--- layout: model title: English asr_wav2vec2_large_xls_r_thai_test TFWav2Vec2ForCTC from juierror author: John Snow Labs name: asr_wav2vec2_large_xls_r_thai_test date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_thai_test` is an English model originally trained by juierror. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_thai_test_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_thai_test_en_4.2.0_3.0_1664024143489.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_thai_test_en_4.2.0_3.0_1664024143489.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use

{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_thai_test", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_thai_test", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
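The snippets above assume an `audioDf` DataFrame whose `audio_content` column holds arrays of floats. As a minimal sketch of how such arrays can be produced (pure Python standard library, not part of the Spark NLP API; the file name is a hypothetical example), a 16-bit PCM mono WAV file can be decoded like this:

```python
import struct
import wave

def wav_to_floats(path):
    """Decode a 16-bit PCM mono WAV file into a list of floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wav:
        raw = wav.readframes(wav.getnframes())
    # "<h" = little-endian signed 16-bit samples; scale to [-1.0, 1.0].
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [s / 32768.0 for s in samples]

# The float list can then back the "audio_content" column, e.g.:
# audioDf = spark.createDataFrame([[wav_to_floats("sample.wav")]], ["audio_content"])
```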
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_thai_test| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Detect Nationality / Company Founding Places in texts author: John Snow Labs name: finner_wiki_nationality date: 2023-01-15 tags: [en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an NER model aimed at detecting nationalities, specifically those mentioned as a company's founding place. It was trained on Wikipedia texts about companies. ## Predicted Entities `NATIONALITY`, `O` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_wiki_nationality_en_1.0.0_3.0_1673797584937.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_wiki_nationality_en_1.0.0_3.0_1673797584937.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from pyspark.sql import functions as F documenter = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencizer = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512) chunks = finance.NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") ner = finance.NerModel.pretrained("finner_wiki_nationality", "en", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") pipe = nlp.Pipeline(stages=[documenter, sentencizer, tokenizer, embeddings, ner, chunks]) model = pipe.fit(df) res = model.transform(df) res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.begin, res.ner_chunk.end, res.ner_chunk.metadata)).alias("cols")) \ .select(F.expr("cols['3']['sentence']").alias("sentence_id"), F.expr("cols['0']").alias("chunk"), F.expr("cols['2']").alias("end"), F.expr("cols['3']['entity']").alias("ner_label"))\ .filter("ner_label!='O'")\ .show(truncate=False) ```
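The `arrays_zip`/`explode` step above lines up the parallel `result`, `begin`, `end`, and `metadata` arrays of the `ner_chunk` column and drops `O` tokens. The same reshaping can be sketched in plain Python on illustrative values (no Spark required; the offsets are made up to match the shape of the output):

```python
# One row of the ner_chunk column, as parallel arrays (illustrative values).
result = ["American"]
begin = [66]
end = [73]
metadata = [{"sentence": "0", "entity": "NATIONALITY"}]

# Equivalent of F.explode(F.arrays_zip(...)) plus the select/filter above:
rows = [
    {"sentence_id": m["sentence"], "chunk": r, "end": e, "ner_label": m["entity"]}
    for r, b, e, m in zip(result, begin, end, metadata)
    if m["entity"] != "O"
]
```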
## Results ```bash +-----------+--------+---+-----------+ |sentence_id|chunk |end|ner_label | +-----------+--------+---+-----------+ |0 |American|73 |NATIONALITY| +-----------+--------+---+-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_wiki_nationality| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|1.2 MB| ## References Wikipedia ## Benchmarking ```bash label tp fp fn prec rec f1 B-NATIONALITY 57 7 1 0.890625 0.98275864 0.93442625 Macro-average 57 7 1 0.890625 0.98275864 0.93442625 Micro-average 57 7 1 0.890625 0.98275864 0.93442625 ``` --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_10_h_512 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-10_H-512` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_10_h_512_zh_4.2.4_3.0_1670325743888.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_10_h_512_zh_4.2.4_3.0_1670325743888.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_10_h_512","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_10_h_512","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_10_h_512| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|161.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-10_H-512 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Translate English to Bantu languages Pipeline author: John Snow Labs name: translate_en_bnt date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, bnt, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `bnt` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_bnt_xx_2.7.0_2.4_1609687439138.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_bnt_xx_2.7.0_2.4_1609687439138.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_bnt", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_bnt", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.bnt').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_bnt| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering Cased model (from Gantenbein) author: John Snow Labs name: roberta_qa_addi_fr_xlm_r date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ADDI-FR-XLM-R` is an English model originally trained by `Gantenbein`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_addi_fr_xlm_r_en_4.3.0_3.0_1674207724209.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_addi_fr_xlm_r_en_4.3.0_3.0_1674207724209.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_addi_fr_xlm_r","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_addi_fr_xlm_r","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_addi_fr_xlm_r| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|422.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Gantenbein/ADDI-FR-XLM-R --- layout: model title: Entity Recognition Pipeline (Large, Spanish) author: John Snow Labs name: entity_recognizer_lg date: 2022-06-25 tags: [es, open_source] task: Named Entity Recognition language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_lg is a pretrained pipeline that performs basic text processing steps and recognizes entities, covering most of the common text processing tasks on your dataframe. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_es_4.0.0_3.0_1656126065101.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_es_4.0.0_3.0_1656126065101.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("entity_recognizer_lg", "es") result = pipeline.annotate("""I love johnsnowlabs! """) ``` {:.nlu-block} ```python import nlu nlu.load("es.ner.lg").predict("""I love johnsnowlabs! """) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_lg| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|es| |Size:|2.5 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - NerDLModel - NerConverter --- layout: model title: Relation Extraction between Posologic entities author: John Snow Labs name: posology_re date: 2020-09-01 task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 2.5.5 spark_version: 2.4 tags: [re, en, clinical, licensed, relation extraction] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts relations between posology-related terminology. ## Predicted Entities `DRUG-DOSAGE`, `DRUG-FREQUENCY`, `DRUG-ADE`, `DRUG-FORM`, `ENDED_BY`, `DRUG-ROUTE`, `DRUG-DURATION`, `DRUG-REASON`, `DRUG-STRENGTH` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_POSOLOGY/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentences") tokenizer = Tokenizer() \ .setInputCols(["sentences"]) \ .setOutputCol("tokens") words_embedder = WordEmbeddingsModel()\ .pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentences", "tokens"])\ .setOutputCol("embeddings") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") ner_tagger = MedicalNerModel()\ .pretrained("ner_posology", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_chunker = NerConverterInternal()\ .setInputCols(["sentences", "tokens", "ner_tags"])\ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel()\ .pretrained("dependency_conllu", "en")\ .setInputCols(["sentences", "pos_tags", "tokens"])\ .setOutputCol("dependencies") reModel = RelationExtractionModel()\ .pretrained("posology_re")\ .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\ .setOutputCol("relations")\ .setMaxSyntacticDistance(4) pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, words_embedder, pos_tagger, ner_tagger, ner_chunker, dependency_parser, reModel]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = pipeline.fit(empty_data) light_pipeline = LightPipeline(model) result = light_pipeline.fullAnnotate("The patient was prescribed 1 unit of Advil for 5 days after meals. The patient was also given 1 unit of Metformin daily. 
He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.") ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val ner_tagger = MedicalNerModel() .pretrained("ner_posology", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_chunker = new NerConverterInternal() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") val re_Model = RelationExtractionModel() .pretrained("posology_re") .setInputCols(Array("embeddings", "pos_tags", "ner_chunks", "dependencies")) .setOutputCol("relations") .setMaxSyntacticDistance(4) val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, words_embedder, pos_tagger, ner_tagger, ner_chunker, dependency_parser, re_Model)) val data = Seq("The patient was prescribed 1 unit of Advil for 5 days after meals. The patient was also given 1 unit of Metformin daily. 
He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
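The `relations` column pairs a drug chunk with one attribute chunk per row. A common follow-up step is grouping those pairs per drug; the sketch below does this in plain Python on rows shaped like the pipeline's output for the example text above (values illustrative, not actual model output):

```python
from collections import defaultdict

# Relation rows shaped like (relation_type, chunk1, chunk2); illustrative values
# drawn from the example sentence "... metformin 1000 mg two times a day."
relations = [
    ("DRUG-STRENGTH", "metformin", "1000 mg"),
    ("DRUG-FREQUENCY", "metformin", "two times a day"),
    ("DOSAGE-DRUG", "12 units", "insulin lispro"),
]

def group_by_drug(rel_rows):
    """Collect every attribute chunk under its drug, keyed by relation type."""
    grouped = defaultdict(dict)
    for rel_type, chunk1, chunk2 in rel_rows:
        left, right = rel_type.split("-")
        # The DRUG side may be either chunk; normalize so the drug is the key.
        drug, attr_label, attr = (
            (chunk1, right, chunk2) if left == "DRUG" else (chunk2, left, chunk1)
        )
        grouped[drug][attr_label] = attr
    return dict(grouped)

print(group_by_drug(relations))
# {'metformin': {'STRENGTH': '1000 mg', 'FREQUENCY': 'two times a day'},
#  'insulin lispro': {'DOSAGE': '12 units'}}
```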
{:.h2_title} ## Results ```bash | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |----------------|----------|---------------|-------------|------------------|-----------|---------------|-------------|------------------|------------| | DURATION-DRUG | DURATION | 493 | 500 | five-day | DRUG | 512 | 522 | amoxicillin | 1.0 | | DRUG-DURATION | DRUG | 681 | 693 | dapagliflozin | DURATION | 695 | 708 | for six months | 1.0 | | DRUG-ROUTE | DRUG | 1940 | 1946 | insulin | ROUTE | 1948 | 1951 | drip | 1.0 | | DOSAGE-DRUG | DOSAGE | 2255 | 2262 | 40 units | DRUG | 2267 | 2282 | insulin glargine | 1.0 | | DRUG-FREQUENCY | DRUG | 2267 | 2282 | insulin glargine | FREQUENCY | 2284 | 2291 | at night | 1.0 | | DOSAGE-DRUG | DOSAGE | 2295 | 2302 | 12 units | DRUG | 2307 | 2320 | insulin lispro | 1.0 | | DRUG-FREQUENCY | DRUG | 2307 | 2320 | insulin lispro | FREQUENCY | 2322 | 2331 | with meals | 1.0 | | DRUG-STRENGTH | DRUG | 2339 | 2347 | metformin | STRENGTH | 2349 | 2355 | 1000 mg | 1.0 | | DRUG-FREQUENCY | DRUG | 2339 | 2347 | metformin | FREQUENCY | 2357 | 2371 | two times a day | 1.0 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|posology_re| |Compatibility:|Healthcare NLP 2.5.5+| |Edition:|Official| |License:|Licensed| |Language:|[en]| --- layout: model title: ELECTRA Embeddings(ELECTRA Base) author: John Snow Labs name: electra_base_uncased date: 2020-08-27 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description ELECTRA is a BERT-like model that is pre-trained as a discriminator in a set-up resembling a generative adversarial network (GAN). It was originally published by: Kevin Clark and Minh-Thang Luong and Quoc V. Le and Christopher D. 
Manning: [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/forum?id=r1xMH1BtvB), ICLR 2020. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_base_uncased_en_2.6.0_2.4_1598485481403.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_base_uncased_en_2.6.0_2.4_1598485481403.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("electra_base_uncased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("electra_base_uncased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.electra.base_uncased').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash token en_embed_electra_base_uncased_embeddings I [-0.5244714021682739, -0.0994749441742897, 0.2... love [-0.14990234375, -0.45483139157295227, 0.28477... NLP [-0.030217083171010017, -0.43060103058815, -0.... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_base_uncased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/google/electra_base/2 --- layout: model title: English asr_20220507_122935 TFWav2Vec2ForCTC from lilitket author: John Snow Labs name: pipeline_asr_20220507_122935 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_20220507_122935` is an English model originally trained by lilitket. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_20220507_122935_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_20220507_122935_en_4.2.0_3.0_1664117535687.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_20220507_122935_en_4.2.0_3.0_1664117535687.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_20220507_122935', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_20220507_122935", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_20220507_122935| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Smaller BERT Sentence Embeddings (L-4_H-256_A-4) author: John Snow Labs name: sent_small_bert_L4_256 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L4_256_en_2.6.0_2.4_1598350389644.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L4_256_en_2.6.0_2.4_1598350389644.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L4_256", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer'], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L4_256", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L4_256').predict(text, output_level='sentence') embeddings_df ```
{:.h2_title} ## Results ```bash sentence en_embed_sentence_small_bert_L4_256_embeddings I hate cancer [-0.13163965940475464, 0.5425440073013306, 0.6... Antibiotics aren't painkiller [-0.4377692639827728, 0.5017094016075134, 0.42... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L4_256| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|256| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/1 --- layout: model title: Fast Neural Machine Translation Model from Congo Swahili to English author: John Snow Labs name: opus_mt_swc_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, swc, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `swc` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_swc_en_xx_2.7.0_2.4_1609162361398.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_swc_en_xx_2.7.0_2.4_1609162361398.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_swc_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_swc_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.swc.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_swc_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate Japanese to English Pipeline author: John Snow Labs name: translate_jap_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, jap, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `jap` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_jap_en_xx_2.7.0_2.4_1609688530036.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_jap_en_xx_2.7.0_2.4_1609688530036.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_jap_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_jap_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.jap.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_jap_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English DistilBertForQuestionAnswering model (from tabo)
author: John Snow Labs
name: distilbert_qa_tabo_base_uncased_finetuned_squad2
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad2` is an English model originally trained by `tabo`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_tabo_base_uncased_finetuned_squad2_en_4.0.0_3.0_1654726880025.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_tabo_base_uncased_finetuned_squad2_en_4.0.0_3.0_1654726880025.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_tabo_base_uncased_finetuned_squad2","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_tabo_base_uncased_finetuned_squad2","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.distil_bert.base_uncased.by_tabo").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
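The nlu one-liner packs question and context into a single string separated by `|||`. Assuming that convention (inferred from the snippet above, not from nlu's documentation), splitting such a string back into its parts is a one-liner:

```python
# Sketch: splitting the assumed "question|||context" convention used by the
# nlu QA loaders into its two parts.

def split_qa(packed: str):
    """Split a 'question|||context' string into (question, context)."""
    question, _, context = packed.partition("|||")
    return question.strip(), context.strip()

q, c = split_qa("What is my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What is my name?
print(c)  # My name is Clara and I live in Berkeley.
```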
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_tabo_base_uncased_finetuned_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/tabo/distilbert-base-uncased-finetuned-squad2 --- layout: model title: Typo Detector author: John Snow Labs name: distilbert_token_classifier_typo_detector date: 2022-01-19 tags: [typo, distilbert, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.3.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model was imported from `Hugging Face` ([link](https://huggingface.co/m3hrdadfi/typo-detector-distilbert-en)) and it's been trained on NeuSpell corpus to detect typos, leveraging `DistilBERT` embeddings and `DistilBertForTokenClassification` for NER purposes. It classifies typo tokens as `PO`. ## Predicted Entities `PO` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_typo_detector_en_3.3.4_3.0_1642581005021.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_typo_detector_en_3.3.4_3.0_1642581005021.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_typo_detector", "en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])

text = """He had also stgruggled with addiction during his tine in Congress."""
data = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_typo_detector", "en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))

val example = Seq("He had also stgruggled with addiction during his tine in Congress.").toDS.toDF("text")
val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.classify.typos.distilbert").predict("""He had also stgruggled with addiction during his tine in Congress.""")
```
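The `NerConverter` stage turns per-token BIO labels (`B-PO`, `I-PO`, `O`) into chunks. A plain-Python sketch of that grouping logic, using illustrative labels for the example sentence above (inside Spark NLP this is done by the annotator itself):

```python
# Sketch of BIO -> chunk grouping, mirroring what NerConverter does.

def bio_to_chunks(tokens, labels):
    """Group B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, current_label = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                chunks.append((" ".join(current), current_label))
            current, current_label = [tok], lab[2:]
        elif lab.startswith("I-") and current:
            current.append(tok)
        else:  # "O" or a dangling I- tag closes the open chunk
            if current:
                chunks.append((" ".join(current), current_label))
            current, current_label = [], None
    if current:
        chunks.append((" ".join(current), current_label))
    return chunks

tokens = ["He", "had", "also", "stgruggled", "with", "addiction",
          "during", "his", "tine", "in", "Congress", "."]
labels = ["O", "O", "O", "B-PO", "O", "O",
          "O", "O", "B-PO", "O", "O", "O"]
print(bio_to_chunks(tokens, labels))  # [('stgruggled', 'PO'), ('tine', 'PO')]
```

The output matches the chunks shown in the Results section.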
## Results ```bash +------------+---------+ |chunk |ner_label| +------------+---------+ |stgruggled |PO | |tine |PO | +------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_typo_detector| |Compatibility:|Spark NLP 3.3.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## Data Source [https://github.com/neuspell/neuspell](https://github.com/neuspell/neuspell) ## Benchmarking ```bash label precision recall f1-score support micro-avg 0.992332 0.985997 0.989154 416054.0 macro-avg 0.992332 0.985997 0.989154 416054.0 weighted-avg 0.992332 0.985997 0.989154 416054.0 ``` --- layout: model title: Bulgarian BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_bg_cased date: 2022-12-02 tags: [bg, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: bg edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-bg-cased` is a Bulgarian model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_bg_cased_bg_4.2.4_3.0_1670016244564.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_bg_cased_bg_4.2.4_3.0_1670016244564.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_bg_cased","bg") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_bg_cased","bg")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
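`BertEmbeddings` emits one vector per token. A common downstream step (not part of the snippet above) is mean pooling the token vectors into a single sentence vector. A minimal sketch with toy 3-dimensional vectors (the real model produces 768-dimensional ones):

```python
# Sketch: mean pooling token embeddings into a sentence embedding.

def mean_pool(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

token_embeddings = [
    [1.0, 2.0, 3.0],   # toy embedding for token 1
    [3.0, 2.0, 1.0],   # toy embedding for token 2
]
print(mean_pool(token_embeddings))  # [2.0, 2.0, 2.0]
```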
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_bg_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|bg| |Size:|358.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-bg-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Clinical QA BioGPT (JSL - conditions) author: John Snow Labs name: biogpt_chat_jsl_conditions date: 2023-05-11 tags: [en, licensed, clinical, tensorflow] task: Text Generation language: en edition: Healthcare NLP 4.4.0 spark_version: 3.0 supported: true engine: tensorflow annotator: MedicalTextGenerator article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is based on BioGPT finetuned with questions related to various medical conditions. It's less conversational, and more Q&A focused. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/biogpt_chat_jsl_conditions_en_4.4.0_3.0_1683778577103.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/biogpt_chat_jsl_conditions_en_4.4.0_3.0_1683778577103.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

gpt_qa = MedicalTextGenerator.pretrained("biogpt_chat_jsl_conditions", "en", "clinical/models")\
    .setInputCols("documents")\
    .setOutputCol("answer")\
    .setMaxNewTokens(100)

pipeline = Pipeline().setStages([document_assembler, gpt_qa])

data = spark.createDataFrame([["How to treat asthma ?"]]).toDF("text")

pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("documents")

val gpt_qa = MedicalTextGenerator.pretrained("biogpt_chat_jsl_conditions", "en", "clinical/models")
    .setInputCols("documents")
    .setOutputCol("answer")
    .setMaxNewTokens(100)

val pipeline = new Pipeline().setStages(Array(document_assembler, gpt_qa))

val text = "How to treat asthma ?"
val data = Seq(text).toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
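`setMaxNewTokens(100)` caps how many tokens the generator appends to the prompt. A toy sketch of the decoding loop behind such a cap (this is an illustration of the general autoregressive pattern, not Spark NLP's actual implementation; `toy_next` is a stand-in for a real next-token model):

```python
# Sketch: autoregressive generation bounded by max_new_tokens.

def generate(prompt_tokens, next_token_fn, max_new_tokens, eos=None):
    """Append tokens from next_token_fn until eos or the cap is hit."""
    out = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tok = next_token_fn(out)
        if tok == eos:
            break
        out.append(tok)
    return out

def toy_next(tokens):
    """Stand-in 'model' that always emits the next integer."""
    return tokens[-1] + 1

print(generate([0], toy_next, max_new_tokens=5))  # [0, 1, 2, 3, 4, 5]
```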
## Results

```bash
question: How to treat asthma ?. answer: The main treatments for asthma are reliever inhalers, which are small handheld devices that you put into your mouth or nose to help you breathe quickly, and preventer inhaler, a soft mist inhaler that lets you use your inhaler as often as you like. If you have severe asthma, your doctor may prescribe a long-acting bronchodilator, such as salmeterol or vilanterol, or a steroid inhaler. You'll usually need to take both types of inhaler at the same time.
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|biogpt_chat_jsl_conditions|
|Compatibility:|Healthcare NLP 4.4.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.4 GB|
|Case sensitive:|true|

---
layout: model
title: Translate English to Setswana Pipeline
author: John Snow Labs
name: translate_en_tn
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, tn, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `tn` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_tn_xx_2.7.0_2.4_1609690891054.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_tn_xx_2.7.0_2.4_1609690891054.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_tn", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_tn", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.tn').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_en_tn|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Explain Document pipeline for Spanish (explain_document_lg)
author: John Snow Labs
name: explain_document_lg
date: 2021-03-23
tags: [open_source, spanish, explain_document_lg, pipeline, es]
supported: true
task: [Named Entity Recognition, Lemmatization]
language: es
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The explain_document_lg pipeline is a pretrained pipeline that performs the basic text processing steps and recognizes entities, covering most of the common text processing tasks on your dataframe.

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_es_3.0.0_3.0_1616497458202.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_lg_es_3.0.0_3.0_1616497458202.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('explain_document_lg', lang = 'es')
annotations = pipeline.fullAnnotate("Hola de John Snow Labs!")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("explain_document_lg", lang = "es")
val result = pipeline.fullAnnotate("Hola de John Snow Labs!")(0)
```

{:.nlu-block}
```python
import nlu
text = ["Hola de John Snow Labs!"]
result_df = nlu.load('es.explain.lg').predict(text)
result_df
```
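The pipeline's output columns (token, lemma, pos, ner, ...) are token-aligned, parallel lists, so they can simply be zipped together. A minimal sketch using the values from the Results section (the lists are copied from that table, not produced by running the pipeline here):

```python
# Sketch: combining the pipeline's parallel, token-aligned output columns.

tokens = ['Hola', 'de', 'John', 'Snow', 'Labs!']
pos    = ['PART', 'ADP', 'PROPN', 'PROPN', 'PROPN']
ner    = ['O', 'O', 'B-PER', 'I-PER', 'I-PER']

tagged = list(zip(tokens, pos, ner))
entity_tokens = [tok for tok, _, tag in tagged if tag != 'O']
print(entity_tokens)  # ['John', 'Snow', 'Labs!']
```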
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:-----------------------------|:----------------------------|:----------------------------------------|:----------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Hola de John Snow Labs! '] | ['Hola de John Snow Labs!'] | ['Hola', 'de', 'John', 'Snow', 'Labs!'] | ['Hola', 'de', 'John', 'Snow', 'Labs!'] | ['PART', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[-0.016199000179767,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_lg| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|es| --- layout: model title: Mapping Entities with Corresponding RxNorm Codes author: John Snow Labs name: rxnorm_mapper date: 2022-06-07 tags: [en, rxnorm, licensed, chunk_mapper] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.0 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps entities with their corresponding RxNorm codes ## Predicted Entities `rxnorm_code` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_mapper_en_3.5.0_3.0_1654614618628.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_mapper_en_3.5.0_3.0_1654614618628.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") posology_ner_model = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("posology_ner") posology_ner_converter = NerConverterInternal()\ .setInputCols("sentence", "token", "posology_ner")\ .setOutputCol("ner_chunk") chunkerMapper = ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings")\ .setRel("rxnorm_code") mapper_pipeline = Pipeline().setStages([ document_assembler, sentence_detector, tokenizer, word_embeddings, posology_ner_model, posology_ner_converter, chunkerMapper]) data = spark.createDataFrame([["The patient was given Zyrtec 10 MG, Adapin 10 MG Oral Capsule, Septi-Soothe 0.5 Topical Spray"]]).toDF("text") result = mapper_pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val posology_ner_model = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("posology_ner") val 
posology_ner_converter = new NerConverterInternal()
    .setInputCols("sentence", "token", "posology_ner")
    .setOutputCol("ner_chunk")

val chunkerMapper = ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")
    .setInputCols(Array("ner_chunk"))
    .setOutputCol("mappings")
    .setRel("rxnorm_code")

val mapper_pipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    posology_ner_model,
    posology_ner_converter,
    chunkerMapper))

val sentence = "The patient was given Zyrtec 10 MG, Adapin 10 MG Oral Capsule, Septi-Soothe 0.5 Topical Spray"
val data = Seq(sentence).toDF("text")
val result = mapper_pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.rxnorm_resolver").predict("""The patient was given Zyrtec 10 MG, Adapin 10 MG Oral Capsule, Septi-Soothe 0.5 Topical Spray""")
```
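Conceptually, `ChunkMapperModel` is a large curated lookup from chunk text to codes for the chosen relation (`rxnorm_code` here). A tiny dictionary-based sketch of that idea, seeded with the three mappings shown in the Results section (the real model handles normalization and many more entries):

```python
# Sketch: chunk -> RxNorm code lookup, mimicking what the chunk mapper does.

rxnorm = {
    "Zyrtec 10 MG": "1011483",
    "Adapin 10 MG Oral Capsule": "1000050",
    "Septi-Soothe 0.5 Topical Spray": "1000046",
}

def map_chunks(chunks, table, default="NONE"):
    """Look each chunk up in the mapping table."""
    return [(chunk, table.get(chunk, default)) for chunk in chunks]

print(map_chunks(["Zyrtec 10 MG", "aspirin"], rxnorm))
```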
## Results

```bash
+------------------------------+---------------+
|chunk                         |rxnorm_mappings|
+------------------------------+---------------+
|Zyrtec 10 MG                  |1011483        |
|Adapin 10 MG Oral Capsule     |1000050        |
|Septi-Soothe 0.5 Topical Spray|1000046        |
+------------------------------+---------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|rxnorm_mapper|
|Compatibility:|Healthcare NLP 3.5.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[posology_ner_chunk]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|2.3 MB|

---
layout: model
title: English BertForQuestionAnswering model (from Ghost1)
author: John Snow Labs
name: bert_qa_bert_finetuned_squad1
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad1` is an English model originally trained by `Ghost1`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_squad1_en_4.0.0_3.0_1654535985704.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_squad1_en_4.0.0_3.0_1654535985704.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_finetuned_squad1","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_bert_finetuned_squad1","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.by_Ghost1").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
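Extractive QA models like this one score every token as a potential answer start and answer end, then pick the best-scoring span within the context. A toy sketch of that span selection (the scores below are made up for illustration; the real model produces logits over subword tokens):

```python
# Sketch: extractive QA span selection from per-token start/end scores.

def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) indices maximizing start + end score, end >= start."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start  = [0.1, 0.0, 0.2, 5.0, 0.0, 0.0, 0.0, 0.0, 0.3, 0.0]  # toy logits
end    = [0.0, 0.0, 0.1, 4.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0]  # toy logits

i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```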
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_finetuned_squad1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Ghost1/bert-finetuned-squad1 --- layout: model title: Swedish asr_wav2vec2_large_voxrex_swedish_4gram TFWav2Vec2ForCTC from viktor-enzell author: John Snow Labs name: pipeline_asr_wav2vec2_large_voxrex_swedish_4gram date: 2022-09-25 tags: [wav2vec2, sv, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: sv edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_large_voxrex_swedish_4gram` is a Swedish model originally trained by viktor-enzell. NOTE: This pipeline only works on a CPU, if you need to use this pipeline on a GPU device please use pipeline_asr_wav2vec2_large_voxrex_swedish_4gram_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_voxrex_swedish_4gram_sv_4.2.0_3.0_1664113986284.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_voxrex_swedish_4gram_sv_4.2.0_3.0_1664113986284.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_voxrex_swedish_4gram', lang = 'sv') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_voxrex_swedish_4gram", lang = "sv") val annotations = pipeline.transform(audioDF) ```
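The `Wav2Vec2ForCTC` stage of this pipeline decodes frame-level character predictions with CTC (this particular model additionally rescores with a 4-gram language model). The core greedy-CTC idea — collapse repeated predictions, then drop blanks — can be sketched in a few lines; the vocabulary and frame ids below are toy values:

```python
# Sketch: greedy CTC decoding (collapse repeats, remove blanks).

def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Decode a sequence of per-frame token ids into a string."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(id_to_char[i])
        prev = i
    return "".join(out)

vocab = {1: "h", 2: "e", 3: "l", 4: "o"}           # toy vocabulary
frames = [1, 1, 0, 2, 2, 3, 0, 3, 4, 4]            # toy per-frame argmax ids
print(ctc_greedy_decode(frames, vocab))            # hello
```

Note how the blank (id 0) between the two `3`s is what allows the doubled "l" to survive the repeat-collapsing step.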
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_voxrex_swedish_4gram| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|sv| |Size:|757.4 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Greek BertForMaskedLM Base Uncased model (from gealexandri) author: John Snow Labs name: bert_embeddings_greeksocial_base_greek_uncased_v1 date: 2022-12-02 tags: [el, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: el edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `greeksocialbert-base-greek-uncased-v1` is a Greek model originally trained by `gealexandri`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_greeksocial_base_greek_uncased_v1_el_4.2.4_3.0_1670022274783.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_greeksocial_base_greek_uncased_v1_el_4.2.4_3.0_1670022274783.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_greeksocial_base_greek_uncased_v1","el") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_greeksocial_base_greek_uncased_v1","el")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_greeksocial_base_greek_uncased_v1|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|el|
|Size:|424.0 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/gealexandri/greeksocialbert-base-greek-uncased-v1
- http://www.paloservices.com/

---
layout: model
title: Extract Effective, Renewal, Termination dates (Small)
author: John Snow Labs
name: legner_dates_sm
date: 2022-11-21
tags: [renewal, effective, termination, date, en, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model extracts whether a date is an Effective Date, a Renewal Date, or a Termination Date, and also extracts the surrounding keywords that may indicate what kind of date it is. Please take into account that the keywords were not used to learn the dates; all entities were trained separately. You can, however, use the keywords to double-check that the date type is correct.

## Predicted Entities

`EFFDATE`, `EFFDATE_KEYWORD`, `RENDATE`, `RENDATE_KEYWORD`, `TERMINDATE`, `TERMINDATE_KEYWORD`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_dates_sm_en_1.0.0_3.0_1669028480461.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_dates_sm_en_1.0.0_3.0_1669028480461.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained('legner_dates_sm', 'en', 'legal/models')\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""RENEWAL DATE. The date on which this Agreement shall renew, July 1st, pursuant to the terms and conditions contained herein."""] res = model.transform(spark.createDataFrame([text]).toDF("text")) ```
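Downstream, the `NerConverter` stage merges the token-level B-/I- tags produced by the NER model into entity chunks. A minimal pure-Python sketch of that merge logic (a hypothetical helper for illustration, not the Spark NLP implementation), using tokens from the example sentence:

```python
# Merge BIO-tagged tokens into (chunk_text, entity_label) pairs.
def merge_bio(tokens, labels):
    chunks, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                chunks.append(tuple(current))
            current = [tok, lab[2:]]
        elif lab.startswith("I-") and current and lab[2:] == current[1]:
            current[0] += " " + tok
        else:
            if current:
                chunks.append(tuple(current))
            current = None
    if current:
        chunks.append(tuple(current))
    return chunks

tokens = ["RENEWAL", "DATE", ".", "The", "date", "July", "1st", ","]
labels = ["B-RENDATE_KEYWORD", "I-RENDATE_KEYWORD", "O", "O", "O",
          "B-RENDATE", "I-RENDATE", "O"]
print(merge_bio(tokens, labels))
# [('RENEWAL DATE', 'RENDATE_KEYWORD'), ('July 1st', 'RENDATE')]
```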
## Results ```bash +----------+-----------------+ | token| ner_label| +----------+-----------------+ | RENEWAL|B-RENDATE_KEYWORD| | DATE|I-RENDATE_KEYWORD| | .| O| | The| O| | date| O| | on| O| | which| O| | this| O| | Agreement| O| | shall| O| | renew| O| | ,| O| | July| B-RENDATE| | 1st| I-RENDATE| | ,| O| | pursuant| O| | to| O| | the| O| | terms| O| | and| O| |conditions| O| | contained| O| | herein| O| | .| O| +----------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_dates_sm| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.1 MB| ## References In-house dataset. ## Benchmarking ```bash label tp fp fn prec rec f1 I-RENDATE 6 2 5 0.75 0.54545456 0.631579 B-EFFDATE_KEYWORD 5 0 0 1.0 1.0 1.0 B-EFFDATE 5 0 0 1.0 1.0 1.0 B-TERMINDATE 3 1 0 0.75 1.0 0.85714287 I-TERMINDATE 9 4 0 0.6923077 1.0 0.8181818 I-EFFDATE 15 0 0 1.0 1.0 1.0 I-RENDATE_KEYWORD 4 0 0 1.0 1.0 1.0 I-EFFDATE_KEYWORD 5 0 0 1.0 1.0 1.0 I-TERMINDATE_KEYWORD 5 0 0 1.0 1.0 1.0 B-RENDATE 2 1 2 0.6666667 0.5 0.57142854 B-TERMINDATE_KEYWORD 5 0 0 1.0 1.0 1.0 B-RENDATE_KEYWORD 3 0 1 1.0 0.75 0.85714287 Macro-average 67 8 8 0.90491456 0.8996212 0.9022601 Micro-average 67 8 8 0.8933333 0.8933333 0.89333326 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from SupriyaArun) author: John Snow Labs name: distilbert_qa_supriyaarun_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using 
Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `SupriyaArun`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_supriyaarun_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769358282.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_supriyaarun_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769358282.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_supriyaarun_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_supriyaarun_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_supriyaarun_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/SupriyaArun/distilbert-base-uncased-finetuned-squad

---
layout: model
title: Legal Public Finance And Budget Policy Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_public_finance_and_budget_policy_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, public_finance_and_budget_policy, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.

Given a document, the legclf_public_finance_and_budget_policy_bert model, a Bert Sentence Embeddings Document Classifier, classifies whether or not the document belongs to the class Public_Finance_and_Budget_Policy (Binary Classification) according to EuroVoc labels.
## Predicted Entities `Public_Finance_and_Budget_Policy`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_public_finance_and_budget_policy_bert_en_1.0.0_3.0_1678111827097.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_public_finance_and_budget_policy_bert_en_1.0.0_3.0_1678111827097.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_public_finance_and_budget_policy_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+----------------------------------+
|result                            |
+----------------------------------+
|[Public_Finance_and_Budget_Policy]|
|[Other]                           |
|[Other]                           |
|[Public_Finance_and_Budget_Policy]|
+----------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_public_finance_and_budget_policy_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.4 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
                           label  precision  recall  f1-score  support
                           Other       0.91    0.86      0.89       37
Public_Finance_and_Budget_Policy       0.86    0.91      0.88       33
                        accuracy          -       -      0.89       70
                       macro-avg       0.89    0.89      0.89       70
                    weighted-avg       0.89    0.89      0.89       70
```

---
layout: model
title: Clinical Deidentification
author: John Snow Labs
name: clinical_deidentification
date: 2021-05-27
tags: [deidentification, en, licensed, pipeline]
task: De-identification
language: en
nav_key: models
edition: Healthcare NLP 3.0.3
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pipeline can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`, `CITY`, `COUNTRY`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `STREET`, `USERNAME`, `ZIP`, `ACCOUNT`, `LICENSE`, `VIN`, `SSN`, `DLN`, `PLATE`, `IPADDR` entities.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_en_3.0.3_3.0_1622141991699.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_en_3.0.3_3.0_1622141991699.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from sparknlp.pretrained import PretrainedPipeline

deid_pipeline = PretrainedPipeline("clinical_deidentification", "en", "clinical/models")

deid_pipeline.annotate("Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN:324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years-old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val deid_pipeline = PretrainedPipeline("clinical_deidentification","en","clinical/models")

val result = deid_pipeline.annotate("Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN:324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years-old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.")
```

{:.nlu-block}
```python
import nlu
nlu.load("en.de_identify.clinical_pipeline").predict("""Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN:324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years-old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.""")
```
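`annotate()` returns a plain Python dict mapping each output column name to a list of strings, one entry per sentence. A small post-processing sketch (plain Python; the dict below is abbreviated from the example output, not produced by actually running the pipeline) pairing each original sentence with its masked and obfuscated versions:

```python
# Abbreviated annotate() output: parallel lists keyed by output column.
result = {
    "sentence": ["Record date : 2093-01-13, David Hale, M.D.",
                 "IP: 203.120.223.13."],
    "masked": ["Record date : , , M.D.",
               "IP: ."],
    "obfuscated": ["Record date : 2093-02-13, Shella Solan, M.D.",
                   "IP: 333.333.333.333."],
}

# Zip the parallel lists into per-sentence (original, masked, obfuscated) triples.
rows = list(zip(result["sentence"], result["masked"], result["obfuscated"]))
for original, masked, obfuscated in rows:
    print(f"{original!r} -> {obfuscated!r}")
```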
## Results ```bash {'sentence': ['Record date : 2093-01-13, David Hale, M.D.', 'IP: 203.120.223.13.', 'The driver's license no:A334455B.', 'the SSN:324598674 and e-mail: hale@gmail.com.', 'Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93.', 'PCP : Oliveira, 25 years-old.', 'Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.'], 'masked': ['Record date : , , M.D.', 'IP: .', 'The driver's license .', 'the and e-mail: .', 'Name : MR. # Date : .', 'PCP : , years-old.', 'Record date : , Patient's VIN : .'], 'obfuscated': ['Record date : 2093-02-13, Shella Solan, M.D.', 'IP: 333.333.333.333.', 'The driver's license O497302436569.', 'the SSN-539-29-1060 and e-mail: Keith@google.com.', 'Name : Cindy Nakai MR. # I7396944 Date : 06-11-1985.', 'PCP : Benigno Paganini, 3 years-old.', 'Record date : 2079-12-30, Patient's VIN : 5eeee44ffff555666.'], 'ner_chunk': ['2093-01-13', 'David Hale', 'no:A334455B', 'SSN:324598674', 'Hendrickson, Ora', '719435', '01/13/93', 'Oliveira', '25', '2079-11-09', '1HGBH41JXMN109286']} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.0.3+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - TokenizerModel - LemmatizerModel - Finisher --- layout: model title: Legal Dividends Clause Binary Classifier author: John Snow Labs name: legclf_dividends_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `dividends` clause type. To use this model, make sure you provide enough context as an input. 
Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial linked above).

This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `dividends`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_dividends_clause_en_1.0.0_3.2_1660123433871.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_dividends_clause_en_1.0.0_3.2_1660123433871.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_dividends_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
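As the description notes, several binary clause classifiers can be chained over the same `clause_text`, each adding its own category column. A plain-Python sketch (the model names and predicted labels below are illustrative, not real pipeline output) of collapsing those per-model predictions into clause → True/False flags:

```python
# Hypothetical per-model predictions: each binary classifier emits either its
# clause label (e.g. "dividends") or "other".
predictions = {
    "legclf_dividends_clause": "dividends",
    "legclf_liens_clause": "other",
}

# A clause is flagged True whenever its classifier predicted anything but "other".
flags = {model: label != "other" for model, label in predictions.items()}
print(flags)
# {'legclf_dividends_clause': True, 'legclf_liens_clause': False}
```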
## Results

```bash
+-----------+
|result     |
+-----------+
|[dividends]|
|[other]    |
|[other]    |
|[dividends]|
+-----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_dividends_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.7 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
   dividends       0.93    0.96      0.95       28
       other       0.98    0.97      0.98       64
    accuracy          -       -      0.97       92
   macro-avg       0.96    0.97      0.96       92
weighted-avg       0.97    0.97      0.97       92
```

---
layout: model
title: English RobertaForQuestionAnswering (from tli8hf)
author: John Snow Labs
name: roberta_qa_unqover_roberta_large_newsqa
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `unqover-roberta-large-newsqa` is an English model originally trained by `tli8hf`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_unqover_roberta_large_newsqa_en_4.0.0_3.0_1655740252612.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_unqover_roberta_large_newsqa_en_4.0.0_3.0_1655740252612.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_unqover_roberta_large_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_unqover_roberta_large_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.roberta.large").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_unqover_roberta_large_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/tli8hf/unqover-roberta-large-newsqa --- layout: model title: Sentence Entity Resolver for Clinical Abbreviations and Acronyms (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_clinical_abbreviation_acronym date: 2021-12-11 tags: [abbreviation, entity_resolver, licensed, en, clinical, acronym] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical abbreviations and acronyms to their meanings using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It is the first primitive version of abbreviation resolution and will be improved further in the following releases. ## Predicted Entities `Abbreviation Meanings` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_clinical_abbreviation_acronym_en_3.3.4_2.4_1639224244652.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_clinical_abbreviation_acronym_en_3.3.4_2.4_1639224244652.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
...
c2doc = Chunk2Doc()\
    .setInputCols("merged_chunk")\
    .setOutputCol("ner_chunk_doc")

sentence_chunk_embeddings = BertSentenceChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["document", "merged_chunk"])\
    .setOutputCol("sentence_embeddings")\
    .setChunkWeight(0.5)

abbr_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_clinical_abbreviation_acronym", "en", "clinical/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("abbr_meaning")\
    .setDistanceFunction("EUCLIDEAN")\
    .setCaseSensitive(False)

resolver_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter_icd,
        entity_extractor,
        chunk_merge,
        c2doc,
        sentence_chunk_embeddings,
        abbr_resolver])

model = resolver_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))

sample_text = "HISTORY OF PRESENT ILLNESS: The patient three weeks ago was seen at another clinic for upper respiratory infection-type symptoms. She was diagnosed with a viral infection and had used OTC medications including Tylenol, Sudafed, and Nyquil."

abbr_result = model.transform(spark.createDataFrame([[sample_text]]).toDF('text'))
```
```scala
...
val c2doc = Chunk2Doc()
    .setInputCols("merged_chunk")
    .setOutputCol("ner_chunk_doc")

val sentence_chunk_embeddings = BertSentenceChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
    .setInputCols(Array("document", "merged_chunk"))
    .setOutputCol("sentence_embeddings")
    .setChunkWeight(0.5)

val abbr_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_clinical_abbreviation_acronym", "en", "clinical/models")
    .setInputCols(Array("sentence_embeddings"))
    .setOutputCol("abbr_meaning")
    .setDistanceFunction("EUCLIDEAN")
    .setCaseSensitive(false)

val resolver_pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, word_embeddings, clinical_ner, ner_converter_icd, entity_extractor, chunk_merge, c2doc, sentence_chunk_embeddings, abbr_resolver))

val sample_text = Seq("HISTORY OF PRESENT ILLNESS: The patient three weeks ago was seen at another clinic for upper respiratory infection-type symptoms. She was diagnosed with a viral infection and had used OTC medications including Tylenol, Sudafed, and Nyquil.").toDF("text")

val abbr_result = resolver_pipeline.fit(sample_text).transform(sample_text)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.clinical_abbreviation_acronym").predict("""HISTORY OF PRESENT ILLNESS: The patient three weeks ago was seen at another clinic for upper respiratory infection-type symptoms. She was diagnosed with a viral infection and had used OTC medications including Tylenol, Sudafed, and Nyquil.""")
```
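The resolver emits a ranked candidate list per chunk, and the top candidate becomes the `abbr_meaning` output. A plain-Python sketch pairing the parallel candidate columns (values copied from the example output for the chunk `OTC`):

```python
# Ranked candidates from the resolver, best first (taken from the OTC example).
all_k_results = ["over the counter", "ornithine transcarbamoylase",
                 "enteric-coated", "thyroxine"]
all_k_resolutions = ["OTC", "OTC", "EC", "T4"]

# Pair each candidate meaning with its abbreviation; the first pair is the
# resolver's chosen answer (the abbr_meaning output).
candidates = list(zip(all_k_results, all_k_resolutions))
top_meaning, top_abbr = candidates[0]
print(top_meaning)  # over the counter
```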
## Results ```bash | sent_id | ner_chunk | entity | abbr_meaning | all_k_results | all_k_resolutions | |----------:|:------------|:---------|:-----------------|:-----------------------------------------------------------------------------------|:---------------------------| | 0 | OTC | ABBR | over the counter | ['over the counter', 'ornithine transcarbamoylase', 'enteric-coated', 'thyroxine'] | ['OTC', 'OTC', 'EC', 'T4'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_clinical_abbreviation_acronym| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[abbr_meaning]| |Language:|en| |Size:|104.9 MB| |Case sensitive:|false| --- layout: model title: DistilBERT base model (uncased) author: John Snow Labs name: distilbert_base_uncased date: 2021-05-20 tags: [distilbert, en, english, embeddings, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a distilled version of the [BERT base model](https://huggingface.co/bert-base-cased). It was introduced in [this paper](https://arxiv.org/abs/1910.01108). The code for the distillation process can be found [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation). This model is uncased: it does not make a difference between english and English. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_base_uncased_en_3.1.0_2.4_1621522159616.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_base_uncased_en_3.1.0_2.4_1621522159616.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = DistilBertEmbeddings.pretrained("distilbert_base_uncased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = DistilBertEmbeddings.pretrained("distilbert_base_uncased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.distilbert.base.uncased").predict("""Put your text here.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_base_uncased|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token, sentence]|
|Output Labels:|[embeddings]|
|Language:|en|
|Case sensitive:|true|

## Data Source

[https://huggingface.co/distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased)

## Benchmarking

```bash
When fine-tuned on downstream tasks, this model achieves the following results:

Glue test results:

| Task | MNLI | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  |
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
|      | 82.2 | 88.5 | 89.2 | 91.3  | 51.3 | 85.8  | 87.5 | 59.9 |
```

---
layout: model
title: Legal Liens Clause Binary Classifier
author: John Snow Labs
name: legclf_liens_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `liens` clause type. To use this model, make sure you provide enough context as an input.

Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial linked above).

This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `liens`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_liens_clause_en_1.0.0_3.2_1660122624426.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_liens_clause_en_1.0.0_3.2_1660122624426.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_liens_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-------+
|result |
+-------+
|[liens]|
|[other]|
|[other]|
|[liens]|
+-------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_liens_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.8 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
       liens       0.95    0.84      0.89       44
       other       0.93    0.98      0.96       99
    accuracy          -       -      0.94      143
   macro-avg       0.94    0.91      0.92      143
weighted-avg       0.94    0.94      0.94      143
```

---
layout: model
title: English asr_wav2vec2_large_960h_lv60 TFWav2Vec2ForCTC from facebook
author: John Snow Labs
name: asr_wav2vec2_large_960h_lv60
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_960h_lv60` is an English model originally trained by facebook.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_960h_lv60_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_960h_lv60_en_4.2.0_3.0_1664017360276.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_960h_lv60_en_4.2.0_3.0_1664017360276.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_960h_lv60", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_960h_lv60", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
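`AudioAssembler` expects the `audio_content` column to hold the raw audio as an array of floats (typically 16 kHz mono). A stdlib-only sketch (a hypothetical helper for illustration, not part of Spark NLP) of converting signed 16-bit PCM samples into the normalized floats such a column would contain:

```python
import struct

# Convert little-endian signed 16-bit PCM bytes to floats in [-1.0, 1.0).
def pcm16_to_floats(pcm_bytes):
    n = len(pcm_bytes) // 2
    samples = struct.unpack("<%dh" % n, pcm_bytes)
    return [s / 32768.0 for s in samples]

# Three toy samples: silence, half amplitude, full negative amplitude.
pcm = struct.pack("<3h", 0, 16384, -32768)
print(pcm16_to_floats(pcm))  # [0.0, 0.5, -1.0]
```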
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_960h_lv60|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|757.4 MB|

---
layout: model
title: Legal Energy Policy Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_energy_policy_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, energy_policy, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_energy_policy_bert model, a Bert Sentence Embeddings Document Classifier, determines whether the document belongs to the Energy_Policy class or not (Binary Classification) according to EuroVoc labels.

## Predicted Entities

`Energy_Policy`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_energy_policy_bert_en_1.0.0_3.0_1678111634600.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_energy_policy_bert_en_1.0.0_3.0_1678111634600.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_energy_policy_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    document_assembler,
    embeddings,
    doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")

model = nlpPipeline.fit(df)

result = model.transform(df)
```
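The Benchmarking section below reports per-class precision, recall, and F1. As a reminder of how those derive from raw confusion counts, a minimal sketch (the counts here are illustrative, not the model's):

```python
def precision_recall_f1(tp, fp, fn):
    """Derive per-class precision, recall and F1 from confusion counts:
    tp = true positives, fp = false positives, fn = false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example with made-up counts: 8 correct hits, 2 false alarms, 2 misses.
p, r, f = precision_recall_f1(8, 2, 2)
```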
## Results

```bash
+---------------+
|         result|
+---------------+
|[Energy_Policy]|
|        [Other]|
|        [Other]|
|[Energy_Policy]|
+---------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_energy_policy_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.4 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
        label  precision  recall  f1-score  support
Energy_Policy       0.85    0.91      0.88       57
        Other       0.88    0.80      0.84       46
     accuracy          -       -      0.86      103
    macro-avg       0.87    0.86      0.86      103
 weighted-avg       0.87    0.86      0.86      103
```

---
layout: model
title: Lemmatizer (Catalan, SpacyLookup)
author: John Snow Labs
name: lemma_spacylookup
date: 2022-03-03
tags: [open_source, lemmatizer, ca]
task: Lemmatization
language: ca
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This Catalan Lemmatizer is a scalable, production-ready version of the Rule-based Lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/).

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_ca_3.4.1_3.0_1646316619311.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_ca_3.4.1_3.0_1646316619311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup", "ca") \
    .setInputCols(["token"]) \
    .setOutputCol("lemma")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer])

example = spark.createDataFrame([["No ets millor que jo"]], ["text"])

results = pipeline.fit(example).transform(example)
```

```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup", "ca")
  .setInputCols(Array("token"))
  .setOutputCol("lemma")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer))

val data = Seq("No ets millor que jo").toDF("text")

val results = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("ca.lemma").predict("""No ets millor que jo""")
```
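Under the hood, a SpacyLookup lemmatizer is essentially a token-to-lemma dictionary lookup with a fallback to the surface form. A minimal sketch of the idea with a two-entry toy table (the entries are illustrative; the real tables come from spacy-lookups-data):

```python
# Toy lookup table for illustration; real tables ship with
# spacy-lookups-data and contain many thousands of entries.
CA_LEMMA_LOOKUP = {"ets": "ser", "millor": "bo"}

def lemmatize(tokens, lookup):
    """Return the lemma for each token; tokens absent from the
    lookup table fall back to the token itself."""
    return [lookup.get(tok.lower(), tok) for tok in tokens]
```

Tokens not covered by the table pass through unchanged, which is why lookup lemmatizers are fast but only as good as their dictionaries.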
## Results

```bash
+--------------------------+
|result                    |
+--------------------------+
|[No, ets, millor, que, jo]|
+--------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|lemma_spacylookup|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[lemma]|
|Language:|ca|
|Size:|7.0 MB|

---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Adrian)
author: John Snow Labs
name: distilbert_qa_adrian_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Adrian`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_adrian_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768221355.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_adrian_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768221355.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_adrian_base_uncased_finetuned_squad", "en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val Document_Assembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_adrian_base_uncased_finetuned_squad", "en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq("What's my name?", "My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_adrian_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/Adrian/distilbert-base-uncased-finetuned-squad

---
layout: model
title: Zero-shot Legal NER (CUAD, small)
author: John Snow Labs
name: legner_roberta_zeroshot_cuad_small
date: 2023-01-30
tags: [zero, shot, cuad, en, licensed, tensorflow]
task: Named Entity Recognition
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is a Zero-shot NER model, trained using RoBERTa on SQUAD and finetuned to perform Zero-shot NER on the CUAD legal dataset. To use it, a specific prompt is required. This is an example for extracting PARTIES:

```
"Highlight the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. Details: The two or more parties who signed the contract"
```

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_roberta_zeroshot_cuad_small_en_1.0.0_3.0_1675089181024.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_roberta_zeroshot_cuad_small_en_1.0.0_3.0_1675089181024.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

zeroshot = nlp.ZeroShotNerModel.pretrained("legner_roberta_zeroshot_cuad_small", "en", "legal/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("zero_shot_ner")\
    .setEntityDefinitions(
        {
            'PARTIES': ['Highlight the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. Details: The two or more parties who signed the contract']
        })

nerconverter = nlp.NerConverter()\
    .setInputCols(["document", "token", "zero_shot_ner"])\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    tokenizer,
    zeroshot,
    nerconverter
])

from pyspark.sql import types as T

sample_text = ["""THIS CREDIT AGREEMENT is dated as of April 29, 2010, and is made by and among P.H. GLATFELTER COMPANY, a Pennsylvania corporation ( the "COMPANY") and certain of its subsidiaries. Identified on the signature pages hereto (each a "BORROWER" and collectively, the "BORROWERS"), each of the GUARANTORS (as hereinafter defined), the LENDERS (as hereinafter defined), PNC BANK, NATIONAL ASSOCIATION, in its capacity as agent for the Lenders under this Agreement (hereinafter referred to in such capacity as the "ADMINISTRATIVE AGENT"), and, for the limited purpose of public identification in trade tables, PNC CAPITAL MARKETS LLC and CITIZENS BANK OF PENNSYLVANIA, as joint arrangers and joint bookrunners, and CITIZENS BANK OF PENNSYLVANIA, as syndication agent.""".replace('\n',' ')]

p_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

res = p_model.transform(spark.createDataFrame(sample_text, T.StringType()).toDF("text"))

res.show()
```
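Conceptually, zero-shot NER reframes each entity type as one or more extractive-QA prompts: the prompt is the question, the contract text is the context, and answered spans become NER chunks. A minimal sketch of that mapping with a stand-in `toy_qa` function in place of the RoBERTa model (both function names are ours, purely illustrative):

```python
def zero_shot_ner(text, entity_definitions, qa_model):
    """Ask each entity type's prompt(s) against the text and collect
    answered spans as (chunk, label) pairs."""
    chunks = []
    for label, prompts in entity_definitions.items():
        for prompt in prompts:
            answer = qa_model(question=prompt, context=text)
            if answer:  # unanswered prompts yield no chunk
                chunks.append((answer, label))
    return chunks

# Stand-in "QA model" for illustration only; the real model scores spans.
def toy_qa(question, context):
    if "Parties" in question and "P.H. GLATFELTER COMPANY" in context:
        return "P.H. GLATFELTER COMPANY"
    return None
```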
## Results

```bash
+-----------------------+---------+
|chunk                  |ner_label|
+-----------------------+---------+
|P.H. GLATFELTER COMPANY|PARTIES  |
+-----------------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legner_roberta_zeroshot_cuad_small|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|449.0 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

SQUAD and CUAD

---
layout: model
title: Smaller BERT Embeddings (L-2_H-768_A-12)
author: John Snow Labs
name: small_bert_L2_768
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L2_768_en_2.6.0_2.4_1598344957042.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L2_768_en_2.6.0_2.4_1598344957042.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
...
embeddings = BertEmbeddings.pretrained("small_bert_L2_768", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])

pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"]))
```

```scala
...
val embeddings = BertEmbeddings.pretrained("small_bert_L2_768", "en")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))

val data = Seq("I love NLP").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu

text = ["I love NLP"]
embeddings_df = nlu.load('en.embed.bert.small_L2_768').predict(text, output_level='token')
embeddings_df
```
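The description above mentions knowledge distillation, where the small student is trained against a teacher's temperature-softened output distribution rather than hard labels. A minimal, plain-Python sketch of the softening step (illustrative only, not part of Spark NLP):

```python
import math

def soften(logits, temperature):
    """Temperature-scaled softmax: higher temperature spreads probability
    mass across classes, exposing the teacher's relative preferences
    among non-target classes ('dark knowledge')."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

At temperature 1 this is an ordinary softmax; the student is typically trained to match the teacher's softened distribution at a higher temperature.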
{:.h2_title}
## Results

```bash
token   en_embed_bert_small_L2_768_embeddings
I       [-0.2451338768005371, 0.40763044357299805, -0....
love    [-0.23793038725852966, -0.07403656840324402, -...
NLP     [-0.864113450050354, -0.2902209758758545, 0.54...
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|small_bert_L2_768|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|[en]|
|Dimension:|768|
|Case sensitive:|false|

{:.h2_title}
## Data Source

The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-768_A-12/1

---
layout: model
title: Sentence Entity Resolver for Clinical Abbreviations and Acronyms (sbiobert_base_cased_mli embeddings)
author: John Snow Labs
name: sbiobertresolve_clinical_abbreviation_acronym
date: 2022-02-01
tags: [en, entity_resolution, clinical, licensed]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.3.4
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model maps clinical abbreviations and acronyms to their meanings using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It is an improved version of the base model and includes more varied training data.
## Predicted Entities

`Abbreviation Meanings`

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_clinical_abbreviation_acronym_en_3.3.4_3.0_1643681527227.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_clinical_abbreviation_acronym_en_3.3.4_3.0_1643681527227.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("word_embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token", "word_embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverterInternal() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(['ABBR'])

sentence_chunk_embeddings = BertSentenceChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["document", "ner_chunk"])\
    .setOutputCol("sentence_embeddings")\
    .setChunkWeight(0.5)\
    .setCaseSensitive(True)

abbr_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_clinical_abbreviation_acronym", "en", "clinical/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("abbr_meaning")\
    .setDistanceFunction("EUCLIDEAN")

resolver_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        sentence_chunk_embeddings,
        abbr_resolver
])

model = resolver_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))

sample_text = "Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet."

abbr_result = model.transform(spark.createDataFrame([[sample_text]]).toDF('text'))
```

```scala
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("word_embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "token", "word_embeddings"))
  .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
  .setInputCols(Array("document", "token", "ner"))
  .setOutputCol("ner_chunk")
  .setWhiteList(Array("ABBR"))

val sentence_chunk_embeddings = BertSentenceChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
  .setInputCols(Array("document", "ner_chunk"))
  .setOutputCol("sentence_embeddings")
  .setChunkWeight(0.5)
  .setCaseSensitive(true)

val abbr_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_clinical_abbreviation_acronym", "en", "clinical/models")
  .setInputCols(Array("sentence_embeddings"))
  .setOutputCol("abbr_meaning")
  .setDistanceFunction("EUCLIDEAN")

val resolver_pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, word_embeddings, clinical_ner, ner_converter, sentence_chunk_embeddings, abbr_resolver))

val sample_text = Seq("""Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.""").toDS().toDF("text")

val abbr_result = resolver_pipeline.fit(sample_text).transform(sample_text)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.clinical_abbreviation_acronym").predict("""Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.""")
```
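In the output, each resolved chunk carries its best meaning in `abbr_meaning`, while all ranked candidates are packed into a single `:::`-separated string (shown as `all_k_results` in the Results section). A minimal sketch of unpacking that string into a ranked list (the helper name is ours, not part of the library):

```python
def parse_all_k_results(all_k, top_n=3):
    """Split a ':::'-joined resolver candidate string into a ranked
    list, dropping empty fragments and keeping the top_n candidates."""
    candidates = [c.strip() for c in all_k.split(":::") if c.strip()]
    return candidates[:top_n]
```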
## Results

```bash
|   | chunk | abbr_meaning                         | all_k_results                                                                                                                                                                                               |
|--:|:------|:-------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0 | CBC   | Complete Blood Count                 | Complete Blood Count:::Complete blood count:::blood group in ABO system:::(complement) component 4:::abortion:::carbohydrate antigen:::clear to auscultation:::carcinoembryonic antigen:::cervical (level) 4 |
| 1 | AB    | blood group in ABO system            | blood group in ABO system:::abortion                                                                                                                                                                        |
| 2 | VDRL  | Venereal disease research laboratory | Venereal disease research laboratory:::venous blood gas:::leukocyte esterase:::vertical banded gastroplasty                                                                                                  |
| 3 | HIV   | human immunodeficiency virus         | human immunodeficiency virus:::blood group in ABO system:::abortion:::fluorescent in situ hybridization                                                                                                     |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_clinical_abbreviation_acronym|
|Compatibility:|Healthcare NLP 3.3.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[output]|
|Language:|en|
|Size:|112.3 MB|
|Case sensitive:|true|

## References

Trained on in-house curated dataset.

---
layout: model
title: Legal Closing Clause Binary Classifier
author: John Snow Labs
name: legclf_closing_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `closing` clause type. To use this model, make sure you provide enough context as an input.
Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text; it is better to skip them, unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into account that the embeddings used by this model allow up to 512 tokens. If your texts are longer than that, consider splitting them into smaller pieces (the tutorial linked above also covers this).

This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `closing`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_closing_clause_en_1.0.0_3.2_1660123306835.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_closing_clause_en_1.0.0_3.2_1660123306835.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("clause_text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = nlp.ClassifierDLModel.pretrained("legclf_closing_clause", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    embeddings,
    docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text")

model = nlpPipeline.fit(df)

result = model.transform(df)
```
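The description above recommends paragraph-level splitting (by multiline) before running the classifier over long documents. A minimal plain-Python sketch of that preprocessing step, independent of Spark NLP:

```python
import re

def split_paragraphs(document):
    """Split a document into paragraphs on one or more blank lines,
    trimming whitespace and dropping empty fragments."""
    parts = re.split(r"\n\s*\n", document)
    return [p.strip() for p in parts if p.strip()]
```

Each returned paragraph can then be fed to the classifier as a separate row, yielding one True/False prediction per paragraph.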
## Results

```bash
+---------+
|   result|
+---------+
|[closing]|
|  [other]|
|  [other]|
|[closing]|
+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_closing_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
     closing       0.94    0.91      0.93       56
       other       0.97    0.98      0.97      143
    accuracy          -       -      0.96      199
   macro-avg       0.95    0.94      0.95      199
weighted-avg       0.96    0.96      0.96      199
```

---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from saraks)
author: John Snow Labs
name: distilbert_qa_cuad_effective_date_08_31_v1
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cuad-distil-effective_date-08-31-v1` is an English model originally trained by `saraks`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_effective_date_08_31_v1_en_4.3.0_3.0_1672766163916.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_effective_date_08_31_v1_en_4.3.0_3.0_1672766163916.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_effective_date_08_31_v1", "en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val Document_Assembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_effective_date_08_31_v1", "en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq("What's my name?", "My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_cuad_effective_date_08_31_v1|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/saraks/cuad-distil-effective_date-08-31-v1

---
layout: model
title: Named Entity Recognition (NER) Model in Norwegian (Norne 840B 300)
author: John Snow Labs
name: norne_840B_300
date: 2020-05-06
task: Named Entity Recognition
language: "no"
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [ner, norne, open_source]
supported: true
annotator: NerDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Norne is a Named Entity Recognition (or NER) model for Norwegian, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. Norne 840B 300 is trained with GloVe 840B 300 word embeddings, so be sure to use the same embeddings in the pipeline.

{:.h2_title}
## Predicted Entities

Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Derived-`DRV`, Product-`PROD`, Geo-political Entities Location-`GPE_LOC`, Geo-political Entities Organization-`GPE_ORG`, Event-`EVT`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_NO/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_NO.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/norne_840B_300_no_2.5.0_2.4_1588781290267.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/norne_840B_300_no_2.5.0_2.4_1588781290267.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("norne_840B_300", "no") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, programvareutvikler, investor og filantrop. Han er mest kjent som medgründer av Microsoft Corporation. I løpet av sin karriere hos Microsoft hadde Gates stillingene som styreleder, administrerende direktør (CEO), president og sjef programvarearkitekt, samtidig som han var den største individuelle aksjonæren fram til mai 2014. Han er en av de mest kjente gründere og pionerene i mikrodatarevolusjon på 1970- og 1980-tallet. Han er født og oppvokst i Seattle, Washington, og grunnla Microsoft sammen med barndomsvennen Paul Allen i 1975, i Albuquerque, New Mexico; det fortsatte å bli verdens største programvare for datamaskinprogramvare. Gates ledet selskapet som styreleder og administrerende direktør til han gikk av som konsernsjef i januar 2000, men han forble styreleder og ble sjef for programvarearkitekt. I løpet av slutten av 1990-tallet hadde Gates blitt kritisert for sin forretningstaktikk, som har blitt ansett som konkurransedyktig. Denne uttalelsen er opprettholdt av en rekke dommer. I juni 2006 kunngjorde Gates at han skulle gå over til en deltidsrolle hos Microsoft og på heltid ved Bill & Melinda Gates Foundation, den private veldedige stiftelsen som han og kona, Melinda Gates, opprettet i 2000. [ 9] Han overførte gradvis arbeidsoppgavene sine til Ray Ozzie og Craig Mundie. 
Han trakk seg som styreleder for Microsoft i februar 2014 og tiltrådte et nytt verv som teknologirådgiver for å støtte den nyutnevnte administrerende direktøren Satya Nadella.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("norne_840B_300", "no") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, programvareutvikler, investor og filantrop. Han er mest kjent som medgründer av Microsoft Corporation. I løpet av sin karriere hos Microsoft hadde Gates stillingene som styreleder, administrerende direktør (CEO), president og sjef programvarearkitekt, samtidig som han var den største individuelle aksjonæren fram til mai 2014. Han er en av de mest kjente gründere og pionerene i mikrodatarevolusjon på 1970- og 1980-tallet. Han er født og oppvokst i Seattle, Washington, og grunnla Microsoft sammen med barndomsvennen Paul Allen i 1975, i Albuquerque, New Mexico; det fortsatte å bli verdens største programvare for datamaskinprogramvare. Gates ledet selskapet som styreleder og administrerende direktør til han gikk av som konsernsjef i januar 2000, men han forble styreleder og ble sjef for programvarearkitekt. I løpet av slutten av 1990-tallet hadde Gates blitt kritisert for sin forretningstaktikk, som har blitt ansett som konkurransedyktig. Denne uttalelsen er opprettholdt av en rekke dommer. I juni 2006 kunngjorde Gates at han skulle gå over til en deltidsrolle hos Microsoft og på heltid ved Bill & Melinda Gates Foundation, den private veldedige stiftelsen som han og kona, Melinda Gates, opprettet i 2000. 
[ 9] Han overførte gradvis arbeidsoppgavene sine til Ray Ozzie og Craig Mundie. Han trakk seg som styreleder for Microsoft i februar 2014 og tiltrådte et nytt verv som teknologirådgiver for å støtte den nyutnevnte administrerende direktøren Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, programvareutvikler, investor og filantrop. Han er mest kjent som medgründer av Microsoft Corporation. I løpet av sin karriere hos Microsoft hadde Gates stillingene som styreleder, administrerende direktør (CEO), president og sjef programvarearkitekt, samtidig som han var den største individuelle aksjonæren fram til mai 2014. Han er en av de mest kjente gründere og pionerene i mikrodatarevolusjon på 1970- og 1980-tallet. Han er født og oppvokst i Seattle, Washington, og grunnla Microsoft sammen med barndomsvennen Paul Allen i 1975, i Albuquerque, New Mexico; det fortsatte å bli verdens største programvare for datamaskinprogramvare. Gates ledet selskapet som styreleder og administrerende direktør til han gikk av som konsernsjef i januar 2000, men han forble styreleder og ble sjef for programvarearkitekt. I løpet av slutten av 1990-tallet hadde Gates blitt kritisert for sin forretningstaktikk, som har blitt ansett som konkurransedyktig. Denne uttalelsen er opprettholdt av en rekke dommer. I juni 2006 kunngjorde Gates at han skulle gå over til en deltidsrolle hos Microsoft og på heltid ved Bill & Melinda Gates Foundation, den private veldedige stiftelsen som han og kona, Melinda Gates, opprettet i 2000. Han overførte gradvis arbeidsoppgavene sine til Ray Ozzie og Craig Mundie. 
Han trakk seg som styreleder for Microsoft i februar 2014 og tiltrådte et nytt verv som teknologirådgiver for å støtte den nyutnevnte administrerende direktøren Satya Nadella."""] ner_df = nlu.load('no.ner.norne.glove.840B_300').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title} ## Results ```bash +-------------------------------+---------+ |chunk |ner_label| +-------------------------------+---------+ |William Henry Gates III |PER | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PER | |CEO |PER | |Seattle |GPE_LOC | |Washington |GPE_LOC | |Microsoft |ORG | |Paul Allen |PER | |Albuquerque |GPE_LOC | |New Mexico |GPE_LOC | |Gates |PER | |Gates |PER | |Gates |PER | |Microsoft |ORG | |Bill & Melinda Gates Foundation|ORG | |Melinda Gates |PER | |Ray Ozzie |PER | |Craig Mundie |PER | |Microsoft |ORG | +-------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|norne_840B_300| |Type:|ner| |Compatibility:| Spark NLP 2.5.0+| |Edition:|Official| |License:|Open Source| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|no| |Case sensitive:|false| {:.h2_title} ## Data Source The detailed information can be found from [https://www.aclweb.org/anthology/2020.lrec-1.559.pdf](https://www.aclweb.org/anthology/2020.lrec-1.559.pdf) --- layout: model title: Detect Person, Organization, Location, Facility, Product and Event entities in Persian (persian_w2v_cc_300d) author: John Snow Labs name: personer_cc_300d date: 2020-12-07 task: Named Entity Recognition language: fa edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [ner, fa, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses Persian word embeddings to find 6 different types of entities in Persian text. It is trained using `persian_w2v_cc_300d` word embeddings, so please use the same embeddings in the pipeline. ## Predicted Entities Persons-`PER`, Facilities-`FAC`, Products-`PRO`, Locations-`LOC`, Organizations-`ORG`, Events-`EVENT`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/personer_cc_300d_fa_2.7.0_2.4_1607339059321.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/personer_cc_300d_fa_2.7.0_2.4_1607339059321.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("persian_w2v_cc_300d", "fa") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner = NerDLModel.pretrained("personer_cc_300d", "fa") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate("به گزارش خبرنگار ایرنا ، بر اساس تصمیم این مجمع ، محمد قمی نماینده مردم پاکدشت به عنوان رئیس و علی‌اکبر موسوی خوئینی و شمس‌الدین وهابی نمایندگان مردم تهران به عنوان نواب رئیس انتخاب شدند") ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("persian_w2v_cc_300d", "fa") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("personer_cc_300d", "fa") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter().setInputCols(Array("sentence", "token", "ner")).setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("به گزارش خبرنگار ایرنا ، بر اساس تصمیم این مجمع ، محمد قمی نماینده مردم پاکدشت به عنوان رئیس و علی‌اکبر موسوی خوئینی و شمس‌الدین وهابی نمایندگان مردم تهران به عنوان نواب رئیس انتخاب شدند").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fa.ner").predict("""به گزارش خبرنگار ایرنا ، بر اساس تصمیم این مجمع ، محمد قمی نماینده مردم پاکدشت به عنوان رئیس و علی‌اکبر موسوی خوئینی و شمس‌الدین وهابی نمایندگان مردم تهران به عنوان نواب رئیس انتخاب شدند""") ```
## Results ```bash | | ner_chunk | entity | |---:|--------------------------:|-------------:| | 0 | خبرنگار ایرنا | ORG | | 1 | محمد قمی | PER | | 2 | پاکدشت | LOC | | 3 | علی‌اکبر موسوی خوئینی | PER | | 4 | شمس‌الدین وهابی | PER | | 5 | تهران | LOC | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|personer_cc_300d| |Type:|ner| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token, word_embeddings]| |Output Labels:|[ner]| |Language:|fa| |Dependencies:|persian_w2v_cc_300d| ## Data Source This model is trained on data provided by [https://www.aclweb.org/anthology/C16-1319/](https://www.aclweb.org/anthology/C16-1319/). ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|:--------------|------:|------:|-----:|---------:|---------:|---------:| | 0 | B-Per | 1035 | 99 | 75 | 0.912698 | 0.932432 | 0.92246 | | 1 | I-Fac | 239 | 42 | 64 | 0.850534 | 0.788779 | 0.818493 | | 2 | I-Pro | 173 | 52 | 158 | 0.768889 | 0.522659 | 0.622302 | | 3 | I-Loc | 221 | 68 | 66 | 0.764706 | 0.770035 | 0.767361 | | 4 | I-Per | 652 | 38 | 55 | 0.944928 | 0.922207 | 0.933429 | | 5 | B-Org | 1118 | 289 | 348 | 0.794598 | 0.762619 | 0.778281 | | 6 | I-Org | 1543 | 237 | 240 | 0.866854 | 0.865395 | 0.866124 | | 7 | I-Event | 486 | 130 | 108 | 0.788961 | 0.818182 | 0.803306 | | 8 | B-Loc | 974 | 252 | 168 | 0.794454 | 0.85289 | 0.822635 | | 9 | B-Fac | 123 | 31 | 44 | 0.798701 | 0.736527 | 0.766355 | | 10 | B-Pro | 168 | 81 | 97 | 0.674699 | 0.633962 | 0.653697 | | 11 | B-Event | 126 | 52 | 51 | 0.707865 | 0.711864 | 0.709859 | | 12 | Macro-average | 6858 | 1371 | 1474 | 0.805657 | 0.776463 | 0.790791 | | 13 | Micro-average | 6858 | 1371 | 1474 | 0.833394 | 0.823092 | 0.828211 | ``` --- layout: model title: BERT Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on QNLI author: John Snow Labs name: bert_wiki_books_qnli date: 2021-08-30 tags: [en, open_source, wikipedia_dataset, 
bert_embeddings, qnli_dataset, books_corpus_dataset] task: Embeddings language: en nav_key: models edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses a BERT base architecture initialized from https://tfhub.dev/google/experts/bert/wiki_books/1 and fine-tuned on QNLI. This is a BERT base architecture but some changes have been made to the original training and export scheme based on more recent learnings. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_wiki_books_qnli_en_3.2.0_3.0_1630322335414.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_wiki_books_qnli_en_3.2.0_3.0_1630322335414.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_wiki_books_qnli", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_wiki_books_qnli", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.wiki_books_qnli').predict(text, output_level='token') embeddings_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_wiki_books_qnli| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Case sensitive:|false| ## Data Source [1]: [Wikipedia dataset](https://dumps.wikimedia.org/) [2]: [BooksCorpus dataset](http://yknzhu.wixsite.com/mbweb) [3]: [QNLI dataset](https://gluebenchmark.com/) This model has been imported from: https://tfhub.dev/google/experts/bert/wiki_books/qnli/2 --- layout: model title: Detect Cellular/Molecular Biology Entities author: John Snow Labs name: ner_cellular_en date: 2020-04-22 task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 2.4.2 spark_version: 2.4 tags: [ner, en, clinical, licensed] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for molecular biology related terms. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. ## Predicted Entities `DNA`, `Cell_type`, `Cell_line`, `RNA`, `Protein`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CELLULAR/){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_cellular_en_2.4.2_2.4_1587513308751.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_cellular_en_2.4.2_2.4_1587513308751.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPython.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") cellular_ner = NerDLModel.pretrained("ner_cellular", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, cellular_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive. ']], ["text"])) ``` ```scala ... val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val cellular_ner = NerDLModel.pretrained("ner_cellular", "en", "clinical/models") .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") ... 
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, cellular_ner, ner_converter)) val data = Seq("Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.h2_title} ## Results The output is a dataframe with a sentence per row and a ``"ner"`` column containing all of the entity labels in the sentence, entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select ``"token.result"`` and ``"ner.result"`` from your output dataframe or add the ``"Finisher"`` to the end of your pipeline. ```bash +-----------------------------------------------------------+---------+ |chunk |ner | +-----------------------------------------------------------+---------+ |intracellular signaling proteins |protein | |human T-cell leukemia virus type 1 promoter |DNA | |Tax |protein | |Tax-responsive element 1 |DNA | |cyclic AMP-responsive members |protein | |CREB/ATF family |protein | |transcription factors |protein | |Tax |protein | |human T-cell leukemia virus type 1 Tax-responsive element 1|DNA | |TRE-1), |DNA | |lacZ gene |DNA | |CYC1 promoter |DNA | |TRE-1 |DNA | |cyclic AMP response element-binding protein |protein | |CREB |protein | |CREB |protein | |GAL4 activation domain |protein | |GAD |protein | |reporter gene |DNA | |Tax |protein | +-----------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_cellular| |Type:|ner| |Compatibility:|Spark NLP 2.4.2| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Case sensitive:|false| {:.h2_title} ## Data Source Trained on the JNLPBA corpus containing more than 2,404 publication abstracts with ``'embeddings_clinical'``. 
http://www.geniaproject.org/ {:.h2_title} ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|:--------------|-------:|------:|-----:|---------:|---------:|---------:| | 0 | B-cell_line | 377 | 203 | 123 | 0.65 | 0.754 | 0.698148 | | 1 | I-DNA | 1519 | 277 | 266 | 0.845768 | 0.85098 | 0.848366 | | 2 | I-protein | 3981 | 911 | 786 | 0.813778 | 0.835116 | 0.824309 | | 3 | B-protein | 4483 | 1433 | 579 | 0.757776 | 0.885618 | 0.816724 | | 4 | I-cell_line | 786 | 340 | 203 | 0.698046 | 0.794742 | 0.743262 | | 5 | I-RNA | 178 | 42 | 9 | 0.809091 | 0.951872 | 0.874693 | | 6 | B-RNA | 99 | 28 | 19 | 0.779528 | 0.838983 | 0.808163 | | 7 | B-cell_type | 1440 | 294 | 480 | 0.83045 | 0.75 | 0.788177 | | 8 | I-cell_type | 2431 | 377 | 559 | 0.865741 | 0.813044 | 0.838565 | | 9 | B-DNA | 814 | 267 | 240 | 0.753006 | 0.772296 | 0.762529 | | 10 | Macro-average | 16108 | 4172 | 3264 | 0.780318 | 0.824665 | 0.801879 | | 11 | Micro-average | 16108 | 4172 | 3264 | 0.79428 | 0.831509 | 0.812469 | ``` --- layout: model title: Named Entity Recognition for Japanese (XLM-RoBERTa) author: John Snow Labs name: ner_ud_gsd_xlm_roberta_base date: 2021-09-15 tags: [ja, ner, open_source] task: Named Entity Recognition language: ja edition: Spark NLP 3.2.2 spark_version: 3.0 supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates named entities in a text, that can be used to find features such as names of people, places, and organizations. The model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. This model uses the pretrained XlmRoBertaEmbeddings embeddings "xlm_roberta_base" as an input, so be sure to use the same embeddings in the pipeline. 
## Predicted Entities `ORDINAL`, `PERSON`, `LAW`, `MOVEMENT`, `LOC`, `WORK_OF_ART`, `DATE`, `NORP`, `TITLE_AFFIX`, `QUANTITY`, `FAC`, `TIME`, `MONEY`, `LANGUAGE`, `GPE`, `EVENT`, `ORG`, `PERCENT`, `PRODUCT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_ud_gsd_xlm_roberta_base_ja_3.2.2_3.0_1631696644878.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ner_ud_gsd_xlm_roberta_base_ja_3.2.2_3.0_1631696644878.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import sparknlp from pyspark.ml import Pipeline from sparknlp.annotator import * from sparknlp.base import * from sparknlp.training import * documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") \ .setInputCols(["sentence"]) \ .setOutputCol("token") embeddings = XlmRoBertaEmbeddings.pretrained() \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nerTagger = NerDLModel.pretrained("ner_ud_gsd_xlm_roberta_base", "ja") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") pipeline = Pipeline().setStages( [ documentAssembler, sentence, word_segmenter, embeddings, nerTagger, ] ) data = spark.createDataFrame([["宮本茂氏は、日本の任天堂のゲームプロデューサーです。"]]).toDF("text") model = pipeline.fit(data) result = model.transform(data) result.selectExpr("explode(arrays_zip(token.result, ner.result))").show() ``` ```scala import spark.implicits._ import com.johnsnowlabs.nlp.DocumentAssembler import com.johnsnowlabs.nlp.annotator.{NerDLModel, SentenceDetector, WordSegmenterModel} import com.johnsnowlabs.nlp.embeddings.XlmRoBertaEmbeddings import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") .setInputCols("sentence") .setOutputCol("token") val embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base", "xx") .setInputCols("sentence", "token") .setOutputCol("embeddings") val nerTagger = NerDLModel.pretrained("ner_ud_gsd_xlm_roberta_base", "ja") .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array( documentAssembler, 
sentence, word_segmenter, embeddings, nerTagger )) val data = Seq("宮本茂氏は、日本の任天堂のゲームプロデューサーです。").toDF("text") val model = pipeline.fit(data) val result = model.transform(data) result.selectExpr("explode(arrays_zip(token.result, ner.result))").show() ``` {:.nlu-block} ```python import nlu nlu.load("ja.ner.ud_gsd_xlm_roberta_base").predict("""宮本茂氏は、日本の任天堂のゲームプロデューサーです。""")  ```
## Results ```bash +-------------------+ | col| +-------------------+ | {宮本, B-PERSON}| | {茂, I-PERSON}| | {氏, O}| | {は, O}| | {、, O}| | {日本, B-GPE}| | {の, O}| | {任天, B-ORG}| | {堂, I-ORG}| | {の, O}| | {ゲーム, O}| |{プロデューサー, O}| | {です, O}| | {。, O}| +-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_ud_gsd_xlm_roberta_base| |Type:|ner| |Compatibility:|Spark NLP 3.2.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ja| |Dependencies:|xlm_roberta_base| ## Data Source The model was trained on the Universal Dependencies, curated by Google. A NER version was created by megagonlabs: https://github.com/megagonlabs/UD_Japanese-GSD Reference: Asahara, M., Kanayama, H., Tanaka, T., Miyao, Y., Uematsu, S., Mori, S., Matsumoto, Y., Omura, M., & Murawaki, Y. (2018). Universal Dependencies Version 2 for Japanese. In LREC-2018. ## Benchmarking ```bash label precision recall f1-score support DATE 0.93 0.97 0.95 206 EVENT 0.78 0.48 0.60 52 FAC 0.80 0.68 0.73 59 GPE 0.88 0.81 0.85 102 LANGUAGE 1.00 1.00 1.00 8 LAW 0.82 0.69 0.75 13 LOC 0.87 0.83 0.85 41 MONEY 1.00 1.00 1.00 20 MOVEMENT 0.67 0.55 0.60 11 NORP 0.84 0.86 0.85 57 O 0.99 0.99 0.99 11785 ORDINAL 0.94 0.94 0.94 32 ORG 0.71 0.78 0.74 179 PERCENT 1.00 1.00 1.00 16 PERSON 0.89 0.90 0.89 127 PRODUCT 0.56 0.68 0.61 50 QUANTITY 0.92 0.96 0.94 172 TIME 0.91 1.00 0.96 32 TITLE_AFFIX 0.86 0.75 0.80 24 WORK_OF_ART 0.87 0.85 0.86 48 accuracy - - 0.98 13034 macro-avg 0.86 0.84 0.85 13034 weighted-avg 0.98 0.98 0.98 13034 ``` --- layout: model title: Catalan RobertaForQuestionAnswering (from thatdramebaazguy) author: John Snow Labs name: roberta_qa_roberta_base_squad date: 2022-06-20 tags: [ca, open_source, question_answering, roberta] task: Question Answering language: ca edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## 
Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad` is a Catalan model originally trained by `thatdramebaazguy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad_ca_4.0.0_3.0_1655734774977.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad_ca_4.0.0_3.0_1655734774977.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_squad","ca") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_squad","ca") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|ca| |Size:|461.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/thatdramebaazguy/roberta-base-squad --- layout: model title: Slovenian T5ForConditionalGeneration Small Cased model (from cjvt) author: John Snow Labs name: t5_legacy_sl_small date: 2023-01-30 tags: [sl, open_source, t5, tensorflow] task: Text Generation language: sl edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legacy-t5-sl-small` is a Slovenian model originally trained by `cjvt`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_legacy_sl_small_sl_4.3.0_3.0_1675104880094.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_legacy_sl_small_sl_4.3.0_3.0_1675104880094.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_legacy_sl_small","sl") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_legacy_sl_small","sl") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_legacy_sl_small| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|sl| |Size:|178.9 MB| ## References - https://huggingface.co/cjvt/legacy-t5-sl-small --- layout: model title: Arabic BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_ar_cased date: 2022-12-02 tags: [ar, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ar edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-ar-cased` is an Arabic model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_ar_cased_ar_4.2.4_3.0_1670015694662.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_ar_cased_ar_4.2.4_3.0_1670015694662.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_ar_cased","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_ar_cased","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_ar_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|344.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-ar-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: French Legal Roberta Embeddings author: John Snow Labs name: roberta_large_french_legal date: 2023-02-16 tags: [fr, french, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: fr edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-french-roberta-large` is a French model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_large_french_legal_fr_4.2.4_3.0_1676556919312.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_large_french_legal_fr_4.2.4_3.0_1676556919312.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_large_french_legal", "fr")\ .setInputCols(["sentence"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_large_french_legal", "fr") .setInputCols("sentence") .setOutputCol("embeddings") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_large_french_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|1.3 GB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-french-roberta-large --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_base_squad2.0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2.0_en_4.3.0_3.0_1674219848563.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2.0_en_4.3.0_3.0_1674219848563.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2.0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_squad2.0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|460.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/roberta-base_squad2.0 --- layout: model title: Part of Speech for Breton author: John Snow Labs name: pos_ud_keb date: 2021-03-09 tags: [part_of_speech, open_source, breton, pos_ud_keb, br] task: Part of Speech Tagging language: br edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - ADV - VERB - PUNCT - NOUN - PART - ADJ - ADP - NUM - DET - X - PROPN - PRON - CCONJ - SCONJ - SYM - INTJ {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_keb_br_3.0.0_3.0_1615292153000.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_keb_br_3.0.0_3.0_1615292153000.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_keb", "br") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Hello from John Snow Labs!']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_keb", "br") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Hello from John Snow Labs!").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs!"] token_df = nlu.load('br.pos').predict(text) token_df ```
## Results ```bash token pos 0 Hello PRON 1 from VERB 2 John ADJ 3 Snow PROPN 4 Labs PROPN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_keb| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|br| --- layout: model title: Legal Further assurances Clause Binary Classifier author: John Snow Labs name: legclf_further_assurances_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `further-assurances` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
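The paragraph-splitting option mentioned above can be sketched in plain Python (a simplified stand-in for the techniques in the linked tutorial, not Spark NLP's own splitters; the sample clauses below are made up for illustration):

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines ("multiline" splitting) and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Hypothetical two-clause document.
doc = """FURTHER ASSURANCES. Each party shall execute such further documents as may be required.

GOVERNING LAW. This Agreement shall be governed by the laws of the State of Delaware."""

paragraphs = split_paragraphs(doc)
print(len(paragraphs))  # two candidate clauses, each well under the 512-token limit
```

Each resulting paragraph can then be fed to the classifier as its own row, so the model sees one clause-sized chunk of context at a time.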
## Predicted Entities `other`, `further-assurances` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_further_assurances_clause_en_1.0.0_3.2_1660122474814.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_further_assurances_clause_en_1.0.0_3.2_1660122474814.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_further_assurances_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +--------------------+ | result| +--------------------+ |[further-assurances]| |[other]| |[other]| |[further-assurances]| +--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_further_assurances_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support further-assurances 1.00 1.00 1.00 41 other 1.00 1.00 1.00 99 accuracy - - 1.00 140 macro-avg 1.00 1.00 1.00 140 weighted-avg 1.00 1.00 1.00 140 ``` --- layout: model title: English BertForQuestionAnswering model (from AnonymousSub) author: John Snow Labs name: bert_qa_fpdm_hier_bert_FT_newsqa date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_hier_bert_FT_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_fpdm_hier_bert_FT_newsqa_en_4.0.0_3.0_1654187869149.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_fpdm_hier_bert_FT_newsqa_en_4.0.0_3.0_1654187869149.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_fpdm_hier_bert_FT_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_fpdm_hier_bert_FT_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.bert.fpdm_hier_ft.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_fpdm_hier_bert_FT_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/fpdm_hier_bert_FT_newsqa --- layout: model title: Russian RoBERTa Embeddings (from blinoff) author: John Snow Labs name: roberta_embeddings_roberta_base_russian_v0 date: 2022-04-14 tags: [roberta, embeddings, ru, open_source] task: Embeddings language: ru edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-base-russian-v0` is a Russian model originally trained by `blinoff`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_base_russian_v0_ru_3.4.2_3.0_1649947793512.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_base_russian_v0_ru_3.4.2_3.0_1649947793512.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_base_russian_v0","ru") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Я люблю искра NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_base_russian_v0","ru") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Я люблю искра NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ru.embed.roberta_base_russian_v0").predict("""Я люблю искра NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_roberta_base_russian_v0| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ru| |Size:|468.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/blinoff/roberta-base-russian-v0 --- layout: model title: Typo Detector Pipeline for Icelandic author: John Snow Labs name: distilbert_token_classifier_typo_detector_pipeline date: 2022-06-25 tags: [icelandic, typo, ner, distilbert, is, open_source] task: Named Entity Recognition language: is edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of [distilbert_token_classifier_typo_detector_is](https://nlp.johnsnowlabs.com/2022/01/19/distilbert_token_classifier_typo_detector_is.html). ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/TYPO_DETECTOR_IS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/DistilBertForTokenClassification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_typo_detector_pipeline_is_4.0.0_3.0_1656119193097.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_typo_detector_pipeline_is_4.0.0_3.0_1656119193097.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python typo_pipeline = PretrainedPipeline("distilbert_token_classifier_typo_detector_pipeline", lang = "is") typo_pipeline.annotate("Það er miög auðvelt að draga marktækar álykanir af texta með Spark NLP.") ``` ```scala val typo_pipeline = new PretrainedPipeline("distilbert_token_classifier_typo_detector_pipeline", lang = "is") typo_pipeline.annotate("Það er miög auðvelt að draga marktækar álykanir af texta með Spark NLP.") ```
## Results ```bash +--------+---------+ |chunk |ner_label| +--------+---------+ |miög |PO | |álykanir|PO | +--------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_typo_detector_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|is| |Size:|505.8 MB| ## Included Models - DocumentAssembler - TokenizerModel - DistilBertForTokenClassification - NerConverter - Finisher --- layout: model title: Detect Diagnosis, Symptoms, Drugs, Labs and Demographics (ner_jsl_enriched) author: John Snow Labs name: ner_jsl_enriched_en date: 2020-04-22 task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 2.4.2 spark_version: 2.4 tags: [ner, en, clinical, licensed] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terminology. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Definitions of Predicted Entities: - `Age`: All mentions of ages, past or present, related to the patient or anybody else. - `Dosage`: Quantity prescribed by the physician for an active ingredient; measurement units are described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_Name`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single active ingredient or multiple active ingredients. - `Frequency`: Frequency of administration for a dose prescribed. - `Gender`: Gender-specific nouns and pronouns. - `Symptom`: All the symptoms mentioned in the document, of a patient or someone else.
- `Allergen`: Allergen-related extractions mentioned in the document. - `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted. - `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately. - `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments. - `Pulse`: Peripheral heart rate, without advanced information like measurement location. - `Respiration`: Number of breaths per minute. - `Route`: Drug and medication administration routes described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital signs Headers are extracted separately with their specific labels). - `Temperature`: All mentions that refer to body temperature. - `Weight`: All mentions related to a patient's weight. ## Predicted Entities `Age`, `Diagnosis`, `Dosage`, `Drug_Name`, `Frequency`, `Gender`, `Lab_Name`, `Lab_Result`, `Symptom_Name`, `Allergenic_substance`, `Blood_Pressure`, `Causative_Agents_(Virus_and_Bacteria)`, `Modifier`, `Name`, `Negation`, `O2_Saturation`, `Procedure`, `Procedure_Name`, `Pulse_Rate`, `Respiratory_Rate`, `Route`, `Section_Name`, `Substance_Name`, `Temperature`, `Weight`.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_en_2.4.2_2.4_1587513303751.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_en_2.4.2_2.4_1587513303751.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
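The chunk merging that the NerConverter stage performs can be illustrated with a short plain-Python IOB sketch (a simplified illustration of the idea, not the actual Spark NLP implementation):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag starts a new chunk; flush any chunk in progress.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # An I- tag with a matching label continues the current chunk.
            current.append(token)
        else:
            # "O" (or a mismatched I-) closes the current chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["mom", "has", "respiratory", "congestion"]
tags = ["B-Gender", "O", "B-Symptom_Name", "I-Symptom_Name"]
print(iob_to_chunks(tokens, tags))
# [('mom', 'Gender'), ('respiratory congestion', 'Symptom_Name')]
```

This is why the Results table below shows multi-word chunks such as "respiratory congestion" rather than individual token tags.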
{% include programmingLanguageSelectScalaPython.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_jsl_enriched", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."]], ["text"])) ``` ```scala ... val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_jsl_enriched", "en", "clinical/models") .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") ... 
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.h2_title} ## Results The output is a dataframe with a sentence per row and a ``"ner"`` column containing all of the entity labels in the sentence, entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select ``"token.result"`` and ``"ner.result"`` from your output dataframe or add the ``"Finisher"`` to the end of your pipeline. ```bash +---------------------------+------------+ |chunk |ner | +---------------------------+------------+ |21-day-old |Age | |male |Gender | |congestion |Symptom_Name| |mom |Gender | |suctioning yellow discharge|Symptom_Name| |she |Gender | |problems with his breathing|Symptom_Name| |perioral cyanosis |Symptom_Name| |retractions |Symptom_Name| |mom |Gender | |Tylenol |Drug_Name | |His |Gender | |his |Gender | |respiratory congestion |Symptom_Name| |He |Gender | |tired |Symptom_Name| |fussy |Symptom_Name| |albuterol |Drug_Name | +---------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_enriched_en_2.4.2_2.4| |Type:|ner| |Compatibility:|Spark NLP 2.4.2| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Case sensitive:|false| {:.h2_title} ## Data Source Trained on data gathered and manually annotated by John Snow Labs.
https://www.johnsnowlabs.com/data/ {:.h2_title} ## Benchmarking ```bash label tp fp fn prec rec f1 B-Pulse_Rate 80 26 9 0.754717 0.898876 0.820513 I-Diagnosis 2341 1644 1129 0.587453 0.67464 0.628035 I-Procedure_Name 2209 1128 1085 0.661972 0.670613 0.666265 B-Lab_Result 432 107 263 0.801484 0.621583 0.700162 B-Dosage 465 179 81 0.72205 0.851648 0.781513 I-Causative_Agents_(Virus_and_Bacteria) 9 3 10 0.75 0.473684 0.580645 B-Name 648 295 510 0.687169 0.559585 0.616849 I-Name 917 427 665 0.682292 0.579646 0.626794 B-Weight 52 25 9 0.675325 0.852459 0.753623 B-Symptom_Name 4244 1911 1776 0.689521 0.704983 0.697166 I-Maybe 25 15 63 0.625 0.284091 0.390625 I-Symptom_Name 1920 1584 2503 0.547945 0.434095 0.48442 B-Modifier 1399 704 942 0.66524 0.597608 0.629613 B-Blood_Pressure 82 21 7 0.796117 0.921348 0.854167 B-Frequency 290 93 97 0.75718 0.749354 0.753247 I-Gender 29 19 25 0.604167 0.537037 0.568627 I-Age 3 6 11 0.333333 0.214286 0.26087 B-Drug_Name 1762 500 271 0.778957 0.866699 0.820489 B-Substance_Name 143 32 53 0.817143 0.729592 0.770889 B-Temperature 58 23 11 0.716049 0.84058 0.773333 B-Section_Name 2700 294 177 0.901804 0.938478 0.919775 I-Route 131 165 177 0.442568 0.425325 0.433775 B-Maybe 108 47 164 0.696774 0.397059 0.505855 B-Gender 5156 685 68 0.882726 0.986983 0.931948 I-Dosage 435 182 87 0.705024 0.833333 0.763828 B-Causative_Agents_(Virus_and_Bacteria) 21 17 6 0.552632 0.777778 0.646154 I-Frequency 278 131 191 0.679707 0.592751 0.633257 B-Age 352 34 21 0.911917 0.9437 0.927536 I-Lab_Result 27 20 170 0.574468 0.137056 0.221311 B-Negation 1501 311 341 0.828366 0.814875 0.821565 B-Diagnosis 2657 1281 1049 0.674708 0.716945 0.695186 I-Section_Name 3876 1304 188 0.748263 0.95374 0.838598 B-Route 466 286 123 0.619681 0.791172 0.695004 I-Negation 80 152 190 0.344828 0.296296 0.318725 B-Procedure_Name 1453 739 562 0.662865 0.721092 0.690754 I-Allergenic_substance 6 1 7 0.857143 0.461538 0.6 B-Allergenic_substance 74 31 23 0.704762 0.762887 0.732673 I-Weight 
46 43 17 0.516854 0.730159 0.605263 B-Lab_Name 639 189 287 0.771739 0.690065 0.72862 I-Modifier 104 156 417 0.4 0.199616 0.266325 I-Temperature 2 7 13 0.222222 0.133333 0.166667 I-Drug_Name 334 237 290 0.584939 0.535256 0.558996 I-Lab_Name 271 157 140 0.633178 0.659367 0.646007 B-Respiratory_Rate 46 6 5 0.884615 0.901961 0.893204 Macro-average 37896 15237 14343 0.621144 0.562248 0.59023 Micro-average 37896 15237 14343 0.713229 0.725435 0.71928 ``` --- layout: model title: English asr_Part1 TFWav2Vec2ForCTC from zasheza author: John Snow Labs name: pipeline_asr_Part1 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Part1` is an English model originally trained by zasheza. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_Part1_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_Part1_en_4.2.0_3.0_1664039779675.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_Part1_en_4.2.0_3.0_1664039779675.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_Part1', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_Part1", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_Part1| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from datarpit) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_natural_questions date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-natural-questions` is an English model originally trained by `datarpit`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_natural_questions_en_4.3.0_3.0_1672768123339.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_natural_questions_en_4.3.0_3.0_1672768123339.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_natural_questions","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_natural_questions","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
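As a toy illustration of what the `answer` column contains: extractive QA models like this one score start/end positions, so the prediction is always a span copied from the context. The indices below are hand-picked for the example above, not produced by the model:

```python
# Extractive QA returns a span of the context, never free-form text.
context = "My name is Clara and I live in Berkeley."
start, end = 11, 16  # hypothetical best-scoring span boundaries
answer = context[start:end]
print(answer)  # Clara
```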
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_natural_questions| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/datarpit/distilbert-base-uncased-finetuned-natural-questions --- layout: model title: Legal Non competition Clause Binary Classifier author: John Snow Labs name: legclf_non_competition_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `non-competition` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `non-competition` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_non_competition_clause_en_1.0.0_3.2_1660122728584.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_non_competition_clause_en_1.0.0_3.2_1660122728584.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_non_competition_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
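The paragraph-splitting advice from the description (splitting by multiline before classification) can be sketched in plain Python. This is pre-processing outside the Spark pipeline; the `clause_text` column name comes from the example above, and the sample contract text is made up:

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines; each piece becomes one row
    of the clause_text column fed to the classifier."""
    parts = re.split(r"\n\s*\n", text)
    # Drop empty pieces and, per the 512-token embedding limit noted
    # in the description, anything a whitespace tokenizer counts as too long.
    return [p.strip() for p in parts if p.strip() and len(p.split()) <= 512]

doc = """Section 1. Non-Competition.
The Executive agrees not to compete with the Company...

Section 2. Governing Law.
This Agreement shall be governed by the laws of..."""

clauses = split_paragraphs(doc)
print(len(clauses))  # 2
```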
## Results
```bash
+-----------------+
|           result|
+-----------------+
|[non-competition]|
|          [other]|
|          [other]|
|[non-competition]|
+-----------------+
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_non_competition_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking
```bash
label precision recall f1-score support
non-competition 1.00 0.89 0.94 18
other 0.97 1.00 0.99 74
accuracy - - 0.98 92
macro-avg 0.99 0.94 0.96 92
weighted-avg 0.98 0.98 0.98 92
```
--- layout: model title: English T5ForConditionalGeneration Cased model (from dbernsohn) author: John Snow Labs name: t5_numbers_gcd date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5_numbers_gcd` is an English model originally trained by `dbernsohn`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_numbers_gcd_en_4.3.0_3.0_1675156829112.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_numbers_gcd_en_4.3.0_3.0_1675156829112.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_numbers_gcd","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_numbers_gcd","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
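`t5_numbers_gcd` was trained on the `numbers_gcd` split of TensorFlow's math_dataset, where the input is a natural-language question about two integers and the target is their greatest common divisor. A stdlib sketch of the reference answers (the exact prompt wording below is illustrative, not the dataset's):

```python
import math

# Reference answers for the numbers_gcd task (prompt wording illustrative).
questions = [(18, 24), (270, 192)]
for a, b in questions:
    print(f"What is the greatest common factor of {a} and {b}? -> {math.gcd(a, b)}")
```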
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_numbers_gcd| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|283.1 MB| ## References - https://huggingface.co/dbernsohn/t5_numbers_gcd - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://www.tensorflow.org/datasets/catalog/math_dataset#mathdatasetnumbers_gcd - https://github.com/DorBernsohn/CodeLM/tree/main/MathLM - https://www.linkedin.com/in/dor-bernsohn-70b2b1146/ --- layout: model title: Pipeline to Detect Clinical Entities (ner_eu_clinical_case - fr) author: John Snow Labs name: ner_eu_clinical_case_pipeline date: 2023-03-08 tags: [fr, clinical, licensed, ner] task: Named Entity Recognition language: fr edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_eu_clinical_case](https://nlp.johnsnowlabs.com/2023/02/01/ner_eu_clinical_case_fr.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_pipeline_fr_4.3.0_3.2_1678261744783.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_pipeline_fr_4.3.0_3.2_1678261744783.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_eu_clinical_case_pipeline", "fr", "clinical/models") text = " Un garçon de 3 ans atteint d'un trouble autistique à l'hôpital du service pédiatrique A de l'hôpital universitaire. Il n'a pas d'antécédents familiaux de troubles ou de maladies du spectre autistique. Le garçon a été diagnostiqué avec un trouble de communication sévère, avec des difficultés d'interaction sociale et un traitement sensoriel retardé. Les tests sanguins étaient normaux (thyréostimuline (TSH), hémoglobine, volume globulaire moyen (MCV) et ferritine). L'endoscopie haute a également montré une tumeur sous-muqueuse provoquant une obstruction subtotale de la sortie gastrique. Devant la suspicion d'une tumeur stromale gastro-intestinale, une gastrectomie distale a été réalisée. L'examen histopathologique a révélé une prolifération de cellules fusiformes dans la couche sous-muqueuse. " result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_eu_clinical_case_pipeline", "fr", "clinical/models") val text = " Un garçon de 3 ans atteint d'un trouble autistique à l'hôpital du service pédiatrique A de l'hôpital universitaire. Il n'a pas d'antécédents familiaux de troubles ou de maladies du spectre autistique. Le garçon a été diagnostiqué avec un trouble de communication sévère, avec des difficultés d'interaction sociale et un traitement sensoriel retardé. Les tests sanguins étaient normaux (thyréostimuline (TSH), hémoglobine, volume globulaire moyen (MCV) et ferritine). L'endoscopie haute a également montré une tumeur sous-muqueuse provoquant une obstruction subtotale de la sortie gastrique. Devant la suspicion d'une tumeur stromale gastro-intestinale, une gastrectomie distale a été réalisée. 
L'examen histopathologique a révélé une prolifération de cellules fusiformes dans la couche sous-muqueuse. " val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | chunks | begin | end | entities | confidence | |---:|:------------------------------------------------------|--------:|------:|:-------------------|-------------:| | 0 | Un garçon de 3 ans | 1 | 18 | patient | 0.58786 | | 1 | trouble autistique à l'hôpital du service pédiatrique | 33 | 85 | clinical_condition | 0.560657 | | 2 | l'hôpital | 92 | 100 | clinical_event | 0.3725 | | 3 | Il n'a | 117 | 122 | patient | 0.62695 | | 4 | d'antécédents | 128 | 140 | clinical_event | 0.8355 | | 5 | troubles | 155 | 162 | clinical_condition | 0.9096 | | 6 | maladies | 170 | 177 | clinical_condition | 0.9109 | | 7 | du spectre autistique | 179 | 199 | bodypart | 0.4828 | | 8 | Le garçon | 202 | 210 | patient | 0.48925 | | 9 | diagnostiqué | 218 | 229 | clinical_event | 0.2155 | | 10 | trouble | 239 | 245 | clinical_condition | 0.8545 | | 11 | difficultés | 281 | 291 | clinical_event | 0.5636 | | 12 | traitement | 321 | 330 | clinical_event | 0.9046 | | 13 | tests | 355 | 359 | clinical_event | 0.9305 | | 14 | normaux | 378 | 384 | units_measurements | 0.9394 | | 15 | thyréostimuline | 387 | 401 | clinical_event | 0.4653 | | 16 | TSH | 404 | 406 | clinical_event | 0.691 | | 17 | ferritine | 456 | 464 | clinical_event | 0.2768 | | 18 | L'endoscopie | 468 | 479 | clinical_event | 0.7778 | | 19 | montré | 499 | 504 | clinical_event | 0.9829 | | 20 | tumeur sous-muqueuse | 510 | 529 | clinical_condition | 0.7923 | | 21 | provoquant | 531 | 540 | clinical_event | 0.868 | | 22 | obstruction | 546 | 556 | clinical_condition | 0.9448 | | 23 | la sortie gastrique | 571 | 589 | bodypart | 0.496233 | | 24 | suspicion | 602 | 610 | clinical_event | 0.9035 | | 25 | tumeur stromale gastro-intestinale | 618 | 651 | clinical_condition | 0.5901 | | 26 | gastrectomie | 658 | 669 | clinical_event | 0.3939 | | 27 | L'examen | 695 | 702 | clinical_event | 0.5114 | | 28 | révélé | 724 | 729 | clinical_event | 0.9731 | | 29 | prolifération | 735 | 747 | clinical_event | 0.6767 | 
| 30 | cellules fusiformes | 752 | 770 | bodypart | 0.5233 | | 31 | la couche sous-muqueuse | 777 | 799 | bodypart | 0.6755 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_eu_clinical_case_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|fr| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Part of Speech for Bhojpuri (pos_ud_bhtb) author: John Snow Labs name: pos_ud_bhtb date: 2021-01-18 task: Part of Speech Tagging language: bh edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [bho, bh, pos, open_source] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 14 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. 
## Predicted Entities `ADJ`, `ADP`, `ADV`, `AUX`, `CCONJ`, `DET`, `INTJ`, `NOUN`, `NUM`, `PART`, `PRON`, `PROPN`, `PUNCT`, `SCONJ`, `VERB`, and `X` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_bhtb_bh_2.7.0_2.4_1610989017843.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_bhtb_bh_2.7.0_2.4_1610989017843.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an nlp pipeline after tokenization.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_bhtb", "bh") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['ओहु लोग के मालूम बा कि श्लील होखते भोजपुरी के नींव हिल जाई ।']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_bhtb", "bh") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("ओहु लोग के मालूम बा कि श्लील होखते भोजपुरी के नींव हिल जाई ।").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["ओहु लोग के मालूम बा कि श्लील होखते भोजपुरी के नींव हिल जाई ।"] pos_df = nlu.load('bh.pos').predict(text) pos_df ```
## Results ```bash +------------------------------------------------------------+----------------------------------------------------------------------------------+ |text |result | +------------------------------------------------------------+----------------------------------------------------------------------------------+ |ओहु लोग के मालूम बा कि श्लील होखते भोजपुरी के नींव हिल जाई ।|[DET, NOUN, ADP, NOUN, VERB, SCONJ, ADJ, VERB, PROPN, ADP, NOUN, VERB, AUX, PUNCT]| +------------------------------------------------------------+----------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_bhtb| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[pos]| |Language:|bh| ## Data Source The model was trained on the [Universal Dependencies](http://universaldependencies.org) version 2.7. Reference: - Ojha, A. K., & Zeman, D. (2020). Universal Dependency Treebanks for Low-Resource Indian Languages: The Case of Bhojpuri. Proceedings of the WILDRE5{--} 5th Workshop on Indian Language Data: Resources and Evaluation. 
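The `result` column in the output above holds exactly one UD tag per token, so pairing tokens with tags is a straightforward zip once the annotations are collected (tokens and tags below are copied from the example output):

```python
# Tokens of the example sentence and the tags the model produced for it.
tokens = "ओहु लोग के मालूम बा कि श्लील होखते भोजपुरी के नींव हिल जाई ।".split()
tags = ["DET", "NOUN", "ADP", "NOUN", "VERB", "SCONJ", "ADJ",
        "VERB", "PROPN", "ADP", "NOUN", "VERB", "AUX", "PUNCT"]
assert len(tokens) == len(tags)  # one tag per token
tagged = list(zip(tokens, tags))
print(tagged[:3])
```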
## Benchmarking ```bash | pos | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | ADJ | 0.92 | 0.92 | 0.92 | 250 | | ADP | 0.95 | 0.96 | 0.96 | 989 | | ADV | 0.85 | 0.88 | 0.86 | 32 | | AUX | 0.93 | 0.95 | 0.94 | 355 | | CCONJ | 0.95 | 0.95 | 0.95 | 151 | | DET | 0.96 | 0.95 | 0.95 | 353 | | INTJ | 1.00 | 1.00 | 1.00 | 5 | | NOUN | 0.95 | 0.96 | 0.96 | 1854 | | NUM | 0.97 | 0.98 | 0.97 | 149 | | PART | 0.94 | 0.93 | 0.93 | 192 | | PRON | 0.95 | 0.94 | 0.95 | 335 | | PROPN | 0.94 | 0.94 | 0.94 | 419 | | PUNCT | 0.97 | 0.96 | 0.96 | 695 | | SCONJ | 1.00 | 0.96 | 0.98 | 118 | | VERB | 0.95 | 0.93 | 0.94 | 767 | | X | 0.50 | 1.00 | 0.67 | 1 | | accuracy | | | 0.95 | 6665 | | macro avg | 0.92 | 0.95 | 0.93 | 6665 | | weighted avg | 0.95 | 0.95 | 0.95 | 6665 | ``` --- layout: model title: Pipeline to Detect Clinical Entities (BertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_jsl_pipeline date: 2022-03-23 tags: [licensed, ner, clinical, bertfortokenclassification, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_jsl](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_jsl_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_pipeline_en_3.4.1_3.0_1648044551434.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_pipeline_en_3.4.1_3.0_1648044551434.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_jsl_pipeline", "en", "clinical/models") pipeline.fullAnnotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_jsl_pipeline", "en", "clinical/models") pipeline.fullAnnotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. 
The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.bert_token_ner_jsl.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
## Results ```bash +--------------------------------+------------+ |chunk |ner_label | +--------------------------------+------------+ |21-day-old |Age | |Caucasian male |Demographics| |congestion |Symptom | |mom |Demographics| |yellow discharge |Symptom | |nares |Body_Part | |she |Demographics| |mild problems with his breathing|Symptom | |perioral cyanosis |Symptom | |retractions |Symptom | |One day ago |Date_Time | |mom |Demographics| |tactile temperature |Symptom | |Tylenol |Drug | |Baby-girl |Age | |decreased p.o. intake |Symptom | |His |Demographics| |breast-feeding |Body_Part | |his |Demographics| |respiratory congestion |Symptom | +--------------------------------+------------+ only showing top 20 rows ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_jsl_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.5 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - MedicalBertForTokenClassifier - NerConverter --- layout: model title: English BertForTokenClassification Small Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4_original_PubmedBert_small date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4-original-PubmedBert_small` is an English model originally trained by `ghadeermobasher`.
## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_original_PubmedBert_small_en_4.0.0_3.0_1657108249459.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_original_PubmedBert_small_en_4.0.0_3.0_1657108249459.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_original_PubmedBert_small","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_original_PubmedBert_small","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4_original_PubmedBert_small| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|408.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4-original-PubmedBert_small --- layout: model title: SNOMED ChunkResolver author: John Snow Labs name: chunkresolve_snomed_findings_clinical class: ChunkEntityResolverModel language: en nav_key: models repository: clinical/models date: 2020-06-20 task: Entity Resolution edition: Healthcare NLP 2.5.1 spark_version: 2.4 tags: [clinical,licensed,entity_resolution,en] deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Entity Resolution model based on KNN using Word Embeddings + Word Movers Distance. ## Predicted Entities SNOMED codes and their normalized definition with `clinical_embeddings`. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_snomed_findings_clinical_en_2.5.1_2.4_1592617161564.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_snomed_findings_clinical_en_2.5.1_2.4_1592617161564.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python ... snomed_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_snomed_findings_clinical","en","clinical/models")\ .setInputCols("token","chunk_embeddings")\ .setOutputCol("snomed_resolution") pipeline_snomed = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, snomed_ner_converter, chunk_embeddings, snomed_resolver]) data = ["""Pentamidine 300 mg IV q . 36 hours , Pentamidine nasal wash 60 mg per 6 ml of sterile water q.d . , voriconazole 200 mg p.o . b.i.d . , acyclovir 400 mg p.o . b.i.d . , cyclosporine 50 mg p.o . b.i.d . , prednisone 60 mg p.o . q.d . , GCSF 480 mcg IV q.d . , Epogen 40,000 units subcu q . week , Protonix 40 mg q.d . , Simethicone 80 mg p.o . q . 8 , nitroglycerin paste 1 " ; q . 4 h . p.r.n . , flunisolide nasal inhaler , 2 puffs q . 8 , OxyCodone 10-15 mg p.o . q . 6 p.r.n . , Sudafed 30 mg q . 6 p.o . p.r.n . , Fluconazole 2% cream b.i.d . to erythematous skin lesions , Ditropan 5 mg p.o . b.i.d . , Tylenol 650 mg p.o . q . 4 h . p.r.n . , Ambien 5-10 mg p.o . q . h.s . p.r.n . , Neurontin 100 mg q . a.m . , 200 mg q . p.m . , Aquaphor cream b.i.d . p.r.n . , Lotrimin 1% cream b.i.d . to feet , Dulcolax 5-10 mg p.o . q.d . p.r.n . , Phoslo 667 mg p.o . t.i.d . , Peridex 0.12% , 15 ml p.o . b.i.d . mouthwash , Benadryl 25-50 mg q . 4-6 h . p.r.n . pruritus , Sarna cream q.d . p.r.n . pruritus , Nystatin 5 ml p.o . q.i.d . swish and !""", """Albuterol nebulizers 2.5 mg q.4h . and Atrovent nebulizers 0.5 mg q.4h . , please alternate albuterol and Atrovent ; Rocaltrol 0.25 mcg per NG tube q.d .; calcium carbonate 1250 mg per NG tube q.i.d .; vitamin B12 1000 mcg IM q . month , next dose is due Nov 18 ; diltiazem 60 mg per NG tube t.i.d .; ferrous sulfate 300 mg per NG t.i.d .; Haldol 5 mg IV q.h.s .; hydralazine 10 mg IV q.6h . p.r.n . 
hypertension ; lisinopril 10 mg per NG tube q.d .; Ativan 1 mg per NG tube q.h.s .; Lopressor 25 mg per NG tube t.i.d .; Zantac 150 mg per NG tube b.i.d .; multivitamin 10 ml per NG tube q.d .; Macrodantin 100 mg per NG tube q.i.d . x 10 days beginning on 11/3/00 .""", """Tylenol 650 mg p.o . q . 4-6h p.r.n . headache or pain ; acyclovir 400 mg p.o . t.i.d .; acyclovir topical t.i.d . to be applied to lesion on corner of mouth ; Peridex 15 ml p.o . b.i.d .; Mycelex 1 troche p.o . t.i.d .; g-csf 404 mcg subcu q.d .; folic acid 1 mg p.o . q.d .; lorazepam 1-2 mg p.o . q . 4-6h p.r.n . nausea and vomiting ; Miracle Cream topical q.d . p.r.n . perianal irritation ; Eucerin Cream topical b.i.d .; Zantac 150 mg p.o . b.i.d .; Restoril 15-30 mg p.o . q . h.s . p.r.n . insomnia ; multivitamin 1 tablet p.o . q.d .; viscous lidocaine 15 ml p.o . q . 3h can be applied to corner of mouth or lips p.r.n . pain control ."""] model = pipeline_snomed.fit(spark.createDataFrame([['']]).toDF("text")) results = model.transform(spark.createDataFrame([[t] for t in data], ["text"])) ``` ```scala ... val snomed_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_snomed_findings_clinical","en","clinical/models") .setInputCols("token","chunk_embeddings") .setOutputCol("snomed_resolution") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, snomed_ner_converter, chunk_embeddings, snomed_resolver)) val data = Array("""Pentamidine 300 mg IV q . 36 hours , Pentamidine nasal wash 60 mg per 6 ml of sterile water q.d . , voriconazole 200 mg p.o . b.i.d . , acyclovir 400 mg p.o . b.i.d . , cyclosporine 50 mg p.o . b.i.d . , prednisone 60 mg p.o . q.d . , GCSF 480 mcg IV q.d . , Epogen 40,000 units subcu q . week , Protonix 40 mg q.d . , Simethicone 80 mg p.o . q . 8 , nitroglycerin paste 1 " ; q . 4 h . p.r.n . , flunisolide nasal inhaler , 2 puffs q . 8 , OxyCodone 10-15 mg p.o . q . 6 p.r.n . , Sudafed 30 mg q . 6 p.o . p.r.n . , Fluconazole 2% cream b.i.d . to erythematous skin lesions , Ditropan 5 mg p.o . b.i.d . , Tylenol 650 mg p.o . q . 4 h . p.r.n . , Ambien 5-10 mg p.o . q . h.s . p.r.n . , Neurontin 100 mg q . a.m . , 200 mg q . p.m . , Aquaphor cream b.i.d . p.r.n . , Lotrimin 1% cream b.i.d . to feet , Dulcolax 5-10 mg p.o . q.d . p.r.n .
, Phoslo 667 mg p.o . t.i.d . , Peridex 0.12% , 15 ml p.o . b.i.d . mouthwash , Benadryl 25-50 mg q . 4-6 h . p.r.n . pruritus , Sarna cream q.d . p.r.n . pruritus , Nystatin 5 ml p.o . q.i.d . swish and !""", """Albuterol nebulizers 2.5 mg q.4h . and Atrovent nebulizers 0.5 mg q.4h . , please alternate albuterol and Atrovent ; Rocaltrol 0.25 mcg per NG tube q.d .; calcium carbonate 1250 mg per NG tube q.i.d .; vitamin B12 1000 mcg IM q . month , next dose is due Nov 18 ; diltiazem 60 mg per NG tube t.i.d .; ferrous sulfate 300 mg per NG t.i.d .; Haldol 5 mg IV q.h.s .; hydralazine 10 mg IV q.6h . p.r.n . hypertension ; lisinopril 10 mg per NG tube q.d .; Ativan 1 mg per NG tube q.h.s .; Lopressor 25 mg per NG tube t.i.d .; Zantac 150 mg per NG tube b.i.d .; multivitamin 10 ml per NG tube q.d .; Macrodantin 100 mg per NG tube q.i.d . x 10 days beginning on 11/3/00 .""", """Tylenol 650 mg p.o . q . 4-6h p.r.n . headache or pain ; acyclovir 400 mg p.o . t.i.d .; acyclovir topical t.i.d . to be applied to lesion on corner of mouth ; Peridex 15 ml p.o . b.i.d .; Mycelex 1 troche p.o . t.i.d .; g-csf 404 mcg subcu q.d .; folic acid 1 mg p.o . q.d .; lorazepam 1-2 mg p.o . q . 4-6h p.r.n . nausea and vomiting ; Miracle Cream topical q.d . p.r.n . perianal irritation ; Eucerin Cream topical b.i.d .; Zantac 150 mg p.o . b.i.d .; Restoril 15-30 mg p.o . q . h.s . p.r.n . insomnia ; multivitamin 1 tablet p.o . q.d .; viscous lidocaine 15 ml p.o . q . 3h can be applied to corner of mouth or lips p.r.n . pain control .""").toSeq.toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
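Entity-resolution annotators like ChunkEntityResolverModel return the ranked candidate descriptions as a single `:::`-joined string (visible in the `target_text` column of the results below). A minimal plain-Python sketch, no Spark required, of unpacking that string into an ordered candidate list (the function name is illustrative, not part of the Spark NLP API):

```python
def unpack_candidates(target_text, top_k=3):
    """Split a ':::'-joined resolver string into an ordered candidate list."""
    candidates = [c.strip() for c in target_text.split(":::") if c.strip()]
    return candidates[:top_k]

row = "Pruritus:::Genital pruritus:::Postmenopausal pruritus:::Pruritus hiemalis"
print(unpack_candidates(row))
# ['Pruritus', 'Genital pruritus', 'Postmenopausal pruritus']
```

The first candidate corresponds to the resolved `code` in the output table.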
{:.h2_title} ## Results ```bash +-----------------------------------------------------------------------------+-------+----------------------------------------------------------------------------------------------------+-----------------+----------+ | chunk| entity| target_text| code|confidence| +-----------------------------------------------------------------------------+-------+----------------------------------------------------------------------------------------------------+-----------------+----------+ | erythematous skin lesions|PROBLEM|Skin lesion:::Achromic skin lesions of pinta:::Scaly skin:::Skin constricture:::Cratered skin les...| 95324001| 0.0937| | pruritus|PROBLEM|Pruritus:::Genital pruritus:::Postmenopausal pruritus:::Pruritus hiemalis:::Pruritus ani:::Anogen...| 418363000| 0.1394| | pruritus|PROBLEM|Pruritus:::Genital pruritus:::Postmenopausal pruritus:::Pruritus hiemalis:::Pruritus ani:::Anogen...| 418363000| 0.1394| | hypertension|PROBLEM|Hypertension:::Renovascular hypertension:::Idiopathic hypertension:::Venous hypertension:::Resist...| 38341003| 0.1019| | headache or pain|PROBLEM|Pain:::Headache:::Postchordotomy pain:::Throbbing pain:::Aching headache:::Postspinal headache:::...| 22253000| 0.0953| | applied to lesion on corner of mouth|PROBLEM|Lesion of tongue:::Erythroleukoplakia of mouth:::Lesion of nose:::Lesion of oropharynx:::Erythrop...| 300246005| 0.0547| | nausea and vomiting|PROBLEM|Nausea and vomiting:::Vomiting without nausea:::Nausea:::Intractable nausea and vomiting:::Vomiti...| 16932000| 0.0995| | perianal irritation|PROBLEM|Perineal irritation:::Vulval irritation:::Skin irritation:::Perianal pain:::Perianal itch:::Vagin...| 281639001| 0.0764| | insomnia|PROBLEM|Insomnia:::Mood insomnia:::Nonorganic insomnia:::Persistent insomnia:::Psychophysiologic insomnia...| 193462001| 0.1198| 
+-----------------------------------------------------------------------------+-------+----------------------------------------------------------------------------------------------------+-----------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |----------------|---------------------------------------| | Name: | chunkresolve_snomed_findings_clinical | | Type: | ChunkEntityResolverModel | | Compatibility: | Spark NLP 2.5.1+ | | License: | Licensed | | Edition: | Official | | Input labels: | [token, chunk_embeddings] | | Output labels: | [entity] | | Language: | en | | Case sensitive: | True | | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained on SNOMED CT Findings http://www.snomed.org/ --- layout: model title: English ElectraForQuestionAnswering model (from navteca) author: John Snow Labs name: electra_qa_base_squad2 date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-squad2` is an English model originally trained by `navteca`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_base_squad2_en_4.0.0_3.0_1655920731292.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_base_squad2_en_4.0.0_3.0_1655920731292.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.electra.base.by_navteca").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
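Extractive QA models of this kind score every token as a potential answer start and answer end; the predicted answer is the highest-scoring valid span. A toy plain-Python sketch of that span selection (illustrative only; Spark NLP performs this internally, and the scores here are made up):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick (i, j) maximizing start_scores[i] + end_scores[j] with i <= j < i + max_len."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score, best = s + end_scores[j], (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]
end   = [0.0, 0.1, 0.0, 4.0, 0.2, 0.0, 0.0, 0.0, 1.0, 0.1]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```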
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_base_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|408.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/navteca/electra-base-squad2 - https://rajpurkar.github.io/SQuAD-explorer/ --- layout: model title: Detect Genes and Human Phenotypes author: John Snow Labs name: ner_human_phenotype_gene_clinical date: 2020-09-21 task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 2.6.0 spark_version: 2.4 tags: [ner, en, licensed, clinical] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects mentions of genes and human phenotypes (hp) in medical text. ## Predicted Entities `GENE`, `HP` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_HUMAN_PHENOTYPE_GENE_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_HUMAN_PHENOTYPE_GENE_CLINICAL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_clinical_en_2.5.5_2.4_1598558253840.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_clinical_en_2.5.5_2.4_1598558253840.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_human_phenotype_gene_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate("Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).") ``` ```scala ... val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_human_phenotype_gene_clinical", "en", "clinical/models") .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.human_phenotype.gene_clinical").predict("""Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).""") ```
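The NerConverter stage recommended above merges token-level IOB tags (e.g. `B-HP`, `I-HP`) into full entity chunks. A plain-Python sketch of that merge logic (illustrative only, not the Spark NLP implementation):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, entity) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity begins
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)           # continuation of the open entity
        else:                             # O tag closes any open entity
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(iob_to_chunks(
    ["presented", "with", "polyhydramnios", ",", "polyuria"],
    ["O", "O", "B-HP", "O", "B-HP"]))
# [('polyhydramnios', 'HP'), ('polyuria', 'HP')]
```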
{:.h2_title} ## Results ```bash +----+------------------+---------+-------+----------+ | | chunk | begin | end | entity | +====+==================+=========+=======+==========+ | 0 | BS type | 29 | 32 | GENE | +----+------------------+---------+-------+----------+ | 1 | polyhydramnios | 75 | 88 | HP | +----+------------------+---------+-------+----------+ | 2 | polyuria | 91 | 98 | HP | +----+------------------+---------+-------+----------+ | 3 | nephrocalcinosis | 101 | 116 | HP | +----+------------------+---------+-------+----------+ | 4 | hypokalemia | 122 | 132 | HP | +----+------------------+---------+-------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_human_phenotype_gene_clinical| |Type:|ner| |Compatibility:|Healthcare NLP 2.6.0 +| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|[en]| |Case sensitive:|false| ## Data source This model was trained with data from https://github.com/lasigeBioTM/PGR For further details please refer to https://aclweb.org/anthology/papers/N/N19/N19-1152/ ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|--------------:|------:|-----:|-----:|---------:|---------:|---------:| | 0 | I-HP | 303 | 56 | 64 | 0.844011 | 0.825613 | 0.834711 | | 1 | B-GENE | 1176 | 158 | 252 | 0.881559 | 0.823529 | 0.851557 | | 2 | B-HP | 1078 | 133 | 96 | 0.890173 | 0.918228 | 0.903983 | | 3 | Macro-average | 2557 | 347 | 412 | 0.871915 | 0.85579 | 0.863777 | | 4 | Micro-average | 2557 | 347 | 412 | 0.88051 | 0.861233 | 0.870765 | ``` --- layout: model title: Fast Neural Machine Translation Model from Bulgarian to English author: John Snow Labs name: opus_mt_bg_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, bg, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: 
"Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `bg` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bg_en_xx_2.7.0_2.4_1609169845170.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bg_en_xx_2.7.0_2.4_1609169845170.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_bg_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_bg_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.bg.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
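The sentence detector precedes MarianTransformer because translation models work on bounded, sentence-sized inputs. A naive plain-Python sketch of that split-then-translate flow (the regex split and the `translate` stub stand in for SentenceDetectorDLModel and the Marian model):

```python
import re

def translate_by_sentence(text, translate):
    # naive sentence split; SentenceDetectorDLModel does this robustly in the pipeline
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return " ".join(translate(s) for s in sentences)

fake_translate = lambda s: f"<en>{s}</en>"  # stand-in for opus_mt_bg_en
print(translate_by_sentence("Здравей. Как си?", fake_translate))
```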
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_bg_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: BERT Sequence Classification - Detecting Hate Speech (bert_sequence_classifier_hatexplain) author: John Snow Labs name: bert_sequence_classifier_hatexplain date: 2021-11-06 tags: [bert_for_sequence_classification, hate, hate_speech, speech, offensive, en, open_source] task: Text Classification language: en nav_key: models edition: Spark NLP 3.3.2 spark_version: 2.4 supported: true annotator: BertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is imported from Hugging Face and is used to classify a text as `Hate speech`, `Offensive`, or `Normal`. The model was trained on data from Gab and Twitter, and human rationales were included as part of the training data to boost performance. Citation: ```bash @article{mathew2020hatexplain, title={HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection}, author={Mathew, Binny and Saha, Punyajoy and Yimam, Seid Muhie and Biemann, Chris and Goyal, Pawan and Mukherjee, Animesh}, journal={arXiv preprint arXiv:2012.10289}, year={2020} } ``` ## Predicted Entities `hate speech`, `normal`, `offensive` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_hatexplain_en_3.3.2_2.4_1636214446271.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_hatexplain_en_3.3.2_2.4_1636214446271.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = BertForSequenceClassification \ .pretrained('bert_sequence_classifier_hatexplain', 'en') \ .setInputCols(['token', 'document']) \ .setOutputCol('class') \ .setCaseSensitive(True) \ .setMaxSentenceLength(512) pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier]) example = spark.createDataFrame([['I love you very much!']]).toDF("text") result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_hatexplain", "en") .setInputCols("document", "token") .setOutputCol("class") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) val example = Seq("I love you very much!").toDS.toDF("text") val result = pipeline.fit(example).transform(example) ```
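A sequence classifier like this one emits one score per class, and the predicted label (`['normal']` in the results below) is simply the highest-scoring class. A plain-Python sketch of that final step (the logit values are made up for illustration):

```python
import math

def predict_label(logits, labels):
    """Softmax the raw scores for readability, then take the argmax label."""
    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]
    return labels[probs.index(max(probs))], max(probs)

labels = ["hate speech", "normal", "offensive"]
print(predict_label([-1.2, 2.7, -0.4], labels)[0])  # normal
```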
## Results ```bash ['normal'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_hatexplain| |Compatibility:|Spark NLP 3.3.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[label]| |Language:|en| |Case sensitive:|true| ## Data Source [https://huggingface.co/Hate-speech-CNERG/bert-base-uncased-hatexplain](https://huggingface.co/Hate-speech-CNERG/bert-base-uncased-hatexplain) ## Benchmarking ```bash +-------+------------+--------+ | Acc | Macro F1 | AUROC | +-------+------------+--------+ | 0.698 | 0.687 | 0.851 | +-------+------------+--------+ ``` --- layout: model title: English Deberta Embeddings model (from domenicrosati) author: John Snow Labs name: deberta_embeddings_mlm_test date: 2023-03-13 tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, en, tensorflow] task: Embeddings language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DeBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deberta-mlm-test` is a English model originally trained by `domenicrosati`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_mlm_test_en_4.3.1_3.0_1678702297278.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_mlm_test_en_4.3.1_3.0_1678702297278.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_mlm_test","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_mlm_test","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
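Token embeddings like these are typically consumed downstream via vector similarity. A self-contained sketch of cosine similarity between two embedding vectors (toy 4-dimensional vectors for illustration; the real model emits much larger ones):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

a = [0.2, 0.1, -0.4, 0.8]
b = [0.19, 0.12, -0.38, 0.81]
print(round(cosine(a, b), 3))  # close to 1.0 for near-identical vectors
```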
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|deberta_embeddings_mlm_test| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|267.4 MB| |Case sensitive:|false| ## References https://huggingface.co/domenicrosati/deberta-mlm-test --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_el8_dl4 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-el8-dl4` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el8_dl4_en_4.3.0_3.0_1675120674389.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el8_dl4_en_4.3.0_3.0_1675120674389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_el8_dl4","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_el8_dl4","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
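T5-style models are steered by a task prefix prepended to the input text (for example `summarize:` or `translate English to German:`, following the original T5 conventions). A trivial sketch of preparing such inputs before they reach the `text` column (check the model card for which tasks this particular checkpoint supports):

```python
def with_task_prefix(task, text):
    """Prepend a T5 task prefix to the raw input text."""
    return f"{task}: {text}"

print(with_task_prefix("summarize",
                       "Spark NLP provides scalable NLP pipelines on Apache Spark."))
```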
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_el8_dl4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|143.8 MB| ## References - https://huggingface.co/google/t5-efficient-small-el8-dl4 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: RoBERTa Large CoNLL-03 NER Pipeline author: John Snow Labs name: roberta_large_token_classifier_conll03_pipeline date: 2022-06-19 tags: [open_source, ner, token_classifier, roberta, conll03, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [roberta_large_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/26/roberta_large_token_classifier_conll03_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655654476076.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655654476076.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("roberta_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("roberta_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PERSON | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_large_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - RoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: Detect PHI for Deidentification (Augmented) author: John Snow Labs name: ner_deid_augmented date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Deidentification NER (Augmented) is a Named Entity Recognition model that annotates text to find protected health information that may need to be deidentified. We adhered to the official annotation guidelines (AG) of the 2014 i2b2 de-identification challenge while annotating new datasets for this model.
All the details regarding the nuances and explanations for AG can be found here [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/) ## Predicted Entities `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_augmented_en_3.0.0_3.0_1617208449273.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_augmented_en_3.0.0_3.0_1617208449273.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_deid_augmented","en","clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital, Dr. John Green (2347165768). He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed.
The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same. ']], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_deid_augmented","en","clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital, Dr. John Green (2347165768). He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. 
A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same. """).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.deid.augmented").predict("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital, Dr. John Green (2347165768). He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. 
The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same. """) ```
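Downstream de-identification typically replaces each detected chunk with a mask. As a minimal illustration of what the `ner_chunk` output enables, the sketch below masks spans in plain Python over hand-written begin/end offsets and labels — it is not the annotator's own masking (Healthcare NLP ships dedicated de-identification components for that), and the offsets are illustrative:

```python
def mask_phi(text, chunks, mask_char="*"):
    """Replace each detected PHI span with mask characters of equal length.

    `chunks` holds (begin, end, label) triples; `end` is inclusive, matching
    Spark NLP annotation offsets.
    """
    masked = list(text)
    for begin, end, label in chunks:
        for i in range(begin, end + 1):
            masked[i] = mask_char
    return "".join(masked)

# Illustrative offsets for a shortened version of the example document above.
text = "Mr. Smith was seen by Dr. John Green on 02/04/2003."
chunks = [(4, 8, "NAME"), (26, 35, "NAME"), (40, 49, "DATE")]
print(mask_phi(text, chunks))  # Mr. ***** was seen by Dr. ********** on **********.
```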
## Results ```bash +---------------+---------+ |chunk |ner_label| +---------------+---------+ |Smith |NAME | |VA Hospital |LOCATION | |John Green |NAME | |2347165768 |ID | |Day Hospital |LOCATION | |02/04/2003 |DATE | |Smith |NAME | |Day Hospital |LOCATION | |Smith |NAME | |Smith |NAME | |7 Ardmore Tower|LOCATION | |Hart |NAME | |Smith |NAME | |02/07/2003 |DATE | +---------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_augmented| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on plain n2c2 2014: De-identification and Heart Disease Risk Factors Challenge datasets with embeddings_clinical https://portal.dbmi.hms.harvard.edu/projects/n2c2-2014/ ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|--------------:|------:|------:|------:|---------:|---------:|---------:| | 0 | I-NAME | 1096 | 47 | 80 | 0.95888 | 0.931973 | 0.945235 | | 1 | I-CONTACT | 93 | 0 | 4 | 1 | 0.958763 | 0.978947 | | 2 | I-AGE | 3 | 1 | 6 | 0.75 | 0.333333 | 0.461538 | | 3 | B-DATE | 2078 | 42 | 52 | 0.980189 | 0.975587 | 0.977882 | | 4 | I-DATE | 474 | 39 | 25 | 0.923977 | 0.9499 | 0.936759 | | 5 | I-LOCATION | 755 | 68 | 76 | 0.917375 | 0.908544 | 0.912938 | | 6 | I-PROFESSION | 78 | 8 | 9 | 0.906977 | 0.896552 | 0.901734 | | 7 | B-NAME | 1182 | 101 | 36 | 0.921278 | 0.970443 | 0.945222 | | 8 | B-AGE | 259 | 10 | 11 | 0.962825 | 0.959259 | 0.961039 | | 9 | B-ID | 146 | 8 | 11 | 0.948052 | 0.929936 | 0.938907 | | 10 | B-PROFESSION | 76 | 9 | 21 | 0.894118 | 0.783505 | 0.835165 | | 11 | B-LOCATION | 556 | 87 | 71 | 0.864697 | 0.886762 | 0.875591 | | 12 | I-ID | 64 | 8 | 3 | 0.888889 | 0.955224 | 0.920863 | | 13 | B-CONTACT | 40 | 7 | 5 | 0.851064 | 0.888889 | 0.869565 | | 14 | Macro-average | 6900 | 435 | 410 | 0.912023 | 0.880619 | 0.896046 | | 15 | Micro-average | 6900 | 435 | 
410 | 0.940695 | 0.943912 | 0.942301 |
```

---
layout: model
title: Fast Neural Machine Translation Model from English to Haitian Creole
author: John Snow Labs
name: opus_mt_en_ht
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, ht, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with contributions from academic groups (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial partners. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.

- source languages: `en`
- target languages: `ht`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ht_xx_2.7.0_2.4_1609163766018.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ht_xx_2.7.0_2.4_1609163766018.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_ht", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

data = ["Hello, how are you today?"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_ht", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("Hello, how are you today?").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.ht').predict(text, output_level='sentence')
opus_df
```
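`fullAnnotate` returns, per input document, a mapping from output column to a list of annotations whose `result` field carries the text. The sketch below collects the translated sentences from such a structure — the nested dicts are a simplified stand-in for Spark NLP's annotation objects, and the Haitian Creole strings are made up for illustration:

```python
# Simplified stand-in for light_pipeline.fullAnnotate(...) output.
annotated = [
    {"translation": [
        {"result": "Bonjou, kijan ou ye jodi a?"},
        {"result": "Mwen kontan wè ou."},
    ]}
]

def extract_translations(annotated):
    """Collect the translated sentence strings from each annotated document."""
    return [[ann["result"] for ann in doc.get("translation", [])]
            for doc in annotated]

print(extract_translations(annotated)[0])
```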
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_en_ht|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English BertForQuestionAnswering model (from Kutay)
author: John Snow Labs
name: bert_qa_fine_tuned_tweetqa_aip
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fine_tuned_tweetqa_aip` is an English model originally trained by `Kutay`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_fine_tuned_tweetqa_aip_en_4.0.0_3.0_1654187707453.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_fine_tuned_tweetqa_aip_en_4.0.0_3.0_1654187707453.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_fine_tuned_tweetqa_aip","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_fine_tuned_tweetqa_aip","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.trivia.bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
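The nlu one-liner packs the question and its context into a single string separated by `|||`. A small sketch of that packing convention — only the `|||` separator is taken from the snippet above; the helper names are ours:

```python
SEP = "|||"

def join_qa(question, context):
    """Pack a question/context pair into the single-string nlu format."""
    return f"{question}{SEP}{context}"

def split_qa(packed):
    """Recover the (question, context) pair from a packed string."""
    question, context = packed.split(SEP, 1)
    return question, context

packed = join_qa("What's my name?", "My name is Clara and I live in Berkeley.")
print(split_qa(packed))
```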
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_fine_tuned_tweetqa_aip|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/Kutay/fine_tuned_tweetqa_aip

---
layout: model
title: Extract relations between drugs and proteins (ReDL)
author: John Snow Labs
name: redl_drugprot_biobert
date: 2023-01-14
tags: [relation_extraction, clinical, en, licensed, tensorflow]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Detect interactions between chemical compounds/drugs and genes/proteins using BERT, by classifying whether a specified semantic relation holds between chemical and gene entities within a sentence or document. The entity labels used during training were derived from the custom NER model created by our team for the DrugProt corpus. These include CHEMICAL for chemical compounds/drugs, GENE for genes/proteins, and GENE_AND_CHEMICAL for entity mentions of type GENE and of type CHEMICAL that overlap (such as enzymes and small peptides). The relation categories from the DrugProt corpus were condensed from 13 categories to 10 due to low numbers of examples for certain categories. This merging process grouped the SUBSTRATE_PRODUCT-OF and SUBSTRATE relation categories together, and grouped the AGONIST-ACTIVATOR, AGONIST-INHIBITOR, and AGONIST relation categories together.
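The 13-to-10 label condensation described above can be written down as a simple mapping. The dictionary below is our reconstruction from the prose (only the merges named there are included), not a published artifact:

```python
# Merges described above: SUBSTRATE_PRODUCT-OF folds into SUBSTRATE, and the
# AGONIST-ACTIVATOR / AGONIST-INHIBITOR variants fold into AGONIST.
LABEL_MERGES = {
    "SUBSTRATE_PRODUCT-OF": "SUBSTRATE",
    "AGONIST-ACTIVATOR": "AGONIST",
    "AGONIST-INHIBITOR": "AGONIST",
}

def condense(label):
    """Map an original DrugProt relation label to its condensed category."""
    return LABEL_MERGES.get(label, label)

print(condense("AGONIST-ACTIVATOR"))  # AGONIST
```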
## Predicted Entities

`INHIBITOR`, `DIRECT-REGULATOR`, `SUBSTRATE`, `ACTIVATOR`, `INDIRECT-UPREGULATOR`, `INDIRECT-DOWNREGULATOR`, `ANTAGONIST`, `PRODUCT-OF`, `PART-OF`, `AGONIST`

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_drugprot_biobert_en_4.2.4_3.0_1673736326031.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_drugprot_biobert_en_4.2.4_3.0_1673736326031.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use

The table below lists the `redl_drugprot_biobert` RE model, its labels, the optimal NER model to pair it with, and the meaningful relation pairs.

| RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS |
|:--------:|:---------------:|:---------:|----------|
| redl_drugprot_biobert | INHIBITOR, DIRECT-REGULATOR, SUBSTRATE, ACTIVATOR, INDIRECT-UPREGULATOR, INDIRECT-DOWNREGULATOR, ANTAGONIST, PRODUCT-OF, PART-OF, AGONIST | ner_drugprot_clinical | ["chemical-gene", "chemical-gene_and_chemical", "gene_and_chemical-gene"] |
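`RENerChunksFilter.setRelationPairs` keeps only candidate chunk pairs whose entity types appear in the configured pair list. The sketch below reproduces that filtering logic in plain Python, using the pairs from the RE PAIRS column — the function is illustrative rather than the annotator's implementation, and treating pairs as order-insensitive is our assumption:

```python
# Pairs from the RE PAIRS column, uppercased to match NER entity labels.
ALLOWED_PAIRS = {
    frozenset(("CHEMICAL", "GENE")),
    frozenset(("CHEMICAL", "GENE_AND_CHEMICAL")),
    frozenset(("GENE_AND_CHEMICAL", "GENE")),
}

def is_candidate(entity1, entity2):
    """True if this entity-type pair should be scored by the RE model."""
    return frozenset((entity1, entity2)) in ALLOWED_PAIRS

print(is_candidate("CHEMICAL", "GENE"))      # True
print(is_candidate("CHEMICAL", "CHEMICAL"))  # False
```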
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") words_embedder = WordEmbeddingsModel()\ .pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentences", "tokens"])\ .setOutputCol("embeddings") drugprot_ner_tagger = MedicalNerModel.pretrained("ner_drugprot_clinical", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverterInternal()\ .setInputCols(["sentences", "tokens", "ner_tags"])\ .setOutputCol("ner_chunks") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models")\ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel()\ .pretrained("dependency_conllu", "en")\ .setInputCols(["sentences", "pos_tags", "tokens"])\ .setOutputCol("dependencies") # Set a filter on pairs of named entities which will be treated as relation candidates drugprot_re_ner_chunk_filter = RENerChunksFilter()\ .setInputCols(["ner_chunks", "dependencies"])\ .setOutputCol("re_ner_chunks")\ .setMaxSyntacticDistance(4) # .setRelationPairs(['CHEMICAL-GENE']) drugprot_re_Model = RelationExtractionDLModel()\ .pretrained('redl_drugprot_biobert', "en", "clinical/models")\ .setPredictionThreshold(0.9)\ .setInputCols(["re_ner_chunks", "sentences"])\ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, words_embedder, drugprot_ner_tagger, ner_converter, pos_tagger, dependency_parser, drugprot_re_ner_chunk_filter, drugprot_re_Model]) text='''Lipid specific activation of the murine P4-ATPase Atp8a1 (ATPase II). 
The asymmetric transbilayer distribution of phosphatidylserine (PS) in the mammalian plasma membrane and secretory vesicles is maintained, in part, by an ATP-dependent transporter. This aminophospholipid "flippase" selectively transports PS to the cytosolic leaflet of the bilayer and is sensitive to vanadate, Ca(2+), and modification by sulfhydryl reagents. Although the flippase has not been positively identified, a subfamily of P-type ATPases has been proposed to function as transporters of amphipaths, including PS and other phospholipids. A candidate PS flippase ATP8A1 (ATPase II), originally isolated from bovine secretory vesicles, is a member of this subfamily based on sequence homology to the founding member of the subfamily, the yeast protein Drs2, which has been linked to ribosomal assembly, the formation of Golgi-coated vesicles, and the maintenance of PS asymmetry. To determine if ATP8A1 has biochemical characteristics consistent with a PS flippase, a murine homologue of this enzyme was expressed in insect cells and purified. The purified Atp8a1 is inactive in detergent micelles or in micelles containing phosphatidylcholine, phosphatidic acid, or phosphatidylinositol, is minimally activated by phosphatidylglycerol or phosphatidylethanolamine (PE), and is maximally activated by PS. The selectivity for PS is dependent upon multiple elements of the lipid structure. Similar to the plasma membrane PS transporter, Atp8a1 is activated only by the naturally occurring sn-1,2-glycerol isomer of PS and not the sn-2,3-glycerol stereoisomer. Both flippase and Atp8a1 activities are insensitive to the stereochemistry of the serine headgroup. Most modifications of the PS headgroup structure decrease recognition by the plasma membrane PS flippase. 
Activation of Atp8a1 is also reduced by these modifications; phosphatidylserine-O-methyl ester, lysophosphatidylserine, glycerophosphoserine, and phosphoserine, which are not transported by the plasma membrane flippase, do not activate Atp8a1. Weakly translocated lipids (PE, phosphatidylhydroxypropionate, and phosphatidylhomoserine) are also weak Atp8a1 activators. However, N-methyl-phosphatidylserine, which is transported by the plasma membrane flippase at a rate equivalent to PS, is incapable of activating Atp8a1 activity. These results indicate that the ATPase activity of the secretory granule Atp8a1 is activated by phospholipids binding to a specific site whose properties (PS selectivity, dependence upon glycerol but not serine, stereochemistry, and vanadate sensitivity) are similar to, but distinct from, the properties of the substrate binding site of the plasma membrane flippase.''' data = spark.createDataFrame([[text]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val drugprot_ner_tagger = MedicalNerModel.pretrained("ner_drugprot_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel() 
.pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val drugprot_re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") // .setRelationPairs(Array("CHEMICAL-GENE")) // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. val drugprot_re_Model = RelationExtractionDLModel() .pretrained("redl_drugprot_biobert", "en", "clinical/models") .setPredictionThreshold(0.9) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, words_embedder, drugprot_ner_tagger, ner_converter, pos_tagger, dependency_parser, drugprot_re_ner_chunk_filter, drugprot_re_Model)) val data = Seq("""Lipid specific activation of the murine P4-ATPase Atp8a1 (ATPase II). The asymmetric transbilayer distribution of phosphatidylserine (PS) in the mammalian plasma membrane and secretory vesicles is maintained, in part, by an ATP-dependent transporter. This aminophospholipid "flippase" selectively transports PS to the cytosolic leaflet of the bilayer and is sensitive to vanadate, Ca(2+), and modification by sulfhydryl reagents. Although the flippase has not been positively identified, a subfamily of P-type ATPases has been proposed to function as transporters of amphipaths, including PS and other phospholipids. A candidate PS flippase ATP8A1 (ATPase II), originally isolated from bovine secretory vesicles, is a member of this subfamily based on sequence homology to the founding member of the subfamily, the yeast protein Drs2, which has been linked to ribosomal assembly, the formation of Golgi-coated vesicles, and the maintenance of PS asymmetry. 
To determine if ATP8A1 has biochemical characteristics consistent with a PS flippase, a murine homologue of this enzyme was expressed in insect cells and purified. The purified Atp8a1 is inactive in detergent micelles or in micelles containing phosphatidylcholine, phosphatidic acid, or phosphatidylinositol, is minimally activated by phosphatidylglycerol or phosphatidylethanolamine (PE), and is maximally activated by PS. The selectivity for PS is dependent upon multiple elements of the lipid structure. Similar to the plasma membrane PS transporter, Atp8a1 is activated only by the naturally occurring sn-1,2-glycerol isomer of PS and not the sn-2,3-glycerol stereoisomer. Both flippase and Atp8a1 activities are insensitive to the stereochemistry of the serine headgroup. Most modifications of the PS headgroup structure decrease recognition by the plasma membrane PS flippase. Activation of Atp8a1 is also reduced by these modifications; phosphatidylserine-O-methyl ester, lysophosphatidylserine, glycerophosphoserine, and phosphoserine, which are not transported by the plasma membrane flippase, do not activate Atp8a1. Weakly translocated lipids (PE, phosphatidylhydroxypropionate, and phosphatidylhomoserine) are also weak Atp8a1 activators. However, N-methyl-phosphatidylserine, which is transported by the plasma membrane flippase at a rate equivalent to PS, is incapable of activating Atp8a1 activity. 
These results indicate that the ATPase activity of the secretory granule Atp8a1 is activated by phospholipids binding to a specific site whose properties (PS selectivity, dependence upon glycerol but not serine, stereochemistry, and vanadate sensitivity) are similar to, but distinct from, the properties of the substrate binding site of the plasma membrane flippase.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.drugprot").predict("""Lipid specific activation of the murine P4-ATPase Atp8a1 (ATPase II). The asymmetric transbilayer distribution of phosphatidylserine (PS) in the mammalian plasma membrane and secretory vesicles is maintained, in part, by an ATP-dependent transporter. This aminophospholipid "flippase" selectively transports PS to the cytosolic leaflet of the bilayer and is sensitive to vanadate, Ca(2+), and modification by sulfhydryl reagents. Although the flippase has not been positively identified, a subfamily of P-type ATPases has been proposed to function as transporters of amphipaths, including PS and other phospholipids. A candidate PS flippase ATP8A1 (ATPase II), originally isolated from bovine secretory vesicles, is a member of this subfamily based on sequence homology to the founding member of the subfamily, the yeast protein Drs2, which has been linked to ribosomal assembly, the formation of Golgi-coated vesicles, and the maintenance of PS asymmetry. To determine if ATP8A1 has biochemical characteristics consistent with a PS flippase, a murine homologue of this enzyme was expressed in insect cells and purified. The purified Atp8a1 is inactive in detergent micelles or in micelles containing phosphatidylcholine, phosphatidic acid, or phosphatidylinositol, is minimally activated by phosphatidylglycerol or phosphatidylethanolamine (PE), and is maximally activated by PS. The selectivity for PS is dependent upon multiple elements of the lipid structure. 
Similar to the plasma membrane PS transporter, Atp8a1 is activated only by the naturally occurring sn-1,2-glycerol isomer of PS and not the sn-2,3-glycerol stereoisomer. Both flippase and Atp8a1 activities are insensitive to the stereochemistry of the serine headgroup. Most modifications of the PS headgroup structure decrease recognition by the plasma membrane PS flippase. Activation of Atp8a1 is also reduced by these modifications; phosphatidylserine-O-methyl ester, lysophosphatidylserine, glycerophosphoserine, and phosphoserine, which are not transported by the plasma membrane flippase, do not activate Atp8a1. Weakly translocated lipids (PE, phosphatidylhydroxypropionate, and phosphatidylhomoserine) are also weak Atp8a1 activators. However, N-methyl-phosphatidylserine, which is transported by the plasma membrane flippase at a rate equivalent to PS, is incapable of activating Atp8a1 activity. These results indicate that the ATPase activity of the secretory granule Atp8a1 is activated by phospholipids binding to a specific site whose properties (PS selectivity, dependence upon glycerol but not serine, stereochemistry, and vanadate sensitivity) are similar to, but distinct from, the properties of the substrate binding site of the plasma membrane flippase.""") ```
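`setPredictionThreshold(0.9)` discards any relation whose confidence falls below 0.9. The same post-filter can be sketched over plain rows — the sample rows below are illustrative (the first two echo values from the Results section; the third confidence is made up to fall under the threshold):

```python
rows = [
    {"relation": "ACTIVATOR", "chunk1": "murine P4-ATPase", "chunk2": "Atp8a1", "confidence": 0.95415354},
    {"relation": "SUBSTRATE", "chunk1": "flippase", "chunk2": "PS", "confidence": 0.9991992},
    {"relation": "PART-OF", "chunk1": "flippase", "chunk2": "serine", "confidence": 0.88},
]

def apply_threshold(rows, threshold=0.9):
    """Keep only relation predictions at or above the confidence threshold."""
    return [r for r in rows if r["confidence"] >= threshold]

print([r["relation"] for r in apply_threshold(rows)])  # ['ACTIVATOR', 'SUBSTRATE']
```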
## Results ```bash +---------+-----------------+-------------+-----------+--------------------+-----------------+-------------+-----------+--------------------+----------+ | relation| entity1|entity1_begin|entity1_end| chunk1| entity2|entity2_begin|entity2_end| chunk2|confidence| +---------+-----------------+-------------+-----------+--------------------+-----------------+-------------+-----------+--------------------+----------+ |ACTIVATOR| GENE| 33| 48| murine P4-ATPase| GENE| 50| 55| Atp8a1|0.95415354| |ACTIVATOR| GENE| 50| 55| Atp8a1| GENE| 58| 66| ATPase II| 0.9600417| |SUBSTRATE| CHEMICAL| 114| 131| phosphatidylserine|GENE_AND_CHEMICAL| 224| 248|ATP-dependent tra...| 0.9931178| |SUBSTRATE| CHEMICAL| 134| 135| PS|GENE_AND_CHEMICAL| 224| 248|ATP-dependent tra...| 0.9978284| |SUBSTRATE|GENE_AND_CHEMICAL| 256| 282|aminophospholipid...| CHEMICAL| 308| 309| PS| 0.9968598| |SUBSTRATE| GENE| 443| 450| flippase| CHEMICAL| 589| 590| PS| 0.9991992| |ACTIVATOR| CHEMICAL| 1201| 1219| phosphatidylcholine| CHEMICAL| 1222| 1238| phosphatidic acid|0.96227807| |ACTIVATOR| CHEMICAL| 1244| 1263|phosphatidylinositol| CHEMICAL| 1292| 1311|phosphatidylglycerol|0.93301487| |ACTIVATOR| CHEMICAL| 1244| 1263|phosphatidylinositol| CHEMICAL| 1316| 1339|phosphatidylethan...|0.93579245| |ACTIVATOR| CHEMICAL| 1292| 1311|phosphatidylglycerol| CHEMICAL| 1316| 1339|phosphatidylethan...| 0.9583067| |ACTIVATOR| CHEMICAL| 1292| 1311|phosphatidylglycerol| CHEMICAL| 1342| 1343| PE| 0.9603738| |ACTIVATOR| CHEMICAL| 1316| 1339|phosphatidylethan...| CHEMICAL| 1342| 1343| PE| 0.9596611| |ACTIVATOR| CHEMICAL| 1316| 1339|phosphatidylethan...| CHEMICAL| 1377| 1378| PS| 0.9832381| |ACTIVATOR| CHEMICAL| 1342| 1343| PE| CHEMICAL| 1377| 1378| PS| 0.981709| |ACTIVATOR| GENE| 1511| 1516| Atp8a1| CHEMICAL| 1563| 1577| sn-1,2-glycerol|0.99146277| |ACTIVATOR| GENE| 1511| 1516| Atp8a1| CHEMICAL| 1589| 1590| PS| 0.9842391| |ACTIVATOR| GENE| 1511| 1516| Atp8a1| CHEMICAL| 1604| 1618| sn-2,3-glycerol|0.98676455| | 
PART-OF| GENE| 1639| 1646| flippase| CHEMICAL| 1716| 1721| serine| 0.9470919|
|SUBSTRATE| CHEMICAL| 1936| 1957|lysophosphatidyls...| GENE| 2050| 2057| flippase|0.98919815|
|SUBSTRATE| CHEMICAL| 1960| 1979|glycerophosphoserine| GENE| 2050| 2057| flippase| 0.9857248|
+---------+-----------------+-------------+-----------+--------------------+-----------------+-------------+-----------+--------------------+----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|redl_drugprot_biobert|
|Compatibility:|Healthcare NLP 4.2.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|401.7 MB|

## References

This model was trained on the DrugProt corpus.

## Benchmarking

```bash
label                   recall  precision  f1     support
ACTIVATOR               0.885   0.776      0.827  235
AGONIST                 0.810   0.925      0.864  137
ANTAGONIST              0.970   0.919      0.944  199
DIRECT-REGULATOR        0.836   0.901      0.867  403
INDIRECT-DOWNREGULATOR  0.885   0.850      0.867  313
INDIRECT-UPREGULATOR    0.844   0.887      0.865  270
INHIBITOR               0.947   0.937      0.942  1083
PART-OF                 0.939   0.889      0.913  247
PRODUCT-OF              0.697   0.953      0.805  145
SUBSTRATE               0.912   0.884      0.898  468
Avg                     0.873   0.892      0.879  -
Weighted-Avg            0.897   0.899      0.897  -
```

---
layout: model
title: Chinese Bert Embeddings (from qinluo)
author: John Snow Labs
name: bert_embeddings_wobert_chinese_plus
date: 2022-04-11
tags: [bert, embeddings, zh, open_source]
task: Embeddings
language: zh
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `wobert-chinese-plus` is a Chinese model originally trained by `qinluo`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_wobert_chinese_plus_zh_3.4.2_3.0_1649669941004.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_wobert_chinese_plus_zh_3.4.2_3.0_1649669941004.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_wobert_chinese_plus","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_wobert_chinese_plus","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.wobert_chinese_plus").predict("""I love Spark NLP""") ```
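Vectors from the `embeddings` column are typically compared with cosine similarity. A minimal, dependency-free sketch — the three-dimensional toy vectors are stand-ins for the model's real, much higher-dimensional outputs:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [0.1, 0.3, 0.5]
v2 = [0.2, 0.6, 1.0]   # same direction as v1, so similarity is 1.0
v3 = [0.5, -0.3, 0.1]
print(round(cosine_similarity(v1, v2), 6))
```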
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_wobert_chinese_plus| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|467.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/qinluo/wobert-chinese-plus - https://github.com/ZhuiyiTechnology/WoBERT - https://github.com/JunnYu/WoBERT_pytorch --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from Evelyn18) author: John Snow Labs name: roberta_qa_base_spanish_squades_becas1 date: 2023-01-20 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-squades-becas1` is a Spanish model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becas1_es_4.3.0_3.0_1674217912605.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becas1_es_4.3.0_3.0_1674217912605.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becas1","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becas1","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
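Under the hood, extractive QA heads score every candidate start and end token and keep the best-scoring span from the context. The toy sketch below illustrates that selection rule only; the tokens and logit values are invented, and this is not the annotator's actual implementation:

```python
def best_span(tokens, start_logits, end_logits, max_len=8):
    # Pick (i, j) maximizing start_logits[i] + end_logits[j] with i <= j
    # and a bounded span length, then return the covered tokens.
    best, best_score = (0, 0), float("-inf")
    for i in range(len(tokens)):
        for j in range(i, min(i + max_len, len(tokens))):
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best_score, best = score, (i, j)
    i, j = best
    return " ".join(tokens[i:j + 1])

tokens = ["My", "name", "is", "Clara"]
start = [0.1, 0.2, 0.3, 2.5]  # made-up start scores
end = [0.0, 0.1, 0.2, 3.0]    # made-up end scores
print(best_span(tokens, start, end))  # → "Clara"
```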
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_spanish_squades_becas1|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|460.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/Evelyn18/roberta-base-spanish-squades-becas1

---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_urdu_proj TFWav2Vec2ForCTC from MSaudTahir
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xls_r_300m_urdu_proj
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_urdu_proj` is an English model originally trained by MSaudTahir.

NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_urdu_proj_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_urdu_proj_en_4.2.0_3.0_1664102013221.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_urdu_proj_en_4.2.0_3.0_1664102013221.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_urdu_proj', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_urdu_proj", lang = "en") val annotations = pipeline.transform(audioDF) ```
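The snippet above assumes an `audioDF` holding raw audio as arrays of floats. One way to build those floats from a mono 16-bit PCM WAV file is sketched below using only the Python standard library; the file name and the exact DataFrame schema expected by AudioAssembler are illustrative assumptions, not prescribed by the pipeline:

```python
import struct
import wave

def wav_to_floats(path):
    # Read a mono 16-bit PCM WAV file and normalize samples to [-1.0, 1.0],
    # the raw-float representation ASR annotators work on.
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2 and wav.getnchannels() == 1
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Hypothetical usage with a Spark session already available:
# floats = wav_to_floats("speech.wav")
# audioDF = spark.createDataFrame([[floats]], ["audio_content"])
```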
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_urdu_proj|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: Legal Politics And Public Safety Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_politics_and_public_safety_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, politics_and_public_safety, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.

The legclf_politics_and_public_safety_bert model is a Bert Sentence Embeddings Document Classifier that classifies whether a given document belongs to the class Politics_and_Public_Safety or not (Binary Classification) according to EuroVoc labels.

## Predicted Entities

`Politics_and_Public_Safety`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_politics_and_public_safety_bert_en_1.0.0_3.0_1678111789915.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_politics_and_public_safety_bert_en_1.0.0_3.0_1678111789915.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_politics_and_public_safety_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+----------------------------+
|result                      |
+----------------------------+
|[Politics_and_Public_Safety]|
|[Other]                     |
|[Other]                     |
|[Politics_and_Public_Safety]|
+----------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_politics_and_public_safety_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.5 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
                     label  precision  recall  f1-score  support
                     Other       0.88    0.90      0.89       84
Politics_and_Public_Safety       0.90    0.88      0.89       83
                  accuracy          -       -      0.89      167
                 macro-avg       0.89    0.89      0.89      167
              weighted-avg       0.89    0.89      0.89      167
```

---
layout: model
title: Legal Voting Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_voting_agreement_bert
date: 2022-11-25
tags: [en, legal, classification, agreement, voting, licensed, bert]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The `legclf_voting_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `voting-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time.

## Predicted Entities

`voting-agreement`, `other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_voting_agreement_bert_en_1.0.0_3.0_1669372485272.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_voting_agreement_bert_en_1.0.0_3.0_1669372485272.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_voting_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+------------------+
|result            |
+------------------+
|[voting-agreement]|
|[other]           |
|[other]           |
|[voting-agreement]|
+------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_voting_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|

## References

Legal documents, scraped from the Internet, and classified in-house + SEC documents

## Benchmarking

```bash
           label  precision  recall  f1-score  support
           other       0.97    0.98      0.98       65
voting-agreement       0.97    0.94      0.95       33
        accuracy          -       -      0.97       98
       macro-avg       0.97    0.96      0.97       98
    weighted-avg       0.97    0.97      0.97       98
```

---
layout: model
title: Fast Neural Machine Translation Model from English to Hiri Motu
author: John Snow Labs
name: opus_mt_en_ho
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, ho, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `en` - target languages: `ho` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ho_xx_2.7.0_2.4_1609169379532.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ho_xx_2.7.0_2.4_1609169379532.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_ho", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_ho", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.ho').predict(text, output_level='sentence')
opus_df
```
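MarianTransformer translates one sentence at a time, which is why a sentence detector sits in front of it in the pipeline. As a rough illustration of that splitting step, here is a naive regex stand-in for SentenceDetectorDLModel (the real annotator is a trained model and handles far more cases):

```python
import re

def naive_sentence_split(text):
    # Crude stand-in for a sentence detector: split after ., ! or ?
    # when followed by whitespace. Real detectors handle abbreviations etc.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(naive_sentence_split("Hello there. How are you? Fine!"))
# → ['Hello there.', 'How are you?', 'Fine!']
```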
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_en_ho|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: ALBERT Embeddings (XXLarge Uncased)
author: John Snow Labs
name: albert_xxlarge_uncased
date: 2020-04-28
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [embeddings, en, open_source]
supported: true
annotator: AlBertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

ALBERT is "A Lite" version of BERT, a popular unsupervised language representation learning algorithm. ALBERT uses parameter-reduction techniques that allow for large-scale configurations, overcome previous memory limitations, and achieve better behavior with respect to model degradation. The details are described in the paper "[ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.](https://arxiv.org/abs/1909.11942)"

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_xxlarge_uncased_en_2.5.0_2.4_1588073588232.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_xxlarge_uncased_en_2.5.0_2.4_1588073588232.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = AlbertEmbeddings.pretrained("albert_xxlarge_uncased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = AlbertEmbeddings.pretrained("albert_xxlarge_uncased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.albert.xxlarge_uncased').predict(text, output_level='token') embeddings_df ```
{:.h2_title}
## Results

```bash
token   en_embed_albert_xxlarge_uncased_embeddings
I       [-0.07972775399684906, 0.06297606974840164, 0....
love    [-0.07597140967845917, 0.05237535387277603, 0....
NLP     [0.005398618057370186, -0.0253510233014822, 0....
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|albert_xxlarge_uncased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.5.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|[en]|
|Dimension:|1024|
|Case sensitive:|false|

{:.h2_title}
## Data Source

The model is imported from [https://tfhub.dev/google/albert_xxlarge/3](https://tfhub.dev/google/albert_xxlarge/3)

---
layout: model
title: Explain Document Pipeline for Portuguese
author: John Snow Labs
name: explain_document_sm
date: 2021-03-22
tags: [open_source, portuguese, explain_document_sm, pipeline, pt]
supported: true
task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging]
language: pt
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The explain_document_sm is a pretrained pipeline that we can use to process text with a simple sequence of basic processing steps.
It performs most of the common text processing tasks on your dataframe.

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_pt_3.0.0_3.0_1616422933551.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_sm_pt_3.0.0_3.0_1616422933551.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('explain_document_sm', lang = 'pt')
annotations = pipeline.fullAnnotate("Olá de John Snow Labs! ")[0]
annotations.keys()
```

```scala
val pipeline = new PretrainedPipeline("explain_document_sm", lang = "pt")
val result = pipeline.fullAnnotate("Olá de John Snow Labs! ")(0)
```

{:.nlu-block}
```python
import nlu
text = ["Olá de John Snow Labs! "]
result_df = nlu.load('pt.explain').predict(text)
result_df
```
## Results

```bash
|    | document                    | sentence                   | token                                  | lemma                                  | pos                                         | embeddings                   | ner                                   | entities            |
|---:|:----------------------------|:---------------------------|:---------------------------------------|:---------------------------------------|:--------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------|
|  0 | ['Olá de John Snow Labs! '] | ['Olá de John Snow Labs!'] | ['Olá', 'de', 'John', 'Snow', 'Labs!'] | ['Olá', 'de', 'John', 'Snow', 'Labs!'] | ['PROPN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.0, 0.0, 0.0, 0.0,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|explain_document_sm|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|pt|

---
layout: model
title: Legal Land Transport Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_land_transport_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, land_transport, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.

The legclf_land_transport_bert model is a Bert Sentence Embeddings Document Classifier that classifies whether a given document belongs to the class Land_Transport or not (Binary Classification) according to EuroVoc labels.
## Predicted Entities `Land_Transport`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_land_transport_bert_en_1.0.0_3.0_1678111683794.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_land_transport_bert_en_1.0.0_3.0_1678111683794.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_land_transport_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+----------------+
|result          |
+----------------+
|[Land_Transport]|
|[Other]         |
|[Other]         |
|[Land_Transport]|
+----------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_land_transport_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.5 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
         label  precision  recall  f1-score  support
Land_Transport       0.87    0.92      0.89       97
         Other       0.92    0.88      0.90      104
      accuracy          -       -      0.90      201
     macro-avg       0.90    0.90      0.90      201
  weighted-avg       0.90    0.90      0.90      201
```

---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab1_by_tahazakir TFWav2Vec2ForCTC from tahazakir
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_colab1_by_tahazakir
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab1_by_tahazakir` is an English model originally trained by tahazakir.
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab1_by_tahazakir_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab1_by_tahazakir_en_4.2.0_3.0_1664038802857.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab1_by_tahazakir_en_4.2.0_3.0_1664038802857.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab1_by_tahazakir', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab1_by_tahazakir", lang = "en") val annotations = pipeline.transform(audioDF) ```
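The Wav2Vec2ForCTC stage inside this pipeline emits one label per audio frame, which CTC decoding then collapses into text. A minimal sketch of greedy CTC collapse, using a toy vocabulary and invented frame ids (the model's real vocabulary and decoding differ):

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    # CTC collapse: merge consecutive repeated ids, then drop blanks.
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(id_to_char[i])
        prev = i
    return "".join(out)

vocab = {1: "c", 2: "a", 3: "t"}
# Per-frame argmax ids from an acoustic model; 0 is the CTC blank.
print(ctc_greedy_decode([1, 1, 0, 2, 2, 0, 3], vocab))  # → "cat"
```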
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab1_by_tahazakir|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|355.0 MB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: Legal Litigations Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_litigations_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, litigations, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from the US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

This model is a Binary Classifier (True, False) for the `Litigations` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.

If you have big legal documents, and you want to look for clauses, we recommend that you split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`Litigations`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_litigations_bert_en_1.0.0_3.0_1678049988540.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_litigations_bert_en_1.0.0_3.0_1678049988540.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_litigations_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
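The description above recommends paragraph splitting (by multiline) before classification, so each provision is classified on its own. A minimal sketch of that splitting step in plain Python; the section headings are invented examples, and the workshop notebook linked above shows the full techniques:

```python
import re

def split_paragraphs(text):
    # "Paragraph splitting (by multiline)": break a long legal document
    # on blank lines so each provision can be classified separately.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Section 1. Governing Law...\n\nSection 2. Litigations...\n\nSection 3. Notices..."
print(split_paragraphs(doc))  # three separate provisions
```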
## Results

```bash
+-------------+
|result       |
+-------------+
|[Litigations]|
|[Other]      |
|[Other]      |
|[Litigations]|
+-------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_litigations_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
       label  precision  recall  f1-score  support
 Litigations       0.99    0.97      0.98      125
       Other       0.97    0.99      0.98      150
    accuracy          -       -      0.98      275
   macro-avg       0.98    0.98      0.98      275
weighted-avg       0.98    0.98      0.98      275
```

---
layout: model
title: Dutch BertForMaskedLM Base Cased model (from Geotrend)
author: John Snow Labs
name: bert_embeddings_base_nl_cased
date: 2022-12-02
tags: [nl, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: nl
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-nl-cased` is a Dutch model originally trained by `Geotrend`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_nl_cased_nl_4.2.4_3.0_1670018595773.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_nl_cased_nl_4.2.4_3.0_1670018595773.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_nl_cased","nl") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_nl_cased","nl")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_nl_cased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|nl|
|Size:|391.2 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/Geotrend/bert-base-nl-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers

---
layout: model
title: Legal Satisfaction And Discharge Of Indenture Clause Binary Classifier
author: John Snow Labs
name: legclf_satisfaction_and_discharge_of_indenture_clause
date: 2023-01-27
tags: [en, legal, classification, satisfaction, discharge, indenture, clauses, satisfaction_and_discharge_of_indenture, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `satisfaction-and-discharge-of-indenture` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.

If you have big legal documents, and you want to look for clauses, we recommend that you split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`satisfaction-and-discharge-of-indenture`, `other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_satisfaction_and_discharge_of_indenture_clause_en_1.0.0_3.0_1674821475929.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_satisfaction_and_discharge_of_indenture_clause_en_1.0.0_3.0_1674821475929.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_satisfaction_and_discharge_of_indenture_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-----------------------------------------+
|result                                   |
+-----------------------------------------+
|[satisfaction-and-discharge-of-indenture]|
|[other]                                  |
|[other]                                  |
|[satisfaction-and-discharge-of-indenture]|
+-----------------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_satisfaction_and_discharge_of_indenture_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
                                  label  precision  recall  f1-score  support
                                  other       0.95    0.97      0.96       36
satisfaction-and-discharge-of-indenture       0.97    0.94      0.95       31
                               accuracy          -       -      0.96       67
                              macro-avg       0.96    0.95      0.95       67
                           weighted-avg       0.96    0.96      0.96       67
```

---
layout: model
title: Resolve Tickers to Company Names
author: John Snow Labs
name: finel_tickers2names
date: 2022-09-09
tags: [en, finance, companies, tickers, nasdaq, licensed]
task: Entity Resolution
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is an Entity Resolution / Entity Linking model, which is able to provide Company Names given their Ticker / Trading Symbols. You can use any NER which extracts Tickers, then send the output to this Entity Linking model to get the Company Name.

## Predicted Entities

{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/financial_company_normalization){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finel_tickers2names_en_1.0.0_3.2_1662733866127.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finel_tickers2names_en_1.0.0_3.2_1662733866127.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("ner_chunk")

embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
    .setInputCols("ner_chunk") \
    .setOutputCol("sentence_embeddings")

resolver = finance.SentenceEntityResolverModel.pretrained("finel_tickers2names", "en", "finance/models") \
    .setInputCols(["ner_chunk", "sentence_embeddings"]) \
    .setOutputCol("name")\
    .setDistanceFunction("EUCLIDEAN")

pipeline = nlp.Pipeline(
    stages = [
        documentAssembler,
        embeddings,
        resolver])

# LightPipeline requires a fitted PipelineModel, so fit on an empty DataFrame first
pipelineModel = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

lp = LightPipeline(pipelineModel)

lp.fullAnnotate("unit")
```
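The resolver returns a list of candidate resolutions with their distances (see the Results section). A small pure-Python sketch of picking the closest candidate — the candidate list here is illustrative sample data, not live resolver output:

```python
# Illustrative post-processing: pick the resolution with the smallest distance.
# Candidates and distances mirror the sample in the Results table; in real use
# they come from the "all_k_resolutions" / "all_k_distances" metadata fields.

candidates = ["UNITI GROUP INC.", "Uniti Group INC.", "Uniti Group Incorporated"]
distances = [0.0, 0.0, 0.0]

# min() keeps the first candidate on ties, preserving the resolver's ranking
best = min(zip(candidates, distances), key=lambda cd: cd[1])[0]
print(best)
```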
## Results ```bash +-------+--------------------+-----------------------------------------------------------------+----------------------------------------------------------------+---------------------------+ | chunk| code | all_codes| resolutions | all_distances| +-------+--------------------+-----------------------------------------------------------------+----------------------------------------------------------------+---------------------------+ | unit | UNITI GROUP INC. | [UNITI GROUP INC., Uniti Group INC. , Uniti Group Incorporated] |[UNITI GROUP INC., Uniti Group INC. , Uniti Group Incorporated] | [0.0000, 0.0000, 0.0000] | +-------+--------------------+-----------------------------------------------------------------+----------------------------------------------------------------+---------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finel_tickers2names| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[org_company_name]| |Language:|en| |Size:|8.5 MB| |Case sensitive:|false| ## References https://data.world/johnsnowlabs/list-of-companies-in-nasdaq-exchanges --- layout: model title: Pipeline to Detect Clinical Events author: John Snow Labs name: ner_events_clinical_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_events_clinical](https://nlp.johnsnowlabs.com/2021/03/31/ner_events_clinical_en.html) model. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_events_clinical_pipeline_en_3.4.1_3.0_1647873847549.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_events_clinical_pipeline_en_3.4.1_3.0_1647873847549.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_events_clinical_pipeline", "en", "clinical/models") pipeline.annotate("The patient presented to the emergency room last evening") ``` ```scala val pipeline = new PretrainedPipeline("ner_events_clinical_pipeline", "en", "clinical/models") pipeline.annotate("The patient presented to the emergency room last evening") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.events_clinical.pipeline").predict("""The patient presented to the emergency room last evening""") ```
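`annotate()` returns a plain dict mapping output column names to lists of strings. A minimal sketch of pairing the extracted chunks with their NER labels — the column names and values below are illustrative assumptions mirroring the Results table, not live pipeline output:

```python
# Hand-written sample of an annotate() result (column names are assumptions;
# check pipeline.model.stages for the actual output columns of your pipeline).
sample = {
    "ner_chunks": ["presented", "the emergency room", "last evening"],
    "ner_labels": ["OCCURRENCE", "CLINICAL_DEPT", "DATE"],
}

# Chunks and labels are parallel lists, so zip() lines them up.
pairs = list(zip(sample["ner_chunks"], sample["ner_labels"]))
for chunk, label in pairs:
    print(f"{chunk} -> {label}")
```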
## Results ```bash +------------------+-------------+ |chunk |ner_label | +------------------+-------------+ |presented |OCCURRENCE | |the emergency room|CLINICAL_DEPT| |last evening |DATE | +------------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_events_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: BERT Sequence Classification - Identify Antisemitic texts author: John Snow Labs name: bert_sequence_classifier_antisemitism date: 2021-11-06 tags: [en, open_source] task: Text Classification language: en nav_key: models edition: Spark NLP 3.3.2 spark_version: 2.4 supported: true annotator: BertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is imported from `Hugging Face-models` and it was trained on 4K tweets, where ~50% were labeled as antisemitic. The model identifies if the text is antisemitic or not. - `1` : Antisemitic - `0` : Non-antisemitic ## Predicted Entities `1`, `0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_antisemitism_en_3.3.2_2.4_1636196636003.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_antisemitism_en_3.3.2_2.4_1636196636003.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

sequenceClassifier = BertForSequenceClassification \
    .pretrained('bert_sequence_classifier_antisemitism', 'en') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('class')

pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier])

example = spark.createDataFrame([["The Jews have too much power!"]]).toDF("text")

result = pipeline.fit(example).transform(example)
```

```scala
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_antisemitism", "en")
  .setInputCols("document", "token")
  .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))

val example = Seq("The Jews have too much power!").toDS.toDF("text")

val result = pipeline.fit(example).transform(example)
```
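The model outputs the string labels `1` and `0`. A small post-processing sketch mapping them to the readable names given in the Description above:

```python
# Label meanings are taken from the model card's Description section.
LABELS = {"1": "antisemitic", "0": "non-antisemitic"}

def readable(predictions):
    """Map raw class labels to human-readable names."""
    return [LABELS[p] for p in predictions]

print(readable(["1", "0"]))  # ['antisemitic', 'non-antisemitic']
```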
## Results

```bash
['1']
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_antisemitism|
|Compatibility:|Spark NLP 3.3.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token, sentence]|
|Output Labels:|[label]|
|Language:|en|
|Case sensitive:|true|

## Data Source

[https://huggingface.co/astarostap/autonlp-antisemitism-2-21194454](https://huggingface.co/astarostap/autonlp-antisemitism-2-21194454)

---
layout: model
title: English DistilBertForTokenClassification Base Uncased model (from Datasaur)
author: John Snow Labs
name: distilbert_token_classifier_base_uncased_finetuned_conll2003
date: 2023-03-03
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-conll2003` is an English model originally trained by `Datasaur`.

## Predicted Entities

`LOC`, `ORG`, `PER`, `MISC`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_uncased_finetuned_conll2003_en_4.3.1_3.0_1677881552803.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_uncased_finetuned_conll2003_en_4.3.1_3.0_1677881552803.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_uncased_finetuned_conll2003","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_uncased_finetuned_conll2003","en")
  .setInputCols(Array("document", "token"))
  .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_base_uncased_finetuned_conll2003|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/Datasaur/distilbert-base-uncased-finetuned-conll2003

---
layout: model
title: English RobertaForQuestionAnswering (from eAsyle)
author: John Snow Labs
name: roberta_qa_roberta_base_custom_QA
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta_base_custom_QA` is an English model originally trained by `eAsyle`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_custom_QA_en_4.0.0_3.0_1655738945694.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_custom_QA_en_4.0.0_3.0_1655738945694.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_custom_QA","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```

```scala
val document = new MultiDocumentAssembler()
  .setInputCols("question", "context")
  .setOutputCols("document_question", "document_context")

val spanClassifier = RoBertaForQuestionAnswering
  .pretrained("roberta_qa_roberta_base_custom_QA","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
  ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
  ("What's my name?", "My name is Clara and I live in Berkeley."))
  .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_custom_QA|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|424.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/eAsyle/roberta_base_custom_QA

---
layout: model
title: Financial NER (sm, Small)
author: John Snow Labs
name: finner_financial_small
date: 2022-10-19
tags: [en, finance, ner, annual, reports, 10k, filings, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: FinanceNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is a `sm` (small) version of a financial model, trained with more generic labels than the other versions of the model (`md`, `lg`, ...) you can find in Models Hub. Please note this model requires some tokenization configuration to extract the currency (see the Python snippet below).

The aim of this model is to detect the main pieces of financial information in the annual reports of companies; more specifically, this model was trained on 10-K filings. 
The currently available entities are: - AMOUNT: Numeric amounts, not percentages - PERCENTAGE: Numeric amounts which are percentages - CURRENCY: The currency of the amount - FISCAL_YEAR: A date which expresses which month the fiscal exercise was closed for a specific year - DATE: Generic dates in context where either it's not a fiscal year or it can't be asserted as such given the context - PROFIT: Profit or also Revenue - PROFIT_INCREASE: A piece of information saying there was a profit / revenue increase in that fiscal year - PROFIT_DECLINE: A piece of information saying there was a profit / revenue decrease in that fiscal year - EXPENSE: An expense or loss - EXPENSE_INCREASE: A piece of information saying there was an expense increase in that fiscal year - EXPENSE_DECREASE: A piece of information saying there was an expense decrease in that fiscal year You can also check for the Relation Extraction model which connects these entities together ## Predicted Entities `AMOUNT`, `CURRENCY`, `DATE`, `FISCAL_YEAR`, `PERCENTAGE`, `EXPENSE`, `EXPENSE_INCREASE`, `EXPENSE_DECREASE`, `PROFIT`, `PROFIT_INCREASE`, `PROFIT_DECLINE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FINNER_FINANCIAL_10K/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_financial_small_en_1.0.0_3.0_1666185056018.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_financial_small_en_1.0.0_3.0_1666185056018.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from pyspark.sql import functions as F

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")\
    .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$', '€'])

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")\
    .setInputCols("sentence", "token")\
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)

ner_model = finance.NerModel.pretrained("finner_financial_small", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
])

data = spark.createDataFrame([["""License fees revenue decreased 40 %, or $ 0.5 million to $ 0.7 million for the year ended December 31, 2020 compared to $ 1.2 million for the year ended December 31, 2019. Services revenue increased 4 %, or $ 1.1 million, to $ 25.6 million for the year ended December 31, 2020 from $ 24.5 million for the year ended December 31, 2019. Costs of revenue, excluding depreciation and amortization increased by $ 0.1 million, or 2 %, to $ 8.8 million for the year ended December 31, 2020 from $ 8.7 million for the year ended December 31, 2019. The increase was primarily related to increase in internal staff costs of $ 1.1 million as we increased delivery staff and work performed on internal projects, partially offset by a decrease in third party consultant costs of $ 0.6 million as these were converted to internal staff or terminated. 
Also, a decrease in travel costs of $ 0.4 million due to travel restrictions caused by the global pandemic. As a percentage of revenue, cost of revenue, excluding depreciation and amortization was 34 % for each of the years ended December 31, 2020 and 2019. Sales and marketing expenses decreased 20 %, or $ 1.5 million, to $ 6.0 million for the year ended December 31, 2020 from $ 7.5 million for the year ended December 31, 2019."""]]).toDF("text") model = pipeline.fit(data) result = model.transform(data) result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \ .select(F.expr("cols['0']").alias("text"), F.expr("cols['1']['entity']").alias("label")).show(200, truncate = False) ```
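The model predicts `CURRENCY` and `AMOUNT` as separate chunks. A pure-Python post-processing sketch that joins a currency symbol with the amount that immediately follows it — the data is illustrative, and the merged `MONEY` label is an assumption of this sketch, not a model output:

```python
def merge_currency_amounts(entities):
    """entities: list of (text, label) tuples in document order.
    Merges each CURRENCY chunk with the AMOUNT chunk right after it,
    tagging the merged span with a hypothetical MONEY label."""
    merged, i = [], 0
    while i < len(entities):
        text, label = entities[i]
        if label == "CURRENCY" and i + 1 < len(entities) and entities[i + 1][1] == "AMOUNT":
            merged.append((f"{text} {entities[i + 1][0]}", "MONEY"))
            i += 2
        else:
            merged.append((text, label))
            i += 1
    return merged

# Illustrative chunks mirroring the Results table:
sample = [("40", "PERCENTAGE"), ("$", "CURRENCY"), ("0.5 million", "AMOUNT")]
print(merge_currency_amounts(sample))
```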
## Results ```bash +---------------------------------------------------------+----------------+ |text |label | +---------------------------------------------------------+----------------+ |License fees revenue |PROFIT_DECLINE | |40 |PERCENTAGE | |$ |CURRENCY | |0.5 million |AMOUNT | |$ |CURRENCY | |0.7 million |AMOUNT | |December 31, 2020 |FISCAL_YEAR | |$ |CURRENCY | |1.2 million |AMOUNT | |December 31, 2019 |FISCAL_YEAR | |Services revenue |PROFIT_INCREASE | |4 |PERCENTAGE | |$ |CURRENCY | |1.1 million |AMOUNT | |$ |CURRENCY | |25.6 million |AMOUNT | |December 31, 2020 |FISCAL_YEAR | |$ |CURRENCY | |24.5 million |AMOUNT | |December 31, 2019 |FISCAL_YEAR | |Costs of revenue, excluding depreciation and amortization|EXPENSE_INCREASE| |$ |CURRENCY | |0.1 million |AMOUNT | |2 |PERCENTAGE | |$ |CURRENCY | |8.8 million |AMOUNT | |December 31, 2020 |FISCAL_YEAR | |$ |CURRENCY | |8.7 million |AMOUNT | |December 31, 2019 |FISCAL_YEAR | |internal staff costs |EXPENSE_INCREASE| |$ |CURRENCY | |1.1 million |AMOUNT | |third party consultant costs |EXPENSE_DECREASE| |$ |CURRENCY | |0.6 million |AMOUNT | |travel costs |EXPENSE_DECREASE| |$ |CURRENCY | |0.4 million |AMOUNT | |cost of revenue, excluding depreciation and amortization |EXPENSE | |34 |PERCENTAGE | |December 31, 2020 |FISCAL_YEAR | |2019 |DATE | |Sales and marketing expenses |EXPENSE_DECREASE| |20 |PERCENTAGE | |$ |CURRENCY | |1.5 million |AMOUNT | |$ |CURRENCY | |6.0 million |AMOUNT | |December 31, 2020 |FISCAL_YEAR | |$ |CURRENCY | |7.5 million |AMOUNT | |December 31, 2019 |FISCAL_YEAR | +---------------------------------------------------------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_financial_small| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.5 MB| ## References Manual annotations on 10-K Filings ## Benchmarking ```bash label 
tp fp fn prec rec f1 I-AMOUNT 163 3 1 0.9819277 0.99390244 0.9878788 B-AMOUNT 233 1 0 0.99572647 1.0 0.99785864 B-DATE 288 8 6 0.972973 0.97959185 0.9762712 I-DATE 292 8 13 0.97333336 0.9573771 0.9652893 I-EXPENSE 16 10 9 0.61538464 0.64 0.62745094 B-PROFIT_INCREASE 17 5 7 0.77272725 0.7083333 0.73913044 B-EXPENSE 9 5 10 0.64285713 0.47368422 0.5454545 I-PROFIT_DECLINE 21 4 6 0.84 0.7777778 0.8076922 I-PROFIT 15 4 14 0.7894737 0.51724136 0.625 B-CURRENCY 232 1 0 0.99570817 1.0 0.99784946 I-PROFIT_INCREASE 18 3 8 0.85714287 0.6923077 0.7659574 B-PROFIT 13 6 14 0.68421054 0.4814815 0.5652174 B-PERCENTAGE 59 0 0 1.0 1.0 1.0 I-FISCAL_YEAR 231 9 1 0.9625 0.99568963 0.9788135 B-PROFIT_DECLINE 12 3 2 0.8 0.85714287 0.82758623 B-EXPENSE_INCREASE 32 3 9 0.9142857 0.7804878 0.84210527 B-EXPENSE_DECREASE 23 10 8 0.6969697 0.7419355 0.71874994 B-FISCAL_YEAR 77 3 0 0.9625 1.0 0.9808917 I-EXPENSE_DECREASE 43 17 13 0.71666664 0.76785713 0.7413793 I-EXPENSE_INCREASE 63 6 22 0.9130435 0.7411765 0.8181819 Macro-average 1857 109 143 0.85437155 0.8052994 0.82910997 Micro-average 1857 109 143 0.9445575 0.9285 0.9364599 ``` --- layout: model title: Pipeline to Detect Clinical Entities (BertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_jsl_pipeline date: 2023-06-07 tags: [licensed, en, clinical, ner, ner_jsl, bertfortokenclassification] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [bert_token_classifier_ner_jsl](https://nlp.johnsnowlabs.com/2023/05/04/bert_token_classifier_ner_jsl_en.html) model. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_pipeline_en_4.3.0_3.0_1686126034858.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_pipeline_en_4.3.0_3.0_1686126034858.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_jsl_pipeline", "en", "clinical/models") text = """The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""" result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_jsl_pipeline", "en", "clinical/models") val text = """The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. 
The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""" val result = pipeline.fullAnnotate(text) ```
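Each chunk returned by `fullAnnotate` carries a confidence score (shown in the Results table). A pure-Python sketch of dropping low-confidence chunks before downstream use — the dict shape and sample values below are illustrative, not live pipeline output:

```python
def filter_by_confidence(chunks, threshold=0.8):
    """chunks: list of dicts with 'result', 'label', and 'confidence' keys
    (a simplified stand-in for the annotation metadata). Keeps only chunks
    at or above the confidence threshold."""
    return [c for c in chunks if c["confidence"] >= threshold]

# Sample values taken from the Results table below:
sample = [
    {"result": "yellow", "label": "Symptom", "confidence": 0.476263},
    {"result": "nares", "label": "External_body_part_or_region", "confidence": 0.999152},
]

print(filter_by_confidence(sample))
```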
## Results ```bash | | ner_chunks | begin | end | ner_labels | confidence | |---:|:------------------------------------------|--------:|------:|:-----------------------------|-------------:| | 0 | 21-day-old | 17 | 26 | Age | 0.996622 | | 1 | Caucasian | 28 | 36 | Race_Ethnicity | 0.999759 | | 2 | male | 38 | 41 | Gender | 0.999847 | | 3 | 2 days | 52 | 57 | Duration | 0.818646 | | 4 | congestion | 62 | 71 | Symptom | 0.997344 | | 5 | mom | 75 | 77 | Gender | 0.999601 | | 6 | yellow | 99 | 104 | Symptom | 0.476263 | | 7 | discharge | 106 | 114 | Symptom | 0.704853 | | 8 | nares | 135 | 139 | External_body_part_or_region | 0.999152 | | 9 | she | 147 | 149 | Gender | 0.999927 | | 10 | mild | 168 | 171 | Modifier | 0.999674 | | 11 | problems with his breathing while feeding | 173 | 213 | Symptom | 0.995353 | | 12 | perioral cyanosis | 237 | 253 | Symptom | 0.99852 | | 13 | retractions | 258 | 268 | Symptom | 0.999806 | | 14 | One day ago | 272 | 282 | RelativeDate | 0.99949 | | 15 | mom | 285 | 287 | Gender | 0.999779 | | 16 | tactile temperature | 304 | 322 | Symptom | 0.997475 | | 17 | Tylenol | 345 | 351 | Drug_BrandName | 0.998978 | | 18 | Baby-girl | 354 | 362 | Age | 0.990654 | | 19 | decreased | 382 | 390 | Symptom | 0.996808 | | 20 | intake | 397 | 402 | Symptom | 0.983608 | | 21 | His | 405 | 407 | Gender | 0.999922 | | 22 | breast-feeding | 416 | 429 | External_body_part_or_region | 0.994421 | | 23 | 20 minutes | 444 | 453 | Duration | 0.992322 | | 24 | 5 to 10 minutes | 464 | 478 | Duration | 0.969913 | | 25 | his | 493 | 495 | Gender | 0.999908 | | 26 | respiratory congestion | 497 | 518 | Symptom | 0.995677 | | 27 | He | 521 | 522 | Gender | 0.999803 | | 28 | tired | 555 | 559 | Symptom | 0.999463 | | 29 | fussy | 574 | 578 | Symptom | 0.996514 | | 30 | over the past 2 days | 580 | 599 | RelativeDate | 0.998001 | | 31 | albuterol | 642 | 650 | Drug_Ingredient | 0.99964 | | 32 | ER | 676 | 677 | Clinical_Dept | 0.998161 | | 33 | His | 680 | 682 | Gender | 
0.999921 | | 34 | urine output has also decreased | 684 | 714 | Symptom | 0.971606 | | 35 | he | 726 | 727 | Gender | 0.999916 | | 36 | per 24 hours | 765 | 776 | Frequency | 0.910935 | | 37 | he | 783 | 784 | Gender | 0.999922 | | 38 | per 24 hours | 812 | 823 | Frequency | 0.921849 | | 39 | Mom | 826 | 828 | Gender | 0.999606 | | 40 | diarrhea | 841 | 848 | Symptom | 0.999849 | | 41 | His | 851 | 853 | Gender | 0.999739 | | 42 | bowel | 855 | 859 | Internal_organ_or_component | 0.999471 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_jsl_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|405.4 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Brazilian Portuguese NER for Laws (Bert, Base) author: John Snow Labs name: legner_br_bert_base date: 2022-09-28 tags: [pt, legal, ner, laws, licensed] task: Named Entity Recognition language: pt edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Deep Learning Portuguese Named Entity Recognition model for the legal domain, trained using Base Bert Embeddings, and is able to predict the following entities: - ORGANIZACAO (Organizations) - JURISPRUDENCIA (Jurisprudence) - PESSOA (Person) - TEMPO (Time) - LOCAL (Location) - LEGISLACAO (Laws) - O (Other) You can find different versions of this model in Models Hub: - With a Deep Learning architecture (non-transformer) and Base Embeddings; - With a Deep Learning architecture (non-transformer) and Large Embeddings; - With a Transformers Architecture and Base Embeddings; - With a Transformers Architecture and Large Embeddings; ## Predicted Entities `PESSOA`, `ORGANIZACAO`, 
`LEGISLACAO`, `JURISPRUDENCIA`, `TEMPO`, `LOCAL` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_LEGAL_PT/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_br_bert_base_pt_1.0.0_3.0_1664362186486.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_br_bert_base_pt_1.0.0_3.0_1664362186486.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
import pandas as pd

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token")

tokenClassifier = nlp.BertForTokenClassification.pretrained("legner_br_bert_base","pt", "legal/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("ner")

pipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        tokenClassifier])

example = spark.createDataFrame(pd.DataFrame({'text': ["""Mediante do exposto , com fundamento nos artigos 32 , i , e 33 , da lei 8.443/1992 , submetem-se os autos à consideração superior , com posterior encaminhamento ao ministério público junto ao tcu e ao gabinete do relator , propondo : a ) conhecer do recurso e , no mérito , negar-lhe provimento ; b ) comunicar ao recorrente , ao superior tribunal militar e ao tribunal regional federal da 2ª região , a fim de fornecer subsídios para os processos judiciais 2001.34.00.024796-9 e 2003.34.00.044227-3 ; e aos demais interessados a deliberação que vier a ser proferida por esta corte ” ."""]}))

result = pipeline.fit(example).transform(example)
```
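The classifier emits one BIO tag per token (as shown in the Results section). A minimal pure-Python sketch of grouping `(token, tag)` pairs back into entity chunks; the sample tokens are taken from the example sentence above:

```python
def bio_to_chunks(tagged_tokens):
    """Group (token, BIO-tag) pairs into (chunk_text, label) tuples.
    B- starts a chunk, a matching I- extends it, anything else closes it."""
    chunks, current_text, current_label = [], [], None
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):
            if current_text:
                chunks.append((" ".join(current_text), current_label))
            current_text, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_text.append(token)
        else:  # "O" or an inconsistent I- tag closes the current chunk
            if current_text:
                chunks.append((" ".join(current_text), current_label))
            current_text, current_label = [], None
    if current_text:
        chunks.append((" ".join(current_text), current_label))
    return chunks

sample = [("ao", "O"), ("ministério", "B-ORGANIZACAO"),
          ("público", "I-ORGANIZACAO"), ("junto", "O")]
print(bio_to_chunks(sample))
```

Spark NLP's `NerConverter` performs this grouping inside a pipeline; the sketch above is only for post-processing raw tag output by hand.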
## Results ```bash +-------------------+----------------+ | token| ner| +-------------------+----------------+ | diante| O| | do| O| | exposto| O| | ,| O| | com| O| | fundamento| O| | nos| O| | artigos| B-LEGISLACAO| | 32| I-LEGISLACAO| | ,| I-LEGISLACAO| | i| I-LEGISLACAO| | ,| I-LEGISLACAO| | e| I-LEGISLACAO| | 33| I-LEGISLACAO| | ,| I-LEGISLACAO| | da| I-LEGISLACAO| | lei| I-LEGISLACAO| | 8.443/1992| I-LEGISLACAO| | ,| O| | submetem-se| O| | os| O| | autos| O| | à| O| | consideração| O| | superior| O| | ,| O| | com| O| | posterior| O| | encaminhamento| O| | ao| O| | ministério| B-ORGANIZACAO| | público| I-ORGANIZACAO| | junto| O| | ao| O| | tcu| B-ORGANIZACAO| | e| O| | ao| O| | gabinete| O| | do| O| | relator| O| | ,| O| | propondo| O| | :| O| | a| O| | )| O| | conhecer| O| | do| O| | recurso| O| | e| O| | ,| O| | no| O| | mérito| O| | ,| O| | negar-lhe| O| | provimento| O| | ;| O| | b| O| | )| O| | comunicar| O| | ao| O| | recorrente| O| | ,| O| | ao| O| | superior| B-ORGANIZACAO| | tribunal| I-ORGANIZACAO| | militar| I-ORGANIZACAO| | e| O| | ao| O| | tribunal| B-ORGANIZACAO| | regional| I-ORGANIZACAO| | federal| I-ORGANIZACAO| | da| I-ORGANIZACAO| | 2ª| I-ORGANIZACAO| | região| I-ORGANIZACAO| | ,| O| | a| O| | fim| O| | de| O| | fornecer| O| | subsídios| O| | para| O| | os| O| | processos| O| | judiciais| O| |2001.34.00.024796-9|B-JURISPRUDENCIA| | e| O| |2003.34.00.044227-3|B-JURISPRUDENCIA| | ;| O| | e| O| | aos| O| | demais| O| | interessados| O| | a| O| | deliberação| O| | que| O| | vier| O| | a| O| | ser| O| | proferida| O| | por| O| | esta| O| | corte| O| | ”| O| | .| O| +-------------------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_br_bert_base| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|pt| |Size:|406.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References Original texts 
available at https://paperswithcode.com/sota?task=Token+Classification&dataset=lener_br, together with in-house data augmentation using weak labelling ## Benchmarking ```bash label precision recall f1-score support B-ORGANIZACAO 0.86 0.86 0.86 499 I-ORGANIZACAO 0.89 0.89 0.89 859 B-LEGISLACAO 0.94 0.94 0.94 373 I-LEGISLACAO 0.96 0.98 0.97 2235 B-JURISPRUDENCIA 0.76 0.54 0.63 183 I-JURISPRUDENCIA 0.87 0.79 0.83 475 B-TEMPO 0.92 0.61 0.74 192 I-TEMPO 0.90 0.93 0.91 68 B-PESSOA 0.93 0.96 0.95 231 I-PESSOA 0.96 0.99 0.97 494 B-LOCAL 0.78 0.81 0.79 47 I-LOCAL 0.59 0.74 0.66 85 micro-avg 0.91 0.91 0.91 5741 macro-avg 0.86 0.84 0.84 5741 weighted-avg 0.91 0.91 0.91 5741 ``` --- layout: model title: Legal Organisation Of Transport Document Classifier (EURLEX) author: John Snow Labs name: legclf_organisation_of_transport_bert date: 2023-03-06 tags: [en, legal, classification, clauses, organisation_of_transport, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_organisation_of_transport_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class Organisation_of_Transport or not (binary classification) according to the EuroVoc labels. 
## Predicted Entities `Organisation_of_Transport`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_organisation_of_transport_bert_en_1.0.0_3.0_1678111547219.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_organisation_of_transport_bert_en_1.0.0_3.0_1678111547219.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_organisation_of_transport_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
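Only a Python snippet is provided for this classifier; a Scala sketch of the same pipeline is shown below, following the Scala conventions of the other cards on this page. This is an illustration under the assumption that the licensed classifier is reachable through the standard Spark NLP Scala classes; the legal module's Scala namespace may differ.

```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")
  .setInputCols(Array("document"))
  .setOutputCol("sentence_embeddings")

// Licensed model: the third argument points at the "legal/models" bucket
val docClassifier = ClassifierDLModel.pretrained("legclf_organisation_of_transport_bert", "en", "legal/models")
  .setInputCols(Array("sentence_embeddings"))
  .setOutputCol("category")

val pipeline = new Pipeline().setStages(Array(documentAssembler, embeddings, docClassifier))

val data = Seq("YOUR TEXT HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```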
## Results ```bash +---------------------------+ |result| +---------------------------+ |[Organisation_of_Transport]| |[Other]| |[Other]| |[Organisation_of_Transport]| +---------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_organisation_of_transport_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Organisation_of_Transport 0.87 0.90 0.88 188 Other 0.89 0.86 0.88 184 accuracy - - 0.88 372 macro-avg 0.88 0.88 0.88 372 weighted-avg 0.88 0.88 0.88 372 ``` --- layout: model title: English BertForQuestionAnswering Cased model (from roshnir) author: John Snow Labs name: bert_qa_mbert_finetuned_mlqa_dev date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mBert-finetuned-mlqa-dev-en` is an English model originally trained by `roshnir`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_dev_en_4.0.0_3.0_1657189939497.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_dev_en_4.0.0_3.0_1657189939497.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_dev","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_dev","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_mbert_finetuned_mlqa_dev| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|626.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/roshnir/mBert-finetuned-mlqa-dev-en --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4CHEMD_ImbalancedPubMedBERT date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD_ImbalancedPubMedBERT` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_ImbalancedPubMedBERT_en_4.0.0_3.0_1657108954704.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_ImbalancedPubMedBERT_en_4.0.0_3.0_1657108954704.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_ImbalancedPubMedBERT","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_ImbalancedPubMedBERT","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4CHEMD_ImbalancedPubMedBERT| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|408.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4CHEMD_ImbalancedPubMedBERT --- layout: model title: English XlmRoBertaForQuestionAnswering (from ncduy) author: John Snow Labs name: xlm_roberta_qa_xlm_roberta_base_squad2_distilled_finetuned_chaii date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-squad2-distilled-finetuned-chaii` is an English model originally trained by `ncduy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_squad2_distilled_finetuned_chaii_en_4.0.0_3.0_1655991564185.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_squad2_distilled_finetuned_chaii_en_4.0.0_3.0_1655991564185.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_base_squad2_distilled_finetuned_chaii","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_base_squad2_distilled_finetuned_chaii","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_chaii.xlm_roberta.distilled_base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_roberta_base_squad2_distilled_finetuned_chaii| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|886.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ncduy/xlm-roberta-base-squad2-distilled-finetuned-chaii --- layout: model title: English T5ForConditionalGeneration Base Cased model (from mrm8488) author: John Snow Labs name: t5_base_finetuned_wikisql date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-finetuned-wikiSQL` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_wikisql_en_4.3.0_3.0_1675109286457.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_wikisql_en_4.3.0_3.0_1675109286457.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_base_finetuned_wikisql","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_base_finetuned_wikisql","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_base_finetuned_wikisql| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|887.6 MB| ## References - https://huggingface.co/mrm8488/t5-base-finetuned-wikiSQL - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://github.com/salesforce/WikiSQL - https://arxiv.org/pdf/1910.10683.pdf - https://i.imgur.com/jVFMMWR.png - https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb - https://github.com/patil-suraj - https://pbs.twimg.com/media/Ec5vaG5XsAINty_?format=png&name=900x900 - https://twitter.com/mrm8488 - https://www.linkedin.com/in/manuel-romero-cs/ --- layout: model title: Fast Neural Machine Translation Model from Thai to English author: John Snow Labs name: opus_mt_th_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, th, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `th` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_th_en_xx_2.7.0_2.4_1609163813254.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_th_en_xx_2.7.0_2.4_1609163813254.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_th_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_th_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.th.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_th_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate Pangasinan to English Pipeline author: John Snow Labs name: translate_pag_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, pag, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `pag` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_pag_en_xx_2.7.0_2.4_1609686426766.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_pag_en_xx_2.7.0_2.4_1609686426766.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_pag_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_pag_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.pag.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_pag_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from twmkn9) author: John Snow Labs name: roberta_qa_base_squad2 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-base-squad2` is an English model originally trained by `twmkn9`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_en_4.3.0_3.0_1674210478798.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_en_4.3.0_3.0_1674210478798.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_squad2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|307.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/twmkn9/distilroberta-base-squad2 --- layout: model title: Finnish asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot TFWav2Vec2ForCTC from aapot author: John Snow Labs name: pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot` is a Finnish model originally trained by aapot. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot_fi_4.2.0_3.0_1664024597770.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot_fi_4.2.0_3.0_1664024597770.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot', lang = 'fi') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot", lang = "fi") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering model (from Graphcore) author: John Snow Labs name: bert_qa_Graphcore_bert_large_uncased_squad date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-squad` is an English model originally trained by `Graphcore`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Graphcore_bert_large_uncased_squad_en_4.0.0_3.0_1654536525530.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Graphcore_bert_large_uncased_squad_en_4.0.0_3.0_1654536525530.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Graphcore_bert_large_uncased_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_Graphcore_bert_large_uncased_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.large_uncased.by_Graphcore").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_Graphcore_bert_large_uncased_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|798.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Graphcore/bert-large-uncased-squad --- layout: model title: Italian T5ForConditionalGeneration Small Cased model (from it5) author: John Snow Labs name: t5_it5_efficient_small_el32_question_generation date: 2023-01-30 tags: [it, open_source, t5, tensorflow] task: Text Generation language: it edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `it5-efficient-small-el32-question-generation` is an Italian model originally trained by `it5`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_question_generation_it_4.3.0_3.0_1675103595829.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_question_generation_it_4.3.0_3.0_1675103595829.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_question_generation","it") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_question_generation","it") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_it5_efficient_small_el32_question_generation| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|it| |Size:|593.8 MB| ## References - https://huggingface.co/it5/it5-efficient-small-el32-question-generation - https://github.com/stefan-it - https://arxiv.org/abs/2203.03759 - https://gsarti.com - https://malvinanissim.github.io - https://arxiv.org/abs/2109.10686 - https://github.com/gsarti/it5 - https://paperswithcode.com/sota?task=Question+generation&dataset=SQuAD-IT --- layout: model title: English DistilBertForQuestionAnswering model (from Adrian) Squad2 author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_squad_colab date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad-colab` is an English model originally trained by `Adrian`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad_colab_en_4.0.0_3.0_1654726625385.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad_colab_en_4.0.0_3.0_1654726625385.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_colab","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_colab","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased_colab.by_Adrian").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_squad_colab| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Adrian/distilbert-base-uncased-finetuned-squad-colab --- layout: model title: Swedish asr_wav2vec2_swedish_common_voice TFWav2Vec2ForCTC from birgermoell author: John Snow Labs name: asr_wav2vec2_swedish_common_voice date: 2022-09-25 tags: [wav2vec2, sv, audio, open_source, asr] task: Automatic Speech Recognition language: sv edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_swedish_common_voice` is a Swedish model originally trained by birgermoell. NOTE: This model only works on a CPU; if you need to run it on a GPU device, please use asr_wav2vec2_swedish_common_voice_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_swedish_common_voice_sv_4.2.0_3.0_1664114373826.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_swedish_common_voice_sv_4.2.0_3.0_1664114373826.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_swedish_common_voice", "sv")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_swedish_common_voice", "sv") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
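The snippets above assume an existing `audioDf` with an `audio_content` column holding each recording as an array of floats. How you produce those floats is up to you; as a minimal stdlib-only sketch (the helper name is hypothetical, and 16-bit PCM mono input is assumed), a WAV file can be decoded like this:

```python
import struct
import wave

def wav_to_floats(path):
    """Decode a 16-bit PCM mono WAV file into floats in [-1.0, 1.0],
    the kind of array usually placed in the audio_content column."""
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    # "<h" = little-endian signed 16-bit samples; normalize by 2**15
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]
```

A single-row DataFrame could then be built with something like `spark.createDataFrame([[floats]]).toDF("audio_content")` before running the pipeline.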
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_swedish_common_voice| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|sv| |Size:|1.2 GB| --- layout: model title: Smaller BERT Embeddings (L-6_H-128_A-2) author: John Snow Labs name: small_bert_L6_128 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L6_128_en_2.6.0_2.4_1598344340449.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L6_128_en_2.6.0_2.4_1598344340449.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L6_128", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L6_128", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L6_128').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash token en_embed_bert_small_L6_128_embeddings I [0.43105611205101013, 0.6831966638565063, -1.2..... love [0.8754201531410217, 0.4752326011657715, -1.46... NLP [-0.2781177759170532, -0.14001458883285522, 1... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|small_bert_L6_128| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|128| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/google/small_bert/bert_uncased_L-6_H-128_A-2/2](https://tfhub.dev/google/small_bert/bert_uncased_L-6_H-128_A-2/2) --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC5CDR_Chem_Modified_PubMedBERT_384 date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC5CDR-Chem-Modified-PubMedBERT-384` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC5CDR_Chem_Modified_PubMedBERT_384_en_4.0.0_3.0_1657109392849.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC5CDR_Chem_Modified_PubMedBERT_384_en_4.0.0_3.0_1657109392849.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC5CDR_Chem_Modified_PubMedBERT_384","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC5CDR_Chem_Modified_PubMedBERT_384","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
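The token classifier above emits one tag per token (BIO scheme); in Spark NLP you would typically append a `NerConverter` stage to merge those tags into entity chunks. The merging idea itself can be sketched framework-free (hypothetical helper, plain Python):

```python
def bio_to_chunks(tokens, tags):
    """Collapse token-level B-/I- tags into (chunk_text, label) spans."""
    chunks, cur, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur:
                chunks.append((" ".join(cur), label))
            cur, label = [tok], tag[2:]
        elif tag.startswith("I-") and cur and tag[2:] == label:
            cur.append(tok)  # continuation of the current entity
        else:
            if cur:
                chunks.append((" ".join(cur), label))
            cur, label = [], None
    if cur:
        chunks.append((" ".join(cur), label))
    return chunks

print(bio_to_chunks(["Aspirin", "and", "warfarin"],
                    ["B-Chemical", "O", "B-Chemical"]))
# [('Aspirin', 'Chemical'), ('warfarin', 'Chemical')]
```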
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC5CDR_Chem_Modified_PubMedBERT_384| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|408.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC5CDR-Chem-Modified-PubMedBERT-384 --- layout: model title: Detect Persons, Locations, Organizations and Misc Entities in Portuguese (WikiNER 6B 100) author: John Snow Labs name: wikiner_6B_100 date: 2020-05-10 task: Named Entity Recognition language: pt edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [ner, pt, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 6B 100 is trained with GloVe 6B 100 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_PT){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_PT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_6B_100_pt_2.5.0_2.4_1588495233192.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_6B_100_pt_2.5.0_2.4_1588495233192.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_100d") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("wikiner_6B_100", "pt") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (nascido em 28 de outubro de 1955) é um magnata americano de negócios, desenvolvedor de software, investidor e filantropo. Ele é mais conhecido como co-fundador da Microsoft Corporation. Durante sua carreira na Microsoft, Gates ocupou os cargos de presidente, diretor executivo (CEO), presidente e diretor de arquitetura de software, além de ser o maior acionista individual até maio de 2014. Ele é um dos empreendedores e pioneiros mais conhecidos da revolução dos microcomputadores nas décadas de 1970 e 1980. Nascido e criado em Seattle, Washington, Gates co-fundou a Microsoft com o amigo de infância Paul Allen em 1975, em Albuquerque, Novo México; tornou-se a maior empresa de software de computador pessoal do mundo. Gates liderou a empresa como presidente e CEO até deixar o cargo em janeiro de 2000, mas ele permaneceu como presidente e tornou-se arquiteto-chefe de software. No final dos anos 90, Gates foi criticado por suas táticas de negócios, que foram consideradas anticompetitivas. Esta opinião foi confirmada por várias decisões judiciais. Em junho de 2006, Gates anunciou que iria passar para um cargo de meio período na Microsoft e trabalhar em período integral na Fundação Bill & Melinda Gates, a fundação de caridade privada que ele e sua esposa, Melinda Gates, estabeleceram em 2000. Ele gradualmente transferiu seus deveres para Ray Ozzie e Craig Mundie. 
Ele deixou o cargo de presidente da Microsoft em fevereiro de 2014 e assumiu um novo cargo como consultor de tecnologia para apoiar a recém-nomeada CEO Satya Nadella.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_100d") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("wikiner_6B_100", "pt") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (nascido em 28 de outubro de 1955) é um magnata americano de negócios, desenvolvedor de software, investidor e filantropo. Ele é mais conhecido como co-fundador da Microsoft Corporation. Durante sua carreira na Microsoft, Gates ocupou os cargos de presidente, diretor executivo (CEO), presidente e diretor de arquitetura de software, além de ser o maior acionista individual até maio de 2014. Ele é um dos empreendedores e pioneiros mais conhecidos da revolução dos microcomputadores nas décadas de 1970 e 1980. Nascido e criado em Seattle, Washington, Gates co-fundou a Microsoft com o amigo de infância Paul Allen em 1975, em Albuquerque, Novo México; tornou-se a maior empresa de software de computador pessoal do mundo. Gates liderou a empresa como presidente e CEO até deixar o cargo em janeiro de 2000, mas ele permaneceu como presidente e tornou-se arquiteto-chefe de software. No final dos anos 90, Gates foi criticado por suas táticas de negócios, que foram consideradas anticompetitivas. Esta opinião foi confirmada por várias decisões judiciais. Em junho de 2006, Gates anunciou que iria passar para um cargo de meio período na Microsoft e trabalhar em período integral na Fundação Bill & Melinda Gates, a fundação de caridade privada que ele e sua esposa, Melinda Gates, estabeleceram em 2000. 
Ele gradualmente transferiu seus deveres para Ray Ozzie e Craig Mundie. Ele deixou o cargo de presidente da Microsoft em fevereiro de 2014 e assumiu um novo cargo como consultor de tecnologia para apoiar a recém-nomeada CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (nascido em 28 de outubro de 1955) é um magnata americano de negócios, desenvolvedor de software, investidor e filantropo. Ele é mais conhecido como co-fundador da Microsoft Corporation. Durante sua carreira na Microsoft, Gates ocupou os cargos de presidente, diretor executivo (CEO), presidente e diretor de arquitetura de software, além de ser o maior acionista individual até maio de 2014. Ele é um dos empreendedores e pioneiros mais conhecidos da revolução dos microcomputadores nas décadas de 1970 e 1980. Nascido e criado em Seattle, Washington, Gates co-fundou a Microsoft com o amigo de infância Paul Allen em 1975, em Albuquerque, Novo México; tornou-se a maior empresa de software de computador pessoal do mundo. Gates liderou a empresa como presidente e CEO até deixar o cargo em janeiro de 2000, mas ele permaneceu como presidente e tornou-se arquiteto-chefe de software. No final dos anos 90, Gates foi criticado por suas táticas de negócios, que foram consideradas anticompetitivas. Esta opinião foi confirmada por várias decisões judiciais. Em junho de 2006, Gates anunciou que iria passar para um cargo de meio período na Microsoft e trabalhar em período integral na Fundação Bill & Melinda Gates, a fundação de caridade privada que ele e sua esposa, Melinda Gates, estabeleceram em 2000. Ele gradualmente transferiu seus deveres para Ray Ozzie e Craig Mundie. 
Ele deixou o cargo de presidente da Microsoft em fevereiro de 2014 e assumiu um novo cargo como consultor de tecnologia para apoiar a recém-nomeada CEO Satya Nadella."""] ner_df = nlu.load('pt.ner.wikiner.glove.6B_100').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title} ## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |William Henry Gates III|PER | |Ele |MISC | |Microsoft Corporation |ORG | |Durante |ORG | |Microsoft |ORG | |Gates |PER | |CEO |ORG | |Ele |MISC | |Nascido |MISC | |Seattle |LOC | |Washington |LOC | |Gates |PER | |Microsoft |ORG | |Paul Allen |PER | |Albuquerque |LOC | |Novo México |LOC | |Gates |PER | |CEO |ORG | |Gates |PER | |Gates |PER | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wikiner_6B_100| |Type:|ner| |Compatibility:|Spark NLP 2.5.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|pt| |Case sensitive:|false| {:.h2_title} ## Data Source The model was trained based on data from [https://pt.wikipedia.org](https://pt.wikipedia.org) --- layout: model title: French CamemBert Embeddings (from Jodsa) author: John Snow Labs name: camembert_embeddings_camembert_mlm date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `camembert_mlm` is a French model originally trained by `Jodsa`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_camembert_mlm_fr_3.4.4_3.0_1653985748924.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_camembert_mlm_fr_3.4.4_3.0_1653985748924.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_camembert_mlm","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_camembert_mlm","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_camembert_mlm| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|420.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/Jodsa/camembert_mlm --- layout: model title: Legal Supplemental Indenture Document Binary Classifier (Longformer) author: John Snow Labs name: legclf_supplemental_indenture_agreement date: 2022-12-18 tags: [en, legal, classification, licensed, document, longformer, supplemental, indenture, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_supplemental_indenture_agreement` model is a Longformer Document Classifier used to classify whether a document belongs to the class `supplemental-indenture` or not (Binary Classification). Longformers are limited to 4096 tokens, so only the first 4096 tokens of a document are taken into account. In our experience, for the vast majority of documents in legal corpora that are clean and contain only the legal document itself, without extra material before it, 4096 tokens are enough for Document Classification. If your documents exceed 4096 tokens, you can split them into 4096-token chunks, average the chunk embeddings, and train on the averaged version, so that the whole document is taken into account.
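The chunk-and-average workaround described above can be sketched with plain NumPy (the helper name is hypothetical; note that a plain mean over chunk vectors weights tokens in a short final chunk slightly more than a length-weighted mean would):

```python
import numpy as np

def average_chunk_embeddings(token_embeddings, chunk_size=4096):
    """Split a (num_tokens, dim) array into chunks of at most chunk_size
    tokens, mean-pool each chunk, then average the per-chunk vectors
    into a single fixed-size document vector."""
    chunks = [token_embeddings[i:i + chunk_size]
              for i in range(0, len(token_embeddings), chunk_size)]
    chunk_means = np.stack([chunk.mean(axis=0) for chunk in chunks])
    return chunk_means.mean(axis=0)

doc = np.ones((10000, 768))  # e.g. 10k tokens of 768-dim embeddings
print(average_chunk_embeddings(doc).shape)  # (768,)
```

Training the sentence-embedding classifier on these averaged vectors lets documents longer than the 4096-token window contribute in full.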
## Predicted Entities `supplemental-indenture`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_supplemental_indenture_agreement_en_1.0.0_3.0_1671393678095.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_supplemental_indenture_agreement_en_1.0.0_3.0_1671393678095.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_supplemental_indenture_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +------------------------+ |result                  | +------------------------+ |[supplemental-indenture]| |[other]                 | |[other]                 | |[supplemental-indenture]| +------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_supplemental_indenture_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.97 0.95 0.96 221 supplemental-indenture 0.90 0.94 0.92 107 accuracy - - 0.95 328 macro-avg 0.94 0.95 0.94 328 weighted-avg 0.95 0.95 0.95 328 ``` --- layout: model title: Relation Extraction between Test and Results (ReDL) author: John Snow Labs name: redl_oncology_test_result_biobert_wip date: 2023-01-15 tags: [licensed, clinical, oncology, en, relation_extraction, test, tensorflow] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This relation extraction model links test extractions to their corresponding results.
## Predicted Entities `is_finding_of`, `O` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_oncology_test_result_biobert_wip_en_4.2.4_3.0_1673776756086.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_oncology_test_result_biobert_wip_en_4.2.4_3.0_1673776756086.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Each relevant relation pair in the pipeline should include one test entity (such as Biomarker, Imaging_Test, Pathology_Test or Oncogene) and one result entity (such as Biomarker_Result, Pathology_Result or Tumor_Finding).
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") re_ner_chunk_filter = RENerChunksFilter()\ .setInputCols(["ner_chunk", "dependencies"])\ .setOutputCol("re_ner_chunk")\ .setMaxSyntacticDistance(10)\ .setRelationPairs(["Biomarker-Biomarker_Result", "Biomarker_Result-Biomarker", "Oncogene-Biomarker_Result", "Biomarker_Result-Oncogene", "Pathology_Test-Pathology_Result", "Pathology_Result-Pathology_Test"]) re_model = RelationExtractionDLModel.pretrained("redl_oncology_test_result_biobert_wip", "en", "clinical/models")\ .setInputCols(["re_ner_chunk", "sentence"])\ .setOutputCol("relation_extraction") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model]) data = spark.createDataFrame([["Pathology showed tumor cells, which were positive for estrogen and progesterone 
receptors."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentence", "pos_tags", "token")) .setOutputCol("dependencies") val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunk", "dependencies")) .setOutputCol("re_ner_chunk") .setMaxSyntacticDistance(10) .setRelationPairs(Array("Biomarker-Biomarker_Result", "Biomarker_Result-Biomarker", "Oncogene-Biomarker_Result", "Biomarker_Result-Oncogene", "Pathology_Test-Pathology_Result", "Pathology_Result-Pathology_Test")) val re_model = RelationExtractionDLModel.pretrained("redl_oncology_test_result_biobert_wip", "en", "clinical/models") .setPredictionThreshold(0.5f) .setInputCols(Array("re_ner_chunk", "sentence")) .setOutputCol("relation_extraction") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, 
re_model)) val data = Seq("Pathology showed tumor cells, which were positive for estrogen and progesterone receptors.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.oncology.test_result_biobert").predict("""Pathology showed tumor cells, which were positive for estrogen and progesterone receptors.""") ```
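Downstream code often needs to keep only high-confidence relations from the `relation_extraction` column. A minimal pure-Python sketch of that filtering step (the row values are taken from the Results section of this card; the 0.9 threshold is an illustrative choice, not a model default):

```python
# Relation rows as they appear in the Results section of this card.
rows = [
    {"relation": "is_finding_of", "entity1": "Pathology_Test", "chunk1": "Pathology",
     "entity2": "Pathology_Result", "chunk2": "tumor cells", "confidence": 0.8494344},
    {"relation": "is_finding_of", "entity1": "Biomarker_Result", "chunk1": "positive",
     "entity2": "Biomarker", "chunk2": "estrogen", "confidence": 0.99451536},
    {"relation": "is_finding_of", "entity1": "Biomarker_Result", "chunk1": "positive",
     "entity2": "Biomarker", "chunk2": "progesterone receptors", "confidence": 0.99218905},
]

def filter_relations(rows, threshold=0.9):
    """Keep only relations whose confidence meets the threshold."""
    return [r for r in rows if r["confidence"] >= threshold]

high_conf = filter_relations(rows)
```

The same filtering can of course be done on the Spark DataFrame itself before collecting results.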
## Results ```bash +-------------+----------------+-------------+-----------+---------+----------------+-------------+-----------+--------------------+----------+ | relation| entity1|entity1_begin|entity1_end| chunk1| entity2|entity2_begin|entity2_end| chunk2|confidence| +-------------+----------------+-------------+-----------+---------+----------------+-------------+-----------+--------------------+----------+ |is_finding_of| Pathology_Test| 0| 8|Pathology|Pathology_Result| 17| 27| tumor cells| 0.8494344| |is_finding_of|Biomarker_Result| 41| 48| positive| Biomarker| 54| 61| estrogen|0.99451536| |is_finding_of|Biomarker_Result| 41| 48| positive| Biomarker| 67| 88|progesterone rece...|0.99218905| +-------------+----------------+-------------+-----------+---------+----------------+-------------+-----------+--------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_oncology_test_result_biobert_wip| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|401.7 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label recall precision f1 O 0.87 0.92 0.9 is_finding_of 0.93 0.88 0.9 macro-avg 0.90 0.90 0.9 ``` --- layout: model title: English image_classifier_vit_diam ViTForImageClassification from godiec author: John Snow Labs name: image_classifier_vit_diam date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_diam` is an English model originally trained by godiec.
## Predicted Entities `bunny`, `moon`, `sun`, `tiger` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_diam_en_4.1.0_3.0_1660167848550.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_diam_en_4.1.0_3.0_1660167848550.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_diam", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_diam", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
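The classifier emits one of the four labels above per image. As an illustration of how raw class scores map to a predicted label (toy logits, not actual model output; the real model applies the same softmax-and-argmax step internally):

```python
import math

# The four classes this model predicts (see Predicted Entities above).
LABELS = ["bunny", "moon", "sun", "tiger"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Return the highest-probability label and its probability."""
    probs = softmax(logits)
    best = probs.index(max(probs))
    return LABELS[best], probs[best]

label, prob = predict_label([0.1, 0.2, 2.5, 0.3])  # toy scores favoring "sun"
```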
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_diam| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: BERT Sentence Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on SQuAD 2.0 author: John Snow Labs name: sent_bert_wiki_books_squad2 date: 2021-08-31 tags: [en, open_source, sentence_detection, wikipedia_dataset, books_corpus_dataset, squad_2_dataset] task: Embeddings language: en nav_key: models edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses a BERT base architecture initialized from https://tfhub.dev/google/experts/bert/wiki_books/1 and fine-tuned on SQuAD 2.0. This is a BERT base architecture but some changes have been made to the original training and export scheme based on more recent learnings. This model is intended to be used for a variety of English NLP tasks. The pre-training data contains more formal text and the model may not generalize to more colloquial text such as social media or messages. This model is fine-tuned on the SQuAD 2.0 and is recommended for use in question answering tasks. The fine-tuning task uses the SQuAD 2.0 dataset as a span-labeling task to label the answer to a question in a given context. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_wiki_books_squad2_en_3.2.0_3.0_1630412125790.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_wiki_books_squad2_en_3.2.0_3.0_1630412125790.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \ .setInputCols(["document"]) \ .setOutputCol("sentence") sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_wiki_books_squad2", "en") \ .setInputCols("sentence") \ .setOutputCol("bert_sentence") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings]) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_wiki_books_squad2", "en") .setInputCols("sentence") .setOutputCol("bert_sentence") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings)) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] sent_embeddings_df = nlu.load('en.embed_sentence.bert.wiki_books_squad2').predict(text, output_level='sentence') sent_embeddings_df ```
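Sentence embeddings are usually compared with cosine similarity. A self-contained sketch on toy vectors (the real vectors produced by this model are 768-dimensional, but are used the same way):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional stand-ins for real sentence embeddings.
v1 = [0.1, 0.3, 0.5, 0.7]
v2 = [0.1, 0.3, 0.5, 0.7]   # identical direction -> similarity 1.0
v3 = [0.7, -0.5, 0.3, -0.1]  # orthogonal to v1 -> similarity 0.0
```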
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_bert_wiki_books_squad2| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[bert_sentence]| |Language:|en| |Case sensitive:|false| ## Data Source [1]: [Wikipedia dataset](https://dumps.wikimedia.org/) [2]: [BooksCorpus dataset](http://yknzhu.wixsite.com/mbweb) [3]: [Stanford Question Answering (SQuAD 2.0) dataset](https://rajpurkar.github.io/SQuAD-explorer/) This model has been imported from: https://tfhub.dev/google/experts/bert/wiki_books/squad2/2 --- layout: model title: Classify text about Effective, Renewal or Termination date author: John Snow Labs name: legclf_dates_sm date: 2022-11-21 tags: [effective, renewal, termination, date, en, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Text Classification model that can help you classify whether a paragraph talks about an Effective Date, a Renewal Date, a Termination Date or something else. Don't confuse this model with the NER model (`legner_dates_sm`), which allows you to extract the actual dates from the texts. ## Predicted Entities `EFFECTIVE_DATE`, `RENEWAL_DATE`, `TERMINATION_DATE`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_dates_sm_en_1.0.0_3.0_1669034322560.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_dates_sm_en_1.0.0_3.0_1669034322560.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained('legclf_dates_sm', 'en', 'legal/models')\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("label") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) text = ["""Renewal Date means January 1, 2018."""] empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) res = model.transform(spark.createDataFrame([text]).toDF("text")) ```
## Results ```bash +--------------+ | result| +--------------+ |[RENEWAL_DATE]| +--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_dates_sm| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[label]| |Language:|en| |Size:|22.5 MB| ## References In-house annotations. ## Benchmarking ```bash label precision recall f1-score support EFFECTIVE_DATE 1.00 0.80 0.89 5 RENEWAL_DATE 1.00 1.00 1.00 6 TERMINATION_DATE 0.86 0.75 0.80 8 other 0.91 1.00 0.95 21 accuracy - - 0.93 40 macro-avg 0.94 0.89 0.91 40 weighted-avg 0.93 0.93 0.92 40 ``` --- layout: model title: Legal Termination Clause Binary Classifier (CUAD dataset, SBERT version) author: John Snow Labs name: legclf_sbert_cuad_termination_clause date: 2022-11-11 tags: [termination, en, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `termination` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. This version was trained with Sentence BERT (SBERT) embeddings.
There is another version of this model using the Universal Sentence Encoder, called `legclf_cuad_termination_clause`. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. There are other models with similar titles; the difference is the dataset each was trained on. This one was trained on the `cuad` dataset. ## Predicted Entities `termination`, `other` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_sbert_cuad_termination_clause_en_1.0.0_3.0_1668163200458.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_sbert_cuad_termination_clause_en_1.0.0_3.0_1668163200458.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_sbert_cuad_termination_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([[" ---------------------\n\n This Agreement may be terminated immediately by Developer..."]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
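The paragraph-splitting technique mentioned in the description (splitting by multiline) can be sketched in plain Python. This is a minimal illustration of the idea, not the workshop implementation:

```python
import re

def split_paragraphs(text):
    """Split a document into paragraphs on blank lines (one or more empty lines)."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = ("TERMINATION.\n\n"
       "This Agreement may be terminated immediately by Developer.\n\n"
       "GOVERNING LAW.\n\n"
       "This Agreement is governed by the laws of Delaware.")
paragraphs = split_paragraphs(doc)
```

Each resulting paragraph can then be fed to the classifier as a separate `clause_text` row.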
## Results ```bash +-------------+ | result| +-------------+ |[termination]| +-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_sbert_cuad_termination_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References In-house annotations on the CUAD dataset. ## Benchmarking ```bash label precision recall f1-score support other 1.00 1.00 1.00 41 termination 1.00 1.00 1.00 40 accuracy - - 1.00 81 macro-avg 1.00 1.00 1.00 81 weighted-avg 1.00 1.00 1.00 81 ``` --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350` is a German model originally trained by jonatasgrosman.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350_de_4.2.0_3.0_1664103295967.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350_de_4.2.0_3.0_1664103295967.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350", lang = "de") val annotations = pipeline.transform(audioDF) ```
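The `audioDF` fed to the pipeline carries audio content as an array of floats. If your audio arrives as 16-bit PCM samples, a common pre-processing step is normalizing them to the [-1.0, 1.0] range. A sketch of just that conversion (loading the wav file and building the DataFrame are separate steps):

```python
def pcm16_to_float(samples):
    """Normalize signed 16-bit PCM samples to floats in [-1.0, 1.0]."""
    return [s / 32768.0 for s in samples]

# Illustrative raw samples spanning the signed 16-bit range.
raw = [0, 16384, -16384, 32767, -32768]
floats = pcm16_to_float(raw)
```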
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering model (from sunitha) author: John Snow Labs name: bert_qa_output_files date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `output_files` is an English model originally trained by `sunitha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_output_files_en_4.0.0_3.0_1654189020709.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_output_files_en_4.0.0_3.0_1654189020709.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_output_files","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_output_files","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.output_files.bert.by_sunitha").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
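Under the hood, extractive QA models of this kind score a start and an end position over the context tokens, and the answer is the highest-scoring valid span. A toy sketch of that decoding step (illustrative scores, not actual model output):

```python
def best_span(start_scores, end_scores, max_len=10):
    """Pick (start, end) maximizing start_scores[s] + end_scores[e] with s <= e."""
    best = (0, 0, float("-inf"))
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = ss + end_scores[e]
            if score > best[2]:
                best = (s, e, score)
    return best[:2]

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Toy scores that peak on "Clara" for both start and end.
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]
end   = [0.0, 0.1, 0.0, 4.8, 0.2, 0.0, 0.0, 0.1, 0.4, 0.0]
s, e = best_span(start, end)
answer = " ".join(tokens[s:e + 1])
```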
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_output_files| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/sunitha/output_files --- layout: model title: English image_classifier_vit_ViT_FaceMask_Finetuned ViTForImageClassification from AkshatSurolia author: John Snow Labs name: image_classifier_vit_ViT_FaceMask_Finetuned date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_ViT_FaceMask_Finetuned` is an English model originally trained by AkshatSurolia. ## Predicted Entities `Mask`, `No Mask` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_ViT_FaceMask_Finetuned_en_4.1.0_3.0_1660165872491.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_ViT_FaceMask_Finetuned_en_4.1.0_3.0_1660165872491.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_ViT_FaceMask_Finetuned", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_ViT_FaceMask_Finetuned", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_ViT_FaceMask_Finetuned| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English DistilBertForQuestionAnswering model (from bdickson) author: John Snow Labs name: distilbert_qa_bdickson_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `bdickson`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_bdickson_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725099987.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_bdickson_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725099987.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bdickson_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bdickson_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_bdickson").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_bdickson_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/bdickson/distilbert-base-uncased-finetuned-squad --- layout: model title: Icelandic XlmRoBertaForQuestionAnswering (from vesteinn) author: John Snow Labs name: xlm_roberta_qa_XLMr_ENIS_QA_Is date: 2022-06-23 tags: [is, open_source, question_answering, xlmroberta] task: Question Answering language: is edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `XLMr-ENIS-QA-Is` is an Icelandic model originally trained by `vesteinn`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_XLMr_ENIS_QA_Is_is_4.0.0_3.0_1655983971257.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_XLMr_ENIS_QA_Is_is_4.0.0_3.0_1655983971257.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_XLMr_ENIS_QA_Is","is") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_XLMr_ENIS_QA_Is","is") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("is.answer_question.xlmr_roberta").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_XLMr_ENIS_QA_Is| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|is| |Size:|453.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/vesteinn/XLMr-ENIS-QA-Is --- layout: model title: Pipeline to Mapping ICD10-CM Codes with Their Corresponding ICD-9-CM Codes author: John Snow Labs name: icd10_icd9_mapping date: 2023-06-13 tags: [en, licensed, icd10cm, icd9, pipeline, chunk_mapping] task: Chunk Mapping language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of `icd10_icd9_mapper` model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10_icd9_mapping_en_4.4.4_3.2_1686663555396.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10_icd9_mapping_en_4.4.4_3.2_1686663555396.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("icd10_icd9_mapping", "en", "clinical/models") result = pipeline.fullAnnotate("Z833 A0100 A000") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("icd10_icd9_mapping", "en", "clinical/models") val result = pipeline.fullAnnotate("Z833 A0100 A000") ``` {:.nlu-block} ```python import nlu nlu.load("en.icd10_icd9.mapping").predict("""Put your text here.""") ```
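Conceptually, the pipeline performs a code-to-code lookup. A minimal pure-Python stand-in using the three example codes from this card (the pretrained mapper ships the full ICD-10-CM to ICD-9-CM mapping, not just this subset):

```python
# Illustrative subset of the mapping; the pretrained model covers the full code set.
ICD10_TO_ICD9 = {
    "Z833": "V180",
    "A0100": "0020",
    "A000": "0010",
}

def map_codes(text):
    """Map whitespace-separated ICD-10-CM codes to ICD-9-CM, preserving order."""
    return [ICD10_TO_ICD9.get(code, "NONE") for code in text.split()]

mapped = map_codes("Z833 A0100 A000")
```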
## Results ```bash | | icd10_code | icd9_code | |---:|:--------------------|:-------------------| | 0 | Z833 | A0100 | A000 | V180 | 0020 | 0010 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd10_icd9_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|593.6 KB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel --- layout: model title: Translate Albanian to English Pipeline author: John Snow Labs name: translate_sq_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, sq, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `sq` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_sq_en_xx_2.7.0_2.4_1609688416552.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_sq_en_xx_2.7.0_2.4_1609688416552.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_sq_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_sq_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.sq.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_sq_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Detect Clinical Entities (ner_eu_clinical_case - es) author: John Snow Labs name: ner_eu_clinical_case date: 2023-02-01 tags: [es, clinical, licensed, ner] task: Named Entity Recognition language: es edition: Healthcare NLP 4.2.8 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition (NER) deep learning model for extracting clinical entities from Spanish texts. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state of the art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNN. The corpus used for model training is provided by European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives. ## Predicted Entities `clinical_event`, `bodypart`, `clinical_condition`, `units_measurements`, `patient`, `date_time` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_es_4.2.8_3.0_1675285093855.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_es_4.2.8_3.0_1675285093855.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","es")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_eu_clinical_case", "es", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["""Un niño de 3 años con trastorno autista en el hospital de la sala pediátrica A del hospital universitario. No tiene antecedentes familiares de enfermedad o trastorno del espectro autista. El niño fue diagnosticado con un trastorno de comunicación severo, con dificultades de interacción social y retraso en el procesamiento sensorial. Los análisis de sangre fueron normales (hormona estimulante de la tiroides (TSH), hemoglobina, volumen corpuscular medio (MCV) y ferritina). La endoscopia alta también mostró un tumor submucoso que causaba una obstrucción subtotal de la salida gástrica. Ante la sospecha de tumor del estroma gastrointestinal, se realizó gastrectomía distal. 
El examen histopatológico reveló proliferación de células fusiformes en la capa submucosa."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","es") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_eu_clinical_case", "es", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("""Un niño de 3 años con trastorno autista en el hospital de la sala pediátrica A del hospital universitario. No tiene antecedentes familiares de enfermedad o trastorno del espectro autista. El niño fue diagnosticado con un trastorno de comunicación severo, con dificultades de interacción social y retraso en el procesamiento sensorial. Los análisis de sangre fueron normales (hormona estimulante de la tiroides (TSH), hemoglobina, volumen corpuscular medio (MCV) y ferritina). La endoscopia alta también mostró un tumor submucoso que causaba una obstrucción subtotal de la salida gástrica. Ante la sospecha de tumor del estroma gastrointestinal, se realizó gastrectomía distal. El examen histopatológico reveló proliferación de células fusiformes en la capa submucosa.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
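The `ner_converter` stage assembles the model's per-token IOB tags into the entity chunks shown under Results. As a rough plain-Python illustration of that merging logic (the tokens and tags below are made up for the example, not actual model output):

```python
# Minimal sketch of IOB chunk merging, similar in spirit to what
# NerConverterInternal does when turning per-token NER tags into chunks.
def merge_iob(tokens, tags):
    """Merge (token, IOB-tag) pairs into (chunk_text, label) tuples."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O" tag or a stray I- tag closes the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Un", "niño", "de", "3", "años", "con", "trastorno", "autista"]
tags   = ["B-patient", "I-patient", "I-patient", "I-patient", "I-patient",
          "O", "B-clinical_event", "I-clinical_event"]
print(merge_iob(tokens, tags))
# -> [('Un niño de 3 años', 'patient'), ('trastorno autista', 'clinical_event')]
```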
## Results ```bash +--------------------------------+------------------+ |chunk                           |ner_label         | +--------------------------------+------------------+ |Un niño de 3 años               |patient           | |trastorno autista               |clinical_event    | |antecedentes                    |clinical_event    | |enfermedad                      |clinical_event    | |trastorno del espectro autista  |clinical_event    | |El niño                         |patient           | |diagnosticado                   |clinical_event    | |trastorno de comunicación severo|clinical_event    | |dificultades                    |clinical_event    | |retraso                         |clinical_event    | |análisis                        |clinical_event    | |sangre                          |bodypart          | |normales                        |units_measurements| |hormona                         |clinical_event    | |la tiroides                     |bodypart          | |TSH                             |clinical_event    | |hemoglobina                     |clinical_event    | |volumen                         |clinical_event    | |MCV                             |clinical_event    | |ferritina                       |clinical_event    | |endoscopia                      |clinical_event    | |mostró                          |clinical_event    | |tumor submucoso                 |clinical_event    | |obstrucción                     |clinical_event    | |tumor                           |clinical_event    | |del estroma gastrointestinal    |bodypart          | |gastrectomía                    |clinical_event    | |examen                          |clinical_event    | |reveló                          |clinical_event    | |proliferación                   |clinical_event    | |células fusiformes              |bodypart          | |la capa submucosa               |bodypart          | +--------------------------------+------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_eu_clinical_case| |Compatibility:|Healthcare NLP 4.2.8+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|895.1 KB| ## References The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives.
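The per-label Benchmarking scores for this model are the standard precision/recall/F1 values derived from true-positive, false-positive, and false-negative counts. A minimal sketch of that derivation, checked against the `date_time` and `patient` rows of the benchmark:

```python
# Precision/recall/F1 from tp/fp/fn counts, as reported in the
# benchmarking tables of these model cards.
def prf1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 4), round(recall, 4), round(f1, 4)

print(prf1(87, 10, 17))  # date_time row -> (0.8969, 0.8365, 0.8657)
print(prf1(76, 8, 11))   # patient row   -> (0.9048, 0.8736, 0.8889)
```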
## Benchmarking ```bash label tp fp fn total precision recall f1 date_time 87.0 10.0 17.0 104.0 0.8969 0.8365 0.8657 units_measurements 37.0 5.0 11.0 48.0 0.8810 0.7708 0.8222 clinical_condition 50.0 34.0 70.0 120.0 0.5952 0.4167 0.4902 patient 76.0 8.0 11.0 87.0 0.9048 0.8736 0.8889 clinical_event 399.0 44.0 79.0 478.0 0.9007 0.8347 0.8664 bodypart 153.0 56.0 13.0 166.0 0.7321 0.9217 0.8160 macro - - - - - - 0.7916 micro - - - - - - 0.8128 ``` --- layout: model title: Korean BertForQuestionAnswering model (from bespin-global) author: John Snow Labs name: bert_qa_bespin_global_klue_bert_base_mrc date: 2022-06-02 tags: [ko, open_source, question_answering, bert] task: Question Answering language: ko edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `klue-bert-base-mrc` is a Korean model originally trained by `bespin-global`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bespin_global_klue_bert_base_mrc_ko_4.0.0_3.0_1654188080353.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bespin_global_klue_bert_base_mrc_ko_4.0.0_3.0_1654188080353.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bespin_global_klue_bert_base_mrc","ko") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bespin_global_klue_bert_base_mrc","ko") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("ko.answer_question.klue.bert.base.by_bespin-global").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
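Under the hood, an extractive QA annotator like `BertForQuestionAnswering` scores every context token as a candidate answer start and end, and the returned answer is the highest-scoring valid span. A toy sketch of that span-selection step (the logits below are invented for illustration, not real model output):

```python
# Toy extractive-QA span selection: pick the (start, end) pair with the
# highest combined score, requiring end >= start and a bounded span length.
def best_span(start_logits, end_logits, max_len=30):
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best_score:
                best_score, best = score, (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end   = [0.1, 0.1, 0.1, 4.8, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # -> Clara
```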
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bespin_global_klue_bert_base_mrc| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ko| |Size:|413.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/bespin-global/klue-bert-base-mrc - https://www.bespinglobal.com/ --- layout: model title: Fast Neural Machine Translation Model from Salishan Languages to English author: John Snow Labs name: opus_mt_sal_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, sal, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `sal` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_sal_en_xx_2.7.0_2.4_1609163973743.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_sal_en_xx_2.7.0_2.4_1609163973743.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_sal_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_sal_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.sal.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_sal_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_testimonial TFWav2Vec2ForCTC from testimonial author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab_by_testimonial date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_testimonial` is an English model originally trained by testimonial. NOTE: This model only works on a CPU; to run it on a GPU, use `asr_wav2vec2_base_timit_demo_colab_by_testimonial_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_testimonial_en_4.2.0_3.0_1664107711019.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_testimonial_en_4.2.0_3.0_1664107711019.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_by_testimonial", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_by_testimonial", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
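Wav2Vec2ForCTC models emit one character prediction per audio frame, which is then turned into text by CTC decoding: consecutive duplicates are merged and blank tokens dropped. A minimal sketch of greedy CTC collapse (the frame labels below are invented; a real model produces one argmax label per frame):

```python
# Greedy CTC collapse: merge consecutive duplicates, then drop blanks.
BLANK = "_"

def ctc_collapse(frames, blank=BLANK):
    out, prev = [], None
    for ch in frames:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_collapse(list("hh_eee_l_ll_oo")))  # -> hello
```

Note that the blank symbol lets CTC express genuine double letters: `"aa_a"` collapses to `"aa"`, while `"aaa"` collapses to `"a"`.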
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab_by_testimonial| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|354.9 MB| --- layout: model title: Classify Edgar Financial Filings and Schedules author: John Snow Labs name: finclf_sec_schedules_filings date: 2023-01-13 tags: [sec, filings, schedules, en, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true recommended: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a `multiclass` model that analyzes the first 512 tokens of your document and determines whether it belongs to one of the supported classes (see Predicted Entities). The class `schedule` includes `TO-C`, `13D`, `TO-T`, `14F1`, `14D9`, `14N`, `13G`, `TO-I`, `13E3`. `3` means SEC's `FORM-3`. `4` means SEC's `FORM-4`. ## Predicted Entities `schedule`, `other`, `10-K`, `10-Q`, `3`, `4`, `8-K`, `S-8` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_sec_schedules_filings_en_1.0.0_3.0_1673628989895.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_sec_schedules_filings_en_1.0.0_3.0_1673628989895.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.UniversalSentenceEncoder.pretrained()\ .setInputCols("document") \ .setOutputCol("sentence_embeddings") doc_classifier = finance.ClassifierDLModel.pretrained("finclf_sec_schedules_filings", "en", "finance/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier ]) text = """SECURITIES AND EXCHANGE COMMISSION WASHINGTON, DC 20549 SCHEDULE 13D (Rule 13d-101) INFORMATION TO BE INCLUDED IN STATEMENTS FILED PURSUANT TO RULE 13d-1(a) AND AMENDMENTS THERETO FILED PURSUANT TO RULE 13d-2(a) Under the Securities Exchange Act of 1934 (Amendment No. 2)* TILE SHOP HOLDINGS, INC. (Name of Issuer) ....""" df = spark.createDataFrame([[text]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
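Because only the first 512 tokens are analyzed, long filings are effectively classified from their opening text, which for SEC filings is usually the most class-indicative part. A rough illustration of that truncation, using whitespace tokens as a stand-in for the model's real subword tokenizer:

```python
# Keep only the first max_tokens whitespace tokens of a document. The real
# model uses a subword tokenizer, so its 512-token window is shorter in
# characters than this whitespace approximation.
def head_tokens(text, max_tokens=512):
    return " ".join(text.split()[:max_tokens])

doc = "SECURITIES AND EXCHANGE COMMISSION " * 200  # long-filing stand-in
print(len(head_tokens(doc).split()))  # -> 512
```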
## Results ```bash ['schedule'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_sec_schedules_filings| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.7 MB| ## References SEC's Edgar ## Benchmarking ```bash label precision recall f1-score support 10-K 0.93 0.90 0.92 42 10-Q 0.95 0.95 0.95 38 3 0.62 0.61 0.62 33 4 0.82 0.78 0.80 54 8-K 0.86 0.91 0.88 33 S-8 0.93 0.96 0.95 28 other 1.00 1.00 1.00 238 schedule 0.94 0.96 0.95 50 accuracy - - 0.93 516 macro-avg 0.88 0.88 0.88 516 weighted-avg 0.93 0.93 0.93 516 ``` --- layout: model title: RxNorm Xsmall ChunkResolver author: John Snow Labs name: chunkresolve_rxnorm_xsmall_clinical class: ChunkEntityResolverModel language: en nav_key: models repository: clinical/models date: 2020-06-24 task: Entity Resolution edition: Healthcare NLP 2.5.2 spark_version: 2.4 tags: [clinical,licensed,entity_resolution,en] deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Entity resolution model based on k-NN over word embeddings, using Word Mover's Distance. {:.h2_title} ## Predicted Entities RxNorm codes and their normalized definitions with `clinical_embeddings`.
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_xsmall_clinical_en_2.5.2_2.4_1592959394598.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_xsmall_clinical_en_2.5.2_2.4_1592959394598.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python ... rxnorm_resolver = ChunkEntityResolverModel\ .pretrained('chunkresolve_rxnorm_xsmall_clinical', 'en', "clinical/models")\ .setEnableLevenshtein(True)\ .setNeighbours(200).setAlternatives(5).setDistanceWeights([3,11,0,0,0,9])\ .setInputCols(['token', 'chunk_embeddings'])\ .setOutputCol('rxnorm_resolution')\ .setPoolingStrategy("MAX") pipeline_rxnorm = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, rxnorm_resolver]) data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation."""]]).toDF("text") model = pipeline_rxnorm.fit(data) results = model.transform(data) ``` ```scala ... 
val rxnorm_resolver = ChunkEntityResolverModel .pretrained("chunkresolve_rxnorm_xsmall_clinical", "en", "clinical/models") .setEnableLevenshtein(true) .setNeighbours(200).setAlternatives(5).setDistanceWeights(Array(3,11,0,0,0,9)) .setInputCols(Array("token", "chunk_embeddings")) .setOutputCol("rxnorm_resolution") .setPoolingStrategy("MAX") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, rxnorm_resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
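At its core, the resolver is a k-nearest-neighbour search over embedded chunk text. A toy sketch of the idea, with made-up 2-d vectors and plain Euclidean distance standing in for the model's real embedding space and weighted distance mix (the RxNorm codes are the ones shown in the Results below):

```python
import math

# Toy k-NN entity resolution: embed the chunk, return the codes of the
# k closest reference entries. Vectors are made up; the real model mixes
# several distances (including Word Mover's and Levenshtein).
REFERENCE = {
    "310488":  (0.9, 0.1),   # Glipizide (illustrative vector)
    "861731":  (0.7, 0.3),   # Glipizide Metformin hydrochloride
    "1488574": (0.1, 0.9),   # dapagliflozin
}

def resolve(chunk_vec, k=2):
    scored = sorted(REFERENCE.items(),
                    key=lambda kv: math.dist(kv[1], chunk_vec))
    return [code for code, _ in scored[:k]]

print(resolve((0.85, 0.15)))  # -> ['310488', '861731']
```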
{:.h2_title} ## Results ```bash +---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ | chunk| entity| target_text| code|confidence| +---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ | metformin|TREATMENT|Glipizide Metformin hydrochloride:::Glyburide Metformin hydrochloride:::Glipizide Metformin hydro...| 861731| 0.2000| | glipizide|TREATMENT| Glipizide:::Glipizide:::Glipizide:::Glipizide:::Glipizide Metformin hydrochloride| 310488| 0.2499| |dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG|TREATMENT|dapagliflozin saxagliptin:::dapagliflozin saxagliptin:::dapagliflozin saxagliptin:::dapagliflozin...|1925504| 0.2080| | dapagliflozin|TREATMENT| dapagliflozin:::dapagliflozin:::dapagliflozin:::dapagliflozin:::dapagliflozin saxagliptin|1488574| 0.2492| +---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |----------------|-------------------------------------| | Name: | chunkresolve_rxnorm_xsmall_clinical | | Type: | ChunkEntityResolverModel | | Compatibility: | Spark NLP 2.5.2+ | | License: | Licensed | |Edition:|Official| | |Input labels: | [token, chunk_embeddings] | |Output labels: | [entity] | | Language: | en | | Case sensitive: | True | | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained on December 2019 RxNorm Subset http://www.snomed.org/ --- layout: model title: Finnish asr_wav2vec2_large_xlsr_53_finnish_by_Tommi TFWav2Vec2ForCTC from Tommi author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_finnish_by_Tommi date: 2022-09-24 tags: 
[wav2vec2, fi, audio, open_source, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_finnish_by_Tommi` is a Finnish model originally trained by Tommi. NOTE: This model only works on a CPU; to run it on a GPU, use `asr_wav2vec2_large_xlsr_53_finnish_by_Tommi_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_finnish_by_Tommi_fi_4.2.0_3.0_1664020930598.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_finnish_by_Tommi_fi_4.2.0_3.0_1664020930598.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_finnish_by_Tommi", "fi")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_finnish_by_Tommi", "fi") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_finnish_by_Tommi| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|fi| |Size:|1.2 GB| --- layout: model title: Pipeline to Detect Medication Entities, Assign Assertion Status and Find Relations author: John Snow Labs name: explain_clinical_doc_medication date: 2022-04-01 tags: [licensed, en, clinical, ner, assertion, relation_extraction, posology] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.4.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pipeline for detecting posology entities with the `ner_posology_large` NER model, assigning their assertion status with the `assertion_jsl` model, and extracting relations between posology-related entities with the `posology_re` relation extraction model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_medication_en_3.4.2_3.0_1648813363898.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_medication_en_3.4.2_3.0_1648813363898.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("explain_clinical_doc_medication", "en", "clinical/models") result = pipeline.fullAnnotate("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2. She received a course of Bactrim for 14 days for UTI. She was prescribed 5000 units of Fragmin subcutaneously daily, and along with Lantus 40 units subcutaneously at bedtime.""")[0] ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("explain_clinical_doc_medication", "en", "clinical/models") val result = pipeline.fullAnnotate("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2. She received a course of Bactrim for 14 days for UTI. She was prescribed 5000 units of Fragmin subcutaneously daily, and along with Lantus 40 units subcutaneously at bedtime.""")(0) ``` {:.nlu-block} ```python import nlu nlu.load("en.explain_dco.clinical_medication.pipeline").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2. She received a course of Bactrim for 14 days for UTI. She was prescribed 5000 units of Fragmin subcutaneously daily, and along with Lantus 40 units subcutaneously at bedtime.""") ```
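The `posology_re` stage links each attribute entity (DOSAGE, ROUTE, FREQUENCY, DURATION) to a DRUG mention. The real model is learned, not rule-based; as a rough sketch of the output shape only, pairing each attribute chunk with the nearest drug chunk (the example chunks come from the sentence above):

```python
# Rough sketch of drug-attribute pairing: attach each attribute chunk to
# the nearest DRUG chunk in the sentence. This only illustrates the
# relation-output shape; the real posology_re model is learned.
ATTRS = {"DOSAGE", "ROUTE", "FREQUENCY", "DURATION"}

def pair_relations(chunks):
    """chunks: list of (position, text, label) tuples, in reading order."""
    drugs = [(pos, text) for pos, text, lab in chunks if lab == "DRUG"]
    rels = []
    for pos, text, lab in chunks:
        if lab in ATTRS and drugs:
            _, dtext = min(drugs, key=lambda d: abs(d[0] - pos))
            rels.append((f"DRUG-{lab}", dtext, text))
    return rels

chunks = [(0, "Bactrim", "DRUG"), (1, "for 14 days", "DURATION"),
          (2, "5000 units", "DOSAGE"), (3, "Fragmin", "DRUG")]
print(pair_relations(chunks))
# -> [('DRUG-DURATION', 'Bactrim', 'for 14 days'),
#     ('DRUG-DOSAGE', 'Fragmin', '5000 units')]
```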
## Results ```bash +----+----------------+------------+ | | chunks | entities | |---:|:---------------|:-----------| | 0 | insulin | DRUG | | 1 | Bactrim | DRUG | | 2 | for 14 days | DURATION | | 3 | 5000 units | DOSAGE | | 4 | Fragmin | DRUG | | 5 | subcutaneously | ROUTE | | 6 | daily | FREQUENCY | | 7 | Lantus | DRUG | | 8 | 40 units | DOSAGE | | 9 | subcutaneously | ROUTE | | 10 | at bedtime | FREQUENCY | +----+----------------+------------+ +----+----------+------------+-------------+ | | chunks | entities | assertion | |---:|:---------|:-----------|:------------| | 0 | insulin | DRUG | Present | | 1 | Bactrim | DRUG | Past | | 2 | Fragmin | DRUG | Planned | | 3 | Lantus | DRUG | Planned | +----+----------+------------+-------------+ +----------------+-----------+------------+-----------+----------------+ | relation | entity1 | chunk1 | entity2 | chunk2 | |:---------------|:----------|:-----------|:----------|:---------------| | DRUG-DURATION | DRUG | Bactrim | DURATION | for 14 days | | DOSAGE-DRUG | DOSAGE | 5000 units | DRUG | Fragmin | | DRUG-ROUTE | DRUG | Fragmin | ROUTE | subcutaneously | | DRUG-FREQUENCY | DRUG | Fragmin | FREQUENCY | daily | | DRUG-DOSAGE | DRUG | Lantus | DOSAGE | 40 units | | DRUG-ROUTE | DRUG | Lantus | ROUTE | subcutaneously | | DRUG-FREQUENCY | DRUG | Lantus | FREQUENCY | at bedtime | +----------------+-----------+------------+-----------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_clinical_doc_medication| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternal - NerConverterInternal - AssertionDLModel - PerceptronModel - DependencyParserModel - PosologyREModel --- layout: model title: Pipeline to Detect Clinical Entities (WIP Greedy) author: John Snow Labs 
name: jsl_ner_wip_greedy_clinical_pipeline date: 2022-03-21 tags: [licensed, ner, wip, clinical, greedy, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [jsl_ner_wip_greedy_clinical](https://nlp.johnsnowlabs.com/2021/03/31/jsl_ner_wip_greedy_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_greedy_clinical_pipeline_en_3.4.1_3.0_1647866343183.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_greedy_clinical_pipeline_en_3.4.1_3.0_1647866343183.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("jsl_ner_wip_greedy_clinical_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` ```scala val pipeline = new PretrainedPipeline("jsl_ner_wip_greedy_clinical_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. 
His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.wip_greedy_clinical.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
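The `## Results` section pairs each extracted chunk with its NER label. As a pure-Python sketch (independent of Spark NLP; the parallel-list shape is illustrative, not the pipeline's API), rows like those in the results table can be assembled from chunk and label lists:

```python
def to_rows(chunks, labels):
    """Pair each extracted chunk with its NER label, as in the results table."""
    assert len(chunks) == len(labels), "chunks and labels must be parallel lists"
    return list(zip(chunks, labels))

rows = to_rows(["21-day-old", "Caucasian", "male"], ["Age", "Race_Ethnicity", "Gender"])
print(rows)  # [('21-day-old', 'Age'), ('Caucasian', 'Race_Ethnicity'), ('male', 'Gender')]
```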
## Results ```bash +----------------------------------------------+----------------------------+ |chunk                                         |ner_label                   | +----------------------------------------------+----------------------------+ |21-day-old                                    |Age                         | |Caucasian                                     |Race_Ethnicity              | |male                                          |Gender                      | |for 2 days                                    |Duration                    | |congestion                                    |Symptom                     | |mom                                           |Gender                      | |suctioning yellow discharge                   |Symptom                     | |nares                                         |External_body_part_or_region| |she                                           |Gender                      | |mild problems with his breathing while feeding|Symptom                     | |perioral cyanosis                             |Symptom                     | |retractions                                   |Symptom                     | |One day ago                                   |RelativeDate                | |mom                                           |Gender                      | |tactile temperature                           |Symptom                     | |Tylenol                                       |Drug                        | |Baby                                          |Age                         | |decreased p.o. intake                         |Symptom                     | |His                                           |Gender                      | |20 minutes                                    |Duration                    | +----------------------------------------------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|jsl_ner_wip_greedy_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011) author: John Snow Labs name: distilbert_token_classifier_autotrain_company_all_903429540 date: 2023-03-14 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-company_all-903429540` is an English model originally trained by `ismail-lucifer011`.
## Predicted Entities `Company`, `OOV` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_company_all_903429540_en_4.3.1_3.0_1678783374440.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_company_all_903429540_en_4.3.1_3.0_1678783374440.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_company_all_903429540","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_company_all_903429540","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_company_all_903429540| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ismail-lucifer011/autotrain-company_all-903429540 --- layout: model title: English asr_distil_wav2vec2 TFWav2Vec2ForCTC from OthmaneJ author: John Snow Labs name: pipeline_asr_distil_wav2vec2 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_distil_wav2vec2` is an English model originally trained by OthmaneJ. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_distil_wav2vec2_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_distil_wav2vec2_en_4.2.0_3.0_1664020989973.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_distil_wav2vec2_en_4.2.0_3.0_1664020989973.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_distil_wav2vec2', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_distil_wav2vec2", lang = "en") val annotations = pipeline.transform(audioDF) ```
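Wav2Vec2-style models consume the waveform as an array of floats, typically 16 kHz mono samples normalized to [-1.0, 1.0). A minimal pure-Python sketch (an assumption about the upstream audio format, not part of this pipeline's API) of normalizing signed 16-bit PCM samples before placing them in the audio column:

```python
def pcm16_to_float(samples):
    """Scale signed 16-bit PCM samples into the float range [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]

# Silence, half amplitude, and the most negative 16-bit value:
print(pcm16_to_float([0, 16384, -32768]))  # [0.0, 0.5, -1.0]
```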
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_distil_wav2vec2| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|188.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Pipeline to Classify Texts into 4 News Categories author: John Snow Labs name: bert_sequence_classifier_age_news_pipeline date: 2022-06-19 tags: [ag_news, news, bert, bert_sequence, classification, en, open_source] task: Text Classification language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_sequence_classifier_age_news_en](https://nlp.johnsnowlabs.com/2021/11/07/bert_sequence_classifier_age_news_en.html) model, which is imported from `HuggingFace`. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_age_news_pipeline_en_4.0.0_3.0_1655653779437.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_age_news_pipeline_en_4.0.0_3.0_1655653779437.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python news_pipeline = PretrainedPipeline("bert_sequence_classifier_age_news_pipeline", lang = "en") news_pipeline.annotate("Microsoft has taken its first step into the metaverse.") ``` ```scala val news_pipeline = new PretrainedPipeline("bert_sequence_classifier_age_news_pipeline", lang = "en") news_pipeline.annotate("Microsoft has taken its first step into the metaverse.") ```
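The classifier predicts one of the four AG News categories. A small helper (a sketch; the expanded descriptions are our own, only the four short label strings follow the AG News convention) for turning the predicted label into a human-readable description:

```python
AG_NEWS_LABELS = {
    "World": "international news",
    "Sports": "sports news",
    "Business": "business and finance news",
    "Sci/Tech": "science and technology news",
}

def describe(label):
    # Fall back to the raw label for anything unexpected.
    return AG_NEWS_LABELS.get(label, label)

print(describe("Sci/Tech"))  # science and technology news
```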
## Results ```bash ['Sci/Tech'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_age_news_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|42.4 MB| ## Included Models - DocumentAssembler - TokenizerModel - BertForSequenceClassification --- layout: model title: Legal NER for NDA (Return of Confidential Information Clauses) author: John Snow Labs name: legner_nda_return_of_conf_info date: 2023-04-19 tags: [en, legal, licensed, ner, nda] task: Named Entity Recognition language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a NER model, aimed to be run **only** after detecting the `RETURN_OF_CONF_INFO` clause with a proper classifier (use `legmulticlf_mnda_sections_paragraph_other` model for that purpose). It will extract the following entities: `ARCHIVAL_PURPOSE`, and `LEGAL_PURPOSE`. ## Predicted Entities `ARCHIVAL_PURPOSE`, `LEGAL_PURPOSE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_nda_return_of_conf_info_en_1.0.0_3.0_1681936414470.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_nda_return_of_conf_info_en_1.0.0_3.0_1681936414470.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_nda_return_of_conf_info", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""Notwithstanding the foregoing, the Recipient and its Representatives may retain copies of the Confidential Information to the extent that such retention is required to demonstrate compliance with applicable law or governmental rule or regulation, to the extent included in any board or executive documents relating to the proposed business relationship, and in its archives for backup purposes subject to the confidentiality provisions of this Agreement."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
## Results ```bash +--------------+----------------+ |chunk         |ner_label       | +--------------+----------------+ |applicable law|LEGAL_PURPOSE   | |governmental  |LEGAL_PURPOSE   | |regulation    |LEGAL_PURPOSE   | |archives      |ARCHIVAL_PURPOSE| |backup        |ARCHIVAL_PURPOSE| +--------------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_nda_return_of_conf_info| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References In-house annotations on Non-disclosure Agreements ## Benchmarking ```bash label precision recall f1-score support ARCHIVAL_PURPOSE 0.94 1.00 0.97 16 LEGAL_PURPOSE 0.78 0.85 0.81 33 micro-avg 0.83 0.90 0.86 49 macro-avg 0.86 0.92 0.89 49 weighted-avg 0.83 0.90 0.86 49 ``` --- layout: model title: English asr_xlsr_wav2vec_english TFWav2Vec2ForCTC from harshit345 author: John Snow Labs name: asr_xlsr_wav2vec_english date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xlsr_wav2vec_english` is an English model originally trained by harshit345.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_xlsr_wav2vec_english_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xlsr_wav2vec_english_en_4.2.0_3.0_1664043295043.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xlsr_wav2vec_english_en_4.2.0_3.0_1664043295043.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_xlsr_wav2vec_english", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_xlsr_wav2vec_english", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_xlsr_wav2vec_english| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: English BertForQuestionAnswering Base Uncased model (from AurasuiteAgreements) author: John Snow Labs name: bert_qa_base_uncased_contracts_finetuned_on_squadv2 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-contracts-finetuned-on-squadv2` is an English model originally trained by `AurasuiteAgreements`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_contracts_finetuned_on_squadv2_en_4.0.0_3.0_1657183854663.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_contracts_finetuned_on_squadv2_en_4.0.0_3.0_1657183854663.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_contracts_finetuned_on_squadv2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_contracts_finetuned_on_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
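The `answer` annotation carries character offsets into the context document. As a pure-Python sketch (Spark NLP annotation offsets are inclusive on both ends), the answer text can be recovered from the context string like this:

```python
def extract_answer(context, begin, end):
    # Spark NLP annotations use inclusive character offsets,
    # so the slice must extend one past `end`.
    return context[begin:end + 1]

ctx = "My name is Clara and I live in Berkeley."
print(extract_answer(ctx, 11, 15))  # Clara
```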
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_contracts_finetuned_on_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/AurasuiteAgreements/bert-base-uncased-contracts-finetuned-on-squadv2 --- layout: model title: Turkish BertForTokenClassification Cased model (from busecarik) author: John Snow Labs name: bert_token_classifier_berturk_sunlp_ner_turkish date: 2022-11-30 tags: [tr, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: tr edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `berturk-sunlp-ner-turkish` is a Turkish model originally trained by `busecarik`. ## Predicted Entities `PRODUCT`, `TIME`, `MONEY`, `ORGANIZATION`, `LOCATION`, `TVSHOW`, `PERSON` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_berturk_sunlp_ner_turkish_tr_4.2.4_3.0_1669815581712.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_berturk_sunlp_ner_turkish_tr_4.2.4_3.0_1669815581712.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_berturk_sunlp_ner_turkish","tr") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_berturk_sunlp_ner_turkish","tr") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
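The classifier emits one IOB-style tag per token (`B-PERSON`, `I-LOCATION`, `O`, …). To group tagged tokens into entity chunks without an extra converter stage, a minimal pure-Python sketch (the sample tokens are illustrative):

```python
def bio_to_chunks(tokens, tags):
    """Merge B-/I- tagged tokens into (chunk, label) pairs; drop O tokens."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # O tag, or an I- tag that does not continue the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(bio_to_chunks(
    ["Ali", "Veli", "Istanbul", "dedi"],
    ["B-PERSON", "I-PERSON", "B-LOCATION", "O"],
))  # [('Ali Veli', 'PERSON'), ('Istanbul', 'LOCATION')]
```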
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_berturk_sunlp_ner_turkish| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|tr| |Size:|689.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/busecarik/berturk-sunlp-ner-turkish - https://github.com/SU-NLP/SUNLP-Twitter-NER-Dataset --- layout: model title: Legal Conditions Clause Binary Classifier author: John Snow Labs name: legclf_conditions_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `conditions` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `conditions` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_conditions_clause_en_1.0.0_3.2_1660123336869.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_conditions_clause_en_1.0.0_3.2_1660123336869.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_conditions_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
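When several of these binary clause classifiers are run over the same text, their outputs can be collapsed into the set of detected clause types. A pure-Python sketch, assuming each classifier returns either its clause name or the `other` fallback:

```python
def detected_clauses(predictions):
    """Keep every classifier output that is not the 'other' fallback."""
    return sorted({p for p in predictions if p != "other"})

print(detected_clauses(["conditions", "other", "disclaimer", "other"]))
# ['conditions', 'disclaimer']
```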
## Results ```bash +------------+ |result      | +------------+ |[conditions]| |[other]     | |[other]     | |[conditions]| +------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_conditions_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support conditions 0.89 0.78 0.83 82 other 0.90 0.95 0.93 175 accuracy - - 0.90 257 macro-avg 0.90 0.87 0.88 257 weighted-avg 0.90 0.90 0.90 257 ``` --- layout: model title: English DistilBertForQuestionAnswering Cased model (from ksabeh) author: John Snow Labs name: distilbert_qa_attribute_correction_mlm_titles date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-attribute-correction-mlm-titles` is an English model originally trained by `ksabeh`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_attribute_correction_mlm_titles_en_4.3.0_3.0_1672766395443.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_attribute_correction_mlm_titles_en_4.3.0_3.0_1672766395443.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_attribute_correction_mlm_titles","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_attribute_correction_mlm_titles","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_attribute_correction_mlm_titles| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ksabeh/distilbert-attribute-correction-mlm-titles --- layout: model title: Legal Disclaimer Clause Binary Classifier author: John Snow Labs name: legclf_disclaimer_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `disclaimer` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `disclaimer` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_disclaimer_clause_en_1.0.0_3.2_1660123418981.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_disclaimer_clause_en_1.0.0_3.2_1660123418981.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_disclaimer_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +------------+ |result      | +------------+ |[disclaimer]| |[other]     | |[other]     | |[disclaimer]| +------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_disclaimer_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support disclaimer 0.94 0.89 0.91 35 other 0.93 0.96 0.95 55 accuracy - - 0.93 90 macro-avg 0.93 0.92 0.93 90 weighted-avg 0.93 0.93 0.93 90 ``` --- layout: model title: Hebrew Lemmatizer author: John Snow Labs name: lemma date: 2020-12-09 task: Lemmatization language: he edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [lemmatizer, he, open_source] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous.
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_he_2.7.0_2.4_1607522684355.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_he_2.7.0_2.4_1607522684355.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of a pipeline after tokenisation.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "he") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) results = light_pipeline.fullAnnotate(["""להגיש הגישה הגיש הגשתי יגיש מגישים הגישו תגיש הגשנו מגישה"""]) ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "he") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("להגיש הגישה הגיש הגשתי יגיש מגישים הגישו תגיש הגשנו מגישה").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""להגיש הגישה הגיש הגשתי יגיש מגישים הגישו תגיש הגשנו מגישה"""] lemma_df = nlu.load('he.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
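At its core, a dictionary lemmatizer like this maps each inflected surface form to its root via lookup. The sketch below illustrates that idea in plain Python with a tiny, hand-made table — it is not the model's actual dictionary, which is derived from Universal Dependencies data:

```python
# Minimal sketch of dictionary-based lemmatization: each inflected form
# is looked up in a form -> lemma table; unknown forms pass through.
# This tiny table is illustrative only, not the model's real dictionary.
lemma_dict = {
    "הגישה": "הגיש",
    "הגשתי": "הגיש",
    "יגיש": "הגיש",
    "מגישים": "הגיש",
}

def lemmatize(tokens, table):
    """Return the lemma for each token, falling back to the token itself."""
    return [table.get(tok, tok) for tok in tokens]

print(lemmatize(["הגישה", "יגיש", "שלום"], lemma_dict))
```

The real annotator adds contextual disambiguation on top of this lookup, which is why ambiguous forms can still resolve to the correct root.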
## Results ```bash {'lemma': [Annotation(token, 0, 4, הגיש, {'sentence': '0'}), Annotation(token, 6, 10, הגיש, {'sentence': '0'}), Annotation(token, 12, 15, הגיש, {'sentence': '0'}), Annotation(token, 17, 21, הגיש, {'sentence': '0'}), Annotation(token, 23, 26, הגיש, {'sentence': '0'}), Annotation(token, 28, 33, הגיש, {'sentence': '0'}), Annotation(token, 35, 39, הגיש, {'sentence': '0'}), Annotation(token, 41, 44, הגיש, {'sentence': '0'}), Annotation(token, 46, 50, הגיש, {'sentence': '0'}), Annotation(token, 52, 56, הגיש, {'sentence': '0'})]} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[tokens]| |Output Labels:|[lemma]| |Language:|he| ## Data Source This model is trained on data obtained from [https://universaldependencies.org/](https://universaldependencies.org/) --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from Moussab) author: John Snow Labs name: roberta_qa_deepset_base_squad2_orkg_how_1e_4 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deepset-roberta-base-squad2-orkg-how-1e-4` is an English model originally trained by `Moussab`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_how_1e_4_en_4.3.0_3.0_1674209475140.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_how_1e_4_en_4.3.0_3.0_1674209475140.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_how_1e_4","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_how_1e_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
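Extractive QA models such as this one do not generate free text; they score candidate start and end positions inside the context and return the span between them. A plain-Python sketch of that final span-slicing step — the offsets here are hypothetical, whereas the real model derives them from token-level logits:

```python
def extract_answer(context: str, start: int, end: int) -> str:
    # Slice the predicted [start, end) character span out of the context.
    return context[start:end]

context = "My name is Clara and I live in Berkeley."
# Hypothetical predicted offsets for the answer to "What's my name?"
print(extract_answer(context, 11, 16))  # -> Clara
```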
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_deepset_base_squad2_orkg_how_1e_4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Moussab/deepset-roberta-base-squad2-orkg-how-1e-4 --- layout: model title: English T5ForConditionalGeneration Cased model (from rajistics) author: John Snow Labs name: t5_informal_formal_style_transfer date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `informal_formal_style_transfer` is an English model originally trained by `rajistics`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_informal_formal_style_transfer_en_4.3.0_3.0_1675103071459.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_informal_formal_style_transfer_en_4.3.0_3.0_1675103071459.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_informal_formal_style_transfer","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_informal_formal_style_transfer","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_informal_formal_style_transfer| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|924.5 MB| ## References - https://huggingface.co/rajistics/informal_formal_style_transfer - https://github.com/PrithivirajDamodaran/Styleformer - https://www.aclweb.org/anthology/D19-5502.pdf - http://cs230.stanford.edu/projects_winter_2020/reports/32069807.pdf - https://arxiv.org/pdf/1804.06437.pdf --- layout: model title: Spanish Named Entity Recognition (from mrm8488) author: John Snow Labs name: bert_ner_TinyBERT_spanish_uncased_finetuned_ner date: 2022-05-09 tags: [bert, ner, token_classification, es, open_source] task: Named Entity Recognition language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `TinyBERT-spanish-uncased-finetuned-ner` is a Spanish model originally trained by `mrm8488`. ## Predicted Entities `LOC`, `PER`, `ORG`, `MISC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_TinyBERT_spanish_uncased_finetuned_ner_es_3.4.2_3.0_1652096474583.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_TinyBERT_spanish_uncased_finetuned_ner_es_3.4.2_3.0_1652096474583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_TinyBERT_spanish_uncased_finetuned_ner","es") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_TinyBERT_spanish_uncased_finetuned_ner","es") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
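The token classifier emits one tag per token in BIO notation (e.g. `B-PER`, `I-PER`, `O`); downstream code typically folds those into entity chunks. A minimal plain-Python sketch of that folding step, using made-up tags for illustration rather than actual model output:

```python
def bio_to_chunks(tokens, tags):
    """Fold per-token BIO tags into (entity_type, text) chunks."""
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a fresh chunk.
            if current:
                chunks.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # Continuation of the current chunk.
            current[1].append(tok)
        else:
            # O tag (or a dangling I-) closes any open chunk.
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(toks)) for label, toks in chunks]

tokens = ["Amo", "Spark", "NLP"]
tags = ["O", "B-MISC", "I-MISC"]  # illustrative tags, not real model output
print(bio_to_chunks(tokens, tags))  # -> [('MISC', 'Spark NLP')]
```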
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_TinyBERT_spanish_uncased_finetuned_ner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|es| |Size:|54.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/mrm8488/TinyBERT-spanish-uncased-finetuned-ner - https://www.kaggle.com/nltkdata/conll-corpora - https://twitter.com/mrm8488 --- layout: model title: Financial News Multilabel Classifier author: John Snow Labs name: finmulticlf_news date: 2022-08-30 tags: [en, finance, classification, news, licensed] task: Text Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: MultiClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a multilabel classification model trained on news scraped from the Internet, with in-house annotations and label grouping. Because the model is multilabel, the output for a financial news item can be an array of 0 (no classes detected), 1 (one class), or N (several classes detected). The available classes are: - acq: Acquisition / Purchase operations - finance: Generic financial news - fuel: News about fuel and energy sources - jobs: News about jobs, employment rates, etc. - livestock: News about animals and livestock - mineral: News about minerals such as copper, gold, silver, coal, etc.
- plant: News about greens, plants, cereals, etc. - trade: Trading news ## Predicted Entities `acq`, `finance`, `fuel`, `jobs`, `livestock`, `mineral`, `plant`, `trade` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFICATION_MULTILABEL/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finmulticlf_news_en_1.0.0_3.2_1661857631377.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finmulticlf_news_en_1.0.0_3.2_1661857631377.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") \ .setCleanupMode("shrink") embeddings = nlp.UniversalSentenceEncoder.pretrained() \ .setInputCols("document") \ .setOutputCol("embeddings") docClassifier = nlp.MultiClassifierDLModel.pretrained("finmulticlf_news", "en","finance/models")\ .setInputCols("embeddings") \ .setOutputCol("category") pipeline = nlp.Pipeline() \ .setStages( [ documentAssembler, embeddings, docClassifier ] ) empty_data = spark.createDataFrame([[""]]).toDF("text") pipelineModel = pipeline.fit(empty_data) text = [""" ECUADOR HAS TRADE SURPLUS IN FIRST FOUR MONTHS Ecuador posted a trade surplus of 10.6 mln dlrs in the first four months of 1987 compared with a surplus of 271.7 mln in the same period in 1986, the central bank of Ecuador said in its latest monthly report. Ecuador suspended sales of crude oil, its principal export product, in March after an earthquake destroyed part of its oil-producing infrastructure. Exports in the first four months of 1987 were around 639 mln dlrs and imports 628.3 mln, compared with 771 mln and 500 mln respectively in the same period last year. Exports of crude and products in the first four months were around 256.1 mln dlrs, compared with 403.3 mln in the same period in 1986. The central bank said that between January and May Ecuador sold 16.1 mln barrels of crude and 2.3 mln barrels of products, compared with 32 mln and 2.7 mln respectively in the same period last year. Ecuador's international reserves at the end of May were around 120.9 mln dlrs, compared with 118.6 mln at the end of April and 141.3 mln at the end of May 1986, the central bank said. gold reserves were 165.7 mln dlrs at the end of May compared with 124.3 mln at the end of April. """] lmodel = LightPipeline(pipelineModel) results = lmodel.annotate(text) ```
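MultiClassifierDL scores each class independently and keeps every class whose score clears a threshold (0.5 by default), which is how a single news item can end up with zero, one, or several labels. A plain-Python sketch of that decision step, with illustrative scores rather than real model output:

```python
def multilabel_predict(scores, threshold=0.5):
    """Keep every label whose independent score clears the threshold."""
    return sorted(label for label, s in scores.items() if s >= threshold)

# Illustrative per-class scores for one news item (made up for this sketch).
scores = {"acq": 0.07, "finance": 0.91, "fuel": 0.32, "trade": 0.84}
print(multilabel_predict(scores))  # -> ['finance', 'trade']
```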
## Results ```bash ['finance', 'trade'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finmulticlf_news| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|12.9 MB| ## References News scraped from the Internet and manual in-house annotations ## Benchmarking ```bash label precision recall f1-score support acq 0.94 0.92 0.93 718 finance 0.95 0.96 0.96 1499 fuel 0.91 0.86 0.88 286 jobs 0.86 0.57 0.69 21 livestock 0.93 0.44 0.60 57 mineral 0.87 0.62 0.72 121 plant 0.89 0.88 0.89 301 trade 0.79 0.72 0.75 113 micro-avg 0.93 0.90 0.92 3116 macro-avg 0.89 0.75 0.80 3116 weighted-avg 0.93 0.90 0.91 3116 samples-avg 0.91 0.91 0.91 3116 ``` --- layout: model title: English image_classifier_vit_vc_bantai__withoutAMBI_adunest ViTForImageClassification from AykeeSalazar author: John Snow Labs name: image_classifier_vit_vc_bantai__withoutAMBI_adunest date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_vc_bantai__withoutAMBI_adunest` is an English model originally trained by AykeeSalazar.
## Predicted Entities `nonViolation`, `publicDrinking`, `publicSmoking` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vc_bantai__withoutAMBI_adunest_en_4.1.0_3.0_1660166079256.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vc_bantai__withoutAMBI_adunest_en_4.1.0_3.0_1660166079256.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_vc_bantai__withoutAMBI_adunest", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_vc_bantai__withoutAMBI_adunest", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
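ViTForImageClassification scores each image against the three classes above and returns the best one; the final decision amounts to a softmax over the class logits followed by an argmax. A sketch of that step with made-up logits (the real values come from the transformer):

```python
import math

def classify(logits):
    """Softmax over raw logits, then pick the highest-probability class."""
    exps = {label: math.exp(v) for label, v in logits.items()}
    total = sum(exps.values())
    probs = {label: v / total for label, v in exps.items()}
    return max(probs, key=probs.get), probs

# Made-up logits for one image, for illustration only.
label, probs = classify({"nonViolation": 2.1, "publicDrinking": 0.3, "publicSmoking": -0.5})
print(label)  # -> nonViolation
```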
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_vc_bantai__withoutAMBI_adunest| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English image_classifier_vit_apes ViTForImageClassification from ducnapa author: John Snow Labs name: image_classifier_vit_apes date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_apes` is an English model originally trained by ducnapa. ## Predicted Entities `chimpanzee`, `gibbon`, `gorilla`, `orangutan` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_apes_en_4.1.0_3.0_1660172348568.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_apes_en_4.1.0_3.0_1660172348568.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_apes", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_apes", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_apes| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Stopwords Remover for Amharic language (228 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, am, open_source] task: Stop Words Removal language: am edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_am_3.4.1_3.0_1646673268681.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_am_3.4.1_3.0_1646673268681.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","am") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["አንተ አልተሻልክም።"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","am") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("አንተ አልተሻልክም።").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("am.stopwords").predict("""አንተ አልተሻልክም።""") ```
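At its core, StopWordsCleaner is a set-membership filter over tokens. The plain-Python sketch below shows that behavior; the single stopword here is an illustrative assumption (the pronoun አንተ, "you"), whereas the real model ships the full 228-entry stopwords-iso list for Amharic:

```python
def remove_stopwords(tokens, stopwords):
    """Drop any token that appears in the stopword set."""
    return [tok for tok in tokens if tok not in stopwords]

# Illustrative one-entry set; the pretrained model carries 228 entries.
stopwords = {"አንተ"}
print(remove_stopwords(["አንተ", "አልተሻልክም።"], stopwords))  # -> ['አልተሻልክም።']
```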
## Results ```bash +----------+ |result | +----------+ |[አልተሻልክም።]| +----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|am| |Size:|2.5 KB| --- layout: model title: TREC(50) Question Classifier author: John Snow Labs name: classifierdl_use_trec50 class: ClassifierDLModel language: en nav_key: models repository: public/models date: 03/05/2020 task: Text Classification edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [classifier] supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Classify open-domain, fact-based questions into sub categories of the following broad semantic categories: Abbreviation, Description, Entities, Human Beings, Locations or Numeric Values. {:.h2_title} ## Predicted Entities ``ENTY_animal``, ``ENTY_body``, ``ENTY_color``, ``ENTY_cremat``, ``ENTY_currency``, ``ENTY_dismed``, ``ENTY_event``, ``ENTY_food``, ``ENTY_instru``, ``ENTY_lang``, ``ENTY_letter``, ``ENTY_other``, ``ENTY_plant``, ``ENTY_product``, ``ENTY_religion``, ``ENTY_sport``, ``ENTY_substance``, ``ENTY_symbol``, ``ENTY_techmeth``, ``ENTY_termeq``, ``ENTY_veh``, ``ENTY_word``, ``DESC_def``, ``DESC_desc``, ``DESC_manner``, ``DESC_reason``, ``HUM_gr``, ``HUM_ind``, ``HUM_title``, ``HUM_desc``, ``LOC_city``, ``LOC_country``, ``LOC_mount``, ``LOC_other``, ``LOC_state``, ``NUM_code``, ``NUM_count``, ``NUM_date``, ``NUM_dist``, ``NUM_money``, ``NUM_ord``, ``NUM_other``, ``NUM_period``, ``NUM_perc``, ``NUM_speed``, ``NUM_temp``, ``NUM_volsize``, ``NUM_weight``, ``ABBR_abb``, ``ABBR_exp``. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_EN_TREC/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_EN_TREC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_trec50_en_2.5.0_2.4_1588493558481.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_trec50_en_2.5.0_2.4_1588493558481.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") use = UniversalSentenceEncoder.pretrained(lang="en") \ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") document_classifier = ClassifierDLModel.pretrained('classifierdl_use_trec50', 'en') \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") nlpPipeline = Pipeline(stages=[documentAssembler, use, document_classifier]) light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate('When did the construction of stone circles begin in the UK?') ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val use = UniversalSentenceEncoder.pretrained(lang="en") .setInputCols(Array("document")) .setOutputCol("sentence_embeddings") val document_classifier = ClassifierDLModel.pretrained("classifierdl_use_trec50", "en") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, use, document_classifier)) val data = Seq("When did the construction of stone circles begin in the UK?").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s.
Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."""] trec50_df = nlu.load('en.classify.trec50.use').predict(text, output_level = "document") trec50_df[["document", "trec50"]] ```
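Each of the 50 predicted labels packs a coarse TREC category and a fine-grained subtype into one string joined by an underscore — `NUM_date`, for instance, is the date subtype of Numeric Values. Splitting a prediction back into the two levels is a one-liner:

```python
def split_trec_label(label):
    """Split a fine-grained TREC label like "NUM_date" into (coarse, fine)."""
    coarse, fine = label.split("_", 1)
    return coarse, fine

print(split_trec_label("NUM_date"))  # -> ('NUM', 'date')
```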
{:.h2_title} ## Results {:.table-model} ```bash +------------------------------------------------------------------------------------------------+------------+ |document |class | +------------------------------------------------------------------------------------------------+------------+ |When did the construction of stone circles begin in the UK? | NUM_date | +------------------------------------------------------------------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} | Model Name | classifierdl_use_trec50 | | Model Class | ClassifierDLModel | | Spark NLP Compatibility | 2.5.0 | | Spark Compatibility | 2.4 | | License | open source| | Edition | public | | Input Labels | [document, sentence_embeddings] | | Output Labels | [class] | | Language | en| | Upstream Dependencies | with tfhub_use | {:.h2_title} ## Data Source This model is trained on the 50-class version of the TREC dataset. http://search.r-project.org/library/textdata/html/dataset_trec.html --- layout: model title: Malay T5ForConditionalGeneration Base Cased model (from mesolitica) author: John Snow Labs name: t5_finetune_paraphrase_base_standard_bahasa_cased date: 2023-01-30 tags: [ms, open_source, t5, tensorflow] task: Text Generation language: ms edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `finetune-paraphrase-t5-base-standard-bahasa-cased` is a Malay model originally trained by `mesolitica`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_finetune_paraphrase_base_standard_bahasa_cased_ms_4.3.0_3.0_1675102005289.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_finetune_paraphrase_base_standard_bahasa_cased_ms_4.3.0_3.0_1675102005289.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_finetune_paraphrase_base_standard_bahasa_cased","ms") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_finetune_paraphrase_base_standard_bahasa_cased","ms") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_finetune_paraphrase_base_standard_bahasa_cased| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ms| |Size:|926.8 MB| ## References - https://huggingface.co/mesolitica/finetune-paraphrase-t5-base-standard-bahasa-cased - https://github.com/huseinzol05/malaya/tree/master/session/paraphrase/hf-t5 --- layout: model title: T5 Clinical Summarization / QA model author: John Snow Labs name: t5_base_mediqa_mnli date: 2021-02-19 tags: [t5, licensed, clinical, en] supported: true recommended: true task: Summarization language: en nav_key: models edition: Healthcare NLP 2.7.4 spark_version: 2.4 annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The T5 transformer model described in the seminal paper “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” can perform a variety of tasks, such as text summarization, question answering and translation. More details about using the model can be found in the paper (https://arxiv.org/pdf/1910.10683.pdf). This model is specifically trained on medical data for text summarization and question answering. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/t5_base_mediqa_mnli_en_2.7.4_2.4_1613750257481.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/t5_base_mediqa_mnli_en_2.7.4_2.4_1613750257481.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("documents") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols("documents")\ .setOutputCol("sentence") t5 = T5Transformer.pretrained("t5_base_mediqa_mnli", "en", "clinical/models") \ .setInputCols(["sentence"]) \ .setOutputCol("t5_output")\ .setTask("summarize medical questions:")\ .setMaxOutputLength(200) pipeline = Pipeline(stages=[ document_assembler, sentence_detector, t5 ]) data = spark.createDataFrame([ [1, "content:SUBJECT: Normal physical traits but no period MESSAGE: I'm a 40 yr. old woman that has infantile reproductive organs and have never experienced a mensus. I have had Doctors look but they all say I just have infantile female reproductive organs. When I try to look for answers on the internet I cannot find anything. ALL my \"girly\" parts are normal. My organs never matured. Could you give me more information please. focus:all"] ]).toDF('id', 'text') results = pipeline.fit(data).transform(data) results.select("t5_output.result").show(truncate=False) ``` {:.nlu-block} ```python import nlu nlu.load("en.t5.mediqa").predict("""content:SUBJECT: Normal physical traits but no period MESSAGE: I'm a 40 yr. old woman that has infantile reproductive organs and have never experienced a mensus. I have had Doctors look but they all say I just have infantile female reproductive organs. When I try to look for answers on the internet I cannot find anything. ALL my \""") ```
## Results ```bash What are the treatments for mensus?, What are the treatments for infantile female reproductive organs?, What are the treatments for cancer?, What are the treatments for organ transplantation?, What are the treatments for cancer?, What are the treatments for cancer? ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_base_mediqa_mnli| |Compatibility:|Healthcare NLP 2.7.4+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Data Source Trained on the MEDIQA 2021 and MedNLI datasets. --- layout: model title: Explain Document Pipeline for Swedish author: John Snow Labs name: explain_document_md date: 2021-03-22 tags: [open_source, swedish, explain_document_md, pipeline, sv] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: sv edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_md is a pretrained pipeline that performs the most common text processing steps on your dataframe: sentence detection, tokenization, lemmatization, part-of-speech tagging, word embeddings and named entity recognition. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_sv_3.0.0_3.0_1616436435552.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_md_sv_3.0.0_3.0_1616436435552.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_md', lang = 'sv') annotations = pipeline.fullAnnotate("Hej från John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_md", lang = "sv") val result = pipeline.fullAnnotate("Hej från John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hej från John Snow Labs! "] result_df = nlu.load('sv.explain.md').predict(text) result_df ```
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:------------------------------|:-----------------------------|:-----------------------------------------|:-----------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Hej från John Snow Labs! '] | ['Hej från John Snow Labs!'] | ['Hej', 'från', 'John', 'Snow', 'Labs!'] | ['Hej', 'från', 'John', 'Snow', 'Labs!'] | ['NOUN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.4006600081920624,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|sv| --- layout: model title: Fast Neural Machine Translation Model from Tok Pisin to English author: John Snow Labs name: opus_mt_tpi_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, tpi, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `tpi` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_tpi_en_xx_2.7.0_2.4_1609166984301.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_tpi_en_xx_2.7.0_2.4_1609166984301.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_tpi_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_tpi_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.tpi.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_tpi_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Sentence Entity Resolver for ICD-O (sbiobertresolve_icdo_augmented) author: John Snow Labs name: sbiobertresolve_icdo_augmented date: 2021-06-22 tags: [licensed, en, clinical, entity_resolution] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD-O codes using sBioBert sentence embeddings. It is augmented with site information from ICD-10 and synonyms from SNOMED vocabularies, and trained on a dataset 20x larger than the one used for the previous version of the ICD-O resolver. Given an oncological entity found in the text (via NER models like ner_jsl), it returns top terms and resolutions along with the corresponding ICD-O codes to present more granularity with respect to the body parts mentioned. It also returns the original `Topography` codes and the `Morphology` codes, comprising `Histology` and `Behavior` codes, with descriptions. ## Predicted Entities ICD-O Codes and their normalized definition with `sbiobert_base_cased_mli` embeddings. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icdo_augmented_en_3.1.0_3.0_1624344274944.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icdo_augmented_en_3.1.0_3.0_1624344274944.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("jsl_ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "jsl_ner"]) \ .setOutputCol("ner_chunk") chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") icdo_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_icdo_augmented","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icdo_resolver]) data = spark.createDataFrame([["The patient is a very pleasant 61-year-old female with a strong family history of colon polyps. The patient reports her first polyps noted at the age of 50. We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma. She also has history of several malignancies in the family. Her father died of a brain tumor at the age of 81. Her sister died at the age of 65 breast cancer. 
She has two maternal aunts with history of lung cancer both of whom were smoker. Also a paternal grandmother who was diagnosed with leukemia at 86 and a paternal grandfather who had B-cell lymphoma."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala ... val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("jsl_ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "jsl_ner")) .setOutputCol("ner_chunk") val chunk2doc = new Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val icdo_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_icdo_augmented","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icdo_resolver)) val data = Seq("The patient is a very pleasant 61-year-old female with a strong family history of colon polyps. The patient reports her first polyps noted at the age of 50. 
We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma. She also has history of several malignancies in the family. Her father died of a brain tumor at the age of 81. Her sister died at the age of 65 breast cancer. She has two maternal aunts with history of lung cancer both of whom were smoker. Also a paternal grandmother who was diagnosed with leukemia at 86 and a paternal grandfather who had B-cell lymphoma.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icdo_augmented").predict("""The patient is a very pleasant 61-year-old female with a strong family history of colon polyps. The patient reports her first polyps noted at the age of 50. We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma. She also has history of several malignancies in the family. Her father died of a brain tumor at the age of 81. Her sister died at the age of 65 breast cancer. She has two maternal aunts with history of lung cancer both of whom were smoker. Also a paternal grandmother who was diagnosed with leukemia at 86 and a paternal grandfather who had B-cell lymphoma.""") ```
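The combined codes this resolver returns take the form `Morphology||Topography` (e.g. `9971/3||C38.3`), where the morphology part joins a histology code and a behavior digit with `/`. A plain-Python sketch of how such a code could be unpacked for downstream use (the function and field names are our own, not part of the Spark NLP API):

```python
def split_icdo(code: str) -> dict:
    """Split a combined ICD-O resolution code into its parts.

    For "9971/3||C38.3":
      - morphology "9971/3" = histology "9971" + behavior "3"
      - topography "C38.3"  = the ICD-10-style site code
    """
    morphology, topography = code.split("||")
    histology, behavior = morphology.split("/")
    return {"histology": histology, "behavior": behavior, "topography": topography}

print(split_icdo("9971/3||C38.3"))
# {'histology': '9971', 'behavior': '3', 'topography': 'C38.3'}
```

The same split applies to every entry in the `all_k_codes` column, which joins its top-k candidates with `:::`.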
## Results ```bash +--------------------+-----+---+-----------+-------------+-------------------------+-------------------------+ | chunk|begin|end| entity| code| all_k_resolutions| all_k_codes| +--------------------+-----+---+-----------+-------------+-------------------------+-------------------------+ | mesothelioma| 255|266|Oncological|9971/3||C38.3|malignant mediastinal ...|9971/3||C38.3:::8854/3...| |several malignancies| 293|312|Oncological|8894/3||C39.8|overlapping malignant ...|8894/3||C39.8:::8070/2...| | brain tumor| 350|360|Oncological|9562/0||C71.9|cancer of the brain:::...|9562/0||C71.9:::9070/3...| | breast cancer| 413|425|Oncological|9691/3||C50.9|carcinoma of breast:::...|9691/3||C50.9:::8070/2...| | lung cancer| 471|481|Oncological|8814/3||C34.9|malignant tumour of lu...|8814/3||C34.9:::8550/3...| | leukemia| 560|567|Oncological|9670/3||C80.9|anemia in neoplastic d...|9670/3||C80.9:::9714/3...| | B-cell lymphoma| 610|624|Oncological|9818/3||C77.9|secondary malignant ne...|9818/3||C77.9:::9655/3...| +--------------------+-----+---+-----------+-------------+-------------------------+-------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icdo_augmented| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[icdo_code]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on ICD-O Histology Behaviour dataset with `sbiobert_base_cased_mli ` sentence embeddings. 
https://apps.who.int/iris/bitstream/handle/10665/96612/9789241548496_eng.pdf --- layout: model title: German DistilBERT Embeddings author: John Snow Labs name: distilbert_embeddings_distilbert_base_german_cased date: 2022-04-12 tags: [distilbert, embeddings, de, open_source] task: Embeddings language: de edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-german-cased` is a German model originally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_german_cased_de_3.4.2_3.0_1649783682880.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_german_cased_de_3.4.2_3.0_1649783682880.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_german_cased","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_german_cased","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.embed.distilbert_base_german_cased").predict("""Ich liebe Spark NLP""") ```
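Once token vectors are extracted from the `embeddings` column, a common next step is comparing tokens by cosine similarity. A minimal, library-free sketch (the short vectors are made-up stand-ins for real 768-dimensional DistilBERT embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d vectors standing in for 768-d token embeddings
v1 = [0.40, 0.12, -0.35]
v2 = [0.38, 0.10, -0.30]
print(round(cosine_similarity(v1, v2), 3))  # close to 1.0 for similar tokens
```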
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_base_german_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|250.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/distilbert-base-german-cased --- layout: model title: Legal ORG, PRODUCT and ALIAS NER (small) author: John Snow Labs name: legner_orgs_prods_alias date: 2022-08-17 tags: [en, legal, ner, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Named Entity Recognition model, trained on a subset of generic CoNLL, financial and legal CoNLL, OntoNotes and several in-house corpora, to detect Organizations, Products and Aliases of Companies. ## Predicted Entities `ORG`, `PROD`, `ALIAS` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FINNER_ORGPROD){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_orgs_prods_alias_en_1.0.0_3.2_1660733903868.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_orgs_prods_alias_en_1.0.0_3.2_1660733903868.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias","en","legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties"). 
"""] res = model.transform(spark.createDataFrame([text]).toDF("text")) res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.metadata)).alias("cols")) \ .select(F.expr("cols['0']").alias("chunk"), F.expr("cols['1']['entity']").alias("ner_label"), F.expr("cols['1']['confidence']").alias("confidence")).show(truncate=False) ```
## Results ```bash +-----------------------------------+---------+----------+ |chunk |ner_label|confidence| +-----------------------------------+---------+----------+ |Armstrong Flooring, Inc |ORG |0.807575 | |Seller |ALIAS |0.997 | |AFI Licensing LLC |ORG |0.7076333 | |Licensing |ALIAS |0.9981 | |Seller |ALIAS |0.996 | |Arizona |ALIAS |0.9958 | |AHF Holding, Inc. |ORG |0.72438 | |Tarzan HoldCo, Inc |ORG |0.684675 | |Buyer |ALIAS |0.9983 | |Armstrong Hardwood Flooring Company|ORG |0.58274996| |Company |ALIAS |0.9989 | |Buyer |ALIAS |0.9979 | |Buyer Entities |ALIAS |0.98835003| |Arizona |ALIAS |0.9635 | |Buyer Entities |ALIAS |0.77565 | |Party |ALIAS |0.9982 | +-----------------------------------+---------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_orgs_prods_alias| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.7 MB| ## References ConLL-2003, FinSec ConLL, a subset of Ontonotes, In-house corpora ## Benchmarking ```bash label tp fp fn prec rec f1 I-ORG 12853 2621 2685 0.8306191 0.82719785 0.828905 B-PRODUCT 2306 697 932 0.76789874 0.712168 0.7389841 I-ALIAS 14 6 13 0.7 0.5185185 0.59574467 B-ORG 8967 2078 2311 0.81186056 0.79508775 0.80338657 I-PRODUCT 2336 803 1091 0.74418604 0.68164575 0.7115443 B-ALIAS 76 14 22 0.84444445 0.7755102 0.80851066 Macro-average 26552 6219 7054 0.78316814 0.7183547 0.7493626 Micro-average 26552 6219 7054 0.8102285 0.790097 0.80003613 ``` --- layout: model title: English DistilBertForQuestionAnswering model (from Tianle) author: John Snow Labs name: distilbert_qa_Tianle_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: 
cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Tianle`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Tianle_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724812794.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Tianle_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724812794.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Tianle_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Tianle_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Tianle").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_Tianle_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Tianle/distilbert-base-uncased-finetuned-squad --- layout: model title: Sentence Entity Resolver for RxNorm (NDC) author: John Snow Labs name: sbiobertresolve_rxnorm_ndc date: 2021-10-05 tags: [licensed, clinical, en, ndc, rxnorm] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.2.3 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps `DRUG` entities to RxNorm codes and their [National Drug Codes (NDC)](https://www.drugs.com/ndc.html#:~:text=The%20NDC%2C%20or%20National%20Drug,and%20the%20commercial%20package%20size.) using `sbiobert_base_cased_mli` sentence embeddings. All NDC codes of a drug can be found, separated by the `|` symbol, in the `all_k_aux_labels` field of the metadata. 
## Predicted Entities `RxNorm Codes`, `NDC Codes` {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_ndc_en_3.2.3_2.4_1633424811842.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_ndc_en_3.2.3_2.4_1633424811842.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings\ .pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\ .setInputCols(["ner_chunk"])\ .setOutputCol("sentence_embeddings") rxnorm_ndc_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_rxnorm_ndc", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sentence_embeddings"]) \ .setOutputCol("rxnorm_code")\ .setDistanceFunction("EUCLIDEAN") rxnorm_ndc_pipeline = Pipeline( stages = [ documentAssembler, sbert_embedder, rxnorm_ndc_resolver]) data = spark.createDataFrame([["activated charcoal 30000 mg powder for oral suspension"]]).toDF("text") res = rxnorm_ndc_pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en","clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sentence_embeddings") val rxnorm_ndc_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_rxnorm_ndc", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sentence_embeddings")) .setOutputCol("rxnorm_code") .setDistanceFunction("EUCLIDEAN") val rxnorm_ndc_pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, rxnorm_ndc_resolver)) val data = Seq("activated charcoal 30000 mg powder for oral suspension").toDF("text") val res = rxnorm_ndc_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.rxnorm_ndc").predict("""activated charcoal 30000 mg powder for oral suspension""") ```
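As noted in the description, each resolution carries its NDC codes in the `all_k_aux_labels` metadata, joined by the `|` symbol. A small plain-Python sketch of splitting one such entry (the helper name and sample string are illustrative, not part of the Spark NLP API):

```python
def parse_ndc_aux_label(aux_label: str) -> list:
    """Split one all_k_aux_labels entry into individual NDC codes."""
    return [code.strip() for code in aux_label.split("|") if code.strip()]

print(parse_ndc_aux_label("08679001362|86790016280|00067004490"))
# ['08679001362', '86790016280', '00067004490']
```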
## Results ```bash +--+------------------------------------------------------+-----------+-----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------+ | |ner_chunk |rxnorm_code|all_codes |resolutions |all_k_aux_labels (ndc_codes) | +--+------------------------------------------------------+-----------+-----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------+ |0 |activated charcoal 30000 mg powder for oral suspension|1440919 |[1440919, 808917, 1088194, 1191772, 808921,...]|'activated charcoal 30000 MG Powder for Oral Suspension', 'Activated Charcoal 30000 MG Powder for Oral Suspension', 'wheat dextrin 3000 MG Powder for Oral Solution [Benefiber]', 'cellulose 3000 MG Oral Powder [Unifiber]', 'fosfomycin 3000 MG Powder for Oral Solution [Monurol]', ...|69784030828, 00395052791, 08679001362|86790016280|00067004490, 46017004408|68220004416, 00456430001,...| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_rxnorm_ndc| |Compatibility:|Healthcare NLP 3.2.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[rxnorm_code]| |Language:|en| |Case sensitive:|false| --- layout: model title: English XlmRoBertaForQuestionAnswering (from krinal214) 
author: John Snow Labs name: xlm_roberta_qa_xlm_all date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-all` is an English model originally trained by `krinal214`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_all_en_4.0.0_3.0_1655988363716.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_all_en_4.0.0_3.0_1655988363716.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_all","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_all","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.tydiqa.xlm_roberta").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_all| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|924.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/krinal214/xlm-all --- layout: model title: French CamemBert Embeddings (from tnagata) author: John Snow Labs name: camembert_embeddings_tnagata_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `tnagata`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_tnagata_generic_model_fr_3.4.4_3.0_1653990401775.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_tnagata_generic_model_fr_3.4.4_3.0_1653990401775.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_tnagata_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_tnagata_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_tnagata_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/tnagata/dummy-model --- layout: model title: Adverse Drug Events Classifier author: John Snow Labs name: classifierml_ade date: 2023-05-04 tags: [ade, clinical, licensed, en, text_classification] task: Text Classification language: en edition: Healthcare NLP 4.4.1 spark_version: 3.0 supported: true annotator: DocumentMLClassifierModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained with the `DocumentMLClassifierApproach` annotator and classifies a text/sentence into two categories: `True`: the sentence mentions a possible ADE; `False`: the sentence contains no information about an ADE. The model was trained on the ADE-Corpus-V2 dataset (Adverse Drug Reaction Data), a corpus for classifying whether a sentence is ADE-related (`True`) or not (`False`). ## Predicted Entities `True`, `False` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifierml_ade_en_4.4.1_3.0_1683229229936.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifierml_ade_en_4.4.1_3.0_1683229229936.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") classifier_ml = DocumentMLClassifierModel.pretrained("classifierml_ade", "en", "clinical/models")\ .setInputCols("token")\ .setOutputCol("prediction") clf_Pipeline = Pipeline(stages=[ document_assembler, tokenizer, classifier_ml]) data = spark.createDataFrame([["""I feel great after taking tylenol."""], ["""Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient."""]]).toDF("text") result = clf_Pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val classifier_ml = DocumentMLClassifierModel.pretrained("classifierml_ade", "en", "clinical/models") .setInputCols("token") .setOutputCol("prediction") val clf_Pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, classifier_ml)) val data = Seq("I feel great after taking tylenol", "Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.").toDF("text") val result = clf_Pipeline.fit(data).transform(data) ```
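The classifier emits its verdict as the string labels `True`/`False` in the `prediction` column. A minimal sketch (the helper name is ours, not a Spark NLP API) of turning those strings into Python booleans for downstream filtering:

```python
def is_ade(result_labels):
    """Map the classifier's string labels ('True'/'False') to booleans."""
    return [label == "True" for label in result_labels]

print(is_ade(["False", "True"]))  # [False, True]
```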
## Results ```bash +----------------------------------------------------------------------------------------+-------+ |text |result | +----------------------------------------------------------------------------------------+-------+ |I feel great after taking tylenol |[False]| |Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.|[True] | +----------------------------------------------------------------------------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierml_ade| |Compatibility:|Healthcare NLP 4.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[prediction]| |Language:|en| |Size:|2.7 MB| ## References The corpus used for model training is the ADE-Corpus-V2 Dataset: Adverse Drug Reaction Data, a corpus for classifying whether a sentence is ADE-related (True) or not (False). Reference: Gurulingappa et al., Benchmark Corpus to Support Information Extraction for Adverse Drug Effects, JBI, 2012. http://www.sciencedirect.com/science/article/pii/S1532046412000615 ## Benchmarking ```bash label precision recall f1-score support False 0.90 0.94 0.92 3359 True 0.85 0.75 0.79 1364 accuracy - - 0.89 4723 macro avg 0.87 0.85 0.86 4723 weighted avg 0.89 0.89 0.89 4723 ``` --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_kv32 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-kv32` is an English model originally trained by `google`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_kv32_en_4.3.0_3.0_1675121475776.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_kv32_en_4.3.0_3.0_1675121475776.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_kv32","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_kv32","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_kv32| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|129.8 MB| ## References - https://huggingface.co/google/t5-efficient-small-kv32 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English BertForQuestionAnswering model (from andi611) author: John Snow Labs name: bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_movie_with_neg_with_repeat date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-whole-word-masking-squad2-with-ner-mit-movie-with-neg-with-repeat` is an English model originally trained by `andi611`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_movie_with_neg_with_repeat_en_4.0.0_3.0_1654537505373.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_movie_with_neg_with_repeat_en_4.0.0_3.0_1654537505373.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_movie_with_neg_with_repeat","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_movie_with_neg_with_repeat","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.movie_squadv2.bert.large_uncased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
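The NLU one-liners in these cards pack question and context into a single string separated by `|||`. A minimal sketch (the helper name is ours, not part of NLU) of splitting such a string client-side:

```python
def split_qa_input(packed):
    """Split an NLU-style 'question|||context' string into its two parts."""
    question, _, context = packed.partition("|||")
    return question.strip(), context.strip()

q, c = split_qa_input("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
```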
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_movie_with_neg_with_repeat| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/andi611/bert-large-uncased-whole-word-masking-squad2-with-ner-mit-movie-with-neg-with-repeat --- layout: model title: English ElectraForQuestionAnswering model (from ptran74) Version-3 author: John Snow Labs name: electra_qa_DSPFirst_Finetuning_3 date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `DSPFirst-Finetuning-3` is an English model originally trained by `ptran74`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_DSPFirst_Finetuning_3_en_4.0.0_3.0_1655919564099.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_DSPFirst_Finetuning_3_en_4.0.0_3.0_1655919564099.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_DSPFirst_Finetuning_3","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_DSPFirst_Finetuning_3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.electra.finetuning_3").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_DSPFirst_Finetuning_3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ptran74/DSPFirst-Finetuning-3 - https://github.gatech.edu/pages/VIP-ITS/textbook_SQuAD_explore/explore/textbookv1.0/textbook/ --- layout: model title: English RobertaForQuestionAnswering Cased model (from thatdramebaazguy) author: John Snow Labs name: roberta_qa_movie_squad date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `movie-roberta-squad` is an English model originally trained by `thatdramebaazguy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_movie_squad_en_4.2.4_3.0_1669985305755.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_movie_squad_en_4.2.4_3.0_1669985305755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_movie_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_movie_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_movie_squad| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|466.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/thatdramebaazguy/movie-roberta-squad - https://github.com/ibm-aur-nlp/domain-specific-QA - https://github.com/adityaarunsinghal/Domain-Adaptation/blob/master/scripts/shell_scripts/train_movieR_just_squadv1.sh - https://github.com/adityaarunsinghal/Domain-Adaptation/ --- layout: model title: English RobertaForQuestionAnswering (from jgammack) author: John Snow Labs name: roberta_qa_roberta_base_squad date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad` is an English model originally trained by `jgammack`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad_en_4.0.0_3.0_1655734821686.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad_en_4.0.0_3.0_1655734821686.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base.by_jgammack").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/jgammack/roberta-base-squad --- layout: model title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_4 TFWav2Vec2ForCTC from nimrah author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_4 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_4` is an English model originally trained by nimrah. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device please use pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_4_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_4_en_4.2.0_3.0_1664116941781.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_4_en_4.2.0_3.0_1664116941781.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_4', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_4", lang = "en") val annotations = pipeline.transform(audioDF) ```
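The pipeline expects `audioDF` to contain audio as arrays of floats. A hedged, stdlib-only sketch of decoding a 16-bit mono PCM WAV file into normalized floats (the function name is illustrative; resampling and the exact DataFrame schema are out of scope here):

```python
import struct
import wave

def wav_to_floats(path):
    """Decode a 16-bit mono PCM WAV file into floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2 and w.getnchannels() == 1
        frames = w.readframes(w.getnframes())
    # '<h' = little-endian signed 16-bit; normalize by 2**15
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]
```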
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_4| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from lucasresck) author: John Snow Labs name: distilbert_qa_lucasresck_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `lucasresck`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_lucasresck_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772108616.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_lucasresck_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772108616.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_lucasresck_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_lucasresck_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_lucasresck_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/lucasresck/distilbert-base-uncased-finetuned-squad --- layout: model title: NER Pipeline for German author: John Snow Labs name: xlm_roberta_large_token_classifier_conll03_pipeline date: 2022-04-19 tags: [german, roberta, xlm, ner, conll03, de, open_source] task: Named Entity Recognition language: de edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [xlm_roberta_large_token_classifier_conll03_de](https://nlp.johnsnowlabs.com/2021/12/25/xlm_roberta_large_token_classifier_conll03_de.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_conll03_pipeline_de_3.4.1_3.0_1650369924733.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_conll03_pipeline_de_3.4.1_3.0_1650369924733.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("xlm_roberta_large_token_classifier_conll03_pipeline", lang = "de") pipeline.annotate("Ibser begann seine Karriere beim ASK Ebreichsdorf. 2004 wechselte er zu Admira Wacker Mödling, wo er auch in der Akademie spielte.") ``` ```scala val pipeline = new PretrainedPipeline("xlm_roberta_large_token_classifier_conll03_pipeline", lang = "de") pipeline.annotate("Ibser begann seine Karriere beim ASK Ebreichsdorf. 2004 wechselte er zu Admira Wacker Mödling, wo er auch in der Akademie spielte.") ```
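`annotate` returns a plain Python dict mapping output column names to lists of strings. A minimal sketch of pairing recognized chunks with their labels (the key names below are assumptions based on the Results table, not guaranteed by the pipeline):

```python
def chunks_with_labels(annotations):
    """Pair each NER chunk with its label from an annotate()-style dict."""
    return list(zip(annotations["ner_chunk"], annotations["ner_label"]))

sample = {
    "ner_chunk": ["Ibser", "ASK Ebreichsdorf", "Admira Wacker Mödling"],
    "ner_label": ["PER", "ORG", "ORG"],
}
print(chunks_with_labels(sample)[0])  # ('Ibser', 'PER')
```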
## Results ```bash +----------------------+---------+ |chunk |ner_label| +----------------------+---------+ |Ibser |PER | |ASK Ebreichsdorf |ORG | |Admira Wacker Mödling |ORG | +----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_large_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.8 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - XlmRoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: Detect Assertion Status from Response to Treatment author: John Snow Labs name: assertion_oncology_response_to_treatment_wip date: 2022-10-11 tags: [licensed, clinical, oncology, en, assertion] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects the assertion status of entities related to response to treatment. The model identifies positive mentions (Present_Or_Past status), and hypothetical or absent mentions (Hypothetical_Or_Absent status). 
## Predicted Entities `Hypothetical_Or_Absent`, `Present_Or_Past` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_response_to_treatment_wip_en_4.0.0_3.0_1665522412809.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_response_to_treatment_wip_en_4.0.0_3.0_1665522412809.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(["Response_To_Treatment"]) assertion = AssertionDLModel.pretrained("assertion_oncology_response_to_treatment_wip", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion]) data = spark.createDataFrame([["The patient presented no evidence of recurrence."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") 
.setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Response_To_Treatment")) val assertion = AssertionDLModel.pretrained("assertion_oncology_response_to_treatment_wip","en","clinical/models") .setInputCols(Array("sentence","ner_chunk","embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion)) val data = Seq("""The patient presented no evidence of recurrence.""").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.oncology_response_to_treatment_wip").predict("""The patient presented no evidence of recurrence.""") ```
## Results ```bash | chunk | ner_label | assertion | |:-----------|:----------------------|:-----------------------| | recurrence | Response_To_Treatment | Hypothetical_Or_Absent | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_oncology_response_to_treatment_wip| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion_pred]| |Language:|en| |Size:|1.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label precision recall f1-score support Hypothetical_Or_Absent 0.82 0.90 0.86 61.0 Present_Or_Past 0.89 0.80 0.84 61.0 macro-avg 0.86 0.85 0.85 122.0 weighted-avg 0.86 0.85 0.85 122.0 ``` --- layout: model title: English image_classifier_vit_hot_dog_or_sandwich ViTForImageClassification from osanseviero author: John Snow Labs name: image_classifier_vit_hot_dog_or_sandwich date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_hot_dog_or_sandwich` is an English model originally trained by osanseviero. ## Predicted Entities `hot dog`, `sandwich` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_hot_dog_or_sandwich_en_4.1.0_3.0_1660169703998.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_hot_dog_or_sandwich_en_4.1.0_3.0_1660169703998.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_hot_dog_or_sandwich", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_hot_dog_or_sandwich", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_hot_dog_or_sandwich| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Spanish DistilBERT Embeddings (from Geotrend) author: John Snow Labs name: distilbert_embeddings_distilbert_base_es_cased date: 2022-04-12 tags: [distilbert, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-es-cased` is a Spanish model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_es_cased_es_3.4.2_3.0_1649783277779.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_es_cased_es_3.4.2_3.0_1649783277779.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_es_cased","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_es_cased","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.distilbert_base_es_cased").predict("""Me encanta chispa nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_base_es_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|237.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/distilbert-base-es-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Sentence Entity Resolver for ATC (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_atc date: 2022-03-01 tags: [atc, licensed, en, clinical, entity_resolution] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps drug entities to ATC (Anatomical Therapeutic Chemical) codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. ## Predicted Entities `ATC Codes` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_atc_en_3.4.1_2.4_1646127233333.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_atc_en_3.4.1_2.4_1646127233333.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("word_embeddings") posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") \ .setInputCols(["sentence", "token", "word_embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(["DRUG"]) c2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sentence_embeddings")\ .setCaseSensitive(False) atc_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_atc", "en", "clinical/models")\ .setInputCols(["sentence_embeddings"]) \ .setOutputCol("atc_code")\ .setDistanceFunction("EUCLIDEAN") resolver_pipeline = Pipeline( stages = [ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, posology_ner, ner_converter, c2doc, sbert_embedder, atc_resolver ]) sampleText = ["""He was seen by the endocrinology service and she was discharged on eltrombopag at night, amlodipine with meals metformin two times a day.""", """She was immediately given hydrogen peroxide 30 mg and amoxicillin twice daily for 10 days to treat the infection on her leg. 
She has a history of taking magnesium hydroxide.""", """She was given antidepressant for a month"""] data = spark.createDataFrame(sampleText, StringType()).toDF("text") results = resolver_pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("word_embeddings") val posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") .setInputCols(Array("sentence", "token", "word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("DRUG")) val c2doc = new Chunk2Doc() .setInputCols(Array("ner_chunk")) .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sentence_embeddings") .setCaseSensitive(false) val atc_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_atc", "en", "clinical/models") .setInputCols(Array("sentence_embeddings")) .setOutputCol("atc_code") .setDistanceFunction("EUCLIDEAN") val resolver_pipeline = new Pipeline().setStages(Array(document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, posology_ner, ner_converter, c2doc, sbert_embedder, atc_resolver)) val data = Seq("He was seen by the endocrinology service and she was discharged on eltrombopag at night, amlodipine with meals metformin two times a day and then ibuprofen. 
She was immediately given hydrogen peroxide 30 mg and amoxicillin twice daily for 10 days to treat the infection on her leg. She has a history of taking magnesium hydroxide. She was given antidepressant for a month").toDF("text") val results = resolver_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.atc").predict("""She was immediately given hydrogen peroxide 30 mg and amoxicillin twice daily for 10 days to treat the infection on her leg. She has a history of taking magnesium hydroxide.""") ```
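The resolver packs its top-k candidates into single `:::`-delimited strings (the `all_k_codes`, `resolutions`, and `all_k_aux_labels` columns in the Results section below). A minimal plain-Python sketch of unpacking one such row into aligned (code, term) pairs — the helper name is illustrative, and the sample values are copied from the metformin row:

```python
def unpack_candidates(codes: str, resolutions: str):
    """Split ':::'-delimited resolver columns into aligned (code, term) pairs."""
    return [
        (code.strip(), term.strip())
        for code, term in zip(codes.split(":::"), resolutions.split(":::"))
    ]

# Sample values copied from the metformin row of the resolver output:
pairs = unpack_candidates(
    "A10BA02:::A10BA01:::A10BB01",
    "metformin :::phenformin :::glyburide / metformin",
)
# → [('A10BA02', 'metformin'), ('A10BA01', 'phenformin'), ('A10BB01', 'glyburide / metformin')]
```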
## Results ```bash +-------------------+--------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+ | chunk|atc_code| all_k_codes| resolutions| all_k_aux_labels| +-------------------+--------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+ | eltrombopag| B02BX05|B02BX05:::A07DA06:::B06AC03:::M01AB08:::L04AA39...|eltrombopag :::eluxadoline :::ecallantide :::et...|ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th...| | amlodipine| C08CA01|C08CA01:::C08CA17:::C08CA13:::C08CA06:::C08CA10...|amlodipine :::levamlodipine :::lercanidipine ::...|ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th...| | metformin| A10BA02|A10BA02:::A10BA01:::A10BB01:::A10BH04:::A10BB07...|metformin :::phenformin :::glyburide / metformi...|ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th...| | hydrogen peroxide| A01AB02|A01AB02:::S02AA06:::D10AE:::D11AX25:::D10AE01::...|hydrogen peroxide :::hydrogen peroxide; otic:::...|ATC 5th:::ATC 5th:::ATC 4th:::ATC 5th:::ATC 5th...| | amoxicillin| J01CA04|J01CA04:::J01CA01:::J01CF02:::J01CF01:::J01CA51...|amoxicillin :::ampicillin :::cloxacillin :::dic...|ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th...| |magnesium hydroxide| A02AA04|A02AA04:::A12CC02:::D10AX30:::B05XA11:::A02AA02...|magnesium hydroxide :::magnesium sulfate :::alu...|ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th...| | antidepressant| N06A|N06A:::N05A:::N06AX:::N05AH02:::N06D:::N06CA:::...|ANTIDEPRESSANTS:::ANTIPSYCHOTICS:::Other antide...|ATC 3rd:::ATC 3rd:::ATC 4th:::ATC 5th:::ATC 3rd...| +-------------------+--------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_atc| 
|Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[atc_code]| |Language:|en| |Size:|71.6 MB| |Case sensitive:|false| ## References Trained on the ATC 2022 Codes dataset. --- layout: model title: Pipeline to Detect Clinical Entities (jsl_ner_wip_greedy_clinical) author: John Snow Labs name: jsl_ner_wip_greedy_clinical_pipeline date: 2023-03-14 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [jsl_ner_wip_greedy_clinical](https://nlp.johnsnowlabs.com/2021/03/31/jsl_ner_wip_greedy_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_greedy_clinical_pipeline_en_4.3.0_3.2_1678784242470.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_greedy_clinical_pipeline_en_4.3.0_3.2_1678784242470.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("jsl_ner_wip_greedy_clinical_pipeline", "en", "clinical/models") text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature..''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("jsl_ner_wip_greedy_clinical_pipeline", "en", "clinical/models") val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. 
The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.wip_greedy_clinical.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature..""") ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-----------------------------------------------|--------:|------:|:-----------------------------|-------------:| | 0 | 21-day-old | 17 | 26 | Age | 0.9817 | | 1 | Caucasian | 28 | 36 | Race_Ethnicity | 0.9998 | | 2 | male | 38 | 41 | Gender | 0.9922 | | 3 | for 2 days | 48 | 57 | Duration | 0.6968 | | 4 | congestion | 62 | 71 | Symptom | 0.875 | | 5 | mom | 75 | 77 | Gender | 0.8156 | | 6 | suctioning yellow discharge | 88 | 114 | Symptom | 0.2697 | | 7 | nares | 135 | 139 | External_body_part_or_region | 0.6216 | | 8 | she | 147 | 149 | Gender | 0.9965 | | 9 | mild problems with his breathing while feeding | 168 | 213 | Symptom | 0.444029 | | 10 | perioral cyanosis | 237 | 253 | Symptom | 0.3283 | | 11 | retractions | 258 | 268 | Symptom | 0.957 | | 12 | One day ago | 272 | 282 | RelativeDate | 0.646267 | | 13 | mom | 285 | 287 | Gender | 0.692 | | 14 | tactile temperature | 304 | 322 | Symptom | 0.20765 | | 15 | Tylenol | 345 | 351 | Drug | 0.9951 | | 16 | Baby | 354 | 357 | Age | 0.981 | | 17 | decreased p.o. intake | 377 | 397 | Symptom | 0.437375 | | 18 | His | 400 | 402 | Gender | 0.999 | | 19 | 20 minutes | 439 | 448 | Duration | 0.20415 | | 20 | q.2h. 
| 450 | 454 | Frequency | 0.6406 | | 21 | to 5 to 10 minutes | 456 | 473 | Duration | 0.12444 | | 22 | his | 488 | 490 | Gender | 0.9904 | | 23 | respiratory congestion | 492 | 513 | Symptom | 0.5294 | | 24 | He | 516 | 517 | Gender | 0.9989 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|jsl_ner_wip_greedy_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk TFWav2Vec2ForCTC from krirk author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk` is an English model originally trained by krirk. 
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk_en_4.2.0_3.0_1664042731612.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk_en_4.2.0_3.0_1664042731612.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk", lang = "en") val annotations = pipeline.transform(audioDF) ```
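Both snippets above assume an `audioDF` whose `audio_content` column holds arrays of floating-point samples; Wav2Vec2 models additionally expect 16 kHz mono audio. A dependency-free sketch of decoding a 16-bit PCM mono WAV file into such a float array with Python's standard library — the file name and the commented DataFrame wrapping are illustrative, not part of the pipeline API:

```python
import struct
import wave

def wav_to_floats(path):
    """Decode a 16-bit PCM mono WAV file into floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expects 16-bit PCM samples"
        frames = wf.readframes(wf.getnframes())
    # '<h' = little-endian signed 16-bit; normalize by the int16 range
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Illustrative wrapping into the DataFrame the pipeline consumes:
# audioDF = spark.createDataFrame([[wav_to_floats("sample.wav")]], ["audio_content"])
```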
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English asr_wav2vec2_xls_r_300m_cv8 TFWav2Vec2ForCTC from comodoro author: John Snow Labs name: pipeline_asr_wav2vec2_xls_r_300m_cv8 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_cv8` is an English model originally trained by comodoro. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xls_r_300m_cv8_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_cv8_en_4.2.0_3.0_1664014149392.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_cv8_en_4.2.0_3.0_1664014149392.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_300m_cv8', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_300m_cv8", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xls_r_300m_cv8| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Pipeline to Detect Genes/Proteins (BC2GM) in Medical Text author: John Snow Labs name: bert_token_classifier_ner_bc2gm_gene_pipeline date: 2023-03-20 tags: [en, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_bc2gm_gene](https://nlp.johnsnowlabs.com/2022/07/25/bert_token_classifier_ner_bc2gm_gene_en_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc2gm_gene_pipeline_en_4.3.0_3.2_1679303903870.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc2gm_gene_pipeline_en_4.3.0_3.2_1679303903870.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_bc2gm_gene_pipeline", "en", "clinical/models") text = '''ROCK-I, Kinectin, and mDia2 can bind the wild type forms of both RhoA and Cdc42 in a GTP-dependent manner in vitro. These results support the hypothesis that in the presence of tryptophan the ribosome translating tnaC blocks Rho ' s access to the boxA and rut sites, thereby preventing transcription termination.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_bc2gm_gene_pipeline", "en", "clinical/models") val text = "ROCK-I, Kinectin, and mDia2 can bind the wild type forms of both RhoA and Cdc42 in a GTP-dependent manner in vitro. These results support the hypothesis that in the presence of tryptophan the ribosome translating tnaC blocks Rho ' s access to the boxA and rut sites, thereby preventing transcription termination." val result = pipeline.fullAnnotate(text) ```
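`fullAnnotate` returns one dictionary of annotation lists per input text, and the chunk/label table in the Results section below can be rebuilt from its `ner_chunk` entries. A minimal sketch of that flattening — the plain dicts here are illustrative stand-ins for Spark NLP's `Annotation` objects, which expose the same fields as attributes (`a.result`, `a.metadata`, ...):

```python
def chunks_to_rows(ner_chunks):
    """Flatten ner_chunk annotations into (chunk, begin, end, label, confidence) rows."""
    return [
        (a["result"], a["begin"], a["end"],
         a["metadata"]["entity"], float(a["metadata"]["confidence"]))
        for a in ner_chunks
    ]

# Plain-dict stand-ins for two annotations from result[0]["ner_chunk"]:
sample = [
    {"result": "ROCK-I", "begin": 0, "end": 5,
     "metadata": {"entity": "GENE/PROTEIN", "confidence": "0.999978"}},
    {"result": "Kinectin", "begin": 8, "end": 15,
     "metadata": {"entity": "GENE/PROTEIN", "confidence": "0.999973"}},
]
rows = chunks_to_rows(sample)
```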
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:------------|--------:|------:|:-------------|-------------:| | 0 | ROCK-I | 0 | 5 | GENE/PROTEIN | 0.999978 | | 1 | Kinectin | 8 | 15 | GENE/PROTEIN | 0.999973 | | 2 | mDia2 | 22 | 26 | GENE/PROTEIN | 0.999974 | | 3 | RhoA | 65 | 68 | GENE/PROTEIN | 0.999976 | | 4 | Cdc42 | 74 | 78 | GENE/PROTEIN | 0.999979 | | 5 | tnaC | 213 | 216 | GENE/PROTEIN | 0.999978 | | 6 | Rho | 225 | 227 | GENE/PROTEIN | 0.999976 | | 7 | boxA | 247 | 250 | GENE/PROTEIN | 0.999837 | | 8 | rut sites | 256 | 264 | GENE/PROTEIN | 0.99115 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_bc2gm_gene_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Recognize Entities DL Pipeline for Danish - Small author: John Snow Labs name: entity_recognizer_sm date: 2021-03-22 tags: [open_source, danish, entity_recognizer_sm, pipeline, da] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: da edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_sm is a pretrained pipeline that can be used to process text through a simple sequence of basic processing steps. 
It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_da_3.0.0_3.0_1616443414871.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_da_3.0.0_3.0_1616443414871.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('entity_recognizer_sm', lang='da')
annotations = pipeline.fullAnnotate("Hej fra John Snow Labs! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("entity_recognizer_sm", lang = "da")
val result = pipeline.fullAnnotate("Hej fra John Snow Labs! ")(0)
```

{:.nlu-block}
```python
import nlu

text = ["Hej fra John Snow Labs! "]
result_df = nlu.load('da.ner').predict(text)
result_df
```
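The `entities` column shown in the results is produced by merging IOB-tagged tokens into chunks. A minimal pure-Python sketch of that merging, using the token and tag values from the sample output above (`merge_iob` is an illustrative helper name, not part of Spark NLP):

```python
# Merge IOB tags into entity chunks, as the pipeline's NER converter
# stage does. Tokens/tags mirror the sample output for
# "Hej fra John Snow Labs! ".
def merge_iob(tokens, tags):
    chunks, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(" ".join(current))
            current = [tok]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tokens = ['Hej', 'fra', 'John', 'Snow', 'Labs!']
tags = ['O', 'O', 'B-PER', 'I-PER', 'I-PER']
print(merge_iob(tokens, tags))  # ['John Snow Labs!']
```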
## Results

```bash
|    | document                     | sentence                    | token                                   | embeddings                   | ner                                   | entities            |
|---:|:-----------------------------|:----------------------------|:----------------------------------------|:-----------------------------|:--------------------------------------|:--------------------|
|  0 | ['Hej fra John Snow Labs! '] | ['Hej fra John Snow Labs!'] | ['Hej', 'fra', 'John', 'Snow', 'Labs!'] | [[0.0306969992816448,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|entity_recognizer_sm|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|da|

---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from sumedh)
author: John Snow Labs
name: t5_base_amazonreviews
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-amazonreviews` is an English model originally trained by `sumedh`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_amazonreviews_en_4.3.0_3.0_1675107991189.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_amazonreviews_en_4.3.0_3.0_1675107991189.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_base_amazonreviews", "en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_base_amazonreviews", "en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_base_amazonreviews|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|921.3 MB|

## References

- https://huggingface.co/sumedh/t5-base-amazonreviews

---
layout: model
title: Recognize Entities OntoNotes pipeline - BERT Mini
author: John Snow Labs
name: onto_recognize_entities_bert_mini
date: 2021-03-23
tags: [open_source, english, onto_recognize_entities_bert_mini, pipeline, en]
supported: true
task: [Named Entity Recognition, Lemmatization]
language: en
nav_key: models
edition: Spark NLP 3.0.0
spark_version: 3.0
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The onto_recognize_entities_bert_mini is a pretrained pipeline that performs basic text processing steps and recognizes entities. It performs most of the common text processing tasks on your dataframe.

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_mini_en_3.0.0_3.0_1616477436682.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_mini_en_3.0.0_3.0_1616477436682.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('onto_recognize_entities_bert_mini', lang='en')
annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("onto_recognize_entities_bert_mini", lang = "en")
val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0)
```

{:.nlu-block}
```python
import nlu

text = ["Hello from John Snow Labs ! "]
result_df = nlu.load('en.ner.onto.bert.mini').predict(text)
result_df
```
## Results

```bash
|    | document                         | sentence                        | token                                          | embeddings                   | ner                                        | entities           |
|---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:-----------------------------|:-------------------------------------------|:-------------------|
|  0 | ['Hello from John Snow Labs ! '] | ['Hello from John Snow Labs !'] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | [[-0.147406503558158,.,...]] | ['O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O'] | ['John Snow Labs'] |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|onto_recognize_entities_bert_mini|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|

---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_2_h_512
date: 2022-12-02
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-2_H-512` is a Chinese model originally trained by `uer`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_2_h_512_zh_4.2.4_3.0_1670021628805.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_2_h_512_zh_4.2.4_3.0_1670021628805.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_2_h_512", "zh") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_2_h_512", "zh")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
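The `embeddings` column holds one vector per token. A common downstream step (not part of this pipeline) is mean pooling the token vectors into a single sentence vector. A pure-Python sketch with made-up toy values; the real model emits 512-dimensional vectors:

```python
# Mean-pool per-token vectors into one sentence vector.
# Toy 4-dimensional vectors for illustration only.
def mean_pool(token_vectors):
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n
            for i in range(len(token_vectors[0]))]

vectors = [[1.0, 2.0, 0.0, 4.0],
           [3.0, 0.0, 2.0, 0.0]]
print(mean_pool(vectors))  # [2.0, 1.0, 1.0, 2.0]
```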
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_2_h_512|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|66.4 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/uer/chinese_roberta_L-2_H-512
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://cloud.tencent.com/

---
layout: model
title: Spanish BertForQuestionAnswering model (from IIC)
author: John Snow Labs
name: bert_qa_beto_base_spanish_sqac
date: 2022-06-02
tags: [es, open_source, question_answering, bert]
task: Question Answering
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `beto-base-spanish-sqac` is a Spanish model originally trained by `IIC`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_beto_base_spanish_sqac_es_4.0.0_3.0_1654185522043.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_beto_base_spanish_sqac_es_4.0.0_3.0_1654185522043.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_beto_base_spanish_sqac", "es") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_beto_base_spanish_sqac", "es")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.sqac.bert.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
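Extractive QA models like this one predict a span over the context, and the answer in the `answer` column is the corresponding slice of the context. A pure-Python sketch of that final step; the offsets here are illustrative values chosen for the example, and the real model works on token indices rather than characters:

```python
# The model predicts span boundaries over the context; the answer is
# the corresponding slice. Offsets below are illustrative only.
context = "My name is Clara and I live in Berkeley."
start, end = 11, 16  # illustrative character-level span boundaries
answer = context[start:end]
print(answer)  # Clara
```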
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_beto_base_spanish_sqac|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|es|
|Size:|410.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/IIC/beto-base-spanish-sqac
- https://paperswithcode.com/sota?task=question-answering&dataset=PlanTL-GOB-ES%2FSQAC
- https://arxiv.org/abs/2107.07253
- https://github.com/dccuchile/beto
- https://www.bsc.es/

---
layout: model
title: Persian XlmRoBertaForQuestionAnswering (from SajjadAyoubi)
author: John Snow Labs
name: xlm_roberta_qa_xlm_roberta_large_fa_qa
date: 2022-06-23
tags: [fa, open_source, question_answering, xlmroberta]
task: Question Answering
language: fa
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-large-fa-qa` is a Persian model originally trained by `SajjadAyoubi`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_large_fa_qa_fa_4.0.0_3.0_1655995556403.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_large_fa_qa_fa_4.0.0_3.0_1655995556403.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_large_fa_qa", "fa") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = XlmRoBertaForQuestionAnswering
    .pretrained("xlm_roberta_qa_xlm_roberta_large_fa_qa", "fa")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("fa.answer_question.xlm_roberta.large").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_roberta_large_fa_qa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|fa|
|Size:|1.9 GB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/SajjadAyoubi/xlm-roberta-large-fa-qa
- https://colab.research.google.com/github/sajjjadayobi/PersianQA/blob/main/notebooks/HowToUse.ipynb

---
layout: model
title: English Deberta Embeddings model (from domenicrosati)
author: John Snow Labs
name: deberta_embeddings_xsmall_dapt_scientific_papers_pubmed
date: 2023-03-13
tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, en, tensorflow]
task: Embeddings
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DeBertaEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deberta-xsmall-dapt-scientific-papers-pubmed` is an English model originally trained by `domenicrosati`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_xsmall_dapt_scientific_papers_pubmed_en_4.3.1_3.0_1678701718282.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_xsmall_dapt_scientific_papers_pubmed_en_4.3.1_3.0_1678701718282.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_xsmall_dapt_scientific_papers_pubmed", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_xsmall_dapt_scientific_papers_pubmed", "en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))

val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|deberta_embeddings_xsmall_dapt_scientific_papers_pubmed|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|246.9 MB|
|Case sensitive:|false|

## References

- https://huggingface.co/domenicrosati/deberta-xsmall-dapt-scientific-papers-pubmed

---
layout: model
title: Sentence Detection in English Texts
author: John Snow Labs
name: sentence_detector_dl
date: 2021-01-02
task: Sentence Detection
language: en
nav_key: models
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [en, sentence_detection, open_source]
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation.

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/SENTENCE_DETECTOR/){:.button.button-orange.button-orange-trans.arr.button-icon}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/20.SentenceDetectorDL_Healthcare.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_en_2.7.0_2.4_1609611052663.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_en_2.7.0_2.4_1609611052663.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL]))

sd_model.fullAnnotate("""John loves Mary.Mary loves Peter. Peter loves Helen .Helen loves John; Total: four people involved.""")
```
```scala
val documenter = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val pipeline = new Pipeline().setStages(Array(documenter, model))

val data = Seq("John loves Mary.Mary loves Peter. Peter loves Helen .Helen loves John; Total: four people involved.").toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("sentence_detector").predict("""John loves Mary.Mary loves Peter. Peter loves Helen .Helen loves John; Total: four people involved.""")
```
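The example text is chosen to defeat simple rule-based splitting: "Mary.Mary" and ".Helen" have no whitespace after the period. A small pure-Python sketch contrasting a whitespace-dependent regex split with the five sentences the model recovers:

```python
import re

text = ("John loves Mary.Mary loves Peter. Peter loves Helen ."
        "Helen loves John; Total: four people involved.")

# Splitting only after punctuation followed by whitespace misses
# "Mary.Mary" and ".Helen", yielding 3 segments instead of the
# 5 sentences SentenceDetectorDL finds.
naive = re.split(r"(?<=[.;])\s+", text)
print(len(naive))  # 3
```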
## Results

```bash
+---+------------------------------+
| 0 | John loves Mary.             |
+---+------------------------------+
| 1 | Mary loves Peter             |
+---+------------------------------+
| 2 | Peter loves Helen .          |
+---+------------------------------+
| 3 | Helen loves John;            |
+---+------------------------------+
| 4 | Total: four people involved. |
+---+------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sentence_detector_dl|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[sentences]|
|Language:|en|

## Data Source

Please visit the repo for more information https://github.com/dbmdz/deep-eos

## Benchmarking

```bash
label  Accuracy  Recall  Prec  F1
0      0.98      1.00    0.96  0.98
```

---
layout: model
title: Fast Neural Machine Translation Model from Afrikaans to Esperanto
author: John Snow Labs
name: opus_mt_af_eo
date: 2021-06-01
tags: [open_source, seq2seq, translation, af, eo, xx, multilingual]
task: Translation
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects.
source languages: af

target languages: eo

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_af_eo_xx_3.1.0_2.4_1622559306014.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_af_eo_xx_3.1.0_2.4_1622559306014.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

marian = MarianTransformer.pretrained("opus_mt_af_eo", "xx")\
    .setInputCols(["sentences"])\
    .setOutputCol("translation")
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_af_eo", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")
```

{:.nlu-block}
```python
import nlu

text = ["text to translate"]
translate_df = nlu.load('xx.Afrikaans.translate_to.Esperanto').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_af_eo|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English DistilBertForQuestionAnswering Base Cased model (from kuberpmu)
author: John Snow Labs
name: distilbert_qa_kuberpmu_base_cased_led_squad_finetuned
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-finetuned-squad` is an English model originally trained by `kuberpmu`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_kuberpmu_base_cased_led_squad_finetuned_en_4.3.0_3.0_1672766528239.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_kuberpmu_base_cased_led_squad_finetuned_en_4.3.0_3.0_1672766528239.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kuberpmu_base_cased_led_squad_finetuned", "en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kuberpmu_base_cased_led_squad_finetuned", "en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_kuberpmu_base_cased_led_squad_finetuned|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/kuberpmu/distilbert-base-cased-distilled-squad-finetuned-squad

---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from ab20211112)
author: John Snow Labs
name: distilbert_qa_ab20211112_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `ab20211112`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_ab20211112_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769623851.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_ab20211112_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769623851.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ab20211112_base_uncased_finetuned_squad", "en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ab20211112_base_uncased_finetuned_squad", "en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_ab20211112_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/ab20211112/distilbert-base-uncased-finetuned-squad

---
layout: model
title: English BertForQuestionAnswering Cased model (from motiondew)
author: John Snow Labs
name: bert_qa_sd1
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-sd1` is an English model originally trained by `motiondew`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_sd1_en_4.0.0_3.0_1657187990798.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_sd1_en_4.0.0_3.0_1657187990798.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sd1", "en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sd1", "en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_sd1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/motiondew/bert-sd1 --- layout: model title: Pipeline to Detect Clinical Entities (Slim version) author: John Snow Labs name: bert_token_classifier_ner_jsl_slim_pipeline date: 2022-03-21 tags: [licensed, ner, slim, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_jsl_slim](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_jsl_slim_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_slim_pipeline_en_3.4.1_3.0_1647865346100.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_slim_pipeline_en_3.4.1_3.0_1647865346100.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("bert_token_classifier_ner_jsl_slim_pipeline", "en", "clinical/models") pipeline.annotate("HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer.") ``` ```scala val pipeline = new PretrainedPipeline("bert_token_classifier_ner_jsl_slim_pipeline", "en", "clinical/models") pipeline.annotate("HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer.") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.jsl_slim.pipeline").predict("""HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer.""") ```
## Results ```bash +----------------+------------+ |chunk |ner_label | +----------------+------------+ |HISTORY: |Header | |30-year-old |Age | |female |Demographics| |mammography |Test | |soft tissue lump|Symptom | |shoulder |Body_Part | |breast cancer |Oncological | |her mother |Demographics| |age 58 |Age | |breast cancer |Oncological | +----------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_jsl_slim_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.8 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverter --- layout: model title: Detect PHI for Deidentification purposes (Spanish, reduced entities, augmented data, Roberta) author: John Snow Labs name: ner_deid_generic_roberta_augmented date: 2022-02-16 tags: [deid, es, licensed] task: De-identification language: es edition: Healthcare NLP 3.3.4 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow a generic model to be trained using a deep learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Spanish) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 8 entities (1 more than the `ner_deid_generic_roberta` model). This NER model is trained on a combination of custom datasets, the Spanish CoNLL 2002 corpus and the MeddoProf dataset, using several data augmentation mechanisms, and has been further augmented with the MEDDOCAN Spanish de-identification corpus (which `ner_deid_generic_roberta` does not include). 
It is a generalized version of `ner_deid_subentity_roberta_augmented` and is based on RoBERTa embeddings. A `ner_deid_generic_augmented` model that uses Sciwi 300d embeddings is also available. ## Predicted Entities `CONTACT`, `NAME`, `DATE`, `ID`, `LOCATION`, `PROFESSION`, `AGE`, `SEX` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_ES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_roberta_augmented_es_3.3.4_3.0_1645006281743.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_roberta_augmented_es_3.3.4_3.0_1645006281743.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = medical.NerModel.pretrained("ner_deid_generic_roberta_augmented", "es", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, roberta_embeddings, clinical_ner]) text = [''' Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. 
'''] df = spark.createDataFrame([text]).toDF("text") results = nlpPipeline.fit(df).transform(df) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val roberta_embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_roberta_augmented", "es", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, roberta_embeddings, clinical_ner)) val text = "Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos." val df = Seq(text).toDF("text") val results = pipeline.fit(df).transform(df) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.deid.generic.roberta").predict(""" Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. """) ```
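Once the token-level NER output has been converted to chunks, de-identification typically replaces each detected span with its entity label. A minimal pure-Python sketch of that masking step — the chunk offsets below are illustrative, not actual model output:

```python
def mask_entities(text, chunks):
    """Replace each (begin, end, label) character span with <LABEL>.

    Spans are applied right to left so earlier offsets stay valid.
    """
    for begin, end, label in sorted(chunks, reverse=True):
        text = text[:begin] + f"<{label}>" + text[end + 1:]
    return text

# Illustrative text and offsets (accents omitted to keep offsets obvious).
text = "Antonio Miguel Martinez, un varon de 35 anos de edad."
chunks = [(0, 22, "NAME"), (37, 38, "AGE")]
print(mask_entities(text, chunks))
# -> <NAME>, un varon de <AGE> anos de edad.
```

In practice the chunk offsets come from the `ner` column after a `NerConverter` stage; Healthcare NLP also ships a dedicated `DeIdentification` annotator that performs this step with more options (masking, obfuscation, date shifting).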
## Results ```bash +------------+------------+ | token| ner_label| +------------+------------+ | Antonio| B-NAME| | Miguel| I-NAME| | Martínez| I-NAME| | ,| O| | un| B-SEX| | varón| I-SEX| | de| O| | 35| B-AGE| | años| O| | de| O| | edad| O| | ,| O| | de| O| | profesión| O| | auxiliar|B-PROFESSION| | de|I-PROFESSION| | enfermería|I-PROFESSION| | y| O| | nacido| O| | en| O| | Cadiz| B-LOCATION| | ,| O| | España| B-LOCATION| | .| O| | Aún| O| | no| O| | estaba| O| | vacunado| O| | ,| O| | se| O| | infectó| O| | con| O| | Covid-19|B-PROFESSION| | el| O| | dia| O| | 14| O| | de| O| | Marzo| O| | y| O| | tuvo| O| | que| O| | ir| O| | al| O| | Hospital| B-LOCATION| | Fue| O| | tratado| O| | con| O| | anticuerpos| O| |monoclonales| O| | en| O| | la| O| | Clinica| B-LOCATION| | San| I-LOCATION| | Carlos| I-LOCATION| | .| O| +------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic_roberta_augmented| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|16.3 MB| ## References - Internal JSL annotated corpus - [Spanish conLL](https://www.clips.uantwerpen.be/conll2002/ner/data/) - [MeddoProf](https://temu.bsc.es/meddoprof/data/) - [MeddoCan](https://temu.bsc.es/meddocan/) ## Benchmarking ```bash label tp fp fn total precision recall f1 CONTACT 177.0 3.0 6.0 183.0 0.9833 0.9672 0.9752 NAME 1963.0 159.0 123.0 2086.0 0.9251 0.941 0.933 DATE 953.0 18.0 16.0 969.0 0.9815 0.9835 0.9825 ORGANIZATION 2320.0 520.0 362.0 2682.0 0.8169 0.865 0.8403 ID 63.0 7.0 1.0 64.0 0.9 0.9844 0.9403 SEX 619.0 14.0 8.0 627.0 0.9779 0.9872 0.9825 LOCATION 2388.0 470.0 423.0 2811.0 0.8355 0.8495 0.8425 PROFESSION 233.0 15.0 28.0 261.0 0.9395 0.8927 0.9155 AGE 516.0 16.0 3.0 519.0 0.9699 0.9942 0.9819 macro - - - - - - 0.93263 micro - - - - - - 0.89427 ``` --- layout: model title: English BertForQuestionAnswering model 
(from manav) author: John Snow Labs name: bert_qa_causal_qa date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `causal_qa` is an English model originally trained by `manav`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_causal_qa_en_4.0.0_3.0_1654537687909.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_causal_qa_en_4.0.0_3.0_1654537687909.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_causal_qa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_causal_qa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.by_manav").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_causal_qa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/manav/causal_qa - https://github.com/kstats/CausalQG --- layout: model title: English image_classifier_vit_base_patch16_224_in21k_snacks ViTForImageClassification from matteopilotto author: John Snow Labs name: image_classifier_vit_base_patch16_224_in21k_snacks date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_patch16_224_in21k_snacks` is an English model originally trained by matteopilotto. ## Predicted Entities `salad`, `candy`, `muffin`, `banana`, `grape`, `popcorn`, `pretzel`, `pineapple`, `juice`, `orange`, `doughnut`, `carrot`, `waffle`, `cake`, `cookie`, `ice cream`, `watermelon`, `hot dog`, `apple`, `strawberry` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_in21k_snacks_en_4.1.0_3.0_1660167587853.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_in21k_snacks_en_4.1.0_3.0_1660167587853.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_patch16_224_in21k_snacks", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_patch16_224_in21k_snacks", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
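`ViTForImageClassification` emits one score per class, and the predicted label in the `class` column is effectively an argmax after a softmax over those scores. A toy pure-Python sketch of that final step, with made-up logits and labels from this model's snack classes:

```python
import math

def predict_label(logits, labels):
    """Softmax the logits for a readable probability, then take the argmax."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]

labels = ["apple", "banana", "muffin"]
label, prob = predict_label([0.2, 3.1, 1.0], labels)  # made-up logits
print(label)  # -> banana
```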
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_patch16_224_in21k_snacks| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.0 MB| --- layout: model title: BioBERT Sentence Embeddings (Pubmed) author: John Snow Labs name: sent_biobert_pubmed_base_cased date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains the pre-trained weights of BioBERT, a language representation model for the biomedical domain, especially designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_biobert_pubmed_base_cased_en_2.6.0_2.4_1598348028762.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_biobert_pubmed_base_cased_en_2.6.0_2.4_1598348028762.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer'], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.biobert.pubmed_base_cased').predict(text, output_level='sentence') embeddings_df ```
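The 768-dimensional vectors in the `sentence_embeddings` column are typically compared with cosine similarity. A small pure-Python helper, shown here on toy 3-dimensional vectors standing in for real embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for two 768-d BioBERT sentence embeddings.
print(round(cosine([0.2, 0.1, -0.5], [0.2, 0.2, -0.4]), 3))  # -> 0.969
```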
{:.h2_title} ## Results ```bash en_embed_sentence_biobert_pubmed_base_cased_embeddings sentence [0.209750697016716, 0.21535921096801758, -0.59... I hate cancer [0.01466107927262783, -0.20778851211071014, -0... Antibiotics aren't painkiller ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_biobert_pubmed_base_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert) --- layout: model title: German Named Entity Recognition author: John Snow Labs name: xlmroberta_ner_xlm_roberta_large_finetuned_conll03_german date: 2022-05-17 tags: [xlm_roberta, ner, token_classification, de, open_source] task: Named Entity Recognition language: de edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-large-finetuned-conll03-german` is a German model originally trained by HuggingFace. ## Predicted Entities `PER`, `ORG`, `MISC`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_large_finetuned_conll03_german_de_3.4.2_3.0_1652807937775.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_large_finetuned_conll03_german_de_3.4.2_3.0_1652807937775.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_large_finetuned_conll03_german","de") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_large_finetuned_conll03_german","de") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
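The `ner` column produced above holds token-level BIO tags (`B-PER`, `I-PER`, `O`, …); downstream code usually groups them into entity chunks, which is what Spark NLP's `NerConverter` annotator does. A minimal pure-Python sketch of that grouping:

```python
def bio_to_chunks(tokens, tags):
    """Group BIO-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" tag, or an I- tag that does not continue the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Ich", "liebe", "Spark", "NLP"]
tags = ["O", "O", "B-ORG", "I-ORG"]  # illustrative tags, not actual model output
print(bio_to_chunks(tokens, tags))  # -> [('Spark NLP', 'ORG')]
```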
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_xlm_roberta_large_finetuned_conll03_german| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|1.8 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/xlm-roberta-large-finetuned-conll03-german --- layout: model title: Telugu RobertaForMaskedLM Cased model (from neuralspace-reverie) author: John Snow Labs name: roberta_embeddings_indic_transformers date: 2022-12-12 tags: [te, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: te edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `indic-transformers-te-roberta` is a Telugu model originally trained by `neuralspace-reverie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indic_transformers_te_4.2.4_3.0_1670858686031.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indic_transformers_te_4.2.4_3.0_1670858686031.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_indic_transformers","te") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_indic_transformers","te") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_indic_transformers| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|te| |Size:|314.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-te-roberta - https://oscar-corpus.com/ --- layout: model title: Spanish Text Classification (from `hackathon-pln-es`) author: John Snow Labs name: roberta_jurisbert_clas_art_convencion_americana_dh date: 2022-05-20 tags: [roberta, text_classification, es, open_source] task: Text Classification language: es edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: RoBertaForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `jurisbert-clas-art-convencion-americana-dh` is a Spanish model originally trained by `hackathon-pln-es`. ## Predicted Entities `Artículo 63.1 Reparaciones`, `Artículo 15. Derecho de Reunión`, `Artículo 4. Derecho a la Vida`, `Artículo 1. Obligación de Respetar los Derechos`, `Artículo 5. Derecho a la Integridad Personal`, `Artículo 8. Garantías Judiciales`, `Artículo 19. Derechos del Niño`, `Artículo 17. Protección a la Familia`, `Artículo 2. Deber de Adoptar Disposiciones de Derecho Interno`, `Artículo 16. Libertad de Asociación`, `Artículo 25. Protección Judicial`, `Artículo 11. Protección de la Honra y de la Dignidad`, `Artículo 12. Libertad de Conciencia y de Religión`, `Artículo 9. Principio de Legalidad y de Retroactividad`, `Artículo 7. Derecho a la Libertad Personal`, `Artículo 24. Igualdad ante la Ley`, `Artículo 6. Prohibición de la Esclavitud y Servidumbre`, `Artículo 22. Derecho de Circulación y de Residencia`, `Artículo 28.
Cláusula Federal`, `Artículo 21. Derecho a la Propiedad Privada`, `Artículo_29_Normas_de_Interpretación`, `Artículo 23. Derechos Políticos`, `Artículo 13. Libertad de Pensamiento y de Expresión`, `Artículo 26. Desarrollo Progresivo`, `Artículo 30. Alcance de las Restricciones`, `Artículo 14. Derecho de Rectificación o Respuesta`, `Artículo 3. Derecho al Reconocimiento de la Personalidad Jurídica`, `Artículo 27. Suspensión de Garantías`, `Artículo 20. Derecho a la Nacionalidad`, `Artículo 18. Derecho al Nombre` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_jurisbert_clas_art_convencion_americana_dh_es_3.4.4_3.0_1653049484318.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_jurisbert_clas_art_convencion_americana_dh_es_3.4.4_3.0_1653049484318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = RoBertaForSequenceClassification.pretrained("roberta_jurisbert_clas_art_convencion_americana_dh","es") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Me encanta Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForSequenceClassification.pretrained("roberta_jurisbert_clas_art_convencion_americana_dh","es") .setInputCols(Array("sentence", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Me encanta Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_jurisbert_clas_art_convencion_americana_dh| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|es| |Size:|466.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References https://huggingface.co/hackathon-pln-es/jurisbert-clas-art-convencion-americana-dh --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_roberta_bert_triplet_epochs_1_shard_1_squad2.0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_bert_triplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_bert_triplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739325907.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_bert_triplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739325907.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_roberta_bert_triplet_epochs_1_shard_1_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_rule_based_roberta_bert_triplet_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base_ruletriplet_epochs_1_shard_1.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_roberta_bert_triplet_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|460.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_bert_triplet_epochs_1_shard_1_squad2.0 --- layout: model title: Pipeline to Detect diseases in Text (large) author: John Snow Labs name: ner_diseases_large_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, disease, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_diseases_large](https://nlp.johnsnowlabs.com/2021/04/01/ner_diseases_large_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_diseases_large_pipeline_en_3.4.1_3.0_1647872024826.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_diseases_large_pipeline_en_3.4.1_3.0_1647872024826.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_diseases_large_pipeline", "en", "clinical/models") pipeline.annotate("""Detection of various other intracellular signaling proteins is also described. Multiple autoimmune syndrome has been detected. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. She has Chikungunya virus disease story also. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""") ``` ```scala val pipeline = new PretrainedPipeline("ner_diseases_large_pipeline", "en", "clinical/models") pipeline.annotate("Detection of various other intracellular signaling proteins is also described. Multiple autoimmune syndrome has been detected. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. She has Chikungunya virus disease story also. 
We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.diseases_large.pipeline").predict("""Detection of various other intracellular signaling proteins is also described. Multiple autoimmune syndrome has been detected. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. She has Chikungunya virus disease story also. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""") ```
## Results ```bash +----------------------------+---------+ |chunk |ner_label| +----------------------------+---------+ |Multiple autoimmune syndrome|Disease | |T-cell leukemia |Disease | |T-cell leukemia |Disease | |Chikungunya virus disease |Disease | +----------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_diseases_large_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Drug Side Effect Classification Pipeline - Voice of the Patient author: John Snow Labs name: bert_sequence_classifier_vop_drug_side_effect_pipeline date: 2023-06-14 tags: [clinical, licensed, en, classification, vop] task: Text Classification language: en edition: Healthcare NLP 4.4.3 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline includes the Medical Bert for Sequence Classification model to classify health-related text in colloquial language according to the presence or absence of mentions of side effects related to drugs. The pipeline is built on top of the [bert_sequence_classifier_vop_drug_side_effect](https://nlp.johnsnowlabs.com/2023/06/13/bert_sequence_classifier_vop_drug_side_effect_en.html) model.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_drug_side_effect_pipeline_en_4.4.3_3.2_1686704779005.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_drug_side_effect_pipeline_en_4.4.3_3.2_1686704779005.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_sequence_classifier_vop_drug_side_effect_pipeline", "en", "clinical/models") pipeline.annotate("I felt kind of dizzy after taking that medication for a month.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_sequence_classifier_vop_drug_side_effect_pipeline", "en", "clinical/models") val result = pipeline.annotate("I felt kind of dizzy after taking that medication for a month.") ```
## Results ```bash | text | prediction | |:---------------------------------------------------------------|:-------------| | I felt kind of dizzy after taking that medication for a month. | Drug_AE | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_vop_drug_side_effect_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|406.4 MB| ## Included Models - DocumentAssembler - TokenizerModel - MedicalBertForSequenceClassification --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from JonatanGk) author: John Snow Labs name: roberta_qa_jonatangk_base_bne_finetuned_s_c date: 2023-01-20 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bne-finetuned-sqac` is a Spanish model originally trained by `JonatanGk`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_jonatangk_base_bne_finetuned_s_c_es_4.3.0_3.0_1674213010026.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_jonatangk_base_bne_finetuned_s_c_es_4.3.0_3.0_1674213010026.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_jonatangk_base_bne_finetuned_s_c","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_jonatangk_base_bne_finetuned_s_c","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_jonatangk_base_bne_finetuned_s_c| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|460.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/JonatanGk/roberta-base-bne-finetuned-sqac --- layout: model title: Legal Other remedies Clause Binary Classifier author: John Snow Labs name: legclf_other_remedies_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `other-remedies` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `other-remedies` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_other_remedies_clause_en_1.0.0_3.2_1660122803750.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_other_remedies_clause_en_1.0.0_3.2_1660122803750.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_other_remedies_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
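The paragraph splitting recommended in the description above can be done before the `clause_text` DataFrame is built. A minimal sketch in plain Python (no Spark required), assuming clauses are separated by blank lines; the sample contract text is invented for illustration:

```python
import re

def split_paragraphs(text: str) -> list[str]:
    # Split on one or more blank lines and drop empty fragments
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

contract = (
    "1. Remedies. The parties agree...\n\n"
    "2. Governing Law. This Agreement...\n\n\n"
    "3. Notices."
)
clauses = split_paragraphs(contract)
# each resulting paragraph becomes one row of the "clause_text" column fed to the pipeline
```

Splitting this way also keeps each input comfortably under the 512-token limit of the sentence embeddings used by the classifier.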
## Results ```bash +----------------+ | result| +----------------+ |[other-remedies]| |[other]| |[other]| |[other-remedies]| +----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_other_remedies_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.93 0.95 0.94 98 other-remedies 0.86 0.82 0.84 38 accuracy - - 0.91 136 macro-avg 0.90 0.88 0.89 136 weighted-avg 0.91 0.91 0.91 136 ``` --- layout: model title: German NER for Laws (Bert, Base) author: John Snow Labs name: legner_bert_base_courts date: 2022-10-02 tags: [de, legal, ner, laws, court, licensed] task: Named Entity Recognition language: de edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model can be used to detect legal entities in German text, predicting up to 19 different labels: ``` | tag | meaning ----------------- | AN | Anwalt | EUN | Europäische Norm | GS | Gesetz | GRT | Gericht | INN | Institution | LD | Land | LDS | Landschaft | LIT | Literatur | MRK | Marke | ORG | Organisation | PER | Person | RR | Richter | RS | Rechtssprechung | ST | Stadt | STR | Straße | UN | Unternehmen | VO | Verordnung | VS | Vorschrift | VT | Vertrag ``` This German Named Entity Recognition model was trained using a large German Base Bert model and finetuned on the Court Decisions (2017-2018) dataset (check the `References` section). You can also find a lighter Deep Learning (non-transformer-based) version in our Models Hub (`legner_courts`) and a Bert Large version (`legner_bert_large_courts`).
## Predicted Entities `STR`, `LIT`, `PER`, `EUN`, `VT`, `MRK`, `INN`, `UN`, `RS`, `ORG`, `GS`, `VS`, `LDS`, `GRT`, `VO`, `RR`, `LD`, `AN`, `ST` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_LEGAL_DE/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_bert_base_courts_de_1.0.0_3.0_1664708306072.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_bert_base_courts_de_1.0.0_3.0_1664708306072.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") ner_model = legal.BertForTokenClassification.pretrained("legner_bert_base_courts", "de", "legal/models")\ .setInputCols(["document", "token"])\ .setOutputCol("ner")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) ner_converter = nlp.NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, ner_model, ner_converter ]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) text_list = ["""Der Europäische Gerichtshof für Menschenrechte (EGMR) gibt dabei allerdings ebenso wenig wie das Bundesverfassungsgericht feste Fristen vor, sondern stellt auf die jeweiligen Umstände des Einzelfalls ab.""", """Formelle Rechtskraft ( § 705 ZPO ) trat mit Verkündung des Revisionsurteils am 15. Dezember 2016 ein (vgl. Zöller / Seibel ZPO 32. Aufl. § 705 Rn. 8) ."""] df = spark.createDataFrame(pd.DataFrame({"text" : text_list})) result = model.transform(df) result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \ .select(F.expr("cols['0']").alias("ner_chunk"), F.expr("cols['1']['entity']").alias("label")).show(truncate = False) ```
## Results ```bash +------------------------------------------+-----+ |ner_chunk |label| +------------------------------------------+-----+ |Europäische Gerichtshof für Menschenrechte|GRT | |EGMR |GRT | |Bundesverfassungsgericht |GRT | |§ 705 ZPO |GS | |Zöller / Seibel ZPO 32. Aufl. § 705 Rn. 8 |LIT | +------------------------------------------+-----+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_bert_base_courts| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|de| |Size:|407.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References The dataset used to train this model is taken from Leitner, et.al (2019) Leitner, E., Rehm, G., and Moreno-Schneider, J. (2019). Fine-grained Named Entity Recognition in Legal Documents. In Maribel Acosta, et al., editors, Semantic Systems. The Power of AI and Knowledge Graphs. Proceedings of the 15th International Conference (SEMANTiCS2019), number 11702 in Lecture Notes in Computer Science, pages 272–287, Karlsruhe, Germany, 9. Springer. 10/11 September 2019. Source of the annotated text: Court decisions from 2017 and 2018 were selected for the dataset, published online by the Federal Ministry of Justice and Consumer Protection. The documents originate from seven federal courts: Federal Labour Court (BAG), Federal Fiscal Court (BFH), Federal Court of Justice (BGH), Federal Patent Court (BPatG), Federal Social Court (BSG), Federal Constitutional Court (BVerfG) and Federal Administrative Court (BVerwG). 
## Benchmarking ```bash label precision recall f1-score support AN 0.82 0.61 0.70 23 EUN 0.90 0.93 0.92 210 GRT 0.95 0.98 0.96 445 GS 0.97 0.98 0.98 2739 INN 0.87 0.88 0.88 321 LD 0.92 0.94 0.93 189 LDS 0.44 0.73 0.55 26 LIT 0.85 0.91 0.88 449 MRK 0.40 0.86 0.55 44 ORG 0.72 0.79 0.76 184 PER 0.71 0.91 0.80 260 RR 0.73 0.58 0.65 208 RS 0.95 0.97 0.96 1859 ST 0.81 0.94 0.87 120 STR 0.69 0.69 0.69 26 UN 0.73 0.84 0.78 158 VO 0.82 0.86 0.84 107 VS 0.48 0.81 0.60 86 VT 0.90 0.87 0.89 442 micro-avg 0.90 0.93 0.92 7896 macro-avg 0.77 0.85 0.80 7896 weighted-avg 0.91 0.93 0.92 7896 ``` --- layout: model title: Word2Vec Embeddings in Malay (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, ms, open_source] task: Embeddings language: ms edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ms_3.4.1_3.0_1647445079012.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ms_3.4.1_3.0_1647445079012.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ms") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Saya suka Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ms") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Saya suka Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ms.embed.w2v_cc_300d").predict("""Saya suka Spark NLP""") ```
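Downstream, the 300-dimensional vectors produced by the lookup above are typically compared with cosine similarity. A toy sketch in plain Python, using invented 3-d stand-ins for the real 300-d embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # cosine similarity = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# toy 3-d stand-ins for the real 300-d vectors; values are illustrative only
v_suka = [0.9, 0.1, 1.1]    # hypothetical vector for "suka"
v_gemar = [1.0, 0.0, 1.0]   # hypothetical vector for a near-synonym
similar = cosine(v_suka, v_gemar)
```

In the Spark NLP pipeline the vectors live in the `embeddings` annotation column of `result` and can be exploded out with standard DataFrame operations.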
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|ms| |Size:|700.4 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Legal Documentation Document Classifier (EURLEX) author: John Snow Labs name: legclf_documentation_bert date: 2023-03-06 tags: [en, legal, classification, clauses, documentation, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_documentation_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class Documentation or not (Binary Classification) according to EuroVoc labels. ## Predicted Entities `Documentation`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_documentation_bert_en_1.0.0_3.0_1678111855792.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_documentation_bert_en_1.0.0_3.0_1678111855792.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_documentation_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +---------------+ |result| +---------------+ |[Documentation]| |[Other]| |[Other]| |[Documentation]| +---------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_documentation_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Documentation 0.81 0.83 0.82 83 Other 0.87 0.85 0.86 107 accuracy - - 0.84 190 macro-avg 0.84 0.84 0.84 190 weighted-avg 0.84 0.84 0.84 190 ``` --- layout: model title: English ElectraForQuestionAnswering model (from mbartolo) author: John Snow Labs name: electra_qa_large_synqa date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-large-synqa` is an English model originally trained by `mbartolo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_large_synqa_en_4.0.0_3.0_1655921148669.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_large_synqa_en_4.0.0_3.0_1655921148669.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_large_synqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_large_synqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.synqa.electra.large").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
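The nlu one-liner above packs the question and context into a single string separated by `|||`. Building several such inputs programmatically might look like this (a sketch, with the format inferred from the example call; the helper name is our own):

```python
def to_nlu_qa_input(question: str, context: str) -> str:
    # The nlu QA loaders take "question|||context" as one string
    return f"{question}|||{context}"

pairs = [
    ("What is my name?", "My name is Clara and I live in Berkeley."),
    ("Where do I live?", "My name is Clara and I live in Berkeley."),
]
inputs = [to_nlu_qa_input(q, c) for q, c in pairs]
# pass `inputs` to nlu.load(...).predict(...)
```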
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_large_synqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mbartolo/electra-large-synqa --- layout: model title: Sentence Detection in Telugu Text author: John Snow Labs name: sentence_detector_dl date: 2021-08-30 tags: [te, sentence_detection, open_source] task: Sentence Detection language: te edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_te_3.2.0_3.0_1630338728542.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_te_3.2.0_3.0_1630338728542.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "te") \ .setInputCols(["document"]) \ .setOutputCol("sentences") sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL])) sd_model.fullAnnotate("""ఆంగ్ల పఠన పేరాల యొక్క గొప్ప మూలం కోసం చూస్తున్నారా? మీరు సరైన స్థలానికి వచ్చారు. ఇటీవలి అధ్యయనం ప్రకారం, నేటి యువతలో చదివే అలవాటు వేగంగా తగ్గుతోంది. వారు కొన్ని సెకన్ల కంటే ఎక్కువ ఇచ్చిన ఆంగ్ల పఠన పేరాపై దృష్టి పెట్టలేరు! అలాగే, చదవడం అనేది అన్ని పోటీ పరీక్షలలో అంతర్భాగం. కాబట్టి, మీరు మీ పఠన నైపుణ్యాలను ఎలా మెరుగుపరుచుకుంటారు? ఈ ప్రశ్నకు సమాధానం నిజానికి మరొక ప్రశ్న: పఠన నైపుణ్యాల ఉపయోగం ఏమిటి? చదవడం యొక్క ముఖ్య ఉద్దేశ్యం 'అర్థం చేసుకోవడం'.""") ``` ```scala val documenter = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "te") .setInputCols(Array("document")) .setOutputCol("sentence") val pipeline = new Pipeline().setStages(Array(documenter, model)) val data = Seq("ఆంగ్ల పఠన పేరాల యొక్క గొప్ప మూలం కోసం చూస్తున్నారా? మీరు సరైన స్థలానికి వచ్చారు. ఇటీవలి అధ్యయనం ప్రకారం, నేటి యువతలో చదివే అలవాటు వేగంగా తగ్గుతోంది. వారు కొన్ని సెకన్ల కంటే ఎక్కువ ఇచ్చిన ఆంగ్ల పఠన పేరాపై దృష్టి పెట్టలేరు! అలాగే, చదవడం అనేది అన్ని పోటీ పరీక్షలలో అంతర్భాగం. కాబట్టి, మీరు మీ పఠన నైపుణ్యాలను ఎలా మెరుగుపరుచుకుంటారు? ఈ ప్రశ్నకు సమాధానం నిజానికి మరొక ప్రశ్న: పఠన నైపుణ్యాల ఉపయోగం ఏమిటి? చదవడం యొక్క ముఖ్య ఉద్దేశ్యం 'అర్థం చేసుకోవడం'.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load('te.sentence_detector').predict("ఆంగ్ల పఠన పేరాల యొక్క గొప్ప మూలం కోసం చూస్తున్నారా? మీరు సరైన స్థలానికి వచ్చారు. ఇటీవలి అధ్యయనం ప్రకారం, నేటి యువతలో చదివే అలవాటు వేగంగా తగ్గుతోంది. వారు కొన్ని సెకన్ల కంటే ఎక్కువ ఇచ్చిన ఆంగ్ల పఠన పేరాపై దృష్టి పెట్టలేరు! 
అలాగే, చదవడం అనేది అన్ని పోటీ పరీక్షలలో అంతర్భాగం. కాబట్టి, మీరు మీ పఠన నైపుణ్యాలను ఎలా మెరుగుపరుచుకుంటారు? ఈ ప్రశ్నకు సమాధానం నిజానికి మరొక ప్రశ్న: పఠన నైపుణ్యాల ఉపయోగం ఏమిటి? చదవడం యొక్క ముఖ్య ఉద్దేశ్యం 'అర్థం చేసుకోవడం'.", output_level ='sentence') ```
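For contrast with the neural detector above, the naive baseline it improves on is a plain punctuation split. A sketch in plain Python that works on simple cases but misfires on abbreviations, decimals, and quoted speech:

```python
import re

def naive_split(text: str) -> list[str]:
    # Split after ., !, or ? followed by whitespace -- no context awareness
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

sentences = naive_split("మీరు సరైన స్థలానికి వచ్చారు. చదవడం అన్ని పోటీ పరీక్షలలో అంతర్భాగం!")
```

SentenceDetectorDL learns boundary decisions from data instead, which is why it handles the ambiguous cases this rule gets wrong.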
## Results ```bash +--------------------------------------------------------------------------+ |result | +--------------------------------------------------------------------------+ |[ఆంగ్ల పఠన పేరాల యొక్క గొప్ప మూలం కోసం చూస్తున్నారా?] | |[మీరు సరైన స్థలానికి వచ్చారు.] | |[ఇటీవలి అధ్యయనం ప్రకారం, నేటి యువతలో చదివే అలవాటు వేగంగా తగ్గుతోంది.] | |[వారు కొన్ని సెకన్ల కంటే ఎక్కువ ఇచ్చిన ఆంగ్ల పఠన పేరాపై దృష్టి పెట్టలేరు!]| |[అలాగే, చదవడం అనేది అన్ని పోటీ పరీక్షలలో అంతర్భాగం.] | |[కాబట్టి, మీరు మీ పఠన నైపుణ్యాలను ఎలా మెరుగుపరుచుకుంటారు?] | |[ఈ ప్రశ్నకు సమాధానం నిజానికి మరొక ప్రశ్న:] | |[పఠన నైపుణ్యాల ఉపయోగం ఏమిటి?] | |[చదవడం యొక్క ముఖ్య ఉద్దేశ్యం 'అర్థం చేసుకోవడం'.] | +--------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentence_detector_dl| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[sentences]| |Language:|te| ## Benchmarking ```bash Accuracy: 0.98 Recall: 1.00 Precision: 0.96 F1: 0.98 ``` --- layout: model title: Danish Lemmatizer author: John Snow Labs name: lemma date: 2020-07-29 23:34:00 +0800 task: Lemmatization language: da edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [lemmatizer, da] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb#scrollTo=bbzEH9u7tdxR){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_da_2.5.5_2.4_1596054395311.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_da_2.5.5_2.4_1596054395311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "da") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("John Snow er bortset fra at være kongen i nord, en engelsk læge og en leder inden for udvikling af anæstesi og medicinsk hygiejne.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "da") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("John Snow er bortset fra at være kongen i nord, en engelsk læge og en leder inden for udvikling af anæstesi og medicinsk hygiejne.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""John Snow er bortset fra at være kongen i nord, en engelsk læge og en leder inden for udvikling af anæstesi og medicinsk hygiejne."""] lemma_df = nlu.load('da.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
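Under the hood, a dictionary lemmatizer is essentially a lookup from inflected forms to a shared root, with unknown tokens passed through unchanged. A minimal pure-Python sketch of that behavior, using a toy Danish lookup (the entries are illustrative only, not taken from the model's actual dictionary):

```python
# Toy lemma dictionary: maps inflected Danish forms to their root.
# Entries are illustrative only, not from the actual pretrained model.
LEMMA_DICT = {
    "er": "være",       # "is"      -> "to be"
    "var": "være",      # "was"     -> "to be"
    "læger": "læge",    # "doctors" -> "doctor"
    "ledere": "leder",  # "leaders" -> "leader"
}

def lemmatize(tokens):
    """Map each token to its dictionary root; unknown tokens pass through."""
    return [LEMMA_DICT.get(tok.lower(), tok) for tok in tokens]

print(lemmatize(["John", "Snow", "er", "en", "engelsk", "læge"]))
# -> ['John', 'Snow', 'være', 'en', 'engelsk', 'læge']
```

The real model additionally uses surrounding context to disambiguate forms that map to more than one root, which a plain lookup like this cannot do.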
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=3, result='John', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=5, end=8, result='Snow', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=10, end=11, result='være', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=13, end=19, result='bortset', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=21, end=23, result='fra', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|da| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Stopwords Remover for Slovene language (319 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, sl, open_source] task: Stop Words Removal language: sl edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_sl_3.4.1_3.0_1646672307800.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_sl_3.4.1_3.0_1646672307800.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","sl") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Nisi boljši od mene"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","sl") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Nisi boljši od mene").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("sl.stopwords").predict("""Nisi boljši od mene""") ```
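Conceptually, the cleaner just drops any token whose lowercase form appears in a fixed stopword list. A minimal sketch using a tiny illustrative subset of the Slovene list (the real model ships 319 entries):

```python
# Illustrative subset of a Slovene stopword list; the actual model has 319 entries.
STOPWORDS = {"od", "mene", "in", "je", "na"}

def clean_tokens(tokens, stopwords=STOPWORDS):
    """Remove tokens whose lowercase form is in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in stopwords]

print(clean_tokens(["Nisi", "boljši", "od", "mene"]))
# -> ['Nisi', 'boljši']
```

This matches the `[Nisi, boljši]` output shown in the Results section below for the same sentence.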
## Results ```bash +--------------+ |result | +--------------+ |[Nisi, boljši]| +--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|sl| |Size:|2.1 KB| --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_nl8 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-nl8` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl8_en_4.3.0_3.0_1675123172323.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl8_en_4.3.0_3.0_1675123172323.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_nl8","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_nl8","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_nl8| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|176.0 MB| ## References - https://huggingface.co/google/t5-efficient-small-nl8 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Embeddings Healthcare 100 dims author: John Snow Labs name: embeddings_healthcare_100d class: WordEmbeddingsModel language: en nav_key: models repository: clinical/models date: 2020-05-29 task: Embeddings edition: Healthcare NLP 2.5.0 spark_version: 2.4 tags: [clinical,embeddings,en] supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/embeddings_healthcare_100d_en_2.5.0_2.4_1590794626292.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/embeddings_healthcare_100d_en_2.5.0_2.4_1590794626292.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python model = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("word_embeddings") ``` ```scala val model = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models") .setInputCols("document","token") .setOutputCol("word_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.glove.healthcare_100d").predict("""Put your text here.""") ```
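As a lookup annotator, the model simply maps each token to a fixed 100-dimensional vector, with out-of-vocabulary tokens resolved to a zero vector. A rough sketch of that behavior (toy vocabulary and a smaller dimension, purely for illustration):

```python
import random

DIM = 4  # the real model uses 100 dimensions; 4 keeps the sketch readable

# Toy embedding table; the real vectors come from training on clinical corpora.
random.seed(0)
EMBEDDINGS = {w: [random.uniform(-1, 1) for _ in range(DIM)]
              for w in ["patient", "diabetes", "insulin"]}

def embed(tokens):
    """Look up each token; unknown tokens map to a zero vector."""
    zero = [0.0] * DIM
    return [EMBEDDINGS.get(tok.lower(), zero) for tok in tokens]

vectors = embed(["Patient", "has", "diabetes"])
print([v == [0.0] * DIM for v in vectors])
```

Here "has" is outside the toy vocabulary, so it receives the zero vector while the two in-vocabulary tokens get their stored embeddings.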
{:.h2_title} ## Results Word2Vec feature vectors based on ``embeddings_healthcare_100d``. {:.model-param} ## Model Information {:.table-model} |---------------|----------------------------| | Name: | embeddings_healthcare_100d | | Type: | WordEmbeddingsModel | | Compatibility: | Spark NLP 2.5.0+ | | License: | Licensed | | Edition: | Official | |Input labels: | [document, token] | |Output labels: | [word_embeddings] | | Language: | en | | Dimension: | 100.0 | {:.h2_title} ## Data Source Trained on PubMed + ICD10 + UMLS + MIMIC III corpora https://www.nlm.nih.gov/databases/download/pubmed_medline.html --- layout: model title: English XlmRoBertaForQuestionAnswering (from horsbug98) author: John Snow Labs name: xlm_roberta_qa_Part_1_XLM_Model_E1 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Part_1_XLM_Model_E1` is an English model originally trained by `horsbug98`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_Part_1_XLM_Model_E1_en_4.0.0_3.0_1655983332305.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_Part_1_XLM_Model_E1_en_4.0.0_3.0_1655983332305.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_Part_1_XLM_Model_E1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_Part_1_XLM_Model_E1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.tydiqa.xlm_roberta.by_horsbug98").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
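Extractive QA models of this kind score every context token as a potential answer start and end, and the returned answer is the highest-scoring valid span. A simplified standalone sketch of that span selection, with made-up scores standing in for the model's logits:

```python
def best_span(tokens, start_logits, end_logits, max_len=10):
    """Pick the (start, end) pair maximizing start+end score, with start <= end."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(tokens))):
            score = s_score + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    s, e, _ = best
    return " ".join(tokens[s:e + 1])

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 3.0, 0.0, 0.1, 0.0, 0.0, 0.5, 0.0]  # made-up logits
end   = [0.0, 0.1, 0.2, 2.5, 0.1, 0.0, 0.0, 0.0, 0.4, 0.0]
print(best_span(tokens, start, end))  # -> Clara
```

The real annotator works on subword pieces and applies a maximum sentence length (512 above), but the span-selection idea is the same.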
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_Part_1_XLM_Model_E1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|877.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/horsbug98/Part_1_XLM_Model_E1 --- layout: model title: English Bert Embeddings Cased model (from nlpie) author: John Snow Labs name: bert_embeddings_distil_clinical date: 2023-02-22 tags: [open_source, bert, bert_embeddings, bertformaskedlm, en, tensorflow] task: Embeddings language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distil-clinicalbert` is an English model originally trained by `nlpie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_distil_clinical_en_4.3.0_3.0_1677088459443.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_distil_clinical_en_4.3.0_3.0_1677088459443.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_distil_clinical","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark-NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_distil_clinical","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark-NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_distil_clinical| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|247.1 MB| |Case sensitive:|true| ## References https://huggingface.co/nlpie/distil-clinicalbert --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_16_finetuned_squad_seed_0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-16-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_16_finetuned_squad_seed_0_en_4.3.0_3.0_1674214155549.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_16_finetuned_squad_seed_0_en_4.3.0_3.0_1674214155549.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_16_finetuned_squad_seed_0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_16_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_16_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|416.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-16-finetuned-squad-seed-0 --- layout: model title: Detect Drugs and Posology Entities (ner_posology_greedy) author: John Snow Labs name: ner_posology_greedy date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects drugs, dosage, form, frequency, duration, route, and drug strength in text. It differs from `ner_posology` in the sense that it chunks together drugs, dosage, form, strength, and route when they appear together, resulting in a bigger chunk. It is trained using `embeddings_clinical` so please use the same embeddings in the pipeline. ## Predicted Entities `DRUG`, `STRENGTH`, `DURATION`, `FREQUENCY`, `FORM`, `DOSAGE`, `ROUTE`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_greedy_en_3.0.0_3.0_1617208415393.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_greedy_en_3.0.0_3.0_1617208415393.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. 
He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day."]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter)) val data = Seq("""The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology.greedy").predict("""The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.""") ```
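The `NerConverter` stage at the end of the pipeline turns per-token IOB tags into entity chunks by merging each `B-` tag with the `I-` tags that follow it. A small standalone sketch of that merging (the token tags here are made up for illustration):

```python
def merge_bio(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # flush the previous chunk before starting a new one
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" tag ends any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["prescribed", "1", "capsule", "of", "Advil", "10", "mg", "for", "5", "days"]
tags   = ["O", "B-DRUG", "I-DRUG", "I-DRUG", "I-DRUG", "I-DRUG", "I-DRUG",
          "B-DURATION", "I-DURATION", "I-DURATION"]
print(merge_bio(tokens, tags))
# -> [('1 capsule of Advil 10 mg', 'DRUG'), ('for 5 days', 'DURATION')]
```

This is exactly where the "greedy" behavior shows up: because the model tags dosage, form, and strength tokens as a continuation of the drug entity, the merged chunk is the larger span `1 capsule of Advil 10 mg` rather than `Advil` alone.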
## Results ```bash +----+----------------------------------+---------+-------+------------+ | | chunks | begin | end | entities | |---:|---------------------------------:|--------:|------:|-----------:| | 0 | 1 capsule of Advil 10 mg | 27 | 50 | DRUG | | 1 | magnesium hydroxide 100mg/1ml PO | 67 | 98 | DRUG | | 2 | for 5 days | 52 | 61 | DURATION | | 3 | 40 units of insulin glargine | 168 | 195 | DRUG | | 4 | at night | 197 | 204 | FREQUENCY | | 5 | 12 units of insulin lispro | 207 | 232 | DRUG | | 6 | with meals | 234 | 243 | FREQUENCY | | 7 | metformin 1000 mg | 250 | 266 | DRUG | | 8 | two times a day | 268 | 282 | FREQUENCY | +----+----------------------------------+---------+-------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_greedy| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on augmented i2b2_med7 + FDA dataset with ``embeddings_clinical``, [https://www.i2b2.org/NLP/Medication](https://www.i2b2.org/NLP/Medication). --- layout: model title: Word2Vec Embeddings in Occitan (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, oc, open_source] task: Embeddings language: oc edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_oc_3.4.1_3.0_1647450923281.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_oc_3.4.1_3.0_1647450923281.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","oc") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","oc") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("oc.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|oc| |Size:|459.6 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English Named Entity Recognition (from abhishek) author: John Snow Labs name: bert_ner_autonlp_prodigy_10_3362554 date: 2022-05-09 tags: [bert, ner, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `autonlp-prodigy-10-3362554` is an English model originally trained by `abhishek`. ## Predicted Entities `LOCATION`, `PERSON`, `ORG`, `PRODUCT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_autonlp_prodigy_10_3362554_en_3.4.2_3.0_1652097317068.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_autonlp_prodigy_10_3362554_en_3.4.2_3.0_1652097317068.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_autonlp_prodigy_10_3362554","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_autonlp_prodigy_10_3362554","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_autonlp_prodigy_10_3362554| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/abhishek/autonlp-prodigy-10-3362554 --- layout: model title: Thai BertForQuestionAnswering model (from zhufy) author: John Snow Labs name: bert_qa_xquad_th_mbert_base date: 2022-06-02 tags: [th, open_source, question_answering, bert] task: Question Answering language: th edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xquad-th-mbert-base` is a Thai model originally trained by `zhufy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_xquad_th_mbert_base_th_4.0.0_3.0_1654192577829.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_xquad_th_mbert_base_th_4.0.0_3.0_1654192577829.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_xquad_th_mbert_base","th") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_xquad_th_mbert_base","th") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("th.answer_question.xquad.multi_lingual_bert.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_xquad_th_mbert_base| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|th| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/zhufy/xquad-th-mbert-base - https://github.com/deepmind/xquad --- layout: model title: Sentiment Analysis of IMDB Reviews (sentimentdl_use_imdb) author: John Snow Labs name: sentimentdl_use_imdb date: 2021-01-15 task: Sentiment Analysis language: en nav_key: models edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, en, sentiment] supported: true annotator: SentimentDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Classify IMDB reviews in negative and positive categories using `Universal Sentence Encoder`. ## Predicted Entities `neg`, `pos` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentimentdl_use_imdb_en_2.7.0_2.4_1610715247685.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentimentdl_use_imdb_en_2.7.0_2.4_1610715247685.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") use = UniversalSentenceEncoder.pretrained('tfhub_use', lang="en") \ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") classifier = SentimentDLModel().pretrained('sentimentdl_use_imdb')\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("sentiment") nlp_pipeline = Pipeline(stages=[document_assembler, use, classifier ]) l_model = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = l_model.fullAnnotate('Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now!') ``` {:.nlu-block} ```python import nlu nlu.load("en.sentiment.imdb.use.dl").predict("""Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now!""") ```
## Results ```bash
|    | document                                                                                                                                                                                                                                                     | sentiment |
|---:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------|
|  0 | Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now! | positive  |
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentimentdl_use_imdb| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[sentiment]| |Language:|en| |Dependencies:|tfhub_use| ## Data Source This model is trained on data from https://ai.stanford.edu/~amaas/data/sentiment/ ## Benchmarking ```bash
              precision    recall  f1-score   support
         neg       0.88      0.82      0.85     12500
         pos       0.84      0.88      0.86     12500
    accuracy                           0.85     25000
   macro avg       0.86      0.86      0.85     25000
weighted avg       0.86      0.85      0.85     25000
``` --- layout: model title: Pipeline to Extract Granular Anatomical Entities from Oncology Texts author: John Snow Labs name: ner_oncology_anatomy_granular_pipeline date: 2023-03-08 tags: [licensed, clinical, en, oncology, ner, anatomy] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_oncology_anatomy_granular](https://nlp.johnsnowlabs.com/2022/11/24/ner_oncology_anatomy_granular_en.html) model.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_granular_pipeline_en_4.3.0_3.2_1678286098380.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_granular_pipeline_en_4.3.0_3.2_1678286098380.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_oncology_anatomy_granular_pipeline", "en", "clinical/models") text = '''The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_oncology_anatomy_granular_pipeline", "en", "clinical/models") val text = "The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver." val result = pipeline.fullAnnotate(text) ```
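The `fullAnnotate` call returns chunk annotations that are usually flattened into rows for inspection. A minimal pure-Python sketch of that flattening step, using hypothetical stand-in dicts (the real Spark NLP annotation objects expose `result`, `begin`, `end`, and `metadata` fields, but the exact layout depends on the library version):

```python
# Hypothetical stand-ins for the ner_chunk annotations returned by fullAnnotate;
# values mirror the first two chunks from the example sentence above.
annotations = [
    {"result": "left",   "begin": 36, "end": 39, "metadata": {"entity": "Direction",   "confidence": "0.9981"}},
    {"result": "breast", "begin": 41, "end": 46, "metadata": {"entity": "Site_Breast", "confidence": "0.9969"}},
]

def to_rows(chunks):
    """Flatten chunk annotations into (text, begin, end, label, confidence) rows."""
    return [
        (c["result"], c["begin"], c["end"], c["metadata"]["entity"], float(c["metadata"]["confidence"]))
        for c in chunks
    ]

rows = to_rows(annotations)
```

Each tuple in `rows` corresponds to one line of the results table for this pipeline.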
## Results ```bash
|    | ner_chunks | begin | end | ner_label   | confidence |
|---:|:-----------|------:|----:|:------------|-----------:|
|  0 | left       |    36 |  39 | Direction   |     0.9981 |
|  1 | breast     |    41 |  46 | Site_Breast |     0.9969 |
|  2 | lungs      |    82 |  86 | Site_Lung   |     0.9978 |
|  3 | liver      |    99 | 103 | Site_Liver  |     0.9999 |
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_anatomy_granular_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Legal Duration Clause Binary Classifier author: John Snow Labs name: legclf_duration_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `duration` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `duration` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_duration_clause_en_1.0.0_3.2_1660123443846.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_duration_clause_en_1.0.0_3.2_1660123443846.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_duration_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
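The description above recommends paragraph splitting (by multiline) before feeding long documents to the classifier, so each candidate fits within the model's 512-token window. A minimal sketch of that pre-processing step in plain Python (the regex and helper name are illustrative, not taken from the workshop notebook):

```python
import re

def split_paragraphs(text):
    """Split a long legal document into paragraph-level candidates on blank lines,
    dropping empty fragments; each piece is classified independently."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Toy two-paragraph document standing in for a real contract.
doc = (
    "Section 1. Term.\nThis Agreement shall remain in force for five years.\n\n"
    "Section 2. Notices.\nAll notices shall be in writing."
)
clauses = split_paragraphs(doc)
```

Each element of `clauses` can then become one row of the `clause_text` column in the pipeline above.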
## Results ```bash
+----------+
|    result|
+----------+
|[duration]|
|   [other]|
|   [other]|
|[duration]|
+----------+
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_duration_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash
       label  precision  recall  f1-score  support
    duration       0.97    0.95      0.96       37
       other       0.98    0.99      0.98       86
    accuracy          -       -      0.98      123
   macro-avg       0.97    0.97      0.97      123
weighted-avg       0.98    0.98      0.98      123
``` --- layout: model title: Ganda XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_luganda date: 2022-08-13 tags: [lg, open_source, xlm_roberta, ner] task: Named Entity Recognition language: lg edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-ner-luganda` is a Ganda model originally trained by `mbeukman`. ## Predicted Entities `ORG`, `LOC`, `PER`, `DATE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_luganda_lg_4.1.0_3.0_1660427316470.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_luganda_lg_4.1.0_3.0_1660427316470.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_luganda","lg") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_luganda","lg") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_luganda| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|lg| |Size:|776.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-ner-luganda - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://github.com/masakhane-io/masakhane-ner - https://www.apache.org/licenses/LICENSE-2.0 --- layout: model title: English RoBERTa Embeddings (SCOTUS dataset) author: John Snow Labs name: roberta_embeddings_fairlex_scotus_minilm date: 2022-04-14 tags: [roberta, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `fairlex-scotus-minilm` is an English model originally trained by `coastalcph`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_fairlex_scotus_minilm_en_3.4.2_3.0_1649947447091.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_fairlex_scotus_minilm_en_3.4.2_3.0_1649947447091.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_fairlex_scotus_minilm","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_fairlex_scotus_minilm","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.fairlex_scotus_minilm").predict("""I love Spark NLP""") ```
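Once token vectors land in the `embeddings` column, a common downstream step is comparing them with cosine similarity. A small pure-Python sketch with toy low-dimensional stand-ins for the model's vectors (real vectors from this model are much higher-dimensional and would be extracted from the result DataFrame):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional stand-ins for two token embeddings.
v_love, v_spark = [0.2, 0.7, 0.1], [0.1, 0.8, 0.0]
sim = cosine(v_love, v_spark)
```

Similar tokens yield a similarity near 1.0; unrelated ones drift toward 0.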
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_fairlex_scotus_minilm| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|114.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/coastalcph/fairlex-scotus-minilm - https://coastalcph.github.io - https://github.com/iliaschalkidis - https://twitter.com/KiddoThe2B --- layout: model title: Portuguese BERT Embeddings (Large Cased) author: John Snow Labs name: bert_portuguese_large_cased date: 2020-11-04 task: Embeddings language: pt edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, pt] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a pre-trained BERT model for the Portuguese language. The `BERT-Base` and `BERT-Large` Cased variants were trained on `BrWaC` (Brazilian Web as Corpus), a large Portuguese corpus, for 1,000,000 steps, using whole-word masking. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_portuguese_large_cased_pt_2.6.0_2.4_1604487922125.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_portuguese_large_cased_pt_2.6.0_2.4_1604487922125.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("bert_portuguese_large_cased", "pt") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['Eu amo PNL']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("bert_portuguese_large_cased", "pt") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("Eu amo PNL").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Eu amo PNL"] embeddings_df = nlu.load('pt.bert.cased.large').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash
token    pt_bert_cased_large_embeddings
Eu       [0.6893012523651123, 0.18436528742313385, 0.14...
amo      [0.6536692976951599, 0.17582201957702637, -0.5...
PNL      [-0.1397203803062439, 0.5698696374893188, -0.3...
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_portuguese_large_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[pt]| |Dimension:|1024| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from https://github.com/neuralmind-ai/portuguese-bert --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_10 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-512-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_10_en_4.0.0_3.0_1657185134431.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_10_en_4.0.0_3.0_1657185134431.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_10","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-512-finetuned-squad-seed-10 --- layout: model title: English DistilBertForQuestionAnswering model (from anurag0077) Squad3 author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_squad3 date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad3` is an English model originally trained by `anurag0077`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad3_en_4.0.0_3.0_1654726909717.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad3_en_4.0.0_3.0_1654726909717.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad3","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_anurag0077").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_squad3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anurag0077/distilbert-base-uncased-finetuned-squad3 --- layout: model title: Legal Change in control Clause Binary Classifier author: John Snow Labs name: legclf_change_in_control_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `change-in-control` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `change-in-control` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_change_in_control_clause_en_1.0.0_3.2_1660123291976.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_change_in_control_clause_en_1.0.0_3.2_1660123291976.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_change_in_control_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
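The description notes that several of these binary clause classifiers can be stacked over the same text, yielding one True/False flag per clause type. A pure-Python sketch of collapsing the per-model predictions into such flags (model names and predicted labels here are illustrative stand-ins for real pipeline outputs):

```python
# Hypothetical per-model outputs: each binary classifier predicts either its
# own clause type or "other" for the same input clause text.
predictions = {
    "legclf_change_in_control_clause": "change-in-control",
    "legclf_duration_clause": "other",
}

def clause_flags(preds):
    """Collapse binary clause classifier outputs into model -> bool flags;
    a prediction other than "other" means the clause type was detected."""
    return {model: label != "other" for model, label in preds.items()}

flags = clause_flags(predictions)
```

The resulting dict gives one boolean per classifier you added to the stack.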
## Results ```bash
+-------------------+
|             result|
+-------------------+
|[change-in-control]|
|            [other]|
|            [other]|
|[change-in-control]|
+-------------------+
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_change_in_control_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash
             label  precision  recall  f1-score  support
 change-in-control       0.89    0.96      0.92       25
             other       0.99    0.96      0.97       76
          accuracy          -       -      0.96      101
         macro-avg       0.94    0.96      0.95      101
      weighted-avg       0.96    0.96      0.96      101
``` --- layout: model title: Vietnamese XlmRoBertaForQuestionAnswering (from bhavikardeshna) author: John Snow Labs name: xlm_roberta_qa_xlm_roberta_base_vietnamese date: 2022-06-23 tags: [vn, open_source, question_answering, xlmroberta] task: Question Answering language: vn edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-vietnamese` is a Vietnamese model originally trained by `bhavikardeshna`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_vietnamese_vn_4.0.0_3.0_1655991784070.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_vietnamese_vn_4.0.0_3.0_1655991784070.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_base_vietnamese","vn") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_base_vietnamese","vn") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("vn.answer_question.xlm_roberta.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_roberta_base_vietnamese| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|vn| |Size:|880.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/bhavikardeshna/xlm-roberta-base-vietnamese --- layout: model title: Detect Drug Chemicals (BertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_drugs date: 2021-09-20 tags: [drug, ner, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.2.0 spark_version: 2.4 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for Drugs. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. It detects drug chemicals. ## Predicted Entities `DrugChem` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_BERT_TOKEN_CLASSIFIER/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_drugs_en_3.2.0_2.4_1632141658042.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_drugs_en_3.2.0_2.4_1632141658042.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_drugs", "en", "clinical/models")\ .setInputCols("token", "sentence")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']}))) test_sentence = """The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.
With the objective of determining the usefulnessof vinorelbine monotherapy in patients with advanced or recurrent breast cancerafter standard therapy, we evaluated the efficacy and safety of vinorelbine inpatients previously treated with anthracyclines and taxanes.""" result = model.transform(spark.createDataFrame(pd.DataFrame({'text': [test_sentence]}))) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_drugs", "en", "clinical/models") .setInputCols(Array("token", "sentence")) .setOutputCol("ner") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val data = Seq("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle.
The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulnessof vinorelbine monotherapy in patients with advanced or recurrent breast cancerafter standard therapy, we evaluated the efficacy and safety of vinorelbine inpatients previously treated with anthracyclines and taxanes.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_drugs").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. 
With the objective of determining the usefulnessof vinorelbine monotherapy in patients with advanced or recurrent breast cancerafter standard therapy, we evaluated the efficacy and safety of vinorelbine inpatients previously treated with anthracyclines and taxanes.""") ```
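The `NerConverter` stage in the pipelines above merges token-level IOB tags into entity chunks. A simplified sketch of that grouping logic (the token/tag pairs are illustrative, not actual pipeline output):

```python
def iob_to_chunks(tokens, tags):
    """Group B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag starts a new chunk; flush any chunk in progress.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # An I- tag with a matching label continues the current chunk.
            current.append(tok)
        else:
            # O tag (or inconsistent I-) ends the current chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["treated", "with", "anthracyclines", "and", "taxanes"]
tags = ["O", "O", "B-DrugChem", "O", "B-DrugChem"]
print(iob_to_chunks(tokens, tags))  # [('anthracyclines', 'DrugChem'), ('taxanes', 'DrugChem')]
```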
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |potassium |DrugChem | |nucleotide |DrugChem | |anthracyclines|DrugChem | |taxanes |DrugChem | |vinorelbine |DrugChem | |vinorelbine |DrugChem | |anthracyclines|DrugChem | |taxanes |DrugChem | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_drugs| |Compatibility:|Healthcare NLP 3.2.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Case sensitive:|true| |Max sentence length:|128| ## Data Source Trained on i2b2_med7 + FDA. https://www.i2b2.org/NLP/Medication ## Benchmarking ```bash label precision recall f1-score support B-DrugChem 0.99 0.99 0.99 97872 I-DrugChem 0.99 0.99 0.99 54909 O 1.00 1.00 1.00 1191109 accuracy - - 1.00 1343890 macro-avg 0.99 0.99 0.99 1343890 weighted-avg 1.00 1.00 1.00 1343890 ``` --- layout: model title: Translate Haitian Creole to English Pipeline author: John Snow Labs name: translate_ht_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ht, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list).
Note that this module is very computationally expensive, especially on longer sequences. Using an accelerator such as a GPU is recommended. - source languages: `ht` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ht_en_xx_2.7.0_2.4_1609687007155.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ht_en_xx_2.7.0_2.4_1609687007155.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ht_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ht_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ht.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ht_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForMaskedLM Base Cased model (from model-attribution-challenge) author: John Snow Labs name: roberta_embeddings_model_attribution_challenge_base date: 2022-12-12 tags: [en, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base` is an English model originally trained by `model-attribution-challenge`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_model_attribution_challenge_base_en_4.2.4_3.0_1670859033776.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_model_attribution_challenge_base_en_4.2.4_3.0_1670859033776.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_model_attribution_challenge_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_model_attribution_challenge_base","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
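A typical downstream use of the `embeddings` column is comparing tokens by cosine similarity. A self-contained sketch with short illustrative vectors (real RoBERTa base embeddings are 768-dimensional; these numbers are invented):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Illustrative 4-d vectors standing in for real token embeddings.
spark_vec = [0.2, 0.7, -0.1, 0.4]
nlp_vec = [0.25, 0.6, -0.2, 0.5]
love_vec = [-0.6, 0.1, 0.8, -0.3]

# Tokens with similar contexts get similar vectors, so this prints True.
print(cosine(spark_vec, nlp_vec) > cosine(spark_vec, love_vec))  # True
```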
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_model_attribution_challenge_base| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|300.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/model-attribution-challenge/roberta-base - https://arxiv.org/abs/1907.11692 - https://github.com/pytorch/fairseq/tree/master/examples/roberta - https://yknzhu.wixsite.com/mbweb - https://en.wikipedia.org/wiki/English_Wikipedia - https://commoncrawl.org/2016/10/news-dataset-available/ - https://github.com/jcpeterson/openwebtext - https://arxiv.org/abs/1806.02847 --- layout: model title: Detect Drugs and posology entities including experimental drugs and cycles (ner_posology_experimental) author: John Snow Labs name: ner_posology_experimental date: 2021-09-01 tags: [licensed, clinical, en, ner] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.1.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects drugs, experimental drugs, cyclelength, cyclecount, cycleday, dosage, form, frequency, duration, route, and drug strength in text. It is based on the core `ner_posology` model, adds support for drug cycles, and is enriched with additional data from clinical trials.
## Predicted Entities `Administration`, `Cyclenumber`, `Strength`, `Cycleday`, `Duration`, `Cyclecount`, `Route`, `Form`, `Frequency`, `Cyclelength`, `Drug`, `Dosage` {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_experimental_en_3.1.3_3.0_1630511369574.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_experimental_en_3.1.3_3.0_1630511369574.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_posology_experimental", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["Y-90 Humanized Anti-Tac: 10 mCi (if a bone marrow transplant was part of the patient's previous therapy) or 15 mCi of yttrium labeled anti-TAC; followed by calcium trisodium Inj (Ca DTPA)..\n\nCalcium-DTPA: Ca-DTPA will be administered intravenously on Days 1-3 to clear the radioactive agent from the body."]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_posology_experimental", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new
NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter)) val data = Seq("""Y-90 Humanized Anti-Tac: 10 mCi (if a bone marrow transplant was part of the patient's previous therapy) or 15 mCi of yttrium labeled anti-TAC; followed by calcium trisodium Inj (Ca DTPA)..\n\nCalcium-DTPA: Ca-DTPA will be administered intravenously on Days 1-3 to clear the radioactive agent from the body.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology.experimental").predict("""Y-90 Humanized Anti-Tac: 10 mCi (if a bone marrow transplant was part of the patient's previous therapy) or 15 mCi of yttrium labeled anti-TAC; followed by calcium trisodium Inj (Ca DTPA)..\n\nCalcium-DTPA: Ca-DTPA will be administered intravenously on Days 1-3 to clear the radioactive agent from the body.""") ```
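A common post-processing step for posology NER is attaching attribute chunks (Dosage, Route, Cycleday, …) to the nearest preceding Drug mention. A naive sketch of that heuristic, using chunk values like those this model extracts (the ordering and pairing here are illustrative; production systems typically use a relation-extraction model instead):

```python
def attach_to_drug(chunks):
    """Attach posology attribute chunks to the most recent Drug mention."""
    records = {}
    current_drug = None
    for text, label in chunks:
        if label == "Drug":
            current_drug = text
            records.setdefault(current_drug, [])
        elif current_drug is not None:
            # Attribute chunks bind to the last Drug seen in reading order.
            records[current_drug].append((label, text))
    return records

# Chunk/label pairs like those this model produces (hypothetical ordering).
chunks = [("Anti-Tac", "Drug"), ("10 mCi", "Dosage"),
          ("Ca-DTPA", "Drug"), ("intravenously", "Route"), ("Days 1-3", "Cycleday")]
print(attach_to_drug(chunks))
```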
## Results ```bash | | chunk | begin | end | entity | |---:|:-------------------------|--------:|------:|:---------| | 0 | Anti-Tac | 15 | 22 | Drug | | 1 | 10 mCi | 25 | 30 | Dosage | | 2 | 15 mCi | 108 | 113 | Dosage | | 3 | yttrium labeled anti-TAC | 118 | 141 | Drug | | 4 | calcium trisodium Inj | 156 | 176 | Drug | | 5 | Calcium-DTPA | 191 | 202 | Drug | | 6 | Ca-DTPA | 205 | 211 | Drug | | 7 | intravenously | 234 | 246 | Route | | 8 | Days 1-3 | 251 | 258 | Cycleday | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_experimental| |Compatibility:|Healthcare NLP 3.1.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source This model is trained on FDA 2018 Medication dataset, enriched with clinical trials data. ## Benchmarking ```bash label tp fp fn prec rec f1 B-Drug 30260 1321 1630 0.95817107 0.9488868 0.95350635 B-Cycleday 294 1 7 0.99661016 0.9767442 0.9865772 B-Dosage 4019 441 972 0.9011211 0.8052494 0.85049194 I-Strength 21784 2375 1616 0.9016929 0.9309401 0.9160832 I-Cyclenumber 113 2 1 0.9826087 0.9912280 0.98689955 B-Cyclelength 217 3 0 0.98636365 1.0 0.99313504 B-Administration 97 1 5 0.9897959 0.95098037 0.96999997 I-Cyclecount 174 7 3 0.96132594 0.9830508 0.972067 B-Strength 18871 1299 1161 0.9355974 0.9420427 0.93880904 B-Frequency 13064 464 713 0.96570075 0.9482471 0.95689434 B-Cyclenumber 93 2 1 0.97894734 0.9893617 0.9841269 I-Duration 6116 519 738 0.92177844 0.89232564 0.9068129 B-Cyclecount 120 5 3 0.96 0.9756098 0.9677419 B-Form 10964 912 986 0.92320645 0.9174895 0.9203391 I-Route 275 42 51 0.8675079 0.8435583 0.85536546 I-Cyclelength 261 5 0 0.981203 1.0 0.9905123 I-Dosage 2385 471 1107 0.835084 0.6829897 0.75141776 I-Cycleday 548 5 13 0.9909584 0.9768271 0.983842 I-Frequency 18644 967 1574 0.9506909 0.9221486 0.9362023 I-Administration 303 10 5 0.9680511 0.98376626 0.9758454 I-Form 642 284 553 0.6933045 0.5372385 
0.6053748 B-Route 5930 280 692 0.9549114 0.8954998 0.92425185 B-Duration 2422 261 359 0.9027208 0.87090975 0.88653 I-Drug 11472 1066 1240 0.9149784 0.9024544 0.9086733 Macro-average 149068 10743 13430 0.93426394 0.9111479 0.92256117 Micro-average 149068 10743 13430 0.93277687 0.91735286 0.9250006 ``` --- layout: model title: GloVe Embeddings 840B 300 (Multilingual) author: John Snow Labs name: glove_840B_300 date: 2020-01-22 task: Embeddings language: xx edition: Spark NLP 2.4.0 spark_version: 2.4 tags: [open_source, embeddings] supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description GloVe (Global Vectors) is a model for distributed word representation. This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity. It outperformed many common Word2vec models on the word analogy task. One benefit of GloVe is that it is the result of directly modeling relationships, instead of getting them as a side effect of training a language model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/glove_840B_300_xx_2.4.0_2.4_1579698926752.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/glove_840B_300_xx_2.4.0_2.4_1579698926752.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", "xx") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love Spark NLP']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", "xx") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""I love Spark NLP"""] glove_df = nlu.load('xx.embed.glove.840B_300').predict(text) glove_df ```
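The word-analogy behavior mentioned in the description comes from simple vector arithmetic: the offset between related words is roughly constant, so `king - man + woman` lands near `queen`. A toy sketch with hand-made 3-d vectors (real GloVe 840B vectors are 300-d and learned from co-occurrence statistics; these numbers are invented):

```python
import math

# Toy 3-d vectors standing in for learned GloVe embeddings.
vecs = {
    "king": [0.8, 0.9, 0.1],
    "queen": [0.8, 0.1, 0.9],
    "man": [0.2, 0.9, 0.1],
    "woman": [0.2, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

def cos(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# king - man + woman should land closest to queen.
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
best = max((w for w in vecs if w not in ("king", "man", "woman")),
           key=lambda w: cos(target, vecs[w]))
print(best)  # queen
```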
{:.h2_title} ## Results ```bash token | glove_embeddings | -------|----------------------------------------------------| I | [0.1941000074148178, 0.22603000700473785, -0.4...] | love | [0.13948999345302582, 0.534529983997345, -0.25...] | Spark | [0.20353999733924866, 0.6292600035667419, 0.27...] | NLP | [0.059436000883579254, 0.18411000072956085, -0...] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|glove_840B_300| |Type:|embeddings| |Compatibility:|Spark NLP 2.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[xx]| |Dimension:|300| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://nlp.stanford.edu/projects/glove/](https://nlp.stanford.edu/projects/glove/) --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_samantharhay TFWav2Vec2ForCTC from samantharhay author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab_by_samantharhay date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_samantharhay` is an English model originally trained by samantharhay.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_by_samantharhay_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_samantharhay_en_4.2.0_3.0_1664102943776.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_samantharhay_en_4.2.0_3.0_1664102943776.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_by_samantharhay", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_by_samantharhay", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
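Both snippets assume an existing `audioDf` with an `audio_content` column holding the raw audio as an array of floats. A framework-free sketch of producing such an array from a 16 kHz mono WAV (generated in memory here so the example is self-contained; Wav2Vec2 models expect 16 kHz mono input):

```python
import io
import math
import struct
import wave

RATE = 16000  # Wav2Vec2 models are trained on 16 kHz audio

# Generate a 0.1 s 440 Hz tone in memory so the example needs no external file.
pcm = [int(32767 * 0.3 * math.sin(2 * math.pi * 440 * t / RATE)) for t in range(RATE // 10)]
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)   # mono
    w.setsampwidth(2)   # 16-bit samples
    w.setframerate(RATE)
    w.writeframes(struct.pack("<%dh" % len(pcm), *pcm))

# Decode the WAV back into the normalized [-1, 1] float array the pipeline consumes.
buf.seek(0)
with wave.open(buf, "rb") as w:
    raw = w.readframes(w.getnframes())
audio_content = [s / 32768.0 for s in struct.unpack("<%dh" % (len(raw) // 2), raw)]
print(len(audio_content))  # 1600 samples = 0.1 s at 16 kHz
```

`audioDf` could then be assembled with something like `spark.createDataFrame([[audio_content]], ["audio_content"])`; the exact schema expected by `AudioAssembler` is an assumption here, not verified against a running cluster.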
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab_by_samantharhay| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|354.8 MB| --- layout: model title: Spanish BertForQuestionAnswering model (from mrm8488) author: John Snow Labs name: bert_qa_distill_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488 date: 2022-06-03 tags: [es, open_source, question_answering, bert] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_distill_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488_es_4.0.0_3.0_1654249847218.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_distill_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488_es_4.0.0_3.0_1654249847218.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_distill_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_distill_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.squadv2.bert.distilled_base_cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
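The nlu one-liner above packs question and context into a single string separated by `|||`. A tiny sketch of that convention (the exact parsing nlu performs internally may differ; this only mirrors the example's format):

```python
def split_qa(payload, sep="|||"):
    """Split an nlu-style 'question|||context' payload into its two parts."""
    question, _, context = payload.partition(sep)
    # Strip whitespace and the stray quote seen in the example payloads.
    return question.strip(), context.strip().strip('"')

q, c = split_qa('''What's my name?|||"My name is Clara and I live in Berkeley.''')
print(q)  # What's my name?
print(c)  # My name is Clara and I live in Berkeley.
```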
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_distill_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|es| |Size:|410.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es - https://github.com/dccuchile/beto - https://twitter.com/mrm8488 - https://github.com/ccasimiro88/TranslateAlignRetrieve - https://colab.research.google.com/github/mrm8488/shared_colab_notebooks/blob/master/Huggingface_pipelines_demo.ipynb - https://colab.research.google.com/github/mrm8488/shared_colab_notebooks/blob/master/Using_Spanish_BERT_fine_tuned_for_Q%26A_pipelines.ipynb --- layout: model title: Detect Clinical Conditions (ner_eu_clinical_case - fr) author: John Snow Labs name: ner_eu_clinical_condition date: 2023-02-06 tags: [fr, clinical, licensed, ner, clinical_condition] task: Named Entity Recognition language: fr edition: Healthcare NLP 4.2.8 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition (NER) deep learning model for extracting clinical conditions from French texts. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNN. The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives.
## Predicted Entities `clinical_condition` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_fr_4.2.8_3.0_1675725809666.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_fr_4.2.8_3.0_1675725809666.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","fr")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained('ner_eu_clinical_condition', "fr", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["""Il aurait présenté il y’ a environ 30 ans des ulcérations génitales non traitées spontanément guéries. L’interrogatoire retrouvait une toux sèche depuis trois mois, des douleurs rétro-sternales constrictives, une dyspnée stade III de la NYHA et un contexte d’ apyrexie. 
Sur ce tableau s’ est greffé des œdèmes des membres inférieurs puis un tableau d’ anasarque d’ où son hospitalisation en cardiologie pour décompensation cardiaque globale."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","fr") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_eu_clinical_condition", "fr", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documenter, sentenceDetector, tokenizer, word_embeddings, ner_model, ner_converter)) val data = Seq(Array("""Il aurait présenté il y’ a environ 30 ans des ulcérations génitales non traitées spontanément guéries. L’interrogatoire retrouvait une toux sèche depuis trois mois, des douleurs rétro-sternales constrictives, une dyspnée stade III de la NYHA et un contexte d’ apyrexie. Sur ce tableau s’ est greffé des œdèmes des membres inférieurs puis un tableau d’ anasarque d’ où son hospitalisation en cardiologie pour décompensation cardiaque globale.""")).toDS().toDF("text") val result = pipeline.fit(data).transform(data) ```
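The precision, recall, and F1 figures reported in the Benchmarking sections of these cards derive directly from true-positive/false-positive/false-negative counts. A quick sketch reproducing this model's reported scores from its tp/fp/fn counts:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 (harmonic mean of the two) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from this model's benchmark row.
p, r, f = prf(tp=269, fp=51, fn=52)
print(round(p, 4), round(r, 4), round(f, 4))  # 0.8406 0.838 0.8393
```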
## Results ```bash +------------------------+------------------+ |chunk |ner_label | +------------------------+------------------+ |ulcérations |clinical_condition| |toux sèche |clinical_condition| |douleurs |clinical_condition| |dyspnée |clinical_condition| |apyrexie |clinical_condition| |anasarque |clinical_condition| |décompensation cardiaque|clinical_condition| +------------------------+------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_eu_clinical_condition| |Compatibility:|Healthcare NLP 4.2.8+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|fr| |Size:|899.9 KB| ## References The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives. ## Benchmarking ```bash label tp fp fn total precision recall f1 clinical_event 269.0 51.0 52.0 321.0 0.8406 0.8380 0.8393 macro - - - - - - 0.8393 micro - - - - - - 0.8393 ``` --- layout: model title: Turkish BertForQuestionAnswering Cased model (from enelpi) author: John Snow Labs name: bert_qa_question_answering_cased_squadv2 date: 2022-07-07 tags: [tr, open_source, bert, question_answering] task: Question Answering language: tr edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-question-answering-cased-squadv2_tr` is a Turkish model originally trained by `enelpi`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_question_answering_cased_squadv2_tr_4.0.0_3.0_1657187898372.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_question_answering_cased_squadv2_tr_4.0.0_3.0_1657187898372.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_question_answering_cased_squadv2","tr") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_question_answering_cased_squadv2","tr") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
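Extractive QA annotators such as `BertForQuestionAnswering` score candidate start and end token positions and return the best-scoring span from the context. A minimal pure-Python sketch of that span-selection step, with hypothetical logits (not actual model output):

```python
# Pick the highest-scoring (start, end) span from per-token logits,
# as span-based extractive QA heads do. The logit values are hypothetical,
# chosen so that the single token "Clara" wins.
tokens = ["Benim", "adım", "Clara", "ve", "Berkeley'de", "yaşıyorum", "."]
start_logits = [0.1, 0.3, 2.5, 0.2, 0.4, 0.1, 0.0]
end_logits = [0.0, 0.2, 2.8, 0.1, 0.5, 0.2, 0.1]

# Consider every valid span (end >= start) and keep the best-scoring one.
best = max(
    ((s, e) for s in range(len(tokens)) for e in range(s, len(tokens))),
    key=lambda span: start_logits[span[0]] + end_logits[span[1]],
)
answer = " ".join(tokens[best[0] : best[1] + 1])
print(answer)  # → Clara
```

Real implementations also mask spans outside the context and cap the span length, but the selection logic is the same.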
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_question_answering_cased_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|tr| |Size:|413.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/enelpi/bert-question-answering-cased-squadv2_tr --- layout: model title: English BertForMaskedLM Base Cased model author: John Snow Labs name: bert_embeddings_base_cased date: 2022-12-02 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased` is a English model originally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_cased_en_4.2.4_3.0_1670016286064.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_cased_en_4.2.4_3.0_1670016286064.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_cased","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_cased","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|406.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/bert-base-cased - https://arxiv.org/abs/1810.04805 - https://github.com/google-research/bert - https://yknzhu.wixsite.com/mbweb - https://en.wikipedia.org/wiki/English_Wikipedia --- layout: model title: Smaller BERT Sentence Embeddings (L-8_H-256_A-4) author: John Snow Labs name: sent_small_bert_L8_256 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L8_256_en_2.6.0_2.4_1598350433990.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L8_256_en_2.6.0_2.4_1598350433990.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L8_256", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L8_256", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L8_256').predict(text, output_level='sentence') embeddings_df ```
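The Results section below shows the raw 256-dimensional sentence vectors; a common downstream use is comparing sentences by cosine similarity. A minimal pure-Python sketch on toy 3-dimensional vectors (illustrative values, not actual model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for sentence embeddings.
v1 = [0.2, 0.5, 0.7]
v2 = [0.1, 0.4, 0.8]
sim = cosine_similarity(v1, v2)
print(round(sim, 4))  # ≈ 0.9813
```

In practice you would pull the vectors out of the `sentence_embeddings` column of the transformed DataFrame and compare them the same way.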
{:.h2_title} ## Results ```bash sentence en_embed_sentence_small_bert_L8_256_embeddings I hate cancer [-0.04690948873758316, 0.5517814755439758, 0.7... Antibiotics aren't painkiller [0.4066215753555298, 0.48149049282073975, 0.18... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L8_256| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|256| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-256_A-4/1 --- layout: model title: Translate Hungarian to English Pipeline author: John Snow Labs name: translate_hu_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, hu, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `hu` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_hu_en_xx_2.7.0_2.4_1609688503781.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_hu_en_xx_2.7.0_2.4_1609688503781.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_hu_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_hu_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.hu.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_hu_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Stopwords Remover for Tamil language (125 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, ta, open_source] task: Stop Words Removal language: ta edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_ta_3.4.1_3.0_1646673010096.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_ta_3.4.1_3.0_1646673010096.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","ta") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["நீங்கள் என்னை விட நன்றாக இல்லை"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","ta") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("நீங்கள் என்னை விட நன்றாக இல்லை").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ta.stopwords").predict("""நீங்கள் என்னை விட நன்றாக இல்லை""") ```
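Under the hood a stopwords remover simply drops tokens that appear in a fixed word list. The sketch below illustrates the idea in plain Python with a hypothetical one-entry list, chosen only to reproduce the Results table below (the actual `stopwords_iso` model ships 125 Tamil entries):

```python
# Illustrative stopword filtering; this single-entry list is hypothetical
# and mirrors the card's Results (விட is removed, all other tokens kept).
stopwords = {"விட"}

tokens = ["நீங்கள்", "என்னை", "விட", "நன்றாக", "இல்லை"]
clean_tokens = [t for t in tokens if t not in stopwords]
print(clean_tokens)  # → ['நீங்கள்', 'என்னை', 'நன்றாக', 'இல்லை']
```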
## Results ```bash +-------------------------------+ |result | +-------------------------------+ |[நீங்கள், என்னை, நன்றாக, இல்லை]| +-------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|ta| |Size:|2.0 KB| --- layout: model title: Fast Neural Machine Translation Model from Punjabi (Eastern) to English author: John Snow Labs name: opus_mt_pa_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pa, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `pa` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_pa_en_xx_2.7.0_2.4_1609164148817.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_pa_en_xx_2.7.0_2.4_1609164148817.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_pa_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_pa_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.pa.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_pa_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering Large Cased model (from dmis-lab) author: John Snow Labs name: bert_qa_biobert_large_cased_v1.1_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobert-large-cased-v1.1-squad` is a English model originally trained by `dmis-lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_large_cased_v1.1_squad_en_4.0.0_3.0_1657189073455.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_large_cased_v1.1_squad_en_4.0.0_3.0_1657189073455.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biobert_large_cased_v1.1_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biobert_large_cased_v1.1_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_biobert_large_cased_v1.1_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/dmis-lab/biobert-large-cased-v1.1-squad --- layout: model title: Legal Amendments and waivers Clause Binary Classifier (md) author: John Snow Labs name: legclf_amendments_and_waivers_md date: 2023-01-11 tags: [en, legal, classification, document, agreement, contract, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `amendments-and-waivers` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `amendments-and-waivers` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_amendments_and_waivers_md_en_1.0.0_3.0_1673460275552.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_amendments_and_waivers_md_en_1.0.0_3.0_1673460275552.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_amendments_and_waivers_md", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
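The description above recommends splitting large documents into paragraph-level chunks before classification. A minimal pure-Python sketch of multiline (blank-line) paragraph splitting; the classifier pipeline itself is unchanged, each resulting chunk would simply be fed in as `clause_text` (the sample contract text is hypothetical):

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Hypothetical two-clause contract fragment, for illustration only.
doc = """Section 1. Amendments.
This Agreement may be amended only by written consent of both parties.

Section 2. Governing Law.
This Agreement is governed by the laws of the State of Delaware."""

clauses = split_paragraphs(doc)
for clause in clauses:
    print(clause.splitlines()[0])
# → Section 1. Amendments.
# → Section 2. Governing Law.
```

Each element of `clauses` stays well under the 512-token embedding limit mentioned above, so it can be classified independently.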
## Results ```bash +------------------------+ | result| +------------------------+ |[amendments-and-waivers]| |[other]| |[other]| |[amendments-and-waivers]| +------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_amendments_and_waivers_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash precision recall f1-score support effect-of-termination 1.00 0.82 0.90 38 other 0.85 1.00 0.92 39 accuracy 0.91 77 macro avg 0.92 0.91 0.91 77 weighted avg 0.92 0.91 0.91 77 ``` --- layout: model title: Spanish Named Entity Recognition (RoBERTa base trained with data from the National Library of Spain (BNE) and CONLL 2003 data), by the TEMU Unit of the BSC-CNS author: cayorodriguez name: roberta_base_bne_conll_ner_spark_nlp date: 2022-11-21 tags: [es, open_source] task: Named Entity Recognition language: es edition: Spark NLP 4.0.0 spark_version: 3.2 supported: false article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. roberta-base-bne-conll-ner_spark_nlp is a Spanish model originally trained by TEMU-BSC for PlanTL-GOB-ES.
## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/cayorodriguez/roberta_base_bne_conll_ner_spark_nlp_es_4.0.0_3.2_1669018824287.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/cayorodriguez/roberta_base_bne_conll_ner_spark_nlp_es_4.0.0_3.2_1669018824287.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") ner = RoBertaForTokenClassification.pretrained("roberta_base_bne_conll_ner_spark_nlp","es") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, ner]) data = spark.createDataFrame([["El Plan Nacional para el Impulso de las Tecnologías del Lenguage es una iniciativa del Gobierno de España"]]).toDF("text") result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_bne_conll_ner_spark_nlp| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Community| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|es| |Size:|447.3 MB| |Case sensitive:|true| |Max sentence length:|128| --- layout: model title: BERT Sequence Classifier - Classify the Music Genre author: John Snow Labs name: bert_sequence_classifier_song_lyrics date: 2021-11-07 tags: [song, lyrics, en, bert_for_sequence_classification, open_source] task: Text Classification language: en nav_key: models edition: Spark NLP 3.3.2 spark_version: 2.4 supported: true annotator: BertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is imported from `Hugging Face-models` and it classifies the music genre into 6 classes. ## Predicted Entities `Dance`, `Heavy Metal`, `Hip Hop`, `Indie`, `Pop`, `Rock` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_song_lyrics_en_3.3.2_2.4_1636283685615.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_song_lyrics_en_3.3.2_2.4_1636283685615.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = BertForSequenceClassification \ .pretrained('bert_sequence_classifier_song_lyrics', 'en') \ .setInputCols(['token', 'document']) \ .setOutputCol('class') \ .setCaseSensitive(True) \ .setMaxSentenceLength(512) pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier]) example = spark.createDataFrame([["""Because you need me Every single day Trying to find me But you don't know why Trying to find me again But you don't know how Trying to find me again Every single day"""]]).toDF("text") result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_song_lyrics", "en") .setInputCols("document", "token") .setOutputCol("class") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) val example = Seq("""Because you need me Every single day Trying to find me But you don't know why Trying to find me again But you don't know how Trying to find me again Every single day""").toDS.toDF("text") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.song_lyrics").predict("""Because you need me Every single day Trying to find me But you don't know why Trying to find me again But you don't know how Trying to find me again Every single day""") ```
## Results ```bash ['Rock'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_song_lyrics| |Compatibility:|Spark NLP 3.3.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[label]| |Language:|en| |Case sensitive:|true| ## Data Source [https://huggingface.co/juliensimon/autonlp-song-lyrics-18753417](https://huggingface.co/juliensimon/autonlp-song-lyrics-18753417) ## Benchmarking ```bash +--------------------+----------+ | Validation Metrics | Score | +--------------------+----------+ | Loss | 0.906597 | | Accuracy | 0.668027 | | Macro F1 | 0.538484 | | Micro F1 | 0.668027 | | Weighted F1 | 0.64147 | | Macro Precision | 0.67444 | | Micro Precision | 0.668027 | | Weighted Precision | 0.663409 | | Macro Recall | 0.50784 | | Micro Recall | 0.668027 | | Weighted Recall | 0.668027 | +--------------------+----------+ ``` --- layout: model title: RCT Binary Classifier (BioBERT Sentence Embeddings) author: John Snow Labs name: rct_binary_classifier_biobert date: 2022-05-27 tags: [licensed, rct, clinical, classifier, en] task: Text Classification language: en nav_key: models edition: Healthcare NLP 3.4.2 spark_version: 3.0 supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a BioBERT based classifier that can classify if an article is a randomized clinical trial (RCT) or not. 
## Predicted Entities `true`, `false` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CLASSIFICATION_RCT/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CLASSIFICATION_RCT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rct_binary_classifier_biobert_en_3.4.2_3.0_1653668780966.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rct_binary_classifier_biobert_en_3.4.2_3.0_1653668780966.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") bert_sent = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") classifier_dl = ClassifierDLModel.pretrained("rct_binary_classifier_biobert", "en", "clinical/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("class") biobert_clf_pipeline = Pipeline( stages = [ document_assembler, bert_sent, classifier_dl ]) data = spark.createDataFrame([["""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. 
"""]]).toDF("text") result = biobert_clf_pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val bert_sent = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased", "en") .setInputCols("document") .setOutputCol("sentence_embeddings") val classifier_dl = ClassifierDLModel.pretrained("rct_binary_classifier_biobert", "en", "clinical/models") .setInputCols(Array("sentence_embeddings")) .setOutputCol("class") val biobert_clf_pipeline = new Pipeline().setStages(Array(documenter, bert_sent, classifier_dl)) val data = Seq("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. 
""").toDS.toDF("text") val result = biobert_clf_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.rct_binary_biobert").predict("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. """) ```
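Since the pipeline's output column is `class`, the predicted labels can be read back out of the `result` DataFrame with something like `rows = result.select("class.result").collect()`. A minimal sketch of flattening those collected rows into plain label strings (the helper name is illustrative, not part of the Spark NLP API):

```python
# Illustrative helper: each collected row holds one list of label strings
# (one entry per document); concatenate them in order.
def flatten_labels(rows):
    return [label for row in rows for label in row[0]]

# e.g. flatten_labels([(["true"],)]) for a single-document DataFrame
```

For a one-row input like the abstract above, this would yield a single-element list such as `["true"]`.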
## Results ```bash | text | rct | |---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------| | Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. 
Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. | true | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|rct_binary_classifier_biobert| |Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References https://arxiv.org/abs/1710.06071 ## Benchmarking ```bash label precision recall f1-score support false 0.86 0.81 0.84 2915 true 0.80 0.85 0.83 2545 accuracy - - 0.83 5460 macro-avg 0.83 0.83 0.83 5460 weighted-avg 0.83 0.83 0.83 5460 ``` --- layout: model title: Fast Neural Machine Translation Model from Oromo to English author: John Snow Labs name: opus_mt_om_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, om, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. - source languages: `om` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_om_en_xx_2.7.0_2.4_1609169131097.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_om_en_xx_2.7.0_2.4_1609169131097.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_om_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_om_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.om.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_om_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Bemba (Zambia) asr_wav2vec2_large_xls_r_300m_bemba_fds TFWav2Vec2ForCTC from csikasote author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_bemba_fds date: 2022-09-24 tags: [wav2vec2, bem, audio, open_source, asr] task: Automatic Speech Recognition language: bem edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_bemba_fds` is a Bemba (Zambia) model originally trained by csikasote. NOTE: This model works only on a CPU; if you need to run it on a GPU, please use asr_wav2vec2_large_xls_r_300m_bemba_fds_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_bemba_fds_bem_4.2.0_3.0_1664023896000.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_bemba_fds_bem_4.2.0_3.0_1664023896000.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_bemba_fds", "bem")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_bemba_fds", "bem") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
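The snippets above assume an existing `audioDf` whose `audio_content` column holds the raw audio as an array of floats, which is what `AudioAssembler` consumes. A minimal, hedged sketch of producing such floats from 16-bit PCM bytes (the helper name is hypothetical; in practice you might read the bytes from a WAV file with the stdlib `wave` module and then build the DataFrame):

```python
import struct

# Hypothetical helper: convert little-endian signed 16-bit PCM samples
# into the normalized float array expected in the "audio_content" column.
def pcm16_to_floats(pcm_bytes):
    n = len(pcm_bytes) // 2
    samples = struct.unpack("<%dh" % n, pcm_bytes[: n * 2])
    # Normalize each sample from [-32768, 32767] into roughly [-1.0, 1.0).
    return [s / 32768.0 for s in samples]

# audioDf = spark.createDataFrame([[pcm16_to_floats(raw_bytes)]], ["audio_content"])
```

Note the sample rate must match what the Wav2vec2 model was trained on (typically 16 kHz); resample beforehand if needed.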
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_bemba_fds| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|bem| |Size:|1.2 GB| --- layout: model title: Detect Clinical Entities (BertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_jsl date: 2022-03-21 tags: [ner_jsl, ner, berfortokenclassification, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terminology. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. It detects 77 entities. Definitions of Predicted Entities: - `Medical_Device`: All mentions related to medical devices and supplies. - `Vital_Signs_Header`: Identifies section headers that correspond to Vital Signs of a patient. - `Allergen`: Allergen related extractions mentioned in the document. - `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients. - `Clinical_Dept`: Terms that indicate the medical and/or surgical departments. - `Symptom`: All the symptoms mentioned in the document, of a patient or someone else. - `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by the naked eye. - `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient. - `Age`: All mentions of ages, past or present, of the patient or anybody else. - `Birth_Entity`: Mentions that indicate giving birth.
- `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else. - `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs). - `Test_Result`: Terms related to all the test results present in the document (clinical tests results are included). - `Test`: Mentions of laboratory, pathology, and radiological tests. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments. - `Treatment`: Includes therapeutic and minimally invasive treatment and procedures (invasive treatments or procedures are extracted as "Procedure"). - `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.). ## Predicted Entities `Medical_Device`, `Physical_Measurement`, `Alergen`, `Procedure`, `Substance_Quantity`, `Drug`, `Test_Result`, `Pregnancy_Newborn`, `Admission_Discharge`, `Demographics`, `Lifestyle`, `Header`, `Date_Time`, `Treatment`, `Clinical_Dept`, `Test`, `Death_Entity`, `Age`, `Oncological`, `Body_Part`, `Birth_Entity`, `Vital_Sign`, `Symptom`, `Disease_Syndrome_Disorder` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_BERT_TOKEN_CLASSIFIER/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_en_3.3.4_2.4_1647895738040.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_en_3.3.4_2.4_1647895738040.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_jsl", "en", "clinical/models")\ .setInputCols(["token", "sentence"])\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) sample_text = """The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""" df = spark.createDataFrame([[sample_text]]).toDF("text") result = pipeline.fit(df).transform(df) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_jsl", "en", "clinical/models") .setInputCols(Array("token", "sentence")) .setOutputCol("ner") .setCaseSensitive(True) val ner_converter = new NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val sample_text = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text") val result = pipeline.fit(sample_text).transform(sample_text) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_jsl").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
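The `NerConverter` stage in the pipeline above merges the token-level BIO tags emitted by the classifier into whole entity chunks. A pure-Python sketch of that merging logic, for intuition only (the function name is illustrative, not the Spark NLP API):

```python
# Illustrative sketch of BIO-tag merging: consecutive B-X / I-X tokens
# become one (chunk_text, label) pair; O tokens close any open chunk.
def merge_bio(tokens, tags):
    chunks, cur, lab = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur:
                chunks.append((" ".join(cur), lab))
            cur, lab = [tok], tag[2:]
        elif tag.startswith("I-") and cur and tag[2:] == lab:
            cur.append(tok)
        else:
            if cur:
                chunks.append((" ".join(cur), lab))
            cur, lab = [], None
    if cur:
        chunks.append((" ".join(cur), lab))
    return chunks
```

For example, tokens tagged `B-Drug O O` yield one `Drug` chunk, matching rows like `Tylenol | Drug` in the results table below.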
## Results ```bash +--------------------------------+-------------+ |chunk |ner_label | +--------------------------------+-------------+ |21-day-old |Age | |Caucasian male |Demographics | |congestion |Symptom | |mom |Demographics | |yellow discharge |Symptom | |nares |Body_Part | |she |Demographics | |mild problems with his breathing|Symptom | |perioral cyanosis |Symptom | |retractions |Symptom | |One day ago |Date_Time | |mom |Demographics | |tactile temperature |Symptom | |Tylenol |Drug | |Baby-girl |Age | |decreased p.o. intake |Symptom | |His |Demographics | |breast-feeding |Body_Part | |his |Demographics | |respiratory congestion |Symptom | |He |Demographics | |tired |Symptom | |fussy |Symptom | |over the past 2 days |Date_Time | |albuterol |Drug | |ER |Clinical_Dept| |His |Demographics | |urine output has |Symptom | |decreased |Symptom | |he |Demographics | |he |Demographics | |Mom |Demographics | |diarrhea |Symptom | |His |Demographics | |bowel |Body_Part | +--------------------------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_jsl| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## Benchmarking ```bash label tp fp fn prec rec f1 B-Medical_Device 2696 444 282 0.8585987 0.9053055 0.8813337 I-Physical_Measurement 220 16 34 0.9322034 0.8661417 0.8979592 B-Procedure 1800 239 281 0.8827857 0.8649688 0.8737864 B-Drug 1865 218 237 0.8953432 0.8872502 0.8912784 I-Test_Result 289 203 292 0.5873983 0.4974182 0.5386766 I-Pregnancy_Newborn 150 41 104 0.7853403 0.5905512 0.6741573 B-Admission_Discharge 255 35 6 0.8793103 0.9770115 0.9255898 B-Demographics 4609 119 123 0.9748308 0.9740068 0.9744186 I-Lifestyle 71 49 20 0.5916666 0.7802198 0.6729857 B-Header 2463 53 122 0.9789348 0.9528046 0.965693 
I-Date_Time 928 184 191 0.8345324 0.8293119 0.8319139 B-Test_Result 866 198 262 0.8139097 0.7677305 0.7901459 I-Treatment 114 37 46 0.7549669 0.7125 0.733119 B-Clinical_Dept 688 83 76 0.8923476 0.9005235 0.8964169 B-Test 1920 333 313 0.8521970 0.8598298 0.8559965 B-Death_Entity 36 9 2 0.8 0.9473684 0.8674699 B-Lifestyle 268 58 50 0.8220859 0.8427673 0.8322981 B-Date_Time 823 154 176 0.8423746 0.8238238 0.8329959 I-Age 136 34 49 0.8 0.7351351 0.7661972 I-Oncological 345 41 19 0.8937824 0.9478022 0.9199999 I-Body_Part 3717 720 424 0.8377282 0.8976093 0.8666356 B-Pregnancy_Newborn 153 51 104 0.75 0.5953307 0.6637744 B-Treatment 169 41 58 0.8047619 0.7444933 0.7734553 I-Procedure 2302 326 417 0.8759513 0.8466348 0.8610435 B-Birth_Entity 6 5 7 0.5454545 0.4615384 0.5 I-Vital_Sign 639 197 93 0.7643540 0.8729508 0.815051 I-Header 4451 111 216 0.9756685 0.9537176 0.9645682 I-Death_Entity 2 0 0 1 1 1 I-Clinical_Dept 621 54 39 0.92 0.9409091 0.9303371 I-Test 1593 378 353 0.8082192 0.8186022 0.8133775 B-Age 472 43 51 0.9165048 0.9024856 0.9094413 I-Symptom 4227 1271 1303 0.7688250 0.7643761 0.7665941 I-Demographics 321 53 53 0.8582887 0.8582887 0.8582887 B-Body_Part 6312 912 809 0.8737541 0.8863923 0.8800279 B-Physical_Measurement 91 10 17 0.9009901 0.8425926 0.8708134 B-Disease_Syndrome_Disorder 2817 336 433 0.8934348 0.8667692 0.8799001 B-Symptom 4522 830 747 0.8449178 0.8582274 0.8515206 I-Disease_Syndrome_Disorder 2814 386 530 0.879375 0.8415072 0.8600244 I-Drug 3737 612 517 0.859278 0.8784673 0.8687667 I-Medical_Device 1825 331 131 0.8464749 0.9330266 0.8876459 B-Oncological 276 28 27 0.9078947 0.9108911 0.9093904 B-Vital_Sign 429 97 79 0.8155893 0.8444882 0.8297872 Macro-average 62038 9340 9110 0.7678277 0.7648211 0.7663215 Micro-average 62038 9340 9110 0.8691473 0.8719570 0.87055 ``` --- layout: model title: Abkhazian asr_baseline TFWav2Vec2ForCTC from neelan-elucidate-ai author: John Snow Labs name: asr_baseline date: 2022-09-24 tags: [wav2vec2, ab, audio, 
open_source, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_baseline` is an Abkhazian model originally trained by neelan-elucidate-ai. NOTE: This model works only on a CPU; if you need to run it on a GPU, please use asr_baseline_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_baseline_ab_4.2.0_3.0_1664021865075.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_baseline_ab_4.2.0_3.0_1664021865075.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_baseline", "ab")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_baseline", "ab") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_baseline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ab| |Size:|446.6 KB| --- layout: model title: Translate English to Malagasy Pipeline author: John Snow Labs name: translate_en_mg date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, mg, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `mg` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_mg_xx_2.7.0_2.4_1609687980137.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_mg_xx_2.7.0_2.4_1609687980137.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_mg", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_mg", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.mg').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_mg| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Detect Clinical Entities (ner_jsl_biobert) author: John Snow Labs name: ner_jsl_biobert_pipeline date: 2023-03-20 tags: [clinical, licensed, en, ner] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_jsl_biobert](https://nlp.johnsnowlabs.com/2021/09/05/ner_jsl_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_biobert_pipeline_en_4.3.0_3.2_1679309924530.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_biobert_pipeline_en_4.3.0_3.2_1679309924530.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_jsl_biobert_pipeline", "en", "clinical/models") text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_jsl_biobert_pipeline", "en", "clinical/models") val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. 
The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl_biobert.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:---|---:|---:|:---|---:| | 0 | 21-day-old | 17 | 26 | Age | 1 | | 1 | Caucasian | 28 | 36 | Race_Ethnicity | 0.9304 | | 2 | male | 38 | 41 | Gender | 1 | | 3 | for 2 days | 48 | 57 | Duration | 0.6477 | | 4 | congestion | 62 | 71 | Symptom | 0.7325 | | 5 | mom | 75 | 77 | Gender | 0.9995 | | 6 | suctioning | 88 | 97 | Modifier | 0.1445 | | 7 | yellow discharge | 99 | 114 | Symptom | 0.43875 | | 8 | nares | 135 | 139 | External_body_part_or_region | 0.9005 | | 9 | she | 147 | 149 | Gender | 0.9956 | | 10 | mild | 168 | 171 | Modifier | 0.5113 | | 11 | problems with his breathing while feeding | 173 | 213 | Symptom | 0.4362 | | 12 | perioral cyanosis | 237 | 253 | Symptom | 0.76325 | | 13 | retractions | 258 | 268 | Symptom | 0.9819 | | 14 | One day ago | 272 | 282 | RelativeDate | 0.838267 | | 15 | mom | 285 | 287 | Gender | 0.9995 | | 16 | tactile temperature | 304 | 322 | Symptom | 0.5194 | | 17 | Tylenol | 345 | 351 | Drug_BrandName | 0.9999 | | 18 | Baby | 354 | 357 | Age | 0.9997 | | 19 | decreased p.o | 377 | 389 | Symptom | 0.445 | | 20 | His | 400 | 402 | Gender | 0.9996 | | 21 | from 20 minutes q.2h. to 5 to 10 minutes | 434 | 473 | Duration | 0.24581 | | 22 | his | 488 | 490 | Gender | 0.9573 | | 23 | respiratory congestion | 492 | 513 | Symptom | 0.5144 | | 24 | He | 516 | 517 | Gender | 1 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.9 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Legal Non Competition Clause Binary Classifier author: John Snow Labs name: legclf_non_comp_clause date: 2023-02-13 tags: [en, legal, classification, non_competition, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `non_comp` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `non_comp`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_non_comp_clause_en_1.0.0_3.0_1676304359955.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_non_comp_clause_en_1.0.0_3.0_1676304359955.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_non_comp_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
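The description above recommends splitting long documents into paragraph-sized chunks before classification. A minimal sketch of that step in plain Python, independent of Spark NLP (the function name and the sample clauses are illustrative):

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraph-level chunks on blank lines.

    Each chunk can then be classified as its own row, so the model
    sees one clause-sized span of context at a time.
    """
    # One or more blank lines (possibly containing spaces) ends a paragraph
    chunks = re.split(r"\n\s*\n", text)
    return [c.strip() for c in chunks if c.strip()]

doc = (
    "NON-COMPETITION.\nThe Employee agrees not to compete...\n"
    "\n"
    "SEVERABILITY.\nIf any provision is held invalid..."
)
paragraphs = split_paragraphs(doc)
```

Each resulting chunk would then take the place of `"YOUR TEXT HERE"` as one row of the input DataFrame in the pipeline above.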
## Results ```bash +-------+ |result| +-------+ |[non_comp]| |[other]| |[other]| |[non_comp]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_non_comp_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support non_comp 1.00 1.00 1.00 15 other 1.00 1.00 1.00 7 accuracy - - 1.00 22 macro-avg 1.00 1.00 1.00 22 weighted-avg 1.00 1.00 1.00 22 ``` --- layout: model title: English DistilBertForQuestionAnswering model (from Nadhiya) author: John Snow Labs name: distilbert_qa_Nadhiya_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Nadhiya`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Nadhiya_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724293769.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Nadhiya_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724293769.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Nadhiya_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Nadhiya_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Nadhiya").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
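In the `nlu` one-liner above, the question and its context travel in a single string separated by `|||`. A tiny, hypothetical helper makes that packing convention explicit:

```python
def qa_input(question: str, context: str) -> str:
    """Pack a question and its context into the single
    '|||'-separated string used by the nlu QA loaders."""
    return f"{question}|||{context}"

packed = qa_input("What is my name?", "My name is Clara and I live in Berkeley.")
# packed == "What is my name?|||My name is Clara and I live in Berkeley."
```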
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_Nadhiya_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Nadhiya/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal Noncompetition Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_noncompetition_agreement_bert date: 2023-01-29 tags: [en, legal, classification, noncompetition, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_noncompetition_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether the document belongs to the class `noncompetition-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `noncompetition-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_noncompetition_agreement_bert_en_1.0.0_3.0_1674990641933.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_noncompetition_agreement_bert_en_1.0.0_3.0_1674990641933.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_noncompetition_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[noncompetition-agreement]| |[other]| |[other]| |[noncompetition-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_noncompetition_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support noncompetition-agreement 0.97 0.97 0.97 32 other 0.98 0.98 0.98 55 accuracy - - 0.98 87 macro-avg 0.98 0.98 0.98 87 weighted-avg 0.98 0.98 0.98 87 ``` --- layout: model title: English asr_wav2vec2_large_xls_r_300m_english_colab TFWav2Vec2ForCTC from shacharm author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_english_colab date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_english_colab` is an English model originally trained by shacharm.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_300m_english_colab_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_english_colab_en_4.2.0_3.0_1664103475912.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_english_colab_en_4.2.0_3.0_1664103475912.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_english_colab", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_english_colab", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
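Both snippets above assume an `audioDf` that already holds raw audio as an array of floats in an `audio_content` column. A minimal, self-contained sketch of preparing such input with only the Python standard library (the sample file is generated on the spot and the final Spark step is left as a comment since it needs a running session; wav2vec2 models generally expect 16 kHz mono audio):

```python
import math
import struct
import wave

# Write a tiny mono 16-bit WAV (1 second of a 440 Hz tone at 16 kHz)
# so the example is self-contained
with wave.open("sample.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    frames = [int(32767 * math.sin(2 * math.pi * 440 * t / 16000))
              for t in range(16000)]
    w.writeframes(struct.pack("<%dh" % len(frames), *frames))

# Read it back as the list of floats the audio_content column carries
with wave.open("sample.wav", "rb") as w:
    raw = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    audio_floats = [s / 32768.0 for s in samples]

# With a live SparkSession, the DataFrame used above would be built as:
# audioDf = spark.createDataFrame([(audio_floats,)], ["audio_content"])
```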
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_english_colab| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Italian Embeddings (Base, Recipes) author: John Snow Labs name: bert_embeddings_chefberto_italian_cased date: 2022-04-11 tags: [bert, embeddings, it, open_source] task: Embeddings language: it edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `chefberto-italian-cased` is an Italian model originally trained by `vinhood`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chefberto_italian_cased_it_3.4.2_3.0_1649676831699.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chefberto_italian_cased_it_3.4.2_3.0_1649676831699.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_chefberto_italian_cased","it") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Adoro Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_chefberto_italian_cased","it") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Adoro Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("it.embed.chefberto_italian_cased").predict("""Adoro Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chefberto_italian_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|it| |Size:|415.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/vinhood/chefberto-italian-cased - https://twitter.com/denocris - https://www.linkedin.com/in/cristiano-de-nobili/ - https://www.vinhood.com/en/ --- layout: model title: English BertForMaskedLM Base Cased model (from VMware) author: John Snow Labs name: bert_embeddings_v_2021_base date: 2022-12-02 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `vbert-2021-base` is an English model originally trained by `VMware`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_v_2021_base_en_4.2.4_3.0_1670022938608.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_v_2021_base_en_4.2.4_3.0_1670022938608.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_v_2021_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_v_2021_base","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_v_2021_base| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|409.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/VMware/vbert-2021-base - https://medium.com/@rickbattle/weaknesses-of-wordpiece-tokenization-eb20e37fec99 --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from Evelyn18) author: John Snow Labs name: roberta_qa_base_spanish_squades_becasincentivos3 date: 2023-01-20 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-squades-becasIncentivos3` is a Spanish model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becasincentivos3_es_4.3.0_3.0_1674218087235.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becasincentivos3_es_4.3.0_3.0_1674218087235.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becasincentivos3","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becasincentivos3","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_spanish_squades_becasincentivos3| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|459.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Evelyn18/roberta-base-spanish-squades-becasIncentivos3 --- layout: model title: English XlmRoBertaForQuestionAnswering (from teacookies) author: John Snow Labs name: xlm_roberta_qa_autonlp_roberta_base_squad2_24465524 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-roberta-base-squad2-24465524` is an English model originally trained by `teacookies`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465524_en_4.0.0_3.0_1655987091352.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465524_en_4.0.0_3.0_1655987091352.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465524","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465524","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.xlm_roberta.base_24465524.by_teacookies").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_roberta_base_squad2_24465524| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|887.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-roberta-base-squad2-24465524 --- layout: model title: Vietnamese Bert Embeddings author: John Snow Labs name: bert_embeddings_bert_base_vi_cased date: 2022-04-11 tags: [bert, embeddings, vi, open_source] task: Embeddings language: vi edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-vi-cased` is a Vietnamese model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_vi_cased_vi_3.4.2_3.0_1649676357396.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_vi_cased_vi_3.4.2_3.0_1649676357396.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_vi_cased","vi") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Tôi yêu Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_vi_cased","vi") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Tôi yêu Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("vi.embed.bert_cased").predict("""Tôi yêu Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_vi_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|vi| |Size:|373.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-vi-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Normalizing Section Headers in Clinical Notes author: John Snow Labs name: normalized_section_header_mapper date: 2022-04-04 tags: [en, chunkmapper, chunkmapping, normalizer, sectionheader, licensed, clinical] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.4.2 spark_version: 3.0 supported: true annotator: NotDefined article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline normalizes the section headers in clinical notes. It returns two levels of normalization called `level_1` and `level_2`. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NORMALIZED_SECTION_HEADER_MAPPER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NORMALIZED_SECTION_HEADER_MAPPER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/normalized_section_header_mapper_en_3.4.2_3.0_1649098646707.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/normalized_section_header_mapper_en_3.4.2_3.0_1649098646707.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en","clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("word_embeddings") clinical_ner = MedicalNerModel.pretrained("ner_jsl_slim", "en", "clinical/models")\ .setInputCols(["sentence","token", "word_embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(["Header"]) chunkerMapper = ChunkMapperModel.pretrained("normalized_section_header_mapper", "en", "clinical/models") \ .setInputCols("ner_chunk")\ .setOutputCol("mappings")\ .setRel("level_1") #or level_2 pipeline = Pipeline().setStages([document_assembler, sentence_detector, tokenizer, embeddings, clinical_ner, ner_converter, chunkerMapper]) sentences = """ADMISSION DIAGNOSIS Right pleural effusion and suspected malignant mesothelioma. PRINCIPAL DIAGNOSIS Right pleural effusion, suspected malignant mesothelioma. 
GENERAL REVIEW Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.""" test_data = spark.createDataFrame([[sentences]]).toDF("text") result = pipeline.fit(test_data).transform(test_data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en","clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("word_embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_jsl_slim", "en", "clinical/models") .setInputCols(Array("sentence","token", "word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Header")) val chunkerMapper = ChunkMapperModel.pretrained("normalized_section_header_mapper", "en", "clinical/models") .setInputCols("ner_chunk") .setOutputCol("mappings") .setRel("level_1") // or level_2 val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, clinical_ner, ner_converter, chunkerMapper)) val test_sentence = """ADMISSION DIAGNOSIS Right pleural effusion and suspected malignant mesothelioma. PRINCIPAL DIAGNOSIS Right pleural effusion, suspected malignant mesothelioma. GENERAL REVIEW Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.""" val test_data = Seq(test_sentence).toDS.toDF("text") val result = pipeline.fit(test_data).transform(test_data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.section_headers_normalized").predict("""ADMISSION DIAGNOSIS Right pleural effusion and suspected malignant mesothelioma. 
PRINCIPAL DIAGNOSIS Right pleural effusion, suspected malignant mesothelioma. GENERAL REVIEW Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.""") ```
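Conceptually, the mapper resolves each detected `Header` chunk through a lookup keyed by the `rel` parameter (`level_1` or `level_2`). A minimal pure-Python sketch of that behaviour — the `level_1` entries below come from the Results section of this card, while the `level_2` values are purely illustrative, not the model's actual dictionary:

```python
# Sketch of the two-level lookup behind normalized_section_header_mapper.
# level_1 values match this card's Results table; level_2 values are
# hypothetical placeholders for illustration only.
SECTION_MAPPINGS = {
    "ADMISSION DIAGNOSIS": {"level_1": "DIAGNOSIS", "level_2": "ADMISSION DIAGNOSIS"},
    "PRINCIPAL DIAGNOSIS": {"level_1": "DIAGNOSIS", "level_2": "PRINCIPAL DIAGNOSIS"},
    "GENERAL REVIEW": {"level_1": "REVIEW TYPE", "level_2": "GENERAL REVIEW"},
}

def normalize_header(header: str, rel: str = "level_1") -> str:
    """Return the normalized form of a header, mirroring setRel("level_1"/"level_2")."""
    entry = SECTION_MAPPINGS.get(header.strip().upper())
    return entry[rel] if entry else header  # unknown headers pass through unchanged

print(normalize_header("ADMISSION DIAGNOSIS"))        # DIAGNOSIS
print(normalize_header("GENERAL REVIEW", "level_1"))  # REVIEW TYPE
```

The real model ships its mapping inside the pretrained `ChunkMapperModel`; this sketch only illustrates why switching `setRel` changes the granularity of the output.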
## Results ```bash +-------------------+------------------+ |section |normalized_section| +-------------------+------------------+ |ADMISSION DIAGNOSIS|DIAGNOSIS | |PRINCIPAL DIAGNOSIS|DIAGNOSIS | |GENERAL REVIEW |REVIEW TYPE | +-------------------+------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|normalized_section_header_mapper| |Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|14.2 KB| --- layout: model title: Translate Tuvaluan to English Pipeline author: John Snow Labs name: translate_tvl_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, tvl, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as GPU is recommended. 
- source languages: `tvl` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_tvl_en_xx_2.7.0_2.4_1609690444126.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_tvl_en_xx_2.7.0_2.4_1609690444126.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_tvl_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_tvl_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.tvl.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_tvl_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Integration Clause Binary Classifier author: John Snow Labs name: legclf_integration_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `integration` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. 
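The paragraph-splitting option mentioned above (splitting by multiline) can be approximated in plain Python before the chunks are fed to the classifier pipeline — a minimal sketch assuming blank-line-separated clauses; the Legal NLP tutorial linked above uses dedicated Spark NLP components for this:

```python
import re

def split_paragraphs(document: str) -> list[str]:
    """Split a document into candidate clause paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

# Hypothetical contract snippet for illustration.
contract = (
    "ENTIRE AGREEMENT. This Agreement constitutes the entire agreement "
    "between the parties.\n\n"
    "GOVERNING LAW. This Agreement shall be governed by the laws of Delaware."
)
paragraphs = split_paragraphs(contract)
# Each paragraph can then be classified independently, giving the model
# one clause-sized chunk of context (within the 512-token limit) at a time.
```

Splitting by headers / subheaders works the same way with a header-matching pattern instead of blank lines.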
## Predicted Entities `other`, `integration` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_integration_clause_en_1.0.0_3.2_1660122564699.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_integration_clause_en_1.0.0_3.2_1660122564699.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_integration_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------------+ |result | +-------------+ |[integration]| |[other] | |[other] | |[integration]| +-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_integration_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support integration 0.93 0.73 0.82 37 other 0.92 0.98 0.95 118 accuracy - - 0.92 155 macro-avg 0.93 0.86 0.88 155 weighted-avg 0.92 0.92 0.92 155 ``` --- layout: model title: English asr_Quran_speech_recognizer TFWav2Vec2ForCTC from Nuwaisir author: John Snow Labs name: asr_Quran_speech_recognizer date: 2022-09-26 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Quran_speech_recognizer` is an English model originally trained by Nuwaisir. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_Quran_speech_recognizer_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Quran_speech_recognizer_en_4.2.0_3.0_1664208158710.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Quran_speech_recognizer_en_4.2.0_3.0_1664208158710.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_Quran_speech_recognizer", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_Quran_speech_recognizer", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_Quran_speech_recognizer| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Detect Clinical Conditions (ner_eu_clinical_condition) author: John Snow Labs name: ner_eu_clinical_condition date: 2023-02-06 tags: [en, clinical, licensed, ner] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.8 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition (NER) deep learning model for clinical conditions. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNN. The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives. ## Predicted Entities `clinical_condition` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_en_4.2.8_3.0_1675718793293.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_en_4.2.8_3.0_1675718793293.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained('ner_eu_clinical_condition', "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["""Hyperparathyroidism was considered upon the fourth occasion. The history of weakness and generalized joint pains were present. He also had history of epigastric pain diagnosed informally as gastritis. He had previously had open reduction and internal fixation for the initial two fractures under general anesthesia. 
He sustained mandibular fracture."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_eu_clinical_condition", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documenter, sentenceDetector, tokenizer, word_embeddings, ner_model, ner_converter)) val data = Seq(Array("""Hyperparathyroidism was considered upon the fourth occasion. The history of weakness and generalized joint pains were present. He also had history of epigastric pain diagnosed informally as gastritis. He had previously had open reduction and internal fixation for the initial two fractures under general anesthesia. He sustained mandibular fracture.""")).toDS().toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash +-----------------------+------------------+ |chunk |ner_label | +-----------------------+------------------+ |Hyperparathyroidism |clinical_condition| |weakness |clinical_condition| |generalized joint pains|clinical_condition| |epigastric pain |clinical_condition| |gastritis |clinical_condition| |fractures |clinical_condition| |anesthesia |clinical_condition| |mandibular fracture |clinical_condition| +-----------------------+------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_eu_clinical_condition| |Compatibility:|Healthcare NLP 4.2.8+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|851.3 KB| ## References The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives. ## Benchmarking ```bash label tp fp fn total precision recall f1 clinical_event 230.0 28.0 70.0 300.0 0.8915 0.7667 0.8244 macro - - - - - - 0.8244 micro - - - - - - 0.8244 ``` --- layout: model title: English RobertaForQuestionAnswering (from mvonwyl) author: John Snow Labs name: roberta_qa_roberta_base_finetuned_squad2 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad2` is an English model originally trained by `mvonwyl`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_squad2_en_4.0.0_3.0_1655734553755.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_squad2_en_4.0.0_3.0_1655734553755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_finetuned_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_finetuned_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base.by_mvonwyl").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_finetuned_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mvonwyl/roberta-base-finetuned-squad2 --- layout: model title: English RobertaForQuestionAnswering Large Cased model (from susghosh) author: John Snow Labs name: roberta_qa_large_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-squad` is an English model originally trained by `susghosh`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_large_squad_en_4.3.0_3.0_1674221913718.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_large_squad_en_4.3.0_3.0_1674221913718.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_large_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/susghosh/roberta-large-squad --- layout: model title: Translate Bemba (Zambia) to English Pipeline author: John Snow Labs name: translate_bem_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, bem, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as GPU is recommended. - source languages: `bem` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_bem_en_xx_2.7.0_2.4_1609701800337.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_bem_en_xx_2.7.0_2.4_1609701800337.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_bem_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_bem_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.bem.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_bem_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Resolve CVX Codes author: John Snow Labs name: cvx_resolver_pipeline date: 2022-10-12 tags: [en, licensed, clinical, resolver, chunk_mapping, cvx, pipeline] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 4.2.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps entities with their corresponding CVX codes. You’ll just feed your text and it will return the corresponding CVX codes. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/cvx_resolver_pipeline_en_4.2.1_3.0_1665611325640.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/cvx_resolver_pipeline_en_4.2.1_3.0_1665611325640.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline resolver_pipeline = PretrainedPipeline("cvx_resolver_pipeline", "en", "clinical/models") text= "The patient has a history of influenza vaccine, tetanus and DTaP" result = resolver_pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val resolver_pipeline = new PretrainedPipeline("cvx_resolver_pipeline", "en", "clinical/models") val result = resolver_pipeline.fullAnnotate("The patient has a history of influenza vaccine, tetanus and DTaP") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.cvx_pipeline").predict("""The patient has a history of influenza vaccine, tetanus and DTaP""") ```
## Results ```bash +-----------------+---------+--------+ |chunk |ner_chunk|cvx_code| +-----------------+---------+--------+ |influenza vaccine|Vaccine |160 | |tetanus |Vaccine |35 | |DTaP |Vaccine |20 | +-----------------+---------+--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|cvx_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.2.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|2.1 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger --- layout: model title: Legal Electrical And Nuclear Industries Document Classifier (EURLEX) author: John Snow Labs name: legclf_electrical_and_nuclear_industries_bert date: 2023-03-06 tags: [en, legal, classification, clauses, electrical_and_nuclear_industries, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_electrical_and_nuclear_industries_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class Electrical_and_Nuclear_Industries or not (Binary Classification) according to EuroVoc labels. 
## Predicted Entities `Electrical_and_Nuclear_Industries`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_electrical_and_nuclear_industries_bert_en_1.0.0_3.0_1678111896903.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_electrical_and_nuclear_industries_bert_en_1.0.0_3.0_1678111896903.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_electrical_and_nuclear_industries_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-----------------------------------+ |result | +-----------------------------------+ |[Electrical_and_Nuclear_Industries]| |[Other] | |[Other] | |[Electrical_and_Nuclear_Industries]| +-----------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_electrical_and_nuclear_industries_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Electrical_and_Nuclear_Industries 0.82 0.94 0.88 34 Other 0.95 0.85 0.90 46 accuracy - - 0.89 80 macro-avg 0.89 0.89 0.89 80 weighted-avg 0.90 0.89 0.89 80 ``` --- layout: model title: Spanish RobertaForQuestionAnswering (from mrm8488) author: John Snow Labs name: roberta_qa_longformer_base_4096_spanish_finetuned_squad date: 2022-06-20 tags: [es, open_source, question_answering, roberta] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `longformer-base-4096-spanish-finetuned-squad` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_longformer_base_4096_spanish_finetuned_squad_es_4.0.0_3.0_1655728985385.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_longformer_base_4096_spanish_finetuned_squad_es_4.0.0_3.0_1655728985385.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_longformer_base_4096_spanish_finetuned_squad","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_longformer_base_4096_spanish_finetuned_squad","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.squad.roberta.base_4096.by_mrm8488").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_longformer_base_4096_spanish_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|es| |Size:|473.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/longformer-base-4096-spanish-finetuned-squad - https://creativecommons.org/licenses/by/4.0/legalcode - https://es.wikinews.org/ - https://creativecommons.org/licenses/by/2.5/ - https://es.wikipedia.org/ - https://creativecommons.org/licenses/by-sa/3.0/legalcode - https://twitter.com/mrm8488 - https://www.narrativa.com/ - http://clic.ub.edu/corpus/en --- layout: model title: Legal Use of proceeds Clause Binary Classifier author: John Snow Labs name: legclf_use_of_proceeds_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `use-of-proceeds` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification as sentence level. If you have big legal documents, and you want to look for clauses, we recommend you to split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. 
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `use-of-proceeds` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_use_of_proceeds_clause_en_1.0.0_3.2_1660123170841.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_use_of_proceeds_clause_en_1.0.0_3.2_1660123170841.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_use_of_proceeds_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
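The per-label F1 scores reported in the Benchmarking section below are the harmonic mean of precision and recall; a minimal sketch of that relationship, using the `use-of-proceeds` row from the table:

```python
# F1 is the harmonic mean of precision and recall. The use-of-proceeds row in
# the Benchmarking section (precision 1.00, recall 0.98) reproduces this way;
# values are rounded to two decimals as in the table.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(round(f1(1.00, 0.98), 2))  # -> 0.99, matching the table
```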
## Results ```bash +-------+ | result| +-------+ |[use-of-proceeds]| |[other]| |[other]| |[use-of-proceeds]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_use_of_proceeds_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.99 1.00 1.00 112 use-of-proceeds 1.00 0.98 0.99 43 accuracy - - 0.99 155 macro-avg 1.00 0.99 0.99 155 weighted-avg 0.99 0.99 0.99 155 ``` --- layout: model title: Word2Vec Embeddings in Macedonian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, mk, open_source] task: Embeddings language: mk edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_mk_3.4.1_3.0_1647443888318.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_mk_3.4.1_3.0_1647443888318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","mk") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Сакам искра НЛП"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","mk") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Сакам искра НЛП").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("mk.embed.w2v_cc_300d").predict("""Сакам искра НЛП""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|mk| |Size:|788.2 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Translate English to Morisyen Pipeline author: John Snow Labs name: translate_en_mfe date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, mfe, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `mfe` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_mfe_xx_2.7.0_2.4_1609690383378.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_mfe_xx_2.7.0_2.4_1609690383378.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_mfe", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_mfe", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.mfe').predict(text, output_level='sentence') translate_df ```
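For reference, `annotate` returns a plain Python dict that maps each output column name to a list of result strings. A minimal sketch of consuming that structure — the dict below is a mock standing in for a real pipeline call, and the `translation` key name is an assumption based on typical translate pipelines:

```python
# Mock of the dict shape returned by PretrainedPipeline.annotate; the key
# names and the placeholder translation are assumptions, not real output.
annotations = {
    "sentence": ["Your sentence to translate!"],
    "translation": ["<translated sentence>"],
}

# Join the per-sentence translations back into a single string.
translated_text = " ".join(annotations["translation"])
print(translated_text)  # -> <translated sentence>
```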
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_mfe| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Extract Cancer Therapies and Posology Information author: John Snow Labs name: ner_oncology_unspecific_posology date: 2022-11-24 tags: [licensed, clinical, oncology, en, ner, treatment, posology] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts mentions of treatments and posology information using unspecific labels (low granularity). Definitions of Predicted Entities: - `Cancer_Therapy`: Mentions of cancer treatments, including chemotherapy, radiotherapy, surgery and other. - `Posology_Information`: Terms related to the posology of the treatment, including duration, frequencies and dosage. ## Predicted Entities `Cancer_Therapy`, `Posology_Information` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_unspecific_posology_en_4.2.2_3.0_1669309081671.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_unspecific_posology_en_4.2.2_3.0_1669309081671.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_unspecific_posology", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. 
She is currently receiving her second cycle of chemotherapy and is in good overall condition."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_unspecific_posology", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving her second cycle of chemotherapy and is in good overall condition.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_unspecific_posology").predict("""The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving her second cycle of chemotherapy and is in good overall condition.""") ```
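The micro-average row in the Benchmarking section below pools the tp/fp/fn counts across both labels before computing precision, recall, and F1; a minimal sketch reproducing it from the per-label counts in the table:

```python
# Precision/recall/F1 from true-positive, false-positive, false-negative counts.
def prf(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Per-label (tp, fp, fn) counts from the Benchmarking table
posology = (2663, 244, 399)   # Posology_Information
cancer = (2580, 317, 247)     # Cancer_Therapy

# Micro average: sum the counts across labels, then compute the metrics once
micro = prf(posology[0] + cancer[0], posology[1] + cancer[1], posology[2] + cancer[2])
print([round(m, 2) for m in micro])  # -> [0.9, 0.89, 0.9], matching the micro_avg row
```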
## Results ```bash | chunk | ner_label | |:-----------------|:---------------------| | adriamycin | Cancer_Therapy | | 60 mg/m2 | Posology_Information | | cyclophosphamide | Cancer_Therapy | | 600 mg/m2 | Posology_Information | | over six courses | Posology_Information | | second cycle | Posology_Information | | chemotherapy | Cancer_Therapy | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_unspecific_posology| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.3 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Posology_Information 2663 244 399 3062 0.92 0.87 0.89 Cancer_Therapy 2580 317 247 2827 0.89 0.91 0.90 macro_avg 5243 561 646 5889 0.90 0.89 0.90 micro_avg 5243 561 646 5889 0.90 0.89 0.90 ``` --- layout: model title: Generic Classifier for Adverse Drug Events (LogReg) author: John Snow Labs name: generic_logreg_classifier_ade date: 2023-05-09 tags: [generic_classifier, logreg, clinical, licensed, en, text_classification, ade] task: Text Classification language: en edition: Healthcare NLP 4.4.1 spark_version: 3.0 supported: true annotator: GenericLogRegClassifierModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained with the Generic Classifier annotator and the Logistic Regression algorithm and classifies text/sentence into two categories: True : The sentence is talking about a possible ADE False : The sentence doesn’t have any information about an ADE. The corpus used for model training is ADE-Corpus-V2 Dataset: Adverse Drug Reaction Data. This is a dataset for classification of a sentence if it is ADE-related (True) or not (False). 
## Predicted Entities `True`, `False` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/generic_logreg_classifier_ade_en_4.4.1_3.0_1683641152188.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/generic_logreg_classifier_ade_en_4.4.1_3.0_1683641152188.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("word_embeddings") sentence_embeddings = SentenceEmbeddings() \ .setInputCols(["document", "word_embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") features_asm = FeaturesAssembler()\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("features") generic_classifier = GenericClassifierModel.pretrained("generic_logreg_classifier_ade", "en", "clinical/models")\ .setInputCols(["features"])\ .setOutputCol("class") clf_Pipeline = Pipeline(stages=[ document_assembler, tokenizer, word_embeddings, sentence_embeddings, features_asm, generic_classifier]) data = spark.createDataFrame([["""None of the patients required treatment for the overdose."""], ["""Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient."""]]).toDF("text") result = clf_Pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models") .setInputCols(Array("document", "token")) .setOutputCol("word_embeddings") val sentence_embeddings = new SentenceEmbeddings() .setInputCols(Array("document", "word_embeddings")) .setOutputCol("sentence_embeddings") .setPoolingStrategy("AVERAGE") val features_asm = new FeaturesAssembler() .setInputCols("sentence_embeddings") .setOutputCol("features") val generic_classifier = GenericClassifierModel.pretrained("generic_logreg_classifier_ade", "en", "clinical/models") .setInputCols("features") .setOutputCol("class") val clf_Pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, word_embeddings, sentence_embeddings, features_asm, generic_classifier)) val data = Seq("None of the patients required treatment for the overdose.", "Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.").toDS.toDF("text") val result = clf_Pipeline.fit(data).transform(data) ```
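The "weighted avg" row in the Benchmarking section below weights each label's score by its support (the number of test examples carrying that label); a minimal sketch reproducing the weighted F1 from the per-label rows in the table:

```python
# Support-weighted average of per-label scores, as reported by
# classification reports ("weighted avg" row).
def weighted_avg(scores, supports):
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total

f1_false, f1_true = 0.88, 0.64            # per-label F1 from the table
support_false, support_true = 3362, 1361  # per-label support from the table

w = weighted_avg([f1_false, f1_true], [support_false, support_true])
print(round(w, 2))  # -> 0.81, matching the weighted avg row
```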
## Results ```bash +----------------------------------------------------------------------------------------+-------+ |text |result | +----------------------------------------------------------------------------------------+-------+ |None of the patients required treatment for the overdose. |[False]| |Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.|[True] | +----------------------------------------------------------------------------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|generic_logreg_classifier_ade| |Compatibility:|Healthcare NLP 4.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[feature_vector]| |Output Labels:|[prediction]| |Language:|en| |Size:|17.0 KB| ## References The corpus used for model training is ADE-Corpus-V2 Dataset: Adverse Drug Reaction Data. This is a dataset for classification of a sentence if it is ADE-related (True) or not (False). Reference: Gurulingappa et al., Benchmark Corpus to Support Information Extraction for Adverse Drug Effects, JBI, 2012. http://www.sciencedirect.com/science/article/pii/S1532046412000615 ## Benchmarking ```bash label precision recall f1-score support False 0.84 0.92 0.88 3362 True 0.74 0.57 0.64 1361 accuracy - - 0.82 4723 macro avg 0.79 0.74 0.76 4723 weighted avg 0.81 0.82 0.81 4723 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Altaic Languages author: John Snow Labs name: opus_mt_en_tut date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, tut, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. 
Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `tut` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_tut_xx_2.7.0_2.4_1609166499411.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_tut_xx_2.7.0_2.4_1609166499411.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_tut", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Your sentence to translate!"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_tut", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.tut').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_tut| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English asr_processor_with_lm TFWav2Vec2ForCTC from hf-internal-testing author: John Snow Labs name: pipeline_asr_processor_with_lm date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_processor_with_lm` is a English model originally trained by hf-internal-testing. NOTE: This pipeline only works on a CPU, if you need to use this pipeline on a GPU device please use pipeline_asr_processor_with_lm_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_processor_with_lm_en_4.2.0_3.0_1664025190161.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_processor_with_lm_en_4.2.0_3.0_1664025190161.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_processor_with_lm', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_processor_with_lm", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_processor_with_lm| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|459.1 KB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal Force Majeure Clause Binary Classifier (CUAD dataset) author: John Snow Labs name: legclf_cuad_force_majeure_clause date: 2022-11-30 tags: [en, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `force-majeure` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification as sentence level. If you have big legal documents, and you want to look for clauses, we recommend you to split the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration the embeddings of this model allows up to 512 tokens. If you have more than that, consider splitting in smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause model you have added. 
There are other models in Models Hub with similar titles; the difference is the dataset each was trained on. This one was trained on the `cuad` dataset. ## Predicted Entities `other`, `force-majeure` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cuad_force_majeure_clause_en_1.0.0_3.0_1669806586316.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cuad_force_majeure_clause_en_1.0.0_3.0_1669806586316.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_cuad_force_majeure_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["10 . FORCE-MAJEURE 10.1 Except for the obligations to make any payment , required by this Contract ( which shall not be subject to relief under this item ), a Party shall not be in breach of this Contract and liable to the other Party for any failure to fulfil any obligation under this Contract to the extent any fulfillment has been interfered with , hindered , delayed , or prevented by any circumstance whatsoever , which is not reasonably within the control of and is unforeseeable by such Party and if such Party exercised due diligence , including acts of God , fire , flood , freezing , landslides , lightning , earthquakes , fire , storm , floods , washouts , and other natural disasters , wars ( declared or undeclared ), insurrections , riots , civil disturbances , epidemics , quarantine restrictions , blockade , embargo , strike , lockouts , labor disputes , or restrictions imposed by any government ."]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
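The description above notes that this classifier can be combined with other binary clause classifiers to get a series of True/False values. A minimal pure-Python sketch of that post-processing step, collapsing each classifier's predicted category into a boolean flag — the predictions below are mock values, not real pipeline output:

```python
# Each binary clause classifier emits either its clause label or "other".
# Collapsing those predictions into per-clause booleans yields the
# True/False series described above. Predictions here are mock values.
predictions = {
    "legclf_cuad_force_majeure_clause": "force-majeure",
    "legclf_use_of_proceeds_clause": "other",
}

clause_flags = {
    model: category != "other" for model, category in predictions.items()
}
print(clause_flags)
# -> {'legclf_cuad_force_majeure_clause': True, 'legclf_use_of_proceeds_clause': False}
```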
## Results ```bash +-------+ | result| +-------+ |[force-majeure]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_cuad_force_majeure_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.3 MB| ## References In-house annotations on Cuad dataset. ## Benchmarking ```bash label precision recall f1-score support force-majeure 0.97 0.94 0.95 31 other 0.96 0.98 0.97 56 accuracy - - 0.97 87 macro-avg 0.97 0.96 0.96 87 weighted-avg 0.97 0.97 0.97 87 ``` --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_10 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-16-finetuned-squad-seed-10` is a English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_10_en_4.0.0_3.0_1657184453535.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_10_en_4.0.0_3.0_1657184453535.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_10","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-16-finetuned-squad-seed-10 --- layout: model title: Pipeline to Mapping ICD10-CM Codes with Their Corresponding SNOMED Codes author: John Snow Labs name: icd10cm_snomed_mapping date: 2022-06-27 tags: [icd10cm, snomed, pipeline, clinical, en, licensed, chunk_mapper] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of `icd10cm_snomed_mapper` model. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10cm_snomed_mapping_en_3.5.3_3.0_1656361159581.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10cm_snomed_mapping_en_3.5.3_3.0_1656361159581.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("icd10cm_snomed_mapping", "en", "clinical/models") result = pipeline.fullAnnotate('R079 N4289 M62830') ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("icd10cm_snomed_mapping", "en", "clinical/models") val result = pipeline.fullAnnotate("R079 N4289 M62830") ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.icd10cm_to_snomed.pipe").predict("""R079 N4289 M62830""") ```
## Results ```bash | | icd10cm_code | snomed_code | |---:|:----------------------|:-----------------------------------------| | 0 | R079 | N4289 | M62830 | 161972006 | 22035000 | 16410651000119105 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd10cm_snomed_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.1 MB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel --- layout: model title: Legal Waiver of Jury Trial Clause Binary Classifier author: John Snow Labs name: legclf_waiver_of_jury_trial_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `waiver-of-jury-trial` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
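The paragraph-splitting advice above can be sketched in plain Python. This is only an illustration of the idea, not part of Legal NLP: the function name and the whitespace-based token count (a rough proxy for the model's real 512-token tokenizer budget) are assumptions.

```python
import re

def split_into_paragraphs(text, max_tokens=512):
    """Split a long legal document into paragraphs on blank lines ("by multiline"),
    flagging whether each paragraph is likely to fit the embedding token budget.
    Whitespace tokenization is only a rough proxy for the model's tokenizer."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

doc = "FIRST CLAUSE.\nSome text here.\n\nSECOND CLAUSE.\nMore text."
chunks = split_into_paragraphs(doc)
# Each chunk can now be fed to the classifier as a separate row.
```

Each resulting paragraph would then become one row of the `clause_text` column used in the pipeline below.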
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `waiver-of-jury-trial` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_waiver_of_jury_trial_clause_en_1.0.0_3.2_1660123193810.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_waiver_of_jury_trial_clause_en_1.0.0_3.2_1660123193810.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_waiver_of_jury_trial_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +----------------------+ |result | +----------------------+ |[waiver-of-jury-trial]| |[other] | |[other] | |[waiver-of-jury-trial]| +----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_waiver_of_jury_trial_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.96 0.99 0.98 102 waiver-of-jury-trial 0.96 0.86 0.91 28 accuracy - - 0.96 130 macro-avg 0.96 0.92 0.94 130 weighted-avg 0.96 0.96 0.96 130 ``` --- layout: model title: Modern Greek (1453-) BertForQuestionAnswering model (from Danastos) author: John Snow Labs name: bert_qa_newsqa_bert_el_Danastos date: 2022-06-03 tags: [open_source, question_answering, bert] task: Question Answering language: el edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `newsqa_bert_el` is a Modern Greek (1453-) model originally trained by `Danastos`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_newsqa_bert_el_Danastos_el_4.0.0_3.0_1654249941385.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_newsqa_bert_el_Danastos_el_4.0.0_3.0_1654249941385.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_newsqa_bert_el_Danastos","el") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_newsqa_bert_el_Danastos","el") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("el.answer_question.news.bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")  ```
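In the nlu snippet above, the question and its context are passed to `predict()` as a single string joined by a `|||` separator. A tiny helper for building that input string can be sketched as follows; the helper itself is illustrative and not part of the nlu API:

```python
def make_qa_input(question, context, sep="|||"):
    """Join a question and its context into the single-string
    question|||context format used in the nlu predict() call above."""
    return f"{question}{sep}{context}"

qa_input = make_qa_input("What's my name?",
                         "My name is Clara and I live in Berkeley.")
# qa_input == "What's my name?|||My name is Clara and I live in Berkeley."
```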
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_newsqa_bert_el_Danastos| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|el| |Size:|421.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Danastos/newsqa_bert_el --- layout: model title: English image_classifier_vit_rust_image_classification_3 ViTForImageClassification from SummerChiam author: John Snow Labs name: image_classifier_vit_rust_image_classification_3 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rust_image_classification_3` is an English model originally trained by SummerChiam. ## Predicted Entities `nonrust`, `rust` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_3_en_4.1.0_3.0_1660167270260.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_3_en_4.1.0_3.0_1660167270260.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_rust_image_classification_3", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_rust_image_classification_3", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_rust_image_classification_3| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Fast Neural Machine Translation Model from English to Lozi author: John Snow Labs name: opus_mt_en_loz date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, loz, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `loz` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_loz_xx_2.7.0_2.4_1609164276767.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_loz_xx_2.7.0_2.4_1609164276767.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_loz", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_loz", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.loz').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_loz| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_32_finetuned_squad_seed_0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-32-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_32_finetuned_squad_seed_0_en_4.3.0_3.0_1674215168842.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_32_finetuned_squad_seed_0_en_4.3.0_3.0_1674215168842.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_32_finetuned_squad_seed_0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_32_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_32_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|417.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-32-finetuned-squad-seed-0 --- layout: model title: English RobertaForQuestionAnswering Mini Cased model (from sguskin) author: John Snow Labs name: roberta_qa_minilmv2_l6_h384_squad1.1 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `minilmv2-L6-H384-squad1.1` is an English model originally trained by `sguskin`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_minilmv2_l6_h384_squad1.1_en_4.3.0_3.0_1674211435898.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_minilmv2_l6_h384_squad1.1_en_4.3.0_3.0_1674211435898.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_minilmv2_l6_h384_squad1.1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_minilmv2_l6_h384_squad1.1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_minilmv2_l6_h384_squad1.1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|112.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/sguskin/minilmv2-L6-H384-squad1.1 --- layout: model title: French CamemBert Embeddings (from JonathanSum) author: John Snow Labs name: camembert_embeddings_JonathanSum_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `JonathanSum`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_JonathanSum_generic_model_fr_3.4.4_3.0_1653986427511.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_JonathanSum_generic_model_fr_3.4.4_3.0_1653986427511.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_JonathanSum_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_JonathanSum_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_JonathanSum_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/JonathanSum/dummy-model --- layout: model title: Stopwords Remover for Vietnamese language (1942 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, vi, open_source] task: Stop Words Removal language: vi edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_vi_3.4.1_3.0_1646672286429.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_vi_3.4.1_3.0_1646672286429.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","vi") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Bạn không tốt hơn tôi"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","vi") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Bạn không tốt hơn tôi").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("vi.stopwords").predict("""Bạn không tốt hơn tôi""") ```
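The Results section that follows shows an empty list, because every token of the example sentence appears in the model's stopword list. The cleaning logic can be sketched in plain Python; the five-word stopword set below is only an illustrative subset (implied by the empty result) of the full 1942-entry list, not the actual model artifact:

```python
# Illustrative subset of the Vietnamese stopwords-iso list (1942 entries in the model).
STOPWORDS = {"bạn", "không", "tốt", "hơn", "tôi"}

def clean_tokens(tokens, stopwords=STOPWORDS):
    """Drop tokens whose lowercase form appears in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

print(clean_tokens("Bạn không tốt hơn tôi".split()))  # prints: []
```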
## Results ```bash +------+ |result| +------+ |[] | +------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|vi| |Size:|8.6 KB| --- layout: model title: Generic Deidentification NER author: John Snow Labs name: legner_deid date: 2022-08-09 tags: [en, legal, ner, deid, licensed] task: [De-identification, Named Entity Recognition] language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true recommended: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an NER model that allows you to detect generic entities which may need to be masked or obfuscated to comply with regulations such as GDPR and CCPA. This is just an NER model; make sure you also try the full De-identification pipelines available in Models Hub. ## Predicted Entities `AGE`, `CITY`, `COUNTRY`, `DATE`, `EMAIL`, `FAX`, `LOCATION-OTHER`, `ORG`, `PERSON`, `PHONE`, `PROFESSION`, `STATE`, `STREET`, `URL`, `ZIP` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/DEID_LEGAL/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_deid_en_1.0.0_3.2_1660050699764.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_deid_en_1.0.0_3.2_1660050699764.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained('legner_deid', "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = [""" This LICENSE AND DEVELOPMENT AGREEMENT (this Agreement) is entered into effective as of Nov. 02, 2019 (the Effective Date) by and between Bioeq IP AG, having its principal place of business at 333 Twin Dolphin Drive, Suite 600, Redwood City, CA, 94065, USA (Licensee). """] res = model.transform(spark.createDataFrame([text]).toDF("text")) ```
## Results ```bash +-----------+----------------+ | token| ner_label| +-----------+----------------+ | This| O| | LICENSE| O| | AND| O| |DEVELOPMENT| O| | AGREEMENT| O| | (| O| | this| O| | Agreement| O| | )| O| | is| O| | entered| O| | into| O| | effective| O| | as| O| | of| O| | Nov| B-DATE| | .| I-DATE| | 02| I-DATE| | ,| I-DATE| | 2019| I-DATE| | (| O| | the| O| | Effective| O| | Date| O| | )| O| | by| O| | and| O| | between| O| | Bioeq| O| | IP| O| | AG| O| | ,| O| | having| O| | its| O| | principal| O| | place| O| | of| O| | business| O| | at| O| | 333| B-STREET| | Twin| I-STREET| | Dolphin| I-STREET| | Drive| I-STREET| | ,| O| | Suite|B-LOCATION-OTHER| | 600|I-LOCATION-OTHER| | ,| O| | Redwood| B-CITY| | City| I-CITY| | ,| O| | CA| B-STATE| | ,| O| | 94065| B-ZIP| | ,| O| | USA| B-STATE| | (| O| | Licensee| O| | ).| O| +-----------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_deid| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.4 MB| ## References In-house annotated documents with protected information ## Benchmarking ```bash label precision recall f1-score support B-AGE 0.96 0.89 0.92 245 B-CITY 0.85 0.86 0.86 123 B-COUNTRY 0.86 0.67 0.75 36 B-DATE 0.98 0.97 0.97 2352 B-ORG 0.75 0.71 0.73 38 B-PERSON 0.97 0.94 0.95 1348 B-PHONE 0.86 0.80 0.83 86 B-PROFESSION 0.93 0.75 0.83 84 B-STATE 0.92 0.89 0.91 102 B-STREET 0.99 0.91 0.95 89 I-CITY 0.82 0.77 0.79 35 I-COUNTRY 1.00 0.50 0.67 6 I-DATE 0.96 0.95 0.96 402 I-ORG 0.71 0.86 0.77 28 I-PERSON 0.98 0.96 0.97 1240 I-PHONE 0.91 0.92 0.92 77 I-PROFESSION 0.96 0.79 0.87 70 I-STATE 1.00 0.62 0.77 8 I-STREET 0.98 0.94 0.96 188 I-ZIP 0.84 0.97 0.90 60 O 1.00 1.00 1.00 194103 accuracy - - 1.00 200762 macro-avg 0.72 0.62 0.65 200762 weighted-avg 1.00 1.00 1.00 200762 ``` --- layout: model title: German Public Health Mention 
Sequence Classifier (BERT-base) author: John Snow Labs name: bert_sequence_classifier_health_mentions_bert date: 2022-08-10 tags: [public_health, de, licensed, sequence_classification, health_mention] task: Text Classification language: de edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [bert-base-german](https://www.deepset.ai/german-bert) based sequence classification model that can classify public health mentions in German social media text. ## Predicted Entities `non-health`, `health-related` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_health_mentions_bert_de_4.0.2_3.0_1660131666549.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_health_mentions_bert_de_4.0.2_3.0_1660131666549.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_health_mentions_bert", "de", "clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame([ ["Die Temperaturen klettern am Wochenende."], ["Zu den Symptomen gehört u.a. eine verringerte Greifkraft."] ]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_health_mentions_bert", "de", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq("Die Temperaturen klettern am Wochenende.", "Zu den Symptomen gehört u.a. eine verringerte Greifkraft.").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.classify.bert_sequence.health_mentions_bert").predict("""Zu den Symptomen gehört u.a. eine verringerte Greifkraft.""") ```
## Results ```bash +---------------------------------------------------------+----------------+ |text |result | +---------------------------------------------------------+----------------+ |Die Temperaturen klettern am Wochenende. |[non-health] | |Zu den Symptomen gehört u.a. eine verringerte Greifkraft.|[health-related]| +---------------------------------------------------------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_health_mentions_bert| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|de| |Size:|409.8 MB| |Case sensitive:|true| |Max sentence length:|128| ## References Curated from several academic and in-house datasets. ## Benchmarking ```bash label precision recall f1-score support non-health 0.99 0.90 0.94 82 health-related 0.89 0.99 0.94 69 accuracy - - 0.94 151 macro-avg 0.94 0.94 0.94 151 weighted-avg 0.94 0.94 0.94 151 ``` --- layout: model title: Word2Vec Embeddings in Sakha (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, sah, open_source] task: Embeddings language: sah edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sah_3.4.1_3.0_1647455186697.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sah_3.4.1_3.0_1647455186697.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sah") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sah") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("sah.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
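The `embeddings` column produced above holds one 300-dimensional vector per token. A common downstream use of such lookup embeddings is comparing tokens by cosine similarity; a self-contained sketch with toy low-dimensional vectors standing in for real model output:

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d stand-ins for the model's 300-d token vectors.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0: same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0: orthogonal
```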
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|sah| |Size:|150.9 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Legal Financial Institutions And Credit Document Classifier (EURLEX) author: John Snow Labs name: legclf_financial_institutions_and_credit_bert date: 2023-03-06 tags: [en, legal, classification, clauses, financial_institutions_and_credit, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_financial_institutions_and_credit_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class Financial_Institutions_and_Credit or not (Binary Classification) according to EuroVoc labels. ## Predicted Entities `Financial_Institutions_and_Credit`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_financial_institutions_and_credit_bert_en_1.0.0_3.0_1678111900978.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_financial_institutions_and_credit_bert_en_1.0.0_3.0_1678111900978.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_financial_institutions_and_credit_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Financial_Institutions_and_Credit]| |[Other]| |[Other]| |[Financial_Institutions_and_Credit]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_financial_institutions_and_credit_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.4 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Financial_Institutions_and_Credit 0.87 0.89 0.88 81 Other 0.87 0.85 0.86 72 accuracy - - 0.87 153 macro-avg 0.87 0.87 0.87 153 weighted-avg 0.87 0.87 0.87 153 ``` --- layout: model title: English RobertaForQuestionAnswering Cased model (from Nakul24) author: John Snow Labs name: roberta_qa_emotion_extraction date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `RoBERTa-emotion-extraction` is an English model originally trained by `Nakul24`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_emotion_extraction_en_4.3.0_3.0_1674208609562.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_emotion_extraction_en_4.3.0_3.0_1674208609562.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_emotion_extraction","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_emotion_extraction","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
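Span-based QA models like this one score every context token as a potential answer start and as a potential answer end; the returned answer is the highest-scoring valid pair (start before end, within a length cap). A minimal sketch of that selection step, using made-up logits rather than real model output:

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    # Score each candidate span as start_logit[i] + end_logit[j], with j >= i.
    best_i, best_j, best_score = 0, 0, float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best_score:
                best_i, best_j, best_score = i, j, score
    return best_i, best_j

# Made-up logits over a 4-token context: token 1 is the best start, token 2 the best end.
print(best_span([0.1, 5.0, 0.2, 0.0], [0.0, 0.1, 4.0, 0.3]))  # (1, 2)
```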
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_emotion_extraction| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|426.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Nakul24/RoBERTa-emotion-extraction --- layout: model title: Legal Question Answering (Bert) author: John Snow Labs name: legqa_bert date: 2022-08-09 tags: [en, legal, qa, licensed] task: Question Answering language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Legal Bert-based Question Answering model, trained on squad-v2, finetuned on proprietary Legal questions and answers. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legqa_bert_en_1.0.0_3.2_1660054695560.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legqa_bert_en_1.0.0_3.2_1660054695560.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) spanClassifier = nlp.BertForQuestionAnswering.pretrained("legqa_bert","en", "legal/models") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = nlp.Pipeline().setStages([ documentAssembler, spanClassifier ]) example = spark.createDataFrame([["Who was subjected to torture?", "The applicant submitted that her husband was subjected to treatment amounting to abuse whilst in the custody of police."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) result.select('answer.result').show() ```
## Results ```bash `her husband` ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legqa_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References Trained on squad-v2, finetuned on proprietary Legal questions and answers. --- layout: model title: Finnish asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP TFWav2Vec2ForCTC from Finnish-NLP author: John Snow Labs name: pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP` is a Finnish model originally trained by Finnish-NLP. NOTE: This pipeline only works on a CPU, if you need to use this pipeline on a GPU device please use pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP_fi_4.2.0_3.0_1664025690267.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP_fi_4.2.0_3.0_1664025690267.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP', lang = 'fi') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP", lang = "fi") val annotations = pipeline.transform(audioDF) ```
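The Wav2Vec2ForCTC stage inside this pipeline emits a per-frame distribution over characters, which CTC decoding collapses into text (here further rescored by a language model, per the model name). The core greedy step — collapse consecutive repeats, then drop blanks — can be sketched as follows, with a toy vocabulary that is purely illustrative:

```python
def ctc_greedy_decode(frame_ids, blank=0, vocab=None):
    # Collapse consecutive repeated ids, then remove blank tokens.
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    if vocab is not None:
        return "".join(vocab[i] for i in out)
    return out

# Toy frames: blank=0, and a 3-character vocabulary.
vocab = {1: "h", 2: "e", 3: "i"}
print(ctc_greedy_decode([0, 1, 1, 0, 2, 2, 3, 0], vocab=vocab))  # "hei"
```

Note that a blank between two identical characters is what allows true doubled letters to survive the repeat-collapsing step.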
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| |Size:|3.5 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Translate English to Rundi Pipeline author: John Snow Labs name: translate_en_run date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, run, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as GPU is recommended. - source languages: `en` - target languages: `run` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_run_xx_2.7.0_2.4_1609688818752.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_run_xx_2.7.0_2.4_1609688818752.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_run", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_run", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.run').predict(text, output_level='sentence') translate_df ```
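Since translation cost grows with sequence length, it usually pays to feed the pipeline short, size-bounded batches of sentences rather than one long document in a single call. A simple batching helper — the 500-character cap is an arbitrary illustration, not a pipeline requirement:

```python
def batch_sentences(sentences, max_chars=500):
    """Group sentences into batches bounded by total character count."""
    batches, current, size = [], [], 0
    for s in sentences:
        # Start a new batch once adding this sentence would exceed the cap.
        if current and size + len(s) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(s)
        size += len(s)
    if current:
        batches.append(current)
    return batches

long_a, long_b, short_c = "a" * 300, "b" * 300, "c" * 100
print(batch_sentences([long_a, long_b, short_c]))  # two batches: [a], [b, c]
```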
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_run| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate English to Papiamento Pipeline author: John Snow Labs name: translate_en_pap date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, pap, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as GPU is recommended. - source languages: `en` - target languages: `pap` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_pap_xx_2.7.0_2.4_1609691878350.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_pap_xx_2.7.0_2.4_1609691878350.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_pap", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_pap", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.pap').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_pap| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Stockholder Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_stockholder_agreement_bert date: 2022-11-25 tags: [en, legal, classification, agreement, stockholder, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_stockholder_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `stockholder-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `stockholder-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_stockholder_agreement_bert_en_1.0.0_3.0_1669371874035.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_stockholder_agreement_bert_en_1.0.0_3.0_1669371874035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_stockholder_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
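The description notes that this Bert-based classifier is lighter at inference time than a Longformer alternative. Claims like that are easy to verify empirically with a small best-of-N timer wrapped around any callable; this is a generic sketch, not a rigorous benchmark harness:

```python
import time

def best_of_n(fn, *args, repeats=5):
    # Best-of-N wall-clock timing smooths out scheduler and warm-up noise.
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

# Stand-in workload; in practice you would time the fitted pipeline's
# transform-and-collect over a fixed sample of documents.
elapsed = best_of_n(lambda: sum(range(100_000)))
print(f"best of 5 runs: {elapsed:.6f}s")
```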
## Results ```bash +-------+ |result| +-------+ |[stockholder-agreement]| |[other]| |[other]| |[stockholder-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_stockholder_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.93 1.00 0.96 65 stockholder-agreement 1.00 0.83 0.91 29 accuracy - - 0.95 94 macro-avg 0.96 0.91 0.93 94 weighted-avg 0.95 0.95 0.95 94 ``` --- layout: model title: Legal Oil Industry Document Classifier (EURLEX) author: John Snow Labs name: legclf_oil_industry_bert date: 2023-03-06 tags: [en, legal, classification, clauses, oil_industry, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_oil_industry_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class Oil_Industry or not (Binary Classification) according to EuroVoc labels.
## Predicted Entities `Oil_Industry`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_oil_industry_bert_en_1.0.0_3.0_1678111691969.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_oil_industry_bert_en_1.0.0_3.0_1678111691969.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_oil_industry_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
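The Benchmarking section below reports per-label precision, recall, and F1 plus a macro average. For reference, these follow directly from true-positive, false-positive, and false-negative counts; the counts used here are illustrative, not the model's:

```python
def precision_recall_f1(tp, fp, fn):
    # Standard definitions, guarding against empty denominators.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_f1(counts):
    # counts: list of (tp, fp, fn) per label; macro = unweighted mean of F1 scores.
    return sum(precision_recall_f1(*c)[2] for c in counts) / len(counts)

# Illustrative counts for a two-label problem.
print(precision_recall_f1(40, 7, 3))
print(macro_f1([(40, 7, 3), (33, 3, 7)]))
```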
## Results ```bash +-------+ |result| +-------+ |[Oil_Industry]| |[Other]| |[Other]| |[Oil_Industry]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_oil_industry_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Oil_Industry 0.85 0.93 0.89 43 Other 0.92 0.82 0.87 40 accuracy - - 0.88 83 macro-avg 0.88 0.88 0.88 83 weighted-avg 0.88 0.88 0.88 83 ``` --- layout: model title: Korean ElectraForQuestionAnswering model (from monologg) Version-2 author: John Snow Labs name: electra_qa_base_v2_finetuned_korquad_384 date: 2022-06-22 tags: [ko, open_source, electra, question_answering] task: Question Answering language: ko edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `koelectra-base-v2-finetuned-korquad-384` is a Korean model originally trained by `monologg`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_base_v2_finetuned_korquad_384_ko_4.0.0_3.0_1655922143142.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_base_v2_finetuned_korquad_384_ko_4.0.0_3.0_1655922143142.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_v2_finetuned_korquad_384","ko") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_v2_finetuned_korquad_384","ko") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ko.answer_question.korquad.electra.base_v2_384.by_monologg").predict("""내 이름은 무엇입니까?|||제 이름은 클라라이고 저는 버클리에 살고 있습니다.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_base_v2_finetuned_korquad_384| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ko| |Size:|412.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/monologg/koelectra-base-v2-finetuned-korquad-384 --- layout: model title: Pipeline to Classify Texts into TREC-6 Classes author: John Snow Labs name: bert_sequence_classifier_trec_coarse_pipeline date: 2022-06-19 tags: [bert_sequence, trec, coarse, bert, en, open_source] task: Text Classification language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [bert_sequence_classifier_trec_coarse_en](https://nlp.johnsnowlabs.com/2021/11/06/bert_sequence_classifier_trec_coarse_en.html). The TREC dataset for question classification consists of open-domain, fact-based questions divided into broad semantic categories. You can check the official documentation of the dataset, entities, etc. [here](https://search.r-project.org/CRAN/refmans/textdata/html/dataset_trec.html). ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_trec_coarse_pipeline_en_4.0.0_3.0_1655653749614.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_trec_coarse_pipeline_en_4.0.0_3.0_1655653749614.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python trec_pipeline = PretrainedPipeline("bert_sequence_classifier_trec_coarse_pipeline", lang = "en") trec_pipeline.annotate("Germany is the largest country in Europe economically.") ``` ```scala val trec_pipeline = new PretrainedPipeline("bert_sequence_classifier_trec_coarse_pipeline", lang = "en") trec_pipeline.annotate("Germany is the largest country in Europe economically.") ```
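The pipeline returns one of the six TREC coarse classes, as in the `['LOC']` result shown below. For readers unfamiliar with the dataset, the standard TREC-6 label codes can be kept in a simple lookup (descriptions paraphrased from the dataset documentation linked above):

```python
# The six coarse TREC question classes.
TREC6 = {
    "ABBR": "abbreviation",
    "DESC": "description and abstract concepts",
    "ENTY": "entities",
    "HUM": "human beings",
    "LOC": "locations",
    "NUM": "numeric values",
}

def describe(label):
    # Map a predicted label code to a human-readable description.
    return TREC6.get(label, "unknown label")

print(describe("LOC"))  # locations
```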
## Results ```bash ['LOC'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_trec_coarse_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|406.6 MB| ## Included Models - DocumentAssembler - TokenizerModel - BertForSequenceClassification --- layout: model title: Extract medical devices and clinical department mentions (Voice of the Patients) author: John Snow Labs name: ner_vop_clinical_dept_wip date: 2023-05-19 tags: [licensed, clinical, en, ner, vop, patient] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts mentions of medical devices and clinical departments from text written in the patient’s own words. Note: the ‘wip’ suffix indicates that model development is a work in progress; the model will be finalised and its performance improved in upcoming releases. ## Predicted Entities `AdmissionDischarge`, `ClinicalDept`, `MedicalDevice` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_clinical_dept_wip_en_4.4.2_3.0_1684512218256.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_clinical_dept_wip_en_4.4.2_3.0_1684512218256.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_clinical_dept_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["My little brother is having surgery tomorrow in the orthopedic department. He is getting a titanium plate put in his leg to help it heal faster. Wishing him a speedy recovery!"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_clinical_dept_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("My little brother is having surgery tomorrow in the orthopedic department. He is getting a titanium plate put in his leg to help it heal faster. Wishing him a speedy recovery!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
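The `NerConverterInternal` stage at the end of this pipeline merges the token-level BIO tags produced by the NER model into entity chunks like those in the Results table below. Its essential logic can be sketched in plain Python (a simplified sketch, not the annotator's actual implementation):

```python
def bio_to_chunks(tokens, tags):
    # Merge B-/I- tagged tokens into (chunk_text, label) pairs.
    chunks, cur_tokens, cur_label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur_tokens:
                chunks.append((" ".join(cur_tokens), cur_label))
            cur_tokens, cur_label = [tok], tag[2:]
        elif tag.startswith("I-") and cur_label == tag[2:]:
            cur_tokens.append(tok)
        else:  # "O" tag or an inconsistent I- tag closes the open chunk.
            if cur_tokens:
                chunks.append((" ".join(cur_tokens), cur_label))
            cur_tokens, cur_label = [], None
    if cur_tokens:
        chunks.append((" ".join(cur_tokens), cur_label))
    return chunks

tokens = ["the", "orthopedic", "department", ".", "titanium", "plate"]
tags = ["O", "B-ClinicalDept", "I-ClinicalDept", "O", "B-MedicalDevice", "I-MedicalDevice"]
print(bio_to_chunks(tokens, tags))
# [('orthopedic department', 'ClinicalDept'), ('titanium plate', 'MedicalDevice')]
```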
## Results ```bash | chunk | ner_label | |:----------------------|:--------------| | orthopedic department | ClinicalDept | | titanium plate | MedicalDevice | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_clinical_dept_wip| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
## Benchmarking ```bash label tp fp fn total precision recall f1 AdmissionDischarge 29 1 5 34 0.97 0.85 0.91 ClinicalDept 292 41 34 326 0.88 0.90 0.89 MedicalDevice 244 72 88 332 0.77 0.73 0.75 macro_avg 565 114 127 692 0.87 0.83 0.85 micro_avg 565 114 127 692 0.83 0.82 0.82 ``` --- layout: model title: Explain Document Pipeline for Dutch author: John Snow Labs name: explain_document_md date: 2021-03-22 tags: [open_source, dutch, explain_document_md, pipeline, nl] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: nl edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_md is a pretrained pipeline that processes text through a simple sequence of basic steps. It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_nl_3.0.0_3.0_1616434945966.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_md_nl_3.0.0_3.0_1616434945966.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_md', lang = 'nl') annotations = pipeline.fullAnnotate("Hallo van John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_md", lang = "nl") val result = pipeline.fullAnnotate("Hallo van John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hallo van John Snow Labs! "] result_df = nlu.load('nl.explain.md').predict(text) result_df ```
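The `fullAnnotate` call returns one Python dict per input text, keyed by the pipeline's output columns. A self-contained sketch of consuming that structure (the dict here is hand-built to mirror this pipeline's output for illustration, not produced by Spark NLP, and the values are reduced to plain strings rather than Annotation objects):

```python
# Hand-built stand-in for pipeline.fullAnnotate("Hallo van John Snow Labs! ")[0]
annotations = {
    "token": ["Hallo", "van", "John", "Snow", "Labs!"],
    "ner": ["O", "O", "B-PER", "I-PER", "I-PER"],
    "entities": ["John Snow Labs!"],
}

# Pair each token with its predicted NER tag.
tagged = list(zip(annotations["token"], annotations["ner"]))
print(tagged[2])                # ('John', 'B-PER')
print(annotations["entities"])  # ['John Snow Labs!']
```

In the real output each value is a list of `Annotation` objects, so the string payload lives in each annotation's `result` field.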
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:-------------------------------|:------------------------------|:------------------------------------------|:------------------------------------------|:--------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Hallo van John Snow Labs! '] | ['Hallo van John Snow Labs!'] | ['Hallo', 'van', 'John', 'Snow', 'Labs!'] | ['Hallo', 'van', 'John', 'Snow', 'Labs!'] | ['PROPN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.5910000205039978,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|nl| --- layout: model title: Fast Neural Machine Translation Model from Tigrinya to English author: John Snow Labs name: opus_mt_ti_en date: 2020-12-29 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ti, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `ti` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ti_en_xx_2.7.0_2.4_1609254492007.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ti_en_xx_2.7.0_2.4_1609254492007.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ti_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Text to translate.") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ti_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ti.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ti_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from xraychen) author: John Snow Labs name: bert_qa_mqa_baseline date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mqa-baseline` is an English model originally trained by `xraychen`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mqa_baseline_en_4.0.0_3.0_1654188319379.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mqa_baseline_en_4.0.0_3.0_1654188319379.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mqa_baseline","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_mqa_baseline","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.base.by_xraychen").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
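Conceptually, a span-extraction QA model like this one scores every context token as a possible answer start and a possible answer end, then returns the best-scoring valid (start ≤ end) pair. A toy, library-free sketch of that selection step; all scores here are invented for illustration and do not come from the model:

```python
# Context tokens with made-up start/end scores (higher = more likely).
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.2, 0.1, 4.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end_scores   = [0.1, 0.1, 0.1, 3.5, 0.2, 0.1, 0.1, 0.1, 0.4, 0.1]

# Pick the (start, end) pair with the highest combined score, start <= end.
best = max(
    ((s, e) for s in range(len(tokens)) for e in range(s, len(tokens))),
    key=lambda p: start_scores[p[0]] + end_scores[p[1]],
)
answer = " ".join(tokens[best[0] : best[1] + 1])
print(answer)  # Clara
```

The real annotator does this internally and exposes only the decoded span in the `answer` output column.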
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_mqa_baseline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/xraychen/mqa-baseline --- layout: model title: English XlmRoBertaForQuestionAnswering (from teacookies) author: John Snow Labs name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265898 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-more_fine_tune_24465520-26265898` is an English model originally trained by `teacookies`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265898_en_4.0.0_3.0_1655984453355.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265898_en_4.0.0_3.0_1655984453355.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265898","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265898","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265898").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265898| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|888.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265898 --- layout: model title: RoBERTa base biomedical author: ireneisdoomed name: roberta_base_biomedical date: 2022-01-13 tags: [es, open_source] task: Text Classification language: es edition: Spark NLP 3.4.0 spark_version: 3.0 supported: false article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model has been pulled from the HF Hub - https://huggingface.co/PlanTL-GOB-ES/roberta-base-biomedical-clinical-es This is a result of reproducing the tutorial for bringing HF's models into Spark NLP - https://medium.com/spark-nlp/importing-huggingface-models-into-sparknlp-8c63bdea671d ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ireneisdoomed/roberta_base_biomedical_es_3.4.0_3.0_1642093372752.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/ireneisdoomed/roberta_base_biomedical_es_3.4.0_3.0_1642093372752.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("term")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") roberta_embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es", "@ireneisdoomed")\ .setInputCols(["document", "token"])\ .setOutputCol("roberta_embeddings") pipeline = Pipeline(stages = [ documentAssembler, tokenizer, roberta_embeddings]) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.roberta_base_biomedical").predict("""Put your text here.""") ```
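The `roberta_embeddings` column holds one vector per token. A common downstream use of such embeddings is comparing them with cosine similarity. A minimal, dependency-free sketch; the 4-dimensional vectors below are made up for illustration and stand in for the model's real 768-dimensional outputs:

```python
import math

# Toy stand-ins for token embedding vectors (values invented).
vec_gripe = [0.2, 0.9, 0.1, 0.4]   # "gripe" (flu)
vec_fiebre = [0.1, 0.8, 0.2, 0.5]  # "fiebre" (fever)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim = cosine(vec_gripe, vec_fiebre)
print(round(sim, 3))
```

Related clinical terms should score close to 1.0; unrelated ones closer to 0.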
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_biomedical| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Community| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|es| |Size:|301.7 MB| --- layout: model title: Extract entities in clinical trial abstracts author: John Snow Labs name: ner_clinical_trials_abstracts date: 2022-06-22 tags: [ner, clinical, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Named Entity Recognition model uses a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. It extracts relevant entities from clinical trial abstracts. It uses a simplified version of the ontology specified by [Sanchez Graillet, O., et al.](https://pub.uni-bielefeld.de/record/2939477) in order to extract concepts related to trial design, diseases, drugs, population, statistics and publication. ## Predicted Entities `Age`, `AllocationRatio`, `Author`, `BioAndMedicalUnit`, `CTAnalysisApproach`, `CTDesign`, `Confidence`, `Country`, `DisorderOrSyndrome`, `DoseValue`, `Drug`, `DrugTime`, `Duration`, `Journal`, `NumberPatients`, `PMID`, `PValue`, `PercentagePatients`, `PublicationYear`, `TimePoint`, `Value` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_trials_abstracts_en_3.5.3_3.0_1655911616789.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_trials_abstracts_en_3.5.3_3.0_1655911616789.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical" ,"en", "clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical_trials_abstracts", "en", "clinical/models")\ .setInputCols(["sentence","token", "embeddings"])\ .setOutputCol("ner") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner]) text = ["A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes. 
In a multicentre, open, randomised study, 570 patients with Type 2 diabetes, aged 34 - 80 years, were treated for 52 weeks with insulin glargine or NPH insulin given once daily at bedtime."] data = spark.createDataFrame([text]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical_trials_abstracts", "en", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner)) val text = "A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes. In a multicentre, open, randomised study, 570 patients with Type 2 diabetes, aged 34 - 80 years, were treated for 52 weeks with insulin glargine or NPH insulin given once daily at bedtime." val data = Seq(text).toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.ner.clinical_trials_abstracts").predict("""A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes. In a multicentre, open, randomised study, 570 patients with Type 2 diabetes, aged 34 - 80 years, were treated for 52 weeks with insulin glargine or NPH insulin given once daily at bedtime.""") ```
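The pipeline above stops at token-level BIO tags; in a fuller pipeline a `NerConverter` stage merges those tags into entity chunks. A dependency-free sketch of that merging logic, applied to a few tokens and tags of the kind this model produces:

```python
def bio_to_chunks(tokens, tags):
    """Merge token-level BIO tags into (chunk_text, label) pairs."""
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = (tag[2:], [token])          # start a new chunk
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)              # continue the open chunk
        else:                                     # "O" or inconsistent I- tag
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(" ".join(toks), label) for label, toks in chunks]

tokens = ["insulin", "glargine", "with", "NPH", "insulin"]
tags = ["B-Drug", "I-Drug", "O", "B-Drug", "I-Drug"]
print(bio_to_chunks(tokens, tags))
# [('insulin glargine', 'Drug'), ('NPH insulin', 'Drug')]
```

This is only an illustration of the idea; in Spark NLP you would add the `NerConverter` annotator to the pipeline rather than post-process by hand.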
## Results ```bash +-----------+--------------------+ | token| ner_label| +-----------+--------------------+ | A| O| | one-year| O| | ,| O| | randomised| B-CTDesign| | ,| O| |multicentre| B-CTDesign| | trial| O| | comparing| O| | insulin| B-Drug| | glargine| I-Drug| | with| O| | NPH| B-Drug| | insulin| I-Drug| | in| O| |combination| O| | with| O| | oral| O| | agents| O| | in| O| | patients| O| | with| O| | type|B-DisorderOrSyndrome| | 2|I-DisorderOrSyndrome| | diabetes|I-DisorderOrSyndrome| | .| O| | In| O| | a| O| |multicentre| B-CTDesign| | ,| O| | open| B-CTDesign| | ,| O| | randomised| B-CTDesign| | study| O| | ,| O| | 570| B-NumberPatients| | patients| O| | with| O| | Type|B-DisorderOrSyndrome| | 2|I-DisorderOrSyndrome| | diabetes|I-DisorderOrSyndrome| | ,| O| | aged| O| | 34| B-Age| | -| O| | 80| B-Age| | years| O| | ,| O| | were| O| | treated| O| | for| O| | 52| B-Duration| | weeks| I-Duration| | with| O| | insulin| B-Drug| | glargine| I-Drug| | or| O| | NPH| B-Drug| | insulin| I-Drug| | given| O| | once| B-DrugTime| | daily| I-DrugTime| | at| O| | bedtime| B-DrugTime| | .| O| +-----------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical_trials_abstracts| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|14.7 MB| ## References - [Sanchez Graillet O, Cimiano P, Witte C, Ell B. C-TrO: an ontology for summarization and aggregation of the level of evidence in clinical trials. In: Proceedings of the Workshop Ontologies and Data in Life Sciences (ODLS 2019) in the Joint Ontology Workshops' (JOWO 2019). 
2019.](https://pub.uni-bielefeld.de/record/2939477) ## Benchmarking ```bash label precision recall f1-score support Age 0.88 0.61 0.72 38 AllocationRatio 1.00 1.00 1.00 24 Author 0.93 0.92 0.92 789 BioAndMedicalUnit 0.95 0.94 0.95 785 CTAnalysisApproach 1.00 0.87 0.93 23 CTDesign 0.91 0.95 0.93 410 Confidence 0.95 0.95 0.95 899 Country 0.94 0.86 0.90 123 DisorderOrSyndrome 0.99 0.98 0.99 568 DoseValue 0.96 0.97 0.97 263 Drug 0.96 0.95 0.96 1290 DrugTime 0.97 0.85 0.91 377 Duration 0.89 0.86 0.88 271 Journal 0.95 0.93 0.94 175 NumberPatients 0.95 0.94 0.94 173 O 0.98 0.98 0.98 21613 PMID 1.00 1.00 1.00 55 PValue 0.97 0.99 0.98 654 PercentagePatients 0.92 0.92 0.92 235 PublicationYear 0.86 0.96 0.91 57 TimePoint 0.85 0.75 0.80 514 Value 0.94 0.94 0.94 1195 accuracy - - 0.97 30531 macro-avg 0.94 0.91 0.93 30531 weighted-avg 0.97 0.97 0.97 30531 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from hyan97) author: John Snow Labs name: distilbert_qa_hyan97_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `hyan97`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_hyan97_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771413196.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_hyan97_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771413196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hyan97_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(False) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hyan97_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_hyan97_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/hyan97/distilbert-base-uncased-finetuned-squad --- layout: model title: Pipeline to Detect Drug Information author: John Snow Labs name: ner_posology_large_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, drug, en] task: [Named Entity Recognition, Pipeline Healthcare] language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_posology_large](https://nlp.johnsnowlabs.com/2021/03/31/ner_posology_large_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_POSOLOGY.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_large_pipeline_en_3.4.1_3.0_1647873112906.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_large_pipeline_en_3.4.1_3.0_1647873112906.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_posology_large_pipeline", "en", "clinical/models") pipeline.fullAnnotate("The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. 
daily, and Bactrim DS b.i.d.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_posology_large_pipeline", "en", "clinical/models") pipeline.fullAnnotate("The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. 
daily, and Bactrim DS b.i.d.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posoloy_large.pipeline").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""") ```
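The pipeline returns entity chunks in document order; a common post-processing step (not performed by the pipeline itself) is attaching each DRUG chunk to the attribute chunks that follow it. A self-contained sketch over a few (chunk, label) pairs of the kind this pipeline produces:

```python
# A few (chunk_text, label) pairs as this pipeline would emit them, in order.
chunks = [
    ("Fragmin", "DRUG"), ("5000 units", "DOSAGE"),
    ("subcutaneously", "ROUTE"), ("daily", "FREQUENCY"),
    ("OxyContin", "DRUG"), ("30 mg", "STRENGTH"),
    ("p.o", "ROUTE"), ("q.12 h", "FREQUENCY"),
]

# Group each drug with the attributes that follow it, up to the next drug.
drugs = {}
current = None
for text, label in chunks:
    if label == "DRUG":
        current = text
        drugs[current] = {}
    elif current:
        drugs[current][label] = text

print(drugs["Fragmin"])
# {'DOSAGE': '5000 units', 'ROUTE': 'subcutaneously', 'FREQUENCY': 'daily'}
```

This proximity heuristic is only illustrative; for robust drug-attribute linking, Healthcare NLP's relation extraction models are the supported route.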
## Results ```bash +--------------+---------+ |chunk |ner | +--------------+---------+ |insulin |DRUG | |Bactrim |DRUG | |for 14 days |DURATION | |Fragmin |DRUG | |5000 units |DOSAGE | |subcutaneously|ROUTE | |daily |FREQUENCY| |Xenaderm |DRUG | |topically |ROUTE | |b.i.d |FREQUENCY| |Lantus |DRUG | |40 units |DOSAGE | |subcutaneously|ROUTE | |at bedtime |FREQUENCY| |OxyContin |DRUG | |30 mg |STRENGTH | |p.o |ROUTE | |q.12 h |FREQUENCY| |folic acid |DRUG | |1 mg |STRENGTH | +--------------+---------+ only showing top 20 rows ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_large_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Explain Document Pipeline for French author: John Snow Labs name: explain_document_md date: 2021-03-22 tags: [open_source, french, explain_document_md, pipeline, fr] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: fr edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_md is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps. 
It performs most of the common text-processing tasks on your DataFrame. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_fr_3.0.0_3.0_1616429735046.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_md_fr_3.0.0_3.0_1616429735046.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_md', lang = 'fr') annotations = pipeline.fullAnnotate("Bonjour de John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_md", lang = "fr") val result = pipeline.fullAnnotate("Bonjour de John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Bonjour de John Snow Labs! "] result_df = nlu.load('fr.explain.md').predict(text) result_df ```
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:--------------------------------|:-------------------------------|:-------------------------------------------|:-------------------------------------------|:-------------------------------------------|:-----------------------------|:-------------------------------------------|:-------------------------------| | 0 | ['Bonjour de John Snow Labs! '] | ['Bonjour de John Snow Labs!'] | ['Bonjour', 'de', 'John', 'Snow', 'Labs!'] | ['Bonjour', 'de', 'John', 'Snow', 'Labs!'] | ['INTJ', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.0783179998397827,.,...]] | ['I-MISC', 'O', 'I-PER', 'I-PER', 'I-PER'] | ['Bonjour', 'John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|fr| --- layout: model title: Word2Vec Embeddings in Yiddish (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, yi, open_source] task: Embeddings language: yi edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_yi_3.4.1_3.0_1647467653837.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_yi_3.4.1_3.0_1647467653837.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","yi") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["איך ליבע אָנצינדן נלפּ"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","yi") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("איך ליבע אָנצינדן נלפּ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("yi.embed.w2v_cc_300d").predict("""איך ליבע אָנצינדן נלפּ""") ```
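Each token in the `embeddings` output column carries one 300-dimensional vector, and downstream comparisons between such vectors are typically done with cosine similarity. A minimal, framework-free sketch of that comparison; the toy 4-dimensional vectors below use made-up values, not real model output:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-d stand-ins for the 300-d vectors in the `embeddings` column.
v_cat = [0.2, 0.8, 0.1, 0.4]
v_dog = [0.25, 0.75, 0.15, 0.35]
v_car = [0.9, 0.05, 0.7, 0.0]

print(round(cosine(v_cat, v_dog), 3))  # close to 1.0: related words
print(round(cosine(v_cat, v_car), 3))  # much lower: unrelated words
```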
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|yi| |Size:|114.5 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Japanese Lemmatizer author: John Snow Labs name: lemma date: 2021-01-15 task: Lemmatization language: ja edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [ja, lemmatizer, open_source] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/TEXT_PREPROCESSING/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TEXT_PREPROCESSING.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_ja_2.7.0_2.4_1610746691356.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_ja_2.7.0_2.4_1610746691356.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") word_segmenter = WordSegmenterModel.pretrained('wordseg_gsd_ud', 'ja')\ .setInputCols("document")\ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma", "ja") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, word_segmenter, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) results = light_pipeline.fullAnnotate(["これに不快感を示す住民はいましたが,現在,表立って反対や抗議の声を挙げている住民はいないようです。"]) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") .setInputCols("document") .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma", "ja") .setInputCols("token") .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter, lemmatizer)) val data = Seq("これに不快感を示す住民はいましたが,現在,表立って反対や抗議の声を挙げている住民はいないようです。").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["これに不快感を示す住民はいましたが,現在,表立って反対や抗議の声を挙げている住民はいないようです。"] lemma_df = nlu.load('ja.lemma').predict(text, output_level = "document") lemma_df.lemma.values[0] ```
## Results ```bash {'lemma': [Annotation(token, 0, 1, これ, {'sentence': '0'}), Annotation(token, 2, 2, にる, {'sentence': '0'}), Annotation(token, 3, 4, 不快, {'sentence': '0'}), Annotation(token, 5, 5, 感, {'sentence': '0'}), Annotation(token, 6, 6, を, {'sentence': '0'}), Annotation(token, 7, 8, 示す, {'sentence': '0'}), Annotation(token, 9, 10, 住民, {'sentence': '0'}), Annotation(token, 11, 11, はる, {'sentence': '0'}), Annotation(token, 12, 12, いる, {'sentence': '0'}), Annotation(token, 13, 14, まする, {'sentence': '0'}), Annotation(token, 15, 15, たる, {'sentence': '0'}), Annotation(token, 16, 16, がる, {'sentence': '0'}), Annotation(token, 17, 17, ,, {'sentence': '0'}), Annotation(token, 18, 19, 現在, {'sentence': '0'}), Annotation(token, 20, 20, ,, {'sentence': '0'}), Annotation(token, 21, 23, 表立つ, {'sentence': '0'}), Annotation(token, 24, 24, てる, {'sentence': '0'}), Annotation(token, 25, 26, 反対, {'sentence': '0'}), Annotation(token, 27, 27, やる, {'sentence': '0'}), Annotation(token, 28, 29, 抗議, {'sentence': '0'}), Annotation(token, 30, 30, のる, {'sentence': '0'}), Annotation(token, 31, 31, 声, {'sentence': '0'}), Annotation(token, 32, 32, を, {'sentence': '0'}), Annotation(token, 33, 34, 挙げる, {'sentence': '0'}), Annotation(token, 35, 35, てる, {'sentence': '0'}), Annotation(token, 36, 37, いる, {'sentence': '0'}), Annotation(token, 38, 39, 住民, {'sentence': '0'}), Annotation(token, 40, 40, はる, {'sentence': '0'}), Annotation(token, 41, 41, いる, {'sentence': '0'}), Annotation(token, 42, 43, なぐ, {'sentence': '0'}), Annotation(token, 44, 45, よう, {'sentence': '0'}), Annotation(token, 46, 47, です, {'sentence': '0'}), Annotation(token, 48, 48, 。, {'sentence': '0'})]} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[token]| |Language:|ja| ## Data Source The model was trained using the universal dependencies data set version 2 and the _IPADIC_ dictionary from 
[Mecab](https://taku910.github.io/mecab/). References: > - Asahara, M., Kanayama, H., Tanaka, T., Miyao, Y., Uematsu, S., Mori, S., Matsumoto, Y., Omura, M., & Murawaki, Y. (2018). Universal Dependencies Version 2 for Japanese. In LREC-2018. --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_kv16 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-kv16` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_kv16_en_4.3.0_3.0_1675121314422.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_kv16_en_4.3.0_3.0_1675121314422.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_kv16","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_kv16","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
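T5 casts every task as text-to-text, so input strings conventionally carry a task prefix (Spark NLP's `T5Transformer` also exposes a `setTask` parameter for this). A minimal, framework-free sketch of that convention; the prefix and sentence below are made-up examples, not output of this model:

```python
def with_task(prefix, texts):
    """Prepend a T5 task prefix to each input string (text-to-text convention)."""
    sep = "" if prefix.endswith(" ") else " "
    return [f"{prefix}{sep}{t}" for t in texts]

# Hypothetical summarization inputs in T5's prefixed format.
inputs = with_task("summarize:", ["Spark NLP runs transformer models at scale."])
print(inputs[0])  # "summarize: Spark NLP runs transformer models at scale."
```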
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_kv16| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|120.7 MB| ## References - https://huggingface.co/google/t5-efficient-small-kv16 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Lewotobi RobertaForQuestionAnswering (from 21iridescent) author: John Snow Labs name: roberta_qa_RoBERTa_base_finetuned_squad2_lwt date: 2022-06-20 tags: [open_source, question_answering, roberta] task: Question Answering language: lwt edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `RoBERTa-base-finetuned-squad2-lwt` is a Lewotobi model originally trained by `21iridescent`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_RoBERTa_base_finetuned_squad2_lwt_lwt_4.0.0_3.0_1655727062223.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_RoBERTa_base_finetuned_squad2_lwt_lwt_4.0.0_3.0_1655727062223.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_RoBERTa_base_finetuned_squad2_lwt","lwt") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_RoBERTa_base_finetuned_squad2_lwt","lwt") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("lwt.answer_question.squadv2.roberta.base.by_21iridescent").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
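Extractive QA models of this kind score each context token as a possible answer start and as a possible answer end, and the predicted answer is the span that maximizes the combined score. A framework-free sketch of that selection step; the tokens and per-token scores below are invented for illustration, not real model output:

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) pair maximizing start+end score, with end >= start."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Hypothetical scores; a real model derives these from question + context.
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.0, 0.1, 0.1, 0.3, 0.0]
end   = [0.0, 0.1, 0.2, 4.5, 0.1, 0.0, 0.1, 0.2, 0.4, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # prints "Clara"
```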
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_RoBERTa_base_finetuned_squad2_lwt| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|lwt| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/21iridescent/RoBERTa-base-finetuned-squad2-lwt --- layout: model title: Professions & Occupations NER model in Spanish (meddroprof_scielowiki) author: John Snow Labs name: meddroprof_scielowiki date: 2022-12-18 tags: [ner, licensed, professions, es, occupations] task: Named Entity Recognition language: es edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description NER model that detects professions and occupations in Spanish texts. It was trained with the `embeddings_scielowiki_300d` embeddings, so the same `WordEmbeddingsModel` is required in the pipeline. ## Predicted Entities `ACTIVIDAD`, `PROFESION`, `SITUACION_LABORAL` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_PROFESSIONS_ES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_PROFESSIONS_ES.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/meddroprof_scielowiki_es_4.2.2_3.0_1671367707210.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/meddroprof_scielowiki_es_4.2.2_3.0_1671367707210.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence = SentenceDetector() \ .setInputCols("document") \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d", "es", "clinical/models")\ .setInputCols(["document", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("meddroprof_scielowiki", "es", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence, tokenizer, word_embeddings, clinical_ner, ner_converter]) sample_text = """La paciente es la mayor de 2 hermanos, tiene un hermano de 13 años estudiando 1o ESO. Sus padres son ambos ATS , trabajan en diferentes centros de salud estudiando 1o ESO""" df = spark.createDataFrame([[sample_text]]).toDF("text") result = pipeline.fit(df).transform(df) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d", "es", "clinical/models") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("meddroprof_scielowiki", "es", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence, 
tokenizer, word_embeddings, clinical_ner, ner_converter)) val data = Seq("""La paciente es la mayor de 2 hermanos, tiene un hermano de 13 años estudiando 1o ESO. Sus padres son ambos ATS , trabajan en diferentes centros de salud estudiando 1o ESO""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.scielowiki").predict("""La paciente es la mayor de 2 hermanos, tiene un hermano de 13 años estudiando 1o ESO. Sus padres son ambos ATS , trabajan en diferentes centros de salud estudiando 1o ESO""") ```
## Results ```bash +---------------------------------------+-----------------+ |chunk |ner_label | +---------------------------------------+-----------------+ |estudiando 1o ESO |SITUACION_LABORAL| |ATS |PROFESION | |trabajan en diferentes centros de salud|PROFESION | |estudiando 1o ESO |SITUACION_LABORAL| +---------------------------------------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|meddroprof_scielowiki| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|14.8 MB| ## References The model was trained with the [MEDDOPROF](https://temu.bsc.es/meddoprof/data/) data set: > The MEDDOPROF corpus is a collection of 1844 clinical cases from over 20 different specialties annotated with professions and employment statuses. The corpus was annotated by a team composed of linguists and clinical experts following specially prepared annotation guidelines, after several cycles of quality control and annotation consistency analysis before annotating the entire dataset. Figure 1 shows a screenshot of a sample manual annotation generated using the brat annotation tool. 
Reference: ``` @article{meddoprof, title={NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2022 on automatic recognition, classification and normalization of professions and occupations from medical texts}, author={Lima-López, Salvador and Farré-Maduell, Eulàlia and Miranda-Escalada, Antonio and Brivá-Iglesias, Vicent and Krallinger, Martin}, journal = {Procesamiento del Lenguaje Natural}, volume = {67}, year={2022} } ``` ## Benchmarking ```bash label precision recall f1-score support B-ACTIVIDAD 0.82 0.36 0.50 25 B-PROFESION 0.87 0.75 0.81 634 B-SITUACION_LABORAL 0.79 0.67 0.72 310 I-ACTIVIDAD 0.86 0.43 0.57 58 I-PROFESION 0.87 0.80 0.83 944 I-SITUACION_LABORAL 0.74 0.71 0.73 407 O 1.00 1.00 1.00 139880 accuracy - - 0.99 142258 macro-avg 0.85 0.67 0.74 142258 weighted-avg 0.99 0.99 0.99 142258 ``` --- layout: model title: English RobertaForQuestionAnswering (from veronica320) author: John Snow Labs name: roberta_qa_QA_for_Event_Extraction date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `QA-for-Event-Extraction` is an English model originally trained by `veronica320`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_QA_for_Event_Extraction_en_4.0.0_3.0_1655726863853.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_QA_for_Event_Extraction_en_4.0.0_3.0_1655726863853.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_QA_for_Event_Extraction","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_QA_for_Event_Extraction","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.by_veronica320").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_QA_for_Event_Extraction| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/veronica320/QA-for-Event-Extraction - https://aclanthology.org/2021.acl-short.42/ - https://github.com/veronica320/Zeroshot-Event-Extraction - https://github.com/uwnlp/qamr --- layout: model title: Slovak BertForMaskedLM Cased model (from fav-kky) author: John Snow Labs name: bert_embeddings_fernet_cc date: 2022-12-02 tags: [sk, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: sk edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `FERNET-CC_sk` is a Slovak model originally trained by `fav-kky`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_fernet_cc_sk_4.2.4_3.0_1670015248677.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_fernet_cc_sk_4.2.4_3.0_1670015248677.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_fernet_cc","sk") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_fernet_cc","sk") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
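The annotator emits one vector per token in the `embeddings` column; when a single sentence-level vector is needed, a common and simple approach is mean pooling over the token vectors. A framework-free sketch, with toy 3-dimensional vectors standing in for the model's real, much higher-dimensional token embeddings:

```python
def mean_pool(token_vectors):
    """Average per-token embedding vectors into one sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

# Toy stand-ins for two token embeddings of one sentence.
vecs = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
print(mean_pool(vecs))  # prints [2.0, 2.0, 1.0]
```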
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_fernet_cc| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|sk| |Size:|612.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/fav-kky/FERNET-CC_sk - https://arxiv.org/abs/2107.10042 --- layout: model title: Legal NER for NDA (Confidential Information-Restricted) author: John Snow Labs name: legner_nda_confidential_information_restricted date: 2023-04-11 tags: [en, legal, licensed, ner, nda] task: Named Entity Recognition language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a NER model, aimed to be run **only** after detecting the `USE_OF_CONF_INFO ` clause with a proper classifier (use legmulticlf_mnda_sections_paragraph_other for that purpose). It will extract the following entities: `RESTRICTED_ACTION`, `RESTRICTED_SUBJECT`, `RESTRICTED_OBJECT`, and `RESTRICTED_IND_OBJECT`. ## Predicted Entities `RESTRICTED_ACTION`, `RESTRICTED_SUBJECT`, `RESTRICTED_OBJECT`, `RESTRICTED_IND_OBJECT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_nda_confidential_information_restricted_en_1.0.0_3.0_1681210372591.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_nda_confidential_information_restricted_en_1.0.0_3.0_1681210372591.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_nda_confidential_information_restricted", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""The recipient may use the proprietary information solely for the purpose of performing its obligations under a separate agreement with the disclosing party, and may not disclose such information to any third party without the prior written consent of the disclosing party."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
## Results ```bash +-----------+---------------------+ |chunk |ner_label | +-----------+---------------------+ |recipient |RESTRICTED_SUBJECT | |disclose |RESTRICTED_ACTION | |information|RESTRICTED_OBJECT | |third party|RESTRICTED_IND_OBJECT| +-----------+---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_nda_confidential_information_restricted| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References In-house annotations on the Non-disclosure Agreements ## Benchmarking ```bash label precision recall f1-score support RESTRICTED_ACTION 0.92 0.94 0.93 36 RESTRICTED_IND_OBJECT 1.00 0.93 0.97 15 RESTRICTED_OBJECT 0.74 1.00 0.85 26 RESTRICTED_SUBJECT 0.72 0.90 0.80 29 micro-avg 0.82 0.94 0.88 106 macro-avg 0.85 0.94 0.89 106 weighted-avg 0.83 0.94 0.88 106 ``` --- layout: model title: Financial NER (Signers) author: John Snow Labs name: finner_signers date: 2023-02-24 tags: [signers, parties, en, licensed] task: Named Entity Recognition language: en edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: FinanceNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Legal NER Model, aimed to process the last page of the agreements when information can be found about: - People Signing the document; - Title of those people in their companies; - Company (Party) they represent; ## Predicted Entities `SIGNING_TITLE`, `SIGNING_PERSON`, `PARTY` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/LEGALNER_SIGNERS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_signers_en_1.0.0_3.0_1677258652137.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_signers_en_1.0.0_3.0_1677258652137.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = finance.NerModel.pretrained('finner_signers', 'en', 'finance/models')\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = """ VENDOR: VENDINGDATA CORPORATION, a Nevada corporation By: /s/ Steven J. Blad Its: Steven J. Blad CEO DISTRIBUTOR: TECHNICAL CASINO SUPPLIES LTD, an English company By: /s/ David K. Heap Its: David K. Heap Chief Executive Officer -15-""" res = model.transform(spark.createDataFrame([[text]]).toDF("text")) ```
## Results ```bash +-----------+----------------+ | token| ner_label| +-----------+----------------+ | VENDOR| O| | :| O| |VENDINGDATA| B-PARTY| |CORPORATION| I-PARTY| | ,| I-PARTY| | a| O| | Nevada| O| |corporation| O| | By| O| | :| O| | /s/| O| | Steven|B-SIGNING_PERSON| | J|I-SIGNING_PERSON| | .|I-SIGNING_PERSON| | Blad|I-SIGNING_PERSON| | Its| O| | :| O| | Steven|B-SIGNING_PERSON| | J|I-SIGNING_PERSON| | .|I-SIGNING_PERSON| | Blad|I-SIGNING_PERSON| | CEO| B-SIGNING_TITLE| |DISTRIBUTOR| O| | :| O| | TECHNICAL| B-PARTY| | CASINO| I-PARTY| | SUPPLIES| I-PARTY| | LTD| I-PARTY| | ,| I-PARTY| | an| O| | English| O| | company| O| | By| O| | :| O| | /s/| O| | David|B-SIGNING_PERSON| | K|I-SIGNING_PERSON| | .|I-SIGNING_PERSON| | Heap|I-SIGNING_PERSON| | Its| O| | :| O| | David|B-SIGNING_PERSON| | K|I-SIGNING_PERSON| | .|I-SIGNING_PERSON| | Heap|I-SIGNING_PERSON| | Chief| B-SIGNING_TITLE| | Executive| I-SIGNING_TITLE| | Officer| I-SIGNING_TITLE| | -| O| | 15| O| | -| O| +-----------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_signers| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.4 MB| ## References Manual annotations on CUAD dataset and data augmentation ## Benchmarking ```bash label tp fp fn prec rec f1 I-PARTY 366 26 39 0.93367344 0.9037037 0.91844416 I-SIGNING_TITLE 41 0 4 1.0 0.9111111 0.95348835 I-SIGNING_PERSON 115 10 13 0.92 0.8984375 0.9090909 B-SIGNING_PERSON 46 3 11 0.93877554 0.80701756 0.8679246 B-PARTY 122 14 28 0.89705884 0.81333333 0.85314685 B-SIGNING_TITLE 26 0 2 1.0 0.9285714 0.9629629 Macro-average 716 53 97 0.9482513 0.8770291 0.91125065 Micro-average 716 53 97 0.9310793 0.8806888 0.9051833 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18) author: John Snow Labs name: distilbert_qa_base_uncased_becas_1 date: 
2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-becas-1` is an English model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_1_en_4.3.0_3.0_1672767422925.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_1_en_4.3.0_3.0_1672767422925.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
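Under the hood, an extractive QA head scores each context token as a candidate answer start or end and returns the best-scoring span. A simplified, hypothetical sketch of that span selection (the scores are made up for illustration; this is not the actual Spark NLP code):

```python
# Pick the (start, end) pair with the highest combined score, with start <= end
# and a cap on answer length -- the standard decoding step for extractive QA.
def best_span(start_scores, end_scores, max_len=15):
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley"]
start = [0.1, 0.2, 0.1, 3.0, 0.1, 0.1, 0.1, 0.1, 0.5]  # illustrative scores
end = [0.1, 0.1, 0.1, 2.5, 0.1, 0.1, 0.1, 0.1, 0.9]
s, e = best_span(start, end)
print(" ".join(context[s:e + 1]))  # -> Clara
```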
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_becas_1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Evelyn18/distilbert-base-uncased-becas-1 --- layout: model title: Pipeline to Extract Cancer Therapies and Posology Information author: John Snow Labs name: ner_oncology_unspecific_posology_healthcare_pipeline date: 2023-03-08 tags: [licensed, clinical, oncology, en, ner, treatment, posology] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_oncology_unspecific_posology_healthcare](https://nlp.johnsnowlabs.com/2023/01/11/ner_oncology_unspecific_posology_healthcare_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_unspecific_posology_healthcare_pipeline_en_4.3.0_3.2_1678269380685.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_unspecific_posology_healthcare_pipeline_en_4.3.0_3.2_1678269380685.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_oncology_unspecific_posology_healthcare_pipeline", "en", "clinical/models") text = "The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving her second cycle of chemotherapy and is in good overall condition. " result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_oncology_unspecific_posology_healthcare_pipeline", "en", "clinical/models") val text = "The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving her second cycle of chemotherapy and is in good overall condition. " val result = pipeline.fullAnnotate(text) ```
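`fullAnnotate` returns chunks with inclusive character offsets into the input text. A small sketch of recovering a chunk's surface text from its `begin`/`end` values (the text and offsets here are illustrative):

```python
# Spark NLP annotations use inclusive offsets, so slicing needs end + 1.
text = "The patient underwent a regimen consisting of adriamycin (60 mg/m2)"
begin, end = 46, 55  # inclusive character offsets, as in the annotation
chunk = text[begin:end + 1]
print(chunk)  # -> adriamycin
```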
## Results ```bash | | chunks | begin | end | entities | confidence | |---:|:-----------------|--------:|------:|:---------------------|-------------:| | 0 | adriamycin | 46 | 55 | Cancer_Therapy | 0.9999 | | 1 | 60 mg/m2 | 58 | 65 | Posology_Information | 0.807 | | 2 | cyclophosphamide | 72 | 87 | Cancer_Therapy | 0.9998 | | 3 | 600 mg/m2 | 90 | 98 | Posology_Information | 0.9566 | | 4 | over six courses | 101 | 116 | Posology_Information | 0.689833 | | 5 | second cycle | 150 | 161 | Posology_Information | 0.9906 | | 6 | chemotherapy | 166 | 177 | Cancer_Therapy | 0.9997 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_unspecific_posology_healthcare_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|533.1 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Korean RoBERTa Embeddings (from lassl) author: John Snow Labs name: roberta_embeddings_roberta_ko_small date: 2022-04-14 tags: [roberta, embeddings, ko, open_source] task: Embeddings language: ko edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-ko-small` is a Korean model originally trained by `lassl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_ko_small_ko_3.4.2_3.0_1649947873838.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_ko_small_ko_3.4.2_3.0_1649947873838.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_ko_small","ko") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["나는 Spark NLP를 좋아합니다"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_ko_small","ko") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("나는 Spark NLP를 좋아합니다").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ko.embed.roberta_ko_small").predict("""나는 Spark NLP를 좋아합니다""") ```
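Embeddings like those produced above are typically compared with cosine similarity. A self-contained sketch of that comparison (plain Python, independent of Spark NLP; the vectors are toy values, not real model output):

```python
import math

# Cosine similarity: dot product of the vectors divided by the product of
# their norms. Identical directions give 1.0, orthogonal vectors give 0.0.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [0.2, 0.1, 0.4]
v2 = [0.2, 0.1, 0.4]
v3 = [-0.4, 0.0, 0.2]
print(cosine(v1, v2))  # identical vectors -> ~1.0
print(cosine(v1, v3))  # orthogonal vectors -> ~0.0
```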
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_roberta_ko_small| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ko| |Size:|87.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/lassl/roberta-ko-small - https://github.com/lassl/lassl --- layout: model title: Named Entity Recognition - ELECTRA Large (OntoNotes) author: John Snow Labs name: onto_electra_large_uncased date: 2020-12-05 task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [ner, en, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Onto is a Named Entity Recognition (or NER) model trained on OntoNotes 5.0. It can extract up to 18 entities such as people, places, organizations, money, time, date, etc. This model uses the pretrained `electra_large_uncased` embeddings model from the `BertEmbeddings` annotator as an input. ## Predicted Entities `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_electra_large_uncased_en_2.7.0_2.4_1607198670231.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_electra_large_uncased_en_2.7.0_2.4_1607198670231.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("electra_large_uncased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") ner_onto = NerDLModel.pretrained("onto_electra_large_uncased", "en") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."]], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("electra_large_uncased", "en") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_onto = NerDLModel.pretrained("onto_electra_large_uncased", "en") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter)) val data = Seq("William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."""] ner_df = nlu.load('en.ner.onto.electra.uncased_large').predict(text, output_level='chunk') ner_df[["entities", "entities_class"]] ```
{:.h2_title} ## Results ```bash +---------------------+---------+ |chunk |ner_label| +---------------------+---------+ |William Henry Gates |PERSON | |October 28, 1955 |DATE | |American |NORP | |Microsoft Corporation|ORG | |Microsoft |ORG | |Gates |PERSON | |CEO |ORG | |May 2014 |DATE | |one |CARDINAL | |the 1970s and 1980s |DATE | |Seattle |GPE | |Washington |GPE | |Gates |PERSON | |Microsoft |FAC | |Paul Allen |PERSON | |1975 |DATE | |Albuquerque |GPE | |New Mexico |GPE | |Gates |PERSON | |January 2000 |DATE | +---------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_electra_large_uncased| |Type:|ner| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source The model is trained based on data from OntoNotes 5.0 [https://catalog.ldc.upenn.edu/LDC2013T19](https://catalog.ldc.upenn.edu/LDC2013T19) ## Benchmarking ```bash Micro-average: prec: 0.88980144, rec: 0.88069624, f1: 0.8852254 CoNLL Eval: processed 152728 tokens with 11257 phrases; found: 11227 phrases; correct: 9876. 
accuracy: 97.64%; 9876 11257 11227 precision: 87.97%; recall: 87.73%; FB1: 87.85 CARDINAL: 789 935 937 precision: 84.20%; recall: 84.39%; FB1: 84.29 937 DATE: 1399 1602 1640 precision: 85.30%; recall: 87.33%; FB1: 86.30 1640 EVENT: 30 63 43 precision: 69.77%; recall: 47.62%; FB1: 56.60 43 FAC: 72 135 115 precision: 62.61%; recall: 53.33%; FB1: 57.60 115 GPE: 2131 2240 2252 precision: 94.63%; recall: 95.13%; FB1: 94.88 2252 LANGUAGE: 8 22 9 precision: 88.89%; recall: 36.36%; FB1: 51.61 9 LAW: 20 40 31 precision: 64.52%; recall: 50.00%; FB1: 56.34 31 LOC: 123 179 202 precision: 60.89%; recall: 68.72%; FB1: 64.57 202 MONEY: 286 314 321 precision: 89.10%; recall: 91.08%; FB1: 90.08 321 NORP: 803 841 918 precision: 87.47%; recall: 95.48%; FB1: 91.30 918 ORDINAL: 177 195 218 precision: 81.19%; recall: 90.77%; FB1: 85.71 218 ORG: 1502 1795 1687 precision: 89.03%; recall: 83.68%; FB1: 86.27 1687 PERCENT: 306 349 344 precision: 88.95%; recall: 87.68%; FB1: 88.31 344 PERSON: 1887 1988 2020 precision: 93.42%; recall: 94.92%; FB1: 94.16 2020 PRODUCT: 48 76 62 precision: 77.42%; recall: 63.16%; FB1: 69.57 62 QUANTITY: 85 105 111 precision: 76.58%; recall: 80.95%; FB1: 78.70 111 TIME: 128 212 190 precision: 67.37%; recall: 60.38%; FB1: 63.68 190 WORK_OF_ART: 82 166 127 precision: 64.57%; recall: 49.40%; FB1: 55.97 127 ``` --- layout: model title: English asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8 TFWav2Vec2ForCTC from lilitket author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark 
NLP. `asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8` is an English model originally trained by lilitket. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8_en_4.2.0_3.0_1664121225531.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8_en_4.2.0_3.0_1664121225531.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8", lang = "en") val annotations = pipeline.transform(audioDF) ```
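The pipeline expects `audioDF` rows to carry raw audio as an array of floats. A minimal stdlib-only sketch of decoding 16-bit PCM WAV data into that normalized float representation (the helper name is ours, not a Spark NLP API):

```python
import struct
import wave

# Convert a 16-bit PCM WAV file into the normalized float samples that an
# audio content column would carry.
def wav_to_floats(path):
    """Read a 16-bit PCM WAV file and scale samples to [-1.0, 1.0)."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "sketch handles 16-bit PCM only"
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]  # scale int16 to [-1.0, 1.0)
```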
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Norwegian BertForMaskedLM Cased model (from ltgoslo) author: John Snow Labs name: bert_embeddings_norbert date: 2022-12-06 tags: ["no", open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: "no" edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `norbert` is a Norwegian model originally trained by `ltgoslo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_norbert_no_4.2.4_3.0_1670326996300.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_norbert_no_4.2.4_3.0_1670326996300.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_norbert","no") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_norbert","no") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_norbert| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|no| |Size:|417.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/ltgoslo/norbert - http://vectors.nlpl.eu/repository/20/216.zip - http://norlm.nlpl.eu/ - https://github.com/ltgoslo/NorBERT - https://arxiv.org/abs/2104.06546 - https://www.eosc-nordic.eu/ - https://www.mn.uio.no/ifi/english/research/projects/sant/index.html - https://www.mn.uio.no/ifi/english/research/groups/ltg/ --- layout: model title: Detect Oncology-Specific Entities author: John Snow Labs name: ner_oncology_wip date: 2022-09-30 tags: [licensed, clinical, oncology, en, ner, biomarker, treatment] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts more than 40 oncology-related entities, including therapies, tests and staging. Definitions of Predicted Entities: - `Adenopathy`: Mentions of pathological findings of the lymph nodes. - `Age`: All mentions of ages, past or present, related to the patient or to anybody else. - `Biomarker`: Biological molecules that indicate the presence or absence of cancer, or the type of cancer. Oncogenes are excluded from this category. - `Biomarker_Result`: Terms or values that are identified as the result of a biomarker. - `Cancer_Dx`: Mentions of cancer diagnoses (such as "breast cancer") or pathological types that are usually used as synonyms for "cancer" (e.g. "carcinoma"). When anatomical references are present, they are included in the Cancer_Dx extraction. - `Cancer_Score`: Clinical or imaging scores that are specific to cancer settings (e.g. "BI-RADS" or "Allred score"). 
- `Cancer_Surgery`: Terms that indicate surgery as a form of cancer treatment. - `Chemotherapy`: Mentions of chemotherapy drugs, or unspecific words such as "chemotherapy". - `Cycle_Count`: The total number of cycles of an oncological therapy being administered (e.g. "5 cycles"). - `Cycle_Day`: References to the day of the cycle of oncological therapy (e.g. "day 5"). - `Cycle_Number`: The number of the cycle of an oncological therapy that is being applied (e.g. "third cycle"). - `Date`: Mentions of exact dates, in any format, including day number, month and/or year. - `Death_Entity`: Words that indicate the death of the patient or someone else (including family members), such as "died" or "passed away". - `Direction`: Directional and laterality terms, such as "left", "right", "bilateral", "upper" and "lower". - `Dosage`: The quantity prescribed by the physician for an active ingredient. - `Duration`: Words indicating the duration of a treatment (e.g. "for 2 weeks"). - `Frequency`: Words indicating the frequency of treatment administration (e.g. "daily" or "bid"). - `Gender`: Gender-specific nouns and pronouns (including words such as "him" or "she", and family members such as "father"). - `Grade`: All pathological grading of tumors (e.g. "grade 1") or degrees of cellular differentiation (e.g. "well-differentiated"). - `Histological_Type`: Histological variants or cancer subtypes, such as "papillary", "clear cell" or "medullary". - `Hormonal_Therapy`: Mentions of hormonal drugs used to treat cancer, or unspecific words such as "hormonal therapy". - `Imaging_Test`: Imaging tests mentioned in texts, such as "chest CT scan". - `Immunotherapy`: Mentions of immunotherapy drugs, or unspecific words such as "immunotherapy". - `Invasion`: Mentions that refer to tumor invasion, such as "invasion" or "involvement". Metastases or lymph node involvement are excluded from this category. - `Line_Of_Therapy`: Explicit references to the line of therapy of an oncological therapy (e.g. 
"first-line treatment"). - `Metastasis`: Terms that indicate a metastatic disease. Anatomical references are not included in these extractions. - `Oncogene`: Mentions of genes that are implicated in the etiology of cancer. - `Pathology_Result`: The findings of a biopsy from the pathology report that is not covered by another entity (e.g. "malignant ductal cells"). - `Pathology_Test`: Mentions of biopsies or tests that use tissue samples. - `Performance_Status`: Mentions of performance status scores, such as ECOG and Karnofsky. The name of the score is extracted together with the result (e.g. "ECOG performance status of 4"). - `Race_Ethnicity`: The race and ethnicity categories include racial and national origin or sociocultural groups. - `Radiotherapy`: Terms that indicate the use of Radiotherapy. - `Response_To_Treatment`: Terms related to clinical progress of the patient related to cancer treatment, including "recurrence", "bad response" or "improvement". - `Relative_Date`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "yesterday" or "three years later"). - `Route`: Words indicating the type of administration route (such as "PO" or "transdermal"). - `Site_Bone`: Anatomical terms that refer to the human skeleton. - `Site_Brain`: Anatomical terms that refer to the central nervous system (including the brain stem and the cerebellum). - `Site_Breast`: Anatomical terms that refer to the breasts. - `Site_Liver`: Anatomical terms that refer to the liver. - `Site_Lung`: Anatomical terms that refer to the lungs. - `Site_Lymph_Node`: Anatomical terms that refer to lymph nodes, excluding adenopathies. - `Site_Other_Body_Part`: Relevant anatomical terms that are not included in the rest of the anatomical entities. - `Smoking_Status`: All mentions of smoking related to the patient or to someone else. - `Staging`: Mentions of cancer stage such as "stage 2b" or "T2N1M0". 
It also includes words such as "in situ", "early-stage" or "advanced".
- `Targeted_Therapy`: Mentions of targeted therapy drugs, or unspecific words such as "targeted therapy".
- `Tumor_Finding`: All nonspecific terms that may be related to tumors, either malignant or benign (for example: "mass", "tumor", "lesion", or "neoplasm").
- `Tumor_Size`: Size of the tumor, including numerical value and unit of measurement (e.g. "3 cm").
- `Unspecific_Therapy`: Terms that indicate a known cancer therapy but are not specific to any other therapy entity (e.g. "chemoradiotherapy" or "adjuvant therapy").

## Predicted Entities

`Histological_Type`, `Direction`, `Staging`, `Cancer_Score`, `Imaging_Test`, `Cycle_Number`, `Tumor_Finding`, `Site_Lymph_Node`, `Invasion`, `Response_To_Treatment`, `Smoking_Status`, `Tumor_Size`, `Cycle_Count`, `Adenopathy`, `Age`, `Biomarker_Result`, `Unspecific_Therapy`, `Site_Breast`, `Chemotherapy`, `Targeted_Therapy`, `Radiotherapy`, `Performance_Status`, `Pathology_Test`, `Site_Other_Body_Part`, `Cancer_Surgery`, `Line_Of_Therapy`, `Pathology_Result`, `Hormonal_Therapy`, `Site_Bone`, `Biomarker`, `Immunotherapy`, `Cycle_Day`, `Frequency`, `Route`, `Duration`, `Death_Entity`, `Metastasis`, `Site_Liver`, `Cancer_Dx`, `Grade`, `Date`, `Site_Lung`, `Site_Brain`, `Relative_Date`, `Race_Ethnicity`, `Gender`, `Oncogene`, `Dosage`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_wip_en_4.0.0_3.0_1664556885893.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_wip_en_4.0.0_3.0_1664556885893.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter])

data = spark.createDataFrame([["The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy."]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))

val data = Seq("The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.oncology_wip").predict("""The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR.
Postoperatively, radiotherapy was administered to her breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.""")
```
## Results

```bash
| chunk                          | ner_label             |
|:-------------------------------|:----------------------|
| left                           | Direction             |
| mastectomy                     | Cancer_Surgery        |
| axillary lymph node dissection | Cancer_Surgery        |
| left                           | Direction             |
| breast cancer                  | Cancer_Dx             |
| twenty years ago               | Relative_Date         |
| tumor                          | Tumor_Finding         |
| positive                       | Biomarker_Result      |
| ER                             | Biomarker             |
| PR                             | Biomarker             |
| radiotherapy                   | Radiotherapy          |
| breast                         | Site_Breast           |
| cancer                         | Cancer_Dx             |
| recurred                       | Response_To_Treatment |
| right                          | Direction             |
| lung                           | Site_Lung             |
| metastasis                     | Metastasis            |
| 13 years later                 | Relative_Date         |
| adriamycin                     | Chemotherapy          |
| 60 mg/m2                       | Dosage                |
| cyclophosphamide               | Chemotherapy          |
| 600 mg/m2                      | Dosage                |
| six courses                    | Cycle_Count           |
| first line                     | Line_Of_Therapy       |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_oncology_wip|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|992.6 KB|

## References

In-house annotated oncology case reports.
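The precision, recall, and F1 scores reported in the benchmarking section are derived from the per-label tp/fp/fn counts in the usual way. As a quick sanity check, a minimal Python sketch, using the `Histological_Type` and `Cancer_Dx` rows from the table as worked examples:

```python
# Precision, recall and F1 from true-positive (tp), false-positive (fp)
# and false-negative (fn) counts, rounded to two decimals as in the
# benchmarking table.
def prf1(tp: float, fp: float, fn: float):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

print(prf1(200.0, 60.0, 143.0))   # Histological_Type row -> (0.77, 0.58, 0.66)
print(prf1(938.0, 120.0, 128.0))  # Cancer_Dx row -> (0.89, 0.88, 0.88)
```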
## Benchmarking

```bash
label tp fp fn total precision recall f1
Histological_Type 200.0 60.0 143.0 343.0 0.77 0.58 0.66
Direction 602.0 126.0 132.0 734.0 0.83 0.82 0.82
Staging 160.0 17.0 56.0 216.0 0.90 0.74 0.81
Cancer_Score 17.0 0.0 42.0 59.0 1.00 0.29 0.45
Imaging_Test 1534.0 192.0 175.0 1709.0 0.89 0.90 0.89
Cycle_Number 46.0 14.0 27.0 73.0 0.77 0.63 0.69
Tumor_Finding 834.0 52.0 108.0 942.0 0.94 0.89 0.91
Site_Lymph_Node 414.0 50.0 52.0 466.0 0.89 0.89 0.89
Invasion 89.0 7.0 44.0 133.0 0.93 0.67 0.78
Response_To_Treatment 225.0 70.0 204.0 429.0 0.76 0.52 0.62
Smoking_Status 48.0 15.0 6.0 54.0 0.76 0.89 0.82
Tumor_Size 771.0 133.0 81.0 852.0 0.85 0.90 0.88
Cycle_Count 143.0 59.0 32.0 175.0 0.71 0.82 0.76
Adenopathy 27.0 8.0 17.0 44.0 0.77 0.61 0.68
Age 655.0 18.0 41.0 696.0 0.97 0.94 0.96
Biomarker_Result 845.0 281.0 261.0 1106.0 0.75 0.76 0.76
Unspecific_Therapy 131.0 168.0 103.0 234.0 0.44 0.56 0.49
Site_Breast 69.0 6.0 49.0 118.0 0.92 0.58 0.72
Chemotherapy 475.0 36.0 143.0 618.0 0.93 0.77 0.84
Targeted_Therapy 139.0 8.0 40.0 179.0 0.95 0.78 0.85
Radiotherapy 192.0 12.0 30.0 222.0 0.94 0.86 0.90
Performance_Status 60.0 13.0 40.0 100.0 0.82 0.60 0.69
Pathology_Test 631.0 154.0 178.0 809.0 0.80 0.78 0.79
Site_Other_Body_Part 663.0 297.0 438.0 1101.0 0.69 0.60 0.64
Cancer_Surgery 542.0 139.0 103.0 645.0 0.80 0.84 0.82
Line_Of_Therapy 79.0 11.0 8.0 87.0 0.88 0.91 0.89
Pathology_Result 546.0 369.0 309.0 855.0 0.60 0.64 0.62
Hormonal_Therapy 82.0 1.0 34.0 116.0 0.99 0.71 0.82
Site_Bone 166.0 52.0 66.0 232.0 0.76 0.72 0.74
Biomarker 899.0 342.0 212.0 1111.0 0.72 0.81 0.76
Immunotherapy 87.0 52.0 24.0 111.0 0.63 0.78 0.70
Cycle_Day 142.0 28.0 41.0 183.0 0.84 0.78 0.80
Frequency 295.0 27.0 54.0 349.0 0.92 0.85 0.88
Route 50.0 4.0 47.0 97.0 0.93 0.52 0.66
Duration 384.0 81.0 163.0 547.0 0.83 0.70 0.76
Death_Entity 21.0 1.0 8.0 29.0 0.95 0.72 0.82
Metastasis 274.0 13.0 15.0 289.0 0.95 0.95 0.95
Site_Liver 125.0 142.0 45.0 170.0 0.47 0.74 0.57
Cancer_Dx 938.0 120.0
128.0 1066.0 0.89 0.88 0.88
Grade 119.0 19.0 79.0 198.0 0.86 0.60 0.71
Date 614.0 28.0 15.0 629.0 0.96 0.98 0.97
Site_Lung 285.0 66.0 158.0 443.0 0.81 0.64 0.72
Site_Brain 149.0 40.0 51.0 200.0 0.79 0.74 0.77
Relative_Date 853.0 351.0 111.0 964.0 0.71 0.88 0.79
Race_Ethnicity 41.0 7.0 10.0 51.0 0.85 0.80 0.83
Gender 933.0 14.0 8.0 941.0 0.99 0.99 0.99
Oncogene 198.0 53.0 159.0 357.0 0.79 0.55 0.65
Dosage 675.0 81.0 152.0 827.0 0.89 0.82 0.85
Radiation_Dose 79.0 20.0 17.0 96.0 0.80 0.82 0.81
macro_avg 17546.0 3857.0 4459.0 22005.0 0.83 0.75 0.78
micro_avg NaN NaN NaN NaN 0.83 0.80 0.81
```

---
layout: model
title: German T5ForConditionalGeneration Cased model (from dehio)
author: John Snow Labs
name: t5_german_qg_quad
date: 2023-01-30
tags: [de, open_source, t5, tensorflow]
task: Text Generation
language: de
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `german-qg-t5-quad` is a German model originally trained by `dehio`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_german_qg_quad_de_4.3.0_3.0_1675102735996.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_german_qg_quad_de_4.3.0_3.0_1675102735996.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_german_qg_quad","de") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_german_qg_quad","de")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_german_qg_quad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|de|
|Size:|923.3 MB|

## References

- https://huggingface.co/dehio/german-qg-t5-quad
- https://www.deepset.ai/germanquad
- https://github.com/d-e-h-i-o/german-qg

---
layout: model
title: English RobertaForQuestionAnswering (from akdeniz27)
author: John Snow Labs
name: roberta_qa_roberta_large_cuad
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-cuad` is an English model originally trained by `akdeniz27`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_cuad_en_4.0.0_3.0_1655736445187.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_cuad_en_4.0.0_3.0_1655736445187.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_cuad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```

```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = RoBertaForQuestionAnswering
    .pretrained("roberta_qa_roberta_large_cuad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.cuad.roberta.large").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
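For intuition only: extractive question-answering models of this kind score every token of the context as a candidate answer start and answer end, and the highest-scoring valid span (start before end) is returned as the answer. The sketch below is a simplified illustration with made-up logits, not the Spark NLP implementation:

```python
# Pick the highest-scoring (start, end) span with start <= end and a
# bounded span length. Logit values below are invented for illustration.
def best_span(start_logits, end_logits, max_len=15):
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end   = [0.1, 0.1, 0.1, 4.5, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]

s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```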
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_large_cuad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/akdeniz27/roberta-large-cuad
- https://github.com/TheAtticusProject/cuad
- https://github.com/marshmellow77/cuad-demo

---
layout: model
title: French CamemBert Embeddings (from Yanzhu)
author: John Snow Labs
name: camembert_embeddings_bertweetfr_base
date: 2022-05-23
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bertweetfr-base` is a French model originally trained by `Yanzhu`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_bertweetfr_base_fr_3.4.4_3.0_1653320961942.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_bertweetfr_base_fr_3.4.4_3.0_1653320961942.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_bertweetfr_base","fr") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])

data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_bertweetfr_base","fr")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))

val data = Seq("J'adore Spark NLP").toDF("text")

val result = pipeline.fit(data).transform(data)
```
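The `embeddings` column produced by the pipeline above holds one vector per token. Downstream applications commonly compare such vectors with cosine similarity; a minimal, framework-free sketch with toy vectors (real CamemBERT embeddings are 768-dimensional, the 4-dimensional values here are for illustration only):

```python
import math

# Cosine similarity: dot product divided by the product of the norms.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = [1.0, 0.0, 1.0, 0.0]
v = [1.0, 0.0, 1.0, 0.0]
w = [0.0, 1.0, 0.0, 1.0]

print(cosine(u, v))  # identical vectors -> 1.0
print(cosine(u, w))  # orthogonal vectors -> 0.0
```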
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_bertweetfr_base|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|415.7 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/Yanzhu/bertweetfr-base

---
layout: model
title: Detect Clinical Entities (bert_token_classifier_ner_jsl)
author: John Snow Labs
name: bert_token_classifier_ner_jsl
date: 2021-08-28
tags: [ner, en, licensed, clinical]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.2.0
spark_version: 2.4
supported: true
annotator: BertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a BERT-based version of the `ner_jsl` model and performs better than the legacy NER model (MedicalNerModel) based on the BiLSTM-CNN-Char architecture.

Definitions of Predicted Entities:

- `Injury_or_Poisoning`: Physical harm or injury caused to the body, including those caused by accidents, falls, or poisoning of a patient or someone else.
- `Direction`: All the information relating to the laterality of the internal and external organs.
- `Test`: Mentions of laboratory, pathology, and radiological tests.
- `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient.
- `Death_Entity`: Mentions that indicate the death of a patient.
- `Relationship_Status`: State of a patient's romantic or social relationships (e.g. single, married, divorced).
- `Duration`: The duration of a medical treatment or medication use.
- `Respiration`: Number of breaths per minute.
- `Hyperlipidemia`: Terms that indicate hyperlipidemia with relevant subtypes and synonyms.
- `Birth_Entity`: Mentions that indicate giving birth.
- `Age`: All mentions of ages, past or present, related to the patient or to anybody else.
- `Labour_Delivery`: Extractions include stages of labor and delivery.
- `Family_History_Header`: Identifies section headers that correspond to Family History of the patient.
- `BMI`: Numeric values and other text information related to Body Mass Index.
- `Temperature`: All mentions that refer to body temperature.
- `Alcohol`: Terms that indicate alcohol use, abuse or drinking issues of a patient or someone else.
- `Kidney_Disease`: Terms that refer to any kidney diseases (includes mentions of modifiers such as "Acute" or "Chronic").
- `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else.
- `Medical_History_Header`: Identifies section headers that correspond to Past Medical History of a patient.
- `Cerebrovascular_Disease`: All terms that refer to cerebrovascular diseases and events.
- `Oxygen_Therapy`: Breathing support triggered by the patient or entirely or partially by machine (e.g. ventilator, BPAP, CPAP).
- `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements.
- `Psychological_Condition`: All mental health diagnoses, disorders, conditions or syndromes of a patient or someone else.
- `Heart_Disease`: All mentions of acquired, congenital or degenerative heart diseases.
- `Employment`: All mentions of patient or provider occupational titles and employment status.
- `Obesity`: Terms related to a patient being obese (overweight and BMI are extracted as different labels).
- `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.).
- `Pregnancy`: All terms related to pregnancy (excluding terms that are extracted with their specific labels, such as "Labour_Delivery" etc.).
- `ImagingFindings`: All mentions of radiographic and imaging findings.
- `Procedure`: All mentions of invasive medical or surgical procedures or treatments.
- `Medical_Device`: All mentions related to medical devices and supplies.
- `Race_Ethnicity`: All terms that refer to racial and national origin of sociocultural groups.
- `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital Signs headers are extracted separately with their specific labels).
- `Symptom`: All the symptoms mentioned in the document, of a patient or someone else.
- `Treatment`: Includes therapeutic and minimally invasive treatments and procedures (invasive treatments or procedures are extracted as "Procedure").
- `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs).
- `Route`: Drug and medication administration routes available described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Drug_Ingredient`: Active ingredient(s) found in drug products.
- `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted.
- `Diet`: All mentions and information regarding a patient's dietary habits.
- `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by the naked eye.
- `LDL`: All mentions related to the lab test and results for LDL (Low Density Lipoprotein).
- `VS_Finding`: Qualitative data (e.g. Fever, Cyanosis, Tachycardia) and any other symptoms that refer to vital signs.
- `Allergen`: Allergen related extractions mentioned in the document.
- `EKG_Findings`: All mentions of EKG readings.
- `Imaging_Technique`: All mentions of special radiographic views or special imaging techniques used in radiology.
- `Triglycerides`: All terms related to the specific lab test for Triglycerides.
- `RelativeTime`: Time references that are relative to different times or events (e.g. words such as "approximately", "in the morning").
- `Gender`: Gender-specific nouns and pronouns.
- `Pulse`: Peripheral heart rate, without advanced information like measurement location.
- `Social_History_Header`: Identifies section headers that correspond to Social History of a patient.
- `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs).
- `Diabetes`: All terms related to diabetes mellitus.
- `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately.
- `Internal_organ_or_component`: All mentions related to internal body parts or organs that cannot be examined by the naked eye.
- `Clinical_Dept`: Terms that indicate the medical and/or surgical departments.
- `Form`: Drug and medication forms available described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients.
- `Strength`: Potency of one unit of drug (or a combination of drugs); the measurement units available are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Fetus_NewBorn`: All terms related to fetus, infant, new born (excluding terms that are extracted with their specific labels, such as "Labour_Delivery", "Pregnancy" etc.).
- `RelativeDate`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "approximately two years ago", "about two days ago").
- `Height`: All mentions related to a patient's height.
- `Test_Result`: Terms related to all the test results present in the document (clinical test results are included).
- `Sexually_Active_or_Sexual_Orientation`: All terms that are related to sexuality, sexual orientations and sexual activity.
- `Frequency`: Frequency of administration for a dose prescribed.
- `Time`: Specific time references (hour and/or minutes).
- `Weight`: All mentions related to a patient's weight.
- `Vaccine`: Generic and brand names of vaccines or vaccination procedures.
- `Vital_Signs_Header`: Identifies section headers that correspond to Vital Signs of a patient.
- `Communicable_Disease`: Includes all mentions of communicable diseases.
- `Dosage`: Quantity prescribed by the physician for an active ingredient; the measurement units available are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Overweight`: Terms related to the patient being overweight (BMI and Obesity are extracted separately).
- `Hypertension`: All terms related to hypertension (quantitative data such as 150/100 is extracted as Blood_Pressure).
- `HDL`: Terms related to the lab test for HDL (High Density Lipoprotein).
- `Total_Cholesterol`: Terms related to the lab test and results for cholesterol.
- `Smoking`: All mentions of smoking status of a patient.
- `Date`: Mentions of an exact date, in any format, including day number, month and/or year.
## Predicted Entities `Injury_or_Poisoning`, `Direction`, `Test`, `Admission_Discharge`, `Death_Entity`, `Relationship_Status`, `Duration`, `Respiration`, `Hyperlipidemia`, `Birth_Entity`, `Age`, `Labour_Delivery`, `Family_History_Header`, `BMI`, `Temperature`, `Alcohol`, `Kidney_Disease`, `Oncological`, `Medical_History_Header`, `Cerebrovascular_Disease`, `Oxygen_Therapy`, `O2_Saturation`, `Psychological_Condition`, `Heart_Disease`, `Employment`, `Obesity`, `Disease_Syndrome_Disorder`, `Pregnancy`, `ImagingFindings`, `Procedure`, `Medical_Device`, `Race_Ethnicity`, `Section_Header`, `Symptom`, `Treatment`, `Substance`, `Route`, `Drug_Ingredient`, `Blood_Pressure`, `Diet`, `External_body_part_or_region`, `LDL`, `VS_Finding`, `Allergen`, `EKG_Findings`, `Imaging_Technique`, `Triglycerides`, `RelativeTime`, `Gender`, `Pulse`, `Social_History_Header`, `Substance_Quantity`, `Diabetes`, `Modifier`, `Internal_organ_or_component`, `Clinical_Dept`, `Form`, `Drug_BrandName`, `Strength`, `Fetus_NewBorn`, `RelativeDate`, `Height`, `Test_Result`, `Sexually_Active_or_Sexual_Orientation`, `Frequency`, `Time`, `Weight`, `Vaccine`, `Vital_Signs_Header`, `Communicable_Disease`, `Dosage`, `Overweight`, `Hypertension`, `HDL`, `Total_Cholesterol`, `Smoking`, `Date` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_BERT_TOKEN_CLASSIFIER/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_en_3.2.0_2.4_1630172634235.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_en_3.2.0_2.4_1630172634235.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_jsl", "en", "clinical/models")\ .setInputCols("token", "sentence")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) sample_text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . 
Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . 
She had close follow-up with endocrinology post discharge ."""

result = model.transform(spark.createDataFrame([[sample_text]]).toDF("text"))
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_jsl", "en", "clinical/models")
    .setInputCols(Array("token", "sentence"))
    .setOutputCol("ner")
    .setCaseSensitive(true)

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence","token","ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    tokenClassifier,
    ner_converter))

val sample_text = Seq("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity .
Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . 
She had close follow-up with endocrinology post discharge .""").toDS.toDF("text") val result = pipeline.fit(sample_text).transform(sample_text) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_jsl").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . 
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge .""") ```
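`MedicalBertForTokenClassifier` emits one label per token, and the `NerConverter` stage then merges consecutive tokens into entity chunks. A minimal, simplified pure-Python sketch of that merging step (real converters work over IOB-tagged output; the tokens and labels below are illustrative):

```python
# Merge consecutive tokens that share a label into chunks, skipping the
# "O" (outside) label -- a simplified stand-in for what NerConverter does.
def merge_chunks(tokens, labels):
    chunks = []
    for token, label in zip(tokens, labels):
        if chunks and chunks[-1][1] == label and label != "O":
            # Extend the previous chunk with this token.
            chunks[-1] = (chunks[-1][0] + " " + token, label)
        else:
            chunks.append((token, label))
    # Keep only labeled chunks.
    return [c for c in chunks if c[1] != "O"]

tokens = ["gestational", "diabetes", "mellitus", "was", "diagnosed"]
labels = ["Diabetes", "Diabetes", "Diabetes", "O", "O"]
print(merge_chunks(tokens, labels))
# [('gestational diabetes mellitus', 'Diabetes')]
```

In the pipeline itself this is handled by `NerConverter`, so no manual merging is needed; the sketch only shows what the `ner_chunk` column represents.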
## Results ```bash +------------+-------------------------+ |chunk |label | +------------+-------------------------+ |28-year-old |Age | |female |Gender | |gestational |Diabetes | |diabetes |Diabetes | |mellitus |Diabetes | |eight |RelativeDate | |years |RelativeDate | |prior |RelativeDate | |type |Diabetes | |two |Diabetes | |diabetes |Diabetes | |mellitus |Diabetes | |T2DM |Diabetes | |HTG-induced |Diabetes | |pancreatitis|Disease_Syndrome_Disorder| |three |RelativeDate | |years |RelativeDate | |prior |RelativeDate | |acute |Disease_Syndrome_Disorder| |hepatitis |Disease_Syndrome_Disorder| |obesity |Obesity | |body |BMI | |mass |BMI | |index |BMI | |BMI |BMI | |) |BMI | |of |BMI | |33.5 |BMI | |kg/m2 |BMI | |polyuria |Symptom | |polydipsia |Symptom | |poor |Symptom | |appetite |Symptom | |vomiting |Symptom | |Two |RelativeDate | |weeks |RelativeDate | |prior |RelativeDate | |she |Gender | |five-day |Drug | |course |Drug | +------------+-------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_jsl| |Compatibility:|Healthcare NLP 3.2.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Case sensitive:|true| |Max sentence length:|128| ## Data Source Trained on data gathered and manually annotated by John Snow Labs. 
https://www.johnsnowlabs.com/data/ ## Benchmarking ```bash label precision recall f1-score support Admission_Discharge 0.84 0.97 0.90 415 Age 0.96 0.96 0.96 2434 Alcohol 0.75 0.83 0.79 145 BMI 1.00 0.77 0.87 26 Blood_Pressure 0.86 0.88 0.87 597 Cerebrovascular_Disease 0.74 0.77 0.75 266 Clinical_Dept 0.90 0.92 0.91 2385 Communicable_Disease 0.70 0.59 0.64 85 Date 0.95 0.98 0.96 1438 Death_Entity 0.83 0.83 0.83 59 Diabetes 0.95 0.95 0.95 350 Diet 0.60 0.49 0.54 229 Direction 0.88 0.90 0.89 6187 Disease_Syndrome_Disorder 0.90 0.89 0.89 13236 Dosage 0.57 0.49 0.53 263 Drug 0.91 0.93 0.92 15926 Duration 0.82 0.85 0.83 1218 EKG_Findings 0.64 0.70 0.67 325 Employment 0.79 0.85 0.82 539 External_body_part_or_region 0.84 0.84 0.84 4805 Family_History_Header 1.00 1.00 1.00 889 Fetus_NewBorn 0.57 0.56 0.56 341 Frequency 0.87 0.90 0.88 1718 Gender 0.98 0.98 0.98 5666 HDL 0.60 1.00 0.75 6 Heart_Disease 0.88 0.88 0.88 2295 Height 0.89 0.96 0.92 134 Hyperlipidemia 1.00 0.95 0.97 194 Hypertension 0.95 0.98 0.97 566 ImagingFindings 0.66 0.64 0.65 601 Imaging_Technique 0.62 0.67 0.64 108 Injury_or_Poisoning 0.85 0.83 0.84 1680 Internal_organ_or_component 0.90 0.91 0.90 21318 Kidney_Disease 0.89 0.89 0.89 446 LDL 0.88 0.97 0.92 37 Labour_Delivery 0.82 0.71 0.76 306 Medical_Device 0.89 0.93 0.91 12852 Medical_History_Header 0.96 0.97 0.96 1013 Modifier 0.68 0.60 0.64 1398 O2_Saturation 0.84 0.82 0.83 199 Obesity 0.96 0.98 0.97 130 Oncological 0.88 0.96 0.92 1635 Overweight 0.80 0.80 0.80 10 Oxygen_Therapy 0.91 0.92 0.92 231 Pregnancy 0.81 0.83 0.82 439 Procedure 0.91 0.91 0.91 14410 Psychological_Condition 0.81 0.81 0.81 354 Pulse 0.85 0.95 0.89 389 Race_Ethnicity 1.00 1.00 1.00 163 Relationship_Status 0.93 0.91 0.92 57 RelativeDate 0.83 0.86 0.84 1562 RelativeTime 0.74 0.79 0.77 431 Respiration 0.99 0.95 0.97 221 Route 0.68 0.69 0.69 597 Section_Header 0.97 0.98 0.98 28580 Sexually_Active_or_Sexual_Orientation 1.00 0.64 0.78 14 Smoking 0.83 0.90 0.86 225 Social_History_Header 0.95 
0.99 0.97 825 Strength 0.71 0.55 0.62 227 Substance 0.85 0.81 0.83 193 Symptom 0.84 0.86 0.85 23092 Temperature 0.94 0.97 0.96 410 Test 0.84 0.88 0.86 9050 Test_Result 0.84 0.84 0.84 2766 Time 0.90 0.81 0.86 140 Total_Cholesterol 0.69 0.95 0.80 73 Treatment 0.73 0.72 0.73 506 Triglycerides 0.83 0.80 0.81 30 VS_Finding 0.76 0.77 0.76 588 Vaccine 0.70 0.84 0.76 92 Vital_Signs_Header 0.95 0.98 0.97 2223 Weight 0.88 0.89 0.88 306 O 0.97 0.96 0.97 253164 accuracy - - 0.94 445974 macro-avg 0.82 0.82 0.81 445974 weighted-avg 0.94 0.94 0.94 445974 ``` --- layout: model title: Named Entity Recognition Profiling (Biobert) author: John Snow Labs name: ner_profiling_biobert date: 2022-01-18 tags: [ner_profiling, ner, clinical, biobert, en, licensed] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.3.1 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to explore all the available pretrained NER models at once. When you run this pipeline over your text, you will end up with the predictions coming out of each pretrained clinical NER model trained with `biobert_pubmed_base_cased`. It has been updated by adding NER model outputs to the previous version. Here are the NER models that this pretrained pipeline includes: `ner_jsl_enriched_biobert`, `ner_clinical_biobert`, `ner_chemprot_biobert`, `ner_jsl_greedy_biobert`, `ner_bionlp_biobert`, `ner_human_phenotype_go_biobert`, `jsl_rd_ner_wip_greedy_biobert`, `ner_posology_large_biobert`, `ner_risk_factors_biobert`, `ner_anatomy_coarse_biobert`, `ner_deid_enriched_biobert`, `ner_human_phenotype_gene_biobert`, `ner_jsl_biobert`, `ner_events_biobert`, `ner_deid_biobert`, `ner_posology_biobert`, `ner_diseases_biobert`, `jsl_ner_wip_greedy_biobert`, `ner_ade_biobert`, `ner_anatomy_biobert`, `ner_cellular_biobert` . 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.2.Pretrained_NER_Profiling_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_profiling_biobert_en_3.3.1_2.4_1642535810700.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_profiling_biobert_en_3.3.1_2.4_1642535810700.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline ner_profiling_pipeline = PretrainedPipeline('ner_profiling_biobert', 'en', 'clinical/models') result = ner_profiling_pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val ner_profiling_pipeline = PretrainedPipeline("ner_profiling_biobert", "en", "clinical/models") val result = ner_profiling_pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.profiling_biobert").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .""") ```
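Because the profiling pipeline runs many NER models at once, `annotate()` returns one entry per output column. A small sketch of grouping such a result by model, using a mocked dict; the key names and values below are illustrative, not the pipeline's actual column names:

```python
# Mocked shape of a profiling result: one key per model output plus
# shared columns such as "sentence". Illustrative values only.
result = {
    "ner_clinical_biobert_chunks": ["gestational diabetes mellitus", "obesity"],
    "ner_diseases_biobert_chunks": ["obesity", "polyuria"],
    "sentence": ["A 28-year-old female ..."],
}

# Collect the per-model chunk lists, dropping the shared columns.
per_model = {
    key.replace("_chunks", ""): chunks
    for key, chunks in result.items()
    if key.endswith("_chunks")
}

for model, chunks in sorted(per_model.items()):
    print(model, chunks)
```

Inspect the keys of your actual `result` dict to see the exact column names the pipeline produces.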
## Results ```bash ******************** ner_diseases_biobert Model Results ******************** [('gestational diabetes mellitus', 'Disease'), ('type two diabetes mellitus', 'Disease'), ('T2DM', 'Disease'), ('HTG-induced pancreatitis', 'Disease'), ('hepatitis', 'Disease'), ('obesity', 'Disease'), ('polyuria', 'Disease'), ('polydipsia', 'Disease'), ('poor appetite', 'Disease'), ('vomiting', 'Disease')] ******************** ner_events_biobert Model Results ******************** [('gestational diabetes mellitus', 'PROBLEM'), ('eight years', 'DURATION'), ('presentation', 'OCCURRENCE'), ('type two diabetes mellitus ( T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('three years', 'DURATION'), ('presentation', 'OCCURRENCE'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index', 'TEST'), ('BMI', 'TEST'), ('presented', 'OCCURRENCE'), ('a one-week', 'DURATION'), ('polyuria', 'PROBLEM'), ('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')] ******************** ner_jsl_biobert Model Results ******************** [('28-year-old', 'Age'), ('female', 'Gender'), ('gestational diabetes mellitus', 'Diabetes'), ('eight years prior', 'RelativeDate'), ('type two diabetes mellitus', 'Diabetes'), ('T2DM', 'Disease_Syndrome_Disorder'), ('HTG-induced pancreatitis', 'Disease_Syndrome_Disorder'), ('three years prior', 'RelativeDate'), ('acute', 'Modifier'), ('hepatitis', 'Disease_Syndrome_Disorder'), ('obesity', 'Obesity'), ('body mass index', 'BMI'), ('BMI ) of 33.5 kg/m2', 'BMI'), ('one-week', 'Duration'), ('polyuria', 'Symptom'), ('polydipsia', 'Symptom'), ('poor appetite', 'Symptom'), ('vomiting', 'Symptom')] ******************** ner_clinical_biobert Model Results ******************** [('gestational diabetes mellitus', 'PROBLEM'), ('subsequent type two diabetes mellitus ( T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index ( BMI )', 
'TEST'), ('polyuria', 'PROBLEM'), ('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')] ******************** ner_risk_factors_biobert Model Results ******************** [('diabetes mellitus', 'DIABETES'), ('subsequent type two diabetes mellitus', 'DIABETES'), ('obesity', 'OBESE')] ... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_profiling_biobert| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.3.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|750.1 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - Finisher --- layout: model title: English DistilBertForQuestionAnswering Base Cased model (from rahulchakwate) author: John Snow Labs name: distilbert_qa_base_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`distilbert-base-finetuned-squad` is an English model originally trained by `rahulchakwate`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_finetuned_squad_en_4.3.0_3.0_1672767084899.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_finetuned_squad_en_4.3.0_3.0_1672767084899.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
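The transformed DataFrame carries the prediction in the `answer` column as Spark NLP annotations. A rough pure-Python sketch of pulling out the answer text and a confidence value from one row, using a mocked annotation (the nested structure mirrors Spark NLP's annotation schema, but the `score` metadata key here is an assumption, not confirmed by this card):

```python
# Mocked row: one annotation in the "answer" column. Field values are
# illustrative; "score" in metadata is an assumed key.
row = {
    "answer": [{
        "annotatorType": "chunk",
        "begin": 11,
        "end": 15,
        "result": "Clara",
        "metadata": {"score": "0.98"},  # assumed metadata key
    }]
}

# Extract (text, score) pairs, defaulting the score to 0.0 if absent.
answers = [
    (a["result"], float(a["metadata"].get("score", 0.0)))
    for a in row["answer"]
]
print(answers)
```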
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/rahulchakwate/distilbert-base-finetuned-squad --- layout: model title: English RobertaForSequenceClassification Cased model (from lucianpopa) author: John Snow Labs name: roberta_classifier_autonlp_trec_classification_522314623 date: 2022-12-09 tags: [en, open_source, roberta, sequence_classification, classification] task: Text Classification language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-TREC-classification-522314623` is an English model originally trained by `lucianpopa`. ## Predicted Entities `1`, `0`, `4`, `2`, `3`, `5` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_classifier_autonlp_trec_classification_522314623_en_4.2.4_3.0_1670622157916.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_classifier_autonlp_trec_classification_522314623_en_4.2.4_3.0_1670622157916.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autonlp_trec_classification_522314623","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_classifier]) data = spark.createDataFrame([["I love you!"], ["I feel lucky to be here."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autonlp_trec_classification_522314623","en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_classifier)) val data = Seq("I love you!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
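The Predicted Entities above are numeric class ids. If human-readable TREC-6 category names are needed, they can be mapped after prediction. Note the index order below is an assumption for illustration; verify it against the original Hugging Face model's `config.json` before relying on it:

```python
# Hypothetical id-to-name mapping for the six TREC question categories.
# The index assignment is an assumption, not taken from the model config.
TREC_LABELS = {
    "0": "ABBR",  # abbreviation
    "1": "DESC",  # description / definition
    "2": "ENTY",  # entity
    "3": "HUM",   # human
    "4": "LOC",   # location
    "5": "NUM",   # numeric value
}

def readable(label: str) -> str:
    # Fall back to the raw label for anything unmapped.
    return TREC_LABELS.get(label, label)

print(readable("3"))
```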
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_classifier_autonlp_trec_classification_522314623| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/lucianpopa/autonlp-TREC-classification-522314623 --- layout: model title: Basic General Purpose Pipeline for Catalan author: cayorodriguez name: pipeline_md date: 2022-07-11 tags: [ca, open_source] task: [Named Entity Recognition, Sentence Detection, Embeddings, Stop Words Removal, Part of Speech Tagging, Lemmatization, Chunk Mapping, Pipeline Public] language: ca edition: Spark NLP 3.4.4 spark_version: 3.0 supported: false recommended: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Model for Catalan language processing based on models by Barcelona SuperComputing Center and the AINA project (Generalitat de Catalunya), following POS and tokenization guidelines from ANCORA Universal Dependencies corpus. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/cayorodriguez/pipeline_md_ca_3.4.4_3.0_1657533114488.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/cayorodriguez/pipeline_md_ca_3.4.4_3.0_1657533114488.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("pipeline_md", "ca", "@cayorodriguez") result = pipeline.annotate("El català ja és a SparkNLP.") ```
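For token-level annotators, `annotate()` returns parallel lists that can be zipped into per-token records. A sketch, using a trimmed version of the pipeline's output for the example sentence (assumed to match the Results shown on this page):

```python
# Assumed annotate() output for "El català ja és a SparkNLP." --
# each annotator contributes one list with one entry per token.
result = {
    "token": ["El", "català", "ja", "és", "a", "SparkNLP", "."],
    "lemma": ["el", "català", "ja", "ser", "a", "sparknlp", "."],
    "pos":   ["DET", "NOUN", "ADV", "AUX", "ADP", "PROPN", "PUNCT"],
}

# Pair up token, lemma and POS tag for each position.
rows = list(zip(result["token"], result["lemma"], result["pos"]))
for token, lemma, pos in rows:
    print(f"{token}\t{lemma}\t{pos}")
```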
## Results ```bash {'chunk': ['El català ja', 'SparkNLP', 'és'], 'entities': ['SparkNLP'], 'lemma': ['el', 'català', 'ja', 'ser', 'a', 'sparknlp', '.'], 'document': ['El català ja es a SparkNLP.'], 'pos': ['DET', 'NOUN', 'ADV', 'AUX', 'ADP', 'PROPN', 'PUNCT'], 'sentence_embeddings': ['El català ja és a SparkNLP.'], 'cleanTokens': ['català', 'SparkNLP', '.'], 'token': ['El', 'català', 'ja', 'és', 'a', 'SparkNLP', '.'], 'ner': ['O', 'O', 'O', 'O', 'O', 'B-ORG', 'O'], 'embeddings': ['El', 'català', 'ja', 'és', 'a', 'SparkNLP', '.'], 'form': ['el', 'català', 'ja', 'és', 'a', 'sparknlp', '.'], 'sentence': ['El català ja és a SparkNLP.']} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Community| |Language:|ca| |Size:|756.1 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - NormalizerModel - StopWordsCleaner - RoBertaEmbeddings - SentenceEmbeddings - EmbeddingsFinisher - LemmatizerModel - PerceptronModel - RoBertaForTokenClassification - NerConverter - Chunker --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from huangtuoyue) author: John Snow Labs name: distilbert_qa_huangtuoyue_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `huangtuoyue`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_huangtuoyue_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771314379.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_huangtuoyue_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771314379.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_huangtuoyue_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_huangtuoyue_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_huangtuoyue_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/huangtuoyue/distilbert-base-uncased-finetuned-squad --- layout: model title: Fast Neural Machine Translation Model from Belarusian to Spanish author: John Snow Labs name: opus_mt_be_es date: 2021-06-01 tags: [open_source, seq2seq, translation, be, es, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). source languages: be target languages: es {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_be_es_xx_3.1.0_2.4_1622557108215.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_be_es_xx_3.1.0_2.4_1622557108215.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_be_es", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_be_es", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Belarusian.translate_to.Spanish').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_be_es| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Korean Bert Embeddings (from deeq) author: John Snow Labs name: bert_embeddings_dbert date: 2022-04-11 tags: [bert, embeddings, ko, open_source] task: Embeddings language: ko edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `dbert` is a Korean model originally trained by `deeq`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_dbert_ko_3.4.2_3.0_1649675591519.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_dbert_ko_3.4.2_3.0_1649675591519.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_dbert","ko") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["나는 Spark NLP를 좋아합니다"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_dbert","ko") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("나는 Spark NLP를 좋아합니다").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ko.embed.dbert").predict("""나는 Spark NLP를 좋아합니다""") ```
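After the pipeline runs, each token in the `embeddings` column carries a dense vector. A common post-processing step is comparing token vectors with cosine similarity; a minimal sketch, with toy 4-dimensional vectors standing in for the model's real output:

```python
import math

# Cosine similarity between two dense vectors of equal length.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy stand-ins for two token embeddings pulled from the "embeddings" column.
vec_a = [0.1, 0.3, -0.2, 0.7]
vec_b = [0.1, 0.2, -0.1, 0.8]
print(round(cosine(vec_a, vec_b), 3))
```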
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_dbert| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ko| |Size:|424.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/deeq/dbert --- layout: model title: German T5ForConditionalGeneration Base Cased model (from Einmalumdiewelt) author: John Snow Labs name: t5_base_gnad date: 2023-01-30 tags: [de, open_source, t5, tensorflow] task: Text Generation language: de edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `T5-Base_GNAD` is a German model originally trained by `Einmalumdiewelt`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_gnad_de_4.3.0_3.0_1675099176903.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_gnad_de_4.3.0_3.0_1675099176903.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_base_gnad","de") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_base_gnad","de") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_base_gnad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|de| |Size:|919.7 MB| ## References - https://huggingface.co/Einmalumdiewelt/T5-Base_GNAD --- layout: model title: Spanish Lemmatizer author: John Snow Labs name: lemma date: 2020-02-17 00:16:00 +0800 task: Lemmatization language: es edition: Spark NLP 2.4.0 spark_version: 2.4 tags: [lemmatizer, es] supported: true annotator: LemmatizerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_es_2.4.0_2.4_1581890818386.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_es_2.4.0_2.4_1581890818386.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "es") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Además de ser el rey del norte, John Snow es un médico inglés y líder en el desarrollo de la anestesia y la higiene médica.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "es") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Además de ser el rey del norte, John Snow es un médico inglés y líder en el desarrollo de la anestesia y la higiene médica.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Además de ser el rey del norte, John Snow es un médico inglés y líder en el desarrollo de la anestesia y la higiene médica."""] lemma_df = nlu.load('es.lemma').predict(text, output_level = "token") lemma_df.lemma.values[0] ```
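To make the idea in the description concrete, here is a toy, dictionary-based sketch of what a lemmatizer does — purely illustrative, with a hypothetical hand-made map; the pretrained model uses a full Spanish lexicon plus context, not this tiny lookup:

```python
# Toy dictionary-based lemmatizer (illustrative only; the pretrained
# Spark NLP model relies on a full lexicon, not this hand-made map).
LEMMA_MAP = {
    "es": "ser",
    "fue": "ser",
    "eran": "ser",
    "reyes": "rey",
    "médicos": "médico",
}

def lemmatize(tokens):
    """Map each token to its root form, falling back to the token itself."""
    return [LEMMA_MAP.get(t.lower(), t) for t in tokens]

print(lemmatize(["Es", "reyes", "ingleses"]))  # → ['ser', 'rey', 'ingleses']
```

Note how unknown forms ("ingleses") pass through unchanged — the real model additionally uses surrounding context to disambiguate forms this naive lookup cannot.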
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=5, result='Además', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=7, end=8, result='de', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=10, end=12, result='ser', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=14, end=15, result='el', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=17, end=19, result='rey', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.4.0| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|es| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: English BertForQuestionAnswering Cased model (from motiondew) author: John Snow Labs name: bert_qa_set_date_2_lr_2e_5_bs_32_ep_4 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-set_date_2-lr-2e-5-bs-32-ep-4` is an English model originally trained by `motiondew`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_2_lr_2e_5_bs_32_ep_4_en_4.0.0_3.0_1657188461900.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_2_lr_2e_5_bs_32_ep_4_en_4.0.0_3.0_1657188461900.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_2_lr_2e_5_bs_32_ep_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_2_lr_2e_5_bs_32_ep_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_set_date_2_lr_2e_5_bs_32_ep_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/motiondew/bert-set_date_2-lr-2e-5-bs-32-ep-4 --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18) author: John Snow Labs name: distilbert_qa_base_uncased_becasv2_1 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-becasv2-1` is an English model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becasv2_1_en_4.3.0_3.0_1672767657809.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becasv2_1_en_4.3.0_3.0_1672767657809.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becasv2_1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becasv2_1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_becasv2_1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Evelyn18/distilbert-base-uncased-becasv2-1 --- layout: model title: Translate English to Turkic languages Pipeline author: John Snow Labs name: translate_en_trk date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, trk, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `trk` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_trk_xx_2.7.0_2.4_1609690153382.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_trk_xx_2.7.0_2.4_1609690153382.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_trk", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_trk", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.trk').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_trk| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295` is a German model originally trained by jonatasgrosman. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295_de_4.2.0_3.0_1664114711185.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295_de_4.2.0_3.0_1664114711185.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering model (from MarcBrun) author: John Snow Labs name: bert_qa_ixambert_finetuned_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ixambert-finetuned-squad` is an English model originally trained by `MarcBrun`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_ixambert_finetuned_squad_en_4.0.0_3.0_1654187989433.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_ixambert_finetuned_squad_en_4.0.0_3.0_1654187989433.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_ixambert_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_ixambert_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.ixam_bert.by_MarcBrun").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_ixambert_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|661.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/MarcBrun/ixambert-finetuned-squad --- layout: model title: English BertForQuestionAnswering Cased model (from akmal2500) author: John Snow Labs name: bert_qa_akmal2500_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `akmal2500`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_akmal2500_finetuned_squad_en_4.0.0_3.0_1657186331020.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_akmal2500_finetuned_squad_en_4.0.0_3.0_1657186331020.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_akmal2500_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_akmal2500_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_akmal2500_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/akmal2500/bert-finetuned-squad --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-64-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_0_en_4.0.0_3.0_1655733244827.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_0_en_4.0.0_3.0_1655733244827.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_64d_seed_0").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|419.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-64-finetuned-squad-seed-0 --- layout: model title: Legal Qualifications Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_qualifications_bert date: 2023-03-05 tags: [en, legal, classification, clauses, qualifications, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Qualifications` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Qualifications`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_qualifications_bert_en_1.0.0_3.0_1678049907588.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_qualifications_bert_en_1.0.0_3.0_1678049907588.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_qualifications_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
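As the description above recommends, long filings should be split into provision-sized pieces before classification. A minimal sketch of the "paragraph splitting (by multiline)" idea in plain Python — an assumption-laden stand-in for the annotator-based approach in the linked tutorial, assuming blank lines separate provisions:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs on one or more blank lines."""
    parts = re.split(r"\n\s*\n", text)
    # Drop empty fragments and strip surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]

contract = (
    "Section 1. Qualifications of the Trustee...\n\n"
    "Section 2. Governing Law...\n\n\n"
    "Section 3. Notices..."
)
for provision in split_paragraphs(contract):
    print(provision)
```

Each resulting paragraph can then be fed to the classifier as a separate row of the input DataFrame, keeping every piece under the 512-token embedding limit.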
## Results ```bash +----------------+ |result| +----------------+ |[Qualifications]| |[Other]| |[Other]| |[Qualifications]| +----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_qualifications_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.83 0.83 0.83 6 Qualifications 0.83 0.83 0.83 6 accuracy - - 0.83 12 macro-avg 0.83 0.83 0.83 12 weighted-avg 0.83 0.83 0.83 12 ``` --- layout: model title: Resolver Company Names to Tickers using Nasdaq Stock Screener author: John Snow Labs name: finel_nasdaq_ticker_stock_screener date: 2023-01-20 tags: [en, licensed, finance, nasdaq, ticker] task: Entity Resolution language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an Entity Resolution / Entity Linking model, which is able to provide Ticker / Trading Symbols using a Company Name as an input. You can use any NER which extracts Organizations / Companies / Parties, then send the input to the `finel_nasdaq_company_name_stock_screener` model to get a normalized company name. Finally, this Entity Linking model gets the Ticker / Trading Symbol (given the company has one). ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finel_nasdaq_ticker_stock_screener_en_1.0.0_3.0_1674236954508.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finel_nasdaq_ticker_stock_screener_en_1.0.0_3.0_1674236954508.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') tokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") ner_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\ .setInputCols(["document", "token"])\ .setOutputCol("embeddings") ner_model = finance.NerModel.pretrained('finner_orgs_prods_alias', 'en', 'finance/models')\ .setInputCols(["document", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") chunkToDoc = nlp.Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") ticker_embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en")\ .setInputCols("ner_chunk_doc")\ .setOutputCol("ticker_embeddings") er_ticker_model = finance.SentenceEntityResolverModel.pretrained('finel_nasdaq_ticker_stock_screener', 'en', 'finance/models')\ .setInputCols(["ticker_embeddings"])\ .setOutputCol("ticker")\ .setAuxLabelCol("company_name") pipeline = nlp.Pipeline().setStages([document_assembler, tokenizer, ner_embeddings, ner_model, ner_converter, chunkToDoc, ticker_embeddings, er_ticker_model]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = pipeline.fit(empty_data) lp = nlp.LightPipeline(model) text = """Nike is an American multinational association that is involved in the design, development, manufacturing and worldwide marketing and sales of apparel, footwear, accessories, equipment and services.""" result = lp.annotate(text) result["ticker"] ```
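Conceptually, sentence-embedding resolvers of this kind work by nearest-neighbour search over embedded company names. A toy illustration of that idea with hypothetical 3-dimensional vectors (not the model's actual embeddings or search implementation):

```python
import math

# Hypothetical toy "embeddings" for a few company names, keyed by ticker.
INDEX = {
    "NKE": [0.9, 0.1, 0.0],   # Nike
    "AAPL": [0.1, 0.9, 0.0],  # Apple
    "MSFT": [0.0, 0.1, 0.9],  # Microsoft
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def resolve(query_vec):
    """Return the ticker whose stored embedding is closest to the query."""
    return max(INDEX, key=lambda t: cosine(INDEX[t], query_vec))

print(resolve([0.85, 0.2, 0.05]))  # → NKE (closest to the Nike vector)
```

The real pipeline replaces the hand-made vectors with Universal Sentence Encoder embeddings and the dictionary with the resolver's trained index of Nasdaq company names.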
## Results ```bash ['NKE'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finel_nasdaq_ticker_stock_screener| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[normalized]| |Language:|en| |Size:|54.6 MB| |Case sensitive:|false| ## References https://www.nasdaq.com/market-activity/stocks/screener --- layout: model title: Korean Electra Embeddings (from krevas) author: John Snow Labs name: electra_embeddings_finance_koelectra_base_generator date: 2022-05-17 tags: [ko, open_source, electra, embeddings] task: Embeddings language: ko edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `finance-koelectra-base-generator` is a Korean model originally trained by `krevas`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_finance_koelectra_base_generator_ko_3.4.4_3.0_1652786802248.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_finance_koelectra_base_generator_ko_3.4.4_3.0_1652786802248.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_finance_koelectra_base_generator","ko") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["나는 Spark NLP를 좋아합니다"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_finance_koelectra_base_generator","ko") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("나는 Spark NLP를 좋아합니다").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_finance_koelectra_base_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ko| |Size:|129.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/krevas/finance-koelectra-base-generator - https://openreview.net/forum?id=r1xMH1BtvB - https://github.com/google-research/electra --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_test TFWav2Vec2ForCTC from ying-tina author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab_test date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_test` is an English model originally trained by ying-tina. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_test_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_test_en_4.2.0_3.0_1664111957682.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_test_en_4.2.0_3.0_1664111957682.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_test", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_test", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
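Under the hood, a CTC model such as Wav2Vec2ForCTC emits one character prediction per audio frame, and decoding collapses consecutive repeats and drops the blank token to produce the transcript. A toy illustration of that collapse step in plain Python (using `_` as an assumed blank symbol; the real model's vocabulary and frame predictions differ):

```python
def ctc_collapse(frames, blank="_"):
    # Collapse consecutive duplicate predictions, then drop blanks.
    out = []
    prev = None
    for ch in frames:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

# Frame-level predictions for a short utterance:
print(ctc_collapse(list("_hh_eee_ll_lll_oo_")))  # -> hello
```

Note that a blank between two identical characters keeps both, which is how CTC represents genuine double letters.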
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab_test| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|354.9 MB| --- layout: model title: Sinhala RobertaForMaskedLM Cased model (from keshan) author: John Snow Labs name: roberta_embeddings_sinhalaberto date: 2022-12-12 tags: [si, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: si edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `SinhalaBERTo` is a Sinhala model originally trained by `keshan`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_sinhalaberto_si_4.2.4_3.0_1670858534410.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_sinhalaberto_si_4.2.4_3.0_1670858534410.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_sinhalaberto","si") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_sinhalaberto","si") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_sinhalaberto| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|si| |Size:|314.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/keshan/SinhalaBERTo - https://oscar-corpus.com/ - https://arxiv.org/abs/1907.11692 --- layout: model title: Fast Neural Machine Translation Model from Baltic Languages to English author: John Snow Labs name: opus_mt_bat_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, bat, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `bat` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bat_en_xx_2.7.0_2.4_1609164652074.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bat_en_xx_2.7.0_2.4_1609164652074.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_bat_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_bat_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.bat.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_bat_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Clinical Analysis author: John Snow Labs name: clinical_analysis class: PipelineModel language: en nav_key: models repository: clinical/models date: 2020-02-01 task: Pipeline Healthcare edition: Healthcare NLP 2.4.0 spark_version: 2.4 tags: [clinical,licensed,pipeline,en] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.Pretrained_Clinical_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_analysis_en_2.4.0_2.4_1580600773378.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_analysis_en_2.4.0_2.4_1580600773378.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python model = PretrainedPipeline("clinical_analysis","en","clinical/models") ``` ```scala val model = PipelineModel.pretrained("clinical_analysis","en","clinical/models") ```
{:.model-param} ## Model Information {:.table-model} |---------------|-------------------| | Name: | clinical_analysis | | Type: | PipelineModel | | Compatibility: | Spark NLP 2.4.0+ | | License: | Licensed | | Edition: | Official | | Language: | en | {:.h2_title} ## Data Source --- layout: model title: English asr_Fine_Tunning_on_CV_dataset TFWav2Vec2ForCTC from Sania67 author: John Snow Labs name: pipeline_asr_Fine_Tunning_on_CV_dataset date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Fine_Tunning_on_CV_dataset` is an English model originally trained by Sania67. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_Fine_Tunning_on_CV_dataset_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_Fine_Tunning_on_CV_dataset_en_4.2.0_3.0_1664118340149.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_Fine_Tunning_on_CV_dataset_en_4.2.0_3.0_1664118340149.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_Fine_Tunning_on_CV_dataset', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_Fine_Tunning_on_CV_dataset", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_Fine_Tunning_on_CV_dataset| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal Nonstatutory Stock Option Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_nonstatutory_stock_option_agreement_bert date: 2023-02-02 tags: [en, legal, classification, nonstatutory, stock, option, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_nonstatutory_stock_option_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `nonstatutory-stock-option-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `nonstatutory-stock-option-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_nonstatutory_stock_option_agreement_bert_en_1.0.0_3.0_1675360953793.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_nonstatutory_stock_option_agreement_bert_en_1.0.0_3.0_1675360953793.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_nonstatutory_stock_option_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
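Conceptually, the ClassifierDL head takes the sentence embedding produced by `sent_bert_base_cased` and maps it to one logit per class, picking the higher-scoring label. A toy sketch of that final step with hypothetical weights and a 3-dimensional input for brevity (the real embedding is 768-dimensional and the weights are learned):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

labels = ["nonstatutory-stock-option-agreement", "other"]

def classify(sent_emb, weights, bias):
    # One dense layer: logits = W . x + b, then softmax + argmax.
    logits = [sum(w * x for w, x in zip(row, sent_emb)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    return labels[probs.index(max(probs))]

W = [[1.0, -0.5, 0.2], [-0.8, 0.9, 0.1]]  # hypothetical weights
b = [0.0, 0.1]                            # hypothetical biases
print(classify([0.9, 0.1, 0.3], W, b))    # -> nonstatutory-stock-option-agreement
```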
## Results ```bash +-------+ |result| +-------+ |[nonstatutory-stock-option-agreement]| |[other]| |[other]| |[nonstatutory-stock-option-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_nonstatutory_stock_option_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support nonstatutory-stock-option-agreement 0.98 0.96 0.97 53 other 0.98 0.99 0.99 122 accuracy - - 0.98 175 macro-avg 0.98 0.98 0.98 175 weighted-avg 0.98 0.98 0.98 175 ``` --- layout: model title: Chinese BertForMaskedLM Cased model (from hfl) author: John Snow Labs name: bert_embeddings_rbtl3 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rbtl3` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbtl3_zh_4.2.4_3.0_1670022905492.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbtl3_zh_4.2.4_3.0_1670022905492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbtl3","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbtl3","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_rbtl3| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|228.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/rbtl3 - https://arxiv.org/abs/1906.08101 - https://github.com/google-research/bert - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://arxiv.org/abs/2004.13922 - https://arxiv.org/abs/1906.08101 --- layout: model title: Finnish T5ForConditionalGeneration Tiny Cased model (from Finnish-NLP) author: John Snow Labs name: t5_tiny_nl6 date: 2023-01-31 tags: [fi, open_source, t5, tensorflow] task: Text Generation language: fi edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-tiny-nl6-finnish` is a Finnish model originally trained by `Finnish-NLP`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_tiny_nl6_fi_4.3.0_3.0_1675156113232.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_tiny_nl6_fi_4.3.0_3.0_1675156113232.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_tiny_nl6","fi") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_tiny_nl6","fi") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_tiny_nl6| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|fi| |Size:|145.8 MB| ## References - https://huggingface.co/Finnish-NLP/t5-tiny-nl6-finnish - https://arxiv.org/abs/1910.10683 - https://github.com/google-research/text-to-text-transfer-transformer - https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511 - https://arxiv.org/abs/2002.05202 - https://arxiv.org/abs/2109.10686 - http://urn.fi/urn:nbn:fi:lb-2017070501 - http://urn.fi/urn:nbn:fi:lb-2021050401 - http://urn.fi/urn:nbn:fi:lb-2018121001 - http://urn.fi/urn:nbn:fi:lb-2020021803 - https://sites.research.google/trc/about/ - https://github.com/google-research/t5x - https://github.com/spyysalo/yle-corpus - https://github.com/aajanki/eduskunta-vkk - https://sites.research.google/trc/ - https://www.linkedin.com/in/aapotanskanen/ - https://www.linkedin.com/in/rasmustoivanen/ --- layout: model title: Fast Neural Machine Translation Model from English to Luvale author: John Snow Labs name: opus_mt_en_lue date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, lue, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `lue` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_lue_xx_2.7.0_2.4_1609168026109.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_lue_xx_2.7.0_2.4_1609168026109.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_lue", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_lue", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.lue').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_lue| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Sentence Entity Resolver for SNOMED codes (procedures and measurements) author: John Snow Labs name: sbiobertresolve_clinical_snomed_procedures_measurements date: 2021-11-15 tags: [en, licensed, clinical, entity_resolution] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.2 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps medical entities to SNOMED codes using `sent_biobert_clinical_base_cased` Sentence Bert Embeddings. The corpus of this model includes `Procedures` and `Measurement` domains. ## Predicted Entities `SNOMED` codes from `Procedures` and `Measurements` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_clinical_snomed_procedures_measurements_en_3.3.2_3.0_1636985738813.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_clinical_snomed_procedures_measurements_en_3.3.2_3.0_1636985738813.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sent_biobert_clinical_base_cased", "en")\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_clinical_snomed_procedures_measurements", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("cpt_code") pipelineModel = PipelineModel( stages = [ documentAssembler, sbert_embedder, resolver]) l_model = LightPipeline(pipelineModel) result = l_model.fullAnnotate(['coronary calcium score', 'heart surgery', 'ct scan', 'bp value']) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sent_biobert_clinical_base_cased", "en") .setInputCols(Array("ner_chunk")) .setOutputCol("sbert_embeddings") val resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_clinical_snomed_procedures_measurements", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("cpt_code") val pipeline = new Pipeline().setStages(Array(document_assembler, sbert_embedder, resolver)) val pipelineModel = pipeline.fit(Seq("").toDS.toDF("text")) val l_model = new LightPipeline(pipelineModel) val result = l_model.fullAnnotate(Array("coronary calcium score", "heart surgery", "ct scan", "bp value")) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.clinical_snomed_procedures_measurements").predict("""coronary calcium score""") ```
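Conceptually, the resolver embeds each input chunk and returns the SNOMED codes whose description embeddings lie closest in the embedding space. A toy nearest-neighbor sketch with made-up 3-dimensional vectors and a hypothetical three-entry code table (the real model uses 768-dimensional Bert sentence embeddings over the full SNOMED vocabulary):

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical embedding table: SNOMED code -> (description, vector)
codes = {
    "450360000": ("Coronary artery calcium score", [0.9, 0.1, 0.0]),
    "2598006":   ("Open heart surgery",            [0.1, 0.9, 0.2]),
    "75367002":  ("Blood pressure",                [0.0, 0.2, 0.9]),
}

def resolve(chunk_vec, k=1):
    # Rank codes by similarity to the chunk embedding, return the top k.
    ranked = sorted(codes, key=lambda c: cosine(chunk_vec, codes[c][1]), reverse=True)
    return ranked[:k]

print(resolve([0.85, 0.15, 0.05]))  # closest to the calcium-score vector
```

The `all_k_codes` column in the results below corresponds to running this kind of lookup with `k > 1`.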
## Results ```bash | | chunk | code | code_description | all_k_code_desc | all_k_codes | |---:|:-----------------------|----------:|:------------------------------|:--------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------| | 0 | coronary calcium score | 450360000 | Coronary artery calcium score | ['450360000', '450734004', '1086491000000104', '1086481000000101', '762241007'] | ['Coronary artery calcium score', 'Coronary artery calcium score', 'Dundee Coronary Risk Disk score', 'Dundee Coronary Risk rank', 'Dundee Coronary Risk Disk'] | | 1 | heart surgery | 2598006 | Open heart surgery | ['2598006', '64915003', '119766003', '34068001', '233004008'] | ['Open heart surgery', 'Operation on heart', 'Heart reconstruction', 'Heart valve replacement', 'Coronary sinus operation'] | | 2 | ct scan | 303653007 | CT of head | ['303653007', '431864000', '363023007', '418272005', '241577003'] | ['CT of head', 'CT guided injection', 'CT of site', 'CT angiography', 'CT of spine'] | | 3 | bp value | 75367002 | Blood pressure | ['75367002', '6797001', '723232008', '46973005', '427732000'] | ['Blood pressure', 'Mean blood pressure', 'Average blood pressure', 'Blood pressure taking', 'Speed of blood pressure response'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_clinical_snomed_procedures_measurements| |Compatibility:|Healthcare NLP 3.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_chunk_embeddings]| |Output Labels:|[output]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on `SNOMED` code dataset with `sent_biobert_clinical_base_cased` sentence embeddings. 
--- layout: model title: Legal Representations And Warranties Clause Binary Classifier author: John Snow Labs name: legclf_representations_and_warranties_clause date: 2022-12-18 tags: [en, legal, classification, licensed, clause, bert, representations, and, warranties, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the representations-and-warranties clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it’s better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. 
## Predicted Entities `representations-and-warranties`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_representations_and_warranties_clause_en_1.0.0_3.0_1671393649249.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_representations_and_warranties_clause_en_1.0.0_3.0_1671393649249.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_representations_and_warranties_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[representations-and-warranties]| |[other]| |[other]| |[representations-and-warranties]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_representations_and_warranties_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.92 0.90 0.91 39 representations-and-warranties 0.86 0.89 0.88 28 accuracy - - 0.90 67 macro-avg 0.89 0.90 0.89 67 weighted-avg 0.90 0.90 0.90 67 ``` --- layout: model title: Hindi BertForMaskedLM Cased model (from neuralspace-reverie) author: John Snow Labs name: bert_embeddings_indic_transformers date: 2022-12-06 tags: [hi, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: hi edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `indic-transformers-hi-bert` is a Hindi model originally trained by `neuralspace-reverie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_hi_4.2.4_3.0_1670326624051.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_hi_4.2.4_3.0_1670326624051.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","hi") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","hi") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_indic_transformers| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|hi| |Size:|612.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-hi-bert - https://oscar-corpus.com/ --- layout: model title: Italian T5ForConditionalGeneration Small Cased model (from it5) author: John Snow Labs name: t5_it5_efficient_small_el32_repubblica_to_ilgiornale date: 2023-01-30 tags: [it, open_source, t5, tensorflow] task: Text Generation language: it edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `it5-efficient-small-el32-repubblica-to-ilgiornale` is a Italian model originally trained by `it5`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_repubblica_to_ilgiornale_it_4.3.0_3.0_1675103650043.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_repubblica_to_ilgiornale_it_4.3.0_3.0_1675103650043.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_repubblica_to_ilgiornale","it") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_repubblica_to_ilgiornale","it") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_it5_efficient_small_el32_repubblica_to_ilgiornale| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|it| |Size:|594.0 MB| ## References - https://huggingface.co/it5/it5-efficient-small-el32-repubblica-to-ilgiornale - https://github.com/stefan-it - https://arxiv.org/abs/2203.03759 - https://gsarti.com - https://malvinanissim.github.io - https://arxiv.org/abs/2109.10686 - https://github.com/gsarti/it5 - https://paperswithcode.com/sota?task=Headline+style+transfer+%28Repubblica+to+Il+Giornale%29&dataset=CHANGE-IT --- layout: model title: Sentence Entity Resolver for NDC (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_ndc date: 2021-11-27 tags: [ndc, entity_resolution, licensed, en, clinical] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.2 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts (like drugs/ingredients) to [National Drug Codes](https://www.fda.gov/drugs/drug-approvals-and-databases/national-drug-code-directory) using `sbiobert_base_cased_mli` Sentence Bert Embeddings. Also, if a drug has more than one NDC code, it returns all other codes in the all_k_aux_label column, separated by the `|` symbol.
## Predicted Entities `NDC Codes` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_NDC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_ndc_en_3.3.2_2.4_1638010818380.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_ndc_en_3.3.2_2.4_1638010818380.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The ```sbiobertresolve_ndc``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_posology_greedy``` as the NER model, with ```DRUG``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... c2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sentence_embeddings") ndc_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_ndc", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sentence_embeddings"]) \ .setOutputCol("ndc_code")\ .setDistanceFunction("EUCLIDEAN")\ .setCaseSensitive(False) resolver_pipeline = Pipeline( stages = [ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, posology_ner, ner_converter_icd, c2doc, sbert_embedder, ndc_resolver ]) data = spark.createDataFrame([["""The patient was transferred secondary to inability and continue of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given aspirin 81 mg, folic acid 1 g daily, insulin glargine 100 UNT/ML injection and metformin 500 mg p.o. p.r.n."""]]).toDF("text") result = resolver_pipeline.fit(data).transform(data) ``` ```scala ... 
val c2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sentence_embeddings") val ndc_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_ndc", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sentence_embeddings")) .setOutputCol("ndc_code") .setDistanceFunction("EUCLIDEAN") .setCaseSensitive(false) val resolver_pipeline = new Pipeline().setStages(Array( document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, posology_ner, ner_converter_icd, c2doc, sbert_embedder, ndc_resolver )) val clinical_note = Seq("""The patient was transferred secondary to inability and continue of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given aspirin 81 mg, folic acid 1 g daily, insulin glargine 100 UNT/ML injection and metformin 500 mg p.o. p.r.n.""").toDS.toDF("text") val result = resolver_pipeline.fit(clinical_note).transform(clinical_note) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.ndc").predict("""The patient was transferred secondary to inability and continue of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given aspirin 81 mg, folic acid 1 g daily, insulin glargine 100 UNT/ML injection and metformin 500 mg p.o. p.r.n.""") ```
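As described above, when a drug maps to more than one NDC code the resolver packs the alternatives into a single `|`-joined string (with `-` marking candidates that have no extra codes). A minimal post-processing sketch — plain Python with a hypothetical value, not part of the pipeline — for unpacking such an entry:

```python
def split_aux_ndc(aux_label):
    """Split a '|'-joined all_k_aux_label entry into a list of NDC codes.

    A bare '-' means the candidate has no alternative codes.
    """
    return [code for code in aux_label.split("|") if code and code != "-"]

# Hypothetical aux-label value in the shape the resolver emits
aux = "70000042002|00363021879|41250027408"
print(split_aux_ndc(aux))  # ['70000042002', '00363021879', '41250027408']
```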
## Results ```bash +-------------------------------------+------+-----------+------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ner_chunk|entity| ndc_code| description| all_codes| all_resolutions| other ndc codes| +-------------------------------------+------+-----------+------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | aspirin 81 mg| DRUG|73089008114| aspirin 81 mg/81mg, 81 mg in 1 carton , capsule|[73089008114, 71872708704, 71872715401, 68210101500, 69536028110, 63548086706, 71679001000, 68196090051, 00113400500, 69536018112, 73089008112, 63981056362, 63739043402, 63548086705, 00113046708, 7...|[aspirin 81 mg/81mg, 81 mg in 1 carton , capsule, aspirin 81 mg 81 mg/1, 4 blister pack in 1 bag , tablet, aspirin 81 mg/1, 1 
blister pack in 1 bag , tablet, coated, aspirin 81 mg/1, 1 bag in 1 dru...| [-, -, -, -, -, -, -, -, -, -, -, 63940060962, -, -, -, -, -, -, -, -, 70000042002|00363021879|41250027408|36800046708|59779027408|49035027408|71476010131|81522046708|30142046708, -, -, -, -]| | folic acid 1 g| DRUG|43744015101| folic acid 1 g/g, 1 g in 1 package , powder|[43744015101, 63238340000, 66326050555, 51552041802, 51552041805, 63238340001, 81919000204, 51552041804, 66326050556, 51552106301, 51927003300, 71092997701, 51927296300, 51552146602, 61281900002, 6...|[folic acid 1 g/g, 1 g in 1 package , powder, folic acid 1 kg/kg, 1 kg in 1 bottle , powder, folic acid 1 kg/kg, 1 kg in 1 drum , powder, folic acid 1 g/g, 5 g in 1 container , powder, folic acid 1...| [-, -, -, -, -, -, -, -, -, -, -, 51552139201, -, -, -, 81919000203, -, 81919000201, -, -, -, -, -, -, -]| |insulin glargine 100 UNT/ML injection| DRUG|00088502101|insulin glargine 100 [iu]/ml, 1 vial, glass in 1 package , injection, solution|[00088502101, 00088222033, 49502019580, 00002771563, 00169320111, 00088250033, 70518139000, 00169266211, 50090127600, 50090407400, 00002771559, 00002772899, 70518225200, 70518138800, 00024592410, 0...|[insulin glargine 100 [iu]/ml, 1 vial, glass in 1 package , injection, solution, insulin glargine 100 [iu]/ml, 1 vial, glass in 1 carton , injection, solution, insulin glargine 100 [iu]/ml, 1 vial ...|[-, -, -, 00088221900, -, -, 50090139800|00088502005, -, 70518146200|00169368712, 00169368512|73070020011, 00088221905|49502019675|50090406800, -, 73070010011|00169750111|50090495500, 66733077301|0...| | metformin 500 mg| DRUG|70010006315| metformin hydrochloride 500 mg/500mg, 500 mg in 1 drum , tablet|[70010006315, 62207041613, 71052050750, 62207049147, 71052091050, 25000010197, 25000013498, 25000010198, 71052063005, 51662139201, 70010049118, 70882012456, 71052011005, 71052065905, 71052050850, 1...|[metformin hydrochloride 500 mg/500mg, 500 mg in 1 drum , tablet, metformin hcl 500 mg/kg, 50 kg 
in 1 drum , powder, 5-fluorouracil 500 g/500g, 500 g in 1 container , powder, metformin er 500 mg 50...| [-, -, -, 70010049105, -, -, -, -, -, -, -, -, -, -, -, 71800000801|42571036007, -, -, -, -, -, -, -, -, -]| +-------------------------------------+------+-----------+------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_ndc| |Compatibility:|Healthcare NLP 3.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[ndc_code]| |Language:|en| |Case sensitive:|false| --- layout: model title: English asr_sanskrit TFWav2Vec2ForCTC from Tarakki100 author: John Snow Labs name: pipeline_asr_sanskrit date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_sanskrit` is a English model originally trained by Tarakki100. 
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_sanskrit_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_sanskrit_en_4.2.0_3.0_1664112399942.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_sanskrit_en_4.2.0_3.0_1664112399942.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_sanskrit', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_sanskrit", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_sanskrit| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|227.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Aspect based Sentiment Analysis for restaurant reviews author: John Snow Labs name: ner_aspect_based_sentiment date: 2020-12-29 task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 2.6.2 spark_version: 2.4 tags: [sentiment, open_source, en, ner] supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Automatically detect positive, negative and neutral aspects about restaurants from user reviews. Instead of labelling the entire review as negative or positive, this model helps identify which exact phrases relate to sentiment identified in the review. ## Predicted Entities `NEG`, `POS` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/ASPECT_BASED_SENTIMENT_RESTAURANT/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/ABSA_Inference.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_aspect_based_sentiment_en_2.6.2_2.4_1609249232812.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ner_aspect_based_sentiment_en_2.6.2_2.4_1609249232812.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", "xx")\ .setInputCols(["document", "token"])\ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("ner_aspect_based_sentiment")\ .setInputCols(["document", "token", "embeddings"])\ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["Came for lunch my sister. We loved our Thai-style main which amazing with lots of flavours very impressive for vegetarian. But the service was below average and the chips were too terrible to finish."]]).toDF("text")) ``` ```scala ... val word_embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", "xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("ner_aspect_based_sentiment") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter)) val data = Seq("Came for lunch my sister. We loved our Thai-style main which amazing with lots of flavours very impressive for vegetarian. But the service was below average and the chips were too terrible to finish.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Came for lunch my sister. We loved our Thai-style main which amazing with lots of flavours very impressive for vegetarian. But the service was below average and the chips were too terrible to finish."""] ner_df = nlu.load('en.ner.aspect_sentiment').predict(text, output_level='token') list(zip(ner_df["entities"].values[0], ner_df["entities_confidence"].values[0])) ```
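The NerConverter stage at the end of the pipeline merges the model's token-level B-/I- tags into full entity chunks. A simplified sketch of that grouping logic — plain Python, illustrative only; the actual annotator also tracks character offsets and confidence metadata:

```python
def bio_to_chunks(tokens, tags):
    """Merge BIO-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag closes any open chunk and starts a new one
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            # An I- tag continues the open chunk
            current.append(token)
        else:
            # An O tag (or stray I-) closes the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["the", "service", "was", "below", "average"]
tags   = ["O", "B-NEG", "O", "O", "O"]
print(bio_to_chunks(tokens, tags))  # [('service', 'NEG')]
```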
## Results ```bash +----------------------------------------------------------------------------------------------------+-------------------+-----------+ | sentence | aspect | sentiment | +----------------------------------------------------------------------------------------------------+-------------------+-----------+ | We loved our Thai-style main which amazing with lots of flavours very impressive for vegetarian. | Thai-style main | positive | | We loved our Thai-style main which amazing with lots of flavours very impressive for vegetarian. | lots of flavours | positive | | But the service was below average and the chips were too terrible to finish. | service | negative | | But the service was below average and the chips were too terrible to finish. | chips | negative | +----------------------------------------------------------------------------------------------------+-------------------+-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_aspect_based_sentiment| |Type:|ner| |Compatibility:|Spark NLP 2.6.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, embeddings]| |Output Labels:|[absa]| |Language:|en| |Dependencies:|glove_6B_300| --- layout: model title: Smaller BERT Sentence Embeddings (L-12_H-128_A-2) author: John Snow Labs name: sent_small_bert_L12_128 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. 
However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L12_128_en_2.6.0_2.4_1598350359233.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L12_128_en_2.6.0_2.4_1598350359233.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L12_128", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L12_128", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L12_128').predict(text, output_level='sentence') embeddings_df ```
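Once the 128-dimensional sentence vectors are collected from the result, they can be compared directly, e.g. with cosine similarity. A small sketch in plain Python, with toy 4-dimensional vectors standing in for the model's real 128-dimensional outputs:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors in place of the collected sentence_embeddings values
v1 = [0.37, -0.28, 0.11, 0.05]
v2 = [0.90, -0.41, 0.02, 0.10]
print(round(cosine_similarity(v1, v2), 3))
```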
{:.h2_title} ## Results ```bash en_embed_sentence_small_bert_L12_128_embeddings sentence [-0.3747739791870117, -0.28460437059402466, 0.... I hate cancer [0.9055836200714111, -0.41459062695503235, 0.0... Antibiotics aren't painkiller ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L12_128| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|128| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/1 --- layout: model title: Greek Lemmatizer author: John Snow Labs name: lemma date: 2020-05-05 16:56:00 +0800 task: Lemmatization language: el edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [lemmatizer, el] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. 
{:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_el_2.5.0_2.4_1588686951720.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_el_2.5.0_2.4_1588686951720.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "el") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Εκτός από το ότι είναι ο βασιλιάς του Βορρά, ο John Snow είναι Άγγλος γιατρός και ηγέτης στην ανάπτυξη της αναισθησίας και της ιατρικής υγιεινής.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "el") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Εκτός από το ότι είναι ο βασιλιάς του Βορρά, ο John Snow είναι Άγγλος γιατρός και ηγέτης στην ανάπτυξη της αναισθησίας και της ιατρικής υγιεινής.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Εκτός από το ότι είναι ο βασιλιάς του Βορρά, ο John Snow είναι Άγγλος γιατρός και ηγέτης στην ανάπτυξη της αναισθησίας και της ιατρικής υγιεινής."""] lemma_df = nlu.load('el.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=4, result='εκτός', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=6, end=8, result='από', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=10, end=11, result='ο', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=13, end=15, result='ότι', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=17, end=21, result='είμαι', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|el| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Legal Adjustments Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_adjustments_bert date: 2023-03-05 tags: [en, legal, classification, clauses, adjustments, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Adjustments` clause type. To use this model, make sure you provide enough context as an input.
Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Adjustments`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_adjustments_bert_en_1.0.0_3.0_1678050029175.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_adjustments_bert_en_1.0.0_3.0_1678050029175.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_adjustments_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
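As noted above, long filings are best split into provisions before classification, since the embeddings allow up to 512 tokens. A minimal sketch of paragraph splitting by multiline breaks — plain Python with a made-up document, independent of the splitting annotators in the tutorial:

```python
import re

def split_paragraphs(text):
    """Split a document into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Hypothetical two-provision contract excerpt
doc = (
    "Section 1. Adjustments.\nThe Exercise Price shall be adjusted as follows...\n"
    "\n"
    "Section 2. Notices.\nAll notices hereunder shall be in writing..."
)
for paragraph in split_paragraphs(doc):
    print(paragraph[:23])
```

Each resulting paragraph can then be fed to the classifier as a separate row of the `text` column.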
## Results ```bash +-------+ |result| +-------+ |[Adjustments]| |[Other]| |[Other]| |[Adjustments]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_adjustments_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Adjustments 0.88 0.90 0.89 40 Other 0.93 0.91 0.92 58 accuracy - - 0.91 98 macro-avg 0.90 0.91 0.91 98 weighted-avg 0.91 0.91 0.91 98 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from sharonpeng) author: John Snow Labs name: distilbert_qa_sharonpeng_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `sharonpeng`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_sharonpeng_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772605467.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_sharonpeng_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772605467.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sharonpeng_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sharonpeng_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_sharonpeng_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/sharonpeng/distilbert-base-uncased-finetuned-squad --- layout: model title: English T5ForConditionalGeneration Base Cased model (from google) author: John Snow Labs name: t5_efficient_base_ff1000 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-ff1000` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_ff1000_en_4.3.0_3.0_1675111729938.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_ff1000_en_4.3.0_3.0_1675111729938.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_base_ff1000","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_base_ff1000","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_base_ff1000| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|330.1 MB| ## References - https://huggingface.co/google/t5-efficient-base-ff1000 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_4 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-64-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_4_en_4.0.0_3.0_1657185460089.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_4_en_4.0.0_3.0_1657185460089.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-64-finetuned-squad-seed-4 --- layout: model title: Multilingual BertForQuestionAnswering model (from Paul-Vinh) author: John Snow Labs name: bert_qa_Paul_Vinh_bert_base_multilingual_cased_finetuned_squad date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased-finetuned-squad` is a Multilingual model originally trained by `Paul-Vinh`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Paul_Vinh_bert_base_multilingual_cased_finetuned_squad_xx_4.0.0_3.0_1654180153562.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Paul_Vinh_bert_base_multilingual_cased_finetuned_squad_xx_4.0.0_3.0_1654180153562.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Paul_Vinh_bert_base_multilingual_cased_finetuned_squad","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_Paul_Vinh_bert_base_multilingual_cased_finetuned_squad","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("xx.answer_question.squad.bert.multilingual_base_cased.by_Paul-Vinh").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_Paul_Vinh_bert_base_multilingual_cased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Paul-Vinh/bert-base-multilingual-cased-finetuned-squad --- layout: model title: Portuguese BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_pt_cased date: 2022-12-02 tags: [pt, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: pt edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-pt-cased` is a Portuguese model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_pt_cased_pt_4.2.4_3.0_1670018755967.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_pt_cased_pt_4.2.4_3.0_1670018755967.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_pt_cased","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_pt_cased","pt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_pt_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|pt| |Size:|395.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-pt-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Arabic Bert Embeddings (Large, Arabert Model, v02) author: John Snow Labs name: bert_embeddings_bert_large_arabertv02 date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-arabertv02` is an Arabic model originally trained by `aubmindlab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_arabertv02_ar_3.4.2_3.0_1649677496517.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_arabertv02_ar_3.4.2_3.0_1649677496517.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_arabertv02","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_arabertv02","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.bert_large_arabertv02").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_large_arabertv02| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|1.4 GB| |Case sensitive:|true| ## References - https://huggingface.co/aubmindlab/bert-large-arabertv02 - https://github.com/google-research/bert - https://arxiv.org/abs/2003.00104 - https://github.com/WissamAntoun/pydata_khobar_meetup - http://alt.qcri.org/farasa/segmenter.html - https://github.com/google-research/bert/blob/master/multilingual.md - https://github.com/elnagara/HARD-Arabic-Dataset - https://www.aclweb.org/anthology/D15-1299 - https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf - https://github.com/mohamedadaly/LABR - http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp - https://github.com/husseinmozannar/SOQAL - https://github.com/aub-mind/arabert/blob/master/AraBERT/README.md - https://arxiv.org/abs/2003.00104v2 - https://archive.org/details/arwiki-20190201 - https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4 - https://www.aclweb.org/anthology/W19-4619 - https://sites.aub.edu.lb/mindlab/ - https://www.yakshof.com/#/ - https://www.behance.net/rahalhabib - https://www.linkedin.com/in/wissam-antoun-622142b4/ - https://twitter.com/wissam_antoun - https://github.com/WissamAntoun - https://www.linkedin.com/in/fadybaly/ - https://twitter.com/fadybaly - https://github.com/fadybaly --- layout: model title: Arabic DistilBERT Embeddings (from Geotrend) author: John Snow Labs name: distilbert_embeddings_distilbert_base_ar_cased date: 2022-04-12 tags: [distilbert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover 
use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-ar-cased` is an Arabic model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_ar_cased_ar_3.4.2_3.0_1649783855007.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_ar_cased_ar_3.4.2_3.0_1649783855007.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_ar_cased","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_ar_cased","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.distilbert").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_base_ar_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|182.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/distilbert-base-ar-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Hindi XLMRobertaForTokenClassification Large Cased model (from cfilt) author: John Snow Labs name: xlmroberta_ner_hiner_original_large date: 2022-08-13 tags: [hi, open_source, xlm_roberta, ner] task: Named Entity Recognition language: hi edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `HiNER-original-xlm-roberta-large` is a Hindi model originally trained by `cfilt`. ## Predicted Entities `GAME`, `MISC`, `ORGANIZATION`, `FESTIVAL`, `LOCATION`, `LITERATURE`, `LANGUAGE`, `NUMEX`, `PERSON`, `RELIGION`, `TIMEX` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_hiner_original_large_hi_4.1.0_3.0_1660406213542.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_hiner_original_large_hi_4.1.0_3.0_1660406213542.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_hiner_original_large","hi") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_hiner_original_large","hi") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_hiner_original_large| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|hi| |Size:|1.8 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/cfilt/HiNER-original-xlm-roberta-large - https://paperswithcode.com/sota?task=Token+Classification&dataset=HiNER+Original --- layout: model title: Legal Bereavement leave Clause Binary Classifier author: John Snow Labs name: legclf_bereavement_leave_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `bereavement-leave` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, producing as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `bereavement-leave` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_bereavement_leave_clause_en_1.0.0_3.2_1660123269138.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_bereavement_leave_clause_en_1.0.0_3.2_1660123269138.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_bereavement_leave_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
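The paragraph splitting recommended in the description can be prototyped in plain Python before the texts are loaded into the `clause_text` DataFrame. This is a minimal sketch of blank-line ("multiline") splitting; `split_paragraphs` is a hypothetical helper written for illustration, not part of Spark NLP:

```python
import re

def split_paragraphs(text: str) -> list:
    # Split a long legal document on blank lines ("multiline" splitting),
    # dropping empty fragments, so each paragraph can be classified on its own.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "FIRST PARAGRAPH OF THE CONTRACT...\n\nSECOND PARAGRAPH OF THE CONTRACT..."
rows = [[p] for p in split_paragraphs(doc)]
# each row then becomes one "clause_text" entry, e.g.:
# df = spark.createDataFrame(rows).toDF("clause_text")
```

Paragraphs that still exceed the 512-token embedding limit should be split further, for example by headers or sub-headers as the tutorial suggests.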
## Results ```bash +-------+ | result| +-------+ |[bereavement-leave]| |[other]| |[other]| |[bereavement-leave]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_bereavement_leave_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support bereavement-leave 1.00 1.00 1.00 30 other 1.00 1.00 1.00 76 accuracy - - 1.00 106 macro-avg 1.00 1.00 1.00 106 weighted-avg 1.00 1.00 1.00 106 ``` --- layout: model title: Medical Question Answering (biogpt) author: John Snow Labs name: biogpt_pubmed_qa date: 2023-02-26 tags: [licensed, en, clinical, biogpt, gpt, pubmed, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Healthcare NLP 4.3.0 spark_version: 3.0 published: false engine: tensorflow annotator: MedicalQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model has been trained on medical documents and can generate two types of answers. Two question types are supported: `"short"` (producing yes/no/maybe answers) and `"long"` (producing full answers). ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/biogpt_pubmed_qa_en_4.3.0_3.0_1677406773484.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/biogpt_pubmed_qa_en_4.3.0_3.0_1677406773484.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler()\ .setInputCols("question", "context")\ .setOutputCols("document_question", "document_context") med_qa = MedicalQuestionAnswering.pretrained("medical_qa_biogpt","en","clinical/models")\ .setInputCols(["document_question", "document_context"])\ .setMaxNewTokens(100)\ .setOutputCol("answer")\ .setQuestionType("short") #long pipeline = Pipeline(stages=[document_assembler, med_qa]) paper_abstract = "The visual indexing theory proposed by Zenon Pylyshyn (Cognition, 32, 65-97, 1989) predicts that visual attention mechanisms are employed when mental images are projected onto a visual scene. Recent eye-tracking studies have supported this hypothesis by showing that people tend to look at empty places where requested information has been previously presented. However, it has remained unclear to what extent this behavior is related to memory performance. The aim of the present study was to explore whether the manipulation of spatial attention can facilitate memory retrieval. In two experiments, participants were asked first to memorize a set of four objects and then to determine whether a probe word referred to any of the objects. The results of both experiments indicate that memory accuracy is not affected by the current focus of attention and that all the effects of directing attention to specific locations on response times can be explained in terms of stimulus-stimulus and stimulus-response spatial compatibility." long_question = "What is the effect of directing attention on memory?" yes_no_question = "Does directing attention improve memory for items?" 
data = spark.createDataFrame( [ [long_question, paper_abstract, "long"], [yes_no_question, paper_abstract, "short"], ] ).toDF("question", "context", "question_type") pipeline.fit(data).transform(data.where("question_type == 'long'"))\ .select("answer.result")\ .show(truncate=False) pipeline.fit(data).transform(data.where("question_type == 'short'"))\ .select("answer.result")\ .show(truncate=False) ``` ```scala val document_assembler = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val med_qa = MedicalQuestionAnswering.pretrained("medical_qa_biogpt","en","clinical/models") .setInputCols(Array("document_question", "document_context")) .setMaxNewTokens(100) .setOutputCol("answer") .setQuestionType("short") // or "long" val pipeline = new Pipeline().setStages(Array(document_assembler, med_qa)) val paper_abstract = "The visual indexing theory proposed by Zenon Pylyshyn (Cognition, 32, 65-97, 1989) predicts that visual attention mechanisms are employed when mental images are projected onto a visual scene. Recent eye-tracking studies have supported this hypothesis by showing that people tend to look at empty places where requested information has been previously presented. However, it has remained unclear to what extent this behavior is related to memory performance. The aim of the present study was to explore whether the manipulation of spatial attention can facilitate memory retrieval. In two experiments, participants were asked first to memorize a set of four objects and then to determine whether a probe word referred to any of the objects. The results of both experiments indicate that memory accuracy is not affected by the current focus of attention and that all the effects of directing attention to specific locations on response times can be explained in terms of stimulus-stimulus and stimulus-response spatial compatibility." val long_question = "What is the effect of directing attention on memory?" 
val yes_no_question = "Does directing attention improve memory for items?" val data = Seq( (long_question, paper_abstract, "long"), (yes_no_question, paper_abstract, "short")) .toDS.toDF("question", "context", "question_type") val result = pipeline.fit(data).transform(data) ```
## Results ```bash #short result +------+ |result| +------+ |[no] | +------+ #long result +------------------------------------------------------------------------------------------------------------------------------------------------------+ |result | +------------------------------------------------------------------------------------------------------------------------------------------------------+ |[the results of the two experiments suggest that the visual indexeing theory does not fully explain the effects that spatial attention has on memory.]| +------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|biogpt_pubmed_qa| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.1 GB| |Case sensitive:|true| --- layout: model title: Detect Clinical Entities (ner_jsl) author: John Snow Labs name: ner_jsl_en date: 2020-04-22 task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 2.4.2 spark_version: 2.4 tags: [ner, en, clinical, licensed] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terminology. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNN. Definitions of Predicted Entities: - `Age`: All mentions of ages, past or present, related to the patient or anybody else. - `Temperature`: All mentions that refer to body temperature. - `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments. 
- `Symptom`: All symptoms mentioned in the document, whether of the patient or someone else. - `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs). - `Route`: Drug and medication administration routes, as described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted. - `Allergen`: Allergen-related extractions mentioned in the document. - `Gender`: Gender-specific nouns and pronouns. - `Pulse`: Peripheral heart rate, without advanced information like measurement location. - `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately. - `Frequency`: Frequency of administration for a dose prescribed. - `Weight`: All mentions related to a patient's weight. - `Dosage`: Quantity prescribed by the physician for an active ingredient; the applicable measurement units are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Respiration`: Number of breaths per minute. ## Predicted Entities `Diagnosis`, `Procedure_Name`, `Lab_Result`, `Procedure`, `Procedure_Findings`, `O2_Saturation`, `Procedure_incident_description`, `Dosage`, `Causative_Agents_(Virus_and_Bacteria)`, `Name`, `Cause_of_death`, `Substance_Name`, `Weight`, `Symptom_Name`, `Maybe`, `Modifier`, `Blood_Pressure`, `Frequency`, `Gender`, `Drug_incident_description`, `Age`, `Drug_Name`, `Temperature`, `Section_Name`, `Route`, `Negation`, `Negated`, `Allergenic_substance`, `Lab_Name`, `Respiratory_Rate`.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_en_2.4.2_2.4_1587513304751.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_en_2.4.2_2.4_1587513304751.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPython.html %} ```python ... embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_jsl", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."]], ["text"])) ``` ```scala ... val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_jsl", "en", "clinical/models") .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") ... 
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.").toDF("text") val result = pipeline.fit(data).transform(data) ```
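The `...` in the two snippets above elides the upstream stages listed earlier (DocumentAssembler, SentenceDetector, Tokenizer) plus the NerConverter. A minimal sketch of those stages follows; the column names mirror the ones already used in the snippets, but the exact configuration here is an assumption rather than the card's official code:

```python
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, NerConverter

# Turns the raw "text" column into Spark NLP document annotations
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Splits each document into sentences
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Splits each sentence into tokens
tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Groups the IOB tags emitted in the "ner" column into full entity chunks
ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")
```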
{:.h2_title} ## Results The output is a dataframe with a sentence per row and a ``"ner"`` column containing all of the entity labels in the sentence, entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select ``"token.result"`` and ``"ner.result"`` from your output dataframe or add the ``"Finisher"`` to the end of your pipeline. ```bash +---------------------------+------------+ |chunk |ner | +---------------------------+------------+ |21-day-old |Age | |male |Gender | |congestion |Symptom_Name| |mom |Gender | |suctioning yellow discharge|Symptom_Name| |she |Gender | |mild |Modifier | |problems with his breathing|Symptom_Name| |negative |Negated | |perioral cyanosis |Symptom_Name| |retractions |Symptom_Name| |mom |Gender | |Tylenol |Drug_Name | |His |Gender | |his |Gender | |respiratory congestion |Symptom_Name| |He |Gender | |tired |Symptom_Name| |fussy |Symptom_Name| |albuterol |Drug_Name | +---------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_en_2.4.2_2.4| |Type:|ner| |Compatibility:|Spark NLP 2.4.2| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence,token, embeddings]| |Output Labels:|[ner]| |Language:|[en]| |Case sensitive:|false| {:.h2_title} ## Data Source Trained on data gathered and manually annotated by John Snow Labs. 
https://www.johnsnowlabs.com/data/ {:.h2_title} ## Benchmarking ```bash label tp fp fn prec rec f1 B-Pulse_Rate 77 39 12 0.663793 0.865169 0.75122 I-Diagnosis 2134 1139 1329 0.652001 0.616229 0.63361 I-Procedure_Name 2335 1329 956 0.637282 0.709511 0.671459 B-Lab_Result 601 182 94 0.767561 0.864748 0.813261 B-Procedure 1 0 5 1 0.166667 0.285714 B-Procedure_Findings 2 13 72 0.133333 0.027027 0.044944 B-O2_Saturation 1 3 4 0.25 0.2 0.222222 B-Dosage 477 197 68 0.707715 0.875229 0.782609 I-Causative_Agents_(Virus_and_Bacteria) 12 2 7 0.857143 0.631579 0.727273 B-Name 562 268 554 0.677108 0.503584 0.577595 I-Cause_of_death 9 5 11 0.642857 0.45 0.529412 I-Substance_Name 24 34 54 0.413793 0.307692 0.352941 I-Name 716 377 710 0.655078 0.502104 0.56848 B-Cause_of_death 9 6 8 0.6 0.529412 0.5625 B-Weight 52 22 9 0.702703 0.852459 0.77037 B-Symptom_Name 4364 1916 1652 0.694904 0.725399 0.709824 I-Maybe 27 51 61 0.346154 0.306818 0.325301 I-Symptom_Name 2073 1492 2348 0.581487 0.468898 0.519159 B-Modifier 1573 890 768 0.638652 0.671935 0.654871 B-Blood_Pressure 76 19 13 0.8 0.853933 0.826087 B-Frequency 308 134 77 0.696833 0.8 0.744861 I-Gender 26 31 28 0.45614 0.481482 0.468468 I-Drug_incident_description 4 10 57 0.285714 0.065574 0.106667 B-Drug_incident_description 2 5 23 0.285714 0.08 0.125 I-Age 5 0 9 1 0.357143 0.526316 B-Drug_Name 1741 490 290 0.780368 0.857213 0.816987 B-Substance_Name 148 41 48 0.783069 0.755102 0.768831 B-Temperature 56 23 13 0.708861 0.811594 0.756757 I-Procedure 1 0 7 1 0.125 0.222222 B-Section_Name 2711 317 166 0.89531 0.942301 0.918205 I-Route 119 110 189 0.519651 0.386364 0.443203 B-Maybe 143 80 127 0.641256 0.52963 0.580122 B-Gender 5166 709 58 0.879319 0.988897 0.930895 I-Dosage 434 196 87 0.688889 0.833013 0.754127 B-Causative_Agents_(Virus_and_Bacteria) 19 3 8 0.863636 0.703704 0.77551 I-Frequency 275 134 191 0.672372 0.590129 0.628571 B-Age 357 27 16 0.929688 0.957105 0.943197 I-Lab_Result 45 78 152 0.365854 0.228426 0.28125 B-Negation 99 
38 38 0.722628 0.722628 0.722628 B-Diagnosis 2786 1342 913 0.674903 0.753177 0.711895 I-Section_Name 3885 1353 179 0.741695 0.955955 0.835304 B-Route 421 217 166 0.659875 0.717206 0.687347 I-Negation 11 30 24 0.268293 0.314286 0.289474 B-Procedure_Name 1490 811 522 0.647545 0.740557 0.690934 B-Negated 1490 332 215 0.817783 0.8739 0.844911 I-Allergenic_substance 1 0 12 1 0.0769231 0.142857 I-Negated 89 132 146 0.402715 0.378723 0.390351 I-Procedure_Findings 2 31 283 0.060606 0.0070175 0.012570 B-Allergenic_substance 72 29 24 0.712871 0.75 0.730965 I-Weight 47 35 16 0.573171 0.746032 0.648276 B-Lab_Name 804 290 122 0.734918 0.868251 0.79604 I-Modifier 99 73 422 0.575581 0.190019 0.285714 I-Temperature 1 0 14 1 0.066667 0.125 I-Drug_Name 362 284 261 0.560372 0.581059 0.570528 I-Lab_Name 284 194 127 0.594142 0.690998 0.63892 B-Respiratory_Rate 46 5 5 0.901961 0.901961 0.901961 Macro-average 38674 15571 13819 0.589085 0.515426 0.5498 Micro-average 38674 15571 13819 0.712951 0.736746 0.724653 ``` --- layout: model title: Chinese Part of Speech Tagger (Base, UPOS, Chinese Wikipedia Texts) author: John Snow Labs name: bert_pos_chinese_roberta_base_upos date: 2022-04-26 tags: [bert, pos, part_of_speech, zh, open_source] task: Part of Speech Tagging language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `chinese-roberta-base-upos` is a Chinese model originally trained by `KoichiYasuoka`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_chinese_roberta_base_upos_zh_3.4.2_3.0_1650993193631.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_chinese_roberta_base_upos_zh_3.4.2_3.0_1650993193631.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_chinese_roberta_base_upos","zh") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["我爱Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_chinese_roberta_base_upos","zh") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("我爱Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.pos.chinese_roberta_base_upos").predict("""我爱Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_chinese_roberta_base_upos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|zh| |Size:|381.8 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/KoichiYasuoka/chinese-roberta-base-upos - https://universaldependencies.org/u/pos/ - https://github.com/KoichiYasuoka/esupar --- layout: model title: Legal Labor Matters Clause Binary Classifier author: John Snow Labs name: legclf_labor_matters_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `labor-matters` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip them unless you want to do binary classification at sentence level. If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model support up to 512 tokens; if your text is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).
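As a rough illustration of the first technique, paragraph splitting by multiline can be done in plain Python before the text ever reaches the classifier. This sketch is illustrative and is not the workshop notebook's exact code:

```python
import re

def split_paragraphs(text):
    """Split a document into paragraphs on blank lines (multiline split)."""
    # One or more blank lines (possibly containing whitespace) ends a paragraph
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = "Clause 1. Labor matters...\n\nClause 2. Governing law...\n\n\nClause 3. Severability..."
print(split_paragraphs(doc))
# → ['Clause 1. Labor matters...', 'Clause 2. Governing law...', 'Clause 3. Severability...']
```

Each resulting paragraph can then be sent through the classification pipeline as its own row.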
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `labor-matters` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_labor_matters_clause_en_1.0.0_3.2_1660123653618.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_labor_matters_clause_en_1.0.0_3.2_1660123653618.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_labor_matters_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +---------------+ |result | +---------------+ |[labor-matters]| |[other] | |[other] | |[labor-matters]| +---------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_labor_matters_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support labor-matters 0.94 0.97 0.95 30 other 0.98 0.96 0.97 56 accuracy - - 0.97 86 macro-avg 0.96 0.97 0.96 86 weighted-avg 0.97 0.97 0.97 86 ``` --- layout: model title: Irish Legal Roberta Embeddings author: John Snow Labs name: roberta_base_irish_legal date: 2023-02-16 tags: [gle, irish, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: gle edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-irish-roberta-base` is an Irish model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_irish_legal_gle_4.2.4_3.0_1676553431450.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_irish_legal_gle_4.2.4_3.0_1676553431450.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_irish_legal", "gle")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_irish_legal", "gle") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") ```
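Downstream, token or sentence embeddings like these are typically compared with cosine similarity. A minimal pure-Python illustration with toy vectors (these are not actual model outputs):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors; real RoBERTa base embeddings are 768-dimensional
v1 = [0.1, 0.3, 0.5]
v2 = [0.2, 0.6, 1.0]   # proportional to v1, so similarity is exactly 1.0
print(round(cosine_similarity(v1, v2), 4))
# → 1.0
```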
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_irish_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|gle| |Size:|415.9 MB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-irish-roberta-base --- layout: model title: English T5ForConditionalGeneration Small Cased model (from allenai) author: John Snow Labs name: t5_small_squad11 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-squad11` is an English model originally trained by `allenai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_squad11_en_4.3.0_3.0_1675155640438.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_squad11_en_4.3.0_3.0_1675155640438.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_squad11","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_squad11","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_squad11| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|148.2 MB| ## References - https://huggingface.co/allenai/t5-small-squad11 --- layout: model title: English BertForQuestionAnswering model (from DaisyMak) author: John Snow Labs name: bert_qa_bert_finetuned_squad_accelerate_10epoch_transformerfrozen date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad-accelerate-10epoch_transformerfrozen` is an English model originally trained by `DaisyMak`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_squad_accelerate_10epoch_transformerfrozen_en_4.0.0_3.0_1654535929048.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_squad_accelerate_10epoch_transformerfrozen_en_4.0.0_3.0_1654535929048.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_finetuned_squad_accelerate_10epoch_transformerfrozen","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_finetuned_squad_accelerate_10epoch_transformerfrozen","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_DaisyMak").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_finetuned_squad_accelerate_10epoch_transformerfrozen| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/DaisyMak/bert-finetuned-squad-accelerate-10epoch_transformerfrozen --- layout: model title: Spam Classifier author: John Snow Labs name: classifierdl_use_spam class: ClassifierDLModel language: en nav_key: models repository: public/models date: 03/07/2020 task: Text Classification edition: Spark NLP 2.5.3 spark_version: 2.4 tags: [classifier] supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Automatically identify messages as being regular messages or Spam. {:.h2_title} ## Predicted Entities ``spam``, ``ham`` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_EN_SPAM/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_EN_SPAM.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_spam_en_2.5.3_2.4_1593783318934.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_spam_en_2.5.3_2.4_1593783318934.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") use = UniversalSentenceEncoder.pretrained(lang="en") \ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") document_classifier = ClassifierDLModel.pretrained('classifierdl_use_spam', 'en') \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") nlpPipeline = Pipeline(stages=[documentAssembler, use, document_classifier]) light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate("Congratulations! You've won a $1,000 Walmart gift card. Go to http://bit.ly/1234 to claim now.") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val use = UniversalSentenceEncoder.pretrained(lang="en") .setInputCols(Array("document")) .setOutputCol("sentence_embeddings") val document_classifier = ClassifierDLModel.pretrained("classifierdl_use_spam", "en") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, use, document_classifier)) val data = Seq("Congratulations! You've won a $1,000 Walmart gift card. Go to http://bit.ly/1234 to claim now.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Congratulations! You've won a $1,000 Walmart gift card. Go to http://bit.ly/1234 to claim now."""] spam_df = nlu.load('classify.spam.use').predict(text, output_level='document') spam_df[["document", "spam"]] ```
{:.h2_title} ## Results ```bash +------------------------------------------------------------------------------------------------+------------+ |document |class | +------------------------------------------------------------------------------------------------+------------+ |Congratulations! You've won a $1,000 Walmart gift card. Go to http://bit.ly/1234 to claim now. | spam | +------------------------------------------------------------------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| | Model Name | classifierdl_use_spam | | Model Class | ClassifierDLModel | | Spark Compatibility | 2.5.3 | | Spark NLP Compatibility | 2.4 | | License | open source | | Edition | public | | Input Labels | [document, sentence_embeddings] | | Output Labels | [class] | | Language | en | | Upstream Dependencies | tfhub_use | {:.h2_title} ## Data Source This model is trained on UCI spam dataset. https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip {:.h2_title} ## Benchmarking Accuracy of the model with USE Embeddings is `0.86` ```bash precision recall f1-score support ham 0.86 1.00 0.92 1440 spam 0.00 0.00 0.00 238 accuracy 0.86 1678 macro avg 0.43 0.50 0.46 1678 weighted avg 0.74 0.86 0.79 1678 ``` --- layout: model title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah TFWav2Vec2ForCTC from nimrah author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark 
NLP.`asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah` is an English model originally trained by nimrah. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah_en_4.2.0_3.0_1664115813044.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah_en_4.2.0_3.0_1664115813044.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah", lang = "en") val annotations = pipeline.transform(audioDF) ```
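Both snippets assume an `audioDF` of raw audio samples that is never constructed above. How you build it depends on your setup, but decoding a 16-bit PCM WAV into normalized floats — the representation Wav2Vec2 pipelines consume, typically at 16 kHz mono — can be sketched in plain Python. The helper name is illustrative, not Spark NLP API:

```python
import io
import struct
import wave

def wav_to_floats(wav_bytes):
    """Decode 16-bit PCM WAV bytes into floats in [-1.0, 1.0]."""
    with wave.open(io.BytesIO(wav_bytes)) as wav:
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a tiny 16 kHz mono 16-bit WAV in memory for demonstration
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(16000)  # 16 kHz
    w.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))

floats = wav_to_floats(buf.getvalue())
print(floats[:2])
# → [0.0, 0.5]
```

A list of floats like this, placed in a DataFrame column, is what the pipeline's AudioAssembler stage reads.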
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: German ElectraForQuestionAnswering Distilled model (from deepset) author: John Snow Labs name: electra_qa_g_base_germanquad_distilled date: 2022-06-22 tags: [de, open_source, electra, question_answering] task: Question Answering language: de edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `gelectra-base-germanquad-distilled` is a German model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_g_base_germanquad_distilled_de_4.0.0_3.0_1655921806755.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_g_base_germanquad_distilled_de_4.0.0_3.0_1655921806755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_g_base_germanquad_distilled","de") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Was ist mein Name?", "Mein Name ist Clara und ich lebe in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_g_base_germanquad_distilled","de") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Was ist mein Name?", "Mein Name ist Clara und ich lebe in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.answer_question.electra.distilled_base").predict("""Was ist mein Name?|||Mein Name ist Clara und ich lebe in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_g_base_germanquad_distilled| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|de| |Size:|410.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/deepset/gelectra-base-germanquad-distilled - https://deepset.ai/germanquad - https://deepset.ai/german-bert - https://github.com/deepset-ai/FARM - https://github.com/deepset-ai/haystack/ - https://haystack.deepset.ai/community/join --- layout: model title: Part of Speech for Armenian author: John Snow Labs name: pos_ud_armtdp date: 2020-07-29 23:34:00 +0800 task: Part of Speech Tagging language: hy edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [pos, hy] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_armtdp_hy_2.5.5_2.4_1596053517801.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_armtdp_hy_2.5.5_2.4_1596053517801.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_armtdp", "hy") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Հյուսիսային թագավոր լինելուց բացի, Johnոն Սնոուն անգլիացի բժիշկ է և անզգայացման և բժշկական հիգիենայի զարգացման առաջատար:") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_armtdp", "hy") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Հյուսիսային թագավոր լինելուց բացի, Johnոն Սնոուն անգլիացի բժիշկ է և անզգայացման և բժշկական հիգիենայի զարգացման առաջատար:").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Հյուսիսային թագավոր լինելուց բացի, Johnոն Սնոուն անգլիացի բժիշկ է և անզգայացման և բժշկական հիգիենայի զարգացման առաջատար:"""] pos_df = nlu.load('hy.pos').predict(text, output_level='token') pos_df ```
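The `begin`/`end` fields of each `pos` annotation in the Results below are inclusive character offsets into the input string, so a token's surface form is `text[begin:end + 1]`. A quick plain-Python check using the sample sentence from the snippet above (no Spark required):

```python
text = ("Հյուսիսային թագավոր լինելուց բացի, Johnոն Սնոուն անգլիացի բժիշկ է և "
        "անզգայացման և բժշկական հիգիենայի զարգացման առաջատար:")

# begin/end are inclusive, so slice with text[begin:end + 1]
assert text[0:11] == "Հյուսիսային"   # begin=0,  end=10 -> tagged ADJ
assert text[12:19] == "թագավոր"      # begin=12, end=18 -> tagged ADJ
```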
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=10, result='ADJ', metadata={'word': 'Հյուսիսային'}), Row(annotatorType='pos', begin=12, end=18, result='ADJ', metadata={'word': 'թագավոր'}), Row(annotatorType='pos', begin=20, end=27, result='NOUN', metadata={'word': 'լինելուց'}), Row(annotatorType='pos', begin=29, end=32, result='ADP', metadata={'word': 'բացի'}), Row(annotatorType='pos', begin=33, end=33, result='PUNCT', metadata={'word': ','}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_armtdp| |Type:|pos| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|hy| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Legal Confidential Treatment Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_confidential_treatment_bert date: 2023-01-26 tags: [en, legal, classification, confidential, treatment, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_confidential_treatment_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `confidential-treatment` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. 
## Predicted Entities `confidential-treatment`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_confidential_treatment_bert_en_1.0.0_3.0_1674732030348.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_confidential_treatment_bert_en_1.0.0_3.0_1674732030348.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_confidential_treatment_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +------------------------+ |result | +------------------------+ |[confidential-treatment]| |[other] | |[other] | |[confidential-treatment]| +------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_confidential_treatment_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support confidential-treatment 0.98 0.98 0.98 55 other 0.99 0.99 0.99 116 accuracy - - 0.99 171 macro-avg 0.99 0.99 0.99 171 weighted-avg 0.99 0.99 0.99 171 ``` --- layout: model title: English image_classifier_vit_rock_challenge_ViT_two_by_two ViTForImageClassification from dimbyTa author: John Snow Labs name: image_classifier_vit_rock_challenge_ViT_two_by_two date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rock_challenge_ViT_two_by_two` is an English model originally trained by dimbyTa. ## Predicted Entities `fines`, `large`, `medium`, `pellets` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rock_challenge_ViT_two_by_two_en_4.1.0_3.0_1660168489231.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rock_challenge_ViT_two_by_two_en_4.1.0_3.0_1660168489231.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_rock_challenge_ViT_two_by_two", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_rock_challenge_ViT_two_by_two", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
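For intuition: the classification head scores each of this model's four labels, and the predicted class is the softmax argmax over those scores. A minimal, Spark-free illustration with made-up logits:

```python
import math

labels = ["fines", "large", "medium", "pellets"]  # this model's classes
logits = [0.2, 3.1, 0.4, 1.0]                     # invented example scores

exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]
print(labels[probs.index(max(probs))])  # large
```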
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_rock_challenge_ViT_two_by_two| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Part of Speech for Marathi author: John Snow Labs name: pos_ud_ufal date: 2020-07-29 23:34:00 +0800 task: Part of Speech Tagging language: mr edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [pos, mr] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_ufal_mr_2.5.5_2.4_1596054314811.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_ufal_mr_2.5.5_2.4_1596054314811.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_ufal", "mr") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("उत्तरेचा राजा होण्याव्यतिरिक्त, जॉन स्नो एक इंग्रज चिकित्सक आहे आणि भूल आणि वैद्यकीय स्वच्छतेच्या विकासासाठी अग्रगण्य आहे.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_ufal", "mr") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("उत्तरेचा राजा होण्याव्यतिरिक्त, जॉन स्नो एक इंग्रज चिकित्सक आहे आणि भूल आणि वैद्यकीय स्वच्छतेच्या विकासासाठी अग्रगण्य आहे.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""उत्तरेचा राजा होण्याव्यतिरिक्त, जॉन स्नो एक इंग्रज चिकित्सक आहे आणि भूल आणि वैद्यकीय स्वच्छतेच्या विकासासाठी अग्रगण्य आहे."""] pos_df = nlu.load('mr.pos').predict(text) pos_df ```
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=7, result='NOUN', metadata={'word': 'उत्तरेचा'}), Row(annotatorType='pos', begin=9, end=12, result='NOUN', metadata={'word': 'राजा'}), Row(annotatorType='pos', begin=14, end=29, result='NOUN', metadata={'word': 'होण्याव्यतिरिक्त'}), Row(annotatorType='pos', begin=30, end=30, result='PUNCT', metadata={'word': ','}), Row(annotatorType='pos', begin=32, end=34, result='NOUN', metadata={'word': 'जॉन'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_ufal| |Type:|pos| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|mr| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Fast Neural Machine Translation Model from English to Pangasinan author: John Snow Labs name: opus_mt_en_pag date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, pag, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `en` - target languages: `pag` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_pag_xx_2.7.0_2.4_1609169951806.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_pag_xx_2.7.0_2.4_1609169951806.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_pag", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_pag", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.pag').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_pag| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from jgammack) author: John Snow Labs name: distilbert_qa_sae_base_uncased_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `SAE-distilbert-base-uncased-squad` is an English model originally trained by `jgammack`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_sae_base_uncased_squad_en_4.3.0_3.0_1672765542477.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_sae_base_uncased_squad_en_4.3.0_3.0_1672765542477.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sae_base_uncased_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sae_base_uncased_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_sae_base_uncased_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/jgammack/SAE-distilbert-base-uncased-squad --- layout: model title: English image_classifier_vit_amgerindaf ViTForImageClassification from gaganpathre author: John Snow Labs name: image_classifier_vit_amgerindaf date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_amgerindaf` is an English model originally trained by gaganpathre. ## Predicted Entities `african`, `american`, `german`, `indian` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_amgerindaf_en_4.1.0_3.0_1660172489317.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_amgerindaf_en_4.1.0_3.0_1660172489317.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_amgerindaf", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_amgerindaf", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_amgerindaf| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: XLNet Large CoNLL-03 NER Pipeline author: John Snow Labs name: xlnet_large_token_classifier_conll03_pipeline date: 2022-06-19 tags: [open_source, ner, token_classifier, xlnet, conll03, large, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [xlnet_large_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/28/xlnet_large_token_classifier_conll03_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlnet_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655654301280.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlnet_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655654301280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("xlnet_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("xlnet_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ```
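The pipeline's NerConverter stage is what merges token-level IOB tags into the entity chunks shown in the Results section. A simplified, Spark-free sketch of that grouping, with hand-written tags for illustration:

```python
def iob_to_chunks(tokens, tags):
    """Group IOB tags (B-X begins a chunk, I-X continues it) into (text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" tag ends any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = "My name is John and I work at John Snow Labs .".split()
tags = ["O", "O", "O", "B-PERSON", "O", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O"]
print(iob_to_chunks(tokens, tags))  # [('John', 'PERSON'), ('John Snow Labs', 'ORG')]
```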
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PERSON | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlnet_large_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.4 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - XlnetForTokenClassification - NerConverter - Finisher --- layout: model title: English DistilBertForTokenClassification Cased model (from ml6team) author: John Snow Labs name: distilbert_token_classifier_keyphrase_extraction_kptimes date: 2023-03-03 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyphrase-extraction-distilbert-kptimes` is an English model originally trained by `ml6team`. ## Predicted Entities `KEY` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_kptimes_en_4.3.1_3.0_1677881468043.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_kptimes_en_4.3.1_3.0_1677881468043.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_kptimes","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_kptimes","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
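This model truncates input past its maximum sentence length of 128 tokens (see Model Information below). One common workaround, sketched here in plain Python and not part of the Spark NLP API, is to split long token sequences into overlapping windows and classify each window separately:

```python
def windows(tokens, max_len=128, stride=64):
    """Split a token list into overlapping windows of at most max_len tokens."""
    if len(tokens) <= max_len:
        return [tokens]
    out = []
    start = 0
    while start < len(tokens):
        out.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end
        start += stride
    return out

tokens = [f"tok{i}" for i in range(300)]
parts = windows(tokens)
print(len(parts), len(parts[0]))  # 4 128
```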
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_keyphrase_extraction_kptimes| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ml6team/keyphrase-extraction-distilbert-kptimes - https://arxiv.org/abs/1911.12559 - https://paperswithcode.com/sota?task=Keyphrase+Extraction&dataset=kptimes --- layout: model title: Chinese T5ForConditionalGeneration Cased model (from wawaup) author: John Snow Labs name: t5_mengzit5_comment date: 2023-01-30 tags: [zh, open_source, t5] task: Text Generation language: zh edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `MengziT5-Comment` is a Chinese model originally trained by `wawaup`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_mengzit5_comment_zh_4.3.0_3.0_1675098308627.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_mengzit5_comment_zh_4.3.0_3.0_1675098308627.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_mengzit5_comment","zh") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_mengzit5_comment","zh") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_mengzit5_comment| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|zh| |Size:|1.0 GB| ## References - https://huggingface.co/wawaup/MengziT5-Comment - https://github.com/lancopku/Graph-to-seq-comment-generation --- layout: model title: Legal NER for MAPA(Multilingual Anonymisation for Public Administrations) author: John Snow Labs name: legner_mapa date: 2023-04-27 tags: [el, ner, legal, mapa, licensed] task: Named Entity Recognition language: el edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union. This model extracts `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, and `PERSON` entities from `Greek` documents. ## Predicted Entities `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, `PERSON` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_mapa_el_1.0.0_3.0_1682590655353.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_mapa_el_1.0.0_3.0_1682590655353.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_el_cased", "el")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_mapa", "el", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""86 Στην υπόθεση της κύριας δίκης, προκύπτει ότι ορισμένοι εργαζόμενοι της Martin‑Meat αποσπάσθηκαν στην Αυστρία κατά την περίοδο μεταξύ του έτους 2007 και του έτους 2012, για την εκτέλεση εργασιών τεμαχισμού κρέατος σε εγκαταστάσεις της Alpenrind."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
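The macro and weighted averages in the Benchmarking section below can be recomputed from the per-label F1 scores and supports: macro is the unweighted mean across labels, while weighted scales each label's score by its support.

```python
# Per-label (f1, support) pairs from this model's benchmark table
f1 = {"ADDRESS": (0.94, 16), "AMOUNT": (0.78, 12), "DATE": (0.98, 65),
      "ORGANISATION": (0.85, 40), "PERSON": (0.92, 38)}

macro = sum(s for s, _ in f1.values()) / len(f1)           # unweighted mean
total = sum(n for _, n in f1.values())
weighted = sum(s * n for s, n in f1.values()) / total      # support-weighted mean

print(round(macro, 2), round(weighted, 2))  # 0.89 0.92
```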
## Results ```bash +-----------+------------+ |chunk |ner_label | +-----------+------------+ |Martin‑Meat|ORGANISATION| |Αυστρία |ADDRESS | |2007 |DATE | |2012 |DATE | |Alpenrind |ORGANISATION| +-----------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_mapa| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|el| |Size:|16.4 MB| ## References The dataset is available [here](https://huggingface.co/datasets/joelito/mapa). ## Benchmarking ```bash label precision recall f1-score support ADDRESS 0.89 1.00 0.94 16 AMOUNT 0.82 0.75 0.78 12 DATE 0.98 0.98 0.98 65 ORGANISATION 0.85 0.85 0.85 40 PERSON 0.90 0.95 0.92 38 macro-avg 0.89 0.91 0.90 171 weighted-avg 0.91 0.93 0.92 171 ``` --- layout: model title: English DistilBertForTokenClassification Cased model (from ml6team) author: John Snow Labs name: distilbert_token_classifier_keyphrase_extraction_kptimes date: 2023-03-06 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyphrase-extraction-distilbert-kptimes` is an English model originally trained by `ml6team`.
## Predicted Entities `KEY` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_kptimes_en_4.3.1_3.0_1678133866327.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_kptimes_en_4.3.1_3.0_1678133866327.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_kptimes","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_kptimes","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_keyphrase_extraction_kptimes| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ml6team/keyphrase-extraction-distilbert-kptimes - https://arxiv.org/abs/1911.12559 - https://paperswithcode.com/sota?task=Keyphrase+Extraction&dataset=kptimes --- layout: model title: Ganda asr_wav2vec2_luganda_by_cahya TFWav2Vec2ForCTC from cahya author: John Snow Labs name: pipeline_asr_wav2vec2_luganda_by_cahya date: 2022-09-24 tags: [wav2vec2, lg, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: lg edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_luganda_by_cahya` is a Ganda model originally trained by cahya. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_luganda_by_cahya_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_luganda_by_cahya_lg_4.2.0_3.0_1664037813297.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_luganda_by_cahya_lg_4.2.0_3.0_1664037813297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_luganda_by_cahya', lang = 'lg') annotations = pipeline.transform(audioDF) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_luganda_by_cahya", lang = "lg") val annotations = pipeline.transform(audioDF) ```
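The `audioDF` above must hold each recording's raw waveform as an array of floats. A minimal, stdlib-only sketch of converting 16-bit PCM samples into that format — the 16 kHz mono sample rate, the normalization convention, and the `audio_content` column name are assumptions taken from typical Wav2vec2 usage, so verify them against your own data:

```python
# Sketch (assumption: the pipeline expects raw audio as a list of floats in
# [-1, 1], typically 16 kHz mono). Converts 16-bit PCM integers — e.g. as read
# from a WAV file via the stdlib `wave` module — to normalized floats.
import array
import math

def pcm16_to_floats(samples):
    """Normalize 16-bit PCM integers to floats in [-1.0, 1.0]."""
    return [s / 32768.0 for s in samples]

# Stand-in for samples read from a real file: a 440 Hz tone, 16 kHz, 0.1 s.
sr = 16000
pcm = array.array("h", (int(32767 * 0.5 * math.sin(2 * math.pi * 440 * t / sr))
                        for t in range(sr // 10)))
floats = pcm16_to_floats(pcm)

# The float list would then be wrapped into the expected DataFrame, e.g.:
#   audioDF = spark.createDataFrame([[floats]], ["audio_content"])
print(len(floats))  # → 1600
```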
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_luganda_by_cahya| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|lg| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: NER Model Finder author: John Snow Labs name: ner_model_finder date: 2022-09-05 tags: [pretrainedpipeline, clinical, ner, en, licensed] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is trained with bert embeddings and can be used to find the most appropriate NER model given the entity name. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_MODEL_FINDER/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.Pretrained_Clinical_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_model_finder_en_4.1.0_3.0_1662378666469.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_model_finder_en_4.1.0_3.0_1662378666469.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline ner_pipeline = PretrainedPipeline("ner_model_finder", "en", "clinical/models") result = ner_pipeline.annotate("medication") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val ner_pipeline = PretrainedPipeline("ner_model_finder","en","clinical/models") val result = ner_pipeline.annotate("medication") ``` {:.nlu-block} ```python import nlu nlu.load("en.ner.model_finder.pipeline").predict("""Put your text here.""") ```
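As the Results section shows, `annotate` returns the matching model names serialized as a single string inside `model_names`. A small pure-Python sketch of turning that string into a usable list — the shortened `result` dict here is a stand-in for the real pipeline output:

```python
# Sketch: parse the string-encoded list that ner_model_finder returns under
# `model_names`. The dict below is a shortened stand-in for the real output.
import ast

result = {"model_names": ["['ner_posology_greedy', 'ner_drugs_large', 'ner_pathogen']"]}

model_names = ast.literal_eval(result["model_names"][0])
print(model_names)  # → ['ner_posology_greedy', 'ner_drugs_large', 'ner_pathogen']
```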
## Results ```bash {'model_names': ["['ner_posology_greedy', 'jsl_ner_wip_modifier_clinical', 'ner_posology_small', 'ner_jsl_greedy', 'ner_ade_clinical', 'ner_posology', 'ner_risk_factors', 'ner_ade_healthcare', 'ner_drugs_large', 'ner_jsl_slim', 'ner_posology_experimental', 'ner_posology_large', 'ner_posology_healthcare', 'ner_drugs_greedy', 'ner_pathogen']"]} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_model_finder| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|155.9 MB| ## Included Models - DocumentAssembler - BertSentenceEmbeddings - SentenceEntityResolverModel - Finisher --- layout: model title: Korean asr_wav2vec2_large_xlsr_korean TFWav2Vec2ForCTC from kresnik author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_korean date: 2022-09-25 tags: [wav2vec2, ko, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: ko edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_korean` is a Korean model originally trained by kresnik. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_korean_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_korean_ko_4.2.0_3.0_1664112533768.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_korean_ko_4.2.0_3.0_1664112533768.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_korean', lang = 'ko') annotations = pipeline.transform(audioDF) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_korean", lang = "ko") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_korean| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|ko| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Chinese Bert Embeddings (Base, captions dataset) author: John Snow Labs name: bert_embeddings_mengzi_oscar_base_caption date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `mengzi-oscar-base-caption` is a Chinese model originally trained by `Langboat`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_mengzi_oscar_base_caption_zh_3.4.2_3.0_1649670622527.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_mengzi_oscar_base_caption_zh_3.4.2_3.0_1649670622527.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_mengzi_oscar_base_caption","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_mengzi_oscar_base_caption","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.mengzi_oscar_base_caption").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_mengzi_oscar_base_caption| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|383.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/Langboat/mengzi-oscar-base-caption - https://arxiv.org/abs/2110.06696 - https://github.com/Langboat/Mengzi/blob/main/Mengzi-Oscar.md - https://github.com/microsoft/Oscar/blob/master/INSTALL.md --- layout: model title: Catalan RobertaForQuestionAnswering Base Cased model (from projecte-aina) author: John Snow Labs name: roberta_qa_base_ca_cased date: 2022-12-02 tags: [ca, open_source, roberta, question_answering, tensorflow] task: Question Answering language: ca edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-ca-cased-qa` is a Catalan model originally trained by `projecte-aina`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_ca_cased_ca_4.2.4_3.0_1669986048039.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_ca_cased_ca_4.2.4_3.0_1669986048039.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_ca_cased","ca")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_ca_cased","ca") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_ca_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ca| |Size:|451.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/projecte-aina/roberta-base-ca-cased-qa - https://arxiv.org/abs/1907.11692 - https://github.com/projecte-aina/club - https://www.apache.org/licenses/LICENSE-2.0 - https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca%7Cen - https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina --- layout: model title: German asr_wav2vec2_large_xlsr_german_demo TFWav2Vec2ForCTC from marcel author: John Snow Labs name: asr_wav2vec2_large_xlsr_german_demo date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_german_demo` is a German model originally trained by marcel. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_german_demo_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_german_demo_de_4.2.0_3.0_1664103787136.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_german_demo_de_4.2.0_3.0_1664103787136.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_german_demo", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_german_demo", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_german_demo| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: English DistilBertForQuestionAnswering Cased model (from evegarcianz) author: John Snow Labs name: distilbert_qa_finetuned_adversarial date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-adversarial_qa` is an English model originally trained by `evegarcianz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_finetuned_adversarial_en_4.3.0_3.0_1672765744855.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_finetuned_adversarial_en_4.3.0_3.0_1672765744855.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_finetuned_adversarial","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_finetuned_adversarial","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_finetuned_adversarial| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/evegarcianz/bert-finetuned-adversarial_qa --- layout: model title: Legal Portuguese Embeddings (Base, Agreements) author: John Snow Labs name: bert_embeddings_bert_base_portuguese_cased_finetuned_tcu_acordaos date: 2022-04-11 tags: [bert, embeddings, pt, open_source] task: Embeddings language: pt edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-portuguese-cased-finetuned-tcu-acordaos` is a Portuguese model originally trained by `Luciano`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_portuguese_cased_finetuned_tcu_acordaos_pt_3.4.2_3.0_1649674108376.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_portuguese_cased_finetuned_tcu_acordaos_pt_3.4.2_3.0_1649674108376.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_portuguese_cased_finetuned_tcu_acordaos","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Eu amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_portuguese_cased_finetuned_tcu_acordaos","pt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Eu amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.embed.bert_base_portuguese_cased_finetuned_tcu_acordaos").predict("""Eu amo Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_portuguese_cased_finetuned_tcu_acordaos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|pt| |Size:|408.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/Luciano/bert-base-portuguese-cased-finetuned-tcu-acordaos --- layout: model title: Fast Neural Machine Translation Model from English to French-Based Creoles And Pidgins author: John Snow Labs name: opus_mt_en_cpf date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, cpf, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `cpf` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_cpf_xx_2.7.0_2.4_1609169519834.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_cpf_xx_2.7.0_2.4_1609169519834.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_cpf", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_cpf", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.cpf').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_cpf| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Smaller BERT Embeddings (L-10_H-512_A-8) author: John Snow Labs name: small_bert_L10_512 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L10_512_en_2.6.0_2.4_1598344780916.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L10_512_en_2.6.0_2.4_1598344780916.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L10_512", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L10_512", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L10_512').predict(text, output_level='token') embeddings_df ```
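Once extracted, token vectors from the `embeddings` column can be compared directly, for example with cosine similarity. A stdlib-only sketch — the three-dimensional stand-in vectors below are purely illustrative, since this model actually produces 512-dimensional vectors:

```python
# Sketch: cosine similarity between two token embedding vectors, as returned
# in the embeddings column. Real vectors from this model are 512-dimensional;
# short stand-ins are used here so the example is self-contained.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

tok_love = [-0.2278, 0.1580, 1.0]   # stand-in token vector
tok_nlp = [0.2888, 0.4943, -0.4]    # stand-in token vector

print(round(cosine(tok_love, tok_love), 4))  # identical vectors → 1.0
print(round(cosine(tok_love, tok_nlp), 4))
```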
{:.h2_title} ## Results ```bash token en_embed_bert_small_L10_512_embeddings I [0.08983156085014343, 0.6781706809997559, -0.1... love [-0.22787825763225555, 0.15800981223583221, 1.... NLP [0.2888692617416382, 0.49437081813812256, -0.4... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|small_bert_L10_512| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|512| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-512_A-8/1 --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_12_h_256 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-12_H-256` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_256_zh_4.2.4_3.0_1670325800870.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_256_zh_4.2.4_3.0_1670325800870.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_256","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_256","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_12_h_256| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|57.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-12_H-256 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Legal Capitalization Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_capitalization_bert date: 2023-03-05 tags: [en, legal, classification, clauses, capitalization, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Capitalization` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques shown in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Note that the embeddings used by this model allow up to 512 tokens. If your text is longer, consider splitting it into smaller pieces (you can also check the tutorial linked above).

This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, producing a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`Capitalization`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_capitalization_bert_en_1.0.0_3.0_1678050537123.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_capitalization_bert_en_1.0.0_3.0_1678050537123.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_capitalization_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
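As the description notes, long filings should be broken into provisions before classification. A minimal paragraph-splitting sketch in plain Python (the helper name and regex are illustrative, not part of Spark NLP):

```python
import re

def split_paragraphs(text):
    """Split a document into paragraphs on runs of blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Clause 1. Capitalization ...\n\nClause 2. Governing Law ...\n\nClause 3. Notices ..."
paragraphs = split_paragraphs(doc)

# Each paragraph becomes one row of the "text" column fed to the pipeline above.
rows = [[p] for p in paragraphs]
```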
## Results

```bash
+----------------+
|result          |
+----------------+
|[Capitalization]|
|[Other]         |
|[Other]         |
|[Capitalization]|
+----------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_capitalization_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
         label  precision  recall  f1-score  support
Capitalization       0.92    0.96      0.94       48
         Other       0.97    0.94      0.96       70
      accuracy          -       -      0.95      118
     macro-avg       0.95    0.95      0.95      118
  weighted-avg       0.95    0.95      0.95      118
```

---
layout: model
title: Part of Speech for Swedish
author: John Snow Labs
name: pos_ud_tal
date: 2020-05-04 23:32:00 +0800
task: Part of Speech Tagging
language: sv
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [pos, sv]
supported: true
annotator: PerceptronModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

{:.h2_title}
## Description

This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part-of-speech model is useful for extracting the grammatical structure of a piece of text automatically.
{:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_tal_sv_2.5.0_2.4_1588622711284.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_tal_sv_2.5.0_2.4_1588622711284.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_tal", "sv") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Förutom att vara kungen i norr är John Snow en engelsk läkare och en ledare inom utveckling av anestesi och medicinsk hygien.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_tal", "sv") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Förutom att vara kungen i norr är John Snow en engelsk läkare och en ledare inom utveckling av anestesi och medicinsk hygien.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Förutom att vara kungen i norr är John Snow en engelsk läkare och en ledare inom utveckling av anestesi och medicinsk hygien."""] pos_df = nlu.load('sv.pos.ud_tal').predict(text, output_level='token') pos_df ```
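Each annotation returned by `fullAnnotate` carries the tag in its `result` field and the original word in its metadata, so pairing them up is straightforward. A small post-processing sketch (the sample rows are hypothetical, mirroring the shape of LightPipeline POS output):

```python
# Hypothetical annotation rows mirroring LightPipeline POS output:
# 'result' holds the tag, metadata holds the surface form.
annotations = [
    {"result": "ADP",  "metadata": {"word": "Förutom"}},
    {"result": "PART", "metadata": {"word": "att"}},
    {"result": "AUX",  "metadata": {"word": "vara"}},
]

def to_word_tag_pairs(rows):
    """Turn POS annotations into (word, tag) tuples."""
    return [(r["metadata"]["word"], r["result"]) for r in rows]

pairs = to_word_tag_pairs(annotations)
```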
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=6, result='ADP', metadata={'word': 'Förutom'}), Row(annotatorType='pos', begin=8, end=10, result='PART', metadata={'word': 'att'}), Row(annotatorType='pos', begin=12, end=15, result='AUX', metadata={'word': 'vara'}), Row(annotatorType='pos', begin=17, end=22, result='NOUN', metadata={'word': 'kungen'}), Row(annotatorType='pos', begin=24, end=24, result='ADP', metadata={'word': 'i'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_tal| |Type:|pos| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|sv| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Pipeline to Detect PHI for Deidentification (Generic - Augmented) author: John Snow Labs name: ner_deid_generic_augmented_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, deidentification, generic, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_deid_generic_augmented](https://nlp.johnsnowlabs.com/2021/06/30/ner_deid_generic_augmented_en.html) model. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_augmented_pipeline_en_3.4.1_3.0_1647869128382.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_augmented_pipeline_en_3.4.1_3.0_1647869128382.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_generic_augmented_pipeline", "en", "clinical/models") pipeline.annotate("A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227.") ``` ```scala val pipeline = new PretrainedPipeline("ner_deid_generic_augmented_pipeline", "en", "clinical/models") pipeline.annotate("A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.deid_generic_augmented.pipeline").predict("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227.""") ```
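Deidentification usually ends with replacing each detected chunk by its entity label. A minimal post-processing sketch (the chunk/label pairs are hardcoded here for illustration; in practice they come from the pipeline's NER output):

```python
def mask_phi(text, chunks):
    """Replace each detected chunk with its entity label in angle brackets."""
    for chunk, label in chunks:
        text = text.replace(chunk, "<" + label + ">")
    return text

sample = "Record date : 2093-01-13, David Hale, M.D."
detected = [("2093-01-13", "DATE"), ("David Hale", "NAME")]
masked = mask_phi(sample, detected)
# masked == "Record date : <DATE>, <NAME>, M.D."
```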
## Results

```bash
+-------------------------------------------------+---------+
|chunk                                            |ner_label|
+-------------------------------------------------+---------+
|2093-01-13                                       |DATE     |
|David Hale                                       |NAME     |
|Hendrickson                                      |NAME     |
|Ora MR.                                          |LOCATION |
|7194334                                          |ID       |
|01/13/93                                         |DATE     |
|Oliveira                                         |NAME     |
|25                                               |AGE      |
|1-11-2000                                        |DATE     |
|Cocke County Baptist Hospital. 0295 Keats Street.|LOCATION |
|(302) 786-5227                                   |CONTACT  |
+-------------------------------------------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_deid_generic_augmented_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter

---
layout: model
title: Detect Drug Information (Large)
author: John Snow Labs
name: ner_posology_large
date: 2021-03-31
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained named entity recognition deep learning model for posology. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.

## Predicted Entities

`DOSAGE`, `DRUG`, `DURATION`, `FORM`, `FREQUENCY`, `ROUTE`, `STRENGTH`.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_large_en_3.0.0_3.0_1617207221150.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_large_en_3.0.0_3.0_1617207221150.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_posology_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("entities") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. 
The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.']], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_posology_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("entities") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was 
originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology.large").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. 
She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""") ```
## Results

```bash
+--------------+---------+
|chunk         |ner      |
+--------------+---------+
|insulin       |DRUG     |
|Bactrim       |DRUG     |
|for 14 days   |DURATION |
|Fragmin       |DRUG     |
|5000 units    |DOSAGE   |
|subcutaneously|ROUTE    |
|daily         |FREQUENCY|
|Xenaderm      |DRUG     |
|topically     |ROUTE    |
|b.i.d         |FREQUENCY|
|Lantus        |DRUG     |
|40 units      |DOSAGE   |
|subcutaneously|ROUTE    |
|at bedtime    |FREQUENCY|
|OxyContin     |DRUG     |
|30 mg         |STRENGTH |
|p.o           |ROUTE    |
|q.12 h        |FREQUENCY|
|folic acid    |DRUG     |
|1 mg          |STRENGTH |
+--------------+---------+
only showing top 20 rows
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_posology_large|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|

## Data Source

Trained on the 2018 i2b2 dataset and FDA Drug datasets with `embeddings_clinical`.

https://open.fda.gov/

## Benchmarking

```bash
|    | label         | tp     | fp    | fn    | prec     | rec      | f1       |
|---:|--------------:|-------:|------:|------:|---------:|---------:|---------:|
|  0 | B-DRUG        | 30096  | 1952  | 1265  | 0.939091 | 0.959663 | 0.949266 |
|  1 | B-STRENGTH    | 18379  | 1286  | 1142  | 0.934605 | 0.941499 | 0.938039 |
|  2 | I-DURATION    | 6346   | 692   | 391   | 0.901677 | 0.941962 | 0.921379 |
|  3 | I-STRENGTH    | 21368  | 2752  | 1411  | 0.885904 | 0.938057 | 0.911235 |
|  4 | I-FREQUENCY   | 18406  | 1525  | 1112  | 0.923486 | 0.943027 | 0.933154 |
|  5 | B-FORM        | 11297  | 1276  | 726   | 0.898513 | 0.939616 | 0.918605 |
|  6 | B-DOSAGE      | 3731   | 611   | 765   | 0.859281 | 0.829849 | 0.844309 |
|  7 | I-DOSAGE      | 2100   | 734   | 887   | 0.741002 | 0.703047 | 0.721526 |
|  8 | I-DRUG        | 11853  | 1364  | 1202  | 0.8968   | 0.907928 | 0.902329 |
|  9 | I-ROUTE       | 227    | 31    | 56    | 0.879845 | 0.80212  | 0.839187 |
| 10 | B-ROUTE       | 5870   | 436   | 488   | 0.930859 | 0.923246 | 0.927037 |
| 11 | B-DURATION    | 2493   | 313   | 205   | 0.888453 | 0.924018 | 0.905887 |
| 12 | B-FREQUENCY   | 12648  | 709   | 430   | 0.946919 | 0.96712  | 0.956913 |
| 13 | I-FORM        | 919    | 472   | 502   | 0.660676 | 0.646728 | 0.653627 |
| 14 | Macro-average | 145733 | 14153 | 10582 | 0.877651 | 0.88342  | 0.880526 |
| 15 | Micro-average | 145733 | 14153 | 10582 | 0.911481 | 0.932303 | 0.921774 |
```

---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from mrm8488)
author: John Snow Labs
name: roberta_qa_base_1b_1_finetuned_squadv1
date: 2022-12-02
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-1B-1-finetuned-squadv1` is an English model originally trained by `mrm8488`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_1b_1_finetuned_squadv1_en_4.2.4_3.0_1669985495535.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_1b_1_finetuned_squadv1_en_4.2.4_3.0_1669985495535.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

question_answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_1b_1_finetuned_squadv1","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[document_assembler, question_answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val questionAnswering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_1b_1_finetuned_squadv1","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, questionAnswering))

val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_1b_1_finetuned_squadv1|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|446.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/mrm8488/roberta-base-1B-1-finetuned-squadv1
- https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/
- https://twitter.com/mrm8488
- https://www.linkedin.com/in/manuel-romero-cs/

---
layout: model
title: English RobertaForQuestionAnswering (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_10
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-1024-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_10_en_4.0.0_3.0_1655730685006.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_10_en_4.0.0_3.0_1655730685006.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_10","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = RoBertaForQuestionAnswering
    .pretrained("roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_10","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.base_1024d_seed_10").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_10|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|439.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-1024-finetuned-squad-seed-10

---
layout: model
title: Legal Application Of Trust Money Clause Binary Classifier
author: John Snow Labs
name: legclf_application_of_trust_money_clause
date: 2023-01-27
tags: [en, legal, classification, application, trust, money, clauses, application_of_trust_money, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `application-of-trust-money` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques shown in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Note that the embeddings used by this model allow up to 512 tokens.
If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).

This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, producing a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`application-of-trust-money`, `other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_application_of_trust_money_clause_en_1.0.0_3.0_1674820460698.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_application_of_trust_money_clause_en_1.0.0_3.0_1674820460698.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_application_of_trust_money_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
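Since each binary clause classifier emits either its clause label or `other`, stacking several of them reduces to collecting the labels that fired. A toy aggregation sketch (model names and outputs are hardcoded for illustration, not produced by a real pipeline run):

```python
# Hypothetical per-classifier outputs for one contract provision.
outputs = {
    "legclf_application_of_trust_money_clause": "application-of-trust-money",
    "legclf_capitalization_bert": "other",
}

def detected_clauses(results):
    """Keep only the clause types whose binary classifier fired."""
    return [label for label in results.values() if label.lower() != "other"]

clauses = detected_clauses(outputs)
```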
## Results

```bash
+----------------------------+
|result                      |
+----------------------------+
|[application-of-trust-money]|
|[other]                     |
|[other]                     |
|[application-of-trust-money]|
+----------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_application_of_trust_money_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
                     label  precision  recall  f1-score  support
application-of-trust-money       1.00    0.89      0.94       18
                     other       0.95    1.00      0.97       36
                  accuracy          -       -      0.96       54
                 macro-avg       0.97    0.94      0.96       54
              weighted-avg       0.96    0.96      0.96       54
```

---
layout: model
title: Word2Vec Embeddings in Breton (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-14
tags: [cc, embeddings, fastText, word2vec, br, open_source]
task: Embeddings
language: br
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Word Embeddings lookup annotator that maps tokens to vectors.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_br_3.4.1_3.0_1647287953369.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_br_3.4.1_3.0_1647287953369.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","br") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","br") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("br.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
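Once tokens are mapped to vectors, a common next step is comparing them. A self-contained cosine-similarity sketch in plain Python (the toy 3-d vectors stand in for the 300-d embeddings this model emits):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vec_a = [0.1, 0.3, 0.5]
vec_b = [0.2, 0.1, 0.4]
similarity = cosine(vec_a, vec_b)
```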
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|br|
|Size:|351.5 MB|
|Case sensitive:|false|
|Dimension:|300|

---
layout: model
title: Abkhazian asr_wav2vec2_common_voice_ab_demo TFWav2Vec2ForCTC from patrickvonplaten
author: John Snow Labs
name: pipeline_asr_wav2vec2_common_voice_ab_demo
date: 2022-09-24
tags: [wav2vec2, ab, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: ab
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_common_voice_ab_demo` is an Abkhazian model originally trained by patrickvonplaten.

NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use pipeline_asr_wav2vec2_common_voice_ab_demo_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_common_voice_ab_demo_ab_4.2.0_3.0_1664042317411.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_common_voice_ab_demo_ab_4.2.0_3.0_1664042317411.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_common_voice_ab_demo', lang = 'ab') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_common_voice_ab_demo", lang = "ab") val annotations = pipeline.transform(audioDF) ```
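The snippets above assume an `audioDF` already exists. How you build it depends on your setup, but the audio column these Wav2Vec2 pipelines consume is essentially a sequence of float samples. A hedged, standard-library-only sketch (the `audio_content` column name is assumed from the AudioAssembler examples on this page) of turning a mono 16-bit PCM WAV into such a float list:

```python
import io
import struct
import wave

def wav_to_floats(wav_bytes):
    """Decode a mono 16-bit PCM WAV into floats in [-1.0, 1.0]."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a tiny in-memory WAV so the sketch is self-contained
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit
    w.setframerate(16000)  # wav2vec2 models are typically trained on 16 kHz audio
    w.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))

floats = wav_to_floats(buf.getvalue())
print(floats)  # [0.0, 0.5, -0.5, ~0.99997]
# A Spark DataFrame could then be created along the lines of:
# audioDF = spark.createDataFrame([(floats,)], ["audio_content"])
```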
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_common_voice_ab_demo| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|ab| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English DistilBertForQuestionAnswering Base Cased model author: John Snow Labs name: distilbert_qa_base_cased_led_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad` is an English model originally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_en_4.3.0_3.0_1672766495924.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_en_4.3.0_3.0_1672766495924.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
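Extractive QA models such as this one do not generate free text: they predict where an answer span starts and ends inside `document_context`, and the `answer` annotation carries those offsets. A hedged, framework-free sketch of what that final step amounts to (function name is ours; offsets shown end-exclusive for the slice, whereas Spark NLP annotations report inclusive begin/end positions):

```python
def extract_span(context, begin, end):
    """Slice the predicted answer span out of the context (end-exclusive)."""
    return context[begin:end]

context = "My name is Clara and I live in Berkeley."
print(extract_span(context, 11, 16))  # Clara
```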
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_cased_led_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/distilbert-base-cased-distilled-squad - https://arxiv.org/abs/1910.01108 - https://aclanthology.org/2021.acl-long.330.pdf - https://dl.acm.org/doi/pdf/10.1145/3442188.3445922 - https://yknzhu.wixsite.com/mbweb - https://en.wikipedia.org/wiki/English_Wikipedia - https://mlco2.github.io/impact#compute - https://arxiv.org/abs/1910.09700 - https://arxiv.org/pdf/1910.01108.pdf - https://paperswithcode.com/sota?task=Question+Answering&dataset=squad --- layout: model title: Visual NER on 10K Filings (SEC) author: John Snow Labs name: visualner_10kfilings date: 2022-09-21 tags: [en, licensed] task: OCR Object Detection language: en nav_key: models edition: Visual NLP 4.0.0 spark_version: 3.2 supported: true annotator: VisualDocumentNERv21 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Visual NER model aimed at extracting the main key points from the summary page of SEC 10-K filings (annual reports). ## Predicted Entities `REGISTRANT`, `ADDRESS`, `PHONE`, `DATE`, `EMPLOYERIDNB`, `EXCHANGE`, `STATE`, `STOCKCLASS`, `STOCKVALUE`, `TRADINGSYMBOL`, `FILENUMBER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/visualner_10kfilings_en_4.0.0_3.2_1663769328577.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/visualner_10kfilings_en_4.0.0_3.2_1663769328577.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python binary_to_image = BinaryToImage()\ .setOutputCol("image") \ .setImageType(ImageType.TYPE_3BYTE_BGR) img_to_hocr = ImageToHocr()\ .setInputCol("image")\ .setOutputCol("hocr")\ .setIgnoreResolution(False)\ .setOcrParams(["preserve_interword_spaces=0"]) tokenizer = HocrTokenizer()\ .setInputCol("hocr")\ .setOutputCol("token") doc_ner = VisualDocumentNerV21()\ .pretrained("visualner_10kfilings", "en", "clinical/models")\ .setInputCols(["token", "image"])\ .setOutputCol("entities") draw = ImageDrawAnnotations() \ .setInputCol("image") \ .setInputChunksCol("entities") \ .setOutputCol("image_with_annotations") \ .setFontSize(10) \ .setLineWidth(4)\ .setRectColor(Color.red) # OCR pipeline pipeline = PipelineModel(stages=[ binary_to_image, img_to_hocr, tokenizer, doc_ner, draw ]) import pyspark.sql.functions as f bin_df = spark.read.format("binaryFile").load('data/t01.jpg') bin_df.show() results = pipeline.transform(bin_df).cache() res = results.collect() ## since pyspark 2.3 doesn't have element_at, 'getItem' is invoked path_array = f.split(results['path'], '/') # from pyspark 2.4 # results.withColumn("filename", f.element_at(f.split("path", "/"), -1)) \ results.withColumn('filename', path_array.getItem(f.size(path_array)- 1)) \ .withColumn("exploded_entities", f.explode("entities")) \ .select("filename", "exploded_entities") \ .show(truncate=False) ```
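The `getItem(f.size(path_array) - 1)` workaround in the snippet above simply selects the last path segment, i.e. the file name. A plain-Python sketch of the same logic (function name is ours), useful for sanity-checking the Spark expression:

```python
def last_path_segment(path):
    """Equivalent of f.element_at(f.split(path, '/'), -1) / the getItem fallback."""
    parts = path.split('/')
    return parts[len(parts) - 1]

print(last_path_segment('data/t01.jpg'))  # t01.jpg
```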
## Results ```bash +--------+----------------------------------------------------------------------------------------------------------------------------------------------------------+ |filename|exploded_entities | +--------+----------------------------------------------------------------------------------------------------------------------------------------------------------+ |t01.jpg |{named_entity, 712, 716, OTHERS, {confidence -> 96, width -> 74, x -> 1557, y -> 416, word -> Ended, token -> ended, height -> 18}, []} | |t01.jpg |{named_entity, 718, 724, DATE-B, {confidence -> 96, width -> 97, x -> 1639, y -> 416, word -> January, token -> january, height -> 24}, []} | |t01.jpg |{named_entity, 726, 727, DATE-I, {confidence -> 95, width -> 34, x -> 1743, y -> 416, word -> 31,, token -> 31, height -> 22}, []} | |t01.jpg |{named_entity, 730, 733, DATE-I, {confidence -> 96, width -> 54, x -> 1785, y -> 416, word -> 2021, token -> 2021, height -> 18}, []} | |t01.jpg |{named_entity, 735, 744, OTHERS, {confidence -> 91, width -> 143, x -> 1372, y -> 472, word -> Commission, token -> commission, height -> 18}, []} | |t01.jpg |{named_entity, 746, 749, OTHERS, {confidence -> 96, width -> 36, x -> 1523, y -> 472, word -> file, token -> file, height -> 18}, []} | |t01.jpg |{named_entity, 751, 756, OTHERS, {confidence -> 92, width -> 96, x -> 1568, y -> 472, word -> number:, token -> number, height -> 18}, []} | |t01.jpg |{named_entity, 759, 761, FILENUMBER-B, {confidence -> 92, width -> 119, x -> 1675, y -> 472, word -> 001-39495, token -> 001, height -> 18}, []} | |t01.jpg |{named_entity, 769, 773, REGISTRANT-B, {confidence -> 92, width -> 136, x -> 1472, y -> 558, word -> ASANA,, token -> asana, height -> 31}, []} | |t01.jpg |{named_entity, 776, 778, REGISTRANT-I, {confidence -> 95, width -> 72, x -> 1620, y -> 558, word -> INC., token -> inc, height -> 25}, []} 
+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|visualner_10kfilings| |Type:|ocr| |Compatibility:|Visual NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|744.4 MB| ## References SEC 10k filings --- layout: model title: Detect Clinical Entities (ner_jsl_enriched) author: John Snow Labs name: ner_jsl_enriched date: 2021-10-22 tags: [ner, licensed, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.0 spark_version: 3.0 supported: true recommended: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terminology. This model is capable of predicting up to `87` different entities and is based on `ner_jsl`. Definitions of Predicted Entities: - `Injury_or_Poisoning`: Physical harm or injury caused to the body, including those caused by accidents, falls, or poisoning of a patient or someone else. - `Direction`: All the information relating to the laterality of the internal and external organs. - `Test`: Mentions of laboratory, pathology, and radiological tests. - `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient. - `Death_Entity`: Mentions that indicate the death of a patient. - `Relationship_Status`: State of a patient's romantic or social relationships (e.g. single, married, divorced). - `Duration`: The duration of a medical treatment or medication use. - `Respiration`: Number of breaths per minute. - `Hyperlipidemia`: Terms that indicate hyperlipidemia with relevant subtypes and synonyms. - `Birth_Entity`: Mentions that indicate giving birth. 
- `Age`: All mentions of age, past or present, related to the patient or anybody else. - `Labour_Delivery`: Extractions include stages of labor and delivery. - `Family_History_Header`: Identifies section headers that correspond to Family History of the patient. - `BMI`: Numeric values and other text information related to Body Mass Index. - `Temperature`: All mentions that refer to body temperature. - `Alcohol`: Terms that indicate alcohol use, abuse or drinking issues of a patient or someone else. - `Kidney_Disease`: Terms that refer to any kidney diseases (includes mentions of modifiers such as "Acute" or "Chronic"). - `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else. - `Medical_History_Header`: Identifies section headers that correspond to Past Medical History of a patient. - `Cerebrovascular_Disease`: All terms that refer to cerebrovascular diseases and events. - `Oxygen_Therapy`: Breathing support triggered by patient or entirely or partially by machine (e.g. ventilator, BPAP, CPAP). - `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements. - `Psychological_Condition`: All mental health diagnoses, disorders, conditions or syndromes of a patient or someone else. - `Heart_Disease`: All mentions of acquired, congenital or degenerative heart diseases. - `Employment`: All mentions of patient or provider occupational titles and employment status. - `Obesity`: Terms related to a patient being obese (overweight and BMI are extracted as different labels). - `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.). - `Pregnancy`: All terms related to Pregnancy (excluding terms that are extracted with their specific labels, such as "Labour_Delivery" etc.). 
- `ImagingFindings`: All mentions of radiographic and imaging findings. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments. - `Medical_Device`: All mentions related to medical devices and supplies. - `Race_Ethnicity`: All terms that refer to racial and national origin of sociocultural groups. - `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital signs Headers are extracted separately with their specific labels). - `Symptom`: All the symptoms mentioned in the document, of a patient or someone else. - `Treatment`: Includes therapeutic and minimally invasive treatment and procedures (invasive treatments or procedures are extracted as "Procedure"). - `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs). - `Route`: Drug and medication administration routes, as described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_Ingredient`: Active ingredient/s found in drug products. - `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted. - `Diet`: All mentions and information regarding a patient's dietary habits. - `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by the naked eye. - `LDL`: All mentions related to the lab test and results for LDL (Low Density Lipoprotein). - `VS_Finding`: Qualitative data (e.g. Fever, Cyanosis, Tachycardia) and any other symptoms that refer to vital signs. - `Allergen`: Allergen-related extractions mentioned in the document. - `EKG_Findings`: All mentions of EKG readings. - `Imaging_Technique`: All mentions of special radiographic views or special imaging techniques used in radiology. 
- `Triglycerides`: All terms related to the specific lab test for Triglycerides. - `RelativeTime`: Time references that are relative to different times or events (e.g. words such as "approximately", "in the morning"). - `Gender`: Gender-specific nouns and pronouns. - `Pulse`: Peripheral heart rate, without advanced information like measurement location. - `Social_History_Header`: Identifies section headers that correspond to Social History of a patient. - `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs). - `Diabetes`: All terms related to diabetes mellitus. - `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately. - `Internal_organ_or_component`: All mentions related to internal body parts or organs that cannot be examined by the naked eye. - `Clinical_Dept`: Terms that indicate the medical and/or surgical departments. - `Form`: Drug and medication forms, as described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients. - `Strength`: Potency of one unit of drug (or a combination of drugs); the measurement units are described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Fetus_NewBorn`: All terms related to fetus, infant, or newborn (excluding terms that are extracted with their specific labels, such as "Labour_Delivery", "Pregnancy" etc.). 
- `RelativeDate`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "approximately two years ago", "about two days ago"). - `Height`: All mentions related to a patient's height. - `Test_Result`: Terms related to all the test results present in the document (clinical test results are included). - `Sexually_Active_or_Sexual_Orientation`: All terms that are related to sexuality, sexual orientations and sexual activity. - `Frequency`: Frequency of administration for a dose prescribed. - `Time`: Specific time references (hour and/or minutes). - `Weight`: All mentions related to a patient's weight. - `Vaccine`: Generic and brand names of vaccines or vaccination procedures. - `Vital_Signs_Header`: Identifies section headers that correspond to Vital Signs of a patient. - `Communicable_Disease`: Includes all mentions of communicable diseases. - `Dosage`: Quantity prescribed by the physician for an active ingredient; measurement units are described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Overweight`: Terms related to the patient being overweight (BMI and Obesity are extracted separately). - `Hypertension`: All terms related to Hypertension (quantitative data such as 150/100 is extracted as Blood_Pressure). - `HDL`: Terms related to the lab test for HDL (High Density Lipoprotein). - `Total_Cholesterol`: Terms related to the lab test and results for cholesterol. - `Smoking`: All mentions of smoking status of a patient. - `Date`: Mentions of an exact date, in any format, including day number, month and/or year. 
## Predicted Entities `Social_History_Header`, `Oncology_Therapy`, `Blood_Pressure`, `Respiration`, `Performance_Status`, `Family_History_Header`, `Dosage`, `Clinical_Dept`, `Diet`, `Procedure`, `HDL`, `Weight`, `Admission_Discharge`, `LDL`, `Kidney_Disease`, `Oncological`, `Route`, `Imaging_Technique`, `Puerperium`, `Overweight`, `Temperature`, `Diabetes`, `Vaccine`, `Age`, `Test_Result`, `Employment`, `Time`, `Obesity`, `EKG_Findings`, `Pregnancy`, `Communicable_Disease`, `BMI`, `Strength`, `Tumor_Finding`, `Section_Header`, `RelativeDate`, `ImagingFindings`, `Death_Entity`, `Date`, `Cerebrovascular_Disease`, `Treatment`, `Labour_Delivery`, `Pregnancy_Delivery_Puerperium`, `Direction`, `Internal_organ_or_component`, `Psychological_Condition`, `Form`, `Medical_Device`, `Test`, `Symptom`, `Disease_Syndrome_Disorder`, `Staging`, `Birth_Entity`, `Hyperlipidemia`, `O2_Saturation`, `Frequency`, `External_body_part_or_region`, `Drug_Ingredient`, `Vital_Signs_Header`, `Substance_Quantity`, `Race_Ethnicity`, `VS_Finding`, `Injury_or_Poisoning`, `Medical_History_Header`, `Alcohol`, `Triglycerides`, `Total_Cholesterol`, `Sexually_Active_or_Sexual_Orientation`, `Female_Reproductive_Status`, `Relationship_Status`, `Drug_BrandName`, `RelativeTime`, `Duration`, `Hypertension`, `Metastasis`, `Gender`, `Oxygen_Therapy`, `Pulse`, `Heart_Disease`, `Modifier`, `Allergen`, `Smoking`, `Substance`, `Cancer_Modifier`, `Fetus_NewBorn`, `Height` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_en_3.3.0_3.0_1634865045033.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_en_3.3.0_3.0_1634865045033.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") jsl_ner = MedicalNerModel.pretrained("ner_jsl_enriched", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("jsl_ner") jsl_ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "jsl_ner"]) \ .setOutputCol("ner_chunk") jsl_ner_pipeline = Pipeline().setStages([ documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter]) jsl_ner_model = jsl_ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature."]]).toDF("text") result = jsl_ner_model.transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val jsl_ner = MedicalNerModel.pretrained("ner_jsl_enriched", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("jsl_ner") val jsl_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "jsl_ner")) .setOutputCol("ner_chunk") val jsl_ner_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter)) val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text") val result = jsl_ner_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl.enriched").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
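The model emits token-level IOB tags (e.g. `B-Symptom`, `I-Symptom`, `O`), and the `NerConverter` stage merges consecutive B-/I- tokens into the chunks shown in the results below. A hedged, simplified sketch of that merging logic (whitespace joining assumed; the real annotator works on character offsets):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new chunk begins
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:  # continue the open chunk
            current.append(token)
        else:                             # "O" or an orphan I- tag closes it
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["The", "patient", "is", "a", "21-day-old", "Caucasian", "male"]
tags   = ["O",   "O",       "O",  "O", "B-Age",      "B-Race_Ethnicity", "B-Gender"]
print(iob_to_chunks(tokens, tags))
# [('21-day-old', 'Age'), ('Caucasian', 'Race_Ethnicity'), ('male', 'Gender')]
```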
## Results ```bash | | chunk | begin | end | entity | |---:|:------------------------------------------|--------:|------:|:-----------------------------| | 0 | 21-day-old | 17 | 26 | Age | | 1 | Caucasian | 28 | 36 | Race_Ethnicity | | 2 | male | 38 | 41 | Gender | | 3 | 2 days | 52 | 57 | Duration | | 4 | congestion | 62 | 71 | Symptom | | 5 | mom | 75 | 77 | Gender | | 6 | suctioning yellow discharge | 88 | 114 | Symptom | | 7 | nares | 135 | 139 | External_body_part_or_region | | 8 | she | 147 | 149 | Gender | | 9 | mild | 168 | 171 | Modifier | | 10 | problems with his breathing while feeding | 173 | 213 | Symptom | | 11 | perioral cyanosis | 237 | 253 | Symptom | | 12 | retractions | 258 | 268 | Symptom | | 13 | One day ago | 272 | 282 | RelativeDate | | 14 | mom | 285 | 287 | Gender | | 15 | tactile temperature | 304 | 322 | Symptom | | 16 | Tylenol | 345 | 351 | Drug_BrandName | | 17 | Baby | 354 | 357 | Age | | 18 | decreased p.o. intake | 377 | 397 | Symptom | | 19 | His | 400 | 402 | Gender | | 20 | q.2h | 450 | 453 | Frequency | | 21 | 5 to 10 minutes | 459 | 473 | Duration | | 22 | his | 488 | 490 | Gender | | 23 | respiratory congestion | 492 | 513 | Symptom | | 24 | He | 516 | 517 | Gender | | 25 | tired | 550 | 554 | Symptom | | 26 | fussy | 569 | 573 | Symptom | | 27 | over the past 2 days | 575 | 594 | RelativeDate | | 28 | albuterol | 637 | 645 | Drug_Ingredient | | 29 | ER | 671 | 672 | Clinical_Dept | | 30 | His | 675 | 677 | Gender | | 31 | urine output has also decreased | 679 | 709 | Symptom | | 32 | he | 721 | 722 | Gender | | 33 | per 24 hours | 760 | 771 | Frequency | | 34 | he | 778 | 779 | Gender | | 35 | per 24 hours | 807 | 818 | Frequency | | 36 | Mom | 821 | 823 | Gender | | 37 | diarrhea | 836 | 843 | Symptom | | 38 | His | 846 | 848 | Gender | | 39 | bowel | 850 | 854 | Internal_organ_or_component | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_enriched| |Compatibility:|Healthcare NLP 
3.3.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on data sampled from MTSamples and Clinicaltrials.gov, and annotated in-house. ## Benchmarking ```bash label tp fp fn prec rec f1 B-Oxygen_Therapy 139 44 44 0.75956285 0.75956285 0.75956285 B-Oncology_Therapy 2 0 4 1.0 0.33333334 0.5 B-Cerebrovascular_Disease 49 13 23 0.7903226 0.6805556 0.7313434 B-Triglycerides 3 0 1 1.0 0.75 0.85714287 I-Cerebrovascular_Disease 18 11 22 0.62068963 0.45 0.52173907 B-Medical_Device 2723 350 299 0.88610476 0.9010589 0.8935192 B-Labour_Delivery 38 6 27 0.8636364 0.5846154 0.6972478 I-Vaccine 27 2 8 0.9310345 0.7714286 0.84374994 I-Obesity 7 0 0 1.0 1.0 1.0 I-Smoking 2 3 4 0.4 0.33333334 0.36363637 B-RelativeTime 141 69 70 0.67142856 0.66824645 0.6698337 I-Staging 0 0 1 0.0 0.0 0.0 B-Imaging_Technique 23 7 30 0.76666665 0.43396226 0.55421686 B-Heart_Disease 264 54 51 0.8301887 0.83809525 0.8341232 B-Procedure 2091 206 277 0.9103178 0.8830236 0.89646304 I-RelativeTime 177 56 65 0.75965667 0.73140496 0.74526316 I-Substance_Quantity 0 12 1 0.0 0.0 0.0 B-Obesity 53 0 4 1.0 0.9298246 0.9636364 I-RelativeDate 702 94 97 0.88190955 0.8785983 0.8802508 B-O2_Saturation 55 27 22 0.6707317 0.71428573 0.6918239 B-Direction 3138 213 260 0.9364369 0.9234844 0.92991555 I-Alcohol 2 0 4 1.0 0.33333334 0.5 I-Oxygen_Therapy 104 60 57 0.63414633 0.6459627 0.64 B-Diet 34 5 39 0.8717949 0.46575344 0.6071429 B-Dosage 267 59 115 0.8190184 0.69895285 0.7542373 B-Injury_or_Poisoning 353 67 67 0.8404762 0.8404762 0.8404762 B-Hypertension 98 3 9 0.97029704 0.91588783 0.94230765 I-Test_Result 1093 58 145 0.94960904 0.8828756 0.91502714 B-Female_Reproductive_Status 0 0 1 0.0 0.0 0.0 B-Substance_Quantity 0 4 1 0.0 0.0 0.0 B-Alcohol 72 6 15 0.9230769 0.82758623 0.8727273 B-Height 14 7 8 0.6666667 0.6363636 0.65116286 I-Substance 19 2 4 0.9047619 0.82608694 0.86363643 B-RelativeDate 470 65 79 0.8785047 0.856102 
0.86715865
B-Admission_Discharge 242 8 3 0.968 0.9877551 0.9777778
B-Date 424 25 18 0.94432074 0.959276 0.9517396
B-Kidney_Disease 71 12 12 0.85542166 0.85542166 0.85542166
I-Admission_Discharge 0 0 1 0.0 0.0 0.0
I-Strength 506 82 38 0.8605442 0.93014705 0.89399296
B-Allergen 0 3 10 0.0 0.0 0.0
I-Injury_or_Poisoning 315 83 93 0.7914573 0.77205884 0.7816377
I-Drug_Ingredient 300 88 46 0.77319586 0.867052 0.8174387
I-Time 298 31 14 0.90577507 0.9551282 0.9297972
B-Substance 54 7 9 0.8852459 0.85714287 0.87096775
B-Total_Cholesterol 12 2 3 0.85714287 0.8 0.82758623
I-Vital_Signs_Header 138 8 6 0.94520545 0.9583333 0.9517241
I-Internal_organ_or_component 2826 302 304 0.9034527 0.9028754 0.903164
B-Hyperlipidemia 27 1 1 0.96428573 0.96428573 0.9642857
I-Sexually_Active_or_Sexual_Orientation 4 2 1 0.6666667 0.8 0.72727275
B-Sexually_Active_or_Sexual_Orientation 4 3 2 0.5714286 0.6666667 0.61538464
I-Fetus_NewBorn 27 18 19 0.6 0.5869565 0.5934066
B-BMI 5 0 4 1.0 0.5555556 0.71428573
B-ImagingFindings 63 36 64 0.6363636 0.496063 0.5575221
B-Drug_Ingredient 1905 202 183 0.9041291 0.9123563 0.90822405
B-Test_Result 1327 131 184 0.9101509 0.87822634 0.8939037
B-Section_Header 2763 120 106 0.9583767 0.96305335 0.96070933
I-Treatment 103 40 39 0.7202797 0.7253521 0.72280705
B-Clinical_Dept 744 62 99 0.9230769 0.8825623 0.902365
I-Kidney_Disease 109 12 1 0.90082645 0.9909091 0.94372296
I-Pulse 156 35 27 0.8167539 0.852459 0.8342246
B-Test 2312 293 418 0.887524 0.84688646 0.8667291
B-Weight 64 10 14 0.8648649 0.82051283 0.8421053
I-Respiration 81 47 11 0.6328125 0.8804348 0.73636365
I-EKG_Findings 73 15 70 0.82954544 0.5104895 0.63203466
I-Section_Header 1999 73 97 0.96476835 0.95372134 0.9592131
I-VS_Finding 32 17 28 0.6530612 0.53333336 0.58715594
B-Strength 513 60 44 0.89528793 0.92100537 0.9079646
I-Cancer_Modifier 5 0 0 1.0 1.0 1.0
I-Social_History_Header 39 6 0 0.8666667 1.0 0.92857146
B-Vital_Signs_Header 216 14 6 0.9391304 0.972973 0.95575225
B-Death_Entity 41 11 3 0.78846157 0.9318182 0.8541667
B-Modifier 2050 335 307 0.8595388 0.86974967 0.86461407
B-Blood_Pressure 108 17 27 0.864 0.8 0.83076924
I-O2_Saturation 99 23 36 0.8114754 0.73333335 0.77042806
B-Frequency 519 53 63 0.9073427 0.8917526 0.8994801
I-Triglycerides 3 0 5 1.0 0.375 0.54545456
I-Female_Reproductive_Status 0 0 3 0.0 0.0 0.0
I-Duration 529 71 112 0.88166666 0.82527304 0.8525383
I-Diabetes 41 8 1 0.8367347 0.97619045 0.90109897
B-Race_Ethnicity 77 0 4 1.0 0.9506173 0.9746836
I-Gender 0 0 2 0.0 0.0 0.0
I-Height 40 1 18 0.9756098 0.6896552 0.8080808
B-Communicable_Disease 11 2 8 0.84615386 0.57894737 0.68749994
I-Family_History_Header 35 0 1 1.0 0.9722222 0.9859155
B-LDL 3 1 0 0.75 1.0 0.85714287
B-Form 169 41 40 0.8047619 0.80861247 0.8066826
I-Race_Ethnicity 2 0 2 1.0 0.5 0.6666667
B-Psychological_Condition 114 12 19 0.9047619 0.85714287 0.88030887
I-Drug_BrandName 14 12 12 0.53846157 0.53846157 0.53846157
I-Hypertension 2 2 10 0.5 0.16666667 0.25
I-Age 196 43 7 0.8200837 0.9655172 0.88687783
B-EKG_Findings 38 18 35 0.6785714 0.5205479 0.58914727
B-Employment 193 31 41 0.86160713 0.8247863 0.8427947
I-Oncological 333 38 23 0.8975741 0.9353933 0.9160936
B-Time 320 34 23 0.9039548 0.9329446 0.91822094
B-Treatment 129 36 61 0.7818182 0.6789474 0.7267606
B-Temperature 104 15 19 0.8739496 0.8455285 0.85950416
B-Tumor_Finding 1 2 10 0.33333334 0.09090909 0.14285715
I-Procedure 2667 348 335 0.8845771 0.8884077 0.8864883
B-Relationship_Status 37 3 3 0.925 0.925 0.925
B-Pregnancy 77 17 15 0.81914896 0.8369565 0.827957
B-Fetus_NewBorn 18 7 18 0.72 0.5 0.59016395
I-Total_Cholesterol 14 1 5 0.93333334 0.7368421 0.8235294
I-Route 193 17 13 0.9190476 0.9368932 0.92788464
B-Birth_Entity 1 7 1 0.125 0.5 0.2
I-Communicable_Disease 5 1 2 0.8333333 0.71428573 0.7692307
I-Medical_History_Header 119 0 3 1.0 0.97540987 0.98755187
I-Imaging_Technique 10 1 15 0.90909094 0.4 0.5555555
B-Smoking 96 5 5 0.95049506 0.95049506 0.95049506
I-Labour_Delivery 29 20 9 0.59183675 0.7631579 0.6666667
I-Death_Entity 3 1 0 0.75 1.0 0.85714287
B-Diabetes 77 3 3 0.9625 0.9625 0.9625
B-HDL 2 0 1 1.0 0.6666667 0.8
B-Drug_BrandName 792 67 61 0.9220023 0.9284877 0.92523366
B-Gender 4498 58 63 0.9872695 0.9861872 0.9867281
B-Metastasis 5 2 8 0.71428573 0.3846154 0.5
I-Relationship_Status 0 0 4 0.0 0.0 0.0
B-Cancer_Modifier 4 0 1 1.0 0.8 0.88888896
B-Vaccine 39 6 7 0.8666667 0.84782606 0.8571428
I-Heart_Disease 317 47 47 0.8708791 0.8708791 0.8708791
I-Dosage 216 47 126 0.82129276 0.6315789 0.7140496
B-Staging 0 0 2 0.0 0.0 0.0
B-Social_History_Header 65 8 3 0.89041096 0.9558824 0.92198586
B-External_body_part_or_region 1792 195 229 0.9018621 0.8866898 0.8942116
I-Clinical_Dept 559 23 47 0.9604811 0.92244226 0.9410775
I-Tumor_Finding 0 13 11 0.0 0.0 0.0
I-Test 1919 311 305 0.8605381 0.8628597 0.8616973
I-Frequency 447 53 68 0.894 0.86796117 0.88078815
B-Age 461 50 22 0.90215266 0.9544513 0.9275654
B-Pulse 96 14 18 0.8727273 0.84210527 0.85714287
I-Symptom 3408 1152 1091 0.7473684 0.75750166 0.7524009
I-Form 1 5 2 0.16666667 0.33333334 0.22222224
I-Pregnancy 66 8 36 0.8918919 0.64705884 0.75000006
I-LDL 5 2 6 0.71428573 0.45454547 0.5555556
I-Diet 40 7 25 0.85106385 0.61538464 0.7142857
I-Blood_Pressure 165 35 36 0.825 0.8208955 0.8229426
I-ImagingFindings 118 60 72 0.66292137 0.6210526 0.6413044
I-Date 195 11 10 0.9466019 0.9512195 0.9489051
I-Hyperlipidemia 1 1 0 0.5 1.0 0.6666667
B-Route 755 69 83 0.91626215 0.90095466 0.90854394
B-Duration 219 30 72 0.8795181 0.7525773 0.81111115
B-Medical_History_Header 84 6 3 0.93333334 0.9655172 0.9491525
I-Metastasis 3 4 4 0.42857143 0.42857143 0.42857143
I-Allergen 0 1 3 0.0 0.0 0.0
B-Respiration 53 19 15 0.7361111 0.7794118 0.75714284
I-External_body_part_or_region 429 73 62 0.85458165 0.8737271 0.86404836
I-BMI 12 3 3 0.8 0.8 0.8000001
B-Internal_organ_or_component 4361 475 509 0.90177834 0.89548254 0.8986194
I-Weight 146 14 21 0.9125 0.8742515 0.8929664
B-Disease_Syndrome_Disorder 2222 283 350 0.88702595 0.86391914 0.8753201
B-Symptom 4910 711 744 0.87351006 0.8684117 0.87095344
B-VS_Finding 207 28 44 0.8808511 0.8247012 0.8518518
I-Disease_Syndrome_Disorder 1659 201 383 0.89193547 0.8124388 0.85033315
I-Modifier 162 67 138 0.70742357 0.54 0.61247635
I-Medical_Device 1786 245 176 0.8793698 0.9102956 0.89456546
B-Oncological 354 44 29 0.8894472 0.92428195 0.90653
I-Temperature 172 16 24 0.9148936 0.877551 0.8958333
I-Employment 108 18 32 0.85714287 0.7714286 0.81203
I-Psychological_Condition 40 9 7 0.81632656 0.85106385 0.8333334
B-Family_History_Header 47 0 4 1.0 0.92156863 0.9591837
I-Direction 186 42 23 0.81578946 0.8899522 0.8512586
I-HDL 3 0 2 1.0 0.6 0.75
Macro-average 71581 9121 10180 0.7729799 0.721845 0.7465378
Micro-average 71581 9121 10180 0.8869793 0.8754908 0.88119763
```
--- layout: model title: Legal Set-off Clause Binary Classifier author: John Snow Labs name: legclf_set_off_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `set-off` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc.
Note that the embeddings used by this model allow up to 512 tokens. If your text is longer, consider splitting it into smaller pieces (see the tutorial linked above). This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing as output a True/False value for each of the clause models you add. ## Predicted Entities `other`, `set-off` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_set_off_clause_en_1.0.0_3.2_1660122998732.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_set_off_clause_en_1.0.0_3.2_1660122998732.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_set_off_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
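As a rough sketch of the "paragraph splitting (by multiline)" strategy recommended above, long contracts can be broken on blank lines with plain Python before building the `clause_text` DataFrame. This is a simplification of what the linked tutorial covers, and the sample contract text below is made up:

```python
import re

def split_paragraphs(text: str):
    # Split on one or more blank lines and drop empty fragments.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

contract = (
    "1. Set-off. The Lender may set off any obligation owed to it.\n\n"
    "2. Governing Law. This Agreement is governed by the laws of Delaware."
)
clauses = split_paragraphs(contract)
print(len(clauses))  # -> 2
# each element can then become one row of the "clause_text" column
```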
## Results ```bash
+---------+
|   result|
+---------+
|[set-off]|
|  [other]|
|  [other]|
|[set-off]|
+---------+
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_set_off_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.95 0.98 0.96 94 set-off 0.94 0.86 0.90 36 accuracy - - 0.95 130 macro-avg 0.94 0.92 0.93 130 weighted-avg 0.95 0.95 0.95 130 ``` --- layout: model title: English RobertaForQuestionAnswering (from vuiseng9) author: John Snow Labs name: roberta_qa_roberta_l_squadv1.1 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-l-squadv1.1` is an English model originally trained by `vuiseng9`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_l_squadv1.1_en_4.0.0_3.0_1655735988687.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_l_squadv1.1_en_4.0.0_3.0_1655735988687.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_l_squadv1.1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_l_squadv1.1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
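Under the hood, extractive QA models of this kind score candidate answer spans via start and end token positions. The toy decoder below illustrates the general idea only; the scores are made up, and this is not Spark NLP's actual implementation:

```python
def best_span(start_scores, end_scores, max_len=15):
    # Pick the (start, end) pair with the highest combined score,
    # subject to start <= end and a maximum span length.
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# toy logits for the tokens [My, name, is, Clara, .]
start = [0.1, 0.2, 0.1, 3.5, 0.0]
end   = [0.0, 0.1, 0.2, 3.8, 0.1]
print(best_span(start, end))  # -> (3, 3): the single-token span "Clara"
```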
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_l_squadv1.1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/vuiseng9/roberta-l-squadv1.1 --- layout: model title: Legal Advice Class Identifier author: John Snow Labs name: legclf_reddit_advice date: 2023-03-10 tags: [en, licensed, legal, classifier, reddit, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true recommended: true engine: tensorflow annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a multiclass classification model that identifies the topic/class of an informal message from a legal forum, covering the following classes: `digital`, `business`, `insurance`, `contract`, `driving`, `school`, `family`, `wills`, `employment`, `housing`, `criminal`. ## Predicted Entities `digital`, `business`, `insurance`, `contract`, `driving`, `school`, `family`, `wills`, `employment`, `housing`, `criminal` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_reddit_advice_en_1.0.0_3.0_1678448985639.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_reddit_advice_en_1.0.0_3.0_1678448985639.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") seq_classifier = legal.BertForSequenceClassification.pretrained("legclf_reddit_advice", "en", "legal/models") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = nlp.Pipeline(stages=[documentAssembler, tokenizer, seq_classifier]) data = spark.createDataFrame([["Mother of my child took my daughter and moved (without notice), won't let me see her or tell me where she is."]]).toDF("text") result = pipeline.fit(data).transform(data) ```
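As a sanity check, the weighted-average F1 reported in the benchmarking section can be reproduced from the per-class rows of that table (values copied verbatim; small differences are rounding):

```python
# per-class (f1, support) pairs from the benchmarking table
scores = {
    "business": (0.72, 239), "contract": (0.73, 207), "criminal": (0.80, 209),
    "digital": (0.75, 223), "driving": (0.86, 223), "employment": (0.83, 222),
    "family": (0.92, 216), "housing": (0.92, 221), "insurance": (0.81, 221),
    "school": (0.89, 207), "wills": (0.96, 199),
}
total = sum(n for _, n in scores.values())                      # 2387 examples
weighted_f1 = sum(f * n for f, n in scores.values()) / total    # support-weighted mean
print(round(weighted_f1, 2))  # -> 0.83
```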
## Results ```bash +--------+ | result| +--------+ |[family]| +--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_reddit_advice| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References Train dataset available [here](https://huggingface.co/datasets/jonathanli/legal-advice-reddit) ## Benchmarking ```bash label precision recall f1-score support business 0.76 0.67 0.72 239 contract 0.80 0.68 0.73 207 criminal 0.82 0.77 0.80 209 digital 0.76 0.74 0.75 223 driving 0.86 0.85 0.86 223 employment 0.76 0.92 0.83 222 family 0.88 0.95 0.92 216 housing 0.89 0.95 0.92 221 insurance 0.83 0.80 0.81 221 school 0.87 0.91 0.89 207 wills 0.95 0.96 0.96 199 accuracy - - 0.83 2387 macro-avg 0.84 0.84 0.83 2387 weighted-avg 0.83 0.83 0.83 2387 ``` --- layout: model title: Extract Temporal Entities from Voice of the Patient Documents (embeddings_clinical) author: John Snow Labs name: ner_vop_temporal date: 2023-06-06 tags: [clinical, licensed, ner, en, vop, temporal] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts temporal references from documents written in the patient's own words.
## Predicted Entities `DateTime`, `Duration`, `Frequency` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_temporal_en_4.4.3_3.0_1686076127059.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_temporal_en_4.4.3_3.0_1686076127059.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_temporal", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["I broke my arm playing football last month and had to get surgery in the orthopedic department. 
The cast just came off yesterday and I'm excited to start physical therapy and get back to the game."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_temporal", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("I broke my arm playing football last month and had to get surgery in the orthopedic department. The cast just came off yesterday and I'm excited to start physical therapy and get back to the game.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:-----------|:------------| | last month | DateTime | | yesterday | DateTime | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_temporal| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
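The precision, recall, and F1 columns in the benchmarking table that follows are derived from the tp/fp/fn counts in the standard way; for example, for the `DateTime` row:

```python
def prf(tp, fp, fn):
    # precision = tp / predicted positives, recall = tp / actual positives,
    # F1 = harmonic mean of the two
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

p, r, f1 = prf(tp=4056, fp=655, fn=346)  # DateTime counts from the table
print(round(p, 2), round(r, 2), round(f1, 2))  # -> 0.86 0.92 0.89
```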
## Benchmarking ```bash label tp fp fn total precision recall f1 DateTime 4056 655 346 4402 0.86 0.92 0.89 Duration 2008 371 302 2310 0.84 0.87 0.86 Frequency 879 157 200 1079 0.85 0.81 0.83 macro_avg 6943 1183 848 7791 0.85 0.87 0.86 micro_avg 6943 1183 848 7791 0.85 0.89 0.87 ``` --- layout: model title: Slovak RobertaForMaskedLM Cased model (from fav-kky) author: John Snow Labs name: roberta_embeddings_fernet_news date: 2022-12-12 tags: [sk, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: sk edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `FERNET-News_sk` is a Slovak model originally trained by `fav-kky`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_fernet_news_sk_4.2.4_3.0_1670858429673.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_fernet_news_sk_4.2.4_3.0_1670858429673.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_fernet_news","sk") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_fernet_news","sk") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
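Downstream, the vectors in the `embeddings` column are typically compared with cosine similarity. A minimal pure-Python version is below; the 3-d vectors are illustrative stand-ins for the model's real, much higher-dimensional token embeddings:

```python
import math

def cosine(u, v):
    # dot product divided by the product of the vector norms
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

print(round(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]), 6))  # -> 1.0
print(round(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 6))  # -> 0.0
```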
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_fernet_news| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|sk| |Size:|467.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/fav-kky/FERNET-News_sk - https://arxiv.org/abs/2107.10042 --- layout: model title: Clinical English Bert Embeddings (Base, 128 dimension) author: John Snow Labs name: bert_embeddings_clinical_pubmed_bert_base_128 date: 2022-04-11 tags: [bert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `clinical-pubmed-bert-base-128` is an English model originally trained by `Tsubasaz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_clinical_pubmed_bert_base_128_en_3.4.2_3.0_1649672767031.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_clinical_pubmed_bert_base_128_en_3.4.2_3.0_1649672767031.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_clinical_pubmed_bert_base_128","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_clinical_pubmed_bert_base_128","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.clinical_pubmed_bert_base_128").predict("""I love Spark NLP""") ```
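A common way to turn the token-level embeddings produced above into a single sentence vector is mean pooling. A minimal sketch with toy 2-d vectors follows (this model's real outputs are 768-dimensional):

```python
def mean_pool(token_vectors):
    """Average a list of token embeddings into one sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# three toy 2-d "token embeddings"
print(mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))  # -> [3.0, 4.0]
```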
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_clinical_pubmed_bert_base_128| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|410.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/Tsubasaz/clinical-pubmed-bert-base-128 - https://mimic.physionet.org/ --- layout: model title: Korean ElectraForQuestionAnswering model (from monologg) Version-3 author: John Snow Labs name: electra_qa_base_v3_finetuned_korquad date: 2022-06-22 tags: [ko, open_source, electra, question_answering] task: Question Answering language: ko edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `koelectra-base-v3-finetuned-korquad` is a Korean model originally trained by `monologg`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_base_v3_finetuned_korquad_ko_4.0.0_3.0_1655922227904.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_base_v3_finetuned_korquad_ko_4.0.0_3.0_1655922227904.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_v3_finetuned_korquad","ko") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_v3_finetuned_korquad","ko") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ko.answer_question.korquad.electra.base").predict("""내 이름은 무엇입니까?|||제 이름은 클라라이고 저는 버클리에 살고 있습니다.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_base_v3_finetuned_korquad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ko| |Size:|419.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/monologg/koelectra-base-v3-finetuned-korquad --- layout: model title: Sentence Entity Resolver for UMLS CUI Codes author: John Snow Labs name: sbiobertresolve_umls_findings date: 2021-04-30 tags: [en, clinical, licensed, entity_resolution] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.2 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Map clinical entities to UMLS CUI codes. ## Predicted Entities This model returns CUI (concept unique identifier) codes for 200K concepts from clinical findings. https://www.nlm.nih.gov/research/umls/index.html {:.btn-box} [Live Demo](http://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_findings_en_3.0.2_3.0_1619774838339.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_findings_en_3.0.2_3.0_1619774838339.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli",'en','clinical/models')\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_umls_findings","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") pipeline = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver]) data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting."""]]).toDF("text") results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.umls.findings").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.""") ```
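With `.setDistanceFunction("EUCLIDEAN")` as configured above, the resolver ranks candidate codes by Euclidean distance between sentence embeddings and picks the closest. The toy 2-d sketch below illustrates only the ranking idea; the vectors are made up, and the CUI codes are borrowed from the results table purely as labels:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

chunk = [0.1, 0.9]  # pretend embedding of an extracted chunk
candidates = {"C2183115": [0.1, 0.8], "C0042963": [0.9, 0.1]}
# the candidate with the smallest distance to the chunk embedding wins
print(min(candidates, key=lambda c: euclidean(chunk, candidates[c])))  # -> C2183115
```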
## Results ```bash | | ner_chunk | cui_code | |---:|:--------------------------------------|:-----------| | 0 | gestational diabetes mellitus | C2183115 | | 1 | subsequent type two diabetes mellitus | C3532488 | | 2 | T2DM | C3280267 | | 3 | HTG-induced pancreatitis | C4554179 | | 4 | an acute hepatitis | C4750596 | | 5 | obesity | C1963185 | | 6 | a body mass index | C0578022 | | 7 | polyuria | C3278312 | | 8 | polydipsia | C3278316 | | 9 | poor appetite | C0541799 | | 10 | vomiting | C0042963 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_umls_findings| |Compatibility:|Healthcare NLP 3.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[umls_code]| |Language:|en| ## Data Source https://www.nlm.nih.gov/research/umls/index.html --- layout: model title: English BertForQuestionAnswering model (from Rocketknight1) author: John Snow Labs name: bert_qa_bert_finetuned_qa date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-qa` is an English model originally trained by `Rocketknight1`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_qa_en_4.0.0_3.0_1654535355191.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_qa_en_4.0.0_3.0_1654535355191.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_finetuned_qa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_finetuned_qa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.by_Rocketknight1").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
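The `nlu.load(...).predict(...)` one-liner above packs the question and context into a single string separated by `|||`. The convention itself is trivial to parse; this sketch only illustrates the packing format, not nlu's internals:

```python
def split_qa(packed: str):
    # "question|||context" -> ("question", "context")
    question, context = packed.split("|||", 1)
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # -> What's my name?
```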
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_finetuned_qa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/Rocketknight1/bert-finetuned-qa

---
layout: model
title: Portuguese DistilBertForQuestionAnswering Cased model (from mrm8488)
author: John Snow Labs
name: distilbert_qa_finetuned_squad
date: 2022-07-21
tags: [open_source, distilbert, question_answering, pt]
task: Question Answering
language: pt
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBERT Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-multi-finedtuned-squad-pt` is a Portuguese model originally trained by `mrm8488`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_finetuned_squad_pt_4.0.0_3.0_1658401584866.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_finetuned_squad_pt_4.0.0_3.0_1658401584866.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_finetuned_squad","pt") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["PUT YOUR 'QUESTION' STRING HERE?", "PUT YOUR 'CONTEXT' STRING HERE"]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_finetuned_squad","pt")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("PUT YOUR 'QUESTION' STRING HERE?", "PUT YOUR 'CONTEXT' STRING HERE")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|pt| |Size:|505.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References https://huggingface.co/mrm8488/distilbert-multi-finedtuned-squad-pt --- layout: model title: Part of Speech for Japanese author: John Snow Labs name: pos_ud_gsd date: 2021-01-03 task: Part of Speech Tagging language: ja edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [pos, ja, open_source] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 13 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_ja_2.7.0_2.4_1609700150824.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_ja_2.7.0_2.4_1609700150824.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

pos = PerceptronModel.pretrained("pos_ud_gsd", "ja") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    word_segmenter,
    pos
])

example = spark.createDataFrame([['院長と話をしたところ、腰痛治療も得意なようです。']], ["text"])

result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence_detector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja")
  .setInputCols("sentence")
  .setOutputCol("token")

val pos = PerceptronModel.pretrained("pos_ud_gsd", "ja")
  .setInputCols(Array("document", "token"))
  .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter, pos))

val data = Seq("院長と話をしたところ、腰痛治療も得意なようです。").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu

text = ["""5月13日に放送されるフジテレビ系「僕らの音楽」にて、福原美穂とAIという豪華共演が決定した。"""]
pos_df = nlu.load('ja.pos.ud_gsd').predict(text, output_level='token')
pos_df
```
## Results ```bash +------+-----+ |token |pos | +------+-----+ |院長 |NOUN | |と |ADP | |話 |NOUN | |を |ADP | |し |VERB | |た |AUX | |ところ|NOUN | |、 |PUNCT| |腰痛 |NOUN | |治療 |NOUN | |も |ADP | |得意 |ADJ | |な |AUX | |よう |AUX | |です |AUX | |。 |PUNCT| +------+-----+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_gsd| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[pos]| |Language:|ja| ## Data Source The model was trained on the [Universal Dependencies](https://universaldependencies.org/), curated by Google. Reference: > Asahara, M., Kanayama, H., Tanaka, T., Miyao, Y., Uematsu, S., Mori, S., Matsumoto, Y., Omura, M., & Murawaki, Y. (2018). Universal Dependencies Version 2 for Japanese. In LREC-2018. ## Benchmarking ```bash | pos_tag | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | ADJ | 0.90 | 0.78 | 0.84 | 350 | | ADP | 0.98 | 0.99 | 0.99 | 2804 | | ADV | 0.87 | 0.65 | 0.74 | 220 | | AUX | 0.95 | 0.98 | 0.96 | 1768 | | CCONJ | 0.97 | 0.93 | 0.95 | 42 | | DET | 1.00 | 1.00 | 1.00 | 66 | | INTJ | 0.00 | 0.00 | 0.00 | 1 | | NOUN | 0.93 | 0.98 | 0.95 | 3692 | | NUM | 0.99 | 0.98 | 0.99 | 251 | | PART | 0.96 | 0.83 | 0.89 | 128 | | PRON | 0.97 | 0.94 | 0.95 | 101 | | PROPN | 0.92 | 0.70 | 0.79 | 313 | | PUNCT | 1.00 | 1.00 | 1.00 | 1294 | | SCONJ | 0.97 | 0.94 | 0.96 | 682 | | SYM | 0.99 | 1.00 | 0.99 | 67 | | VERB | 0.96 | 0.92 | 0.94 | 1255 | | accuracy | 0.96 | 13034 | | | | macro avg | 0.90 | 0.85 | 0.87 | 13034 | | weighted avg | 0.96 | 0.96 | 0.95 | 13034 | ``` --- layout: model title: Detect Person, Organization and Location in Turkish text author: John Snow Labs name: xlm_roberta_base_token_classifier_ner date: 2021-12-02 tags: [xlm, roberta, ner, turkish, tr, open_source] task: Named Entity Recognition language: tr edition: Spark NLP 3.3.2 spark_version: 2.4 supported: true annotator: XlmRoBertaForTokenClassification 
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model was imported from Hugging Face models. It is a fine-tuned version of `xlm-roberta-base` (a multilingual version of RoBERTa), trained on a reviewed version of a well-known Turkish NER dataset (https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt).

## Predicted Entities

`PER`, `LOC`, `ORG`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_base_token_classifier_ner_tr_3.3.2_2.4_1638447262808.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_base_token_classifier_ner_tr_3.3.2_2.4_1638447262808.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlm_roberta_base_token_classifier_ner", "tr")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

text = """Benim adım Cesur Yurttaş ve İstanbul'da yaşıyorum."""

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlm_roberta_base_token_classifier_ner", "tr")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("ner")

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))

val example = Seq("Benim adım Cesur Yurttaş ve İstanbul'da yaşıyorum.").toDF("text")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("tr.ner.xlm_roberta").predict("""Benim adım Cesur Yurttaş ve İstanbul'da yaşıyorum.""")
```
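The Benchmarking section below reports accuracy, F1, precision, and recall. As a quick sanity check, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values:

```python
# Precision and recall as reported in the Benchmarking section
precision = 0.9407349896480332
recall = 0.9578392621870883

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
# f1 ≈ 0.9492, matching the reported value
```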
## Results

```bash
+-------------+---------+
|chunk        |ner_label|
+-------------+---------+
|Cesur Yurttaş|PER      |
|İstanbul'da  |LOC      |
+-------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_roberta_base_token_classifier_ner|
|Compatibility:|Spark NLP 3.3.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|tr|
|Case sensitive:|true|
|Max sentence length:|256|

## Data Source

[https://huggingface.co/akdeniz27/xlm-roberta-base-turkish-ner](https://huggingface.co/akdeniz27/xlm-roberta-base-turkish-ner)

## Benchmarking

```bash
accuracy: 0.9919343118732742
f1: 0.9492100796448622
precision: 0.9407349896480332
recall: 0.9578392621870883
```

---
layout: model
title: Sentence Embeddings - sbert medium (tuned)
author: John Snow Labs
name: sbert_jsl_medium_uncased
date: 2021-06-30
tags: [embeddings, clinical, licensed, en]
task: Embeddings
language: en
nav_key: models
edition: Healthcare NLP 3.1.0
spark_version: 2.4
supported: true
annotator: BertSentenceEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is trained to generate contextual sentence embeddings of input sentences.

{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_medium_uncased_en_3.1.0_2.4_1625050209626.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_medium_uncased_en_3.1.0_2.4_1625050209626.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
sbiobert_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased", "en", "clinical/models") \
    .setInputCols(["sentence"]) \
    .setOutputCol("sbert_embeddings")
```
```scala
val sbiobert_embeddings = BertSentenceEmbeddings
  .pretrained("sbert_jsl_medium_uncased", "en", "clinical/models")
  .setInputCols(Array("sentence"))
  .setOutputCol("sbert_embeddings")
```

{:.nlu-block}
```python
import nlu
nlu.load("en.embed_sentence.bert.jsl_medium_uncased").predict("""Put your text here.""")
```
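As noted in the Results section below, this annotator outputs one 768-dimensional vector per sentence. Downstream consumers (such as entity resolvers) typically compare these vectors with cosine similarity; here is a minimal NumPy sketch using made-up stand-in vectors rather than real model output:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
v1 = rng.normal(size=768)              # stand-in for a 768-dim sentence embedding
v2 = v1 + 0.1 * rng.normal(size=768)   # a slightly perturbed, nearby vector

sim = cosine_similarity(v1, v2)        # close to 1.0 for similar sentences
```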
## Results

```bash
Gives a 768 dimensional vector representation of the sentence.
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sbert_jsl_medium_uncased|
|Compatibility:|Healthcare NLP 3.1.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Case sensitive:|false|

## Data Source

Tuned on the MedNLI dataset.

## Benchmarking

```bash
MedNLI Acc: 0.724, STS (cos): 0.743
```

---
layout: model
title: Arabic Bert Embeddings (MARBERT model)
author: John Snow Labs
name: bert_embeddings_MARBERT
date: 2022-04-11
tags: [bert, embeddings, ar, open_source]
task: Embeddings
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `MARBERT` is an Arabic model originally trained by `UBC-NLP`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_MARBERT_ar_3.4.2_3.0_1649677129277.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_MARBERT_ar_3.4.2_3.0_1649677129277.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_MARBERT","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_MARBERT","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.MARBERT").predict("""أنا أحب شرارة NLP""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_MARBERT|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|611.5 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/UBC-NLP/MARBERT
- https://doi.org/10.14288/SOCKEYE
- https://www.tensorflow.org/tfrc

---
layout: model
title: Legal Effect of termination Clause Binary Classifier (md)
author: John Snow Labs
name: legclf_effect_of_termination_md
date: 2023-01-11
tags: [en, legal, classification, document, agreement, contract, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `effect-of-termination` clause type. To use this model, make sure you provide enough context as an input. Adding a Sentence Splitter to the pipeline would make the model see only individual sentences instead of the whole text, so it is better to skip it unless you want to do binary classification at the sentence level.

If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into account that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the same tutorial linked above).
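The paragraph-splitting strategy mentioned above can be sketched in plain Python before feeding chunks to the classifier. This is an illustrative helper, not the workshop's implementation, and the 512-token budget is approximated here with whitespace tokens rather than the model's subword tokenizer:

```python
import re

def split_into_paragraphs(text: str, max_tokens: int = 512):
    """Split a document on blank lines (multiline breaks) and flag
    whether each paragraph fits the embedding model's token budget."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

doc = "Section 1. Effect of Termination.\n\nUpon termination, all rights cease.\n\n"
chunks = split_into_paragraphs(doc)  # two paragraphs, both within budget
```

Each resulting paragraph can then be passed to the classifier pipeline as a separate row of the `clause_text` column.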
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, producing a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `effect-of-termination`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_effect_of_termination_md_en_1.0.0_3.0_1673460267892.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_effect_of_termination_md_en_1.0.0_3.0_1673460267892.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_effect_of_termination_md", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-----------------------+
|result                 |
+-----------------------+
|[effect-of-termination]|
|[other]                |
|[other]                |
|[effect-of-termination]|
+-----------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_effect_of_termination_md|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.8 MB|

## References

Legal documents, scraped from the Internet and classified in-house, plus SEC documents.

## Benchmarking

```bash
                      precision    recall  f1-score   support

conditions-precedent       0.91      0.88      0.89        24
               other       0.93      0.95      0.94        39

            accuracy                           0.92        63
           macro avg       0.92      0.91      0.92        63
        weighted avg       0.92      0.92      0.92        63
```

---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from nlpconnect)
author: John Snow Labs
name: roberta_qa_dpr_nq_reader_base
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dpr-nq-reader-roberta-base` is an English model originally trained by `nlpconnect`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_dpr_nq_reader_base_en_4.3.0_3.0_1674210699630.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_dpr_nq_reader_base_en_4.3.0_3.0_1674210699630.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_dpr_nq_reader_base","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_dpr_nq_reader_base","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_dpr_nq_reader_base|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|466.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/nlpconnect/dpr-nq-reader-roberta-base

---
layout: model
title: Translate English to Romanian Pipeline
author: John Snow Labs
name: translate_en_ro
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, ro, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `en`
- target languages: `ro`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ro_xx_2.7.0_2.4_1609687586572.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ro_xx_2.7.0_2.4_1609687586572.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ro", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ro", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ro').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_en_ro|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Arabic BertForQuestionAnswering model (from bhavikardeshna)
author: John Snow Labs
name: bert_qa_multilingual_bert_base_cased_arabic
date: 2022-06-02
tags: [open_source, question_answering, bert]
task: Question Answering
language: ar
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `multilingual-bert-base-cased-arabic` is an Arabic model originally trained by `bhavikardeshna`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_bert_base_cased_arabic_ar_4.0.0_3.0_1654188420264.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_bert_base_cased_arabic_ar_4.0.0_3.0_1654188420264.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_multilingual_bert_base_cased_arabic","ar") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
  .setInputCols("question", "context")
  .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
  .pretrained("bert_qa_multilingual_bert_base_cased_arabic","ar")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
  ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
  ("What's my name?", "My name is Clara and I live in Berkeley."))
  .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("ar.answer_question.bert.multilingual_arabic_tuned_base_cased.by_bhavikardeshna").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_multilingual_bert_base_cased_arabic|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|ar|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/bhavikardeshna/multilingual-bert-base-cased-arabic

---
layout: model
title: Translate English to Lingala Pipeline
author: John Snow Labs
name: translate_en_ln
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, ln, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `en`
- target languages: `ln`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ln_xx_2.7.0_2.4_1609698946668.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ln_xx_2.7.0_2.4_1609698946668.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ln", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ln", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ln').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_en_ln|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English BertForQuestionAnswering model (from nlpunibo)
author: John Snow Labs
name: bert_qa_bert
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert` is an English model originally trained by `nlpunibo`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_en_4.0.0_3.0_1654179427441.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_en_4.0.0_3.0_1654179427441.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```

```scala
val document = new MultiDocumentAssembler()
  .setInputCols("question", "context")
  .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
  .pretrained("bert_qa_bert","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
  ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
  ("What's my name?", "My name is Clara and I live in Berkeley."))
  .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.bert.by_nlpunibo").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
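The Scala example caps input at 512 tokens via `setMaxSentenceLength(512)`; longer contexts get truncated. A rough pre-check using whitespace tokenization can flag risky inputs — note that the model's WordPiece tokenizer usually produces more tokens than a whitespace split, so treat this count only as a lower bound:

```python
def rough_token_count(text):
    # Whitespace split only; WordPiece tokenization usually yields
    # more tokens, so this is a lower bound, not an exact count.
    return len(text.split())

context = "My name is Clara and I live in Berkeley."
assert rough_token_count(context) <= 512  # safely under the model's limit
print(rough_token_count(context))  # → 9
```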
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/nlpunibo/bert

---
layout: model
title: Legal Processed Agricultural Produce Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_processed_agricultural_produce_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, processed_agricultural_produce, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_processed_agricultural_produce_bert model, a BERT Sentence Embeddings Document Classifier, determines whether the document belongs to the class Processed_Agricultural_Produce or not (binary classification), according to EuroVoc labels.

## Predicted Entities

`Processed_Agricultural_Produce`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_processed_agricultural_produce_bert_en_1.0.0_3.0_1678111667311.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_processed_agricultural_produce_bert_en_1.0.0_3.0_1678111667311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_processed_agricultural_produce_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+--------------------------------+
|result                          |
+--------------------------------+
|[Processed_Agricultural_Produce]|
|[Other]                         |
|[Other]                         |
|[Processed_Agricultural_Produce]|
+--------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_processed_agricultural_produce_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.7 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
                         label  precision  recall  f1-score  support
                         Other       0.90    0.92      0.91      421
Processed_Agricultural_Produce       0.93    0.91      0.92      487
                      accuracy          -       -      0.92      908
                     macro-avg       0.91    0.92      0.91      908
                  weighted-avg       0.92    0.92      0.92      908
```

---
layout: model
title: Portuguese asr_bp500_xlsr TFWav2Vec2ForCTC from lgris
author: John Snow Labs
name: asr_bp500_xlsr
date: 2022-09-26
tags: [wav2vec2, pt, audio, open_source, asr]
task: Automatic Speech Recognition
language: pt
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_bp500_xlsr` is a Portuguese model originally trained by lgris.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_bp500_xlsr_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_bp500_xlsr_pt_4.2.0_3.0_1664193982159.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_bp500_xlsr_pt_4.2.0_3.0_1664193982159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_bp500_xlsr", "pt")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_bp500_xlsr", "pt") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
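The snippets above assume an `audioDf` whose `audio_content` column already holds the raw audio samples as an array of floats. A minimal, dependency-free sketch of producing such an array from a 16-bit PCM WAV file is shown below; real pipelines often use a library such as librosa instead, and the function name here is illustrative:

```python
import struct
import wave

def read_wav_as_floats(path):
    """Read a 16-bit PCM mono WAV file, normalizing samples to [-1.0, 1.0].

    Sketch of the preprocessing needed to build the `audio_content`
    column consumed by AudioAssembler; not part of the pipeline itself.
    """
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "expects 16-bit PCM"
        frames = wav.readframes(wav.getnframes())
    # "<h" = little-endian signed 16-bit; divide by 2**15 to normalize
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]
```

The resulting list can then be placed in a single-column DataFrame named `audio_content`; the exact DataFrame construction depends on your Spark session and schema conventions.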
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_bp500_xlsr|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|pt|
|Size:|756.2 MB|

---
layout: model
title: English DistilBertForQuestionAnswering Base Cased model (from nlpunibo)
author: John Snow Labs
name: distilbert_qa_base_config3
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert_base_config3` is an English model originally trained by `nlpunibo`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_config3_en_4.3.0_3.0_1672774482106.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_config3_en_4.3.0_3.0_1672774482106.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_config3","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val Document_Assembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_config3","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

// A Seq of (question, context) tuples yields one row with both columns
val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_config3| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/nlpunibo/distilbert_base_config3 --- layout: model title: Stance About Health Mandates Related to Covid-19 Classifier (BioBERT) author: John Snow Labs name: bert_sequence_classifier_health_mandates_stance_tweet date: 2022-08-08 tags: [en, clinical, licensed, public_health, classifier, sequence_classification, covid_19, tweet, stance, mandate] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [BioBERT based](https://github.com/dmis-lab/biobert) classifier that can classify stance about health mandates related to Covid-19 from tweets. This model is intended for direct use as a classification model and the target classes are: Support, Disapproval, Not stated. 
## Predicted Entities `Support`, `Disapproval`, `Not stated` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_MANDATES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4SC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_health_mandates_stance_tweet_en_4.0.2_3.0_1659982585130.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_health_mandates_stance_tweet_en_4.0.2_3.0_1659982585130.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from pyspark.sql.types import StringType

document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_health_mandates_stance_tweet", "en", "clinical/models")\
    .setInputCols(["document",'token'])\
    .setOutputCol("class")

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

data = spark.createDataFrame(["""It's too dangerous to hold the RNC, but let's send students and teachers back to school.""",
"""So is the flu and pneumonia what are their s stop the Media Manipulation covid has treatments Youre Speaker Pelosi nephew so stop the agenda LIES.""",
"""Just a quick update to my U.S. followers, I'll be making a stop in all 50 states this spring! No tickets needed, just don't wash your hands, cough on each other.""",
"""Go to a restaurant no mask Do a food shop wear a mask INCONSISTENT No Masks No Masks.""",
"""But if schools close who is gonna occupy those graves Cause politiciansprotected smokers protected drunkardsprotected school kids amp teachers""",
"""New title Maskhole I think Im going to use this very soon coronavirus."""], StringType()).toDF("text")

result = pipeline.fit(data).transform(data)

result.select("text", "class.result").show(truncate=False)
```

```scala
val documenter = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_health_mandates_stance_tweet", "en", "clinical/models")
  .setInputCols(Array("document","token"))
  .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier))

val data = Seq("It's too dangerous to hold the RNC, but let's send students and teachers back to school.",
"So is the flu and pneumonia what are their s stop the Media Manipulation covid has treatments Youre Speaker Pelosi nephew so stop the agenda LIES.",
"Just a quick update to my U.S. followers, I'll be making a stop in all 50 states this spring! No tickets needed, just don't wash your hands, cough on each other.",
"Go to a restaurant no mask Do a food shop wear a mask INCONSISTENT No Masks No Masks.",
"But if schools close who is gonna occupy those graves Cause politiciansprotected smokers protected drunkardsprotected school kids amp teachers",
"New title Maskhole I think Im going to use this very soon coronavirus.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.classify.health_stance").predict("""Just a quick update to my U.S. followers, I'll be making a stop in all 50 states this spring! No tickets needed, just don't wash your hands, cough on each other.""")
```
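Under the hood, a sequence classifier like this scores each of the three classes and predicts the arg-max; a softmax turns the raw scores into probabilities. A minimal sketch of that computation with made-up logits (these are illustrative values, not actual model outputs):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Support", "Disapproval", "Not stated"]
logits = [2.1, 0.3, -0.5]  # illustrative scores, not real model outputs
probs = softmax(logits)
print(labels[probs.index(max(probs))])  # → Support
```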
## Results ```bash +------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+ |text |result | +------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+ |It's too dangerous to hold the RNC, but let's send students and teachers back to school. |[Support] | |So is the flu and pneumonia what are their s stop the Media Manipulation covid has treatments Youre Speaker Pelosi nephew so stop the agenda LIES. |[Disapproval]| |Just a quick update to my U.S. followers, I'll be making a stop in all 50 states this spring! No tickets needed, just don't wash your hands, cough on each other.|[Not stated] | |Go to a restaurant no mask Do a food shop wear a mask INCONSISTENT No Masks No Masks. |[Disapproval]| |But if schools close who is gonna occupy those graves Cause politiciansprotected smokers protected drunkardsprotected school kids amp teachers |[Support] | |New title Maskhole I think Im going to use this very soon coronavirus. |[Not stated] | +------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_health_mandates_stance_tweet| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References The dataset is Covid-19-specific and consists of tweets collected via a series of keywords associated with that disease. 
## Benchmarking

```bash
       label  precision  recall  f1-score  support
 Disapproval       0.70    0.64      0.67      158
  Not_stated       0.75    0.78      0.76      244
     Support       0.73    0.74      0.74      197
    accuracy          -       -      0.73      599
   macro-avg       0.72    0.72      0.72      599
weighted-avg       0.73    0.73      0.73      599
```

---
layout: model
title: Explain Document Pipeline for Polish
author: John Snow Labs
name: explain_document_sm
date: 2021-03-22
tags: [open_source, polish, explain_document_sm, pipeline, pl]
supported: true
task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging]
language: pl
edition: Spark NLP 3.0.0
spark_version: 3.0
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The explain_document_sm is a pretrained pipeline that performs basic text processing steps, covering most of the common text processing tasks on your dataframe.

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_pl_3.0.0_3.0_1616423208721.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_sm_pl_3.0.0_3.0_1616423208721.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('explain_document_sm', lang = 'pl')
annotations = pipeline.fullAnnotate("Witaj z John Snow Labs!")[0]
annotations.keys()
```

```scala
val pipeline = new PretrainedPipeline("explain_document_sm", lang = "pl")
val result = pipeline.fullAnnotate("Witaj z John Snow Labs!")(0)
```

{:.nlu-block}
```python
import nlu
text = ["Witaj z John Snow Labs!"]
result_df = nlu.load('pl.explain').predict(text)
result_df
```
## Results

```bash
|    | document                     | sentence                    | token                                   | lemma                                   | pos                                        | embeddings                   | ner                                   | entities            |
|---:|:-----------------------------|:----------------------------|:----------------------------------------|:----------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------|
|  0 | ['Witaj z John Snow Labs! '] | ['Witaj z John Snow Labs!'] | ['Witaj', 'z', 'John', 'Snow', 'Labs!'] | ['witać', 'z', 'John', 'Snow', 'Labs!'] | ['VERB', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.0, 0.0, 0.0, 0.0,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|explain_document_sm|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|pl|

---
layout: model
title: French CamemBert Embeddings (from safik)
author: John Snow Labs
name: camembert_embeddings_safik_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `safik`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_safik_generic_model_fr_3.4.4_3.0_1653990191365.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_safik_generic_model_fr_3.4.4_3.0_1653990191365.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_safik_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_safik_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
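Embeddings produced by this model are typically consumed downstream, for example by comparing tokens or sentences with cosine similarity. A minimal sketch with illustrative 4-dimensional vectors (real CamemBERT embeddings are 768-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative toy vectors, not actual model output.
v1 = [0.2, 0.1, -0.4, 0.3]
v2 = [0.2, 0.1, -0.4, 0.3]
print(cosine_similarity(v1, v2))  # identical vectors -> ~1.0
```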
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_safik_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/safik/dummy-model --- layout: model title: Extract Granular Anatomical Entities from Oncology Texts author: John Snow Labs name: ner_oncology_anatomy_granular date: 2022-11-24 tags: [licensed, clinical, en, oncology, ner, anatomy] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts mentions of anatomical entities using granular labels. ## Predicted Entities `Direction`, `Site_Lymph_Node`, `Site_Breast`, `Site_Other_Body_Part`, `Site_Bone`, `Site_Liver`, `Site_Lung`, `Site_Brain` Definitions of Predicted Entities: - `Direction`: Directional and laterality terms, such as "left", "right", "bilateral", "upper" and "lower". - `Site_Bone`: Anatomical terms that refer to the human skeleton. - `Site_Brain`: Anatomical terms that refer to the central nervous system (including the brain stem and the cerebellum). - `Site_Breast`: Anatomical terms that refer to the breasts. - `Site_Liver`: Anatomical terms that refer to the liver. - `Site_Lung`: Anatomical terms that refer to the lungs. - `Site_Lymph_Node`: Anatomical terms that refer to lymph nodes, excluding adenopathies. - `Site_Other_Body_Part`: Relevant anatomical terms that are not included in the rest of the anatomical entities. 
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_granular_en_4.2.2_3.0_1669299394344.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_granular_en_4.2.2_3.0_1669299394344.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_anatomy_granular", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_anatomy_granular", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new 
Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_anatomy_granular").predict("""The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.""") ```
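The NerConverter stage at the end of the pipeline merges token-level BIO tags into labeled chunks. A simplified sketch of that merging logic is shown below (tags are illustrative; the real annotator also tracks character offsets and metadata):

```python
def bio_to_chunks(tokens, tags):
    """Merge token-level BIO tags into (chunk, label) pairs.

    Illustrative reimplementation of the idea behind NerConverter,
    not the actual Spark NLP code.
    """
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["mass", "in", "her", "left", "breast"]
tags = ["O", "O", "O", "B-Direction", "B-Site_Breast"]
print(bio_to_chunks(tokens, tags))
# → [('left', 'Direction'), ('breast', 'Site_Breast')]
```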
## Results ```bash | chunk | ner_label | |:--------|:------------| | left | Direction | | breast | Site_Breast | | lungs | Site_Lung | | liver | Site_Liver | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_anatomy_granular| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.3 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Direction 822 221 162 984 0.79 0.84 0.81 Site_Lymph_Node 481 38 70 551 0.93 0.87 0.90 Site_Breast 88 14 59 147 0.86 0.60 0.71 Site_Other_Body_Part 604 184 897 1501 0.77 0.40 0.53 Site_Bone 252 74 61 313 0.77 0.81 0.79 Site_Liver 178 92 56 234 0.66 0.76 0.71 Site_Lung 398 98 161 559 0.80 0.71 0.75 Site_Brain 197 44 82 279 0.82 0.71 0.76 macro_avg 3020 765 1548 4568 0.80 0.71 0.74 micro_avg 3020 765 1548 4568 0.80 0.66 0.71 ``` --- layout: model title: Finnish asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot TFWav2Vec2ForCTC from aapot author: John Snow Labs name: asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot` is a Finnish model originally trained by aapot. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot_fi_4.2.0_3.0_1664022293307.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot_fi_4.2.0_3.0_1664022293307.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot", "fi")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot", "fi") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
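Wav2Vec2-family models are generally trained on 16 kHz audio, so recordings at other sample rates should be resampled before being fed to the pipeline. The naive linear-interpolation resampler below is for illustration only; in practice use a proper DSP library (e.g. resampy or torchaudio) to avoid aliasing artifacts:

```python
def resample_linear(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampler (illustrative sketch)."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate     # fractional source index
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# Halving a 32 kHz signal down to 16 kHz keeps every other sample position.
halved = resample_linear([0.0, 1.0, 0.0, -1.0], 32000, 16000)
print(halved)  # → [0.0, 0.0]
```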
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|fi|
|Size:|3.6 GB|

---
layout: model
title: Detect PHI for Deidentification purposes (Italian)
author: John Snow Labs
name: ner_deid_subentity
date: 2022-03-22
tags: [deid, it, licensed]
task: Named Entity Recognition
language: it
edition: Healthcare NLP 3.4.2
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.

Deidentification NER (Italian) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 19 entities. This NER model is trained on a custom, internally annotated dataset, with a COVID-19 Italian de-identification research dataset [(Catelli et al.)](https://ieeexplore.ieee.org/document/9335570) making up 15% of the total data, plus several data augmentation mechanisms.
## Predicted Entities `DATE`, `AGE`, `SEX`, `PROFESSION`, `ORGANIZATION`, `PHONE`, `EMAIL`, `ZIP`, `STREET`, `CITY`, `COUNTRY`, `PATIENT`, `DOCTOR`, `HOSPITAL`, `MEDICALRECORD`, `SSN`, `IDNUM`, `USERNAME`, `URL` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_it_3.4.2_3.0_1647983756765.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_it_3.4.2_3.0_1647983756765.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "it")\ .setInputCols(["sentence", "token"])\ .setOutputCol("word_embeddings") clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "it", "clinical/models")\ .setInputCols(["sentence","token", "word_embeddings"])\ .setOutputCol("ner") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner]) text = ["Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015."] data = spark.createDataFrame([text]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "it") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "it", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner)) val text = "Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015." 
val data = Seq(text).toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("it.med_ner.deid_subentity").predict("""Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015.""") ```
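The precision, recall, and F1 figures reported in the Benchmarking section below follow the standard definitions computed from each label's tp/fp/fn counts. As a minimal sketch (the `prf` helper is illustrative, not part of Spark NLP), the PATIENT row can be reproduced from its counts:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 4), round(recall, 4), round(f1, 4)

print(prf(263, 29, 25))  # PATIENT row: tp=263, fp=29, fn=25 → (0.9007, 0.9132, 0.9069)
```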
## Results ```bash +-------------+----------+ | token| ner_label| +-------------+----------+ | Ho| O| | visto| O| | Gastone| B-PATIENT| |Montanariello| I-PATIENT| | (| O| | 49| B-AGE| | anni| O| | )| O| | riferito| O| | all| O| | '| O| | Ospedale|B-HOSPITAL| | San|I-HOSPITAL| | Camillo|I-HOSPITAL| | per| O| | diabete| O| | mal| O| | controllato| O| | con| O| | sintomi| O| | risalenti| O| | a| O| | marzo| B-DATE| | 2015| I-DATE| | .| O| +-------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity| |Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|it| |Size:|15.0 MB| ## References - Internally annotated corpus - [COVID-19 Italian de-identification dataset making up 15% of total data: R. Catelli, F. Gargiulo, V. Casola, G. De Pietro, H. Fujita and M. Esposito, "A Novel COVID-19 Data Set and an Effective Deep Learning Approach for the De-Identification of Italian Medical Records," in IEEE Access, vol. 9, pp. 
19097-19110, 2021, doi: 10.1109/ACCESS.2021.3054479.](https://ieeexplore.ieee.org/document/9335570) ## Benchmarking ```bash label tp fp fn total precision recall f1 PATIENT 263.0 29.0 25.0 288.0 0.9007 0.9132 0.9069 HOSPITAL 365.0 36.0 48.0 413.0 0.9102 0.8838 0.8968 DATE 1164.0 13.0 26.0 1190.0 0.989 0.9782 0.9835 ORGANIZATION 72.0 25.0 26.0 98.0 0.7423 0.7347 0.7385 URL 41.0 0.0 0.0 41.0 1.0 1.0 1.0 CITY 421.0 9.0 19.0 440.0 0.9791 0.9568 0.9678 STREET 198.0 4.0 6.0 204.0 0.9802 0.9706 0.9754 USERNAME 20.0 2.0 2.0 22.0 0.9091 0.9091 0.9091 SEX 753.0 26.0 21.0 774.0 0.9666 0.9729 0.9697 IDNUM 113.0 3.0 7.0 120.0 0.9741 0.9417 0.9576 EMAIL 148.0 0.0 0.0 148.0 1.0 1.0 1.0 ZIP 148.0 3.0 1.0 149.0 0.9801 0.9933 0.9867 MEDICALRECORD 19.0 3.0 6.0 25.0 0.8636 0.76 0.8085 SSN 13.0 1.0 1.0 14.0 0.9286 0.9286 0.9286 PROFESSION 316.0 28.0 53.0 369.0 0.9186 0.8564 0.8864 PHONE 53.0 0.0 2.0 55.0 1.0 0.9636 0.9815 COUNTRY 182.0 14.0 15.0 197.0 0.9286 0.9239 0.9262 DOCTOR 769.0 77.0 62.0 831.0 0.909 0.9254 0.9171 AGE 763.0 8.0 18.0 781.0 0.9896 0.977 0.9832 macro - - - - - - 0.9328 micro - - - - - - 0.9494 ``` --- layout: model title: Translate Igbo to English Pipeline author: John Snow Labs name: translate_ig_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ig, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `ig` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ig_en_xx_2.7.0_2.4_1609690747646.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ig_en_xx_2.7.0_2.4_1609690747646.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ig_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ig_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ig.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ig_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate English to Luba-Katanga Pipeline author: John Snow Labs name: translate_en_lu date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, lu, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `lu` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_lu_xx_2.7.0_2.4_1609701845697.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_lu_xx_2.7.0_2.4_1609701845697.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_lu", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_lu", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.lu').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_lu| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English Bert Embeddings (Base, Uncased, Agriculture) author: John Snow Labs name: bert_embeddings_agriculture_bert_uncased date: 2022-04-11 tags: [bert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `agriculture-bert-uncased` is an English model originally trained by `recobo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_agriculture_bert_uncased_en_3.4.2_3.0_1649672401296.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_agriculture_bert_uncased_en_3.4.2_3.0_1649672401296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_agriculture_bert_uncased","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_agriculture_bert_uncased","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.agriculture_bert_uncased").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_agriculture_bert_uncased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|412.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/recobo/agriculture-bert-uncased --- layout: model title: Malay T5ForConditionalGeneration Tiny Cased model (from mesolitica) author: John Snow Labs name: t5_finetune_translation_tiny_standard_bahasa_cased date: 2023-01-30 tags: [ms, open_source, t5, tensorflow] task: Text Generation language: ms edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `finetune-translation-t5-tiny-standard-bahasa-cased` is a Malay model originally trained by `mesolitica`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_finetune_translation_tiny_standard_bahasa_cased_ms_4.3.0_3.0_1675102243431.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_finetune_translation_tiny_standard_bahasa_cased_ms_4.3.0_3.0_1675102243431.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_finetune_translation_tiny_standard_bahasa_cased","ms") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_finetune_translation_tiny_standard_bahasa_cased","ms") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_finetune_translation_tiny_standard_bahasa_cased| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ms| |Size:|176.9 MB| ## References - https://huggingface.co/mesolitica/finetune-translation-t5-tiny-standard-bahasa-cased - https://github.com/huseinzol05/malay-dataset/tree/master/translation/laser - https://github.com/huseinzol05/malaya/tree/master/session/translation/hf-t5 --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_6 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-1024-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_6_en_4.0.0_3.0_1657184079074.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_6_en_4.0.0_3.0_1657184079074.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_6","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(False) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
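Under the hood, an extractive QA head scores every context token as a possible answer start and end, and the returned answer is the best-scoring span. A plain-Python sketch of that span-selection step (the token scores here are made-up illustrative values, not Spark NLP output):

```python
# Hypothetical start/end scores for each context token.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end_scores = [0.1, 0.1, 0.1, 4.8, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]

# Pick the best start token, then the best end token at or after it.
start = max(range(len(tokens)), key=start_scores.__getitem__)
end = max(range(start, len(tokens)), key=end_scores.__getitem__)
answer = " ".join(tokens[start:end + 1])
print(answer)  # → Clara
```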
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-1024-finetuned-squad-seed-6 --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from Akshat) author: John Snow Labs name: xlmroberta_ner_akshat_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `Akshat`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_akshat_base_finetuned_panx_de_4.1.0_3.0_1660429170020.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_akshat_base_finetuned_panx_de_4.1.0_3.0_1660429170020.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_akshat_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_akshat_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_akshat_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Akshat/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Pipeline to Detect Radiology Related Entities author: John Snow Labs name: ner_radiology_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_radiology](https://nlp.johnsnowlabs.com/2021/03/31/ner_radiology_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_RADIOLOGY/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_RADIOLOGY.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_radiology_pipeline_en_3.4.1_3.0_1647874212591.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_radiology_pipeline_en_3.4.1_3.0_1647874212591.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_radiology_pipeline", "en", "clinical/models") pipeline.fullAnnotate("Breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_radiology_pipeline", "en", "clinical/models") pipeline.fullAnnotate("Breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.radiology.pipeline").predict("""Breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""") ```
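The `chunk` column in the Results table comes from merging contiguous BIO-tagged tokens into entity chunks (the pipeline's NerConverter stage performs this step). A plain-Python sketch of that merge, with a made-up two-token example (illustrative only, not the Spark NLP API):

```python
def bio_to_chunks(tokens, tags):
    """Merge BIO-tagged tokens into (chunk, label) pairs, NerConverter-style."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity begins
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:  # continuation of the current entity
            current.append(tok)
        else:                              # outside any entity
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(bio_to_chunks(
    ["Breast", "ultrasound", "was"],
    ["B-BodyPart", "B-ImagingTest", "O"],
))  # → [('Breast', 'BodyPart'), ('ultrasound', 'ImagingTest')]
```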
## Results ```bash +---------------------+-------------------------+ |chunk |ner_label | +---------------------+-------------------------+ |Breast |BodyPart | |ultrasound |ImagingTest | |ovoid mass |ImagingFindings | |0.5 x 0.5 x 0.4 |Measurements | |cm |Units | |left shoulder |BodyPart | |mass |ImagingFindings | |isoechoic echotexture|ImagingFindings | |muscle |BodyPart | |internal color flow |ImagingFindings | |benign fibrous tissue|ImagingFindings | |lipoma |Disease_Syndrome_Disorder| +---------------------+-------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_radiology_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English image_classifier_vit_exper7_mesum5 ViTForImageClassification from sudo-s author: John Snow Labs name: image_classifier_vit_exper7_mesum5 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_exper7_mesum5` is an English model originally trained by sudo-s.
## Predicted Entities `45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper7_mesum5_en_4.1.0_3.0_1660168221446.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper7_mesum5_en_4.1.0_3.0_1660168221446.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_exper7_mesum5", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_exper7_mesum5", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_exper7_mesum5| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.3 MB| --- layout: model title: Pipeline to Detect Anatomical References (biobert) author: John Snow Labs name: ner_anatomy_biobert_pipeline date: 2023-03-20 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_anatomy_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_anatomy_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_biobert_pipeline_en_4.3.0_3.2_1679312126242.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_biobert_pipeline_en_4.3.0_3.2_1679312126242.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_anatomy_biobert_pipeline", "en", "clinical/models") text = '''This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now. General: Well-developed female, in no acute distress, afebrile. HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist. Neck: No lymphadenopathy. Chest: Clear. Abdomen: Positive bowel sounds and soft. Dermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_anatomy_biobert_pipeline", "en", "clinical/models") val text = "This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now. 
General: Well-developed female, in no acute distress, afebrile. HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist. Neck: No lymphadenopathy. Chest: Clear. Abdomen: Positive bowel sounds and soft. Dermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.anatomy_biobert.pipeline").predict("""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now. General: Well-developed female, in no acute distress, afebrile. HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist. Neck: No lymphadenopathy. Chest: Clear. Abdomen: Positive bowel sounds and soft. Dermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-----------------|--------:|------:|:-----------------------|-------------:| | 0 | right | 314 | 318 | Organism_subdivision | 0.9948 | | 1 | great | 320 | 324 | Organism_subdivision | 0.8723 | | 2 | toe | 326 | 328 | Organism_subdivision | 0.9205 | | 3 | skin | 374 | 377 | Organ | 1 | | 4 | Sclerae | 542 | 548 | Pathological_formation | 0.8029 | | 5 | Extraocular | 574 | 584 | Multi-tissue_structure | 0.8437 | | 6 | muscles | 586 | 592 | Multi-tissue_structure | 0.8796 | | 7 | Nares | 613 | 617 | Organ | 0.7716 | | 8 | turbinates | 659 | 668 | Multi-tissue_structure | 0.9257 | | 9 | Mucous membranes | 716 | 731 | Cell | 0.70435 | | 10 | Neck | 744 | 747 | Organism_subdivision | 0.9982 | | 11 | Abdomen | 784 | 790 | Organism_subdivision | 0.8902 | | 12 | bowel | 802 | 806 | Organism_subdivision | 1 | | 13 | right | 869 | 873 | Organism_subdivision | 0.9967 | | 14 | toe | 881 | 883 | Organism_subdivision | 0.9816 | | 15 | skin | 933 | 936 | Organ | 1 | | 16 | toenails | 943 | 950 | Organism_subdivision | 0.9999 | | 17 | foot | 999 | 1002 | Organism_subdivision | 0.9831 | | 18 | toe | 1023 | 1025 | Organism_subdivision | 0.9653 | | 19 | toenails | 1031 | 1038 | Organism_subdivision | 0.9999 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_anatomy_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.2 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Detect Clinical Entities in Romanian (Bert, Base, Cased) author: John Snow Labs name: ner_clinical_bert date: 2022-06-30 tags: [licensed, clinical, ro, ner, bert] task: Named Entity Recognition language: ro edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: 
MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract clinical entities from Romanian clinical texts. This model is trained using `bert_base_cased` embeddings. ## Predicted Entities `Measurements`, `Form`, `Symptom`, `Route`, `Procedure`, `Disease_Syndrome_Disorder`, `Score`, `Drug_Ingredient`, `Pulse`, `Frequency`, `Date`, `Body_Part`, `Drug_Brand_Name`, `Time`, `Direction`, `Dosage`, `Medical_Device`, `Imaging_Technique`, `Test`, `Imaging_Findings`, `Imaging_Test`, `Test_Result`, `Weight`, `Clinical_Dept`, `Units` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_bert_ro_4.0.0_3.0_1656624991573.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_bert_ro_4.0.0_3.0_1656624991573.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical_bert","ro","clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) sample_text = """ Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. 
Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.""" data = spark.createDataFrame([[sample_text]]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical_bert", "ro", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter)) val data = Seq("""Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ro.embed.clinical.bert.base_cased").predict(""" Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. 
Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.""") ```
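Once the chunks have been extracted, a per-label frequency breakdown is often a useful first check. A minimal sketch with `collections.Counter`, using a hand-written subset of the (chunk, label) pairs from the Results table to stand in for the collected DataFrame rows:

```python
from collections import Counter

# A few (chunk, label) pairs as they appear in the Results table; in
# practice these would come from exploding the "ner_chunk" column.
pairs = [
    ("Angio CT cardio-toracic", "Imaging_Test"),
    ("Atrezie", "Disease_Syndrome_Disorder"),
    ("valva pulmonara", "Body_Part"),
    ("Tromboza", "Disease_Syndrome_Disorder"),
    ("30 ml", "Dosage"),
]

label_counts = Counter(label for _, label in pairs)
print(label_counts.most_common(1))
# [('Disease_Syndrome_Disorder', 2)]
```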
## Results ```bash +--------------------------+-------------------------+ |chunks |entities | +--------------------------+-------------------------+ |Angio CT cardio-toracic |Imaging_Test | |Atrezie |Disease_Syndrome_Disorder| |valva pulmonara |Body_Part | |Hipoplazie |Disease_Syndrome_Disorder| |VS |Body_Part | |Atrezie |Disease_Syndrome_Disorder| |VAV stang |Body_Part | |Anastomoza Glenn |Disease_Syndrome_Disorder| |Tromboza |Disease_Syndrome_Disorder| |Sectia Clinica Cardiologie|Clinical_Dept | |GE Revolution HD |Medical_Device | |Branula albastra |Medical_Device | |membrului superior drept |Body_Part | |Scout |Body_Part | |30 ml |Dosage | |Iomeron 350 |Drug_Ingredient | |2.2 ml/s |Dosage | |20 ml |Dosage | |ser fiziologic |Drug_Ingredient | |angio-CT |Imaging_Test | +--------------------------+-------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical_bert| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ro| |Size:|16.3 MB| ## Benchmarking ```bash label precision recall f1-score support Body_Part 0.91 0.93 0.92 679 Clinical_Dept 0.68 0.65 0.67 97 Date 0.99 0.99 0.99 87 Direction 0.66 0.76 0.70 50 Disease_Syndrome_Disorder 0.73 0.76 0.74 121 Dosage 0.78 1.00 0.87 38 Drug_Ingredient 0.90 0.94 0.92 48 Form 1.00 1.00 1.00 6 Imaging_Findings 0.86 0.82 0.84 201 Imaging_Technique 0.92 0.92 0.92 26 Imaging_Test 0.93 0.98 0.95 205 Measurements 0.71 0.69 0.70 214 Medical_Device 0.85 0.81 0.83 42 Pulse 0.82 1.00 0.90 9 Route 1.00 0.91 0.95 33 Score 1.00 0.98 0.99 41 Time 1.00 1.00 1.00 28 Units 0.60 0.93 0.73 88 Weight 0.82 1.00 0.90 9 micro-avg 0.84 0.87 0.86 2037 macro-avg 0.70 0.74 0.72 2037 weighted-avg 0.84 0.87 0.85 2037 ``` --- layout: model title: English DistilBertForQuestionAnswering Cased model (from IDL) author: John Snow Labs name: distilbert_qa_autotrain_qna_1170143354 date: 2023-01-03 
tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-qna-1170143354` is an English model originally trained by `IDL`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_autotrain_qna_1170143354_en_4.3.0_3.0_1672765675805.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_autotrain_qna_1170143354_en_4.3.0_3.0_1672765675805.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_autotrain_qna_1170143354","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_autotrain_qna_1170143354","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
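After `transform`, the answer span sits inside the `answer` output column. A minimal post-processing sketch, with plain dicts standing in for the collected rows (the dict shape is illustrative, not the library's exact API):

```python
# Dict stand-ins for collected rows of the transform output; the "answer"
# column holds a list of annotations whose "result" is the answer span.
rows = [
    {"question": "What's my name?",
     "answer": [{"result": "Clara"}]},
]

# Map each question to its top answer string.
answers = {r["question"]: r["answer"][0]["result"] for r in rows}
print(answers)  # {"What's my name?": 'Clara'}
```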
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_autotrain_qna_1170143354| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/IDL/autotrain-qna-1170143354 --- layout: model title: Legal Employment Document Classifier (EURLEX) author: John Snow Labs name: legclf_employment_bert date: 2023-03-06 tags: [en, legal, classification, clauses, employment, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the same office. Given a document, the legclf_employment_bert model, a Bert Sentence Embeddings Document Classifier, predicts whether the document belongs to the Employment class or not (binary classification) according to EuroVoc labels. ## Predicted Entities `Employment`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_employment_bert_en_1.0.0_3.0_1678111724799.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_employment_bert_en_1.0.0_3.0_1678111724799.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_employment_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
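The predicted class lands in the `category` output column, one annotation list per document. A minimal, self-contained sketch of reading it out, with plain dicts standing in for collected rows (the dict shape and the empty-row fallback are illustrative assumptions):

```python
# Dict stand-ins for the "category" column of the transform output; each
# document normally gets a list with a single predicted-class annotation.
predictions = [
    {"category": [{"result": "Employment"}]},
    {"category": [{"result": "Other"}]},
    {"category": []},  # defensive: a row with no prediction
]

def predicted_class(row, default="Other"):
    # Take the first annotation's result, falling back if the list is empty.
    anns = row["category"]
    return anns[0]["result"] if anns else default

labels = [predicted_class(r) for r in predictions]
print(labels)  # ['Employment', 'Other', 'Other']
```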
## Results ```bash +------------+ |result | +------------+ |[Employment]| |[Other]| |[Other]| |[Employment]| +------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_employment_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Employment 0.90 0.89 0.89 70 Other 0.87 0.89 0.88 61 accuracy - - 0.89 131 macro-avg 0.88 0.89 0.89 131 weighted-avg 0.89 0.89 0.89 131 ``` --- layout: model title: English T5ForConditionalGeneration Small Cased model (from khanglam7012) author: John Snow Labs name: t5_small date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small` is an English model originally trained by `khanglam7012`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_en_4.3.0_3.0_1675125819094.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_en_4.3.0_3.0_1675125819094.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
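The generated text ends up in the `answers` output column, one annotation per generation. A minimal sketch of collecting it, with plain dicts standing in for the collected rows (the dict shape and the sample output string are illustrative assumptions):

```python
# Dict stand-ins for collected rows; "answers" holds the generated text.
rows = [
    {"text": "PUT YOUR STRING HERE",
     "answers": [{"result": "generated output for the input string"}]},
]

# Flatten all generations into a single list of strings.
generated = [ann["result"] for r in rows for ann in r["answers"]]
print(generated)
```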
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|253.6 MB| ## References - https://huggingface.co/khanglam7012/t5-small - https://user-images.githubusercontent.com/49101362/116334480-f5e57a00-a7dd-11eb-987c-186477f94b6e.png - https://pypi.org/project/keytotext/ - https://pepy.tech/project/keytotext - https://colab.research.google.com/github/gagan3012/keytotext/blob/master/Examples/K2T.ipynb - https://share.streamlit.io/gagan3012/keytotext/UI/app.py - https://github.com/gagan3012/keytotext/tree/master/Training%20Notebooks - https://github.com/gagan3012/keytotext/tree/master/Examples - https://user-images.githubusercontent.com/49101362/116220679-90e64180-a755-11eb-9246-82d93d924a6c.png - https://github.com/gagan3012/streamlit-tags - https://user-images.githubusercontent.com/49101362/116162205-fc042980-a6fd-11eb-892e-8f6902f193f4.png --- layout: model title: Part of Speech for Irish author: John Snow Labs name: pos_ud_idt date: 2020-07-29 23:34:00 +0800 task: Part of Speech Tagging language: ga edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [pos, ga] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_idt_ga_2.5.5_2.4_1596054150271.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_idt_ga_2.5.5_2.4_1596054150271.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_idt", "ga") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Seachas a bheith ina rí ar an tuaisceart, is dochtúir Sasanach é John Snow agus ceannaire i bhforbairt ainéistéise agus sláinteachas míochaine.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_idt", "ga") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Seachas a bheith ina rí ar an tuaisceart, is dochtúir Sasanach é John Snow agus ceannaire i bhforbairt ainéistéise agus sláinteachas míochaine.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Seachas a bheith ina rí ar an tuaisceart, is dochtúir Sasanach é John Snow agus ceannaire i bhforbairt ainéistéise agus sláinteachas míochaine."""] pos_df = nlu.load('ga.pos').predict(text, output_level='token') pos_df ```
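Each `pos` annotation carries the tag in its `result` field and the surface form under `metadata['word']`, as the Results block shows. A minimal sketch of rebuilding a `word/TAG` string from that output, with plain dicts standing in for the annotation objects:

```python
# Dict stand-ins for the first few "pos" annotations from the Results block.
pos_annotations = [
    {"result": "ADP",  "metadata": {"word": "Seachas"}},
    {"result": "PART", "metadata": {"word": "a"}},
    {"result": "NOUN", "metadata": {"word": "bheith"}},
]

# Join each token with its tag in the conventional word/TAG notation.
tagged = " ".join(f'{a["metadata"]["word"]}/{a["result"]}'
                  for a in pos_annotations)
print(tagged)  # Seachas/ADP a/PART bheith/NOUN
```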
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=6, result='ADP', metadata={'word': 'Seachas'}), Row(annotatorType='pos', begin=8, end=8, result='PART', metadata={'word': 'a'}), Row(annotatorType='pos', begin=10, end=15, result='NOUN', metadata={'word': 'bheith'}), Row(annotatorType='pos', begin=17, end=19, result='ADP', metadata={'word': 'ina'}), Row(annotatorType='pos', begin=21, end=22, result='NOUN', metadata={'word': 'rí'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_idt| |Type:|pos| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|ga| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Stopwords Remover for Bulgarian language (405 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, bg, open_source] task: Stop Words Removal language: bg edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_bg_3.4.1_3.0_1646672949029.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_bg_3.4.1_3.0_1646672949029.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","bg") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Не си по-добър от мен"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","bg") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Не си по-добър от мен").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("bg.stopwords").predict("""Не си по-добър от мен""") ```
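Conceptually, the cleaner just filters tokens against its stopword list. A minimal sketch of that filtering, using a tiny hand-picked subset of the 405-entry Bulgarian list purely for illustration (the full list ships inside the model):

```python
# Illustrative subset of the Bulgarian stopword list; the capitalised "Не"
# is not in this subset, matching the Results block where it survives.
stopwords = {"си", "от", "мен"}

tokens = ["Не", "си", "по-добър", "от", "мен"]
clean_tokens = [t for t in tokens if t not in stopwords]
print(clean_tokens)  # ['Не', 'по-добър']
```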
## Results ```bash +--------------+ |result | +--------------+ |[Не, по-добър]| +--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|bg| |Size:|3.0 KB| --- layout: model title: Pipeline to Detect clinical events (biobert) author: John Snow Labs name: ner_events_biobert_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_events_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_events_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_events_biobert_pipeline_en_3.4.1_3.0_1647873577802.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_events_biobert_pipeline_en_3.4.1_3.0_1647873577802.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_events_biobert_pipeline", "en", "clinical/models") pipeline.annotate("The patient presented to the emergency room last evening") ``` ```scala val pipeline = new PretrainedPipeline("ner_events_biobert_pipeline", "en", "clinical/models") pipeline.annotate("The patient presented to the emergency room last evening") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.biobert_events.pipeline").predict("""The patient presented to the emergency room last evening""") ```
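`annotate` returns a single dict mapping each output column to a list of strings. A minimal sketch of pairing chunks with their labels, where the `label` column name is an assumption made for illustration (the chunk values mirror the Results block):

```python
# Dict stand-in for one annotate() result; "label" is a hypothetical
# parallel column holding the entity type of each chunk.
annotated = {
    "ner_chunk": ["presented", "the emergency room"],
    "label": ["OCCURRENCE", "CLINICAL_DEPT"],
}

# Pair each chunk with its label positionally.
pairs = list(zip(annotated["ner_chunk"], annotated["label"]))
print(pairs)
# [('presented', 'OCCURRENCE'), ('the emergency room', 'CLINICAL_DEPT')]
```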
## Results ```bash +------------------+-------------+ |chunks |entities | +------------------+-------------+ |presented |OCCURRENCE | |the emergency room|CLINICAL_DEPT| +------------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_events_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.0 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from Shappey) author: John Snow Labs name: roberta_qa_base_qna_squad2_trained date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-QnA-squad2-trained` is an English model originally trained by `Shappey`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_qna_squad2_trained_en_4.3.0_3.0_1674212572106.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_qna_squad2_trained_en_4.3.0_3.0_1674212572106.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_qna_squad2_trained","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_qna_squad2_trained","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_qna_squad2_trained| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|456.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Shappey/roberta-base-QnA-squad2-trained --- layout: model title: English T5ForConditionalGeneration Cased model (from google) author: John Snow Labs name: t5_efficient_xl_nl2 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-xl-nl2` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_xl_nl2_en_4.3.0_3.0_1675124205513.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_xl_nl2_en_4.3.0_3.0_1675124205513.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_xl_nl2","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_xl_nl2","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_xl_nl2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|575.7 MB| ## References - https://huggingface.co/google/t5-efficient-xl-nl2 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Hindi Bert Embeddings (from monsoon-nlp) author: John Snow Labs name: bert_embeddings_muril_adapted_local date: 2022-04-11 tags: [bert, embeddings, hi, open_source] task: Embeddings language: hi edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `muril-adapted-local` is a Hindi model originally trained by `monsoon-nlp`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_hi_3.4.2_3.0_1649673217108.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_hi_3.4.2_3.0_1649673217108.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","hi") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["मुझे स्पार्क एनएलपी पसंद है"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","hi") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("मुझे स्पार्क एनएलपी पसंद है").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("hi.embed.muril_adapted_local").predict("""मुझे स्पार्क एनएलपी पसंद है""") ```
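Each token in the `embeddings` output column carries one dense vector. A minimal sanity-check sketch, with plain dicts standing in for the annotation objects; the 768-dimension width is the usual BERT-base size and is an assumption here, not read from the model:

```python
# Dict stand-ins for token-level embedding annotations: one vector per token.
token_embeddings = [
    {"result": "मुझे", "embeddings": [0.0] * 768},
    {"result": "पसंद", "embeddings": [0.0] * 768},
]

# Check every token's vector has the expected dimensionality.
dims = {a["result"]: len(a["embeddings"]) for a in token_embeddings}
print(dims)  # {'मुझे': 768, 'पसंद': 768}
```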
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_muril_adapted_local| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|hi| |Size:|888.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/monsoon-nlp/muril-adapted-local - https://tfhub.dev/google/MuRIL/1 --- layout: model title: Fast Neural Machine Translation Model from Romance Languages to English author: John Snow Labs name: opus_mt_roa_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, roa, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `roa` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_roa_en_xx_2.7.0_2.4_1609166767520.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_roa_en_xx_2.7.0_2.4_1609166767520.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_roa_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_roa_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.roa.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_roa_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Multilingual XLMRobertaForTokenClassification Base Cased model (from Neha2608) author: John Snow Labs name: xlmroberta_ner_neha2608_base_finetuned_panx_all date: 2022-08-13 tags: [xx, open_source, xlm_roberta, ner] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-all` is a Multilingual model originally trained by `Neha2608`. ## Predicted Entities `ORG`, `LOC`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_neha2608_base_finetuned_panx_all_xx_4.1.0_3.0_1660427653138.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_neha2608_base_finetuned_panx_all_xx_4.1.0_3.0_1660427653138.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_neha2608_base_finetuned_panx_all","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_neha2608_base_finetuned_panx_all","xx") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_neha2608_base_finetuned_panx_all| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|861.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Neha2608/xlm-roberta-base-finetuned-panx-all --- layout: model title: German asr_wav2vec2_large_xlsr_53_German TFWav2Vec2ForCTC from MehdiHosseiniMoghadam author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_German date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_German` is a German model originally trained by MehdiHosseiniMoghadam. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_German_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_German_de_4.2.0_3.0_1664107471966.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_German_de_4.2.0_3.0_1664107471966.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_German', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_German", lang = "de") val annotations = pipeline.transform(audioDF) ```
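The pipeline consumes a DataFrame whose audio column holds the raw waveform as an array of floats; wav2vec2-style models typically expect 16 kHz mono input. Below is a minimal, Spark-free sketch (the function name is illustrative, not part of the Spark NLP API) of decoding a 16-bit PCM WAV into such a float array using only the Python standard library:

```python
import io
import struct
import wave

def wav_to_floats(wav_bytes: bytes) -> list[float]:
    """Decode 16-bit PCM WAV bytes into floats in [-1.0, 1.0)."""
    with wave.open(io.BytesIO(wav_bytes)) as wf:
        assert wf.getsampwidth() == 2, "expects 16-bit PCM"
        frames = wf.readframes(wf.getnframes())
    # Little-endian signed 16-bit samples, scaled to unit range.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]
```

A row built from such a float list would then populate the audio DataFrame passed to `pipeline.transform`.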
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_German| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English ElectraForQuestionAnswering model (from howey) author: John Snow Labs name: electra_qa_large_squad date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-large-squad` is an English model originally trained by `howey`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_large_squad_en_4.0.0_3.0_1655920994017.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_large_squad_en_4.0.0_3.0_1655920994017.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_large_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_large_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.electra.large").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_large_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/howey/electra-large-squad --- layout: model title: English image_classifier_vit_robot22 ViTForImageClassification from sudo-s author: John Snow Labs name: image_classifier_vit_robot22 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_robot22` is an English model originally trained by sudo-s.
## Predicted Entities `45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_robot22_en_4.1.0_3.0_1660168305706.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_robot22_en_4.1.0_3.0_1660168305706.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_robot22", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_robot22", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_robot22| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.3 MB| --- layout: model title: German T5ForConditionalGeneration Cased model (from diversifix) author: John Snow Labs name: t5_diversiformer date: 2023-01-30 tags: [de, open_source, t5, tensorflow] task: Text Generation language: de edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `diversiformer` is a German model originally trained by `diversifix`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_diversiformer_de_4.3.0_3.0_1675100976411.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_diversiformer_de_4.3.0_3.0_1675100976411.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_diversiformer","de") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_diversiformer","de") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_diversiformer| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|de| |Size:|1.2 GB| ## References - https://huggingface.co/diversifix/diversiformer - https://arxiv.org/abs/2010.11934 - https://github.com/diversifix/diversiformer - https://www.gnu.org/licenses/ --- layout: model title: Spanish RobertaForQuestionAnswering (from hackathon-pln-es) author: John Snow Labs name: roberta_qa_roberta_base_biomedical_es_squad2_hackathon_pln date: 2022-06-21 tags: [es, open_source, question_answering, roberta] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-biomedical-es-squad2-es` is a Spanish model originally trained by `hackathon-pln-es`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_biomedical_es_squad2_hackathon_pln_es_4.0.0_3.0_1655790263349.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_biomedical_es_squad2_hackathon_pln_es_4.0.0_3.0_1655790263349.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_biomedical_es_squad2_hackathon_pln","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_biomedical_es_squad2_hackathon_pln","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.squadv2_bio_medical.roberta.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_biomedical_es_squad2_hackathon_pln| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|es| |Size:|465.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/hackathon-pln-es/roberta-base-biomedical-es-squad2-es - https://somosnlp.org/hackathon --- layout: model title: Pipeline to Extract Oncology Tests author: John Snow Labs name: ner_oncology_test_pipeline date: 2023-03-09 tags: [licensed, clinical, oncology, en, ner, test] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_oncology_test](https://nlp.johnsnowlabs.com/2022/11/24/ner_oncology_test_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_test_pipeline_en_4.3.0_3.2_1678351357734.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_test_pipeline_en_4.3.0_3.2_1678351357734.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_oncology_test_pipeline", "en", "clinical/models") text = ''' biopsy was conducted using an ultrasound guided thick-needle. His chest computed tomography (CT) scan was negative.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_oncology_test_pipeline", "en", "clinical/models") val text = " biopsy was conducted using an ultrasound guided thick-needle. His chest computed tomography (CT) scan was negative." val result = pipeline.fullAnnotate(text) ```
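`fullAnnotate` returns chunk annotations carrying `begin` and `end` character offsets into the input text; in Spark NLP annotations the `end` offset is inclusive. A minimal, Spark-free sketch of recovering a chunk string from its offsets (the helper name is illustrative):

```python
def chunk_text(text: str, begin: int, end: int) -> str:
    """Slice a Spark NLP annotation span; `end` is inclusive."""
    return text[begin:end + 1]

# Same input text as the pipeline example above.
text = (" biopsy was conducted using an ultrasound guided thick-needle."
        " His chest computed tomography (CT) scan was negative.")
print(chunk_text(text, 1, 6))    # -> biopsy
print(chunk_text(text, 31, 47))  # -> ultrasound guided
```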
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:--------------------------|--------:|------:|:---------------|-------------:| | 0 | biopsy | 1 | 6 | Pathology_Test | 0.9987 | | 1 | ultrasound guided | 31 | 47 | Imaging_Test | 0.87635 | | 2 | chest computed tomography | 67 | 91 | Imaging_Test | 0.9176 | | 3 | CT | 94 | 95 | Imaging_Test | 0.8294 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_test_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Translate English to Sino-Tibetan languages Pipeline author: John Snow Labs name: translate_en_sit date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, sit, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `sit` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_sit_xx_2.7.0_2.4_1609691674420.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_sit_xx_2.7.0_2.4_1609691674420.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_sit", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_sit", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.sit').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_sit| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Yoruba Named Entity Recognition (from mbeukman) author: John Snow Labs name: xlmroberta_ner_xlm_roberta_base_finetuned_yoruba_finetuned_ner_yoruba date: 2022-05-17 tags: [xlm_roberta, ner, token_classification, yo, open_source] task: Named Entity Recognition language: yo edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-yoruba-finetuned-ner-yoruba` is a Yoruba model originally trained by `mbeukman`. ## Predicted Entities `PER`, `ORG`, `LOC`, `DATE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_yoruba_finetuned_ner_yoruba_yo_3.4.2_3.0_1652808837178.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_yoruba_finetuned_ner_yoruba_yo_3.4.2_3.0_1652808837178.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_yoruba_finetuned_ner_yoruba","yo") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Mo nifẹ Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_yoruba_finetuned_ner_yoruba","yo") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Mo nifẹ Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_xlm_roberta_base_finetuned_yoruba_finetuned_ner_yoruba| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|yo| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-yoruba-finetuned-ner-yoruba - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://www.apache.org/licenses/LICENSE-2.0 - https://github.com/Michael-Beukman --- layout: model title: Word2Vec Embeddings in Malagasy (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, mg, open_source] task: Embeddings language: mg edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_mg_3.4.1_3.0_1647444052051.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_mg_3.4.1_3.0_1647444052051.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","mg") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Tiako ny spark nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","mg") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Tiako ny spark nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("mg.embed.w2v_cc_300d").predict("""Tiako ny spark nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|mg| |Size:|233.5 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Pipeline to Detect Cancer Genetics author: John Snow Labs name: ner_bionlp_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_bionlp](https://nlp.johnsnowlabs.com/2021/03/31/ner_bionlp_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_bionlp_pipeline_en_3.4.1_3.0_1647871349979.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_bionlp_pipeline_en_3.4.1_3.0_1647871349979.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_bionlp_pipeline", "en", "clinical/models") pipeline.annotate("The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.") ``` ```scala val pipeline = new PretrainedPipeline("ner_bionlp_pipeline", "en", "clinical/models") pipeline.annotate("The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. 
Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.bionlp.pipeline").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.""") ```
## Results ```bash +----------------------+--------------------+ |chunk |ner_label | +----------------------+--------------------+ |human |Organism | |Kir 3.3 |Gene_or_gene_product| |GIRK3 |Gene_or_gene_product| |potassium |Simple_chemical | |GIRK |Gene_or_gene_product| |chromosome 1q21-23 |Cellular_component | |pancreas |Organ | |tissues |Tissue | |fat andskeletal muscle|Tissue | |KCNJ9 |Gene_or_gene_product| |Type II |Gene_or_gene_product| +----------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_bionlp_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Detect Persons, Locations, Organizations and Misc Entities in Polish (WikiNER 6B 100) author: John Snow Labs name: wikiner_6B_100 date: 2020-05-10 task: Named Entity Recognition language: pl edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [ner, pl, open_source] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 6B 100 is trained with GloVe 6B 100 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_PL){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_PL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_6B_100_pl_2.5.0_2.4_1588519719293.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_6B_100_pl_2.5.0_2.4_1588519719293.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_100d") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("wikiner_6B_100", "pl") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (ur. 28 października 1955 r.) To amerykański magnat biznesowy, programista, inwestor i filantrop. Najbardziej znany jest jako współzałożyciel Microsoft Corporation. Podczas swojej kariery w Microsoft Gates zajmował stanowiska prezesa, dyrektora generalnego (CEO), prezesa i głównego architekta oprogramowania, będąc jednocześnie największym indywidualnym akcjonariuszem do maja 2014 r. Jest jednym z najbardziej znanych przedsiębiorców i pionierów rewolucja mikrokomputerowa lat 70. i 80. Urodzony i wychowany w Seattle w stanie Waszyngton, Gates był współzałożycielem Microsoftu z przyjacielem z dzieciństwa Paulem Allenem w 1975 r. W Albuquerque w Nowym Meksyku; stała się największą na świecie firmą produkującą oprogramowanie komputerowe. Gates prowadził firmę jako prezes i dyrektor generalny, aż do ustąpienia ze stanowiska dyrektora generalnego w styczniu 2000 r., Ale pozostał przewodniczącym i został głównym architektem oprogramowania. Pod koniec lat 90. Gates był krytykowany za taktykę biznesową, którą uważano za antykonkurencyjną. Opinię tę podtrzymują liczne orzeczenia sądowe. W czerwcu 2006 r. Gates ogłosił, że przejdzie do pracy w niepełnym wymiarze godzin w Microsoft i pracy w pełnym wymiarze godzin w Bill & Melinda Gates Foundation, prywatnej fundacji charytatywnej, którą on i jego żona Melinda Gates utworzyli w 2000 r. 
Stopniowo przeniósł obowiązki na Raya Ozziego i Craiga Mundie. Zrezygnował z funkcji prezesa Microsoftu w lutym 2014 r. I objął nowe stanowisko jako doradca ds. Technologii, aby wesprzeć nowo mianowaną CEO Satyę Nadellę.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_100d") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("wikiner_6B_100", "pl") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (ur. 28 października 1955 r.) To amerykański magnat biznesowy, programista, inwestor i filantrop. Najbardziej znany jest jako współzałożyciel Microsoft Corporation. Podczas swojej kariery w Microsoft Gates zajmował stanowiska prezesa, dyrektora generalnego (CEO), prezesa i głównego architekta oprogramowania, będąc jednocześnie największym indywidualnym akcjonariuszem do maja 2014 r. Jest jednym z najbardziej znanych przedsiębiorców i pionierów rewolucja mikrokomputerowa lat 70. i 80. Urodzony i wychowany w Seattle w stanie Waszyngton, Gates był współzałożycielem Microsoftu z przyjacielem z dzieciństwa Paulem Allenem w 1975 r. W Albuquerque w Nowym Meksyku; stała się największą na świecie firmą produkującą oprogramowanie komputerowe. Gates prowadził firmę jako prezes i dyrektor generalny, aż do ustąpienia ze stanowiska dyrektora generalnego w styczniu 2000 r., Ale pozostał przewodniczącym i został głównym architektem oprogramowania. Pod koniec lat 90. Gates był krytykowany za taktykę biznesową, którą uważano za antykonkurencyjną. Opinię tę podtrzymują liczne orzeczenia sądowe. W czerwcu 2006 r. 
Gates ogłosił, że przejdzie do pracy w niepełnym wymiarze godzin w Microsoft i pracy w pełnym wymiarze godzin w Bill & Melinda Gates Foundation, prywatnej fundacji charytatywnej, którą on i jego żona Melinda Gates utworzyli w 2000 r. Stopniowo przeniósł obowiązki na Raya Ozziego i Craiga Mundie. Zrezygnował z funkcji prezesa Microsoftu w lutym 2014 r. I objął nowe stanowisko jako doradca ds. Technologii, aby wesprzeć nowo mianowaną CEO Satyę Nadellę.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (ur. 28 października 1955 r.) To amerykański magnat biznesowy, programista, inwestor i filantrop. Najbardziej znany jest jako współzałożyciel Microsoft Corporation. Podczas swojej kariery w Microsoft Gates zajmował stanowiska prezesa, dyrektora generalnego (CEO), prezesa i głównego architekta oprogramowania, będąc jednocześnie największym indywidualnym akcjonariuszem do maja 2014 r. Jest jednym z najbardziej znanych przedsiębiorców i pionierów rewolucja mikrokomputerowa lat 70. i 80. Urodzony i wychowany w Seattle w stanie Waszyngton, Gates był współzałożycielem Microsoftu z przyjacielem z dzieciństwa Paulem Allenem w 1975 r. W Albuquerque w Nowym Meksyku; stała się największą na świecie firmą produkującą oprogramowanie komputerowe. Gates prowadził firmę jako prezes i dyrektor generalny, aż do ustąpienia ze stanowiska dyrektora generalnego w styczniu 2000 r., Ale pozostał przewodniczącym i został głównym architektem oprogramowania. Pod koniec lat 90. Gates był krytykowany za taktykę biznesową, którą uważano za antykonkurencyjną. Opinię tę podtrzymują liczne orzeczenia sądowe. W czerwcu 2006 r. Gates ogłosił, że przejdzie do pracy w niepełnym wymiarze godzin w Microsoft i pracy w pełnym wymiarze godzin w Bill & Melinda Gates Foundation, prywatnej fundacji charytatywnej, którą on i jego żona Melinda Gates utworzyli w 2000 r. Stopniowo przeniósł obowiązki na Raya Ozziego i Craiga Mundie. 
Zrezygnował z funkcji prezesa Microsoftu w lutym 2014 r. I objął nowe stanowisko jako doradca ds. Technologii, aby wesprzeć nowo mianowaną CEO Satyę Nadellę."""] ner_df = nlu.load('pl.ner.wikiner.glove.6B_100').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title}
## Results

```bash
+-------------------------------+---------+
|chunk                          |ner_label|
+-------------------------------+---------+
|William Henry Gates III        |PER      |
|Microsoft Corporation          |ORG      |
|Podczas swojej kariery         |MISC     |
|Microsoft Gates                |MISC     |
|CEO                            |ORG      |
|Urodzony                       |LOC      |
|Seattle                        |LOC      |
|Waszyngton                     |LOC      |
|Gates                          |PER      |
|Microsoftu                     |ORG      |
|Paulem Allenem                 |PER      |
|Albuquerque                    |LOC      |
|Nowym Meksyku                  |LOC      |
|Gates                          |PER      |
|Ale                            |PER      |
|Gates                          |PER      |
|Opinię                         |PER      |
|Gates                          |PER      |
|Microsoft                      |ORG      |
|Bill & Melinda Gates Foundation|ORG      |
+-------------------------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|wikiner_6B_100|
|Type:|ner|
|Compatibility:|Spark NLP 2.5.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|pl|
|Case sensitive:|false|

{:.h2_title}
## Data Source

The model was trained based on data from [https://pl.wikipedia.org](https://pl.wikipedia.org)

---
layout: model
title: Fast Neural Machine Translation Model from Shona to English
author: John Snow Labs
name: opus_mt_sn_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, sn, en, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial ones. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `sn` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_sn_en_xx_2.7.0_2.4_1609167066503.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_sn_en_xx_2.7.0_2.4_1609167066503.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_sn_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_sn_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.sn.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_sn_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from IIC) author: John Snow Labs name: roberta_qa_base_spanish_s_c date: 2022-12-02 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-sqac` is a Spanish model originally trained by `IIC`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_s_c_es_4.2.4_3.0_1669986419016.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_s_c_es_4.2.4_3.0_1669986419016.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_s_c","es")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_s_c","es")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
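Under the hood, an extractive QA head like this one scores every token of the context as a potential answer start and end; the predicted answer is the span `(s, e)` with `s <= e` that maximizes `start_logits[s] + end_logits[e]`, usually capped at a maximum span length. A minimal sketch of that decoding step, in plain Python with illustrative logits (not actual model output):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Return (start, end) token indices of the highest-scoring answer span.

    Scores are additive: start_logits[s] + end_logits[e], with e >= s and
    the span length capped at max_len tokens.
    """
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Toy logits: token 1 is the likeliest start, token 2 the likeliest end.
print(best_span([0.1, 5.0, 0.2, 0.3], [0.0, 0.1, 4.0, 0.2]))  # (1, 2)
```

The `s <= e` constraint is what prevents degenerate "end before start" answers that independent argmaxes over the two logit vectors could produce.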
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_spanish_s_c|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|460.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/IIC/roberta-base-spanish-sqac
- https://www.bsc.es/
- https://arxiv.org/abs/2107.07253
- https://paperswithcode.com/sota?task=question-answering&dataset=PlanTL-GOB-ES%2FSQAC

---
layout: model
title: Text Detection
author: John Snow Labs
name: text_detection_v1
date: 2021-12-09
tags: [en, licensed]
task: OCR Text Detection & Recognition
language: en
nav_key: models
edition: Visual NLP 3.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

CRAFT (Character Region Awareness For Text detection) is a convolutional neural network that produces a character region score and an affinity score. The region score is used to localize individual characters in the image, and the affinity score is used to group characters into a single text instance. To compensate for the lack of character-level annotations, the authors propose a weakly-supervised learning framework that estimates character-level ground truths from existing real word-level datasets.

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/text_detection_v1_en_3.0.0_3.0_1639033905025.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/text_detection_v1_en_3.0.0_3.0_1639033905025.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
text_detector = ImageTextDetector.pretrained("text_detection_v1", "en", "clinical/ocr")
text_detector.setInputCol("image")
text_detector.setOutputCol("text_regions")
```
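As a toy illustration of the region/affinity idea described above (not the actual CRAFT post-processing, which works on dense score maps with connected-component analysis and rotated-rectangle fitting): cells whose region score clears a threshold count as character pixels, and two adjacent character pixels are merged into the same text instance only where the affinity score is also high. A self-contained Python sketch:

```python
def group_characters(region, affinity, thr=0.5):
    """Toy CRAFT-style grouping on 2-D score grids.

    Cells with region score >= thr are character pixels; two 4-adjacent
    character pixels are linked into one instance only if both carry an
    affinity score >= thr. Returns a list of sets of (row, col) cells,
    one set per text instance.
    """
    rows, cols = len(region), len(region[0])
    is_char = [[region[r][c] >= thr for c in range(cols)] for r in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    instances = []
    for r in range(rows):
        for c in range(cols):
            if not is_char[r][c] or seen[r][c]:
                continue
            comp, stack = set(), [(r, c)]
            seen[r][c] = True
            while stack:  # flood fill over affinity-linked character pixels
                y, x = stack.pop()
                comp.add((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and is_char[ny][nx] and not seen[ny][nx]
                            and affinity[y][x] >= thr and affinity[ny][nx] >= thr):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            instances.append(comp)
    return instances

# Two linked characters plus one isolated character -> two instances.
print(len(group_characters([[0.9, 0.9, 0.0, 0.9]], [[0.8, 0.8, 0.0, 0.1]])))  # 2
```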
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|text_detection_v1|
|Type:|ocr|
|Compatibility:|Visual NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Output Labels:|[text_regions]|
|Language:|en|

---
layout: model
title: English RobertaForSequenceClassification Cased model (from mrm8488)
author: John Snow Labs
name: roberta_sequence_classifier_distilroberta_finetuned_financial_news_sentiment_analysis
date: 2022-07-13
tags: [en, open_source, roberta, sequence_classification]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForSequenceClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-finetuned-financial-news-sentiment-analysis` is an English model originally trained by `mrm8488`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_distilroberta_finetuned_financial_news_sentiment_analysis_en_4.0.0_3.0_1657716075006.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_distilroberta_finetuned_financial_news_sentiment_analysis_en_4.0.0_3.0_1657716075006.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_distilroberta_finetuned_financial_news_sentiment_analysis","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, classifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_distilroberta_finetuned_financial_news_sentiment_analysis","en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, classifier))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
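After the pipeline runs, the `class` column holds the label whose logit is highest after a softmax. A minimal sketch of that final step in plain Python; the three-label financial-sentiment set `negative`/`neutral`/`positive` is an assumption about this model's label order, shown for illustration only:

```python
import math

def predict_label(logits, labels=("negative", "neutral", "positive")):
    """Softmax the logits and return the winning label with its probability.

    NOTE: the label names and their order here are illustrative, not read
    from the model's configuration.
    """
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = probs.index(max(probs))
    return labels[best], probs[best]

label, prob = predict_label([0.2, 0.1, 3.0])
print(label)  # positive
```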
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_sequence_classifier_distilroberta_finetuned_financial_news_sentiment_analysis|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|309.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis

---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from juanmarmol)
author: John Snow Labs
name: distilbert_qa_juanmarmol_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `juanmarmol`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_juanmarmol_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771544745.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_juanmarmol_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771544745.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_juanmarmol_base_uncased_finetuned_squad","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_juanmarmol_base_uncased_finetuned_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_juanmarmol_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/juanmarmol/distilbert-base-uncased-finetuned-squad --- layout: model title: Universal Sentence Encoder XLING English and French author: John Snow Labs name: tfhub_use_xling_en_fr date: 2020-12-08 task: Embeddings language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 deprecated: true tags: [open_source, embeddings, xx] supported: true annotator: UniversalSentenceEncoder article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The Universal Sentence Encoder Cross-lingual (XLING) module is an extension of the Universal Sentence Encoder that includes training on multiple tasks across languages. The multi-task training setup is based on the paper "Learning Cross-lingual Sentence Representations via a Multi-task Dual Encoder". This specific module is trained on English and French (en-fr) tasks, and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and tasks, with the goal of learning text representations that are useful out-of-the-box for a number of applications. The input to the module is variable length English or French text and the output is a 512 dimensional vector. We note that one does not need to specify the language that the input is in, as the model was trained such that English and French text with similar meanings will have similar (high dot product score) embeddings. 
We also note that this model can be used for monolingual English (and potentially monolingual French) tasks with comparable or even better performance than the purely English Universal Sentence Encoder. Note: This model only works on Linux and macOS operating systems and is not compatible with Windows due to the incompatibility of the SentencePiece library. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/tfhub_use_xling_en_fr_xx_2.7.0_2.4_1607440713842.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/tfhub_use_xling_en_fr_xx_2.7.0_2.4_1607440713842.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
...
embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_xling_en_fr", "xx") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])

pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

result = pipeline_model.transform(spark.createDataFrame([["I love NLP"], ["J'adore utiliser SparkNLP"]], ["text"]))
```

```scala
...
val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_xling_en_fr", "xx")
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))

val data = Seq("I love NLP", "J'adore utiliser SparkNLP").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["I love NLP", "J'adore utiliser SparkNLP"]
embeddings_df = nlu.load('xx.use.xling_en_fr').predict(text, output_level='sentence')
embeddings_df
```
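Because English and French sentences with similar meanings map to nearby points, cross-lingual similarity can be read off with a dot product or cosine similarity between the 512-dimensional vectors. A minimal sketch in plain Python, using tiny illustrative vectors rather than real 512-d embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for sentence embeddings: a paraphrase pair should
# score higher than an unrelated pair.
en = [0.9, 0.1, 0.3]
fr = [0.8, 0.2, 0.3]          # "same meaning" -> nearly parallel vector
other = [-0.2, 0.9, -0.5]     # unrelated sentence

print(cosine(en, fr) > cosine(en, other))  # True
```

With real model output, `en` and `fr` would be the `sentence_embeddings` vectors produced by the pipeline above.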
## Results

It gives a 512-dimensional vector for each sentence.

```bash
	sentence                    xx_use_xling_en_fr_embeddings
0	I love NLP                  [0.0608731247484684, -0.06734627485275269, -0....
1	J'adore utiliser SparkNLP   [0.07564588636159897, -0.06953935325145721, 0....
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|tfhub_use_xling_en_fr|
|Compatibility:|Spark NLP 2.7.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|xx|

## Data Source

This embeddings model is imported from [https://tfhub.dev/google/universal-sentence-encoder-xling/en-fr/1](https://tfhub.dev/google/universal-sentence-encoder-xling/en-fr/1)

---
layout: model
title: Lithuanian BertForMaskedLM Base Cased model (from Geotrend)
author: John Snow Labs
name: bert_embeddings_base_lt_cased
date: 2022-12-02
tags: [lt, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: lt
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-lt-cased` is a Lithuanian model originally trained by `Geotrend`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_lt_cased_lt_4.2.4_3.0_1670018323172.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_lt_cased_lt_4.2.4_3.0_1670018323172.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_lt_cased","lt") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_lt_cased","lt")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_lt_cased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|lt|
|Size:|369.3 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/Geotrend/bert-base-lt-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers

---
layout: model
title: English T5ForConditionalGeneration Cased model (from Apoorva)
author: John Snow Labs
name: t5_apoorva_k2t_test
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `k2t-test` is an English model originally trained by `Apoorva`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_apoorva_k2t_test_en_4.3.0_3.0_1675103912014.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_apoorva_k2t_test_en_4.3.0_3.0_1675103912014.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_apoorva_k2t_test","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_apoorva_k2t_test","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_apoorva_k2t_test| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|275.8 MB| ## References - https://huggingface.co/Apoorva/k2t-test --- layout: model title: Lemmatizer (Norwegian Bokmål, SpacyLookup) author: John Snow Labs name: lemma_spacylookup date: 2022-03-08 tags: [open_source, lemmatizer, nb] task: Lemmatization language: nb edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Norwegian Bokmål Lemmatizer is a scalable, production-ready version of the Rule-based Lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_nb_3.4.1_3.0_1646753600988.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_nb_3.4.1_3.0_1646753600988.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","nb") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Du er ikke bedre enn meg"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","nb") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("Du er ikke bedre enn meg").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("nb.lemma").predict("""Du er ikke bedre enn meg""") ```
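Conceptually, a lookup lemmatizer like this one maps each token through a dictionary and falls back to the surface form when the token is unknown. A toy sketch in plain Python — the mini lookup table below mirrors the example sentence and is invented for illustration, not taken from the model's actual dictionary:

```python
# Toy lookup table (illustrative only; the real model ships its own dictionary)
lookup = {"meg": "jeg"}

def lemmatize(tokens):
    # Tokens missing from the table fall back to their surface form
    return [lookup.get(token.lower(), token) for token in tokens]

print(lemmatize("Du er ikke bedre enn meg".split()))
# ['Du', 'er', 'ikke', 'bedre', 'enn', 'jeg']
```

The output matches the Results table for this card: only `meg` has a distinct lemma in this sentence; every other token is already in base form.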
## Results ```bash +-------------------------------+ |result | +-------------------------------+ |[Du, er, ikke, bedre, enn, jeg]| +-------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma_spacylookup| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|nb| |Size:|15.3 KB| --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from 123tarunanand) author: John Snow Labs name: roberta_qa_base_finetuned date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned` is an English model originally trained by `123tarunanand`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_en_4.3.0_3.0_1674216346492.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_en_4.3.0_3.0_1674216346492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
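Extractive QA models of this kind score every context token as a possible answer start and answer end; the prediction is the best-scoring valid span. A toy sketch of that decoding step (the scores below are invented, not actual model logits):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) pair with the highest combined score, end >= start."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best[2]:
                best = (i, j, s + end_scores[j])
    return best[0], best[1]

tokens = ["My", "name", "is", "Clara"]
start = [0.1, 0.2, 0.1, 3.0]  # invented start logits
end = [0.0, 0.1, 0.2, 2.5]    # invented end logits
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # prints Clara
```

In the Spark NLP pipeline above this decoding happens inside the annotator; the sketch is only meant to show why the model answers with a span of the context rather than free text.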
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_finetuned| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/123tarunanand/roberta-base-finetuned --- layout: model title: English XlmRoBertaForQuestionAnswering (from teacookies) author: John Snow Labs name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265903 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-more_fine_tune_24465520-26265903` is an English model originally trained by `teacookies`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265903_en_4.0.0_3.0_1655985000390.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265903_en_4.0.0_3.0_1655985000390.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265903","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = XlmRoBertaForQuestionAnswering
    .pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265903","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265903").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
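Note that the `nlu` one-liner packs the question and the context into a single string separated by `|||`. A tiny helper makes that format explicit (the function name is ours, not part of the nlu API):

```python
def to_nlu_qa_input(question: str, context: str) -> str:
    # nlu question-answering models expect "question|||context" as one string
    return f"{question}|||{context}"

print(to_nlu_qa_input("What's my name?", "My name is Clara and I live in Berkeley."))
# What's my name?|||My name is Clara and I live in Berkeley.
```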
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265903| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|888.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265903 --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_bert_base_uncased_few_shot_k_128_finetuned_squad_seed_0 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-128-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_few_shot_k_128_finetuned_squad_seed_0_en_4.0.0_3.0_1654180777930.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_few_shot_k_128_finetuned_squad_seed_0_en_4.0.0_3.0_1654180777930.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_few_shot_k_128_finetuned_squad_seed_0","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_bert_base_uncased_few_shot_k_128_finetuned_squad_seed_0","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.base_uncased_128d_seed_0").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_few_shot_k_128_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-128-finetuned-squad-seed-0 --- layout: model title: Legal Definition Of Confidential Information Clause Binary Classifier author: John Snow Labs name: legclf_def_of_conf_info_clause date: 2023-02-13 tags: [en, legal, classification, definition, confidential, information, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `def_of_conf_info` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
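As an illustration of the first technique above (paragraph splitting by multiline), a minimal pre-processing sketch outside Spark NLP could look like this — the regex and function name are illustrative, not part of the model or the tutorial:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document on blank lines, so each paragraph can be classified separately."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("1. CONFIDENTIALITY\n\n"
       "Each party shall keep all Confidential Information secret.\n\n"
       "2. TERM\n\n"
       "This Agreement lasts two years.")
print(split_paragraphs(doc))
```

Each resulting paragraph can then be fed to the classifier as its own row of the input DataFrame.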
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `def_of_conf_info`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_def_of_conf_info_clause_en_1.0.0_3.0_1676302657181.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_def_of_conf_info_clause_en_1.0.0_3.0_1676302657181.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_def_of_conf_info_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +------------------+ |result | +------------------+ |[def_of_conf_info]| |[other] | |[other] | |[def_of_conf_info]| +------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_def_of_conf_info_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support def_of_conf_info 0.91 1.00 0.95 20 other 1.00 0.85 0.92 13 accuracy - - 0.94 33 macro-avg 0.95 0.92 0.93 33 weighted-avg 0.94 0.94 0.94 33 ``` --- layout: model title: RxNorm Sbd ChunkResolver author: John Snow Labs name: chunkresolve_rxnorm_sbd_clinical class: ChunkEntityResolverModel language: en nav_key: models repository: clinical/models date: 2020-07-27 task: Entity Resolution edition: Healthcare NLP 2.5.1 spark_version: 2.4 tags: [clinical,licensed,entity_resolution,en] deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Entity Resolution model based on KNN using Word Embeddings + Word Mover's Distance. ## Predicted Entities RxNorm Codes and their normalized definition with `clinical_embeddings`.
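The resolver's core idea — nearest-neighbour search over chunk embeddings — can be sketched in plain Python. The vectors and codes below are made-up toy data, and cosine distance stands in for the Word Mover's Distance the real model uses:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def resolve(query_vec, index):
    """Return the code whose embedding is nearest to the query chunk embedding."""
    return min(index, key=lambda item: cosine_distance(query_vec, item[1]))[0]

# Toy index of (RxNorm code, embedding) pairs — invented vectors, not model output
index = [("105376", [0.9, 0.1, 0.0]), ("1486981", [0.1, 0.9, 0.2])]
print(resolve([0.85, 0.15, 0.05], index))  # prints 105376
```

The production model additionally re-ranks the top `setNeighbours(...)` candidates and returns `setAlternatives(...)` alternatives, as shown in the pipeline below.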
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_sbd_clinical_en_2.5.1_2.4_1595813912622.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_sbd_clinical_en_2.5.1_2.4_1595813912622.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPython.html %}
```python
...
rxnorm_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_rxnorm_sbd_clinical", "en", "clinical/models")\
    .setEnableLevenshtein(True)\
    .setNeighbours(200)\
    .setAlternatives(5)\
    .setDistanceWeights([3,11,0,0,0,9])\
    .setInputCols(["token", "chunk_embeddings"])\
    .setOutputCol("rxnorm_resolution")\
    .setPoolingStrategy("MAX")

pipeline_rxnorm = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, rxnorm_resolver])

model = pipeline_rxnorm.fit(spark.createDataFrame([[""]]).toDF("text"))

results = model.transform(data)
```
```scala
...
val rxnorm_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_rxnorm_sbd_clinical", "en", "clinical/models")
    .setEnableLevenshtein(true)
    .setNeighbours(200)
    .setAlternatives(5)
    .setDistanceWeights(Array(3,11,0,0,0,9))
    .setInputCols(Array("token", "chunk_embeddings"))
    .setOutputCol("rxnorm_resolution")
    .setPoolingStrategy("MAX")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, rxnorm_resolver))

val result = pipeline.fit(Seq("").toDS.toDF("text")).transform(data)
```
{:.h2_title} ## Results ```bash +-----------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ | chunk| entity| target_text(rxnorm)| code|confidence| +-----------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ | metformin|TREATMENT|Metformin hydrochloride 500 MG Oral Tablet [Glucamet]:::Metformin hydrochloride 850 MG Oral Table...| 105376| 0.2067| | glipizide|TREATMENT|Glipizide 5 MG Oral Tablet [Minidiab]:::Glipizide 5 MG Oral Tablet [Glucotrol]:::Glipizide 5 MG O...| 105373| 0.2224| | dapagliflozin for T2DM|TREATMENT|dapagliflozin 5 MG / saxagliptin 5 MG Oral Tablet [Qtern]:::dapagliflozin 10 MG / saxagliptin 5 M...|2169276| 0.2532| | atorvastatin and gemfibrozil for HTG|TREATMENT|atorvastatin 20 MG / ezetimibe 10 MG Oral Tablet [Liptruzet]:::atorvastatin 40 MG / ezetimibe 10 ...|1422095| 0.2183| | dapagliflozin|TREATMENT|dapagliflozin 5 MG Oral Tablet [Farxiga]:::dapagliflozin 10 MG Oral Tablet [Farxiga]:::dapagliflo...|1486981| 0.3523| | bicarbonate|TREATMENT|Sodium Bicarbonate 0.417 MEQ/ML Oral Solution [Desempacho]:::potassium bicarbonate 25 MEQ Efferve...|1305099| 0.2149| |insulin drip for euDKA and HTG with a reduction|TREATMENT|insulin aspart, human 30 UNT/ML / insulin degludec 70 UNT/ML Pen Injector [Ryzodeg]:::3 ML insuli...|1994318| 0.2124| | SGLT2 inhibitor|TREATMENT|C1 esterase inhibitor (human) 500 UNT Injection [Cinryze]:::alpha 1-proteinase inhibitor, human 1...| 809871| 0.2044| | insulin glargine|TREATMENT|Insulin Glargine 100 UNT/ML Pen Injector [Lantus]:::Insulin Glargine 300 UNT/ML Pen Injector [Tou...|1359856| 0.2265| | insulin lispro with meals|TREATMENT|Insulin Lispro 100 UNT/ML Cartridge [Humalog]:::Insulin Lispro 200 UNT/ML Pen Injector [Humalog]:...|1652648| 0.2469| | metformin|TREATMENT|Metformin 
hydrochloride 500 MG Oral Tablet [Glucamet]:::Metformin hydrochloride 850 MG Oral Table...| 105376| 0.2067| | SGLT2 inhibitors|TREATMENT|alpha 1-proteinase inhibitor, human 1 MG Injection [Prolastin]:::C1 esterase inhibitor (human) 50...|1661220| 0.2167| +-----------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |----------------|----------------------------------| | Name: | chunkresolve_rxnorm_sbd_clinical | | Type: | ChunkEntityResolverModel | | Compatibility: | Spark NLP 2.5.1+ | | License: | Licensed | |Edition:|Official| | |Input labels: | [token, chunk_embeddings ] | |Output labels: | [entity] | | Language: | en | | Case sensitive: | True | | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained on December 2019 RxNorm Clinical Drugs (TTY=SBD) ontology graph with `embeddings_clinical` https://www.nlm.nih.gov/pubs/techbull/nd19/brief/nd19_rxnorm_december_2019_release.html --- layout: model title: English T5ForConditionalGeneration Cased model (from ThomasNLG) author: John Snow Labs name: t5_qg_squad1 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-qg_squad1-en` is a English model originally trained by `ThomasNLG`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_qg_squad1_en_4.3.0_3.0_1675125547851.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_qg_squad1_en_4.3.0_3.0_1675125547851.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_qg_squad1","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_qg_squad1","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_qg_squad1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|923.2 MB| ## References - https://huggingface.co/ThomasNLG/t5-qg_squad1-en - https://github.com/ThomasScialom/QuestEval --- layout: model title: Detect PHI for Deidentification (Generic) author: John Snow Labs name: ner_deid_generic date: 2022-01-06 tags: [deid, ner, de, licensed] task: Named Entity Recognition language: de edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The Named Entity Recognition annotator allows for a generic model to be trained by utilizing a deep learning algorithm (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER is a Named Entity Recognition model that annotates German text to find protected health information (PHI) that may need to be deidentified. It was trained with in-house annotations and detects 7 entities.
## Predicted Entities `DATE`, `NAME`, `LOCATION`, `PROFESSION`, `AGE`, `ID`, `CONTACT` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_DE){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_de_3.3.4_2.4_1641460977185.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_de_3.3.4_2.4_1641460977185.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","de","clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") deid_ner = MedicalNerModel.pretrained("ner_deid_generic", "de", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_deid_generic_chunk") nlpPipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, word_embeddings, deid_ner, ner_converter]) data = spark.createDataFrame([["""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. 
Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen."""]]).toDF("text")

result = nlpPipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val deid_ner = MedicalNerModel.pretrained("ner_deid_generic", "de", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_deid_generic_chunk")

val nlpPipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    deid_ner,
    ner_converter))

val data = Seq("""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""").toDS.toDF("text")

val result = nlpPipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("de.med_ner.deid_generic").predict("""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""")
```
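The precision/recall/F1 figures reported in the Benchmarking section of NER cards like this one are derived from true-positive, false-positive, and false-negative counts. A minimal sketch of that computation, using the CONTACT row of the table below as input:

```python
def prf(tp, fp, fn):
    """Compute precision, recall, and F1 from raw tp/fp/fn counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 4), round(recall, 4), round(f1, 4)

print(prf(68, 25, 12))  # prints (0.7312, 0.85, 0.7861)
```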
## Results ```bash +-------------------------+----------------------+ |chunk |ner_deid_generic_chunk| +-------------------------+----------------------+ |Michael Berger |NAME | |12 Dezember 2018 |DATE | |St. Elisabeth-Krankenhaus|LOCATION | |Bad Kissingen |LOCATION | |Berger |NAME | |76 |AGE | +-------------------------+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|de| |Size:|15.0 MB| ## Data Source In-house annotated dataset ## Benchmarking ```bash label tp fp fn total precision recall f1 CONTACT 68.0 25.0 12.0 80.0 0.7312 0.85 0.7861 NAME 3965.0 294.0 274.0 4239.0 0.931 0.9354 0.9332 DATE 4049.0 2.0 0.0 4049.0 0.9995 1.0 0.9998 ID 185.0 11.0 32.0 217.0 0.9439 0.8525 0.8959 LOCATION 5065.0 414.0 1021.0 6086.0 0.9244 0.8322 0.8759 PROFESSION 145.0 8.0 117.0 262.0 0.9477 0.5534 0.6988 AGE 458.0 13.0 18.0 476.0 0.9724 0.9622 0.9673 ``` --- layout: model title: Legal Waivers Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_waivers_bert date: 2023-03-05 tags: [en, legal, classification, clauses, waivers, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset is aimed at contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Waivers` clause type.
To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Waivers`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_waivers_bert_en_1.0.0_3.0_1678049911607.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_waivers_bert_en_1.0.0_3.0_1678049911607.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_waivers_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
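The paragraph splitting (by multiline) strategy recommended earlier for keeping inputs within the 512-token limit can be sketched without Spark NLP at all. This is a minimal illustration only; `split_paragraphs` is a hypothetical helper, not part of the library:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs on blank lines (multiline splitting),
    so each piece can be classified separately."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "First clause text.\n\nSecond clause text.\n\n\nThird clause text."
print(split_paragraphs(doc))
# → ['First clause text.', 'Second clause text.', 'Third clause text.']
```

Each resulting paragraph can then be fed to the classifier as its own row in the input DataFrame.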
## Results ```bash +-------+ |result| +-------+ |[Waivers]| |[Other]| |[Other]| |[Waivers]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_waivers_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.93 0.93 0.93 138 Waivers 0.92 0.92 0.92 106 accuracy - - 0.93 244 macro-avg 0.92 0.92 0.92 244 weighted-avg 0.93 0.93 0.93 244 ``` --- layout: model title: English T5ForConditionalGeneration Base Cased model (from microsoft) author: John Snow Labs name: t5_ssr_base date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ssr-base` is an English model originally trained by `microsoft`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_ssr_base_en_4.3.0_3.0_1675107262685.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_ssr_base_en_4.3.0_3.0_1675107262685.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_ssr_base","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_ssr_base","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_ssr_base| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|926.9 MB| ## References - https://huggingface.co/microsoft/ssr-base - https://arxiv.org/abs/2101.00416 --- layout: model title: News Classifier Pipeline for Turkish text author: John Snow Labs name: classifierdl_bert_news_pipeline date: 2021-08-27 tags: [tr, news, classification, open_source] task: Text Classification language: tr edition: Spark NLP 3.2.0 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pre-trained pipeline classifies Turkish texts of news. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_TR_NEWS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_TR_NEWS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_bert_news_pipeline_tr_3.2.0_2.4_1630061137177.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_bert_news_pipeline_tr_3.2.0_2.4_1630061137177.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("classifierdl_bert_news_pipeline", lang = "tr") result = pipeline.fullAnnotate("Bonservisi elinde olan Milli oyuncu, yeni takımıyla el sıkıştı")[0] result["class"] ``` ```scala val pipeline = new PretrainedPipeline("classifierdl_bert_news_pipeline", "tr") val result = pipeline.fullAnnotate("Bonservisi elinde olan Milli oyuncu, yeni takımıyla el sıkıştı")(0) ```
## Results ```bash ["Sport"] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_bert_news_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Language:|tr| ## Included Models - DocumentAssembler - BertSentenceEmbeddings - ClassifierDLModel --- layout: model title: English Named Entity Recognition (from kSaluja) author: John Snow Labs name: bert_ner_autonlp_tele_new_5k_557515810 date: 2022-05-09 tags: [bert, ner, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `autonlp-tele_new_5k-557515810` is an English model originally trained by `kSaluja`. ## Predicted Entities `TARGET`, `SUGGESTIONTYPE`, `CALLTYPE`, `INSTRUMENT`, `BUYPRICE`, `HOLDINGPERIOD`, `STOPLOSS` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_autonlp_tele_new_5k_557515810_en_3.4.2_3.0_1652097492338.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_autonlp_tele_new_5k_557515810_en_3.4.2_3.0_1652097492338.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_autonlp_tele_new_5k_557515810","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_autonlp_tele_new_5k_557515810","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
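The token classifier above emits one IOB tag per token (e.g. `B-BUYPRICE`, `I-BUYPRICE`, `O`). Merging those tags into entity chunks — the job a NerConverter stage performs in a full Spark NLP pipeline — can be sketched in plain Python. `merge_bio` and the example tokens below are illustrative, not part of the library:

```python
def merge_bio(tokens, tags):
    """Merge IOB/BIO tags into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                       # close the previous chunk
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)               # continue the open chunk
        else:
            if current:                       # O tag (or mismatch) ends the chunk
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(merge_bio(["buy", "at", "120", "rupees"],
                ["O", "O", "B-BUYPRICE", "I-BUYPRICE"]))
# → [('120 rupees', 'BUYPRICE')]
```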
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_autonlp_tele_new_5k_557515810| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/kSaluja/autonlp-tele_new_5k-557515810 --- layout: model title: Pipeline to Detect Clinical Entities (ner_eu_clinical_case) author: John Snow Labs name: ner_eu_clinical_case_pipeline date: 2023-03-08 tags: [clinical, licensed, ner, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_eu_clinical_case](https://nlp.johnsnowlabs.com/2023/01/25/ner_eu_clinical_case_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_pipeline_en_4.3.0_3.2_1678262043022.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_pipeline_en_4.3.0_3.2_1678262043022.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_eu_clinical_case_pipeline", "en", "clinical/models") text = " A 3-year-old boy with autistic disorder on hospital of pediatric ward A at university hospital. He has no family history of illness or autistic spectrum disorder. The child was diagnosed with a severe communication disorder, with social interaction difficulties and sensory processing delay. Blood work was normal (thyroid-stimulating hormone (TSH), hemoglobin, mean corpuscular volume (MCV), and ferritin). Upper endoscopy also showed a submucosal tumor causing subtotal obstruction of the gastric outlet. Because a gastrointestinal stromal tumor was suspected, distal gastrectomy was performed. Histopathological examination revealed spindle cell proliferation in the submucosal layer. " result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_eu_clinical_case_pipeline", "en", "clinical/models") val text = " A 3-year-old boy with autistic disorder on hospital of pediatric ward A at university hospital. He has no family history of illness or autistic spectrum disorder. The child was diagnosed with a severe communication disorder, with social interaction difficulties and sensory processing delay. Blood work was normal (thyroid-stimulating hormone (TSH), hemoglobin, mean corpuscular volume (MCV), and ferritin). Upper endoscopy also showed a submucosal tumor causing subtotal obstruction of the gastric outlet. Because a gastrointestinal stromal tumor was suspected, distal gastrectomy was performed. Histopathological examination revealed spindle cell proliferation in the submucosal layer. " val result = pipeline.fullAnnotate(text) ```
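The chunk-level `confidence` values this pipeline reports for multi-token chunks are typically an average of the per-token NER scores inside each chunk. A minimal sketch of that aggregation, using hypothetical token scores (the real per-token values live in the annotation metadata):

```python
def chunk_confidence(token_scores):
    """Average per-token NER confidences into a single chunk-level score."""
    return round(sum(token_scores) / len(token_scores), 6)

# e.g. a three-token chunk with (hypothetical) per-token confidences
print(chunk_confidence([0.9991, 0.4, 0.8003]))
```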
## Results ```bash | | chunks | begin | end | entities | confidence | |---:|:-------------------------------|--------:|------:|:-------------------|-------------:| | 0 | A 3-year-old boy | 1 | 16 | patient | 0.733133 | | 1 | autistic disorder | 23 | 39 | clinical_condition | 0.5412 | | 2 | He | 97 | 98 | patient | 0.9991 | | 3 | illness | 125 | 131 | clinical_event | 0.4956 | | 4 | autistic spectrum disorder | 136 | 161 | clinical_condition | 0.5002 | | 5 | The child | 164 | 172 | patient | 0.82435 | | 6 | diagnosed | 178 | 186 | clinical_event | 0.9912 | | 7 | disorder | 216 | 223 | clinical_event | 0.3804 | | 8 | difficulties | 250 | 261 | clinical_event | 0.3221 | | 9 | Blood | 293 | 297 | bodypart | 0.7617 | | 10 | work | 299 | 302 | clinical_event | 0.9361 | | 11 | normal | 308 | 313 | units_measurements | 0.5337 | | 12 | hormone | 336 | 342 | clinical_event | 0.362 | | 13 | hemoglobin | 351 | 360 | clinical_event | 0.6106 | | 14 | volume | 380 | 385 | clinical_event | 0.6226 | | 15 | endoscopy | 415 | 423 | clinical_event | 0.9917 | | 16 | showed | 430 | 435 | clinical_event | 0.9904 | | 17 | tumor | 450 | 454 | clinical_condition | 0.5606 | | 18 | causing | 456 | 462 | clinical_event | 0.7362 | | 19 | obstruction | 473 | 483 | clinical_event | 0.6198 | | 20 | the gastric outlet | 488 | 505 | bodypart | 0.634967 | | 21 | gastrointestinal stromal tumor | 518 | 547 | clinical_condition | 0.387833 | | 22 | suspected | 553 | 561 | clinical_event | 0.8225 | | 23 | gastrectomy | 571 | 581 | clinical_event | 0.935 | | 24 | examination | 616 | 626 | clinical_event | 0.9987 | | 25 | revealed | 628 | 635 | clinical_event | 0.9989 | | 26 | spindle cell proliferation | 637 | 662 | clinical_condition | 0.4487 | | 27 | the submucosal layer | 667 | 686 | bodypart | 0.523 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_eu_clinical_case_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| 
|Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from ksabeh) author: John Snow Labs name: roberta_qa_base_attribute_correction_mlm_titles date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-attribute-correction-mlm-titles` is an English model originally trained by `ksabeh`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_attribute_correction_mlm_titles_en_4.3.0_3.0_1674212765867.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_attribute_correction_mlm_titles_en_4.3.0_3.0_1674212765867.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_attribute_correction_mlm_titles","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_attribute_correction_mlm_titles","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_attribute_correction_mlm_titles| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|466.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/ksabeh/roberta-base-attribute-correction-mlm-titles --- layout: model title: English asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18 TFWav2Vec2ForCTC from jhonparra18 author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18` is an English model originally trained by jhonparra18. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18_en_4.2.0_3.0_1664019766050.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18_en_4.2.0_3.0_1664019766050.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_2_h_512 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-2_H-512` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_2_h_512_zh_4.2.4_3.0_1670325890317.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_2_h_512_zh_4.2.4_3.0_1670325890317.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_2_h_512","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_2_h_512","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
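Once the `embeddings` column has been collected, the token vectors produced by this model can be compared with cosine similarity, e.g. to find semantically close tokens. A dependency-free sketch (the vectors here are toy examples, not real 512-dimensional model outputs):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # identical directions → 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors → 0.0
```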
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_2_h_512| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|66.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-2_H-512 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: English asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken TFWav2Vec2ForCTC from cuzeverynameistaken author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken` is an English model originally trained by cuzeverynameistaken.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken_en_4.2.0_3.0_1664023121407.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken_en_4.2.0_3.0_1664023121407.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Part of Speech for English author: John Snow Labs name: pos_ud_ewt date: 2021-03-08 tags: [part_of_speech, open_source, english, pos_ud_ewt, en] task: Part of Speech Tagging language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - PROPN - PUNCT - ADJ - NOUN - VERB - DET - ADP - AUX - PRON - PART - SCONJ - NUM - ADV - CCONJ - X - INTJ - SYM {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_ewt_en_3.0.0_3.0_1615230175426.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_ewt_en_3.0.0_3.0_1615230175426.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_ewt", "en") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Hello from John Snow Labs ! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_ewt", "en") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Hello from John Snow Labs ! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs ! "] token_df = nlu.load('en.pos.ud_ewt').predict(text) token_df ```
## Results ```bash token pos 0 Hello INTJ 1 from ADP 2 John PROPN 3 Snow PROPN 4 Labs PROPN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_ewt| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|en| --- layout: model title: Sentence Entity Resolver for Billable ICD10-CM HCC Codes (sbertresolve_icd10cm_augmented_billable_hcc) author: John Snow Labs name: sbertresolve_icd10cm_augmented_billable_hcc date: 2023-05-31 tags: [en, clinical, entity_resolution, icd10cm, billable, hcc, licensed] task: Entity Resolution language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts to ICD-10-CM codes using `sbert_jsl_medium_uncased` sentence bert embeddings, and it supports 7-digit codes with Hierarchical Condition Categories (HCC) status. It also returns the official resolution text within the brackets inside the metadata. The model is augmented with synonyms, and previous augmentations are flexed according to cosine distances to unnormalized terms (ground truths). In the result, look for the `all_k_aux_labels` parameter in the metadata to get HCC status. This column can be split into further details: billable status || HCC status || HCC score. For example, an `all_k_aux_labels` value of `1||1||19` means the billable status is 1, the HCC status is 1, and the HCC score is 19.
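The `billable status || hcc status || hcc score` encoding of `all_k_aux_labels` described above can be unpacked with a simple split. `parse_aux_label` is a hypothetical helper, shown only to illustrate the format:

```python
def parse_aux_label(aux: str) -> dict:
    """Split an `all_k_aux_labels` entry of the form 'billable||hcc_status||hcc_score'."""
    billable, hcc_status, hcc_score = aux.split("||")
    return {
        "billable": int(billable),      # 1 if the code is billable
        "hcc_status": int(hcc_status),  # 1 if the code maps to an HCC category
        "hcc_score": int(hcc_score),    # the HCC category number (0 if none)
    }

print(parse_aux_label("1||1||19"))
# → {'billable': 1, 'hcc_status': 1, 'hcc_score': 19}
```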
## Predicted Entities `ICD-10-CM Codes` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10cm_augmented_billable_hcc_en_4.4.2_3.0_1685534837223.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10cm_augmented_billable_hcc_en_4.4.2_3.0_1685534837223.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(["PROBLEM"]) chunk2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") bert_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased", "en", "clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("bert_embeddings")\ .setCaseSensitive(False) icd10_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_icd10cm_augmented_billable_hcc", "en", "clinical/models")\ .setInputCols(["bert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, bert_embeddings, icd10_resolver]) data_ner = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting."]]).toDF("text") results = 
nlpPipeline.fit(data_ner).transform(data_ner) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("PROBLEM")) val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val bert_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased", "en", "clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("bert_embeddings") .setCaseSensitive(false) val icd10_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_icd10cm_augmented_billable_hcc", "en", "clinical/models") .setInputCols("bert_embeddings") .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, bert_embeddings, icd10_resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.").toDF("text") val
result = pipeline.fit(data).transform(data) ```
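The `hcc_list` column in the resolver output pairs each candidate code with three `||`-separated flags. Assuming the fields are billable status, HCC status, and HCC code (as the `0||0||0` / `1||1||19` patterns in the Results section suggest), a minimal parsing sketch:

```python
def parse_hcc_flags(hcc_entry: str) -> dict:
    """Split a '||'-separated HCC entry into named fields.

    Field meanings (billable / hcc_status / hcc_code) are assumed from
    the 0||0||0 and 1||1||19 patterns seen in the resolver output.
    """
    billable, hcc_status, hcc_code = hcc_entry.split("||")
    return {
        "billable": billable == "1",
        "hcc_status": hcc_status == "1",
        "hcc_code": hcc_code,
    }

# Pair each candidate code with its parsed flags
codes = ["E11", "E11.9", "E10.9"]
flags = ["0||0||0", "1||1||19", "1||1||19"]
parsed = {c: parse_hcc_flags(f) for c, f in zip(codes, flags)}
```

This makes it easy to keep only billable, HCC-eligible codes downstream.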
## Results ```bash +-------------------------------------+-------+----------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+ | ner_chunk| entity|icd10_code| resolutions| all_codes| hcc_list| +-------------------------------------+-------+----------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+ | gestational diabetes mellitus|PROBLEM| O24.4|[gestational diabetes mellitus [gestational diabetes mellitus], maternal...| [O24.4, O24.41, O24.43, Z86.32, K86.8, P70.2, O24.434, E10.9, O24.430]|[0||0||0, 0||0||0, 0||0||0, 1||0||0, 0||0||0, 1||0||0, 1||0||0, 1||1||19...| |subsequent type two diabetes mellitus|PROBLEM| E11|[type 2 diabetes mellitus [type 2 diabetes mellitus], type ii diabetes m...|[E11, E11.9, E10.9, E10, E13.9, Z83.3, L83, E11.8, E11.32, E10.8, Z86.39...|[0||0||0, 1||1||19, 1||1||19, 0||0||0, 1||1||19, 1||0||0, 1||0||0, 1||1|...| | acute hepatitis|PROBLEM| K72.0|[acute hepatitis [acute and subacute hepatic failure], acute hepatitis a...|[K72.0, B15, B17.2, B17.1, B16, B17.9, B18.8, B15.9, K75.2, K73.9, B17.1...|[0||0||0, 0||0||0, 1||0||0, 0||0||0, 0||0||0, 1||0||0, 1||1||29, 1||0||0...| | obesity|PROBLEM| E66.9|[obesity [obesity, unspecified], upper body obesity [other obesity], chi...| [E66.9, E66.8, P90, Q13.0, M79.4, Z86.39]| [1||0||0, 1||0||0, 1||0||0, 1||0||0, 1||0||0, 1||0||0]| | a body mass index|PROBLEM| E66.9|[observation of body mass index [obesity, unspecified], finding of body ...|[E66.9, Z68.41, Z68, E66.8, Z68.45, Z68.4, Z68.1, Z68.2, R22.9, Z68.22, ...|[1||0||0, 1||1||22, 0||0||0, 1||0||0, 1||1||22, 0||0||0, 1||0||0, 0||0||...| | polyuria|PROBLEM| R35|[polyuria [polyuria], sialuria 
[other specified metabolic disorders], st...|[R35, E88.8, R30.0, N28.89, O04.8, R82.4, E74.8, R82.2, E73.9, R82.0, R3...|[0||0||0, 0||0||0, 1||0||0, 1||0||0, 0||0||0, 1||0||0, 0||0||0, 1||0||0,...| | polydipsia|PROBLEM| R63.1|[polydipsia [polydipsia], polyotia [accessory auricle], polysomia [conjo...|[R63.1, Q17.0, Q89.4, Q89.09, Q74.8, H53.8, H53.2, Q13.2, R63.8, E23.2, ...|[1||0||0, 1||0||0, 1||0||0, 1||0||0, 1||0||0, 1||0||0, 1||0||0, 1||0||0,...| | poor appetite|PROBLEM| R63.0|[poor appetite [anorexia], excessive appetite [polyphagia], poor feeding...|[R63.0, R63.2, P92.9, R45.81, Z55.8, R41.84, R41.3, Z74.8, R46.89, R45.8...|[1||0||0, 1||0||0, 1||0||0, 1||0||0, 1||0||0, 0||0||0, 1||0||0, 1||0||0,...| | vomiting|PROBLEM| R11.1|[vomiting [vomiting], vomiting bile [vomiting following gastrointestinal...|[R11.1, K91.0, K92.0, A08.39, R11, P92.0, P92.09, R11.12, R11.10, O21.9,...|[0||0||0, 1||0||0, 1||0||0, 1||0||0, 0||0||0, 0||0||0, 1||0||0, 1||0||0,...| +-------------------------------------+-------+----------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbertresolve_icd10cm_augmented_billable_hcc| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[bert_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Size:|938.6 MB| |Case sensitive:|false| --- layout: model title: Chinese RoBERTa Embeddings (from benjamin) author: John Snow Labs name: roberta_embeddings_roberta_base_wechsel_chinese date: 2022-04-14 tags: [roberta, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description 
Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-base-wechsel-chinese` is a Chinese model originally trained by `benjamin`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_base_wechsel_chinese_zh_3.4.2_3.0_1649944645905.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_base_wechsel_chinese_zh_3.4.2_3.0_1649944645905.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_base_wechsel_chinese","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["我喜欢Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_base_wechsel_chinese","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("我喜欢Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.roberta_base_wechsel_chinese").predict("""我喜欢Spark NLP""") ```
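Each row of the `embeddings` column holds one vector per token (768 dimensions for this base-size model). A common downstream use is comparing tokens by cosine similarity; a pure-Python sketch (the short vectors below are illustrative stand-ins, not real model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional stand-ins for token embedding vectors
v_spark = [0.2, 0.1, 0.9, 0.3]
v_nlp = [0.25, 0.05, 0.8, 0.4]
print(round(cosine_similarity(v_spark, v_nlp), 3))  # -> 0.987
```

With the real pipeline, the per-token vectors can be pulled from `result.selectExpr("explode(embeddings.embeddings)")`.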
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_roberta_base_wechsel_chinese| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|468.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/benjamin/roberta-base-wechsel-chinese - https://github.com/CPJKU/wechsel - https://arxiv.org/abs/2112.06598 --- layout: model title: English BertForQuestionAnswering Mini Uncased model (from Renukswamy) author: John Snow Labs name: bert_qa_renukswamy_minilm_uncased_squad2_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `minilm-uncased-squad2-finetuned-squad` is an English model originally trained by `Renukswamy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_renukswamy_minilm_uncased_squad2_finetuned_squad_en_4.0.0_3.0_1657190358667.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_renukswamy_minilm_uncased_squad2_finetuned_squad_en_4.0.0_3.0_1657190358667.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_renukswamy_minilm_uncased_squad2_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_renukswamy_minilm_uncased_squad2_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
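Under the hood, extractive QA models like this one score every context token as a possible answer start and end, then pick the highest-scoring valid span. A minimal sketch of that decoding step (the scores below are toy values, not real model logits):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) pair maximizing start+end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 3.5, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]
end = [0.0, 0.1, 0.2, 3.1, 0.1, 0.0, 0.0, 0.1, 0.5, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```

The annotator performs this selection internally; the `answer` column carries the decoded span text.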
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_renukswamy_minilm_uncased_squad2_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|124.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Renukswamy/minilm-uncased-squad2-finetuned-squad --- layout: model title: Pipeline to Resolve ICD-10-CM Codes author: John Snow Labs name: icd10cm_resolver_pipeline date: 2022-11-02 tags: [en, clinical, licensed, resolver, chunk_mapping, pipeline, icd10cm] task: Pipeline Healthcare language: en nav_key: models edition: Spark NLP for Healthcare 4.2.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps entities with their corresponding ICD-10-CM codes. You’ll just feed your text and it will return the corresponding ICD-10-CM codes. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10cm_resolver_pipeline_en_4.2.1_3.0_1667389014041.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10cm_resolver_pipeline_en_4.2.1_3.0_1667389014041.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline resolver_pipeline = PretrainedPipeline("icd10cm_resolver_pipeline", "en", "clinical/models") text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage""" result = resolver_pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val resolver_pipeline = new PretrainedPipeline("icd10cm_resolver_pipeline", "en", "clinical/models") val result = resolver_pipeline.fullAnnotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage""") ``` {:.nlu-block} ```python import nlu nlu.load("en.icd10cm_resolver.pipeline").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage""") ```
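`fullAnnotate` returns, per input text, a dictionary of annotation lists whose entries expose `result` and `metadata`. A hedged sketch of collecting chunk-to-code pairs from such output, using plain objects to stand in for Spark NLP `Annotation` instances (the values mirror the Results table; the structure here is illustrative):

```python
from types import SimpleNamespace

# Stand-ins for Spark NLP Annotation objects (illustrative only)
ner_chunks = [
    SimpleNamespace(result="gestational diabetes mellitus", metadata={"entity": "PROBLEM"}),
    SimpleNamespace(result="anisakiasis", metadata={"entity": "PROBLEM"}),
]
resolutions = [
    SimpleNamespace(result="O24.919", metadata={}),
    SimpleNamespace(result="B81.0", metadata={}),
]

def pair_chunks_with_codes(chunks, codes):
    """Zip NER chunks with their resolved ICD-10-CM codes."""
    return [
        {"chunk": c.result, "entity": c.metadata["entity"], "icd10cm_code": r.result}
        for c, r in zip(chunks, codes)
    ]

rows = pair_chunks_with_codes(ner_chunks, resolutions)
```

With the real pipeline, the two lists would come from keys of `result[0]` (the exact column names depend on the pipeline's internal stages).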
## Results ```bash +-----------------------------+---------+------------+ |chunk |ner_chunk|icd10cm_code| +-----------------------------+---------+------------+ |gestational diabetes mellitus|PROBLEM |O24.919 | |anisakiasis |PROBLEM |B81.0 | |fetal and neonatal hemorrhage|PROBLEM |P545 | +-----------------------------+---------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd10cm_resolver_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP for Healthcare 4.2.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|3.5 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger --- layout: model title: Extract entities in clinical trial abstracts (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_clinical_trials_abstracts date: 2022-06-29 tags: [berttokenclassifier, bert, biobert, en, ner, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Named Entity Recognition model is trained with the BertForTokenClassification method from transformers library and imported into Spark NLP. It extracts relevant entities from clinical trial abstracts. It uses a simplified version of the ontology specified by [Sanchez Graillet, O., et al.](https://pub.uni-bielefeld.de/record/2939477) in order to extract concepts related to trial design, diseases, drugs, population, statistics and publication. 
## Predicted Entities `Age`, `AllocationRatio`, `Author`, `BioAndMedicalUnit`, `CTAnalysisApproach`, `CTDesign`, `Confidence`, `Country`, `DisorderOrSyndrome`, `DoseValue`, `Drug`, `DrugTime`, `Duration`, `Journal`, `NumberPatients`, `PMID`, `PValue`, `PercentagePatients`, `PublicationYear`, `TimePoint`, `Value` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_trials_abstracts_en_3.5.3_3.0_1656475829985.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_trials_abstracts_en_3.5.3_3.0_1656475829985.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_clinical_trials_abstracts", "en", "clinical/models")\ .setInputCols("token", "sentence")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) text = ["This open-label, parallel-group, two-arm, pilot study compared the beta-cell protective effect of adding insulin glargine (GLA) vs. NPH insulin to ongoing metformin. Overall, 28 insulin-naive type 2 diabetes subjects (mean +/- SD age, 61.5 +/- 6.7 years; BMI, 30.7 +/- 4.3 kg/m(2)) treated with metformin and sulfonylurea were randomized to add once-daily GLA or NPH at bedtime."] data = spark.createDataFrame([text]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_clinical_trials_abstracts", "en", "clinical/models") .setInputCols(Array("token", "sentence")) .setOutputCol("ner") .setCaseSensitive(true) val 
ner_converter = new NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val text = "This open-label, parallel-group, two-arm, pilot study compared the beta-cell protective effect of adding insulin glargine (GLA) vs. NPH insulin to ongoing metformin. Overall, 28 insulin-naive type 2 diabetes subjects (mean +/- SD age, 61.5 +/- 6.7 years; BMI, 30.7 +/- 4.3 kg/m(2)) treated with metformin and sulfonylurea were randomized to add once-daily GLA or NPH at bedtime." val data = Seq(text).toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical_trials_abstracts").predict("""This open-label, parallel-group, two-arm, pilot study compared the beta-cell protective effect of adding insulin glargine (GLA) vs. NPH insulin to ongoing metformin. Overall, 28 insulin-naive type 2 diabetes subjects (mean +/- SD age, 61.5 +/- 6.7 years; BMI, 30.7 +/- 4.3 kg/m(2)) treated with metformin and sulfonylurea were randomized to add once-daily GLA or NPH at bedtime.""") ```
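The `ner_converter` stage groups the classifier's token-level BIO tags into the chunks shown under Results. A minimal re-implementation of that grouping (simplified; the real `NerConverter` also carries character offsets and confidence metadata):

```python
def bio_to_chunks(tokens, tags):
    """Collapse B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O" tag or inconsistent I- tag: flush any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["adding", "insulin", "glargine", "(", "GLA", ")"]
tags = ["O", "B-Drug", "I-Drug", "O", "B-Drug", "O"]
print(bio_to_chunks(tokens, tags))  # -> [('insulin glargine', 'Drug'), ('GLA', 'Drug')]
```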
## Results ```bash +----------------+------------------+ |chunk |ner_label | +----------------+------------------+ |open-label |CTDesign | |parallel-group |CTDesign | |two-arm |CTDesign | |insulin glargine|Drug | |GLA |Drug | |NPH insulin |Drug | |metformin |Drug | |28 |NumberPatients | |type 2 diabetes |DisorderOrSyndrome| |61.5 |Age | |kg/m(2 |BioAndMedicalUnit | |metformin |Drug | |sulfonylurea |Drug | |randomized |CTDesign | |once-daily |DrugTime | |GLA |Drug | |NPH |Drug | |bedtime |DrugTime | +----------------+------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_clinical_trials_abstracts| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - [Sanchez Graillet O, Cimiano P, Witte C, Ell B. C-TrO: an ontology for summarization and aggregation of the level of evidence in clinical trials. In: Proceedings of the Workshop Ontologies and Data in Life Sciences (ODLS 2019) in the Joint Ontology Workshops' (JOWO 2019). 
2019.](https://pub.uni-bielefeld.de/record/2939477) ## Benchmarking ```bash label precision recall f1-score support B-Age 0.93 0.88 0.90 16 B-AllocationRatio 1.00 1.00 1.00 7 B-Author 0.98 1.00 0.99 702 B-BioAndMedicalUnit 0.96 0.97 0.96 723 B-CTAnalysisApproach 1.00 1.00 1.00 5 B-CTDesign 0.93 0.95 0.94 384 B-Confidence 0.91 0.95 0.93 184 B-Country 0.88 0.91 0.90 115 B-DisorderOrSyndrome 0.92 0.96 0.94 393 B-DoseValue 0.97 0.98 0.97 117 B-Drug 0.97 0.98 0.97 3944 B-DrugTime 0.92 0.90 0.91 202 B-Duration 0.90 0.88 0.89 100 B-Journal 1.00 1.00 1.00 131 B-NumberPatients 0.94 0.98 0.96 165 B-PMID 1.00 1.00 1.00 239 B-PValue 0.86 0.89 0.88 132 B-PercentagePatients 0.93 0.97 0.95 105 B-PublicationYear 1.00 0.98 0.99 57 B-TimePoint 0.78 0.87 0.82 306 B-Value 0.89 0.87 0.88 407 I-Age 1.00 0.45 0.62 22 I-AllocationRatio 1.00 1.00 1.00 14 I-Author 0.99 0.98 0.99 590 I-BioAndMedicalUnit 0.97 0.99 0.98 344 I-CTAnalysisApproach 0.90 1.00 0.95 18 I-CTDesign 0.84 0.89 0.87 183 I-Confidence 0.92 0.98 0.95 753 I-Country 0.00 0.00 0.00 10 I-DisorderOrSyndrome 0.99 0.98 0.99 600 I-DoseValue 0.99 0.98 0.98 164 I-Drug 0.90 0.89 0.90 393 I-DrugTime 0.96 0.80 0.88 192 I-Duration 0.90 0.84 0.87 165 I-Journal 0.98 0.99 0.99 238 I-NumberPatients 1.00 0.95 0.98 22 I-PValue 0.96 0.99 0.98 612 I-PercentagePatients 0.99 1.00 1.00 130 I-TimePoint 0.81 0.78 0.79 282 I-Value 0.93 0.96 0.95 787 O 0.99 0.98 0.98 24184 accuracy - - 0.97 38137 macro-avg 0.92 0.91 0.91 38137 weighted-avg 0.97 0.97 0.97 38137 ``` --- layout: model title: Swedish XlmRoBertaForQuestionAnswering (from saattrupdan) author: John Snow Labs name: xlm_roberta_qa_xlmr_large_qa_sv_sv_m3hrdadfi date: 2022-06-24 tags: [open_source, question_answering, xlmroberta, sv] task: Question Answering language: sv edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, 
adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlmr-large-qa-sv` is a Swedish model originally trained by `m3hrdadfi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlmr_large_qa_sv_sv_m3hrdadfi_sv_4.0.0_3.0_1656066739095.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlmr_large_qa_sv_sv_m3hrdadfi_sv_4.0.0_3.0_1656066739095.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlmr_large_qa_sv_sv_m3hrdadfi","sv") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlmr_large_qa_sv_sv_m3hrdadfi","sv") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("sv.answer_question.xlmr_roberta.large").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlmr_large_qa_sv_sv_m3hrdadfi| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|sv| |Size:|1.9 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/m3hrdadfi/xlmr-large-qa-sv --- layout: model title: Turkish BertForQuestionAnswering model (from emre) author: John Snow Labs name: bert_qa_distilbert_tr_q_a date: 2022-06-02 tags: [tr, open_source, question_answering, bert] task: Question Answering language: tr edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-tr-q-a` is a Turkish model originally trained by `emre`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_distilbert_tr_q_a_tr_4.0.0_3.0_1654187587161.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_distilbert_tr_q_a_tr_4.0.0_3.0_1654187587161.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_distilbert_tr_q_a","tr") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_distilbert_tr_q_a","tr") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("tr.answer_question.bert.distilled").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_distilbert_tr_q_a| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|tr| |Size:|412.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/emre/distilbert-tr-q-a - https://github.com/TQuad/turkish-nlp-qa-dataset --- layout: model title: Translate English to Tetun Dili Pipeline author: John Snow Labs name: translate_en_tdt date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, tdt, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `tdt` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_tdt_xx_2.7.0_2.4_1609687644374.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_tdt_xx_2.7.0_2.4_1609687644374.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_tdt", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_tdt", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.tdt').predict(text, output_level='sentence') translate_df ```
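Since Marian translation cost grows quickly with sequence length, long inputs are best split into sentences before calling `annotate`. A naive pre-splitting sketch (the pretrained pipeline already includes its own sentence detector; this only illustrates the idea):

```python
import re

def split_sentences(text):
    """Very naive sentence splitter on terminal punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

doc = "This is long. Translate it! One sentence at a time?"
sentences = split_sentences(doc)
print(sentences)  # -> ['This is long.', 'Translate it!', 'One sentence at a time?']
```

Each element can then be passed to `pipeline.annotate` individually, keeping per-call sequence length small.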
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_tdt| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Castilian, Spanish BertForQuestionAnswering model (from bhavikardeshna) author: John Snow Labs name: bert_qa_multilingual_bert_base_cased_spanish date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `multilingual-bert-base-cased-spanish` is a Castilian, Spanish model originally trained by `bhavikardeshna`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_bert_base_cased_spanish_es_4.0.0_3.0_1654188563403.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_bert_base_cased_spanish_es_4.0.0_3.0_1654188563403.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_multilingual_bert_base_cased_spanish","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_multilingual_bert_base_cased_spanish","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.bert.multilingual_spanish_tuned_base_cased.by_bhavikardeshna").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_multilingual_bert_base_cased_spanish| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/bhavikardeshna/multilingual-bert-base-cased-spanish --- layout: model title: Financial Relation Extraction on 10K filings (Small) author: John Snow Labs name: finre_financial_small date: 2022-11-07 tags: [financial, 10k, filings, en, licensed] task: Relation Extraction language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts relations between amounts, counts, percentages, dates, and the financial entities extracted with one of these models: `finner_financial_small`, `finner_financial_medium`, or `finner_financial_large`. We highly recommend using it with `finner_financial_large`. ## Predicted Entities `has_amount`, `has_amount_date`, `has_percentage_date`, `has_percentage`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finre_financial_small_en_1.0.0_3.0_1667815219417.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finre_financial_small_en_1.0.0_3.0_1667815219417.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencizer = nlp.SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl", "en") \
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")\
    .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$','€'])

bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("bert_embeddings")

ner_model = finance.NerModel.pretrained("finner_financial_large", "en", "finance/models")\
    .setInputCols(["sentence", "token", "bert_embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

# ===========
# This is needed only to filter relation pairs using finance.RENerChunksFilter (see below)
# ===========
pos = nlp.PerceptronModel.pretrained("pos_anc", 'en')\
    .setInputCols("sentence", "token")\
    .setOutputCol("pos")

dependency_parser = nlp.DependencyParserModel.pretrained("dependency_conllu", "en") \
    .setInputCols(["sentence", "pos", "token"]) \
    .setOutputCol("dependencies")

ENTITIES = ['PROFIT', 'PROFIT_INCREASE', 'PROFIT_DECLINE', 'CF', 'CF_INCREASE', 'CF_DECREASE', 'LIABILITY', 'EXPENSE', 'EXPENSE_INCREASE', 'EXPENSE_DECREASE']

ENTITY_PAIRS = [f"{x}-AMOUNT" for x in ENTITIES]
ENTITY_PAIRS.extend([f"{x}-COUNT" for x in ENTITIES])
ENTITY_PAIRS.extend([f"{x}-PERCENTAGE" for x in ENTITIES])
ENTITY_PAIRS.append("AMOUNT-FISCAL_YEAR")
ENTITY_PAIRS.append("AMOUNT-DATE")
ENTITY_PAIRS.append("AMOUNT-CURRENCY")

re_ner_chunk_filter = finance.RENerChunksFilter() \
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunk")\
    .setRelationPairs(ENTITY_PAIRS)\
    .setMaxSyntacticDistance(5)
# ===========

reDL = finance.RelationExtractionDLModel.pretrained('finre_financial_small', 'en', 'finance/models')\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relations")

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentencizer,
    tokenizer,
    bert_embeddings,
    ner_model,
    ner_converter,
    pos,
    dependency_parser,
    re_ner_chunk_filter,
    reDL])

text = "In the third quarter of fiscal 2021, we received net proceeds of $342.7 million, after deducting underwriters discounts and commissions and offering costs of $31.8 million, including the exercise of the underwriters option to purchase additional shares. "

data = spark.createDataFrame([[text]]).toDF("text")

model = pipeline.fit(data)
results = model.transform(data)
```
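As a quick sanity check on the snippet above, the relation-pair whitelist can be built and inspected without Spark; it expands to 33 candidate pairs (10 NER labels times 3 value types, plus the 3 `AMOUNT` pairs):

```python
# Rebuild the candidate relation pairs exactly as in the pipeline above.
ENTITIES = ['PROFIT', 'PROFIT_INCREASE', 'PROFIT_DECLINE', 'CF', 'CF_INCREASE',
            'CF_DECREASE', 'LIABILITY', 'EXPENSE', 'EXPENSE_INCREASE', 'EXPENSE_DECREASE']

ENTITY_PAIRS = [f"{x}-AMOUNT" for x in ENTITIES]
ENTITY_PAIRS.extend([f"{x}-COUNT" for x in ENTITIES])
ENTITY_PAIRS.extend([f"{x}-PERCENTAGE" for x in ENTITIES])
ENTITY_PAIRS.extend(["AMOUNT-FISCAL_YEAR", "AMOUNT-DATE", "AMOUNT-CURRENCY"])

# 10 entities x 3 value types + 3 AMOUNT pairs = 33 candidate pairs
print(len(ENTITY_PAIRS))  # → 33
```

Only chunk pairs whose labels appear in this list (and that fall within the configured syntactic distance) are passed on to the relation extraction model.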
## Results

```bash
relation    entity1   entity1_begin  entity1_end  chunk1                          entity2  entity2_begin  entity2_end  chunk2         confidence
has_amount  CF        49             60           net proceeds                    AMOUNT   66             78           342.7 million  0.9999101
has_amount  CURRENCY  65             65           $                               AMOUNT   66             78           342.7 million  0.9925425
has_amount  EXPENSE   125            154          commissions and offering costs  AMOUNT   160            171          31.8 million   0.9997677
has_amount  CURRENCY  159            159          $                               AMOUNT   160            171          31.8 million   0.998896
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|finre_financial_small|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|405.7 MB|

## References

In-house annotations of 10K filings.

## Benchmarking

```bash
Relation             Recall  Precision  F1     Support
has_amount           0.997   0.997      0.997  670
has_amount_date      0.996   0.994      0.995  470
has_percentage       1.000   1.000      1.000  87
has_percentage_date  0.985   1.000      0.993  68
other                1.000   1.000      1.000  205
Avg.                 0.996   0.998      0.997  1583
Weighted-Avg.        0.997   0.997      0.997  1583
```
---
layout: model
title: Turkish Electra Embeddings (from dbmdz)
author: John Snow Labs
name: electra_embeddings_electra_base_turkish_mc4_uncased_generator
date: 2022-05-17
tags: [tr, open_source, electra, embeddings]
task: Embeddings
language: tr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-turkish-mc4-uncased-generator` is a Turkish model originally trained by `dbmdz`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_turkish_mc4_uncased_generator_tr_3.4.4_3.0_1652786631684.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_turkish_mc4_uncased_generator_tr_3.4.4_3.0_1652786631684.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_turkish_mc4_uncased_generator","tr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Spark NLP'yi seviyorum"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_turkish_mc4_uncased_generator","tr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Spark NLP'yi seviyorum").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|electra_embeddings_electra_base_turkish_mc4_uncased_generator|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|tr|
|Size:|130.7 MB|
|Case sensitive:|false|

## References

- https://huggingface.co/dbmdz/electra-base-turkish-mc4-uncased-generator
- https://zenodo.org/badge/latestdoi/237817454
- https://twitter.com/mervenoyann
- https://github.com/allenai/allennlp/discussions/5265
- https://github.com/dbmdz
- http://www.andrew.cmu.edu/user/ko/
---
layout: model
title: English T5ForConditionalGeneration Tiny Cased model (from google)
author: John Snow Labs
name: t5_efficient_tiny_nl6
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-nl6` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nl6_en_4.3.0_3.0_1675123966207.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nl6_en_4.3.0_3.0_1675123966207.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_tiny_nl6","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_tiny_nl6","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_tiny_nl6|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|53.2 MB|

## References

- https://huggingface.co/google/t5-efficient-tiny-nl6
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: Image De-Identification
author: John Snow Labs
name: ner_deid_large
date: 2023-01-03
tags: [en, licensed, ocr, image_deidentification]
task: Image DeIdentification
language: en
nav_key: models
edition: Visual NLP 4.0.0
spark_version: 3.2.1
supported: true
annotator: ImageDeIdentification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Deidentification NER (Large) is a Named Entity Recognition model that annotates text to find protected health information that may need to be deidentified. The entities it annotates are `Age`, `Contact`, `Date`, `Id`, `Location`, `Name`, and `Profession`. This model is trained with the `embeddings_clinical` word embeddings model, so be sure to use the same embeddings in the pipeline. It protects specific health information that could identify living or deceased individuals, preserving patient confidentiality without affecting the values and information that may be needed for research purposes.
## Predicted Entities

`Age`, `Contact`, `Date`, `Id`, `Location`, `Name`, `Profession`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/ocr/DEID_IMAGE/){:.button.button-orange.button-orange-trans.co.button-icon}
[Open in Colab](https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/tutorials/Certification_Trainings/3.1.SparkOcrImageDeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_large_en_3.0.0_3.0_1617209688468.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_large_en_3.0.0_3.0_1617209688468.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
def deidentification_nlp_pipeline(input_column, prefix=""):
    document_assembler = DocumentAssembler() \
        .setInputCol(input_column) \
        .setOutputCol(prefix + "document")

    # Sentence detector: splits each line into sentences
    sentence_detector = SentenceDetector() \
        .setInputCols([prefix + "document"]) \
        .setOutputCol(prefix + "sentence")

    tokenizer = Tokenizer() \
        .setInputCols([prefix + "sentence"]) \
        .setOutputCol(prefix + "token")

    # Clinical word embeddings
    word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
        .setInputCols([prefix + "sentence", prefix + "token"]) \
        .setOutputCol(prefix + "embeddings")

    # NER model trained on the i2b2 dataset (sampled from MIMIC)
    clinical_ner = MedicalNerModel.pretrained("ner_deid_large", "en", "clinical/models") \
        .setInputCols([prefix + "sentence", prefix + "token", prefix + "embeddings"]) \
        .setOutputCol(prefix + "ner")

    custom_ner_converter = NerConverter() \
        .setInputCols([prefix + "sentence", prefix + "token", prefix + "ner"]) \
        .setOutputCol(prefix + "ner_chunk") \
        .setWhiteList(["NAME", "AGE", "CONTACT", "LOCATION", "PROFESSION", "PERSON", "DATE"])

    nlp_pipeline = Pipeline(stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        custom_ner_converter
    ])
    empty_data = spark.createDataFrame([[""]]).toDF(input_column)
    nlp_model = nlp_pipeline.fit(empty_data)
    return nlp_model

# Convert binary content to images
binary_to_image = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image_raw")

# Extract text from images
ocr = ImageToText() \
    .setInputCol("image_raw") \
    .setOutputCol("text") \
    .setIgnoreResolution(False) \
    .setPageIteratorLevel(PageIteratorLevel.SYMBOL) \
    .setPageSegMode(PageSegmentationMode.SPARSE_TEXT) \
    .setConfidenceThreshold(70)

# Find the coordinates of sensitive data
position_finder = PositionFinder() \
    .setInputCols("ner_chunk") \
    .setOutputCol("coordinates") \
    .setPageMatrixCol("positions") \
    .setMatchingWindow(1000) \
    .setPadding(1)

# Draw filled rectangles to hide sensitive data
drawRegions = ImageDrawRegions() \
    .setInputCol("image_raw") \
    .setInputRegionsCol("coordinates") \
    .setOutputCol("image_with_regions") \
    .setFilledRect(True) \
    .setRectColor(Color.gray)

# OCR pipeline
pipeline = Pipeline(stages=[
    binary_to_image,
    ocr,
    deidentification_nlp_pipeline(input_column="text"),
    position_finder,
    drawRegions
])

image_path = pkg_resources.resource_filename("sparkocr", "resources/ocr/images/p1.jpg")
image_df = spark.read.format("binaryFile").load(image_path)

result = pipeline.fit(image_df).transform(image_df).cache()
```
```scala
def deidentificationNlpPipeline(inputColumn: String, prefix: String = "") = {
  val documentAssembler = new DocumentAssembler()
    .setInputCol(inputColumn)
    .setOutputCol(prefix + "document")

  // Sentence detector: splits each line into sentences
  val sentenceDetector = new SentenceDetector()
    .setInputCols(Array(prefix + "document"))
    .setOutputCol(prefix + "sentence")

  val tokenizer = new Tokenizer()
    .setInputCols(Array(prefix + "sentence"))
    .setOutputCol(prefix + "token")

  // Clinical word embeddings
  val wordEmbeddings = WordEmbeddingsModel
    .pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array(prefix + "sentence", prefix + "token"))
    .setOutputCol(prefix + "embeddings")

  // NER model trained on the i2b2 dataset (sampled from MIMIC)
  val clinicalNer = MedicalNerModel
    .pretrained("ner_deid_large", "en", "clinical/models")
    .setInputCols(Array(prefix + "sentence", prefix + "token", prefix + "embeddings"))
    .setOutputCol(prefix + "ner")

  val customNerConverter = new NerConverter()
    .setInputCols(Array(prefix + "sentence", prefix + "token", prefix + "ner"))
    .setOutputCol(prefix + "ner_chunk")
    .setWhiteList(Array("NAME", "AGE", "CONTACT", "LOCATION", "PROFESSION", "PERSON", "DATE"))

  val nlpPipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    wordEmbeddings,
    clinicalNer,
    customNerConverter
  ))
  val emptyData = Seq("").toDF(inputColumn)
  nlpPipeline.fit(emptyData)
}

// Convert binary content to images
val binaryToImage = new BinaryToImage()
  .setInputCol("content")
  .setOutputCol("image_raw")

// Extract text from images
val ocr = new ImageToText()
  .setInputCol("image_raw")
  .setOutputCol("text")
  .setIgnoreResolution(false)
  .setPageIteratorLevel(PageIteratorLevel.SYMBOL)
  .setPageSegMode(PageSegmentationMode.SPARSE_TEXT)
  .setConfidenceThreshold(70)

// Find the coordinates of sensitive data
val positionFinder = new PositionFinder()
  .setInputCols("ner_chunk")
  .setOutputCol("coordinates")
  .setPageMatrixCol("positions")
  .setMatchingWindow(1000)
  .setPadding(1)

// Draw filled rectangles to hide sensitive data
val drawRegions = new ImageDrawRegions()
  .setInputCol("image_raw")
  .setInputRegionsCol("coordinates")
  .setOutputCol("image_with_regions")
  .setFilledRect(true)
  .setRectColor(Color.gray)

// OCR pipeline
val pipeline = new Pipeline().setStages(Array(
  binaryToImage,
  ocr,
  deidentificationNlpPipeline("text"),
  positionFinder,
  drawRegions))

val imagePath = "sparkocr/resources/ocr/images/p1.jpg"
val imageDf = spark.read.format("binaryFile").load(imagePath)

val result = pipeline.fit(imageDf).transform(imageDf).cache()
```
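`PositionFinder` maps each NER chunk back to pixel coordinates so that `ImageDrawRegions` can cover it. Conceptually, this amounts to merging the OCR bounding boxes of the chunk's tokens and padding the result. The sketch below is a toy illustration of that idea in plain Python, not the Visual NLP API; `merge_boxes` and the box tuples are made up:

```python
# Toy illustration: a redaction box for a chunk taken as the union of its
# tokens' OCR bounding boxes, expanded by a padding margin.
def merge_boxes(boxes, padding=1):
    # each box is (x, y, width, height)
    x1 = min(b[0] for b in boxes) - padding
    y1 = min(b[1] for b in boxes) - padding
    x2 = max(b[0] + b[2] for b in boxes) + padding
    y2 = max(b[1] + b[3] for b in boxes) + padding
    return (x1, y1, x2 - x1, y2 - y1)

# Two adjacent word boxes for a chunk like "John Doe"
print(merge_boxes([(10, 20, 40, 12), (55, 20, 35, 12)]))  # → (9, 19, 82, 14)
```

The `setPadding(1)` call in the pipeline above plays the same role as the `padding` argument here: a small margin so the drawn rectangle fully covers the rendered text.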
## Example

{%- capture input_image -%}
![Screenshot](/assets/images/examples_ocr/image8.png)
{%- endcapture -%}

{%- capture output_image -%}
![Screenshot](/assets/images/examples_ocr/image8_out.png)
{%- endcapture -%}

{% include templates/input_output_image.md input_image=input_image output_image=output_image %}

## Output text

```bash
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ner_chunk                                                                                                                                                                                                                                                                                                     |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 193, 202, 04/04/2018, {entity -> DATE, sentence -> 1, chunk -> 0, confidence -> 0.9999}, []}, {chunk, 3290, 3290, ., {entity -> NAME, sentence -> 17, chunk -> 1, confidence -> 0.6035}, []}, {chunk, 3388, 3397, 04/12/2018, {entity -> DATE, sentence -> 20, chunk -> 2, confidence -> 1.0}, []}]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_deid_large|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|

## Data Source

Trained on augmented 2010 i2b2 challenge data with 'embeddings_clinical'.
[https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/](https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/) ## Benchmarking ```bash label tp fp fn prec rec f1 I-TREATMENT 6625 1187 1329 0.848054 0.832914 0.840416 I-PROBLEM 15142 1976 2542 0.884566 0.856254 0.87018 B-PROBLEM 11005 1065 1587 0.911765 0.873968 0.892466 I-TEST 6748 923 1264 0.879677 0.842237 0.86055 B-TEST 8196 942 1029 0.896914 0.888455 0.892665 B-TREATMENT 8271 1265 1073 0.867345 0.885167 0.876165 Macro-average 55987 7358 8824 0.881387 0.863166 0.872181 Micro-average 55987 7358 8824 0.883842 0.86385 0.873732 ``` --- layout: model title: Word2Vec Embeddings in Mirandese (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, mwl, open_source] task: Embeddings language: mwl edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_mwl_3.4.1_3.0_1647446133349.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_mwl_3.4.1_3.0_1647446133349.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","mwl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","mwl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("mwl.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
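Once tokens are mapped to vectors, a common downstream step is comparing them with cosine similarity. The sketch below is a minimal, framework-free illustration; the 3-d vectors stand in for the model's 300-d embeddings and their values are made up:

```python
import math

# Cosine similarity between two dense vectors of equal length.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Identical directions score 1.0; orthogonal directions score 0.0.
print(round(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]), 3))  # → 1.0
print(round(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 3))  # → 0.0
```

In practice you would pull the per-token vectors out of the `embeddings` column produced by the pipeline above and compare those.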
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|mwl| |Size:|113.5 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Multilingual XlmRoBertaForQuestionAnswering (from gokulkarthik) author: John Snow Labs name: xlm_roberta_qa_xlm_roberta_qa_chaii date: 2022-06-23 tags: [en, hi, ta, open_source, question_answering, xlmroberta, xx] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-qa-chaii` is a multilingual model originally trained by `gokulkarthik`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_qa_chaii_xx_4.0.0_3.0_1655996639001.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_qa_chaii_xx_4.0.0_3.0_1655996639001.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_qa_chaii","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_qa_chaii","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("xx.answer_question.chaii.xlm_roberta").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
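Under the hood, extractive QA annotators such as `XlmRoBertaForQuestionAnswering` score every context token as a possible answer start and end, then pick the best-scoring valid span. A simplified, framework-free illustration follows; the `best_span` helper and the scores are made up for this example:

```python
# Pick the highest-scoring (start, end) pair with start <= end.
def best_span(start_scores, end_scores):
    best = (0, 0)
    best_score = float("-inf")
    for i, s in enumerate(start_scores):
        for j, e in enumerate(end_scores[i:], start=i):
            if s + e > best_score:
                best_score = s + e
                best = (i, j)
    return best

tokens = ["My", "name", "is", "Clara"]
start = [0.1, 0.2, 0.1, 2.5]   # made-up scores peaking at "Clara"
end   = [0.0, 0.1, 0.2, 2.8]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # → Clara
```

The real model applies additional constraints (for example, the span must lie inside the context, not the question), but the core idea is this start/end span search.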
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_roberta_qa_chaii|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|xx|
|Size:|885.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/gokulkarthik/xlm-roberta-qa-chaii
---
layout: model
title: Korean BertForQuestionAnswering model (from eliza-dukim)
author: John Snow Labs
name: bert_qa_bert_base_multilingual_cased_korquad_v1
date: 2022-06-02
tags: [open_source, question_answering, bert]
task: Question Answering
language: ko
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased_korquad-v1` is a Korean model originally trained by `eliza-dukim`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_korquad_v1_ko_4.0.0_3.0_1654180276142.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_korquad_v1_ko_4.0.0_3.0_1654180276142.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_multilingual_cased_korquad_v1","ko") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_multilingual_cased_korquad_v1","ko") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("ko.answer_question.korquad.bert.multilingual_base_cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_multilingual_cased_korquad_v1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|ko|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/eliza-dukim/bert-base-multilingual-cased_korquad-v1
---
layout: model
title: Extract relations between problem, treatment and test entities (ReDL)
author: John Snow Labs
name: redl_clinical_biobert
date: 2023-01-14
tags: [en, licensed, relation_extraction, clinical, tensorflow]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Extracts eight relation types between problem, treatment and test entities, such as `TrIP` (a treatment has improved a medical problem).

## Predicted Entities

`PIP`, `TeCP`, `TeRP`, `TrAP`, `TrCP`, `TrIP`, `TrNAP`, `TrWP`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_clinical_biobert_en_4.2.4_3.0_1673727174891.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_clinical_biobert_en_4.2.4_3.0_1673727174891.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"]) \
    .setOutputCol("embeddings")

ner_tagger = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens", "embeddings"]) \
    .setOutputCol("ner_tags")

ner_converter = NerConverterInternal() \
    .setInputCols(["sentences", "tokens", "ner_tags"]) \
    .setOutputCol("ner_chunks")

dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \
    .setInputCols(["sentences", "pos_tags", "tokens"]) \
    .setOutputCol("dependencies")

# Set a filter on pairs of named entities which will be treated as relation candidates
re_ner_chunk_filter = RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"])\
    .setMaxSyntacticDistance(10)\
    .setOutputCol("re_ner_chunks")\
    .setRelationPairs(["problem-test", "problem-treatment"])

# This model was trained on sentence-level relations.
# It can also be trained on document-level relations; in that case, pass "document" instead of "sentence" as input at prediction time.
re_model = RelationExtractionDLModel()\ .pretrained('redl_clinical_biobert', 'en', "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) text ="""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation. Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . 
However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge . 
""" data = spark.createDataFrame([[text]]).toDF("text") p_model = pipeline.fit(data) result = p_model.transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols(Array("sentences")) .setOutputCol("tokens") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") .setRelationPairs(Array("problem-test", "problem-treatment")) // The dataset this model is trained to is sentence-wise. // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. 
val re_model = RelationExtractionDLModel() .pretrained("redl_clinical_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation. Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . 
However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.clinical").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation. 
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge . """) ```
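In practice, the extracted relations are often post-filtered by the model's confidence score on top of the `setPredictionThreshold(0.5)` already applied in the pipeline above. A minimal pure-Python sketch of that filtering step, using rows shaped like the Results table below (the `keep_confident` helper is illustrative, not part of Spark NLP):

```python
# Each tuple mirrors the (relation, chunk1, chunk2, confidence) fields
# produced by RelationExtractionDLModel; values come from the example output.
rows = [
    ("TrAP", "amoxicillin", "a respiratory tract infection", 0.99863595),
    ("TrAP", "glipizide", "HTG", 0.53253245),
    ("TeRP", "blood samples", "significant lipemia", 0.76421493),
]

def keep_confident(rows, threshold=0.8):
    """Drop relation candidates whose confidence is below the threshold."""
    return [r for r in rows if r[3] >= threshold]

# At the default 0.8 threshold only the amoxicillin row survives.
print(keep_confident(rows))
```

Raising or lowering the threshold trades precision for recall, exactly as with `setPredictionThreshold`.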
## Results ```bash +--------+---------+-------------+-----------+--------------------+---------+-------------+-----------+--------------------+----------+ |relation| entity1|entity1_begin|entity1_end| chunk1| entity2|entity2_begin|entity2_end| chunk2|confidence| +--------+---------+-------------+-----------+--------------------+---------+-------------+-----------+--------------------+----------+ | TrAP|TREATMENT| 512| 522| amoxicillin| PROBLEM| 528| 556|a respiratory tra...|0.99863595| | TrAP|TREATMENT| 571| 579| metformin| PROBLEM| 617| 620| T2DM|0.99126583| | TrAP|TREATMENT| 583| 591| glipizide| PROBLEM| 617| 620| T2DM|0.99036837| | TrAP|TREATMENT| 583| 591| glipizide| PROBLEM| 659| 661| HTG|0.53253245| | TrAP|TREATMENT| 599| 611| dapagliflozin| PROBLEM| 617| 620| T2DM| 0.9954288| | TrAP|TREATMENT| 599| 611| dapagliflozin| PROBLEM| 659| 661| HTG|0.95774424| | TrAP| PROBLEM| 617| 620| T2DM|TREATMENT| 626| 637| atorvastatin| 0.9347153| | TrAP| PROBLEM| 617| 620| T2DM|TREATMENT| 643| 653| gemfibrozil|0.97919524| | TrAP|TREATMENT| 626| 637| atorvastatin| PROBLEM| 659| 661| HTG| 0.7040749| | TrAP|TREATMENT| 643| 653| gemfibrozil| PROBLEM| 659| 661| HTG|0.97676986| | TeRP| TEST| 739| 758|Physical examination| PROBLEM| 796| 810| dry oral mucosa| 0.9983334| | TeRP| TEST| 830| 854|her abdominal exa...| PROBLEM| 875| 884| tenderness|0.99468285| | TeRP| TEST| 830| 854|her abdominal exa...| PROBLEM| 888| 895| guarding| 0.9940719| | TeRP| TEST| 830| 854|her abdominal exa...| PROBLEM| 902| 909| rigidity|0.99489564| | TeRP| TEST| 1246| 1258| blood samples| PROBLEM| 1283| 1301| significant lipemia|0.76421493| | TeRP| TEST| 1444| 1458| serum chemistry| PROBLEM| 1553| 1566| still elevated| 0.9956291| | TeRP| TEST| 1507| 1517| her glucose| PROBLEM| 1553| 1566| still elevated|0.97471684| | TeRP| TEST| 1535| 1547| the anion gap| PROBLEM| 1553| 1566| still elevated|0.99222517| | TeRP| PROBLEM| 1553| 1566| still elevated| TEST| 1576| 1592| serum bicarbonate|0.97230035| | TeRP| PROBLEM| 
1553| 1566| still elevated| TEST| 1610| 1627| triglyceride level|0.96121335| +--------+---------+-------------+-----------+--------------------+---------+-------------+-----------+--------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_clinical_biobert| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|401.7 MB| ## References Trained on the 2010 i2b2 relation challenge dataset. ## Benchmarking ```bash label Recall Precision F1 Support PIP 0.859 0.878 0.869 1435 TeCP 0.629 0.782 0.697 337 TeRP 0.903 0.929 0.916 2034 TrAP 0.872 0.866 0.869 1693 TrCP 0.641 0.677 0.659 340 TrIP 0.517 0.796 0.627 151 TrNAP 0.402 0.672 0.503 112 TrWP 0.257 0.824 0.392 109 Avg. 0.635 0.803 0.691 - ``` --- layout: model title: RE Pipeline between Tests, Results, and Dates author: John Snow Labs name: re_test_result_date_pipeline date: 2022-03-31 tags: [licensed, clinical, relation_extraction, tests, results, dates, en] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [re_test_result_date](https://nlp.johnsnowlabs.com/2021/02/24/re_test_result_date_en.html) model. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CLINICAL_DATE/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/RE_CLINICAL_DATE.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_test_result_date_pipeline_en_3.4.1_3.0_1648734076557.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_test_result_date_pipeline_en_3.4.1_3.0_1648734076557.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("re_test_result_date_pipeline", "en", "clinical/models") pipeline.fullAnnotate("He was advised chest X-ray or CT scan after checking his SpO2 which was <= 93%") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("re_test_result_date_pipeline", "en", "clinical/models") pipeline.fullAnnotate("He was advised chest X-ray or CT scan after checking his SpO2 which was <= 93%") ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.date_test_result.pipeline").predict("""He was advised chest X-ray or CT scan after checking his SpO2 which was <= 93%""") ```
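Downstream code typically keeps only the pairs for which a real relation was predicted. A minimal pure-Python sketch using the rows from the Results table (the `O` label marks pairs with no relation; the row structure is a simplification of the full annotation output):

```python
# Rows mirror the (relations, chunk1, chunk2) columns of the Results table.
results = [
    ("O", "chest X-ray", "93%"),
    ("O", "CT scan", "93%"),
    ("is_result_of", "SpO2", "93%"),
]

# Keep only chunk pairs that are actually linked by a relation.
linked = [(test, measurement) for rel, test, measurement in results if rel != "O"]
print(linked)  # [('SpO2', '93%')]
```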
## Results ```bash | index | relations | entity1 | chunk1 | entity2 | chunk2 | |-------|--------------|--------------|---------------------|--------------|---------| | 0 | O | TEST | chest X-ray | MEASUREMENTS | 93% | | 1 | O | TEST | CT scan | MEASUREMENTS | 93% | | 2 | is_result_of | TEST | SpO2 | MEASUREMENTS | 93% | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_test_result_date_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - PerceptronModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - DependencyParserModel - RelationExtractionModel --- layout: model title: Fast Neural Machine Translation Model from English to Azerbaijani author: John Snow Labs name: opus_mt_en_az date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, az, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. 
- source languages: `en` - target languages: `az` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_az_xx_2.7.0_2.4_1609166632809.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_az_xx_2.7.0_2.4_1609166632809.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_az", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_az", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.az').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_az| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Multilingual XLMRoBerta Embeddings (from hfl) author: John Snow Labs name: xlmroberta_embeddings_cino_small_v2 date: 2022-05-13 tags: [zh, ko, open_source, xlm_roberta, embeddings, xx, cino] task: Embeddings language: xx edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: XlmRoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `cino-small-v2` is a multilingual model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_cino_small_v2_xx_3.4.4_3.0_1652439686002.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_cino_small_v2_xx_3.4.4_3.0_1652439686002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_cino_small_v2","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_cino_small_v2","xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
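The pipeline above yields one embedding vector per token. For document-level tasks these are often mean-pooled into a single fixed-size vector (Spark NLP provides the `SentenceEmbeddings` annotator for this); the NumPy version below is only a sketch of the pooling operation, with made-up 3-dimensional vectors instead of the model's real embedding size:

```python
import numpy as np

# Sketch: average token embeddings into one sentence vector.
# The 3-d token vectors here are illustrative, not model output.
token_embeddings = np.array([
    [0.2, 0.4, 0.0],
    [0.6, 0.0, 0.2],
    [0.4, 0.2, 0.4],
])
sentence_vector = token_embeddings.mean(axis=0)
print(sentence_vector)  # one vector with the same dimensionality as a token
```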
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_embeddings_cino_small_v2| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|xx| |Size:|552.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/cino-small-v2 - https://github.com/ymcui/Chinese-Minority-PLM - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology --- layout: model title: Legal Pledge And Security Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_pledge_and_security_agreement_bert date: 2022-11-25 tags: [en, legal, classification, agreement, pledge, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_pledge_and_security_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the `pledge-and-security-agreement` class (or a similar document type) or not (binary classification). Unlike the Longformer model, this model is lighter, which makes inference faster. 
## Predicted Entities `pledge-and-security-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_pledge_and_security_agreement_bert_en_1.0.0_3.0_1669368647407.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_pledge_and_security_agreement_bert_en_1.0.0_3.0_1669368647407.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_pledge_and_security_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
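For reference, the per-class scores reported in the Benchmarking section satisfy the usual identity F1 = 2PR/(P+R). A quick check for the `pledge-and-security-agreement` row (precision 0.91, recall 0.89):

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.91, 0.89
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.9, matching the reported f1-score
```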
## Results ```bash +-------+ |result| +-------+ |[pledge-and-security-agreement]| |[other]| |[other]| |[pledge-and-security-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_pledge_and_security_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.95 0.96 0.96 82 pledge-and-security-agreement 0.91 0.89 0.90 35 accuracy - - 0.94 117 macro-avg 0.93 0.92 0.93 117 weighted-avg 0.94 0.94 0.94 117 ``` --- layout: model title: Part of Speech for Hebrew author: John Snow Labs name: pos_ud_htb date: 2021-03-09 tags: [part_of_speech, open_source, hebrew, pos_ud_htb, he] task: Part of Speech Tagging language: he edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. 
## Predicted Entities - None - DET - NOUN - VERB - CCONJ - ADP - PRON - PUNCT - ADJ - ADV - SCONJ - NUM - PROPN - AUX - X - INTJ {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_htb_he_3.0.0_3.0_1615292289236.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_htb_he_3.0.0_3.0_1615292289236.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_htb", "he") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['שלום מ John Snow Labs! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_htb", "he") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("שלום מ John Snow Labs! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["שלום מ John Snow Labs! "] token_df = nlu.load('he.pos.ud_htb').predict(text) token_df ```
## Results ```bash token pos 0 שלום None 1 מ ADP 2 John NOUN 3 Snow NOUN 4 Labs NOUN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_htb| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|he| --- layout: model title: English T5ForConditionalGeneration Cased model (from gagan3012) author: John Snow Labs name: t5_k2t_new date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `k2t-new` is an English model originally trained by `gagan3012`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_k2t_new_en_4.3.0_3.0_1675103876567.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_k2t_new_en_4.3.0_3.0_1675103876567.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_k2t_new","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_k2t_new","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_k2t_new| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|262.7 MB| ## References - https://huggingface.co/gagan3012/k2t-new - https://user-images.githubusercontent.com/49101362/116334480-f5e57a00-a7dd-11eb-987c-186477f94b6e.png - https://pypi.org/project/keytotext/ - https://pepy.tech/project/keytotext - https://colab.research.google.com/github/gagan3012/keytotext/blob/master/Examples/K2T.ipynb - https://share.streamlit.io/gagan3012/keytotext/UI/app.py - https://github.com/gagan3012/keytotext/tree/master/Training%20Notebooks - https://github.com/gagan3012/keytotext/tree/master/Examples - https://user-images.githubusercontent.com/49101362/116220679-90e64180-a755-11eb-9246-82d93d924a6c.png - https://github.com/gagan3012/streamlit-tags - https://user-images.githubusercontent.com/49101362/116162205-fc042980-a6fd-11eb-892e-8f6902f193f4.png --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC5CDR_Chem_Modified_SciBERT_384 date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC5CDR-Chem-Modified-SciBERT-384` is an English model originally trained by `ghadeermobasher`. 
## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC5CDR_Chem_Modified_SciBERT_384_en_4.0.0_3.0_1657109428387.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC5CDR_Chem_Modified_SciBERT_384_en_4.0.0_3.0_1657109428387.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC5CDR_Chem_Modified_SciBERT_384","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC5CDR_Chem_Modified_SciBERT_384","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
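The token classifier emits one IOB-style tag per token; turning those tags into entity chunks is what Spark NLP's `NerConverter` does. The plain-Python sketch below illustrates the aggregation logic only; the token and tag sequences are invented for illustration, not real model output:

```python
# Sketch of IOB tag aggregation: group B-/I- tagged tokens into chunks.
def tags_to_chunks(tokens, tags):
    chunks, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(" ".join(current))
            current = [tok]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" tag (or stray I-) ends any open chunk
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tokens = ["Patients", "received", "folic", "acid", "and", "heparin"]
tags = ["O", "O", "B-Chemical", "I-Chemical", "O", "B-Chemical"]
print(tags_to_chunks(tokens, tags))  # ['folic acid', 'heparin']
```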
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC5CDR_Chem_Modified_SciBERT_384| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|410.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC5CDR-Chem-Modified-SciBERT-384 --- layout: model title: Mapping Vaccine Products with Their Corresponding CVX Codes, Vaccine Names and CPT Codes author: John Snow Labs name: cvx_name_mapper date: 2022-10-12 tags: [cvx, chunk_mapping, cpt, en, clinical, licensed] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 4.2.1 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps vaccine products with their corresponding CVX codes, vaccine names and CPT codes. It returns 3 types of vaccine names; `short_name`, `full_name` and `trade_name`. ## Predicted Entities `cvx_code`, `short_name`, `full_name`, `trade_name`, `cpt_code` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/cvx_name_mapper_en_4.2.1_3.0_1665599269592.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/cvx_name_mapper_en_4.2.1_3.0_1665599269592.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('doc') chunk_assembler = Doc2Chunk()\ .setInputCols(['doc'])\ .setOutputCol('ner_chunk') chunkerMapper = ChunkMapperModel\ .pretrained("cvx_name_mapper", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings")\ .setRels(["cvx_code", "short_name", "full_name", "trade_name", "cpt_code"]) mapper_pipeline = Pipeline(stages=[ document_assembler, chunk_assembler, chunkerMapper ]) data = spark.createDataFrame([['DTaP'], ['MYCOBAX'], ['cholera, live attenuated']]).toDF('text') res = mapper_pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("doc") val chunk_assembler = new Doc2Chunk() .setInputCols(Array("doc")) .setOutputCol("ner_chunk") val chunkerMapper = ChunkMapperModel.pretrained("cvx_name_mapper", "en","clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("mappings") .setRels(Array("cvx_code", "short_name", "full_name", "trade_name", "cpt_code")) val pipeline = new Pipeline().setStages(Array( documentAssembler, chunk_assembler, chunkerMapper)) val data = Seq("DTaP", "MYCOBAX", "cholera, live attenuated").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.cvx_name").predict("""cholera, live attenuated""") ```
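Conceptually, the chunk mapper behaves like a dictionary lookup from a matched chunk's text to its related codes and names. The framework-free sketch below illustrates that idea using entries taken from the results table that follows; it is an illustration only, not the Spark NLP implementation.

```python
# Illustrative sketch of a chunk-mapper lookup (NOT the Spark NLP internals).
# Entries mirror the cvx_name_mapper results shown in this card.
cvx_map = {
    "DTaP": {"cvx_code": "20", "short_name": "DTaP", "cpt_code": "90700"},
    "MYCOBAX": {"cvx_code": "19", "short_name": "BCG", "cpt_code": "90585"},
    "cholera, live attenuated": {
        "cvx_code": "174",
        "short_name": "cholera, live attenuated",
        "cpt_code": "90625",
    },
}

def map_chunk(chunk: str, relation: str) -> str:
    """Return the requested relation for a chunk, or NONE if unmapped."""
    return cvx_map.get(chunk, {}).get(relation, "NONE")

print(map_chunk("MYCOBAX", "cvx_code"))  # 19
```

In the real pipeline the lookup keys are the `ner_chunk` annotations produced upstream, and `setRels` selects which of these relations are emitted in the `mappings` column.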
## Results ```bash +--------------------------+--------+--------------------------+-------------------------------------------------------------+------------+--------+ |chunk |cvx_code|short_name |full_name |trade_name |cpt_code| +--------------------------+--------+--------------------------+-------------------------------------------------------------+------------+--------+ |[DTaP] |[20] |[DTaP] |[diphtheria, tetanus toxoids and acellular pertussis vaccine]|[ACEL-IMUNE]|[90700] | |[MYCOBAX] |[19] |[BCG] |[Bacillus Calmette-Guerin vaccine] |[MYCOBAX] |[90585] | |[cholera, live attenuated]|[174] |[cholera, live attenuated]|[cholera, live attenuated] |[VAXCHORA] |[90625] | +--------------------------+--------+--------------------------+-------------------------------------------------------------+------------+--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|cvx_name_mapper| |Compatibility:|Healthcare NLP 4.2.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|25.1 KB| --- layout: model title: Indonesian RoBERTa Embeddings (Large) author: John Snow Labs name: roberta_embeddings_indonesian_roberta_large date: 2022-04-14 tags: [roberta, embeddings, id, open_source] task: Embeddings language: id edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `indonesian-roberta-large` is an Indonesian model originally trained by `flax-community`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indonesian_roberta_large_id_3.4.2_3.0_1649948688324.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indonesian_roberta_large_id_3.4.2_3.0_1649948688324.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indonesian_roberta_large","id") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Saya suka percikan NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indonesian_roberta_large","id") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Saya suka percikan NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("id.embed.indonesian_roberta_large").predict("""Saya suka percikan NLP""") ```
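A common downstream use of the `embeddings` column is comparing token or sentence vectors, for example with cosine similarity. The following Spark-free sketch shows the arithmetic on toy 4-dimensional vectors; real outputs from a large RoBERTa model are typically 1024-dimensional, and the values below are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real RoBERTa embeddings.
v1 = [0.1, 0.3, -0.2, 0.7]
v2 = [0.1, 0.3, -0.2, 0.7]
print(round(cosine_similarity(v1, v2), 2))  # 1.0
```

In a Spark pipeline, the vectors themselves can be pulled out of the result DataFrame (e.g. by exploding the `embeddings` annotation column) before applying a comparison like this.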
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_indonesian_roberta_large| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|id| |Size:|632.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/flax-community/indonesian-roberta-large - https://arxiv.org/abs/1907.11692 - https://hf.co/w11wo - https://hf.co/stevenlimcorn - https://hf.co/munggok - https://hf.co/chewkokwah --- layout: model title: English DistilBertForQuestionAnswering model (from andi611) Squad2 with Neg, Multi author: John Snow Labs name: distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-squad2-with-ner-with-neg-with-multi` is an English model originally trained by `andi611`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi_en_4.0.0_3.0_1654727406477.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi_en_4.0.0_3.0_1654727406477.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq("What is my name?", "My name is Clara and I live in Berkeley.").toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_conll.distil_bert.base_uncased_with_neg_with_multi.by_andi611").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
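Extractive QA models of this kind predict a start and an end position inside the context, and the answer is the span between them. A minimal, framework-free sketch of that final step (the token indices below are hypothetical, not the model's actual output):

```python
def extract_answer(context_tokens, start, end):
    """Join the context tokens between predicted start/end indices (inclusive)."""
    return " ".join(context_tokens[start:end + 1])

# Toy example matching the card's sample context; index 3 is hypothetical.
tokens = "My name is Clara and I live in Berkeley .".split()
print(extract_answer(tokens, 3, 3))  # Clara
```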
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/andi611/distilbert-base-uncased-squad2-with-ner-with-neg-with-multi --- layout: model title: Pipeline to Detect Symptoms, Treatments and Other Entities in German author: John Snow Labs name: ner_healthcare_pipeline date: 2023-03-15 tags: [ner, healthcare, licensed, de] task: Named Entity Recognition language: de edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_healthcare](https://nlp.johnsnowlabs.com/2021/09/15/ner_healthcare_de.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_pipeline_de_4.3.0_3.2_1678880382332.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_pipeline_de_4.3.0_3.2_1678880382332.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_healthcare_pipeline", "de", "clinical/models") text = '''Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist Hernia femoralis, Akne, einseitig, ein hochmalignes bronchogenes Karzinom, das überwiegend im Zentrum der Lunge, in einem Hauptbronchus entsteht. Die mittlere Prävalenz wird auf 1/20.000 geschätzt.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_healthcare_pipeline", "de", "clinical/models") val text = "Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist Hernia femoralis, Akne, einseitig, ein hochmalignes bronchogenes Karzinom, das überwiegend im Zentrum der Lunge, in einem Hauptbronchus entsteht. Die mittlere Prävalenz wird auf 1/20.000 geschätzt." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:------------------|--------:|------:|:----------------------|-------------:| | 0 | Kleinzellige | 4 | 15 | MEASUREMENT | 0.6897 | | 1 | Bronchialkarzinom | 17 | 33 | MEDICAL_CONDITION | 0.8983 | | 2 | Kleinzelliger | 36 | 48 | MEDICAL_SPECIFICATION | 0.1777 | | 3 | Lungenkrebs | 50 | 60 | MEDICAL_CONDITION | 0.9776 | | 4 | SCLC | 63 | 66 | MEDICAL_CONDITION | 0.9626 | | 5 | Hernia | 73 | 78 | MEDICAL_CONDITION | 0.8177 | | 6 | femoralis | 80 | 88 | LOCAL_SPECIFICATION | 0.9119 | | 7 | Akne | 91 | 94 | MEDICAL_CONDITION | 0.9995 | | 8 | einseitig | 97 | 105 | MEASUREMENT | 0.909 | | 9 | hochmalignes | 112 | 123 | MEDICAL_CONDITION | 0.6778 | | 10 | bronchogenes | 125 | 136 | BODY_PART | 0.621 | | 11 | Karzinom | 138 | 145 | MEDICAL_CONDITION | 0.8118 | | 12 | Lunge | 179 | 183 | BODY_PART | 0.9985 | | 13 | Hauptbronchus | 195 | 207 | BODY_PART | 0.9864 | | 14 | mittlere | 223 | 230 | MEASUREMENT | 0.9651 | | 15 | Prävalenz | 232 | 240 | MEDICAL_CONDITION | 0.9833 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_healthcare_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|de| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Translate English to Hindi Pipeline author: John Snow Labs name: translate_en_hi date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, hi, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. 
It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `hi` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_hi_xx_2.7.0_2.4_1609689464668.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_hi_xx_2.7.0_2.4_1609689464668.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_hi", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_hi", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.hi').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_hi| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Sentence Entity Resolver for CPT codes (Augmented) author: John Snow Labs name: sbiobertresolve_cpt_augmented date: 2021-05-30 tags: [licensed, entity_resolution, en, clinical] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to CPT codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. This model is enriched with augmented data for better performance. ## Predicted Entities CPT codes and their descriptions. {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cpt_augmented_en_3.0.4_3.0_1622372290384.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cpt_augmented_en_3.0.4_3.0_1622372290384.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") chunk2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_cpt_augmented","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . 
Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("sbert_embeddings") val resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_cpt_augmented","en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver)) val data = Seq("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic
renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.cpt.augmented").predict("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""") ```
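The `setDistanceFunction("EUCLIDEAN")` setting controls how candidate codes are ranked against each chunk's sentence embedding: the resolver returns the codes whose reference embeddings lie closest to the query. A framework-free sketch of that nearest-neighbor step, with toy 2-dimensional vectors and codes borrowed from the results below (not the model's real index):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def resolve(query_vec, index):
    """Rank (code, vector) candidates by distance to the query embedding."""
    return sorted(index, key=lambda item: euclidean(query_vec, item[1]))

# Toy embedding index standing in for the resolver's real CPT index.
index = [("99135", [0.9, 0.1]), ("36440", [0.2, 0.8]), ("32960", [0.5, 0.5])]
ranked = resolve([0.85, 0.15], index)
print(ranked[0][0])  # 99135
```

The `all_k_codes` column in the output corresponds to this ranked list, with the top-ranked candidate surfaced as `code`.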
## Results ```bash +--------------------+-----+---+-------+-----+----------+--------------------+--------------------+ | chunk|begin|end| entity| code|confidence| all_k_resolutions| all_k_codes| +--------------------+-----+---+-------+-----+----------+--------------------+--------------------+ | hypertension| 68| 79|PROBLEM|36440| 0.3349|Hypertransfusion:...|36440:::24935:::0...| |chronic renal ins...| 83|109|PROBLEM|50395| 0.0821|Nephrostomy:::Ren...|50395:::50328:::5...| | COPD| 113|116|PROBLEM|32960| 0.1575|Lung collapse pro...|32960:::32215:::1...| | gastritis| 120|128|PROBLEM|43501| 0.1772|Gastric ulcer sut...|43501:::43631:::4...| | TIA| 136|138|PROBLEM|61460| 0.1432|Intracranial tran...|61460:::64742:::2...| |a non-ST elevatio...| 182|202|PROBLEM|61624| 0.1151|Percutaneous non-...|61624:::61626:::3...| |Guaiac positive s...| 208|229|PROBLEM|44005| 0.1115|Enterolysis:::Abd...|44005:::49080:::4...| | mid LAD lesion| 332|345|PROBLEM|0281T| 0.2407|Plication of left...|0281T:::93462:::9...| | hypotension| 362|372|PROBLEM|99135| 0.9935|Induced hypotensi...|99135:::99185:::9...| | bradycardia| 378|388|PROBLEM|99135| 0.3884|Induced hypotensi...|99135:::33305:::3...| | vagal reaction| 466|479|PROBLEM|55450| 0.1427|Vasoligation:::Va...|55450:::64408:::7...| +--------------------+-----+---+-------+-----+----------+--------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_cpt_augmented| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[cpt_code_aug]| |Language:|en| |Case sensitive:|false| --- layout: model title: Social Determinants of Health (slim) author: John Snow Labs name: ner_sdoh_slim_wip date: 2022-11-15 tags: [en, licensed, sdoh, social_determinants, public_health, clinical] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.1 spark_version: 3.0 supported: true 
annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts terminology related to `Social Determinants of Health` from various kinds of biomedical documents. ## Predicted Entities `Housing`, `Smoking`, `Substance_Frequency`, `Childhood_Development`, `Age`, `Other_Disease`, `Employment`, `Marital_Status`, `Diet`, `Disability`, `Mental_Health`, `Alcohol`, `Substance_Quantity`, `Family_Member`, `Race_Ethnicity`, `Gender`, `Geographic_Entity`, `Sexual_Orientation`, `Substance_Use` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_slim_wip_en_4.2.1_3.0_1668524622964.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_slim_wip_en_4.2.1_3.0_1668524622964.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_sdoh_slim_wip", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, clinical_embeddings, ner_model, ner_converter]) text = [""" Mother states that he does smoke, there is a family hx of alcohol on both maternal and paternal sides of the family, maternal grandfather who died of alcohol related complications and paternal grandmother with severe alcoholism. Pts own drinking began at age 16, living in LA, had a DUI at age 17 after totaling a new car that his mother bought for him, he was married.
"""] data = spark.createDataFrame([text]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols("sentence", "token") .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_sdoh_slim_wip", "en", "clinical/models") .setInputCols("sentence", "token", "embeddings") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val nlpPipeline = new PipelineModel().setStages(Array(document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter)) val data = Seq("""Mother states that there is a family hx of alcohol on both maternal and paternal sides of the family, maternal grandfather who died of alcohol related complications and paternal grandmother with severe alcoholism. Pts own drinking began at age 16, had a DUI at age 17 after totaling a new car that his mother bought for him, he was married.""").toDS.toDF("text") val result = nlpPipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.sdoh_slim_wip").predict(""" Mother states that he does smoke, there is a family hx of alcohol on both maternal and paternal sides of the family, maternal grandfather who died of alcohol related complications and paternal grandmother with severe alcoholism. Pts own drinking began at age 16, living in LA, had a DUI at age 17 after totaling a new car that his mother bought for him, he was married. """) ```
## Results ```bash +-------------+-------------------+ | token| ner_label| +-------------+-------------------+ | Mother| B-Family_Member| | states| O| | that| O| | he| B-Gender| | does| O| | smoke| B-Smoking| | ,| O| | there| O| | is| O| | a| O| | family| O| | hx| O| | of| O| | alcohol| B-Alcohol| | on| O| | both| O| | maternal| B-Family_Member| | and| O| | paternal| B-Family_Member| | sides| O| | of| O| | the| O| | family| O| | ,| O| | maternal| B-Family_Member| | grandfather| B-Family_Member| | who| O| | died| O| | of| O| | alcohol| B-Alcohol| | related| O| |complications| O| | and| O| | paternal| B-Family_Member| | grandmother| B-Family_Member| | with| O| | severe| B-Alcohol| | alcoholism| I-Alcohol| | .| O| | Pts| O| | own| O| | drinking| B-Alcohol| | began| O| | at| O| | age| B-Age| | 16| I-Age| | ,| O| | living| O| | in| O| | LA|B-Geographic_Entity| | ,| O| | had| O| | a| O| | DUI| O| | at| O| | age| O| | 17| O| | after| O| | totaling| O| | a| O| | new| O| | car| O| | that| O| | his| B-Gender| | mother| B-Family_Member| | bought| O| | for| O| | him| B-Gender| | ,| O| | he| B-Gender| | was| O| | married| B-Marital_Status| | .| O| +-------------+-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_sdoh_slim_wip| |Compatibility:|Healthcare NLP 4.2.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|2.8 MB| ## References Manual annotations have been made over the [MTSamples](https://mtsamples.com/) and [MIMIC](https://physionet.org/content/mimiciii/1.4/) datasets.
## Benchmarking ```bash label precision recall f1-score support B-Age 0.93 0.90 0.91 277 B-Alcohol 0.90 0.88 0.89 410 B-Childhood_Development 1.00 1.00 1.00 1 B-Diet 1.00 1.00 1.00 6 B-Disability 0.96 0.95 0.96 57 B-Employment 0.91 0.79 0.85 1926 B-Family_Member 0.93 0.97 0.95 2412 B-Gender 0.97 0.99 0.98 6161 B-Geographic_Entity 0.81 0.79 0.80 82 B-Housing 0.82 0.73 0.77 183 B-Marital_Status 0.93 0.91 0.92 184 B-Mental_Health 0.85 0.72 0.78 487 B-Other_Disease 0.77 0.82 0.79 381 B-Race_Ethnicity 0.91 0.94 0.93 34 B-Sexual_Orientation 0.75 0.90 0.82 10 B-Smoking 0.96 0.96 0.96 209 B-Substance_Frequency 0.92 0.88 0.90 88 B-Substance_Quantity 0.83 0.79 0.81 72 B-Substance_Use 0.80 0.82 0.81 213 I-Age 0.91 0.95 0.93 589 I-Alcohol 0.80 0.77 0.79 159 I-Childhood_Development 1.00 1.00 1.00 3 I-Diet 1.00 0.89 0.94 9 I-Disability 1.00 0.53 0.70 15 I-Employment 0.77 0.62 0.69 369 I-Family_Member 0.79 0.84 0.81 138 I-Gender 0.57 0.88 0.69 231 I-Geographic_Entity 1.00 0.85 0.92 13 I-Housing 0.86 0.83 0.84 362 I-Marital_Status 1.00 0.18 0.31 11 I-Mental_Health 0.81 0.47 0.59 241 I-Other_Disease 0.75 0.74 0.75 256 I-Race_Ethnicity 1.00 1.00 1.00 15 I-Smoking 0.90 0.93 0.91 46 I-Substance_Frequency 0.85 0.73 0.79 75 I-Substance_Quantity 0.84 0.88 0.86 174 I-Substance_Use 0.86 0.84 0.85 182 O 0.99 0.99 0.99 148829 accuracy - - 0.98 164910 macro_avg 0.89 0.83 0.85 164910 weighted_avg 0.98 0.98 0.98 164910 ``` --- layout: model title: Chinese Bert Embeddings (Large, 3-layer) author: John Snow Labs name: bert_embeddings_rbtl3 date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `rbtl3` is a Chinese model originally trained by `hfl`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbtl3_zh_3.4.2_3.0_1649669312632.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbtl3_zh_3.4.2_3.0_1649669312632.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_rbtl3","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_rbtl3","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.rbtl3").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_rbtl3| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|228.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/rbtl3 - https://arxiv.org/abs/1906.08101 - https://github.com/google-research/bert - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://arxiv.org/abs/2004.13922 --- layout: model title: Part of Speech for Breton author: John Snow Labs name: pos_ud_keb date: 2020-07-29 23:34:00 +0800 task: Part of Speech Tagging language: br edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [pos, br] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_keb_br_2.5.5_2.4_1596053588899.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_keb_br_2.5.5_2.4_1596053588899.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_keb", "br") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Distaolit dimp hon dleoù evel m'hor bo ivez distaolet d'hon dleourion.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_keb", "br") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Distaolit dimp hon dleoù evel m'hor bo ivez distaolet d'hon dleourion.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Distaolit dimp hon dleoù evel m'hor bo ivez distaolet d'hon dleourion."""] pos_df = nlu.load('br.pos').predict(text, output_level='token') pos_df ```
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=8, result='VERB', metadata={'word': 'Distaolit'}), Row(annotatorType='pos', begin=10, end=13, result='VERB', metadata={'word': 'dimp'}), Row(annotatorType='pos', begin=15, end=17, result='DET', metadata={'word': 'hon'}), Row(annotatorType='pos', begin=19, end=23, result='NOUN', metadata={'word': 'dleoù'}), Row(annotatorType='pos', begin=25, end=28, result='ADP', metadata={'word': 'evel'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_keb| |Type:|pos| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|br| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from arrandi) author: John Snow Labs name: xlmroberta_ner_arrandi_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `arrandi`. 
## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_arrandi_base_finetuned_panx_de_4.1.0_3.0_1660431029647.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_arrandi_base_finetuned_panx_de_4.1.0_3.0_1660431029647.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_arrandi_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_arrandi_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_arrandi_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/arrandi/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Voice of the Patients (embeddings_clinical_large) author: John Snow Labs name: ner_vop_wip_emb_clinical_large date: 2023-05-19 tags: [licensed, clinical, en, ner, vop, patient] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts healthcare-related terms from documents written in a patient’s own words. Note: the ‘wip’ suffix indicates that model development is a work in progress; the model will be finalised and its performance improved in upcoming releases.
## Predicted Entities `Gender`, `Employment`, `BodyPart`, `Age`, `PsychologicalCondition`, `Form`, `Vaccine`, `Drug`, `Substance`, `ClinicalDept`, `Laterality`, `DateTime`, `Test`, `VitalTest`, `Disease`, `Dosage`, `Route`, `Duration`, `Procedure`, `AdmissionDischarge`, `Symptom`, `Frequency`, `RelationshipStatus`, `HealthStatus`, `Allergen`, `Modifier`, `SubstanceQuantity`, `TestResult`, `MedicalDevice`, `Treatment`, `InjuryOrPoisoning` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_wip_emb_clinical_large_en_4.4.2_3.0_1684511324500.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_wip_emb_clinical_large_en_4.4.2_3.0_1684511324500.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_wip_emb_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. 
I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_wip_emb_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . 
Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
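The chunk/label table in the Results section below is obtained by flattening the `ner_chunk` annotations: each chunk annotation carries the matched text in its `result` field and the predicted label in `metadata["entity"]`. A minimal pure-Python sketch of that flattening (the dict shape mimics `LightPipeline.fullAnnotate` output; the sample values are taken from the results below):

```python
# Flatten chunk annotations into (chunk, ner_label) rows.
def to_rows(chunk_annotations):
    return [(a["result"], a["metadata"]["entity"]) for a in chunk_annotations]

# Illustrative annotations, matching the first rows of the Results table.
annotations = [
    {"result": "20 year old", "metadata": {"entity": "Age"}},
    {"result": "girl", "metadata": {"entity": "Gender"}},
    {"result": "hyperthyroid", "metadata": {"entity": "Disease"}},
]
print(to_rows(annotations))
# -> [('20 year old', 'Age'), ('girl', 'Gender'), ('hyperthyroid', 'Disease')]
```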
## Results ```bash | chunk | ner_label | |:---------------------|:-----------------------| | 20 year old | Age | | girl | Gender | | hyperthyroid | Disease | | 1 month ago | DateTime | | weak | Symptom | | light | Symptom | | panic attacks | PsychologicalCondition | | depression | PsychologicalCondition | | left | Laterality | | chest | BodyPart | | pain | Symptom | | increased | TestResult | | heart rate | VitalTest | | rapidly | Modifier | | weight loss | Symptom | | 4 months | Duration | | hospital | ClinicalDept | | discharged | AdmissionDischarge | | hospital | ClinicalDept | | blood tests | Test | | brain | BodyPart | | mri | Test | | ultrasound scan | Test | | endoscopy | Procedure | | doctors | Employment | | homeopathy doctor | Employment | | he | Gender | | hyperthyroid | Disease | | TSH | Test | | 0.15 | TestResult | | T3 | Test | | T4 | Test | | normal | TestResult | | b12 deficiency | Disease | | vitamin D deficiency | Disease | | weekly | Frequency | | supplement | Drug | | vitamin D | Drug | | 1000 mcg | Dosage | | b12 | Drug | | daily | Frequency | | homeopathy medicine | Drug | | 40 days | Duration | | after 30 days | DateTime | | TSH | Test | | 0.5 | TestResult | | now | DateTime | | weakness | Symptom | | depression | PsychologicalCondition | | last week | DateTime | | rapid | TestResult | | heartrate | VitalTest | | allopathy medicine | Treatment | | homeopathy | Treatment | | thyroid | BodyPart | | allopathy | Treatment | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_wip_emb_clinical_large| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.9 MB| |Dependencies:|embeddings_clinical_large| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. 
I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
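The benchmark below reports raw tp/fp/fn counts alongside the derived precision/recall/f1 columns; the derived values can be reproduced from the counts. A quick sanity check in plain Python against the Gender and AdmissionDischarge rows:

```python
# Recompute precision, recall, and F1 from true positives, false
# positives, and false negatives, rounded to 2 decimals as in the table.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

print(prf(1294, 25, 23))  # Gender row -> (0.98, 0.98, 0.98)
print(prf(25, 1, 9))      # AdmissionDischarge row -> (0.96, 0.74, 0.83)
```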
## Benchmarking ```bash label tp fp fn total precision recall f1 Gender 1294 25 23 1317 0.98 0.98 0.98 Employment 1171 47 72 1243 0.96 0.94 0.95 BodyPart 2710 199 190 2900 0.93 0.93 0.93 Age 539 47 43 582 0.92 0.93 0.92 PsychologicalCondition 417 45 27 444 0.90 0.94 0.92 Form 250 33 16 266 0.88 0.94 0.91 Vaccine 37 2 5 42 0.95 0.88 0.91 Drug 1311 144 129 1440 0.90 0.91 0.91 Substance 399 69 22 421 0.85 0.95 0.90 ClinicalDept 288 25 38 326 0.92 0.88 0.90 Laterality 538 47 90 628 0.92 0.86 0.89 DateTime 3992 602 410 4402 0.87 0.91 0.89 Test 1064 141 144 1208 0.88 0.88 0.88 VitalTest 154 32 18 172 0.83 0.90 0.86 Disease 1755 316 260 2015 0.85 0.87 0.86 Dosage 347 62 65 412 0.85 0.84 0.85 Route 41 7 7 48 0.85 0.85 0.85 Duration 1845 233 465 2310 0.89 0.80 0.84 Procedure 555 83 150 705 0.87 0.79 0.83 AdmissionDischarge 25 1 9 34 0.96 0.74 0.83 Symptom 3710 727 865 4575 0.84 0.81 0.82 Frequency 851 159 228 1079 0.84 0.79 0.81 RelationshipStatus 18 3 6 24 0.86 0.75 0.80 HealthStatus 83 29 24 107 0.74 0.78 0.76 Allergen 29 1 17 46 0.97 0.63 0.76 Modifier 783 189 356 1139 0.81 0.69 0.74 SubstanceQuantity 60 17 25 85 0.78 0.71 0.74 TestResult 364 114 160 524 0.76 0.69 0.73 MedicalDevice 225 56 107 332 0.80 0.68 0.73 Treatment 142 34 86 228 0.81 0.62 0.70 InjuryOrPoisoning 104 24 72 176 0.81 0.59 0.68 macro_avg 25101 3513 4129 29230 0.87 0.82 0.84 micro_avg 25101 3513 4129 29230 0.88 0.86 0.87 ``` --- layout: model title: Emotion Detection Classifier author: John Snow Labs name: bert_sequence_classifier_emotion date: 2022-01-14 tags: [bert_for_sequence, en, emotion, open_source] task: Text Classification language: en nav_key: models edition: Spark NLP 3.3.4 spark_version: 3.0 supported: true annotator: BertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model was imported from `Hugging Face` and it's been fine-tuned on emotion [dataset](https://huggingface.co/nateraw/bert-base-uncased-emotion), 
leveraging `Bert` embeddings and `BertForSequenceClassification` for text classification purposes. ## Predicted Entities `sadness`, `joy`, `love`, `anger`, `fear`, `surprise` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_emotion_en_3.3.4_3.0_1642152012549.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_emotion_en_3.3.4_3.0_1642152012549.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = BertForSequenceClassification \ .pretrained('bert_sequence_classifier_emotion', 'en') \ .setInputCols(['token', 'document']) \ .setOutputCol('class') pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier]) example = spark.createDataFrame([["What do you mean? Are you kidding me?"]]).toDF("text") result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_emotion", "en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) val example = Seq("What do you mean? Are you kidding me?").toDS.toDF("text") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.emotion.bert").predict("""What do you mean? Are you kidding me?""") ```
## Results ```bash ['anger'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_emotion| |Compatibility:|Spark NLP 3.3.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|410.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## Data Source [https://huggingface.co/datasets/viewer/?dataset=emotion](https://huggingface.co/datasets/viewer/?dataset=emotion) ## Benchmarking NOTE: The author didn't share Precision / Recall / F1; only Validation Accuracy was shared as [Evaluation Results](https://huggingface.co/nateraw/bert-base-uncased-emotion#eval-results). ```bash Validation Accuracy: 0.931 ``` --- layout: model title: Translate Mossi to English Pipeline author: John Snow Labs name: translate_mos_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, mos, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `mos` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_mos_en_xx_2.7.0_2.4_1609689738585.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_mos_en_xx_2.7.0_2.4_1609689738585.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_mos_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_mos_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.mos.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_mos_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Sentence Entity Resolver for ICD10-CM (general 3 character codes) author: John Snow Labs name: sbiobertresolve_icd10cm_generalised_augmented date: 2023-05-24 tags: [licensed, en, clinical, entity_resolution] task: Entity Resolution language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD-10-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It predicts ICD-10-CM codes up to 3 characters (in the ICD-10-CM code structure, the first three characters represent the general type of injury or disease).
## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_generalised_augmented_en_4.4.2_3.0_1684930238103.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_generalised_augmented_en_4.4.2_3.0_1684930238103.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use `sbiobertresolve_icd10cm_generalised_augmented` resolver model must be used with `sbiobert_base_cased_mli` as embeddings.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(['PROBLEM']) chunk2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_generalised_augmented","en", "clinical/models")\ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . 
Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") .setWhiteList("PROBLEM") val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("sbert_embeddings") val icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_generalised_augmented","en", "clinical/models") .setInputCols("sbert_embeddings") .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver)) val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , 
hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ```
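As a worked illustration of what "generalised" means here: a full ICD-10-CM code such as `N18.3` collapses to its 3-character category `N18`, which is the granularity this resolver returns (e.g. `I10`, `N18`, `J44` in the results below). A minimal sketch with a hypothetical helper, not part of the model's API:

```python
def generalise_icd10(code: str) -> str:
    """Reduce a full ICD-10-CM code to its 3-character category,
    e.g. N18.3 (a chronic kidney disease subcode) -> N18."""
    return code.replace(".", "")[:3]

print(generalise_icd10("N18.3"))  # -> N18
print(generalise_icd10("I10"))    # -> I10
```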
## Results ```bash | ner_chunk | entity | icd10_code | all_codes | resolutions | | --------------------------- | -------- | ---------- | --------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | hypertension | PROBLEM | I10 | I10:::Z86:::I15:::Z87:::P29:::Z82:::I11:::Z13 | hypertension [hypertension]:::hypertension monitored [hypertension monitored]:::secondary hypertension [secondary hypertension]:::h/o: hypertension [h/o: hypertension]:::diastolic hypertension [diastolic hypertension]:::fh: hypertension [fh: hypertension]:::hypertensive heart disease [hypertensive heart disease]:::suspected hypertension [suspected hypertension] | | chronic renal insufficiency | PROBLEM | N18 | N18:::P29:::N19:::P96:::D63:::N28:::E79:::N13 | chronic renal insufficiency [chronic renal insufficiency]:::chronic renal failure [chronic renal failure]:::chronic progressive renal insufficiency [chronic progressive renal insufficiency]:::renal insufficiency [renal insufficiency]:::anaemia of chronic renal insufficiency [anaemia of chronic renal insufficiency]:::impaired renal function disorder [impaired renal function disorder]:::chronic renal failure syndrome [chronic renal failure syndrome]:::post-renal renal failure [post-renal renal failure] | | COPD | PROBLEM | J44 | J44:::J98:::J62:::Z76:::J81:::J96:::I27 | copd - chronic obstructive pulmonary disease [copd - chronic obstructive 
pulmonary disease]:::chronic lung disease (disorder) [chronic lung disease (disorder)]:::chronic lung disease [chronic lung disease]:::chronic obstructive pulmonary disease leaflet given [chronic obstructive pulmonary disease leaflet given]:::chronic pulmonary congestion (disorder) [chronic pulmonary congestion (disorder)]:::chronic respiratory failure (disorder) [chronic respiratory failure (disorder)]:::chronic pulmonary heart disease (disorder) [chronic pulmonary heart disease (disorder)] | | gastritis | PROBLEM | K29 | K29:::Z13:::K52:::A09:::B96 | gastritis [gastritis]:::suspicion of gastritis [suspicion of gastritis]:::toxic gastritis [toxic gastritis]:::suppurative gastritis [suppurative gastritis]:::bacterial gastritis [bacterial gastritis] | | TIA | PROBLEM | S06 | S06:::G45:::I63:::G46:::G95 | cerebral concussion [cerebral concussion]:::transient ischemic attack (disorder) [transient ischemic attack (disorder)]:::thalamic stroke [thalamic stroke]:::occipital cerebral infarction (disorder) [occipital cerebral infarction (disorder)]:::spinal cord stroke [spinal cord stroke] | | a non-ST elevation MI | PROBLEM | I21 | I21:::I24:::I25:::I63:::I5A:::M62:::I60 | silent myocardial infarction (disorder) [silent myocardial infarction (disorder)]:::acute nontransmural infarction [acute nontransmural infarction]:::mi - silent myocardial infarction [mi - silent myocardial infarction]:::silent cerebral infarct [silent cerebral infarct]:::non-ischemic myocardial injury (non-traumatic) [non-ischemic myocardial injury (non-traumatic)]:::nontraumatic ischemic infarction of muscle, unsp shoulder [nontraumatic ischemic infarction of muscle, unsp shoulder]:::nontraumatic ruptured cerebral aneurysm [nontraumatic ruptured cerebral aneurysm] | | Guaiac positive stools | PROBLEM | K92 | K92:::R19:::R15:::P54:::K38:::R29:::R85 | guaiac-positive stools [guaiac-positive stools]:::acholic stool (finding) [acholic stool (finding)]:::fecal overflow [fecal overflow]:::fecal occult 
blood positive [fecal occult blood positive]:::appendicular fecalith [appendicular fecalith]:::slump test positive [slump test positive]:::anal swab culture positive (finding) [anal swab culture positive (finding)] | | mid LAD lesion | PROBLEM | I21 | I21:::Q24:::Q21:::I49:::I51:::Q28:::Q20:::I25 | stemi involving left anterior descending coronary artery [stemi involving left anterior descending coronary artery]:::overriding left ventriculoarterial valve [overriding left ventriculoarterial valve]:::double outlet left atrium [double outlet left atrium]:::left atrial rhythm (finding) [left atrial rhythm (finding)]:::disorder of left atrium [disorder of left atrium]:::left dominant coronary system [left dominant coronary system]:::abnormality of left atrial appendage [abnormality of left atrial appendage]:::aneurysm of patch of left ventricular outflow tract [aneurysm of patch of left ventricular outflow tract] | | hypotension | PROBLEM | I95 | I95:::H44:::T50:::G96:::O26 | hypotension [hypotension]:::globe hypotension [globe hypotension]:::drug-induced hypotension [drug-induced hypotension]:::intracranial hypotension [intracranial hypotension]:::supine hypotensive syndrome [supine hypotensive syndrome] | | bradycardia | PROBLEM | O99 | O99:::P29:::R00:::R94:::R25:::I49:::R41:::R06 | bradycardia [bradycardia]:::bradycardia (finding) [bradycardia (finding)]:::sinus bradycardia [sinus bradycardia]:::ecg: bradycardia [ecg: bradycardia]:::bradykinesia [bradykinesia]:::nodal escape bradycardia [nodal escape bradycardia]:::bradykinesis [bradykinesis]:::bradypnea [bradypnea] | | vagal reaction | PROBLEM | R55 | R55:::R29:::R00:::R94:::R25:::I49:::R41:::R06 | vagal reaction [vagal reaction]:::vasovagal reaction [vasovagal reaction]:::vasovagal syncope [vasovagal syncope]:::vagal stimulation effect [vagal stimulation effect]:::vagal stimulation (procedure) [vagal stimulation (procedure)]:::vasovagal shock [vasovagal shock]:::vasovagal attack [vasovagal attack]:::vasovagal 
symptoms [vasovagal symptoms] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icd10cm_generalised_augmented| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Size:|1.5 GB| |Case sensitive:|false| --- layout: model title: ICD10CM Neoplasms Entity Resolver author: John Snow Labs name: chunkresolve_icd10cm_neoplasms_clinical class: ChunkEntityResolverModel language: en nav_key: models repository: clinical/models date: 2020-04-28 task: Entity Resolution edition: Healthcare NLP 2.4.5 spark_version: 2.4 tags: [clinical,licensed,entity_resolution,en] deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Entity Resolution model based on KNN using Word Embeddings + Word Mover's Distance. ## Predicted Entities ICD10-CM codes and their normalized definitions, resolved with ``clinical_embeddings``. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_ICD10_CM.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_neoplasms_clinical_en_2.4.5_2.4_1588108205630.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_neoplasms_clinical_en_2.4.5_2.4_1588108205630.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python ... neoplasm_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_neoplasms_clinical","en","clinical/models")\ .setInputCols("token","chunk_embeddings")\ .setOutputCol("entity") pipeline_puerile = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, neoplasm_resolver]) data = spark.createDataFrame([["""The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion."""]]).toDF("text") model = pipeline_puerile.fit(data) results = model.transform(data) ``` ```scala ... val neoplasm_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_neoplasms_clinical","en","clinical/models") .setInputCols(Array("token","chunk_embeddings")) .setOutputCol("entity") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, neoplasm_resolver)) val data = Seq("The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. 
She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion.").toDF("text") val result = pipeline.fit(data).transform(data) ```
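`chunkresolve_icd10cm_neoplasms_clinical` was trained only on the ICD-10-CM ranges C000-D489 and R590-R599 (see Data Source below). A minimal plain-Python sketch of a sanity check you might run over the resolver's output codes — the helper is hypothetical and not part of Spark NLP:

```python
def in_neoplasm_range(code: str) -> bool:
    """True if an undotted ICD-10-CM code (e.g. "C9251") falls inside the
    ranges this resolver was trained on: C00-D48 plus R59."""
    letter, block = code[0], int(code[1:3])
    if letter == "C":
        return True           # C00-C96: malignant neoplasms
    if letter == "D":
        return block <= 48    # D00-D48: in situ, benign, uncertain behavior
    if letter == "R":
        return block == 59    # R59: enlarged lymph nodes
    return False
```

Codes outside these ranges in the output (e.g. from ambiguous chunks) are candidates for filtering before downstream use.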
{:.h2_title} ## Results ```bash chunk entity icd10_neoplasm_description icd10_neoplasm_code 0 patient Organism Acute myelomonocytic leukemia, in remission C9251 1 infant Organism Malignant (primary) neoplasm, unspecified C801 2 nose Organ Malignant neoplasm of nasal cavity C300 3 She Organism Malignant neoplasm of thyroid gland C73 4 She Organism Malignant neoplasm of thyroid gland C73 5 She Organism Malignant neoplasm of thyroid gland C73 6 Aldex Gene_or_gene_product Acute megakaryoblastic leukemia not having ach... C9420 7 ear Organ Other benign neoplasm of skin of right ear and... D2321 8 She Organism Malignant neoplasm of thyroid gland C73 9 She Organism Malignant neoplasm of thyroid gland C73 10 She Organism Malignant neoplasm of thyroid gland C73 ``` {:.model-param} ## Model Information {:.table-model} |----------------|-----------------------------------------| | Name: | chunkresolve_icd10cm_neoplasms_clinical | | Type: | ChunkEntityResolverModel | | Compatibility: | Spark NLP 2.4.5+ | | License: | Licensed | |Edition:|Official| | |Input labels: | [token, chunk_embeddings] | |Output labels: | [entity] | | Language: | en | | Case sensitive: | True | | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained on ICD10CM Dataset Ranges: C000-D489, R590-R599 https://www.icd10data.com/ICD10CM/Codes/C00-D49 --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186 TFWav2Vec2ForCTC from Sarahliu186 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark 
NLP.`asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186` is an English model originally trained by Sarahliu186. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186_en_4.2.0_3.0_1664114424756.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186_en_4.2.0_3.0_1664114424756.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186", lang = "en") val annotations = pipeline.transform(audioDF) ```
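The pipeline above consumes a DataFrame of raw audio samples (`audioDF`); Wav2Vec2 models expect 16 kHz mono float arrays. A minimal stdlib sketch of that conversion step, assuming mono 16-bit PCM WAV input — the `audio_content` column name in the hand-off comment is an assumption based on the usage shown elsewhere on this page:

```python
import wave
import struct

def wav_to_floats(src):
    """Read a mono 16-bit PCM WAV (path or binary file object) and
    return its samples normalized to [-1.0, 1.0]."""
    with wave.open(src, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM audio")
        frames = wf.readframes(wf.getnframes())
    # Little-endian signed 16-bit samples, scaled by the int16 range
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Hypothetical hand-off to the pipeline above (column name assumed):
# audio_df = spark.createDataFrame([[wav_to_floats("sample.wav")]]).toDF("audio_content")
```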
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English image_classifier_vit_fancy_animales ViTForImageClassification from andy-0v0 author: John Snow Labs name: image_classifier_vit_fancy_animales date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_fancy_animales` is an English model originally trained by andy-0v0. ## Predicted Entities `penguin`, `chow chow`, `sloth`, `wombat`, `panda` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_fancy_animales_en_4.1.0_3.0_1660165986402.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_fancy_animales_en_4.1.0_3.0_1660165986402.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_fancy_animales", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_fancy_animales", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
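Under the hood, the classifier scores each image against the five labels above and picks the highest-scoring one. A minimal plain-Python sketch of that softmax/argmax step — the logits here are illustrative, not actual model output:

```python
import math

LABELS = ["penguin", "chow chow", "sloth", "wombat", "panda"]

def top_class(logits):
    """Softmax over raw class logits; return (label, probability) of the best class."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]
```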
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_fancy_animales| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Legal Authorizations Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_authorizations_bert date: 2023-03-05 tags: [en, legal, classification, clauses, authorizations, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Authorizations` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that this model's embeddings allow up to 512 tokens. 
If your text exceeds that, consider splitting it into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Authorizations`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_authorizations_bert_en_1.0.0_3.0_1678050707023.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_authorizations_bert_en_1.0.0_3.0_1678050707023.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_authorizations_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
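The description above suggests stacking several binary clause classifiers and reading off one True/False value per clause. A minimal plain-Python sketch of that aggregation step — the classifier names and outputs below are hypothetical:

```python
def clause_profile(predictions):
    """Collapse per-classifier outputs into one clause -> bool profile.

    predictions maps a clause name to the category string its binary
    classifier returned (either the clause name itself, or "Other")."""
    return {clause: pred == clause for clause, pred in predictions.items()}

# e.g. outputs of three binary classifiers run on the same document:
# clause_profile({"Authorizations": "Authorizations",
#                 "Amendments": "Other",
#                 "Assignments": "Other"})
```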
## Results ```bash +----------------+ |result| +----------------+ |[Authorizations]| |[Other]| |[Other]| |[Authorizations]| +----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_authorizations_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Authorizations 0.93 0.96 0.95 73 Other 0.97 0.95 0.96 97 accuracy - - 0.95 170 macro-avg 0.95 0.95 0.95 170 weighted-avg 0.95 0.95 0.95 170 ``` --- layout: model title: Extract relations between problem, treatment and test entities (ReDL) author: John Snow Labs name: redl_clinical_biobert date: 2021-07-24 tags: [en, licensed, relation_extraction, clinical] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.0.3 spark_version: 2.4 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extracts relations such as `TrIP` (a certain treatment has improved a medical problem) and 7 other relation types between problem, treatment, and test entities. 
## Predicted Entities `PIP`, `TeCP`, `TeRP`, `TrAP`, `TrCP`, `TrIP`, `TrNAP`, `TrWP` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_clinical_biobert_en_3.0.3_2.4_1627118222780.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_clinical_biobert_en_3.0.3_2.4_1627118222780.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = sparknlp.annotators.Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") words_embedder = WordEmbeddingsModel() \ .pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel() \ .pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens", "embeddings"]) \ .setOutputCol("ner_tags") ner_converter = NerConverter() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel() \ .pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") # Set a filter on pairs of named entities which will be treated as relation candidates re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks")\ .setRelationPairs(["problem-test", "problem-treatment"]) # The dataset this model is trained on is sentence-wise. # This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. 
re_model = RelationExtractionDLModel()\ .pretrained('redl_clinical_biobert', 'en', "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) text ="""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation. Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . 
However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge . """ data = spark.createDataFrame([[text]]).toDF("text") p_model = pipeline.fit(data) result = p_model.transform(data) ``` ```scala ... 
val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols(Array("sentences")) .setOutputCol("tokens") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") .setRelationPairs(Array("problem-test", "problem-treatment")) // The dataset this model is trained on is sentence-wise. // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. 
val re_model = RelationExtractionDLModel.pretrained("redl_clinical_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation. Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . 
However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.clinical").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation. 
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge . """) ```
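The model only emits relations whose confidence clears the `setPredictionThreshold(0.5)` cutoff; downstream you may want a stricter cutoff or a whitelist of relation labels. A minimal plain-Python sketch of that filtering step, over hypothetical row tuples shaped like the Results table below:

```python
def filter_relations(rows, threshold=0.5, keep_labels=None):
    """Keep relation rows at or above `threshold`, optionally restricted
    to a set of relation labels.

    rows: iterable of (relation, entity1, chunk1, entity2, chunk2, confidence).
    """
    kept = []
    for rel, e1, c1, e2, c2, conf in rows:
        if conf < threshold:
            continue  # below the prediction threshold
        if keep_labels is not None and rel not in keep_labels:
            continue  # label not whitelisted (e.g. drop "O")
        kept.append((rel, e1, c1, e2, c2, conf))
    return kept
```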
## Results ```bash +--------+---------+-------------+-----------+--------------------+---------+-------------+-----------+--------------------+----------+ |relation| entity1|entity1_begin|entity1_end| chunk1| entity2|entity2_begin|entity2_end| chunk2|confidence| +--------+---------+-------------+-----------+--------------------+---------+-------------+-----------+--------------------+----------+ | TrAP|TREATMENT| 512| 522| amoxicillin| PROBLEM| 528| 556|a respiratory tra...|0.99796957| | TrAP|TREATMENT| 571| 579| metformin| PROBLEM| 617| 620| T2DM|0.99757993| | TrAP|TREATMENT| 599| 611| dapagliflozin| PROBLEM| 659| 661| HTG| 0.996036| | TrAP| PROBLEM| 617| 620| T2DM|TREATMENT| 626| 637| atorvastatin| 0.9693424| | TrAP| PROBLEM| 617| 620| T2DM|TREATMENT| 643| 653| gemfibrozil|0.99460286| | TeRP| TEST| 739| 758|Physical examination| PROBLEM| 796| 810| dry oral mucosa|0.99775106| | TeRP| TEST| 830| 854|her abdominal exa...| PROBLEM| 875| 884| tenderness|0.99272937| | TeRP| TEST| 830| 854|her abdominal exa...| PROBLEM| 888| 895| guarding| 0.9840321| | TeRP| TEST| 830| 854|her abdominal exa...| PROBLEM| 902| 909| rigidity| 0.9883966| | TeRP| TEST| 1246| 1258| blood samples| PROBLEM| 1265| 1274| hemolyzing| 0.9534202| | TeRP| TEST| 1507| 1517| her glucose| PROBLEM| 1553| 1566| still elevated| 0.9464761| | TeRP| PROBLEM| 1553| 1566| still elevated| TEST| 1576| 1592| serum bicarbonate| 0.9428323| | TeRP| PROBLEM| 1553| 1566| still elevated| TEST| 1656| 1661| lipase| 0.9558198| | TeRP| PROBLEM| 1553| 1566| still elevated| TEST| 1670| 1672| U/L| 0.9214444| | TeRP| TEST| 1676| 1702|The β-hydroxybuty...| PROBLEM| 1733| 1740| elevated| 0.9863963| | TrAP|TREATMENT| 1937| 1951| an insulin drip| PROBLEM| 1957| 1961| euDKA| 0.9852455| | O| PROBLEM| 1957| 1961| euDKA| TEST| 1991| 2003| the anion gap|0.94141793| | O| PROBLEM| 1957| 1961| euDKA| TEST| 2015| 2027| triglycerides| 0.9622529| 
+--------+---------+-------------+-----------+--------------------+---------+-------------+-----------+--------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_clinical_biobert| |Compatibility:|Healthcare NLP 3.0.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Case sensitive:|true| ## Data Source Trained on the 2010 i2b2 relation challenge. ## Benchmarking ```bash Relation Recall Precision F1 Support PIP 0.859 0.878 0.869 1435 TeCP 0.629 0.782 0.697 337 TeRP 0.903 0.929 0.916 2034 TrAP 0.872 0.866 0.869 1693 TrCP 0.641 0.677 0.659 340 TrIP 0.517 0.796 0.627 151 TrNAP 0.402 0.672 0.503 112 TrWP 0.257 0.824 0.392 109 Avg. 0.635 0.803 0.691 - ``` --- layout: model title: Legal NER for NDA (Definition of Confidential Information Clauses) author: John Snow Labs name: legner_nda_def_conf_info date: 2023-04-10 tags: [en, licensed, legal, ner, nda, definition] task: Named Entity Recognition language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a NER model, intended to be run **only** after detecting the `DEF_OF_CONF_INFO` clause with a proper classifier (use the `legmulticlf_mnda_sections_paragraph_other` model for that purpose). It will extract the following entities: `CONF_INFO_FORM` and `CONF_INFO_TYPE`. ## Predicted Entities `CONF_INFO_FORM`, `CONF_INFO_TYPE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_nda_def_conf_info_en_1.0.0_3.0_1681152951608.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_nda_def_conf_info_en_1.0.0_3.0_1681152951608.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_nda_def_conf_info", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = [""""Confidential Information" shall mean all written or oral information of a proprietary, intellectual, or similar nature relating to GT Solar's business, projects, operations, activities, or affairs whether of a technical or financial nature or otherwise (including, without limitation, reports, financial information, business plans and proposals, ideas, concepts, trade secrets, know-how, processes, and other technical or business information, whether concerning GT Solar' businesses or otherwise) which has not been publicly disclosed and which the Recipient acquires directly or indirectly from GT Solar, its officers, employees, affiliates, agents or representatives."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
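The `NerConverter` stage in the pipeline above merges the token-level IOB tags emitted by the NER model into entity chunks. As a rough illustration of what that merging does, here is a simplified pure-Python sketch (not the annotator's actual implementation, which also tracks character offsets and metadata):

```python
def iob_to_chunks(tokens, tags):
    """Merge token-level IOB tags (B-X / I-X / O) into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and label != tag[2:]):
            if current:                      # close any open chunk first
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):           # continue the open chunk
            current.append(token)
        else:                                # "O" ends any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(iob_to_chunks(
    ["written", "or", "oral", "information"],
    ["B-CONF_INFO_FORM", "O", "B-CONF_INFO_FORM", "B-CONF_INFO_TYPE"]))
# -> [('written', 'CONF_INFO_FORM'), ('oral', 'CONF_INFO_FORM'), ('information', 'CONF_INFO_TYPE')]
```

In the actual pipeline, logic of this kind is what turns the token-level `ner` column into the `ner_chunk` column.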
## Results ```bash +-------------+--------------+ |chunk |ner_label | +-------------+--------------+ |written |CONF_INFO_FORM| |oral |CONF_INFO_FORM| |reports |CONF_INFO_TYPE| |information |CONF_INFO_TYPE| |plans |CONF_INFO_TYPE| |proposals |CONF_INFO_TYPE| |ideas |CONF_INFO_TYPE| |concepts |CONF_INFO_TYPE| |trade secrets|CONF_INFO_TYPE| |know-how |CONF_INFO_TYPE| |processes |CONF_INFO_TYPE| |information |CONF_INFO_TYPE| +-------------+--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_nda_def_conf_info| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References In-house annotations on the Non-disclosure Agreements ## Benchmarking ```bash label precision recall f1-score support CONF_INFO_FORM 1.00 0.95 0.97 20 CONF_INFO_TYPE 0.87 0.93 0.90 163 micro-avg 0.88 0.93 0.90 183 macro-avg 0.93 0.94 0.94 183 weighted-avg 0.88 0.93 0.90 183 ``` --- layout: model title: Translate Basque (family) to English Pipeline author: John Snow Labs name: translate_euq_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, euq, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `euq` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_euq_en_xx_2.7.0_2.4_1609687608095.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_euq_en_xx_2.7.0_2.4_1609687608095.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_euq_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_euq_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.euq.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_euq_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Extract Demographic Entities from Oncology Texts author: John Snow Labs name: ner_oncology_demographics date: 2022-11-24 tags: [licensed, clinical, en, ner, oncology, demographics] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts demographic information from oncology texts, including age, gender, and smoking status. Definitions of Predicted Entities: - `Age`: All mentions of age, past or present, related to the patient or to anybody else. - `Gender`: Gender-specific nouns and pronouns (including words such as "him" or "she", and family members such as "father"). - `Race_Ethnicity`: The race and ethnicity categories include racial and national origin or sociocultural groups. - `Smoking_Status`: All mentions of smoking related to the patient or to someone else. 
## Predicted Entities `Age`, `Gender`, `Race_Ethnicity`, `Smoking_Status` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_demographics_en_4.2.2_3.0_1669300163954.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_demographics_en_4.2.2_3.0_1669300163954.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_demographics", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The patient is a 40-year-old man with history of heavy smoking."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_demographics", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, 
sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The patient is a 40-year-old man with history of heavy smoking.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_demographics").predict("""The patient is a 40-year-old man with history of heavy smoking.""") ```
## Results ```bash | chunk | ner_label | |:------------|:---------------| | 40-year-old | Age | | man | Gender | | smoking | Smoking_Status | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_demographics| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.6 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Smoking_Status 60 19 8 68 0.76 0.88 0.82 Age 934 33 15 949 0.97 0.98 0.97 Race_Ethnicity 57 5 5 62 0.92 0.92 0.92 Gender 1248 18 6 1254 0.99 1.00 0.99 macro_avg 2299 75 34 2333 0.91 0.95 0.93 micro_avg 2299 75 34 2333 0.97 0.99 0.98 ``` --- layout: model title: Legal Non Disparagement Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_non_disparagement_bert date: 2023-03-05 tags: [en, legal, classification, clauses, non_disparagement, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from the US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Non_Disparagement` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. 
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Non_Disparagement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_non_disparagement_bert_en_1.0.0_3.0_1678049568901.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_non_disparagement_bert_en_1.0.0_3.0_1678049568901.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_non_disparagement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Non_Disparagement]| |[Other]| |[Other]| |[Non_Disparagement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_non_disparagement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Non_Disparagement 0.98 0.98 0.98 41 Other 0.98 0.98 0.98 59 accuracy - - 0.98 100 macro-avg 0.98 0.98 0.98 100 weighted-avg 0.98 0.98 0.98 100 ``` --- layout: model title: Financial English BERT Embeddings (Base) author: John Snow Labs name: bert_embeddings_sec_bert_base date: 2022-04-12 tags: [bert, embeddings, en, open_source, financial] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Financial Pretrained BERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `sec-bert-base` is an English model originally trained by `nlpaueb`. This is the reference base model, which means it uses the same architecture as BERT-BASE trained on financial documents. If you are interested in Financial Embeddings, take a look also at these two models: - [sec-num](https://nlp.johnsnowlabs.com/2022/04/12/bert_embeddings_sec_bert_num_en_3_0.html): Same as this base model, but every number token is replaced with a [NUM] pseudo-token (handling all numeric expressions in a uniform manner and disallowing their fragmentation). 
- [sec-shape](https://nlp.johnsnowlabs.com/2022/04/12/bert_embeddings_sec_bert_sh_en_3_0.html): Same as this base model but we replace numbers with pseudo-tokens that represent the number’s shape, so numeric expressions (of known shapes) are no longer fragmented, e.g., '53.2' becomes '[XX.X]' and '40,200.5' becomes '[XX,XXX.X]'. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_sec_bert_base_en_3.4.2_3.0_1649759502537.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_sec_bert_base_en_3.4.2_3.0_1649759502537.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.sec_bert_base").predict("""I love Spark NLP""") ```
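The `[NUM]` and number-shape pseudo-token schemes used by the sibling `sec-num` and `sec-shape` models described above can be sketched in a few lines. This is a simplified illustration of the idea, not the exact preprocessing used to train those models:

```python
import re

# A numeric expression: digits with optional thousands separators and a decimal part.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")

def to_num_token(text):
    """sec-num style: every numeric expression becomes a single [NUM] pseudo-token."""
    return NUMBER.sub("[NUM]", text)

def to_shape_token(text):
    """sec-shape style: digits are masked to X, keeping separators, e.g. 53.2 -> [XX.X]."""
    return NUMBER.sub(lambda m: "[" + re.sub(r"\d", "X", m.group()) + "]", text)

print(to_num_token("Revenue grew to 40,200.5 million."))
# -> Revenue grew to [NUM] million.
print(to_shape_token("Revenue grew to 40,200.5 million."))
# -> Revenue grew to [XX,XXX.X] million.
```

Both schemes keep numeric expressions as single tokens, so the wordpiece tokenizer no longer fragments them.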
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_sec_bert_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|409.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/nlpaueb/sec-bert-base - https://arxiv.org/abs/2203.06482 - http://nlp.cs.aueb.gr/ --- layout: model title: English BertForQuestionAnswering Cased model (from ericw0530) author: John Snow Labs name: bert_qa_ericw0530_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `ericw0530`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_ericw0530_finetuned_squad_en_4.0.0_3.0_1657186496979.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_ericw0530_finetuned_squad_en_4.0.0_3.0_1657186496979.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_ericw0530_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_ericw0530_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
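For intuition: extractive QA models like this one score every context token as a potential answer start and answer end, and the predicted answer is the highest-scoring valid span. A minimal sketch of that selection step with made-up scores (the annotator performs this internally on the model's logits):

```python
def best_span(start_scores, end_scores, max_len=30):
    """Pick (start, end) maximizing start_scores[s] + end_scores[e], with s <= e < s + max_len."""
    best, best_score = (0, 0), float("-inf")
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            if ss + end_scores[e] > best_score:
                best_score, best = ss + end_scores[e], (s, e)
    return best

# Hypothetical per-token scores for the example context used in the snippets above.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.0, 0.1, 0.0, 1.0, 0.0]
end   = [0.0, 0.1, 0.0, 4.5, 0.2, 0.0, 0.0, 0.1, 1.5, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))
# -> Clara
```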
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_ericw0530_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ericw0530/bert-finetuned-squad --- layout: model title: English RobertaForQuestionAnswering Tiny Cased model (from deepset) author: John Snow Labs name: roberta_qa_tiny_squad2_step1 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tinyroberta-squad2-step1` is an English model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_tiny_squad2_step1_en_4.3.0_3.0_1674224441422.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_tiny_squad2_step1_en_4.3.0_3.0_1674224441422.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_tiny_squad2_step1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_tiny_squad2_step1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_tiny_squad2_step1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|307.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/deepset/tinyroberta-squad2-step1 --- layout: model title: Legal Limited Partnership Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_limited_partnership_agreement_bert date: 2022-11-24 tags: [en, legal, classification, agreement, limited_partnership, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_limited_partnership_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `limited-partnership-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `limited-partnership-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_limited_partnership_agreement_bert_en_1.0.0_3.0_1669315953601.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_limited_partnership_agreement_bert_en_1.0.0_3.0_1669315953601.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_limited_partnership_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[limited-partnership-agreement]| |[other]| |[other]| |[limited-partnership-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_limited_partnership_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support limited-partnership-agreement 1.00 1.00 1.0 22 other 1.00 1.00 1.0 41 accuracy - - 1.0 63 macro-avg 1.00 1.00 1.0 63 weighted-avg 1.00 1.00 1.0 63 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from avioo1) author: John Snow Labs name: distilbert_qa_avioo1_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `avioo1`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_avioo1_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770116079.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_avioo1_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770116079.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_avioo1_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_avioo1_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_avioo1_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/avioo1/distilbert-base-uncased-finetuned-squad --- layout: model title: Recognize Entities DL pipeline for French - Large author: John Snow Labs name: entity_recognizer_lg date: 2021-03-23 tags: [open_source, french, entity_recognizer_lg, pipeline, fr] supported: true task: [Named Entity Recognition, Lemmatization] language: fr edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_lg is a pretrained pipeline that can be used to process text, performing basic processing steps and recognizing entities. It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_fr_3.0.0_3.0_1616461515226.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_fr_3.0.0_3.0_1616461515226.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('entity_recognizer_lg', lang = 'fr') annotations = pipeline.fullAnnotate("Bonjour de John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("entity_recognizer_lg", lang = "fr") val result = pipeline.fullAnnotate("Bonjour de John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Bonjour de John Snow Labs! "] result_df = nlu.load('fr.ner').predict(text) result_df ```
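The `entities` column in the results below comes from the NER converter, which groups consecutive IOB-tagged tokens into entity chunks. A minimal plain-Python sketch of that grouping logic (a simplification, not the actual NerConverter implementation):

```python
def iob_to_chunks(tokens, tags):
    """Group consecutive B-/I- tagged tokens into entity chunks."""
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "O":
            if current:
                chunks.append(" ".join(current))
                current = []
        elif tag.startswith("B-"):
            # B- always opens a fresh chunk
            if current:
                chunks.append(" ".join(current))
            current = [token]
        else:
            # I- continues (or, in IOB1 style, starts) a chunk
            current.append(token)
    if current:
        chunks.append(" ".join(current))
    return chunks

tokens = ["Bonjour", "de", "John", "Snow", "Labs!"]
tags = ["O", "O", "I-PER", "I-PER", "I-PER"]
chunks = iob_to_chunks(tokens, tags)
```

Applied to the tags shown in the results, this yields the single chunk "John Snow Labs!".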
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:--------------------------------|:-------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Bonjour de John Snow Labs! '] | ['Bonjour de John Snow Labs!'] | ['Bonjour', 'de', 'John', 'Snow', 'Labs!'] | [[-0.010997000150382,.,...]] | ['O', 'O', 'I-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_lg| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|fr| --- layout: model title: Legal Captions Clause Binary Classifier author: John Snow Labs name: legclf_captions_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `captions` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `captions` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_captions_clause_en_1.0.0_3.2_1660123284259.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_captions_clause_en_1.0.0_3.2_1660123284259.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_captions_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
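When documents exceed the 512-token limit mentioned above, one of the splitting strategies from the tutorial — paragraph splitting by multiline — can be sketched in plain Python before building the Spark dataframe (a simplified stand-in for the workshop code, not the exact implementation):

```python
import re

def split_paragraphs(text):
    """Split a document into paragraphs on blank lines (multiline split)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("SECTION 1. Captions.\nHeadings are for convenience only.\n"
       "\n"
       "SECTION 2. Notices.\nAll notices shall be in writing.")
paragraphs = split_paragraphs(doc)
```

Each resulting paragraph can then become one row of the `clause_text` column fed to the classifier above.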
## Results ```bash +----------+ |    result| +----------+ |[captions]| |   [other]| |   [other]| |[captions]| +----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_captions_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support captions 0.96 1.00 0.98 50 other 1.00 0.98 0.99 105 accuracy - - 0.99 155 macro-avg 0.98 0.99 0.99 155 weighted-avg 0.99 0.99 0.99 155 ``` --- layout: model title: ICD10PCS Entity Resolver author: John Snow Labs name: chunkresolve_icd10pcs_clinical class: ChunkEntityResolverModel language: en nav_key: models repository: clinical/models date: 2020-04-21 task: Entity Resolution edition: Healthcare NLP 2.4.2 spark_version: 2.4 tags: [clinical,licensed,entity_resolution,en] deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Entity Resolution model based on KNN using Word Embeddings + Word Movers Distance. ## Predicted Entities ICD10-PCS Codes and their normalized definition with `clinical_embeddings`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10pcs_clinical_en_2.4.5_2.4_1587491320087.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10pcs_clinical_en_2.4.5_2.4_1587491320087.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python ... model = ChunkEntityResolverModel.pretrained("chunkresolve_icd10pcs_clinical","en","clinical/models") .setInputCols("token","chunk_embeddings") .setOutputCol("entity") pipeline_icd10pcs = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, ner, chunk_embeddings, model]) data = ["""He has a starvation ketosis but nothing found for significant for dry oral mucosa"""] pipeline_model = pipeline_icd10pcs.fit(spark.createDataFrame([[""]]).toDF("text")) light_pipeline = LightPipeline(pipeline_model) result = light_pipeline.annotate(data) ``` ```scala ... val model = ChunkEntityResolverModel.pretrained("chunkresolve_icd10pcs_clinical","en","clinical/models") .setInputCols("token","chunk_embeddings") .setOutputCol("entity") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, ner, chunk_embeddings, model)) val data = Seq("He has a starvation ketosis but nothing found for significant for dry oral mucosa").toDF("text") val result = pipeline.fit(data).transform(data) ```
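Spark NLP chunk annotations carry inclusive `begin`/`end` character offsets, so a chunk can always be recovered from the original text with `text[begin:end + 1]`. A quick plain-Python sanity check against the offsets shown in the results below:

```python
text = "He has a starvation ketosis but nothing found for significant for dry oral mucosa"

def chunk_at(text, begin, end):
    # Spark NLP offsets are inclusive on both ends, hence end + 1.
    return text[begin:end + 1]

first = chunk_at(text, 7, 26)    # offsets of the first chunk in the results
second = chunk_at(text, 66, 80)  # offsets of the second chunk
```

This check is handy when post-processing resolver output outside of Spark.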
{:.h2_title} ## Results ```bash | | chunks | begin | end | code | resolutions | |---|----------------------|-------|-----|---------|--------------------------------------------------| | 0 | a starvation ketosis | 7 | 26 | 6A3Z1ZZ | Hyperthermia, Multiple:::Narcosynthesis:::Hype...| | 1 | dry oral mucosa | 66 | 80 | 8E0ZXY4 | Yoga Therapy:::Release Cecum, Open Approach:::...| ``` {:.model-param} ## Model Information {:.table-model} |----------------|--------------------------------| | Name: | chunkresolve_icd10pcs_clinical | | Type: | ChunkEntityResolverModel | | Compatibility: | Spark NLP 2.4.2+ | | License: | Licensed | |Edition:|Official| | |Input labels: | token, chunk_embeddings | |Output labels: | entity | | Language: | en | | Case sensitive: | True | | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained on ICD10 Procedure Coding System dataset https://www.icd10data.com/ICD10PCS/Codes --- layout: model title: Finnish asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot TFWav2Vec2ForCTC from aapot author: John Snow Labs name: pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot` is a Finnish model originally trained by aapot. 
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot_fi_4.2.0_3.0_1664022498179.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot_fi_4.2.0_3.0_1664022498179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot', lang = 'fi') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot", lang = "fi") val annotations = pipeline.transform(audioDF) ```
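Wav2Vec2 models like the one inside this pipeline emit one character prediction per audio frame; CTC decoding then merges repeated frames and removes blank symbols to produce the transcript (the "lm" variant additionally rescores with a language model). A toy greedy CTC collapse in plain Python, illustrating the idea rather than Spark NLP's internal decoder:

```python
def ctc_collapse(frames, blank="-"):
    """Greedy CTC decoding: merge consecutive duplicates, then drop blanks."""
    out = []
    prev = None
    for ch in frames:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

# Per-frame argmax characters for a short utterance ('-' is the CTC blank).
decoded = ctc_collapse(list("hhee-l-ll-oo"))
```

Note how the blank between the two `l` runs is what preserves the double letter in the output.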
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_by_aapot| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| |Size:|3.6 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English image_classifier_vit_roomidentifier ViTForImageClassification from lazyturtl author: John Snow Labs name: image_classifier_vit_roomidentifier date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_roomidentifier` is an English model originally trained by lazyturtl. ## Predicted Entities `Kitchen`, `Bedroom`, `Bathroom`, `DinningRoom`, `LivingRoom` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_roomidentifier_en_4.1.0_3.0_1660168339182.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_roomidentifier_en_4.1.0_3.0_1660168339182.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_roomidentifier", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_roomidentifier", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
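ViT models classify an image by cutting it into fixed-size patches that become a token sequence, with a class token prepended. The sequence-length arithmetic can be sketched in plain Python — assuming the common ViT-base configuration of 224×224 inputs with 16×16 patches (the exact values for this particular checkpoint are an assumption, not stated in the card):

```python
def vit_sequence_length(image_size=224, patch_size=16):
    """Number of patch tokens plus the prepended [CLS] token (assumed config)."""
    patches_per_side = image_size // patch_size   # 224 // 16 = 14
    n_patches = patches_per_side ** 2             # 14 * 14 = 196
    return n_patches + 1                          # +1 for the [CLS] token

seq_len = vit_sequence_length()
```

This is why ViT inference cost grows quadratically as the input resolution increases: doubling the side length quadruples the patch count.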
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_roomidentifier| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_kv256 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-kv256` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_kv256_en_4.3.0_3.0_1675121414682.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_kv256_en_4.3.0_3.0_1675121414682.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_kv256","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_kv256","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_kv256| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|256.0 MB| ## References - https://huggingface.co/google/t5-efficient-small-kv256 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Fast Neural Machine Translation Model from Ga to English author: John Snow Labs name: opus_mt_gaa_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, gaa, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `gaa` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_gaa_en_xx_2.7.0_2.4_1609163836574.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_gaa_en_xx_2.7.0_2.4_1609163836574.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_gaa_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_gaa_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.gaa.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
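MarianTransformer translates sentence by sentence, which is why a sentence detector precedes it in the pipeline above. A naive rule-based splitter illustrating the idea in plain Python (SentenceDetectorDL is a trained model, not this regex — this is only a sketch of the role it plays):

```python
import re

def naive_sentences(text):
    """Split on sentence-final punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

sentences = naive_sentences("Hello there. How are you? Fine!")
```

Each resulting sentence would then be translated independently and the outputs concatenated.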
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_gaa_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Chinese Word Segmentation author: John Snow Labs name: wordseg_pku date: 2021-01-03 task: Word Segmentation language: zh edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, word_segmentation, cn, zh] supported: true annotator: WordSegmenterModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WordSegmenterModel (WSM) is based on a maximum entropy probability model to detect word boundaries in Chinese text. Chinese text is written without white space between the words, and a computer-based application cannot know _a priori_ which sequence of ideograms forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step. References: - Xue, Nianwen. "Chinese word segmentation as character tagging." International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing. 2003. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_pku_zh_2.7.0_2.4_1609694210774.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_pku_zh_2.7.0_2.4_1609694210774.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline as a substitute for the Tokenizer stage.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") word_segmenter = WordSegmenterModel.pretrained('wordseg_pku', 'zh')\ .setInputCols("document")\ .setOutputCol("token") pipeline = Pipeline(stages=[document_assembler, word_segmenter]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) example = spark.createDataFrame([['然而,这样的处理也衍生了一些问题。']], ["text"]) result = model.transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val word_segmenter = WordSegmenterModel.pretrained("wordseg_pku", "zh") .setInputCols("document") .setOutputCol("token") val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter)) val data = Seq("然而,这样的处理也衍生了一些问题。").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""然而,这样的处理也衍生了一些问题。"""] ner_df = nlu.load('zh.segment_words.pku').predict(text, output_level='token') ner_df ```
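Since Chinese is written without spaces, the segmenter only inserts boundaries — it never alters characters. The tokens should therefore concatenate back to the original string exactly, a useful sanity check on the results below (plain Python):

```python
# Tokens as produced by wordseg_pku for the example sentence.
tokens = ["然而", ",", "这样", "的", "处理", "也", "衍生", "了", "一些", "问题", "。"]
original = "然而,这样的处理也衍生了一些问题。"

reconstructed = "".join(tokens)
round_trip_ok = reconstructed == original
```

If this round trip ever fails in a real pipeline, it usually points to text normalization happening before segmentation.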
## Results ```bash +----------------------------------+--------------------------------------------------------+ |text |result | +----------------------------------+--------------------------------------------------------+ |然而,这样的处理也衍生了一些问题。|[然而, ,, 这样, 的, 处理, 也, 衍生, 了, 一些, 问题, 。]| +----------------------------------+--------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wordseg_pku| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[token]| |Language:|zh| ## Data Source The model is trained on the Peking University (PKU) data set available on the Second International Chinese Word Segmentation Bakeoff [SIGHAN 2005](http://sighan.cs.uchicago.edu/bakeoff2005/). ## Benchmarking ```bash | Model | precision | recall | f1-score | |---------------|--------------|--------------|--------------| | WORDSEG_CTB | 0.6453 | 0.6341 | 0.6397 | | WORDSEG_WEIBO | 0.5454 | 0.5655 | 0.5553 | | WORDSEG_MSR | 0.5984 | 0.6088 | 0.6035 | | WORDSEG_PKU | 0.6094 | 0.6321 | 0.6206 | | WORDSEG_LARGE | 0.6326 | 0.6269 | 0.6297 | ``` --- layout: model title: Entity Recognizer LG author: John Snow Labs name: entity_recognizer_lg date: 2022-06-25 tags: ["no", open_source] task: Named Entity Recognition language: "no" edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_lg is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps and recognizes entities.
It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_no_4.0.0_3.0_1656124813478.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_no_4.0.0_3.0_1656124813478.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("entity_recognizer_lg", "no") result = pipeline.annotate("""I love johnsnowlabs! """) ``` {:.nlu-block} ```python import nlu nlu.load("no.ner.lg").predict("""I love johnsnowlabs! """) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_lg| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|no| |Size:|2.5 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - NerDLModel - NerConverter --- layout: model title: Word2Vec Embeddings in Somali (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, so, open_source] task: Embeddings language: so edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_so_3.4.1_3.0_1647458819071.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_so_3.4.1_3.0_1647458819071.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","so") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Waan jeclahay Spark Nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","so") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Waan jeclahay Spark Nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("so.embed.w2v_cc_300d").predict("""Waan jeclahay Spark Nlp""") ```
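Downstream of the lookup, word vectors are most often compared with cosine similarity: parallel vectors score 1, orthogonal vectors score 0. A dependency-free sketch with toy 3-d vectors (the real embeddings are 300-dimensional):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product over the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

sim_parallel = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])    # same direction -> 1.0
sim_orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # unrelated -> 0.0
```

The same function applies unchanged to the 300-d vectors extracted from the `embeddings` annotation column.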
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|so| |Size:|98.5 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Sentiment Analysis of German texts author: John Snow Labs name: classifierdl_bert_sentiment date: 2021-09-09 tags: [de, sentiment, classification, open_source] task: Sentiment Analysis language: de edition: Spark NLP 3.2.0 spark_version: 2.4 supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model identifies the sentiments (positive or negative) in German texts. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_DE/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_De_SENTIMENT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_bert_sentiment_de_3.2.0_2.4_1631184887201.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_bert_sentiment_de_3.2.0_2.4_1631184887201.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = BertSentenceEmbeddings\ .pretrained('labse', 'xx') \ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") sentimentClassifier = ClassifierDLModel.pretrained("classifierdl_bert_sentiment", "de") \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") de_sentiment_pipeline = Pipeline(stages=[document, embeddings, sentimentClassifier]) light_pipeline = LightPipeline(de_sentiment_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) result1 = light_pipeline.annotate("Spiel und Meisterschaft nicht spannend genug? Muss man jetzt den Videoschiedsrichter kontrollieren? Ich bin entsetzt...dachte der darf nur bei krassen Fehlentscheidungen ran. So macht der Fussball keinen Spass mehr.") result2 = light_pipeline.annotate("Habe gestern am Mittwoch den #werder Podcast vermisst. Wie schnell man sich an etwas gewöhnt und darauf freut. Danke an @Plainsman74 für die guten Interviews und den Einblick hinter die Kulissen von @werderbremen. Angenehme Winterpause weiterhin!") print(result1["class"], result2["class"], sep = "\n") ``` ```scala val document = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val embeddings = BertSentenceEmbeddings .pretrained("labse", "xx") .setInputCols(Array("document")) .setOutputCol("sentence_embeddings") val sentimentClassifier = ClassifierDLModel.pretrained("classifierdl_bert_sentiment", "de") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val de_sentiment_pipeline = new Pipeline().setStages(Array(document, embeddings, sentimentClassifier)) val light_pipeline = new LightPipeline(de_sentiment_pipeline.fit(Seq("").toDF("text"))) val result1 = light_pipeline.annotate("Spiel und Meisterschaft nicht spannend genug? Muss man jetzt den Videoschiedsrichter kontrollieren? 
Ich bin entsetzt...dachte der darf nur bei krassen Fehlentscheidungen ran. So macht der Fussball keinen Spass mehr.") val result2 = light_pipeline.annotate("Habe gestern am Mittwoch den #werder Podcast vermisst. Wie schnell man sich an etwas gewöhnt und darauf freut. Danke an @Plainsman74 für die guten Interviews und den Einblick hinter die Kulissen von @werderbremen. Angenehme Winterpause weiterhin!") ``` {:.nlu-block} ```python import nlu nlu.load("de.classify.sentiment.bert").predict("""Habe gestern am Mittwoch den #werder Podcast vermisst. Wie schnell man sich an etwas gewöhnt und darauf freut. Danke an @Plainsman74 für die guten Interviews und den Einblick hinter die Kulissen von @werderbremen. Angenehme Winterpause weiterhin!""") ```
## Results ```bash ['NEGATIVE'] ['POSITIVE'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_bert_sentiment| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|de| ## Data Source https://github.com/charlesmalafosse/open-dataset-for-sentiment-analysis/ ## Benchmarking ```bash label precision recall f1-score support NEGATIVE 0.83 0.85 0.84 978 POSITIVE 0.94 0.93 0.94 2582 accuracy - - 0.91 3560 macro-avg 0.89 0.89 0.89 3560 weighted-avg 0.91 0.91 0.91 3560 ``` --- layout: model title: Moldavian, Moldovan, Romanian asr_wav2vec2_large_xlsr_53_romanian_by_anton_l TFWav2Vec2ForCTC from anton-l author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_romanian_by_anton_l date: 2022-09-25 tags: [wav2vec2, ro, audio, open_source, asr] task: Automatic Speech Recognition language: ro edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_romanian_by_anton_l` is a Moldavian, Moldovan, Romanian model originally trained by anton-l. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_large_xlsr_53_romanian_by_anton_l_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_romanian_by_anton_l_ro_4.2.0_3.0_1664098684581.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_romanian_by_anton_l_ro_4.2.0_3.0_1664098684581.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_romanian_by_anton_l", "ro")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_romanian_by_anton_l", "ro") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_romanian_by_anton_l| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ro| |Size:|1.2 GB| --- layout: model title: Pipeline to Mapping MESH Codes with Their Corresponding UMLS Codes author: John Snow Labs name: mesh_umls_mapping date: 2023-06-13 tags: [en, licensed, clinical, resolver, pipeline, chunk_mapping, mesh, umls] task: Chunk Mapping language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of `mesh_umls_mapper` model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/mesh_umls_mapping_en_4.4.4_3.2_1686663527159.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/mesh_umls_mapping_en_4.4.4_3.2_1686663527159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("mesh_umls_mapping", "en", "clinical/models")

result = pipeline.fullAnnotate("C028491 D019326 C579867")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = new PretrainedPipeline("mesh_umls_mapping", "en", "clinical/models")

val result = pipeline.fullAnnotate("C028491 D019326 C579867")
```

{:.nlu-block}
```python
import nlu
nlu.load("en.mesh.umls.mapping").predict("""Put your text here.""")
```
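`fullAnnotate` returns one result per input text. A minimal sketch of collapsing the parallel chunk lists into a MESH-to-UMLS lookup, using the codes from this card's Results section (the column names and the list-of-strings shape are simplifying assumptions — the real result holds Annotation objects):

```python
# Hypothetical shape of one fullAnnotate result: parallel lists of chunk texts
# under assumed output columns "mesh_code" and "umls_code".
annotated = {
    "mesh_code": ["C028491", "D019326", "C579867"],
    "umls_code": ["C0043904", "C0045010", "C3696376"],
}

def to_mapping(result):
    """Zip the parallel annotation lists into a {mesh_code: umls_code} dict."""
    return dict(zip(result["mesh_code"], result["umls_code"]))

mapping = to_mapping(annotated)
```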
## Results

```bash
Results

|    | mesh_code                    | umls_code                       |
|---:|:-----------------------------|:--------------------------------|
|  0 | C028491, D019326, C579867    | C0043904, C0045010, C3696376    |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|mesh_umls_mapping|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|3.9 MB|

## Included Models

- DocumentAssembler
- TokenizerModel
- ChunkMapperModel

---
layout: model
title: English Bert Embeddings (from monsoon-nlp)
author: John Snow Labs
name: bert_embeddings_muril_adapted_local
date: 2022-04-11
tags: [bert, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `muril-adapted-local` is an English model originally trained by `monsoon-nlp`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_en_3.4.2_3.0_1649672705449.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_en_3.4.2_3.0_1649672705449.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.muril_adapted_local").predict("""I love Spark NLP""") ```
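The vectors in the `embeddings` column are typically compared with cosine similarity (e.g. to find related tokens or sentences). A small self-contained sketch of that computation — the 3-dimensional vectors are toy values for illustration, not real MuRIL outputs, which are much higher-dimensional:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy vectors standing in for BERT token embeddings.
v_spark = [0.2, 0.1, 0.9]
v_nlp = [0.25, 0.05, 0.85]
v_love = [0.9, 0.3, 0.1]

sim_related = cosine(v_spark, v_nlp)      # vectors pointing the same way
sim_unrelated = cosine(v_spark, v_love)   # vectors pointing apart
```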
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_muril_adapted_local|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|888.7 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/monsoon-nlp/muril-adapted-local
- https://tfhub.dev/google/MuRIL/1

---
layout: model
title: Arabic Bert Embeddings (from MutazYoune)
author: John Snow Labs
name: bert_embeddings_Ara_DialectBERT
date: 2022-04-11
tags: [bert, embeddings, ar, open_source]
task: Embeddings
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `Ara_DialectBERT` is an Arabic model originally trained by `MutazYoune`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_Ara_DialectBERT_ar_3.4.2_3.0_1649678666850.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_Ara_DialectBERT_ar_3.4.2_3.0_1649678666850.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_Ara_DialectBERT","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_Ara_DialectBERT","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.Ara_DialectBERT").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_Ara_DialectBERT| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|409.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/MutazYoune/Ara_DialectBERT - https://github.com/elnagara/HARD-Arabic-Dataset --- layout: model title: Fast Neural Machine Translation Model from Kinyarwanda to English author: John Snow Labs name: opus_mt_rw_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, rw, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `rw` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_rw_en_xx_2.7.0_2.4_1609164993136.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_rw_en_xx_2.7.0_2.4_1609164993136.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_rw_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["Put your Kinyarwanda text here."]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_rw_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Put your Kinyarwanda text here.").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.rw.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_rw_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Oncology Pipeline for Therapies author: John Snow Labs name: oncology_therapy_pipeline date: 2022-12-01 tags: [licensed, pipeline, oncology, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline includes Named-Entity Recognition and Assertion Status models to extract information from oncology texts. This pipeline focuses on entities related to therapies. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/oncology_therapy_pipeline_en_4.2.2_3.0_1669906146446.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/oncology_therapy_pipeline_en_4.2.2_3.0_1669906146446.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("oncology_therapy_pipeline", "en", "clinical/models") pipeline.fullAnnotate("The patient underwent a mastectomy two years ago. She is currently receiving her second cycle of adriamycin and cyclophosphamide, and is in good overall condition.")[0] ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("oncology_therapy_pipeline", "en", "clinical/models") val result = pipeline.fullAnnotate("""The patient underwent a mastectomy two years ago. She is currently receiving her second cycle of adriamycin and cyclophosphamide, and is in good overall condition.""")(0) ``` {:.nlu-block} ```python import nlu nlu.load("en.oncology_therpay.pipeline").predict("""The patient underwent a mastectomy two years ago. She is currently receiving her second cycle of adriamycin and cyclophosphamide, and is in good overall condition.""") ```
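The Results section below tabulates each model's output as (chunk, ner_label) pairs. A minimal sketch of that grouping over a simplified annotation list — the tuple shape is an assumption for illustration; the real pipeline output uses Annotation objects with metadata:

```python
# Simplified stand-ins for NER annotations: (chunk_text, label) pairs
# taken from the example sentence in this card.
ner_chunks = [
    ("mastectomy", "Cancer_Surgery"),
    ("second cycle", "Cycle_Number"),
    ("adriamycin", "Chemotherapy"),
    ("cyclophosphamide", "Chemotherapy"),
]

def group_by_label(chunks):
    """Group chunk texts under their NER label."""
    grouped = {}
    for text, label in chunks:
        grouped.setdefault(label, []).append(text)
    return grouped

by_label = group_by_label(ner_chunks)
```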
## Results ```bash ******************** ner_oncology_wip results ******************** | chunk | ner_label | |:-----------------|:---------------| | mastectomy | Cancer_Surgery | | second cycle | Cycle_Number | | adriamycin | Chemotherapy | | cyclophosphamide | Chemotherapy | ******************** ner_oncology_wip results ******************** | chunk | ner_label | |:-----------------|:---------------| | mastectomy | Cancer_Surgery | | second cycle | Cycle_Number | | adriamycin | Chemotherapy | | cyclophosphamide | Chemotherapy | ******************** ner_oncology_wip results ******************** | chunk | ner_label | |:-----------------|:---------------| | mastectomy | Cancer_Surgery | | second cycle | Cycle_Number | | adriamycin | Cancer_Therapy | | cyclophosphamide | Cancer_Therapy | ******************** ner_oncology_unspecific_posology_wip results ******************** | chunk | ner_label | |:-----------------|:---------------------| | mastectomy | Cancer_Therapy | | second cycle | Posology_Information | | adriamycin | Cancer_Therapy | | cyclophosphamide | Cancer_Therapy | ******************** assertion_oncology_wip results ******************** | chunk | ner_label | assertion | |:-----------------|:---------------|:------------| | mastectomy | Cancer_Surgery | Past | | adriamycin | Chemotherapy | Present | | cyclophosphamide | Chemotherapy | Present | ******************** assertion_oncology_treatment_binary_wip results ******************** | chunk | ner_label | assertion | |:-----------------|:---------------|:----------------| | mastectomy | Cancer_Surgery | Present_Or_Past | | adriamycin | Chemotherapy | Present_Or_Past | | cyclophosphamide | Chemotherapy | Present_Or_Past | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|oncology_therapy_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - 
SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ChunkMergeModel - AssertionDLModel - AssertionDLModel

---
layout: model
title: ESG Text Classification (Augmented, 26 classes)
author: John Snow Labs
name: finclf_augmented_esg
date: 2022-09-06
tags: [en, financial, esg, classification, licensed]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
recommended: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model classifies financial texts and news into 26 ESG classes that belong to three verticals: Environment, Social, and Governance. It can be used to build an ESG scoreboard for companies. If you are looking for a generic version that returns only Environment, Social, or Governance, please see the finance_sequence_classifier_esg model in the Models Hub.
## Predicted Entities `Business_Ethics`, `Data_Security`, `Access_And_Affordability`, `Business_Model_Resilience`, `Competitive_Behavior`, `Critical_Incident_Risk_Management`, `Customer_Welfare`, `Director_Removal`, `Employee_Engagement_Inclusion_And_Diversity`, `Employee_Health_And_Safety`, `Human_Rights_And_Community_Relations`, `Labor_Practices`, `Management_Of_Legal_And_Regulatory_Framework`, `Physical_Impacts_Of_Climate_Change`, `Product_Quality_And_Safety`, `Product_Design_And_Lifecycle_Management`, `Selling_Practices_And_Product_Labeling`, `Supply_Chain_Management`, `Systemic_Risk_Management`, `Waste_And_Hazardous_Materials_Management`, `Water_And_Wastewater_Management`, `Air_Quality`, `Customer_Privacy`, `Ecological_Impacts`, `Energy_Management`, `GHG_Emissions` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FINCLF_ESG/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_augmented_esg_en_1.0.0_3.2_1662473372920.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_augmented_esg_en_1.0.0_3.2_1662473372920.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = nlp.Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = finance.BertForSequenceClassification.pretrained("finclf_augmented_esg", "en", "finance/models")\ .setInputCols(["document",'token'])\ .setOutputCol("class") pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) # couple of simple examples example = spark.createDataFrame([["""The Canadian Environmental Assessment Agency (CEAA) concluded that in June 2016 the company had not made an effort to protect public drinking water and was ignoring concerns raised by its own scientists about the potential levels of pollutants in the local water supply. At the time, there were concerns that the company was not fully testing onsite wells for contaminants and did not use the proper methods for testing because of its test kits now manufactured in China.A preliminary report by the company in June 2016 was commissioned by the Alberta government to provide recommendations to Alberta Environment officials"""]]).toDF("text") result = pipeline.fit(example).transform(example) # result is a DataFrame result.select("text", "class.result").show() ```
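This card's Benchmarking section reports per-class scores plus macro and weighted averages over the 26 classes. How those two aggregates differ can be sketched in a few lines — the three-class numbers below are invented for illustration, not the model's real scores:

```python
# (f1, support) per class — hypothetical values for illustration only.
per_class = {
    "Business_Ethics": (0.80, 10),
    "Data_Security": (0.90, 40),
    "GHG_Emissions": (0.60, 50),
}

def macro_f1(scores):
    """Unweighted mean: every class counts equally."""
    return sum(f1 for f1, _ in scores.values()) / len(scores)

def weighted_f1(scores):
    """Support-weighted mean: frequent classes dominate."""
    total = sum(n for _, n in scores.values())
    return sum(f1 * n for f1, n in scores.values()) / total

macro = macro_f1(per_class)       # (0.80 + 0.90 + 0.60) / 3
weighted = weighted_f1(per_class)  # pulled down by the large, low-F1 class
```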
## Results ```bash +--------------------+--------------------+ | text| result| +--------------------+--------------------+ |The Canadian Envi...|[Waste_And_Hazard...| +--------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_augmented_esg| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|410.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References In-house annotations from scrapped annual reports and tweets about ESG ## Benchmarking ```bash label precision recall f1-score support Business_Ethics 0.73 0.80 0.76 10 Data_Security 1.00 0.89 0.94 9 Access_And_Affordability 1.00 1.00 1.00 15 Business_Model_Resilience 1.00 1.00 1.00 12 Competitive_Behavior 0.92 1.00 0.96 12 Critical_Incident_Risk_Management 0.92 1.00 0.96 11 Customer_Welfare 0.85 1.00 0.92 11 Director_Removal 0.91 1.00 0.95 10 Employee_Engagement_Inclusion_And_Diversity 1.00 1.00 1.00 11 Employee_Health_And_Safety 1.00 1.00 1.00 10 Human_Rights_And_Community_Relations 0.94 1.00 0.97 16 Labor_Practices 0.71 0.53 0.61 19 Management_Of_Legal_And_Regulatory_Framework 1.00 0.95 0.97 19 Physical_Impacts_Of_Climate_Change 0.93 1.00 0.97 14 Product_Quality_And_Safety 1.00 1.00 1.00 14 Product_Design_And_Lifecycle_Management 1.00 1.00 1.00 18 Selling_Practices_And_Product_Labeling 1.00 1.00 1.00 17 Supply_Chain_Management 0.89 1.00 0.94 8 Systemic_Risk_Management 1.00 0.86 0.92 14 Waste_And_Hazardous_Materials_Management 0.88 1.00 0.93 14 Water_And_Wastewater_Management 1.00 1.00 1.00 8 Air_Quality 1.00 1.00 1.00 16 Customer_Privacy 1.00 0.93 0.97 15 Ecological_Impacts 1.00 1.00 1.00 16 Energy_Management 1.00 0.91 0.95 11 GHG_Emissions 1.00 0.91 0.95 11 accuracy - - 0.95 330 macro-avg 0.95 0.95 0.95 330 weighted-avg 0.95 0.95 0.95 330 ``` --- layout: model title: English ElectraForQuestionAnswering Small 
model (from Palak)
author: John Snow Labs
name: electra_qa_google_small_discriminator_squad
date: 2022-06-22
tags: [en, open_source, electra, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `google_electra-small-discriminator_squad` is an English model originally trained by `Palak`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_google_small_discriminator_squad_en_4.0.0_3.0_1655922022343.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_google_small_discriminator_squad_en_4.0.0_3.0_1655922022343.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_google_small_discriminator_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_google_small_discriminator_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.electra.small.by_Palak").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
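Under the hood, extractive QA heads like this one score every token as a possible answer start and answer end, and the answer is the highest-scoring valid span. A toy sketch of that selection step — the tokens and logits are invented for illustration:

```python
# Toy context tokens with invented start/end logits.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.2, 0.1, 3.0, 0.1, 0.1, 0.1, 0.1, 0.5, 0.0]
end_logits = [0.1, 0.1, 0.1, 2.5, 0.1, 0.1, 0.1, 0.1, 0.4, 0.0]

def best_span(starts, ends, max_len=8):
    """Pick (i, j) maximizing starts[i] + ends[j], with i <= j < i + max_len."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(starts):
        for j in range(i, min(i + max_len, len(ends))):
            if s + ends[j] > best_score:
                best_score, best = s + ends[j], (i, j)
    return best

i, j = best_span(start_logits, end_logits)
answer = " ".join(tokens[i:j + 1])
```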
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|electra_qa_google_small_discriminator_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|51.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/Palak/google_electra-small-discriminator_squad

---
layout: model
title: English DistilBertForTokenClassification Cased model (from Lucifermorningstar011)
author: John Snow Labs
name: distilbert_token_classifier_autotrain_final_784824218
date: 2023-03-14
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-final-784824218` is an English model originally trained by `Lucifermorningstar011`.

## Predicted Entities

`9`, `0`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824218_en_4.3.1_3.0_1678783236100.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824218_en_4.3.1_3.0_1678783236100.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824218","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824218","en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
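Token-classification output is usually consumed by merging consecutive tagged tokens into entity chunks (in Spark NLP this is what a NerConverter stage does). This particular model's labels are opaque (`9`, `0`), so the sketch below assumes standard BIO tags purely for illustration:

```python
tokens = ["John", "Snow", "Labs", "is", "in", "Delaware"]
tags = ["B-ORG", "I-ORG", "I-ORG", "O", "O", "B-LOC"]

def bio_to_chunks(tokens, tags):
    """Merge B-/I- runs into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" or a mismatched I- tag closes the current chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

chunks = bio_to_chunks(tokens, tags)
```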
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_autotrain_final_784824218|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/Lucifermorningstar011/autotrain-final-784824218

---
layout: model
title: English RobertaForQuestionAnswering (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_fpdm_triplet_roberta_FT_new_newsqa
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_triplet_roberta_FT_new_newsqa` is an English model originally trained by `AnonymousSub`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_triplet_roberta_FT_new_newsqa_en_4.0.0_3.0_1655728767027.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_triplet_roberta_FT_new_newsqa_en_4.0.0_3.0_1655728767027.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_triplet_roberta_FT_new_newsqa","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = RoBertaForQuestionAnswering
    .pretrained("roberta_qa_fpdm_triplet_roberta_FT_new_newsqa","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.news.roberta.qa_fpdm_triplet_roberta_ft_new_newsqa.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_fpdm_triplet_roberta_FT_new_newsqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|461.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/AnonymousSub/fpdm_triplet_roberta_FT_new_newsqa

---
layout: model
title: German Bert Embeddings (from amine)
author: John Snow Labs
name: bert_embeddings_bert_base_5lang_cased
date: 2022-04-11
tags: [bert, embeddings, de, open_source]
task: Embeddings
language: de
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-5lang-cased` is a German model originally trained by `amine`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_5lang_cased_de_3.4.2_3.0_1649676183514.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_5lang_cased_de_3.4.2_3.0_1649676183514.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_5lang_cased","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Funken NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_5lang_cased","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Funken NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.embed.bert_base_5lang_cased").predict("""Ich liebe Funken NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_5lang_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|464.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/amine/bert-base-5lang-cased - https://cloud.google.com/compute/docs/machine-types#n1_machine_type --- layout: model title: Translate English to Catalan Pipeline author: John Snow Labs name: translate_en_ca date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, ca, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `ca` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ca_xx_2.7.0_2.4_1609686877248.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ca_xx_2.7.0_2.4_1609686877248.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ca", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ca", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ca').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_ca| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Indonesian XLMRobertaForTokenClassification Cased model (from vkhangpham) author: John Snow Labs name: xlmroberta_ner_shopee date: 2022-08-13 tags: [id, open_source, xlm_roberta, ner] task: Named Entity Recognition language: id edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `shopee-ner` is an Indonesian model originally trained by `vkhangpham`. ## Predicted Entities `STR`, `POI` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_shopee_id_4.1.0_3.0_1660423012861.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_shopee_id_4.1.0_3.0_1660423012861.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_shopee","id") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_shopee","id") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_shopee| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|id| |Size:|865.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/vkhangpham/shopee-ner --- layout: model title: Legal Certain definitions Clause Binary Classifier author: John Snow Labs name: legclf_certain_definitions_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `certain-definitions` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings of this model allow up to 512 tokens. If your input is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial linked above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `certain-definitions` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_certain_definitions_clause_en_1.0.0_3.2_1660122218038.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_certain_definitions_clause_en_1.0.0_3.2_1660122218038.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_certain_definitions_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
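The description above recommends splitting long documents into paragraphs before classification, since the embeddings only cover 512 tokens. A minimal plain-Python sketch of the "paragraph splitting (by multiline)" technique — the helper name and the sample contract text are illustrative, not part of Spark NLP:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document on runs of blank lines, dropping empty chunks."""
    chunks = re.split(r"\n\s*\n", text)
    return [c.strip() for c in chunks if c.strip()]

document = """1. DEFINITIONS.

"Affiliate" means any entity controlling a party.

2. TERM.

This Agreement starts on the Effective Date."""

paragraphs = split_paragraphs(document)
print(len(paragraphs))  # 4 paragraphs, each comfortably under the 512-token limit
```

Each resulting paragraph would then become one row of the `clause_text` column fed to the pipeline above.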
## Results ```bash +---------------------+ |result               | +---------------------+ |[certain-definitions]| |[other]              | |[other]              | |[certain-definitions]| +---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_certain_definitions_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support certain-definitions 0.95 0.84 0.89 49 other 0.94 0.99 0.96 138 accuracy - - 0.95 187 macro-avg 0.95 0.91 0.93 187 weighted-avg 0.95 0.95 0.95 187 ``` --- layout: model title: Sinhala BertForQuestionAnswering model (from sankhajay) author: John Snow Labs name: bert_qa_bert_base_sinhala_qa date: 2022-06-02 tags: [si, open_source, question_answering, bert] task: Question Answering language: si edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-sinhala-qa` is a Sinhala model originally trained by `sankhajay`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_sinhala_qa_si_4.0.0_3.0_1654180367412.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_sinhala_qa_si_4.0.0_3.0_1654180367412.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_sinhala_qa","si") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_sinhala_qa","si") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("si.answer_question.bert.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_sinhala_qa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|si| |Size:|752.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/sankhajay/bert-base-sinhala-qa --- layout: model title: Legal Forfeitures Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_forfeitures_bert date: 2023-03-05 tags: [en, legal, classification, clauses, forfeitures, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Forfeitures` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc.
Note that the embeddings of this model allow up to 512 tokens. If your input is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial linked above). This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Forfeitures`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_forfeitures_bert_en_1.0.0_3.0_1678046913501.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_forfeitures_bert_en_1.0.0_3.0_1678046913501.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_forfeitures_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
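The description notes that several binary clause classifiers can be combined, yielding one True/False value per clause type. A plain-Python sketch of merging per-model labels into a single clause map — the model names and labels below are illustrative stand-ins for real pipeline outputs:

```python
def merge_clause_predictions(predictions: dict) -> dict:
    """Map each clause classifier's label to a True/False flag.

    `predictions` maps a model name to the label it emitted for one
    paragraph; any label other than "other" counts as a positive hit
    for that clause type.
    """
    return {model: label != "other" for model, label in predictions.items()}

# Illustrative outputs from three binary clause classifiers on one paragraph.
paragraph_preds = {
    "legclf_forfeitures_bert": "Forfeitures",
    "legclf_costs_bert": "other",
    "legclf_certain_definitions_clause": "other",
}

flags = merge_clause_predictions(paragraph_preds)
print(flags)
```

In a real pipeline each entry would come from that model's `category` output column for the same input row.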
## Results ```bash +-------------+ |result       | +-------------+ |[Forfeitures]| |[Other]      | |[Other]      | |[Forfeitures]| +-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_forfeitures_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Forfeitures 0.91 0.97 0.94 32 Other 0.98 0.94 0.96 50 accuracy - - 0.95 82 macro-avg 0.95 0.95 0.95 82 weighted-avg 0.95 0.95 0.95 82 ``` --- layout: model title: English asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2 TFWav2Vec2ForCTC from gary109 author: John Snow Labs name: pipeline_asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2` is an English model originally trained by gary109.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2_en_4.2.0_3.0_1664101430715.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2_en_4.2.0_3.0_1664101430715.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English asr_wav2vec2_xls_r_300m_hindi_lm TFWav2Vec2ForCTC from shoubhik author: John Snow Labs name: asr_wav2vec2_xls_r_300m_hindi_lm date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_hindi_lm` is an English model originally trained by shoubhik. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_xls_r_300m_hindi_lm_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_hindi_lm_en_4.2.0_3.0_1664106060093.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_hindi_lm_en_4.2.0_3.0_1664106060093.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xls_r_300m_hindi_lm", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xls_r_300m_hindi_lm", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
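The snippet above assumes an existing `audioDf` whose `audio_content` column holds raw audio as an array of floats, but never shows how that array is produced. A minimal plain-Python sketch of decoding 16-bit PCM bytes into normalized floats — the sample buffer is synthetic, and wiring the list into a Spark DataFrame column is an assumption about the expected input shape, not shown in this card:

```python
import struct

def pcm16_to_floats(raw: bytes) -> list:
    """Decode little-endian 16-bit PCM samples into floats in [-1.0, 1.0]."""
    n = len(raw) // 2
    samples = struct.unpack("<" + "h" * n, raw)
    return [s / 32768.0 for s in samples]

# Synthetic 3-sample buffer: silence, full-scale negative, half-scale positive.
raw = struct.pack("<3h", 0, -32768, 16384)
floats = pcm16_to_floats(raw)
print(floats)  # [0.0, -1.0, 0.5]
```

The resulting list would then populate the `audio_content` column, e.g. `spark.createDataFrame([[floats]]).toDF("audio_content")`.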
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xls_r_300m_hindi_lm| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from deepakvk) author: John Snow Labs name: roberta_qa_deepakvk_base_squad2_finetuned_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-finetuned-squad` is an English model originally trained by `deepakvk`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_deepakvk_base_squad2_finetuned_squad_en_4.3.0_3.0_1674219250943.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_deepakvk_base_squad2_finetuned_squad_en_4.3.0_3.0_1674219250943.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepakvk_base_squad2_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepakvk_base_squad2_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_deepakvk_base_squad2_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/deepakvk/roberta-base-squad2-finetuned-squad --- layout: model title: English T5ForConditionalGeneration Cased model (from cometrain) author: John Snow Labs name: t5_fake_news_detector date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fake-news-detector-t5` is an English model originally trained by `cometrain`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_fake_news_detector_en_4.3.0_3.0_1675101857981.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_fake_news_detector_en_4.3.0_3.0_1675101857981.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_fake_news_detector","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_fake_news_detector","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_fake_news_detector| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|277.3 MB| ## References - https://huggingface.co/cometrain/fake-news-detector-t5 - https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from khoanvm) author: John Snow Labs name: roberta_qa_base_squad2_finetuned_visquad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-finetuned-visquad` is an English model originally trained by `khoanvm`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_finetuned_visquad_en_4.3.0_3.0_1674219613502.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_finetuned_visquad_en_4.3.0_3.0_1674219613502.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2_finetuned_visquad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2_finetuned_visquad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_squad2_finetuned_visquad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/khoanvm/roberta-base-squad2-finetuned-visquad --- layout: model title: English DistilBertForQuestionAnswering model (from anurag0077) Squad2 author: John Snow Labs name: distilbert_qa_anurag0077_base_uncased_finetuned_squad2 date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad2` is an English model originally trained by `anurag0077`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_anurag0077_base_uncased_finetuned_squad2_en_4.0.0_3.0_1654726811228.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_anurag0077_base_uncased_finetuned_squad2_en_4.0.0_3.0_1654726811228.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_anurag0077_base_uncased_finetuned_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_anurag0077_base_uncased_finetuned_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.distil_bert.base_uncased.by_anurag0077").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
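The `nlu` one-liners in these QA cards pack the question and context into a single string separated by `|||`. A small sketch of assembling that input from separate fields — the helper name is illustrative and not part of the `nlu` API:

```python
def make_qa_input(question: str, context: str) -> str:
    """Join a question and its context with the `|||` separator used by nlu QA predict()."""
    return f"{question}|||{context}"

text = make_qa_input("What is my name?", "My name is Clara and I live in Berkeley.")
print(text)  # What is my name?|||My name is Clara and I live in Berkeley.
```

The resulting string is what would be passed to `nlu.load(...).predict(...)` above.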
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_anurag0077_base_uncased_finetuned_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anurag0077/distilbert-base-uncased-finetuned-squad2 --- layout: model title: English RoBERTa Embeddings (SMILES Strings, v2) author: John Snow Labs name: roberta_embeddings_chEMBL26_smiles_v2 date: 2022-04-14 tags: [roberta, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `chEMBL26_smiles_v2` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_chEMBL26_smiles_v2_en_3.4.2_3.0_1649946865988.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_chEMBL26_smiles_v2_en_3.4.2_3.0_1649946865988.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_chEMBL26_smiles_v2","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_chEMBL26_smiles_v2","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.chEMBL26_smiles_v2").predict("""I love Spark NLP""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_chEMBL26_smiles_v2|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|90.1 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/mrm8488/chEMBL26_smiles_v2

---
layout: model
title: Legal Costs Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_costs_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, costs, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

This model is a Binary Classifier (True, False) for the `Costs` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`Costs`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_costs_bert_en_1.0.0_3.0_1678049894274.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_costs_bert_en_1.0.0_3.0_1678049894274.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_costs_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
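As described above, this classifier works best on whole provisions rather than single sentences. A minimal sketch of paragraph splitting by multiline (blank-line) breaks before feeding each paragraph to the pipeline; the regex and the minimum-length filter are illustrative assumptions, not part of the model:

```python
import re

def split_paragraphs(text: str, min_chars: int = 40):
    """Split a legal document into candidate provisions on blank lines.

    min_chars is an illustrative threshold used here to drop short
    headers and noise; tune it for your documents.
    """
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if len(p.strip()) >= min_chars]

doc = """SECTION 9. COSTS.

Each party shall bear its own costs and expenses incurred in
connection with this Agreement.

SECTION 10. NOTICES.

All notices shall be delivered in writing to the addresses below."""

paragraphs = split_paragraphs(doc)
# Each paragraph can then become one row of the `text` column above.
```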
## Results

```bash
+-------+
|result |
+-------+
|[Costs]|
|[Other]|
|[Other]|
|[Costs]|
+-------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_costs_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.4 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
       label  precision  recall  f1-score  support
       Costs       0.81    1.00      0.90       13
       Other       1.00    0.88      0.94       26
    accuracy          -       -      0.92       39
   macro-avg       0.91    0.94      0.92       39
weighted-avg       0.94    0.92      0.92       39
```

---
layout: model
title: Fast Neural Machine Translation Model from English to Tsonga
author: John Snow Labs
name: opus_mt_en_ts
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, ts, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `en` - target languages: `ts` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ts_xx_2.7.0_2.4_1609170044688.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ts_xx_2.7.0_2.4_1609170044688.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_ts", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])

light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

data = "Put the text you want to translate here."
result = light_pipeline.fullAnnotate(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_ts", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("Put the text you want to translate here.").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.ts').predict(text, output_level='sentence')
opus_df
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_ts| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: NER Model Finder with Sentence Entity Resolvers (sbert_jsl_medium_uncased) author: John Snow Labs name: sbertresolve_ner_model_finder date: 2022-09-05 tags: [en, entity_resolver, licensed, ner, clinical] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities (NER labels) to the most appropriate NER model using `sbert_jsl_medium_uncased` Sentence Bert Embeddings. Given the entity name, it will return a list of pretrained NER models having that entity or similar ones. ## Predicted Entities `ner_model_list` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_ner_model_finder_en_4.1.0_3.0_1662377743401.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_ner_model_finder_en_4.1.0_3.0_1662377743401.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("ner_chunk")

sbert_embedder = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased","en","clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("sbert_embeddings")

ner_model_finder = SentenceEntityResolverModel.pretrained("sbertresolve_ner_model_finder", "en", "clinical/models")\
    .setInputCols(["sbert_embeddings"])\
    .setOutputCol("model_names")\
    .setDistanceFunction("EUCLIDEAN")

ner_model_finder_pipelineModel = PipelineModel(stages = [documentAssembler, sbert_embedder, ner_model_finder])

light_pipeline = LightPipeline(ner_model_finder_pipelineModel)

annotations = light_pipeline.fullAnnotate("medication")
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("ner_chunk")

val sbert_embedder = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased","en","clinical/models")
    .setInputCols(Array("ner_chunk"))
    .setOutputCol("sbert_embeddings")

val ner_model_finder = SentenceEntityResolverModel.pretrained("sbertresolve_ner_model_finder", "en", "clinical/models")
    .setInputCols(Array("sbert_embeddings"))
    .setOutputCol("model_names")
    .setDistanceFunction("EUCLIDEAN")

val ner_model_finder_pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, ner_model_finder))

val ner_model_finder_pipelineModel = ner_model_finder_pipeline.fit(Seq("").toDF("text"))

val light_pipeline = new LightPipeline(ner_model_finder_pipelineModel)

val annotations = light_pipeline.fullAnnotate("medication")
```

{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.ner.model_finder").predict("""Put your text here.""")
```
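The resolver returns its candidate NER model lists as `':::'`-separated, Python-style list strings (as in the results shown for the query "medication"). A small sketch of unpacking such a field into real Python lists; the sample string is a shortened, hypothetical value in the same format as that output:

```python
import ast

# Shortened, hypothetical sample in the same format as the resolver
# output: each ':::'-separated chunk is a Python-style list of model names.
all_models = ("['ner_posology_greedy', 'ner_posology_small']:::"
              "['ner_drugs_large', 'ner_drugs_greedy']")

# ast.literal_eval safely parses each chunk into an actual list.
candidate_lists = [ast.literal_eval(chunk) for chunk in all_models.split(":::")]
# candidate_lists -> [['ner_posology_greedy', 'ner_posology_small'],
#                     ['ner_drugs_large', 'ner_drugs_greedy']]
```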
## Results

```bash
# light_pipeline.fullAnnotate("medication")

entity      : medication

models      : ['ner_posology_greedy', 'jsl_ner_wip_modifier_clinical', 'ner_posology_small', 'ner_jsl_greedy',
               'ner_ade_clinical', 'ner_posology', 'ner_risk_factors', 'ner_ade_healthcare', 'ner_drugs_large',
               'ner_jsl_slim', 'ner_posology_experimental', 'ner_posology_large', 'ner_posology_healthcare',
               'ner_drugs_greedy', 'ner_pathogen']

resolutions : medication ::: drug ::: treatment ::: targeted therapy ::: therapeutic procedure :::
              drug ingredient ::: drug chemical ::: medical procedure ::: substance ::: medical device :::
              administration ::: medical condition ::: measurement ::: drug strength :::
              physiological reaction ::: dose ::: research activity

all_models  (one candidate model list per resolution above; ':::'-separated in the raw output):
  ['ner_posology_greedy', 'jsl_ner_wip_modifier_clinical', 'ner_posology_small', 'ner_jsl_greedy', 'ner_ade_clinical', 'ner_posology', 'ner_risk_factors', 'ner_ade_healthcare', 'ner_drugs_large', 'ner_jsl_slim', 'ner_posology_experimental', 'ner_posology_large', 'ner_posology_healthcare', 'ner_drugs_greedy', 'ner_pathogen']
  ['ner_posology_greedy', 'jsl_ner_wip_modifier_clinical', 'ner_posology_small', 'ner_jsl_greedy', 'ner_ade_clinical', 'ner_nature_nero_clinical', 'ner_posology', 'ner_biomarker', 'ner_clinical_trials_abstracts', 'ner_risk_factors', 'ner_ade_healthcare', 'ner_drugs_large', 'ner_jsl_slim', 'ner_posology_experimental', 'ner_posology_large', 'ner_posology_healthcare', 'ner_drugs_greedy']
  ['ner_covid_trials', 'ner_jsl', 'jsl_rd_ner_wip_greedy_clinical', 'jsl_ner_wip_modifier_clinical', 'ner_healthcare', 'ner_jsl_enriched', 'ner_events_clinical', 'ner_jsl_greedy', 'ner_clinical', 'ner_clinical_large', 'ner_jsl_slim', 'ner_events_healthcare', 'ner_events_admission_clinical']
  ['ner_biomarker']
  ['ner_medmentions_coarse']
  ['ner_covid_trials', 'ner_jsl_enriched', 'ner_jsl', 'ner_medmentions_coarse']
  ['ner_drugs']
  ['ner_nature_nero_clinical']
  ['ner_jsl', 'jsl_rd_ner_wip_greedy_clinical', 'jsl_ner_wip_modifier_clinical', 'ner_medmentions_coarse', 'ner_jsl_enriched', 'ner_jsl_greedy']
  ['ner_jsl', 'jsl_rd_ner_wip_greedy_clinical', 'ner_nature_nero_clinical', 'ner_medmentions_coarse', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_enriched', 'ner_radiology_wip_clinical', 'ner_jsl_greedy', 'ner_radiology', 'ner_jsl_slim']
  ['ner_posology_experimental']
  ['ner_pathogen']
  ['ner_measurements_clinical', 'jsl_rd_ner_wip_greedy_clinical', 'ner_nature_nero_clinical', 'ner_radiology_wip_clinical', 'ner_radiology', 'ner_nihss']
  ['ner_jsl', 'ner_posology_greedy', 'jsl_rd_ner_wip_greedy_clinical', 'jsl_ner_wip_modifier_clinical', 'ner_posology_small', 'ner_jsl_enriched', 'ner_jsl_greedy', 'ner_posology', 'ner_posology_experimental', 'ner_posology_large', 'ner_posology_healthcare']
  ['ner_covid_trials', 'ner_jsl', 'jsl_rd_ner_wip_greedy_clinical', 'ner_medmentions_coarse', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_enriched', 'ner_jsl_greedy']
  ['ner_clinical_trials_abstracts']
  ['ner_medmentions_coarse', 'ner_nature_nero_clinical']
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sbertresolve_ner_model_finder|
|Compatibility:|Healthcare NLP 4.1.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sbert_embeddings]|
|Output Labels:|[models]|
|Language:|en|
|Size:|737.3 KB|
|Case sensitive:|false|

## References

This model is trained with
the data that has the labels of 70 different clinical NER models. --- layout: model title: Norwegian BertForMaskedLM Cased model (from ltgoslo) author: John Snow Labs name: bert_embeddings_norbert2 date: 2022-12-02 tags: ["no", open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: "no" edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `norbert2` is a Norwegian model originally trained by `ltgoslo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_norbert2_no_4.2.4_3.0_1670022783195.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_norbert2_no_4.2.4_3.0_1670022783195.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_norbert2","no") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_norbert2","no")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
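Once the pipeline above has produced embedding vectors, a common follow-up is comparing two vectors with cosine similarity. A stdlib-only sketch using made-up 3-dimensional vectors (the real embeddings are much wider):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical vectors have similarity 1.0; orthogonal vectors 0.0.
sim = cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])
```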
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_norbert2|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|no|
|Size:|467.9 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/ltgoslo/norbert2
- http://vectors.nlpl.eu/repository/20/221.zip
- http://norlm.nlpl.eu/
- https://github.com/ltgoslo/NorBERT
- https://aclanthology.org/2021.nodalida-main.4/
- https://www.eosc-nordic.eu/
- https://www.mn.uio.no/ifi/english/research/groups/ltg/

---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_efficient_small_dm2000
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-dm2000` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dm2000_en_4.3.0_3.0_1675119184723.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dm2000_en_4.3.0_3.0_1675119184723.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_small_dm2000","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_small_dm2000","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_small_dm2000|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|590.0 MB|

## References

- https://huggingface.co/google/t5-efficient-small-dm2000
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145

---
layout: model
title: English asr_wav2vec2_base_timit_moaiz_exp2 TFWav2Vec2ForCTC from moaiz237
author: John Snow Labs
name: asr_wav2vec2_base_timit_moaiz_exp2
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_moaiz_exp2` is an English model originally trained by moaiz237.

NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_base_timit_moaiz_exp2_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_moaiz_exp2_en_4.2.0_3.0_1664037589335.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_moaiz_exp2_en_4.2.0_3.0_1664037589335.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_moaiz_exp2", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_moaiz_exp2", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
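The pipeline above expects `audioDf` to carry an `audio_content` column of floating-point samples. A stdlib-only sketch of decoding little-endian 16-bit PCM bytes into normalized floats; the trailing `spark.createDataFrame` line is a hypothetical illustration and depends on your Spark session and schema:

```python
import struct

def pcm16_to_floats(raw: bytes):
    """Decode little-endian 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    n = len(raw) // 2
    samples = struct.unpack("<%dh" % n, raw[: n * 2])
    return [s / 32768.0 for s in samples]

# Two samples: full-scale negative, then half-scale positive.
floats = pcm16_to_floats(struct.pack("<2h", -32768, 16384))
# floats -> [-1.0, 0.5]

# Hypothetical follow-up (requires an active Spark session):
# audioDf = spark.createDataFrame([(floats,)], ["audio_content"])
```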
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_moaiz_exp2|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|355.0 MB|

---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from nlpconnect)
author: John Snow Labs
name: roberta_qa_dpr_nq_reader_base_v2
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dpr-nq-reader-roberta-base-v2` is an English model originally trained by `nlpconnect`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_dpr_nq_reader_base_v2_en_4.3.0_3.0_1674210757247.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_dpr_nq_reader_base_v2_en_4.3.0_3.0_1674210757247.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_dpr_nq_reader_base_v2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_dpr_nq_reader_base_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_dpr_nq_reader_base_v2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|466.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/nlpconnect/dpr-nq-reader-roberta-base-v2 --- layout: model title: Detect concepts in drug development trials (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_drug_development_trials date: 2021-12-17 tags: [en, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a `BertForTokenClassification` NER model that identifies concepts related to drug development, including `Trial Groups`, `End Points`, `Hazard Ratio`, and other entities in free text. 
## Predicted Entities `Patient_Count`, `Duration`, `End_Point`, `Value`, `Trial_Group`, `Hazard_Ratio`, `Total_Patients` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DRUGS_DEVELOPMENT_TRIALS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_drug_development_trials_en_3.3.2_3.0_1639776838533.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_drug_development_trials_en_3.3.2_3.0_1639776838533.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_drug_development_trials", "en", "clinical/models")\ .setInputCols("token", "document")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") p_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']}))) test_sentence = """In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan.""" result = p_model.transform(spark.createDataFrame(pd.DataFrame({'text': [test_sentence]}))) ``` ```scala ... val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_drug_development_trials", "en", "clinical/models") .setInputCols("token", "document") .setOutputCol("ner") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("document","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter)) val data = Seq("""In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. 
The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.ner.drug_development_trials").predict("""In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan.""") ```
## Results ```bash | | chunk | entity | |---:|:------------------|:--------------| | 0 | median | Duration | | 1 | overall survival | End_Point | | 2 | with | Trial_Group | | 3 | without topotecan | Trial_Group | | 4 | 4.0 | Value | | 5 | 3.6 months | Value | | 6 | 23 | Patient_Count | | 7 | 63 | Patient_Count | | 8 | 55 | Patient_Count | | 9 | 33 patients | Patient_Count | | 10 | topotecan | Trial_Group | | 11 | 11 | Patient_Count | | 12 | 61 | Patient_Count | | 13 | 66 | Patient_Count | | 14 | 32 patients | Patient_Count | | 15 | without topotecan | Trial_Group | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_drug_development_trials| |Compatibility:|Healthcare NLP 3.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|400.6 MB| |Case sensitive:|true| |Max sentence length:|256| ## Data Source Trained on data obtained from `clinicaltrials.gov` and annotated in-house. 
## Benchmarking ```bash label precision recall f1 support B-Duration 0.93 0.94 0.93 1820 B-End_Point 0.99 0.98 0.98 5022 B-Hazard_Ratio 0.97 0.95 0.96 778 B-Patient_Count 0.81 0.88 0.85 300 B-Trial_Group 0.86 0.88 0.87 6751 B-Value 0.94 0.96 0.95 7675 I-Duration 0.71 0.82 0.76 185 I-End_Point 0.94 0.98 0.96 1491 I-Patient_Count 0.48 0.64 0.55 44 I-Trial_Group 0.78 0.75 0.77 4561 I-Value 0.93 0.95 0.94 1511 O 0.96 0.95 0.95 47423 accuracy - - 0.94 77608 macro-avg 0.79 0.82 0.80 77608 weighted-avg 0.94 0.94 0.94 77608 ``` --- layout: model title: English RobertaForQuestionAnswering Tiny Cased model (from deepset) author: John Snow Labs name: roberta_qa_tiny_6l_768d date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tinyroberta-6l-768d` is an English model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_tiny_6l_768d_en_4.2.4_3.0_1669988517909.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_tiny_6l_768d_en_4.2.4_3.0_1669988517909.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_tiny_6l_768d","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_tiny_6l_768d","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_tiny_6l_768d| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|307.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/deepset/tinyroberta-6l-768d - https://arxiv.org/pdf/1909.10351.pdf - https://github.com/deepset-ai/haystack - https://haystack.deepset.ai/guides/model-distillation - https://github.com/deepset-ai/haystack/ - https://deepset.ai/german-bert - https://deepset.ai/germanquad - https://github.com/deepset-ai/FARM - https://twitter.com/deepset_ai - https://www.linkedin.com/company/deepset-ai/ - https://haystack.deepset.ai/community/join - https://github.com/deepset-ai/haystack/discussions - https://deepset.ai - http://www.deepset.ai/jobs --- layout: model title: Legal Joint Filing Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_joint_filing_agreement date: 2022-11-24 tags: [en, legal, classification, agreement, joint_filing, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_joint_filing_agreement` model is a Legal Longformer Document Classifier that classifies whether a document belongs to the class `joint-filing-agreement` or not (Binary Classification). Longformers have a limit of 4096 tokens, so only the first 4096 tokens are taken into account. We have found that, for the large majority of documents in legal corpora, 4096 tokens are enough to perform Document Classification, provided the text is clean and contains only the legal document itself without extra material before it. 
If that is not the case for your documents, let us know and we can apply a different approach: split the text into 4096-token chunks, average their embeddings, and train on the averaged version, so that the whole document is taken into account. In practice, however, this should rarely be required. ## Predicted Entities `joint-filing-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_joint_filing_agreement_en_1.0.0_3.0_1669291473829.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_joint_filing_agreement_en_1.0.0_3.0_1669291473829.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_joint_filing_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
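The chunk-and-average fallback mentioned in the description can be sketched in plain Python. This is an illustrative sketch only, not part of the Spark NLP API: `chunk`, `document_embedding`, and the `embed` callback are hypothetical helpers standing in for the Longformer embedding stage of the pipeline above.

```python
# Hypothetical sketch of the chunk-and-average fallback: split a long token
# sequence into windows of at most `max_len` tokens, embed each window (here
# via a caller-supplied `embed` function), and average the per-chunk vectors
# so every part of the document contributes to the final representation.

from typing import Callable, List

MAX_LEN = 4096  # Longformer's token limit, per the card above


def chunk(tokens: List[str], max_len: int = MAX_LEN) -> List[List[str]]:
    """Split `tokens` into consecutive windows of at most `max_len` items."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


def document_embedding(
    tokens: List[str],
    embed: Callable[[List[str]], List[float]],
    max_len: int = MAX_LEN,
) -> List[float]:
    """Average the embeddings of each chunk so no part of the document is dropped."""
    chunk_vecs = [embed(c) for c in chunk(tokens, max_len)]
    dim = len(chunk_vecs[0])
    return [sum(v[i] for v in chunk_vecs) / len(chunk_vecs) for i in range(dim)]
```

Training on the averaged vectors would then proceed exactly as with the `SentenceEmbeddings` output in the pipeline above.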
## Results ```bash +-------+ |result| +-------+ |[joint-filing-agreement]| |[other]| |[other]| |[joint-filing-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_joint_filing_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support joint-filing-agreement 0.97 0.97 0.97 31 other 0.99 0.99 0.99 90 accuracy - - 0.98 121 macro avg 0.98 0.98 0.98 121 weighted avg 0.98 0.98 0.98 121 ``` --- layout: model title: Translate Lozi to English Pipeline author: John Snow Labs name: translate_loz_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, loz, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `loz` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_loz_en_xx_2.7.0_2.4_1609698759083.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_loz_en_xx_2.7.0_2.4_1609698759083.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_loz_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_loz_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.loz.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_loz_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from English to Dutch author: John Snow Labs name: opus_mt_en_nl date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, nl, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `nl` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_nl_xx_2.7.0_2.4_1609164726700.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_nl_xx_2.7.0_2.4_1609164726700.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_nl", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_nl", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.nl').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_nl| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Clinical English Bert Embeddings (Base, 512 dimension) author: John Snow Labs name: bert_embeddings_clinical_pubmed_bert_base_512 date: 2022-04-11 tags: [bert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `clinical-pubmed-bert-base-512` is an English model originally trained by `Tsubasaz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_clinical_pubmed_bert_base_512_en_3.4.2_3.0_1649672313480.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_clinical_pubmed_bert_base_512_en_3.4.2_3.0_1649672313480.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_clinical_pubmed_bert_base_512","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_clinical_pubmed_bert_base_512","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.clinical_pubmed_bert_base_512").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_clinical_pubmed_bert_base_512| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|410.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/Tsubasaz/clinical-pubmed-bert-base-512 - https://mimic.physionet.org/ --- layout: model title: English RoBERTa Embeddings (Base, Wikipedia and Bookcorpus datasets) author: John Snow Labs name: roberta_embeddings_muppet_roberta_base date: 2022-04-14 tags: [roberta, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `muppet-roberta-base` is an English model originally trained by `facebook`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_muppet_roberta_base_en_3.4.2_3.0_1649946369947.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_muppet_roberta_base_en_3.4.2_3.0_1649946369947.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_muppet_roberta_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_muppet_roberta_base","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.muppet_roberta_base").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_muppet_roberta_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|301.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/facebook/muppet-roberta-base - https://arxiv.org/abs/2101.11038 --- layout: model title: Fast Neural Machine Translation Model from Kabyle to English author: John Snow Labs name: opus_mt_kab_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, kab, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `kab` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_kab_en_xx_2.7.0_2.4_1609166904449.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_kab_en_xx_2.7.0_2.4_1609166904449.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_kab_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_kab_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.kab.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_kab_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Moldavian, Moldovan, Romanian asr_wav2vec2_large_xlsr_53_romanian_by_anton_l TFWav2Vec2ForCTC from anton-l author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_romanian_by_anton_l date: 2022-09-25 tags: [wav2vec2, ro, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: ro edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_romanian_by_anton_l` is a Moldavian, Moldovan, Romanian model originally trained by anton-l. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_romanian_by_anton_l_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_romanian_by_anton_l_ro_4.2.0_3.0_1664098754607.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_romanian_by_anton_l_ro_4.2.0_3.0_1664098754607.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_romanian_by_anton_l', lang = 'ro') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_romanian_by_anton_l", lang = "ro") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_romanian_by_anton_l| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|ro| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English T5ForConditionalGeneration Tiny Cased model (from google) author: John Snow Labs name: t5_efficient_tiny_nh1 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-nh1` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nh1_en_4.3.0_3.0_1675123623466.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nh1_en_4.3.0_3.0_1675123623466.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_tiny_nh1","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_tiny_nh1","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_tiny_nh1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|41.6 MB| ## References - https://huggingface.co/google/t5-efficient-tiny-nh1 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English RobertaForQuestionAnswering (from thatdramebaazguy) author: John Snow Labs name: roberta_qa_roberta_base_MITmovie_squad date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-MITmovie-squad` is an English model originally trained by `thatdramebaazguy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_MITmovie_squad_en_4.0.0_3.0_1655729784944.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_MITmovie_squad_en_4.0.0_3.0_1655729784944.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_MITmovie_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_MITmovie_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.movie_squad.roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_MITmovie_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|461.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/thatdramebaazguy/roberta-base-MITmovie-squad --- layout: model title: English RobertaForQuestionAnswering Cased model (from sunitha) author: John Snow Labs name: roberta_qa_cv_custom_ds date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `CV_Custom_DS` is an English model originally trained by `sunitha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_cv_custom_ds_en_4.3.0_3.0_1674207905368.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_cv_custom_ds_en_4.3.0_3.0_1674207905368.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_cv_custom_ds","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_cv_custom_ds","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_cv_custom_ds| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/sunitha/CV_Custom_DS --- layout: model title: Pipeline to Extract Clinical Abbreviations and Acronyms author: John Snow Labs name: ner_abbreviation_clinical_pipeline date: 2023-03-14 tags: [ner, abbreviation, acronym, en, clinical, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_abbreviation_clinical](https://nlp.johnsnowlabs.com/2021/12/30/ner_abbreviation_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_abbreviation_clinical_pipeline_en_4.3.0_3.2_1678777406281.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_abbreviation_clinical_pipeline_en_4.3.0_3.2_1678777406281.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_abbreviation_clinical_pipeline", "en", "clinical/models") text = '''Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_abbreviation_clinical_pipeline", "en", "clinical/models") val text = "Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical-abbreviation.pipeline").predict("""Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.""") ```
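The `fullAnnotate` call returns one result set per input text. As a rough sketch of post-processing (the exact output structure varies by Spark NLP version, so treat the field names below as assumptions), the detected chunks can be flattened into plain tuples for inspection:

```python
# Sketch: flatten fullAnnotate-style chunk annotations into (text, begin, end, label) rows.
# The dictionary shape used here is an assumption modeled on typical Spark NLP output;
# verify the keys against your own result object.

def flatten_chunks(annotations):
    rows = []
    for ann in annotations:
        rows.append((
            ann["result"],                      # chunk text, e.g. "CBC"
            ann["begin"],
            ann["end"],
            ann["metadata"].get("entity", ""),  # NER label, e.g. "ABBR"
        ))
    return rows

# Example shaped like the Results table below:
sample = [
    {"result": "CBC", "begin": 126, "end": 128, "metadata": {"entity": "ABBR"}},
    {"result": "AB", "begin": 159, "end": 160, "metadata": {"entity": "ABBR"}},
]
print(flatten_chunks(sample))
```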
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-------------|--------:|------:|:------------|-------------:| | 0 | CBC | 126 | 128 | ABBR | 1 | | 1 | AB | 159 | 160 | ABBR | 1 | | 2 | VDRL | 189 | 192 | ABBR | 1 | | 3 | HIV | 247 | 249 | ABBR | 1 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_abbreviation_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English T5ForConditionalGeneration Base Cased model (from google) author: John Snow Labs name: t5_efficient_base_nl4 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-nl4` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nl4_en_4.3.0_3.0_1675113874908.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nl4_en_4.3.0_3.0_1675113874908.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_base_nl4","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_base_nl4","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_base_nl4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|221.4 MB| ## References - https://huggingface.co/google/t5-efficient-base-nl4 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Pipeline to Detect Clinical Entities (WIP) author: John Snow Labs name: jsl_ner_wip_clinical_pipeline date: 2022-03-21 tags: [licensed, ner, wip, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [jsl_ner_wip_clinical](https://nlp.johnsnowlabs.com/2021/03/31/jsl_ner_wip_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_clinical_pipeline_en_3.4.1_3.0_1647865732108.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_clinical_pipeline_en_3.4.1_3.0_1647865732108.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("jsl_ner_wip_clinical_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("jsl_ner_wip_clinical_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. 
His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl_wip_clinical.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
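Once the pipeline has run, the chunk/label pairs it produces can be grouped for a quick overview with plain Python (no Spark required); the sample pairs here are taken from the Results section:

```python
from collections import defaultdict

def group_by_label(pairs):
    """Group (chunk, label) pairs into a {label: [chunks]} dict, preserving order."""
    grouped = defaultdict(list)
    for chunk, label in pairs:
        grouped[label].append(chunk)
    return dict(grouped)

# A few chunk/label pairs as reported by the pipeline:
pairs = [
    ("21-day-old", "Age"),
    ("Caucasian", "Race_Ethnicity"),
    ("male", "Gender"),
    ("congestion", "Symptom"),
    ("mom", "Gender"),
    ("Tylenol", "Drug_BrandName"),
]
print(group_by_label(pairs)["Gender"])  # ['male', 'mom']
```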
## Results ```bash +-----------------------------------------+----------------------------+ |chunk |ner_label | +-----------------------------------------+----------------------------+ |21-day-old |Age | |Caucasian |Race_Ethnicity | |male |Gender | |for 2 days |Duration | |congestion |Symptom | |mom |Gender | |yellow |Modifier | |discharge |Symptom | |nares |External_body_part_or_region| |she |Gender | |mild |Modifier | |problems with his breathing while feeding|Symptom | |perioral cyanosis |Symptom | |retractions |Symptom | |One day ago |RelativeDate | |mom |Gender | |Tylenol |Drug_BrandName | |Baby |Age | |decreased p.o. intake |Symptom | |His |Gender | +-----------------------------------------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|jsl_ner_wip_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Part of Speech for Bengali (pos_msri) author: John Snow Labs name: pos_msri date: 2021-01-20 task: Part of Speech Tagging language: bn edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [bn, pos, open_source] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates the part of speech of tokens in a text. The parts of speech annotated include NN (noun), CC (Conjuncts - coordinating and subordinating), and 26 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. 
## Predicted Entities `BM (Not Documented)`, `CC (Conjuncts, Coordinating and Subordinating)`, `CL (Clitics)`, `DEM (Demonstratives)`, `INJ (Interjection)`, `INTF (Intensifier)`, `JJ (Adjective)`, `NEG (Negative)`, `NN (Noun)`, `NNC (Compound Nouns)`, `NNP (Proper Noun)`, `NST (Preposition of Direction)`, `PPR (Postposition)`, `PRP (Pronoun)`, `PSP (Preposition)`, `QC (Cardinal Number)`, `QF (Quantifiers)`, `QO (Ordinal Numbers)`, `RB (Adverb)`, `RDP (Not Documented)`, `RP (Particle)`, `SYM (Special Symbol)`, `UT (Not Documented)`, `VAUX (Verb Auxiliary)`, `VM (Verb)`, `WQ (wh- qualifier)` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_msri_bn_2.7.0_2.4_1611173659719.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_msri_bn_2.7.0_2.4_1611173659719.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline after tokenization.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_msri", "bn") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([["বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে ` মোদ ' ৷"]], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_msri", "bn") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে ` মোদ ' ৷").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে ` মোদ ' ৷"] pos_df = nlu.load('bn.pos').predict(text, output_level = "token") pos_df ```
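For a quick sanity check of the predictions, the token-level tags (for example the `result` column shown in the Results section) can be tallied with plain Python:

```python
from collections import Counter

# Tag sequence as predicted for the sample sentence above
tags = ["NN", "NNP", "NN", "NN", "VM", "SYM", "NN", "SYM", "SYM"]

tag_counts = Counter(tags)
print(tag_counts.most_common(2))  # [('NN', 4), ('SYM', 3)]
```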
## Results ```bash +------------------------------------------------------+----------------------------------------+ |text |result | +------------------------------------------------------+----------------------------------------+ |বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে ` মোদ ' ৷|[NN, NNP, NN, NN, VM, SYM, NN, SYM, SYM]| +------------------------------------------------------+----------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_msri| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[pos]| |Language:|bn| ## Data Source The model was trained on the _Indian Language POS-Tagged Corpus_ from [NLTK](http://www.nltk.org) collected by A Kumaran (Microsoft Research, India). ## Benchmarking ```bash | | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | BM | 1.00 | 1.00 | 1.00 | 1 | | CC | 0.99 | 0.99 | 0.99 | 390 | | CL | 1.00 | 1.00 | 1.00 | 2 | | DEM | 0.98 | 0.99 | 0.98 | 139 | | INJ | 0.92 | 0.85 | 0.88 | 13 | | INTF | 1.00 | 1.00 | 1.00 | 55 | | JJ | 0.99 | 0.99 | 0.99 | 688 | | NEG | 0.99 | 0.98 | 0.99 | 135 | | NN | 0.99 | 0.99 | 0.99 | 2996 | | NNC | 1.00 | 1.00 | 1.00 | 4 | | NNP | 0.97 | 0.98 | 0.97 | 528 | | NST | 1.00 | 1.00 | 1.00 | 156 | | PPR | 1.00 | 1.00 | 1.00 | 1 | | PRP | 0.98 | 0.98 | 0.98 | 685 | | PSP | 0.99 | 0.99 | 0.99 | 250 | | QC | 0.99 | 0.99 | 0.99 | 193 | | QF | 0.98 | 0.98 | 0.98 | 187 | | QO | 1.00 | 1.00 | 1.00 | 22 | | RB | 0.99 | 0.99 | 0.99 | 187 | | RDP | 1.00 | 0.98 | 0.99 | 44 | | RP | 0.99 | 0.96 | 0.97 | 79 | | SYM | 0.97 | 0.98 | 0.98 | 1413 | | UNK | 1.00 | 1.00 | 1.00 | 1 | | UT | 1.00 | 1.00 | 1.00 | 18 | | VAUX | 0.97 | 0.97 | 0.97 | 400 | | VM | 0.99 | 0.98 | 0.98 | 1393 | | WQ | 1.00 | 0.99 | 0.99 | 71 | | XC | 0.98 | 0.97 | 0.97 | 219 | | accuracy | | | 0.98 | 10270 | | macro avg | 0.99 | 0.98 | 0.99 | 10270 | | 
weighted avg | 0.98 | 0.98 | 0.98 | 10270 | ``` --- layout: model title: Sentence Entity Resolver for Billable ICD10-CM HCC Codes (sbertresolve_icd10cm_slim_billable_hcc_med) author: John Snow Labs name: sbertresolve_icd10cm_slim_billable_hcc_med date: 2021-05-25 tags: [icd10cm, licensed, slim, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.3 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts to ICD10-CM codes using sentence BERT embeddings. In this model, synonyms having low cosine similarity to unnormalized terms are dropped. It also returns the official resolution text within the brackets inside the metadata. The model is augmented with synonyms, and previous augmentations are flexed according to cosine distances to unnormalized terms (ground truths). ## Predicted Entities Outputs 7-digit billable ICD codes. In the result, look for the aux_label parameter in the metadata to get the HCC status. The HCC status can be split into three components: billable status, HCC status, and HCC score. For example, in the output shown below, the billable status is 1, the HCC status is 1, and the HCC score is 11. {:.btn-box} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10cm_slim_billable_hcc_med_en_3.0.3_2.4_1621977523869.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10cm_slim_billable_hcc_med_en_3.0.3_2.4_1621977523869.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings\ .pretrained('sbert_jsl_medium_uncased', 'en','clinical/models')\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") icd10_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_icd10cm_slim_billable_hcc_med","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("icd10cm_code")\ .setDistanceFunction("EUCLIDEAN")\ .setReturnCosineDistances(True) bert_pipeline_icd = Pipeline(stages = [document_assembler, sbert_embedder, icd10_resolver]) data = spark.createDataFrame([["bladder cancer"]]).toDF("text") results = bert_pipeline_icd.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbert_jsl_medium_uncased","en","clinical/models") .setInputCols("document") .setOutputCol("sbert_embeddings") val icd10_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_icd10cm_slim_billable_hcc_med","en", "clinical/models") .setInputCols("sbert_embeddings") .setOutputCol("icd10cm_code") .setDistanceFunction("EUCLIDEAN") .setReturnCosineDistances(true) val bert_pipeline_icd = new Pipeline().setStages(Array(document_assembler, sbert_embedder, icd10_resolver)) val data = Seq("bladder cancer").toDS.toDF("text") val result = bert_pipeline_icd.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd10cm.slim_billable_hcc_med").predict("""bladder cancer""") ```
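The billable/HCC information arrives packed in the `aux_label` metadata field. Assuming the common `||`-delimited encoding (the separator here is an assumption — check it against your actual output), it can be unpacked like this:

```python
def parse_hcc(aux_label: str) -> dict:
    """Split an aux_label such as '1||1||11' into billable status, HCC status, and HCC score."""
    billable, hcc_status, hcc_score = aux_label.split("||")
    return {"billable": billable, "hcc_status": hcc_status, "hcc_score": hcc_score}

print(parse_hcc("1||1||11"))  # {'billable': '1', 'hcc_status': '1', 'hcc_score': '11'}
```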
## Results ```bash | | chunks | code | resolutions | all_codes | billable_hcc_status_score | all_distances | |---:|:---------------|:--------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|--------------------------------------------------------------------------------------:|:----------------------------|:-----------------------------------------------------------------------------------------------------------------| | 0 | bladder cancer | C671 |[bladder cancer, dome [Malignant neoplasm of dome of bladder], cancer of the urinary bladder [Malignant neoplasm of bladder, unspecified], prostate cancer [Malignant neoplasm of prostate], cancer of the urinary bladder, lateral wall [Malignant neoplasm of lateral wall of bladder], cancer of the urinary bladder, anterior wall [Malignant neoplasm of anterior wall of bladder], cancer of the urinary bladder, posterior wall [Malignant neoplasm of posterior wall of bladder], cancer of the urinary bladder, neck [Malignant neoplasm of bladder neck], cancer of the urinary bladder, ureteric orifice [Malignant neoplasm of ureteric orifice]]| [C671, C679, C61, C672, C673, C674, C675, C676, D090, Z126, D494, C670, Z8551, C7911] | ['1', '1', '11'] | [0.0894, 0.1051, 0.1184, 0.1180, 0.1200, 0.1204, 0.1255, 0.1375, 0.1357, 0.1452, 0.1469, 0.1513, 0.1500, 0.1575] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model 
Name:|sbertresolve_icd10cm_slim_billable_hcc_med| |Compatibility:|Healthcare NLP 3.0.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk, sbert_embeddings]| |Output Labels:|[icd10_code]| |Language:|en| |Case sensitive:|false| --- layout: model title: Swedish asr_lm_swedish TFWav2Vec2ForCTC from birgermoell author: John Snow Labs name: asr_lm_swedish date: 2022-09-25 tags: [wav2vec2, sv, audio, open_source, asr] task: Automatic Speech Recognition language: sv edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_lm_swedish` is a Swedish model originally trained by birgermoell. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_lm_swedish_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_lm_swedish_sv_4.2.0_3.0_1664117876808.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_lm_swedish_sv_4.2.0_3.0_1664117876808.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_lm_swedish", "sv")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_lm_swedish", "sv") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_lm_swedish| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|sv| |Size:|757.4 MB| --- layout: model title: Translate English to Gun Pipeline author: John Snow Labs name: translate_en_guw date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, guw, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `guw` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_guw_xx_2.7.0_2.4_1609688347625.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_guw_xx_2.7.0_2.4_1609688347625.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_guw", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_guw", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.guw').predict(text, output_level='sentence') translate_df ```
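Since translation cost grows quickly with input length, long documents are best split into sentence-sized pieces before being fed to the pipeline. A minimal splitter is sketched below (a rough heuristic, not Spark NLP code — in practice a SentenceDetector stage inside the pipeline is the more robust choice):

```python
import re

def split_sentences(text: str):
    """Naively split on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

doc = "Marian is fast. It powers Microsoft Translator! Can it translate this?"
print(split_sentences(doc))
```

Each returned piece can then be passed to `pipeline.annotate` individually, keeping per-call sequence length small.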
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_guw| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Cased model (from evegarcianz) author: John Snow Labs name: distilbert_qa_evegarcianz_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `evegarcianz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_evegarcianz_finetuned_squad_en_4.3.0_3.0_1672765810295.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_evegarcianz_finetuned_squad_en_4.3.0_3.0_1672765810295.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_evegarcianz_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_evegarcianz_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
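After `transform`, the predicted span lives in the `answer` column as a list of annotations. Flattening that column into plain strings can be sketched as below; the `rows` list is a mocked stand-in for `result.select("answer").collect()`, and the only assumption is the standard Spark NLP annotation shape (a `result` string plus `begin`/`end` offsets):

```python
# Each collected row carries a list of annotations; each annotation
# exposes the predicted text in `result` and its character offsets.
rows = [
    {"answer": [{"result": "Clara", "begin": 11, "end": 15}]},
]

def extract_answers(rows):
    """Flatten the annotator output into plain answer strings."""
    return [ann["result"] for row in rows for ann in row["answer"]]

print(extract_answers(rows))  # ['Clara']
```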
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_evegarcianz_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/evegarcianz/bert-finetuned-squad --- layout: model title: Legal NER on EDGAR Documents author: John Snow Labs name: legner_sec_edgar date: 2023-04-13 tags: [en, licensed, legal, ner, sec, edgar] task: Named Entity Recognition language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Legal NER model extracts `ORG`, `INST`, `LAW`, `COURT`, `PER`, `LOC`, `MISC`, `ALIAS`, and `TICKER` entities from the US SEC EDGAR documents. ## Predicted Entities `ALIAS`, `COURT`, `INST`, `LAW`, `LOC`, `MISC`, `ORG`, `PER`, `TICKER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_sec_edgar_en_1.0.0_3.0_1681397579002.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_sec_edgar_en_1.0.0_3.0_1681397579002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_sec_edgar", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""In our opinion, the accompanying consolidated balance sheets and the related consolidated statements of operations, of changes in stockholders' equity, and of cash flows present fairly, in all material respects, the financial position of SunGard Capital Corp. II and its subsidiaries ( SCC II ) at December 31, 2010, and 2009, and the results of their operations and their cash flows for each of the three years in the period ended December 31, 2010, in conformity with accounting principles generally accepted in the United States of America."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
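The `ner_chunk` annotations carry the entity label in their metadata, so pairing each chunk with its label reproduces the table shown in the Results section. A mocked, Spark-free sketch — the annotation shape (`result` text plus a `metadata["entity"]` label) is the standard Spark NLP chunk schema, and the sample chunks are taken from the example sentence:

```python
# Mocked ner_chunk annotations as they would come back from
# result.select("ner_chunk").collect(); only the fields used
# below are modeled.
chunks = [
    {"result": "SunGard Capital Corp. II", "metadata": {"entity": "ORG"}},
    {"result": "SCC II", "metadata": {"entity": "ALIAS"}},
    {"result": "United States of America", "metadata": {"entity": "LOC"}},
]

def to_pairs(chunks):
    """Return (chunk, ner_label) rows like the Results table."""
    return [(c["result"], c["metadata"]["entity"]) for c in chunks]

for chunk, label in to_pairs(chunks):
    print(f"{chunk} -> {label}")
```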
## Results ```bash +----------------------------------------+---------+ |chunk |ner_label| +----------------------------------------+---------+ |SunGard Capital Corp. II |ORG | |SCC II |ALIAS | |accounting principles generally accepted|LAW | |United States of America |LOC | +----------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_sec_edgar| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References In-house annotations ## Benchmarking ```bash label precision recall f1-score support ALIAS 0.86 0.74 0.79 84 COURT 0.86 1.00 0.92 6 INST 0.94 0.76 0.84 76 LAW 0.91 0.93 0.92 166 LOC 0.89 0.88 0.88 140 MISC 0.90 0.83 0.86 226 ORG 0.89 0.93 0.91 430 PER 0.92 0.92 0.92 66 TICKER 1.00 0.86 0.92 7 micro-avg 0.90 0.88 0.89 1201 macro-avg 0.91 0.87 0.89 1201 weighted-avg 0.90 0.88 0.89 1201 ``` --- layout: model title: English DistilBertForQuestionAnswering model (from andi611) Squad2 with Neg, Repeat author: John Snow Labs name: distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_repeat date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-squad2-with-ner-with-neg-with-repeat` is an English model originally trained by `andi611`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_repeat_en_4.0.0_3.0_1654727466812.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_repeat_en_4.0.0_3.0_1654727466812.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_repeat","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_repeat","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq("What is my name?", "My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_conll.distil_bert.base_uncased_with_neg_with_repeat.by_andi611").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_repeat| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/andi611/distilbert-base-uncased-squad2-with-ner-with-neg-with-repeat --- layout: model title: Legal Independent contractor Clause Binary Classifier author: John Snow Labs name: legclf_independent_contractor_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `independent-contractor` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `independent-contractor` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_independent_contractor_clause_en_1.0.0_3.2_1660122527352.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_independent_contractor_clause_en_1.0.0_3.2_1660122527352.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_independent_contractor_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
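As the description notes, several binary clause classifiers can be stacked and their outputs reduced to a per-clause True/False map. A minimal, Spark-free sketch of that reduction, assuming each classifier's `category` output is either its clause name or `other` (the extra clause names below are hypothetical examples, not models from this card):

```python
# Mocked category outputs from three hypothetical clause classifiers,
# keyed by the clause each model detects.
outputs = {
    "independent-contractor": ["independent-contractor"],
    "confidentiality": ["other"],
    "non-compete": ["other"],
}

def clause_flags(outputs):
    """Map each clause to True when its classifier fired, else False."""
    return {clause: preds[0] == clause for clause, preds in outputs.items()}

print(clause_flags(outputs))
# {'independent-contractor': True, 'confidentiality': False, 'non-compete': False}
```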
## Results ```bash +------------------------+ |result| +------------------------+ |[independent-contractor]| |[other]| |[other]| |[independent-contractor]| +------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_independent_contractor_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support independent-contractor 1.00 1.00 1.00 34 other 1.00 1.00 1.00 101 accuracy - - 1.00 135 macro-avg 1.00 1.00 1.00 135 weighted-avg 1.00 1.00 1.00 135 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from guhuawuli) author: John Snow Labs name: distilbert_qa_guhuawuli_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `guhuawuli`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_guhuawuli_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770918999.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_guhuawuli_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770918999.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_guhuawuli_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_guhuawuli_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_guhuawuli_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/guhuawuli/distilbert-base-uncased-finetuned-squad --- layout: model title: English ALBERT Embeddings (xx-large) author: John Snow Labs name: albert_embeddings_albert_xxlarge_v1 date: 2022-04-14 tags: [albert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: AlBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `albert-xxlarge-v1` is an English model originally trained by Hugging Face. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_xxlarge_v1_en_3.4.2_3.0_1649954172408.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_xxlarge_v1_en_3.4.2_3.0_1649954172408.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_xxlarge_v1","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_xxlarge_v1","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.albert_xxlarge_v1").predict("""I love Spark NLP""") ```
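Each token row in the `embeddings` column carries a dense vector (4096-dimensional for an xxlarge ALBERT checkpoint). A common downstream step is comparing tokens or pooled sentences by cosine similarity; a library-free sketch, using short toy vectors in place of real embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 4-d vectors standing in for real 4096-d ALBERT embeddings.
spark_vec = [0.1, 0.3, -0.2, 0.7]
nlp_vec = [0.2, 0.25, -0.1, 0.6]
print(round(cosine(spark_vec, nlp_vec), 3))
```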
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_embeddings_albert_xxlarge_v1| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|834.9 MB| |Case sensitive:|false| ## References - https://huggingface.co/albert-xxlarge-v1 - https://arxiv.org/abs/1909.11942 - https://github.com/google-research/albert - https://yknzhu.wixsite.com/mbweb - https://en.wikipedia.org/wiki/English_Wikipedia --- layout: model title: Relation Extraction Between Body Parts and Direction Entities (ReDL) author: John Snow Labs name: redl_bodypart_direction_biobert date: 2023-01-14 tags: [licensed, en, clinical, relation_extraction, tensorflow] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Relation extraction between body part entities (e.g. Internal_organ_or_component, External_body_part_or_region) and direction entities (e.g. upper, lower) in clinical texts. 1: the body part and direction entities are related; 0: they are not related.
## Predicted Entities `1`, `0` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.1.Clinical_Relation_Extraction_BodyParts_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_direction_biobert_en_4.2.4_3.0_1673710170047.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_direction_biobert_en_4.2.4_3.0_1673710170047.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") words_embedder = WordEmbeddingsModel() \ .pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverterInternal() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel() \ .pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") # Set a filter on pairs of named entities which will be treated as relation candidates re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks")\ .setRelationPairs(['direction-external_body_part_or_region', 'external_body_part_or_region-direction', 'direction-internal_organ_or_component', 'internal_organ_or_component-direction' ]) # The dataset this model is trained on is sentence-wise. # This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input.
re_model = RelationExtractionDLModel()\ .pretrained('redl_bodypart_direction_biobert', 'en', "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) data = spark.createDataFrame([[''' MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia ''']]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") .setRelationPairs(Array("direction-external_body_part_or_region", 
"external_body_part_or_region-direction", "direction-internal_organ_or_component", "internal_organ_or_component-direction")) // The dataset this model is trained on is sentence-wise. // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. val re_model = RelationExtractionDLModel() .pretrained("redl_bodypart_direction_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation").predict(""" MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia """) ```
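Since the model emits a `1`/`0` label per candidate pair plus a confidence score, a typical post-processing step keeps only positively related pairs above a threshold, mirroring `setPredictionThreshold(0.5)`. A mocked, Spark-free sketch, with rows shaped like the Results table below:

```python
# Mocked relation rows shaped like the model's output metadata.
relations = [
    {"relation": "1", "chunk1": "upper", "chunk2": "brain stem", "confidence": 0.9999989},
    {"relation": "0", "chunk1": "upper", "chunk2": "cerebellum", "confidence": 0.99992585},
    {"relation": "1", "chunk1": "left", "chunk2": "cerebellum", "confidence": 1.0},
]

def related_pairs(relations, threshold=0.5):
    """Keep pairs labeled 1 whose confidence clears the threshold."""
    return [(r["chunk1"], r["chunk2"]) for r in relations
            if r["relation"] == "1" and r["confidence"] >= threshold]

print(related_pairs(relations))  # [('upper', 'brain stem'), ('left', 'cerebellum')]
```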
## Results ```bash | index | relations | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |-------|-----------|-----------------------------|---------------|-------------|------------|-----------------------------|---------------|-------------|---------------|------------| | 0 | 1 | Direction | 35 | 39 | upper | Internal_organ_or_component | 41 | 50 | brain stem | 0.9999989 | | 1 | 0 | Direction | 35 | 39 | upper | Internal_organ_or_component | 59 | 68 | cerebellum | 0.99992585 | | 2 | 0 | Direction | 35 | 39 | upper | Internal_organ_or_component | 81 | 93 | basil ganglia | 0.9999999 | | 3 | 0 | Internal_organ_or_component | 41 | 50 | brain stem | Direction | 54 | 57 | left | 0.999811 | | 4 | 0 | Internal_organ_or_component | 41 | 50 | brain stem | Direction | 75 | 79 | right | 0.9998203 | | 5 | 1 | Direction | 54 | 57 | left | Internal_organ_or_component | 59 | 68 | cerebellum | 1.0 | | 6 | 0 | Direction | 54 | 57 | left | Internal_organ_or_component | 81 | 93 | basil ganglia | 0.97616416 | | 7 | 0 | Internal_organ_or_component | 59 | 68 | cerebellum | Direction | 75 | 79 | right | 0.953046 | | 8 | 1 | Direction | 75 | 79 | right | Internal_organ_or_component | 81 | 93 | basil ganglia | 1.0 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_bodypart_direction_biobert| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|401.7 MB| ## References Trained on an internal dataset. ## Benchmarking ```bash label Recall Precision F1 Support 0 0.856 0.873 0.865 153 1 0.986 0.984 0.985 1347 Avg.
0.921 0.929 0.925 - ``` --- layout: model title: English T5ForConditionalGeneration Base Cased model (from allenai) author: John Snow Labs name: t5_unifiedqa_v2_base_1363200 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `unifiedqa-v2-t5-base-1363200` is an English model originally trained by `allenai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_unifiedqa_v2_base_1363200_en_4.3.0_3.0_1675157943693.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_unifiedqa_v2_base_1363200_en_4.3.0_3.0_1675157943693.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_unifiedqa_v2_base_1363200","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_unifiedqa_v2_base_1363200","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
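UnifiedQA checkpoints expect the question and context in a single input string rather than separate columns, so the `text` column is usually built with a small formatting helper before running the pipeline. The lower-casing and newline separator below follow the upstream UnifiedQA repository's convention and are an assumption about this particular export; verify against your checkpoint:

```python
def to_unifiedqa_input(question, context):
    """Join question and context into one prompt string.

    Assumption: UnifiedQA-style inputs are lower-cased, with the
    question and context separated by a newline (upstream repo
    convention; confirm before relying on it).
    """
    return question.lower() + " \n " + context.lower()

prompt = to_unifiedqa_input("Which engine is used?", "Spark NLP runs on TensorFlow.")
print(prompt)
```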
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_unifiedqa_v2_base_1363200| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|474.3 MB| ## References - https://huggingface.co/allenai/unifiedqa-v2-t5-base-1363200 - https://github.com/allenai/unifiedqa --- layout: model title: Stop Words Cleaner for Finnish author: John Snow Labs name: stopwords_fi date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: fi edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, fi] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_fi_fi_2.5.4_2.4_1594742441054.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_fi_fi_2.5.4_2.4_1594742441054.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_fi", "fi") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Sen lisäksi, että hän on pohjoisen kuningas, John Snow on englantilainen lääkäri ja johtava anestesian ja lääketieteellisen hygienian kehittämisessä.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_fi", "fi") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Sen lisäksi, että hän on pohjoisen kuningas, John Snow on englantilainen lääkäri ja johtava anestesian ja lääketieteellisen hygienian kehittämisessä.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Sen lisäksi, että hän on pohjoisen kuningas, John Snow on englantilainen lääkäri ja johtava anestesian ja lääketieteellisen hygienian kehittämisessä."""] stopword_df = nlu.load('fi.stopwords').predict(text) stopword_df[["cleanTokens"]] ```
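Conceptually, the cleaner drops any token whose form appears in a language-specific stop list. A minimal, Spark-free sketch of that filtering step — the tiny Finnish list below is an illustrative subset, not the pretrained model's actual stop-word vocabulary:

```python
# Illustrative subset of Finnish stop words; the pretrained model
# ships its own, much larger list.
FI_STOPWORDS = {"sen", "että", "hän", "on", "ja"}

def clean_tokens(tokens, stopwords=FI_STOPWORDS):
    """Drop tokens whose lower-cased form is in the stop list."""
    return [t for t in tokens if t.lower() not in stopwords]

print(clean_tokens(["Sen", "lisäksi", "että", "hän", "on", "pohjoisen", "kuningas"]))
# ['lisäksi', 'pohjoisen', 'kuningas']
```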
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=11, end=11, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=25, end=33, result='pohjoisen', metadata={'sentence': '0'}), Row(annotatorType='token', begin=35, end=42, result='kuningas', metadata={'sentence': '0'}), Row(annotatorType='token', begin=43, end=43, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=45, end=48, result='John', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_fi| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|fi| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: Stop Words Cleaner for Thai author: John Snow Labs name: stopwords_th date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: th edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, th] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_th_th_2.5.4_2.4_1594742440606.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_th_th_2.5.4_2.4_1594742440606.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_th", "th") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("นอกเหนือจากการเป็นราชาแห่งทิศเหนือแล้วจอห์นสโนว์ยังเป็นแพทย์ชาวอังกฤษและเป็นผู้นำในการพัฒนายาระงับความรู้สึกและสุขอนามัยทางการแพทย์") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_th", "th") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("นอกเหนือจากการเป็นราชาแห่งทิศเหนือแล้วจอห์นสโนว์ยังเป็นแพทย์ชาวอังกฤษและเป็นผู้นำในการพัฒนายาระงับความรู้สึกและสุขอนามัยทางการแพทย์").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""นอกเหนือจากการเป็นราชาแห่งทิศเหนือแล้วจอห์นสโนว์ยังเป็นแพทย์ชาวอังกฤษและเป็นผู้นำในการพัฒนายาระงับความรู้สึกและสุขอนามัยทางการแพทย์"""] stopword_df = nlu.load('th.stopwords').predict(text) stopword_df[['cleanTokens']] ```
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=130, result='นอกเหนือจากการเป็นราชาแห่งทิศเหนือแล้วจอห์นสโนว์ยังเป็นแพทย์ชาวอังกฤษและเป็นผู้นำในการพัฒนายาระงับความรู้สึกและสุขอนามัยทางการแพทย์', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_th| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|th| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: Legal European Construction Document Classifier (EURLEX) author: John Snow Labs name: legclf_european_construction_bert date: 2023-03-06 tags: [en, legal, classification, clauses, european_construction, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_european_construction_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class European_Construction or not (binary classification) according to EuroVoc labels.
## Predicted Entities `European_Construction`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_european_construction_bert_en_1.0.0_3.0_1678111732690.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_european_construction_bert_en_1.0.0_3.0_1678111732690.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_european_construction_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[European_Construction]| |[Other]| |[Other]| |[European_Construction]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_european_construction_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support European_Construction 0.85 0.90 0.87 535 Other 0.88 0.83 0.86 505 accuracy - - 0.87 1040 macro-avg 0.87 0.87 0.87 1040 weighted-avg 0.87 0.87 0.87 1040 ``` --- layout: model title: French CamemBert Embeddings (from adeiMousa) author: John Snow Labs name: camembert_embeddings_adeiMousa_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `adeiMousa`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_adeiMousa_generic_model_fr_3.4.4_3.0_1653987280320.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_adeiMousa_generic_model_fr_3.4.4_3.0_1653987280320.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_adeiMousa_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_adeiMousa_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_adeiMousa_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/adeiMousa/dummy-model --- layout: model title: English Named Entity Recognition (from DeDeckerThomas) author: John Snow Labs name: distilbert_ner_keyphrase_extraction_distilbert_openkp date: 2022-05-16 tags: [distilbert, ner, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyphrase-extraction-distilbert-openkp` is an English model originally trained by `DeDeckerThomas`. ## Predicted Entities `KEY` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_keyphrase_extraction_distilbert_openkp_en_3.4.2_3.0_1652721945024.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_keyphrase_extraction_distilbert_openkp_en_3.4.2_3.0_1652721945024.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_keyphrase_extraction_distilbert_openkp","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_keyphrase_extraction_distilbert_openkp","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_ner_keyphrase_extraction_distilbert_openkp| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/DeDeckerThomas/keyphrase-extraction-distilbert-openkp - https://github.com/microsoft/OpenKP - https://paperswithcode.com/sota?task=Keyphrase+Extraction&dataset=openkp --- layout: model title: Detect diseases in text (large) author: John Snow Labs name: ner_diseases_large date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract mentions of different types of disease in medical text using pretrained NER model. ## Predicted Entities `Disease` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DIAG_PROC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_diseases_large_en_3.0.0_3.0_1617260844811.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_diseases_large_en_3.0.0_3.0_1617260844811.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_diseases_large", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_diseases_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("EXAMPLE_TEXT").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} 
```python import nlu nlu.load("en.med_ner.diseases.large").predict("""Put your text here.""") ```
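The `NerConverter` stage in the pipeline above groups token-level B-/I- tags into entity chunks. A minimal, library-free sketch of that grouping logic (a simplification for illustration, not the actual Spark NLP implementation):

```python
# Simplified BIO-tag chunking: merge a B- tag and its following I- tags
# into one (chunk_text, label) pair, roughly what NerConverter produces.
def bio_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # an "O" tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["He", "has", "chronic", "renal", "insufficiency", "."]
tags = ["O", "O", "B-Disease", "I-Disease", "I-Disease", "O"]
print(bio_to_chunks(tokens, tags))  # [('chronic renal insufficiency', 'Disease')]
```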
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_diseases_large| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| --- layout: model title: Clinical Deidentification Pipeline (English, slim) author: John Snow Labs name: clinical_deidentification_slim date: 2023-06-13 tags: [deidentification, deid, glove, slim, pipeline, clinical, en, licensed] task: Pipeline Healthcare language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline is trained with lightweight `glove_100d` embeddings and can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`, `CITY`, `COUNTRY`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `PROFESSION`, `STREET`, `USERNAME`, `ZIP`, `ACCOUNT`, `LICENSE`, `VIN`, `SSN`, `DLN`, `PLATE`, `IPADDR`, `EMAIL` entities. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_slim_en_4.4.4_3.2_1686665745769.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_slim_en_4.4.4_3.2_1686665745769.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification_slim", "en", "clinical/models") sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""" result = deid_pipeline.annotate(sample) print("\n".join(result['masked'])) print("\n".join(result['masked_with_chars'])) print("\n".join(result['masked_fixed_length_chars'])) print("\n".join(result['obfuscated'])) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = new PretrainedPipeline("clinical_deidentification_slim","en","clinical/models") val sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""" val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("en.de_identify.clinical_slim").predict("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""") ```
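The output styles produced by this pipeline differ only in how each detected PHI chunk is replaced. A toy sketch of the three masking policies (obfuscation, which substitutes realistic fake values drawn from faker lists, is omitted; the chunk and label below are illustrative):

```python
# Toy illustration of three masking policies used in deidentification:
# entity-label masks, same-length character masks, and fixed-length masks.
def mask_with_label(chunk, label):
    return f"<{label}>"

def mask_with_chars(chunk):
    # Same visual length as the original chunk: brackets plus asterisks.
    return "[" + "*" * (len(chunk) - 2) + "]"

def mask_fixed_length(chunk):
    return "****"

chunk, label = "Hendrickson, Ora", "PATIENT"
print(mask_with_label(chunk, label))  # <PATIENT>
print(mask_with_chars(chunk))         # [**************]
print(mask_fixed_length(chunk))       # ****
```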
## Results ```bash Results Masked with entity labels ------------------------------ Name : , Record date: , # . Dr. , ID: , IP . He is a male was admitted to the for cystectomy on . Patient's VIN : , SSN , Driver's license . Phone , , , E-MAIL: . Masked with chars ------------------------------ Name : [**************], Record date: [********], # [****]. Dr. [********], ID: [********], IP [************]. He is a [*********] male was admitted to the [**********] for cystectomy on [******]. Patient's VIN : [***************], SSN [**********], Driver's license [*********]. Phone [************], [***************], [***********], E-MAIL: [*************]. Masked with fixed length chars ------------------------------ Name : ****, Record date: ****, # ****. Dr. ****, ID: ****, IP ****. He is a **** male was admitted to the **** for cystectomy on ****. Patient's VIN : ****, SSN ****, Driver's license ****. Phone ****, ****, ****, E-MAIL: ****. Obfuscated ------------------------------ Name : Layne Nation, Record date: 2093-03-13, # C6240488. Dr. Dr Rosalba Hill, ID: JY:3489547, IP 005.005.005.005. He is a 79 male was admitted to the JOHN MUIR MEDICAL CENTER-CONCORD CAMPUS for cystectomy on 01-25-1997. Patient's VIN : 3CCCC22DDDD333888, SSN SSN-289-37-4495, Driver's license S99983662. Phone 04.32.52.27.90, North Adrienne, Colorado Springs, E-MAIL: Rawland@google.com. 
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification_slim| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|181.9 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - TextMatcherModel - ContextualParserModel - RegexMatcherModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ChunkMergeModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: English BertForQuestionAnswering model (from twmkn9) author: John Snow Labs name: bert_qa_twmkn9_bert_base_uncased_squad2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squad2` is an English model originally trained by `twmkn9`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_twmkn9_bert_base_uncased_squad2_en_4.0.0_3.0_1654181501175.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_twmkn9_bert_base_uncased_squad2_en_4.0.0_3.0_1654181501175.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_twmkn9_bert_base_uncased_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_twmkn9_bert_base_uncased_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.base_uncased.by_twmkn9").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_twmkn9_bert_base_uncased_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/twmkn9/bert-base-uncased-squad2 --- layout: model title: English Bert Embeddings (Large, Uncased) author: John Snow Labs name: bert_embeddings_bert_large_uncased_whole_word_masking date: 2022-04-11 tags: [bert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-uncased-whole-word-masking` is an English model originally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_uncased_whole_word_masking_en_3.4.2_3.0_1649671495082.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_uncased_whole_word_masking_en_3.4.2_3.0_1649671495082.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_uncased_whole_word_masking","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_uncased_whole_word_masking","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.bert_large_uncased_whole_word_masking").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_large_uncased_whole_word_masking| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| ## References - https://huggingface.co/bert-large-uncased-whole-word-masking - https://arxiv.org/abs/1810.04805 - https://github.com/google-research/bert - https://yknzhu.wixsite.com/mbweb - https://en.wikipedia.org/wiki/English_Wikipedia --- layout: model title: Sentence Entity Resolver for CPT codes (Augmented) author: John Snow Labs name: sbiobertresolve_cpt_procedures_augmented date: 2021-05-30 tags: [licensed, entity_resolution, en, clinical] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to CPT codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. This model is enriched with augmented data for better performance. ## Predicted Entities CPT codes and their descriptions. 
{:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cpt_procedures_augmented_en_3.0.4_3.0_1622371775342.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cpt_procedures_augmented_en_3.0.4_3.0_1622371775342.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_cpt_procedures_augmented","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala ... 
val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_cpt_procedures_augmented","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver)) val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.cpt.procedures_augmented").predict("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""") ```
## Results ```bash +--------------------+-----+---+-------+-----+----------+--------------------+--------------------+ | chunk|begin|end| entity| code|confidence| all_k_resolutions| all_k_codes| +--------------------+-----+---+-------+-----+----------+--------------------+--------------------+ | hypertension| 68| 79|PROBLEM|36440| 0.3349|Hypertransfusion:...|36440:::24935:::0...| |chronic renal ins...| 83|109|PROBLEM|50395| 0.0821|Nephrostomy:::Ren...|50395:::50328:::5...| | COPD| 113|116|PROBLEM|32960| 0.1575|Lung collapse pro...|32960:::32215:::1...| | gastritis| 120|128|PROBLEM|43501| 0.1772|Gastric ulcer sut...|43501:::43631:::4...| | TIA| 136|138|PROBLEM|61460| 0.1432|Intracranial tran...|61460:::64742:::2...| |a non-ST elevatio...| 182|202|PROBLEM|61624| 0.1151|Percutaneous non-...|61624:::61626:::3...| |Guaiac positive s...| 208|229|PROBLEM|44005| 0.1115|Enterolysis:::Abd...|44005:::49080:::4...| | mid LAD lesion| 332|345|PROBLEM|0281T| 0.2407|Plication of left...|0281T:::93462:::9...| | hypotension| 362|372|PROBLEM|99135| 0.9935|Induced hypotensi...|99135:::99185:::9...| | bradycardia| 378|388|PROBLEM|99135| 0.3884|Induced hypotensi...|99135:::33305:::3...| | vagal reaction| 466|479|PROBLEM|55450| 0.1427|Vasoligation:::Va...|55450:::64408:::7...| +--------------------+-----+---+-------+-----+----------+--------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_cpt_procedures_augmented| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[cpt_code_aug]| |Language:|en| |Case sensitive:|false| --- layout: model title: Sentence Entity Resolver for RxCUI (``sbiobert_base_cased_mli`` embeddings) author: John Snow Labs name: sbiobertresolve_rxcui language: en nav_key: models repository: clinical/models date: 2020-12-11 task: Entity Resolution edition: Healthcare NLP 2.6.5 spark_version: 2.4 tags: 
[clinical,entity_resolution,en] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model maps extracted medical entities to RxCUI codes using chunk embeddings. {:.h2_title} ## Predicted Entities RxCUI Codes and their normalized definition with ``sbiobert_base_cased_mli`` embeddings. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxcui_en_2.6.4_2.4_1607714146277.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxcui_en_2.6.4_2.4_1607714146277.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use The ```sbiobertresolve_rxcui``` resolver model must be used with ```sbiobert_base_cased_mli``` as embeddings and ```ner_posology``` as the NER model, with ```DRUG``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") rxcui_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_rxcui","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, rxcui_resolver]) data = spark.createDataFrame([["He was seen by the endocrinology service and she was discharged on 50 mg of eltrombopag oral at night, 5 mg amlodipine with meals, and metformin 1000 mg two times a day"]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala ... 
val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val rxcui_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_rxcui","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, rxcui_resolver)) val data = Seq("He was seen by the endocrinology service and she was discharged on 50 mg of eltrombopag oral at night, 5 mg amlodipine with meals, and metformin 1000 mg two times a day").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.h2_title} ## Results ```bash +---------------------------+--------+-----------------------------------------------------+ | chunk | code | term | +---------------------------+--------+-----------------------------------------------------+ | 50 mg of eltrombopag oral | 825427 | eltrombopag 50 MG Oral Tablet | | 5 mg amlodipine | 197361 | amlodipine 5 MG Oral Tablet | | metformin 1000 mg | 861004 | metformin hydrochloride 2000 MG Oral Tablet | +---------------------------+--------+-----------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---------------|---------------------| | Name: | sbiobertresolve_rxcui | | Type: | SentenceEntityResolverModel | | Compatibility: | Spark NLP 2.6.5 + | | License: | Licensed | | Edition: | Official | |Input labels: | [ner_chunk, chunk_embeddings] | |Output labels: | [resolution] | | Language: | en | | Dependencies: | sbiobert_base_cased_mli | {:.h2_title} ## Data Source Trained on November 2020 RxNorm Clinical Drugs ontology graph with ``sbiobert_base_cased_mli`` 
embeddings. https://www.nlm.nih.gov/pubs/techbull/nd20/brief/nd20_rx_norm_november_release.html. [Sample Content](https://rxnav.nlm.nih.gov/REST/rxclass/class/byRxcui.json?rxcui=1000000). --- layout: model title: Part of Speech for Turkish author: John Snow Labs name: pos_ud_imst date: 2021-03-08 tags: [part_of_speech, open_source, turkish, pos_ud_imst, tr] task: Part of Speech Tagging language: tr edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - ADJ - PROPN - PUNCT - ADP - NOUN - VERB - PRON - ADV - NUM - AUX - CCONJ - DET - INTJ - X {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_imst_tr_3.0.0_3.0_1615230214154.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_imst_tr_3.0.0_3.0_1615230214154.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_imst", "tr") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([["John Snow Labs'tan merhaba! "]], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_imst", "tr") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("John Snow Labs'tan merhaba! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["John Snow Labs'tan merhaba! "] token_df = nlu.load('tr.pos.ud_imst').predict(text) token_df ```
## Results ```bash token pos 0 John NOUN 1 Snow PROPN 2 Labs'tan PROPN 3 merhaba NOUN 4 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_imst| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|tr| --- layout: model title: English asr_wav2vec2_large_960h_lv60_self TFWav2Vec2ForCTC from facebook author: John Snow Labs name: pipeline_asr_wav2vec2_large_960h_lv60_self date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_960h_lv60_self` is an English model originally trained by facebook. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_960h_lv60_self_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_960h_lv60_self_en_4.2.0_3.0_1664037014758.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_960h_lv60_self_en_4.2.0_3.0_1664037014758.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_960h_lv60_self', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_960h_lv60_self", lang = "en") val annotations = pipeline.transform(audioDF) ```
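The pipeline expects a DataFrame of raw audio samples; Wav2Vec2 models consume 16 kHz mono audio as normalized float arrays. As a sketch of that preprocessing step (under the assumption that `audio_content` is the column the assembler reads, as in the Wav2Vec2 examples above), here is one way to decode 16-bit PCM WAV data into floats in `[-1.0, 1.0]` using only the Python standard library:

```python
import io
import struct
import wave

def wav_to_floats(fileobj):
    # Decode 16-bit PCM WAV samples into floats in [-1.0, 1.0]
    with wave.open(fileobj, "rb") as wf:
        assert wf.getsampwidth() == 2, "expects 16-bit PCM"
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a tiny in-memory 16 kHz mono WAV so the sketch is self-contained
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))
buf.seek(0)

floats = wav_to_floats(buf)
# audioDF = spark.createDataFrame([(floats,)], ["audio_content"])  # assumed Spark schema
```

The commented-out last line shows how such samples would typically be wrapped into the `audioDF` used by the pipeline.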
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_960h_lv60_self| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|757.3 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English DistilBertForTokenClassification Cased model (from Lucifermorningstar011) author: John Snow Labs name: distilbert_token_classifier_autotrain_final_784824218 date: 2023-03-03 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-final-784824218` is an English model originally trained by `Lucifermorningstar011`. ## Predicted Entities `9`, `0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824218_en_4.3.1_3.0_1677881805603.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824218_en_4.3.1_3.0_1677881805603.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824218","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824218","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
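The classifier emits one tag per token; downstream, contiguous `B-`/`I-` tags are usually merged into entity chunks (the job Spark NLP's `NerConverter` performs). That grouping step can be sketched in plain Python (toy tokens and tags, not actual model output):

```python
def bio_to_chunks(tokens, tags):
    # Merge a B-X tag and its following I-X tags into (chunk_text, label) pairs
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            # O tag (or inconsistent I- tag): close any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

chunks = bio_to_chunks(
    ["John", "Snow", "lives", "in", "London"],
    ["B-PER", "I-PER", "O", "O", "B-LOC"],
)
```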
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_final_784824218| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Lucifermorningstar011/autotrain-final-784824218 --- layout: model title: Part of Speech for Romanian author: John Snow Labs name: pos_ud_rrt date: 2020-05-04 23:32:00 +0800 task: Part of Speech Tagging language: ro edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [pos, ro] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_rrt_ro_2.5.0_2.4_1588622539956.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_rrt_ro_2.5.0_2.4_1588622539956.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_rrt", "ro") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("În afară de a fi regele nordului, John Snow este un medic englez și un lider în dezvoltarea anesteziei și igienei medicale.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_rrt", "ro") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("În afară de a fi regele nordului, John Snow este un medic englez și un lider în dezvoltarea anesteziei și igienei medicale.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""În afară de a fi regele nordului, John Snow este un medic englez și un lider în dezvoltarea anesteziei și igienei medicale."""] pos_df = nlu.load('ro.pos.ud_rrt').predict(text, output_level='token') pos_df ```
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=1, result='ADP', metadata={'word': 'În'}), Row(annotatorType='pos', begin=3, end=7, result='ADV', metadata={'word': 'afară'}), Row(annotatorType='pos', begin=9, end=10, result='ADP', metadata={'word': 'de'}), Row(annotatorType='pos', begin=12, end=12, result='PART', metadata={'word': 'a'}), Row(annotatorType='pos', begin=14, end=15, result='AUX', metadata={'word': 'fi'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_rrt| |Type:|pos| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|ro| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Chinese BertForMaskedLM Cased model (from hfl) author: John Snow Labs name: bert_embeddings_rbt4_h312 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rbt4-h312` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt4_h312_zh_4.2.4_3.0_1670327101139.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt4_h312_zh_4.2.4_3.0_1670327101139.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt4_h312","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt4_h312","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_rbt4_h312| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|43.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/rbt4-h312 - https://github.com/iflytek/MiniRBT - https://github.com/ymcui/LERT - https://github.com/ymcui/PERT - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/iflytek/HFL-Anthology --- layout: model title: English DistilBertForQuestionAnswering Base Cased model (from nlpunibo) author: John Snow Labs name: distilbert_qa_base_config1 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert_base_config1` is an English model originally trained by `nlpunibo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_config1_en_4.3.0_3.0_1672774413705.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_config1_en_4.3.0_3.0_1672774413705.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_config1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_config1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
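Under the hood, an extractive QA head scores every context token as a possible answer start and end, and the predicted answer is the span `(i, j)` with `i <= j` that maximizes `start[i] + end[j]`. A toy sketch of that span selection over hypothetical logits (not actual model output):

```python
def best_span(tokens, start_logits, end_logits):
    # Pick (i, j) with i <= j maximizing start_logits[i] + end_logits[j]
    best, best_score = (0, 0), float("-inf")
    for i in range(len(tokens)):
        for j in range(i, len(tokens)):
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    i, j = best
    return " ".join(tokens[i : j + 1])

context = ["My", "name", "is", "Clara"]
start = [0.1, 0.2, 0.3, 2.5]  # hypothetical start logits
end = [0.0, 0.1, 0.2, 3.0]    # hypothetical end logits
answer = best_span(context, start, end)
```

Production implementations add refinements (a maximum answer length, a no-answer score), but the core search is the same.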
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_config1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/nlpunibo/distilbert_base_config1 --- layout: model title: Extract Entities in Spanish Clinical Trial Abstracts (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_clinical_trials_abstracts date: 2022-08-11 tags: [es, clinical, licensed, token_classification, bert, ner] task: Named Entity Recognition language: es edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Named Entity Recognition model is intended for detecting relevant entities from Spanish clinical trial abstracts and trained using the BertForTokenClassification method from the transformers library and [BERT based](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) embeddings. The model detects Pharmacological and Chemical Substances (CHEM), pathologies (DISO), and lab tests, diagnostic or therapeutic procedures (PROC). ## Predicted Entities `CHEM`, `DISO`, `PROC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_trials_abstracts_es_4.0.2_3.0_1660229117151.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_trials_abstracts_es_4.0.2_3.0_1660229117151.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_clinical_trials_abstracts", "es", "clinical/models")\ .setInputCols("token", "sentence")\ .setOutputCol("label")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["sentence","token","label"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame(["""Efecto de la suplementación con ácido fólico sobre los niveles de homocisteína total en pacientes en hemodiálisis. La hiperhomocisteinemia es un marcador de riesgo independiente de morbimortalidad cardiovascular. 
Hemos prospectivamente reducir los niveles de homocisteína total (tHcy) mediante suplemento con ácido fólico y vitamina B6 (pp), valorando su posible correlación con dosis de diálisis, función residual y parámetros nutricionales."""], StringType()).toDF("text") result = model.transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_clinical_trials_abstracts", "es", "clinical/models") .setInputCols(Array("token", "sentence")) .setOutputCol("label") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("sentence","token","label")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val data = Seq("""Efecto de la suplementación con ácido fólico sobre los niveles de homocisteína total en pacientes en hemodiálisis. La hiperhomocisteinemia es un marcador de riesgo independiente de morbimortalidad cardiovascular. Hemos prospectivamente reducir los niveles de homocisteína total (tHcy) mediante suplemento con ácido fólico y vitamina B6 (pp), valorando su posible correlación con dosis de diálisis, función residual y parámetros nutricionales.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.classify.bert_token.clinical_trials_abstract").predict("""Efecto de la suplementación con ácido fólico sobre los niveles de homocisteína total en pacientes en hemodiálisis. La hiperhomocisteinemia es un marcador de riesgo independiente de morbimortalidad cardiovascular. 
Hemos prospectivamente reducir los niveles de homocisteína total (tHcy) mediante suplemento con ácido fólico y vitamina B6 (pp), valorando su posible correlación con dosis de diálisis, función residual y parámetros nutricionales.""") ```
## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |suplementación |PROC | |ácido fólico |CHEM | |niveles de homocisteína|PROC | |hemodiálisis |PROC | |hiperhomocisteinemia |DISO | |niveles de homocisteína|PROC | |tHcy |PROC | |ácido fólico |CHEM | |vitamina B6 |CHEM | |pp |CHEM | |diálisis |PROC | |función residual |PROC | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_clinical_trials_abstracts| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|es| |Size:|410.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References The model is prepared using the reference paper: "A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine", Leonardo Campillos-Llanos, Ana Valverde-Mateos, Adrián Capllonch-Carrión and Antonio Moreno-Sandoval. 
BMC Medical Informatics and Decision Making volume 21, Article number: 69 (2021) ## Benchmarking ```bash label precision recall f1-score support B-CHEM 0.9335 0.9314 0.9325 4944 I-CHEM 0.8210 0.8689 0.8443 1251 B-DISO 0.9406 0.9429 0.9417 5538 I-DISO 0.9071 0.9115 0.9093 5129 B-PROC 0.8850 0.9113 0.8979 5893 I-PROC 0.8711 0.8615 0.8663 7047 micro-avg 0.9010 0.9070 0.9040 29802 macro-avg 0.8930 0.9046 0.8987 29802 weighted-avg 0.9012 0.9070 0.9040 29802 ``` --- layout: model title: Estonian Legal Roberta Embeddings author: John Snow Labs name: roberta_base_estonian_legal date: 2023-02-16 tags: [et, estonian, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: et edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-estonian-roberta-base` is an Estonian model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_estonian_legal_et_4.2.4_3.0_1676577830758.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_estonian_legal_et_4.2.4_3.0_1676577830758.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_estonian_legal", "et")\ .setInputCols(["sentence"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_estonian_legal", "et") .setInputCols("sentence") .setOutputCol("embeddings") ```
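Embedding vectors are usually consumed by comparing them, for example scoring how similar two spans of legal text are via cosine similarity. A minimal pure-Python sketch over toy vectors (standing in for the model's high-dimensional output):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

same = cosine([1.0, 0.0], [2.0, 0.0])   # identical direction -> 1.0
ortho = cosine([1.0, 0.0], [0.0, 1.0])  # orthogonal -> 0.0
```

In practice the vectors in the `embeddings` output column would be compared this way (or fed to a classifier) downstream.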
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_estonian_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|et| |Size:|416.0 MB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-estonian-roberta-base --- layout: model title: NER Pipeline for Anatomy Entities - Voice of the Patient author: John Snow Labs name: ner_vop_anatomy_pipeline date: 2023-06-09 tags: [licensed, ner, en, pipeline, vop, anatomy] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline extracts mentions of anatomical sites from health-related text in colloquial language. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_anatomy_pipeline_en_4.4.3_3.0_1686341261132.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_anatomy_pipeline_en_4.4.3_3.0_1686341261132.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_vop_anatomy_pipeline", "en", "clinical/models") pipeline.annotate(" Ugh, I pulled a muscle in my neck from sleeping weird last night. It's like a knot in my trapezius and it hurts to turn my head. ") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_vop_anatomy_pipeline", "en", "clinical/models") val result = pipeline.annotate(" Ugh, I pulled a muscle in my neck from sleeping weird last night. It's like a knot in my trapezius and it hurts to turn my head. ") ```
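Downstream code typically flattens the pipeline output into (chunk, label) pairs. A sketch of that post-processing step — the dictionary shape below is an assumption about the annotate-style output, shown with mocked data rather than a real pipeline result:

```python
# Hypothetical shape of an annotate()-style result, for illustration only:
# each chunk annotation carries its text plus an 'entity' metadata field.
annotations = {
    "ner_chunk": [
        {"result": "muscle", "metadata": {"entity": "BodyPart"}},
        {"result": "neck", "metadata": {"entity": "BodyPart"}},
        {"result": "trapezius", "metadata": {"entity": "BodyPart"}},
        {"result": "head", "metadata": {"entity": "BodyPart"}},
    ]
}

def chunks_with_labels(annotations, col="ner_chunk"):
    """Collect (chunk, label) pairs from an annotate-style dict."""
    return [(a["result"], a["metadata"]["entity"]) for a in annotations.get(col, [])]

for chunk, label in chunks_with_labels(annotations):
    print(f"{chunk} -> {label}")
```

The column name `ner_chunk` is an assumption; check the actual output columns of the loaded pipeline before relying on it.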
## Results ```bash | chunk | ner_label | |:----------|:------------| | muscle | BodyPart | | neck | BodyPart | | trapezius | BodyPart | | head | BodyPart | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_anatomy_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|791.6 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_samantharhay TFWav2Vec2ForCTC from samantharhay author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_samantharhay date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_samantharhay` is an English model originally trained by samantharhay. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_by_samantharhay_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_samantharhay_en_4.2.0_3.0_1664102981328.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_samantharhay_en_4.2.0_3.0_1664102981328.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_samantharhay', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_samantharhay", lang = "en") val annotations = pipeline.transform(audioDF) ```
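The pipeline consumes audio as arrays of floats (16 kHz mono is the usual wav2vec2 sampling rate). If your audio arrives as raw 16-bit PCM, it first needs to be normalized before building `audioDF`; a small stdlib-only helper sketching that conversion (the divisor 32768 assumes little-endian signed 16-bit samples):

```python
import struct

def pcm16_to_floats(raw: bytes):
    """Convert little-endian 16-bit PCM samples to floats in [-1.0, 1.0),
    the representation wav2vec2-style models typically expect."""
    n = len(raw) // 2
    samples = struct.unpack("<%dh" % n, raw[: n * 2])
    return [s / 32768.0 for s in samples]

# Four synthetic samples: silence, half amplitude up/down, near full scale.
raw = struct.pack("<4h", 0, 16384, -16384, 32767)
print(pcm16_to_floats(raw))
```

The resulting float list is what you would place in the `audio_content` column consumed by the pipeline's AudioAssembler stage.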
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_samantharhay| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|354.8 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering model (from niklaspm) author: John Snow Labs name: bert_qa_linkbert_large_finetuned_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `linkbert-large-finetuned-squad` is an English model originally trained by `niklaspm`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_linkbert_large_finetuned_squad_en_4.0.0_3.0_1654188104988.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_linkbert_large_finetuned_squad_en_4.0.0_3.0_1654188104988.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_linkbert_large_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_linkbert_large_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.link_bert.large").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
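After `transform`, each row's `answer` column holds span annotations whose metadata usually carries a confidence score. A sketch of picking the best answer per row — the dictionary layout below is an assumed, mocked shape of collected results, not a guaranteed API:

```python
# Hypothetical shape of one collected result row, for illustration only:
rows = [
    {"question": "What's my name?",
     "answer": [{"result": "Clara", "metadata": {"score": "0.98"}}]},
]

def best_answer(row):
    """Return the highest-scoring answer string for a row, or None if empty."""
    answers = row.get("answer", [])
    if not answers:
        return None
    return max(answers, key=lambda a: float(a["metadata"].get("score", 0)))["result"]

print(best_answer(rows[0]))
```

The metadata key `score` is an assumption; inspect one real annotation's metadata to confirm the field name before using this in production.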
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_linkbert_large_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/niklaspm/linkbert-large-finetuned-squad - https://arxiv.org/abs/2203.15827 --- layout: model title: Translate Catalan to English Pipeline author: John Snow Labs name: translate_ca_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ca, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `ca` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ca_en_xx_2.7.0_2.4_1609691497747.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ca_en_xx_2.7.0_2.4_1609691497747.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ca_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ca_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ca.translate_to.en').predict(text, output_level='sentence') translate_df ```
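Since the Marian module is computationally expensive on longer inputs, it can help to feed `annotate` a bounded number of sentences at a time rather than one large list. A minimal batching helper:

```python
def batched(items, size):
    """Yield fixed-size batches of the input list; translation calls on
    bounded batches keep memory and latency per call predictable."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

sentences = ["Frase u.", "Frase dos.", "Frase tres.", "Frase quatre.", "Frase cinc."]
for batch in batched(sentences, 2):
    # In a real run you would call pipeline.annotate(batch) here.
    print(batch)
```

Each yielded batch can then be passed to `pipeline.annotate(...)` in turn, accumulating the translations.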
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ca_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: RCT Classifier (BioBERT) author: John Snow Labs name: bert_sequence_classifier_rct_biobert date: 2022-03-01 tags: [licensed, sequence_classification, bert, en, rct] task: Text Classification language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 2.4 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [BioBERT-based](https://github.com/dmis-lab/biobert) classifier that can classify the sections within the abstracts of scientific articles regarding randomized clinical trials (RCT). ## Predicted Entities `BACKGROUND`, `CONCLUSIONS`, `METHODS`, `OBJECTIVE`, `RESULTS` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_rct_biobert_en_3.4.1_2.4_1646129655723.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_rct_biobert_en_3.4.1_2.4_1646129655723.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequenceClassifier_loaded = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_rct_biobert", "en", "clinical/models")\ .setInputCols(["document", "token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier_loaded ]) data = spark.createDataFrame([["""Previous attempts to prevent all the unwanted postoperative responses to major surgery with an epidural hydrophilic opioid , morphine , have not succeeded . The authors ' hypothesis was that the lipophilic opioid fentanyl , infused epidurally close to the spinal-cord opioid receptors corresponding to the dermatome of the surgical incision , gives equal pain relief but attenuates postoperative hormonal and metabolic responses more effectively than does systemic fentanyl ."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_rct_biobert", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq("Previous attempts to prevent all the unwanted postoperative responses to major surgery with an epidural hydrophilic opioid , morphine , have not succeeded . The authors ' hypothesis was that the lipophilic opioid fentanyl , infused epidurally close to the spinal-cord opioid receptors corresponding to the dermatome of the surgical incision , gives equal pain relief but attenuates postoperative hormonal and metabolic responses more effectively than does systemic fentanyl .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical_trials").predict("""Previous attempts to prevent all the unwanted postoperative responses to major surgery with an epidural hydrophilic opioid , morphine , have not succeeded . The authors ' hypothesis was that the lipophilic opioid fentanyl , infused epidurally close to the spinal-cord opioid receptors corresponding to the dermatome of the surgical incision , gives equal pain relief but attenuates postoperative hormonal and metabolic responses more effectively than does systemic fentanyl .""") ```
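Applied sentence by sentence to an abstract, the classifier yields one section label per sentence; collapsing consecutive identical labels recovers the abstract's section layout. A sketch of that post-processing with mocked predictions (not real model output):

```python
# Hypothetical per-sentence predictions for one abstract, for illustration:
predictions = ["BACKGROUND", "BACKGROUND", "METHODS", "RESULTS", "RESULTS", "CONCLUSIONS"]

def section_spans(labels):
    """Collapse consecutive identical section labels into (label, count) spans,
    recovering the section structure of an RCT abstract."""
    spans = []
    for label in labels:
        if spans and spans[-1][0] == label:
            spans[-1] = (label, spans[-1][1] + 1)
        else:
            spans.append((label, 1))
    return spans

print(section_spans(predictions))
```

The label strings come from this model's predicted entities (`BACKGROUND`, `CONCLUSIONS`, `METHODS`, `OBJECTIVE`, `RESULTS`).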
## Results ```bash +--------------------------------------------------+------------+ |text |class | +--------------------------------------------------+------------+ |Previous attempts to prevent all the unwanted p...|[BACKGROUND]| +--------------------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_rct_biobert| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.0 MB| |Case sensitive:|true| |Max sentence length:|128| ## References https://arxiv.org/abs/1710.06071 ## Benchmarking ```bash label precision recall f1-score support BACKGROUND 0.77 0.86 0.81 2000 CONCLUSIONS 0.96 0.95 0.95 2000 METHODS 0.96 0.98 0.97 2000 OBJECTIVE 0.85 0.77 0.81 2000 RESULTS 0.98 0.95 0.96 2000 accuracy - - 0.9 10000 macro-avg 0.9 0.9 0.9 10000 weighted-avg 0.9 0.9 0.9 10000 ``` --- layout: model title: English XlmRoBertaForQuestionAnswering (from SauravMaheshkar) author: John Snow Labs name: xlm_roberta_qa_xlm_multi_roberta_large_chaii date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-multi-roberta-large-chaii` is an English model originally trained by `SauravMaheshkar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_multi_roberta_large_chaii_en_4.0.0_3.0_1655988703987.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_multi_roberta_large_chaii_en_4.0.0_3.0_1655988703987.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_multi_roberta_large_chaii","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_multi_roberta_large_chaii","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.chaii.xlm_roberta.large_multi.by_SauravMaheshkar").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_multi_roberta_large_chaii| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.9 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/SauravMaheshkar/xlm-multi-roberta-large-chaii --- layout: model title: Legal Sub Advisory Agreement Document Binary Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_sub_advisory_agreement_bert date: 2022-12-18 tags: [en, legal, classification, licensed, document, bert, sub, advisory, agreement, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_sub_advisory_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `sub-advisory-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `sub-advisory-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_sub_advisory_agreement_bert_en_1.0.0_3.0_1671393859780.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_sub_advisory_agreement_bert_en_1.0.0_3.0_1671393859780.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_sub_advisory_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
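For binary document classification it is common to act on the prediction only when its confidence clears a threshold. A sketch of that post-processing — the metadata shape below is an assumption about the classifier's output, shown with mocked rows:

```python
# Hypothetical classifier output metadata, for illustration only:
doc_results = [
    {"category": [{"result": "sub-advisory-agreement", "metadata": {"confidence": "0.97"}}]},
    {"category": [{"result": "sub-advisory-agreement", "metadata": {"confidence": "0.58"}}]},
]

def confident_label(row, threshold=0.8, fallback="other"):
    """Keep the predicted class only when its confidence clears the threshold;
    otherwise fall back to the negative class."""
    ann = row["category"][0]
    conf = float(ann["metadata"].get("confidence", 0))
    return ann["result"] if conf >= threshold else fallback

print([confident_label(r) for r in doc_results])
```

The metadata key `confidence` and the 0.8 threshold are assumptions for illustration; tune the threshold against a validation set.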
## Results ```bash +------------------------+ |result | +------------------------+ |[sub-advisory-agreement]| |[other] | |[other] | |[sub-advisory-agreement]| +------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_sub_advisory_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.99 0.99 0.99 204 sub-advisory-agreement 0.98 0.98 0.98 107 accuracy - - 0.99 311 macro-avg 0.99 0.99 0.99 311 weighted-avg 0.99 0.99 0.99 311 ``` --- layout: model title: Recognize Entities OntoNotes - ELECTRA Large author: John Snow Labs name: onto_recognize_entities_electra_large date: 2020-12-09 task: [Named Entity Recognition, Sentence Detection, Embeddings, Pipeline Public] language: en nav_key: models edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [en, open_source, pipeline] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pre-trained pipeline containing a NerDL model trained on OntoNotes 5.0 with `electra_large_uncased` embeddings. It can extract the following 18 entities: ## Predicted Entities `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_electra_large_en_2.7.0_2.4_1607530726468.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_electra_large_en_2.7.0_2.4_1607530726468.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('onto_recognize_entities_electra_large') result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("onto_recognize_entities_electra_large") val result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.") ``` {:.nlu-block} ```python import nlu nlu.load("en.ner.onto.large").predict("""Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.""") ```
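The recognized chunks are often easier to consume grouped by label rather than as a flat list. A sketch using the (chunk, label) pairs from the example sentence (hard-coded here for illustration, matching the Results table rather than a live pipeline call):

```python
from collections import defaultdict

# (chunk, label) pairs as shown in the example results, for illustration:
pairs = [("Johnson", "PERSON"), ("first", "ORDINAL"), ("2001", "DATE"),
         ("eight years", "DATE"), ("London", "GPE"), ("2008 to 2016", "DATE")]

def group_by_label(pairs):
    """Group recognized chunks under their OntoNotes entity label."""
    grouped = defaultdict(list)
    for chunk, label in pairs:
        grouped[label].append(chunk)
    return dict(grouped)

print(group_by_label(pairs))
```

In a real run, the pairs would be built from the pipeline's chunk annotations and their entity metadata.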
{:.h2_title} ## Results ```bash +------------+---------+ |chunk |ner_label| +------------+---------+ |Johnson |PERSON | |first |ORDINAL | |2001 |DATE | |eight years |DATE | |London |GPE | |2008 to 2016|DATE | +------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_recognize_entities_electra_large| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - Tokenizer - BertEmbeddings - NerDLModel - NerConverter --- layout: model title: Part of Speech for Dutch author: John Snow Labs name: pos_ud_alpino date: 2021-03-08 tags: [part_of_speech, open_source, dutch, pos_ud_alpino, nl] task: Part of Speech Tagging language: nl edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - PRON - AUX - ADV - VERB - PUNCT - ADP - NUM - NOUN - SCONJ - DET - ADJ - PROPN - CCONJ - SYM - X - INTJ {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_alpino_nl_3.0.0_3.0_1615230249057.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_alpino_nl_3.0.0_3.0_1615230249057.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_alpino", "nl") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Hallo van John Snow Labs! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_alpino", "nl") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Hallo van John Snow Labs! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Hallo van John Snow Labs! "] token_df = nlu.load('nl.pos.ud_alpino').predict(text) token_df ```
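Once tokens carry UD tags, filtering by tag is a one-liner; for example, keeping only (proper) nouns from the tagged example output:

```python
# Token/tag pairs matching the example results, for illustration:
tagged = [("Hallo", "PROPN"), ("van", "ADP"), ("John", "PROPN"),
          ("Snow", "PROPN"), ("Labs", "PROPN"), ("!", "PUNCT")]

def select_tags(tagged, wanted=("NOUN", "PROPN")):
    """Keep only tokens whose Universal Dependencies tag is in the wanted set."""
    return [tok for tok, tag in tagged if tag in wanted]

print(select_tags(tagged))
```

In a real run, the pairs would be zipped from the `token` and `pos` annotation columns of the transformed DataFrame.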
## Results ```bash token pos 0 Hallo PROPN 1 van ADP 2 John PROPN 3 Snow PROPN 4 Labs PROPN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_alpino| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|nl| --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_8_h_768 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-8_H-768` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_8_h_768_zh_4.2.4_3.0_1670021813941.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_8_h_768_zh_4.2.4_3.0_1670021813941.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_8_h_768","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_8_h_768","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
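Token-level BERT embeddings are often mean-pooled into a single sentence vector for similarity or classification tasks. A framework-independent sketch (the 2-dimensional vectors are illustrative; this model produces 768-dimensional ones):

```python
def mean_pool(token_vectors):
    """Average token embeddings into one sentence vector, a common way to
    derive a sentence representation from token-level BERT output."""
    if not token_vectors:
        return []
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / len(token_vectors) for i in range(dim)]

vectors = [[1.0, 2.0], [3.0, 4.0]]
print(mean_pool(vectors))
```

The same pooling applies to the vectors stored in the `embeddings` annotations of the transformed DataFrame, one list of token vectors per document.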
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_8_h_768| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|277.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-8_H-768 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: English BertForQuestionAnswering Cased model (from Callmenicky) author: John Snow Labs name: bert_qa_callmenicky_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `Callmenicky`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_callmenicky_finetuned_squad_en_4.0.0_3.0_1657185991318.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_callmenicky_finetuned_squad_en_4.0.0_3.0_1657185991318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_callmenicky_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_callmenicky_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_callmenicky_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Callmenicky/bert-finetuned-squad --- layout: model title: Pipeline to Detect Organism in Medical Texts author: John Snow Labs name: bert_token_classifier_ner_linnaeus_species_pipeline date: 2023-03-20 tags: [en, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_linnaeus_species](https://nlp.johnsnowlabs.com/2022/07/25/bert_token_classifier_ner_linnaeus_species_en_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_linnaeus_species_pipeline_en_4.3.0_3.2_1679303734578.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_linnaeus_species_pipeline_en_4.3.0_3.2_1679303734578.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_linnaeus_species_pipeline", "en", "clinical/models") text = '''First identified in chicken, vigilin homologues have now been found in human (6), Xenopus laevis (7), Drosophila melanogaster (8) and Schizosaccharomyces pombe.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_linnaeus_species_pipeline", "en", "clinical/models") val text = "First identified in chicken, vigilin homologues have now been found in human (6), Xenopus laevis (7), Drosophila melanogaster (8) and Schizosaccharomyces pombe." val result = pipeline.fullAnnotate(text) ```
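The `fullAnnotate` call returns a list of annotation maps; the detected species chunks carry their text, character offsets, and metadata in the `result`, `begin`, `end`, and `metadata` fields, which is where the values in the Results table below come from. A minimal, self-contained sketch of that post-processing step (the `Annotation` namedtuple is only a stand-in for the real Spark NLP annotation objects, and the chunk output column is assumed to be named `ner_chunk`):

```python
# Illustrative post-processing of fullAnnotate output into (chunk, begin, end,
# label, confidence) rows. Annotation here is a mock with the same field names
# as Spark NLP's annotation objects.
from collections import namedtuple

Annotation = namedtuple("Annotation", "result begin end metadata")

def chunks_to_rows(chunks):
    return [
        (c.result, c.begin, c.end, c.metadata.get("entity"),
         float(c.metadata.get("confidence", 0)))
        for c in chunks
    ]

mock_chunks = [Annotation("chicken", 20, 26,
                          {"entity": "SPECIES", "confidence": "0.998697"})]
print(chunks_to_rows(mock_chunks))  # -> [('chicken', 20, 26, 'SPECIES', 0.998697)]
```

With a real pipeline run, passing the chunk annotations from `result[0]` (under whatever output column the final converter stage uses) yields the rows shown under Results.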
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:--------------------------|--------:|------:|:------------|-------------:| | 0 | chicken | 20 | 26 | SPECIES | 0.998697 | | 1 | human | 71 | 75 | SPECIES | 0.999767 | | 2 | Xenopus laevis | 82 | 95 | SPECIES | 0.999918 | | 3 | Drosophila melanogaster | 102 | 124 | SPECIES | 0.999925 | | 4 | Schizosaccharomyces pombe | 134 | 158 | SPECIES | 0.999881 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_linnaeus_species_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: English BertForQuestionAnswering model (from clagator) author: John Snow Labs name: bert_qa_biobert_squad2_cased date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobert_squad2_cased` is an English model originally trained by `clagator`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_squad2_cased_en_4.0.0_3.0_1654185692636.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_squad2_cased_en_4.0.0_3.0_1654185692636.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biobert_squad2_cased","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_biobert_squad2_cased","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.biobert.cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_biobert_squad2_cased| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|403.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/clagator/biobert_squad2_cased --- layout: model title: Sundanese asr_wav2vec2_large_xlsr_sundanese TFWav2Vec2ForCTC from cahya author: John Snow Labs name: asr_wav2vec2_large_xlsr_sundanese date: 2022-09-24 tags: [wav2vec2, su, audio, open_source, asr] task: Automatic Speech Recognition language: su edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_sundanese` is a Sundanese model originally trained by cahya. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_sundanese_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_sundanese_su_4.2.0_3.0_1664039112850.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_sundanese_su_4.2.0_3.0_1664039112850.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_sundanese", "su")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_sundanese", "su") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
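For intuition about the `text` output: Wav2Vec2 CTC models emit one label per audio frame, and the final transcription is obtained by collapsing consecutive repeated labels and dropping the CTC blank token. A toy sketch of that greedy collapse step (the frame labels below are invented for illustration; the real annotator decodes from raw audio and its own vocabulary):

```python
# Greedy CTC collapse: merge consecutive repeated frame labels, then drop the
# blank symbol. "_" stands in for the CTC blank token in this toy example.
BLANK = "_"

def ctc_collapse(frames):
    out = []
    prev = None
    for label in frames:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

print(ctc_collapse(list("_hhe_ll_lloo_")))  # -> hello
```

Note how the blank between the two `l` runs keeps "hello" from collapsing into "helo"; that is the role of the blank token in CTC decoding.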
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_sundanese| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|su| |Size:|1.2 GB| --- layout: model title: French Bert Embeddings author: John Snow Labs name: bert_embeddings_bert_base_fr_cased date: 2022-04-11 tags: [bert, embeddings, fr, open_source] task: Embeddings language: fr edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-fr-cased` is a French model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_fr_cased_fr_3.4.2_3.0_1649675673587.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_fr_cased_fr_3.4.2_3.0_1649675673587.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_fr_cased","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark Nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_fr_cased","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark Nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fr.embed.bert_base_fr_cased").predict("""J'adore Spark Nlp""") ```
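The `embeddings` column produced above holds one 768-dimensional vector per token, and a common downstream use is comparing tokens or sentences by cosine similarity. A minimal sketch of that comparison with toy low-dimensional vectors (in practice the vectors come from the `embeddings` annotations):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dim vectors standing in for 768-dim BERT token embeddings.
print(round(cosine([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0]), 2))  # -> 1.0
print(round(cosine([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]), 2))  # -> 0.0
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, which is why cosine similarity is the usual choice for ranking semantically related tokens.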
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_fr_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|fr| |Size:|393.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-fr-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Translate Latvian to English Pipeline author: John Snow Labs name: translate_lv_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, lv, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended.
- source languages: `lv` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_lv_en_xx_2.7.0_2.4_1609690492134.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_lv_en_xx_2.7.0_2.4_1609690492134.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_lv_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_lv_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.lv.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_lv_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Detect Clinical Entities author: John Snow Labs name: ner_jsl_greedy_biobert_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_jsl_greedy_biobert](https://nlp.johnsnowlabs.com/2021/08/13/ner_jsl_greedy_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_greedy_biobert_pipeline_en_3.4.1_3.0_1647869992577.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_greedy_biobert_pipeline_en_3.4.1_3.0_1647869992577.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_jsl_greedy_biobert_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_jsl_greedy_biobert_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. 
His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.biobert_jsl_greedy.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
## Results ```bash | | chunk | entity | |---:|:-----------------------------------------------|:-----------------------------| | 0 | 21-day-old | Age | | 1 | Caucasian | Race_Ethnicity | | 2 | male | Gender | | 3 | for 2 days | Duration | | 4 | congestion | Symptom | | 5 | mom | Gender | | 6 | suctioning yellow discharge | Symptom | | 7 | nares | External_body_part_or_region | | 8 | she | Gender | | 9 | mild problems with his breathing while feeding | Symptom | | 10 | perioral cyanosis | Symptom | | 11 | retractions | Symptom | | 12 | One day ago | RelativeDate | | 13 | mom | Gender | | 14 | tactile temperature | Symptom | | 15 | Tylenol | Drug | | 16 | Baby | Age | | 17 | decreased p.o. intake | Symptom | | 18 | His | Gender | | 19 | breast-feeding | External_body_part_or_region | | 20 | q.2h | Frequency | | 21 | to 5 to 10 minutes | Duration | | 22 | his | Gender | | 23 | respiratory congestion | Symptom | | 24 | He | Gender | | 25 | tired | Symptom | | 26 | fussy | Symptom | | 27 | over the past 2 days | RelativeDate | | 28 | albuterol | Drug | | 29 | ER | Clinical_Dept | | 30 | His | Gender | | 31 | urine output has also decreased | Symptom | | 32 | he | Gender | | 33 | per 24 hours | Frequency | | 34 | he | Gender | | 35 | per 24 hours | Frequency | | 36 | Mom | Gender | | 37 | diarrhea | Symptom | | 38 | His | Gender | | 39 | bowel | Internal_organ_or_component | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_greedy_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from microsoft) author: John Snow Labs name: roberta_qa_xdoc_base_squad2.0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, 
tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xdoc-base-squad2.0` is an English model originally trained by `microsoft`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_xdoc_base_squad2.0_en_4.3.0_3.0_1674224984469.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_xdoc_base_squad2.0_en_4.3.0_3.0_1674224984469.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) question_answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_xdoc_base_squad2.0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[document_assembler, question_answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val questionAnswering = RoBertaForQuestionAnswering.pretrained("roberta_qa_xdoc_base_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, questionAnswering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_xdoc_base_squad2.0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|466.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/microsoft/xdoc-base-squad2.0 - https://arxiv.org/abs/2210.02849 --- layout: model title: English image_classifier_vit_Check_Missing_Teeth ViTForImageClassification from steven123 author: John Snow Labs name: image_classifier_vit_Check_Missing_Teeth date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_Check_Missing_Teeth` is an English model originally trained by steven123. ## Predicted Entities `Missing Teeth`, `Non-Missing Teeth` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Check_Missing_Teeth_en_4.1.0_3.0_1660167758083.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Check_Missing_Teeth_en_4.1.0_3.0_1660167758083.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_Check_Missing_Teeth", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_Check_Missing_Teeth", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_Check_Missing_Teeth| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Fast Neural Machine Translation Model from Luvale to English author: John Snow Labs name: opus_mt_lue_en date: 2020-12-29 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, lue, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. - source languages: `lue` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_lue_en_xx_2.7.0_2.4_1609254447650.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_lue_en_xx_2.7.0_2.4_1609254447650.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_lue_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_lue_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.lue.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_lue_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Persian BertForQuestionAnswering Base Uncased model (from mhmsadegh) author: John Snow Labs name: bert_qa_base_parsbert_uncased_finetuned_squad date: 2022-07-07 tags: [fa, open_source, bert, question_answering] task: Question Answering language: fa edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-parsbert-uncased-finetuned-squad` is a Persian model originally trained by `mhmsadegh`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_parsbert_uncased_finetuned_squad_fa_4.0.0_3.0_1657183402278.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_parsbert_uncased_finetuned_squad_fa_4.0.0_3.0_1657183402278.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_parsbert_uncased_finetuned_squad","fa") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["اسم من چیست؟", "نام من کلارا است و من در برکلی زندگی می کنم."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_parsbert_uncased_finetuned_squad","fa") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("اسم من چیست؟", "نام من کلارا است و من در برکلی زندگی می کنم.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_parsbert_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|fa| |Size:|607.1 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/mhmsadegh/bert-base-parsbert-uncased-finetuned-squad --- layout: model title: Legal Modifications Clause Binary Classifier author: John Snow Labs name: legclf_modifications_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `modifications` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
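As an illustration of the paragraph-splitting-by-multiline option mentioned above, a minimal pure-Python sketch that breaks a document on blank lines before classifying each chunk (the sample clause text is invented; the workshop notebook linked above covers the full Spark NLP tokenization and splitting approach):

```python
import re

def split_paragraphs(document: str):
    # Split on one or more blank lines and drop empty chunks, so each
    # paragraph can be classified independently.
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

# Invented sample text standing in for a real contract.
doc = ("1. AMENDMENTS.\nThis Agreement may be amended only in writing.\n\n"
       "2. NOTICES.\nAll notices shall be delivered by certified mail.")
for paragraph in split_paragraphs(doc):
    print(paragraph[:20])
```

Each resulting paragraph can then be fed to the classifier as a separate `clause_text` row.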
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `modifications` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_modifications_clause_en_1.0.0_3.2_1660123743355.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_modifications_clause_en_1.0.0_3.2_1660123743355.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_modifications_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[modifications]| |[other]| |[other]| |[modifications]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_modifications_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support modifications 0.89 0.83 0.86 76 other 0.92 0.95 0.94 168 accuracy - - 0.91 244 macro-avg 0.91 0.89 0.90 244 weighted-avg 0.91 0.91 0.91 244 ``` --- layout: model title: English BERT Embeddings Cased model (from mrm8488) author: John Snow Labs name: bert_embeddings_bioclinicalbert_finetuned_covid_papers date: 2022-07-15 tags: [en, open_source, bert, embeddings] task: Embeddings language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BERT Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bioclinicalBERT-finetuned-covid-papers` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bioclinicalbert_finetuned_covid_papers_en_4.0.0_3.0_1657880858798.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bioclinicalbert_finetuned_covid_papers_en_4.0.0_3.0_1657880858798.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bioclinicalbert_finetuned_covid_papers","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bioclinicalbert_finetuned_covid_papers","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bioclinicalbert_finetuned_covid_papers| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|406.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/mrm8488/bioclinicalBERT-finetuned-covid-papers --- layout: model title: Detect Assertion Status from Response to Treatment author: John Snow Labs name: assertion_oncology_response_to_treatment_wip date: 2022-10-01 tags: [licensed, clinical, oncology, en, assertion] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects the assertion status of entities related to response to treatment. The model identifies positive mentions (Present_Or_Past status), and hypothetical or absent mentions (Hypothetical_Or_Absent status). ## Predicted Entities `Hypothetical_Or_Absent`, `Present_Or_Past` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_response_to_treatment_wip_en_4.1.0_3.0_1664641698152.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_response_to_treatment_wip_en_4.1.0_3.0_1664641698152.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(["Response_To_Treatment"]) assertion = AssertionDLModel.pretrained("assertion_oncology_response_to_treatment_wip", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion]) data = spark.createDataFrame([["The patient presented no evidence of recurrence."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") 
.setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Response_To_Treatment")) val assertion = AssertionDLModel.pretrained("assertion_oncology_response_to_treatment_wip","en","clinical/models") .setInputCols(Array("sentence","ner_chunk","embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion)) val data = Seq("""The patient presented no evidence of recurrence.""").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.oncology_response_to_treatment_wip").predict("""The patient presented no evidence of recurrence.""") ```
## Results ```bash | chunk | ner_label | assertion | |:-----------|:----------------------|:-----------------------| | recurrence | Response_To_Treatment | Hypothetical_Or_Absent | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_oncology_response_to_treatment_wip| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion_pred]| |Language:|en| |Size:|1.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label precision recall f1-score support Hypothetical_Or_Absent 0.83 0.96 0.89 46.0 Present_Or_Past 0.94 0.79 0.86 43.0 macro-avg 0.89 0.87 0.87 89.0 weighted-avg 0.89 0.88 0.88 89.0 ``` --- layout: model title: English BertForQuestionAnswering model (from ZYW) author: John Snow Labs name: bert_qa_squad_mbert_model date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squad-mbert-model` is an English model originally trained by `ZYW`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_squad_mbert_model_en_4.0.0_3.0_1654192040161.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_squad_mbert_model_en_4.0.0_3.0_1654192040161.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_squad_mbert_model","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_squad_mbert_model","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.multi_lingual_bert.by_ZYW").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_squad_mbert_model| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ZYW/squad-mbert-model --- layout: model title: Sentence Detection in Tamil Text author: John Snow Labs name: sentence_detector_dl date: 2021-08-30 tags: [ta, open_source, sentence_detection] task: Sentence Detection language: ta edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_ta_3.2.0_3.0_1630337465197.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_ta_3.2.0_3.0_1630337465197.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "ta") \ .setInputCols(["document"]) \ .setOutputCol("sentences") sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL])) sd_model.fullAnnotate("""ஆங்கில வாசிப்பு பத்திகளின் சிறந்த ஆதாரத்தைத் தேடுகிறீர்களா? நீங்கள் சரியான இடத்திற்கு வந்துவிட்டீர்கள். சமீபத்திய ஆய்வின்படி, இன்றைய இளைஞர்களிடம் படிக்கும் பழக்கம் வேகமாக குறைந்து வருகிறது. கொடுக்கப்பட்ட ஆங்கில வாசிப்பு பத்தியில் சில வினாடிகளுக்கு மேல் அவர்களால் கவனம் செலுத்த முடியாது! மேலும், அனைத்து போட்டித் தேர்வுகளிலும் வாசிப்பு ஒரு ஒருங்கிணைந்த பகுதியாகும். எனவே, உங்கள் வாசிப்புத் திறனை எவ்வாறு மேம்படுத்துவது? இந்த கேள்விக்கான பதில் உண்மையில் மற்றொரு கேள்வி: வாசிப்பு திறனின் பயன் என்ன? வாசிப்பின் முக்கிய நோக்கம் 'உணர்த்துவது'.""") ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "ta") .setInputCols(Array("document")) .setOutputCol("sentence") val pipeline = new Pipeline().setStages(Array(documenter, model)) val data = Seq("ஆங்கில வாசிப்பு பத்திகளின் சிறந்த ஆதாரத்தைத் தேடுகிறீர்களா? நீங்கள் சரியான இடத்திற்கு வந்துவிட்டீர்கள். சமீபத்திய ஆய்வின்படி, இன்றைய இளைஞர்களிடம் படிக்கும் பழக்கம் வேகமாக குறைந்து வருகிறது. கொடுக்கப்பட்ட ஆங்கில வாசிப்பு பத்தியில் சில வினாடிகளுக்கு மேல் அவர்களால் கவனம் செலுத்த முடியாது! மேலும், அனைத்து போட்டித் தேர்வுகளிலும் வாசிப்பு ஒரு ஒருங்கிணைந்த பகுதியாகும். எனவே, உங்கள் வாசிப்புத் திறனை எவ்வாறு மேம்படுத்துவது? இந்த கேள்விக்கான பதில் உண்மையில் மற்றொரு கேள்வி: வாசிப்பு திறனின் பயன் என்ன? வாசிப்பின் முக்கிய நோக்கம் 'உணர்த்துவது'.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load('ta.sentence_detector').predict("ஆங்கில வாசிப்பு பத்திகளின் சிறந்த ஆதாரத்தைத் தேடுகிறீர்களா? 
நீங்கள் சரியான இடத்திற்கு வந்துவிட்டீர்கள். சமீபத்திய ஆய்வின்படி, இன்றைய இளைஞர்களிடம் படிக்கும் பழக்கம் வேகமாக குறைந்து வருகிறது. கொடுக்கப்பட்ட ஆங்கில வாசிப்பு பத்தியில் சில வினாடிகளுக்கு மேல் அவர்களால் கவனம் செலுத்த முடியாது! மேலும், அனைத்து போட்டித் தேர்வுகளிலும் வாசிப்பு ஒரு ஒருங்கிணைந்த பகுதியாகும். எனவே, உங்கள் வாசிப்புத் திறனை எவ்வாறு மேம்படுத்துவது? இந்த கேள்விக்கான பதில் உண்மையில் மற்றொரு கேள்வி: வாசிப்பு திறனின் பயன் என்ன? வாசிப்பின் முக்கிய நோக்கம் 'உணர்த்துவது'.", output_level ='sentence') ```
## Results ```bash +--------------------------------------------------------------------------------------------------+ |result | +--------------------------------------------------------------------------------------------------+ |[ஆங்கில வாசிப்பு பத்திகளின் சிறந்த ஆதாரத்தைத் தேடுகிறீர்களா?] | |[நீங்கள் சரியான இடத்திற்கு வந்துவிட்டீர்கள்.] | |[சமீபத்திய ஆய்வின்படி, இன்றைய இளைஞர்களிடம் படிக்கும் பழக்கம் வேகமாக குறைந்து வருகிறது.] | |[கொடுக்கப்பட்ட ஆங்கில வாசிப்பு பத்தியில் சில வினாடிகளுக்கு மேல் அவர்களால் கவனம் செலுத்த முடியாது!]| |[மேலும், அனைத்து போட்டித் தேர்வுகளிலும் வாசிப்பு ஒரு ஒருங்கிணைந்த பகுதியாகும்.] | |[எனவே, உங்கள் வாசிப்புத் திறனை எவ்வாறு மேம்படுத்துவது?] | |[இந்த கேள்விக்கான பதில் உண்மையில் மற்றொரு கேள்வி:] | |[வாசிப்பு திறனின் பயன் என்ன?] | |[வாசிப்பின் முக்கிய நோக்கம் 'உணர்த்துவது'.] | +--------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentence_detector_dl| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[sentences]| |Language:|ta| ## Benchmarking ```bash label Accuracy Recall Prec F1 0 0.98 1.00 0.96 0.98 ``` --- layout: model title: English BertForSequenceClassification Mini Cased model (from mrm8488) author: John Snow Labs name: bert_sequence_classifier_mini_finetuned_age_news_classification date: 2022-07-13 tags: [en, open_source, bert, sequence_classification] task: Text Classification language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-mini-finetuned-age_news-classification` is an English model originally trained by `mrm8488`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_mini_finetuned_age_news_classification_en_4.0.0_3.0_1657720835247.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_mini_finetuned_age_news_classification_en_4.0.0_3.0_1657720835247.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") classifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_mini_finetuned_age_news_classification","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, classifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val classifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_mini_finetuned_age_news_classification","en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, classifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_mini_finetuned_age_news_classification| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|42.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mrm8488/bert-mini-finetuned-age_news-classification --- layout: model title: English LongformerForQuestionAnswering model (from manishiitg) Version-2 author: John Snow Labs name: longformer_qa_recruit_v2 date: 2022-06-26 tags: [en, open_source, longformer, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: LongformerForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `longformer-recruit-qa-v2` is an English model originally trained by `manishiitg`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/longformer_qa_recruit_v2_en_4.0.0_3.0_1656255752690.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/longformer_qa_recruit_v2_en_4.0.0_3.0_1656255752690.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_recruit_v2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_recruit_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.longformer.v2").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|longformer_qa_recruit_v2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|556.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/manishiitg/longformer-recruit-qa-v2 --- layout: model title: Fast Neural Machine Translation Model from English to Finnish author: John Snow Labs name: opus_mt_en_fi date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, fi, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `fi` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_fi_xx_2.7.0_2.4_1609167722467.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_fi_xx_2.7.0_2.4_1609167722467.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_fi", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = "text to translate" result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_fi", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.fi').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_fi| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Sentence Entity Resolver for Snomed Concepts, Body Structure Version (sbiobert_base_cased_mli) author: John Snow Labs name: sbiobertresolve_snomed_bodyStructure date: 2021-07-08 tags: [snomed, en, clinical, licensed] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical (anatomical structures) entities to Snomed codes (body structure version) using sentence embeddings. ## Predicted Entities Snomed Codes and their normalized definition with `sbiobert_base_cased_mli` embeddings. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_bodyStructure_en_3.1.0_2.4_1625732176926.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_bodyStructure_en_3.1.0_2.4_1625732176926.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use ```sbiobertresolve_snomed_bodyStructure``` resolver model must be used with ```sbiobert_base_cased_mli``` as embeddings and ```ner_jsl``` as NER model.
Set ```Disease_Syndrome_Disorder, External_body_part_or_region``` in ```.setWhiteList()```. Alternatively, the ```sbiobertresolve_snomed_bodyStructure``` resolver model can be used with ```sbiobert_base_cased_mli``` as embeddings and the merged chunks of the ```ner_jsl``` and ```ner_anatomy_coarse``` NER models; in that case there is no need to set ```.setWhiteList()```.
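As a plain-Python illustration of what the `.setWhiteList()` step in the first option does, the sketch below filters NER chunks down to the whitelisted labels before they reach the resolver. The helper name and sample chunks are assumptions for this sketch only; inside the actual Spark NLP pipeline this filtering is performed by `NerConverter.setWhiteList()`:

```python
# Keep only NER chunks whose label is in the whitelist, mirroring
# what NerConverter's setWhiteList does in the Spark NLP pipeline.
def filter_chunks(chunks, whitelist):
    return [(text, label) for text, label in chunks if label in whitelist]

ner_chunks = [
    ("amputation stump", "External_body_part_or_region"),
    ("diabetes", "Disease_Syndrome_Disorder"),
    ("yesterday", "Date"),
]
whitelist = {"Disease_Syndrome_Disorder", "External_body_part_or_region"}
print(filter_chunks(ner_chunks, whitelist))  # the Date chunk is dropped
```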
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") jsl_sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") snomed_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_snomed_bodyStructure", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("snomed_code") snomed_pipelineModel = PipelineModel( stages = [ documentAssembler, jsl_sbert_embedder, snomed_resolver]) snomed_lp = LightPipeline(snomed_pipelineModel) result = snomed_lp.fullAnnotate("Amputation stump") ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sbert_embeddings") val snomed_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_snomed_bodyStructure", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("snomed_code") val snomed_pipeline = new Pipeline().setStages(Array(document_assembler, sbert_embedder, snomed_resolver)) val snomed_pipelineModel = snomed_pipeline.fit(Seq("").toDF("text")) val snomed_lp = new LightPipeline(snomed_pipelineModel) val result = snomed_lp.fullAnnotate("Amputation stump") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.snomed_body_structure").predict("""Amputation stump""") ```
## Results ```bash | | chunks | code | resolutions | all_codes | all_distances | |---:|:-----------------|:---------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------| | 0 | amputation stump | 38033009 | [Amputation stump, Amputation stump of upper limb, Amputation stump of left upper limb, Amputation stump of lower limb, Amputation stump of left lower limb, Amputation stump of right upper limb, Amputation stump of right lower limb, ...]| ['38033009', '771359009', '771364008', '771358001', '771367001', '771365009', '771368006', ...] | ['0.0000', '0.0773', '0.0858', '0.0863', '0.0905', '0.0911', '0.0972', ...] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_snomed_bodyStructure| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[snomed_code]| |Language:|en| |Case sensitive:|true| ## Data Source https://www.snomed.org/ --- layout: model title: Arabic Bert Embeddings (from bashar-talafha) author: John Snow Labs name: bert_embeddings_multi_dialect_bert_base_arabic date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `multi-dialect-bert-base-arabic` is an Arabic model originally trained by `bashar-talafha`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_multi_dialect_bert_base_arabic_ar_3.4.2_3.0_1649677978634.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_multi_dialect_bert_base_arabic_ar_3.4.2_3.0_1649677978634.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_multi_dialect_bert_base_arabic","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_multi_dialect_bert_base_arabic","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.multi_dialect_bert_base_arabic").predict("""أنا أحب شرارة NLP""") ```
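Each row of `result` carries one embedding per token in the `embeddings` column. When a single fixed-size vector per text is needed (for similarity search or clustering), a common follow-up is mean pooling the token vectors; a minimal pure-Python sketch of that step (the exact column layout you collect from Spark is an assumption to verify):

```python
# Average a list of per-token embedding vectors into one sentence vector.
def mean_pool(token_vectors):
    """Mean-pool token embeddings into a single fixed-size vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# tiny 2-token, 2-dimensional example
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```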
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_multi_dialect_bert_base_arabic| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|414.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/bashar-talafha/multi-dialect-bert-base-arabic - https://ai.mawdoo3.com/ - https://github.com/alisafaya/Arabic-BERT - https://sites.google.com/view/nadi-shared-task - https://github.com/mawdoo3/Multi-dialect-Arabic-BERT --- layout: model title: English DistilBertForQuestionAnswering model (from caiosantillo) author: John Snow Labs name: distilbert_qa_caiosantillo_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `caiosantillo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_caiosantillo_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725142690.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_caiosantillo_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725142690.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_caiosantillo_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(False) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_caiosantillo_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_caiosantillo").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
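To spot-check the model's `answer` output against gold answers, SQuAD-style evaluation normalizes both strings (lowercase, strip punctuation and articles, collapse whitespace) before comparing. A small self-contained sketch of that convention:

```python
import re
import string

def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """True if the normalized prediction equals the normalized gold answer."""
    return normalize(prediction) == normalize(gold)

print(exact_match("Clara", "clara."))  # True
```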
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_caiosantillo_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/caiosantillo/distilbert-base-uncased-finetuned-squad --- layout: model title: Fast Neural Machine Translation Model from Icelandic to English author: John Snow Labs name: opus_mt_is_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, is, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `is` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_is_en_xx_2.7.0_2.4_1609167041296.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_is_en_xx_2.7.0_2.4_1609167041296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_is_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_is_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.is.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_is_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from bhavikardeshna) author: John Snow Labs name: bert_qa_multilingual_bert_base_cased_english date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `multilingual-bert-base-cased-english` is an English model originally trained by `bhavikardeshna`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_bert_base_cased_english_en_4.0.0_3.0_1654188464581.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_bert_base_cased_english_en_4.0.0_3.0_1654188464581.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_multilingual_bert_base_cased_english","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_multilingual_bert_base_cased_english","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.multilingual_english_tuned_base_cased.by_bhavikardeshna").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_multilingual_bert_base_cased_english| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|665.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/bhavikardeshna/multilingual-bert-base-cased-english --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_fpdm_pert_sent_0.01_squad2.0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_roberta_pert_sent_0.01_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_pert_sent_0.01_squad2.0_en_4.3.0_3.0_1674211060807.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_pert_sent_0.01_squad2.0_en_4.3.0_3.0_1674211060807.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_pert_sent_0.01_squad2.0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_pert_sent_0.01_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_fpdm_pert_sent_0.01_squad2.0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|460.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/fpdm_roberta_pert_sent_0.01_squad2.0 --- layout: model title: Fast Neural Machine Translation Model from English to Xhosa author: John Snow Labs name: opus_mt_en_xh date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, xh, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `xh` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_xh_xx_2.7.0_2.4_1609163416968.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_xh_xx_2.7.0_2.4_1609163416968.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_xh", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_xh", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.xh').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_xh| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_dm128 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-dm128` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dm128_en_4.3.0_3.0_1675118965696.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dm128_en_4.3.0_3.0_1675118965696.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_dm128","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_dm128","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_dm128| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|37.4 MB| ## References - https://huggingface.co/google/t5-efficient-small-dm128 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English BertForQuestionAnswering model (from AnonymousSub) author: John Snow Labs name: bert_qa_fpdm_triplet_bert_FT_newsqa date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_triplet_bert_FT_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_fpdm_triplet_bert_FT_newsqa_en_4.0.0_3.0_1654187915859.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_fpdm_triplet_bert_FT_newsqa_en_4.0.0_3.0_1654187915859.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_fpdm_triplet_bert_FT_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_fpdm_triplet_bert_FT_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.bert.qa_fpdm_triplet_ft.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_fpdm_triplet_bert_FT_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/fpdm_triplet_bert_FT_newsqa --- layout: model title: Detect Drug Information author: John Snow Labs name: ner_posology_en date: 2020-04-15 task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 2.4.2 spark_version: 2.4 tags: [ner, en, licensed, clinical] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Pretrained named entity recognition deep learning model for posology. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. {:.h2_title} ## Predicted Entities ``DOSAGE``, ``DRUG``, ``DURATION``, ``FORM``, ``FREQUENCY``, ``ROUTE``, ``STRENGTH``.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_en_2.4.4_2.4_1584452534235.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_en_2.4.4_2.4_1584452534235.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPython.html %} ```python ... embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_posology", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. 
p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.']], ["text"])) ``` ```scala ... val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_posology", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. 
p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.h2_title} ## Results The output is a dataframe with a sentence per row and a ``"ner"`` column containing all of the entity labels in the sentence, entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select ``"token.result"`` and ``"ner.result"`` from your output dataframe or add the ``"Finisher"`` to the end of your pipeline. ```bash +--------------+---------+ |chunk |ner | +--------------+---------+ |insulin |DRUG | |Bactrim |DRUG | |for 14 days |DURATION | |Fragmin |DRUG | |5000 units |DOSAGE | |subcutaneously|ROUTE | |daily |FREQUENCY| |Xenaderm |DRUG | |topically |ROUTE | |b.i.d |FREQUENCY| |Lantus |DRUG | |40 units |DOSAGE | |subcutaneously|ROUTE | |at bedtime |FREQUENCY| |OxyContin |DRUG | |30 mg |STRENGTH | |p.o |ROUTE | |q.12 h |FREQUENCY| |folic acid |DRUG | |1 mg |STRENGTH | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_en| |Type:|ner| |Compatibility:|Spark NLP 2.4.2| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence,token, embeddings]| |Output Labels:|[ner]| |Language:|[en]| |Case sensitive:|false| | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained on the 2018 i2b2 dataset and FDA Drug datasets with ``embeddings_clinical``. 
https://open.fda.gov/ ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|:--------------|-------:|------:|-----:|---------:|---------:|---------:| | 0 | B-DRUG | 2639 | 221 | 117 | 0.922727 | 0.957547 | 0.939815 | | 1 | B-STRENGTH | 1711 | 188 | 87 | 0.901 | 0.951613 | 0.925615 | | 2 | I-DURATION | 553 | 74 | 60 | 0.881978 | 0.902121 | 0.891935 | | 3 | I-STRENGTH | 1927 | 239 | 176 | 0.889658 | 0.91631 | 0.902788 | | 4 | I-FREQUENCY | 1749 | 163 | 133 | 0.914749 | 0.92933 | 0.921982 | | 5 | B-FORM | 1028 | 109 | 84 | 0.904134 | 0.92446 | 0.914184 | | 6 | B-DOSAGE | 323 | 71 | 81 | 0.819797 | 0.799505 | 0.809524 | | 7 | I-DOSAGE | 173 | 89 | 82 | 0.660305 | 0.678431 | 0.669246 | | 8 | I-DRUG | 1020 | 129 | 118 | 0.887728 | 0.896309 | 0.891998 | | 9 | I-ROUTE | 17 | 4 | 5 | 0.809524 | 0.772727 | 0.790698 | | 10 | B-ROUTE | 526 | 49 | 52 | 0.914783 | 0.910035 | 0.912402 | | 11 | B-DURATION | 223 | 35 | 27 | 0.864341 | 0.892 | 0.877953 | | 12 | B-FREQUENCY | 1170 | 90 | 54 | 0.928571 | 0.955882 | 0.942029 | | 13 | I-FORM | 48 | 6 | 11 | 0.888889 | 0.813559 | 0.849558 | | 14 | Macro-average | 13107 | 1467 | 1087 | 0.870585 | 0.878559 | 0.874554 | | 15 | Micro-average | 13107 | 1467 | 1087 | 0.899341 | 0.923418 | 0.911221 | ``` --- layout: model title: Translate English to Pedi Pipeline author: John Snow Labs name: translate_en_nso date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, nso, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. 
Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `nso` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_nso_xx_2.7.0_2.4_1609687739962.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_nso_xx_2.7.0_2.4_1609687739962.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_nso", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_nso", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.nso').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_nso| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering model (from andi611) Squad2 with Neg, Multi, Repeat author: John Snow Labs name: distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi_with_repeat date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-squad2-with-ner-with-neg-with-multi-with-repeat` is an English model originally trained by `andi611`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi_with_repeat_en_4.0.0_3.0_1654727440196.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi_with_repeat_en_4.0.0_3.0_1654727440196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi_with_repeat","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi_with_repeat","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_conll.distil_bert.base_uncased_with_neg_with_multi_with_repeat.by_andi611").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
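Under the hood, SQuAD-style extractive QA models like this one score every candidate (start, end) token pair over the context and return the highest-scoring span as the answer. A toy pure-Python sketch of that selection step, with made-up logits (the real model produces one start and one end logit per subword token):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick the (start, end) pair maximizing start_logits[s] + end_logits[e], with s <= e."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.2, 0.1, 4.0, 0.0, 0.1, 0.1, 0.2, 1.5, 0.0]  # made-up scores
end_logits   = [0.0, 0.1, 0.2, 3.5, 0.1, 0.0, 0.1, 0.1, 1.0, 0.2]
s, e = best_span(start_logits, end_logits)
print(" ".join(context[s:e + 1]))  # prints "Clara"
```

This is a conceptual sketch only; the annotator handles tokenization, the no-answer case of SQuAD2, and batching internally.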
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_squad2_with_ner_with_neg_with_multi_with_repeat| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/andi611/distilbert-base-uncased-squad2-with-ner-with-neg-with-multi-with-repeat --- layout: model title: English DistilBertForQuestionAnswering Small model (from ncduy) author: John Snow Labs name: distilbert_qa_base_cased_distilled_squad_finetuned_squad_small date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-finetuned-squad-small` is an English model originally trained by `ncduy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_distilled_squad_finetuned_squad_small_en_4.0.0_3.0_1654723606084.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_distilled_squad_finetuned_squad_small_en_4.0.0_3.0_1654723606084.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_distilled_squad_finetuned_squad_small","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_distilled_squad_finetuned_squad_small","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_small_cased").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_cased_distilled_squad_finetuned_squad_small| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ncduy/distilbert-base-cased-distilled-squad-finetuned-squad-small --- layout: model title: Multilabel Classification of Customer Service (Linguistic features) author: John Snow Labs name: finmulticlf_customer_service_lin_features date: 2023-02-03 tags: [en, licensed, finance, classification, customer, linguistic, tensorflow] task: Text Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: MultiClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Multilabel Text Classification model that can help you classify a chat message from customer service according to linguistic features. 
The classes are the following: - Q - Colloquial variation - P - Politeness variation - W - Offensive language - K - Keyword language - B - Basic syntactic structure - C - Coordinated syntactic structure - I - Interrogative structure - M - Morphological variation (plurals, tenses…) - L - Lexical variation (synonyms) - E - Expanded abbreviations (I'm -> I am, I'd -> I would…) - N - Negation - Z - Noise phenomena like spelling or punctuation errors ## Predicted Entities `B`, `C`, `E`, `I`, `K`, `L`, `M`, `N`, `P`, `Q`, `W`, `Z` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finmulticlf_customer_service_lin_features_en_1.0.0_3.0_1675430237309.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finmulticlf_customer_service_lin_features_en_1.0.0_3.0_1675430237309.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") embeddings = nlp.UniversalSentenceEncoder.pretrained() \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.MultiClassifierDLModel.pretrained("finmulticlf_customer_service_lin_features", "en", "finance/models")\ .setInputCols("sentence_embeddings") \ .setOutputCol("class") pipeline = nlp.Pipeline().setStages( [ document_assembler, embeddings, docClassifier ] ) empty_data = spark.createDataFrame([[""]]).toDF("text") model = pipeline.fit(empty_data) light_model = nlp.LightPipeline(model) result = light_model.annotate("""What do i have to ddo to cancel a Gold account""") result["class"] ```
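Unlike a single-label classifier, a multilabel head like this one scores each class independently and keeps every class whose score clears a threshold (0.5 is the usual default). A sketch of that final selection step in plain Python, with hypothetical per-class scores for the sample message above:

```python
LABELS = ["B", "C", "E", "I", "K", "L", "M", "N", "P", "Q", "W", "Z"]

def predict_labels(scores, threshold=0.5):
    """Keep every class whose independent score clears the threshold."""
    return [label for label in LABELS if scores.get(label, 0.0) >= threshold]

# Hypothetical per-class scores; the real model computes these from sentence embeddings.
scores = {"B": 0.97, "Q": 0.91, "L": 0.88, "I": 0.73, "Z": 0.66, "C": 0.12}
print(predict_labels(scores))  # prints ['B', 'I', 'L', 'Q', 'Z']
```

Note how several classes can fire at once, which is why a noisy interrogative message can be tagged `Q`, `B`, `L`, `Z` and `I` simultaneously.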
## Results ```bash ['Q', 'B', 'L', 'Z', 'I'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finmulticlf_customer_service_lin_features| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|13.0 MB| ## References https://github.com/bitext/customer-support-intent-detection-training-dataset ## Benchmarking ```bash label precision recall f1-score support B 1.00 1.00 1.00 485 C 0.79 0.80 0.80 61 E 0.74 0.89 0.80 44 I 0.95 0.94 0.94 134 K 0.96 0.96 0.96 108 L 0.96 0.97 0.96 402 M 0.93 0.93 0.93 134 N 0.90 0.75 0.82 12 P 0.77 0.90 0.83 30 Q 0.73 0.68 0.71 212 W 0.85 0.88 0.87 33 Z 0.68 0.72 0.70 160 micro-avg 0.90 0.90 0.90 1815 macro-avg 0.85 0.87 0.86 1815 weighted-avg 0.90 0.90 0.90 1815 samples-avg 0.91 0.92 0.90 1815 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Indonesian author: John Snow Labs name: opus_mt_en_id date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, id, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `en` - target languages: `id` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_id_xx_2.7.0_2.4_1609164704175.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_id_xx_2.7.0_2.4_1609164704175.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_id", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_id", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.id').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_id| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Stopwords Remover for Norwegian Bokmål language (211 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, nb, open_source] task: Stop Words Removal language: nb edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_nb_3.4.1_3.0_1646673078011.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_nb_3.4.1_3.0_1646673078011.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","nb") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Bortsett fra å være kongen av nord, er John Snow en engelsk lege og en leder i utviklingen av anestesi og medisinsk hygiene."]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","nb") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Bortsett fra å være kongen av nord, er John Snow en engelsk lege og en leder i utviklingen av anestesi og medisinsk hygiene.").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("nb.stopwords").predict("""Bortsett fra å være kongen av nord, er John Snow en engelsk lege og en leder i utviklingen av anestesi og medisinsk hygiene.""")
```
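Conceptually, the StopWordsCleaner stage is a set lookup over the token stream: any token found in the stopword list is dropped. A minimal plain-Python sketch with a tiny illustrative subset of the list (the real model ships the full 211-entry ISO list and is configurable for case sensitivity):

```python
# Tiny illustrative subset; the pretrained model loads the full 211-entry list.
stopwords = {"fra", "å", "være", "av", "er", "en", "og", "i"}

def clean(tokens, stopwords):
    """Drop tokens whose lowercase form appears in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

print(clean(["Bortsett", "fra", "å", "være", "kongen", "av", "nord"], stopwords))
# prints ['Bortsett', 'kongen', 'nord']
```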
## Results ```bash +----------------------------------------------------------------------------------------------------+ |result | +----------------------------------------------------------------------------------------------------+ |[Bortsett, kongen, nord, ,, John, Snow, engelsk, lege, utviklingen, anestesi, medisinsk, hygiene, .]| +----------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|nb| |Size:|1.9 KB| --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from microsoft) author: John Snow Labs name: roberta_qa_xdoc_base_squad1.1 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xdoc-base-squad1.1` is an English model originally trained by `microsoft`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_xdoc_base_squad1.1_en_4.3.0_3.0_1674224925472.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_xdoc_base_squad1.1_en_4.3.0_3.0_1674224925472.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_xdoc_base_squad1.1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_xdoc_base_squad1.1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_xdoc_base_squad1.1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|466.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/microsoft/xdoc-base-squad1.1 - https://arxiv.org/abs/2210.02849 --- layout: model title: Malay T5ForConditionalGeneration Tiny Cased model (from mesolitica) author: John Snow Labs name: t5_super_tiny_bahasa_cased date: 2023-01-31 tags: [ms, open_source, t5, tensorflow] task: Text Generation language: ms edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-super-tiny-bahasa-cased` is a Malay model originally trained by `mesolitica`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_super_tiny_bahasa_cased_ms_4.3.0_3.0_1675156057502.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_super_tiny_bahasa_cased_ms_4.3.0_3.0_1675156057502.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_super_tiny_bahasa_cased","ms") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_super_tiny_bahasa_cased","ms") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_super_tiny_bahasa_cased| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ms| |Size:|40.5 MB| ## References - https://huggingface.co/mesolitica/t5-super-tiny-bahasa-cased - https://github.com/huseinzol05/malaya/tree/master/pretrained-model/t5/prepare - https://github.com/google-research/text-to-text-transfer-transformer - https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5 --- layout: model title: Legal Arguments Mining in Court Decisions (in German) author: John Snow Labs name: legclf_argument_mining_german date: 2023-03-26 tags: [de, licensed, classification, legal, tensorflow] task: Text Classification language: de edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Multiclass classification model in German which classifies arguments in legal discourse into the following classes: `subsumption`, `definition`, `conclusion`, `other`. ## Predicted Entities `subsumption`, `definition`, `conclusion`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_argument_mining_german_de_1.0.0_3.0_1679848514704.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_argument_mining_german_de_1.0.0_3.0_1679848514704.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_large_german_legal", "de")\ .setInputCols(["document", "token"])\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512) embeddingsSentence = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") docClassifier = legal.ClassifierDLModel.pretrained("legclf_argument_mining_german", "de", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, embeddingsSentence, docClassifier ]) df = spark.createDataFrame([["Folglich liegt eine Verletzung von Artikel 8 der Konvention vor ."]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) result.select("text", "category.result").show(truncate=False) ```
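The SentenceEmbeddings stage with the AVERAGE strategy collapses the per-token vectors into a single sentence vector by taking their element-wise mean. A minimal plain-Python sketch of that pooling step, using toy 4-dimensional vectors in place of the real high-dimensional RoBERTa embeddings:

```python
def average_pool(token_vectors):
    """AVERAGE pooling: element-wise mean of token embeddings -> one sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Three toy 4-d token vectors standing in for real contextual embeddings.
tokens = [[1.0, 0.0, 2.0, 4.0], [3.0, 2.0, 0.0, 0.0], [2.0, 4.0, 1.0, 2.0]]
print(average_pool(tokens))  # prints [2.0, 2.0, 1.0, 2.0]
```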
## Results ```bash +-----------------------------------------------------------------+------------+ |text |result | +-----------------------------------------------------------------+------------+ |Folglich liegt eine Verletzung von Artikel 8 der Konvention vor .|[conclusion]| +-----------------------------------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_argument_mining_german| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|de| |Size:|24.0 MB| ## References Train dataset available [here](https://huggingface.co/datasets/MeilingShi/legal_argument_mining) ## Benchmarking ```bash label precision recall f1-score support conclusion 0.88 0.88 0.88 52 definition 0.83 0.83 0.83 58 other 0.86 0.88 0.87 49 subsumption 0.81 0.80 0.80 64 accuracy - - 0.84 223 macro avg 0.85 0.85 0.85 223 weighted avg 0.84 0.84 0.84 223 ``` --- layout: model title: Match Datetime in Texts author: John Snow Labs name: match_datetime date: 2022-01-04 tags: [en, open_source] task: Pipeline Public language: en nav_key: models edition: Spark NLP 3.3.4 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description DateMatcher based on yyyy/MM/dd {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/match_datetime_en_3.3.4_3.0_1641310187437.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/match_datetime_en_3.3.4_3.0_1641310187437.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} 
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline_local = PretrainedPipeline('match_datetime') input_list = ["David visited the restaurant yesterday with his family. He also visited and the day before, but at that time he was alone. David again visited today with his colleagues. He and his friends really liked the food and hoped to visit again tomorrow."] tres = pipeline_local.fullAnnotate(input_list)[0] for dte in tres['date']: sent = tres['sentence'][int(dte.metadata['sentence'])] print(f'text/chunk {sent.result[dte.begin:dte.end+1]} | mapped_date: {dte.result}') ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline import com.johnsnowlabs.nlp.SparkNLP SparkNLP.version() val testData = spark.createDataFrame(Seq( (1, "David visited the restaurant yesterday with his family. He also visited and the day before, but at that time he was alone. David again visited today with his colleagues. He and his friends really liked the food and hoped to visit again tomorrow."))).toDF("id", "text") val pipeline = PretrainedPipeline("match_datetime", lang="en") val annotation = pipeline.transform(testData) annotation.show() ```
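The MultiDateMatcher inside this pipeline resolves relative expressions such as "yesterday" and "tomorrow" against an anchor date (by default, the day the pipeline runs) and prints them in the yyyy/MM/dd format. A rough plain-Python sketch of that arithmetic, assuming a fixed anchor date of 2022/01/03:

```python
from datetime import date, timedelta

def resolve(expr, anchor):
    """Map a handful of relative-date expressions to concrete dates, MultiDateMatcher-style."""
    offsets = {"yesterday": -1, "today": 0, "tomorrow": 1}
    return anchor + timedelta(days=offsets[expr])

anchor = date(2022, 1, 3)  # assumed run date for this sketch
print(resolve("tomorrow", anchor).strftime("%Y/%m/%d"))  # prints 2022/01/04
```

The real annotator also parses absolute dates and many more relative forms; this only illustrates the offset idea.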
## Results ```bash text/chunk yesterday | mapped_date: 2022/01/02 text/chunk day before | mapped_date: 2022/01/02 text/chunk today | mapped_date: 2022/01/03 text/chunk tomorrow | mapped_date: 2022/01/04 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|match_datetime| |Type:|pipeline| |Compatibility:|Spark NLP 3.3.4+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|12.9 KB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - MultiDateMatcher --- layout: model title: Word2Vec Embeddings in Romansh (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, rm, open_source] task: Embeddings language: rm edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_rm_3.4.1_3.0_1647454087483.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_rm_3.4.1_3.0_1647454087483.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","rm") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","rm") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("rm.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
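Downstream of the lookup, each token carries a 300-dimensional vector, and relatedness between tokens is typically measured with cosine similarity over those vectors. A self-contained sketch with toy 3-d vectors standing in for the real 300-d embeddings (the vectors and token names are made up for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d vectors standing in for 300-d embeddings of three tokens.
v_cat, v_dog, v_car = [1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.0, 1.0]
print(cosine(v_cat, v_dog) > cosine(v_cat, v_car))  # related words score higher
```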
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|rm| |Size:|64.6 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: DistilRoBERTa Base Ontonotes NER Pipeline author: ahmedlone127 name: distilroberta_base_token_classifier_ontonotes_pipeline date: 2022-06-14 tags: [open_source, ner, token_classifier, distilroberta, ontonotes, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: false annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [distilroberta_base_token_classifier_ontonotes](https://nlp.johnsnowlabs.com/2021/09/26/distilroberta_base_token_classifier_ontonotes_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/distilroberta_base_token_classifier_ontonotes_pipeline_en_4.0.0_3.0_1655219463122.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/distilroberta_base_token_classifier_ontonotes_pipeline_en_4.0.0_3.0_1655219463122.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("distilroberta_base_token_classifier_ontonotes_pipeline", lang = "en") pipeline.annotate("My name is John and I have been working at John Snow Labs since November 2020.") ``` ```scala val pipeline = new PretrainedPipeline("distilroberta_base_token_classifier_ontonotes_pipeline", lang = "en") pipeline.annotate("My name is John and I have been working at John Snow Labs since November 2020.") ```
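The pipeline's `NerConverter` stage is what merges the token-level IOB tags produced by the classifier into whole entity chunks. As a rough plain-Python illustration of that grouping logic (not the Spark NLP implementation itself; the token and tag lists below are hypothetical):

```python
def iob_to_chunks(tokens, tags):
    """Group token-level IOB tags into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new chunk begins; flush any chunk in progress
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # Continuation of the current chunk
            current.append(tok)
        else:
            # "O" tag (or inconsistent "I-") ends the current chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["My", "name", "is", "John", "and", "I", "work", "at", "John", "Snow", "Labs", "."]
tags   = ["O", "O", "O", "B-PERSON", "O", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O"]
print(iob_to_chunks(tokens, tags))  # [('John', 'PERSON'), ('John Snow Labs', 'ORG')]
```

In the real pipeline this conversion happens inside `NerConverter`, which additionally carries character offsets and metadata along with each chunk.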
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PERSON | |John Snow Labs|ORG | |November 2020 |DATE | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilroberta_base_token_classifier_ontonotes_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Community| |Language:|en| |Size:|307.5 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - RoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: Legal Now therefore Clause Binary Classifier author: John Snow Labs name: legclf_now_therefore_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `now-therefore` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). 
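As a minimal sketch of the paragraph-splitting ("multiline") heuristic mentioned above, independent of the Spark NLP utilities in the workshop notebook (plain Python; the clause fragments are placeholders, not from a real contract):

```python
import re

def split_paragraphs(text):
    """Split a document into paragraphs on blank lines (the 'multiline' heuristic)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Hypothetical clause fragments, purely for illustration
doc = (
    "WHEREAS, the parties wish to set forth their agreement...\n"
    "\n"
    "NOW, THEREFORE, in consideration of the mutual covenants herein...\n"
    "\n"
    "IN WITNESS WHEREOF, the parties have executed this Agreement..."
)
paragraphs = split_paragraphs(doc)
print(len(paragraphs))  # each paragraph can then be sent to the classifier separately
```

Each resulting paragraph stays well under the 512-token limit far more reliably than feeding whole documents to the model.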
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `now-therefore` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_now_therefore_clause_en_1.0.0_3.2_1660122766408.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_now_therefore_clause_en_1.0.0_3.2_1660122766408.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_now_therefore_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +---------------+ | result| +---------------+ |[now-therefore]| |[other]| |[other]| |[now-therefore]| +---------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_now_therefore_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support now-therefore 0.98 0.98 0.98 58 other 0.99 0.99 0.99 146 accuracy - - 0.99 204 macro-avg 0.99 0.99 0.99 204 weighted-avg 0.99 0.99 0.99 204 ``` --- layout: model title: Chamorro RobertaForQuestionAnswering (from Gantenbein) author: John Snow Labs name: roberta_qa_ADDI_CH_RoBERTa date: 2022-06-20 tags: [open_source, question_answering, roberta] task: Question Answering language: ch edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ADDI-CH-RoBERTa` is a Chamorro model originally trained by `Gantenbein`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ADDI_CH_RoBERTa_ch_4.0.0_3.0_1655726262986.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ADDI_CH_RoBERTa_ch_4.0.0_3.0_1655726262986.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_ADDI_CH_RoBERTa","ch") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_ADDI_CH_RoBERTa","ch") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("ch.answer_question.roberta.ch_tuned.by_Gantenbein").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_ADDI_CH_RoBERTa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|ch| |Size:|421.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Gantenbein/ADDI-CH-RoBERTa --- layout: model title: Swedish BERT Sentence Base Cased Embedding author: John Snow Labs name: sent_bert_base_cased date: 2021-09-06 tags: [swedish, bert_sentence_embeddings, open_source, cased, sv] task: Embeddings language: sv edition: Spark NLP 3.2.2 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, Swedish Wikipedia and internet forums), aiming to provide a representative BERT model for Swedish text. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_base_cased_sv_3.2.2_3.0_1630926268941.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_base_cased_sv_3.2.2_3.0_1630926268941.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "sv") \ .setInputCols("sentence") \ .setOutputCol("bert_sentence") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings ]) ``` ```scala val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "sv") .setInputCols("sentence") .setOutputCol("bert_sentence") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings )) ``` {:.nlu-block} ```python import nlu nlu.load("sv.embed_sentence.bert.base_cased").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_bert_base_cased| |Compatibility:|Spark NLP 3.2.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[bert_sentence]| |Language:|sv| |Case sensitive:|true| ## Data Source The model is imported from: https://huggingface.co/KB/bert-base-swedish-cased --- layout: model title: ALBERT Large CoNLL-03 NER Pipeline author: ahmedlone127 name: albert_large_token_classifier_conll03_pipeline date: 2022-06-14 tags: [open_source, ner, token_classifier, albert, conll03, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: false article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [albert_large_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/26/albert_large_token_classifier_conll03_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/albert_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655211084220.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/albert_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655211084220.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("albert_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("albert_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PER | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_large_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Community| |Language:|en| |Size:|64.4 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - AlbertForTokenClassification - NerConverter - Finisher --- layout: model title: Pipeline to Detect Drugs - Generalized Single Entity author: John Snow Labs name: ner_drugs_greedy_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, drug, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_drugs_greedy](https://nlp.johnsnowlabs.com/2021/03/31/ner_drugs_greedy_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_drugs_greedy_pipeline_en_3.4.1_3.0_1647873160931.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_drugs_greedy_pipeline_en_3.4.1_3.0_1647873160931.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_drugs_greedy_pipeline", "en", "clinical/models") pipeline.annotate("DOSAGE AND ADMINISTRATION The initial dosage of hydrocortisone tablets may vary from 20 mg to 240 mg of hydrocortisone per day depending on the specific disease entity being treated.") ``` ```scala val pipeline = new PretrainedPipeline("ner_drugs_greedy_pipeline", "en", "clinical/models") pipeline.annotate("DOSAGE AND ADMINISTRATION The initial dosage of hydrocortisone tablets may vary from 20 mg to 240 mg of hydrocortisone per day depending on the specific disease entity being treated.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.drugs_greedy.pipeline").predict("""DOSAGE AND ADMINISTRATION The initial dosage of hydrocortisone tablets may vary from 20 mg to 240 mg of hydrocortisone per day depending on the specific disease entity being treated.""") ```
## Results ```bash +-----------------------------------+------------+ | chunk | ner_label | +-----------------------------------+------------+ | hydrocortisone tablets | DRUG | | 20 mg to 240 mg of hydrocortisone | DRUG | +-----------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_drugs_greedy_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Legal Reinstatement Clause Binary Classifier author: John Snow Labs name: legclf_reinstatement_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `reinstatement` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `reinstatement` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_reinstatement_clause_en_1.0.0_3.2_1660122901279.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_reinstatement_clause_en_1.0.0_3.2_1660122901279.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_reinstatement_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
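When several clause classifiers are run over the same clause, their categorical outputs can be reduced to the series of True/False values described above. A minimal post-processing sketch (plain Python; the prediction dict is a stand-in for real pipeline output, and the model names are used only as keys):

```python
def clause_flags(predictions):
    """Reduce each clause classifier's category to a True/False presence flag.

    `predictions` maps model name -> predicted category; the clause counts as
    present whenever the prediction is anything other than "other".
    """
    return {name: category != "other" for name, category in predictions.items()}

# Hypothetical outputs from two clause classifiers run on the same clause
predictions = {
    "legclf_reinstatement_clause": "reinstatement",
    "legclf_now_therefore_clause": "other",
}
print(clause_flags(predictions))
# {'legclf_reinstatement_clause': True, 'legclf_now_therefore_clause': False}
```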
## Results ```bash +---------------+ | result| +---------------+ |[reinstatement]| |[other]| |[other]| |[reinstatement]| +---------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_reinstatement_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.95 0.97 0.96 101 reinstatement 0.89 0.83 0.86 29 accuracy - - 0.94 130 macro-avg 0.92 0.90 0.91 130 weighted-avg 0.94 0.94 0.94 130 ``` --- layout: model title: Detect Anatomical Structures (Single Entity - embeddings_clinical) author: John Snow Labs name: ner_anatomy_coarse date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description An NER model to extract all types of anatomical references in text using "embeddings_clinical" embeddings. It is a single entity model and generalizes all anatomical references to a single entity. 
## Predicted Entities `Anatomy` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ANATOMY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_en_3.0.0_3.0_1617209678971.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_en_3.0.0_3.0_1617209678971.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_anatomy_coarse", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["content in the lung tissue"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_anatomy_coarse", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("""content in the lung tissue""").toDS().toDF("text") 
val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.anatomy.coarse").predict("""content in the lung tissue""") ```
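The micro-averaged scores reported in the Benchmarking section can be reproduced by pooling the per-label tp/fp/fn counts and scoring once. A small sanity-check sketch (plain Python, using the counts from the table):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Per-label counts from this model's benchmarking table
b_anatomy = (2568, 165, 158)   # tp, fp, fn for B-Anatomy
i_anatomy = (1692, 89, 169)    # tp, fp, fn for I-Anatomy

# Micro-average: pool the counts across labels, then compute a single P/R/F1
tp, fp, fn = (sum(pair) for pair in zip(b_anatomy, i_anatomy))
precision, recall, f1 = prf(tp, fp, fn)
print(f"{precision:.6f} {recall:.6f} {f1:.6f}")
```

The macro-average row, by contrast, averages the per-label precision/recall/F1 values directly, which is why the two rows differ slightly.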
## Results ```bash | | ner_chunk | entity | |---:|------------------:|----------:| | 0 | lung tissue | Anatomy | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_anatomy_coarse| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on a custom dataset using 'embeddings_clinical'. ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|:--------------|------:|------:|------:|---------:|---------:|---------:| | 0 | B-Anatomy | 2568 | 165 | 158 | 0.939627 | 0.94204 | 0.940832 | | 1 | I-Anatomy | 1692 | 89 | 169 | 0.950028 | 0.909189 | 0.92916 | | 2 | Macro-average | 4260 | 254 | 327 | 0.944827 | 0.925614 | 0.935122 | | 3 | Micro-average | 4260 | 254 | 327 | 0.943731 | 0.928712 | 0.936161 | ``` --- layout: model title: BERT Sentence Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on QQP author: John Snow Labs name: sent_bert_wiki_books_qqp date: 2021-08-31 tags: [en, sentence_embeddings, open_source, wikipedia_dataset, books_corpus_dataset, qqp_dataset] task: Embeddings language: en nav_key: models edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses a BERT base architecture initialized from https://tfhub.dev/google/experts/bert/wiki_books/1 and fine-tuned on QQP. This is a BERT base architecture but some changes have been made to the original training and export scheme based on more recent learnings. This model is intended to be used for a variety of English NLP tasks. The pre-training data contains more formal text and the model may not generalize to more colloquial text such as social media or messages. This model is fine-tuned on the QQP and is recommended for use in semantic similarity of question pairs tasks. 
The fine-tuning task uses the Quora Question Pairs (QQP) dataset to predict whether two questions are duplicates or not. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_wiki_books_qqp_en_3.2.0_3.0_1630412104798.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_wiki_books_qqp_en_3.2.0_3.0_1630412104798.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_wiki_books_qqp", "en") \ .setInputCols("sentence") \ .setOutputCol("bert_sentence") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings ]) ``` ```scala val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_wiki_books_qqp", "en") .setInputCols("sentence") .setOutputCol("bert_sentence") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings )) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] sent_embeddings_df = nlu.load('en.embed_sentence.bert.wiki_books_qqp').predict(text, output_level='sentence') sent_embeddings_df ```
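For the duplicate-question task this model is tuned for, sentence embeddings are typically compared with cosine similarity. A toy sketch (plain Python; the 4-dimensional vectors and the questions in the comments are stand-ins for the model's real 768-dimensional embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for sentence embeddings
q1 = [0.2, 0.7, 0.1, 0.4]      # "How do I learn Python?"
q2 = [0.25, 0.65, 0.05, 0.45]  # "What is the best way to learn Python?"
q3 = [0.9, 0.1, 0.8, 0.0]      # "What is the capital of France?"

print(cosine(q1, q2) > cosine(q1, q3))  # the duplicate-like pair scores higher
```

With real embeddings from this model, the same comparison ranks candidate question pairs by semantic closeness.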
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_bert_wiki_books_qqp| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[bert_sentence]| |Language:|en| |Case sensitive:|false| ## Data Source [1]: [Wikipedia dataset](https://dumps.wikimedia.org/) [2]: [BooksCorpus dataset](http://yknzhu.wixsite.com/mbweb) [3]: [Quora Question Pairs (QQP) dataset](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) This Model has been imported from: https://tfhub.dev/google/experts/bert/wiki_books/qqp/2 --- layout: model title: Portuguese Legal Roberta Embeddings author: John Snow Labs name: roberta_base_portuguese_legal date: 2023-02-16 tags: [pt, portuguese, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: pt edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-portuguese-roberta-base` is a Portuguese model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_portuguese_legal_pt_4.2.4_3.0_1676563448655.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_portuguese_legal_pt_4.2.4_3.0_1676563448655.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_portuguese_legal", "pt")\ .setInputCols(["sentence"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_portuguese_legal", "pt") .setInputCols("sentence") .setOutputCol("embeddings") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_portuguese_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|pt| |Size:|415.9 MB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-portuguese-roberta-base --- layout: model title: Fast Neural Machine Translation Model from Korean to English author: John Snow Labs name: opus_mt_ko_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ko, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `ko` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ko_en_xx_2.7.0_2.4_1609168124610.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ko_en_xx_2.7.0_2.4_1609168124610.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ko_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Text to translate.") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ko_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate.").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ko.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ko_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Multilingual XLMRobertaForTokenClassification Cased model (from nbroad) author: John Snow Labs name: xlmroberta_ner_jplu_r_40_lang date: 2022-08-13 tags: [xx, open_source, xlm_roberta, ner] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `jplu-xlm-r-ner-40-lang` is a Multilingual model originally trained by `nbroad`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jplu_r_40_lang_xx_4.1.0_3.0_1660422549645.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jplu_r_40_lang_xx_4.1.0_3.0_1660422549645.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jplu_r_40_lang","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jplu_r_40_lang","xx") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_jplu_r_40_lang| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|967.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/nbroad/jplu-xlm-r-ner-40-lang --- layout: model title: English DistilBertForQuestionAnswering Cased model (from aszidon) author: John Snow Labs name: distilbert_qa_custom4 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbertcustom4` is an English model originally trained by `aszidon`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom4_en_4.3.0_3.0_1672774680666.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom4_en_4.3.0_3.0_1672774680666.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom4","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_custom4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/aszidon/distilbertcustom4 --- layout: model title: English BertForQuestionAnswering model (from LenaSchmidt) author: John Snow Labs name: bert_qa_no_need_to_name_this date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `no_need_to_name_this` is an English model originally trained by `LenaSchmidt`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_no_need_to_name_this_en_4.0.0_3.0_1654188966070.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_no_need_to_name_this_en_4.0.0_3.0_1654188966070.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_no_need_to_name_this","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_no_need_to_name_this","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_no_need_to_name_this| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|410.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/LenaSchmidt/no_need_to_name_this --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from navteca) author: John Snow Labs name: roberta_qa_navteca_base_squad2 date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2` is an English model originally trained by `navteca`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_navteca_base_squad2_en_4.2.4_3.0_1669986777716.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_navteca_base_squad2_en_4.2.4_3.0_1669986777716.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_navteca_base_squad2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_navteca_base_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_navteca_base_squad2| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/navteca/roberta-base-squad2 - https://rajpurkar.github.io/SQuAD-explorer/ --- layout: model title: Chinese BertForMaskedLM Mini Cased model (from hfl) author: John Snow Labs name: bert_embeddings_minirbt_h256 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `minirbt-h256` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_minirbt_h256_zh_4.2.4_3.0_1670022628583.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_minirbt_h256_zh_4.2.4_3.0_1670022628583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_minirbt_h256","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_minirbt_h256","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_minirbt_h256| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|39.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/minirbt-h256 - https://github.com/iflytek/MiniRBT - https://github.com/ymcui/LERT - https://github.com/ymcui/PERT - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/iflytek/HFL-Anthology --- layout: model title: English asr_wav2vec2_xls_r_tf_left_right_shuru TFWav2Vec2ForCTC from hrdipto author: John Snow Labs name: pipeline_asr_wav2vec2_xls_r_tf_left_right_shuru date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_tf_left_right_shuru` is an English model originally trained by hrdipto. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xls_r_tf_left_right_shuru_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_tf_left_right_shuru_en_4.2.0_3.0_1664040053317.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_tf_left_right_shuru_en_4.2.0_3.0_1664040053317.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_tf_left_right_shuru', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_tf_left_right_shuru", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xls_r_tf_left_right_shuru| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_cased_finetuned_squad_r3f date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased-finetuned-squad-r3f` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_cased_finetuned_squad_r3f_en_4.0.0_3.0_1657182921335.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_cased_finetuned_squad_r3f_en_4.0.0_3.0_1657182921335.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_cased_finetuned_squad_r3f","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_cased_finetuned_squad_r3f","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_cased_finetuned_squad_r3f| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-cased-finetuned-squad-r3f --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from kj141) author: John Snow Labs name: distilbert_qa_kj141_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `kj141`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_kj141_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771876070.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_kj141_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771876070.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kj141_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kj141_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_kj141_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/kj141/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal Indemnification and contribution Clause Binary Classifier (md) author: John Snow Labs name: legclf_indemnification_and_contribution_md date: 2022-11-25 tags: [en, legal, classification, document, agreement, contract, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `indemnification-and-contribution` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences rather than the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques described in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens; if your text is longer, consider splitting it into smaller pieces (see the same tutorial linked above).
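The paragraph-splitting idea mentioned above can be sketched in plain Python before the chunks are fed to the pipeline. This is a minimal, hypothetical helper, independent of Spark NLP; the `max_words` safety margin is an invented heuristic (not a model parameter) meant only to keep chunks comfortably under the 512-token embedding limit.

```python
import re

def split_paragraphs(text, max_words=300):
    """Split a legal document into paragraph-sized chunks on blank lines.

    max_words is a hypothetical safety margin so each chunk stays well
    under the classifier's 512-token embedding limit.
    """
    # Paragraphs are separated by one or more blank lines ("multiline" split).
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    for p in paragraphs:
        words = p.split()
        # Further split any paragraph that is still too long.
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks
```

Each resulting chunk can then be placed in the `clause_text` column shown in the usage example below and classified independently.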
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `indemnification-and-contribution` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_indemnification_and_contribution_md_en_1.0.0_3.0_1669376499482.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_indemnification_and_contribution_md_en_1.0.0_3.0_1669376499482.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_indemnification_and_contribution_md", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +----------------------------------+ | result| +----------------------------------+ |[indemnification-and-contribution]| |[other]| |[other]| |[indemnification-and-contribution]| +----------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_indemnification_and_contribution_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scraped from the Internet and classified in-house + SEC documents ## Benchmarking ```bash precision recall f1-score support indemnification 0.92 0.96 0.94 25 other 0.97 0.95 0.96 39 accuracy 0.95 64 macro avg 0.95 0.95 0.95 64 weighted avg 0.95 0.95 0.95 64 ``` --- layout: model title: Korean ElectraForQuestionAnswering model (from sehandev) author: John Snow Labs name: electra_qa_long date: 2022-06-22 tags: [ko, open_source, electra, question_answering] task: Question Answering language: ko edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `koelectra-long-qa` is a Korean model originally trained by `sehandev`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_long_ko_4.0.0_3.0_1655922278586.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_long_ko_4.0.0_3.0_1655922278586.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_long","ko") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_long","ko") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ko.answer_question.electra").predict("""내 이름은 무엇입니까?|||제 이름은 클라라이고 저는 버클리에 살고 있습니다.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_long| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ko| |Size:|419.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/sehandev/koelectra-long-qa --- layout: model title: Legal Whereas Clause Binary Classifier (md) author: John Snow Labs name: legclf_whereas_md date: 2022-11-25 tags: [en, legal, classification, document, agreement, contract, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `whereas` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences rather than the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques described in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens; if your text is longer, consider splitting it into smaller pieces (see the same tutorial linked above).
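The header/subheader splitting mentioned above can likewise be sketched in plain Python. This is a rough, hypothetical illustration: the header regex below is an invented approximation (numbered headings like "1. Definitions" or ALL-CAPS openers like "WHEREAS"), not the pattern used in the workshop tutorial.

```python
import re

# Hypothetical header pattern: a numbered heading ("2. INDEMNITY") or an
# ALL-CAPS opener ("WHEREAS") at the start of a line.
HEADER = re.compile(r"^(?:\d+(?:\.\d+)*\.?\s+\S|[A-Z][A-Z ]{3,})", re.MULTILINE)

def split_by_headers(text):
    """Split a document into sections at lines that look like headers."""
    starts = [m.start() for m in HEADER.finditer(text)]
    if not starts:
        # No recognizable headers: return the whole text as one section.
        return [text.strip()] if text.strip() else []
    if starts[0] != 0:
        starts.insert(0, 0)  # Keep any preamble before the first header.
    starts.append(len(text))
    return [text[a:b].strip() for a, b in zip(starts, starts[1:]) if text[a:b].strip()]
```

Each section can then be classified independently, the same way the paragraph-level chunks would be.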
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `whereas` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_whereas_md_en_1.0.0_3.0_1669376529983.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_whereas_md_en_1.0.0_3.0_1669376529983.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_whereas_md", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +---------+ | result| +---------+ |[whereas]| |[other]| |[other]| |[whereas]| +---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_whereas_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet and classified in-house + SEC documents ## Benchmarking ```bash precision recall f1-score support other 0.93 1.00 0.96 39 whereas 1.00 0.92 0.96 38 accuracy 0.96 77 macro avg 0.96 0.96 0.96 77 weighted avg 0.96 0.96 0.96 77 ``` --- layout: model title: Translate English to Semitic languages Pipeline author: John Snow Labs name: translate_en_sem date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, sem, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `sem` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_sem_xx_2.7.0_2.4_1609690194050.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_sem_xx_2.7.0_2.4_1609690194050.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_sem", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_sem", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.sem').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_sem| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Detect Clinical Entities (ner_jsl_greedy_biobert) author: John Snow Labs name: ner_jsl_greedy_biobert_pipeline date: 2023-03-20 tags: [ner, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_jsl_greedy_biobert](https://nlp.johnsnowlabs.com/2021/08/13/ner_jsl_greedy_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_greedy_biobert_pipeline_en_4.3.0_3.2_1679310105776.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_greedy_biobert_pipeline_en_4.3.0_3.2_1679310105776.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_jsl_greedy_biobert_pipeline", "en", "clinical/models") text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_jsl_greedy_biobert_pipeline", "en", "clinical/models") val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. 
The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.biobert_jsl_greedy.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-----------------------------------------------|--------:|------:|:-----------------------------|-------------:| | 0 | 21-day-old | 17 | 26 | Age | 1 | | 1 | Caucasian | 28 | 36 | Race_Ethnicity | 0.9488 | | 2 | male | 38 | 41 | Gender | 0.9978 | | 3 | for 2 days | 48 | 57 | Duration | 0.7709 | | 4 | congestion | 62 | 71 | Symptom | 0.5467 | | 5 | mom | 75 | 77 | Gender | 0.9355 | | 6 | suctioning yellow discharge | 88 | 114 | Symptom | 0.327867 | | 7 | nares | 135 | 139 | External_body_part_or_region | 0.8963 | | 8 | she | 147 | 149 | Gender | 0.995 | | 9 | mild problems with his breathing while feeding | 168 | 213 | Symptom | 0.588714 | | 10 | perioral cyanosis | 237 | 253 | Symptom | 0.58635 | | 11 | retractions | 258 | 268 | Symptom | 0.9864 | | 12 | One day ago | 272 | 282 | RelativeDate | 0.755833 | | 13 | mom | 285 | 287 | Gender | 0.9956 | | 14 | tactile temperature | 304 | 322 | Symptom | 0.10505 | | 15 | Tylenol | 345 | 351 | Drug | 0.9496 | | 16 | Baby | 354 | 357 | Age | 0.976 | | 17 | decreased p.o. intake | 377 | 397 | Symptom | 0.448125 | | 18 | His | 400 | 402 | Gender | 0.999 | | 19 | q.2h. 
to 5 to 10 minutes | 450 | 473 | Frequency | 0.298843 | | 20 | his | 488 | 490 | Gender | 0.9976 | | 21 | respiratory congestion | 492 | 513 | VS_Finding | 0.6158 | | 22 | He | 516 | 517 | Gender | 0.9998 | | 23 | tired | 550 | 554 | Symptom | 0.8912 | | 24 | fussy | 569 | 573 | Symptom | 0.9541 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_greedy_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.8 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Sentence Entity Resolver for RxNorm (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_rxnorm language: en nav_key: models repository: clinical/models date: 2020-11-27 task: Entity Resolution edition: Healthcare NLP 2.6.4 spark_version: 2.4 tags: [clinical,entity_resolution,en] supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model maps extracted medical entities to RxNorm codes using chunk embeddings. {:.h2_title} ## Predicted Entities RxNorm Codes and their normalized definition with ``sbiobert_base_cased_mli`` embeddings. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_en_2.6.4_2.4_1606235763316.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_en_2.6.4_2.4_1606235763316.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, rxnorm_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala ... 
val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val rxnorm_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_rxnorm","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, rxnorm_resolver)) val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.h2_title} ## Results ```bash +--------------------+-----+---+---------+-------+----------+-----------------------------------------------+--------------------+ | chunk|begin|end| entity| code|confidence| resolutions| codes| +--------------------+-----+---+---------+-------+----------+-----------------------------------------------+--------------------+ | hypertension| 68| 79| PROBLEM| 386165| 0.1567|hypercal:::hypersed:::hypertears:::hyperstat...|386165:::217667::...| |chronic renal ins...| 83|109| PROBLEM| 218689| 0.1036|nephro calci:::dialysis solutions:::creatini...|218689:::3310:::2...| | COPD| 113|116| PROBLEM|1539999| 
0.1644|broncomar dm:::acne medication:::carbon mono...|1539999:::214981:...| | gastritis| 120|128| PROBLEM| 225965| 0.1983|gastroflux:::gastroflux oral product:::uceri...|225965:::1176661:...| | TIA| 136|138| PROBLEM|1089812| 0.0625|thera tears:::thiotepa injection:::nature's ...|1089812:::1660003...| |a non-ST elevatio...| 182|202| PROBLEM| 218767| 0.1007|non-aspirin pm:::aspirin-free:::non aspirin ...|218767:::215440::...| |Guaiac positive s...| 208|229| PROBLEM|1294361| 0.0820|anusol rectal product:::anusol hc rectal pro...|1294361:::1166715...| |cardiac catheteri...| 295|317| TEST| 385247| 0.1566|cardiacap:::cardiology pack:::cardizem:::car...|385247:::545063::...| | PTCA| 324|327|TREATMENT| 8410| 0.0867|alteplase:::reteplase:::pancuronium:::tripe ...|8410:::76895:::78...| | mid LAD lesion| 332|345| PROBLEM| 151672| 0.0549|dulcolax:::lazerformalyde:::linaclotide:::du...|151672:::217985::...| +--------------------+-----+---+---------+-------+----------+-----------------------------------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---------------|---------------------| | Name: | sbiobertresolve_rxnorm | | Type: | SentenceEntityResolverModel | | Compatibility: | Spark NLP 2.6.4 + | | License: | Licensed | | Edition: | Official | |Input labels: | [ner_chunk, chunk_embeddings] | |Output labels: | [resolution] | | Language: | en | | Dependencies: | sbiobert_base_cased_mli | {:.h2_title} ## Data Source Trained on November 2020 RxNorm Clinical Drugs ontology graph with ``sbiobert_base_cased_mli`` embeddings. 
https://www.nlm.nih.gov/pubs/techbull/nd20/brief/nd20_rx_norm_november_release.html --- layout: model title: English asr_wav2vec2_cetuc_sid_voxforge_mls_1 TFWav2Vec2ForCTC from joaoalvarenga author: John Snow Labs name: asr_wav2vec2_cetuc_sid_voxforge_mls_1 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_cetuc_sid_voxforge_mls_1` is an English model originally trained by joaoalvarenga. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_cetuc_sid_voxforge_mls_1_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_cetuc_sid_voxforge_mls_1_en_4.2.0_3.0_1664023215604.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_cetuc_sid_voxforge_mls_1_en_4.2.0_3.0_1664023215604.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_cetuc_sid_voxforge_mls_1", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_cetuc_sid_voxforge_mls_1", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
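Both snippets assume an `audioDf` DataFrame whose `audio_content` column holds the raw audio as an array of floats (Wav2vec2 models are typically trained on 16 kHz mono audio). As a hedged, stdlib-only sketch of how such a float array could be produced from a 16-bit PCM WAV file (the Spark DataFrame construction itself is left out; the file name is just for the demo):

```python
import struct
import wave

def wav_to_floats(path):
    """Decode a 16-bit PCM WAV file into a list of floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "this sketch expects 16-bit PCM"
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Self-contained demo: write a one-second silent mono clip, then decode it.
with wave.open("demo.wav", "wb") as out:
    out.setnchannels(1)      # mono
    out.setsampwidth(2)      # 16-bit samples
    out.setframerate(16000)  # 16 kHz
    out.writeframes(struct.pack("<16000h", *([0] * 16000)))

audio = wav_to_floats("demo.wav")
print(len(audio))  # 16000 samples
```

The resulting list can then be placed in the `audio_content` column, e.g. with `spark.createDataFrame([[audio]]).toDF("audio_content")`, before running the pipeline above.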
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_cetuc_sid_voxforge_mls_1| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Multilingual BertForQuestionAnswering Cased model (from roshnir) author: John Snow Labs name: bert_qa_mbert_finetuned_mlqa_ar_hi_dev date: 2022-07-07 tags: [xx, open_source, bert, question_answering] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mBert-finetuned-mlqa-dev-ar-hi` is a Multilingual model originally trained by `roshnir`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_ar_hi_dev_xx_4.0.0_3.0_1657189808354.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_ar_hi_dev_xx_4.0.0_3.0_1657189808354.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_ar_hi_dev","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_ar_hi_dev","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_mbert_finetuned_mlqa_ar_hi_dev| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|626.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/roshnir/mBert-finetuned-mlqa-dev-ar-hi --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from nlp-en-es) author: John Snow Labs name: roberta_qa_nlp_en_es_base_bne_finetuned_s_c date: 2022-12-02 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bne-finetuned-sqac` is a Spanish model originally trained by `nlp-en-es`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_nlp_en_es_base_bne_finetuned_s_c_es_4.2.4_3.0_1669985738926.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_nlp_en_es_base_bne_finetuned_s_c_es_4.2.4_3.0_1669985738926.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_nlp_en_es_base_bne_finetuned_s_c","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_nlp_en_es_base_bne_finetuned_s_c","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_nlp_en_es_base_bne_finetuned_s_c| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|460.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/nlp-en-es/roberta-base-bne-finetuned-sqac --- layout: model title: Irish Lemmatizer author: John Snow Labs name: lemma date: 2020-07-29 23:34:00 +0800 task: Lemmatization language: ga edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [lemmatizer, ga] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb#scrollTo=bbzEH9u7tdxR){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_ga_2.5.5_2.4_1596054397576.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_ga_2.5.5_2.4_1596054397576.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "ga") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Seachas a bheith ina rí ar an tuaisceart, is dochtúir Sasanach é John Snow agus ceannaire i bhforbairt ainéistéise agus sláinteachas míochaine.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "ga") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Seachas a bheith ina rí ar an tuaisceart, is dochtúir Sasanach é John Snow agus ceannaire i bhforbairt ainéistéise agus sláinteachas míochaine.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Seachas a bheith ina rí ar an tuaisceart, is dochtúir Sasanach é John Snow agus ceannaire i bhforbairt ainéistéise agus sláinteachas míochaine."""] lemma_df = nlu.load('ga.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=6, result='Seachas', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=8, end=8, result='a', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=10, end=15, result='bheith', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=17, end=19, result='i', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=21, end=22, result='rí', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|ga| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: English DistilBertForQuestionAnswering model (from hcy11) author: John Snow Labs name: distilbert_qa_hcy11_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `hcy11`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_hcy11_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725400512.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_hcy11_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725400512.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hcy11_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hcy11_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_hcy11").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_hcy11_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/hcy11/distilbert-base-uncased-finetuned-squad --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC5CDR_Chem_Modified_BioBERT_512 date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC5CDR-Chem-Modified-BioBERT-512` is a English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC5CDR_Chem_Modified_BioBERT_512_en_4.0.0_3.0_1657109320849.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC5CDR_Chem_Modified_BioBERT_512_en_4.0.0_3.0_1657109320849.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC5CDR_Chem_Modified_BioBERT_512","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC5CDR_Chem_Modified_BioBERT_512","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC5CDR_Chem_Modified_BioBERT_512| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|403.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC5CDR-Chem-Modified-BioBERT-512 --- layout: model title: Legal Asset Purchase Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_asset_purchase_agreement date: 2022-11-10 tags: [en, legal, classification, licensed, agreement] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_asset_purchase_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `asset-purchase-agreement` or not (Binary Classification). Longformers are limited to 4096 tokens, so only the first 4096 tokens of a document are taken into account. We have found that for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document itself without any extra leading text, 4096 tokens are enough for Document Classification. If that is not the case for your documents, let us know and we can take another approach: splitting each document into 4096-token chunks, averaging the chunk embeddings, and training on the averaged version, which means the whole document is taken into account. In theory, however, this should not be required. 
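The chunk-and-average fallback mentioned in the description can be sketched without any Spark NLP machinery: split the token-level embeddings into windows of at most 4096 tokens, mean-pool each window, then mean-pool the window vectors into one document vector. The function below is a plain-Python illustration of that idea only, not part of the library:

```python
def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def chunked_doc_embedding(token_embeddings, max_len=4096):
    """Average embeddings per chunk of at most max_len tokens, then average
    the chunk means, so tokens beyond the first 4096 still contribute."""
    chunks = [token_embeddings[i:i + max_len]
              for i in range(0, len(token_embeddings), max_len)]
    return mean([mean(chunk) for chunk in chunks])

# Toy example: 6000 one-dimensional "embeddings", all equal to 1.0,
# split into one full 4096-token chunk and one 1904-token remainder.
doc_vec = chunked_doc_embedding([[1.0]] * 6000)
print(doc_vec)  # [1.0]
```

Note that averaging chunk means weights a short final chunk the same as a full one; a length-weighted average would instead be equivalent to pooling over all tokens at once.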
## Predicted Entities `asset-purchase-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_asset_purchase_agreement_en_1.0.0_3.0_1668104092936.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_asset_purchase_agreement_en_1.0.0_3.0_1668104092936.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = nlp.ClassifierDLModel.pretrained("legclf_asset_purchase_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[asset-purchase-agreement]| |[other]| |[other]| |[asset-purchase-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_asset_purchase_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.5 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support asset-purchase-agreement 0.96 0.96 0.96 27 other 0.99 0.99 0.99 85 accuracy - - 0.98 112 macro-avg 0.98 0.98 0.98 112 weighted-avg 0.98 0.98 0.98 112 ``` --- layout: model title: Finnish XLMRobertaForTokenClassification Base Cased model (from tner) author: John Snow Labs name: xlmroberta_ner_base_fin date: 2022-08-13 tags: [fi, open_source, xlm_roberta, ner] task: Named Entity Recognition language: fi edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-fin` is a Finnish model originally trained by `tner`. ## Predicted Entities `other`, `person`, `location`, `organization` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_fin_fi_4.1.0_3.0_1660426752654.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_fin_fi_4.1.0_3.0_1660426752654.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_fin","fi") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_fin","fi") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_fin| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|fi| |Size:|773.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/tner/xlm-roberta-base-fin - https://github.com/asahi417/tner --- layout: model title: Ukrainian T5ForConditionalGeneration Cased model (from ukr-models) author: John Snow Labs name: t5_uk_summarizer date: 2023-01-31 tags: [uk, open_source, t5, tensorflow] task: Text Generation language: uk edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `uk-summarizer` is a Ukrainian model originally trained by `ukr-models`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_uk_summarizer_uk_4.3.0_3.0_1675157739525.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_uk_summarizer_uk_4.3.0_3.0_1675157739525.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_uk_summarizer","uk") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_uk_summarizer","uk") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_uk_summarizer| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|uk| |Size:|995.5 MB| ## References - https://huggingface.co/ukr-models/uk-summarizer --- layout: model title: Translate English to Chinese Pipeline author: John Snow Labs name: translate_en_zh date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, zh, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `zh` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_zh_xx_2.7.0_2.4_1609686009785.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_zh_xx_2.7.0_2.4_1609686009785.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_zh", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_zh", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.zh').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_zh| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Appointment Clause Binary Classifier author: John Snow Labs name: legclf_appointment_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `appointment` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens; if your texts are longer than that, consider splitting them into smaller pieces (see the same tutorial link above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, yielding as output a series of True/False values for each of the legal clause models you have added.
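The first splitting technique listed above, paragraph splitting by multiline, can be sketched with a simple regex: break the document on blank lines so each candidate clause can be classified separately. This is a plain-Python illustration, not the exact code from the workshop notebook.

```python
import re

# Split a legal document into candidate clauses on blank lines, so each
# piece can be fed to the binary clause classifier on its own.

def split_paragraphs(text):
    # Two or more consecutive newlines (possibly with whitespace between
    # them) end a paragraph.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = """1. APPOINTMENT.

The Company hereby appoints the Agent as its exclusive representative.

2. TERM.

This Agreement shall remain in force for two (2) years."""

for clause in split_paragraphs(doc):
    print(clause)
```

Each resulting paragraph stays well under the 512-token limit of the sentence embeddings, so it can be passed directly into the `clause_text` column of the pipeline shown below.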
## Predicted Entities `other`, `appointment` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_appointment_clause_en_1.0.0_3.2_1660122120413.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_appointment_clause_en_1.0.0_3.2_1660122120413.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_appointment_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[appointment]| |[other]| |[other]| |[appointment]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_appointment_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support appointment 0.86 0.94 0.90 32 other 0.98 0.95 0.96 101 accuracy - - 0.95 133 macro-avg 0.92 0.94 0.93 133 weighted-avg 0.95 0.95 0.95 133 ``` --- layout: model title: Japanese Bert Embeddings (Base, Character Tokenization, Whole Word Masking) author: John Snow Labs name: bert_embeddings_bert_base_japanese_char_whole_word_masking date: 2022-04-11 tags: [bert, embeddings, ja, open_source] task: Embeddings language: ja edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-japanese-char-whole-word-masking` is a Japanese model originally trained by `cl-tohoku`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_japanese_char_whole_word_masking_ja_3.4.2_3.0_1649674360241.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_japanese_char_whole_word_masking_ja_3.4.2_3.0_1649674360241.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_japanese_char_whole_word_masking","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["私はSpark NLPを愛しています"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_japanese_char_whole_word_masking","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("私はSpark NLPを愛しています").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ja.embed.bert_base_japanese_char_whole_word_masking").predict("""私はSpark NLPを愛しています""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_japanese_char_whole_word_masking| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Size:|334.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/cl-tohoku/bert-base-japanese-char-whole-word-masking - https://github.com/google-research/bert - https://github.com/cl-tohoku/bert-japanese/tree/v1.0 - https://github.com/attardi/wikiextractor - https://taku910.github.io/mecab/ - https://creativecommons.org/licenses/by-sa/3.0/ - https://www.tensorflow.org/tfrc/ --- layout: model title: Fast Neural Machine Translation Model from English to Tonga (Zambezi) author: John Snow Labs name: opus_mt_en_toi date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, toi, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `en` - target languages: `toi` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_toi_xx_2.7.0_2.4_1609166942717.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_toi_xx_2.7.0_2.4_1609166942717.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_toi", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_toi", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.toi').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_toi| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Warranty Clause Binary Classifier (CUAD dataset) author: John Snow Labs name: legclf_cuad_warranty_clause date: 2022-10-18 tags: [warranty, clause, en, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `warranty` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens; if your texts are longer than that, consider splitting them into smaller pieces (see the same tutorial link above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, yielding as output a series of True/False values for each of the legal clause models you have added. There are other models in Models Hub with similar titles; the difference is the dataset each was trained on. This one was trained on the `cuad` dataset. ## Predicted Entities `warranty`, `other` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cuad_warranty_clause_en_1.0.0_3.0_1666097671097.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cuad_warranty_clause_en_1.0.0_3.0_1666097671097.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_cuad_warranty_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["Sutro represents and warrants that:\n\n7.3.1..."]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[warranty]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_cuad_warranty_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.6 MB| ## References CUAD dataset ## Benchmarking ```bash label precision recall f1-score support other 0.96 0.96 0.96 27 warranty 0.93 0.93 0.93 14 accuracy - - 0.95 41 macro-avg 0.95 0.95 0.95 41 weighted-avg 0.95 0.95 0.95 41 ``` --- layout: model title: Pipeline to Detect PHI for Deidentification author: John Snow Labs name: ner_deid_augmented_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, deidentification, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deid_augmented](https://nlp.johnsnowlabs.com/2021/03/31/ner_deid_augmented_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_augmented_pipeline_en_3.4.1_3.0_1647864550318.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_augmented_pipeline_en_3.4.1_3.0_1647864550318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_augmented_pipeline", "en", "clinical/models") pipeline.annotate("HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital, Dr. John Green (2347165768). He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.") ``` ```scala val pipeline = new PretrainedPipeline("ner_deid_augmented_pipeline", "en", "clinical/models") pipeline.annotate("HISTORY OF PRESENT ILLNESS: Mr. 
Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital, Dr. John Green (2347165768). He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.") ``` {:.nlu-block} ```python import nlu nlu.load("en.deid.ner_augmented.pipeline").predict("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital, Dr. John Green (2347165768). He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. 
On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.""") ```
## Results ```bash +---------------+---------+ |chunk |ner_label| +---------------+---------+ |Smith |NAME | |VA Hospital |LOCATION | |John Green |NAME | |2347165768 |ID | |Day Hospital |LOCATION | |02/04/2003 |DATE | |Smith |NAME | |Day Hospital |LOCATION | |Smith |NAME | |Smith |NAME | |7 Ardmore Tower|LOCATION | |Hart |NAME | |Smith |NAME | |02/07/2003 |DATE | +---------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_augmented_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from shahrukhx01) author: John Snow Labs name: roberta_qa_base_squad2_boolq_baseline date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-boolq-baseline` is a English model originally trained by `shahrukhx01`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_boolq_baseline_en_4.3.0_3.0_1674219076102.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_boolq_baseline_en_4.3.0_3.0_1674219076102.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2_boolq_baseline","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2_boolq_baseline","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_squad2_boolq_baseline| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/shahrukhx01/roberta-base-squad2-boolq-baseline --- layout: model title: Translate Seychellois Creole to English Pipeline author: John Snow Labs name: translate_crs_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, crs, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `crs` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_crs_en_xx_2.7.0_2.4_1609690310342.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_crs_en_xx_2.7.0_2.4_1609690310342.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_crs_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_crs_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.crs.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_crs_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_cline_emanuals_tech date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cline-emanuals-techqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_cline_emanuals_tech_en_4.3.0_3.0_1674209326690.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_cline_emanuals_tech_en_4.3.0_3.0_1674209326690.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_cline_emanuals_tech","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_cline_emanuals_tech","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_cline_emanuals_tech| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|466.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/cline-emanuals-techqa --- layout: model title: Vietnamese DistilBERT Base Cased Embeddings author: John Snow Labs name: distilbert_base_cased date: 2022-01-13 tags: [embeddings, distilbert, vietnamese, vi, open_source] task: Embeddings language: vi edition: Spark NLP 3.3.4 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This embeddings model was imported from `Hugging Face`. It is a customized version of `distilbert_base_multilingual_cased` that produces the same representations as the original model, preserving the original accuracy. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_base_cased_vi_3.3.4_3.0_1642064850307.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_base_cased_vi_3.3.4_3.0_1642064850307.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... distilbert = DistilBertEmbeddings.pretrained("distilbert_base_cased", "vi")\ .setInputCols(["sentence",'token'])\ .setOutputCol("embeddings") nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, distilbert]) text = """Tôi yêu Spark NLP""" data = spark.createDataFrame([[text]]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala ... val embeddings = DistilBertEmbeddings.pretrained("distilbert_base_cased", "vi") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("Tôi yêu Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("vi.embed.distilbert.cased").predict("""Tôi yêu Spark NLP""") ```
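The `embeddings` column holds one dense vector per token (see the Results section below). A common downstream step is comparing tokens by cosine similarity. The helper below is a minimal, framework-free sketch of that computation; the sample vectors are illustrative stand-ins, not actual model output.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative low-dimensional stand-ins for the real token vectors.
v_toi = [-0.38, -0.02, 0.11, 0.40]
v_yeu = [-0.33, -0.55, 0.09, 0.21]

print(cosine_similarity(v_toi, v_toi))  # identical vectors give 1.0
print(cosine_similarity(v_toi, v_yeu))
```

In a real pipeline you would pull the vectors out of `result.selectExpr("explode(embeddings.embeddings)")` before comparing them.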
## Results ```bash +-----+--------------------+ |token| embeddings| +-----+--------------------+ | Tôi|[-0.38760236, -0....| | yêu|[-0.3357051, -0.5...| |Spark|[-0.20642707, -0....| | NLP|[-0.013280544, -0...| +-----+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_base_cased| |Compatibility:|Spark NLP 3.3.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|vi| |Size:|211.6 MB| |Case sensitive:|false| --- layout: model title: French CamemBert Embeddings (from Sonny) author: John Snow Labs name: camembert_embeddings_Sonny_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `Sonny`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Sonny_generic_model_fr_3.4.4_3.0_1653986957057.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Sonny_generic_model_fr_3.4.4_3.0_1653986957057.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Sonny_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Sonny_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_Sonny_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/Sonny/dummy-model --- layout: model title: French CamemBert Embeddings (from elusive-magnolia) author: John Snow Labs name: camembert_embeddings_elusive_magnolia_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `elusive-magnolia`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_elusive_magnolia_generic_model_fr_3.4.4_3.0_1653988370340.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_elusive_magnolia_generic_model_fr_3.4.4_3.0_1653988370340.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_elusive_magnolia_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_elusive_magnolia_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_elusive_magnolia_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/elusive-magnolia/dummy-model --- layout: model title: Translate English to Tuvaluan Pipeline author: John Snow Labs name: translate_en_tvl date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, tvl, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `tvl` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_tvl_xx_2.7.0_2.4_1609686360411.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_tvl_xx_2.7.0_2.4_1609686360411.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_tvl", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_tvl", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.tvl').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_tvl| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate Basque to English Pipeline author: John Snow Labs name: translate_eu_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, eu, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `eu` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_eu_en_xx_2.7.0_2.4_1609686527199.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_eu_en_xx_2.7.0_2.4_1609686527199.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_eu_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_eu_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.eu.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_eu_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering (from AlirezaBaneshi) author: John Snow Labs name: roberta_qa_autotrain_test2_756523213 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-test2-756523213` is an English model originally trained by `AlirezaBaneshi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_autotrain_test2_756523213_en_4.0.0_3.0_1655727630639.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_autotrain_test2_756523213_en_4.0.0_3.0_1655727630639.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_autotrain_test2_756523213","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_autotrain_test2_756523213","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.756523213.by_AlirezaBaneshi").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
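The NLU one-liner above packs the question and context into a single string separated by `|||`. A minimal sketch of splitting such a packed string back into its two parts (the separator handling shown here is inferred from the example, not NLU's documented parser):

```python
def split_qa_input(packed, sep="|||"):
    # Split a "question|||context" string into its two parts.
    question, _, context = packed.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa_input("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
print(c)  # My name is Clara and I live in Berkeley.
```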
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_autotrain_test2_756523213| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|415.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AlirezaBaneshi/autotrain-test2-756523213 --- layout: model title: Fast Neural Machine Translation Model from English to French author: John Snow Labs name: opus_mt_en_fr date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, fr, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `fr` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_fr_xx_2.7.0_2.4_1609166836357.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_fr_xx_2.7.0_2.4_1609166836357.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_fr", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_fr", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.fr').predict(text, output_level='sentence') opus_df ```
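Because long inputs are expensive for the transformer, the pipeline above splits each document into sentences before translating. The sketch below illustrates that preprocessing idea outside Spark with a naive regex splitter; it is a toy stand-in for `SentenceDetectorDLModel`, not its actual algorithm.

```python
import re

def naive_sentence_split(text):
    # Naive splitter: break after sentence-ending punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

doc = "Spark NLP is fast. It scales well! Does it translate? Yes."
print(naive_sentence_split(doc))
# ['Spark NLP is fast.', 'It scales well!', 'Does it translate?', 'Yes.']
```

Each resulting sentence can then be translated independently, which keeps individual sequences short.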
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_fr| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Detect Living Species (w2v_cc_300d) author: John Snow Labs name: ner_living_species_pipeline date: 2023-03-13 tags: [gl, ner, clinical, licensed] task: Named Entity Recognition language: gl edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_living_species](https://nlp.johnsnowlabs.com/2022/06/23/ner_living_species_gl_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_pipeline_gl_4.3.0_3.2_1678704830024.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_pipeline_gl_4.3.0_3.2_1678704830024.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_living_species_pipeline", "gl", "clinical/models") text = '''Muller de 45 anos, sen antecedentes médicos de interese, que foi remitida á consulta de dermatoloxía de urxencias por lesións faciales de tres semanas de evolución. A paciente non presentaba lesións noutras localizaciones nin outra clínica de interese. No seu centro de saúde prescribíronlle corticoides tópicos ante a sospeita de picaduras de artrópodos e unha semana despois, antivirales orais baixo o diagnóstico de posible infección herpética. As lesións interferían de forma notable na súa vida persoal e profesional xa que traballaba de face ao púbico. Unha semana máis tarde o diagnóstico foi confirmado ao resultar o cultivo positivo a Staphylococcus aureus.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_living_species_pipeline", "gl", "clinical/models") val text = "Muller de 45 anos, sen antecedentes médicos de interese, que foi remitida á consulta de dermatoloxía de urxencias por lesións faciales de tres semanas de evolución. A paciente non presentaba lesións noutras localizaciones nin outra clínica de interese. No seu centro de saúde prescribíronlle corticoides tópicos ante a sospeita de picaduras de artrópodos e unha semana despois, antivirales orais baixo o diagnóstico de posible infección herpética. As lesións interferían de forma notable na súa vida persoal e profesional xa que traballaba de face ao púbico. Unha semana máis tarde o diagnóstico foi confirmado ao resultar o cultivo positivo a Staphylococcus aureus." val result = pipeline.fullAnnotate(text) ```
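`fullAnnotate` returns chunk annotations like the rows shown in the Results section below. A minimal sketch of tallying predicted chunks per NER label, using hypothetical dicts shaped like that table rather than the actual Spark NLP `Annotation` objects:

```python
from collections import Counter

def count_by_label(chunks):
    # Count how many predicted chunks carry each NER label.
    return Counter(chunk["ner_label"] for chunk in chunks)

# Hypothetical rows mirroring the Results table.
chunks = [
    {"ner_chunk": "Muller", "ner_label": "HUMAN"},
    {"ner_chunk": "paciente", "ner_label": "HUMAN"},
    {"ner_chunk": "artrópodos", "ner_label": "SPECIES"},
    {"ner_chunk": "Staphylococcus aureus", "ner_label": "SPECIES"},
]
print(count_by_label(chunks))  # Counter({'HUMAN': 2, 'SPECIES': 2})
```

Such a tally is a quick sanity check that the pipeline is finding both `HUMAN` and `SPECIES` mentions in a batch of clinical notes.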
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:----------------------|--------:|------:|:------------|-------------:| | 0 | Muller | 0 | 5 | HUMAN | 0.9998 | | 1 | paciente | 167 | 174 | HUMAN | 0.9985 | | 2 | artrópodos | 344 | 353 | SPECIES | 0.9647 | | 3 | antivirales | 378 | 388 | SPECIES | 0.8854 | | 4 | herpética | 437 | 445 | SPECIES | 0.9592 | | 5 | púbico | 551 | 556 | HUMAN | 0.7293 | | 6 | Staphylococcus aureus | 644 | 664 | SPECIES | 0.87005 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|gl| |Size:|794.9 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Hindi BertForMaskedLM Cased model (from neuralspace-reverie) author: John Snow Labs name: bert_embeddings_indic_transformers date: 2022-12-02 tags: [hi, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: hi edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `indic-transformers-hi-bert` is a Hindi model originally trained by `neuralspace-reverie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_hi_4.2.4_3.0_1670022367639.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_hi_4.2.4_3.0_1670022367639.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","hi") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","hi") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_indic_transformers| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|hi| |Size:|612.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-hi-bert - https://oscar-corpus.com/ --- layout: model title: Ukrainian DistilBERT Embeddings (from Geotrend) author: John Snow Labs name: distilbert_embeddings_distilbert_base_uk_cased date: 2022-04-12 tags: [distilbert, embeddings, uk, open_source] task: Embeddings language: uk edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-uk-cased` is a Ukrainian model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_uk_cased_uk_3.4.2_3.0_1649783949701.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_uk_cased_uk_3.4.2_3.0_1649783949701.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_uk_cased","uk") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Я люблю Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_uk_cased","uk") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Я люблю Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("uk.embed.distilbert_base_cased").predict("""Я люблю Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_base_uk_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|uk| |Size:|195.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/distilbert-base-uk-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Lemmatizer (Luxembourgish, SpacyLookup) author: John Snow Labs name: lemma_spacylookup date: 2022-03-03 tags: [open_source, lemmatizer, lb] task: Lemmatization language: lb edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Luxembourgish Lemmatizer is a scalable, production-ready version of the Rule-based Lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_lb_3.4.1_3.0_1646316561258.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_lb_3.4.1_3.0_1646316561258.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","lb") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Dir sidd net besser wéi ech"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","lb") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("Dir sidd net besser wéi ech").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("lb.lemma").predict("""Dir sidd net besser wéi ech""") ```
## Results ```bash +--------------------------------------+ |result | +--------------------------------------+ |[dir, sidd, net, besseren, wéien, ech]| +--------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma_spacylookup| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|lb| |Size:|3.9 MB| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18) author: John Snow Labs name: distilbert_qa_base_uncased_becasv2_3 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-becasv2-3` is an English model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becasv2_3_en_4.3.0_3.0_1672767723393.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becasv2_3_en_4.3.0_3.0_1672767723393.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becasv2_3","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becasv2_3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_becasv2_3| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Evelyn18/distilbert-base-uncased-becasv2-3 --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from bdickson) author: John Snow Labs name: distilbert_qa_bdickson_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `bdickson`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_bdickson_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770181883.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_bdickson_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770181883.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bdickson_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bdickson_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_bdickson_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/bdickson/distilbert-base-uncased-finetuned-squad --- layout: model title: German Bert Embeddings (Cased) author: John Snow Labs name: bert_embeddings_bert_base_german_dbmdz_cased date: 2022-04-11 tags: [bert, embeddings, de, open_source] task: Embeddings language: de edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-german-dbmdz-cased` is a German model originally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_german_dbmdz_cased_de_3.4.2_3.0_1649676089568.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_german_dbmdz_cased_de_3.4.2_3.0_1649676089568.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_german_dbmdz_cased","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_german_dbmdz_cased","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.embed.bert_base_german_dbmdz_cased").predict("""Ich liebe Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_german_dbmdz_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|412.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/bert-base-german-dbmdz-cased --- layout: model title: Translate Estonian to English Pipeline author: John Snow Labs name: translate_et_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, et, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `et` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_et_en_xx_2.7.0_2.4_1609699041974.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_et_en_xx_2.7.0_2.4_1609699041974.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_et_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_et_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.et.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_et_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav TFWav2Vec2ForCTC from vai6hav author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav` is an English model originally trained by vai6hav. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav_en_4.2.0_3.0_1664113051189.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav_en_4.2.0_3.0_1664113051189.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_8 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-512-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_8_en_4.0.0_3.0_1654191650975.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_8_en_4.0.0_3.0_1654191650975.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_8","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_8","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_512d_seed_8").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")  ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_8| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|387.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-512-finetuned-squad-seed-8 --- layout: model title: Sentence Entity Resolver for Snomed Concepts, Body Structure Version (sbiobert_base_cased_mli) author: John Snow Labs name: sbiobertresolve_snomed_bodyStructure date: 2021-06-15 tags: [snomed, en, clinical, licensed] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical (anatomical structures) entities to Snomed codes (body structure version) using sentence embeddings. ## Predicted Entities Snomed Codes and their normalized definition with `sbiobert_base_cased_mli` embeddings. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_bodyStructure_en_3.1.0_3.0_1623774132614.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_bodyStructure_en_3.1.0_3.0_1623774132614.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The ```sbiobertresolve_snomed_bodyStructure``` resolver model must be used with ```sbiobert_base_cased_mli``` as embeddings and ```ner_jsl``` as the NER model, with ```Disease_Syndrome_Disorder, External_body_part_or_region``` set in ```.setWhiteList()```. Alternatively, it can be used with ```sbiobert_base_cased_mli``` as embeddings and the merged chunks of the ```ner_jsl``` and ```ner_anatomy_coarse``` NER models; in that case there is no need to set ```.setWhiteList()```.
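As a rough plain-Python illustration (not the Spark NLP API), the hypothetical helper below mimics what `.setWhiteList()` does on a NER converter: it keeps only the chunks whose entity label is in the allowed set. The chunk texts and labels are made-up examples.

```python
def white_list(chunks, allowed):
    """Keep only (text, label) chunks whose label is in the allowed set,
    preserving their original order."""
    allowed = set(allowed)
    return [(text, label) for text, label in chunks if label in allowed]

# Made-up NER output: (chunk text, entity label) pairs.
chunks = [("amputation stump", "External_body_part_or_region"),
          ("diabetes", "Disease_Syndrome_Disorder"),
          ("aspirin", "Drug")]

# Only the two whitelisted labels survive; "Drug" is filtered out.
print(white_list(chunks, ["Disease_Syndrome_Disorder", "External_body_part_or_region"]))
```

In the actual pipeline this filtering is done by the NER converter itself, so only anatomical chunks reach the resolver.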
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") jsl_sbert_embedder = BertSentenceEmbeddings\ .pretrained('sbiobert_base_cased_mli','en','clinical/models')\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") snomed_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_snomed_bodyStructure", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("snomed_code") snomed_pipelineModel = PipelineModel( stages = [ documentAssembler, jsl_sbert_embedder, snomed_resolver]) snomed_lp = LightPipeline(snomed_pipelineModel) result = snomed_lp.fullAnnotate("Amputation stump") ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sbert_embeddings") val snomed_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_snomed_bodyStructure", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("snomed_code") val snomed_pipelineModel = new Pipeline().setStages(Array(document_assembler, sbert_embedder, snomed_resolver)).fit(Seq("").toDF("text")) val snomed_lp = new LightPipeline(snomed_pipelineModel) val result = snomed_lp.fullAnnotate("Amputation stump") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.snomed_body_structure").predict("""Amputation stump""") ```
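The resolver returns a ranked list of candidate codes together with their embedding distances (the `all_codes` and `all_distances` columns shown in the Results below). As a hedged plain-Python sketch — `best_resolution` is a hypothetical post-processing helper, not part of the Spark NLP API — this is how the closest candidate could be selected from that metadata:

```python
def best_resolution(all_codes, all_distances):
    """Return (code, distance) for the candidate with the smallest
    embedding distance; distances arrive as strings in the output."""
    pairs = zip(all_codes, (float(d) for d in all_distances))
    return min(pairs, key=lambda p: p[1])

# Truncated candidate lists taken from the Results table below.
codes = ['38033009', '771359009', '771364008']
dists = ['0.0000', '0.0773', '0.0858']
print(best_resolution(codes, dists))  # ('38033009', 0.0)
```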
## Results ```bash | | chunks | code | resolutions | all_codes | all_distances | |---:|:-----------------|:---------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------| | 0 | amputation stump | 38033009 | [Amputation stump, Amputation stump of upper limb, Amputation stump of left upper limb, Amputation stump of lower limb, Amputation stump of left lower limb, Amputation stump of right upper limb, Amputation stump of right lower limb, ...]| ['38033009', '771359009', '771364008', '771358001', '771367001', '771365009', '771368006', ...] | ['0.0000', '0.0773', '0.0858', '0.0863', '0.0905', '0.0911', '0.0972', ...] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_snomed_bodyStructure| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[snomed_code]| |Language:|en| |Case sensitive:|true| ## Data Source https://www.snomed.org/ --- layout: model title: English DistilBertForTokenClassification Cased model (from Lucifermorningstar011) author: John Snow Labs name: distilbert_token_classifier_autotrain_ner_778023879 date: 2023-03-03 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and 
production-readiness using Spark NLP. `autotrain-ner-778023879` is an English model originally trained by `Lucifermorningstar011`. ## Predicted Entities `9`, `0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_ner_778023879_en_4.3.1_3.0_1677881870073.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_ner_778023879_en_4.3.1_3.0_1677881870073.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_ner_778023879","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_ner_778023879","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_ner_778023879| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Lucifermorningstar011/autotrain-ner-778023879 --- layout: model title: Pipeline to Extract Negation and Uncertainty Entities from Spanish Medical Texts (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_negation_uncertainty_pipeline date: 2023-03-20 tags: [es, clinical, licensed, token_classification, bert, ner, negation, uncertainty, linguistics] task: Named Entity Recognition language: es edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [bert_token_classifier_negation_uncertainty](https://nlp.johnsnowlabs.com/2022/08/11/bert_token_classifier_negation_uncertainty_es_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_negation_uncertainty_pipeline_es_4.3.0_3.2_1679298806721.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_negation_uncertainty_pipeline_es_4.3.0_3.2_1679298806721.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_negation_uncertainty_pipeline", "es", "clinical/models") text = '''Con diagnóstico probable de cirrosis hepática (no conocida previamente) y peritonitis espontanea primaria con tratamiento durante 8 dias con ceftriaxona en el primer ingreso (no se realizó paracentesis control por escasez de liquido). Lesión tumoral en hélix izquierdo de 0,5 cms. de diámetro susceptible de ca basocelular perlado.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_negation_uncertainty_pipeline", "es", "clinical/models") val text = "Con diagnóstico probable de cirrosis hepática (no conocida previamente) y peritonitis espontanea primaria con tratamiento durante 8 dias con ceftriaxona en el primer ingreso (no se realizó paracentesis control por escasez de liquido). Lesión tumoral en hélix izquierdo de 0,5 cms. de diámetro susceptible de ca basocelular perlado." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-------------------------------------------------------|--------:|------:|:------------|-------------:| | 0 | probable | 16 | 23 | UNC | 0.999994 | | 1 | de cirrosis hepática | 25 | 44 | USCO | 0.999988 | | 2 | no | 47 | 48 | NEG | 0.999995 | | 3 | conocida previamente | 50 | 69 | NSCO | 0.999992 | | 4 | no | 175 | 176 | NEG | 0.999995 | | 5 | se realizó paracentesis control por escasez de liquido | 178 | 231 | NSCO | 0.999995 | | 6 | susceptible de | 293 | 306 | UNC | 0.999986 | | 7 | ca basocelular perlado | 308 | 329 | USCO | 0.99999 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_negation_uncertainty_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|es| |Size:|410.6 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18) author: John Snow Labs name: distilbert_qa_base_uncased_becas_3 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-becas-3` is an English model originally trained by `Evelyn18`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_3_en_4.3.0_3.0_1672767491512.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_3_en_4.3.0_3.0_1672767491512.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_3","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_becas_3| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Evelyn18/distilbert-base-uncased-becas-3 --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from k3nneth) author: John Snow Labs name: xlmroberta_ner_k3nneth_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `k3nneth`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_k3nneth_base_finetuned_panx_de_4.1.0_3.0_1660434928587.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_k3nneth_base_finetuned_panx_de_4.1.0_3.0_1660434928587.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_k3nneth_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_k3nneth_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_k3nneth_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/k3nneth/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Legal The closing Clause Binary Classifier author: John Snow Labs name: legclf_the_closing_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `the-closing` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `the-closing` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_the_closing_clause_en_1.0.0_3.2_1660123095618.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_the_closing_clause_en_1.0.0_3.2_1660123095618.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_the_closing_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
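The description above recommends pre-splitting long contracts into paragraphs before classification. As a minimal illustration (plain Python, not the Legal NLP splitting annotators themselves), paragraph splitting by multiline can be sketched as:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document on blank lines (one or more empty lines),
    dropping whitespace-only fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Hypothetical contract excerpt for illustration
doc = (
    "SECTION 9. THE CLOSING.\nThe Closing shall take place at the offices of the Seller.\n"
    "\n"
    "SECTION 10. NOTICES.\nAll notices shall be in writing."
)
paragraphs = split_paragraphs(doc)
print(len(paragraphs))  # 2
```

Each resulting paragraph can then be loaded as a separate `clause_text` row before running the pipeline above.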
## Results ```bash +-------+ | result| +-------+ |[the-closing]| |[other]| |[other]| |[the-closing]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_the_closing_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 1.00 1.00 105 the-closing 1.00 1.00 1.00 35 accuracy - - 1.00 140 macro-avg 1.00 1.00 1.00 140 weighted-avg 1.00 1.00 1.00 140 ``` --- layout: model title: Clinical Portuguese Bert Embeddings (Clinical) author: John Snow Labs name: biobert_embeddings_clinical date: 2022-04-11 tags: [biobert, embeddings, pt, open_source] task: Embeddings language: pt edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BioBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `biobertpt-clin` is a Portuguese model originally trained by `pucpr`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/biobert_embeddings_clinical_pt_3.4.2_3.0_1649686994929.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/biobert_embeddings_clinical_pt_3.4.2_3.0_1649686994929.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("biobert_embeddings_clinical","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Odeio o cancro"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("biobert_embeddings_clinical","pt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Odeio o cancro").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.embed.gs_clinical").predict("""Odeio o cancro""") ```
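A common downstream use of these embeddings is comparing tokens or sentences by cosine similarity. A generic, library-free sketch of that computation (the 3-dimensional vectors are toy stand-ins for the 768-dimensional BERT vectors in the `embeddings` column):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 0.0, 0.0]
print(round(cosine_similarity(v1, v2), 4))  # 0.7071
```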
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|biobert_embeddings_clinical| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|pt| |Size:|667.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/pucpr/biobertpt-clin - https://aclanthology.org/2020.clinicalnlp-1.7/ - https://github.com/HAILab-PUCPR/BioBERTpt --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_42 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-16-finetuned-squad-seed-42` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_42_en_4.0.0_3.0_1654191455684.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_42_en_4.0.0_3.0_1654191455684.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_42","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_42","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_seed_42").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
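Under the hood, extractive QA models of this kind score every token as a candidate answer start and answer end, and the predicted answer is the highest-scoring valid span. A toy sketch of that span-selection step (the scores below are made up for illustration; the real model produces them from the question/context pair):

```python
def best_span(start_scores, end_scores, max_len=30):
    """Pick (start, end) maximizing start_scores[s] + end_scores[e]
    subject to s <= e < s + max_len."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 1.0, 0.0]
end   = [0.0, 0.1, 0.0, 4.5, 0.0, 0.0, 0.0, 0.2, 1.5, 0.1]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```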
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_42| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|380.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-16-finetuned-squad-seed-42 --- layout: model title: English asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu TFWav2Vec2ForCTC from adelgalu author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu` is an English model originally trained by adelgalu. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu_en_4.2.0_3.0_1664098872120.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu_en_4.2.0_3.0_1664098872120.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu", lang = "en") val annotations = pipeline.transform(audioDF) ```
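The pipeline expects `audioDF` to contain raw floating-point samples at 16 kHz; how you build that DataFrame depends on your setup. A stdlib-only sketch of decoding 16-bit PCM WAV data into such a float array (the in-memory synthetic tone stands in for a real recording, and the 16 kHz rate is what Wav2Vec2 models are trained on):

```python
import io
import math
import struct
import wave

sr = 16000  # sample rate expected by Wav2Vec2-style models

# Synthesize 0.1 s of a 440 Hz tone as 16-bit mono PCM (stand-in for real audio)
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)   # 16-bit samples
    w.setframerate(sr)
    w.writeframes(b"".join(
        struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * 440 * t / sr)))
        for t in range(sr // 10)))

# Decode it back into the list of floats in [-1, 1] that an
# "audio_content" column would hold
buf.seek(0)
with wave.open(buf, "rb") as w:
    raw = w.readframes(w.getnframes())
samples = [s / 32768.0 for (s,) in struct.iter_unpack("<h", raw)]
print(len(samples))  # 1600
```

A list like `samples` can then be placed in a single-column Spark DataFrame and fed to the pipeline.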
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|349.3 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering model (from maroo93) author: John Snow Labs name: bert_qa_squad1.1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squad1.1` is an English model originally trained by `maroo93`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_squad1.1_en_4.0.0_3.0_1654192132045.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_squad1.1_en_4.0.0_3.0_1654192132045.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_squad1.1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_squad1.1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_maroo93").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_squad1.1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/maroo93/squad1.1 --- layout: model title: Pretrained Pipeline for Few-NERD-General NER Model author: John Snow Labs name: nerdl_fewnerd_100d_pipeline date: 2022-06-28 tags: [fewnerd, nerdl, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the Few-NERD model and detects: `PERSON`, `ORGANIZATION`, `LOCATION`, `ART`, `BUILDING`, `PRODUCT`, `EVENT`, `OTHER` ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_FEW_NERD/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_FewNERD.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/nerdl_fewnerd_100d_pipeline_en_4.0.0_3.0_1656388980361.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/nerdl_fewnerd_100d_pipeline_en_4.0.0_3.0_1656388980361.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python fewnerd_pipeline = PretrainedPipeline("nerdl_fewnerd_100d_pipeline", lang = "en") fewnerd_pipeline.annotate("""The Double Down is a sandwich offered by Kentucky Fried Chicken restaurants. He did not see active service again until 1882, when he took part in the Anglo-Egyptian War, and was present at the battle of Tell El Kebir (September 1882), for which he was mentioned in dispatches, received the Egypt Medal with clasp and the 3rd class of the Order of Medjidie, and was appointed a Companion of the Order of the Bath (CB).""") ``` ```scala val pipeline = new PretrainedPipeline("nerdl_fewnerd_100d_pipeline", lang = "en") val result = pipeline.fullAnnotate("The Double Down is a sandwich offered by Kentucky Fried Chicken restaurants. He did not see active service again until 1882, when he took part in the Anglo-Egyptian War, and was present at the battle of Tell El Kebir (September 1882), for which he was mentioned in dispatches, received the Egypt Medal with clasp and the 3rd class of the Order of Medjidie, and was appointed a Companion of the Order of the Bath (CB).")(0) ```
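The chunks this pipeline returns come from merging the token-level IOB tags produced by NerDLModel into entities, a job done by the NerConverter stage. A minimal pure-Python sketch of that merging logic — illustrative only, not the library's implementation:

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB2-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O" or an inconsistent tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Kentucky", "Fried", "Chicken", "opened", "in", "London"]
tags = ["B-ORGANIZATION", "I-ORGANIZATION", "I-ORGANIZATION", "O", "O", "B-LOCATION"]
print(iob_to_chunks(tokens, tags))
# [('Kentucky Fried Chicken', 'ORGANIZATION'), ('London', 'LOCATION')]
```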
## Results ```bash +-----------------------+------------+ |chunk |ner_label | +-----------------------+------------+ |Kentucky Fried Chicken |ORGANIZATION| |Anglo-Egyptian War |EVENT | |battle of Tell El Kebir|EVENT | |Egypt Medal |OTHER | |Order of Medjidie |OTHER | +-----------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|nerdl_fewnerd_100d_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|167.3 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - NerDLModel - NerConverter - Finisher --- layout: model title: English RobertaForQuestionAnswering Cased model (from Ching) author: John Snow Labs name: roberta_qa_negation_detector date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `negation_detector` is an English model originally trained by `Ching`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_negation_detector_en_4.3.0_3.0_1674211601485.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_negation_detector_en_4.3.0_3.0_1674211601485.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_negation_detector","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_negation_detector","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_negation_detector| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Ching/negation_detector --- layout: model title: Extract test entities (Voice of the Patients) author: John Snow Labs name: ner_vop_test_wip date: 2023-04-20 tags: [licensed, clinical, en, ner, vop, patient, test] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts test mentions from documents written in the patient’s own words. Note: the ‘wip’ suffix indicates that model development is work-in-progress; the model will be finalised and its performance improved in upcoming releases. ## Predicted Entities `Measurements`, `TestResult`, `Test`, `VitalTest` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_test_wip_en_4.4.0_3.0_1682013044617.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_test_wip_en_4.4.0_3.0_1682013044617.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_test_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["I went to the endocrinology department to get my thyroid levels checked. They ordered a blood test and found out that I have hypothyroidism, so now I'm on medication to manage it."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_test_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("I went to the endocrinology department to get my thyroid levels checked. They ordered a blood test and found out that I have hypothyroidism, so now I'm on medication to manage it.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:---------------|:------------| | thyroid levels | Test | | blood test | Test | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_test_wip| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.9 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I”m 20 year old girl. I”m diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I”m taking weekly supplement of vitamin D and 1000 mcg b12 daily. I”m taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I”m facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
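The precision, recall, and F1 figures reported in the benchmarking section follow directly from the tp/fp/fn counts in the same table. A quick sketch that reproduces the `Test` row:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# `Test` label row: tp=1194, fp=98, fn=207
p, r, f = prf(1194, 98, 207)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.92 0.85 0.89
```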
## Benchmarking ```bash label tp fp fn total precision recall f1 Measurements 100 58 30 130 0.63 0.77 0.69 TestResult 452 124 182 634 0.78 0.71 0.75 Test 1194 98 207 1401 0.92 0.85 0.89 VitalTest 195 20 23 218 0.91 0.89 0.90 macro_avg 1941 300 442 2383 0.81 0.80 0.81 micro_avg 1941 300 442 2383 0.87 0.81 0.84 ``` --- layout: model title: Sentiment Analysis of IMDB Reviews Pipeline (analyze_sentimentdl_use_imdb) author: John Snow Labs name: analyze_sentimentdl_use_imdb date: 2021-01-15 task: [Embeddings, Sentiment Analysis, Pipeline Public] language: en nav_key: models edition: Spark NLP 2.7.1 spark_version: 2.4 tags: [en, pipeline, sentiment] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pre-trained pipeline to classify IMDB reviews in `neg` and `pos` classes using `tfhub_use` embeddings. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/analyze_sentimentdl_use_imdb_en_2.7.1_2.4_1610723836151.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/analyze_sentimentdl_use_imdb_en_2.7.1_2.4_1610723836151.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("analyze_sentimentdl_use_imdb", lang = "en") result = pipeline.fullAnnotate("Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("analyze_sentimentdl_use_imdb", lang = "en") val result = pipeline.fullAnnotate("Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now!") ``` {:.nlu-block} ```python import nlu text = ["""Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now!"""] sentiment_df = nlu.load('en.sentiment.imdb.use').predict(text, output_level='sentence') sentiment_df ```
## Results ```bash |    | document | sentiment | |---:|---|---| |  0 | Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now! | positive | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|analyze_sentimentdl_use_imdb| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.1+| |Edition:|Official| |Language:|en| ## Included Models `tfhub_use`, `sentimentdl_use_imdb` --- layout: model title: Legal Deposit Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_deposit_agreement_bert date: 2022-12-06 tags: [en, legal, classification, agreement, deposit, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_deposit_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `deposit-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `deposit-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_deposit_agreement_bert_en_1.0.0_3.0_1670349380582.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_deposit_agreement_bert_en_1.0.0_3.0_1670349380582.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_deposit_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
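Downstream, the `category` column is usually reduced to a boolean per document. A minimal sketch of that step in plain Python (the helper name is ours, not part of Spark NLP):

```python
def flag_deposit_agreements(categories):
    """Map classifier outputs to a True/False flag per document."""
    return [c == "deposit-agreement" for c in categories]

predictions = ["deposit-agreement", "other", "other", "deposit-agreement"]
print(flag_deposit_agreements(predictions))  # [True, False, False, True]
```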
## Results ```bash +-------+ |result| +-------+ |[deposit-agreement]| |[other]| |[other]| |[deposit-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_deposit_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support deposit-agreement 0.97 0.97 0.97 36 other 0.98 0.98 0.98 65 accuracy - - 0.98 101 macro-avg 0.98 0.98 0.98 101 weighted-avg 0.98 0.98 0.98 101 ``` --- layout: model title: Legal Section Headings Clause Binary Classifier author: John Snow Labs name: legclf_section_headings_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `section-headings` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `section-headings` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_section_headings_clause_en_1.0.0_3.2_1660122983672.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_section_headings_clause_en_1.0.0_3.2_1660122983672.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_section_headings_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
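As the description notes, the sentence embeddings cap input at 512 tokens, so long contracts are best pre-split by paragraph before classification. A rough pure-Python sketch of that splitting step (whitespace tokens only, an approximation of the real tokenizer; the helper name is ours):

```python
import re

MAX_TOKENS = 512  # the sentence embeddings truncate input beyond this

def split_into_paragraphs(text, max_tokens=MAX_TOKENS):
    """Split a long contract on blank lines and whitespace-truncate each
    paragraph to the embedding limit (rough token count, for illustration)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [" ".join(p.split()[:max_tokens]) for p in paragraphs]

doc = "SECTION 1. Headings.\n\nThe section headings are for convenience only.\n\n"
print(split_into_paragraphs(doc))
```

Each resulting paragraph can then be fed to the pipeline as a separate row of the `clause_text` column.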
## Results ```bash +-------+ | result| +-------+ |[section-headings]| |[other]| |[other]| |[section-headings]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_section_headings_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 1.00 1.00 159 section-headings 1.00 1.00 1.00 46 accuracy - - 1.00 205 macro-avg 1.00 1.00 1.00 205 weighted-avg 1.00 1.00 1.00 205 ``` --- layout: model title: Extract Anatomical Entities from Voice of the Patient Documents (embeddings_clinical_large) author: John Snow Labs name: ner_vop_anatomy_emb_clinical_large date: 2023-06-06 tags: [licensed, clinical, en, ner, vop, anatomy] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts anatomical terms from documents written in patients' own words. ## Predicted Entities `BodyPart`, `Laterality` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_anatomy_emb_clinical_large_en_4.4.3_3.0_1686074062221.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_anatomy_emb_clinical_large_en_4.4.3_3.0_1686074062221.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_anatomy_emb_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["Ugh, I pulled a muscle in my neck from sleeping weird last night. 
It's like a knot in my trapezius and it hurts to turn my head."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_anatomy_emb_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Ugh, I pulled a muscle in my neck from sleeping weird last night. It's like a knot in my trapezius and it hurts to turn my head.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
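Once the `ner_chunk` annotations are extracted, a common follow-up is grouping chunks under their label. A small illustrative helper in plain Python, operating on (chunk, label) pairs like those in the Results table (the function name is ours, not a Spark NLP API):

```python
from collections import defaultdict

def group_chunks_by_label(chunks):
    """Collect extracted chunk texts under their NER label."""
    grouped = defaultdict(list)
    for chunk, label in chunks:
        grouped[label].append(chunk)
    return dict(grouped)

rows = [("muscle", "BodyPart"), ("neck", "BodyPart"),
        ("trapezius", "BodyPart"), ("head", "BodyPart")]
print(group_chunks_by_label(rows))
```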
## Results ```bash | chunk | ner_label | |:----------|:------------| | muscle | BodyPart | | neck | BodyPart | | trapezius | BodyPart | | head | BodyPart | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_anatomy_emb_clinical_large| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical_large| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
## Benchmarking ```bash label tp fp fn total precision recall f1 BodyPart 2725 236 175 2900 0.92 0.94 0.93 Laterality 546 62 82 628 0.90 0.87 0.88 macro_avg 3271 298 257 3528 0.91 0.90 0.90 micro_avg 3271 298 257 3528 0.92 0.93 0.92 ``` --- layout: model title: Part of Speech for Bulgarian author: John Snow Labs name: pos_btb date: 2021-03-23 tags: [pos, bg, open_source] supported: true task: Part of Speech Tagging language: bg edition: Spark NLP 2.7.5 spark_version: 2.4 annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron` architecture. ## Predicted Entities - ADJ - ADP - ADV - AUX - CCONJ - DET - NOUN - NUM - PART - PRON - PROPN - PUNCT - VERB - X {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_btb_bg_2.7.5_2.4_1616506894131.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_btb_bg_2.7.5_2.4_1616506894131.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_btb", "bg")\ .setInputCols(["document", "token"])\ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Столица на Република България е град София .']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_btb", "bg") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Столица на Република България е град София .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Столица на Република България е град София ."""] token_df = nlu.load('bg.pos.btb').predict(text) token_df ```
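The `pos` output carries exactly one tag per token, so tags can be zipped back onto the tokens. A minimal plain-Python sketch using the example sentence and the tags shown in the Results section:

```python
# Tags taken from the Results section for this example sentence;
# the pairing assumes one POS tag per token, which the annotator guarantees.
tokens = "Столица на Република България е град София .".split()
tags = ["NOUN", "ADP", "NOUN", "PROPN", "AUX", "NOUN", "PROPN", "PUNCT"]
tagged = list(zip(tokens, tags))
print(tagged[:2])
```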
## Results ```bash +--------------------------------------------+-------------------------------------------------+ |text |result | +--------------------------------------------+-------------------------------------------------+ |Столица на Република България е град София .|[NOUN, ADP, NOUN, PROPN, AUX, NOUN, PROPN, PUNCT]| +--------------------------------------------+-------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_btb| |Compatibility:|Spark NLP 2.7.5+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[pos]| |Language:|bg| ## Data Source The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) data set. ## Benchmarking ```bash | | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | ADJ | 0.89 | 0.87 | 0.88 | 1377 | | ADP | 0.95 | 0.95 | 0.95 | 2238 | | ADV | 0.94 | 0.92 | 0.93 | 671 | | AUX | 0.98 | 0.97 | 0.97 | 916 | | CCONJ | 0.96 | 0.95 | 0.96 | 467 | | DET | 0.91 | 0.88 | 0.90 | 273 | | INTJ | 1.00 | 1.00 | 1.00 | 1 | | NOUN | 0.92 | 0.93 | 0.93 | 3486 | | NUM | 0.89 | 0.87 | 0.88 | 223 | | PART | 0.98 | 0.96 | 0.97 | 210 | | PRON | 0.97 | 0.97 | 0.97 | 981 | | PROPN | 0.88 | 0.89 | 0.89 | 805 | | PUNCT | 0.95 | 0.96 | 0.95 | 2268 | | SCONJ | 0.98 | 0.97 | 0.98 | 156 | | VERB | 0.95 | 0.94 | 0.94 | 1652 | | accuracy | | | 0.94 | 15724 | | macro avg | 0.94 | 0.94 | 0.94 | 15724 | | weighted avg | 0.94 | 0.94 | 0.94 | 15724 | ``` --- layout: model title: English asr_wav2vec2_xlsr_53_phon TFWav2Vec2ForCTC from facebook author: John Snow Labs name: asr_wav2vec2_xlsr_53_phon date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: 
"Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_53_phon` is an English model originally trained by facebook. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use `asr_wav2vec2_xlsr_53_phon_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_53_phon_en_4.2.0_3.0_1664109116417.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_53_phon_en_4.2.0_3.0_1664109116417.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xlsr_53_phon", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xlsr_53_phon", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
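Both snippets assume an `audioDf` whose `audio_content` column holds arrays of floats. As a standard-library sketch of how such an array is produced (real projects typically use librosa or soundfile, and Wav2vec2 models are usually trained on 16 kHz mono audio, so resampling may also be needed), here is 16-bit PCM WAV decoding in plain Python:

```python
import io
import struct
import wave

def wav_to_floats(wav_bytes):
    """Decode 16-bit mono PCM WAV bytes into floats in [-1, 1] — the kind
    of array an audio DataFrame column would hold (illustrative sketch)."""
    with wave.open(io.BytesIO(wav_bytes)) as wav:
        n = wav.getnframes()
        pcm = struct.unpack("<%dh" % n, wav.readframes(n))
    return [s / 32768.0 for s in pcm]

# Build a tiny in-memory WAV purely for demonstration.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))

floats = wav_to_floats(buf.getvalue())
print(floats)
```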
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xlsr_53_phon| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|756.9 MB| --- layout: model title: German XlmRoBertaForQuestionAnswering (from saattrupdan) author: John Snow Labs name: xlm_roberta_qa_xlmr_base_texas_squad_de_de_saattrupdan date: 2022-06-24 tags: [de, open_source, question_answering, xlmroberta] task: Question Answering language: de edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlmr-base-texas-squad-de` is a German model originally trained by `saattrupdan`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlmr_base_texas_squad_de_de_saattrupdan_de_4.0.0_3.0_1656062956033.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlmr_base_texas_squad_de_de_saattrupdan_de_4.0.0_3.0_1656062956033.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlmr_base_texas_squad_de_de_saattrupdan","de") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlmr_base_texas_squad_de_de_saattrupdan","de") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("de.answer_question.squad_de_tuned.xlmr_roberta.base.by_saattrupdan").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
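When comparing extracted answers against gold answers, SQuAD-style scoring normalizes both strings first. A simplified exact-match sketch in plain Python (this omits the article-stripping of the official SQuAD script and is not the evaluation code actually used for this model):

```python
import string

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace (simplified
    SQuAD-style normalization, for illustration only)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(prediction, gold):
    """True when normalized prediction and gold answer are identical."""
    return normalize(prediction) == normalize(gold)

print(exact_match("Clara.", "clara"))  # True
```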
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlmr_base_texas_squad_de_de_saattrupdan| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|de| |Size:|874.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/saattrupdan/xlmr-base-texas-squad-de --- layout: model title: Word2Vec Embeddings in Sardinian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, sc, open_source] task: Embeddings language: sc edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sc_3.4.1_3.0_1647455351489.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sc_3.4.1_3.0_1647455351489.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sc") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sc") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("sc.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
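A typical downstream use of the 300-dimensional token vectors is comparing words by cosine similarity. A dependency-free sketch (the two-dimensional toy vectors are illustrative, not actual model output):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```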
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|sc| |Size:|74.3 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Legal Permitted Use Clause Binary Classifier author: John Snow Labs name: legclf_permitted_use_clause date: 2023-02-13 tags: [en, legal, classification, permitted, use, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `permitted_use` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). 
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `permitted_use`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_permitted_use_clause_en_1.0.0_3.0_1676305311848.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_permitted_use_clause_en_1.0.0_3.0_1676305311848.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_permitted_use_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
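As with the other clause classifiers, several binary models can run over the same document and their outputs folded into one True/False map. A hypothetical aggregation step in plain Python (the model names in the dict are illustrative):

```python
def clause_flags(categories_per_model):
    """Fold the outputs of several binary clause classifiers into one
    True/False map per document (hypothetical post-processing sketch)."""
    return {name: cat != "other" for name, cat in categories_per_model.items()}

doc_outputs = {"permitted_use": "permitted_use", "bonus": "other"}
print(clause_flags(doc_outputs))  # {'permitted_use': True, 'bonus': False}
```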
## Results ```bash +-------+ |result| +-------+ |[permitted_use]| |[other]| |[other]| |[permitted_use]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_permitted_use_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.1 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 1.00 1.00 5 permitted_use 1.00 1.00 1.00 11 accuracy - - 1.00 16 macro-avg 1.00 1.00 1.00 16 weighted-avg 1.00 1.00 1.00 16 ``` --- layout: model title: Legal Bonus Clause Binary Classifier author: John Snow Labs name: legclf_bonus_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `bonus` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `bonus` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_bonus_clause_en_1.0.0_3.2_1660122172503.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_bonus_clause_en_1.0.0_3.2_1660122172503.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_bonus_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
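The Benchmarking tables on these cards report per-label precision, recall, and F1, which can be recomputed from raw counts. A small sketch of the standard formulas (the counts here are made up for illustration, not this model's actual confusion matrix):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw true/false positive/negative counts,
    as reported in the Benchmarking tables."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = prf(tp=36, fp=2, fn=2)
print(round(p, 2), round(r, 2), round(f, 2))
```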
## Results ```bash +-------+ | result| +-------+ |[bonus]| |[other]| |[other]| |[bonus]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_bonus_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support bonus 0.97 0.95 0.96 38 other 0.98 0.99 0.98 95 accuracy - - 0.98 133 macro-avg 0.98 0.97 0.97 133 weighted-avg 0.98 0.98 0.98 133 ``` --- layout: model title: Detect Diagnoses and Procedures (Spanish) author: John Snow Labs name: ner_diag_proc date: 2021-03-31 tags: [ner, clinical, licensed, es] task: Named Entity Recognition language: es edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The Named Entity Recognition annotator allows a generic model to be trained using a deep learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. This is a pretrained named entity recognition deep learning model for diagnoses and procedures in Spanish. ## Predicted Entities `DIAGNOSTICO`, `PROCEDIMIENTO`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DIAG_PROC_ES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DIAG_PROC_ES.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_diag_proc_es_3.0.0_3.0_1617208422892.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_diag_proc_es_3.0.0_3.0_1617208422892.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embed = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d","es","clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("word_embeddings") model = MedicalNerModel.pretrained("ner_diag_proc","es","clinical/models")\ .setInputCols("sentence","token","word_embeddings")\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embed, model, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['HISTORIA DE ENFERMEDAD ACTUAL: El Sr. Smith es un hombre blanco veterano de 60 años con múltiples comorbilidades, que tiene antecedentes de cáncer de vejiga diagnosticado hace aproximadamente dos años por el Hospital VA. Allí se sometió a una resección. Debía ser ingresado en el Hospital de Día para una cistectomía. Fue visto en la Clínica de Urología y Clínica de Radiología el 02/04/2003. CURSO DE HOSPITAL: El Sr. Smith se presentó en el Hospital de Día antes de la cirugía de Urología. En evaluación, EKG, ecocardiograma fue anormal, se obtuvo una consulta de Cardiología. Luego se procedió a una resonancia magnética de estrés con adenosina cardíaca, la misma fue positiva para isquemia inducible, infarto subendocárdico inferolateral leve a moderado con isquemia peri-infarto. Además, se observa isquemia inducible en el tabique lateral inferior. El Sr. Smith se sometió a un cateterismo del corazón izquierdo, que reveló una enfermedad de las arterias coronarias de dos vasos. 
La RCA, proximal estaba estenosada en un 95% y la distal en un 80% estenosada. La LAD media estaba estenosada en un 85% y la LAD distal estaba estenosada en un 85%. Se colocaron cuatro stents de metal desnudo Multi-Link Vision para disminuir las cuatro lesiones al 0%. Después de la intervención, el Sr. Smith fue admitido en 7 Ardmore Tower bajo el Servicio de Cardiología bajo la dirección del Dr. Hart. El Sr. Smith tuvo un curso hospitalario post-intervención sin complicaciones. Se mantuvo estable para el alta hospitalaria el 07/02/2003 con instrucciones de tomar Plavix diariamente durante un mes y Urología está al tanto de lo mismo.']], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embed = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d","es","clinical/models") .setInputCols(Array("document","token")) .setOutputCol("word_embeddings") val model = MedicalNerModel.pretrained("ner_diag_proc","es","clinical/models") .setInputCols(Array("sentence","token","word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embed, model, ner_converter)) val data = Seq("""HISTORIA DE ENFERMEDAD ACTUAL: El Sr. Smith es un hombre blanco veterano de 60 años con múltiples comorbilidades, que tiene antecedentes de cáncer de vejiga diagnosticado hace aproximadamente dos años por el Hospital VA. Allí se sometió a una resección. Debía ser ingresado en el Hospital de Día para una cistectomía. Fue visto en la Clínica de Urología y Clínica de Radiología el 02/04/2003. CURSO DE HOSPITAL: El Sr. 
Smith se presentó en el Hospital de Día antes de la cirugía de Urología. En evaluación, EKG, ecocardiograma fue anormal, se obtuvo una consulta de Cardiología. Luego se procedió a una resonancia magnética de estrés con adenosina cardíaca, la misma fue positiva para isquemia inducible, infarto subendocárdico inferolateral leve a moderado con isquemia peri-infarto. Además, se observa isquemia inducible en el tabique lateral inferior. El Sr. Smith se sometió a un cateterismo del corazón izquierdo, que reveló una enfermedad de las arterias coronarias de dos vasos. La RCA, proximal estaba estenosada en un 95% y la distal en un 80% estenosada. La LAD media estaba estenosada en un 85% y la LAD distal estaba estenosada en un 85%. Se colocaron cuatro stents de metal desnudo Multi-Link Vision para disminuir las cuatro lesiones al 0%. Después de la intervención, el Sr. Smith fue admitido en 7 Ardmore Tower bajo el Servicio de Cardiología bajo la dirección del Dr. Hart. El Sr. Smith tuvo un curso hospitalario post-intervención sin complicaciones. Se mantuvo estable para el alta hospitalaria el 07/02/2003 con instrucciones de tomar Plavix diariamente durante un mes y Urología está al tanto de lo mismo.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner").predict("""HISTORIA DE ENFERMEDAD ACTUAL: El Sr. Smith es un hombre blanco veterano de 60 años con múltiples comorbilidades, que tiene antecedentes de cáncer de vejiga diagnosticado hace aproximadamente dos años por el Hospital VA. Allí se sometió a una resección. Debía ser ingresado en el Hospital de Día para una cistectomía. Fue visto en la Clínica de Urología y Clínica de Radiología el 02/04/2003. CURSO DE HOSPITAL: El Sr. Smith se presentó en el Hospital de Día antes de la cirugía de Urología. En evaluación, EKG, ecocardiograma fue anormal, se obtuvo una consulta de Cardiología. 
Luego se procedió a una resonancia magnética de estrés con adenosina cardíaca, la misma fue positiva para isquemia inducible, infarto subendocárdico inferolateral leve a moderado con isquemia peri-infarto. Además, se observa isquemia inducible en el tabique lateral inferior. El Sr. Smith se sometió a un cateterismo del corazón izquierdo, que reveló una enfermedad de las arterias coronarias de dos vasos. La RCA, proximal estaba estenosada en un 95% y la distal en un 80% estenosada. La LAD media estaba estenosada en un 85% y la LAD distal estaba estenosada en un 85%. Se colocaron cuatro stents de metal desnudo Multi-Link Vision para disminuir las cuatro lesiones al 0%. Después de la intervención, el Sr. Smith fue admitido en 7 Ardmore Tower bajo el Servicio de Cardiología bajo la dirección del Dr. Hart. El Sr. Smith tuvo un curso hospitalario post-intervención sin complicaciones. Se mantuvo estable para el alta hospitalaria el 07/02/2003 con instrucciones de tomar Plavix diariamente durante un mes y Urología está al tanto de lo mismo.""") ```
## Results ```bash +----------------------+-------------+ |chunk |ner_label | +----------------------+-------------+ |ENFERMEDAD |DIAGNOSTICO | |cáncer de vejiga |DIAGNOSTICO | |resección |PROCEDIMIENTO| |cistectomía |PROCEDIMIENTO| |estrés |DIAGNOSTICO | |infarto subendocárdico|DIAGNOSTICO | |enfermedad |DIAGNOSTICO | |arterias coronarias |DIAGNOSTICO | +----------------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_diag_proc| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| ## Benchmarking ```bash +-------------+------+------+------+------+---------+------+------+ | entity| tp| fp| fn| total|precision|recall| f1| +-------------+------+------+------+------+---------+------+------+ |PROCEDIMIENTO|2299.0|1103.0| 860.0|3159.0| 0.6758|0.7278|0.7008| | DIAGNOSTICO|6623.0|1364.0|2974.0|9597.0| 0.8292|0.6901|0.7533| +-------------+------+------+------+------+---------+------+------+ +------------------+ | macro| +------------------+ |0.7270531284138397| +------------------+ +------------------+ | micro| +------------------+ |0.7402992400932049| +------------------+ ``` --- layout: model title: English BertForQuestionAnswering model (from motiondew) author: John Snow Labs name: bert_qa_bert_finetuned_lr2_e5_b16_ep2 date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-lr2-e5-b16-ep2` is an English model originally trained by `motiondew`.
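The macro and micro summary values in the ner_diag_proc benchmarking table above follow from the per-entity tp/fp/fn counts. A minimal plain-Python sketch of the arithmetic (the dictionary literal below simply restates the table; reading the reported "micro" value as the support-weighted average of the per-entity F1 scores is our interpretation of the numbers, not an official definition):

```python
# Per-entity counts from the ner_diag_proc benchmarking table.
counts = {
    "PROCEDIMIENTO": {"tp": 2299, "fp": 1103, "fn": 860, "total": 3159},
    "DIAGNOSTICO":   {"tp": 6623, "fp": 1364, "fn": 2974, "total": 9597},
}

def f1(tp, fp, fn):
    # F1 = 2*tp / (2*tp + fp + fn), i.e. the harmonic mean of
    # precision tp/(tp+fp) and recall tp/(tp+fn).
    return 2 * tp / (2 * tp + fp + fn)

per_entity = {e: f1(c["tp"], c["fp"], c["fn"]) for e, c in counts.items()}

# "macro" in the table is the unweighted mean of the per-entity F1 scores.
macro = sum(per_entity.values()) / len(per_entity)

# The reported "micro" value coincides with the average of the
# per-entity F1 scores weighted by the `total` (support) column.
support = sum(c["total"] for c in counts.values())
weighted = sum(per_entity[e] * c["total"] for e, c in counts.items()) / support

print(round(macro, 4), round(weighted, 4))  # 0.7271 0.7403, as in the table
```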
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_lr2_e5_b16_ep2_en_4.0.0_3.0_1654535195058.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_lr2_e5_b16_ep2_en_4.0.0_3.0_1654535195058.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_finetuned_lr2_e5_b16_ep2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_finetuned_lr2_e5_b16_ep2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.by_motiondew").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_finetuned_lr2_e5_b16_ep2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/motiondew/bert-finetuned-lr2-e5-b16-ep2 --- layout: model title: Spanish RobertaForQuestionAnswering (from mrm8488) author: John Snow Labs name: roberta_qa_mrm8488_roberta_base_bne_finetuned_sqac date: 2022-06-20 tags: [es, open_source, question_answering, roberta] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bne-finetuned-sqac` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_mrm8488_roberta_base_bne_finetuned_sqac_es_4.0.0_3.0_1655729996691.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_mrm8488_roberta_base_bne_finetuned_sqac_es_4.0.0_3.0_1655729996691.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_mrm8488_roberta_base_bne_finetuned_sqac","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_mrm8488_roberta_base_bne_finetuned_sqac","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.sqac.roberta.base.by_mrm8488").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_mrm8488_roberta_base_bne_finetuned_sqac| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|es| |Size:|460.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/roberta-base-bne-finetuned-sqac - https://paperswithcode.com/sota?task=Question+Answering&dataset=sqac --- layout: model title: English RobertaForQuestionAnswering Cased model (from clementgyj) author: John Snow Labs name: roberta_qa_finetuned_squad_50k date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-finetuned-squad-50k` is an English model originally trained by `clementgyj`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_squad_50k_en_4.3.0_3.0_1674220438911.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_squad_50k_en_4.3.0_3.0_1674220438911.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_squad_50k","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_squad_50k","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_finetuned_squad_50k| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|462.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/clementgyj/roberta-finetuned-squad-50k --- layout: model title: English AlbertForQuestionAnswering model (from twmkn9) author: John Snow Labs name: albert_base_qa_squad2 date: 2022-06-15 tags: [open_source, albert, question_answering, en] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: AlBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `albert-base-v2-squad2` is an English model originally trained by `twmkn9`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_base_qa_squad2_en_4.0.0_3.0_1655294222450.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_base_qa_squad2_en_4.0.0_3.0_1655294222450.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = AlbertForQuestionAnswering.pretrained("albert_base_qa_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_base_qa_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.span_question.albert").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_base_qa_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|42.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References https://huggingface.co/twmkn9/albert-base-v2-squad2 --- layout: model title: Clinical Deidentification author: John Snow Labs name: clinical_deidentification date: 2023-06-13 tags: [deidentification, pipeline, de, licensed, clinical] task: Pipeline Healthcare language: de edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to deidentify PHI information from **German** medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `STREET`, `USERNAME`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE`, `CONTACT`, `ID`, `PHONE`, `ZIP`, `ACCOUNT`, `SSN`, `DLN`, `PLATE` entities. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_de_4.4.4_3.2_1686663693325.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_de_4.4.4_3.2_1686663693325.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "de", "clinical/models") sample = """Zusammenfassung : Michael Berger wird am Morgen des 12 Dezember 2018 ins St.Elisabeth Krankenhaus eingeliefert. Herr Michael Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: T0110053F Platte A-BC124 Kontonummer: DE89370400440532013000 SSN : 13110587M565 Lizenznummer: B072RRE2I55 Adresse : St.Johann-Straße 13 19300 """ result = deid_pipeline.annotate(sample) print("\n".join(result['masked'])) print("\n".join(result['masked_with_chars'])) print("\n".join(result['masked_fixed_length_chars'])) print("\n".join(result['obfuscated'])) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = PretrainedPipeline("clinical_deidentification","de","clinical/models") val sample = "Zusammenfassung : Michael Berger wird am Morgen des 12 Dezember 2018 ins St.Elisabeth Krankenhaus eingeliefert. Herr Michael Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: T0110053F Platte A-BC124 Kontonummer: DE89370400440532013000 SSN : 13110587M565 Lizenznummer: B072RRE2I55 Adresse : St.Johann-Straße 13 19300" val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("de.deid.clinical").predict("""Zusammenfassung : Michael Berger wird am Morgen des 12 Dezember 2018 ins St.Elisabeth Krankenhaus eingeliefert. Herr Michael Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: T0110053F Platte A-BC124 Kontonummer: DE89370400440532013000 SSN : 13110587M565 Lizenznummer: B072RRE2I55 Adresse : St.Johann-Straße 13 19300 """) ```
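The Results section below shows the same sentence under three masking policies (entity labels, same-length character masks, fixed-length masks) plus obfuscation. As a rough illustration of what those policies do, here is a pure-Python toy with a hand-written span list; it is not the pipeline's actual implementation, and the span offsets and the `mask` helper are assumptions for this example only:

```python
# Toy illustration of the three masking policies shown in the Results
# section. The entity spans are hand-written for this example; in the
# real pipeline they come from the NER stage.
text = "Herr Michael Berger ist 76 Jahre alt."
spans = [(5, 19, "PATIENT"), (24, 26, "AGE")]  # (start, end, label)

def mask(text, spans, policy):
    parts, prev = [], 0
    for start, end, label in spans:
        parts.append(text[prev:start])
        chunk = text[start:end]
        if policy == "entity_labels":
            parts.append(f"<{label}>")
        elif policy == "same_length_chars":
            # Brackets plus stars, preserving the original chunk length.
            parts.append("[" + "*" * max(len(chunk) - 2, 0) + "]")
        elif policy == "fixed_length_chars":
            parts.append("****")
        prev = end
    parts.append(text[prev:])
    return "".join(parts)

print(mask(text, spans, "entity_labels"))      # Herr <PATIENT> ist <AGE> Jahre alt.
print(mask(text, spans, "same_length_chars"))  # Herr [************] ist [] Jahre alt.
print(mask(text, spans, "fixed_length_chars")) # Herr **** ist **** Jahre alt.
```

Note the simplification: the real pipeline handles very short chunks (such as the two-character age) differently from this toy's bracket convention.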
## Results ```bash Results Masked with entity labels ------------------------------ Zusammenfassung : <PATIENT> wird am Morgen des <DATE> ins <HOSPITAL> eingeliefert. Herr <PATIENT> ist <AGE> Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: <ID> Platte <PLATE> Kontonummer: <ACCOUNT> SSN : <SSN> Lizenznummer: <DLN> Adresse : <STREET> <ZIP> Masked with chars ------------------------------ Zusammenfassung : [************] wird am Morgen des [**************] ins [**********************] eingeliefert. Herr [************] ist ** Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: [*******] Platte [*****] Kontonummer: [********************] SSN : [**********] Lizenznummer: [*********] Adresse : [*****************] [***] Masked with fixed length chars ------------------------------ Zusammenfassung : **** wird am Morgen des **** ins **** eingeliefert. Herr **** ist **** Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: **** Platte **** Kontonummer: **** SSN : **** Lizenznummer: **** Adresse : **** **** Obfuscated ------------------------------ Zusammenfassung : Herrmann Kallert wird am Morgen des 11-26-1977 ins International Neuroscience eingeliefert. Herr Herrmann Kallert ist 79 Jahre alt und hat zu viel Wasser in den Beinen. 
Persönliche Daten : ID-Nummer: 136704D357 Platte QA348G Kontonummer: 192837465738 SSN : 1310011981M454 Lizenznummer: XX123456 Adresse : Klingelhöferring 31206 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|de| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ChunkMergeModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: German asr_exp_w2v2t_vp_100k_s627 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2t_vp_100k_s627 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2t_vp_100k_s627` is a German model originally trained by jonatasgrosman. 
NOTE: This model only works on a CPU; if you need to run it on a GPU device, please use asr_exp_w2v2t_vp_100k_s627_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2t_vp_100k_s627_de_4.2.0_3.0_1664105815417.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2t_vp_100k_s627_de_4.2.0_3.0_1664105815417.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2t_vp_100k_s627", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2t_vp_100k_s627", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
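The snippet above assumes an `audioDf` DataFrame whose `audio_content` column holds the raw waveform as an array of floats (16 kHz mono is the usual sampling rate for wav2vec2 models). A rough sketch of producing such a float array from a 16-bit PCM WAV file using only the Python standard library; the commented `audioDf` line is an assumption that simply follows the column name used above, not an official recipe:

```python
import struct
import wave

def wav_to_floats(path):
    # Read a 16-bit PCM mono WAV file and scale samples to [-1.0, 1.0],
    # the float representation wav2vec2-style models expect.
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2 and wf.getnchannels() == 1
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Hypothetical usage with the pipeline above (the column name must match
# AudioAssembler's setInputCol):
# audioDf = spark.createDataFrame([[wav_to_floats("sample.wav")]],
#                                 ["audio_content"])
```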
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2t_vp_100k_s627| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_roberta_twostagetriplet_epochs_1_shard_1_squad2.0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_twostagetriplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_twostagetriplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739775556.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_twostagetriplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739775556.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_roberta_twostagetriplet_epochs_1_shard_1_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_rule_based_roberta_twostagetriplet_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base_rule_based_twostagetriplet_epochs_1_shard_1.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_roberta_twostagetriplet_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|306.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_twostagetriplet_epochs_1_shard_1_squad2.0 --- layout: model title: Legal Rights And Freedoms Document Classifier (EURLEX) author: John Snow Labs name: legclf_rights_and_freedoms_bert date: 2023-03-06 tags: [en, legal, classification, clauses, rights_and_freedoms, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_rights_and_freedoms_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class Rights_and_Freedoms or not (binary classification) according to EuroVoc labels. ## Predicted Entities `Rights_and_Freedoms`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_rights_and_freedoms_bert_en_1.0.0_3.0_1678111839271.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_rights_and_freedoms_bert_en_1.0.0_3.0_1678111839271.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_rights_and_freedoms_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Rights_and_Freedoms]| |[Other]| |[Other]| |[Rights_and_Freedoms]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_rights_and_freedoms_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.2 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.89 0.85 0.87 39 Rights_and_Freedoms 0.79 0.85 0.81 26 accuracy - - 0.85 65 macro-avg 0.84 0.85 0.84 65 weighted-avg 0.85 0.85 0.85 65 ``` --- layout: model title: Portuguese asr_bp_lapsbm1_xlsr TFWav2Vec2ForCTC from lgris author: John Snow Labs name: asr_bp_lapsbm1_xlsr date: 2022-09-26 tags: [wav2vec2, pt, audio, open_source, asr] task: Automatic Speech Recognition language: pt edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_bp_lapsbm1_xlsr` is a Portuguese model originally trained by lgris. NOTE: This model only works on a CPU, if you need to use this model on a GPU device please use asr_bp_lapsbm1_xlsr_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_bp_lapsbm1_xlsr_pt_4.2.0_3.0_1664190605281.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_bp_lapsbm1_xlsr_pt_4.2.0_3.0_1664190605281.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_bp_lapsbm1_xlsr", "pt")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_bp_lapsbm1_xlsr", "pt") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_bp_lapsbm1_xlsr| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|pt| |Size:|756.4 MB| --- layout: model title: Fast Neural Machine Translation Model from English to Cushitic languages author: John Snow Labs name: opus_mt_en_cus date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, cus, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `cus` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_cus_xx_2.7.0_2.4_1609168898891.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_cus_xx_2.7.0_2.4_1609168898891.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_cus", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_cus", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.cus').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_cus| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: BERT Sentence Embeddings (Large Uncased) author: John Snow Labs name: sent_bert_large_uncased date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains a deep bidirectional transformer trained on Wikipedia and the BookCorpus. The details are described in the paper "[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_large_uncased_en_2.6.0_2.4_1598347026632.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_large_uncased_en_2.6.0_2.4_1598347026632.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_bert_large_uncased", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_bert_large_uncased", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.bert_large_uncased').predict(text, output_level='sentence') embeddings_df ```
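The sentence vectors shown in the Results section below are typically compared with cosine similarity. Here is a minimal pure-Python sketch of that comparison, independent of Spark NLP; the short vectors are made-up toy values standing in for the model's 1024-dimensional output:

```python
import math

def cosine_similarity(a, b):
    # cosine(a, b) = dot(a, b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional stand-ins for real sentence embeddings
v1 = [-0.13, -0.29, 0.41, 0.08]
v2 = [-0.12, -0.31, 0.38, 0.10]
print(cosine_similarity(v1, v2))
```

In practice you would read the vectors out of the `sentence_embeddings` column produced by the pipeline above and compare them the same way.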
{:.h2_title} ## Results ```bash sentence en_embed_sentence_bert_large_uncased_embeddings I hate cancer [[-0.13290119171142578, -0.2996622622013092, -... Antibiotics aren't painkiller [[-0.13290119171142578, -0.2996622622013092, -... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_bert_large_uncased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|en| |Dimension:|1024| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/google/bert_uncased_L-24_H-1024_A-16/1](https://tfhub.dev/google/bert_uncased_L-24_H-1024_A-16/1) --- layout: model title: Chinese Part of Speech Tagger (from ckiplab) author: John Snow Labs name: bert_pos_bert_base_chinese_pos date: 2022-04-26 tags: [bert, pos, part_of_speech, zh, open_source] task: Part of Speech Tagging language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-chinese-pos` is a Chinese model originally trained by `ckiplab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_chinese_pos_zh_3.4.2_3.0_1650993041893.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_chinese_pos_zh_3.4.2_3.0_1650993041893.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_chinese_pos","zh") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_chinese_pos","zh") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.pos.bert_base_chinese_pos").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_chinese_pos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|zh| |Size:|381.8 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ckiplab/bert-base-chinese-pos - https://github.com/ckiplab/ckip-transformers - https://muyang.pro - https://ckip.iis.sinica.edu.tw --- layout: model title: Swedish Legal Roberta Embeddings author: John Snow Labs name: roberta_base_swedish_legal date: 2023-02-17 tags: [se, swedish, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: se edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-swedish-roberta-base` is a Swedish model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_swedish_legal_se_4.2.4_3.0_1676643288694.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_swedish_legal_se_4.2.4_3.0_1676643288694.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_swedish_legal", "se")\ .setInputCols(["sentence"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_swedish_legal", "se") .setInputCols("sentence") .setOutputCol("embeddings") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_swedish_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|se| |Size:|415.9 MB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-swedish-roberta-base --- layout: model title: Financial Financial conditions Item Binary Classifier author: John Snow Labs name: finclf_financial_conditions_item date: 2022-08-10 tags: [en, finance, classification, 10k, annual, reports, sec, filings, licensed] task: Text Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `financial_conditions` item type of 10K Annual Reports. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big financial documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Finance NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
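The first splitting technique listed above, paragraph splitting by multiline, can be sketched without Spark NLP. The helper and sample filing text below are illustrative assumptions, not the workshop implementation:

```python
import re

def split_paragraphs(text):
    # Split on one or more blank lines (the "multiline" heuristic)
    # and drop empty fragments.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

filing = """Item 7. Management's Discussion.

Our financial condition improved in fiscal 2021.

Forward-looking statements involve risks."""

chunks = split_paragraphs(filing)
print(len(chunks))  # 3 paragraph-sized chunks
```

Each resulting chunk can then be fed to the classifier as its own document.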
## Predicted Entities `other`, `financial_conditions` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_financial_conditions_item_en_1.0.0_3.2_1660154420184.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_financial_conditions_item_en_1.0.0_3.2_1660154420184.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") useEmbeddings = nlp.UniversalSentenceEncoder.pretrained() \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("finclf_financial_conditions_item", "en", "finance/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, useEmbeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[financial_conditions]| |[other]| |[other]| |[financial_conditions]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_financial_conditions_item| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.5 MB| ## References Weak labelling on documents from Edgar database ## Benchmarking ```bash label precision recall f1-score support financial_conditions 0.83 0.73 0.78 245 other 0.75 0.84 0.80 237 accuracy - - 0.79 482 macro-avg 0.79 0.79 0.79 482 weighted-avg 0.79 0.79 0.79 482 ``` --- layout: model title: Legal Non Exclusivity Clause Binary Classifier author: John Snow Labs name: legclf_non_exclusivity_clause date: 2023-01-29 tags: [en, legal, classification, exclusivity, clauses, non_exclusivity, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `non-exclusivity` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `non-exclusivity`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_non_exclusivity_clause_en_1.0.0_3.0_1675006033580.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_non_exclusivity_clause_en_1.0.0_3.0_1675006033580.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_non_exclusivity_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[non-exclusivity]| |[other]| |[other]| |[non-exclusivity]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_non_exclusivity_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support non-exclusivity 0.93 0.96 0.95 27 other 0.97 0.95 0.96 39 accuracy - - 0.95 66 macro-avg 0.95 0.96 0.95 66 weighted-avg 0.96 0.95 0.95 66 ``` --- layout: model title: Finnish asr_wav2vec2_large_xlsr_finnish TFWav2Vec2ForCTC from birgermoell author: John Snow Labs name: asr_wav2vec2_large_xlsr_finnish date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_finnish` is a Finnish model originally trained by birgermoell. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_finnish_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_finnish_fi_4.2.0_3.0_1664021375004.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_finnish_fi_4.2.0_3.0_1664021375004.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_finnish", "fi")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_finnish", "fi") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_finnish| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|fi| |Size:|1.2 GB| --- layout: model title: Legal Intellectual property Clause Binary Classifier author: John Snow Labs name: legclf_intellectual_property_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `intellectual-property` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
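Since the embeddings accept at most 512 tokens, longer clauses need to be cut down before classification. Below is a rough sketch using whitespace tokens; this is a simplification, as the model's own subword tokenizer usually produces more tokens than a whitespace count, so a real pipeline should leave headroom or use the actual tokenizer:

```python
def chunk_by_tokens(text, max_tokens=512):
    # Greedy chunking: each chunk holds at most max_tokens
    # whitespace-separated tokens.
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

doc = "clause " * 1200  # roughly 1200 whitespace tokens
chunks = chunk_by_tokens(doc)
print([len(c.split()) for c in chunks])  # [512, 512, 176]
```

Each chunk can then be classified independently and the per-chunk predictions aggregated.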
## Predicted Entities `other`, `intellectual-property` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_intellectual_property_clause_en_1.0.0_3.2_1660123623906.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_intellectual_property_clause_en_1.0.0_3.2_1660123623906.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_intellectual_property_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[intellectual-property]| |[other]| |[other]| |[intellectual-property]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_intellectual_property_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support intellectual-property 0.95 0.85 0.90 47 other 0.93 0.98 0.95 95 accuracy - - 0.94 142 macro-avg 0.94 0.92 0.93 142 weighted-avg 0.94 0.94 0.94 142 ``` --- layout: model title: German Part of Speech Tagger (from KoichiYasuoka) author: John Snow Labs name: bert_pos_bert_large_german_upos date: 2022-05-09 tags: [bert, pos, part_of_speech, de, open_source] task: Part of Speech Tagging language: de edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-german-upos` is a German model originally trained by `KoichiYasuoka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_large_german_upos_de_3.4.2_3.0_1652092375858.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_large_german_upos_de_3.4.2_3.0_1652092375858.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_large_german_upos","de") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_large_german_upos","de") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_large_german_upos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/KoichiYasuoka/bert-large-german-upos - https://github.com/UniversalDependencies/UD_German-HDT - https://universaldependencies.org/u/pos/ - https://github.com/KoichiYasuoka/esupar --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_8 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-32-finetuned-squad-seed-8` is a English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_8_en_4.0.0_3.0_1657185089636.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_8_en_4.0.0_3.0_1657185089636.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_8","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_8","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_8| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-32-finetuned-squad-seed-8 --- layout: model title: Sentence Embeddings - sbert medium (tuned) author: John Snow Labs name: sbiobert_jsl_rxnorm_cased date: 2021-12-23 tags: [licensed, embeddings, clinical, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps sentences & documents to a 768 dimensional dense vector space by using average pooling on top of BioBert model. It's also fine-tuned on RxNorm dataset to help generalization over medication-related datasets. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobert_jsl_rxnorm_cased_en_3.3.4_2.4_1640271525048.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobert_jsl_rxnorm_cased_en_3.3.4_2.4_1640271525048.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_jsl_rxnorm_cased", "en", "clinical/models")\
    .setInputCols(["sentence"])\
    .setOutputCol("sbiobert_embeddings")
```
```scala
val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_jsl_rxnorm_cased", "en", "clinical/models")
  .setInputCols("sentence")
  .setOutputCol("sbiobert_embeddings")
```

{:.nlu-block}
```python
import nlu
nlu.load("en.embed_sentence.biobert.rxnorm").predict("""Put your text here.""")
```
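For intuition, the average pooling mentioned in the description reduces token-level BioBERT vectors to a single 768-dimensional sentence vector by taking their mean. A minimal numpy sketch, with random vectors standing in for real BioBERT outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend a 12-token sentence produced one 768-dim vector per token.
token_embeddings = rng.normal(size=(12, 768))

# Average pooling: the sentence vector is the element-wise mean over tokens.
sentence_embedding = token_embeddings.mean(axis=0)
print(sentence_embedding.shape)  # (768,)
```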
## Results ```bash Gives a 768-dimensional vector representation of the sentence. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobert_jsl_rxnorm_cased| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|402.0 MB| --- layout: model title: Pipeline to Detect Medication Entities, Assign Assertion Status and Find Relations author: John Snow Labs name: explain_clinical_doc_medication date: 2023-04-20 tags: [licensed, clinical, ner, en, assertion, relation_extraction, posology, medication] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pipeline for detecting posology entities with the `ner_posology_large` NER model, assigning their assertion status with `assertion_jsl` model, and extracting relations between posology-related terminology with `posology_re` relation extraction model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_medication_en_4.3.0_3.2_1682017727303.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_medication_en_4.3.0_3.2_1682017727303.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("explain_clinical_doc_medication", "en", "clinical/models") text = '''The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2. She received a course of Bactrim for 14 days for UTI. She was prescribed 5000 units of Fragmin subcutaneously daily, and along with Lantus 40 units subcutaneously at bedtime.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("explain_clinical_doc_medication", "en", "clinical/models") val text = "The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2. She received a course of Bactrim for 14 days for UTI. She was prescribed 5000 units of Fragmin subcutaneously daily, and along with Lantus 40 units subcutaneously at bedtime." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.explain_dco.clinical_medication.pipeline").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2. She received a course of Bactrim for 14 days for UTI. She was prescribed 5000 units of Fragmin subcutaneously daily, and along with Lantus 40 units subcutaneously at bedtime.""") ```
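The `fullAnnotate` call above returns lists of annotation objects per output column, which can be flattened into tables like the ones shown in the Results section. A rough sketch of that post-processing; plain dicts stand in for the annotation objects here, so this is a simplification rather than Spark NLP's exact annotation schema:

```python
# Hypothetical, simplified stand-ins for NER chunk annotations.
ner_chunks = [
    {"result": "Bactrim", "metadata": {"entity": "DRUG"}},
    {"result": "for 14 days", "metadata": {"entity": "DURATION"}},
]

# Build (chunk, entity) rows, as in the first Results table.
rows = [(c["result"], c["metadata"]["entity"]) for c in ner_chunks]
for chunk, entity in rows:
    print(f"{chunk:<12} {entity}")
```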
## Results ```bash +----+----------------+------------+ | | chunks | entities | |---:|:---------------|:-----------| | 0 | insulin | DRUG | | 1 | Bactrim | DRUG | | 2 | for 14 days | DURATION | | 3 | 5000 units | DOSAGE | | 4 | Fragmin | DRUG | | 5 | subcutaneously | ROUTE | | 6 | daily | FREQUENCY | | 7 | Lantus | DRUG | | 8 | 40 units | DOSAGE | | 9 | subcutaneously | ROUTE | | 10 | at bedtime | FREQUENCY | +----+----------------+------------+ +----+----------+------------+-------------+ | | chunks | entities | assertion | |---:|:---------|:-----------|:------------| | 0 | insulin | DRUG | Present | | 1 | Bactrim | DRUG | Past | | 2 | Fragmin | DRUG | Planned | | 3 | Lantus | DRUG | Planned | +----+----------+------------+-------------+ +----------------+-----------+------------+-----------+----------------+ | relation | entity1 | chunk1 | entity2 | chunk2 | |:---------------|:----------|:-----------|:----------|:---------------| | DRUG-DURATION | DRUG | Bactrim | DURATION | for 14 days | | DOSAGE-DRUG | DOSAGE | 5000 units | DRUG | Fragmin | | DRUG-ROUTE | DRUG | Fragmin | ROUTE | subcutaneously | | DRUG-FREQUENCY | DRUG | Fragmin | FREQUENCY | daily | | DRUG-DOSAGE | DRUG | Lantus | DOSAGE | 40 units | | DRUG-ROUTE | DRUG | Lantus | ROUTE | subcutaneously | | DRUG-FREQUENCY | DRUG | Lantus | FREQUENCY | at bedtime | +----------------+-----------+------------+-----------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_clinical_doc_medication| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel - NerConverterInternalModel - AssertionDLModel - PerceptronModel - DependencyParserModel - PosologyREModel --- layout: model title: Bemba (Zambia) asr_wav2vec2_large_xls_r_300m_bemba_fds 
TFWav2Vec2ForCTC from csikasote
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xls_r_300m_bemba_fds
date: 2022-09-24
tags: [wav2vec2, bem, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: bem
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_bemba_fds` is a Bemba (Zambia) model originally trained by csikasote.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_bemba_fds_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_bemba_fds_bem_4.2.0_3.0_1664023955232.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_bemba_fds_bem_4.2.0_3.0_1664023955232.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_bemba_fds', lang = 'bem') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_bemba_fds", lang = "bem") val annotations = pipeline.transform(audioDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_bemba_fds|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|bem|
|Size:|1.2 GB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: English image_classifier_vit_ViTFineTuned ViTForImageClassification from pthpth
author: John Snow Labs
name: image_classifier_vit_ViTFineTuned
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_ViTFineTuned` is an English model originally trained by pthpth.

## Predicted Entities

`white_bread`, `brown_bread`, `cracker`, `aluminium_foil`, `linen`, `wool`, `corduroy`, `wood`, `lettuce_leaf`, `cotton`, `cork`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_ViTFineTuned_en_4.1.0_3.0_1660167943982.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_ViTFineTuned_en_4.1.0_3.0_1660167943982.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_ViTFineTuned", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_ViTFineTuned", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_ViTFineTuned|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|

---
layout: model
title: Legal Effectiveness Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_effectiveness_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, effectiveness, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

This model is a Binary Classifier (True, False) for the `Effectiveness` clause type. To use this model, make sure you provide enough context as input.

Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens.
If your text is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this).

This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`Effectiveness`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_effectiveness_bert_en_1.0.0_3.0_1678050012884.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_effectiveness_bert_en_1.0.0_3.0_1678050012884.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_effectiveness_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
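Before running the classifier on long contracts, the "paragraph splitting (by multiline)" recommended in the description can be done in plain Python. A minimal sketch; the sample provisions below are invented for illustration:

```python
import re

# Invented two-provision contract excerpt, separated by a blank line.
document = (
    "EFFECTIVENESS. This Agreement shall become effective upon execution.\n\n"
    "GOVERNING LAW. This Agreement is governed by the laws of Delaware."
)

# Split on blank lines so the classifier sees one provision at a time.
paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
print(len(paragraphs))  # 2
```

Each resulting paragraph would then become one row of the `text` column fed to the pipeline above.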
## Results ```bash +-------+ |result| +-------+ |[Effectiveness]| |[Other]| |[Other]| |[Effectiveness]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_effectiveness_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Effectiveness 0.88 0.92 0.90 24 Other 0.94 0.92 0.93 36 accuracy - - 0.92 60 macro-avg 0.91 0.92 0.91 60 weighted-avg 0.92 0.92 0.92 60 ``` --- layout: model title: German asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s545 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s545 date: 2022-09-26 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s545` is a German model originally trained by jonatasgrosman. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s545_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s545_de_4.2.0_3.0_1664191216048.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s545_de_4.2.0_3.0_1664191216048.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s545", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s545", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s545| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: Part of Speech for Afrikaans author: John Snow Labs name: pos_afribooms date: 2021-03-16 tags: [af, open_source, pos] supported: true task: Part of Speech Tagging language: af edition: Spark NLP 2.7.5 spark_version: 2.4 annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron` architecture. ## Predicted Entities - ADJ - ADP - ADV - AUX - CCONJ - DET - NOUN - NUM - PART - PRON - PROPN - PUNCT - VERB - X {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_afribooms_af_2.7.5_2.4_1615903333785.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_afribooms_af_2.7.5_2.4_1615903333785.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

pos = PerceptronModel.pretrained("pos_afribooms", "af")\
    .setInputCols(["document", "token"])\
    .setOutputCol("pos")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    pos
])

example = spark.createDataFrame([['Die kodes wat gebruik word , moet duidelik en verstaanbaar vir leerders en ouers wees .']], ["text"])

result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence_detector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val pos = PerceptronModel.pretrained("pos_afribooms", "af")
  .setInputCols(Array("document", "token"))
  .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))

val data = Seq("Die kodes wat gebruik word , moet duidelik en verstaanbaar vir leerders en ouers wees .").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu

text = ["Die kodes wat gebruik word , moet duidelik en verstaanbaar vir leerders en ouers wees ."]
token_df = nlu.load('af.pos.afribooms').predict(text)
token_df
```
## Results ```bash +---------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+ |text |result | +---------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+ |Die kodes wat gebruik word , moet duidelik en verstaanbaar vir leerders en ouers wees .|[DET, NOUN, PRON, VERB, AUX, PUNCT, AUX, ADJ, CCONJ, ADJ, ADP, NOUN, CCONJ, NOUN, AUX, PUNCT]| +---------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_afribooms| |Compatibility:|Spark NLP 2.7.5+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[pos]| |Language:|af| ## Data Source The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) data set. 
## Benchmarking ```bash | | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | ADJ | 0.60 | 0.67 | 0.63 | 665 | | ADP | 0.76 | 0.78 | 0.77 | 1299 | | ADV | 0.74 | 0.69 | 0.72 | 523 | | AUX | 0.85 | 0.83 | 0.84 | 663 | | CCONJ | 0.71 | 0.71 | 0.71 | 380 | | DET | 0.83 | 0.70 | 0.76 | 1014 | | NOUN | 0.69 | 0.72 | 0.71 | 2025 | | NUM | 0.76 | 0.76 | 0.76 | 42 | | PART | 0.67 | 0.68 | 0.68 | 322 | | PRON | 0.87 | 0.87 | 0.87 | 794 | | PROPN | 0.82 | 0.73 | 0.77 | 156 | | PUNCT | 0.68 | 0.70 | 0.69 | 877 | | SCONJ | 0.85 | 0.85 | 0.85 | 210 | | SYM | 0.87 | 0.88 | 0.87 | 142 | | VERB | 0.69 | 0.72 | 0.70 | 889 | | X | 0.35 | 0.14 | 0.20 | 64 | | accuracy | | | 0.74 | 10065 | | macro avg | 0.73 | 0.72 | 0.72 | 10065 | | weighted avg | 0.74 | 0.74 | 0.74 | 10065 | ``` --- layout: model title: English DistilBertForQuestionAnswering model (from vitusya) author: John Snow Labs name: distilbert_qa_vitusya_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is a English model originally trained by `vitusya`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_vitusya_base_uncased_finetuned_squad_en_4.0.0_3.0_1654726511045.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_vitusya_base_uncased_finetuned_squad_en_4.0.0_3.0_1654726511045.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vitusya_base_uncased_finetuned_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vitusya_base_uncased_finetuned_squad","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_vitusya").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_vitusya_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/vitusya/distilbert-base-uncased-finetuned-squad

---
layout: model
title: Legal Investment Subadvisory Agreement Document Classifier (Longformer)
author: John Snow Labs
name: legclf_investment_subadvisory_agreement
date: 2022-11-10
tags: [en, legal, classification, agreement, investment_subadvisory, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The `legclf_investment_subadvisory_agreement` model is a Legal Longformer Document Classifier that classifies whether a document belongs to the class `investment-subadvisory-agreement` or not (Binary Classification).

Longformers have a limit of 4096 tokens, so only the first 4096 tokens are taken into account. We have found that for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document without extra material before it, 4096 tokens are enough for Document Classification. If not, let us know and we can take another approach: splitting the document into chunks of 4096 tokens, averaging the embeddings, and training on the averaged version, which means the whole document is taken into account. In theory, though, this should not be required.
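The chunk-and-average fallback for documents longer than 4096 tokens could look roughly like this; random numpy vectors stand in for real Longformer token embeddings, so this is a sketch of the idea rather than the production approach:

```python
import numpy as np

def chunked_mean_embedding(token_embeddings, chunk_size=4096):
    # Average embeddings within each 4096-token window, then average the
    # window vectors, so tokens beyond the model limit still contribute.
    chunks = [token_embeddings[i:i + chunk_size]
              for i in range(0, len(token_embeddings), chunk_size)]
    return np.mean([c.mean(axis=0) for c in chunks], axis=0)

rng = np.random.default_rng(0)
doc = rng.normal(size=(10000, 768))  # longer than the 4096-token limit
vec = chunked_mean_embedding(doc)
print(vec.shape)  # (768,)
```

Note this weights each window equally, including a possibly shorter final window; a length-weighted mean is an equally reasonable variant.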
## Predicted Entities `investment-subadvisory-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_investment_subadvisory_agreement_en_1.0.0_3.0_1668115303552.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_investment_subadvisory_agreement_en_1.0.0_3.0_1668115303552.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = nlp.ClassifierDLModel.pretrained("legclf_investment_subadvisory_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+----------------------------------+
|result                            |
+----------------------------------+
|[investment-subadvisory-agreement]|
|[other]                           |
|[other]                           |
|[investment-subadvisory-agreement]|
+----------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_investment_subadvisory_agreement|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.0 MB|

## References

Legal documents, scraped from the Internet and classified in-house + SEC documents

## Benchmarking

```bash
                           label  precision  recall  f1-score  support
investment-subadvisory-agreement       1.00    0.98      0.99       42
                           other       0.99    1.00      0.99       66
                        accuracy          -       -      0.99      108
                       macro-avg       0.99    0.99      0.99      108
                    weighted-avg       0.99    0.99      0.99      108
```

---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Walterchamy)
author: John Snow Labs
name: distilbert_qa_walterchamy_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Walterchamy`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_walterchamy_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769525196.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_walterchamy_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769525196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_walterchamy_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_walterchamy_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_walterchamy_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/Walterchamy/distilbert-base-uncased-finetuned-squad

---
layout: model
title: Medically Sound Suggestion Classifier (BioBERT)
author: John Snow Labs
name: bert_sequence_classifier_vop_sound_medical
date: 2023-06-13
tags: [licensed, clinical, classification, en, vop, tensorflow]
task: Text Classification
language: en
edition: Healthcare NLP 4.4.3
spark_version: 3.0
supported: true
engine: tensorflow
annotator: MedicalBertForSequenceClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a [BioBERT-based](https://github.com/dmis-lab/biobert) classifier meant to identify whether the suggestion mentioned in the text is medically sound.

## Predicted Entities

`True`, `False`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_sound_medical_en_4.4.3_3.0_1686673701807.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_sound_medical_en_4.4.3_3.0_1686673701807.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_vop_sound_medical", "en", "clinical/models")\ .setInputCols(["document",'token'])\ .setOutputCol("prediction") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame(["I had a lung surgery for emphyema and after surgery my xray showing some recovery.", "I was advised to put honey on a burned skin."], StringType()).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_vop_sound_medical", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("prediction") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq("I had a lung surgery for emphyema and after surgery my xray showing some recovery.", "I was advised to put honey on a burned skin.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
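At prediction time, a binary sequence classifier like this one produces one raw score (logit) per label and returns the label with the highest probability. The following is a framework-free sketch of that final step, not Spark NLP's internal code; the logits and label order are invented for illustration:

```python
import math

LABELS = ["False", "True"]  # hypothetical label order, for illustration only

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return the highest-probability label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return LABELS[best], probs[best]

print(predict([-1.2, 2.3]))
```

In the real pipeline this step happens inside `MedicalBertForSequenceClassification`, which emits the winning label in the `prediction` column.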
## Results ```bash +----------------------------------------------------------------------------------+-------+ |text |result | +----------------------------------------------------------------------------------+-------+ |I had a lung surgery for emphyema and after surgery my xray showing some recovery.|[True] | |I was advised to put honey on a burned skin. |[False]| +----------------------------------------------------------------------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_vop_sound_medical| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset “Hello,I’m 20 year old girl. I’m diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I’m taking weekly supplement of vitamin D and 1000 mcg b12 daily. I’m taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I’m facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? 
Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.” ## Benchmarking ```bash label precision recall f1-score support False 0.848564 0.752315 0.797546 432 True 0.664577 0.785185 0.719864 270 accuracy - - 0.764957 702 macro_avg 0.756570 0.768750 0.758705 702 weighted_avg 0.777800 0.764957 0.767668 702 ``` --- layout: model title: English Bert Embeddings (Base, Uncased, Unstructured) author: John Snow Labs name: bert_embeddings_bert_base_uncased_sparse_70_unstructured date: 2022-04-11 tags: [bert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-uncased-sparse-70-unstructured` is an English model originally trained by `Intel`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_uncased_sparse_70_unstructured_en_3.4.2_3.0_1649672494464.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_uncased_sparse_70_unstructured_en_3.4.2_3.0_1649672494464.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_uncased_sparse_70_unstructured","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_uncased_sparse_70_unstructured","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.bert_base_uncased_sparse_70_unstructured").predict("""I love Spark NLP""") ```
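Once the `embeddings` column is populated, a common downstream use is comparing token vectors with cosine similarity. Below is a minimal, framework-free sketch with invented low-dimensional vectors (real BERT base vectors are 768-dimensional); it illustrates the comparison, not Spark NLP's API:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-d "embeddings" with made-up values, purely illustrative.
cat = [1.0, 0.0, 1.0, 0.0]
kitten = [0.9, 0.1, 0.8, 0.2]
car = [0.0, 1.0, 0.0, 1.0]
print(cosine(cat, kitten) > cosine(cat, car))  # → True
```

Semantically related tokens tend to score closer to 1, unrelated ones closer to 0.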
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_uncased_sparse_70_unstructured| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|228.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/Intel/bert-base-uncased-sparse-70-unstructured --- layout: model title: Word2Vec Embeddings in Russian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, ru, open_source] task: Embeddings language: ru edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ru_3.4.1_3.0_1647455083959.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ru_3.4.1_3.0_1647455083959.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ru") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Я люблю искра NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ru") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Я люблю искра NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ru.embed.w2v_cc_300d").predict("""Я люблю искра NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|ru| |Size:|1.3 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Legal Asia And Oceania Document Classifier (EURLEX) author: John Snow Labs name: legclf_asia_and_oceania_bert date: 2023-03-06 tags: [en, legal, classification, clauses, asia_and_oceania, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_asia_and_oceania_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class Asia_and_Oceania or not (Binary Classification) according to EuroVoc labels. ## Predicted Entities `Asia_and_Oceania`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_asia_and_oceania_bert_en_1.0.0_3.0_1678111638726.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_asia_and_oceania_bert_en_1.0.0_3.0_1678111638726.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_asia_and_oceania_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
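The Benchmarking section of this card reports per-class scores together with macro and weighted averages. As a reminder of how those two averages relate, here is a short, framework-free sketch; the F1 values and supports below are invented for illustration, not taken from a real run:

```python
def macro_avg(scores):
    """Unweighted mean over classes: every class counts equally."""
    return sum(scores) / len(scores)

def weighted_avg(scores, supports):
    """Mean over classes weighted by support (examples per class)."""
    return sum(s * n for s, n in zip(scores, supports)) / sum(supports)

# Illustrative per-class F1 scores and supports.
f1 = [1.0, 0.5]
support = [3, 1]
print(macro_avg(f1), weighted_avg(f1, support))  # → 0.75 0.875
```

A large gap between the two averages usually signals class imbalance: the weighted average is pulled toward the majority class.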
## Results ```bash +------------------+ |result | +------------------+ |[Asia_and_Oceania]| |[Other] | |[Other] | |[Asia_and_Oceania]| +------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_asia_and_oceania_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Asia_and_Oceania 0.92 0.91 0.92 456 Other 0.90 0.92 0.91 400 accuracy - - 0.91 856 macro-avg 0.91 0.91 0.91 856 weighted-avg 0.91 0.91 0.91 856 ``` --- layout: model title: Hindi RoBERTa Embeddings (from neuralspace-reverie) author: John Snow Labs name: roberta_embeddings_indic_transformers_hi_roberta date: 2022-04-14 tags: [roberta, embeddings, hi, open_source] task: Embeddings language: hi edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `indic-transformers-hi-roberta` is a Hindi model originally trained by `neuralspace-reverie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indic_transformers_hi_roberta_hi_3.4.2_3.0_1649947526435.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indic_transformers_hi_roberta_hi_3.4.2_3.0_1649947526435.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indic_transformers_hi_roberta","hi") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["मुझे स्पार्क एनएलपी पसंद है"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indic_transformers_hi_roberta","hi") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("मुझे स्पार्क एनएलपी पसंद है").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("hi.embed.indic_transformers_hi_roberta").predict("""मुझे स्पार्क एनएलपी पसंद है""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_indic_transformers_hi_roberta| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|hi| |Size:|313.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-hi-roberta - https://oscar-corpus.com/ --- layout: model title: Dutch BERT Sentence Base Cased Embedding author: John Snow Labs name: sent_bert_base_cased date: 2021-09-06 tags: [dutch, open_source, bert_sentence_embeddings, cased, nl] task: Embeddings language: nl edition: Spark NLP 3.2.2 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description BERTje is a Dutch pre-trained BERT model developed at the University of Groningen. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_base_cased_nl_3.2.2_3.0_1630926264607.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_base_cased_nl_3.2.2_3.0_1630926264607.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "nl") \ .setInputCols("sentence") \ .setOutputCol("bert_sentence") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings]) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "nl") .setInputCols("sentence") .setOutputCol("bert_sentence") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("nl.embed_sentence.bert.base_cased").predict("""Put your text here.""") ```
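`BertSentenceEmbeddings` outputs one vector per sentence. To build intuition for what a sentence-level vector is, here is a toy illustration of mean pooling token vectors into a single vector; this is a conceptual sketch with invented numbers, not BERTje's actual pooling mechanism:

```python
def mean_pool(token_vectors):
    """Element-wise average of token vectors into one sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Three toy 2-d token vectors (invented; real BERT vectors are 768-d).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(tokens))  # → [3.0, 4.0]
```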
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_bert_base_cased| |Compatibility:|Spark NLP 3.2.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[bert_sentence]| |Language:|nl| |Case sensitive:|true| ## Data Source The model is imported from: https://huggingface.co/GroNLP/bert-base-dutch-cased --- layout: model title: English DebertaForQuestionAnswering model (from nbroad) author: John Snow Labs name: deberta_v3_xsmall_qa_squad2 date: 2022-06-15 tags: [open_source, deberta, question_answering, en] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DeBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deberta-v3-xsmall-squad2` is a English model originally trained by `nbroad`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_v3_xsmall_qa_squad2_en_4.0.0_3.0_1655290640197.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_v3_xsmall_qa_squad2_en_4.0.0_3.0_1655290640197.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DebertaForQuestionAnswering.pretrained("deberta_v3_xsmall_qa_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DebertaForQuestionAnswering.pretrained("deberta_v3_xsmall_qa_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.deberta").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
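Conceptually, an extractive QA model like this one scores every context token as a candidate answer start and answer end, then returns the highest-scoring valid span. The sketch below is a simplified, framework-free illustration of that decoding step, not Spark NLP's actual implementation; the token list and logits are invented:

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick the (start, end) pair maximizing start+end score, with start <= end."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best[0], best[1]

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 3.0, 0.0, 0.1, 0.0, 0.0, 0.5, 0.0]  # invented logits
end   = [0.0, 0.1, 0.0, 2.5, 0.0, 0.0, 0.1, 0.0, 0.4, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # → Clara
```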
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|deberta_v3_xsmall_qa_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|252.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References https://huggingface.co/nbroad/deberta-v3-xsmall-squad2 --- layout: model title: Relation Extraction between dates and other entities (ReDL) author: John Snow Labs name: redl_oncology_temporal_biobert_wip date: 2022-09-29 tags: [licensed, clinical, oncology, en, relation_extraction, temporal] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This relation extraction model links Date and Relative_Date extractions to clinical entities such as Test or Cancer_Dx. ## Predicted Entities `is_date_of`, `O` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_oncology_temporal_biobert_wip_en_4.1.0_3.0_1664456191667.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_oncology_temporal_biobert_wip_en_4.1.0_3.0_1664456191667.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Each relevant relation pair in the pipeline should include one date entity (Date or Relative_Date) and a clinical entity (such as Pathology_Test, Cancer_Dx or Chemotherapy).
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") re_ner_chunk_filter = RENerChunksFilter()\ .setInputCols(["ner_chunk", "dependencies"])\ .setOutputCol("re_ner_chunk")\ .setMaxSyntacticDistance(10)\ .setRelationPairs(["Cancer_Dx-Date", "Date-Cancer_Dx", "Relative_Date-Cancer_Dx", "Cancer_Dx-Relative_Date", "Cancer_Surgery-Date", "Date-Cancer_Surgery", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery"]) re_model = RelationExtractionDLModel.pretrained("redl_oncology_temporal_biobert_wip", "en", "clinical/models")\ .setInputCols(["re_ner_chunk", "sentence"])\ .setOutputCol("relation_extraction") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model]) data = spark.createDataFrame([["Her breast cancer was diagnosed last year."]]).toDF("text") result = 
pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentence", "pos_tags", "token")) .setOutputCol("dependencies") val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols("ner_chunk", "dependencies") .setOutputCol("re_ner_chunk") .setMaxSyntacticDistance(10) .setRelationPairs(Array("Cancer_Dx-Date", "Date-Cancer_Dx", "Relative_Date-Cancer_Dx", "Cancer_Dx-Relative_Date", "Cancer_Surgery-Date", "Date-Cancer_Surgery", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery")) val re_model = RelationExtractionDLModel.pretrained("redl_oncology_temporal_biobert_wip", "en", "clinical/models") .setPredictionThreshold(0.5f) .setInputCols("re_ner_chunk", "sentence") .setOutputCol("relation_extraction") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("Her breast 
cancer was diagnosed last year.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.oncology_temporal_biobert_wip").predict("""Her breast cancer was diagnosed last year.""") ```
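The `RENerChunksFilter` stage above only forwards chunk pairs whose entity types appear in `setRelationPairs`, so the relation extraction model never scores irrelevant combinations. A minimal, framework-free sketch of that filtering idea (the chunk values below are invented, and this is not the annotator's actual code):

```python
def candidate_pairs(chunks, allowed):
    """chunks: list of (text, entity_type); allowed: set of 'TypeA-TypeB' strings."""
    pairs = []
    for i, (text1, ent1) in enumerate(chunks):
        for text2, ent2 in chunks[i + 1:]:
            # Keep the pair if either direction is in the allowed list.
            if f"{ent1}-{ent2}" in allowed or f"{ent2}-{ent1}" in allowed:
                pairs.append((text1, text2))
    return pairs

chunks = [("breast cancer", "Cancer_Dx"),
          ("last year", "Relative_Date"),
          ("biopsy", "Pathology_Test")]
allowed = {"Cancer_Dx-Relative_Date", "Relative_Date-Cancer_Dx"}
print(candidate_pairs(chunks, allowed))  # → [('breast cancer', 'last year')]
```

The real annotator additionally enforces the syntactic-distance limit set with `setMaxSyntacticDistance`.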
## Results ```bash | chunk1 | entity1 | chunk2 | entity2 | relation | confidence | | --------------- |--------------- |---------------- |--------------- |----------- |----------- | | breast cancer | Cancer_Dx | last year | Relative_Date | is_date_of | 0.9999256 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_oncology_temporal_biobert_wip| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|405.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label recall precision f1 support O 0.77 0.81 0.79 302.0 is_date_of 0.82 0.78 0.80 298.0 macro-avg 0.79 0.79 0.79 NaN ``` --- layout: model title: Dutch RoBERTa Embeddings (Merged) author: John Snow Labs name: roberta_embeddings_robbertje_1_gb_merged date: 2022-04-14 tags: [roberta, embeddings, nl, open_source] task: Embeddings language: nl edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `robbertje-1-gb-merged` is a Dutch model originally trained by `DTAI-KULeuven`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_robbertje_1_gb_merged_nl_3.4.2_3.0_1649949144654.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_robbertje_1_gb_merged_nl_3.4.2_3.0_1649949144654.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_robbertje_1_gb_merged","nl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ik hou van vonk nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_robbertje_1_gb_merged","nl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ik hou van vonk nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("nl.embed.robbertje_1_gb_merged").predict("""Ik hou van vonk nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_robbertje_1_gb_merged| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|nl| |Size:|279.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/DTAI-KULeuven/robbertje-1-gb-merged - http://github.com/iPieter/robbert - http://github.com/iPieter/robbertje - https://www.clinjournal.org/clinj/article/view/131 - https://www.clin31.ugent.be - https://arxiv.org/abs/2101.05716 --- layout: model title: Part of Speech for Vietnamese author: John Snow Labs name: pos_vtb date: 2021-03-10 tags: [open_source, pos, vi] supported: true task: Part of Speech Tagging language: vi edition: Spark NLP 2.7.5 spark_version: 2.4 annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron` architecture. ## Predicted Entities - ADJ - ADP - ADV - AUX - CCONJ - DET - NOUN - NUM - PART - PRON - PROPN - PUNCT - VERB - X {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_vtb_vi_2.7.5_2.4_1615401332222.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_vtb_vi_2.7.5_2.4_1615401332222.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_vtb", "vi") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Thắng sẽ tìm nghề mới cho Lan .']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_vtb", "vi") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Thắng sẽ tìm nghề mới cho Lan .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Thắng sẽ tìm nghề mới cho Lan ."] token_df = nlu.load('vi.pos.vtb').predict(text) token_df ```
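As the description notes, this tagger is an averaged perceptron: each token is turned into simple features, each tag has a weight per feature, and the highest-scoring tag wins. Below is a heavily simplified sketch of the scoring step only; the features and weights are invented, and the real model also uses context features and weight averaging:

```python
def features(word):
    """Toy feature extractor: the lowercased word plus its 2-char suffix."""
    return {"word=" + word.lower(), "suffix=" + word[-2:].lower()}

def predict_tag(word, weights, tags):
    """Score each tag as the sum of its feature weights; return the best tag."""
    def score(tag):
        return sum(weights.get((tag, f), 0.0) for f in features(word))
    return max(tags, key=score)

# Hypothetical learned weights, for illustration only.
weights = {("VERB", "word=tìm"): 2.0,
           ("NOUN", "word=nghề"): 2.0,
           ("NOUN", "word=tìm"): 0.5}
print(predict_tag("tìm", weights, ["NOUN", "VERB"]))  # → VERB
```

Training repeatedly adjusts these weights toward the gold tag; the "averaged" variant returns the mean of all intermediate weight vectors for better generalization.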
## Results ```bash +-------------------------------+--------------------------------------------+ |text |result | +-------------------------------+--------------------------------------------+ |Thắng sẽ tìm nghề mới cho Lan .|[NOUN, X, VERB, NOUN, ADJ, ADP, NOUN, PUNCT]| +-------------------------------+--------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_vtb| |Compatibility:|Spark NLP 2.7.5+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[pos]| |Language:|vi| ## Data Source The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) data set. ## Benchmarking ```bash | | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | ADJ | 0.58 | 0.49 | 0.53 | 738 | | ADP | 0.84 | 0.87 | 0.86 | 688 | | AUX | 0.79 | 0.95 | 0.87 | 132 | | CCONJ | 0.85 | 0.80 | 0.83 | 335 | | DET | 0.95 | 0.85 | 0.90 | 232 | | INTJ | 1.00 | 0.14 | 0.25 | 7 | | NOUN | 0.84 | 0.86 | 0.85 | 3838 | | NUM | 0.94 | 0.91 | 0.92 | 412 | | PART | 0.53 | 0.30 | 0.38 | 87 | | PROPN | 0.85 | 0.85 | 0.85 | 494 | | PUNCT | 0.97 | 0.99 | 0.98 | 1722 | | SCONJ | 0.99 | 0.98 | 0.98 | 122 | | VERB | 0.73 | 0.76 | 0.74 | 2178 | | X | 0.81 | 0.76 | 0.79 | 970 | | accuracy | | | 0.83 | 11955 | | macro avg | 0.83 | 0.75 | 0.77 | 11955 | | weighted avg | 0.83 | 0.83 | 0.83 | 11955 | ``` --- layout: model title: English DistilBertForQuestionAnswering Base Cased model (from autoevaluate) author: John Snow Labs name: distilbert_qa_autoevaluate_base_cased_led_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, 
adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad` is an English model originally trained by `autoevaluate`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_autoevaluate_base_cased_led_squad_en_4.3.0_3.0_1672766463212.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_autoevaluate_base_cased_led_squad_en_4.3.0_3.0_1672766463212.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_autoevaluate_base_cased_led_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_autoevaluate_base_cased_led_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
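DistilBertForQuestionAnswering is an extractive model: it scores every context token as a candidate answer start and answer end, then returns the highest-scoring valid span. A schematic stdlib-only sketch of that span selection (the scores below are toy values, not real model outputs):

```python
# Toy start/end scores over context tokens; the model emits one score
# per token for "answer starts here" and one for "answer ends here".
context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.2, 0.3, 2.5, 0.1, 0.0, 0.1, 0.1, 0.4, 0.0]
end_scores   = [0.0, 0.1, 0.2, 2.8, 0.2, 0.0, 0.1, 0.1, 0.5, 0.0]

# Pick the (start, end) pair with the best combined score, with start <= end.
best = max(
    ((s, e) for s in range(len(context)) for e in range(s, len(context))),
    key=lambda span: start_scores[span[0]] + end_scores[span[1]],
)
answer = " ".join(context[best[0]:best[1] + 1])
print(answer)  # -> Clara
```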
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_autoevaluate_base_cased_led_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/autoevaluate/distilbert-base-cased-distilled-squad --- layout: model title: Legal Entire Agreements Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_entire_agreements_bert date: 2023-03-05 tags: [en, legal, classification, clauses, entire_agreements, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Entire_Agreements` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do binary classification at sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting them using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Entire_Agreements`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_entire_agreements_bert_en_1.0.0_3.0_1678050004746.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_entire_agreements_bert_en_1.0.0_3.0_1678050004746.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_entire_agreements_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
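As noted above, several binary clause classifiers can be chained in one pipeline, each contributing its clause label or `Other`. A plain-Python sketch of collapsing such outputs into one True/False flag per clause type (the second classifier name and both predicted labels are illustrative, not real pipeline output):

```python
# Each binary clause classifier emits its clause label or "Other";
# collapse that into a True/False flag per clause model.
predictions = {
    "legclf_entire_agreements_bert": "Entire_Agreements",
    "legclf_governing_law_bert": "Other",  # hypothetical second classifier
}

flags = {
    name: label != "Other"
    for name, label in predictions.items()
}
print(flags)
```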
## Results ```bash +-------------------+ |result | +-------------------+ |[Entire_Agreements]| |[Other] | |[Other] | |[Entire_Agreements]| +-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_entire_agreements_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Entire_Agreements 0.99 0.98 0.98 284 Other 0.98 0.99 0.98 312 accuracy - - 0.98 596 macro-avg 0.98 0.98 0.98 596 weighted-avg 0.98 0.98 0.98 596 ``` --- layout: model title: Detect clinical events author: John Snow Labs name: ner_events_healthcare date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect clinical events such as Date, Occurrence, Clinical Department and more using a pretrained NER model.
## Predicted Entities `OCCURRENCE`, `TREATMENT`, `TIME`, `DATE`, `PROBLEM`, `CLINICAL_DEPT`, `DURATION`, `EVIDENTIAL`, `FREQUENCY`, `TEST` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_EVENTS_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_events_healthcare_en_3.0.0_3.0_1617260839291.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_events_healthcare_en_3.0.0_3.0_1617260839291.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_events_healthcare", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_events_healthcare", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("EXAMPLE_TEXT").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` 
{:.nlu-block} ```python import nlu nlu.load("en.med_ner.events_healthcre").predict("""Put your text here.""") ```
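The `NerConverter` stage in the pipeline above merges the token-level B-/I-/O tags emitted by the NER model into entity chunks. A minimal plain-Python sketch of that merge (the tokens and tags below are illustrative, not real model output):

```python
# Merge BIO-tagged tokens into (chunk, label) pairs, as NerConverter does.
tokens = ["She", "was", "admitted", "to", "the", "emergency", "room", "yesterday"]
tags = ["O", "O", "B-OCCURRENCE", "O", "O", "B-CLINICAL_DEPT", "I-CLINICAL_DEPT", "B-DATE"]

chunks = []
for token, tag in zip(tokens, tags):
    if tag.startswith("B-"):
        # A B- tag opens a new chunk carrying the label after the prefix.
        chunks.append([token, tag[2:]])
    elif tag.startswith("I-") and chunks:
        # An I- tag extends the most recent chunk.
        chunks[-1][0] += " " + token

print([tuple(c) for c in chunks])
```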
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_events_healthcare| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Benchmarking ```bash entity tp fp fn total precision recall f1 DURATION 575.0 263.0 231.0 806.0 0.6862 0.7134 0.6995 PROBLEM 8067.0 2479.0 2305.0 10372.0 0.7649 0.7778 0.7713 DATE 1787.0 508.0 315.0 2102.0 0.7786 0.8501 0.8128 CLINICAL_DEPT 1804.0 393.0 338.0 2142.0 0.8211 0.8422 0.8315 OCCURRENCE 1917.0 893.0 2188.0 4105.0 0.6822 0.467 0.5544 TREATMENT 4578.0 1596.0 1817.0 6395.0 0.7415 0.7159 0.7285 FREQUENCY 145.0 46.0 213.0 358.0 0.7592 0.405 0.5282 TEST 3723.0 949.0 1113.0 4836.0 0.7969 0.7699 0.7831 EVIDENTIAL 334.0 80.0 279.0 613.0 0.8068 0.5449 0.6504 macro - - - - - - 0.60759 micro - - - - - - 0.73065 ``` --- layout: model title: Detect Clinical Entities (jsl_ner_wip_greedy_clinical) author: John Snow Labs name: jsl_ner_wip_greedy_clinical date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terminology. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. 
## Predicted Entities `Injury_or_Poisoning`, `Direction`, `Test`, `Admission_Discharge`, `Death_Entity`, `Relationship_Status`, `Duration`, `Hyperlipidemia`, `Respiration`, `Birth_Entity`, `Age`, `Family_History_Header`, `Labour_Delivery`, `BMI`, `Temperature`, `Alcohol`, `Kidney_Disease`, `Oncological`, `Medical_History_Header`, `Cerebrovascular_Disease`, `Oxygen_Therapy`, `O2_Saturation`, `Psychological_Condition`, `Heart_Disease`, `Employment`, `Obesity`, `Disease_Syndrome_Disorder`, `Pregnancy`, `ImagingFindings`, `Procedure`, `Medical_Device`, `Race_Ethnicity`, `Section_Header`, `Drug`, `Symptom`, `Treatment`, `Substance`, `Route`, `Blood_Pressure`, `Diet`, `External_body_part_or_region`, `LDL`, `VS_Finding`, `Allergen`, `EKG_Findings`, `Imaging_Technique`, `Triglycerides`, `RelativeTime`, `Gender`, `Pulse`, `Social_History_Header`, `Substance_Quantity`, `Diabetes`, `Modifier`, `Internal_organ_or_component`, `Clinical_Dept`, `Form`, `Strength`, `Fetus_NewBorn`, `RelativeDate`, `Height`, `Test_Result`, `Time`, `Frequency`, `Sexually_Active_or_Sexual_Orientation`, `Weight`, `Vaccine`, `Vital_Signs_Header`, `Communicable_Disease`, `Dosage`, `Hypertension`, `HDL`, `Overweight`, `Total_Cholesterol`, `Smoking`, `Date`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_greedy_clinical_en_3.0.0_3.0_1617206898504.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_greedy_clinical_en_3.0.0_3.0_1617206898504.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") jsl_ner = MedicalNerModel.pretrained("jsl_ner_wip_greedy_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("jsl_ner") jsl_ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "jsl_ner"]) \ .setOutputCol("ner_chunk") jsl_ner_pipeline = Pipeline().setStages([ documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter]) jsl_ner_model = jsl_ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame([["""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature."""]]).toDF("text") result = jsl_ner_model.transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val jsl_ner = MedicalNerModel.pretrained("jsl_ner_wip_greedy_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("jsl_ner") val jsl_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "jsl_ner")) .setOutputCol("ner_chunk") val jsl_ner_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter)) val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text") val result = jsl_ner_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl.wip.clinical.greedy").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
## Results ```bash +----------------------------------------------+----------------------------+ |chunk |ner_label | +----------------------------------------------+----------------------------+ |21-day-old |Age | |Caucasian |Race_Ethnicity | |male |Gender | |for 2 days |Duration | |congestion |Symptom | |mom |Gender | |suctioning yellow discharge |Symptom | |nares |External_body_part_or_region| |she |Gender | |mild problems with his breathing while feeding|Symptom | |perioral cyanosis |Symptom | |retractions |Symptom | |One day ago |RelativeDate | |mom |Gender | |tactile temperature |Symptom | |Tylenol |Drug | |Baby |Age | |decreased p.o. intake |Symptom | |His |Gender | |20 minutes |Duration | |q.2h. |Frequency | |to 5 to 10 minutes |Duration | |his |Gender | |respiratory congestion |Symptom | |He |Gender | |tired |Symptom | |fussy |Symptom | |over the past 2 days |RelativeDate | |albuterol |Drug | |ER |Clinical_Dept | |His |Gender | |urine output has also decreased |Symptom | |he |Gender | |per 24 hours |Frequency | |he |Gender | |per 24 hours |Frequency | |Mom |Gender | |diarrhea |Symptom | |His |Gender | |bowel |Internal_organ_or_component | +----------------------------------------------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|jsl_ner_wip_greedy_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on data gathered and manually annotated by John Snow Labs. https://www.johnsnowlabs.com/data/ ## Benchmarking ```bash entity tp fp fn total precision recall f1 VS_Finding 229.0 56.0 34.0 263.0 0.8035 0.8707 0.8358 Direction 4009.0 479.0 403.0 4412.0 0.8933 0.9087 0.9009 Female_Reproducti... 2.0 1.0 3.0 5.0 0.6667 0.4 0.5 Respiration 80.0 9.0 14.0 94.0 0.8989 0.8511 0.8743 Cerebrovascular_D... 
82.0 27.0 18.0 100.0 0.7523 0.82 0.7847 not 4.0 0.0 0.0 4.0 1.0 1.0 1.0 Family_History_He... 86.0 4.0 3.0 89.0 0.9556 0.9663 0.9609 Heart_Disease 469.0 76.0 83.0 552.0 0.8606 0.8496 0.8551 ImagingFindings 68.0 38.0 75.0 143.0 0.6415 0.4755 0.5462 RelativeTime 141.0 76.0 66.0 207.0 0.6498 0.6812 0.6651 Strength 720.0 49.0 58.0 778.0 0.9363 0.9254 0.9308 Smoking 117.0 8.0 6.0 123.0 0.936 0.9512 0.9435 Medical_Device 3584.0 730.0 359.0 3943.0 0.8308 0.909 0.8681 EKG_Findings 41.0 20.0 45.0 86.0 0.6721 0.4767 0.5578 Pulse 138.0 23.0 24.0 162.0 0.8571 0.8519 0.8545 Psychological_Con... 121.0 14.0 29.0 150.0 0.8963 0.8067 0.8491 Overweight 5.0 2.0 0.0 5.0 0.7143 1.0 0.8333 Triglycerides 3.0 0.0 0.0 3.0 1.0 1.0 1.0 Obesity 49.0 6.0 4.0 53.0 0.8909 0.9245 0.9074 Admission_Discharge 325.0 30.0 2.0 327.0 0.9155 0.9939 0.9531 HDL 2.0 1.0 1.0 3.0 0.6667 0.6667 0.6667 Diabetes 118.0 13.0 7.0 125.0 0.9008 0.944 0.9219 Section_Header 3778.0 148.0 138.0 3916.0 0.9623 0.9648 0.9635 Age 617.0 52.0 47.0 664.0 0.9223 0.9292 0.9257 O2_Saturation 34.0 11.0 19.0 53.0 0.7556 0.6415 0.6939 Kidney_Disease 114.0 5.0 12.0 126.0 0.958 0.9048 0.9306 Test 2668.0 526.0 498.0 3166.0 0.8353 0.8427 0.839 Communicable_Disease 25.0 12.0 9.0 34.0 0.6757 0.7353 0.7042 Hypertension 152.0 10.0 6.0 158.0 0.9383 0.962 0.95 External_body_par... 
652.0 387.0 340.0 2992.0 0.8727 0.8864 0.8795 Oxygen_Therapy 67.0 21.0 23.0 90.0 0.7614 0.7444 0.7528 Test_Result 1124.0 227.0 258.0 1382.0 0.832 0.8133 0.8225 Modifier 539.0 185.0 309.0 848.0 0.7445 0.6356 0.6858 BMI 7.0 1.0 1.0 8.0 0.875 0.875 0.875 Labour_Delivery 75.0 19.0 23.0 98.0 0.7979 0.7653 0.7813 Employment 249.0 51.0 57.0 306.0 0.83 0.8137 0.8218 Clinical_Dept 948.0 95.0 80.0 1028.0 0.9089 0.9222 0.9155 Time 36.0 7.0 7.0 43.0 0.8372 0.8372 0.8372 Procedure 3180.0 460.0 480.0 3660.0 0.8736 0.8689 0.8712 Diet 50.0 29.0 30.0 80.0 0.6329 0.625 0.6289 Oncological 478.0 46.0 50.0 528.0 0.9122 0.9053 0.9087 LDL 3.0 0.0 2.0 5.0 1.0 0.6 0.75 Symptom 6801.0 1097.0 1097.0 7898.0 0.8611 0.8611 0.8611 Temperature 109.0 12.0 7.0 116.0 0.9008 0.9397 0.9198 Vital_Signs_Header 213.0 27.0 16.0 229.0 0.8875 0.9301 0.9083 Relationship_Status 42.0 2.0 1.0 43.0 0.9545 0.9767 0.9655 Total_Cholesterol 10.0 4.0 5.0 15.0 0.7143 0.6667 0.6897 Blood_Pressure 167.0 22.0 23.0 190.0 0.8836 0.8789 0.8813 Injury_or_Poisoning 510.0 83.0 111.0 621.0 0.86 0.8213 0.8402 Drug_Ingredient 1698.0 160.0 158.0 1856.0 0.9139 0.9149 0.9144 Treatment 156.0 40.0 54.0 210.0 0.7959 0.7429 0.7685 Assertion_SocialD... 4.0 0.0 6.0 10.0 1.0 0.4 0.5714 Pregnancy 100.0 45.0 41.0 141.0 0.6897 0.7092 0.6993 Vaccine 13.0 3.0 6.0 19.0 0.8125 0.6842 0.7429 Disease_Syndrome_... 2861.0 452.0 376.0 3237.0 0.8636 0.8838 0.8736 Height 25.0 8.0 9.0 34.0 0.7576 0.7353 0.7463 Frequency 650.0 157.0 148.0 798.0 0.8055 0.8145 0.81 Route 872.0 83.0 85.0 957.0 0.9131 0.9112 0.9121 Death_Entity 49.0 7.0 6.0 55.0 0.875 0.8909 0.8829 Duration 367.0 132.0 95.0 462.0 0.7355 0.7944 0.7638 Internal_organ_or... 6532.0 1016.0 987.0 7519.0 0.8654 0.8687 0.8671 Alcohol 79.0 20.0 12.0 91.0 0.798 0.8681 0.8316 Date 515.0 19.0 19.0 534.0 0.9644 0.9644 0.9644 Hyperlipidemia 47.0 2.0 1.0 48.0 0.9592 0.9792 0.9691 Social_History_He... 
89.0 9.0 4.0 93.0 0.9082 0.957 0.9319 Race_Ethnicity 113.0 0.0 3.0 116.0 1.0 0.9741 0.9869 Imaging_Technique 47.0 31.0 30.0 77.0 0.6026 0.6104 0.6065 Drug_BrandName 963.0 72.0 79.0 1042.0 0.9304 0.9242 0.9273 RelativeDate 553.0 128.0 121.0 674.0 0.812 0.8205 0.8162 Gender 6043.0 59.0 87.0 6130.0 0.9903 0.9858 0.9881 Form 227.0 35.0 47.0 274.0 0.8664 0.8285 0.847 Dosage 279.0 42.0 62.0 341.0 0.8692 0.8182 0.8429 Medical_History_H... 117.0 4.0 11.0 128.0 0.9669 0.9141 0.9398 Substance 59.0 16.0 16.0 75.0 0.7867 0.7867 0.7867 Weight 85.0 19.0 21.0 106.0 0.8173 0.8019 0.8095 macro - - - - - - 0.7286 micro - - - - - - 0.8715 ``` --- layout: model title: English BertForMaskedLM Cased model (from anferico) author: John Snow Labs name: bert_embeddings_for_patents date: 2022-12-02 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-for-patents` is an English model originally trained by `anferico`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_for_patents_en_4.2.4_3.0_1670019529456.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_for_patents_en_4.2.4_3.0_1670019529456.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_for_patents","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_for_patents","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
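A common downstream use of the `embeddings` column is comparing token or sentence vectors by cosine similarity. A stdlib-only sketch with toy low-dimensional vectors (real BERT embeddings are much higher-dimensional; the vectors and words below are made up for illustration):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product over the product of vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for token embeddings.
vec_patent = [0.2, 0.7, 0.1, 0.4]
vec_invention = [0.25, 0.6, 0.05, 0.5]
vec_banana = [0.9, 0.0, 0.4, 0.0]

print(round(cosine(vec_patent, vec_invention), 3))
print(round(cosine(vec_patent, vec_banana), 3))
```

Related words should score closer to 1.0 than unrelated ones.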
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_for_patents| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| ## References - https://huggingface.co/anferico/bert-for-patents - https://cloud.google.com/blog/products/ai-machine-learning/how-ai-improves-patent-analysis - https://services.google.com/fh/files/blogs/bert_for_patents_white_paper.pdf - https://github.com/google/patents-public-data/blob/master/models/BERT%20for%20Patents.md - https://github.com/ec-jrc/Patents4IPPC - https://picampus-school.com/ - https://ec.europa.eu/jrc/en --- layout: model title: Italian Bert Embeddings (from bullmount) author: John Snow Labs name: bert_embeddings_hseBert_it_cased date: 2022-04-11 tags: [bert, embeddings, it, open_source] task: Embeddings language: it edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `hseBert-it-cased` is an Italian model originally trained by `bullmount`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_hseBert_it_cased_it_3.4.2_3.0_1649676875956.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_hseBert_it_cased_it_3.4.2_3.0_1649676875956.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_hseBert_it_cased","it") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Adoro Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_hseBert_it_cased","it") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Adoro Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("it.embed.hseBert_it_cased").predict("""Adoro Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_hseBert_it_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|it| |Size:|412.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/bullmount/hseBert-it-cased --- layout: model title: Castilian, Spanish BertForQuestionAnswering model (from CenIA) author: John Snow Labs name: bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_tar date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-cased-finetuned-qa-tar` is a Castilian, Spanish model originally trained by `CenIA`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_tar_es_4.0.0_3.0_1654180441819.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_tar_es_4.0.0_3.0_1654180441819.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_tar","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_tar","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.bert.base_cased.by_CenIA").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_tar| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|410.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/CenIA/bert-base-spanish-wwm-cased-finetuned-qa-tar --- layout: model title: Fast Neural Machine Translation Model from Niuean to English author: John Snow Labs name: opus_mt_niu_en date: 2020-12-29 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, niu, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial ones. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `niu` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_niu_en_xx_2.7.0_2.4_1609254471115.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_niu_en_xx_2.7.0_2.4_1609254471115.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_niu_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Put the Niuean text to translate here.") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_niu_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Put the Niuean text to translate here.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.niu.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
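For intuition, MarianTransformer generates the translation autoregressively: at each step the decoder scores the next target token given the source sentence and the tokens emitted so far, and greedy decoding simply keeps the top-scoring one. A toy sketch of that loop, where the hypothetical `TABLE` stands in for the real decoder (illustrative only, not Marian internals):

```python
def greedy_decode(score_next, bos="<s>", eos="</s>", max_len=10):
    """Repeatedly pick the highest-scoring next token until EOS or max_len."""
    out = [bos]
    for _ in range(max_len):
        scores = score_next(tuple(out))  # dict: candidate token -> score
        tok = max(scores, key=scores.get)
        if tok == eos:
            break
        out.append(tok)
    return out[1:]

# Made-up per-prefix score tables standing in for the trained decoder.
TABLE = {
    ("<s>",): {"hello": 0.9, "world": 0.1},
    ("<s>", "hello"): {"world": 0.8, "</s>": 0.2},
    ("<s>", "hello", "world"): {"</s>": 0.99, "world": 0.01},
}
print(greedy_decode(TABLE.get))  # ['hello', 'world']
```

The real model additionally uses subword (SentencePiece) tokenization and beam search rather than pure greedy decoding.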
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_niu_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Sentence Entity Resolver for SNOMED (sbiobertresolve_snomed_drug) author: John Snow Labs name: sbiobertresolve_snomed_drug date: 2022-01-18 tags: [licensed, snomed, clinical, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps detected drug entities to SNOMED codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. ## Predicted Entities `SNOMED Codes` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_drug_en_3.3.4_2.4_1642534694043.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_drug_en_3.3.4_2.4_1642534694043.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", 'clinical/models') \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("word_embeddings") clinical_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") \ .setInputCols(["sentence", "token", "word_embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(['DRUG']) c2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sentence_chunk_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sentence_embeddings") snomed_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_drug", "en", "clinical/models") \ .setInputCols(["sentence_embeddings"]) \ .setOutputCol("snomed_code")\ .setDistanceFunction("EUCLIDEAN") resolver_pipeline = Pipeline( stages = [ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, clinical_ner, ner_converter, c2doc, sentence_chunk_embeddings, snomed_resolver ]) model = resolver_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = model.transform(spark.createDataFrame([["She is given Fragmin 5000 units subcutaneously daily, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Avandia 4 mg daily, aspirin 81 mg daily, Neurontin 400 mg p.o. t.i.d., magnesium citrate 1 bottle p.o. 
p.r.n., sliding scale coverage insulin."]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("word_embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") .setInputCols(Array("sentence", "token", "word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("DRUG")) val c2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sentence_chunk_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sentence_embeddings") val snomed_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_drug", "en", "clinical/models") .setInputCols(Array("sentence_embeddings")) .setOutputCol("snomed_code") .setDistanceFunction("EUCLIDEAN") val resolver_pipeline = new Pipeline().setStages(Array(document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, clinical_ner, ner_converter, c2doc, sentence_chunk_embeddings, snomed_resolver)) val data = Seq("She is given Fragmin 5000 units subcutaneously daily, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Avandia 4 mg daily, aspirin 81 mg daily, Neurontin 400 mg p.o. t.i.d., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin.").toDF("text") val result = resolver_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.snomed_drug").predict("""She is given Fragmin 5000 units subcutaneously daily, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Avandia 4 mg daily, aspirin 81 mg daily, Neurontin 400 mg p.o. t.i.d., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin.""") ```
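Conceptually, the resolver embeds each detected drug chunk with `sbiobert_base_cased_mli` and returns the SNOMED entries whose reference embeddings are nearest under the configured distance (`EUCLIDEAN` above). A toy illustration with made-up 3-dimensional vectors (real sentence embeddings are 768-dimensional; this is not the annotator's actual implementation):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def resolve(chunk_vec, reference, k=3):
    """Return the k codes whose reference embeddings are nearest to chunk_vec."""
    ranked = sorted(reference.items(), key=lambda kv: euclidean(chunk_vec, kv[1]))
    return [code for code, _ in ranked[:k]]

# Hypothetical reference embeddings for three codes from the results below.
reference = {
    "387458008": [0.9, 0.1, 0.0],   # Aspirin
    "63718003":  [0.1, 0.8, 0.1],   # Folic acid
    "67866001":  [0.0, 0.1, 0.9],   # Insulin
}
print(resolve([0.85, 0.2, 0.0], reference, k=1))  # ['387458008']
```

The `all_k_results` column in the output corresponds to this ranked list of nearest codes.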
## Results ```bash +-----------------+------+-----------------+-----------------+------------------------------------------------------------+------------------------------------------------------------+ | ner_chunk|entity| snomed_code| resolved_text| all_k_results| all_k_resolutions| +-----------------+------+-----------------+-----------------+------------------------------------------------------------+------------------------------------------------------------+ | Fragmin| DRUG| 9487801000001106| Fragmin|9487801000001106:::130752006:::28999000:::953500100000110...|Fragmin:::Fragilysin:::Fusarin:::Femulen:::Fumonisin:::Fr...| | OxyContin| DRUG| 9296001000001100| OxyCONTIN|9296001000001100:::373470001:::230091000001108:::55452001...|OxyCONTIN:::Oxychlorosene:::Oxyargin:::oxyCODONE:::Oxymor...| | folic acid| DRUG| 63718003| Folic acid|63718003:::6247001:::226316008:::432165000:::438451000124...|Folic acid:::Folic acid-containing product:::Folic acid s...| | levothyroxine| DRUG|10071011000001106| Levothyroxine|10071011000001106:::710809001:::768532006:::126202002:::7...|Levothyroxine:::Levothyroxine (substance):::Levothyroxine...| | Avandia| DRUG| 9217601000001109| avandia|9217601000001109:::9217501000001105:::12226401000001108::...|avandia:::avandamet:::Anatera:::Intanza:::Avamys:::Aragam...| | aspirin| DRUG| 387458008| Aspirin|387458008:::7947003:::5145711000001107:::426365001:::4125...|Aspirin:::Aspirin-containing product:::Aspirin powder:::A...| | Neurontin| DRUG| 9461401000001102| neurontin|9461401000001102:::130694004:::86822004:::952840100000110...|neurontin:::Neurolysin:::Neurine (substance):::Nebilet:::...| |magnesium citrate| DRUG| 12495006|Magnesium citrate|12495006:::387401007:::21691008:::15531411000001106:::408...|Magnesium citrate:::Magnesium carbonate:::Magnesium trisi...| | insulin| DRUG| 67866001| Insulin|67866001:::325072002:::414515005:::39487003:::411530000::...|Insulin:::Insulin aspart:::Insulin detemir:::Insulin-cont...| 
+-----------------+------+-----------------+-----------------+------------------------------------------------------------+------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_snomed_drug| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[snomed_code]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Dependencies:|ner_posology| ## Data Source Trained on `SNOMED` code dataset with `sbiobert_base_cased_mli` sentence embeddings. --- layout: model title: Pipeline for Mapping SNOMED Codes to Their Corresponding ICDO Codes author: John Snow Labs name: snomed_icdo_mapping date: 2023-06-13 tags: [en, licensed, clinical, pipeline, chunk_mapping, snomed, icdo] task: Chunk Mapping language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the `snomed_icdo_mapper` model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/snomed_icdo_mapping_en_4.4.4_3.2_1686665539427.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/snomed_icdo_mapping_en_4.4.4_3.2_1686665539427.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("snomed_icdo_mapping", "en", "clinical/models") result = pipeline.fullAnnotate("10376009 2026006 26638004") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("snomed_icdo_mapping", "en", "clinical/models") val result = pipeline.fullAnnotate("10376009 2026006 26638004") ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.snomed_to_icdo.pipe").predict("""Put your text here.""") ```
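Under the hood, a ChunkMapperModel is essentially a curated lookup table from source codes to target codes. A minimal sketch of that idea, using the three SNOMED codes from the example above paired order-respectively with the ICD-O codes shown in the Results section (pairing assumed; the real mapper covers far more codes):

```python
# Assumed order-respective SNOMED -> ICD-O pairs from the results table.
SNOMED_TO_ICDO = {
    "10376009": "8050/2",
    "2026006":  "9014/0",
    "26638004": "8322/0",
}

def map_codes(text):
    """Mimic the pipeline: tokenize on whitespace and look each code up."""
    return [SNOMED_TO_ICDO.get(code, "NONE") for code in text.split()]

print(map_codes("10376009 2026006 26638004"))  # ['8050/2', '9014/0', '8322/0']
```

Codes absent from the table would yield a "NONE" relation, mirroring how chunk mappers report unmatched chunks.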
## Results ```bash
|    | snomed_code | icdo_code |
|---:|:------------|:----------|
|  0 | 10376009    | 8050/2    |
|  1 | 2026006     | 9014/0    |
|  2 | 26638004    | 8322/0    |
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|snomed_icdo_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|212.8 KB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel --- layout: model title: English BertForQuestionAnswering Cased model (from chanifrusydi) author: John Snow Labs name: bert_qa_chanifrusydi_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `chanifrusydi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_chanifrusydi_finetuned_squad_en_4.0.0_3.0_1657186373688.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_chanifrusydi_finetuned_squad_en_4.0.0_3.0_1657186373688.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_chanifrusydi_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_chanifrusydi_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_chanifrusydi_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/chanifrusydi/bert-finetuned-squad --- layout: model title: Smaller BERT Sentence Embeddings (L-2_H-512_A-8) author: John Snow Labs name: sent_small_bert_L2_512 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L2_512_en_2.6.0_2.4_1598350526043.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L2_512_en_2.6.0_2.4_1598350526043.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_512", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer'], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_512", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L2_512').predict(text, output_level='sentence') embeddings_df ```
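A common next step with sentence embeddings is comparing them, for instance with cosine similarity. A self-contained sketch on toy vectors (real `sent_small_bert_L2_512` vectors have 512 dimensions; the values below are made up):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-d stand-ins for the two sentence embeddings above.
v1 = [0.0159, 0.2105, -0.1100]
v2 = [-0.2905, 0.2152, 0.1300]
print(round(cosine(v1, v2), 4))
```

Identical sentences score 1.0, orthogonal embeddings 0.0, which is why cosine similarity is a convenient basis for semantic-similarity and retrieval tasks on these vectors.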
{:.h2_title} ## Results ```bash sentence en_embed_sentence_small_bert_L2_512_embeddings I hate cancer [0.015892572700977325, 0.21051561832427979, 0.... Antibiotics aren't painkiller [-0.2904765009880066, 0.21515187621116638, 0.1... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L2_512| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|512| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-512_A-8/1 --- layout: model title: Sentence Detection in Sindhi Text author: John Snow Labs name: sentence_detector_dl date: 2021-08-30 tags: [sd, open_source, sentence_detection] task: Sentence Detection language: sd edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_sd_3.2.0_3.0_1630337452693.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_sd_3.2.0_3.0_1630337452693.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "sd") \ .setInputCols(["document"]) \ .setOutputCol("sentences") sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL])) sd_model.fullAnnotate("""readingولي رھيا آھن ھڪڙو وڏو ذريعو انگريزي پڙھڻ جا پيراگراف؟ توھان صحيح ھن place تي آيا آھيو. هڪ تازي تحقيق مطابق ا today's جي نوجوانن ۾ پڙهڻ جي عادت تيزيءَ سان گهٽجي رهي آهي. اھي نٿا ڏئي سگھن انگريزي ڏنل پيراگراف تي ڪجھ سيڪنڊن کان و forيڪ لاءِ. پڻ ، پڙهڻ هو ۽ آهي هڪ لازمي حصو س allني مقابلي واري امتحانن جو. تنھنڪري ، توھان پنھنجي پڙھڻ جي صلاحيتن کي ڪيئن بھتر ڪريو ٿا؟ ھن سوال جو جواب اصل ۾ ھڪڙو questionيو سوال آھي: پڙھڻ جي صلاحيتن جو استعمال ا آھي؟ پڙهڻ جو بنيادي مقصد آهي ’احساس ڪرڻ‘.""") ``` ```scala val documenter = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "sd") .setInputCols(Array("document")) .setOutputCol("sentence") val pipeline = new Pipeline().setStages(Array(documenter, model)) val data = Seq("readingولي رھيا آھن ھڪڙو وڏو ذريعو انگريزي پڙھڻ جا پيراگراف؟ توھان صحيح ھن place تي آيا آھيو. هڪ تازي تحقيق مطابق ا today's جي نوجوانن ۾ پڙهڻ جي عادت تيزيءَ سان گهٽجي رهي آهي. اھي نٿا ڏئي سگھن انگريزي ڏنل پيراگراف تي ڪجھ سيڪنڊن کان و forيڪ لاءِ. پڻ ، پڙهڻ هو ۽ آهي هڪ لازمي حصو س allني مقابلي واري امتحانن جو. تنھنڪري ، توھان پنھنجي پڙھڻ جي صلاحيتن کي ڪيئن بھتر ڪريو ٿا؟ ھن سوال جو جواب اصل ۾ ھڪڙو questionيو سوال آھي: پڙھڻ جي صلاحيتن جو استعمال ا آھي؟ پڙهڻ جو بنيادي مقصد آهي ’احساس ڪرڻ‘.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load('sd.sentence_detector').predict("readingولي رھيا آھن ھڪڙو وڏو ذريعو انگريزي پڙھڻ جا پيراگراف؟ توھان صحيح ھن place تي آيا آھيو. 
هڪ تازي تحقيق مطابق ا today's جي نوجوانن ۾ پڙهڻ جي عادت تيزيءَ سان گهٽجي رهي آهي. اھي نٿا ڏئي سگھن انگريزي ڏنل پيراگراف تي ڪجھ سيڪنڊن کان و forيڪ لاءِ. پڻ ، پڙهڻ هو ۽ آهي هڪ لازمي حصو س allني مقابلي واري امتحانن جو. تنھنڪري ، توھان پنھنجي پڙھڻ جي صلاحيتن کي ڪيئن بھتر ڪريو ٿا؟ ھن سوال جو جواب اصل ۾ ھڪڙو questionيو سوال آھي: پڙھڻ جي صلاحيتن جو استعمال ا آھي؟ پڙهڻ جو بنيادي مقصد آهي ’احساس ڪرڻ‘.", output_level ='sentence') ```
## Results ```bash +--------------------------------------------------------------------------------------------------------------+ |result | +--------------------------------------------------------------------------------------------------------------+ |[readingولي رھيا آھن ھڪڙو وڏو ذريعو انگريزي پڙھڻ جا پيراگراف؟ توھان صحيح ھن place تي آيا آھيو.] | |[هڪ تازي تحقيق مطابق ا today's جي نوجوانن ۾ پڙهڻ جي عادت تيزيءَ سان گهٽجي رهي آهي.] | |[اھي نٿا ڏئي سگھن انگريزي ڏنل پيراگراف تي ڪجھ سيڪنڊن کان و forيڪ لاءِ.] | |[پڻ ، پڙهڻ هو ۽ آهي هڪ لازمي حصو س allني مقابلي واري امتحانن جو.] | |[تنھنڪري ، توھان پنھنجي پڙھڻ جي صلاحيتن کي ڪيئن بھتر ڪريو ٿا؟ ھن سوال جو جواب اصل ۾ ھڪڙو questionيو سوال آھي:]| |[پڙھڻ جي صلاحيتن جو استعمال ا آھي؟ پڙهڻ جو بنيادي مقصد آهي ’احساس ڪرڻ‘.] | +--------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentence_detector_dl| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[sentences]| |Language:|sd| ## Benchmarking ```bash label Accuracy Recall Prec F1 0 0.98 1.00 0.96 0.98 ``` --- layout: model title: English DistilBertForQuestionAnswering model (from avioo1) author: John Snow Labs name: distilbert_qa_avioo1_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `avioo1`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_avioo1_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725061133.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_avioo1_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725061133.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_avioo1_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_avioo1_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_avioo1").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_avioo1_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/avioo1/distilbert-base-uncased-finetuned-squad --- layout: model title: Telugu Bert Embeddings (from monsoon-nlp) author: John Snow Labs name: bert_embeddings_muril_adapted_local date: 2022-04-11 tags: [bert, embeddings, te, open_source] task: Embeddings language: te edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `muril-adapted-local` is a Telugu model originally trained by `monsoon-nlp`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_te_3.4.2_3.0_1649675347372.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_te_3.4.2_3.0_1649675347372.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","te") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["నేను స్పార్క్ nlp ను ప్రేమిస్తున్నాను"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","te") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("నేను స్పార్క్ nlp ను ప్రేమిస్తున్నాను").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("te.embed.muril_adapted_local").predict("""నేను స్పార్క్ nlp ను ప్రేమిస్తున్నాను""") ```
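BertEmbeddings emits one vector per token; a sentence-level representation is commonly derived downstream by mean-pooling those vectors. A framework-free sketch of mean pooling on toy 2-dimensional vectors (illustrative only, not Spark NLP internals):

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token embeddings into one sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Toy 2-d embeddings for three tokens (real MuRIL vectors are 768-dimensional).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
print(mean_pool(tokens))  # [3.0, 2.0]
```

In Spark NLP this pooling is typically handled by a SentenceEmbeddings stage placed after the token-level embeddings annotator.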
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_muril_adapted_local| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|te| |Size:|888.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/monsoon-nlp/muril-adapted-local - https://tfhub.dev/google/MuRIL/1 --- layout: model title: Swahili XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_hausa_finetuned_ner_swahili date: 2022-08-01 tags: [sw, open_source, xlm_roberta, ner] task: Named Entity Recognition language: sw edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-hausa-finetuned-ner-swahili` is a Swahili model originally trained by `mbeukman`. ## Predicted Entities `PER`, `LOC`, `ORG`, `DATE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_hausa_finetuned_ner_swahili_sw_4.1.0_3.0_1659353949038.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_hausa_finetuned_ner_swahili_sw_4.1.0_3.0_1659353949038.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_hausa_finetuned_ner_swahili","sw") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_hausa_finetuned_ner_swahili","sw") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
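The `ner_converter` stage above merges token-level IOB tags (`B-PER`, `I-PER`, `O`, ...) into entity chunks. The grouping logic can be sketched in plain Python as follows (an illustration of the idea, not Spark NLP's internal implementation):

```python
def bio_to_chunks(tokens, tags):
    """Group token-level B-/I-/O tags into (entity_text, label) chunks."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            # "O" tag, or an I- tag that does not continue the open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Asha", "anaishi", "Dar", "es", "Salaam"]
tags = ["B-PER", "O", "B-LOC", "I-LOC", "I-LOC"]
print(bio_to_chunks(tokens, tags))  # [('Asha', 'PER'), ('Dar es Salaam', 'LOC')]
```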
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_hausa_finetuned_ner_swahili| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|sw| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-hausa-finetuned-ner-swahili - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://github.com/masakhane-io/masakhane-ner --- layout: model title: Legal Subscription Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_subscription_agreement_bert date: 2022-11-25 tags: [en, legal, classification, agreement, subscription, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_subscription_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `subscription-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `subscription-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_subscription_agreement_bert_en_1.0.0_3.0_1669372100379.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_subscription_agreement_bert_en_1.0.0_3.0_1669372100379.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_subscription_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[subscription-agreement]| |[other]| |[other]| |[subscription-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_subscription_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.94 0.97 0.95 65 subscription-agreement 0.94 0.89 0.91 35 accuracy - - 0.94 100 macro-avg 0.94 0.93 0.93 100 weighted-avg 0.94 0.94 0.94 100 ``` --- layout: model title: Legal No solicitation Clause Binary Classifier author: John Snow Labs name: legclf_no_solicitation_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `no-solicitation` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. 
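The paragraph splitting (by multiline) listed above can be approximated with a few lines of plain Python run before the text reaches the pipeline. This is only a sketch of the idea; the linked tutorial shows the full Spark-based approach:

```python
import re

def split_paragraphs(document: str):
    # Split on one or more blank lines and drop empty fragments.
    parts = re.split(r"\n\s*\n", document)
    return [p.strip() for p in parts if p.strip()]

doc = "Clause 1. No solicitation...\n\nClause 2. Governing law...\n\n\nClause 3. Severability..."
print(split_paragraphs(doc))
# ['Clause 1. No solicitation...', 'Clause 2. Governing law...', 'Clause 3. Severability...']
```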
Take into consideration that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in the Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `no-solicitation` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_no_solicitation_clause_en_1.0.0_3.2_1660123765936.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_no_solicitation_clause_en_1.0.0_3.2_1660123765936.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_no_solicitation_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[no-solicitation]| |[other]| |[other]| |[no-solicitation]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_no_solicitation_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support no-solicitation 0.93 0.96 0.94 26 other 0.98 0.96 0.97 46 accuracy - - 0.96 72 macro-avg 0.95 0.96 0.96 72 weighted-avg 0.96 0.96 0.96 72 ``` --- layout: model title: Detect entities related to road traffic author: John Snow Labs name: ner_traffic date: 2021-04-01 tags: [ner, clinical, licensed, de] task: Named Entity Recognition language: de edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect entities related to road traffic using a pretrained NER model. 
## Predicted Entities `ORGANIZATION_COMPANY`, `DISASTER_TYPE`, `TIME`, `TRIGGER`, `DATE`, `PERSON`, `LOCATION_STOP`, `ORGANIZATION`, `DISTANCE`, `LOCATION_STREET`, `NUMBER`, `DURATION`, `ORG_POSITION`, `LOCATION_ROUTE`, `LOCATION`, `LOCATION_CITY` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_TRAFFIC_DE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_traffic_de_3.0.0_3.0_1617260858901.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_traffic_de_3.0.0_3.0_1617260858901.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_german = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_traffic", "de", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_german, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_german = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_traffic", "de", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_german, ner, ner_converter)) val data = Seq("EXAMPLE_TEXT").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu 
nlu.load("de.med_ner.traffic").predict("""Put your text here.""") ```
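The benchmarking table below reports `tp`, `fp`, and `fn` per entity; precision, recall, and F1 follow from the standard definitions, which can be checked against any row (here, `DURATION` and `DISTANCE`):

```python
def prf(tp, fp, fn):
    # Standard precision / recall / F1 from true positives, false positives, false negatives.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 4), round(recall, 4), round(f1, 4)

# DURATION row: tp=113, fp=34, fn=94
print(prf(113, 34, 94))   # (0.7687, 0.5459, 0.6384)
# DISTANCE row: tp=99, fp=0, fn=16
print(prf(99, 0, 16))     # (1.0, 0.8609, 0.9252)
```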
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_traffic| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|de| ## Benchmarking ```bash entity tp fp fn total precision recall f1 DURATION 113.0 34.0 94.0 207.0 0.7687 0.5459 0.6384 ORGANIZATION_COMPANY 667.0 324.0 515.0 1182.0 0.6731 0.5643 0.6139 LOCATION_CITY 441.0 137.0 166.0 607.0 0.763 0.7265 0.7443 LOCATION_ROUTE 132.0 30.0 61.0 193.0 0.8148 0.6839 0.7437 DATE 730.0 81.0 168.0 898.0 0.9001 0.8129 0.8543 PERSON 422.0 84.0 174.0 596.0 0.834 0.7081 0.7659 LOCATION_STREET 132.0 12.0 99.0 231.0 0.9167 0.5714 0.704 LOCATION 697.0 94.0 359.0 1056.0 0.8812 0.66 0.7547 TIME 266.0 34.0 45.0 311.0 0.8867 0.8553 0.8707 TRIGGER 187.0 34.0 192.0 379.0 0.8462 0.4934 0.6233 DISTANCE 99.0 0.0 16.0 115.0 1.0 0.8609 0.9252 NUMBER 608.0 147.0 189.0 797.0 0.8053 0.7629 0.7835 LOCATION_STOP 403.0 53.0 77.0 480.0 0.8838 0.8396 0.8611 macro - - - - - - 0.6528 micro - - - - - - 0.7261 ``` --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4CHEMD_Modified_pubmed_clinical date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Modified_pubmed_clinical` is an English model originally trained by `ghadeermobasher`. 
## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Modified_pubmed_clinical_en_4.0.0_3.0_1657108848432.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Modified_pubmed_clinical_en_4.0.0_3.0_1657108848432.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Modified_pubmed_clinical","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Modified_pubmed_clinical","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4CHEMD_Modified_pubmed_clinical| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|407.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4CHEMD-Modified_pubmed_clinical --- layout: model title: Turkish BertForTokenClassification Cased model (from busecarik) author: John Snow Labs name: bert_token_classifier_loodos_sunlp_ner_turkish date: 2022-11-30 tags: [tr, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: tr edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-loodos-sunlp-ner-turkish` is a Turkish model originally trained by `busecarik`. ## Predicted Entities `PRODUCT`, `TIME`, `MONEY`, `ORGANIZATION`, `LOCATION`, `TVSHOW`, `PERSON` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_loodos_sunlp_ner_turkish_tr_4.2.4_3.0_1669815349144.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_loodos_sunlp_ner_turkish_tr_4.2.4_3.0_1669815349144.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_loodos_sunlp_ner_turkish","tr") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_loodos_sunlp_ner_turkish","tr") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_loodos_sunlp_ner_turkish| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|tr| |Size:|412.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/busecarik/bert-loodos-sunlp-ner-turkish - https://github.com/SU-NLP/SUNLP-Twitter-NER-Dataset --- layout: model title: Legal Transactions With Affiliates Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_transactions_with_affiliates_bert date: 2023-03-05 tags: [en, legal, classification, clauses, transactions_with_affiliates, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Transactions_With_Affiliates` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. 
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in the Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Transactions_With_Affiliates`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_transactions_with_affiliates_bert_en_1.0.0_3.0_1678050557126.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_transactions_with_affiliates_bert_en_1.0.0_3.0_1678050557126.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_transactions_with_affiliates_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Transactions_With_Affiliates]| |[Other]| |[Other]| |[Transactions_With_Affiliates]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_transactions_with_affiliates_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 1.00 0.97 0.98 61 Transactions_With_Affiliates 0.95 1.00 0.98 42 accuracy - - 0.98 103 macro-avg 0.98 0.98 0.98 103 weighted-avg 0.98 0.98 0.98 103 ``` --- layout: model title: Extract Temporal Entities from Voice of the Patient Documents (embeddings_clinical_medium) author: John Snow Labs name: ner_vop_temporal_emb_clinical_medium date: 2023-06-06 tags: [licensed, clinical, ner, en, vop, temporal] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts temporal references from the documents transferred from the patient’s own sentences. ## Predicted Entities `DateTime`, `Frequency`, `Duration` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_temporal_emb_clinical_medium_en_4.4.3_3.0_1686076464979.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_temporal_emb_clinical_medium_en_4.4.3_3.0_1686076464979.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_temporal_emb_clinical_medium", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["I broke my arm playing football last month and had to get surgery in the orthopedic department. 
The cast just came off yesterday and I'm excited to start physical therapy and get back to the game."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_temporal_emb_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("I broke my arm playing football last month and had to get surgery in the orthopedic department. The cast just came off yesterday and I'm excited to start physical therapy and get back to the game.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:-----------|:------------| | last month | DateTime | | yesterday | DateTime | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_temporal_emb_clinical_medium| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical_medium| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
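The benchmarking section that follows reports both macro and micro averages. Under the usual definition (assumed here), micro averaging pools `tp`/`fp`/`fn` across all labels before computing the metrics, which reproduces the table's micro row:

```python
rows = {  # tp, fp, fn per label, taken from the benchmarking table
    "DateTime": (3954, 470, 448),
    "Frequency": (921, 190, 158),
    "Duration": (1952, 362, 358),
}
tp = sum(r[0] for r in rows.values())
fp = sum(r[1] for r in rows.values())
fn = sum(r[2] for r in rows.values())
micro_p = tp / (tp + fp)   # 6827 / 7849
micro_r = tp / (tp + fn)   # 6827 / 7791
print(round(micro_p, 2), round(micro_r, 2))  # 0.87 0.88
```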
## Benchmarking ```bash label tp fp fn total precision recall f1 DateTime 3954 470 448 4402 0.89 0.90 0.90 Frequency 921 190 158 1079 0.83 0.85 0.84 Duration 1952 362 358 2310 0.84 0.85 0.84 macro_avg 6827 1022 964 7791 0.85 0.87 0.86 micro_avg 6827 1022 964 7791 0.87 0.88 0.87 ``` --- layout: model title: Arabic Bert Embeddings (Base) author: John Snow Labs name: bert_embeddings_bert_base_arabic date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic` is an Arabic model originally trained by `asafaya`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabic_ar_3.4.2_3.0_1649677068712.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabic_ar_3.4.2_3.0_1649677068712.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabic","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabic","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.bert_base_arabic").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_arabic| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|414.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/asafaya/bert-base-arabic - https://traces1.inria.fr/oscar/ - http://commoncrawl.org/ - https://dumps.wikimedia.org/backup-index.html - https://github.com/google-research/bert - https://www.tensorflow.org/tfrc - https://github.com/alisafaya/Arabic-BERT --- layout: model title: Tagalog Electra Embeddings (from jcblaise) author: John Snow Labs name: electra_embeddings_electra_tagalog_small_cased_generator date: 2022-05-17 tags: [tl, open_source, electra, embeddings] task: Embeddings language: tl edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-tagalog-small-cased-generator` is a Tagalog model originally trained by `jcblaise`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_tagalog_small_cased_generator_tl_3.4.4_3.0_1652786760112.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_tagalog_small_cased_generator_tl_3.4.4_3.0_1652786760112.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_tagalog_small_cased_generator","tl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Mahilig ako sa Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_tagalog_small_cased_generator","tl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Mahilig ako sa Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electra_tagalog_small_cased_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|tl| |Size:|18.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/jcblaise/electra-tagalog-small-cased-generator - https://blaisecruz.com --- layout: model title: Fast Neural Machine Translation Model from English to Malayo-Polynesian Languages author: John Snow Labs name: opus_mt_en_poz date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, poz, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. - source languages: `en` - target languages: `poz` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_poz_xx_2.7.0_2.4_1609168000682.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_poz_xx_2.7.0_2.4_1609168000682.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_poz", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_poz", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.poz').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_poz| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Reporting Clause Binary Classifier author: John Snow Labs name: legclf_reporting_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `reporting` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial linked above). This model can be combined with any of the hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
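The paragraph-splitting recommendation above can be sketched outside Spark NLP as well. A minimal, illustrative helper (not part of the library) that splits a contract into paragraph-sized chunks by blank lines, so each chunk stays within the classifier's token budget:

```python
# Minimal sketch (hypothetical helper, not a Spark NLP API): split a long
# legal document into paragraphs by blank lines, keeping only chunks whose
# whitespace token count fits within the embedding limit. Paragraphs longer
# than the limit would still need further splitting (omitted for brevity).
def split_paragraphs(text, max_tokens=512):
    chunks = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if paragraph and len(paragraph.split()) <= max_tokens:
            chunks.append(paragraph)
    return chunks
```

Each resulting chunk can then be fed to the classifier as a separate row in the `clause_text` column.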
## Predicted Entities `other`, `reporting` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_reporting_clause_en_1.0.0_3.2_1660123939267.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_reporting_clause_en_1.0.0_3.2_1660123939267.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_reporting_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[reporting]| |[other]| |[other]| |[reporting]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_reporting_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.93 0.93 0.93 154 reporting 0.86 0.87 0.87 78 accuracy - - 0.91 232 macro-avg 0.90 0.90 0.90 232 weighted-avg 0.91 0.91 0.91 232 ``` --- layout: model title: English RobertaForQuestionAnswering (from deepset) author: John Snow Labs name: roberta_qa_roberta_base_squad2_covid date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-covid` is an English model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2_covid_en_4.0.0_3.0_1655735153729.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2_covid_en_4.0.0_3.0_1655735153729.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_squad2_covid","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_squad2_covid","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_covid.roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_squad2_covid| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/deepset/roberta-base-squad2-covid - https://www.linkedin.com/company/deepset-ai/ - https://github.com/deepset-ai/COVID-QA/blob/master/data/question-answering/200423_covidQA.json - https://haystack.deepset.ai/community/join - https://deepset.ai/german-bert - https://github.com/deepset-ai/FARM - http://www.deepset.ai/jobs - https://twitter.com/deepset_ai - https://github.com/deepset-ai/haystack/discussions - https://github.com/deepset-ai/haystack/ - https://deepset.ai - https://deepset.ai/germanquad - https://github.com/deepset-ai/FARM/blob/master/examples/question_answering_crossvalidation.py --- layout: model title: ChunkResolver Loinc Clinical author: John Snow Labs name: chunkresolve_loinc_clinical date: 2021-04-02 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Entity resolution model based on KNN using word embeddings and Word Mover's Distance. ## Predicted Entities LOINC Codes with ``clinical_embeddings``.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_LOINC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_loinc_clinical_en_3.0.0_3.0_1617355407030.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_loinc_clinical_en_3.0.0_3.0_1617355407030.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... loinc_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_loinc_clinical", "en", "clinical/models") \ .setInputCols(["token", "chunk_embeddings"]) \ .setOutputCol("loinc_code") \ .setDistanceFunction("COSINE") \ .setNeighbours(5) pipeline_loinc = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, loinc_resolver]) data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting."""]]).toDF("text") model = pipeline_loinc.fit(data) results = model.transform(data) ``` ```scala ... val loinc_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_loinc_clinical", "en", "clinical/models") .setInputCols(Array("token", "chunk_embeddings")) .setOutputCol("loinc_code") .setDistanceFunction("COSINE") .setNeighbours(5) val pipeline_loinc = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, loinc_resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.").toDF("text") val result = pipeline_loinc.fit(data).transform(data) ```
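The resolver above ranks LOINC candidates by cosine distance and keeps the five nearest (per `setDistanceFunction("COSINE")` and `setNeighbours(5)`). A minimal pure-Python sketch of that KNN lookup, using toy vectors rather than real embeddings, illustrates the ranking step:

```python
import math

# Illustrative sketch of a cosine-distance KNN lookup (not the Spark NLP
# internals): rank candidate LOINC codes by cosine distance between a chunk
# embedding and stored code embeddings, keeping the k nearest.
def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def nearest_codes(chunk_vec, code_vecs, k=5):
    # code_vecs: list of (loinc_code, embedding) pairs
    ranked = sorted(code_vecs, key=lambda item: cosine_distance(chunk_vec, item[1]))
    return [code for code, _ in ranked[:k]]
```

In the real model the candidate set and embeddings are learned from clinical data; this sketch only shows the distance-based neighbor selection.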
## Results ```bash Chunk loinc-Code 0 gestational diabetes mellitus 44877-9 1 type two diabetes mellitus 44877-9 2 T2DM 93692-2 3 prior episode of HTG-induced pancreatitis 85695-5 4 associated with an acute hepatitis 24363-4 5 obesity with a body mass index 47278-7 6 BMI) of 33.5 kg/m2 47214-2 7 polyuria 35234-4 8 polydipsia 25541-4 9 poor appetite 50056-1 10 vomiting 34175-0 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|chunkresolve_loinc_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token, chunk_embeddings]| |Output Labels:|[loinc]| |Language:|en| --- layout: model title: English image_classifier_vit_housing_categories ViTForImageClassification from Albe author: John Snow Labs name: image_classifier_vit_housing_categories date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_housing_categories` is an English model originally trained by Albe. ## Predicted Entities `tree house`, `yurt`, `caravan`, `farm`, `castle` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_housing_categories_en_4.1.0_3.0_1660166778182.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_housing_categories_en_4.1.0_3.0_1660166778182.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_housing_categories", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_housing_categories", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_housing_categories| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Pipeline to Detect Clinical Entities author: John Snow Labs name: ner_jsl_biobert_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_jsl_biobert](https://nlp.johnsnowlabs.com/2021/09/05/ner_jsl_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_biobert_pipeline_en_3.4.1_3.0_1647869212989.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_biobert_pipeline_en_3.4.1_3.0_1647869212989.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_jsl_biobert_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` ```scala val pipeline = new PretrainedPipeline("ner_jsl_biobert_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. 
His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl_biobert.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
## Results ```bash +-----------------------------------------+----------------------------+ |chunk |ner_label | +-----------------------------------------+----------------------------+ |21-day-old |Age | |Caucasian |Race_Ethnicity | |male |Gender | |for 2 days |Duration | |congestion |Symptom | |mom |Gender | |suctioning |Modifier | |yellow discharge |Symptom | |nares |External_body_part_or_region| |she |Gender | |mild |Modifier | |problems with his breathing while feeding|Symptom | |perioral cyanosis |Symptom | |retractions |Symptom | |One day ago |RelativeDate | |mom |Gender | |tactile temperature |Symptom | |Tylenol |Drug_BrandName | |Baby |Age | |decreased p.o |Symptom | |His |Gender | |from 20 minutes q.2h. to 5 to 10 minutes |Duration | |his |Gender | |respiratory congestion |Symptom | |He |Gender | |tired |Symptom | |fussy |Symptom | |over the past 2 days |RelativeDate | |albuterol |Drug_Ingredient | |ER |Clinical_Dept | |His |Gender | |urine output has also decreased |Symptom | |he |Gender | |per 24 hours |Frequency | |he |Gender | |per 24 hours |Frequency | |Mom |Gender | |diarrhea |Symptom | |His |Gender | |bowel |Internal_organ_or_component | +-----------------------------------------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter --- layout: model title: Chinese Bert Embeddings (Base, MacBERT) author: John Snow Labs name: bert_embeddings_chinese_macbert_base date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## 
Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `chinese-macbert-base` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_macbert_base_zh_3.4.2_3.0_1649669049572.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_macbert_base_zh_3.4.2_3.0_1649669049572.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_chinese_macbert_base","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_chinese_macbert_base","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.chinese_macbert_base").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_macbert_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|384.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/chinese-macbert-base - https://github.com/ymcui/MacBERT/blob/master/LICENSE - https://2020.emnlp.org - https://arxiv.org/abs/2004.13922 - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://github.com/chatopera/Synonyms --- layout: model title: Legal Exhibits Clause Binary Classifier author: John Snow Labs name: legclf_exhibits_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `exhibits` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial linked above). This model can be combined with any of the hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `exhibits` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_exhibits_clause_en_1.0.0_3.2_1660123526366.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_exhibits_clause_en_1.0.0_3.2_1660123526366.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_exhibits_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
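As the description notes, several binary clause classifiers can be combined, each emitting its clause name or `other`. A minimal pure-Python sketch (toy labels, not real `model.transform()` output) of collecting those predictions into per-clause True/False flags:

```python
# Illustrative sketch (not a Spark NLP API): given one predicted label per
# clause classifier, derive a True/False flag per clause type. A classifier
# "fires" when it predicts its own clause name rather than "other".
def clause_flags(predictions):
    # predictions: mapping of clause type -> predicted label, e.g.
    # {"exhibits": "exhibits", "reporting": "other"}
    return {clause: label == clause for clause, label in predictions.items()}
```

In practice you would run each classifier's pipeline over the same `clause_text` column and gather the `category` outputs this way.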
## Results ```bash +-------+ | result| +-------+ |[exhibits]| |[other]| |[other]| |[exhibits]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_exhibits_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support exhibits 0.95 0.97 0.96 59 other 0.98 0.97 0.97 96 accuracy - - 0.97 155 macro-avg 0.96 0.97 0.97 155 weighted-avg 0.97 0.97 0.97 155 ``` --- layout: model title: Vietnamese Deberta Embeddings model (from binhquoc) author: John Snow Labs name: deberta_embeddings_vie_small date: 2023-03-12 tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, vie, tensorflow] task: Embeddings language: vie edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DeBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `vie-deberta-small` is a Vietnamese model originally trained by `binhquoc`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_vie_small_vie_4.3.1_3.0_1678626638418.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_vie_small_vie_4.3.1_3.0_1678626638418.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_vie_small","vie") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_vie_small","vie") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|deberta_embeddings_vie_small| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|vie| |Size:|278.0 MB| |Case sensitive:|false| ## References https://huggingface.co/binhquoc/vie-deberta-small --- layout: model title: Legal Jurisdictions Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_jurisdictions_bert date: 2023-03-05 tags: [en, legal, classification, clauses, jurisdictions, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Jurisdictions` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Jurisdictions`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_jurisdictions_bert_en_1.0.0_3.0_1678050016981.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_jurisdictions_bert_en_1.0.0_3.0_1678050016981.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_jurisdictions_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
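The paragraph-splitting strategy recommended in the description ("by multiline") can be sketched without Spark. A minimal version using a blank-line regex, which you could apply to a document before feeding each paragraph to the classifier (the workshop notebook may implement this differently):

```python
import re

def split_paragraphs(text):
    # Split on one or more blank lines and drop empty fragments.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = """GOVERNING LAW. This Agreement shall be governed by the laws of Delaware.

NOTICES. All notices shall be delivered in writing."""

paragraphs = split_paragraphs(doc)
print(len(paragraphs))  # 2
```

Each resulting paragraph can then become one row of the `text` column in the DataFrame passed to the pipeline above.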
## Results ```bash +-------+ |result| +-------+ |[Jurisdictions]| |[Other]| |[Other]| |[Jurisdictions]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_jurisdictions_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Jurisdictions 0.86 1.00 0.93 19 Other 1.00 0.91 0.95 32 accuracy - - 0.94 51 macro-avg 0.93 0.95 0.94 51 weighted-avg 0.95 0.94 0.94 51 ``` --- layout: model title: English Bert Embeddings (Uncased) author: John Snow Labs name: bert_embeddings_false_positives_scancode_bert_base_uncased_L8_1 date: 2022-04-11 tags: [bert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `false-positives-scancode-bert-base-uncased-L8-1` is an English model originally trained by `ayansinha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_false_positives_scancode_bert_base_uncased_L8_1_en_3.4.2_3.0_1649672624525.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_false_positives_scancode_bert_base_uncased_L8_1_en_3.4.2_3.0_1649672624525.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_false_positives_scancode_bert_base_uncased_L8_1","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_false_positives_scancode_bert_base_uncased_L8_1","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.false_positives_scancode_bert_base_uncased_L8_1").predict("""I love Spark NLP""") ```
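The pipeline above yields one vector per token in the `embeddings` column. A common (though not the only) way to turn these into a single document vector is mean pooling, averaging each dimension across tokens. A minimal sketch with toy 2-dimensional vectors (illustrative values, not actual model output):

```python
def mean_pool(token_vectors):
    # Average each dimension across all token vectors.
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

# Three toy token vectors standing in for real BERT embeddings.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(tokens))  # [3.0, 4.0]
```

For sentence-level vectors out of the box, Spark NLP also offers dedicated sentence-embedding annotators, so this is only a sketch of the underlying idea.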
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_false_positives_scancode_bert_base_uncased_L8_1| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|410.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/ayansinha/false-positives-scancode-bert-base-uncased-L8-1 - https://github.com/nexB/scancode-results-analyzer - https://github.com/nexB/scancode-results-analyzer#quickstart---local-machine - https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py --- layout: model title: French CamemBert Embeddings (from ysharma) author: John Snow Labs name: camembert_embeddings_ysharma_generic_model_2 date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model-2` is a French model originally trained by `ysharma`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_ysharma_generic_model_2_fr_3.4.4_3.0_1653991037086.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_ysharma_generic_model_2_fr_3.4.4_3.0_1653991037086.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_ysharma_generic_model_2","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_ysharma_generic_model_2","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_ysharma_generic_model_2| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/ysharma/dummy-model-2 --- layout: model title: Multilingual BertForQuestionAnswering model (from mrm8488) author: John Snow Labs name: bert_qa_bert_multi_cased_finetuned_xquadv1 date: 2022-06-02 tags: [en, es, de, el, ru, tr, ar, vi, th, zh, hi, open_source, question_answering, bert, xx] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-multi-cased-finetuned-xquadv1` is a Multilingual model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_multi_cased_finetuned_xquadv1_xx_4.0.0_3.0_1654184515717.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_multi_cased_finetuned_xquadv1_xx_4.0.0_3.0_1654184515717.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_multi_cased_finetuned_xquadv1","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_multi_cased_finetuned_xquadv1","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("xx.answer_question.xquad.bert.cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
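Under the hood, extractive QA models like this one score every token as a potential answer start and end, and the predicted answer is the best-scoring valid span. A toy sketch of that selection step (the scores below are made up for illustration, not model output):

```python
def best_span(start_scores, end_scores, max_len=15):
    # Pick (i, j) maximizing start_scores[i] + end_scores[j] with i <= j
    # and a bounded span length, as extractive QA decoders typically do.
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best[0], best[1]

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 3.1, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]
end   = [0.0, 0.1, 0.0, 2.8, 0.1, 0.0, 0.0, 0.0, 0.4, 0.0]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```

The `answer` column produced by the annotator is the result of this kind of span selection applied to the real model logits.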
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_multi_cased_finetuned_xquadv1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|xx| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/bert-multi-cased-finetuned-xquadv1 - https://github.com/google-research/bert/blob/master/multilingual.md - https://twitter.com/mrm8488 - https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl - https://colab.research.google.com/github/mrm8488/shared_colab_notebooks/blob/master/Try_mrm8488_xquad_finetuned_model.ipynb - https://github.com/fxsjy/jieba - https://github.com/deepmind/xquad --- layout: model title: Legal Deposit Of Redemption Price Clause Binary Classifier author: John Snow Labs name: legclf_deposit_of_redemption_price_clause date: 2023-01-27 tags: [en, legal, classification, deposit, redemption, price, clauses, deposit_of_redemption_price, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `deposit-of-redemption-price` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. 
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `deposit-of-redemption-price`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_deposit_of_redemption_price_clause_en_1.0.0_3.0_1674820606738.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_deposit_of_redemption_price_clause_en_1.0.0_3.0_1674820606738.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_deposit_of_redemption_price_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
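As the description notes, several binary clause classifiers can be chained and their predictions read as a clause-to-True/False map. A minimal sketch of that aggregation step over hypothetical per-classifier outputs (the classifier names and labels below are illustrative):

```python
def clause_flags(predictions):
    # predictions maps classifier name -> predicted label for one document.
    # A clause is considered present when the prediction is not "other".
    return {name: label.lower() != "other" for name, label in predictions.items()}

# Hypothetical outputs from two binary clause classifiers on one document.
preds = {
    "deposit-of-redemption-price": "deposit-of-redemption-price",
    "jurisdictions": "other",
}
print(clause_flags(preds))
```

In a real Spark pipeline each classifier would write to its own output column, and this reduction would run over the collected rows.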
## Results ```bash +-------+ |result| +-------+ |[deposit-of-redemption-price]| |[other]| |[other]| |[deposit-of-redemption-price]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_deposit_of_redemption_price_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support deposit-of-redemption-price 0.96 1.00 0.98 22 other 1.00 0.97 0.99 38 accuracy - - 0.98 60 macro-avg 0.98 0.99 0.98 60 weighted-avg 0.98 0.98 0.98 60 ``` --- layout: model title: English BertForMaskedLM Base Cased model (from ayansinha) author: John Snow Labs name: bert_embeddings_lic_class_scancode_base_cased_l32_1 date: 2022-12-06 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `lic-class-scancode-bert-base-cased-L32-1` is an English model originally trained by `ayansinha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_lic_class_scancode_base_cased_l32_1_en_4.2.4_3.0_1670326834348.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_lic_class_scancode_base_cased_l32_1_en_4.2.4_3.0_1670326834348.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_lic_class_scancode_base_cased_l32_1","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_lic_class_scancode_base_cased_l32_1","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_lic_class_scancode_base_cased_l32_1| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|406.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/ayansinha/lic-class-scancode-bert-base-cased-L32-1 - https://github.com/nexB/scancode-results-analyzer - https://github.com/nexB/scancode-results-analyzer#quickstart---local-machine - https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py --- layout: model title: English BertForQuestionAnswering model (from krinal214) author: John Snow Labs name: bert_qa_mBERT_all_ty_SQen_SQ20_1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mBERT_all_ty_SQen_SQ20_1` is an English model originally trained by `krinal214`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mBERT_all_ty_SQen_SQ20_1_en_4.0.0_3.0_1654188214197.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mBERT_all_ty_SQen_SQ20_1_en_4.0.0_3.0_1654188214197.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mBERT_all_ty_SQen_SQ20_1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_mBERT_all_ty_SQen_SQ20_1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.multi_lingual_bert").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_mBERT_all_ty_SQen_SQ20_1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/krinal214/mBERT_all_ty_SQen_SQ20_1 --- layout: model title: English RobertaForQuestionAnswering Cased model (from skandaonsolve) author: John Snow Labs name: roberta_qa_finetuned_timeentities2 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-finetuned-timeentities2` is an English model originally trained by `skandaonsolve`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_timeentities2_en_4.3.0_3.0_1674220671794.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_timeentities2_en_4.3.0_3.0_1674220671794.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_timeentities2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_timeentities2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_finetuned_timeentities2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|465.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/skandaonsolve/roberta-finetuned-timeentities2 --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from Evelyn18) author: John Snow Labs name: roberta_qa_base_spanish_squades_becasincentivos1 date: 2023-01-20 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-squades-becasIncentivos1` is a Spanish model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becasincentivos1_es_4.3.0_3.0_1674217969589.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becasincentivos1_es_4.3.0_3.0_1674217969589.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becasincentivos1","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becasincentivos1","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_spanish_squades_becasincentivos1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|es| |Size:|459.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Evelyn18/roberta-base-spanish-squades-becasIncentivos1 --- layout: model title: Detect Organism in Medical Texts author: John Snow Labs name: bert_token_classifier_ner_linnaeus_species date: 2022-07-25 tags: [en, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. The model detects species entities in biomedical text. ## Predicted Entities `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_linnaeus_species_en_4.0.0_3.0_1658755473753.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_linnaeus_species_en_4.0.0_3.0_1658755473753.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_linnaeus_species", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, ner_model, ner_converter ]) data = spark.createDataFrame([["""First identified in chicken, vigilin homologues have now been found in human (6), Xenopus laevis (7), Drosophila melanogaster (8) and Schizosaccharomyces pombe."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_linnaeus_species", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") .setCaseSensitive(true) .setMaxSentenceLength(512) val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, ner_model, ner_converter)) val data = Seq("""First 
identified in chicken, vigilin homologues have now been found in human (6), Xenopus laevis (7), Drosophila melanogaster (8) and Schizosaccharomyces pombe.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.linnaeus_species").predict("""First identified in chicken, vigilin homologues have now been found in human (6), Xenopus laevis (7), Drosophila melanogaster (8) and Schizosaccharomyces pombe.""") ```
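As a sanity check on the Benchmarking figures reported further down this card, the micro-averaged precision, recall, and F1 can be re-derived from the per-label precision, recall, and support alone. This is plain arithmetic, independent of Spark NLP; the per-label numbers below are copied from the benchmarking table:

```python
# Per-label scores from the benchmarking table of this model card.
labels = {
    "B-SPECIES": {"precision": 0.6391, "recall": 0.9204, "support": 1433},
    "I-SPECIES": {"precision": 0.8297, "recall": 0.7071, "support": 799},
}

# Recover approximate true-positive and predicted counts per label:
#   recall    = TP / support    ->  TP        = recall * support
#   precision = TP / predicted  ->  predicted = TP / precision
tp = {k: v["recall"] * v["support"] for k, v in labels.items()}
predicted = {k: tp[k] / labels[k]["precision"] for k in labels}

# Micro-averaging pools counts across labels before dividing.
micro_recall = sum(tp.values()) / sum(v["support"] for v in labels.values())
micro_precision = sum(tp.values()) / sum(predicted.values())
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

print(round(micro_precision, 4), round(micro_recall, 4), round(micro_f1, 4))
```

The recomputed values agree with the reported micro-avg row (0.6863 / 0.8441 / 0.7571) up to the rounding already present in the table.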
## Results ```bash +-------------------------+-------+ |ner_chunk |label | +-------------------------+-------+ |chicken |SPECIES| |human |SPECIES| |Xenopus laevis |SPECIES| |Drosophila melanogaster |SPECIES| |Schizosaccharomyces pombe|SPECIES| +-------------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_linnaeus_species| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References [https://github.com/cambridgeltl/MTL-Bioinformatics-2016](https://github.com/cambridgeltl/MTL-Bioinformatics-2016) ## Benchmarking ```bash label precision recall f1-score support B-SPECIES 0.6391 0.9204 0.7544 1433 I-SPECIES 0.8297 0.7071 0.7635 799 micro-avg 0.6863 0.8441 0.7571 2232 macro-avg 0.7344 0.8138 0.7589 2232 weighted-avg 0.7073 0.8441 0.7576 2232 ``` --- layout: model title: Finnish asr_wav2vec2_xlsr_1b_finnish TFWav2Vec2ForCTC from aapot author: John Snow Labs name: pipeline_asr_wav2vec2_xlsr_1b_finnish date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_xlsr_1b_finnish` is a Finnish model originally trained by aapot. 
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xlsr_1b_finnish_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_1b_finnish_fi_4.2.0_3.0_1664018763154.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_1b_finnish_fi_4.2.0_3.0_1664018763154.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xlsr_1b_finnish', lang = 'fi') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xlsr_1b_finnish", lang = "fi") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xlsr_1b_finnish| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| |Size:|3.6 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Icelandic NER Pipeline author: John Snow Labs name: roberta_token_classifier_icelandic_ner_pipeline date: 2022-04-20 tags: [open_source, ner, token_classifier, roberta, icelandic, is] task: Named Entity Recognition language: is edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [roberta_token_classifier_icelandic_ner](https://nlp.johnsnowlabs.com/2021/12/06/roberta_token_classifier_icelandic_ner_is.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_icelandic_ner_pipeline_is_3.4.1_3.0_1650453946425.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_icelandic_ner_pipeline_is_3.4.1_3.0_1650453946425.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("roberta_token_classifier_icelandic_ner_pipeline", lang = "is") pipeline.annotate("Ég heiti Peter Fergusson. Ég hef búið í New York síðan í október 2011 og unnið hjá Tesla Motor og þénað 100K $ á ári.") ``` ```scala val pipeline = new PretrainedPipeline("roberta_token_classifier_icelandic_ner_pipeline", lang = "is") pipeline.annotate("Ég heiti Peter Fergusson. Ég hef búið í New York síðan í október 2011 og unnið hjá Tesla Motor og þénað 100K $ á ári.") ```
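The pipeline's NerConverter stage merges B-/I- token tags into the chunk/label pairs shown in the Results section. Conceptually, the BIO merging works as in this minimal pure-Python sketch (illustrative only — not the Spark NLP implementation):

```python
def bio_to_chunks(tokens, tags):
    """Merge BIO tags into (chunk, label) pairs, e.g. B-Person/I-Person -> Person."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # a new B- tag closes any open chunk
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)  # continuation of the open chunk
        else:  # "O" tag or an inconsistent I- tag closes the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Ég", "heiti", "Peter", "Fergusson", "."]
tags   = ["O",  "O",     "B-Person", "I-Person", "O"]
print(bio_to_chunks(tokens, tags))  # [('Peter Fergusson', 'Person')]
```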
## Results ```bash +----------------+------------+ |chunk |ner_label | +----------------+------------+ |Peter Fergusson |Person | |New York |Location | |október 2011 |Date | |Tesla Motor |Organization| |100K $ |Money | +----------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_token_classifier_icelandic_ner_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|is| |Size:|457.5 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - RoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: Clinical Deidentification (English, Glove, Augmented) author: John Snow Labs name: clinical_deidentification_glove_augmented date: 2022-09-16 tags: [en, deid, deidentification, licensed, clinical, glove, pipeline] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline is trained with lightweight `glove_100d` embeddings and can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`, `CITY`, `COUNTRY`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `STREET`, `USERNAME`, `ZIP`, `ACCOUNT`, `LICENSE`, `VIN`, `SSN`, `DLN`, `PLATE`, `IPADDR` entities. It differs from `clinical_deidentification_glove` in how it handles PHONE and PATIENT entities: in addition to the NER models, it applies rules from Contextual Parser components. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_glove_augmented_en_4.1.0_3.2_1663311659491.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_glove_augmented_en_4.1.0_3.2_1663311659491.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification_glove_augmented", "en", "clinical/models") deid_pipeline.annotate("""Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN: 324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.""") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = PretrainedPipeline("clinical_deidentification_glove_augmented", "en", "clinical/models") val result = pipeline.annotate("""Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN: 324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.""") ``` {:.nlu-block} ```python import nlu nlu.load("en.deid.glove_augmented.pipeline").predict("""Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN: 324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.""") ```
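The Results section shows the same sentences masked in several styles side by side. The three mask styles can be sketched in plain Python as below. This is an illustrative simplification only: the actual DeIdentificationModel works on annotated chunks with character offsets and also supports obfuscation with faker lists, and the policy names in the comments are taken from this card's output keys:

```python
def mask_entity(text, chunk, policy="entity_labels", label="NAME"):
    """Replace one PHI chunk according to a masking policy (simplified sketch)."""
    if policy == "entity_labels":          # 'masked' output style
        replacement = f"<{label}>"
    elif policy == "fixed_length_chars":   # 'masked_fixed_length_chars' style
        replacement = "****"
    elif policy == "same_length_chars":    # 'masked_with_chars' style: keep length
        replacement = "[" + "*" * max(len(chunk) - 2, 0) + "]"
    else:
        raise ValueError(policy)
    return text.replace(chunk, replacement)

s = "Record date : 2093-01-13, David Hale, M.D."
print(mask_entity(s, "David Hale", "same_length_chars"))
# Record date : 2093-01-13, [********], M.D.
```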
## Results ```bash {'masked': ['Record date : , , M.D.', 'IP: .', "The driver's license no: .", 'The SSN: and e-mail: .', 'Name : MR. # Date : .', 'PCP : , years old.', 'Record date : , : .'], 'masked_fixed_length_chars': ['Record date : ****, ****, M.D.', 'IP: ****.', "The driver's license no: ****.", 'The SSN: **** and e-mail: ****.', 'Name : **** MR. # **** Date : ****.', 'PCP : ****, **** years old.', 'Record date : ****, **** : ****.'], 'masked_with_chars': ['Record date : [********], [********], M.D.', 'IP: [************].', "The driver's license no: [******].", 'The SSN: [*******] and e-mail: [************].', 'Name : [**************] MR. # [****] Date : [******].', 'PCP : [******], ** years old.', 'Record date : [********], [***********] : [***************].'], 'ner_chunk': ['2093-01-13', 'David Hale', 'A334455B', '324598674', 'hale@gmail.com', 'Hendrickson, Ora', '719435', '01/13/93', 'Oliveira', '25', '2079-11-09', "Patient's VIN", '1HGBH41JXMN109286'], 'obfuscated': ['Record date : 2093-01-23, Dr Marshia Curling, M.D.', 'IP: 004.004.004.004.', "The driver's license no: 123XX123.", 'The SSN: SSN-089-89-9294 and e-mail: Mikey@hotmail.com.', 'Name : Stephania Chang MR. # E5881795 Date : 02-14-1983.', 'PCP : Dr Lovella Israel, 52 years old.', 'Record date : 2079-11-14, Dr Colie Carne : 3CCCC22DDDD333888.'], 'sentence': ['Record date : 2093-01-13, David Hale, M.D.', 'IP: 203.120.223.13.', "The driver's license no: A334455B.", 'The SSN: 324598674 and e-mail: hale@gmail.com.', 'Name : Hendrickson, Ora MR. 
# 719435 Date : 01/13/93.', 'PCP : Oliveira, 25 years old.', "Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286."]} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification_glove_augmented| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|181.3 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ChunkMergeModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: Multilingual XLMRobertaForTokenClassification Base Cased model (from Davlan) author: John Snow Labs name: xlmroberta_ner_base_sadilar date: 2022-08-01 tags: [xx, open_source, xlm_roberta, ner] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-sadilar-ner` is a Multilingual model originally trained by `Davlan`. 
## Predicted Entities `DATE`, `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_sadilar_xx_4.1.0_3.0_1659356675311.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_sadilar_xx_4.1.0_3.0_1659356675311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_sadilar","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_sadilar","xx") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_sadilar| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|806.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Davlan/xlm-roberta-base-sadilar-ner - https://www.sadilar.org/index.php/en/ --- layout: model title: Company Name to IRS (Edgar database) author: John Snow Labs name: legel_edgar_irs date: 2022-08-30 tags: [en, legal, companies, edgar, licensed] task: Entity Resolution language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an Entity Linking / Entity Resolution model, which allows you to retrieve the IRS number of a company given its name, using SEC Edgar database. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/ER_EDGAR_CRUNCHBASE/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legel_edgar_irs_en_1.0.0_3.2_1661866500067.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legel_edgar_irs_en_1.0.0_3.2_1661866500067.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \ .setInputCols("ner_chunk") \ .setOutputCol("sentence_embeddings") resolver = legal.SentenceEntityResolverModel.pretrained("legel_edgar_irs", "en", "legal/models")\ .setInputCols(["ner_chunk", "sentence_embeddings"]) \ .setOutputCol("irs_code")\ .setDistanceFunction("EUCLIDEAN") pipeline = nlp.Pipeline(stages=[documentAssembler, embeddings, resolver]) pipelineModel = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) lp = nlp.LightPipeline(pipelineModel) lp.fullAnnotate("CONTACT GOLD") ```
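Conceptually, the resolver embeds the input chunk and returns the IRS codes whose stored name embeddings are closest under the configured distance function (here `EUCLIDEAN`), which is what produces the ranked `all_codes` / `all_distances` columns in the Results section. A toy sketch of that nearest-neighbor lookup, using made-up 3-d vectors and a hypothetical index (the real model uses 512-d Universal Sentence Encoder embeddings over the full Edgar index):

```python
import math

# Hypothetical index: IRS code -> embedding of the company name (toy 3-d vectors).
index = {
    "981369960": [0.9, 0.1, 0.0],
    "271989147": [0.6, 0.4, 0.1],
    "208531222": [0.2, 0.8, 0.3],
}

def resolve(query_vec, index, k=2):
    """Return the k codes nearest to the query embedding by Euclidean distance."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ranked = sorted(index.items(), key=lambda kv: dist(query_vec, kv[1]))
    return [(code, round(dist(query_vec, vec), 4)) for code, vec in ranked[:k]]

# Pretend this is the embedding of "CONTACT GOLD".
print(resolve([0.85, 0.15, 0.0], index))
```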
## Results ```bash +--------------+-----------+---------------------------------------------------------+--------------------------------------------------------+-------------------------------------------+ | chunk| code | all_codes| resolutions | all_distances| +--------------+-----------+---------------------------------------------------------+--------------------------------------------------------+-------------------------------------------+ | CONTACT GOLD | 981369960| [981369960, 271989147, 208531222, 273566922, 270348508] |[981369960, 271989147, 208531222, 273566922, 270348508] | [0.1733, 0.3700, 0.3867, 0.4103, 0.4121] | +--------------+-----------+---------------------------------------------------------+--------------------------------------------------------+-------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legel_edgar_irs| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[company_irs_number]| |Language:|en| |Size:|313.8 MB| |Case sensitive:|false| ## References In-house scraping and postprocessing of the SEC Edgar database --- layout: model title: Translate Yapese to English Pipeline author: John Snow Labs name: translate_yap_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, yap, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `yap` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_yap_en_xx_2.7.0_2.4_1609686209390.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_yap_en_xx_2.7.0_2.4_1609686209390.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_yap_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_yap_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.yap.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_yap_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English image_classifier_vit_vliegmachine ViTForImageClassification from johnnydevriese author: John Snow Labs name: image_classifier_vit_vliegmachine date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_vliegmachine` is an English model originally trained by johnnydevriese. ## Predicted Entities `f117`, `f16`, `f18` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vliegmachine_en_4.1.0_3.0_1660166009742.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vliegmachine_en_4.1.0_3.0_1660166009742.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_vliegmachine", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_vliegmachine", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_vliegmachine| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English RobertaForMaskedLM Large Cased model author: John Snow Labs name: roberta_embeddings_large date: 2022-12-12 tags: [en, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large` is an English model originally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_large_en_4.2.4_3.0_1670859597088.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_large_en_4.2.4_3.0_1670859597088.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_large","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_large","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_large| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|847.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/roberta-large - https://arxiv.org/abs/1907.11692 - https://github.com/pytorch/fairseq/tree/master/examples/roberta - https://yknzhu.wixsite.com/mbweb - https://en.wikipedia.org/wiki/English_Wikipedia - https://commoncrawl.org/2016/10/news-dataset-available/ - https://github.com/jcpeterson/openwebtext - https://arxiv.org/abs/1806.02847 --- layout: model title: Pipeline to Detect Anatomical References (biobert) author: John Snow Labs name: ner_anatomy_biobert_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_anatomy_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_anatomy_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_biobert_pipeline_en_3.4.1_3.0_1647873806641.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_biobert_pipeline_en_3.4.1_3.0_1647873806641.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_anatomy_biobert_pipeline", "en", "clinical/models") pipeline.annotate("This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.") ``` ```scala val pipeline = new PretrainedPipeline("ner_anatomy_biobert_pipeline", "en", "clinical/models") pipeline.annotate("This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. 
TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.anatomy_biobert.pipeline").predict("""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.""") ```
## Results ```bash +-------------------+----------------------+ |chunk |ner_label | +-------------------+----------------------+ |right |Organism_subdivision | |great |Organism_subdivision | |toe |Organism_subdivision | |skin |Organ | |Sclerae |Pathological_formation| |Extraocular muscles|Multi-tissue_structure| |Nares |Organ | |turbinates |Multi-tissue_structure| |Mucous membranes |Cell | |Abdomen |Organism_subdivision | |bowel |Organism_subdivision | |right |Organism_subdivision | |toe |Organism_subdivision | |skin |Organ | |toenails |Organism_subdivision | |foot |Organism_subdivision | |toe |Organism_subdivision | |toenails |Organism_subdivision | +-------------------+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_anatomy_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.1 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter --- layout: model title: Translate English to Slavic languages Pipeline author: John Snow Labs name: translate_en_sla date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, sla, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `sla` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_sla_xx_2.7.0_2.4_1609687363449.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_sla_xx_2.7.0_2.4_1609687363449.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_sla", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_sla", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.sla').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_sla| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Environmental matters Clause Binary Classifier author: John Snow Labs name: legclf_environmental_matters_clause date: 2022-09-28 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `environmental-matters` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If your input is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
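The paragraph-splitting option listed above (splitting by multiline) can be approximated outside Spark NLP with a few lines of plain Python. This is only an illustrative sketch, not the workshop's implementation; the sample contract text is invented:

```python
import re

def split_paragraphs(text):
    # Split on one or more blank lines, dropping empty fragments.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

contract = (
    "1. DEFINITIONS\nTerms used herein have the meanings below.\n\n"
    "2. ENVIRONMENTAL MATTERS\nThe Seller complies with all environmental laws."
)
paragraphs = split_paragraphs(contract)
print(len(paragraphs))  # 2
```

Each resulting paragraph can then be fed to the classifier as a separate row of the `clause_text` column.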
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `environmental-matters` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_environmental_matters_clause_en_1.0.0_3.0_1664363148554.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_environmental_matters_clause_en_1.0.0_3.0_1664363148554.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_environmental_matters_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[environmental-matters]| |[other]| |[other]| |[environmental-matters]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_environmental_matters_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support environmental-matters 0.95 0.86 0.90 21 other 0.94 0.98 0.96 48 accuracy - - 0.94 69 macro-avg 0.94 0.92 0.93 69 weighted-avg 0.94 0.94 0.94 69 ``` --- layout: model title: ALBERT Embeddings (Base Uncased) author: John Snow Labs name: albert_base_uncased date: 2020-04-28 task: Embeddings language: en nav_key: models edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: AlBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description ALBERT is "A Lite" version of BERT, a popular unsupervised language representation learning algorithm. ALBERT uses parameter-reduction techniques that allow for large-scale configurations, overcome previous memory limitations, and achieve better behavior with respect to model degradation. The details are described in the paper "[ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.](https://arxiv.org/abs/1909.11942)" {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_base_uncased_en_2.5.0_2.4_1588073363475.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_base_uncased_en_2.5.0_2.4_1588073363475.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = AlbertEmbeddings.pretrained("albert_base_uncased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = AlbertEmbeddings.pretrained("albert_base_uncased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.albert.base_uncased').predict(text, output_level='token') embeddings_df ```
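Each token annotation produced by `AlbertEmbeddings` carries a 768-dimensional float vector, and a common follow-up step is comparing tokens by cosine similarity. A minimal pure-Python sketch (the toy 3-dimensional vectors are invented stand-ins for the real 768-dimensional ones):

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

vec_love = [0.35, -1.19, 0.42]  # toy stand-ins for ALBERT token vectors
vec_like = [0.30, -1.10, 0.40]
print(cosine(vec_love, vec_like))  # close to 1.0 for near-identical directions
```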
{:.h2_title} ## Results ```bash token en_embed_albert_base_uncased_embeddings I [1.0153148174285889, 0.5481745600700378, -0.44... love [0.3452114760875702, -1.191628336906433, 0.423... NLP [-0.4268064796924591, -0.3819553852081299, 0.8... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_base_uncased| |Type:|embeddings| |Compatibility:|Spark NLP 2.5.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|en| |Dimension:|768| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/google/albert_base/3](https://tfhub.dev/google/albert_base/3) --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578 TFWav2Vec2ForCTC from doddle124578 author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578` is an English model originally trained by doddle124578.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use `asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578_en_4.2.0_3.0_1664037269433.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578_en_4.2.0_3.0_1664037269433.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
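The snippets above assume an existing `audioDf` whose `audio_content` column holds arrays of floats. How that column is produced is up to you; one hypothetical way, using only the standard library and assuming 16-bit mono PCM WAV input, is to decode the samples and normalize them to the [-1, 1] range:

```python
import io
import struct
import wave

def wav_bytes_to_floats(wav_bytes):
    # Decode 16-bit PCM frames and normalize to [-1, 1].
    with wave.open(io.BytesIO(wav_bytes)) as wf:
        raw = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [s / 32768.0 for s in samples]

# Build a tiny in-memory WAV (4 samples) just to demonstrate the round trip.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))

floats = wav_bytes_to_floats(buf.getvalue())
print(floats)
```

A list like `floats` could then back a one-row DataFrame, e.g. `spark.createDataFrame([(floats,)], ["audio_content"])`, matching the `AudioAssembler` input column above.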
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: Detect Anatomical and Observation Entities in Chest Radiology Reports (CheXpert) author: John Snow Labs name: ner_chexpert date: 2021-09-30 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts `Anatomical` and `Observation` entities from Chest Radiology Reports. ## Predicted Entities `ANAT - Anatomy`, `OBS - Observation` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_RADIOLOGY/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chexpert_en_3.3.0_3.0_1633010671460.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chexpert_en_3.3.0_3.0_1633010671460.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_chexpert", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["FINAL REPORT HISTORY : Chest tube leak , to assess for pneumothorax. FINDINGS : In comparison with study of ___ , the endotracheal tube and Swan - Ganz catheter have been removed . The left chest tube remains in place and there is no evidence of pneumothorax. 
Mild atelectatic changes are seen at the left base."]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_chexpert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("""FINAL REPORT HISTORY : Chest tube leak , to assess for pneumothorax. FINDINGS : In comparison with study of ___ , the endotracheal tube and Swan - Ganz catheter have been removed . The left chest tube remains in place and there is no evidence of pneumothorax. Mild atelectatic changes are seen at the left base.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.chexpert").predict("""FINAL REPORT HISTORY : Chest tube leak , to assess for pneumothorax. FINDINGS : In comparison with study of ___ , the endotracheal tube and Swan - Ganz catheter have been removed . The left chest tube remains in place and there is no evidence of pneumothorax. Mild atelectatic changes are seen at the left base.""") ```
## Results ```bash | | chunk | label | |---:|:-------------------------|:--------| | 0 | endotracheal tube | OBS | | 1 | Swan - Ganz catheter | OBS | | 2 | left chest | ANAT | | 3 | tube | OBS | | 4 | in place | OBS | | 5 | pneumothorax | OBS | | 6 | Mild atelectatic changes | OBS | | 7 | left base | ANAT | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_chexpert| |Compatibility:|Healthcare NLP 3.3.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on the CheXpert dataset explained in https://arxiv.org/pdf/2106.14463.pdf. ## Benchmarking ```bash label tp fp fn prec rec f1 I-ANAT_DP 26 11 11 0.7027027 0.7027027 0.7027027 B-OBS_DP 1489 141 104 0.9134969 0.9347144 0.9239839 I-OBS_DP 16 3 54 0.84210527 0.22857143 0.35955057 B-ANAT_DP 1125 39 45 0.96649486 0.96153843 0.96401024 Macro-average 2656 194 214 0.8561999 0.70688176 0.7744088 Micro-average 2656 194 214 0.9319298 0.92543554 0.9286713 ``` --- layout: model title: Legal Waivers Clause Binary Classifier author: John Snow Labs name: legclf_waivers_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `waivers` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If your input is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `waivers` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_waivers_clause_en_1.0.0_3.2_1660124127062.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_waivers_clause_en_1.0.0_3.2_1660124127062.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_waivers_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
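As the description notes, several binary clause classifiers can be run side by side, yielding one True/False per clause type. A minimal, Spark-free sketch of folding their outputs into a single map; the clause types and labels here are purely illustrative:

```python
def combine_clause_flags(predictions):
    # Map each clause type to True when its binary classifier fired,
    # i.e. returned anything other than the catch-all "other" label.
    return {clause: label != "other" for clause, label in predictions.items()}

# Hypothetical per-classifier outputs for one paragraph:
flags = combine_clause_flags({
    "waivers": "waivers",
    "environmental-matters": "other",
    "indemnification": "other",
})
print(flags["waivers"])  # True
```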
## Results ```bash +-------+ | result| +-------+ |[waivers]| |[other]| |[other]| |[waivers]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_waivers_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.96 0.98 0.97 324 waivers 0.96 0.90 0.93 128 accuracy - - 0.96 452 macro-avg 0.96 0.94 0.95 452 weighted-avg 0.96 0.96 0.96 452 ``` --- layout: model title: Detect PHI for Deidentification (Sub Entity) author: John Snow Labs name: ner_deid_subentity date: 2022-01-06 tags: [de, deid, ner, licensed] task: Named Entity Recognition language: de edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity recognition annotator allows for a generic model to be trained by utilizing a deep learning algorithm (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER is a Named Entity Recognition model that annotates German text to find protected health information (PHI) that may need to be deidentified. It was trained with in-house annotations and detects 12 entities.
## Predicted Entities `PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `STREET`, `USERNAME`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_DE){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_de_3.3.4_2.4_1641460993460.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_de_3.3.4_2.4_1641460993460.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","de","clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") deid_ner = MedicalNerModel.pretrained("ner_deid_subentity", "de", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_deid_subentity_chunk") nlpPipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, word_embeddings, deid_ner, ner_converter]) data = spark.createDataFrame([["""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. 
Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen."""]]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val deid_ner = MedicalNerModel.pretrained("ner_deid_subentity", "de", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_deid_subentity_chunk") val nlpPipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, deid_ner, ner_converter)) val data = Seq("""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""").toDS.toDF("text") val result = nlpPipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.med_ner.deid_subentity").predict("""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""") ```
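A typical next step after detecting PHI chunks is masking them in the source text. The following is only a naive string-replacement sketch for illustration, not the library's deidentification mechanism:

```python
def mask_phi(text, chunks):
    # Replace each detected chunk with its entity label; process longer
    # chunks first so substrings (e.g. "Berger" inside "Michael Berger")
    # don't clobber a longer match.
    for chunk, label in sorted(chunks, key=lambda c: -len(c[0])):
        text = text.replace(chunk, "<" + label + ">")
    return text

sentence = "Michael Berger wird am 12 Dezember 2018 eingeliefert."
masked = mask_phi(sentence, [("Michael Berger", "PATIENT"),
                             ("12 Dezember 2018", "DATE")])
print(masked)  # <PATIENT> wird am <DATE> eingeliefert.
```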
## Results ```bash +-------------------------+-------------------------+ |chunk |ner_deid_subentity_chunk | +-------------------------+-------------------------+ |Michael Berger |PATIENT | |12 Dezember 2018 |DATE | |St. Elisabeth-Krankenhaus|HOSPITAL | |Bad Kissingen |CITY | |Berger |PATIENT | |76 |AGE | +-------------------------+-------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|de| |Size:|15.0 MB| ## Data Source In-house annotated dataset ## Benchmarking ```bash label tp fp fn total precision recall f1 PATIENT 2080.0 58.0 74.0 2154.0 0.9729 0.9656 0.9692 HOSPITAL 1598.0 4.0 0.0 1598.0 0.9975 1.0 0.9988 DATE 4047.0 7.0 2.0 4049.0 0.9983 0.9995 0.9989 ORGANIZATION 1288.0 108.0 67.0 1355.0 0.9226 0.9506 0.9364 CITY 196.0 1.0 4.0 200.0 0.9949 0.98 0.9874 STREET 124.0 1.0 4.0 128.0 0.992 0.9688 0.9802 USERNAME 45.0 0.0 0.0 45.0 1.0 1.0 1.0 PROFESSION 262.0 1.0 0.0 262.0 0.9962 1.0 0.9981 PHONE 71.0 10.0 9.0 80.0 0.8765 0.8875 0.882 COUNTRY 306.0 5.0 6.0 312.0 0.9839 0.9808 0.9823 DOCTOR 1414.0 9.0 39.0 1453.0 0.9937 0.9732 0.9833 AGE 473.0 3.0 3.0 476.0 0.9937 0.9937 0.9937 ``` --- layout: model title: English BertForMaskedLM Base Uncased model author: John Snow Labs name: bert_embeddings_base_uncased date: 2022-12-02 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased` is an English model originally trained by HuggingFace.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_uncased_en_4.2.4_3.0_1670019190911.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_uncased_en_4.2.4_3.0_1670019190911.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_uncased","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_uncased","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_uncased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|409.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/bert-base-uncased - https://arxiv.org/abs/1810.04805 - https://github.com/google-research/bert - https://github.com/google-research/bert/blob/master/README.md - https://yknzhu.wixsite.com/mbweb - https://en.wikipedia.org/wiki/English_Wikipedia --- layout: model title: Fast Neural Machine Translation Model from English to Berber author: John Snow Labs name: opus_mt_en_ber date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, ber, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `en` - target languages: `ber` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ber_xx_2.7.0_2.4_1609169805124.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ber_xx_2.7.0_2.4_1609169805124.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_ber", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_ber", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.ber').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_ber| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Indemnification Clause Binary Classifier (md) author: John Snow Labs name: legclf_indemnification_md date: 2022-11-25 tags: [en, legal, classification, document, agreement, contract, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `indemnification` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If your input is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
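The 512-token limit mentioned above can be enforced with a simple greedy split. This whitespace-based sketch only approximates the model's real subword tokenizer, so leave some headroom in practice:

```python
def chunk_by_tokens(text, max_tokens=512):
    # Greedily pack whitespace tokens into pieces of at most max_tokens.
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

long_clause = " ".join("token%d" % i for i in range(1200))
chunks = chunk_by_tokens(long_clause)
print([len(c.split()) for c in chunks])  # [512, 512, 176]
```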
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `indemnification` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_indemnification_md_en_1.0.0_3.0_1669376491781.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_indemnification_md_en_1.0.0_3.0_1669376491781.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_indemnification_md", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
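When several of these binary clause classifiers are chained in one pipeline, each contributes its own predicted label. A purely illustrative helper (the function and the result dictionary are hypothetical, not part of Spark NLP) for collapsing those labels into the True/False flags described above:

```python
def clause_flags(predictions):
    """Map each clause classifier's predicted label to a True/False flag.

    `predictions` maps a model name to the label it returned; any label
    other than "other" means the clause was detected."""
    return {name: label != "other" for name, label in predictions.items()}

flags = clause_flags({
    "legclf_indemnification_md": "indemnification",
    "legclf_cancellation_clause": "other",
})
# → {"legclf_indemnification_md": True, "legclf_cancellation_clause": False}
```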
## Results ```bash +-----------------+ | result| +-----------------+ |[indemnification]| |[other]| |[other]| |[indemnification]| +-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_indemnification_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash precision recall f1-score support indemnification-and-contribution 0.84 0.93 0.88 28 other 0.94 0.87 0.91 39 accuracy 0.90 67 macro avg 0.89 0.90 0.89 67 weighted avg 0.90 0.90 0.90 67 ``` --- layout: model title: Recognize Entities OntoNotes pipeline - BERT Small author: John Snow Labs name: onto_recognize_entities_bert_small date: 2021-03-22 tags: [open_source, english, onto_recognize_entities_bert_small, pipeline, en] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The onto_recognize_entities_bert_small is a pretrained pipeline that performs basic text processing steps.
It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_small_en_3.0.0_3.0_1616443983762.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_small_en_3.0.0_3.0_1616443983762.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('onto_recognize_entities_bert_small', lang = 'en') annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("onto_recognize_entities_bert_small", lang = "en") val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs ! "] result_df = nlu.load('en.ner.onto.bert.small').predict(text) result_df ```
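In the pipeline's output, the `ner` column holds IOB tags and `entities` holds the merged chunks (see the Results section). For intuition, merging IOB tags back into entity strings can be sketched in plain Python (an illustrative helper, not part of the pipeline itself):

```python
def iob_to_entities(tokens, tags):
    """Merge IOB NER tags into entity strings (B- starts a chunk, I- extends it)."""
    entities, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(" ".join(current))
            current = [tok]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

ents = iob_to_entities(
    ["Hello", "from", "John", "Snow", "Labs", "!"],
    ["O", "O", "B-PERSON", "I-PERSON", "I-PERSON", "O"],
)
# → ["John Snow Labs"]
```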
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:-----------------------------|:----------------------------------------------------|:-------------------| | 0 | ['Hello from John Snow Labs ! '] | ['Hello from John Snow Labs !'] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | [[0.9379079937934875,.,...]] | ['O', 'O', 'B-PERSON', 'I-PERSON', 'I-PERSON', 'O'] | ['John Snow Labs'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_recognize_entities_bert_small| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Swty) author: John Snow Labs name: distilbert_qa_swty_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is a English model originally trained by `Swty`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_swty_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769390900.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_swty_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769390900.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_swty_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_swty_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_swty_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Swty/distilbert-base-uncased-finetuned-squad --- layout: model title: Pipeline to Mapping RxNORM Codes with Their Corresponding UMLS Codes author: John Snow Labs name: rxnorm_umls_mapping date: 2023-06-13 tags: [en, licensed, clinical, pipeline, chunk_mapping, rxnorm, umls] task: Chunk Mapping language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of `rxnorm_umls_mapper` model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_umls_mapping_en_4.4.4_3.2_1686663532119.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_umls_mapping_en_4.4.4_3.2_1686663532119.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("rxnorm_umls_mapping", "en", "clinical/models") result = pipeline.fullAnnotate(["1161611", "315677"]) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("rxnorm_umls_mapping", "en", "clinical/models") val result = pipeline.fullAnnotate(Array("1161611", "315677")) ``` {:.nlu-block} ```python import nlu nlu.load("en.rxnorm.umls.mapping").predict("""Put your text here.""") ```
## Results ```bash Results | | rxnorm_code | umls_code | |---:|:------------|:----------| | 0 | 1161611 | C3215948 | | 1 | 315677 | C0984912 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|rxnorm_umls_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.9 MB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel --- layout: model title: Legal Cancellation Clause Binary Classifier author: John Snow Labs name: legclf_cancellation_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `cancellation` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that this model's embeddings support up to 512 tokens; if your input is longer, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `cancellation` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cancellation_clause_en_1.0.0_3.2_1660122195560.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cancellation_clause_en_1.0.0_3.2_1660122195560.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_cancellation_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +--------------+ | result| +--------------+ |[cancellation]| |[other]| |[other]| |[cancellation]| +--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_cancellation_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support cancellation 0.92 0.92 0.92 38 other 0.97 0.97 0.97 96 accuracy - - 0.96 134 macro-avg 0.94 0.94 0.94 134 weighted-avg 0.96 0.96 0.96 134 ``` --- layout: model title: German asr_wav2vec2_large_xlsr_53_german_by_marcel TFWav2Vec2ForCTC from marcel author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_german_by_marcel date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_german_by_marcel` is a German model originally trained by marcel. NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_german_by_marcel_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_german_by_marcel_de_4.2.0_3.0_1664101959541.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_german_by_marcel_de_4.2.0_3.0_1664101959541.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_german_by_marcel', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_german_by_marcel", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_german_by_marcel| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Czech asr_wav2vec2_large_xlsr_53_Czech TFWav2Vec2ForCTC from MehdiHosseiniMoghadam author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_Czech date: 2022-09-25 tags: [wav2vec2, cs, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: cs edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_large_xlsr_53_Czech` is a Czech model originally trained by MehdiHosseiniMoghadam. NOTE: This pipeline only works on a CPU, if you need to use this pipeline on a GPU device please use pipeline_asr_wav2vec2_large_xlsr_53_Czech_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_Czech_cs_4.2.0_3.0_1664119968423.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_Czech_cs_4.2.0_3.0_1664119968423.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_Czech', lang = 'cs') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_Czech", lang = "cs") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_Czech| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|cs| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Hausa asr_Hausa_xlsr TFWav2Vec2ForCTC from Akashpb13 author: John Snow Labs name: asr_Hausa_xlsr date: 2022-09-26 tags: [wav2vec2, ha, audio, open_source, asr] task: Automatic Speech Recognition language: ha edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_Hausa_xlsr` is a Hausa model originally trained by Akashpb13. NOTE: This model only works on a CPU, if you need to use this model on a GPU device please use asr_Hausa_xlsr_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Hausa_xlsr_ha_4.2.0_3.0_1664192959924.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Hausa_xlsr_ha_4.2.0_3.0_1664192959924.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_Hausa_xlsr", "ha")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_Hausa_xlsr", "ha") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_Hausa_xlsr| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ha| |Size:|1.2 GB| --- layout: model title: Legal No waiver Clause Binary Classifier author: John Snow Labs name: legclf_no_waiver_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `no-waiver` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that this model's embeddings support up to 512 tokens; if your input is longer, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing a series of True/False values for each of the legal clause models you have added.
## Predicted Entities `other`, `no-waiver` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_no_waiver_clause_en_1.0.0_3.2_1660122721126.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_no_waiver_clause_en_1.0.0_3.2_1660122721126.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_no_waiver_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-----------+ | result| +-----------+ |[no-waiver]| |[other]| |[other]| |[no-waiver]| +-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_no_waiver_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support no-waiver 0.95 0.98 0.97 43 other 0.99 0.98 0.99 119 accuracy - - 0.98 162 macro-avg 0.97 0.98 0.98 162 weighted-avg 0.98 0.98 0.98 162 ``` --- layout: model title: Stopwords Remover for Armenian language (105 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, hy, open_source] task: Stop Words Removal language: hy edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_hy_3.4.1_3.0_1646672921601.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_hy_3.4.1_3.0_1646672921601.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","hy") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Դու ինձանից ավելի լավն ես"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stopWords = StopWordsCleaner.pretrained("stopwords_iso","hy") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stopWords)) val data = Seq("Դու ինձանից ավելի լավն ես").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("hy.stopwords").predict("""Դու ինձանից ավելի լավն ես""") ```
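As a quick sanity check on the output, the removed stopwords are simply the difference between the original tokens and `cleanTokens`; a plain-Python illustration using the example sentence above (the token lists are copied from the Results section):

```python
tokens = ["Դու", "ինձանից", "ավելի", "լավն", "ես"]
clean_tokens = ["ինձանից", "ավելի", "լավն"]

# tokens dropped by the stopwords cleaner, in original order
removed = [t for t in tokens if t not in clean_tokens]
# → ["Դու", "ես"]
```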
## Results ```bash +----------------------+ |result | +----------------------+ |[ինձանից, ավելի, լավն]| +----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|hy| |Size:|1.8 KB| --- layout: model title: Fast Neural Machine Translation Model from Central Bikol to Spanish author: John Snow Labs name: opus_mt_bcl_es date: 2021-06-01 tags: [open_source, seq2seq, translation, bcl, es, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). source languages: bcl target languages: es {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_es_xx_3.1.0_2.4_1622561503404.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_es_xx_3.1.0_2.4_1622561503404.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_bcl_es", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_bcl_es", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Central Bikol.translate_to.Spanish').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_bcl_es| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English Deberta Embeddings model (from domenicrosati) author: John Snow Labs name: deberta_embeddings_v3_large_dapt_scientific_papers_pubmed_tapt date: 2023-03-12 tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, en, tensorflow] task: Embeddings language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DeBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deberta-v3-large-dapt-scientific-papers-pubmed-tapt` is a English model originally trained by `domenicrosati`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_v3_large_dapt_scientific_papers_pubmed_tapt_en_4.3.1_3.0_1678658548832.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_v3_large_dapt_scientific_papers_pubmed_tapt_en_4.3.1_3.0_1678658548832.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_v3_large_dapt_scientific_papers_pubmed_tapt","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_v3_large_dapt_scientific_papers_pubmed_tapt","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
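Downstream tasks typically compare the vectors in the `embeddings` column by cosine similarity; a minimal helper in plain Python (independent of Spark NLP, shown only to illustrate how such embedding vectors are usually compared):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])  # ≈ 1.0 (same direction)
```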
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|deberta_embeddings_v3_large_dapt_scientific_papers_pubmed_tapt| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|1.6 GB| |Case sensitive:|false| ## References https://huggingface.co/domenicrosati/deberta-v3-large-dapt-scientific-papers-pubmed-tapt --- layout: model title: Javanese DistilBERT Embeddings (Small, Wikipedia) author: John Snow Labs name: distilbert_embeddings_javanese_distilbert_small date: 2022-04-12 tags: [distilbert, embeddings, jv, open_source] task: Embeddings language: jv edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `javanese-distilbert-small` is a Javanese model orginally trained by `w11wo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_javanese_distilbert_small_jv_3.4.2_3.0_1649783759354.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_javanese_distilbert_small_jv_3.4.2_3.0_1649783759354.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_javanese_distilbert_small","jv") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_javanese_distilbert_small","jv") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("jv.embed.distilbert").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_javanese_distilbert_small| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|jv| |Size:|248.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/w11wo/javanese-distilbert-small - https://arxiv.org/abs/1910.01108 - https://github.com/sgugger - https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb - https://w11wo.github.io/ --- layout: model title: Legal Vacations Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_vacations_bert date: 2023-03-05 tags: [en, legal, classification, clauses, vacations, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset is aimed at contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Vacations` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. 
If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Vacations`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_vacations_bert_en_1.0.0_3.0_1678050711001.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_vacations_bert_en_1.0.0_3.0_1678050711001.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_vacations_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
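The description above recommends splitting long documents into paragraph-sized clauses before classification. A minimal pure-Python sketch of the "paragraph splitting (by multiline)" technique is shown below; this regex-based approach is an illustration only (the Workshop tutorial linked above covers more robust splitting strategies), and `split_paragraphs` is a hypothetical helper, not a Spark NLP API.

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into candidate clause paragraphs on blank lines.

    Each returned chunk can then be fed to the clause classifier as a
    separate row of the "text" column.
    """
    # Two or more consecutive newlines mark a paragraph boundary.
    paragraphs = re.split(r"\n{2,}", text)
    # Drop empty chunks and surrounding whitespace.
    return [p.strip() for p in paragraphs if p.strip()]

contract = (
    "1. VACATIONS.\nEmployee shall be entitled to vacation days.\n\n"
    "2. NOTICES.\nAll notices shall be in writing."
)
clauses = split_paragraphs(contract)
```

Each element of `clauses` can then be turned into a row of the input DataFrame used in the pipeline above.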
## Results ```bash +-----------+ |result | +-----------+ |[Vacations]| |[Other]| |[Other]| |[Vacations]| +-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_vacations_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 1.00 0.98 0.99 56 Vacations 0.97 1.00 0.99 36 accuracy - - 0.99 92 macro-avg 0.99 0.99 0.99 92 weighted-avg 0.99 0.99 0.99 92 ``` --- layout: model title: T5 text-to-text model author: John Snow Labs name: t5_small date: 2020-12-21 task: [Question Answering, Summarization, Translation] language: en nav_key: models edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, t5, en] supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The T5 transformer model described in the seminal paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". This model can perform a variety of tasks, such as text summarization, question answering and translation. More details about using the model can be found in the paper (https://arxiv.org/pdf/1910.10683.pdf). 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/T5TRANSFORMER/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/T5TRANSFORMER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_en_2.7.0_2.4_1608554292913.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_en_2.7.0_2.4_1608554292913.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") t5 = T5Transformer() \ .pretrained("t5_small") \ .setTask("summarize:")\ .setMaxOutputLength(200)\ .setInputCols(["documents"]) \ .setOutputCol("summaries") pipeline = Pipeline().setStages([document_assembler, t5]) results = pipeline.fit(data_df).transform(data_df) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("documents") val t5 = T5Transformer .pretrained("t5_small") .setTask("summarize:") .setInputCols(Array("documents")) .setOutputCol("summaries") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val result = pipeline.fit(dataDf).transform(dataDf) ``` {:.nlu-block} ```python import nlu nlu.load("en.t5.small").predict("""Put your text here.""") ```
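T5 casts every problem as text-to-text: the task is selected purely by a textual prefix prepended to the input, which is what `setTask("summarize:")` configures above. The pure-Python sketch below illustrates the prefix convention with a few prefixes from the original paper; `with_task` is a hypothetical helper for illustration, and the exact prefix strings a given checkpoint accepts depend on how it was trained.

```python
def with_task(task: str, text: str) -> str:
    """Prepend a T5 task prefix to an input string."""
    return task + " " + text.strip()

# A single model handles different tasks, switched only by the prefix.
examples = [
    with_task("summarize:", "Long article text ..."),
    with_task("translate English to German:", "Good morning."),
]
```

Changing the argument to `setTask` in the pipeline above is the Spark NLP equivalent of choosing a different prefix here.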
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|en| ## Data Source C4 --- layout: model title: German BertForMaskedLM Base Uncased model (from dbmdz) author: John Snow Labs name: bert_embeddings_base_german_uncased date: 2022-12-02 tags: [de, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: de edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-german-uncased` is a German model originally trained by `dbmdz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_german_uncased_de_4.2.4_3.0_1670017726606.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_german_uncased_de_4.2.4_3.0_1670017726606.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_german_uncased","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_german_uncased","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_german_uncased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|412.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/dbmdz/bert-base-german-uncased - https://deepset.ai/german-bert - https://deepset.ai/ - https://spacy.io/ - https://github.com/allenai/scibert - https://github.com/stefan-it/fine-tuned-berts-seq - https://github.com/dbmdz/berts/issues/new --- layout: model title: Detect Cellular/Molecular Biology Entities (clinical_large) author: John Snow Labs name: ner_cellular_emb_clinical_large date: 2023-05-24 tags: [ner, en, licensed, clinical, dna, rna, protein, cell_line, cell_type] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for molecular biology related terms. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. 
## Predicted Entities `DNA`, `Cell_type`, `Cell_line`, `RNA`, `Protein` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CELLULAR/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_CELLULAR.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_cellular_emb_clinical_large_en_4.4.2_3.0_1684920548062.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_cellular_emb_clinical_large_en_4.4.2_3.0_1684920548062.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_cellular_emb_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(['sentence', 'token', 'ner'])\ .setOutputCol('ner_chunk') pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter ]) sample_df = spark.createDataFrame([["""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. 
Tax alone was also inactive."""]]).toDF("text") result = pipeline.fit(sample_df).transform(sample_df) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_cellular_emb_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter)) val sample_data = Seq("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. 
Tax alone was also inactive.""").toDS.toDF("text") val result = pipeline.fit(sample_data).transform(sample_data) ```
## Results ```bash +-------------------------------------------+-----+---+---------+ |chunk |begin|end|ner_label| +-------------------------------------------+-----+---+---------+ |intracellular signaling proteins |27 |58 |protein | |human T-cell leukemia virus type 1 promoter|130 |172|DNA | |Tax |186 |188|protein | |Tax-responsive element 1 |193 |216|DNA | |cyclic AMP-responsive members |237 |265|protein | |CREB/ATF family |274 |288|protein | |transcription factors |293 |313|protein | |Tax |389 |391|protein | |Tax-responsive element 1 |431 |454|DNA | |TRE-1 |457 |461|DNA | |lacZ gene |582 |590|DNA | |CYC1 promoter |617 |629|DNA | |TRE-1 |663 |667|DNA | |cyclic AMP response element-binding protein|695 |737|protein | |CREB |740 |743|protein | |CREB |749 |752|protein | |GAL4 activation domain |767 |788|protein | |GAD |791 |793|protein | |reporter gene |848 |860|DNA | |Tax |863 |865|protein | +-------------------------------------------+-----+---+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_cellular_emb_clinical_large| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|2.8 MB| ## References Trained on the JNLPBA corpus containing more than 2,404 publication abstracts. 
(http://www.geniaproject.org/) ## Benchmarking ```bash label precision recall f1-score support cell_type 0.89 0.79 0.84 4912 protein 0.80 0.90 0.84 9841 cell_line 0.66 0.75 0.70 1489 DNA 0.78 0.87 0.82 2845 RNA 0.79 0.81 0.80 305 micro-avg 0.80 0.85 0.83 19392 macro-avg 0.78 0.82 0.80 19392 weighted-avg 0.81 0.85 0.83 19392 ``` --- layout: model title: BERT multilingual base model (cased) author: John Snow Labs name: bert_base_multilingual_cased date: 2021-05-20 tags: [xx, multilingual, embeddings, bert, open_source] supported: true task: Embeddings language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained model on the top 104 languages with the largest Wikipedia using a masked language modeling (MLM) objective. It was introduced in [this paper](https://arxiv.org/abs/1810.04805) and first released in [this repository](https://github.com/google-research/bert). This model is case sensitive: it makes a difference between english and English. BERT is a transformers model pretrained on a large corpus of multilingual data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with two objectives: - Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence. 
- Next sentence prediction (NSP): the models concatenate two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict if the two sentences were following each other or not. This way, the model learns an inner representation of the languages in the training set that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard classifier using the features produced by the BERT model as inputs. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_multilingual_cased_xx_3.1.0_2.4_1621519556711.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_multilingual_cased_xx_3.1.0_2.4_1621519556711.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_base_multilingual_cased", "xx") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_base_multilingual_cased", "xx") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("xx.embed.bert_base_multilingual_cased").predict("""Put your text here.""") ```
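The masked-language-modeling objective described above (randomly masking ~15% of the input and asking the model to recover it) can be sketched in plain Python. This is illustrative only: the real pretraining masks WordPiece tokens rather than whitespace-split words and uses an 80/10/10 replacement scheme, and `mask_tokens` is a hypothetical helper, not part of Spark NLP.

```python
import random

def mask_tokens(tokens, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    """Replace ~15% of tokens with [MASK]; return the masked sequence and
    the positions the model would be trained to predict."""
    rng = random.Random(seed)
    n = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n))
    masked = list(tokens)
    for i in positions:
        masked[i] = mask_token
    return masked, positions

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, positions = mask_tokens(tokens)
```

During pretraining the model sees `masked` and is scored on how well it predicts the original tokens at `positions`, which is what forces the bidirectional representation mentioned above.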
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_multilingual_cased| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[embeddings]| |Language:|xx| |Case sensitive:|true| ## Data Source [https://huggingface.co/bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) --- layout: model title: Extract Oncology Tests author: John Snow Labs name: ner_oncology_test date: 2022-11-24 tags: [licensed, clinical, oncology, en, ner, test] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts mentions of tests from oncology texts, including pathology tests and imaging tests. Definitions of Predicted Entities: - `Biomarker`: Biological molecules that indicate the presence or absence of cancer, or the type of cancer. Oncogenes are excluded from this category. - `Biomarker_Result`: Terms or values that are identified as the result of a biomarker. - `Imaging_Test`: Imaging tests mentioned in texts, such as "chest CT scan". - `Oncogene`: Mentions of genes that are implicated in the etiology of cancer. - `Pathology_Test`: Mentions of biopsies or tests that use tissue samples. 
## Predicted Entities `Biomarker`, `Biomarker_Result`, `Imaging_Test`, `Oncogene`, `Pathology_Test` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_test_en_4.2.2_3.0_1669307746859.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_test_en_4.2.2_3.0_1669307746859.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_test", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["A biopsy was conducted using an ultrasound guided thick-needle. 
His chest computed tomography (CT) scan was negative."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_test", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("A biopsy was conducted using an ultrasound guided thick-needle. His chest computed tomography (CT) scan was negative.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_test").predict("""A biopsy was conducted using an ultrasound guided thick-needle. His chest computed tomography (CT) scan was negative.""") ```
## Results ```bash | chunk | ner_label | |:-------------------------------|:---------------| | biopsy | Pathology_Test | | ultrasound guided thick-needle | Pathology_Test | | chest computed tomography | Imaging_Test | | CT | Imaging_Test | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_test| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.2 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Imaging_Test 2020 229 184 2204 0.90 0.92 0.91 Biomarker_Result 1177 186 268 1445 0.86 0.81 0.84 Pathology_Test 888 276 162 1050 0.76 0.85 0.80 Biomarker 1287 254 228 1515 0.84 0.85 0.84 Oncogene 365 89 84 449 0.80 0.81 0.81 macro_avg 5737 1034 926 6663 0.83 0.85 0.84 micro_avg 5737 1034 926 6663 0.85 0.86 0.85 ``` --- layout: model title: English BertForQuestionAnswering model (from deepset) author: John Snow Labs name: bert_base_cased_qa_squad2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased-squad2` is an English model originally trained by `deepset`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_cased_qa_squad2_en_4.0.0_3.0_1654193845988.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_cased_qa_squad2_en_4.0.0_3.0_1654193845988.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_base_cased_qa_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_base_cased_qa_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.base_cased.by_deepset").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_cased_qa_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/deepset/bert-base-cased-squad2 --- layout: model title: Legal Specific performance Clause Binary Classifier author: John Snow Labs name: legclf_specific_performance_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `specific-performance` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `specific-performance` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_specific_performance_clause_en_1.0.0_3.2_1660123020947.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_specific_performance_clause_en_1.0.0_3.2_1660123020947.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_specific_performance_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +----------------------+ |                result| +----------------------+ |[specific-performance]| |               [other]| |               [other]| |[specific-performance]| +----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_specific_performance_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 0.98 0.99 94 specific-performance 0.95 1.00 0.97 36 accuracy - - 0.98 130 macro-avg 0.97 0.99 0.98 130 weighted-avg 0.99 0.98 0.98 130 ``` --- layout: model title: Legal Counterparts Clause Binary Classifier author: John Snow Labs name: legclf_counterparts_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `counterparts` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings of this model allow up to 512 tokens.
If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `counterparts` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_counterparts_clause_en_1.0.0_3.2_1660123374262.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_counterparts_clause_en_1.0.0_3.2_1660123374262.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_counterparts_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +--------------+ |        result| +--------------+ |[counterparts]| |       [other]| |       [other]| |[counterparts]| +--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_counterparts_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support counterparts 1.00 1.00 1.00 38 other 1.00 1.00 1.00 97 accuracy - - 1.00 135 macro-avg 1.00 1.00 1.00 135 weighted-avg 1.00 1.00 1.00 135 ``` --- layout: model title: Relation Extraction between different oncological entity types (unspecific version) author: John Snow Labs name: re_oncology_wip date: 2022-09-27 tags: [licensed, clinical, oncology, en, relation_extraction, temporal, test, biomarker, anatomy] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: RelationExtractionModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This relation extraction model identifies relations between dates and other clinical entities, between tumor mentions and their size, between anatomical entities and other clinical entities, and between tests and their results. In contrast to re_oncology_granular, all these relation types are labeled as is_related_to. The different types of relations can be identified by considering the pairs of entities that are linked.
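The last point, distinguishing relation types by the pair of linked entity types, can be sketched as plain Python post-processing. This is a hypothetical illustration: the `rows` tuples mirror the kind of output shown in the Results section, and the `relation_kind` helper is an assumption of this sketch, not part of the Spark NLP API.

```python
# Since this model labels every relation "is_related_to", the entity types
# attached to each chunk pair tell you which kind of link it actually is.
rows = [
    ("mastectomy", "Cancer_Surgery", "two months ago", "Relative_Date", "is_related_to"),
    ("3 cm", "Tumor_Size", "mass", "Tumor_Finding", "is_related_to"),
]

def relation_kind(entity1, entity2):
    """Map an (entity1, entity2) type pair to a human-readable relation kind."""
    pair = frozenset((entity1, entity2))
    if pair == frozenset(("Tumor_Size", "Tumor_Finding")):
        return "tumor_size"
    if pair == frozenset(("Cancer_Surgery", "Relative_Date")):
        return "surgery_date"
    return "other"

kinds = [relation_kind(e1, e2) for _, e1, _, e2, _ in rows]
```

Using a `frozenset` makes the lookup direction-independent, matching the symmetric relation pairs passed to `setRelationPairs` in the pipeline below.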
## Predicted Entities `is_related_to`, `O` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_oncology_wip_en_4.0.0_3.0_1664302122205.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_oncology_wip_en_4.0.0_3.0_1664302122205.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use relation pairs to include only the combinations of entities that are relevant in your case.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") re_model = RelationExtractionModel.pretrained("re_oncology_wip", "en", "clinical/models") \ .setInputCols(["embeddings", "pos_tags", "ner_chunk", "dependencies"]) \ .setOutputCol("relation_extraction") \ .setRelationPairs(["Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery"]) \ .setMaxSyntacticDistance(10) pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_model]) data = spark.createDataFrame([["A mastectomy was performed two months ago, and a 3 cm mass was extracted."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = 
SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentence", "pos_tags", "token")) .setOutputCol("dependencies") val re_model = RelationExtractionModel.pretrained("re_oncology_wip", "en", "clinical/models") .setInputCols(Array("embeddings", "pos_tags", "ner_chunk", "dependencies")) .setOutputCol("relation_extraction") .setRelationPairs(Array("Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery")) .setMaxSyntacticDistance(10) val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_model)) val data = Seq("A mastectomy was performed two months ago, and a 3 cm mass was extracted.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.oncology_wip").predict("""A mastectomy was performed two months ago, and a 3 cm mass was extracted.""") ```
## Results ```bash chunk1 entity1 chunk2 entity2 relation confidence mastectomy Cancer_Surgery two months ago Relative_Date is_related_to 0.9623304 3 cm Tumor_Size mass Tumor_Finding is_related_to 0.7947009 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_oncology_wip| |Type:|re| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]| |Output Labels:|[relations]| |Language:|en| |Size:|266.3 KB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash relation recall precision f1 O 0.82 0.88 0.85 is_related_to 0.89 0.83 0.86 macro-avg 0.86 0.86 0.86 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Austro-Asiatic languages author: John Snow Labs name: opus_mt_en_aav date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, aav, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
- source languages: `en` - target languages: `aav` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_aav_xx_2.7.0_2.4_1609169278246.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_aav_xx_2.7.0_2.4_1609169278246.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_aav", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_aav", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.aav').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_aav| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering Cased model (from CNT-UPenn) author: John Snow Labs name: roberta_qa_for_seizurefrequency date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `RoBERTa_for_seizureFrequency_QA` is an English model originally trained by `CNT-UPenn`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_for_seizurefrequency_en_4.3.0_3.0_1674208667059.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_for_seizurefrequency_en_4.3.0_3.0_1674208667059.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_for_seizurefrequency","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_for_seizurefrequency","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_for_seizurefrequency| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|466.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/CNT-UPenn/RoBERTa_for_seizureFrequency_QA - https://doi.org/10.1093/jamia/ocac018 --- layout: model title: Pipeline to Detect Living Species (w2v_cc_300d) author: John Snow Labs name: ner_living_species_pipeline date: 2023-03-13 tags: [pt, ner, clinical, licensed] task: Named Entity Recognition language: pt edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_living_species](https://nlp.johnsnowlabs.com/2022/06/22/ner_living_species_pt_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_pipeline_pt_4.3.0_3.2_1678708110628.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_pipeline_pt_4.3.0_3.2_1678708110628.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_living_species_pipeline", "pt", "clinical/models") text = '''Uma rapariga de 16 anos com um historial pessoal de asma apresentou ao departamento de dermatologia com lesões cutâneas assintomáticas que tinham estado presentes durante 2 meses. A paciente tinha sido tratada com creme corticosteróide devido a uma suspeita inicial de eczema atópico, apesar do qual apresentava um crescimento progressivo marcado das lesões. Tinha um gato doméstico que ela nunca tinha levado ao veterinário. O exame físico revelou placas em forma de anel com uma borda periférica activa na parte superior das costas e nos aspectos laterais do pescoço e da face. Cultura local obtida por raspagem de tapete isolado Trichophyton rubrum. Com base em dados clínicos e cultura, foi estabelecido o diagnóstico de tinea incognito.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_living_species_pipeline", "pt", "clinical/models") val text = "Uma rapariga de 16 anos com um historial pessoal de asma apresentou ao departamento de dermatologia com lesões cutâneas assintomáticas que tinham estado presentes durante 2 meses. A paciente tinha sido tratada com creme corticosteróide devido a uma suspeita inicial de eczema atópico, apesar do qual apresentava um crescimento progressivo marcado das lesões. Tinha um gato doméstico que ela nunca tinha levado ao veterinário. O exame físico revelou placas em forma de anel com uma borda periférica activa na parte superior das costas e nos aspectos laterais do pescoço e da face. Cultura local obtida por raspagem de tapete isolado Trichophyton rubrum. Com base em dados clínicos e cultura, foi estabelecido o diagnóstico de tinea incognito." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:--------------------|--------:|------:|:------------|-------------:| | 0 | rapariga | 4 | 11 | HUMAN | 0.9991 | | 1 | pessoal | 41 | 47 | HUMAN | 0.9765 | | 2 | paciente | 182 | 189 | HUMAN | 1 | | 3 | gato | 368 | 371 | SPECIES | 0.9847 | | 4 | veterinário | 413 | 423 | HUMAN | 0.91 | | 5 | Trichophyton rubrum | 632 | 650 | SPECIES | 0.9996 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|pt| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Finnish asr_wav2vec2_large_xlsr_53_finnish_by_vasilis TFWav2Vec2ForCTC from vasilis author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_finnish_by_vasilis date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_finnish_by_vasilis` is a Finnish model originally trained by vasilis.
NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use `asr_wav2vec2_large_xlsr_53_finnish_by_vasilis_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_finnish_by_vasilis_fi_4.2.0_3.0_1664024039687.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_finnish_by_vasilis_fi_4.2.0_3.0_1664024039687.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_finnish_by_vasilis", "fi")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_finnish_by_vasilis", "fi") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_finnish_by_vasilis| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|fi| |Size:|1.2 GB| --- layout: model title: English T5ForConditionalGeneration Cased model (from yirmibesogluz) author: John Snow Labs name: t5_t2t_ner_ade_balanced date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t2t-ner-ade-balanced` is an English model originally trained by `yirmibesogluz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_t2t_ner_ade_balanced_en_4.3.0_3.0_1675107775759.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_t2t_ner_ade_balanced_en_4.3.0_3.0_1675107775759.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_t2t_ner_ade_balanced","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_t2t_ner_ade_balanced","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_t2t_ner_ade_balanced| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|924.7 MB| ## References - https://huggingface.co/yirmibesogluz/t2t-ner-ade-balanced - https://github.com/gokceuludogan/boun-tabi-smm4h22 --- layout: model title: Legal Natural Environment Document Classifier (EURLEX) author: John Snow Labs name: legclf_natural_environment_bert date: 2023-03-06 tags: [en, legal, classification, clauses, natural_environment, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_natural_environment_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class Natural_Environment or not (Binary Classification) according to EuroVoc labels. ## Predicted Entities `Natural_Environment`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_natural_environment_bert_en_1.0.0_3.0_1678111551323.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_natural_environment_bert_en_1.0.0_3.0_1678111551323.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_natural_environment_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +---------------------+ |               result| +---------------------+ |[Natural_Environment]| |              [Other]| |              [Other]| |[Natural_Environment]| +---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_natural_environment_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Natural_Environment 0.89 0.87 0.88 45 Other 0.88 0.90 0.89 50 accuracy - - 0.88 95 macro-avg 0.88 0.88 0.88 95 weighted-avg 0.88 0.88 0.88 95 ``` --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from jonfrank) author: John Snow Labs name: xlmroberta_ner_jonfrank_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `jonfrank`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jonfrank_base_finetuned_panx_de_4.1.0_3.0_1660434803462.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jonfrank_base_finetuned_panx_de_4.1.0_3.0_1660434803462.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jonfrank_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jonfrank_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_jonfrank_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/jonfrank/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Italian T5ForConditionalGeneration Base Cased model (from aiknowyou) author: John Snow Labs name: t5_mt5_base_it_paraphraser date: 2023-01-30 tags: [it, open_source, t5, tensorflow] task: Text Generation language: it edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mt5-base-it-paraphraser` is an Italian model originally trained by `aiknowyou`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_mt5_base_it_paraphraser_it_4.3.0_3.0_1675105866508.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_mt5_base_it_paraphraser_it_4.3.0_3.0_1675105866508.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_mt5_base_it_paraphraser","it") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_mt5_base_it_paraphraser","it") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_mt5_base_it_paraphraser| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|it| |Size:|969.5 MB| ## References - https://huggingface.co/aiknowyou/mt5-base-it-paraphraser - https://arxiv.org/abs/2010.11934 - https://colab.research.google.com/drive/1DGeF190gJ3DjRFQiwhFuZalp427iqJNQ - https://gist.github.com/avidale/44cd35bfcdaf8bedf51d97c468cc8001 - http://creativecommons.org/licenses/by-nc-sa/4.0/ --- layout: model title: Detect tumor morphology in Spanish texts author: John Snow Labs name: cantemist_scielowiki date: 2021-07-23 tags: [ner, licensed, oncology, es] task: Named Entity Recognition language: es edition: Spark NLP for Healthcare 3.1.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect tumor morphology entities in Spanish text. ## Predicted Entities `MORFOLOGIA_NEOPLASIA`, `O` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_TUMOR_ES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/cantemist_scielowiki_es_3.1.2_3.0_1627080305994.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/cantemist_scielowiki_es_3.1.2_3.0_1627080305994.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentence = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") embeddings_stage = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d", "es", "clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("word_embeddings") clinical_ner = MedicalNerModel.pretrained("cantemist_scielowiki", "es", "clinical/models")\ .setInputCols(["sentence", "token", "word_embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(['document', 'token', 'ner']) \ .setOutputCol('ner_chunk') pipeline = Pipeline(stages=[ document_assembler, sentence, tokenizer, embeddings_stage, clinical_ner, ner_converter ]) data = spark.createDataFrame([["""Anamnesis Paciente de 37 años de edad sin antecedentes patológicos ni quirúrgicos de interés. En diciembre de 2012 consultó al Servicio de Urgencias por un cuadro de cefalea aguda e hipostesia del hemicuerpo izquierdo de 15 días de evolución refractario a tratamiento. Exploración neurológica sin focalidad; fondo de ojo: papiledema unilateral. Se solicitaron una TC del SNC, que objetiva una LOE frontal derecha con afectación aparente del cuerpo calloso, y una RM del SNC, que muestra un extenso proceso expansivo intraparenquimatoso frontal derecho que infiltra la rodilla del cuerpo calloso, mal delimitada y sin componente necrótico. Tras la administración de contraste se apreciaban diferentes realces parcheados en la lesión, pero sin definirse una cápsula con aumento del flujo sanguíneo en la lesión, características compatibles con linfoma o astrocitoma anaplásico . 
El 3 de enero de 2013 se efectúa biopsia intraoperatoria, con diagnóstico histológico de astrocitoma anaplásico GIII"""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_stage = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d", "es", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("word_embeddings") val clinical_ner = MedicalNerModel.pretrained("cantemist_scielowiki", "es", "clinical/models") .setInputCols(Array("sentence", "token", "word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence, tokenizer, embeddings_stage, clinical_ner, ner_converter)) val data = Seq("""Anamnesis Paciente de 37 años de edad sin antecedentes patológicos ni quirúrgicos de interés. En diciembre de 2012 consultó al Servicio de Urgencias por un cuadro de cefalea aguda e hipostesia del hemicuerpo izquierdo de 15 días de evolución refractario a tratamiento. Exploración neurológica sin focalidad; fondo de ojo: papiledema unilateral. Se solicitaron una TC del SNC, que objetiva una LOE frontal derecha con afectación aparente del cuerpo calloso, y una RM del SNC, que muestra un extenso proceso expansivo intraparenquimatoso frontal derecho que infiltra la rodilla del cuerpo calloso, mal delimitada y sin componente necrótico. Tras la administración de contraste se apreciaban diferentes realces parcheados en la lesión, pero sin definirse una cápsula con aumento del flujo sanguíneo en la lesión, características compatibles con linfoma o astrocitoma anaplásico . 
El 3 de enero de 2013 se efectúa biopsia intraoperatoria, con diagnóstico histológico de astrocitoma anaplásico GIII""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash +---------------------+----------------------+ | token | prediction | +---------------------+----------------------+ | Anamnesis | O | | Paciente | O | | de | O | | 37 | O | | años | O | | de | O | | edad | O | | sin | O | | antecedentes | O | | patológicos | O | | ni | O | | quirúrgicos | O | | de | O | | interés | O | | . | O | | En | O | | diciembre | O | | de | O | | 2012 | O | | consultó | O | | al | O | | Servicio | O | | de | O | | Urgencias | O | | por | O | | un | O | | cuadro | O | | de | O | | cefalea | O | | aguda | O | | e | O | | hipostesia | O | | del | O | | hemicuerpo | O | | izquierdo | O | | de | O | | 15 | O | | días | O | | de | O | | evolución | O | | refractario | O | | a | O | | tratamiento | O | | . | O | | Exploración | O | | neurológica | O | | sin | O | | focalidad | O | | ; | O | | fondo | O | | de | O | | ojo | O | | : | O | | papiledema | O | | unilateral | O | | . | O | | Se | O | | solicitaron | O | | una | O | | TC | O | | del | O | | SNC | B-MORFOLOGIA_NEOP... | | , | O | | que | O | | objetiva | O | | una | O | | LOE | O | | frontal | O | | derecha | O | | con | O | | afectación | B-MORFOLOGIA_NEOP... | | aparente | I-MORFOLOGIA_NEOP... | | del | I-MORFOLOGIA_NEOP... | | cuerpo | I-MORFOLOGIA_NEOP... | | calloso | I-MORFOLOGIA_NEOP... | | , | O | | y | O | | una | O | | RM | B-MORFOLOGIA_NEOP... | | del | I-MORFOLOGIA_NEOP... | | SNC | I-MORFOLOGIA_NEOP... | | , | O | | que | O | | muestra | O | | un | O | | extenso | O | | proceso | B-MORFOLOGIA_NEOP... | | expansivo | I-MORFOLOGIA_NEOP... | | intraparenquimatoso | I-MORFOLOGIA_NEOP... | | frontal | I-MORFOLOGIA_NEOP... | | derecho | I-MORFOLOGIA_NEOP... | | que | I-MORFOLOGIA_NEOP... | | infiltra | I-MORFOLOGIA_NEOP... | | la | I-MORFOLOGIA_NEOP... | | rodilla | I-MORFOLOGIA_NEOP... | | del | I-MORFOLOGIA_NEOP... | | cuerpo | I-MORFOLOGIA_NEOP... | | calloso | I-MORFOLOGIA_NEOP... 
| | , | O | | mal | O | | delimitada | O | | y | O | | sin | O | | componente | O | | necrótico | O | | . | O | | Tras | O | | la | O | | administración | O | | de | O | | contraste | O | | se | O | | apreciaban | O | | diferentes | O | | realces | O | | parcheados | O | | en | O | | la | O | | lesión | O | | , | O | | pero | O | | sin | O | | definirse | O | | una | O | | cápsula | O | | con | O | | aumento | O | | del | O | | flujo | O | | sanguíneo | O | | en | O | | la | O | | lesión | O | | , | O | | características | O | | compatibles | O | | con | O | | linfoma | O | | o | O | | astrocitoma | B-MORFOLOGIA_NEOP... | | anaplásico | I-MORFOLOGIA_NEOP... | | . | O | | El | O | | 3 | O | | de | O | | enero | O | | de | O | | 2013 | O | | se | O | | efectúa | O | | biopsia | O | | intraoperatoria | O | | , | O | | con | O | | diagnóstico | O | | histológico | O | | de | O | | astrocitoma | B-MORFOLOGIA_NEOP... | | anaplásico | I-MORFOLOGIA_NEOP... | | GIII | I-MORFOLOGIA_NEOP... | +---------------------+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|cantemist_scielowiki| |Compatibility:|Spark NLP for Healthcare 3.1.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, word_embeddings]| |Output Labels:|[ner]| |Language:|es| |Dependencies:|embeddings_scielowiki_300d| ## Data Source The model was trained with the [CANTEMIST](https://temu.bsc.es/cantemist/) data set: > CANTEMIST is an annotated data set for oncology analysis in the Spanish language containing 1301 oncological clinical case reports with a total of 63,016 sentences and 1093,501 tokens. All documents of the corpus have been manually annotated by clinical experts with mentions of tumor morphology (in Spanish, “morfología de neoplasia”). There are 16,030 tumor morphology mentions mapped to an eCIE-O code (850 unique codes) References: 1. P. Ruas, A. Neves, V. D. Andrade, F. M. 
Couto, Lasigebiotm at cantemist: Named entity recognition and normalization of tumour morphology entities and clinical coding of Spanish health-related documents, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings, 2020 2. Antonio Miranda-Escalada, Eulàlia Farré-Maduell, Martin Krallinger. Named Entity Recognition, Concept Normalization and Clinical Coding: Overview of the Cantemist Track for Cancer Text Mining in Spanish, Corpus, Guidelines, Methods and Results. Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings. 303-323 (2020). ## Benchmarking ```bash label precision recall f1-score support B-MORFOLOGIA_NEOPLASIA 0.94 0.73 0.83 2474 I-MORFOLOGIA_NEOPLASIA 0.81 0.74 0.77 3169 O 0.99 1.00 1.00 283006 accuracy - - 0.99 288649 macro-avg 0.92 0.82 0.87 288649 weighted-avg 0.99 0.99 0.99 288649 ``` --- layout: model title: Yue Chinese asr_wav2vec2_large_xlsr_cantonese_by_ctl TFWav2Vec2ForCTC from ctl author: John Snow Labs name: asr_wav2vec2_large_xlsr_cantonese_by_ctl date: 2022-09-24 tags: [wav2vec2, yue, audio, open_source, asr] task: Automatic Speech Recognition language: yue edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_large_xlsr_cantonese_by_ctl` is a Yue Chinese model originally trained by ctl. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_cantonese_by_ctl_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_cantonese_by_ctl_yue_4.2.0_3.0_1664039699316.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_cantonese_by_ctl_yue_4.2.0_3.0_1664039699316.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_cantonese_by_ctl", "yue")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_cantonese_by_ctl", "yue") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_cantonese_by_ctl| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|yue| |Size:|1.2 GB| --- layout: model title: Google's Tapas Table Understanding (Small, SQA) author: John Snow Labs name: table_qa_tapas_small_finetuned_sqa date: 2022-09-30 tags: [en, table, qa, question, answering, open_source] task: Table Question Answering language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: TapasForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a zero-shot Table Understanding model which allows you to carry out Question Answering on Spark DataFrames. If you have a file stored in any table format, such as CSV, load it into a Spark DataFrame before using the model. Size of this model: Small. Has aggregation operations?: False. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_small_finetuned_sqa_en_4.2.0_3.0_1664530724535.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_small_finetuned_sqa_en_4.2.0_3.0_1664530724535.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python json_data = """ { "header": ["name", "money", "age"], "rows": [ ["Donald Trump", "$100,000,000", "75"], ["Elon Musk", "$20,000,000,000,000", "55"] ] } """ queries = [ "Who earns less than 200,000,000?", "Who earns 100,000,000?", "How much money has Donald Trump?", "How old are they?", ] data = spark.createDataFrame([ [json_data, " ".join(queries)] ]).toDF("table_json", "questions") document_assembler = MultiDocumentAssembler() \ .setInputCols("table_json", "questions") \ .setOutputCols("document_table", "document_questions") sentence_detector = SentenceDetector() \ .setInputCols(["document_questions"]) \ .setOutputCol("questions") table_assembler = TableAssembler()\ .setInputCols(["document_table"])\ .setOutputCol("table") tapas = TapasForQuestionAnswering\ .pretrained("table_qa_tapas_small_finetuned_sqa","en")\ .setInputCols(["questions", "table"])\ .setOutputCol("answers") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, table_assembler, tapas ]) model = pipeline.fit(data) model\ .transform(data)\ .selectExpr("explode(answers) AS answer")\ .select("answer")\ .show(truncate=False) ```
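The snippet above builds the table JSON by hand; when the table lives in a CSV file, you can load it with Spark and serialize it into the same layout first. A minimal sketch — the `people.csv` path and the `table_to_json` helper name are illustrative assumptions, not part of the Spark NLP API:

```python
import json

def table_to_json(header, rows):
    """Serialize a table into the {"header": [...], "rows": [[...], ...]}
    JSON string that the TableAssembler stage parses from the table column."""
    return json.dumps({"header": header, "rows": [[str(c) for c in r] for r in rows]})

# With Spark, the header and rows could come from a loaded CSV, e.g.:
#   df = spark.read.csv("people.csv", header=True)
#   table_json = table_to_json(df.columns, [list(r) for r in df.collect()])
table_json = table_to_json(
    ["name", "money", "age"],
    [["Donald Trump", "$100,000,000", "75"],
     ["Elon Musk", "$20,000,000,000,000", "55"]],
)
```

The resulting `table_json` string is then placed in the `table_json` column alongside the questions, exactly as in the pipeline above.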
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |answer | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} | |{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|table_qa_tapas_small_finetuned_sqa| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|110.1 MB| |Case sensitive:|false| ## References https://www.microsoft.com/en-us/download/details.aspx?id=54253 --- layout: model title: Fast Neural Machine Translation Model from Manx to English author: John Snow Labs name: opus_mt_gv_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, gv, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal 
dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `gv` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_gv_en_xx_2.7.0_2.4_1609165146913.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_gv_en_xx_2.7.0_2.4_1609165146913.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_gv_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_gv_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.gv.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_gv_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_16_finetuned_squad_seed_2 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-16-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_16_finetuned_squad_seed_2_en_4.3.0_3.0_1674214328836.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_16_finetuned_squad_seed_2_en_4.3.0_3.0_1674214328836.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_16_finetuned_squad_seed_2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_16_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_16_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|416.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-16-finetuned-squad-seed-2 --- layout: model title: English DistilBertForQuestionAnswering Cased model (from rowan1224) author: John Snow Labs name: distilbert_qa_squad_slp date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-squad-slp` is an English model originally trained by `rowan1224`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_squad_slp_en_4.3.0_3.0_1672774348522.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_squad_slp_en_4.3.0_3.0_1672774348522.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_slp","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_slp","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_squad_slp| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/rowan1224/distilbert-squad-slp --- layout: model title: Legal Subordination Clause Binary Classifier author: John Snow Labs name: legclf_subordination_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `subordination` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
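Paragraph splitting (by multiline), the first technique listed above, can be sketched in plain Python; the `split_paragraphs` helper name is illustrative, and the workshop tutorial linked above covers more robust approaches:

```python
import re

def split_paragraphs(document: str) -> list:
    """Split a long legal document on blank lines, so each paragraph can be
    classified separately and stays within the ~512-token embedding limit."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n+", document)]
    return [p for p in paragraphs if p]

text = ("SUBORDINATION.\nThe Lessee accepts this Lease subject to any mortgages...\n"
        "\n"
        "GOVERNING LAW.\nThis Agreement shall be governed by...")
chunks = split_paragraphs(text)
# each chunk becomes one row of the "clause_text" column fed to the classifier
```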
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `subordination` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_subordination_clause_en_1.0.0_3.2_1660124022192.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_subordination_clause_en_1.0.0_3.2_1660124022192.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_subordination_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
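The description notes that several clause classifiers can be chained behind the same document and embedding stages, with one True/False-style output per clause type. A sketch of how the extra stages could be planned — every model name other than legclf_subordination_clause is a hypothetical placeholder:

```python
# Each classifier writes to its own output column derived from its name.
clause_models = [
    "legclf_subordination_clause",    # this card's model
    "legclf_indemnification_clause",  # hypothetical placeholder
    "legclf_severability_clause",     # hypothetical placeholder
]

def output_col(model_name: str) -> str:
    """Derive a distinct output column per classifier, e.g. 'category_subordination'."""
    return "category_" + model_name.split("_")[1]

output_cols = [output_col(m) for m in clause_models]

# In the pipeline, each extra stage would then look like:
#   nlp.ClassifierDLModel.pretrained(name, "en", "legal/models") \
#       .setInputCols(["sentence_embeddings"]) \
#       .setOutputCol(output_col(name))
```

Because every classifier reads the same `sentence_embeddings` column, the embeddings are computed once per document regardless of how many clause models are appended.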
## Results ```bash +-------+ | result| +-------+ |[subordination]| |[other]| |[other]| |[subordination]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_subordination_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.94 0.98 0.96 49 subordination 0.96 0.89 0.93 28 accuracy - - 0.95 77 macro-avg 0.95 0.94 0.94 77 weighted-avg 0.95 0.95 0.95 77 ``` --- layout: model title: Detect Living Species (bert_embeddings_bert_base_italian_xxl_cased) author: John Snow Labs name: ner_living_species_bert date: 2022-06-23 tags: [it, ner, clinical, licensed, bert] task: Named Entity Recognition language: it edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract living species from clinical texts in Italian, which is critical to scientific disciplines like medicine, biology, ecology/biodiversity, nutrition and agriculture. This model is trained using `bert_embeddings_bert_base_italian_xxl_cased` embeddings. It is trained on the [LivingNER](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) corpus that is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. **NOTE:** 1. The text files were translated from Spanish with a neural machine translation system. 2. The annotations were translated with the same neural machine translation system. 3. The translated annotations were transferred to the translated text files using an annotation transfer technology. 
## Predicted Entities `HUMAN`, `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_it_3.5.3_3.0_1655972219820.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_it_3.5.3_3.0_1655972219820.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_italian_xxl_cased", "it")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_living_species_bert", "it", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""Una donna di 74 anni è stata ricoverata con dolore addominale diffuso, ipossia e astenia di 2 settimane di evoluzione. La sua storia personale includeva ipertensione in trattamento con amiloride/idroclorotiazide e dislipidemia controllata con lovastatina. La sua storia familiare era: madre morta di cancro gastrico, fratello con cirrosi epatica di eziologia sconosciuta e sorella con carcinoma epatocellulare. Lo studio eziologico delle diverse cause di malattia epatica cronica comprendeva: virus epatotropi (HBV, HCV) e HIV, studio dell'autoimmunità, ceruloplasmina, ferritina e porfirine nelle urine, tutti risultati negativi. 
Il paziente è stato messo in trattamento anticoagulante con acenocumarolo e diuretici a tempo indeterminato."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_italian_xxl_cased", "it") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_living_species_bert", "it", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""Una donna di 74 anni è stata ricoverata con dolore addominale diffuso, ipossia e astenia di 2 settimane di evoluzione. La sua storia personale includeva ipertensione in trattamento con amiloride/idroclorotiazide e dislipidemia controllata con lovastatina. La sua storia familiare era: madre morta di cancro gastrico, fratello con cirrosi epatica di eziologia sconosciuta e sorella con carcinoma epatocellulare. Lo studio eziologico delle diverse cause di malattia epatica cronica comprendeva: virus epatotropi (HBV, HCV) e HIV, studio dell'autoimmunità, ceruloplasmina, ferritina e porfirine nelle urine, tutti risultati negativi. 
Il paziente è stato messo in trattamento anticoagulante con acenocumarolo e diuretici a tempo indeterminato.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("it.med_ner.living_species.bert").predict("""Una donna di 74 anni è stata ricoverata con dolore addominale diffuso, ipossia e astenia di 2 settimane di evoluzione. La sua storia personale includeva ipertensione in trattamento con amiloride/idroclorotiazide e dislipidemia controllata con lovastatina. La sua storia familiare era: madre morta di cancro gastrico, fratello con cirrosi epatica di eziologia sconosciuta e sorella con carcinoma epatocellulare. Lo studio eziologico delle diverse cause di malattia epatica cronica comprendeva: virus epatotropi (HBV, HCV) e HIV, studio dell'autoimmunità, ceruloplasmina, ferritina e porfirine nelle urine, tutti risultati negativi. Il paziente è stato messo in trattamento anticoagulante con acenocumarolo e diuretici a tempo indeterminato.""") ```
## Results ```bash +----------------+-------+ |ner_chunk |label | +----------------+-------+ |donna |HUMAN | |personale |HUMAN | |madre |HUMAN | |fratello |HUMAN | |sorella |HUMAN | |virus epatotropi|SPECIES| |HBV |SPECIES| |HCV |SPECIES| |HIV |SPECIES| |paziente |HUMAN | +----------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_bert| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|it| |Size:|16.4 MB| ## References [https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) ## Benchmarking ```bash label precision recall f1-score support B-HUMAN 0.88 0.95 0.91 2772 B-SPECIES 0.76 0.89 0.82 2860 I-HUMAN 0.70 0.59 0.64 101 I-SPECIES 0.70 0.81 0.75 1036 micro-avg 0.80 0.90 0.85 6769 macro-avg 0.76 0.81 0.78 6769 weighted-avg 0.80 0.90 0.85 6769 ``` --- layout: model title: English asr_wav2vec_finetuned_on_cryptocurrency TFWav2Vec2ForCTC from distractedm1nd author: John Snow Labs name: asr_wav2vec_finetuned_on_cryptocurrency date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec_finetuned_on_cryptocurrency` is an English model originally trained by distractedm1nd.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec_finetuned_on_cryptocurrency_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec_finetuned_on_cryptocurrency_en_4.2.0_3.0_1664024959023.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec_finetuned_on_cryptocurrency_en_4.2.0_3.0_1664024959023.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec_finetuned_on_cryptocurrency", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec_finetuned_on_cryptocurrency", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
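The snippets above assume an `audioDf` whose `audio_content` column holds the waveform as normalized floats. A minimal sketch of producing such floats with only the Python standard library (the 16 kHz sample rate and synthetic sine wave are illustrative stand-ins for a real recording; Wav2Vec2 models expect 16 kHz mono input):

```python
import io
import math
import struct
import wave

# Synthesize one second of 16 kHz mono 16-bit PCM as a stand-in for a real file.
sr = 16000
pcm = [int(32767 * math.sin(2 * math.pi * 440 * t / sr)) for t in range(sr)]

buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(sr)
    w.writeframes(struct.pack("<%dh" % len(pcm), *pcm))

# Decode the wav back into floats in [-1, 1] -- the representation expected
# in the "audio_content" column consumed by AudioAssembler.
buf.seek(0)
with wave.open(buf, "rb") as w:
    frames = w.readframes(w.getnframes())
floats = [s / 32768.0 for s in struct.unpack("<%dh" % (len(frames) // 2), frames)]

# With a running SparkSession (not shown here), the DataFrame would then be:
# audioDf = spark.createDataFrame([(floats,)], ["audio_content"])
print(len(floats))
```

In practice you would read an existing 16 kHz wav file instead of synthesizing one; the float conversion step is the same.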
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec_finetuned_on_cryptocurrency| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: RE Pipeline between Tests, Results, and Dates author: John Snow Labs name: re_test_result_date_pipeline date: 2023-06-13 tags: [licensed, clinical, relation_extraction, tests, results, dates, en] task: Relation Extraction language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [re_test_result_date](https://nlp.johnsnowlabs.com/2021/02/24/re_test_result_date_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_test_result_date_pipeline_en_4.4.4_3.2_1686665254277.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_test_result_date_pipeline_en_4.4.4_3.2_1686665254277.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("re_test_result_date_pipeline", "en", "clinical/models") pipeline.fullAnnotate("He was advised chest X-ray or CT scan after checking his SpO2 which was <= 93%") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("re_test_result_date_pipeline", "en", "clinical/models") pipeline.fullAnnotate("He was advised chest X-ray or CT scan after checking his SpO2 which was <= 93%") ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.date_test_result.pipeline").predict("""He was advised chest X-ray or CT scan after checking his SpO2 which was <= 93%""") ```
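The pipeline labels entity pairs it judges unrelated with `O`. A small post-processing sketch for keeping only detected relations (plain Python over illustrative tuples, not a Spark NLP API; in practice you would build these rows from the `fullAnnotate` output):

```python
# Each row: (relation, entity1, chunk1, entity2, chunk2) -- illustrative values
# mirroring the sample output shown below in the Results section of this card.
rows = [
    ("O", "TEST", "chest X-ray", "MEASUREMENTS", "93%"),
    ("O", "TEST", "CT scan", "MEASUREMENTS", "93%"),
    ("is_result_of", "TEST", "SpO2", "MEASUREMENTS", "93%"),
]

# Keep only the pairs where an actual relation was predicted.
related = [r for r in rows if r[0] != "O"]
print(related)
```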
## Results ```bash | index | relations | entity1 | chunk1 | entity2 | chunk2 | |-------|--------------|--------------|---------------------|--------------|---------| | 0 | O | TEST | chest X-ray | MEASUREMENTS | 93% | | 1 | O | TEST | CT scan | MEASUREMENTS | 93% | | 2 | is_result_of | TEST | SpO2 | MEASUREMENTS | 93% | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_test_result_date_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - PerceptronModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - DependencyParserModel - RelationExtractionModel --- layout: model title: English BertForQuestionAnswering model (from aymanm419) author: John Snow Labs name: bert_qa_araSpeedest date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `araSpeedest` is an English model originally trained by `aymanm419`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_araSpeedest_en_4.0.0_3.0_1654179104397.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_araSpeedest_en_4.0.0_3.0_1654179104397.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_araSpeedest","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_araSpeedest","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.by_aymanm419").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
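Under the hood, extractive QA models of this kind score every candidate answer span in the context and return the best one. A toy sketch of the span-selection step (the logit values are illustrative, not real model outputs):

```python
# Pick the (start, end) token pair maximizing start_logit + end_logit,
# subject to end >= start -- the standard extractive-QA decoding rule.
start_logits = [0.1, 0.2, 6.0, 0.3, 0.1]
end_logits = [0.0, 0.1, 0.4, 5.5, 0.2]

n = len(start_logits)
best = max(
    ((s, e) for s in range(n) for e in range(s, n)),
    key=lambda se: start_logits[se[0]] + end_logits[se[1]],
)
print(best)  # token indices of the selected answer span
```

Real implementations also cap the span length and handle the "no answer" case; the maximization above is the core idea.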
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_araSpeedest| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|505.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/aymanm419/araSpeedest --- layout: model title: Legal No material adverse change Clause Binary Classifier author: John Snow Labs name: legclf_no_material_adverse_change_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `no-material-adverse-change` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `no-material-adverse-change` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_no_material_adverse_change_clause_en_1.0.0_3.2_1660122691534.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_no_material_adverse_change_clause_en_1.0.0_3.2_1660122691534.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_no_material_adverse_change_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
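The description above recommends splitting long documents into clause-sized pieces before classification. A minimal sketch of the first strategy it lists, paragraph splitting by blank lines (plain Python; the document string is an illustrative placeholder):

```python
import re

# Illustrative multi-clause document; real input would be a full contract.
document = (
    "SECTION 1. Definitions. Capitalized terms have the meanings set forth herein."
    "\n\n"
    "SECTION 2. No Material Adverse Change. Since the date hereof, no event has "
    "occurred that would reasonably be expected to have a material adverse effect."
)

# Split on runs of blank lines and drop empty fragments, so each paragraph
# (candidate clause) can be fed to the classifier independently.
clauses = [p.strip() for p in re.split(r"\n{2,}", document) if p.strip()]
print(len(clauses))
```

Each resulting string would then populate the `clause_text` column of the DataFrame used in the pipeline above.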
## Results ```bash +-------+ | result| +-------+ |[no-material-adverse-change]| |[other]| |[other]| |[no-material-adverse-change]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_no_material_adverse_change_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support no-material-adverse-change 0.95 0.97 0.96 37 other 0.99 0.98 0.99 103 accuracy - - 0.98 140 macro-avg 0.97 0.98 0.97 140 weighted-avg 0.98 0.98 0.98 140 ``` --- layout: model title: Detect Anatomical Regions (MedicalBertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_anatomy date: 2022-01-06 tags: [anatomy, bertfortokenclassification, ner, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for anatomy terms. This model is trained with the BertForTokenClassification method from the transformers library and imported into Spark NLP.
## Predicted Entities `Anatomical_system`, `Cell`, `Cellular_component`, `Developing_anatomical_structure`, `Immaterial_anatomical_entity`, `Multi-tissue_structure`, `Organ`, `Organism_subdivision`, `Organism_substance`, `Pathological_formation`, `Tissue` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_anatomy_en_3.3.4_2.4_1641454747169.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_anatomy_en_3.3.4_2.4_1641454747169.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_anatomy", "en", "clinical/models")\ .setInputCols("token", "document")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter() \ .setInputCols(["document","token","ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter]) data = spark.createDataFrame([["""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. 
Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_anatomy", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("ner") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("document","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter)) val data = Seq("""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. 
Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_anatomy").predict("""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.""") ```
## Results ```bash +-------------------+----------------------+ |chunk |ner_label | +-------------------+----------------------+ |great toe |Multi-tissue_structure| |skin |Organ | |conjunctivae |Multi-tissue_structure| |Extraocular muscles|Multi-tissue_structure| |Nares |Multi-tissue_structure| |turbinates |Multi-tissue_structure| |Oropharynx |Multi-tissue_structure| |Mucous membranes |Tissue | |Neck |Organism_subdivision | |bowel |Organ | |great toe |Multi-tissue_structure| |skin |Organ | |toenails |Organism_subdivision | |foot |Organism_subdivision | |great toe |Multi-tissue_structure| |toenails |Organism_subdivision | +-------------------+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_anatomy| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## Data Source Trained on the Anatomical Entity Mention (AnEM) corpus with 'embeddings_clinical'. 
http://www.nactem.ac.uk/anatomy/ ## Benchmarking ```bash label precision recall f1-score support B-Anatomical_system 1.00 0.50 0.67 4 B-Cell 0.89 0.96 0.92 74 B-Cellular_component 0.97 0.81 0.88 36 B-Developing_anatomical_structure 1.00 0.50 0.67 6 B-Immaterial_anatomical_entity 0.60 0.75 0.67 4 B-Multi-tissue_structure 0.75 0.86 0.80 58 B-Organ 0.86 0.88 0.87 48 B-Organism_subdivision 0.62 0.42 0.50 12 B-Organism_substance 0.89 0.81 0.85 31 B-Pathological_formation 0.91 0.91 0.91 32 B-Tissue 0.94 0.76 0.84 21 I-Anatomical_system 1.00 1.00 1.00 1 I-Cell 1.00 0.84 0.91 62 I-Cellular_component 0.92 0.85 0.88 13 I-Developing_anatomical_structure 1.00 1.00 1.00 1 I-Immaterial_anatomical_entity 1.00 1.00 1.00 1 I-Multi-tissue_structure 1.00 0.77 0.87 26 I-Organ 0.80 0.80 0.80 5 I-Organism_substance 1.00 0.71 0.83 7 I-Pathological_formation 1.00 0.94 0.97 16 I-Tissue 0.93 0.93 0.93 15 accuracy - - 0.84 473 macro-avg 0.87 0.77 0.83 473 weighted-avg 0.90 0.84 0.86 473 ``` --- layout: model title: Part of Speech for Latin author: John Snow Labs name: pos_ud_llct date: 2020-07-29 23:34:00 +0800 task: Part of Speech Tagging language: la edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [pos, la] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_llct_la_2.5.5_2.4_1596054191115.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_llct_la_2.5.5_2.4_1596054191115.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_llct", "la") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Alius est esse regem Aquilonis, et de Anglis medicus et nives Ioannes dux in progressus medicinae anesthesia et hygiene.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_llct", "la") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Alius est esse regem Aquilonis, et de Anglis medicus et nives Ioannes dux in progressus medicinae anesthesia et hygiene.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Alius est esse regem Aquilonis, et de Anglis medicus et nives Ioannes dux in progressus medicinae anesthesia et hygiene."""] pos_df = nlu.load('la.pos').predict(text, output_level='token') pos_df ```
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=4, result='PROPN', metadata={'word': 'Alius'}), Row(annotatorType='pos', begin=6, end=8, result='AUX', metadata={'word': 'est'}), Row(annotatorType='pos', begin=10, end=13, result='VERB', metadata={'word': 'esse'}), Row(annotatorType='pos', begin=15, end=19, result='VERB', metadata={'word': 'regem'}), Row(annotatorType='pos', begin=21, end=29, result='PROPN', metadata={'word': 'Aquilonis'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_llct| |Type:|pos| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|la| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Japanese T5ForConditionalGeneration Cased model (from astremo) author: John Snow Labs name: t5_friendly date: 2023-01-30 tags: [ja, open_source, t5, tensorflow] task: Text Generation language: ja edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `friendly_JA` is a Japanese model originally trained by `astremo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_friendly_ja_4.3.0_3.0_1675102435483.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_friendly_ja_4.3.0_3.0_1675102435483.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_friendly","ja") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_friendly","ja") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_friendly| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ja| |Size:|923.1 MB| ## References - https://huggingface.co/astremo/friendly_JA - http://creativecommons.org/licenses/by/4.0/ --- layout: model title: English RobertaForQuestionAnswering Cased model (from skandaonsolve) author: John Snow Labs name: roberta_qa_finetuned_timeentities2_ttsp75 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-finetuned-timeentities2_ttsp75` is an English model originally trained by `skandaonsolve`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_timeentities2_ttsp75_en_4.3.0_3.0_1674220728523.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_timeentities2_ttsp75_en_4.3.0_3.0_1674220728523.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_timeentities2_ttsp75","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_timeentities2_ttsp75","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_finetuned_timeentities2_ttsp75| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/skandaonsolve/roberta-finetuned-timeentities2_ttsp75 --- layout: model title: Fast Neural Machine Translation Model from Hausa to English author: John Snow Labs name: opus_mt_ha_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ha, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `ha` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ha_en_xx_2.7.0_2.4_1609168807678.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ha_en_xx_2.7.0_2.4_1609168807678.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ha_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["PUT YOUR HAUSA TEXT HERE"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ha_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("PUT YOUR HAUSA TEXT HERE").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ha.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ha_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Financial Zero-shot NER author: John Snow Labs name: finner_roberta_zeroshot date: 2022-09-02 tags: [en, finance, ner, zero, shot, zeroshot, licensed] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true recommended: true annotator: ZeroShotNER article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained to carry out a Zero-Shot Named Entity Recognition (NER) approach, detecting any kind of entity with no training dataset, using just the pretrained RoBERTa embeddings (included in the model) and some examples. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FINNER_ZEROSHOT/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_roberta_zeroshot_en_1.0.0_3.2_1662113599526.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_roberta_zeroshot_en_1.0.0_3.2_1662113599526.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sparktokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") zero_shot_ner = finance.ZeroShotNerModel.pretrained("finner_roberta_zeroshot", "en", "finance/models")\ .setInputCols(["document", "token"])\ .setOutputCol("zero_shot_ner")\ .setEntityDefinitions( { "DATE": ['When was the company acquisition?', 'When was the company purchase agreement?'], "ORG": ["Which company was acquired?"], "PRODUCT": ["Which product?"], "PROFIT_INCREASE": ["How much has the gross profit increased?"], "REVENUES_DECLINED": ["How much has the revenues declined?"], "OPERATING_LOSS_2020": ["Which was the operating loss in 2020"], "OPERATING_LOSS_2019": ["Which was the operating loss in 2019"] }) nerconverter = nlp.NerConverter()\ .setInputCols(["document", "token", "zero_shot_ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[ documentAssembler, sparktokenizer, zero_shot_ner, nerconverter, ] ) sample_text = ["In March 2012, as part of a longer-term strategy, the Company acquired Vertro, Inc., which owned and operated the ALOT product portfolio.", "In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.", "While our gross profit margin increased to 81.4% in 2020 from 63.1% in 2019, our revenues declined approximately 27% in 2020 as compared to 2019.",
"We reported an operating loss of approximately $8,048,581 million in 2020 as compared to an operating loss of approximately $7,738,193 million in 2019."] p_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) res = p_model.transform(spark.createDataFrame(sample_text, StringType()).toDF("text")) res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.begin, res.ner_chunk.end, res.ner_chunk.metadata)).alias("cols")) \ .select(F.expr("cols['0']").alias("chunk"), F.expr("cols['3']['entity']").alias("ner_label"))\ .filter("ner_label!='O'")\ .show(truncate=False) ```
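Conceptually, the entity definitions above recast NER as extractive question answering: each label is tied to one or more questions, and any answer span the model finds in the text is tagged with that label. A minimal pure-Python sketch of that mapping — the `fake_qa` stub and the single `DATE` definition below are illustrative stand-ins for the model's RoBERTa question answering:

```python
# Conceptual sketch of zero-shot NER via extractive QA. fake_qa is a stub
# standing in for the transformer QA model inside ZeroShotNerModel.
def fake_qa(question, context):
    # Stub: pretend the model answers "When ..." questions with this span.
    return "February 2017" if question.startswith("When") else None

DEFINITIONS = {"DATE": ["When was the company purchase agreement?"]}

def zero_shot_ner(context):
    entities = []
    for label, questions in DEFINITIONS.items():
        for question in questions:
            answer = fake_qa(question, context)
            if answer and answer in context:
                start = context.index(answer)
                # (chunk, begin, end, label) mirroring the ner_chunk output
                entities.append((answer, start, start + len(answer) - 1, label))
    return entities

print(zero_shot_ner("In February 2017, the Company entered into an agreement."))
# → [('February 2017', 3, 15, 'DATE')]
```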
## Results ```bash +------------------+-------------------+ |chunk |ner_label | +------------------+-------------------+ |March 2012 |DATE | |Vertro |ORG | |ALOT |PRODUCT | |February 2017 |DATE | |NetSeer |ORG | |81.4% |PROFIT_INCREASE | |27% |REVENUES_DECLINED | |$8,048,581 million|OPERATING_LOSS_2020| |$7,738,193 million|OPERATING_LOSS_2019| +------------------+-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_roberta_zeroshot| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|460.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References Financial Roberta Embeddings --- layout: model title: Malay T5ForConditionalGeneration Base Cased model (from mesolitica) author: John Snow Labs name: t5_base_bahasa_cased date: 2023-01-30 tags: [ms, open_source, t5, tensorflow] task: Text Generation language: ms edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-bahasa-cased` is a Malay model originally trained by `mesolitica`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_bahasa_cased_ms_4.3.0_3.0_1675108290125.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_bahasa_cased_ms_4.3.0_3.0_1675108290125.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_base_bahasa_cased","ms") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_base_bahasa_cased","ms") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_base_bahasa_cased| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ms| |Size:|473.3 MB| ## References - https://huggingface.co/mesolitica/t5-base-bahasa-cased - https://github.com/huseinzol05/malaya/tree/master/pretrained-model/t5/prepare - https://github.com/google-research/text-to-text-transfer-transformer - https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5 --- layout: model title: English RobertaForQuestionAnswering Cased model (from LucasS) author: John Snow Labs name: roberta_qa_robertabaseabsa date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `robertaBaseABSA` is an English model originally trained by `LucasS`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_robertabaseabsa_en_4.3.0_3.0_1674222849343.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_robertabaseabsa_en_4.3.0_3.0_1674222849343.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_robertabaseabsa","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_robertabaseabsa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_robertabaseabsa| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|437.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/LucasS/robertaBaseABSA --- layout: model title: Arabic BertForMaskedLM Base Cased model (from CAMeL-Lab) author: John Snow Labs name: bert_embeddings_base_arabic_camel_msa date: 2022-12-02 tags: [ar, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ar edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-arabic-camelbert-msa` is an Arabic model originally trained by `CAMeL-Lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabic_camel_msa_ar_4.2.4_3.0_1670016029582.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabic_camel_msa_ar_4.2.4_3.0_1670016029582.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabic_camel_msa","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabic_camel_msa","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
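The `embeddings` column produced above holds one dense vector per token, and downstream tasks typically compare these vectors with cosine similarity. A toy illustration with made-up 4-dimensional vectors (the real CAMeLBERT embeddings are 768-dimensional):

```python
import math

# Toy cosine similarity between two made-up token vectors; the actual
# embeddings produced by the pipeline above are 768-dimensional.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

v_token_a = [0.2, 0.7, 0.1, 0.4]     # hypothetical vector for one token
v_token_b = [0.25, 0.6, 0.15, 0.5]   # hypothetical vector for a related token
print(round(cosine(v_token_a, v_token_b), 3))  # → 0.982
```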
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_arabic_camel_msa| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|409.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://catalog.ldc.upenn.edu/LDC2011T11 - http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus - https://vlo.clarin.eu/search;jsessionid=31066390B2C9E8C6304845BA79869AC1?1&q=osian - https://archive.org/details/arwiki-20190201 - https://oscar-corpus.com/ - https://github.com/google-research/bert - https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/tokenization.py#L286-L297 - https://github.com/CAMeL-Lab/camel_tools --- layout: model title: French RoBERTa Embeddings (from benjamin) author: John Snow Labs name: roberta_embeddings_roberta_base_wechsel_french date: 2022-04-14 tags: [roberta, embeddings, fr, open_source] task: Embeddings language: fr edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-base-wechsel-french` is a French model originally trained by `benjamin`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_base_wechsel_french_fr_3.4.2_3.0_1649947929675.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_base_wechsel_french_fr_3.4.2_3.0_1649947929675.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_base_wechsel_french","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark Nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_base_wechsel_french","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark Nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fr.embed.roberta_base_wechsel_french").predict("""J'adore Spark Nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_roberta_base_wechsel_french| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|fr| |Size:|468.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/benjamin/roberta-base-wechsel-french - https://github.com/CPJKU/wechsel - https://arxiv.org/abs/2112.06598 --- layout: model title: Stop Words Cleaner for Basque author: John Snow Labs name: stopwords_eu date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: eu edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, eu] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_eu_eu_2.5.4_2.4_1594742441951.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_eu_eu_2.5.4_2.4_1594742441951.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_eu", "eu") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Iparraldeko erregea izateaz gain, mediku ingelesa eta anestesia eta higiene medikoa garatzen duen liderra da John Snow.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_eu", "eu") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Iparraldeko erregea izateaz gain, mediku ingelesa eta anestesia eta higiene medikoa garatzen duen liderra da John Snow.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Iparraldeko erregea izateaz gain, mediku ingelesa eta anestesia eta higiene medikoa garatzen duen liderra da John Snow."""] stopword_df = nlu.load('eu.stopwords').predict(text) stopword_df[["cleanTokens"]] ```
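Under the hood, the cleaner simply drops any token that appears in a fixed Basque stop word list. A minimal pure-Python sketch of that operation — the three stop words below are a small illustrative sample, not the model's actual list:

```python
# Filtering tokens against a stop word set. The words here are a tiny
# illustrative sample, not the full stopwords_eu list bundled with the model.
BASQUE_STOPWORDS = {"eta", "da", "duen"}

def clean_tokens(tokens):
    # Keep only tokens that are not stop words (case-insensitive match).
    return [t for t in tokens if t.lower() not in BASQUE_STOPWORDS]

print(clean_tokens(["mediku", "ingelesa", "eta", "anestesia"]))
# → ['mediku', 'ingelesa', 'anestesia']
```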
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=10, result='Iparraldeko', metadata={'sentence': '0'}), Row(annotatorType='token', begin=12, end=18, result='erregea', metadata={'sentence': '0'}), Row(annotatorType='token', begin=20, end=26, result='izateaz', metadata={'sentence': '0'}), Row(annotatorType='token', begin=28, end=31, result='gain', metadata={'sentence': '0'}), Row(annotatorType='token', begin=32, end=32, result=',', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_eu| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|eu| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: ICDO Entity Resolver author: John Snow Labs name: chunkresolve_icdo_clinical class: ChunkEntityResolverModel language: en nav_key: models repository: clinical/models date: 2020-04-21 task: Entity Resolution edition: Healthcare NLP 2.4.2 spark_version: 2.4 tags: [clinical,licensed,entity_resolution,en] deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Entity resolution model based on KNN, using word embeddings and Word Movers Distance. ## Predicted Entities ICD-O Codes and their normalized definition with `clinical_embeddings`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICDO/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_ICDO.ipynb#scrollTo=Qdh2BQaLI7tU){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icdo_clinical_en_2.4.5_2.4_1587491354644.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icdo_clinical_en_2.4.5_2.4_1587491354644.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python ... model = ChunkEntityResolverModel.pretrained("chunkresolve_icdo_clinical","en","clinical/models") \ .setInputCols("token","chunk_embeddings") \ .setOutputCol("entity") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, clinical_ner_model, clinical_ner_chunker, chunk_embeddings, model]) data = ["""DIAGNOSIS: Left breast adenocarcinoma stage T3 N1b M0, stage IIIA. She has been found more recently to have stage IV disease with metastatic deposits and recurrence involving the chest wall and lower left neck lymph nodes. PHYSICAL EXAMINATION NECK: On physical examination palpable lymphadenopathy is present in the left lower neck and supraclavicular area. No other cervical lymphadenopathy or supraclavicular lymphadenopathy is present. RESPIRATORY: Good air entry bilaterally. Examination of the chest wall reveals a small lesion where the chest wall recurrence was resected. No lumps, bumps or evidence of disease involving the right breast is present. ABDOMEN: Normal bowel sounds, no hepatomegaly. No tenderness on deep palpation. She has just started her last cycle of chemotherapy today, and she wishes to visit her daughter in Brooklyn, New York. After this she will return in approximately 3 to 4 weeks and begin her radiotherapy treatment at that time."""] pipeline_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) light_pipeline = LightPipeline(pipeline_model) result = light_pipeline.annotate(data) ``` ```scala ... val model = ChunkEntityResolverModel.pretrained("chunkresolve_icdo_clinical","en","clinical/models") .setInputCols("token","chunk_embeddings") .setOutputCol("entity") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, clinical_ner_model, clinical_ner_chunker, chunk_embeddings, model)) val data = Seq("DIAGNOSIS: Left breast adenocarcinoma stage T3 N1b M0, stage IIIA. 
She has been found more recently to have stage IV disease with metastatic deposits and recurrence involving the chest wall and lower left neck lymph nodes. PHYSICAL EXAMINATION NECK: On physical examination palpable lymphadenopathy is present in the left lower neck and supraclavicular area. No other cervical lymphadenopathy or supraclavicular lymphadenopathy is present. RESPIRATORY: Good air entry bilaterally. Examination of the chest wall reveals a small lesion where the chest wall recurrence was resected. No lumps, bumps or evidence of disease involving the right breast is present. ABDOMEN: Normal bowel sounds, no hepatomegaly. No tenderness on deep palpation. She has just started her last cycle of chemotherapy today, and she wishes to visit her daughter in Brooklyn, New York. After this she will return in approximately 3 to 4 weeks and begin her radiotherapy treatment at that time.").toDF("text") val result = pipeline.fit(data).transform(data) ```
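Conceptually, the resolver embeds the detected chunk and returns the ICD-O code whose description is nearest in embedding space. The toy sketch below substitutes made-up 2-d vectors and Euclidean distance for the model's clinical embeddings and Word Movers Distance, purely to show the nearest-neighbour idea:

```python
import math

# Toy nearest-neighbour code resolution. The vectors and the two codes are
# made up for illustration; the real model compares clinical word embeddings
# with Word Movers Distance over the full ICD-O vocabulary.
CODE_VECTORS = {
    "8500/2": [0.9, 0.1],  # hypothetical embedding of a carcinoma description
    "9140/3": [0.1, 0.8],  # hypothetical embedding of the Kaposi sarcoma description
}

def resolve(chunk_vector):
    # Return the code whose description embedding is closest to the chunk.
    return min(CODE_VECTORS, key=lambda code: math.dist(chunk_vector, CODE_VECTORS[code]))

print(resolve([0.85, 0.2]))  # → 8500/2
```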
{:.h2_title} ## Results ```bash | | chunk | begin | end | entity | icdo_description | icdo_code | |---|----------------------------|-------|-----|--------|---------------------------------------------|-----------| | 0 | Left breast adenocarcinoma | 11 | 36 | Cancer | Intraductal carcinoma, noninfiltrating, NOS | 8500/2 | | 1 | T3 N1b M0 | 44 | 52 | Cancer | Kaposi sarcoma | 9140/3 | ``` {:.model-param} ## Model Information {:.table-model} |----------------|----------------------------| | Name: | chunkresolve_icdo_clinical | | Type: | ChunkEntityResolverModel | | Compatibility: | Spark NLP 2.4.2+ | | License: | Licensed | | Edition: | Official | | Input labels: | token, chunk_embeddings | | Output labels: | entity | | Language: | en | | Case sensitive: | True | | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained on the ICD-O Histology Behaviour dataset: [https://apps.who.int/iris/bitstream/handle/10665/96612/9789241548496_eng.pdf](https://apps.who.int/iris/bitstream/handle/10665/96612/9789241548496_eng.pdf) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from FOFer) author: John Snow Labs name: distilbert_qa_fofer_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `FOFer`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_fofer_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768487186.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_fofer_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768487186.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_fofer_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_fofer_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_fofer_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/FOFer/distilbert-base-uncased-finetuned-squad --- layout: model title: English BertForQuestionAnswering model (from aodiniz) author: John Snow Labs name: bert_qa_bert_uncased_L_10_H_512_A_8_squad2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-10_H-512_A-8_squad2` is an English model originally trained by `aodiniz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_10_H_512_A_8_squad2_en_4.0.0_3.0_1654185195977.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_10_H_512_A_8_squad2_en_4.0.0_3.0_1654185195977.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_10_H_512_A_8_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_uncased_L_10_H_512_A_8_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.uncased_10l_512d_a8a_512d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
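Under the hood, extractive QA annotators such as `BertForQuestionAnswering` score every context token as a potential answer start and answer end, then return the best-scoring valid span. A minimal, framework-independent sketch of that span selection (illustrative only — not the Spark NLP API, and the scores are toy values):

```python
def best_span(start_scores, end_scores, max_len=30):
    """Pick the (start, end) token pair maximizing start+end score, with end >= start."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        # only consider spans of bounded length that end at or after the start
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# toy logits over 5 context tokens
start = [0.1, 2.0, 0.3, 0.0, -1.0]
end = [0.0, 0.5, 3.0, 0.2, -2.0]
print(best_span(start, end))  # (1, 2)
```

The real annotator works on subword tokens and maps the chosen span back to the original context text, but the selection principle is the same.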
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_uncased_L_10_H_512_A_8_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|178.3 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-10_H-512_A-8_squad2 --- layout: model title: Arabic Part of Speech Tagger (from CAMeL-Lab) author: John Snow Labs name: bert_pos_bert_base_arabic_camelbert_ca_pos_egy date: 2022-04-26 tags: [bert, pos, part_of_speech, ar, open_source] task: Part of Speech Tagging language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-ca-pos-egy` is an Arabic model originally trained by `CAMeL-Lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_ca_pos_egy_ar_3.4.2_3.0_1650993368525.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_ca_pos_egy_ar_3.4.2_3.0_1650993368525.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_ca_pos_egy","ar") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_ca_pos_egy","ar") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("أنا أحب الشرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.pos.arabic_camelbert_ca_pos_egy").predict("""أنا أحب الشرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_arabic_camelbert_ca_pos_egy| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|ar| |Size:|407.4 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-ca-pos-egy - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://github.com/CAMeL-Lab/camel_tools --- layout: model title: Legal Confidentiality Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_confidentiality_bert date: 2023-03-05 tags: [en, legal, classification, clauses, confidentiality, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Confidentiality` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. 
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Confidentiality`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_confidentiality_bert_en_1.0.0_3.0_1678050626282.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_confidentiality_bert_en_1.0.0_3.0_1678050626282.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_confidentiality_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Confidentiality]| |[Other]| |[Other]| |[Confidentiality]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_confidentiality_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Confidentiality 0.94 0.99 0.96 123 Other 0.99 0.95 0.97 150 accuracy - - 0.97 273 macro-avg 0.97 0.97 0.97 273 weighted-avg 0.97 0.97 0.97 273 ``` --- layout: model title: Word Embeddings for Japanese (japanese_cc_300d) author: John Snow Labs name: japanese_cc_300d date: 2021-09-09 tags: [embeddings, open_source, ja] task: Embeddings language: ja edition: Spark NLP 3.2.2 spark_version: 3.0 supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained on Common Crawl and Wikipedia using fastText. It is trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives. The model gives 300-dimensional vector outputs per token. The output vectors map words into a meaningful space where the distance between the vectors is related to semantic similarity of words. These embeddings can be used in multiple tasks like semantic word similarity, named entity recognition, sentiment analysis, and classification. 
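The claim that vector distance tracks semantic similarity is usually checked with cosine similarity between embedding vectors. A small self-contained sketch of the computation (using toy 3-d vectors standing in for the 300-d fastText embeddings, not real model weights):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product normalized by the vector magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy vectors: "cat" and "kitten" point in similar directions, "car" does not
cat = [1.0, 0.9, 0.1]
kitten = [0.9, 1.0, 0.2]
car = [0.1, 0.2, 1.0]
print(cosine(cat, kitten) > cosine(cat, car))  # related words score higher
```

With real embeddings from the `embeddings` output column, the same function ranks nearest neighbors for semantic word similarity tasks.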
## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/japanese_cc_300d_ja_3.2.2_3.0_1631192388744.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/japanese_cc_300d_ja_3.2.2_3.0_1631192388744.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import sparknlp from sparknlp.base import * from sparknlp.annotator import * from pyspark.ml import Pipeline documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") \ .setInputCols(["sentence"]) \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("japanese_cc_300d", "ja") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") pipeline = Pipeline().setStages([ documentAssembler, sentence, word_segmenter, embeddings ]) data = spark.createDataFrame([["宮本茂氏は、日本の任天堂のゲームプロデューサーです。"]]).toDF("text") model = pipeline.fit(data) result = model.transform(data) result.selectExpr("explode(arrays_zip(embeddings.result, embeddings.embeddings))").show() ``` ```scala import spark.implicits._ import com.johnsnowlabs.nlp.DocumentAssembler import com.johnsnowlabs.nlp.annotator.{SentenceDetector, WordSegmenterModel} import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("japanese_cc_300d", "ja") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentence, word_segmenter, embeddings )) val data = Seq("宮本茂氏は、日本の任天堂のゲームプロデューサーです。").toDF("text") val model = pipeline.fit(data) val result = model.transform(data) result.selectExpr("explode(arrays_zip(embeddings.result, embeddings.embeddings))").show() ``` {:.nlu-block} ```python import nlu 
nlu.load("ja.embed.glove.cc_300d").predict("""宮本茂氏は、日本の任天堂のゲームプロデューサーです。""") ```
## Results ```bash +---------------------------+ | col| +---------------------------+ | [宮本, [0.1944, 0.4...| | [茂, [-0.079, 0.09...| | [氏, [-0.1053, 0.1...| | [は, [0.0732, -0.0...| | [、, [0.0571, -0.0...| | [日本, [0.1844, 0.0...| | [の, [0.0109, -0.0...| | [任天, [0.0, 0.0, 0...| | [堂, [-0.1972, 0.0...| | [の, [0.0109, -0.0...| | [ゲーム, [0.013, 0.0...| |[プロデューサー, [-0.010...| | [です, [0.0036, -0....| | [。, [0.069, -0.01...| +---------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|japanese_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.2.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ja| |Case sensitive:|false| |Dimension:|300| ## Data Source This model is imported from https://fasttext.cc/docs/en/crawl-vectors.html --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_fpdm_triplet_ft_new_news date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_triplet_roberta_FT_new_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_triplet_ft_new_news_en_4.3.0_3.0_1674211184082.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_triplet_ft_new_news_en_4.3.0_3.0_1674211184082.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_triplet_ft_new_news","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_triplet_ft_new_news","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_fpdm_triplet_ft_new_news| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|461.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/fpdm_triplet_roberta_FT_new_newsqa --- layout: model title: Legal Indemnifications Clause Binary Classifier author: John Snow Labs name: legclf_cuad_indemnifications_clause date: 2022-09-27 tags: [cuad, indemnifications, en, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `indemnifications` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). 
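The first splitting technique mentioned above, paragraph splitting by multiline, can be approximated with a plain regex split on blank lines. A minimal sketch, independent of the workshop notebook (the sample text is invented for illustration):

```python
import re

def split_paragraphs(document: str):
    """Split a document into candidate provisions on one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

doc = (
    "1. Indemnification.\nThe Seller shall indemnify the Buyer against all claims."
    "\n\n"
    "2. Confidentiality.\nEach party shall keep the terms of this Agreement confidential."
)
print(split_paragraphs(doc))  # two provisions, one per clause
```

Each resulting piece can then be fed to the classifier as its own row, keeping every input comfortably under the 512-token limit.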
This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `indemnifications` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cuad_indemnifications_clause_en_1.0.0_3.0_1664272531526.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cuad_indemnifications_clause_en_1.0.0_3.0_1664272531526.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_cuad_indemnifications_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
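Since each binary clause classifier emits one label per document, running several of them over the same text yields one label per clause type, which can be merged into a clause → True/False map. A pure-Python sketch of that post-processing step (the function and variable names are hypothetical, not a Spark NLP API):

```python
def merge_clause_flags(predictions: dict) -> dict:
    """predictions maps a clause name to the label its binary classifier returned;
    any label other than 'other' means the clause was detected."""
    return {clause: label.lower() != "other" for clause, label in predictions.items()}

# labels as they would come out of two classifiers' 'category' columns
preds = {"indemnifications": "indemnifications", "confidentiality": "other"}
print(merge_clause_flags(preds))  # {'indemnifications': True, 'confidentiality': False}
```

This is the "series of True/False values" described above, one entry per clause model added to the pipeline.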
## Results ```bash +-------+ | result| +-------+ |[indemnifications]| |[other]| |[other]| |[indemnifications]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_cuad_indemnifications_clause| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|21.9 MB| ## References In-house annotations on CUAD dataset ## Benchmarking ```bash label precision recall f1-score support indemnifications 1.00 0.83 0.91 12 other 0.83 1.00 0.91 10 accuracy - - 0.91 22 macro avg 0.92 0.92 0.91 22 weighted avg 0.92 0.91 0.91 22 ``` --- layout: model title: English BertForMaskedLM Large Cased model author: John Snow Labs name: bert_embeddings_large_cased_whole_word_masking date: 2022-12-02 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-cased-whole-word-masking` is an English model originally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_cased_whole_word_masking_en_4.2.4_3.0_1670020123161.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_cased_whole_word_masking_en_4.2.4_3.0_1670020123161.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_cased_whole_word_masking","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_cased_whole_word_masking","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_large_cased_whole_word_masking| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| ## References - https://huggingface.co/bert-large-cased-whole-word-masking - https://arxiv.org/abs/1810.04805 - https://github.com/google-research/bert - https://yknzhu.wixsite.com/mbweb - https://en.wikipedia.org/wiki/English_Wikipedia --- layout: model title: English RobertaForQuestionAnswering (from comacrae) author: John Snow Labs name: roberta_qa_roberta_eda_and_parav3 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-eda-and-parav3` is a English model originally trained by `comacrae`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_eda_and_parav3_en_4.0.0_3.0_1655735762543.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_eda_and_parav3_en_4.0.0_3.0_1655735762543.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_eda_and_parav3","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_eda_and_parav3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.eda_and_parav3.by_comacrae").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_eda_and_parav3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/comacrae/roberta-eda-and-parav3 --- layout: model title: Part of Speech for Bulgarian author: John Snow Labs name: pos_ud_btb date: 2021-03-08 tags: [part_of_speech, open_source, bulgarian, pos_ud_btb, bg] task: Part of Speech Tagging language: bg edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - ADP - NOUN - PUNCT - VERB - AUX - PRON - ADJ - PART - ADV - INTJ - DET - PROPN - CCONJ - NUM - SCONJ - X {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_btb_bg_3.0.0_3.0_1615230275121.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_btb_bg_3.0.0_3.0_1615230275121.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_btb", "bg") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Здравейте от Lak Snow Labs! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_btb", "bg") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Здравейте от Lak Snow Labs! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Здравейте от Lak Snow Labs! "] token_df = nlu.load('bg.pos.ud_btb').predict(text) token_df ```
## Results ```bash token pos 0 Здравейте VERB 1 от ADP 2 Lak ADJ 3 Snow PROPN 4 Labs PROPN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_btb| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|bg| --- layout: model title: Pipeline to Detect Medication author: John Snow Labs name: ner_medication_pipeline date: 2022-07-28 tags: [ner, en, licensed] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pretrained pipeline to detect medication entities. It was built on top of the `ner_posology_greedy` model and also augmented with the drug names mentioned in the UK and US drugbank datasets. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_medication_pipeline_en_4.0.0_3.0_1658987434372.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_medication_pipeline_en_4.0.0_3.0_1658987434372.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline ner_medication_pipeline = PretrainedPipeline("ner_medication_pipeline", "en", "clinical/models") text = """The patient was prescribed metformin 1000 MG, and glipizide 2.5 MG. The other patient was given Fragmin 5000 units, Xenaderm to wounds topically b.i.d. and OxyContin 30 mg.""" result = ner_medication_pipeline.fullAnnotate([text]) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val ner_medication_pipeline = new PretrainedPipeline("ner_medication_pipeline", "en", "clinical/models") val result = ner_medication_pipeline.fullAnnotate("The patient was prescribed metformin 1000 MG, and glipizide 2.5 MG. The other patient was given Fragmin 5000 units, Xenaderm to wounds topically b.i.d. and OxyContin 30 mg.") ``` ## Results ```bash | ner_chunk | entity | |:-------------------|:---------| | metformin 1000 MG | DRUG | | glipizide 2.5 MG | DRUG | | Fragmin 5000 units | DRUG | | Xenaderm | DRUG | | OxyContin 30 mg | DRUG | ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_medication_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel - TextMatcherModel - ChunkMergeModel - Finisher --- layout: model title: ALBERT Embeddings (Large Uncased) author: John Snow Labs name: albert_large_uncased date: 2020-04-28 task: Embeddings language: en nav_key: models edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: AlBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description ALBERT is "A Lite" version of BERT, a popular unsupervised language representation learning algorithm. ALBERT uses parameter-reduction techniques that allow for large-scale configurations, overcome previous memory limitations, and achieve better behavior with respect to model degradation. The details are described in the paper "[ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.](https://arxiv.org/abs/1909.11942)" {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_large_uncased_en_2.5.0_2.4_1588073397355.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_large_uncased_en_2.5.0_2.4_1588073397355.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = AlbertEmbeddings.pretrained("albert_large_uncased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = AlbertEmbeddings.pretrained("albert_large_uncased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.albert.large_uncased').predict(text, output_level='token') embeddings_df ```
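Each row of the output pairs a token with a 1024-dimensional vector. As a quick sanity check on such vectors, cosine similarity can be computed in plain Python; the 2- and 3-dimensional vectors below are toy stand-ins for the real embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy vectors standing in for the 1024-d ALBERT embeddings.
print(round(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]), 4))  # 1.0 (identical)
print(round(cosine([1.0, 0.0], [0.0, 1.0]), 4))            # 0.0 (orthogonal)
```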
{:.h2_title} ## Results ```bash token en_embed_albert_large_uncased_embeddings I [0.3967159688472748, -0.6448764801025391, -0.3... love [1.1107065677642822, -0.2454298734664917, 0.60... NLP [0.02937467396259308, -0.7092287540435791, -0.... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_large_uncased| |Type:|embeddings| |Compatibility:|Spark NLP 2.5.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|1024| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/google/albert_large/3](https://tfhub.dev/google/albert_large/3) --- layout: model title: Stop Words Cleaner for Russian author: John Snow Labs name: stopwords_ru date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: ru edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, ru] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_ru_ru_2.5.4_2.4_1594742439248.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_ru_ru_2.5.4_2.4_1594742439248.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_ru", "ru") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Помимо того, что он король севера, Джон Сноу - английский врач и лидер в разработке анестезии и медицинской гигиены.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_ru", "ru") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Помимо того, что он король севера, Джон Сноу - английский врач и лидер в разработке анестезии и медицинской гигиены.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Помимо того, что он король севера, Джон Сноу - английский врач и лидер в разработке анестезии и медицинской гигиены."""] stopword_df = nlu.load('ru.stopwords').predict(text) stopword_df[['cleanTokens']] ```
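Under the hood the cleaner simply drops tokens that appear in a language-specific stopword list. A minimal sketch of the idea (the stopword set below is an illustrative subset, not the model's actual Russian list):

```python
RU_STOPWORDS = {"что", "он", "и", "в", "того"}  # illustrative subset only

def clean_tokens(tokens, stopwords=RU_STOPWORDS):
    """Drop tokens whose lowercase form appears in the stopword list."""
    return [t for t in tokens if t.lower() not in stopwords]

print(clean_tokens(["Помимо", "того", ",", "что", "он", "король", "севера"]))
# ['Помимо', ',', 'король', 'севера']
```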
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=5, result='Помимо', metadata={'sentence': '0'}), Row(annotatorType='token', begin=11, end=11, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=20, end=25, result='король', metadata={'sentence': '0'}), Row(annotatorType='token', begin=27, end=32, result='севера', metadata={'sentence': '0'}), Row(annotatorType='token', begin=33, end=33, result=',', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_ru| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|ru| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: Arabic Bert Embeddings (Large) author: John Snow Labs name: bert_embeddings_bert_large_arabic date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-arabic` is an Arabic model originally trained by `asafaya`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_arabic_ar_3.4.2_3.0_1649678414101.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_arabic_ar_3.4.2_3.0_1649678414101.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_arabic","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_arabic","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.bert_large_arabic").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_large_arabic| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|1.3 GB| |Case sensitive:|true| ## References - https://huggingface.co/asafaya/bert-large-arabic - https://traces1.inria.fr/oscar/ - http://commoncrawl.org/ - https://dumps.wikimedia.org/backup-index.html - https://github.com/google-research/bert - https://www.tensorflow.org/tfrc - https://github.com/alisafaya/Arabic-BERT --- layout: model title: Hindi BertForQuestionAnswering model (from Sindhu) author: John Snow Labs name: bert_qa_muril_large_squad2 date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: hi edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `muril-large-squad2` is a Hindi model originally trained by `Sindhu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_muril_large_squad2_hi_4.0.0_3.0_1654188807056.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_muril_large_squad2_hi_4.0.0_3.0_1654188807056.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_muril_large_squad2","hi") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_muril_large_squad2","hi") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("hi.answer_question.squadv2.bert.large").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_muril_large_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|hi| |Size:|1.9 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Sindhu/muril-large-squad2 - https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/ - https://twitter.com/batw0man --- layout: model title: Google's Tapas Table Understanding (Medium, WIKISQL) author: John Snow Labs name: table_qa_tapas_medium_finetuned_wikisql_supervised date: 2022-09-30 tags: [en, table, qa, question, answering, open_source] task: Table Question Answering language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: TapasForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a zero-shot table understanding model that lets you carry out Question Answering on Spark DataFrames. If your table is stored in a file format such as CSV, load it into a Spark DataFrame first. Size of this model: Medium. Has aggregation operations?: True ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_medium_finetuned_wikisql_supervised_en_4.2.0_3.0_1664530746170.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_medium_finetuned_wikisql_supervised_en_4.2.0_3.0_1664530746170.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python json_data = """ { "header": ["name", "money", "age"], "rows": [ ["Donald Trump", "$100,000,000", "75"], ["Elon Musk", "$20,000,000,000,000", "55"] ] } """ queries = [ "Who earns less than 200,000,000?", "Who earns 100,000,000?", "How much money has Donald Trump?", "How old are they?", ] data = spark.createDataFrame([ [json_data, " ".join(queries)] ]).toDF("table_json", "questions") document_assembler = MultiDocumentAssembler() \ .setInputCols("table_json", "questions") \ .setOutputCols("document_table", "document_questions") sentence_detector = SentenceDetector() \ .setInputCols(["document_questions"]) \ .setOutputCol("questions") table_assembler = TableAssembler()\ .setInputCols(["document_table"])\ .setOutputCol("table") tapas = TapasForQuestionAnswering\ .pretrained("table_qa_tapas_medium_finetuned_wikisql_supervised","en")\ .setInputCols(["questions", "table"])\ .setOutputCol("answers") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, table_assembler, tapas ]) model = pipeline.fit(data) model\ .transform(data)\ .selectExpr("explode(answers) AS answer")\ .select("answer")\ .show(truncate=False) ```
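The description notes that a table stored as a file (e.g. CSV) must be loaded before use. A stdlib-only sketch of turning CSV text into the `{"header": ..., "rows": ...}` JSON string expected in the `table_json` column (the helper name is an assumption for illustration):

```python
import csv
import io
import json

def csv_to_table_json(csv_text):
    """Convert CSV text into the {'header': [...], 'rows': [[...]]} JSON string."""
    reader = csv.reader(io.StringIO(csv_text))
    rows = list(reader)
    return json.dumps({"header": rows[0], "rows": rows[1:]})

csv_text = 'name,money,age\nDonald Trump,"$100,000,000",75\n'
print(csv_to_table_json(csv_text))
# {"header": ["name", "money", "age"], "rows": [["Donald Trump", "$100,000,000", "75"]]}
```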
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |answer | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} | |{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|table_qa_tapas_medium_finetuned_wikisql_supervised| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|157.5 MB| |Case sensitive:|false| ## References https://www.microsoft.com/en-us/download/details.aspx?id=54253 https://github.com/ppasupat/WikiTableQuestions https://github.com/salesforce/WikiSQL --- layout: model title: English image_classifier_vit_autotrain_dog_vs_food ViTForImageClassification from abhishek author: John Snow Labs name: image_classifier_vit_autotrain_dog_vs_food date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification 
article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_autotrain_dog_vs_food` is an English model originally trained by abhishek. ## Predicted Entities `dog`, `food` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_autotrain_dog_vs_food_en_4.1.0_3.0_1660171758307.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_autotrain_dog_vs_food_en_4.1.0_3.0_1660171758307.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_autotrain_dog_vs_food", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_autotrain_dog_vs_food", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
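Both snippets assume an `imageDF` Spark DataFrame of images, typically produced with `spark.read.format("image").load(path)`. As a small illustration of gathering the image paths such a loader would consume, here is a stdlib-only sketch (the helper name and extension list are assumptions):

```python
import pathlib
import tempfile

def list_images(folder, exts=(".jpg", ".jpeg", ".png")):
    """Collect image file paths under a folder, sorted for determinism."""
    return sorted(str(p) for p in pathlib.Path(folder).rglob("*")
                  if p.suffix.lower() in exts)

# Demonstrate with a throwaway directory containing two dummy files.
with tempfile.TemporaryDirectory() as d:
    (pathlib.Path(d) / "dog.jpg").touch()
    (pathlib.Path(d) / "notes.txt").touch()
    print([pathlib.Path(p).name for p in list_images(d)])  # ['dog.jpg']
```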
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_autotrain_dog_vs_food| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from Moussab) author: John Snow Labs name: roberta_qa_deepset_base_squad2_orkg_no_label_5e_05 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deepset-roberta-base-squad2-orkg-no-label-5e-05` is an English model originally trained by `Moussab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_no_label_5e_05_en_4.3.0_3.0_1674209664953.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_no_label_5e_05_en_4.3.0_3.0_1674209664953.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_no_label_5e_05","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_no_label_5e_05","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_deepset_base_squad2_orkg_no_label_5e_05| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Moussab/deepset-roberta-base-squad2-orkg-no-label-5e-05 --- layout: model title: English BertForQuestionAnswering model (from HankyStyle) author: John Snow Labs name: bert_qa_Multi_ling_BERT date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Multi-ling-BERT` is an English model originally trained by `HankyStyle`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Multi_ling_BERT_en_4.0.0_3.0_1654178859506.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Multi_ling_BERT_en_4.0.0_3.0_1654178859506.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Multi_ling_BERT","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_Multi_ling_BERT","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.by_HankyStyle").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_Multi_ling_BERT| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|626.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/HankyStyle/Multi-ling-BERT --- layout: model title: Word2Vec Embeddings in Swahili (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, sw, open_source] task: Embeddings language: sw edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sw_3.4.1_3.0_1647459595377.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sw_3.4.1_3.0_1647459595377.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sw") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ninapenda Spark NLP."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sw") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ninapenda Spark NLP.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("sw.embed.w2v_cc_300d").predict("""Ninapenda Spark NLP.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|sw| |Size:|224.0 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English DistilBertForQuestionAnswering model (from threem) Squad2 author: John Snow Labs name: distilbert_qa_mysquadv2_8Jan22_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mysquadv2_8Jan22-finetuned-squad` is an English model originally trained by `threem`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_mysquadv2_8Jan22_finetuned_squad_en_4.0.0_3.0_1654728468416.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_mysquadv2_8Jan22_finetuned_squad_en_4.0.0_3.0_1654728468416.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_mysquadv2_8Jan22_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_mysquadv2_8Jan22_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.distil_bert.v2.by_threem").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_mysquadv2_8Jan22_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/threem/mysquadv2_8Jan22-finetuned-squad --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from ArneD) author: John Snow Labs name: xlmroberta_ner_arned_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `ArneD`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_arned_base_finetuned_panx_de_4.1.0_3.0_1660429307551.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_arned_base_finetuned_panx_de_4.1.0_3.0_1660429307551.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_arned_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_arned_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_arned_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/ArneD/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_anan0329 TFWav2Vec2ForCTC from anan0329 author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab_by_anan0329 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_anan0329` is an English model originally trained by anan0329. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_by_anan0329_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_anan0329_en_4.2.0_3.0_1664114661625.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_anan0329_en_4.2.0_3.0_1664114661625.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_by_anan0329", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_by_anan0329", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab_by_anan0329| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: NER Pipeline for Tests - Voice of the Patient author: John Snow Labs name: ner_vop_test_pipeline date: 2023-06-10 tags: [licensed, pipeline, ner, en, vop, test] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline extracts mentions of tests and their results from health-related text in colloquial language. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_test_pipeline_en_4.4.3_3.0_1686427000395.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_test_pipeline_en_4.4.3_3.0_1686427000395.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_vop_test_pipeline", "en", "clinical/models") pipeline.annotate(" I went to the endocrinology department to get my thyroid levels checked. They ordered a blood test and found out that I have hypothyroidism, so now I'm on medication to manage it. ") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_vop_test_pipeline", "en", "clinical/models") val result = pipeline.annotate(" I went to the endocrinology department to get my thyroid levels checked. They ordered a blood test and found out that I have hypothyroidism, so now I'm on medication to manage it. ") ```
## Results ```bash | chunk | ner_label | |:---------------|:------------| | thyroid levels | Test | | blood test | Test | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_test_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|791.6 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Legal Representations Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_representations_bert date: 2023-03-05 tags: [en, legal, classification, clauses, representations, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Representations` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.
If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Representations`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_representations_bert_en_1.0.0_3.0_1678050021027.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_representations_bert_en_1.0.0_3.0_1678050021027.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_representations_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
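As a rough illustration of the "paragraph splitting (by multiline)" technique recommended in the description above, a long document can be split on blank lines before building the input DataFrame. This is a minimal sketch, not part of the official tutorial; the sample text and variable names are illustrative:

```python
# Minimal sketch: split a long legal document into provisions on blank
# lines, so each provision can be classified independently.
legal_text = (
    "The Borrower represents and warrants that...\n\n"
    "Each party hereby releases the other from...\n\n"
    "This Agreement shall be governed by..."
)

# One entry per provision; empty fragments are dropped.
paragraphs = [p.strip() for p in legal_text.split("\n\n") if p.strip()]

# Each paragraph then becomes one row of the input DataFrame, e.g.:
# df = spark.createDataFrame([[p] for p in paragraphs]).toDF("text")
print(len(paragraphs))  # 3
```

Feeding one provision per row keeps each classification within the 512-token limit of the sentence embeddings.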
## Results ```bash +-----------------+ |result | +-----------------+ |[Representations]| |[Other] | |[Other] | |[Representations]| +-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_representations_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Training dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.92 0.91 0.92 93 Representations 0.88 0.89 0.89 66 accuracy - - 0.91 159 macro-avg 0.90 0.90 0.90 159 weighted-avg 0.91 0.91 0.91 159 ``` --- layout: model title: RxNorm to MeSH Code Mapping author: John Snow Labs name: rxnorm_mesh_mapping date: 2023-06-13 tags: [rxnorm, mesh, en, licensed, pipeline] task: Pipeline Healthcare language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps RxNorm codes to MeSH codes without using any text data. You just feed white space-delimited RxNorm codes and it will return the corresponding MeSH codes as a list. If no mapping exists, the original code is returned unchanged. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_mesh_mapping_en_4.4.4_3.2_1686663529810.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_mesh_mapping_en_4.4.4_3.2_1686663529810.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("rxnorm_mesh_mapping","en","clinical/models") pipeline.annotate("1191 6809 47613") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("rxnorm_mesh_mapping","en","clinical/models") val result = pipeline.annotate("1191 6809 47613") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.rxnorm.mesh").predict("""1191 6809 47613""") ```
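Since `annotate` returns the RxNorm inputs and MeSH outputs as parallel lists, they can be zipped into a plain lookup dictionary for downstream use. A sketch using the codes shown in this card's Results section:

```python
# Sketch: turn the pipeline's parallel output lists into a code-to-code map.
# The codes below are the ones shown in this card's Results section.
result = {
    "rxnorm": ["1191", "6809", "47613"],
    "mesh": ["D001241", "D008687", "D019355"],
}
rxnorm_to_mesh = dict(zip(result["rxnorm"], result["mesh"]))
print(rxnorm_to_mesh["6809"])  # D008687
```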
## Results ```bash {'rxnorm': ['1191', '6809', '47613'], 'mesh': ['D001241', 'D008687', 'D019355']} Note: | RxNorm | Details | | ---------- | -------------------:| | 1191 | aspirin | | 6809 | metformin | | 47613 | calcium citrate | | MeSH | Details | | ---------- | -------------------:| | D001241 | Aspirin | | D008687 | Metformin | | D019355 | Calcium Citrate | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|rxnorm_mesh_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|103.6 KB| ## Included Models - DocumentAssembler - TokenizerModel - LemmatizerModel - Finisher --- layout: model title: T5 for Active to Passive Style Transfer author: John Snow Labs name: t5_active_to_passive_styletransfer date: 2022-01-12 tags: [t5, open_source, en] task: Text Generation language: en nav_key: models edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a text-to-text model based on T5, fine-tuned to generate passively written text from an actively written input, for the task "transfer Active to Passive:". It is based on Prithiviraj Damodaran's Styleformer.
## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/T5_LINGUISTIC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/T5_LINGUISTIC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_active_to_passive_styletransfer_en_3.4.0_3.0_1641987400533.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_active_to_passive_styletransfer_en_3.4.0_3.0_1641987400533.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import sparknlp from sparknlp.base import * from sparknlp.annotator import * spark = sparknlp.start() documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") t5 = T5Transformer.pretrained("t5_active_to_passive_styletransfer") \ .setTask("transfer Active to Passive:") \ .setInputCols(["documents"]) \ .setMaxOutputLength(200) \ .setOutputCol("transfers") pipeline = Pipeline().setStages([documentAssembler, t5]) data = spark.createDataFrame([["I am writing you a letter."]]).toDF("text") result = pipeline.fit(data).transform(data) result.select("transfers.result").show(truncate=False) ``` ```scala import spark.implicits._ import com.johnsnowlabs.nlp.base.DocumentAssembler import com.johnsnowlabs.nlp.annotators.seq2seq.T5Transformer import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("documents") val t5 = T5Transformer.pretrained("t5_active_to_passive_styletransfer") .setTask("transfer Active to Passive:") .setMaxOutputLength(200) .setInputCols("documents") .setOutputCol("transfer") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("I am writing you a letter.").toDF("text") val result = pipeline.fit(data).transform(data) result.select("transfer.result").show(false) ``` {:.nlu-block} ```python import nlu nlu.load("en.t5.active_to_passive_styletransfer").predict("""transfer Active to Passive:""") ```
## Results ```bash +---------------------------+ |result | +---------------------------+ |[a letter is written by me]| +---------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_active_to_passive_styletransfer| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[transfers]| |Language:|en| |Size:|264.5 MB| ## Data Source The original model is from the transformers library: https://huggingface.co/prithivida/active_to_passive_styletransfer --- layout: model title: Luo (Kenya and Tanzania) XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_ner date: 2022-08-01 tags: [luo, open_source, xlm_roberta, ner] task: Named Entity Recognition language: luo edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-ner-luo` is a Luo (Kenya and Tanzania) model originally trained by `mbeukman`. ## Predicted Entities `DATE`, `PER`, `ORG`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_ner_luo_4.1.0_3.0_1659355137886.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_ner_luo_4.1.0_3.0_1659355137886.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_ner","luo") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_ner","luo") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_ner| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|luo| |Size:|772.6 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-ner-luo - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://github.com/masakhane-io/masakhane-ner --- layout: model title: Pipeline to Detect Anatomical and Observation Entities in Chest Radiology Reports (CheXpert) author: John Snow Labs name: ner_chexpert_pipeline date: 2023-03-14 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_chexpert](https://nlp.johnsnowlabs.com/2021/09/30/ner_chexpert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chexpert_pipeline_en_4.3.0_3.2_1678779791404.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chexpert_pipeline_en_4.3.0_3.2_1678779791404.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_chexpert_pipeline", "en", "clinical/models") text = '''FINAL REPORT HISTORY : Chest tube leak , to assess for pneumothorax. FINDINGS : In comparison with study of ___ , the endotracheal tube and Swan - Ganz catheter have been removed . The left chest tube remains in place and there is no evidence of pneumothorax. Mild atelectatic changes are seen at the left base.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_chexpert_pipeline", "en", "clinical/models") val text = "FINAL REPORT HISTORY : Chest tube leak , to assess for pneumothorax. FINDINGS : In comparison with study of ___ , the endotracheal tube and Swan - Ganz catheter have been removed . The left chest tube remains in place and there is no evidence of pneumothorax. Mild atelectatic changes are seen at the left base." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.chexpert.pipeline").predict("""FINAL REPORT HISTORY : Chest tube leak , to assess for pneumothorax. FINDINGS : In comparison with study of ___ , the endotracheal tube and Swan - Ganz catheter have been removed . The left chest tube remains in place and there is no evidence of pneumothorax. Mild atelectatic changes are seen at the left base.""") ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-------------|--------:|------:|:------------|-------------:| | 0 | endotracheal | 118 | 129 | OBS | 0.9881 | | 1 | tube | 131 | 134 | OBS | 0.9996 | | 2 | Swan - Ganz | 140 | 150 | OBS | 0.9625 | | 3 | catheter | 152 | 159 | OBS | 0.9919 | | 4 | left | 185 | 188 | ANAT | 0.9983 | | 5 | chest | 190 | 194 | ANAT | 0.9749 | | 6 | tube | 196 | 199 | OBS | 0.9999 | | 7 | in place | 209 | 216 | OBS | 0.9894 | | 8 | pneumothorax | 246 | 257 | OBS | 0.9997 | | 9 | Mild | 260 | 263 | OBS | 0.9988 | | 10 | atelectatic | 265 | 275 | OBS | 0.9986 | | 11 | changes | 277 | 283 | OBS | 0.9984 | | 12 | left | 301 | 304 | ANAT | 0.9999 | | 13 | base | 306 | 309 | ANAT | 0.9999 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_chexpert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Arabic ALBERT Embeddings (Large) author: John Snow Labs name: albert_embeddings_albert_large_arabic date: 2022-04-14 tags: [albert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: AlBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `albert-large-arabic` is an Arabic model originally trained by `asafaya`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_large_arabic_ar_3.4.2_3.0_1649954278357.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_large_arabic_ar_3.4.2_3.0_1649954278357.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_large_arabic","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_large_arabic","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.albert_large_arabic").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_embeddings_albert_large_arabic| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|68.0 MB| |Case sensitive:|false| ## References - https://huggingface.co/asafaya/albert-large-arabic - https://oscar-corpus.com/ - http://commoncrawl.org/ - https://dumps.wikimedia.org/backup-index.html - https://github.com/google-research/albert - https://www.tensorflow.org/tfrc - https://github.com/KUIS-AI-Lab/Arabic-ALBERT/ --- layout: model title: Legal Release Clause Binary Classifier author: John Snow Labs name: legclf_release_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `release` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `release` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_release_clause_en_1.0.0_3.2_1660122916319.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_release_clause_en_1.0.0_3.2_1660122916319.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_release_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[release]| |[other]| |[other]| |[release]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_release_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scrapped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.88 1.00 0.93 70 release 1.00 0.50 0.67 20 accuracy - - 0.89 90 macro-avg 0.94 0.75 0.80 90 weighted-avg 0.90 0.89 0.87 90 ``` --- layout: model title: Toxic Comment Classification - Small author: John Snow Labs name: multiclassifierdl_use_toxic_sm date: 2021-01-21 task: Text Classification language: en nav_key: models edition: Spark NLP 2.7.1 spark_version: 2.4 tags: [open_source, en, text_classification] supported: true annotator: MultiClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments. The Conversation AI team, a research initiative founded by Jigsaw and Google (both a part of Alphabet) is working on tools to help improve the online conversation. One area of focus is the study of negative online behaviors, like toxic comments (i.e. comments that are rude, disrespectful, or otherwise likely to make someone leave a discussion). So far they’ve built a range of publicly available models served through the Perspective API, including toxicity. But the current models still make errors, and they don’t allow users to select which types of toxicity they’re interested in finding (e.g. 
some platforms may be fine with profanity, but not with other types of toxic content). Automatically detect identity hate, insult, obscene, severe toxic, threat, or toxic content in SM comments using our out-of-the-box Spark NLP MultiClassifierDL. We removed records without any labels when training this model (only 14K+ comments were used). ## Predicted Entities `toxic`, `severe_toxic`, `identity_hate`, `insult`, `obscene`, `threat` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_MULTILABEL_TOXIC/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_MULTILABEL_TOXIC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/multiclassifierdl_use_toxic_sm_en_2.7.1_2.4_1611230645484.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/multiclassifierdl_use_toxic_sm_en_2.7.1_2.4_1611230645484.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") use = UniversalSentenceEncoder.pretrained() \ .setInputCols(["document"])\ .setOutputCol("use_embeddings") docClassifier = MultiClassifierDLModel.pretrained("multiclassifierdl_use_toxic_sm") \ .setInputCols(["use_embeddings"])\ .setOutputCol("category")\ .setThreshold(0.5) pipeline = Pipeline( stages = [ document, use, docClassifier ]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") .setCleanupMode("shrink") val use = UniversalSentenceEncoder.pretrained() .setInputCols("document") .setOutputCol("use_embeddings") val docClassifier = MultiClassifierDLModel.pretrained("multiclassifierdl_use_toxic_sm") .setInputCols("use_embeddings") .setOutputCol("category") .setThreshold(0.5f) val pipeline = new Pipeline() .setStages( Array( documentAssembler, use, docClassifier ) ) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.toxic.sm").predict("""Put your text here.""") ```
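The `setThreshold(0.5)` call above controls how the multi-label scores become output labels: every class whose score reaches the threshold is emitted, independently of the others. A toy sketch with made-up scores (not real model output):

```python
# Hypothetical per-label scores illustrating the 0.5 cutoff set via
# setThreshold in the pipeline above; each label is decided independently.
scores = {"toxic": 0.91, "insult": 0.62, "obscene": 0.44, "threat": 0.08}
threshold = 0.5
predicted = sorted(label for label, s in scores.items() if s >= threshold)
print(predicted)  # ['insult', 'toxic']
```

Raising the threshold trades recall for precision; with multi-label output, a single comment can receive several labels or none at all.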
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|multiclassifierdl_use_toxic_sm| |Compatibility:|Spark NLP 2.7.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[use_embeddings]| |Output Labels:|[category]| |Language:|en| ## Data Source https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/overview ## Benchmarking ```bash Classification report: precision recall f1-score support 0 0.56 0.30 0.39 127 1 0.71 0.70 0.70 761 2 0.76 0.72 0.74 824 3 0.55 0.21 0.31 147 4 0.79 0.38 0.51 50 5 0.94 1.00 0.97 1504 micro avg 0.83 0.80 0.81 3413 macro avg 0.72 0.55 0.60 3413 weighted avg 0.81 0.80 0.80 3413 samples avg 0.84 0.83 0.80 3413 ``` --- layout: model title: Detect PHI for Generic Deidentification in Romanian (BERT) author: John Snow Labs name: ner_deid_generic_bert date: 2022-11-22 tags: [licensed, clinical, ro, deidentification, phi, generic, bert] task: Named Entity Recognition language: ro edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition model trained using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Romanian) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It is trained with bert_base_cased embeddings and can detect 7 generic entities. This NER model is trained with a combination of custom datasets with several data augmentation mechanisms.
## Predicted Entities `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_bert_ro_4.2.2_3.0_1669122326582.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_bert_ro_4.2.2_3.0_1669122326582.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")\ .setInputCols(["sentence","token"])\ .setOutputCol("word_embeddings") clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_bert", "ro", "clinical/models")\ .setInputCols(["sentence","token","word_embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter]) text = """ Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""" data = spark.createDataFrame([[text]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro") .setInputCols(Array("sentence","token")) .setOutputCol("word_embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_bert", "ro", "clinical/models") .setInputCols(Array("sentence","token","word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( 
documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter)) val text = """Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""" val data = Seq(text).toDS.toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ro.med_ner.deid_generic_bert").predict(""" Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""") ```
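The `ner_chunk` output is what downstream de-identification consumes: each detected chunk of protected health information is replaced by its entity label. A minimal sketch of that masking step in plain Python, using a few of the chunk/label pairs this model detects as hypothetical input (the real Healthcare NLP de-identification annotators work on character offsets, not string replacement):

```python
# Masking step of de-identification: replace every detected PHI chunk
# with its entity label, longest chunks first so a short chunk does not
# clobber part of a longer one it happens to be a substring of.
def mask_phi(text, chunks):
    """chunks: list of (chunk_text, label) pairs from the NER converter."""
    for chunk, label in sorted(chunks, key=lambda c: -len(c[0])):
        text = text.replace(chunk, f"<{label}>")
    return text

chunks = [
    ("BUREAN MARIA", "NAME"),
    ("Vaslui", "LOCATION"),
    ("25 May 2022", "DATE"),
    ("2450502264401", "ID"),
]
text = "Nume si Prenume : BUREAN MARIA, Data: 25 May 2022, C.N.P : 2450502264401"
print(mask_phi(text, chunks))
# Nume si Prenume : <NAME>, Data: <DATE>, C.N.P : <ID>
```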
## Results ```bash +----------------------------+---------+ |chunk |ner_label| +----------------------------+---------+ |Spitalul Pentru Ochi de Deal|LOCATION | |Drumul Oprea Nr |LOCATION | |972 |LOCATION | |Vaslui |LOCATION | |737405 |LOCATION | |+40(235)413773 |CONTACT | |25 May 2022 |DATE | |BUREAN MARIA |NAME | |77 |AGE | |Agota Evelyn Tımar |NAME | |2450502264401 |ID | +----------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic_bert| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ro| |Size:|16.5 MB| ## References - Custom John Snow Labs datasets - Data augmentation techniques ## Benchmarking ```bash label precision recall f1-score support AGE 0.95 0.97 0.96 1186 CONTACT 0.99 0.98 0.98 366 DATE 0.96 0.92 0.94 4518 ID 1.00 1.00 1.00 679 LOCATION 0.91 0.90 0.90 1683 NAME 0.93 0.96 0.94 2916 PROFESSION 0.87 0.85 0.86 161 micro-avg 0.94 0.94 0.94 11509 macro-avg 0.94 0.94 0.94 11509 weighted-avg 0.95 0.94 0.94 11509 ``` --- layout: model title: Modern Greek (1453-) asr_wav2vec2_large_xlsr_greek_2 TFWav2Vec2ForCTC from skylord author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_greek_2 date: 2022-09-25 tags: [wav2vec2, el, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: el edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_greek_2` is a Modern Greek (1453-) model originally trained by skylord. 
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_greek_2_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_greek_2_el_4.2.0_3.0_1664112085138.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_greek_2_el_4.2.0_3.0_1664112085138.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_greek_2', lang = 'el') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_greek_2", lang = "el") val annotations = pipeline.transform(audioDF) ```
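The `audioDF` passed to the pipeline must carry the raw audio as normalized floating-point samples. A minimal sketch, in plain Python with only the standard library, of converting 16-bit PCM WAV data into that representation; note that Wav2Vec2 models additionally expect 16 kHz mono input:

```python
import io
import struct
import wave

def wav_to_floats(wav_file):
    """Read mono 16-bit PCM WAV data (a path or file-like object)
    and normalize the samples to the range [-1.0, 1.0]."""
    with wave.open(wav_file, "rb") as wav:
        assert wav.getsampwidth() == 2, "expects 16-bit PCM"
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Round-trip demo: write three known 16-bit samples, read them back.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)  # Wav2Vec2 models expect 16 kHz mono
    w.writeframes(struct.pack("<3h", 0, 16384, -32768))
buf.seek(0)
print(wav_to_floats(buf))  # [0.0, 0.5, -1.0]
```

A Spark DataFrame built from such float lists (one row per utterance) is the kind of input the pretrained pipeline's AudioAssembler stage consumes.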
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_greek_2| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|el| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from huxxx657) author: John Snow Labs name: roberta_qa_base_finetuned_deletion_squad_10 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-deletion-squad-10` is an English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_deletion_squad_10_en_4.3.0_3.0_1674216541297.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_deletion_squad_10_en_4.3.0_3.0_1674216541297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_deletion_squad_10","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_deletion_squad_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
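Under the hood, extractive question answering scores every context token as a candidate answer start and as a candidate answer end, then returns the span with the highest combined score. A minimal sketch of that decoding step in plain Python; the tokens and per-token scores are made up for illustration:

```python
# Span decoding for extractive QA: pick the (start, end) pair with the
# highest combined score, subject to start <= end and a maximum length.
def best_span(start_scores, end_scores, max_len=15):
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best[2]:
                best = (i, j, s + end_scores[j])
    return best[0], best[1]

# Context tokens and hypothetical per-token start/end scores:
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.1, 0.0, 0.0, 0.1, 0.0, 1.2, 0.0]
end   = [0.0, 0.1, 0.0, 4.8, 0.2, 0.0, 0.0, 0.1, 1.0, 0.3]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```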
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_finetuned_deletion_squad_10| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-deletion-squad-10 --- layout: model title: English image_classifier_vit_demo ViTForImageClassification from nguyenbh author: John Snow Labs name: image_classifier_vit_demo date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_demo` is an English model originally trained by nguyenbh. 
## Predicted Entities `turnstile`, `damselfly`, `mixing bowl`, `sea snake`, `cockroach, roach`, `buckle`, `beer glass`, `bulbul`, `lumbermill, sawmill`, `whippet`, `Australian terrier`, `television, television system`, `hoopskirt, crinoline`, `horse cart, horse-cart`, `guillotine`, `malamute, malemute, Alaskan malamute`, `coyote, prairie wolf, brush wolf, Canis latrans`, `colobus, colobus monkey`, `hognose snake, puff adder, sand viper`, `sock`, `burrito`, `printer`, `bathing cap, swimming cap`, `chiton, coat-of-mail shell, sea cradle, polyplacophore`, `Rottweiler`, `cello, violoncello`, `pitcher, ewer`, `computer keyboard, keypad`, `bow`, `peacock`, `ballplayer, baseball player`, `refrigerator, icebox`, `solar dish, solar collector, solar furnace`, `passenger car, coach, carriage`, `African chameleon, Chamaeleo chamaeleon`, `oboe, hautboy, hautbois`, `toyshop`, `Leonberg`, `howler monkey, howler`, `bluetick`, `African elephant, Loxodonta africana`, `American lobster, Northern lobster, Maine lobster, Homarus americanus`, `combination lock`, `black-and-tan coonhound`, `bonnet, poke bonnet`, `harvester, reaper`, `Appenzeller`, `iron, smoothing iron`, `electric locomotive`, `lycaenid, lycaenid butterfly`, `sandbar, sand bar`, `Cardigan, Cardigan Welsh corgi`, `pencil sharpener`, `jean, blue jean, denim`, `backpack, back pack, knapsack, packsack, rucksack, haversack`, `monitor`, `ice cream, icecream`, `apiary, bee house`, `water jug`, `American coot, marsh hen, mud hen, water hen, Fulica americana`, `ground beetle, carabid beetle`, `jigsaw puzzle`, `ant, emmet, pismire`, `wreck`, `kuvasz`, `gyromitra`, `Ibizan hound, Ibizan Podenco`, `brown bear, bruin, Ursus arctos`, `bolo tie, bolo, bola tie, bola`, `Pembroke, Pembroke Welsh corgi`, `French bulldog`, `prison, prison house`, `ballpoint, ballpoint pen, ballpen, Biro`, `stage`, `airliner`, `dogsled, dog sled, dog sleigh`, `redshank, Tringa totanus`, `menu`, `Indian cobra, Naja naja`, `swab, swob, mop`, `window screen`, 
`brain coral`, `artichoke, globe artichoke`, `loupe, jeweler's loupe`, `loudspeaker, speaker, speaker unit, loudspeaker system, speaker system`, `panpipe, pandean pipe, syrinx`, `wok`, `croquet ball`, `plate`, `scoreboard`, `Samoyed, Samoyede`, `ocarina, sweet potato`, `beaver`, `borzoi, Russian wolfhound`, `horizontal bar, high bar`, `stretcher`, `seat belt, seatbelt`, `obelisk`, `forklift`, `feather boa, boa`, `frying pan, frypan, skillet`, `barbershop`, `hamper`, `face powder`, `Siamese cat, Siamese`, `ladle`, `dingo, warrigal, warragal, Canis dingo`, `mountain tent`, `head cabbage`, `echidna, spiny anteater, anteater`, `Polaroid camera, Polaroid Land camera`, `dumbbell`, `espresso`, `notebook, notebook computer`, `Norfolk terrier`, `binoculars, field glasses, opera glasses`, `carpenter's kit, tool kit`, `moving van`, `catamaran`, `tiger beetle`, `bikini, two-piece`, `Siberian husky`, `studio couch, day bed`, `bulletproof vest`, `lawn mower, mower`, `promontory, headland, head, foreland`, `soap dispenser`, `vulture`, `dam, dike, dyke`, `brambling, Fringilla montifringilla`, `toilet tissue, toilet paper, bathroom tissue`, `ringlet, ringlet butterfly`, `tiger cat`, `mobile home, manufactured home`, `Norwich terrier`, `little blue heron, Egretta caerulea`, `English setter`, `Tibetan mastiff`, `rocking chair, rocker`, `mask`, `maze, labyrinth`, `bookcase`, `viaduct`, `sweatshirt`, `plow, plough`, `basenji`, `typewriter keyboard`, `Windsor tie`, `coral fungus`, `desktop computer`, `Kerry blue terrier`, `Angora, Angora rabbit`, `can opener, tin opener`, `shield, buckler`, `triumphal arch`, `horned viper, cerastes, sand viper, horned asp, Cerastes cornutus`, `miniature schnauzer`, `tape player`, `jaguar, panther, Panthera onca, Felis onca`, `hook, claw`, `file, file cabinet, filing cabinet`, `chime, bell, gong`, `shower curtain`, `window shade`, `acoustic guitar`, `gas pump, gasoline pump, petrol pump, island dispenser`, `cicada, cicala`, `Petri dish`, `paintbrush`, 
`banana`, `chickadee`, `mountain bike, all-terrain bike, off-roader`, `lighter, light, igniter, ignitor`, `oil filter`, `cab, hack, taxi, taxicab`, `Christmas stocking`, `rugby ball`, `black widow, Latrodectus mactans`, `bustard`, `fiddler crab`, `web site, website, internet site, site`, `chocolate sauce, chocolate syrup`, `chainlink fence`, `fireboat`, `cocktail shaker`, `airship, dirigible`, `projectile, missile`, `bagel, beigel`, `screwdriver`, `oystercatcher, oyster catcher`, `pot, flowerpot`, `water bottle`, `Loafer`, `drumstick`, `soccer ball`, `cairn, cairn terrier`, `padlock`, `tow truck, tow car, wrecker`, `bloodhound, sleuthhound`, `punching bag, punch bag, punching ball, punchball`, `great grey owl, great gray owl, Strix nebulosa`, `scale, weighing machine`, `trench coat`, `briard`, `cheetah, chetah, Acinonyx jubatus`, `entertainment center`, `Boston bull, Boston terrier`, `Arabian camel, dromedary, Camelus dromedarius`, `steam locomotive`, `coil, spiral, volute, whorl, helix`, `plane, carpenter's plane, woodworking plane`, `gondola`, `spider web, spider's web`, `bathtub, bathing tub, bath, tub`, `pelican`, `miniature poodle`, `cowboy boot`, `perfume, essence`, `lakeside, lakeshore`, `timber wolf, grey wolf, gray wolf, Canis lupus`, `moped`, `sunscreen, sunblock, sun blocker`, `Brabancon griffon`, `puffer, pufferfish, blowfish, globefish`, `lifeboat`, `pool table, billiard table, snooker table`, `Bouvier des Flandres, Bouviers des Flandres`, `Pomeranian`, `theater curtain, theatre curtain`, `marimba, xylophone`, `baboon`, `vacuum, vacuum cleaner`, `pill bottle`, `pick, plectrum, plectron`, `hen`, `American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier`, `digital watch`, `pier`, `oxygen mask`, `Tibetan terrier, chrysanthemum dog`, `ostrich, Struthio camelus`, `water ouzel, dipper`, `drilling platform, offshore rig`, `magnetic compass`, `throne`, `butternut squash`, `minibus`, `EntleBucher`, `carousel, carrousel, 
merry-go-round, roundabout, whirligig`, `hot pot, hotpot`, `rain barrel`, `wood rabbit, cottontail, cottontail rabbit`, `miniature pinscher`, `partridge`, `three-toed sloth, ai, Bradypus tridactylus`, `English springer, English springer spaniel`, `corkscrew, bottle screw`, `fur coat`, `robin, American robin, Turdus migratorius`, `dowitcher`, `ruddy turnstone, Arenaria interpres`, `water snake`, `stove`, `Great Pyrenees`, `soft-coated wheaten terrier`, `carbonara`, `snail`, `breastplate, aegis, egis`, `wolf spider, hunting spider`, `hatchet`, `CD player`, `axolotl, mud puppy, Ambystoma mexicanum`, `pomegranate`, `poncho`, `leatherback turtle, leatherback, leathery turtle, Dermochelys coriacea`, `lorikeet`, `spatula`, `jay`, `platypus, duckbill, duckbilled platypus, duck-billed platypus, Ornithorhynchus anatinus`, `stethoscope`, `flagpole, flagstaff`, `coho, cohoe, coho salmon, blue jack, silver salmon, Oncorhynchus kisutch`, `agama`, `red wolf, maned wolf, Canis rufus, Canis niger`, `beaker`, `eft`, `pretzel`, `brassiere, bra, bandeau`, `frilled lizard, Chlamydosaurus kingi`, `joystick`, `goldfish, Carassius auratus`, `fig`, `maypole`, `caldron, cauldron`, `admiral`, `impala, Aepyceros melampus`, `spotted salamander, Ambystoma maculatum`, `syringe`, `hog, pig, grunter, squealer, Sus scrofa`, `handkerchief, hankie, hanky, hankey`, `tarantula`, `cheeseburger`, `pinwheel`, `sax, saxophone`, `dung beetle`, `broccoli`, `cassette player`, `milk can`, `traffic light, traffic signal, stoplight`, `shovel`, `sarong`, `tabby, tabby cat`, `parallel bars, bars`, `ladybug, ladybeetle, lady beetle, ladybird, ladybird beetle`, `quill, quill pen`, `giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca`, `steel drum`, `quail`, `Blenheim spaniel`, `wig`, `hamster`, `ice lolly, lolly, lollipop, popsicle`, `seashore, coast, seacoast, sea-coast`, `chest`, `worm fence, snake fence, snake-rail fence, Virginia fence`, `missile`, `beer bottle`, `yellow lady's slipper, yellow 
lady-slipper, Cypripedium calceolus, Cypripedium parviflorum`, `breakwater, groin, groyne, mole, bulwark, seawall, jetty`, `white wolf, Arctic wolf, Canis lupus tundrarum`, `guacamole`, `porcupine, hedgehog`, `trolleybus, trolley coach, trackless trolley`, `greenhouse, nursery, glasshouse`, `trimaran`, `Italian greyhound`, `potter's wheel`, `jacamar`, `wallet, billfold, notecase, pocketbook`, `Lakeland terrier`, `green lizard, Lacerta viridis`, `indigo bunting, indigo finch, indigo bird, Passerina cyanea`, `green mamba`, `walking stick, walkingstick, stick insect`, `crossword puzzle, crossword`, `eggnog`, `barrow, garden cart, lawn cart, wheelbarrow`, `remote control, remote`, `bicycle-built-for-two, tandem bicycle, tandem`, `wool, woolen, woollen`, `black grouse`, `abaya`, `marmoset`, `golf ball`, `jeep, landrover`, `Mexican hairless`, `dishwasher, dish washer, dishwashing machine`, `jersey, T-shirt, tee shirt`, `planetarium`, `goose`, `mailbox, letter box`, `capuchin, ringtail, Cebus capucinus`, `marmot`, `orangutan, orang, orangutang, Pongo pygmaeus`, `coffeepot`, `ambulance`, `shopping basket`, `pop bottle, soda bottle`, `red fox, Vulpes vulpes`, `crash helmet`, `street sign`, `affenpinscher, monkey pinscher, monkey dog`, `Arctic fox, white fox, Alopex lagopus`, `sidewinder, horned rattlesnake, Crotalus cerastes`, `ruffed grouse, partridge, Bonasa umbellus`, `muzzle`, `measuring cup`, `canoe`, `reflex camera`, `fox squirrel, eastern fox squirrel, Sciurus niger`, `French loaf`, `killer whale, killer, orca, grampus, sea wolf, Orcinus orca`, `dial telephone, dial phone`, `thimble`, `bubble`, `vizsla, Hungarian pointer`, `running shoe`, `mailbag, postbag`, `radio telescope, radio reflector`, `piggy bank, penny bank`, `Chihuahua`, `chambered nautilus, pearly nautilus, nautilus`, `Airedale, Airedale terrier`, `kimono`, `green snake, grass snake`, `rubber eraser, rubber, pencil eraser`, `upright, upright piano`, `orange`, `revolver, six-gun, six-shooter`, `ashcan, 
trash can, garbage can, wastebin, ash bin, ash-bin, ashbin, dustbin, trash barrel, trash bin`, `drum, membranophone, tympan`, `Dungeness crab, Cancer magister`, `lipstick, lip rouge`, `gong, tam-tam`, `fountain`, `tub, vat`, `malinois`, `sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita`, `German short-haired pointer`, `apron`, `Irish setter, red setter`, `dishrag, dishcloth`, `school bus`, `candle, taper, wax light`, `bib`, `cash machine, cash dispenser, automated teller machine, automatic teller machine, automated teller, automatic teller, ATM`, `power drill`, `English foxhound`, `miniskirt, mini`, `swing`, `slug`, `hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa`, `rifle`, `Saluki, gazelle hound`, `Sealyham terrier, Sealyham`, `bullet train, bullet`, `hyena, hyaena`, `ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus`, `toy terrier`, `goblet`, `safe`, `cup`, `electric guitar`, `red wine`, `restaurant, eating house, eating place, eatery`, `wall clock`, `washbasin, handbasin, washbowl, lavabo, wash-hand basin`, `red-breasted merganser, Mergus serrator`, `crate`, `banded gecko`, `hippopotamus, hippo, river horse, Hippopotamus amphibius`, `tick`, `tripod`, `sombrero`, `desk`, `sea slug, nudibranch`, `racer, race car, racing car`, `pizza, pizza pie`, `dining table, board`, `Saint Bernard, St Bernard`, `komondor`, `electric ray, crampfish, numbfish, torpedo`, `prairie chicken, prairie grouse, prairie fowl`, `coffee mug`, `hammer`, `golfcart, golf cart`, `unicycle, monocycle`, `bison`, `soup bowl`, `rapeseed`, `golden retriever`, `plastic bag`, `grey fox, gray fox, Urocyon cinereoargenteus`, `water tower`, `house finch, linnet, Carpodacus mexicanus`, `barbell`, `hair slide`, `tiger, Panthera tigris`, `black-footed ferret, ferret, Mustela nigripes`, `meat loaf, meatloaf`, `hand blower, blow dryer, blow drier, hair dryer, hair drier`, `overskirt`, `gibbon, Hylobates lar`, `Gila monster, Heloderma suspectum`, `toucan`, 
`snowmobile`, `pencil box, pencil case`, `scuba diver`, `cloak`, `Sussex spaniel`, `otter`, `Greater Swiss Mountain dog`, `great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias`, `torch`, `magpie`, `tiger shark, Galeocerdo cuvieri`, `wing`, `Border collie`, `bell cote, bell cot`, `sea anemone, anemone`, `teapot`, `sea urchin`, `screen, CRT screen`, `bookshop, bookstore, bookstall`, `oscilloscope, scope, cathode-ray oscilloscope, CRO`, `crib, cot`, `police van, police wagon, paddy wagon, patrol wagon, wagon, black Maria`, `hartebeest`, `manhole cover`, `iPod`, `rock python, rock snake, Python sebae`, `nipple`, `suspension bridge`, `safety pin`, `sea lion`, `cougar, puma, catamount, mountain lion, painter, panther, Felis concolor`, `mantis, mantid`, `wardrobe, closet, press`, `projector`, `Granny Smith`, `diamondback, diamondback rattlesnake, Crotalus adamanteus`, `pirate, pirate ship`, `espresso maker`, `African hunting dog, hyena dog, Cape hunting dog, Lycaon pictus`, `cradle`, `common newt, Triturus vulgaris`, `tricycle, trike, velocipede`, `bobsled, bobsleigh, bob`, `thunder snake, worm snake, Carphophis amoenus`, `thresher, thrasher, threshing machine`, `banjo`, `armadillo`, `pajama, pyjama, pj's, jammies`, `ski`, `Maltese dog, Maltese terrier, Maltese`, `leafhopper`, `book jacket, dust cover, dust jacket, dust wrapper`, `silky terrier, Sydney silky`, `Shih-Tzu`, `wallaby, brush kangaroo`, `cardigan`, `sturgeon`, `freight car`, `home theater, home theatre`, `sundial`, `African crocodile, Nile crocodile, Crocodylus niloticus`, `odometer, hodometer, mileometer, milometer`, `sliding door`, `vine snake`, `West Highland white terrier`, `mongoose`, `hornbill`, `beagle`, `European gallinule, Porphyrio porphyrio`, `submarine, pigboat, sub, U-boat`, `Komodo dragon, Komodo lizard, dragon lizard, giant lizard, Varanus komodoensis`, `cock`, `pedestal, plinth, footstall`, `accordion, piano accordion, squeeze box`, `gown`, `lynx, catamount`, 
`guenon, guenon monkey`, `Walker hound, Walker foxhound`, `standard schnauzer`, `reel`, `hip, rose hip, rosehip`, `grasshopper, hopper`, `Dutch oven`, `stone wall`, `hard disc, hard disk, fixed disk`, `snow leopard, ounce, Panthera uncia`, `shopping cart`, `digital clock`, `hourglass`, `Border terrier`, `Old English sheepdog, bobtail`, `academic gown, academic robe, judge's robe`, `spiny lobster, langouste, rock lobster, crawfish, crayfish, sea crawfish`, `spotlight, spot`, `dome`, `barn spider, Araneus cavaticus`, `bee eater`, `basketball`, `cliff dwelling`, `folding chair`, `isopod`, `Doberman, Doberman pinscher`, `bittern`, `sunglasses, dark glasses, shades`, `picket fence, paling`, `Crock Pot`, `ibex, Capra ibex`, `neck brace`, `cardoon`, `cassette`, `amphibian, amphibious vehicle`, `minivan`, `analog clock`, `trailer truck, tractor trailer, trucking rig, rig, articulated lorry, semi`, `yurt`, `cliff, drop, drop-off`, `Bernese mountain dog`, `teddy, teddy bear`, `sloth bear, Melursus ursinus, Ursus ursinus`, `bassoon`, `toaster`, `ptarmigan`, `Gordon setter`, `night snake, Hypsiglena torquata`, `grand piano, grand`, `purse`, `clumber, clumber spaniel`, `shoji`, `hair spray`, `maillot`, `knee pad`, `space heater`, `bottlecap`, `chiffonier, commode`, `chain saw, chainsaw`, `sulphur butterfly, sulfur butterfly`, `pay-phone, pay-station`, `kelpie`, `mouse, computer mouse`, `car wheel`, `cornet, horn, trumpet, trump`, `container ship, containership, container vessel`, `matchstick`, `scabbard`, `American black bear, black bear, Ursus americanus, Euarctos americanus`, `langur`, `rock crab, Cancer irroratus`, `lionfish`, `speedboat`, `black stork, Ciconia nigra`, `knot`, `disk brake, disc brake`, `mosquito net`, `white stork, Ciconia ciconia`, `abacus`, `titi, titi monkey`, `grocery store, grocery, food market, market`, `waffle iron`, `pickelhaube`, `wooden spoon`, `Norwegian elkhound, elkhound`, `earthstar`, `sewing machine`, `balance beam, beam`, `potpie`, `chain 
mail, ring mail, mail, chain armor, chain armour, ring armor, ring armour`, `Staffordshire bullterrier, Staffordshire bull terrier`, `switch, electric switch, electrical switch`, `dhole, Cuon alpinus`, `paddle, boat paddle`, `limousine, limo`, `Shetland sheepdog, Shetland sheep dog, Shetland`, `space bar`, `library`, `paddlewheel, paddle wheel`, `alligator lizard`, `Band Aid`, `Persian cat`, `bull mastiff`, `tailed frog, bell toad, ribbed toad, tailed toad, Ascaphus trui`, `sports car, sport car`, `football helmet`, `laptop, laptop computer`, `lens cap, lens cover`, `tennis ball`, `violin, fiddle`, `lab coat, laboratory coat`, `cinema, movie theater, movie theatre, movie house, picture palace`, `weasel`, `bow tie, bow-tie, bowtie`, `macaw`, `dough`, `whiskey jug`, `microphone, mike`, `spoonbill`, `bassinet`, `mud turtle`, `velvet`, `warthog`, `plunger, plumber's helper`, `dugong, Dugong dugon`, `honeycomb`, `badger`, `dragonfly, darning needle, devil's darning needle, sewing needle, snake feeder, snake doctor, mosquito hawk, skeeter hawk`, `bee`, `doormat, welcome mat`, `fountain pen`, `giant schnauzer`, `assault rifle, assault gun`, `limpkin, Aramus pictus`, `siamang, Hylobates syndactylus, Symphalangus syndactylus`, `albatross, mollymawk`, `confectionery, confectionary, candy store`, `harp`, `parachute, chute`, `barrel, cask`, `tank, army tank, armored combat vehicle, armoured combat vehicle`, `collie`, `kite`, `puck, hockey puck`, `stupa, tope`, `buckeye, horse chestnut, conker`, `patio, terrace`, `broom`, `Dandie Dinmont, Dandie Dinmont terrier`, `scorpion`, `agaric`, `balloon`, `bucket, pail`, `squirrel monkey, Saimiri sciureus`, `Eskimo dog, husky`, `zebra`, `garter snake, grass snake`, `indri, indris, Indri indri, Indri brevicaudatus`, `tractor`, `guinea pig, Cavia cobaya`, `maraca`, `red-backed sandpiper, dunlin, Erolia alpina`, `bullfrog, Rana catesbeiana`, `trilobite`, `Japanese spaniel`, `gorilla, Gorilla gorilla`, `monastery`, `centipede`, `terrapin`, 
`llama`, `long-horned beetle, longicorn, longicorn beetle`, `boxer`, `curly-coated retriever`, `mortar`, `hammerhead, hammerhead shark`, `goldfinch, Carduelis carduelis`, `garden spider, Aranea diademata`, `stopwatch, stop watch`, `grey whale, gray whale, devilfish, Eschrichtius gibbosus, Eschrichtius robustus`, `leaf beetle, chrysomelid`, `birdhouse`, `king crab, Alaska crab, Alaskan king crab, Alaska king crab, Paralithodes camtschatica`, `stole`, `bell pepper`, `radiator`, `flatworm, platyhelminth`, `mushroom`, `Scotch terrier, Scottish terrier, Scottie`, `liner, ocean liner`, `toilet seat`, `lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens`, `zucchini, courgette`, `harvestman, daddy longlegs, Phalangium opilio`, `Newfoundland, Newfoundland dog`, `flamingo`, `whiptail, whiptail lizard`, `geyser`, `cleaver, meat cleaver, chopper`, `sea cucumber, holothurian`, `American egret, great white heron, Egretta albus`, `parking meter`, `beacon, lighthouse, beacon light, pharos`, `coucal`, `motor scooter, scooter`, `mitten`, `cannon`, `weevil`, `megalith, megalithic structure`, `stinkhorn, carrion fungus`, `ear, spike, capitulum`, `box turtle, box tortoise`, `snowplow, snowplough`, `tench, Tinca tinca`, `modem`, `tobacco shop, tobacconist shop, tobacconist`, `barn`, `skunk, polecat, wood pussy`, `African grey, African gray, Psittacus erithacus`, `Madagascar cat, ring-tailed lemur, Lemur catta`, `holster`, `barometer`, `sleeping bag`, `washer, automatic washer, washing machine`, `recreational vehicle, RV, R.V.`, `drake`, `tray`, `butcher shop, meat market`, `china cabinet, china closet`, `medicine chest, medicine cabinet`, `photocopier`, `Yorkshire terrier`, `starfish, sea star`, `racket, racquet`, `park bench`, `Labrador retriever`, `whistle`, `clog, geta, patten, sabot`, `volcano`, `quilt, comforter, comfort, puff`, `leopard, Panthera pardus`, `cauliflower`, `swimming trunks, bathing trunks`, `American chameleon, anole, Anolis carolinensis`, `alp`, 
`mortarboard`, `barracouta, snoek`, `cocker spaniel, English cocker spaniel, cocker`, `space shuttle`, `beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon`, `harmonica, mouth organ, harp, mouth harp`, `gasmask, respirator, gas helmet`, `wombat`, `Model T`, `wild boar, boar, Sus scrofa`, `hermit crab`, `flat-coated retriever`, `rotisserie`, `jinrikisha, ricksha, rickshaw`, `trifle`, `bannister, banister, balustrade, balusters, handrail`, `go-kart`, `bakery, bakeshop, bakehouse`, `ski mask`, `dock, dockage, docking facility`, `Egyptian cat`, `oxcart`, `redbone`, `shoe shop, shoe-shop, shoe store`, `convertible`, `ox`, `crayfish, crawfish, crawdad, crawdaddy`, `cowboy hat, ten-gallon hat`, `conch`, `spaghetti squash`, `toy poodle`, `saltshaker, salt shaker`, `microwave, microwave oven`, `triceratops`, `necklace`, `castle`, `streetcar, tram, tramcar, trolley, trolley car`, `eel`, `diaper, nappy, napkin`, `standard poodle`, `prayer rug, prayer mat`, `radio, wireless`, `crane`, `envelope`, `rule, ruler`, `gar, garfish, garpike, billfish, Lepisosteus osseus`, `spider monkey, Ateles geoffroyi`, `Irish wolfhound`, `German shepherd, German shepherd dog, German police dog, alsatian`, `umbrella`, `sunglass`, `aircraft carrier, carrier, flattop, attack aircraft carrier`, `water buffalo, water ox, Asiatic buffalo, Bubalus bubalis`, `jellyfish`, `groom, bridegroom`, `tree frog, tree-frog`, `steel arch bridge`, `lemon`, `pickup, pickup truck`, `vault`, `groenendael`, `baseball`, `junco, snowbird`, `maillot, tank suit`, `gazelle`, `jack-o'-lantern`, `military uniform`, `slide rule, slipstick`, `wire-haired fox terrier`, `acorn squash`, `electric fan, blower`, `Brittany spaniel`, `chimpanzee, chimp, Pan troglodytes`, `pillow`, `binder, ring-binder`, `schipperke`, `Afghan hound, Afghan`, `plate rack`, `car mirror`, `hand-held computer, hand-held microcomputer`, `papillon`, `schooner`, `Bedlington terrier`, `cellular telephone, cellular phone, 
cellphone, cell, mobile phone`, `altar`, `Chesapeake Bay retriever`, `cabbage butterfly`, `polecat, fitch, foulmart, foumart, Mustela putorius`, `comic book`, `French horn, horn`, `daisy`, `organ, pipe organ`, `mashed potato`, `acorn`, `fly`, `chain`, `American alligator, Alligator mississipiensis`, `mink`, `garbage truck, dustcart`, `totem pole`, `wine bottle`, `strawberry`, `cricket`, `European fire salamander, Salamandra salamandra`, `coral reef`, `Welsh springer spaniel`, `bighorn, bighorn sheep, cimarron, Rocky Mountain bighorn, Rocky Mountain sheep, Ovis canadensis`, `snorkel`, `bald eagle, American eagle, Haliaeetus leucocephalus`, `meerkat, mierkat`, `grille, radiator grille`, `nematode, nematode worm, roundworm`, `anemone fish`, `corn`, `loggerhead, loggerhead turtle, Caretta caretta`, `palace`, `suit, suit of clothes`, `pineapple, ananas`, `macaque`, `ping-pong ball`, `ram, tup`, `church, church building`, `koala, koala bear, kangaroo bear, native bear, Phascolarctos cinereus`, `hare`, `bath towel`, `strainer`, `yawl`, `otterhound, otter hound`, `table lamp`, `king snake, kingsnake`, `lotion`, `lion, king of beasts, Panthera leo`, `thatch, thatched roof`, `basset, basset hound`, `black and gold garden spider, Argiope aurantia`, `barber chair`, `proboscis monkey, Nasalis larvatus`, `consomme`, `Irish terrier`, `Irish water spaniel`, `common iguana, iguana, Iguana iguana`, `Weimaraner`, `Great Dane`, `pug, pug-dog`, `rhinoceros beetle`, `vase`, `brass, memorial tablet, plaque`, `kit fox, Vulpes macrotis`, `king penguin, Aptenodytes patagonica`, `vending machine`, `dalmatian, coach dog, carriage dog`, `rock beauty, Holocanthus tricolor`, `pole`, `cuirass`, `bolete`, `jackfruit, jak, jack`, `monarch, monarch butterfly, milkweed butterfly, Danaus plexippus`, `chow, chow chow`, `nail`, `packet`, `half track`, `Lhasa, Lhasa apso`, `boathouse`, `hay`, `valley, vale`, `slot, one-armed bandit`, `volleyball`, `carton`, `shower cap`, `tile roof`, `lacewing, lacewing 
fly`, `patas, hussar monkey, Erythrocebus patas`, `boa constrictor, Constrictor constrictor`, `black swan, Cygnus atratus`, `lampshade, lamp shade`, `mousetrap`, `crutch`, `vestment`, `Pekinese, Pekingese, Peke`, `tusker`, `warplane, military plane`, `sandal`, `screw`, `custard apple`, `Scottish deerhound, deerhound`, `spindle`, `keeshond`, `hummingbird`, `letter opener, paper knife, paperknife`, `cucumber, cuke`, `bearskin, busby, shako`, `fire engine, fire truck`, `trombone`, `ringneck snake, ring-necked snake, ring snake`, `sorrel`, `fire screen, fireguard`, `paper towel`, `flute, transverse flute`, `hotdog, hot dog, red hot`, `Indian elephant, Elephas maximus`, `mosque`, `stingray`, `Rhodesian ridgeback`, `four-poster` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_demo_en_4.1.0_3.0_1660167069440.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_demo_en_4.1.0_3.0_1660167069440.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_demo", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_demo", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_demo| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|324.8 MB| --- layout: model title: Word2Vec Embeddings in Lombard (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, lmo, open_source] task: Embeddings language: lmo edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_lmo_3.4.1_3.0_1647443293397.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_lmo_3.4.1_3.0_1647443293397.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","lmo") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","lmo") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("lmo.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
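The 300-dimensional vectors produced by this annotator are typically compared with cosine similarity to find related words. A plain-Python sketch of that comparison, using toy 3-d vectors rather than real model output:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d vectors standing in for the model's 300-d embeddings.
v_cat = [0.8, 0.1, 0.3]
v_dog = [0.7, 0.2, 0.3]
v_car = [0.1, 0.9, 0.2]

# "cat" should be closer to "dog" than to "car".
print(cosine_similarity(v_cat, v_dog) > cosine_similarity(v_cat, v_car))  # True
```

In a Spark NLP pipeline the vectors arrive in the `embeddings` annotation column; this snippet only illustrates the distance measure applied downstream.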
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|lmo| |Size:|297.3 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Detect Diseases author: John Snow Labs name: ner_diseases_en date: 2020-03-25 task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 2.4.4 spark_version: 2.4 tags: [ner, en, licensed, clinical] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Pretrained named entity recognition deep learning model for diseases. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. {:.h2_title} ## Predicted Entities ``Disease``. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DIAG_PROC/){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_diseases_en_2.4.4_2.4_1584452534235.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_diseases_en_2.4.4_2.4_1584452534235.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. 
Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
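The NerConverter step groups the per-token IOB tags emitted by NerDLModel into complete entity chunks. The grouping logic can be illustrated in plain Python (a hedged sketch of the idea, not Spark NLP's actual implementation):

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (chunk_text, label) pairs,
    mimicking what NerConverter does downstream of NerDLModel."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            # "O" tag (or a stray "I-") closes any open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(iob_to_chunks(
    ["The", "patient", "has", "a", "small", "incisional", "hernia"],
    ["O", "O", "O", "B-Disease", "I-Disease", "I-Disease", "I-Disease"]))
# [('a small incisional hernia', 'Disease')]
```

In the real pipeline this happens inside NerConverter, which also carries character offsets and metadata for each chunk.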
{% include programmingLanguageSelectScalaPython.html %} ```python ... embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_diseases", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive. ']], ["text"])) ``` ```scala ... val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_diseases", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") ... 
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive. ").toDF("text") val result = pipeline.fit(data).transform(data) ```
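The precision, recall, and F1 figures reported in the benchmarking section are derived from token-level true-positive/false-positive/false-negative counts. As a quick plain-Python sanity check against the I-Disease row:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true/false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# I-Disease row from the benchmarking table: tp=5014, fp=222, fn=171
p, r, f = prf(5014, 222, 171)
print(round(p, 6), round(r, 6), round(f, 6))  # 0.957601 0.96702 0.962288
```

The micro-average row applies the same formulas to the summed counts across both entity tags.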
{:.h2_title} ## Results The output is a dataframe with a sentence per row and a ``"ner"`` column containing all of the entity labels in the sentence, entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select ``"token.result"`` and ``"ner.result"`` from your output dataframe, or add the ``"Finisher"`` to the end of your pipeline. ```bash +------------------------------+---------+ |chunk |ner | +------------------------------+---------+ |the cyst |Disease | |a large Prolene suture |Disease | |a very small incisional hernia|Disease | |the hernia cavity |Disease | |omentum |Disease | |the hernia |Disease | |the wound lesion |Disease | |The lesion |Disease | |the existing scar |Disease | |the cyst |Disease | |the wound |Disease | |this cyst down to its base |Disease | |a small incisional hernia |Disease | |The cyst |Disease | |The wound |Disease | +------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_diseases_en_2.4.4_2.4| |Type:|ner| |Compatibility:|Spark NLP 2.4.4+| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Case sensitive:|false| {:.h2_title} ## Data Source Trained on i2b2 with ``embeddings_clinical``. 
https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ {:.h2_title} ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|:--------------|-------:|-----:|-----:|---------:|---------:|---------:| | 0 | I-Disease | 5014 | 222 | 171 | 0.957601 | 0.96702 | 0.962288 | | 1 | B-Disease | 6004 | 213 | 159 | 0.965739 | 0.974201 | 0.969952 | | 2 | Macro-average | 11018 | 435 | 330 | 0.96167 | 0.970611 | 0.96612 | | 3 | Micro-average | 11018 | 435 | 330 | 0.962019 | 0.97092 | 0.966449 | ``` --- layout: model title: English BertForQuestionAnswering model (from deepset) author: John Snow Labs name: bert_qa_deepset_bert_base_uncased_squad2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squad2` is an English model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_deepset_bert_base_uncased_squad2_en_4.0.0_3.0_1654181480200.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_deepset_bert_base_uncased_squad2_en_4.0.0_3.0_1654181480200.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_deepset_bert_base_uncased_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_deepset_bert_base_uncased_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.base_uncased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
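Under the hood, extractive QA models like this one score every context token as a potential answer start and as a potential answer end; the predicted answer is the highest-scoring valid span. A toy plain-Python illustration of that decoding step (the scores are made up for illustration; this is not the Spark NLP internals):

```python
def best_span(start_scores, end_scores, max_len=8):
    """Return (start, end) token indices of the highest-scoring span
    with end >= start and length bounded by max_len."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best[2]:
                best = (i, j, s + end_scores[j])
    return best[0], best[1]

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 2.5, 0.1, 0.1, 0.1, 0.1, 0.3, 0.0]
end   = [0.0, 0.1, 0.1, 2.8, 0.1, 0.1, 0.1, 0.1, 0.4, 0.0]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```

The annotator returns this decoded span in the `answer` output column.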
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_deepset_bert_base_uncased_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/deepset/bert-base-uncased-squad2 - https://github.com/deepset-ai/haystack/discussions - https://deepset.ai - https://twitter.com/deepset_ai - http://www.deepset.ai/jobs - https://haystack.deepset.ai/community/join - https://github.com/deepset-ai/haystack/ - https://deepset.ai/german-bert - https://www.linkedin.com/company/deepset-ai/ - https://github.com/deepset-ai/FARM - https://deepset.ai/germanquad --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_8 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-256-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_8_en_4.0.0_3.0_1657184863712.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_8_en_4.0.0_3.0_1657184863712.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_8","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_8","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_8| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-256-finetuned-squad-seed-8 --- layout: model title: Legal Forbearance Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_forbearance_agreement_bert date: 2023-02-02 tags: [en, legal, classification, forbearance, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_forbearance_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `forbearance-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `forbearance-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_forbearance_agreement_bert_en_1.0.0_3.0_1675359983427.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_forbearance_agreement_bert_en_1.0.0_3.0_1675359983427.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_forbearance_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-----------------------+ |result                 | +-----------------------+ |[forbearance-agreement]| |[other]                | |[other]                | |[forbearance-agreement]| +-----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_forbearance_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Legal documents scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support forbearance-agreement 0.97 1.00 0.99 37 other 1.00 0.99 0.99 73 accuracy - - 0.99 110 macro-avg 0.99 0.99 0.99 110 weighted-avg 0.99 0.99 0.99 110 ``` --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_fpdm_hier_roberta_FT_newsqa date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_hier_roberta_FT_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_hier_roberta_FT_newsqa_en_4.0.0_3.0_1655728663969.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_hier_roberta_FT_newsqa_en_4.0.0_3.0_1655728663969.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_hier_roberta_FT_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_fpdm_hier_roberta_FT_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.roberta.qa_fpdm_hier_roberta_ft_newsqa.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_fpdm_hier_roberta_FT_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|458.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/fpdm_hier_roberta_FT_newsqa --- layout: model title: Hindi BertForQuestionAnswering Cased model (from roshnir) author: John Snow Labs name: bert_qa_mbert_finetuned_mlqa_dev date: 2022-07-07 tags: [hi, open_source, bert, question_answering] task: Question Answering language: hi edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mBert-finetuned-mlqa-dev-hi` is a Hindi model originally trained by `roshnir`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_dev_hi_4.0.0_3.0_1657190202881.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_dev_hi_4.0.0_3.0_1657190202881.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_dev","hi") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["मेरा नाम क्या है?", "मेरा नाम क्लारा है और मैं बर्कले में रहता हूं।"]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_dev","hi") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("मेरा नाम क्या है?", "मेरा नाम क्लारा है और मैं बर्कले में रहता हूं।")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_mbert_finetuned_mlqa_dev| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|hi| |Size:|626.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/roshnir/mBert-finetuned-mlqa-dev-hi --- layout: model title: Fast Neural Machine Translation Model from English to Philippine Languages author: John Snow Labs name: opus_mt_en_phi date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, phi, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `en` - target languages: `phi` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_phi_xx_2.7.0_2.4_1609169890617.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_phi_xx_2.7.0_2.4_1609169890617.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_phi", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_phi", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.phi').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_phi| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Danish Legal Roberta Embeddings author: John Snow Labs name: roberta_base_danish_legal date: 2023-02-16 tags: [da, danish, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: da edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-danish-roberta-base` is a Danish model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_danish_legal_da_4.2.4_3.0_1676576630307.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_danish_legal_da_4.2.4_3.0_1676576630307.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_danish_legal", "da")\
    .setInputCols(["sentence"])\
    .setOutputCol("embeddings")
```
```scala
val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_danish_legal", "da")
    .setInputCols("sentence")
    .setOutputCol("embeddings")
```
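The snippet above produces token-level vectors in the `embeddings` column. When a single vector per sentence is needed downstream, one common approach is mean pooling; the helper below is a hedged, plain-Python sketch of that step (not a Spark NLP API), shown on toy vectors:

```python
# Mean-pool a list of token embedding vectors into one sentence vector.
# Illustrative only: in a real pipeline the vectors would come from the
# `embeddings` annotation column produced by RoBertaEmbeddings.
def mean_pool(token_vectors):
    if not token_vectors:
        return []
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / len(token_vectors)
            for i in range(dim)]

print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```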
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_base_danish_legal|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|da|
|Size:|416.0 MB|
|Case sensitive:|true|

## References

https://huggingface.co/joelito/legal-danish-roberta-base
---
layout: model
title: BERT Sequence Classification - Classify into News Categories
author: John Snow Labs
name: bert_sequence_classifier_age_news
date: 2021-11-07
tags: [news, classification, bert_for_sequence_classification, en, open_source, ag_news]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 3.3.2
spark_version: 2.4
supported: true
annotator: BertForSequenceClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is imported from `Hugging Face-models`. It is a BERT-Mini model fine-tuned on the `age_news` dataset.

## Predicted Entities

`World`, `Sports`, `Business`, `Sci/Tech`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_age_news_en_3.3.2_2.4_1636288849469.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_age_news_en_3.3.2_2.4_1636288849469.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

sequenceClassifier = BertForSequenceClassification \
    .pretrained('bert_sequence_classifier_age_news', 'en') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('class') \
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512)

pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier])

example = spark.createDataFrame([['Microsoft has taken its first step into the metaverse.']]).toDF("text")

result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_age_news", "en")
    .setInputCols("document", "token")
    .setOutputCol("class")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))

val example = Seq("Microsoft has taken its first step into the metaverse.").toDS.toDF("text")

val result = pipeline.fit(example).transform(example)
```
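The `class` output column carries one category annotation per document. A hedged sketch of pulling the top label out of collected rows (plain Python over the dict-of-lists shape that Spark NLP's `annotate()` produces; the helper name is illustrative):

```python
# Pick the predicted label from each row's `class` annotations.
# Assumes rows shaped like Spark NLP annotate() output, i.e. a dict of
# output column name -> list of result strings: {"class": ["Sci/Tech"]}.
def predicted_labels(rows, column="class"):
    return [row[column][0] if row.get(column) else None for row in rows]

rows = [{"class": ["Sci/Tech"]}, {"class": ["Sports"]}]
print(predicted_labels(rows))  # ['Sci/Tech', 'Sports']
```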
## Results

```bash
['Sci/Tech']
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_age_news|
|Compatibility:|Spark NLP 3.3.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token, sentence]|
|Output Labels:|[label]|
|Language:|en|
|Case sensitive:|true|

## Data Source

[https://huggingface.co/mrm8488/bert-mini-finetuned-age_news-classification](https://huggingface.co/mrm8488/bert-mini-finetuned-age_news-classification)

## Benchmarking

```bash
Test set accuracy: 0.93
```
---
layout: model
title: Extract relations between problem, treatment and test entities (ReDL)
author: John Snow Labs
name: redl_clinical_biobert
date: 2021-02-04
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 2.7.3
spark_version: 2.4
tags: [licensed, clinical, en, relation_extraction]
supported: true
annotator: RelationExtractionDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Extract relations such as `TrIP` (a certain treatment has improved a medical problem) and seven other relations between problem, treatment, and test entities.

## Predicted Entities

`PIP`, `TeCP`, `TeRP`, `TrAP`, `TrCP`, `TrIP`, `TrNAP`, `TrWP`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_clinical_biobert_en_2.7.3_2.4_1612443963755.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_clinical_biobert_en_2.7.3_2.4_1612443963755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = sparknlp.annotators.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

pos_tagger = PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

words_embedder = WordEmbeddingsModel() \
    .pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"]) \
    .setOutputCol("embeddings")

ner_tagger = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ner_tags")

ner_converter = NerConverter() \
    .setInputCols(["sentences", "tokens", "ner_tags"]) \
    .setOutputCol("ner_chunks")

dependency_parser = DependencyParserModel() \
    .pretrained("dependency_conllu", "en") \
    .setInputCols(["sentences", "pos_tags", "tokens"]) \
    .setOutputCol("dependencies")

# Set a filter on pairs of named entities which will be treated as relation candidates
re_ner_chunk_filter = RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"])\
    .setMaxSyntacticDistance(10)\
    .setOutputCol("re_ner_chunks")\
    .setRelationPairs(['SYMPTOM-EXTERNAL_BODY_PART_OR_REGION'])

# This model was trained on sentence-level data.
# It can also be trained on document-level relations; in that case, use "document" instead of "sentence" as input at prediction time.
re_model = RelationExtractionDLModel()\
    .pretrained('redl_clinical_biobert', 'en', "clinical/models") \
    .setPredictionThreshold(0.5)\
    .setInputCols(["re_ner_chunks", "sentences"]) \
    .setOutputCol("relations")

pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model])

text ="""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . """

data = spark.createDataFrame([[text]]).toDF("text")

p_model = pipeline.fit(data)

result = p_model.transform(data)
```
```scala
...
val documenter = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentencer = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentences")

val tokenizer = new Tokenizer()
    .setInputCols("sentences")
    .setOutputCol("tokens")

val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens"))
    .setOutputCol("pos_tags")

val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens"))
    .setOutputCol("embeddings")

val ner_tagger = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens", "embeddings"))
    .setOutputCol("ner_tags")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentences", "tokens", "ner_tags"))
    .setOutputCol("ner_chunks")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
    .setInputCols(Array("sentences", "pos_tags", "tokens"))
    .setOutputCol("dependencies")

// Set a filter on pairs of named entities which will be treated as relation candidates
val re_ner_chunk_filter = new RENerChunksFilter()
    .setInputCols(Array("ner_chunks", "dependencies"))
    .setMaxSyntacticDistance(10)
    .setOutputCol("re_ner_chunks")
    .setRelationPairs(Array("SYMPTOM-EXTERNAL_BODY_PART_OR_REGION"))

// This model was trained on sentence-level data.
// It can also be trained on document-level relations; in that case, use "document" instead of "sentence" as input at prediction time.
val re_model = RelationExtractionDLModel.pretrained("redl_clinical_biobert", "en", "clinical/models")
    .setPredictionThreshold(0.5)
    .setInputCols(Array("re_ner_chunks", "sentences"))
    .setOutputCol("relations")

val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model))

val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.relation.clinical").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . """)
```
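Each extracted relation carries a confidence score (visible in the Results table). A hedged, plain-Python sketch of keeping only high-confidence relations; the tuple shape is an illustrative flattening of the `relations` output column, not the DataFrame schema itself:

```python
# Keep only relations whose confidence clears a threshold.
# Each row here is (relation_type, chunk1, chunk2, confidence) --
# an illustrative flattening of the `relations` output column.
def filter_relations(rows, threshold=0.7):
    return [row for row in rows if row[3] >= threshold]

rows = [
    ("PIP", "gestational diabetes mellitus", "T2DM", 0.763447),
    ("PIP", "subsequent type two diabetes mellitus", "HTG-induced pancreatitis", 0.610396),
]
print(filter_relations(rows))  # keeps only the first row
```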
## Results ```bash | | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |---:|:-----------|:----------|----------------:|--------------:|:--------------------------------------|:----------|----------------:|--------------:|:-------------------------|-------------:| | 0 | PIP | PROBLEM | 39 | 67 | gestational diabetes mellitus | PROBLEM | 157 | 160 | T2DM | 0.763447 | | 1 | PIP | PROBLEM | 39 | 67 | gestational diabetes mellitus | PROBLEM | 289 | 295 | obesity | 0.682246 | | 2 | PIP | PROBLEM | 117 | 153 | subsequent type two diabetes mellitus | PROBLEM | 187 | 210 | HTG-induced pancreatitis | 0.610396 | | 3 | PIP | PROBLEM | 117 | 153 | subsequent type two diabetes mellitus | PROBLEM | 264 | 281 | an acute hepatitis | 0.726894 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_clinical_biobert| |Compatibility:|Healthcare NLP 2.7.3+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Data Source Trained on 2010 i2b2 relation challenge. ## Benchmarking ```bash Relation Recall Precision F1 Support PIP 0.859 0.878 0.869 1435 TeCP 0.629 0.782 0.697 337 TeRP 0.903 0.929 0.916 2034 TrAP 0.872 0.866 0.869 1693 TrCP 0.641 0.677 0.659 340 TrIP 0.517 0.796 0.627 151 TrNAP 0.402 0.672 0.503 112 TrWP 0.257 0.824 0.392 109 Avg. 0.635 0.803 0.691 ``` --- layout: model title: Sentence Entity Resolver for ICD10-CM (general 3 character codes) author: John Snow Labs name: sbiobertresolve_icd10cm_generalised date: 2021-09-29 tags: [licensed, clinical, en, entity_resolution] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.2.1 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD10-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. 
It predicts ICD codes up to 3 characters (in the ICD-10 code structure, the first three characters represent the general type of the injury or disease).

## Predicted Entities

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_generalised_en_3.2.1_3.0_1632938859569.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_generalised_en_3.2.1_3.0_1632938859569.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use

The `sbiobertresolve_icd10cm_generalised` resolver model must be used with `sbiobert_base_cased_mli` as the embeddings model and `ner_clinical` as the NER model, with `PROBLEM` set in `.setWhiteList()`.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") icd10_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_icd10cm_generalised","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . 
Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val icd10_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_icd10cm_generalised","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver)) val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . 
Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd10cm_generalised").predict("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""") ```
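The resolver's `all_k_codes` and `all_k_resolutions` metadata fields come back as `:::`-separated strings (visible in the Results below). A hedged, plain-Python sketch of pairing each candidate code with its description (the helper name is illustrative):

```python
# Pair each candidate ICD-10 code with its description.
# `all_k_codes` / `all_k_resolutions` are ':::'-separated strings in the
# resolver's annotation metadata.
def pair_candidates(all_k_codes, all_k_resolutions):
    codes = all_k_codes.split(":::")
    descs = all_k_resolutions.split(":::")
    return list(zip(codes, descs))

pairs = pair_candidates("I10:::I15", "hypertension:::hypertension (high blood pressure)")
print(pairs)  # [('I10', 'hypertension'), ('I15', 'hypertension (high blood pressure)')]
```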
## Results ```bash | | chunk | begin | end | entity | code | code_desc | distance | all_k_resolutions | all_k_codes | |---:|:----------------------------|--------:|------:|:---------|:-------|:---------------------------------------------------------|-----------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------| | 0 | hypertension | 68 | 79 | PROBLEM | I10 | hypertension | 0 | hypertension:::hypertension (high blood pressure):::h/o: hypertension:::fh: hypertension:::hypertensive heart disease:::labile hypertension:::history of hypertension (situation):::endocrine hypertension | I10:::I15:::Z86:::Z82:::I11:::R03:::Z87:::E27 | | 1 | chronic renal insufficiency | 83 | 109 | PROBLEM | N18 | chronic renal impairment | 0.014 | chronic renal impairment:::renal insufficiency:::renal failure:::anaemia of chronic renal insufficiency:::impaired renal function disorder:::history of renal insufficiency:::prerenal renal failure:::abnormal renal function:::abnormal renal function | N18:::P96:::N19:::D63:::N28:::Z87:::N17:::N25:::R94 | | 2 | COPD | 113 | 116 | PROBLEM | J44 | chronic obstructive lung disease (disorder) | 0.1197 | chronic obstructive lung disease (disorder):::chronic obstructive pulmonary disease leaflet given:::chronic pulmonary congestion (disorder):::chronic respiratory failure (disorder):::chronic respiratory insufficiency:::cor pulmonale (chronic):::history of - chronic lung disease (situation) | J44:::Z76:::J81:::J96:::R06:::I27:::Z87 | | 3 | gastritis | 120 | 128 | PROBLEM | K29 | gastritis | 0 | gastritis:::bacterial gastritis:::parasitic gastritis | K29:::B96:::K93 | | 4 | TIA | 136 | 138 
| PROBLEM | S06 | cerebral concussion | 0.1662 | cerebral concussion:::transient ischemic attack (disorder):::thalamic stroke:::cerebral trauma:::stroke:::traumatic amputation:::spinal cord stroke | S06:::G45:::I63:::S09:::I64:::T14:::G95 | | 5 | a non-ST elevation MI | 182 | 202 | PROBLEM | I21 | non-st elevation (nstemi) myocardial infarction | 0.1615 | non-st elevation (nstemi) myocardial infarction:::nonruptured cerebral artery dissection:::acute stroke, nonatherosclerotic:::nontraumatic ischemic infarction of muscle, unsp shoulder:::history of nonatherosclerotic stroke without residual deficits:::non-traumatic cerebral hemorrhage | I21:::I67:::I63:::M62:::Z86:::I61 | | 6 | Guaiac positive stools | 208 | 229 | PROBLEM | R85 | abnormal anal pap | 0.1807 | abnormal anal pap:::straining at stool (finding):::amine test positive:::appendiceal colic:::fecal smearing:::epiploic appendagitis:::diverticulosis of intestine (finding):::appendicitis (disorder):::colostomy present (finding):::thickened anal verge (finding):::anal fissure:::amoebic enteritis:::zenkers diverticulum | R85:::R19:::Z78:::K38:::R15:::K65:::K57:::K37:::Z93:::K62:::K60:::A06:::K22 | | 7 | mid LAD lesion | 332 | 345 | PROBLEM | I21 | stemi involving left anterior descending coronary artery | 0.1595 | stemi involving left anterior descending coronary artery:::divided left atrium:::disorder of left atrium:::double inlet left ventricle:::left os acromiale:::furuncle of left upper limb:::left anterior fascicular hemiblock (heart rhythm):::aberrant origin of left subclavian artery:::stent in circumflex branch of left coronary artery (finding) | I21:::Q24:::I51:::Q20:::M89:::L02:::I44:::Q27:::Z95 | | 8 | hypotension | 362 | 372 | PROBLEM | I95 | hypotension | 0 | hypotension:::supine hypotensive syndrome | I95:::O26 | | 9 | bradycardia | 378 | 388 | PROBLEM | R00 | bradycardia | 0 | bradycardia:::bradycardia (finding):::drug-induced bradycardia:::bradycardia (disorder) | R00:::P29:::T50:::P20 | | 10 | 
vagal reaction | 466 | 479 | PROBLEM | G52 | vagus nerve finding | 0.0926 | vagus nerve finding:::vasomotor reaction:::vesicular breathing (finding):::abdominal muscle tone - finding:::agonizing state:::paresthesia (finding):::glossolalia (finding):::tactile alteration (finding) | G52:::I73:::R09:::R19:::R45:::R20:::R41:::R44 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icd10cm_generalised| |Compatibility:|Healthcare NLP 3.2.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_chunk_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on ICD10 Clinical Modification dataset with `sbiobert_base_cased_mli` sentence embeddings. https://www.icd10data.com/ICD10CM/Codes/ --- layout: model title: Pretrained Pipeline for Few-NERD-General NER Model author: John Snow Labs name: nerdl_fewnerd_100d_pipeline date: 2021-12-03 tags: [fewnerd, nerdl, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.3.1 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on Few-NERD model and it detects : `PERSON`, `ORGANIZATION`, `LOCATION`, `ART`, `BUILDING`, `PRODUCT`, `EVENT`, `OTHER` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_FEW_NERD/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_FewNERD.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/nerdl_fewnerd_100d_pipeline_en_3.3.1_2.4_1638523061152.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/public/models/nerdl_fewnerd_100d_pipeline_en_3.3.1_2.4_1638523061152.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
fewnerd_pipeline = PretrainedPipeline("nerdl_fewnerd_100d_pipeline", lang = "en")

fewnerd_pipeline.annotate("""The Double Down is a sandwich offered by Kentucky Fried Chicken restaurants. He did not see active service again until 1882, when he took part in the Anglo-Egyptian War, and was present at the battle of Tell El Kebir (September 1882), for which he was mentioned in dispatches, received the Egypt Medal with clasp and the 3rd class of the Order of Medjidie, and was appointed a Companion of the Order of the Bath (CB).""")
```
```scala
val pipeline = new PretrainedPipeline("nerdl_fewnerd_100d_pipeline", lang = "en")

val result = pipeline.fullAnnotate("The Double Down is a sandwich offered by Kentucky Fried Chicken restaurants. He did not see active service again until 1882, when he took part in the Anglo-Egyptian War, and was present at the battle of Tell El Kebir (September 1882), for which he was mentioned in dispatches, received the Egypt Medal with clasp and the 3rd class of the Order of Medjidie, and was appointed a Companion of the Order of the Bath (CB).")(0)
```
## Results

```bash
+-----------------------+------------+
|chunk                  |ner_label   |
+-----------------------+------------+
|Kentucky Fried Chicken |ORGANIZATION|
|Anglo-Egyptian War     |EVENT       |
|battle of Tell El Kebir|EVENT       |
|Egypt Medal            |OTHER       |
|Order of Medjidie      |OTHER       |
+-----------------------+------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|nerdl_fewnerd_100d_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.3.1+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|

## Included Models

- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- NerDLModel
- NerConverter
- Finisher
---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_cantonese TFWav2Vec2ForCTC from ivanlau
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xls_r_300m_cantonese
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_cantonese` is an English model originally trained by ivanlau.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_cantonese_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_cantonese_en_4.2.0_3.0_1664113131550.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_cantonese_en_4.2.0_3.0_1664113131550.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_cantonese', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_cantonese", lang = "en") val annotations = pipeline.transform(audioDF) ```
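`transform` expects `audioDF` rows carrying the raw audio as an array of floats. As a hedged sketch of the preparation step (standard library only; the normalization convention, dividing 16-bit samples by 32768, is a common assumption for Wav2Vec2-style inputs, not something this card specifies):

```python
import struct

# Convert raw 16-bit little-endian PCM bytes into floats in [-1, 1),
# the array-of-floats representation ASR pipelines consume.
def pcm16_to_floats(pcm_bytes):
    n = len(pcm_bytes) // 2
    samples = struct.unpack("<%dh" % n, pcm_bytes[: n * 2])
    return [s / 32768.0 for s in samples]

raw = struct.pack("<3h", 0, 16384, -32768)
print(pcm16_to_floats(raw))  # [0.0, 0.5, -1.0]
```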
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_cantonese| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Stopwords Remover for Hungarian language (219 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, hu, open_source] task: Stop Words Removal language: hu edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_hu_3.4.1_3.0_1646673036995.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_hu_3.4.1_3.0_1646673036995.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

stop_words = StopWordsCleaner.pretrained("stopwords_iso","hu") \
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])

example = spark.createDataFrame([["Nem vagy jobb, mint én"]], ["text"])

results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val stop_words = StopWordsCleaner.pretrained("stopwords_iso","hu")
    .setInputCols(Array("token"))
    .setOutputCol("cleanTokens")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))

val data = Seq("Nem vagy jobb, mint én").toDF("text")

val results = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("hu.stopwords").predict("""Nem vagy jobb, mint én""")
```
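What `StopWordsCleaner` does here can be mirrored in a few lines of plain Python. A hedged sketch (the four stopwords below are an assumed subset consistent with the Results shown for this example, not the full 219-entry ISO list):

```python
# Case-insensitive stopword filtering, mimicking StopWordsCleaner.
# The stopword set is a tiny illustrative subset of the Hungarian
# stopwords-iso list.
HU_STOPWORDS = {"nem", "vagy", "mint", "én"}

def clean_tokens(tokens, stopwords=HU_STOPWORDS):
    return [t for t in tokens if t.lower() not in stopwords]

print(clean_tokens(["Nem", "vagy", "jobb", ",", "mint", "én"]))  # ['jobb', ',']
```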
## Results

```bash
+---------+
|result   |
+---------+
|[jobb, ,]|
+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|hu|
|Size:|2.1 KB|
---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from jteng)
author: John Snow Labs
name: distilbert_qa_finetuned_syllabus
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-syllabus` is an English model originally trained by `jteng`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_finetuned_syllabus_en_4.3.0_3.0_1672765843797.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_finetuned_syllabus_en_4.3.0_3.0_1672765843797.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_finetuned_syllabus","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_finetuned_syllabus","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_finetuned_syllabus| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/jteng/bert-finetuned-syllabus --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from ardallie) author: John Snow Labs name: xlmroberta_ner_ardallie_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `ardallie`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_ardallie_base_finetuned_panx_de_4.1.0_3.0_1660430915952.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_ardallie_base_finetuned_panx_de_4.1.0_3.0_1660430915952.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_ardallie_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_ardallie_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
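The `NerConverter` stage in the pipeline above groups token-level IOB tags (`B-PER`, `I-PER`, `O`, ...) into entity chunks. A rough, framework-free sketch of that grouping logic (tokens and tags are invented for illustration):

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (text, label) entity chunks."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            # "O" tag or inconsistent continuation: close any open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Angela", "Merkel", "besuchte", "Berlin", "."]
tags   = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(iob_to_chunks(tokens, tags))  # -> [('Angela Merkel', 'PER'), ('Berlin', 'LOC')]
```

This is only a sketch of the idea; the real annotator also carries character offsets and metadata in its `ner_chunk` annotations.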
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_ardallie_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/ardallie/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: English T5ForConditionalGeneration Cased model (from pitehu) author: John Snow Labs name: t5_ner_conll_entityreplace date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `T5_NER_CONLL_ENTITYREPLACE` is an English model originally trained by `pitehu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_ner_conll_entityreplace_en_4.3.0_3.0_1675099568513.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_ner_conll_entityreplace_en_4.3.0_3.0_1675099568513.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_ner_conll_entityreplace","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_ner_conll_entityreplace","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_ner_conll_entityreplace| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|275.5 MB| ## References - https://huggingface.co/pitehu/T5_NER_CONLL_ENTITYREPLACE - https://arxiv.org/pdf/2111.10952.pdf - https://arxiv.org/pdf/1810.04805.pdf --- layout: model title: Legal Dispute resolution Clause Binary Classifier author: John Snow Labs name: legclf_dispute_resolution_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `dispute-resolution` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `dispute-resolution` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_dispute_resolution_clause_en_1.0.0_3.2_1660122359464.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_dispute_resolution_clause_en_1.0.0_3.2_1660122359464.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_dispute_resolution_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
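As the description notes, long contracts should be split before classification so each clause candidate fits within the 512-token embedding window. A minimal, Spark-independent sketch of the "paragraph splitting (by multiline)" strategy mentioned there (the sample contract text is invented for illustration):

```python
import re

def split_paragraphs(text):
    """Split a document into clause candidates on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = """ARTICLE 12. DISPUTE RESOLUTION.

Any dispute arising under this Agreement shall be settled by arbitration.

ARTICLE 13. NOTICES."""

for paragraph in split_paragraphs(doc):
    print(paragraph)
```

Each resulting paragraph can then be loaded into the `clause_text` column and classified independently by the pipeline above.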
## Results ```bash +--------------------+ |result | +--------------------+ |[dispute-resolution]| |[other] | |[other] | |[dispute-resolution]| +--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_dispute_resolution_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support dispute-resolution 0.84 0.84 0.84 32 other 0.94 0.94 0.94 84 accuracy - - 0.91 116 macro-avg 0.89 0.89 0.89 116 weighted-avg 0.91 0.91 0.91 116 ``` --- layout: model title: English T5ForConditionalGeneration Small Cased model (from allenai) author: John Snow Labs name: t5_small_next_word_generator_qoogle date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-next-word-generator-qoogle` is an English model originally trained by `allenai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_next_word_generator_qoogle_en_4.3.0_3.0_1675126551905.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_next_word_generator_qoogle_en_4.3.0_3.0_1675126551905.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_next_word_generator_qoogle","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_next_word_generator_qoogle","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
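A next-word generator like this one ultimately turns raw scores over candidate words into probabilities and, under greedy decoding, emits the most likely one. A toy, framework-free sketch of that softmax-and-argmax step (the candidate words and scores below are invented, not real model output):

```python
import math

def greedy_next_word(scores):
    """Convert raw scores into a probability distribution and pick the argmax."""
    m = max(scores.values())                       # subtract max for numerical stability
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    total = sum(exps.values())
    probs = {w: e / total for w, e in exps.items()}
    return max(probs, key=probs.get), probs

# Invented scores for a hypothetical prefix such as "The capital of France is".
scores = {"Paris": 4.2, "Lyon": 1.1, "Berlin": 0.3}
word, probs = greedy_next_word(scores)
print(word)  # -> Paris
```

The actual T5 decoder works over subword vocabularies and may use beam search, but the selection principle is the same.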
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_next_word_generator_qoogle| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|148.1 MB| ## References - https://huggingface.co/allenai/t5-small-next-word-generator-qoogle --- layout: model title: Translate English to Tagalog Pipeline author: John Snow Labs name: translate_en_tl date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, tl, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `tl` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_tl_xx_2.7.0_2.4_1609691452242.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_tl_xx_2.7.0_2.4_1609691452242.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_tl", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_tl", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.tl').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_tl| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011) author: John Snow Labs name: distilbert_token_classifier_autotrain_name_vsv_all_901529445 date: 2023-03-03 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-name_vsv_all-901529445` is an English model originally trained by `ismail-lucifer011`. ## Predicted Entities `OOV`, `Name` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_vsv_all_901529445_en_4.3.1_3.0_1677881751372.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_vsv_all_901529445_en_4.3.1_3.0_1677881751372.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_vsv_all_901529445","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_vsv_all_901529445","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_name_vsv_all_901529445| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ismail-lucifer011/autotrain-name_vsv_all-901529445 --- layout: model title: BioBERT Sentence Embeddings (PMC) author: John Snow Labs name: sent_biobert_pmc_base_cased date: 2020-09-19 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.2 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains a pre-trained weights of BioBERT, a language representation model for biomedical domain, especially designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_biobert_pmc_base_cased_en_2.6.2_2.4_1600532770743.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_biobert_pmc_base_cased_en_2.6.2_2.4_1600532770743.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pmc_base_cased", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pmc_base_cased", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.biobert.pmc_base_cased').predict(text, output_level='sentence') embeddings_df ```
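Sentence embeddings like these are typically compared with cosine similarity, e.g. for semantic search over biomedical text. A small framework-free sketch (the vectors below are toy 4-dimensional stand-ins for the model's 768-dimensional output, not real values):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins: two "similar" sentence vectors and one unrelated one.
emb_a = [0.34, 0.04, -0.12, 0.50]
emb_b = [0.44, 0.07, -0.11, 0.48]
emb_c = [-0.60, 0.90, 0.10, -0.20]

print(cosine(emb_a, emb_b))  # similar sentences -> close to 1
print(cosine(emb_a, emb_c))  # unrelated sentences -> much lower
```

In practice you would extract the `sentence_embeddings` column from `result` and apply this comparison to the real 768-dimensional vectors.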
{:.h2_title} ## Results ```bash sentence en_embed_sentence_biobert_pmc_base_cased_embeddings I hate cancer [0.34035101532936096, 0.04413360357284546, -0.... Antibiotics aren't painkiller [0.4397204518318176, 0.066007100045681, -0.114... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_biobert_pmc_base_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.2| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert) --- layout: model title: BERT Embeddings trained on MEDLINE/PubMed and fine-tuned on SQuAD 2.0 author: John Snow Labs name: bert_pubmed_squad2 date: 2021-08-30 tags: [en, open_source, squad_2_dataset, medline_pubmed_dataset, bert_embeddings] task: Embeddings language: en nav_key: models edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses a BERT base architecture initialized from https://tfhub.dev/google/experts/bert/pubmed/1 and fine-tuned on SQuAD 2.0. This is a BERT base architecture but some changes have been made to the original training and export scheme based on more recent learnings. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pubmed_squad2_en_3.2.0_3.0_1630323544592.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pubmed_squad2_en_3.2.0_3.0_1630323544592.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_pubmed_squad2", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_pubmed_squad2", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.pubmed_squad2').predict(text, output_level='token') embeddings_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pubmed_squad2| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Case sensitive:|false| ## Data Source [1]: [Wikipedia dataset](https://dumps.wikimedia.org/) [2]: [BooksCorpus dataset](http://yknzhu.wixsite.com/mbweb) [3]: [Stanford Question Answering (SQuAD 2.0) dataset](https://rajpurkar.github.io/SQuAD-explorer/) [4]: [MEDLINE/PubMed dataset](https://www.nlm.nih.gov/databases/download/pubmed_medline.html) This model has been imported from: https://tfhub.dev/google/experts/bert/pubmed/squad2/2 --- layout: model title: Bashkir asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt TFWav2Vec2ForCTC from AigizK author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt date: 2022-09-24 tags: [wav2vec2, ba, audio, open_source, asr] task: Automatic Speech Recognition language: ba edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt` is a Bashkir model originally trained by AigizK. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use `asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt_gpu` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt_ba_4.2.0_3.0_1664040309179.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt_ba_4.2.0_3.0_1664040309179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt", "ba")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt", "ba") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
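Wav2Vec2 models are trained with CTC, so the raw per-frame predictions contain repeated labels and blank symbols that must be collapsed into the final transcript. A toy sketch of that greedy CTC decoding step (the per-frame labels below are invented for illustration):

```python
def ctc_collapse(frame_labels, blank="_"):
    """Greedy CTC decode: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return "".join(out)

# One invented label per audio frame, as a CTC head might emit them.
frames = ["h", "h", "_", "e", "e", "l", "_", "l", "l", "o", "_"]
print(ctc_collapse(frames))  # -> hello
```

The annotator performs this decoding internally; the blank between the two `l` runs is what allows the repeated letter in "hello" to survive the merge.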
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ba| |Size:|1.2 GB| --- layout: model title: Hindi BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_hi_cased date: 2022-12-02 tags: [hi, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: hi edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-hi-cased` is a Hindi model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_hi_cased_hi_4.2.4_3.0_1670017763072.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_hi_cased_hi_4.2.4_3.0_1670017763072.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_hi_cased","hi") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_hi_cased","hi") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_hi_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|hi| |Size:|339.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-hi-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: German BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_de_cased date: 2022-12-02 tags: [de, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: de edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-de-cased` is a German model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_de_cased_de_4.2.4_3.0_1670016505471.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_de_cased_de_4.2.4_3.0_1670016505471.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_de_cased","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_de_cased","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_de_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|398.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-de-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Translate English to Twi Pipeline author: John Snow Labs name: translate_en_tw date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, tw, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `tw` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_tw_xx_2.7.0_2.4_1609691518744.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_tw_xx_2.7.0_2.4_1609691518744.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_tw", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_tw", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.tw').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_tw| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Marscen) author: John Snow Labs name: distilbert_qa_marscen_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Marscen`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_marscen_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768784222.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_marscen_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768784222.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_marscen_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_marscen_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_marscen_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Marscen/distilbert-base-uncased-finetuned-squad --- layout: model title: English BertForQuestionAnswering Cased model (from spasis) author: John Snow Labs name: bert_qa_spasis_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `spasis`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spasis_finetuned_squad_en_4.0.0_3.0_1657186715488.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spasis_finetuned_squad_en_4.0.0_3.0_1657186715488.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spasis_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spasis_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spasis_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/spasis/bert-finetuned-squad --- layout: model title: Extract Test Entities from Voice of the Patient Documents (embeddings_clinical_medium) author: John Snow Labs name: ner_vop_test_emb_clinical_medium date: 2023-06-06 tags: [licensed, clinical, ner, en, vop, test] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts test mentions from documents written in the patient’s own words. ## Predicted Entities `VitalTest`, `Test`, `Measurements`, `TestResult` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_test_emb_clinical_medium_en_4.4.3_3.0_1686076924102.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_test_emb_clinical_medium_en_4.4.3_3.0_1686076924102.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_test_emb_clinical_medium", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["I went to the endocrinology department to get my thyroid levels checked. 
They ordered a blood test and found out that I have hypothyroidism, so now I'm on medication to manage it."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_test_emb_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("I went to the endocrinology department to get my thyroid levels checked. They ordered a blood test and found out that I have hypothyroidism, so now I'm on medication to manage it.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:---------------|:------------| | thyroid levels | Test | | blood test | Test | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_test_emb_clinical_medium| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical_medium| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
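The precision, recall, and F1 figures in the Benchmarking section below follow directly from the raw tp/fp/fn counts. A quick, self-contained sketch of that arithmetic, checked against the counts reported for the `Test` label:

```python
# Standard NER evaluation arithmetic: precision, recall, and F1
# from true-positive, false-positive, and false-negative counts.
def prf1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts for the "Test" label from the benchmarking table.
p, r, f = prf1(tp=1040, fp=118, fn=168)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.9 0.86 0.88
```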
## Benchmarking ```bash label tp fp fn total precision recall f1 VitalTest 162 32 10 172 0.84 0.94 0.89 Test 1040 118 168 1208 0.90 0.86 0.88 Measurements 136 22 50 186 0.86 0.73 0.79 TestResult 360 109 164 524 0.77 0.69 0.73 macro_avg 1698 281 392 2090 0.84 0.80 0.82 micro_avg 1698 281 392 2090 0.86 0.81 0.84 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from ydshieh) author: John Snow Labs name: roberta_qa_ydshieh_base_squad2 date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2` is an English model originally trained by `ydshieh`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ydshieh_base_squad2_en_4.2.4_3.0_1669986831741.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ydshieh_base_squad2_en_4.2.4_3.0_1669986831741.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ydshieh_base_squad2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ydshieh_base_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_ydshieh_base_squad2| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/ydshieh/roberta-base-squad2 - https://github.com/deepset-ai/FARM/issues/552 - https://github.com/deepset-ai/FARM/blob/master/examples/question_answering.py - https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/ - https://github.com/deepset-ai/haystack/ - https://deepset.ai/german-bert - https://deepset.ai/germanquad - https://github.com/deepset-ai/FARM - https://twitter.com/deepset_ai - https://www.linkedin.com/company/deepset-ai/ - https://haystack.deepset.ai/community/join - https://github.com/deepset-ai/haystack/discussions - https://deepset.ai - http://www.deepset.ai/jobs --- layout: model title: Pipeline to Extract the Names of Drugs & Chemicals author: John Snow Labs name: ner_chemd_clinical_pipeline date: 2023-03-14 tags: [chemdner, chemd, ner, clinical, en, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_chemd_clinical](https://nlp.johnsnowlabs.com/2021/11/04/ner_chemd_clinical_en.html) model.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chemd_clinical_pipeline_en_4.3.0_3.2_1678778578175.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chemd_clinical_pipeline_en_4.3.0_3.2_1678778578175.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_chemd_clinical_pipeline", "en", "clinical/models") text = '''Isolation, Structure Elucidation, and Iron-Binding Properties of Lystabactins, Siderophores Isolated from a Marine Pseudoalteromonas sp. The marine bacterium Pseudoalteromonas sp. S2B, isolated from the Gulf of Mexico after the Deepwater Horizon oil spill, was found to produce lystabactins A, B, and C (1-3), three new siderophores. The structures were elucidated through mass spectrometry, amino acid analysis, and NMR. The lystabactins are composed of serine (Ser), asparagine (Asn), two formylated/hydroxylated ornithines (FOHOrn), dihydroxy benzoic acid (Dhb), and a very unusual nonproteinogenic amino acid, 4,8-diamino-3-hydroxyoctanoic acid (LySta). The iron-binding properties of the compounds were investigated through a spectrophotometric competition.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_chemd_clinical_pipeline", "en", "clinical/models") val text = "Isolation, Structure Elucidation, and Iron-Binding Properties of Lystabactins, Siderophores Isolated from a Marine Pseudoalteromonas sp. The marine bacterium Pseudoalteromonas sp. S2B, isolated from the Gulf of Mexico after the Deepwater Horizon oil spill, was found to produce lystabactins A, B, and C (1-3), three new siderophores. The structures were elucidated through mass spectrometry, amino acid analysis, and NMR. The lystabactins are composed of serine (Ser), asparagine (Asn), two formylated/hydroxylated ornithines (FOHOrn), dihydroxy benzoic acid (Dhb), and a very unusual nonproteinogenic amino acid, 4,8-diamino-3-hydroxyoctanoic acid (LySta). The iron-binding properties of the compounds were investigated through a spectrophotometric competition." 
val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-----------------------------------|--------:|------:|:-------------|-------------:| | 0 | Lystabactins | 65 | 76 | FAMILY | 0.9841 | | 1 | lystabactins A, B, and C | 278 | 301 | MULTIPLE | 0.813429 | | 2 | amino acid | 392 | 401 | FAMILY | 0.74585 | | 3 | lystabactins | 426 | 437 | FAMILY | 0.8007 | | 4 | serine | 455 | 460 | TRIVIAL | 0.9924 | | 5 | Ser | 463 | 465 | FORMULA | 0.9999 | | 6 | asparagine | 469 | 478 | TRIVIAL | 0.9795 | | 7 | Asn | 481 | 483 | FORMULA | 0.9999 | | 8 | formylated/hydroxylated ornithines | 491 | 524 | FAMILY | 0.50085 | | 9 | FOHOrn | 527 | 532 | FORMULA | 0.509 | | 10 | dihydroxy benzoic acid | 536 | 557 | SYSTEMATIC | 0.6346 | | 11 | amino acid | 602 | 611 | FAMILY | 0.4204 | | 12 | 4,8-diamino-3-hydroxyoctanoic acid | 614 | 647 | SYSTEMATIC | 0.9124 | | 13 | LySta | 650 | 654 | ABBREVIATION | 0.9193 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_chemd_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Stop Words Cleaner for Yoruba author: John Snow Labs name: stopwords_yo date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: yo edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, yo] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. 
Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_yo_yo_2.5.4_2.4_1594742440695.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_yo_yo_2.5.4_2.4_1594742440695.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_yo", "yo") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Yato si jijẹ ọba ariwa, John Snow jẹ oṣoogun ara Gẹẹsi kan ati adari ninu idagbasoke anaesthesia ati imototo ilera.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_yo", "yo") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Yato si jijẹ ọba ariwa, John Snow jẹ oṣoogun ara Gẹẹsi kan ati adari ninu idagbasoke anaesthesia ati imototo ilera.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Yato si jijẹ ọba ariwa, John Snow jẹ oṣoogun ara Gẹẹsi kan ati adari ninu idagbasoke anaesthesia ati imototo ilera."""] stopword_df = nlu.load('yo.stopwords').predict(text) stopword_df[['cleanTokens']] ```
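Conceptually, `StopWordsCleaner` performs a set-membership filter over the token stream. A language-agnostic sketch of the idea in plain Python — the three-word stopword set here is illustrative only, not the model's actual Yoruba list:

```python
# Illustrative only: a toy stopword filter. The pretrained stopwords_yo model
# ships its own Yoruba stopword list; these three words are a hypothetical subset.
stopwords = {"si", "ati", "ninu"}

def clean_tokens(tokens):
    # Keep every token whose lowercase form is not in the stopword set.
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "Yato si jijẹ ọba ariwa".split()
print(clean_tokens(tokens))  # ['Yato', 'jijẹ', 'ọba', 'ariwa']
```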
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=3, result='Yato', metadata={'sentence': '0'}), Row(annotatorType='token', begin=5, end=6, result='si', metadata={'sentence': '0'}), Row(annotatorType='token', begin=8, end=11, result='jijẹ', metadata={'sentence': '0'}), Row(annotatorType='token', begin=13, end=15, result='ọba', metadata={'sentence': '0'}), Row(annotatorType='token', begin=17, end=21, result='ariwa', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_yo| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|yo| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: English BertForMaskedLM Base Uncased model (from mlcorelib) author: John Snow Labs name: bert_embeddings_deberta_base_uncased date: 2022-12-06 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deberta-base-uncased` is an English model originally trained by `mlcorelib`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_deberta_base_uncased_en_4.2.4_3.0_1670326237283.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_deberta_base_uncased_en_4.2.4_3.0_1670326237283.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_deberta_base_uncased","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_deberta_base_uncased","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
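The `embeddings` column produced above holds one dense vector per token. Downstream comparisons between such vectors are typically done with cosine similarity; a minimal, library-free sketch (the 4-dimensional vectors are toy values standing in for real 768-dimensional BERT embeddings):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of L2 norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for token embeddings.
v1 = [0.2, 0.1, -0.4, 0.3]
v2 = [0.2, 0.1, -0.4, 0.3]
print(cosine(v1, v2))  # ≈ 1.0 for identical vectors
```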
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_deberta_base_uncased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|409.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/mlcorelib/deberta-base-uncased - https://arxiv.org/abs/1810.04805 - https://github.com/google-research/bert - https://yknzhu.wixsite.com/mbweb - https://en.wikipedia.org/wiki/English_Wikipedia --- layout: model title: English RobertaForQuestionAnswering (from huxxx657) author: John Snow Labs name: roberta_qa_roberta_base_finetuned_squad_2 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad-2` is an English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_squad_2_en_4.0.0_3.0_1655734456241.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_squad_2_en_4.0.0_3.0_1655734456241.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_finetuned_squad_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_finetuned_squad_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_v2.by_huxxx657").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_finetuned_squad_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|437.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-squad-2 --- layout: model title: Finnish asr_wav2vec2_large_xlsr_finnish TFWav2Vec2ForCTC from birgermoell author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_finnish date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_finnish` is a Finnish model originally trained by birgermoell. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_finnish_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_finnish_fi_4.2.0_3.0_1664021434692.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_finnish_fi_4.2.0_3.0_1664021434692.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_finnish', lang = 'fi') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_finnish", lang = "fi") val annotations = pipeline.transform(audioDF) ```
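The pipeline's `AudioAssembler` stage expects `audioDF` to contain a column of floating-point sample arrays (Wav2Vec2 models are trained on 16 kHz mono audio). A stdlib-only sketch of the usual preprocessing step — converting signed 16-bit PCM samples to normalized floats before building that DataFrame; the function name and toy buffer are illustrative, not part of the Spark NLP API:

```python
import struct

def pcm16_to_floats(raw: bytes):
    # Interpret little-endian signed 16-bit samples and scale to [-1.0, 1.0).
    n = len(raw) // 2
    samples = struct.unpack("<%dh" % n, raw[: n * 2])
    return [s / 32768.0 for s in samples]

# Toy buffer: three samples (zero, half scale, negative full scale).
raw = struct.pack("<3h", 0, 16384, -32768)
print(pcm16_to_floats(raw))  # [0.0, 0.5, -1.0]
```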
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_finnish| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_8_h_768 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-8_H-768` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_8_h_768_zh_4.2.4_3.0_1670326074741.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_8_h_768_zh_4.2.4_3.0_1670326074741.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_8_h_768","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_8_h_768","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
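Downstream code typically compares the vectors in the `embeddings` column with cosine similarity. A small self-contained sketch with toy 3-dimensional vectors (real outputs of this model are 768-dimensional):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 0.0]), 4))  # -> 0.7071
```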
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_8_h_768| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|277.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-8_H-768 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Detect Clinical Events (Admissions) author: John Snow Labs name: ner_events_admission_clinical date: 2021-03-01 tags: [ner, licensed, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 2.7.4 spark_version: 2.4 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model can be used to detect clinical events in medical text, with a focus on admission entities. ## Predicted Entities `DATE`, `TIME`, `PROBLEM`, `TEST`, `TREATMENT`, `OCCURRENCE`, `CLINICAL_DEPT`, `EVIDENTIAL`, `DURATION`, `FREQUENCY`, `ADMISSION`, `DISCHARGE`.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_EVENTS_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_events_admission_clinical_en_2.7.4_2.4_1614582648104.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_events_admission_clinical_en_2.7.4_2.4_1614582648104.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_events_admission_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["The patient presented to the emergency room last evening"]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_events_admission_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("""The patient 
presented to the emergency room last evening""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.admission_events").predict("""The patient presented to the emergency room last evening""") ```
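The NerConverter stage merges token-level IOB tags into full entity chunks. The following pure-Python sketch illustrates that merging logic on the example sentence; the tags are hand-written for illustration and this is not Spark NLP's actual implementation.

```python
def iob_to_chunks(tokens, tags):
    """Merge B-/I- tagged tokens into (label, chunk_text) pairs."""
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = (tag[2:], [tok])          # start a new chunk
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)              # extend the open chunk
        else:                                   # "O" or inconsistent I- tag
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(toks)) for label, toks in chunks]

tokens = ["The", "patient", "presented", "to", "the", "emergency", "room", "last", "evening"]
tags = ["O", "O", "B-EVIDENTIAL", "O", "B-CLINICAL_DEPT", "I-CLINICAL_DEPT", "I-CLINICAL_DEPT", "B-DATE", "I-DATE"]
print(iob_to_chunks(tokens, tags))
# -> [('EVIDENTIAL', 'presented'), ('CLINICAL_DEPT', 'the emergency room'), ('DATE', 'last evening')]
```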
## Results ```bash +----+-----------------------------+---------+---------+-----------------+ | | chunk | begin | end | entity | +====+=============================+=========+=========+=================+ | 0 | presented | 12 | 20 | EVIDENTIAL | +----+-----------------------------+---------+---------+-----------------+ | 1 | the emergency room | 25 | 42 | CLINICAL_DEPT | +----+-----------------------------+---------+---------+-----------------+ | 2 | last evening | 44 | 55 | DATE | +----+-----------------------------+---------+---------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_events_admission_clinical| |Type:|ner| |Compatibility:|Healthcare NLP 2.7.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on augmented/enriched i2b2 events data with clinical_embeddings. The data for Admissions has been enriched specifically. ## Benchmarking ```bash label tp fp fn prec rec f1 I-TIME 42 6 9 0.875 0.8235294 0.8484849 I-TREATMENT 1134 111 312 0.9108434 0.7842324 0.8428094 B-OCCURRENCE 406 344 382 0.5413333 0.51522845 0.52795845 I-DURATION 160 42 71 0.7920792 0.6926407 0.73903 B-DATE 500 32 49 0.9398496 0.9107468 0.92506933 I-DATE 309 54 49 0.8512397 0.8631285 0.8571429 B-ADMISSION 206 1 2 0.9951691 0.99038464 0.9927711 I-PROBLEM 2394 390 412 0.85991377 0.85317177 0.8565295 B-CLINICAL_DEPT 327 64 77 0.8363171 0.8094059 0.8226415 B-TIME 44 12 15 0.78571427 0.7457627 0.76521736 I-CLINICAL_DEPT 597 62 78 0.90591806 0.8844444 0.8950525 B-PROBLEM 1643 260 252 0.86337364 0.86701846 0.86519223 I-FREQUENCY 35 21 39 0.625 0.47297296 0.5384615 I-TEST 1082 171 117 0.86352754 0.9024187 0.8825449 B-TEST 781 125 127 0.8620309 0.86013216 0.86108047 B-TREATMENT 1283 176 202 0.87936944 0.8639731 0.87160325 B-DISCHARGE 155 0 1 1.0 0.99358976 0.99678457 B-EVIDENTIAL 269 25 75 0.914966 0.78197676 0.84326017 B-DURATION 97 43 44 0.69285715 
0.6879433 0.6903914 B-FREQUENCY 70 16 33 0.81395346 0.6796116 0.7407407 tp: 11841 fp: 2366 fn: 2680 labels: 22 Macro-average prec: 0.8137135, rec: 0.7533389, f1: 0.7823631 Micro-average prec: 0.83346236, rec: 0.8154397, f1: 0.8243525 ``` --- layout: model title: Legal Whereas Clause Binary Classifier author: John Snow Labs name: legclf_cuad_whereas_clause date: 2022-11-25 tags: [whereas, en, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `whereas` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences rather than the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
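The "paragraph splitting (by multiline)" technique mentioned above can be sketched in a few lines of plain Python: split on blank lines and keep only pieces that fit the 512-token budget (whitespace word count is used here as a rough proxy for the model's tokenizer, so this is an approximation, not the workshop's actual code).

```python
import re

def split_paragraphs(text, max_tokens=512):
    """Split a document on blank lines; drop empty or over-long pieces."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [p for p in paragraphs if len(p.split()) <= max_tokens]

doc = "WHEREAS, the parties wish to cooperate.\n\nNOW, THEREFORE, the parties agree as follows."
print(split_paragraphs(doc))  # two paragraphs, each well under 512 tokens
```

Each returned paragraph can then be fed to the classifier as a separate row of the `clause_text` column.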
## Predicted Entities `whereas`, `other` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cuad_whereas_clause_en_1.0.0_3.0_1669379828062.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cuad_whereas_clause_en_1.0.0_3.0_1669379828062.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_cuad_whereas_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +---------+ |   result| +---------+ |[whereas]| |  [other]| |  [other]| |[whereas]| +---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_cuad_whereas_clause| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.4 MB| ## References In-house annotations on CUAD dataset ## Benchmarking ```bash label precision recall f1-score support other 0.98 0.94 0.96 67 whereas 0.91 0.98 0.94 41 accuracy - - 0.95 108 macro-avg 0.95 0.96 0.95 108 weighted-avg 0.96 0.95 0.95 108 ``` --- layout: model title: English DistilBertForQuestionAnswering model (from T-qualizer) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_advers date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-advers` is an English model originally trained by `T-qualizer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_advers_en_4.0.0_3.0_1654723842338.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_advers_en_4.0.0_3.0_1654723842338.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_advers","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_advers","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.base_uncased").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
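For intuition, extractive QA heads like DistilBertForQuestionAnswering score every token as a possible answer start and end, and the best-scoring valid (start, end) pair is returned as the span. A minimal pure-Python sketch with invented logits (not Spark NLP's actual implementation):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick the (start, end) pair with the highest combined score, end >= start."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.2, 0.1, 3.0, 0.0, 0.1, 0.0, 0.1, 0.5, 0.0]
end_logits = [0.0, 0.1, 0.2, 2.5, 0.1, 0.0, 0.1, 0.0, 0.3, 0.1]
s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```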
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_advers| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/T-qualizer/distilbert-base-uncased-finetuned-advers --- layout: model title: Applicable Law Clause NER Model author: John Snow Labs name: legner_applicable_law_clause date: 2023-01-12 tags: [en, ner, licensed, applicable_law] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an NER model intended to be run on `applicable_law` clauses to retrieve entities labeled `APPLIC_LAW`. Make sure you run this model only on `applicable_law` clauses after filtering them with the `legclf_applicable_law_cuad` model. ## Predicted Entities `APPLIC_LAW` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_applicable_law_clause_en_1.0.0_3.0_1673558480167.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_applicable_law_clause_en_1.0.0_3.0_1673558480167.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained("legner_applicable_law_clause", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""ELECTRAMECCANICA VEHICLES CORP., an entity incorporated under the laws of the Province of British Columbia, Canada, with an address of Suite 102 East 1st Avenue, Vancouver, British Columbia, Canada, V5T 1A4 ("EMV")""" ] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
## Results ```bash +----------------------------------------+----------+----------+ |chunk |ner_label |confidence| +----------------------------------------+----------+----------+ |laws of the Province of British Columbia|APPLIC_LAW|0.95625716| +----------------------------------------+----------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_applicable_law_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|1.1 MB| ## References In-house dataset ## Benchmarking ```bash label precision recall f1-score support B-APPLIC_LAW 0.90 0.89 0.90 84 I-APPLIC_LAW 0.98 0.93 0.96 425 micro-avg 0.97 0.93 0.95 509 macro-avg 0.94 0.91 0.93 509 weighted-avg 0.97 0.93 0.95 509 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from laampt) author: John Snow Labs name: distilbert_qa_laampt_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `laampt`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_laampt_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771910439.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_laampt_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771910439.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_laampt_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_laampt_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_laampt_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/laampt/distilbert-base-uncased-finetuned-squad --- layout: model title: German asr_wav2vec2_base_german TFWav2Vec2ForCTC from aware-ai author: John Snow Labs name: pipeline_asr_wav2vec2_base_german date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_german` is a German model originally trained by aware-ai. NOTE: This pipeline only works on a CPU. If you need to run this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_german_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_german_de_4.2.0_3.0_1664099298596.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_german_de_4.2.0_3.0_1664099298596.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_german', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_german", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_german| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering model (from NeuML) author: John Snow Labs name: bert_qa_bert_small_cord19_squad2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-small-cord19-squad2` is an English model originally trained by `NeuML`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_small_cord19_squad2_en_4.0.0_3.0_1654184738698.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_small_cord19_squad2_en_4.0.0_3.0_1654184738698.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_small_cord19_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_small_cord19_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_cord19.bert.small").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_small_cord19_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|130.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/NeuML/bert-small-cord19-squad2 --- layout: model title: Spanish Bert Embeddings (from amine) author: John Snow Labs name: bert_embeddings_bert_base_5lang_cased date: 2022-04-11 tags: [bert, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-5lang-cased` is a Spanish model originally trained by `amine`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_5lang_cased_es_3.4.2_3.0_1649671304061.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_5lang_cased_es_3.4.2_3.0_1649671304061.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_5lang_cased","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_5lang_cased","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.bert_base_5lang_cased").predict("""Me encanta chispa nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_5lang_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|464.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/amine/bert-base-5lang-cased - https://cloud.google.com/compute/docs/machine-types#n1_machine_type --- layout: model title: English DistilBertForQuestionAnswering Cased model (from saraks) author: John Snow Labs name: distilbert_qa_cuad_document_name_08_25 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cuad-distil-document_name-08-25` is an English model originally trained by `saraks`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_document_name_08_25_en_4.3.0_3.0_1672766062646.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_document_name_08_25_en_4.3.0_3.0_1672766062646.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_document_name_08_25","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_document_name_08_25","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_cuad_document_name_08_25| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/saraks/cuad-distil-document_name-08-25 --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from srmukundb) author: John Snow Labs name: distilbert_qa_srmukundb_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `srmukundb`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_srmukundb_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772870213.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_srmukundb_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772870213.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_srmukundb_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_srmukundb_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
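Internally, extractive QA annotators like the one above score every context token as a possible answer start and end, then pick the best-scoring valid span. A framework-free sketch of that selection step (the tokens and logit values below are invented for illustration, not produced by this model):

```python
# Minimal sketch of extractive-QA span selection: each token gets a start
# logit and an end logit; the best pair with start <= end wins.
# All values here are hypothetical.

def best_span(start_logits, end_logits, max_len=15):
    """Return (start, end) maximising start_logit + end_logit with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.2, 0.3, 6.0, 0.1, 0.2, 0.1, 0.0, 0.4, 0.0]
end_logits   = [0.0, 0.1, 0.2, 5.5, 0.1, 0.1, 0.0, 0.1, 0.3, 0.0]

s, e = best_span(start_logits, end_logits)
print(" ".join(context[s:e + 1]))  # -> Clara
```

A real DistilBERT QA head produces these logits per token; the annotator then maps the winning token span back to the original context string.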
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_srmukundb_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/srmukundb/distilbert-base-uncased-finetuned-squad --- layout: model title: English image_classifier_vit_asl ViTForImageClassification from akahana author: John Snow Labs name: image_classifier_vit_asl date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_asl` is an English model originally trained by akahana. ## Predicted Entities `E`, `del`, `X`, `N`, `T`, `Y`, `J`, `U`, `F`, `A`, `M`, `I`, `G`, `nothing`, `V`, `Q`, `L`, `space`, `B`, `P`, `C`, `H`, `W`, `K`, `R`, `O`, `D`, `Z`, `S` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_asl_en_4.1.0_3.0_1660166442859.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_asl_en_4.1.0_3.0_1660166442859.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_asl", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_asl", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
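The `class` column produced above holds the highest-probability label. The final step a ViT classification head performs — a softmax over per-class logits followed by an argmax — can be sketched without Spark (the three labels and logit values here are hypothetical stand-ins for this model's 29 ASL classes):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

labels = ["A", "B", "C"]   # hypothetical subset of the model's classes
logits = [1.2, 4.7, 0.3]   # hypothetical per-class scores
probs = softmax(logits)
print(labels[probs.index(max(probs))])  # -> B
```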
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_asl| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.0 MB| --- layout: model title: Detect Assertion Status (assertion_dl_healthcare) author: John Snow Labs name: assertion_dl_healthcare class: AssertionDLModel reference embedding: healthcare_embeddings language: en nav_key: models repository: clinical/models date: 2020-09-23 task: Assertion Status edition: Healthcare NLP 2.6.0 spark_version: 2.4 tags: [clinical,licensed,assertion,en] supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Assertion of Clinical Entities based on Deep Learning. ## Predicted Entities `hypothetical`, `present`, `absent`, `possible`, `conditional`, `associated_with_someone_else`. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_dl_healthcare_en_2.6.0_2.4_1600849811713.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_dl_healthcare_en_2.6.0_2.4_1600849811713.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, AssertionDLModel.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") clinical_assertion = AssertionDLModel.pretrained("assertion_dl_healthcare","en","clinical/models")\ .setInputCols(["document","ner_chunk","embeddings"])\ .setOutputCol("assertion") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion]) data = spark.createDataFrame([['Patient has a headache for the last 2 weeks and appears anxious when she walks fast. No alopecia noted. She denies pain']]).toDF("text") model = nlpPipeline.fit(data) results = model.transform(data) ``` ```scala ... val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val clinical_assertion = AssertionDLModel.pretrained("assertion_dl_healthcare","en","clinical/models") .setInputCols("document","ner_chunk","embeddings") .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion)) val data = Seq("Patient has a headache for the last 2 weeks and appears anxious when she walks fast. No alopecia noted. 
She denies pain").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.healthcare").predict("""Patient has a headache for the last 2 weeks and appears anxious when she walks fast. No alopecia noted. She denies pain""") ```
{:.h2_title} ## Result ```bash | | chunks | entities| assertion | |--:|-----------:|--------:|------------:| | 0 | a headache | PROBLEM | present | | 1 | anxious | PROBLEM | conditional | | 2 | alopecia | PROBLEM | absent | | 3 | pain | PROBLEM | absent | ``` {:.model-param} ## Model Information {:.table-model} |----------------|----------------------------------| | Name: | assertion_dl_healthcare | | Type: | AssertionDLModel | | Compatibility: | 2.6.0 | | License: | Licensed | |Edition:|Official| | |Input labels: | [document, chunk, word_embeddings] | |Output labels: | [assertion] | | Language: | en | | Case sensitive: | False | | Dependencies: | embeddings_healthcare_100d | {:.h2_title} ## Data Source Trained using ``embeddings_clinical`` on 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text from https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ {:.h2_title} ## Benchmarking ```bash label prec rec f1 absent 0.9289 0.9466 0.9377 present 0.9433 0.9559 0.9496 conditional 0.6888 0.5 0.5794 associated_with_someone_else 0.9285 0.9122 0.9203 hypothetical 0.9079 0.8654 0.8862 possible 0.7 0.6146 0.6545 macro-avg 0.8496 0.7991 0.8236 micro-avg 0.9245 0.9245 0.9245 ``` --- layout: model title: English asr_wav2vec2_base_100h_ngram TFWav2Vec2ForCTC from saahith author: John Snow Labs name: asr_wav2vec2_base_100h_ngram date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_base_100h_ngram` is a English model originally trained by saahith. 
NOTE: This model only works on a CPU; if you need to run it on a GPU device, please use asr_wav2vec2_base_100h_ngram_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_100h_ngram_en_4.2.0_3.0_1664042339482.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_100h_ngram_en_4.2.0_3.0_1664042339482.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_100h_ngram", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_100h_ngram", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
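`audioDf` above is assumed to hold a column of raw audio samples; Wav2Vec2 checkpoints are generally trained on 16 kHz mono audio, so the `audio_content` column is typically an array of floats at that rate, normalized to [-1.0, 1.0]. A dependency-free sketch of building such an array (a synthetic 440 Hz tone standing in for a real recording):

```python
import math

SAMPLE_RATE = 16000  # wav2vec2 models generally expect 16 kHz mono input

def sine_wave(freq_hz, seconds, rate=SAMPLE_RATE):
    """Synthesize a mono float waveform in [-1.0, 1.0]."""
    n = int(rate * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n)]

samples = sine_wave(440.0, 1.0)
# Hypothetical use: audioDf = spark.createDataFrame([(samples,)], ["audio_content"])
print(len(samples))  # -> 16000
```

In practice you would load a real recording (e.g. with an audio library, resampled to 16 kHz) rather than synthesize one; the point is the shape and range of the data the assembler consumes.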
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_100h_ngram| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|227.9 MB| --- layout: model title: IndicBERT - Albert for 12 major Indian languages author: John Snow Labs name: albert_indic date: 2022-01-26 tags: [open_source, albert, as, bn, en, gu, kn, ml, mr, or, pa, ta, te, xx] task: Embeddings language: xx edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: AlBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description IndicBERT is a multilingual ALBERT model pretrained exclusively on 12 major Indian languages. It is pre-trained on our novel monolingual corpus of around 9 billion tokens and subsequently evaluated on a set of diverse tasks. IndicBERT has much fewer parameters than other multilingual models (mBERT, XLM-R etc.) while it also achieves a performance on-par or better than these models. The 12 languages covered by IndicBERT are: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_indic_xx_3.4.0_3.0_1643211494926.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_indic_xx_3.4.0_3.0_1643211494926.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import sparknlp from sparknlp.base import * from sparknlp.annotator import * from pyspark.ml import Pipeline documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") embeddings = AlbertEmbeddings.pretrained("albert_indic","xx") \ .setInputCols(["document",'token'])\ .setOutputCol("embeddings")\ embeddingsFinisher = EmbeddingsFinisher() \ .setInputCols(["embeddings"]) \ .setOutputCols("finished_embeddings") \ .setOutputAsVector(True) \ .setCleanAnnotations(False) pipeline = Pipeline().setStages([ documentAssembler, tokenizer, embeddings, embeddingsFinisher ]) data = spark.createDataFrame([ ["கர்நாடக சட்டப் பேரவையில் வெற்றி பெற்ற எம்எல்ஏக்கள் இன்று பதவியேற்றுக் கொண்ட நிலையில் , காங்கிரஸ் எம்எல்ஏ ஆனந்த் சிங் க்கள் ஆப்சென்ட் ஆகி அதிர்ச்சியை ஏற்படுத்தியுள்ளார் . உச்சநீதிமன்ற உத்தரவுப்படி இன்று மாலை முதலமைச்சர் எடியூரப்பா இன்று நம்பிக்கை வாக்கெடுப்பு நடத்தி பெரும்பான்மையை நிரூபிக்க உச்சநீதிமன்றம் உத்தரவிட்டது ."], ]).toDF("text") result = pipeline.fit(data).transform(data) result.selectExpr("explode(finished_embeddings) as result").show(5, 80) ``` ```scala import spark.implicits._ import com.johnsnowlabs.nlp.base.DocumentAssembler import com.johnsnowlabs.nlp.annotators.Tokenizer import com.johnsnowlabs.nlp.embeddings.AlbertEmbeddings import com.johnsnowlabs.nlp.EmbeddingsFinisher import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = AlbertEmbeddings.pretrained("albert_indic", "xx") .setInputCols("token", "document") .setOutputCol("embeddings") val embeddingsFinisher = new EmbeddingsFinisher() .setInputCols("embeddings") .setOutputCols("finished_embeddings") .setOutputAsVector(true) .setCleanAnnotations(false) val 
pipeline = new Pipeline().setStages(Array( documentAssembler, tokenizer, embeddings, embeddingsFinisher )) val data = Seq("கர்நாடக சட்டப் பேரவையில் வெற்றி பெற்ற எம்எல்ஏக்கள் இன்று பதவியேற்றுக் கொண்ட நிலையில் , காங்கிரஸ் எம்எல்ஏ ஆனந்த் சிங் க்கள் ஆப்சென்ட் ஆகி அதிர்ச்சியை ஏற்படுத்தியுள்ளார் . உச்சநீதிமன்ற உத்தரவுப்படி இன்று மாலை முதலமைச்சர் எடியூரப்பா இன்று நம்பிக்கை வாக்கெடுப்பு நடத்தி பெரும்பான்மையை நிரூபிக்க உச்சநீதிமன்றம் உத்தரவிட்டது .") .toDF("text") val result = pipeline.fit(data).transform(data) result.selectExpr("explode(finished_embeddings) as result").show(5, 80) ``` {:.nlu-block} ```python import nlu nlu.load("xx.embed.albert.indic").predict("""கர்நாடக சட்டப் பேரவையில் வெற்றி பெற்ற எம்எல்ஏக்கள் இன்று பதவியேற்றுக் கொண்ட நிலையில் , காங்கிரஸ் எம்எல்ஏ ஆனந்த் சிங் க்கள் ஆப்சென்ட் ஆகி அதிர்ச்சியை ஏற்படுத்தியுள்ளார் . உச்சநீதிமன்ற உத்தரவுப்படி இன்று மாலை முதலமைச்சர் எடியூரப்பா இன்று நம்பிக்கை வாக்கெடுப்பு நடத்தி பெரும்பான்மையை நிரூபிக்க உச்சநீதிமன்றம் உத்தரவிட்டது .""") ```
## Results ```bash +--------------------------------------------------------------------------------+ | result| +--------------------------------------------------------------------------------+ |[0.2693195641040802,-0.6446362733840942,-0.05138964205980301,0.06030936539173...| |[0.027906809002161026,-0.37459731101989746,-0.08371371030807495,-0.0869174525...| |[0.3804604113101959,-0.7870151400566101,0.08463867008686066,-0.30186718702316...| |[0.15204764902591705,-0.26839596033096313,0.07375998795032501,-0.131638795137...| |[0.1482795625925064,-0.221298485994339,-0.022987276315689087,-0.2132280170917...| +--------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_indic| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|xx| |Size:|128.3 MB| ## References The model was exported from transformers and is based on https://github.com/AI4Bharat/indic-bert --- layout: model title: English RobertaForQuestionAnswering (from comacrae) author: John Snow Labs name: roberta_qa_roberta_paraphrasev3 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-paraphrasev3` is a English model originally trained by `comacrae`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_paraphrasev3_en_4.0.0_3.0_1655738199528.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_paraphrasev3_en_4.0.0_3.0_1655738199528.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_paraphrasev3","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_paraphrasev3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.paraphrasev3.by_comacrae").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_paraphrasev3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/comacrae/roberta-paraphrasev3 --- layout: model title: English image_classifier_vit_base_patch32_384_finetuned_eurosat ViTForImageClassification from keithanpai author: John Snow Labs name: image_classifier_vit_base_patch32_384_finetuned_eurosat date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_patch32_384_finetuned_eurosat` is an English model originally trained by keithanpai. ## Predicted Entities `dff`, `bklf`, `nvf`, `vascf`, `akiecf`, `bccf`, `melf` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch32_384_finetuned_eurosat_en_4.1.0_3.0_1660172185841.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch32_384_finetuned_eurosat_en_4.1.0_3.0_1660172185841.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_patch32_384_finetuned_eurosat", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_patch32_384_finetuned_eurosat", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_patch32_384_finetuned_eurosat| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|328.4 MB| --- layout: model title: Part of Speech for Irish author: John Snow Labs name: pos_ud_idt date: 2021-03-09 tags: [part_of_speech, open_source, irish, pos_ud_idt, ga] task: Part of Speech Tagging language: ga edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - ADP - NOUN - DET - AUX - PRON - VERB - SCONJ - PART - ADV - PUNCT - CCONJ - ADJ - PROPN - NUM - X - SYM {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_idt_ga_3.0.0_3.0_1615292201208.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_idt_ga_3.0.0_3.0_1615292201208.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_idt", "ga") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Dia duit ó John Labs Sneachta! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_idt", "ga") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Dia duit ó John Labs Sneachta! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ['Dia duit ó John Labs Sneachta! '] token_df = nlu.load('ga.pos').predict(text) token_df ```
## Results ```bash token pos 0 Dia NOUN 1 duit NOUN 2 ó ADP 3 John PROPN 4 Labs PROPN 5 Sneachta NOUN 6 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_idt| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|ga| --- layout: model title: BioBERT Embeddings (Pubmed Large) author: John Snow Labs name: biobert_pubmed_large_cased date: 2020-09-19 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.2 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains a pre-trained weights of BioBERT, a language representation model for biomedical domain, especially designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/biobert_pubmed_large_cased_en_2.6.2_2.4_1600529365263.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/biobert_pubmed_large_cased_en_2.6.2_2.4_1600529365263.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("biobert_pubmed_large_cased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("biobert_pubmed_large_cased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I hate cancer").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer"] embeddings_df = nlu.load('en.embed.biobert.pubmed_large_cased').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash token en_embed_biobert_pubmed_large_cased_embeddings I [-0.041047871112823486, 0.24242812395095825, 0... hate [-0.6859451532363892, -0.45743268728256226, -0... cancer [-0.12403186410665512, 0.6688604354858398, -0.... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|biobert_pubmed_large_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.2| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|1024| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert) --- layout: model title: English image_classifier_vit_pond_image_classification_1 ViTForImageClassification from SummerChiam author: John Snow Labs name: image_classifier_vit_pond_image_classification_1 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_pond_image_classification_1` is a English model originally trained by SummerChiam. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_1_en_4.1.0_3.0_1660165744277.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_1_en_4.1.0_3.0_1660165744277.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_pond_image_classification_1", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_pond_image_classification_1", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_pond_image_classification_1| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Laws Spanish Named Entity Recognition (from `hackathon-pln-es`) author: John Snow Labs name: roberta_ner_jurisbert_finetuning_ner date: 2022-05-20 tags: [roberta, ner, token_classification, es, open_source] task: Named Entity Recognition language: es edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `jurisbert-finetuning-ner` is a Spanish model originally trained by `hackathon-pln-es`. ## Predicted Entities `TRAT_INTL`, `LEY` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_ner_jurisbert_finetuning_ner_es_3.4.4_3.0_1653046369327.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_ner_jurisbert_finetuning_ner_es_3.4.4_3.0_1653046369327.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_jurisbert_finetuning_ner","es") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Me encanta Spark PNL"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_jurisbert_finetuning_ner","es") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Me encanta Spark PNL").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_ner_jurisbert_finetuning_ner| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|es| |Size:|464.4 MB| |Case sensitive:|true| |Max sentence length:|128| ## References https://huggingface.co/hackathon-pln-es/jurisbert-finetuning-ner --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Rocketknight1) author: John Snow Labs name: distilbert_qa_rocketknight1_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Rocketknight1`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_rocketknight1_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769088913.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_rocketknight1_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769088913.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_rocketknight1_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_rocketknight1_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_rocketknight1_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Rocketknight1/distilbert-base-uncased-finetuned-squad --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from gbennett) author: John Snow Labs name: xlmroberta_ner_gbennett_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `gbennett`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_gbennett_base_finetuned_panx_de_4.1.0_3.0_1660433217069.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_gbennett_base_finetuned_panx_de_4.1.0_3.0_1660433217069.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_gbennett_base_finetuned_panx","de") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_gbennett_base_finetuned_panx","de")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("document", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_gbennett_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/gbennett/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Translate English to West Germanic languages Pipeline author: John Snow Labs name: translate_en_gmw date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, gmw, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `gmw` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_gmw_xx_2.7.0_2.4_1609689808163.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_gmw_xx_2.7.0_2.4_1609689808163.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_gmw", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_gmw", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.gmw').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_gmw| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Fees and expenses Clause Binary Classifier (md) author: John Snow Labs name: legclf_fees_and_expenses_md date: 2023-01-11 tags: [en, legal, classification, document, agreement, contract, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `fees-and-expenses` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
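The splitting techniques listed above are plain text preprocessing that can run before any Spark NLP stage. As an illustration, paragraph splitting by multiline can be sketched in a few lines of Python (the helper name and regex are illustrative, not part of the library):

```python
import re

def split_into_paragraphs(text: str) -> list:
    """Split a document on blank lines (one or more empty lines),
    dropping whitespace-only fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

document = (
    "FEES AND EXPENSES.\nEach party shall bear its own costs.\n\n"
    "GOVERNING LAW.\nThis Agreement is governed by the laws of Delaware."
)
paragraphs = split_into_paragraphs(document)
# each paragraph can then become a separate row fed to the classifier
```

Each resulting paragraph can be loaded into a one-column Spark DataFrame and classified independently.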
This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `fees-and-expenses` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_fees_and_expenses_md_en_1.0.0_3.0_1673460290897.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_fees_and_expenses_md_en_1.0.0_3.0_1673460290897.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_fees_and_expenses_md", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-------------------+
|             result|
+-------------------+
|[fees-and-expenses]|
|            [other]|
|            [other]|
|[fees-and-expenses]|
+-------------------+
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_fees_and_expenses_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house + SEC documents ## Benchmarking

```bash
                          precision  recall  f1-score  support
miscellaneous-provisions       0.75    0.75      0.75       24
                   other       0.85    0.85      0.85       39
                accuracy                         0.81       63
               macro avg       0.80    0.80      0.80       63
            weighted avg       0.81    0.81      0.81       63
```

--- layout: model title: Pipeline to Detect Drugs - Generalized Single Entity (ner_drugs_greedy) author: John Snow Labs name: ner_drugs_greedy_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_drugs_greedy](https://nlp.johnsnowlabs.com/2021/03/31/ner_drugs_greedy_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_drugs_greedy_pipeline_en_4.3.0_3.2_1678877919575.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_drugs_greedy_pipeline_en_4.3.0_3.2_1678877919575.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_drugs_greedy_pipeline", "en", "clinical/models") text = '''DOSAGE AND ADMINISTRATION The initial dosage of hydrocortisone tablets may vary from 20 mg to 240 mg of hydrocortisone per day depending on the specific disease entity being treated.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_drugs_greedy_pipeline", "en", "clinical/models") val text = "DOSAGE AND ADMINISTRATION The initial dosage of hydrocortisone tablets may vary from 20 mg to 240 mg of hydrocortisone per day depending on the specific disease entity being treated." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.drugs_greedy.pipeline").predict("""DOSAGE AND ADMINISTRATION The initial dosage of hydrocortisone tablets may vary from 20 mg to 240 mg of hydrocortisone per day depending on the specific disease entity being treated.""") ```
## Results

```bash
|    | ner_chunk                         |   begin |   end | ner_label   |   confidence |
|---:|:----------------------------------|--------:|------:|:------------|-------------:|
|  0 | hydrocortisone tablets            |      48 |    69 | DRUG        |       0.9923 |
|  1 | 20 mg to 240 mg of hydrocortisone |      85 |   117 | DRUG        |       0.7361 |
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_drugs_greedy_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Legal Titles And Subtitles Clause Binary Classifier author: John Snow Labs name: legclf_titles_and_subtitles_clause date: 2023-01-29 tags: [en, legal, classification, titles, subtitles, clauses, titles_and_subtitles, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `titles-and-subtitles` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc.
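Besides header- or paragraph-based splitting, an oversized clause can simply be chunked to fit the embeddings' token limit. A rough whitespace-token sketch follows (real subword counts differ from whitespace counts, so keep a safety margin; the helper is illustrative, not a library function):

```python
def chunk_by_tokens(text: str, max_tokens: int = 512) -> list:
    """Greedily pack whitespace tokens into chunks of at most max_tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

long_clause = "token " * 1200          # a 1200-token stand-in for a long clause
chunks = chunk_by_tokens(long_clause, max_tokens=512)
# → 3 chunks: 512 + 512 + 176 tokens
```

Each chunk can then be classified as its own row, and the per-chunk results aggregated.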
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `titles-and-subtitles`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_titles_and_subtitles_clause_en_1.0.0_3.0_1674993674002.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_titles_and_subtitles_clause_en_1.0.0_3.0_1674993674002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_titles_and_subtitles_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+----------------------+
|                result|
+----------------------+
|[titles-and-subtitles]|
|               [other]|
|               [other]|
|[titles-and-subtitles]|
+----------------------+
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_titles_and_subtitles_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking

```bash
               label  precision  recall  f1-score  support
               other       0.97    1.00      0.99       39
titles-and-subtitles       1.00    0.97      0.98       30
            accuracy          -       -      0.99       69
           macro-avg       0.99    0.98      0.99       69
        weighted-avg       0.99    0.99      0.99       69
```

--- layout: model title: Sentence Entity Resolver for ICD10-CM (Augmented) author: John Snow Labs name: sbiobertresolve_icd10cm_augmented date: 2022-01-21 tags: [icd10cm, entity_resolution, clinical, en, licensed] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.1 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD10-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It has also been augmented with synonyms to make it more accurate.
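At its core, sentence entity resolution is a nearest-neighbour search over sentence-embedding space using the configured distance function (Euclidean here, matching `setDistanceFunction("EUCLIDEAN")` in the pipeline). The toy 3-d vectors and code index below are invented purely to illustrate the mechanism; real sbiobert embeddings are high-dimensional:

```python
import math

# hypothetical 3-d embeddings standing in for real sentence embeddings
code_index = {
    "E669": [0.9, 0.1, 0.0],   # obesity
    "R35":  [0.0, 0.8, 0.2],   # polyuria
    "R111": [0.1, 0.1, 0.9],   # vomiting
}

def resolve(entity_vec, index):
    """Return the code whose embedding is closest in Euclidean distance."""
    def dist(v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(entity_vec, v)))
    return min(index, key=lambda code: dist(index[code]))

resolve([0.85, 0.15, 0.05], code_index)  # → "E669"
```

The resolver additionally returns the ranked list of nearby codes (the `all_codes` column in the results below), which corresponds to sorting the whole index by this distance instead of taking only the minimum.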
## Predicted Entities `ICD10CM Codes` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_3.3.1_3.0_1642756161477.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_3.3.1_3.0_1642756161477.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(['PROBLEM']) chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") icd10_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_icd10cm_augmented","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver]) data_ner = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. 
Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection."]]).toDF("text") results = nlpPipeline.fit(data_ner).transform(data_ner) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("PROBLEM")) val chunk2doc = new Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val icd10_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_icd10cm_augmented","en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, 
and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd10cm.augmented").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.""") ```
## Results ```bash +-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+ | ner_chunk| entity|icd10cm_code| resolutions| all_codes| +-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+ | gestational diabetes mellitus|PROBLEM| O2441|gestational diabetes mellitus:::postpartum gestational diabetes mel...| O2441:::O2443:::Z8632:::Z875:::O2431:::O2411:::O244:::O241:::O2481| |subsequent type two diabetes mellitus|PROBLEM| O2411|pre-existing type 2 diabetes mellitus:::disorder associated with ty...|O2411:::E118:::E11:::E139:::E119:::E113:::E1144:::Z863:::Z8639:::E1...| | T2DM|PROBLEM| E11|type 2 diabetes mellitus:::disorder associated with type 2 diabetes...|E11:::E118:::E119:::O2411:::E109:::E139:::E113:::E8881:::Z833:::D64...| | HTG-induced pancreatitis|PROBLEM| K8520|alcohol-induced pancreatitis:::drug-induced acute pancreatitis:::he...|K8520:::K853:::K8590:::F102:::K852:::K859:::K8580:::K8591:::K858:::...| | acute hepatitis|PROBLEM| K720|acute hepatitis:::acute hepatitis a:::acute infectious hepatitis:::...|K720:::B15:::B179:::B172:::Z0389:::B159:::B150:::B16:::K752:::K712:...| | obesity|PROBLEM| E669|obesity:::abdominal obesity:::obese:::central obesity:::overweight ...|E669:::E668:::Z6841:::Q130:::E66:::E6601:::Z8639:::E349:::H3550:::Z...| | a body mass index|PROBLEM| Z6841|finding of body mass index:::observation of body mass index:::mass ...|Z6841:::E669:::R229:::Z681:::R223:::R221:::Z68:::R222:::R220:::R418...| | polyuria|PROBLEM| R35|polyuria:::nocturnal polyuria:::polyuric state:::polyuric state (di...|R35:::R3581:::R358:::E232:::R31:::R350:::R8299:::N401:::E723:::O048...| | polydipsia|PROBLEM| R631|polydipsia:::psychogenic polydipsia:::primary 
polydipsia:::psychoge...|R631:::F6389:::E232:::F639:::O40:::G475:::M7989:::R632:::R061:::H53...| | poor appetite|PROBLEM| R630|poor appetite:::poor feeding:::bad taste in mouth:::unpleasant tast...|R630:::P929:::R438:::R432:::E86:::R196:::F520:::Z724:::R0689:::Z768...| | vomiting|PROBLEM| R111|vomiting:::intermittent vomiting:::vomiting symptoms:::periodic vom...| R111:::R11:::R1110:::G43A1:::P921:::P9209:::G43A:::R1113:::R110| | a respiratory tract infection|PROBLEM| J988|respiratory tract infection:::upper respiratory tract infection:::b...|J988:::J069:::A499:::J22:::J209:::Z593:::T17:::J0410:::Z1383:::J189...| +-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icd10cm_augmented| |Compatibility:|Healthcare NLP 3.3.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Size:|1.4 GB| |Case sensitive:|false| ## Data Source Trained on ICD10CM 2022 Codes dataset: https://www.cdc.gov/nchs/icd/icd10cm.htm --- layout: model title: Relation extraction between Drugs and ADE author: John Snow Labs name: re_ade_clinical date: 2021-07-12 tags: [licensed, clinical, en, relation_extraction, ade] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.1.2 spark_version: 3.0 supported: true annotator: RelationExtractionModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model identifies relations between drugs and the adverse reactions caused by them; it predicts whether an adverse event is caused by a drug. `1`: the adverse event and drug entities are related. `0`: the adverse event and drug entities are not related.
## Predicted Entities `0`, `1` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ADE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/RE_ADE.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_ade_clinical_en_3.1.2_3.0_1626104637779.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_ade_clinical_en_3.1.2_3.0_1626104637779.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The table below shows the `re_ade_clinical` RE model, its labels, the optimal NER model, and the meaningful relation pairs.

| RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS |
|:---------------:|:--------------:|:----------------:|------------------------------|
| re_ade_clinical | 0, 1 | ner_ade_clinical | ["ade-drug", "drug-ade"] |
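Conceptually, the relation-pair restriction is a filter over candidate entity-label pairs: only pairs whose `left-right` label combination appears in the configured list are scored. The standalone sketch below (a hypothetical helper, not the library implementation) mirrors the case-insensitive matching that `setRelationPairsCaseSensitive(False)` enables:

```python
def allowed_pairs(candidates, relation_pairs, case_sensitive=False):
    """Keep only entity-label pairs listed in relation_pairs."""
    if not case_sensitive:
        relation_pairs = {p.lower() for p in relation_pairs}
    kept = []
    for left, right in candidates:
        pair = f"{left}-{right}"
        if (pair if case_sensitive else pair.lower()) in relation_pairs:
            kept.append((left, right))
    return kept

candidates = [("ADE", "DRUG"), ("DRUG", "ADE"), ("DRUG", "DRUG")]
allowed_pairs(candidates, ["ade-drug", "drug-ade"])
# → [("ADE", "DRUG"), ("DRUG", "ADE")]
```

Restricting the pairs this way keeps the model from scoring meaningless combinations such as drug-drug.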
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") words_embedder = WordEmbeddingsModel() \ .pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel() \ .pretrained("ner_ade_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner_tags") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner_tags"]) \ .setOutputCol("ner_chunks") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"])\ .setOutputCol("pos_tags") dependency_parser = sparknlp.annotators.DependencyParserModel()\ .pretrained("dependency_conllu", "en")\ .setInputCols(["sentence", "pos_tags", "token"])\ .setOutputCol("dependencies") re_model = RelationExtractionModel()\ .pretrained("re_ade_clinical", "en", 'clinical/models')\ .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\ .setOutputCol("relations")\ .setMaxSyntacticDistance(10)\ .setPredictionThreshold(0.1)\ .setRelationPairs(["ade-drug", "drug-ade"])\ .setRelationPairsCaseSensitive(False) nlp_pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, words_embedder, ner_tagger, ner_converter, pos_tagger, dependency_parser, re_model]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) text ="""Been taking Lipitor for 15 years , have experienced severe fatigue a lot. The doctor moved me to voltarene 2 months ago, so far I have only had muscle cramps. 
""" annotations = light_pipeline.fullAnnotate(text) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = NerDLModel() .pretrained("ner_ade_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") val re_model = RelationExtractionModel() .pretrained("re_ade_clinical", "en", 'clinical/models') .setInputCols(Array("embeddings", "pos_tags", "ner_chunks", "dependencies")) .setOutputCol("relations") .setMaxSyntacticDistance(3) .setPredictionThreshold(0.5) .setRelationPairs(Array("drug-ade", "ade-drug")) val nlpPipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, words_embedder, ner_tagger, ner_converter, pos_tagger, dependency_parser, re_model)) val data = Seq("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot. The doctor moved me to voltarene 2 months ago, so far I have only had muscle cramps. 
""").toDS.toDF("text") val result = nlpPipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.adverse_drug_events.clinical").predict("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot. The doctor moved me to voltarene 2 months ago, so far I have only had muscle cramps.""") ```
## Results ```bash | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |---------:|:--------|--------------:|------------:|:----------|:--------|--------------:|------------:|:---------------|-----------:| | 1 | DRUG | 12 | 18 | Lipitor | ADE | 52 | 65 | severe fatigue | 1 | | 1 | DRUG | 97 | 105 | voltarene | ADE | 144 | 156 | muscle cramps | 0.997283 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_ade_clinical| |Type:|re| |Compatibility:|Healthcare NLP 3.1.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]| |Output Labels:|[relations]| |Language:|en| ## Data Source This model is trained on custom data annotated by JSL. ## Benchmarking ```bash label precision recall f1-score support 0 0.86 0.88 0.87 1787 1 0.92 0.90 0.91 2586 micro-avg 0.89 0.89 0.89 4373 macro-avg 0.89 0.89 0.89 4373 weighted-avg 0.89 0.89 0.89 4373 ``` --- layout: model title: English asr_model_4 TFWav2Vec2ForCTC from niclas author: John Snow Labs name: pipeline_asr_model_4 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_model_4` is an English model originally trained by niclas.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use `pipeline_asr_model_4_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_model_4_en_4.2.0_3.0_1664098319002.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_model_4_en_4.2.0_3.0_1664098319002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_model_4', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_model_4", lang = "en") val annotations = pipeline.transform(audioDF) ```
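The pipeline's `transform` expects an `audioDF` whose audio column holds arrays of floats. A minimal stdlib-only sketch of building that column from a 16-bit PCM mono WAV file (the file name `sample.wav` is illustrative; resampling to the model's expected sample rate is out of scope here):

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit PCM mono WAV file into a list of floats in [-1, 1]."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expected 16-bit PCM"
        frames = wf.readframes(wf.getnframes())
    # Unpack little-endian signed 16-bit samples and normalize.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# audioDF = spark.createDataFrame([[wav_to_floats("sample.wav")]], ["audio_content"])
```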
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_model_4| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English asr_Wav2Vec2_XLSR_Bengali_10500 TFWav2Vec2ForCTC from shoubhik author: John Snow Labs name: pipeline_asr_Wav2Vec2_XLSR_Bengali_10500 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Wav2Vec2_XLSR_Bengali_10500` is an English model originally trained by shoubhik. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use `pipeline_asr_Wav2Vec2_XLSR_Bengali_10500_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_Wav2Vec2_XLSR_Bengali_10500_en_4.2.0_3.0_1664105201102.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_Wav2Vec2_XLSR_Bengali_10500_en_4.2.0_3.0_1664105201102.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_Wav2Vec2_XLSR_Bengali_10500', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_Wav2Vec2_XLSR_Bengali_10500", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_Wav2Vec2_XLSR_Bengali_10500| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|3.6 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Brazilian Portuguese NER for Laws (Base) author: John Snow Labs name: legner_br_base date: 2022-09-27 tags: [pt, licensed] task: Named Entity Recognition language: pt edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Deep Learning Portuguese Named Entity Recognition model for the legal domain, trained using Base Bert Embeddings, and is able to predict the following entities: - ORGANIZACAO (Organizations) - JURISPRUDENCIA (Jurisprudence) - PESSOA (Person) - TEMPO (Time) - LOCAL (Location) - LEGISLACAO (Laws) - O (Other) You can find different versions of this model in Models Hub: - With a Deep Learning architecture (non-transformer) and Base Embeddings; - With a Deep Learning architecture (non-transformer) and Large Embeddings; - With a Transformers Architecture and Base Embeddings; - With a Transformers Architecture and Large Embeddings; ## Predicted Entities `PESSOA`, `ORGANIZACAO`, `LEGISLACAO`, `JURISPRUDENCIA`, `TEMPO`, `LOCAL` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_LEGAL_PT/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_br_base_pt_1.0.0_3.0_1664276774137.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_br_base_pt_1.0.0_3.0_1664276774137.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = nlp.Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') embeddings = nlp.BertEmbeddings.pretrained("bert_portuguese_base_cased", "pt")\ .setInputCols("document", "token") \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained('legner_br_base', 'pt', 'legal/models') \ .setInputCols(['document', 'token', 'embeddings']) \ .setOutputCol('ner') ner_converter = nlp.NerConverter() \ .setInputCols(['document', 'token', 'ner']) \ .setOutputCol('ner_chunk') pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, ner_model, ner_converter ]) example = spark.createDataFrame(pd.DataFrame({'text': ["""Mediante do exposto , com fundamento nos artigos 32 , i , e 33 , da lei 8.443/1992 , submetem-se os autos à consideração superior , com posterior encaminhamento ao ministério público junto ao tcu e ao gabinete do relator , propondo : a ) conhecer do recurso e , no mérito , negar-lhe provimento ; b ) comunicar ao recorrente , ao superior tribunal militar e ao tribunal regional federal da 2ª região , a fim de fornecer subsídios para os processos judiciais 2001.34.00.024796-9 e 2003.34.00.044227-3 ; e aos demais interessados a deliberação que vier a ser proferida por esta corte ” ."""]})) result = pipeline.fit(example).transform(example) ```
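Inside the pipeline, the `ner_converter` stage groups the token-level BIO tags into entity chunks. The grouping logic can be sketched as a small standalone helper over parallel token/tag lists (a simplified re-implementation for illustration, not the Spark NLP code itself):

```python
def bio_to_chunks(tokens, tags):
    """Group BIO-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new chunk, closing any open one.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            # O tag (or a stray I-) closes the open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks
```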
## Results ```bash +-------------------+----------------+ | token| ner| +-------------------+----------------+ | diante| O| | do| O| | exposto| O| | ,| O| | com| O| | fundamento| O| | nos| O| | artigos| B-LEGISLACAO| | 32| I-LEGISLACAO| | ,| I-LEGISLACAO| | i| I-LEGISLACAO| | ,| I-LEGISLACAO| | e| I-LEGISLACAO| | 33| I-LEGISLACAO| | ,| I-LEGISLACAO| | da| I-LEGISLACAO| | lei| I-LEGISLACAO| | 8.443/1992| I-LEGISLACAO| | ,| O| | submetem-se| O| | os| O| | autos| O| | à| O| | consideração| O| | superior| O| | ,| O| | com| O| | posterior| O| | encaminhamento| O| | ao| O| | ministério| B-ORGANIZACAO| | público| I-ORGANIZACAO| | junto| O| | ao| O| | tcu| B-ORGANIZACAO| | e| O| | ao| O| | gabinete| O| | do| O| | relator| O| | ,| O| | propondo| O| | :| O| | a| O| | )| O| | conhecer| O| | do| O| | recurso| O| | e| O| | ,| O| | no| O| | mérito| O| | ,| O| | negar-lhe| O| | provimento| O| | ;| O| | b| O| | )| O| | comunicar| O| | ao| O| | recorrente| O| | ,| O| | ao| O| | superior| B-ORGANIZACAO| | tribunal| I-ORGANIZACAO| | militar| I-ORGANIZACAO| | e| O| | ao| O| | tribunal| B-ORGANIZACAO| | regional| I-ORGANIZACAO| | federal| I-ORGANIZACAO| | da| I-ORGANIZACAO| | 2ª| I-ORGANIZACAO| | região| I-ORGANIZACAO| | ,| O| | a| O| | fim| O| | de| O| | fornecer| O| | subsídios| O| | para| O| | os| O| | processos| O| | judiciais| O| |2001.34.00.024796-9|B-JURISPRUDENCIA| | e| O| |2003.34.00.044227-3|B-JURISPRUDENCIA| | ;| O| | e| O| | aos| O| | demais| O| | interessados| O| | a| O| | deliberação| O| | que| O| | vier| O| | a| O| | ser| O| | proferida| O| | por| O| | esta| O| | corte| O| | ”| O| | .| O| +-------------------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_br_base| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|pt| |Size:|18.8 MB| ## References Original texts available in 
https://paperswithcode.com/sota?task=Token+Classification&dataset=lener_br and in-house data augmentation with weak labelling ## Benchmarking ```bash label precision recall f1-score support B-JURISPRUDENCIA 0.84 0.91 0.88 175 B-LEGISLACAO 0.96 0.96 0.96 347 B-LOCAL 0.69 0.68 0.68 40 B-ORGANIZACAO 0.95 0.71 0.81 441 B-PESSOA 0.91 0.95 0.93 221 B-TEMPO 0.94 0.86 0.90 176 I-JURISPRUDENCIA 0.86 0.91 0.89 461 I-LEGISLACAO 0.98 0.99 0.98 2012 I-LOCAL 0.54 0.53 0.53 72 I-ORGANIZACAO 0.94 0.76 0.84 768 I-PESSOA 0.93 0.98 0.95 461 I-TEMPO 0.90 0.85 0.88 66 O 0.99 1.00 0.99 38419 accuracy - - 0.98 43659 macro-avg 0.88 0.85 0.86 43659 weighted-avg 0.98 0.98 0.98 43659 ``` --- layout: model title: English asr_wav2vec2_base_100h_by_vuiseng9 TFWav2Vec2ForCTC from vuiseng9 author: John Snow Labs name: asr_wav2vec2_base_100h_by_vuiseng9 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_by_vuiseng9` is an English model originally trained by vuiseng9. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use `asr_wav2vec2_base_100h_by_vuiseng9_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_100h_by_vuiseng9_en_4.2.0_3.0_1664022833525.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_100h_by_vuiseng9_en_4.2.0_3.0_1664022833525.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_100h_by_vuiseng9", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_100h_by_vuiseng9", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_100h_by_vuiseng9| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|227.9 MB| --- layout: model title: Arabic BertForMaskedLM Base Cased model (from CAMeL-Lab) author: John Snow Labs name: bert_embeddings_base_arabic_camel_msa_eighth date: 2022-12-02 tags: [ar, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ar edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-arabic-camelbert-msa-eighth` is an Arabic model originally trained by `CAMeL-Lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabic_camel_msa_eighth_ar_4.2.4_3.0_1670016070157.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabic_camel_msa_eighth_ar_4.2.4_3.0_1670016070157.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabic_camel_msa_eighth","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabic_camel_msa_eighth","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
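Downstream, the per-token vectors this model produces are typically compared with cosine similarity (e.g. to find semantically close tokens or sentences). A dependency-free sketch of the metric:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```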
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_arabic_camel_msa_eighth| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|409.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa-eighth - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://catalog.ldc.upenn.edu/LDC2011T11 - http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus - https://vlo.clarin.eu/search;jsessionid=31066390B2C9E8C6304845BA79869AC1?1&q=osian - https://archive.org/details/arwiki-20190201 - https://oscar-corpus.com/ - https://github.com/google-research/bert - https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/tokenization.py#L286-L297 - https://github.com/CAMeL-Lab/camel_tools - https://github.com/CAMeL-Lab/CAMeLBERT --- layout: model title: Detect Cancer Genetics (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_bionlp date: 2021-11-03 tags: [bertfortokenclassification, ner, bionlp, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.0 spark_version: 2.4 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts biological and genetic terms from cancer-related texts using a pre-trained NER model. It was trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP.
## Predicted Entities `Amino_acid`, `Anatomical_system`, `Cancer`, `Cell`, `Cellular_component`, `Developing_anatomical_Structure`, `Gene_or_gene_product`, `Immaterial_anatomical_entity`, `Multi-tissue_structure`, `Organ`, `Organism`, `Organism_subdivision`, `Simple_chemical`, `Tissue`, `Organism_substance`, `Pathological_formation` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bionlp_en_3.3.0_2.4_1635952712612.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bionlp_en_3.3.0_2.4_1635952712612.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_bionlp", "en", "clinical/models")\ .setInputCols("token", "document")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter]) p_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']}))) test_sentence = """Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.""" result = p_model.transform(spark.createDataFrame(pd.DataFrame({'text': [test_sentence]}))) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_bionlp", "en", "clinical/models") .setInputCols(Array("token", "document")) .setOutputCol("ner") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("document","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter)) val data = Seq("""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. 
The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.bionlp").predict("""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.""") ```
## Results ```bash +-------------------+----------------------+ |chunk |ner_label | +-------------------+----------------------+ |erbA IRES |Organism | |erbA/myb virus |Organism | |erythroid cells |Cell | |bone marrow |Multi-tissue_structure| |blastoderm cultures|Cell | |erbA/myb IRES virus|Organism | |erbA IRES virus |Organism | |blastoderm |Cell | +-------------------+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_bionlp| |Compatibility:|Healthcare NLP 3.3.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Case sensitive:|true| |Max sentence length:|256| ## Data Source Trained on the Cancer Genetics (CG) task of the BioNLP Shared Task 2013. https://aclanthology.org/W13-2008/ ## Benchmarking ```bash label precision recall f1-score support B-Cancer 0.88 0.82 0.85 924 B-Cell 0.84 0.86 0.85 1013 B-Cellular_component 0.87 0.84 0.86 180 B-Developing_anatomical_structure 0.65 0.65 0.65 17 B-Gene_or_gene_product 0.62 0.79 0.69 2520 B-Immaterial_anatomical_entity 0.68 0.74 0.71 31 B-Multi-tissue_structure 0.84 0.76 0.80 303 B-Organ 0.78 0.74 0.76 156 B-Organism 0.93 0.86 0.89 518 B-Organism_subdivision 0.74 0.51 0.61 39 B-Organism_substance 0.93 0.66 0.77 102 B-Pathological_formation 0.85 0.60 0.71 88 B-Simple_chemical 0.61 0.75 0.68 727 B-Tissue 0.74 0.83 0.78 184 I-Amino_acid 0.60 1.00 0.75 3 I-Cancer 0.91 0.69 0.78 604 I-Cell 0.98 0.74 0.84 1091 I-Cellular_component 0.88 0.62 0.73 69 I-Multi-tissue_structure 0.89 0.86 0.87 162 I-Organ 0.67 0.59 0.62 17 I-Organism 0.84 0.45 0.59 120 I-Organism_substance 0.80 0.50 0.62 24 I-Pathological_formation 0.81 0.56 0.67 39 I-Tissue 0.83 0.86 0.84 111 accuracy - - 0.64 12129 macro-avg 0.73 0.56 0.60 12129 weighted-avg 0.83 0.64 0.68 12129 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from en) author: John Snow Labs name: 
distilbert_qa_squad_base_uncased_finetuned date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `en`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_squad_base_uncased_finetuned_en_4.3.0_3.0_1672770716210.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_squad_base_uncased_finetuned_en_4.3.0_3.0_1672770716210.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_base_uncased_finetuned","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_base_uncased_finetuned","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_squad_base_uncased_finetuned| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/en/distilbert-base-uncased-finetuned-squad --- layout: model title: English DistilBertForQuestionAnswering model (from datarpit) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_natural_questions date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-natural-questions` is an English model originally trained by `datarpit`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_natural_questions_en_4.0.0_3.0_1654723994546.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_natural_questions_en_4.0.0_3.0_1654723994546.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_natural_questions","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_natural_questions","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.base_uncased.by_datarpit").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_natural_questions| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/datarpit/distilbert-base-uncased-finetuned-natural-questions --- layout: model title: French CamemBert Embeddings (from Sebu) author: John Snow Labs name: camembert_embeddings_Sebu_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `Sebu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Sebu_generic_model_fr_3.4.4_3.0_1653986850148.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Sebu_generic_model_fr_3.4.4_3.0_1653986850148.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Sebu_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Sebu_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
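To get a single sentence-level vector from the per-token CamemBERT embeddings, a common (if simplistic) choice is mean pooling over the token vectors. A dependency-free sketch:

```python
def mean_pool(token_vectors):
    """Average per-token embedding vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("no token vectors to pool")
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]
```

For real use, pooling is usually restricted to non-padding tokens; Spark NLP also offers `SentenceEmbeddings` for this inside a pipeline.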
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_Sebu_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/Sebu/dummy-model --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Thitaree) author: John Snow Labs name: distilbert_qa_thitaree_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Thitaree`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_thitaree_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769425321.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_thitaree_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769425321.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_thitaree_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_thitaree_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_thitaree_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Thitaree/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal NER for MAPA(Multilingual Anonymisation for Public Administrations) author: John Snow Labs name: legner_mapa date: 2023-04-27 tags: [pt, licensed, ner, legal, mapa] task: Named Entity Recognition language: pt edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union. This model extracts `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, and `PERSON` entities from `Portuguese` documents. ## Predicted Entities `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, `PERSON` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_mapa_pt_1.0.0_3.0_1682608680085.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_mapa_pt_1.0.0_3.0_1682608680085.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_pt_cased", "pt")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_mapa", "pt", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""Nos termos dos Decretos da Garda Síochána (6), só pode ser admitido como estagiário para integrar a força policial nacional quem tiver pelo menos 18 anos, mas menos de 35 anos de idade, no primeiro dia do mês em que tenha sido publicado pela primeira vez, num jornal nacional, o anúncio da vaga a que o recrutamento respeita."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
## Results ```bash +-----------------------+------------+ |chunk |ner_label | +-----------------------+------------+ |Garda Síochána |ORGANISATION| |força policial nacional|ORGANISATION| |18 anos |AMOUNT | |35 anos |AMOUNT | +-----------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_mapa| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|pt| |Size:|1.4 MB| ## References The dataset is available [here](https://huggingface.co/datasets/joelito/mapa). ## Benchmarking ```bash label precision recall f1-score support ADDRESS 0.91 0.91 0.91 23 AMOUNT 1.00 0.83 0.91 6 DATE 1.00 0.95 0.97 61 ORGANISATION 0.85 0.77 0.81 30 PERSON 0.88 0.91 0.89 65 micro-avg 0.92 0.90 0.91 185 macro-avg 0.93 0.87 0.90 185 weighted-avg 0.92 0.90 0.91 185 ``` --- layout: model title: English DistilBertForQuestionAnswering model (from nlpunibo) Config1 author: John Snow Labs name: distilbert_qa_base_config1 date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert_base_config1` is an English model originally trained by `nlpunibo`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_config1_en_4.0.0_3.0_1654727786120.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_config1_en_4.0.0_3.0_1654727786120.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_config1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_config1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.base_config1.by_nlpunibo").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
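Under the hood, span-extraction QA models score every context token as a possible answer start and as a possible answer end; the returned `answer` is the best-scoring span of the context. A toy, library-free sketch of that decoding step (the tokens and scores below are made up for illustration and are not the model's real output):

```python
# Simplified span decoding for extractive QA: pick the (start, end) pair
# with the highest combined start-score + end-score, with start <= end.
def decode_span(tokens, start_scores, end_scores):
    """Return the highest-scoring token span as a string."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, len(end_scores)):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return " ".join(tokens[best[0]:best[1] + 1])

context = "My name is Clara and I live in Berkeley ."
tokens = context.split()
# Hypothetical scores peaking on the token "Clara" (index 3).
start = [0.0, 0.0, 0.1, 5.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0]
end   = [0.0, 0.0, 0.0, 5.0, 0.1, 0.0, 0.0, 0.0, 0.2, 0.0]
print(decode_span(tokens, start, end))  # -> Clara
```

The real annotator performs this decoding internally; the sketch only illustrates why the answer is always a contiguous substring of `document_context`.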
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_config1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/nlpunibo/distilbert_base_config1 --- layout: model title: Translate English to Western Malayo-Polynesian languages Pipeline author: John Snow Labs name: translate_en_pqw date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, pqw, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `pqw` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_pqw_xx_2.7.0_2.4_1609688594063.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_pqw_xx_2.7.0_2.4_1609688594063.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_pqw", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_pqw", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.pqw').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_pqw| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Chunk Entity Resolver RxNorm-scdc author: John Snow Labs name: chunkresolve_rxnorm_in_healthcare date: 2021-04-16 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to RxNorm codes using chunk embeddings (augmented with synonyms, four times richer than previous resolver). ## Predicted Entities RxNorm codes {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_RXNORM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_in_healthcare_en_3.0.0_3.0_1618605195699.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_in_healthcare_en_3.0.0_3.0_1618605195699.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... resolver = ChunkEntityResolverModel.pretrained("chunkresolve_rxnorm_in_healthcare","en","clinical/models") .setInputCols("token","chunk_embeddings") .setOutputCol("entity") pipeline = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, resolver]) data = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation."]]).toDF("text") model = pipeline.fit(data) results = model.transform(data) ... ``` ```scala ... val resolver = ChunkEntityResolverModel.pretrained("chunkresolve_rxnorm_in_healthcare","en","clinical/models") .setInputCols("token","chunk_embeddings") .setOutputCol("entity") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. 
Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.").toDF("text") val result = pipeline.fit(data).transform(data) ```
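The resolver packs its ranked candidate resolutions into a single `target_text` string separated by `:::` (visible in the results below). A small helper for unpacking the top-k candidates from such a cell — a convenience sketch, not part of the library:

```python
# Split a ':::'-separated resolver candidate string into its top-k entries.
def top_candidates(target_text: str, k: int = 3) -> list:
    return [c.strip() for c in target_text.split(":::")[:k]]

cell = ("Dapagliflozin Tablets:::dapagliflozin 5 mg oral tablet"
        ":::dapagliflozin 10 mg oral tablet")
print(top_candidates(cell, k=2))
# -> ['Dapagliflozin Tablets', 'dapagliflozin 5 mg oral tablet']
```

Applied to the `target_text` column of the transformed DataFrame (e.g. via a UDF), this yields one candidate list per resolved chunk.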
## Results ```bash +---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ | chunk| entity| target_text| code|confidence| +---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ | metformin|TREATMENT|metFORMIN compounding powder:::Metformin Hydrochloride Powder:::metFORMIN 500 mg oral tablet:::me...| 601021| 0.2364| | glipizide|TREATMENT|Glipizide Powder:::Glipizide Crystal:::Glipizide Tablets:::glipiZIDE 5 mg oral tablet:::glipiZIDE...| 241604| 0.3647| |dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG|TREATMENT|Ezetimibe and Atorvastatin Tablets:::Amlodipine and Atorvastatin Tablets:::Atorvastatin Calcium T...|1422084| 0.3407| | dapagliflozin|TREATMENT|Dapagliflozin Tablets:::dapagliflozin 5 mg oral tablet:::dapagliflozin 10 mg oral tablet:::Dapagl...|1488568| 0.7070| +---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|chunkresolve_rxnorm_in_healthcare| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token, chunk_embeddings]| |Output Labels:|[rxnorm_code]| |Language:|en| --- layout: model title: Finnish asr_wav2vec2_xlsr_300m_finnish TFWav2Vec2ForCTC from aapot author: John Snow Labs name: asr_wav2vec2_xlsr_300m_finnish date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description 
Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_300m_finnish` is a Finnish model originally trained by aapot. NOTE: This model only works on a CPU; if you need to run it on a GPU device, please use `asr_wav2vec2_xlsr_300m_finnish_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_300m_finnish_fi_4.2.0_3.0_1664023005420.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_300m_finnish_fi_4.2.0_3.0_1664023005420.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xlsr_300m_finnish", "fi")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xlsr_300m_finnish", "fi") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
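The snippets above assume `audioDf` already exists; `Wav2Vec2ForCTC` consumes raw waveforms as arrays of floats (16 kHz mono for XLS-R-style models like this one). A stdlib-only sketch of decoding 16-bit PCM WAV bytes into that representation — the final `createDataFrame` line is illustrative and assumes a running Spark session, with the column name matching the `AudioAssembler` input above:

```python
import io
import math
import struct
import wave

def wav_to_floats(wav_bytes: bytes) -> list:
    """Decode 16-bit PCM WAV bytes into floats in [-1.0, 1.0]."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit PCM"
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Synthesize a tiny 16 kHz mono sine clip so the sketch is self-contained.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack(
        "<160h", *(int(10000 * math.sin(2 * math.pi * 440 * t / 16000))
                   for t in range(160))))

floats = wav_to_floats(buf.getvalue())
# In a Spark session, this list would become the "audio_content" column:
# audioDf = spark.createDataFrame([(floats,)], ["audio_content"])
```

In practice a library such as librosa or soundfile handles resampling and decoding; the point here is only the shape of the data the annotator expects.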
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xlsr_300m_finnish| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|fi| |Size:|1.2 GB| --- layout: model title: English DistilBertForQuestionAnswering model (from graviraja) author: John Snow Labs name: distilbert_qa_graviraja_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `graviraja`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_graviraja_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725306055.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_graviraja_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725306055.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_graviraja_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_graviraja_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_graviraja").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_graviraja_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/graviraja/distilbert-base-uncased-finetuned-squad --- layout: model title: Indonesian RoBERTa Embeddings (from w11wo) author: John Snow Labs name: roberta_embeddings_indo_roberta_small date: 2022-04-14 tags: [roberta, embeddings, id, open_source] task: Embeddings language: id edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `indo-roberta-small` is an Indonesian model originally trained by `w11wo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indo_roberta_small_id_3.4.2_3.0_1649948731693.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indo_roberta_small_id_3.4.2_3.0_1649948731693.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indo_roberta_small","id") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Saya suka percikan NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indo_roberta_small","id") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Saya suka percikan NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("id.embed.indo_roberta_small").predict("""Saya suka percikan NLP""") ```
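A common downstream use of the `embeddings` column is comparing token or sentence vectors with cosine similarity. A plain-Python sketch with made-up 4-dimensional vectors standing in for the model's real embedding vectors (the words and values are invented for illustration):

```python
import math

# Cosine similarity between two dense vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors: semantically related words should score higher
# than unrelated ones when the embeddings are any good.
v_kucing = [0.9, 0.1, 0.3, 0.0]  # "cat"  (hypothetical vector)
v_anjing = [0.8, 0.2, 0.4, 0.1]  # "dog"  (hypothetical vector)
v_mobil  = [0.0, 0.9, 0.0, 0.8]  # "car"  (hypothetical vector)

print(cosine(v_kucing, v_anjing) > cosine(v_kucing, v_mobil))  # -> True
```

With real output, the per-token vectors live in each row's `embeddings` annotations and can be compared the same way.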
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_indo_roberta_small| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|id| |Size:|314.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/w11wo/indo-roberta-small - https://arxiv.org/abs/1907.11692 - https://github.com/sgugger - https://w11wo.github.io/ --- layout: model title: English DistilBertForQuestionAnswering Base Cased model (from Moussab) author: John Snow Labs name: distilbert_qa_base_cased_led_squad_orkg_which_1e_04 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-orkg-which-1e-04` is an English model originally trained by `Moussab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_orkg_which_1e_04_en_4.3.0_3.0_1672766889512.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_orkg_which_1e_04_en_4.3.0_3.0_1672766889512.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_orkg_which_1e_04","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_orkg_which_1e_04","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_cased_led_squad_orkg_which_1e_04| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Moussab/distilbert-base-cased-distilled-squad-orkg-which-1e-04 --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_anan0329 TFWav2Vec2ForCTC from anan0329 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_anan0329 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_anan0329` is an English model originally trained by anan0329. NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use `pipeline_asr_wav2vec2_base_timit_demo_colab_by_anan0329_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_anan0329_en_4.2.0_3.0_1664114693280.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_anan0329_en_4.2.0_3.0_1664114693280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_anan0329', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_anan0329", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_anan0329| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Chinese Bert Embeddings (Base) author: John Snow Labs name: bert_embeddings_jdt_fin_roberta_wwm date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `jdt-fin-roberta-wwm` is a Chinese model originally trained by `wangfan`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_jdt_fin_roberta_wwm_zh_3.4.2_3.0_1649669984329.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_jdt_fin_roberta_wwm_zh_3.4.2_3.0_1649669984329.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_jdt_fin_roberta_wwm","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_jdt_fin_roberta_wwm","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.jdt_fin_roberta_wwm").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_jdt_fin_roberta_wwm| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|383.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/wangfan/jdt-fin-roberta-wwm - https://3.cn/103c-hwSS - https://3.cn/103c-izpe --- layout: model title: English image_classifier_vit_llama_alpaca_guanaco_vicuna ViTForImageClassification from osanseviero author: John Snow Labs name: image_classifier_vit_llama_alpaca_guanaco_vicuna date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_llama_alpaca_guanaco_vicuna` is an English model originally trained by osanseviero. ## Predicted Entities `alpaca`, `guanaco`, `llama`, `vicuna` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_llama_alpaca_guanaco_vicuna_en_4.1.0_3.0_1660166270042.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_llama_alpaca_guanaco_vicuna_en_4.1.0_3.0_1660166270042.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_llama_alpaca_guanaco_vicuna", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_llama_alpaca_guanaco_vicuna", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_llama_alpaca_guanaco_vicuna| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Legal Sublease Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_sublease_agreement date: 2022-11-10 tags: [en, legal, classification, agreement, sublease, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_sublease_agreement` model is a Legal Longformer Document Classifier that classifies whether a document belongs to the class `sublease-agreement` or not (Binary Classification). Longformers have a limit of 4096 tokens, so only the first 4096 tokens are taken into account. We have found that for the vast majority of documents in legal corpora, if they are clean and contain only the legal document without any extra leading content, 4096 tokens are enough to perform Document Classification. If not, let us know and we can provide an alternative approach: splitting the document into 4096-token chunks, averaging their embeddings, and training on the averaged version, so the whole document is taken into account. In theory, however, this should not be required. ## Predicted Entities `sublease-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_sublease_agreement_en_1.0.0_3.0_1668117647287.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_sublease_agreement_en_1.0.0_3.0_1668117647287.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_sublease_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
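The chunk-and-average alternative mentioned in the description can be sketched in plain Python. This is only an illustration of the idea — the chunk size of 4 and the stand-in vectors are toy values, not part of the model:

```python
# Illustration of the chunk-and-average idea from the description: split a
# long token sequence into fixed-size chunks, embed each chunk, and use the
# element-wise mean of the chunk vectors as the document embedding.

def chunk_tokens(tokens, max_len=4096):
    """Split a token list into consecutive chunks of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def average_embeddings(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

tokens = list(range(10))
chunks = chunk_tokens(tokens, max_len=4)           # chunk sizes: 4, 4, 2
stand_in_vectors = [[float(len(c))] * 3 for c in chunks]
doc_vector = average_embeddings(stand_in_vectors)
```

With a real Longformer encoder, `stand_in_vectors` would be the per-chunk embeddings and `doc_vector` would feed the classifier.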
## Results ```bash +--------------------+ | result| +--------------------+ |[sublease-agreement]| |[other]| |[other]| |[sublease-agreement]| +--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_sublease_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.99 1.00 0.99 66 sublease-agreement 1.00 0.97 0.99 35 accuracy - - 0.99 101 macro-avg 0.99 0.99 0.99 101 weighted-avg 0.99 0.99 0.99 101 ``` --- layout: model title: German Financial Bert Word Embeddings author: John Snow Labs name: bert_embeddings_german_financial_statements_bert date: 2022-04-11 tags: [bert, embeddings, de, open_source] task: Embeddings language: de edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Financial Bert Word Embeddings model, trained on German Financial Statements. Uploaded to Hugging Face, adapted and imported into Spark NLP. `german-financial-statements-bert` is a German model originally trained by `fabianrausch`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_german_financial_statements_bert_de_3.4.2_3.0_1649676227862.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_german_financial_statements_bert_de_3.4.2_3.0_1649676227862.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_german_financial_statements_bert","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Funken NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_german_financial_statements_bert","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Funken NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.embed.german_financial_statements_bert").predict("""Ich liebe Funken NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_german_financial_statements_bert| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|409.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/fabianrausch/german-financial-statements-bert --- layout: model title: Sentence Entity Resolver for UMLS CUI Codes (Clinical Drug) author: John Snow Labs name: sbiobertresolve_umls_clinical_drugs date: 2022-07-05 tags: [entity_resolution, licensed, clinical, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities to UMLS CUI codes. It is trained on the 2022AA UMLS dataset. The complete dataset has 127 different categories, and this model is trained on the Clinical Drug category using `sbiobert_base_cased_mli` embeddings. ## Predicted Entities `Predicts UMLS codes for Clinical Drug medical concepts` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_clinical_drugs_en_4.0.0_3.0_1657039242193.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_clinical_drugs_en_4.0.0_3.0_1657039242193.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("clinical_ner") ner_model_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "clinical_ner"])\ .setOutputCol("ner_chunk") chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_umls_clinical_drugs","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") pipeline = Pipeline(stages = [document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_model_converter, chunk2doc, sbert_embedder, resolver]) data = spark.createDataFrame([["""She was immediately given hydrogen peroxide 30 mg to treat the infection on her leg, and has been advised Neosporin Cream for 5 days. She has a history of taking magnesium hydroxide 100mg/1ml and metformin 1000 mg."""]]).toDF("text") results = pipeline.fit(data).transform(data) ``` ```scala ... 
val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel .pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("clinical_ner") val ner_model_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "clinical_ner")) .setOutputCol("ner_chunk") val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") .setCaseSensitive(false) val resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_umls_clinical_drugs", "en", "clinical/models") .setInputCols(Array("ner_chunk_doc", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_model_converter, chunk2doc, sbert_embedder, resolver)) val data = Seq("She was immediately given hydrogen peroxide 30 mg to treat the infection on her leg, and has been advised Neosporin Cream for 5 days. She has a history of taking magnesium hydroxide 100mg/1ml and metformin 1000 mg.").toDF("text") val res = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.umls_clinical_drugs").predict("""She was immediately given hydrogen peroxide 30 mg to treat the infection on her leg, and has been advised Neosporin Cream for 5 days. 
She has a history of taking magnesium hydroxide 100mg/1ml and metformin 1000 mg.""") ```
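The resolver annotations produced above carry the top code in their `result` field and candidate codes in their metadata; they can be flattened into rows like those in the Results section. The sketch below uses plain dicts as stand-ins for Spark NLP `Annotation` objects, and the `all_k_results` metadata key with `:::` separators is an assumption about the resolver's output format:

```python
# Flatten resolver output into (chunk, code, candidates) rows. The dicts
# below stand in for the Annotation objects a real pipeline would return.

def summarize_resolutions(chunks, resolutions):
    """Pair each NER chunk with its resolved code and the candidate codes."""
    rows = []
    for chunk, res in zip(chunks, resolutions):
        rows.append({
            "chunk": chunk,
            "code": res["result"],
            "all_k_codes": res["metadata"].get("all_k_results", "").split(":::"),
        })
    return rows

resolutions = [{"result": "C1126248",
                "metadata": {"all_k_results": "C1126248:::C0304655:::C1605252"}}]
rows = summarize_resolutions(["hydrogen peroxide 30 mg"], resolutions)
```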
## Results ```bash | | chunk | code | code_description | all_k_code_desc | all_k_codes | |---:|:------------------------------|:---------|:---------------------------|:-------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | 0 | hydrogen peroxide 30 mg | C1126248 | hydrogen peroxide 30 mg/ml | ['C1126248', 'C0304655', 'C1605252', 'C0304656', 'C1154260'] | ['hydrogen peroxide 30 mg/ml', 'hydrogen peroxide solution 30%', 'hydrogen peroxide 30 mg/ml [proxacol]', 'hydrogen peroxide 30 mg/ml cutaneous solution', 'benzoyl peroxide 30 mg/ml'] | | 1 | Neosporin Cream | C0132149 | neosporin cream | ['C0132149', 'C0358174', 'C0357999', 'C0307085', 'C0698810'] | ['neosporin cream', 'nystan cream', 'nystadermal cream', 'nupercainal cream', 'nystaform cream'] | | 2 | magnesium hydroxide 100mg/1ml | C1134402 | magnesium hydroxide 100 mg | ['C1134402', 'C1126785', 'C4317023', 'C4051486', 'C4047137'] | ['magnesium hydroxide 100 mg', 'magnesium hydroxide 100 mg/ml', 'magnesium sulphate 100mg/ml injection', 'magnesium sulfate 100 mg', 'magnesium sulfate 100 mg/ml'] | | 3 | metformin 1000 mg | C0987664 | metformin 1000 mg | ['C0987664', 'C2719784', 'C0978482', 'C2719786', 'C4282269'] | ['metformin 1000 mg', 'metformin hydrochloride 1000 mg', 'metformin hcl 1000mg tab', 'metformin hydrochloride 1000 mg [fortamet]', 'metformin hcl 1000mg sa tab'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_umls_clinical_drugs| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[name]| |Language:|en| |Size:|2.5 GB| |Case sensitive:|false| ## References Trained on 2022AA UMLS dataset’s Clinical Drug category. 
https://www.nlm.nih.gov/research/umls/index.html --- layout: model title: Legal Other Definitional Provisions Clause Binary Classifier author: John Snow Labs name: legclf_other_definitional_provisions_clause date: 2023-01-29 tags: [en, legal, classification, other, definitional, provisions, clauses, other_definitional_provisions, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `other-definitional-provisions` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings used by this model allow up to 512 tokens. If your text is longer, consider splitting it into smaller pieces (see the same tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing a True/False value for each of the legal clause models you add. 
## Predicted Entities `other-definitional-provisions`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_other_definitional_provisions_clause_en_1.0.0_3.0_1674993355901.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_other_definitional_provisions_clause_en_1.0.0_3.0_1674993355901.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_other_definitional_provisions_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
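The "paragraph splitting (by multiline)" technique mentioned in the description can be sketched without any NLP library; the blank-line regex below is an assumption about the paragraph delimiter, not the tutorial's exact code:

```python
import re

# Paragraph splitting "by multiline": treat one or more blank lines as a
# paragraph boundary, then feed each paragraph to the classifier separately.

def split_paragraphs(text):
    """Split text on blank lines and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("1. DEFINITIONS.\nTerms used herein...\n\n"
       "2. OTHER DEFINITIONAL PROVISIONS.\nAll terms defined...")
paragraphs = split_paragraphs(doc)  # one candidate text per clause
```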
## Results ```bash +-------------------------------+ | result| +-------------------------------+ |[other-definitional-provisions]| |[other]| |[other]| |[other-definitional-provisions]| +-------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_other_definitional_provisions_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.97 1.00 0.98 30 other-definitional-provisions 1.00 0.95 0.98 22 accuracy - - 0.98 52 macro-avg 0.98 0.98 0.98 52 weighted-avg 0.98 0.98 0.98 52 ``` --- layout: model title: Pipeline for Detect Medication author: John Snow Labs name: ner_medication_pipeline date: 2023-06-13 tags: [ner, en, licensed] task: Pipeline Healthcare language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pretrained pipeline to detect medication entities. It was built on top of the `ner_posology_greedy` model and augmented with the drug names mentioned in the UK and US drugbank datasets. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_medication_pipeline_en_4.4.4_3.2_1686665836067.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_medication_pipeline_en_4.4.4_3.2_1686665836067.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline ner_medication_pipeline = PretrainedPipeline("ner_medication_pipeline", "en", "clinical/models") text = """The patient was prescribed metformin 1000 MG, and glipizide 2.5 MG. The other patient was given Fragmin 5000 units, Xenaderm to wounds topically b.i.d. and OxyContin 30 mg.""" result = ner_medication_pipeline.fullAnnotate([text]) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val ner_medication_pipeline = new PretrainedPipeline("ner_medication_pipeline", "en", "clinical/models") val result = ner_medication_pipeline.fullAnnotate("The patient was prescribed metformin 1000 MG, and glipizide 2.5 MG. The other patient was given Fragmin 5000 units, Xenaderm to wounds topically b.i.d. and OxyContin 30 mg.") ``` ## Results ```bash | ner_chunk | entity | |:-------------------|:---------| | metformin 1000 MG | DRUG | | glipizide 2.5 MG | DRUG | | Fragmin 5000 units | DRUG | | Xenaderm | DRUG | | OxyContin 30 mg | DRUG | ```
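`fullAnnotate` returns one set of annotations per input text, keyed by output column; a sketch of turning that into the chunk/entity rows shown above, using plain dicts as stand-ins for the returned `Annotation` objects (the `ner_chunk` column name and the `entity` metadata key are assumptions based on the pipeline's typical output):

```python
# Convert fullAnnotate-style output into (chunk, entity) rows. Each
# annotation's text is in "result" and its label in metadata["entity"].

def chunks_to_rows(annotated, column="ner_chunk"):
    return [(ann["result"], ann["metadata"]["entity"])
            for ann in annotated.get(column, [])]

annotated = {"ner_chunk": [
    {"result": "metformin 1000 MG", "metadata": {"entity": "DRUG"}},
    {"result": "glipizide 2.5 MG", "metadata": {"entity": "DRUG"}},
]}
rows = chunks_to_rows(annotated)
```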
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_medication_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel - TextMatcherModel - ChunkMergeModel - Finisher --- layout: model title: Translate English to Finnish Pipeline author: John Snow Labs name: translate_en_fi date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, fi, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `fi` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_fi_xx_2.7.0_2.4_1609689441892.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_fi_xx_2.7.0_2.4_1609689441892.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_fi", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_fi", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.fi').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_fi| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Stop Words Cleaner for Greek author: John Snow Labs name: stopwords_el date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: el edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, el] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_el_el_2.5.4_2.4_1594742437880.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_el_el_2.5.4_2.4_1594742437880.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_el", "el") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Εκτός από το ότι είναι ο βασιλιάς του Βορρά, ο John Snow είναι Άγγλος γιατρός και ηγέτης στην ανάπτυξη της αναισθησίας και της ιατρικής υγιεινής.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_el", "el") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Εκτός από το ότι είναι ο βασιλιάς του Βορρά, ο John Snow είναι Άγγλος γιατρός και ηγέτης στην ανάπτυξη της αναισθησίας και της ιατρικής υγιεινής.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Εκτός από το ότι είναι ο βασιλιάς του Βορρά, ο John Snow είναι Άγγλος γιατρός και ηγέτης στην ανάπτυξη της αναισθησίας και της ιατρικής υγιεινής."""] stopword_df = nlu.load('el.stopwords').predict(text) stopword_df[["cleanTokens"]] ```
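The idea behind the annotator can be illustrated in plain Python with a toy stop-word list (the English words below are only an example; the model ships its own Greek list):

```python
# Toy illustration of stop-word removal: keep only tokens whose lowercase
# form is not in the stop-word set. STOP_WORDS here is a tiny English
# example, not the Greek list bundled with stopwords_el.

STOP_WORDS = {"the", "a", "of", "is", "and"}

def clean_tokens(tokens, stop_words=STOP_WORDS):
    return [t for t in tokens if t.lower() not in stop_words]

print(clean_tokens(["The", "king", "of", "the", "North"]))  # ['king', 'North']
```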
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=4, result='Εκτός', metadata={'sentence': '0'}), Row(annotatorType='token', begin=6, end=8, result='από', metadata={'sentence': '0'}), Row(annotatorType='token', begin=13, end=15, result='ότι', metadata={'sentence': '0'}), Row(annotatorType='token', begin=17, end=21, result='είναι', metadata={'sentence': '0'}), Row(annotatorType='token', begin=25, end=32, result='βασιλιάς', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_el| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|el| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: Financial Finbert Sentiment Analysis (DistilRoBerta) author: John Snow Labs name: finclf_distilroberta_sentiment_analysis date: 2022-08-09 tags: [en, finance, sentiment, classification, sentiment_analysis, licensed] task: Sentiment Analysis language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a pre-trained NLP model to analyze sentiment of financial text. It is built by further training the DistilRoBerta language model in the finance domain, using a financial corpus and thereby fine-tuning it for financial sentiment classification. Financial PhraseBank by Malo et al. (2014) and in-house JSL documents and annotations have been used for fine-tuning. 
## Predicted Entities `positive`, `negative`, `neutral` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN_FINANCE/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_distilroberta_sentiment_analysis_en_1.0.0_3.2_1660055192412.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_distilroberta_sentiment_analysis_en_1.0.0_3.2_1660055192412.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") classifier = nlp.RoBertaForSequenceClassification.pretrained("finclf_distilroberta_sentiment_analysis","en", "finance/models") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") nlpPipeline = nlp.Pipeline( stages = [ documentAssembler, tokenizer, classifier]) # couple of simple examples example = spark.createDataFrame([["Stocks rallied and the British pound gained."]]).toDF("text") result = nlpPipeline.fit(example).transform(example) # result is a DataFrame result.select("text", "class.result").show() ```
## Results ```bash +--------------------+----------+ | text| result| +--------------------+----------+ |Stocks rallied an...|[positive]| +--------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_distilroberta_sentiment_analysis| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|309.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References In-house financial documents and Financial PhraseBank by Malo et al. (2014) ## Benchmarking ```bash label precision recall f1-score support positive 0.77 0.88 0.81 253 negative 0.86 0.85 0.88 133 neutral 0.93 0.86 0.90 584 accuracy - - 0.86 970 macro-avg 0.85 0.86 0.85 970 weighted-avg 0.87 0.86 0.87 970 ``` --- layout: model title: Portuguese BertForTokenClassification Cased model (from pucpr) author: John Snow Labs name: bert_token_classifier_clinicalnerpt_disease date: 2022-11-30 tags: [pt, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: pt edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `clinicalnerpt-disease` is a Portuguese model originally trained by `pucpr`. 
## Predicted Entities `DiseaseOrSyndrome` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_clinicalnerpt_disease_pt_4.2.4_3.0_1669822418241.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_clinicalnerpt_disease_pt_4.2.4_3.0_1669822418241.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_clinicalnerpt_disease","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_clinicalnerpt_disease","pt") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
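The token classifier above emits one IOB-style tag per token (here `B-DiseaseOrSyndrome`, `I-DiseaseOrSyndrome`, or `O`); in Spark NLP you would typically append an `NerConverter` stage to merge those tags into entity chunks. As a minimal plain-Python sketch of that merging logic (toy tokens and tags, not the Spark NLP API):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (entity_text, label) chunks."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Paciente", "com", "diabetes", "mellitus", "tipo", "2"]
tags = ["O", "O", "B-DiseaseOrSyndrome", "I-DiseaseOrSyndrome",
        "I-DiseaseOrSyndrome", "I-DiseaseOrSyndrome"]
print(iob_to_chunks(tokens, tags))
# → [('diabetes mellitus tipo 2', 'DiseaseOrSyndrome')]
```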
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_clinicalnerpt_disease| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|pt| |Size:|665.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/pucpr/clinicalnerpt-disease - https://www.aclweb.org/anthology/2020.clinicalnlp-1.7/ - https://github.com/HAILab-PUCPR/SemClinBr - https://github.com/HAILab-PUCPR/BioBERTpt --- layout: model title: Stop Words Cleaner for Marathi author: John Snow Labs name: stopwords_mr date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: mr edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, mr] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_mr_mr_2.5.4_2.4_1594742439994.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_mr_mr_2.5.4_2.4_1594742439994.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_mr", "mr") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("उत्तरेचा राजा होण्याव्यतिरिक्त, जॉन स्नो एक इंग्रज चिकित्सक आहे आणि भूल आणि वैद्यकीय स्वच्छतेच्या विकासासाठी अग्रगण्य आहे.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_mr", "mr") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("उत्तरेचा राजा होण्याव्यतिरिक्त, जॉन स्नो एक इंग्रज चिकित्सक आहे आणि भूल आणि वैद्यकीय स्वच्छतेच्या विकासासाठी अग्रगण्य आहे.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""उत्तरेचा राजा होण्याव्यतिरिक्त, जॉन स्नो एक इंग्रज चिकित्सक आहे आणि भूल आणि वैद्यकीय स्वच्छतेच्या विकासासाठी अग्रगण्य आहे."""] stopword_df = nlu.load('mr.stopwords').predict(text) stopword_df[['cleanTokens']] ```
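Conceptually, the cleaner keeps each token unless it appears in a language-specific stop-word list. A minimal plain-Python sketch of that filtering step (toy word set, not the model's actual Marathi lexicon):

```python
# Toy stop-word set; the real model ships a curated Marathi lexicon.
stop_words = {"आहे", "आणि", "एक"}

def clean_tokens(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

tokens = ["जॉन", "स्नो", "एक", "इंग्रज", "चिकित्सक", "आहे"]
print(clean_tokens(tokens))
# → ['जॉन', 'स्नो', 'इंग्रज', 'चिकित्सक']
```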
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=7, result='उत्तरेचा', metadata={'sentence': '0'}), Row(annotatorType='token', begin=9, end=12, result='राजा', metadata={'sentence': '0'}), Row(annotatorType='token', begin=14, end=29, result='होण्याव्यतिरिक्त', metadata={'sentence': '0'}), Row(annotatorType='token', begin=30, end=30, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=32, end=34, result='जॉन', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_mr| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|mr| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: Translate Icelandic to English Pipeline author: John Snow Labs name: translate_is_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, is, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences.
The use of an accelerator such as a GPU is recommended. - source languages: `is` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_is_en_xx_2.7.0_2.4_1609690970413.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_is_en_xx_2.7.0_2.4_1609690970413.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_is_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_is_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.is.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_is_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate English to Xhosa Pipeline author: John Snow Labs name: translate_en_xh date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, xh, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `xh` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_xh_xx_2.7.0_2.4_1609689615747.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_xh_xx_2.7.0_2.4_1609689615747.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_xh", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_xh", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.xh').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_xh| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from zhenyueyu) author: John Snow Labs name: distilbert_qa_zhenyueyu_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `zhenyueyu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_zhenyueyu_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773369365.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_zhenyueyu_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773369365.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_zhenyueyu_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_zhenyueyu_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
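Under the hood, an extractive QA head scores every token of the context as a potential answer start and answer end, and the predicted answer is the span maximizing the combined score with start ≤ end. A minimal plain-Python sketch of that span-selection step (toy scores, not the model's actual logits):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick (i, j) maximizing start_scores[i] + end_scores[j], with i <= j < i + max_len."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score, best = s + end_scores[j], (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 1.0, 0.0]
end   = [0.0, 0.1, 0.0, 4.5, 0.2, 0.0, 0.0, 0.0, 1.2, 0.1]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))
# → Clara
```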
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_zhenyueyu_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/zhenyueyu/distilbert-base-uncased-finetuned-squad --- layout: model title: Abkhazian asr_xls_test TFWav2Vec2ForCTC from pere author: John Snow Labs name: pipeline_asr_xls_test date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_test` is an Abkhazian model originally trained by pere. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_xls_test_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_test_ab_4.2.0_3.0_1664020711203.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_test_ab_4.2.0_3.0_1664020711203.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_xls_test', lang = 'ab') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_xls_test", lang = "ab") val annotations = pipeline.transform(audioDF) ```
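ASR output quality is typically measured as word error rate (WER): the word-level edit distance between hypothesis and reference transcripts, divided by the reference length. A minimal sketch of that metric (standard Levenshtein dynamic programming, not part of this pipeline's API):

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over words / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```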
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_xls_test| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|ab| |Size:|452.5 KB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Arabic Bert Embeddings (Base) author: John Snow Labs name: bert_embeddings_bert_base_qarib date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-qarib` is an Arabic model originally trained by `qarib`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_qarib_ar_3.4.2_3.0_1649677790858.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_qarib_ar_3.4.2_3.0_1649677790858.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_qarib","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_qarib","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.bert_base_qarib").predict("""أنا أحب شرارة NLP""") ```
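The embeddings output is one dense vector per token, and downstream tasks usually compare those vectors with cosine similarity. A minimal sketch with toy 3-dimensional vectors (the real model produces 768-dimensional BERT vectors):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

v1 = [0.2, 0.9, 0.1]    # toy embedding for token A
v2 = [0.25, 0.85, 0.0]  # toy embedding for token B
print(round(cosine(v1, v2), 3))
```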
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_qarib| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|506.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/qarib/bert-base-qarib - http://opus.nlpl.eu/ - https://github.com/qcri/QARIB/Training_QARiB.md - https://github.com/qcri/QARIB/Using_QARiB.md --- layout: model title: Catalan, Valencian asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala TFWav2Vec2ForCTC from softcatala author: John Snow Labs name: asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala date: 2022-09-24 tags: [wav2vec2, ca, audio, open_source, asr] task: Automatic Speech Recognition language: ca edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala` is a Catalan (Valencian) model originally trained by softcatala. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala_ca_4.2.0_3.0_1664037065825.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala_ca_4.2.0_3.0_1664037065825.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala", "ca")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala", "ca") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
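Wav2Vec2ForCTC emits one character distribution per audio frame; greedy CTC decoding then takes the argmax label per frame, collapses consecutive repeats, and drops the blank symbol. A minimal plain-Python sketch of that collapse step (toy frame labels, not the annotator's internals):

```python
BLANK = "_"  # CTC blank symbol

def ctc_greedy_collapse(frame_labels):
    """Collapse repeated frame labels, then remove CTC blanks."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return "".join(out)

# Frames for "hola": labels repeat because the frame rate exceeds the character rate.
frames = ["h", "h", "_", "o", "o", "_", "l", "l", "_", "a", "a"]
print(ctc_greedy_collapse(frames))
# → hola
```

Note that a blank between two identical labels keeps them distinct (`["l", "_", "l"]` decodes to `"ll"`), which is exactly why CTC uses the blank symbol.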
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ca| |Size:|1.2 GB| --- layout: model title: Sango asr_wav2vec2_large_xlsr_53_swiss_german TFWav2Vec2ForCTC from Yves author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_swiss_german date: 2022-09-24 tags: [wav2vec2, sg, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: sg edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_swiss_german` is a Sango model originally trained by Yves. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_swiss_german_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_swiss_german_sg_4.2.0_3.0_1664022719221.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_swiss_german_sg_4.2.0_3.0_1664022719221.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_swiss_german', lang = 'sg') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_swiss_german", lang = "sg") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_swiss_german| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|sg| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Danish XlmRoBertaForQuestionAnswering (from saattrupdan) author: John Snow Labs name: xlm_roberta_qa_xlmr_base_texas_squad_da_da_saattrupdan date: 2022-06-24 tags: [da, open_source, question_answering, xlmroberta] task: Question Answering language: da edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlmr-base-texas-squad-da` is a Danish model originally trained by `saattrupdan`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlmr_base_texas_squad_da_da_saattrupdan_da_4.0.0_3.0_1656062061104.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlmr_base_texas_squad_da_da_saattrupdan_da_4.0.0_3.0_1656062061104.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlmr_base_texas_squad_da_da_saattrupdan","da") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlmr_base_texas_squad_da_da_saattrupdan","da") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("da.answer_question.squad.xlmr_roberta.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlmr_base_texas_squad_da_da_saattrupdan| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|da| |Size:|878.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/saattrupdan/xlmr-base-texas-squad-da --- layout: model title: Legal Construction Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_construction_bert date: 2023-03-05 tags: [en, legal, classification, clauses, construction, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Construction` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Construction`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_construction_bert_en_1.0.0_3.0_1678050533068.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_construction_bert_en_1.0.0_3.0_1678050533068.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_construction_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
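As the description notes, several of these binary clause classifiers can be run over the same provisions and their outputs merged into a single clause-to-present map. A minimal plain-Python sketch of that merging step (hypothetical classifier names and outputs, not the Legal NLP API):

```python
# Hypothetical per-classifier outputs for one provision: each binary model
# returns either its own clause label or "Other".
predictions = {
    "legclf_construction_bert": "Construction",
    "legclf_amendments_bert": "Other",
    "legclf_assignments_bert": "Other",
}

def clause_flags(preds):
    """True where a classifier predicted its clause, False where it said 'Other'."""
    return {name: label != "Other" for name, label in preds.items()}

print(clause_flags(predictions))
```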
## Results ```bash +--------------+ |result| +--------------+ |[Construction]| |[Other]| |[Other]| |[Construction]| +--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_construction_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Construction 0.84 0.83 0.84 46 Other 0.88 0.90 0.89 67 accuracy - - 0.87 113 macro-avg 0.86 0.86 0.86 113 weighted-avg 0.87 0.87 0.87 113 ``` --- layout: model title: English BertForTokenClassification Base Uncased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4_Original_BiomedNLP_PubMedBERT_base_uncased_abstract date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4_Original-BiomedNLP-PubMedBERT-base-uncased-abstract` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_Original_BiomedNLP_PubMedBERT_base_uncased_abstract_en_4.0.0_3.0_1657109212870.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_Original_BiomedNLP_PubMedBERT_base_uncased_abstract_en_4.0.0_3.0_1657109212870.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_Original_BiomedNLP_PubMedBERT_base_uncased_abstract","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_Original_BiomedNLP_PubMedBERT_base_uncased_abstract","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4_Original_BiomedNLP_PubMedBERT_base_uncased_abstract| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|408.5 MB| |Case sensitive:|false| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4_Original-BiomedNLP-PubMedBERT-base-uncased-abstract --- layout: model title: Multilingual DistilBertForTokenClassification Base Cased model (from mrm8488) author: John Snow Labs name: distilbert_ner_base_multi_cased_finetuned_typo_detection date: 2022-07-21 tags: [open_source, distilbert, ner, typo, multilingual, xx] task: Named Entity Recognition language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT NER model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-multi-cased-finetuned-typo-detection` is a Multilingual model originally trained by `mrm8488`. ## Predicted Entities `ok`, `typo` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_base_multi_cased_finetuned_typo_detection_xx_4.0.0_3.0_1658399913400.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_base_multi_cased_finetuned_typo_detection_xx_4.0.0_3.0_1658399913400.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") ner = DistilBertForTokenClassification.pretrained("distilbert_ner_base_multi_cased_finetuned_typo_detection","xx") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, ner]) data = spark.createDataFrame([["PUT YOUR STRING HERE."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val ner = DistilBertForTokenClassification.pretrained("distilbert_ner_base_multi_cased_finetuned_typo_detection","xx") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, ner)) val data = Seq("PUT YOUR STRING HERE.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_ner_base_multi_cased_finetuned_typo_detection| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|505.8 MB| |Case sensitive:|true| |Max sentence length:|128| ## References https://huggingface.co/mrm8488/distilbert-base-multi-cased-finetuned-typo-detection --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_6 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-64-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_6_en_4.0.0_3.0_1655733561065.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_6_en_4.0.0_3.0_1655733561065.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_6","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_64d_seed_6").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|419.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-64-finetuned-squad-seed-6 --- layout: model title: Multilingual DistilBertForQuestionAnswering Base Cased model (from monakth) author: John Snow Labs name: distilbert_qa_base_cased_squadv2 date: 2023-01-03 tags: [xx, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true recommended: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-multilingual-cased-squadv2` is a Multilingual model originally trained by `monakth`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_squadv2_xx_4.3.0_3.0_1672767315694.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_squadv2_xx_4.3.0_3.0_1672767315694.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_squadv2","xx")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_squadv2","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_cased_squadv2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|505.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/monakth/distilbert-base-multilingual-cased-squadv2 --- layout: model title: Fast Neural Machine Translation Model from Pijin to English author: John Snow Labs name: opus_mt_pis_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pis, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `pis` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_pis_en_xx_2.7.0_2.4_1609163443618.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_pis_en_xx_2.7.0_2.4_1609163443618.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_pis_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Put the text you want to translate here."] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_pis_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Put the text you want to translate here.").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.pis.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_pis_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Word2Vec Embeddings in Tagalog (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, tl, open_source] task: Embeddings language: tl edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_tl_3.4.1_3.0_1647461421317.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_tl_3.4.1_3.0_1647461421317.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","tl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Gustung-gusto ko ang Spark NLP."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","tl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Gustung-gusto ko ang Spark NLP.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("tl.embed.w2v_cc_300d").predict("""Gustung-gusto ko ang Spark NLP.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|tl| |Size:|416.3 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Legal Employment Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_employment_agreement_bert date: 2022-11-24 tags: [en, legal, classification, agreement, employment, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_employment_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `employment-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `employment-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_employment_agreement_bert_en_1.0.0_3.0_1669310901974.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_employment_agreement_bert_en_1.0.0_3.0_1669310901974.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_employment_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[employment-agreement]| |[other]| |[other]| |[employment-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_employment_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support employment-agreement 0.96 0.90 0.93 29 other 0.96 0.99 0.98 82 accuracy - - 0.96 111 macro-avg 0.96 0.94 0.95 111 weighted-avg 0.96 0.96 0.96 111 ``` --- layout: model title: Fast Neural Machine Translation Model from Latvian to English author: John Snow Labs name: opus_mt_lv_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, lv, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. 
- source languages: `lv` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_lv_en_xx_2.7.0_2.4_1609163952807.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_lv_en_xx_2.7.0_2.4_1609163952807.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_lv_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Put the text you want to translate here."] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_lv_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Put the text you want to translate here.").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.lv.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_lv_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForSequenceClassification Cased model (from Kaveh8) author: John Snow Labs name: roberta_classifier_autonlp_imdb_rating_625417974 date: 2022-12-09 tags: [en, open_source, roberta, sequence_classification, classification, tensorflow] task: Text Classification language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-imdb_rating-625417974` is a English model originally trained by `Kaveh8`. ## Predicted Entities `1`, `4`, `3`, `2`, `5` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_classifier_autonlp_imdb_rating_625417974_en_4.2.4_3.0_1670622586146.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_classifier_autonlp_imdb_rating_625417974_en_4.2.4_3.0_1670622586146.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autonlp_imdb_rating_625417974","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_classifier]) data = spark.createDataFrame([["I love you!"], ["I feel lucky to be here."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autonlp_imdb_rating_625417974","en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_classifier)) val data = Seq("I love you!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_classifier_autonlp_imdb_rating_625417974| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|428.0 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Kaveh8/autonlp-imdb_rating-625417974 --- layout: model title: English asr_wav2vec2_large_xlsr_coraa_portuguese_cv8 TFWav2Vec2ForCTC from lgris author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_coraa_portuguese_cv8 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_coraa_portuguese_cv8` is an English model originally trained by lgris. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_coraa_portuguese_cv8_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_coraa_portuguese_cv8_en_4.2.0_3.0_1664043408372.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_coraa_portuguese_cv8_en_4.2.0_3.0_1664043408372.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_coraa_portuguese_cv8', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_coraa_portuguese_cv8", lang = "en") val annotations = pipeline.transform(audioDF) ```
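The snippets above transform an `audioDF` that is never constructed on this page. As a minimal sketch of one way to prepare it — the file path is a placeholder, the `audio_content` column name follows the AudioAssembler examples elsewhere in these cards, and 16-bit PCM mono input is assumed — the raw samples can be decoded with only the Python standard library:

```python
import struct
import wave

def wav_to_floats(path):
    """Decode a 16-bit PCM WAV file into a list of floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit PCM samples"
        frames = w.readframes(w.getnframes())
    # "<Nh" = N little-endian signed 16-bit integers
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# With a live SparkSession, the float list becomes the one-column
# DataFrame the pipeline expects (column name is an assumption here):
#
#   audioDF = spark.createDataFrame([[wav_to_floats("sample.wav")]]) \
#                  .toDF("audio_content")
```

The Spark portion is left as a comment because it needs a running session; only the decoding step is runnable standalone.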
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_coraa_portuguese_cv8| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Chunk Resolver (Cpt Clinical) author: John Snow Labs name: chunkresolve_cpt_clinical class: ChunkEntityResolverModel language: en nav_key: models repository: clinical/models date: 2020-04-21 task: Entity Resolution edition: Healthcare NLP 2.4.2 spark_version: 2.4 tags: [clinical,licensed,entity_resolution,en] deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Entity Resolution model Based on KNN using Word Embeddings + Word Movers Distance. ## Predicted Entities CPT Codes and their normalized definition with `clinical_embeddings`. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_cpt_clinical_en_2.4.5_2.4_1587491373378.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_cpt_clinical_en_2.4.5_2.4_1587491373378.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python ... cpt_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_cpt_clinical","en","clinical/models")\ .setInputCols("token","chunk_embeddings")\ .setOutputCol("entity") pipeline_puerile = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, cpt_resolver]) data = spark.createDataFrame([["""The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion."""]]).toDF("text") model = pipeline_puerile.fit(data) results = model.transform(data) ``` ```scala ... val cpt_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_cpt_clinical","en","clinical/models") .setInputCols(Array("token","chunk_embeddings")) .setOutputCol("entity") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, cpt_resolver)) val data = Seq("The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. 
At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion.").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.h2_title} ## Results ```bash chunk entity cpt_description cpt_code 0 a cold, cough PROBLEM Thoracoscopy, surgical; with removal of a sing... 32669 1 runny nose PROBLEM Unlisted procedure, larynx 31599 2 fever PROBLEM Cesarean delivery only; 59514 3 difficulty breathing PROBLEM Repair, laceration of diaphragm, any approach 39501 4 her cough PROBLEM Exploration for postoperative hemorrhage, thro... 35840 5 physical exam TEST Cesarean delivery only; including postpartum care 59515 6 fairly congested PROBLEM Pyelotomy; with drainage, pyelostomy 50125 7 Amoxil TREATMENT Cholecystoenterostomy; with gastroenterostomy 47721 8 Aldex TREATMENT Laparoscopy, surgical; with omentopexy (omenta... 49326 9 difficulty breathing PROBLEM Repair, laceration of diaphragm, any approach 39501 10 more congested PROBLEM for section of 1 or more cranial nerves 61460 11 trouble sleeping PROBLEM Repair, laceration of diaphragm, any approach 39501 12 congestion PROBLEM Repair, laceration of diaphragm, any approach 39501 ``` {:.model-param} ## Model Information {:.table-model} |----------------|---------------------------| | Name: | chunkresolve_cpt_clinical | | Type: | ChunkEntityResolverModel | | Compatibility: | Spark NLP 2.4.2+ | | License: | Licensed | |Edition:|Official| | |Input labels: | token, chunk_embeddings | |Output labels: | entity | | Language: | en | | Case sensitive: | True | | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained on Current Procedural Terminology dataset. 
--- layout: model title: Pipeline to Detect details of cellular structures (biobert) author: John Snow Labs name: ner_cellular_biobert_pipeline date: 2023-03-20 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_cellular_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_cellular_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_cellular_biobert_pipeline_en_4.3.0_3.2_1679314449983.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_cellular_biobert_pipeline_en_4.3.0_3.2_1679314449983.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_cellular_biobert_pipeline", "en", "clinical/models") text = '''Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_cellular_biobert_pipeline", "en", "clinical/models") val text = "Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. 
Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.cellular_biobert.pipeline").predict("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:--------------------------------------------|--------:|------:|:------------|-------------:| | 0 | intracellular signaling proteins | 27 | 58 | protein | 0.673333 | | 1 | human T-cell leukemia virus type 1 promoter | 130 | 172 | DNA | 0.426171 | | 2 | Tax | 186 | 188 | protein | 0.779 | | 3 | Tax-responsive element 1 | 193 | 216 | DNA | 0.756933 | | 4 | cyclic AMP-responsive members | 237 | 265 | protein | 0.629333 | | 5 | CREB/ATF family | 274 | 288 | protein | 0.8499 | | 6 | transcription factors | 293 | 313 | protein | 0.78165 | | 7 | Tax | 389 | 391 | protein | 0.8463 | | 8 | Tax-responsive element 1 | 431 | 454 | DNA | 0.713067 | | 9 | TRE-1 | 457 | 461 | DNA | 0.9983 | | 10 | lacZ gene | 582 | 590 | DNA | 0.7018 | | 11 | CYC1 promoter | 617 | 629 | DNA | 0.81865 | | 12 | TRE-1 | 663 | 667 | DNA | 0.9967 | | 13 | cyclic AMP response element-binding protein | 695 | 737 | protein | 0.51984 | | 14 | CREB | 740 | 743 | protein | 0.9708 | | 15 | CREB | 749 | 752 | protein | 0.8875 | | 16 | GAL4 activation domain | 767 | 788 | protein | 0.578633 | | 17 | GAD | 791 | 793 | protein | 0.6432 | | 18 | reporter gene | 848 | 860 | DNA | 0.61005 | | 19 | Tax | 863 | 865 | protein | 0.99 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_cellular_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.1 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Sentence Entity Resolver for ICD10-PCS (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_icd10pcs date: 2021-05-16 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: 
true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD10-PCS codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings, and has a faster load time, with a speedup of about 6X compared to previous versions. The load process is also more memory-friendly: the maximum memory required during load time is smaller, reducing the chance of OOM exceptions and relaxing hardware requirements. ## Predicted Entities Predicts ICD10-PCS Codes and their normalized definitions. {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10pcs_en_3.0.4_3.0_1621189710474.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10pcs_en_3.0.4_3.0_1621189710474.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The ```sbiobertresolve_icd10pcs``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_jsl``` as the NER model, with ```Procedure``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") icd10pcs_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_icd10pcs","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10pcs_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala ... 
val chunk2doc = new Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val icd10pcs_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_icd10pcs","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10pcs_resolver)) val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd10pcs").predict("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . 
Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""") ```
## Results ```bash +--------------------+-----+---+---------+-------+----------+--------------------+--------------------+ | chunk|begin|end| entity| code|confidence| resolutions| codes| +--------------------+-----+---+---------+-------+----------+--------------------+--------------------+ | hypertension| 68| 79| PROBLEM|DWY18ZZ| 0.0626|Hyperthermia of H...|DWY18ZZ:::6A3Z1ZZ...| |chronic renal ins...| 83|109| PROBLEM|DTY17ZZ| 0.0722|Contact Radiation...|DTY17ZZ:::04593ZZ...| | COPD| 113|116| PROBLEM|2W04X7Z| 0.0765|Change Intermitte...|2W04X7Z:::0J063ZZ...| | gastritis| 120|128| PROBLEM|04723Z6| 0.0826|Dilation of Gastr...|04723Z6:::04724Z6...| | TIA| 136|138| PROBLEM|00F5XZZ| 0.1074|Fragmentation in ...|00F5XZZ:::00F53ZZ...| |a non-ST elevatio...| 182|202| PROBLEM|B307ZZZ| 0.0750|Plain Radiography...|B307ZZZ:::2W59X3Z...| |Guaiac positive s...| 208|229| PROBLEM|3E1G38Z| 0.0886|Irrigation of Upp...|3E1G38Z:::3E1G38X...| |cardiac catheteri...| 295|317| TEST|4A0234Z| 0.0783|Measurement of Ca...|4A0234Z:::4A02X4A...| | PTCA| 324|327|TREATMENT|03SG3ZZ| 0.0507|Reposition Intrac...|03SG3ZZ:::0GCQ3ZZ...| | mid LAD lesion| 332|345| PROBLEM|02H73DZ| 0.0490|Insertion of Intr...|02H73DZ:::02163Z7...| +--------------------+-----+---+---------+-------+----------+--------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icd10pcs| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk, sbert_embeddings]| |Output Labels:|[icd10pcs_code]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on ICD10 Procedure Coding System dataset with ``sbiobert_base_cased_mli`` sentence embeddings. 
https://www.icd10data.com/ICD10PCS/Codes --- layout: model title: English T5ForConditionalGeneration Base Cased model (from nouamanetazi) author: John Snow Labs name: t5_cover_letter_base date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cover-letter-t5-base` is an English model originally trained by `nouamanetazi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_cover_letter_base_en_4.3.0_3.0_1675100617268.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_cover_letter_base_en_4.3.0_3.0_1675100617268.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_cover_letter_base","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_cover_letter_base","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_cover_letter_base| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|910.7 MB| ## References - https://huggingface.co/nouamanetazi/cover-letter-t5-base --- layout: model title: Fast Neural Machine Translation Model from Artificial languages to English author: John Snow Labs name: opus_mt_art_en date: 2021-06-01 tags: [open_source, seq2seq, translation, art, en, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. source languages: art target languages: en {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_art_en_xx_3.1.0_2.4_1622559545730.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_art_en_xx_3.1.0_2.4_1622559545730.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_art_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_art_en", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Artificial languages.translate_to.English').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_art_en| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_16_finetuned_squad_seed_6 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-16-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_16_finetuned_squad_seed_6_en_4.3.0_3.0_1674214482569.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_16_finetuned_squad_seed_6_en_4.3.0_3.0_1674214482569.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_16_finetuned_squad_seed_6","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_16_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_16_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|416.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-16-finetuned-squad-seed-6 --- layout: model title: Explain Document pipeline for Dutch (explain_document_lg) author: John Snow Labs name: explain_document_lg date: 2021-03-23 tags: [open_source, dutch, explain_document_lg, pipeline, nl] supported: true task: [Named Entity Recognition, Lemmatization] language: nl edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_lg is a pretrained pipeline that performs basic processing steps and recognizes entities. It covers most of the common text processing tasks you would run on your DataFrame. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_nl_3.0.0_3.0_1616513098571.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_lg_nl_3.0.0_3.0_1616513098571.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_lg', lang = 'nl') annotations = pipeline.fullAnnotate("Hallo van John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_lg", lang = "nl") val result = pipeline.fullAnnotate("Hallo van John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hallo van John Snow Labs! "] result_df = nlu.load('nl.explain.lg').predict(text) result_df ```
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:-------------------------------|:------------------------------|:------------------------------------------|:------------------------------------------|:--------------------------------------------|:-----------------------------|:------------------------------------------|:-----------------------------| | 0 | ['Hallo van John Snow Labs! '] | ['Hallo van John Snow Labs!'] | ['Hallo', 'van', 'John', 'Snow', 'Labs!'] | ['Hallo', 'van', 'John', 'Snow', 'Labs!'] | ['PROPN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[-0.245989993214607,.,...]] | ['B-PER', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['Hallo', 'John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_lg| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|nl| --- layout: model title: Legal Support Clause Binary Classifier author: John Snow Labs name: legclf_support_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `support` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings of this model allow up to 512 tokens; if your text is longer, consider splitting it into smaller pieces (the tutorial linked above also covers this). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `support` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_support_clause_en_1.0.0_3.2_1660123058175.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_support_clause_en_1.0.0_3.2_1660123058175.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_support_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
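The paragraph-splitting recommendation above (split large documents by multiline before classification) can be sketched in plain Python; the helper name, regex, and sample clauses are illustrative only, not part of Spark NLP:

```python
import re

def split_into_paragraphs(text: str) -> list[str]:
    """Split a document on blank lines, dropping whitespace-only fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Hypothetical contract text, for illustration only.
doc = """1. SUPPORT. Vendor shall provide support services to Customer.

2. TERMINATION. Either party may terminate this Agreement upon notice.

3. GOVERNING LAW. This Agreement is governed by the laws of Delaware."""

paragraphs = split_into_paragraphs(doc)
# Each element can then become one row of the "clause_text" column fed to the classifier.
```

Each resulting paragraph gives the classifier a full provision of context, rather than a single sentence.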
## Results ```bash +-------+ | result| +-------+ |[support]| |[other]| |[other]| |[support]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_support_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.97 0.99 0.98 120 support 0.97 0.89 0.93 35 accuracy - - 0.97 155 macro-avg 0.97 0.94 0.95 155 weighted-avg 0.97 0.97 0.97 155 ``` --- layout: model title: Legal Terminations Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_terminations_bert date: 2023-03-05 tags: [en, legal, classification, clauses, terminations, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Terminations` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings of this model allow up to 512 tokens; if your text is longer, consider splitting it into smaller pieces (the tutorial linked above also covers this). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Terminations`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_terminations_bert_en_1.0.0_3.0_1678050545103.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_terminations_bert_en_1.0.0_3.0_1678050545103.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_terminations_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
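Since the sentence embeddings accept at most 512 tokens, a simple pre-check can flag provisions that still need splitting; note that whitespace counting below is only a rough, illustrative proxy for the model's own WordPiece tokenizer:

```python
def needs_splitting(text: str, max_tokens: int = 512) -> bool:
    # Whitespace counting approximates the real tokenizer, which usually
    # produces MORE tokens, so leave some headroom in practice.
    return len(text.split()) > max_tokens

short_clause = "Either party may terminate this Agreement upon thirty days notice."
long_clause = "whereas " * 600  # 600 whitespace-separated tokens
flags = [needs_splitting(short_clause), needs_splitting(long_clause)]  # [False, True]
```

Provisions flagged `True` should be broken up (e.g. by paragraph) before being sent through the pipeline.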
## Results ```bash +-------+ |result| +-------+ |[Terminations]| |[Other]| |[Other]| |[Terminations]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_terminations_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.96 0.89 0.93 160 Terminations 0.88 0.95 0.91 129 accuracy - - 0.92 289 macro-avg 0.92 0.92 0.92 289 weighted-avg 0.92 0.92 0.92 289 ``` --- layout: model title: SNOMED Sentence Resolver (Spanish) author: John Snow Labs name: robertaresolve_snomed date: 2021-11-03 tags: [embeddings, es, snomed, entity_resolution, clinical, licensed] task: Entity Resolution language: es edition: Healthcare NLP 3.3.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps pre-detected NER chunks (from a `MedicalNerModel`, a `ChunkConverter` and a `Chunk2Doc`) to SNOMED terms and codes for the Spanish version of SNOMED. It requires Roberta Clinical Word Embeddings (`roberta_base_biomedical_es`) averaged with `SentenceEmbeddings`. ## Predicted Entities `SNOMED codes` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/robertaresolve_snomed_es_3.3.0_3.0_1635933551478.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/robertaresolve_snomed_es_3.3.0_3.0_1635933551478.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use any Spanish `MedicalNerModel` from our Models Hub that detects, for example, diagnoses. Then use a `NerConverter` (in case your model uses B-I-O notation).
Create documents using `Chunk2Doc`. Then use a `Tokenizer` to split the chunk, and finally use the `roberta_base_biomedical_es` Roberta Embeddings model and a `SentenceEmbeddings` annotator with an average pooling strategy, as in the example.
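The average pooling strategy simply takes the element-wise mean of the chunk's word vectors to produce one chunk embedding; a minimal stand-alone illustration with toy 3-dimensional vectors (real `roberta_base_biomedical_es` vectors are 768-dimensional):

```python
def average_pool(word_vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of word vectors -> one chunk embedding."""
    n, dim = len(word_vectors), len(word_vectors[0])
    return [sum(vec[i] for vec in word_vectors) / n for i in range(dim)]

# Toy vectors for a two-token chunk such as "diabetes mellitus".
chunk = [[0.2, 0.4, 0.0], [0.4, 0.0, 0.2]]
pooled = average_pool(chunk)  # approximately [0.3, 0.2, 0.1]
```

This is what the `SentenceEmbeddings` annotator computes when its pooling strategy is set to `AVERAGE`.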
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... c2doc = nlp.Chunk2Doc() \ .setInputCols("ner_chunk") \ .setOutputCol("sentence") chunk_tokenizer = nlp.Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") chunk_word_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner_chunk_word_embeddings") chunk_embeddings = nlp.SentenceEmbeddings() \ .setInputCols(["sentence", "ner_chunk_word_embeddings"]) \ .setOutputCol("ner_chunk_embeddings") \ .setPoolingStrategy("AVERAGE") er = medical.SentenceEntityResolverModel.pretrained("robertaresolve_snomed", "es", "clinical/models")\ .setInputCols(["sentence", "ner_chunk_embeddings"]) \ .setOutputCol("snomed_code") \ .setDistanceFunction("EUCLIDEAN") snomed_resolve_pipeline = Pipeline(stages = [ c2doc, chunk_tokenizer, chunk_word_embeddings, chunk_embeddings, er ]) empty = spark.createDataFrame([['']]).toDF("text") p_model = snomed_resolve_pipeline.fit(empty) test_sentence = "Mujer de 28 años con antecedentes de diabetes mellitus gestacional diagnosticada ocho años antes de la presentación y posterior diabetes mellitus tipo dos (DM2), un episodio previo de pancreatitis inducida por HTG tres años antes de la presentación, asociado con una hepatitis aguda, y obesidad con un índice de masa corporal (IMC) de 33,5 kg / m2, que se presentó con antecedentes de una semana de poliuria, polidipsia, falta de apetito y vómitos. Dos semanas antes de la presentación, fue tratada con un ciclo de cinco días de amoxicilina por una infección del tracto respiratorio. Estaba tomando metformina, glipizida y dapagliflozina para la DM2 y atorvastatina y gemfibrozil para la HTG. Había estado tomando dapagliflozina durante seis meses en el momento de la presentación. 
El examen físico al momento de la presentación fue significativo para la mucosa oral seca; significativamente, su examen abdominal fue benigno sin dolor a la palpación, protección o rigidez. Los hallazgos de laboratorio pertinentes al ingreso fueron: glucosa sérica 111 mg / dl, bicarbonato 18 mmol / l, anión gap 20, creatinina 0,4 mg / dl, triglicéridos 508 mg / dl, colesterol total 122 mg / dl, hemoglobina glucosilada (HbA1c) 10%. y pH venoso 7,27. La lipasa sérica fue normal a 43 U / L. Los niveles séricos de acetona no pudieron evaluarse ya que las muestras de sangre se mantuvieron hemolizadas debido a una lipemia significativa. La paciente ingresó inicialmente por cetosis por inanición, ya que refirió una ingesta oral deficiente durante los tres días previos a la admisión. Sin embargo, la química sérica obtenida seis horas después de la presentación reveló que su glucosa era de 186 mg / dL, la brecha aniónica todavía estaba elevada a 21, el bicarbonato sérico era de 16 mmol / L, el nivel de triglicéridos alcanzó un máximo de 2050 mg / dL y la lipasa fue de 52 U / L. Se obtuvo el nivel de β-hidroxibutirato y se encontró que estaba elevado a 5,29 mmol / L; la muestra original se centrifugó y la capa de quilomicrones se eliminó antes del análisis debido a la interferencia de la turbidez causada por la lipemia nuevamente. El paciente fue tratado con un goteo de insulina para euDKA y HTG con una reducción de la brecha aniónica a 13 y triglicéridos a 1400 mg / dL, dentro de las 24 horas. Se pensó que su euDKA fue precipitada por su infección del tracto respiratorio en el contexto del uso del inhibidor de SGLT2. La paciente fue atendida por el servicio de endocrinología y fue dada de alta con 40 unidades de insulina glargina por la noche, 12 unidades de insulina lispro con las comidas y metformina 1000 mg dos veces al día. Se determinó que todos los inhibidores de SGLT2 deben suspenderse indefinidamente. Tuvo un seguimiento estrecho con endocrinología post alta." 
result = p_model.transform(spark.createDataFrame(pd.DataFrame({'text': [test_sentence]}))) ``` ```scala ... val c2doc = new Chunk2Doc() .setInputCols(Array("ner_chunk")) .setOutputCol("sentence") val chunk_tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val chunk_word_embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es") .setInputCols(Array("sentence", "token")) .setOutputCol("ner_chunk_word_embeddings") val chunk_embeddings = new SentenceEmbeddings() .setInputCols(Array("sentence", "ner_chunk_word_embeddings")) .setOutputCol("ner_chunk_embeddings") .setPoolingStrategy("AVERAGE") val er = SentenceEntityResolverModel.pretrained("robertaresolve_snomed", "es", "clinical/models") .setInputCols(Array("sentence", "ner_chunk_embeddings")) .setOutputCol("snomed_code") .setDistanceFunction("EUCLIDEAN") val snomed_pipeline = new Pipeline().setStages(Array( c2doc, chunk_tokenizer, chunk_word_embeddings, chunk_embeddings, er)) val test_sentence = """Mujer de 28 años con antecedentes de diabetes mellitus gestacional diagnosticada ocho años antes de la presentación y posterior diabetes mellitus tipo dos (DM2), un episodio previo de pancreatitis inducida por HTG tres años antes de la presentación, asociado con una hepatitis aguda, y obesidad con un índice de masa corporal (IMC) de 33,5 kg / m2, que se presentó con antecedentes de una semana de poliuria, polidipsia, falta de apetito y vómitos. Dos semanas antes de la presentación, fue tratada con un ciclo de cinco días de amoxicilina por una infección del tracto respiratorio. Estaba tomando metformina, glipizida y dapagliflozina para la DM2 y atorvastatina y gemfibrozil para la HTG. Había estado tomando dapagliflozina durante seis meses en el momento de la presentación. El examen físico al momento de la presentación fue significativo para la mucosa oral seca; significativamente, su examen abdominal fue benigno sin dolor a la palpación, protección o rigidez. 
Los hallazgos de laboratorio pertinentes al ingreso fueron: glucosa sérica 111 mg / dl, bicarbonato 18 mmol / l, anión gap 20, creatinina 0,4 mg / dl, triglicéridos 508 mg / dl, colesterol total 122 mg / dl, hemoglobina glucosilada (HbA1c) 10%. y pH venoso 7,27. La lipasa sérica fue normal a 43 U / L. Los niveles séricos de acetona no pudieron evaluarse ya que las muestras de sangre se mantuvieron hemolizadas debido a una lipemia significativa. La paciente ingresó inicialmente por cetosis por inanición, ya que refirió una ingesta oral deficiente durante los tres días previos a la admisión. Sin embargo, la química sérica obtenida seis horas después de la presentación reveló que su glucosa era de 186 mg / dL, la brecha aniónica todavía estaba elevada a 21, el bicarbonato sérico era de 16 mmol / L, el nivel de triglicéridos alcanzó un máximo de 2050 mg / dL y la lipasa fue de 52 U / L. Se obtuvo el nivel de β-hidroxibutirato y se encontró que estaba elevado a 5,29 mmol / L; la muestra original se centrifugó y la capa de quilomicrones se eliminó antes del análisis debido a la interferencia de la turbidez causada por la lipemia nuevamente. El paciente fue tratado con un goteo de insulina para euDKA y HTG con una reducción de la brecha aniónica a 13 y triglicéridos a 1400 mg / dL, dentro de las 24 horas. Se pensó que su euDKA fue precipitada por su infección del tracto respiratorio en el contexto del uso del inhibidor de SGLT2. La paciente fue atendida por el servicio de endocrinología y fue dada de alta con 40 unidades de insulina glargina por la noche, 12 unidades de insulina lispro con las comidas y metformina 1000 mg dos veces al día. Se determinó que todos los inhibidores de SGLT2 deben suspenderse indefinidamente. Tuvo un seguimiento estrecho con endocrinología post alta."""
val data = Seq(test_sentence).toDF("text") val result = snomed_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.resolve.snomed").predict("""Mujer de 28 años con antecedentes de diabetes mellitus gestacional diagnosticada ocho años antes de la presentación y posterior diabetes mellitus tipo dos (DM2), un episodio previo de pancreatitis inducida por HTG tres años antes de la presentación, asociado con una hepatitis aguda, y obesidad con un índice de masa corporal (IMC) de 33,5 kg / m2, que se presentó con antecedentes de una semana de poliuria, polidipsia, falta de apetito y vómitos. Dos semanas antes de la presentación, fue tratada con un ciclo de cinco días de amoxicilina por una infección del tracto respiratorio. Estaba tomando metformina, glipizida y dapagliflozina para la DM2 y atorvastatina y gemfibrozil para la HTG. Había estado tomando dapagliflozina durante seis meses en el momento de la presentación. El examen físico al momento de la presentación fue significativo para la mucosa oral seca; significativamente, su examen abdominal fue benigno sin dolor a la palpación, protección o rigidez. Los hallazgos de laboratorio pertinentes al ingreso fueron: glucosa sérica 111 mg / dl, bicarbonato 18 mmol / l, anión gap 20, creatinina 0,4 mg / dl, triglicéridos 508 mg / dl, colesterol total 122 mg / dl, hemoglobina glucosilada (HbA1c) 10%. y pH venoso 7,27. La lipasa sérica fue normal a 43 U / L. Los niveles séricos de acetona no pudieron evaluarse ya que las muestras de sangre se mantuvieron hemolizadas debido a una lipemia significativa. La paciente ingresó inicialmente por cetosis por inanición, ya que refirió una ingesta oral deficiente durante los tres días previos a la admisión. 
Sin embargo, la química sérica obtenida seis horas después de la presentación reveló que su glucosa era de 186 mg / dL, la brecha aniónica todavía estaba elevada a 21, el bicarbonato sérico era de 16 mmol / L, el nivel de triglicéridos alcanzó un máximo de 2050 mg / dL y la lipasa fue de 52 U / L. Se obtuvo el nivel de β-hidroxibutirato y se encontró que estaba elevado a 5,29 mmol / L; la muestra original se centrifugó y la capa de quilomicrones se eliminó antes del análisis debido a la interferencia de la turbidez causada por la lipemia nuevamente. El paciente fue tratado con un goteo de insulina para euDKA y HTG con una reducción de la brecha aniónica a 13 y triglicéridos a 1400 mg / dL, dentro de las 24 horas. Se pensó que su euDKA fue precipitada por su infección del tracto respiratorio en el contexto del uso del inhibidor de SGLT2. La paciente fue atendida por el servicio de endocrinología y fue dada de alta con 40 unidades de insulina glargina por la noche, 12 unidades de insulina lispro con las comidas y metformina 1000 mg dos veces al día. Se determinó que todos los inhibidores de SGLT2 deben suspenderse indefinidamente. Tuvo un seguimiento estrecho con endocrinología post alta.""") ```
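The `AVERAGE` pooling strategy configured on `SentenceEmbeddings` above simply takes the element-wise mean of the per-token vectors in each chunk. A minimal plain-Python sketch (toy 4-dimensional vectors standing in for the real RoBERTa token embeddings, not Spark NLP API):

```python
def average_pool(token_vectors):
    """Element-wise mean of equal-length vectors, mirroring the
    AVERAGE pooling strategy of SentenceEmbeddings."""
    if not token_vectors:
        raise ValueError("no token vectors to pool")
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

# Toy per-token vectors for one NER chunk.
chunk_vectors = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
]
print(average_pool(chunk_vectors))  # [2.0, 2.0, 2.0, 2.0]
```

The resulting single vector per chunk is what the resolver compares against its SNOMED entry embeddings.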
## Results ```bash +----+-------------------------------+-------------+--------------+ | | ner_chunk | entity | snomed_code| |----+-------------------------------+-------------+--------------| | 0 | diabetes mellitus gestacional | DIAGNOSTICO | 11687002 | | 1 | diabetes mellitus tipo dos | DIAGNOSTICO | 44054006 | | 2 | pancreatitis | DIAGNOSTICO | 75694006 | | 3 | HTG | DIAGNOSTICO | 266569009 | | 4 | hepatitis aguda | DIAGNOSTICO | 37871000 | | 5 | obesidad | DIAGNOSTICO | 5476005 | | 6 | índice de masa corporal | DIAGNOSTICO | 162859006 | | 7 | poliuria | DIAGNOSTICO | 56574000 | | 8 | polidipsia | DIAGNOSTICO | 17173007 | | 9 | falta de apetito | DIAGNOSTICO | 49233005 | | 10 | vómitos | DIAGNOSTICO | 422400008 | | 11 | infección | DIAGNOSTICO | 40733004 | | 12 | HTG | DIAGNOSTICO | 266569009 | | 13 | dolor | DIAGNOSTICO | 22253000 | | 14 | rigidez | DIAGNOSTICO | 271587009 | | 15 | cetosis | DIAGNOSTICO | 2538008 | | 16 | infección | DIAGNOSTICO | 40733004 | +----+-------------------------------+-------------+--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|robertaresolve_snomed| |Compatibility:|Healthcare NLP 3.3.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk_doc, sentence_embeddings]| |Output Labels:|[snomed_code]| |Language:|es| |Case sensitive:|false| |Dependencies:|roberta_base_biomedical_es| --- layout: model title: Embeddings Healthcare author: John Snow Labs name: embeddings_healthcare class: WordEmbeddingsModel language: en nav_key: models repository: clinical/models date: 2020-03-26 task: Embeddings edition: Healthcare NLP 2.4.4 spark_version: 2.4 tags: [licensed,clinical,embeddings,en] supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Word Embeddings lookup annotator that maps tokens to vectors. 
{:.h2_title} ## Data Source Trained on PubMed + ICD10 + UMLS + MIMIC III corpora https://www.nlm.nih.gov/databases/download/pubmed_medline.html {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/embeddings_healthcare_en_2.4.4_2.4_1585188313964.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/embeddings_healthcare_en_2.4.4_2.4_1585188313964.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python model = WordEmbeddingsModel.pretrained("embeddings_healthcare","en","clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("word_embeddings") ``` ```scala val model = WordEmbeddingsModel.pretrained("embeddings_healthcare","en","clinical/models") .setInputCols("document","token") .setOutputCol("word_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.glove.healthcare").predict("""Put your text here.""") ```
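Conceptually, a word-embeddings lookup annotator is a table from token to a fixed-size vector, with a fallback for out-of-vocabulary tokens. A toy sketch with invented 3-dimensional values (the real `embeddings_healthcare` vectors are 400-dimensional and learned from the corpora above):

```python
# Hypothetical miniature lookup table; values are illustrative only.
EMBEDDINGS = {
    "diabetes":  [0.12, -0.40, 0.33],
    "mellitus":  [0.10, -0.38, 0.31],
    "metformin": [-0.25, 0.07, 0.44],
}
OOV = [0.0, 0.0, 0.0]  # out-of-vocabulary tokens map to a zero vector

def embed(tokens):
    """Map each token to its stored vector, falling back to OOV."""
    return [EMBEDDINGS.get(t.lower(), OOV) for t in tokens]

print(embed(["Diabetes", "xyzzy"])[1])  # [0.0, 0.0, 0.0] -- unseen token
```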
{:.h2_title} ## Results Word2Vec feature vectors based on ``embeddings_healthcare``. {:.model-param} ## Model Information {:.table-model} |---------------|-----------------------| | Name: | embeddings_healthcare | | Type: | WordEmbeddingsModel | | Compatibility: | Spark NLP 2.4.4+ | | License: | Licensed | | Edition: | Official | |Input labels: | [document, token] | |Output labels: | [word_embeddings] | | Language: | en | | Dimension: | 400.0 | --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from ThaisBeham) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_fira date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-fira` is an English model originally trained by `ThaisBeham`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_fira_en_4.3.0_3.0_1672767957708.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_fira_en_4.3.0_3.0_1672767957708.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_fira","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_fira","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
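Under the hood, an extractive QA model like this scores every context token as a possible answer start and end, then picks the best valid span. A toy sketch of that span selection with invented logits (the numbers, token list, and `max_len` parameter are illustrative, not real model outputs):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Return the (start, end) token pair maximizing the combined
    start+end score, with end >= start and a bounded span length."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 4.0, 0.0, 0.1, 0.1, 0.0, 0.3, 0.0]  # toy logits
end   = [0.0, 0.1, 0.2, 3.5, 0.1, 0.0, 0.0, 0.1, 0.2, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```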
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_fira| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/ThaisBeham/distilbert-base-uncased-finetuned-fira --- layout: model title: English Named Entity Recognition (from elastic) author: John Snow Labs name: distilbert_ner_distilbert_base_cased_finetuned_conll03_english date: 2022-05-16 tags: [distilbert, ner, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-finetuned-conll03-english` is an English model originally trained by `elastic`. ## Predicted Entities `ORG`, `MISC`, `PER`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_distilbert_base_cased_finetuned_conll03_english_en_3.4.2_3.0_1652721683253.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_distilbert_base_cased_finetuned_conll03_english_en_3.4.2_3.0_1652721683253.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_distilbert_base_cased_finetuned_conll03_english","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_distilbert_base_cased_finetuned_conll03_english","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
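The pipeline above leaves raw IOB tags in the `ner` column. Grouping those tags into entity chunks (the job Spark NLP's `NerConverter` annotator performs when appended to such a pipeline) can be sketched in plain Python; this is an illustrative simplification, not the actual implementation:

```python
def iob_to_chunks(tokens, tags):
    """Group IOB2 tags (B-XXX / I-XXX / O) into (text, label) chunks."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)  # continuation of the open chunk
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["I", "love", "Spark", "NLP"]
tags = ["O", "O", "B-ORG", "I-ORG"]
print(iob_to_chunks(tokens, tags))  # [('Spark NLP', 'ORG')]
```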
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_ner_distilbert_base_cased_finetuned_conll03_english| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/elastic/distilbert-base-cased-finetuned-conll03-english --- layout: model title: English DistilBertForQuestionAnswering model (from V3RX2000) author: John Snow Labs name: distilbert_qa_V3RX2000_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `V3RX2000`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_V3RX2000_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724840806.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_V3RX2000_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724840806.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_V3RX2000_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_V3RX2000_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_V3RX2000").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_V3RX2000_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/V3RX2000/distilbert-base-uncased-finetuned-squad --- layout: model title: English asr_wav2vec2_large_960h_lv60_self TFWav2Vec2ForCTC from facebook author: John Snow Labs name: asr_wav2vec2_large_960h_lv60_self date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_960h_lv60_self` is an English model originally trained by facebook. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_960h_lv60_self_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_960h_lv60_self_en_4.2.0_3.0_1664036965244.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_960h_lv60_self_en_4.2.0_3.0_1664036965244.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_960h_lv60_self", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_960h_lv60_self", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
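`Wav2Vec2ForCTC` produces per-frame character predictions that CTC decoding turns into text by merging consecutive repeats and dropping blank symbols. A simplified greedy-decoding sketch (the `_` blank symbol and frame sequence are toy stand-ins, not the model's real vocabulary):

```python
BLANK = "_"  # placeholder CTC blank symbol for this illustration

def ctc_collapse(frame_chars):
    """Greedy CTC decoding: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for ch in frame_chars:
        if ch != prev:
            if ch != BLANK:
                out.append(ch)
            prev = ch
    return "".join(out)

# Per-frame argmax characters for a short toy utterance.
frames = list("_hh_e_ll_llo__")
print(ctc_collapse(frames))  # hello
```

Note how the blank between the two `l` runs keeps "ll" from collapsing to a single "l".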
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_960h_lv60_self| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|757.3 MB| --- layout: model title: Explain Document Pipeline - CARP author: John Snow Labs name: explain_clinical_doc_carp date: 2020-08-19 task: [Named Entity Recognition, Assertion Status, Relation Extraction, Pipeline Healthcare] language: en nav_key: models edition: Healthcare NLP 2.5.5 spark_version: 2.4 tags: [pipeline, en, clinical, licensed] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pretrained pipeline with ``ner_clinical``, ``assertion_dl``, ``re_clinical`` and ``ner_posology``. It will extract clinical and medication entities, assign assertion status and find relationships between clinical entities. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.Pretrained_Clinical_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_carp_en_2.5.5_2.4_1597841630062.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_carp_en_2.5.5_2.4_1597841630062.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python carp_pipeline = PretrainedPipeline("explain_clinical_doc_carp","en","clinical/models") annotations = carp_pipeline.fullAnnotate("""A 28-year-old female with a history of gestational diabetes mellitus, used to take metformin 1000 mg two times a day, presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting. She was seen by the endocrinology service and discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals.""")[0] annotations.keys() ``` ```scala val carp_pipeline = new PretrainedPipeline("explain_clinical_doc_carp","en","clinical/models") val result = carp_pipeline.fullAnnotate("""A 28-year-old female with a history of gestational diabetes mellitus, used to take metformin 1000 mg two times a day, presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting. She was seen by the endocrinology service and discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals.""")(0) ``` {:.nlu-block} ```python import nlu nlu.load("en.explain_doc.carp").predict("""A 28-year-old female with a history of gestational diabetes mellitus, used to take metformin 1000 mg two times a day, presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting. She was seen by the endocrinology service and discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals.""") ```
{:.h2_title} ## Results This pretrained pipeline gives the result of `ner_clinical`, `re_clinical`, `ner_posology` and `assertion_dl` models. ```bash | | chunks | ner_clinical | assertion | posology_chunk | ner_posology | relations | |---|-------------------------------|--------------|-----------|------------------|--------------|-----------| | 0 | gestational diabetes mellitus | PROBLEM | present | metformin | Drug | TrAP | | 1 | metformin | TREATMENT | present | 1000 mg | Strength | TrCP | | 2 | polyuria | PROBLEM | present | two times a day | Frequency | TrCP | | 3 | polydipsia | PROBLEM | present | 40 units | Dosage | TrWP | | 4 | poor appetite | PROBLEM | present | insulin glargine | Drug | TrCP | | 5 | vomiting | PROBLEM | present | at night | Frequency | TrAP | | 6 | insulin glargine | TREATMENT | present | 12 units | Dosage | TrAP | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_clinical_doc_carp| |Type:|pipeline| |Compatibility:|Spark NLP 2.5.5| |License:|Licensed| |Edition:|Official| |Language:|[en]| {:.h2_title} ## Included Models - ner_clinical - assertion_dl - re_clinical - ner_posology --- layout: model title: Multilingual XLMRobertaForTokenClassification Base Cased model (from moghis) author: John Snow Labs name: xlmroberta_ner_base_finetuned_panx date: 2022-08-14 tags: [de, fr, open_source, xlm_roberta, ner, xx] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-fr-de` is a Multilingual model originally trained by `moghis`. 
## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_panx_xx_4.1.0_3.0_1660445027313.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_panx_xx_4.1.0_3.0_1660445027313.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_panx","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_panx","xx") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|858.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/moghis/xlm-roberta-base-finetuned-panx-fr-de --- layout: model title: Translate English to Germanic languages Pipeline author: John Snow Labs name: translate_en_gem date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, gem, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `gem` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_gem_xx_2.7.0_2.4_1609688259531.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_gem_xx_2.7.0_2.4_1609688259531.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_gem", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_gem", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.gem').predict(text, output_level='sentence') translate_df ```
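`PretrainedPipeline.annotate` returns a plain Python dict that maps output column names to lists of strings, so the translation can be post-processed without touching Spark. A minimal sketch (the `"translation"` key and the sample value are illustrative assumptions; inspect the dict returned by `pipeline.annotate(...)` for the actual keys):

```python
def join_translations(annotations: dict) -> str:
    """Concatenate the translated sentences into one string."""
    # "translation" is an assumed key name for illustration.
    return " ".join(annotations.get("translation", []))

# Hypothetical annotate() result, for illustration only:
sample = {"translation": ["Dein Satz zum Übersetzen!"]}
print(join_translations(sample))  # Dein Satz zum Übersetzen!
```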
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_gem| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Finnish BERT Embeddings (Base Cased) author: John Snow Labs name: bert_base_finnish_cased date: 2022-01-03 tags: [open_source, embeddings, fi, bert] task: Embeddings language: fi edition: Spark NLP 3.3.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A version of Google's BERT deep transfer learning model for Finnish. The model can be fine-tuned to achieve state-of-the-art results for various Finnish natural language processing tasks. FinBERT features a custom 50,000 wordpiece vocabulary that has much better coverage of Finnish words than e.g. the previously released multilingual BERT models from Google. FinBERT has been pre-trained for 1 million steps on over 3 billion tokens (24B characters) of Finnish text drawn from news, online discussion, and internet crawls. By contrast, Multilingual BERT was trained on Wikipedia texts, where the Finnish Wikipedia text is approximately 3% of the amount used to train FinBERT. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_finnish_cased_fi_3.3.4_2.4_1641223279447.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_finnish_cased_fi_3.3.4_2.4_1641223279447.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_base_finnish_cased", "fi") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") sample_data = spark.createDataFrame([['Syväoppiminen perustuu keinotekoisiin hermoihin, jotka muodostavat monikerroksisen neuroverkon.']], ["text"]) nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(sample_data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_base_finnish_cased", "fi") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("Syväoppiminen perustuu keinotekoisiin hermoihin, jotka muodostavat monikerroksisen neuroverkon.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fi.embed_sentence.bert.cased").predict("""Syväoppiminen perustuu keinotekoisiin hermoihin, jotka muodostavat monikerroksisen neuroverkon.""") ```
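Each row of the `embeddings` column carries one 768-dimensional vector per token (BERT-base). A common follow-up step is comparing two token vectors with cosine similarity; a plain-Python sketch (the short vectors are illustrative stand-ins for real FinBERT vectors):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Identical vectors score 1.0, orthogonal ones 0.0:
print(round(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]), 6))  # 1.0
```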
## Results ```bash +--------------------+---------------+ | embeddings| token| +--------------------+---------------+ |[0.53366333, -0.4...| Syväoppiminen| |[0.49171034, -1.1...| perustuu| |[-0.0017492473, -...| keinotekoisiin| |[0.61259747, -0.7...| hermoihin| |[-0.008151092, -0...| ,| |[-0.4050159, -0.2...| jotka| |[-0.69079936, 0.6...| muodostavat| |[-0.45641452, 0.4...|monikerroksisen| |[1.278124, -1.218...| neuroverkon| |[0.42451048, -1.2...| .| +--------------------+---------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_finnish_cased| |Compatibility:|Spark NLP 3.3.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[bert]| |Language:|fi| |Size:|464.2 MB| |Case sensitive:|true| --- layout: model title: Translate West Germanic languages to English Pipeline author: John Snow Labs name: translate_gmw_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, gmw, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `gmw` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_gmw_en_xx_2.7.0_2.4_1609685887934.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_gmw_en_xx_2.7.0_2.4_1609685887934.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_gmw_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_gmw_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.gmw.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_gmw_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_hier_quadruplet_epochs_1_shard_1_squad2.0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_hier_quadruplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_hier_quadruplet_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223448939.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_hier_quadruplet_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223448939.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_hier_quadruplet_epochs_1_shard_1_squad2.0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_hier_quadruplet_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
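Beyond batch `transform`, the same model can be wrapped in a `LightPipeline` to answer single questions interactively; its `annotate` output is a dict of lists. A plain-Python sketch of unpacking such a result (the dict shape and sample value are illustrative assumptions):

```python
def top_answer(annotations: dict) -> str:
    """Return the first predicted answer, or '' if none was produced."""
    # "answer" matches the output column set in the pipeline above.
    answers = annotations.get("answer", [])
    return answers[0] if answers else ""

# Hypothetical annotate() result, for illustration only:
sample = {"answer": ["Clara"]}
print(top_answer(sample))  # Clara
```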
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_hier_quadruplet_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|460.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_hier_quadruplet_epochs_1_shard_1_squad2.0 --- layout: model title: English BertForMaskedLM Large Cased model author: John Snow Labs name: bert_embeddings_large_cased date: 2022-12-02 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-cased` is an English model originally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_cased_en_4.2.4_3.0_1670020019196.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_cased_en_4.2.4_3.0_1670020019196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_cased","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_cased","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_large_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| ## References - https://huggingface.co/bert-large-cased - https://arxiv.org/abs/1810.04805 - https://github.com/google-research/bert - https://yknzhu.wixsite.com/mbweb - https://en.wikipedia.org/wiki/English_Wikipedia --- layout: model title: English BertForQuestionAnswering model (from bdickson) author: John Snow Labs name: bert_qa_bdickson_bert_base_uncased_finetuned_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-finetuned-squad` is an English model originally trained by `bdickson`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bdickson_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654181090164.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bdickson_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654181090164.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bdickson_bert_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bdickson_bert_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased.by_bdickson").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
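The nlu snippet above joins the question and its context into a single string with a `|||` separator. A small helper makes that input convention explicit (the separator follows the snippet; the helper name is ours):

```python
def nlu_qa_input(question: str, context: str) -> str:
    """Join question and context with the '|||' separator the nlu QA loads expect."""
    return f"{question}|||{context}"

print(nlu_qa_input("What's my name?", "My name is Clara and I live in Berkeley."))
# What's my name?|||My name is Clara and I live in Berkeley.
```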
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bdickson_bert_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/bdickson/bert-base-uncased-finetuned-squad --- layout: model title: COVID BERT Embeddings (Large Uncased) author: John Snow Labs name: covidbert_large_uncased date: 2020-08-27 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description BERT-large-uncased model, pretrained on a corpus of messages from Twitter about COVID-19. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/covidbert_large_uncased_en_2.6.0_2.4_1598484981419.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/covidbert_large_uncased_en_2.6.0_2.4_1598484981419.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("covidbert_large_uncased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("covidbert_large_uncased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.covidbert.large_uncased').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash en_embed_covidbert_large_uncased_embeddings token [-1.934066891670227, 0.620597779750824, 0.0967... I [-0.5530431866645813, 1.1948248147964478, -0.0... love [0.255395770072937, 0.5808677077293396, 0.3073... NLP ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|covidbert_large_uncased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|1024| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/digitalepidemiologylab/covid-twitter-bert/2 --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_bert_base_uncased_few_shot_k_512_finetuned_squad_seed_0 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-512-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_few_shot_k_512_finetuned_squad_seed_0_en_4.0.0_3.0_1654180865294.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_few_shot_k_512_finetuned_squad_seed_0_en_4.0.0_3.0_1654180865294.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_few_shot_k_512_finetuned_squad_seed_0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_few_shot_k_512_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased_512d_seed_0").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_few_shot_k_512_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-512-finetuned-squad-seed-0 --- layout: model title: Part of Speech for Persian author: John Snow Labs name: pos_ud_perdt date: 2020-11-30 task: Part of Speech Tagging language: fa edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [fa, pos] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_perdt_fa_2.7.0_2.4_1606724821106.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_perdt_fa_2.7.0_2.4_1606724821106.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_perdt", "fa") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate(["جان اسنو جدا از سلطنت شمال ، یک پزشک انگلیسی و رهبر توسعه بیهوشی و بهداشت پزشکی است."]) ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_perdt", "fa") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("جان اسنو جدا از سلطنت شمال ، یک پزشک انگلیسی و رهبر توسعه بیهوشی و بهداشت پزشکی است.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""جان اسنو جدا از سلطنت شمال ، یک پزشک انگلیسی و رهبر توسعه بیهوشی و بهداشت پزشکی است"""] pos_df = nlu.load('fa.pos').predict(text) pos_df ```
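Once the tagger has run, the per-token tags (as in the sample output below) are easy to aggregate downstream. A plain-Python sketch counting tag frequencies (the tag list here is an illustrative excerpt, not the full sentence):

```python
from collections import Counter

def tag_histogram(tags):
    """Frequency count of POS tags in a tagged sentence."""
    return Counter(tags)

tags = ["NOUN", "NOUN", "ADJ", "ADP", "NOUN", "PUNCT", "NOUN"]
print(tag_histogram(tags).most_common(1))  # [('NOUN', 4)]
```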
## Results ```bash {'pos': [Annotation(pos, 0, 2, NOUN, {'word': 'جان'}), Annotation(pos, 4, 7, NOUN, {'word': 'اسنو'}), Annotation(pos, 9, 11, ADJ, {'word': 'جدا'}), Annotation(pos, 13, 14, ADP, {'word': 'از'}), Annotation(pos, 16, 20, NOUN, {'word': 'سلطنت'}), Annotation(pos, 22, 25, NOUN, {'word': 'شمال'}), Annotation(pos, 27, 27, PUNCT, {'word': '،'}), Annotation(pos, 29, 30, NUM, {'word': 'یک'}), Annotation(pos, 32, 35, NOUN, {'word': 'پزشک'}), Annotation(pos, 37, 43, ADJ, {'word': 'انگلیسی'}), Annotation(pos, 45, 45, CCONJ, {'word': 'و'}), Annotation(pos, 47, 50, NOUN, {'word': 'رهبر'}), Annotation(pos, 52, 56, NOUN, {'word': 'توسعه'}), Annotation(pos, 58, 63, VERB, {'word': 'بیهوشی'}), Annotation(pos, 65, 65, CCONJ, {'word': 'و'}), Annotation(pos, 67, 72, NOUN, {'word': 'بهداشت'}), Annotation(pos, 74, 78, ADJ, {'word': 'پزشکی'}), Annotation(pos, 80, 82, AUX, {'word': 'است'}), Annotation(pos, 83, 83, PUNCT, {'word': '.'})]} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_perdt| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[tags, document]| |Output Labels:|[pos]| |Language:|fa| ## Data Source The model is trained on data obtained from [https://universaldependencies.org](https://universaldependencies.org) ## Benchmarking ```bash | | | precision | recall | f1-score | support | |---:|:-------------|:------------|:---------|-----------:|----------:| | 0 | ADJ | 0.88 | 0.88 | 0.88 | 1647 | | 1 | ADP | 0.99 | 0.99 | 0.99 | 3402 | | 2 | ADV | 0.94 | 0.91 | 0.92 | 383 | | 3 | AUX | 0.99 | 0.99 | 0.99 | 1000 | | 4 | CCONJ | 1.00 | 1.00 | 1 | 1022 | | 5 | DET | 0.94 | 0.96 | 0.95 | 490 | | 6 | INTJ | 0.88 | 0.81 | 0.85 | 27 | | 7 | NOUN | 0.95 | 0.96 | 0.95 | 8201 | | 8 | NUM | 0.94 | 0.97 | 0.96 | 293 | | 9 | None | 1.00 | 0.99 | 0.99 | 289 | | 10 | PART | 1.00 | 0.86 | 0.92 | 28 | | 11 | PRON | 0.98 | 0.97 | 0.98 | 1117 | | 12 | PROPN | 0.84 | 0.78 | 0.81 | 1107 | | 13 | PUNCT | 1.00 | 1.00 | 1 | 2134 | | 
14 | SCONJ | 0.98 | 0.98 | 0.98 | 630 | | 15 | VERB | 0.99 | 0.99 | 0.99 | 2581 | | 16 | accuracy | | | 0.96 | 24351 | | 17 | macro avg | 0.96 | 0.94 | 0.95 | 24351 | | 18 | weighted avg | 0.96 | 0.96 | 0.96 | 24351 | ``` --- layout: model title: Legal Rights Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_rights_agreement date: 2022-11-24 tags: [en, legal, classification, agreement, rights, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_rights_agreement` model is a Legal Longformer Document Classifier that classifies whether a document belongs to the class `rights-agreement` or not (binary classification). Longformers have a limit of 4096 tokens, so only the first 4096 tokens are taken into account. We have found that for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document without extra leading material, 4096 tokens are enough for document classification. If not, let us know and we can apply another approach: splitting the document into chunks of 4096 tokens, averaging the embeddings, and training on the averaged version, which means the whole document is taken into account. In practice this should rarely be required. ## Predicted Entities `rights-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_rights_agreement_en_1.0.0_3.0_1669294308500.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_rights_agreement_en_1.0.0_3.0_1669294308500.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_rights_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
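For documents longer than the 4096-token Longformer window, the description above suggests splitting the text into chunks, embedding each chunk, and averaging. The averaging step is just an element-wise mean over the chunk vectors; a minimal plain-Python sketch (the 2-d chunk vectors are illustrative):

```python
def average_embeddings(chunk_vectors):
    """Element-wise mean across per-chunk embedding vectors."""
    n = len(chunk_vectors)
    return [sum(vals) / n for vals in zip(*chunk_vectors)]

chunks = [[1.0, 2.0], [3.0, 4.0]]
print(average_embeddings(chunks))  # [2.0, 3.0]
```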
## Results ```bash +------------------+ |result| +------------------+ |[rights-agreement]| |[other]| |[other]| |[rights-agreement]| +------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_rights_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.4 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.99 0.98 0.98 90 rights-agreement 0.94 0.97 0.95 30 accuracy - - 0.97 120 macro-avg 0.96 0.97 0.97 120 weighted-avg 0.98 0.97 0.98 120 ``` --- layout: model title: Translate Korean to English Pipeline author: John Snow Labs name: translate_ko_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ko, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `ko` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ko_en_xx_2.7.0_2.4_1609688668059.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ko_en_xx_2.7.0_2.4_1609688668059.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ko_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ko_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ko.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ko_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Vietnamese Deberta Embeddings model (from hieule) author: John Snow Labs name: deberta_embeddings_spm_vie date: 2023-03-12 tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, vie, tensorflow] task: Embeddings language: vie edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DeBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spm-vie-deberta` is a Vietnamese model originally trained by `hieule`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_spm_vie_vie_4.3.1_3.0_1678627522214.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_spm_vie_vie_4.3.1_3.0_1678627522214.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_spm_vie","vie") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_spm_vie","vie") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|deberta_embeddings_spm_vie| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|vie| |Size:|290.3 MB| |Case sensitive:|false| ## References https://huggingface.co/hieule/spm-vie-deberta --- layout: model title: English asr_wav2vec2_large_xls_r_300m_hyAM_batch4 TFWav2Vec2ForCTC from lilitket author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_hyAM_batch4` is an English model originally trained by lilitket. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_en_4.2.0_3.0_1664119468999.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_en_4.2.0_3.0_1664119468999.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4", lang = "en") val annotations = pipeline.transform(audioDF) ```
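The `audioDF` above is assumed to already hold raw audio as arrays of floats; Wav2Vec2 models typically expect 16 kHz mono audio normalized to [-1.0, 1.0]. As a minimal, illustrative sketch (plain Python standard library, not a Spark NLP API; the helper name is hypothetical), a 16-bit mono WAV file could be decoded into such floats like this:

```python
import struct
import wave

def read_wav_floats(path):
    """Decode a 16-bit mono WAV file into a list of floats in [-1.0, 1.0].

    Illustrative helper: the resulting list is the kind of float array an
    ASR pipeline consumes (e.g. loaded into an audio column of a DataFrame).
    """
    with wave.open(path, "rb") as w:
        n = w.getnframes()
        # 16-bit PCM: little-endian signed shorts, one per frame (mono)
        samples = struct.unpack("<%dh" % n, w.readframes(n))
    return [s / 32768.0 for s in samples]
```

Each decoded file would then become one row of the DataFrame passed to `pipeline.transform`.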
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal Second Supplemental Indenture Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_second_supplemental_indenture_bert date: 2023-02-02 tags: [en, legal, classification, second, supplemental, indenture, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_second_supplemental_indenture_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `second-supplemental-indenture` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `second-supplemental-indenture`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_second_supplemental_indenture_bert_en_1.0.0_3.0_1675359737104.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_second_supplemental_indenture_bert_en_1.0.0_3.0_1675359737104.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_second_supplemental_indenture_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[second-supplemental-indenture]| |[other]| |[other]| |[second-supplemental-indenture]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_second_supplemental_indenture_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.95 0.97 0.96 73 second-supplemental-indenture 0.95 0.90 0.92 39 accuracy - - 0.95 112 macro-avg 0.95 0.94 0.94 112 weighted-avg 0.95 0.95 0.95 112 ``` --- layout: model title: Legal Material contracts Clause Binary Classifier author: John Snow Labs name: legclf_material_contracts_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `material-contracts` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. 
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `material-contracts` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_material_contracts_clause_en_1.0.0_3.2_1660122646574.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_material_contracts_clause_en_1.0.0_3.2_1660122646574.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_material_contracts_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
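The paragraph splitting recommended above can be done before the pipeline with plain Python; a minimal sketch of splitting by multiline (the helper is hypothetical, not a Spark NLP API), so each piece stays within the 512-token embedding limit:

```python
import re

def split_paragraphs(text):
    """Split a long legal document into paragraphs on blank lines
    (multiline splitting), so each piece can be classified separately."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

contract = (
    "10. TERMINATION.\nEither party may terminate this Agreement.\n\n"
    "11. GOVERNING LAW.\nThis Agreement shall be governed by New York law."
)
clauses = split_paragraphs(contract)
# each clause would then become one row of the `clause_text` column above
```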
## Results ```bash +-------+ | result| +-------+ |[material-contracts]| |[other]| |[other]| |[material-contracts]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_material_contracts_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support material-contracts 0.85 0.79 0.82 29 other 0.94 0.96 0.95 93 accuracy - - 0.92 122 macro-avg 0.89 0.88 0.88 122 weighted-avg 0.92 0.92 0.92 122 ``` --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_2 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-64-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_2_en_4.0.0_3.0_1657185416907.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_2_en_4.0.0_3.0_1657185416907.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-64-finetuned-squad-seed-2 --- layout: model title: Word2Vec Embeddings in Sanskrit (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, sa, open_source] task: Embeddings language: sa edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sa_3.4.1_3.0_1647455309990.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sa_3.4.1_3.0_1647455309990.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sa") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sa") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("sa.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
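The 300-dimensional vectors written to the `embeddings` column can be compared with cosine similarity, e.g. to find related Sanskrit tokens. A minimal sketch in plain Python (independent of Spark NLP; the helper name is illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal vectors)
```

Tokens with similar meanings tend to score close to 1.0 under this measure.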
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|sa| |Size:|288.9 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Bank Complaints Classification author: John Snow Labs name: finclf_bank_complaints date: 2022-08-09 tags: [en, finance, bank, classification, licensed] task: Text Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model classifies Bank-related texts into 7 different categories, and can be used to automatically process incoming emails to customer support channels and forward them to the proper recipients. ## Predicted Entities `Accounts`, `Credit Cards`, `Credit Reporting`, `Debt Collection`, `Loans`, `Money Transfer and Currency`, `Mortgage` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_bank_complaints_en_1.0.0_3.2_1660035048303.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_bank_complaints_en_1.0.0_3.2_1660035048303.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") embeddings = nlp.UniversalSentenceEncoder.pretrained() \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") classifier_dl = nlp.ClassifierDLModel.pretrained("finclf_bank_complaints", "en", "finance/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("label") clf_pipeline = nlp.Pipeline( stages = [ document_assembler, embeddings, classifier_dl ]) light_pipeline = LightPipeline(clf_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) result = light_pipeline.annotate("""Over the course of 30 days I have filed a dispute in regards to inaccurate and false information on my credit report. Ive attached a copy of my dispute mailed in certified to Equifax and they are still reporting these incorrect items. According to the fair credit ACT, section 609 ( a ) ( 1 ) ( A ) they are required by Federal Law to only report Accurate information and the have not done so. They have not provided me with any proof i.e. and original consumer contract with my signature on it proving that this is my account.Further more, I would like to make a formal complaint that Ive tried calling Equifax Over 10 times this week and every single time Ive called Ive asked for a representative in the fraud dispute department wants transfer it over there when you speak to the representative and let them know that you are looking to dispute inquiries and accounts due to fraud they immediately transfer you to their survey line essentially ending the call. I believe Equifax is training their representatives to not help consumers over the phone and performing unethical practices. Once I finally got a hold of a representative she told me that she could not help because I did not send in my Social Security card which violates my consumer rights. 
So Im Making a formal CFPB complaint that you will correct Equifaxs actions. Below Ive written what is also included in the files uploaded, my disputes for inaccuracies on my credit report.""") result['label'] ```
## Results ```bash ['Credit Reporting'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_bank_complaints| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References https://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data ## Benchmarking ```bash label precision recall f1-score support Accounts 0.77 0.73 0.75 490 Credit_Cards 0.75 0.68 0.72 461 Credit_Reporting 0.73 0.81 0.76 488 Debt_Collection 0.72 0.72 0.72 459 Loans 0.78 0.78 0.78 472 Money_Transfer_and_Currency 0.82 0.84 0.83 482 Mortgage 0.87 0.87 0.87 488 accuracy - - 0.78 3340 macro-avg 0.78 0.78 0.78 3340 weighted-avg 0.78 0.78 0.78 3340 ``` --- layout: model title: Spanish BertForQuestionAnswering model (from MMG) author: John Snow Labs name: bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad2_es_MMG date: 2022-06-03 tags: [es, open_source, question_answering, bert] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-cased-finetuned-sqac-finetuned-squad2-es` is a Spanish model originally trained by `MMG`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad2_es_MMG_es_4.0.0_3.0_1654249736195.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad2_es_MMG_es_4.0.0_3.0_1654249736195.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad2_es_MMG","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad2_es_MMG","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.squadv2_sqac.bert.base_cased.by_MMG").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad2_es_MMG| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|410.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/MMG/bert-base-spanish-wwm-cased-finetuned-sqac-finetuned-squad2-es --- layout: model title: SDOH Substance Usage For Binary Classification author: John Snow Labs name: genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli date: 2023-01-14 tags: [en, licensed, generic_classifier, sdoh, substance, clinical] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true recommended: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Generic Classifier model is intended for detecting substance use in clinical notes and is trained using the GenericClassifierApproach annotator. `Present:` if the patient was a current consumer of substance, or was a consumer in the past and had quit. `None:` if the patient had never consumed substance, or if there was no related text. ## Predicted Entities `Present`, `None` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli_en_4.2.4_3.0_1673697973649.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli_en_4.2.4_3.0_1673697973649.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") features_asm = FeaturesAssembler()\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("features") generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli", 'en', 'clinical/models')\ .setInputCols(["features"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, sentence_embeddings, features_asm, generic_classifier ]) text_list = ["Lives in apartment with 16-year-old daughter. Denies EtOH use currently although reports occasional use in past. Utox on admission positive for opiate (on as rx) as well as cocaine. 4-6 cigarettes a day on and off for 10 years. Denies h/o illicit drug use besides marijuana although admitted to cocaine use after being found to have urine positive for cocaine.", "The patient quit smoking approximately two years ago with an approximately a 40 pack year history, mostly cigar use. 
He also reports 'heavy alcohol use', quit 15 months ago."] df = spark.createDataFrame(text_list, StringType()).toDF("text") result = pipeline.fit(df).transform(df) result.select("text", "class.result").show(truncate=100) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence_embeddings") val features_asm = new FeaturesAssembler() .setInputCols("sentence_embeddings") .setOutputCol("features") val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli", "en", "clinical/models") .setInputCols("features") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_embeddings, features_asm, generic_classifier)) val data = Seq("The patient quit smoking approximately two years ago with an approximately a 40 pack year history, mostly cigar use. He also reports 'heavy alcohol use', quit 15 months ago.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.generic.sdoh_substance_binary_sbiobert_cased").predict("""Lives in apartment with 16-year-old daughter. Denies EtOH use currently although reports occasional use in past. Utox on admission positive for opiate (on as rx) as well as cocaine. 4-6 cigarettes a day on and off for 10 years. Denies h/o illicit drug use besides marijuana although admitted to cocaine use after being found to have urine positive for cocaine.""") ```
## Results ```bash +----------------------------------------------------------------------------------------------------+---------+ | text| result| +----------------------------------------------------------------------------------------------------+---------+ |Lives in apartment with 16-year-old daughter. Denies EtOH use currently although reports occasion...|[Present]| |The patient quit smoking approximately two years ago with an approximately a 40 pack year history...| [None]| +----------------------------------------------------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[features]| |Output Labels:|[prediction]| |Language:|en| |Size:|3.4 MB| ## Benchmarking ```bash label precision recall f1-score support None 0.91 0.83 0.87 898 Present 0.76 0.87 0.81 540 accuracy - - 0.85 1438 macro-avg 0.83 0.85 0.84 1438 weighted-avg 0.85 0.85 0.85 1438 ``` --- layout: model title: Legal Multilabel Classifier on Covid-19 Exceptions (Italian) author: John Snow Labs name: legmulticlf_covid19_exceptions_italian date: 2023-04-20 tags: [it, licensed, legal, multilabel, classification, tensorflow] task: Text Classification language: it edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: MultiClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is the Multi-Label Text Classification model that can be used to identify up to 5 classes to facilitate analysis, discovery, and comparison of legal texts in Italian related to COVID-19 exception measures. 
The classes are as follows: - Closures/lockdown - Government_oversight - Restrictions_of_daily_liberties - Restrictions_of_fundamental_rights_and_civil_liberties - State_of_Emergency ## Predicted Entities `Closures/lockdown`, `Government_oversight`, `Restrictions_of_daily_liberties`, `Restrictions_of_fundamental_rights_and_civil_liberties`, `State_of_Emergency` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legmulticlf_covid19_exceptions_italian_it_1.0.0_3.0_1681985472330.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legmulticlf_covid19_exceptions_italian_it_1.0.0_3.0_1681985472330.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"]) \ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_bert_base_italian_xxl_cased", "it") \ .setInputCols(["document", "token"])\ .setOutputCol("embeddings") embeddingsSentence = nlp.SentenceEmbeddings() \ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") multilabelClfModel = nlp.MultiClassifierDLModel.pretrained('legmulticlf_covid19_exceptions_italian', 'it', "legal/models") \ .setInputCols(["sentence_embeddings"])\ .setOutputCol("class") clf_pipeline = nlp.Pipeline( stages=[document_assembler, tokenizer, embeddings, embeddingsSentence, multilabelClfModel]) df = spark.createDataFrame([["Al di fuori di tale ultima ipotesi, secondo le raccomandazioni impartite dal Ministero della salute, occorre provvedere ad assicurare la corretta applicazione di misure preventive quali lavare frequentemente le mani con acqua e detergenti comuni."]]).toDF("text") model = clf_pipeline.fit(df) result = model.transform(df) result.select("text", "class.result").show(truncate=False) ```
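Because this is a multi-label model, each of the five classes is decided independently: a label is emitted whenever its score passes a per-class threshold (typically 0.5 for MultiClassifierDL), so a single text can receive zero, one, or several labels. A minimal sketch of that decision rule in plain Python (the helper and scores are illustrative, not the model's API):

```python
def labels_above_threshold(scores, labels, threshold=0.5):
    """Return every label whose score meets the threshold -- multi-label
    decisions are independent per class, unlike softmax classification."""
    return [label for label, score in zip(labels, scores) if score >= threshold]

labels = ["Closures/lockdown", "Government_oversight", "Restrictions_of_daily_liberties",
          "Restrictions_of_fundamental_rights_and_civil_liberties", "State_of_Emergency"]
scores = [0.12, 0.03, 0.91, 0.48, 0.07]  # hypothetical per-class scores
print(labels_above_threshold(scores, labels))  # ['Restrictions_of_daily_liberties']
```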
## Results ```bash +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+ |text |result | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+ |Al di fuori di tale ultima ipotesi, secondo le raccomandazioni impartite dal Ministero della salute, occorre provvedere ad assicurare la corretta applicazione di misure preventive quali lavare frequentemente le mani con acqua e detergenti comuni.|[Restrictions_of_daily_liberties]| +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legmulticlf_covid19_exceptions_italian| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|it| |Size:|13.9 MB| ## References Train dataset available [here](https://huggingface.co/datasets/joelito/covid19_emergency_event) ## Benchmarking ```bash label precision recall f1-score support Closures/lockdown 0.88 0.94 0.91 47 Government_oversight 1.00 0.50 0.67 4 Restrictions_of_daily_liberties 0.88 0.79 0.83 28 Restrictions_of_fundamental_rights_and_civil_liberties 0.62 0.62 0.62 16 State_of_Emergency 0.67 1.00 0.80 6 micro-avg 0.82 0.83 0.83 101 macro-avg 0.81 0.77 0.77 101 weighted-avg 0.83 0.83 0.83 101 samples-avg 0.81 0.84 0.81 101 ``` --- layout: 
model title: English image_classifier_vit_pond_image_classification_12 ViTForImageClassification from SummerChiam author: John Snow Labs name: image_classifier_vit_pond_image_classification_12 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_pond_image_classification_12` is an English model originally trained by SummerChiam. ## Predicted Entities `NormalCement0`, `Boiling0`, `NormalNight0`, `Algae0`, `BoilingNight0`, `NormalRain0`, `Normal0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_12_en_4.1.0_3.0_1660171317776.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_12_en_4.1.0_3.0_1660171317776.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_pond_image_classification_12", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_pond_image_classification_12", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_pond_image_classification_12| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English asr_hausa_4_wa2vec_data_aug_xls_r_300m TFWav2Vec2ForCTC from Tiamz author: John Snow Labs name: pipeline_asr_hausa_4_wa2vec_data_aug_xls_r_300m date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_hausa_4_wa2vec_data_aug_xls_r_300m` is an English model originally trained by Tiamz. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_hausa_4_wa2vec_data_aug_xls_r_300m_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_hausa_4_wa2vec_data_aug_xls_r_300m_en_4.2.0_3.0_1664108237641.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_hausa_4_wa2vec_data_aug_xls_r_300m_en_4.2.0_3.0_1664108237641.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_hausa_4_wa2vec_data_aug_xls_r_300m', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_hausa_4_wa2vec_data_aug_xls_r_300m", lang = "en") val annotations = pipeline.transform(audioDF) ```
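The pipeline expects a DataFrame with a column of audio samples as floats (wav2vec2-style models are typically trained on 16 kHz mono audio). How you build `audioDF` is up to you; one minimal sketch using only the Python standard library, assuming a 16-bit mono WAV file (the filename is a placeholder):

```python
import struct
import wave

def load_wav_as_floats(path):
    """Read a 16-bit mono WAV file into a list of floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as w:
        # Only handle the simple case: 1 channel, 2 bytes per sample.
        assert w.getnchannels() == 1 and w.getsampwidth() == 2
        frames = w.readframes(w.getnframes())
    # '<h' = little-endian signed 16-bit, the PCM sample format.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# floats = load_wav_as_floats("sample.wav")          # placeholder filename
# audioDF = spark.createDataFrame([[floats]], ["audio_content"])
```

For production workloads a dedicated audio library is preferable, but the pipeline only needs the float array and does not care how it was produced.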
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_hausa_4_wa2vec_data_aug_xls_r_300m| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal Warranty NER (md) author: John Snow Labs name: legner_warranty_md date: 2022-12-01 tags: [warranty, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description IMPORTANT: Don't run this model on the whole legal agreement. Instead: - Split by paragraphs. You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration; - Use the `legclf_warranty_clause` Text Classifier to select only those paragraphs. This is a Legal Named Entity Recognition Model to identify the Subject (who), Action (what), Object (the warranty) and Indirect Object (to whom) from Warranty clauses. This is the `md` (medium) version of the classifier, trained with more data and more resistant to false positives outside the specific section, which may help when running it at the whole-document level (although that is not recommended). ## Predicted Entities `WARRANTY`, `WARRANTY_ACTION`, `WARRANTY_SUBJECT`, `WARRANTY_INDIRECT_OBJECT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_warranty_md_en_1.0.0_3.0_1669893390077.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_warranty_md_en_1.0.0_3.0_1669893390077.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained('legner_warranty_md', 'en', 'legal/models')\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[documentAssembler,sentenceDetector,tokenizer,embeddings,ner_model,ner_converter]) data = spark.createDataFrame([["""8 . Representations and Warranties SONY hereby makes the following representations and warranties to PURCHASER , each of which shall be true and correct as of the date hereof and as of the Closing Date , and shall be unaffected by any investigation heretofore or hereafter made : 8.1 Power and Authority SONY has the right and power to enter into this IP Agreement and to transfer the Transferred Patents and to grant the license set forth in Section 3.1 ."""]]).toDF("text") result = nlpPipeline.fit(data).transform(data) ```
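The paragraph-level workflow recommended in the description (split the agreement, keep only the paragraphs `legclf_warranty_clause` flags as warranty clauses, then run this NER model on those) needs a splitting step up front. A minimal sketch, assuming paragraphs are separated by blank lines:

```python
import re

def split_paragraphs(agreement_text):
    """Split a document into paragraphs on one or more blank lines."""
    paragraphs = re.split(r"\n\s*\n", agreement_text)
    return [p.strip() for p in paragraphs if p.strip()]

doc = "8. Representations and Warranties ...\n\n8.1 Power and Authority ..."
print(split_paragraphs(doc))
# ['8. Representations and Warranties ...', '8.1 Power and Authority ...']
```

Each resulting paragraph then becomes one row of the `text` column fed to the classifier and, for the selected rows, to this pipeline.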
## Results ```bash +--------------------------------------------------------------------------+------------------------+ |chunk |entity | +--------------------------------------------------------------------------+------------------------+ |SONY |WARRANTY_SUBJECT | |makes the following representations and warranties |WARRANTY_ACTION | |PURCHASER |WARRANTY_INDIRECT_OBJECT| |shall be true and correct as of the date hereof and as of the Closing Date|WARRANTY | |shall be unaffected by any investigation |WARRANTY | |SONY |WARRANTY_SUBJECT | |has the right and power to enter into this IP Agreement |WARRANTY | +--------------------------------------------------------------------------+------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_warranty_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.1 MB| ## References In-house annotated examples from CUAD legal dataset ## Benchmarking ```bash label tp fp fn prec rec f1 I-WARRANTY_SUBJECT 23 9 19 0.71875 0.54761904 0.62162155 B-WARRANTY 111 36 34 0.75510204 0.76551723 0.760274 B-WARRANTY_SUBJECT 55 31 33 0.6395349 0.625 0.6321839 I-WARRANTY_INDIRECT_OBJECT 18 6 3 0.75 0.85714287 0.79999995 I-WARRANTY_ACTION 77 8 14 0.90588236 0.84615386 0.875 B-WARRANTY_ACTION 36 4 4 0.9 0.9 0.9 I-WARRANTY 1686 487 313 0.7758859 0.8434217 0.8082455 B-WARRANTY_INDIRECT_OBJECT 34 12 6 0.73913044 0.85 0.79069775 Macro-average 2040 593 426 0.7730357 0.7793569 0.7761834 Micro-average 2040 593 426 0.77478164 0.8272506 0.80015695 ``` --- layout: model title: English BertForQuestionAnswering Base Cased model (from niklaspm) author: John Snow Labs name: bert_qa_linkbert_base_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true 
annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `linkbert-base-finetuned-squad` is an English model originally trained by `niklaspm`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_linkbert_base_finetuned_squad_en_4.0.0_3.0_1657189758932.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_linkbert_base_finetuned_squad_en_4.0.0_3.0_1657189758932.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_linkbert_base_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_linkbert_base_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq("What is my name?", "My name is Clara and I live in Berkeley.").toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
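For context, extractive QA heads such as this one emit a start score and an end score for every context token, and the returned answer is the highest-scoring span with start ≤ end. Spark NLP performs this decoding internally; a minimal sketch of the idea in plain Python, with made-up scores for the example sentence above:

```python
def best_span(start_scores, end_scores, max_len=30):
    """Pick the (start, end) pair maximizing start+end score, start <= end."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        # Only consider spans up to max_len tokens long.
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score, best = s + end_scores[j], (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]  # illustrative
end   = [0.1, 0.1, 0.2, 4.0, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]  # illustrative
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```

Real implementations also mask out spans that fall in the question portion of the input; this sketch skips that detail.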
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_linkbert_base_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/niklaspm/linkbert-base-finetuned-squad - https://arxiv.org/abs/2203.15827 --- layout: model title: Translate English to Tok Pisin Pipeline author: John Snow Labs name: translate_en_tpi date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, tpi, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `tpi` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_tpi_xx_2.7.0_2.4_1609686751040.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_tpi_xx_2.7.0_2.4_1609686751040.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_tpi", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_tpi", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.tpi').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_tpi| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Longformer (Base, 4096) author: John Snow Labs name: legal_longformer_base date: 2022-10-20 tags: [en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.1 spark_version: [3.2, 3.0] supported: true annotator: LongformerEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description [Longformer](https://arxiv.org/abs/2004.05150) is a transformer model for long documents. `legal_longformer_base` is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. It supports sequences of length up to 4,096 and is specifically trained on *legal documents*. Longformer uses a combination of a sliding window (local) attention and global attention. Global attention is user-configured based on the task to allow the model to learn task-specific representations. If you use `Longformer` in your research, please cite [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150). ``` @article{Beltagy2020Longformer, title={Longformer: The Long-Document Transformer}, author={Iz Beltagy and Matthew E. Peters and Arman Cohan}, journal={arXiv:2004.05150}, year={2020}, } ``` `Longformer` is an open-source project developed by [the Allen Institute for Artificial Intelligence (AI2)](http://www.allenai.org). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering. 
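The sliding-window-plus-global attention pattern described above is easy to visualize: each token attends to a fixed window of neighbors, while user-designated global tokens attend to, and are attended by, every position. A minimal sketch of the resulting boolean attention mask (a simplification of the real implementation, which uses separate projections for local and global attention):

```python
def longformer_mask(seq_len, window, global_ids):
    """Boolean mask: mask[i][j] is True iff token i may attend to token j."""
    g = set(global_ids)
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            local = abs(i - j) <= window      # sliding-window (local) attention
            is_global = i in g or j in g      # global tokens see, and are seen by, all
            mask[i][j] = local or is_global
    return mask

# 8 tokens, a window of 1 on each side, first token made global:
m = longformer_mask(8, 1, global_ids=[0])
```

The attended-pair count grows linearly with sequence length (roughly `seq_len * (2 * window + 1)` plus the global rows/columns) rather than quadratically, which is what makes 4,096-token inputs tractable.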
## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/legal_longformer_base_en_4.2.1_3.2_1666282710556.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/legal_longformer_base_en_4.2.1_3.2_1666282710556.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = LongformerEmbeddings\ .pretrained("legal_longformer_base", "en")\ .setInputCols(["document", "token"])\ .setOutputCol("embeddings")\ .setCaseSensitive(True)\ .setMaxSentenceLength(4096) ``` ```scala val embeddings = LongformerEmbeddings.pretrained("legal_longformer_base", "en") .setInputCols("document", "token") .setOutputCol("embeddings") .setCaseSensitive(true) .setMaxSentenceLength(4096) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legal_longformer_base| |Compatibility:|Spark NLP 4.2.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|531.1 MB| |Case sensitive:|true| |Max sentence length:|4096| ## References https://huggingface.co/saibo/legal-longformer-base-4096 --- layout: model title: Pipeline to Detect Genetic Cancer Entities author: John Snow Labs name: ner_cancer_genetics_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_cancer_genetics](https://nlp.johnsnowlabs.com/2021/03/31/ner_cancer_genetics_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_cancer_genetics_pipeline_en_4.3.0_3.2_1678864026558.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_cancer_genetics_pipeline_en_4.3.0_3.2_1678864026558.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_cancer_genetics_pipeline", "en", "clinical/models") text = '''The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_cancer_genetics_pipeline", "en", "clinical/models") val text = "The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. 
Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.cancer_genetics.pipeline").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:------------------------------------------------------------------------|--------:|------:|:------------|-------------:| | 0 | human KCNJ9 | 4 | 14 | protein | 0.674 | | 1 | Kir 3.3 | 17 | 23 | protein | 0.95355 | | 2 | GIRK3 | 26 | 30 | protein | 0.5127 | | 3 | G-protein-activated inwardly rectifying potassium (GIRK) channel family | 52 | 122 | protein | 0.691744 | | 4 | KCNJ9 locus | 173 | 183 | DNA | 0.97875 | | 5 | chromosome 1q21-23 | 188 | 205 | DNA | 0.95305 | | 6 | coding exons | 357 | 368 | DNA | 0.63345 | | 7 | identified14 single nucleotide polymorphisms | 451 | 494 | DNA | 0.6994 | | 8 | SNPs), | 497 | 502 | DNA | 0.79075 | | 9 | KCNJ9 gene | 801 | 810 | DNA | 0.95605 | | 10 | KCNJ9 protein | 868 | 880 | protein | 0.844 | | 11 | locus | 931 | 935 | DNA | 0.9685 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_cancer_genetics_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_12_h_256 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-12_H-256` is a Chinese model originally trained by `uer`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_256_zh_4.2.4_3.0_1670021539975.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_256_zh_4.2.4_3.0_1670021539975.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_256","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_256","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_12_h_256| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|57.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-12_H-256 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Mapping RXNORM Codes with Their Corresponding UMLS Codes author: John Snow Labs name: rxnorm_umls_mapper date: 2022-06-26 tags: [rxnorm, umls, chunk_mapper, clinical, licensed, en] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps RXNORM codes to corresponding UMLS codes. ## Predicted Entities `umls_code` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_umls_mapper_en_3.5.3_3.0_1656276292081.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_umls_mapper_en_3.5.3_3.0_1656276292081.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli", "en","clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") rxnorm_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models")\ .setInputCols(["ner_chunk", "sbert_embeddings"])\ .setOutputCol("rxnorm_code")\ .setDistanceFunction("EUCLIDEAN") chunkerMapper = ChunkMapperModel\ .pretrained("rxnorm_umls_mapper", "en", "clinical/models")\ .setInputCols(["rxnorm_code"])\ .setOutputCol("umls_mappings")\ .setRels(["umls_code"]) pipeline = Pipeline(stages = [ documentAssembler, sbert_embedder, rxnorm_resolver, chunkerMapper ]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) light_pipeline= LightPipeline(model) result = light_pipeline.fullAnnotate("amlodipine 5 MG") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols("ner_chunk") .setOutputCol("sbert_embeddings") val rxnorm_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("rxnorm_code") .setDistanceFunction("EUCLIDEAN") val chunkerMapper = ChunkMapperModel .pretrained("rxnorm_umls_mapper", "en", "clinical/models") .setInputCols("rxnorm_code") .setOutputCol("umls_mappings") .setRels(Array("umls_code")) val pipeline = new Pipeline(stages = Array( documentAssembler, sbert_embedder, rxnorm_resolver, chunkerMapper )) val data = Seq("amlodipine 5 MG").toDS.toDF("text") val result= pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu 
nlu.load("en.rxnorm_to_umls").predict("""amlodipine 5 MG""") ```
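Conceptually, the chunk-mapper stage is a curated code-to-code dictionary: once the resolver has produced an RXNORM code, the mapper looks it up in pretrained RXNORM→UMLS relations. A minimal sketch of that final step (the single code pair here is taken from the example output in the Results below; the real model bundles the full mapping):

```python
# Toy RXNORM -> UMLS lookup illustrating the mapper's "umls_code" relation.
# One entry only, copied from the documented example; the real model ships
# many thousands of entries.
rxnorm_to_umls = {
    "329528": "C1124796",  # amlodipine 5 MG
}

def map_code(rxnorm_code, rels=("umls_code",)):
    """Return the requested relation(s) for a resolved RXNORM code."""
    if "umls_code" not in rels:
        return {}
    return {"umls_code": rxnorm_to_umls.get(rxnorm_code)}

print(map_code("329528"))  # {'umls_code': 'C1124796'}
```

The expensive part of the pipeline is the resolution from free text to an RXNORM code; the mapping itself is a fast dictionary lookup, which is why chunk mappers are small (this one is 1.9 MB).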
## Results ```bash | | ner_chunk | rxnorm_code | umls_mappings | |---:|:----------------|--------------:|:----------------| | 0 | amlodipine 5 MG | 329528 | C1124796 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|rxnorm_umls_mapper| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[rxnorm_code]| |Output Labels:|[mappings]| |Language:|en| |Size:|1.9 MB| ## References This pretrained model maps RXNORM codes to corresponding UMLS codes under the Unified Medical Language System (UMLS). --- layout: model title: Pipeline to Detect Clinical Entities (ner_eu_clinical_case - es) author: John Snow Labs name: ner_eu_clinical_case_pipeline date: 2023-03-08 tags: [es, clinical, licensed, ner] task: Named Entity Recognition language: es edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_eu_clinical_case](https://nlp.johnsnowlabs.com/2023/02/01/ner_eu_clinical_case_es.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_pipeline_es_4.3.0_3.2_1678261388612.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_pipeline_es_4.3.0_3.2_1678261388612.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_eu_clinical_case_pipeline", "es", "clinical/models") text = " Un niño de 3 años con trastorno autista en el hospital de la sala pediátrica A del hospital universitario. No tiene antecedentes familiares de enfermedad o trastorno del espectro autista. El niño fue diagnosticado con un trastorno de comunicación severo, con dificultades de interacción social y retraso en el procesamiento sensorial. Los análisis de sangre fueron normales (hormona estimulante de la tiroides (TSH), hemoglobina, volumen corpuscular medio (MCV) y ferritina). La endoscopia alta también mostró un tumor submucoso que causaba una obstrucción subtotal de la salida gástrica. Ante la sospecha de tumor del estroma gastrointestinal, se realizó gastrectomía distal. El examen histopatológico reveló proliferación de células fusiformes en la capa submucosa. " result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_eu_clinical_case_pipeline", "es", "clinical/models") val text = " Un niño de 3 años con trastorno autista en el hospital de la sala pediátrica A del hospital universitario. No tiene antecedentes familiares de enfermedad o trastorno del espectro autista. El niño fue diagnosticado con un trastorno de comunicación severo, con dificultades de interacción social y retraso en el procesamiento sensorial. Los análisis de sangre fueron normales (hormona estimulante de la tiroides (TSH), hemoglobina, volumen corpuscular medio (MCV) y ferritina). La endoscopia alta también mostró un tumor submucoso que causaba una obstrucción subtotal de la salida gástrica. Ante la sospecha de tumor del estroma gastrointestinal, se realizó gastrectomía distal. El examen histopatológico reveló proliferación de células fusiformes en la capa submucosa. 
" val result = pipeline.fullAnnotate(text) ```
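Each recognized chunk carries a confidence score, which can be used to drop weak predictions before downstream use. A minimal pure-Python sketch of such filtering; the list-of-dicts layout mocks the pipeline output and its field names are assumptions:

```python
# Keep only chunks at or above a confidence threshold.
# The dict layout mocks fullAnnotate output; field names are assumed.
def filter_by_confidence(chunks, threshold=0.5):
    return [c for c in chunks if float(c["confidence"]) >= threshold]

sample = [
    {"result": "trastorno", "entity": "clinical_event", "confidence": "0.9976"},
    {"result": "hormona", "entity": "clinical_event", "confidence": "0.398"},
]
print(filter_by_confidence(sample))
```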
## Results ```bash | | chunks | begin | end | entities | confidence | |---:|:-----------------------------|--------:|------:|:-------------------|-------------:| | 0 | Un niño de 3 años | 1 | 17 | patient | 0.68856 | | 1 | trastorno | 23 | 31 | clinical_event | 0.9976 | | 2 | autista | 33 | 39 | clinical_condition | 0.7979 | | 3 | antecedentes | 117 | 128 | clinical_event | 0.7161 | | 4 | enfermedad | 144 | 153 | clinical_event | 0.5444 | | 5 | trastorno | 157 | 165 | clinical_event | 0.9914 | | 6 | del espectro autista | 167 | 186 | clinical_condition | 0.5385 | | 7 | El niño | 189 | 195 | patient | 0.87065 | | 8 | diagnosticado | 201 | 213 | clinical_event | 0.6442 | | 9 | trastorno | 222 | 230 | clinical_event | 0.836 | | 10 | de comunicación severo | 232 | 253 | clinical_condition | 0.501067 | | 11 | dificultades | 260 | 271 | clinical_event | 0.8807 | | 12 | retraso | 297 | 303 | clinical_event | 0.6975 | | 13 | análisis | 340 | 347 | clinical_event | 0.9664 | | 14 | sangre | 352 | 357 | bodypart | 0.9251 | | 15 | normales | 366 | 373 | units_measurements | 0.9838 | | 16 | hormona | 376 | 382 | clinical_event | 0.398 | | 17 | la tiroides | 399 | 409 | bodypart | 0.37665 | | 18 | TSH | 412 | 414 | clinical_event | 0.9389 | | 19 | hemoglobina | 418 | 428 | clinical_event | 0.2746 | | 20 | volumen | 431 | 437 | clinical_event | 0.9674 | | 21 | MCV | 458 | 460 | clinical_event | 0.6897 | | 22 | ferritina | 465 | 473 | clinical_event | 0.8188 | | 23 | endoscopia | 480 | 489 | clinical_event | 0.9953 | | 24 | mostró | 504 | 509 | clinical_event | 0.9998 | | 25 | tumor | 514 | 518 | clinical_event | 0.9866 | | 26 | submucoso | 520 | 528 | clinical_condition | 0.6053 | | 27 | obstrucción | 546 | 556 | clinical_event | 0.9974 | | 28 | tumor | 610 | 614 | clinical_event | 0.7284 | | 29 | del estroma gastrointestinal | 616 | 643 | bodypart | 0.577067 | | 30 | gastrectomía | 657 | 668 | clinical_event | 0.9666 | | 31 | examen | 681 | 686 | clinical_event | 0.9738 | | 32 | 
reveló | 704 | 709 | clinical_event | 0.9993 | | 33 | proliferación | 711 | 723 | clinical_event | 0.9996 | | 34 | células fusiformes | 728 | 745 | bodypart | 0.7001 | | 35 | la capa submucosa | 750 | 766 | bodypart | 0.641267 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_eu_clinical_case_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|es| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English Named Entity Recognition (from sagorsarker) author: John Snow Labs name: bert_ner_codeswitch_nepeng_lid_lince date: 2022-05-09 tags: [bert, ner, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `codeswitch-nepeng-lid-lince` is an English model originally trained by `sagorsarker`. ## Predicted Entities `mixed`, `other`, `en`, `ambiguous`, `ne`, `nep` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_codeswitch_nepeng_lid_lince_en_3.4.2_3.0_1652097400616.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_codeswitch_nepeng_lid_lince_en_3.4.2_3.0_1652097400616.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_codeswitch_nepeng_lid_lince","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_codeswitch_nepeng_lid_lince","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_codeswitch_nepeng_lid_lince| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/sagorsarker/codeswitch-nepeng-lid-lince - https://ritual.uh.edu/lince/home - https://github.com/sagorbrur/codeswitch --- layout: model title: Translate English to Indonesian Pipeline author: John Snow Labs name: translate_en_id date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, id, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a computationally expensive module, especially on longer sequences; using an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `id` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_id_xx_2.7.0_2.4_1609690363498.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_id_xx_2.7.0_2.4_1609690363498.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_id", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_id", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.id').predict(text, output_level='sentence') translate_df ```
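`annotate` returns a dict mapping output columns to lists of strings, one entry per detected sentence. A small sketch of stitching sentence-level translations back into one text; the column name `translation` and the mocked dict are assumptions, not a verified output schema:

```python
# Join sentence-level outputs from an annotate()-style dict.
# annotate() is assumed to yield {column: [sentence strings]};
# the "translation" column name is also an assumption.
def join_translation(annotations, col="translation"):
    return " ".join(annotations.get(col, []))

mock = {"translation": ["Kalimat pertama.", "Kalimat kedua."]}
print(join_translation(mock))
```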
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_id| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: German T5ForConditionalGeneration Cased model (from dehio) author: John Snow Labs name: t5_german_qg_e2e_quad date: 2023-01-30 tags: [de, open_source, t5, tensorflow] task: Text Generation language: de edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `german-qg-t5-e2e-quad` is a German model originally trained by `dehio`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_german_qg_e2e_quad_de_4.3.0_3.0_1675102645662.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_german_qg_e2e_quad_de_4.3.0_3.0_1675102645662.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_german_qg_e2e_quad","de") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_german_qg_e2e_quad","de") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_german_qg_e2e_quad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|de| |Size:|924.3 MB| ## References - https://huggingface.co/dehio/german-qg-t5-e2e-quad --- layout: model title: Pipeline to Detect PHI in text (enriched-biobert) author: John Snow Labs name: ner_deid_enriched_biobert_pipeline date: 2023-03-20 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deid_enriched_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_deid_enriched_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_enriched_biobert_pipeline_en_4.3.0_3.2_1679316429600.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_enriched_biobert_pipeline_en_4.3.0_3.2_1679316429600.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_enriched_biobert_pipeline", "en", "clinical/models") text = '''A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_enriched_biobert_pipeline", "en", "clinical/models") val text = "A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.deid.ner_enriched_biobert.pipeline").predict("""A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""") ```
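A common next step after detection is masking the identified PHI. Assuming `begin`/`end` are inclusive character offsets, as the pipeline reports them, a minimal pure-Python sketch:

```python
# Replace each detected PHI span with a <LABEL> placeholder.
# chunks: (begin, end, label) tuples with inclusive offsets (assumed).
def mask_phi(text, chunks):
    # Process right-to-left so earlier offsets stay valid after edits.
    for begin, end, label in sorted(chunks, reverse=True):
        text = text[:begin] + f"<{label}>" + text[end + 1:]
    return text

text = "Record date : 2093-01-13, David Hale, M.D."
chunks = [(14, 23, "DATE"), (26, 35, "DOCTOR")]
print(mask_phi(text, chunks))
```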
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:------------------------------|--------:|------:|:-------------|-------------:| | 0 | 2093-01-13 | 17 | 26 | DATE | 0.9267 | | 1 | David Hale | 29 | 38 | DOCTOR | 0.7949 | | 2 | Hendrickson, Ora | 53 | 68 | PATIENT | 0.637733 | | 3 | 7194334 | 76 | 82 | PHONE | 0.4939 | | 4 | Cocke County Baptist Hospital | 114 | 142 | HOSPITAL | 0.6199 | | 5 | 0295 Keats Street | 145 | 161 | STREET | 0.592433 | | 6 | 302) 786-5227 | 174 | 186 | PHONE | 0.846833 | | 7 | Brothers Coal-Mine | 253 | 270 | ORGANIZATION | 0.45085 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_enriched_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.2 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Chuvash RobertaForQuestionAnswering (from sunitha) author: John Snow Labs name: roberta_qa_CV_Custom_DS date: 2022-06-20 tags: [open_source, question_answering, roberta] task: Question Answering language: cv edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `CV_Custom_DS` is a Chuvash model originally trained by `sunitha`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_CV_Custom_DS_cv_4.0.0_3.0_1655726596821.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_CV_Custom_DS_cv_4.0.0_3.0_1655726596821.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_CV_Custom_DS","cv") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_CV_Custom_DS","cv") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("cv.answer_question.roberta.cv_custom_ds.by_sunitha").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_CV_Custom_DS| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|cv| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/sunitha/CV_Custom_DS --- layout: model title: English Deberta Embeddings model (from smeoni) author: John Snow Labs name: deberta_embeddings_nbme_V3_large date: 2023-03-13 tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, en, tensorflow] task: Embeddings language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DeBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `nbme-deberta-V3-large` is an English model originally trained by `smeoni`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_nbme_V3_large_en_4.3.1_3.0_1678713648667.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_nbme_V3_large_en_4.3.1_3.0_1678713648667.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_nbme_V3_large","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_nbme_V3_large","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
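Contextual token embeddings are typically compared with cosine similarity. A self-contained sketch, with plain Python lists standing in for the vectors found in the `embeddings` output column:

```python
import math

# Cosine similarity between two equal-length vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(round(cosine([1.0, 0.0], [1.0, 1.0]), 4))  # 0.7071
```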
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|deberta_embeddings_nbme_V3_large| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|1.6 GB| |Case sensitive:|false| ## References https://huggingface.co/smeoni/nbme-deberta-V3-large --- layout: model title: English BertForMaskedLM Large Cased model (from VMware) author: John Snow Labs name: bert_embeddings_v_2021_large date: 2022-12-02 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `vbert-2021-large` is an English model originally trained by `VMware`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_v_2021_large_en_4.2.4_3.0_1670023012204.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_v_2021_large_en_4.2.4_3.0_1670023012204.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_v_2021_large","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_v_2021_large","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_v_2021_large| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| ## References - https://huggingface.co/VMware/vbert-2021-large - https://medium.com/@rickbattle/weaknesses-of-wordpiece-tokenization-eb20e37fec99 --- layout: model title: English BertForQuestionAnswering Uncased model (from aodiniz) author: John Snow Labs name: bert_qa_uncased_l_2_h_128_a_2_squad2 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-2_H-128_A-2_squad2` is an English model originally trained by `aodiniz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_uncased_l_2_h_128_a_2_squad2_en_4.0.0_3.0_1657188893270.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_uncased_l_2_h_128_a_2_squad2_en_4.0.0_3.0_1657188893270.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_uncased_l_2_h_128_a_2_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_uncased_l_2_h_128_a_2_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_uncased_l_2_h_128_a_2_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|16.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-2_H-128_A-2_squad2 --- layout: model title: Detect Entities (GloVe) author: John Snow Labs name: ner_dl date: 2020-03-19 task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 2.4.3 spark_version: 2.4 tags: [ner, en, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description `ner_dl` is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. It was trained on the CoNLL 2003 text corpus. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. The `ner_dl` model is trained with GloVe 100D word embeddings, so be sure to use the same embeddings in the pipeline. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_dl_en_2.4.3_2.4_1584624950746.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ner_dl_en_2.4.3_2.4_1584624950746.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`. ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained('glove_100d') \ .setInputCols(["document", 'token']) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("ner_dl", "en") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."]], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_100d") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("ner_dl", "en") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."""] ner_df = nlu.load('en.ner.dl.glove.6B_100d').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
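The `ner_converter` stage in the pipeline above merges token-level IOB tags (`B-PER`, `I-PER`, `O`, …) into entity chunks like those shown in the Results table. The grouping logic can be sketched in plain Python — a simplified illustration of the idea, not Spark NLP's actual implementation:

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens (B-X / I-X / O) into (chunk_text, label) pairs."""
    chunks, current_tokens, current_label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and current_label != tag[2:]):
            if current_tokens:  # close any open chunk before starting a new one
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O" ends any open chunk
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks

tokens = ["William", "Henry", "Gates", "III", "founded", "Microsoft"]
tags   = ["B-PER",   "I-PER", "I-PER", "I-PER", "O",     "B-ORG"]
print(iob_to_chunks(tokens, tags))
# [('William Henry Gates III', 'PER'), ('Microsoft', 'ORG')]
```

This is why the `ner` output column carries one tag per token, while the converted chunk column carries one row per entity.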
{:.h2_title} ## Results ```bash +-------------------------------+---------+ |chunk |ner_label| +-------------------------------+---------+ |William Henry Gates III |PER | |American |MISC | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PER | |Born |LOC | |Seattle |LOC | |Washington |LOC | |Gates |PER | |Microsoft |ORG | |Paul Allen |PER | |Albuquerque |LOC | |New Mexico |LOC | |Gates |PER | |Gates |PER | |Gates |PER | |Microsoft |ORG | |Bill & Melinda Gates Foundation|ORG | |Melinda Gates |PER | |Ray Ozzie |PER | +-------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_dl| |Type:|ner| |Compatibility:|Spark NLP 2.4.3+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Case sensitive:|false| {:.h2_title} ## Data Source The model was trained on data from the [CoNLL 2003 Data Set](https://github.com/synalp/NER/tree/master/corpus/CoNLL-2003) --- layout: model title: Korean Bert Embeddings author: John Snow Labs name: bert_embeddings_bert_base date: 2022-04-11 tags: [bert, embeddings, ko, open_source] task: Embeddings language: ko edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base` is a Korean model originally trained by `klue`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_ko_3.4.2_3.0_1649675453798.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_ko_3.4.2_3.0_1649675453798.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base","ko") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["나는 Spark NLP를 좋아합니다"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base","ko") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("나는 Spark NLP를 좋아합니다").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ko.embed.bert").predict("""나는 Spark NLP를 좋아합니다""") ```
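Once the `embeddings` column is populated, each token carries a dense vector, and comparing tokens reduces to vector math such as cosine similarity. A minimal sketch over toy 4-dimensional vectors (the real vectors from this model are much larger; the values below are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for token embedding vectors.
v_spark = [0.2, 0.8, 0.1, 0.4]
v_nlp = [0.25, 0.75, 0.05, 0.5]
v_unrelated = [0.9, -0.1, 0.8, -0.3]

print(cosine_similarity(v_spark, v_nlp))        # close to 1.0 (similar)
print(cosine_similarity(v_spark, v_unrelated))  # much lower (dissimilar)
```

In practice you would pull the vectors out of the `embeddings` annotations in `result` and feed them to downstream tasks such as classification or clustering.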
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ko| |Size:|415.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/klue/bert-base - https://github.com/KLUE-benchmark/KLUE - https://arxiv.org/abs/2105.09680 --- layout: model title: Persian BertForQuestionAnswering model (from ForutanRad) author: John Snow Labs name: bert_qa_bert_fa_QA_v1 date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: fa edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-fa-QA-v1` is a Persian model originally trained by `ForutanRad`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_fa_QA_v1_fa_4.0.0_3.0_1654181654761.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_fa_QA_v1_fa_4.0.0_3.0_1654181654761.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_fa_QA_v1","fa") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_fa_QA_v1","fa") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("fa.answer_question.bert.by_ForutanRad").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
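Under the hood, an extractive question-answering head scores every token position as a potential answer start and end, and the predicted answer is the best-scoring valid span. The span selection step can be sketched in plain Python — the logits below are made up for illustration; in the real pipeline the model produces them:

```python
def best_span(tokens, start_logits, end_logits, max_len=10):
    """Pick the span maximizing start_logit + end_logit, with start <= end."""
    best, best_score = None, float("-inf")
    for s in range(len(tokens)):
        for e in range(s, min(s + max_len, len(tokens))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    s, e = best
    return " ".join(tokens[s:e + 1])

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.2, 0.1, 4.0, 0.0, 0.1, 0.0, 0.1, 1.0, 0.0]  # hypothetical
end_logits   = [0.0, 0.1, 0.2, 3.5, 0.1, 0.0, 0.1, 0.0, 0.9, 0.1]  # hypothetical
print(best_span(tokens, start_logits, end_logits))  # Clara
```

This is why the annotator takes both a question and a context column: the span is always selected from the context tokens.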
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_fa_QA_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fa| |Size:|607.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ForutanRad/bert-fa-QA-v1 - https://arxiv.org/abs/2005.12515 --- layout: model title: Slovak RobertaForTokenClassification Cased model (from crabz) author: John Snow Labs name: roberta_token_classifier_slovakbert_ner date: 2023-03-01 tags: [sk, open_source, roberta, token_classification, ner, tensorflow] task: Named Entity Recognition language: sk edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `slovakbert-ner` is a Slovak model originally trained by `crabz`. ## Predicted Entities `4`, `2`, `6`, `1`, `0`, `5`, `3` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_slovakbert_ner_sk_4.3.0_3.0_1677703644531.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_slovakbert_ner_sk_4.3.0_3.0_1677703644531.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_slovakbert_ner","sk") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_slovakbert_ner","sk") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_token_classifier_slovakbert_ner| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|sk| |Size:|439.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/crabz/slovakbert-ner - https://paperswithcode.com/sota?task=Token+Classification&dataset=wikiann --- layout: model title: Maltese Lemmatizer author: John Snow Labs name: lemma date: 2021-04-02 tags: [mt, open_source, lemmatizer] task: Lemmatization language: mt edition: Spark NLP 2.7.5 spark_version: 2.4 supported: true annotator: LemmatizerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a dictionary-based lemmatizer that assigns all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/TEXT_PREPROCESSING/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TEXT_PREPROCESSING.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_mt_2.7.5_2.4_1617376734828.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_mt_2.7.5_2.4_1617376734828.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma", "mt") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Il- Membru tal- Kumitat Leo Brincat talab li bħala xhud ikun hemm rappreżentant tal- MEPA u kien hemm qbil filwaqt li d- Deputat Laburista Joe Mizzi ta lista ta' persuni oħrajn mill- Korporazzjoni Enemalta u minn WasteServ u ma kienx hemm oġġezzjoni ."]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = Tokenizer() .setInputCols("document") .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma", "mt") .setInputCols("token") .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Il- Membru tal- Kumitat Leo Brincat talab li bħala xhud ikun hemm rappreżentant tal- MEPA u kien hemm qbil filwaqt li d- Deputat Laburista Joe Mizzi ta lista ta' persuni oħrajn mill- Korporazzjoni Enemalta u minn WasteServ u ma kienx hemm oġġezzjoni .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Il- Membru tal- Kumitat Leo Brincat talab li bħala xhud ikun hemm rappreżentant tal- MEPA u kien hemm qbil filwaqt li d- Deputat Laburista Joe Mizzi ta lista ta' persuni oħrajn mill- Korporazzjoni Enemalta u minn WasteServ u ma kienx hemm oġġezzjoni ."] lemma_df = nlu.load('mt.lemma').predict(text, output_level = "document") lemma_df.lemma.values[0] ```
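The dictionary lookup this lemmatizer performs can be illustrated in plain Python: every inflected form maps to a single root, and unknown tokens pass through unchanged. A toy English dictionary stands in for the Maltese one the model ships with:

```python
# Toy form -> lemma dictionary; the pretrained model ships a large Maltese one.
lemma_dict = {
    "ran": "run", "running": "run", "runs": "run",
    "better": "good", "went": "go",
}

def lemmatize(tokens):
    """Map each token to its dictionary root; leave unknown tokens as-is."""
    return [lemma_dict.get(tok.lower(), tok) for tok in tokens]

print(lemmatize(["She", "ran", "and", "went", "running"]))
# ['She', 'run', 'and', 'go', 'run']
```

Because the lookup operates on tokens, the annotator takes the `token` column as input, as in the pipelines above.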
## Results ```bash +-------+ | lemma| +-------+ | Il| | _| | _| | tal| | _| | _| | Leo| |Brincat| | _| | _| | _| | _| | _| | _| | _| | tal| | _| | MEPA| | _| | _| +-------+ only showing top 20 rows ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Compatibility:|Spark NLP 2.7.5+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|mt| ## Data Source The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) version 2.7. ## Benchmarking ```bash Precision=0.078, Recall=0.073, F1-score=0.075 ``` --- layout: model title: Named Entity Recognition - BERT Small (OntoNotes) author: John Snow Labs name: onto_small_bert_L4_512 date: 2020-12-05 task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [ner, open_source, en] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Onto is a Named Entity Recognition (or NER) model trained on OntoNotes 5.0. It can extract up to 18 entities such as people, places, organizations, money, time, date, etc. This model uses the pretrained `small_bert_L4_512` embeddings model from the `BertEmbeddings` annotator as an input. ## Predicted Entities `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_small_bert_L4_512_en_2.7.0_2.4_1607199400149.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_small_bert_L4_512_en_2.7.0_2.4_1607199400149.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... ner_onto = NerDLModel.pretrained("onto_small_bert_L4_512", "en") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."]], ["text"])) ``` ```scala ... val ner_onto = NerDLModel.pretrained("onto_small_bert_L4_512", "en") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter)) val data = Seq("William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."""] ner_df = nlu.load('en.ner.onto.bert.small_l4_512').predict(text, output_level='chunk') ner_df[["entities", "entities_class"]] ```
{:.h2_title} ## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |William Henry Gates III|PERSON | |October 28, 1955 |DATE | |American |NORP | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PERSON | |May 2014 |DATE | |the 1970s |DATE | |1980s |DATE | |Seattle |GPE | |Washington |GPE | |Gates |PERSON | |Paul Allen |PERSON | |1975 |DATE | |Albuquerque |GPE | |New Mexico |GPE | |Gates |PERSON | |January 2000 |DATE | |the late 1990s |DATE | |Gates |PERSON | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_small_bert_L4_512| |Type:|ner| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source The model is trained based on data from [OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) ## Benchmarking ```bash Micro-average: prec: 0.8697573, rec: 0.8567398, f1: 0.8631994 CoNLL Eval: processed 152728 tokens with 11257 phrases; found: 11219 phrases; correct: 9557. 
accuracy: 97.17%; 9557 11257 11219 precision: 85.19%; recall: 84.90%; FB1: 85.04 CARDINAL: 804 935 958 precision: 83.92%; recall: 85.99%; FB1: 84.94 958 DATE: 1412 1602 1726 precision: 81.81%; recall: 88.14%; FB1: 84.86 1726 EVENT: 20 63 46 precision: 43.48%; recall: 31.75%; FB1: 36.70 46 FAC: 78 135 122 precision: 63.93%; recall: 57.78%; FB1: 60.70 122 GPE: 2066 2240 2185 precision: 94.55%; recall: 92.23%; FB1: 93.38 2185 LANGUAGE: 10 22 11 precision: 90.91%; recall: 45.45%; FB1: 60.61 11 LAW: 12 40 18 precision: 66.67%; recall: 30.00%; FB1: 41.38 18 LOC: 114 179 168 precision: 67.86%; recall: 63.69%; FB1: 65.71 168 MONEY: 273 314 320 precision: 85.31%; recall: 86.94%; FB1: 86.12 320 NORP: 779 841 873 precision: 89.23%; recall: 92.63%; FB1: 90.90 873 ORDINAL: 174 195 226 precision: 76.99%; recall: 89.23%; FB1: 82.66 226 ORG: 1381 1795 1691 precision: 81.67%; recall: 76.94%; FB1: 79.23 1691 PERCENT: 311 349 349 precision: 89.11%; recall: 89.11%; FB1: 89.11 349 PERSON: 1827 1988 2046 precision: 89.30%; recall: 91.90%; FB1: 90.58 2046 PRODUCT: 32 76 51 precision: 62.75%; recall: 42.11%; FB1: 50.39 51 QUANTITY: 80 105 105 precision: 76.19%; recall: 76.19%; FB1: 76.19 105 TIME: 124 212 219 precision: 56.62%; recall: 58.49%; FB1: 57.54 219 WORK_OF_ART: 60 166 105 precision: 57.14%; recall: 36.14%; FB1: 44.28 105 ``` --- layout: model title: English image_classifier_vit_rust_image_classification_5 ViTForImageClassification from SummerChiam author: John Snow Labs name: image_classifier_vit_rust_image_classification_5 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark 
NLP. `image_classifier_vit_rust_image_classification_5` is an English model originally trained by SummerChiam. ## Predicted Entities `nonrust`, `rust` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_5_en_4.1.0_3.0_1660165828492.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_5_en_4.1.0_3.0_1660165828492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_rust_image_classification_5", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_rust_image_classification_5", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
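The classification head of a ViT model ends in a softmax over its labels (here `nonrust` and `rust`), and the predicted class in the `class` column is the argmax. A minimal sketch with made-up logits (the real model produces them from image patches):

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["nonrust", "rust"]
logits = [1.2, 3.4]  # hypothetical output of the classification head
probs = softmax(logits)
prediction = labels[probs.index(max(probs))]
print(prediction)  # rust
```

The probabilities always sum to 1, so the winning label can be read directly from the largest logit.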
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_rust_image_classification_5| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Company to Ticker using Nasdaq author: John Snow Labs name: finel_nasdaq_data_ticker date: 2022-10-22 tags: [en, finance, companies, nasdaq, ticker, licensed] task: Entity Resolution language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Financial Entity Resolver model, trained to map a company name registered on NASDAQ to its ticker. Use it after extracting a company name with any NER model to obtain the corresponding ticker. You can then use `finmapper_nasdaq_data_ticker` to augment the result with more information about the company from the NASDAQ data source, including the official company name, sector, location, currency, etc. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/ER_EDGAR_CRUNCHBASE/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finel_nasdaq_data_ticker_en_1.0.0_3.0_1666473763228.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finel_nasdaq_data_ticker_en_1.0.0_3.0_1666473763228.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pandas test = ["FIDUS INVESTMENT corp","ASPECT DEVELOPMENT Inc","CFSB BANCORP","DALEEN TECHNOLOGIES","GLEASON Corporation"] testdf = pandas.DataFrame(test, columns=['text']) testsdf = spark.createDataFrame(testdf).toDF('text') documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("sentence") use = nlp.UniversalSentenceEncoder.pretrained("tfhub_use_lg", "en")\ .setInputCols(["sentence"])\ .setOutputCol("embeddings") use_er_model = finance.SentenceEntityResolverModel.pretrained('finel_nasdaq_data_ticker', 'en', 'finance/models')\ .setInputCols("embeddings")\ .setOutputCol('normalized') prediction_model = nlp.Pipeline(stages=[documentAssembler, use, use_er_model]).fit(testsdf) test_pred = prediction_model.transform(testsdf).cache() ```
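Conceptually, a sentence-entity resolver embeds the incoming company name and returns the ticker of the nearest entry in its trained index. That nearest-neighbor lookup can be sketched over made-up low-dimensional embeddings (the real model uses `tfhub_use_lg` sentence vectors and covers all NASDAQ names; the vectors and index below are hypothetical):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical ticker -> embedding index built at training time.
index = {
    "FDUS": [0.9, 0.1, 0.2],   # FIDUS INVESTMENT CORP
    "GLE1": [0.1, 0.8, 0.3],   # GLEASON CORPORATION
    "CFSB": [0.2, 0.2, 0.9],   # CFSB BANCORP
}

def resolve(query_embedding):
    """Return the ticker whose stored embedding is closest by cosine similarity."""
    return max(index, key=lambda ticker: cosine(index[ticker], query_embedding))

print(resolve([0.88, 0.15, 0.25]))  # FDUS
```

This is also why the model's input column is `embeddings` rather than raw text: the encoder stage does the embedding, and the resolver only performs the lookup.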
## Results ```bash +----------------------+-------+ |text |result | +----------------------+-------+ |FIDUS INVESTMENT corp |[FDUS] | |ASPECT DEVELOPMENT Inc|[ASDV] | |CFSB BANCORP |[CFSB] | |DALEEN TECHNOLOGIES |[DALN1]| |GLEASON Corporation |[GLE1] | +----------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finel_nasdaq_data_ticker| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings]| |Output Labels:|[normalized]| |Language:|en| |Size:|69.8 MB| |Case sensitive:|false| ## References NASDAQ Database --- layout: model title: German BERT Base Uncased Model author: John Snow Labs name: bert_base_german_uncased date: 2021-05-20 tags: [open_source, embeddings, german, de, bert] task: Embeddings language: de edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The source data for the model consists of a recent Wikipedia dump, EU Bookshop corpus, Open Subtitles, CommonCrawl, ParaCrawl and News Crawl. This results in a dataset with a size of 16GB and 2,350,234,427 tokens. The model is trained with an initial sequence length of 512 subwords and was performed for 1.5M steps. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_german_uncased_de_3.1.0_2.4_1621504361619.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_german_uncased_de_3.1.0_2.4_1621504361619.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_base_german_uncased", "de") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_base_german_uncased", "de") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("de.embed.bert.uncased").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_german_uncased| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[embeddings]| |Language:|de| |Case sensitive:|true| ## Data Source https://huggingface.co/dbmdz/bert-base-german-uncased ## Benchmarking For results on downstream tasks like NER or PoS tagging, please refer to [this repository](https://github.com/stefan-it/fine-tuned-berts-seq). --- layout: model title: Fast Neural Machine Translation Model from English to Bemba (Zambia) author: John Snow Labs name: opus_mt_en_bem date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, bem, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `bem` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_bem_xx_2.7.0_2.4_1609171023473.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_bem_xx_2.7.0_2.4_1609171023473.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_bem", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_bem", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.bem').predict(text, output_level='sentence')
opus_df
```
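`fullAnnotate` returns one dict per input row, keyed by output column. A minimal sketch of pulling the translated strings out of that structure (the sample record below is mocked for illustration, not real model output):

```python
# Post-process LightPipeline.fullAnnotate output: each row is a dict keyed by
# output column ("translation" here); annotations expose text via `result`.
def collect_translations(annotated_rows):
    results = []
    for row in annotated_rows:
        for ann in row.get("translation", []):
            # Annotation objects have a .result attribute; plain dicts use ["result"]
            results.append(getattr(ann, "result", None) or ann["result"])
    return results

# Mocked record illustrating the expected structure:
sample = [{"translation": [{"result": "translated sentence"}]}]
print(collect_translations(sample))  # -> ['translated sentence']
```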
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_en_bem|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Detect Generic PHI for Deidentification (Arabic)
author: John Snow Labs
name: ner_deid_generic
date: 2023-05-30
tags: [licensed, ner, clinical, deidentification, generic, arabic, ar]
task: Named Entity Recognition
language: ar
edition: Healthcare NLP 4.4.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, "Named Entity Recognition with Bidirectional LSTM-CNNs".

Deidentification NER (Arabic) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 8 entities. This NER model is trained with a combination of custom datasets and several data augmentation mechanisms. This model uses Word2Vec Arabic clinical embeddings.
## Predicted Entities `CONTACT`, `NAME`, `DATE`, `ID`, `SEX`, `LOCATION`, `PROFESSION`, `AGE` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_ar_4.4.2_3.0_1685443881012.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_ar_4.4.2_3.0_1685443881012.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner_subentity = MedicalNerModel.pretrained("ner_deid_generic", "ar", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipelineGeneric = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner_subentity, ner_converter]) text = ''' ملاحظات سريرية - مريض الربو: التاريخ: 30 مايو 2023 اسم المريض: أحمد سليمان العنوان: شارع السلام، مبنى رقم 555، حي الصفاء، الرياض الرمز البريدي: 54321 البلد: المملكة العربية السعودية اسم المستشفى: مستشفى الأمانة اسم الطبيب: د. ريم الحمد تفاصيل الحالة: المريض أحمد سليمان، البالغ من العمر 30 عامًا، يعاني من مرض الربو المزمن. يشكو من ضيق التنفس والسعال المتكرر والشهيق الشديد. تم تشخيصه بمرض الربو بناءً على تاريخه الطبي واختبارات وظائف الرئة. 
'''

data = spark.createDataFrame([[text]]).toDF("text")

results = nlpPipelineGeneric.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar")
    .setInputCols(Array("sentence","token"))
    .setOutputCol("word_embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic", "ar", "clinical/models")
    .setInputCols(Array("sentence","token","word_embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    clinical_ner,
    ner_converter))

val text = """ملاحظات سريرية - مريض الربو: التاريخ: 30 مايو 2023 اسم المريض: أحمد سليمان العنوان: شارع السلام، مبنى رقم 555، حي الصفاء، الرياض الرمز البريدي: 54321 البلد: المملكة العربية السعودية اسم المستشفى: مستشفى الأمانة اسم الطبيب: د. ريم الحمد تفاصيل الحالة: المريض أحمد سليمان، البالغ من العمر 30 عامًا، يعاني من مرض الربو المزمن. يشكو من ضيق التنفس والسعال المتكرر والشهيق الشديد. تم تشخيصه بمرض الربو بناءً على تاريخه الطبي واختبارات وظائف الرئة."""

val data = Seq(text).toDS.toDF("text")

val results = pipeline.fit(data).transform(data)
```
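`NerConverterInternal` is the stage that groups the model's token-level IOB tags into labeled chunks. A minimal pure-Python sketch of the same IOB grouping (the token/tag pairs below are illustrative, not real model output):

```python
# Minimal sketch of IOB chunk grouping, the idea behind NerConverterInternal:
# B-XXX opens a chunk, I-XXX extends it, anything else closes it.
def iob_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O", or an I- tag without a matching open chunk (dropped here)
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(iob_to_chunks(
    ["Ahmed", "Suleiman", "is", "30"],
    ["B-NAME", "I-NAME", "O", "B-AGE"],
))  # -> [('Ahmed Suleiman', 'NAME'), ('30', 'AGE')]
```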
## Results

```bash
+-----------------+---------------------+
|chunk            | ner_label           |
+-----------------+---------------------+
|30 مايو          |DATE                 |
|أحمد سليمان      |NAME                 |
|الرياض           |LOCATION             |
|54321            |LOCATION             |
|المملكة العربية  |LOCATION             |
|السعودية         |LOCATION             |
|مستشفى الأمانة   |LOCATION             |
|ريم الحمد        |NAME                 |
|أحمد             |NAME                 |
+-----------------+---------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_deid_generic|
|Compatibility:|Healthcare NLP 4.4.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|ar|
|Size:|15.0 MB|

## References

- Custom John Snow Labs datasets
- Data augmentation techniques

## Benchmarking

```bash
label        tp     fp    fn    total  precision  recall  f1
CONTACT      146.0   0.0   6.0  152.0  1.0        0.9605  0.9799
NAME         685.0  25.0  25.0  710.0  0.9648     0.9648  0.9648
DATE         876.0  14.0   9.0  885.0  0.9843     0.9898  0.987
ID            28.0   9.0   2.0   30.0  0.7568     0.9333  0.8358
SEX          300.0   8.0  69.0  369.0  0.974      0.813   0.8863
LOCATION     689.0  48.0  38.0  727.0  0.9349     0.9477  0.9413
PROFESSION   303.0  20.0  32.0  335.0  0.9381     0.9045  0.921
AGE          608.0   7.0   9.0  617.0  0.9886     0.9854  0.987
macro        -      -     -     -      -          -       0.9378
micro        -      -     -     -      -          -       0.9572
```

---
layout: model
title: Clean documents pipeline for English
author: John Snow Labs
name: clean_stop
date: 2021-03-24
tags: [open_source, english, clean_stop, pipeline, en]
supported: true
task: [Named Entity Recognition, Lemmatization]
language: en
nav_key: models
edition: Spark NLP 3.0.0
spark_version: 3.0
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The clean_stop pipeline is a pretrained pipeline that processes text with a simple sequence of stages, performing basic processing steps and recognizing entities.
It performs most of the common text processing tasks on your DataFrame.

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/clean_stop_en_3.0.0_3.0_1616544492033.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/clean_stop_en_3.0.0_3.0_1616544492033.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('clean_stop', lang = 'en')

annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("clean_stop", lang = "en")
val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0)
```

{:.nlu-block}
```python
import nlu
text = ["Hello from John Snow Labs ! "]
result_df = nlu.load('en.clean.stop').predict(text)
result_df
```
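The pipeline's `cleanTokens` output drops stop words from the token stream. A minimal pure-Python sketch of the same idea (the stop-word list here is a toy example, not the pipeline's actual list):

```python
# Illustrative stop-word removal, mirroring what the cleanTokens column shows.
# STOP_WORDS below is a toy set, not the pipeline's real stop-word list.
STOP_WORDS = {"a", "an", "the", "from", "of", "to"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["Hello", "from", "John", "Snow", "Labs", "!"]))
# -> ['Hello', 'John', 'Snow', 'Labs', '!']
```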
## Results

```bash
|    | document                         | sentence                        | token                                          | cleanTokens                            |
|---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:---------------------------------------|
|  0 | ['Hello from John Snow Labs ! '] | ['Hello from John Snow Labs !'] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | ['Hello', 'John', 'Snow', 'Labs', '!'] |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|clean_stop|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|

---
layout: model
title: English XlmRoBertaForQuestionAnswering (from alon-albalak)
author: John Snow Labs
name: xlm_roberta_qa_xlm_roberta_large_xquad
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-large-xquad` is an English model originally trained by `alon-albalak`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_large_xquad_en_4.0.0_3.0_1655996505419.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_large_xquad_en_4.0.0_3.0_1655996505419.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_large_xquad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = XlmRoBertaForQuestionAnswering
    .pretrained("xlm_roberta_qa_xlm_roberta_large_xquad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.xquad.xlm_roberta.large").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
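As the NLU snippet shows, question-answering loads take the question and context joined by `|||` in a single string. A tiny helper for building that input (the helper itself is hypothetical, not part of the nlu API):

```python
# Hypothetical helper (not part of the nlu API) for building the
# "question|||context" strings that nlu question-answering models expect.
def to_nlu_qa_input(question, context):
    return f"{question}|||{context}"

print(to_nlu_qa_input("What's my name?", "My name is Clara and I live in Berkeley."))
# -> What's my name?|||My name is Clara and I live in Berkeley.
```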
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_roberta_large_xquad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.8 GB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/alon-albalak/xlm-roberta-large-xquad
- https://github.com/deepmind/xquad

---
layout: model
title: Fast Neural Machine Translation Model from English to Chinese
author: John Snow Labs
name: opus_mt_en_zh
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, zh, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

- source languages: `en`
- target languages: `zh`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_zh_xx_2.7.0_2.4_1609168259647.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_zh_xx_2.7.0_2.4_1609168259647.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_zh", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_zh", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.zh').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_en_zh|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011)
author: John Snow Labs
name: distilbert_token_classifier_autotrain_name_vsv_all_901529445
date: 2023-03-14
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-name_vsv_all-901529445` is an English model originally trained by `ismail-lucifer011`.

## Predicted Entities

`Name`, `OOV`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_vsv_all_901529445_en_4.3.1_3.0_1678783317887.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_vsv_all_901529445_en_4.3.1_3.0_1678783317887.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_vsv_all_901529445","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_vsv_all_901529445","en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_autotrain_name_vsv_all_901529445|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/ismail-lucifer011/autotrain-name_vsv_all-901529445

---
layout: model
title: English XlmRoBertaForQuestionAnswering (from deepset)
author: John Snow Labs
name: xlm_roberta_qa_xlm_roberta_base_squad2_distilled
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-squad2-distilled` is an English model originally trained by `deepset`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_squad2_distilled_en_4.0.0_3.0_1655991460437.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_squad2_distilled_en_4.0.0_3.0_1655991460437.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_base_squad2_distilled","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = XlmRoBertaForQuestionAnswering
    .pretrained("xlm_roberta_qa_xlm_roberta_base_squad2_distilled","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.xlm_roberta.distilled_base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_roberta_base_squad2_distilled|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|854.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/deepset/xlm-roberta-base-squad2-distilled
- https://www.linkedin.com/company/deepset-ai/
- https://twitter.com/deepset_ai
- http://www.deepset.ai/jobs
- https://haystack.deepset.ai/community/join
- https://github.com/deepset-ai/haystack/
- https://github.com/deepset-ai/FARM
- https://deepset.ai/germanquad
- https://deepset.ai
- https://deepset.ai/german-bert
- https://github.com/deepset-ai/haystack/discussions

---
layout: model
title: English DistilBertForQuestionAnswering model (from vkrishnamoorthy)
author: John Snow Labs
name: distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `vkrishnamoorthy`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad_en_4.0.0_3.0_1654726595722.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad_en_4.0.0_3.0_1654726595722.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_vkrishnamoorthy").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/vkrishnamoorthy/distilbert-base-uncased-finetuned-squad

---
layout: model
title: Pipeline to Detect Restaurant-related Terminology
author: John Snow Labs
name: nerdl_restaurant_100d_pipeline
date: 2022-03-18
tags: [restaurant, ner, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [nerdl_restaurant_100d](https://nlp.johnsnowlabs.com/2021/12/31/nerdl_restaurant_100d_en.html) model.

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_RESTAURANT/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_RESTAURANT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/nerdl_restaurant_100d_pipeline_en_3.4.1_3.0_1647610686318.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/nerdl_restaurant_100d_pipeline_en_3.4.1_3.0_1647610686318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python restaurant_pipeline = PretrainedPipeline("nerdl_restaurant_100d_pipeline", lang = "en") restaurant_pipeline.annotate("Hong Kong’s favourite pasta bar also offers one of the most reasonably priced lunch sets in town! With locations spread out all over the territory Sha Tin – Pici’s formidable lunch menu reads like a highlight reel of the restaurant. Choose from starters like the burrata and arugula salad or freshly tossed tuna tartare, and reliable handmade pasta dishes like pappardelle. Finally, round out your effortless Italian meal with a tidy one-pot tiramisu, of course, an espresso to power you through the rest of the day.") ``` ```scala val restaurant_pipeline = new PretrainedPipeline("nerdl_restaurant_100d_pipeline", lang = "en") restaurant_pipeline.annotate("Hong Kong’s favourite pasta bar also offers one of the most reasonably priced lunch sets in town! With locations spread out all over the territory Sha Tin – Pici’s formidable lunch menu reads like a highlight reel of the restaurant. Choose from starters like the burrata and arugula salad or freshly tossed tuna tartare, and reliable handmade pasta dishes like pappardelle. Finally, round out your effortless Italian meal with a tidy one-pot tiramisu, of course, an espresso to power you through the rest of the day.") ```
## Results

```bash
+---------------------------+---------------+
|chunk                      |ner_label      |
+---------------------------+---------------+
|Hong Kong’s                |Restaurant_Name|
|favourite                  |Rating         |
|pasta bar                  |Dish           |
|most reasonably            |Price          |
|lunch                      |Hours          |
|in town!                   |Location       |
|Sha Tin – Pici’s           |Restaurant_Name|
|burrata                    |Dish           |
|arugula salad              |Dish           |
|freshly tossed tuna tartare|Dish           |
|reliable                   |Price          |
|handmade pasta             |Dish           |
|pappardelle                |Dish           |
|effortless                 |Amenity        |
|Italian                    |Cuisine        |
|tidy one-pot               |Amenity        |
|espresso                   |Dish           |
+---------------------------+---------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|nerdl_restaurant_100d_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|166.7 MB|

## Included Models

- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- NerDLModel
- NerConverter
- Finisher

---
layout: model
title: Social Determinants of Health
author: John Snow Labs
name: ner_sdoh
date: 2023-06-13
tags: [clinical, en, social_determinants, ner, public_health, sdoh, licensed]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.3
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The SDOH NER model is designed to detect and label social determinants of health (SDOH) entities within text data. Social determinants of health are crucial factors that influence individuals' health outcomes, encompassing various social, economic, and environmental elements. The model has been trained using advanced machine learning techniques on a diverse range of text sources.
The model can accurately recognize and classify a wide range of SDOH entities, including but not limited to factors such as socioeconomic status, education level, housing conditions, access to healthcare services, employment status, cultural and ethnic background, neighborhood characteristics, and environmental factors. The model's accuracy and precision have been carefully validated against expert-labeled data to ensure reliable and consistent results. Here are the labels of the SDOH NER model with their description: - `Access_To_Care`: Patient’s ability or barriers to access the care needed. "long distances, access to health care, rehab program, etc." - `Age`: All mention of ages including "Newborn, Infant, Child, Teenager, Teenage, Adult, etc." - `Alcohol`: Mentions of alcohol drinking habit. - `Chidhood_Event`: Childhood events mentioned by the patient. "childhood trauma, childhood abuse, etc." - `Communicable_Disease`: Include all the communicable diseases. "HIV, hepatitis, tuberculosis, sexually transmitted diseases, etc." - `Community_Safety`: safety of the neighborhood or places of study or work. "dangerous neighborhood, safe area, etc." - `Diet`: Information regarding the patient’s dietary habits. "vegetarian, vegan, healthy foods, low-calorie diet, etc." - `Disability`: Mentions related to disability - `Eating_Disorder`: This entity is used to extract eating disorders. "anorexia, bulimia, pica, etc." - `Education`:Patient’s educational background - `Employment`: Patient or provider occupational titles. - `Environmental_Condition`: Conditions of the environment where people live. "pollution, air quality, noisy environment, etc." - `Exercise`: Mentions of the exercise habits of a patient. "exercise, physical activity, play football, go to the gym, etc." - `Family_Member`: Nouns that refer to a family member. "mother, father, brother, sister, etc." - `Financial_Status`: Financial status refers to the state and condition of the person’s finances. 
"financial decline, debt, bankruptcy, etc."
- `Food_Insecurity`: Food insecurity is defined as a lack of consistent access to enough food for every person in a household to live an active, healthy life. "food insecurity, scarcity of protein, lack of food, etc."
- `Gender`: Gender-specific nouns and pronouns.
- `Geographic_Entity`: A geographical location, i.e. a specific physical point on Earth.
- `Healthcare_Institution`: Any place, institution, building or agency that provides health care. "hospital, clinic, trauma centers, etc."
- `Housing`: Conditions of the patient's living spaces. "homeless, housing, small apartment, etc."
- `Hyperlipidemia`: Terms that indicate hyperlipidemia and relevant subtypes. "hyperlipidemia, hypercholesterolemia, elevated cholesterol, etc."
- `Hypertension`: Terms related to hypertension. "hypertension, high blood pressure, etc."
- `Income`: Information regarding the patient's income.
- `Insurance_Status`: Information regarding the patient's insurance status. "uninsured, insured, Medicare, Medicaid, etc."
- `Language`: A system of conventional spoken, manual (signed) or written symbols by means of which human beings express themselves. "English, Spanish-speaking, bilingual, etc."
- `Legal_Issues`: Issues that have legal implications. "legal issues, legal problems, detention, in prison, etc."
- `Marital_Status`: Terms that indicate the person's marital status.
- `Mental_Health`: All mental, neurodegenerative and neurodevelopmental diagnoses, disorders, conditions or syndromes mentioned. "depression, anxiety, bipolar disorder, psychosis, etc."
- `Obesity`: Terms related to the patient being obese. "obesity, overweight, etc."
- `Other_Disease`: All other diseases mentioned. "psoriasis, thromboembolism, etc."
- `Other_SDoH_Keywords`: Terms or sentences that provide information about social determinants of health not already covered by any other entity label.
"minimal activities of daily living, lack of government programs, etc."
- `Population_Group`: The population group that a person belongs to, if it does not fall under any other entity. "refugee, prison patient, etc."
- `Quality_Of_Life`: How an individual feels about their current station in life. "lower quality of life, profoundly impact his quality of life, etc."
- `Race_Ethnicity`: Racial, ethnic, and national origins.
- `Sexual_Activity`: Mentions of the patient's sexual behaviors. "monogamous, sexual activity, inconsistent condom use, etc."
- `Sexual_Orientation`: Terms related to sexual orientation. "gay, bisexual, heterosexual, etc."
- `Smoking`: Mentions of smoking habits. "smoking, cigarette, tobacco, etc."
- `Social_Exclusion`: Absence or lack of rights or access to services or goods that are expected of the majority of the population. "social exclusion, social isolation, gender discrimination, etc."
- `Social_Support`: The presence of friends, family or other people to turn to for comfort or help. "social support, live with family, etc."
- `Spiritual_Beliefs`: Spirituality is concerned with beliefs beyond self, usually related to the existence of a superior being. "spiritual beliefs, religious beliefs, strong believer, etc."
- `Substance_Duration`: The duration associated with the health behaviors. "for 2 years, 3 months, etc."
- `Substance_Frequency`: The frequency associated with the health behaviors. "five days a week, daily, weekly, monthly, etc."
- `Substance_Quantity`: The quantity associated with the health behaviors. "2 packs, 40 ounces, ten to twelve, moderate, etc."
- `Substance_Use`: Mentions of illegal recreational drug use, as well as substances that can create dependency, including caffeine and tea. "overdose, cocaine, illicit substance intoxication, coffee, etc."
- `Transportation`: Mentions of access to means of transportation. "car, bus, train, etc."
- `Violence_Or_Abuse`: Episodes of abuse or violence experienced and reported by the patient. "domestic violence, sexual abuse, etc." ## Predicted Entities `Access_To_Care`, `Age`, `Alcohol`, `Chidhood_Event`, `Communicable_Disease`, `Community_Safety`, `Diet`, `Disability`, `Eating_Disorder`, `Education`, `Employment`, `Environmental_Condition`, `Exercise`, `Family_Member`, `Financial_Status`, `Food_Insecurity`, `Gender`, `Geographic_Entity`, `Healthcare_Institution`, `Housing`, `Hyperlipidemia`, `Hypertension`, `Income`, `Insurance_Status`, `Language`, `Legal_Issues`, `Marital_Status`, `Mental_Health`, `Obesity`, `Other_Disease`, `Other_SDoH_Keywords`, `Population_Group`, `Quality_Of_Life`, `Race_Ethnicity`, `Sexual_Activity`, `Sexual_Orientation`, `Smoking`, `Social_Exclusion`, `Social_Support`, `Spiritual_Beliefs`, `Substance_Duration`, `Substance_Frequency`, `Substance_Quantity`, `Substance_Use`, `Transportation`, `Violence_Or_Abuse` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/SOCIAL_DETERMINANT_NER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/SOCIAL_DETERMINANT_NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_en_4.4.3_3.0_1686654976160.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_en_4.4.3_3.0_1686654976160.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_sdoh", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter ]) sample_texts = [["""Smith is 55 years old, living in New York, a divorced Mexcian American woman with financial problems. She speaks Spanish and Portuguese. She lives in an apartment. She has been struggling with diabetes for the past 10 years and has recently been experiencing frequent hospitalizations due to uncontrolled blood sugar levels. Smith works as a cleaning assistant and cannot access health insurance or paid sick leave. She has a son, a student at college. Pt with likely long-standing depression. She is aware she needs rehab. Pt reports having her catholic faith as a means of support as well. She has a long history of etoh abuse, beginning in her teens. She reports she has been a daily drinker for 30 years, most recently drinking beer daily. She smokes a pack of cigarettes a day. 
She had DUI in April and was due to court this week."""]] data = spark.createDataFrame(sample_texts).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_sdoh", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter )) val data = Seq("""Smith is 55 years old, living in New York, a divorced Mexcian American woman with financial problems. She speaks Spanish and Portuguese. She lives in an apartment. She has been struggling with diabetes for the past 10 years and has recently been experiencing frequent hospitalizations due to uncontrolled blood sugar levels. Smith works as a cleaning assistant and cannot access health insurance or paid sick leave. She has a son, a student at college. Pt with likely long-standing depression. She is aware she needs rehab. Pt reports having her catholic faith as a means of support as well. She has a long history of etoh abuse, beginning in her teens. She reports she has been a daily drinker for 30 years, most recently drinking beer daily. She smokes a pack of cigarettes a day. 
She had DUI in April and was due to court this week.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
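Once the pipeline above has run, the chunk annotations in the `ner_chunk` column can be flattened into the chunk/begin/end/label table shown under Results. Below is a minimal pure-Python sketch of that flattening step; the dict layout mimics Spark NLP's annotation schema (chunk text in `result`, offsets in `begin`/`end`, label in `metadata["entity"]`), and the sample rows are taken from the model's output on the text above.

```python
# Minimal sketch: flattening chunk annotations into (chunk, begin, end, label) rows.
# Spark NLP chunk annotations carry the text in `result`, offsets in `begin`/`end`,
# and the entity label in `metadata["entity"]`; here we mimic a few as plain dicts.

def flatten_chunks(annotations):
    """Turn a list of annotation-like dicts into (chunk, begin, end, label) tuples."""
    return [
        (a["result"], a["begin"], a["end"], a["metadata"]["entity"])
        for a in annotations
    ]

sample = [
    {"result": "55 years old", "begin": 9, "end": 20, "metadata": {"entity": "Age"}},
    {"result": "New York", "begin": 33, "end": 40, "metadata": {"entity": "Geographic_Entity"}},
    {"result": "divorced", "begin": 45, "end": 52, "metadata": {"entity": "Marital_Status"}},
]

for chunk, begin, end, label in flatten_chunks(sample):
    print(f"{chunk:<15}{begin:<6}{end:<6}{label}")
```

With a real result DataFrame, the equivalent is to `explode` the `ner_chunk` column and select the `result`, `begin`, `end`, and `metadata` fields; the pure-Python version is shown here only so the logic is easy to follow.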
## Results ```bash +------------------+-----+---+-------------------+ |chunk |begin|end|ner_label | +------------------+-----+---+-------------------+ |55 years old |9 |20 |Age | |New York |33 |40 |Geographic_Entity | |divorced |45 |52 |Marital_Status | |Mexcian American |54 |69 |Race_Ethnicity | |woman |71 |75 |Gender | |financial problems|82 |99 |Financial_Status | |She |102 |104|Gender | |Spanish |113 |119|Language | |Portuguese |125 |134|Language | |She |137 |139|Gender | |apartment |153 |161|Housing | |She |164 |166|Gender | |diabetes |193 |200|Other_Disease | |hospitalizations |268 |283|Other_SDoH_Keywords| |cleaning assistant|342 |359|Employment | |health insurance |379 |394|Insurance_Status | |She |416 |418|Gender | |son |426 |428|Family_Member | |student |433 |439|Education | |college |444 |450|Education | |depression |482 |491|Mental_Health | |She |494 |496|Gender | |she |507 |509|Gender | |rehab |517 |521|Access_To_Care | |her |542 |544|Gender | |catholic faith |546 |559|Spiritual_Beliefs | |support |575 |581|Social_Support | |She |593 |595|Gender | |etoh abuse |619 |628|Alcohol | |her |644 |646|Gender | |teens |648 |652|Age | |She |655 |657|Gender | |she |667 |669|Gender | |daily |682 |686|Substance_Frequency| |drinker |688 |694|Alcohol | |30 years |700 |707|Substance_Duration | |drinking |724 |731|Alcohol | |beer |733 |736|Alcohol | |daily |738 |742|Substance_Frequency| |She |745 |747|Gender | |smokes |749 |754|Smoking | |a pack |756 |761|Substance_Quantity | |cigarettes |766 |775|Smoking | |a day |777 |781|Substance_Frequency| |She |784 |786|Gender | |DUI |792 |794|Legal_Issues | +------------------+-----+---+-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_sdoh| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.0 MB| |Dependencies:|embeddings_clinical| ## References Internal 
SDOH Project ## Benchmarking ```bash label precision recall f1-score support Access_To_Care 0.88 0.88 0.88 544 Age 0.93 0.95 0.94 466 Alcohol 0.96 0.98 0.97 263 Chidhood_Event 1.00 0.53 0.69 19 Communicable_Disease 0.96 1.00 0.98 51 Community_Safety 0.95 0.85 0.89 65 Diet 0.89 0.75 0.82 89 Disability 0.93 0.97 0.95 38 Eating_Disorder 0.95 0.91 0.93 44 Education 0.89 0.94 0.91 67 Employment 0.94 0.95 0.95 2087 Environmental_Condition 0.95 1.00 0.97 18 Exercise 0.90 0.86 0.88 65 Family_Member 0.98 0.98 0.98 2061 Financial_Status 0.73 0.77 0.75 116 Food_Insecurity 0.81 0.88 0.84 43 Gender 0.99 0.98 0.98 4858 Geographic_Entity 0.95 0.84 0.89 106 Healthcare_Institution 0.94 0.98 0.96 721 Housing 0.93 0.88 0.90 369 Hyperlipidemia 0.86 0.60 0.71 10 Hypertension 0.88 0.77 0.82 30 Income 0.82 0.75 0.79 61 Insurance_Status 0.89 0.82 0.85 66 Language 0.96 0.92 0.94 26 Legal_Issues 0.89 0.79 0.84 63 Marital_Status 0.96 1.00 0.98 90 Mental_Health 0.87 0.85 0.86 491 Obesity 0.85 1.00 0.92 11 Other_Disease 0.90 0.89 0.90 609 Other_SDoH_Keywords 0.81 0.85 0.83 258 Population_Group 1.00 0.86 0.92 14 Quality_Of_Life 0.61 0.86 0.72 44 Race_Ethnicity 0.97 0.90 0.94 41 Sexual_Activity 0.78 0.84 0.81 45 Sexual_Orientation 0.80 1.00 0.89 12 Smoking 0.99 0.97 0.98 76 Social_Exclusion 0.88 0.88 0.88 25 Social_Support 0.87 0.95 0.91 676 Spiritual_Beliefs 0.83 0.83 0.83 53 Substance_Duration 0.78 0.69 0.73 52 Substance_Frequency 0.52 0.83 0.64 46 Substance_Quantity 0.82 0.88 0.85 51 Substance_Use 0.91 0.97 0.94 207 Transportation 0.85 0.83 0.84 42 Violence_Or_Abuse 0.87 0.70 0.77 112 micro-avg 0.94 0.94 0.94 15301 macro-avg 0.88 0.87 0.87 15301 weighted-avg 0.94 0.94 0.94 15301 ``` --- layout: model title: English BertForQuestionAnswering model (from FardinSaboori) author: John Snow Labs name: bert_qa_FardinSaboori_bert_finetuned_squad date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `FardinSaboori`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_FardinSaboori_bert_finetuned_squad_en_4.0.0_3.0_1654535437430.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_FardinSaboori_bert_finetuned_squad_en_4.0.0_3.0_1654535437430.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_FardinSaboori_bert_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_FardinSaboori_bert_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_FardinSaboori").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
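The NLU one-liner above packs the question and context into a single string separated by `|||`. A tiny helper (hypothetical; not part of the `nlu` package) makes that input format explicit:

```python
def to_nlu_qa_input(question, context):
    """Join a question/context pair with the '|||' separator used in the nlu QA example."""
    return f"{question}|||{context}"

text = to_nlu_qa_input("What's my name?", "My name is Clara and I live in Berkeley.")
print(text)  # What's my name?|||My name is Clara and I live in Berkeley.
```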
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_FardinSaboori_bert_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/FardinSaboori/bert-finetuned-squad

---
layout: model
title: Bangla Bert Embeddings (from Kowsher)
author: John Snow Labs
name: bert_embeddings_bangla_bert
date: 2022-04-11
tags: [bert, embeddings, bn, open_source]
task: Embeddings
language: bn
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bangla-bert` is a Bangla model originally trained by `Kowsher`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bangla_bert_bn_3.4.2_3.0_1649673360956.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bangla_bert_bn_3.4.2_3.0_1649673360956.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bangla_bert","bn") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["আমি স্পার্ক এনএলপি ভালোবাসি"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bangla_bert","bn") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("আমি স্পার্ক এনএলপি ভালোবাসি").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("bn.embed.bangla_bert").predict("""আমি স্পার্ক এনএলপি ভালোবাসি""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bangla_bert|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|bn|
|Size:|615.0 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/Kowsher/bangla-bert
- https://github.com/Kowsher/bert-base-bangla
- https://arxiv.org/abs/1810.04805
- https://github.com/google-research/bert
- https://www.kaggle.com/gakowsher/bangla-language-model-dataset
- https://ssrn.com/abstract=
- http://kowsher.org/

---
layout: model
title: Translate Central Bikol to English Pipeline
author: John Snow Labs
name: translate_bcl_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, bcl, en, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended.
- source languages: `bcl` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_bcl_en_xx_2.7.0_2.4_1609688692603.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_bcl_en_xx_2.7.0_2.4_1609688692603.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_bcl_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_bcl_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.bcl.translate_to.en').predict(text, output_level='sentence') translate_df ```
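Since Marian inference cost grows quickly with sequence length (see the note in the description above), long inputs are often split into sentence-sized pieces before being passed to `pipeline.annotate`. Below is a naive, illustrative splitter written for this sketch only; a production pipeline would rely on Spark NLP's sentence detector instead of a period-based segmenter.

```python
def split_for_translation(text, max_chars=200):
    """Greedily pack period-delimited sentences into chunks no longer than max_chars.

    A single sentence longer than max_chars is kept whole rather than cut mid-sentence.
    """
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

# Each chunk can then be passed to pipeline.annotate(...) individually.
chunks = split_for_translation("First sentence. Second sentence. " * 10, max_chars=80)
```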
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_bcl_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Translate English to Seychellois Creole Pipeline
author: John Snow Labs
name: translate_en_crs
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, crs, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended.

- source languages: `en`
- target languages: `crs`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_crs_xx_2.7.0_2.4_1609691811394.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_crs_xx_2.7.0_2.4_1609691811394.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_crs", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_crs", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.crs').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_en_crs|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English DistilBertForQuestionAnswering model (from Wiam)
author: John Snow Labs
name: distilbert_qa_Wiam_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Wiam`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Wiam_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724868151.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Wiam_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724868151.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Wiam_base_uncased_finetuned_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Wiam_base_uncased_finetuned_squad","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Wiam").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_Wiam_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/Wiam/distilbert-base-uncased-finetuned-squad

---
layout: model
title: English RobertaForQuestionAnswering (from deepset)
author: John Snow Labs
name: roberta_qa_roberta_base_squad2_distilled
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-distilled` is an English model originally trained by `deepset`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2_distilled_en_4.0.0_3.0_1655735282920.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2_distilled_en_4.0.0_3.0_1655735282920.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_squad2_distilled","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_squad2_distilled","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.distilled_base.by_deepset").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_squad2_distilled| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|463.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/deepset/roberta-base-squad2-distilled - https://www.linkedin.com/company/deepset-ai/ - https://haystack.deepset.ai/community/join - https://github.com/deepset-ai/FARM - http://www.deepset.ai/jobs - https://twitter.com/deepset_ai - https://github.com/deepset-ai/haystack/discussions - https://github.com/deepset-ai/haystack/ - https://deepset.ai - https://deepset.ai/germanquad - https://deepset.ai/german-bert --- layout: model title: Mapping Entities (Clinical Drugs) with Corresponding UMLS CUI Codes author: John Snow Labs name: umls_clinical_drugs_mapper date: 2022-07-06 tags: [umls, chunk_mapper, clinical, licensed, en] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps entities (Clinical Drugs) with their corresponding UMLS CUI codes. 
## Predicted Entities `umls_code` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/umls_clinical_drugs_mapper_en_4.0.0_3.0_1657124255341.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/umls_clinical_drugs_mapper_en_4.0.0_3.0_1657124255341.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("clinical_ner") ner_model_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "clinical_ner"])\ .setOutputCol("ner_chunk") chunkerMapper = ChunkMapperModel.pretrained("umls_clinical_drugs_mapper", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings")\ .setRels(["umls_code"])\ .setLowerCase(True) mapper_pipeline = Pipeline().setStages([ document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_model_converter, chunkerMapper]) sample_text="""She was immediately given hydrogen peroxide 30 mg, and has been advised Neosporin Cream for 5 days. 
She has a history of taking magnesium hydroxide 100mg/1ml and metformin 1000 mg.""" test_data = spark.createDataFrame([[sample_text]]).toDF("text") result = mapper_pipeline.fit(test_data).transform(test_data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel .pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("clinical_ner") val ner_model_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "clinical_ner")) .setOutputCol("ner_chunk") val chunkerMapper = ChunkMapperModel .pretrained("umls_clinical_drugs_mapper", "en", "clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("mappings") .setRels(Array("umls_code")) .setLowerCase(true) val mapper_pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_model_converter, chunkerMapper)) val test_data = Seq("She was immediately given hydrogen peroxide 30 mg, and has been advised Neosporin Cream for 5 days. She has a history of taking magnesium hydroxide 100mg/1ml and metformin 1000 mg.").toDF("text") val result = mapper_pipeline.fit(test_data).transform(test_data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.umls_clinical_drugs_mapper").predict("""She was immediately given hydrogen peroxide 30 mg, and has been advised Neosporin Cream for 5 days. She has a history of taking magnesium hydroxide 100mg/1ml and metformin 1000 mg.""") ```
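In the transformed DataFrame, the `ner_chunk` and `mappings` annotation arrays line up positionally, one UMLS code per detected chunk. A minimal pure-Python sketch of that pairing, using example values from the Results section (in Spark you would read the actual columns from `result` with `selectExpr`):

```python
# Chunks and codes as the pipeline above would emit them; in Spark they
# arrive as parallel annotation arrays, paired by position.
ner_chunks = ["hydrogen peroxide", "Neosporin Cream", "magnesium hydroxide", "metformin"]
umls_codes = ["C0020281", "C0132149", "C0024476", "C0025598"]

# Zip the parallel arrays into a chunk -> code lookup.
chunk_to_code = dict(zip(ner_chunks, umls_codes))
print(chunk_to_code["metformin"])  # C0025598
```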
## Results ```bash +-------------------+---------+ |ner_chunk |umls_code| +-------------------+---------+ |hydrogen peroxide |C0020281 | |Neosporin Cream |C0132149 | |magnesium hydroxide|C0024476 | |metformin |C0025598 | +-------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|umls_clinical_drugs_mapper| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|23.3 MB| ## References 2022AA UMLS dataset’s Clinical Drug category. https://www.nlm.nih.gov/research/umls/index.html --- layout: model title: Detect Clinical Entities (ner_eu_clinical_case - fr) author: John Snow Labs name: ner_eu_clinical_case date: 2023-02-01 tags: [fr, clinical, licensed, ner] task: Named Entity Recognition language: fr edition: Healthcare NLP 4.2.8 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition (NER) deep learning model for extracting clinical entities from French texts. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives.
## Predicted Entities `clinical_event`, `bodypart`, `clinical_condition`, `units_measurements`, `patient`, `date_time` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_fr_4.2.8_3.0_1675293960896.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_fr_4.2.8_3.0_1675293960896.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","fr")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained('ner_eu_clinical_case', "fr", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["""Un garçon de 3 ans atteint d'un trouble autistique à l'hôpital du service pédiatrique A de l'hôpital universitaire. Il n'a pas d'antécédents familiaux de troubles ou de maladies du spectre autistique. Le garçon a été diagnostiqué avec un trouble de communication sévère, avec des difficultés d'interaction sociale et un traitement sensoriel retardé. Les tests sanguins étaient normaux (thyréostimuline (TSH), hémoglobine, volume globulaire moyen (MCV) et ferritine). L'endoscopie haute a également montré une tumeur sous-muqueuse provoquant une obstruction subtotale de la sortie gastrique. Devant la suspicion d'une tumeur stromale gastro-intestinale, une gastrectomie distale a été réalisée. 
L'examen histopathologique a révélé une prolifération de cellules fusiformes dans la couche sous-muqueuse."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","fr") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_eu_clinical_case", "fr", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documenter, sentenceDetector, tokenizer, word_embeddings, ner_model, ner_converter)) val data = Seq(Array("""Un garçon de 3 ans atteint d'un trouble autistique à l'hôpital du service pédiatrique A de l'hôpital universitaire. Il n'a pas d'antécédents familiaux de troubles ou de maladies du spectre autistique. Le garçon a été diagnostiqué avec un trouble de communication sévère, avec des difficultés d'interaction sociale et un traitement sensoriel retardé. Les tests sanguins étaient normaux (thyréostimuline (TSH), hémoglobine, volume globulaire moyen (MCV) et ferritine). L'endoscopie haute a également montré une tumeur sous-muqueuse provoquant une obstruction subtotale de la sortie gastrique. Devant la suspicion d'une tumeur stromale gastro-intestinale, une gastrectomie distale a été réalisée. L'examen histopathologique a révélé une prolifération de cellules fusiformes dans la couche sous-muqueuse.""")).toDS().toDF("text") val result = pipeline.fit(data).transform(data) ```
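A frequent follow-up step is tallying how many chunks each label received. A small, library-free sketch over (chunk, label) pairs like those in the Results section below (in practice these would be read from the `ner_chunk` annotations):

```python
from collections import Counter

# A few (chunk, label) pairs taken from the model's output table.
pairs = [
    ("Un garçon de 3 ans", "patient"),
    ("trouble autistique à l'hôpital du service pédiatrique", "clinical_condition"),
    ("diagnostiqué", "clinical_event"),
    ("tests", "clinical_event"),
    ("normaux", "units_measurements"),
]

# Count chunks per NER label.
label_counts = Counter(label for _, label in pairs)
print(label_counts["clinical_event"])  # 2
```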
## Results ```bash +-----------------------------------------------------+------------------+ |chunk |ner_label | +-----------------------------------------------------+------------------+ |Un garçon de 3 ans |patient | |trouble autistique à l'hôpital du service pédiatrique|clinical_condition| |l'hôpital |clinical_event | |Il n'a |patient | |d'antécédents |clinical_event | |troubles |clinical_condition| |maladies |clinical_condition| |du spectre autistique |bodypart | |Le garçon |patient | |diagnostiqué |clinical_event | |trouble |clinical_condition| |difficultés |clinical_event | |traitement |clinical_event | |tests |clinical_event | |normaux |units_measurements| |thyréostimuline |clinical_event | |TSH |clinical_event | |ferritine |clinical_event | |L'endoscopie |clinical_event | |montré |clinical_event | |tumeur sous-muqueuse |clinical_condition| |provoquant |clinical_event | |obstruction |clinical_condition| |la sortie gastrique |bodypart | |suspicion |clinical_event | |tumeur stromale gastro-intestinale |clinical_condition| |gastrectomie |clinical_event | |L'examen |clinical_event | |révélé |clinical_event | |prolifération |clinical_event | |cellules fusiformes |bodypart | |la couche sous-muqueuse |bodypart | +-----------------------------------------------------+------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_eu_clinical_case| |Compatibility:|Healthcare NLP 4.2.8+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|fr| |Size:|895.0 KB| ## References The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives.
## Benchmarking ```bash label tp fp fn total precision recall f1 date_time 49.0 14.0 70.0 104.0 0.7778 0.7000 0.7368 units_measurements 92.0 19.0 6.0 48.0 0.8288 0.9388 0.8804 clinical_condition 178.0 74.0 73.0 120.0 0.7063 0.7092 0.7078 patient 114.0 6.0 15.0 87.0 0.9500 0.8837 0.9157 clinical_event 265.0 81.0 71.0 478.0 0.7659 0.7887 0.7771 bodypart 243.0 34.0 64.0 166.0 0.8773 0.7915 0.8322 macro - - - - - - 0.8083 micro - - - - - - 0.7978 ``` --- layout: model title: English DistilBertForQuestionAnswering model (from 21iridescent) author: John Snow Labs name: distilbert_qa_21iridescent_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `21iridescent`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_21iridescent_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724023073.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_21iridescent_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724023073.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_21iridescent_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_21iridescent_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_21iridescent").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
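The NLU one-liner above passes question and context as a single string joined by `|||`. A small illustration of that convention (the split itself is plain string handling):

```python
combined = "What is my name?|||My name is Clara and I live in Berkeley."

# Split once on the separator to recover the two fields.
question, context = combined.split("|||", 1)
print(question)  # What is my name?
print(context)   # My name is Clara and I live in Berkeley.
```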
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_21iridescent_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/21iridescent/distilbert-base-uncased-finetuned-squad --- layout: model title: English AlbertForQuestionAnswering model (from AyushPJ) author: John Snow Labs name: albert_qa_ai_club_inductions_21_nlp date: 2022-06-24 tags: [en, open_source, albert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: AlBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ai-club-inductions-21-nlp-ALBERT` is an English model originally trained by `AyushPJ`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_ai_club_inductions_21_nlp_en_4.0.0_3.0_1656063682959.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_ai_club_inductions_21_nlp_en_4.0.0_3.0_1656063682959.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_ai_club_inductions_21_nlp","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_ai_club_inductions_21_nlp","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.albert.by_AyushPJ").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_qa_ai_club_inductions_21_nlp| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|42.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AyushPJ/ai-club-inductions-21-nlp-ALBERT --- layout: model title: Spanish RobertaForQuestionAnswering (from mrm8488) author: John Snow Labs name: roberta_qa_RuPERTa_base_finetuned_squadv1 date: 2022-06-20 tags: [es, open_source, question_answering, roberta] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `RuPERTa-base-finetuned-squadv1` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_RuPERTa_base_finetuned_squadv1_es_4.0.0_3.0_1655727321165.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_RuPERTa_base_finetuned_squadv1_es_4.0.0_3.0_1655727321165.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_RuPERTa_base_finetuned_squadv1","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_RuPERTa_base_finetuned_squadv1","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.squad.ruperta.base.by_mrm8488").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_RuPERTa_base_finetuned_squadv1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|es| |Size:|470.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/RuPERTa-base-finetuned-squadv1 --- layout: model title: Word2Vec Embeddings in Spanish (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_es_3.4.1_3.0_1647459363492.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_es_3.4.1_3.0_1647459363492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.w2v_cc_300d").predict("""Me encanta Spark NLP""") ```
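Each token is mapped to a 300-dimensional vector, and a common downstream use is comparing tokens by cosine similarity. A library-free sketch with toy 3-d vectors standing in for the model's 300-d embeddings:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product over the product of magnitudes.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing the same way score ~1.0; orthogonal vectors score 0.0.
print(cosine([1.0, 0.0, 1.0], [2.0, 0.0, 2.0]))  # ~1.0
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```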
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|es| |Size:|1.3 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Sentiment Analysis of French texts author: John Snow Labs name: classifierdl_bert_sentiment date: 2021-09-08 tags: [fr, sentiment, classification, open_source] task: Sentiment Analysis language: fr edition: Spark NLP 3.2.0 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model identifies the sentiments (positive or negative) in French texts. ## Predicted Entities `POSITIVE`, `NEGATIVE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_FR/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_Fr_Sentiment.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_bert_sentiment_fr_3.2.0_2.4_1631104713514.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_bert_sentiment_fr_3.2.0_2.4_1631104713514.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = BertSentenceEmbeddings\ .pretrained('labse', 'xx') \ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") sentimentClassifier = ClassifierDLModel.pretrained("classifierdl_bert_sentiment", "fr") \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") fr_sentiment_pipeline = Pipeline(stages=[document, embeddings, sentimentClassifier]) light_pipeline = LightPipeline(fr_sentiment_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) result1 = light_pipeline.annotate("Mignolet vraiment dommage de ne jamais le voir comme titulaire") result2 = light_pipeline.annotate("Je me sens bien, je suis heureux d'être de retour.") print(result1["class"], result2["class"], sep = "\n") ``` ```scala val document = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val embeddings = BertSentenceEmbeddings .pretrained("labse", "xx") .setInputCols(Array("document")) .setOutputCol("sentence_embeddings") val sentimentClassifier = ClassifierDLModel.pretrained("classifierdl_bert_sentiment", "fr") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val fr_sentiment_pipeline = new Pipeline().setStages(Array(document, embeddings, sentimentClassifier)) val light_pipeline = new LightPipeline(fr_sentiment_pipeline.fit(Seq("").toDF("text"))) val result1 = light_pipeline.annotate("Mignolet vraiment dommage de ne jamais le voir comme titulaire") val result2 = light_pipeline.annotate("Je me sens bien, je suis heureux d'être de retour.") ``` {:.nlu-block} ```python import nlu nlu.load("fr.classify.sentiment.bert").predict("""Mignolet vraiment dommage de ne jamais le voir comme titulaire""")
```
## Results ```bash ['NEGATIVE'] ['POSITIVE'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_bert_sentiment| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|fr| ## Data Source https://github.com/charlesmalafosse/open-dataset-for-sentiment-analysis/ ## Benchmarking ```bash precision recall f1-score support NEGATIVE 0.82 0.72 0.77 378 POSITIVE 0.92 0.95 0.94 1240 accuracy 0.90 1618 macro avg 0.87 0.84 0.85 1618 weighted avg 0.90 0.90 0.90 1618 ``` --- layout: model title: Legal Confidential Clause Binary Classifier author: John Snow Labs name: legclf_confidential_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `confidential` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
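The paragraph-splitting advice above (split by multiline before classifying) can be sketched without any Spark dependency; `split_paragraphs` is a hypothetical helper name, not part of the library:

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines so each chunk stays well under
    the classifier's 512-token embedding limit."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("1. CONFIDENTIALITY.\nEach party agrees to keep the terms secret.\n\n"
       "2. GOVERNING LAW.\nThis Agreement is governed by the laws of the State.")
for clause in split_paragraphs(doc):
    print(clause[:20])
```

Each resulting chunk would then be fed to the classifier as a separate row.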
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `confidential` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_confidential_clause_en_1.0.0_3.2_1660122270121.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_confidential_clause_en_1.0.0_3.2_1660122270121.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_confidential_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +--------------+ | result| +--------------+ |[confidential]| |[other]| |[other]| |[confidential]| +--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_confidential_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support confidential 0.90 0.90 0.90 41 other 0.97 0.97 0.97 127 accuracy - - 0.95 168 macro-avg 0.94 0.94 0.94 168 weighted-avg 0.95 0.95 0.95 168 ``` --- layout: model title: Stop Words Cleaner for Sesotho author: John Snow Labs name: stopwords_st date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: st edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, st] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions.
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_st_st_2.5.4_2.4_1594742438831.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_st_st_2.5.4_2.4_1594742438831.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_st", "st") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Ntle le ho ba morena oa leboea, John Snow ke ngaka ea Lenyesemane ebile ke moetapele nts'etsopele ea anesthesia le bohloeki ba bongaka.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_st", "st") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Ntle le ho ba morena oa leboea, John Snow ke ngaka ea Lenyesemane ebile ke moetapele nts'etsopele ea anesthesia le bohloeki ba bongaka.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Ntle le ho ba morena oa leboea, John Snow ke ngaka ea Lenyesemane ebile ke moetapele nts'etsopele ea anesthesia le bohloeki ba bongaka."""] stopword_df = nlu.load('st.stopwords').predict(text) stopword_df[['cleanTokens']] ```
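Conceptually the cleaner is a token filter: any token found in the language's stop list is dropped. A toy pure-Python equivalent (the stop list below is a hypothetical subset for illustration only; the pretrained model ships its own Sesotho list):

```python
# Hypothetical subset of Sesotho stop words, for illustration only.
stop_words = {"le", "ho", "ba", "oa", "ea", "ke"}

tokens = ["Ntle", "le", "ho", "ba", "morena", "oa", "leboea"]

# Keep only tokens that are not in the stop list (case-insensitive).
clean_tokens = [t for t in tokens if t.lower() not in stop_words]
print(clean_tokens)  # ['Ntle', 'morena', 'leboea']
```

This mirrors the Results section, where only the content-bearing tokens survive.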
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=3, result='Ntle', metadata={'sentence': '0'}), Row(annotatorType='token', begin=14, end=19, result='morena', metadata={'sentence': '0'}), Row(annotatorType='token', begin=24, end=29, result='leboea', metadata={'sentence': '0'}), Row(annotatorType='token', begin=30, end=30, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=32, end=35, result='John', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_st| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|st| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: English Bert Embeddings Cased model (from Tristan) author: John Snow Labs name: bert_embeddings_olm_base_uncased_oct_2022 date: 2023-02-21 tags: [open_source, bert, bert_embeddings, bertformaskedlm, en, tensorflow] task: Embeddings language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `olm-bert-base-uncased-oct-2022` is an English model originally trained by `Tristan`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_olm_base_uncased_oct_2022_en_4.3.0_3.0_1676999449577.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_olm_base_uncased_oct_2022_en_4.3.0_3.0_1676999449577.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_olm_base_uncased_oct_2022","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_olm_base_uncased_oct_2022","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
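The `embeddings` column produced above holds one dense vector per token. A common downstream step is comparing those vectors with cosine similarity; the following is a minimal stdlib sketch of that comparison (illustrative only, not part of the Spark NLP API):

```python
import math

def cosine_similarity(u, v):
    # dot(u, v) / (|u| * |v|); assumes non-zero vectors of equal length
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# identical vectors score 1.0; orthogonal vectors score 0.0
assert abs(cosine_similarity([1.0, 2.0], [1.0, 2.0]) - 1.0) < 1e-9
assert abs(cosine_similarity([1.0, 0.0], [0.0, 1.0])) < 1e-9
```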
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_olm_base_uncased_oct_2022| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|467.5 MB| |Case sensitive:|true| ## References https://huggingface.co/Tristan/olm-bert-base-uncased-oct-2022 --- layout: model title: Detect PHI for Deidentification in Romanian (BERT) author: John Snow Labs name: ner_deid_subentity_bert date: 2022-06-27 tags: [deidentification, bert, phi, ner, ro, licensed] task: Named Entity Recognition language: ro edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Romanian) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It is trained with `bert_base_cased` embeddings and can detect 17 entities. This NER model is trained with a combination of custom datasets with several data augmentation mechanisms.
## Predicted Entities `AGE`, `CITY`, `COUNTRY`, `DATE`, `DOCTOR`, `EMAIL`, `FAX`, `HOSPITAL`, `IDNUM`, `LOCATION-OTHER`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `PROFESSION`, `STREET`, `ZIP` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_bert_ro_4.0.0_3.0_1656311815383.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_bert_ro_4.0.0_3.0_1656311815383.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")\ .setInputCols(["sentence","token"])\ .setOutputCol("word_embeddings") clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_bert", "ro", "clinical/models")\ .setInputCols(["sentence","token","word_embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter]) text = """ Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""" data = spark.createDataFrame([[text]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro") .setInputCols(Array("sentence","token")) .setOutputCol("word_embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_bert", "ro", "clinical/models") .setInputCols(Array("sentence","token","word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( 
documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter)) val text = """Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""" val data = Seq(text).toDS.toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ro.med_ner.deid.subentity.bert").predict(""" Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""") ```
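The detected PHI chunks are typically the input to a masking step. The sketch below is a deliberately simplified stand-in for that step (Spark NLP's licensed `DeIdentification` annotator does this properly, working from character offsets rather than naive string replacement):

```python
def mask_phi(text, chunks):
    # chunks: (surface_form, label) pairs, e.g. taken from ner_chunk results;
    # naive string replacement -- real de-identification uses char offsets
    for surface, label in chunks:
        text = text.replace(surface, f"<{label}>")
    return text

masked = mask_phi(
    "Nume si Prenume : BUREAN MARIA, Varsta: 77",
    [("BUREAN MARIA", "PATIENT"), ("77", "AGE")],
)
assert masked == "Nume si Prenume : <PATIENT>, Varsta: <AGE>"
```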
## Results ```bash +----------------------------+---------+ |chunk |ner_label| +----------------------------+---------+ |Spitalul Pentru Ochi de Deal|HOSPITAL | |Drumul Oprea Nr |STREET | |Vaslui |CITY | |737405 |ZIP | |+40(235)413773 |PHONE | |25 May 2022 |DATE | |BUREAN MARIA |PATIENT | |77 |AGE | |Agota Evelyn Tımar |DOCTOR | |2450502264401 |IDNUM | +----------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity_bert| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ro| |Size:|16.5 MB| ## References - Custom John Snow Labs datasets - Data augmentation techniques ## Benchmarking ```bash label precision recall f1-score support AGE 0.98 0.95 0.96 1186 CITY 0.94 0.87 0.90 299 COUNTRY 0.90 0.73 0.81 108 DATE 0.98 0.95 0.96 4518 DOCTOR 0.91 0.94 0.93 1979 EMAIL 1.00 0.62 0.77 8 FAX 0.98 0.95 0.96 56 HOSPITAL 0.92 0.85 0.88 881 IDNUM 0.98 0.99 0.98 235 LOCATION-OTHER 1.00 0.85 0.92 13 MEDICALRECORD 0.99 1.00 1.00 444 ORGANIZATION 0.86 0.76 0.81 75 PATIENT 0.91 0.87 0.89 937 PHONE 0.96 0.98 0.97 302 PROFESSION 0.85 0.82 0.83 161 STREET 0.96 0.94 0.95 173 ZIP 0.99 0.98 0.99 138 micro-avg 0.95 0.93 0.94 11513 macro-avg 0.95 0.89 0.91 11513 weighted-avg 0.95 0.93 0.94 11513 ``` --- layout: model title: English BertForQuestionAnswering model (from SauravMaheshkar) author: John Snow Labs name: bert_qa_bert_base_cased_chaii date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`bert-base-cased-chaii` is an English model originally trained by `SauravMaheshkar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_cased_chaii_en_4.0.0_3.0_1654179712101.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_cased_chaii_en_4.0.0_3.0_1654179712101.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_cased_chaii","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_cased_chaii","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.chaii.bert.base_cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
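In the nlu one-liner above, the question and context are packed into a single string joined by a `|||` separator (as the examples on these cards suggest). The packing itself reduces to simple string handling:

```python
SEP = "|||"

def make_qa_input(question, context):
    # nlu-style packed input: question and context in one string,
    # separated by '|||' (format inferred from the card's nlu example)
    return f"{question}{SEP}{context}"

packed = make_qa_input("What's my name?", "My name is Clara and I live in Berkeley.")
assert packed.split(SEP) == [
    "What's my name?",
    "My name is Clara and I live in Berkeley.",
]
```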
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_cased_chaii| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/SauravMaheshkar/bert-base-cased-chaii --- layout: model title: Legal Arguments Mining in Court Decisions author: John Snow Labs name: legclf_argument_mining date: 2023-03-26 tags: [en, classification, licensed, legal, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a multiclass classification model that classifies arguments in legal discourse into the following classes: `subsumption`, `definition`, `conclusion`, `other`. ## Predicted Entities `subsumption`, `definition`, `conclusion`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_argument_mining_en_1.0.0_3.0_1679829561976.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_argument_mining_en_1.0.0_3.0_1679829561976.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")\ .setInputCols(["document", "token"])\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512) embeddingsSentence = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") docClassifier = legal.ClassifierDLModel.pretrained("legclf_argument_mining","en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, embeddingsSentence, docClassifier ]) df = spark.createDataFrame([["There is therefore no doubt – and the Government do not contest – that the measures concerned in the present case ( the children 's continued placement in foster homes and the restrictions imposed on contact between the applicants and their children ) amounts to an “ interference ” with the applicants ' rights to respect for their family life ."]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) result.select("text", "category.result").show() ```
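The `SentenceEmbeddings` stage with `setPoolingStrategy("AVERAGE")` collapses the per-token RoBERTa vectors into a single document vector by element-wise averaging before classification. Conceptually, the pooling step amounts to:

```python
def average_pool(token_vectors):
    # element-wise mean over token embeddings -> one sentence vector
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

# two 3-dimensional token vectors -> one averaged sentence vector
assert average_pool([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]) == [2.0, 3.0, 4.0]
```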
## Results ```bash +--------------------+-------------+ | text| result| +--------------------+-------------+ |There is therefor...|[subsumption]| +--------------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_argument_mining| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.2 MB| ## References Train dataset available [here](https://huggingface.co/datasets/MeilingShi/legal_argument_mining) ## Benchmarking ```bash label precision recall f1-score support conclusion 0.93 0.79 0.85 52 definition 0.87 0.81 0.84 58 other 0.88 0.88 0.88 57 subsumption 0.64 0.79 0.71 52 accuracy - - 0.82 219 macro-avg 0.83 0.82 0.82 219 weighted-avg 0.83 0.82 0.82 219 ``` --- layout: model title: Chinese BertForMaskedLM Cased model (from hfl) author: John Snow Labs name: bert_embeddings_hfl_chinese_roberta_wwm_ext date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese-roberta-wwm-ext` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_hfl_chinese_roberta_wwm_ext_zh_4.2.4_3.0_1670021322707.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_hfl_chinese_roberta_wwm_ext_zh_4.2.4_3.0_1670021322707.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_hfl_chinese_roberta_wwm_ext","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_hfl_chinese_roberta_wwm_ext","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_hfl_chinese_roberta_wwm_ext| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|383.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/chinese-roberta-wwm-ext - https://arxiv.org/abs/1906.08101 - https://github.com/google-research/bert - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://arxiv.org/abs/2004.13922 --- layout: model title: English BertForQuestionAnswering model (from peterhsu) author: John Snow Labs name: bert_qa_peterhsu_bert_finetuned_squad_accelerate date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad-accelerate` is an English model originally trained by `peterhsu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_peterhsu_bert_finetuned_squad_accelerate_en_4.0.0_3.0_1654535878232.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_peterhsu_bert_finetuned_squad_accelerate_en_4.0.0_3.0_1654535878232.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_peterhsu_bert_finetuned_squad_accelerate","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_peterhsu_bert_finetuned_squad_accelerate","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.accelerate.by_peterhsu").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_peterhsu_bert_finetuned_squad_accelerate| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/peterhsu/bert-finetuned-squad-accelerate --- layout: model title: Fast Neural Machine Translation Model from Estonian to English author: John Snow Labs name: opus_mt_et_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, et, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `et` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_et_en_xx_2.7.0_2.4_1609170283601.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_et_en_xx_2.7.0_2.4_1609170283601.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_et_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_et_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.et.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_et_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English asr_wav2vec2_base_timit_moaiz_exp2 TFWav2Vec2ForCTC from moaiz237 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_moaiz_exp2 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_moaiz_exp2` is an English model originally trained by moaiz237. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_moaiz_exp2_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_moaiz_exp2_en_4.2.0_3.0_1664037629984.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_moaiz_exp2_en_4.2.0_3.0_1664037629984.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_moaiz_exp2', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_moaiz_exp2", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_moaiz_exp2| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal NER for MAPA(Multilingual Anonymisation for Public Administrations) author: John Snow Labs name: legner_mapa date: 2023-04-28 tags: [ga, licensed, ner, legal, mapa] task: Named Entity Recognition language: ga edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union. This model extracts `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, and `PERSON` entities from `Irish` documents. ## Predicted Entities `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, `PERSON` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_mapa_ga_1.0.0_3.0_1682670223837.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_mapa_ga_1.0.0_3.0_1682670223837.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_base_irish_legal","gle")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_mapa", "ga", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""Dhiúltaigh Tribunale di Teramo ( An Chúirt Dúiche, Teramo ) an t-iarratas a rinne Bn.Grigorescu, ar bhonn teagmhasach, chun aitheantas a thabhairt san Iodáil do bhreithiúnas colscartha Tribunalul București ( An Chúirt Réigiúnach, Búcairist ) an 3 Nollaig 2012, de bhun Rialachán Uimh."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
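The `NerConverter` stage merges the model's token-level IOB tags into the entity chunks shown in the Results section. A rough stdlib sketch of that merge (a simplification of what the annotator does internally, using hypothetical tag input):

```python
def iob_to_chunks(tokens, tags):
    # greedy merge of B-/I- tagged tokens into (chunk_text, label) pairs
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # 'O' tag or inconsistent I- tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

assert iob_to_chunks(
    ["an", "3", "Nollaig", "2012", ",", "de"],
    ["O", "B-DATE", "I-DATE", "I-DATE", "O", "O"],
) == [("3 Nollaig 2012", "DATE")]
```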
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |Teramo |ADDRESS | |Bn.Grigorescu |PERSON | |Búcairist |ADDRESS | |3 Nollaig 2012|DATE | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_mapa| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ga| |Size:|16.3 MB| ## References The dataset is available [here](https://huggingface.co/datasets/joelito/mapa). ## Benchmarking ```bash label precision recall f1-score support ADDRESS 0.82 0.74 0.78 19 AMOUNT 1.00 1.00 1.00 7 DATE 0.91 0.92 0.91 75 ORGANISATION 0.65 0.67 0.66 48 PERSON 0.71 0.82 0.76 56 micro-avg 0.79 0.82 0.80 205 macro-avg 0.82 0.83 0.82 205 weighted-avg 0.79 0.82 0.80 205 ``` --- layout: model title: English asr_distil_wav2vec2 TFWav2Vec2ForCTC from OthmaneJ author: John Snow Labs name: asr_distil_wav2vec2 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_distil_wav2vec2` is an English model originally trained by OthmaneJ.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_distil_wav2vec2_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_distil_wav2vec2_en_4.2.0_3.0_1664020967214.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_distil_wav2vec2_en_4.2.0_3.0_1664020967214.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_distil_wav2vec2", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_distil_wav2vec2", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
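The `audio_content` column consumed by `AudioAssembler` is expected to hold the raw waveform as an array of floats; Wav2Vec2 models are trained on 16 kHz mono audio, so inputs should be (re)sampled accordingly. A synthetic stand-in built with the stdlib (a real application would instead load a speech file, e.g. with librosa):

```python
import math

SAMPLE_RATE = 16000  # Hz; Wav2Vec2 models expect 16 kHz mono input

def sine_wave(freq_hz, seconds, sample_rate=SAMPLE_RATE):
    # synthetic waveform standing in for real speech audio
    n = int(sample_rate * seconds)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]

audio = sine_wave(440.0, 1.0)
assert len(audio) == SAMPLE_RATE             # one second of samples
assert all(-1.0 <= s <= 1.0 for s in audio)  # normalized float range
```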
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_distil_wav2vec2| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|188.9 MB| --- layout: model title: Legal BERT Base Uncased Embedding author: John Snow Labs name: bert_base_uncased_legal date: 2021-09-07 tags: [english, legal, open_source, bert_embeddings, uncased, en] task: Embeddings language: en nav_key: models edition: Spark NLP 3.2.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description LEGAL-BERT is a family of BERT models for the legal domain, intended to assist legal NLP research, computational law, and legal technology applications. To pre-train the different variations of LEGAL-BERT, we collected 12 GB of diverse English legal text from several fields (e.g., legislation, court cases, contracts) scraped from publicly available resources. Sub-domain variants (CONTRACTS-, EURLEX-, ECHR-) and/or general LEGAL-BERT perform better than using BERT out of the box for domain-specific tasks. A lightweight model (33% the size of BERT-BASE) pre-trained from scratch on legal data with competitive performance is also available. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_uncased_legal_en_3.2.2_3.0_1630999701913.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_uncased_legal_en_3.2.2_3.0_1630999701913.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_base_uncased_legal", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_base_uncased_legal", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.bert.base_uncased_legal").predict("""Put your text here.""") ```
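The `embeddings` column produced above is normally consumed by downstream Spark NLP annotators, but for quick experiments a similarity score between two pooled vectors is often all that is needed. A minimal sketch (the function name is an illustrative assumption, not part of Spark NLP's API):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Applied to pooled LEGAL-BERT vectors of two clauses, a score near 1.0 suggests semantically similar legal language.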
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_uncased_legal| |Compatibility:|Spark NLP 3.2.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Case sensitive:|true| ## Data Source The model is imported from: https://huggingface.co/nlpaueb/legal-bert-base-uncased --- layout: model title: English asr_wav2vec2_base_timit_demo_colab240 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab240 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab240` is an English model originally trained by hassnain. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab240_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab240_en_4.2.0_3.0_1664023921297.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab240_en_4.2.0_3.0_1664023921297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab240", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab240", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab240| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: Turkish Named Entity Recognition (from akdeniz27) author: John Snow Labs name: bert_ner_bert_base_turkish_cased_ner date: 2022-05-09 tags: [bert, ner, token_classification, tr, open_source] task: Named Entity Recognition language: tr edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-turkish-cased-ner` is a Turkish model originally trained by `akdeniz27`. ## Predicted Entities `LOC`, `PER`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_turkish_cased_ner_tr_3.4.2_3.0_1652099217326.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_turkish_cased_ner_tr_3.4.2_3.0_1652099217326.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_turkish_cased_ner","tr") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Spark NLP'yi seviyorum"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_turkish_cased_ner","tr") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Spark NLP'yi seviyorum").toDF("text") val result = pipeline.fit(data).transform(data) ```
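The token classifier emits IOB tags (`B-PER`, `I-PER`, `O`, ...), which Spark NLP normally groups into entity chunks with `NerConverter`. For intuition, the grouping logic can be sketched in plain Python (a simplified, assumption-based sketch, not the actual `NerConverter` implementation):

```python
def iob_to_chunks(tokens, tags):
    """Group parallel IOB tags into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)  # continue the open chunk
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks
```

For example, tokens `["John", "Snow", "lives", "in", "Ankara"]` with tags `["B-PER", "I-PER", "O", "O", "B-LOC"]` yield the chunks `("John Snow", "PER")` and `("Ankara", "LOC")`.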
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_bert_base_turkish_cased_ner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|tr| |Size:|412.9 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/akdeniz27/bert-base-turkish-cased-ner - https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt - https://ieeexplore.ieee.org/document/7495744 --- layout: model title: Sentence Entity Resolver for ICD10-CM (Augmented) author: John Snow Labs name: sbiobertresolve_icd10cm_augmented date: 2022-01-18 tags: [icd10cm, entity_resolution, clinical, en, licensed] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.1 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD10-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It has also been augmented with synonyms to make it more accurate. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_3.3.1_2.4_1642532480732.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_3.3.1_2.4_1642532480732.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(['PROBLEM']) chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") icd10_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_icd10cm_augmented","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver]) data_ner = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. 
Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection."]]).toDF("text") results = nlpPipeline.fit(data_ner).transform(data_ner) ``` ```scala ... val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val icd10_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_icd10cm_augmented","en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. 
Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd10cm.augmented").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.""") ```
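The resolver ranks candidate ICD-10-CM codes by the metric set in `.setDistanceFunction("EUCLIDEAN")`. Conceptually, the ranking step reduces to the following sketch (function and variable names are illustrative assumptions, not the resolver's internals, which also use an efficient candidate index rather than a full scan):

```python
import math

def rank_codes(query_vec, code_vecs):
    """Return (code, distance) pairs sorted by Euclidean distance, nearest first.

    code_vecs: dict mapping a code (e.g. an ICD-10-CM code) to its embedding.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sorted(((code, dist(query_vec, vec)) for code, vec in code_vecs.items()),
                  key=lambda pair: pair[1])
```

The nearest candidate corresponds to the `icd10cm_code` column in the output below, and the sorted tail to `all_codes`.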
## Results ```bash +-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+ | ner_chunk| entity|icd10cm_code| resolutions| all_codes| +-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+ | gestational diabetes mellitus|PROBLEM| O2441|gestational diabetes mellitus:::postpartum gestational diabetes mel...| O2441:::O2443:::Z8632:::Z875:::O2431:::O2411:::O244:::O241:::O2481| |subsequent type two diabetes mellitus|PROBLEM| O2411|pre-existing type 2 diabetes mellitus:::disorder associated with ty...|O2411:::E118:::E11:::E139:::E119:::E113:::E1144:::Z863:::Z8639:::E1...| | T2DM|PROBLEM| E11|type 2 diabetes mellitus:::disorder associated with type 2 diabetes...|E11:::E118:::E119:::O2411:::E109:::E139:::E113:::E8881:::Z833:::D64...| | HTG-induced pancreatitis|PROBLEM| K8520|alcohol-induced pancreatitis:::drug-induced acute pancreatitis:::he...|K8520:::K853:::K8590:::F102:::K852:::K859:::K8580:::K8591:::K858:::...| | acute hepatitis|PROBLEM| K720|acute hepatitis:::acute hepatitis a:::acute infectious hepatitis:::...|K720:::B15:::B179:::B172:::Z0389:::B159:::B150:::B16:::K752:::K712:...| | obesity|PROBLEM| E669|obesity:::abdominal obesity:::obese:::central obesity:::overweight ...|E669:::E668:::Z6841:::Q130:::E66:::E6601:::Z8639:::E349:::H3550:::Z...| | a body mass index|PROBLEM| Z6841|finding of body mass index:::observation of body mass index:::mass ...|Z6841:::E669:::R229:::Z681:::R223:::R221:::Z68:::R222:::R220:::R418...| | polyuria|PROBLEM| R35|polyuria:::nocturnal polyuria:::polyuric state:::polyuric state (di...|R35:::R3581:::R358:::E232:::R31:::R350:::R8299:::N401:::E723:::O048...| | polydipsia|PROBLEM| R631|polydipsia:::psychogenic polydipsia:::primary polydipsia:::psychoge...|R631:::F6389:::E232:::F639:::O40:::G475:::M7989:::R632:::R061:::H53...| | poor appetite|PROBLEM| R630|poor appetite:::poor feeding:::bad taste in mouth:::unpleasant tast...|R630:::P929:::R438:::R432:::E86:::R196:::F520:::Z724:::R0689:::Z768...| | vomiting|PROBLEM| R111|vomiting:::intermittent vomiting:::vomiting symptoms:::periodic vom...| R111:::R11:::R1110:::G43A1:::P921:::P9209:::G43A:::R1113:::R110| | a respiratory tract infection|PROBLEM| J988|respiratory tract infection:::upper respiratory tract infection:::b...|J988:::J069:::A499:::J22:::J209:::Z593:::T17:::J0410:::Z1383:::J189...| +-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icd10cm_augmented| |Compatibility:|Healthcare NLP 3.3.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Size:|1.4 GB| |Case sensitive:|false| |Dependencies:|embeddings_clinical| ## Data Source Trained on ICD10CM 2022 Codes dataset: https://www.cdc.gov/nchs/icd/icd10cm.htm --- layout: model title: English XlmRoBertaForQuestionAnswering (from teacookies) author: John Snow Labs name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265911 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-more_fine_tune_24465520-26265911` is an English model originally trained by `teacookies`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265911_en_4.0.0_3.0_1655985889976.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265911_en_4.0.0_3.0_1655985889976.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265911","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265911","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265911").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
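Under the hood, extractive QA models of this kind score every token as a possible answer start and end, then select the highest-scoring valid span from the context. A toy sketch of that selection step (illustrative only; the real model operates on transformer logits over subword tokens):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) token indices maximizing start+end score, end >= start."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider spans up to max_len tokens long, ending inside the context.
        for e in range(s, min(s + max_len, len(end_scores))):
            if s_score + end_scores[e] > best_score:
                best, best_score = (s, e), s_score + end_scores[e]
    return best
```

With start scores `[0.1, 2.0, 0.3]` and end scores `[0.2, 0.1, 1.5]`, the chosen span is tokens 1 through 2.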
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265911| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|888.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265911 --- layout: model title: Legal Return Of Confidential Information Clause Binary Classifier author: John Snow Labs name: legclf_return_of_conf_info_clause date: 2023-02-13 tags: [en, legal, classification, return, confidential, information, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `return_of_conf_info` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents, and you want to look for clauses, we recommend you to split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `return_of_conf_info`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_return_of_conf_info_clause_en_1.0.0_3.0_1676304098427.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_return_of_conf_info_clause_en_1.0.0_3.0_1676304098427.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_return_of_conf_info_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
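As the description above suggests, long legal documents should be split (for example, into multiline paragraphs) before classification, keeping each piece within the 512-token embedding limit. A minimal sketch of that pre-processing step (whitespace tokens are used as a rough proxy for the model's subword tokens, which is an assumption; the helper name is illustrative):

```python
import re

def split_paragraphs(text, max_tokens=512):
    """Split text on blank lines; flag whether each paragraph fits the token budget.

    Returns a list of (paragraph, fits_budget) pairs. Whitespace tokens only
    approximate the model's subword token count.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]
```

Paragraphs flagged `False` would need further splitting (e.g. by headers or sub-headers) before being sent to the classifier.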
## Results ```bash +-------+ |result| +-------+ |[return_of_conf_info]| |[other]| |[other]| |[return_of_conf_info]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_return_of_conf_info_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 1.00 1.00 8 return_of_conf_info 1.00 1.00 1.00 15 accuracy - - 1.00 23 macro-avg 1.00 1.00 1.00 23 weighted-avg 1.00 1.00 1.00 23 ``` --- layout: model title: Legal Limited Liability Company Agreement Document Binary Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_limited_liability_company_agreement_bert date: 2022-12-18 tags: [en, legal, classification, licensed, document, bert, limited, liability, company, agreement, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_limited_liability_company_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `limited-liability-company-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time.
## Predicted Entities `limited-liability-company-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_limited_liability_company_agreement_bert_en_1.0.0_3.0_1671393854665.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_limited_liability_company_agreement_bert_en_1.0.0_3.0_1671393854665.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_limited_liability_company_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[limited-liability-company-agreement]| |[other]| |[other]| |[limited-liability-company-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_limited_liability_company_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support limited-liability-company-agreement 0.98 0.98 0.98 121 other 0.99 0.99 0.99 204 accuracy - - 0.98 325 macro-avg 0.98 0.98 0.98 325 weighted-avg 0.98 0.98 0.98 325 ``` --- layout: model title: NER Pipeline for 10 High Resourced Languages author: John Snow Labs name: xlm_roberta_large_token_classifier_hrl_pipeline date: 2022-06-27 tags: [arabic, german, english, spanish, french, italian, latvian, dutch, portuguese, chinese, xlm, roberta, ner, xx, open_source] task: Named Entity Recognition language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [xlm_roberta_large_token_classifier_hrl](https://nlp.johnsnowlabs.com/2021/12/26/xlm_roberta_large_token_classifier_hrl_xx.html) model.
## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_HRL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/Ner_HRL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_hrl_pipeline_xx_4.0.0_3.0_1656371823877.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_hrl_pipeline_xx_4.0.0_3.0_1656371823877.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("xlm_roberta_large_token_classifier_hrl_pipeline", lang = "xx") pipeline.annotate("يمكنكم مشاهدة أمير منطقة الرياض الأمير فيصل بن بندر بن عبد العزيز في كل مناسبة وافتتاح تتعلق بمشاريع التعليم والصحة وخدمة الطرق والمشاريع الثقافية في منطقة الرياض.") ``` ```scala val pipeline = new PretrainedPipeline("xlm_roberta_large_token_classifier_hrl_pipeline", lang = "xx") pipeline.annotate("يمكنكم مشاهدة أمير منطقة الرياض الأمير فيصل بن بندر بن عبد العزيز في كل مناسبة وافتتاح تتعلق بمشاريع التعليم والصحة وخدمة الطرق والمشاريع الثقافية في منطقة الرياض.") ```
## Results ```bash +---------------------------+---------+ |chunk |ner_label| +---------------------------+---------+ |الرياض |LOC | |فيصل بن بندر بن عبد العزيز |PER | |الرياض |LOC | +---------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_large_token_classifier_hrl_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|xx| |Size:|1.8 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - XlmRoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: Explain Document Pipeline for Finnish author: John Snow Labs name: explain_document_sm date: 2021-03-22 tags: [open_source, finnish, explain_document_sm, pipeline, fi] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: fi edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_sm is a pretrained pipeline that can be used to process text with a simple sequence of basic processing steps. It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_fi_3.0.0_3.0_1616429037499.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_sm_fi_3.0.0_3.0_1616429037499.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_sm', lang = 'fi') annotations = pipeline.fullAnnotate("Hei John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_sm", lang = "fi") val result = pipeline.fullAnnotate("Hei John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hei John Snow Labs! "] result_df = nlu.load('fi.explain').predict(text) result_df ```
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:-------------------------|:------------------------|:---------------------------------|:---------------------------------|:------------------------------------|:-----------------------------|:---------------------------------|:--------------------| | 0 | ['Hei John Snow Labs! '] | ['Hei John Snow Labs!'] | ['Hei', 'John', 'Snow', 'Labs!'] | ['hei', 'John', 'Snow', 'Labs!'] | ['INTJ', 'PROPN', 'PROPN', 'PROPN'] | [[-0.394499987363815,.,...]] | ['O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_sm| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| --- layout: model title: Sentence Entity Resolver for RxNorm (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_rxnorm_augmented date: 2022-01-03 tags: [rxnorm, licensed, en, clinical, entity_resolution] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.1 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It is trained on an augmented version of the dataset used in previous RxNorm resolver models. Additionally, this model returns concept classes of the drugs in the `all_k_aux_labels` column.
## Predicted Entities `RxNorm Codes`, `Concept Classes` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_augmented_en_3.3.1_2.4_1641241820334.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_augmented_en_3.3.1_2.4_1641241820334.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The ```sbiobertresolve_rxnorm_augmented``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings and ```ner_posology``` as the NER model, with ```DRUG``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli", "en","clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") rxnorm_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models")\ .setInputCols("sbert_embeddings")\ .setOutputCol("rxnorm_code")\ .setDistanceFunction("EUCLIDEAN") rxnorm_pipeline = Pipeline(stages = [ documentAssembler, sbert_embedder, rxnorm_resolver]) model = rxnorm_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) light_model = LightPipeline(model) result = light_model.fullAnnotate(["Coumadin 5 mg", "aspirin", "avandia 4 mg"]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en","clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sbert_embeddings") val rxnorm_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models") .setInputCols("sbert_embeddings") .setOutputCol("rxnorm_code") .setDistanceFunction("EUCLIDEAN") val rxnorm_pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, rxnorm_resolver)) val data = Seq("Coumadin 5 mg", "aspirin", "avandia 4 mg").toDS.toDF("text") val result = rxnorm_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.rxnorm_augmented").predict("""Coumadin 5 mg""") ```
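The resolver packs its top-k candidates into `:::`-separated strings (`all_k_results`, `all_k_distances`, `all_k_resolutions`, `all_k_aux_labels`), as shown in the Results section. A minimal plain-Python sketch of unpacking such a field (independent of Spark NLP; the sample values are illustrative):

```python
def split_topk(field, sep=":::"):
    """Unpack a ':::'-separated top-k metadata field into a list of strings."""
    return [item.strip() for item in field.split(sep) if item.strip()]

# Sample values in the shape of `all_k_results` / `all_k_distances`:
codes = split_topk("855333:::432467:::438740")
distances = [float(d) for d in split_topk("3.0367:::4.7790:::4.7790")]

# The candidate with the smallest distance is the top resolution.
best_code = codes[distances.index(min(distances))]
```

Pairing the unpacked lists with `zip` gives (code, distance, resolution) triples for each chunk.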
## Results ```bash | | RxNormCode | Resolution | all_k_results | all_k_distances | all_k_cosine_distances | all_k_resolutions | all_k_aux_labels | |---:|-------------:|:-----------------------------------------|:----------------------------------|:----------------------------------|:----------------------------------|:----------------------------------------------------------------|:----------------------------------| | 0 | 855333 | warfarin sodium 5 MG [Coumadin] | 855333:::432467:::438740:::103... | 3.0367:::4.7790:::4.7790:::5.3... | 0.0161:::0.0395:::0.0395:::0.0... | warfarin sodium 5 MG [Coumadin]:::coumarin 5 MG Oral Tablet:... | Branded Drug Comp:::Clinical D... | | 1 | 1537020 | aspirin Effervescent Oral Tablet | 1537020:::1191:::1295740:::405... | 0.0000:::0.0000:::4.1826:::5.7... | 0.0000:::0.0000:::0.0292:::0.0... | aspirin Effervescent Oral Tablet:::aspirin:::aspirin Oral Po... | Clinical Drug Form:::Ingredien... | | 2 | 261242 | rosiglitazone 4 MG Oral Tablet [Avandia] | 261242:::810073:::153845:::109... | 0.0000:::4.7482:::5.0125:::5.2... | 0.0000:::0.0365:::0.0409:::0.0... | rosiglitazone 4 MG Oral Tablet [Avandia]:::fesoterodine fuma... | Branded Drug:::Branded Drug Co... 
| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_rxnorm_augmented| |Compatibility:|Healthcare NLP 3.3.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[rxnorm_code]| |Language:|en| |Size:|976.1 MB| |Case sensitive:|false| --- layout: model title: Fast Neural Machine Translation Model from Basque to English author: John Snow Labs name: opus_mt_eu_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, eu, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `eu` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_eu_en_xx_2.7.0_2.4_1609166590644.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_eu_en_xx_2.7.0_2.4_1609166590644.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_eu_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(["text to translate"]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_eu_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.eu.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_eu_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate English to French Pipeline author: John Snow Labs name: translate_en_fr date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, fr, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module especially on larger sequence. The use of an accelerator such as GPU is recommended. - source languages: `en` - target languages: `fr` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_fr_xx_2.7.0_2.4_1609684801803.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_fr_xx_2.7.0_2.4_1609684801803.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_fr", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_fr", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.fr').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_fr| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Word2Vec Embeddings in Romanian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, ro, open_source] task: Embeddings language: ro edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ro_3.4.1_3.0_1647454014729.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ro_3.4.1_3.0_1647454014729.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ro") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Îmi place Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ro") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Îmi place Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ro.embed.w2v_cc_300d").predict("""Îmi place Spark NLP""") ```
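The `embeddings` column produced above holds one 300-d vector per token; such vectors are typically compared with cosine similarity. A minimal plain-Python sketch (independent of Spark NLP; the short vectors are toy stand-ins for the model's 300-d output):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors score ~1.0, orthogonal vectors ~0.0.
print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```

The same function applies unchanged to the 300-d lists found in each annotation's `embeddings` field.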
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|ro| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_roberta_hier_triplet_epochs_1_shard_1_squad2.0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_hier_triplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_hier_triplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739549310.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_hier_triplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739549310.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_roberta_hier_triplet_epochs_1_shard_1_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_rule_based_roberta_hier_triplet_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base_rule_based_hier_triplet_epochs_1_shard_1.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
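The NLU snippet above packs the question and context into one string joined by a `|||` separator. A minimal plain-Python sketch of that packing convention (illustrative only, not NLU's internal parser):

```python
def split_qa(payload, sep="|||"):
    """Split an NLU-style 'question|||context' payload into its two parts."""
    question, _, context = payload.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
```

After splitting, `q` is the question and `c` the context, matching the two columns the `MultiDocumentAssembler` expects.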
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_roberta_hier_triplet_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|460.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_hier_triplet_epochs_1_shard_1_squad2.0 --- layout: model title: Part of Speech for Korean author: John Snow Labs name: pos_ud_kaist date: 2021-03-09 tags: [part_of_speech, open_source, korean, pos_ud_kaist, ko] task: Part of Speech Tagging language: ko edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - CCONJ - ADV - SCONJ - DET - NOUN - VERB - ADJ - PUNCT - AUX - PRON - PROPN - NUM - INTJ - PART - X - ADP - SYM {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_kaist_ko_3.0.0_3.0_1615292391244.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_kaist_ko_3.0.0_3.0_1615292391244.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_kaist", "ko") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['John Snow Labs에서 안녕하세요! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_kaist", "ko") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("John Snow Labs에서 안녕하세요! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["John Snow Labs에서 안녕하세요! "] token_df = nlu.load('ko.pos.ud_kaist').predict(text) token_df ```
## Results ```bash token pos 0 J NOUN 1 o NOUN 2 h NOUN 3 n SCONJ 4 S X 5 n X 6 o X 7 w X 8 L X 9 a X 10 b X 11 s X 12 에 ADP 13 서 SCONJ 14 안 ADV 15 녕 VERB 16 하세요 VERB 17 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_kaist| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|ko| --- layout: model title: Multilingual XLMRobertaForTokenClassification Base Cased model (from cj-mills) author: John Snow Labs name: xlmroberta_ner_cj_mills_base_finetuned_panx_all date: 2022-08-13 tags: [xx, open_source, xlm_roberta, ner] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-all` is a Multilingual model originally trained by `cj-mills`. ## Predicted Entities `ORG`, `LOC`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_cj_mills_base_finetuned_panx_all_xx_4.1.0_3.0_1660427899930.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_cj_mills_base_finetuned_panx_all_xx_4.1.0_3.0_1660427899930.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_cj_mills_base_finetuned_panx_all","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_cj_mills_base_finetuned_panx_all","xx") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_cj_mills_base_finetuned_panx_all| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|860.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/cj-mills/xlm-roberta-base-finetuned-panx-all --- layout: model title: Pipeline to Detect PHI for Generic Deidentification in Romanian (BERT) author: John Snow Labs name: ner_deid_generic_bert_pipeline date: 2023-03-09 tags: [licensed, clinical, ro, deidentification, phi, generic, bert] task: Named Entity Recognition language: ro edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_deid_generic_bert](https://nlp.johnsnowlabs.com/2022/11/22/ner_deid_generic_bert_ro.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_bert_pipeline_ro_4.3.0_3.2_1678352946195.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_bert_pipeline_ro_4.3.0_3.2_1678352946195.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_generic_bert_pipeline", "ro", "clinical/models") text = '''Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_generic_bert_pipeline", "ro", "clinical/models") val text = "Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401" val result = pipeline.fullAnnotate(text) ```
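Each chunk returned by the pipeline carries inclusive `begin`/`end` character offsets into the input text (see the Results table). A plain-Python sketch of how such offsets relate to a chunk (illustrative, not the pipeline's internals):

```python
def chunk_offsets(text, chunk):
    """Return the inclusive (begin, end) character offsets of `chunk` in `text`."""
    begin = text.find(chunk)
    if begin == -1:
        return None
    # Spark NLP annotation offsets are inclusive, hence the -1.
    return begin, begin + len(chunk) - 1

text = "Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui"
print(chunk_offsets(text, "Spitalul Pentru Ochi de Deal"))  # (0, 27)
```

These are the same (0, 27) offsets reported for the first LOCATION chunk in the Results table.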
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-----------------------------|--------:|------:|:------------|-------------:| | 0 | Spitalul Pentru Ochi de Deal | 0 | 27 | LOCATION | 0.99352 | | 1 | Drumul Oprea Nr. 972 | 30 | 49 | LOCATION | 0.99994 | | 2 | Vaslui | 51 | 56 | LOCATION | 1 | | 3 | 737405 | 59 | 64 | LOCATION | 1 | | 4 | +40(235)413773 | 79 | 92 | CONTACT | 1 | | 5 | 25 May 2022 | 119 | 129 | DATE | 1 | | 6 | si | 145 | 146 | NAME | 0.9998 | | 7 | BUREAN MARIA | 158 | 169 | NAME | 0.9993 | | 8 | 77 | 180 | 181 | AGE | 1 | | 9 | Agota Evelyn Tımar C | 191 | 210 | NAME | 0.859975 | | 10 | 2450502264401 | 218 | 230 | ID | 1 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic_bert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|ro| |Size:|483.8 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: English XlmRoBertaForQuestionAnswering (from teacookies) author: John Snow Labs name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265905 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-more_fine_tune_24465520-26265905` is an English model originally trained by `teacookies`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265905_en_4.0.0_3.0_1655985226687.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265905_en_4.0.0_3.0_1655985226687.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265905","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265905","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265905").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265905| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|888.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265905 --- layout: model title: Fast Neural Machine Translation Model from English to Afro-Asiatic Languages author: John Snow Labs name: opus_mt_en_afa date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, afa, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `afa` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_afa_xx_2.7.0_2.4_1609169665906.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_afa_xx_2.7.0_2.4_1609169665906.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_afa", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(["text to translate"]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_afa", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.afa').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_afa| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Sentence Entity Resolver for RxNorm (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_rxnorm date: 2021-05-16 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings, and loads about 6X faster than previous versions. The load process is also more memory friendly: the maximum memory required during loading is smaller, reducing the chance of OOM exceptions and relaxing hardware requirements. ## Predicted Entities Predicts RxNorm Codes and their normalized definition for each chunk.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_RXNORM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_en_3.0.4_3.0_1636395903630.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_en_3.0.4_3.0_1636395903630.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") chunk2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, rxnorm_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . 
Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("sbert_embeddings") val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm","en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, rxnorm_resolver)) val data = Seq("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic 
renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.rxnorm").predict("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""") ```
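The resolver stage above maps the sentence embedding of each NER chunk to its nearest RxNorm concept, which is what `setDistanceFunction("EUCLIDEAN")` controls. As a rough illustration only, here is a minimal nearest-neighbor sketch in plain Python — the codes and 2-d vectors are made up for the example (real sentence embeddings are 768-dimensional):

```python
import math

# Hypothetical toy vocabulary: RxNorm-style codes paired with pre-computed
# embeddings (2-d, invented vectors; purely illustrative).
CODE_EMBEDDINGS = {
    "386165": [0.9, 0.1],   # hypertension-related concept
    "225965": [0.1, 0.8],   # gastritis-related concept
}

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def resolve(chunk_embedding, code_embeddings):
    """Return the (code, distance) of the nearest concept by Euclidean distance."""
    return min(
        ((code, euclidean(chunk_embedding, vec)) for code, vec in code_embeddings.items()),
        key=lambda pair: pair[1],
    )

code, dist = resolve([0.85, 0.15], CODE_EMBEDDINGS)
print(code)  # → 386165
```

The real model also returns the top-k candidates and their distances in the annotation metadata, as shown in the Results table below.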
## Results ```bash +--------------------+-----+---+---------+-------+----------+-----------------------------------------------+--------------------+ | chunk|begin|end| entity| code|confidence| resolutions| codes| +--------------------+-----+---+---------+-------+----------+-----------------------------------------------+--------------------+ | hypertension| 68| 79| PROBLEM| 386165| 0.1567|hypercal:::hypersed:::hypertears:::hyperstat...|386165:::217667::...| |chronic renal ins...| 83|109| PROBLEM| 218689| 0.1036|nephro calci:::dialysis solutions:::creatini...|218689:::3310:::2...| | COPD| 113|116| PROBLEM|1539999| 0.1644|broncomar dm:::acne medication:::carbon mono...|1539999:::214981:...| | gastritis| 120|128| PROBLEM| 225965| 0.1983|gastroflux:::gastroflux oral product:::uceri...|225965:::1176661:...| | TIA| 136|138| PROBLEM|1089812| 0.0625|thera tears:::thiotepa injection:::nature's ...|1089812:::1660003...| |a non-ST elevatio...| 182|202| PROBLEM| 218767| 0.1007|non-aspirin pm:::aspirin-free:::non aspirin ...|218767:::215440::...| |Guaiac positive s...| 208|229| PROBLEM|1294361| 0.0820|anusol rectal product:::anusol hc rectal pro...|1294361:::1166715...| |cardiac catheteri...| 295|317| TEST| 385247| 0.1566|cardiacap:::cardiology pack:::cardizem:::car...|385247:::545063::...| | PTCA| 324|327|TREATMENT| 8410| 0.0867|alteplase:::reteplase:::pancuronium:::tripe ...|8410:::76895:::78...| | mid LAD lesion| 332|345| PROBLEM| 151672| 0.0549|dulcolax:::lazerformalyde:::linaclotide:::du...|151672:::217985::...| +--------------------+-----+---+---------+-------+----------+-----------------------------------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_rxnorm| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk, drugs_sbert_embeddings]| |Output Labels:|[rxnorm_code]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on 
November 2020 RxNorm Clinical Drugs ontology graph with ``sbiobert_base_cased_mli`` embeddings. https://www.nlm.nih.gov/pubs/techbull/nd20/brief/nd20_rx_norm_november_release.html --- layout: model title: Stopwords Remover for Marathi language (187 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, mr, open_source] task: Stop Words Removal language: mr edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_mr_3.4.1_3.0_1646672300971.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_mr_3.4.1_3.0_1646672300971.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","mr") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["आपण माझ्यापेक्षा चांगले नाही"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","mr") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("आपण माझ्यापेक्षा चांगले नाही").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("mr.stopwords").predict("""आपण माझ्यापेक्षा चांगले नाही""") ```
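Conceptually, `StopWordsCleaner` just drops tokens that appear in its stopword list. A plain-Python sketch of that behavior — the stopword set here is a tiny illustrative subset, not the model's full 187-entry list:

```python
# Assumed subset of the Marathi stopwords-iso list (illustrative only).
MARATHI_STOPWORDS = {"आपण", "आणि", "हे"}

def clean_tokens(tokens, stopwords, case_sensitive=False):
    """Mimic StopWordsCleaner: drop tokens found in the stopword list."""
    if not case_sensitive:
        stopwords = {w.lower() for w in stopwords}
        return [t for t in tokens if t.lower() not in stopwords]
    return [t for t in tokens if t not in stopwords]

tokens = ["आपण", "माझ्यापेक्षा", "चांगले", "नाही"]
print(clean_tokens(tokens, MARATHI_STOPWORDS))
# The example sentence keeps [माझ्यापेक्षा, चांगले, नाही], matching the Results below.
```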
## Results ```bash +----------------------------+ |result | +----------------------------+ |[माझ्यापेक्षा, चांगले, नाही]| +----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|mr| |Size:|2.1 KB| --- layout: model title: English T5ForConditionalGeneration Cased model (from ThomasNLG) author: John Snow Labs name: t5_qg_webnlg_synth date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-qg_webnlg_synth-en` is an English model originally trained by `ThomasNLG`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_qg_webnlg_synth_en_4.3.0_3.0_1675125600977.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_qg_webnlg_synth_en_4.3.0_3.0_1675125600977.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_qg_webnlg_synth","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_qg_webnlg_synth","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_qg_webnlg_synth| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|280.8 MB| ## References - https://huggingface.co/ThomasNLG/t5-qg_webnlg_synth-en - https://github.com/ThomasScialom/QuestEval - https://arxiv.org/abs/2104.07555 --- layout: model title: English asr_wav2vec2_xlsr_53_phon TFWav2Vec2ForCTC from facebook author: John Snow Labs name: pipeline_asr_wav2vec2_xlsr_53_phon date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_53_phon` is an English model originally trained by facebook. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xlsr_53_phon_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_53_phon_en_4.2.0_3.0_1664109509538.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_53_phon_en_4.2.0_3.0_1664109509538.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xlsr_53_phon', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xlsr_53_phon", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xlsr_53_phon| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|756.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Hindi asr_Wav2Vec2_xls_r_lm_300m TFWav2Vec2ForCTC from LegolasTheElf author: John Snow Labs name: asr_Wav2Vec2_xls_r_lm_300m date: 2022-09-26 tags: [wav2vec2, hi, audio, open_source, asr] task: Automatic Speech Recognition language: hi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Wav2Vec2_xls_r_lm_300m` is a Hindi model originally trained by LegolasTheElf. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_Wav2Vec2_xls_r_lm_300m_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Wav2Vec2_xls_r_lm_300m_hi_4.2.0_3.0_1664190519147.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Wav2Vec2_xls_r_lm_300m_hi_4.2.0_3.0_1664190519147.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_Wav2Vec2_xls_r_lm_300m", "hi")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_Wav2Vec2_xls_r_lm_300m", "hi") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
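Wav2Vec2ForCTC emits one token id per audio frame, and the transcript comes from CTC decoding: merge consecutive repeats, then drop the blank token. A minimal greedy-decoding sketch — the frame ids and the choice of 0 as the blank id are illustrative assumptions, not the model's actual vocabulary:

```python
def ctc_collapse(ids, blank_id=0):
    """Greedy CTC decoding: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for i in ids:
        if i != prev:       # keep only the first of each run of repeats
            out.append(i)
        prev = i
    return [i for i in out if i != blank_id]

# Hypothetical per-frame argmax ids; 0 is the CTC blank.
frame_ids = [0, 5, 5, 0, 0, 7, 7, 7, 0, 5]
print(ctc_collapse(frame_ids))  # → [5, 7, 5]
```

The `_lm_300m` variant additionally rescores candidate transcripts with a language model, which a greedy sketch like this does not capture.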
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_Wav2Vec2_xls_r_lm_300m| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|hi| |Size:|1.2 GB| --- layout: model title: Catalan RobertaForQuestionAnswering (from projecte-aina) author: John Snow Labs name: roberta_qa_roberta_base_ca_cased_qa date: 2022-06-20 tags: [ca, open_source, question_answering, roberta] task: Question Answering language: ca edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-ca-cased-qa` is a Catalan model originally trained by `projecte-aina`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_ca_cased_qa_ca_4.0.0_3.0_1655730281795.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_ca_cased_qa_ca_4.0.0_3.0_1655730281795.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_ca_cased_qa","ca") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_ca_cased_qa","ca") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_ca_cased_qa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|ca| |Size:|451.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/projecte-aina/roberta-base-ca-cased-qa - https://arxiv.org/abs/1907.11692 - https://github.com/projecte-aina/club --- layout: model title: Relation extraction between Drugs and ADE (ReDL) author: John Snow Labs name: redl_ade_biobert date: 2021-07-12 tags: [relation_extraction, en, clinical, licensed, ade, biobert] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.1.2 spark_version: 3.0 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is an end-to-end trained BioBERT model, capable of relating drugs and the adverse reactions they cause. It predicts whether an adverse event is caused by a drug: `1` means the adverse event and drug entities are related, `0` means they are not related. ## Predicted Entities `0`, `1` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ADE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/RE_ADE.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_ade_biobert_en_3.1.2_3.0_1626105541347.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_ade_biobert_en_3.1.2_3.0_1626105541347.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = sparknlp.annotators.Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") words_embedder = WordEmbeddingsModel() \ .pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverter() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") dependency_parser = sparknlp.annotators.DependencyParserModel()\ .pretrained("dependency_conllu", "en")\ .setInputCols(["sentences", "pos_tags", "tokens"])\ .setOutputCol("dependencies") # Set a filter on pairs of named entities which will be treated as relation candidates re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks")\ .setRelationPairs(['ade-drug', 'drug-ade']) # The dataset this model is trained to is sentence-wise. # This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. 
re_model = RelationExtractionDLModel()\ .pretrained('redl_ade_biobert', 'en', "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) light_pipeline = LightPipeline(pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) text ="""Been taking Lipitor for 15 years , have experienced severe fatigue a lot. The doctor moved me to voltarene 2 months ago, so far I have only had muscle cramps.""" annotations = light_pipeline.fullAnnotate(text) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) 
.setOutputCol("re_ner_chunks") .setRelationPairs(Array("drug-ade", "ade-drug")) // The dataset this model is trained on is sentence-based. // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. val re_model = RelationExtractionDLModel() .pretrained("redl_ade_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, words_embedder, ner_tagger, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot. The doctor moved me to voltarene 2 months ago, so far I have only had muscle cramps.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.adverse_drug_events.clinical.biobert").predict("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""") ```
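Before the DL model scores anything, `RENerChunksFilter` restricts the candidates to entity pairs whose types match `setRelationPairs` and whose syntactic distance is within `setMaxSyntacticDistance`. A rough plain-Python sketch of that filtering step — it uses token distance as a stand-in for distance over the dependency tree, and the chunk data is invented for the example:

```python
from itertools import combinations

def candidate_pairs(chunks, relation_pairs, max_distance):
    """Keep entity pairs whose types match an allowed 'a-b' pair (in either
    order) and whose token distance is within the limit."""
    allowed = {tuple(p.split("-")) for p in relation_pairs}
    out = []
    for a, b in combinations(chunks, 2):
        types = (a["entity"].lower(), b["entity"].lower())
        if types in allowed or types[::-1] in allowed:
            if abs(a["token_index"] - b["token_index"]) <= max_distance:
                out.append((a["chunk"], b["chunk"]))
    return out

# Invented chunks mirroring the example sentence above.
chunks = [
    {"chunk": "Lipitor", "entity": "DRUG", "token_index": 2},
    {"chunk": "severe fatigue", "entity": "ADE", "token_index": 9},
    {"chunk": "voltarene", "entity": "DRUG", "token_index": 18},
]
print(candidate_pairs(chunks, ["ade-drug", "drug-ade"], max_distance=10))
```

Only the surviving pairs are passed to `RelationExtractionDLModel`, which then assigns each one the `0`/`1` label shown in the Results below.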
## Results ```bash | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |-----------:|:----------|----------------:|--------------:|:----------|:----------|----------------:|--------------:|:---------------|-------------:| | 1 | DRUG | 12 | 18 | Lipitor | ADE | 52 | 65 | severe fatigue | 0.998156 | | 1 | DRUG | 97 | 105 | voltarene | ADE | 144 | 156 | muscle cramps | 0.985513 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_ade_biobert| |Compatibility:|Healthcare NLP 3.1.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[redl_ner_chunks, document]| |Output Labels:|[relations]| |Language:|en| ## Data Source This model is trained on custom data annotated by JSL. ## Benchmarking ```bash label Recall Precision F1 Support 0 0.829 0.895 0.861 1146 1 0.955 0.923 0.939 2454 Avg. 0.892 0.909 0.900 - Weighted-Avg. 0.915 0.914 0.914 - ``` --- layout: model title: German asr_wav2vec2_large_xlsr_53_german_by_marcel TFWav2Vec2ForCTC from marcel author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_german_by_marcel date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_large_xlsr_53_german_by_marcel` is a German model originally trained by marcel. 
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_german_by_marcel_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_german_by_marcel_de_4.2.0_3.0_1664101884066.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_german_by_marcel_de_4.2.0_3.0_1664101884066.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_german_by_marcel", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_german_by_marcel", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_german_by_marcel| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_0 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-1024-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_0_en_4.0.0_3.0_1657183896456.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_0_en_4.0.0_3.0_1657183896456.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(False) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-1024-finetuned-squad-seed-0 --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-32-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_2_en_4.0.0_3.0_1654191544720.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_2_en_4.0.0_3.0_1654191544720.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_32d_seed_2").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
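Extractive QA models like the one above score every context token as a potential answer start and end; the returned answer is the highest-scoring valid span. A minimal post-processing sketch over made-up logits (the real annotator does this internally over subword tokens):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick the (start, end) pair maximizing start+end score, with end >= start
    and a bounded span length."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Invented per-token logits for the example context.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.2, 0.1, 4.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end_logits   = [0.1, 0.1, 0.1, 3.5, 0.2, 0.1, 0.1, 0.1, 0.4, 0.1]
s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # → Clara
```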
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|376.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-32-finetuned-squad-seed-2 --- layout: model title: English BertForQuestionAnswering Tiny Cased model (from mrm8488) author: John Snow Labs name: bert_qa_tiny_wrslb_finetuned_squadv1 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-tiny-wrslb-finetuned-squadv1` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_tiny_wrslb_finetuned_squadv1_en_4.0.0_3.0_1657188696480.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_tiny_wrslb_finetuned_squadv1_en_4.0.0_3.0_1657188696480.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tiny_wrslb_finetuned_squadv1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tiny_wrslb_finetuned_squadv1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
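For intuition only: extractive QA heads such as this annotator score every token as a candidate answer start and answer end, and the returned answer is the highest-scoring valid span. The following minimal sketch (plain Python, not the Spark NLP API; the scores are made up) illustrates that span-selection step.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Pick the (start, end) token indices with the highest combined score,
    subject to end >= start and a maximum answer length."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            if s + end_scores[j] > best[2]:
                best = (i, j, s + end_scores[j])
    return best[0], best[1]

# Hypothetical scores: token 1 is the best start, token 2 the best end.
print(best_span([0.1, 2.0, 0.0], [0.0, 0.1, 1.5]))  # → (1, 2)
```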
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_tiny_wrslb_finetuned_squadv1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|16.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/bert-tiny-wrslb-finetuned-squadv1 --- layout: model title: Fast Neural Machine Translation Model from English to Hindi author: John Snow Labs name: opus_mt_en_hi date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, hi, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `hi` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_hi_xx_2.7.0_2.4_1609169231360.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_hi_xx_2.7.0_2.4_1609169231360.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_hi", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_hi", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.hi').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_hi| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Russian Named Entity Recognition (from IlyaGusev) author: John Snow Labs name: bert_ner_rubertconv_toxic_editor date: 2022-05-09 tags: [bert, ner, token_classification, ru, open_source] task: Named Entity Recognition language: ru edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `rubertconv_toxic_editor` is a Russian model originally trained by `IlyaGusev`. ## Predicted Entities `equal`, `replace`, `delete`, `insert` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_rubertconv_toxic_editor_ru_3.4.2_3.0_1652099038495.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_rubertconv_toxic_editor_ru_3.4.2_3.0_1652099038495.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_rubertconv_toxic_editor","ru") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Я люблю Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_rubertconv_toxic_editor","ru") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Я люблю Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
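The classifier emits one tag per token (`equal`, `replace`, `delete`, `insert`). To act on the predictions, for example to apply the suggested detoxifying edits, it is often convenient to group consecutive tokens that carry the same tag into spans. A small helper sketch (plain Python, independent of Spark NLP; the token and tag lists are assumed to be extracted from the `token` and `ner` output columns):

```python
def group_edit_spans(tokens, tags):
    """Collapse per-token edit tags into (tag, phrase) spans by grouping
    consecutive tokens that share the same tag."""
    spans = []
    for token, tag in zip(tokens, tags):
        if spans and spans[-1][0] == tag:
            spans[-1][1].append(token)
        else:
            spans.append((tag, [token]))
    return [(tag, " ".join(words)) for tag, words in spans]

# All tokens tagged `equal` collapse into a single span.
print(group_edit_spans(["Я", "люблю", "Spark", "NLP"],
                       ["equal", "equal", "equal", "equal"]))
# → [('equal', 'Я люблю Spark NLP')]
```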
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_rubertconv_toxic_editor| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ru| |Size:|662.8 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/IlyaGusev/rubertconv_toxic_editor - https://colab.research.google.com/drive/1NUSO1QGlDgD-IWXa2SpeND089eVxrCJW - https://github.com/skoltech-nlp/russe_detox_2022/tree/main/data --- layout: model title: Aspect based Sentiment Analysis for restaurant reviews author: John Snow Labs name: ner_aspect_based_sentiment date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Automatically detect positive, negative and neutral aspects about restaurants from user reviews. Instead of labelling the entire review as negative or positive, this model helps identify which exact phrases relate to sentiment identified in the review. 
## Predicted Entities `NEG`, `POS` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/ASPECT_BASED_SENTIMENT_RESTAURANT/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_aspect_based_sentiment_en_3.0.0_3.0_1617209723737.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_aspect_based_sentiment_en_3.0.0_3.0_1617209723737.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", "xx")\ .setInputCols(["document", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_aspect_based_sentiment")\ .setInputCols(["document", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("entities") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["Came for lunch my sister. We loved our Thai-style main which amazing with lots of flavours very impressive for vegetarian. 
But the service was below average and the chips were too terrible to finish."]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", "xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_aspect_based_sentiment") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("entities") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter)) val data = Seq("Came for lunch my sister. We loved our Thai-style main which amazing with lots of flavours very impressive for vegetarian. But the service was below average and the chips were too terrible to finish.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.aspect_sentiment").predict("""Came for lunch my sister. We loved our Thai-style main which amazing with lots of flavours very impressive for vegetarian. But the service was below average and the chips were too terrible to finish.""") ```
## Results ```bash +----------------------------------------------------------------------------------------------------+-------------------+-----------+ | sentence | aspect | sentiment | +----------------------------------------------------------------------------------------------------+-------------------+-----------+ | We loved our Thai-style main which amazing with lots of flavours very impressive for vegetarian. | Thai-style main | positive | | We loved our Thai-style main which amazing with lots of flavours very impressive for vegetarian. | lots of flavours | positive | | But the service was below average and the chips were too terrible to finish. | service | negative | | But the service was below average and the chips were too terrible to finish. | chips | negative | +----------------------------------------------------------------------------------------------------+-------------------+-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_aspect_based_sentiment| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token, embeddings]| |Output Labels:|[absa]| |Language:|en| --- layout: model title: English asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6 TFWav2Vec2ForCTC from chrisvinsen author: John Snow Labs name: asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6` is an English model originally trained by chrisvinsen. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6_en_4.2.0_3.0_1664106725822.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6_en_4.2.0_3.0_1664106725822.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
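The snippets above assume an `audioDf` DataFrame whose `audio_content` column holds the raw audio as an array of floats. One way to produce those floats from a 16-bit mono PCM WAV file, sketched with only the Python standard library (the file name and the commented-out `spark.createDataFrame` call are illustrative assumptions, not part of the model's API):

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit mono PCM WAV file and return its samples
    scaled to the [-1.0, 1.0] range expected for audio_content."""
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "expected 16-bit PCM"
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Hypothetical DataFrame construction (assumes an active Spark session):
# audioDf = spark.createDataFrame([(wav_to_floats("sample.wav"),)], ["audio_content"])
```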
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Detect posology entities (biobert) author: John Snow Labs name: ner_posology_biobert date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect Drug, Dosage and administration instructions in text using a pretrained NER model. ## Predicted Entities `FREQUENCY`, `DRUG`, `STRENGTH`, `FORM`, `DURATION`, `DOSAGE`, `ROUTE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_biobert_en_3.0.0_3.0_1617260806766.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_biobert_en_3.0.0_3.0_1617260806766.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_posology_biobert", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_posology_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("EXAMPLE_TEXT").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu 
nlu.load("en.med_ner.posology.biobert").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_biobert| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Benchmarking ```bash label precision recall f1-score support B-DOSAGE 0.78 0.67 0.72 559 B-DRUG 0.93 0.94 0.94 3865 B-DURATION 0.79 0.81 0.80 331 B-FORM 0.90 0.87 0.88 1472 B-FREQUENCY 0.92 0.94 0.93 1577 B-ROUTE 0.94 0.85 0.89 772 B-STRENGTH 0.88 0.92 0.90 2519 I-DOSAGE 0.62 0.57 0.60 357 I-DRUG 0.81 0.89 0.85 1539 I-DURATION 0.80 0.89 0.84 796 I-FORM 0.58 0.54 0.56 142 I-FREQUENCY 0.86 0.93 0.89 2424 I-ROUTE 1.00 0.47 0.64 32 I-STRENGTH 0.85 0.91 0.88 2972 O 0.98 0.98 0.98 101134 accuracy - - 0.97 120491 macro-avg 0.84 0.81 0.82 120491 weighted-avg 0.97 0.97 0.97 120491 ``` --- layout: model title: XLM-RoBERTa Base, CoNLL-03 NER Pipeline author: ahmedlone127 name: xlm_roberta_base_token_classifier_conll03_pipeline date: 2022-06-14 tags: [open_source, ner, token_classifier, xlm_roberta, conll03, xlm, base, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: false annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [xlm_roberta_base_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/10/03/xlm_roberta_base_token_classifier_conll03_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/xlm_roberta_base_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655217192119.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/xlm_roberta_base_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655217192119.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("xlm_roberta_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("xlm_roberta_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PERSON | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_base_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Community| |Language:|en| |Size:|851.9 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - XlmRoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: English asr_wav2vec2_large_xlsr_ksponspeech_1_20 TFWav2Vec2ForCTC from cheulyop author: John Snow Labs name: asr_wav2vec2_large_xlsr_ksponspeech_1_20 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_ksponspeech_1_20` is an English model originally trained by cheulyop. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_ksponspeech_1_20_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_ksponspeech_1_20_en_4.2.0_3.0_1664097388003.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_ksponspeech_1_20_en_4.2.0_3.0_1664097388003.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_ksponspeech_1_20", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_ksponspeech_1_20", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_ksponspeech_1_20| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: German Electra Embeddings (from stefan-it) author: John Snow Labs name: electra_embeddings_electra_base_gc4_64k_400000_cased_generator date: 2022-05-17 tags: [de, open_source, electra, embeddings] task: Embeddings language: de edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-gc4-64k-400000-cased-generator` is a German model originally trained by `stefan-it`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_400000_cased_generator_de_3.4.4_3.0_1652786393218.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_400000_cased_generator_de_3.4.4_3.0_1652786393218.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_400000_cased_generator","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_400000_cased_generator","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
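Once the `embeddings` column is collected, a common downstream check is the cosine similarity between two token vectors, e.g. to verify that related words get closer representations. A self-contained sketch of that computation (plain Python; the two-dimensional vectors here are made up for illustration, real ones come from the annotator's `embeddings` field):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Identical directions score 1.0, orthogonal directions score 0.0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0
```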
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electra_base_gc4_64k_400000_cased_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|de| |Size:|223.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/stefan-it/electra-base-gc4-64k-400000-cased-generator - https://german-nlp-group.github.io/projects/gc4-corpus.html - https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf --- layout: model title: Translate English to Azerbaijani Pipeline author: John Snow Labs name: translate_en_az date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, az, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `az` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_az_xx_2.7.0_2.4_1609685842780.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_az_xx_2.7.0_2.4_1609685842780.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_az", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_az", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.az').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_az| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Electronics And Electrical Engineering Document Classifier (EURLEX) author: John Snow Labs name: legclf_electronics_and_electrical_engineering_bert date: 2023-03-06 tags: [en, legal, classification, clauses, electronics_and_electrical_engineering, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_electronics_and_electrical_engineering_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class Electronics_and_Electrical_Engineering or not (Binary Classification) according to EuroVoc labels. ## Predicted Entities `Electronics_and_Electrical_Engineering`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_electronics_and_electrical_engineering_bert_en_1.0.0_3.0_1678111679765.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_electronics_and_electrical_engineering_bert_en_1.0.0_3.0_1678111679765.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_electronics_and_electrical_engineering_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Electronics_and_Electrical_Engineering]| |[Other]| |[Other]| |[Electronics_and_Electrical_Engineering]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_electronics_and_electrical_engineering_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Electronics_and_Electrical_Engineering 0.86 0.83 0.84 58 Other 0.85 0.88 0.86 64 accuracy - - 0.85 122 macro-avg 0.85 0.85 0.85 122 weighted-avg 0.85 0.85 0.85 122 ``` --- layout: model title: Legal Disclosure Of Information Clause Binary Classifier author: John Snow Labs name: legclf_disclosure_of_information_clause date: 2023-01-27 tags: [en, legal, classification, disclosure, information, clauses, disclosure_of_information, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `disclosure-of-information` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.
If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `disclosure-of-information`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_disclosure_of_information_clause_en_1.0.0_3.0_1674820995298.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_disclosure_of_information_clause_en_1.0.0_3.0_1674820995298.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_disclosure_of_information_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
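As a pre-processing illustration for the splitting strategies mentioned in the description, paragraph splitting by multiline can be sketched in plain Python before building the Spark DataFrame. This is a minimal sketch, not part of the Spark NLP API; the regex and the sample document are assumptions for illustration only:

```python
import re

def split_paragraphs(text):
    # Paragraph splitting (by multiline): cut on one or more blank lines
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# hypothetical two-clause document
doc = """FIRST CLAUSE. The parties agree to the terms below.

SECOND CLAUSE. Each party shall keep the terms confidential."""

paragraphs = split_paragraphs(doc)
# each paragraph would become one row of the DataFrame fed to the classifier
print(len(paragraphs))  # → 2
```

Each resulting paragraph can then be loaded as one row of the `text` column consumed by the pipeline above.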
## Results ```bash +-------+ |result| +-------+ |[disclosure-of-information]| |[other]| |[other]| |[disclosure-of-information]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_disclosure_of_information_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support disclosure-of-information 1.00 0.96 0.98 26 other 0.97 1.00 0.99 39 accuracy - - 0.98 65 macro-avg 0.99 0.98 0.98 65 weighted-avg 0.99 0.98 0.98 65 ``` --- layout: model title: Sentence Entity Resolver for RxNorm (sbiobert_jsl_rxnorm_cased embeddings) author: John Snow Labs name: sbiobertresolve_jsl_rxnorm_augmented date: 2021-12-27 tags: [en, clinical, entity_resolution, licensed] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes using `sbiobert_jsl_rxnorm_cased` Sentence Bert Embeddings. It is trained on the augmented version of the dataset used in previous RxNorm resolver models. Additionally, this model returns concept classes of the drugs in the all_k_aux_labels column.
## Predicted Entities `RxNorm Codes`, `Concept Classes` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_jsl_rxnorm_augmented_en_3.3.4_2.4_1640637079907.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_jsl_rxnorm_augmented_en_3.3.4_2.4_1640637079907.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings.pretrained('sbiobert_jsl_rxnorm_cased', 'en','clinical/models')\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_jsl_rxnorm_augmented", "en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("rxnorm_code")\ .setDistanceFunction("EUCLIDEAN") rxnorm_pipelineModel = PipelineModel( stages = [ documentAssembler, sbert_embedder, rxnorm_resolver]) light_model = LightPipeline(rxnorm_pipelineModel) result = light_model.fullAnnotate(["Coumadin 5 mg", "aspirin", "Neurontin 300", "avandia 4 mg"]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_jsl_rxnorm_cased", "en", "clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sbert_embeddings") val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_jsl_rxnorm_augmented", "en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("rxnorm_code") .setDistanceFunction("EUCLIDEAN") val rxnorm_pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, rxnorm_resolver)) val rxnorm_pipelineModel = rxnorm_pipeline.fit(Seq("").toDF("text")) val light_model = new LightPipeline(rxnorm_pipelineModel) val result = light_model.fullAnnotate(Array("Coumadin 5 mg", "aspirin", "Neurontin 300", "avandia 4 mg")) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.rxnorm.augmented").predict("""Coumadin 5 mg""") ```
## Results ```bash | | RxNormCode | Resolution | all_k_results | all_k_distances | all_k_cosine_distances | all_k_resolutions | all_k_aux_labels | |---:|-------------:|:-------------------------------------------------------------------|:----------------------------------|:----------------------------------|:----------------------------------|:----------------------------------------------------------------|:----------------------------------| | 0 | 855333 | warfarin sodium 5 MG [Coumadin] | 855333:::432467:::438740:::103... | 0.0000:::5.0617:::5.0617:::5.9... | 0.0000:::0.0388:::0.0388:::0.0... | warfarin sodium 5 MG [Coumadin]:::coumarin 5 MG Oral Tablet:... | Branded Drug Comp:::Clinical D... | | 1 | 1537020 | aspirin Effervescent Oral Tablet | 1537020:::1191:::405403:::2187... | 0.0000:::0.0000:::9.0615:::9.4... | 0.0000:::0.0000:::0.1268:::0.1... | aspirin Effervescent Oral Tablet:::aspirin:::YSP Aspirin:::N... | Clinical Drug Form:::Ingredien... | | 2 | 261242 | rosiglitazone 4 MG Oral Tablet [Avandia] | 261242:::208364:::1792373:::57... | 0.0000:::8.0227:::8.1631:::8.2... | 0.0000:::0.0982:::0.1001:::0.1... | rosiglitazone 4 MG Oral Tablet [Avandia]:::triamcinolone 4 M... | Branded Drug:::Branded Drug:::... 
| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_jsl_rxnorm_augmented| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[rxnorm_code]| |Language:|en| |Size:|970.8 MB| |Case sensitive:|false| --- layout: model title: Chinese BertForTokenClassification Base Cased model (from ckiplab) author: John Snow Labs name: bert_token_classifier_base_han_chinese_ws date: 2022-11-30 tags: [zh, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-han-chinese-ws` is a Chinese model originally trained by `ckiplab`. ## Predicted Entities `B`, `I` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_base_han_chinese_ws_zh_4.2.4_3.0_1669814901320.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_base_han_chinese_ws_zh_4.2.4_3.0_1669814901320.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_base_han_chinese_ws","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_base_han_chinese_ws","zh") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
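Since this word-segmentation model emits `B`/`I` tags per character, the predicted word boundaries can be recovered with a small post-processing step. The sketch below is illustrative only; the characters and tags are hypothetical, not actual model output:

```python
def bi_tags_to_words(chars, tags):
    # 'B' starts a new word, 'I' appends the character to the current word
    words = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words

# hypothetical segmentation of a four-character sentence
print(bi_tags_to_words(list("我喜歡你"), ["B", "B", "I", "B"]))  # → ['我', '喜歡', '你']
```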
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_base_han_chinese_ws| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|zh| |Size:|395.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/ckiplab/bert-base-han-chinese-ws - https://github.com/ckiplab/han-transformers - http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/akiwi/kiwi.sh - http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/dkiwi/kiwi.sh - http://lingcorpus.iis.sinica.edu.tw/cgi-bin/kiwi/pkiwi/kiwi.sh - http://asbc.iis.sinica.edu.tw - https://ckip.iis.sinica.edu.tw/ --- layout: model title: English asr_model_sid_voxforge_cetuc_2 TFWav2Vec2ForCTC from joaoalvarenga author: John Snow Labs name: asr_model_sid_voxforge_cetuc_2 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_model_sid_voxforge_cetuc_2` is an English model originally trained by joaoalvarenga. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_model_sid_voxforge_cetuc_2_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_model_sid_voxforge_cetuc_2_en_4.2.0_3.0_1664022318789.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_model_sid_voxforge_cetuc_2_en_4.2.0_3.0_1664022318789.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_model_sid_voxforge_cetuc_2", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_model_sid_voxforge_cetuc_2", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_model_sid_voxforge_cetuc_2| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Detect Drug Information (Small) author: John Snow Labs name: ner_posology date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for posology. This NER is trained with the ``embeddings_clinical`` word embeddings model, so be sure to use the same embeddings in the pipeline. ## Predicted Entities ``DOSAGE``, ``DRUG``, ``DURATION``, ``FORM``, ``FREQUENCY``, ``ROUTE``, ``STRENGTH``. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_en_3.0.0_2.4_1617208445872.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_en_3.0.0_2.4_1617208445872.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_posology","en","clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes.
She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.']], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val model = MedicalNerModel.pretrained("ner_posology","en","clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, model, ner_converter)) val data = Seq("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe.
She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. 
The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""") ```
## Results ```bash +--------------+---------+ |chunk |ner | +--------------+---------+ |insulin |DRUG | |Bactrim |DRUG | |for 14 days |DURATION | |Fragmin |DRUG | |5000 units |DOSAGE | |subcutaneously|ROUTE | |daily |FREQUENCY| |Xenaderm |DRUG | |topically |ROUTE | |b.i.d., |FREQUENCY| |Lantus |DRUG | |40 units |DOSAGE | |subcutaneously|ROUTE | |at bedtime |FREQUENCY| |OxyContin |DRUG | |30 mg |STRENGTH | |p.o |ROUTE | |q.12 h |FREQUENCY| |folic acid |DRUG | |1 mg |STRENGTH | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on the 2018 i2b2 dataset (no FDA) with ``embeddings_clinical``. https://www.i2b2.org/NLP/Medication ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|--------------:|------:|------:|------:|---------:|---------:|---------:| | 0 | B-DRUG | 1408 | 62 | 99 | 0.957823 | 0.934307 | 0.945919 | | 1 | B-STRENGTH | 470 | 43 | 29 | 0.916179 | 0.941884 | 0.928854 | | 2 | I-DURATION | 123 | 22 | 8 | 0.848276 | 0.938931 | 0.891304 | | 3 | I-STRENGTH | 499 | 66 | 15 | 0.883186 | 0.970817 | 0.924931 | | 4 | I-FREQUENCY | 945 | 47 | 55 | 0.952621 | 0.945 | 0.948795 | | 5 | B-FORM | 365 | 13 | 12 | 0.965608 | 0.96817 | 0.966887 | | 6 | B-DOSAGE | 298 | 27 | 26 | 0.916923 | 0.919753 | 0.918336 | | 7 | I-DOSAGE | 348 | 29 | 22 | 0.923077 | 0.940541 | 0.931727 | | 8 | I-DRUG | 208 | 25 | 60 | 0.892704 | 0.776119 | 0.830339 | | 9 | I-ROUTE | 10 | 0 | 2 | 1 | 0.833333 | 0.909091 | | 10 | B-ROUTE | 467 | 4 | 25 | 0.991507 | 0.949187 | 0.969886 | | 11 | B-DURATION | 64 | 10 | 10 | 0.864865 | 0.864865 | 0.864865 | | 12 | B-FREQUENCY | 588 | 12 | 17 | 0.98 | 0.971901 | 0.975934 | | 13 | I-FORM | 264 | 5 | 4 | 0.981413 | 0.985075 | 0.98324 | | 14 | Macro-average | 6057 | 365 | 384 | 0.93387 | 
0.924277 | 0.929049 | | 15 | Micro-average | 6057 | 365 | 384 | 0.943164 | 0.940382 | 0.941771 | ``` --- layout: model title: Legal Performance Clause Binary Classifier author: John Snow Labs name: legclf_performance_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `performance` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
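To stay under the 512-token embedding limit described above, long clauses can be pre-chunked before classification. A naive whitespace-token chunker is sketched below; it is an approximation only, since the model's wordpiece tokenizer generally produces somewhat more tokens than whitespace splitting, and the sample clause is hypothetical:

```python
def chunk_by_tokens(text, max_tokens=512):
    # split into pieces of at most max_tokens whitespace-separated tokens
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]

# hypothetical clause of 1200 tokens, well beyond the embedding limit
clause = "shall perform " * 600
pieces = chunk_by_tokens(clause)
print([len(p.split()) for p in pieces])  # → [512, 512, 176]
```

Each piece can then be classified independently, and the clause flagged as `performance` if any piece is.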
## Predicted Entities `other`, `performance` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_performance_clause_en_1.0.0_3.2_1660123818700.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_performance_clause_en_1.0.0_3.2_1660123818700.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_performance_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[performance]| |[other]| |[other]| |[performance]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_performance_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.90 1.00 0.95 89 performance 1.00 0.74 0.85 39 accuracy - - 0.92 128 macro-avg 0.95 0.87 0.90 128 weighted-avg 0.93 0.92 0.92 128 ``` --- layout: model title: Translate English to Tsonga Pipeline author: John Snow Labs name: translate_en_ts date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, ts, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `ts` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ts_xx_2.7.0_2.4_1609699062996.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ts_xx_2.7.0_2.4_1609699062996.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ts", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ts", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ts').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_ts| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Finnish BERT Embeddings (Base Uncased) author: John Snow Labs name: bert_finnish_uncased date: 2020-08-31 task: Embeddings language: fi edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, fi] supported: true deprecated: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A version of Google's BERT deep transfer learning model for Finnish. The model can be fine-tuned to achieve state-of-the-art results for various Finnish natural language processing tasks. `FinBERT` features a custom 50,000 wordpiece vocabulary that has much better coverage of Finnish words. `FinBERT` has been pre-trained for 1 million steps on over 3 billion tokens (24B characters) of Finnish text drawn from news, online discussion, and internet crawls. By contrast, Multilingual BERT was trained on Wikipedia texts, where the Finnish Wikipedia text is approximately 3% of the amount used to train `FinBERT`. These features allow `FinBERT` to outperform not only Multilingual BERT but also all previously proposed models when fine-tuned for Finnish natural language processing tasks. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_finnish_uncased_fi_2.6.0_2.4_1598897239983.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_finnish_uncased_fi_2.6.0_2.4_1598897239983.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("bert_finnish_uncased", "fi") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['Rakastan NLP: tä']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("bert_finnish_uncased", "fi") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("Rakastan NLP: tä").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Rakastan NLP: tä"] embeddings_df = nlu.load('fi.embed.bert.uncased.').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash token fi_embed_bert_uncased__embeddings Rakastan [-0.5126021504402161, -1.1741008758544922, 0.6... NLP [1.4763829708099365, -1.5427947044372559, 0.80... : [-0.2581554353237152, -0.5670831203460693, -1.... tä [0.39770740270614624, -0.7221324443817139, 0.1... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_finnish_uncased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|fi| |Dimension:|768| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://github.com/TurkuNLP/FinBERT --- layout: model title: English asr_sanskrit TFWav2Vec2ForCTC from Tarakki100 author: John Snow Labs name: asr_sanskrit date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_sanskrit` is an English model originally trained by Tarakki100. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_sanskrit_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_sanskrit_en_4.2.0_3.0_1664112373546.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_sanskrit_en_4.2.0_3.0_1664112373546.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_sanskrit", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_sanskrit", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_sanskrit| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|227.9 MB| --- layout: model title: Pre-trained Pipeline for Few-NERD NER Model author: John Snow Labs name: nerdl_fewnerd_subentity_100d_pipeline date: 2022-06-28 tags: [fewnerd, ner, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the Few-NERD/inter public dataset and extracts 66 general-scope entities. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_FEW_NERD/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_FewNERD.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/nerdl_fewnerd_subentity_100d_pipeline_en_4.0.0_3.0_1656388795031.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/nerdl_fewnerd_subentity_100d_pipeline_en_4.0.0_3.0_1656388795031.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python fewnerd_pipeline = PretrainedPipeline("nerdl_fewnerd_subentity_100d_pipeline", lang = "en") fewnerd_pipeline.annotate("""12 Corazones ('12 Hearts') is Spanish-language dating game show produced in the United States for the television network Telemundo since January 2005, based on its namesake Argentine TV show format. The show is filmed in Los Angeles and revolves around the twelve Zodiac signs that identify each contestant. In 2008, Ho filmed a cameo in the Steven Spielberg feature film The Cloverfield Paradox, as a news pundit.""") ``` ```scala val pipeline = new PretrainedPipeline("nerdl_fewnerd_subentity_100d_pipeline", lang = "en") val result = pipeline.fullAnnotate("12 Corazones ('12 Hearts') is Spanish-language dating game show produced in the United States for the television network Telemundo since January 2005, based on its namesake Argentine TV show format. The show is filmed in Los Angeles and revolves around the twelve Zodiac signs that identify each contestant. In 2008, Ho filmed a cameo in the Steven Spielberg feature film The Cloverfield Paradox, as a news pundit.")(0) ```
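The pipeline's labels are hierarchical Few-NERD types of the form `coarse-fine` (e.g. `person-director`). If you only need the coarse category, splitting on the first hyphen is enough. A pure-Python sketch over the example output (the `chunks` list below is transcribed by hand and purely illustrative):

```python
# Each Few-NERD label is "coarse-fine"; partition on the first "-" to group by coarse type.
chunks = [
    ("United States", "location-GPE"),
    ("Telemundo", "organization-media/newspaper"),
    ("Steven Spielberg", "person-director"),
    ("Cloverfield Paradox", "art-film"),
]

by_coarse = {}
for text, label in chunks:
    coarse, _, fine = label.partition("-")
    by_coarse.setdefault(coarse, []).append((text, fine))

print(by_coarse["person"])  # [('Steven Spielberg', 'director')]
```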
## Results ```bash +-----------------------+----------------------------+ |chunk |ner_label | +-----------------------+----------------------------+ |Corazones ('12 Hearts')|art-broadcastprogram | |Spanish-language |other-language | |United States |location-GPE | |Telemundo |organization-media/newspaper| |Argentine TV |organization-media/newspaper| |Los Angeles |location-GPE | |Steven Spielberg |person-director | |Cloverfield Paradox |art-film | +-----------------------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|nerdl_fewnerd_subentity_100d_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|167.8 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - NerDLModel - NerConverter - Finisher --- layout: model title: English RoBERTa Embeddings (Base, Biomarkers/Carcinoma/Clinical Trial) author: John Snow Labs name: roberta_embeddings_roberta_pubmed date: 2022-04-14 tags: [roberta, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-pubmed` is an English model originally trained by `raynardj`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_pubmed_en_3.4.2_3.0_1649946815266.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_pubmed_en_3.4.2_3.0_1649946815266.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_pubmed","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_pubmed","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.roberta_pubmed").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_roberta_pubmed| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|468.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/raynardj/roberta-pubmed - https://pubmed.ncbi.nlm.nih.gov/ - https://www.ncbi.nlm.nih.gov/mesh/ --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from cosmo) author: John Snow Labs name: distilbert_qa_cosmo_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `cosmo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_cosmo_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770516755.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_cosmo_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770516755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cosmo_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cosmo_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
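Internally, extractive QA models of this kind score every context token as a candidate answer start and as a candidate answer end, and the best-scoring span becomes the `answer` annotation. A pure-Python illustration of that decoding step (the token list and scores are made up for the example sentence, not real model output):

```python
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end_scores   = [0.1, 0.1, 0.1, 4.5, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]

# Pick the (start, end) pair with the highest combined score, subject to start <= end.
best = max(
    ((s, e) for s in range(len(tokens)) for e in range(s, len(tokens))),
    key=lambda p: start_scores[p[0]] + end_scores[p[1]],
)
answer = " ".join(tokens[best[0]:best[1] + 1])
print(answer)  # Clara
```

Real implementations also cap the span length and mask out question tokens, omitted here for brevity.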
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_cosmo_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/cosmo/distilbert-base-uncased-finetuned-squad --- layout: model title: Fast Neural Machine Translation Model from English to Catalan author: John Snow Labs name: opus_mt_en_ca date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, ca, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `ca` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ca_xx_2.7.0_2.4_1609167744249.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ca_xx_2.7.0_2.4_1609167744249.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_ca", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_ca", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.ca').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_ca| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering Cased model (from skandaonsolve) author: John Snow Labs name: roberta_qa_finetuned_location date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-finetuned-location` is an English model originally trained by `skandaonsolve`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_location_en_4.3.0_3.0_1674220382399.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_location_en_4.3.0_3.0_1674220382399.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_location","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_location","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_finetuned_location| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/skandaonsolve/roberta-finetuned-location --- layout: model title: Finnish Legal Roberta Embeddings author: John Snow Labs name: roberta_base_finnish_legal date: 2023-02-16 tags: [fi, finnish, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: fi edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-finnish-roberta-base` is a Finnish model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_finnish_legal_fi_4.2.4_3.0_1676561071432.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_finnish_legal_fi_4.2.4_3.0_1676561071432.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_finnish_legal", "fi")\ .setInputCols(["sentence"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_finnish_legal", "fi") .setInputCols("sentence") .setOutputCol("embeddings") ```
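Once extracted, the 768-dimensional vectors are typically compared with cosine similarity for retrieval or clustering. A self-contained, pure-Python sketch of that computation (the short toy vectors stand in for real embedding output):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d vectors standing in for two 768-d token embeddings.
v1 = [0.5, -1.2, 0.6]
v2 = [0.4, -1.0, 0.8]
print(round(cosine(v1, v2), 3))
```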
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_finnish_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fi| |Size:|416.0 MB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-finnish-roberta-base --- layout: model title: Stopwords Remover for Dutch language (352 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, nl, open_source] task: Stop Words Removal language: nl edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_nl_3.4.1_3.0_1646673228420.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_nl_3.4.1_3.0_1646673228420.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","nl") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Je bent niet beter dan ik"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","nl") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Je bent niet beter dan ik").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("nl.stopwords").predict("""Je bent niet beter dan ik""") ```
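Conceptually, `StopWordsCleaner` keeps only the tokens absent from its stopword list; every word in the example sentence happens to be a Dutch stopword, so the cleaned result is empty. A pure-Python sketch of the same filtering (the set shown is a small illustrative subset of the 352 entries, not the full list):

```python
# Illustrative subset of the Dutch stopwords-iso list.
dutch_stopwords = {"je", "bent", "niet", "beter", "dan", "ik"}

def clean(tokens):
    # Keep tokens whose lowercase form is not in the stopword set.
    return [t for t in tokens if t.lower() not in dutch_stopwords]

print(clean("Je bent niet beter dan ik".split()))  # []
print(clean("Taal is geen barrière".split()))
```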
## Results ```bash +------+ |result| +------+ |[] | +------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|nl| |Size:|2.4 KB| --- layout: model title: Finance Pipeline (Headers / Subheaders) author: John Snow Labs name: finpipe_header_subheader date: 2023-01-20 tags: [en, finance, ner, licensed, contextual_parser] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a finance pretrained pipeline that will help you split long financial documents into smaller sections. To do that, it detects Headers and Subheaders of different sections. You can then use the begin and end information in the metadata to retrieve the text between those headers. PART I, PART II, etc. are HEADERS; Item 1, Item 2, etc. are also HEADERS; Item 1A, 2B, etc. are SUBHEADERS; 1., 2., 2.1, etc. are SUBHEADERS. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finpipe_header_subheader_en_1.0.0_3.0_1674243435691.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finpipe_header_subheader_en_1.0.0_3.0_1674243435691.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python finance_pipeline = nlp.PretrainedPipeline("finpipe_header_subheader", "en", "finance/models") text = [""" Item 2. Definitions. For purposes of this Agreement, the following terms have the meanings ascribed thereto in this Section 1. 2. Appointment as Reseller. Item 2A. Appointment. The Company hereby [***]. Allscripts may also disclose Company's pricing information relating to its Merchant Processing Services and facilitate procurement of Merchant Processing Services on behalf of Sublicensed Customers, including, without limitation by references to such pricing information and Merchant Processing Services in Customer Agreements. 6 Item 2B. Customer Agreements."""] result = finance_pipeline.annotate(text) ```
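As the description notes, the `begin`/`end` offsets in the output metadata let you slice the original text into sections running between consecutive headers. A pure-Python sketch of that slicing step (the abridged `document` string is illustrative, and offsets are recovered with `find` here; in practice you would take `begin` from each chunk's metadata):

```python
# Abridged document text, as fed to the pipeline.
document = (
    "Item 2. Definitions. For purposes of this Agreement, the following terms "
    "have the meanings ascribed thereto in this Section 1. "
    "Item 2A. Appointment. The Company hereby [***]. "
    "Item 2B. Customer Agreements."
)

# Header chunks with their begin offsets (in practice: chunk text + begin metadata).
titles = ["Item 2. Definitions.", "Item 2A. Appointment.", "Item 2B. Customer Agreements."]
headers = [(document.find(t), t) for t in titles]

# Each section spans from its header's begin offset to the next header's begin.
bounds = [b for b, _ in headers] + [len(document)]
sections = [(t, document[b:e].strip()) for (b, t), e in zip(headers, bounds[1:])]

print(sections[1][0])  # Item 2A. Appointment.
```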
## Results ```bash | chunks | begin | end | entities | |------------------------------:|------:|----:|----------:| | Item 2. Definitions. | 1 | 21 | HEADER | | Item 2A. Appointment. | 158 | 179 | SUBHEADER | | Item 2B. Customer Agreements. | 538 | 566 | SUBHEADER | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finpipe_header_subheader| |Type:|pipeline| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|23.6 KB| ## Included Models - DocumentAssembler - TokenizerModel - ContextualParserModel - ContextualParserModel - ChunkMergeModel --- layout: model title: English asr_wav2vec2_base_timit_demo_by_patrickvonplaten TFWav2Vec2ForCTC from patrickvonplaten author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_by_patrickvonplaten date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_by_patrickvonplaten` is an English model originally trained by patrickvonplaten.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_by_patrickvonplaten_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_by_patrickvonplaten_en_4.2.0_3.0_1664025434108.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_by_patrickvonplaten_en_4.2.0_3.0_1664025434108.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_by_patrickvonplaten', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_by_patrickvonplaten", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_by_patrickvonplaten| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|349.4 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English Part of Speech Tagger (from KoichiYasuoka) author: John Snow Labs name: xlmroberta_pos_xlm_roberta_base_english_upos date: 2022-05-18 tags: [xlm_roberta, pos, part_of_speech, en, open_source] task: Part of Speech Tagging language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-english-upos` is an English model originally trained by `KoichiYasuoka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_pos_xlm_roberta_base_english_upos_en_3.4.2_3.0_1652837577026.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_pos_xlm_roberta_base_english_upos_en_3.4.2_3.0_1652837577026.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_pos_xlm_roberta_base_english_upos","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_pos_xlm_roberta_base_english_upos","en") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
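Each token receives one Universal POS tag, so a common follow-up is tallying tag frequencies across a document. A minimal pure-Python sketch on illustrative annotator output (the `tagged` pairs are what the model might emit for the example sentence, written by hand here):

```python
from collections import Counter

# Illustrative (token, UPOS) pairs for the example sentence.
tagged = [("I", "PRON"), ("love", "VERB"), ("Spark", "PROPN"), ("NLP", "PROPN")]

counts = Counter(tag for _, tag in tagged)
print(counts["PROPN"])  # 2
```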
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_pos_xlm_roberta_base_english_upos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|en| |Size:|791.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/KoichiYasuoka/xlm-roberta-base-english-upos - https://github.com/UniversalDependencies/UD_English-EWT - https://universaldependencies.org/u/pos/ - https://github.com/KoichiYasuoka/esupar --- layout: model title: Translate Kinyarwanda to English Pipeline author: John Snow Labs name: translate_rw_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, rw, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `rw` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_rw_en_xx_2.7.0_2.4_1609687466102.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_rw_en_xx_2.7.0_2.4_1609687466102.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_rw_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_rw_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.rw.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_rw_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Spanish RobertaForQuestionAnswering Large Cased model (from stevemobs) author: John Snow Labs name: roberta_qa_large_fine_tuned_squad date: 2023-01-20 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-fine-tuned-squad-es` is a Spanish model originally trained by `stevemobs`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_large_fine_tuned_squad_es_4.3.0_3.0_1674221753097.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_large_fine_tuned_squad_es_4.3.0_3.0_1674221753097.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_fine_tuned_squad","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_fine_tuned_squad","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_large_fine_tuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/stevemobs/roberta-large-fine-tuned-squad-es --- layout: model title: English asr_wav2vec2_xls_r_300m_cv8 TFWav2Vec2ForCTC from comodoro author: John Snow Labs name: asr_wav2vec2_xls_r_300m_cv8 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_cv8` is an English model originally trained by comodoro. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_xls_r_300m_cv8_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_cv8_en_4.2.0_3.0_1664036662517.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_cv8_en_4.2.0_3.0_1664036662517.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xls_r_300m_cv8", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xls_r_300m_cv8", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xls_r_300m_cv8| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Fake News Classifier author: John Snow Labs name: classifierdl_use_fakenews date: 2021-01-09 task: Text Classification language: en nav_key: models edition: Spark NLP 2.7.1 spark_version: 2.4 tags: [open_source, en, classifier] supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Determine if news articles are Real or Fake. ## Predicted Entities `REAL`, `FAKE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_EN_FAKENEWS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_EN_FAKENEWS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_fakenews_en_2.7.1_2.4_1610187399147.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_fakenews_en_2.7.1_2.4_1610187399147.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") use = UniversalSentenceEncoder.pretrained('tfhub_use', lang="en") \ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") document_classifier = ClassifierDLModel.pretrained('classifierdl_use_fakenews', 'en') \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") nlpPipeline = Pipeline(stages=[document_assembler, use, document_classifier]) light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate('Donald Trump a KGB Spy? 11/02/2016 In today’s video, Christopher Greene of AMTV reports Hillary Clinton') ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val use = UniversalSentenceEncoder.pretrained("tfhub_use", lang = "en") .setInputCols(Array("document")) .setOutputCol("sentence_embeddings") val document_classifier = ClassifierDLModel.pretrained("classifierdl_use_fakenews", "en") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, use, document_classifier)) val data = Seq("Donald Trump a KGB Spy? 11/02/2016 In today’s video, Christopher Greene of AMTV reports Hillary Clinton").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Donald Trump a KGB Spy? 11/02/2016 In today’s video, Christopher Greene of AMTV reports Hillary Clinton"""] fake_df = nlu.load('classify.fakenews.use').predict(text, output_level='document') fake_df[["document", "fakenews"]] ```
## Results ```bash +--------------------------------------------------------------------------------------------------------+------------+ |document |class | +--------------------------------------------------------------------------------------------------------+------------+ |Donald Trump a KGB Spy? 11/02/2016 In today’s video, Christopher Greene of AMTV reports Hillary Clinton | FAKE | +--------------------------------------------------------------------------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_use_fakenews| |Compatibility:|Spark NLP 2.7.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Dependencies:|tfhub_use| ## Data Source This model is trained on the fake news classification challenge. https://raw.githubusercontent.com/joolsa/fake_real_news_dataset/master/fake_or_real_news.csv.zip ## Benchmarking ```bash precision recall f1-score support FAKE 0.86 0.89 0.88 626 REAL 0.89 0.86 0.87 634 accuracy 0.87 1260 macro avg 0.88 0.87 0.87 1260 weighted avg 0.88 0.87 0.87 1260 ``` --- layout: model title: Fast Neural Machine Translation Model from Chuukese to English author: John Snow Labs name: opus_mt_chk_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, chk, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors helping with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `chk` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_chk_en_xx_2.7.0_2.4_1609169397835.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_chk_en_xx_2.7.0_2.4_1609169397835.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_chk_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_chk_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.chk.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_chk_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Finnish asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman` is a Finnish model originally trained by jonatasgrosman. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman_fi_4.2.0_3.0_1664041894280.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman_fi_4.2.0_3.0_1664041894280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman', lang = 'fi') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman", lang = "fi") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering model (from deepset) author: John Snow Labs name: bert_qa_minilm_uncased_squad2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `minilm-uncased-squad2` is an English model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_minilm_uncased_squad2_en_4.0.0_3.0_1654188279232.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_minilm_uncased_squad2_en_4.0.0_3.0_1654188279232.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_minilm_uncased_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(False) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_minilm_uncased_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.mini_lm_base_uncased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
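To illustrate what the `answer` output contains conceptually: extractive QA models score every token as a potential answer start and end, and the highest-scoring valid span (start before end, within a length limit) is returned. A plain-Python sketch with made-up scores — this is not the Spark NLP API, which performs the span selection internally:

```python
# Hypothetical span selection for extractive QA. The scores below are
# invented for illustration; real models produce start/end logits per token.

def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) maximizing start_scores[s] + end_scores[e], s <= e."""
    best_score, best = float("-inf"), (0, 0)
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start  = [0.1, 0.0, 0.2, 6.0, 0.0, 0.1, 0.0, 0.0, 1.0, 0.0]
end    = [0.0, 0.1, 0.0, 5.5, 0.2, 0.0, 0.0, 0.0, 1.2, 0.1]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```

The `max_len` cap mirrors the common practice of rejecting implausibly long answer spans.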
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_minilm_uncased_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|123.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/deepset/minilm-uncased-squad2 - https://github.com/deepset-ai/haystack/discussions - https://deepset.ai - https://github.com/deepset-ai/FARM/blob/master/examples/question_answering.py - https://twitter.com/deepset_ai - http://www.deepset.ai/jobs - https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/ - https://haystack.deepset.ai/community/join - https://github.com/deepset-ai/haystack/ - https://deepset.ai/german-bert - https://www.linkedin.com/company/deepset-ai/ - https://github.com/deepset-ai/FARM - https://deepset.ai/germanquad --- layout: model title: Abkhazian asr_xls_r_ab_test_by_muneson TFWav2Vec2ForCTC from muneson author: John Snow Labs name: pipeline_asr_xls_r_ab_test_by_muneson date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_ab_test_by_muneson` is an Abkhazian model originally trained by muneson. 
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_xls_r_ab_test_by_muneson_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_ab_test_by_muneson_ab_4.2.0_3.0_1664019208833.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_ab_test_by_muneson_ab_4.2.0_3.0_1664019208833.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_xls_r_ab_test_by_muneson', lang = 'ab') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_xls_r_ab_test_by_muneson", lang = "ab") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_xls_r_ab_test_by_muneson| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|ab| |Size:|452.2 KB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from sasuke) author: John Snow Labs name: distilbert_qa_sasuke_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `sasuke`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_sasuke_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772441119.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_sasuke_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772441119.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sasuke_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sasuke_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_sasuke_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/sasuke/distilbert-base-uncased-finetuned-squad --- layout: model title: Financial Deidentification Pipeline author: John Snow Labs name: finpipe_deid date: 2023-02-27 tags: [deid, deidentification, anonymization, en, licensed] task: [De-identification, Pipeline Finance] language: en edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true recommended: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a pretrained pipeline that deidentifies legal and financial documents for compliance with data privacy regulations such as GDPR and CCPA. Since the models used in this pipeline are statistical, make sure you use it in a human-in-the-loop process to guarantee 100% accuracy. You can carry out both masking and obfuscation with this pipeline, on the following entities: `ALIAS`, `EMAIL`, `PHONE`, `PROFESSION`, `ORG`, `DATE`, `PERSON`, `ADDRESS`, `STREET`, `CITY`, `STATE`, `ZIP`, `COUNTRY`, `TITLE_CLASS`, `TICKER`, `STOCK_EXCHANGE`, `CFN`, `IRS` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/DEID_FIN/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finpipe_deid_en_1.0.0_3.0_1677508149273.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finpipe_deid_en_1.0.0_3.0_1677508149273.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("finpipe_deid", "en", "finance/models") sample = """CARGILL, INCORPORATED By: Pirkko Suominen Name: Pirkko Suominen Title: Director, Bio Technology Development, Date: 10/19/2011 BIOAMBER, SAS By: Jean-François Huc Name: Jean-François Huc Title: President Date: October 15, 2011 email : jeanfran@gmail.com phone : 1808733909 """ result = deid_pipeline.annotate(sample) print("\nMasked with entity labels") print("-"*30) print("\n".join(result['deidentified'])) print("\nMasked with chars") print("-"*30) print("\n".join(result['masked_with_chars'])) print("\nMasked with fixed length chars") print("-"*30) print("\n".join(result['masked_fixed_length_chars'])) print("\nObfuscated") print("-"*30) print("\n".join(result['obfuscated'])) ```
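To make the masking modes shown in the Results concrete: each detected entity span can be replaced by its label, by same-length characters, by a fixed-length placeholder, or by a realistic fake value (obfuscation). A plain-Python sketch of masking a single entity span — illustrative only; the real logic lives in Spark NLP's `DeIdentificationModel`, and the helper below is hypothetical:

```python
# Hypothetical single-entity masker illustrating the four output styles.
# Real de-identification handles many overlapping entities and obfuscation
# draws replacements from curated fake-value dictionaries.

def mask_entity(text, start, end, label, mode):
    """Replace text[start:end] according to the chosen masking mode."""
    ent = text[start:end]
    if mode == "entity_label":
        repl = f"<{label}>"
    elif mode == "same_length_chars":
        # brackets plus asterisks, preserving the entity's total length
        repl = "[" + "*" * max(len(ent) - 2, 0) + "]"
    elif mode == "fixed_length_chars":
        repl = "****"
    else:
        raise ValueError(f"unknown mode: {mode}")
    return text[:start] + repl + text[end:]

text = "Name: Pirkko Suominen, phone: 1808733909"
print(mask_entity(text, 6, 21, "PERSON", "entity_label"))
# -> Name: <PERSON>, phone: 1808733909
```

Same-length masking preserves document layout; fixed-length masking additionally hides the length of the original value.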
## Results ```bash Masked with entity labels ------------------------------ , By: Name: : , Date: , By: Name: : Date: email : phone : Masked with chars ------------------------------ [*****], [**********] By: [*************] Name: [*******************]: [**********************************] Center, Date: [********] [******], [*] By: [***************] Name: [**********************]: [*******]Date: [**************] email : [****************] phone : [********] Masked with fixed length chars ------------------------------ ****, **** By: **** Name: ****: ****, Date: **** ****, **** By: **** Name: ****: ****Date: **** email : **** phone : **** Obfuscated ------------------------------ MGT Trust Company, LLC., Clarus llc. By: Benjamin Dean Name: John Snow Labs Inc: Sales Manager, Date: 03/08/2025 Clarus llc., SESA CO. By: JAMES TURNER Name: MGT Trust Company, LLC.: Business ManagerDate: 11/7/2016 email : Tyrus@google.com phone : 78 834 854 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finpipe_deid| |Type:|pipeline| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|458.6 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - BertEmbeddings - FinanceNerModel - NerConverterInternalModel - FinanceNerModel - NerConverterInternalModel - FinanceNerModel - NerConverterInternalModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel --- layout: model title: Detect Clinical Conditions (ner_eu_clinical_condition - it) author: John Snow Labs name: ner_eu_clinical_condition date: 2023-02-06 tags: [it, clinical, licensed, ner, clinical_condition] task: Named Entity Recognition language: it edition: Healthcare NLP 4.2.8 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: 
type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition (NER) deep learning model for extracting clinical conditions from Italian texts. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives. ## Predicted Entities `clinical_condition` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_it_4.2.8_3.0_1675726754516.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_it_4.2.8_3.0_1675726754516.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","it")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained('ner_eu_clinical_condition', "it", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["""Donna, 64 anni, ricovero per dolore epigastrico persistente, irradiato a barra e posteriormente, associato a dispesia e anoressia. Poche settimane dopo compaiono, però, iperemia, intenso edema vulvare ed una esione ulcerativa sul lato sinistro della parete rettale che la RM mostra essere una fistola transfinterica. 
Questi trattamenti determinano un miglioramento dell’ infiammazione ed una riduzione dell’ ulcera, ma i condilomi permangono inalterati."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","it") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_eu_clinical_condition", "it", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documenter, sentenceDetector, tokenizer, word_embeddings, ner_model, ner_converter)) val data = Seq(Array("""Donna, 64 anni, ricovero per dolore epigastrico persistente, irradiato a barra e posteriormente, associato a dispesia e anoressia. Poche settimane dopo compaiono, però, iperemia, intenso edema vulvare ed una esione ulcerativa sul lato sinistro della parete rettale che la RM mostra essere una fistola transfinterica. Questi trattamenti determinano un miglioramento dell’ infiammazione ed una riduzione dell’ ulcera, ma i condilomi permangono inalterati.""")).toDS().toDF("text") val result = pipeline.fit(data).transform(data) ```
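The precision, recall, and F1 figures reported in the Benchmarking section follow directly from the raw counts (tp=208, fp=35, fn=46 for `clinical_condition`). A quick sketch of the computation:

```python
# Derive precision/recall/F1 from true-positive, false-positive,
# and false-negative counts, as reported in the Benchmarking table.

def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = prf(208, 35, 46)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
# -> precision=0.8560 recall=0.8189 f1=0.8370
```

With a single entity class, the macro and micro averages coincide with the per-class F1, which is why all three rows in the table show 0.8370.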
## Results ```bash +----------------------+------------------+ |chunk |ner_label | +----------------------+------------------+ |dolore epigastrico |clinical_condition| |anoressia |clinical_condition| |iperemia |clinical_condition| |edema |clinical_condition| |fistola transfinterica|clinical_condition| |infiammazione |clinical_condition| +----------------------+------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_eu_clinical_condition| |Compatibility:|Healthcare NLP 4.2.8+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|it| |Size:|903.5 KB| ## References The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives. ## Benchmarking ```bash label tp fp fn total precision recall f1 clinical_condition 208.0 35.0 46.0 254.0 0.8560 0.8189 0.8370 macro - - - - - - 0.8370 micro - - - - - - 0.8370 ``` --- layout: model title: English RobertaForQuestionAnswering (from saburbutt) author: John Snow Labs name: roberta_qa_roberta_large_tweetqa date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta_large_tweetqa` is an English model originally trained by `saburbutt`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_tweetqa_en_4.0.0_3.0_1655739122606.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_tweetqa_en_4.0.0_3.0_1655739122606.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_tweetqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_large_tweetqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.trivia.roberta.large").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
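The predicted span ends up in the `answer` annotation column. Below is a minimal sketch of selecting the highest-scoring candidate, assuming each annotation exposes `result` and a `metadata["score"]` field as Spark NLP question-answering annotators usually do; the sample row is illustrative, not actual model output:

```python
# Sketch: pick the top-scoring answer from a list of answer annotations.
# Assumes the usual Spark NLP annotation shape: "result" holds the span text,
# "metadata" holds a "score" string.
def best_answer(annotations):
    """Return the span text of the highest-scoring annotation."""
    return max(annotations, key=lambda a: float(a["metadata"].get("score", 0)))["result"]

# Illustrative sample row (not a guarantee of what the model predicts).
sample = [{"result": "Clara", "metadata": {"score": "0.93"}}]
print(best_answer(sample))  # Clara
```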
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_large_tweetqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/saburbutt/roberta_large_tweetqa --- layout: model title: English image_classifier_vit_iiif_manuscript_ ViTForImageClassification from davanstrien author: John Snow Labs name: image_classifier_vit_iiif_manuscript_ date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_iiif_manuscript_` is an English model originally trained by davanstrien. ## Predicted Entities `3rd upper flyleaf verso`, `Blank leaf recto`, `3rd lower flyleaf verso`, `2nd lower flyleaf verso`, `2nd upper flyleaf verso`, `flyleaf`, `1st upper flyleaf verso`, `1st lower flyleaf verso`, `fol`, `cover`, `Lower flyleaf verso`, `Blank leaf verso`, `Upper flyleaf verso` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_iiif_manuscript__en_4.1.0_3.0_1660170321999.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_iiif_manuscript__en_4.1.0_3.0_1660170321999.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_iiif_manuscript_", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_iiif_manuscript_", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_iiif_manuscript_| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Legal Purchase and sale Clause Binary Classifier author: John Snow Labs name: legclf_purchase_and_sale_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `purchase-and-sale` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities `other`, `purchase-and-sale` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_purchase_and_sale_clause_en_1.0.0_3.2_1660123863834.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_purchase_and_sale_clause_en_1.0.0_3.2_1660123863834.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_purchase_and_sale_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
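As noted in the description, several binary clause classifiers can be run over the same text and their predicted categories collected into a single clause-to-boolean map. Below is a minimal sketch of that aggregation step; the classifier names and predictions are illustrative examples, not actual model output:

```python
# Sketch: combine the outputs of several binary clause classifiers into one
# clause -> bool map. Each classifier is assumed to emit either its positive
# label (e.g. "purchase-and-sale") or "other".
def clause_flags(predictions):
    """predictions: dict of classifier name -> predicted category."""
    return {name: category != "other" for name, category in predictions.items()}

# Illustrative predictions from two hypothetical classifier runs.
preds = {
    "legclf_purchase_and_sale_clause": "purchase-and-sale",
    "legclf_bankruptcy_clause": "other",
}
print(clause_flags(preds))
```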
## Results ```bash +-------+ | result| +-------+ |[purchase-and-sale]| |[other]| |[other]| |[purchase-and-sale]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_purchase_and_sale_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.93 1.00 0.96 65 purchase-and-sale 1.00 0.88 0.94 43 accuracy - - 0.95 108 macro-avg 0.96 0.94 0.95 108 weighted-avg 0.96 0.95 0.95 108 ``` --- layout: model title: Detect Problems, Tests and Treatments (ner_clinical) in German author: John Snow Labs name: ner_clinical date: 2023-05-05 tags: [ner, clinical, licensed, de] task: Named Entity Recognition language: de edition: Healthcare NLP 4.4.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terms in German. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNN.
## Predicted Entities `PROBLEM`, `TEST`, `TREATMENT` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/14.German_Healthcare_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_de_4.4.0_3.0_1683310968546.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_de_4.4.0_3.0_1683310968546.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "de", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) sample_text= """Verschlechterung von Schmerzen oder Schwäche in den Beinen , Verlust der Darm - oder Blasenfunktion oder andere besorgniserregende Symptome. Der Patient erhielt empirisch Ampicillin , Gentamycin und Flagyl sowie Narcan zur Umkehrung von Fentanyl . 
ALT war 181 , AST war 156 , LDH war 336 , alkalische Phosphatase war 214 und Bilirubin war insgesamt 12,7 .""" results = model.transform(spark.createDataFrame([[sample_text]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_clinical", "de", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("""Verschlechterung von Schmerzen oder Schwäche in den Beinen , Verlust der Darm - oder Blasenfunktion oder andere besorgniserregende Symptome. Der Patient erhielt empirisch Ampicillin , Gentamycin und Flagyl sowie Narcan zur Umkehrung von Fentanyl . ALT war 181 , AST war 156 , LDH war 336 , alkalische Phosphatase war 214 und Bilirubin war insgesamt 12,7 .""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash +----------------------+---------+ |chunk |ner_label| +----------------------+---------+ |Schmerzen |PROBLEM | |Schwäche in den Beinen|PROBLEM | |Verlust der Darm |PROBLEM | |Blasenfunktion |PROBLEM | |Symptome |PROBLEM | |empirisch Ampicillin |TREATMENT| |Gentamycin |TREATMENT| |Flagyl |TREATMENT| |Narcan |TREATMENT| |Fentanyl |TREATMENT| |ALT |TEST | |AST |TEST | |LDH |TEST | |alkalische Phosphatase|TEST | |Bilirubin |TEST | +----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|de| |Size:|2.1 MB| ## Benchmarking ```bash label precision recall f1-score support B-PROBLEM 0.78 0.77 0.77 407 B-TEST 0.77 0.92 0.84 220 B-TREATMENT 0.82 0.71 0.76 241 I-PROBLEM 0.87 0.78 0.82 386 I-TEST 0.66 0.93 0.77 57 I-TREATMENT 0.68 0.76 0.72 76 O 0.96 0.97 0.96 4323 accuracy - - 0.92 5710 macro-avg 0.79 0.83 0.81 5710 weighted-avg 0.92 0.92 0.92 5710 ``` --- layout: model title: Javanese Bert Embeddings (Small, Imdb) author: John Snow Labs name: bert_embeddings_javanese_bert_small_imdb date: 2022-04-11 tags: [bert, embeddings, jv, open_source] task: Embeddings language: jv edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `javanese-bert-small-imdb` is a Javanese model originally trained by `w11wo`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_javanese_bert_small_imdb_jv_3.4.2_3.0_1649676688860.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_javanese_bert_small_imdb_jv_3.4.2_3.0_1649676688860.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_javanese_bert_small_imdb","jv") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_javanese_bert_small_imdb","jv") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("jv.embed.javanese_bert_small_imdb").predict("""I love Spark NLP""") ```
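Once the `embeddings` column is populated, token vectors can be compared directly, for example with cosine similarity. Below is a minimal sketch over tiny toy vectors; the real model emits one fixed-size vector per token:

```python
# Sketch: cosine similarity between two token embedding vectors.
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for real token embeddings.
print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0
```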
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_javanese_bert_small_imdb| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|jv| |Size:|410.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/w11wo/javanese-bert-small-imdb - https://arxiv.org/abs/1810.04805 - https://github.com/sgugger - https://w11wo.github.io/ --- layout: model title: Legal Bankruptcy Clause Binary Classifier author: John Snow Labs name: legclf_bankruptcy_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `bankruptcy` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `bankruptcy` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_bankruptcy_clause_en_1.0.0_3.2_1660122157662.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_bankruptcy_clause_en_1.0.0_3.2_1660122157662.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_bankruptcy_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[bankruptcy]| |[other]| |[other]| |[bankruptcy]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_bankruptcy_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support bankruptcy 1.00 0.73 0.84 26 other 0.94 1.00 0.97 107 accuracy - - 0.95 133 macro-avg 0.97 0.87 0.91 133 weighted-avg 0.95 0.95 0.94 133 ``` --- layout: model title: Detect PHI for Deidentification purposes (Spanish, Roberta, augmented) author: John Snow Labs name: ner_deid_subentity_roberta_augmented date: 2022-02-16 tags: [deid, es, licensed, clinical] task: De-identification language: es edition: Healthcare NLP 3.3.4 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNN. Deidentification NER (Spanish) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 17 entities, which is more than the previously released `ner_deid_subentity_roberta` model. This NER model is trained with a combination of custom datasets, the Spanish CoNLL 2002, MeddoProf and MeddoCan datasets, and includes several data augmentation mechanisms. This is a version that includes RoBERTa clinical embeddings. You can also find `ner_deid_subentity_augmented`, which uses Sciwi 300d embeddings instead of RoBERTa.
## Predicted Entities `PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `ID`, `STREET`, `USERNAME`, `SEX`, `EMAIL`, `ZIP`, `MEDICALRECORD`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_ES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_roberta_augmented_es_3.3.4_3.0_1645006804071.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_roberta_augmented_es_3.3.4_3.0_1645006804071.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = medical.NerModel.pretrained("ner_deid_subentity_roberta_augmented", "es", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, roberta_embeddings, clinical_ner]) text = [''' Antonio Miguel Martínez, varón de de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. 
'''] df = spark.createDataFrame([text]).toDF("text") results = nlpPipeline.fit(df).transform(df) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val roberta_embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_roberta_augmented", "es", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, roberta_embeddings, clinical_ner)) val text = "Antonio Miguel Martínez, varón de de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos." val df = Seq(text).toDF("text") val results = pipeline.fit(df).transform(df) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.deid.subentity.roberta").predict(""" Antonio Miguel Martínez, varón de de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. """) ```
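Because this pipeline has no NER converter stage, the output stays at token level with IOB tags, as the Results section shows. Below is a minimal sketch of rolling those tags up into entity chunks, using a few (token, tag) pairs taken from the example output:

```python
# Sketch: merge token-level IOB tags into (chunk text, entity) pairs.
def iob_to_chunks(tagged):
    """tagged: list of (token, tag) pairs with tags like B-PATIENT, I-PATIENT, O."""
    chunks, current = [], None
    for token, tag in tagged:
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = (token, tag[2:])  # start a new chunk
        elif tag.startswith("I-") and current and tag[2:] == current[1]:
            current = (current[0] + " " + token, current[1])  # extend the chunk
        else:
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return chunks

# A few pairs from the example output above.
tokens = [("Antonio", "B-PATIENT"), ("Miguel", "I-PATIENT"),
          ("Martínez", "I-PATIENT"), (",", "O"), ("varón", "B-SEX")]
print(iob_to_chunks(tokens))
```

In a Spark NLP pipeline this merge is what adding a `NerConverter` stage after the NER model does for you.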
## Results ```bash +------------+------------+ | token| ner_label| +------------+------------+ | Antonio| B-PATIENT| | Miguel| I-PATIENT| | Martínez| I-PATIENT| | ,| O| | varón| B-SEX| | de| O| | de| O| | 35| B-AGE| | años| O| | de| O| | edad| O| | ,| O| | de| O| | profesión| O| | auxiliar|B-PROFESSION| | de|I-PROFESSION| | enfermería|I-PROFESSION| | y| O| | nacido| O| | en| O| | Cadiz| B-CITY| | ,| O| | España| B-COUNTRY| | .| O| | Aún| O| | no| O| | estaba| O| | vacunado| O| | ,| O| | se| O| | infectó| O| | con| O| | Covid-19| O| | el| O| | dia| O| | 14| B-DATE| | de| I-DATE| | Marzo| I-DATE| | y| O| | tuvo| O| | que| O| | ir| O| | al| O| | Hospital| O| | Fue| O| | tratado| O| | con| O| | anticuerpos| O| |monoclonales| O| | en| O| | la| O| | Clinica| B-HOSPITAL| | San| I-HOSPITAL| | Carlos| I-HOSPITAL| | .| O| +------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity_roberta_augmented| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|16.3 MB| ## References - Internal JSL annotated corpus - [Spanish conLL](https://www.clips.uantwerpen.be/conll2002/ner/data/) - [MeddoProf](https://temu.bsc.es/meddoprof/data/) - [MeddoCan](https://temu.bsc.es/meddocan/) ## Benchmarking ```bash label tp fp fn total precision recall f1 PATIENT 1874.0 165.0 186.0 2060.0 0.9191 0.9097 0.9144 HOSPITAL 241.0 19.0 54.0 295.0 0.9269 0.8169 0.8685 DATE 954.0 17.0 15.0 969.0 0.9825 0.9845 0.9835 ORGANIZATION 2521.0 483.0 468.0 2989.0 0.8392 0.8434 0.8413 CITY 1464.0 369.0 289.0 1753.0 0.7987 0.8351 0.8165 ID 35.0 1.0 0.0 35.0 0.9722 1.0 0.9859 STREET 194.0 8.0 6.0 200.0 0.9604 0.97 0.9652 USERNAME 7.0 0.0 4.0 11.0 1.0 0.6364 0.7778 SEX 618.0 9.0 9.0 627.0 0.9856 0.9856 0.9856 EMAIL 134.0 0.0 0.0 134.0 1.0 1.0 1.0 ZIP 138.0 0.0 1.0 139.0 1.0 0.9928 0.9964 MEDICALRECORD 29.0 10.0 0.0 29.0 0.7436 1.0 0.8529 
PROFESSION 231.0 20.0 30.0 261.0 0.9203 0.8851 0.9023 PHONE 44.0 0.0 6.0 50.0 1.0 0.88 0.9362 COUNTRY 458.0 96.0 103.0 561.0 0.8267 0.8164 0.8215 DOCTOR 432.0 38.0 48.0 480.0 0.9191 0.9 0.9095 AGE 509.0 9.0 10.0 519.0 0.9826 0.9807 0.9817 macro - - - - - - 0.9141 micro - - - - - - 0.8891 ``` --- layout: model title: Romanian T5ForConditionalGeneration Small Cased model (from BlackKakapo) author: John Snow Labs name: t5_small_grammar_v2 date: 2023-01-31 tags: [ro, open_source, t5, tensorflow] task: Text Generation language: ro edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-grammar-ro-v2` is a Romanian model originally trained by `BlackKakapo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_grammar_v2_ro_4.3.0_3.0_1675126347481.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_grammar_v2_ro_4.3.0_3.0_1675126347481.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_grammar_v2","ro") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_grammar_v2","ro") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_grammar_v2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ro| |Size:|288.1 MB| ## References - https://huggingface.co/BlackKakapo/t5-small-grammar-ro-v2 --- layout: model title: Translate Lushai to English Pipeline author: John Snow Labs name: translate_lus_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, lus, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `lus` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_lus_en_xx_2.7.0_2.4_1609686287332.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_lus_en_xx_2.7.0_2.4_1609686287332.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_lus_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_lus_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.lus.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_lus_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English AlbertForQuestionAnswering model (from sultan)
author: John Snow Labs
name: albert_qa_BioM_xxlarge_SQuAD2
date: 2022-06-24
tags: [en, open_source, albert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: AlBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BioM-ALBERT-xxlarge-SQuAD2` is an English model originally trained by `sultan`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_BioM_xxlarge_SQuAD2_en_4.0.0_3.0_1656063644904.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_BioM_xxlarge_SQuAD2_en_4.0.0_3.0_1656063644904.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_BioM_xxlarge_SQuAD2","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_BioM_xxlarge_SQuAD2","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.albert.xxl.by_sultan").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|albert_qa_BioM_xxlarge_SQuAD2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|771.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/sultan/BioM-ALBERT-xxlarge-SQuAD2
- http://participants-area.bioasq.org/results/9b/phaseB/
- https://github.com/salrowili/BioM-Transformers

---
layout: model
title: Italian BertForMaskedLM Base Uncased model (from dbmdz)
author: John Snow Labs
name: bert_embeddings_base_italian_xxl_uncased
date: 2022-12-02
tags: [it, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: it
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-italian-xxl-uncased` is an Italian model originally trained by `dbmdz`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_italian_xxl_uncased_it_4.2.4_3.0_1670018034736.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_italian_xxl_uncased_it_4.2.4_3.0_1670018034736.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_italian_xxl_uncased","it") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_italian_xxl_uncased","it")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_italian_xxl_uncased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|it|
|Size:|415.3 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased
- http://opus.nlpl.eu/
- https://traces1.inria.fr/oscar/
- https://github.com/dbmdz/berts/issues/7
- https://github.com/stefan-it/turkish-bert/tree/master/electra
- https://github.com/stefan-it/italian-bertelectra
- https://github.com/dbmdz/berts/issues/new

---
layout: model
title: Legal Registration Clause Binary Classifier
author: John Snow Labs
name: legclf_registration_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `registration` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend you split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the tutorial linked above).

This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `registration`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_registration_clause_en_1.0.0_3.2_1660123909144.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_registration_clause_en_1.0.0_3.2_1660123909144.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_registration_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
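Since the classifier sees at most 512 tokens of context, longer contracts are usually split into paragraph-sized candidates first. A minimal, Spark-free sketch of the "paragraph splitting (by multiline)" approach mentioned above (the sample clause text is invented for illustration):

```python
import re

def split_paragraphs(text):
    # Split on blank lines and drop empty candidates; each surviving
    # paragraph becomes one row of "clause_text" to classify.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("REGISTRATION RIGHTS.\nThe Company shall register the Shares...\n"
       "\n"
       "GOVERNING LAW.\nThis Agreement is governed by...")
clauses = split_paragraphs(doc)
print(len(clauses))  # 2 paragraph-level candidates to classify
```

Each resulting paragraph can then be loaded into the `clause_text` column of the DataFrame used in the pipeline above.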
## Results

```bash
+--------------+
|        result|
+--------------+
|[registration]|
|       [other]|
|       [other]|
|[registration]|
+--------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_registration_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
              label  precision  recall  f1-score  support
              other       0.90    0.97      0.93       65
registration-rights       0.94    0.83      0.88       41
           accuracy          -       -      0.92      106
          macro-avg       0.92    0.90      0.91      106
       weighted-avg       0.92    0.92      0.91      106
```

---
layout: model
title: Pipeline to Detect Chemical Compounds and Genes
author: John Snow Labs
name: ner_chemprot_clinical_pipeline
date: 2023-03-15
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on the top of the [ner_chemprot_clinical](https://nlp.johnsnowlabs.com/2021/03/31/ner_chemprot_clinical_en.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chemprot_clinical_pipeline_en_4.3.0_3.2_1678865440862.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chemprot_clinical_pipeline_en_4.3.0_3.2_1678865440862.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_chemprot_clinical_pipeline", "en", "clinical/models") text = '''Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_chemprot_clinical_pipeline", "en", "clinical/models") val text = "Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.chemprot_clinical.pipeline").predict("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""") ```
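The chunks returned by `fullAnnotate` carry character-level `begin`/`end` offsets into the input string; in Spark NLP these offsets are inclusive on both ends. A quick pure-Python check of how such offsets slice the chunk back out of the text (a hypothetical `chunk_at` helper, not a Spark NLP API):

```python
text = "Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium."

def chunk_at(text, begin, end):
    # Spark NLP annotation offsets are inclusive, so the Python
    # slice runs to end + 1.
    return text[begin:end + 1]

print(chunk_at(text, 0, 11))   # Keratinocyte
print(chunk_at(text, 13, 18))  # growth
```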
## Results

```bash
|    | ner_chunk    |   begin |   end | ner_label   |   confidence |
|---:|:-------------|--------:|------:|:------------|-------------:|
|  0 | Keratinocyte |       0 |    11 | GENE-Y      |       0.7433 |
|  1 | growth       |      13 |    18 | GENE-Y      |       0.6481 |
|  2 | factor       |      20 |    25 | GENE-Y      |       0.5693 |
|  3 | acidic       |      31 |    36 | GENE-Y      |       0.5518 |
|  4 | fibroblast   |      38 |    47 | GENE-Y      |       0.5111 |
|  5 | growth       |      49 |    54 | GENE-Y      |       0.4559 |
|  6 | factor       |      56 |    61 | GENE-Y      |       0.5213 |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_chemprot_clinical_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel

---
layout: model
title: English RobertaForQuestionAnswering Large Cased model (from deepset)
author: John Snow Labs
name: roberta_qa_large_squad2_hp
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-squad2-hp` is an English model originally trained by `deepset`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_large_squad2_hp_en_4.3.0_3.0_1674222219863.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_large_squad2_hp_en_4.3.0_3.0_1674222219863.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_squad2_hp","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_squad2_hp","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_large_squad2_hp|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/deepset/roberta-large-squad2-hp

---
layout: model
title: Summarize Clinical Question Notes
author: John Snow Labs
name: summarizer_clinical_questions
date: 2023-04-03
tags: [licensed, en, clinical, summarization, tensorflow]
task: Summarization
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.0
supported: true
engine: tensorflow
annotator: MedicalSummarizer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a modified Flan-T5 (LLM) based summarization model, finetuned by John Snow Labs on medical questions exchanged in clinical mediums (clinic, email, call center, etc.). It can generate summaries of up to 512 tokens given an input text (max 1024 tokens).

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/MEDICAL_TEXT_SUMMARIZATION_QA/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/32.Medical_Text_Summarization.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_questions_en_4.3.2_3.0_1680550227628.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_questions_en_4.3.2_3.0_1680550227628.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") summarizer = MedicalSummarizer.pretrained("summarizer_clinical_questions", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("summary")\ .setMaxTextLength(512)\ .setMaxNewTokens(512) pipeline = sparknlp.base.Pipeline(stages=[ document_assembler, summarizer ]) text = """ Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
""" data = spark.createDataFrame([[text]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val summarizer = MedicalSummarizer.pretrained("summarizer_clinical_questions", "en", "clinical/models") .setInputCols("document_prompt") .setOutputCol("answer") .setMaxTextLength(512) .setMaxNewTokens(512) val pipeline = new Pipeline().setStages(Array(document_assembler, summarizer)) val text = """Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. """ val data = Seq(Array(text)).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash ['What are the treatments for hyperthyroidism?'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_clinical_questions| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|920.0 MB| --- layout: model title: Word2Vec Embeddings in Dimli (individual language) (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, diq, open_source] task: Embeddings language: diq edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_diq_3.4.1_3.0_1647467748159.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_diq_3.4.1_3.0_1647467748159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","diq") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","diq") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("diq.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
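Downstream, the 300-dimensional token vectors this lookup produces are typically compared with cosine similarity. A minimal stdlib sketch with toy 3-d stand-ins (not the real model weights):

```python
import math

def cosine(u, v):
    # cos(u, v) = (u · v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# invented toy vectors: related words should score higher than unrelated ones
v_cat = [1.0, 0.9, 0.1]
v_dog = [0.9, 1.0, 0.2]
v_car = [0.1, 0.2, 1.0]
print(cosine(v_cat, v_dog) > cosine(v_cat, v_car))  # True
```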
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|diq| |Size:|101.9 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Sundanese asr_wav2vec2_large_xlsr_sundanese TFWav2Vec2ForCTC from cahya author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_sundanese date: 2022-09-24 tags: [wav2vec2, su, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: su edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_large_xlsr_sundanese` is a Sundanese model originally trained by cahya. NOTE: This pipeline only works on a CPU, if you need to use this pipeline on a GPU device please use pipeline_asr_wav2vec2_large_xlsr_sundanese_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_sundanese_su_4.2.0_3.0_1664039175545.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_sundanese_su_4.2.0_3.0_1664039175545.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_sundanese', lang = 'su') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_sundanese", lang = "su") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_sundanese| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|su| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Professions & Occupations NER model in Spanish (meddroprof_scielowiki) author: John Snow Labs name: meddroprof_scielowiki date: 2021-07-26 tags: [ner, licensed, prefessions, es, occupations] task: Named Entity Recognition language: es edition: Healthcare NLP 3.1.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description NER model that detects professions and occupations in Spanish texts. Trained with the `embeddings_scielowiki_300d` embeddings, and the same `WordEmbeddingsModel` is needed in the pipeline. ## Predicted Entities `ACTIVIDAD`, `PROFESION`, `SITUACION_LABORAL` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_PROFESSIONS_ES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_PROFESSIONS_ES.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/meddroprof_scielowiki_es_3.1.3_3.0_1627328955264.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/meddroprof_scielowiki_es_3.1.3_3.0_1627328955264.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols("document") \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d", "es", "clinical/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("meddroprof_scielowiki", "es", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter])

sample_text = """La paciente es la mayor de 2 hermanos, tiene un hermano de 13 años estudiando 1o ESO. Sus padres son ambos ATS , trabajan en diferentes centros de salud estudiando 1o ESO"""

df = spark.createDataFrame([[sample_text]]).toDF("text")

result = pipeline.fit(df).transform(df)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d", "es", "clinical/models")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("meddroprof_scielowiki", "es", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence,
tokenizer, word_embeddings, clinical_ner, ner_converter)) val data = Seq("""La paciente es la mayor de 2 hermanos, tiene un hermano de 13 años estudiando 1o ESO. Sus padres son ambos ATS , trabajan en diferentes centros de salud estudiando 1o ESO""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.scielowiki").predict("""La paciente es la mayor de 2 hermanos, tiene un hermano de 13 años estudiando 1o ESO. Sus padres son ambos ATS , trabajan en diferentes centros de salud estudiando 1o ESO""") ```
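The `NerConverter` stage groups the token-level `B-`/`I-` tags emitted by the NER model into entity chunks like those shown in the Results below. A minimal pure-Python sketch of that IOB grouping logic (simplified: it ignores sentence boundaries and confidence scores, and is not the actual Spark NLP implementation):

```python
def bio_to_chunks(tokens, tags):
    # Collect (chunk_text, label) pairs: B- starts a chunk, I- extends it,
    # O (or anything else) closes the open chunk.
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Sus", "padres", "son", "ambos", "ATS"]
tags = ["O", "O", "O", "O", "B-PROFESION"]
print(bio_to_chunks(tokens, tags))  # [('ATS', 'PROFESION')]
```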
## Results ```bash +---------------------------------------+-----------------+ |chunk |ner_label | +---------------------------------------+-----------------+ |estudiando 1o ESO |SITUACION_LABORAL| |ATS |PROFESION | |trabajan en diferentes centros de salud|PROFESION | |estudiando 1o ESO |SITUACION_LABORAL| +---------------------------------------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|meddroprof_scielowiki| |Compatibility:|Healthcare NLP 3.1.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, word_embeddings]| |Output Labels:|[ner]| |Language:|es| |Dependencies:|embeddings_scielowiki_300d| ## Data Source The model was trained with the [MEDDOPROF](https://temu.bsc.es/meddoprof/data/) data set: > The MEDDOPROF corpus is a collection of 1844 clinical cases from over 20 different specialties annotated with professions and employment statuses. The corpus was annotated by a team composed of linguists and clinical experts following specially prepared annotation guidelines, after several cycles of quality control and annotation consistency analysis before annotating the entire dataset. Figure 1 shows a screenshot of a sample manual annotation generated using the brat annotation tool. 
Reference: ``` @article{meddoprof, title={NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts}, author={Lima-López, Salvador and Farré-Maduell, Eulàlia and Miranda-Escalada, Antonio and Brivá-Iglesias, Vicent and Krallinger, Martin}, journal = {Procesamiento del Lenguaje Natural}, volume = {67}, year={2021} } ``` ## Benchmarking ```bash label precision recall f1-score support B-ACTIVIDAD 0.82 0.36 0.50 25 B-PROFESION 0.87 0.75 0.81 634 B-SITUACION_LABORAL 0.79 0.67 0.72 310 I-ACTIVIDAD 0.86 0.43 0.57 58 I-PROFESION 0.87 0.80 0.83 944 I-SITUACION_LABORAL 0.74 0.71 0.73 407 O 1.00 1.00 1.00 139880 accuracy - - 0.99 142258 macro-avg 0.85 0.67 0.74 142258 weighted-avg 0.99 0.99 0.99 142258 ``` --- layout: model title: Spanish NER Pipeline author: John Snow Labs name: roberta_token_classifier_bne_capitel_ner_pipeline date: 2022-04-20 tags: [roberta, token_classifier, spanish, ner, es, open_source] task: Named Entity Recognition language: es edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [roberta_token_classifier_bne_capitel_ner_es](https://nlp.johnsnowlabs.com/2021/12/07/roberta_token_classifier_bne_capitel_ner_es.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_bne_capitel_ner_pipeline_es_3.4.1_3.0_1650450203759.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_bne_capitel_ner_pipeline_es_3.4.1_3.0_1650450203759.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("roberta_token_classifier_bne_capitel_ner_pipeline", lang = "es") pipeline.annotate("Me llamo Antonio y trabajo en la fábrica de Mercedes-Benz en Madrid.") ``` ```scala val pipeline = new PretrainedPipeline("roberta_token_classifier_bne_capitel_ner_pipeline", lang = "es") pipeline.annotate("Me llamo Antonio y trabajo en la fábrica de Mercedes-Benz en Madrid.") ```
## Results

```bash
+------------------------+---------+
|chunk                   |ner_label|
+------------------------+---------+
|Antonio                 |PER      |
|fábrica de Mercedes-Benz|ORG      |
|Madrid                  |LOC      |
+------------------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_token_classifier_bne_capitel_ner_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Language:|es|
|Size:|459.4 MB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- RoBertaForTokenClassification
- NerConverter
- Finisher

---
layout: model
title: Fast Neural Machine Translation Model from Greek Languages to English
author: John Snow Labs
name: opus_mt_grk_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, grk, en, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `grk` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_grk_en_xx_2.7.0_2.4_1609166792734.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_grk_en_xx_2.7.0_2.4_1609166792734.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_grk_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

data = ["PUT YOUR GREEK TEXT HERE"]
result = light_pipeline.fullAnnotate(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_grk_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("PUT YOUR GREEK TEXT HERE").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.grk.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_grk_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Legal Utilities Clause Binary Classifier
author: John Snow Labs
name: legclf_utilities_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `utilities` clause type. To use this model, make sure you provide enough context as an input.

Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into account that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (the tutorial linked above also covers this).

This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
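The paragraph-splitting strategy mentioned above can be sketched in plain Python. This is a simplified stand-in for the Spark NLP splitting utilities covered in the linked tutorial, and the sample contract text is invented:

```python
import re

# Simplified sketch: split a long contract into candidate clauses on blank
# lines ("paragraph splitting by multiline"), so each piece can be sent to
# the binary clause classifier separately.
def split_paragraphs(text):
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

contract = """UTILITIES. Tenant shall pay all charges for water, gas and electricity.

NOTICES. All notices under this Agreement shall be in writing."""

clauses = split_paragraphs(contract)
print(len(clauses))  # → 2
```

Each element of `clauses` can then be loaded into the `clause_text` column of the DataFrame shown in the usage example.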
## Predicted Entities `other`, `utilities` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_utilities_clause_en_1.0.0_3.2_1660123178356.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_utilities_clause_en_1.0.0_3.2_1660123178356.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_utilities_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-----------+
|result     |
+-----------+
|[utilities]|
|[other]    |
|[other]    |
|[utilities]|
+-----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_utilities_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
       other       0.99    0.98      0.99      105
   utilities       0.93    0.97      0.95       29
    accuracy          -       -      0.98      134
   macro-avg       0.96    0.97      0.97      134
weighted-avg       0.98    0.98      0.98      134
```

---
layout: model
title: Word2Vec Embeddings in Pashto (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, ps, open_source]
task: Embeddings
language: ps
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Word Embeddings lookup annotator that maps tokens to vectors.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ps_3.4.1_3.0_1647451184427.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ps_3.4.1_3.0_1647451184427.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ps") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["زه د سپارک الاپ خوښوم"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ps") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("زه د سپارک الاپ خوښوم").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ps.embed.w2v_cc_300d").predict("""زه د سپارک الاپ خوښوم""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|ps| |Size:|170.5 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English XLMRobertaForTokenClassification Large Cased model (from asahi417) author: John Snow Labs name: xlmroberta_ner_tner_large_ontonotes5 date: 2022-08-13 tags: [en, open_source, xlm_roberta, ner] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tner-xlm-roberta-large-ontonotes5` is a English model originally trained by `asahi417`. ## Predicted Entities `language`, `time`, `percent`, `quantity`, `product`, `ordinal number`, `cardinal number`, `event`, `geopolitical area`, `facility`, `organization`, `work of art`, `group`, `money`, `law`, `person`, `location`, `date` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_large_ontonotes5_en_4.1.0_3.0_1660425115455.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_large_ontonotes5_en_4.1.0_3.0_1660425115455.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_large_ontonotes5","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_large_ontonotes5","en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("document", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_tner_large_ontonotes5| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|1.8 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/asahi417/tner-xlm-roberta-large-ontonotes5 - https://github.com/asahi417/tner --- layout: model title: English XlmRoBertaForQuestionAnswering (from teacookies) author: John Snow Labs name: xlm_roberta_qa_autonlp_roberta_base_squad2_24465517 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-roberta-base-squad2-24465517` is a English model originally trained by `teacookies`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465517_en_4.0.0_3.0_1655986325897.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465517_en_4.0.0_3.0_1655986325897.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465517","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```

```scala
val document = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = XlmRoBertaForQuestionAnswering
    .pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465517","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.xlm_roberta.base_24465517.by_teacookies").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_autonlp_roberta_base_squad2_24465517|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|887.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/teacookies/autonlp-roberta-base-squad2-24465517

---
layout: model
title: Sentence Entity Resolver for UMLS CUI Codes
author: John Snow Labs
name: sbiobertresolve_umls_findings
date: 2021-10-03
tags: [entity_resolution, licensed, clinical, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.2.3
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model maps clinical entities and concepts to 4 major categories of UMLS CUI codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It loads about 6X faster than previous versions. The load process is also more memory-friendly: the maximum memory required during loading is smaller, which reduces the chance of OOM exceptions and relaxes hardware requirements.
This model returns CUI (concept unique identifier) codes for 200K concepts from clinical findings (see [https://www.nlm.nih.gov/research/umls/index.html](https://www.nlm.nih.gov/research/umls/index.html)).

## Predicted Entities

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_UMLS_CUI/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_findings_en_3.2.3_3.0_1633220877215.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_findings_en_3.2.3_3.0_1633220877215.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") stopwords = StopWordsCleaner.pretrained()\ .setInputCols("token")\ .setOutputCol("cleanTokens")\ .setCaseSensitive(False) word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "cleanTokens"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "cleanTokens", "ner"]) \ .setOutputCol("ner_chunk") chunk2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli",'en','clinical/models')\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_umls_findings","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") pipeline = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver]) data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a 
one-week history of polyuria, polydipsia, poor appetite, and vomiting."""]]).toDF("text")

results = pipeline.fit(data).transform(data)
```

```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val stopwords = StopWordsCleaner.pretrained()
    .setInputCols("token")
    .setOutputCol("cleanTokens")
    .setCaseSensitive(false)

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "cleanTokens"))
    .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "cleanTokens", "ner"))
    .setOutputCol("ner_chunk")

val chunk2doc = new Chunk2Doc()
    .setInputCols("ner_chunk")
    .setOutputCol("ner_chunk_doc")

val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
    .setInputCols("ner_chunk_doc")
    .setOutputCol("sbert_embeddings")

val resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_umls_findings", "en", "clinical/models")
    .setInputCols(Array("sbert_embeddings"))
    .setOutputCol("resolution")
    .setDistanceFunction("EUCLIDEAN")

val p_model = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver))

val data = Seq("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to
presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.""").toDS().toDF("text") val res = p_model.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.umls.findings").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.""") ```
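Conceptually, the resolver embeds each detected chunk with the sentence-BERT model and returns the UMLS code of the nearest pre-computed concept embedding under the configured distance function. A toy, pure-Python sketch of Euclidean-distance resolution, where the 2-d vectors and the pairing of codes to concepts are invented for illustration:

```python
import math

# Toy illustration of Euclidean-distance entity resolution: each NER chunk
# embedding is mapped to the code of the closest concept embedding.
# The vectors below are made up; real sentence embeddings are ~768-d.
concepts = {
    "C0042963": (1.0, 0.0),  # e.g. "vomiting"
    "C3278312": (0.0, 1.0),  # e.g. "polyuria"
}

def resolve(chunk_vec):
    def dist(v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, chunk_vec)))
    return min(concepts, key=lambda code: dist(concepts[code]))

print(resolve((0.9, 0.1)))  # → C0042963
```

The actual model performs this search over 200K concept embeddings with an optimized index rather than a linear scan.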
## Results

```bash
|    | ner_chunk                             | cui_code   |
|---:|:--------------------------------------|:-----------|
|  0 | gestational diabetes mellitus         | C2183115   |
|  1 | subsequent type two diabetes mellitus | C3532488   |
|  2 | T2DM                                  | C3280267   |
|  3 | HTG-induced pancreatitis              | C4554179   |
|  4 | an acute hepatitis                    | C4750596   |
|  5 | obesity                               | C1963185   |
|  6 | a body mass index                     | C0578022   |
|  7 | polyuria                              | C3278312   |
|  8 | polydipsia                            | C3278316   |
|  9 | poor appetite                         | C0541799   |
| 10 | vomiting                              | C0042963   |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_umls_findings|
|Compatibility:|Healthcare NLP 3.2.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_chunk_embeddings]|
|Output Labels:|[umls_code]|
|Language:|en|
|Case sensitive:|false|

## Data Source

Trained on 200K concepts from clinical findings: [https://www.nlm.nih.gov/research/umls/index.html](https://www.nlm.nih.gov/research/umls/index.html)

---
layout: model
title: Legal Withholdings Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_withholdings_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, withholdings, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

This model is a Binary Classifier (True, False) for the `Withholdings` clause type. To use this model, make sure you provide enough context as an input.
Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into account that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (the tutorial linked above also covers this).

This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`Withholdings`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_withholdings_bert_en_1.0.0_3.0_1678049980404.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_withholdings_bert_en_1.0.0_3.0_1678049980404.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_withholdings_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
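As a rough guard for the 512-token embedding limit noted in the description, you can pre-check provision length before classification. This sketch counts whitespace-separated tokens, which only approximates the model tokenizer's subword count:

```python
# Approximate pre-check against the 512-token embedding limit. A whitespace
# split undercounts subword tokens, so treat this as a heuristic only.
MAX_TOKENS = 512

def needs_splitting(text, limit=MAX_TOKENS):
    return len(text.split()) > limit

clause = "The Employer shall withhold all applicable federal, state and local taxes."
print(needs_splitting(clause))  # → False
```

Provisions that fail this check should go through one of the splitting techniques listed above before being added to the `text` column.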
## Results

```bash
+--------------+
|result        |
+--------------+
|[Withholdings]|
|[Other]       |
|[Other]       |
|[Withholdings]|
+--------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_withholdings_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
       label  precision  recall  f1-score  support
       Other       0.99    0.98      0.98       85
Withholdings       0.97    0.98      0.98       61
    accuracy          -       -      0.98      146
   macro-avg       0.98    0.98      0.98      146
weighted-avg       0.98    0.98      0.98      146
```

---
layout: model
title: English XlmRoBertaForQuestionAnswering (from seongju)
author: John Snow Labs
name: xlm_roberta_qa_squadv2_xlm_roberta_base
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squadv2-xlm-roberta-base` is an English model originally trained by `seongju`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_squadv2_xlm_roberta_base_en_4.0.0_3.0_1655988029859.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_squadv2_xlm_roberta_base_en_4.0.0_3.0_1655988029859.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_squadv2_xlm_roberta_base","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```

```scala
val document = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = XlmRoBertaForQuestionAnswering
    .pretrained("xlm_roberta_qa_squadv2_xlm_roberta_base","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.xlm_roberta.base_v2").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_squadv2_xlm_roberta_base| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|875.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/seongju/squadv2-xlm-roberta-base - https://rajpurkar.github.io/SQuAD-explorer/ --- layout: model title: English image_classifier_vit_denver_nyc_paris ViTForImageClassification from nateraw author: John Snow Labs name: image_classifier_vit_denver_nyc_paris date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_denver_nyc_paris` is a English model originally trained by nateraw. ## Predicted Entities `denver`, `new york city`, `paris` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_denver_nyc_paris_en_4.1.0_3.0_1660172026182.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_denver_nyc_paris_en_4.1.0_3.0_1660172026182.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
image_assembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

imageClassifier = ViTForImageClassification \
    .pretrained("image_classifier_vit_denver_nyc_paris", "en") \
    .setInputCols("image_assembler") \
    .setOutputCol("class")

pipeline = Pipeline(stages=[
    image_assembler,
    imageClassifier,
])

pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```

```scala
val imageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val imageClassifier = ViTForImageClassification
  .pretrained("image_classifier_vit_denver_nyc_paris", "en")
  .setInputCols("image_assembler")
  .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))

val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
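The `class` column of `pipelineDF` carries one predicted label per image. For a quick distribution over a batch, a tally like the following works — the label list here is a mocked stand-in for values collected from the DataFrame, not real pipeline output:

```python
from collections import Counter

# Mocked per-image predictions, standing in for values collected from the
# "class" column of pipelineDF (labels come from the Predicted Entities set).
predicted = ["paris", "denver", "paris", "new york city", "paris"]

counts = Counter(predicted)
print(counts.most_common(1))  # -> [('paris', 3)]
```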
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_denver_nyc_paris|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|

---
layout: model
title: Embeddings Scielo 50 dims
author: John Snow Labs
name: embeddings_scielo_50d
class: WordEmbeddingsModel
language: es
repository: clinical/models
date: 2020-05-26
task: Embeddings
edition: Healthcare NLP 2.5.0
spark_version: 2.4
tags: [clinical,embeddings,es]
supported: true
annotator: WordEmbeddingsModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

{:.h2_title}
## Description

Word Embeddings lookup annotator that maps tokens to vectors.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/embeddings_scielo_50d_es_2.5.0_2.4_1590467114993.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/embeddings_scielo_50d_es_2.5.0_2.4_1590467114993.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
model = WordEmbeddingsModel.pretrained("embeddings_scielo_50d", "es", "clinical/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("word_embeddings")
```

```scala
val model = WordEmbeddingsModel.pretrained("embeddings_scielo_50d", "es", "clinical/models")
  .setInputCols("document", "token")
  .setOutputCol("word_embeddings")
```

{:.nlu-block}
```python
import nlu
nlu.load("es.embed.scielo.50d").predict("""Put your text here.""")
```
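Each token is mapped to a 50-dimensional vector, and downstream comparisons between tokens are typically done with cosine similarity. A minimal, library-free sketch of that computation — the short vectors below are made-up toy values, not actual `embeddings_scielo_50d` output:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d stand-ins for two word vectors (real ones are 50-d):
v_fiebre = [0.12, -0.45, 0.83]
v_dolor  = [0.10, -0.40, 0.90]
print(round(cosine_similarity(v_fiebre, v_dolor), 3))
```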
{:.h2_title}
## Results

Word2Vec feature vectors based on `embeddings_scielo_50d`.

{:.model-param}
## Model Information

{:.table-model}
|---------------|-----------------------|
| Name: | embeddings_scielo_50d |
| Type: | WordEmbeddingsModel |
| Compatibility: | Spark NLP 2.5.0+ |
| License: | Licensed |
| Edition: | Official |
| Input labels: | [document, token] |
| Output labels: | [word_embeddings] |
| Language: | es |
| Dimension: | 50.0 |

{:.h2_title}
## Data Source

Trained on Scielo Articles: https://zenodo.org/record/3744326#.XtViinVKh_U

---
layout: model
title: Spanish RobertaForQuestionAnswering Base Cased model (from IIC)
author: John Snow Labs
name: roberta_qa_base_spanish_squades
date: 2022-12-02
tags: [es, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: es
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-squades` is a Spanish model originally trained by `IIC`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_es_4.2.4_3.0_1669986476235.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_es_4.2.4_3.0_1669986476235.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades", "es")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val Document_Assembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades", "es")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_spanish_squades|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|459.6 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/IIC/roberta-base-spanish-squades
- https://arxiv.org/abs/2107.07253
- https://paperswithcode.com/sota?task=question-answering&dataset=squad_es

---
layout: model
title: Detect Clinical Conditions (ner_eu_clinical_condition - es)
author: John Snow Labs
name: ner_eu_clinical_condition
date: 2023-02-06
tags: [es, clinical, licensed, ner, clinical_condition]
task: Named Entity Recognition
language: es
edition: Healthcare NLP 4.2.8
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained named entity recognition (NER) deep learning model for extracting clinical conditions from Spanish texts. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, "Named Entity Recognition with Bidirectional LSTM-CNN". The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives.

## Predicted Entities

`clinical_condition`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_es_4.2.8_3.0_1675721390266.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_es_4.2.8_3.0_1675721390266.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "es")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_eu_clinical_condition", "es", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    sentenceDetectorDL,
    tokenizer,
    word_embeddings,
    ner,
    ner_converter])

data = spark.createDataFrame([["""La exploración abdominal revela una cicatriz de laparotomía media infraumbilical, la presencia de ruidos disminuidos, y dolor a la palpación de manera difusa sin claros signos de irritación peritoneal. No existen hernias inguinales o crurales."""]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documenter = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "es")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val ner_model = MedicalNerModel.pretrained("ner_eu_clinical_condition", "es", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documenter, sentenceDetector, tokenizer, word_embeddings, ner_model, ner_converter))

val data = Seq("""La exploración abdominal revela una cicatriz de laparotomía media infraumbilical, la presencia de ruidos disminuidos, y dolor a la palpación de manera difusa sin claros signos de irritación peritoneal. No existen hernias inguinales o crurales.""").toDF("text")

val result = pipeline.fit(data).transform(data)
```
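The benchmarking figures reported further down follow directly from the confusion counts with the standard precision/recall/F1 definitions; a quick arithmetic check:

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts reported for the clinical_condition label: tp=354, fp=42, fn=84
p, r, f = prf1(354, 42, 84)
print(round(p, 4), round(r, 4), round(f, 4))  # 0.8939 0.8082 0.8489
```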
## Results

```bash
+--------------------+------------------+
|chunk               |ner_label         |
+--------------------+------------------+
|cicatriz            |clinical_condition|
|dolor a la palpación|clinical_condition|
|signos              |clinical_condition|
|irritación          |clinical_condition|
|hernias inguinales  |clinical_condition|
+--------------------+------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_eu_clinical_condition|
|Compatibility:|Healthcare NLP 4.2.8+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|es|
|Size:|898.1 KB|

## References

The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives.

## Benchmarking

```bash
             label     tp    fp    fn  total  precision  recall      f1
clinical_condition  354.0  42.0  84.0  438.0     0.8939  0.8082  0.8489
             macro      -     -     -      -          -       -  0.8489
             micro      -     -     -      -          -       -  0.8489
```

---
layout: model
title: NER Model Finder with Sentence Entity Resolvers (SBert, Medium, Uncased)
author: John Snow Labs
name: sbertresolve_ner_model_finder
date: 2022-01-17
tags: [ner, licensed, clinical, entity_resolver, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.3.2
spark_version: 2.4
supported: true
recommended: true
annotator: SentenceEntityResolverModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model maps clinical entities (NER labels) to the most appropriate NER models using `sbert_jsl_medium_uncased` Sentence Bert Embeddings. Given an entity name, it returns a list of pretrained NER models covering that entity or similar ones.
## Predicted Entities

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_ner_model_finder_en_3.3.2_2.4_1642422477025.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_ner_model_finder_en_3.3.2_2.4_1642422477025.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("ner_chunk")

sbert_embedder = BertSentenceEmbeddings\
    .pretrained("sbert_jsl_medium_uncased", "en", "clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("sbert_embeddings")

ner_model_finder = SentenceEntityResolverModel\
    .pretrained("sbertresolve_ner_model_finder", "en", "clinical/models")\
    .setInputCols(["ner_chunk", "sbert_embeddings"])\
    .setOutputCol("model_names")\
    .setDistanceFunction("EUCLIDEAN")

ner_model_finder_pipelineModel = PipelineModel(stages=[documentAssembler, sbert_embedder, ner_model_finder])

light_pipeline = LightPipeline(ner_model_finder_pipelineModel)

annotations = light_pipeline.fullAnnotate("medication")
```

```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("ner_chunk")

val sbert_embedder = BertSentenceEmbeddings
  .pretrained("sbert_jsl_medium_uncased", "en", "clinical/models")
  .setInputCols(Array("ner_chunk"))
  .setOutputCol("sbert_embeddings")

val ner_model_finder = SentenceEntityResolverModel
  .pretrained("sbertresolve_ner_model_finder", "en", "clinical/models")
  .setInputCols(Array("ner_chunk", "sbert_embeddings"))
  .setOutputCol("model_names")
  .setDistanceFunction("EUCLIDEAN")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, ner_model_finder))

val ner_model_finder_pipelineModel = pipeline.fit(Seq("").toDF("text"))

val light_pipeline = new LightPipeline(ner_model_finder_pipelineModel)

val annotations = light_pipeline.fullAnnotate("medication")
```

{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.ner.model_finder").predict("""Put your text here.""")
```
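As the Results section shows, the resolver returns the candidate lists in `:::`-joined columns (`resolutions` and `all_models`), one model list per resolved entity name. A small helper for pairing them up — the column strings below are shortened, mocked samples in that format, not a real prediction:

```python
def parse_resolver_output(resolutions, all_models):
    """Pair each ':::'-separated entity name with its candidate NER model list."""
    names = resolutions.split(":::")
    model_lists = [m.strip() for m in all_models.split(":::")]
    return dict(zip(names, model_lists))

# Shortened, mocked columns in the format shown under Results:
resolutions = "medication:::drug chemical:::abbreviation"
all_models = "['ner_posology', 'ner_drugs_large']:::['ner_drugs']:::['ner_chemd_clinical']"

pairs = parse_resolver_output(resolutions, all_models)
print(pairs["drug chemical"])  # -> ['ner_drugs']
```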
## Results

```bash
entity      : medication

models      : ['ner_posology', 'ner_posology_large', 'ner_posology_small', 'ner_posology_greedy', 'ner_drugs_large', 'ner_posology_experimental', 'ner_drugs_greedy', 'ner_ade_clinical', 'ner_jsl_slim', 'ner_posology_healthcare', 'ner_ade_healthcare', 'jsl_ner_wip_modifier_clinical', 'ner_ade_clinical', 'ner_jsl_greedy', 'ner_risk_factors']

resolutions : medication:::drug:::treatment:::therapeutic procedure:::drug ingredient:::drug chemical:::diagnostic aid:::substance:::medical device:::diagnostic procedure:::administration:::measurement:::drug strength:::physiological reaction:::patient:::vaccine:::psychological condition:::abbreviation

# all_models: one candidate model list per resolution, ::: separated
all_models  : ['ner_posology', 'ner_posology_large', 'ner_posology_small', 'ner_posology_greedy', 'ner_drugs_large', 'ner_posology_experimental', 'ner_drugs_greedy', 'ner_ade_clinical', 'ner_jsl_slim', 'ner_posology_healthcare', 'ner_ade_healthcare', 'jsl_ner_wip_modifier_clinical', 'ner_ade_clinical', 'ner_jsl_greedy', 'ner_risk_factors']
  :::['ner_posology', 'ner_posology_large', 'ner_posology_small', 'ner_posology_greedy', 'ner_drugs_large', 'ner_posology_experimental', 'ner_drugs_greedy', 'ner_jsl_slim', 'ner_posology_healthcare', 'ner_ade_healthcare', 'jsl_ner_wip_modifier_clinical', 'ner_ade_clinical', 'ner_jsl_greedy', 'ner_risk_factors']
  :::['jsl_rd_ner_wip_greedy_clinical', 'ner_clinical_large', 'ner_healthcare', 'ner_jsl_enriched', 'ner_clinical', 'ner_jsl_slim', 'ner_covid_trials', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_events_admission_clinical', 'ner_events_healthcare', 'ner_events_clinical', 'ner_jsl_greedy']
  :::['ner_medmentions_coarse']
  :::['ner_jsl_enriched', 'ner_covid_trials', 'ner_jsl', 'ner_medmentions_coarse']
  :::['ner_drugs']
  :::['ner_clinical_icdem', 'ner_medmentions_coarse']
  :::['jsl_rd_ner_wip_greedy_clinical', 'ner_jsl_enriched', 'ner_medmentions_coarse', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_greedy']
  :::['jsl_rd_ner_wip_greedy_clinical', 'ner_jsl_enriched', 'ner_medmentions_coarse', 'ner_radiology_wip_clinical', 'ner_jsl_slim', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_greedy', 'ner_radiology']
  :::['ner_medmentions_coarse', 'ner_clinical_icdem']
  :::['ner_posology_experimental']
  :::['jsl_rd_ner_wip_greedy_clinical', 'ner_measurements_clinical', 'ner_radiology_wip_clinical', 'ner_radiology']
  :::['jsl_rd_ner_wip_greedy_clinical', 'ner_posology_small', 'ner_jsl_enriched', 'ner_posology_experimental', 'ner_posology_large', 'ner_posology_healthcare', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_posology_greedy', 'ner_posology', 'ner_jsl_greedy']
  :::['ner_covid_trials', 'ner_medmentions_coarse', 'jsl_rd_ner_wip_greedy_clinical', 'ner_jsl_enriched', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_greedy']
  :::['ner_deid_subentity_augmented', 'ner_deid_subentity_glove', 'ner_deidentify_dl', 'ner_deid_enriched']
  :::['jsl_rd_ner_wip_greedy_clinical', 'ner_jsl_enriched', 'ner_covid_trials', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_greedy']
  :::['ner_medmentions_coarse', 'jsl_rd_ner_wip_greedy_clinical', 'ner_jsl_enriched', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_greedy']
  :::['ner_chemd_clinical']
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sbertresolve_ner_model_finder|
|Compatibility:|Healthcare NLP 3.3.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sbert_embeddings]|
|Output Labels:|[models]|
|Language:|en|
|Size:|611.1 KB|
|Case sensitive:|false|

--- layout: model title: Fast Neural Machine Translation Model from Afrikaans to Swedish author: John Snow Labs name: opus_mt_af_sv date: 2021-06-01 tags: [open_source, seq2seq, translation, af, sv, xx, multilingual] task:
Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). source languages: af target languages: sv {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_af_sv_xx_3.1.0_2.4_1622562929786.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_af_sv_xx_3.1.0_2.4_1622562929786.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_af_sv", "xx") \
    .setInputCols(["sentence"]) \
    .setOutputCol("translation")
```

```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols("document")
  .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_af_sv", "xx")
  .setInputCols("sentence")
  .setOutputCol("translation")
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Afrikaans.translate_to.Swedish').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_af_sv|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune TFWav2Vec2ForCTC from hrdipto
author: John Snow Labs
name: pipeline_asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune` is an English model originally trained by hrdipto.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune_en_4.2.0_3.0_1664041738598.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune_en_4.2.0_3.0_1664041738598.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune', lang = 'en')

annotations = pipeline.transform(audioDF)
```

```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune", lang = "en")

val annotations = pipeline.transform(audioDF)
```
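The `audioDF` fed to the pipeline is expected to carry floating-point audio samples (Wav2vec2-family models are trained on 16 kHz mono audio). If your source is 16-bit PCM integers, one common convention is to scale them into the [-1.0, 1.0) range before building the DataFrame; a minimal sketch of that scaling (the convention, not a requirement verified against this specific pipeline):

```python
def pcm16_to_float(samples):
    """Scale signed 16-bit PCM integer samples to floats in [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]

raw = [0, 16384, -32768, 32767]
print(pcm16_to_float(raw))  # [0.0, 0.5, -1.0, 0.999969482421875]
```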
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: Translate English to Dutch Pipeline
author: John Snow Labs
name: translate_en_nl
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, nl, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list).

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `en`
- target languages: `nl`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_nl_xx_2.7.0_2.4_1609688238018.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_nl_xx_2.7.0_2.4_1609688238018.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_nl", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_nl", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.nl').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_nl| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Named Entity Recognition - BERT Base (OntoNotes) author: John Snow Labs name: onto_bert_base_cased date: 2020-12-05 task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [ner, en, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Onto is a Named Entity Recognition (or NER) model trained on OntoNotes 5.0. It can extract up to 18 entities such as people, places, organizations, money, time, date, etc. This model uses the pretrained `bert_base_cased` embeddings model from `BertEmbeddings` annotator as an input. ## Predicted Entities `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_bert_base_cased_en_2.7.0_2.4_1607197077494.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_bert_base_cased_en_2.7.0_2.4_1607197077494.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("bert_base_cased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") ner_onto = NerDLModel.pretrained("onto_bert_base_cased", "en") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."]], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("bert_base_cased", "en") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_onto = NerDLModel.pretrained("onto_bert_base_cased", "en") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter)) val data = Seq("William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."""] ner_df = nlu.load('en.ner.onto.bert.cased_base').predict(text, output_level='chunk') ner_df[["entities", "entities_class"]] ```
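The `ner_converter` stage in the pipelines above groups the per-token IOB tags emitted by `NerDLModel` into labeled chunks like those in the Results table below. A minimal sketch of that grouping logic (hand-written toy tags; not the annotator's actual implementation):

```python
# Group IOB-tagged tokens into (chunk, label) pairs, as a NER converter does.
def iob_to_chunks(tokens, tags):
    chunks, cur, cur_label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur:                       # close the previous chunk
                chunks.append((" ".join(cur), cur_label))
            cur, cur_label = [tok], tag[2:]
        elif tag.startswith("I-") and cur_label == tag[2:]:
            cur.append(tok)               # continue the current chunk
        else:                             # "O" or a stray "I-" tag
            if cur:
                chunks.append((" ".join(cur), cur_label))
            cur, cur_label = [], None
    if cur:
        chunks.append((" ".join(cur), cur_label))
    return chunks

print(iob_to_chunks(
    ["William", "Henry", "Gates", "was", "born", "in", "Seattle"],
    ["B-PERSON", "I-PERSON", "I-PERSON", "O", "O", "O", "B-GPE"]))
# -> [('William Henry Gates', 'PERSON'), ('Seattle', 'GPE')]
```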
{:.h2_title} ## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |William Henry Gates III|PERSON | |October 28, 1955 |DATE | |American |NORP | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PERSON | |May 2014 |DATE | |one |CARDINAL | |the 1970s and 1980s |DATE | |Seattle |GPE | |Washington |GPE | |Gates |PERSON | |Paul Allen |PERSON | |1975 |DATE | |Albuquerque |GPE | |New Mexico |GPE | |Gates |ORG | |January 2000 |DATE | |the late 1990s |DATE | |Gates |PERSON | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_bert_base_cased| |Type:|ner| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source The model is trained based on data from [OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) ## Benchmarking ```bash Micro-average: prec: 0.8987879, rec: 0.90063596, f1: 0.89971095 CoNLL Eval: processed 152728 tokens with 11257 phrases; found: 11276 phrases; correct: 10006. 
accuracy: 98.01%; 10006 11257 11276 precision: 88.74%; recall: 88.89%; FB1: 88.81 CARDINAL: 822 935 990 precision: 83.03%; recall: 87.91%; FB1: 85.40 990 DATE: 1355 1602 1567 precision: 86.47%; recall: 84.58%; FB1: 85.52 1567 EVENT: 32 63 59 precision: 54.24%; recall: 50.79%; FB1: 52.46 59 FAC: 96 135 124 precision: 77.42%; recall: 71.11%; FB1: 74.13 124 GPE: 2116 2240 2182 precision: 96.98%; recall: 94.46%; FB1: 95.70 2182 LANGUAGE: 10 22 11 precision: 90.91%; recall: 45.45%; FB1: 60.61 11 LAW: 21 40 28 precision: 75.00%; recall: 52.50%; FB1: 61.76 28 LOC: 141 179 178 precision: 79.21%; recall: 78.77%; FB1: 78.99 178 MONEY: 278 314 321 precision: 86.60%; recall: 88.54%; FB1: 87.56 321 NORP: 799 841 850 precision: 94.00%; recall: 95.01%; FB1: 94.50 850 ORDINAL: 177 195 217 precision: 81.57%; recall: 90.77%; FB1: 85.92 217 ORG: 1606 1795 1848 precision: 86.90%; recall: 89.47%; FB1: 88.17 1848 PERCENT: 306 349 344 precision: 88.95%; recall: 87.68%; FB1: 88.31 344 PERSON: 1856 1988 1978 precision: 93.83%; recall: 93.36%; FB1: 93.60 1978 PRODUCT: 54 76 76 precision: 71.05%; recall: 71.05%; FB1: 71.05 76 QUANTITY: 87 105 108 precision: 80.56%; recall: 82.86%; FB1: 81.69 108 TIME: 143 212 216 precision: 66.20%; recall: 67.45%; FB1: 66.82 216 WORK_OF_ART: 107 166 179 precision: 59.78%; recall: 64.46%; FB1: 62.03 179 ``` --- layout: model title: Traditional Chinese Word Segmentation author: John Snow Labs name: wordseg_gsd_ud_trad date: 2021-01-25 task: Word Segmentation language: zh edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [word_segmentation, zh, open_source] supported: true annotator: WordSegmenterModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WordSegmenterModel (WSM) is based on maximum entropy probability model to detect word boundaries in Chinese text. Chinese text is written without white space between the words, and a computer-based application cannot know a priori which sequence of ideograms form a word. 
Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step. This model was trained on *traditional characters* in Chinese texts. Reference: - Xue, Nianwen. “Chinese word segmentation as character tagging.” International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_gsd_ud_trad_zh_2.7.0_2.4_1611584735643.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_gsd_ud_trad_zh_2.7.0_2.4_1611584735643.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline as a substitute for the Tokenizer stage.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud_trad", "zh")\ .setInputCols(["sentence"])\ .setOutputCol("token") pipeline = Pipeline(stages=[document_assembler, sentence_detector, word_segmenter]) ws_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) example = spark.createDataFrame([['然而,這樣的處理也衍生了一些問題。']], ["text"]) result = ws_model.transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud_trad", "zh") .setInputCols(Array("sentence")) .setOutputCol("token") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter)) val data = Seq("然而,這樣的處理也衍生了一些問題。").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""然而,這樣的處理也衍生了一些問題。"""] token_df = nlu.load('zh.segment_words.gsd').predict(text) token_df ```
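The character-tagging formulation cited above (Xue, 2003) labels each character with its position inside a word; the segmentation is then read directly off the tag sequence. A toy sketch using a BMES scheme with hand-written tags (the model learns these tags; they are hard-coded here only for illustration):

```python
# Recover words from a BMES character-tag sequence:
# B = word-begin, M = word-middle, E = word-end, S = single-character word.
def bmes_to_words(chars, tags):
    words, cur = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            if cur:
                words.append(cur)
                cur = ""
            words.append(ch)
        elif tag == "B":
            if cur:
                words.append(cur)
            cur = ch
        elif tag == "M":
            cur += ch
        else:  # "E": final character of the current word
            words.append(cur + ch)
            cur = ""
    if cur:
        words.append(cur)
    return words

print(bmes_to_words(list("然而這樣的處理"), ["B", "E", "B", "E", "S", "B", "E"]))
# -> ['然而', '這樣', '的', '處理']
```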
## Results ```bash +-----------------------------------------+-----------------------------------------------------------+ |text | result | +-----------------------------------------+-----------------------------------------------------------+ |然而 , 這樣 的 處理 也 衍生 了 一些 問題 。 |[然而, ,, 這樣, 的, 處理, 也, 衍生, 了, 一些, 問題, 。] | +-----------------------------------------+-----------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wordseg_gsd_ud_trad| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[token]| |Language:|zh| ## Data Source The model was trained on the [Universal Dependencies](https://universaldependencies.org/) for Traditional Chinese annotated and converted by Google. ## Benchmarking ```bash | precision | recall | f1-score | |--------------|----------|------------| | 0.7392 | 0.7754 | 0.7569 | ``` --- layout: model title: Legal Distributions Clause Binary Classifier author: John Snow Labs name: legclf_distributions_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `distributions` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.
If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings used by this model allow up to 512 tokens; if your text is longer than that, consider splitting it into smaller pieces (the same tutorial shows how). This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `distributions` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_distributions_clause_en_1.0.0_3.2_1660123426386.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_distributions_clause_en_1.0.0_3.2_1660123426386.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_distributions_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
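As the description notes, stacking several of these binary clause classifiers yields one True/False flag per clause type. A small sketch of collapsing their `category` outputs into such flags (the input dict below is an illustrative stand-in for the classifiers' result columns, not an actual Spark NLP API):

```python
# Map each classifier's predicted label to a boolean clause flag:
# anything other than the "other" class counts as a positive detection.
def clause_flags(predictions):
    """predictions: {clause_name: predicted_label} -> {clause_name: bool}"""
    return {name: label != "other" for name, label in predictions.items()}

print(clause_flags({"distributions": "distributions", "warranty": "other"}))
# -> {'distributions': True, 'warranty': False}
```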
## Results ```bash +-------+ | result| +-------+ |[distributions]| |[other]| |[other]| |[distributions]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_distributions_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support distributions 0.88 0.90 0.89 41 other 0.93 0.92 0.92 60 accuracy - - 0.91 101 macro-avg 0.91 0.91 0.91 101 weighted-avg 0.91 0.91 0.91 101 ``` --- layout: model title: Legal Master Administrative Services Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_master_administrative_services_agreement_bert date: 2023-01-29 tags: [en, legal, classification, master, administrative, services, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_master_administrative_services_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `master-administrative-services-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time.
## Predicted Entities `master-administrative-services-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_master_administrative_services_agreement_bert_en_1.0.0_3.0_1674990113596.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_master_administrative_services_agreement_bert_en_1.0.0_3.0_1674990113596.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_master_administrative_services_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[master-administrative-services-agreement]| |[other]| |[other]| |[master-administrative-services-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_master_administrative_services_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support master-administrative-services-agreement 1.00 1.00 1.00 54 other 1.00 1.00 1.00 101 accuracy - - 1.00 155 macro-avg 1.00 1.00 1.00 155 weighted-avg 1.00 1.00 1.00 155 ``` --- layout: model title: Clinical Deidentification (English, Glove, Augmented) author: John Snow Labs name: clinical_deidentification_glove_augmented date: 2022-03-22 tags: [deid, deidentification, en, licensed] task: De-identification language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true recommended: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline is trained with lightweight glove_100d embeddings and can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`, `CITY`, `COUNTRY`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `STREET`, `USERNAME`, `ZIP`, `ACCOUNT`, `LICENSE`, `VIN`, `SSN`, `DLN`, `PLATE`, `IPADDR` entities. It differs from `clinical_deidentification_glove` in how it handles PHONE and PATIENT entities, adding, on top of the NER models, rules in Contextual Parser components.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_glove_augmented_en_3.4.1_3.0_1647966639326.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_glove_augmented_en_3.4.1_3.0_1647966639326.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification_glove_augmented", "en", "clinical/models") deid_pipeline.annotate("Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN:324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years-old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = PretrainedPipeline("clinical_deidentification_glove_augmented","en","clinical/models") val result = deid_pipeline.annotate("Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN:324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years-old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.") ``` {:.nlu-block} ```python import nlu nlu.load("en.deid.glove_augmented.pipeline").predict("""Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN:324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years-old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.""") ```
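For intuition only: the pipeline's "masked" output replaces each detected PHI chunk with a placeholder. The toy regex patterns below mimic that behavior on a fragment of the example text; the real pipeline relies on NER models and contextual-parser rules, not regexes like these:

```python
# Illustrative masking sketch: substitute a label placeholder for each
# matched PHI pattern. These patterns are deliberately simplistic.
import re

PATTERNS = {
    "DATE": r"\b\d{4}-\d{2}-\d{2}\b",
    "SSN": r"\bSSN:\d{9}\b",
    "EMAIL": r"\b[\w.]+@[\w.]+\.\w+\b",
}

def mask_phi(text):
    for label, pat in PATTERNS.items():
        text = re.sub(pat, f"<{label}>", text)
    return text

print(mask_phi("Record date : 2093-01-13, the SSN:324598674 and e-mail: hale@gmail.com."))
# -> Record date : <DATE>, the <SSN> and e-mail: <EMAIL>.
```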
## Results ```bash {'sentence': ['Record date : 2093-01-13, David Hale, M.D.', 'IP: 203.120.223.13.', 'The driver's license no:A334455B.', 'the SSN:324598674 and e-mail: hale@gmail.com.', 'Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93.', 'PCP : Oliveira, 25 years-old.', 'Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.'], 'masked': ['Record date : , , M.D.', 'IP: .', 'The driver's license .', 'the and e-mail: .', 'Name : MR. # Date : .', 'PCP : , years-old.', 'Record date : , Patient's VIN : .'], 'obfuscated': ['Record date : 2093-02-13, Shella Solan, M.D.', 'IP: 444.444.444.444.', 'The driver's license O497302436569.', 'the SSN-539-29-1060 and e-mail: Keith@google.com.', 'Name : Roscoe Kerns MR. # Q984288 Date : 10-08-1991.', 'PCP : Dr Rudell Dubin, 10 years-old.', 'Record date : 2079-12-30, Patient's VIN : 5eeee44ffff555666.'], 'ner_chunk': ['2093-01-13', 'David Hale', 'no:A334455B', 'SSN:324598674', 'Hendrickson, Ora', '719435', '01/13/93', 'Oliveira', '25', '2079-11-09', '1HGBH41JXMN109286']} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification_glove_augmented| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|181.3 MB| ## Included Models - nlp.DocumentAssembler - nlp.SentenceDetector - nlp.TokenizerModel - nlp.WordEmbeddingsModel - medical.NerModel - medical.NerConverterInternal - medical.NerModel - medical.NerConverterInternal - ChunkMergeModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ChunkMergeModel - ChunkMergeModel - medical.DeIdentificationModel - medical.DeIdentificationModel - medical.DeIdentificationModel - medical.DeIdentificationModel - Finisher --- layout: model title: English BertForQuestionAnswering model (from madlag) author: John Snow 
Labs name: bert_qa_bert_base_uncased_squadv1_x1.16_f88.1_d8_unstruct_v1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squadv1-x1.16-f88.1-d8-unstruct-v1` is an English model originally trained by `madlag`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squadv1_x1.16_f88.1_d8_unstruct_v1_en_4.0.0_3.0_1654181545377.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squadv1_x1.16_f88.1_d8_unstruct_v1_en_4.0.0_3.0_1654181545377.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_squadv1_x1.16_f88.1_d8_unstruct_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_squadv1_x1.16_f88.1_d8_unstruct_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased.x1.16_f88.1_d8_unstruct.by_madlag").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
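Extractive QA models such as BertForQuestionAnswering score every context token as a potential answer start and end, and the answer is the highest-scoring valid span. A sketch of that span selection with made-up scores (not the model's real logits):

```python
# Pick the highest-scoring (start, end) span, with end >= start and a
# bounded span length, then return the covered tokens as the answer.
def best_span(tokens, start_scores, end_scores, max_len=8):
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(tokens))):
            score = s + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    i, j = best
    return " ".join(tokens[i:j + 1])

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley"]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 0.3]
end   = [0.0, 0.1, 0.0, 4.0, 0.2, 0.0, 0.0, 0.1, 0.5]
print(best_span(tokens, start, end))  # -> Clara
```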
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_squadv1_x1.16_f88.1_d8_unstruct_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|146.1 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/madlag/bert-base-uncased-squadv1-x1.16-f88.1-d8-unstruct-v1 - https://rajpurkar.github.io/SQuAD-explorer - https://www.aclweb.org/anthology/N19-1423.pdf --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_el8_dl2 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-el8-dl2` is a English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el8_dl2_en_4.3.0_3.0_1675120612912.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el8_dl2_en_4.3.0_3.0_1675120612912.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_el8_dl2","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_el8_dl2","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
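T5 is a text-to-text model, so the task is conveyed as a plain-text prefix on the input string rather than through a task-specific head. A trivial helper illustrating the convention (the prefix names are common examples, not an exhaustive list this checkpoint was trained on):

```python
# Build a T5-style input by prepending the task as a text prefix.
def t5_input(task_prefix, text):
    return f"{task_prefix}: {text}"

print(t5_input("summarize", "Spark NLP runs transformers at scale on Apache Spark."))
# -> summarize: Spark NLP runs transformers at scale on Apache Spark.
```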
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_el8_dl2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|127.7 MB| ## References - https://huggingface.co/google/t5-efficient-small-el8-dl2 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English T5ForConditionalGeneration Cased model (from ybagoury) author: John Snow Labs name: t5_flan_base_tldr_news date: 2023-03-02 tags: [open_source, t5, flan, en, tensorflow] task: Text Generation language: en edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. flan-t5-base-tldr_news is a English model originally trained by ybagoury. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_flan_base_tldr_news_en_4.3.0_3.0_1677760144575.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_flan_base_tldr_news_en_4.3.0_3.0_1677760144575.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_flan_base_tldr_news","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_flan_base_tldr_news","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_flan_base_tldr_news| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|1.0 GB| ## References https://huggingface.co/ybagoury/flan-t5-base-tldr_news --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673` is a German model originally trained by jonatasgrosman. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673_de_4.2.0_3.0_1664114173921.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673_de_4.2.0_3.0_1664114173921.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
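The pipeline above expects `audioDf` to contain a column of floating-point audio samples (16 kHz mono for Wav2Vec2 models). As an illustrative pure-Python sketch, using only the standard library, here is how a 16-bit PCM WAV can be decoded into the normalized float array that `AudioAssembler` consumes. The `audioDf` wiring in the final comment is a hypothetical example, not Spark NLP API:

```python
import io
import math
import struct
import wave

# Synthesize a 0.1 s, 16 kHz mono sine tone as 16-bit PCM WAV bytes
# (a stand-in for a real speech recording on disk).
sample_rate, n_samples = 16000, 1600
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)          # 16-bit samples
    w.setframerate(sample_rate)
    w.writeframes(b"".join(
        struct.pack("<h", int(32767 * math.sin(2 * math.pi * 440 * i / sample_rate)))
        for i in range(n_samples)))

# Decode the WAV back into floats in [-1.0, 1.0] -- the representation
# that Wav2Vec2-style models consume.
buf.seek(0)
with wave.open(buf, "rb") as w:
    raw = w.readframes(w.getnframes())
floats = [s / 32768.0 for s in struct.unpack("<%dh" % (len(raw) // 2), raw)]

# Hypothetical wiring into the pipeline above:
# audioDf = spark.createDataFrame([(floats,)], ["audio_content"])
```

For real recordings you would read the file from disk instead of a `BytesIO` buffer, and resample to 16 kHz first if needed.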
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: Part of Speech for Amharic (pos_ud_att) author: John Snow Labs name: pos_ud_att date: 2021-01-20 task: Part of Speech Tagging language: am edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [am, pos, open_source] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 13 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. ## Predicted Entities | POS tag | Description | |---------|----------------------------| | ADJ | adjective | | ADP | adposition | | ADV | adverb | | AUX | auxiliary | | CCONJ | coordinating conjunction | | DET | determiner | | INTJ | interjection | | NOUN | noun | | NUM | numeral | | PART | particle | | PRON | pronoun | | PROPN | proper noun | | PUNCT | punctuation | | SCONJ | subordinating conjunction | | SYM | symbol | | VERB | verb | | X | other | {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_att_am_2.7.0_2.4_1611180723328.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_att_am_2.7.0_2.4_1611180723328.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline after tokenization.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_att", "am") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ ።']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_att", "am") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ ።").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ ።"] pos_df = nlu.load('am.pos').predict(text) pos_df ```
## Results ```bash +------------------------------+----------------------------------------------------------------+ |text |result | +------------------------------+----------------------------------------------------------------+ |ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ ።|[NOUN, DET, PART, NOUN, DET, PART, VERB, PRON, AUX, PRON, PUNCT]| +------------------------------+----------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_att| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[pos]| |Language:|am| ## Data Source The model was trained on the [Universal Dependencies](https://universaldependencies.org/) version 2.7. Reference: - Binyam Ephrem Seyoum, Yusuke Miyao and Baye Yimam Mekonnen. 2018. Universal Dependencies for Amharic. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pp. 
2216–2222, Miyazaki, Japan: European Language Resources Association (ELRA) ## Benchmarking ```bash | | precision | recall | f1-score | support | |:------------:|:---------:|:------:|:--------:|:-------:| | ADJ | 1.00 | 0.97 | 0.99 | 116 | | ADP | 0.99 | 1.00 | 0.99 | 681 | | ADV | 0.94 | 0.99 | 0.96 | 93 | | AUX | 1.00 | 1.00 | 1.00 | 419 | | CCONJ | 0.99 | 0.97 | 0.98 | 99 | | DET | 0.99 | 1.00 | 0.99 | 485 | | INTJ | 0.97 | 0.99 | 0.98 | 67 | | NOUN | 0.99 | 1.00 | 1.00 | 1485 | | NUM | 1.00 | 1.00 | 1.00 | 42 | | PART | 1.00 | 1.00 | 1.00 | 875 | | PRON | 1.00 | 1.00 | 1.00 | 2547 | | PROPN | 1.00 | 0.99 | 0.99 | 236 | | PUNCT | 1.00 | 1.00 | 1.00 | 1093 | | SCONJ | 1.00 | 0.98 | 0.99 | 214 | | VERB | 1.00 | 1.00 | 1.00 | 1552 | | accuracy | | | 1.00 | 10004 | | macro avg | 0.99 | 0.99 | 0.99 | 10004 | | weighted avg | 1.00 | 1.00 | 1.00 | 10004 | ``` --- layout: model title: Legal Approvals Clause Binary Classifier author: John Snow Labs name: legclf_approvals_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `approvals` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. 
If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `approvals` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_approvals_clause_en_1.0.0_3.2_1660123231676.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_approvals_clause_en_1.0.0_3.2_1660123231676.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_approvals_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
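The splitting advice above (paragraph splitting "by multiline", capped at the 512-token embedding limit) can be sketched in plain Python. This is an illustrative helper, not part of Spark NLP; the function name and whitespace tokenization are assumptions for the sketch:

```python
def split_for_classifier(text, max_tokens=512):
    """Split a long document into classifier-sized chunks: first on blank
    lines (paragraph splitting "by multiline"), then cap each paragraph
    at max_tokens whitespace tokens to respect the embedding limit."""
    chunks = []
    for paragraph in text.split("\n\n"):
        tokens = paragraph.split()
        for i in range(0, len(tokens), max_tokens):
            chunks.append(" ".join(tokens[i:i + max_tokens]))
    return chunks
```

Each returned chunk can then be fed to the classifier as a separate row of the `clause_text` column.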
## Results ```bash +-------+ | result| +-------+ |[approvals]| |[other]| |[other]| |[approvals]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_approvals_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support approvals 0.89 0.94 0.91 33 other 0.97 0.95 0.96 82 accuracy - - 0.95 115 macro-avg 0.93 0.95 0.94 115 weighted-avg 0.95 0.95 0.95 115 ``` --- layout: model title: Ocr pipeline in streaming author: John Snow Labs name: ocr_streaming date: 2023-01-03 tags: [en, licensed, ocr, streaming] task: Ocr Streaming language: en nav_key: models edition: Visual NLP 4.0.0 spark_version: 3.0 supported: true annotator: OcrStreaming article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Streaming pipeline implementation for the OCR task, using tesseract models. Tesseract is an Optical Character Recognition (OCR) engine developed by Google. It is an open-source tool that can be used to recognize text in images and convert it into machine-readable text. The engine is based on a neural network architecture and uses machine learning algorithms to improve its accuracy over time. Tesseract has been trained on a variety of datasets to improve its recognition capabilities. These datasets include images of text in various languages and scripts, as well as images with different font styles, sizes, and orientations. The training process involves feeding the engine with a large number of images and their corresponding text, allowing the engine to learn the patterns and characteristics of different text styles. 
One of the most important datasets used in training Tesseract is the UNLV dataset, which contains over 400,000 images of text in different languages, scripts, and font styles. This dataset is widely used in the OCR community and has been instrumental in improving the accuracy of Tesseract. Other datasets that have been used in training Tesseract include the ICDAR dataset, the IIIT-HWS dataset, and the RRC-GV-WS dataset. In addition to these datasets, Tesseract also uses a technique called adaptive training, where the engine can continuously improve its recognition capabilities by learning from new images and text. This allows Tesseract to adapt to new text styles and languages, and improve its overall accuracy. ## Predicted Entities {:.btn-box} [Open in Colab](https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/tutorials/Certification_Trainings/6.1.SparkOcrStreamingPDF.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python # Transform binary to image pdf_to_image = PdfToImage() \ .setInputCol("content") \ .setOutputCol("image") # Run OCR for each region ocr = ImageToText() \ .setInputCol("image") \ .setOutputCol("text") \ .setConfidenceThreshold(60) # OCR pipeline pipeline = PipelineModel(stages=[ pdf_to_image, ocr]) # fill path to folder with PDFs here dataset_path = "data/pdfs/*.pdf" # read one file to infer the schema pdfs_df = spark.read.format("binaryFile").load(dataset_path).limit(1) # count of files in one microbatch maxFilesPerTrigger = 4 # read files as a stream pdf_stream_df = spark.readStream \ .format("binaryFile") \ .schema(pdfs_df.schema) \ .option("maxFilesPerTrigger", maxFilesPerTrigger) \ .load(dataset_path) # process files using the OCR pipeline result = pipeline.transform(pdf_stream_df).withColumn("timestamp", current_timestamp()) # store results to an in-memory table query = result.writeStream \ .format('memory') \ .queryName('result') \ .start() # show results spark.table("result").select("timestamp","pagenum", "path", "text").show(10) ``` ```scala // Transform binary to image val pdf_to_image = new PdfToImage() .setInputCol("content") .setOutputCol("image") // Run OCR for each region val ocr = new ImageToText() .setInputCol("image") .setOutputCol("text") .setConfidenceThreshold(60) // OCR pipeline (all stages are transformers, so fitting only wires them together) val pipeline = new Pipeline().setStages(Array(pdf_to_image, ocr)) // fill path to folder with PDFs here val dataset_path = "data/pdfs/*.pdf" // read one file to infer the schema val pdfs_df = spark.read.format("binaryFile").load(dataset_path).limit(1) // count of files in one microbatch val maxFilesPerTrigger = 4 // read files as a stream val pdf_stream_df = spark.readStream .format("binaryFile") .schema(pdfs_df.schema) .option("maxFilesPerTrigger", maxFilesPerTrigger) .load(dataset_path) // process files using the OCR pipeline val result = pipeline.fit(pdfs_df).transform(pdf_stream_df).withColumn("timestamp", current_timestamp()) // store results to an in-memory table val query = result.writeStream .format("memory") .queryName("result") .start() // show results spark.table("result").select("timestamp", "pagenum", "path", "text").show(10) ```
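The `maxFilesPerTrigger` option above controls how many files enter each streaming micro-batch. A plain-Python analogue of that grouping (illustrative only; Structured Streaming handles this internally) makes the behaviour concrete:

```python
def microbatches(paths, max_files_per_trigger=4):
    """Group input files the way maxFilesPerTrigger does: each streaming
    micro-batch picks up at most max_files_per_trigger new files, in order."""
    return [paths[i:i + max_files_per_trigger]
            for i in range(0, len(paths), max_files_per_trigger)]
```

With ten PDFs and `max_files_per_trigger=4`, the stream processes them as three micro-batches of 4, 4, and 2 files.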
## Example ### Input: ![Screenshot](/assets/images/examples_ocr/image4.png) ### Output: ```bash +--------------------+ | value| +--------------------+ | | | | | | | | | | | | |ne Pa a Date: 7/1...| |er ‘Sample No. _ ...| |“ Original reques...| | | |Sample specificat...| | , BLEND CASING R...| | | |- OLD GOLD STRAIG...| | | |Control for Sampl...| | | | Cigarettes:| | | | OLD GOLD STRAIGHT| +--------------------+ only showing top 20 rows ``` ## Model Information {:.table-model} |---|---| |Model Name:|ocr_streaming| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| --- layout: model title: Legal Indemnification Procedures Clause Binary Classifier author: John Snow Labs name: legclf_indemnification_procedures_clause date: 2023-01-27 tags: [en, legal, classification, indemnification, procedures, clauses, indemnification_procedures, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `indemnification-procedures` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. 
Note that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `indemnification-procedures`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_indemnification_procedures_clause_en_1.0.0_3.0_1674819938050.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_indemnification_procedures_clause_en_1.0.0_3.0_1674819938050.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_indemnification_procedures_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[indemnification-procedures]| |[other]| |[other]| |[indemnification-procedures]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_indemnification_procedures_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support indemnification-procedures 1.00 0.96 0.98 23 other 0.97 1.00 0.99 39 accuracy - - 0.98 62 macro-avg 0.99 0.98 0.98 62 weighted-avg 0.98 0.98 0.98 62 ``` --- layout: model title: Detect Problems, Tests and Treatments (ner_clinical) in German author: John Snow Labs name: ner_clinical date: 2023-05-08 tags: [ner, clinical, licensed, de] task: Named Entity Recognition language: de edition: Healthcare NLP 4.4.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terms in German. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. 
## Predicted Entities `PROBLEM`, `TEST`, `TREATMENT` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/14.German_Healthcare_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_de_4.4.0_3.0_1683555292486.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_de_4.4.0_3.0_1683555292486.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "de", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) sample_text= """Verschlechterung von Schmerzen oder Schwäche in den Beinen , Verlust der Darm - oder Blasenfunktion oder andere besorgniserregende Symptome. Der Patient erhielt empirisch Ampicillin , Gentamycin und Flagyl sowie Narcan zur Umkehrung von Fentanyl . 
ALT war 181 , AST war 156 , LDH war 336 , alkalische Phosphatase war 214 und Bilirubin war insgesamt 12,7 .""" results = model.transform(spark.createDataFrame([[sample_text]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_clinical", "de", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("""Verschlechterung von Schmerzen oder Schwäche in den Beinen , Verlust der Darm - oder Blasenfunktion oder andere besorgniserregende Symptome. Der Patient erhielt empirisch Ampicillin , Gentamycin und Flagyl sowie Narcan zur Umkehrung von Fentanyl . ALT war 181 , AST war 156 , LDH war 336 , alkalische Phosphatase war 214 und Bilirubin war insgesamt 12,7 .""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ```
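The `NerConverterInternal` stage above merges the model's IOB tags into the entity chunks shown in the results. A minimal pure-Python sketch of that grouping logic (illustrative only — the real annotator also tracks character offsets and confidence metadata):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB tags into (chunk_text, label) pairs: a B- tag starts a
    chunk, a matching I- tag extends it, and anything else closes the
    currently open chunk."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O", or an I- tag that does not continue the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks
```

For example, the tag sequence `B-PROBLEM I-PROBLEM I-PROBLEM I-PROBLEM O B-TEST` over the tokens "Schwäche in den Beinen , ALT" yields the chunks `("Schwäche in den Beinen", "PROBLEM")` and `("ALT", "TEST")`.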
## Results ```bash +----------------------+---------+ |chunk |ner_label| +----------------------+---------+ |Schmerzen |PROBLEM | |Schwäche in den Beinen|PROBLEM | |Verlust der Darm |PROBLEM | |Blasenfunktion |PROBLEM | |Symptome |PROBLEM | |empirisch Ampicillin |TREATMENT| |Gentamycin |TREATMENT| |Flagyl |TREATMENT| |Narcan |TREATMENT| |Fentanyl |TREATMENT| |ALT |TEST | |AST |TEST | |LDH |TEST | |alkalische Phosphatase|TEST | |Bilirubin |TEST | +----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|de| |Size:|2.0 MB| ## Benchmarking ```bash label precision recall f1-score support B-PROBLEM 0.85 0.71 0.78 512 B-TEST 0.89 0.85 0.87 203 B-TREATMENT 0.84 0.82 0.83 238 I-PROBLEM 0.78 0.70 0.74 355 I-TEST 0.90 0.83 0.87 66 I-TREATMENT 0.62 0.71 0.66 75 O 0.94 0.97 0.95 4141 accuracy - - 0.91 5590 macro avg 0.83 0.80 0.81 5590 weighted avg 0.91 0.91 0.91 5590 ``` --- layout: model title: English DistilBertForTokenClassification Cased model (from f2io) author: John Snow Labs name: distilbert_token_classifier_ner_roles_openapi date: 2023-03-14 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ner-roles-openapi` is an English model originally trained by `f2io`. 
## Predicted Entities ``, `MISC`, `ORG`, `ENTITY`, `PER`, `PRG`, `ROLE`, `OR`, `LOC`, `ACTION` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_ner_roles_openapi_en_4.3.1_3.0_1678782949346.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_ner_roles_openapi_en_4.3.1_3.0_1678782949346.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_ner_roles_openapi","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_ner_roles_openapi","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_ner_roles_openapi| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/f2io/ner-roles-openapi --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from deepset) author: John Snow Labs name: roberta_qa_deepset_base_squad2 date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true recommended: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2` is an English model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_en_4.2.4_3.0_1669986722225.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_en_4.2.4_3.0_1669986722225.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
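Under the hood, extractive QA models like this one score every context token as a candidate answer start and end, then pick the best valid span. A schematic pure-Python sketch of that decoding step (not Spark NLP's actual implementation; the toy scores below are invented for illustration):

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return the (start, end) token indices maximising
    start_scores[start] + end_scores[end], subject to end >= start and a
    bounded answer length."""
    best, span = float("-inf"), (0, 0)
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            if s_score + end_scores[e] > best:
                best, span = s_score + end_scores[e], (s, e)
    return span

# Toy scores for the context "My name is Clara and I live in Berkeley":
# the model assigns its highest start and end scores to the token "Clara".
start = [0.0, 0.1, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 1.0]
end   = [0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 2.0]
```

With these toy scores, `best_span(start, end)` returns `(3, 3)` — the single-token span "Clara".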
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_deepset_base_squad2| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/deepset/roberta-base-squad2 - https://haystack.deepset.ai/tutorials/first-qa-system - https://github.com/deepset-ai/haystack/ - https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/ - http://deepset.ai/ - https://haystack.deepset.ai/ - https://deepset.ai/german-bert - https://deepset.ai/germanquad - https://docs.haystack.deepset.ai - https://haystack.deepset.ai/community - https://twitter.com/deepset_ai - https://www.linkedin.com/company/deepset-ai/ - https://github.com/deepset-ai/haystack/discussions - http://www.deepset.ai/jobs - https://paperswithcode.com/sota?task=Question+Answering&dataset=squad_v2 --- layout: model title: Fast Neural Machine Translation Model from Catalan to English author: John Snow Labs name: opus_mt_ca_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ca, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `ca` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ca_en_xx_2.7.0_2.4_1609169355424.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ca_en_xx_2.7.0_2.4_1609169355424.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ca_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ca_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ca.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ca_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Detect PHI for Deidentification (Glove, 7 labels) author: John Snow Labs name: ner_deid_generic_glove_pipeline date: 2023-03-13 tags: [deid, clinical, glove, licensed, ner, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_deid_generic_glove](https://nlp.johnsnowlabs.com/2021/06/06/ner_deid_generic_glove_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_glove_pipeline_en_4.3.0_3.2_1678734514341.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_glove_pipeline_en_4.3.0_3.2_1678734514341.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_generic_glove_pipeline", "en", "clinical/models") text = '''Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_generic_glove_pipeline", "en", "clinical/models") val text = "Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227." val result = pipeline.fullAnnotate(text) ```
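De-identification ultimately replaces each detected chunk with its label or a surrogate. As a minimal, Spark-free illustration of that masking step (plain Python; the `mask_phi` helper is illustrative and uses the same inclusive begin/end offset convention as the pipeline's annotations):

```python
def mask_phi(text, chunks):
    """Replace (begin, end, label) chunks in text with <label> placeholders."""
    # Replace right-to-left so earlier offsets stay valid after each substitution.
    for begin, end, label in sorted(chunks, reverse=True):
        text = text[:begin] + f"<{label}>" + text[end + 1:]  # end is inclusive
    return text

# Offsets match the first two rows of the Results table for this pipeline.
text = "Record date : 2093-01-13 , David Hale , M.D ."
chunks = [(14, 23, "DATE"), (27, 36, "NAME")]
print(mask_phi(text, chunks))  # Record date : <DATE> , <NAME> , M.D .
```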
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:------------------------------|--------:|------:|:------------|-------------:| | 0 | 2093-01-13 | 14 | 23 | DATE | 1 | | 1 | David Hale | 27 | 36 | NAME | 0.9938 | | 2 | Hendrickson Ora | 55 | 69 | NAME | 0.992 | | 3 | 7194334 | 78 | 84 | ID | 1 | | 4 | 01/13/93 | 93 | 100 | DATE | 1 | | 5 | Oliveira | 110 | 117 | NAME | 1 | | 6 | 25 | 121 | 122 | AGE | 0.8724 | | 7 | 2079-11-09 | 150 | 159 | DATE | 1 | | 8 | Cocke County Baptist Hospital | 163 | 191 | LOCATION | 0.8586 | | 9 | 0295 Keats Street | 195 | 211 | LOCATION | 0.948667 | | 10 | 302-786-5227 | 221 | 232 | CONTACT | 0.9972 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic_glove_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|167.3 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Word2Vec Embeddings in Upper Sorbian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, hsb, open_source] task: Embeddings language: hsb edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_hsb_3.4.1_3.0_1647465128124.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_hsb_3.4.1_3.0_1647465128124.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","hsb") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","hsb") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("hsb.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
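Downstream tasks typically compare these 300-dimensional token vectors with cosine similarity. A self-contained sketch of that comparison (toy 3-d vectors stand in for real model output):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Identical directions score 1.0; orthogonal directions score 0.0.
print(round(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]), 2))  # 1.0
print(round(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 2))  # 0.0
```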
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|hsb| |Size:|144.4 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Persian Named Entity Recognition (from HooshvareLab) author: John Snow Labs name: bert_ner_bert_base_parsbert_ner_uncased date: 2022-05-09 tags: [bert, ner, token_classification, fa, open_source] task: Named Entity Recognition language: fa edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-parsbert-ner-uncased` is a Persian model originally trained by `HooshvareLab`. ## Predicted Entities `percent`, `facility`, `location`, `money`, `product`, `person`, `date`, `organization`, `time`, `event` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_parsbert_ner_uncased_fa_3.4.2_3.0_1652099655453.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_parsbert_ner_uncased_fa_3.4.2_3.0_1652099655453.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_parsbert_ner_uncased","fa") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["من عاشق جرقه nlp هستم"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_parsbert_ner_uncased","fa") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("من عاشق جرقه nlp هستم").toDF("text") val result = pipeline.fit(data).transform(data) ```
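The `ner` output column holds one IOB tag per token; grouping those tags into entity chunks is usually done by a `NerConverter` stage added after this annotator. A plain-Python sketch of that IOB-to-chunk grouping (illustrative only, not Spark NLP's implementation; the example tokens and tags are hypothetical):

```python
def iob_to_chunks(tokens, tags):
    """Group IOB tags (B-X / I-X / O) into (entity_text, label) chunks."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" or an I- tag that doesn't continue the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["John", "Smith", "visited", "Tehran"]
tags = ["B-person", "I-person", "O", "B-location"]
print(iob_to_chunks(tokens, tags))  # [('John Smith', 'person'), ('Tehran', 'location')]
```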
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_bert_base_parsbert_ner_uncased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|fa| |Size:|607.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/HooshvareLab/bert-base-parsbert-ner-uncased - https://arxiv.org/abs/2005.12515 - http://nsurl.org/tasks/task-7-named-entity-recognition-ner-for-farsi/ - https://github.com/HaniehP/PersianNER - https://github.com/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb - https://colab.research.google.com/github/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb - https://tensorflow.org/tfrc - https://hooshvare.com - https://www.linkedin.com/in/m3hrdadfi/ - https://twitter.com/m3hrdadfi - https://github.com/m3hrdadfi - https://www.linkedin.com/in/mohammad-gharachorloo/ - https://twitter.com/MGharachorloo - https://github.com/baarsaam - https://www.linkedin.com/in/marziehphi/ - https://twitter.com/marziehphi - https://github.com/marziehphi - https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/ - https://twitter.com/mmanthouri - https://github.com/mmanthouri - https://hooshvare.com/ - https://www.linkedin.com/company/hooshvare - https://twitter.com/hooshvare - https://github.com/hooshvare - https://www.instagram.com/hooshvare/ - https://www.linkedin.com/in/sara-tabrizi-64548b79/ - https://www.behance.net/saratabrizi - https://www.instagram.com/sara_b_tabrizi/ --- layout: model title: Legal Waiver Of Jury Trials Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_waiver_of_jury_trials_bert date: 2023-03-05 tags: [en, legal, classification, clauses, waiver_of_jury_trials, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: 
LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Waiver_Of_Jury_Trials` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. 
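The paragraph-splitting option mentioned above can be approximated in a few lines of plain Python (a sketch of splitting on blank lines; not the workshop's exact code):

```python
import re

def split_paragraphs(text):
    # Split on one or more blank lines and drop empty pieces.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "WAIVER OF JURY TRIAL. Each party waives...\n\nGOVERNING LAW. This Agreement shall..."
print(len(split_paragraphs(doc)))  # 2
```

Each resulting paragraph can then be fed to the classifier as its own document.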
## Predicted Entities `Waiver_Of_Jury_Trials`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_waiver_of_jury_trials_bert_en_1.0.0_3.0_1678050634484.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_waiver_of_jury_trials_bert_en_1.0.0_3.0_1678050634484.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_waiver_of_jury_trials_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Waiver_Of_Jury_Trials]| |[Other]| |[Other]| |[Waiver_Of_Jury_Trials]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_waiver_of_jury_trials_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.85 0.99 0.92 125 Waiver_Of_Jury_Trials 0.99 0.75 0.85 89 accuracy - - 0.89 214 macro-avg 0.92 0.87 0.88 214 weighted-avg 0.91 0.89 0.89 214 ``` --- layout: model title: Mapping RxNorm Codes with Corresponding Actions and Treatments author: John Snow Labs name: rxnorm_action_treatment_mapper date: 2022-05-08 tags: [en, chunk_mapper, rxnorm, action, treatment, licensed, clinical] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.1 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps RxNorm and RxNorm Extension codes with their corresponding action and treatment. Action refers to the function of the drug in various body systems; treatment refers to which disease the drug is used to treat. 
## Predicted Entities `action`, `treatment` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_action_treatment_mapper_en_3.5.1_3.0_1652043181565.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_action_treatment_mapper_en_3.5.1_3.0_1652043181565.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('ner_chunk') sbert_embedder = BertSentenceEmbeddings.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\ .setInputCols(["ner_chunk"])\ .setOutputCol("sentence_embeddings")\ .setCaseSensitive(False) rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en", "clinical/models") \ .setInputCols(["sentence_embeddings"]) \ .setOutputCol("rxnorm_code")\ .setDistanceFunction("EUCLIDEAN") chunkerMapper_action = ChunkMapperModel.pretrained("rxnorm_action_treatment_mapper", "en", "clinical/models")\ .setInputCols(["rxnorm_code"])\ .setOutputCol("Action")\ .setRel("Action") chunkerMapper_treatment = ChunkMapperModel.pretrained("rxnorm_action_treatment_mapper", "en", "clinical/models")\ .setInputCols(["rxnorm_code"])\ .setOutputCol("Treatment")\ .setRel("Treatment") pipeline = Pipeline().setStages([document_assembler, sbert_embedder, rxnorm_resolver, chunkerMapper_action, chunkerMapper_treatment ]) model = pipeline.fit(spark.createDataFrame([['']]).toDF('text')) light_pipeline = LightPipeline(model) result = light_pipeline.annotate(['Sinequan 150 MG', 'Zonalon 50 mg']) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en","clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sentence_embeddings") .setCaseSensitive(false) val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en", "clinical/models") .setInputCols(Array("sentence_embeddings")) .setOutputCol("rxnorm_code") .setDistanceFunction("EUCLIDEAN") val chunkerMapper_action = ChunkMapperModel.pretrained("rxnorm_action_treatment_mapper", "en", "clinical/models") .setInputCols("rxnorm_code") .setOutputCol("Action") .setRel("Action") val 
chunkerMapper_treatment = ChunkMapperModel.pretrained("rxnorm_action_treatment_mapper", "en", "clinical/models") .setInputCols("rxnorm_code") .setOutputCol("Treatment") .setRel("Treatment") val pipeline = new Pipeline().setStages(Array(document_assembler, sbert_embedder, rxnorm_resolver, chunkerMapper_action, chunkerMapper_treatment )) val text_data = Seq("Sinequan 150 MG", "Zonalon 50 mg").toDS.toDF("text") val res = pipeline.fit(text_data).transform(text_data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.rxnorm_to_action_treatment").predict("""Sinequan 150 MG""") ```
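Conceptually, the mapper is a lookup from an RxNorm code to its `Action` and `Treatment` relations. A plain-Python sketch of that shape (values copied from the Results section; the dict and the `relations_for` helper are illustrative, not the model's actual storage format):

```python
# Illustrative lookup keyed by RxNorm code, mirroring the mapper's two relations.
rxnorm_mappings = {
    "1000067": {  # Sinequan 150 MG
        "Action": ["Antidepressant", "Anxiolytic", "Psychoanaleptics", "Sedative"],
        "Treatment": ["Alcoholism", "Depression", "Neurosis", "Anxiety&Panic Attacks", "Psychosis"],
    },
}

def relations_for(code, rel):
    # Return the requested relation list, or [] for unknown codes/relations.
    return rxnorm_mappings.get(code, {}).get(rel, [])

print(relations_for("1000067", "Action"))
```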
## Results ```bash | | ner_chunk | rxnorm_code | Treatment | Action | |---:|:--------------------|:--------------|:-------------------------------------------------------------------------------|:-----------------------------------------------------------------------| | 0 | ['Sinequan 150 MG'] | ['1000067'] | ['Alcoholism', 'Depression', 'Neurosis', 'Anxiety&Panic Attacks', 'Psychosis'] | ['Antidepressant', 'Anxiolytic', 'Psychoanaleptics', 'Sedative'] | | 1 | ['Zonalon 50 mg'] | ['103971'] | ['Pain'] | ['Analgesic', 'Analgesic (Opioid)', 'Analgetic', 'Opioid', 'Vitamins'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|rxnorm_action_treatment_mapper| |Compatibility:|Healthcare NLP 3.5.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|19.3 MB| --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_recipe_triplet_recipes_base_timestep_squadv2_epochs_3 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `recipe_triplet_recipes-roberta-base_TIMESTEP_squadv2_epochs_3` is a English model originally trained by `AnonymousSub`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_recipe_triplet_recipes_base_timestep_squadv2_epochs_3_en_4.3.0_3.0_1674212220798.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_recipe_triplet_recipes_base_timestep_squadv2_epochs_3_en_4.3.0_3.0_1674212220798.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_recipe_triplet_recipes_base_timestep_squadv2_epochs_3","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_recipe_triplet_recipes_base_timestep_squadv2_epochs_3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_recipe_triplet_recipes_base_timestep_squadv2_epochs_3| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|467.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/recipe_triplet_recipes-roberta-base_TIMESTEP_squadv2_epochs_3 --- layout: model title: Legal Rules of construction Clause Binary Classifier author: John Snow Labs name: legclf_rules_of_construction_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `rules-of-construction` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `rules-of-construction` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_rules_of_construction_clause_en_1.0.0_3.2_1660122976182.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_rules_of_construction_clause_en_1.0.0_3.2_1660122976182.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_rules_of_construction_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
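The Benchmarking section below reports per-class precision, recall, and F1 plus macro averages; these follow from the standard formulas, sketched here in plain Python with the per-class scores from that table:

```python
def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Per-class scores as reported for this model (rounded inputs).
f1_other = f1(0.96, 0.99)
f1_rules = f1(0.98, 0.91)
macro_f1 = (f1_other + f1_rules) / 2
print(round(f1_other, 2), round(f1_rules, 2), round(macro_f1, 2))  # 0.97 0.94 0.96
```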
## Results ```bash +-------+ | result| +-------+ |[rules-of-construction]| |[other]| |[other]| |[rules-of-construction]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_rules_of_construction_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.96 0.99 0.97 98 rules-of-construction 0.98 0.91 0.94 46 accuracy - - 0.97 144 macro-avg 0.97 0.95 0.96 144 weighted-avg 0.97 0.97 0.96 144 ``` --- layout: model title: Translate Dravidian languages to English Pipeline author: John Snow Labs name: translate_dra_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, dra, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as GPU is recommended. 
- source languages: `dra` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_dra_en_xx_2.7.0_2.4_1609686505330.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_dra_en_xx_2.7.0_2.4_1609686505330.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_dra_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_dra_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.dra.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_dra_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Compliance Clause Binary Classifier author: John Snow Labs name: legclf_compliance_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `compliance` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. 
## Predicted Entities `other`, `compliance` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_compliance_clause_en_1.0.0_3.2_1660122240327.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_compliance_clause_en_1.0.0_3.2_1660122240327.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_compliance_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
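Before fitting the pipeline above, long documents can be pre-split into paragraphs — the "paragraph splitting (by multiline)" technique mentioned in the description. A minimal, Spark-free sketch of that step (the sample clauses are illustrative, not from a real contract):

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a legal document into candidate clause paragraphs on blank lines."""
    # Two or more consecutive newlines mark a paragraph boundary.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = (
    "1. COMPLIANCE. Each party shall comply with all applicable laws.\n\n"
    "2. DEFINITIONS. Capitalized terms have the meanings set forth herein.\n\n\n"
    "3. TERM. This Agreement commences on the Effective Date."
)
paragraphs = split_paragraphs(doc)
print(len(paragraphs))  # 3 — one candidate paragraph per clause
```

Each resulting paragraph can then be loaded as a row of the `clause_text` column and classified independently.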
## Results ```bash +------------+ |result| +------------+ |[compliance]| |[other]| |[other]| |[compliance]| +------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_compliance_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support compliance 0.80 0.82 0.81 67 other 0.94 0.93 0.93 188 accuracy - - 0.90 255 macro-avg 0.87 0.87 0.87 255 weighted-avg 0.90 0.90 0.90 255 ``` --- layout: model title: Sentence Embeddings - sbiobert (tuned) author: John Snow Labs name: sbiobert_jsl_umls_cased date: 2021-06-30 tags: [embeddings, clinical, licensed, en] task: Embeddings language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 2.4 supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained to generate contextual sentence embeddings of input sentences. {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobert_jsl_umls_cased_en_3.1.0_2.4_1625050246280.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobert_jsl_umls_cased_en_3.1.0_2.4_1625050246280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sbiobert_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_jsl_umls_cased", "en", "clinical/models") \ .setInputCols(["sentence"]) \ .setOutputCol("sbert_embeddings") ``` ```scala val sbiobert_embeddings = BertSentenceEmbeddings .pretrained("sbiobert_jsl_umls_cased","en","clinical/models") .setInputCols(Array("sentence")) .setOutputCol("sbert_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed_sentence.biobert.jsl_umls_cased").predict("""Put your text here.""") ```
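Downstream resolvers typically compare the resulting 768-dimensional vectors by cosine similarity. A minimal NumPy sketch of that comparison (the vectors below are random placeholders, not real `sbert_embeddings` outputs):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(seed=0)
emb_a = rng.normal(size=768)  # placeholder for one sentence embedding
emb_b = rng.normal(size=768)  # placeholder for another sentence embedding

print(round(cosine_similarity(emb_a, emb_a), 3))  # 1.0 — identical sentences
score = cosine_similarity(emb_a, emb_b)
print(-1.0 <= score <= 1.0)  # True — cosine similarity is bounded
```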
## Results ```bash Gives a 768 dimensional vector representation of the sentence. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobert_jsl_umls_cased| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Case sensitive:|true| ## Data Source Tuned on MedNLI dataset ## Benchmarking ```bash MedNLI Score Acc 0.758 STS(cos) 0.651 ``` --- layout: model title: BERT Sentence Embeddings trained on MEDLINE/PubMed and fine-tuned on SQuAD 2.0 author: John Snow Labs name: sent_bert_pubmed_squad2 date: 2021-08-31 tags: [en, open_source, sentence_embeddings, medline_pubmed_dataset, squad_2_dataset] task: Embeddings language: en nav_key: models edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses a BERT base architecture initialized from https://tfhub.dev/google/experts/bert/pubmed/1 and fine-tuned on SQuAD 2.0. This is a BERT base architecture but some changes have been made to the original training and export scheme based on more recent learnings. This model is intended to be used for a variety of English NLP tasks in the medical domain. This model is fine-tuned on the SQuAD 2.0 as a span-labeling task to label the answer to a question in a given context and is recommended for use in question answering tasks. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_pubmed_squad2_en_3.2.0_3.0_1630412086842.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_pubmed_squad2_en_3.2.0_3.0_1630412086842.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_pubmed_squad2", "en") \ .setInputCols("sentence") \ .setOutputCol("bert_sentence") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings ]) ``` ```scala val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_pubmed_squad2", "en") .setInputCols("sentence") .setOutputCol("bert_sentence") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings )) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] sent_embeddings_df = nlu.load('en.embed_sentence.bert.pubmed_squad2').predict(text, output_level='sentence') sent_embeddings_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_bert_pubmed_squad2| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[bert_sentence]| |Language:|en| |Case sensitive:|false| ## Data Source [1]: [MEDLINE/PubMed dataset](https://www.nlm.nih.gov/databases/download/pubmed_medline.html) [2]: [Stanford Question Answering (SQuAD 2.0) dataset](https://rajpurkar.github.io/SQuAD-explorer/) This model has been imported from: https://tfhub.dev/google/experts/bert/pubmed/squad2/2 --- layout: model title: Detect Persons, Locations, Organizations and Misc Entities in Russian (WikiNER 840B 300) author: John Snow Labs name: wikiner_840B_300 date: 2020-03-16 task: Named Entity Recognition language: ru edition: Spark NLP 2.4.4 spark_version: 2.4 tags: [ner, ru, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 840B 300 is trained with GloVe 840B 300 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_RU){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_RU.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_840B_300_ru_2.4.4_2.4_1584014001695.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_840B_300_ru_2.4.4_2.4_1584014001695.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("wikiner_840B_300", "ru") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['Уильям Генри Гейтс III (родился 28 октября 1955 года) - американский бизнес-магнат, разработчик программного обеспечения, инвестор и филантроп. Он наиболее известен как соучредитель корпорации Microsoft. За время своей карьеры в Microsoft Гейтс занимал должности председателя, главного исполнительного директора (CEO), президента и главного архитектора программного обеспечения, а также был крупнейшим индивидуальным акционером до мая 2014 года. Он является одним из самых известных предпринимателей и пионеров микрокомпьютерная революция 1970-х и 1980-х годов. Гейтс родился и вырос в Сиэтле, штат Вашингтон, в 1975 году вместе с другом детства Полом Алленом в Альбукерке, штат Нью-Мексико, и основал компанию Microsoft. она стала крупнейшей в мире компанией-разработчиком программного обеспечения для персональных компьютеров. Гейтс руководил компанией в качестве председателя и генерального директора, пока в январе 2000 года не ушел с поста генерального директора, но остался председателем и стал главным архитектором программного обеспечения. В конце 1990-х Гейтс подвергся критике за свою деловую тактику, которая считалась антиконкурентной. Это мнение было подтверждено многочисленными судебными решениями. 
В июне 2006 года Гейтс объявил, что перейдет на неполный рабочий день в Microsoft и будет работать на полную ставку в Фонде Билла и Мелинды Гейтс, частном благотворительном фонде, который он и его жена Мелинда Гейтс создали в 2000 году. [ 9] Постепенно он передал свои обязанности Рэю Оззи и Крейгу Манди. Он ушел с поста президента Microsoft в феврале 2014 года и занял новую должность консультанта по технологиям для поддержки вновь назначенного генерального директора Сатья Наделла.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("wikiner_840B_300", "ru") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("Уильям Генри Гейтс III (родился 28 октября 1955 года) - американский бизнес-магнат, разработчик программного обеспечения, инвестор и филантроп. Он наиболее известен как соучредитель корпорации Microsoft. За время своей карьеры в Microsoft Гейтс занимал должности председателя, главного исполнительного директора (CEO), президента и главного архитектора программного обеспечения, а также был крупнейшим индивидуальным акционером до мая 2014 года. Он является одним из самых известных предпринимателей и пионеров микрокомпьютерная революция 1970-х и 1980-х годов. Гейтс родился и вырос в Сиэтле, штат Вашингтон, в 1975 году вместе с другом детства Полом Алленом в Альбукерке, штат Нью-Мексико, и основал компанию Microsoft. она стала крупнейшей в мире компанией-разработчиком программного обеспечения для персональных компьютеров. 
Гейтс руководил компанией в качестве председателя и генерального директора, пока в январе 2000 года не ушел с поста генерального директора, но остался председателем и стал главным архитектором программного обеспечения. В конце 1990-х Гейтс подвергся критике за свою деловую тактику, которая считалась антиконкурентной. Это мнение было подтверждено многочисленными судебными решениями. В июне 2006 года Гейтс объявил, что перейдет на неполный рабочий день в Microsoft и будет работать на полную ставку в Фонде Билла и Мелинды Гейтс, частном благотворительном фонде, который он и его жена Мелинда Гейтс создали в 2000 году. [ 9] Постепенно он передал свои обязанности Рэю Оззи и Крейгу Манди. Он ушел с поста президента Microsoft в феврале 2014 года и занял новую должность консультанта по технологиям для поддержки вновь назначенного генерального директора Сатья Наделла.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Уильям Генри Гейтс III (родился 28 октября 1955 года) - американский бизнес-магнат, разработчик программного обеспечения, инвестор и филантроп. Он наиболее известен как соучредитель корпорации Microsoft. За время своей карьеры в Microsoft Гейтс занимал должности председателя, главного исполнительного директора (CEO), президента и главного архитектора программного обеспечения, а также был крупнейшим индивидуальным акционером до мая 2014 года. Он является одним из самых известных предпринимателей и пионеров микрокомпьютерная революция 1970-х и 1980-х годов. Гейтс родился и вырос в Сиэтле, штат Вашингтон, в 1975 году вместе с другом детства Полом Алленом в Альбукерке, штат Нью-Мексико, и основал компанию Microsoft. она стала крупнейшей в мире компанией-разработчиком программного обеспечения для персональных компьютеров. 
Гейтс руководил компанией в качестве председателя и генерального директора, пока в январе 2000 года не ушел с поста генерального директора, но остался председателем и стал главным архитектором программного обеспечения. В конце 1990-х Гейтс подвергся критике за свою деловую тактику, которая считалась антиконкурентной. Это мнение было подтверждено многочисленными судебными решениями. В июне 2006 года Гейтс объявил, что перейдет на неполный рабочий день в Microsoft и будет работать на полную ставку в Фонде Билла и Мелинды Гейтс, частном благотворительном фонде, который он и его жена Мелинда Гейтс создали в 2000 году. [ 9] Постепенно он передал свои обязанности Рэю Оззи и Крейгу Манди. Он ушел с поста президента Microsoft в феврале 2014 года и занял новую должность консультанта по технологиям для поддержки вновь назначенного генерального директора Сатья Наделла."""] ner_df = nlu.load('ru.ner.wikiner.glove.840B_300').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title} ## Results ```bash +----------------------+---------+ |chunk |ner_label| +----------------------+---------+ |Уильям Генри Гейтс III|PER | |Microsoft |ORG | |Microsoft Гейтс |ORG | |CEO |ORG | |Гейтс |PER | |Сиэтле |LOC | |Вашингтон |LOC | |Полом Алленом |PER | |Альбукерке |LOC | |Нью-Мексико |LOC | |Microsoft |ORG | |Гейтс |PER | |Гейтс |PER | |Гейтс |PER | |Microsoft |ORG | |Фонде Билла |PER | |Мелинды Гейтс |PER | |Мелинда Гейтс |PER | |Постепенно |PER | |Рэю Оззи |PER | +----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wikiner_840B_300| |Type:|ner| |Compatibility:|Spark NLP 2.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ru| |Case sensitive:|false| {:.h2_title} ## Data Source The model is trained based on data from [https://ru.wikipedia.org](https://ru.wikipedia.org) --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4CHEMD_Chem_Original_BlueBERT_384 date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Chem-Original-BlueBERT-384` is an English model originally trained by `ghadeermobasher`.
## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Original_BlueBERT_384_en_4.0.0_3.0_1657108595543.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Original_BlueBERT_384_en_4.0.0_3.0_1657108595543.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Original_BlueBERT_384","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Original_BlueBERT_384","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4CHEMD_Chem_Original_BlueBERT_384| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|408.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4CHEMD-Chem-Original-BlueBERT-384 --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4_original_PubmedBert date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4-original-PubmedBert` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_original_PubmedBert_en_4.0.0_3.0_1657108214734.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_original_PubmedBert_en_4.0.0_3.0_1657108214734.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_original_PubmedBert","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_original_PubmedBert","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4_original_PubmedBert| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|408.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4-original-PubmedBert --- layout: model title: Word Embeddings for Arabic (arabic_w2v_cc_300d) author: John Snow Labs name: arabic_w2v_cc_300d date: 2020-12-05 task: Embeddings language: ar edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [embeddings, ar, open_source] supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained on Common Crawl and Wikipedia using fastText. It is trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives. The model gives 300 dimensional vector outputs per token. The output vectors map words into a meaningful space where the distance between the vectors is related to semantic similarity of words. These embeddings can be used in multiple tasks like semantic word similarity, named entity recognition, sentiment analysis, and classification. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/arabic_w2v_cc_300d_ar_2.7.0_2.4_1607168354606.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/arabic_w2v_cc_300d_ar_2.7.0_2.4_1607168354606.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of a pipeline after tokenization.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['أنا أحب التعلم الآلي']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("أنا أحب التعلم الآلي").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["أنا أحب التعلم الآلي"] arabicvec_df = nlu.load('ar.embed.cbow.300d').predict(text, output_level='token') arabicvec_df ```
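Since the distance between these word vectors tracks semantic similarity, a common use is nearest-neighbour lookup over the `embeddings` column. A toy sketch with placeholder 300-dimensional vectors (random stand-ins, not real fastText outputs; the tokens are taken from the example sentence above):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
# Placeholder 300-d vectors keyed by token, standing in for real model outputs.
vocab = {tok: rng.normal(size=300) for tok in ["أنا", "أحب", "التعلم", "الآلي"]}
# A slightly perturbed copy of one vector, simulating a near-duplicate query.
query = vocab["التعلم"] + rng.normal(scale=0.01, size=300)

def nearest(query_vec, table):
    """Return the token whose vector has the highest cosine similarity to the query."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(table, key=lambda tok: cos(query_vec, table[tok]))

print(nearest(query, vocab))  # التعلم — the perturbed vector maps back to its token
```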
{:.h2_title} ## Results The model gives 300-dimensional Word2Vec feature vector outputs per token. ```bash | ar_embed_cbow_300d_embeddings token |----------------------------------------------------|-------- | [-0.11158058792352676, -0.06634224951267242, -... أنا | [-0.2818698585033417, -0.21061033010482788, -0... أحب ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|arabic_w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[word_embeddings]| |Language:|ar| |Case sensitive:|false| |Dimension:|300| ## Data Source This model is imported from [https://fasttext.cc/docs/en/crawl-vectors.html](https://fasttext.cc/docs/en/crawl-vectors.html) --- layout: model title: English DistilBertForQuestionAnswering model (from jgammack) SAE author: John Snow Labs name: distilbert_qa_SAE_base_uncased_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `SAE-distilbert-base-uncased-squad` is an English model originally trained by `jgammack`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_SAE_base_uncased_squad_en_4.0.0_3.0_1654722995068.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_SAE_base_uncased_squad_en_4.0.0_3.0_1654722995068.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_SAE_base_uncased_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_SAE_base_uncased_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased_sae.by_jgammack").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_SAE_base_uncased_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/jgammack/SAE-distilbert-base-uncased-squad --- layout: model title: Legal Definitions Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_definitions_bert date: 2023-03-05 tags: [en, legal, classification, clauses, definitions, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Definitions` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc.
Note that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Definitions`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_definitions_bert_en_1.0.0_3.0_1678049915628.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_definitions_bert_en_1.0.0_3.0_1678049915628.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_definitions_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
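Because the sentence embeddings accept at most 512 tokens, clauses over that length should be split before classification. A rough pre-check using a whitespace token count as a cheap proxy (WordPiece tokenization usually yields more tokens than this count, so treat the threshold conservatively; the sample clauses are illustrative):

```python
MAX_TOKENS = 512

def needs_splitting(text: str, limit: int = MAX_TOKENS) -> bool:
    """Rough check: whitespace token count as a lower-bound proxy for WordPiece length."""
    return len(text.split()) > limit

short_clause = '"Affiliate" means any entity controlling, controlled by, or under common control with a party.'
long_clause = "whereas " * 600  # 600 whitespace tokens, over the limit

print(needs_splitting(short_clause))  # False
print(needs_splitting(long_clause))   # True
```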
## Results ```bash +-------------+ |result| +-------------+ |[Definitions]| |[Other]| |[Other]| |[Definitions]| +-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_definitions_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Definitions 0.97 0.96 0.96 67 Other 0.97 0.98 0.97 95 accuracy - - 0.97 162 macro-avg 0.97 0.97 0.97 162 weighted-avg 0.97 0.97 0.97 162 ``` --- layout: model title: Relation Extraction between anatomical entities and other clinical entities (ReDL) author: John Snow Labs name: redl_oncology_location_biobert_wip date: 2023-01-15 tags: [licensed, clinical, oncology, en, relation_extraction, anatomy, tensorflow] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This relation extraction model links extractions from anatomical entities (such as Site_Breast or Site_Lung) to other clinical entities (such as Tumor_Finding or Cancer_Surgery).
## Predicted Entities `is_location_of`, `O` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_oncology_location_biobert_wip_en_4.2.4_3.0_1673770597615.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_oncology_location_biobert_wip_en_4.2.4_3.0_1673770597615.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use relation pairs to include only the combinations of entities that are relevant in your case.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") re_ner_chunk_filter = RENerChunksFilter()\ .setInputCols(["ner_chunk", "dependencies"])\ .setOutputCol("re_ner_chunk")\ .setMaxSyntacticDistance(10)\ .setRelationPairs(["Tumor_Finding-Site_Breast", "Site_Breast-Tumor_Finding", "Tumor_Finding-Anatomical_Site", "Anatomical_Site-Tumor_Finding"]) re_model = RelationExtractionDLModel.pretrained("redl_oncology_location_biobert_wip", "en", "clinical/models")\ .setInputCols(["re_ner_chunk", "sentence"])\ .setOutputCol("relation_extraction") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model]) data = spark.createDataFrame([["In April 2011, she first noticed a lump in her right breast."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val 
document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentence", "pos_tags", "token")) .setOutputCol("dependencies") val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunk", "dependencies")) .setOutputCol("re_ner_chunk") .setMaxSyntacticDistance(10) .setRelationPairs(Array("Tumor_Finding-Site_Breast", "Site_Breast-Tumor_Finding","Tumor_Finding-Anatomical_Site", "Anatomical_Site-Tumor_Finding")) val re_model = RelationExtractionDLModel.pretrained("redl_oncology_location_biobert_wip", "en", "clinical/models") .setPredictionThreshold(0.5f) .setInputCols(Array("re_ner_chunk", "sentence")) .setOutputCol("relation_extraction") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("""In April 2011, she first noticed a lump in her right breast.""").toDS.toDF("text") val result = 
pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.oncology_location_biobert_wip").predict("""In April 2011, she first noticed a lump in her right breast.""") ```
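The `setRelationPairs` argument above restricts relation candidates to the entity-label combinations you care about. Conceptually the filter works like this plain-Python sketch (an illustration only, not Spark NLP's actual implementation; the `candidate_pairs` helper is hypothetical):

```python
# Sketch of relation-pair filtering: entity chunks are (text, label) tuples,
# and only ordered pairs whose label combination is in the allowed set
# become relation candidates for the downstream RE model.
ALLOWED_PAIRS = {
    ("Tumor_Finding", "Site_Breast"),
    ("Site_Breast", "Tumor_Finding"),
}

def candidate_pairs(chunks):
    """Yield ordered chunk pairs whose labels form an allowed combination."""
    pairs = []
    for i, (text1, label1) in enumerate(chunks):
        for j, (text2, label2) in enumerate(chunks):
            if i != j and (label1, label2) in ALLOWED_PAIRS:
                pairs.append(((text1, label1), (text2, label2)))
    return pairs

chunks = [("lump", "Tumor_Finding"), ("breast", "Site_Breast"), ("April 2011", "Date")]
print(candidate_pairs(chunks))
```

Note how the `Date` chunk never appears in a candidate pair: pruning irrelevant combinations up front is what keeps the relation extraction step fast and precise.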
## Results ```bash +--------------+-------------+-------------+-----------+------+-----------+-------------+-----------+------+----------+ | relation| entity1|entity1_begin|entity1_end|chunk1| entity2|entity2_begin|entity2_end|chunk2|confidence| +--------------+-------------+-------------+-----------+------+-----------+-------------+-----------+------+----------+ |is_location_of|Tumor_Finding| 35| 38| lump|Site_Breast| 53| 58|breast| 0.9628376| +--------------+-------------+-------------+-----------+------+-----------+-------------+-----------+------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_oncology_location_biobert_wip| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|401.7 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label recall precision f1 O 0.90 0.94 0.92 is_location_of 0.94 0.90 0.92 macro-avg 0.92 0.92 0.92 ``` --- layout: model title: Chinese BertForMaskedLM Cased model (from hfl) author: John Snow Labs name: bert_embeddings_rbt6 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rbt6` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt6_zh_4.2.4_3.0_1670327118775.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt6_zh_4.2.4_3.0_1670327118775.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt6","zh") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt6","zh")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
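A typical downstream use of the `embeddings` column is comparing token or sentence vectors by cosine similarity. A minimal stdlib sketch (the vectors here are toy values, not actual rbt6 outputs):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [0.2, 0.1, 0.4]
v2 = [0.2, 0.1, 0.4]
v3 = [-0.4, 0.0, 0.2]
print(cosine_similarity(v1, v2))  # ~1.0 for identical vectors
print(cosine_similarity(v1, v3))
```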
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_rbt6|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|224.3 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/hfl/rbt6
- https://arxiv.org/abs/1906.08101
- https://github.com/google-research/bert
- https://github.com/ymcui/Chinese-BERT-wwm
- https://github.com/ymcui/MacBERT
- https://github.com/ymcui/Chinese-ELECTRA
- https://github.com/ymcui/Chinese-XLNet
- https://github.com/airaria/TextBrewer
- https://github.com/ymcui/HFL-Anthology
- https://arxiv.org/abs/2004.13922

---
layout: model
title: Legal Cause Clause Binary Classifier
author: John Snow Labs
name: legclf_cause_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `cause` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Keep in mind that the embeddings used by this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `cause`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cause_clause_en_1.0.0_3.2_1660122210522.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cause_clause_en_1.0.0_3.2_1660122210522.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_cause_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-------+
| result|
+-------+
|[cause]|
|[other]|
|[other]|
|[cause]|
+-------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_cause_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.1 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
       cause       0.96    1.00      0.98       27
       other       1.00    0.99      1.00      108
    accuracy          -       -      0.99      135
   macro-avg       0.98    1.00      0.99      135
weighted-avg       0.99    0.99      0.99      135
```

---
layout: model
title: Chinese BertForQuestionAnswering model (from jackh1995)
author: John Snow Labs
name: bert_qa_bert_chinese_finetuned
date: 2022-06-02
tags: [zh, open_source, question_answering, bert]
task: Question Answering
language: zh
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-chinese-finetuned` is a Chinese model originally trained by `jackh1995`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_chinese_finetuned_zh_4.0.0_3.0_1654181635362.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_chinese_finetuned_zh_4.0.0_3.0_1654181635362.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_chinese_finetuned","zh") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_bert_chinese_finetuned","zh")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("zh.answer_question.bert.by_jackh1995").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
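Extractive QA models like this one score every context token as a possible answer start and end, and the returned chunk is the best-scoring valid span. A toy sketch of that selection step (the logits here are hypothetical, not the model's real outputs):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick (start, end) maximizing start_logits[s] + end_logits[e], with s <= e."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best

tokens = ["My", "name", "is", "Clara"]
start_logits = [0.1, 0.2, 0.1, 2.0]
end_logits = [0.0, 0.1, 0.2, 1.9]
s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```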
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_chinese_finetuned|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|zh|
|Size:|381.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/jackh1995/bert-chinese-finetuned

---
layout: model
title: Stopwords Remover for Ancient Greek language (907 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, grc, open_source]
task: Stop Words Removal
language: grc
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_grc_3.4.1_3.0_1646673167002.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_grc_3.4.1_3.0_1646673167002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

stop_words = StopWordsCleaner.pretrained("stopwords_iso","grc") \
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])

example = spark.createDataFrame([["Όντας δε θνητούς θνητά και φρονείν χρεών."]], ["text"])

results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val stop_words = StopWordsCleaner.pretrained("stopwords_iso","grc")
    .setInputCols(Array("token"))
    .setOutputCol("cleanTokens")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))

val data = Seq("Όντας δε θνητούς θνητά και φρονείν χρεών.").toDF("text")

val results = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("grc.stopwords").predict("""Όντας δε θνητούς θνητά και φρονείν χρεών.""")
```
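Conceptually, the cleaner is a set-membership filter over tokens. A plain-Python illustration with a tiny hypothetical stopword list (not the model's actual 907-entry ISO list, whose contents may differ):

```python
# Tiny hypothetical stopword sample -- NOT the real 907-entry ISO list.
STOPWORDS = {"και", "δε"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not in the stopword set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ["Όντας", "δε", "θνητούς", "θνητά", "και", "φρονείν", "χρεών", "."]
print(remove_stopwords(tokens))
```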
## Results

```bash
+---------------------------------------------------+
|result                                             |
+---------------------------------------------------+
|[Όντας, δε, θνητούς, θνητά, και, φρονείν, χρεών, .]|
+---------------------------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|grc|
|Size:|4.5 KB|

---
layout: model
title: English BertForQuestionAnswering model (from rsvp-ai)
author: John Snow Labs
name: bert_qa_bertserini_bert_base_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bertserini-bert-base-squad` is an English model originally trained by `rsvp-ai`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bertserini_bert_base_squad_en_4.0.0_3.0_1654185449571.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bertserini_bert_base_squad_en_4.0.0_3.0_1654185449571.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bertserini_bert_base_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_bertserini_bert_base_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.base.by_rsvp-ai").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bertserini_bert_base_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/rsvp-ai/bertserini-bert-base-squad

---
layout: model
title: Google's Tapas Table Understanding (Mini, WTQ)
author: John Snow Labs
name: table_qa_tapas_mini_finetuned_wtq
date: 2022-09-30
tags: [en, table, qa, question, answering, open_source]
task: Table Question Answering
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: TapasForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is a Zero-shot Table Understanding Model which allows you to carry out Question Answering on Spark DataFrames. If your table is stored in a file format such as CSV, load it with Spark first.

Size of this model: Mini

Has aggregation operations?: True

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_mini_finetuned_wtq_en_4.2.0_3.0_1664530449660.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_mini_finetuned_wtq_en_4.2.0_3.0_1664530449660.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python json_data = """ { "header": ["name", "money", "age"], "rows": [ ["Donald Trump", "$100,000,000", "75"], ["Elon Musk", "$20,000,000,000,000", "55"] ] } """ queries = [ "Who earns less than 200,000,000?", "Who earns 100,000,000?", "How much money has Donald Trump?", "How old are they?", ] data = spark.createDataFrame([ [json_data, " ".join(queries)] ]).toDF("table_json", "questions") document_assembler = MultiDocumentAssembler() \ .setInputCols("table_json", "questions") \ .setOutputCols("document_table", "document_questions") sentence_detector = SentenceDetector() \ .setInputCols(["document_questions"]) \ .setOutputCol("questions") table_assembler = TableAssembler()\ .setInputCols(["document_table"])\ .setOutputCol("table") tapas = TapasForQuestionAnswering\ .pretrained("table_qa_tapas_mini_finetuned_wtq","en")\ .setInputCols(["questions", "table"])\ .setOutputCol("answers") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, table_assembler, tapas ]) model = pipeline.fit(data) model\ .transform(data)\ .selectExpr("explode(answers) AS answer")\ .select("answer")\ .show(truncate=False) ```
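Besides selecting cells, the model predicts an aggregation operator (NONE, SUM, AVERAGE, COUNT) to apply to them, which is how "How old are they?" can produce an averaged answer over the two age cells. A toy sketch of that final step (the `apply_aggregation` helper is hypothetical, not part of the Spark NLP API):

```python
def apply_aggregation(op, cells):
    """Apply a TAPAS-style aggregation operator to the selected cell values."""
    if op == "NONE":
        return cells  # return the selected cells verbatim
    values = [float(c) for c in cells]
    if op == "SUM":
        return sum(values)
    if op == "AVERAGE":
        return sum(values) / len(values)
    if op == "COUNT":
        return len(values)
    raise ValueError(f"unknown operator: {op}")

print(apply_aggregation("AVERAGE", ["75", "55"]))  # -> 65.0
```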
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |answer | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} | |{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|table_qa_tapas_mini_finetuned_wtq| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|43.4 MB| |Case sensitive:|false| ## References https://www.microsoft.com/en-us/download/details.aspx?id=54253 https://github.com/ppasupat/WikiTableQuestions --- layout: model title: Fast Neural Machine Translation Model from English to Efik author: John Snow Labs name: opus_mt_en_efi date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, efi, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine 
Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.

- source languages: `en`
- target languages: `efi`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_efi_xx_2.7.0_2.4_1609164796629.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_efi_xx_2.7.0_2.4_1609164796629.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_efi", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your text to translate here.")
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_efi", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("Your text to translate here.").toDS.toDF("text")

val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.efi').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_en_efi|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab11_by_sameearif88 TFWav2Vec2ForCTC from sameearif88
author: John Snow Labs
name: asr_wav2vec2_base_timit_demo_colab11_by_sameearif88
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab11_by_sameearif88` is an English model originally trained by sameearif88.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab11_by_sameearif88_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab11_by_sameearif88_en_4.2.0_3.0_1664021285479.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab11_by_sameearif88_en_4.2.0_3.0_1664021285479.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab11_by_sameearif88", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab11_by_sameearif88", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
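The snippets above assume an `audioDf` whose `audio_content` column holds raw audio as arrays of floats. One way to produce such arrays from a mono 16-bit PCM WAV file using only the standard library (a sketch; the file name and DataFrame layout are assumptions, not part of this model card):

```python
import wave
import struct

def wav_to_floats(path):
    """Read a mono 16-bit PCM WAV file into a list of floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expects 16-bit samples"
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Hypothetical usage, assuming an active Spark session:
# audioDf = spark.createDataFrame([[wav_to_floats("sample.wav")]], ["audio_content"])
```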
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_demo_colab11_by_sameearif88|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|355.0 MB|

---
layout: model
title: English AlbertForQuestionAnswering model (from madlag)
author: John Snow Labs
name: albert_qa_base_v2_squad
date: 2022-06-24
tags: [en, open_source, albert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: AlBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `albert-base-v2-squad` is an English model originally trained by `madlag`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_base_v2_squad_en_4.0.0_3.0_1656063705520.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_base_v2_squad_en_4.0.0_3.0_1656063705520.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_base_v2_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_base_v2_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.albert.base_v2").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_qa_base_v2_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|42.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/madlag/albert-base-v2-squad - https://github.com/google-research/albert --- layout: model title: Extract relations between problem, test, and findings in reports author: John Snow Labs name: re_test_problem_finding date: 2021-04-19 tags: [en, relation_extraction, licensed, clinical] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 2.7.1 spark_version: 2.4 supported: true annotator: RelationExtractionModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Finds relations between diagnoses, tests, and imaging findings in radiology reports. `1` : The two entities are related. `0` : The two entities are not related. ## Predicted Entities `0`, `1` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_RADIOLOGY/){:.button.button-orange} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_test_problem_finding_en_2.7.1_2.4_1618830922197.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_test_problem_finding_en_2.7.1_2.4_1618830922197.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The table below shows the `re_test_problem_finding` RE model, its labels, the optimal NER model, and the meaningful relation pairs.
| RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS |
|:-----------------------:|:---------------:|:---------:|----------|
| re_test_problem_finding | 0, 1 | ner_jsl | "test-cerebrovascular_disease", "cerebrovascular_disease-test", "test-communicable_disease", "communicable_disease-test", "test-diabetes", "diabetes-test", "test-disease_syndrome_disorder", "disease_syndrome_disorder-test", "test-heart_disease", "heart_disease-test", "test-hyperlipidemia", "hyperlipidemia-test", "test-hypertension", "hypertension-test", "test-injury_or_poisoning", "injury_or_poisoning-test", "test-kidney_disease", "kidney_disease-test", "test-obesity", "obesity-test", "test-oncological", "oncological-test", "test-psychological_condition", "psychological_condition-test", "test-symptom", "symptom-test", "ekg_findings-disease_syndrome_disorder", "disease_syndrome_disorder-ekg_findings", "ekg_findings-heart_disease", "heart_disease-ekg_findings", "ekg_findings-symptom", "symptom-ekg_findings", "imagingfindings-cerebrovascular_disease", "cerebrovascular_disease-imagingfindings", "imagingfindings-communicable_disease", "communicable_disease-imagingfindings", "imagingfindings-disease_syndrome_disorder", "disease_syndrome_disorder-imagingfindings", "imagingfindings-heart_disease", "heart_disease-imagingfindings", "imagingfindings-hyperlipidemia", "hyperlipidemia-imagingfindings", "imagingfindings-hypertension", "hypertension-imagingfindings", "imagingfindings-injury_or_poisoning", "injury_or_poisoning-imagingfindings", "imagingfindings-kidney_disease", "kidney_disease-imagingfindings", "imagingfindings-oncological", "oncological-imagingfindings", "imagingfindings-psychological_condition", "psychological_condition-imagingfindings", "imagingfindings-symptom", "symptom-imagingfindings", "vs_finding-cerebrovascular_disease", "cerebrovascular_disease-vs_finding", "vs_finding-communicable_disease", "communicable_disease-vs_finding", "vs_finding-diabetes", "diabetes-vs_finding", "vs_finding-disease_syndrome_disorder", "disease_syndrome_disorder-vs_finding", "vs_finding-heart_disease", "heart_disease-vs_finding", "vs_finding-hyperlipidemia", "hyperlipidemia-vs_finding", "vs_finding-hypertension", "hypertension-vs_finding", "vs_finding-injury_or_poisoning", "injury_or_poisoning-vs_finding", "vs_finding-kidney_disease", "kidney_disease-vs_finding", "vs_finding-obesity", "obesity-vs_finding", "vs_finding-oncological", "oncological-vs_finding", "vs_finding-overweight", "overweight-vs_finding", "vs_finding-psychological_condition", "psychological_condition-vs_finding", "vs_finding-symptom", "symptom-vs_finding" |
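The RE PAIRS column above follows a mechanical pattern: each finding-type label (test, ekg_findings, imagingfindings, vs_finding) is paired with a set of problem-type labels, in both directions. A sketch of generating such a list, using an illustrative subset of labels rather than the exact set the model supports:

```python
# Build bidirectional "entity1-entity2" relation-pair strings between
# finding-type and problem-type NER labels, as in the RE PAIRS column.
findings = ["test", "imagingfindings"]
problems = ["diabetes", "heart_disease", "symptom"]

pairs = []
for f in findings:
    for p in problems:
        pairs.append(f"{f}-{p}")  # e.g. "test-diabetes"
        pairs.append(f"{p}-{f}")  # e.g. "diabetes-test"
```

Strings of this `entity1-entity2` form are what `.setRelationPairs(...)` expects in the pipeline below.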
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

ner_tagger = MedicalNerModel.pretrained("jsl_ner_wip_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens", "embeddings"])\
    .setOutputCol("ner_tags")

ner_chunker = NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

re_model = RelationExtractionModel.pretrained("re_test_problem_finding", "en", "clinical/models")\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setMaxSyntacticDistance(4)\
    .setPredictionThreshold(0.9)\
    .setRelationPairs(["procedure-symptom"])

nlp_pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, words_embedder, pos_tagger, ner_tagger, ner_chunker, dependency_parser, re_model])

light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

annotations = light_pipeline.fullAnnotate("""Targeted biopsy of this lesion for histological correlation should be considered.""")
```
```scala
val documenter = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentencer = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentences")

val tokenizer = new Tokenizer()
  .setInputCols("sentences")
  .setOutputCol("tokens")

val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens"))
  .setOutputCol("embeddings")

val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens"))
  .setOutputCol("pos_tags")

val ner_tagger = MedicalNerModel.pretrained("jsl_ner_wip_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens", "embeddings"))
  .setOutputCol("ner_tags")

val ner_chunker = new NerConverterInternal()
  .setInputCols(Array("sentences", "tokens", "ner_tags"))
  .setOutputCol("ner_chunks")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentences", "pos_tags", "tokens"))
  .setOutputCol("dependencies")

val re_model = RelationExtractionModel.pretrained("re_test_problem_finding", "en", "clinical/models")
  .setInputCols(Array("embeddings", "pos_tags", "ner_chunks", "dependencies"))
  .setOutputCol("relations")
  .setMaxSyntacticDistance(4)
  .setPredictionThreshold(0.9f)
  .setRelationPairs(Array("procedure-symptom"))

val nlp_pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, words_embedder, pos_tagger, ner_tagger, ner_chunker, dependency_parser, re_model))

val data = Seq("""Targeted biopsy of this lesion for histological correlation should be considered.""").toDS.toDF("text")

val result = nlp_pipeline.fit(data).transform(data)
```
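One parameter worth calling out is `.setPredictionThreshold(0.9)`: candidate relations whose confidence falls below the threshold are dropped from the output. In plain Python, that filtering step amounts to the following (toy candidates for illustration, not actual model output):

```python
# Keep only relation candidates at or above the prediction threshold.
threshold = 0.9
candidates = [
    {"relation": "1", "entity1": "PROCEDURE", "entity2": "SYMPTOM", "confidence": 0.97},
    {"relation": "1", "entity1": "TEST", "entity2": "SYMPTOM", "confidence": 0.42},
]
kept = [c for c in candidates if c["confidence"] >= threshold]
```

Lowering the threshold trades precision for recall; 0.9 keeps only high-confidence relations.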
## Results ```bash | index | relations | entity1 | chunk1 | entity2 | chunk2 | |-------|--------------|--------------|---------------------|--------------|---------| | 0 | 1 | PROCEDURE | biopsy | SYMPTOM | lesion | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_test_problem_finding| |Type:|re| |Compatibility:|Healthcare NLP 2.7.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]| |Output Labels:|[relations]| |Language:|en| ## Data Source Trained on internal datasets. --- layout: model title: English BertForQuestionAnswering Cased model (from maroo93) author: John Snow Labs name: bert_qa_kd_squad1.1 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `kd_squad1.1` is an English model originally trained by `maroo93`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_kd_squad1.1_en_4.0.0_3.0_1657189570964.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_kd_squad1.1_en_4.0.0_3.0_1657189570964.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_kd_squad1.1","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_kd_squad1.1","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_kd_squad1.1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|249.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/maroo93/kd_squad1.1 --- layout: model title: Chinese Part of Speech Tagger (from raynardj) author: John Snow Labs name: bert_pos_classical_chinese_punctuation_guwen_biaodian date: 2022-05-09 tags: [bert, pos, part_of_speech, zh, open_source] task: Part of Speech Tagging language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `classical-chinese-punctuation-guwen-biaodian` is a Chinese model originally trained by `raynardj`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_classical_chinese_punctuation_guwen_biaodian_zh_3.4.2_3.0_1652088290238.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_classical_chinese_punctuation_guwen_biaodian_zh_3.4.2_3.0_1652088290238.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_classical_chinese_punctuation_guwen_biaodian","zh") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_classical_chinese_punctuation_guwen_biaodian","zh") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_classical_chinese_punctuation_guwen_biaodian| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|zh| |Size:|381.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/raynardj/classical-chinese-punctuation-guwen-biaodian - https://github.com/raynardj/yuan --- layout: model title: English RobertaForQuestionAnswering (from sunitha) author: John Snow Labs name: roberta_qa_Roberta_Custom_Squad_DS date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Roberta_Custom_Squad_DS` is an English model originally trained by `sunitha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_Roberta_Custom_Squad_DS_en_4.0.0_3.0_1655727273046.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_Roberta_Custom_Squad_DS_en_4.0.0_3.0_1655727273046.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_Roberta_Custom_Squad_DS","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_Roberta_Custom_Squad_DS","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.by_sunitha").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_Roberta_Custom_Squad_DS| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/sunitha/Roberta_Custom_Squad_DS --- layout: model title: Ukrainian BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_uk_cased date: 2022-12-02 tags: [uk, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: uk edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uk-cased` is a Ukrainian model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_uk_cased_uk_4.2.4_3.0_1670019147754.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_uk_cased_uk_4.2.4_3.0_1670019147754.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_uk_cased","uk") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_uk_cased","uk")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_uk_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|uk| |Size:|357.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-uk-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Translate English to San Salvador Kongo Pipeline author: John Snow Labs name: translate_en_kwy date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, kwy, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `kwy` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_kwy_xx_2.7.0_2.4_1609688437952.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_kwy_xx_2.7.0_2.4_1609688437952.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_kwy", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_kwy", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.kwy').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_kwy| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Greek BertForQuestionAnswering Cased model (from Danastos) author: John Snow Labs name: bert_qa_qacombination_el_4 date: 2022-07-07 tags: [el, open_source, bert, question_answering] task: Question Answering language: el edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `qacombination_bert_el_4` is a Greek model originally trained by `Danastos`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_qacombination_el_4_el_4.0.0_3.0_1657190786453.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_qacombination_el_4_el_4.0.0_3.0_1657190786453.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_qacombination_el_4","el") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["Ποιο είναι το όνομά μου?", "Το όνομά μου είναι Κλάρα και μένω στο Μπέρκλεϊ."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_qacombination_el_4","el")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("Ποιο είναι το όνομά μου?", "Το όνομά μου είναι Κλάρα και μένω στο Μπέρκλεϊ.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_qacombination_el_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|el| |Size:|421.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Danastos/qacombination_bert_el_4 --- layout: model title: Company Name Normalization using Nasdaq Stock Screener author: John Snow Labs name: finel_nasdaq_company_name_stock_screener date: 2023-01-20 tags: [en, finance, licensed, nasdaq, company] task: Entity Resolution language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Financial Entity Resolver model, trained to obtain normalized versions of Company Names, registered in NASDAQ Stock Screener. You can use this model after extracting a company name using any NER, and you will obtain the official name of the company as per NASDAQ Stock Screener. After this, you can use `finmapper_nasdaq_company_name_stock_screener` to augment and obtain more information about a company using NASDAQ Stock Screener, including Ticker, Sector, Country, etc. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finel_nasdaq_company_name_stock_screener_en_1.0.0_3.0_1674233034536.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finel_nasdaq_company_name_stock_screener_en_1.0.0_3.0_1674233034536.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\ .setInputCols(["document", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") chunkToDoc = nlp.Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") chunk_embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \ .setInputCols("ner_chunk_doc") \ .setOutputCol("sentence_embeddings") use_er_model = finance.SentenceEntityResolverModel.pretrained("finel_nasdaq_company_name_stock_screener", "en", "finance/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("normalized")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, ner_model, ner_converter, chunkToDoc, chunk_embeddings, use_er_model ]) text = """NIKE is an American multinational corporation that is engaged in the design, development, manufacturing, and worldwide marketing and sales of footwear, apparel, equipment, accessories, and services.""" test_data = spark.createDataFrame([[text]]).toDF("text") model = nlpPipeline.fit(test_data) lp = nlp.LightPipeline(model) result = lp.annotate(text) result["normalized"] ```
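Conceptually, the resolver stage with `setDistanceFunction("EUCLIDEAN")` returns the catalog entry whose sentence embedding lies closest to the embedding of the extracted company-name chunk. A toy sketch of that nearest-neighbor step in plain Python, using made-up two-dimensional vectors (real embeddings here are 512-dimensional Universal Sentence Encoder vectors):

```python
import math

# Toy nearest-neighbor resolution: pick the catalog name whose embedding
# is closest (Euclidean distance) to the query chunk embedding.
catalog = {
    "Nike Inc. Common Stock": [0.9, 0.1],
    "Apple Inc. Common Stock": [0.1, 0.9],
}
query = [0.8, 0.2]  # stands in for the embedding of the extracted chunk "NIKE"

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

best = min(catalog, key=lambda name: euclidean(catalog[name], query))
# best == "Nike Inc. Common Stock"
```

The actual model performs this lookup over the full NASDAQ Stock Screener catalog and returns the matched official name in the `normalized` column.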
## Results ```bash ['Nike Inc. Common Stock'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finel_nasdaq_company_name_stock_screener| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[normalized]| |Language:|en| |Size:|54.7 MB| |Case sensitive:|false| ## References https://www.nasdaq.com/market-activity/stocks/screener --- layout: model title: Self Reported Stress Classifier (BioBERT) author: John Snow Labs name: bert_sequence_classifier_self_reported_stress_tweet date: 2022-07-29 tags: [en, licenced, clinical, public_health, sequence_classification, classifier, stress, licensed] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true recommended: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [BioBERT based](https://github.com/dmis-lab/biobert) classifier that can identify stress in social media (Twitter) posts in the self-disclosure category. The model finds whether a person claims he/she is stressed or not. 
## Predicted Entities `not-stressed`, `stressed` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_STRESS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4SC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_self_reported_stress_tweet_en_4.0.0_3.0_1659087442993.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_self_reported_stress_tweet_en_4.0.0_3.0_1659087442993.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_self_reported_stress_tweet", "en", "clinical/models")\ .setInputCols(["document", "token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame([["Do you feel stressed?"], ["I'm so stressed!"], ["Depression and anxiety will probably end up killing me – I feel so stressed all the time and just feel awful."], ["Do you enjoy living constantly in this self-inflicted stress?"]]).toDF("text") result = pipeline.fit(data).transform(data) result.select("text", "class.result").show(truncate=False) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_self_reported_stress_tweet", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) val data = Seq("Do you feel stressed?", "I'm so stressed!", "Depression and anxiety will probably end up killing me – I feel so stressed all the time and just feel awful.", "Do you enjoy living constantly in this self-inflicted stress?").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.self_reported_stress").predict("""Depression and anxiety will probably end up killing me – I feel so stressed all the time and just feel awful.""") ```
## Results ```bash +-------------------------------------------------------------------------------------------------------------+--------------+ |text |result | +-------------------------------------------------------------------------------------------------------------+--------------+ |Do you feel stressed? |[not-stressed]| |I'm so stressed! |[stressed] | |Depression and anxiety will probably end up killing me – I feel so stressed all the time and just feel awful.|[stressed] | |Do you enjoy living constantly in this self-inflicted stress? |[not-stressed]| +-------------------------------------------------------------------------------------------------------------+--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_self_reported_stress_tweet| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## Benchmarking ```bash label precision recall f1-score support not-stressed 0.8564 0.8020 0.8283 409 stressed 0.7197 0.7909 0.7536 263 accuracy - - 0.7976 672 macro-avg 0.7881 0.7964 0.7910 672 weighted-avg 0.8029 0.7976 0.7991 672 ``` --- layout: model title: English asr_wav2vec2_large_960h_lv60_self_4_gram TFWav2Vec2ForCTC from patrickvonplaten author: John Snow Labs name: asr_wav2vec2_large_960h_lv60_self_4_gram date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_960h_lv60_self_4_gram` is an English model originally trained by patrickvonplaten. 
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_960h_lv60_self_4_gram_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_960h_lv60_self_4_gram_en_4.2.0_3.0_1664021695681.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_960h_lv60_self_4_gram_en_4.2.0_3.0_1664021695681.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_960h_lv60_self_4_gram", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_960h_lv60_self_4_gram", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_960h_lv60_self_4_gram| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|757.4 MB| --- layout: model title: Turkish XlmRoBertaForQuestionAnswering (from Aybars) author: John Snow Labs name: xlm_roberta_qa_XLM_Turkish date: 2022-06-23 tags: [tr, open_source, question_answering, xlmroberta] task: Question Answering language: tr edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `XLM_Turkish` is a Turkish model originally trained by `Aybars`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_XLM_Turkish_tr_4.0.0_3.0_1655983903393.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_XLM_Turkish_tr_4.0.0_3.0_1655983903393.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_XLM_Turkish","tr") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_XLM_Turkish","tr") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("tr.answer_question.xlm_roberta").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_XLM_Turkish| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|tr| |Size:|792.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Aybars/XLM_Turkish --- layout: model title: English asr_wav2vec2_coral_300ep TFWav2Vec2ForCTC from joaoalvarenga author: John Snow Labs name: pipeline_asr_wav2vec2_coral_300ep date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_coral_300ep` is an English model originally trained by joaoalvarenga. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_coral_300ep_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_coral_300ep_en_4.2.0_3.0_1664023766656.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_coral_300ep_en_4.2.0_3.0_1664023766656.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_coral_300ep', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_coral_300ep", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_coral_300ep| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Pipeline for Detect Subentity PHI for Deidentification (Arabic) author: John Snow Labs name: ner_deid_subentity_pipeline date: 2023-05-31 tags: [licensed, clinical, deidentification, ar, pipeline] task: Pipeline Healthcare language: ar edition: Healthcare NLP 4.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_deid_subentity](https://nlp.johnsnowlabs.com/2023/05/29/ner_deid_subentity_ar.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_pipeline_ar_4.4.1_3.0_1685563688023.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_pipeline_ar_4.4.1_3.0_1685563688023.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_subentity_pipeline", "ar", "clinical/models") text= ''' ملاحظات سريرية - مريض الربو. التاريخ: 16 أبريل 2000. اسم المريضة: ليلى حسن. العنوان: شارع المعرفة، مبنى رقم 789، حي الأمانة، جدة. الرمز البريدي: 54321. البلد: المملكة العربية السعودية. اسم المستشفى: مستشفى النور. اسم الطبيب: د. أميرة أحمد. تفاصيل الحالة: المريضة ليلى حسن، البالغة من العمر 35 عامًا، تعاني من مرض الربو المزمن. تشكو من ضيق التنفس والسعال المتكرر والشهيق الشديد. تم تشخيصها بمرض الربو بناءً على تاريخها الطبي واختبارات وظائف الرئة. الخطة: تم وصف مضادات الالتهاب غير الستيرويدية والموسعات القصبية لتحسين التنفس وتقليل التهيج. يجب على المريضة حمل معها جهاز الاستنشاق في حالة حدوث نوبة ربو حادة. يتعين على المريضة تجنب التحسس من العوامل المسببة للربو، مثل الدخان والغبار والحيوانات الأليفة. يجب مراقبة وظائف الرئة بانتظام ومتابعة التعليمات الطبية المتعلقة بمرض الربو. تعليم المريضة بشأن كيفية استخدام جهاز الاستنشاق بشكل صحيح وتقنيات التنفس الصحيح. ''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_subentity_pipeline", "ar", "clinical/models") val text = "ملاحظات سريرية - مريض الربو. التاريخ: 16 أبريل 2000. اسم المريضة: ليلى حسن. العنوان: شارع المعرفة، مبنى رقم 789، حي الأمانة، جدة. الرمز البريدي: 54321. البلد: المملكة العربية السعودية. اسم المستشفى: مستشفى النور. اسم الطبيب: د. أميرة أحمد. تفاصيل الحالة: المريضة ليلى حسن، البالغة من العمر 35 عامًا، تعاني من مرض الربو المزمن. تشكو من ضيق التنفس والسعال المتكرر والشهيق الشديد. تم تشخيصها بمرض الربو بناءً على تاريخها الطبي واختبارات وظائف الرئة. الخطة: تم وصف مضادات الالتهاب غير الستيرويدية والموسعات القصبية لتحسين التنفس وتقليل التهيج. يجب على المريضة حمل معها جهاز الاستنشاق في حالة حدوث نوبة ربو حادة. يتعين على المريضة تجنب التحسس من العوامل المسببة للربو، مثل الدخان والغبار والحيوانات الأليفة. 
يجب مراقبة وظائف الرئة بانتظام ومتابعة التعليمات الطبية المتعلقة بمرض الربو. تعليم المريضة بشأن كيفية استخدام جهاز الاستنشاق بشكل صحيح وتقنيات التنفس الصحيح." val result = pipeline.fullAnnotate(text) ```
## Results ```bash +---------------+--------+ |chunks |entities| +---------------+--------+ |16 أبريل 2000 |DATE | |ليلى حسن |PATIENT | |789، |ZIP | |جدة |CITY | |54321 |ZIP | |المملكة العربية|CITY | |السعودية |COUNTRY | |النور |HOSPITAL| |أميرة أحمد |DOCTOR | |ليلى |PATIENT | |35 |AGE | +---------------+--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.1+| |License:|Licensed| |Edition:|Official| |Language:|ar| |Size:|1.2 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English XLMRobertaForTokenClassification Base Cased model (from tner) author: John Snow Labs name: xlmroberta_ner_base_bc5cdr date: 2022-08-13 tags: [en, open_source, xlm_roberta, ner] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-bc5cdr` is an English model originally trained by `tner`. ## Predicted Entities `chemical`, `disease` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_bc5cdr_en_4.1.0_3.0_1660425851127.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_bc5cdr_en_4.1.0_3.0_1660425851127.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_bc5cdr","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_bc5cdr","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_bc5cdr| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|780.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/tner/xlm-roberta-base-bc5cdr - https://github.com/asahi417/tner --- layout: model title: Legal Corporate existence Clause Binary Classifier author: John Snow Labs name: legclf_corporate_existence_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `corporate-existence` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). 
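The paragraph splitting mentioned above can be sketched with a plain blank-line split (a minimal illustration only; `split_paragraphs` is a hypothetical helper, not the workshop implementation):

```python
import re

def split_paragraphs(text):
    # Split raw document text into paragraph-sized chunks on blank lines,
    # dropping empty fragments.
    return [p.strip() for p in re.split(r"\n\s*\n+", text) if p.strip()]

doc = ("CORPORATE EXISTENCE.\nThe Company is duly organized.\n\n"
       "GOVERNING LAW.\nThis Agreement is governed by Delaware law.")
clauses = split_paragraphs(doc)
# each chunk can then be fed to the classifier as a "clause_text" row
```

Each resulting chunk keeps the context of a full paragraph rather than a single sentence, and typical clauses stay well under the 512-token limit.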
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `corporate-existence` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_corporate_existence_clause_en_1.0.0_3.2_1660123366726.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_corporate_existence_clause_en_1.0.0_3.2_1660123366726.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_corporate_existence_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +---------------------+ |result | +---------------------+ |[corporate-existence]| |[other] | |[other] | |[corporate-existence]| +---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_corporate_existence_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support corporate-existence 0.91 0.93 0.92 43 other 0.96 0.95 0.95 76 accuracy - - 0.94 119 macro-avg 0.93 0.94 0.94 119 weighted-avg 0.94 0.94 0.94 119 ``` --- layout: model title: Bangla BertForMaskedLM Cased model (from neuralspace-reverie) author: John Snow Labs name: bert_embeddings_indic_transformers date: 2022-12-06 tags: [bn, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: bn edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `indic-transformers-bn-bert` is a Bangla model originally trained by `neuralspace-reverie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_bn_4.2.4_3.0_1670326563595.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_bn_4.2.4_3.0_1670326563595.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","bn") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","bn") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_indic_transformers| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|bn| |Size:|505.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-bn-bert - https://oscar-corpus.com/ --- layout: model title: Extract Cancer Therapies and Granular Posology Information author: John Snow Labs name: ner_oncology_posology date: 2022-11-24 tags: [licensed, clinical, en, oncology, ner, treatment, posology] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts cancer therapies (Cancer_Surgery, Radiotherapy and Cancer_Therapy) and posology information at a granular level. Definitions of Predicted Entities: - `Cancer_Surgery`: Terms that indicate surgery as a form of cancer treatment. - `Cancer_Therapy`: Any cancer treatment mentioned in text, excluding surgeries and radiotherapy. - `Cycle_Count`: The total number of cycles being administered of an oncological therapy (e.g. "5 cycles"). - `Cycle_Day`: References to the day of the cycle of oncological therapy (e.g. "day 5"). - `Cycle_Number`: The number of the cycle of an oncological therapy that is being applied (e.g. "third cycle"). - `Dosage`: The quantity prescribed by the physician for an active ingredient. - `Duration`: Words indicating the duration of a treatment (e.g. "for 2 weeks"). - `Frequency`: Words indicating the frequency of treatment administration (e.g. "daily" or "bid"). - `Radiotherapy`: Terms that indicate the use of Radiotherapy. - `Radiation_Dose`: Dose used in radiotherapy. - `Route`: Words indicating the type of administration route (such as "PO" or "transdermal"). 
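As a minimal illustration of how the granular labels defined above can be consumed downstream (a hypothetical post-processing step, not part of the model), the chunk/label pairs produced by the NER stage can be grouped into a posology record:

```python
from collections import defaultdict

def group_chunks(chunks):
    # chunks: (text, ner_label) pairs, as in the "chunk"/"ner_label"
    # output columns shown in the Results section
    grouped = defaultdict(list)
    for text, label in chunks:
        grouped[label].append(text)
    return dict(grouped)

example = [("adriamycin", "Cancer_Therapy"), ("60 mg/m2", "Dosage"),
           ("six courses", "Cycle_Count")]
record = group_chunks(example)
# record == {"Cancer_Therapy": ["adriamycin"], "Dosage": ["60 mg/m2"],
#            "Cycle_Count": ["six courses"]}
```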
## Predicted Entities `Cancer_Surgery`, `Cancer_Therapy`, `Cycle_Count`, `Cycle_Day`, `Cycle_Number`, `Dosage`, `Duration`, `Frequency`, `Radiotherapy`, `Radiation_Dose`, `Route` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_posology_en_4.2.2_3.0_1669306988706.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_posology_en_4.2.2_3.0_1669306988706.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_posology", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. 
She is currently receiving his second cycle of chemotherapy and is in good overall condition."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_posology", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving his second cycle of chemotherapy and is in good overall condition.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_posology").predict("""The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving his second cycle of chemotherapy and is in good overall condition.""") ```
## Results ```bash | chunk | ner_label | |:-----------------|:---------------| | adriamycin | Cancer_Therapy | | 60 mg/m2 | Dosage | | cyclophosphamide | Cancer_Therapy | | 600 mg/m2 | Dosage | | six courses | Cycle_Count | | second cycle | Cycle_Number | | chemotherapy | Cancer_Therapy | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_posology| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.3 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Cycle_Number 52 4 45 97 0.93 0.54 0.68 Cycle_Count 200 63 30 230 0.76 0.87 0.81 Radiotherapy 255 16 55 310 0.94 0.82 0.88 Cancer_Surgery 592 66 227 819 0.90 0.72 0.80 Cycle_Day 175 22 73 248 0.89 0.71 0.79 Frequency 337 44 90 427 0.88 0.79 0.83 Route 53 1 60 113 0.98 0.47 0.63 Cancer_Therapy 1448 81 250 1698 0.95 0.85 0.90 Duration 525 154 236 761 0.77 0.69 0.73 Dosage 858 79 202 1060 0.92 0.81 0.86 Radiation_Dose 86 4 40 126 0.96 0.68 0.80 macro_avg 4581 534 1308 5889 0.90 0.72 0.79 micro_avg 4581 534 1308 5889 0.90 0.78 0.83 ``` --- layout: model title: Indonesian T5ForConditionalGeneration Base Cased model (from Wikidepia) author: John Snow Labs name: t5_indot5_base_paraphrase date: 2023-01-30 tags: [id, open_source, t5] task: Text Generation language: id edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `IndoT5-base-paraphrase` is an Indonesian model originally trained by `Wikidepia`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_indot5_base_paraphrase_id_4.3.0_3.0_1675097776595.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_indot5_base_paraphrase_id_4.3.0_3.0_1675097776595.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_indot5_base_paraphrase","id") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_indot5_base_paraphrase","id") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_indot5_base_paraphrase| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|id| |Size:|1.0 GB| ## References - https://huggingface.co/Wikidepia/IndoT5-base-paraphrase --- layout: model title: OCR small for printed text author: John Snow Labs name: ocr_small_printed date: 2022-02-16 tags: [en, licensed] task: OCR Text Detection & Recognition language: en nav_key: models edition: Visual NLP 3.3.3 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description OCR small model for recognizing printed text, based on the TrOCR architecture. The TrOCR model was proposed in TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to perform optical character recognition (OCR). The abstract from the paper is the following: Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets.
Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/ocr_small_printed_en_3.3.3_2.4_1645007455031.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/ocr_small_printed_en_3.3.3_2.4_1645007455031.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ocr = ImageToTextv2().pretrained("ocr_small_printed", "en", "clinical/ocr") ocr.setInputCols(["image"]) ocr.setOutputCol("text") result = ocr.transform(image_text_lines_df).collect() print(result[0].text) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ocr_small_printed| |Type:|ocr| |Compatibility:|Visual NLP 3.3.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|146.7 MB| --- layout: model title: English asr_wav2vec2_base_timit_demo_colab9 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab9 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab9` is an English model originally trained by hassnain. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab9_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab9_en_4.2.0_3.0_1664019939623.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab9_en_4.2.0_3.0_1664019939623.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab9", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab9", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab9| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: Spanish RobertaForQuestionAnswering (from jamarju) author: John Snow Labs name: roberta_qa_roberta_base_bne_squad_2.0_es_jamarju date: 2022-06-21 tags: [es, open_source, question_answering, roberta] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bne-squad-2.0-es` is a Spanish model originally trained by `jamarju`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_bne_squad_2.0_es_jamarju_es_4.0.0_3.0_1655789380928.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_bne_squad_2.0_es_jamarju_es_4.0.0_3.0_1655789380928.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_bne_squad_2.0_es_jamarju","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_bne_squad_2.0_es_jamarju","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.squad.roberta.base.by_jamarju").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
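The nlu one-liner above packs the question and context into a single string joined by `|||`. A minimal plain-Python sketch of that convention, using the sample strings from the snippet (no Spark NLP required to run it):

```python
# Question and context travel as one string, separated by "|||",
# matching the nlu.load(...).predict(...) call above.
payload = "What's my name?|||My name is Clara and I live in Berkeley."

# Split once on the separator to recover the two parts
question, context = payload.split("|||", 1)
print(question)  # What's my name?
print(context)   # My name is Clara and I live in Berkeley.
```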
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_bne_squad_2.0_es_jamarju| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|es| |Size:|456.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/jamarju/roberta-base-bne-squad-2.0-es - https://github.com/PlanTL-SANIDAD/lm-spanish - https://github.com/ccasimiro88/TranslateAlignRetrieve --- layout: model title: English DistilBertForQuestionAnswering model (from Hoang) author: John Snow Labs name: distilbert_qa_Hoang_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Hoang`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Hoang_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724211596.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Hoang_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724211596.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Hoang_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Hoang_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Hoang").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_Hoang_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Hoang/distilbert-base-uncased-finetuned-squad --- layout: model title: ICD10GM ChunkResolver author: John Snow Labs name: chunkresolve_ICD10GM class: ChunkEntityResolverModel language: de repository: clinical/models date: 2020-09-06 task: Entity Resolution edition: Healthcare NLP 2.5.5 spark_version: 2.4 tags: [clinical,licensed,entity_resolution,de] deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Entity Resolution model based on KNN using Word Embeddings + Word Mover's Distance. ## Predicted Entities Codes and their normalized definition with `clinical_embeddings`. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/14.German_Healthcare_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_ICD10GM_de_2.5.5_2.4_1599431635423.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_ICD10GM_de_2.5.5_2.4_1599431635423.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python ... icd10_resolution = ChunkEntityResolverModel.pretrained("chunkresolve_ICD10GM",'de','clinical/models') \ .setInputCols(["token", "chunk_embeddings"]) \ .setOutputCol("icd10_de_code")\ .setDistanceFunction("EUCLIDEAN") \ .setNeighbours(5) pipeline_icd10 = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, de_embeddings, de_ner, ner_converter, chunk_embeddings, icd10_resolution]) empty_data = spark.createDataFrame([['''Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist Hernia femoralis, Akne, einseitig, ein hochmalignes bronchogenes Karzinom, das überwiegend im Zentrum der Lunge, in einem Hauptbronchus entsteht. Die mittlere Prävalenz wird auf 1/20.000 geschätzt. Vom SCLC sind hauptsächlich Peronen mittleren Alters (27-66 Jahre) mit Raucheranamnese betroffen. Etwa 70% der Patienten mit SCLC haben bei Stellung der Diagnose schon extra-thorakale Symptome. Zu den Symptomen gehören Thoraxschmerz, Dyspnoe, Husten und pfeifende Atmung. Die Beteiligung benachbarter Bereiche verursacht Heiserkeit, Dysphagie und Oberes Vena-cava-Syndrom (Obstruktion des Blutflusses durch die Vena cava superior). Zusätzliche Symptome als Folge einer Fernmetastasierung sind ebenfalls möglich. Rauchen und Strahlenexposition sind synergistisch wirkende Risikofaktoren. Die industrielle Exposition mit Bis (Chlormethyläther) ist ein weiterer Risikofaktor. Röntgenaufnahmen des Thorax sind nicht ausreichend empfindlich, um einen SCLC frühzeitig zu erkennen. Röntgenologischen Auffälligkeiten muß weiter nachgegangen werden, meist mit Computertomographie. Die Diagnose wird bioptisch gesichert. Patienten mit SCLC erhalten meist Bestrahlung und/oder Chemotherapie. In Hinblick auf eine Verbesserung der Überlebenschancen der Patienten ist sowohl bei ausgedehnten und bei begrenzten SCLC eine kombinierte Chemotherapie wirksamer als die Behandlung mit Einzelsubstanzen. 
Es kann auch eine prophylaktische Bestrahlung des Schädels erwogen werden, da innerhalb von 2-3 Jahren nach Behandlungsbeginn ein hohes Risiko für zentralnervöse Metastasen besteht. Das Kleinzellige Bronchialkarzinom ist der aggressivste Lungentumor: Die 5-Jahres-Überlebensrate beträgt 1-5%, der Median des gesamten Überlebens liegt bei etwa 6 bis 10 Monaten.''']]).toDF("text") model = pipeline_icd10.fit(empty_data) results = model.transform(empty_data) ``` ```scala ... val icd10_resolution = ChunkEntityResolverModel.pretrained("chunkresolve_ICD10GM","de","clinical/models") .setInputCols("token", "chunk_embeddings") .setOutputCol("icd10_de_code") .setDistanceFunction("EUCLIDEAN") .setNeighbours(5) val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, de_embeddings, de_ner, ner_converter, chunk_embeddings, icd10_resolution)) val data = Seq("Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist Hernia femoralis, Akne, einseitig, ein hochmalignes bronchogenes Karzinom, das überwiegend im Zentrum der Lunge, in einem Hauptbronchus entsteht. Die mittlere Prävalenz wird auf 1/20.000 geschätzt. Vom SCLC sind hauptsächlich Peronen mittleren Alters (27-66 Jahre) mit Raucheranamnese betroffen. Etwa 70% der Patienten mit SCLC haben bei Stellung der Diagnose schon extra-thorakale Symptome. Zu den Symptomen gehören Thoraxschmerz, Dyspnoe, Husten und pfeifende Atmung. Die Beteiligung benachbarter Bereiche verursacht Heiserkeit, Dysphagie und Oberes Vena-cava-Syndrom (Obstruktion des Blutflusses durch die Vena cava superior). Zusätzliche Symptome als Folge einer Fernmetastasierung sind ebenfalls möglich. Rauchen und Strahlenexposition sind synergistisch wirkende Risikofaktoren. Die industrielle Exposition mit Bis (Chlormethyläther) ist ein weiterer Risikofaktor. Röntgenaufnahmen des Thorax sind nicht ausreichend empfindlich, um einen SCLC frühzeitig zu erkennen.
Röntgenologischen Auffälligkeiten muß weiter nachgegangen werden, meist mit Computertomographie. Die Diagnose wird bioptisch gesichert. Patienten mit SCLC erhalten meist Bestrahlung und/oder Chemotherapie. In Hinblick auf eine Verbesserung der Überlebenschancen der Patienten ist sowohl bei ausgedehnten und bei begrenzten SCLC eine kombinierte Chemotherapie wirksamer als die Behandlung mit Einzelsubstanzen. Es kann auch eine prophylaktische Bestrahlung des Schädels erwogen werden, da innerhalb von 2-3 Jahren nach Behandlungsbeginn ein hohes Risiko für zentralnervöse Metastasen besteht. Das Kleinzellige Bronchialkarzinom ist der aggressivste Lungentumor: Die 5-Jahres-Überlebensrate beträgt 1-5%, der Median des gesamten Überlebens liegt bei etwa 6 bis 10 Monaten.").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.h2_title} ## Results ```bash | Problem | ICD10-Code | |--------------------|------------| | Kleinzellige | M01.00 | | Bronchialkarzinom | I50.0 | | Kleinzelliger | I37.0 | | Lungenkrebs | B90.9 | | SCLC | C83.0 | | ... | ... | | Kleinzellige | M01.00 | | Bronchialkarzinom | I50.0 | | Lungentumor | C90.31 | | 1-5% | I37.0 | | 6 bis 10 Monaten | Q91.6 | ``` {:.model-param} ## Model Information {:.table-model} |----------------|--------------------------| | Name: | chunkresolve_ICD10GM | | Type: | ChunkEntityResolverModel | | Compatibility: | Spark NLP 2.5.5+ | | License: | Licensed | | Edition: | Official | | Input labels: | [token, chunk_embeddings] | | Output labels: | [entity] | | Language: | de | | Case sensitive: | True | | Dependencies: | w2v_cc_300d | {:.h2_title} ## Data Source FILLUP --- layout: model title: Onto Recognize Entities Lg author: John Snow Labs name: onto_recognize_entities_lg date: 2022-06-28 tags: [en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The onto_recognize_entities_lg is a pretrained pipeline that can be used to process text, performing basic processing steps and recognizing entities. It handles most of the common text processing tasks on your dataframe. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_lg_en_4.0.0_3.0_1656389642706.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_lg_en_4.0.0_3.0_1656389642706.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("onto_recognize_entities_lg", "en") result = pipeline.annotate("""I love johnsnowlabs! """) ``` {:.nlu-block} ```python import nlu nlu.load("en.ner.onto.lg").predict("""I love johnsnowlabs! """) ```
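`PretrainedPipeline.annotate` returns a plain dictionary mapping each output column name to a list of strings. A minimal sketch of post-processing such a result — the dictionary below is a mocked stand-in for illustration (the keys follow typical NER-pipeline outputs, but the exact values are assumed, not real model output):

```python
# Mocked stand-in for pipeline.annotate("I love johnsnowlabs!") output;
# the real keys and values depend on the pipeline's stages.
annotations = {
    "token": ["I", "love", "johnsnowlabs", "!"],
    "ner": ["O", "O", "B-ORG", "O"],
}

# Keep only the tokens tagged as part of an entity (anything but "O")
entities = [
    tok
    for tok, tag in zip(annotations["token"], annotations["ner"])
    if tag != "O"
]
print(entities)  # ['johnsnowlabs']
```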
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_recognize_entities_lg| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|2.5 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - NerDLModel - NerConverter --- layout: model title: Translate Hausa to English Pipeline author: John Snow Labs name: translate_ha_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ha, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `ha` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ha_en_xx_2.7.0_2.4_1609686549663.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ha_en_xx_2.7.0_2.4_1609686549663.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ha_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ha_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ha.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ha_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Detect Cancer Genetics (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_bionlp_pipeline date: 2023-03-20 tags: [bertfortokenclassification, ner, bionlp, en, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_bionlp](https://nlp.johnsnowlabs.com/2022/01/03/bert_token_classifier_ner_bionlp_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bionlp_pipeline_en_4.3.0_3.2_1679308593451.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bionlp_pipeline_en_4.3.0_3.2_1679308593451.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_bionlp_pipeline", "en", "clinical/models") text = '''Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_bionlp_pipeline", "en", "clinical/models") val text = "Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.biolp.pipeline").predict("""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:--------------------|--------:|------:|:-----------------------|-------------:| | 0 | erbA IRES | 9 | 17 | Organism | 0.999188 | | 1 | erbA/myb virus | 27 | 40 | Organism | 0.999434 | | 2 | erythroid cells | 65 | 79 | Cell | 0.999837 | | 3 | bone | 100 | 103 | Multi-tissue_structure | 0.999846 | | 4 | marrow | 105 | 110 | Multi-tissue_structure | 0.999876 | | 5 | blastoderm cultures | 115 | 133 | Cell | 0.999823 | | 6 | erbA/myb IRES virus | 140 | 158 | Organism | 0.999751 | | 7 | erbA IRES virus | 236 | 250 | Organism | 0.999749 | | 8 | blastoderm | 259 | 268 | Cell | 0.999897 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_bionlp_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|405.0 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Fast Neural Machine Translation Model from English to Niger-Kordofanian Languages author: John Snow Labs name: opus_mt_en_nic date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, nic, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `nic` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_nic_xx_2.7.0_2.4_1609167803723.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_nic_xx_2.7.0_2.4_1609167803723.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_nic", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_nic", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.nic').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_nic| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English T5ForConditionalGeneration Cased model (from google) author: John Snow Labs name: t5_efficient_xl_nl4 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-xl-nl4` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_xl_nl4_en_4.3.0_3.0_1675124613893.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_xl_nl4_en_4.3.0_3.0_1675124613893.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_xl_nl4","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_xl_nl4","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_xl_nl4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|1.0 GB| ## References - https://huggingface.co/google/t5-efficient-xl-nl4 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Legal Quiet Enjoyment Clause Binary Classifier author: John Snow Labs name: legclf_quiet_enjoyment_clause date: 2023-01-29 tags: [en, legal, classification, quiet, enjoyment, clauses, quiet_enjoyment, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `quiet-enjoyment` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens.
If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `quiet-enjoyment`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_quiet_enjoyment_clause_en_1.0.0_3.0_1675005306234.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_quiet_enjoyment_clause_en_1.0.0_3.0_1675005306234.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_quiet_enjoyment_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
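As the description above suggests, long documents can be split into paragraphs before classification so each piece fits within the embedding limit. A minimal, hedged sketch of paragraph splitting in plain Python (independent of the Spark NLP API; the whitespace-token cap is only a rough stand-in for the model's 512-token subword limit):

```python
# Hedged sketch: split a long legal document on blank lines and truncate each
# paragraph to a token budget before feeding pieces to the classifier.
def split_paragraphs(text, max_tokens=512):
    """Split on blank lines; truncate each piece to max_tokens whitespace tokens."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    out = []
    for p in paragraphs:
        tokens = p.split()
        out.append(" ".join(tokens[:max_tokens]))
    return out

doc = "Clause 1. Quiet enjoyment of the premises.\n\nClause 2. Governing law."
print(split_paragraphs(doc))
```

Each returned piece can then be placed in its own row of the `text` column before running the pipeline shown above.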
## Results ```bash +-----------------+ |result| +-----------------+ |[quiet-enjoyment]| |[other]| |[other]| |[quiet-enjoyment]| +-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_quiet_enjoyment_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.95 1.00 0.97 39 quiet-enjoyment 1.00 0.94 0.97 33 accuracy - - 0.97 72 macro-avg 0.98 0.97 0.97 72 weighted-avg 0.97 0.97 0.97 72 ``` --- layout: model title: Sentence Entity Resolver for ATC (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_atc date: 2022-03-01 tags: [atc, licensed, en, clinical, entity_resolution] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps drug entities to ATC (Anatomic Therapeutic Chemical) codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. ## Predicted Entities `ATC Codes` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_atc_en_3.4.1_3.0_1646126349436.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_atc_en_3.4.1_3.0_1646126349436.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("word_embeddings") posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") \ .setInputCols(["sentence", "token", "word_embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(["DRUG"]) c2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sentence_embeddings")\ .setCaseSensitive(False) atc_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_atc", "en", "clinical/models")\ .setInputCols(["sentence_embeddings"]) \ .setOutputCol("atc_code")\ .setDistanceFunction("EUCLIDEAN") resolver_pipeline = Pipeline( stages = [ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, posology_ner, ner_converter, c2doc, sbert_embedder, atc_resolver ]) sampleText = ["""He was seen by the endocrinology service and she was discharged on eltrombopag at night, amlodipine with meals metformin two times a day.""", """She was immediately given hydrogen peroxide 30 mg and amoxicillin twice daily for 10 days to treat the infection on her leg. 
She has a history of taking magnesium hydroxide.""", """She was given antidepressant for a month"""] data = spark.createDataFrame(sampleText, StringType()).toDF("text") results = resolver_pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("word_embeddings") val posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") .setInputCols(Array("sentence", "token", "word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("DRUG")) val c2doc = new Chunk2Doc() .setInputCols(Array("ner_chunk")) .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sentence_embeddings") .setCaseSensitive(false) val atc_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_atc", "en", "clinical/models") .setInputCols(Array("sentence_embeddings")) .setOutputCol("atc_code") .setDistanceFunction("EUCLIDEAN") val resolver_pipeline = new Pipeline().setStages(Array(document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, posology_ner, ner_converter, c2doc, sbert_embedder, atc_resolver)) val data = Seq("He was seen by the endocrinology service and she was discharged on eltrombopag at night, amlodipine with meals metformin two times a day and then ibuprofen.
She was immediately given hydrogen peroxide 30 mg and amoxicillin twice daily for 10 days to treat the infection on her leg. She has a history of taking magnesium hydroxide. She was given antidepressant for a month").toDF("text") val results = resolver_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.atc").predict("""She was immediately given hydrogen peroxide 30 mg and amoxicillin twice daily for 10 days to treat the infection on her leg. She has a history of taking magnesium hydroxide.""") ```
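The resolver packs its top-k candidates into single strings joined by `:::` (the `all_k_codes` and `resolutions` columns). A small hedged helper for unpacking those strings into (code, name) pairs — plain Python, not part of the Spark NLP API:

```python
# Hedged sketch: unpack the ":::"-joined top-k candidate strings that the
# resolver emits in columns such as all_k_codes / resolutions.
def unpack_topk(codes, resolutions, sep=":::"):
    """Pair each candidate ATC code with its resolved drug name."""
    code_list = [c.strip() for c in codes.split(sep)]
    name_list = [n.strip() for n in resolutions.split(sep)]
    return list(zip(code_list, name_list))

pairs = unpack_topk("A10BA02:::A10BA01", "metformin :::phenformin ")
print(pairs)  # [('A10BA02', 'metformin'), ('A10BA01', 'phenformin')]
```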
## Results ```bash +-------------------+--------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+ | chunk|atc_code| all_k_codes| resolutions| all_k_aux_labels| +-------------------+--------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+ | eltrombopag| B02BX05|B02BX05:::A07DA06:::B06AC03:::M01AB08:::L04AA39...|eltrombopag :::eluxadoline :::ecallantide :::et...|ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th...| | amlodipine| C08CA01|C08CA01:::C08CA17:::C08CA13:::C08CA06:::C08CA10...|amlodipine :::levamlodipine :::lercanidipine ::...|ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th...| | metformin| A10BA02|A10BA02:::A10BA01:::A10BB01:::A10BH04:::A10BB07...|metformin :::phenformin :::glyburide / metformi...|ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th...| | hydrogen peroxide| A01AB02|A01AB02:::S02AA06:::D10AE:::D11AX25:::D10AE01::...|hydrogen peroxide :::hydrogen peroxide; otic:::...|ATC 5th:::ATC 5th:::ATC 4th:::ATC 5th:::ATC 5th...| | amoxicillin| J01CA04|J01CA04:::J01CA01:::J01CF02:::J01CF01:::J01CA51...|amoxicillin :::ampicillin :::cloxacillin :::dic...|ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th...| |magnesium hydroxide| A02AA04|A02AA04:::A12CC02:::D10AX30:::B05XA11:::A02AA02...|magnesium hydroxide :::magnesium sulfate :::alu...|ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th:::ATC 5th...| | antidepressant| N06A|N06A:::N05A:::N06AX:::N05AH02:::N06D:::N06CA:::...|ANTIDEPRESSANTS:::ANTIPSYCHOTICS:::Other antide...|ATC 3rd:::ATC 3rd:::ATC 4th:::ATC 5th:::ATC 3rd...| +-------------------+--------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_atc| 
|Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[atc_code]| |Language:|en| |Size:|71.6 MB| |Case sensitive:|false| ## References Trained on ATC 2022 Codes dataset --- layout: model title: Catalan Lemmatizer author: John Snow Labs name: lemma date: 2020-07-29 23:34:00 +0800 task: Lemmatization language: ca edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [lemmatizer, ca] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb#scrollTo=bbzEH9u7tdxR){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_ca_2.5.5_2.4_1596054394549.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_ca_2.5.5_2.4_1596054394549.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "ca") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("A part de ser el rei del nord, John Snow és un metge anglès i líder en el desenvolupament de l'anestèsia i la higiene mèdica.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "ca") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("A part de ser el rei del nord, John Snow és un metge anglès i líder en el desenvolupament de l'anestèsia i la higiene mèdica.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""A part de ser el rei del nord, John Snow és un metge anglès i líder en el desenvolupament de l'anestèsia i la higiene mèdica."""] lemma_df = nlu.load('ca.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
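The description above says the lemmatizer maps all inflected forms of a word to a single root. A toy illustration of that form-to-lemma mapping in plain Python (the tiny Catalan lemma table is invented for the example and is not the model's actual dictionary, which is learned from Universal Dependencies data):

```python
# Hedged illustration of form-to-lemma lookup; the lemma table below is
# invented for the example, NOT the model's real dictionary.
LEMMAS = {"és": "ser", "era": "ser", "metges": "metge"}

def lemmatize(tokens, table=LEMMAS):
    # Fall back to the lowercased surface form when no lemma is known.
    return [table.get(t.lower(), t.lower()) for t in tokens]

print(lemmatize(["És", "metges"]))  # ['ser', 'metge']
```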
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=0, result='a', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=2, end=5, result='part', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=7, end=8, result='de', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=10, end=12, result='ser', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=14, end=15, result='ell', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|ca| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: English BertForQuestionAnswering model (from Nakul24) author: John Snow Labs name: bert_qa_Spanbert_emotion_extraction date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Spanbert-emotion-extraction` is an English model originally trained by `Nakul24`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Spanbert_emotion_extraction_en_4.0.0_3.0_1654179065087.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Spanbert_emotion_extraction_en_4.0.0_3.0_1654179065087.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Spanbert_emotion_extraction","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_Spanbert_emotion_extraction","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.span_bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_Spanbert_emotion_extraction| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|384.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Nakul24/Spanbert-emotion-extraction --- layout: model title: Pipeline to Detect PHI in medical text (biobert) author: John Snow Labs name: ner_deid_biobert_pipeline date: 2023-03-20 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deid_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_deid_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_biobert_pipeline_en_4.3.0_3.2_1679310594035.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_biobert_pipeline_en_4.3.0_3.2_1679310594035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_biobert_pipeline", "en", "clinical/models") text = '''A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_biobert_pipeline", "en", "clinical/models") val text = "A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.deid.ner_biobert.pipeline").predict("""A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:------------------------------|--------:|------:|:------------|-------------:| | 0 | 2093-01-13 | 17 | 26 | DATE | 0.981 | | 1 | David Hale | 29 | 38 | NAME | 0.77585 | | 2 | Hendrickson | 53 | 63 | NAME | 0.9666 | | 3 | Ora | 66 | 68 | LOCATION | 0.8723 | | 4 | Oliveira | 91 | 98 | LOCATION | 0.7785 | | 5 | Cocke County Baptist Hospital | 114 | 142 | LOCATION | 0.792 | | 6 | Keats Street | 150 | 161 | LOCATION | 0.77305 | | 7 | Phone | 164 | 168 | LOCATION | 0.7083 | | 8 | Brothers | 253 | 260 | LOCATION | 0.9447 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.2 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: English BertForQuestionAnswering Cased model (from SebastianS) author: John Snow Labs name: bert_qa_sebastians_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `SebastianS`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_sebastians_finetuned_squad_en_4.0.0_3.0_1657186249406.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_sebastians_finetuned_squad_en_4.0.0_3.0_1657186249406.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sebastians_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sebastians_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
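Under the hood, extractive QA models like this one score each context token as a possible answer start and end, then pick the best valid span. A schematic, hedged sketch of that span selection over invented toy scores (not the model's actual inference code):

```python
# Schematic span selection for extractive QA: choose the (start, end) pair
# with the highest combined score, requiring end >= start. Toy scores only.
def best_span(start_scores, end_scores, max_len=15):
    best, span = float("-inf"), (0, 0)
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best:
                best, span = s + end_scores[j], (i, j)
    return span

tokens = ["My", "name", "is", "Clara"]
start = [0.1, 0.2, 0.1, 2.0]
end = [0.0, 0.1, 0.2, 1.9]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```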
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_sebastians_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/SebastianS/bert-finetuned-squad --- layout: model title: Translate English to Pijin Pipeline author: John Snow Labs name: translate_en_pis date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, pis, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `pis` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_pis_xx_2.7.0_2.4_1609698832184.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_pis_xx_2.7.0_2.4_1609698832184.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_pis", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_pis", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.pis').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_pis| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from English to Tagalog author: John Snow Labs name: opus_mt_en_tl date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, tl, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `tl` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_tl_xx_2.7.0_2.4_1609169442130.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_tl_xx_2.7.0_2.4_1609169442130.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_tl", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_tl", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.tl').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_tl| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Word2Vec Embeddings in Western Frisian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, fy, open_source] task: Embeddings language: fy edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_fy_3.4.1_3.0_1647467525855.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_fy_3.4.1_3.0_1647467525855.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","fy") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ik hâld fan spark nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","fy") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ik hâld fan spark nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fy.embed.w2v_cc_300d").predict("""Ik hâld fan spark nlp""") ```
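The annotator above maps each token to a 300-dimensional vector; a common downstream use of such vectors is comparing tokens by cosine similarity. A hedged sketch in plain Python with invented 3-d toy vectors standing in for the real 300-d embeddings:

```python
# Hedged sketch: cosine similarity between token vectors such as those in the
# "embeddings" column (toy 3-d vectors here instead of the real 300-d ones).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

print(round(cosine([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]), 3))  # 1.0
```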
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|fy| |Size:|306.1 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Legal Terms Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_terms_agreement_bert date: 2022-11-25 tags: [en, legal, classification, agreement, terms, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_terms_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `terms-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `terms-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_terms_agreement_bert_en_1.0.0_3.0_1669372149301.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_terms_agreement_bert_en_1.0.0_3.0_1669372149301.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_terms_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[terms-agreement]| |[other]| |[other]| |[terms-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_terms_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.97 0.95 0.96 65 terms-agreement 0.91 0.94 0.93 34 accuracy - - 0.95 99 macro-avg 0.94 0.95 0.94 99 weighted-avg 0.95 0.95 0.95 99 ``` --- layout: model title: Language Detection & Identification Pipeline - 21 Languages (BiGRU) author: John Snow Labs name: detect_language_bigru_21 date: 2020-12-05 task: [Pipeline Public, Language Detection, Sentence Detection] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [language_detection, open_source, pipeline, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Language detection and identification is the task of automatically detecting the language(s) present in a document based on the content of the document. ``LanguageDetectorDL`` is an annotator that detects the language of documents or sentences depending on the ``inputCols``. In addition, ``LanguageDetectorDL`` can accurately detect language from documents with mixed languages by coalescing sentences and selecting the best candidate. We have designed and developed Deep Learning models using BiGRU architectures (mentioned in the model's name) in TensorFlow/Keras. The models are trained on large datasets such as Wikipedia and Tatoeba, and evaluated with high accuracy on the Europarl dataset. 
The output is a language code in Wiki Code style: [https://en.wikipedia.org/wiki/List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias). This pipeline can detect the following languages: ## Predicted Entities `Bulgarian`, `Czech`, `Danish`, `German`, `Greek`, `English`, `Estonian`, `Finnish`, `French`, `Hungarian`, `Italian`, `Lithuanian`, `Latvian`, `Dutch`, `Polish`, `Portuguese`, `Romanian`, `Slovak`, `Slovenian`, `Spanish`, `Swedish`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/LANGUAGE_DETECTOR/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/detect_language_bigru_21_xx_2.7.0_2.4_1607186103596.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/detect_language_bigru_21_xx_2.7.0_2.4_1607186103596.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("detect_language_bigru_21", lang = "xx") pipeline.annotate("French author who helped pioneer the science-fiction genre.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("detect_language_bigru_21", lang = "xx") pipeline.annotate("French author who helped pioneer the science-fiction genre.") ``` {:.nlu-block} ```python import nlu text = ["French author who helped pioneer the science-fiction genre."] lang_df = nlu.load("xx.classify.lang.bigru").predict(text) lang_df ```
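The pipeline returns Wiki-style language codes (see the List of Wikipedias linked above). A small lookup table — assembled here by hand from that list, not emitted by the pipeline — maps the codes back to the 21 supported language names:

```python
# Wiki-style code -> language name, for the 21 languages this pipeline detects
WIKI_CODES = {
    "bg": "Bulgarian", "cs": "Czech", "da": "Danish", "de": "German",
    "el": "Greek", "en": "English", "et": "Estonian", "fi": "Finnish",
    "fr": "French", "hu": "Hungarian", "it": "Italian", "lt": "Lithuanian",
    "lv": "Latvian", "nl": "Dutch", "pl": "Polish", "pt": "Portuguese",
    "ro": "Romanian", "sk": "Slovak", "sl": "Slovenian", "es": "Spanish",
    "sv": "Swedish",
}

# e.g. the sample sentence above is detected as "en"
print(WIKI_CODES["en"])
```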
## Results ```bash {'document': ['French author who helped pioneer the science-fiction genre.'], 'language': ['en']} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|detect_language_bigru_21| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - LanguageDetectorDL --- layout: model title: Meena's Tapas Table Understanding (Base) author: John Snow Labs name: table_qa_table_question_answering_tapas date: 2022-09-30 tags: [en, table, qa, question, answering, open_source] task: Table Question Answering language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: TapasForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Zero-shot Table Understanding model that lets you carry out Question Answering on Spark DataFrames. If your table is stored in a file format such as CSV, load it with Spark first. Size of this model: Base. Has aggregation operations?: True. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_table_question_answering_tapas_en_4.2.0_3.0_1664530457710.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_table_question_answering_tapas_en_4.2.0_3.0_1664530457710.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python json_data = """ { "header": ["name", "money", "age"], "rows": [ ["Donald Trump", "$100,000,000", "75"], ["Elon Musk", "$20,000,000,000,000", "55"] ] } """ queries = [ "Who earns less than 200,000,000?", "Who earns 100,000,000?", "How much money has Donald Trump?", "How old are they?", ] data = spark.createDataFrame([ [json_data, " ".join(queries)] ]).toDF("table_json", "questions") document_assembler = MultiDocumentAssembler() \ .setInputCols("table_json", "questions") \ .setOutputCols("document_table", "document_questions") sentence_detector = SentenceDetector() \ .setInputCols(["document_questions"]) \ .setOutputCol("questions") table_assembler = TableAssembler()\ .setInputCols(["document_table"])\ .setOutputCol("table") tapas = TapasForQuestionAnswering\ .pretrained("table_qa_table_question_answering_tapas","en")\ .setInputCols(["questions", "table"])\ .setOutputCol("answers") pipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, table_assembler, tapas ]) model = pipeline.fit(data) model\ .transform(data)\ .selectExpr("explode(answers) AS answer")\ .select("answer")\ .show(truncate=False) ```
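The description notes that tables stored as files (e.g. CSV) should be loaded first. One illustrative way — using only the Python standard library, outside Spark — to turn CSV text into the `{"header": ..., "rows": ...}` JSON structure that `TableAssembler` consumes above:

```python
import csv
import io
import json

def csv_to_table_json(csv_text):
    # First row becomes the header; remaining rows become the table body
    rows = list(csv.reader(io.StringIO(csv_text)))
    return json.dumps({"header": rows[0], "rows": rows[1:]})

csv_text = 'name,money,age\nDonald Trump,"$100,000,000",75\nElon Musk,"$20,000,000,000,000",55\n'
json_data = csv_to_table_json(csv_text)
```

The resulting `json_data` string can then be placed in the `table_json` column exactly as in the example above.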
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |answer | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} | |{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|table_qa_table_question_answering_tapas| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|413.9 MB| |Case sensitive:|false| ## References https://huggingface.co/models?pipeline_tag=table-question-answering --- layout: model title: Legal Scope Clause Binary Classifier author: John Snow Labs name: legclf_scope_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `scope` clause type. 
To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at the sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens; if your text is longer, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, producing a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `scope` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_scope_clause_en_1.0.0_3.2_1660123969333.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_scope_clause_en_1.0.0_3.2_1660123969333.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_scope_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
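The paragraph splitting (by multiline) recommended above can be sketched with a plain-Python helper applied before the text reaches the Spark DataFrame — this is an illustrative pre-processing step, not part of the Spark NLP pipeline itself:

```python
import re

def split_paragraphs(text):
    # Split on blank lines (two or more consecutive newlines) and drop empty fragments
    return [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]

doc = "1. Scope of Work. The Contractor shall...\n\n2. Payment. Fees are due...\n\n3. Term."
paragraphs = split_paragraphs(doc)
```

Each resulting paragraph would then become one row of the `clause_text` column used above.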
## Results ```bash +-------+ | result| +-------+ |[scope]| |[other]| |[other]| |[scope]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_scope_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.94 0.90 0.92 88 scope 0.73 0.83 0.77 29 accuracy - - 0.88 117 macro-avg 0.83 0.86 0.85 117 weighted-avg 0.89 0.88 0.88 117 ``` --- layout: model title: Legal Signers Clause Binary Classifier (CUAD dataset) author: John Snow Labs name: legclf_cuad_signers_clause date: 2022-11-17 tags: [signers, en, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the signers part of a document. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at the sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. 
If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, producing a series of True/False values for each of the legal clause models you have added. There are other models with a similar title; the difference is the dataset each was trained on. This one was trained on the `cuad` dataset. ## Predicted Entities `signers`, `other` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cuad_signers_clause_en_1.0.0_3.0_1668693373474.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cuad_signers_clause_en_1.0.0_3.0_1668693373474.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_cuad_signers_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
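As mentioned above, several binary clause classifiers can be stacked in one pipeline, each contributing its own True/False signal. A hypothetical post-processing step (the model names and labels below are illustrative placeholders, not real pipeline output) that turns their predicted labels into per-clause boolean flags:

```python
def clause_flags(predictions):
    # A classifier's label is "other" when the clause is absent; anything else means present
    return {model: label != "other" for model, label in predictions.items()}

# Hypothetical per-document outputs from two stacked binary classifiers
predictions = {
    "legclf_cuad_signers_clause": "signers",
    "legclf_scope_clause": "other",
}
flags = clause_flags(predictions)
```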
## Results ```bash +---------+ | result| +---------+ |[signers]| |[other]| +---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_cuad_signers_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.9 MB| ## References CUAD dataset ## Benchmarking ```bash label precision recall f1-score support other 1.00 1.00 1.00 73 signers 1.00 1.00 1.00 35 accuracy - - 1.00 108 macro-avg 1.00 1.00 1.00 108 weighted-avg 1.00 1.00 1.00 108 ``` --- layout: model title: English RobertaForQuestionAnswering (from SauravMaheshkar) author: John Snow Labs name: roberta_qa_roberta_base_chaii date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-chaii` is an English model originally trained by `SauravMaheshkar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_chaii_en_4.0.0_3.0_1655730347590.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_chaii_en_4.0.0_3.0_1655730347590.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_chaii","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_chaii","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.chaii.roberta.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_chaii| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/SauravMaheshkar/roberta-base-chaii --- layout: model title: Fast Neural Machine Translation Model from Central Bikol to German author: John Snow Labs name: opus_mt_bcl_de date: 2021-06-01 tags: [open_source, seq2seq, translation, bcl, de, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). source languages: bcl target languages: de {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_de_xx_3.1.0_2.4_1622550430850.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_de_xx_3.1.0_2.4_1622550430850.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_bcl_de", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_bcl_de", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Central Bikol.translate_to.German').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_bcl_de| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Word2Vec Embeddings in Cebuano (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, ceb, open_source] task: Embeddings language: ceb edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ceb_3.4.1_3.0_1647290267903.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ceb_3.4.1_3.0_1647290267903.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ceb") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ganahan ko spark nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ceb") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ganahan ko spark nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ceb.embed.w2v_cc_300d").predict("""Ganahan ko spark nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|ceb| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English image_classifier_vit__beans ViTForImageClassification from johnnydevriese author: John Snow Labs name: image_classifier_vit__beans date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit__beans` is an English model originally trained by johnnydevriese. ## Predicted Entities `angular_leaf_spot`, `bean_rust`, `healthy` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit__beans_en_4.1.0_3.0_1660169646080.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit__beans_en_4.1.0_3.0_1660169646080.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit__beans", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit__beans", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
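The `class` column holds the winning label among the three predicted entities. As an illustrative (non-Spark) sketch of that decision, an argmax over per-class scores — the score values below are made up, not real model output:

```python
# The three predicted entities listed above
LABELS = ["angular_leaf_spot", "bean_rust", "healthy"]

def top_label(scores):
    # Pick the label whose score is highest
    return LABELS[max(range(len(LABELS)), key=lambda i: scores[i])]

print(top_label([0.05, 0.15, 0.80]))
```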
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit__beans| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_hier_triplet_shuffled_epochs_1_shard_1_squad2.0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_hier_triplet_shuffled_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_hier_triplet_shuffled_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223641296.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_hier_triplet_shuffled_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223641296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_hier_triplet_shuffled_epochs_1_shard_1_squad2.0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_hier_triplet_shuffled_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_hier_triplet_shuffled_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|460.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_hier_triplet_shuffled_epochs_1_shard_1_squad2.0 --- layout: model title: English T5ForConditionalGeneration Cased model (from dbernsohn) author: John Snow Labs name: t5_wikisql_en2sql date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5_wikisql_en2SQL` is an English model originally trained by `dbernsohn`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_wikisql_en2sql_en_4.3.0_3.0_1675157192158.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_wikisql_en2sql_en_4.3.0_3.0_1675157192158.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_wikisql_en2sql","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_wikisql_en2sql","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_wikisql_en2sql| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|288.2 MB| ## References - https://huggingface.co/dbernsohn/t5_wikisql_en2SQL - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://github.com/DorBernsohn/CodeLM/tree/main/SQLM - https://www.linkedin.com/in/dor-bernsohn-70b2b1146/ --- layout: model title: Tamil XlmRoBertaForQuestionAnswering (from AswiN037) author: John Snow Labs name: xlm_roberta_qa_xlm_roberta_squad_tamil date: 2022-06-23 tags: [ta, open_source, question_answering, xlmroberta] task: Question Answering language: ta edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-squad-tamil` is a Tamil model originally trained by `AswiN037`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_squad_tamil_ta_4.0.0_3.0_1655996786525.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_squad_tamil_ta_4.0.0_3.0_1655996786525.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_squad_tamil","ta") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_squad_tamil","ta") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("ta.answer_question.squad.xlm_roberta").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_roberta_squad_tamil| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|ta| |Size:|1.9 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AswiN037/xlm-roberta-squad-tamil --- layout: model title: English DistilBertForTokenClassification Cased model (from m3hrdadfi) author: John Snow Labs name: distilbert_tok_classifier_typo_detector date: 2023-03-06 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `typo-detector-distilbert-en` is a English model originally trained by `m3hrdadfi`. ## Predicted Entities `TYPO` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_tok_classifier_typo_detector_en_4.3.1_3.0_1678134333311.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_tok_classifier_typo_detector_en_4.3.1_3.0_1678134333311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_tok_classifier_typo_detector","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_tok_classifier_typo_detector","en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_tok_classifier_typo_detector| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/m3hrdadfi/typo-detector-distilbert-en - https://github.com/neuspell/neuspell - https://github.com/m3hrdadfi/typo-detector/issues --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_roberta_FT_newsqa date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta_FT_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_FT_newsqa_en_4.0.0_3.0_1655738866363.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_FT_newsqa_en_4.0.0_3.0_1655738866363.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_FT_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_FT_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.roberta.qa_roberta_ft_newsqa.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_FT_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|458.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/roberta_FT_newsqa --- layout: model title: Hebrew BertForQuestionAnswering model (from tdklab) author: John Snow Labs name: bert_qa_hebert_finetuned_hebrew_squad date: 2022-06-02 tags: [he, open_source, question_answering, bert] task: Question Answering language: he edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `hebert-finetuned-hebrew-squad` is a Hebrew model originally trained by `tdklab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_hebert_finetuned_hebrew_squad_he_4.0.0_3.0_1654187940492.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_hebert_finetuned_hebrew_squad_he_4.0.0_3.0_1654187940492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_hebert_finetuned_hebrew_squad","he") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_hebert_finetuned_hebrew_squad","he") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("he.answer_question.squad.bert").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_hebert_finetuned_hebrew_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|he| |Size:|408.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/tdklab/hebert-finetuned-hebrew-squad --- layout: model title: English DistilBertForQuestionAnswering model (from FOFer) author: John Snow Labs name: distilbert_qa_FOFer_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `FOFer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_FOFer_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724124682.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_FOFer_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724124682.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_FOFer_base_uncased_finetuned_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_FOFer_base_uncased_finetuned_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_FOFer").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_FOFer_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/FOFer/distilbert-base-uncased-finetuned-squad --- layout: model title: Word Segmenter for Chinese author: John Snow Labs name: wordseg_ctb9 date: 2021-03-08 tags: [word_segmentation, open_source, chinese, wordseg_ctb9, zh] task: Word Segmentation language: zh edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description [WordSegmenterModel-WSM](https://en.wikipedia.org/wiki/Text_segmentation) is based on a maximum-entropy probability model that detects word boundaries in Chinese text. Chinese text is written without white space between words, so a computer-based application cannot know a priori which sequence of ideograms forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/chinese/word_segmentation/words_segmenter_demo.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_ctb9_zh_3.0.0_3.0_1615225768619.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_ctb9_zh_3.0.0_3.0_1615225768619.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
# assemble raw text into the annotation column the segmenter reads
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("sentence")

word_segmenter = WordSegmenterModel.pretrained("wordseg_ctb9", "zh") \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

pipeline = Pipeline(stages=[document_assembler, word_segmenter])

ws_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

example = spark.createDataFrame([['从John Snow Labs你好! ']], ["text"])

result = ws_model.transform(example)
```

```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("sentence")

val word_segmenter = WordSegmenterModel.pretrained("wordseg_ctb9", "zh")
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter))

val data = Seq("从John Snow Labs你好! ").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu

text = ["从John Snow Labs你好! "]
token_df = nlu.load('zh.segment_words.ctb9').predict(text)
token_df
```
## Results

```bash
0        从
1         J
2         o
3         h
4         n
5         S
6         n
7         o
8         w
9      Labs
10        你
11        好
12        !
Name: token, dtype: object
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wordseg_ctb9| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[words_segmented]| |Language:|zh| --- layout: model title: Summarize Clinical Notes in Layman Terms author: John Snow Labs name: summarizer_clinical_laymen date: 2023-05-29 tags: [licensed, en, clinical, summarization, tensorflow] task: Summarization language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true engine: tensorflow annotator: MedicalSummarizer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a modified version of a Flan-T5-based (LLM) summarization model, fine-tuned on a custom dataset by John Snow Labs to avoid clinical jargon in the summaries. It can generate summaries of up to 512 tokens given an input text (max 1024 tokens). ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_laymen_en_4.4.2_3.0_1685360017257.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_laymen_en_4.4.2_3.0_1685360017257.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") summarizer = MedicalSummarizer.pretrained("summarizer_clinical_laymen", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("summary")\ .setMaxNewTokens(512) pipeline = sparknlp.base.Pipeline(stages=[ document_assembler, summarizer ]) text ="""Olivia Smith was seen in my office for evaluation for elective surgical weight loss on October 6, 2008. Olivia Smith is a 34-year-old female with a BMI of 43. She is 5'6" tall and weighs 267 pounds. She is motivated to attempt surgical weight loss because she has been overweight for over 20 years and wants to have more energy and improve her self-image. She is not only affected physically, but also socially by her weight. When she loses weight she always regains it and she always gains back more weight than she has lost. At one time, she lost 100 pounds and gained the weight back within a year. She has tried numerous commercial weight loss programs including Weight Watcher's for four months in 1992 with 15-pound weight loss, RS for two months in 1990 with six-pound weight loss, Slim Fast for six weeks in 2004 with eight-pound weight loss, an exercise program for two months in 2007 with a five-pound weight loss, Atkin's Diet for three months in 2008 with a ten-pound weight loss, and Dexatrim for one month in 2005 with a five-pound weight loss. She has also tried numerous fat reduction or fad diets. She was on Redux for nine months with a 100-pound weight loss.\n\nPAST MEDICAL HISTORY: She has a history of hypertension and shortness of breath.\n\nPAST SURGICAL HISTORY: Pertinent for cholecystectomy.\n\nPSYCHOLOGICAL HISTORY: Negative.\n\nSOCIAL HISTORY: She is single. She drinks alcohol once a week. 
She does not smoke.\n\nFAMILY HISTORY: Pertinent for obesity and hypertension.\n\nMEDICATIONS: Include Topamax 100 mg twice daily, Zoloft 100 mg twice daily, Abilify 5 mg daily, Motrin 800 mg daily, and a multivitamin.\n\nALLERGIES: She has no known drug allergies.\n\nREVIEW OF SYSTEMS: Negative.\n\nPHYSICAL EXAM: This is a pleasant female in no acute distress. Alert and oriented x 3. HEENT: Normocephalic, atraumatic. Extraocular muscles intact, nonicteric sclerae. Chest is clear to auscultation bilaterally. Cardiovascular is normal sinus rhythm. Abdomen is obese, soft, nontender and nondistended. Extremities show no edema, clubbing or cyanosis.\n\nASSESSMENT/PLAN: This is a 34-year-old female with a BMI of 43 who is interested in surgical weight via the gastric bypass as opposed to Lap-Band. Olivia Smith will be asking for a letter of medical necessity from Dr. Andrew Johnson. She will also see my nutritionist and social worker and have an upper endoscopy. Once this is completed, we will submit her to her insurance company for approval. """ data = spark.createDataFrame([[text]]).toDF("text") result = pipeline.fit(data).transform(data) ```
## Results

```bash
['This is a clinical note about a 34-year-old woman who is interested in having weight loss surgery. She has been overweight for over 20 years and wants to have more energy and improve her self-image. She has tried many diets and weight loss programs, but has not been successful in keeping the weight off. She has a history of hypertension and shortness of breath, but is not allergic to any medications. She will have an upper endoscopy and will be contacted by a nutritionist and social worker. The plan is to have her weight loss surgery through the gastric bypass, rather than Lap-Band.']
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_clinical_laymen| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|920.5 MB| --- layout: model title: Danish asr_xls_r_300m_nst_cv9 TFWav2Vec2ForCTC from chcaa author: John Snow Labs name: asr_xls_r_300m_nst_cv9 date: 2022-09-25 tags: [wav2vec2, da, audio, open_source, asr] task: Automatic Speech Recognition language: da edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_300m_nst_cv9` is a Danish model originally trained by chcaa. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_xls_r_300m_nst_cv9_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xls_r_300m_nst_cv9_da_4.2.0_3.0_1664103508619.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xls_r_300m_nst_cv9_da_4.2.0_3.0_1664103508619.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_xls_r_300m_nst_cv9", "da")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_xls_r_300m_nst_cv9", "da") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
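The snippets above assume an `audioDf` with an `audio_content` column of floating-point samples already exists. As a minimal stdlib sketch (an assumption, not part of this model card), the floats might be prepared from a 16-bit PCM WAV file like this; Wav2Vec2 models typically expect 16 kHz mono audio, and the `spark.createDataFrame` wrapping shown in the trailing comment is likewise an assumption:

```python
import struct
import wave

def read_wav_as_floats(path_or_file):
    """Read a 16-bit PCM WAV file and normalize samples to [-1.0, 1.0]."""
    with wave.open(path_or_file, "rb") as wav:
        assert wav.getsampwidth() == 2, "expected 16-bit PCM"
        frames = wav.getnframes()
        channels = wav.getnchannels()
        raw = wav.readframes(frames)
    # little-endian signed 16-bit samples, interleaved if multi-channel
    samples = struct.unpack("<%dh" % (frames * channels), raw)
    return [s / 32768.0 for s in samples]

# The floats could then be wrapped into the DataFrame the pipeline expects, e.g.:
# audioDf = spark.createDataFrame([(read_wav_as_floats("sample.wav"),)], ["audio_content"])
```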
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_xls_r_300m_nst_cv9| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|da| |Size:|756.3 MB| --- layout: model title: Sentence Entity Resolver for LOINC (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_loinc_cased date: 2021-12-24 tags: [en, clinical, licensed, entity_resolution, loinc] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted clinical NER entities to LOINC codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It is trained with augmented cased (unlowered) concept names, since the sbiobert model is cased. ## Predicted Entities `LOINC` {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_loinc_cased_en_3.3.4_2.4_1640374998947.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_loinc_cased_en_3.3.4_2.4_1640374998947.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical','en', 'clinical/models')\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") rad_ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") rad_ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(['Test']) chunk2doc = Chunk2Doc() \ .setInputCols("ner_chunk") \ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings")\ .setCaseSensitive(True) resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_loinc_cased", "en", "clinical/models") \ .setInputCols(["sbert_embeddings"])\ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") pipeline = Pipeline(stages = [ documentAssembler, sentenceDetector, tokenizer, word_embeddings, rad_ner, rad_ner_converter, chunk2doc, sbert_embedder, resolver ]) data = spark.createDataFrame([["""The patient is a 22-year-old female with a history of obesity. She has a BMI of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126. 
Her hemoglobin is 8.2%."""]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained()
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val rad_ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val rad_ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    .setWhiteList(Array("Test"))

val chunk2doc = new Chunk2Doc()
    .setInputCols("ner_chunk")
    .setOutputCol("ner_chunk_doc")

val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
    .setInputCols(Array("ner_chunk_doc"))
    .setOutputCol("sbert_embeddings")
    .setCaseSensitive(true)

val resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_loinc_cased", "en", "clinical/models")
    .setInputCols(Array("sbert_embeddings"))
    .setOutputCol("resolution")
    .setDistanceFunction("EUCLIDEAN")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, rad_ner, rad_ner_converter, chunk2doc, sbert_embedder, resolver))

val data = Seq("The patient is a 22-year-old female with a history of obesity. She has a BMI of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126. Her hemoglobin is 8.2%.").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.loinc_cased").predict("""The patient is a 22-year-old female with a history of obesity. 
She has a BMI of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126. Her hemoglobin is 8.2%.""") ```
## Results ```bash +-------------------------------------+------+-----------+----------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ner_chunk|entity| resolution| all_codes| resolutions| +-------------------------------------+------+-----------+----------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | BMI| Test| LP35925-4|[LP35925-4, 59574-4, BDYCRC, 73964-9, 59574-4,... |[Body mass index (BMI), Body mass index, Body circumference, Body muscle mass, Body mass index (BMI) [Percentile], ... | | aspartate aminotransferase| Test| 14409-7|[14409-7, 1916-6, 16325-3, 16324-6, 43822-6, 308... |[Aspartate aminotransferase, Aspartate aminotransferase/Alanine aminotransferase, Alanine aminotransferase/Aspartate aminotransferase, Alanine aminotransferase, Aspartate aminotransferase [Prese... | | alanine aminotransferase| Test| 16324-6|[16324-6, 16325-3, 14409-7, 1916-6, 59245-1, 30... |[Alanine aminotransferase, Alanine aminotransferase/Aspartate aminotransferase, Aspartate aminotransferase, Aspartate aminotransferase/Alanine aminotransferase, Alanine glyoxylate aminotransfer,... | | hemoglobin| Test| 14775-1|[14775-1, 16931-8, 12710-0, 29220-1, 15082-1, 72... |[Hemoglobin, Hematocrit/Hemoglobin, Hemoglobin pattern, Haptoglobin, Methemoglobin, Oxyhemoglobin, Hemoglobin test status, Verdohemoglobin, Hemoglobin A, Hemoglobin distribution width, Myoglobin,... 
| +-------------------------------------+------+-----------+----------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_loinc_cased| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[loinc_code]| |Language:|en| |Size:|648.5 MB| |Case sensitive:|true| --- layout: model title: Financial Financial statements Item Binary Classifier author: John Snow Labs name: finclf_financial_statements_item date: 2022-08-10 tags: [en, finance, classification, 10k, annual, reports, sec, filings, licensed] task: Text Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `financial_statements` item type of 10K Annual Reports. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline would make the model see only individual sentences rather than the whole text, so it is better to skip it unless you want to perform Binary Classification at sentence level. If you have big financial documents and want to look for clauses, we recommend splitting them using any of the techniques covered in our Finance NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). ## Predicted Entities `other`, `financial_statements` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_financial_statements_item_en_1.0.0_3.2_1660154427604.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_financial_statements_item_en_1.0.0_3.2_1660154427604.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") useEmbeddings = nlp.UniversalSentenceEncoder.pretrained() \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("finclf_financial_statements_item", "en", "finance/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, useEmbeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
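The description above recommends splitting long filings into pieces that fit the 512-token embedding window before classification. A minimal, library-free sketch of the paragraph (multiline) splitting strategy might look like this; the helper name is illustrative and not part of Spark NLP:

```python
import re

def split_paragraphs(text: str, max_tokens: int = 512) -> list:
    """Split a document on blank lines, then chop any paragraph whose
    whitespace-token count exceeds max_tokens into smaller windows."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    pieces = []
    for p in paragraphs:
        tokens = p.split()
        if len(tokens) <= max_tokens:
            pieces.append(p)
        else:
            # fall back to fixed-size token windows for oversized paragraphs
            for start in range(0, len(tokens), max_tokens):
                pieces.append(" ".join(tokens[start:start + max_tokens]))
    return pieces

doc = "ITEM 8. FINANCIAL STATEMENTS\n\nThe consolidated balance sheets...\n\nITEM 9. CHANGES"
print(split_paragraphs(doc))  # three pieces, one per paragraph
```

Each resulting piece can then be loaded into the `text` column of the DataFrame fed to the pipeline above.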
## Results ```bash +-------+ | result| +-------+ |[financial_statements]| |[other]| |[other]| |[financial_statements]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_financial_statements_item| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.6 MB| ## References Weak labelling on documents from Edgar database ## Benchmarking ```bash label precision recall f1-score support financial_statements 0.86 0.96 0.91 1204 other 0.96 0.85 0.90 1254 accuracy - - 0.90 2458 macro-avg 0.91 0.91 0.90 2458 weighted-avg 0.91 0.90 0.90 2458 ``` --- layout: model title: Adverse Drug Events Binary Classifier (BioBERT) author: John Snow Labs name: bert_sequence_classifier_ade_augmented date: 2022-07-27 tags: [clinical, licensed, public_health, ade, classifier, sequence_classification, en] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [BioBERT-based](https://github.com/dmis-lab/biobert) classifier that can classify tweets reporting ADEs (Adverse Drug Events). 
## Predicted Entities `ADE`, `noADE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_ade_augmented_en_4.0.0_3.0_1658905698079.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_ade_augmented_en_4.0.0_3.0_1658905698079.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_ade_augmented", "en", "clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame(["So glad I am off effexor, so sad it ruined my teeth. tip Please be carefull taking antideppresiva and read about it 1st", "Religare Capital Ranbaxy has been accepting approval for Diovan since 2012"], StringType()).toDF("text") result = pipeline.fit(data).transform(data) result.select("text", "class.result").show(truncate=False) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_ade_augmented", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq(Array("So glad I am off effexor, so sad it ruined my teeth. tip Please be carefull taking antideppresiva and read about it 1st", "Religare Capital Ranbaxy has been accepting approval for Diovan since 2012")).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.adverse_drug_events").predict("""So glad I am off effexor, so sad it ruined my teeth. tip Please be carefull taking antideppresiva and read about it 1st""") ```
## Results ```bash +-----------------------------------------------------------------------------------------------------------------------+-------+ |text |result | +-----------------------------------------------------------------------------------------------------------------------+-------+ |So glad I am off effexor, so sad it ruined my teeth. tip Please be carefull taking antideppresiva and read about it 1st|[ADE] | |Religare Capital Ranbaxy has been accepting approval for Diovan since 2012 |[noADE]| +-----------------------------------------------------------------------------------------------------------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_ade_augmented| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## Benchmarking ```bash label precision recall f1-score support ADE 0.9696 0.9595 0.9645 2763 noADE 0.9670 0.9753 0.9712 3366 accuracy - - 0.9682 6129 macro-avg 0.9683 0.9674 0.9678 6129 weighted-avg 0.9682 0.9682 0.9682 6129 ``` --- layout: model title: Italian Legal Roberta Embeddings author: John Snow Labs name: roberta_large_italian_legal date: 2023-02-16 tags: [it, italian, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: it edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-italian-roberta-large` is an Italian model originally trained by `joelito`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_large_italian_legal_it_4.2.4_3.0_1676557559157.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_large_italian_legal_it_4.2.4_3.0_1676557559157.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_large_italian_legal", "it")\ .setInputCols(["sentence"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_large_italian_legal", "it") .setInputCols("sentence") .setOutputCol("embeddings") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_large_italian_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|it| |Size:|1.3 GB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-italian-roberta-large --- layout: model title: Explain Document Pipeline for Spanish author: John Snow Labs name: explain_document_md date: 2021-03-22 tags: [open_source, spanish, explain_document_md, pipeline, es] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: es edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_md is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps. It performs most of the common text processing tasks on your dataframe {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_es_3.0.0_3.0_1616431976931.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_md_es_3.0.0_3.0_1616431976931.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_md', lang = 'es') annotations = pipeline.fullAnnotate("Hola de John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_md", lang = "es") val result = pipeline.fullAnnotate("Hola de John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hola de John Snow Labs! "] result_df = nlu.load('es.explain.md').predict(text) result_df ```
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:-----------------------------|:----------------------------|:----------------------------------------|:----------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Hola de John Snow Labs! '] | ['Hola de John Snow Labs!'] | ['Hola', 'de', 'John', 'Snow', 'Labs!'] | ['Hola', 'de', 'John', 'Snow', 'Labs!'] | ['PART', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.5123000144958496,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|es| --- layout: model title: Legal Separation Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_separation_agreement date: 2022-11-24 tags: [en, legal, classification, agreement, separation, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_separation_agreement` model is a Legal Longformer Document Classifier to classify whether a document belongs to the class `separation-agreement` or not (Binary Classification). Longformers are limited to 4096 tokens, so only the first 4096 tokens will be taken into account. We have found that for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document itself without any extra leading material, 4096 tokens are enough to perform Document Classification. 
If not, let us know and we can carry out another approach for you: taking chunks of 4096 tokens, averaging their embeddings, and training with the averaged version, which means the whole document is taken into account. In theory, though, this should not be required. ## Predicted Entities `separation-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_separation_agreement_en_1.0.0_3.0_1669294576564.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_separation_agreement_en_1.0.0_3.0_1669294576564.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_separation_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
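The chunk-averaging fallback mentioned in the description (embed 4096-token chunks, then average so the whole document contributes) can be sketched outside Spark NLP as follows. Here `embed_fn` is a stand-in for a real Longformer encoder and the helper name is hypothetical:

```python
import numpy as np

def average_chunk_embedding(tokens, embed_fn, chunk_size=4096):
    """Embed a long token sequence in fixed-size chunks and average the
    per-chunk vectors so no part of the document is discarded."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    vectors = [embed_fn(c) for c in chunks]  # one vector per chunk
    return np.mean(vectors, axis=0)

# toy embedder: mean token length as a 1-d "embedding" (stand-in for Longformer)
toy_embed = lambda chunk: np.array([sum(len(t) for t in chunk) / len(chunk)])
```

Training on such averaged vectors would then proceed exactly as with the single-window `sentence_embeddings` column above.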
## Results ```bash +-------+ |result| +-------+ |[separation-agreement]| |[other]| |[other]| |[separation-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_separation_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.94 0.91 0.93 90 separation-agreement 0.82 0.88 0.85 42 accuracy - - 0.90 132 macro-avg 0.88 0.90 0.89 132 weighted-avg 0.90 0.90 0.90 132 ``` --- layout: model title: Pipeline to Map RxNorm Codes to Corresponding National Drug Codes (NDC) author: John Snow Labs name: rxnorm_ndc_mapping date: 2022-06-27 tags: [rxnorm, ndc, pipeline, chunk_mapper, clinical, licensed, en] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps RxNorm codes to NDC codes without using any text data. Simply feed whitespace-delimited RxNorm codes, and it will return the two corresponding types of NDC codes, called `package ndc` and `product ndc`. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_ndc_mapping_en_3.5.3_3.0_1656369648141.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_ndc_mapping_en_3.5.3_3.0_1656369648141.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("rxnorm_ndc_mapping", "en", "clinical/models") result = pipeline.fullAnnotate("1652674 259934") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("rxnorm_ndc_mapping", "en", "clinical/models") val result = pipeline.fullAnnotate("1652674 259934") ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.rxnorm_to_ndc.pipe").predict("""1652674 259934""") ```
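Because the pipeline returns parallel lists (one entry per input code), a small post-processing step can zip them into per-code records. This sketch assumes lists already extracted into the dictionary shape shown under Results; the raw `fullAnnotate` return is a list of annotation objects, so the helper here is illustrative only:

```python
def pair_ndc_codes(result: dict) -> list:
    """Zip the parallel output lists into one record per RxNorm code."""
    return [
        {"rxnorm": rx, "package_ndc": pkg, "product_ndc": prod}
        for rx, pkg, prod in zip(result["rxnorm_code"],
                                 result["package_ndc"],
                                 result["product_ndc"])
    ]

# dictionary shape as shown in the Results section of this card
sample = {"document": ["1652674 259934"],
          "package_ndc": ["62135-0625-60", "13349-0010-39"],
          "product_ndc": ["46708-0499", "13349-0010"],
          "rxnorm_code": ["1652674", "259934"]}
print(pair_ndc_codes(sample))
```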
## Results ```bash {'document': ['1652674 259934'], 'package_ndc': ['62135-0625-60', '13349-0010-39'], 'product_ndc': ['46708-0499', '13349-0010'], 'rxnorm_code': ['1652674', '259934']} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|rxnorm_ndc_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|4.0 MB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_ntp0102 TFWav2Vec2ForCTC from ntp0102 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_ntp0102 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_ntp0102` is an English model originally trained by ntp0102. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_by_ntp0102_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_ntp0102_en_4.2.0_3.0_1664026602923.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_ntp0102_en_4.2.0_3.0_1664026602923.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_ntp0102', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_ntp0102", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_ntp0102| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|349.4 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Spanish RobertaForQuestionAnswering (from jamarju) author: John Snow Labs name: roberta_qa_roberta_large_bne_squad_2.0_es_jamarju date: 2022-06-21 tags: [es, open_source, question_answering, roberta] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-bne-squad-2.0-es` is a Spanish model originally trained by `jamarju`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_bne_squad_2.0_es_jamarju_es_4.0.0_3.0_1655789415779.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_bne_squad_2.0_es_jamarju_es_4.0.0_3.0_1655789415779.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_bne_squad_2.0_es_jamarju","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_large_bne_squad_2.0_es_jamarju","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.squad.roberta.large.by_jamarju").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_large_bne_squad_2.0_es_jamarju| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|es| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/jamarju/roberta-large-bne-squad-2.0-es - https://github.com/PlanTL-SANIDAD/lm-spanish - https://github.com/ccasimiro88/TranslateAlignRetrieve --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from miamiya) author: John Snow Labs name: roberta_qa_miamiya_base_squad2_finetuned_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-finetuned-squad` is an English model originally trained by `miamiya`. 
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_miamiya_base_squad2_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_miamiya_base_squad2_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_miamiya_base_squad2_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/miamiya/roberta-base-squad2-finetuned-squad --- layout: model title: Hungarian Legal Roberta Embeddings author: John Snow Labs name: roberta_base_hungarian_legal date: 2023-02-16 tags: [hu, hungarian, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: hu edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-hungarian-roberta-base` is a Hungarian model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_hungarian_legal_hu_4.2.4_3.0_1676558480899.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_hungarian_legal_hu_4.2.4_3.0_1676558480899.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_hungarian_legal", "hu")\ .setInputCols(["sentence"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_hungarian_legal", "hu") .setInputCols("sentence") .setOutputCol("embeddings") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_hungarian_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|hu| |Size:|416.0 MB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-hungarian-roberta-base --- layout: model title: Word2Vec Embeddings in Sundanese (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, su, open_source] task: Embeddings language: su edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_su_3.4.1_3.0_1647459488324.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_su_3.4.1_3.0_1647459488324.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","su") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Abdi bogoh Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","su") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Abdi bogoh Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("su.embed.w2v_cc_300d").predict("""Abdi bogoh Spark NLP""") ```
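Once tokens are mapped to vectors in the `embeddings` column, downstream tasks typically compare them with cosine similarity. A minimal sketch, with toy 3-d vectors standing in for the model's 300-d output:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy 3-d stand-ins for the 300-d vectors this model emits
u = np.array([1.0, 0.0, 1.0])
v = np.array([1.0, 0.0, 1.0])
w = np.array([0.0, 1.0, 0.0])
print(cosine(u, v))  # 1.0 (identical direction)
print(cosine(u, w))  # 0.0 (orthogonal)
```

In practice the vectors would come from the `embeddings` annotations produced by the pipeline above.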
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|su| |Size:|185.9 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English DistilBertForQuestionAnswering Cased model (from Sachinkelenjaguri) author: John Snow Labs name: distilbert_qa_sa_qna date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `sa_Qna` is an English model originally trained by `Sachinkelenjaguri`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_sa_qna_en_4.3.0_3.0_1672775418180.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_sa_qna_en_4.3.0_3.0_1672775418180.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sa_qna","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sa_qna","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_sa_qna| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Sachinkelenjaguri/sa_Qna --- layout: model title: Translate English to Austronesian languages Pipeline author: John Snow Labs name: translate_en_map date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, map, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `map` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_map_xx_2.7.0_2.4_1609688461104.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_map_xx_2.7.0_2.4_1609688461104.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_map", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_map", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.map').predict(text, output_level='sentence') translate_df ```
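Because the module is computationally expensive on longer sequences, it can help to feed `annotate` sentence-sized chunks rather than whole documents. A rough pre-chunking sketch with a naive regex splitter (the pipeline's own sentence handling is authoritative; `split_sentences` is a hypothetical helper for illustration):

```python
import re

def split_sentences(text):
    # Naive split on sentence-final punctuation followed by whitespace.
    # This only illustrates keeping each translation request short.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(split_sentences("Your sentence to translate! Another one to translate."))
```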
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_map| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForTokenClassification Cased model (from Lucifermorningstar011) author: John Snow Labs name: distilbert_token_classifier_autotrain_final_784824211 date: 2023-03-06 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-final-784824211` is an English model originally trained by `Lucifermorningstar011`. ## Predicted Entities `9`, `0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824211_en_4.3.1_3.0_1678134173949.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824211_en_4.3.1_3.0_1678134173949.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824211","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824211","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_final_784824211| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Lucifermorningstar011/autotrain-final-784824211 --- layout: model title: Sentence Entity Resolver for billable ICD10-CM HCC codes author: John Snow Labs name: sbiobertresolve_icd10cm_augmented_billable_hcc date: 2021-05-16 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD10-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings, and has faster load time, with a speedup of about 6X when compared to previous versions. The load process is now more memory friendly, meaning that the maximum memory required during load time is smaller, reducing the chance of OOM exceptions and thus relaxing hardware requirements. It has been augmented with synonyms, making it four times richer than the previous resolver, and adds support for 7-digit codes with HCC status. ## Predicted Entities Outputs 7-digit billable ICD codes. In the result, look for the `aux_label` parameter in the metadata to get the HCC status. The HCC status can be split to get further information: `billable status`, `hcc status`, and `hcc score`. For example, in the example shared below, the billable status is `1`, the hcc status is `1`, and the hcc score is `8`.
{:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_billable_hcc_en_3.0.4_2.4_1621189647111.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_billable_hcc_en_3.0.4_2.4_1621189647111.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The ```sbiobertresolve_icd10cm_augmented_billable_hcc``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_clinical``` as the NER model, with ```PROBLEM``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sbert_embeddings") icd10_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models") \ .setInputCols(["document", "sbert_embeddings"]) \ .setOutputCol("icd10cm_code")\ .setDistanceFunction("EUCLIDEAN")\ .setReturnCosineDistances(True) bert_pipeline_icd = Pipeline(stages = [document_assembler, sbert_embedder, icd10_resolver]) data = spark.createDataFrame([["metastatic lung cancer"]]).toDF("text") results = bert_pipeline_icd.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sbert_embeddings") val icd10_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models") .setInputCols(Array("document", "sbert_embeddings")) .setOutputCol("icd10cm_code") .setDistanceFunction("EUCLIDEAN") .setReturnCosineDistances(true) val bert_pipeline_icd = new Pipeline().setStages(Array(document_assembler, sbert_embedder, icd10_resolver)) val data = Seq("metastatic lung cancer").toDF("text") val result = bert_pipeline_icd.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd10cm.augmented_billable").predict("""metastatic lung cancer""") ```
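The `aux_label` metadata packs the billable status, HCC status, and HCC score into one string. A minimal parsing sketch, assuming a `||`-separated value such as `1||1||8` (the separator is an assumption for illustration; inspect the metadata of your resolver output to confirm the actual delimiter):

```python
def parse_hcc_status(aux_label, sep="||"):
    # Split an aux_label value such as "1||1||8" into its three fields.
    # The "||" separator is assumed here, not taken from the source docs.
    billable, hcc_status, hcc_score = aux_label.split(sep)
    return {"billable": billable, "hcc_status": hcc_status, "hcc_score": hcc_score}

print(parse_hcc_status("1||1||8"))
```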
## Results ```bash | | chunks | code | resolutions | all_codes | billable_hcc_status_score | all_distances | |---:|:-----------------------|:-------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------|:----------------------------|:-------------------------------------------------------------------------------------------------------------------------| | 0 | metastatic lung cancer | C7800 | ['cancer metastatic to lung', 'metastasis from malignant tumor of lung', 'cancer metastatic to left lung', 'history of cancer metastatic to lung', 'metastatic cancer', 'history of cancer metastatic to lung (situation)', 'metastatic adenocarcinoma to bilateral lungs', 'cancer metastatic to chest wall', 'metastatic malignant neoplasm to left lower lobe of lung', 'metastatic carcinoid tumour', 'cancer metastatic to respiratory tract', 'metastatic carcinoid tumor'] | ['C7800', 'C349', 'C7801', 'Z858', 'C800', 'Z8511', 'C780', 'C798', 'C7802', 'C799', 'C7830', 'C7B00'] | ['1', '1', '8'] | ['0.0464', '0.0829', '0.0852', '0.0860', '0.0914', '0.0989', '0.1133', '0.1220', '0.1220', '0.1253', '0.1249', '0.1260'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icd10cm_augmented_billable_hcc| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Case sensitive:|false| --- layout: model title: Oncology Pipeline for 
Therapies author: John Snow Labs name: oncology_therapy_pipeline date: 2023-03-29 tags: [licensed, pipeline, oncology, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline includes Named-Entity Recognition and Assertion Status models to extract information from oncology texts, focusing on entities related to therapies. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/oncology_therapy_pipeline_en_4.3.2_3.2_1680123025997.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/oncology_therapy_pipeline_en_4.3.2_3.2_1680123025997.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("oncology_therapy_pipeline", "en", "clinical/models") text = '''The patient underwent a mastectomy two years ago. She is currently receiving her second cycle of adriamycin and cyclophosphamide, and is in good overall condition.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("oncology_therapy_pipeline", "en", "clinical/models") val text = "The patient underwent a mastectomy two years ago. She is currently receiving her second cycle of adriamycin and cyclophosphamide, and is in good overall condition." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.oncology_therpay.pipeline").predict("""The patient underwent a mastectomy two years ago. She is currently receiving her second cycle of adriamycin and cyclophosphamide, and is in good overall condition.""") ```
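`fullAnnotate` returns annotations per output column; the chunk/label tables in the Results section can be rebuilt from them. A minimal sketch over plain dicts mirroring that shape (both the `ner_chunk` column name and the `entity` metadata key are assumptions here; print the keys of your own result to confirm them):

```python
def chunks_to_rows(annotations, key="ner_chunk"):
    # annotations: dict mapping an output column name to a list of
    # {"result": ..., "metadata": {...}} entries, mirroring the shape of
    # a fullAnnotate() result. Column and metadata key names are assumed.
    return [(a["result"], a["metadata"].get("entity")) for a in annotations.get(key, [])]

sample = {"ner_chunk": [
    {"result": "mastectomy", "metadata": {"entity": "Cancer_Surgery"}},
    {"result": "adriamycin", "metadata": {"entity": "Chemotherapy"}},
]}
print(chunks_to_rows(sample))
```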
## Results ```bash ******************** ner_oncology_wip results ******************** | chunk | ner_label | |:-----------------|:---------------| | mastectomy | Cancer_Surgery | | second cycle | Cycle_Number | | adriamycin | Chemotherapy | | cyclophosphamide | Chemotherapy | ******************** ner_oncology_wip results ******************** | chunk | ner_label | |:-----------------|:---------------| | mastectomy | Cancer_Surgery | | second cycle | Cycle_Number | | adriamycin | Cancer_Therapy | | cyclophosphamide | Cancer_Therapy | ******************** ner_oncology_unspecific_posology_wip results ******************** | chunk | ner_label | |:-----------------|:---------------------| | mastectomy | Cancer_Therapy | | second cycle | Posology_Information | | adriamycin | Cancer_Therapy | | cyclophosphamide | Cancer_Therapy | ******************** assertion_oncology_wip results ******************** | chunk | ner_label | assertion | |:-----------------|:---------------|:------------| | mastectomy | Cancer_Surgery | Past | | adriamycin | Chemotherapy | Present | | cyclophosphamide | Chemotherapy | Present | ******************** assertion_oncology_treatment_binary_wip results ******************** | chunk | ner_label | assertion | |:-----------------|:---------------|:----------------| | mastectomy | Cancer_Surgery | Present_Or_Past | | adriamycin | Chemotherapy | Present_Or_Past | | cyclophosphamide | Chemotherapy | Present_Or_Past | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|oncology_therapy_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler -
SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ChunkMergeModel - AssertionDLModel - AssertionDLModel --- layout: model title: Arabic ElectraForQuestionAnswering model (from wissamantoun) author: John Snow Labs name: electra_qa_ara_base_artydiqa date: 2022-06-22 tags: [ar, open_source, electra, question_answering] task: Question Answering language: ar edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `araelectra-base-artydiqa` is an Arabic model originally trained by `wissamantoun`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_ara_base_artydiqa_ar_4.0.0_3.0_1655920272375.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_ara_base_artydiqa_ar_4.0.0_3.0_1655920272375.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_ara_base_artydiqa","ar") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_ara_base_artydiqa","ar") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.answer_question.tydiqa.electra.base").predict("""ما هو اسمي؟|||اسمي كلارا وأنا أعيش في بيركلي.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_ara_base_artydiqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ar| |Size:|504.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/wissamantoun/araelectra-base-artydiqa --- layout: model title: Explain Document Pipeline for Italian author: John Snow Labs name: explain_document_md date: 2021-03-22 tags: [open_source, italian, explain_document_md, pipeline, it] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: it edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_md is a pretrained pipeline that performs basic text processing steps, covering most of the common text processing tasks on your dataframe. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_it_3.0.0_3.0_1616430477970.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_md_it_3.0.0_3.0_1616430477970.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_md', lang = 'it') annotations = pipeline.fullAnnotate("Ciao da John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_md", lang = "it") val result = pipeline.fullAnnotate("Ciao da John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Ciao da John Snow Labs! "] result_df = nlu.load('it.explain.document').predict(text) result_df ```
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:-----------------------------|:----------------------------|:----------------------------------------|:----------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Ciao da John Snow Labs! '] | ['Ciao da John Snow Labs!'] | ['Ciao', 'da', 'John', 'Snow', 'Labs!'] | ['Ciao', 'da', 'John', 'Snow', 'Labs!'] | ['VERB', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[-0.146050006151199,.,...]] | ['O', 'O', 'I-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|it| --- layout: model title: Chinese BertForMaskedLM Large Cased model (from genggui001) author: John Snow Labs name: bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_wwm_large_ext_fix_mlm` is a Chinese model originally trained by `genggui001`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm_zh_4.2.4_3.0_1670326139931.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm_zh_4.2.4_3.0_1670326139931.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
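The pipeline emits one vector per token. When a single sentence-level vector is needed downstream, a common approach is mean pooling over the token vectors. A minimal pure-Python sketch (toy 2-dimensional vectors stand in for the real BERT token vectors):

```python
def mean_pool(token_vectors):
    # Average token-level vectors into one fixed-size sentence vector.
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Toy stand-ins for real BERT token embeddings.
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # -> [2.0, 3.0]
```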
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|1.2 GB| |Case sensitive:|true| ## References - https://huggingface.co/genggui001/chinese_roberta_wwm_large_ext_fix_mlm - https://github.com/ymcui/Chinese-BERT-wwm/issues/98 - https://github.com/genggui001/chinese_roberta_wwm_large_ext_fix_mlm --- layout: model title: English DistilBertForQuestionAnswering model (from machine2049) Duorc author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_duorc_ date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-duorc_distilbert` is an English model originally trained by `machine2049`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_duorc__en_4.0.0_3.0_1654723876220.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_duorc__en_4.0.0_3.0_1654723876220.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_duorc_","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_duorc_","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.base_uncased.by_machine2049").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_duorc_| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/machine2049/distilbert-base-uncased-finetuned-duorc_distilbert --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from kaouther) author: John Snow Labs name: distilbert_qa_kaouther_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `kaouther`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_kaouther_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771677866.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_kaouther_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771677866.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kaouther_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kaouther_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_kaouther_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/kaouther/distilbert-base-uncased-finetuned-squad --- layout: model title: French CamemBert Embeddings (from gulabpatel) author: John Snow Labs name: camembert_embeddings_new_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `new-dummy-model` is a French model originally trained by `gulabpatel`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_new_generic_model_fr_3.4.4_3.0_1653991782298.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_new_generic_model_fr_3.4.4_3.0_1653991782298.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_new_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_new_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_new_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/gulabpatel/new-dummy-model --- layout: model title: German Financial Bert Word Embeddings author: John Snow Labs name: bert_sentence_embeddings_financial date: 2022-05-04 tags: [bert, embeddings, de, open_source, financial] task: Embeddings language: de edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Although the model name contains the word `sentence`, this is a word embeddings model. Financial Pretrained BERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `german-financial-statements-bert` is a German model originally trained by `fabianrausch`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sentence_embeddings_financial_de_3.4.2_3.0_1651678415089.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sentence_embeddings_financial_de_3.4.2_3.0_1651678415089.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_sentence_embeddings_financial","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Spark-NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_sentence_embeddings_financial","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Spark-NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sentence_embeddings_financial| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|409.6 MB| |Case sensitive:|true| --- layout: model title: English asr_wav2vec2_base_timit_ali_hasan_colab_EX2 TFWav2Vec2ForCTC from ali221000262 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_ali_hasan_colab_EX2 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_ali_hasan_colab_EX2` is an English model originally trained by ali221000262. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_ali_hasan_colab_EX2_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_ali_hasan_colab_EX2_en_4.2.0_3.0_1664038560611.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_ali_hasan_colab_EX2_en_4.2.0_3.0_1664038560611.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_ali_hasan_colab_EX2', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_ali_hasan_colab_EX2", lang = "en") val annotations = pipeline.transform(audioDF) ```
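Both snippets assume an `audioDF` already exists. `AudioAssembler` consumes a column of floating-point audio samples, so the WAV file has to be decoded first. A minimal sketch using only the Python standard library (the `audio_content` column name and 16 kHz mono 16-bit PCM input are assumptions matching the model cards on this page, not part of the pipeline itself):

```python
import struct
import wave

def wav_to_floats(path):
    """Read a mono 16-bit PCM WAV file and scale samples to [-1.0, 1.0]."""
    with wave.open(path, "rb") as wav:
        n = wav.getnframes()
        raw = wav.readframes(n)
    # "<h" = little-endian signed 16-bit; one value per mono frame.
    samples = struct.unpack("<%dh" % n, raw)
    return [s / 32768.0 for s in samples]

# floats = wav_to_floats("speech.wav")
# audioDF = spark.createDataFrame([(floats,)], ["audio_content"])
```

For stereo or 24-bit sources, resample and downmix to 16 kHz mono first; the sketch above deliberately handles only the simplest case.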
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_ali_hasan_colab_EX2| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|354.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Named Entity Recognition (NER) Model in Norwegian (Norne 6B 100) author: John Snow Labs name: norne_6B_100 date: 2020-05-06 task: Named Entity Recognition language: "no" edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [ner, nn, nb, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Norne is a Named Entity Recognition (NER) model for Norwegian, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. Norne 6B 100 is trained with GloVe 6B 100 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Derived-`DRV`, Product-`PROD`, Geo-political Entities Location-`GPE_LOC`, Geo-political Entities Organization-`GPE_ORG`, Event-`EVT`.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_NO/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_NO.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/norne_6B_300_no_2.5.0_2.4_1588781290264.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/norne_6B_300_no_2.5.0_2.4_1588781290264.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained('glove_6B_100') \ .setInputCols(['document', 'token']) \ .setOutputCol('embeddings') ner_model = NerDLModel.pretrained("norne_6B_100", "no") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, programvareutvikler, investor og filantrop. Han er mest kjent som medgründer av Microsoft Corporation. I løpet av sin karriere hos Microsoft hadde Gates stillingene som styreleder, administrerende direktør (CEO), president og sjef programvarearkitekt, samtidig som han var den største individuelle aksjonæren fram til mai 2014. Han er en av de mest kjente gründere og pionerene i mikrodatarevolusjon på 1970- og 1980-tallet. Han er født og oppvokst i Seattle, Washington, og grunnla Microsoft sammen med barndomsvennen Paul Allen i 1975, i Albuquerque, New Mexico; det fortsatte å bli verdens største programvare for datamaskinprogramvare. Gates ledet selskapet som styreleder og administrerende direktør til han gikk av som konsernsjef i januar 2000, men han forble styreleder og ble sjef for programvarearkitekt. I løpet av slutten av 1990-tallet hadde Gates blitt kritisert for sin forretningstaktikk, som har blitt ansett som konkurransedyktig. Denne uttalelsen er opprettholdt av en rekke dommer. I juni 2006 kunngjorde Gates at han skulle gå over til en deltidsrolle hos Microsoft og på heltid ved Bill & Melinda Gates Foundation, den private veldedige stiftelsen som han og kona, Melinda Gates, opprettet i 2000. [ 9] Han overførte gradvis arbeidsoppgavene sine til Ray Ozzie og Craig Mundie. 
Han trakk seg som styreleder for Microsoft i februar 2014 og tiltrådte et nytt verv som teknologirådgiver for å støtte den nyutnevnte administrerende direktøren Satya Nadella.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_6B_100") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("norne_6B_100", "no") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, programvareutvikler, investor og filantrop. Han er mest kjent som medgründer av Microsoft Corporation. I løpet av sin karriere hos Microsoft hadde Gates stillingene som styreleder, administrerende direktør (CEO), president og sjef programvarearkitekt, samtidig som han var den største individuelle aksjonæren fram til mai 2014. Han er en av de mest kjente gründere og pionerene i mikrodatarevolusjon på 1970- og 1980-tallet. Han er født og oppvokst i Seattle, Washington, og grunnla Microsoft sammen med barndomsvennen Paul Allen i 1975, i Albuquerque, New Mexico; det fortsatte å bli verdens største programvare for datamaskinprogramvare. Gates ledet selskapet som styreleder og administrerende direktør til han gikk av som konsernsjef i januar 2000, men han forble styreleder og ble sjef for programvarearkitekt. I løpet av slutten av 1990-tallet hadde Gates blitt kritisert for sin forretningstaktikk, som har blitt ansett som konkurransedyktig. Denne uttalelsen er opprettholdt av en rekke dommer. I juni 2006 kunngjorde Gates at han skulle gå over til en deltidsrolle hos Microsoft og på heltid ved Bill & Melinda Gates Foundation, den private veldedige stiftelsen som han og kona, Melinda Gates, opprettet i 2000. 
[ 9] Han overførte gradvis arbeidsoppgavene sine til Ray Ozzie og Craig Mundie. Han trakk seg som styreleder for Microsoft i februar 2014 og tiltrådte et nytt verv som teknologirådgiver for å støtte den nyutnevnte administrerende direktøren Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, programvareutvikler, investor og filantrop. Han er mest kjent som medgründer av Microsoft Corporation. I løpet av sin karriere hos Microsoft hadde Gates stillingene som styreleder, administrerende direktør (CEO), president og sjef programvarearkitekt, samtidig som han var den største individuelle aksjonæren fram til mai 2014. Han er en av de mest kjente gründere og pionerene i mikrodatarevolusjon på 1970- og 1980-tallet. Han er født og oppvokst i Seattle, Washington, og grunnla Microsoft sammen med barndomsvennen Paul Allen i 1975, i Albuquerque, New Mexico; det fortsatte å bli verdens største programvare for datamaskinprogramvare. Gates ledet selskapet som styreleder og administrerende direktør til han gikk av som konsernsjef i januar 2000, men han forble styreleder og ble sjef for programvarearkitekt. I løpet av slutten av 1990-tallet hadde Gates blitt kritisert for sin forretningstaktikk, som har blitt ansett som konkurransedyktig. Denne uttalelsen er opprettholdt av en rekke dommer. I juni 2006 kunngjorde Gates at han skulle gå over til en deltidsrolle hos Microsoft og på heltid ved Bill & Melinda Gates Foundation, den private veldedige stiftelsen som han og kona, Melinda Gates, opprettet i 2000. Han overførte gradvis arbeidsoppgavene sine til Ray Ozzie og Craig Mundie. 
Han trakk seg som styreleder for Microsoft i februar 2014 og tiltrådte et nytt verv som teknologirådgiver for å støtte den nyutnevnte administrerende direktøren Satya Nadella."""] ner_df = nlu.load('no.ner.norne.glove.6B_100').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title} ## Results ```bash +-------------------------------+---------+ |chunk |ner_label| +-------------------------------+---------+ |William Henry Gates III |PER | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PER | |Seattle |GPE_LOC | |Washington |GPE_LOC | |Microsoft |ORG | |Paul Allen |PER | |Albuquerque |GPE_LOC | |New Mexico |GPE_LOC | |Gates |PER | |Gates |PER | |Gates |PER | |Microsoft |ORG | |Bill & Melinda Gates Foundation|ORG | |Melinda Gates |PER | |Ray Ozzie |PER | |Craig Mundie |PER | |Han |PER | |Microsoft |ORG | +-------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|norne_6B_100| |Type:|ner| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |License:|Open Source| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|no| |Case sensitive:|false| {:.h2_title} ## Data Source Detailed information can be found in [https://www.aclweb.org/anthology/2020.lrec-1.559.pdf](https://www.aclweb.org/anthology/2020.lrec-1.559.pdf) --- layout: model title: Lemmatizer (Lithuanian, SpacyLookup) author: John Snow Labs name: lemma_spacylookup date: 2022-03-03 tags: [open_source, lemmatizer, lt] task: Lemmatization language: lt edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Lithuanian Lemmatizer is a scalable, production-ready version of the rule-based lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_lt_3.4.1_3.0_1646316598333.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_lt_3.4.1_3.0_1646316598333.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","lt") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Jūs nesate geresnis už mane"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","lt") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("Jūs nesate geresnis už mane").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("lt.lemma").predict("""Jūs nesate geresnis už mane""") ```
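Once the `token` and `lemma` annotations are collected to the driver, comparing them is plain Python. The helper below is a hypothetical post-processing sketch (not part of Spark NLP) that reports which tokens the lemmatizer actually changed, using the token/lemma strings produced for the example sentence above:

```python
def changed_lemmas(tokens, lemmas):
    """Pair tokens with their lemmas and keep only the pairs the lemmatizer altered."""
    return [(tok, lem) for tok, lem in zip(tokens, lemmas) if tok != lem]

# Token and lemma strings as produced for the example sentence above.
tokens = ["Jūs", "nesate", "geresnis", "už", "mane"]
lemmas = ["Jūs", "nebūti", "geras", "už", "mane"]
print(changed_lemmas(tokens, lemmas))  # [('nesate', 'nebūti'), ('geresnis', 'geras')]
```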
## Results ```bash +------------------------------+ |result | +------------------------------+ |[Jūs, nebūti, geras, už, mane]| +------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma_spacylookup| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|lt| |Size:|2.6 MB| --- layout: model title: Spanish BertForTokenClassification Cased model (from luch0247) author: John Snow Labs name: bert_token_classifier_autotrain_lucy_alicorp_1356152290 date: 2022-11-30 tags: [es, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: es edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-Lucy-Alicorp-1356152290` is a Spanish model originally trained by `luch0247`. ## Predicted Entities `C`, `NM`, `VRB`, `CR`, `QT`, `DB` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autotrain_lucy_alicorp_1356152290_es_4.2.4_3.0_1669814335691.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autotrain_lucy_alicorp_1356152290_es_4.2.4_3.0_1669814335691.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autotrain_lucy_alicorp_1356152290","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autotrain_lucy_alicorp_1356152290","es") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
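The pipeline above stops at per-token tags; in Spark NLP a `NerConverter` stage normally groups IOB-style tags into entity chunks. If the model's labels follow the IOB scheme (an assumption — the label set listed above may be flat), the grouping logic looks roughly like this pure-Python sketch:

```python
def iob_to_chunks(tokens, tags):
    """Collapse IOB tags (B-X / I-X / O) into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the previous chunk before opening a new one
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # O tag (or stray I- with no open chunk) ends any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(iob_to_chunks(["Lima", "es", "de", "Perú"], ["B-LOC", "O", "O", "B-LOC"]))
# [('Lima', 'LOC'), ('Perú', 'LOC')]
```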
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_autotrain_lucy_alicorp_1356152290| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|es| |Size:|410.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/luch0247/autotrain-Lucy-Alicorp-1356152290 --- layout: model title: Fast Neural Machine Translation Model from English to Twi author: John Snow Labs name: opus_mt_en_tw date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, tw, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `en` - target languages: `tw` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_tw_xx_2.7.0_2.4_1609169865968.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_tw_xx_2.7.0_2.4_1609169865968.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_tw", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_tw", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.tw').predict(text, output_level='sentence') opus_df ```
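`fullAnnotate` produces one `translation` annotation per detected sentence, so reassembling the full output means joining their `result` fields. A sketch with a namedtuple standing in for Spark NLP's `Annotation` class (only the `result` field is modeled here; the real class carries begin/end offsets and metadata as well):

```python
from collections import namedtuple

# Minimal stand-in for Spark NLP's Annotation; only `result` is used below.
Annotation = namedtuple("Annotation", ["result"])

def join_translations(annotations):
    """Concatenate per-sentence translation results into one string."""
    return " ".join(a.result for a in annotations)

sentences = [Annotation("Sentence one."), Annotation("Sentence two.")]
print(join_translations(sentences))  # Sentence one. Sentence two.
```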
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_tw| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English XLMRobertaForTokenClassification Base Cased model (from AI4Sec) author: John Snow Labs name: xlmroberta_ner_cyner_base date: 2022-08-13 tags: [en, open_source, xlm_roberta, ner] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cyner-xlm-roberta-base` is an English model originally trained by `AI4Sec`. ## Predicted Entities `Vulnerability`, `Malware`, `System`, `Organization`, `Indicator` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_cyner_base_en_4.1.0_3.0_1660422140565.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_cyner_base_en_4.1.0_3.0_1660422140565.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_cyner_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_cyner_base","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_cyner_base| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|780.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AI4Sec/cyner-xlm-roberta-base --- layout: model title: English asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves TFWav2Vec2ForCTC from tonyalves author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves` is an English model originally trained by tonyalves. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves_en_4.2.0_3.0_1664109299079.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves_en_4.2.0_3.0_1664109299079.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from huxxx657) author: John Snow Labs name: roberta_qa_base_finetuned_scrambled_squad_15 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-scrambled-squad-15` is an English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_scrambled_squad_15_en_4.3.0_3.0_1674216826395.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_scrambled_squad_15_en_4.3.0_3.0_1674216826395.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_scrambled_squad_15","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_scrambled_squad_15","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
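Collected to the driver, the `answer` column holds a list of annotation structs whose `result` field is the extracted span. A sketch of pulling out the first answer, with plain dicts standing in for the structs (the `result` key mirrors Spark NLP's Annotation schema; the helper itself is illustrative, not part of the library):

```python
def first_answer(row_answer):
    """Return the text of the first answer annotation, or None when no span was found."""
    return row_answer[0]["result"] if row_answer else None

# Plain-dict stand-ins for the collected `answer` column of the example above.
print(first_answer([{"result": "Clara"}]))  # Clara
print(first_answer([]))                     # None
```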
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_finetuned_scrambled_squad_15| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-scrambled-squad-15 --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_42 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-1024-finetuned-squad-seed-42` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_42_en_4.3.0_3.0_1674213511887.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_42_en_4.3.0_3.0_1674213511887.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_42","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_42","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_42| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|447.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-1024-finetuned-squad-seed-42 --- layout: model title: Word Segmenter for Chinese author: John Snow Labs name: wordseg_pku date: 2021-03-09 tags: [word_segmentation, open_source, chinese, wordseg_pku, zh] task: Word Segmentation language: zh edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: WordSegmenterModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description [WordSegmenterModel-WSM](https://en.wikipedia.org/wiki/Text_segmentation) is based on a maximum entropy probability model that detects word boundaries in Chinese text. Chinese text is written without white space between words, so a computer-based application cannot know a priori which sequence of ideograms forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/chinese/word_segmentation/words_segmenter_demo.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_pku_zh_3.0.0_3.0_1615292332841.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_pku_zh_3.0.0_3.0_1615292332841.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("sentence") word_segmenter = WordSegmenterModel.pretrained("wordseg_pku", "zh") \ .setInputCols(["sentence"]) \ .setOutputCol("token") pipeline = Pipeline(stages=[document_assembler, word_segmenter]) ws_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) example = spark.createDataFrame([['从John Snow Labs你好! ']], ["text"]) result = ws_model.transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("sentence") val word_segmenter = WordSegmenterModel.pretrained("wordseg_pku", "zh") .setInputCols(Array("sentence")) .setOutputCol("token") val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter)) val data = Seq("从John Snow Labs你好! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["从John Snow Labs你好! "] token_df = nlu.load('zh.segment_words.pku').predict(text) token_df ```
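Conceptually, the segmenter tags each character as beginning or continuing a word and then joins runs of characters into tokens. A toy version of that decoding step (the `B`/`I` tag scheme here is illustrative only, not the model's internal format):

```python
def join_segments(chars, tags):
    """Join characters into words from begin/inside ('B'/'I') tags.

    A 'B' tag starts a new word; 'I' appends to the current one.
    """
    words = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words

# "你好" decoded as one word, "从" as another
print(join_segments(list("从你好"), ["B", "B", "I"]))  # ['从', '你好']
```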
## Results ```bash 0 从 1 Jo 2 hn 3 Sn 4 ow 5 La 6 bs 7 你 8 好 9 ! Name: token, dtype: object ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wordseg_pku| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[words_segmented]| |Language:|zh| --- layout: model title: English BertForQuestionAnswering model (from aodiniz) author: John Snow Labs name: bert_qa_bert_uncased_L_2_H_512_A_8_cord19_200616_squad2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-2_H-512_A-8_cord19-200616_squad2` is an English model originally trained by `aodiniz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_2_H_512_A_8_cord19_200616_squad2_en_4.0.0_3.0_1654185223441.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_2_H_512_A_8_cord19_200616_squad2_en_4.0.0_3.0_1654185223441.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_2_H_512_A_8_cord19_200616_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_uncased_L_2_H_512_A_8_cord19_200616_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_cord19.bert.uncased_2l_512d_a8a_512d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
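In the nlu one-liner above, the question and the context travel in a single string joined by `|||`. A trivial helper makes that convention explicit:

```python
def nlu_qa_input(question, context):
    """Join a question and its context with the '|||' separator used in the nlu snippet."""
    return f"{question}|||{context}"

print(nlu_qa_input("What's my name?", "My name is Clara and I live in Berkeley."))
# What's my name?|||My name is Clara and I live in Berkeley.
```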
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_uncased_L_2_H_512_A_8_cord19_200616_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|83.4 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-2_H-512_A-8_cord19-200616_squad2 --- layout: model title: Italian T5ForConditionalGeneration Small Cased model (from it5) author: John Snow Labs name: t5_it5_efficient_small_el32_informal_to_formal date: 2023-01-30 tags: [it, open_source, t5, tensorflow] task: Text Generation language: it edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `it5-efficient-small-el32-informal-to-formal` is an Italian model originally trained by `it5`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_informal_to_formal_it_4.3.0_3.0_1675103414416.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_informal_to_formal_it_4.3.0_3.0_1675103414416.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_informal_to_formal","it") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_informal_to_formal","it") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_it5_efficient_small_el32_informal_to_formal| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|it| |Size:|593.5 MB| ## References - https://huggingface.co/it5/it5-efficient-small-el32-informal-to-formal - https://github.com/stefan-it - https://arxiv.org/abs/2203.03759 - https://gsarti.com - https://malvinanissim.github.io - https://arxiv.org/abs/2109.10686 - https://github.com/gsarti/it5 - https://paperswithcode.com/sota?task=Informal-to-formal+Style+Transfer&dataset=XFORMAL+%28Italian+Subset%29 --- layout: model title: Legal Powers Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_powers_bert date: 2023-03-05 tags: [en, legal, classification, clauses, powers, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Powers` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level. 
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Keep in mind that this model's embeddings allow up to 512 tokens; if your text is longer, consider splitting it into smaller pieces (the same tutorial linked above covers this as well). This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Powers`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_powers_bert_en_1.0.0_3.0_1678050727123.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_powers_bert_en_1.0.0_3.0_1678050727123.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_powers_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
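The description above recommends splitting long documents into provisions before classification. A minimal stand-in for the paragraph-splitting step (plain Python on blank-line boundaries; the Legal NLP tutorial linked in the description covers more robust techniques):

```python
import re

def split_paragraphs(text):
    """Split a document into provisions on one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("POWERS. The Agent may execute all documents.\n\n"
       "GOVERNING LAW. This Agreement is governed by Delaware law.")
for provision in split_paragraphs(doc):
    print(provision)
```

Each resulting provision would then go into the `text` column of the DataFrame fed to the pipeline, so the classifier sees one provision at a time.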
## Results ```bash +-------+ |result| +-------+ |[Powers]| |[Other]| |[Other]| |[Powers]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_powers_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.3 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.90 0.95 0.92 19 Powers 0.90 0.82 0.86 11 accuracy - - 0.90 30 macro-avg 0.90 0.88 0.89 30 weighted-avg 0.90 0.90 0.90 30 ``` --- layout: model title: Korean BertForMaskedLM Base Cased model (from kykim) author: John Snow Labs name: bert_embeddings_kor_base date: 2022-12-02 tags: [ko, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ko edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-kor-base` is a Korean model originally trained by `kykim`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_kor_base_ko_4.2.4_3.0_1670019610924.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_kor_base_ko_4.2.4_3.0_1670019610924.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_kor_base","ko") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_kor_base","ko") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
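The token vectors that land in the `embeddings` column are typically consumed by downstream comparisons such as cosine similarity. A dependency-free sketch of that measure, over plain lists standing in for embedding vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors score 1.0; orthogonal vectors score 0.0.
print(cosine([1.0, 2.0], [2.0, 4.0]))  # 1.0
```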
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_kor_base| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ko| |Size:|443.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/kykim/bert-kor-base - https://github.com/kiyoungkim1/LM-kor --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from moghis) author: John Snow Labs name: xlmroberta_ner_base_finetuned_panx_de_data date: 2022-08-14 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de-data` is a German model originally trained by `moghis`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_panx_de_data_de_4.1.0_3.0_1660438168971.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_panx_de_data_de_4.1.0_3.0_1660438168971.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_panx_de_data","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_panx_de_data","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
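The `NerConverter` stage groups token-level IOB predictions (`B-PER`, `I-PER`, `O`, ...) into entity chunks. A simplified pure-Python rendition of that grouping (illustrative only, not the annotator's actual implementation):

```python
def bio_to_chunks(tokens, tags):
    """Collapse IOB tags into (text, label) entity chunks."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O" or an inconsistent I- tag closes the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(bio_to_chunks(["Angela", "Merkel", "besucht", "Berlin"],
                    ["B-PER", "I-PER", "O", "B-LOC"]))
# [('Angela Merkel', 'PER'), ('Berlin', 'LOC')]
```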
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_panx_de_data| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/moghis/xlm-roberta-base-finetuned-panx-de-data --- layout: model title: French CamemBert Embeddings (from kaushikacharya) author: John Snow Labs name: camembert_embeddings_kaushikacharya_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `kaushikacharya`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_kaushikacharya_generic_model_fr_3.4.4_3.0_1653989224371.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_kaushikacharya_generic_model_fr_3.4.4_3.0_1653989224371.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_kaushikacharya_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_kaushikacharya_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_kaushikacharya_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/kaushikacharya/dummy-model --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_10 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-16-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_10_en_4.0.0_3.0_1655731588944.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_10_en_4.0.0_3.0_1655731588944.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_10","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_seed_10").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|416.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-16-finetuned-squad-seed-10 --- layout: model title: Pipeline to Resolve Medication Codes author: John Snow Labs name: medication_resolver_pipeline date: 2023-04-10 tags: [resolver, snomed, umls, rxnorm, ndc, ade, en, licensed, pipeline] task: Entity Resolution language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pretrained resolver pipeline to extract medications and resolve their adverse reactions (ADE), RxNorm, UMLS, NDC, SNOMED CT codes, and action/treatments in clinical text. Action/treatments are available for branded medication, and SNOMED codes are available for non-branded medication. This pipeline can be used as a LightPipeline (with `annotate/fullAnnotate`). You can use `medication_resolver_transform_pipeline` for Spark transform. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/medication_resolver_pipeline_en_4.3.2_3.0_1681151954032.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/medication_resolver_pipeline_en_4.3.2_3.0_1681151954032.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline med_resolver_pipeline = PretrainedPipeline("medication_resolver_pipeline", "en", "clinical/models") text = """The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""" result = med_resolver_pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val med_resolver_pipeline = new PretrainedPipeline("medication_resolver_pipeline", "en", "clinical/models") val result = med_resolver_pipeline.fullAnnotate("""The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.medication").predict("""The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""") ```
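`fullAnnotate` returns, per input text, a mapping from output column names to lists of annotations; lining chunks up with their resolved codes is then a matter of zipping those lists. A plain-Python sketch of that post-processing step (the column names and the dict-shaped annotations here are illustrative stand-ins, not this pipeline's verified schema):

```python
def to_rows(annotations, chunk_col="chunk", code_col="rxnorm_code"):
    """Pair each extracted chunk with its resolved code.

    `annotations` mimics a fullAnnotate result: a dict of column name ->
    list of annotation-like dicts carrying a 'result' field.
    """
    chunks = [a["result"] for a in annotations[chunk_col]]
    codes = [a["result"] for a in annotations[code_col]]
    return list(zip(chunks, codes))

fake = {"chunk": [{"result": "Lescol 40 MG"}],
        "rxnorm_code": [{"result": "103919"}]}
print(to_rows(fake))  # [('Lescol 40 MG', '103919')]
```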
## Results ```bash | | chunks | entities | ADE | RxNorm | Action | Treatment | UMLS | SNOMED_CT | NDC_Product | NDC_Package | |---:|:-----------------------------|:-----------|:----------------------------|---------:|:---------------------------|:-------------------------------------------|:---------|:------------|:--------------|:--------------| | 0 | Amlodopine Vallarta 10-320mg | DRUG | Gynaecomastia | 722131 | NONE | NONE | C1949334 | 425838008 | 00093-7693 | 00093-7693-56 | | 1 | Eviplera | DRUG | Anxiety | 217010 | Inhibitory Bone Resorption | Osteoporosis | C0720318 | NONE | NONE | NONE | | 2 | Lescol 40 MG | DRUG | NONE | 103919 | Hypocholesterolemic | Heterozygous Familial Hypercholesterolemia | C0353573 | NONE | 00078-0234 | 00078-0234-05 | | 3 | Everolimus 1.5 mg tablet | DRUG | Acute myocardial infarction | 2056895 | NONE | NONE | C4723581 | NONE | 00054-0604 | 00054-0604-21 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|medication_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|3.2 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel - TextMatcherModel - ChunkMergeModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger - ResolverMerger - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - Finisher --- layout: model title: Fast Neural Machine Translation Model from Arabic to Esperanto author: John Snow Labs name: opus_mt_ar_eo date: 2021-06-01 tags: [open_source, seq2seq, translation, ar, eo, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## 
Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. source languages: ar target languages: eo {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ar_eo_xx_3.1.0_2.4_1622554268155.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ar_eo_xx_3.1.0_2.4_1622554268155.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ar_eo", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ar_eo", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Arabic.translate_to.Esperanto').predict(text, output_level='sentence') translate_df ```
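The sentence detector matters here because the Marian transformer translates sentence by sentence. As a rough illustration of the segmentation step alone (the pretrained `sentence_detector_dl` model handles far more cases than this regex):

```python
import re

def naive_sentences(text):
    """Split on sentence-final punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(naive_sentences("Hello there. How are you?"))
# ['Hello there.', 'How are you?']
```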
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ar_eo| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Emotional Stressor Classifier (BERT) author: John Snow Labs name: bert_sequence_classifier_stressor date: 2022-07-27 tags: [stressor, public_health, en, licensed, sequence_classification] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [bioBERT](https://nlp.johnsnowlabs.com/2022/07/18/biobert_pubmed_base_cased_v1.2_en_3_0.html) based classifier that can classify the source of emotional stress in text. ## Predicted Entities `Family_Issues`, `Financial_Problem`, `Health_Fatigue_or_Physical Pain`, `Other`, `School`, `Work`, `Social_Relationships` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_STRESS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4SC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_stressor_en_4.0.0_3.0_1658923809554.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_stressor_en_4.0.0_3.0_1658923809554.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_stressor", "en", "clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame([["All the panic about the global pandemic has been stressing me out!"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_stressor", "en", "clinical/models") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) val data = Seq("All the panic about the global pandemic has been stressing me out!").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.stressor").predict("""All the panic about the global pandemic has been stressing me out!""") ```
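For intuition, the sequence-classification head ends with a softmax over class logits followed by an argmax. Below is a minimal pure-Python sketch of that final step; the label list mirrors this model's classes, but the logit values are invented for illustration (the real scores come from the fine-tuned BERT head):

```python
import math

# Illustrative label set; underscores used uniformly for readability.
LABELS = ["Family_Issues", "Financial_Problem",
          "Health_Fatigue_or_Physical_Pain", "Other",
          "School", "Work", "Social_Relationships"]

def classify(logits):
    # Numerically stable softmax over hypothetical class logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

# Invented logits where the "health" class dominates.
label, prob = classify([0.1, -0.3, 2.4, 0.5, -1.0, 0.2, 0.0])
print(label)  # Health_Fatigue_or_Physical_Pain
```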
## Results ```bash +------------------------------------------------------------------+-----------------------------------+ |text |class | +------------------------------------------------------------------+-----------------------------------+ |All the panic about the global pandemic has been stressing me out!|[Health, Fatigue, or Physical Pain]| +------------------------------------------------------------------+-----------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_stressor| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## Benchmarking ```bash label precision recall f1-score support Family Issues 0.80 0.87 0.84 161 Financial Problem 0.87 0.83 0.85 126 Health, Fatigue, or Physical Pain 0.75 0.81 0.78 168 Other 0.82 0.80 0.81 384 School 0.89 0.91 0.90 127 Social Relationships 0.83 0.71 0.76 133 Work 0.87 0.89 0.88 271 accuracy - - 0.83 1370 macro-avg 0.83 0.83 0.83 1370 weighted-avg 0.83 0.83 0.83 1370 ``` --- layout: model title: Question classification of open-domain and fact-based questions Pipeline - TREC6 author: John Snow Labs name: classifierdl_use_trec6_pipeline date: 2021-01-08 task: [Text Classification, Pipeline Public] language: en nav_key: models edition: Spark NLP 2.7.1 spark_version: 2.4 tags: [classifier, text_classification, en, open_source, pipeline] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Classify open-domain, fact-based questions into one of the following broad semantic categories: Abbreviation, Description, Entities, Human Beings, Locations or Numeric Values. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_EN_TREC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_EN_TREC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_trec6_pipeline_en_2.7.1_2.4_1610119335714.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_trec6_pipeline_en_2.7.1_2.4_1610119335714.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("classifierdl_use_trec6_pipeline", lang = "en") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("classifierdl_use_trec6_pipeline", lang = "en") ```
## Results ```bash +------------------------------------------------------------------------------------------------+------------+ |document |class | +------------------------------------------------------------------------------------------------+------------+ |When did the construction of stone circles begin in the UK? | NUM | +------------------------------------------------------------------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_use_trec6_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.1+| |Edition:|Official| |Language:|en| --- layout: model title: Translate English to Luo (Kenya and Tanzania) Pipeline author: John Snow Labs name: translate_en_luo date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, luo, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `luo` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_luo_xx_2.7.0_2.4_1609689219424.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_luo_xx_2.7.0_2.4_1609689219424.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_luo", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_luo", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.luo').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_luo| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English asr_wav2vec2_large_xlsr_53_gpt TFWav2Vec2ForCTC from voidful author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_gpt date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_gpt` is an English model originally trained by voidful. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_gpt_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_gpt_en_4.2.0_3.0_1664095296861.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_gpt_en_4.2.0_3.0_1664095296861.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_gpt', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_gpt", lang = "en") val annotations = pipeline.transform(audioDF) ```
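The Wav2Vec2ForCTC stage inside this pipeline emits per-frame token scores that are decoded with CTC. A minimal pure-Python sketch of greedy CTC decoding — take the argmax token per frame, collapse repeats, drop the blank symbol; the vocabulary and frame outputs here are invented for illustration:

```python
BLANK = "_"  # hypothetical blank symbol

def ctc_greedy_decode(frame_ids, vocab):
    # Map per-frame argmax ids to symbols.
    chars = [vocab[i] for i in frame_ids]
    # Collapse consecutive repeats.
    collapsed = []
    for c in chars:
        if not collapsed or collapsed[-1] != c:
            collapsed.append(c)
    # Drop blanks.
    return "".join(c for c in collapsed if c != BLANK)

vocab = [BLANK, "h", "e", "l", "o"]
# frames: h h _ e l l _ l o o  ->  "hello"
frames = [1, 1, 0, 2, 3, 3, 0, 3, 4, 4]
print(ctc_greedy_decode(frames, vocab))  # hello
```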
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_gpt| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.3 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Abkhazian asr_hf_challenge_test TFWav2Vec2ForCTC from Iskaj author: John Snow Labs name: asr_hf_challenge_test date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_hf_challenge_test` is an Abkhazian model originally trained by Iskaj. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_hf_challenge_test_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_hf_challenge_test_ab_4.2.0_3.0_1664021278980.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_hf_challenge_test_ab_4.2.0_3.0_1664021278980.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_hf_challenge_test", "ab")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_hf_challenge_test", "ab") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_hf_challenge_test| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ab| |Size:|446.6 KB| --- layout: model title: English RobertaForQuestionAnswering Cased model (from arjunth2001) author: John Snow Labs name: roberta_qa_priv_qna date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `priv_qna` is an English model originally trained by `arjunth2001`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_priv_qna_en_4.3.0_3.0_1674211774365.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_priv_qna_en_4.3.0_3.0_1674211774365.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_priv_qna","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_priv_qna","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_priv_qna| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.6 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/arjunth2001/priv_qna --- layout: model title: English BertForQuestionAnswering Small Cased model (from motiondew) author: John Snow Labs name: bert_qa_sd2_small date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-sd2-small` is an English model originally trained by `motiondew`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_sd2_small_en_4.0.0_3.0_1657188173526.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_sd2_small_en_4.0.0_3.0_1657188173526.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sd2_small","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sd2_small","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq("What is my name?", "My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
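Extractive QA models of this kind score each token as a potential answer start and end, then pick the highest-scoring valid span. A minimal pure-Python sketch of that decoding step; the tokens and logits below are invented for illustration:

```python
def best_span(start_logits, end_logits, max_len=15):
    # Pick the (start, end) pair with the highest combined score,
    # requiring start <= end and a bounded span length.
    best, span = float("-inf"), (0, 0)
    for s, sl in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            if sl + end_logits[e] > best:
                best, span = sl + end_logits[e], (s, e)
    return span

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 3.1, 0.0, 0.1, 0.0, 0.0, 0.4, 0.0]
end   = [0.0, 0.1, 0.0, 2.7, 0.2, 0.0, 0.0, 0.1, 0.9, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```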
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_sd2_small| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/motiondew/bert-sd2-small --- layout: model title: Translate Niger-Kordofanian languages to English Pipeline author: John Snow Labs name: translate_nic_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, nic, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended. - source languages: `nic` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_nic_en_xx_2.7.0_2.4_1609699199544.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_nic_en_xx_2.7.0_2.4_1609699199544.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_nic_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_nic_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.nic.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_nic_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Extract Biomarkers and Their Results author: John Snow Labs name: ner_oncology_biomarker_healthcare_pipeline date: 2023-03-08 tags: [licensed, clinical, oncology, en, ner, biomarker] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_oncology_biomarker_healthcare](https://nlp.johnsnowlabs.com/2023/01/11/ner_oncology_biomarker_healthcare_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_biomarker_healthcare_pipeline_en_4.3.0_3.2_1678269721297.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_biomarker_healthcare_pipeline_en_4.3.0_3.2_1678269721297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_oncology_biomarker_healthcare_pipeline", "en", "clinical/models") text = '''he results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA), Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87%.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_oncology_biomarker_healthcare_pipeline", "en", "clinical/models") val text = "he results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA), Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87%." val result = pipeline.fullAnnotate(text) ```
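The pipeline's NER converter stage merges token-level IOB predictions into the entity chunks shown in the results. A minimal pure-Python sketch of that merging logic; the tokens and tags below are invented, loosely mirroring the biomarker example:

```python
def iob_to_chunks(tokens, tags):
    # Merge IOB-tagged tokens into (chunk_text, label) pairs.
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = [tag[2:], [tok]]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)
        else:  # "O" or an inconsistent I- tag closes the open chunk
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(" ".join(toks), label) for label, toks in chunks]

tokens = ["tested", "negative", "for", "CK7", "and", "chromogranin", "A"]
tags = ["O", "B-Biomarker_Result", "O", "B-Biomarker", "O",
        "B-Biomarker", "I-Biomarker"]
print(iob_to_chunks(tokens, tags))
# [('negative', 'Biomarker_Result'), ('CK7', 'Biomarker'), ('chromogranin A', 'Biomarker')]
```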
## Results ```bash | | chunks | begin | end | entities | confidence | |---:|:-----------------------------------------|--------:|------:|:-----------------|-------------:| | 0 | negative | 69 | 76 | Biomarker_Result | 1 | | 1 | CK7 | 82 | 84 | Biomarker | 1 | | 2 | synaptophysin | 87 | 99 | Biomarker | 1 | | 3 | Syn | 102 | 104 | Biomarker | 0.9999 | | 4 | chromogranin A | 108 | 121 | Biomarker | 0.99855 | | 5 | CgA | 124 | 126 | Biomarker | 1 | | 6 | Muc5AC | 130 | 135 | Biomarker | 0.9999 | | 7 | human epidermal growth factor receptor-2 | 138 | 177 | Biomarker | 0.99994 | | 8 | HER2 | 180 | 183 | Biomarker | 1 | | 9 | Muc6 | 191 | 194 | Biomarker | 1 | | 10 | positive | 197 | 204 | Biomarker_Result | 0.9997 | | 11 | CK20 | 210 | 213 | Biomarker | 1 | | 12 | Muc1 | 216 | 219 | Biomarker | 1 | | 13 | Muc2 | 222 | 225 | Biomarker | 1 | | 14 | E-cadherin | 228 | 237 | Biomarker | 0.9997 | | 15 | p53 | 244 | 246 | Biomarker | 1 | | 16 | Ki-67 index | 253 | 263 | Biomarker | 0.99865 | | 17 | 87% | 275 | 277 | Biomarker_Result | 0.828 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_biomarker_healthcare_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|533.1 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English RobertaForQuestionAnswering Large Cased model (from phiyodr) author: John Snow Labs name: roberta_qa_large_finetuned_squad2 date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to 
provide scalability and production-readiness using Spark NLP. `roberta-large-finetuned-squad2` is an English model originally trained by `phiyodr`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_large_finetuned_squad2_en_4.2.4_3.0_1669987751520.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_large_finetuned_squad2_en_4.2.4_3.0_1669987751520.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_finetuned_squad2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_finetuned_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_large_finetuned_squad2| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/phiyodr/roberta-large-finetuned-squad2 - https://rajpurkar.github.io/SQuAD-explorer/ - https://arxiv.org/abs/1907.11692 - https://arxiv.org/abs/1806.03822 - https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json - https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/ --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from sasuke) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_squad1 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad1` is an English model originally trained by `sasuke`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad1_en_4.3.0_3.0_1672773467636.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad1_en_4.3.0_3.0_1672773467636.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(False) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_squad1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/sasuke/distilbert-base-uncased-finetuned-squad1 --- layout: model title: Legal Warrant Agreement Document Binary Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_warrant_agreement_bert date: 2022-12-18 tags: [en, legal, classification, licensed, document, bert, warrant, agreement, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_warrant_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `warrant-agreement` or not (binary classification). Unlike the Longformer-based variant, this model is lighter, with faster inference. ## Predicted Entities `warrant-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_warrant_agreement_bert_en_1.0.0_3.0_1671393844438.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_warrant_agreement_bert_en_1.0.0_3.0_1671393844438.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_warrant_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
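The per-class scores reported for this model are combined into macro and weighted averages in the usual way: macro weights each class equally, while weighted weights each class by its support. A minimal pure-Python sketch using this model's rounded per-class F1 scores (because the inputs are rounded, the weighted figure can differ in the last digit from the officially reported one):

```python
def averages(per_class):
    # per_class maps label -> (f1, support).
    n = sum(sup for _, sup in per_class.values())
    macro = sum(f1 for f1, _ in per_class.values()) / len(per_class)
    weighted = sum(f1 * sup for f1, sup in per_class.values()) / n
    return round(macro, 2), round(weighted, 2)

scores = {"other": (0.98, 204), "warrant-agreement": (0.96, 83)}
print(averages(scores))  # (0.97, 0.97)
```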
## Results ```bash +-------------------+ |result | +-------------------+ |[warrant-agreement]| |[other] | |[other] | |[warrant-agreement]| +-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_warrant_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.98 0.99 0.98 204 warrant-agreement 0.96 0.95 0.96 83 accuracy - - 0.98 287 macro-avg 0.97 0.97 0.97 287 weighted-avg 0.98 0.98 0.98 287 ``` --- layout: model title: Pipeline to Detect Anatomical Regions (MedicalBertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_anatomy_pipeline date: 2023-03-20 tags: [anatomy, bertfortokenclassification, ner, en, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_anatomy](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_anatomy_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_anatomy_pipeline_en_4.3.0_3.2_1679306174114.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_anatomy_pipeline_en_4.3.0_3.2_1679306174114.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_anatomy_pipeline", "en", "clinical/models") text = '''This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now. General: Well-developed female, in no acute distress, afebrile. HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist. Neck: No lymphadenopathy. Chest: Clear. Abdomen: Positive bowel sounds and soft. Dermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_anatomy_pipeline", "en", "clinical/models") val text = "This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now. 
General: Well-developed female, in no acute distress, afebrile. HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist. Neck: No lymphadenopathy. Chest: Clear. Abdomen: Positive bowel sounds and soft. Dermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.anatomy_pipeline").predict("""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now. General: Well-developed female, in no acute distress, afebrile. HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist. Neck: No lymphadenopathy. Chest: Clear. Abdomen: Positive bowel sounds and soft. Dermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-----------------|--------:|------:|:-----------------------|-------------:| | 0 | great | 320 | 324 | Multi-tissue_structure | 0.693343 | | 1 | toe | 326 | 328 | Multi-tissue_structure | 0.378996 | | 2 | skin | 374 | 377 | Organ | 0.946453 | | 3 | conjunctivae | 554 | 565 | Multi-tissue_structure | 0.929193 | | 4 | Extraocular | 574 | 584 | Multi-tissue_structure | 0.858331 | | 5 | muscles | 586 | 592 | Organ | 0.670788 | | 6 | Nares | 613 | 617 | Multi-tissue_structure | 0.573931 | | 7 | turbinates | 659 | 668 | Multi-tissue_structure | 0.947797 | | 8 | Oropharynx | 683 | 692 | Multi-tissue_structure | 0.458301 | | 9 | Mucous membranes | 716 | 731 | Tissue | 0.811466 | | 10 | Neck | 744 | 747 | Organism_subdivision | 0.879527 | | 11 | bowel | 802 | 806 | Organ | 0.919502 | | 12 | great | 875 | 879 | Multi-tissue_structure | 0.701514 | | 13 | toe | 881 | 883 | Multi-tissue_structure | 0.264513 | | 14 | skin | 933 | 936 | Organ | 0.925361 | | 15 | toenails | 943 | 950 | Organism_subdivision | 0.674937 | | 16 | foot | 999 | 1002 | Organism_subdivision | 0.544587 | | 17 | great | 1017 | 1021 | Multi-tissue_structure | 0.818323 | | 18 | toe | 1023 | 1025 | Organism_subdivision | 0.341098 | | 19 | toenails | 1031 | 1038 | Organism_subdivision | 0.75016 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_anatomy_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.9 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Luo (Kenya and Tanzania) XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_luo_finetuned_ner date: 2022-08-01 tags: [luo, open_source, xlm_roberta, ner] 
task: Named Entity Recognition language: luo edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-luo-finetuned-ner-luo` is a Luo (Kenya and Tanzania) model originally trained by `mbeukman`. ## Predicted Entities `PER`, `DATE`, `ORG`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_luo_finetuned_ner_luo_4.1.0_3.0_1659354450889.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_luo_finetuned_ner_luo_4.1.0_3.0_1659354450889.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_luo_finetuned_ner","luo") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_luo_finetuned_ner","luo") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
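For intuition, the `NerConverter` stage above is what turns the classifier's token-level IOB tags (`B-PER`, `I-PER`, `O`, …) into entity chunks. A minimal pure-Python sketch of that grouping logic — an illustration only, not Spark NLP's actual implementation:

```python
def iob_to_chunks(tokens, tags):
    """Group token-level IOB tags (e.g. B-PER, I-PER, O) into entity chunks.

    Stray I- tags without a matching open chunk are treated like O here.
    """
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = {"text": token, "label": tag[2:]}
        elif tag.startswith("I-") and current and current["label"] == tag[2:]:
            current["text"] += " " + token
        else:
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return chunks

print(iob_to_chunks(["John", "Snow", "lives", "in", "Nairobi"],
                    ["B-PER", "I-PER", "O", "O", "B-LOC"]))
# [{'text': 'John Snow', 'label': 'PER'}, {'text': 'Nairobi', 'label': 'LOC'}]
```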
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_luo_finetuned_ner| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|luo| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-luo-finetuned-ner-luo - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://github.com/masakhane-io/masakhane-ner --- layout: model title: English image_classifier_vit_cifar10 ViTForImageClassification from alfredcs author: John Snow Labs name: image_classifier_vit_cifar10 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_cifar10` is an English model originally trained by alfredcs. ## Predicted Entities `deer`, `bird`, `dog`, `horse`, `automobile`, `truck`, `frog`, `ship`, `airplane`, `cat` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_cifar10_en_4.1.0_3.0_1660167465918.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_cifar10_en_4.1.0_3.0_1660167465918.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_cifar10", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_cifar10", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
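The `class` output above is the label whose softmax probability is highest across the ten CIFAR-10 classes listed under Predicted Entities. As a toy illustration of that final step — the label order and logits below are invented for the example, not taken from the model:

```python
import math

# Hypothetical label order -- the real model stores its own id-to-label mapping.
LABELS = ["airplane", "automobile", "bird", "cat", "deer",
          "dog", "frog", "horse", "ship", "truck"]

def predict_label(logits):
    """Softmax the raw class logits and return the top label with its probability."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

label, prob = predict_label([0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.2, 0.1, 0.3, 0.1])
print(label)  # cat
```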
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_cifar10| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English DistilBertForQuestionAnswering Base Cased model (from ncduy) author: John Snow Labs name: distilbert_qa_base_cased_led_squad_finetuned_test date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-finetuned-squad-test` is an English model originally trained by `ncduy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_finetuned_test_en_4.3.0_3.0_1672766662366.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_finetuned_test_en_4.3.0_3.0_1672766662366.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_finetuned_test","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_finetuned_test","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
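For intuition, extractive QA models of this kind score every context token as a candidate answer start and answer end, and the predicted answer is the span maximizing the combined score. A toy sketch of that selection step with invented scores (not the model's real logits):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (i, j) maximizing start_scores[i] + end_scores[j] with i <= j."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score, best = s + end_scores[j], (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end   = [0.1, 0.1, 0.1, 4.8, 0.2, 0.1, 0.1, 0.1, 0.4, 0.1]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```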
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_cased_led_squad_finetuned_test| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ncduy/distilbert-base-cased-distilled-squad-finetuned-squad-test --- layout: model title: Indonesian Part of Speech Tagger (from w11wo) author: John Snow Labs name: roberta_pos_indonesian_roberta_base_posp_tagger date: 2022-05-03 tags: [roberta, pos, part_of_speech, id, open_source] task: Part of Speech Tagging language: id edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `indonesian-roberta-base-posp-tagger` is an Indonesian model originally trained by `w11wo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_pos_indonesian_roberta_base_posp_tagger_id_3.4.2_3.0_1651596272433.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_pos_indonesian_roberta_base_posp_tagger_id_3.4.2_3.0_1651596272433.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_pos_indonesian_roberta_base_posp_tagger","id") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Saya suka Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_pos_indonesian_roberta_base_posp_tagger","id") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Saya suka Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("id.pos.indonesian_roberta_base_posp_tagger").predict("""Saya suka Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_pos_indonesian_roberta_base_posp_tagger| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|id| |Size:|466.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/w11wo/indonesian-roberta-base-posp-tagger - https://arxiv.org/abs/1907.11692 - https://hf.co/flax-community/indonesian-roberta-base - https://hf.co/datasets/indonlu - https://w11wo.github.io/ --- layout: model title: Multilingual Representations for Indian Languages (MuRIL) - BERT Sentence Embedding pre-trained on 17 Indian languages author: John Snow Labs name: sent_bert_muril date: 2021-09-01 tags: [xx, open_source, sentence_embeddings, muril, indian_languages] task: Embeddings language: xx edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses a BERT base architecture pretrained from scratch using the Wikipedia, Common Crawl, PMINDIA and Dakshina corpora for the following 17 Indian languages: `Assamese`, `Bengali`, `English`, `Gujarati`, `Hindi`, `Kannada`, `Kashmiri`, `Malayalam`, `Marathi`, `Nepali`, `Oriya`, `Punjabi`, `Sanskrit`, `Sindhi`, `Tamil`, `Telugu`, `Urdu`. The MuRIL model is pre-trained on monolingual segments as well as parallel segments, as detailed below: - Monolingual Data: Publicly available corpora from Wikipedia and Common Crawl for 17 Indian languages. - Parallel Data: There are two types of parallel data: - Translated Data: Translations of the above monolingual corpora obtained using the Google NMT pipeline. Translated segment pairs were fed as input; the publicly available PMINDIA corpus was also used. - Transliterated Data: Transliterations of Wikipedia obtained using the IndicTrans library.
Transliterated segment pairs were fed as input; the publicly available Dakshina dataset was also used. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_muril_xx_3.2.0_3.0_1630467991919.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_muril_xx_3.2.0_3.0_1630467991919.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \ .setInputCols(["document"]) \ .setOutputCol("sentence") sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_muril", "xx") \ .setInputCols("sentence") \ .setOutputCol("bert_sentence") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings]) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_muril", "xx") .setInputCols("sentence") .setOutputCol("bert_sentence") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings)) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] sent_embeddings_df = nlu.load('en.embed_sentence.bert.muril').predict(text, output_level='sentence') sent_embeddings_df ```
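The `bert_sentence` vectors produced above are typically compared with cosine similarity, e.g. to match a sentence against its translation in another of the 17 supported languages. A minimal sketch of that comparison:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0
```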
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_bert_muril| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[bert_sentence]| |Language:|xx| |Case sensitive:|false| ## Data Source [1]: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018. [2]: [Wikipedia](https://www.tensorflow.org/datasets/catalog/wikipedia) [3]: [Common Crawl](http://commoncrawl.org/the-data/) [4]: [PMINDIA](http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual/index.html) [5]: [Dakshina](https://github.com/google-research-datasets/dakshina) The model is imported from: https://tfhub.dev/google/MuRIL/1 --- layout: model title: Sentence Detection in Punjabi Text author: John Snow Labs name: sentence_detector_dl date: 2021-08-30 tags: [pa, open_source, sentence_detection] task: Sentence Detection language: pa edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_pa_3.2.0_3.0_1630320087911.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_pa_3.2.0_3.0_1630320087911.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "pa") \ .setInputCols(["document"]) \ .setOutputCol("sentences") sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL])) sd_model.fullAnnotate("""ਅੰਗਰੇਜ਼ੀ ਪੜ੍ਹਨ ਦੇ ਪੈਰਾਗ੍ਰਾਫਾਂ ਦੇ ਇੱਕ ਮਹਾਨ ਸਰੋਤ ਦੀ ਭਾਲ ਕਰ ਰਹੇ ਹੋ? ਤੁਸੀਂ ਸਹੀ ਜਗ੍ਹਾ ਤੇ ਆਏ ਹੋ. ਇੱਕ ਤਾਜ਼ਾ ਅਧਿਐਨ ਅਨੁਸਾਰ ਅੱਜ ਦੇ ਨੌਜਵਾਨਾਂ ਵਿੱਚ ਪੜ੍ਹਨ ਦੀ ਆਦਤ ਤੇਜ਼ੀ ਨਾਲ ਘਟ ਰਹੀ ਹੈ। ਉਹ ਕੁਝ ਸਕਿੰਟਾਂ ਤੋਂ ਵੱਧ ਸਮੇਂ ਲਈ ਦਿੱਤੇ ਗਏ ਅੰਗਰੇਜ਼ੀ ਪੜ੍ਹਨ ਵਾਲੇ ਪੈਰੇ 'ਤੇ ਧਿਆਨ ਨਹੀਂ ਦੇ ਸਕਦੇ! ਨਾਲ ਹੀ, ਪੜ੍ਹਨਾ ਸਾਰੀਆਂ ਪ੍ਰਤੀਯੋਗੀ ਪ੍ਰੀਖਿਆਵਾਂ ਦਾ ਇੱਕ ਅਨਿੱਖੜਵਾਂ ਅੰਗ ਸੀ ਅਤੇ ਹੈ. ਇਸ ਲਈ, ਤੁਸੀਂ ਆਪਣੇ ਪੜ੍ਹਨ ਦੇ ਹੁਨਰ ਨੂੰ ਕਿਵੇਂ ਸੁਧਾਰਦੇ ਹੋ? ਇਸ ਪ੍ਰਸ਼ਨ ਦਾ ਉੱਤਰ ਅਸਲ ਵਿੱਚ ਇੱਕ ਹੋਰ ਪ੍ਰਸ਼ਨ ਹੈ: ਪੜ੍ਹਨ ਦੇ ਹੁਨਰ ਦੀ ਵਰਤੋਂ ਕੀ ਹੈ? ਪੜ੍ਹਨ ਦਾ ਮੁੱਖ ਉਦੇਸ਼ 'ਅਰਥ ਬਣਾਉਣਾ' ਹੈ.""") ``` ```scala val documenter = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "pa") .setInputCols(Array("document")) .setOutputCol("sentence") val pipeline = new Pipeline().setStages(Array(documenter, model)) val data = Seq("ਅੰਗਰੇਜ਼ੀ ਪੜ੍ਹਨ ਦੇ ਪੈਰਾਗ੍ਰਾਫਾਂ ਦੇ ਇੱਕ ਮਹਾਨ ਸਰੋਤ ਦੀ ਭਾਲ ਕਰ ਰਹੇ ਹੋ? ਤੁਸੀਂ ਸਹੀ ਜਗ੍ਹਾ ਤੇ ਆਏ ਹੋ. ਇੱਕ ਤਾਜ਼ਾ ਅਧਿਐਨ ਅਨੁਸਾਰ ਅੱਜ ਦੇ ਨੌਜਵਾਨਾਂ ਵਿੱਚ ਪੜ੍ਹਨ ਦੀ ਆਦਤ ਤੇਜ਼ੀ ਨਾਲ ਘਟ ਰਹੀ ਹੈ। ਉਹ ਕੁਝ ਸਕਿੰਟਾਂ ਤੋਂ ਵੱਧ ਸਮੇਂ ਲਈ ਦਿੱਤੇ ਗਏ ਅੰਗਰੇਜ਼ੀ ਪੜ੍ਹਨ ਵਾਲੇ ਪੈਰੇ 'ਤੇ ਧਿਆਨ ਨਹੀਂ ਦੇ ਸਕਦੇ! ਨਾਲ ਹੀ, ਪੜ੍ਹਨਾ ਸਾਰੀਆਂ ਪ੍ਰਤੀਯੋਗੀ ਪ੍ਰੀਖਿਆਵਾਂ ਦਾ ਇੱਕ ਅਨਿੱਖੜਵਾਂ ਅੰਗ ਸੀ ਅਤੇ ਹੈ. ਇਸ ਲਈ, ਤੁਸੀਂ ਆਪਣੇ ਪੜ੍ਹਨ ਦੇ ਹੁਨਰ ਨੂੰ ਕਿਵੇਂ ਸੁਧਾਰਦੇ ਹੋ? ਇਸ ਪ੍ਰਸ਼ਨ ਦਾ ਉੱਤਰ ਅਸਲ ਵਿੱਚ ਇੱਕ ਹੋਰ ਪ੍ਰਸ਼ਨ ਹੈ: ਪੜ੍ਹਨ ਦੇ ਹੁਨਰ ਦੀ ਵਰਤੋਂ ਕੀ ਹੈ? ਪੜ੍ਹਨ ਦਾ ਮੁੱਖ ਉਦੇਸ਼ 'ਅਰਥ ਬਣਾਉਣਾ' ਹੈ.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python nlu.load('pa.sentence_detector').predict("ਅੰਗਰੇਜ਼ੀ ਪੜ੍ਹਨ ਦੇ ਪੈਰਾਗ੍ਰਾਫਾਂ ਦੇ ਇੱਕ ਮਹਾਨ ਸਰੋਤ ਦੀ ਭਾਲ ਕਰ ਰਹੇ ਹੋ? ਤੁਸੀਂ ਸਹੀ ਜਗ੍ਹਾ ਤੇ ਆਏ ਹੋ. 
ਇੱਕ ਤਾਜ਼ਾ ਅਧਿਐਨ ਅਨੁਸਾਰ ਅੱਜ ਦੇ ਨੌਜਵਾਨਾਂ ਵਿੱਚ ਪੜ੍ਹਨ ਦੀ ਆਦਤ ਤੇਜ਼ੀ ਨਾਲ ਘਟ ਰਹੀ ਹੈ। ਉਹ ਕੁਝ ਸਕਿੰਟਾਂ ਤੋਂ ਵੱਧ ਸਮੇਂ ਲਈ ਦਿੱਤੇ ਗਏ ਅੰਗਰੇਜ਼ੀ ਪੜ੍ਹਨ ਵਾਲੇ ਪੈਰੇ 'ਤੇ ਧਿਆਨ ਨਹੀਂ ਦੇ ਸਕਦੇ! ਨਾਲ ਹੀ, ਪੜ੍ਹਨਾ ਸਾਰੀਆਂ ਪ੍ਰਤੀਯੋਗੀ ਪ੍ਰੀਖਿਆਵਾਂ ਦਾ ਇੱਕ ਅਨਿੱਖੜਵਾਂ ਅੰਗ ਸੀ ਅਤੇ ਹੈ. ਇਸ ਲਈ, ਤੁਸੀਂ ਆਪਣੇ ਪੜ੍ਹਨ ਦੇ ਹੁਨਰ ਨੂੰ ਕਿਵੇਂ ਸੁਧਾਰਦੇ ਹੋ? ਇਸ ਪ੍ਰਸ਼ਨ ਦਾ ਉੱਤਰ ਅਸਲ ਵਿੱਚ ਇੱਕ ਹੋਰ ਪ੍ਰਸ਼ਨ ਹੈ: ਪੜ੍ਹਨ ਦੇ ਹੁਨਰ ਦੀ ਵਰਤੋਂ ਕੀ ਹੈ? ਪੜ੍ਹਨ ਦਾ ਮੁੱਖ ਉਦੇਸ਼ 'ਅਰਥ ਬਣਾਉਣਾ' ਹੈ.", output_level ='sentence') ```
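As a point of comparison, a purely rule-based splitter for Punjabi would break on the danda (।) plus the Western terminators. The naive sketch below shows why a learned model is preferable — rules like this misfire on abbreviations, quoted punctuation, and unpunctuated boundaries:

```python
import re

def naive_split(text):
    """Split after a danda (।), '.', '?' or '!' that is followed by whitespace."""
    parts = re.split(r"(?<=[।.?!])\s+", text.strip())
    return [p for p in parts if p]

print(len(naive_split("ਤੁਸੀਂ ਸਹੀ ਜਗ੍ਹਾ ਤੇ ਆਏ ਹੋ. ਪੜ੍ਹਨ ਦੇ ਹੁਨਰ ਦੀ ਵਰਤੋਂ ਕੀ ਹੈ? ਠੀਕ ਹੈ।")))  # 3
```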
## Results ```bash +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |result | +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |[ਅੰਗਰੇਜ਼ੀ ਪੜ੍ਹਨ ਦੇ ਪੈਰਾਗ੍ਰਾਫਾਂ ਦੇ ਇੱਕ ਮਹਾਨ ਸਰੋਤ ਦੀ ਭਾਲ ਕਰ ਰਹੇ ਹੋ?] | |[ਤੁਸੀਂ ਸਹੀ ਜਗ੍ਹਾ ਤੇ ਆਏ ਹੋ.] | |[ਇੱਕ ਤਾਜ਼ਾ ਅਧਿਐਨ ਅਨੁਸਾਰ ਅੱਜ ਦੇ ਨੌਜਵਾਨਾਂ ਵਿੱਚ ਪੜ੍ਹਨ ਦੀ ਆਦਤ ਤੇਜ਼ੀ ਨਾਲ ਘਟ ਰਹੀ ਹੈ। ਉਹ ਕੁਝ ਸਕਿੰਟਾਂ ਤੋਂ ਵੱਧ ਸਮੇਂ ਲਈ ਦਿੱਤੇ ਗਏ ਅੰਗਰੇਜ਼ੀ ਪੜ੍ਹਨ ਵਾਲੇ ਪੈਰੇ 'ਤੇ ਧਿਆਨ ਨਹੀਂ ਦੇ ਸਕਦੇ!]| |[ਨਾਲ ਹੀ, ਪੜ੍ਹਨਾ ਸਾਰੀਆਂ ਪ੍ਰਤੀਯੋਗੀ ਪ੍ਰੀਖਿਆਵਾਂ ਦਾ ਇੱਕ ਅਨਿੱਖੜਵਾਂ ਅੰਗ ਸੀ ਅਤੇ ਹੈ.] | |[ਇਸ ਲਈ, ਤੁਸੀਂ ਆਪਣੇ ਪੜ੍ਹਨ ਦੇ ਹੁਨਰ ਨੂੰ ਕਿਵੇਂ ਸੁਧਾਰਦੇ ਹੋ?] | |[ਇਸ ਪ੍ਰਸ਼ਨ ਦਾ ਉੱਤਰ ਅਸਲ ਵਿੱਚ ਇੱਕ ਹੋਰ ਪ੍ਰਸ਼ਨ ਹੈ:] | |[ਪੜ੍ਹਨ ਦੇ ਹੁਨਰ ਦੀ ਵਰਤੋਂ ਕੀ ਹੈ?] | |[ਪੜ੍ਹਨ ਦਾ ਮੁੱਖ ਉਦੇਸ਼ 'ਅਰਥ ਬਣਾਉਣਾ' ਹੈ.] | +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentence_detector_dl| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[sentences]| |Language:|pa| ## Benchmarking ```bash label Accuracy Recall Prec F1 0 0.98 1.00 0.96 0.98 ``` --- layout: model title: English asr_wav2vec2_large_robust_libri_960h TFWav2Vec2ForCTC from facebook author: John Snow Labs name: pipeline_asr_wav2vec2_large_robust_libri_960h date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and 
production-readiness using Spark NLP. `asr_wav2vec2_large_robust_libri_960h` is an English model originally trained by facebook. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_robust_libri_960h_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_robust_libri_960h_en_4.2.0_3.0_1664039514827.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_robust_libri_960h_en_4.2.0_3.0_1664039514827.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_robust_libri_960h', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_robust_libri_960h", lang = "en") val annotations = pipeline.transform(audioDF) ```
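The `transform` call above assumes `audioDF` already holds the raw audio as an array of floats, which is what the pipeline's `AudioAssembler` stage consumes. Assuming 16-bit mono PCM WAV input (resampling to the model's expected rate is out of scope here), a stdlib-only sketch of producing that float array:

```python
import struct
import wave

def load_wav_floats(path):
    """Read a 16-bit mono PCM WAV file and normalize samples to [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2 or wf.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        raw = wf.readframes(wf.getnframes())
    # '<h' = little-endian signed 16-bit, one per sample
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [s / 32768.0 for s in samples]
```

The resulting list can then be loaded into a single-column Spark DataFrame (e.g. `spark.createDataFrame([(floats,)], ["audio_content"])` — column name assumed per the AudioAssembler examples elsewhere on this hub) before calling the pipeline.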
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_robust_libri_960h| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|757.6 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English asr_wav2vec2_xls_r_300m_german_english TFWav2Vec2ForCTC from aware-ai author: John Snow Labs name: pipeline_asr_wav2vec2_xls_r_300m_german_english date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_german_english` is an English model originally trained by aware-ai. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xls_r_300m_german_english_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_german_english_en_4.2.0_3.0_1664111961756.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_german_english_en_4.2.0_3.0_1664111961756.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_300m_german_english', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_300m_german_english", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xls_r_300m_german_english| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Clean patterns pipeline for English author: John Snow Labs name: clean_pattern date: 2022-07-06 tags: [open_source, english, clean_pattern, pipeline, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The clean_pattern is a pretrained pipeline that performs basic text processing steps and recognizes entities. It covers most of the common text processing tasks on your DataFrame. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/clean_pattern_en_4.0.0_3.0_1657137560119.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/clean_pattern_en_4.0.0_3.0_1657137560119.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('clean_pattern', lang = 'en') annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("clean_pattern", lang = "en") val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs ! "] result_df = nlu.load('en.clean.pattern').predict(text) result_df ```
## Results ```bash | | document | sentence | token | normal | |---:|:-----------|:-----------|:----------|:----------| | 0 | ['Hello'] | ['Hello'] | ['Hello'] | ['Hello'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clean_pattern| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|28.8 KB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - NormalizerModel --- layout: model title: Bulgarian RobertaForMaskedLM Base Cased model (from iarfmoose) author: John Snow Labs name: roberta_embeddings_base_bulgarian date: 2022-12-12 tags: [bg, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: bg edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bulgarian` is a Bulgarian model originally trained by `iarfmoose`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_base_bulgarian_bg_4.2.4_3.0_1670859176755.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_base_bulgarian_bg_4.2.4_3.0_1670859176755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_base_bulgarian","bg") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_base_bulgarian","bg")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
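After the pipeline runs, each token row in `result` carries a dense vector in the `embeddings` column. As a minimal sketch (plain Python, no Spark required; `vec_a` and `vec_b` are hypothetical stand-ins for two token vectors pulled out of the result), such vectors can be compared with cosine similarity:

```python
import math

def cosine_similarity(u, v):
    # Compare two embedding vectors; returns a value in [-1, 1].
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 4-dimensional stand-ins for two token vectors; the real
# model produces 768-dimensional vectors per token.
vec_a = [0.1, 0.3, -0.2, 0.7]
vec_b = [0.2, 0.1, -0.1, 0.6]
print(cosine_similarity(vec_a, vec_b))
```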
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_base_bulgarian|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|bg|
|Size:|473.8 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/iarfmoose/roberta-base-bulgarian
- https://arxiv.org/abs/1907.11692
- https://oscar-corpus.com/
- https://wortschatz.uni-leipzig.de/en/download/bulgarian

---
layout: model
title: Pipeline to Detect Clinical Conditions (ner_eu_clinical_case - eu)
author: John Snow Labs
name: ner_eu_clinical_condition_pipeline
date: 2023-03-07
tags: [eu, clinical, licensed, ner, clinical_condition]
task: Named Entity Recognition
language: eu
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [ner_eu_clinical_condition](https://nlp.johnsnowlabs.com/2023/02/06/ner_eu_clinical_condition_eu.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_pipeline_eu_4.3.0_3.2_1678213509285.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_pipeline_eu_4.3.0_3.2_1678213509285.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_eu_clinical_condition_pipeline", "eu", "clinical/models") text = " Gertaera honetatik bi hilabetetara, umea Larrialdietako Zerbitzura dator 4 egunetan zehar buruko mina eta bekokiko hantura azaltzeagatik, sukarrik izan gabe. Miaketan, haztapen mingarria duen bekokiko hantura bigunaz gain, ez da beste zeinurik azaltzen. Polakiuria eta tenesmo arina ere izan zuen egun horretan hematuriarekin batera. Geroztik sintomarik gabe dago. " result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_eu_clinical_condition_pipeline", "eu", "clinical/models") val text = " Gertaera honetatik bi hilabetetara, umea Larrialdietako Zerbitzura dator 4 egunetan zehar buruko mina eta bekokiko hantura azaltzeagatik, sukarrik izan gabe. Miaketan, haztapen mingarria duen bekokiko hantura bigunaz gain, ez da beste zeinurik azaltzen. Polakiuria eta tenesmo arina ere izan zuen egun horretan hematuriarekin batera. Geroztik sintomarik gabe dago. " val result = pipeline.fullAnnotate(text) ```
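`fullAnnotate` returns one dictionary per input text, keyed by pipeline output column. A minimal post-processing sketch (hedged: plain dicts stand in for the `Annotation` objects Spark NLP actually returns, and `ner_chunk` is assumed to be the pipeline's chunk column) shows how rows like those in the Results section can be assembled:

```python
def chunk_rows(annotations, chunk_col="ner_chunk"):
    # Flatten NER chunk annotations into plain rows with the chunk text,
    # character offsets, entity label, and model confidence.
    rows = []
    for ann in annotations.get(chunk_col, []):
        rows.append({
            "chunk": ann["result"],
            "begin": ann["begin"],
            "end": ann["end"],
            "entity": ann["metadata"].get("entity"),
            "confidence": float(ann["metadata"].get("confidence", 0.0)),
        })
    return rows

# Dict stand-in for one fullAnnotate() result row.
sample = {"ner_chunk": [
    {"result": "mina", "begin": 98, "end": 101,
     "metadata": {"entity": "clinical_condition", "confidence": "0.8754"}},
]}
print(chunk_rows(sample))
```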
## Results

```bash
|    | chunks     |   begin |   end | entities           |   confidence |
|---:|:-----------|--------:|------:|:-------------------|-------------:|
|  0 | mina       |      98 |   101 | clinical_condition |       0.8754 |
|  1 | hantura    |     116 |   122 | clinical_condition |       0.8877 |
|  2 | sukarrik   |     139 |   146 | clinical_condition |       0.9119 |
|  3 | mingarria  |     178 |   186 | clinical_condition |       0.7381 |
|  4 | hantura    |     203 |   209 | clinical_condition |       0.8805 |
|  5 | Polakiuria |     256 |   265 | clinical_condition |       0.6683 |
|  6 | sintomarik |     345 |   354 | clinical_condition |       0.9632 |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_eu_clinical_condition_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|eu|
|Size:|1.1 GB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel

---
layout: model
title: English image_classifier_vit_anomaly ViTForImageClassification from hafidber
author: John Snow Labs
name: image_classifier_vit_anomaly
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_anomaly` is an English model originally trained by hafidber.
## Predicted Entities `abnormal`, `normal` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_anomaly_en_4.1.0_3.0_1660169901656.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_anomaly_en_4.1.0_3.0_1660169901656.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_anomaly", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_anomaly", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
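The snippets above assume an existing `imageDF`; in Spark that is typically produced by the image data source, e.g. `spark.read.format("image").load(folder)`. A small stdlib-only helper (a sketch, not part of Spark NLP) for gathering the image files to point that reader at:

```python
from pathlib import Path

def collect_image_paths(folder, exts=(".jpg", ".jpeg", ".png")):
    # Recursively list image files under `folder`, sorted for reproducible
    # ordering; pass the folder (or these paths) to
    # spark.read.format("image").load(...) to build imageDF.
    return sorted(str(p) for p in Path(folder).rglob("*")
                  if p.suffix.lower() in exts)
```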
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_anomaly| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Legal Supplemental Indenture Document Binary Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_supplemental_indenture_agreement_bert date: 2022-12-18 tags: [en, legal, classification, licensed, document, bert, supplemental, indenture, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_supplemental_indenture_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `supplemental-indenture` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `supplemental-indenture`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_supplemental_indenture_agreement_bert_en_1.0.0_3.0_1671393857190.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_supplemental_indenture_agreement_bert_en_1.0.0_3.0_1671393857190.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_supplemental_indenture_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+------------------------+
|result                  |
+------------------------+
|[supplemental-indenture]|
|[other]                 |
|[other]                 |
|[supplemental-indenture]|
+------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_supplemental_indenture_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|

## References

Legal documents, scraped from the Internet, and classified in-house + SEC documents

## Benchmarking

```bash
                 label  precision  recall  f1-score  support
                 other        0.97    0.95      0.96      204
supplemental-indenture        0.91    0.95      0.93      111
              accuracy           -       -      0.95      315
             macro-avg        0.94    0.95      0.95      315
          weighted-avg        0.95    0.95      0.95      315
```

---
layout: model
title: English image_classifier_vit_deit_flyswot ViTForImageClassification from davanstrien
author: John Snow Labs
name: image_classifier_vit_deit_flyswot
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_deit_flyswot` is an English model originally trained by davanstrien.

## Predicted Entities

`EDGE + SPINE`, `OTHER`, `PAGE + FOLIO`, `FLYSHEET`, `CONTAINER`, `CONTROL SHOT`, `COVER`, `SCROLL`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_deit_flyswot_en_4.1.0_3.0_1660166402706.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_deit_flyswot_en_4.1.0_3.0_1660166402706.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_deit_flyswot", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_deit_flyswot", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_deit_flyswot|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.8 MB|

---
layout: model
title: English image_classifier_vit_llama_alpaca_snake ViTForImageClassification from osanseviero
author: John Snow Labs
name: image_classifier_vit_llama_alpaca_snake
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_llama_alpaca_snake` is an English model originally trained by osanseviero.

## Predicted Entities

`alpaca`, `llamas`, `snake`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_llama_alpaca_snake_en_4.1.0_3.0_1660170191761.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_llama_alpaca_snake_en_4.1.0_3.0_1660170191761.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_llama_alpaca_snake", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_llama_alpaca_snake", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_llama_alpaca_snake|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|

---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab66 TFWav2Vec2ForCTC from hassnain
author: John Snow Labs
name: asr_wav2vec2_base_timit_demo_colab66
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab66` is an English model originally trained by hassnain.

NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab66_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab66_en_4.2.0_3.0_1664024830506.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab66_en_4.2.0_3.0_1664024830506.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab66", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab66", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
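The example above assumes an `audioDf` whose `audio_content` column holds the raw waveform as an array of floats. A stdlib-only sketch of decoding a 16-bit PCM mono WAV file into that shape (an assumption about the expected layout; resampling to the model's sampling rate is not handled here):

```python
import struct
import wave

def wav_to_floats(path):
    # Decode 16-bit PCM mono WAV frames into floats in [-1.0, 1.0),
    # roughly the shape AudioAssembler's "audio_content" column expects.
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2 and w.getnchannels() == 1
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]
```

`spark.createDataFrame([[wav_to_floats("sample.wav")]]).toDF("audio_content")` would then give a single-column DataFrame to fit the pipeline on (again a sketch; `sample.wav` is a hypothetical file).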
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_demo_colab66|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|355.0 MB|

---
layout: model
title: Translate English to Tetela Pipeline
author: John Snow Labs
name: translate_en_tll
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, tll, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `en`
- target languages: `tll`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_tll_xx_2.7.0_2.4_1609699282699.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_tll_xx_2.7.0_2.4_1609699282699.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_tll", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_tll", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.tll').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_tll| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering Base Uncased model (from ksabeh) author: John Snow Labs name: bert_qa_base_uncased_attribute_correction_mlm_titles date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-attribute-correction-mlm-titles` is a English model originally trained by `ksabeh`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_attribute_correction_mlm_titles_en_4.0.0_3.0_1657183812655.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_attribute_correction_mlm_titles_en_4.0.0_3.0_1657183812655.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_attribute_correction_mlm_titles","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_attribute_correction_mlm_titles","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
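After `transform`, each row's `answer` column holds the predicted span. A hedged sketch of pulling the answer string out of a collected row (plain dicts stand in for the Annotation structs Spark NLP actually returns; the `answer` key mirrors the `setOutputCol` name above):

```python
def first_answer(row, answer_col="answer"):
    # Return the first predicted answer string, or None when the model
    # produced no span (as SQuAD2-style models may for unanswerable questions).
    answers = row.get(answer_col, [])
    return answers[0]["result"] if answers else None

# Dict stand-in for one collected result row.
sample_row = {"answer": [{"result": "Clara", "begin": 11, "end": 15}]}
print(first_answer(sample_row))
```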
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_attribute_correction_mlm_titles| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/ksabeh/bert-base-uncased-attribute-correction-mlm-titles --- layout: model title: Lewotobi RobertaForQuestionAnswering (from 21iridescent) author: John Snow Labs name: roberta_qa_distilroberta_base_finetuned_squad2_lwt date: 2022-06-20 tags: [open_source, question_answering, roberta] task: Question Answering language: lwt edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-base-finetuned-squad2-lwt` is a Lewotobi model originally trained by `21iridescent`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_distilroberta_base_finetuned_squad2_lwt_lwt_4.0.0_3.0_1655728304909.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_distilroberta_base_finetuned_squad2_lwt_lwt_4.0.0_3.0_1655728304909.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_distilroberta_base_finetuned_squad2_lwt","lwt") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = RoBertaForQuestionAnswering
    .pretrained("roberta_qa_distilroberta_base_finetuned_squad2_lwt","lwt")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("lwt.answer_question.squadv2.roberta.distilled_base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_distilroberta_base_finetuned_squad2_lwt| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|lwt| |Size:|307.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/21iridescent/distilroberta-base-finetuned-squad2-lwt --- layout: model title: French BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_fr_cased date: 2022-12-02 tags: [fr, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: fr edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-fr-cased` is a French model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_fr_cased_fr_4.2.4_3.0_1670017613286.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_fr_cased_fr_4.2.4_3.0_1670017613286.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_fr_cased","fr") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_fr_cased","fr")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_fr_cased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|fr|
|Size:|393.6 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/Geotrend/bert-base-fr-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers

---
layout: model
title: Pipeline to Detect Chemicals in Medical text (BertForTokenClassification)
author: John Snow Labs
name: bert_token_classifier_ner_chemicals_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, berfortokenclassification, chemicals, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [bert_token_classifier_ner_chemicals](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_chemicals_en.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemicals_pipeline_en_3.4.1_3.0_1647889424974.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemicals_pipeline_en_3.4.1_3.0_1647889424974.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("bert_token_classifier_ner_chemicals_pipeline", "en", "clinical/models") pipeline.annotate("The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.") ``` ```scala val pipeline = new PretrainedPipeline("bert_token_classifier_ner_chemicals_pipeline", "en", "clinical/models") pipeline.annotate("The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.chemicals_pipeline").predict("""The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.""") ```
## Results

```bash
+---------------------------+---------+
|chunk                      |ner_label|
+---------------------------+---------+
|p - choloroaniline         |CHEM     |
|chlorhexidine - digluconate|CHEM     |
|kanamycin                  |CHEM     |
|colistin                   |CHEM     |
|povidone - iodine          |CHEM     |
+---------------------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_chemicals_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|404.7 MB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverter

---
layout: model
title: English image_classifier_vit_vision_transformer_v3 ViTForImageClassification from mrgiraffe
author: John Snow Labs
name: image_classifier_vit_vision_transformer_v3
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_vision_transformer_v3` is an English model originally trained by mrgiraffe.

## Predicted Entities

`chart`, `imagechart`, `notchart`, `pdfpagechart`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vision_transformer_v3_en_4.1.0_3.0_1660168553159.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vision_transformer_v3_en_4.1.0_3.0_1660168553159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_vision_transformer_v3", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_vision_transformer_v3", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_vision_transformer_v3| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Hindi XlmRoBertaForQuestionAnswering (from bhavikardeshna) author: John Snow Labs name: xlm_roberta_qa_xlm_roberta_base_hindi date: 2022-06-23 tags: [hi, open_source, question_answering, xlmroberta] task: Question Answering language: hi edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-hindi` is a Hindi model originally trained by `bhavikardeshna`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_hindi_hi_4.0.0_3.0_1655990042129.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_hindi_hi_4.0.0_3.0_1655990042129.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_base_hindi","hi") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_base_hindi","hi") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("hi.answer_question.xlm_roberta.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
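As the `nlu` snippet above shows, the one-liner API expects the question and the context joined into a single string by a `|||` separator. A minimal pure-Python sketch of that payload convention (the `split_qa` helper is illustrative, not part of the nlu library):

```python
# Hypothetical helper illustrating the "question|||context" payload format
# used by nlu's question-answering predict() calls; split_qa is not an nlu API.
def split_qa(payload, sep="|||"):
    question, _, context = payload.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
```

In the Spark NLP pipeline itself, the question and context travel as two separate DataFrame columns instead, as the `MultiDocumentAssembler` example above shows.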
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_roberta_base_hindi| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|hi| |Size:|885.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/bhavikardeshna/xlm-roberta-base-hindi --- layout: model title: Finnish asr_wav2vec2_xlsr_train_aug_bigLM_1B TFWav2Vec2ForCTC from RASMUS author: John Snow Labs name: asr_wav2vec2_xlsr_train_aug_bigLM_1B date: 2022-09-25 tags: [wav2vec2, fi, audio, open_source, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_train_aug_bigLM_1B` is a Finnish model originally trained by RASMUS. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_xlsr_train_aug_bigLM_1B_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_train_aug_bigLM_1B_fi_4.2.0_3.0_1664097486875.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_train_aug_bigLM_1B_fi_4.2.0_3.0_1664097486875.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xlsr_train_aug_bigLM_1B", "fi")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xlsr_train_aug_bigLM_1B", "fi") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xlsr_train_aug_bigLM_1B| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|fi| |Size:|3.6 GB| --- layout: model title: German asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s779 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s779 date: 2022-09-26 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s779` is a German model originally trained by jonatasgrosman. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s779_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s779_de_4.2.0_3.0_1664191745283.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s779_de_4.2.0_3.0_1664191745283.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s779", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s779", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s779| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: Legal Enforceability Clause Binary Classifier author: John Snow Labs name: legclf_enforceability_clause date: 2022-09-28 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `enforceability` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
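The paragraph-splitting pre-processing recommended above can be sketched in plain Python, independently of the Spark NLP splitters covered in the linked tutorial (the `split_paragraphs` helper and the sample clauses are illustrative, not a Spark NLP API):

```python
import re

# Illustrative pre-processing sketch: split a long legal document into
# paragraph-sized chunks (separated by blank lines) so each chunk can be
# classified independently; split_paragraphs is not a Spark NLP API.
def split_paragraphs(text):
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Clause 1. This Agreement shall be enforceable...\n\nClause 2. Payment terms..."
chunks = split_paragraphs(doc)  # feed each chunk to the clause classifier
```

Each resulting chunk can then be placed in the `clause_text` column consumed by the pipeline shown in the "How to use" section below.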
## Predicted Entities `other`, `enforceability` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_enforceability_clause_en_1.0.0_3.0_1664363141173.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_enforceability_clause_en_1.0.0_3.0_1664363141173.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_enforceability_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[enforceability]| |[other]| |[other]| |[enforceability]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_enforceability_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support enforceability 0.87 0.89 0.88 38 other 0.95 0.94 0.94 78 accuracy - - 0.92 116 macro-avg 0.91 0.92 0.91 116 weighted-avg 0.92 0.92 0.92 116 ``` --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4CHEMD_Chem_Modified_BlueBERT_512 date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Chem-Modified-BlueBERT-512` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Modified_BlueBERT_512_en_4.0.0_3.0_1657108398297.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Modified_BlueBERT_512_en_4.0.0_3.0_1657108398297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Modified_BlueBERT_512","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Modified_BlueBERT_512","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4CHEMD_Chem_Modified_BlueBERT_512| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|408.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4CHEMD-Chem-Modified-BlueBERT-512 --- layout: model title: English BertForQuestionAnswering model (from rahulkuruvilla) author: John Snow Labs name: bert_qa_COVID_BERTa date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `COVID-BERTa` is an English model originally trained by `rahulkuruvilla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_COVID_BERTa_en_4.0.0_3.0_1654176515744.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_COVID_BERTa_en_4.0.0_3.0_1654176515744.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_COVID_BERTa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_COVID_BERTa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.covid_bert.a.by_rahulkuruvilla").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_COVID_BERTa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/rahulkuruvilla/COVID-BERTa --- layout: model title: German asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman` is a German model originally trained by jonatasgrosman. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman_de_4.2.0_3.0_1664096062320.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman_de_4.2.0_3.0_1664096062320.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Translate English to Dravidian languages Pipeline author: John Snow Labs name: translate_en_dra date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, dra, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `dra` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_dra_xx_2.7.0_2.4_1609698809869.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_dra_xx_2.7.0_2.4_1609698809869.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_dra", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_dra", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.dra').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_dra| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Hindi Bert Embeddings (from Geotrend) author: John Snow Labs name: bert_embeddings_bert_base_hi_cased date: 2022-04-11 tags: [bert, embeddings, hi, open_source] task: Embeddings language: hi edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-hi-cased` is a Hindi model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_hi_cased_hi_3.4.2_3.0_1649673139297.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_hi_cased_hi_3.4.2_3.0_1649673139297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_hi_cased","hi") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["मुझे स्पार्क एनएलपी पसंद है"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_hi_cased","hi") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("मुझे स्पार्क एनएलपी पसंद है").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("hi.embed.bert_hi_cased").predict("""मुझे स्पार्क एनएलपी पसंद है""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_hi_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|hi| |Size:|339.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-hi-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Mapping HCPCS Codes with Corresponding National Drug Codes (NDC) and Drug Brand Names author: John Snow Labs name: hcpcs_ndc_mapper date: 2023-04-13 tags: [en, licensed, chunk_mapping, hcpcs, ndc, brand_name] task: Chunk Mapping language: en edition: Healthcare NLP 4.4.0 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps HCPCS codes with their corresponding National Drug Codes (NDC) and their drug brand names. ## Predicted Entities `ndc_code`, `brand_name` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/hcpcs_ndc_mapper_en_4.4.0_3.0_1681405950608.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/hcpcs_ndc_mapper_en_4.4.0_3.0_1681405950608.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("hcpcs_chunk") chunkerMapper = DocMapperModel.pretrained("hcpcs_ndc_mapper", "en", "clinical/models")\ .setInputCols(["hcpcs_chunk"])\ .setOutputCol("mappings")\ .setRels(["ndc_code", "brand_name"]) pipeline = Pipeline().setStages([document_assembler, chunkerMapper]) model = pipeline.fit(spark.createDataFrame([['']]).toDF('text')) lp = LightPipeline(model) res = lp.fullAnnotate(["Q5106", "J9211", "J7508"]) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("hcpcs_chunk") val chunkerMapper = DocMapperModel .pretrained("hcpcs_ndc_mapper", "en", "clinical/models") .setInputCols(Array("hcpcs_chunk")) .setOutputCol("mappings") .setRels(Array("ndc_code", "brand_name")) val mapper_pipeline = new Pipeline().setStages(Array( document_assembler, chunkerMapper)) val data = Seq("Q5106", "J9211", "J7508").toDF("text") val result = mapper_pipeline.fit(data).transform(data) ```
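Conceptually, the mapper behaves like a lookup from an HCPCS code to its labelled relations. A minimal pure-Python sketch using the mappings shown in this card's example output (the dict and the `map_code` helper are illustrative, not the model's actual lookup store):

```python
# Illustrative lookup mirroring the mapper's behaviour for the example codes;
# values are taken from this card's sample output, not a full HCPCS table.
hcpcs_mappings = {
    "Q5106": {"ndc_code": "59353-0003-10", "brand_name": "RETACRIT (PF) 3000 U/1 ML"},
    "J9211": {"ndc_code": "59762-2596-01", "brand_name": "IDARUBICIN HYDROCHLORIDE (PF) 1 MG/ML"},
    "J7508": {"ndc_code": "00469-0687-73", "brand_name": "ASTAGRAF XL 5 MG"},
}

def map_code(code, relation):
    # Returns None for codes or relations the table does not cover.
    return hcpcs_mappings.get(code, {}).get(relation)
```

The real model applies the same code-to-relation lookup at scale inside the Spark pipeline above, emitting one `mappings` annotation per relation.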
## Results ```bash +-----------+-------------------------------------+----------+ |hcpcs_chunk|mappings |relation | +-----------+-------------------------------------+----------+ |Q5106 |59353-0003-10 |ndc_code | |Q5106 |RETACRIT (PF) 3000 U/1 ML |brand_name| |J9211 |59762-2596-01 |ndc_code | |J9211 |IDARUBICIN HYDROCHLORIDE (PF) 1 MG/ML|brand_name| |J7508 |00469-0687-73 |ndc_code | |J7508 |ASTAGRAF XL 5 MG |brand_name| +-----------+-------------------------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|hcpcs_ndc_mapper| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|20.7 KB| --- layout: model title: German asr_wav2vec2_large_xlsr_53_german_by_oliverguhr TFWav2Vec2ForCTC from oliverguhr author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_german_by_oliverguhr date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_large_xlsr_53_german_by_oliverguhr` is a German model originally trained by oliverguhr. 
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_german_by_oliverguhr_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_german_by_oliverguhr_de_4.2.0_3.0_1664104534440.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_german_by_oliverguhr_de_4.2.0_3.0_1664104534440.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_german_by_oliverguhr', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_german_by_oliverguhr", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_german_by_oliverguhr| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal Information Technology And Data Processing Document Classifier (EURLEX) author: John Snow Labs name: legclf_information_technology_and_data_processing_bert date: 2023-03-06 tags: [en, legal, classification, clauses, information_technology_and_data_processing, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_information_technology_and_data_processing_bert model, a Bert Sentence Embeddings Document Classifier, classifies whether the document belongs to the class Information_Technology_and_Data_Processing or not (Binary Classification), according to EuroVoc labels. ## Predicted Entities `Information_Technology_and_Data_Processing`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_information_technology_and_data_processing_bert_en_1.0.0_3.0_1678111859916.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_information_technology_and_data_processing_bert_en_1.0.0_3.0_1678111859916.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_information_technology_and_data_processing_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
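The accuracy reported in the Benchmarking section of this card equals the support-weighted average of the per-class recalls, and the weighted-avg row weights each class's score by its support. A quick sanity check over the reported figures (values copied from the table; rounding to two decimals matches the report):

```python
# Per-class recall and support, as reported in the benchmarking table.
recalls = {"Information_Technology_and_Data_Processing": 0.80, "Other": 0.85}
supports = {"Information_Technology_and_Data_Processing": 153, "Other": 141}

total = sum(supports.values())  # 294 evaluation documents
# Accuracy = correct predictions / total = support-weighted average of recall.
accuracy = sum(recalls[c] * supports[c] for c in recalls) / total
print(round(accuracy, 2))  # 0.82, matching the reported accuracy row
```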
## Results ```bash +-------+ |result| +-------+ |[Information_Technology_and_Data_Processing]| |[Other]| |[Other]| |[Information_Technology_and_Data_Processing]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_information_technology_and_data_processing_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.7 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Information_Technology_and_Data_Processing 0.85 0.80 0.82 153 Other 0.79 0.85 0.82 141 accuracy - - 0.82 294 macro-avg 0.82 0.82 0.82 294 weighted-avg 0.83 0.82 0.82 294 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18) author: John Snow Labs name: distilbert_qa_base_uncased_becas_7 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-becas-7` is an English model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_7_en_4.3.0_3.0_1672767624756.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_7_en_4.3.0_3.0_1672767624756.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_7","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_7","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_becas_7| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Evelyn18/distilbert-base-uncased-becas-7 --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from clisi2000) author: John Snow Labs name: xlmroberta_ner_clisi2000_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `clisi2000`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_clisi2000_base_finetuned_panx_de_4.1.0_3.0_1660431788167.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_clisi2000_base_finetuned_panx_de_4.1.0_3.0_1660431788167.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_clisi2000_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_clisi2000_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_clisi2000_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/clisi2000/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: English DistilBertForQuestionAnswering model (from minhdang241) author: John Snow Labs name: distilbert_qa_robustqa_baseline_01 date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `robustqa-baseline-01` is an English model originally trained by `minhdang241`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_robustqa_baseline_01_en_4.0.0_3.0_1654728524154.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_robustqa_baseline_01_en_4.0.0_3.0_1654728524154.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_robustqa_baseline_01","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_robustqa_baseline_01","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.base.by_minhdang241").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
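The NLU one-liner above joins the question and its context with a `|||` separator in a single string. A minimal helper for building and splitting such strings (the helper names are illustrative, not part of the NLU API):

```python
SEP = "|||"

def join_qa(question, context):
    """Join a question and its context in the '|||' format shown in the NLU example."""
    return f"{question}{SEP}{context}"

def split_qa(text):
    """Split a '|||'-joined string back into (question, context)."""
    question, _, context = text.partition(SEP)
    return question, context

q, c = split_qa(join_qa("What is my name?", "My name is Clara and I live in Berkeley."))
```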
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_robustqa_baseline_01| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/minhdang241/robustqa-baseline-01 --- layout: model title: French CamemBert Embeddings (from yancong) author: John Snow Labs name: camembert_embeddings_yancong_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `yancong`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_yancong_generic_model_fr_3.4.4_3.0_1653990825081.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_yancong_generic_model_fr_3.4.4_3.0_1653990825081.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_yancong_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_yancong_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
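Once the pipeline above runs, each token carries a fixed-size vector in the `embeddings` column. A common downstream step is comparing two tokens or chunks by cosine similarity; a minimal sketch on plain Python lists (the vectors here are toy values, not actual CamemBERT output):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors standing in for token embeddings.
v1 = [0.2, 0.1, -0.4, 0.3]
v2 = [0.2, 0.1, -0.4, 0.3]
print(cosine_similarity(v1, v2))  # identical vectors give similarity ~1.0
```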
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_yancong_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/yancong/dummy-model --- layout: model title: Chinese BertForMaskedLM Cased model (from qinluo) author: John Snow Labs name: bert_embeddings_wo_chinese_plus date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `wobert-chinese-plus` is a Chinese model originally trained by `qinluo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_wo_chinese_plus_zh_4.2.4_3.0_1670023089360.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_wo_chinese_plus_zh_4.2.4_3.0_1670023089360.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_wo_chinese_plus","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_wo_chinese_plus","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_wo_chinese_plus| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|467.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/qinluo/wobert-chinese-plus - https://github.com/ZhuiyiTechnology/WoBERT - https://github.com/JunnYu/WoBERT_pytorch --- layout: model title: English DistilBertForQuestionAnswering model (from abhilash1910) author: John Snow Labs name: distilbert_qa_squadv1 date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-squadv1` is an English model originally trained by `abhilash1910`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_squadv1_en_4.0.0_3.0_1654727758712.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_squadv1_en_4.0.0_3.0_1654727758712.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squadv1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squadv1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.by_abhilash1910").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_squadv1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/abhilash1910/distilbert-squadv1 --- layout: model title: Sentence Entity Resolver for UMLS CUI Codes (Drug & Substance) author: John Snow Labs name: sbiobertresolve_umls_drug_substance date: 2021-12-06 tags: [entity_resolution, en, clinical, licensed] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.3 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities to UMLS CUI codes. It is trained on `2021AB` UMLS dataset. The complete dataset has 127 different categories, and this model is trained on the `Clinical Drug`, `Pharmacologic Substance`, `Antibiotic`, `Hazardous or Poisonous Substance` categories using `sbiobert_base_cased_mli` embeddings. ## Predicted Entities `Predicts UMLS codes for Drugs & Substances medical concepts` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_drug_substance_en_3.3.3_3.0_1638802613409.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_drug_substance_en_3.3.3_3.0_1638802613409.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentenceDetector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") stopwords = StopWordsCleaner.pretrained()\ .setInputCols("token")\ .setOutputCol("cleanTokens")\ .setCaseSensitive(False) word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "cleanTokens"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "cleanTokens", "ner"]) \ .setOutputCol("ner_chunk") chunk2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli",'en','clinical/models')\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings").setCaseSensitive(False) resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_umls_drug_substance","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") pipeline = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver]) data = spark.createDataFrame([[""]]).toDF("text") model = LightPipeline(pipeline.fit(data)) results = model.fullAnnotate(['Dilaudid', 'Hydromorphone', 'Exalgo', 'Palladone', 'Hydrogen peroxide 30 mg', 'Neosporin Cream', 'Magnesium hydroxide 100mg/1ml', 'Metformin 1000 mg']) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() 
.setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val stopwords = StopWordsCleaner.pretrained() .setInputCols("token") .setOutputCol("cleanTokens") .setCaseSensitive(false) val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "cleanTokens")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "cleanTokens", "ner")) .setOutputCol("ner_chunk") val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en","clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("sbert_embeddings") val resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_umls_drug_substance", "en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver)) val data = Seq("""'Dilaudid', 'Hydromorphone', 'Exalgo', 'Palladone', 'Hydrogen peroxide 30 mg', 'Neosporin Cream', 'Magnesium hydroxide 100mg/1ml', 'Metformin 1000 mg'""").toDS().toDF("text") val res = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.umls_drug_substance").predict("""Magnesium hydroxide 100mg/1ml""") ```
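Each row of the results table below carries parallel ranked lists of candidate CUIs (`all_k_codes`) and their descriptions (`all_k_code_desc`). A minimal post-processing sketch pairing them up (values copied from the Dilaudid row of this card; variable names are illustrative):

```python
# Ranked candidate lists as returned by the resolver for the chunk "Dilaudid"
# (values copied from the results table in this card).
all_k_codes = ["C0728755", "C0719907", "C1448344", "C0305924", "C1569295"]
all_k_descs = ["dilaudid", "Dilaudid HP", "Disthelm", "Dilaudid Injection", "Distaph"]

# Pair each CUI with its description, preserving the resolver's ranking.
ranked = list(zip(all_k_codes, all_k_descs))
best_code, best_desc = ranked[0]  # top-1 resolution
```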
## Results ```bash | | chunk | code | code_description | all_k_code_desc | all_k_codes | |---:|:------------------------------|:---------|:---------------------------|:-------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | 0 | Dilaudid | C0728755 | dilaudid | ['C0728755', 'C0719907', 'C1448344', 'C0305924', 'C1569295'] | ['dilaudid', 'Dilaudid HP', 'Disthelm', 'Dilaudid Injection', 'Distaph'] | | 1 | Hydromorphone | C0012306 | HYDROMORPHONE | ['C0012306', 'C0700533', 'C1646274', 'C1170495', 'C0498841'] | ['HYDROMORPHONE', 'Hydromorphone HCl', 'Phl-HYDROmorphone', 'PMS HYDROmorphone', 'Hydromorphone injection'] | | 2 | Exalgo | C2746500 | Exalgo | ['C2746500', 'C0604734', 'C1707065', 'C0070591', 'C3660437'] | ['Exalgo', 'exaltolide', 'Exelgyn', 'Extacol', 'exserohilone'] | | 3 | Palladone | C0730726 | palladone | ['C0730726', 'C0594402', 'C1655349', 'C0069952', 'C2742475'] | ['palladone', 'Palladone-SR', 'Palladone IR', 'palladiazo', 'palladia'] | | 4 | Hydrogen peroxide 30 mg | C1126248 | hydrogen peroxide 30 MG/ML | ['C1126248', 'C0304655', 'C1605252', 'C0304656', 'C1154260'] | ['hydrogen peroxide 30 MG/ML', 'Hydrogen peroxide solution 30%', 'hydrogen peroxide 30 MG/ML [Proxacol]', 'Hydrogen peroxide 30 mg/mL cutaneous solution', 'benzoyl peroxide 30 MG/ML'] | | 5 | Neosporin Cream | C0132149 | Neosporin Cream | ['C0132149', 'C0306959', 'C4722788', 'C0704071', 'C0698988'] | ['Neosporin Cream', 'Neosporin Ointment', 'Neomycin Sulfate Cream', 'Neosporin Topical Ointment', 'Naseptin cream'] | | 6 | Magnesium hydroxide 100mg/1ml | C1134402 | magnesium hydroxide 100 MG | ['C1134402', 'C1126785', 'C4317023', 'C4051486', 'C4047137'] | ['magnesium hydroxide 100 MG', 'magnesium hydroxide 100 MG/ML', 'Magnesium sulphate 100mg/mL injection', 'magnesium sulfate 100 MG', 'magnesium 
sulfate 100 MG/ML'] | | 7 | Metformin 1000 mg | C0987664 | metformin 1000 MG | ['C0987664', 'C2719784', 'C0978482', 'C2719786', 'C4282269'] | ['metformin 1000 MG', 'metFORMIN hydrochloride 1000 MG', 'METFORMIN HCL 1000MG TAB', 'metFORMIN hydrochloride 1000 MG [Fortamet]', 'METFORMIN HCL 1000MG SA TAB'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_umls_drug_substance| |Compatibility:|Healthcare NLP 3.3.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_chunk_embeddings]| |Output Labels:|[output]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on the `2021AB` UMLS dataset’s `Clinical Drug`, `Pharmacologic Substance`, `Antibiotic`, `Hazardous or Poisonous Substance` categories. https://www.nlm.nih.gov/research/umls/index.html --- layout: model title: Italian Embeddings (Base, Wines description) author: John Snow Labs name: bert_embeddings_wineberto_italian_cased date: 2022-04-11 tags: [bert, embeddings, it, open_source] task: Embeddings language: it edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `wineberto-italian-cased` is an Italian model originally trained by `vinhood`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_wineberto_italian_cased_it_3.4.2_3.0_1649676965822.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_wineberto_italian_cased_it_3.4.2_3.0_1649676965822.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_wineberto_italian_cased","it") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Adoro Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_wineberto_italian_cased","it") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Adoro Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("it.embed.wineberto_italian_cased").predict("""Adoro Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_wineberto_italian_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|it| |Size:|415.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/vinhood/wineberto-italian-cased - https://twitter.com/denocris - https://www.linkedin.com/in/cristiano-de-nobili/ - https://www.vinhood.com/en/ --- layout: model title: Legal Purchase price Clause Binary Classifier author: John Snow Labs name: legclf_purchase_price_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `purchase-price` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
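Paragraph splitting by multiline, mentioned above, can be as simple as breaking the raw document on blank lines before feeding each chunk to the classifier; a minimal sketch (the tutorial linked above covers more robust techniques):

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines, dropping empty chunks and trimming whitespace."""
    chunks = re.split(r"\n\s*\n", text)
    return [c.strip() for c in chunks if c.strip()]

# Hypothetical two-clause document; each chunk would be classified separately.
doc = "Clause 1. The purchase price shall be...\n\nClause 2. The closing shall occur..."
paragraphs = split_paragraphs(doc)
```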
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `purchase-price` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_purchase_price_clause_en_1.0.0_3.2_1660123871318.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_purchase_price_clause_en_1.0.0_3.2_1660123871318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_purchase_price_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +----------------+ | result| +----------------+ |[purchase-price]| |[other]| |[other]| |[purchase-price]| +----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_purchase_price_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.99 0.99 0.99 81 purchase-price 0.98 0.98 0.98 47 accuracy - - 0.98 128 macro-avg 0.98 0.98 0.98 128 weighted-avg 0.98 0.98 0.98 128 ``` --- layout: model title: Pipeline to Detect PHI for deidentification purposes author: John Snow Labs name: ner_deid_subentity_augmented_i2b2_pipeline date: 2023-03-13 tags: [deid, ner, phi, deidentification, licensed, i2b2, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deid_subentity_augmented_i2b2](https://nlp.johnsnowlabs.com/2021/11/29/ner_deid_subentity_augmented_i2b2_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_augmented_i2b2_pipeline_en_4.3.0_3.2_1678735152629.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_augmented_i2b2_pipeline_en_4.3.0_3.2_1678735152629.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_subentity_augmented_i2b2_pipeline", "en", "clinical/models") text = '''Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_subentity_augmented_i2b2_pipeline", "en", "clinical/models") val text = "Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.deid.subentity_ner_augmented_i2b2.pipeline").predict("""Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.""") ```
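The `begin`/`end` offsets in the pipeline's output are inclusive character indices into the input text. A quick plain-Python sanity check on a prefix of the sample text above:

```python
text = ("Record date : 2093-01-13 , David Hale , M.D . , Name : "
        "Hendrickson Ora , MR # 7194334 Date : 01/13/93 .")

def chunk_at(text, begin, end):
    """Spark NLP annotation offsets are inclusive, so slice to end + 1."""
    return text[begin:end + 1]

print(chunk_at(text, 14, 23))  # 2093-01-13
print(chunk_at(text, 27, 36))  # David Hale
print(chunk_at(text, 55, 69))  # Hendrickson Ora
```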
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:------------------------------|--------:|------:|:--------------|-------------:| | 0 | 2093-01-13 | 14 | 23 | DATE | 0.9997 | | 1 | David Hale | 27 | 36 | DOCTOR | 0.9507 | | 2 | Hendrickson Ora | 55 | 69 | PATIENT | 0.9981 | | 3 | 7194334 | 78 | 84 | MEDICALRECORD | 0.9996 | | 4 | 01/13/93 | 93 | 100 | DATE | 0.9992 | | 5 | Oliveira | 110 | 117 | DOCTOR | 0.8822 | | 6 | 25 | 121 | 122 | AGE | 0.5648 | | 7 | 2079-11-09 | 150 | 159 | DATE | 0.9995 | | 8 | Cocke County Baptist Hospital | 163 | 191 | HOSPITAL | 0.863775 | | 9 | 0295 Keats Street | 195 | 211 | STREET | 0.754533 | | 10 | 302-786-5227 | 221 | 232 | PHONE | 0.9697 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity_augmented_i2b2_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Chinese Bert Embeddings (Base) author: John Snow Labs name: bert_embeddings_bert_base_chinese_jinyong date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-chinese-jinyong` is a Chinese model originally trained by `yechen`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_chinese_jinyong_zh_3.4.2_3.0_1649670833638.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_chinese_jinyong_zh_3.4.2_3.0_1649670833638.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_chinese_jinyong","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_chinese_jinyong","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.bert_base_chinese_jinyong").predict("""I love Spark NLP""") ```
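Once extracted, token embeddings are typically compared with cosine similarity (e.g. for retrieval or clustering). A self-contained sketch — the toy 4-dimensional vectors below stand in for the model's much higher-dimensional output vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dim vectors standing in for BERT token embeddings:
print(round(cosine([1, 0, 1, 0], [2, 0, 2, 0]), 4))  # 1.0 (parallel vectors)
print(round(cosine([1, 0, 0, 0], [0, 1, 0, 0]), 4))  # 0.0 (orthogonal vectors)
```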
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_chinese_jinyong| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|384.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/yechen/bert-base-chinese-jinyong --- layout: model title: Detect Anatomical Regions author: John Snow Labs name: ner_anatomy_en date: 2020-04-22 task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 2.4.2 spark_version: 2.4 tags: [ner, en, clinical, licensed] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for anatomy terms. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNN. ## Predicted Entities `Anatomical_system`, `Cell`, `Cellular_component`, `Developing_anatomical_structure`, `Immaterial_anatomical_entity`, `Multi-tissue_structure`, `Organ`, `Organism_subdivision`, `Organism_substance`, `Pathological_formation`, `Tissue` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ANATOMY/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_en_2.4.2_2.4_1587513307751.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_en_2.4.2_2.4_1587513307751.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use 
as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPython.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_anatomy", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) results = model.transform(spark.createDataFrame([['This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.']], ["text"])) ``` ```scala ... 
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_anatomy", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now. General: Well-developed female, in no acute distress, afebrile. HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist. Neck: No lymphadenopathy. Chest: Clear. Abdomen: Positive bowel sounds and soft. Dermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ```
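The NerConverter stage at the end of this pipeline groups the model's IOB-tagged tokens into full entity chunks. A simplified plain-Python sketch of that grouping (not the annotator's actual implementation):

```python
def iob_to_chunks(tokens, labels):
    """Group (token, IOB-label) pairs into (chunk_text, entity) tuples."""
    chunks, current, entity = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:  # flush a previous chunk before starting a new one
                chunks.append((" ".join(current), entity))
            current, entity = [tok], lab[2:]
        elif lab.startswith("I-") and current:
            current.append(tok)
        else:  # "O" ends any open chunk
            if current:
                chunks.append((" ".join(current), entity))
            current, entity = [], None
    if current:
        chunks.append((" ".join(current), entity))
    return chunks

print(iob_to_chunks(["Extraocular", "muscles", "intact"],
                    ["B-Organ", "I-Organ", "O"]))
# [('Extraocular muscles', 'Organ')]
```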
{:.h2_title} ## Results The output is a dataframe with a sentence per row and a ``"ner"`` column containing all of the entity labels in the sentence, entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select ``"token.result"`` and ``"ner.result"`` from your output dataframe or add the ``"Finisher"`` to the end of your pipeline. ```bash +-------------------+----------------------+ |chunk |ner | +-------------------+----------------------+ |skin |Organ | |Extraocular muscles|Organ | |turbinates |Multi-tissue_structure| |Mucous membranes |Tissue | |Neck |Organism_subdivision | |bowel |Organ | |skin |Organ | +-------------------+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_anatomy_en_2.4.2_2.4| |Type:|ner| |Compatibility:|Spark NLP 2.4.2| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|[en]| |Case sensitive:|false| {:.h2_title} ## Data Source Trained on the Anatomical Entity Mention (AnEM) corpus with ``'embeddings_clinical'``. 
http://www.nactem.ac.uk/anatomy/ {:.h2_title} ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|----------------------------------:|-----:|-----:|-----:|---------:|---------:|---------:| | 0 | B-Immaterial_anatomical_entity | 4 | 0 | 1 | 1 | 0.8 | 0.888889 | | 1 | B-Cellular_component | 14 | 4 | 7 | 0.777778 | 0.666667 | 0.717949 | | 2 | B-Organism_subdivision | 21 | 7 | 3 | 0.75 | 0.875 | 0.807692 | | 3 | I-Cell | 47 | 8 | 5 | 0.854545 | 0.903846 | 0.878505 | | 4 | B-Tissue | 14 | 2 | 10 | 0.875 | 0.583333 | 0.7 | | 5 | B-Anatomical_system | 5 | 1 | 3 | 0.833333 | 0.625 | 0.714286 | | 6 | B-Organism_substance | 26 | 2 | 8 | 0.928571 | 0.764706 | 0.83871 | | 7 | B-Cell | 86 | 6 | 11 | 0.934783 | 0.886598 | 0.910053 | | 8 | I-Immaterial_anatomical_entity | 5 | 0 | 0 | 1 | 1 | 1 | | 9 | I-Tissue | 16 | 1 | 6 | 0.941176 | 0.727273 | 0.820513 | | 10 | I-Pathological_formation | 20 | 0 | 1 | 1 | 0.952381 | 0.97561 | | 11 | I-Anatomical_system | 7 | 0 | 0 | 1 | 1 | 1 | | 12 | B-Organ | 30 | 7 | 3 | 0.810811 | 0.909091 | 0.857143 | | 13 | B-Pathological_formation | 35 | 5 | 3 | 0.875 | 0.921053 | 0.897436 | | 14 | I-Cellular_component | 4 | 0 | 3 | 1 | 0.571429 | 0.727273 | | 15 | I-Multi-tissue_structure | 26 | 10 | 6 | 0.722222 | 0.8125 | 0.764706 | | 16 | B-Multi-tissue_structure | 57 | 23 | 8 | 0.7125 | 0.876923 | 0.786207 | | 17 | I-Organism_substance | 6 | 2 | 0 | 0.75 | 1 | 0.857143 | | 18 | Macro-average | 424 | 84 | 88 | 0.731775 | 0.682666 | 0.706368 | | 19 | Micro-average | 424 | 84 | 88 | 0.834646 | 0.828125 | 0.831372 | ``` --- layout: model title: Clinical Deidentification Pipeline (Portuguese) author: John Snow Labs name: clinical_deidentification date: 2022-06-21 tags: [deid, deidentification, pt, licensed] task: [De-identification, Pipeline Healthcare] language: pt edition: Healthcare NLP 3.5.0 spark_version: 3.0 supported: true recommended: true annotator: PipelineModel article_header: type: cover use_language_switcher: 
"Python-Scala-Java" --- ## Description This pipeline is trained with `w2v_cc_300d` Portuguese embeddings and can be used to deidentify PHI information from medical texts in Portuguese. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask, fake or obfuscate the following entities: `AGE`, `DATE`, `PROFESSION`, `EMAIL`, `ID`, `COUNTRY`, `STREET`, `DOCTOR`, `HOSPITAL`, `PATIENT`, `URL`, `IP`, `ORGANIZATION`, `PHONE`, `ZIP`, `ACCOUNT`, `SSN`, `PLATE`, `SEX` and `IPADDR` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_pt_3.5.0_3.0_1655820388743.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_pt_3.5.0_3.0_1655820388743.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "pt", "clinical/models") sample = """Dados do paciente. Nome: Mauro. Apelido: Gonçalves. NIF: 368503. NISS: 26 63514095. Endereço: Calle Miguel Benitez 90. CÓDIGO POSTAL: 28016. Dados de cuidados. Data de nascimento: 03/03/1946. País: Portugal. Idade: 70 anos Sexo: M. Data de admissão: 12/12/2016. Doutor: Ignacio Navarro Cuéllar NºCol: 28 28 70973. Relatório clínico do paciente: Paciente de 70 anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia. Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal. O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV. A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal. Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicéridos de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml. A citologia da urina era repetidamente desconfiada por malignidade. A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis. A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso. 
O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal. A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga. Referido por: Miguel Santos - Avenida dos Aliados, 22 Portugal E-mail: nnavcu@hotmail.com. """ result = deid_pipeline.annotate(sample) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "pt", "clinical/models") val sample = "Dados do paciente. Nome: Mauro. Apelido: Gonçalves. NIF: 368503. NISS: 26 63514095. Endereço: Calle Miguel Benitez 90. CÓDIGO POSTAL: 28016. Dados de cuidados. Data de nascimento: 03/03/1946. País: Portugal. Idade: 70 anos Sexo: M. Data de admissão: 12/12/2016. Doutor: Ignacio Navarro Cuéllar NºCol: 28 28 70973. Relatório clínico do paciente: Paciente de 70 anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia. Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal. O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV. 
A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal. Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicéridos de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml. A citologia da urina era repetidamente desconfiada por malignidade. A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis. A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso. O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal. A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga. Referido por: Miguel Santos - Avenida dos Aliados, 22 Portugal E-mail: nnavcu@hotmail.com" val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("pt.deid.clinical").predict("""Dados do paciente. Nome: Mauro. Apelido: Gonçalves. NIF: 368503. NISS: 26 63514095. Endereço: Calle Miguel Benitez 90. CÓDIGO POSTAL: 28016. Dados de cuidados. Data de nascimento: 03/03/1946. País: Portugal. Idade: 70 anos Sexo: M. Data de admissão: 12/12/2016. Doutor: Ignacio Navarro Cuéllar NºCol: 28 28 70973. 
Relatório clínico do paciente: Paciente de 70 anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia. Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal. O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV. A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal. Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicéridos de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml. A citologia da urina era repetidamente desconfiada por malignidade. A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis. A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso. O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal. A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga. Referido por: Miguel Santos - Avenida dos Aliados, 22 Portugal E-mail: nnavcu@hotmail.com. """) ```
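The pipeline's masking modes can be pictured with a toy plain-Python sketch over (chunk, label) pairs. The three policies below mimic entity-label masking, length-preserving character masking, and fixed-length masking; this is illustrative only, not the DeIdentificationModel implementation:

```python
def mask_with_label(chunk, label):
    """Replace the chunk with its entity label (illustrative format)."""
    return f"<{label}>"

def mask_with_chars(chunk, label):
    """Length-preserving: brackets around asterisks, total width unchanged."""
    if len(chunk) < 3:
        return "*" * len(chunk)
    return "[" + "*" * (len(chunk) - 2) + "]"

def mask_fixed_length(chunk, label, width=4):
    """Fixed-length: every chunk becomes the same number of asterisks."""
    return "*" * width

print(mask_with_label("Mauro", "PATIENT"))    # <PATIENT>
print(mask_with_chars("Mauro", "PATIENT"))    # [***]
print(mask_fixed_length("Mauro", "PATIENT"))  # ****
```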
## Results ```bash Masked with entity labels ------------------------------ Dados do . Nome: . Apelido: . NIF: . NISS: . Endereço: . CÓDIGO POSTAL: . Dados de cuidados. Data de nascimento: . País: . Idade: anos Sexo: . Data de admissão: . Doutor: Cuéllar NºCol: . Relatório clínico do : de anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia. Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal. O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV. A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal. Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicér de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml. A citologia da urina era repetidamente desconfiada por malignidade. A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis. A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso. O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal. 
A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga. Referido por: - , 22 E-mail: . Masked with chars ------------------------------ Dados do [******]. Nome: [***]. Apelido: [*******]. NIF: [****]. NISS: [*********]. Endereço: [*********************]. CÓDIGO POSTAL: [***]. Dados de cuidados. Data de nascimento: [********]. País: [******]. Idade: ** anos Sexo: *. Data de admissão: [********]. Doutor: [*************] Cuéllar NºCol: ** ** [***]. Relatório clínico do [******]: [******] de ** anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia. Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal. O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV. A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal. Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicér[**] de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml. A citologia da urina era repetidamente desconfiada por malignidade. A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis. A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso. 
O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal. A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga. Referido por: [***********] - [*****************], 22 [******] E-mail: [****************]. Masked with fixed length chars ------------------------------ Dados do ****. Nome: ****. Apelido: ****. NIF: ****. NISS: ****. Endereço: ****. CÓDIGO POSTAL: ****. Dados de cuidados. Data de nascimento: ****. País: ****. Idade: **** anos Sexo: ****. Data de admissão: ****. Doutor: **** Cuéllar NºCol: **** **** ****. Relatório clínico do ****: **** de **** anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia. Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal. O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV. A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal. Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicér**** de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml. A citologia da urina era repetidamente desconfiada por malignidade. 
A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis. A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso. O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal. A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga. Referido por: **** - ****, 22 **** E-mail: ****. Obfuscated ------------------------------ Dados do H.. Nome: Marcos Alves. Apelido: Tiago Santos. NIF: 566-445. NISS: 134544332. Endereço: Rua de Santa María, 100. CÓDIGO POSTAL: 4099. Dados de cuidados. Data de nascimento: 31/03/1946. País: Espanha. Idade: 46 anos Sexo: Mulher. Data de admissão: 06/01/2017. Doutor: Carlos Melo Cuéllar NºCol: 134544332 134544332 124 445 311. Relatório clínico do H.: M. de 46 anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia. Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal. O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV. 
A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal. Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicérHomen de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml. A citologia da urina era repetidamente desconfiada por malignidade. A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis. A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso. O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal. A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga. Referido por: Carlos Melo - Avenida Dos Aliados, 56, 22 Espanha E-mail: maria.prado@jacob.com. 
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.5.0+| |License:|Licensed| |Edition:|Official| |Language:|pt| |Size:|1.3 GB| ## Included Models - nlp.DocumentAssembler - nlp.SentenceDetectorDLModel - nlp.TokenizerModel - nlp.WordEmbeddingsModel - medical.NerModel - nlp.NerConverter - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - nlp.TextMatcherModel - ContextualParserModel - ContextualParserModel - nlp.RegexMatcherModel - nlp.RegexMatcherModel - ChunkMergeModel - medical.DeIdentificationModel - medical.DeIdentificationModel - medical.DeIdentificationModel - medical.DeIdentificationModel - Finisher --- layout: model title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77 TFWav2Vec2ForCTC from emeson77 author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77` is an English model originally trained by emeson77.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77_en_4.2.0_3.0_1664037180334.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77_en_4.2.0_3.0_1664037180334.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
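Both snippets above assume an `audioDf` with an `audio_content` column of float samples, but do not show how to build it. The following is a minimal sketch of preparing such a column in plain Python; the synthetic sine waveform and the 16 kHz mono assumption are illustrative stand-ins (in practice you would load real audio, e.g. with a library such as librosa, before calling `spark.createDataFrame`):

```python
import math

def synth_waveform(freq=440.0, sr=16000, secs=0.5):
    """Generate a mono float waveform in [-1, 1] as a stand-in for real audio.
    Wav2Vec2 models expect 16 kHz mono float samples."""
    return [math.sin(2 * math.pi * freq * t / sr) for t in range(int(sr * secs))]

samples = synth_waveform()
rows = [[samples]]
# With an active SparkSession (assumed, as in the snippets above):
# audioDf = spark.createDataFrame(rows, ["audio_content"])
```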
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Legal Food Technology Document Classifier (EURLEX) author: John Snow Labs name: legclf_food_technology_bert date: 2023-03-06 tags: [en, legal, classification, clauses, food_technology, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_food_technology_bert model, a BERT Sentence Embeddings Document Classifier, determines whether the document belongs to the Food_Technology class or not (Binary Classification) according to EuroVoc labels. ## Predicted Entities `Food_Technology`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_food_technology_bert_en_1.0.0_3.0_1678111847548.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_food_technology_bert_en_1.0.0_3.0_1678111847548.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_food_technology_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Food_Technology]| |[Other]| |[Other]| |[Food_Technology]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_food_technology_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Food_Technology 0.87 0.86 0.86 221 Other 0.83 0.84 0.84 181 accuracy - - 0.85 402 macro-avg 0.85 0.85 0.85 402 weighted-avg 0.85 0.85 0.85 402 ``` --- layout: model title: Vaccine Sentiment Classifier (BioBERT) author: John Snow Labs name: bert_sequence_classifier_vaccine_sentiment date: 2022-07-28 tags: [public_health, vaccine_sentiment, en, licensed, sequence_classification] task: Sentiment Analysis language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true recommended: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [BioBERT](https://nlp.johnsnowlabs.com/2022/07/18/biobert_pubmed_base_cased_v1.2_en_3_0.html)-based sentiment analysis model that can extract information from COVID-19 Vaccine-related tweets. The model predicts whether a tweet contains positive, negative, or neutral sentiments about COVID-19 Vaccines.
## Predicted Entities `neutral`, `positive`, `negative` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_VACCINE_STATUS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4SC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vaccine_sentiment_en_4.0.0_3.0_1658995472179.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vaccine_sentiment_en_4.0.0_3.0_1658995472179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_vaccine_sentiment", "en", "clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) text_list = ['A little bright light for an otherwise dark week. Thanks researchers, and frontline workers. Onwards.', 'People with a history of severe allergic reaction to any component of the vaccine should not take.', '43 million doses of vaccines administrated worldwide...Production capacity of CHINA to reach 4 b'] data = spark.createDataFrame(text_list, StringType()).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_vaccine_sentiment", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq("A little bright light for an otherwise dark week. Thanks researchers, and frontline workers. Onwards.", "People with a history of severe allergic reaction to any component of the vaccine should not take.", "43 million doses of vaccines administrated worldwide...Production capacity of CHINA to reach 4 b").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.bert_sequence_vaccine_sentiment").predict("""A little bright light for an otherwise dark week. 
Thanks researchers, and frontline workers. Onwards.""") ```
## Results ```bash +-----------------------------------------------------------------------------------------------------+----------+ |text |class | +-----------------------------------------------------------------------------------------------------+----------+ |A little bright light for an otherwise dark week. Thanks researchers, and frontline workers. Onwards.|[positive]| |People with a history of severe allergic reaction to any component of the vaccine should not take. |[negative]| |43 million doses of vaccines administrated worldwide...Production capacity of CHINA to reach 4 b |[neutral] | +-----------------------------------------------------------------------------------------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_vaccine_sentiment| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References Curated from several academic and in-house datasets. ## Benchmarking ```bash label precision recall f1-score support neutral 0.82 0.78 0.80 1007 positive 0.88 0.90 0.89 1002 negative 0.83 0.86 0.84 881 accuracy - - 0.85 2890 macro-avg 0.85 0.85 0.85 2890 weighted-avg 0.85 0.85 0.85 2890 ``` --- layout: model title: English BertForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: bert_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`rule_based_only_classfn_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1657191247552.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1657191247552.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/rule_based_only_classfn_epochs_1_shard_1_squad2.0 --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from Leizhang) author: John Snow Labs name: xlmroberta_ner_leizhang_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `Leizhang`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_leizhang_base_finetuned_panx_de_4.1.0_3.0_1660429672416.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_leizhang_base_finetuned_panx_de_4.1.0_3.0_1660429672416.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_leizhang_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_leizhang_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_leizhang_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Leizhang/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: English asr_wav2vec2_xls_r_300m_Turkish_Tr_med TFWav2Vec2ForCTC from emre author: John Snow Labs name: pipeline_asr_wav2vec2_xls_r_300m_Turkish_Tr_med date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_Turkish_Tr_med` is an English model originally trained by emre. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xls_r_300m_Turkish_Tr_med_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_Turkish_Tr_med_en_4.2.0_3.0_1664037839645.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_Turkish_Tr_med_en_4.2.0_3.0_1664037839645.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_300m_Turkish_Tr_med', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_300m_Turkish_Tr_med", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xls_r_300m_Turkish_Tr_med| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Translate English to Salishan languages Pipeline author: John Snow Labs name: translate_en_sal date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, sal, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `sal` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_sal_xx_2.7.0_2.4_1609686644399.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_sal_xx_2.7.0_2.4_1609686644399.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_sal", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_sal", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.sal').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_sal| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Seniority Clause Binary Classifier author: John Snow Labs name: legclf_seniority_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `seniority` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
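As a toy illustration of the paragraph-splitting technique mentioned above (a sketch only; the linked tutorial covers the full set of splitting approaches), splitting by multiline boundaries can be as simple as:

```python
import re

def split_paragraphs(text):
    """Split a document into candidate clause texts on blank-line boundaries."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Seniority clause text here.\n\nGoverning law clause text here."
paragraphs = split_paragraphs(doc)
# Each paragraph would then be fed to the classifier as a separate row.
```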
## Predicted Entities `other`, `seniority` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_seniority_clause_en_1.0.0_3.2_1660123984376.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_seniority_clause_en_1.0.0_3.2_1660123984376.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_seniority_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[seniority]| |[other]| |[other]| |[seniority]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_seniority_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.99 0.96 0.97 91 seniority 0.90 0.97 0.94 37 accuracy - - 0.96 128 macro-avg 0.94 0.96 0.95 128 weighted-avg 0.96 0.96 0.96 128 ``` --- layout: model title: English XlmRoBertaForQuestionAnswering (from teacookies) author: John Snow Labs name: xlm_roberta_qa_autonlp_roberta_base_squad2_24465522 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-roberta-base-squad2-24465522` is an English model originally trained by `teacookies`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465522_en_4.0.0_3.0_1655986870269.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465522_en_4.0.0_3.0_1655986870269.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465522","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465522","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.xlm_roberta.base_24465522.by_teacookies").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_roberta_base_squad2_24465522| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|887.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-roberta-base-squad2-24465522 --- layout: model title: English RobertaForQuestionAnswering (from LucasS) author: John Snow Labs name: roberta_qa_robertaBaseABSA date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `robertaBaseABSA` is an English model originally trained by `LucasS`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_robertaBaseABSA_en_4.0.0_3.0_1655738733590.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_robertaBaseABSA_en_4.0.0_3.0_1655738733590.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_robertaBaseABSA","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_robertaBaseABSA","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_robertaBaseABSA| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|436.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/LucasS/robertaBaseABSA --- layout: model title: English BertForQuestionAnswering model (from horsbug98) author: John Snow Labs name: bert_qa_Part_2_mBERT_Model_E2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Part_2_mBERT_Model_E2` is an English model originally trained by `horsbug98`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Part_2_mBERT_Model_E2_en_4.0.0_3.0_1654178989285.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Part_2_mBERT_Model_E2_en_4.0.0_3.0_1654178989285.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Part_2_mBERT_Model_E2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_Part_2_mBERT_Model_E2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.tydiqa.multi_lingual_bert").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_Part_2_mBERT_Model_E2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/horsbug98/Part_2_mBERT_Model_E2 --- layout: model title: English RobertaForQuestionAnswering Cased model (from comacrae) author: John Snow Labs name: roberta_qa_eda_and_parav3 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-eda-and-parav3` is an English model originally trained by `comacrae`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_eda_and_parav3_en_4.3.0_3.0_1674220091398.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_eda_and_parav3_en_4.3.0_3.0_1674220091398.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_eda_and_parav3","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_eda_and_parav3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_eda_and_parav3| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/comacrae/roberta-eda-and-parav3 --- layout: model title: English image_classifier_vit_dog_food__base_patch16_224_in21k ViTForImageClassification from sasha author: John Snow Labs name: image_classifier_vit_dog_food__base_patch16_224_in21k date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_dog_food__base_patch16_224_in21k` is an English model originally trained by sasha. ## Predicted Entities `dog`, `food` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_dog_food__base_patch16_224_in21k_en_4.1.0_3.0_1660171837844.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_dog_food__base_patch16_224_in21k_en_4.1.0_3.0_1660171837844.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_dog_food__base_patch16_224_in21k", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_dog_food__base_patch16_224_in21k", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
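The `patch16_224` part of the model name encodes the ViT input geometry: 224×224 images are cut into non-overlapping 16×16 patches, each becoming one token in the transformer's input sequence. A quick sanity check of the resulting sequence length (illustrative arithmetic only):

```python
def vit_sequence_length(image_size=224, patch_size=16, cls_token=True):
    # Non-overlapping patches along each axis, flattened into a token sequence;
    # most ViT variants prepend one [CLS] token used for classification.
    patches_per_side = image_size // patch_size
    n_patches = patches_per_side ** 2
    return n_patches + (1 if cls_token else 0)

print(vit_sequence_length())  # 14 * 14 = 196 patches + 1 [CLS] token = 197
```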
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_dog_food__base_patch16_224_in21k| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Sentence Embeddings - sbert medium (tuned) author: John Snow Labs name: sbert_jsl_medium_rxnorm_uncased date: 2022-01-03 tags: [embeddings, clinical, licensed, en] task: Embeddings language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps sentences and documents to a 512-dimensional dense vector space by applying average pooling on top of a BERT model. It is also fine-tuned on the RxNorm dataset to improve generalization on medication-related datasets. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_medium_rxnorm_uncased_en_3.3.4_2.4_1641241051941.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_medium_rxnorm_uncased_en_3.3.4_2.4_1641241051941.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_rxnorm_uncased", "en", "clinical/models")\ .setInputCols("sentence")\ .setOutputCol("sbert_embeddings") ``` ```scala val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_rxnorm_uncased", "en", "clinical/models") .setInputCols("sentence") .setOutputCol("sbert_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed_sentence.bert_uncased.rxnorm").predict("""Put your text here.""") ```
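The model averages BERT token vectors into one 512-dimensional sentence vector; downstream, such vectors are typically compared with cosine similarity (e.g. for RxNorm entity resolution). A stdlib-only sketch of both steps, with toy 3-dimensional vectors standing in for the real 512-dimensional output:

```python
import math

def mean_pool(token_vectors):
    # Average the per-token vectors into a single sentence vector.
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / len(token_vectors) for i in range(dim)]

def cosine(a, b):
    # Cosine similarity: dot product over the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

token_vecs = [[1.0, 0.0, 1.0], [0.0, 2.0, 1.0]]
sent = mean_pool(token_vecs)  # [0.5, 1.0, 1.0]
print(round(cosine(sent, sent), 3))  # 1.0 -- a vector is maximally similar to itself
```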
## Results ```bash Gives a 512-dimensional vector representation of the sentence. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbert_jsl_medium_rxnorm_uncased| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|153.9 MB| |Case sensitive:|false| --- layout: model title: English asr_wav2vec2_large_a TFWav2Vec2ForCTC from yongjian author: John Snow Labs name: asr_wav2vec2_large_a date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_a` is an English model originally trained by yongjian. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_large_a_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_a_en_4.2.0_3.0_1664039889685.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_a_en_4.2.0_3.0_1664039889685.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_a", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_a", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
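Wav2Vec2ForCTC emits a per-frame distribution over characters plus a blank symbol; the transcript in the `text` column comes from collapsing consecutive repeats and dropping blanks. A minimal greedy CTC decode illustrates the idea (toy per-frame labels and a hypothetical `_` blank symbol, not the model's actual vocabulary):

```python
BLANK = "_"  # hypothetical blank symbol for illustration

def ctc_greedy_decode(frame_labels):
    """Collapse consecutive duplicate labels, then drop blanks."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

# Per-frame argmax labels a CTC model might emit for the word "hello";
# the blank between the two l's keeps them from being merged.
frames = ["h", "h", "_", "e", "l", "l", "_", "l", "o", "o"]
print(ctc_greedy_decode(frames))  # -> hello
```

Real decoding operates on logits and often adds a language model; this sketch only shows why frame counts don't map one-to-one onto output characters.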
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_a| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_10_h_512 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-10_H-512` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_10_h_512_zh_4.2.4_3.0_1670021484393.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_10_h_512_zh_4.2.4_3.0_1670021484393.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_10_h_512","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_10_h_512","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_10_h_512| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|161.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-10_H-512 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Detect Assertion Status (assertion_jsl_augmented) author: John Snow Labs name: assertion_jsl_augmented date: 2022-09-15 tags: [licensed, clinical, assertion, en] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The deep neural network architecture for assertion status detection in Spark NLP is based on a BiLSTM framework; it is a modified version of the architecture proposed by Fancellu et al. (Fancellu, Lopez, and Webber 2016). Its goal is to classify the assertions made on given medical concepts as present, absent, or possible in the patient; conditionally present under certain circumstances; hypothetically present at some future point; or mentioned in the patient report but associated with someone else (Uzuner et al. 2011). This model is an augmented version of the [assertion_jsl](https://nlp.johnsnowlabs.com/2021/07/24/assertion_jsl_en.html) model with in-house annotations, and it returns confidence scores with the results.
## Predicted Entities `Present`, `Absent`, `Possible`, `Planned`, `Past`, `Family`, `Hypothetical`, `SomeoneElse` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_jsl_augmented_en_4.1.0_3.0_1663252918565.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_jsl_augmented_en_4.1.0_3.0_1663252918565.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setBlackList(["RelativeDate", "Gender"]) clinical_assertion = AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion ]) text = """Patient had a headache for the last 2 weeks, and appears anxious when she walks fast. No alopecia noted. She denies pain. Her father is paralyzed and it is a stressor for her. She was bullied by her boss and got antidepressant. 
We prescribed sleeping pills for her current insomnia""" data = spark.createDataFrame([[text]]).toDF('text') result = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setBlackList(Array("RelativeDate", "Gender")) val clinical_assertion = AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") .setInputCols(Array("sentence", "ner_chunk", "embeddings")) .setOutputCol("assertion") val nlpPipeline = Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion)) val data= Seq("Patient had a headache for the last 2 weeks, and appears anxious when she walks fast. No alopecia noted. She denies pain. Her father is paralyzed and it is a stressor for her. She was bullied by her boss and got antidepressant. We prescribed sleeping pills for her current insomnia").toDS.toDF("text") val result = nlpPipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.jsl_augmented").predict("""Patient had a headache for the last 2 weeks, and appears anxious when she walks fast. No alopecia noted. She denies pain. Her father is paralyzed and it is a stressor for her. She was bullied by her boss and got antidepressant. 
We prescribed sleeping pills for her current insomnia""") ```
## Results ```bash +--------------+-----+---+-------------------------+-----------+---------+ |ner_chunk |begin|end|ner_label |sentence_id|assertion| +--------------+-----+---+-------------------------+-----------+---------+ |headache |14 |21 |Symptom |0 |Past | |anxious |57 |63 |Symptom |0 |Possible | |alopecia |89 |96 |Disease_Syndrome_Disorder|1 |Absent | |pain |116 |119|Symptom |2 |Absent | |paralyzed |136 |144|Symptom |3 |Family | |antidepressant|212 |225|Drug_Ingredient |4 |Past | |sleeping pills|242 |255|Drug_Ingredient |5 |Planned | |insomnia |273 |280|Symptom |5 |Present | +--------------+-----+---+-------------------------+-----------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_jsl_augmented| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| |Size:|6.5 MB| ## Benchmarking ```bash label precision recall f1-score Absent 0.94 0.93 0.94 Family 0.88 0.91 0.89 Hypothetical 0.85 0.82 0.83 Past 0.89 0.89 0.89 Planned 0.78 0.81 0.80 Possible 0.82 0.82 0.82 Present 0.91 0.93 0.92 SomeoneElse 0.88 0.80 0.84 ``` --- layout: model title: Explain Document pipeline for Swedish (explain_document_lg) author: John Snow Labs name: explain_document_lg date: 2021-03-23 tags: [open_source, swedish, explain_document_lg, pipeline, sv] supported: true task: [Named Entity Recognition, Lemmatization] language: sv edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_lg is a pretrained pipeline that performs basic text processing steps and recognizes entities.
It performs most of the common text processing tasks on your DataFrame. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_sv_3.0.0_3.0_1616520973696.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_lg_sv_3.0.0_3.0_1616520973696.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_lg', lang = 'sv') annotations = pipeline.fullAnnotate("Hej från John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_lg", lang = "sv") val result = pipeline.fullAnnotate("Hej från John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hej från John Snow Labs! "] result_df = nlu.load('sv.explain.lg').predict(text) result_df ```
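The pipeline's `entities` output is produced by merging the token-level IOB `ner` tags into chunks (the job of the NerConverter stage). A minimal, pure-Python sketch of that conversion, using the tags from the sample sentence:

```python
def iob_to_chunks(tokens, tags):
    """Group B-/I- tagged tokens into (text, label) chunks."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close any open chunk before starting a new one
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)  # continue the open chunk
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Hej", "från", "John", "Snow", "Labs!"]
tags = ["O", "O", "B-PER", "I-PER", "I-PER"]
print(iob_to_chunks(tokens, tags))  # [('John Snow Labs!', 'PER')]
```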
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:------------------------------|:-----------------------------|:-----------------------------------------|:-----------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Hej från John Snow Labs! '] | ['Hej från John Snow Labs!'] | ['Hej', 'från', 'John', 'Snow', 'Labs!'] | ['Hej', 'från', 'John', 'Snow', 'Labs!'] | ['NOUN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.0306969992816448,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_lg| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|sv| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Lolaibrin) author: John Snow Labs name: distilbert_qa_lolaibrin_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Lolaibrin`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_lolaibrin_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768653496.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_lolaibrin_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768653496.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_lolaibrin_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_lolaibrin_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_lolaibrin_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Lolaibrin/distilbert-base-uncased-finetuned-squad --- layout: model title: Longformer Base NER Pipeline author: ahmedlone127 name: longformer_base_token_classifier_conll03_pipeline date: 2022-06-14 tags: [ner, longformer, pipeline, conll, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: false article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [longformer_base_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/10/09/longformer_base_token_classifier_conll03_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/longformer_base_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655213912525.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/longformer_base_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655213912525.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("longformer_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I am working at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("longformer_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I am working at John Snow Labs.") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PER | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|longformer_base_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Community| |Language:|en| |Size:|516.0 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - LongformerForTokenClassification - NerConverter - Finisher --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from huxxx657) author: John Snow Labs name: roberta_qa_base_finetuned_scrambled_squad_15_new date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-scrambled-squad-15-new` is an English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_scrambled_squad_15_new_en_4.3.0_3.0_1674216883682.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_scrambled_squad_15_new_en_4.3.0_3.0_1674216883682.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_scrambled_squad_15_new","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_scrambled_squad_15_new","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_finetuned_scrambled_squad_15_new| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-scrambled-squad-15-new --- layout: model title: Fast Neural Machine Translation Model from English to Kwangali author: John Snow Labs name: opus_mt_en_kwn date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, kwn, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `en` - target languages: `kwn` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_kwn_xx_2.7.0_2.4_1609164344098.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_kwn_xx_2.7.0_2.4_1609164344098.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_kwn", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_kwn", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.kwn').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_kwn| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Hindi Named Entity Recognition (from sagorsarker) author: John Snow Labs name: bert_ner_codeswitch_hineng_lid_lince date: 2022-05-09 tags: [bert, ner, token_classification, hi, open_source] task: Named Entity Recognition language: hi edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `codeswitch-hineng-lid-lince` is a Hindi model originally trained by `sagorsarker`. ## Predicted Entities `mixed`, `hin`, `other`, `unk`, `en`, `ambiguous`, `ne`, `fw` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_codeswitch_hineng_lid_lince_hi_3.4.2_3.0_1652097632881.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_codeswitch_hineng_lid_lince_hi_3.4.2_3.0_1652097632881.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_codeswitch_hineng_lid_lince","hi") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["मुझे स्पार्क एनएलपी बहुत पसंद है"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_codeswitch_hineng_lid_lince","hi") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("मुझे स्पार्क एनएलपी बहुत पसंद है").toDF("text") val result = pipeline.fit(data).transform(data) ```
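Since this model tags each token with a language-ID label (`hin`, `en`, `ne`, etc.), a common post-processing step is to merge consecutive tokens that share a tag into code-switch spans. A minimal pure-Python sketch of that merging step; the token/tag sequence below is hypothetical, not actual model output:

```python
from itertools import groupby

def merge_token_tags(tokens, tags):
    """Merge consecutive tokens that share a tag into (span_text, tag) pairs."""
    spans = []
    for tag, group in groupby(zip(tokens, tags), key=lambda pair: pair[1]):
        spans.append((" ".join(tok for tok, _ in group), tag))
    return spans

# Hypothetical token/tag sequence for illustration
tokens = ["main", "office", "jaa", "raha", "hoon"]
tags = ["hin", "en", "hin", "hin", "hin"]
print(merge_token_tags(tokens, tags))
# → [('main', 'hin'), ('office', 'en'), ('jaa raha hoon', 'hin')]
```

In the Spark NLP pipeline above, the tokens and tags would come from the `token` and `ner` output columns rather than hand-written lists.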
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_codeswitch_hineng_lid_lince| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|hi| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/sagorsarker/codeswitch-hineng-lid-lince - https://ritual.uh.edu/lince/home - https://github.com/sagorbrur/codeswitch --- layout: model title: Pipeline to Detect Clinical Entities (BertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_jsl_pipeline date: 2023-03-20 tags: [ner_jsl, ner, berfortokenclassification, en, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_jsl](https://nlp.johnsnowlabs.com/2022/03/21/bert_token_classifier_ner_jsl_en_2_4.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_pipeline_en_4.3.0_3.2_1679305183990.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_pipeline_en_4.3.0_3.2_1679305183990.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_jsl_pipeline", "en", "clinical/models") text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_jsl_pipeline", "en", "clinical/models") val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. 
The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.bert_token_ner_jsl.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
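`fullAnnotate` returns, for each input row, the annotations produced by every pipeline stage; the entity chunks are usually flattened into rows of (chunk, begin, end, label, confidence). A hedged sketch of that flattening; the dict-based `mock_chunks` structure is a simplified stand-in for Spark NLP's annotation objects, not the exact type:

```python
# Simplified stand-in for Spark NLP chunk annotations (illustration only)
mock_chunks = [
    {"result": "21-day-old", "begin": 17, "end": 26,
     "metadata": {"entity": "Age", "confidence": "0.9995"}},
    {"result": "congestion", "begin": 62, "end": 71,
     "metadata": {"entity": "Symptom", "confidence": "0.9979"}},
]

def chunks_to_rows(chunks):
    """Flatten chunk annotations into (text, begin, end, label, confidence) rows."""
    return [
        (c["result"], c["begin"], c["end"],
         c["metadata"]["entity"], float(c["metadata"]["confidence"]))
        for c in chunks
    ]

for row in chunks_to_rows(mock_chunks):
    print(row)
```

With the real pipeline, the chunks would be read from `result[0]["ner_chunk"]` (attribute access on `Annotation` objects) instead of plain dicts.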
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:---------------------------------|--------:|------:|:-------------|-------------:| | 0 | 21-day-old | 17 | 26 | Age | 0.999456 | | 1 | Caucasian male | 28 | 41 | Demographics | 0.9901 | | 2 | congestion | 62 | 71 | Symptom | 0.997918 | | 3 | mom | 75 | 77 | Demographics | 0.999013 | | 4 | yellow discharge | 99 | 114 | Symptom | 0.998663 | | 5 | nares | 135 | 139 | Body_Part | 0.998609 | | 6 | she | 147 | 149 | Demographics | 0.999442 | | 7 | mild problems with his breathing | 168 | 199 | Symptom | 0.930385 | | 8 | perioral cyanosis | 237 | 253 | Symptom | 0.99819 | | 9 | retractions | 258 | 268 | Symptom | 0.999783 | | 10 | One day ago | 272 | 282 | Date_Time | 0.999386 | | 11 | mom | 285 | 287 | Demographics | 0.999835 | | 12 | tactile temperature | 304 | 322 | Symptom | 0.999352 | | 13 | Tylenol | 345 | 351 | Drug | 0.999762 | | 14 | Baby-girl | 354 | 362 | Age | 0.980529 | | 15 | decreased p.o. intake | 382 | 402 | Symptom | 0.998978 | | 16 | His | 405 | 407 | Demographics | 0.999913 | | 17 | breast-feeding | 416 | 429 | Body_Part | 0.99954 | | 18 | his | 493 | 495 | Demographics | 0.999661 | | 19 | respiratory congestion | 497 | 518 | Symptom | 0.834984 | | 20 | He | 521 | 522 | Demographics | 0.999858 | | 21 | tired | 555 | 559 | Symptom | 0.999516 | | 22 | fussy | 574 | 578 | Symptom | 0.997592 | | 23 | over the past 2 days | 580 | 599 | Date_Time | 0.994786 | | 24 | albuterol | 642 | 650 | Drug | 0.999735 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_jsl_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|405.0 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Chinese BertForQuestionAnswering model (from qalover) author: 
John Snow Labs name: bert_qa_chinese_pert_large_open_domain_mrc date: 2022-06-28 tags: [zh, open_source, bert, question_answering] task: Question Answering language: zh edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese-pert-large-open-domain-mrc` is a Chinese model originally trained by `qalover`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_chinese_pert_large_open_domain_mrc_zh_4.0.0_3.0_1656413708959.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_chinese_pert_large_open_domain_mrc_zh_4.0.0_3.0_1656413708959.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_chinese_pert_large_open_domain_mrc","zh") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_chinese_pert_large_open_domain_mrc","zh") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.answer_question.bert.large").predict("""PUT YOUR QUESTION HERE|||PUT YOUR CONTEXT HERE""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_chinese_pert_large_open_domain_mrc| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|zh| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/qalover/chinese-pert-large-open-domain-mrc - https://github.com/dbiir/UER-py/ --- layout: model title: Sentence Entity Resolver for UMLS CUI Codes author: John Snow Labs name: sbiobertresolve_umls_major_concepts date: 2021-10-03 tags: [entity_resolution, licensed, clinical, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.2.3 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts to 4 major categories of UMLS CUI codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It has faster load time, with a speedup of about 6X when compared to previous versions. 
## Predicted Entities This model returns CUI (concept unique identifier) codes for `Clinical Findings`, `Medical Devices`, `Anatomical Structures` and `Injuries & Poisoning` terms. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_major_concepts_en_3.2.3_3.0_1633221571574.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_major_concepts_en_3.2.3_3.0_1633221571574.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The ```sbiobertresolve_umls_major_concepts``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_jsl``` as the NER model, with ```Cerebrovascular_Disease, Communicable_Disease, Diabetes, Disease_Syndrome_Disorder, Heart_Disease, Hyperlipidemia, Hypertension, Injury_or_Poisoning, Kidney_Disease, Medical-Device, Obesity, Oncological, Overweight, Psychological_Condition, Symptom, VS_Finding, ImagingFindings, EKG_Findings``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli",'en','clinical/models')\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_umls_major_concepts","en", "clinical/models") \ .setInputCols(["ner_chunk_doc", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") pipeline = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver]) data = spark.createDataFrame([["The patient complains of ankle pain after falling from stairs. She has been advised Arthroscopy by her primary care pyhsician"]]).toDF("text") results = pipeline.fit(data).transform(data) ``` ```scala ... val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_umls_major_concepts", "en", "clinical/models") .setInputCols(Array("ner_chunk_doc", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val p_model = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver)) val data = Seq("The patient complains of ankle pain after falling from stairs. She has been advised Arthroscopy by her primary care pyhsician").toDF("text") val res = p_model.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.umls").predict("""The patient complains of ankle pain after falling from stairs. She has been advised Arthroscopy by her primary care pyhsician""") ```
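Under `setDistanceFunction("EUCLIDEAN")`, the resolver effectively returns the candidate code whose sentence embedding lies closest to the chunk embedding. A toy sketch of that nearest-neighbor step; the 3-dimensional vectors and the candidate pairing are made up for illustration (real `sbiobert_base_cased_mli` embeddings are 768-dimensional):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def resolve(chunk_emb, candidates):
    """Return the candidate code with the smallest Euclidean distance."""
    return min(candidates, key=lambda item: euclidean(chunk_emb, item[1]))[0]

# Made-up 3-d embeddings for illustration only
candidates = [
    ("C0179144", [0.9, 0.1, 0.0]),   # arthroscope
    ("C0417023", [0.0, 0.8, 0.3]),   # fall from stairs
]
print(resolve([0.85, 0.15, 0.05], candidates))
# → C0179144
```

In practice the candidate set is the model's pretrained UMLS index, and the search is vectorized rather than a Python loop.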
## Results ```bash | | ner_chunk | code | code_description | |---:|:------------------------------|:-------------|:---------------------------------------------| | 0 | ankle | C4047548 | bilateral ankle joint pain (finding) | | 1 | falling from stairs | C0417023 | fall from stairs | | 2 | Arthroscopy | C0179144 | arthroscope | | 3 | primary care pyhsician | C3266804 | referred by primary care physician (finding) | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_umls_major_concepts| |Compatibility:|Healthcare NLP 3.2.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_chunk_embeddings]| |Output Labels:|[umls_code]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on data sampled from https://www.nlm.nih.gov/research/umls/index.html --- layout: model title: BioBERT Sentence Embeddings (PMC) author: John Snow Labs name: sent_biobert_pmc_base_cased date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains a pre-trained weights of BioBERT, a language representation model for biomedical domain, especially designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)". 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_biobert_pmc_base_cased_en_2.6.0_2.4_1598348966950.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_biobert_pmc_base_cased_en_2.6.0_2.4_1598348966950.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pmc_base_cased", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pmc_base_cased", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.biobert.pmc_base_cased').predict(text, output_level='sentence') embeddings_df ```
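Sentence embeddings like these are usually compared with cosine similarity to judge semantic closeness. A small pure-Python sketch; the short vectors below are illustrative stand-ins, not real 768-dimensional BioBERT outputs:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative low-dimensional vectors; real embeddings have 768 dimensions
emb_cancer = [0.34, 0.04, -0.12]
emb_antibiotics = [0.44, 0.07, -0.11]
print(round(cosine_similarity(emb_cancer, emb_antibiotics), 3))
```

With the pipeline above, the vectors would be pulled from the `sentence_embeddings` column of `result` instead of hand-written lists.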
{:.h2_title} ## Results ```bash sentence en_embed_sentence_biobert_pmc_base_cased_embeddings I hate cancer [0.34035101532936096, 0.04413360357284546, -0.... Antibiotics aren't painkiller [0.4397204518318176, 0.066007100045681, -0.114... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_biobert_pmc_base_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|en| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert) --- layout: model title: Explain Document Pipeline for Russian author: John Snow Labs name: explain_document_sm date: 2021-03-22 tags: [open_source, russian, explain_document_sm, pipeline, ru] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: ru edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_sm is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps. It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_ru_3.0.0_3.0_1616422668270.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_sm_ru_3.0.0_3.0_1616422668270.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_sm', lang = 'ru') annotations = pipeline.fullAnnotate("Здравствуйте из Джона Снежных Лабораторий! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_sm", lang = "ru") val result = pipeline.fullAnnotate("Здравствуйте из Джона Снежных Лабораторий! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Здравствуйте из Джона Снежных Лабораторий! "] result_df = nlu.load('ru.explain').predict(text) result_df ```
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:------------------------------------------------|:-----------------------------------------------|:-----------------------------------------------------------|:-----------------------------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:-------------------------------| | 0 | ['Здравствуйте из Джона Снежных Лабораторий! '] | ['Здравствуйте из Джона Снежных Лабораторий!'] | ['Здравствуйте', 'из', 'Джона', 'Снежных', 'Лабораторий!'] | ['здравствовать', 'из', 'Джон', 'Снежных', 'Лабораторий!'] | ['NOUN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.0, 0.0, 0.0, 0.0,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['Джона Снежных Лабораторий!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_sm| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|ru| --- layout: model title: Greek BertForMaskedLM Base Uncased model (from gealexandri) author: John Snow Labs name: bert_embeddings_greeksocial_base_greek_uncased_v1 date: 2022-12-06 tags: [el, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: el edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `greeksocialbert-base-greek-uncased-v1` is a Greek model originally trained by `gealexandri`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_greeksocial_base_greek_uncased_v1_el_4.2.4_3.0_1670326520370.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_greeksocial_base_greek_uncased_v1_el_4.2.4_3.0_1670326520370.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_greeksocial_base_greek_uncased_v1","el") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_greeksocial_base_greek_uncased_v1","el") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_greeksocial_base_greek_uncased_v1| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|el| |Size:|424.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/gealexandri/greeksocialbert-base-greek-uncased-v1 - http://www.paloservices.com/ --- layout: model title: Contextual SpellChecker Clinical author: John Snow Labs name: spellcheck_clinical class: ContextSpellCheckerModel language: en nav_key: models repository: clinical/models date: 2020-04-17 task: Spell Check edition: Healthcare NLP 2.4.2 spark_version: 2.4 tags: [clinical,licensed,en] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Implements the Noisy Channel Model spell checking algorithm. Correction candidates are extracted by combining context information and word information. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/6.Clinical_Context_Spell_Checker.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_en_2.4.2_2.4_1587146727460.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_en_2.4.2_2.4_1587146727460.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python model = ContextSpellCheckerModel.pretrained("spellcheck_clinical","en","clinical/models") \ .setInputCols("token") \ .setOutputCol("spell") ``` ```scala val model = ContextSpellCheckerModel.pretrained("spellcheck_clinical","en","clinical/models") .setInputCols("token") .setOutputCol("spell") ``` {:.nlu-block} ```python import nlu nlu.load("en.spell.clinical").predict("""Put your text here.""") ```
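The noisy channel model described above scores each correction candidate by combining a channel term (how plausible the observed typo is given the candidate, often approximated by edit distance) with a language-model prior from the context. A toy sketch of that scoring; the unigram `priors` dict is a made-up stand-in for the contextual model, not how Spark NLP stores its weights:

```python
import math

def edit_distance(s, t):
    """Standard Levenshtein distance via dynamic programming."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    return dp[m][n]

def noisy_channel_score(typo, candidate, prior):
    """Higher is better: log prior minus an edit-distance penalty."""
    return math.log(prior) - edit_distance(typo, candidate)

# Made-up priors standing in for the contextual language model
priors = {"physician": 0.02, "physic": 0.001}
best = max(priors, key=lambda w: noisy_channel_score("pyhsician", w, priors[w]))
print(best)
# → physician
```

The real model weighs a neural language model over the surrounding tokens rather than unigram priors, but the candidate-ranking shape is the same.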
{:.model-param} ## Model Information {:.table-model} |---------------|--------------------------| | Name: | spellcheck_clinical | | Type: | ContextSpellCheckerModel | | Compatibility: | 2.4.2 | | License: | Licensed | | Edition: | Official | |Input labels: | [token] | |Output labels: | [spell] | | Language: | en | | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained with PubMed and i2b2 datasets. --- layout: model title: English BertForQuestionAnswering Base Cased model (from rsvp-ai) author: John Snow Labs name: bert_qa_bertserini_base_cmrc date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bertserini-bert-base-cmrc` is a English model originally trained by `rsvp-ai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bertserini_base_cmrc_en_4.0.0_3.0_1657188963909.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bertserini_base_cmrc_en_4.0.0_3.0_1657188963909.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bertserini_base_cmrc","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bertserini_base_cmrc","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bertserini_base_cmrc|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|381.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/rsvp-ai/bertserini-bert-base-cmrc

---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from lingchensanwen)
author: John Snow Labs
name: distilbert_qa_lingchensanwen_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `lingchensanwen`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_lingchensanwen_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771975946.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_lingchensanwen_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771975946.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_lingchensanwen_base_uncased_finetuned_squad","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_lingchensanwen_base_uncased_finetuned_squad","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_lingchensanwen_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/lingchensanwen/distilbert-base-uncased-finetuned-squad

---
layout: model
title: English BertForQuestionAnswering model (from juliusco)
author: John Snow Labs
name: bert_qa_biobert_base_cased_v1.1_squad_finetuned_biobert
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobert-base-cased-v1.1-squad-finetuned-biobert` is an English model originally trained by `juliusco`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_base_cased_v1.1_squad_finetuned_biobert_en_4.0.0_3.0_1654185597741.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_base_cased_v1.1_squad_finetuned_biobert_en_4.0.0_3.0_1654185597741.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biobert_base_cased_v1.1_squad_finetuned_biobert","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
  .setInputCols("question", "context")
  .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
  .pretrained("bert_qa_biobert_base_cased_v1.1_squad_finetuned_biobert","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
  ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
  ("What's my name?", "My name is Clara and I live in Berkeley."))
  .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.biobert.base_cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_biobert_base_cased_v1.1_squad_finetuned_biobert|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|403.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/juliusco/biobert-base-cased-v1.1-squad-finetuned-biobert

---
layout: model
title: Clinical Deidentification (Spanish)
author: John Snow Labs
name: clinical_deidentification
date: 2022-03-02
tags: [deid, es, licensed]
task: De-identification
language: es
edition: Healthcare NLP 3.4.1
spark_version: 2.4
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pipeline is trained with sciwiki_300d embeddings and can be used to de-identify PHI from Spanish medical texts. The PHI is masked or obfuscated in the resulting text.
The pipeline can mask (with entity labels, same-length characters, or fixed-length characters) or obfuscate (replace with realistic fake values) the following entities: `AGE`, `DATE`, `PROFESSION`, `E-MAIL`, `USERNAME`, `LOCATION`, `DOCTOR`, `HOSPITAL`, `PATIENT`, `URL`, `IP`, `MEDICALRECORD`, `IDNUM`, `ORGANIZATION`, `PHONE`, `ZIP`, `ACCOUNT`, `SSN`, `PLATE`, `SEX` and `IPADDR`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_es_3.4.1_2.4_1646246697330.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_es_3.4.1_2.4_1646246697330.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from johnsnowlabs import * deid_pipeline = PretrainedPipeline("clinical_deidentification", "es", "clinical/models") sample = """Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. 
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com """

result = deid_pipeline.annotate(sample)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "es", "clinical/models")

val sample = """Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. 
Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com """

val result = deid_pipeline.annotate(sample)
```

{:.nlu-block}
```python
import nlu
nlu.load("es.deid.clinical").predict("""Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. 
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com """) ```
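The four output styles shown in the Results section (mask with entity labels, mask with same-length characters, mask with fixed-length characters, and obfuscation) can be sketched in plain Python on a single detected entity. The helper functions below are hypothetical illustrations only; in the real pipeline this work is done by the pretrained `DeIdentificationModel` stages, not by user code.

```python
# Hypothetical helpers illustrating the pipeline's four de-identification
# styles on one detected PHI span. These are not part of the Spark NLP API.
def mask_with_label(text, span, label):
    # Replace the span with its entity label.
    return text.replace(span, f"<{label}>")

def mask_with_chars(text, span):
    # Same-length mask: brackets plus asterisks, preserving the span length.
    return text.replace(span, "[" + "*" * (len(span) - 2) + "]")

def mask_fixed_length(text, span):
    # Fixed-length mask, regardless of the span length.
    return text.replace(span, "****")

def obfuscate(text, span, fake):
    # Swap the span for a realistic fake value of the same entity type.
    return text.replace(span, fake)

note = "Nombre: Jose. Localidad: Madrid."
print(mask_with_label(note, "Jose", "PATIENT"))  # Nombre: <PATIENT>. Localidad: Madrid.
print(mask_with_chars(note, "Jose"))             # Nombre: [**]. Localidad: Madrid.
print(mask_fixed_length(note, "Jose"))           # Nombre: ****. Localidad: Madrid.
print(obfuscate(note, "Jose", "Aristides"))      # Nombre: Aristides. Localidad: Madrid.
```

Note how the same-length mask keeps the character count of the original span, which preserves document layout, while obfuscation keeps the text readable by substituting plausible fake values.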
## Results ```bash Masked with entity labels ------------------------------ Datos del paciente. Nombre: . Apellidos: . NHC: . NASS: 04. Domicilio: , 5 B.. Localidad/ Provincia: . CP: . Datos asistenciales. Fecha de nacimiento: . País: . Edad: años Sexo: . Fecha de Ingreso: . : María Merino Viveros NºCol: . Informe clínico del paciente: de años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. 
Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. Servicio de Endocrinología y Nutrición km 12,500 28905 - () Correo electrónico: Masked with chars ------------------------------ Datos del paciente. Nombre: [**] . Apellidos: [*************]. NHC: [*****]. NASS: ** [******] 04. Domicilio: [*******************], 5 B.. Localidad/ Provincia: [****]. CP: [***]. Datos asistenciales. Fecha de nacimiento: [********]. País: [****]. Edad: ** años Sexo: *. Fecha de Ingreso: [********]. [****]: María Merino Viveros NºCol: ** ** [***]. 
Informe clínico del paciente: [***] de ** años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en [*********] en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. 
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: [*************] +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente [****] significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. [******************] [******************************] Servicio de Endocrinología y Nutrición [*****************] km 12,500 28905 [****] - [****] ([****]) Correo electrónico: [********************] Masked with fixed length chars ------------------------------ Datos del paciente. Nombre: **** . Apellidos: ****. NHC: ****. NASS: **** **** 04. Domicilio: ****, 5 B.. Localidad/ Provincia: ****. CP: ****. Datos asistenciales. Fecha de nacimiento: ****. País: ****. Edad: **** años Sexo: ****. Fecha de Ingreso: ****. ****: María Merino Viveros NºCol: **** **** ****. 
Informe clínico del paciente: **** de **** años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en **** en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. 
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: **** +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente **** significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. **** **** Servicio de Endocrinología y Nutrición **** km 12,500 28905 **** - **** (****) Correo electrónico: **** Obfuscated ------------------------------ Datos del paciente. Nombre: Sr. Lerma . Apellidos: Aristides Gonzalez Gelabert. NHC: BBBBBBBBQR648597. NASS: 041010000011 RZRM020101906017 04. Domicilio: Valencia, 5 B.. Localidad/ Provincia: Madrid. CP: 99335. Datos asistenciales. Fecha de nacimiento: 25/04/1977. País: Barcelona. Edad: 8 años Sexo: F.. Fecha de Ingreso: 02/08/2018. transportista: María Merino Viveros NºCol: olegario10 olegario10 felisa78. 
Informe clínico del paciente: RZRM020101906017 de 8 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Madrid en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. 
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Dra. Laguna +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente 041010000011 significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
Reinaldo Manjón Malo Barcelona Servicio de Endocrinología y Nutrición Valencia km 12,500 28905 Bilbao - Madrid (Barcelona) Correo electrónico: quintanasalome@example.net ``` ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|es| |Size:|281.2 MB| ## Included Models - nlp.DocumentAssembler - nlp.SentenceDetectorDLModel - nlp.TokenizerModel - nlp.WordEmbeddingsModel - medical.NerModel - nlp.NerConverter - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ChunkMergeModel - medical.DeIdentificationModel - medical.DeIdentificationModel - medical.DeIdentificationModel - medical.DeIdentificationModel - Finisher --- layout: model title: Arabic Bert Embeddings (Base, Arabert Model, v01) author: John Snow Labs name: bert_embeddings_bert_base_arabertv01 date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabertv01` is an Arabic model originally trained by `aubmindlab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabertv01_ar_3.4.2_3.0_1649677579686.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabertv01_ar_3.4.2_3.0_1649677579686.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabertv01","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabertv01","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.bert_base_arabertv01").predict("""أنا أحب شرارة NLP""") ```
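The `embeddings` output column carries one vector per token (768-dimensional for this model). A common follow-up step, not part of the pipeline above, is mean-pooling the token vectors into a single sentence vector. The sketch below does this with plain NumPy on toy 4-dimensional vectors standing in for the real per-token output:

```python
import numpy as np

# Toy per-token vectors standing in for the output of BertEmbeddings
# (the real model emits one 768-dimensional vector per token).
token_vectors = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0, 6.0],
])

# Mean-pooling: average over the token axis to get one sentence vector.
sentence_vector = token_vectors.mean(axis=0)
print(sentence_vector)  # [2. 3. 4. 5.]
```

The same pooling applies regardless of dimension; only the array shape changes when real embeddings are collected from the Spark result.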
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_arabertv01|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|508.0 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/aubmindlab/bert-base-arabertv01
- https://github.com/google-research/bert
- https://arxiv.org/abs/2003.00104
- https://github.com/WissamAntoun/pydata_khobar_meetup
- http://alt.qcri.org/farasa/segmenter.html
- https://github.com/google-research/bert/blob/master/multilingual.md
- https://github.com/elnagara/HARD-Arabic-Dataset
- https://www.aclweb.org/anthology/D15-1299
- https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf
- https://github.com/mohamedadaly/LABR
- http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp
- https://github.com/husseinmozannar/SOQAL
- https://github.com/aub-mind/arabert/blob/master/AraBERT/README.md
- https://arxiv.org/abs/2003.00104v2
- https://archive.org/details/arwiki-20190201
- https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4
- https://www.aclweb.org/anthology/W19-4619
- https://sites.aub.edu.lb/mindlab/
- https://www.yakshof.com/#/
- https://www.behance.net/rahalhabib
- https://www.linkedin.com/in/wissam-antoun-622142b4/
- https://twitter.com/wissam_antoun
- https://github.com/WissamAntoun
- https://www.linkedin.com/in/fadybaly/
- https://twitter.com/fadybaly
- https://github.com/fadybaly

---
layout: model
title: English RobertaForQuestionAnswering (from mbartolo)
author: John Snow Labs
name: roberta_qa_roberta_large_synqa_ext
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-synqa-ext` is an English model originally trained by `mbartolo`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_synqa_ext_en_4.0.0_3.0_1655738082187.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_synqa_ext_en_4.0.0_3.0_1655738082187.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_synqa_ext","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_large_synqa_ext","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.synqa_ext.roberta.large.by_mbartolo").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
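In the NLU one-liner above, the question and its context travel in a single string separated by `|||`. A hypothetical helper (not part of the nlu API) makes that convention explicit:

```python
def to_nlu_qa_string(question: str, context: str) -> str:
    """Join a question and its context with the '|||' separator
    used in the nlu question-answering examples."""
    return f"{question}|||{context}"

pair = to_nlu_qa_string("What's my name?", "My name is Clara and I live in Berkeley.")
print(pair)  # What's my name?|||My name is Clara and I live in Berkeley.
```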
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_large_synqa_ext| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mbartolo/roberta-large-synqa-ext - https://arxiv.org/abs/2002.00293 - https://arxiv.org/abs/2104.08678 --- layout: model title: Part of Speech for Latin author: John Snow Labs name: pos_ud_llct date: 2021-03-09 tags: [part_of_speech, open_source, latin, pos_ud_llct, la] task: Part of Speech Tagging language: la edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - PUNCT - ADP - PROPN - NOUN - VERB - DET - CCONJ - PRON - ADJ - NUM - AUX - SCONJ - ADV - PART - X {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_llct_la_3.0.0_3.0_1615292206384.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_llct_la_3.0.0_3.0_1615292206384.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

pos = PerceptronModel.pretrained("pos_ud_llct", "la") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    pos
])

example = spark.createDataFrame([['Aequaliter Nubila Labs Ioannes de salve ! ']], ["text"])

result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val pos = PerceptronModel.pretrained("pos_ud_llct", "la")
    .setInputCols(Array("document", "token"))
    .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))

val data = Seq("Aequaliter Nubila Labs Ioannes de salve ! ").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["Aequaliter Nubila Labs Ioannes de salve ! "]
token_df = nlu.load('la.pos').predict(text)
token_df
```
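The tagger produces exactly one label per token, so the `token` and `pos` outputs align position by position. A plain-Python sketch of pairing them up, using the tags this model assigns to the example sentence:

```python
# Tokens and tags align one-to-one, so zipping yields (token, tag) pairs.
tokens = ["Aequaliter", "Nubila", "Labs", "Ioannes", "de", "salve", "!"]
tags = ["PROPN", "PROPN", "ADJ", "NOUN", "ADP", "NOUN", "PROPN"]

tagged = list(zip(tokens, tags))
print(tagged[:2])  # [('Aequaliter', 'PROPN'), ('Nubila', 'PROPN')]
```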
## Results

```bash
   token       pos
0  Aequaliter  PROPN
1  Nubila      PROPN
2  Labs        ADJ
3  Ioannes     NOUN
4  de          ADP
5  salve       NOUN
6  !           PROPN
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pos_ud_llct|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|la|

---
layout: model
title: English BertForQuestionAnswering model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_10
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-1024-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_10_en_4.0.0_3.0_1654189512725.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_10_en_4.0.0_3.0_1654189512725.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_10","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_1024d_seed_10").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_10|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|390.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-1024-finetuned-squad-seed-10

---
layout: model
title: Word2Vec Embeddings in Tatar (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, tt, open_source]
task: Embeddings
language: tt
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Word Embeddings lookup annotator that maps tokens to vectors.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_tt_3.4.1_3.0_1647462888271.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_tt_3.4.1_3.0_1647462888271.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","tt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","tt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("tt.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
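A typical use of these 300-dimensional vectors is comparing words by cosine similarity. A minimal sketch with NumPy, using toy 3-dimensional vectors in place of real embeddings collected from the pipeline:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for two words' 300-dimensional embeddings.
v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 0.0, 1.0]
v3 = [0.0, 1.0, 0.0]

print(cosine_similarity(v1, v2))  # 1.0 (identical direction)
print(cosine_similarity(v1, v3))  # 0.0 (orthogonal)
```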
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|tt| |Size:|535.4 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: BERT Embeddings (Base Uncased) author: John Snow Labs name: bert_base_uncased date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains a deep bidirectional transformer trained on Wikipedia and the BookCorpus. The details are described in the paper "[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_uncased_en_2.6.0_2.4_1598340514223.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_uncased_en_2.6.0_2.4_1598340514223.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = BertEmbeddings.pretrained("bert_base_uncased", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])

pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"]))
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_base_uncased", "en")
    .setInputCols("sentence", "token")
    .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))

val data = Seq("I love NLP").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
embeddings_df = nlu.load('en.embed.bert').predict(text, output_level='token')
embeddings_df
```
{:.h2_title} ## Results ```bash token en_embed_bert_embeddings I [0.5920650362968445, 0.18827693164348602, 0.12... love [1.2889715433120728, 0.8475795388221741, 0.720... NLP [0.21503107249736786, -0.9925870299339294, 1.0... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_uncased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1](https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1) --- layout: model title: Czech RobertaForMaskedLM Cased model (from fav-kky) author: John Snow Labs name: roberta_embeddings_fernet_news date: 2022-12-12 tags: [cs, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: cs edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `FERNET-News` is a Czech model originally trained by `fav-kky`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_fernet_news_cs_4.2.4_3.0_1670858382244.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_fernet_news_cs_4.2.4_3.0_1670858382244.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_fernet_news","cs") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_fernet_news","cs")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_fernet_news|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|cs|
|Size:|468.2 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/fav-kky/FERNET-News
- https://arxiv.org/abs/2107.10042

---
layout: model
title: English BertForQuestionAnswering model (from lewtun)
author: John Snow Labs
name: bert_qa_bert_base_uncased_finetuned_squad_v1
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-finetuned-squad-v1` is an English model originally trained by `lewtun`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_finetuned_squad_v1_en_4.0.0_3.0_1654181199655.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_finetuned_squad_v1_en_4.0.0_3.0_1654181199655.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_finetuned_squad_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_finetuned_squad_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased.by_lewtun").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_uncased_finetuned_squad_v1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/lewtun/bert-base-uncased-finetuned-squad-v1

---
layout: model
title: English Named Entity Recognition (from lucifermorninstar011)
author: John Snow Labs
name: distilbert_ner_autotrain_luicfer_company_861827409
date: 2022-05-16
tags: [distilbert, ner, token_classification, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: DistilBertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-luicfer_company-861827409` is an English model originally trained by `lucifermorninstar011`.

## Predicted Entities

`vocab`, `company`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_autotrain_luicfer_company_861827409_en_3.4.2_3.0_1652721660864.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_autotrain_luicfer_company_861827409_en_3.4.2_3.0_1652721660864.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_autotrain_luicfer_company_861827409","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_autotrain_luicfer_company_861827409","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
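The `ner` column holds one IOB-style label per token; in a full Spark NLP pipeline, an NerConverter stage would group consecutive B-/I- labels into entity chunks. A simplified standalone sketch of that grouping logic (the tokens and labels below are made up for illustration, not actual model output):

```python
def iob_to_chunks(tokens, labels):
    """Group consecutive B-/I- labels into (entity_type, text) chunks."""
    chunks, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            # A B- tag always starts a new chunk.
            if current:
                chunks.append(current)
            current = (lab[2:], [tok])
        elif lab.startswith("I-") and current and current[0] == lab[2:]:
            # An I- tag of the same type continues the open chunk.
            current[1].append(tok)
        else:
            # O tag (or mismatched I-) closes any open chunk.
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(etype, " ".join(words)) for etype, words in chunks]

tokens = ["I", "work", "at", "John", "Snow", "Labs"]
labels = ["O", "O", "O", "B-company", "I-company", "I-company"]
print(iob_to_chunks(tokens, labels))  # [('company', 'John Snow Labs')]
```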
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_ner_autotrain_luicfer_company_861827409|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/lucifermorninstar011/autotrain-luicfer_company-861827409

---
layout: model
title: Summarize clinical notes (augmented)
author: John Snow Labs
name: summarizer_clinical_jsl_augmented
date: 2023-03-30
tags: [licensed, clinical, en, summarization, tensorflow]
task: Summarization
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.0
supported: true
engine: tensorflow
annotator: MedicalSummarizer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a modified Flan-T5 (LLM) based summarization model, first fine-tuned with natural instructions and then fine-tuned with clinical notes, encounters, critical care notes, discharge notes, and reports curated by John Snow Labs. It is further optimized by augmenting the training methodology and the dataset. Given input text of up to 1024 tokens, it can generate summaries of up to 512 tokens.
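Because the input window is capped at roughly 1024 tokens, longer notes are commonly split into chunks that are summarized separately and then merged. A rough whitespace-based sketch of such splitting (the model's tokenizer counts tokens differently from words, so treat the limit here as an approximation):

```python
def chunk_words(text, max_words=1024, overlap=64):
    """Split text into chunks of at most max_words words,
    sharing `overlap` words between consecutive chunks so
    no sentence context is lost at a boundary."""
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
        start += max_words - overlap
    return chunks

chunks = chunk_words("word " * 2500, max_words=1024, overlap=64)
print(len(chunks))  # 3
```

Each chunk can then be run through the summarizer pipeline on its own, with the partial summaries concatenated or summarized once more.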
## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/MEDICAL_TEXT_SUMMARIZATION/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/32.Medical_Text_Summarization.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_jsl_augmented_en_4.3.2_3.0_1680203312371.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_jsl_augmented_en_4.3.2_3.0_1680203312371.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") summarizer = MedicalSummarizer()\ .pretrained("summarizer_clinical_jsl_augmented", "en", "clinical/models")\ .setInputCols("document")\ .setOutputCol("summary")\ .setMaxTextLength(512)\ .setMaxNewTokens(512) pipeline = Pipeline(stages=[document, summarizer]) text = """Patient with hypertension, syncope, and spinal stenosis - for recheck. (Medical Transcription Sample Report) SUBJECTIVE: The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS: Reviewed and unchanged from the dictation on 12/03/2003. MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash.""" data = spark.createDataFrame([[text]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val summarizer = MedicalSummarizer() .pretrained("summarizer_clinical_jsl_augmented", "en", "clinical/models") .setInputCols("document") .setOutputCol("summary") .setMaxTextLength(512) .setMaxNewTokens(512) val pipeline = new Pipeline().setStages(Array(document, summarizer)) val text = """Patient with hypertension, syncope, and spinal stenosis - for recheck. (Medical Transcription Sample Report) SUBJECTIVE: The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS: Reviewed and unchanged from the dictation on 12/03/2003. 
MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash.""" val data = Seq(text).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
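The `setMaxTextLength(512)` call above caps how much input the summarizer consumes at once; longer clinical notes are typically processed in windows. A rough sketch of that windowing idea, using whitespace tokens as a stand-in for the model's real tokenizer (an illustration, not the annotator's internal logic):

```python
# Split a long document into windows of at most max_len whitespace tokens.
# Real tokenization is subword-based; whitespace splitting is a simplification.
def split_windows(text, max_len=512):
    tokens = text.split()
    return [" ".join(tokens[i:i + max_len]) for i in range(0, len(tokens), max_len)]

windows = split_windows("word " * 1200, max_len=512)
print([len(w.split()) for w in windows])  # [512, 512, 176]
```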
## Results ```bash A 78-year-old female with hypertension, syncope, and spinal stenosis returns for a recheck. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. Her medications include Atenolol, Premarin, calcium with vitamin D, multivitamin, aspirin, and TriViFlor. She also has Elocon cream and Synalar cream for rash. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_clinical_jsl_augmented| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|920.0 MB| ## Benchmarks ### Benchmark on MtSamples Summarization Dataset | model_name | model_size | rouge | bleu | bertscore_precision | bertscore_recall | bertscore_f1 | |--|--|--|--|--|--|--| philschmid/flan-t5-base-samsum | 250M | 0.1919 | 0.1124 | 0.8409 | 0.8964 | 0.8678 | linydub/bart-large-samsum | 500M | 0.1586 | 0.0732 | 0.8747 | 0.8184 | 0.8456 | philschmid/bart-large-cnn-samsum | 500M | 0.2170 | 0.1299 | 0.8846 | 0.8436 | 0.8636 | transformersbook/pegasus-samsum | 500M | 0.1924 | 0.0965 | 0.8920 | 0.8149 | 0.8517 | summarizer_clinical_jsl | 250M | 0.4836 | 0.4188 | 0.9041 | 0.9374 | 0.9204 | summarizer_clinical_jsl_augmented | 250M | 0.5119 | 0.4545 | 0.9282 | 0.9526 | 0.9402 | ### Benchmark on MIMIC Summarization Dataset | model_name | model_size | rouge | bleu | bertscore_precision | bertscore_recall | bertscore_f1 | |--|--|--|--|--|--|--| philschmid/flan-t5-base-samsum | 250M | 0.1910 | 0.1037 | 0.8708 | 0.9056 | 0.8879 | linydub/bart-large-samsum | 500M | 0.1252 | 0.0382 | 0.8933 | 0.8440 | 0.8679 | philschmid/bart-large-cnn-samsum | 500M | 0.1795 | 0.0889 | 0.9172 | 0.8978 | 0.9074 | transformersbook/pegasus-samsum | 570M | 0.1425 | 0.0582 | 0.9171 | 0.8682 | 0.8920 | summarizer_clinical_jsl | 250M | 0.395 | 0.2962 | 0.895 | 0.9316 | 0.913 | summarizer_clinical_jsl_augmented | 250M | 0.3964 | 0.307 | 0.9109 | 0.9452 | 0.9227 | ![Benchmark
Summary](https://github.com/JohnSnowLabs/jsl-private-projects/blob/llm_benchmarks/internal_projects/LLM_Experiments/jsl-summarization-benchmarks.png?raw=true) ## References Trained on in-house curated dataset --- layout: model title: Part of Speech for Norwegian Nynorsk author: John Snow Labs name: pos_ud_nynorsk date: 2020-05-05 18:57:00 +0800 task: Part of Speech Tagging language: nn edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [pos, nn] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_nynorsk_nn_2.5.0_2.4_1588693690964.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_nynorsk_nn_2.5.0_2.4_1588693690964.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_nynorsk", "nn") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Annet enn å være kongen i nord, er John Snow en engelsk lege og en leder innen utvikling av anestesi og medisinsk hygiene.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_nynorsk", "nn") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Annet enn å være kongen i nord, er John Snow en engelsk lege og en leder innen utvikling av anestesi og medisinsk hygiene.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Annet enn å være kongen i nord, er John Snow en engelsk lege og en leder innen utvikling av anestesi og medisinsk hygiene."""] pos_df = nlu.load('nn.pos.ud_nynorsk').predict(text) pos_df ```
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=4, result='NOUN', metadata={'word': 'Annet'}), Row(annotatorType='pos', begin=6, end=8, result='SCONJ', metadata={'word': 'enn'}), Row(annotatorType='pos', begin=10, end=10, result='PART', metadata={'word': 'å'}), Row(annotatorType='pos', begin=12, end=15, result='VERB', metadata={'word': 'være'}), Row(annotatorType='pos', begin=17, end=22, result='NOUN', metadata={'word': 'kongen'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_nynorsk| |Type:|pos| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|nn| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Chinese Bert Embeddings (Large, MacBERT) author: John Snow Labs name: bert_embeddings_chinese_macbert_large date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `chinese-macbert-large` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_macbert_large_zh_3.4.2_3.0_1649669165054.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_macbert_large_zh_3.4.2_3.0_1649669165054.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_chinese_macbert_large","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_chinese_macbert_large","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.chinese_macbert_large").predict("""I love Spark NLP""") ```
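The `Tokenizer` stage above yields whitespace tokens; internally, BERT-family models further split out-of-vocabulary words into subwords with a WordPiece-style greedy longest-match. A minimal sketch of that segmentation with a toy vocabulary (illustrative only; real BERT vocabularies hold tens of thousands of entries):

```python
# Greedy longest-match subword segmentation, WordPiece style.
# Toy vocabulary for illustration only.
VOCAB = {"spark", "nlp", "em", "##bed", "##ding", "##s", "[UNK]"}

def wordpiece(token, vocab=VOCAB):
    pieces, start = [], 0
    while start < len(token):
        end = len(token)
        piece = None
        while start < end:                 # shrink the candidate until it matches
            cand = token[start:end]
            if start > 0:
                cand = "##" + cand         # continuation pieces carry a marker
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:                  # nothing matched -> unknown token
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

print(wordpiece("embeddings"))  # ['em', '##bed', '##ding', '##s']
print(wordpiece("spark"))       # ['spark']
```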
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_macbert_large| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|1.2 GB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/chinese-macbert-large - https://github.com/ymcui/MacBERT/blob/master/LICENSE - https://2020.emnlp.org - https://arxiv.org/abs/2004.13922 - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://github.com/chatopera/Synonyms --- layout: model title: Mapping Drug Brand Names with Corresponding National Drug Codes author: John Snow Labs name: drug_brandname_ndc_mapper date: 2022-05-11 tags: [chunk_mapper, en, licensed, ndc, clinical] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.1 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps drug brand names to corresponding National Drug Codes (NDC). Product NDCs for each strength are returned in result and metadata.
## Predicted Entities {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/drug_brandname_ndc_mapper_en_3.5.1_3.0_1652259542096.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/drug_brandname_ndc_mapper_en_3.5.1_3.0_1652259542096.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("chunk") chunkerMapper = ChunkMapperModel.pretrained("drug_brandname_ndc_mapper", "en", "clinical/models")\ .setInputCols(["chunk"])\ .setOutputCol("ndc")\ .setRel("Strength_NDC") pipeline = Pipeline().setStages([document_assembler, chunkerMapper]) model = pipeline.fit(spark.createDataFrame([['']]).toDF('text')) lp = LightPipeline(model) result = lp.fullAnnotate(["zytiga", "zyvana", "ZYVOX", "ZYTIGA"]) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("chunk") val chunkerMapper = ChunkMapperModel.pretrained("drug_brandname_ndc_mapper", "en", "clinical/models") .setInputCols("chunk") .setOutputCol("ndc") .setRel("Strength_NDC") val pipeline = new Pipeline().setStages(Array(document_assembler, chunkerMapper)) val text_data = Seq("zytiga", "zyvana", "ZYVOX", "ZYTIGA").toDS.toDF("text") val res = pipeline.fit(text_data).transform(text_data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.drug_brand_to_ndc").predict("""Put your text here.""") ```
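Conceptually, the chunk mapper above behaves like a case-insensitive dictionary lookup from a detected chunk to one or more codes, selected by the relation set with `setRel`. A minimal pure-Python sketch of that idea, using entries taken from the results below (the lookup logic is an illustration, not the annotator's actual implementation):

```python
# Case-insensitive brand-name -> NDC lookup; sample entries only.
BRAND_TO_NDC = {
    "zytiga": {"Strength_NDC": "500 mg/1 | 57894-195",
               "Other_NDC": ["250 mg/1 | 57894-150"]},
    "zyvox":  {"Strength_NDC": "600 mg/300mL | 0009-4992",
               "Other_NDC": ["600 mg/300mL | 66298-7807", "600 mg/300mL | 0009-7807"]},
}

def map_brandname(chunk, rel="Strength_NDC"):
    entry = BRAND_TO_NDC.get(chunk.lower())   # normalize case before lookup
    if entry is None:
        return None                           # chunk not covered by the mapping
    return entry[rel]

print(map_brandname("ZYTIGA"))  # 500 mg/1 | 57894-195
```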
## Results ```bash |---:|:------------|:-------------------------|:----------------------------------------------------------| | | Brandname | Strength_NDC | Other_NDCs | |---:|:------------|:-------------------------|:----------------------------------------------------------| | 0 | zytiga | 500 mg/1 | 57894-195 | ['250 mg/1 | 57894-150'] | | 1 | zyvana | 527 mg/1 | 69336-405 | [''] | | 2 | ZYVOX | 600 mg/300mL | 0009-4992 | ['600 mg/300mL | 66298-7807', '600 mg/300mL | 0009-7807'] | | 3 | ZYTIGA | 500 mg/1 | 57894-195 | ['250 mg/1 | 57894-150'] | |---:|:------------|:-------------------------|:----------------------------------------------------------| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|drug_brandname_ndc_mapper| |Compatibility:|Healthcare NLP 3.5.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|3.0 MB| --- layout: model title: English image_classifier_vit_rock_challenge_DeiT_solo ViTForImageClassification from dimbyTa author: John Snow Labs name: image_classifier_vit_rock_challenge_DeiT_solo date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rock_challenge_DeiT_solo` is an English model originally trained by dimbyTa.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rock_challenge_DeiT_solo_en_4.1.0_3.0_1660170757484.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rock_challenge_DeiT_solo_en_4.1.0_3.0_1660170757484.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_rock_challenge_DeiT_solo", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_rock_challenge_DeiT_solo", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_rock_challenge_DeiT_solo| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|81.7 MB| --- layout: model title: Pipeline to Detect Bacterial Species (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_bacteria_pipeline date: 2023-03-20 tags: [bacteria, bertfortokenclassification, ner, en, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [bert_token_classifier_ner_bacteria](https://nlp.johnsnowlabs.com/2022/01/07/bert_token_classifier_ner_bacteria_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bacteria_pipeline_en_4.3.0_3.2_1679305685030.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bacteria_pipeline_en_4.3.0_3.2_1679305685030.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_bacteria_pipeline", "en", "clinical/models") text = '''Based on these genetic and phenotypic properties, we propose that strain SMSP (T) represents a novel species of the genus Methanoregula, for which we propose the name Methanoregula formicica sp. nov., with the type strain SMSP (T) (= NBRC 105244 (T) = DSM 22288 (T)).''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_bacteria_pipeline", "en", "clinical/models") val text = "Based on these genetic and phenotypic properties, we propose that strain SMSP (T) represents a novel species of the genus Methanoregula, for which we propose the name Methanoregula formicica sp. nov., with the type strain SMSP (T) (= NBRC 105244 (T) = DSM 22288 (T))." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.bacteria_ner.pipeline").predict("""Based on these genetic and phenotypic properties, we propose that strain SMSP (T) represents a novel species of the genus Methanoregula, for which we propose the name Methanoregula formicica sp. nov., with the type strain SMSP (T) (= NBRC 105244 (T) = DSM 22288 (T)).""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:------------------------|--------:|------:|:------------|-------------:| | 0 | SMSP (T) | 73 | 80 | SPECIES | 0.99985 | | 1 | Methanoregula formicica | 167 | 189 | SPECIES | 0.999787 | | 2 | SMSP (T) | 222 | 229 | SPECIES | 0.999871 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_bacteria_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.9 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Turkish Bert Embeddings author: John Snow Labs name: bert_embeddings_bert_base_tr_cased date: 2022-04-11 tags: [bert, embeddings, tr, open_source] task: Embeddings language: tr edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-tr-cased` is a Turkish model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_tr_cased_tr_3.4.2_3.0_1649675409597.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_tr_cased_tr_3.4.2_3.0_1649675409597.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_tr_cased","tr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Spark NLP'yi seviyorum"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_tr_cased","tr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Spark NLP'yi seviyorum").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("tr.embed.bert_cased").predict("""Spark NLP'yi seviyorum""") ```
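The pipeline above emits one embedding vector per token; a common way to derive a single sentence-level vector from them is mean pooling. A small self-contained sketch (hypothetical 4-dimensional vectors for brevity; the model's actual vectors are far larger):

```python
# Mean-pool per-token embedding vectors into one sentence vector.
def mean_pool(token_vectors):
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Hypothetical token embeddings for a 3-token sentence.
vectors = [[1.0, 0.0, 2.0, 4.0],
           [3.0, 2.0, 0.0, 0.0],
           [2.0, 4.0, 1.0, 2.0]]
print(mean_pool(vectors))  # [2.0, 2.0, 1.0, 2.0]
```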
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_tr_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|tr| |Size:|378.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-tr-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Zero-Shot Named Entity Recognition (RoBERTa) author: John Snow Labs name: zero_shot_ner_roberta date: 2022-08-29 tags: [ner, zero_shot, licensed, clinical, en, roberta] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true recommended: true annotator: ZeroShotNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained with a Zero-Shot Named Entity Recognition (NER) approach and can detect any user-defined entities with no training dataset, using only the pretrained RoBERTa embeddings included in the model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/zero_shot_ner_roberta_en_4.0.2_3.0_1661769801401.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/zero_shot_ner_roberta_en_4.0.2_3.0_1661769801401.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") zero_shot_ner = ZeroShotNerModel.pretrained("zero_shot_ner_roberta", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("zero_shot_ner")\ .setEntityDefinitions( { "NAME": ["What is his name?", "What is my name?", "What is her name?"], "CITY": ["Which city?", "Which is the city?"] }) ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "zero_shot_ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages = [ documentAssembler, sentenceDetector, tokenizer, zero_shot_ner, ner_converter]) zero_shot_ner_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame(["Hellen works in London, Paris and Berlin. My name is Clara, I live in New York and Hellen lives in Paris.", "John is a man who works in London, London and London."], StringType()).toDF("text") result = zero_shot_ner_model.transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val zero_shot_ner = ZeroShotNerModel.pretrained("zero_shot_ner_roberta", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("zero_shot_ner") .setEntityDefinitions(Map( "NAME"-> Array("What is his name?", "What is my name?", "What is her name?"), "CITY"-> Array("Which city?", "Which is the city?") )) val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "zero_shot_ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, zero_shot_ner, ner_converter)) val data = Seq("Hellen works in London, Paris and Berlin. My name is Clara, I live in New York and Hellen lives in Paris.", "John is a man who works in London, London and London.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.zero_shot.ner_roberta").predict("""Hellen works in London, Paris and Berlin. My name is Clara, I live in New York and Hellen lives in Paris.""") ```
## Results ```bash +------+---------+--------+-----+---+----------+ | token|ner_label|sentence|begin|end|confidence| +------+---------+--------+-----+---+----------+ |Hellen| B-NAME| 0| 0| 5|0.13306311| | works| O| 0| 7| 11| null| | in| O| 0| 13| 14| null| |London| B-CITY| 0| 16| 21| 0.4064213| | ,| O| 0| 22| 22| null| | Paris| B-CITY| 0| 24| 28|0.04597357| | and| O| 0| 30| 32| null| |Berlin| B-CITY| 0| 34| 39|0.16265489| | .| O| 0| 40| 40| null| | My| O| 1| 42| 43| null| | name| O| 1| 45| 48| null| | is| O| 1| 50| 51| null| | Clara| B-NAME| 1| 53| 57| 0.9274031| | ,| O| 1| 58| 58| null| | I| O| 1| 60| 60| null| | live| O| 1| 62| 65| null| | in| O| 1| 67| 68| null| | New| B-CITY| 1| 70| 72|0.82799006| | York| I-CITY| 1| 74| 77|0.82799006| | and| O| 1| 79| 81| null| |Hellen| B-NAME| 1| 83| 88|0.40429682| | lives| O| 1| 90| 94| null| | in| O| 1| 96| 97| null| | Paris| B-CITY| 1| 99|103|0.49216735| | .| O| 1| 104|104| null| | John| B-NAME| 0| 0| 3|0.14063153| | is| O| 0| 5| 6| null| | a| O| 0| 8| 8| null| | man| O| 0| 10| 12| null| | who| O| 0| 14| 16| null| | works| O| 0| 18| 22| null| | in| O| 0| 24| 25| null| |London| B-CITY| 0| 27| 32|0.15521188| | ,| O| 0| 33| 33| null| |London| B-CITY| 0| 35| 40|0.12151082| | and| O| 0| 42| 44| null| |London| B-CITY| 0| 46| 51| 0.2650951| | .| O| 0| 52| 52| null| +------+---------+--------+-----+---+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|zero_shot_ner_roberta| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|460.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References As it is a Zero-Shot NER, no training dataset is necessary. 
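The token-level output above uses BIO tags, which the `NerConverter` stage merges into entity chunks. A minimal pure-Python sketch of that merge step (illustrative, not the annotator's actual implementation):

```python
# Merge BIO-tagged tokens into (chunk_text, label) pairs, the way an
# NER converter assembles "New" + "York" into one CITY chunk.
def bio_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            current.append(tok)
        else:                       # "O" or an inconsistent I- tag closes the chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["I", "live", "in", "New", "York", "and", "Paris"]
tags   = ["O", "O", "O", "B-CITY", "I-CITY", "O", "B-CITY"]
print(bio_to_chunks(tokens, tags))  # [('New York', 'CITY'), ('Paris', 'CITY')]
```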
--- layout: model title: Fast Neural Machine Translation Model from Arabic to Italian author: John Snow Labs name: opus_mt_ar_it date: 2021-06-01 tags: [open_source, seq2seq, translation, ar, it, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). source languages: ar target languages: it {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ar_it_xx_3.1.0_2.4_1622556125806.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ar_it_xx_3.1.0_2.4_1622556125806.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ar_it", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ar_it", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Arabic.translate_to.Italian').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ar_it| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering Large Cased model (from deepset) author: John Snow Labs name: roberta_qa_deepset_large_squad2 date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-squad2` is an English model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_large_squad2_en_4.2.4_3.0_1669987928587.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_large_squad2_en_4.2.4_3.0_1669987928587.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_large_squad2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_large_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
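Extractive QA models such as this one score every token in the context as a potential answer start and end, then return the highest-scoring valid span. A toy sketch of that selection step with made-up logits (not this model's actual outputs):

```python
# Pick the best (start, end) span with end >= start,
# scoring a span as start_logit + end_logit.
def best_span(start_logits, end_logits, max_len=10):
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.2, 0.1, 5.0, 0.0, 0.1, 0.0, 0.1, 1.0, 0.0]
end_logits   = [0.0, 0.1, 0.2, 4.5, 0.1, 0.0, 0.1, 0.0, 1.2, 0.1]
s, e = best_span(start_logits, end_logits)
print(tokens[s:e + 1])  # ['Clara']
```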
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_deepset_large_squad2| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/deepset/roberta-large-squad2 --- layout: model title: Legal Question Answering (RoBerta, CUAD, Base) author: John Snow Labs name: legqa_roberta_cuad_base date: 2023-01-30 tags: [en, licensed, tensorflow] task: Question Answering language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Legal RoBerta-based Question Answering model, trained on squad-v2, finetuned on CUAD dataset (base). In order to use it, a specific prompt is required. This is an example of it for extracting PARTIES: ``` "Highlight the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. Details: The two or more parties who signed the contract" ``` ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legqa_roberta_cuad_base_en_1.0.0_3.0_1675083334950.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legqa_roberta_cuad_base_en_1.0.0_3.0_1675083334950.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) spanClassifier = nlp.RoBertaForQuestionAnswering.pretrained("legqa_roberta_cuad_base","en", "legal/models") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = nlp.Pipeline().setStages([ documentAssembler, spanClassifier ]) text = """THIS CREDIT AGREEMENT is dated as of April 29, 2010, and is made by and among P.H. GLATFELTER COMPANY, a Pennsylvania corporation ( the "COMPANY") and certain of its subsidiaries. Identified on the signature pages hereto (each a "BORROWER" and collectively, the "BORROWERS"), each of the GUARANTORS (as hereinafter defined), the LENDERS (as hereinafter defined), PNC BANK, NATIONAL ASSOCIATION, in its capacity as agent for the Lenders under this Agreement (hereinafter referred to in such capacity as the "ADMINISTRATIVE AGENT"), and, for the limited purpose of public identification in trade tables, PNC CAPITAL MARKETS LLC and CITIZENS BANK OF PENNSYLVANIA, as joint arrangers and joint bookrunners, and CITIZENS BANK OF PENNSYLVANIA, as syndication agent.""".replace('\n',' ') questions = ['"Highlight the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. Details: The two or more parties who signed the contract"'] qt = [ [q,text] for q in questions ] example = spark.createDataFrame(qt).toDF("question", "context") result = pipeline.fit(example).transform(example) result.select('document_question.result', 'answer.result').show(truncate=False) ```
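The CUAD-style prompt above can also be assembled programmatically when querying several clause categories against the same contract. A minimal plain-Python sketch; the category names and descriptions here are illustrative examples of the CUAD prompt convention, not an exhaustive or official list:

```python
# Illustrative CUAD-style categories (description texts are examples only).
CUAD_CATEGORIES = {
    "Parties": "The two or more parties who signed the contract",
    "Agreement Date": "The date of the contract",
}

def build_prompt(category: str, description: str) -> str:
    # Mirrors the prompt shape shown in the example above.
    return (f'"Highlight the parts (if any) of this contract related to '
            f'"{category}" that should be reviewed by a lawyer. '
            f'Details: {description}"')

prompts = [build_prompt(c, d) for c, d in CUAD_CATEGORIES.items()]
```

Each prompt can then be paired with the contract text, exactly as `qt = [[q, text] for q in questions]` does in the snippet above.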
## Results ```bash ["Highlight the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. Details: The two or more parties who signed the contract"]|[P . H . GLATFELTER COMPANY]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legqa_roberta_cuad_base| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|453.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References SQuAD, fine-tuned with CUAD-based Question/Answering --- layout: model title: English Named Entity Recognition (from DeDeckerThomas) author: John Snow Labs name: distilbert_ner_keyphrase_extraction_distilbert_kptimes date: 2022-05-16 tags: [distilbert, ner, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyphrase-extraction-distilbert-kptimes` is an English model originally trained by `DeDeckerThomas`. ## Predicted Entities `KEY` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_keyphrase_extraction_distilbert_kptimes_en_3.4.2_3.0_1652721921747.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_keyphrase_extraction_distilbert_kptimes_en_3.4.2_3.0_1652721921747.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_keyphrase_extraction_distilbert_kptimes","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_keyphrase_extraction_distilbert_kptimes","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_ner_keyphrase_extraction_distilbert_kptimes| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/DeDeckerThomas/keyphrase-extraction-distilbert-kptimes - https://paperswithcode.com/sota?task=Keyphrase+Extraction&dataset=kptimes --- layout: model title: Multilingual T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_flan_small date: 2023-01-30 tags: [vi, ne, fi, ur, ku, yo, si, ru, it, zh, la, hi, he, xh, so, ca, ar, as, sw, en, ro, ig, te, th, ta, ce, es, gu, or, fr, ka, "no", li, cr, ch, be, ha, ga, ja, pa, ko, sl, open_source, t5, xx, tensorflow] task: Text Generation language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `flan-t5-small` is a Multilingual model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_flan_small_xx_4.3.0_3.0_1675102370004.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_flan_small_xx_4.3.0_3.0_1675102370004.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_flan_small","xx") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_flan_small","xx") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
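FLAN-T5 is instruction-tuned, so the task is expressed in the input text itself rather than through a separate task setting. A small sketch of preparing instruction-style rows for the `text` column; the template phrasings are illustrative assumptions, not prescribed prompts:

```python
def make_rows(task_template: str, items):
    """Format raw strings into instruction-style FLAN inputs,
    one single-element list per DataFrame row."""
    return [[task_template.format(text=t)] for t in items]

rows = make_rows("Translate to German: {text}", ["My name is Arthur"])
# Each inner list is one row for spark.createDataFrame(rows).toDF("text")
```

The resulting list can be passed directly to `spark.createDataFrame(rows).toDF("text")` in the pipeline above.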
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_flan_small| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|xx| |Size:|349.5 MB| ## References - https://huggingface.co/google/flan-t5-small - https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints - https://arxiv.org/pdf/2210.11416.pdf - https://github.com/google-research/t5x - https://github.com/google/jax - https://mlco2.github.io/impact#compute - https://arxiv.org/abs/1910.09700 --- layout: model title: Multilingual XLMRoBerta Embeddings author: John Snow Labs name: xlmroberta_embeddings_afriberta_base date: 2022-05-13 tags: [ha, yo, ig, am, so, open_source, xlm_roberta, embeddings, xx, afriberta] task: Embeddings language: xx edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true recommended: true annotator: XlmRoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `afriberta_base` is a Multilingual model originally trained by `castorini`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_afriberta_base_xx_3.4.4_3.0_1652439193066.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_afriberta_base_xx_3.4.4_3.0_1652439193066.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_afriberta_base","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_afriberta_base","xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
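Downstream, the per-token vectors in the `embeddings` column are commonly pooled into one sentence vector and compared by cosine similarity. A minimal plain-Python sketch of that post-processing; the 2-dimensional vectors are dummies standing in for the real AfriBERTa token embeddings:

```python
import math

def mean_pool(token_vectors):
    """Average a list of token embeddings into one sentence vector."""
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / len(token_vectors)
            for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

sent = mean_pool([[1.0, 0.0], [0.0, 1.0]])  # -> [0.5, 0.5]
sim = cosine(sent, [1.0, 1.0])              # same direction, cosine ~ 1.0
```

In a real pipeline the token vectors would come from exploding the `embeddings` annotation column of the transformed DataFrame.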
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_embeddings_afriberta_base| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|xx| |Size:|417.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/castorini/afriberta_base - https://github.com/keleog/afriberta --- layout: model title: English asr_wav2vec2_med_custom_train_large TFWav2Vec2ForCTC from PrajwalS author: John Snow Labs name: asr_wav2vec2_med_custom_train_large date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_med_custom_train_large` is an English model originally trained by PrajwalS. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use `asr_wav2vec2_med_custom_train_large_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_med_custom_train_large_en_4.2.0_3.0_1664122216388.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_med_custom_train_large_en_4.2.0_3.0_1664122216388.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_med_custom_train_large", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_med_custom_train_large", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
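The `audio_content` column is expected to hold the raw waveform as an array of floats (Wav2Vec2 models are typically trained on 16 kHz mono audio). A standard-library sketch of converting a 16-bit PCM WAV file into such an array; the file path is a placeholder:

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit PCM WAV file and scale samples to [-1.0, 1.0]."""
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Usage (path is a placeholder):
# audioDf = spark.createDataFrame([[wav_to_floats("sample.wav")]]) \
#     .toDF("audio_content")
```

This only handles 16-bit mono PCM; other encodings or sample rates would need resampling first.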
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_med_custom_train_large| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Malay T5ForConditionalGeneration Tiny Cased model (from mesolitica) author: John Snow Labs name: t5_tiny_bahasa_cased date: 2023-01-31 tags: [ms, open_source, t5, tensorflow] task: Text Generation language: ms edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-tiny-bahasa-cased` is a Malay model originally trained by `mesolitica`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_tiny_bahasa_cased_ms_4.3.0_3.0_1675156097275.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_tiny_bahasa_cased_ms_4.3.0_3.0_1675156097275.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_tiny_bahasa_cased","ms") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_tiny_bahasa_cased","ms") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_tiny_bahasa_cased| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ms| |Size:|90.5 MB| ## References - https://huggingface.co/mesolitica/t5-tiny-bahasa-cased - https://github.com/huseinzol05/malaya/tree/master/pretrained-model/t5/prepare - https://github.com/google-research/text-to-text-transfer-transformer - https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5 --- layout: model title: Legal Relation Extraction (Parties, Alias, Dates, Document Type, Sm, Bidirectional) author: John Snow Labs name: legre_contract_doc_parties date: 2022-08-12 tags: [en, legal, re, relations, agreements, licensed] task: Relation Extraction language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true recommended: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description IMPORTANT: Don't run this model on the whole legal agreement. Instead: - Split by paragraphs. You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration; - Use the `legclf_introduction_clause` Text Classifier to select only these paragraphs; This is a Legal Relation Extraction model, which can be used after the NER Model for extracting Parties, Document Types, Effective Dates and Aliases, called `legner_contract_doc_parties`. As an output, you will get the relations linking the different concepts together, if such relation exists. 
The list of relations is: - dated_as: A Document has an Effective Date - has_alias: The Alias of a Party all along the document - has_collective_alias: An Alias hold by several parties at the same time - signed_by: Between a Party and the document they signed This model is a `sm` model without meaningful directions in the relations (the model was not trained to understand if the direction of the relation is from left to right or right to left). There are bigger models in Models Hub trained also with directed relationships. ## Predicted Entities `dated_as`, `has_alias`, `has_collective_alias`, `signed_by` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/LEGALRE_PARTIES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legre_contract_doc_parties_en_1.0.0_3.2_1660293010932.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legre_contract_doc_parties_en_1.0.0_3.2_1660293010932.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_base_uncased_legal", "en") \ .setInputCols("document", "token") \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained('legner_contract_doc_parties', 'en', 'legal/models')\ .setInputCols(["document", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") reDL = legal.RelationExtractionDLModel.pretrained('legre_contract_doc_parties', 'en', 'legal/models')\ .setPredictionThreshold(0.5)\ .setInputCols(["ner_chunk", "document"])\ .setOutputCol("relations") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, ner_model, ner_converter, reDL ]) text=''' This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties"). ''' data = spark.createDataFrame([[text]]).toDF("text") model = nlpPipeline.fit(data) result = model.transform(data) ```
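Once the pipeline has run, the `relations` column can be flattened into (relation, chunk1, chunk2) triples and grouped by relation type. A plain-Python sketch of that grouping step; the sample triples are taken from the extracted relations shown for this agreement:

```python
from collections import defaultdict

def group_relations(triples):
    """Group (relation, chunk1, chunk2) triples by relation type."""
    grouped = defaultdict(list)
    for rel, c1, c2 in triples:
        grouped[rel].append((c1, c2))
    return dict(grouped)

sample = [
    ("dated_as", "INTELLECTUAL PROPERTY AGREEMENT", "December 31, 2018"),
    ("has_alias", "Armstrong Flooring, Inc", "Seller"),
    ("has_alias", "AFI Licensing LLC", "Licensing"),
]
grouped = group_relations(sample)
# grouped["has_alias"] -> [('Armstrong Flooring, Inc', 'Seller'),
#                          ('AFI Licensing LLC', 'Licensing')]
```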
## Results ```bash relation entity1 entity1_begin entity1_end chunk1 entity2 entity2_begin entity2_end chunk2 confidence dated_as DOC 6 36 INTELLECTUAL PROPERTY AGREEMENT EFFDATE 70 86 December 31, 2018 0.9933402 signed_by DOC 6 36 INTELLECTUAL PROPERTY AGREEMENT PARTY 142 164 Armstrong Flooring, Inc 0.6235637 signed_by DOC 6 36 INTELLECTUAL PROPERTY AGREEMENT PARTY 316 331 AHF Holding, Inc 0.5001139 has_alias PARTY 142 164 Armstrong Flooring, Inc ALIAS 193 198 Seller 0.93385726 has_alias PARTY 206 222 AFI Licensing LLC ALIAS 264 272 Licensing 0.9859913 has_collective_alias ALIAS 293 298 Seller ALIAS 302 308 Arizona 0.82137156 has_alias PARTY 316 331 AHF Holding, Inc ALIAS 400 404 Buyer 0.8178999 has_alias PARTY 412 446 Armstrong Hardwood Flooring Company ALIAS 479 485 Company 0.9557921 has_alias PARTY 412 446 Armstrong Hardwood Flooring Company ALIAS 575 579 Buyer 0.6778585 has_alias PARTY 412 446 Armstrong Hardwood Flooring Company ALIAS 612 616 Party 0.6778583 has_alias PARTY 412 446 Armstrong Hardwood Flooring Company ALIAS 642 648 Parties 0.6778585 has_collective_alias ALIAS 506 510 Buyer ALIAS 517 530 Buyer Entities 0.69863707 has_collective_alias ALIAS 517 530 Buyer Entities ALIAS 575 579 Buyer 0.55453944 has_collective_alias ALIAS 517 530 Buyer Entities ALIAS 612 616 Party 0.55453944 has_collective_alias ALIAS 517 530 Buyer Entities ALIAS 642 648 Parties 0.55453944 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legre_contract_doc_parties| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|409.9 MB| ## References Manual annotations on CUAD dataset ## Benchmarking ```bash label Recall Precision F1 Support dated_as 0.962 0.962 0.962 26 has_alias 0.936 0.946 0.941 94 has_collective_alias 1.000 1.000 1.000 7 no_rel 0.982 0.980 0.981 497 signed_by 0.961 0.961 0.961 76 Avg. 0.968 0.970 0.969 - Weighted-Avg. 
0.973 0.973 0.973 - ``` --- layout: model title: Hocr for table recognition pdf author: John Snow Labs name: hocr_table_recognition_pdf date: 2023-01-23 tags: [en, licensed] task: HOCR Table Recognition language: en nav_key: models edition: Visual NLP 4.2.4 spark_version: 3.2.1 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Table structure recognition based on hocr with Tesseract architecture, for PDF documents. Tesseract has been trained on a variety of datasets to improve its recognition capabilities. These datasets include images of text in various languages and scripts, as well as images with different font styles, sizes, and orientations. The training process involves feeding the engine with a large number of images and their corresponding text, allowing the engine to learn the patterns and characteristics of different text styles. One of the most important datasets used in training Tesseract is the UNLV dataset, which contains over 400,000 images of text in different languages, scripts, and font styles. This dataset is widely used in the OCR community and has been instrumental in improving the accuracy of Tesseract. Other datasets that have been used in training Tesseract include the ICDAR dataset, the IIIT-HWS dataset, and the RRC-GV-WS dataset. In addition to these datasets, Tesseract also uses a technique called adaptive training, where the engine can continuously improve its recognition capabilities by learning from new images and text. This allows Tesseract to adapt to new text styles and languages, and improve its overall accuracy. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/ocr/PDF_TABLE_RECOGNITION_HOCR/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://github.com/JohnSnowLabs/spark-ocr-workshop/tree/master/jupyter/SparkOCRPdfToTable.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pdf_to_hocr = PdfToHocr() \ .setInputCol("content") \ .setOutputCol("hocr") tokenizer = HocrTokenizer() \ .setInputCol("hocr") \ .setOutputCol("token") \ pdf_to_image = PdfToImage() \ .setInputCol("content") \ .setOutputCol("image") \ .setPageNumCol("tmp_pagenum") \ .setImageType(ImageType.TYPE_3BYTE_BGR) table_detector = ImageTableDetector \ .pretrained("general_model_table_detection_v2", "en", "public/ocr/models") \ .setInputCol("image") \ .setOutputCol("table_regions") \ .setScoreThreshold(0.9) \ .setApplyCorrection(True) \ .setScaleWidthToCol("width_dimension") \ .setScaleHeightToCol("height_dimension") image_scaler = ImageScaler() \ .setWidthCol("width_dimension") \ .setHeightCol("height_dimension") hocr_to_table = HocrToTextTable() \ .setInputCol("hocr") \ .setRegionCol("table_regions") \ .setOutputCol("tables") draw_annotations = ImageDrawAnnotations() \ .setInputCol("scaled_image") \ .setInputChunksCol("tables") \ .setOutputCol("image_with_annotations") \ .setFilledRect(False) \ .setFontSize(5) \ .setRectColor(Color.red) draw_regions = ImageDrawRegions() \ .setInputCol("scaled_image") \ .setInputRegionsCol("table_regions") \ .setOutputCol("image_with_regions") \ .setRectColor(Color.red) pipeline1 = PipelineModel(stages=[ pdf_to_hocr, tokenizer, pdf_to_image, table_detector, image_scaler, draw_regions, hocr_to_table ]) test_image_path = "data/pdfs/f1120.pdf" bin_df = spark.read.format("binaryFile").load(test_image_path) result = pipeline1.transform(bin_df).cache().drop("tmp_pagenum") result = result.filter(result.pagenum == 1) ``` ```scala val pdf_to_hocr = new PdfToHocr() .setInputCol("content") .setOutputCol("hocr") val tokenizer = new HocrTokenizer() .setInputCol("hocr") .setOutputCol("token") val pdf_to_image = new PdfToImage() .setInputCol("content") .setOutputCol("image") .setPageNumCol("tmp_pagenum") .setImageType(ImageType.TYPE_3BYTE_BGR) val table_detector = ImageTableDetector 
.pretrained("general_model_table_detection_v2", "en", "public/ocr/models") .setInputCol("image") .setOutputCol("table_regions") .setScoreThreshold(0.9) .setApplyCorrection(true) .setScaleWidthToCol("width_dimension") .setScaleHeightToCol("height_dimension") val image_scaler = new ImageScaler() .setWidthCol("width_dimension") .setHeightCol("height_dimension") val hocr_to_table = new HocrToTextTable() .setInputCol("hocr") .setRegionCol("table_regions") .setOutputCol("tables") val draw_annotations = new ImageDrawAnnotations() .setInputCol("scaled_image") .setInputChunksCol("tables") .setOutputCol("image_with_annotations") .setFilledRect(false) .setFontSize(5) .setRectColor(Color.red) val draw_regions = new ImageDrawRegions() .setInputCol("scaled_image") .setInputRegionsCol("table_regions") .setOutputCol("image_with_regions") .setRectColor(Color.red) val pipeline1 = new PipelineModel().setStages(Array( pdf_to_hocr, tokenizer, pdf_to_image, table_detector, image_scaler, draw_regions, hocr_to_table)) val test_image_path = "data/pdfs/f1120.pdf" val bin_df = spark.read.format("binaryFile").load(test_image_path) var result = pipeline1.transform(bin_df).cache().drop("tmp_pagenum") result = result.filter(result("pagenum") === 1) ```
## Example {%- capture input_image -%} ![Screenshot](/assets/images/examples_ocr/image14.png) {%- endcapture -%} {%- capture output_image -%} ![Screenshot](/assets/images/examples_ocr/image14_out.png) {%- endcapture -%} {% include templates/input_output_image.md input_image=input_image output_image=output_image %} ## Output text ```bash path modificationTime length hocr height_dimension width_dimension pagenum token image total_pages documentnum table_regions scaled_image image_with_regions tables exception table_index file:/content/f11... 2023-01-23 08:16:... 3471478 ```
Live Demo [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_RoBERTa_hindi_guj_san_hi_3.4.2_3.0_1649947496602.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_RoBERTa_hindi_guj_san_hi_3.4.2_3.0_1649947496602.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_RoBERTa_hindi_guj_san","hi") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["मुझे स्पार्क एनएलपी पसंद है"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_RoBERTa_hindi_guj_san","hi") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("मुझे स्पार्क एनएलपी पसंद है").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("hi.embed.RoBERTa_hindi_guj_san").predict("""मुझे स्पार्क एनएलपी पसंद है""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_RoBERTa_hindi_guj_san| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|hi| |Size:|252.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/surajp/RoBERTa-hindi-guj-san - https://github.com/goru001/inltk - https://www.kaggle.com/disisbig/hindi-wikipedia-articles-172k - https://www.kaggle.com/disisbig/gujarati-wikipedia-articles - https://www.kaggle.com/disisbig/sanskrit-wikipedia-articles - https://twitter.com/parmarsuraj99 - https://www.linkedin.com/in/parmarsuraj99/ --- layout: model title: Multilingual XLMRobertaForTokenClassification Base Cased model (from iis2009002) author: John Snow Labs name: xlmroberta_ner_iis2009002_base_finetuned_panx_all date: 2022-08-13 tags: [xx, open_source, xlm_roberta, ner] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-all` is a Multilingual model originally trained by `iis2009002`. ## Predicted Entities `ORG`, `LOC`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_iis2009002_base_finetuned_panx_all_xx_4.1.0_3.0_1660428464219.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_iis2009002_base_finetuned_panx_all_xx_4.1.0_3.0_1660428464219.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_iis2009002_base_finetuned_panx_all","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_iis2009002_base_finetuned_panx_all","xx") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
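The `ner` column carries per-token IOB tags, which `NerConverter` merges into entity chunks. For intuition, a plain-Python sketch of the same IOB-to-chunk merge; the tokens and tags are toy examples, not model output:

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (entity_type, text) chunks."""
    chunks, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((ctype, " ".join(current)))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" tag or dangling "I-": close any open chunk
            if current:
                chunks.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        chunks.append((ctype, " ".join(current)))
    return chunks

print(iob_to_chunks(["John", "Smith", "visited", "Paris"],
                    ["B-PER", "I-PER", "O", "B-LOC"]))
# -> [('PER', 'John Smith'), ('LOC', 'Paris')]
```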
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_iis2009002_base_finetuned_panx_all| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|861.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/iis2009002/xlm-roberta-base-finetuned-panx-all --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4_Modified_biobert_v1.1 date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4_Modified-biobert-v1.1` is a English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_Modified_biobert_v1.1_en_4.0.0_3.0_1657109068197.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_Modified_biobert_v1.1_en_4.0.0_3.0_1657109068197.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_Modified_biobert_v1.1","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_Modified_biobert_v1.1","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4_Modified_biobert_v1.1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|403.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4_Modified-biobert-v1.1 --- layout: model title: Financial Relation Extraction (Work Experience, Medium, Unidirectional) author: John Snow Labs name: finre_work_experience_md date: 2022-11-08 tags: [work, experience, role, en, licensed] task: Relation Extraction language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description IMPORTANT: Don't run this model on a whole financial report. Instead: - Split the report by paragraphs. You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration; - Use the `finclf_work_experience_item` Text Classifier to select only those paragraphs. This is the `md` (medium) version of the `finre_work_experience` model, trained with more data and with **unidirectional relation extraction**, meaning the direction of the arrow now matters: it goes from the source (`chunk1`) to the target (`chunk2`). This model lets you analyze present and past job positions of people, extracting relations between PERSON, ORG, ROLE and DATE. It requires an NER model with the mentioned entities, such as `finner_org_per_role_date`, and can also be combined with `finassertiondl_past_roles` to detect whether the entities are mentioned as having happened in the PAST or not (although you can also infer that from relations such as `had_role_until`). 
## Predicted Entities `has_role`, `had_role_until`, `has_role_from`, `works_for`, `has_role_in_company` {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finre_work_experience_md_en_1.0.0_3.0_1667922980930.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finre_work_experience_md_en_1.0.0_3.0_1667922980930.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencizer = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "en") \ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") bert_embeddings= BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\ .setInputCols(["sentence", "token"])\ .setOutputCol("bert_embeddings") ner_model = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\ .setInputCols(["sentence", "token", "bert_embeddings"])\ .setOutputCol("ner_orgs") ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner_orgs"])\ .setOutputCol("ner_chunk") pos = PerceptronModel.pretrained()\ .setInputCols(["sentence", "token"])\ .setOutputCol("pos") dependency_parser = DependencyParserModel().pretrained("dependency_conllu", "en")\ .setInputCols(["sentence", "pos", "token"])\ .setOutputCol("dependencies") re_filter = finance.RENerChunksFilter()\ .setInputCols(["ner_chunk", "dependencies"])\ .setOutputCol("re_ner_chunk")\ .setRelationPairs(["PERSON-ROLE", "PERSON-ORG", "ORG-ROLE", "DATE-ROLE"]) reDL = finance.RelationExtractionDLModel()\ .pretrained('finre_work_experience_md','en','finance/models')\ .setInputCols(["re_ner_chunk", "sentence"])\ .setOutputCol("relations") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentencizer, tokenizer, bert_embeddings, ner_model, ner_converter, pos, dependency_parser, re_filter, reDL]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = f"On December 15, 2021, Anirudh Devgan assumed the role of President and Chief Executive Officer of Cadence, replacing Lip-Bu Tan. Prior to his role as Chief Executive Officer, Dr. Devgan served as President of Cadence. Concurrently, Mr. 
Tan transitioned to the role of Executive Chair" lmodel = LightPipeline(model) results = lmodel.fullAnnotate(text) rel_df = get_relations_df(results) rel_df = rel_df[rel_df['relation']!='other'] print(rel_df.to_string(index=False)) print() ```
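The example above relies on `get_relations_df`, a helper defined in the John Snow Labs workshop notebooks rather than in the library itself. A minimal sketch of such a helper (assuming the usual `fullAnnotate` output, where each relation annotation exposes a `result` and a `metadata` dictionary with keys such as `entity1`, `chunk1`, `entity2`, `chunk2` and `confidence`) could look like:

```python
import pandas as pd

def get_relations_df(results, col="relations"):
    """Flatten LightPipeline.fullAnnotate relation output into a pandas DataFrame.

    Assumes each annotation in results[i][col] carries the standard Spark NLP
    relation metadata keys (entity1, chunk1, entity2, chunk2, confidence)."""
    rows = []
    for res in results:
        for rel in res[col]:
            m = rel.metadata
            rows.append({
                "relation": rel.result,       # e.g. "works_for"
                "entity1": m.get("entity1"),
                "chunk1": m.get("chunk1"),
                "entity2": m.get("entity2"),
                "chunk2": m.get("chunk2"),
                "confidence": m.get("confidence"),
            })
    return pd.DataFrame(rows)
```

Column names and metadata keys are illustrative here; check your actual `fullAnnotate` output before relying on them.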
## Results ```bash relation entity1 entity1_begin entity1_end chunk1 entity2 entity2_begin entity2_end chunk2 confidence has_role_from DATE 3 19 December 15, 2021 ROLE 57 65 President 0.9532135 has_role_from DATE 3 19 December 15, 2021 ROLE 71 93 Chief Executive Officer 0.91833746 has_role PERSON 22 35 Anirudh Devgan ROLE 57 65 President 0.9993814 has_role PERSON 22 35 Anirudh Devgan ROLE 71 93 Chief Executive Officer 0.9889985 works_for PERSON 22 35 Anirudh Devgan ORG 98 104 Cadence 0.9983778 has_role_in_company ROLE 57 65 President ORG 98 104 Cadence 0.9997348 has_role_in_company ROLE 71 93 Chief Executive Officer ORG 98 104 Cadence 0.99845624 has_role ROLE 150 172 Chief Executive Officer PERSON 175 184 Dr. Devgan 0.85268635 has_role_in_company ROLE 150 172 Chief Executive Officer ORG 209 215 Cadence 0.9976404 has_role PERSON 175 184 Dr. Devgan ROLE 196 204 President 0.99899226 works_for PERSON 175 184 Dr. Devgan ORG 209 215 Cadence 0.99876934 has_role_in_company ROLE 196 204 President ORG 209 215 Cadence 0.9997203 has_role PERSON 232 238 Mr. Tan ROLE 268 282 Executive Chair 0.98612714 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finre_work_experience_md| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|405.7 MB| ## References Manual annotations on CUAD dataset, 10K filings and Wikidata ## Benchmarking ```bash label Recall Precision F1 Support had_role_until 1.000 1.000 1.000 117 has_role 0.998 0.995 0.997 649 has_role_from 1.000 1.000 1.000 401 has_role_in_company 0.993 0.993 0.993 268 other 0.996 0.996 0.996 235 works_for 0.994 1.000 0.997 330 Avg. 0.997 0.997 0.997 2035 Weighted-Avg. 
0.997 0.997 0.997 2035 ``` --- layout: model title: Detect PHI in text (enriched-biobert) author: John Snow Labs name: ner_deid_enriched_biobert date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect sensitive information in text for de-identification using a pretrained NER model. We adhered to the official annotation guidelines (AG) of the 2014 i2b2 de-identification challenge while annotating the new datasets for this model. All the details regarding the nuances of and explanations for the AG can be found here: [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/) ## Predicted Entities `DOCTOR`, `PHONE`, `COUNTRY`, `MEDICALRECORD`, `STREET`, `CITY`, `PROFESSION`, `PATIENT`, `IDNUM`, `BIOID`, `HEALTHPLAN`, `HOSPITAL`, `USERNAME`, `LOCATION-OTHER`, `AGE`, `FAX`, `EMAIL`, `DATE`, `STATE`, `ZIP`, `URL`, `ORGANIZATION`, `DEVICE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_enriched_biobert_en_3.0.0_3.0_1617260810027.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_enriched_biobert_en_3.0.0_3.0_1617260810027.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_deid_enriched_biobert", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_deid_enriched_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("EXAMPLE_TEXT").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu 
nlu.load("en.med_ner.deid.enriched_biobert").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_enriched_biobert| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| --- layout: model title: English asr_Central_kurdish_xlsr TFWav2Vec2ForCTC from Akashpb13 author: John Snow Labs name: asr_Central_kurdish_xlsr date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Central_kurdish_xlsr` is an English model originally trained by Akashpb13. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_Central_kurdish_xlsr_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Central_kurdish_xlsr_en_4.2.0_3.0_1664103765643.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Central_kurdish_xlsr_en_4.2.0_3.0_1664103765643.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_Central_kurdish_xlsr", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_Central_kurdish_xlsr", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_Central_kurdish_xlsr| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Sentence Entity Resolver for RxNorm (Action / Treatment) author: John Snow Labs name: sbiobertresolve_rxnorm_action_treatment date: 2022-04-25 tags: [licensed, en, entity_resolution, clinical, rxnorm] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.5.1 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. Additionally, this model returns actions and treatments of the drugs in `all_k_aux_labels` column. ## Predicted Entities `RxNorm Codes`, `Action`, `Treatment` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_action_treatment_en_3.5.1_2.4_1650899853599.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_action_treatment_en_3.5.1_2.4_1650899853599.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_action_treatment", "en", "clinical/models")\ .setInputCols(["sbert_embeddings"])\ .setOutputCol("rxnorm_code")\ .setDistanceFunction("EUCLIDEAN") pipelineModel = PipelineModel( stages = [ documentAssembler, sbert_embedder, rxnorm_resolver ]) light_model = LightPipeline(pipelineModel) result = light_model.fullAnnotate(["Zita 200 mg", "coumadin 5 mg", "avandia 4 mg"]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en","clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sbert_embeddings") val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_action_treatment", "en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("rxnorm_code") .setDistanceFunction("EUCLIDEAN") val rxnorm_pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, rxnorm_resolver)) val rxnorm_pipelineModel = rxnorm_pipeline.fit(Seq("").toDF("text")) val light_model = new LightPipeline(rxnorm_pipelineModel) val result = light_model.fullAnnotate(Array("Zita 200 mg", "coumadin 5 mg", "avandia 4 mg")) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.rxnorm_action_treatment").predict("""coumadin 5 mg""") ```
## Results ```bash |    | ner_chunk     |   rxnorm_code | action                                                   | treatment                                                                                                                                                       | |---:|:--------------|--------------:|:---------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------| |  0 | Zita 200 mg   |        104080 | ['Analgesic', 'Antacid', 'Antipyretic', 'Pain Reliever'] | ['Backache', 'Pain', 'Sore Throat', 'Headache', 'Influenza', 'Toothache', 'Heartburn', 'Migraine', 'Muscular Aches And Pains', 'Neuralgia', 'Cold', 'Weakness'] | |  1 | coumadin 5 mg |        855333 | ['Anticoagulant'] | ['Cerebrovascular Accident', 'Pulmonary Embolism', 'Heart Attack', 'AF', 'Embolization'] | |  2 | avandia 4 mg  |        261242 | ['Drugs Used In Diabets', 'Hypoglycemic'] | ['Diabetes Mellitus', 'Type 1 Diabetes Mellitus', 'Type 2 Diabetes'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_rxnorm_action_treatment| |Compatibility:|Healthcare NLP 3.5.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[rxnorm_code]| |Language:|en| |Size:|918.7 MB| |Case sensitive:|false| --- layout: model title: English asr_wav2vec2_large_robust_swbd_300h TFWav2Vec2ForCTC from facebook author: John Snow Labs name: asr_wav2vec2_large_robust_swbd_300h date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_robust_swbd_300h` is an English model originally trained by facebook. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_robust_swbd_300h_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_robust_swbd_300h_en_4.2.0_3.0_1664038284772.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_robust_swbd_300h_en_4.2.0_3.0_1664038284772.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_robust_swbd_300h", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_robust_swbd_300h", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_robust_swbd_300h| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|757.5 MB| --- layout: model title: English BertForQuestionAnswering model (from maroo93) author: John Snow Labs name: bert_qa_squad2.0 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squad2.0` is an English model originally trained by `maroo93`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_squad2.0_en_4.0.0_3.0_1654192176164.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_squad2.0_en_4.0.0_3.0_1654192176164.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/maroo93/squad2.0 --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from vkrishnamoorthy) author: John Snow Labs name: distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `vkrishnamoorthy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773167222.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773167222.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_vkrishnamoorthy_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/vkrishnamoorthy/distilbert-base-uncased-finetuned-squad --- layout: model title: Named Entity Recognition for Japanese (BERT Base Japanese) author: John Snow Labs name: ner_ud_gsd_bert_base_japanese date: 2021-09-16 tags: [ja, ner, open_source] task: Named Entity Recognition language: ja edition: Spark NLP 3.2.2 spark_version: 3.0 supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates named entities in a text, which can be used to find features such as names of people, places, and organizations. The model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. This model uses the pretrained embeddings "bert_base_ja" as an input, so be sure to use the same embeddings in the pipeline. ## Predicted Entities `ORDINAL`, `PERSON`, `LAW`, `MOVEMENT`, `LOC`, `WORK_OF_ART`, `DATE`, `NORP`, `TITLE_AFFIX`, `QUANTITY`, `FAC`, `TIME`, `MONEY`, `LANGUAGE`, `GPE`, `EVENT`, `ORG`, `PERCENT`, `PRODUCT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_ud_gsd_bert_base_japanese_ja_3.2.2_3.0_1631804789491.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ner_ud_gsd_bert_base_japanese_ja_3.2.2_3.0_1631804789491.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import sparknlp from pyspark.ml import Pipeline from sparknlp.annotator import * from sparknlp.base import * from sparknlp.training import * documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") \ .setInputCols(["sentence"]) \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_base_japanese", "ja") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nerTagger = NerDLModel.pretrained("ner_ud_gsd_bert_base_japanese", "ja") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") pipeline = Pipeline().setStages( [ documentAssembler, sentence, word_segmenter, embeddings, nerTagger, ] ) data = spark.createDataFrame([["宮本茂氏は、日本の任天堂のゲームプロデューサーです。"]]).toDF("text") model = pipeline.fit(data) result = model.transform(data) result.selectExpr("explode(arrays_zip(token.result, ner.result))").show() ``` ```scala import spark.implicits._ import com.johnsnowlabs.nlp.DocumentAssembler import com.johnsnowlabs.nlp.annotator.{SentenceDetector, WordSegmenterModel} import com.johnsnowlabs.nlp.embeddings.BertEmbeddings import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") .setInputCols("sentence") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_base_japanese", "ja") .setInputCols("sentence", "token") .setOutputCol("embeddings") val nerTagger = NerDLModel.pretrained("ner_ud_gsd_bert_base_japanese", "ja") .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") val pipeline = new 
Pipeline().setStages(Array( documentAssembler, sentence, word_segmenter, embeddings, nerTagger )) val data = Seq("宮本茂氏は、日本の任天堂のゲームプロデューサーです。").toDF("text") val model = pipeline.fit(data) val result = model.transform(data) result.selectExpr("explode(arrays_zip(token.result, ner.result))").show() ```
## Results ```bash # +-------------------+ # | col| # +-------------------+ # | {宮本, B-PERSON}| # | {茂, I-PERSON}| # | {氏, O}| # | {は, O}| # | {、, O}| # | {日本, B-GPE}| # | {の, O}| # | {任天, B-ORG}| # | {堂, I-ORG}| # | {の, O}| # | {ゲーム, O}| # |{プロデューサー, O}| # | {です, O}| # | {。, O}| # +-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_ud_gsd_bert_base_japanese| |Type:|ner| |Compatibility:|Spark NLP 3.2.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ja| |Dependencies:|bert_base_ja| ## Data Source The model was trained on the Universal Dependencies, curated by Google. A NER version was created by megagonlabs: https://github.com/megagonlabs/UD_Japanese-GSD Reference: Asahara, M., Kanayama, H., Tanaka, T., Miyao, Y., Uematsu, S., Mori, S., Matsumoto, Y., Omura, M., & Murawaki, Y. (2018). Universal Dependencies Version 2 for Japanese. In LREC-2018. ## Benchmarking ```bash label precision recall f1-score support CARDINAL 0.00 0.00 0.00 0 DATE 0.95 0.96 0.96 206 EVENT 0.84 0.50 0.63 52 FAC 0.75 0.71 0.73 59 GPE 0.79 0.76 0.78 102 LANGUAGE 1.00 1.00 1.00 8 LAW 1.00 0.31 0.47 13 LOC 0.89 0.83 0.86 41 MONEY 1.00 1.00 1.00 20 MOVEMENT 1.00 0.18 0.31 11 NORP 0.85 0.82 0.84 57 O 0.99 0.99 0.99 11785 ORDINAL 0.81 0.94 0.87 32 ORG 0.78 0.65 0.71 179 PERCENT 0.89 1.00 0.94 16 PERSON 0.76 0.84 0.80 127 PRODUCT 0.62 0.68 0.65 50 QUANTITY 0.92 0.94 0.93 172 TIME 0.97 0.88 0.92 32 TITLE_AFFIX 0.89 0.71 0.79 24 WORK_OF_ART 0.66 0.73 0.69 48 accuracy - - 0.97 13034 macro-avg 0.83 0.73 0.75 13034 weighted-avg 0.97 0.97 0.97 13034 ``` --- layout: model title: English BertForTokenClassification Small Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4_modified_PubmedBert_small date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 
supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4-modified-PubmedBert_small` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_modified_PubmedBert_small_en_4.0.0_3.0_1657108180187.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_modified_PubmedBert_small_en_4.0.0_3.0_1657108180187.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_modified_PubmedBert_small","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_modified_PubmedBert_small","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4_modified_PubmedBert_small| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|408.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4-modified-PubmedBert_small --- layout: model title: Pipeline to Detect Disease Mentions (MedicalBertForTokenClassification) (BERT) author: John Snow Labs name: bert_token_classifier_disease_mentions_tweet_pipeline date: 2023-03-20 tags: [es, clinical, licensed, public_health, ner, token_classification, disease, tweet] task: Named Entity Recognition language: es edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_disease_mentions_tweet](https://nlp.johnsnowlabs.com/2022/07/28/bert_token_classifier_disease_mentions_tweet_es_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_disease_mentions_tweet_pipeline_es_4.3.0_3.2_1679299531828.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_disease_mentions_tweet_pipeline_es_4.3.0_3.2_1679299531828.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_disease_mentions_tweet_pipeline", "es", "clinical/models") text = '''El diagnóstico fueron varios. Principal: Neumonía en el pulmón derecho. Sinusitis de caballo, Faringitis aguda e infección de orina, también elevada. Gripe No. Estuvo hablando conmigo, sin exagerar, mas de media hora, dándome ánimo y fuerza y que sabe, porque ha visto.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_disease_mentions_tweet_pipeline", "es", "clinical/models") val text = "El diagnóstico fueron varios. Principal: Neumonía en el pulmón derecho. Sinusitis de caballo, Faringitis aguda e infección de orina, también elevada. Gripe No. Estuvo hablando conmigo, sin exagerar, mas de media hora, dándome ánimo y fuerza y que sabe, porque ha visto." val result = pipeline.fullAnnotate(text) ```
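The pipeline reports each chunk with inclusive `begin`/`end` character offsets (see the Results below). A minimal plain-Python sketch, assuming those offsets, shows how a chunk maps back onto the input text:

```python
# Plain-Python sketch: recover a chunk from the annotated text using the
# inclusive begin/end character offsets that Spark NLP annotations report.
def slice_chunk(text, begin, end):
    # Offsets are inclusive on both sides, hence end + 1.
    return text[begin:end + 1]

text = "El diagnóstico fueron varios. Principal: Neumonía en el pulmón derecho."
print(slice_chunk(text, 41, 61))
# → Neumonía en el pulmón
```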
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:----------------------|--------:|------:|:------------|-------------:| | 0 | Neumonía en el pulmón | 41 | 61 | ENFERMEDAD | 0.999969 | | 1 | Sinusitis | 72 | 80 | ENFERMEDAD | 0.999977 | | 2 | Faringitis aguda | 94 | 109 | ENFERMEDAD | 0.999969 | | 3 | infección de orina | 113 | 130 | ENFERMEDAD | 0.999969 | | 4 | Gripe | 150 | 154 | ENFERMEDAD | 0.999983 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_disease_mentions_tweet_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|es| |Size:|462.2 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: English T5ForConditionalGeneration Base Cased model (from google) author: John Snow Labs name: t5_efficient_base_nl16 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-nl16` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nl16_en_4.3.0_3.0_1675113699074.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nl16_en_4.3.0_3.0_1675113699074.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_base_nl16","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_base_nl16","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_base_nl16| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|602.0 MB| ## References - https://huggingface.co/google/t5-efficient-base-nl16 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English asr_wav2vec2_base_100h_by_vuiseng9 TFWav2Vec2ForCTC from vuiseng9 author: John Snow Labs name: pipeline_asr_wav2vec2_base_100h_by_vuiseng9 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_by_vuiseng9` is an English model originally trained by vuiseng9. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_100h_by_vuiseng9_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_by_vuiseng9_en_4.2.0_3.0_1664022865816.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_by_vuiseng9_en_4.2.0_3.0_1664022865816.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_100h_by_vuiseng9', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_100h_by_vuiseng9", lang = "en") val annotations = pipeline.transform(audioDF) ```
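The snippet above assumes an `audioDF` that already contains the raw waveform. Wav2Vec2 models expect the audio as an array of floats in `[-1.0, 1.0]`, so 16-bit PCM samples are usually normalized first; a minimal plain-Python sketch (the `audio_content` column name is an assumption — check the pipeline's `AudioAssembler` input column):

```python
# Plain-Python sketch: convert signed 16-bit PCM samples to the float
# waveform a Wav2Vec2 pipeline expects, by dividing by 32768.
def pcm16_to_floats(samples):
    return [s / 32768.0 for s in samples]

floats = pcm16_to_floats([0, 16384, -32768, 32767])
print(floats)
# → [0.0, 0.5, -1.0, 0.999969482421875]
# rows = [(floats,)]  # e.g. spark.createDataFrame(rows, ["audio_content"])
```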
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_100h_by_vuiseng9| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|227.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English image_classifier_vit_taco_or_what ViTForImageClassification from osanseviero author: John Snow Labs name: image_classifier_vit_taco_or_what date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_taco_or_what` is an English model originally trained by osanseviero. ## Predicted Entities `burrito`, `taco`, `quesadilla`, `fajitas`, `kebab` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_taco_or_what_en_4.1.0_3.0_1660169560946.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_taco_or_what_en_4.1.0_3.0_1660169560946.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_taco_or_what", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_taco_or_what", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_taco_or_what| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Sentence Embeddings - sbert medium (tuned) author: John Snow Labs name: sbert_jsl_medium_umls_uncased date: 2021-05-14 tags: [embeddings, clinical, licensed, en] task: Embeddings language: en nav_key: models edition: Healthcare NLP 3.0.3 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained to generate contextual sentence embeddings of input sentences. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_medium_umls_uncased_en_3.0.3_2.4_1621017148548.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_medium_umls_uncased_en_3.0.3_2.4_1621017148548.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sbiobert_embeddings = BertSentenceEmbeddings\ .pretrained("sbert_jsl_medium_umls_uncased","en","clinical/models")\ .setInputCols(["sentence"])\ .setOutputCol("sbert_embeddings") ``` ```scala val sbiobert_embeddings = BertSentenceEmbeddings .pretrained("sbert_jsl_medium_umls_uncased","en","clinical/models") .setInputCols("sentence") .setOutputCol("sbert_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed_sentence.bert.jsl_medium_umls_uncased").predict("""Put your text here.""") ```
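The resulting `sbert_embeddings` vectors are typically compared with cosine similarity — the STS(cos) figure in the benchmarking below uses the same measure. A minimal plain-Python sketch:

```python
import math

# Plain-Python sketch: score the semantic similarity of two sentence
# embedding vectors (e.g. the 768-dimensional sbert_embeddings output)
# with cosine similarity.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # identical direction → 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal → 0.0
```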
## Results ```bash Gives a 768-dimensional vector representation of the sentence. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbert_jsl_medium_umls_uncased| |Compatibility:|Healthcare NLP 3.0.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Case sensitive:|false| ## Data Source Tuned on the MedNLI and UMLS datasets ## Benchmarking ```bash MedNLI Score Acc 0.744 STS(cos) 0.695 ``` --- layout: model title: Multilingual T5ForConditionalGeneration Base Cased model (from KETI-AIR) author: John Snow Labs name: t5_ke_base date: 2023-01-30 tags: [en, ko, open_source, t5, xx, tensorflow] task: Text Generation language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ke-t5-base` is a Multilingual model originally trained by `KETI-AIR`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_ke_base_xx_4.3.0_3.0_1675104312892.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_ke_base_xx_4.3.0_3.0_1675104312892.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_ke_base","xx") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_ke_base","xx") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_ke_base| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|xx| |Size:|569.3 MB| ## References - https://huggingface.co/KETI-AIR/ke-t5-base - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints - https://github.com/AIRC-KETI/ke-t5 - https://aclanthology.org/2021.findings-emnlp.33/ - https://jmlr.org/papers/volume21/20-074/20-074.pdf - https://aclanthology.org/2021.acl-long.330.pdf - https://dl.acm.org/doi/pdf/10.1145/3442188.3445922 - https://www.tensorflow.org/datasets/catalog/c4 - https://mlco2.github.io/impact#compute - https://arxiv.org/abs/1910.09700 - https://colab.research.google.com/github/google-research/text-to-text-transfer-transformer/blob/main/notebooks/t5-trivia.ipynb --- layout: model title: Translate Portuguese-based languages to English Pipeline author: John Snow Labs name: translate_cpp_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, cpp, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. 
Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `cpp` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_cpp_en_xx_2.7.0_2.4_1609687142938.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_cpp_en_xx_2.7.0_2.4_1609687142938.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_cpp_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_cpp_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.cpp.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_cpp_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Cased model (from mcurmei) author: John Snow Labs name: distilbert_qa_single_label_n_max date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `single_label_N_max` is an English model originally trained by `mcurmei`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_single_label_n_max_en_4.3.0_3.0_1672775498618.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_single_label_n_max_en_4.3.0_3.0_1672775498618.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_single_label_n_max","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_single_label_n_max","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_single_label_n_max| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mcurmei/single_label_N_max --- layout: model title: Japanese Electra Embeddings (from izumi-lab) author: John Snow Labs name: electra_embeddings_electra_small_japanese_fin_generator date: 2022-05-17 tags: [ja, open_source, electra, embeddings] task: Embeddings language: ja edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-small-japanese-fin-generator` is a Japanese model originally trained by `izumi-lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_small_japanese_fin_generator_ja_3.4.4_3.0_1652786680826.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_small_japanese_fin_generator_ja_3.4.4_3.0_1652786680826.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_small_japanese_fin_generator","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Spark NLPが大好きです"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_small_japanese_fin_generator","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Spark NLPが大好きです").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electra_small_japanese_fin_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ja| |Size:|52.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/izumi-lab/electra-small-japanese-fin-generator - https://github.com/google-research/electra - https://github.com/retarfi/language-pretraining/tree/v1.0 - https://arxiv.org/abs/2003.10555 - https://creativecommons.org/licenses/by-sa/4.0/ --- layout: model title: English ElectraForQuestionAnswering model (from ptran74) Version-5 author: John Snow Labs name: electra_qa_DSPFirst_Finetuning_5 date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `DSPFirst-Finetuning-5` is an English model originally trained by `ptran74`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_DSPFirst_Finetuning_5_en_4.0.0_3.0_1655919805104.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_DSPFirst_Finetuning_5_en_4.0.0_3.0_1655919805104.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_DSPFirst_Finetuning_5","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_DSPFirst_Finetuning_5","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq("What is my name?", "My name is Clara and I live in Berkeley.").toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.electra.finetuning_5").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_DSPFirst_Finetuning_5| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ptran74/DSPFirst-Finetuning-5 - https://github.gatech.edu/pages/VIP-ITS/textbook_SQuAD_explore/explore/textbookv1.0/textbook/ - https://github.com/patil-suraj/question_generation --- layout: model title: Wolof XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_wolof_finetuned_ner_wolof date: 2022-08-01 tags: [wo, open_source, xlm_roberta, ner] task: Named Entity Recognition language: wo edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-wolof-finetuned-ner-wolof` is a Wolof model originally trained by `mbeukman`. ## Predicted Entities `DATE`, `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_wolof_finetuned_ner_wolof_wo_4.1.0_3.0_1659356140998.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_wolof_finetuned_ner_wolof_wo_4.1.0_3.0_1659356140998.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_wolof_finetuned_ner_wolof","wo") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_wolof_finetuned_ner_wolof","wo")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("document", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_base_finetuned_wolof_finetuned_ner_wolof|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|wo|
|Size:|1.0 GB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-wolof-finetuned-ner-wolof
- https://arxiv.org/abs/2103.11811
- https://github.com/Michael-Beukman/NERTransfer
- https://github.com/masakhane-io/masakhane-ner

---
layout: model
title: Legal Guarantee Clause Binary Classifier
author: John Snow Labs
name: legclf_guarantee_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `guarantee` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
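The paragraph-splitting option mentioned above (splitting by multiline) can be sketched in plain Python. This is a minimal illustration, not the workshop notebook's exact code, and the sample clause text is hypothetical:

```python
import re

def split_paragraphs(text: str):
    """Split a document into candidate clauses on blank lines (multiline split)."""
    # One or more blank lines separate paragraphs; drop empty leftovers.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "1. GUARANTEE.\nThe Guarantor guarantees...\n\n2. NOTICES.\nAll notices shall..."
print(split_paragraphs(doc))  # two paragraphs, one candidate clause each
```

Each resulting paragraph can then be fed to the classifier as a separate row, so the model sees one candidate clause at a time.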
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `guarantee`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_guarantee_clause_en_1.0.0_3.2_1660123571558.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_guarantee_clause_en_1.0.0_3.2_1660123571558.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_guarantee_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
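When several of these binary clause classifiers are run over the same text, their per-clause labels can be collapsed into a single clause → present map. A plain-Python sketch of that aggregation step (the clause names and label values below are hypothetical examples, not output from a real run):

```python
def summarize_clauses(predictions):
    """Collapse per-classifier labels into a clause -> present? map.

    `predictions` maps a clause name to the label its binary classifier
    returned; any label other than "other" means the clause was detected."""
    return {clause: label != "other" for clause, label in predictions.items()}

preds = {"guarantee": "guarantee", "indemnification": "other"}
print(summarize_clauses(preds))  # {'guarantee': True, 'indemnification': False}
```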
## Results

```bash
+-----------+
|     result|
+-----------+
|[guarantee]|
|    [other]|
|    [other]|
|[guarantee]|
+-----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_guarantee_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.2 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
   guarantee       0.88    0.80      0.83       88
       other       0.91    0.95      0.93      192
    accuracy          -       -      0.90      280
   macro-avg       0.89    0.87      0.88      280
weighted-avg       0.90    0.90      0.90      280
```

---
layout: model
title: English ElectraForQuestionAnswering Small model (from mrm8488)
author: John Snow Labs
name: electra_qa_small_finetuned_squadv1
date: 2022-06-22
tags: [en, open_source, electra, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-small-finetuned-squadv1` is an English model originally trained by `mrm8488`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_small_finetuned_squadv1_en_4.0.0_3.0_1655921278345.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_small_finetuned_squadv1_en_4.0.0_3.0_1655921278345.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_small_finetuned_squadv1","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_small_finetuned_squadv1","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.electra.small").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|electra_qa_small_finetuned_squadv1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|51.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/mrm8488/electra-small-finetuned-squadv1
- https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/

---
layout: model
title: Legal Negative covenants Clause Binary Classifier
author: John Snow Labs
name: legclf_negative_covenants_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `negative-covenants` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
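The 512-token budget mentioned above can be enforced before classification by greedily packing paragraphs into chunks. This is a rough sketch: whitespace tokens are only a crude proxy for the model's actual wordpiece count, so a real pipeline would leave a safety margin:

```python
def chunk_by_token_budget(paragraphs, max_tokens=512):
    """Greedily pack paragraphs into chunks that stay under a token budget.

    Counts whitespace-separated words as a rough stand-in for wordpieces."""
    chunks, current, count = [], [], 0
    for p in paragraphs:
        n = len(p.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(p)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

paras = ["word " * 300, "word " * 300, "word " * 100]
print([len(c.split()) for c in chunk_by_token_budget(paras)])  # [300, 400]
```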
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `negative-covenants`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_negative_covenants_clause_en_1.0.0_3.2_1660122676325.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_negative_covenants_clause_en_1.0.0_3.2_1660122676325.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_negative_covenants_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+--------------------+
|              result|
+--------------------+
|[negative-covenants]|
|             [other]|
|             [other]|
|[negative-covenants]|
+--------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_negative_covenants_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
             label  precision  recall  f1-score  support
negative-covenants       1.00    0.92      0.96       51
             other       0.97    1.00      0.98      130
          accuracy          -       -      0.98      181
         macro-avg       0.99    0.96      0.97      181
      weighted-avg       0.98    0.98      0.98      181
```

---
layout: model
title: Detect Organism in Medical Text
author: John Snow Labs
name: bert_token_classifier_ner_species
date: 2022-07-25
tags: [en, ner, clinical, licensed, bertfortokenclassification]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalBertForTokenClassifier
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Species-800 is a corpus for species entities, which is based on manually annotated abstracts. It comprises 800 PubMed abstracts that contain identified organism mentions. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP.
## Predicted Entities `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_species_en_4.0.0_3.0_1658758056681.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_species_en_4.0.0_3.0_1658758056681.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_species", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(512)

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    ner_model,
    ner_converter
])

data = spark.createDataFrame([["""As determined by 16S rRNA gene sequence analysis, strain 6C (T) represents a distinct species belonging to the class Betaproteobacteria and is most closely related to Thiomonas intermedia DSM 18155 (T) and Thiomonas perometabolis DSM 18570 (T) ."""]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_species", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("ner")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, 
tokenizer, ner_model, ner_converter))

val data = Seq("""As determined by 16S rRNA gene sequence analysis, strain 6C (T) represents a distinct species belonging to the class Betaproteobacteria and is most closely related to Thiomonas intermedia DSM 18155 (T) and Thiomonas perometabolis DSM 18570 (T) .""").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.species").predict("""As determined by 16S rRNA gene sequence analysis, strain 6C (T) represents a distinct species belonging to the class Betaproteobacteria and is most closely related to Thiomonas intermedia DSM 18155 (T) and Thiomonas perometabolis DSM 18570 (T) .""")
```
## Results ```bash +-----------------------+-------+ |ner_chunk |label | +-----------------------+-------+ |6C (T) |SPECIES| |Betaproteobacteria |SPECIES| |Thiomonas intermedia |SPECIES| |DSM 18155 (T) |SPECIES| |Thiomonas perometabolis|SPECIES| |DSM 18570 (T) |SPECIES| +-----------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_species| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References [https://species.jensenlab.org/](https://species.jensenlab.org/) ## Benchmarking ```bash label precision recall f1-score support B-SPECIES 0.6073 0.9374 0.7371 767 I-SPECIES 0.7418 0.8648 0.7986 1043 micro-avg 0.6754 0.8956 0.7701 1810 macro-avg 0.6745 0.9011 0.7678 1810 weighted-avg 0.6848 0.8956 0.7725 1810 ``` --- layout: model title: Sentiment Analysis pipeline for English (analyze_sentimentdl_glove_imdb) author: John Snow Labs name: analyze_sentimentdl_glove_imdb date: 2021-03-24 tags: [open_source, english, analyze_sentimentdl_glove_imdb, pipeline, en] supported: true task: [Named Entity Recognition, Lemmatization] language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The analyze_sentimentdl_glove_imdb is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps and recognizes entities . 
It performs most of the common text processing tasks on your dataframe.

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/analyze_sentimentdl_glove_imdb_en_3.0.0_3.0_1616544505213.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/analyze_sentimentdl_glove_imdb_en_3.0.0_3.0_1616544505213.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('analyze_sentimentdl_glove_imdb', lang = 'en')
annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("analyze_sentimentdl_glove_imdb", lang = "en")
val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0)
```

{:.nlu-block}
```python
import nlu
text = ["Hello from John Snow Labs ! "]
result_df = nlu.load('en.sentiment.glove').predict(text)
result_df
```
## Results ```bash | | document | sentence | tokens | word_embeddings | sentence_embeddings | sentiment | |---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:-----------------------------|:-----------------------------|:------------| | 0 | ['Hello from John Snow Labs ! '] | ['Hello from John Snow Labs !'] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | [[0.2668800055980682,.,...]] | [[0.0771183446049690,.,...]] | ['neg'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|analyze_sentimentdl_glove_imdb| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| --- layout: model title: Extract Pharmacological Entities From Spanish Medical Texts (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_pharmacology date: 2022-08-11 tags: [es, clinical, licensed, token_classification, bert, ner, pharmacology] task: Named Entity Recognition language: es edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Named Entity Recognition model is intended for detecting pharmacological entities from Spanish medical texts and trained using the BertForTokenClassification method from the transformers library and [BERT based](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) embeddings. The model detects PROTEINAS and NORMALIZABLES. 
## Predicted Entities `PROTEINAS`, `NORMALIZABLES` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_pharmacology_es_4.0.2_3.0_1660236427687.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_pharmacology_es_4.0.2_3.0_1660236427687.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_pharmacology", "es", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("label")\
    .setCaseSensitive(True)

ner_converter = NerConverter()\
    .setInputCols(["sentence","token","label"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    tokenClassifier,
    ner_converter])

data = spark.createDataFrame([["""Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa)."""]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_pharmacology", "es", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("label")
    .setCaseSensitive(true)

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence","token","label"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    tokenClassifier,
    ner_converter))

val data = Seq("Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa).").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("es.classify.bert_token.pharmacology").predict("""Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa).""")
```
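The `NerConverter` stage above merges the per-token IOB labels (`B-PROTEINAS`, `I-PROTEINAS`, `O`, ...) into entity chunks. Conceptually it works like this plain-Python sketch (an illustration of the IOB-merging idea, not Spark NLP's internal implementation):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB tags into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                      # close the previous chunk
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)              # continue the open chunk
        else:
            if current:                      # "O" or a stray "I-": close chunk
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["el", "tumor", "expresó", "vimentina", ",", "S-100"]
tags   = ["O", "O", "O", "B-PROTEINAS", "O", "B-PROTEINAS"]
print(iob_to_chunks(tokens, tags))  # [('vimentina', 'PROTEINAS'), ('S-100', 'PROTEINAS')]
```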
## Results ```bash +---------------+-------------+ |chunk |ner_label | +---------------+-------------+ |creatinkinasa |PROTEINAS | |LDH |PROTEINAS | |urea |NORMALIZABLES| |CA 19.9 |PROTEINAS | |vimentina |PROTEINAS | |S-100 |PROTEINAS | |HMB-45 |PROTEINAS | |actina |PROTEINAS | |Cisplatino |NORMALIZABLES| |Interleukina II|PROTEINAS | |Dacarbacina |NORMALIZABLES| |Interferon alfa|PROTEINAS | +---------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_pharmacology| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|es| |Size:|410.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## Benchmarking ```bash label precision recall f1-score support B-NORMALIZABLES 0.9458 0.9694 0.9575 3076 I-NORMALIZABLES 0.8788 0.8969 0.8878 291 B-PROTEINAS 0.9164 0.9369 0.9265 2234 I-PROTEINAS 0.8825 0.7634 0.8186 748 micro-avg 0.9257 0.9304 0.9280 6349 macro-avg 0.9059 0.8917 0.8976 6349 weighted-avg 0.9249 0.9304 0.9270 6349 ``` --- layout: model title: Translate Manx to English Pipeline author: John Snow Labs name: translate_gv_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, gv, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended.

- source languages: `gv`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_gv_en_xx_2.7.0_2.4_1609686733139.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_gv_en_xx_2.7.0_2.4_1609686733139.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_gv_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_gv_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.gv.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_gv_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Legal Death Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_death_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, death, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from the US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

This model is a Binary Classifier (True, False) for the `Death` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`Death`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_death_bert_en_1.0.0_3.0_1678049968182.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_death_bert_en_1.0.0_3.0_1678049968182.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_death_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
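The splitting by headers / subheaders suggested in the description can be sketched with a line-anchored regex. This is a rough heuristic on a hypothetical contract snippet; real contracts vary widely in header style, so the pattern would need tuning:

```python
import re

def split_by_headers(text: str):
    """Split a contract on numbered section headers like '2 SANCTIONS' or '2.1 Term'."""
    # Zero-width split before any line starting with a section number and a capital.
    parts = re.split(r"(?m)^(?=\d+(?:\.\d+)*\s+[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

contract = "1 DEFINITIONS\nTerms used...\n2 SANCTIONS\nEach party represents...\n"
print(split_by_headers(contract))  # one piece per numbered section
```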
## Results

```bash
+-------+
| result|
+-------+
|[Death]|
|[Other]|
|[Other]|
|[Death]|
+-------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_death_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
       label  precision  recall  f1-score  support
       Death       0.86    1.00      0.93       31
       Other       1.00    0.90      0.95       49
    accuracy          -       -      0.94       80
   macro-avg       0.93    0.95      0.94       80
weighted-avg       0.95    0.94      0.94       80
```

---
layout: model
title: English T5ForConditionalGeneration Cased model (from KES)
author: John Snow Labs
name: t5_kes
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `T5-KES` is an English model originally trained by `KES`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_kes_en_4.3.0_3.0_1675099343508.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_kes_en_4.3.0_3.0_1675099343508.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_kes","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_kes","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_kes| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|912.8 MB| ## References - https://huggingface.co/KES/T5-KES - https://arxiv.org/abs/1702.04066 - https://github.com/EricFillion/happy-transformer - https://pypi.org/project/Caribe/ --- layout: model title: Legal Sanctions Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_sanctions_bert date: 2023-03-05 tags: [en, legal, classification, clauses, sanctions, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset is aimed at contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Sanctions` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc.
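As a rough illustration of the first technique, paragraph splitting by multiline can be done with a plain regular expression before the text reaches the classifier (a minimal sketch; the helper name and regex are assumptions for illustration, not part of the tutorial):

```python
import re

def split_paragraphs(text: str) -> list:
    # Split a document into provisions on blank lines (two or more newlines),
    # dropping empty fragments and surrounding whitespace.
    return [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]

doc = "SANCTIONS. The Company complies with all applicable sanctions laws.\n\nGOVERNING LAW. This Agreement shall be governed by the laws of Delaware."
print(split_paragraphs(doc))  # two provisions, each classified independently
```

Each resulting paragraph can then be loaded into the `text` column of the Spark DataFrame shown below.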
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Sanctions`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_sanctions_bert_en_1.0.0_3.0_1678050581574.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_sanctions_bert_en_1.0.0_3.0_1678050581574.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_sanctions_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
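The description above mentions combining several binary clause classifiers into one pipeline. The toy sketch below shows only the aggregation idea in plain Python; the hard-coded label outputs stand in for real model predictions (they are assumptions, not actual calls to the models):

```python
# Each binary clause classifier emits its positive label (e.g. "Sanctions")
# or "Other". Aggregating several of them yields one True/False flag per
# clause type, as described above.
def aggregate(predictions: dict) -> dict:
    return {clause: label != "Other" for clause, label in predictions.items()}

# Pretend outputs from two classifiers from this catalog, e.g.
# legclf_sanctions_bert and legclf_death_bert, on the same paragraph.
predictions = {"Sanctions": "Sanctions", "Death": "Other"}
print(aggregate(predictions))  # {'Sanctions': True, 'Death': False}
```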
## Results ```bash +-------+ |result| +-------+ |[Sanctions]| |[Other]| |[Other]| |[Sanctions]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_sanctions_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.2 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 1.00 0.95 0.97 19 Sanctions 0.92 1.00 0.96 11 accuracy - - 0.97 30 macro-avg 0.96 0.97 0.96 30 weighted-avg 0.97 0.97 0.97 30 ``` --- layout: model title: Word2Vec Embeddings in Uzbek (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, uz, open_source] task: Embeddings language: uz edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_uz_3.4.1_3.0_1647465690254.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_uz_3.4.1_3.0_1647465690254.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","uz") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Men Spark NLP ni yaxshi ko'raman"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","uz") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Men Spark NLP ni yaxshi ko'raman").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("uz.embed.w2v_cc_300d").predict("""Men Spark NLP ni yaxshi ko'raman""") ```
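Once each token has a 300-dimensional vector in the `embeddings` column, a common downstream step is comparing tokens by cosine similarity. A minimal pure-Python sketch of that computation (the 3-d toy vectors are made up for illustration and merely stand in for the real 300-d embeddings):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Identical directions score 1.0; orthogonal directions score 0.0.
print(round(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]), 3))  # 1.0
print(round(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 3))  # 0.0
```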
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|uz| |Size:|481.9 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Stop Words Cleaner for Slovenian author: John Snow Labs name: stopwords_sl date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: sl edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, sl] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_sl_sl_2.5.4_2.4_1594742442155.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_sl_sl_2.5.4_2.4_1594742442155.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_sl", "sl") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("John Snow je poleg tega, da je severni kralj, angleški zdravnik in vodilni v razvoju anestezije in zdravstvene higiene.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_sl", "sl") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("John Snow je poleg tega, da je severni kralj, angleški zdravnik in vodilni v razvoju anestezije in zdravstvene higiene.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""John Snow je poleg tega, da je severni kralj, angleški zdravnik in vodilni v razvoju anestezije in zdravstvene higiene."""] stopword_df = nlu.load('sl.stopwords').predict(text) stopword_df[['cleanTokens']] ```
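Conceptually, the annotator filters the `token` column against a fixed word list. A tiny pure-Python equivalent of that filtering step (the five Slovenian stop words below are a hand-picked subset for illustration, not the model's actual list):

```python
STOP_WORDS = {"je", "da", "in", "v", "tega"}  # assumed subset for illustration

def clean_tokens(tokens):
    # Drop any token whose lowercase form is in the stop-word list
    # (the model itself is case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["John", "Snow", "je", "severni", "kralj"]
print(clean_tokens(tokens))  # ['John', 'Snow', 'severni', 'kralj']
```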
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=3, result='John', metadata={'sentence': '0'}), Row(annotatorType='token', begin=5, end=8, result='Snow', metadata={'sentence': '0'}), Row(annotatorType='token', begin=23, end=23, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=31, end=37, result='severni', metadata={'sentence': '0'}), Row(annotatorType='token', begin=39, end=43, result='kralj', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_sl| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|sl| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: Financial Finetuned FLAN-T5 Text Generation (FIQA dataset) author: John Snow Labs name: fingen_flant5_finetuned_fiqa date: 2023-05-29 tags: [en, finance, generation, licensed, flant5, fiqa, tensorflow] task: Text Generation language: en edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: FinanceTextGenerator article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `fingen_flant5_finetuned_fiqa` model is a Text Generation model fine-tuned from FLAN-T5 on the FIQA dataset. FLAN-T5 is a state-of-the-art language model developed by Google AI that utilizes the T5 architecture for text-generation tasks.
References: ```bibtex @article{flant5_paper, title={Scaling instruction-finetuned language models}, author={Chung, Hyung Won and Hou, Le and Longpre, Shayne and Zoph, Barret and Tay, Yi and Fedus, William and Li, Eric and Wang, Xuezhi and Dehghani, Mostafa and Brahma, Siddhartha and others}, journal={arXiv preprint arXiv:2210.11416}, year={2022} } @article{t5_paper, title={Exploring the limits of transfer learning with a unified text-to-text transformer}, author={Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J}, journal={The Journal of Machine Learning Research}, volume={21}, number={1}, pages={5485--5551}, year={2020}, publisher={JMLRORG} } ``` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/fingen_flant5_finetuned_fiqa_en_1.0.0_3.0_1685363340017.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/fingen_flant5_finetuned_fiqa_en_1.0.0_3.0_1685363340017.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") flant5 = finance.TextGenerator.pretrained("fingen_flant5_finetuned_fiqa", "en", "finance/models")\ .setInputCols(["document"])\ .setOutputCol("generated")\ .setMaxNewTokens(256)\ .setStopAtEos(True)\ .setDoSample(True)\ .setTopK(3) pipeline = nlp.Pipeline(stages=[document_assembler, flant5]) data = spark.createDataFrame([ [1, "How to have a small capital investment in US if I am out of the country?"]]).toDF('id', 'text') results = pipeline.fit(data).transform(data) results.select("generated.result").show(truncate=False) ```
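The pipeline above enables sampling with `setDoSample(True)` and restricts it with `setTopK(3)`. The idea behind top-k filtering can be sketched in plain Python (toy scores and a simplified renormalization; this is an illustration of the technique, not Spark NLP's actual decoder):

```python
import random

def top_k_sample(scores: dict, k: int, rng: random.Random) -> str:
    # Keep only the k highest-scoring candidate tokens, renormalize their
    # scores, then sample one of them proportionally to its score.
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(s for _, s in top)
    r, acc = rng.random() * total, 0.0
    for token, score in top:
        acc += score
        if r <= acc:
            return token
    return top[-1][0]

# Toy next-token distribution; with k=3, "cat" can never be sampled.
scores = {"broker": 0.5, "fund": 0.3, "bank": 0.15, "cat": 0.05}
print(top_k_sample(scores, k=3, rng=random.Random(0)))
```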
## Results ```bash +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |result | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |[I would suggest a local broker. They have diversified funds that are diversified and have the same fees as the US market. They also offer diversified portfolios that have the lowest risk.]| +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|fingen_flant5_finetuned_fiqa| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.6 GB| ## References The dataset is available [here](https://huggingface.co/datasets/BeIR/fiqa) --- layout: model title: Extract Intent Type from Customer Service Chat Messages author: John Snow Labs name: finclf_customer_service_intent_type date: 2023-02-03 tags: [en, licensed, intent, finance, customer, tensorflow] task: Text Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: FinanceClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Text Classification model that can help you classify a chat message from customer service according to intent type.
## Predicted Entities `cancel_order`, `change_order`, `change_setup_shipping_address`, `check_cancellation_fee`, `check_payment_methods`, `check_refund_policy`, `complaint`, `contact_customer_service`, `contact_human_agent`, `create_edit_switch_account`, `delete_account`, `delivery_options`, `delivery_period`, `get_check_invoice`, `get_refund`, `newsletter_subscription`, `payment_issue`, `place_order`, `recover_password`, `registration_problems`, `review`, `track_order`, `track_refund`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_customer_service_intent_type_en_1.0.0_3.0_1675427852317.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_customer_service_intent_type_en_1.0.0_3.0_1675427852317.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") embeddings = nlp.UniversalSentenceEncoder.pretrained() \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = finance.ClassifierDLModel.pretrained("finclf_customer_service_intent_type", "en", "finance/models")\ .setInputCols("sentence_embeddings") \ .setOutputCol("class") pipeline = nlp.Pipeline().setStages( [ document_assembler, embeddings, docClassifier ] ) empty_data = spark.createDataFrame([[""]]).toDF("text") model = pipeline.fit(empty_data) light_model = nlp.LightPipeline(model) result = light_model.annotate("""I have a problem with the deletion of my Premium account.""") result["class"] ```
## Results ```bash ['delete_account'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_customer_service_intent_type| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References https://github.com/bitext/customer-support-intent-detection-evaluation-dataset ## Benchmarking ```bash label precision recall f1-score support cancel_order 0.88 1.00 0.94 30 change_order 1.00 0.90 0.95 30 change_setup_shipping_address 0.97 0.97 0.97 36 check_cancellation_fee 0.97 0.97 0.97 30 check_payment_methods 0.97 0.93 0.95 30 check_refund_policy 0.97 0.97 0.97 30 complaint 0.93 0.93 0.93 30 contact_customer_service 1.00 1.00 1.00 30 contact_human_agent 0.97 0.97 0.97 30 create_edit_switch_account 0.90 0.97 0.93 36 delete_account 0.96 0.87 0.91 30 delivery_options 0.91 1.00 0.95 30 delivery_period 1.00 0.97 0.98 30 get_check_invoice 0.92 0.97 0.95 36 get_refund 1.00 0.87 0.93 30 newsletter_subscription 1.00 0.93 0.97 30 other 1.00 0.92 0.96 38 payment_issue 0.97 1.00 0.98 30 place_order 0.97 0.93 0.95 30 recover_password 0.97 1.00 0.98 30 registration_problems 1.00 0.97 0.98 30 review 0.94 1.00 0.97 30 track_order 0.93 0.93 0.93 30 track_refund 0.91 1.00 0.95 30 accuracy - - 0.96 746 macro-avg 0.96 0.96 0.96 746 weighted-avg 0.96 0.96 0.96 746 ``` --- layout: model title: German Electra Embeddings (from stefan-it) author: John Snow Labs name: electra_embeddings_electra_base_gc4_64k_200000_cased_generator date: 2022-05-17 tags: [de, open_source, electra, embeddings] task: Embeddings language: de edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`electra-base-gc4-64k-200000-cased-generator` is a German model originally trained by `stefan-it`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_200000_cased_generator_de_3.4.4_3.0_1652786339074.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_200000_cased_generator_de_3.4.4_3.0_1652786339074.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_200000_cased_generator","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_200000_cased_generator","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electra_base_gc4_64k_200000_cased_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|de| |Size:|222.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/stefan-it/electra-base-gc4-64k-200000-cased-generator - https://german-nlp-group.github.io/projects/gc4-corpus.html - https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf --- layout: model title: Arabic Named Entity Recognition (from abdusahmbzuai) author: John Snow Labs name: bert_ner_arabert_ner date: 2022-05-04 tags: [bert, ner, token_classification, ar, open_source] task: Named Entity Recognition language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `arabert-ner` is an Arabic model originally trained by `abdusahmbzuai`. ## Predicted Entities `ORG`, `LOC`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_arabert_ner_ar_3.4.2_3.0_1651630356143.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_arabert_ner_ar_3.4.2_3.0_1651630356143.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_arabert_ner","ar") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_arabert_ner","ar") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("أنا أحب الشرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.ner.arabert_ner").predict("""أنا أحب الشرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_arabert_ner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ar| |Size:|505.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/abdusahmbzuai/arabert-ner --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_radhakri119 TFWav2Vec2ForCTC from radhakri119 author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab_by_radhakri119 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_radhakri119` is an English model originally trained by radhakri119. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_by_radhakri119_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_radhakri119_en_4.2.0_3.0_1664101755978.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_radhakri119_en_4.2.0_3.0_1664101755978.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_by_radhakri119", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_by_radhakri119", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab_by_radhakri119| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|354.9 MB| --- layout: model title: Fast Neural Machine Translation Model from English to Central Bikol author: John Snow Labs name: opus_mt_en_bcl date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, bcl, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `bcl` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_bcl_xx_2.7.0_2.4_1609170612440.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_bcl_xx_2.7.0_2.4_1609170612440.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_bcl", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_bcl", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.bcl').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_bcl| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Base Cased model (from Moussab) author: John Snow Labs name: distilbert_qa_base_cased_led_squad_orkg_what_5e_05 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-orkg-what-5e-05` is an English model originally trained by `Moussab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_orkg_what_5e_05_en_4.3.0_3.0_1672766856335.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_orkg_what_5e_05_en_4.3.0_3.0_1672766856335.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_orkg_what_5e_05","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_orkg_what_5e_05","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_cased_led_squad_orkg_what_5e_05| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Moussab/distilbert-base-cased-distilled-squad-orkg-what-5e-05 --- layout: model title: Javanese BertForMaskedLM Small Cased model (from w11wo) author: John Snow Labs name: bert_embeddings_javanese_small_imdb date: 2022-12-02 tags: [jv, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: jv edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `javanese-bert-small-imdb` is a Javanese model originally trained by `w11wo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_javanese_small_imdb_jv_4.2.4_3.0_1670022513681.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_javanese_small_imdb_jv_4.2.4_3.0_1670022513681.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_javanese_small_imdb","jv") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_javanese_small_imdb","jv") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_javanese_small_imdb| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|jv| |Size:|410.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/w11wo/javanese-bert-small-imdb - https://arxiv.org/abs/1810.04805 - https://github.com/sgugger - https://w11wo.github.io/ --- layout: model title: Persian Named Entity Recognition (from HooshvareLab) author: John Snow Labs name: bert_ner_bert_base_parsbert_peymaner_uncased date: 2022-05-09 tags: [bert, ner, token_classification, fa, open_source] task: Named Entity Recognition language: fa edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-parsbert-peymaner-uncased` is a Persian model originally trained by `HooshvareLab`. ## Predicted Entities `LOC`, `PER`, `TIM`, `MON`, `DAT`, `PCT`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_parsbert_peymaner_uncased_fa_3.4.2_3.0_1652099544405.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_parsbert_peymaner_uncased_fa_3.4.2_3.0_1652099544405.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_parsbert_peymaner_uncased","fa") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["من عاشق جرقه nlp هستم"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_parsbert_peymaner_uncased","fa") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("من عاشق جرقه nlp هستم").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_bert_base_parsbert_peymaner_uncased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|fa| |Size:|607.0 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/HooshvareLab/bert-base-parsbert-peymaner-uncased - https://arxiv.org/abs/2005.12515 - http://nsurl.org/tasks/task-7-named-entity-recognition-ner-for-farsi/ - https://github.com/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb - https://colab.research.google.com/github/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb - https://tensorflow.org/tfrc - https://hooshvare.com - https://www.linkedin.com/in/m3hrdadfi/ - https://twitter.com/m3hrdadfi - https://github.com/m3hrdadfi - https://www.linkedin.com/in/mohammad-gharachorloo/ - https://twitter.com/MGharachorloo - https://github.com/baarsaam - https://www.linkedin.com/in/marziehphi/ - https://twitter.com/marziehphi - https://github.com/marziehphi - https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/ - https://twitter.com/mmanthouri - https://github.com/mmanthouri - https://hooshvare.com/ - https://www.linkedin.com/company/hooshvare - https://twitter.com/hooshvare - https://github.com/hooshvare - https://www.instagram.com/hooshvare/ - https://www.linkedin.com/in/sara-tabrizi-64548b79/ - https://www.behance.net/saratabrizi - https://www.instagram.com/sara_b_tabrizi/ --- layout: model title: Explain Document Pipeline for Norwegian (Bokmal) author: John Snow Labs name: explain_document_sm date: 2021-03-22 tags: [open_source, norwegian_bokmal, explain_document_sm, pipeline, "no"] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: "no" edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: 
"Python-Scala-Java" --- ## Description The explain_document_sm is a pretrained pipeline that performs basic text processing steps, covering most of the common text processing tasks on your dataframe. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_no_3.0.0_3.0_1616427435939.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_sm_no_3.0.0_3.0_1616427435939.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_sm', lang = 'no') annotations = pipeline.fullAnnotate("Hei fra John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_sm", lang = "no") val result = pipeline.fullAnnotate("Hei fra John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hei fra John Snow Labs! "] result_df = nlu.load('no.explain').predict(text) result_df ```
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:-----------------------------|:----------------------------|:----------------------------------------|:----------------------------------------|:--------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Hei fra John Snow Labs! '] | ['Hei fra John Snow Labs!'] | ['Hei', 'fra', 'John', 'Snow', 'Labs!'] | ['Hei', 'fra', 'John', 'Snow', 'Labs!'] | ['PROPN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[-0.394499987363815,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_sm| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|no| --- layout: model title: English T5ForConditionalGeneration Tiny Cased model (from google) author: John Snow Labs name: t5_efficient_tiny_ff9000 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-ff9000` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_ff9000_en_4.3.0_3.0_1675123598242.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_ff9000_en_4.3.0_3.0_1675123598242.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_tiny_ff9000","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_tiny_ff9000","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_tiny_ff9000| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|110.1 MB| ## References - https://huggingface.co/google/t5-efficient-tiny-ff9000 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English RobertaForMaskedLM Base Cased model author: John Snow Labs name: roberta_embeddings_distil_base date: 2022-12-12 tags: [en, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-base` is an English model originally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distil_base_en_4.2.4_3.0_1670858593481.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distil_base_en_4.2.4_3.0_1670858593481.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_distil_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_distil_base","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_distil_base| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|308.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/distilroberta-base - https://arxiv.org/abs/1910.01108 - https://aclanthology.org/2021.acl-long.330.pdf - https://dl.acm.org/doi/pdf/10.1145/3442188.3445922 - https://skylion007.github.io/OpenWebTextCorpus/ - https://mlco2.github.io/impact#compute - https://arxiv.org/abs/1910.09700 --- layout: model title: English RobertaForQuestionAnswering Large Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_large_few_shot_k_1024_finetuned_squad_seed_4 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-few-shot-k-1024-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_large_few_shot_k_1024_finetuned_squad_seed_4_en_4.3.0_3.0_1674221411536.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_large_few_shot_k_1024_finetuned_squad_seed_4_en_4.3.0_3.0_1674221411536.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_few_shot_k_1024_finetuned_squad_seed_4","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_few_shot_k_1024_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_large_few_shot_k_1024_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-large-few-shot-k-1024-finetuned-squad-seed-4 --- layout: model title: English BertForQuestionAnswering model (from MrAnderson) author: John Snow Labs name: bert_qa_bert_base_1024_full_trivia_copied_embeddings date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-1024-full-trivia-copied-embeddings` is an English model originally trained by `MrAnderson`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_1024_full_trivia_copied_embeddings_en_4.0.0_3.0_1654179607765.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_1024_full_trivia_copied_embeddings_en_4.0.0_3.0_1654179607765.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_1024_full_trivia_copied_embeddings","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_1024_full_trivia_copied_embeddings","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.trivia.bert.base_1024d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_1024_full_trivia_copied_embeddings| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|409.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/MrAnderson/bert-base-1024-full-trivia-copied-embeddings --- layout: model title: Legal Transfer Clause Binary Classifier author: John Snow Labs name: legclf_transfer_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `transfer` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that this model's embeddings allow up to 512 tokens; if your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). 
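As a minimal sketch of the paragraph-splitting ("by multiline") approach mentioned above, the snippet below uses plain Python, independent of Spark NLP; the function name and sample text are illustrative only:

```python
import re

def split_into_paragraphs(text: str) -> list:
    """Split a document into paragraphs on blank lines (one or more
    consecutive newlines), dropping empty fragments, so each chunk can
    be sent to the clause classifier on its own."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Illustrative contract fragment with two candidate clauses
contract = (
    "1. Transfer. The rights hereunder may be assigned by either party...\n"
    "\n"
    "2. Governing Law. This Agreement shall be governed by..."
)
for clause in split_into_paragraphs(contract):
    print(clause)
```

Each returned chunk can then be fed to the classifier as a separate row of the `clause_text` column shown in the usage example below.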
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `transfer` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_transfer_clause_en_1.0.0_3.2_1660124097088.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_transfer_clause_en_1.0.0_3.2_1660124097088.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_transfer_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[transfer]| |[other]| |[other]| |[transfer]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_transfer_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.91 0.94 0.93 66 transfer 0.90 0.86 0.88 43 accuracy - - 0.91 109 macro-avg 0.91 0.90 0.90 109 weighted-avg 0.91 0.91 0.91 109 ``` --- layout: model title: Fast Neural Machine Translation Model from Tumbuka to English author: John Snow Labs name: opus_mt_tum_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, tum, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `tum` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_tum_en_xx_2.7.0_2.4_1609168643029.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_tum_en_xx_2.7.0_2.4_1609168643029.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_tum_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_tum_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.tum.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_tum_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English asr_wav2vec2_bilal_20epoch TFWav2Vec2ForCTC from Roshana author: John Snow Labs name: pipeline_asr_wav2vec2_bilal_20epoch date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_bilal_20epoch` is an English model originally trained by Roshana. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_bilal_20epoch_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_bilal_20epoch_en_4.2.0_3.0_1664119706366.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_bilal_20epoch_en_4.2.0_3.0_1664119706366.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_bilal_20epoch', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_bilal_20epoch", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_bilal_20epoch| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Fast Neural Machine Translation Model from English to Luba-Lulua author: John Snow Labs name: opus_mt_en_lua date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, lua, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `lua` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_lua_xx_2.7.0_2.4_1609164365414.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_lua_xx_2.7.0_2.4_1609164365414.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_lua", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_lua", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.lua').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_lua| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Detect Clinical Entities (jsl_ner_wip_clinical) author: John Snow Labs name: jsl_ner_wip_clinical date: 2021-01-18 task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 2.7.0 spark_version: 2.4 tags: [ner, en, clinical, licensed] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terminology. The SparkNLP deep learning model (NerDL) is inspired by a former state of the art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNN.
## Predicted Entities `Injury_or_Poisoning`, `Direction`, `Test`, `Admission_Discharge`, `Death_Entity`, `Relationship_Status`, `Duration`, `Respiration`, `Hyperlipidemia`, `Birth_Entity`, `Age`, `Labour_Delivery`, `Family_History_Header`, `BMI`, `Temperature`, `Alcohol`, `Kidney_Disease`, `Oncological`, `Medical_History_Header`, `Cerebrovascular_Disease`, `Oxygen_Therapy`, `O2_Saturation`, `Psychological_Condition`, `Heart_Disease`, `Employment`, `Obesity`, `Disease_Syndrome_Disorder`, `Pregnancy`, `ImagingFindings`, `Procedure`, `Medical_Device`, `Race_Ethnicity`, `Section_Header`, `Symptom`, `Treatment`, `Substance`, `Route`, `Drug_Ingredient`, `Blood_Pressure`, `Diet`, `I-Age`, `External_body_part_or_region`, `LDL`, `VS_Finding`, `Allergen`, `EKG_Findings`, `Imaging_Technique`, `I-Diet`, `Triglycerides`, `RelativeTime`, `Gender`, `Pulse`, `Social_History_Header`, `Substance_Quantity`, `Diabetes`, `Modifier`, `Internal_organ_or_component`, `Clinical_Dept`, `Form`, `Drug_BrandName`, `Strength`, `Fetus_NewBorn`, `RelativeDate`, `Height`, `Test_Result`, `Sexually_Active_or_Sexual_Orientation`, `Frequency`, `Time`, `Weight`, `Vaccine`, `Vital_Signs_Header`, `Communicable_Disease`, `Dosage`, `Overweight`, `Hypertension`, `HDL`, `Total_Cholesterol`, `Smoking`, `Date`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_clinical_en_2.6.5_2.4_1609505628141.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_clinical_en_2.6.5_2.4_1609505628141.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
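NerConverter's token-to-chunk merging follows the standard BIO scheme: a `B-` tag opens a chunk, consecutive `I-` tags of the same entity extend it, and `O` closes it. A minimal pure-Python sketch of that logic (illustrative only — the actual annotator also tracks character offsets and metadata):

```python
def merge_bio(tokens, labels):
    """Merge parallel (token, BIO-label) lists into (chunk, entity) tuples."""
    chunks, current, entity = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- tag always starts a new chunk, closing any open one.
            if current:
                chunks.append((" ".join(current), entity))
            current, entity = [token], label[2:]
        elif label.startswith("I-") and entity == label[2:]:
            # An I- tag of the same entity extends the open chunk.
            current.append(token)
        else:
            # O (or a stray I- with no matching B-) closes the open chunk.
            if current:
                chunks.append((" ".join(current), entity))
            current, entity = [], None
    if current:
        chunks.append((" ".join(current), entity))
    return chunks

tokens = ["The", "patient", "is", "a", "21-day-old", "Caucasian", "male"]
labels = ["O", "O", "O", "O", "B-Age", "B-Race_Ethnicity", "B-Gender"]
print(merge_bio(tokens, labels))
# → [('21-day-old', 'Age'), ('Caucasian', 'Race_Ethnicity'), ('male', 'Gender')]
```

This is the same token-to-chunk collapse that turns the per-token `ner` column into the chunk/label pairs shown in the Results section below.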
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings_clinical = WordEmbeddingsModel.pretrained('embeddings_clinical', 'en', 'clinical/models') \ .setInputCols(['sentence', 'token']) \ .setOutputCol('embeddings') clinical_ner = NerDLModel.pretrained("jsl_ner_wip_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."]], ["text"])) ``` ```scala ... val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols("sentence", "token") .setOutputCol("embeddings") val ner = NerDLModel.pretrained("jsl_ner_wip_clinical", "en", "clinical/models") .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") ... 
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. 
His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
{:.h2_title} ## Results The output is a dataframe with a sentence per row and a ``"ner"`` column containing all of the entity labels in the sentence, entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select ``"token.result"`` and ``"ner.result"`` from your output dataframe or add the ``"Finisher"`` to the end of your pipeline. ```bash +-----------------------------------------+----------------------------+ |chunk |ner_label | +-----------------------------------------+----------------------------+ |21-day-old |Age | |Caucasian |Race_Ethnicity | |male |Gender | |for 2 days |Duration | |congestion |Symptom | |mom |Gender | |yellow |Modifier | |discharge |Symptom | |nares |External_body_part_or_region| |she |Gender | |mild |Modifier | |problems with his breathing while feeding|Symptom | |perioral cyanosis |Symptom | |retractions |Symptom | |One day ago |RelativeDate | |mom |Gender | |Tylenol |Drug_BrandName | |Baby |Age | |decreased p.o. intake |Symptom | |His |Gender | +-----------------------------------------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|jsl_ner_wip_clinical| |Type:|ner| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence,token, embeddings]| |Output Labels:|[ner]| |Language:|[en]| |Case sensitive:|false| {:.h2_title} ## Data Source Trained on data gathered and manually annotated by John Snow Labs. https://www.johnsnowlabs.com/data/ {:.h2_title} ## Benchmarking ```bash entity tp fp fn total precision recall f1 VS_Finding 235.0 46.0 43.0 278.0 0.8363 0.8453 0.8408 Direction 3972.0 465.0 458.0 4430.0 0.8952 0.8966 0.8959 Respiration 82.0 4.0 4.0 86.0 0.9535 0.9535 0.9535 Cerebrovascular_D... 93.0 20.0 24.0 117.0 0.823 0.7949 0.8087 Family_History_He... 
88.0 6.0 3.0 91.0 0.9362 0.967 0.9514 Heart_Disease 447.0 82.0 119.0 566.0 0.845 0.7898 0.8164 RelativeTime 158.0 80.0 59.0 217.0 0.6639 0.7281 0.6945 Strength 624.0 58.0 53.0 677.0 0.915 0.9217 0.9183 Smoking 121.0 11.0 4.0 125.0 0.9167 0.968 0.9416 Medical_Device 3716.0 491.0 466.0 4182.0 0.8833 0.8886 0.8859 Pulse 136.0 22.0 14.0 150.0 0.8608 0.9067 0.8831 Psychological_Con... 135.0 9.0 29.0 164.0 0.9375 0.8232 0.8766 Overweight 2.0 1.0 0.0 2.0 0.6667 1.0 0.8 Triglycerides 3.0 0.0 2.0 5.0 1.0 0.6 0.75 Obesity 42.0 5.0 6.0 48.0 0.8936 0.875 0.8842 Admission_Discharge 318.0 24.0 11.0 329.0 0.9298 0.9666 0.9478 HDL 3.0 0.0 0.0 3.0 1.0 1.0 1.0 Diabetes 110.0 14.0 8.0 118.0 0.8871 0.9322 0.9091 Section_Header 3740.0 148.0 157.0 3897.0 0.9619 0.9597 0.9608 Age 627.0 75.0 48.0 675.0 0.8932 0.9289 0.9107 O2_Saturation 34.0 14.0 17.0 51.0 0.7083 0.6667 0.6869 Kidney_Disease 96.0 12.0 34.0 130.0 0.8889 0.7385 0.8067 Test 2504.0 545.0 498.0 3002.0 0.8213 0.8341 0.8276 Communicable_Disease 21.0 10.0 6.0 27.0 0.6774 0.7778 0.7241 Hypertension 162.0 5.0 10.0 172.0 0.9701 0.9419 0.9558 External_body_par... 
2626.0 356.0 413.0 3039.0 0.8806 0.8641 0.8723 Oxygen_Therapy 81.0 15.0 14.0 95.0 0.8438 0.8526 0.8482 Modifier 2341.0 404.0 539.0 2880.0 0.8528 0.8128 0.8324 Test_Result 1007.0 214.0 255.0 1262.0 0.8247 0.7979 0.8111 BMI 9.0 1.0 0.0 9.0 0.9 1.0 0.9474 Labour_Delivery 57.0 23.0 33.0 90.0 0.7125 0.6333 0.6706 Employment 271.0 59.0 55.0 326.0 0.8212 0.8313 0.8262 Fetus_NewBorn 66.0 33.0 51.0 117.0 0.6667 0.5641 0.6111 Clinical_Dept 923.0 110.0 83.0 1006.0 0.8935 0.9175 0.9053 Time 29.0 13.0 16.0 45.0 0.6905 0.6444 0.6667 Procedure 3185.0 462.0 501.0 3686.0 0.8733 0.8641 0.8687 Diet 36.0 20.0 45.0 81.0 0.6429 0.4444 0.5255 Oncological 459.0 61.0 55.0 514.0 0.8827 0.893 0.8878 LDL 3.0 0.0 3.0 6.0 1.0 0.5 0.6667 Symptom 7104.0 1302.0 1200.0 8304.0 0.8451 0.8555 0.8503 Temperature 116.0 6.0 8.0 124.0 0.9508 0.9355 0.9431 Vital_Signs_Header 215.0 29.0 24.0 239.0 0.8811 0.8996 0.8903 Relationship_Status 49.0 2.0 1.0 50.0 0.9608 0.98 0.9703 Total_Cholesterol 11.0 4.0 5.0 16.0 0.7333 0.6875 0.7097 Blood_Pressure 158.0 18.0 22.0 180.0 0.8977 0.8778 0.8876 Injury_or_Poisoning 579.0 130.0 127.0 706.0 0.8166 0.8201 0.8184 Drug_Ingredient 1716.0 153.0 132.0 1848.0 0.9181 0.9286 0.9233 Treatment 136.0 36.0 60.0 196.0 0.7907 0.6939 0.7391 Pregnancy 123.0 36.0 51.0 174.0 0.7736 0.7069 0.7387 Vaccine 13.0 2.0 6.0 19.0 0.8667 0.6842 0.7647 Disease_Syndrome_... 2981.0 559.0 446.0 3427.0 0.8421 0.8699 0.8557 Height 30.0 10.0 15.0 45.0 0.75 0.6667 0.7059 Frequency 595.0 99.0 138.0 733.0 0.8573 0.8117 0.8339 Route 858.0 76.0 89.0 947.0 0.9186 0.906 0.9123 Duration 351.0 99.0 108.0 459.0 0.78 0.7647 0.7723 Death_Entity 43.0 14.0 5.0 48.0 0.7544 0.8958 0.819 Internal_organ_or... 6477.0 972.0 991.0 7468.0 0.8695 0.8673 0.8684 Alcohol 80.0 18.0 13.0 93.0 0.8163 0.8602 0.8377 Substance_Quantity 6.0 7.0 4.0 10.0 0.4615 0.6 0.5217 Date 498.0 38.0 19.0 517.0 0.9291 0.9632 0.9459 Hyperlipidemia 47.0 3.0 3.0 50.0 0.94 0.94 0.94 Social_History_He... 
99.0 7.0 7.0 106.0 0.934 0.934 0.934 Race_Ethnicity 116.0 0.0 0.0 116.0 1.0 1.0 1.0 Imaging_Technique 40.0 18.0 47.0 87.0 0.6897 0.4598 0.5517 Drug_BrandName 859.0 62.0 61.0 920.0 0.9327 0.9337 0.9332 RelativeDate 566.0 124.0 143.0 709.0 0.8203 0.7983 0.8091 Gender 6096.0 80.0 101.0 6197.0 0.987 0.9837 0.9854 Dosage 244.0 31.0 57.0 301.0 0.8873 0.8106 0.8472 Form 234.0 32.0 55.0 289.0 0.8797 0.8097 0.8432 Medical_History_H... 114.0 9.0 10.0 124.0 0.9268 0.9194 0.9231 Birth_Entity 4.0 2.0 3.0 7.0 0.6667 0.5714 0.6154 Substance 59.0 8.0 11.0 70.0 0.8806 0.8429 0.8613 Sexually_Active_o... 5.0 3.0 4.0 9.0 0.625 0.5556 0.5882 Weight 90.0 10.0 21.0 111.0 0.9 0.8108 0.8531 macro - - - - - - 0.8148 micro - - - - - - 0.8788 ``` --- layout: model title: Legal Economic Analysis Document Classifier (EURLEX) author: John Snow Labs name: legclf_economic_analysis_bert date: 2023-03-06 tags: [en, legal, classification, clauses, economic_analysis, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The `legclf_economic_analysis_bert` model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class `Economic_Analysis` or not (binary classification) according to EuroVoc labels.
## Predicted Entities `Economic_Analysis`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_economic_analysis_bert_en_1.0.0_3.0_1678111814559.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_economic_analysis_bert_en_1.0.0_3.0_1678111814559.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_economic_analysis_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------------------+ |result | +-------------------+ |[Economic_Analysis]| |[Other] | |[Other] | |[Economic_Analysis]| +-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_economic_analysis_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.2 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Economic_Analysis 0.88 0.84 0.86 116 Other 0.84 0.88 0.86 111 accuracy - - 0.86 227 macro-avg 0.86 0.86 0.86 227 weighted-avg 0.86 0.86 0.86 227 ``` --- layout: model title: Arabic Bert Embeddings (MARBERT model v2) author: John Snow Labs name: bert_embeddings_MARBERTv2 date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `MARBERTv2` is an Arabic model originally trained by `UBC-NLP`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_MARBERTv2_ar_3.4.2_3.0_1649678231280.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_MARBERTv2_ar_3.4.2_3.0_1649678231280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_MARBERTv2","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_MARBERTv2","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.MARBERTv2").predict("""أنا أحب شرارة NLP""") ```
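The `embeddings` output above holds one vector per token (768 dimensions for this BERT-base model). When a single sentence-level vector is needed downstream, one common approach is mean pooling over the token vectors — a minimal pure-Python sketch using small dummy vectors in place of real model output:

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Dummy stand-ins for per-token BERT vectors (real ones are 768-dimensional).
vectors = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 0.0],
]
print(mean_pool(vectors))
# → [2.0, 1.0, 1.0, 2.0]
```

In a Spark NLP pipeline this pooling can instead be done by adding a `SentenceEmbeddings` annotator (with an `AVERAGE` pooling strategy) after the token embeddings stage, rather than pooling by hand.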
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_MARBERTv2| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|609.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/UBC-NLP/MARBERTv2 - https://aclanthology.org/2021.acl-long.551.pdf - https://github.com/UBC-NLP/marbert - https://doi.org/10.14288/SOCKEYE - https://www.tensorflow.org/tfrc --- layout: model title: English image_classifier_vit_ice_cream ViTForImageClassification from juanfiguera author: John Snow Labs name: image_classifier_vit_ice_cream date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_ice_cream` is an English model originally trained by juanfiguera. ## Predicted Entities `chocolate ice cream`, `vanilla ice cream` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_ice_cream_en_4.1.0_3.0_1660170383355.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_ice_cream_en_4.1.0_3.0_1660170383355.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_ice_cream", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_ice_cream", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_ice_cream| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Detect PHI for Deidentification purposes (Spanish, reduced entities, augmented data) author: John Snow Labs name: ner_deid_generic_augmented date: 2022-02-15 tags: [deid, es, licensed] task: De-identification language: es edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state of the art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNN. Deidentification NER (Spanish) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 8 entities (1 more than the `ner_deid_generic` ner model). This NER model is trained on a combination of custom datasets, the Spanish CoNLL 2002 corpus, and the MeddoProf dataset, uses several data augmentation mechanisms, and has been augmented with the MEDDOCAN Spanish Deidentification corpus (compared to `ner_deid_generic`, which does not include it). It's a generalized version of `ner_deid_subentity_augmented`.
## Predicted Entities `CONTACT`, `NAME`, `DATE`, `ID`, `LOCATION`, `PROFESSION`, `AGE`, `SEX` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_ES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_augmented_es_3.3.4_2.4_1644925864218.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_augmented_es_3.3.4_2.4_1644925864218.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("word_embeddings") clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "es", "clinical/models")\ .setInputCols(["sentence","token","word_embeddings"])\ .setOutputCol("ner") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner]) text = [''' Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. 
'''] df = spark.createDataFrame([text]).toDF("text") results = nlpPipeline.fit(df).transform(df) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("word_embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "es", "clinical/models") .setInputCols(Array("sentence","token","word_embeddings")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner)) val text = "Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos." val df = Seq(text).toDF("text") val results = pipeline.fit(df).transform(df) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.deid.generic_augmented").predict(""" Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. """) ```
## Results ```bash +------------+------------+ | token| ner_label| +------------+------------+ | Antonio| B-NAME| | Miguel| I-NAME| | Martínez| I-NAME| | ,| O| | un| B-SEX| | varón| I-SEX| | de| O| | 35| B-AGE| | años| O| | de| O| | edad| O| | ,| O| | de| O| | profesión| O| | auxiliar|B-PROFESSION| | de|I-PROFESSION| | enfermería|I-PROFESSION| | y| O| | nacido| O| | en| O| | Cadiz| B-LOCATION| | ,| O| | España| B-LOCATION| | .| O| | Aún| O| | no| O| | estaba| O| | vacunado| O| | ,| O| | se| O| | infectó| O| | con| O| | Covid-19|B-PROFESSION| | el| O| | dia| O| | 14| O| | de| O| | Marzo| O| | y| O| | tuvo| O| | que| O| | ir| O| | al| O| | Hospital| B-LOCATION| | Fue| O| | tratado| O| | con| O| | anticuerpos| O| |monoclonales| O| | en| O| | la| O| | Clinica| B-LOCATION| | San| I-LOCATION| | Carlos| I-LOCATION| | .| O| +------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic_augmented| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, word_embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|15.0 MB| ## References - Internal JSL annotated corpus - [Spanish conLL](https://www.clips.uantwerpen.be/conll2002/ner/data/) - [MeddoProf](https://temu.bsc.es/meddoprof/data/) - [MeddoCan](https://temu.bsc.es/meddocan/) ## Benchmarking ```bash label tp fp fn total precision recall f1 CONTACT 185.0 3.0 0.0 185.0 0.984 1.0 0.992 NAME 2066.0 138.0 106.0 2172.0 0.9374 0.9512 0.9442 DATE 1017.0 18.0 18.0 1035.0 0.9826 0.9826 0.9826 ORGANIZATION 2468.0 482.0 332.0 2800.0 0.8366 0.8814 0.8584 ID 65.0 5.0 3.0 68.0 0.9286 0.9559 0.942 SEX 678.0 8.0 15.0 693.0 0.9883 0.9784 0.9833 LOCATION 2532.0 358.0 420.0 2952.0 0.8761 0.8577 0.8668 PROFESSION 246.0 9.0 31.0 277.0 0.9647 0.8881 0.9248 AGE 547.0 8.0 9.0 556.0 0.9856 0.9838 0.9847 macro - - - - - - 0.9421 micro - - - - - - 0.9092 ``` --- layout: model title: English T5ForConditionalGeneration Base Cased 
model (from google) author: John Snow Labs name: t5_efficient_base_dm256 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-dm256` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_dm256_en_4.3.0_3.0_1675110661031.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_dm256_en_4.3.0_3.0_1675110661031.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_base_dm256","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_base_dm256","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_base_dm256| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|158.7 MB| ## References - https://huggingface.co/google/t5-efficient-base-dm256 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Cyberbullying Classifier author: John Snow Labs name: classifierdl_use_cyberbullying class: ClassifierDLModel language: en nav_key: models repository: public/models date: 03/07/2020 task: Text Classification edition: Spark NLP 2.5.3 spark_version: 2.4 tags: [classifier] supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Identify Racism, Sexism or Neutral tweets. {:.h2_title} ## Predicted Entities ``neutral``, ``racism``, ``sexism``. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN_CYBERBULLYING/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN_CYBERBULLYING.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_cyberbullying_en_2.5.3_2.4_1593783319298.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_cyberbullying_en_2.5.3_2.4_1593783319298.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") use = UniversalSentenceEncoder.pretrained(lang="en") \ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") document_classifier = ClassifierDLModel.pretrained('classifierdl_use_cyberbullying', 'en') \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") nlpPipeline = Pipeline(stages=[documentAssembler, use, document_classifier]) light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate('@geeky_zekey Thanks for showing again that blacks are the biggest racists. Blocked') ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val use = UniversalSentenceEncoder.pretrained(lang="en") .setInputCols(Array("document")) .setOutputCol("sentence_embeddings") val document_classifier = ClassifierDLModel.pretrained("classifierdl_use_cyberbullying", "en") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, use, document_classifier)) val data = Seq("@geeky_zekey Thanks for showing again that blacks are the biggest racists. Blocked").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""@geeky_zekey Thanks for showing again that blacks are the biggest racists. Blocked"""] cyberbull_df = nlu.load('classify.cyberbullying.use').predict(text, output_level='document') cyberbull_df[["document", "cyberbullying"]] ```
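The Benchmarking section of this card reports macro- and weighted-average precision across the three classes. As a sanity check, those aggregates can be recomputed from the per-class precision and support values quoted in the benchmark table (this is an illustrative calculation, not part of the Spark NLP API):

```python
# Per-class precision and support, copied from the benchmarking table in this card.
precision = {"none": 0.69, "racism": 0.00, "sexism": 0.00}
support = {"none": 3245, "racism": 568, "sexism": 922}

# Macro average: unweighted mean over the classes.
macro_p = sum(precision.values()) / len(precision)

# Weighted average: mean weighted by each class's support.
total = sum(support.values())
weighted_p = sum(precision[c] * support[c] for c in precision) / total

print(round(macro_p, 2))     # 0.23, matching the reported macro avg
print(round(weighted_p, 2))  # 0.47, matching the reported weighted avg
```

The zero precision on `racism` and `sexism` explains the large gap between the 0.69 accuracy and the 0.23 macro average.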
{:.h2_title} ## Results ```bash +--------------------------------------------------------------------------------------------------------+------------+ |document |class | +--------------------------------------------------------------------------------------------------------+------------+ |@geeky_zekey Thanks for showing again that blacks are the biggest racists. Blocked. | racism | +--------------------------------------------------------------------------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| | Model Name | classifierdl_use_cyberbullying | | Model Class | ClassifierDLModel | | Spark Compatibility | 2.5.3 | | Spark NLP Compatibility | 2.4 | | License | open source | | Edition | public | | Input Labels | [document, sentence_embeddings] | | Output Labels | [class] | | Language | en | | Upstream Dependencies | tfhub_use | {:.h2_title} ## Data Source This model is trained on cyberbullying detection dataset. https://raw.githubusercontent.com/dhavalpotdar/cyberbullying-detection/master/data/data/data.csv {:.h2_title} ## Benchmarking ```bash precision recall f1-score support none 0.69 1.00 0.81 3245 racism 0.00 0.00 0.00 568 sexism 0.00 0.00 0.00 922 accuracy 0.69 4735 macro avg 0.23 0.33 0.27 4735 weighted avg 0.47 0.69 0.56 4735 ``` --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_bert_small_pretrained_finetuned_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`bert-small-pretrained-finetuned-squad` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_small_pretrained_finetuned_squad_en_4.0.0_3.0_1654184786135.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_small_pretrained_finetuned_squad_en_4.0.0_3.0_1654184786135.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_small_pretrained_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_small_pretrained_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.small_finetuned.by_anas-awadalla").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
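The NLU one-liner above packs the question and the context into a single string separated by `|||`. The convention is easy to mirror when preparing inputs yourself; the helper below is illustrative, not part of the nlu API:

```python
def split_qa_input(text: str):
    """Split an nlu-style question-answering string on the '|||' separator."""
    question, _, context = text.partition("|||")
    return question.strip(), context.strip()

q, c = split_qa_input("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
print(c)  # My name is Clara and I live in Berkeley.
```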
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_small_pretrained_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|107.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-small-pretrained-finetuned-squad --- layout: model title: Detect Assertion Status (assertion_jsl) author: John Snow Labs name: assertion_jsl date: 2021-07-24 tags: [licensed, clinical, assertion, en] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 3.1.2 spark_version: 2.4 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The deep neural network architecture for assertion status detection in Spark NLP is based on a BiLSTM framework and is a modified version of the architecture proposed by Fancellu et al. (Fancellu, Lopez, and Webber 2016). Its goal is to classify assertions made on given medical concepts as present, absent, or possible in the patient; conditionally present under certain circumstances; hypothetically present at some future point; or mentioned in the patient report but associated with someone else (Uzuner et al. 2011). 
## Predicted Entities `Present`, `Absent`, `Possible`, `Planned`, `Someoneelse`, `Past`, `Family`, `None`, `Hypothetical` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_jsl_en_3.1.2_2.4_1627139823450.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_jsl_en_3.1.2_2.4_1627139823450.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") clinical_assertion = AssertionDLModel.pretrained("assertion_jsl", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion]) text="""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""" data = spark.createDataFrame([[text]]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala ... val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val clinical_assertion = AssertionDLModel.pretrained("assertion_jsl", "en", "clinical/models") .setInputCols(Array("sentence", "ner_chunk", "embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion)) val data = Seq("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. 
His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.jsl").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
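The Results section below notes that the entity chunks and their assertion labels can be read from the `ner_chunk.result` and `assertion.result` columns. Once those two array columns are collected, pairing them up is a simple zip, since the arrays are index-aligned; a minimal sketch using values shaped like the first rows of the output table:

```python
# Example arrays as they might come back from ner_chunk.result and
# assertion.result for one document (values mirror the results table below).
chunks = ["21-day-old", "Caucasian", "male", "congestion"]
assertions = ["Family", "Family", "Family", "Present"]

# One assertion label per entity chunk, in the same order.
pairs = list(zip(chunks, assertions))
for chunk, label in pairs:
    print(f"{chunk} -> {label}")
```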
## Results The output is a dataframe with a sentence per row and an `assertion` column containing all of the assertion labels in the sentence. The assertion column also contains assertion character indices and other metadata. To get only the entity chunks and assertion labels, without the metadata, select `ner_chunk.result` and `assertion.result` from your output dataframe. ```bash +-----------------------------------------+-----+---+----------------------------+-------+---------+ |chunk |begin|end|ner_label |sent_id|assertion| +-----------------------------------------+-----+---+----------------------------+-------+---------+ |21-day-old |17 |26 |Age |0 |Family | |Caucasian |28 |36 |Race_Ethnicity |0 |Family | |male |38 |41 |Gender |0 |Family | |for 2 days |48 |57 |Duration |0 |Family | |congestion |62 |71 |Symptom |0 |Present | |mom |75 |77 |Gender |0 |Family | |yellow |99 |104|Modifier |0 |Family | |discharge |106 |114|Symptom |0 |Family | |nares |135 |139|External_body_part_or_region|0 |Family | |she |147 |149|Gender |0 |Family | |mild |168 |171|Modifier |0 |Family | |problems with his breathing while feeding|173 |213|Symptom |0 |Present | |perioral cyanosis |237 |253|Symptom |0 |Absent | |retractions |258 |268|Symptom |0 |Absent | |One day ago |272 |282|RelativeDate |1 |Family | |mom |285 |287|Gender |1 |Family | |Tylenol |345 |351|Drug_BrandName |1 |Family | |Baby |354 |357|Age |2 |Family | |decreased p.o. 
intake |377 |397|Symptom |2 |Family | |His |400 |402|Gender |3 |Family | +-----------------------------------------+-----+---+----------------------------+-------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_jsl| |Compatibility:|Healthcare NLP 3.1.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| ## Data Source Trained on 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text with ‘embeddings_clinical’. https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ ## Benchmarking ```bash label prec rec f1 Absent 0.970 0.943 0.956 Someoneelse 0.868 0.775 0.819 Planned 0.721 0.754 0.737 Possible 0.852 0.884 0.868 Past 0.811 0.823 0.817 Present 0.833 0.866 0.849 Family 0.872 0.921 0.896 None 0.609 0.359 0.452 Hypothetical 0.722 0.810 0.763 Macro-average 0.888 0.872 0.880 Micro-average 0.908 0.908 0.908 ``` --- layout: model title: English RobertaForQuestionAnswering (from deepset) author: John Snow Labs name: roberta_qa_roberta_large_squad2_hp date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-squad2-hp` is an English model originally trained by `deepset`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_squad2_hp_en_4.0.0_3.0_1655737792807.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_squad2_hp_en_4.0.0_3.0_1655737792807.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_squad2_hp","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_large_squad2_hp","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.large.by_deepset").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_large_squad2_hp| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/deepset/roberta-large-squad2-hp --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Gam) author: John Snow Labs name: distilbert_qa_base_uncased_cuad_finetuned date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-cuad-distilbert` is an English model originally trained by `Gam`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_cuad_finetuned_en_4.3.0_3.0_1672767889486.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_cuad_finetuned_en_4.3.0_3.0_1672767889486.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_cuad_finetuned","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_cuad_finetuned","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_cuad_finetuned| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Gam/distilbert-base-uncased-finetuned-cuad-distilbert --- layout: model title: English image_classifier_vit_lawn_weeds ViTForImageClassification from LorenzoDeMattei author: John Snow Labs name: image_classifier_vit_lawn_weeds date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_lawn_weeds` is an English model originally trained by LorenzoDeMattei. ## Predicted Entities `clover`, `dichondra`, `grass` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_lawn_weeds_en_4.1.0_3.0_1660171067931.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_lawn_weeds_en_4.1.0_3.0_1660171067931.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_lawn_weeds", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_lawn_weeds", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_lawn_weeds| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Pipeline to Detect radiology concepts (ner_radiology_wip_clinical) author: John Snow Labs name: ner_radiology_wip_clinical_pipeline date: 2023-03-14 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_radiology_wip_clinical](https://nlp.johnsnowlabs.com/2021/04/01/ner_radiology_wip_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_radiology_wip_clinical_pipeline_en_4.3.0_3.2_1678801944623.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_radiology_wip_clinical_pipeline_en_4.3.0_3.2_1678801944623.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_radiology_wip_clinical_pipeline", "en", "clinical/models") text = '''Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_radiology_wip_clinical_pipeline", "en", "clinical/models") val text = "Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.radiology.clinical_wip.pipeline").predict("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""") ```
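The results table that follows includes a per-chunk confidence score. When post-processing `fullAnnotate` output, a common step is to drop low-confidence predictions; a minimal sketch over rows shaped like that table (the 0.6 threshold is an arbitrary example, not a model recommendation):

```python
# Rows mirroring the pipeline's result table: (chunk, ner_label, confidence).
rows = [
    ("Bilateral", "Direction", 0.9828),
    ("left", "Direction", 0.4667),
    ("benign fibrous tissue", "ImagingFindings", 0.394867),
    ("lipoma", "Disease_Syndrome_Disorder", 0.9142),
]

# Keep only the chunks whose confidence clears the threshold.
threshold = 0.6
kept = [(chunk, label) for chunk, label, conf in rows if conf >= threshold]
print(kept)  # [('Bilateral', 'Direction'), ('lipoma', 'Disease_Syndrome_Disorder')]
```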
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:----------------------|--------:|------:|:--------------------------|-------------:| | 0 | Bilateral | 0 | 8 | Direction | 0.9828 | | 1 | breast | 10 | 15 | BodyPart | 0.8169 | | 2 | ultrasound | 17 | 26 | ImagingTest | 0.6216 | | 3 | ovoid mass | 78 | 87 | ImagingFindings | 0.6917 | | 4 | 0.5 x 0.5 x 0.4 | 113 | 127 | Measurements | 0.91524 | | 5 | cm | 129 | 130 | Units | 0.9987 | | 6 | anteromedial aspect | 163 | 181 | Direction | 0.8241 | | 7 | left | 190 | 193 | Direction | 0.4667 | | 8 | shoulder | 195 | 202 | BodyPart | 0.6349 | | 9 | mass | 210 | 213 | ImagingFindings | 0.9611 | | 10 | isoechoic echotexture | 228 | 248 | ImagingFindings | 0.6851 | | 11 | muscle | 266 | 271 | BodyPart | 0.7805 | | 12 | internal color flow | 294 | 312 | ImagingFindings | 0.5153 | | 13 | benign fibrous tissue | 334 | 354 | ImagingFindings | 0.394867 | | 14 | lipoma | 361 | 366 | Disease_Syndrome_Disorder | 0.9142 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_radiology_wip_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_squadv2_recipe_tokenwise_token_and_step_losses_3_epochs date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face 
and curated to provide scalability and production-readiness using Spark NLP. `squadv2-recipe-roberta-tokenwise-token-and-step-losses-3-epochs` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_squadv2_recipe_tokenwise_token_and_step_losses_3_epochs_en_4.3.0_3.0_1674224122519.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_squadv2_recipe_tokenwise_token_and_step_losses_3_epochs_en_4.3.0_3.0_1674224122519.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_squadv2_recipe_tokenwise_token_and_step_losses_3_epochs","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_squadv2_recipe_tokenwise_token_and_step_losses_3_epochs","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_squadv2_recipe_tokenwise_token_and_step_losses_3_epochs| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|467.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/squadv2-recipe-roberta-tokenwise-token-and-step-losses-3-epochs --- layout: model title: English BertForQuestionAnswering Tiny Cased model (from M-FAC) author: John Snow Labs name: bert_qa_tiny_finetuned_squadv2 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-tiny-finetuned-squadv2` is an English model originally trained by `M-FAC`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_tiny_finetuned_squadv2_en_4.0.0_3.0_1657188687594.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_tiny_finetuned_squadv2_en_4.0.0_3.0_1657188687594.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tiny_finetuned_squadv2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tiny_finetuned_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_tiny_finetuned_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|16.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/M-FAC/bert-tiny-finetuned-squadv2 - https://arxiv.org/pdf/2107.03356.pdf - https://github.com/IST-DASLab/M-FAC --- layout: model title: Detect Units and Measurements in text author: John Snow Labs name: ner_measurements_clinical date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract units and other measurements in reports, prescriptions, and other medical texts using this pretrained NER model. ## Predicted Entities `Units`, `Measurements` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_measurements_clinical_en_3.0.0_3.0_1617260795877.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_measurements_clinical_en_3.0.0_3.0_1617260795877.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_measurements_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_measurements_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("EXAMPLE_TEXT").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.nlu-block} ```python import nlu nlu.load("en.med_ner.measurements").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_measurements_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| --- layout: model title: Legal Law Area Prediction Classifier (Italian) author: John Snow Labs name: legclf_law_area_prediction_italian date: 2023-03-29 tags: [it, licensed, classification, legal, tensorflow] task: Text Classification language: it edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a multiclass classification model that identifies law area labels (civil_law, penal_law, public_law, social_law) in Italian court cases. ## Predicted Entities `civil_law`, `penal_law`, `public_law`, `social_law` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_law_area_prediction_italian_it_1.0.0_3.0_1680095983817.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_law_area_prediction_italian_it_1.0.0_3.0_1680095983817.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_multi_cased", "xx")\ .setInputCols(["document"]) \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_law_area_prediction_italian", "it", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, docClassifier ]) df = spark.createDataFrame([["Per questi motivi, il Tribunale federale pronuncia: 1. Nella misura in cui è ammissibile, il ricorso è respinto. 2. Le spese giudiziarie di fr. 1'000.-- sono poste a carico dei ricorrenti. 3. Comunicazione al patrocinatore dei ricorrenti, al Consiglio di Stato, al Gran Consiglio, al Tribunale amministrativo del Cantone Ticino e all'Ufficio federale dello sviluppo territoriale."]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) result.select("text", "category.result").show(truncate=100) ```
## Results ```bash +----------------------------------------------------------------------------------------------------+------------+ | text| result| +----------------------------------------------------------------------------------------------------+------------+ |Per questi motivi, il Tribunale federale pronuncia: 1. Nella misura in cui è ammissibile, il rico...|[public_law]| +----------------------------------------------------------------------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_law_area_prediction_italian| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|it| |Size:|22.3 MB| ## References Train dataset available [here](https://huggingface.co/datasets/rcds/legal_criticality_prediction) ## Benchmarking ```bash label precision recall f1-score support civil_law 0.86 0.86 0.86 58 penal_law 0.85 0.82 0.83 55 public_law 0.79 0.79 0.79 52 social_law 0.93 0.96 0.94 68 accuracy - - 0.86 233 macro-avg 0.86 0.86 0.86 233 weighted-avg 0.86 0.86 0.86 233 ``` --- layout: model title: Finnish asr_wav2vec2_large_uralic_voxpopuli_v2_finnish TFWav2Vec2ForCTC from Finnish-NLP author: John Snow Labs name: pipeline_asr_wav2vec2_large_uralic_voxpopuli_v2_finnish date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_large_uralic_voxpopuli_v2_finnish` is a Finnish model originally trained by Finnish-NLP. 
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_uralic_voxpopuli_v2_finnish_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_uralic_voxpopuli_v2_finnish_fi_4.2.0_3.0_1664038039563.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_uralic_voxpopuli_v2_finnish_fi_4.2.0_3.0_1664038039563.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_uralic_voxpopuli_v2_finnish', lang = 'fi') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_uralic_voxpopuli_v2_finnish", lang = "fi") val annotations = pipeline.transform(audioDF) ```
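Both snippets above assume an `audioDF` DataFrame whose `audio_content` column holds the recording as an array of floats. As a minimal, hedged sketch of how such a column can be prepared (the file name, the synthetic test tone, and the commented-out `spark.createDataFrame` call are illustrative assumptions, not part of the pipeline), a mono 16-bit PCM WAV can be decoded and normalized with only the Python standard library:

```python
import math
import struct
import wave

def wav_to_floats(path):
    """Read a mono 16-bit PCM WAV file and normalize samples to [-1.0, 1.0]."""
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Write a one-second 440 Hz test tone so the example is self-contained.
with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)           # mono
    w.setsampwidth(2)           # 16-bit PCM
    w.setframerate(16000)       # wav2vec2 models expect 16 kHz input
    pcm = [int(32767 * math.sin(2 * math.pi * 440 * t / 16000)) for t in range(16000)]
    w.writeframes(struct.pack("<%dh" % len(pcm), *pcm))

floats = wav_to_floats("tone.wav")
# audioDF = spark.createDataFrame([[floats]], ["audio_content"])  # requires a SparkSession
```

Wav2vec2 models are trained on 16 kHz audio, so recordings at other sample rates should be resampled before assembly.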
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_uralic_voxpopuli_v2_finnish| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English image_classifier_vit_base_patch16_224_in21k_classify_4scence ViTForImageClassification from HaoHu author: John Snow Labs name: image_classifier_vit_base_patch16_224_in21k_classify_4scence date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_patch16_224_in21k_classify_4scence` is an English model originally trained by HaoHu. ## Predicted Entities `City_road`, `fog`, `rain`, `snow` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_in21k_classify_4scence_en_4.1.0_3.0_1660171116662.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_in21k_classify_4scence_en_4.1.0_3.0_1660171116662.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_patch16_224_in21k_classify_4scence", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_patch16_224_in21k_classify_4scence", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_patch16_224_in21k_classify_4scence| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Detect Living Species author: John Snow Labs name: bert_token_classifier_ner_living_species date: 2022-06-26 tags: [en, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract living species from clinical texts, a task that is critical to scientific disciplines such as medicine, biology, ecology/biodiversity, nutrition and agriculture. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. It is trained on the [LivingNER](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) corpus, which is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. **NOTE:** - The text files were translated from Spanish with a neural machine translation system. - The annotations were translated with the same neural machine translation system. - The translated annotations were transferred to the translated text files using an annotation transfer technology.
## Predicted Entities `HUMAN`, `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_en_3.5.3_3.0_1656273939035.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_en_3.5.3_3.0_1656273939035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_living_species", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, ner_model, ner_converter ]) data = spark.createDataFrame([["""42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. 
In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_living_species", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") .setCaseSensitive(true) .setMaxSentenceLength(512) val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, ner_model, ner_converter)) val data = Seq("""42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. 
In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.living_species.token_bert").predict("""42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL.""") ```
## Results ```bash +-----------------------+-------+ |ner_chunk |label | +-----------------------+-------+ |woman |HUMAN | |bacterial |SPECIES| |Fusarium spp |SPECIES| |patient |HUMAN | |species |SPECIES| |Fusarium solani complex|SPECIES| |antifungals |SPECIES| +-----------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_living_species| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References [https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) ## Benchmarking ```bash label precision recall f1-score support B-HUMAN 0.83 0.96 0.89 2950 B-SPECIES 0.70 0.93 0.80 3129 I-HUMAN 0.73 0.39 0.51 145 I-SPECIES 0.67 0.81 0.74 1166 micro-avg 0.74 0.91 0.82 7390 macro-avg 0.73 0.77 0.73 7390 weighted-avg 0.75 0.91 0.82 7390 ``` --- layout: model title: Pipeline to Resolve ICD-10-CM Codes author: John Snow Labs name: icd10cm_resolver_pipeline date: 2023-04-28 tags: [en, licensed, clinical, resolver, chunk_mapping, pipeline, icd10cm] task: Entity Resolution language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps entities with their corresponding ICD-10-CM codes. You’ll just feed your text and it will return the corresponding ICD-10-CM codes. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10cm_resolver_pipeline_en_4.3.2_3.0_1682726202207.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10cm_resolver_pipeline_en_4.3.2_3.0_1682726202207.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline resolver_pipeline = PretrainedPipeline("icd10cm_resolver_pipeline", "en", "clinical/models") text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage""" result = resolver_pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val resolver_pipeline = new PretrainedPipeline("icd10cm_resolver_pipeline", "en", "clinical/models") val result = resolver_pipeline.fullAnnotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage""") ``` {:.nlu-block} ```python import nlu nlu.load("en.icd10cm_resolver.pipeline").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage""") ```
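ICD-10-CM codes longer than three characters carry a decimal point after the third character (e.g. `B81.0`), but codes are sometimes passed around downstream as bare strings without the separator. The helper below is an illustrative sketch for normalizing such strings; it is not part of the pretrained pipeline:

```python
def normalize_icd10cm(code: str) -> str:
    """Insert the ICD-10-CM decimal point after the third character.

    Codes of three characters or fewer, and codes that already contain
    a dot, are returned unchanged (aside from trimming and upcasing).
    """
    code = code.strip().upper()
    if len(code) <= 3 or "." in code:
        return code
    return code[:3] + "." + code[3:]

print(normalize_icd10cm("P545"))    # P54.5
print(normalize_icd10cm("B81.0"))   # B81.0
print(normalize_icd10cm("O24919"))  # O24.919
```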
## Results ```bash +-----------------------------+---------+------------+ |chunk |ner_chunk|icd10cm_code| +-----------------------------+---------+------------+ |gestational diabetes mellitus|PROBLEM |O24.919 | |anisakiasis |PROBLEM |B81.0 | |fetal and neonatal hemorrhage|PROBLEM |P545 | +-----------------------------+---------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd10cm_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|3.5 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger --- layout: model title: English XlmRoBertaForQuestionAnswering (from horsbug98) author: John Snow Labs name: xlm_roberta_qa_Part_2_XLM_Model_E1 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Part_2_XLM_Model_E1` is an English model originally trained by `horsbug98`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_Part_2_XLM_Model_E1_en_4.0.0_3.0_1655983522974.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_Part_2_XLM_Model_E1_en_4.0.0_3.0_1655983522974.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_Part_2_XLM_Model_E1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_Part_2_XLM_Model_E1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.tydiqa.xlm_roberta.v2.by_horsbug98").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_Part_2_XLM_Model_E1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|814.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/horsbug98/Part_2_XLM_Model_E1 --- layout: model title: Word Embeddings for Urdu (urduvec_140M_300d) author: John Snow Labs name: urduvec_140M_300d date: 2020-12-01 task: Embeddings language: ur edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [embeddings, ur, open_source] supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained using Word2Vec approach on a corpora of 140 Million tokens, has a vocabulary of 100k unique tokens, and gives 300 dimensional vector outputs per token. The output vectors map words into a meaningful space where the distance between the vectors is related to semantic similarity of words. These embeddings can be used in multiple tasks like semantic word similarity, named entity recognition, sentiment analysis, and classification. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/urduvec_140M_300d_ur_2.7.0_2.4_1606810614734.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/urduvec_140M_300d_ur_2.7.0_2.4_1606810614734.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['مجھے سپارک این ایل پی پسند ہے۔']], ["text"])) ``` ```scala val embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("مجھے سپارک این ایل پی پسند ہے۔").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["مجھے سپارک این ایل پی پسند ہے۔"] urduvec_df = nlu.load('ur.embed.urdu_vec_140M_300d').predict(text, output_level="token") urduvec_df ```
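Since the distance between these vectors tracks semantic similarity, the 300-dimensional token vectors returned in the `embeddings` column can be compared directly. A dependency-free cosine-similarity sketch (the two-dimensional sample vectors are made up for illustration; real vectors from this model have 300 dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing the same way score ~1.0; orthogonal vectors score 0.0.
print(cosine_similarity([0.5, 0.5], [1.0, 1.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```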
{:.h2_title} ## Results The model gives 300 dimensional Word2Vec feature vector outputs per token. ```bash |Embeddings vector | Tokens |----------------------------------------------------|--------- | [0.15994004905223846, -0.2213257998228073, 0.0... | مجھے | [-0.16085924208164215, -0.12259697169065475, -... | سپارک | [-0.07977486401796341, -0.528775691986084, 0.3... | این | [-0.24136857688426971, -0.15272589027881622, 0... | ایل | [-0.23666366934776306, -0.16016320884227753, 0... | پی | [0.07911433279514313, 0.05598200485110283, 0.0... | پسند ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|urduvec_140M_300d| |Type:|embeddings| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[word_embeddings]| |Language:|ur| |Case sensitive:|false| |Dimension:|300| ## Data Source The model is imported from http://www.lrec-conf.org/proceedings/lrec2018/pdf/148.pdf --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_4_h_768 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-4_H-768` is a Chinese model originally trained by `uer`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_768_zh_4.2.4_3.0_1670021693429.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_768_zh_4.2.4_3.0_1670021693429.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_768","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_768","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
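The pipeline above emits one embedding per token. When a single fixed-size vector per text is needed (Spark NLP's `SentenceEmbeddings` annotator performs this pooling on annotator output), a common lightweight approach is mean pooling over the token vectors. The sketch below is illustrative and operates on plain Python lists rather than the annotator's result column:

```python
def mean_pool(token_vectors):
    """Average per-token embeddings into one fixed-size vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```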
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_4_h_768| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|170.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-4_H-768 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://github.com/dbiir/UER-py/ - https://cloud.tencent.com/ --- layout: model title: Oncology Pipeline for Biomarkers author: John Snow Labs name: oncology_biomarker_pipeline date: 2022-11-04 tags: [licensed, en, oncology] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Healthcare 4.2.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline includes Named-Entity Recognition, Assertion Status and Relation Extraction models to extract information from oncology texts. This pipeline focuses on entities related to biomarkers. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ONCOLOGY/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/oncology_biomarker_pipeline_en_4.2.2_3.0_1667581643291.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/oncology_biomarker_pipeline_en_4.2.2_3.0_1667581643291.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("oncology_biomarker_pipeline", "en", "clinical/models") pipeline.fullAnnotate("Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.")[0] ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("oncology_biomarker_pipeline", "en", "clinical/models") val result = pipeline.fullAnnotate("""Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.""")(0) ``` {:.nlu-block} ```python import nlu nlu.load("en.oncology_biomarker.pipeline").predict("""Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.""") ```
## Results ```bash ******************** ner_oncology_wip results ******************** | chunk | ner_label | |:-------------------------------|:-----------------| | negative | Biomarker_Result | | thyroid transcription factor-1 | Biomarker | | napsin | Biomarker | | positive | Biomarker_Result | | ER | Biomarker | | PR | Biomarker | | negative | Biomarker_Result | | HER2 | Oncogene | ******************** ner_oncology_biomarker_wip results ******************** | chunk | ner_label | |:-------------------------------|:-----------------| | negative | Biomarker_Result | | thyroid transcription factor-1 | Biomarker | | napsin A | Biomarker | | positive | Biomarker_Result | | ER | Biomarker | | PR | Biomarker | | negative | Biomarker_Result | | HER2 | Biomarker | ******************** ner_oncology_test_wip results ******************** | chunk | ner_label | |:-------------------------------|:-----------------| | Immunohistochemistry | Pathology_Test | | negative | Biomarker_Result | | thyroid transcription factor-1 | Biomarker | | napsin A | Biomarker | | positive | Biomarker_Result | | ER | Biomarker | | PR | Biomarker | | negative | Biomarker_Result | | HER2 | Oncogene | ******************** ner_biomarker results ******************** | chunk | ner_label | |:-------------------------------|:----------------------| | Immunohistochemistry | Test | | negative | Biomarker_Measurement | | thyroid transcription factor-1 | Biomarker | | napsin A | Biomarker | | positive | Biomarker_Measurement | | ER | Biomarker | | PR | Biomarker | | negative | Biomarker_Measurement | | HER2 | Biomarker | ******************** assertion_oncology_wip results ******************** | chunk | ner_label | assertion | |:-------------------------------|:---------------|:------------| | Immunohistochemistry | Pathology_Test | Past | | thyroid transcription factor-1 | Biomarker | Present | | napsin A | Biomarker | Present | | ER | Biomarker | Present | | PR | Biomarker | Present | | HER2 | Oncogene | 
Present | ******************** assertion_oncology_test_binary_wip results ******************** | chunk | ner_label | assertion | |:-------------------------------|:---------------|:----------------| | Immunohistochemistry | Pathology_Test | Medical_History | | thyroid transcription factor-1 | Biomarker | Medical_History | | napsin A | Biomarker | Medical_History | | ER | Biomarker | Medical_History | | PR | Biomarker | Medical_History | | HER2 | Oncogene | Medical_History | ******************** re_oncology_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:---------------------|:-----------------|:-------------------------------|:-----------------|:--------------| | Immunohistochemistry | Pathology_Test | negative | Biomarker_Result | O | | negative | Biomarker_Result | thyroid transcription factor-1 | Biomarker | is_related_to | | negative | Biomarker_Result | napsin A | Biomarker | is_related_to | | positive | Biomarker_Result | ER | Biomarker | is_related_to | | positive | Biomarker_Result | PR | Biomarker | is_related_to | | positive | Biomarker_Result | HER2 | Oncogene | O | | ER | Biomarker | negative | Biomarker_Result | O | | PR | Biomarker | negative | Biomarker_Result | O | | negative | Biomarker_Result | HER2 | Oncogene | is_related_to | ******************** re_oncology_granular_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:---------------------|:-----------------|:-------------------------------|:-----------------|:--------------| | Immunohistochemistry | Pathology_Test | negative | Biomarker_Result | O | | negative | Biomarker_Result | thyroid transcription factor-1 | Biomarker | is_finding_of | | negative | Biomarker_Result | napsin A | Biomarker | is_finding_of | | positive | Biomarker_Result | ER | Biomarker | is_finding_of | | positive | Biomarker_Result | PR | Biomarker | is_finding_of | | positive | Biomarker_Result | HER2 | Oncogene | is_finding_of | | ER | Biomarker | 
negative | Biomarker_Result | O | | PR | Biomarker | negative | Biomarker_Result | O | | negative | Biomarker_Result | HER2 | Oncogene | is_finding_of | ******************** re_oncology_biomarker_result_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:---------------------|:-----------------|:-------------------------------|:-----------------|:--------------| | Immunohistochemistry | Pathology_Test | negative | Biomarker_Result | is_finding_of | | negative | Biomarker_Result | thyroid transcription factor-1 | Biomarker | is_finding_of | | negative | Biomarker_Result | napsin A | Biomarker | is_finding_of | | positive | Biomarker_Result | ER | Biomarker | is_finding_of | | positive | Biomarker_Result | PR | Biomarker | is_finding_of | | positive | Biomarker_Result | HER2 | Oncogene | O | | ER | Biomarker | negative | Biomarker_Result | O | | PR | Biomarker | negative | Biomarker_Result | O | | negative | Biomarker_Result | HER2 | Oncogene | is_finding_of | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|oncology_biomarker_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP for Healthcare 4.2.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ChunkMergeModel - AssertionDLModel - AssertionDLModel - PerceptronModel - DependencyParserModel - RelationExtractionModel - RelationExtractionModel - RelationExtractionModel --- layout: model title: Acquisitions / Subsidiaries Relation Extraction (md, Unidirectional) author: John Snow Labs name: finre_acquisitions_subsidiaries_md date: 2022-11-08 tags: [acquisition, subsidiaries, en, licensed] task: Relation Extraction language: en nav_key: models edition: Finance NLP 1.0.0 
spark_version: 3.0
supported: true
annotator: RelationExtractionDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

IMPORTANT: Don't run this model on a whole financial report. Instead:

- Split the report by paragraphs;
- Use the `finclf_acquisitions_item` Text Classifier to select only those paragraphs.

This is an `md` model, meaning that the directions in the relations are meaningful: `chunk1` is the source of the relation, `chunk2` is the target. The aim of this model is to retrieve acquisition and subsidiary relationships between Organizations, including when the acquisition was carried out ("was_acquired") and by whom ("was_acquired_by"). Subsidiaries are tagged with the relationship "is_subsidiary_of".

## Predicted Entities

`was_acquired`, `was_acquired_by`, `is_subsidiary_of`, `other`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/FINRE_ACQUISITIONS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finre_acquisitions_subsidiaries_md_en_1.0.0_3.0_1667920790547.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finre_acquisitions_subsidiaries_md_en_1.0.0_3.0_1667920790547.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencizer = nlp.SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "en") \ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") bert_embeddings= nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\ .setInputCols(["sentence", "token"])\ .setOutputCol("bert_embeddings") ner_model_date = finance.NerModel.pretrained("finner_sec_dates", "en", "finance/models")\ .setInputCols(["sentence", "token", "bert_embeddings"])\ .setOutputCol("ner_dates") ner_converter_date = nlp.NerConverter()\ .setInputCols(["sentence","token","ner_dates"])\ .setOutputCol("ner_chunk_date") ner_model_org= finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\ .setInputCols(["sentence", "token", "bert_embeddings"])\ .setOutputCol("ner_orgs") ner_converter_org = nlp.NerConverter()\ .setInputCols(["sentence","token","ner_orgs"])\ .setOutputCol("ner_chunk_org")\ .setWhiteList(['ORG', 'PRODUCT', 'ALIAS']) chunk_merger = finance.ChunkMergeApproach()\ .setInputCols('ner_chunk_org', "ner_chunk_date")\ .setOutputCol('ner_chunk') reDL = finance.RelationExtractionDLModel().pretrained('finre_acquisitions_subsidiaries_md', 'en', 'finance/models')\ .setInputCols(["ner_chunk", "sentence"])\ .setOutputCol("relations") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentencizer, tokenizer, bert_embeddings, ner_model_date, ner_converter_date, ner_model_org, ner_converter_org, chunk_merger, reDL]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = "Whatsapp, Inc. 
was acquired by Meta, Inc"

lmodel = nlp.LightPipeline(model)

results = lmodel.fullAnnotate(text)

rel_df = get_relations_df(results)

rel_df = rel_df[rel_df['relation'] != 'no_rel']

print(rel_df.to_string(index=False))
```
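In practice, extracted relations are often thresholded on the model's confidence score before further use. A minimal pure-Python sketch over rows shaped like the relation output (the second row is invented purely to illustrate filtering; it is not real model output):

```python
# Rows mirroring the relation output columns: relation, chunks, confidence
rows = [
    {"relation": "was_acquired_by", "chunk1": "Whatsapp, Inc.",
     "chunk2": "Meta", "confidence": 0.9527305},
    {"relation": "other", "chunk1": "Whatsapp, Inc.",
     "chunk2": "Meta", "confidence": 0.41},
]

def filter_relations(rows, threshold=0.9):
    # Drop the "other" class and anything below the confidence threshold
    return [r for r in rows
            if r["relation"] != "other" and r["confidence"] >= threshold]

for r in filter_relations(rows):
    print(f"{r['chunk1']} --{r['relation']}--> {r['chunk2']}")
```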
## Results ```bash relation entity1 entity1_begin entity1_end chunk1 entity2 entity2_begin entity2_end chunk2 confidence was_acquired_by ORG 0 13 Whatsapp, Inc. ORG 31 34 Meta 0.9527305 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finre_acquisitions_subsidiaries_md| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|405.7 MB| ## References In-house annotations on SEC 10K filings and Wikidata ## Benchmarking ```bash label Recall Precision F1 Support is_subsidiary_of 0.583 0.618 0.600 36 other 0.975 0.948 0.961 243 was_acquired 0.836 0.895 0.864 61 was_acquired_by 0.767 0.780 0.773 60 Avg. 0.790 0.810 0.800 406 Weighted-Avg. 0.887 0.885 0.886 406 ``` --- layout: model title: French XlmRoBertaForQuestionAnswering (from saattrupdan) author: John Snow Labs name: xlm_roberta_qa_xlmr_base_texas_squad_fr_fr_saattrupdan date: 2022-06-24 tags: [open_source, question_answering, xlmroberta, fr] task: Question Answering language: fr edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlmr-base-texas-squad-fr` is a French model originally trained by `saattrupdan`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlmr_base_texas_squad_fr_fr_saattrupdan_fr_4.0.0_3.0_1656066193139.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlmr_base_texas_squad_fr_fr_saattrupdan_fr_4.0.0_3.0_1656066193139.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlmr_base_texas_squad_fr_fr_saattrupdan","fr") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```

```scala
val document = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = XlmRoBertaForQuestionAnswering
    .pretrained("xlm_roberta_qa_xlmr_base_texas_squad_fr_fr_saattrupdan","fr")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("fr.answer_question.squad.xlmr_roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlmr_base_texas_squad_fr_fr_saattrupdan| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|fr| |Size:|873.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/saattrupdan/xlmr-base-texas-squad-fr --- layout: model title: Sentence Entity Resolver for Snomed (sbertresolve_snomed_conditions) author: John Snow Labs name: sbertresolve_snomed_conditions date: 2021-08-28 tags: [snomed, licensed, en, clinical] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.1.3 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities (domain: Conditions) to Snomed codes using `sbert_jsl_medium_uncased` Sentence Bert Embeddings. ## Predicted Entities Snomed Codes and their normalized definition with `sbert_jsl_medium_uncased ` embeddings. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_snomed_conditions_en_3.1.3_2.4_1630180858399.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_snomed_conditions_en_3.1.3_2.4_1630180858399.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("ner_chunk")

sbert_embedder = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased", "en", "clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("sbert_embeddings")

snomed_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_snomed_conditions", "en", "clinical/models")\
    .setInputCols(["sbert_embeddings"])\
    .setOutputCol("snomed_code")\
    .setDistanceFunction("EUCLIDEAN")

snomed_pipelineModel = PipelineModel(
    stages = [
        documentAssembler,
        sbert_embedder,
        snomed_resolver
    ])

snomed_lp = LightPipeline(snomed_pipelineModel)

result = snomed_lp.fullAnnotate("schizophrenia")
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("ner_chunk")

val sbert_embedder = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased", "en", "clinical/models")
    .setInputCols("ner_chunk")
    .setOutputCol("sbert_embeddings")

val snomed_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_snomed_conditions", "en", "clinical/models")
    .setInputCols(Array("sbert_embeddings"))
    .setOutputCol("snomed_code")
    .setDistanceFunction("EUCLIDEAN")

val snomed_pipelineModel = new Pipeline()
    .setStages(Array(documentAssembler, sbert_embedder, snomed_resolver))
    .fit(Seq("").toDF("text"))

val snomed_lp = new LightPipeline(snomed_pipelineModel)

val result = snomed_lp.fullAnnotate("schizophrenia")
```

{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.snomed_conditions").predict("""Put your text here.""")
```
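The resolver returns a ranked list of candidate codes with their distances, so picking the resolved code is simply an argmin over the distances. A pure-Python sketch, using the first few candidate values reported for "schizophrenia" in the results of this card:

```python
# Candidate SNOMED codes and Euclidean distances, as carried in the
# resolver's annotation metadata (values copied from the results table)
all_codes = ["58214004", "83746006", "274952002", "191542003", "191529003", "16990005"]
all_distances = [0.0000, 0.0774, 0.0838, 0.0927, 0.0970, 0.0970]

# The resolved code is the candidate at minimal distance
best_code, best_dist = min(zip(all_codes, all_distances), key=lambda pair: pair[1])
print(best_code)  # → 58214004
```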
## Results

```bash
|    | chunks        | code     | resolutions                                                                                                               | all_codes                                                            | all_distances                                         |
|---:|:--------------|:---------|:--------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------|:------------------------------------------------------|
|  0 | schizophrenia | 58214004 | [schizophrenia, chronic schizophrenia, borderline schizophrenia, schizophrenia, catatonic, subchronic schizophrenia, ...] | [58214004, 83746006, 274952002, 191542003, 191529003, 16990005, ...] | [0.0000, 0.0774, 0.0838, 0.0927, 0.0970, 0.0970, ...] |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sbertresolve_snomed_conditions|
|Compatibility:|Healthcare NLP 3.1.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk, sbert_embeddings]|
|Output Labels:|[snomed_code]|
|Language:|en|
|Case sensitive:|false|

---
layout: model
title: Arabic Part of Speech Tagger (Modern Standard Arabic (MSA), Modern Standard Arabic-MSA POS)
author: John Snow Labs
name: bert_pos_bert_base_arabic_camelbert_msa_pos_msa
date: 2022-04-26
tags: [bert, pos, part_of_speech, ar, open_source]
task: Part of Speech Tagging
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-msa-pos-msa` is an Arabic model originally trained by `CAMeL-Lab`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_msa_pos_msa_ar_3.4.2_3.0_1650993589088.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_msa_pos_msa_ar_3.4.2_3.0_1650993589088.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_msa_pos_msa","ar") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_msa_pos_msa","ar") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("أنا أحب الشرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.pos.arabic_camelbert_msa_pos_msa").predict("""أنا أحب الشرارة NLP""") ```
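A common downstream step is to group tokens by their predicted tag. A minimal pure-Python sketch over (token, tag) pairs of the shape produced in the `pos` column — the tag names below are placeholders for illustration, not the model's exact tagset:

```python
from collections import defaultdict

# Illustrative (token, tag) pairs; tags are placeholders, not real output
tagged = [("أنا", "PRON"), ("أحب", "VERB"), ("الشرارة", "NOUN"), ("NLP", "NOUN")]

def group_by_tag(pairs):
    # Collect tokens under their predicted part-of-speech tag
    groups = defaultdict(list)
    for token, tag in pairs:
        groups[tag].append(token)
    return dict(groups)

print(group_by_tag(tagged))
```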
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_arabic_camelbert_msa_pos_msa| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ar| |Size:|407.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa-pos-msa - https://dl.acm.org/doi/pdf/10.5555/1621804.1621808 - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://github.com/CAMeL-Lab/camel_tools --- layout: model title: Persian BertForMaskedLM Base Cased model (from HooshvareLab) author: John Snow Labs name: bert_embeddings_fa_zwnj_base date: 2022-12-02 tags: [fa, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: fa edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-fa-zwnj-base` is a Persian model originally trained by `HooshvareLab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_fa_zwnj_base_fa_4.2.4_3.0_1670019440290.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_fa_zwnj_base_fa_4.2.4_3.0_1670019440290.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_fa_zwnj_base", "fa") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_fa_zwnj_base", "fa")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_fa_zwnj_base| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|fa| |Size:|444.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/HooshvareLab/bert-fa-zwnj-base - https://arxiv.org/abs/2005.12515 - https://github.com/hooshvare/parsbert/issues --- layout: model title: XLM-RoBERTa Base NER Pipeline author: ahmedlone127 name: xlm_roberta_base_token_classifier_ontonotes_pipeline date: 2022-06-14 tags: [open_source, ner, token_classifier, xlm_roberta, ontonotes, xlm, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: false annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [xlm_roberta_base_token_classifier_ontonotes](https://nlp.johnsnowlabs.com/2021/10/03/xlm_roberta_base_token_classifier_ontonotes_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/xlm_roberta_base_token_classifier_ontonotes_pipeline_en_4.0.0_3.0_1655216428417.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/xlm_roberta_base_token_classifier_ontonotes_pipeline_en_4.0.0_3.0_1655216428417.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("xlm_roberta_base_token_classifier_ontonotes_pipeline", lang = "en") pipeline.annotate("My name is John and I have been working at John Snow Labs since November 2020.") ``` ```scala val pipeline = new PretrainedPipeline("xlm_roberta_base_token_classifier_ontonotes_pipeline", lang = "en") pipeline.annotate("My name is John and I have been working at John Snow Labs since November 2020.") ```
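To pair each recognized chunk with its entity label, use `fullAnnotate`, whose chunk annotations carry the label in their metadata. A pure-Python sketch of collecting (chunk, label) pairs from an illustrative stand-in for that structure (the dicts below mimic the shape of the annotations, they are not the library's actual objects):

```python
# Illustrative stand-in for the chunk annotations a fullAnnotate call returns:
# each annotation carries the chunk text plus metadata with the entity label
annotations = [
    {"result": "John", "metadata": {"entity": "PERSON"}},
    {"result": "John Snow Labs", "metadata": {"entity": "ORG"}},
    {"result": "November 2020", "metadata": {"entity": "DATE"}},
]

pairs = [(a["result"], a["metadata"]["entity"]) for a in annotations]
for chunk, label in pairs:
    print(f"{chunk}\t{label}")
```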
## Results

```bash
+--------------+---------+
|chunk         |ner_label|
+--------------+---------+
|John          |PERSON   |
|John Snow Labs|ORG      |
|November 2020 |DATE     |
+--------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_roberta_base_token_classifier_ontonotes_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Community|
|Language:|en|
|Size:|858.4 MB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- XlmRoBertaForTokenClassification
- NerConverter
- Finisher

---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury TFWav2Vec2ForCTC from Satyamatury
author: John Snow Labs
name: asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury` is an English model originally trained by Satyamatury.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury_gpu`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury_en_4.2.0_3.0_1664112233589.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury_en_4.2.0_3.0_1664112233589.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|

---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18)
author: John Snow Labs
name: distilbert_qa_base_uncased_becas_5
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-becas-5` is an English model originally trained by `Evelyn18`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_5_en_4.3.0_3.0_1672767558441.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_5_en_4.3.0_3.0_1672767558441.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

question_answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_5","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[document_assembler, question_answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val questionAnswering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_5","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, questionAnswering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_becas_5|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/Evelyn18/distilbert-base-uncased-becas-5

---
layout: model
title: Translate Berber to English Pipeline
author: John Snow Labs
name: translate_ber_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, ber, en, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `ber`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ber_en_xx_2.7.0_2.4_1609687862998.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ber_en_xx_2.7.0_2.4_1609687862998.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("translate_ber_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = new PretrainedPipeline("translate_ber_en", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.ber.translate_to.en').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_ber_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: PICO Classifier
author: John Snow Labs
name: classifierdl_pico_biobert
date: 2020-11-12
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 2.6.2
spark_version: 2.4
tags: [classifier, en, licensed, clinical]
supported: true
annotator: ClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

{:.h2_title}
## Description

Classifies medical text according to the PICO framework.

{:.h2_title}
## Predicted Entities

``CONCLUSIONS``, ``DESIGN_SETTING``, ``INTERVENTION``, ``PARTICIPANTS``, ``FINDINGS``, ``MEASUREMENTS``, ``AIMS``.

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/CLASSIFICATION_PICO/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CLINICAL_CLASSIFICATION.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifierdl_pico_biobert_en_2.6.2_2.4_1601901791781.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifierdl_pico_biobert_en_2.6.2_2.4_1601901791781.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

{:.h2_title}
## How to use

Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, BertEmbeddings (biobert_pubmed_base_cased), SentenceEmbeddings, ClassifierDLModel.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\
    .setInputCols(["document", "token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

classifier = ClassifierDLModel.pretrained("classifierdl_pico_biobert", "en", "clinical/models")\
    .setInputCols(["document", "token", "sentence_embeddings"])\
    .setOutputCol("class")

nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings, sentence_embeddings, classifier])

light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

annotations = light_pipeline.fullAnnotate(["""A total of 10 adult daily smokers who reported at least one stressful event and coping episode and provided post-quit data.""", """When carbamazepine is withdrawn from the combination therapy, aripiprazole dose should then be reduced."""])
```
```scala
...
val embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")
  .setInputCols(Array("document", "token"))
  .setOutputCol("word_embeddings")

val sentence_embeddings = new SentenceEmbeddings()
  .setInputCols(Array("document", "word_embeddings"))
  .setOutputCol("sentence_embeddings")
  .setPoolingStrategy("AVERAGE")

val classifier = ClassifierDLModel.pretrained("classifierdl_pico_biobert", "en", "clinical/models")
  .setInputCols(Array("document", "token", "sentence_embeddings"))
  .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, embeddings, sentence_embeddings, classifier))

val data = Seq("A total of 10 adult daily smokers who reported at least one stressful event and coping episode and provided post-quit data.", "When carbamazepine is withdrawn from the combination therapy, aripiprazole dose should then be reduced.").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.classify.pico").predict("""A total of 10 adult daily smokers who reported at least one stressful event and coping episode and provided post-quit data.""")
```
{:.h2_title}
## Results

A dictionary containing class labels for each sentence.

```bash
| sentences                                            | class        |
|------------------------------------------------------+--------------+
| A total of 10 adult daily smokers who reported at... | PARTICIPANTS |
| When carbamazepine is withdrawn from the combinat... | CONCLUSIONS  |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|classifierdl_pico_biobert|
|Type:|ClassifierDLModel|
|Compatibility:|Healthcare NLP 2.6.2+|
|Edition:|Official|
|License:|Licensed|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|[en]|
|Case sensitive:|True|

{:.h2_title}
## Data Source

Trained on a custom dataset derived from the PICO classification dataset, using ``'biobert_pubmed_base_cased'`` embeddings.

{:.h2_title}
## Benchmarking

```bash
|    |         labels | precision |   recall | f1-score | support |
|---:|---------------:|----------:|---------:|---------:|--------:|
|  0 |           AIMS |    0.9197 |   0.9121 |   0.9159 |    3845 |
|  1 |    CONCLUSIONS |    0.8426 |   0.8571 |   0.8498 |    4241 |
|  2 | DESIGN_SETTING |    0.7703 |   0.8351 |   0.8014 |    5191 |
|  3 |       FINDINGS |    0.9214 |   0.8964 |   0.9088 |    9500 |
|  4 |   INTERVENTION |    0.7529 |   0.6758 |   0.7123 |    2597 |
|  5 |   MEASUREMENTS |    0.8409 |   0.7734 |   0.8058 |    3500 |
|  6 |   PARTICIPANTS |    0.7521 |   0.8548 |   0.8002 |    2396 |
|    |       accuracy |           |          |   0.8476 |   31270 |
|    |      macro avg |    0.8286 |   0.8292 |   0.8277 |   31270 |
|    |   weighted avg |    0.8495 |   0.8476 |   0.8476 |   31270 |
```

---
layout: model
title: Match Chunks in Texts
author: John Snow Labs
name: match_chunks
date: 2022-06-15
tags: [en, open_source]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The pipeline matches noun chunks with a part-of-speech regex of the form `<DT>?<JJ>*<NN>+` (an optional determiner, any number of adjectives, and one or more nouns).

## Predicted Entities

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/match_chunks_en_4.0.0_3.0_1655322760895.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/match_chunks_en_4.0.0_3.0_1655322760895.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline

pipeline_local = PretrainedPipeline('match_chunks')

result = pipeline_local.annotate("David visited the restaurant yesterday with his family. He also visited and the day before, but at that time he was alone. David again visited today with his colleagues. He and his friends really liked the food and hoped to visit again tomorrow.")

result['chunk']
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP

SparkNLP.version()

val testData = spark.createDataFrame(Seq(
  (1, "David visited the restaurant yesterday with his family. He also visited and the day before, but at that time he was alone. David again visited today with his colleagues. He and his friends really liked the food and hoped to visit again tomorrow.")
)).toDF("id", "text")

val pipeline = PretrainedPipeline("match_chunks", lang="en")

val annotation = pipeline.transform(testData)

annotation.show()
```

{:.nlu-block}
```python
import nlu
nlu.load("en.match.chunks").predict("""David visited the restaurant yesterday with his family. He also visited and the day before, but at that time he was alone. David again visited today with his colleagues. He and his friends really liked the food and hoped to visit again tomorrow.""")
```
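Under the hood, the pipeline's `Chunker` stage matches its rule over part-of-speech tags rather than over raw text. As a rough illustration (plain Python, not Spark NLP's implementation; the rule `<DT>?<JJ>*<NN>+` and the tag sequence below are assumptions for the sketch), a rule of that shape can be emulated by encoding the tag sequence as a string and running an ordinary regex over it:

```python
import re

# Illustrative sketch only, not Spark NLP's Chunker implementation.
# Emulates a chunk rule like <DT>?<JJ>*<NN>+ (optional determiner,
# any adjectives, one or more nouns) over a POS-tag sequence.
def chunk_spans(tags):
    encoded = "".join(f"<{t}>" for t in tags)
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN>)+")
    # Map each tag's character offset back to its token index.
    starts, pos = {}, 0
    for i, tag in enumerate(tags):
        starts[pos] = i
        pos += len(tag) + 2  # account for "<" and ">"
    spans = []
    for m in pattern.finditer(encoded):
        first = starts[m.start()]
        n_tokens = m.group().count("<")
        spans.append((first, first + n_tokens))
    return spans

# "the red car stopped today" -> DT JJ NN VBD NN
print(chunk_spans(["DT", "JJ", "NN", "VBD", "NN"]))  # [(0, 3), (4, 5)]
```

Each span is a half-open (start, end) token range; mapping spans back to tokens yields chunks like "the red car", matching the shape of the chunks in the results below.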
## Results

```bash
['the restaurant yesterday', 'family', 'the day', 'that time', 'today', 'the food', 'tomorrow']
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|match_chunks|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|4.2 MB|

## Included Models

- DocumentAssembler
- SentenceDetector
- TokenizerModel
- PerceptronModel
- Chunker

---
layout: model
title: Word Embeddings for Dutch (dutch_cc_300d)
author: John Snow Labs
name: dutch_cc_300d
date: 2021-10-04
tags: [nl, embeddings, open_source]
task: Embeddings
language: nl
edition: Spark NLP 3.3.0
spark_version: 3.0
supported: true
annotator: WordEmbeddingsModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model was trained on the Common Crawl and Wikipedia corpora for Dutch using fastText. It was trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negatives.

The model produces a 300-dimensional vector per token. The output vectors map words into a meaningful space where the distance between the vectors is related to the semantic similarity of the words. These embeddings can be used in multiple tasks like semantic word similarity, named entity recognition, sentiment analysis, and classification.

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/dutch_cc_300d_nl_3.3.0_3.0_1633366113070.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/dutch_cc_300d_nl_3.3.0_3.0_1633366113070.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("dutch_cc_300d", "nl") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])

data = spark.createDataFrame([["De Bijlmerramp is de benaming voor de vliegramp"]]).toDF("text")

pipeline_model = nlp_pipeline.fit(data)
result = pipeline_model.transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence_detector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("dutch_cc_300d", "nl")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))

val data = Seq("De Bijlmerramp is de benaming voor de vliegramp").toDF("text")

val result = pipeline.fit(data).transform(data)
```
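Downstream, the claim that vector distance tracks semantic similarity is usually operationalized as cosine similarity between two word vectors. A minimal sketch in plain Python (the 3-dimensional vectors below are made up for illustration; the real model emits 300-dimensional vectors):

```python
import math

# Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1] for real vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors (NOT real dutch_cc_300d outputs), chosen so that
# "koning" (king) sits closer to "koningin" (queen) than to "fiets" (bike).
koning   = [0.8, 0.1, 0.3]
koningin = [0.7, 0.2, 0.3]
fiets    = [-0.2, 0.9, 0.1]

print(cosine(koning, koningin) > cosine(koning, fiets))  # True
```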
## Results

```bash
| token        | embedding                                                                       |
|:-------------|:--------------------------------------------------------------------------------|
| De           | ['0.0249', '-0.0115', '-0.0748', '-0.0823', '0.0866', '-0.0219', '0.00' ...]    |
| Bijlmerramp  | ['0.0204', '0.0079', '0.0224', '0.0352', '-0.0409', '0.0053', '0.0175', ...]    |
| is           | ['-1.0E-4', '0.1419', '0.053', '-0.0921', '0.07', '0.004', '-0.1683', ...]      |
| de           | ['0.0309', '0.0411', '-0.0077', '-0.0756', '0.0741', '-0.0402', '0.025' ...]    |
| benaming     | ['0.0197', '0.0167', '-0.0051', '0.0198', '0.034', '-0.0086', '-0.009', ...]    |
| voor         | ['0.0642', '-0.0171', '-0.0118', '0.0042', '0.0058', '0.0018', '0.0039' ...]    |
| de           | ['0.0309', '0.0411', '-0.0077', '-0.0756', '0.0741', '-0.0402', '0.025' ...]    |
| vliegramp    | ['0.083', '0.025', '0.0029', '0.0064', '-0.0698', '0.0344', '-0.0305', ...]     |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|dutch_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[word_embeddings]|
|Language:|nl|
|Case sensitive:|false|
|Dimension:|300|

## Data Source

This model was imported from https://fasttext.cc/docs/en/crawl-vectors.html

---
layout: model
title: Detect Assertion Status (DL Large)
author: John Snow Labs
name: assertion_dl_large_en
date: 2020-05-21
task: Assertion Status
language: en
nav_key: models
edition: Healthcare NLP 2.5.0
spark_version: 2.4
tags: [ner, en, clinical, licensed]
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Deep learning model for assertion status detection. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.

## Predicted Entities

``hypothetical``, ``present``, ``absent``, ``possible``, ``conditional``, ``associated_with_someone_else``.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_dl_large_en_2.5.0_2.4_1590022282256.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_dl_large_en_2.5.0_2.4_1590022282256.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use

Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter, AssertionDLModel.
{% include programmingLanguageSelectScalaPython.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") clinical_assertion = AssertionDLModel.pretrained("assertion_dl_large", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion]) model = nlpPipeline.fit(spark.createDataFrame([["Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. 
Father with Alzheimer."]]).toDF("text"))

light_model = LightPipeline(model)
```
```scala
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence_detector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val nerConverter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val clinical_assertion = AssertionDLModel.pretrained("assertion_dl_large", "en", "clinical/models")
  .setInputCols(Array("sentence", "ner_chunk", "embeddings"))
  .setOutputCol("assertion")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, nerConverter, clinical_assertion))

val data = Seq("""Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer.""").toDS().toDF("text")

val result = pipeline.fit(data).transform(data)
```
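Because every annotator writes parallel arrays, the n-th element of `ner_chunk.result` lines up with the n-th element of `assertion.result`, so pairing chunks with their labels is just a zip. A toy sketch in plain Python (the values are copied from this card's example output, not produced by running the pipeline):

```python
# Toy illustration: Spark NLP returns parallel `result` arrays per row,
# so entity chunk i carries assertion label i.
ner_chunk_result = ["severe fever", "stomach pain", "Alzheimer"]
assertion_result = ["present", "absent", "associated_with_someone_else"]

pairs = list(zip(ner_chunk_result, assertion_result))
print(pairs)
```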
{:.h2_title}
## Results

The output is a dataframe with a sentence per row and an ``"assertion"`` column containing all of the assertion labels in the sentence. The assertion column also contains assertion character indices and other metadata. To get only the entity chunks and assertion labels, without the metadata, select ``"ner_chunk.result"`` and ``"assertion.result"`` from your output dataframe.

```bash
    chunks           entities   assertion
0   severe fever     PROBLEM    present
1   sore throat      PROBLEM    present
2   stomach pain     PROBLEM    absent
3   an epidural      TREATMENT  present
4   PCA              TREATMENT  present
5   pain control     PROBLEM    present
6   short of breath  PROBLEM    conditional
7   CT               TEST       present
8   lung tumor       PROBLEM    present
9   Alzheimer        PROBLEM    associated_with_someone_else
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|assertion_dl_large|
|Type:|ner|
|Compatibility:|Spark NLP 2.5.0+|
|Edition:|Official|
|License:|Licensed|
|Input Labels:|[sentence, ner_chunk, embeddings]|
|Output Labels:|[assertion]|
|Language:|[en]|
|Case sensitive:|false|

{:.h2_title}
## Data Source

Trained on the 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text, with 'embeddings_clinical'.
https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/

{:.h2_title}
## Benchmarking

```bash
                              prec  rec   f1
absent                        0.97  0.91  0.94
associated_with_someone_else  0.93  0.87  0.90
conditional                   0.70  0.33  0.44
hypothetical                  0.91  0.82  0.86
possible                      0.81  0.59  0.68
present                       0.93  0.98  0.95
micro avg                     0.93  0.93  0.93
macro avg                     0.87  0.75  0.80
```

---
layout: model
title: Pipeline to Detect Clinical Entities (bert_token_classifier_ner_clinical)
author: John Snow Labs
name: bert_token_classifier_ner_clinical_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, berfortokenclassification, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [bert_token_classifier_ner_clinical](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_clinical_en.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_pipeline_en_3.4.1_3.0_1647888696583.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_pipeline_en_3.4.1_3.0_1647888696583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("bert_token_classifier_ner_clinical_pipeline", "en", "clinical/models") pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . 
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge .") ``` ```scala val pipeline = new PretrainedPipeline("bert_token_classifier_ner_clinical_pipeline", "en", "clinical/models") pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . 
Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . 
She had close follow-up with endocrinology post discharge .") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.clinical_pipeline").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . 
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge .""") ```
## Results

```bash
+-----------------------------+---------+
|chunk                        |ner_label|
+-----------------------------+---------+
|gestational diabetes mellitus|PROBLEM  |
|type two diabetes mellitus   |PROBLEM  |
|T2DM                         |PROBLEM  |
|HTG-induced pancreatitis     |PROBLEM  |
|an acute hepatitis           |PROBLEM  |
|obesity                      |PROBLEM  |
|a body mass index            |TEST     |
|BMI                          |TEST     |
|polyuria                     |PROBLEM  |
|polydipsia                   |PROBLEM  |
|poor appetite                |PROBLEM  |
|vomiting                     |PROBLEM  |
|amoxicillin                  |TREATMENT|
|a respiratory tract infection|PROBLEM  |
|metformin                    |TREATMENT|
|glipizide                    |TREATMENT|
|dapagliflozin                |TREATMENT|
|T2DM                         |PROBLEM  |
|atorvastatin                 |TREATMENT|
|gemfibrozil                  |TREATMENT|
+-----------------------------+---------+
only showing top 20 rows
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_clinical_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|404.8 MB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverter

---
layout: model
title: English BertForQuestionAnswering model (from krinal214)
author: John Snow Labs
name: bert_qa_augmented_Squad_Translated
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `augmented_Squad_Translated` is an English model originally trained by `krinal214`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_augmented_Squad_Translated_en_4.0.0_3.0_1654179259638.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_augmented_Squad_Translated_en_4.0.0_3.0_1654179259638.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_augmented_Squad_Translated","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_augmented_Squad_Translated","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.augmented").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_augmented_Squad_Translated| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/krinal214/augmented_Squad_Translated --- layout: model title: Arabic BertForMaskedLM Base Cased model (from aubmindlab) author: John Snow Labs name: bert_embeddings_base_arabertv02 date: 2022-12-02 tags: [ar, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ar edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-arabertv02` is an Arabic model originally trained by `aubmindlab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabertv02_ar_4.2.4_3.0_1670015827071.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabertv02_ar_4.2.4_3.0_1670015827071.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabertv02","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabertv02","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_arabertv02| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|507.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/aubmindlab/bert-base-arabertv02 - https://github.com/google-research/bert - https://arxiv.org/abs/2003.00104 - https://github.com/WissamAntoun/pydata_khobar_meetup - http://alt.qcri.org/farasa/segmenter.html - https://github.com/google-research/bert/blob/master/multilingual.md - https://github.com/elnagara/HARD-Arabic-Dataset - https://www.aclweb.org/anthology/D15-1299 - https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf - https://github.com/mohamedadaly/LABR - http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp - https://github.com/husseinmozannar/SOQAL - https://github.com/aub-mind/arabert/blob/master/AraBERT/README.md - https://arxiv.org/abs/2003.00104v2 - https://archive.org/details/arwiki-20190201 - https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4 - https://www.aclweb.org/anthology/W19-4619 - https://sites.aub.edu.lb/mindlab/ - https://www.yakshof.com/#/ - https://www.behance.net/rahalhabib - https://www.linkedin.com/in/wissam-antoun-622142b4/ - https://twitter.com/wissam_antoun - https://github.com/WissamAntoun - https://www.linkedin.com/in/fadybaly/ - https://twitter.com/fadybaly - https://github.com/fadybaly --- layout: model title: English BertForQuestionAnswering Cased model (from DaisyMak) author: John Snow Labs name: bert_qa_finetuned_squad_transformerfrozen_testtoken date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering
article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad-transformerfrozen-testtoken` is an English model originally trained by `DaisyMak`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_finetuned_squad_transformerfrozen_testtoken_en_4.0.0_3.0_1657187107539.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_finetuned_squad_transformerfrozen_testtoken_en_4.0.0_3.0_1657187107539.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_finetuned_squad_transformerfrozen_testtoken","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_finetuned_squad_transformerfrozen_testtoken","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_finetuned_squad_transformerfrozen_testtoken| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/DaisyMak/bert-finetuned-squad-transformerfrozen-testtoken --- layout: model title: Arabic Named Entity Recognition (Dialectal Arabic-DA) author: John Snow Labs name: bert_ner_bert_base_arabic_camelbert_da_ner date: 2022-05-04 tags: [bert, ner, token_classification, ar, open_source] task: Named Entity Recognition language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-da-ner` is an Arabic model originally trained by `CAMeL-Lab`. ## Predicted Entities `ORG`, `LOC`, `PERS`, `MISC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_arabic_camelbert_da_ner_ar_3.4.2_3.0_1651630269156.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_arabic_camelbert_da_ner_ar_3.4.2_3.0_1651630269156.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_arabic_camelbert_da_ner","ar") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_arabic_camelbert_da_ner","ar") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("أنا أحب الشرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.ner.arabic_camelbert_da_ner").predict("""أنا أحب الشرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_bert_base_arabic_camelbert_da_ner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ar| |Size:|407.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-da-ner - https://camel.abudhabi.nyu.edu/anercorp/ - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://github.com/CAMeL-Lab/camel_tools --- layout: model title: Part of Speech for Hebrew author: John Snow Labs name: pos_ud_htb date: 2020-12-09 task: Part of Speech Tagging language: he edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [pos, open_source, he] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_htb_he_2.7.0_2.4_1607521333296.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_htb_he_2.7.0_2.4_1607521333296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline after tokenization.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_htb", "he") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate(["ב- 25 לאוגוסט עצר השב\"כ את מוחמד אבו-ג'וייד , אזרח ירדני , שגויס לארגון הפת\"ח והופעל על ידי חיזבאללה"]) ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_htb", "he") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("ב- 25 לאוגוסט עצר השב\"כ את מוחמד אבו-ג'וייד , אזרח ירדני , שגויס לארגון הפת\"ח והופעל על ידי חיזבאללה").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["ב- 25 לאוגוסט עצר השב\"כ את מוחמד אבו-ג'וייד , אזרח ירדני , שגויס לארגון הפת\"ח והופעל על ידי חיזבאללה"] pos_df = nlu.load('he.pos.ud_htb').predict(text, output_level='token') pos_df ```
## Results ```bash {'pos': [Annotation(pos, 0, 0, ADP, {'word': 'ב'}), Annotation(pos, 1, 1, PUNCT, {'word': '-'}), Annotation(pos, 3, 4, NUM, {'word': '25'}), Annotation(pos, 6, 12, VERB, {'word': 'לאוגוסט'}), Annotation(pos, 14, 16, None, {'word': 'עצר'}), Annotation(pos, 18, 22, VERB, {'word': 'השב"כ'}), Annotation(pos, 24, 25, ADP, {'word': 'את'}), Annotation(pos, 27, 31, PROPN, {'word': 'מוחמד'}), Annotation(pos, 33, 42, PROPN, {'word': "אבו-ג'וייד"}), Annotation(pos, 44, 44, PUNCT, {'word': ','}), Annotation(pos, 46, 49, NOUN, {'word': 'אזרח'}), Annotation(pos, 51, 55, ADJ, {'word': 'ירדני'}), Annotation(pos, 57, 57, PUNCT, {'word': ','}), Annotation(pos, 59, 63, VERB, {'word': 'שגויס'}), Annotation(pos, 65, 70, ADP, {'word': 'לארגון'}), Annotation(pos, 72, 76, NOUN, {'word': 'הפת"ח'}), Annotation(pos, 78, 83, PROPN, {'word': 'והופעל'}), Annotation(pos, 85, 86, ADP, {'word': 'על'}), Annotation(pos, 88, 90, NOUN, {'word': 'ידי'}), Annotation(pos, 92, 99, PROPN, {'word': 'חיזבאללה'})]} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_htb| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[tags, document]| |Output Labels:|[pos]| |Language:|he| ## Data Source The model is trained on data obtained from [https://universaldependencies.org](https://universaldependencies.org) ## Benchmarking ```bash | | | precision | recall | f1-score | support | |---:|:-------------|:------------|:---------|-----------:|----------:| | 0 | ADJ | 0.83 | 0.83 | 0.83 | 676 | | 1 | ADP | 0.99 | 0.99 | 0.99 | 1889 | | 2 | ADV | 0.93 | 0.89 | 0.91 | 408 | | 3 | AUX | 0.90 | 0.90 | 0.9 | 229 | | 4 | CCONJ | 0.97 | 0.99 | 0.98 | 434 | | 5 | DET | 0.97 | 0.99 | 0.98 | 1390 | | 6 | NOUN | 0.91 | 0.94 | 0.93 | 3056 | | 7 | NUM | 0.97 | 0.96 | 0.97 | 285 | | 9 | PRON | 0.97 | 0.99 | 0.98 | 443 | | 10 | PROPN | 0.82 | 0.72 | 0.77 | 573 | | 11 | PUNCT | 1.00 | 1.00 | 1 | 1381 | | 12 | SCONJ | 0.99 | 0.90 | 0.94 | 411 | | 13 | VERB | 0.87 | 
0.85 | 0.86 | 1063 | | 14 | X | 1.00 | 0.17 | 0.29 | 6 | | 15 | accuracy | | | 0.95 | 15089 | | 16 | macro avg | 0.94 | 0.87 | 0.89 | 15089 | | 17 | weighted avg | 0.95 | 0.95 | 0.95 | 15089 | ``` --- layout: model title: English asr_wav2vec2_base_timit_demo_colab53_by_hassnain TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab53_by_hassnain date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab53_by_hassnain` is an English model originally trained by hassnain. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_base_timit_demo_colab53_by_hassnain_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab53_by_hassnain_en_4.2.0_3.0_1664022612371.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab53_by_hassnain_en_4.2.0_3.0_1664022612371.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab53_by_hassnain", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab53_by_hassnain", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab53_by_hassnain| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: Legal Third party beneficiaries Clause Binary Classifier author: John Snow Labs name: legclf_third_party_beneficiaries_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `third-party-beneficiaries` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do binary classification at sentence level. If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
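The paragraph splitting (by multiline) mentioned above can be approximated in plain Python before the text reaches the Spark pipeline. This is a minimal sketch; the `split_paragraphs` helper is illustrative and not part of the Spark NLP library:

```python
import re

def split_paragraphs(text: str) -> list:
    # Split a document into paragraphs on blank lines ("multiline" splitting),
    # dropping empty fragments and surrounding whitespace.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

contract = """Section 9. Third Party Beneficiaries.

Nothing in this Agreement shall confer any rights upon any person
other than the parties hereto.

Section 10. Governing Law.

This Agreement shall be governed by the laws of the State of Delaware."""

paragraphs = split_paragraphs(contract)
print(len(paragraphs))  # 4 chunks, each well under the 512-token limit
```

Each resulting paragraph can then become one row of the `clause_text` DataFrame that the classifier consumes.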
## Predicted Entities `other`, `third-party-beneficiaries` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_third_party_beneficiaries_clause_en_1.0.0_3.2_1660123110554.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_third_party_beneficiaries_clause_en_1.0.0_3.2_1660123110554.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_third_party_beneficiaries_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[third-party-beneficiaries]| |[other]| |[other]| |[third-party-beneficiaries]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_third_party_beneficiaries_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scrapped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support no-third-party-beneficiaries 0.96 0.96 0.96 49 other 0.98 0.98 0.98 130 accuracy - - 0.98 179 macro-avg 0.97 0.97 0.97 179 weighted-avg 0.98 0.98 0.98 179 ``` --- layout: model title: Mapping Drug Brand Names with Corresponding National Drug Codes author: John Snow Labs name: drug_brandname_ndc_mapper date: 2022-06-26 tags: [chunk_mapper, ndc, clinical, licensed, en] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps drug brand names to corresponding National Drug Codes (NDC). Product NDCs for each strength are returned in result and metadata. 
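Conceptually, the mapper behaves like a case-insensitive lookup from brand name to strength/NDC metadata (the `.setLowerCase(True)` parameter shown in the usage code). This pure-Python sketch is only an illustration of that lookup idea, not the actual `ChunkMapperModel` implementation; the sample values mirror the Results table:

```python
from typing import Optional

# Illustrative subset of the brand-name -> "strength | product NDC" mapping.
BRANDNAME_TO_NDC = {
    "zytiga": "500 mg/1 | 57894-195",
    "zyvana": "527 mg/1 | 69336-405",
    "zyvox": "600 mg/300mL | 0009-4992",
}

def map_brandname(chunk: str) -> Optional[str]:
    # Lower-case the incoming chunk before matching, mirroring setLowerCase(True).
    return BRANDNAME_TO_NDC.get(chunk.lower())

print(map_brandname("ZYVOX"))  # 600 mg/300mL | 0009-4992
```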
## Predicted Entities `Strength_NDC` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/drug_brandname_ndc_mapper_en_3.5.3_3.0_1656260706121.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/drug_brandname_ndc_mapper_en_3.5.3_3.0_1656260706121.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("chunk") chunkerMapper = ChunkMapperModel.pretrained("drug_brandname_ndc_mapper", "en", "clinical/models")\ .setInputCols(["chunk"])\ .setOutputCol("ndc")\ .setRels(["Strength_NDC"])\ .setLowerCase(True) pipeline = Pipeline().setStages([ document_assembler, chunkerMapper]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) light_pipeline = LightPipeline(model) result = light_pipeline.fullAnnotate(["zytiga", "zyvana", "ZYVOX"]) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("chunk") val chunkerMapper = ChunkMapperModel.pretrained("drug_brandname_ndc_mapper", "en", "clinical/models") .setInputCols(Array("chunk")) .setOutputCol("ndc") .setRels(Array("Strength_NDC")) .setLowerCase(true) val pipeline = new Pipeline().setStages(Array( document_assembler, chunkerMapper)) val sample_data = Seq("zytiga", "zyvana", "ZYVOX").toDS.toDF("text") val result = pipeline.fit(sample_data).transform(sample_data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.drug_brand_to_ndc").predict("""Put your text here.""") ```
## Results ```bash | | Brandname | Strength_NDC | |---:|:------------|:-------------------------| | 0 | zytiga | 500 mg/1 | 57894-195 | | 1 | zyvana | 527 mg/1 | 69336-405 | | 2 | ZYVOX | 600 mg/300mL | 0009-4992 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|drug_brandname_ndc_mapper| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|3.0 MB| --- layout: model title: Fast Neural Machine Translation Model from English to Luba-Katanga author: John Snow Labs name: opus_mt_en_lu date: 2020-12-29 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, lu, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `lu` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_lu_xx_2.7.0_2.4_1609281752399.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_lu_xx_2.7.0_2.4_1609281752399.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_lu", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(["Your text to translate here."]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_lu", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your text to translate here.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.lu').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_lu| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from Amharic to Swedish author: John Snow Labs name: opus_mt_am_sv date: 2021-06-01 tags: [open_source, seq2seq, translation, am, sv, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). source languages: am target languages: sv {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_am_sv_xx_3.1.0_2.4_1622554609708.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_am_sv_xx_3.1.0_2.4_1622554609708.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_am_sv", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_am_sv", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Amharic.translate_to.Swedish').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_am_sv| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English asr_wav2vec2_large_xls_r_thai_test TFWav2Vec2ForCTC from juierror author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_thai_test date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_thai_test` is an English model originally trained by juierror. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_thai_test_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_thai_test_en_4.2.0_3.0_1664024211955.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_thai_test_en_4.2.0_3.0_1664024211955.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_thai_test', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_thai_test", lang = "en") val annotations = pipeline.transform(audioDF) ```
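Before calling `transform`, the `audioDF` must contain raw audio as arrays of floats (Wav2Vec2-family models consume 16 kHz mono float audio). As a framework-free sketch of that preprocessing step, here is how signed 16-bit PCM samples could be normalized into the [-1.0, 1.0] float range; the helper name and the toy sample values are illustrative, not part of the pipeline API:

```python
# Hypothetical helper: convert signed 16-bit PCM samples to floats in [-1.0, 1.0],
# the representation Wav2Vec2-style models consume. Real code would read the
# samples from a WAV file (e.g. with the stdlib `wave` module) resampled to 16 kHz.
def pcm16_to_float(samples):
    return [s / 32768.0 for s in samples]

pcm = [0, 16384, -16384, 32767, -32768]   # toy 16-bit sample values
floats = pcm16_to_float(pcm)
print(floats)
```

The normalized float arrays would then populate the `audio_content` column the bundled AudioAssembler reads.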
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_thai_test| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: ESG Text Classification (3 classes) author: John Snow Labs name: finclf_esg date: 2022-09-06 tags: [en, financial, esg, classification, licensed] task: Text Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model classifies financial texts and news into three classes: Environment, Social, and Governance. It can be used to build an ESG scoreboard for companies. If you are looking for an augmented version of this model, with more fine-grained verticals (Greenhouse Gas Emissions, Business Ethics, etc.), see the finance_sequence_classifier_augmented_esg model in the Models Hub. ## Predicted Entities `Environment`, `Social`, `Governance` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FINCLF_ESG/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_esg_en_1.0.0_3.2_1662472406140.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_esg_en_1.0.0_3.2_1662472406140.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = nlp.Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = finance.BertForSequenceClassification.pretrained("finclf_esg", "en", "finance/models")\ .setInputCols(["document",'token'])\ .setOutputCol("class") pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) # couple of simple examples example = spark.createDataFrame([["""The Canadian Environmental Assessment Agency (CEAA) concluded that in June 2016 the company had not made an effort to protect public drinking water and was ignoring concerns raised by its own scientists about the potential levels of pollutants in the local water supply. At the time, there were concerns that the company was not fully testing onsite wells for contaminants and did not use the proper methods for testing because of its test kits now manufactured in China.A preliminary report by the company in June 2016 was commissioned by the Alberta government to provide recommendations to Alberta Environment officials"""]]).toDF("text") result = pipeline.fit(example).transform(example) # result is a DataFrame result.select("text", "class.result").show() ```
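For intuition on what `class.result` contains: the sequence classifier produces one logit per class, and the predicted label is the argmax after a softmax. A self-contained sketch with invented logit values (the real scores come from the fine-tuned BERT classification head):

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Environment", "Social", "Governance"]
logits = [3.1, 0.4, -0.8]                # invented logits for one input text
probs = softmax(logits)
predicted = labels[probs.index(max(probs))]
print(predicted)
```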
## Results ```bash +--------------------+---------------+ | text| result| +--------------------+---------------+ |The Canadian Envi...|[Environmental]| +--------------------+---------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_esg| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|412.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References In-house annotations from scraped annual reports and tweets about ESG ## Benchmarking ```bash label precision recall f1-score support Environmental 0.99 0.97 0.98 97 Social 0.95 0.96 0.95 162 Governance 0.91 0.90 0.91 71 accuracy - - 0.95 330 macro-avg 0.95 0.94 0.95 330 weighted-avg 0.95 0.95 0.95 330 ``` --- layout: model title: Mapping Entities with Corresponding ICD-9-CM Codes author: John Snow Labs name: icd9_mapper date: 2022-09-30 tags: [icd9cm, chunk_mapping, en, licensed, clinical] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps entities with their corresponding ICD-9-CM codes. ## Predicted Entities `icd9_code` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd9_mapper_en_4.1.0_3.0_1664535522949.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd9_mapper_en_4.1.0_3.0_1664535522949.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('doc') chunk_assembler = Doc2Chunk()\ .setInputCols(['doc'])\ .setOutputCol('ner_chunk') chunkerMapper = ChunkMapperModel\ .pretrained("icd9_mapper", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings")\ .setRels(["icd9_code"]) mapper_pipeline = Pipeline(stages=[ document_assembler, chunk_assembler, chunkerMapper ]) test_data = spark.createDataFrame([["24 completed weeks of gestation"]]).toDF("text") result = mapper_pipeline.fit(test_data).transform(test_data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("doc") val chunk_assembler = new Doc2Chunk() .setInputCols(Array("doc")) .setOutputCol("ner_chunk") val chunkerMapper = ChunkMapperModel .pretrained("icd9_mapper", "en", "clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("mappings") .setRels(Array("icd9_code")) val mapper_pipeline = new Pipeline().setStages(Array( document_assembler, chunk_assembler, chunkerMapper)) val test_data = Seq("24 completed weeks of gestation").toDS.toDF("text") val result = mapper_pipeline.fit(test_data).transform(test_data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.icd9").predict("""24 completed weeks of gestation""")
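Conceptually, the chunk mapper is a lookup table from recognized chunk text to ICD-9-CM codes. A stripped-down, dictionary-based illustration — the single entry mirrors this card's example, while the real `ChunkMapperModel` ships the full code set and handles chunk normalization:

```python
# Toy stand-in for icd9_mapper: chunk text -> ICD-9-CM code.
# One illustrative entry only; the pretrained model covers many codes.
icd9_mappings = {
    "24 completed weeks of gestation": "765.22",
}

def map_chunk(chunk, mappings):
    # Return a sentinel when the chunk has no mapping (illustrative choice).
    return mappings.get(chunk.lower(), "NONE")

code = map_chunk("24 completed weeks of gestation", icd9_mappings)
print(code)
```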
## Results ```bash +-------------------------------+------------+ |chunk |icd9_mapping| +-------------------------------+------------+ |24 completed weeks of gestation|765.22 | +-------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd9_mapper| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|374.4 KB| --- layout: model title: Entity Recognizer LG author: John Snow Labs name: entity_recognizer_lg date: 2022-06-28 tags: [fi, open_source] task: Named Entity Recognition language: fi edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_lg is a pretrained pipeline that performs basic text processing steps and recognizes entities. It handles most of the common text processing tasks on your dataframe. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_fi_4.0.0_3.0_1656386552101.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_fi_4.0.0_3.0_1656386552101.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("entity_recognizer_lg", "fi") result = pipeline.annotate("""I love johnsnowlabs! """) ``` {:.nlu-block} ```python import nlu nlu.load("fi.ner.lg").predict("""I love johnsnowlabs! """) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_lg| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| |Size:|2.5 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - NerDLModel - NerConverter --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18) author: John Snow Labs name: distilbert_qa_base_uncased_becasv2_5 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-becasv2-5` is an English model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becasv2_5_en_4.3.0_3.0_1672767788818.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becasv2_5_en_4.3.0_3.0_1672767788818.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becasv2_5","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becasv2_5","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
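For intuition, extractive question answering heads such as `DistilBertForQuestionAnswering` score each context token as a candidate answer start and end, and the answer is the highest-scoring valid span. A toy sketch over the example context; the per-token scores below are invented for illustration, not model output:

```python
# Invented start/end scores for each context token; a real model derives
# these from the transformer's hidden states.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.2, 0.3, 5.0, 0.1, 0.0, 0.0, 0.1, 0.2, 0.0]
end_scores   = [0.0, 0.1, 0.2, 4.5, 0.2, 0.1, 0.0, 0.1, 0.3, 0.0]

def best_span(starts, ends, max_len=15):
    # Pick the (start, end) pair with the highest summed score, with end >= start
    # and a bounded span length.
    best = (0, 0, float("-inf"))
    for i, s in enumerate(starts):
        for j in range(i, min(i + max_len, len(ends))):
            if s + ends[j] > best[2]:
                best = (i, j, s + ends[j])
    return best[0], best[1]

i, j = best_span(start_scores, end_scores)
answer = " ".join(tokens[i:j + 1])
print(answer)
```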
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_becasv2_5| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Evelyn18/distilbert-base-uncased-becasv2-5 --- layout: model title: English DistilBertForQuestionAnswering model (from ncduy) author: John Snow Labs name: distilbert_qa_base_cased_distilled_squad_finetuned_squad_test date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-finetuned-squad-test` is an English model originally trained by `ncduy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_distilled_squad_finetuned_squad_test_en_4.0.0_3.0_1654723644873.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_distilled_squad_finetuned_squad_test_en_4.0.0_3.0_1654723644873.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_distilled_squad_finetuned_squad_test","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_distilled_squad_finetuned_squad_test","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_cased").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
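As the NLU snippet shows, `nlu.load(...).predict` for question answering takes the question and context joined by a `|||` separator. A tiny helper for building that input string; the helper name is ours, not part of the NLU API:

```python
def to_nlu_qa_input(question, context):
    """Join a question and its context with the '|||' separator that NLU's
    question-answering loaders expect."""
    return f"{question}|||{context}"

qa_input = to_nlu_qa_input("What is my name?",
                           "My name is Clara and I live in Berkeley.")
print(qa_input)
```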
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_cased_distilled_squad_finetuned_squad_test| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ncduy/distilbert-base-cased-distilled-squad-finetuned-squad-test --- layout: model title: Extract Demographic Entities from Oncology Texts author: John Snow Labs name: ner_oncology_demographics_wip date: 2022-09-30 tags: [licensed, clinical, oncology, en, ner, demographics] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts demographic information from oncology texts, including age, gender, and smoking status. Definitions of Predicted Entities: - `Age`: All mentions of age, past or present, related to the patient or to anybody else. - `Gender`: Gender-specific nouns and pronouns (including words such as "him" or "she", and family members such as "father"). - `Race_Ethnicity`: The race and ethnicity categories include racial and national origin or sociocultural groups. - `Smoking_Status`: All mentions of smoking related to the patient or to someone else.
## Predicted Entities `Age`, `Gender`, `Race_Ethnicity`, `Smoking_Status` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_demographics_wip_en_4.0.0_3.0_1664563557899.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_demographics_wip_en_4.0.0_3.0_1664563557899.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token")\ .setSplitChars(['-']) word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_demographics_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The patient is a 40-year-old man with history of heavy smoking."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") .setSplitChars("-") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_demographics_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val 
pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The patient is a 40-year-old man with history of heavy smoking.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_demographics_wip").predict("""The patient is a 40-year-old man with history of heavy smoking.""") ```
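The `NerConverter` stage turns the per-token IOB tags emitted by the NER model into labeled chunks. A simplified sketch of that grouping logic over the example sentence, with tags hand-written to mirror this card's example output:

```python
# Simplified IOB-to-chunk grouping, the job NerConverter performs in the pipeline.
# Tags here are hand-written to mirror the card's example output.
tokens = ["The", "patient", "is", "a", "40-year-old", "man", "with",
          "history", "of", "heavy", "smoking", "."]
tags = ["O", "O", "O", "O", "B-Age", "B-Gender", "O",
        "O", "O", "O", "B-Smoking_Status", "O"]

def iob_to_chunks(tokens, tags):
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = [tok, tag[2:]]          # start a new chunk
        elif tag.startswith("I-") and current:
            current[0] += " " + tok           # extend the open chunk
        else:
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(text, label) for text, label in chunks]

chunks = iob_to_chunks(tokens, tags)
print(chunks)
```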
## Results ```bash | chunk | ner_label | |:------------|:---------------| | 40-year-old | Age | | man | Gender | | smoking | Smoking_Status | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_demographics_wip| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|849.2 KB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Smoking_Status 43.0 12.0 11.0 54.0 0.78 0.80 0.79 Age 679.0 27.0 17.0 696.0 0.96 0.98 0.97 Race_Ethnicity 44.0 7.0 7.0 51.0 0.86 0.86 0.86 Gender 933.0 14.0 8.0 941.0 0.99 0.99 0.99 macro_avg 1699.0 60.0 43.0 1742.0 0.90 0.91 0.90 micro_avg NaN NaN NaN NaN 0.97 0.98 0.97 ``` --- layout: model title: English T5ForConditionalGeneration Base Cased model (from doc2query) author: John Snow Labs name: t5_yahoo_answers_base_v1 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `yahoo_answers-t5-base-v1` is an English model originally trained by `doc2query`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_yahoo_answers_base_v1_en_4.3.0_3.0_1675158667385.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_yahoo_answers_base_v1_en_4.3.0_3.0_1675158667385.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_yahoo_answers_base_v1","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_yahoo_answers_base_v1","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_yahoo_answers_base_v1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|1.0 GB| ## References - https://huggingface.co/doc2query/yahoo_answers-t5-base-v1 - https://arxiv.org/abs/1904.08375 - https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf - https://arxiv.org/abs/2104.08663 - https://github.com/UKPLab/beir - https://www.sbert.net/examples/unsupervised_learning/query_generation/README.html --- layout: model title: English BertForQuestionAnswering model (from srmukundb) author: John Snow Labs name: bert_qa_srmukundb_bert_base_uncased_finetuned_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-finetuned-squad` is an English model originally trained by `srmukundb`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_srmukundb_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654181131257.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_srmukundb_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654181131257.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_srmukundb_bert_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_srmukundb_bert_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased.by_srmukundb").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_srmukundb_bert_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/srmukundb/bert-base-uncased-finetuned-squad --- layout: model title: English DistilBertForTokenClassification Cased model (from ml6team) author: John Snow Labs name: distilbert_token_classifier_keyphrase_extraction_inspec date: 2023-03-06 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyphrase-extraction-distilbert-inspec` is an English model originally trained by `ml6team`. ## Predicted Entities `KEY` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_inspec_en_4.3.1_3.0_1678133894118.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_inspec_en_4.3.1_3.0_1678133894118.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_inspec","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_inspec","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_keyphrase_extraction_inspec| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ml6team/keyphrase-extraction-distilbert-inspec - https://dl.acm.org/doi/10.3115/1119355.1119383 - https://paperswithcode.com/sota?task=Keyphrase+Extraction&dataset=inspec --- layout: model title: Word2Vec Embeddings in Azerbaijani (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, az, open_source] task: Embeddings language: az edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_az_3.4.1_3.0_1647284820457.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_az_3.4.1_3.0_1647284820457.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","az") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Qığılcım nlp sevirəm"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","az") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Qığılcım nlp sevirəm").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("az.embed.w2v_cc_300d").predict("""Qığılcım nlp sevirəm""") ```
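Each token is mapped to a 300-dimensional vector; downstream, word similarity is typically measured with cosine similarity between those vectors. A self-contained sketch on toy 3-d vectors as stand-ins for real 300-d `w2v_cc_300d` embeddings — the words and values are invented for illustration:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 3-d stand-ins for 300-d embedding vectors.
v_cat = [1.0, 0.5, 0.0]
v_dog = [0.9, 0.6, 0.1]
v_car = [-0.5, 0.1, 1.0]

print(cosine(v_cat, v_dog), cosine(v_cat, v_car))
```

With real embeddings, semantically related words score close to 1.0 while unrelated words score near or below 0.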
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|az| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Fast Neural Machine Translation Model from Bemba (Zambia) to Spanish author: John Snow Labs name: opus_mt_bem_es date: 2021-06-01 tags: [open_source, seq2seq, translation, bem, es, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. source languages: bem target languages: es {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bem_es_xx_3.1.0_2.4_1622560268464.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bem_es_xx_3.1.0_2.4_1622560268464.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_bem_es", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) data = spark.createDataFrame([["text to translate"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_bem_es", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Bemba (Zambia).translate_to.Spanish').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_bem_es| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering Mini Uncased model (from ahujaniharika95) author: John Snow Labs name: bert_qa_ahujaniharika95_minilm_uncased_squad2_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `minilm-uncased-squad2-finetuned-squad` is an English model originally trained by `ahujaniharika95`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_ahujaniharika95_minilm_uncased_squad2_finetuned_squad_en_4.0.0_3.0_1657190376033.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_ahujaniharika95_minilm_uncased_squad2_finetuned_squad_en_4.0.0_3.0_1657190376033.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_ahujaniharika95_minilm_uncased_squad2_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_ahujaniharika95_minilm_uncased_squad2_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_ahujaniharika95_minilm_uncased_squad2_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|124.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/ahujaniharika95/minilm-uncased-squad2-finetuned-squad --- layout: model title: Fast Neural Machine Translation Model from English to Hiligaynon author: John Snow Labs name: opus_mt_en_hil date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, hil, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `en` - target languages: `hil` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_hil_xx_2.7.0_2.4_1609168441233.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_hil_xx_2.7.0_2.4_1609168441233.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_hil", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_hil", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.hil').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_hil| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Multilingual (English, German) DistilBertForQuestionAnswering model (from ZYW) author: John Snow Labs name: distilbert_qa_en_de_model date: 2022-06-08 tags: [en, de, open_source, distilbert, question_answering, xx] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `en-de-model` is a multilingual (English, German) model originally trained by `ZYW`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_en_de_model_xx_4.0.0_3.0_1654728267702.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_en_de_model_xx_4.0.0_3.0_1654728267702.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_en_de_model","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_en_de_model","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("xx.answer_question.distil_bert.en_de_tuned.by_ZYW").predict("""PUT YOUR QUESTION HERE|||PUT YOUR CONTEXT HERE""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_en_de_model| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|505.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ZYW/en-de-model --- layout: model title: Legal Registration expenses Clause Binary Classifier author: John Snow Labs name: legclf_registration_expenses_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `registration-expenses` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. If your input is longer, consider splitting it into smaller pieces (see the same tutorial linked above).
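The paragraph-splitting step recommended above can be sketched in plain Python, independently of Spark NLP. This is a minimal sketch, not the workshop's actual code: the blank-line regex and the whitespace word count are illustrative assumptions (the real 512-token limit applies to subword tokens, not whitespace-split words).

```python
import re

def split_paragraphs(text, max_tokens=512):
    """Split a document on blank lines and flag paragraphs that may
    exceed the embedding model's 512-token window (whitespace-split
    words are only a rough proxy for the true subword count)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

# Hypothetical two-clause document
doc = ("Registration expenses shall be borne by the Company.\n\n"
       "This Agreement is governed by the laws of Delaware.")
for clause, fits in split_paragraphs(doc):
    print(fits, "-", clause)
```

Each resulting paragraph can then be fed to the classifier pipeline as a separate row of the `clause_text` column.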
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing a True/False value for each of the clause classifiers you add. ## Predicted Entities `other`, `registration-expenses` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_registration_expenses_clause_en_1.0.0_3.2_1660122893840.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_registration_expenses_clause_en_1.0.0_3.2_1660122893840.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_registration_expenses_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
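When several of these binary clause classifiers are run over the same clause text, their outputs can be folded into one record of True/False flags per clause type. A plain-Python sketch of that merge step follows; the classifier names and labels shown are illustrative, not actual model outputs:

```python
def merge_clause_flags(predictions):
    """Fold per-classifier labels into one {clause_type: bool} record.
    Each binary classifier emits either its own clause label or 'other',
    so a clause type is flagged True only when its classifier fired."""
    return {clause: label == clause for clause, label in predictions.items()}

# Hypothetical outputs of two clause classifiers on one clause text
predictions = {
    "registration-expenses": "registration-expenses",
    "governing-law": "other",
}
print(merge_clause_flags(predictions))
# -> {'registration-expenses': True, 'governing-law': False}
```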
## Results ```bash +-------+ | result| +-------+ |[registration-expenses]| |[other]| |[other]| |[registration-expenses]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_registration_expenses_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 0.99 0.99 86 registration-expenses 0.97 1.00 0.98 30 accuracy - - 0.99 116 macro-avg 0.98 0.99 0.99 116 weighted-avg 0.99 0.99 0.99 116 ``` --- layout: model title: Legal Subscription Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_subscription_agreement date: 2022-11-10 tags: [en, legal, classification, agreement, subscription, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_subscription_agreement` model is a Legal Longformer Document Classifier that determines whether a document belongs to the class `subscription-agreement` or not (Binary Classification). Longformers are limited to 4096 tokens, so only the first 4096 tokens of each document are taken into account. We have found that, for the vast majority of documents in legal corpora, 4096 tokens are enough for Document Classification, provided the documents are clean and contain only the legal text without extra leading material. If that is not the case, let us know and we can take another approach: splitting each document into 4096-token chunks, averaging the chunk embeddings, and training on the averaged representation, so that the whole document is taken into account. In theory, however, this should not be required.
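The chunk-and-average fallback described above can be sketched in plain Python. This is an illustrative sketch only: the 4096-token window matches the Longformer limit, but the `embed` stand-in and its 2-dimensional vectors are assumptions, not the actual embedding API.

```python
def chunks(tokens, size=4096):
    """Yield consecutive windows of at most `size` tokens."""
    for i in range(0, len(tokens), size):
        yield tokens[i:i + size]

def average_vectors(vectors):
    """Element-wise mean of equal-length embedding vectors, so every
    chunk of the document contributes to the final representation."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Stand-in embedder: one 2-d vector per chunk (illustrative only)
embed = lambda window: [float(len(window)), 1.0]

doc_tokens = ["tok"] * 5000          # longer than one 4096-token window
doc_vector = average_vectors([embed(w) for w in chunks(doc_tokens)])
print(doc_vector)
# -> [2500.0, 1.0]
```

A real pipeline would replace `embed` with the Longformer embeddings stage and train the classifier on the averaged vectors.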
## Predicted Entities `subscription-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_subscription_agreement_en_1.0.0_3.0_1668111662925.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_subscription_agreement_en_1.0.0_3.0_1668111662925.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = nlp.ClassifierDLModel.pretrained("legclf_subscription_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[subscription-agreement]| |[other]| |[other]| |[subscription-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_subscription_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Legal documents, scraped from the Internet and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.99 0.98 0.98 85 subscription-agreement 0.94 0.97 0.95 30 accuracy - - 0.97 115 macro-avg 0.96 0.97 0.97 115 weighted-avg 0.97 0.97 0.97 115 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from hjds0923) author: John Snow Labs name: distilbert_qa_hjds0923_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `hjds0923`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_hjds0923_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771214311.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_hjds0923_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771214311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hjds0923_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hjds0923_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_hjds0923_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/hjds0923/distilbert-base-uncased-finetuned-squad --- layout: model title: Lemmatizer (French, SpacyLookup) author: John Snow Labs name: lemma_spacylookup date: 2022-03-03 tags: [open_source, lemmatizer, fr] task: Lemmatization language: fr edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: LemmatizerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This French Lemmatizer is a scalable, production-ready version of the rule-based Lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_fr_3.4.1_3.0_1646316584024.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_fr_3.4.1_3.0_1646316584024.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","fr") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Tu n'es pas mieux que moi"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","fr") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("Tu n'es pas mieux que moi").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fr.lemma.spacylookup").predict("""Tu n'es pas mieux que moi""") ```
## Results ```bash +--------------------------------+ |result | +--------------------------------+ |[Tu, n'es, pas, mieux, que, moi]| +--------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma_spacylookup| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|fr| |Size:|2.4 MB| --- layout: model title: English asr_Part1 TFWav2Vec2ForCTC from zasheza author: John Snow Labs name: asr_Part1 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Part1` is an English model originally trained by zasheza. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use `asr_Part1_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Part1_en_4.2.0_3.0_1664039751422.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Part1_en_4.2.0_3.0_1664039751422.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_Part1", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_Part1", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_Part1| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: English BertForQuestionAnswering model (from juliusco) author: John Snow Labs name: bert_qa_biobert_base_cased_v1.1_squad_finetuned_covdrobert date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobert-base-cased-v1.1-squad-finetuned-covdrobert` is an English model originally trained by `juliusco`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_base_cased_v1.1_squad_finetuned_covdrobert_en_4.0.0_3.0_1654185648282.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_base_cased_v1.1_squad_finetuned_covdrobert_en_4.0.0_3.0_1654185648282.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biobert_base_cased_v1.1_squad_finetuned_covdrobert","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_biobert_base_cased_v1.1_squad_finetuned_covdrobert","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.covid_roberta.base_cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_biobert_base_cased_v1.1_squad_finetuned_covdrobert| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|403.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/juliusco/biobert-base-cased-v1.1-squad-finetuned-covdrobert --- layout: model title: Fast Neural Machine Translation Model from Hungarian to English author: John Snow Labs name: opus_mt_hu_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, hu, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `hu` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_hu_en_xx_2.7.0_2.4_1609168984163.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_hu_en_xx_2.7.0_2.4_1609168984163.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_hu_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_hu_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.hu.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_hu_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: BioBERT Embeddings (Pubmed PMC) author: John Snow Labs name: biobert_pubmed_pmc_base_cased date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains the pre-trained weights of BioBERT, a language representation model for the biomedical domain, designed especially for biomedical text mining tasks such as named entity recognition, relation extraction, and question answering. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/biobert_pubmed_pmc_base_cased_en_2.6.0_2.4_1598343200280.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/biobert_pubmed_pmc_base_cased_en_2.6.0_2.4_1598343200280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = BertEmbeddings.pretrained("biobert_pubmed_pmc_base_cased", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I hate cancer']], ["text"]))
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("biobert_pubmed_pmc_base_cased", "en")
    .setInputCols("sentence", "token")
    .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I hate cancer").toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["I hate cancer"]
embeddings_df = nlu.load('en.embed.biobert.pubmed_pmc_base_cased').predict(text, output_level='token')
embeddings_df
```
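The token embeddings this pipeline emits are plain float vectors, so downstream similarity checks need no Spark NLP at all. A minimal, framework-independent sketch of cosine similarity between two such vectors (the short toy vectors below stand in for BioBERT's 768-dimensional outputs):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for 768-dimensional BioBERT embeddings.
v_hate = [0.17, 0.53, 0.15]
v_cancer = [0.19, 0.06, -0.51]
score = cosine_similarity(v_hate, v_cancer)
```

The same function applies unchanged to the vectors in the `embeddings` column once they are collected to the driver.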
{:.h2_title}
## Results

```bash
	token	en_embed_biobert_pubmed_pmc_base_cased_embeddings
	I	[-0.012962102890014648, 0.27699071168899536, 0...
	hate	[0.1688309609889984, 0.5337603688240051, 0.148...
	cancer	[0.1850549429655075, 0.05875205248594284, -0.5...
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|biobert_pubmed_pmc_base_cased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|[en]|
|Dimension:|768|
|Case sensitive:|true|

{:.h2_title}
## Data Source

The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert)

---
layout: model
title: German Electra Embeddings (from stefan-it)
author: John Snow Labs
name: electra_embeddings_electra_base_gc4_64k_800000_cased_generator
date: 2022-05-17
tags: [de, open_source, electra, embeddings]
task: Embeddings
language: de
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-gc4-64k-800000-cased-generator` is a German model originally trained by `stefan-it`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_800000_cased_generator_de_3.4.4_3.0_1652786505011.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_800000_cased_generator_de_3.4.4_3.0_1652786505011.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_800000_cased_generator","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_800000_cased_generator","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|electra_embeddings_electra_base_gc4_64k_800000_cased_generator|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|de|
|Size:|222.9 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/stefan-it/electra-base-gc4-64k-800000-cased-generator
- https://german-nlp-group.github.io/projects/gc4-corpus.html
- https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf

---
layout: model
title: English Bert Embeddings (from nlp4good)
author: John Snow Labs
name: bert_embeddings_psych_search
date: 2022-04-11
tags: [bert, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `psych-search` is an English model originally trained by `nlp4good`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_psych_search_en_3.4.2_3.0_1649672127720.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_psych_search_en_3.4.2_3.0_1649672127720.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_psych_search","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_psych_search","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.psych_search").predict("""I love Spark NLP""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_psych_search|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|412.8 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/nlp4good/psych-search
- https://meshb.nlm.nih.gov/treeView
- https://meshb.nlm.nih.gov/record/ui?ui=D000072339
- https://meshb.nlm.nih.gov/record/ui?ui=D005006
- http://bioasq.org/

---
layout: model
title: English asr_wav2vec2_large_english TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_english
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_english` is an English model originally trained by jonatasgrosman.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_english_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_english_en_4.2.0_3.0_1664020317451.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_english_en_4.2.0_3.0_1664020317451.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_english', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_english", lang = "en") val annotations = pipeline.transform(audioDF) ```
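The pipeline consumes a DataFrame of raw audio samples. As a sketch of how such a DataFrame might be built, assuming a 16 kHz, 16-bit PCM mono WAV file (Wav2Vec2 models expect 16 kHz input) and the `audio_content` column name used in the non-pipeline ASR cards:

```python
import struct
import wave

def load_wav_as_floats(path):
    """Read a 16-bit PCM mono WAV file into a list of floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2 or wf.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM audio")
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Hypothetical usage with a SparkSession in scope:
# audioDF = spark.createDataFrame([[load_wav_as_floats("sample.wav")]], ["audio_content"])
```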
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_english| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Word2Vec Embeddings in Norwegian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: ["no", open_source] task: Embeddings language: "no" edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_no_3.4.1_3.0_1647448666485.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_no_3.4.1_3.0_1647448666485.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","no") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Jeg elsker gnist nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","no") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Jeg elsker gnist nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("no.embed.w2v_cc_300d").predict("""Jeg elsker gnist nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|no| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Extract Cancer Therapies and Granular Posology Information author: John Snow Labs name: ner_oncology_posology date: 2022-10-25 tags: [licensed, clinical, oncology, en, ner, treatment, posology] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Healthcare 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts cancer therapies (Cancer_Surgery, Radiotherapy and Cancer_Therapy) and posology information at a granular level. Definitions of Predicted Entities: - `Cancer_Surgery`: Terms that indicate surgery as a form of cancer treatment. - `Cancer_Therapy`: Any cancer treatment mentioned in text, excluding surgeries and radiotherapy. - `Cycle_Count`: The total number of cycles being administered of an oncological therapy (e.g. "5 cycles"). - `Cycle_Day`: References to the day of the cycle of oncological therapy (e.g. "day 5"). - `Cycle_Number`: The number of the cycle of an oncological therapy that is being applied (e.g. "third cycle"). - `Dosage`: The quantity prescribed by the physician for an active ingredient. - `Duration`: Words indicating the duration of a treatment (e.g. "for 2 weeks"). - `Frequency`: Words indicating the frequency of treatment administration (e.g. "daily" or "bid"). - `Radiotherapy`: Terms that indicate the use of Radiotherapy. - `Radiation_Dose`: Dose used in radiotherapy. - `Route`: Words indicating the type of administration route (such as "PO" or "transdermal"). 
## Predicted Entities `Cancer_Surgery`, `Cancer_Therapy`, `Cycle_Count`, `Cycle_Day`, `Cycle_Number`, `Dosage`, `Duration`, `Frequency`, `Radiotherapy`, `Radiation_Dose`, `Route` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_posology_en_4.0.0_3.0_1666728701834.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_posology_en_4.0.0_3.0_1666728701834.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_posology", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. 
She is currently receiving her second cycle of chemotherapy and is in good overall condition."]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_oncology_posology", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))

val data = Seq("The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving her second cycle of chemotherapy and is in good overall condition.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.oncology_posology").predict("""The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving her second cycle of chemotherapy and is in good overall condition.""")
```
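The per-label scores in the Benchmarking section of this card follow directly from the tp/fp/fn counts. A quick sketch of the standard formulas, checked against the reported Cycle_Number row:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard NER evaluation metrics from raw true/false positive/negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Cycle_Number row of the benchmark: tp=52, fp=4, fn=45.
p, r, f = precision_recall_f1(52, 4, 45)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.93 0.54 0.68
```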
## Results

```bash
| chunk            | ner_label      |
|:-----------------|:---------------|
| adriamycin       | Cancer_Therapy |
| 60 mg/m2         | Dosage         |
| cyclophosphamide | Cancer_Therapy |
| 600 mg/m2        | Dosage         |
| six courses      | Cycle_Count    |
| second cycle     | Cycle_Number   |
| chemotherapy     | Cancer_Therapy |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_oncology_posology|
|Compatibility:|Spark NLP for Healthcare 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|34.3 MB|
|Dependencies:|embeddings_clinical|

## References

In-house annotated oncology case reports.

## Benchmarking

```bash
         label    tp   fp    fn  total  precision  recall    f1
  Cycle_Number    52    4    45     97       0.93    0.54  0.68
   Cycle_Count   200   63    30    230       0.76    0.87  0.81
  Radiotherapy   255   16    55    310       0.94    0.82  0.88
Cancer_Surgery   592   66   227    819       0.90    0.72  0.80
     Cycle_Day   175   22    73    248       0.89    0.71  0.79
     Frequency   337   44    90    427       0.88    0.79  0.83
         Route    53    1    60    113       0.98    0.47  0.63
Cancer_Therapy  1448   81   250   1698       0.95    0.85  0.90
      Duration   525  154   236    761       0.77    0.69  0.73
        Dosage   858   79   202   1060       0.92    0.81  0.86
Radiation_Dose    86    4    40    126       0.96    0.68  0.80
     macro_avg  4581  534  1308   5889       0.90    0.72  0.79
     micro_avg  4581  534  1308   5889       0.90    0.78  0.83
```

---
layout: model
title: English asr_english_filipino_wav2vec2_l_xls_r_test_09 TFWav2Vec2ForCTC from Khalsuu
author: John Snow Labs
name: asr_english_filipino_wav2vec2_l_xls_r_test_09
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_english_filipino_wav2vec2_l_xls_r_test_09` is an English model originally trained by Khalsuu.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_english_filipino_wav2vec2_l_xls_r_test_09_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_english_filipino_wav2vec2_l_xls_r_test_09_en_4.2.0_3.0_1664119314205.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_english_filipino_wav2vec2_l_xls_r_test_09_en_4.2.0_3.0_1664119314205.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_english_filipino_wav2vec2_l_xls_r_test_09", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_english_filipino_wav2vec2_l_xls_r_test_09", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_english_filipino_wav2vec2_l_xls_r_test_09| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Detect Sentences in Healthcare Texts author: John Snow Labs name: sentence_detector_dl_healthcare date: 2021-08-11 tags: [licensed, clinical, en, sentence_detection] task: Sentence Detection language: en nav_key: models edition: Healthcare NLP 3.2.0 spark_version: 3.0 supported: true annotator: SentenceDetectorDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation. ## Predicted Entities Breaks text in sentences. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/SENTENCE_DETECTOR_HC/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/20.SentenceDetectorDL_Healthcare.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sentence_detector_dl_healthcare_en_3.2.0_3.0_1628678815210.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sentence_detector_dl_healthcare_en_3.2.0_3.0_1628678815210.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl_healthcare","en","clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentences") text = """He was given boluses of MS04 with some effect, he has since been placed on a PCA - he take 80mg of oxycontin at home, his PCA dose is ~ 2 the morphine dose of the oxycontin, he has also received ativan for anxiety.Repleted with 20 meq kcl po, 30 mmol K-phos iv and 2 gms mag so4 iv. Size: Prostate gland measures 10x1.1x 4.9 cm (LS x AP x TS). Estimated volume is 51.9 ml. , and is mildly enlarged in size.Normal delineation pattern of the prostate gland is preserved. """ sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL])) result = sd_model.fullAnnotate(text) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val pipeline = new Pipeline().setStages(Array(documenter, model)) val data = Seq("""He was given boluses of MS04 with some effect, he has since been placed on a PCA - he take 80mg of oxycontin at home, his PCA dose is ~ 2 the morphine dose of the oxycontin, he has also received ativan for anxiety.Repleted with 20 meq kcl po, 30 mmol K-phos iv and 2 gms mag so4 iv. Size: Prostate gland measures 10x1.1x 4.9 cm (LS x AP x TS). Estimated volume is 51.9 ml. 
, and is mildly enlarged in size.Normal delineation pattern of the prostate gland is preserved.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.detect_sentence.clinical").predict("""He was given boluses of MS04 with some effect, he has since been placed on a PCA - he take 80mg of oxycontin at home, his PCA dose is ~ 2 the morphine dose of the oxycontin, he has also received ativan for anxiety.Repleted with 20 meq kcl po, 30 mmol K-phos iv and 2 gms mag so4 iv. Size: Prostate gland measures 10x1.1x 4.9 cm (LS x AP x TS). Estimated volume is 51.9 ml. , and is mildly enlarged in size.Normal delineation pattern of the prostate gland is preserved. """) ```
## Results ```bash | | sentences | |---:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | 0 | He was given boluses of MS04 with some effect, he has since been placed on a PCA - he take 80mg of oxycontin at home, his PCA dose is ~ 2 the morphine dose of the oxycontin, he has also received ativan for anxiety. | | 1 | Repleted with 20 meq kcl po, 30 mmol K-phos iv and 2 gms mag so4 iv. | | 2 | Size: Prostate gland measures 10x1.1x 4.9 cm (LS x AP x TS). | | 3 | Estimated volume is | | | 51.9 ml. , and is mildly enlarged in size. | | 4 | Normal delineation pattern of the prostate gland is preserved. | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentence_detector_dl_healthcare| |Compatibility:|Healthcare NLP 3.2.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[sentences]| |Language:|en| ## Data Source Healthcare SDDL model is trained on in-house domain specific data. ## Benchmarking ```bash label Accuracy Recall Prec F1 0 0.98 1.00 0.96 0.98 ``` --- layout: model title: Fast Neural Machine Translation Model from Haitian Creole to English author: John Snow Labs name: opus_mt_ht_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ht, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.

- source languages: `ht`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ht_en_xx_2.7.0_2.4_1609166270703.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ht_en_xx_2.7.0_2.4_1609166270703.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_ht_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_ht_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.ht.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_ht_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Legal Sales Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_sales_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, sales, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

This model is a Binary Classifier (True, False) for the `Sales` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`Sales`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_sales_bert_en_1.0.0_3.0_1678049898342.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_sales_bert_en_1.0.0_3.0_1678049898342.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_sales_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
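The description above recommends paragraph splitting before classification rather than sentence splitting. A minimal sketch of that preprocessing step, assuming paragraphs in the raw contract text are separated by blank lines:

```python
import re

def split_paragraphs(text):
    """Split a document into paragraphs on runs of blank lines."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

contract = ("Seller agrees to deliver the goods.\n\n"
            "Buyer shall pay within 30 days.\n\n\n"
            "This Agreement is governed by Delaware law.")
paragraphs = split_paragraphs(contract)
# Each paragraph then becomes one row of the DataFrame fed to the classifier.
```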
## Results ```bash +-------+ |result| +-------+ |[Sales]| |[Other]| |[Other]| |[Sales]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_sales_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.94 0.89 0.91 54 Sales 0.84 0.91 0.88 35 accuracy - - 0.90 89 macro-avg 0.89 0.90 0.90 89 weighted-avg 0.90 0.90 0.90 89 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_recipe_triplet_recipes_base_easy_squadv2_epochs_3 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `recipe_triplet_recipes-roberta-base_EASY_squadv2_epochs_3` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_recipe_triplet_recipes_base_easy_squadv2_epochs_3_en_4.3.0_3.0_1674212163630.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_recipe_triplet_recipes_base_easy_squadv2_epochs_3_en_4.3.0_3.0_1674212163630.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_recipe_triplet_recipes_base_easy_squadv2_epochs_3","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_recipe_triplet_recipes_base_easy_squadv2_epochs_3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_recipe_triplet_recipes_base_easy_squadv2_epochs_3| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|467.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/recipe_triplet_recipes-roberta-base_EASY_squadv2_epochs_3 --- layout: model title: English asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent TFWav2Vec2ForCTC from creynier author: John Snow Labs name: pipeline_asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent` is an English model originally trained by creynier. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent_en_4.2.0_3.0_1664042475596.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent_en_4.2.0_3.0_1664042475596.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent", lang = "en") val annotations = pipeline.transform(audioDF) ```
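The `audioDF` passed to `pipeline.transform` is assumed here to carry one row per utterance with a column of raw float samples (the non-pipeline variant of this model uses an `audio_content` column). A minimal, standard-library-only sketch of turning a 16-bit PCM wav file into such a float array:

```python
import array
import wave

def wav_to_floats(path: str) -> list:
    """Read a 16-bit PCM wav file and normalise samples to [-1.0, 1.0]."""
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "this sketch expects 16-bit PCM"
        raw = wav.readframes(wav.getnframes())
    samples = array.array("h", raw)  # signed 16-bit integers
    return [s / 32768.0 for s in samples]

# Hypothetical usage with a Spark session (column name is an assumption):
# audioDF = spark.createDataFrame([(wav_to_floats("sample.wav"),)], ["audio_content"])
```

In practice you would also resample the audio to the rate the model was trained on (commonly 16 kHz for wav2vec2 models) before building the DataFrame.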
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|349.4 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Chinese Bert Embeddings (Base, Plus, Wobert model) author: John Snow Labs name: bert_embeddings_wobert_chinese_plus_base date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `wobert_chinese_plus_base` is a Chinese model originally trained by `junnyu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_wobert_chinese_plus_base_zh_3.4.2_3.0_1649669510103.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_wobert_chinese_plus_base_zh_3.4.2_3.0_1649669510103.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_wobert_chinese_plus_base","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_wobert_chinese_plus_base","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.wobert_chinese_plus_base").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_wobert_chinese_plus_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|467.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/junnyu/wobert_chinese_plus_base - https://github.com/ZhuiyiTechnology/WoBERT - https://github.com/JunnYu/WoBERT_pytorch --- layout: model title: Smaller BERT Embeddings (L-8_H-128_A-2) author: John Snow Labs name: small_bert_L8_128 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L8_128_en_2.6.0_2.4_1598344352001.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L8_128_en_2.6.0_2.4_1598344352001.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L8_128", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L8_128", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L8_128').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash token en_embed_bert_small_L8_128_embeddings I [1.8417736291885376, 0.29461684823036194, -0.3... love [2.903827428817749, 0.6693897247314453, -0.338... NLP [1.8207342624664307, 0.1299048662185669, -1.94... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|small_bert_L8_128| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|128| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/1](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/1) --- layout: model title: Lemmatization from BSC/projecte_aina lookups author: cayorodriguez name: lemmatizer_bsc date: 2022-07-07 tags: [ca, open_source] task: Lemmatization language: ca edition: Spark NLP 3.4.4 spark_version: 3.0 supported: false recommended: true annotator: LemmatizerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Lemmatizer using lookup tables from `BSC/projecte_aina` sources. This Lemmatizer should work with specific tokenization rules included in the Python usage section. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/cayorodriguez/lemmatizer_bsc_ca_3.4.4_3.0_1657199421685.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/cayorodriguez/lemmatizer_bsc_ca_3.4.4_3.0_1657199421685.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") ex_list = ["aprox\.","pàg\.","p\.ex\.","gen\.","feb\.","abr\.","jul\.","set\.","oct\.","nov\.","dec\.","dr\.","dra\.","sr\.","sra\.","srta\.","núm\.","st\.","sta\.","pl\.","etc\.", "ex\."] ex_list_all = [] ex_list_all.extend(ex_list) ex_list_all.extend([x[0].upper() + x[1:] for x in ex_list]) ex_list_all.extend([x.upper() for x in ex_list]) tokenizer = Tokenizer() \ .setInputCols(['document']).setOutputCol('token')\ .setInfixPatterns(["(d|D)(els)","(d|D)(el)","(a|A)(ls)","(a|A)(l)","(p|P)(els)","(p|P)(el)",\ "([A-zÀ-ú_@]+)(-[A-zÀ-ú_@]+)",\ "(d'|D')([·A-zÀ-ú@_]+)(\.|\"|;|:|!|\?|\-|\(|\)|”|“|'|,)+","(l'|L')([·A-zÀ-ú_]+)(\.|\"|;|:|!|\?|\-|\(|\)|”|“|'|,)+", \ "(l'|l'|s'|s'|d'|d'|m'|m'|n'|n'|D'|D'|L'|L'|S'|S'|N'|N'|M'|M')([A-zÀ-ú_]+)",\ """([A-zÀ-ú·]+)(\.|,|\)|\?|!|;|\:|\"|”)(\.|,|\)|\?|!|;|\:|\"|”)+""",\ "([A-zÀ-ú·]+)('l|'ns|'t|'m|'n|-les|-la|-lo|-li|-los|-me|-nos|-te|-vos|-se|-hi|-ne|-ho)(\.|,|;|:|\?|,)+",\ "([A-zÀ-ú·]+)('l|'ns|'t|'m|'n|-les|-la|-lo|-li|-los|-me|-nos|-te|-vos|-se|-hi|-ne|-ho)",\ "(\.|\"|;|:|!|\?|\-|\(|\)|”|“|')+([0-9A-zÀ-ú_]+)",\ "([0-9A-zÀ-ú·]+)(\.|\"|;|:|!|\?|\(|\)|”|“|'|,|%)",\ "(\.|\"|;|:|!|\?|\-|\(|\)|”|“|,)+([0-9]+)(\.|\"|;|:|!|\?|\-|\(|\)|”|“|,)+",\ "(d'|D'|l'|L')([·A-zÀ-ú@_]+)('l|'ns|'t|'m|'n|-les|-la|-lo|-li|-los|-me|-nos|-te|-vos|-se|-hi|-ne|-ho)(\.|\"|;|:|!|\?|\-|\(|\)|”|“|,)", \ "([\.|\"|;|:|!|\?|\-|\(|\)|”|“|,]+)([\.|\"|;|:|!|\?|\-|\(|\)|”|“|,]+)"]) \ .setExceptions(ex_list_all) lemmatizer = LemmatizerModel.pretrained("lemmatizer_bsc","ca") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Bons dies, al mati"]], ["text"]) results = pipeline.fit(example).transform(example) ```
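To see what these infix patterns are doing, here is a simplified version of the elided-article pattern applied with plain `re` (illustration only; Spark NLP compiles and applies the full pattern list internally):

```python
import re

# Simplified from the infix list above: split Catalan elided articles and
# prepositions such as "l'" and "d'" off the following word.
contraction = re.compile(r"(d'|D'|l'|L')(\w+)")

match = contraction.match("l'home")
print(match.groups())  # the token is split into ("l'", "home")
```

Each capture group becomes a separate token, so the lemmatizer sees `l'` and `home` individually instead of the fused form.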
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemmatizer_bsc| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Community| |Input Labels:|[form]| |Output Labels:|[lemma]| |Language:|ca| |Size:|7.3 MB| --- layout: model title: English asr_Dansk_wav2vec21 TFWav2Vec2ForCTC from Siyam author: John Snow Labs name: pipeline_asr_Dansk_wav2vec21 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Dansk_wav2vec21` is an English model originally trained by Siyam. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_Dansk_wav2vec21_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_Dansk_wav2vec21_en_4.2.0_3.0_1664118614170.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_Dansk_wav2vec21_en_4.2.0_3.0_1664118614170.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_Dansk_wav2vec21', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_Dansk_wav2vec21", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_Dansk_wav2vec21| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English RoBERTa Embeddings (Large, Wikipedia and Bookcorpus datasets) author: John Snow Labs name: roberta_embeddings_muppet_roberta_large date: 2022-04-14 tags: [roberta, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `muppet-roberta-large` is an English model originally trained by `facebook`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_muppet_roberta_large_en_3.4.2_3.0_1649946679876.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_muppet_roberta_large_en_3.4.2_3.0_1649946679876.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_muppet_roberta_large","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_muppet_roberta_large","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.muppet_roberta_large").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_muppet_roberta_large| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|849.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/facebook/muppet-roberta-large - https://arxiv.org/abs/2101.11038 --- layout: model title: Portuguese Named Entity Recognition (from m-lin20) author: John Snow Labs name: bert_ner_satellite_instrument_bert_NER date: 2022-05-09 tags: [bert, ner, token_classification, pt, open_source] task: Named Entity Recognition language: pt edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `satellite-instrument-bert-NER` is a Portuguese model originally trained by `m-lin20`. ## Predicted Entities `satellite`, `instrument` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_satellite_instrument_bert_NER_pt_3.4.2_3.0_1652098534939.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_satellite_instrument_bert_NER_pt_3.4.2_3.0_1652098534939.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_satellite_instrument_bert_NER","pt") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Eu amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_satellite_instrument_bert_NER","pt") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Eu amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
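The token classifier emits one IOB tag per token; recovering full `satellite`/`instrument` spans means merging consecutive B-/I- tags. A minimal sketch of that post-processing in plain Python (the token and tag values are illustrative, following the standard IOB convention, not actual output from this model):

```python
def merge_iob(tokens, tags):
    """Merge IOB-tagged tokens into (entity_text, label) spans."""
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the previous entity before starting a new one
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)  # continuation of the open entity
        else:
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:  # flush an entity that runs to the end of the sentence
        entities.append((" ".join(current), label))
    return entities

tokens = ["The", "MODIS", "sensor", "on", "Terra"]
tags = ["O", "B-instrument", "O", "O", "B-satellite"]
print(merge_iob(tokens, tags))  # [('MODIS', 'instrument'), ('Terra', 'satellite')]
```

In a real pipeline you would zip the `token` results with the `ner` results per row; Spark NLP also ships an `NerConverter`-style annotator for exactly this step.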
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_satellite_instrument_bert_NER| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|pt| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/m-lin20/satellite-instrument-bert-NER - https://github.com/Tsinghua-mLin/satellite-instrument-NER --- layout: model title: German T5ForConditionalGeneration Small Cased model (from aiassociates) author: John Snow Labs name: t5_small_grammar_correction date: 2023-01-31 tags: [de, open_source, t5, tensorflow] task: Text Generation language: de edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-grammar-correction-german` is a German model originally trained by `aiassociates`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_grammar_correction_de_4.3.0_3.0_1675126287089.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_grammar_correction_de_4.3.0_3.0_1675126287089.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_grammar_correction","de") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_grammar_correction","de") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_grammar_correction| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|de| |Size:|288.7 MB| ## References - https://huggingface.co/aiassociates/t5-small-grammar-correction-german - https://github.com/EricFillion/happy-transformer - https://www.ai.associates/ - https://www.linkedin.com/company/ai-associates --- layout: model title: Sentence Embeddings - sbert tiny (tuned) author: John Snow Labs name: sbert_jsl_tiny_umls_uncased date: 2021-06-30 tags: [embeddings, clinical, licensed, en] task: Embeddings language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 2.4 supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained to generate contextual sentence embeddings of input sentences. {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_tiny_umls_uncased_en_3.1.0_2.4_1625050224767.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_tiny_umls_uncased_en_3.1.0_2.4_1625050224767.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sbiobert_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_tiny_umls_uncased","en","clinical/models").setInputCols(["sentence"]).setOutputCol("sbert_embeddings") ``` ```scala val sbiobert_embeddings = BertSentenceEmbeddings .pretrained("sbert_jsl_tiny_umls_uncased","en","clinical/models") .setInputCols(Array("sentence")) .setOutputCol("sbert_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed_sentence.bert.jsl_tiny_umls_uncased").predict("""Put your text here.""") ```
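Sentence embeddings like these are typically compared with cosine similarity (the same measure as the STS(cos) benchmark on this card). A minimal sketch with toy 3-dimensional vectors standing in for the model's output column:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for sbert_embeddings rows; a UMLS-tuned model
# should place synonymous clinical phrases close together.
fever = [0.8, 0.1, 0.2]
pyrexia = [0.7, 0.2, 0.3]
fracture = [0.1, 0.9, 0.1]

print(cosine_similarity(fever, pyrexia) > cosine_similarity(fever, fracture))  # True
```

This is the comparison entity resolvers perform at scale when matching an input mention against embedded terminology entries.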
## Results ```bash Gives a 768 dimensional vector representation of the sentence. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbert_jsl_tiny_umls_uncased| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Case sensitive:|false| ## Data Source Tuned on MedNLI and UMLS dataset ## Benchmarking ```bash MedNLI Score Acc 0.616 STS(cos) 0.632 ``` --- layout: model title: Spanish RobertaForQuestionAnswering (from hackathon-pln-es) author: John Snow Labs name: roberta_qa_roberta_base_bne_squad2_hackathon_pln date: 2022-06-21 tags: [es, open_source, question_answering, roberta] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bne-squad2-es` is a Spanish model originally trained by `hackathon-pln-es`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_bne_squad2_hackathon_pln_es_4.0.0_3.0_1655790288763.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_bne_squad2_hackathon_pln_es_4.0.0_3.0_1655790288763.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_bne_squad2_hackathon_pln","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_bne_squad2_hackathon_pln","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.squadv2.roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
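Under the hood, extractive QA models of this kind score every (start, end) token pair in the context and return the highest-scoring valid span. A schematic version of that selection step in plain Python, with made-up logits (this is the standard SQuAD-style decoding, not Spark NLP internals):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick the (start, end) pair maximising start+end score, with end >= start."""
    best = (0, 0)
    best_score = float("-inf")
    for start, s_logit in enumerate(start_logits):
        # Only consider spans of bounded length starting at `start`.
        for end in range(start, min(start + max_len, len(end_logits))):
            score = s_logit + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best

# Made-up logits over 6 context tokens, peaking on the answer token.
start_logits = [0.1, 0.2, 0.1, 4.0, 0.3, 0.1]
end_logits = [0.1, 0.1, 0.2, 3.5, 0.2, 0.1]
print(best_span(start_logits, end_logits))  # (3, 3)
```

The annotator performs this decoding internally and surfaces only the answer text in the `answer` column.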
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_bne_squad2_hackathon_pln| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|es| |Size:|456.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/hackathon-pln-es/roberta-base-bne-squad2-es --- layout: model title: English DistilBertForQuestionAnswering Cased model (from aszidon) author: John Snow Labs name: distilbert_qa_custom date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbertcustom` is an English model originally trained by `aszidon`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom_en_4.3.0_3.0_1672774581586.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom_en_4.3.0_3.0_1672774581586.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_custom| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/aszidon/distilbertcustom --- layout: model title: English image_classifier_vit_croupier_creature_classifier ViTForImageClassification from alkzar90 author: John Snow Labs name: image_classifier_vit_croupier_creature_classifier date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_croupier_creature_classifier` is an English model originally trained by alkzar90. ## Predicted Entities `elf`, `goblin`, `knight`, `zombie` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_croupier_creature_classifier_en_4.1.0_3.0_1660171498624.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_croupier_creature_classifier_en_4.1.0_3.0_1660171498624.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_croupier_creature_classifier", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_croupier_creature_classifier", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_croupier_creature_classifier| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Entity Resolver for Human Phenotype Ontology author: John Snow Labs name: sbiobertresolve_HPO date: 2021-05-05 tags: [en, licensed, clinical, entity_resolution] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.2 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps phenotypic abnormalities encountered in human diseases to Human Phenotype Ontology (HPO) codes. ## Predicted Entities This model returns Human Phenotype Ontology (HPO) codes for phenotypic abnormalities encountered in human diseases. It also returns associated codes from the following vocabularies for each HPO code: - MeSH (Medical Subject Headings) - SNOMED - UMLS (Unified Medical Language System ) - ORPHA (international reference resource for information on rare diseases and orphan drugs) - OMIM (Online Mendelian Inheritance in Man) {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_HPO/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_HPO_en_3.0.2_3.0_1620235451661.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_HPO_en_3.0.2_3.0_1620235451661.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use 
The ```sbiobertresolve_HPO``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_human_phenotype_gene_clinical``` as the NER model. There is no need to call ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models").setInputCols(["document"]).setOutputCol("sentence") tokens = Tokenizer().setInputCols(["sentence"]).setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols(["sentence", "token"]).setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_human_phenotype_gene_clinical", "en", "clinical/models").setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner") ner_converter = NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk") chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli",'en','clinical/models')\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_HPO", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") pipeline = Pipeline(stages = [document_assembler, sentence_detector, tokens, embeddings, ner, ner_converter, chunk2doc, sbert_embedder, resolver]) model = LightPipeline(pipeline.fit(spark.createDataFrame([['']], ["text"]))) text="""These disorders include cancer, bipolar disorder, schizophrenia, autism, Cri-du-chat syndrome, myopia, cortical cataract-linked Alzheimer's disease, and infectious diseases""" results = model.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.HPO").predict("""These disorders include cancer, bipolar disorder, schizophrenia, autism, Cri-du-chat syndrome, myopia, cortical cataract-linked Alzheimer's disease, and infectious diseases""") ```
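Each resolution returned by this model carries its cross-vocabulary codes as a single pipe-delimited string (the `aux_codes` column in the results below follows the pattern `VOCAB:code1,code2||VOCAB:...`). A minimal plain-Python sketch for turning such a string into a dictionary; the function name is illustrative and not part of Spark NLP:

```python
def parse_aux_codes(aux: str) -> dict:
    """Split a pipe-delimited aux_codes string, e.g.
    'MSH:D009369||SNOMED:108369006,363346000', into {vocabulary: [codes]}."""
    mapping = {}
    for field in aux.split("||"):
        vocab, _, codes = field.partition(":")
        mapping[vocab] = codes.split(",")
    return mapping

codes = parse_aux_codes("MSH:D009369||SNOMED:108369006,363346000||UMLS:C0006826,C0027651||ORPHA:1775")
print(codes["SNOMED"])  # ['108369006', '363346000']
```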
## Results ```bash | | chunk | entity | resolution | aux_codes | |---:|:-----------------|:---------|:-------------|:-----------------------------------------------------------------------------| | 0 | cancer | HP | HP:0002664 | MSH:D009369||SNOMED:108369006,363346000||UMLS:C0006826,C0027651||ORPHA:1775 | | 1 | bipolar disorder | HP | HP:0007302 | MSH:D001714||SNOMED:13746004||UMLS:C0005586||ORPHA:370079 | | 2 | schizophrenia | HP | HP:0100753 | MSH:D012559||SNOMED:191526005,58214004||UMLS:C0036341||ORPHA:231169 | | 3 | autism | HP | HP:0000717 | MSH:D001321||SNOMED:408856003,408857007,43614003||UMLS:C0004352||ORPHA:79279 | | 4 | myopia | HP | HP:0000545 | MSH:D009216||SNOMED:57190000||UMLS:C0027092||ORPHA:370022 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_HPO| |Compatibility:|Healthcare NLP 3.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[hpo_code]| |Language:|en| --- layout: model title: Legal Exclusivity Clause Binary Classifier author: John Snow Labs name: legclf_exclusivity_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `exclusivity` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. If your input is longer, consider splitting it into smaller pieces (see the tutorial linked above). This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing a True/False value for each of the legal clause models you have added. ## Predicted Entities `other`, `exclusivity` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_exclusivity_clause_en_1.0.0_3.2_1660122411802.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_exclusivity_clause_en_1.0.0_3.2_1660122411802.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_exclusivity_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
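The paragraph-splitting strategy recommended above can be sketched in a few lines of plain Python before the chunks are passed to the classifier; this is an illustration of the idea, not the splitting utilities from the Legal NLP library:

```python
import re

def split_paragraphs(document: str) -> list:
    """Split a long document on blank lines (two or more consecutive
    newlines) and drop empty fragments, so each chunk stays well under
    the 512-token limit of the sentence embeddings."""
    return [p.strip() for p in re.split(r"\n{2,}", document) if p.strip()]

contract = "1. Exclusivity. The Supplier shall...\n\n2. Term. This Agreement...\n\n\n3. Governing Law."
print(split_paragraphs(contract))
```

Each resulting chunk can then be placed in the `clause_text` column of the DataFrame shown above.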
## Results ```bash +-------+ | result| +-------+ |[exclusivity]| |[other]| |[other]| |[exclusivity]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_exclusivity_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support exclusivity 0.79 0.61 0.69 36 other 0.86 0.93 0.90 92 accuracy - - 0.84 128 macro-avg 0.82 0.77 0.79 128 weighted-avg 0.84 0.84 0.84 128 ``` --- layout: model title: English BertForQuestionAnswering model (from batterydata) author: John Snow Labs name: bert_qa_batterydata_bert_base_uncased_squad_v1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squad-v1` is an English model originally trained by `batterydata`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_batterydata_bert_base_uncased_squad_v1_en_4.0.0_3.0_1654181357717.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_batterydata_bert_base_uncased_squad_v1_en_4.0.0_3.0_1654181357717.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_batterydata_bert_base_uncased_squad_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_batterydata_bert_base_uncased_squad_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad_battery.bert.base_uncased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_batterydata_bert_base_uncased_squad_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/batterydata/bert-base-uncased-squad-v1 - https://github.com/ShuHuang/batterybert --- layout: model title: English asr_wav2vec2_xls_r_300m_kh TFWav2Vec2ForCTC from kongkeaouch author: John Snow Labs name: asr_wav2vec2_xls_r_300m_kh date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_kh` is an English model originally trained by kongkeaouch. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_xls_r_300m_kh_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_kh_en_4.2.0_3.0_1664025079738.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_kh_en_4.2.0_3.0_1664025079738.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xls_r_300m_kh", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xls_r_300m_kh", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xls_r_300m_kh| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Legal Representations Clause Binary Classifier author: John Snow Labs name: legclf_representations_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `representations` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. If your input is longer, consider splitting it into smaller pieces (see the tutorial linked above). This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing a True/False value for each of the legal clause models you have added.
## Predicted Entities `other`, `representations` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_representations_clause_en_1.0.0_3.2_1660122946365.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_representations_clause_en_1.0.0_3.2_1660122946365.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_representations_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[representations]| |[other]| |[other]| |[representations]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_representations_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.2 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.98 0.98 0.98 645 representations 0.95 0.94 0.95 212 accuracy - - 0.97 857 macro-avg 0.97 0.96 0.96 857 weighted-avg 0.97 0.97 0.97 857 ``` --- layout: model title: Norwegian BertForTokenClassification Base Cased model (from Kushtrim) author: John Snow Labs name: bert_token_classifier_base_multilingual_cased_finetuned_norsk_ner date: 2022-11-30 tags: ["no", open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: "no" edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased-finetuned-norsk-ner` is a Norwegian model originally trained by `Kushtrim`. ## Predicted Entities `MISC`, `LOC`, `ORG`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_base_multilingual_cased_finetuned_norsk_ner_no_4.2.4_3.0_1669815034218.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_base_multilingual_cased_finetuned_norsk_ner_no_4.2.4_3.0_1669815034218.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_base_multilingual_cased_finetuned_norsk_ner","no") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_base_multilingual_cased_finetuned_norsk_ner","no") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_base_multilingual_cased_finetuned_norsk_ner| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|no| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Kushtrim/bert-base-multilingual-cased-finetuned-norsk-ner --- layout: model title: Pipeline to Extract Neurologic Deficits Related to Stroke Scale (NIHSS) author: John Snow Labs name: ner_nihss_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: [Named Entity Recognition, Pipeline Healthcare] language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_nihss](https://nlp.johnsnowlabs.com/2021/11/15/ner_nihss_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_NIHSS/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_NIHSS.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_nihss_pipeline_en_3.4.1_3.0_1647871076449.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_nihss_pipeline_en_3.4.1_3.0_1647871076449.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_nihss_pipeline", "en", "clinical/models") pipeline.fullAnnotate("Abdomen , soft , nontender . NIH stroke scale on presentation was 23 to 24 for , one for consciousness , two for month and year and two for eye / grip , one to two for gaze , two for face , eight for motor , one for limited ataxia , one to two for sensory , three for best language and two for attention . On the neurologic examination the patient was intermittently.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_nihss_pipeline", "en", "clinical/models") pipeline.fullAnnotate("Abdomen , soft , nontender . NIH stroke scale on presentation was 23 to 24 for , one for consciousness , two for month and year and two for eye / grip , one to two for gaze , two for face , eight for motor , one for limited ataxia , one to two for sensory , three for best language and two for attention . On the neurologic examination the patient was intermittently.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.nihss_pipeline").predict("""Abdomen , soft , nontender . NIH stroke scale on presentation was 23 to 24 for , one for consciousness , two for month and year and two for eye / grip , one to two for gaze , two for face , eight for motor , one for limited ataxia , one to two for sensory , three for best language and two for attention . On the neurologic examination the patient was intermittently.""") ```
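The pipeline labels spelled-out scores such as "two" or "one to two" as `Measurement` entities (see the results below). A hypothetical post-processing step could normalize those chunks to integers, for example to total an NIHSS score; the word list and the take-the-upper-bound choice for ranges are illustrative assumptions, not part of the pipeline:

```python
WORD_TO_INT = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
               "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}

def measurement_to_score(chunk: str):
    """Convert a Measurement chunk ('eight', 'one to two') to an int;
    for ranges, take the upper bound (an arbitrary illustrative choice)."""
    values = [WORD_TO_INT[w] for w in chunk.lower().split() if w in WORD_TO_INT]
    return max(values) if values else None

print(measurement_to_score("eight"))       # 8
print(measurement_to_score("one to two"))  # 2
```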
## Results ```bash | | chunk | entity | |---:|:-------------------|:-------------------------| | 0 | NIH stroke scale | NIHSS | | 1 | 23 to 24 | Measurement | | 2 | one | Measurement | | 3 | consciousness | 1a_LOC | | 4 | two | Measurement | | 5 | month and year | 1b_LOCQuestions | | 6 | two | Measurement | | 7 | eye / grip | 1c_LOCCommands | | 8 | one | Measurement | | 9 | two | Measurement | | 10 | gaze | 2_BestGaze | | 11 | two | Measurement | | 12 | face | 4_FacialPalsy | | 13 | eight | Measurement | | 14 | one | Measurement | | 15 | limited ataxia | 7_LimbAtaxia | | 16 | one to two | Measurement | | 17 | sensory | 8_Sensory | | 18 | three | Measurement | | 19 | best language | 9_BestLanguage | | 20 | two | Measurement | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_nihss_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Stop Words Cleaner for Portuguese author: John Snow Labs name: stopwords_pt date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: pt edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, pt] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. 
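Conceptually, the cleaner filters every token against a fixed stop-word list. The idea can be shown without Spark NLP in a few lines; the tiny Portuguese stop-word set below is only an illustrative subset of the much larger list the model actually ships with:

```python
STOPWORDS = {"de", "ser", "o", "do", "um", "e", "no", "além"}  # illustrative subset

def clean_tokens(tokens):
    """Drop tokens whose lowercase form is in the stop-word set."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(clean_tokens(["Além", "de", "ser", "o", "rei", "do", "norte"]))
# ['rei', 'norte']
```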
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_pt_pt_2.5.4_2.4_1594742441703.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_pt_pt_2.5.4_2.4_1594742441703.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_pt", "pt") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Além de ser o rei do norte, John Snow é um médico inglês e líder no desenvolvimento de anestesia e higiene médica.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_pt", "pt") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Além de ser o rei do norte, John Snow é um médico inglês e líder no desenvolvimento de anestesia e higiene médica.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Além de ser o rei do norte, John Snow é um médico inglês e líder no desenvolvimento de anestesia e higiene médica."""] stopword_df = nlu.load('pt.stopwords').predict(text) stopword_df[['cleanTokens']] ```
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=14, end=16, result='rei', metadata={'sentence': '0'}), Row(annotatorType='token', begin=21, end=25, result='norte', metadata={'sentence': '0'}), Row(annotatorType='token', begin=26, end=26, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=28, end=31, result='John', metadata={'sentence': '0'}), Row(annotatorType='token', begin=33, end=36, result='Snow', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_pt| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|pt| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: English DistilBertForQuestionAnswering Cased model (from aszidon) author: John Snow Labs name: distilbert_qa_custom2 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbertcustom2` is an English model originally trained by `aszidon`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom2_en_4.3.0_3.0_1672774614830.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom2_en_4.3.0_3.0_1672774614830.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_custom2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/aszidon/distilbertcustom2 --- layout: model title: Spanish BertForQuestionAnswering model (from MMG) author: John Snow Labs name: bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_finetuned_sqac date: 2022-06-02 tags: [es, open_source, question_answering, bert] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-cased-finetuned-spa-squad2-es-finetuned-sqac` is a Spanish model originally trained by `MMG`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_finetuned_sqac_es_4.0.0_3.0_1654180469657.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_finetuned_sqac_es_4.0.0_3.0_1654180469657.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_finetuned_sqac","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_finetuned_sqac","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.squadv2_sqac.bert.base_cased_spa.by_MMG").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
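The nlu one-liner above joins the question and the context into a single string with a `|||` separator. A minimal plain-Python sketch of that convention (the helper name is an illustrative assumption, not part of the nlu API):

```python
def split_qa_input(text: str, sep: str = "|||"):
    """Split a 'question|||context' string back into its two parts."""
    question, _, context = text.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa_input("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
print(c)  # My name is Clara and I live in Berkeley.
```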
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_finetuned_sqac| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|410.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/MMG/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es-finetuned-sqac --- layout: model title: Legal Tax returns Clause Binary Classifier author: John Snow Labs name: legclf_tax_returns_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `tax-returns` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
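The first splitting technique listed above (paragraph splitting by multiline) can be sketched in plain Python. The regex below is an illustrative assumption, not the workshop's exact implementation:

```python
import re

def split_paragraphs(text: str):
    """Split a document into paragraphs on blank lines (multiline splitting)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Clause 1. Tax returns...\n\nClause 2. Governing law...\n\nClause 3. Notices..."
print(split_paragraphs(doc))  # three paragraphs, one per clause
```

Each resulting paragraph can then be fed to the classifier as a separate row, keeping every input comfortably under the 512-token embedding limit.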
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `tax-returns` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_tax_returns_clause_en_1.0.0_3.2_1660123065637.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_tax_returns_clause_en_1.0.0_3.2_1660123065637.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_tax_returns_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
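Since the classifier emits either `tax-returns` or `other` per document, combining several such binary classifiers reduces to mapping each predicted category to a boolean. A small plain-Python sketch (the function name is an assumption for illustration):

```python
def clause_flags(predictions):
    """Map predicted clause categories to True (clause present) / False (other)."""
    return [label != "other" for label in predictions]

print(clause_flags(["tax-returns", "other", "other", "tax-returns"]))  # [True, False, False, True]
```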
## Results ```bash +-------+ | result| +-------+ |[tax-returns]| |[other]| |[other]| |[tax-returns]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_tax_returns_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.5 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash precision recall f1-score support other 1.00 0.92 0.96 37 tax-returns 0.75 1.00 0.86 9 accuracy - - 0.93 46 macro-avg 0.88 0.96 0.91 46 weighted-avg 0.95 0.93 0.94 46 ``` --- layout: model title: Pipeline to Detect Units and Measurements in text author: John Snow Labs name: ner_measurements_clinical_pipeline date: 2023-03-14 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_measurements_clinical](https://nlp.johnsnowlabs.com/2021/04/01/ner_measurements_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_measurements_clinical_pipeline_en_4.3.0_3.2_1678832259909.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_measurements_clinical_pipeline_en_4.3.0_3.2_1678832259909.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_measurements_clinical_pipeline", "en", "clinical/models") text = '''Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_measurements_clinical_pipeline", "en", "clinical/models") val text = "Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical_measurements.pipeline").predict("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""") ```
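The pipeline tags numeric spans as `Measurements` and the unit tokens that follow them as `Units`. Downstream, the two chunk types are often recombined into a single quantity; a plain-Python sketch of that pairing (illustrative, not part of the pipeline itself):

```python
def pair_measurements(chunks):
    """Join each Measurements chunk with the Units chunk immediately following it."""
    out = []
    for i, (text, label) in enumerate(chunks):
        if label == "Measurements":
            unit = ""
            if i + 1 < len(chunks) and chunks[i + 1][1] == "Units":
                unit = " " + chunks[i + 1][0]
            out.append(text + unit)
    return out

chunks = [("0.5 x 0.5 x 0.4", "Measurements"), ("cm", "Units")]
print(pair_measurements(chunks))  # ['0.5 x 0.5 x 0.4 cm']
```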
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:----------------|--------:|------:|:-------------|-------------:| | 0 | 0.5 x 0.5 x 0.4 | 113 | 127 | Measurements | 0.98748 | | 1 | cm | 129 | 130 | Units | 0.9996 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_measurements_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Fast Neural Machine Translation Model from East Slavic Languages to English author: John Snow Labs name: opus_mt_zle_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, zle, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `zle` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_zle_en_xx_2.7.0_2.4_1609166964113.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_zle_en_xx_2.7.0_2.4_1609166964113.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_zle_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_zle_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your text to translate").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.zle.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_zle_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Financial News Summarization (X-Small) author: John Snow Labs name: finsum_news_xs date: 2022-11-23 tags: [financial, summarization, en, licensed] task: Summarization language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a financial news summarizer, fine-tuned on an extra-small financial dataset (about 4K news articles). ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finsum_news_xs_en_1.0.0_3.0_1669213220483.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finsum_news_xs_en_1.0.0_3.0_1669213220483.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") t5 = nlp.T5Transformer() \ .pretrained("finsum_news_xs", "en", "finance/models") \ .setInputCols(["documents"]) \ .setOutputCol("summaries") \ .setMaxOutputLength(512) \ .setTask("summarize:") # or 'summarization' data_df = spark.createDataFrame([["Deere Grows Sales 37% as Shipments Rise. Farm equipment supplier forecasts higher sales in year ahead, lifted by price increases and infrastructure investments. Deere & Co. said its fiscal fourth-quarter sales surged 37% as supply constraints eased and the company shipped more of its farm and construction equipment. The Moline, Ill.-based company, the largest supplier of farm equipment in the U.S., said demand held up as it raised prices on farm equipment, and forecast sales gains in the year ahead. Chief Executive John May cited strong demand and increased investment in infrastructure projects as the Biden administration ramps up spending. Elevated crop prices have kept farmers interested in new machinery even as their own production expenses increase."]]).toDF("text") pipeline = nlp.Pipeline().setStages([document_assembler, t5]) results = pipeline.fit(data_df).transform(data_df) results.select("summaries.result").show(truncate=False) ```
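`setTask("summarize:")` works by prepending the task string to each input before it reaches the T5 encoder, which is how T5 models are told which task to perform. A minimal plain-Python sketch of that convention (the helper is illustrative, not the annotator's internal code):

```python
def with_task_prefix(task: str, text: str) -> str:
    """Prepend a T5 task prefix, as setTask does before encoding."""
    return f"{task} {text}"

print(with_task_prefix("summarize:", "Deere Grows Sales 37% as Shipments Rise."))
# summarize: Deere Grows Sales 37% as Shipments Rise.
```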
## Results ```bash Deere & Co. said its fiscal fourth-quarter sales surged 37% as supply constraints eased and the company shipped more farm and construction equipment. Deere & Co. said its fiscal fourth-quarter sales surged 37% as supply constraints eased and the company shipped more farm and construction equipment. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finsum_news_xs| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[summaries]| |Language:|en| |Size:|923.2 MB| ## References John Snow Labs in-house summarized articles. --- layout: model title: English DistilBertForQuestionAnswering model (from Thitaree) author: John Snow Labs name: distilbert_qa_Thitaree_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Thitaree`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Thitaree_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724750492.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Thitaree_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724750492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Thitaree_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Thitaree_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Thitaree").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_Thitaree_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Thitaree/distilbert-base-uncased-finetuned-squad --- layout: model title: Polish BertForQuestionAnswering model (from henryk) author: John Snow Labs name: bert_qa_bert_base_multilingual_cased_finetuned_polish_squad2 date: 2022-06-02 tags: [pl, open_source, question_answering, bert] task: Question Answering language: pl edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased-finetuned-polish-squad2` is a Polish model originally trained by `henryk`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_finetuned_polish_squad2_pl_4.0.0_3.0_1654180123880.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_finetuned_polish_squad2_pl_4.0.0_3.0_1654180123880.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_multilingual_cased_finetuned_polish_squad2","pl") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_multilingual_cased_finetuned_polish_squad2","pl") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("pl.answer_question.squadv2.bert.multilingual_base_cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_multilingual_cased_finetuned_polish_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|pl| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/henryk/bert-base-multilingual-cased-finetuned-polish-squad2 - https://www.linkedin.com/in/henryk-borzymowski-0755a2167/ - https://rajpurkar.github.io/SQuAD-explorer/ - https://github.com/google-research/bert/blob/master/multilingual.md --- layout: model title: Finnish asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP TFWav2Vec2ForCTC from Finnish-NLP author: John Snow Labs name: pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP` is a Finnish model originally trained by Finnish-NLP.
NOTE: This pipeline only works on a CPU; if you need to run this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP_fi_4.2.0_3.0_1664040198920.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP_fi_4.2.0_3.0_1664040198920.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP', lang = 'fi') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP", lang = "fi") val annotations = pipeline.transform(audioDF) ```
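Wav2Vec2 pipelines expect the audio column of `audioDF` to hold arrays of floats. Raw 16-bit PCM samples are therefore typically normalized to the [-1.0, 1.0] range before being loaded into the DataFrame; a hedged plain-Python sketch of that conversion (the function name and scaling constant are standard-practice assumptions, not part of the pipeline's API):

```python
def pcm16_to_float(samples):
    """Normalize signed 16-bit PCM samples to floats in [-1.0, 1.0]."""
    return [s / 32768.0 for s in samples]

print(pcm16_to_float([0, 16384, -32768]))  # [0.0, 0.5, -1.0]
```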
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Translate Oromo to English Pipeline author: John Snow Labs name: translate_om_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, om, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `om` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_om_en_xx_2.7.0_2.4_1609690273713.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_om_en_xx_2.7.0_2.4_1609690273713.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_om_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_om_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.om.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_om_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Detect Anatomical Structures (Single Entity - biobert) author: John Snow Labs name: ner_anatomy_coarse_biobert_pipeline date: 2023-03-20 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_anatomy_coarse_biobert](https://nlp.johnsnowlabs.com/2021/03/31/ner_anatomy_coarse_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_biobert_pipeline_en_4.3.0_3.2_1679316528376.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_biobert_pipeline_en_4.3.0_3.2_1679316528376.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_anatomy_coarse_biobert_pipeline", "en", "clinical/models") text = '''content in the lung tissue''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_anatomy_coarse_biobert_pipeline", "en", "clinical/models") val text = "content in the lung tissue" val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.anatomy_coarse_biobert.pipeline").predict("""content in the lung tissue""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:------------|--------:|------:|:------------|-------------:| | 0 | lung tissue | 15 | 25 | Anatomy | 0.99155 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_anatomy_coarse_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.2 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Fast Neural Machine Translation Model from English to Malagasy author: John Snow Labs name: opus_mt_en_mg date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, mg, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `mg` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_mg_xx_2.7.0_2.4_1609167829744.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_mg_xx_2.7.0_2.4_1609167829744.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_mg", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_mg", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your text to translate").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.mg').predict(text, output_level='sentence') opus_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_en_mg|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from avioo1)
author: John Snow Labs
name: roberta_qa_avioo1_base_squad2_finetuned_squad
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-finetuned-squad` is an English model originally trained by `avioo1`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_avioo1_base_squad2_finetuned_squad_en_4.3.0_3.0_1674219191405.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_avioo1_base_squad2_finetuned_squad_en_4.3.0_3.0_1674219191405.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_avioo1_base_squad2_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_avioo1_base_squad2_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
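To clarify what an extractive QA annotator like `RoBertaForQuestionAnswering` returns, here is a plain-Python sketch of the span-extraction idea: the model predicts the boundaries of an answer span inside the context, and the answer is simply that slice of the context. The character-level framing below is a simplification (the real model predicts token positions), and the predicted span is hard-coded for illustration.

```python
# Simplified illustration of extractive QA: the answer is a span of the context.
context = "My name is Clara and I live in Berkeley."
# Hypothetical predicted span boundaries for the question "What's my name?"
start = context.find("Clara")        # model-predicted start position (simulated)
end = start + len("Clara")           # model-predicted end position (simulated)
answer = context[start:end]
print(answer)  # Clara
```

This is why the Spark NLP output column is called `answer`: it holds the extracted span, not generated text.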
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_avioo1_base_squad2_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/avioo1/roberta-base-squad2-finetuned-squad

---
layout: model
title: Pipeline to Detect Drug Information (Small)
author: John Snow Labs
name: ner_posology_small_pipeline
date: 2023-03-15
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [ner_posology_small](https://nlp.johnsnowlabs.com/2021/03/31/ner_posology_small_en.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_small_pipeline_en_4.3.0_3.2_1678868910811.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_small_pipeline_en_4.3.0_3.2_1678868910811.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_posology_small_pipeline", "en", "clinical/models") text = '''The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. 
daily, and Bactrim DS b.i.d.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_posology_small_pipeline", "en", "clinical/models") val text = "The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d." 
val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology_small.pipeline").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""") ```
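To make the pipeline's label set concrete before looking at the results table, here is a rough rule-based sketch of the structure it extracts from a medication order. This regex is only an illustration of the target labels (DRUG, STRENGTH, ROUTE, FREQUENCY); the actual pipeline uses a neural NER model, not pattern matching, and the order string is taken from the example text above.

```python
import re

# Rule-based sketch of the posology structure; NOT the pipeline's method.
order = "OxyContin 30 mg p.o. q.12 h."
m = re.match(r"(?P<drug>\w+) (?P<strength>\d+ mg) (?P<route>p\.o\.) (?P<freq>.+)", order)
print(m.group("drug"))      # OxyContin  -> DRUG
print(m.group("strength"))  # 30 mg      -> STRENGTH
print(m.group("route"))     # p.o.       -> ROUTE
print(m.group("freq"))      # q.12 h.    -> FREQUENCY
```

The neural model generalizes far beyond such fixed patterns (misspellings, reordered fields, unseen drug names), which is the point of using it instead of rules.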
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:---------------|--------:|------:|:------------|-------------:| | 0 | insulin | 59 | 65 | DRUG | 0.9984 | | 1 | Bactrim | 346 | 352 | DRUG | 0.9999 | | 2 | for 14 days | 354 | 364 | DURATION | 0.8678 | | 3 | Fragmin | 925 | 931 | DRUG | 0.9994 | | 4 | 5000 units | 933 | 942 | DOSAGE | 0.88505 | | 5 | subcutaneously | 944 | 957 | ROUTE | 0.9998 | | 6 | daily | 959 | 963 | FREQUENCY | 0.997 | | 7 | Xenaderm | 966 | 973 | DRUG | 0.99 | | 8 | topically | 985 | 993 | ROUTE | 0.971 | | 9 | b.i.d., | 995 | 1001 | FREQUENCY | 0.76095 | | 10 | Lantus | 1003 | 1008 | DRUG | 0.9994 | | 11 | 40 units | 1010 | 1017 | DOSAGE | 0.89935 | | 12 | subcutaneously | 1019 | 1032 | ROUTE | 0.9992 | | 13 | at bedtime | 1034 | 1043 | FREQUENCY | 0.83435 | | 14 | OxyContin | 1046 | 1054 | DRUG | 0.9992 | | 15 | 30 mg | 1056 | 1060 | STRENGTH | 0.99965 | | 16 | p.o | 1062 | 1064 | ROUTE | 0.9997 | | 17 | q.12 h., | 1067 | 1074 | FREQUENCY | 0.9102 | | 18 | folic acid | 1076 | 1085 | DRUG | 0.9955 | | 19 | 1 mg | 1087 | 1090 | STRENGTH | 0.99875 | | 20 | daily | 1092 | 1096 | FREQUENCY | 0.9993 | | 21 | levothyroxine | 1099 | 1111 | DRUG | 0.9998 | | 22 | 0.1 mg | 1113 | 1118 | STRENGTH | 0.99965 | | 23 | p.o | 1120 | 1122 | ROUTE | 0.999 | | 24 | daily | 1125 | 1129 | FREQUENCY | 0.9373 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_small_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: German Medical Bert Embeddings author: John Snow Labs name: bert_embeddings_German_MedBERT date: 2022-04-11 tags: [bert, embeddings, de, open_source] task: Embeddings language: de edition: Spark NLP 3.4.2 spark_version: 3.0 
supported: true
recommended: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained German Medical Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `German-MedBERT` is a German model originally trained by `smanjil`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_German_MedBERT_de_3.4.2_3.0_1649675767629.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_German_MedBERT_de_3.4.2_3.0_1649675767629.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_German_MedBERT","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Funken NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_German_MedBERT","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Funken NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.embed.medbert").predict("""Ich liebe Funken NLP""") ```
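A common downstream use of the `embeddings` column is semantic similarity: tokens with related medical meanings get vectors that point in similar directions, typically compared with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely to illustrate the comparison; real German-MedBERT embeddings have 768 dimensions.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product normalized by both vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical toy vectors, NOT actual model output.
v_fieber = [0.9, 0.1, 0.2]  # "Fieber" (fever)
v_grippe = [0.8, 0.2, 0.3]  # "Grippe" (flu)
v_auto   = [0.1, 0.9, 0.0]  # "Auto" (car)

# Medically related terms should be closer than unrelated ones.
print(cosine(v_fieber, v_grippe) > cosine(v_fieber, v_auto))  # True
```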
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_German_MedBERT|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|de|
|Size:|409.8 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/smanjil/German-MedBERT
- https://opus4.kobv.de/opus4-rhein-waal/frontdoor/index/index/searchtype/collection/id/16225/start/0/rows/10/doctypefq/masterthesis/docId/740
- https://www.linkedin.com/in/manjil-shrestha-038527b4/

---
layout: model
title: French CamemBert Embeddings (from DoyyingFace)
author: John Snow Labs
name: camembert_embeddings_DoyyingFace_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `DoyyingFace`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_DoyyingFace_generic_model_fr_3.4.4_3.0_1653986003984.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_DoyyingFace_generic_model_fr_3.4.4_3.0_1653986003984.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_DoyyingFace_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_DoyyingFace_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|camembert_embeddings_DoyyingFace_generic_model|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|266.8 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/DoyyingFace/dummy-model

---
layout: model
title: Hindi Named Entity Recognition (from sagorsarker)
author: John Snow Labs
name: bert_ner_codeswitch_hineng_ner_lince
date: 2022-05-09
tags: [bert, ner, token_classification, hi, open_source]
task: Named Entity Recognition
language: hi
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `codeswitch-hineng-ner-lince` is a Hindi model originally trained by `sagorsarker`.

## Predicted Entities

`PERSON`, `ORGANISATION`, `PLACE`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_codeswitch_hineng_ner_lince_hi_3.4.2_3.0_1652097576639.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_codeswitch_hineng_ner_lince_hi_3.4.2_3.0_1652097576639.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_codeswitch_hineng_ner_lince","hi") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["मुझे स्पार्क एनएलपी बहुत पसंद है"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_codeswitch_hineng_ner_lince","hi") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("मुझे स्पार्क एनएलपी बहुत पसंद है").toDF("text") val result = pipeline.fit(data).transform(data) ```
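The `ner` column produced above holds one BIO tag per token; grouping those tags into entity chunks is the job of a converter stage (`NerConverter` in other pipelines on this site). The pure-Python sketch below shows that grouping logic on a hypothetical tagged sentence; the tokens and tags are made up, not model output.

```python
def bio_to_chunks(tokens, tags):
    """Merge token-level BIO tags into (label, chunk_text) pairs."""
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = (tag[2:], [tok])          # start a new chunk
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)              # continue the open chunk
        else:                                   # "O" or inconsistent tag
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(toks)) for label, toks in chunks]

# Hypothetical example in the model's label set (PERSON, ORGANISATION, PLACE).
tokens = ["Rahul", "Dravid", "lives", "in", "Delhi"]
tags = ["B-PERSON", "I-PERSON", "O", "O", "B-PLACE"]
print(bio_to_chunks(tokens, tags))  # [('PERSON', 'Rahul Dravid'), ('PLACE', 'Delhi')]
```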
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_ner_codeswitch_hineng_ner_lince|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|hi|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/sagorsarker/codeswitch-hineng-ner-lince
- https://ritual.uh.edu/lince/home
- https://github.com/sagorbrur/codeswitch

---
layout: model
title: English image_classifier_vit_rust_image_classification_8 ViTForImageClassification from SummerChiam
author: John Snow Labs
name: image_classifier_vit_rust_image_classification_8
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rust_image_classification_8` is an English model originally trained by SummerChiam.

## Predicted Entities

`nonrust`, `rust`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_8_en_4.1.0_3.0_1660166811431.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_8_en_4.1.0_3.0_1660166811431.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_rust_image_classification_8", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_rust_image_classification_8", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
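The `class` output column comes from the classification head's final step: the model emits one logit per label (here `nonrust` and `rust`), and the predicted class is the argmax after a softmax. The sketch below shows only that final step with made-up logit values, not actual model output.

```python
import math

# Final step of an image-classification head, with hypothetical logits.
labels = ["nonrust", "rust"]
logits = [-1.3, 2.1]  # made-up model output for one image

exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]           # softmax over the two labels
pred = labels[probs.index(max(probs))]          # argmax -> predicted class
print(pred)  # rust
```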
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_rust_image_classification_8|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|

---
layout: model
title: Sentence Entity Resolver for ICD10-CM (Augmented)
author: John Snow Labs
name: sbiobertresolve_icd10cm_augmented
date: 2021-10-31
tags: [icd10cm, entity_resolution, clinical, en, licensed]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.3.1
spark_version: 2.4
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model maps extracted medical entities to ICD10-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It has also been augmented with synonyms to make it more accurate.

## Predicted Entities

`ICD10CM Codes`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_3.3.1_2.4_1635684621243.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_3.3.1_2.4_1635684621243.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") icd10_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_icd10cm_augmented","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver]) data_ner = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. 
Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection."]]).toDF("text") results = nlpPipeline.fit(data_ner).transform(data_ner) ``` ```scala ... val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val chunk2doc = new Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val icd10_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_icd10cm_augmented","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, 
presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd10cm.augmented").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.""") ```
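Since the resolver is configured with `setDistanceFunction("EUCLIDEAN")`, the resolution step amounts to a nearest-neighbor search: each NER chunk is embedded and matched to the ICD-10-CM entry whose embedding is closest. The toy sketch below illustrates that search; the 2-dimensional vectors and the tiny code index are made up (real sentence embeddings have 768 dimensions and the index covers the full code set).

```python
import math

# Hypothetical mini index of ICD-10-CM code embeddings (made-up 2-d vectors).
code_index = {
    "E669": [0.9, 0.1],  # obesity
    "R35":  [0.1, 0.8],  # polyuria
}
chunk_vec = [0.85, 0.15]  # hypothetical embedding of the NER chunk "obesity"

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Resolution = pick the code whose embedding is nearest to the chunk embedding.
best = min(code_index, key=lambda c: euclidean(chunk_vec, code_index[c]))
print(best)  # E669
```

The `all_codes` column in the results below corresponds to keeping the k nearest entries rather than just the single best one.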
## Results ```bash +-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+ | ner_chunk| entity|icd10cm_code| resolutions| all_codes| +-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+ | gestational diabetes mellitus|PROBLEM| O2441|gestational diabetes mellitus:::postpartum gestational diabetes mel...| O2441:::O2443:::Z8632:::Z875:::O2431:::O2411:::O244:::O241:::O2481| |subsequent type two diabetes mellitus|PROBLEM| O2411|pre-existing type 2 diabetes mellitus:::disorder associated with ty...|O2411:::E118:::E11:::E139:::E119:::E113:::E1144:::Z863:::Z8639:::E1...| | T2DM|PROBLEM| E11|type 2 diabetes mellitus:::disorder associated with type 2 diabetes...|E11:::E118:::E119:::O2411:::E109:::E139:::E113:::E8881:::Z833:::D64...| | HTG-induced pancreatitis|PROBLEM| K8520|alcohol-induced pancreatitis:::drug-induced acute pancreatitis:::he...|K8520:::K853:::K8590:::F102:::K852:::K859:::K8580:::K8591:::K858:::...| | acute hepatitis|PROBLEM| K720|acute hepatitis:::acute hepatitis a:::acute infectious hepatitis:::...|K720:::B15:::B179:::B172:::Z0389:::B159:::B150:::B16:::K752:::K712:...| | obesity|PROBLEM| E669|obesity:::abdominal obesity:::obese:::central obesity:::overweight ...|E669:::E668:::Z6841:::Q130:::E66:::E6601:::Z8639:::E349:::H3550:::Z...| | a body mass index|PROBLEM| Z6841|finding of body mass index:::observation of body mass index:::mass ...|Z6841:::E669:::R229:::Z681:::R223:::R221:::Z68:::R222:::R220:::R418...| | polyuria|PROBLEM| R35|polyuria:::nocturnal polyuria:::polyuric state:::polyuric state (di...|R35:::R3581:::R358:::E232:::R31:::R350:::R8299:::N401:::E723:::O048...| | polydipsia|PROBLEM| R631|polydipsia:::psychogenic polydipsia:::primary 
polydipsia:::psychoge...|R631:::F6389:::E232:::F639:::O40:::G475:::M7989:::R632:::R061:::H53...|
|                        poor appetite|PROBLEM|        R630|poor appetite:::poor feeding:::bad taste in mouth:::unpleasant tast...|R630:::P929:::R438:::R432:::E86:::R196:::F520:::Z724:::R0689:::Z768...|
|                             vomiting|PROBLEM|        R111|vomiting:::intermittent vomiting:::vomiting symptoms:::periodic vom...|       R111:::R11:::R1110:::G43A1:::P921:::P9209:::G43A:::R1113:::R110|
|        a respiratory tract infection|PROBLEM|        J988|respiratory tract infection:::upper respiratory tract infection:::b...|J988:::J069:::A499:::J22:::J209:::Z593:::T17:::J0410:::Z1383:::J189...|
+-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_icd10cm_augmented|
|Compatibility:|Healthcare NLP 3.3.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[icd10cm_code]|
|Language:|en|
|Case sensitive:|false|

## Data Source

Trained on ICD10CM 2022 Codes dataset: https://www.cdc.gov/nchs/icd/icd10cm.htm

---
layout: model
title: Italian T5ForConditionalGeneration Small Cased model (from it5)
author: John Snow Labs
name: t5_it5_efficient_small_el32_headline_generation
date: 2023-01-30
tags: [it, open_source, t5, tensorflow]
task: Text Generation
language: it
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `it5-efficient-small-el32-headline-generation` is an Italian model originally trained by `it5`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_headline_generation_it_4.3.0_3.0_1675103295731.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_headline_generation_it_4.3.0_3.0_1675103295731.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_headline_generation","it") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_headline_generation","it") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_it5_efficient_small_el32_headline_generation| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|it| |Size:|594.0 MB| ## References - https://huggingface.co/it5/it5-efficient-small-el32-headline-generation - https://github.com/stefan-it - https://arxiv.org/abs/2203.03759 - https://gsarti.com - https://malvinanissim.github.io - https://arxiv.org/abs/2109.10686 - https://github.com/gsarti/it5 - https://paperswithcode.com/sota?task=Headline+generation&dataset=HeadGen-IT --- layout: model title: Recognize Entities DL pipeline for Italian - Large author: John Snow Labs name: entity_recognizer_lg date: 2021-03-23 tags: [open_source, italian, entity_recognizer_lg, pipeline, it] supported: true task: [Named Entity Recognition, Lemmatization] language: it edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_lg is a pretrained pipeline that can be used to process text, performing basic processing steps and recognizing entities.
It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_it_3.0.0_3.0_1616465464186.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_it_3.0.0_3.0_1616465464186.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('entity_recognizer_lg', lang = 'it') annotations = pipeline.fullAnnotate("Ciao da John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("entity_recognizer_lg", lang = "it") val result = pipeline.fullAnnotate("Ciao da John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Ciao da John Snow Labs! "] result_df = nlu.load('it.ner.lg').predict(text) result_df ```
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:-----------------------------|:----------------------------|:----------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Ciao da John Snow Labs! '] | ['Ciao da John Snow Labs!'] | ['Ciao', 'da', 'John', 'Snow', 'Labs!'] | [[-0.238279998302459,.,...]] | ['O', 'O', 'I-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_lg| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|it| --- layout: model title: Translate Punjabi (Eastern) to English Pipeline author: John Snow Labs name: translate_pa_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, pa, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this module is very computationally expensive, especially on longer sequences; the use of an accelerator such as a GPU is recommended.
- source languages: `pa` - target languages: `en` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/INDIAN_TRANSLATION_PUNJABI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TRANSLATION_PIPELINES_MODELS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_pa_en_xx_2.7.0_2.4_1609690246774.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_pa_en_xx_2.7.0_2.4_1609690246774.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_pa_en", lang = "xx") result = pipeline.annotate("ਮੈਨੂੰ ਪੜ੍ਹਨਾ ਪਸੰਦ ਹੈ.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_pa_en", lang = "xx") val result = pipeline.annotate("ਮੈਨੂੰ ਪੜ੍ਹਨਾ ਪਸੰਦ ਹੈ.") ``` {:.nlu-block} ```python import nlu text = ["ਮੈਨੂੰ ਪੜ੍ਹਨਾ ਪਸੰਦ ਹੈ."] translate_df = nlu.load('xx.pa.translate_to.en').predict(text, output_level='sentence') translate_df ```
## Results ```bash +------------------------------+---------------------------+ |sentence |translation | +------------------------------+---------------------------+ |ਮੈਨੂੰ ਪੜ੍ਹਨਾ ਪਸੰਦ ਹੈ. |I like reading. | +------------------------------+---------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_pa_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from usami) author: John Snow Labs name: distilbert_qa_usami_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `usami`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_usami_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772969630.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_usami_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772969630.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_usami_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_usami_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_usami_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/usami/distilbert-base-uncased-finetuned-squad --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from anu24) author: John Snow Labs name: distilbert_qa_anu24_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `anu24`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_anu24_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769853124.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_anu24_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769853124.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_anu24_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_anu24_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_anu24_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anu24/distilbert-base-uncased-finetuned-squad --- layout: model title: Extract Anatomical Entities from Oncology Texts author: John Snow Labs name: ner_oncology_anatomy_general_wip date: 2022-09-30 tags: [licensed, clinical, oncology, en, ner, anatomy] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts anatomical entities using a single general label for anatomical sites, rather than distinguishing specific structure types. Definitions of Predicted Entities: - `Anatomical_Site`: Relevant anatomical terms mentioned in text. - `Direction`: Directional and laterality terms, such as "left", "right", "bilateral", "upper" and "lower". ## Predicted Entities `Anatomical_Site`, `Direction` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_general_wip_en_4.0.0_3.0_1664562237279.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_general_wip_en_4.0.0_3.0_1664562237279.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_anatomy_general_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_anatomy_general_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) 
.setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_anatomy_general").predict("""The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.""") ```
## Results ```bash | chunk | ner_label | |:--------|:----------------| | left | Direction | | breast | Anatomical_Site | | lungs | Anatomical_Site | | liver | Anatomical_Site | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_anatomy_general_wip| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|843.0 KB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Anatomical_Site 2377.0 649.0 353.0 2730.0 0.79 0.87 0.83 Direction 668.0 219.0 66.0 734.0 0.75 0.91 0.82 macro_avg 3045.0 868.0 419.0 3464.0 0.77 0.89 0.83 micro_avg NaN NaN NaN NaN 0.78 0.88 0.83 ``` --- layout: model title: English RobertaForQuestionAnswering Cased model (from Zamachi) author: John Snow Labs name: roberta_qa_for_question_answering date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-for-question-answering` is an English model originally trained by `Zamachi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_for_question_answering_en_4.3.0_3.0_1674220787682.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_for_question_answering_en_4.3.0_3.0_1674220787682.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_for_question_answering","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_for_question_answering","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_for_question_answering| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|466.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Zamachi/roberta-for-question-answering --- layout: model title: English image_classifier_vit_animal_classifier ViTForImageClassification from ritheshSree author: John Snow Labs name: image_classifier_vit_animal_classifier date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_animal_classifier` is an English model originally trained by ritheshSree. ## Predicted Entities `cat`, `dog`, `snake`, `tiger` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_animal_classifier_en_4.1.0_3.0_1660170154919.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_animal_classifier_en_4.1.0_3.0_1660170154919.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_animal_classifier", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_animal_classifier", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_animal_classifier| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Detect Anatomical Structures (Single Entity - biobert) author: John Snow Labs name: ner_anatomy_coarse_biobert date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description An NER model to extract all types of anatomical references in text using "biobert_pubmed_base_cased" embeddings. It is a single entity model and generalizes all anatomical references to a single entity. ## Predicted Entities `Anatomy` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ANATOMY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_biobert_en_3.0.0_3.0_1617209714335.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_biobert_en_3.0.0_3.0_1617209714335.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_anatomy_coarse_biobert", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["content in the lung tissue"]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased", "en") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_anatomy_coarse_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner, ner_converter)) val data = Seq("""content in the lung tissue""").toDS().toDF("text") val result = 
pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.anatomy.coarse_biobert").predict("""content in the lung tissue""") ```
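The Benchmarking table at the end of this card reports per-label and averaged scores computed from true-positive/false-positive/false-negative counts. As a sanity check, here is a minimal plain-Python sketch (no Spark NLP required) that reproduces the Micro-average row from the raw counts; micro-averaging pools the counts across labels before computing the metrics:

```python
# Raw (tp, fp, fn) counts per label, taken from this card's Benchmarking table.
counts = [
    (2499, 155, 162),  # B-Anatomy
    (1695, 116, 158),  # I-Anatomy
]

# Micro-averaging: pool the counts first, then compute precision/recall/F1.
tp = sum(c[0] for c in counts)
fp = sum(c[1] for c in counts)
fn = sum(c[2] for c in counts)

precision = tp / (tp + fp)  # 4194 / 4465
recall = tp / (tp + fn)     # 4194 / 4514
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 6), round(recall, 6), round(f1, 5))
# 0.939306 0.929109 0.93418 -- matches the Micro-average row
```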
## Results ```bash | | ner_chunk | entity | |---:|:------------------|:----------| | 0 | lung tissue | Anatomy | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_anatomy_coarse_biobert| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on a custom dataset using 'biobert_pubmed_base_cased'. ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|--------------:|------:|------:|------:|---------:|---------:|---------:| | 0 | B-Anatomy | 2499 | 155 | 162 | 0.941598 | 0.939121 | 0.940357 | | 1 | I-Anatomy | 1695 | 116 | 158 | 0.935947 | 0.914733 | 0.925218 | | 2 | Macro-average | 4194 | 271 | 320 | 0.938772 | 0.926927 | 0.932812 | | 3 | Micro-average | 4194 | 271 | 320 | 0.939306 | 0.929109 | 0.93418 | ``` --- layout: model title: MeSH to UMLS Code Mapping author: John Snow Labs name: mesh_umls_mapping date: 2021-05-04 tags: [mesh, umls, en, licensed] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.0.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps MeSH codes to UMLS codes without using any text data. Feed it whitespace-delimited MeSH codes and it returns the corresponding UMLS codes as a list. If a code has no mapping, it is returned unchanged. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/mesh_umls_mapping_en_3.0.2_3.0_1620134296251.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/mesh_umls_mapping_en_3.0.2_3.0_1620134296251.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("mesh_umls_mapping","en","clinical/models") pipeline.annotate("C028491 D019326 C579867") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("mesh_umls_mapping","en","clinical/models") val result = pipeline.annotate("C028491 D019326 C579867") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.mesh.umls").predict("""C028491 D019326 C579867""") ```
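Since `annotate` returns a plain dictionary keyed by output column, the code-to-code mapping can be recovered by zipping the two lists. A minimal plain-Python sketch, using the output shape shown in the Results section (the dictionary literal below is copied from that example rather than produced by actually running the pipeline):

```python
# Output shape produced by pipeline.annotate(), per the Results section.
output = {
    "mesh": ["C028491", "D019326", "C579867"],
    "umls": ["C0970275", "C0886627", "C3696376"],
}

# The i-th input MeSH code maps to the i-th UMLS code, so zip pairs them up.
mesh_to_umls = dict(zip(output["mesh"], output["umls"]))
print(mesh_to_umls["D019326"])
# C0886627
```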
## Results ```bash {'mesh': ['C028491', 'D019326', 'C579867'], 'umls': ['C0970275', 'C0886627', 'C3696376']} Note: | MeSH | Details | | ---------- | ----------------------------:| | C028491 | 1,3-butylene glycol | | D019326 | 17-alpha-Hydroxyprogesterone | | C579867 | 3-Methylglutaconic Aciduria | | UMLS | Details | | ---------- | ---------------------------:| | C0970275 | 1,3-butylene glycol | | C0886627 | 17-hydroxyprogesterone | | C3696376 | 3-methylglutaconic aciduria | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|mesh_umls_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.0.2+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - TokenizerModel - LemmatizerModel - Finisher --- layout: model title: English T5ForConditionalGeneration Base Cased model (from google) author: John Snow Labs name: t5_efficient_base_nl2 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-nl2` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nl2_en_4.3.0_3.0_1675113785226.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nl2_en_4.3.0_3.0_1675113785226.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_base_nl2","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_base_nl2","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_base_nl2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|158.2 MB| ## References - https://huggingface.co/google/t5-efficient-base-nl2 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English RobertaForQuestionAnswering (from huxxx657) author: John Snow Labs name: roberta_qa_roberta_base_finetuned_deletion_squad_10 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-deletion-squad-10` is an English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_deletion_squad_10_en_4.0.0_3.0_1655733844546.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_deletion_squad_10_en_4.0.0_3.0_1655733844546.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_finetuned_deletion_squad_10","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_finetuned_deletion_squad_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_deletion_10.by_huxxx657").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
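Under the hood, extractive QA models like this one score every token as a candidate answer start and as a candidate answer end, and return the best-scoring valid span (start before end, within a length limit). The toy sketch below illustrates just that selection step; the tokens, logits, and `max_len` value are made up for illustration and are not produced by this model.

```python
# Toy illustration of extractive QA span selection: pick the (start, end)
# pair with the highest combined start/end score, subject to start <= end
# and a maximum span length.
def best_span(start_logits, end_logits, max_len=15):
    best, best_score = (0, 0), float("-inf")
    for s, sl in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = sl + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.0, 0.2, 3.1, 0.0, 0.1, 0.0, 0.0, 0.5, 0.0]
end_logits   = [0.0, 0.1, 0.0, 2.8, 0.1, 0.0, 0.0, 0.1, 0.4, 0.0]
s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # Clara
```

The real model also handles subword tokens and unanswerable questions, which this sketch deliberately omits.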
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_finetuned_deletion_squad_10| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-deletion-squad-10 --- layout: model title: Social Determinants of Health author: John Snow Labs name: ner_sdoh_wip date: 2023-02-11 tags: [licensed, clinical, en, social_determinants, ner, public_health, sdoh] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.8 spark_version: 3.0 supported: true recommended: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts terminology related to Social Determinants of Health from various kinds of biomedical documents. 
## Predicted Entities `Other_SDoH_Keywords`, `Education`, `Population_Group`, `Quality_Of_Life`, `Housing`, `Substance_Frequency`, `Smoking`, `Eating_Disorder`, `Obesity`, `Healthcare_Institution`, `Financial_Status`, `Age`, `Chidhood_Event`, `Exercise`, `Communicable_Disease`, `Hypertension`, `Other_Disease`, `Violence_Or_Abuse`, `Spiritual_Beliefs`, `Employment`, `Social_Exclusion`, `Access_To_Care`, `Marital_Status`, `Diet`, `Social_Support`, `Disability`, `Mental_Health`, `Alcohol`, `Insurance_Status`, `Substance_Quantity`, `Hyperlipidemia`, `Family_Member`, `Legal_Issues`, `Race_Ethnicity`, `Gender`, `Geographic_Entity`, `Sexual_Orientation`, `Transportation`, `Sexual_Activity`, `Language`, `Substance_Use` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/SOCIAL_DETERMINANT_NER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/SOCIAL_DETERMINANT_NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_wip_en_4.2.8_3.0_1676135569606.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_wip_en_4.2.8_3.0_1676135569606.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_sdoh_wip", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter ]) sample_texts = [["Smith is a 55 years old, divorced Mexcian American woman with financial problems. She speaks spanish. She lives in an apartment. She has been struggling with diabetes for the past 10 years and has recently been experiencing frequent hospitalizations due to uncontrolled blood sugar levels. Smith works as a cleaning assistant and does not have access to health insurance or paid sick leave. She has a son student at college. Pt with likely long-standing depression. She is aware she needs rehab. Pt reprots having her catholic faith as a means of support as well.  She has long history of etoh abuse, beginning in her teens. She reports she has been a daily drinker for 30 years, most recently drinking beer daily. She smokes a pack of cigarettes a day. 
She had DUI back in April and was due to be in court this week."]] data = spark.createDataFrame(sample_texts).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_sdoh_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter )) val data = Seq("He continues to smoke one pack of cigarettes daily, as he has for the past 28 years.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
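The `NerConverterInternal` stage at the end of this pipeline assembles the model's token-level BIO tags into the entity chunks shown in the Results. A minimal pure-Python sketch of that decoding step follows; the tokens and tags are illustrative, not actual model output.

```python
# Minimal BIO-tag decoding, mimicking what a NerConverter stage does:
# a B- tag starts a chunk, following I- tags of the same label extend it,
# and anything else closes the open chunk.
def bio_to_chunks(tokens, tags):
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = (tok, tag[2:])
        elif tag.startswith("I-") and current and tag[2:] == current[1]:
            current = (current[0] + " " + tok, current[1])
        else:
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return chunks

tokens = ["She", "smokes", "a", "pack", "of", "cigarettes", "daily"]
tags = ["B-Gender", "B-Smoking", "B-Substance_Quantity", "I-Substance_Quantity",
        "O", "B-Smoking", "B-Substance_Frequency"]
print(bio_to_chunks(tokens, tags))
# [('She', 'Gender'), ('smokes', 'Smoking'), ('a pack', 'Substance_Quantity'),
#  ('cigarettes', 'Smoking'), ('daily', 'Substance_Frequency')]
```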
## Results ```bash +------------------+-----+---+-------------------+ |chunk |begin|end|ner_label | +------------------+-----+---+-------------------+ |55 years old |11 |22 |Age | |divorced |25 |32 |Marital_Status | |Mexcian American |34 |49 |Race_Ethnicity | |woman |51 |55 |Gender | |financial problems|62 |79 |Financial_Status | |She |82 |84 |Gender | |spanish |93 |99 |Language | |She |102 |104|Gender | |apartment |118 |126|Housing | |She |129 |131|Gender | |diabetes |158 |165|Other_Disease | |cleaning assistant|307 |324|Employment | |health insurance |354 |369|Insurance_Status | |She |391 |393|Gender | |son |401 |403|Family_Member | |student |405 |411|Education | |college |416 |422|Education | |depression |454 |463|Mental_Health | |She |466 |468|Gender | |she |479 |481|Gender | |rehab |489 |493|Access_To_Care | |her |514 |516|Gender | |catholic faith |518 |531|Spiritual_Beliefs | |support |547 |553|Social_Support | |She |565 |567|Gender | |etoh abuse |589 |598|Alcohol | |her |614 |616|Gender | |teens |618 |622|Age | |She |625 |627|Gender | |she |637 |639|Gender | |drinker |658 |664|Alcohol | |drinking beer |694 |706|Alcohol | |daily |708 |712|Substance_Frequency| |She |715 |717|Gender | |smokes |719 |724|Smoking | |a pack |726 |731|Substance_Quantity | |cigarettes |736 |745|Smoking | |a day |747 |751|Substance_Frequency| |She |754 |756|Gender | |DUI |762 |764|Legal_Issues | +------------------+-----+---+-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_sdoh_wip| |Compatibility:|Healthcare NLP 4.2.8+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|1.5 MB| ## Benchmarking ```bash label tp fp fn total precision recall f1 Other_SDoH_Keywords 317.0 84.0 112.0 429.0 0.790524 0.738928 0.763855 Education 116.0 38.0 31.0 147.0 0.753247 0.789116 0.770764 Population_Group 27.0 4.0 11.0 38.0 0.870968 0.710526 0.782609 Quality_Of_Life 
146.0 37.0 57.0 203.0 0.797814 0.719212 0.756477 Housing 809.0 91.0 118.0 927.0 0.898889 0.872708 0.885605 Substance_Frequency 74.0 19.0 44.0 118.0 0.795699 0.627119 0.701422 Smoking 136.0 4.0 2.0 138.0 0.971429 0.985507 0.978417 Eating_Disorder 40.0 2.0 0.0 40.0 0.952381 1.000000 0.975610 Obesity 16.0 1.0 5.0 21.0 0.941176 0.761905 0.842105 Healthcare_Institution 117.0 36.0 57.0 174.0 0.764706 0.672414 0.715596 Financial_Status 222.0 47.0 128.0 350.0 0.825279 0.634286 0.717286 Age 1328.0 109.0 48.0 1376.0 0.924148 0.965116 0.944188 Chidhood_Event 30.0 0.0 24.0 54.0 1.000000 0.555556 0.714286 Exercise 52.0 17.0 31.0 83.0 0.753623 0.626506 0.684211 Communicable_Disease 61.0 5.0 10.0 71.0 0.924242 0.859155 0.890511 Hypertension 45.0 1.0 12.0 57.0 0.978261 0.789474 0.873786 Other_Disease 1065.0 229.0 119.0 1184.0 0.823029 0.899493 0.859564 Violence_Or_Abuse 98.0 26.0 53.0 151.0 0.790323 0.649007 0.712727 Spiritual_Beliefs 94.0 9.0 21.0 115.0 0.912621 0.817391 0.862385 Employment 3797.0 272.0 288.0 4085.0 0.933153 0.929498 0.931322 Social_Exclusion 38.0 6.0 14.0 52.0 0.863636 0.730769 0.791667 Access_To_Care 810.0 95.0 160.0 970.0 0.895028 0.835052 0.864000 Marital_Status 177.0 4.0 9.0 186.0 0.977901 0.951613 0.964578 Diet 110.0 34.0 30.0 140.0 0.763889 0.785714 0.774648 Social_Support 1243.0 197.0 99.0 1342.0 0.863194 0.926230 0.893602 Disability 94.0 4.0 9.0 103.0 0.959184 0.912621 0.935323 Mental_Health 817.0 99.0 216.0 1033.0 0.891921 0.790900 0.838379 Alcohol 592.0 32.0 28.0 620.0 0.948718 0.954839 0.951768 Insurance_Status 145.0 23.0 32.0 177.0 0.863095 0.819209 0.840580 Substance_Quantity 107.0 42.0 39.0 146.0 0.718121 0.732877 0.725424 Hyperlipidemia 14.0 1.0 2.0 16.0 0.933333 0.875000 0.903226 Family_Member 4255.0 110.0 73.0 4328.0 0.974800 0.983133 0.978949 Legal_Issues 71.0 13.0 20.0 91.0 0.845238 0.780220 0.811429 Race_Ethnicity 81.0 9.0 7.0 88.0 0.900000 0.920455 0.910112 Gender 9698.0 183.0 193.0 9891.0 0.981480 0.980487 0.980983 Geographic_Entity 189.0 
18.0 22.0 211.0 0.913043 0.895735 0.904306 Sexual_Orientation 21.0 0.0 3.0 24.0 1.000000 0.875000 0.933333 Transportation 27.0 2.0 27.0 54.0 0.931034 0.500000 0.650602 Sexual_Activity 56.0 4.0 24.0 80.0 0.933333 0.700000 0.800000 Language 35.0 6.0 2.0 37.0 0.853659 0.945946 0.897436 Substance_Use 400.0 40.0 24.0 424.0 0.909091 0.943396 0.925926 ``` --- layout: model title: Pipeline to Detect Anatomical Structures (Single Entity - biobert) author: John Snow Labs name: ner_anatomy_coarse_biobert_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_anatomy_coarse_biobert](https://nlp.johnsnowlabs.com/2021/03/31/ner_anatomy_coarse_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_biobert_pipeline_en_3.4.1_3.0_1647873505075.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_biobert_pipeline_en_3.4.1_3.0_1647873505075.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_anatomy_coarse_biobert_pipeline", "en", "clinical/models") pipeline.annotate("content in the lung tissue") ``` ```scala val pipeline = new PretrainedPipeline("ner_anatomy_coarse_biobert_pipeline", "en", "clinical/models") pipeline.annotate("content in the lung tissue") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.anatomy_coarse_biobert.pipeline").predict("""content in the lung tissue""") ```
## Results ```bash | | ner_chunk | entity | |---:|:------------------|:----------| | 0 | lung tissue | Anatomy | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_anatomy_coarse_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.0 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter --- layout: model title: Chinese Word Segmentation author: John Snow Labs name: wordseg_large date: 2021-01-03 task: Word Segmentation language: zh edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, word_segmentation, zh, cn] supported: true annotator: WordSegmenterModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WordSegmenterModel (WSM) is based on a maximum entropy probability model that detects word boundaries in Chinese text. Chinese text is written without white space between words, so a computer-based application cannot know _a priori_ which sequence of ideograms forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step. For this model, we created a large curated data set from the Chinese Treebank, Weibo, and SIGHAN 2005 data sets, and trained the neural network model as described in the research paper (Xue, Nianwen. "Chinese word segmentation as character tagging." International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing. 2003.). 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_large_zh_2.7.0_2.4_1609681406666.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_large_zh_2.7.0_2.4_1609681406666.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh")\ .setInputCols("document")\ .setOutputCol("token") pipeline = Pipeline(stages=[document_assembler, word_segmenter]) ws_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) example = spark.createDataFrame([['然而,这样的处理也衍生了一些问题。']], ["text"]) result = ws_model.transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh") .setInputCols("document") .setOutputCol("token") val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter)) val data = Seq("然而,这样的处理也衍生了一些问题。").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""然而,这样的处理也衍生了一些问题。"""] token_df = nlu.load('zh.segment_words.large').predict(text, output_level='token') token_df ```
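The character-tagging formulation from the description (Xue, 2003) labels each character as B (word-begin), M (middle), E (end), or S (single-character word); segmentation then just reads words off the tag sequence. A minimal decoding sketch follows — the tag sequence is illustrative, not actual model output.

```python
# Decode a BMES character-tag sequence into words: a word ends at every
# E or S tag, so segmentation is a single pass over the characters.
def decode_bmes(chars, tags):
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        buf += ch
        if tag in ("E", "S"):
            words.append(buf)
            buf = ""
    if buf:  # tolerate a truncated tag sequence
        words.append(buf)
    return words

chars = list("这样的处理")
tags = ["B", "E", "S", "B", "E"]
print(decode_bmes(chars, tags))  # ['这样', '的', '处理']
```

The model's job is predicting the tags; once they are known, decoding is deterministic as above.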
## Results ```bash +----------------------------------+--------------------------------------------------------+ |text |result | +----------------------------------+--------------------------------------------------------+ |然而,这样的处理也衍生了一些问题。|[然而, ,, 这样, 的, 处理, 也, 衍生, 了, 一些, 问题, 。]| +----------------------------------+--------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wordseg_large| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[token]| |Language:|zh| ## Data Source cn_wordseg_large_train.chartag ## Benchmarking ```bash | Model | precision | recall | f1-score | |---------------|--------------|--------------|--------------| | WORDSEG_CTB | 0.6453 | 0.6341 | 0.6397 | | WORDSEG_WEIBO | 0.5454 | 0.5655 | 0.5553 | | WORDSEG_MSRA | 0.5984 | 0.6088 | 0.6035 | | WORDSEG_PKU | 0.6094 | 0.6321 | 0.6206 | | WORDSEG_LARGE | 0.6326 | 0.6269 | 0.6297 | ``` --- layout: model title: Mapping ICD10CM Codes with Their Corresponding UMLS Codes author: John Snow Labs name: icd10cm_umls_mapper date: 2022-06-26 tags: [icd10cm, umls, chunk_mapper, clinical, licensed, en] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps ICD10CM codes to their corresponding codes in the Unified Medical Language System (UMLS). 
## Predicted Entities `umls_code` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10cm_umls_mapper_en_3.5.3_3.0_1656278690210.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10cm_umls_mapper_en_3.5.3_3.0_1656278690210.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") icd10cm_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_icd10cm","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("icd10cm_code")\ .setDistanceFunction("EUCLIDEAN") chunkerMapper = ChunkMapperModel\ .pretrained("icd10cm_umls_mapper", "en", "clinical/models")\ .setInputCols(["icd10cm_code"])\ .setOutputCol("umls_mappings")\ .setRels(["umls_code"]) pipeline = Pipeline(stages = [ documentAssembler, sbert_embedder, icd10cm_resolver, chunkerMapper ]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) light_pipeline = LightPipeline(model) result = light_pipeline.fullAnnotate("Neonatal skin infection") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sbert_embeddings") val icd10cm_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_icd10cm", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("icd10cm_code") .setDistanceFunction("EUCLIDEAN") val chunkerMapper = ChunkMapperModel .pretrained("icd10cm_umls_mapper", "en", "clinical/models") .setInputCols(Array("icd10cm_code")) .setOutputCol("umls_mappings") .setRels(Array("umls_code")) val pipeline = new Pipeline().setStages(Array( documentAssembler, sbert_embedder, icd10cm_resolver, chunkerMapper )) val data = Seq("Neonatal skin infection").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu 
nlu.load("en.icd10cm_to_umls").predict("""Neonatal skin infection""") ```
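Conceptually, the `ChunkMapperModel` stage is a code-to-code lookup applied after the resolver: once an ICD-10-CM code has been produced, its UMLS counterpart comes from a mapping table. A tiny pure-Python sketch of that idea follows; the P394 entry mirrors this card's Results table, while the second entry and the helper function are hypothetical, not the model's actual vocabulary.

```python
# Hypothetical excerpt of an ICD-10-CM -> UMLS mapping table. Only P394 is
# taken from this card's Results; E119 is an illustrative, assumed entry.
icd10cm_to_umls = {
    "P394": "C0456111",  # Neonatal skin infection
    "E119": "C0011860",  # hypothetical entry
}

def map_code(icd10cm_code, rel="umls_code"):
    """Return the mapping in a shape loosely modeled on a chunk-mapper hit."""
    umls = icd10cm_to_umls.get(icd10cm_code)
    return {"relation": rel, "mapping": umls} if umls else None

print(map_code("P394"))  # {'relation': 'umls_code', 'mapping': 'C0456111'}
```

The pretrained mapper wraps a much larger table (and the `setRels` relation filter), but the lookup semantics are the same.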
## Results ```bash | | ner_chunk | icd10cm_code | umls_mappings | |---:|:------------------------|:---------------|:----------------| | 0 | Neonatal skin infection | P394 | C0456111 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd10cm_umls_mapper| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[icd10cm_code]| |Output Labels:|[mappings]| |Language:|en| |Size:|942.9 KB| --- layout: model title: English Named Entity Recognition (from surrey-nlp) author: John Snow Labs name: roberta_ner_roberta_large_finetuned_abbr date: 2022-05-03 tags: [roberta, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-large-finetuned-abbr` is an English model originally trained by `surrey-nlp`. ## Predicted Entities `LF`, `AC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_ner_roberta_large_finetuned_abbr_en_3.4.2_3.0_1651594192589.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_ner_roberta_large_finetuned_abbr_en_3.4.2_3.0_1651594192589.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_roberta_large_finetuned_abbr","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_roberta_large_finetuned_abbr","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.ner.roberta_large_finetuned_abbr").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_ner_roberta_large_finetuned_abbr| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/surrey-nlp/roberta-large-finetuned-abbr - https://paperswithcode.com/sota?task=Token+Classification&dataset=surrey-nlp%2FPLOD-unfiltered --- layout: model title: English Bert Embeddings (from alexanderfalk) author: John Snow Labs name: bert_embeddings_danbert_small_cased date: 2022-04-11 tags: [bert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `danbert-small-cased` is an English model originally trained by `alexanderfalk`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_danbert_small_cased_en_3.4.2_3.0_1649672086620.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_danbert_small_cased_en_3.4.2_3.0_1649672086620.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_danbert_small_cased","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_danbert_small_cased","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.danbert_small_cased").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_danbert_small_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|313.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/alexanderfalk/danbert-small-cased --- layout: model title: Smaller BERT Embeddings (L-4_H-128_A-2) author: John Snow Labs name: small_bert_L4_128 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L4_128_en_2.6.0_2.4_1598344330158.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L4_128_en_2.6.0_2.4_1598344330158.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L4_128", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L4_128", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L4_128').predict(text, output_level='token') embeddings_df ```
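The knowledge-distillation setup mentioned in the description trains the compact student to match the teacher's softened output distribution (via a temperature) alongside the usual hard-label loss. A minimal sketch of such an objective follows; the logits, temperature, and weighting are illustrative assumptions, not the values used to train this model.

```python
import math

# Sketch of a distillation objective: KL-style soft-target term (teacher
# probabilities softened by temperature T) plus the standard hard-label
# cross-entropy, mixed by alpha. The T*T factor keeps gradient scales
# comparable across temperatures, as is conventional.
def softmax(logits, T=1.0):
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft = -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
    hard = -math.log(softmax(student_logits)[hard_label])
    return alpha * (T * T) * soft + (1 - alpha) * hard

# A student that agrees with the teacher incurs a lower loss than one that
# disagrees, which is the whole point of the soft-target term.
print(distillation_loss([2.0, 0.5, -1.0], [2.5, 0.3, -0.8], hard_label=0))
```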
{:.h2_title} ## Results ```bash token en_embed_bert_small_L4_128_embeddings I [0.5109787583351135, 1.6565966606140137, 2.695.... love [1.0555483102798462, 1.8791943788528442, 1.285... NLP [-0.23064681887626648, 0.939659833908081, 1.77... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|small_bert_L4_128| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|128| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1) --- layout: model title: English T5ForConditionalGeneration Base Cased model (from google) author: John Snow Labs name: t5_efficient_base_kv128 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-kv128` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_kv128_en_4.3.0_3.0_1675112746492.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_kv128_en_4.3.0_3.0_1675112746492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_base_kv128","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_base_kv128","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_base_kv128| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|637.5 MB| ## References - https://huggingface.co/google/t5-efficient-base-kv128 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Firat) author: John Snow Labs name: distilbert_qa_firat_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Firat`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_firat_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768588053.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_firat_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768588053.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_firat_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_firat_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
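Under the hood, extractive QA models of this kind score every token as a potential answer start and as a potential answer end, and the answer is the highest-scoring valid span. A minimal pure-Python sketch of that span selection, with hypothetical logits not produced by this model:

```python
def best_span(start_logits, end_logits, max_len=15):
    # Pick (i, j) maximizing start_logits[i] + end_logits[j] with i <= j
    # and a bounded span length, as in standard SQuAD post-processing.
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley"]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3]
end   = [0.1, 0.1, 0.2, 4.5, 0.1, 0.1, 0.1, 0.1, 0.4]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```

The annotator performs this selection internally and returns the decoded span in the `answer` column.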
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_firat_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Firat/distilbert-base-uncased-finetuned-squad --- layout: model title: German Bert Embeddings (Base, Cased) author: John Snow Labs name: bert_embeddings_gbert_base date: 2022-04-11 tags: [bert, embeddings, de, open_source] task: Embeddings language: de edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `gbert-base` is a German model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_gbert_base_de_3.4.2_3.0_1649675902802.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_gbert_base_de_3.4.2_3.0_1649675902802.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_gbert_base","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_gbert_base","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.embed.gbert_base").predict("""Ich liebe Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_gbert_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|412.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/deepset/gbert-base - https://arxiv.org/pdf/2010.10906.pdf - https://arxiv.org/pdf/2010.10906.pdf - https://deepset.ai/german-bert - https://deepset.ai/germanquad - https://github.com/deepset-ai/FARM - https://github.com/deepset-ai/haystack/ - https://twitter.com/deepset_ai - https://www.linkedin.com/company/deepset-ai/ - https://haystack.deepset.ai/community/join - https://github.com/deepset-ai/haystack/discussions - https://deepset.ai - http://www.deepset.ai/jobs --- layout: model title: XLNet Embeddings (Base) author: John Snow Labs name: xlnet_base_cased date: 2020-04-28 task: Embeddings language: en nav_key: models edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: XlnetEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Additionally, XLNet employs Transformer-XL as the backbone model, exhibiting excellent performance for language tasks involving long context. Overall, XLNet achieves state-of-the-art (SOTA) results on various downstream language tasks including question answering, natural language inference, sentiment analysis, and document ranking. 
The details are described in the paper "[XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlnet_base_cased_en_2.5.0_2.4_1588074114942.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlnet_base_cased_en_2.5.0_2.4_1588074114942.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = XlnetEmbeddings.pretrained("xlnet_base_cased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = XlnetEmbeddings.pretrained("xlnet_base_cased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.xlnet_base_cased').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash token en_embed_xlnet_base_cased_embeddings I [0.0027268705889582634, -3.5811028480529785, 0... love [-4.020033836364746, -2.2760159969329834, 0.88... NLP [-0.2549888491630554, -2.2768502235412598, 1.1... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlnet_base_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.5.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/zihangdai/xlnet](https://github.com/zihangdai/xlnet) --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from Ninh) author: John Snow Labs name: xlmroberta_ner_ninh_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `Ninh`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_ninh_base_finetuned_panx_de_4.1.0_3.0_1660430037495.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_ninh_base_finetuned_panx_de_4.1.0_3.0_1660430037495.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_ninh_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_ninh_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
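The `NerConverter` stage above groups token-level B-/I- tags into entity chunks. A rough pure-Python approximation of that grouping, with hypothetical tokens and tags for illustration:

```python
def bio_to_chunks(tokens, tags):
    # Group B-/I- tagged tokens into (text, label) chunks,
    # approximating what NerConverter does on the ner column.
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Angela", "Merkel", "besucht", "Berlin"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(bio_to_chunks(tokens, tags))
# [('Angela Merkel', 'PER'), ('Berlin', 'LOC')]
```

The real converter also carries character offsets and confidence metadata through to the `ner_chunk` column.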
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_ninh_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Ninh/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Pipeline to Detect PHI in Text author: John Snow Labs name: ner_deid_sd_large_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, deidentification, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_deid_sd_large](https://nlp.johnsnowlabs.com/2021/04/01/ner_deid_sd_large_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_sd_large_pipeline_en_3.4.1_3.0_1647870104226.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_sd_large_pipeline_en_3.4.1_3.0_1647870104226.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_sd_large_pipeline", "en", "clinical/models") pipeline.annotate("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""") ``` ```scala val pipeline = new PretrainedPipeline("ner_deid_sd_large_pipeline", "en", "clinical/models") pipeline.annotate("A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.") ``` {:.nlu-block} ```python import nlu nlu.load("en.deid.med_ner_large.pipeline").predict("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""") ```
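Once the pipeline has produced PHI chunks, a common follow-up is masking them in the source text. A simplified sketch, with hard-coded chunks for illustration; Healthcare NLP's own de-identification annotators handle character offsets, obfuscation, and overlapping entities properly:

```python
def mask_entities(text, chunks):
    # Replace each detected chunk with its entity label, longest chunk first
    # so a shorter chunk never clobbers part of a longer one.
    for chunk, label in sorted(chunks, key=lambda c: len(c[0]), reverse=True):
        text = text.replace(chunk, f"<{label}>")
    return text

text = "Record date : 2093-01-13, David Hale, M.D."
chunks = [("2093-01-13", "DATE"), ("David Hale", "NAME")]
print(mask_entities(text, chunks))
# Record date : <DATE>, <NAME>, M.D.
```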
## Results ```bash +-----------------------------+--------+ |chunks |entities| +-----------------------------+--------+ |2093-01-13 |DATE | |David Hale |NAME | |Hendrickson, Ora |NAME | |7194334 |ID | |01/13/93 |DATE | |Oliveira |NAME | |1-11-2000 |DATE | |Cocke County Baptist Hospital|LOCATION| |0295 Keats Street |LOCATION| |786-5227 |CONTACT | |Brothers Coal-Mine |LOCATION| +-----------------------------+--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_sd_large_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Bangla BertForMaskedLM Cased model (from neuralspace-reverie) author: John Snow Labs name: bert_embeddings_indic_transformers date: 2022-12-02 tags: [bn, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: bn edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `indic-transformers-bn-bert` is a Bangla model originally trained by `neuralspace-reverie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_bn_4.2.4_3.0_1670022315370.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_bn_4.2.4_3.0_1670022315370.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","bn") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","bn") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_indic_transformers| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|bn| |Size:|505.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-bn-bert - https://oscar-corpus.com/ --- layout: model title: Relation Extraction Between Dates and Clinical Entities (ReDL) author: John Snow Labs name: redl_date_clinical_biobert date: 2023-01-14 tags: [licensed, en, clinical, relation_extraction, tensorflow] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Identify whether a test was conducted, or a diagnosis was made, on a specific date by classifying the relations between clinical entities and dates. `1` means the date and the clinical entity are related; `0` means they are not. ## Predicted Entities `1`, `0` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CLINICAL_DATE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.1.Clinical_Relation_Extraction_BodyParts_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_date_clinical_biobert_en_4.2.4_3.0_1673731277460.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_date_clinical_biobert_en_4.2.4_3.0_1673731277460.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") words_embedder = WordEmbeddingsModel()\ .pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentences", "tokens"])\ .setOutputCol("embeddings") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") events_ner_tagger = MedicalNerModel.pretrained("ner_events_clinical", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_chunker = NerConverterInternal()\ .setInputCols(["sentences", "tokens", "ner_tags"])\ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel() \ .pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") events_re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setOutputCol("re_ner_chunks") events_re_Model = RelationExtractionDLModel() \ .pretrained('redl_date_clinical_biobert', "en", "clinical/models")\ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[ documenter, sentencer, tokenizer, words_embedder, pos_tagger, events_ner_tagger, ner_chunker, dependency_parser, events_re_ner_chunk_filter, events_re_Model]) data = spark.createDataFrame([['''This 73 y/o patient had CT on 1/12/95, with progressive memory and cognitive decline since 8/11/94.''']]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() 
.setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val events_ner_tagger = MedicalNerModel.pretrained("ner_events_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_chunker = new NerConverterInternal() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") val events_re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setOutputCol("re_ner_chunks") val events_re_Model = RelationExtractionDLModel.pretrained("redl_date_clinical_biobert", "en", "clinical/models") .setPredictionThreshold(0.5f) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, words_embedder, pos_tagger, events_ner_tagger, ner_chunker, dependency_parser, events_re_ner_chunk_filter, events_re_Model)) val data = Seq("This 73 y/o patient had CT on 1/12/95, with progressive memory and cognitive decline since 8/11/94.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.date").predict("""This 73 y/o patient had CT on 1/12/95, with progressive memory and cognitive decline since 8/11/94.""") ```
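The `setPredictionThreshold(0.5)` call above keeps only relation candidates whose confidence clears 0.5. The same filtering, sketched in pure Python over a hypothetical list of predictions shaped like the result rows:

```python
def filter_relations(relations, threshold=0.5):
    # Keep only predicted relations whose confidence clears the threshold,
    # mirroring what the model's prediction threshold does internally.
    return [r for r in relations if r["confidence"] >= threshold]

relations = [
    {"chunk1": "CT", "chunk2": "1/12/95", "relation": "1", "confidence": 0.99998},
    {"chunk1": "CT", "chunk2": "decline", "relation": "0", "confidence": 0.41},
]
kept = filter_relations(relations, threshold=0.5)
print(len(kept))  # 1
```

Raising the threshold trades recall for precision; low-confidence pairs are simply dropped from the `relations` output.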
## Results ```bash +--------+-------+-------------+-----------+--------------------+-------+-------------+-----------+--------------------+----------+ |relation|entity1|entity1_begin|entity1_end| chunk1|entity2|entity2_begin|entity2_end| chunk2|confidence| +--------+-------+-------------+-----------+--------------------+-------+-------------+-----------+--------------------+----------+ | 1| TEST| 24| 25| CT| DATE| 30| 36| 1/12/95|0.99997973| | 1| TEST| 24| 25| CT|PROBLEM| 44| 83|progressive memor...| 0.9998983| | 1| TEST| 24| 25| CT| DATE| 91| 97| 8/11/94| 0.9997316| | 1| DATE| 30| 36| 1/12/95|PROBLEM| 44| 83|progressive memor...| 0.9998915| | 1| DATE| 30| 36| 1/12/95| DATE| 91| 97| 8/11/94| 0.9997931| | 1|PROBLEM| 44| 83|progressive memor...| DATE| 91| 97| 8/11/94| 0.9998667| +--------+-------+-------------+-----------+--------------------+-------+-------------+-----------+--------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_date_clinical_biobert| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|401.7 MB| ## References Trained on an internal dataset. ## Benchmarking ```bash label Recall Precision F1 Support 0 0.738 0.729 0.734 84 1 0.945 0.947 0.946 416 Avg. 0.841 0.838 0.840 - ``` --- layout: model title: Turkish BertForQuestionAnswering Base Cased model (from husnu) author: John Snow Labs name: bert_qa_base_turkish_128k_cased_finetuned_lr_2e_05_epochs_3 date: 2022-07-07 tags: [tr, open_source, bert, question_answering] task: Question Answering language: tr edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`bert-base-turkish-128k-cased-finetuned_lr-2e-05_epochs-3` is a Turkish model originally trained by `husnu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_turkish_128k_cased_finetuned_lr_2e_05_epochs_3_tr_4.0.0_3.0_1657183554228.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_turkish_128k_cased_finetuned_lr_2e_05_epochs_3_tr_4.0.0_3.0_1657183554228.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_turkish_128k_cased_finetuned_lr_2e_05_epochs_3","tr") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_turkish_128k_cased_finetuned_lr_2e_05_epochs_3","tr") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_turkish_128k_cased_finetuned_lr_2e_05_epochs_3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|tr| |Size:|689.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/husnu/bert-base-turkish-128k-cased-finetuned_lr-2e-05_epochs-3 --- layout: model title: Fast Neural Machine Translation Model from English to Setswana author: John Snow Labs name: opus_mt_en_tn date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, tn, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `tn` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_tn_xx_2.7.0_2.4_1609167140357.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_tn_xx_2.7.0_2.4_1609167140357.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentences") marian = MarianTransformer.pretrained("opus_mt_en_tn", "xx")\ .setInputCols(["sentences"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_tn", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.tn').predict(text, output_level='sentence') opus_df ```
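The pipeline detects sentences before translating because Marian models operate on one sentence at a time. A naive rule-based splitter illustrating that preprocessing step; `SentenceDetectorDLModel` uses a trained model rather than this regex, which is only a rough stand-in:

```python
import re

def split_sentences(text):
    # Naive splitter: break on ., ! or ? followed by whitespace.
    # The DL sentence detector handles abbreviations and edge cases
    # this regex gets wrong.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("Hello there. How are you? Fine!"))
# ['Hello there.', 'How are you?', 'Fine!']
```

Each resulting sentence annotation is then fed to the `MarianTransformer` stage as an independent translation unit.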
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_tn| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Chinese NER Model author: John Snow Labs name: bert_token_classifier_chinese_ner date: 2021-12-07 tags: [chinese, token_classifier, bert, zh, open_source] task: Named Entity Recognition language: zh edition: Spark NLP 3.3.2 spark_version: 2.4 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model was imported from `Hugging Face` and has been fine-tuned for the traditional Chinese language, leveraging `Bert` embeddings and `BertForTokenClassification` for NER purposes. ## Predicted Entities `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_chinese_ner_zh_3.3.2_2.4_1638881767667.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_chinese_ner_zh_3.3.2_2.4_1638881767667.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_chinese_ner", "zh")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = """我是莎拉,我从 1999 年 11 月 2 日。开始在斯图加特的梅赛德斯-奔驰公司工作。""" result = model.transform(spark.createDataFrame([[text]]).toDF("text")) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_chinese_ner", "zh") .setInputCols(Array("sentence","token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val example = Seq("我是莎拉,我从 1999 年 11 月 2 日。开始在斯图加特的梅赛德斯-奔驰公司工作。").toDS.toDF("text") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("zh.ner.bert_token").predict("""我是莎拉,我从 1999 年 11 月 2 日。开始在斯图加特的梅赛德斯-奔驰公司工作。""") ```
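Downstream of the token classifier, the `NerConverter` stage merges per-token IOB tags into labeled chunks. The merging logic can be sketched in plain Python (tokens and tags below are hand-written examples, not model output):

```python
# Minimal sketch of NerConverter-style chunk merging: consecutive B-/I- tags
# with the same label are joined into one chunk; O tags close the current chunk.
def iob_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O" or a label mismatch closes the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(iob_to_chunks(
    ["莎拉", "在", "斯图加特", "工作"],
    ["B-PERSON", "O", "B-GPE", "O"]))
# -> [('莎拉', 'PERSON'), ('斯图加特', 'GPE')]
```

The actual annotator also tracks character offsets and metadata; this sketch only shows the tag-merging idea.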
## Results ```bash +-----------------+---------+ |chunk |ner_label| +-----------------+---------+ |莎拉 |PERSON | |1999 年 11 月 2 |DATE | |斯图加特 |GPE | |梅赛德斯-奔驰公司 |ORG | +-----------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_chinese_ner| |Compatibility:|Spark NLP 3.3.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|zh| |Case sensitive:|true| |Max sentence length:|256| ## Data Source [https://huggingface.co/ckiplab/bert-base-chinese-ner](https://huggingface.co/ckiplab/bert-base-chinese-ner) ## Benchmarking ```bash label score f1 0.8118 ``` --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC2GM_Gene_Imbalancedscibert_scivocab_cased date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC2GM-Gene_Imbalancedscibert_scivocab_cased` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `GENE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC2GM_Gene_Imbalancedscibert_scivocab_cased_en_4.0.0_3.0_1657108037612.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC2GM_Gene_Imbalancedscibert_scivocab_cased_en_4.0.0_3.0_1657108037612.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC2GM_Gene_Imbalancedscibert_scivocab_cased","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC2GM_Gene_Imbalancedscibert_scivocab_cased","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC2GM_Gene_Imbalancedscibert_scivocab_cased| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|410.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC2GM-Gene_Imbalancedscibert_scivocab_cased --- layout: model title: English DistilBertForQuestionAnswering model (from charlieoneill) author: John Snow Labs name: distilbert_qa_base_uncased_gradient_clinic date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-gradient-clinic` is an English model originally trained by `charlieoneill`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_gradient_clinic_en_4.0.0_3.0_1654727035184.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_gradient_clinic_en_4.0.0_3.0_1654727035184.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_gradient_clinic","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_gradient_clinic","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.base_uncased.by_charlieoneill").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
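For context, extractive QA models of this kind score every candidate (start, end) token pair with per-token start and end logits and return the highest-scoring span. A minimal plain-Python sketch of that selection step, with made-up logits rather than real model output:

```python
# Sketch of extractive-QA span selection: maximize start_logit + end_logit
# over all pairs with start <= end (bounded by a maximum answer length).
def best_span(start_logits, end_logits, max_len=15):
    best, best_score = (0, 0), float("-inf")
    for s, sl in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = sl + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 3.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0]  # invented logits
end   = [0.0, 0.1, 0.0, 2.5, 0.0, 0.0, 0.0, 0.0, 0.4, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```

Real models also handle "no answer" and subword-to-word alignment; those details are omitted here.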
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_gradient_clinic| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/charlieoneill/distilbert-base-uncased-gradient-clinic --- layout: model title: Thai Word Segmentation author: John Snow Labs name: wordseg_best date: 2021-01-13 task: Word Segmentation language: th edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [th, word_segmentation, open_source] supported: true annotator: WordSegmenterModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WordSegmenterModel (WSM) is based on a maximum-entropy probability model that detects word boundaries in Thai text. Thai text is written without white space between words, so a computer-based application cannot know _a priori_ which sequence of characters forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step. References: - Xue, Nianwen. "Chinese word segmentation as character tagging." International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing. 2003. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_best_th_2.7.0_2.4_1610543628078.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_best_th_2.7.0_2.4_1610543628078.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline as a substitute for the Tokenizer stage.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") word_segmenter = WordSegmenterModel.pretrained('wordseg_best', 'th')\ .setInputCols("document")\ .setOutputCol("token") pipeline = Pipeline(stages=[document_assembler, word_segmenter]) example = spark.createDataFrame([['จวนจะถึงร้านที่คุณจองโต๊ะไว้แล้วจ้ะ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val word_segmenter = WordSegmenterModel.pretrained("wordseg_best", "th") .setInputCols("document") .setOutputCol("token") val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter)) val data = Seq("จวนจะถึงร้านที่คุณจองโต๊ะไว้แล้วจ้ะ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Mona Lisa เป็นภาพวาดสีน้ำมันในศตวรรษที่ 16 ที่สร้างโดย Leonardo จัดขึ้นที่พิพิธภัณฑ์ลูฟร์ในปารีส"""] token_df = nlu.load('th.segment_words').predict(text) token_df ```
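The character-tagging formulation cited in the description (Xue, 2003) treats segmentation as labeling each character as B(egin), M(iddle), E(nd), or S(ingle) and then decoding the tag sequence into words. A plain-Python sketch of the decoding step, with hand-written tags rather than output of `wordseg_best`:

```python
# Decode a BMES character-tag sequence into words (Xue, 2003 scheme).
# Tags are illustrative examples, not produced by the model.
def decode_bmes(chars, tags):
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":          # single-character word
            if current:         # recover from a malformed sequence
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":        # word starts
            if current:
                words.append(current)
            current = ch
        elif tag == "M":        # word continues
            current += ch
        else:                   # "E": word ends
            words.append(current + ch)
            current = ""
    if current:
        words.append(current)
    return words

print(decode_bmes("จวนจะ", ["B", "M", "E", "B", "E"]))  # -> ['จวน', 'จะ']
```

The trained model's job is to predict these tags from character context; the decoding itself is this simple.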
## Results ```bash +-----------------------------------+---------------------------------------------------------+ |text |result | +-----------------------------------+---------------------------------------------------------+ |จวนจะถึงร้านที่คุณจองโต๊ะไว้แล้วจ้ะ|[จวน, จะ, ถึง, ร้าน, ที่, คุณ, จอง, โต๊ะ, ไว้, แล้ว, จ้ะ]| +-----------------------------------+---------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wordseg_best| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[token]| |Language:|th| ## Data Source The model was trained on the [BEST](http://thailang.nectec.or.th/best) corpus from the National Electronics and Computer Technology Center (NECTEC). References: > - Krit Kosawat, Monthika Boriboon, Patcharika Chootrakool, Ananlada Chotimongkol, Supon Klaithin, Sarawoot Kongyoung, Kanyanut Kriengket, Sitthaa Phaholphinyo, Sumonmas Purodakananda, Tipraporn Thanakulwarapas, and Chai Wutiwiwatchai, "BEST 2009: Thai word segmentation software contest," in Proc. 8th Int. Symp. Natural Language Process. (SNLP), Bangkok, Thailand, Oct.20-22, 2009, pp.83-88. > - Monthika Boriboon, Kanyanut Kriengket, Patcharika Chootrakool, Sitthaa Phaholphinyo, Sumonmas Purodakananda, Tipraporn Thanakulwarapas, and Krit Kosawat, "BEST corpus development and analysis," in Proc. 2nd Int. Conf. Asian Language Process. (IALP), Singapore, Dec.7-9, 2009, pp.322-327. 
## Benchmarking ```bash | Model | precision | recall | f1-score | |--------------|-----------|--------|----------| | WORDSEG_BEST | 0.4791 | 0.6245 | 0.5422 | ``` --- layout: model title: English T5ForConditionalGeneration Tiny Cased model (from google) author: John Snow Labs name: t5_efficient_tiny_nl2 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-nl2` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nl2_en_4.3.0_3.0_1675123819194.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nl2_en_4.3.0_3.0_1675123819194.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_tiny_nl2","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_tiny_nl2","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_tiny_nl2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|39.0 MB| ## References - https://huggingface.co/google/t5-efficient-tiny-nl2 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English asr_wav2vec2_base_100h_with_lm_by_saahith TFWav2Vec2ForCTC from saahith author: John Snow Labs name: pipeline_asr_wav2vec2_base_100h_with_lm_by_saahith date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_with_lm_by_saahith` is an English model originally trained by saahith. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_100h_with_lm_by_saahith_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_with_lm_by_saahith_en_4.2.0_3.0_1664117830342.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_with_lm_by_saahith_en_4.2.0_3.0_1664117830342.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_100h_with_lm_by_saahith', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_100h_with_lm_by_saahith", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_100h_with_lm_by_saahith| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|227.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Detect Adverse Drug Events (healthcare) author: John Snow Labs name: ner_ade_healthcare date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect adverse drug events in tweets, reviews, and medical text using pretrained NER model. ## Predicted Entities `DRUG`, `ADE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_ade_healthcare_en_3.0.0_3.0_1617260836627.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_ade_healthcare_en_3.0.0_3.0_1617260836627.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_ade_healthcare", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_ade_healthcare", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("EXAMPLE_TEXT").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.nlu-block} ```python import nlu nlu.load("en.med_ner.ade.ade_healthcare").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_ade_healthcare| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Benchmarking ```bash +------+------+------+------+-------+---------+------+------+ |entity| tp| fp| fn| total|precision|recall| f1| +------+------+------+------+-------+---------+------+------+ | DRUG|9649.0| 884.0|9772.0|19421.0| 0.9161|0.4968|0.6443| | ADE|5909.0|9508.0|1987.0| 7896.0| 0.3833|0.7484|0.5069| +------+------+------+------+-------+---------+------+------+ +------------------+ | macro| +------------------+ |0.5755909944827655| +------------------+ +------------------+ | micro| +------------------+ |0.6045600310939989| +------------------+ ``` --- layout: model title: Pipeline to Resolve CVX Codes author: John Snow Labs name: cvx_resolver_pipeline date: 2023-03-30 tags: [en, licensed, clinical, resolver, chunk_mapping, cvx, pipeline] task: Pipeline Healthcare language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps entities with their corresponding CVX codes. You’ll just feed your text and it will return the corresponding CVX codes. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/cvx_resolver_pipeline_en_4.3.2_3.2_1680178011294.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/cvx_resolver_pipeline_en_4.3.2_3.2_1680178011294.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline resolver_pipeline = PretrainedPipeline("cvx_resolver_pipeline", "en", "clinical/models") text= "The patient has a history of influenza vaccine, tetanus and DTaP" result = resolver_pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val resolver_pipeline = new PretrainedPipeline("cvx_resolver_pipeline", "en", "clinical/models") val result = resolver_pipeline.fullAnnotate("The patient has a history of influenza vaccine, tetanus and DTaP") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.cvx_pipeline").predict("""The patient has a history of influenza vaccine, tetanus and DTaP""") ```
## Results ```bash +-----------------+---------+--------+ |chunk |ner_chunk|cvx_code| +-----------------+---------+--------+ |influenza vaccine|Vaccine |160 | |tetanus |Vaccine |35 | |DTaP |Vaccine |20 | +-----------------+---------+--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|cvx_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|2.1 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger --- layout: model title: Detect Assertion Status from Oncology Entities author: John Snow Labs name: assertion_oncology_wip date: 2022-10-01 tags: [licensed, clinical, oncology, en, assertion] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects the assertion status of entities related to oncology (including diagnoses, therapies and tests). 
## Predicted Entities `Absent`, `Family`, `Hypothetical`, `Past`, `Possible`, `Present` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_wip_en_4.1.0_3.0_1664641275549.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_wip_en_4.1.0_3.0_1664641275549.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(["Cancer_Dx", "Tumor_Finding", "Cancer_Surgery", "Chemotherapy", "Pathology_Test", "Imaging_Test"]) assertion = AssertionDLModel.pretrained("assertion_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion]) data = spark.createDataFrame([["The patient is suspected to have breast cancer. Family history is positive for other cancers. 
The result of the biopsy was positive."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Cancer_Dx", "Tumor_Finding", "Cancer_Surgery", "Chemotherapy", "Pathology_Test", "Imaging_Test")) val assertion = AssertionDLModel.pretrained("assertion_oncology_wip","en","clinical/models") .setInputCols(Array("sentence","ner_chunk","embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion)) val data = Seq("""The patient is suspected to have breast cancer. Family history is positive for other cancers. The result of the biopsy was positive.""").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.oncology_wip").predict("""The patient is suspected to have breast cancer. Family history is positive for other cancers. The result of the biopsy was positive.""")
```
## Results ```bash | chunk | ner_label | assertion | |:--------------|:---------------|:------------| | breast cancer | Cancer_Dx | Possible | | cancers | Cancer_Dx | Family | | biopsy | Pathology_Test | Past | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_oncology_wip| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion_pred]| |Language:|en| |Size:|1.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label precision recall f1-score support Absent 0.81 0.77 0.79 264.0 Family 0.78 0.82 0.80 34.0 Hypothetical 0.67 0.61 0.64 182.0 Past 0.91 0.93 0.92 1583.0 Possible 0.59 0.59 0.59 51.0 Present 0.89 0.89 0.89 1645.0 macro-avg 0.77 0.77 0.77 3759.0 weighted-avg 0.88 0.88 0.88 3759.0 ``` --- layout: model title: Finnish RobertaForQuestionAnswering (from cgou) author: John Snow Labs name: roberta_qa_fin_RoBERTa_v1_finetuned_squad date: 2022-06-20 tags: [open_source, question_answering, roberta] task: Question Answering language: fi edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fin_RoBERTa-v1-finetuned-squad` is a Finnish model originally trained by `cgou`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_fin_RoBERTa_v1_finetuned_squad_fi_4.0.0_3.0_1655728569389.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_fin_RoBERTa_v1_finetuned_squad_fi_4.0.0_3.0_1655728569389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_fin_RoBERTa_v1_finetuned_squad","fi") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_fin_RoBERTa_v1_finetuned_squad","fi") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("fi.answer_question.squad.roberta.by_cgou").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_fin_RoBERTa_v1_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|fi| |Size:|248.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/cgou/fin_RoBERTa-v1-finetuned-squad --- layout: model title: Mapping Drugs from the KEGG Database to Their Efficacies, Molecular Weights and Corresponding Codes from Other Databases author: John Snow Labs name: kegg_drug_mapper date: 2022-11-21 tags: [drug, efficacy, molecular_weight, cas, pubchem, chebi, ligandbox, nikkaji, pdbcct, chunk_mapper, clinical, en, licensed] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps drugs with their corresponding `efficacy`, `molecular_weight` as well as `CAS`, `PubChem`, `ChEBI`, `LigandBox`, `NIKKAJI`, `PDB-CCD` codes. This model was trained with the data from the KEGG database. ## Predicted Entities `efficacy`, `molecular_weight`, `CAS`, `PubChem`, `ChEBI`, `LigandBox`, `NIKKAJI`, `PDB-CCD` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/kegg_drug_mapper_en_4.2.2_3.0_1669069910375.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/kegg_drug_mapper_en_4.2.2_3.0_1669069910375.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") chunkerMapper = ChunkMapperModel.pretrained("kegg_drug_mapper", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings")\ .setRels(["efficacy", "molecular_weight", "CAS", "PubChem", "ChEBI", "LigandBox", "NIKKAJI", "PDB-CCD"]) pipeline = Pipeline().setStages([ document_assembler, sentence_detector, tokenizer, word_embeddings, ner, converter, chunkerMapper]) text = "She is given OxyContin, folic acid, levothyroxine, Norvasc, aspirin, Neurontin" data = spark.createDataFrame([[text]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val converter = new NerConverter() 
.setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val chunkerMapper = ChunkMapperModel.pretrained("kegg_drug_mapper", "en", "clinical/models") .setInputCols("ner_chunk") .setOutputCol("mappings") .setRels(Array("efficacy", "molecular_weight", "CAS", "PubChem", "ChEBI", "LigandBox", "NIKKAJI", "PDB-CCD")) val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, ner, converter, chunkerMapper)) val text = "She is given OxyContin, folic acid, levothyroxine, Norvasc, aspirin, Neurontin" val data = Seq(text).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.kegg_drug").predict("""She is given OxyContin, folic acid, levothyroxine, Norvasc, aspirin, Neurontin""") ```
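Conceptually, the `kegg_drug_mapper` stage acts as a lookup from each detected drug chunk to its KEGG properties. The plain-Python sketch below imitates that behaviour with two entries hard-coded from the sample output; the `KEGG_MAP` dict and `map_chunks` helper are illustrative only, not part of the Spark NLP API:

```python
# Minimal sketch of the chunk-to-properties lookup a ChunkMapperModel performs.
# The values come from the sample output table; the dict itself is hypothetical.
KEGG_MAP = {
    "aspirin": {
        "efficacy": "Analgesic, Anti-inflammatory, Antipyretic",
        "molecular_weight": 180.1574,
        "CAS": "50-78-2",
        "PDB-CCD": "AIN",
    },
    "Neurontin": {
        "efficacy": "Anticonvulsant, Antiepileptic",
        "molecular_weight": 171.2368,
        "CAS": "60142-96-3",
        "PDB-CCD": "GBN",
    },
}

def map_chunks(chunks, relations):
    """Return the requested relations for each chunk, or None when unknown."""
    return {
        chunk: {rel: KEGG_MAP.get(chunk, {}).get(rel) for rel in relations}
        for chunk in chunks
    }

mappings = map_chunks(["aspirin", "Neurontin"], ["CAS", "PDB-CCD"])
print(mappings["aspirin"]["CAS"])  # 50-78-2
```

The real annotator attaches such mappings as `mappings` annotations on each `ner_chunk`, with one relation per requested `setRels` entry.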
## Results ```bash +-------------+--------------------------------------------------+----------------+----------+-----------+-------+---------+---------+-------+ | ner_chunk| efficacy|molecular_weight| CAS| PubChem| ChEBI|LigandBox| NIKKAJI|PDB-CCD| +-------------+--------------------------------------------------+----------------+----------+-----------+-------+---------+---------+-------+ | OxyContin| Analgesic (narcotic), Opioid receptor agonist| 351.8246| 124-90-3| 7847912.0| 7859.0| D00847|J281.239H| NONE| | folic acid|Anti-anemic, Hematopoietic, Supplement (folic a...| 441.3975| 59-30-3| 7847138.0|27470.0| D00070| J1.392G| FOL| |levothyroxine| Replenisher (thyroid hormone)| 776.87| 51-48-9|9.6024815E7|18332.0| D08125| J4.118A| T44| | Norvasc|Antihypertensive, Vasodilator, Calcium channel ...| 408.8759|88150-42-9|5.1091781E7| 2668.0| D07450| J33.383B| NONE| | aspirin|Analgesic, Anti-inflammatory, Antipyretic, Anti...| 180.1574| 50-78-2| 7847177.0|15365.0| D00109| J2.300K| AIN| | Neurontin| Anticonvulsant, Antiepileptic| 171.2368|60142-96-3| 7847398.0|42797.0| D00332| J39.388F| GBN| +-------------+--------------------------------------------------+----------------+----------+-----------+-------+---------+---------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|kegg_drug_mapper| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|1.0 MB| --- layout: model title: Translate English to Haitian Creole Pipeline author: John Snow Labs name: translate_en_ht date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, ht, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework 
written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `ht` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ht_xx_2.7.0_2.4_1609688301365.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ht_xx_2.7.0_2.4_1609688301365.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ht", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ht", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ht').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_ht| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: German Bert Embeddings (Base, Cased, Old Vocabulary) author: John Snow Labs name: bert_embeddings_bert_base_german_cased_oldvocab date: 2022-04-11 tags: [bert, embeddings, de, open_source] task: Embeddings language: de edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-german-cased-oldvocab` is a German model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_german_cased_oldvocab_de_3.4.2_3.0_1649676274361.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_german_cased_oldvocab_de_3.4.2_3.0_1649676274361.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_german_cased_oldvocab","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Funken NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_german_cased_oldvocab","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Funken NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.embed.bert_base_german_cased_oldvocab").predict("""Ich liebe Funken NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_german_cased_oldvocab| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|409.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/deepset/bert-base-german-cased-oldvocab - https://github.com/deepset-ai/FARM/issues/60 - https://deepset.ai/german-bert - https://deepset.ai/germanquad - https://github.com/deepset-ai/FARM - https://github.com/deepset-ai/haystack/ - https://twitter.com/deepset_ai - https://www.linkedin.com/company/deepset-ai/ - https://haystack.deepset.ai/community/join - https://github.com/deepset-ai/haystack/discussions - https://deepset.ai - http://www.deepset.ai/jobs --- layout: model title: Persian BertForQuestionAnswering model (from SajjadAyoubi) author: John Snow Labs name: bert_qa_bert_base_fa_qa date: 2022-06-02 tags: [fa, open_source, question_answering, bert] task: Question Answering language: fa edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-fa-qa` is a Persian model originally trained by `SajjadAyoubi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_fa_qa_fa_4.0.0_3.0_1654179918056.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_fa_qa_fa_4.0.0_3.0_1654179918056.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_fa_qa","fa") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_fa_qa","fa") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("fa.answer_question.bert.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_fa_qa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fa| |Size:|607.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/SajjadAyoubi/bert-base-fa-qa - https://colab.research.google.com/github/sajjjadayobi/PersianQA/blob/main/notebooks/HowToUse.ipynb --- layout: model title: Translate English to East Slavic languages Pipeline author: John Snow Labs name: translate_en_zle date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, zle, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `zle` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_zle_xx_2.7.0_2.4_1609691744462.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_zle_xx_2.7.0_2.4_1609691744462.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_zle", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_zle", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.zle').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_zle| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Universal Sentence Encoder XLING Many author: John Snow Labs name: tfhub_use_xling_many date: 2020-12-08 task: Embeddings language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 deprecated: true tags: [embeddings, open_source, xx] supported: true annotator: UniversalSentenceEncoder article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The Universal Sentence Encoder Cross-lingual (XLING) module is an extension of the Universal Sentence Encoder that includes training on multiple tasks across languages. The multi-task training setup is based on the paper "Learning Cross-lingual Sentence Representations via a Multi-task Dual Encoder". This specific module is trained on English, French, German, Spanish, Italian, Chinese, Korean, and Japanese tasks, and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and tasks, with the goal of learning text representations that are useful out-of-the-box for a number of applications. The input to the module is variable length text in any of the eight aforementioned languages and the output is a 512 dimensional vector. We note that one does not need to specify the language of the input, as the model was trained such that text across languages with similar meanings will have embeddings with high dot product scores. Note: This model only works on Linux and macOS operating systems and is not compatible with Windows due to the incompatibility of the SentencePiece library. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/tfhub_use_xling_many_xx_2.7.0_2.4_1607440840968.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/tfhub_use_xling_many_xx_2.7.0_2.4_1607440840968.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_xling_many", "xx") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP'], ['Me encanta usar SparkNLP']], ["text"])) ``` ```scala val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_xling_many", "xx") .setInputCols("document") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I love NLP", "Me encanta usar SparkNLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP", "Me encanta usar SparkNLP"] embeddings_df = nlu.load('xx.use.xling_many').predict(text, output_level='sentence') embeddings_df ```
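The description above states that sentences with similar meanings, even across languages, receive embeddings with a high dot product. The self-contained sketch below shows the cosine-similarity comparison such vectors are typically used for; the 4-dimensional vectors are toy stand-ins for the model's real 512-dimensional output, not actual model values:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for the vectors of "I love NLP" / "Me encanta usar SparkNLP".
en_vec = [0.036, 0.007, -0.012, 0.051]
es_vec = [0.034, 0.009, -0.010, 0.048]
unrelated = [-0.040, 0.002, 0.047, -0.015]

print(cosine_similarity(en_vec, es_vec))     # close to 1.0
print(cosine_similarity(en_vec, unrelated))  # much lower
```

With the real model, the same comparison is run on the `sentence_embeddings` vectors pulled from the result DataFrame, and no language tag is needed on the input.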
## Results It gives a 512-dimensional vector of the sentences. ```bash xx_use_xling_many_embeddings sentence 0 [0.03621278703212738, 0.007045685313642025, -0... I love NLP 1 [-0.0060035050846636295, 0.028749311342835426,... Me encanta usar SparkNLP ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|tfhub_use_xling_many| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|xx| ## Data Source [https://tfhub.dev/google/universal-sentence-encoder-xling-many/1](https://tfhub.dev/google/universal-sentence-encoder-xling-many/1) --- layout: model title: Pipeline to Mapping MESH Codes with Their Corresponding UMLS Codes author: John Snow Labs name: mesh_umls_mapping date: 2022-06-27 tags: [mesh, umls, chunk_mapper, pipeline, clinical, licensed, en] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of `mesh_umls_mapper` model. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/mesh_umls_mapping_en_3.5.3_3.0_1656366727552.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/mesh_umls_mapping_en_3.5.3_3.0_1656366727552.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline= PretrainedPipeline("mesh_umls_mapping", "en", "clinical/models") result = pipeline.fullAnnotate("C028491 D019326 C579867") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline= new PretrainedPipeline("mesh_umls_mapping", "en", "clinical/models") val result = pipeline.fullAnnotate("C028491 D019326 C579867") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.mesh.umls").predict("""C028491 D019326 C579867""") ```
## Results ```bash |    | mesh_code | umls_code | |---:|:----------|:----------| |  0 | C028491   | C0043904  | |  1 | D019326   | C0045010  | |  2 | C579867   | C3696376  | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|mesh_umls_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|3.8 MB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel --- layout: model title: English T5ForConditionalGeneration Tiny Cased model (from google) author: John Snow Labs name: t5_efficient_tiny_nh16 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-nh16` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nh16_en_4.3.0_3.0_1675123654269.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nh16_en_4.3.0_3.0_1675123654269.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_tiny_nh16","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_tiny_nh16","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_tiny_nh16| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|64.1 MB| ## References - https://huggingface.co/google/t5-efficient-tiny-nh16 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Extract relations between phenotypic abnormalities and diseases (ReDL) author: John Snow Labs name: redl_human_phenotype_gene_biobert date: 2021-07-24 tags: [relation_extraction, en, licensed, clinical] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.0.3 spark_version: 2.4 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract relations to fully understand the origin of some phenotypic abnormalities and their associated diseases. `1` : Entities are related, `0` : Entities are not related. ## Predicted Entities `0`, `1` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_human_phenotype_gene_biobert_en_3.0.3_2.4_1627120647767.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_human_phenotype_gene_biobert_en_3.0.3_2.4_1627120647767.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = sparknlp.annotators.Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel.pretrained("ner_human_phenotype_gene_clinical", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverter() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") # Set a filter on pairs of named entities which will be treated as relation candidates re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks") # The dataset this model was trained on is annotated at the sentence level. # The model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. 
re_model = RelationExtractionDLModel.pretrained("redl_human_phenotype_gene_biobert", "en", "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) text = """She has a retinal degeneration, hearing loss and renal failure, short stature, Mutations in the SH3PXD2B gene coding for the Tks4 protein are responsible for the autosomal recessive.""" data = spark.createDataFrame([[text]]).toDF("text") p_model = pipeline.fit(data) result = p_model.transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_human_phenotype_gene_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) 
.setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") // The dataset this model was trained on is sentence-wise. // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. val re_model = RelationExtractionDLModel() .pretrained("redl_human_phenotype_gene_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("""She has a retinal degeneration, hearing loss and renal failure, short stature, Mutations in the SH3PXD2B gene coding for the Tks4 protein are responsible for the autosomal recessive.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.humen_phenotype_gene").predict("""She has a retinal degeneration, hearing loss and renal failure, short stature, Mutations in the SH3PXD2B gene coding for the Tks4 protein are responsible for the autosomal recessive.""") ```
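The candidate-generation step above can be pictured with a small sketch: pair every two entity chunks in a sentence and drop pairs that lie too far apart. This is illustrative only; the actual `RENerChunksFilter` measures *syntactic* distance over the dependency tree rather than flat token distance, and the chunk dicts below (`text`, `token_index`) are invented for the example.

```python
from itertools import combinations

def candidate_pairs(chunks, max_distance=10):
    """Pair every two entity chunks and keep pairs within max_distance.

    Illustrative stand-in for relation-candidate filtering; the real
    annotator uses dependency-tree (syntactic) distance, not token distance.
    """
    pairs = []
    for a, b in combinations(chunks, 2):
        if abs(a["token_index"] - b["token_index"]) <= max_distance:
            pairs.append((a["text"], b["text"]))
    return pairs

chunks = [
    {"text": "retinal degeneration", "token_index": 3},
    {"text": "hearing loss", "token_index": 6},
    {"text": "autosomal recessive", "token_index": 30},
]
print(candidate_pairs(chunks, max_distance=10))
# [('retinal degeneration', 'hearing loss')]
```

Raising `setMaxSyntacticDistance` admits more distant pairs as candidates, at the cost of more relation-classifier calls per sentence.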
## Results ```bash | | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |---:|-----------:|:----------|----------------:|--------------:|:---------------------|:----------|----------------:|--------------:|:--------------------|-------------:| | 0 | 0 | HP | 10 | 29 | retinal degeneration | HP | 32 | 43 | hearing loss | 0.893809 | | 1 | 0 | HP | 10 | 29 | retinal degeneration | HP | 49 | 61 | renal failure | 0.958486 | | 2 | 1 | HP | 10 | 29 | retinal degeneration | HP | 162 | 180 | autosomal recessive | 0.65584 | | 3 | 0 | HP | 32 | 43 | hearing loss | HP | 64 | 76 | short stature | 0.707055 | | 4 | 1 | HP | 32 | 43 | hearing loss | GENE | 96 | 103 | SH3PXD2B | 0.640802 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_human_phenotype_gene_biobert| |Compatibility:|Healthcare NLP 3.0.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Case sensitive:|true| ## Data Source Trained on a silver standard corpus of human phenotype and gene annotations and their relations. ## Benchmarking ```bash Relation Recall Precision F1 Support 0 0.922 0.908 0.915 129 1 0.831 0.855 0.843 71 Avg. 0.877 0.882 0.879 - ``` --- layout: model title: English image_classifier_vit_exper3_mesum5 ViTForImageClassification from sudo-s author: John Snow Labs name: image_classifier_vit_exper3_mesum5 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_exper3_mesum5` is an English model originally trained by sudo-s. 
## Predicted Entities `45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper3_mesum5_en_4.1.0_3.0_1660167974762.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper3_mesum5_en_4.1.0_3.0_1660167974762.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_exper3_mesum5", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_exper3_mesum5", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_exper3_mesum5| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.3 MB| --- layout: model title: Translate Bulgarian to English Pipeline author: John Snow Labs name: translate_bg_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, bg, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `bg` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_bg_en_xx_2.7.0_2.4_1609691570462.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_bg_en_xx_2.7.0_2.4_1609691570462.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_bg_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_bg_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.bg.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_bg_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Financial SEC Filings Classifier author: John Snow Labs name: finclf_sec_filings date: 2022-12-01 tags: [en, finance, classification, sec, licensed] task: Text Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: FinanceClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model allows you to classify documents among a list of specific US Securities and Exchange Commission filing types: `10-K`, `10-Q`, `8-K`, `S-8`, `3`, `4`, `Other`. **IMPORTANT**: This model works with the first 512 tokens of a document, so you do not need to run it on the whole document. ## Predicted Entities `10-K`, `10-Q`, `8-K`, `S-8`, `3`, `4`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_sec_filings_en_1.0.0_3.0_1669921534523.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_sec_filings_en_1.0.0_3.0_1669921534523.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = finance.ClassifierDLModel.pretrained("finclf_sec_filings", "en", "finance/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier ]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
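Because the classifier only considers the first 512 tokens of a document, long filings can be pre-truncated before building the DataFrame. A rough sketch using whitespace tokens; the pipeline's internal word-piece tokenizer counts tokens differently, so treat this as an approximation rather than an exact cutoff:

```python
def truncate_tokens(text, max_tokens=512):
    """Keep roughly the first max_tokens whitespace-separated tokens.

    Approximate pre-filter only: the model's own (word-piece) tokenizer
    may split words further, so the effective token count can differ.
    """
    tokens = text.split()
    return " ".join(tokens[:max_tokens])

# A synthetic 1000-token document, trimmed before classification.
doc = " ".join(f"tok{i}" for i in range(1000))
short = truncate_tokens(doc)
print(len(short.split()))  # 512
```

The truncated string would then replace `"YOUR TEXT HERE"` in the pipeline above.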
## Results ```bash +-------+ |result| +-------+ |[10-K]| |[8-K]| |[10-Q]| |[S-8]| |[3]| |[4]| |[other]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_sec_filings| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Scraped filings from the SEC ## Benchmarking ```bash label precision recall f1-score support 10-K 0.97 0.82 0.89 40 10-Q 0.94 0.94 0.94 35 3 0.80 0.95 0.87 41 4 0.94 0.76 0.84 42 8-K 0.81 0.94 0.87 32 S-8 0.91 0.93 0.92 44 other 0.98 0.98 0.98 41 accuracy - - 0.90 275 macro-avg 0.91 0.90 0.90 275 weighted-avg 0.91 0.90 0.90 275 ``` --- layout: model title: Embeddings Scielo 150 dims author: John Snow Labs name: embeddings_scielo_150d class: WordEmbeddingsModel language: es repository: clinical/models date: 2020-05-26 task: Embeddings edition: Healthcare NLP 2.5.0 spark_version: 2.4 tags: [clinical,embeddings,es] supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/embeddings_scielo_150d_es_2.5.0_2.4_1590467082526.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/embeddings_scielo_150d_es_2.5.0_2.4_1590467082526.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python model = WordEmbeddingsModel.pretrained("embeddings_scielo_150d","es","clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("word_embeddings") ``` ```scala val model = WordEmbeddingsModel.pretrained("embeddings_scielo_150d","es","clinical/models") .setInputCols("document","token") .setOutputCol("word_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.scielo.150d").predict("""Put your text here.""") ```
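The lookup maps each token to a 150-dimensional vector, and a common way to compare the resulting `word_embeddings` downstream is cosine similarity. A minimal sketch; the short vectors here stand in for real 150-dimensional ones collected from the annotation results:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors: v is u scaled by 2, so the cosine similarity is exactly 1.
u = [0.2, 0.1, 0.4]
v = [0.4, 0.2, 0.8]
print(round(cosine(u, v), 6))  # 1.0
```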
{:.h2_title} ## Results Word2Vec feature vectors based on ``embeddings_scielo_150d``. {:.model-param} ## Model Information {:.table-model} |---------------|------------------------| | Name: | embeddings_scielo_150d | | Type: | WordEmbeddingsModel | | Compatibility: | Spark NLP 2.5.0+ | | License: | Licensed | | Edition: | Official | |Input labels: | [document, token] | |Output labels: | [word_embeddings] | | Language: | es | | Dimension: | 150.0 | {:.h2_title} ## Data Source Trained on Scielo Articles https://zenodo.org/record/3744326#.XtViinVKh_U --- layout: model title: Pipeline to Detect Organism in Medical Text author: John Snow Labs name: bert_token_classifier_ner_species_pipeline date: 2023-03-20 tags: [en, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [bert_token_classifier_ner_species](https://nlp.johnsnowlabs.com/2022/07/25/bert_token_classifier_ner_species_en_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_species_pipeline_en_4.3.0_3.2_1679301125473.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_species_pipeline_en_4.3.0_3.2_1679301125473.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_species_pipeline", "en", "clinical/models") text = '''As determined by 16S rRNA gene sequence analysis, strain 6C (T) represents a distinct species belonging to the class Betaproteobacteria and is most closely related to Thiomonas intermedia DSM 18155 (T) and Thiomonas perometabolis DSM 18570 (T) .''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_species_pipeline", "en", "clinical/models") val text = "As determined by 16S rRNA gene sequence analysis, strain 6C (T) represents a distinct species belonging to the class Betaproteobacteria and is most closely related to Thiomonas intermedia DSM 18155 (T) and Thiomonas perometabolis DSM 18570 (T) ." val result = pipeline.fullAnnotate(text) ```
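In the annotation output, `begin` and `end` are inclusive character offsets into the input text, so a chunk can be recovered with `text[begin:end + 1]`. A quick check against the example sentence, using offsets from the results that follow:

```python
# The example sentence passed to the pipeline above.
text = ("As determined by 16S rRNA gene sequence analysis, strain 6C (T) "
        "represents a distinct species belonging to the class "
        "Betaproteobacteria and is most closely related to Thiomonas "
        "intermedia DSM 18155 (T) and Thiomonas perometabolis DSM 18570 (T) .")

# Spark NLP offsets are inclusive on both ends, hence the end + 1 slice.
for begin, end, chunk in [(57, 62, "6C (T)"), (117, 134, "Betaproteobacteria")]:
    assert text[begin:end + 1] == chunk
print("offsets verified")
```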
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:------------------------|--------:|------:|:------------|-------------:| | 0 | 6C (T) | 57 | 62 | SPECIES | 0.998955 | | 1 | Betaproteobacteria | 117 | 134 | SPECIES | 0.99973 | | 2 | Thiomonas intermedia | 167 | 186 | SPECIES | 0.999822 | | 3 | DSM 18155 (T) | 188 | 200 | SPECIES | 0.997657 | | 4 | Thiomonas perometabolis | 206 | 228 | SPECIES | 0.999614 | | 5 | DSM 18570 (T) | 230 | 242 | SPECIES | 0.997146 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_species_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Multilingual DistilBertForQuestionAnswering Cased model (from ZYW) author: John Snow Labs name: distilbert_qa_zyw_model date: 2023-01-03 tags: [xx, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `en-de-vi-zh-es-model` is a Multilingual model originally trained by `ZYW`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_zyw_model_xx_4.3.0_3.0_1672775050259.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_zyw_model_xx_4.3.0_3.0_1672775050259.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_zyw_model","xx")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_zyw_model","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
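Extractive QA heads score every context token as a potential answer start or end, and decoding picks the best valid (start, end) pair. A simplified pure-Python sketch of that decoding step, with made-up scores for the example question; this is conceptual, not Spark NLP's internal implementation:

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) token pair maximizing start + end score,
    with start <= end and a bounded span length. Simplified sketch of
    the usual extractive-QA decoding step."""
    best, best_score = (0, 0), float("-inf")
    for s, s_sc in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_sc + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Hypothetical per-token scores for "What's my name?" over the context.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 3.0, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]
end   = [0.1, 0.1, 0.1, 2.5, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```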
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_zyw_model| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|505.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ZYW/en-de-vi-zh-es-model --- layout: model title: Extract Clinical Department Entities from Voice of the Patient Documents (embeddings_clinical_large) author: John Snow Labs name: ner_vop_clinical_dept_emb_clinical_large date: 2023-06-06 tags: [licensed, clinical, en, ner, vop] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts mentions of medical devices and clinical departments from documents written in the patient's own words. ## Predicted Entities `ClinicalDept`, `AdmissionDischarge`, `MedicalDevice` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_clinical_dept_emb_clinical_large_en_4.4.3_3.0_1686074681308.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_clinical_dept_emb_clinical_large_en_4.4.3_3.0_1686074681308.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_clinical_dept_emb_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["My little brother is having surgery tomorrow in the orthopedic department. He is getting a titanium plate put in his leg to help it heal faster. 
Wishing him a speedy recovery!"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_clinical_dept_emb_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("My little brother is having surgery tomorrow in the orthopedic department. He is getting a titanium plate put in his leg to help it heal faster. Wishing him a speedy recovery!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:----------------------|:--------------| | orthopedic department | ClinicalDept | | titanium plate | MedicalDevice | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_clinical_dept_emb_clinical_large| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical_large| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
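The micro-average row of the benchmarking table that follows can be reproduced from the summed tp/fp/fn counts; a quick sanity check (values match the table to within rounding):

```python
# Summed true-positive / false-positive / false-negative counts over
# the three labels (ClinicalDept + AdmissionDischarge + MedicalDevice).
tp, fp, fn = 578, 99, 114

precision = tp / (tp + fp)   # 578 / 677
recall = tp / (tp + fn)      # 578 / 692
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```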
## Benchmarking ```bash label tp fp fn total precision recall f1 ClinicalDept 297 35 29 326 0.89 0.91 0.90 AdmissionDischarge 25 0 9 34 1.00 0.74 0.85 MedicalDevice 256 64 76 332 0.80 0.77 0.79 macro_avg 578 99 114 692 0.90 0.81 0.85 micro_avg 578 99 114 692 0.85 0.83 0.84 ``` --- layout: model title: Legal Eu Finance Document Classifier (EURLEX) author: John Snow Labs name: legclf_eu_finance_bert date: 2023-03-06 tags: [en, legal, classification, clauses, eu_finance, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_eu_finance_bert model is a BERT sentence-embeddings document classifier that, given a document, predicts whether it belongs to the Eu_Finance class or not (binary classification) according to EuroVoc labels. ## Predicted Entities `Eu_Finance`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_eu_finance_bert_en_1.0.0_3.0_1678111884579.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_eu_finance_bert_en_1.0.0_3.0_1678111884579.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_eu_finance_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Eu_Finance]| |[Other]| |[Other]| |[Eu_Finance]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_eu_finance_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Eu_Finance 0.89 0.87 0.88 622 Other 0.85 0.87 0.86 529 accuracy - - 0.87 1151 macro-avg 0.87 0.87 0.87 1151 weighted-avg 0.87 0.87 0.87 1151 ``` --- layout: model title: Chinese BertForMaskedLM Cased model (from hfl) author: John Snow Labs name: bert_embeddings_rbt4 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rbt4` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt4_zh_4.2.4_3.0_1670327086297.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt4_zh_4.2.4_3.0_1670327086297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt4","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt4","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_rbt4| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|171.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/rbt4 - https://arxiv.org/abs/1906.08101 - https://github.com/google-research/bert - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://arxiv.org/abs/2004.13922 --- layout: model title: Detect Drug Chemicals author: John Snow Labs name: ner_drugs_large_en date: 2021-01-29 task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Healthcare 2.7.1 spark_version: 2.4 tags: [ner, en, licensed, clinical] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Pretrained named entity recognition deep learning model for drugs. The model combines dosage, strength, form, and route into a single entity: Drug. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. 
{:.h2_title} ## Predicted Entities `DRUG` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_drugs_large_en_2.6.0_2.4_1603915964112.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_drugs_large_en_2.6.0_2.4_1603915964112.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPython.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") # Clinical word embeddings trained on PubMED dataset word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_drugs_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) data = spark.createDataFrame([["""The patient is a 40-year-old white male who presents with a chief complaint of 'chest pain'. The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. He has been advised Aspirin 81 milligrams QDay. Humulin N. insulin 50 units in a.m. HCTZ 50 mg QDay. 
Nitroglycerin 1/150 sublingually PRN chest pain."""]]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") // Clinical word embeddings trained on PubMed dataset val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_drugs_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("""The patient is a 40-year-old white male who presents with a chief complaint of 'chest pain'. The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. He has been advised Aspirin 81 milligrams QDay. Humulin N. insulin 50 units in a.m. HCTZ 50 mg QDay. Nitroglycerin 1/150 sublingually PRN chest pain.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
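The `NerConverter` stage at the end of both pipelines collapses token-level BIO tags (`B-DRUG`, `I-DRUG`, `O`) into full entity chunks. A minimal pure-Python sketch of that chunking logic — an illustration only, not the Spark NLP implementation:

```python
# Hypothetical re-implementation of NerConverter-style chunking:
# consecutive B-/I- tags with the same label are merged into one chunk.
def bio_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O" tag, or an I- tag that does not continue the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Aspirin", "81", "milligrams", "QDay", "."]
tags = ["B-DRUG", "I-DRUG", "I-DRUG", "O", "O"]
print(bio_to_chunks(tokens, tags))  # → [('Aspirin 81 milligrams', 'DRUG')]
```

This is how token-level predictions in the `ner` column become the `ner_chunk` rows shown in the Results table below.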
{:.h2_title} ## Results The output is a dataframe with a sentence per row and a ``"ner"`` column containing all of the entity labels in the sentence, entity character indices, and other metadata. ```bash +--------------------------------+---------+ |chunk |ner_label| +--------------------------------+---------+ |Aspirin 81 milligrams |DRUG | |Humulin N |DRUG | |insulin 50 units |DRUG | |HCTZ 50 mg |DRUG | |Nitroglycerin 1/150 sublingually|DRUG | +--------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_drugs_large_en_2.6.0_2.4| |Type:|ner| |Compatibility:|Spark NLP 2.7.1+| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|[en]| |Case sensitive:|false| {:.h2_title} ## Data Source Trained on i2b2_med7 + FDA with 'embeddings_clinical'. https://www.i2b2.org/NLP/Medication {:.h2_title} ## Benchmarking Since this NER model is derived from `ner_posology` but reduced to a single entity, no benchmarking is applicable. --- layout: model title: English XlmRoBertaForQuestionAnswering (from teacookies) author: John Snow Labs name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265908 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-more_fine_tune_24465520-26265908` is an English model originally trained by `teacookies`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265908_en_4.0.0_3.0_1655985559166.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265908_en_4.0.0_3.0_1655985559166.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265908","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265908","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265908").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
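Extractive QA annotators such as `XlmRoBertaForQuestionAnswering` score every context token as a candidate answer start and answer end; the returned answer is the highest-scoring valid span. A toy sketch of that span-selection step, with made-up scores rather than real model logits:

```python
# Illustrative only (not the Spark NLP internals): pick the (start, end)
# pair with the highest combined score, with end >= start and a length cap.
def best_span(start_scores, end_scores, max_len=15):
    best, best_score = (0, 0), float("-inf")
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            if ss + end_scores[e] > best_score:
                best_score = ss + end_scores[e]
                best = (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.0, 0.1, 0.0, 1.0, 0.0]  # invented scores
end   = [0.0, 0.1, 0.0, 4.5, 0.2, 0.0, 0.0, 0.1, 1.2, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # → Clara
```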
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265908| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|888.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265908 --- layout: model title: English image_classifier_vit_pond_image_classification_7 ViTForImageClassification from SummerChiam author: John Snow Labs name: image_classifier_vit_pond_image_classification_7 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_pond_image_classification_7` is an English model originally trained by SummerChiam. ## Predicted Entities `Normal`, `Boiling`, `Algae`, `NormalCement`, `NormalRain`, `BoilingNight`, `NormalNight` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_7_en_4.1.0_3.0_1660167146003.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_7_en_4.1.0_3.0_1660167146003.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_pond_image_classification_7", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_pond_image_classification_7", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
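The `class` output column carries the label whose score is highest. A hedged sketch of that final step — softmax over logits followed by argmax — using this model's label set and invented logits (not real model output):

```python
import math

# Label set from the Predicted Entities section above.
LABELS = ["Normal", "Boiling", "Algae", "NormalCement",
          "NormalRain", "BoilingNight", "NormalNight"]

def classify(logits):
    # softmax turns raw logits into probabilities; argmax picks the label
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

label, p = classify([0.2, 3.1, 0.4, -1.0, 0.0, 0.5, 0.1])  # made-up logits
print(label)  # → Boiling
```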
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_pond_image_classification_7| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Earning Calls Financial NER (Specific, md) author: John Snow Labs name: finner_earning_calls_specific_md date: 2022-12-15 tags: [en, finance, ner, licensed, earning, calls] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a `md` (medium) version of a financial model trained on Earning Calls transcripts to detect financial entities (NER model). This model is called `Specific` as it has more labels in comparison with the `Generic` version. Please note this model requires some tokenization configuration to extract the currency (see the Python snippet below). The currently available entities are: - AMOUNT: Numeric amounts, not percentages - ASSET: Current or Fixed Asset - ASSET_DECREASE: Decrease in the asset possession/exposure - ASSET_INCREASE: Increase in the asset possession/exposure - CF: Total cash flow - CFO: Cash flow from operating activity - CFO_INCREASE: Cash flow from operating activity increased - CONTRA_LIABILITY: Negative liability account that offsets the liability account (e.g. paying a debt) - COUNT: Number of items (not monetary, not percentages). 
- CURRENCY: The currency of the amount - DATE: Generic dates in context where either it's not a fiscal year or it can't be asserted as such given the context - EXPENSE: An expense or loss - EXPENSE_DECREASE: A piece of information saying there was an expense decrease in that fiscal year - EXPENSE_INCREASE: A piece of information saying there was an expense increase in that fiscal year - FCF: Free Cash Flow - FISCAL_YEAR: A date which expresses which month the fiscal exercise was closed for a specific year - INCOME: Any income that is reported - INCOME_INCREASE: Relative increase in income - KPI: Key Performance Indicator, a quantifiable measure of performance over time for a specific objective - KPI_DECREASE: Relative decrease in a KPI - KPI_INCREASE: Relative increase in a KPI - LIABILITY: Current or Long-Term Liability (not from stockholders) - LIABILITY_DECREASE: Relative decrease in liability - LIABILITY_INCREASE: Relative increase in liability - LOSS: Type of loss (e.g. gross, net) - ORG: Mention of a company/organization name - PERCENTAGE: Numeric amounts which are percentages - PROFIT: Profit or also Revenue - PROFIT_DECLINE: A piece of information saying there was a profit / revenue decrease in that fiscal year - PROFIT_INCREASE: A piece of information saying there was a profit / revenue increase in that fiscal year - REVENUE: Revenue reported by company - REVENUE_DECLINE: Relative decrease in revenue when compared to other years - REVENUE_INCREASE: Relative increase in revenue when compared to other years - STOCKHOLDERS_EQUITY: Equity possessed by stockholders, not liability - TICKER: Trading symbol of the company ## Predicted Entities `AMOUNT`, `ASSET`, `ASSET_DECREASE`, `ASSET_INCREASE`, `CF`, `CFO`, `CFO_INCREASE`, `CF_INCREASE`, `CONTRA_LIABILITY`, `COUNT`, `CURRENCY`, `DATE`, `EXPENSE`, `EXPENSE_DECREASE`, `EXPENSE_INCREASE`, `FCF`, `FISCAL_YEAR`, `INCOME`, `INCOME_INCREASE`, `KPI`, `KPI_DECREASE`, `KPI_INCREASE`, `LIABILITY`, `LIABILITY_DECREASE`, 
`LIABILITY_INCREASE`, `LOSS`, `LOSS_DECREASE`, `ORG`, `PERCENTAGE`, `PROFIT`, `PROFIT_DECLINE`, `PROFIT_INCREASE`, `REVENUE`, `REVENUE_DECLINE`, `REVENUE_INCREASE`, `STOCKHOLDERS_EQUITY`, `TICKER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_earning_calls_specific_md_en_1.0.0_3.0_1671134641020.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_earning_calls_specific_md_en_1.0.0_3.0_1671134641020.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pyspark.sql.functions as F document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token")\ .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$','€']) embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512) ner_model = finance.NerModel.pretrained("finner_earning_calls_specific_md", "en", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""Adjusted EPS was ahead of our expectations at $ 1.21 , and free cash flow is also ahead of our expectations despite a $ 1.5 billion additional tax payment we made related to the R&D amortization."""]]).toDF("text") model = pipeline.fit(data) result = model.transform(data) result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \ .select(F.expr("cols['0']").alias("text"), F.expr("cols['1']['entity']").alias("label")).show(200, truncate = False) ```
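The `setContextChars` call above is the tokenization configuration the description refers to: listing `$` and `€` as context characters makes the tokenizer split them off as separate tokens, so the model can tag `$` as CURRENCY and `1.21` as AMOUNT independently. A simplified, framework-free sketch of the effect (not the actual Tokenizer code):

```python
# Context characters from the pipeline configuration above.
CONTEXT_CHARS = set(".,;:!?*-()”’$€")

def tokenize(text):
    tokens = []
    for raw in text.split():
        # peel context characters off both ends of each whitespace token
        start, end = 0, len(raw)
        while start < end and raw[start] in CONTEXT_CHARS:
            tokens.append(raw[start]); start += 1
        tail = []
        while end > start and raw[end - 1] in CONTEXT_CHARS:
            tail.append(raw[end - 1]); end -= 1
        if start < end:
            tokens.append(raw[start:end])
        tokens.extend(reversed(tail))
    return tokens

print(tokenize("at $1.21, despite"))  # → ['at', '$', '1.21', ',', 'despite']
```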
## Results ```bash +------------+----------+----------+ | token| ner_label|confidence| +------------+----------+----------+ | Adjusted| B-PROFIT| 0.6179| | EPS| I-PROFIT| 0.913| | was| O| 1.0| | ahead| O| 1.0| | of| O| 1.0| | our| O| 1.0| |expectations| O| 1.0| | at| O| 1.0| | $|B-CURRENCY| 1.0| | 1.21| B-AMOUNT| 1.0| | ,| O| 1.0| | and| O| 1.0| | free| B-FCF| 0.9992| | cash| I-FCF| 0.9945| | flow| I-FCF| 0.9988| | is| O| 1.0| | also| O| 1.0| | ahead| O| 1.0| | of| O| 1.0| | our| O| 1.0| |expectations| O| 1.0| | despite| O| 1.0| | a| O| 1.0| | $|B-CURRENCY| 1.0| | 1.5| B-AMOUNT| 1.0| | billion| I-AMOUNT| 1.0| | additional| O| 0.9945| | tax| O| 0.6131| | payment| O| 0.6613| | we| O| 1.0| | made| O| 1.0| | related| O| 1.0| | to| O| 1.0| | the| O| 1.0| | R&D| O| 0.9994| |amortization| O| 0.9989| | .| O| 1.0| +------------+----------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_earning_calls_specific_md| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References In-house annotations on Earning Calls. 
## Benchmarking ```bash label precision recall f1 support AMOUNT 99.303136 99.650350 99.476440 574 ASSET 55.172414 47.058824 50.793651 29 ASSET_INCREASE 100.000000 33.333333 50.000000 1 CF 46.153846 70.588235 55.813953 26 CFO 77.777778 100.000000 87.500000 9 CONTRA_LIABILITY 52.380952 56.410256 54.320988 42 COUNT 65.384615 77.272727 70.833333 26 CURRENCY 98.916968 99.636364 99.275362 554 DATE 86.982249 93.630573 90.184049 169 EXPENSE 67.187500 57.333333 61.870504 64 EXPENSE_DECREASE 100.000000 60.000000 75.000000 3 EXPENSE_INCREASE 40.000000 44.444444 42.105263 10 FCF 75.000000 75.000000 75.000000 20 INCOME 60.000000 40.000000 48.000000 10 KPI 41.666667 23.809524 30.303030 12 KPI_DECREASE 20.000000 10.000000 13.333333 5 KPI_INCREASE 44.444444 38.095238 41.025641 18 LIABILITY 38.461538 38.461538 38.461538 13 LIABILITY_DECREASE 50.000000 66.666667 57.142857 4 LOSS 50.000000 37.500000 42.857143 6 ORG 94.736842 90.000000 92.307692 19 PERCENTAGE 99.299475 99.648506 99.473684 571 PROFIT 78.014184 85.937500 81.784387 141 PROFIT_DECLINE 100.000000 36.363636 53.333333 4 PROFIT_INCREASE 78.947368 75.000000 76.923077 19 REVENUE 64.835165 71.951220 68.208092 91 REVENUE_DECLINE 53.571429 57.692308 55.555556 28 REVENUE_INCREASE 65.734266 75.200000 70.149254 143 STOCKHOLDERS_EQUITY 60.000000 37.500000 46.153846 5 TICKER 94.444444 94.444444 94.444444 18 accuracy - - 0.9571 19083 macro-avg 0.6660 0.5900 0.6070 19083 weighted-avg 0.9575 0.9571 0.9563 19083 ``` --- layout: model title: Legal Note Purchase Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_note_purchase_agreement date: 2022-11-24 tags: [en, legal, classification, agreement, note_purchase, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_note_purchase_agreement` model is a Legal 
Longformer Document Classifier to classify if the document belongs to the class `note-purchase-agreement` or not (Binary Classification). Longformers are limited to 4096 tokens, so only the first 4096 tokens of each document are taken into account. We have found that, for the vast majority of documents in legal corpora, 4096 tokens are enough to perform Document Classification, provided the documents are clean and contain only the legal document itself without extra leading material. If that is not the case, let us know and we can apply another approach for you: splitting the document into 4096-token chunks, averaging their embeddings, and training on the averaged version, so that the whole document is taken into account. In theory, however, this should not be required. ## Predicted Entities `note-purchase-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_note_purchase_agreement_en_1.0.0_3.0_1669292848323.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_note_purchase_agreement_en_1.0.0_3.0_1669292848323.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_note_purchase_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
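For documents longer than the 4096-token Longformer window, the fallback mentioned in the description is to embed 4096-token chunks and average the resulting vectors. A minimal framework-free sketch of that chunk-and-average strategy, where `embed_chunk` is a toy stand-in for the real encoder:

```python
def chunked_average(tokens, embed_chunk, chunk_size=4096):
    # split into fixed-size chunks, embed each, then average dimension-wise
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    vectors = [embed_chunk(c) for c in chunks]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# toy 2-d "encoder": (token count, count of capitalised tokens)
toy = lambda chunk: [float(len(chunk)), float(sum(t[0].isupper() for t in chunk))]
doc = ["Note", "purchase", "agreement"] * 3000  # 9000 tokens → 3 chunks
print(chunked_average(doc, toy))
```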
## Results ```bash +-------+ |result| +-------+ |[note-purchase-agreement]| |[other]| |[other]| |[note-purchase-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_note_purchase_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.4 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support note-purchase-agreement 0.90 0.92 0.91 38 other 0.97 0.96 0.96 90 accuracy - - 0.95 128 macro-avg 0.93 0.94 0.93 128 weighted-avg 0.95 0.95 0.95 128 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from holtin) author: John Snow Labs name: distilbert_qa_base_uncased_holtin_finetuned_full_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-holtin-finetuned-full-squad` is an English model originally trained by `holtin`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_holtin_finetuned_full_squad_en_4.3.0_3.0_1672773896128.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_holtin_finetuned_full_squad_en_4.3.0_3.0_1672773896128.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_holtin_finetuned_full_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_holtin_finetuned_full_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_holtin_finetuned_full_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/holtin/distilbert-base-uncased-holtin-finetuned-full-squad --- layout: model title: Embeddings Sciwiki 150 dims author: John Snow Labs name: embeddings_sciwiki_150d class: WordEmbeddingsModel language: es repository: clinical/models date: 2020-05-27 task: Embeddings edition: Healthcare NLP 2.5.0 spark_version: 2.4 tags: [clinical,embeddings,es] supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/embeddings_sciwiki_150d_es_2.5.0_2.4_1590609340084.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/embeddings_sciwiki_150d_es_2.5.0_2.4_1590609340084.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python model = WordEmbeddingsModel.pretrained("embeddings_sciwiki_150d","es","clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("word_embeddings") ``` ```scala val model = WordEmbeddingsModel.pretrained("embeddings_sciwiki_150d","es","clinical/models") .setInputCols("document","token") .setOutputCol("word_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.sciwiki.150d").predict("""Put your text here.""") ```
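A `WordEmbeddingsModel` is, conceptually, a lookup table from tokens to fixed-size vectors, with out-of-vocabulary tokens mapped to an all-zeros vector. A hypothetical sketch of that behaviour (toy vectors, not the real 150-dimensional weights of this model):

```python
DIM = 150  # embeddings_sciwiki_150d produces 150-dimensional vectors

# toy vocabulary — invented values for illustration only
vocab = {
    "paciente": [0.1] * DIM,
    "fiebre":   [0.2] * DIM,
}

def lookup(tokens):
    # case-insensitive toy lookup; unknown tokens get the zero vector
    zero = [0.0] * DIM
    return [vocab.get(t.lower(), zero) for t in tokens]

vectors = lookup(["Paciente", "con", "fiebre"])
print(len(vectors), len(vectors[0]))  # → 3 150
```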
{:.h2_title} ## Results Word2Vec feature vectors based on ``embeddings_sciwiki_150d``. {:.model-param} ## Model Information {:.table-model} |---------------|-------------------------| | Name: | embeddings_sciwiki_150d | | Type: | WordEmbeddingsModel | | Compatibility: | Spark NLP 2.5.0+ | | License: | Licensed | | Edition: | Official | |Input labels: | [document, token] | |Output labels: | [word_embeddings] | | Language: | es | | Dimension: | 150.0 | {:.h2_title} ## Data Source Trained on Clinical Wikipedia Articles https://zenodo.org/record/3744326#.XtViinVKh_U --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_6_h_128 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-6_H-128` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_128_zh_4.2.4_3.0_1670021708476.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_128_zh_4.2.4_3.0_1670021708476.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_128","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_128","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_6_h_128| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|15.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-6_H-128 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Sentence Embeddings - sbert mini (tuned) author: John Snow Labs name: sbert_jsl_mini_umls_uncased date: 2021-05-14 tags: [embeddings, clinical, licensed, en] task: Embeddings language: en nav_key: models edition: Healthcare NLP 3.0.3 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained to generate contextual sentence embeddings of input sentences. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_mini_umls_uncased_en_3.0.3_2.4_1621017142607.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_mini_umls_uncased_en_3.0.3_2.4_1621017142607.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sbiobert_embeddings = BertSentenceEmbeddings\ .pretrained("sbert_jsl_mini_umls_uncased","en","clinical/models")\ .setInputCols(["sentence"])\ .setOutputCol("sbert_embeddings") ``` ```scala val sbiobert_embeddings = BertSentenceEmbeddings .pretrained("sbert_jsl_mini_umls_uncased","en","clinical/models") .setInputCols(Array("sentence")) .setOutputCol("sbert_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed_sentence.bert.jsl_mini_umlsuncased").predict("""Put your text here.""") ```
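A common downstream use of these sentence vectors is semantic similarity scoring via cosine similarity. A minimal sketch with short toy vectors standing in for real model output:

```python
import math

def cosine(u, v):
    # cosine similarity: dot product divided by the product of the norms
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

print(round(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]), 3))  # → 1.0 (identical)
print(round(cosine([1.0, 0.0], [0.0, 1.0]), 3))  # → 0.0 (orthogonal)
```

In practice `u` and `v` would be two rows of the `sbert_embeddings` column produced by the pipeline above.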
## Results ```bash Gives a 768-dimensional vector representation of the sentence. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbert_jsl_mini_umls_uncased| |Compatibility:|Healthcare NLP 3.0.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Case sensitive:|false| ## Data Source Tuned on the MedNLI and UMLS datasets ## Benchmarking ```bash MedNLI Score Acc 0.677 STS(cos) 0.681 ``` --- layout: model title: Legal Custodian Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_custodian_agreement_bert date: 2022-11-24 tags: [en, legal, classification, agreement, custodian, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_custodian_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `custodian-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `custodian-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_custodian_agreement_bert_en_1.0.0_3.0_1669310583748.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_custodian_agreement_bert_en_1.0.0_3.0_1669310583748.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_custodian_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[custodian-agreement]| |[other]| |[other]| |[custodian-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_custodian_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support custodian-agreement 0.98 0.93 0.95 43 other 0.96 0.99 0.98 82 accuracy - - 0.97 125 macro-avg 0.97 0.96 0.96 125 weighted-avg 0.97 0.97 0.97 125 ``` --- layout: model title: Italian DistilBertForMaskedLM Cased model (from indigo-ai) author: John Snow Labs name: distilbert_embeddings_bertino date: 2022-12-12 tags: [it, open_source, distilbert_embeddings, distilbertformaskedlm] task: Embeddings language: it edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BERTino` is an Italian model originally trained by `indigo-ai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_bertino_it_4.2.4_3.0_1670864710883.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_bertino_it_4.2.4_3.0_1670864710883.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_bertino","it") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(False) pipeline = Pipeline(stages=[documentAssembler, tokenizer, distilbert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_bertino","it") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, distilbert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
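The pipeline above emits one vector per token. A common way to derive a single sentence-level vector from token embeddings is mean pooling; the plain-Python sketch below only illustrates the idea (Spark NLP's `SentenceEmbeddings` annotator does this for you, and the toy 4-dimensional vectors stand in for BERTino's real output).

```python
# Mean-pool per-token embedding vectors into one sentence vector.
# Toy 4-dim vectors stand in for the model's actual dimensionality.
def mean_pool(token_vectors):
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

tokens = [[1.0, 0.0, 2.0, 1.0],
          [3.0, 2.0, 0.0, 1.0]]
print(mean_pool(tokens))  # [2.0, 1.0, 1.0, 1.0]
```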
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_bertino| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|it| |Size:|253.3 MB| |Case sensitive:|false| ## References - https://huggingface.co/indigo-ai/BERTino - https://indigo.ai/en/ - https://www.corpusitaliano.it/ - https://corpora.dipintra.it/public/run.cgi/corp_info?corpname=itwac_full - https://universaldependencies.org/treebanks/it_partut/index.html - https://universaldependencies.org/treebanks/it_isdt/index.html - https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500 --- layout: model title: Detect PHI for Deidentification (Augmented) author: John Snow Labs name: ner_deid_augmented date: 2021-01-20 task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Healthcare 2.7.1 spark_version: 2.4 tags: [en, deidentify, ner, clinical, licensed] supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Deidentification NER (Augmented) is a Named Entity Recognition model that annotates text to find protected health information that may need to be deidentified. We adhered to the official annotation guidelines (AG) of the 2014 i2b2 de-identification challenge while annotating new datasets for this model.
All the details regarding the nuances of the AG, and the rationale behind them, can be found at [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/). ## Predicted Entities `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/3de6f25c23cd487d829ac3ce444ef19cfbe02631/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentificiation.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_augmented_en_2.7.1_2.4_1611145829422.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_augmented_en_2.7.1_2.4_1611145829422.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use This model was trained with the `embeddings_clinical` word embeddings, so be sure to use the same embeddings in the pipeline, along with a document assembler, sentence detector, tokenizer, and NER converter.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("ner_deid_augmented","en","clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(['sentence', 'token', 'ner']) \ .setOutputCol('ner_chunk') nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital, Dr. John Green (2347165768). He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. 
The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same. ']], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("ner_deid_augmented","en","clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val nlpPipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter)) val model = nlpPipeline.fit(Seq.empty[String].toDS.toDF("text")) val results = new LightPipeline(model).fullAnnotate("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital, Dr. John Green (2347165768). He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. 
On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.""") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.deid.augmented").predict("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital, Dr. John Green (2347165768). He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. 
The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same. """) ```
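Once the `ner_chunk` column is extracted, de-identification typically replaces each detected chunk with its entity label. The minimal plain-Python sketch below only illustrates that final step; in the licensed library this is handled by the dedicated `DeIdentification` annotator.

```python
# Replace each detected PHI chunk with its entity label, e.g.
# "Dr. John Green" -> "Dr. <NAME>". A minimal stand-in for the
# licensed DeIdentification annotator, for illustration only.
def mask_phi(text, chunks):
    for chunk, label in chunks:
        text = text.replace(chunk, f"<{label}>")
    return text

chunks = [("John Green", "NAME"), ("02/04/2003", "DATE")]
print(mask_phi("Dr. John Green was seen on 02/04/2003.", chunks))
# Dr. <NAME> was seen on <DATE>.
```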
## Results ```bash +---------------+---------+ |chunk |ner_label| +---------------+---------+ |Smith |NAME | |VA Hospital |LOCATION | |John Green |NAME | |2347165768 |ID | |Day Hospital |LOCATION | |02/04/2003 |DATE | |Smith |NAME | |Day Hospital |LOCATION | |Smith |NAME | |Smith |NAME | |7 Ardmore Tower|LOCATION | |Hart |NAME | |Smith |NAME | |02/07/2003 |DATE | +---------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_augmented| |Type:|ner| |Compatibility:|Spark NLP 2.7.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Dependencies:|embeddings_clinical| ## Data Source Trained on plain n2c2 2014: De-identification and Heart Disease Risk Factors Challenge datasets with embeddings_clinical https://portal.dbmi.hms.harvard.edu/projects/n2c2-2014/ ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|--------------:|------:|------:|------:|---------:|---------:|---------:| | 0 | I-NAME | 1096 | 47 | 80 | 0.95888 | 0.931973 | 0.945235 | | 1 | I-CONTACT | 93 | 0 | 4 | 1 | 0.958763 | 0.978947 | | 2 | I-AGE | 3 | 1 | 6 | 0.75 | 0.333333 | 0.461538 | | 3 | B-DATE | 2078 | 42 | 52 | 0.980189 | 0.975587 | 0.977882 | | 4 | I-DATE | 474 | 39 | 25 | 0.923977 | 0.9499 | 0.936759 | | 5 | I-LOCATION | 755 | 68 | 76 | 0.917375 | 0.908544 | 0.912938 | | 6 | I-PROFESSION | 78 | 8 | 9 | 0.906977 | 0.896552 | 0.901734 | | 7 | B-NAME | 1182 | 101 | 36 | 0.921278 | 0.970443 | 0.945222 | | 8 | B-AGE | 259 | 10 | 11 | 0.962825 | 0.959259 | 0.961039 | | 9 | B-ID | 146 | 8 | 11 | 0.948052 | 0.929936 | 0.938907 | | 10 | B-PROFESSION | 76 | 9 | 21 | 0.894118 | 0.783505 | 0.835165 | | 11 | B-LOCATION | 556 | 87 | 71 | 0.864697 | 0.886762 | 0.875591 | | 12 | I-ID | 64 | 8 | 3 | 0.888889 | 0.955224 | 0.920863 | | 13 | B-CONTACT | 40 | 7 | 5 | 0.851064 | 0.888889 | 0.869565 | | 14 | Macro-average | 6900 | 435 | 410 | 0.912023 | 0.880619 | 
0.896046 | | 15 | Micro-average | 6900 | 435 | 410 | 0.940695 | 0.943912 | 0.942301 | ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from nbroad) author: John Snow Labs name: roberta_qa_base_super_1 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rob-base-superqa1` is an English model originally trained by `nbroad`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_super_1_en_4.3.0_3.0_1674212339330.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_super_1_en_4.3.0_3.0_1674212339330.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_super_1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_super_1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
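Under the hood, extractive QA heads score every token as a potential answer start and answer end, and the answer is the highest-scoring valid span of the context. The toy sketch below illustrates that span-selection step with made-up scores; it is not the model's actual inference code.

```python
# Pick the best (start, end) span from per-token start/end scores,
# as extractive QA heads do. The scores here are invented purely
# for illustration.
def best_span(tokens, start_scores, end_scores):
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        for j in range(i, len(end_scores)):  # end must not precede start
            if s + end_scores[j] > best[2]:
                best = (i, j, s + end_scores[j])
    return " ".join(tokens[best[0]:best[1] + 1])

tokens = ["My", "name", "is", "Clara"]
start  = [0.1, 0.2, 0.1, 2.5]
end    = [0.0, 0.1, 0.2, 2.0]
print(best_span(tokens, start, end))  # Clara
```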
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_super_1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/nbroad/rob-base-superqa1 - https://paperswithcode.com/sota?task=Question+Answering&dataset=adversarial_qa --- layout: model title: Fast Neural Machine Translation Model from Pedi to English author: John Snow Labs name: opus_mt_nso_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, nso, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `nso` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_nso_en_xx_2.7.0_2.4_1609170845142.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_nso_en_xx_2.7.0_2.4_1609170845142.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_nso_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_nso_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.nso.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
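The pipeline detects sentences before translation because Marian models translate one sentence at a time. The naive regex splitter below only illustrates the shape of that step; `SentenceDetectorDLModel` used above handles abbreviations and many languages far more robustly.

```python
import re

# Naive sentence splitter: split on ., ! or ? followed by whitespace.
# A toy stand-in for SentenceDetectorDLModel, for illustration only.
def split_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(split_sentences("Hello there. How are you? Fine!"))
# ['Hello there.', 'How are you?', 'Fine!']
```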
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_nso_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Finance Capital Call Notices Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: finclf_capital_call_notices date: 2023-02-16 tags: [en, licensed, finance, capital_calls, classification, tensorflow] task: Text Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `finclf_capital_call_notices` model is a Bert Sentence Embeddings document classifier that predicts whether a document belongs to the `capital_call_notices` class or to `other` (binary classification). ## Predicted Entities `capital_call_notices`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_capital_call_notices_en_1.0.0_3.0_1676590287518.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_capital_call_notices_en_1.0.0_3.0_1676590287518.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = finance.ClassifierDLModel.pretrained("finclf_capital_call_notices", "en", "finance/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[capital_call_notices]| |[other]| |[other]| |[capital_call_notices]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_capital_call_notices| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Financial documents classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support capital_call_notices 1.00 1.00 1.00 12 other 1.00 1.00 1.00 23 accuracy - - 1.00 35 macro-avg 1.00 1.00 1.00 35 weighted-avg 1.00 1.00 1.00 35 ``` --- layout: model title: Sentiment Analysis of IMDB Reviews Pipeline (analyze_sentimentdl_glove_imdb) author: John Snow Labs name: analyze_sentimentdl_glove_imdb date: 2021-01-15 task: [Embeddings, Sentiment Analysis, Pipeline Public] language: en nav_key: models edition: Spark NLP 2.7.1 spark_version: 2.4 tags: [sentiment, en, pipeline] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pre-trained pipeline to classify IMDB reviews into `neg` and `pos` classes using `glove_100d` embeddings. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/analyze_sentimentdl_glove_imdb_en_2.7.1_2.4_1610722058784.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/analyze_sentimentdl_glove_imdb_en_2.7.1_2.4_1610722058784.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("analyze_sentimentdl_glove_imdb", lang = "en") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("analyze_sentimentdl_glove_imdb", lang = "en") ``` {:.nlu-block} ```python import nlu nlu.load("en.sentiment.glove").predict("""Put your text here.""") ```
## Results ```bash |    | document | sentiment | |---:|:---|:---| |  0 | Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now! | positive | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|analyze_sentimentdl_glove_imdb| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.1+| |Edition:|Official| |Language:|en| ## Included Models `glove_100d`, `sentimentdl_glove_imdb` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from anurag0077) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_squad3 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad3` is an English model originally trained by `anurag0077`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad3_en_4.3.0_3.0_1672773665178.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad3_en_4.3.0_3.0_1672773665178.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad3","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(False) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
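Models fine-tuned on SQuAD, like this one, are usually evaluated with a normalized exact-match metric: predicted and gold answers are lowercased, stripped of punctuation and articles, and whitespace-collapsed before comparison. The sketch below reproduces that normalization from memory, so treat it as an approximation of the official SQuAD evaluation script.

```python
import re
import string

# Normalize an answer roughly the way the SQuAD evaluation script
# does: lowercase, drop punctuation, drop the articles a/an/the,
# and collapse whitespace.
def normalize_answer(s):
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

print(normalize_answer("The  Eiffel Tower!"))  # eiffel tower
```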
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_squad3| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anurag0077/distilbert-base-uncased-finetuned-squad3 --- layout: model title: Translate Hiligaynon to English Pipeline author: John Snow Labs name: translate_hil_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, hil, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `hil` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_hil_en_xx_2.7.0_2.4_1609686032449.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_hil_en_xx_2.7.0_2.4_1609686032449.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_hil_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_hil_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.hil.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_hil_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForTokenClassification Cased model (from ml6team) author: John Snow Labs name: distilbert_token_classifier_keyphrase_extraction_openkp date: 2023-03-03 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyphrase-extraction-distilbert-openkp` is an English model originally trained by `ml6team`. ## Predicted Entities `KEY` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_openkp_en_4.3.0_3.0_1677880905122.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_openkp_en_4.3.0_3.0_1677880905122.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_openkp","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_openkp","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
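The `ner` column produced above holds per-token BIO tags (`B-KEY`, `I-KEY`, `O`); turning them into keyphrases means grouping consecutive tags into chunks, which is what Spark NLP's `NerConverter` annotator does when added as a final pipeline stage. The plain-Python sketch below only illustrates that grouping logic.

```python
# Group BIO tags into chunks: a B- tag opens a phrase, following
# I- tags extend it, O closes it. This mirrors what Spark NLP's
# NerConverter does with the "ner" column.
def bio_to_chunks(tokens, tags):
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tokens = ["keyphrase", "extraction", "with", "distilbert", "models"]
tags   = ["B-KEY", "I-KEY", "O", "B-KEY", "I-KEY"]
print(bio_to_chunks(tokens, tags))
# ['keyphrase extraction', 'distilbert models']
```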
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_keyphrase_extraction_openkp| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ml6team/keyphrase-extraction-distilbert-openkp - https://github.com/microsoft/OpenKP - https://arxiv.org/abs/1911.02671 - https://paperswithcode.com/sota?task=Keyphrase+Extraction&dataset=openkp --- layout: model title: Word2Vec Embeddings in Bihari languages (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, bh, open_source] task: Embeddings language: bh edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_bh_3.4.1_3.0_1647286940542.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_bh_3.4.1_3.0_1647286940542.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","bh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","bh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("bh.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
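Under the hood this annotator is a static lookup table: each in-vocabulary token maps to a fixed 300-dimensional vector (matched case-insensitively here, since the model is not case sensitive), and out-of-vocabulary tokens typically fall back to an all-zeros vector. A toy, Spark-free sketch of that behavior with made-up 3-d vectors:

```python
# Toy word-embeddings lookup. Real w2v_cc_300d vectors are 300-dimensional;
# the 3-d vectors below are made up purely for illustration.
DIM = 3
lookup = {
    "spark": [0.1, 0.2, 0.3],
    "nlp": [0.4, 0.5, 0.6],
}

def embed(tokens):
    # Case-insensitive match; unknown tokens map to the zero vector.
    return [lookup.get(t.lower(), [0.0] * DIM) for t in tokens]

vectors = embed(["Spark", "NLP", "bhojpuri"])
```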
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|bh| |Size:|77.7 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Translate Tigrinya to English Pipeline author: John Snow Labs name: translate_ti_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ti, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `ti` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ti_en_xx_2.7.0_2.4_1609689546761.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ti_en_xx_2.7.0_2.4_1609689546761.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ti_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ti_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ti.translate_to.en').predict(text, output_level='sentence') translate_df ```
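Because the module is expensive on longer sequences, long documents are usually split into sentences before being passed to `annotate`. A naive, illustrative pre-splitter (Spark NLP's own SentenceDetector handles abbreviations and other edge cases this regex does not):

```python
import re

# Naive sentence splitter for pre-chunking text before translation.
# Illustrative only; not how Spark NLP detects sentences internally.
def split_sentences(text):
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

chunks = split_sentences("Your sentence to translate! Another one. A third?")
```

Each chunk can then be translated independently, keeping every Marian call short.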
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ti_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Danish asr_alvenir_wav2vec2_base_nst_cv9 TFWav2Vec2ForCTC from chcaa author: John Snow Labs name: pipeline_asr_alvenir_wav2vec2_base_nst_cv9 date: 2022-09-25 tags: [wav2vec2, da, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: da edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_alvenir_wav2vec2_base_nst_cv9` is a Danish model originally trained by chcaa. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_alvenir_wav2vec2_base_nst_cv9_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_alvenir_wav2vec2_base_nst_cv9_da_4.2.0_3.0_1664104731248.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_alvenir_wav2vec2_base_nst_cv9_da_4.2.0_3.0_1664104731248.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_alvenir_wav2vec2_base_nst_cv9', lang = 'da') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_alvenir_wav2vec2_base_nst_cv9", lang = "da") val annotations = pipeline.transform(audioDF) ```
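The `transform` call above expects `audioDF` to hold a column of raw audio floats. As a minimal, Spark-free sketch of that preprocessing step (the 16 kHz mono, [-1, 1]-normalized assumption matches typical wav2vec2 inputs; the in-memory wav below is purely illustrative):

```python
import io
import math
import struct
import wave

# Convert 16-bit PCM bytes to floats in [-1.0, 1.0] -- the representation
# a wav2vec2 pipeline consumes. Illustrative only; no Spark required.
def pcm16_to_floats(raw):
    n = len(raw) // 2
    return [s / 32768.0 for s in struct.unpack("<%dh" % n, raw)]

# Build a tiny 16 kHz mono wav in memory: 0.01 s of a 440 Hz tone.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    tone = [int(16000 * math.sin(2 * math.pi * 440 * t / 16000)) for t in range(160)]
    w.writeframes(struct.pack("<160h", *tone))

buf.seek(0)
with wave.open(buf, "rb") as w:
    floats = pcm16_to_floats(w.readframes(w.getnframes()))
```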
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_alvenir_wav2vec2_base_nst_cv9| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|da| |Size:|226.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Translate Thai to English Pipeline author: John Snow Labs name: translate_th_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, th, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `th` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_th_en_xx_2.7.0_2.4_1609689519812.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_th_en_xx_2.7.0_2.4_1609689519812.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_th_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_th_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.th.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_th_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Detect Diseases in Medical Text author: John Snow Labs name: bert_token_classifier_ner_bc5cdr_disease_pipeline date: 2023-03-20 tags: [en, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_bc5cdr_disease](https://nlp.johnsnowlabs.com/2022/07/25/bert_token_classifier_ner_bc5cdr_disease_en_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc5cdr_disease_pipeline_en_4.3.0_3.2_1679302082722.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc5cdr_disease_pipeline_en_4.3.0_3.2_1679302082722.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_bc5cdr_disease_pipeline", "en", "clinical/models") text = '''Indomethacin resulted in histopathologic findings typical of interstitial cystitis, such as leaky bladder epithelium and mucosal mastocytosis. The true incidence of nonsteroidal anti-inflammatory drug-induced cystitis in humans must be clarified by prospective clinical trials. An open-label phase II study of low-dose thalidomide in androgen-independent prostate cancer.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_bc5cdr_disease_pipeline", "en", "clinical/models") val text = "Indomethacin resulted in histopathologic findings typical of interstitial cystitis, such as leaky bladder epithelium and mucosal mastocytosis. The true incidence of nonsteroidal anti-inflammatory drug-induced cystitis in humans must be clarified by prospective clinical trials. An open-label phase II study of low-dose thalidomide in androgen-independent prostate cancer." val result = pipeline.fullAnnotate(text) ```
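`fullAnnotate` returns NER chunks whose `begin`/`end` values are inclusive character offsets into the input (see the Results table below). A quick, Spark-free check of how those offsets index into the example text:

```python
# Spark NLP chunk offsets are inclusive on both ends, so a chunk with
# begin=b and end=e is recovered as text[b:e + 1]. Using the first
# DISEASE chunk from the results below (begin=61, end=81):
text = ("Indomethacin resulted in histopathologic findings typical of "
        "interstitial cystitis, such as leaky bladder epithelium and "
        "mucosal mastocytosis.")
chunk = text[61:81 + 1]
```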
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:----------------------|--------:|------:|:------------|-------------:| | 0 | interstitial cystitis | 61 | 81 | DISEASE | 0.999746 | | 1 | mastocytosis | 129 | 140 | DISEASE | 0.999132 | | 2 | cystitis | 209 | 216 | DISEASE | 0.999912 | | 3 | prostate cancer | 355 | 369 | DISEASE | 0.999781 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_bc5cdr_disease_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: English BertForQuestionAnswering model (from LoudlySoft) author: John Snow Labs name: bert_qa_scibert_scivocab_uncased_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `scibert_scivocab_uncased_squad` is an English model originally trained by `LoudlySoft`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_scibert_scivocab_uncased_squad_en_4.0.0_3.0_1654189441461.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_scibert_scivocab_uncased_squad_en_4.0.0_3.0_1654189441461.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_scibert_scivocab_uncased_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_scibert_scivocab_uncased_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.scibert.uncased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
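In the NLU one-liner above, the question and context travel in a single string joined by `|||`. Conceptually the loader splits that convention back apart; an illustrative sketch (not nlu's actual code):

```python
# Split nlu's "question|||context" convention back into its two parts.
def split_qa(packed):
    question, _, context = packed.partition("|||")
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
```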
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_scibert_scivocab_uncased_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|410.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/LoudlySoft/scibert_scivocab_uncased_squad --- layout: model title: Financial Finetuned FLAN-T5 Text Generation ( Financial Alpaca ) author: John Snow Labs name: fingen_flant5_finetuned_alpaca date: 2023-05-25 tags: [en, finance, generation, licensed, flant5, alpaca, tensorflow] task: Text Generation language: en edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: FinanceTextGenerator article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `fingen_flant5_finetuned_alpaca` model is a text generation model fine-tuned from FLAN-T5 on the Financial Alpaca dataset. FLAN-T5 is a state-of-the-art language model developed by Google that utilizes the T5 architecture for text-generation tasks. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/fingen_flant5_finetuned_alpaca_en_1.0.0_3.0_1685016665729.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/fingen_flant5_finetuned_alpaca_en_1.0.0_3.0_1685016665729.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") flant5 = finance.TextGenerator.pretrained("fingen_flant5_finetuned_alpaca", "en", "finance/models")\ .setInputCols(["document"])\ .setOutputCol("generated")\ .setMaxNewTokens(256)\ .setStopAtEos(True)\ .setDoSample(True)\ .setTopK(3) pipeline = nlp.Pipeline(stages=[document_assembler, flant5]) data = spark.createDataFrame([ [1, "What is the US Fair Tax?"]]).toDF('id', 'text') results = pipeline.fit(data).transform(data) results.select("generated.result").show(truncate=False) ```
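With `setDoSample(True)` and `setTopK(3)`, each generation step samples only among the 3 most probable next tokens. An illustrative top-k filter over made-up token probabilities (a sketch of the idea, not Spark NLP's actual decoder):

```python
import random

# Top-k sampling sketch: keep the k most probable tokens, renormalize
# their probabilities, then draw one. Token scores below are made up.
def top_k_sample(probs, k, rng=random.Random(0)):
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    r = rng.random() * total
    for token, p in top:
        r -= p
        if r <= 0:
            return token
    return top[-1][0]

probs = {"the": 0.4, "a": 0.3, "tax": 0.2, "zebra": 0.05, "qux": 0.05}
token = top_k_sample(probs, k=3)
```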
## Results ```bash +--------------------------------------------------------------------------------------------------------------+ |result | +--------------------------------------------------------------------------------------------------------------+ |[Fair tax in the US is essentially an income tax. Fair taxes are tax on your income, and are not taxeable in any country. Fair taxes are taxed as income. If you have a net gain or if the loss of income from taxable activities is less then the fair value (the loss) of your gross income (the loss) then you have to file an Income Report. This will give the US government an overview and give you an understanding. If your net income is less that your fair share of your gross income (which you are entitled) you have the right to claim a refund.]| +--------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|fingen_flant5_finetuned_alpaca| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.6 GB| ## References The dataset is available [here](https://huggingface.co/datasets/gbharti/finance-alpaca/viewer/gbharti--finance-alpaca) --- layout: model title: English Deberta Embeddings model (from ZZ99) author: John Snow Labs name: deberta_embeddings_tapt_nbme_v3_base date: 2023-03-13 tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, en, tensorflow] task: Embeddings language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DeBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tapt_nbme_deberta_v3_base` is an English model originally trained by `ZZ99`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_tapt_nbme_v3_base_en_4.3.1_3.0_1678712713960.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_tapt_nbme_v3_base_en_4.3.1_3.0_1678712713960.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_tapt_nbme_v3_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_tapt_nbme_v3_base","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
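Each token in the output carries one embedding vector (768-dimensional for a DeBERTa v3 base model). A common downstream step is comparing tokens by cosine similarity; a minimal, framework-free sketch with toy vectors:

```python
import math

# Cosine similarity between two embedding vectors. The 4-d vectors here
# are toys; the real model emits one 768-d vector per token.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

sim = cosine([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0])
```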
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|deberta_embeddings_tapt_nbme_v3_base| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|689.5 MB| |Case sensitive:|false| ## References https://huggingface.co/ZZ99/tapt_nbme_deberta_v3_base --- layout: model title: Detect Living Species (bert_base_cased) author: John Snow Labs name: ner_living_species_bert date: 2022-06-23 tags: [ro, ner, clinical, licensed, bert] task: Named Entity Recognition language: ro edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract living species from clinical texts in Romanian which is critical to scientific disciplines like medicine, biology, ecology/biodiversity, nutrition and agriculture. This model is trained using `bert_base_cased` embeddings. It is trained on the [LivingNER](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) corpus that is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. **NOTE :** 1. The text files were translated from Spanish with a neural machine translation system. 2. The annotations were translated with the same neural machine translation system. 3. The translated annotations were transferred to the translated text files using an annotation transfer technology. 
## Predicted Entities `HUMAN`, `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_ro_3.5.3_3.0_1655974560466.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_ro_3.5.3_3.0_1655974560466.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_living_species_bert", "ro", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""O femeie în vârstă de 26 de ani, însărcinată în 11 săptămâni, a consultat serviciul de urgențe dermatologice pentru că prezenta, de 4 zile, leziuni punctiforme dureroase de debut brusc pe vârful degetelor. Pacientul raportează că leziunile au început pe degete și ulterior s-au extins la degetele de la picioare. Markerii de imunitate, ANA și crioagglutininele, au fost negativi, iar serologia VHB a indicat doar vaccinarea. Pe baza acestor rezultate, diagnosticul de vasculită a fost exclus și, având în vedere diagnosticul suspectat de erupție cutanată cu mănuși și șosete, s-a efectuat serologia pentru virusul Ebstein Barr. Exantemă la mănuși și șosete datorat parvovirozei B19. Având în vedere suspiciunea unei afecțiuni infecțioase cu aceste caracteristici, a fost solicitată serologia pentru EBV, enterovirus și parvovirus B19, cu IgM pozitiv pentru acesta din urmă în două ocazii. 
De asemenea, nu au existat semne de anemie fetală sau complicații ale acesteia."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_living_species_bert", "ro", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""O femeie în vârstă de 26 de ani, însărcinată în 11 săptămâni, a consultat serviciul de urgențe dermatologice pentru că prezenta, de 4 zile, leziuni punctiforme dureroase de debut brusc pe vârful degetelor. Pacientul raportează că leziunile au început pe degete și ulterior s-au extins la degetele de la picioare. Markerii de imunitate, ANA și crioagglutininele, au fost negativi, iar serologia VHB a indicat doar vaccinarea. Pe baza acestor rezultate, diagnosticul de vasculită a fost exclus și, având în vedere diagnosticul suspectat de erupție cutanată cu mănuși și șosete, s-a efectuat serologia pentru virusul Ebstein Barr. Exantemă la mănuși și șosete datorat parvovirozei B19. Având în vedere suspiciunea unei afecțiuni infecțioase cu aceste caracteristici, a fost solicitată serologia pentru EBV, enterovirus și parvovirus B19, cu IgM pozitiv pentru acesta din urmă în două ocazii. 
De asemenea, nu au existat semne de anemie fetală sau complicații ale acesteia.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ro.med_ner.living_species.bert").predict("""O femeie în vârstă de 26 de ani, însărcinată în 11 săptămâni, a consultat serviciul de urgențe dermatologice pentru că prezenta, de 4 zile, leziuni punctiforme dureroase de debut brusc pe vârful degetelor. Pacientul raportează că leziunile au început pe degete și ulterior s-au extins la degetele de la picioare. Markerii de imunitate, ANA și crioagglutininele, au fost negativi, iar serologia VHB a indicat doar vaccinarea. Pe baza acestor rezultate, diagnosticul de vasculită a fost exclus și, având în vedere diagnosticul suspectat de erupție cutanată cu mănuși și șosete, s-a efectuat serologia pentru virusul Ebstein Barr. Exantemă la mănuși și șosete datorat parvovirozei B19. Având în vedere suspiciunea unei afecțiuni infecțioase cu aceste caracteristici, a fost solicitată serologia pentru EBV, enterovirus și parvovirus B19, cu IgM pozitiv pentru acesta din urmă în două ocazii. De asemenea, nu au existat semne de anemie fetală sau complicații ale acesteia.""") ```
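The Benchmarking table below reports micro-, macro-, and weighted-average scores; they differ because macro averaging treats every label equally while micro averaging pools all counts, so frequent labels dominate. A quick refresher with made-up per-label counts:

```python
# Illustrative micro vs macro F1 over per-label (tp, fp, fn) counts.
# The counts are made up; see the Benchmarking table for real scores.
counts = {"HUMAN": (90, 10, 10), "SPECIES": (10, 30, 30)}

def f1(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

# Macro: average the per-label F1 scores (small labels count fully).
macro = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro: pool the counts first, then compute a single F1.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro = f1(tp, fp, fn)
```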
## Results ```bash +--------------------+-------+ |ner_chunk |label | +--------------------+-------+ |femeie |HUMAN | |Pacientul |HUMAN | |VHB |SPECIES| |virusul Ebstein Barr|SPECIES| |parvovirozei B19 |SPECIES| |EBV |SPECIES| |enterovirus |SPECIES| |parvovirus B19 |SPECIES| |fetală |HUMAN | +--------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_bert| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ro| |Size:|16.4 MB| ## References [https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) ## Benchmarking ```bash label precision recall f1-score support B-HUMAN 0.85 0.94 0.89 2184 B-SPECIES 0.75 0.85 0.80 2617 I-HUMAN 0.89 0.11 0.20 72 I-SPECIES 0.74 0.80 0.77 1027 micro-avg 0.79 0.86 0.82 5900 macro-avg 0.81 0.67 0.66 5900 weighted-avg 0.79 0.86 0.82 5900 ``` --- layout: model title: English DistilBertForTokenClassification Cased model (from ml6team) author: John Snow Labs name: distilbert_token_classifier_keyphrase_extraction_openkp date: 2023-03-14 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyphrase-extraction-distilbert-openkp` is an English model originally trained by `ml6team`.
## Predicted Entities `KEY` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_openkp_en_4.3.1_3.0_1678782889694.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_openkp_en_4.3.1_3.0_1678782889694.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_openkp","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_openkp","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
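The classifier's `ner` output is one BIO tag per token (with `KEY` as the only entity type, per Predicted Entities above). Turning those tags into keyphrase strings is normally the job of a downstream NerConverter stage; a pure-Python sketch of that grouping, on made-up tokens and tags:

```python
# Illustrative BIO-tag chunker: merge B-KEY/I-KEY runs into keyphrases,
# mirroring what a NerConverter stage would produce downstream.
def chunk_keyphrases(tokens, tags):
    phrases, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B-KEY":
            if current:
                phrases.append(" ".join(current))
            current = [tok]
        elif tag == "I-KEY" and current:
            current.append(tok)
        else:
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

tokens = ["Spark", "NLP", "does", "keyphrase", "extraction"]
tags = ["B-KEY", "I-KEY", "O", "B-KEY", "I-KEY"]
phrases = chunk_keyphrases(tokens, tags)
```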
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_keyphrase_extraction_openkp| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ml6team/keyphrase-extraction-distilbert-openkp - https://github.com/microsoft/OpenKP - https://arxiv.org/abs/1911.02671 - https://paperswithcode.com/sota?task=Keyphrase+Extraction&dataset=openkp --- layout: model title: English image_classifier_vit_vision_transformer_fmri_classification_ft ViTForImageClassification from shivkumarganesh author: John Snow Labs name: image_classifier_vit_vision_transformer_fmri_classification_ft date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_vision_transformer_fmri_classification_ft` is an English model originally trained by shivkumarganesh. ## Predicted Entities `test`, `train`, `val` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vision_transformer_fmri_classification_ft_en_4.1.0_3.0_1660166000402.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vision_transformer_fmri_classification_ft_en_4.1.0_3.0_1660166000402.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_vision_transformer_fmri_classification_ft", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_vision_transformer_fmri_classification_ft", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_vision_transformer_fmri_classification_ft| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Legal Special terms and conditions of trust Clause Binary Classifier author: John Snow Labs name: legclf_special_terms_and_conditions_of_trust_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `special-terms-and-conditions-of-trust` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in the Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
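Paragraph splitting "by multiline", the first technique mentioned above, can be sketched in plain Python before any Spark pipeline is involved: split the document on blank lines so that each candidate clause becomes its own row for the classifier. This is an illustrative sketch, not the tutorial's exact code:

```python
import re

# Split a document into paragraphs on blank lines ("multiline" splitting),
# dropping empty fragments. Each paragraph can then become one row in the
# dataframe passed to the clause classifier.
def split_paragraphs(text):
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = "SPECIAL TERMS.\nThe Trustee shall act as provided herein.\n\nGOVERNING LAW.\nThis Deed is governed by the laws of the State."
print(split_paragraphs(doc))
```

Each returned paragraph would then be loaded into the classifier's `clause_text` column, keeping every chunk under the 512-token embedding limit.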
## Predicted Entities `other`, `special-terms-and-conditions-of-trust` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_special_terms_and_conditions_of_trust_clause_en_1.0.0_3.2_1660123013530.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_special_terms_and_conditions_of_trust_clause_en_1.0.0_3.2_1660123013530.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_special_terms_and_conditions_of_trust_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[special-terms-and-conditions-of-trust]| |[other]| |[other]| |[special-terms-and-conditions-of-trust]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_special_terms_and_conditions_of_trust_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 1.00 1.00 132 special-terms-and-conditions-of-trust 1.00 1.00 1.00 56 accuracy - - 1.00 188 macro-avg 1.00 1.00 1.00 188 weighted-avg 1.00 1.00 1.00 188 ``` --- layout: model title: Movies Sentiment Analysis author: John Snow Labs name: movies_sentiment_analysis date: 2022-07-06 tags: [en, open_source] task: Sentiment Analysis language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The movies_sentiment_analysis pipeline is a pretrained pipeline that performs basic text processing steps and predicts sentiment. It covers most of the common text processing tasks on your dataframe. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/movies_sentiment_analysis_en_4.0.0_3.0_1657135804995.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/movies_sentiment_analysis_en_4.0.0_3.0_1657135804995.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("movies_sentiment_analysis", "en") result = pipeline.annotate("""I love johnsnowlabs! """) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|movies_sentiment_analysis| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|210.0 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - SymmetricDeleteModel - SentimentDetectorModel --- layout: model title: Word2Vec Embeddings in Galician (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-15 tags: [cc, embeddings, fastText, word2vec, gl, open_source] task: Embeddings language: gl edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_gl_3.4.1_3.0_1647374243984.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_gl_3.4.1_3.0_1647374243984.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","gl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Eu amo a faísca NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","gl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Eu amo a faísca NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("gl.embed.w2v_cc_300d").predict("""Eu amo a faísca NLP""") ```
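Conceptually, a word-embeddings lookup annotator like this one is a token-to-vector table: since the model is not case sensitive, tokens are lowercased before lookup, and out-of-vocabulary tokens fall back to a zero vector. A minimal sketch with a made-up 3-dimensional table (the real model is 300-dimensional):

```python
# Toy token -> vector table standing in for the 300-d fastText/word2vec model.
# The vocabulary and values here are illustrative only.
EMB = {"eu": [0.1, 0.2, 0.3], "amo": [0.0, 0.5, 0.1]}
DIM = 3

def lookup(tokens):
    # Case-insensitive lookup; unknown tokens map to a zero vector.
    return [EMB.get(t.lower(), [0.0] * DIM) for t in tokens]

print(lookup(["Eu", "amo", "NLP"]))
```

The `WordEmbeddingsModel` annotator performs this same lookup per token, attaching one vector per `token` annotation in the `embeddings` output column.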
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|gl| |Size:|779.2 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Multilingual XLMRoBerta Embeddings (from castorini) author: John Snow Labs name: xlmroberta_embeddings_afriberta_small date: 2022-05-13 tags: [ha, yo, ig, am, so, open_source, xlm_roberta, embeddings, xx, afriberta] task: Embeddings language: xx edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: XlmRoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `afriberta_small` is a multilingual model originally trained by `castorini`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_afriberta_small_xx_3.4.4_3.0_1652439280261.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_afriberta_small_xx_3.4.4_3.0_1652439280261.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_afriberta_small","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_afriberta_small","xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_embeddings_afriberta_small| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|xx| |Size:|311.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/castorini/afriberta_small - https://github.com/keleog/afriberta --- layout: model title: Smaller BERT Sentence Embeddings (L-6_H-256_A-4) author: John Snow Labs name: sent_small_bert_L6_256 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L6_256_en_2.6.0_2.4_1598350409969.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L6_256_en_2.6.0_2.4_1598350409969.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L6_256", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L6_256", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L6_256').predict(text, output_level='sentence') embeddings_df ```
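The 256-dimensional sentence vectors this model produces are typically consumed by a downstream similarity or classification step. A minimal sketch of cosine similarity over toy vectors — the values below are illustrative, not real model output:

```python
import math

# Cosine similarity between two sentence-embedding vectors: the dot product
# normalized by both vector magnitudes, giving a score in [-1, 1].
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

v1 = [0.77, 0.55, 1.26, 0.10]  # toy stand-ins for 256-d embeddings
v2 = [0.28, -0.03, 1.10, 0.09]
print(round(cosine(v1, v2), 3))
```

In practice you would extract the `sentence_embeddings` arrays from the result dataframe and compare them pairwise the same way.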
{:.h2_title} ## Results ```bash en_embed_sentence_small_bert_L6_256_embeddings sentence [0.7711525559425354, 0.5496315956115723, 1.261... I hate cancer [0.28574034571647644, -0.03116176463663578, 1.... Antibiotics aren't painkiller ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L6_256| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|en| |Dimension:|256| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-256_A-4/1 --- layout: model title: Legal Injunctive relief Clause Binary Classifier author: John Snow Labs name: legclf_injunctive_relief_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `injunctive-relief` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in the Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `injunctive-relief` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_injunctive_relief_clause_en_1.0.0_3.2_1660122542368.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_injunctive_relief_clause_en_1.0.0_3.2_1660122542368.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_injunctive_relief_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[injunctive-relief]| |[other]| |[other]| |[injunctive-relief]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_injunctive_relief_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support injunctive-relief 0.91 1.00 0.95 30 other 1.00 0.97 0.99 103 accuracy - - 0.98 133 macro-avg 0.95 0.99 0.97 133 weighted-avg 0.98 0.98 0.98 133 ``` --- layout: model title: Translate English to Hiri Motu Pipeline author: John Snow Labs name: translate_en_ho date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, ho, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors helping with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `ho` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ho_xx_2.7.0_2.4_1609691430837.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ho_xx_2.7.0_2.4_1609691430837.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ho", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ho", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ho').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_ho| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Full disclosure Clause Binary Classifier author: John Snow Labs name: legclf_full_disclosure_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `full-disclosure` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in the Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities `other`, `full-disclosure` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_full_disclosure_clause_en_1.0.0_3.2_1660122467527.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_full_disclosure_clause_en_1.0.0_3.2_1660122467527.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_full_disclosure_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[full-disclosure]| |[other]| |[other]| |[full-disclosure]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_full_disclosure_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support full-disclosure 1.00 0.94 0.97 31 other 0.98 1.00 0.99 104 accuracy - - 0.99 135 macro-avg 0.99 0.97 0.98 135 weighted-avg 0.99 0.99 0.99 135 ``` --- layout: model title: Pipeline to Summarize Clinical Question Notes author: John Snow Labs name: summarizer_clinical_questions_pipeline date: 2023-05-29 tags: [licensed, en, clinical, summarization, question] task: Summarization language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [summarizer_clinical_questions](https://nlp.johnsnowlabs.com/2023/04/03/summarizer_clinical_questions_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_questions_pipeline_en_4.4.2_3.0_1685401048463.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_questions_pipeline_en_4.4.2_3.0_1685401048463.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("summarizer_clinical_questions_pipeline", "en", "clinical/models") text = """ Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. """ result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("summarizer_clinical_questions_pipeline", "en", "clinical/models") val text = """ Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. 
I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. """ val result = pipeline.fullAnnotate(text) ```
## Results ```bash What are the treatments for hyperthyroidism? ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_clinical_questions_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|936.7 MB| ## Included Models - DocumentAssembler - MedicalSummarizer --- layout: model title: Chinese BertForQuestionAnswering model (from uer) author: John Snow Labs name: bert_qa_roberta_base_chinese_extractive_qa date: 2022-06-02 tags: [zh, open_source, question_answering, bert] task: Question Answering language: zh edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-chinese-extractive-qa` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_roberta_base_chinese_extractive_qa_zh_4.0.0_3.0_1654189258198.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_roberta_base_chinese_extractive_qa_zh_4.0.0_3.0_1654189258198.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_roberta_base_chinese_extractive_qa","zh") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_roberta_base_chinese_extractive_qa","zh") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("zh.answer_question.bert.base.by_uer").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
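Under the hood, an extractive QA head scores every context token as a potential answer start and as a potential answer end; the predicted answer is the span maximizing the combined score. A simplified sketch with made-up scores (real models work on logits over the wordpiece-tokenized context):

```python
# Pick the (start, end) token pair with the highest combined score,
# requiring end >= start and capping the span length. Scores are illustrative.
def best_span(start_scores, end_scores, max_len=15):
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return best

tokens = ["My", "name", "is", "Clara"]
start = [0.1, 0.2, 0.1, 2.5]
end = [0.0, 0.1, 0.2, 2.8]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```

The `answer` output column of the annotator corresponds to the text of this best-scoring span.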
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_roberta_base_chinese_extractive_qa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|zh| |Size:|381.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/uer/roberta-base-chinese-extractive-qa - https://spaces.ac.cn/archives/4338 - https://www.kesci.com/home/competition/5d142d8cbb14e6002c04e14a/content/0 - https://github.com/dbiir/UER-py/ - https://cloud.tencent.com/product/tione/ - https://github.com/ymcui/cmrc2018 --- layout: model title: English RobertaForQuestionAnswering (from nlpconnect) author: John Snow Labs name: roberta_qa_roberta_base_squad2_nq date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-nq` is an English model originally trained by `nlpconnect`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2_nq_en_4.0.0_3.0_1655735618263.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2_nq_en_4.0.0_3.0_1655735618263.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_squad2_nq","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_squad2_nq","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base.by_nlpconnect").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_squad2_nq| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/nlpconnect/roberta-base-squad2-nq --- layout: model title: German T5ForConditionalGeneration Base Cased model (from Einmalumdiewelt) author: John Snow Labs name: t5_base_gnad_maxsamples date: 2023-01-30 tags: [de, open_source, t5, tensorflow] task: Text Generation language: de edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `T5-Base_GNAD_MaxSamples` is a German model originally trained by `Einmalumdiewelt`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_gnad_maxsamples_de_4.3.0_3.0_1675099257674.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_gnad_maxsamples_de_4.3.0_3.0_1675099257674.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_base_gnad_maxsamples","de") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_base_gnad_maxsamples","de") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_base_gnad_maxsamples| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|de| |Size:|922.8 MB| ## References - https://huggingface.co/Einmalumdiewelt/T5-Base_GNAD_MaxSamples --- layout: model title: Context Spell Checker for the English Language author: John Snow Labs name: spellcheck_dl date: 2021-03-28 tags: [en, open_source] supported: true task: Spell Check language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: ContextSpellCheckerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Spell Checker is a sequence-to-sequence model that detects and corrects spelling errors in your input text. It’s based on Levenshtein Automaton for generating candidate corrections and a Neural Language Model for ranking corrections. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spellcheck_dl_en_3.0.0_3.0_1616900699393.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/spellcheck_dl_en_3.0.0_3.0_1616900699393.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The model works at the token level, so you must put it after tokenization. The model can change the length of the tokens when correcting words, so keep this in mind when using it before other annotators that may work with absolute references to the original document like NerConverter.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = RecursiveTokenizer()\ .setInputCols(["document"])\ .setOutputCol("token")\ .setPrefixes(["\"", "“", "(", "[", "\n", "."]) \ .setSuffixes(["\"", "”", ".", ",", "?", ")", "]", "!", ";", ":", "'s", "’s"]) spellModel = ContextSpellCheckerModel\ .pretrained()\ .setInputCols("token")\ .setOutputCol("checked") ``` ```scala val assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new RecursiveTokenizer() .setInputCols(Array("document")) .setOutputCol("token") .setPrefixes(Array("\"", "“", "(", "[", "\n", ".")) .setSuffixes(Array("\"", "”", ".", ",", "?", ")", "]", "!", ";", ":", "'s", "’s")) val spellChecker = ContextSpellCheckerModel. pretrained(). setInputCols("token"). setOutputCol("checked") ``` {:.nlu-block} ```python import nlu nlu.load("spell").predict("""Plese alliow me tao introdduce myslef, I am a man of waelth und tiaste""") ```
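The candidate-generation idea mentioned in the description (a Levenshtein automaton over single-character edits) can be illustrated in plain Python. This is a toy sketch of the concept only, not the Spark NLP implementation, and the vocabulary and test word below are made up:

```python
# Toy sketch of Damerau-Levenshtein candidate generation at edit
# distance 1 -- the candidate step behind a contextual spell checker.
# NOT the actual ContextSpellCheckerModel code.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def candidates_at_distance_one(word):
    """All strings one delete, transpose, substitute, or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | transposes | substitutes | inserts

def correct(word, vocabulary):
    """Return a vocabulary word reachable within one edit, if any.
    The real model ranks candidates with a neural language model;
    this sketch just picks deterministically."""
    if word in vocabulary:
        return word
    hits = candidates_at_distance_one(word) & vocabulary
    return min(hits) if hits else word

print(correct("teh", {"the", "token", "checker"}))  # -> the
```

In the real annotator the final choice among candidates is made by the neural language model using sentence context, which is why it is called a *context* spell checker.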
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|spellcheck_dl| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[corrected]| |Language:|en| ## Data Source American National Corpus. --- layout: model title: English image_classifier_vit_modeversion28_7 ViTForImageClassification from sudo-s author: John Snow Labs name: image_classifier_vit_modeversion28_7 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_modeversion28_7` is an English model originally trained by sudo-s. ## Predicted Entities `45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146` {:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_modeversion28_7_en_4.1.0_3.0_1660168471234.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_modeversion28_7_en_4.1.0_3.0_1660168471234.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_modeversion28_7", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_modeversion28_7", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_modeversion28_7| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.3 MB| --- layout: model title: BioBERT Embeddings (Discharge) author: John Snow Labs name: biobert_discharge_base_cased date: 2020-09-19 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.2 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains the pre-trained weights of ClinicalBERT for discharge summaries. This domain-specific model improves performance on 3/5 clinical NLP tasks and establishes a new state of the art on the MedNLI dataset. The details are described in the paper "[Publicly Available Clinical BERT Embeddings](https://www.aclweb.org/anthology/W19-1909/)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/biobert_discharge_base_cased_en_2.6.2_2.4_1600531401858.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/biobert_discharge_base_cased_en_2.6.2_2.4_1600531401858.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("biobert_discharge_base_cased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("biobert_discharge_base_cased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I hate cancer").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer"] embeddings_df = nlu.load('en.embed.biobert.discharge_base_cased').predict(text, output_level='token') embeddings_df ```
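The embeddings column produced above is usually fed to downstream annotators, but the raw token vectors can also be compared directly, most often with cosine similarity. A minimal pure-Python sketch, using made-up 3-d vectors (the real `biobert_discharge_base_cased` vectors are 768-dimensional, per the model card):

```python
import math

# Cosine similarity between two embedding vectors. The vectors here
# are made up for illustration; real BioBERT token embeddings have
# 768 components.
def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vec_a = [0.1, 0.3, -0.2]   # stand-in for one token's vector
vec_b = [0.2, 0.6, -0.4]   # stand-in for another (here, a scaled copy)
print(round(cosine_similarity(vec_a, vec_b), 4))  # -> 1.0
```

Because `vec_b` is an exact multiple of `vec_a`, the similarity is 1.0; unrelated tokens would score much lower.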
{:.h2_title} ## Results ```bash token en_embed_biobert_discharge_base_cased_embeddings I [0.0036486536264419556, 0.3796533942222595, -0... hate [0.1914958357810974, 0.6709488034248352, -0.49... cancer [0.04618441313505173, -0.04562612622976303, -0... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|biobert_discharge_base_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.2| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/EmilyAlsentzer/clinicalBERT](https://github.com/EmilyAlsentzer/clinicalBERT) --- layout: model title: Dispute Clause Binary Classifier author: John Snow Labs name: legclf_dispute_clauses_cuad date: 2023-01-18 tags: [en, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `dispute_clause` clause type. To use this model, make sure you provide enough context as an input. Sentences have been used as positive examples, so better results will be achieved if a SentenceDetector is added to the pipeline. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other 300+ Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `dispute_clause`, `other` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_dispute_clauses_cuad_en_1.0.0_3.0_1674056674986.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_dispute_clauses_cuad_en_1.0.0_3.0_1674056674986.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel() \ .pretrained("legclf_dispute_clauses_cuad","en","legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("is_dispute_clause") pipeline = nlp.Pipeline() \ .setStages( [ documentAssembler, embeddings, docClassifier ] ) fit_model = pipeline.fit(spark.createDataFrame([[""]]).toDF('text')) lm = nlp.LightPipeline(fit_model) pos_example = "24.2 The parties irrevocably agree that the courts of Ohio shall have non-exclusive jurisdiction to settle any dispute or claim that arises out of or in connection with this agreement or its subject matter or formation ( including non - contractual disputes or claims )." neg_example = "Brokers’ Fees and Expenses Except as expressly set forth in the Transaction Documents to the contrary, each party shall pay the fees and expenses of its advisers, counsel, accountants and other experts, if any, and all other expenses incurred by such party incident to the negotiation, preparation, execution, delivery and performance of this Agreement. The Company shall pay all transfer agent fees, stamp taxes and other taxes and duties levied in connection with the delivery of any Warrant Shares to the Purchasers. Steel Pier Capital Advisors, LLC shall be reimbursed its expenses in having the Transaction Documents prepared on behalf of the Company and for its obligations under the Security Agreement in an amount not to exceed $25,000.00." texts = [ pos_example, neg_example ] res = lm.annotate(texts) ```
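Since the sentence embeddings used here accept at most 512 tokens, long contracts should be split before classification. A minimal greedy whitespace-token splitter, as an illustrative sketch only (the workshop tutorial mentioned above covers proper splitting techniques):

```python
# Greedy whitespace-token splitter: keeps every chunk at or under the
# 512-token embedding limit. Illustrative only -- whitespace splitting
# is not the tokenization the Spark NLP embeddings actually use.
def split_into_chunks(text, max_tokens=512):
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

doc = "clause " * 1100          # a made-up 1100-token document
chunks = split_into_chunks(doc)
print([len(c.split()) for c in chunks])  # -> [512, 512, 76]
```

Each chunk can then be classified independently and the per-chunk labels aggregated.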
## Results ```bash ['dispute_clause'] ['other'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_dispute_clauses_cuad| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[label]| |Language:|en| |Size:|22.9 MB| ## References Manual annotations of CUAD dataset ## Benchmarking ```bash label precision recall f1-score support dispute_clause 1.00 1.00 1.00 61 other 1.00 1.00 1.00 96 accuracy - - 1.00 157 macro-avg 1.00 1.00 1.00 157 weighted-avg 1.00 1.00 1.00 157 ``` --- layout: model title: Pipeline to Detect Problems, Tests and Treatments author: John Snow Labs name: ner_healthcare_pipeline date: 2023-03-14 tags: [ner, licensed, clinical, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_healthcare](https://nlp.johnsnowlabs.com/2021/04/21/ner_healthcare_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_pipeline_en_4.3.0_3.2_1678824932575.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_pipeline_en_4.3.0_3.2_1678824932575.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_healthcare_pipeline", "en", "clinical/models") text = '''A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG .''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_healthcare_pipeline", "en", "clinical/models") val text = "A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG ." 
val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.healthcare_pipeline").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG .""") ```
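The ner_chunks in the results come from collapsing token-level IOB tags into entity spans, which is the job of the NerConverter stage in this pipeline. An illustrative pure-Python sketch of that collapse, using made-up tags for a fragment of the example sentence (not the Spark NLP code):

```python
# Collapse token-level B-/I-/O tags into (chunk_text, label) pairs,
# the operation a NerConverter performs. Tags below are made up for
# illustration.
def iob_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["She", "was", "on", "metformin", "for", "T2DM"]
tags = ["O", "O", "O", "B-TREATMENT", "O", "B-PROBLEM"]
print(iob_to_chunks(tokens, tags))
# -> [('metformin', 'TREATMENT'), ('T2DM', 'PROBLEM')]
```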
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:------------------------------|--------:|------:|:------------|-------------:| | 0 | gestational diabetes mellitus | 39 | 67 | PROBLEM | 0.938233 | | 1 | type two diabetes mellitus | 128 | 153 | PROBLEM | 0.762925 | | 2 | HTG-induced pancreatitis | 186 | 209 | PROBLEM | 0.9742 | | 3 | an acute hepatitis | 263 | 280 | PROBLEM | 0.915067 | | 4 | obesity | 288 | 294 | PROBLEM | 0.9926 | | 5 | a body mass index | 301 | 317 | TEST | 0.721175 | | 6 | BMI | 321 | 323 | TEST | 0.4466 | | 7 | polyuria | 380 | 387 | PROBLEM | 0.9987 | | 8 | polydipsia | 391 | 400 | PROBLEM | 0.9993 | | 9 | poor appetite | 404 | 416 | PROBLEM | 0.96315 | | 10 | vomiting | 424 | 431 | PROBLEM | 0.9588 | | 11 | amoxicillin | 511 | 521 | TREATMENT | 0.6453 | | 12 | a respiratory tract infection | 527 | 555 | PROBLEM | 0.867 | | 13 | metformin | 570 | 578 | TREATMENT | 0.9989 | | 14 | glipizide | 582 | 590 | TREATMENT | 0.9997 | | 15 | dapagliflozin | 598 | 610 | TREATMENT | 0.9996 | | 16 | T2DM | 616 | 619 | TREATMENT | 0.9662 | | 17 | atorvastatin | 625 | 636 | TREATMENT | 0.9993 | | 18 | gemfibrozil | 642 | 652 | TREATMENT | 0.9997 | | 19 | HTG | 658 | 660 | PROBLEM | 0.9927 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_healthcare_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|513.6 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English ElectraForQuestionAnswering model (from mrm8488) Version-2 author: John Snow Labs name: electra_qa_base_finetuned_squadv2 date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: 
BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-finetuned-squadv2` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_base_finetuned_squadv2_en_4.0.0_3.0_1655920687688.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_base_finetuned_squadv2_en_4.0.0_3.0_1655920687688.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_finetuned_squadv2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_finetuned_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.electra.base_v2").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
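Under the hood, extractive QA heads like this one score every context token as a potential answer start and answer end; decoding picks the highest-scoring valid span. An illustrative sketch with made-up logits for the card's example context (not the actual model's numbers or decoding code):

```python
# Decode an answer span from per-token start/end scores: choose the
# (start, end) pair with the highest combined score, with end >= start
# and a length cap. Logits below are invented for illustration.
def best_span(start_logits, end_logits, max_answer_len=15):
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley"]
start = [0.1, 0.0, 0.2, 4.0, 0.0, 0.1, 0.0, 0.0, 0.5]
end   = [0.0, 0.1, 0.0, 3.5, 0.2, 0.0, 0.1, 0.0, 1.0]
s, e = best_span(start, end)
print(" ".join(context[s:e + 1]))  # -> Clara
```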
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_base_finetuned_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|408.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/electra-base-finetuned-squadv2 --- layout: model title: Sentence Entity Resolver for UMLS CUI Codes author: John Snow Labs name: sbiobertresolve_umls_major_concepts date: 2021-05-02 tags: [en, clinical, licensed, entity_resolution] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.2 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Map clinical entities to UMLS CUI codes. ## Predicted Entities This model returns CUI (concept unique identifier) codes for `Clinical Findings`, `Medical Devices`, `Anatomical Structures` and `Injuries & Poisoning` terms {:.btn-box} [Live Demo](http://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_major_concepts_en_3.0.2_3.0_1619973285528.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_major_concepts_en_3.0.2_3.0_1619973285528.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use ```sbiobertresolve_umls_major_concepts``` resolver model must be used with ```sbiobert_base_cased_mli``` as embeddings 
and ```ner_jsl``` as the NER model, with ```Cerebrovascular_Disease, Communicable_Disease, Diabetes, Disease_Syndrome_Disorder, Heart_Disease, Hyperlipidemia, Hypertension, Injury_or_Poisoning, Kidney_Disease, Medical-Device, Obesity, Oncological, Overweight, Psychological_Condition, Symptom, VS_Finding, ImagingFindings, EKG_Findings``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli",'en','clinical/models')\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_umls_major_concepts", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") pipeline = Pipeline(stages = [document_assembler, sentence_detector, tokens, embeddings, ner, ner_converter, chunk2doc, sbert_embedder, resolver]) data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting."""]]).toDF("text") results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.umls").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.""") ```
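The resolver configured above ranks candidate codes by Euclidean distance between sentence embeddings (per `setDistanceFunction("EUCLIDEAN")`). A toy sketch of that nearest-neighbour lookup, with made-up 2-d vectors standing in for the real 768-d embeddings (the CUI codes are real ones from this card's results, for T2DM and obesity):

```python
import math

# Nearest-neighbour lookup by Euclidean distance, mirroring the
# resolver's setDistanceFunction("EUCLIDEAN"). Vectors are invented;
# only the CUI code strings come from the card's results.
def resolve(embedding, code_embeddings):
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return min(code_embeddings,
               key=lambda code: dist(embedding, code_embeddings[code]))

codes = {"C4014362": [0.9, 0.1],   # type two diabetes mellitus
         "C1963185": [0.1, 0.8]}   # obesity
print(resolve([0.85, 0.2], codes))  # -> C4014362
```

The real model performs this search over a precomputed index of UMLS concept embeddings rather than a Python dictionary.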
## Results ```bash | | ner_chunk | resolution | |---:|:------------------------------|:-------------| | 0 | 28-year-old | C1864118 | | 1 | female | C3887375 | | 2 | gestational diabetes mellitus | C2183115 | | 3 | eight years prior | C5195266 | | 4 | subsequent | C3844350 | | 5 | type two diabetes mellitus | C4014362 | | 6 | T2DM | C4014362 | | 7 | HTG-induced pancreatitis | C4554179 | | 8 | three years prior | C1866782 | | 9 | acute | C1332147 | | 10 | hepatitis | C1963279 | | 11 | obesity | C1963185 | | 12 | body mass index | C0578022 | | 13 | 33.5 kg/m2 | C2911054 | | 14 | one-week | C0420331 | | 15 | polyuria | C3278312 | | 16 | polydipsia | C3278316 | | 17 | poor appetite | C0541799 | | 18 | vomiting | C1963281 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_umls_major_concepts| |Compatibility:|Healthcare NLP 3.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[umls_code]| |Language:|en| ## Data Source https://www.nlm.nih.gov/research/umls/index.html --- layout: model title: English image_classifier_vit_lucky_model ViTForImageClassification from Loc author: John Snow Labs name: image_classifier_vit_lucky_model date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_lucky_model` is an English model originally trained by Loc.
## Predicted Entities `turnstile`, `damselfly`, `mixing bowl`, `sea snake`, `cockroach, roach`, `buckle`, `beer glass`, `bulbul`, `lumbermill, sawmill`, `whippet`, `Australian terrier`, `television, television system`, `hoopskirt, crinoline`, `horse cart, horse-cart`, `guillotine`, `malamute, malemute, Alaskan malamute`, `coyote, prairie wolf, brush wolf, Canis latrans`, `colobus, colobus monkey`, `hognose snake, puff adder, sand viper`, `sock`, `burrito`, `printer`, `bathing cap, swimming cap`, `chiton, coat-of-mail shell, sea cradle, polyplacophore`, `Rottweiler`, `cello, violoncello`, `pitcher, ewer`, `computer keyboard, keypad`, `bow`, `peacock`, `ballplayer, baseball player`, `refrigerator, icebox`, `solar dish, solar collector, solar furnace`, `passenger car, coach, carriage`, `African chameleon, Chamaeleo chamaeleon`, `oboe, hautboy, hautbois`, `toyshop`, `Leonberg`, `howler monkey, howler`, `bluetick`, `African elephant, Loxodonta africana`, `American lobster, Northern lobster, Maine lobster, Homarus americanus`, `combination lock`, `black-and-tan coonhound`, `bonnet, poke bonnet`, `harvester, reaper`, `Appenzeller`, `iron, smoothing iron`, `electric locomotive`, `lycaenid, lycaenid butterfly`, `sandbar, sand bar`, `Cardigan, Cardigan Welsh corgi`, `pencil sharpener`, `jean, blue jean, denim`, `backpack, back pack, knapsack, packsack, rucksack, haversack`, `monitor`, `ice cream, icecream`, `apiary, bee house`, `water jug`, `American coot, marsh hen, mud hen, water hen, Fulica americana`, `ground beetle, carabid beetle`, `jigsaw puzzle`, `ant, emmet, pismire`, `wreck`, `kuvasz`, `gyromitra`, `Ibizan hound, Ibizan Podenco`, `brown bear, bruin, Ursus arctos`, `bolo tie, bolo, bola tie, bola`, `Pembroke, Pembroke Welsh corgi`, `French bulldog`, `prison, prison house`, `ballpoint, ballpoint pen, ballpen, Biro`, `stage`, `airliner`, `dogsled, dog sled, dog sleigh`, `redshank, Tringa totanus`, `menu`, `Indian cobra, Naja naja`, `swab, swob, mop`, `window screen`, 
`brain coral`, `artichoke, globe artichoke`, `loupe, jeweler's loupe`, `loudspeaker, speaker, speaker unit, loudspeaker system, speaker system`, `panpipe, pandean pipe, syrinx`, `wok`, `croquet ball`, `plate`, `scoreboard`, `Samoyed, Samoyede`, `ocarina, sweet potato`, `beaver`, `borzoi, Russian wolfhound`, `horizontal bar, high bar`, `stretcher`, `seat belt, seatbelt`, `obelisk`, `forklift`, `feather boa, boa`, `frying pan, frypan, skillet`, `barbershop`, `hamper`, `face powder`, `Siamese cat, Siamese`, `ladle`, `dingo, warrigal, warragal, Canis dingo`, `mountain tent`, `head cabbage`, `echidna, spiny anteater, anteater`, `Polaroid camera, Polaroid Land camera`, `dumbbell`, `espresso`, `notebook, notebook computer`, `Norfolk terrier`, `binoculars, field glasses, opera glasses`, `carpenter's kit, tool kit`, `moving van`, `catamaran`, `tiger beetle`, `bikini, two-piece`, `Siberian husky`, `studio couch, day bed`, `bulletproof vest`, `lawn mower, mower`, `promontory, headland, head, foreland`, `soap dispenser`, `vulture`, `dam, dike, dyke`, `brambling, Fringilla montifringilla`, `toilet tissue, toilet paper, bathroom tissue`, `ringlet, ringlet butterfly`, `tiger cat`, `mobile home, manufactured home`, `Norwich terrier`, `little blue heron, Egretta caerulea`, `English setter`, `Tibetan mastiff`, `rocking chair, rocker`, `mask`, `maze, labyrinth`, `bookcase`, `viaduct`, `sweatshirt`, `plow, plough`, `basenji`, `typewriter keyboard`, `Windsor tie`, `coral fungus`, `desktop computer`, `Kerry blue terrier`, `Angora, Angora rabbit`, `can opener, tin opener`, `shield, buckler`, `triumphal arch`, `horned viper, cerastes, sand viper, horned asp, Cerastes cornutus`, `miniature schnauzer`, `tape player`, `jaguar, panther, Panthera onca, Felis onca`, `hook, claw`, `file, file cabinet, filing cabinet`, `chime, bell, gong`, `shower curtain`, `window shade`, `acoustic guitar`, `gas pump, gasoline pump, petrol pump, island dispenser`, `cicada, cicala`, `Petri dish`, `paintbrush`, 
`banana`, `chickadee`, `mountain bike, all-terrain bike, off-roader`, `lighter, light, igniter, ignitor`, `oil filter`, `cab, hack, taxi, taxicab`, `Christmas stocking`, `rugby ball`, `black widow, Latrodectus mactans`, `bustard`, `fiddler crab`, `web site, website, internet site, site`, `chocolate sauce, chocolate syrup`, `chainlink fence`, `fireboat`, `cocktail shaker`, `airship, dirigible`, `projectile, missile`, `bagel, beigel`, `screwdriver`, `oystercatcher, oyster catcher`, `pot, flowerpot`, `water bottle`, `Loafer`, `drumstick`, `soccer ball`, `cairn, cairn terrier`, `padlock`, `tow truck, tow car, wrecker`, `bloodhound, sleuthhound`, `punching bag, punch bag, punching ball, punchball`, `great grey owl, great gray owl, Strix nebulosa`, `scale, weighing machine`, `trench coat`, `briard`, `cheetah, chetah, Acinonyx jubatus`, `entertainment center`, `Boston bull, Boston terrier`, `Arabian camel, dromedary, Camelus dromedarius`, `steam locomotive`, `coil, spiral, volute, whorl, helix`, `plane, carpenter's plane, woodworking plane`, `gondola`, `spider web, spider's web`, `bathtub, bathing tub, bath, tub`, `pelican`, `miniature poodle`, `cowboy boot`, `perfume, essence`, `lakeside, lakeshore`, `timber wolf, grey wolf, gray wolf, Canis lupus`, `moped`, `sunscreen, sunblock, sun blocker`, `Brabancon griffon`, `puffer, pufferfish, blowfish, globefish`, `lifeboat`, `pool table, billiard table, snooker table`, `Bouvier des Flandres, Bouviers des Flandres`, `Pomeranian`, `theater curtain, theatre curtain`, `marimba, xylophone`, `baboon`, `vacuum, vacuum cleaner`, `pill bottle`, `pick, plectrum, plectron`, `hen`, `American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier`, `digital watch`, `pier`, `oxygen mask`, `Tibetan terrier, chrysanthemum dog`, `ostrich, Struthio camelus`, `water ouzel, dipper`, `drilling platform, offshore rig`, `magnetic compass`, `throne`, `butternut squash`, `minibus`, `EntleBucher`, `carousel, carrousel, 
merry-go-round, roundabout, whirligig`, `hot pot, hotpot`, `rain barrel`, `wood rabbit, cottontail, cottontail rabbit`, `miniature pinscher`, `partridge`, `three-toed sloth, ai, Bradypus tridactylus`, `English springer, English springer spaniel`, `corkscrew, bottle screw`, `fur coat`, `robin, American robin, Turdus migratorius`, `dowitcher`, `ruddy turnstone, Arenaria interpres`, `water snake`, `stove`, `Great Pyrenees`, `soft-coated wheaten terrier`, `carbonara`, `snail`, `breastplate, aegis, egis`, `wolf spider, hunting spider`, `hatchet`, `CD player`, `axolotl, mud puppy, Ambystoma mexicanum`, `pomegranate`, `poncho`, `leatherback turtle, leatherback, leathery turtle, Dermochelys coriacea`, `lorikeet`, `spatula`, `jay`, `platypus, duckbill, duckbilled platypus, duck-billed platypus, Ornithorhynchus anatinus`, `stethoscope`, `flagpole, flagstaff`, `coho, cohoe, coho salmon, blue jack, silver salmon, Oncorhynchus kisutch`, `agama`, `red wolf, maned wolf, Canis rufus, Canis niger`, `beaker`, `eft`, `pretzel`, `brassiere, bra, bandeau`, `frilled lizard, Chlamydosaurus kingi`, `joystick`, `goldfish, Carassius auratus`, `fig`, `maypole`, `caldron, cauldron`, `admiral`, `impala, Aepyceros melampus`, `spotted salamander, Ambystoma maculatum`, `syringe`, `hog, pig, grunter, squealer, Sus scrofa`, `handkerchief, hankie, hanky, hankey`, `tarantula`, `cheeseburger`, `pinwheel`, `sax, saxophone`, `dung beetle`, `broccoli`, `cassette player`, `milk can`, `traffic light, traffic signal, stoplight`, `shovel`, `sarong`, `tabby, tabby cat`, `parallel bars, bars`, `ladybug, ladybeetle, lady beetle, ladybird, ladybird beetle`, `quill, quill pen`, `giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca`, `steel drum`, `quail`, `Blenheim spaniel`, `wig`, `hamster`, `ice lolly, lolly, lollipop, popsicle`, `seashore, coast, seacoast, sea-coast`, `chest`, `worm fence, snake fence, snake-rail fence, Virginia fence`, `missile`, `beer bottle`, `yellow lady's slipper, yellow 
lady-slipper, Cypripedium calceolus, Cypripedium parviflorum`, `breakwater, groin, groyne, mole, bulwark, seawall, jetty`, `white wolf, Arctic wolf, Canis lupus tundrarum`, `guacamole`, `porcupine, hedgehog`, `trolleybus, trolley coach, trackless trolley`, `greenhouse, nursery, glasshouse`, `trimaran`, `Italian greyhound`, `potter's wheel`, `jacamar`, `wallet, billfold, notecase, pocketbook`, `Lakeland terrier`, `green lizard, Lacerta viridis`, `indigo bunting, indigo finch, indigo bird, Passerina cyanea`, `green mamba`, `walking stick, walkingstick, stick insect`, `crossword puzzle, crossword`, `eggnog`, `barrow, garden cart, lawn cart, wheelbarrow`, `remote control, remote`, `bicycle-built-for-two, tandem bicycle, tandem`, `wool, woolen, woollen`, `black grouse`, `abaya`, `marmoset`, `golf ball`, `jeep, landrover`, `Mexican hairless`, `dishwasher, dish washer, dishwashing machine`, `jersey, T-shirt, tee shirt`, `planetarium`, `goose`, `mailbox, letter box`, `capuchin, ringtail, Cebus capucinus`, `marmot`, `orangutan, orang, orangutang, Pongo pygmaeus`, `coffeepot`, `ambulance`, `shopping basket`, `pop bottle, soda bottle`, `red fox, Vulpes vulpes`, `crash helmet`, `street sign`, `affenpinscher, monkey pinscher, monkey dog`, `Arctic fox, white fox, Alopex lagopus`, `sidewinder, horned rattlesnake, Crotalus cerastes`, `ruffed grouse, partridge, Bonasa umbellus`, `muzzle`, `measuring cup`, `canoe`, `reflex camera`, `fox squirrel, eastern fox squirrel, Sciurus niger`, `French loaf`, `killer whale, killer, orca, grampus, sea wolf, Orcinus orca`, `dial telephone, dial phone`, `thimble`, `bubble`, `vizsla, Hungarian pointer`, `running shoe`, `mailbag, postbag`, `radio telescope, radio reflector`, `piggy bank, penny bank`, `Chihuahua`, `chambered nautilus, pearly nautilus, nautilus`, `Airedale, Airedale terrier`, `kimono`, `green snake, grass snake`, `rubber eraser, rubber, pencil eraser`, `upright, upright piano`, `orange`, `revolver, six-gun, six-shooter`, `ashcan, 
trash can, garbage can, wastebin, ash bin, ash-bin, ashbin, dustbin, trash barrel, trash bin`, `drum, membranophone, tympan`, `Dungeness crab, Cancer magister`, `lipstick, lip rouge`, `gong, tam-tam`, `fountain`, `tub, vat`, `malinois`, `sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita`, `German short-haired pointer`, `apron`, `Irish setter, red setter`, `dishrag, dishcloth`, `school bus`, `candle, taper, wax light`, `bib`, `cash machine, cash dispenser, automated teller machine, automatic teller machine, automated teller, automatic teller, ATM`, `power drill`, `English foxhound`, `miniskirt, mini`, `swing`, `slug`, `hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa`, `rifle`, `Saluki, gazelle hound`, `Sealyham terrier, Sealyham`, `bullet train, bullet`, `hyena, hyaena`, `ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus`, `toy terrier`, `goblet`, `safe`, `cup`, `electric guitar`, `red wine`, `restaurant, eating house, eating place, eatery`, `wall clock`, `washbasin, handbasin, washbowl, lavabo, wash-hand basin`, `red-breasted merganser, Mergus serrator`, `crate`, `banded gecko`, `hippopotamus, hippo, river horse, Hippopotamus amphibius`, `tick`, `tripod`, `sombrero`, `desk`, `sea slug, nudibranch`, `racer, race car, racing car`, `pizza, pizza pie`, `dining table, board`, `Saint Bernard, St Bernard`, `komondor`, `electric ray, crampfish, numbfish, torpedo`, `prairie chicken, prairie grouse, prairie fowl`, `coffee mug`, `hammer`, `golfcart, golf cart`, `unicycle, monocycle`, `bison`, `soup bowl`, `rapeseed`, `golden retriever`, `plastic bag`, `grey fox, gray fox, Urocyon cinereoargenteus`, `water tower`, `house finch, linnet, Carpodacus mexicanus`, `barbell`, `hair slide`, `tiger, Panthera tigris`, `black-footed ferret, ferret, Mustela nigripes`, `meat loaf, meatloaf`, `hand blower, blow dryer, blow drier, hair dryer, hair drier`, `overskirt`, `gibbon, Hylobates lar`, `Gila monster, Heloderma suspectum`, `toucan`, 
`snowmobile`, `pencil box, pencil case`, `scuba diver`, `cloak`, `Sussex spaniel`, `otter`, `Greater Swiss Mountain dog`, `great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias`, `torch`, `magpie`, `tiger shark, Galeocerdo cuvieri`, `wing`, `Border collie`, `bell cote, bell cot`, `sea anemone, anemone`, `teapot`, `sea urchin`, `screen, CRT screen`, `bookshop, bookstore, bookstall`, `oscilloscope, scope, cathode-ray oscilloscope, CRO`, `crib, cot`, `police van, police wagon, paddy wagon, patrol wagon, wagon, black Maria`, `hartebeest`, `manhole cover`, `iPod`, `rock python, rock snake, Python sebae`, `nipple`, `suspension bridge`, `safety pin`, `sea lion`, `cougar, puma, catamount, mountain lion, painter, panther, Felis concolor`, `mantis, mantid`, `wardrobe, closet, press`, `projector`, `Granny Smith`, `diamondback, diamondback rattlesnake, Crotalus adamanteus`, `pirate, pirate ship`, `espresso maker`, `African hunting dog, hyena dog, Cape hunting dog, Lycaon pictus`, `cradle`, `common newt, Triturus vulgaris`, `tricycle, trike, velocipede`, `bobsled, bobsleigh, bob`, `thunder snake, worm snake, Carphophis amoenus`, `thresher, thrasher, threshing machine`, `banjo`, `armadillo`, `pajama, pyjama, pj's, jammies`, `ski`, `Maltese dog, Maltese terrier, Maltese`, `leafhopper`, `book jacket, dust cover, dust jacket, dust wrapper`, `silky terrier, Sydney silky`, `Shih-Tzu`, `wallaby, brush kangaroo`, `cardigan`, `sturgeon`, `freight car`, `home theater, home theatre`, `sundial`, `African crocodile, Nile crocodile, Crocodylus niloticus`, `odometer, hodometer, mileometer, milometer`, `sliding door`, `vine snake`, `West Highland white terrier`, `mongoose`, `hornbill`, `beagle`, `European gallinule, Porphyrio porphyrio`, `submarine, pigboat, sub, U-boat`, `Komodo dragon, Komodo lizard, dragon lizard, giant lizard, Varanus komodoensis`, `cock`, `pedestal, plinth, footstall`, `accordion, piano accordion, squeeze box`, `gown`, `lynx, catamount`, 
`guenon, guenon monkey`, `Walker hound, Walker foxhound`, `standard schnauzer`, `reel`, `hip, rose hip, rosehip`, `grasshopper, hopper`, `Dutch oven`, `stone wall`, `hard disc, hard disk, fixed disk`, `snow leopard, ounce, Panthera uncia`, `shopping cart`, `digital clock`, `hourglass`, `Border terrier`, `Old English sheepdog, bobtail`, `academic gown, academic robe, judge's robe`, `spiny lobster, langouste, rock lobster, crawfish, crayfish, sea crawfish`, `spotlight, spot`, `dome`, `barn spider, Araneus cavaticus`, `bee eater`, `basketball`, `cliff dwelling`, `folding chair`, `isopod`, `Doberman, Doberman pinscher`, `bittern`, `sunglasses, dark glasses, shades`, `picket fence, paling`, `Crock Pot`, `ibex, Capra ibex`, `neck brace`, `cardoon`, `cassette`, `amphibian, amphibious vehicle`, `minivan`, `analog clock`, `trailer truck, tractor trailer, trucking rig, rig, articulated lorry, semi`, `yurt`, `cliff, drop, drop-off`, `Bernese mountain dog`, `teddy, teddy bear`, `sloth bear, Melursus ursinus, Ursus ursinus`, `bassoon`, `toaster`, `ptarmigan`, `Gordon setter`, `night snake, Hypsiglena torquata`, `grand piano, grand`, `purse`, `clumber, clumber spaniel`, `shoji`, `hair spray`, `maillot`, `knee pad`, `space heater`, `bottlecap`, `chiffonier, commode`, `chain saw, chainsaw`, `sulphur butterfly, sulfur butterfly`, `pay-phone, pay-station`, `kelpie`, `mouse, computer mouse`, `car wheel`, `cornet, horn, trumpet, trump`, `container ship, containership, container vessel`, `matchstick`, `scabbard`, `American black bear, black bear, Ursus americanus, Euarctos americanus`, `langur`, `rock crab, Cancer irroratus`, `lionfish`, `speedboat`, `black stork, Ciconia nigra`, `knot`, `disk brake, disc brake`, `mosquito net`, `white stork, Ciconia ciconia`, `abacus`, `titi, titi monkey`, `grocery store, grocery, food market, market`, `waffle iron`, `pickelhaube`, `wooden spoon`, `Norwegian elkhound, elkhound`, `earthstar`, `sewing machine`, `balance beam, beam`, `potpie`, `chain 
mail, ring mail, mail, chain armor, chain armour, ring armor, ring armour`, `Staffordshire bullterrier, Staffordshire bull terrier`, `switch, electric switch, electrical switch`, `dhole, Cuon alpinus`, `paddle, boat paddle`, `limousine, limo`, `Shetland sheepdog, Shetland sheep dog, Shetland`, `space bar`, `library`, `paddlewheel, paddle wheel`, `alligator lizard`, `Band Aid`, `Persian cat`, `bull mastiff`, `tailed frog, bell toad, ribbed toad, tailed toad, Ascaphus trui`, `sports car, sport car`, `football helmet`, `laptop, laptop computer`, `lens cap, lens cover`, `tennis ball`, `violin, fiddle`, `lab coat, laboratory coat`, `cinema, movie theater, movie theatre, movie house, picture palace`, `weasel`, `bow tie, bow-tie, bowtie`, `macaw`, `dough`, `whiskey jug`, `microphone, mike`, `spoonbill`, `bassinet`, `mud turtle`, `velvet`, `warthog`, `plunger, plumber's helper`, `dugong, Dugong dugon`, `honeycomb`, `badger`, `dragonfly, darning needle, devil's darning needle, sewing needle, snake feeder, snake doctor, mosquito hawk, skeeter hawk`, `bee`, `doormat, welcome mat`, `fountain pen`, `giant schnauzer`, `assault rifle, assault gun`, `limpkin, Aramus pictus`, `siamang, Hylobates syndactylus, Symphalangus syndactylus`, `albatross, mollymawk`, `confectionery, confectionary, candy store`, `harp`, `parachute, chute`, `barrel, cask`, `tank, army tank, armored combat vehicle, armoured combat vehicle`, `collie`, `kite`, `puck, hockey puck`, `stupa, tope`, `buckeye, horse chestnut, conker`, `patio, terrace`, `broom`, `Dandie Dinmont, Dandie Dinmont terrier`, `scorpion`, `agaric`, `balloon`, `bucket, pail`, `squirrel monkey, Saimiri sciureus`, `Eskimo dog, husky`, `zebra`, `garter snake, grass snake`, `indri, indris, Indri indri, Indri brevicaudatus`, `tractor`, `guinea pig, Cavia cobaya`, `maraca`, `red-backed sandpiper, dunlin, Erolia alpina`, `bullfrog, Rana catesbeiana`, `trilobite`, `Japanese spaniel`, `gorilla, Gorilla gorilla`, `monastery`, `centipede`, `terrapin`, 
`llama`, `long-horned beetle, longicorn, longicorn beetle`, `boxer`, `curly-coated retriever`, `mortar`, `hammerhead, hammerhead shark`, `goldfinch, Carduelis carduelis`, `garden spider, Aranea diademata`, `stopwatch, stop watch`, `grey whale, gray whale, devilfish, Eschrichtius gibbosus, Eschrichtius robustus`, `leaf beetle, chrysomelid`, `birdhouse`, `king crab, Alaska crab, Alaskan king crab, Alaska king crab, Paralithodes camtschatica`, `stole`, `bell pepper`, `radiator`, `flatworm, platyhelminth`, `mushroom`, `Scotch terrier, Scottish terrier, Scottie`, `liner, ocean liner`, `toilet seat`, `lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens`, `zucchini, courgette`, `harvestman, daddy longlegs, Phalangium opilio`, `Newfoundland, Newfoundland dog`, `flamingo`, `whiptail, whiptail lizard`, `geyser`, `cleaver, meat cleaver, chopper`, `sea cucumber, holothurian`, `American egret, great white heron, Egretta albus`, `parking meter`, `beacon, lighthouse, beacon light, pharos`, `coucal`, `motor scooter, scooter`, `mitten`, `cannon`, `weevil`, `megalith, megalithic structure`, `stinkhorn, carrion fungus`, `ear, spike, capitulum`, `box turtle, box tortoise`, `snowplow, snowplough`, `tench, Tinca tinca`, `modem`, `tobacco shop, tobacconist shop, tobacconist`, `barn`, `skunk, polecat, wood pussy`, `African grey, African gray, Psittacus erithacus`, `Madagascar cat, ring-tailed lemur, Lemur catta`, `holster`, `barometer`, `sleeping bag`, `washer, automatic washer, washing machine`, `recreational vehicle, RV, R.V.`, `drake`, `tray`, `butcher shop, meat market`, `china cabinet, china closet`, `medicine chest, medicine cabinet`, `photocopier`, `Yorkshire terrier`, `starfish, sea star`, `racket, racquet`, `park bench`, `Labrador retriever`, `whistle`, `clog, geta, patten, sabot`, `volcano`, `quilt, comforter, comfort, puff`, `leopard, Panthera pardus`, `cauliflower`, `swimming trunks, bathing trunks`, `American chameleon, anole, Anolis carolinensis`, `alp`, 
`mortarboard`, `barracouta, snoek`, `cocker spaniel, English cocker spaniel, cocker`, `space shuttle`, `beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon`, `harmonica, mouth organ, harp, mouth harp`, `gasmask, respirator, gas helmet`, `wombat`, `Model T`, `wild boar, boar, Sus scrofa`, `hermit crab`, `flat-coated retriever`, `rotisserie`, `jinrikisha, ricksha, rickshaw`, `trifle`, `bannister, banister, balustrade, balusters, handrail`, `go-kart`, `bakery, bakeshop, bakehouse`, `ski mask`, `dock, dockage, docking facility`, `Egyptian cat`, `oxcart`, `redbone`, `shoe shop, shoe-shop, shoe store`, `convertible`, `ox`, `crayfish, crawfish, crawdad, crawdaddy`, `cowboy hat, ten-gallon hat`, `conch`, `spaghetti squash`, `toy poodle`, `saltshaker, salt shaker`, `microwave, microwave oven`, `triceratops`, `necklace`, `castle`, `streetcar, tram, tramcar, trolley, trolley car`, `eel`, `diaper, nappy, napkin`, `standard poodle`, `prayer rug, prayer mat`, `radio, wireless`, `crane`, `envelope`, `rule, ruler`, `gar, garfish, garpike, billfish, Lepisosteus osseus`, `spider monkey, Ateles geoffroyi`, `Irish wolfhound`, `German shepherd, German shepherd dog, German police dog, alsatian`, `umbrella`, `sunglass`, `aircraft carrier, carrier, flattop, attack aircraft carrier`, `water buffalo, water ox, Asiatic buffalo, Bubalus bubalis`, `jellyfish`, `groom, bridegroom`, `tree frog, tree-frog`, `steel arch bridge`, `lemon`, `pickup, pickup truck`, `vault`, `groenendael`, `baseball`, `junco, snowbird`, `maillot, tank suit`, `gazelle`, `jack-o'-lantern`, `military uniform`, `slide rule, slipstick`, `wire-haired fox terrier`, `acorn squash`, `electric fan, blower`, `Brittany spaniel`, `chimpanzee, chimp, Pan troglodytes`, `pillow`, `binder, ring-binder`, `schipperke`, `Afghan hound, Afghan`, `plate rack`, `car mirror`, `hand-held computer, hand-held microcomputer`, `papillon`, `schooner`, `Bedlington terrier`, `cellular telephone, cellular phone, 
cellphone, cell, mobile phone`, `altar`, `Chesapeake Bay retriever`, `cabbage butterfly`, `polecat, fitch, foulmart, foumart, Mustela putorius`, `comic book`, `French horn, horn`, `daisy`, `organ, pipe organ`, `mashed potato`, `acorn`, `fly`, `chain`, `American alligator, Alligator mississipiensis`, `mink`, `garbage truck, dustcart`, `totem pole`, `wine bottle`, `strawberry`, `cricket`, `European fire salamander, Salamandra salamandra`, `coral reef`, `Welsh springer spaniel`, `bighorn, bighorn sheep, cimarron, Rocky Mountain bighorn, Rocky Mountain sheep, Ovis canadensis`, `snorkel`, `bald eagle, American eagle, Haliaeetus leucocephalus`, `meerkat, mierkat`, `grille, radiator grille`, `nematode, nematode worm, roundworm`, `anemone fish`, `corn`, `loggerhead, loggerhead turtle, Caretta caretta`, `palace`, `suit, suit of clothes`, `pineapple, ananas`, `macaque`, `ping-pong ball`, `ram, tup`, `church, church building`, `koala, koala bear, kangaroo bear, native bear, Phascolarctos cinereus`, `hare`, `bath towel`, `strainer`, `yawl`, `otterhound, otter hound`, `table lamp`, `king snake, kingsnake`, `lotion`, `lion, king of beasts, Panthera leo`, `thatch, thatched roof`, `basset, basset hound`, `black and gold garden spider, Argiope aurantia`, `barber chair`, `proboscis monkey, Nasalis larvatus`, `consomme`, `Irish terrier`, `Irish water spaniel`, `common iguana, iguana, Iguana iguana`, `Weimaraner`, `Great Dane`, `pug, pug-dog`, `rhinoceros beetle`, `vase`, `brass, memorial tablet, plaque`, `kit fox, Vulpes macrotis`, `king penguin, Aptenodytes patagonica`, `vending machine`, `dalmatian, coach dog, carriage dog`, `rock beauty, Holocanthus tricolor`, `pole`, `cuirass`, `bolete`, `jackfruit, jak, jack`, `monarch, monarch butterfly, milkweed butterfly, Danaus plexippus`, `chow, chow chow`, `nail`, `packet`, `half track`, `Lhasa, Lhasa apso`, `boathouse`, `hay`, `valley, vale`, `slot, one-armed bandit`, `volleyball`, `carton`, `shower cap`, `tile roof`, `lacewing, lacewing 
fly`, `patas, hussar monkey, Erythrocebus patas`, `boa constrictor, Constrictor constrictor`, `black swan, Cygnus atratus`, `lampshade, lamp shade`, `mousetrap`, `crutch`, `vestment`, `Pekinese, Pekingese, Peke`, `tusker`, `warplane, military plane`, `sandal`, `screw`, `custard apple`, `Scottish deerhound, deerhound`, `spindle`, `keeshond`, `hummingbird`, `letter opener, paper knife, paperknife`, `cucumber, cuke`, `bearskin, busby, shako`, `fire engine, fire truck`, `trombone`, `ringneck snake, ring-necked snake, ring snake`, `sorrel`, `fire screen, fireguard`, `paper towel`, `flute, transverse flute`, `hotdog, hot dog, red hot`, `Indian elephant, Elephas maximus`, `mosque`, `stingray`, `Rhodesian ridgeback`, `four-poster` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_lucky_model_en_4.1.0_3.0_1660166485109.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_lucky_model_en_4.1.0_3.0_1660166485109.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_lucky_model", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_lucky_model", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_lucky_model| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|324.8 MB| --- layout: model title: English image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house_apartment_office ViTForImageClassification from mayoughi author: John Snow Labs name: image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house_apartment_office date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house_apartment_office` is an English model originally trained by mayoughi. ## Predicted Entities `office`, `balcony`, `restaurant`, `hospital`, `inside apartment`, `airport`, `hallway`, `inside coffee house` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house_apartment_office_en_4.1.0_3.0_1660172866138.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house_apartment_office_en_4.1.0_3.0_1660172866138.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house_apartment_office", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house_apartment_office", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house_apartment_office| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Google's Tapas Table Understanding (Medium, WTQ) author: John Snow Labs name: table_qa_tapas_medium_finetuned_wtq date: 2022-09-30 tags: [en, table, qa, question, answering, open_source] task: Table Question Answering language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: TapasForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a zero-shot table understanding model that lets you carry out question answering on Spark DataFrames. If your data is stored in a tabular file format such as CSV, load it into a Spark DataFrame before use. Size of this model: Medium. Has aggregation operations?: True. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_medium_finetuned_wtq_en_4.2.0_3.0_1664530490771.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_medium_finetuned_wtq_en_4.2.0_3.0_1664530490771.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python json_data = """ { "header": ["name", "money", "age"], "rows": [ ["Donald Trump", "$100,000,000", "75"], ["Elon Musk", "$20,000,000,000,000", "55"] ] } """ queries = [ "Who earns less than 200,000,000?", "Who earns 100,000,000?", "How much money has Donald Trump?", "How old are they?", ] data = spark.createDataFrame([ [json_data, " ".join(queries)] ]).toDF("table_json", "questions") document_assembler = MultiDocumentAssembler() \ .setInputCols("table_json", "questions") \ .setOutputCols("document_table", "document_questions") sentence_detector = SentenceDetector() \ .setInputCols(["document_questions"]) \ .setOutputCol("questions") table_assembler = TableAssembler()\ .setInputCols(["document_table"])\ .setOutputCol("table") tapas = TapasForQuestionAnswering\ .pretrained("table_qa_tapas_medium_finetuned_wtq","en")\ .setInputCols(["questions", "table"])\ .setOutputCol("answers") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, table_assembler, tapas ]) model = pipeline.fit(data) model\ .transform(data)\ .selectExpr("explode(answers) AS answer")\ .select("answer")\ .show(truncate=False) ```
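If the table lives in a CSV file rather than inline JSON, the `table_json` structure fed to `TableAssembler` above can be built with the standard library alone. A minimal sketch, assuming the CSV's first row is the header (the sample data mirrors the table in the example):

```python
import csv
import io
import json

def csv_to_table_json(csv_text: str) -> str:
    # Convert CSV text into the {"header": [...], "rows": [...]} JSON
    # structure that TableAssembler expects; the first row is the header.
    parsed = list(csv.reader(io.StringIO(csv_text)))
    return json.dumps({"header": parsed[0], "rows": parsed[1:]})

csv_text = (
    "name,money,age\n"
    'Donald Trump,"$100,000,000",75\n'
    'Elon Musk,"$20,000,000,000,000",55\n'
)
json_data = csv_to_table_json(csv_text)
```

The resulting `json_data` string can be placed directly into the DataFrame's `table_json` column as in the pipeline above.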
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |answer | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} | |{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|table_qa_tapas_medium_finetuned_wtq| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|157.5 MB| |Case sensitive:|false| ## References https://www.microsoft.com/en-us/download/details.aspx?id=54253 https://github.com/ppasupat/WikiTableQuestions --- layout: model title: Arabic ElectraForQuestionAnswering model (from salti) author: John Snow Labs name: electra_qa_AraElectra_base_finetuned_ARCD date: 2022-06-22 tags: [ar, open_source, electra, question_answering] task: Question Answering language: ar edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained 
Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `AraElectra-base-finetuned-ARCD` is an Arabic model originally trained by `salti`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_AraElectra_base_finetuned_ARCD_ar_4.0.0_3.0_1655918851105.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_AraElectra_base_finetuned_ARCD_ar_4.0.0_3.0_1655918851105.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_AraElectra_base_finetuned_ARCD","ar") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_AraElectra_base_finetuned_ARCD","ar") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.answer_question.squad_arcd.electra.base").predict("""ما هو اسمي؟|||اسمي كلارا وأنا أعيش في بيركلي.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_AraElectra_base_finetuned_ARCD| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ar| |Size:|504.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/salti/AraElectra-base-finetuned-ARCD --- layout: model title: Legal Building And Public Works Document Classifier (EURLEX) author: John Snow Labs name: legclf_building_and_public_works_bert date: 2023-03-06 tags: [en, legal, classification, clauses, building_and_public_works, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the `legclf_building_and_public_works_bert` model, a BERT Sentence Embeddings document classifier, determines whether the document belongs to the `Building_and_Public_Works` class or not (binary classification) according to EuroVoc labels. ## Predicted Entities `Building_and_Public_Works`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_building_and_public_works_bert_en_1.0.0_3.0_1678111597578.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_building_and_public_works_bert_en_1.0.0_3.0_1678111597578.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_building_and_public_works_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
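The Benchmarking section below reports per-label precision, recall, and F1 together with macro and support-weighted averages. As a quick plain-Python reference for how those summary rows are derived (F1 is the harmonic mean of precision and recall; the weighted average is weighted by support):

```python
def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# (precision, recall, support) per label, from the benchmarking table.
labels = {
    "Building_and_Public_Works": (0.85, 0.85, 33),
    "Other": (0.87, 0.87, 39),
}

f1s = {name: f1(p, r) for name, (p, r, _) in labels.items()}
macro = sum(f1s.values()) / len(f1s)
total = sum(s for _, _, s in labels.values())
weighted = sum(f1s[name] * s for name, (_, _, s) in labels.items()) / total

print(round(macro, 2))     # 0.86
print(round(weighted, 2))  # 0.86
```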
## Results ```bash +---------------------------+ |result                     | +---------------------------+ |[Building_and_Public_Works]| |[Other]                    | |[Other]                    | |[Building_and_Public_Works]| +---------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_building_and_public_works_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Building_and_Public_Works 0.85 0.85 0.85 33 Other 0.87 0.87 0.87 39 accuracy - - 0.86 72 macro-avg 0.86 0.86 0.86 72 weighted-avg 0.86 0.86 0.86 72 ``` --- layout: model title: English T5ForConditionalGeneration Cased model (from ThomasNLG) author: John Snow Labs name: t5_weighter_cnndm date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-weighter_cnndm-en` is an English model originally trained by `ThomasNLG`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_weighter_cnndm_en_4.3.0_3.0_1675156764080.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_weighter_cnndm_en_4.3.0_3.0_1675156764080.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_weighter_cnndm","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_weighter_cnndm","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_weighter_cnndm| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|277.8 MB| ## References - https://huggingface.co/ThomasNLG/t5-weighter_cnndm-en - https://github.com/ThomasScialom/QuestEval - https://arxiv.org/abs/2103.12693 --- layout: model title: French CamemBert Embeddings (from joe8zhang) author: John Snow Labs name: camembert_embeddings_joe8zhang_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `joe8zhang`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_joe8zhang_generic_model_fr_3.4.4_3.0_1653988899631.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_joe8zhang_generic_model_fr_3.4.4_3.0_1653988899631.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_joe8zhang_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_joe8zhang_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_joe8zhang_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/joe8zhang/dummy-model --- layout: model title: Korean ElectraForQuestionAnswering Small model (from monologg) Version-3 author: John Snow Labs name: electra_qa_small_v3_finetuned_korquad date: 2022-06-22 tags: [ko, open_source, electra, question_answering] task: Question Answering language: ko edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `koelectra-small-v3-finetuned-korquad` is a Korean model originally trained by `monologg`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_small_v3_finetuned_korquad_ko_4.0.0_3.0_1655922310458.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_small_v3_finetuned_korquad_ko_4.0.0_3.0_1655922310458.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_small_v3_finetuned_korquad","ko") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_small_v3_finetuned_korquad","ko") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ko.answer_question.korquad.electra.small").predict("""내 이름은 무엇입니까?|||제 이름은 클라라이고 저는 버클리에 살고 있습니다.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_small_v3_finetuned_korquad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ko| |Size:|53.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/monologg/koelectra-small-v3-finetuned-korquad --- layout: model title: Detect Drugs and Posology Entities (ner_posology_greedy) author: John Snow Labs name: ner_posology_greedy date: 2020-12-08 task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 2.6.5 spark_version: 2.4 tags: [ner, licensed, clinical, en] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects drugs, dosage, form, frequency, duration, route, and drug strength in text. It differs from `ner_posology` in the sense that it chunks together drugs, dosage, form, strength, and route when they appear together, resulting in a bigger chunk. It is trained using `embeddings_clinical` so please use the same embeddings in the pipeline. ## Predicted Entities `DRUG`, `STRENGTH`, `DURATION`, `FREQUENCY`, `FORM`, `DOSAGE`, `ROUTE`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_POSOLOGY.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_greedy_en_2.6.4_2.4_1607422064676.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_greedy_en_2.6.4_2.4_1607422064676.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_posology_greedy", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day."]]).toDF("text")) ``` ```scala ... val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = NerDLModel.pretrained("ner_posology_greedy", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter)) val data = Seq("The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. 
He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology.greedy").predict("""The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.""") ```
## Results ```bash +----+----------------------------------+---------+-------+------------+ | | chunks | begin | end | entities | |---:|---------------------------------:|--------:|------:|-----------:| | 0 | 1 capsule of Advil 10 mg | 27 | 50 | DRUG | | 1 | magnesium hydroxide 100mg/1ml PO | 67 | 98 | DRUG | | 2 | for 5 days | 52 | 61 | DURATION | | 3 | 40 units of insulin glargine | 168 | 195 | DRUG | | 4 | at night | 197 | 204 | FREQUENCY | | 5 | 12 units of insulin lispro | 207 | 232 | DRUG | | 6 | with meals | 234 | 243 | FREQUENCY | | 7 | metformin 1000 mg | 250 | 266 | DRUG | | 8 | two times a day | 268 | 282 | FREQUENCY | +----+----------------------------------+---------+-------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_greedy| |Type:|ner| |Compatibility:|Spark NLP 2.6.5+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Dependencies:|embeddings_clinical| ## Data Source Trained on augmented i2b2_med7 + FDA dataset with ``embeddings_clinical``, [https://www.i2b2.org/NLP/Medication](https://www.i2b2.org/NLP/Medication). 
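The `begin`/`end` columns in the results table above are inclusive character offsets into the input text, so a Python slice needs `end + 1`. A quick check against the first sentence of the example input:

```python
# First sentence of the example input from the "How to use" section.
text = ("The patient was prescribed 1 capsule of Advil 10 mg for 5 days "
        "and magnesium hydroxide 100mg/1ml suspension PO.")

# Chunk "1 capsule of Advil 10 mg" is reported with begin=27, end=50.
# Offsets are inclusive, so slice to end + 1.
assert text[27:51] == "1 capsule of Advil 10 mg"

# Chunk "for 5 days" is reported with begin=52, end=61.
assert text[52:62] == "for 5 days"

print(text[27:51])  # 1 capsule of Advil 10 mg
```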
## Benchmarking | label | tp | fp | fn | prec | rec | f1 | |---------------|--------|-------|-------|------------|------------|------------| | B-DRUG | 29362 | 1679 | 1985 | 0.9459103 | 0.93667656 | 0.94127077 | | B-STRENGTH | 14018 | 1172 | 864 | 0.922844 | 0.9419433 | 0.9322958 | | I-DURATION | 6404 | 935 | 476 | 0.87259847 | 0.93081397 | 0.9007666 | | I-STRENGTH | 16686 | 1991 | 1292 | 0.8933983 | 0.9281344 | 0.9104351 | | I-FREQUENCY | 19743 | 1088 | 1081 | 0.9477702 | 0.94808877 | 0.9479294 | | B-FORM | 2733 | 526 | 780 | 0.8386008 | 0.7779676 | 0.80714715 | | B-DOSAGE | 2774 | 474 | 688 | 0.85406405 | 0.80127096 | 0.8268257 | | I-DOSAGE | 1357 | 490 | 844 | 0.7347049 | 0.6165379 | 0.67045456 | | I-DRUG | 37846 | 4103 | 3386 | 0.90219074 | 0.91787934 | 0.9099674 | | I-ROUTE | 208 | 30 | 62 | 0.8739496 | 0.77037036 | 0.8188976 | | B-ROUTE | 3061 | 340 | 451 | 0.9000294 | 0.87158316 | 0.88557786 | | B-DURATION | 2491 | 388 | 276 | 0.865231 | 0.900253 | 0.8823946 | | B-FREQUENCY | 13065 | 608 | 436 | 0.9555328 | 0.9677061 | 0.9615809 | | I-FORM | 154 | 69 | 386 | 0.69058293 | 0.2851852 | 0.40366974 | | | | | | | | | | Macro-average | 149902 | 13893 | 13007 | 0.8712434 | 0.82817215 | 0.849162 | | Micro-average | 149902 | 13893 | 13007 | 0.91518056 | 0.92015785 | 0.9176625 | --- layout: model title: Pipeline to Detect Risk Factors author: John Snow Labs name: ner_risk_factors_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_risk_factors](https://nlp.johnsnowlabs.com/2021/03/31/ner_risk_factors_en.html) model. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_risk_factors_pipeline_en_4.3.0_3.2_1678865692535.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_risk_factors_pipeline_en_4.3.0_3.2_1678865692535.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_risk_factors_pipeline", "en", "clinical/models") text = '''HISTORY OF PRESENT ILLNESS: The patient is a 40-year-old white male who presents with a chief complaint of "chest pain". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. The severity of the pain has progressively increased. He describes the pain as a sharp and heavy pain which radiates to his neck & left arm. He ranks the pain a 7 on a scale of 1-10. He admits some shortness of breath & diaphoresis. He states that he has had nausea & 3 episodes of vomiting tonight. He denies any fever or chills. He admits prior episodes of similar pain prior to his PTCA in 1995. He states the pain is somewhat worse with walking and seems to be relieved with rest. There is no change in pain with positioning. He states that he took 3 nitroglycerin tablets sublingually over the past 1 hour, which he states has partially relieved his pain. The patient ranks his present pain a 4 on a scale of 1-10. The most recent episode of pain has lasted one-hour. The patient denies any history of recent surgery, head trauma, recent stroke, abnormal bleeding such as blood in urine or stool or nosebleed. REVIEW OF SYSTEMS: All other systems reviewed & are negative. PAST MEDICAL HISTORY: Diabetes mellitus type II, hypertension, coronary artery disease, atrial fibrillation, status post PTCA in 1995 by Dr. ABC. SOCIAL HISTORY: Denies alcohol or drugs. Smokes 2 packs of cigarettes per day. Works as a banker. 
FAMILY HISTORY: Positive for coronary artery disease (father & brother).''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_risk_factors_pipeline", "en", "clinical/models") val text = """HISTORY OF PRESENT ILLNESS: The patient is a 40-year-old white male who presents with a chief complaint of "chest pain". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. The severity of the pain has progressively increased. He describes the pain as a sharp and heavy pain which radiates to his neck & left arm. He ranks the pain a 7 on a scale of 1-10. He admits some shortness of breath & diaphoresis. He states that he has had nausea & 3 episodes of vomiting tonight. He denies any fever or chills. He admits prior episodes of similar pain prior to his PTCA in 1995. He states the pain is somewhat worse with walking and seems to be relieved with rest. There is no change in pain with positioning. He states that he took 3 nitroglycerin tablets sublingually over the past 1 hour, which he states has partially relieved his pain. The patient ranks his present pain a 4 on a scale of 1-10. The most recent episode of pain has lasted one-hour. The patient denies any history of recent surgery, head trauma, recent stroke, abnormal bleeding such as blood in urine or stool or nosebleed. REVIEW OF SYSTEMS: All other systems reviewed & are negative. PAST MEDICAL HISTORY: Diabetes mellitus type II, hypertension, coronary artery disease, atrial fibrillation, status post PTCA in 1995 by Dr. ABC. SOCIAL HISTORY: Denies alcohol or drugs. Smokes 2 packs of cigarettes per day. Works as a banker. FAMILY HISTORY: Positive for coronary artery disease (father & brother)."""
val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.risk_factors.pipeline").predict("""HISTORY OF PRESENT ILLNESS: The patient is a 40-year-old white male who presents with a chief complaint of "chest pain". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. The severity of the pain has progressively increased. He describes the pain as a sharp and heavy pain which radiates to his neck & left arm. He ranks the pain a 7 on a scale of 1-10. He admits some shortness of breath & diaphoresis. He states that he has had nausea & 3 episodes of vomiting tonight. He denies any fever or chills. He admits prior episodes of similar pain prior to his PTCA in 1995. He states the pain is somewhat worse with walking and seems to be relieved with rest. There is no change in pain with positioning. He states that he took 3 nitroglycerin tablets sublingually over the past 1 hour, which he states has partially relieved his pain. The patient ranks his present pain a 4 on a scale of 1-10. The most recent episode of pain has lasted one-hour. The patient denies any history of recent surgery, head trauma, recent stroke, abnormal bleeding such as blood in urine or stool or nosebleed. REVIEW OF SYSTEMS: All other systems reviewed & are negative. PAST MEDICAL HISTORY: Diabetes mellitus type II, hypertension, coronary artery disease, atrial fibrillation, status post PTCA in 1995 by Dr. ABC. SOCIAL HISTORY: Denies alcohol or drugs. Smokes 2 packs of cigarettes per day. Works as a banker. FAMILY HISTORY: Positive for coronary artery disease (father & brother).""") ```
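The `begin`/`end` values in the results below are inclusive character offsets into the input note, so a Python slice needs `end + 1`. A quick check against the opening of the example text:

```python
# Opening of the example clinical note from the "How to use" section.
text = (
    'HISTORY OF PRESENT ILLNESS: The patient is a 40-year-old white male '
    'who presents with a chief complaint of "chest pain". The patient is '
    'diabetic and has a prior history of coronary artery disease.'
)

# Chunk "diabetic" is reported with begin=136, end=143 (inclusive).
assert text[136:144] == "diabetic"

# Chunk "coronary artery disease" is reported with begin=172, end=194.
assert text[172:195] == "coronary artery disease"
```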
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-------------------------------------|--------:|------:|:-------------|-------------:| | 0 | diabetic | 136 | 143 | DIABETES | 0.9992 | | 1 | coronary artery disease | 172 | 194 | CAD | 0.689667 | | 2 | Diabetes mellitus type II | 1315 | 1339 | DIABETES | 0.73075 | | 3 | hypertension | 1342 | 1353 | HYPERTENSION | 0.986 | | 4 | coronary artery disease | 1356 | 1378 | CAD | 0.882567 | | 5 | 1995 | 1422 | 1425 | PHI | 0.9999 | | 6 | ABC | 1434 | 1436 | PHI | 0.9999 | | 7 | Smokes 2 packs of cigarettes per day | 1481 | 1516 | SMOKER | 0.634257 | | 8 | banker | 1530 | 1535 | PHI | 0.9779 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_risk_factors_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Amharic Named Entity Recognition (from mbeukman) author: John Snow Labs name: xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_amharic date: 2022-05-17 tags: [xlm_roberta, ner, token_classification, am, open_source] task: Named Entity Recognition language: am edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-swahili-finetuned-ner-amharic` is an Amharic model originally trained by `mbeukman`.
## Predicted Entities `PER`, `ORG`, `LOC`, `DATE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_amharic_am_3.4.2_3.0_1652810385568.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_amharic_am_3.4.2_3.0_1652810385568.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_amharic","am") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["ስካርቻ nlp እወዳለሁ"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_amharic","am") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("ስካርቻ nlp እወዳለሁ").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_amharic| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|am| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-swahili-finetuned-ner-amharic - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://www.apache.org/licenses/LICENSE-2.0 - https://github.com/Michael-Beukm --- layout: model title: Word2Vec Embeddings in Sicilian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, scn, open_source] task: Embeddings language: scn edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_scn_3.4.1_3.0_1647457121351.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_scn_3.4.1_3.0_1647457121351.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","scn") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","scn") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("scn.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
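Downstream consumers typically compare the 300-dimensional vectors in the `embeddings` column with cosine similarity. A minimal plain-Python sketch of that comparison; the 3-d vectors and token names here are made-up stand-ins for illustration, not output of this model:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot(u, v) / (|u| * |v|).
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 3-d stand-ins for the model's 300-d token vectors (hypothetical values).
casa = [0.9, 0.1, 0.2]
jardinu = [0.8, 0.2, 0.3]
virdi = [0.1, 0.9, 0.1]

# A vector is maximally similar to itself, and similar tokens score higher.
print(round(cosine(casa, casa), 6))                      # 1.0
print(cosine(casa, jardinu) > cosine(casa, virdi))       # True
```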
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|scn| |Size:|138.5 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Google's Tapas Table Understanding (Tiny, SQA) author: John Snow Labs name: table_qa_tapas_tiny_finetuned_sqa date: 2022-09-30 tags: [en, table, qa, question, answering, open_source] task: Table Question Answering language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: TapasForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a zero-shot Table Understanding model which allows you to carry out Question Answering on Spark DataFrames. If your data is stored in a tabular file format such as CSV, load it into a Spark DataFrame before using this model. Size of this model: Tiny. Has aggregation operations?: False ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_tiny_finetuned_sqa_en_4.2.0_3.0_1664530438363.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_tiny_finetuned_sqa_en_4.2.0_3.0_1664530438363.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python json_data = """ { "header": ["name", "money", "age"], "rows": [ ["Donald Trump", "$100,000,000", "75"], ["Elon Musk", "$20,000,000,000,000", "55"] ] } """ queries = [ "Who earns less than 200,000,000?", "Who earns 100,000,000?", "How much money has Donald Trump?", "How old are they?", ] data = spark.createDataFrame([ [json_data, " ".join(queries)] ]).toDF("table_json", "questions") document_assembler = MultiDocumentAssembler() \ .setInputCols("table_json", "questions") \ .setOutputCols("document_table", "document_questions") sentence_detector = SentenceDetector() \ .setInputCols(["document_questions"]) \ .setOutputCol("questions") table_assembler = TableAssembler()\ .setInputCols(["document_table"])\ .setOutputCol("table") tapas = TapasForQuestionAnswering\ .pretrained("table_qa_tapas_tiny_finetuned_sqa","en")\ .setInputCols(["questions", "table"])\ .setOutputCol("answers") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, table_assembler, tapas ]) model = pipeline.fit(data) model\ .transform(data)\ .selectExpr("explode(answers) AS answer")\ .select("answer")\ .show(truncate=False) ```
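The `table_json` payload above can be built from any tabular source. A minimal sketch using only the Python standard library (the CSV content here is a hypothetical stand-in for a file on disk):

```python
import csv
import io
import json

# Hypothetical CSV content standing in for a file on disk.
csv_text = "name,money,age\nDonald Trump,$100000000,75\nElon Musk,$20000000000000,55\n"

rows = list(csv.reader(io.StringIO(csv_text)))

# TapasForQuestionAnswering expects the table as JSON with "header" and "rows".
table_json = json.dumps({"header": rows[0], "rows": rows[1:]})
print(table_json)
```

The resulting string can be placed in the `table_json` column exactly as in the pipeline example above.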
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |answer | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} | |{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|table_qa_tapas_tiny_finetuned_sqa| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|17.4 MB| |Case sensitive:|false| ## References https://www.microsoft.com/en-us/download/details.aspx?id=54253 --- layout: model title: English asr_wav2vec2_large_960h_lv60_self_4_gram TFWav2Vec2ForCTC from patrickvonplaten author: John Snow Labs name: pipeline_asr_wav2vec2_large_960h_lv60_self_4_gram date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description 
Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_960h_lv60_self_4_gram` is an English model originally trained by patrickvonplaten. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_960h_lv60_self_4_gram_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_960h_lv60_self_4_gram_en_4.2.0_3.0_1664021741793.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_960h_lv60_self_4_gram_en_4.2.0_3.0_1664021741793.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_960h_lv60_self_4_gram', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_960h_lv60_self_4_gram", lang = "en") val annotations = pipeline.transform(audioDF) ```
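The example above assumes an `audioDF` of raw audio samples. One way to turn a 16-bit PCM WAV file into such float arrays, sketched with only the standard library (a tiny silent WAV is generated in memory in place of a real recording, and the `audio_content` column name is an assumption about the pipeline's expected input):

```python
import io
import struct
import wave

# Generate a tiny silent 16-bit mono WAV in memory, standing in for a real file.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(16000)   # Wav2Vec2 models expect 16 kHz audio
    w.writeframes(struct.pack("<4h", 0, 0, 0, 0))
buf.seek(0)

# Decode the PCM frames into normalized floats in [-1.0, 1.0).
with wave.open(buf, "rb") as wav:
    frames = wav.readframes(wav.getnframes())
    pcm = struct.unpack("<%dh" % (len(frames) // 2), frames)
    floats = [s / 32768.0 for s in pcm]

# The pipeline input could then be built as (column name assumed):
# audioDF = spark.createDataFrame([[floats]], ["audio_content"])
```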
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_960h_lv60_self_4_gram| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|757.4 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Catalan RoBERTa embeddings author: cayorodriguez name: roberta_embeddings_bsc date: 2022-07-07 tags: [roberta, projecte_aina, ca, open_source] task: Embeddings language: ca edition: Spark NLP 3.4.4 spark_version: 3.0 supported: false recommended: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Catalan Roberta Word Embeddings, used within the `PlanTL-GOB-ES/roberta-base-ca` project. This model requires a specific Tokenizer, as shown in the Python Examples section. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/cayorodriguez/roberta_embeddings_bsc_ca_3.4.4_3.0_1657198648319.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/cayorodriguez/roberta_embeddings_bsc_ca_3.4.4_3.0_1657198648319.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") data = spark.createDataFrame([["M'encanta fer anar aixó."]]).toDF("text") ex_list = ["aprox\.","pàg\.","p\.ex\.","gen\.","feb\.","abr\.","jul\.","set\.","oct\.","nov\.","dec\.","dr\.","dra\.","sr\.","sra\.","srta\.","núm\.","st\.","sta\.","pl\.","etc\.", "ex\."] ex_list_all = [] ex_list_all.extend(ex_list) ex_list_all.extend([x[0].upper() + x[1:] for x in ex_list]) ex_list_all.extend([x.upper() for x in ex_list]) tokenizer = Tokenizer() \ .setInputCols(['document']).setOutputCol('token')\ .setInfixPatterns(["(d|D)(els)","(d|D)(el)","(a|A)(ls)","(a|A)(l)","(p|P)(els)","(p|P)(el)",\ "([A-zÀ-ú_@]+)(-[A-zÀ-ú_@]+)",\ "(d'|D')([·A-zÀ-ú@_]+)(\.|\"|;|:|!|\?|\-|\(|\)|”|“|'|,)+","(l'|L')([·A-zÀ-ú_]+)(\.|\"|;|:|!|\?|\-|\(|\)|”|“|'|,)+", \ "(l'|l'|s'|s'|d'|d'|m'|m'|n'|n'|D'|D'|L'|L'|S'|S'|N'|N'|M'|M')([A-zÀ-ú_]+)",\ """([A-zÀ-ú·]+)(\.|,|\)|\?|!|;|\:|\"|”)(\.|,|\)|\?|!|;|\:|\"|”)+""",\ "([A-zÀ-ú·]+)('l|'ns|'t|'m|'n|-les|-la|-lo|-li|-los|-me|-nos|-te|-vos|-se|-hi|-ne|-ho)(\.|,|;|:|\?|,)+",\ "([A-zÀ-ú·]+)('l|'ns|'t|'m|'n|-les|-la|-lo|-li|-los|-me|-nos|-te|-vos|-se|-hi|-ne|-ho)",\ "(\.|\"|;|:|!|\?|\-|\(|\)|”|“|')+([0-9A-zÀ-ú_]+)",\ "([0-9A-zÀ-ú·]+)(\.|\"|;|:|!|\?|\(|\)|”|“|'|,|%)",\ "(\.|\"|;|:|!|\?|\-|\(|\)|”|“|,)+([0-9]+)(\.|\"|;|:|!|\?|\-|\(|\)|”|“|,)+",\ "(d'|D'|l'|L')([·A-zÀ-ú@_]+)('l|'ns|'t|'m|'n|-les|-la|-lo|-li|-los|-me|-nos|-te|-vos|-se|-hi|-ne|-ho)(\.|\"|;|:|!|\?|\-|\(|\)|”|“|,)", \ "([\.|\"|;|:|!|\?|\-|\(|\)|”|“|,]+)([\.|\"|;|:|!|\?|\-|\(|\)|”|“|,]+)"]) \ .setExceptions(ex_list_all).fit(data) embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_bsc","ca") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_bsc| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Community| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ca| |Size:|300.3 MB| |Case sensitive:|true| ## References projecte-aina/catalan_general_crawling @ huggingface --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_256_finetuned_squad_seed_10 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-256-finetuned-squad-seed-10` is a English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_256_finetuned_squad_seed_10_en_4.3.0_3.0_1674214731317.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_256_finetuned_squad_seed_10_en_4.3.0_3.0_1674214731317.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_256_finetuned_squad_seed_10","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_256_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_256_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|427.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-256-finetuned-squad-seed-10 --- layout: model title: Financial Executives Item Binary Classifier author: John Snow Labs name: finclf_executives_item date: 2022-08-10 tags: [en, finance, classification, 10k, annual, reports, sec, filings, licensed] task: Text Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `executives` item type of 10K Annual Reports. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big financial documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Finance NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
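The paragraph-splitting technique mentioned above can be as simple as breaking the filing text on blank lines. A minimal sketch (the sample text is illustrative; the workshop tutorial linked above covers more robust approaches):

```python
import re

# Split a long filing into paragraph-sized chunks on blank lines so that each
# piece stays within the 512-token limit of the sentence embeddings.
filing_text = (
    "ITEM 10. DIRECTORS AND EXECUTIVE OFFICERS\n\n"
    "John Doe has served as Chief Executive Officer since 2015.\n\n"
    "ITEM 11. EXECUTIVE COMPENSATION"
)

paragraphs = [p.strip() for p in re.split(r"\n\s*\n", filing_text) if p.strip()]
print(len(paragraphs))  # -> 3
```

Each resulting chunk can then be fed to the classifier as a separate row of the `text` column.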
## Predicted Entities `other`, `executives` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_executives_item_en_1.0.0_3.2_1660154397820.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_executives_item_en_1.0.0_3.2_1660154397820.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") useEmbeddings = nlp.UniversalSentenceEncoder.pretrained() \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("finclf_executives_item", "en", "finance/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, useEmbeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +------------+ | result| +------------+ |[executives]| | [other]| | [other]| |[executives]| +------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_executives_item| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.6 MB| ## References Weak labelling on documents from the Edgar database ## Benchmarking ```bash label precision recall f1-score support executives 0.96 0.98 0.97 46 other 0.98 0.96 0.97 45 accuracy - - 0.97 91 macro-avg 0.97 0.97 0.97 91 weighted-avg 0.97 0.97 0.97 91 ``` --- layout: model title: English BertForQuestionAnswering Cased model (from motiondew) author: John Snow Labs name: bert_qa_set_date_3_lr_3e_5_bs_32_ep_3 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-set_date_3-lr-3e-5-bs-32-ep-3` is an English model originally trained by `motiondew`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_3_lr_3e_5_bs_32_ep_3_en_4.0.0_3.0_1657188664961.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_3_lr_3e_5_bs_32_ep_3_en_4.0.0_3.0_1657188664961.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_3_lr_3e_5_bs_32_ep_3","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_3_lr_3e_5_bs_32_ep_3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_set_date_3_lr_3e_5_bs_32_ep_3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/motiondew/bert-set_date_3-lr-3e-5-bs-32-ep-3 --- layout: model title: Fast Neural Machine Translation Model from Albanian to English author: John Snow Labs name: opus_mt_sq_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, sq, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `sq` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_sq_en_xx_2.7.0_2.4_1609167620053.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_sq_en_xx_2.7.0_2.4_1609167620053.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_sq_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_sq_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.sq.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_sq_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Detect Assertion Status (assertion_dl_biobert_scope_L10R10) author: John Snow Labs name: assertion_dl_biobert_scope_L10R10 date: 2022-03-24 tags: [licensed, clinical, en, assertion, biobert] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 3.4.2 spark_version: 2.4 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained using `biobert_pubmed_base_cased` BERT token embeddings. It considers 10 tokens on the left and 10 tokens on the right side of the clinical entities extracted by NER models and assigns their assertion status based on their context in this scope. ## Predicted Entities `present`, `absent`, `possible`, `conditional`, `associated_with_someone_else`, `hypothetical` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_dl_biobert_scope_L10R10_en_3.4.2_2.4_1648148217364.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_dl_biobert_scope_L10R10_en_3.4.2_2.4_1648148217364.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") token = Tokenizer()\ .setInputCols(['sentence'])\ .setOutputCol('token') embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical_biobert", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") clinical_assertion = AssertionDLModel.pretrained("assertion_dl_biobert_scope_L10R10","en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") nlpPipeline = Pipeline(stages=[document, sentenceDetector, token, embeddings, clinical_ner, ner_converter, clinical_assertion]) text = "Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer." 
data = spark.createDataFrame([[text]]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") val clinical_assertion = AssertionDLModel.pretrained("assertion_dl_biobert_scope_L10R10","en", "clinical/models") .setInputCols(Array("sentence", "ner_chunk", "embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter, clinical_assertion)) val data = Seq("Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.biobert_l10210").predict("""Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer.""") ```
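To illustrate the L10R10 scope this model uses, consider how a ±10-token window around an entity chunk could be carved out. This is a plain-Python sketch of the idea, not Spark NLP internals; the token indices below are chosen by hand for the chunk "stomach pain":

```python
# Tokens of one sentence and the index span of an NER chunk ("stomach pain").
tokens = "He shows no stomach pain and he maintained on an epidural".split()
chunk_start, chunk_end = 3, 4

L, R = 10, 10  # tokens considered to the left and right of the chunk
scope = tokens[max(0, chunk_start - L):chunk_end + 1 + R]
print(scope)
```

The assertion status is then assigned from the context inside this window; with L10R10 the window here covers the whole short sentence, including the negation cue "no".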
## Results ```bash +---------------+---------+----------------------------+ |chunk |ner_label|assertion | +---------------+---------+----------------------------+ |severe fever |PROBLEM |present | |sore throat |PROBLEM |present | |stomach pain |PROBLEM |absent | |an epidural |TREATMENT|present | |PCA |TREATMENT|present | |pain control |TREATMENT|present | |short of breath|PROBLEM |conditional | |CT |TEST |present | |lung tumor |PROBLEM |present | |Alzheimer |PROBLEM |associated_with_someone_else| +---------------+---------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_dl_biobert_scope_L10R10| |Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| |Size:|3.2 MB| ## References Trained on 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text with `biobert_pubmed_base_cased`. https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ ## Benchmarking ```bash label tp fp fn prec rec f1 absent 839 89 44 0.9040948 0.9501699 0.9265599 present 2436 127 168 0.9504487 0.9354839 0.9429069 conditional 29 21 24 0.58 0.5471698 0.5631067 associated_with_someone_else 39 11 6 0.78 0.8666670 0.8210527 hypothetical 164 44 11 0.7884616 0.9371429 0.8563969 possible 126 36 75 0.7777778 0.6268657 0.6942149 Macro-average 3633 328 328 0.7967971 0.8105832 0.8036310 Micro-average 3633 328 328 0.9171926 0.9171926 0.9171926 ``` --- layout: model title: Word2Vec Embeddings in Thai (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-10 tags: [cc, embeddings, fastText, word2vec, th, open_source] task: Embeddings language: th edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_th_3.4.1_3.0_1646924940460.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_th_3.4.1_3.0_1646924940460.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","th") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["ฉันรัก Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","th") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("ฉันรัก Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("th.embed.w2v_cc_300d").predict("""ฉันรัก Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|th| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Ocr base v2 for handwritten text author: John Snow Labs name: ocr_base_handwritten_v2 date: 2023-01-17 tags: [en, licensed] task: OCR Text Detection & Recognition language: en nav_key: models edition: Visual NLP 4.2.4 spark_version: 3.2.1 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description OCR base model v2 for recognizing handwritten text, based on the TrOCR architecture and pretrained on handwritten datasets. The TrOCR model was proposed in TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to perform optical character recognition (OCR). The abstract from the paper is the following: Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. 
The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/ocr/RECOGNIZE_HANDWRITTEN/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/Cards/SparkOcrImageToTextHandwritten_V2.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/ocr_base_handwritten_v2_en_4.2.2_3.0_1670602309000.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/ocr_base_handwritten_v2_en_4.2.2_3.0_1670602309000.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python binary_to_image = BinaryToImage() \ .setInputCol("content") \ .setOutputCol("image") \ .setImageType(ImageType.TYPE_3BYTE_BGR) text_detector = ImageTextDetectorV2 \ .pretrained("image_text_detector_v2", "en", "clinical/ocr") \ .setInputCol("image") \ .setOutputCol("text_regions") \ .setWithRefiner(True) \ .setSizeThreshold(-1) \ .setLinkThreshold(0.3) \ .setWidth(500) ocr = ImageToTextV2Opt.pretrained("ocr_base_handwritten_v2", "en", "clinical/ocr") \ .setInputCols(["image", "text_regions"]) \ .setGroupImages(True) \ .setOutputCol("text") \ .setRegionsColumn("text_regions") draw_regions = ImageDrawRegions() \ .setInputCol("image") \ .setInputRegionsCol("text_regions") \ .setOutputCol("image_with_regions") \ .setRectColor(Color.green) \ .setRotated(True) pipeline = PipelineModel(stages=[ binary_to_image, text_detector, ocr, draw_regions ]) # Download image: # !wget -q https://github.com/JohnSnowLabs/spark-ocr-workshop/raw/4.0.0-release-candidate/jupyter/data/handwritten/handwritten_example.jpg imagePath = 'handwritten_example.jpg' image_df = spark.read.format("binaryFile").load(imagePath) result = pipeline.transform(image_df).cache() ``` ```scala val binary_to_image = new BinaryToImage() .setInputCol("content") .setOutputCol("image") .setImageType(ImageType.TYPE_3BYTE_BGR) val text_detector = ImageTextDetectorV2 .pretrained("image_text_detector_v2", "en", "clinical/ocr") .setInputCol("image") .setOutputCol("text_regions") .setWithRefiner(true) .setSizeThreshold(-1) .setLinkThreshold(0.3) .setWidth(500) val ocr = ImageToTextV2Opt .pretrained("ocr_base_handwritten_v2", "en", "clinical/ocr") .setInputCols(Array("image", "text_regions")) .setGroupImages(true) .setOutputCol("text") .setRegionsColumn("text_regions") val draw_regions = new ImageDrawRegions() .setInputCol("image") .setInputRegionsCol("text_regions") .setOutputCol("image_with_regions") .setRectColor(Color.green) .setRotated(true) val pipeline = new Pipeline().setStages(Array( binary_to_image, text_detector, ocr, draw_regions)) // Download image: // !wget -q https://github.com/JohnSnowLabs/spark-ocr-workshop/raw/4.0.0-release-candidate/jupyter/data/handwritten/handwritten_example.jpg val imagePath = "handwritten_example.jpg" val image_df = spark.read.format("binaryFile").load(imagePath) val result = pipeline.fit(image_df).transform(image_df).cache() ```
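The pipeline above runs detection first and recognition second: ImageTextDetectorV2 finds text regions, and the recognizer then processes each region. To produce output text in reading order, detected boxes are typically sorted top-to-bottom and then left-to-right. A minimal pure-Python sketch of such an ordering step (the box tuples and tolerance are illustrative, not part of the Visual NLP API):

```python
# Illustrative reading-order sort for detected text regions.
# Each region is (x, y, width, height), with y increasing downward.
def reading_order(regions, line_tolerance=10):
    """Group boxes whose vertical positions differ by less than
    `line_tolerance` into one text line (top-to-bottom), then sort
    each line left-to-right."""
    rows = []
    for region in sorted(regions, key=lambda r: r[1]):
        if rows and abs(rows[-1][0][1] - region[1]) < line_tolerance:
            rows[-1].append(region)   # same text line
        else:
            rows.append([region])     # start a new line
    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda r: r[0]))
    return ordered

boxes = [(200, 12, 80, 20), (10, 8, 90, 20), (15, 60, 120, 22)]
print(reading_order(boxes))  # two boxes near y=10 ordered by x, then the lower box
```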
## Example {%- capture input_image -%} ![Screenshot](/assets/images/examples_ocr/image3.png) {%- endcapture -%} {%- capture output_image -%} ![Screenshot](/assets/images/examples_ocr/image3_out2.png) {%- endcapture -%} {% include templates/input_output_image.md input_image=input_image output_image=output_image %} ## Output text ```bash This is an example of handwritten beerxt Let's # check the performance ! I hope it will be awesome ``` ## Model Information {:.table-model} |---|---| |Model Name:|ocr_base_handwritten_v2| |Type:|ocr| |Compatibility:|Visual NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| --- layout: model title: English DistilBertForQuestionAnswering model (from ajaypyatha) author: John Snow Labs name: distilbert_qa_sdsqna date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `sdsqna` is an English model originally trained by `ajaypyatha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_sdsqna_en_4.0.0_3.0_1654728628088.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_sdsqna_en_4.0.0_3.0_1654728628088.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sdsqna","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sdsqna","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.by_ajaypyatha").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
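Internally, an extractive QA model scores every context token as a candidate answer start and answer end, and the predicted answer is the span with the highest combined score. A toy sketch of that span selection with made-up scores (illustrative, not Spark NLP internals):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) indices maximizing start+end score, start <= end."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score, best = s + end_scores[j], (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 3.0, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]  # made-up logits
end   = [0.1, 0.1, 0.2, 2.5, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # the highest-scoring span
```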
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_sdsqna| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ajaypyatha/sdsqna --- layout: model title: Pipeline to Detect Living Species (biobert_embeddings_biomedical) author: John Snow Labs name: ner_living_species_bert_pipeline date: 2023-03-13 tags: [pt, ner, clinical, licensed, biobert] task: Named Entity Recognition language: pt edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_living_species_bert](https://nlp.johnsnowlabs.com/2022/06/22/ner_living_species_bert_pt_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_pipeline_pt_4.3.0_3.2_1678729438675.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_pipeline_pt_4.3.0_3.2_1678729438675.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_living_species_bert_pipeline", "pt", "clinical/models") text = '''Uma rapariga de 16 anos com um historial pessoal de asma apresentou ao departamento de dermatologia com lesões cutâneas assintomáticas que tinham estado presentes durante 2 meses. A paciente tinha sido tratada com creme corticosteróide devido a uma suspeita inicial de eczema atópico, apesar do qual apresentava um crescimento progressivo marcado das lesões. Tinha um gato doméstico que ela nunca tinha levado ao veterinário. O exame físico revelou placas em forma de anel com uma borda periférica activa na parte superior das costas e nos aspectos laterais do pescoço e da face. Cultura local obtida por raspagem de tapete isolado Trichophyton rubrum. Com base em dados clínicos e cultura, foi estabelecido o diagnóstico de tinea incognito..''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_living_species_bert_pipeline", "pt", "clinical/models") val text = "Uma rapariga de 16 anos com um historial pessoal de asma apresentou ao departamento de dermatologia com lesões cutâneas assintomáticas que tinham estado presentes durante 2 meses. A paciente tinha sido tratada com creme corticosteróide devido a uma suspeita inicial de eczema atópico, apesar do qual apresentava um crescimento progressivo marcado das lesões. Tinha um gato doméstico que ela nunca tinha levado ao veterinário. O exame físico revelou placas em forma de anel com uma borda periférica activa na parte superior das costas e nos aspectos laterais do pescoço e da face. Cultura local obtida por raspagem de tapete isolado Trichophyton rubrum. Com base em dados clínicos e cultura, foi estabelecido o diagnóstico de tinea incognito.." val result = pipeline.fullAnnotate(text) ```
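The last stage of this pipeline (NerConverterInternal) merges token-level IOB tags into the entity chunks shown in the results below. A simplified pure-Python illustration of that merge:

```python
def iob_to_chunks(tokens, tags):
    """Collapse B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # an "O" tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Tinha", "um", "gato", "doméstico", "isolado", "Trichophyton", "rubrum"]
tags   = ["O", "O", "B-SPECIES", "O", "O", "B-SPECIES", "I-SPECIES"]
print(iob_to_chunks(tokens, tags))
```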
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:--------------------|--------:|------:|:------------|-------------:| | 0 | rapariga | 4 | 11 | HUMAN | 0.9849 | | 1 | pessoal | 41 | 47 | HUMAN | 0.9994 | | 2 | paciente | 182 | 189 | HUMAN | 1 | | 3 | gato | 368 | 371 | SPECIES | 0.9912 | | 4 | veterinário | 413 | 423 | HUMAN | 0.9909 | | 5 | Trichophyton rubrum | 632 | 650 | SPECIES | 0.9778 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_bert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|pt| |Size:|684.8 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Legal Agricultural Structures And Production Document Classifier (EURLEX) author: John Snow Labs name: legclf_agricultural_structures_and_production_bert date: 2023-03-06 tags: [en, legal, classification, clauses, agricultural_structures_and_production, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_agricultural_structures_and_production_bert model, a Bert Sentence Embeddings Document Classifier, determines whether the document belongs to the Agricultural_Structures_and_Production class or not (binary classification) according to EuroVoc labels. 
## Predicted Entities `Agricultural_Structures_and_Production`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_agricultural_structures_and_production_bert_en_1.0.0_3.0_1678111593399.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_agricultural_structures_and_production_bert_en_1.0.0_3.0_1678111593399.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_agricultural_structures_and_production_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
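The classifier stage maps one sentence-embedding vector per document to a single label, which amounts to a learned linear layer followed by softmax over the two classes. A toy sketch with made-up weights (illustrative only, not the trained model):

```python
import math

def classify(embedding, weights, labels):
    """Score each label as a dot product with the embedding, softmax,
    and return (argmax label, probabilities)."""
    scores = [sum(w * x for w, x in zip(ws, embedding)) for ws in weights]
    exps = [math.exp(s - max(scores)) for s in scores]
    probs = [e / sum(exps) for e in exps]
    return labels[probs.index(max(probs))], probs

labels = ["Agricultural_Structures_and_Production", "Other"]
weights = [[0.9, -0.2, 0.4], [-0.5, 0.7, 0.1]]  # made-up trained weights
label, probs = classify([1.0, 0.2, 0.5], weights, labels)
print(label)
```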
## Results ```bash +----------------------------------------+ |result| +----------------------------------------+ |[Agricultural_Structures_and_Production]| |[Other]| |[Other]| |[Agricultural_Structures_and_Production]| +----------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_agricultural_structures_and_production_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.3 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Agricultural_Structures_and_Production 0.83 0.91 0.87 44 Other 0.90 0.81 0.85 43 accuracy - - 0.86 87 macro-avg 0.87 0.86 0.86 87 weighted-avg 0.87 0.86 0.86 87 ``` --- layout: model title: XLM-RoBERTa Base (xlm_roberta_base) author: John Snow Labs name: xlm_roberta_base date: 2021-05-25 tags: [xx, multilingual, embeddings, xlm_roberta, open_source] task: Embeddings language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description [XLM-RoBERTa](https://ai.facebook.com/blog/-xlm-r-state-of-the-art-cross-lingual-understanding-through-self-supervision/) is a scaled cross-lingual sentence encoder. It is trained on 2.5TB of data filtered from Common Crawl, covering 100 languages. XLM-R achieves state-of-the-art results on multiple cross-lingual benchmarks. The XLM-RoBERTa model was proposed in [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. It is based on Facebook's RoBERTa model released in 2019. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_base_xx_3.1.0_2.4_1621961851929.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_base_xx_3.1.0_2.4_1621961851929.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base", "xx") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") ``` ```scala val embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base", "xx") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("xx.embed.xlm").predict("""Put your text here.""") ```
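Downstream of an embeddings annotator, the resulting vectors are commonly compared with cosine similarity (e.g., for cross-lingual semantic search). A small self-contained helper, independent of Spark NLP:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]), 3))  # identical directions -> 1.0
print(round(cosine([1.0, 0.0], [0.0, 1.0]), 3))            # orthogonal -> 0.0
```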
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_base| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[embeddings]| |Language:|xx| |Case sensitive:|true| ## Data Source [https://huggingface.co/xlm-roberta-base](https://huggingface.co/xlm-roberta-base) --- layout: model title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011) author: John Snow Labs name: distilbert_token_classifier_autotrain_name_all_904029569 date: 2023-03-14 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-name_all-904029569` is an English model originally trained by `ismail-lucifer011`. ## Predicted Entities `Name`, `OOV` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_all_904029569_en_4.3.1_3.0_1678783428948.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_all_904029569_en_4.3.1_3.0_1678783428948.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_all_904029569","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_all_904029569","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
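Note the 128-token max sentence length in the model information below: longer inputs must be split before tagging. A hedged sketch of overlapping-window chunking (the window and stride values are illustrative, not the annotator's internals):

```python
def windows(tokens, size=128, stride=96):
    """Split a token list into overlapping windows so every token is
    tagged with some left context preserved."""
    if len(tokens) <= size:
        return [tokens]
    out, start = [], 0
    while start < len(tokens):
        out.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
        start += stride
    return out

chunks = windows(list(range(300)), size=128, stride=96)
print([len(c) for c in chunks])  # window lengths; the last one is shorter
```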
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_name_all_904029569| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ismail-lucifer011/autotrain-name_all-904029569 --- layout: model title: SDOH Tobacco Usage For Classification author: John Snow Labs name: genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli date: 2023-01-14 tags: [en, licensed, generic_classifier, sdoh, tobacco, clinical] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true recommended: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Generic Classifier model is intended for detecting tobacco use in clinical notes and was trained using the GenericClassifierApproach annotator. `Present:` if the patient was a current consumer of tobacco. `Past:` the patient was a consumer in the past and had quit. `Never:` if the patient had never consumed tobacco. `None:` if there was no related text. 
## Predicted Entities `Present`, `Past`, `Never`, `None` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/SOCIAL_DETERMINANT_TOBACCO/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/SOCIAL_DETERMINANT_CLASSIFICATION.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli_en_4.2.4_3.0_1673697468673.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli_en_4.2.4_3.0_1673697468673.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") features_asm = FeaturesAssembler()\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("features") generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli", 'en', 'clinical/models')\ .setInputCols(["features"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, sentence_embeddings, features_asm, generic_classifier ]) text_list = ["Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. He uses alcohol and cigarettes", "The patient quit smoking approximately two years ago with an approximately a 40 pack year history, mostly cigar use. He also reports 'heavy alcohol use', quit 15 months ago.", "The patient denies any history of smoking or alcohol abuse. She lives with her one daughter.", "She was previously employed as a hairdresser, though says she hasnt worked in 4 years. 
Not reported by patient, but there is apparently a history of alochol abuse."] df = spark.createDataFrame(text_list, StringType()).toDF("text") result = pipeline.fit(df).transform(df) result.select("text", "class.result").show(truncate=100) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence_embeddings") val features_asm = new FeaturesAssembler() .setInputCols("sentence_embeddings") .setOutputCol("features") val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli", "en", "clinical/models") .setInputCols("features") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_embeddings, features_asm, generic_classifier)) val data = Seq("Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. He uses alcohol and cigarettes.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.generic.sdoh_tobacco_sbiobert_cased").predict("""The patient quit smoking approximately two years ago with an approximately a 40 pack year history, mostly cigar use. He also reports 'heavy alcohol use', quit 15 months ago.""") ```
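The FeaturesAssembler stage simply concatenates its input columns into one numeric feature vector, which the generic classifier then consumes. Conceptually, the assembly step looks like this (a sketch, not the Spark NLP implementation):

```python
def assemble_features(*columns):
    """Flatten several numeric columns/vectors into one feature vector."""
    features = []
    for col in columns:
        if isinstance(col, (list, tuple)):
            features.extend(col)  # a vector column, e.g. an embedding
        else:
            features.append(col)  # a scalar column
    return features

embedding = [0.12, -0.40, 0.88]  # e.g. a sentence embedding
extra = 1.0                      # e.g. an additional scalar feature
print(assemble_features(embedding, extra))
```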
## Results ```bash +----------------------------------------------------------------------------------------------------+---------+ | text| result| +----------------------------------------------------------------------------------------------------+---------+ |Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 2...|[Present]| |The patient quit smoking approximately two years ago with an approximately a 40 pack year history...| [Past]| | The patient denies any history of smoking or alcohol abuse. She lives with her one daughter.| [Never]| |She was previously employed as a hairdresser, though says she hasnt worked in 4 years. Not report...| [None]| +----------------------------------------------------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[features]| |Output Labels:|[prediction]| |Language:|en| |Size:|3.4 MB| ## Benchmarking ```bash label precision recall f1-score support Never 0.89 0.90 0.90 487 None 0.86 0.78 0.82 269 Past 0.87 0.79 0.83 415 Present 0.63 0.82 0.71 203 accuracy - - 0.83 1374 macro-avg 0.81 0.82 0.81 1374 weighted-avg 0.84 0.83 0.83 1374 ``` --- layout: model title: Sentiment Analysis pipeline for English author: John Snow Labs name: analyze_sentiment date: 2021-03-24 tags: [open_source, english, analyze_sentiment, pipeline, en] supported: true task: [Named Entity Recognition, Lemmatization] language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description analyze_sentiment is a pretrained pipeline that performs basic processing steps and recognizes entities. 
It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/analyze_sentiment_en_3.0.0_3.0_1616544471011.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/analyze_sentiment_en_3.0.0_3.0_1616544471011.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('analyze_sentiment', lang = 'en') result = pipeline.fullAnnotate("""Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now!""") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("analyze_sentiment", lang = "en") val result = pipeline.fullAnnotate("""Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now!""") ``` {:.nlu-block} ```python import nlu text = ["""Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now!"""] result_df = nlu.load('en.classify').predict(text) result_df ```
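In miniature, lexicon-style sentiment scoring counts polarity cues and takes the majority. The toy scorer below is purely illustrative (the word lists are made up) and unrelated to the pretrained pipeline's actual trained model:

```python
POSITIVE = {"love", "rad", "right", "awesome"}   # illustrative lexicon
NEGATIVE = {"bother", "bad", "boring"}

def simple_sentiment(text):
    """Count lexicon hits and return the majority polarity."""
    words = [w.strip("!,.`").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "na"

print(simple_sentiment("I just love the story and the music was rad!"))
```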
## Results ```bash | | text | sentiment | |---:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------| | 0 | Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now! | positive | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|analyze_sentiment| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| --- layout: model title: Lemmatizer (Serbian, SpacyLookup) author: John Snow Labs name: lemma_spacylookup date: 2022-03-03 tags: [open_source, lemmatizer, sr] task: Lemmatization language: sr edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Serbian Lemmatizer is a scalable, production-ready version of the Rule-based Lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_sr_3.4.1_3.0_1646316491633.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_sr_3.4.1_3.0_1646316491633.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","sr") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Ниси бољи од мене"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","sr") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("Ниси бољи од мене").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("sr.lemma").predict("""Ниси бољи од мене""") ```
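A lookup lemmatizer is essentially a dictionary from inflected forms to lemmas, returning the token unchanged when no entry exists. A tiny sketch whose entries mirror the example output below (the table itself is illustrative, not the model's actual lookup data):

```python
# Tiny illustrative lookup table: inflected form -> lemma.
LOOKUP = {"бољи": "добар", "мене": "ја"}

def lemmatize(tokens):
    """Return the lookup lemma for each token, or the token itself."""
    return [LOOKUP.get(t, t) for t in tokens]

print(lemmatize(["Ниси", "бољи", "од", "мене"]))
```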
## Results ```bash +---------------------+ |result | +---------------------+ |[Ниси, добар, од, ја]| +---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma_spacylookup| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|sr| |Size:|3.2 MB| --- layout: model title: Czech asr_wav2vec2_large_xlsr_czech TFWav2Vec2ForCTC from arampacha author: John Snow Labs name: asr_wav2vec2_large_xlsr_czech date: 2022-09-25 tags: [wav2vec2, cs, audio, open_source, asr] task: Automatic Speech Recognition language: cs edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_czech` is a Czech model originally trained by arampacha. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_czech_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_czech_cs_4.2.0_3.0_1664120388474.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_czech_cs_4.2.0_3.0_1664120388474.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_czech", "cs")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_czech", "cs") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
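A CTC model such as Wav2Vec2ForCTC emits one label per audio frame; decoding collapses consecutive duplicates and removes the blank symbol (a blank between repeats is what preserves genuinely doubled letters). A minimal greedy decoder over an illustrative alphabet:

```python
def ctc_greedy_decode(frame_labels, blank="_"):
    """Collapse consecutive duplicate labels, then drop blanks."""
    out, prev = [], None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)

# Frame-level labels with repeats and blanks decode to "ahoj" (Czech "hi").
print(ctc_greedy_decode(list("aa_hhh_oo_j")))
```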
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xlsr_czech|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|cs|
|Size:|1.2 GB|

---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from alex-apostolo)
author: John Snow Labs
name: roberta_qa_base_filtered_cuad
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-filtered-cuad` is an English model originally trained by `alex-apostolo`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_filtered_cuad_en_4.3.0_3.0_1674216293189.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_filtered_cuad_en_4.3.0_3.0_1674216293189.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
.setInputCols(["question", "context"])\
.setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_filtered_cuad","en")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_filtered_cuad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_filtered_cuad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|454.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/alex-apostolo/roberta-base-filtered-cuad

---
layout: model
title: Stopwords Remover for Czech language (358 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, cs, open_source]
task: Stop Words Removal
language: cs
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_cs_3.4.1_3.0_1646673248718.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_cs_3.4.1_3.0_1646673248718.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")

stop_words = StopWordsCleaner.pretrained("stopwords_iso","cs") \
.setInputCols(["token"]) \
.setOutputCol("cleanTokens")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])

example = spark.createDataFrame([["Nejste lepší než já"]], ["text"])

results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")

val stop_words = StopWordsCleaner.pretrained("stopwords_iso","cs")
.setInputCols(Array("token"))
.setOutputCol("cleanTokens")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))

val data = Seq("Nejste lepší než já").toDF("text")

val results = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("cs.stopwords").predict("""Nejste lepší než já""")
```
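Conceptually, the StopWordsCleaner stage just filters the token stream against a language-specific stopword list. A toy pure-Python sketch of that filter, using a hypothetical two-word subset of the Czech list (the real model ships all 358 entries):

```python
# Tiny illustrative subset of the Czech stopword list (assumption for this sketch)
czech_stopwords = {"než", "já"}

def remove_stopwords(tokens, stopwords):
    # Compare case-insensitively so sentence-initial capitals still match the list
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "Nejste lepší než já".split()
print(remove_stopwords(tokens, czech_stopwords))  # ['Nejste', 'lepší']
```

The surviving tokens match the `cleanTokens` output shown in the Results section.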
## Results

```bash
+---------------+
|result         |
+---------------+
|[Nejste, lepší]|
+---------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|cs|
|Size:|2.4 KB|

---
layout: model
title: English BertForQuestionAnswering Cased model (from irenelizihui)
author: John Snow Labs
name: bert_qa_irenelizihui_finetuned_squad
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `irenelizihui`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_irenelizihui_finetuned_squad_en_4.0.0_3.0_1657186550664.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_irenelizihui_finetuned_squad_en_4.0.0_3.0_1657186550664.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_irenelizihui_finetuned_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_irenelizihui_finetuned_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
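Extractive QA models like this one score every context token as a potential answer start and as a potential answer end; the predicted answer is the span maximizing the combined score with start ≤ end. A toy sketch of that span search with made-up scores (not the annotator's actual decoding code):

```python
def best_span(tokens, start_scores, end_scores, max_len=15):
    """Return the answer text for the span (i, j) maximizing start + end scores."""
    best, best_score = (0, 0), float("-inf")
    for i in range(len(tokens)):
        # Cap span length, as real QA heads usually do
        for j in range(i, min(i + max_len, len(tokens))):
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best_score, best = score, (i, j)
    i, j = best
    return " ".join(tokens[i:j + 1])

context = "My name is Clara and I live in Berkeley .".split()
# Made-up scores that peak on "Clara" (index 3)
start = [0.1, 0.0, 0.2, 5.0, 0.1, 0.0, 0.1, 0.0, 0.3, 0.0]
end   = [0.0, 0.1, 0.0, 4.0, 0.2, 0.0, 0.0, 0.1, 0.5, 0.0]
print(best_span(context, start, end))  # Clara
```

The real model computes these scores with BERT; the search over spans is the same idea.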
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_irenelizihui_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/irenelizihui/bert-finetuned-squad

---
layout: model
title: English RoBERTa Embeddings (Sampling strategy 'full select')
author: John Snow Labs
name: roberta_embeddings_distilroberta_base_climate_f
date: 2022-04-14
tags: [roberta, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilroberta-base-climate-f` is an English model originally trained by `climatebert`.

Sampling strategy f: As expressed in the author's paper [here](https://arxiv.org/pdf/2110.12010.pdf), f is the "full select" strategy, meaning all sentences from all corpora were selected.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distilroberta_base_climate_f_en_3.4.2_3.0_1649946254298.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distilroberta_base_climate_f_en_3.4.2_3.0_1649946254298.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_distilroberta_base_climate_f","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_distilroberta_base_climate_f","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.distilroberta_base_climate_f").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_distilroberta_base_climate_f| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|310.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/climatebert/distilroberta-base-climate-f - https://arxiv.org/abs/2110.12010 --- layout: model title: Legal Multilabel Classification on Terms of Service (UNFAIR-ToS) author: John Snow Labs name: legmulticlf_unfair_tos date: 2023-03-08 tags: [en, legal, licensed, classification, unfair, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true recommended: true engine: tensorflow annotator: MultiClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Multilabel Text Classification model that can help you classify 8 types of unfair contractual terms (sentences), meaning terms that potentially violate user rights according to European consumer law. ## Predicted Entities `Arbitration`, `Choice_of_Law`, `Content_Removal`, `Contract_by_Using`, `Jurisdiction`, `Limitation_of_Liability`, `Unilateral_Change`, `Unilateral_Termination`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legmulticlf_unfair_tos_en_1.0.0_3.0_1678283272065.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legmulticlf_unfair_tos_en_1.0.0_3.0_1678283272065.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")\ .setInputCols(["document", "token"])\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512) embeddingsSentence = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") docClassifier = nlp.MultiClassifierDLModel().pretrained("legmulticlf_unfair_tos", "en", "legal/models")\ .setInputCols("sentence_embeddings") \ .setOutputCol("class") pipeline = nlp.Pipeline( stages=[ document_assembler, tokenizer, embeddings, embeddingsSentence, docClassifier ] ) empty_data = spark.createDataFrame([[""]]).toDF("text") model = pipeline.fit(empty_data) light_model = nlp.LightPipeline(model) result = light_model.annotate("""we may alter, suspend or discontinue any aspect of the service at any time, including the availability of any service feature, database or content.""") ```
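MultiClassifierDLModel is multilabel: each class receives an independent sigmoid score, and every class clearing a threshold (0.5 by default) is emitted, which is how a single Terms-of-Service sentence can receive two labels at once. A toy sketch of that final thresholding step with made-up scores (not the model's actual outputs):

```python
def multilabel_predict(scores, threshold=0.5):
    """Return every label whose independent sigmoid score clears the threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)

# Made-up sigmoid outputs for one Terms-of-Service sentence
scores = {
    "Unilateral_Change": 0.91,
    "Unilateral_Termination": 0.77,
    "Limitation_of_Liability": 0.12,
    "Jurisdiction": 0.03,
}
print(multilabel_predict(scores))  # ['Unilateral_Change', 'Unilateral_Termination']
```

Contrast this with a softmax classifier, where scores compete and exactly one label wins.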
## Results

```bash
['Unilateral_Change', 'Unilateral_Termination']
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legmulticlf_unfair_tos|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|13.9 MB|

## References

Train dataset available [here](https://github.com/coastalcph/lex-glue)

## Benchmarking

```bash
label                    precision recall f1-score support
Arbitration              1.00      0.82   0.90     11
Choice_of_Law            0.93      0.93   0.93     14
Content_Removal          0.80      0.57   0.67     21
Contract_by_Using        0.93      0.82   0.87     17
Jurisdiction             1.00      1.00   1.00     16
Limitation_of_Liability  0.81      0.80   0.81     60
Other                    0.78      0.71   0.75     66
Unilateral_Change        0.94      0.84   0.89     38
Unilateral_Termination   0.78      0.81   0.79     36
micro-avg                0.85      0.79   0.82     279
macro-avg                0.89      0.81   0.85     279
weighted-avg             0.85      0.79   0.82     279
samples-avg              0.78      0.80   0.78     279
```

---
layout: model
title: ICD10CM Entity Resolver
author: John Snow Labs
name: chunkresolve_icd10cm_clinical
class: ChunkEntityResolverModel
language: en
nav_key: models
repository: clinical/models
date: 2020-04-21
task: Entity Resolution
edition: Healthcare NLP 2.4.2
spark_version: 2.4
tags: [clinical,licensed,entity_resolution,en]
deprecated: true
annotator: ChunkEntityResolverModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

{:.h2_title}
## Description

Entity Resolution model based on KNN, using Word Embeddings + Word Mover's Distance.

## Predicted Entities

ICD10-CM Codes and their normalized definition with `clinical_embeddings`.
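The retrieval step of such a resolver amounts to nearest-neighbour search over embeddings of the code descriptions under a distance function such as cosine. A minimal sketch with toy 3-dimensional vectors (the codes are real ICD-10-CM codes, but the vectors are made up; real embeddings are learned and far larger):

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# Toy embedding index: ICD-10-CM codes -> made-up description vectors
index = {
    "E1121": [0.9, 0.1, 0.0],  # stand-in for a "type 2 diabetes ..." embedding
    "R631":  [0.1, 0.8, 0.2],  # stand-in for a "polydipsia" embedding
    "R358":  [0.0, 0.2, 0.9],  # stand-in for an "other polyuria" embedding
}

def resolve(chunk_vector, index, k=1):
    """Return the k codes whose embeddings are nearest to the chunk embedding."""
    ranked = sorted(index, key=lambda code: cosine_distance(chunk_vector, index[code]))
    return ranked[:k]

print(resolve([0.85, 0.15, 0.05], index))  # ['E1121']
```

The actual model ranks candidates with Word Mover's Distance over chunk token embeddings rather than plain cosine, but the KNN retrieval structure is the same.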
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/enterprise/healthcare/EntityResolution_ICD10_RxNorm_Detailed.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_clinical_en_2.4.5_2.4_1587491222166.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_clinical_en_2.4.5_2.4_1587491222166.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPython.html %}
```python
...
icd10cm_resolution = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_clinical", "en", "clinical/models") \
.setInputCols(["ner_token", "chunk_embeddings"]) \
.setOutputCol("icd10cm_code") \
.setDistanceFunction("COSINE") \
.setNeighbours(5)

pipeline_icd10cm = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, chunk_tokenizer, icd10cm_resolution])

data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG."""]]).toDF("text")

pipeline_model = pipeline_icd10cm.fit(data)

result = pipeline_model.transform(data)
```
```scala
...
val icd10cm_resolution = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_clinical", "en", "clinical/models") .setInputCols("ner_token", "chunk_embeddings") .setOutputCol("icd10cm_code") .setDistanceFunction("COSINE") .setNeighbours(5) val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, chunk_tokenizer, icd10cm_resolution)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG.").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.h2_title}
## Results

```bash
|   | chunk                       | entity  | resolved_text                                     | code   | cms                                               |
|---|-----------------------------|---------|---------------------------------------------------|--------|---------------------------------------------------|
| 0 | T2DM),                      | PROBLEM | Type 2 diabetes mellitus with diabetic nephrop... | E1121  | Type 2 diabetes mellitus with diabetic nephrop... |
| 1 | T2DM                        | PROBLEM | Type 2 diabetes mellitus with diabetic nephrop... | E1121  | Type 2 diabetes mellitus with diabetic nephrop... |
| 2 | polydipsia                  | PROBLEM | Polydipsia                                        | R631   | Polydipsia:::Anhedonia:::Galactorrhea             |
| 3 | interference from turbidity | PROBLEM | Non-working side interference                     | M2656  | Non-working side interference:::Hemoglobinuria... |
| 4 | polyuria                    | PROBLEM | Other polyuria                                    | R358   | Other polyuria:::Polydipsia:::Generalized edem... |
| 5 | lipemia                     | PROBLEM | Glycosuria                                        | R81    | Glycosuria:::Pure hyperglyceridemia:::Hyperchy... |
| 6 | starvation ketosis          | PROBLEM | Propionic acidemia                                | E71121 | Propionic acidemia:::Bartter's syndrome:::Hypo... |
| 7 | HTG                         | PROBLEM | Pure hyperglyceridemia                            | E781   | Pure hyperglyceridemia:::Familial hypercholest... |
```

{:.model-param}
## Model Information

{:.table-model}
|----------------|-------------------------------|
| Name: | chunkresolve_icd10cm_clinical |
| Type: | ChunkEntityResolverModel |
| Compatibility: | Spark NLP 2.4.2+ |
| License: | Licensed |
| Edition: | Official |
| Input labels: | token, chunk_embeddings |
| Output labels: | entity |
| Language: | en |
| Case sensitive: | True |
| Dependencies: | embeddings_clinical |

{:.h2_title}
## Data Source

Trained on the ICD10 Clinical Modification dataset, with tens of variations per code.
https://www.icd10data.com/ICD10CM/Codes/ --- layout: model title: Detect Assertion Status (assertion_dl_healthcare) - supports confidence scores author: John Snow Labs name: assertion_dl_healthcare date: 2021-01-26 task: Assertion Status language: en nav_key: models edition: Healthcare NLP 2.7.2 spark_version: 2.4 tags: [assertion, en, licensed, clinical, healthcare] supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Assign assertion status to clinical entities extracted by NER based on their context in the text. ## Predicted Entities `absent`, `present`, `conditional`, `associated_with_someone_else`, `hypothetical`, `possible`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_dl_healthcare_en_2.7.2_2.4_1611646187271.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_dl_healthcare_en_2.7.2_2.4_1611646187271.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_healthcare", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") clinical_assertion = AssertionDLModel.pretrained("assertion_dl_healthcare", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion ]) data = spark.createDataFrame([["""Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. 
Father with Alzheimer."""]]).toDF("text")

result = nlpPipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")

val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_healthcare", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")

val ner_converter = new NerConverter()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")

val clinical_assertion = AssertionDLModel.pretrained("assertion_dl_healthcare","en", "clinical/models")
.setInputCols(Array("sentence", "ner_chunk", "embeddings"))
.setOutputCol("assertion")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion))

val data = Seq("Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer.").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.assert.healthcare").predict("""Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer.""")
```
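The per-label precision, recall and F1 figures reported in this model's benchmarking section follow directly from the raw tp/fp/fn counts. As a sketch of the standard formulas (not the evaluation code itself), the `absent` row can be reproduced like this:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 6), round(recall, 6), round(f1, 6)

# Counts for the `absent` class from the benchmarking table
print(prf(726, 86, 98))  # (0.894089, 0.881068, 0.887531)
```

The same function applied to each row recovers the reported per-class metrics.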
## Results ```bash +---------------+---------+----------------------------+ |chunk |ner_label|assertion | +---------------+---------+----------------------------+ |severe fever |PROBLEM |present | |sore throat |PROBLEM |present | |stomach pain |PROBLEM |absent | |an epidural |TREATMENT|present | |PCA |TREATMENT|present | |pain control |TREATMENT|present | |short of breath|PROBLEM |conditional | |CT |TEST |present | |lung tumor |PROBLEM |present | |Alzheimer |PROBLEM |associated_with_someone_else| +---------------+---------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_dl_healthcare| |Compatibility:|Spark NLP 2.7.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| ## Data Source Trained on i2b2 assertion data ## Benchmarking ```bash label tp fp fn prec rec f1 absent 726 86 98 0.894089 0.881068 0.887531 present 2544 232 119 0.916427 0.955314 0.935466 conditional 18 13 37 0.580645 0.327273 0.418605 associated_with_someone_else 40 5 9 0.888889 0.816327 0.851064 hypothetical 132 13 26 0.910345 0.835443 0.871287 possible 96 45 105 0.680851 0.477612 0.561404 Macro-average 3556 394 394 0.811874 0.715506 0.76065 Micro-average 3556 394 394 0.900253 0.900253 0.900253 ``` --- layout: model title: Legal Compensation Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_compensation_agreement_bert date: 2023-01-29 tags: [en, legal, classification, compensation, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_compensation_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs 
to the class `compensation-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter and therefore faster at inference.

## Predicted Entities

`compensation-agreement`, `other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_compensation_agreement_bert_en_1.0.0_3.0_1674990338214.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_compensation_agreement_bert_en_1.0.0_3.0_1674990338214.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_compensation_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
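In this model's benchmarking section, macro-avg is the unweighted mean of the per-class scores, while weighted-avg weights each class by its support. A sketch of the two aggregations, fed with the per-class values from the table (small rounding differences are expected since the table itself is rounded to two decimals):

```python
def macro_avg(scores):
    """Unweighted mean of per-class scores."""
    return sum(scores) / len(scores)

def weighted_avg(scores, supports):
    """Mean of per-class scores weighted by class support."""
    return sum(s * n for s, n in zip(scores, supports)) / sum(supports)

precision = [1.00, 0.96]  # compensation-agreement, other
f1 = [0.97, 0.98]
support = [40, 55]

print(round(macro_avg(precision), 2))       # 0.98
print(round(weighted_avg(f1, support), 2))  # 0.98
```

Both values match the macro-avg precision and weighted-avg F1 rows of the table.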
## Results

```bash
+------------------------+
|result                  |
+------------------------+
|[compensation-agreement]|
|[other]                 |
|[other]                 |
|[compensation-agreement]|
+------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_compensation_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.2 MB|

## References

Legal documents, scraped from the Internet and classified in-house, plus SEC documents

## Benchmarking

```bash
label                   precision recall f1-score support
compensation-agreement  1.00      0.95   0.97     40
other                   0.96      1.00   0.98     55
accuracy                -         -      0.98     95
macro-avg               0.98      0.97   0.98     95
weighted-avg            0.98      0.98   0.98     95
```

---
layout: model
title: Legal Operating Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_operating_agreement_bert
date: 2022-11-25
tags: [en, legal, classification, agreement, operating, licensed, bert]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The `legclf_operating_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `operating-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter and therefore faster at inference.

## Predicted Entities

`operating-agreement`, `other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_operating_agreement_bert_en_1.0.0_3.0_1669368532203.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_operating_agreement_bert_en_1.0.0_3.0_1669368532203.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_operating_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+---------------------+
|result               |
+---------------------+
|[operating-agreement]|
|[other]              |
|[other]              |
|[operating-agreement]|
+---------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_operating_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|

## References

Legal documents, scraped from the Internet and classified in-house, plus SEC documents

## Benchmarking

```bash
label               precision recall f1-score support
operating-agreement 0.96      0.84   0.90     31
other               0.94      0.99   0.96     82
accuracy            -         -      0.95     113
macro-avg           0.95      0.91   0.93     113
weighted-avg        0.95      0.95   0.95     113
```

---
layout: model
title: Igbo Named Entity Recognition (from mbeukman)
author: John Snow Labs
name: xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_igbo
date: 2022-05-17
tags: [xlm_roberta, ner, token_classification, ig, open_source]
task: Named Entity Recognition
language: ig
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-swahili-finetuned-ner-igbo` is an Igbo model originally trained by `mbeukman`.

## Predicted Entities

`PER`, `ORG`, `LOC`, `DATE`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_igbo_ig_3.4.2_3.0_1652809139166.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_igbo_ig_3.4.2_3.0_1652809139166.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_igbo","ig") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ahụrụ m n'anya na-atọ m ụtọ"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_igbo","ig") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ahụrụ m n'anya na-atọ m ụtọ").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_igbo| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ig| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-swahili-finetuned-ner-igbo - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://www.apache.org/licenses/LICENSE-2.0 - https://github.com/Michael-Beukman/ --- layout: model title: Translate Indic languages to English Pipeline author: John Snow Labs name: translate_inc_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, inc, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `inc` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_inc_en_xx_2.7.0_2.4_1609688074696.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_inc_en_xx_2.7.0_2.4_1609688074696.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_inc_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_inc_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.inc.translate_to.en').predict(text, output_level='sentence') translate_df ```
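For quick experiments, `annotate` returns a plain dict mapping output column names to lists of strings. The helper below is an illustrative sketch only (not a Spark NLP API) for joining per-sentence translations back into one string; the `"translation"` output key is an assumption and may differ between pipelines.

```python
# Illustrative post-processing only; running the pipeline itself requires an
# active Spark NLP session. The output key "translation" is an assumption.
def join_translations(annotations: dict, key: str = "translation") -> str:
    """Join per-sentence translation strings back into a single string."""
    return " ".join(s.strip() for s in annotations.get(key, []))

# Example with a mocked annotate() result:
mock = {"translation": ["Hello!", "How are you?"]}
print(join_translations(mock))  # Hello! How are you?
```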
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_inc_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Kabyle asr_Kabyle_xlsr TFWav2Vec2ForCTC from Akashpb13 author: John Snow Labs name: pipeline_asr_Kabyle_xlsr date: 2022-09-24 tags: [wav2vec2, kab, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: kab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Kabyle_xlsr` is a Kabyle model originally trained by Akashpb13. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_Kabyle_xlsr_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_Kabyle_xlsr_kab_4.2.0_3.0_1664018945760.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_Kabyle_xlsr_kab_4.2.0_3.0_1664018945760.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('pipeline_asr_Kabyle_xlsr', lang = 'kab')

annotations = pipeline.transform(audioDF)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = new PretrainedPipeline("pipeline_asr_Kabyle_xlsr", lang = "kab")

val annotations = pipeline.transform(audioDF)
```
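Wav2Vec2 pipelines expect the audio column to hold floating-point samples. As an illustrative sketch (not part of Spark NLP), a 16-bit mono PCM WAV file can be decoded into normalized floats with the standard library; the `audio_content` column name in the commented usage is the convention these cards use for AudioAssembler input.

```python
# Sketch of preparing audio for a Wav2Vec2 pipeline: decode a 16-bit mono PCM
# WAV file into floats in [-1.0, 1.0] using only the standard library.
import struct
import wave

def wav_to_floats(path: str) -> list:
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expects 16-bit PCM"
        frames = wf.readframes(wf.getnframes())
    # "<Nh" = N little-endian signed 16-bit samples
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Hypothetical usage with a Spark session:
# audioDF = spark.createDataFrame([(wav_to_floats("sample.wav"),)],
#                                 ["audio_content"])
```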
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_Kabyle_xlsr| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|kab| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Finnish T5ForConditionalGeneration Mini Cased model (from Finnish-NLP) author: John Snow Labs name: t5_mini_nl8 date: 2023-01-31 tags: [fi, open_source, t5, tensorflow] task: Text Generation language: fi edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-mini-nl8-finnish` is a Finnish model originally trained by `Finnish-NLP`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_mini_nl8_fi_4.3.0_3.0_1675124948833.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_mini_nl8_fi_4.3.0_3.0_1675124948833.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_mini_nl8","fi") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_mini_nl8","fi")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_mini_nl8| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|fi| |Size:|315.9 MB| ## References - https://huggingface.co/Finnish-NLP/t5-mini-nl8-finnish - https://arxiv.org/abs/1910.10683 - https://github.com/google-research/text-to-text-transfer-transformer - https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511 - https://arxiv.org/abs/2002.05202 - https://arxiv.org/abs/2109.10686 - http://urn.fi/urn:nbn:fi:lb-2017070501 - http://urn.fi/urn:nbn:fi:lb-2021050401 - http://urn.fi/urn:nbn:fi:lb-2018121001 - http://urn.fi/urn:nbn:fi:lb-2020021803 - https://sites.research.google/trc/about/ - https://github.com/google-research/t5x - https://github.com/spyysalo/yle-corpus - https://github.com/aajanki/eduskunta-vkk - https://sites.research.google/trc/ - https://www.linkedin.com/in/aapotanskanen/ - https://www.linkedin.com/in/rasmustoivanen/ --- layout: model title: Sentence Entity Resolver for RxNorm (sbert_jsl_medium_rxnorm_uncased embeddings) author: John Snow Labs name: sbertresolve_jsl_rxnorm_augmented_med date: 2021-12-28 tags: [clinical, entity_resolution, en, licensed] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes using `sbert_jsl_medium_rxnorm_uncased` Sentence Bert Embeddings. It is trained on an augmented version of the dataset used in previous RxNorm resolver models. Additionally, this model returns the concept classes of the drugs in the `all_k_aux_labels` column.
## Predicted Entities `RxNorm Codes`, `Concept Classes` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_jsl_rxnorm_augmented_med_en_3.3.4_2.4_1640686630389.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_jsl_rxnorm_augmented_med_en_3.3.4_2.4_1640686630389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("ner_chunk")

sbert_embedder = BertSentenceEmbeddings.pretrained('sbert_jsl_medium_rxnorm_uncased', 'en', 'clinical/models')\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("sbert_embeddings")

rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_jsl_rxnorm_augmented_med", "en", "clinical/models") \
    .setInputCols(["sbert_embeddings"]) \
    .setOutputCol("rxnorm_code")\
    .setDistanceFunction("EUCLIDEAN")

rxnorm_pipelineModel = PipelineModel(
    stages = [
        documentAssembler,
        sbert_embedder,
        rxnorm_resolver])

light_model = LightPipeline(rxnorm_pipelineModel)

result = light_model.fullAnnotate(["Coumadin 5 mg", "aspirin", "Neurontin 300", "avandia 4 mg"])
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("ner_chunk")

val sbert_embedder = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_rxnorm_uncased", "en", "clinical/models")
    .setInputCols("ner_chunk")
    .setOutputCol("sbert_embeddings")

val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_jsl_rxnorm_augmented_med", "en", "clinical/models")
    .setInputCols(Array("sbert_embeddings"))
    .setOutputCol("rxnorm_code")
    .setDistanceFunction("EUCLIDEAN")

val rxnorm_pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, rxnorm_resolver))

val rxnorm_pipelineModel = rxnorm_pipeline.fit(Seq("").toDF("text"))

val light_model = new LightPipeline(rxnorm_pipelineModel)

val result = light_model.fullAnnotate(Array("Coumadin 5 mg", "aspirin", "Neurontin 300", "avandia 4 mg"))
```

{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.rxnorm.augmented_med").predict("""Coumadin 5 mg""")
```
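The resolver packs its ranked candidates into single `:::`-delimited strings (the `all_k_*` columns shown in the results below). A small helper, illustrative only and not a Spark NLP API, can unpack them into a ranked list:

```python
# Unpack ":::"-delimited candidate strings from the resolver's all_k_* columns
# into ranked (code, resolution, distance) tuples. Illustrative helper only.
def unpack_candidates(codes: str, resolutions: str, distances: str):
    rows = zip(codes.split(":::"),
               resolutions.split(":::"),
               distances.split(":::"))
    return [(c.strip(), r.strip(), float(d)) for c, r, d in rows]

ranked = unpack_candidates(
    "855333:::855334",
    "warfarin sodium 5 MG [Coumadin]:::warfarin sodium 5 MG Oral Tablet",
    "0.0000:::6.0548",
)
print(ranked[0])  # ('855333', 'warfarin sodium 5 MG [Coumadin]', 0.0)
```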
## Results ```bash | | RxNormCode | Resolution | all_k_results | all_k_distances | all_k_cosine_distances | all_k_resolutions | all_k_aux_labels | |---:|-------------:|:-------------------------------------------|:----------------------------------|:----------------------------------|:----------------------------------|:----------------------------------------------------------------|:----------------------------------| | 0 | 855333 | warfarin sodium 5 MG [Coumadin] | 855333:::855334:::1110792:::11... | 0.0000:::6.0548:::6.1667:::6.1... | 0.0000:::0.0515:::0.0536:::0.0... | warfarin sodium 5 MG [Coumadin]:::warfarin sodium 5 MG Oral ... | Branded Drug Comp:::Branded Dr... | | 1 | 1537020 | aspirin Effervescent Oral Tablet | 1537020:::1191:::202547:::1001... | 0.0000:::0.0000:::8.8123:::9.3... | 0.0000:::0.0000:::0.1145:::0.1... | aspirin Effervescent Oral Tablet:::aspirin:::Empirin:::Ecpir... | Clinical Drug Form:::Ingredien... | | 2 | 105029 | gabapentin 300 MG Oral Capsule [Neurontin] | 105029:::1718929:::1718930:::3... | 5.5969:::8.7502:::8.7502:::8.7... | 0.0452:::0.1092:::0.1092:::0.1... | gabapentin 300 MG Oral Capsule [Neurontin]:::olanzapine 300 ... | Branded Drug:::Clinical Drug C... | | 3 | 261242 | rosiglitazone 4 MG Oral Tablet [Avandia] | 261242:::2123140:::1792373:::8... | 0.0000:::7.1217:::7.7113:::8.4... | 0.0000:::0.0728:::0.0843:::0.1... | rosiglitazone 4 MG Oral Tablet [Avandia]:::erdafitinib 4 MG ... | Branded Drug:::Branded Drug Co... 
| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbertresolve_jsl_rxnorm_augmented_med| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[rxnorm_code]| |Language:|en| |Size:|650.7 MB| |Case sensitive:|false| --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_8 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-16-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_8_en_4.0.0_3.0_1657184626320.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_8_en_4.0.0_3.0_1657184626320.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_8","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_8","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
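Each Spark NLP annotation carries `begin`/`end` character offsets into the input text. A minimal sketch, assuming the inclusive end offsets that Spark NLP annotations use, recovers the answer span from the context string:

```python
# Recover an answer span from the context given annotation offsets.
# Assumes inclusive begin/end offsets, as used by Spark NLP annotations.
def answer_span(context: str, begin: int, end: int) -> str:
    return context[begin:end + 1]

context = "My name is Clara and I live in Berkeley."
print(answer_span(context, 11, 15))  # Clara
```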
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_8| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-16-finetuned-squad-seed-8 --- layout: model title: Catalan RobertaForTokenClassification Cased model (from softcatala) author: John Snow Labs name: roberta_token_classifier_fullstop_catalan_punctuation_prediction date: 2023-03-01 tags: [ca, open_source, roberta, token_classification, ner, tensorflow] task: Named Entity Recognition language: ca edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fullstop-catalan-punctuation-prediction` is a Catalan model originally trained by `softcatala`. ## Predicted Entities `.`, `?`, `-`, `:`, `,`, `0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_fullstop_catalan_punctuation_prediction_ca_4.3.0_3.0_1677703587592.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_fullstop_catalan_punctuation_prediction_ca_4.3.0_3.0_1677703587592.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_fullstop_catalan_punctuation_prediction","ca") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_fullstop_catalan_punctuation_prediction","ca")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_token_classifier_fullstop_catalan_punctuation_prediction| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ca| |Size:|457.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/softcatala/fullstop-catalan-punctuation-prediction - https://github.com/oliverguhr/fullstop-deep-punctuation-prediction --- layout: model title: Bemba (Zambia) asr_wav2vec2_large_xls_r_1b_bemba_fds TFWav2Vec2ForCTC from csikasote author: John Snow Labs name: asr_wav2vec2_large_xls_r_1b_bemba_fds date: 2022-09-24 tags: [wav2vec2, bem, audio, open_source, asr] task: Automatic Speech Recognition language: bem edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_1b_bemba_fds` is a Bemba (Zambia) model originally trained by csikasote. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_large_xls_r_1b_bemba_fds_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_1b_bemba_fds_bem_4.2.0_3.0_1664043378039.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_1b_bemba_fds_bem_4.2.0_3.0_1664043378039.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_1b_bemba_fds", "bem")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_1b_bemba_fds", "bem") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_1b_bemba_fds| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|bem| |Size:|3.6 GB| --- layout: model title: Legal Conduct of business Clause Binary Classifier author: John Snow Labs name: legclf_conduct_of_business_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `conduct-of-business` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities `other`, `conduct-of-business` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_conduct_of_business_clause_en_1.0.0_3.2_1660122262718.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_conduct_of_business_clause_en_1.0.0_3.2_1660122262718.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_conduct_of_business_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
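The f1-score column in the benchmarking section below is the harmonic mean of precision and recall, so the reported figures can be sanity-checked directly (small discrepancies are possible because the table shows rounded precision/recall):

```python
# F1 is the harmonic mean of precision and recall.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# conduct-of-business row: precision 0.96, recall 0.75
print(round(f1(0.96, 0.75), 2))  # 0.84
```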
## Results ```bash +-------+ | result| +-------+ |[conduct-of-business]| |[other]| |[other]| |[conduct-of-business]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_conduct_of_business_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support conduct-of-business 0.96 0.75 0.84 32 other 0.92 0.99 0.96 98 accuracy - - 0.93 130 macro-avg 0.94 0.87 0.90 130 weighted-avg 0.93 0.93 0.93 130 ``` --- layout: model title: Legal Subsidiaries Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_subsidiaries_bert date: 2023-03-05 tags: [en, legal, classification, clauses, subsidiaries, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Subsidiaries` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level.
If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Subsidiaries`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_subsidiaries_bert_en_1.0.0_3.0_1678050508907.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_subsidiaries_bert_en_1.0.0_3.0_1678050508907.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_subsidiaries_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Subsidiaries]| |[Other]| |[Other]| |[Subsidiaries]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_subsidiaries_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.98 0.96 0.97 85 Subsidiaries 0.95 0.97 0.96 64 accuracy - - 0.97 149 macro-avg 0.97 0.97 0.97 149 weighted-avg 0.97 0.97 0.97 149 ``` --- layout: model title: English T5ForConditionalGeneration Tiny Cased model (from google) author: John Snow Labs name: t5_efficient_tiny_ff2000 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-ff2000` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_ff2000_en_4.3.0_3.0_1675123479854.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_ff2000_en_4.3.0_3.0_1675123479854.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_tiny_ff2000","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_tiny_ff2000","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_tiny_ff2000| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|46.1 MB| ## References - https://huggingface.co/google/t5-efficient-tiny-ff2000 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Spanish BERT Sentence Base Cased Embedding author: John Snow Labs name: sent_bert_base_cased date: 2021-09-06 tags: [spanish, open_source, bert_sentence_embeddings, cased, es] task: Embeddings language: es edition: Spark NLP 3.2.2 spark_version: 3.0 supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description BETO is a BERT model trained on a big Spanish corpus. BETO is similar in size to BERT-Base and was trained with the Whole Word Masking technique. Tensorflow and Pytorch checkpoints are available for the uncased and cased versions, along with results on Spanish benchmarks comparing BETO with Multilingual BERT and other (non-BERT-based) models. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_base_cased_es_3.2.2_3.0_1630926259701.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_base_cased_es_3.2.2_3.0_1630926259701.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "es") \
    .setInputCols("sentence") \
    .setOutputCol("bert_sentence")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings])
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "es")
    .setInputCols("sentence")
    .setOutputCol("bert_sentence")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings))
```

{:.nlu-block}
```python
import nlu
nlu.load("es.embed_sentence.bert.base_cased").predict("""Put your text here.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sent_bert_base_cased|
|Compatibility:|Spark NLP 3.2.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[bert_sentence]|
|Language:|es|
|Case sensitive:|true|

## Data Source

The model is imported from: https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased

---
layout: model
title: English BertForQuestionAnswering model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_8
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-128-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_8_en_4.0.0_3.0_1654191434774.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_8_en_4.0.0_3.0_1654191434774.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_8","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_8","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.span_bert.base_cased_128d_seed_8").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
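The `{:.nlu-block}` one-liner above relies on a small convention: the question and the context travel in a single string, separated by `|||`. As a minimal, illustrative plain-Python sketch of that convention (the helper name is hypothetical and not part of the nlu API):

```python
# Hypothetical helper illustrating the "question|||context" convention used by
# nlu's question-answering predict() calls. Not part of the nlu library itself.
def split_question_context(combined: str):
    # partition() splits on the first "|||" only, so "|||" inside the
    # context (unlikely but possible) would not be misinterpreted.
    question, _, context = combined.partition("|||")
    return question.strip(), context.strip()

q, c = split_question_context("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
print(c)  # My name is Clara and I live in Berkeley.
```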
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_8|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|381.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-128-finetuned-squad-seed-8

---
layout: model
title: Korean DistilBertForQuestionAnswering Cased model (from pakupoko)
author: John Snow Labs
name: distilbert_qa_bizlin_model
date: 2023-01-03
tags: [ko, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: ko
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bizlin-distil-model` is a Korean model originally trained by `pakupoko`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_bizlin_model_ko_4.3.0_3.0_1672765867213.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_bizlin_model_ko_4.3.0_3.0_1672765867213.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bizlin_model","ko")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bizlin_model","ko")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_bizlin_model|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|ko|
|Size:|104.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/pakupoko/bizlin-distil-model

---
layout: model
title: English T5ForConditionalGeneration Tiny Cased model (from google)
author: John Snow Labs
name: t5_efficient_tiny_nh32
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-nh32` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nh32_en_4.3.0_3.0_1675123694521.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nh32_en_4.3.0_3.0_1675123694521.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_tiny_nh32","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_tiny_nh32","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_tiny_nh32|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|88.2 MB|

## References

- https://huggingface.co/google/t5-efficient-tiny-nh32
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145

---
layout: model
title: Typo Detector Pipeline for English
author: ahmedlone127
name: distilbert_token_classifier_typo_detector_pipeline
date: 2022-06-14
tags: [ner, bert, bert_for_token, typo, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: false
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of [distilbert_token_classifier_typo_detector](https://nlp.johnsnowlabs.com/2022/01/19/distilbert_token_classifier_typo_detector_en.html).

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/TYPO_DETECTOR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/DistilBertForTokenClassification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/distilbert_token_classifier_typo_detector_pipeline_en_4.0.0_3.0_1655212406234.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/distilbert_token_classifier_typo_detector_pipeline_en_4.0.0_3.0_1655212406234.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
typo_pipeline = PretrainedPipeline("distilbert_token_classifier_typo_detector_pipeline", lang = "en")

typo_pipeline.annotate("He had also stgruggled with addiction during his tine in Congress.")
```
```scala
val typo_pipeline = new PretrainedPipeline("distilbert_token_classifier_typo_detector_pipeline", lang = "en")

typo_pipeline.annotate("He had also stgruggled with addiction during his tine in Congress.")
```
## Results

```bash
+----------+---------+
|chunk     |ner_label|
+----------+---------+
|stgruggled|PO       |
|tine      |PO       |
+----------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_typo_detector_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Community|
|Language:|en|
|Size:|244.2 MB|

## Included Models

- DocumentAssembler
- TokenizerModel
- DistilBertForTokenClassification
- NerConverter
- Finisher

---
layout: model
title: Legal Employment Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_employment_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, employment, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

This model is a Binary Classifier (True, False) for the `Employment` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into account that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).

This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`Employment`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_employment_bert_en_1.0.0_3.0_1678050529088.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_employment_bert_en_1.0.0_3.0_1678050529088.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_employment_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    document_assembler,
    embeddings,
    doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")

model = nlpPipeline.fit(df)

result = model.transform(df)
```
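The document-splitting advice above ("paragraph splitting by multiline") can be sketched in plain Python, run before the Spark DataFrame is built. This is a minimal, hypothetical helper, not part of the Legal NLP library: the function name and the word-based threshold are illustrative, and since the model's 512-token limit applies to subword tokens, a conservative word count is used as a proxy.

```python
# Hypothetical preprocessing sketch: split a long legal document into
# provisions on blank lines, then further chunk any provision that is
# still too long for the model's 512-token window (word count used as
# a conservative proxy for subword tokens).
def split_into_provisions(text, max_words=300):
    provisions = []
    for block in text.split("\n\n"):       # paragraph splitting (by multiline)
        words = block.split()
        if not words:
            continue
        for i in range(0, len(words), max_words):
            provisions.append(" ".join(words[i:i + max_words]))
    return provisions

doc = "Short provision one.\n\nShort provision two.\n\n" + " ".join(["token"] * 650)
chunks = split_into_provisions(doc)
print(len(chunks))  # 5: two short provisions, plus the long one split into 3
```

Each resulting chunk can then be placed in its own row of the `text` column fed to the pipeline above.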
## Results

```bash
+------------+
|result      |
+------------+
|[Employment]|
|[Other]     |
|[Other]     |
|[Employment]|
+------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_employment_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.4 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
       label  precision  recall  f1-score  support
  Employment       0.96    0.96      0.96       27
       Other       0.98    0.98      0.98       49
    accuracy          -       -      0.97       76
   macro-avg       0.97    0.97      0.97       76
weighted-avg       0.97    0.97      0.97       76
```

---
layout: model
title: English BertForQuestionAnswering Cased model (from ponmari)
author: John Snow Labs
name: bert_qa_questionansweing
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `QuestionAnsweingBert` is an English model originally trained by `ponmari`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_questionansweing_en_4.0.0_3.0_1657182420944.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_questionansweing_en_4.0.0_3.0_1657182420944.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_questionansweing","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_questionansweing","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_questionansweing|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/ponmari/QuestionAnsweingBert

---
layout: model
title: English DistilBertForQuestionAnswering model (from Rocketknight1)
author: John Snow Labs
name: distilbert_qa_Rocketknight1_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Rocketknight1`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Rocketknight1_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724414659.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Rocketknight1_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724414659.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Rocketknight1_base_uncased_finetuned_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Rocketknight1_base_uncased_finetuned_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Rocketknight1").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_Rocketknight1_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/Rocketknight1/distilbert-base-uncased-finetuned-squad

---
layout: model
title: English T5ForConditionalGeneration Tiny Cased model (from google)
author: John Snow Labs
name: t5_efficient_tiny_ff3000
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-ff3000` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_ff3000_en_4.3.0_3.0_1675123510286.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_ff3000_en_4.3.0_3.0_1675123510286.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_tiny_ff3000","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_tiny_ff3000","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_tiny_ff3000|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|62.1 MB|

## References

- https://huggingface.co/google/t5-efficient-tiny-ff3000
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145

---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_efficient_small_el16_dl8
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-el16-dl8` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el16_dl8_en_4.3.0_3.0_1675119813436.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el16_dl8_en_4.3.0_3.0_1675119813436.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_small_el16_dl8","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_small_el16_dl8","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_small_el16_dl8|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|224.3 MB|

## References

- https://huggingface.co/google/t5-efficient-small-el16-dl8
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145

---
layout: model
title: English asr_wav2vec2_large_english TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: asr_wav2vec2_large_english
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_english` is an English model originally trained by jonatasgrosman.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_english_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_english_en_4.2.0_3.0_1664020258828.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_english_en_4.2.0_3.0_1664020258828.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
    .setInputCol("audio_content") \
    .setOutputCol("audio_assembler")

speech_to_text = Wav2Vec2ForCTC \
    .pretrained("asr_wav2vec2_large_english", "en")\
    .setInputCols("audio_assembler") \
    .setOutputCol("text")

pipeline = Pipeline(stages=[
  audio_assembler,
  speech_to_text,
])

pipelineModel = pipeline.fit(audioDf)

pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
    .setInputCol("audio_content")
    .setOutputCol("audio_assembler")

val speechToText = Wav2Vec2ForCTC
    .pretrained("asr_wav2vec2_large_english", "en")
    .setInputCols("audio_assembler")
    .setOutputCol("text")

val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))

val pipelineModel = pipeline.fit(audioDf)

val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_english|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|

---
layout: model
title: Pipeline to Detect Living Species
author: John Snow Labs
name: bert_token_classifier_ner_living_species_pipeline
date: 2023-03-20
tags: [pt, ner, clinical, licensed, bertfortokenclassification]
task: Named Entity Recognition
language: pt
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [bert_token_classifier_ner_living_species](https://nlp.johnsnowlabs.com/2022/06/27/bert_token_classifier_ner_living_species_pt_3_0.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_pipeline_pt_4.3.0_3.2_1679304320046.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_pipeline_pt_4.3.0_3.2_1679304320046.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("bert_token_classifier_ner_living_species_pipeline", "pt", "clinical/models")

text = '''Uma rapariga de 16 anos com um historial pessoal de asma apresentou ao departamento de dermatologia com lesões cutâneas assintomáticas que tinham estado presentes durante 2 meses. A paciente tinha sido tratada com creme corticosteróide devido a uma suspeita inicial de eczema atópico, apesar do qual apresentava um crescimento progressivo marcado das lesões. Tinha um gato doméstico que ela nunca tinha levado ao veterinário. O exame físico revelou placas em forma de anel com uma borda periférica activa na parte superior das costas e nos aspectos laterais do pescoço e da face. Cultura local obtida por raspagem de tapete isolado Trichophyton rubrum. Com base em dados clínicos e cultura, foi estabelecido o diagnóstico de tinea incognito.'''

result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = new PretrainedPipeline("bert_token_classifier_ner_living_species_pipeline", "pt", "clinical/models")

val text = "Uma rapariga de 16 anos com um historial pessoal de asma apresentou ao departamento de dermatologia com lesões cutâneas assintomáticas que tinham estado presentes durante 2 meses. A paciente tinha sido tratada com creme corticosteróide devido a uma suspeita inicial de eczema atópico, apesar do qual apresentava um crescimento progressivo marcado das lesões. Tinha um gato doméstico que ela nunca tinha levado ao veterinário. O exame físico revelou placas em forma de anel com uma borda periférica activa na parte superior das costas e nos aspectos laterais do pescoço e da face. Cultura local obtida por raspagem de tapete isolado Trichophyton rubrum. Com base em dados clínicos e cultura, foi estabelecido o diagnóstico de tinea incognito."

val result = pipeline.fullAnnotate(text)
```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:--------------------|--------:|------:|:------------|-------------:| | 0 | rapariga | 4 | 11 | HUMAN | 0.999888 | | 1 | pessoal | 41 | 47 | HUMAN | 0.99987 | | 2 | paciente | 182 | 189 | HUMAN | 0.999731 | | 3 | gato | 368 | 371 | SPECIES | 0.999365 | | 4 | veterinário | 413 | 423 | HUMAN | 0.982236 | | 5 | Trichophyton rubrum | 632 | 650 | SPECIES | 0.996602 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_living_species_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|pt| |Size:|666.0 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Fast Neural Machine Translation Model from English to Romanian author: John Snow Labs name: opus_mt_en_ro date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, ro, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `en` - target languages: `ro` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ro_xx_2.7.0_2.4_1609168722172.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ro_xx_2.7.0_2.4_1609168722172.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_ro", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_ro", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.ro').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_ro| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English asr_wav2vec2_large_xls_r_300m_english_colab TFWav2Vec2ForCTC from shacharm author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_300m_english_colab date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_english_colab` is an English model originally trained by shacharm. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_english_colab_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_english_colab_en_4.2.0_3.0_1664103544762.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_english_colab_en_4.2.0_3.0_1664103544762.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_english_colab', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_english_colab", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_english_colab| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Sentence Embeddings - sbert medium (tuned) author: John Snow Labs name: sbert_jsl_medium_umls_uncased date: 2021-06-30 tags: [embeddings, clinical, licensed, en] task: Embeddings language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 2.4 supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained to generate contextual sentence embeddings of input sentences. {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_medium_umls_uncased_en_3.1.0_2.4_1625050119656.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_medium_umls_uncased_en_3.1.0_2.4_1625050119656.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sbiobert_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_umls_uncased","en","clinical/models").setInputCols(["sentence"]).setOutputCol("sbert_embeddings") ``` ```scala val sbiobert_embeddings = BertSentenceEmbeddings .pretrained("sbert_jsl_medium_umls_uncased","en","clinical/models") .setInputCols(Array("sentence")) .setOutputCol("sbert_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed_sentence.bert.jsl_medium_umls_uncased").predict("""Put your text here.""") ```
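Downstream (for example in entity resolution), sentence vectors like these are compared with cosine similarity, the same measure behind the STS (cos) score reported below. A minimal, illustrative sketch in plain Python — the short toy vectors stand in for the 768-dimensional embeddings this model actually produces:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical stand-ins for sbert_embeddings output (real vectors have 768 dims)
vec_pneumonia      = [0.9, 0.1, 0.3]
vec_lung_infection = [0.8, 0.2, 0.4]
vec_fracture       = [0.1, 0.9, 0.0]

# Semantically close sentences score high; unrelated ones score low
print(cosine_similarity(vec_pneumonia, vec_lung_infection))  # high (close to 1)
print(cosine_similarity(vec_pneumonia, vec_fracture))        # much lower
```

Identical vectors score 1.0; the closer to 1.0 two sentence embeddings are, the more similar the model considers their meanings.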
## Results ```bash Gives a 768 dimensional vector representation of the sentence. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbert_jsl_medium_umls_uncased| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Case sensitive:|false| ## Data Source Tuned on MedNLI and UMLS dataset ## Benchmarking ```bash MedNLI Acc: 0.744, STS (cos): 0.695 ``` --- layout: model title: Detect Persons, Locations, Organizations and Misc Entities in Russian (WikiNER 6B 100) author: John Snow Labs name: wikiner_6B_100 date: 2020-03-16 task: Named Entity Recognition language: ru edition: Spark NLP 2.4.4 spark_version: 2.4 tags: [ner, ru, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 6B 100 is trained with GloVe 6B 100 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_RU){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_RU.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_6B_100_ru_2.4.4_2.4_1584014001452.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_6B_100_ru_2.4.4_2.4_1584014001452.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_100d") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("wikiner_6B_100", "ru") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['Уильям Генри Гейтс III (родился 28 октября 1955 года) - американский бизнес-магнат, разработчик программного обеспечения, инвестор и филантроп. Он наиболее известен как соучредитель корпорации Microsoft. За время своей карьеры в Microsoft Гейтс занимал должности председателя, главного исполнительного директора (CEO), президента и главного архитектора программного обеспечения, а также был крупнейшим индивидуальным акционером до мая 2014 года. Он является одним из самых известных предпринимателей и пионеров микрокомпьютерная революция 1970-х и 1980-х годов. Гейтс родился и вырос в Сиэтле, штат Вашингтон, в 1975 году вместе с другом детства Полом Алленом в Альбукерке, штат Нью-Мексико, и основал компанию Microsoft. она стала крупнейшей в мире компанией-разработчиком программного обеспечения для персональных компьютеров. Гейтс руководил компанией в качестве председателя и генерального директора, пока в январе 2000 года не ушел с поста генерального директора, но остался председателем и стал главным архитектором программного обеспечения. В конце 1990-х Гейтс подвергся критике за свою деловую тактику, которая считалась антиконкурентной. Это мнение было подтверждено многочисленными судебными решениями. 
В июне 2006 года Гейтс объявил, что перейдет на неполный рабочий день в Microsoft и будет работать на полную ставку в Фонде Билла и Мелинды Гейтс, частном благотворительном фонде, который он и его жена Мелинда Гейтс создали в 2000 году. [ 9] Постепенно он передал свои обязанности Рэю Оззи и Крейгу Манди. Он ушел с поста президента Microsoft в феврале 2014 года и занял новую должность консультанта по технологиям для поддержки вновь назначенного генерального директора Сатья Наделла.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_100d") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("wikiner_6B_100", "ru") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("Уильям Генри Гейтс III (родился 28 октября 1955 года) - американский бизнес-магнат, разработчик программного обеспечения, инвестор и филантроп. Он наиболее известен как соучредитель корпорации Microsoft. За время своей карьеры в Microsoft Гейтс занимал должности председателя, главного исполнительного директора (CEO), президента и главного архитектора программного обеспечения, а также был крупнейшим индивидуальным акционером до мая 2014 года. Он является одним из самых известных предпринимателей и пионеров микрокомпьютерная революция 1970-х и 1980-х годов. Гейтс родился и вырос в Сиэтле, штат Вашингтон, в 1975 году вместе с другом детства Полом Алленом в Альбукерке, штат Нью-Мексико, и основал компанию Microsoft. она стала крупнейшей в мире компанией-разработчиком программного обеспечения для персональных компьютеров. Гейтс руководил компанией в качестве председателя и генерального директора, пока в январе 2000 года не ушел с поста генерального директора, но остался председателем и стал главным архитектором программного обеспечения. 
В конце 1990-х Гейтс подвергся критике за свою деловую тактику, которая считалась антиконкурентной. Это мнение было подтверждено многочисленными судебными решениями. В июне 2006 года Гейтс объявил, что перейдет на неполный рабочий день в Microsoft и будет работать на полную ставку в Фонде Билла и Мелинды Гейтс, частном благотворительном фонде, который он и его жена Мелинда Гейтс создали в 2000 году. [ 9] Постепенно он передал свои обязанности Рэю Оззи и Крейгу Манди. Он ушел с поста президента Microsoft в феврале 2014 года и занял новую должность консультанта по технологиям для поддержки вновь назначенного генерального директора Сатья Наделла.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Уильям Генри Гейтс III (родился 28 октября 1955 года) - американский бизнес-магнат, разработчик программного обеспечения, инвестор и филантроп. Он наиболее известен как соучредитель корпорации Microsoft. За время своей карьеры в Microsoft Гейтс занимал должности председателя, главного исполнительного директора (CEO), президента и главного архитектора программного обеспечения, а также был крупнейшим индивидуальным акционером до мая 2014 года. Он является одним из самых известных предпринимателей и пионеров микрокомпьютерная революция 1970-х и 1980-х годов. Гейтс родился и вырос в Сиэтле, штат Вашингтон, в 1975 году вместе с другом детства Полом Алленом в Альбукерке, штат Нью-Мексико, и основал компанию Microsoft. она стала крупнейшей в мире компанией-разработчиком программного обеспечения для персональных компьютеров. Гейтс руководил компанией в качестве председателя и генерального директора, пока в январе 2000 года не ушел с поста генерального директора, но остался председателем и стал главным архитектором программного обеспечения. В конце 1990-х Гейтс подвергся критике за свою деловую тактику, которая считалась антиконкурентной. Это мнение было подтверждено многочисленными судебными решениями. 
В июне 2006 года Гейтс объявил, что перейдет на неполный рабочий день в Microsoft и будет работать на полную ставку в Фонде Билла и Мелинды Гейтс, частном благотворительном фонде, который он и его жена Мелинда Гейтс создали в 2000 году. [ 9] Постепенно он передал свои обязанности Рэю Оззи и Крейгу Манди. Он ушел с поста президента Microsoft в феврале 2014 года и занял новую должность консультанта по технологиям для поддержки вновь назначенного генерального директора Сатья Наделла."""] ner_df = nlu.load('ru.ner.wikiner.glove.6B_100').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title} ## Results ```bash +----------------------+---------+ |chunk |ner_label| +----------------------+---------+ |Уильям Генри Гейтс III|PER | |Microsoft |ORG | |За |ORG | |Microsoft Гейтс |MISC | |CEO |ORG | |Он |PER | |Гейтс |PER | |Сиэтле |LOC | |Вашингтон |LOC | |Полом Алленом |PER | |Альбукерке |LOC | |Нью-Мексико |LOC | |Microsoft |ORG | |Гейтс |PER | |Гейтс |PER | |Это |PER | |В июне 2006 |MISC | |Гейтс |PER | |Microsoft |ORG | |Фонде Билла |ORG | +----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wikiner_6B_100| |Type:|ner| |Compatibility:| Spark NLP 2.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ru| |Case sensitive:|false| {:.h2_title} ## Data Source The model is trained based on data from [https://ru.wikipedia.org](https://ru.wikipedia.org) --- layout: model title: English BertForQuestionAnswering model (from KevinChoi) author: John Snow Labs name: bert_qa_KevinChoi_bert_finetuned_squad_accelerate date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad-accelerate` is an English model originally trained by `KevinChoi`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_KevinChoi_bert_finetuned_squad_accelerate_en_4.0.0_3.0_1654535806391.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_KevinChoi_bert_finetuned_squad_accelerate_en_4.0.0_3.0_1654535806391.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_KevinChoi_bert_finetuned_squad_accelerate","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_KevinChoi_bert_finetuned_squad_accelerate","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.accelerate.by_KevinChoi").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_KevinChoi_bert_finetuned_squad_accelerate| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/KevinChoi/bert-finetuned-squad-accelerate --- layout: model title: Explain Document ML Pipeline for English author: John Snow Labs name: explain_document_ml date: 2021-03-23 tags: [open_source, english, explain_document_ml, pipeline, en] supported: true task: [Named Entity Recognition, Lemmatization] language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_ml is a pretrained pipeline that processes text with a simple sequence of basic processing steps and recognizes entities. It performs most of the common text processing tasks on your DataFrame. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_ml_en_3.0.0_3.0_1616473253101.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_ml_en_3.0.0_3.0_1616473253101.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_ml', lang = 'en') annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_ml", lang = "en") val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs ! "] result_df = nlu.load('en.explain').predict(text) result_df ```
## Results ```bash | | document | sentence | token | spell | lemmas | stems | pos | |---:|:---------------------------------|:---------------------------------|:-------------------------------------------------|:------------------------------------------------|:------------------------------------------------|:-----------------------------------------------|:---------------------------------------| | 0 | ['Hello fronm John Snwow Labs!'] | ['Hello fronm John Snwow Labs!'] | ['Hello', 'fronm', 'John', 'Snwow', 'Labs', '!'] | ['Hello', 'front', 'John', 'Snow', 'Labs', '!'] | ['Hello', 'front', 'John', 'Snow', 'Labs', '!'] | ['hello', 'front', 'john', 'snow', 'lab', '!'] | ['UH', 'NN', 'NNP', 'NNP', 'NNP', '.'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_ml| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| --- layout: model title: English asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4 TFWav2Vec2ForCTC from chrisvinsen author: John Snow Labs name: asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4` is an English model originally trained by chrisvinsen.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4_en_4.2.0_3.0_1664103661065.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4_en_4.2.0_3.0_1664103661065.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Legal Salary Clause Binary Classifier author: John Snow Labs name: legclf_salary_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `salary` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
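The paragraph splitting (by multiline) recommended above can be done before the Spark NLP pipeline even starts. A minimal, illustrative sketch in plain Python — the sample contract text and the helper name are hypothetical, not part of the Legal NLP library:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document on blank lines (multiline paragraph splitting).

    Each resulting chunk can then become its own 'clause_text' row for the
    classifier, keeping every input under the 512-token embedding limit.
    """
    # One or more blank lines (possibly containing spaces) end a paragraph
    chunks = re.split(r"\n\s*\n", text)
    return [c.strip() for c in chunks if c.strip()]

contract = """Section 1. Salary.
The Executive shall receive an annual base salary of $200,000.

Section 2. Term.
This Agreement shall remain in force for two (2) years."""

paragraphs = split_paragraphs(contract)
print(len(paragraphs))  # 2
```

Each element of `paragraphs` can then be loaded into the `clause_text` column used in the pipeline below.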
## Predicted Entities `other`, `salary` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_salary_clause_en_1.0.0_3.2_1660123961758.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_salary_clause_en_1.0.0_3.2_1660123961758.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_salary_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +--------+ |  result| +--------+ |[salary]| | [other]| | [other]| |[salary]| +--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_salary_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 1.00 1.00 79 salary 1.00 1.00 1.00 33 accuracy - - 1.00 112 macro-avg 1.00 1.00 1.00 112 weighted-avg 1.00 1.00 1.00 112 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Lingala author: John Snow Labs name: opus_mt_en_ln date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, ln, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `en` - target languages: `ln` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ln_xx_2.7.0_2.4_1609170351544.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ln_xx_2.7.0_2.4_1609170351544.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_ln", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_ln", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.ln').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_ln| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English asr_wav2vec2_base_100h_with_lm_turkish TFWav2Vec2ForCTC from gorkemgoknar author: John Snow Labs name: asr_wav2vec2_base_100h_with_lm_turkish date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_with_lm_turkish` is an English model originally trained by gorkemgoknar. NOTE: This model only works on a CPU; if you need to use it on a GPU, please use asr_wav2vec2_base_100h_with_lm_turkish_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_100h_with_lm_turkish_en_4.2.0_3.0_1664038528506.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_100h_with_lm_turkish_en_4.2.0_3.0_1664038528506.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_100h_with_lm_turkish", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_100h_with_lm_turkish", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
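The snippets above assume an `audioDf` DataFrame whose `audio_content` column already holds the waveform as an array of floats, but the card does not show how to build it. A minimal stdlib-only sketch (file name, tone, and duration are illustrative) that writes a short 16 kHz mono WAV and reads it back into the float list Spark would ingest:

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # Wav2Vec2 models expect 16 kHz mono audio

# Write a 0.1-second 440 Hz sine tone as 16-bit PCM (a stand-in for a real recording).
frames = [int(32767 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE))
          for t in range(SAMPLE_RATE // 10)]
with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(SAMPLE_RATE)
    w.writeframes(struct.pack("<%dh" % len(frames), *frames))

# Read it back and normalize the 16-bit samples to floats in [-1.0, 1.0].
with wave.open("tone.wav", "rb") as w:
    raw = w.readframes(w.getnframes())
floats = [s / 32768.0 for s in struct.unpack("<%dh" % (len(raw) // 2), raw)]

# Inside a Spark session this list would become the "audio_content" column, e.g.:
# audioDf = spark.createDataFrame([(floats,)], ["audio_content"])
print(len(floats))
```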
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_100h_with_lm_turkish| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|354.3 MB| --- layout: model title: Detect Diseases in Medical Text author: John Snow Labs name: bert_token_classifier_ner_bc5cdr_disease date: 2022-07-25 tags: [en, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Chemicals, diseases, and their relations are among the most searched topics by PubMed users worldwide as they play central roles in many areas of biomedical research and healthcare, such as drug discovery and safety surveillance. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. The model detects diseases in medical text. ## Predicted Entities `DISEASE` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc5cdr_disease_en_4.0.0_3.0_1658754395259.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc5cdr_disease_en_4.0.0_3.0_1658754395259.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_bc5cdr_disease", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, ner_model, ner_converter ]) data = spark.createDataFrame([["""Indomethacin resulted in histopathologic findings typical of interstitial cystitis, such as leaky bladder epithelium and mucosal mastocytosis. The true incidence of nonsteroidal anti-inflammatory drug-induced cystitis in humans must be clarified by prospective clinical trials. 
An open-label phase II study of low-dose thalidomide in androgen-independent prostate cancer."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_bc5cdr_disease", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") .setCaseSensitive(true) .setMaxSentenceLength(512) val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, ner_model, ner_converter)) val data = Seq("""Indomethacin resulted in histopathologic findings typical of interstitial cystitis, such as leaky bladder epithelium and mucosal mastocytosis. The true incidence of nonsteroidal anti-inflammatory drug-induced cystitis in humans must be clarified by prospective clinical trials. An open-label phase II study of low-dose thalidomide in androgen-independent prostate cancer.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.bc5cdr_disease").predict("""Indomethacin resulted in histopathologic findings typical of interstitial cystitis, such as leaky bladder epithelium and mucosal mastocytosis. The true incidence of nonsteroidal anti-inflammatory drug-induced cystitis in humans must be clarified by prospective clinical trials. An open-label phase II study of low-dose thalidomide in androgen-independent prostate cancer.""") ```
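In the transformed DataFrame, each `ner_chunk` annotation carries the chunk text in its `result` field and the label in `metadata['entity']`. A pure-Python sketch of this flattening step (the sample dictionaries mimic the shape of Spark NLP annotations and are illustrative):

```python
def chunks_to_pairs(annotations):
    """Flatten Spark NLP chunk annotations into (text, label) tuples."""
    return [(a["result"], a["metadata"]["entity"]) for a in annotations]

# Illustrative annotations shaped like Spark NLP's Annotation rows.
sample = [
    {"result": "interstitial cystitis", "metadata": {"entity": "DISEASE"}},
    {"result": "prostate cancer", "metadata": {"entity": "DISEASE"}},
]
print(chunks_to_pairs(sample))
```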
## Results ```bash +---------------------+-------+ |ner_chunk |label | +---------------------+-------+ |interstitial cystitis|DISEASE| |mastocytosis |DISEASE| |cystitis |DISEASE| |prostate cancer |DISEASE| +---------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_bc5cdr_disease| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References [https://github.com/cambridgeltl/MTL-Bioinformatics-2016](https://github.com/cambridgeltl/MTL-Bioinformatics-2016) ## Benchmarking ```bash label precision recall f1-score support B-DISEASE 0.7905 0.9146 0.8480 4424 I-DISEASE 0.6521 0.8725 0.7464 2737 micro-avg 0.7328 0.8985 0.8072 7161 macro-avg 0.7213 0.8935 0.7972 7161 weighted-avg 0.7376 0.8985 0.8092 7161 ``` --- layout: model title: Urdu DistilBERT Embeddings (from Geotrend) author: John Snow Labs name: distilbert_embeddings_distilbert_base_ur_cased date: 2022-04-12 tags: [distilbert, embeddings, ur, open_source] task: Embeddings language: ur edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-ur-cased` is an Urdu model originally trained by `Geotrend`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_ur_cased_ur_3.4.2_3.0_1649783731492.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_ur_cased_ur_3.4.2_3.0_1649783731492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_ur_cased","ur") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["مجھے سپارک این ایل پی سے محبت ہے"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_ur_cased","ur") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("مجھے سپارک این ایل پی سے محبت ہے").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ur.embed.distilbert_base_cased").predict("""مجھے سپارک این ایل پی سے محبت ہے""") ```
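Downstream tasks typically compare the resulting token vectors with cosine similarity. A minimal pure-Python sketch (the toy 4-dimensional vectors stand in for the model's real, much longer embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Identical directions score near 1.0; orthogonal vectors score 0.0.
print(cosine([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0]))
print(cosine([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]))
```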
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_base_ur_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ur| |Size:|186.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/distilbert-base-ur-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: English BertForQuestionAnswering model (from ixa-ehu) author: John Snow Labs name: bert_qa_SciBERT_SQuAD_QuAC date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `SciBERT-SQuAD-QuAC` is an English model originally trained by `ixa-ehu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_SciBERT_SQuAD_QuAC_en_4.0.0_3.0_1654179044906.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_SciBERT_SQuAD_QuAC_en_4.0.0_3.0_1654179044906.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_SciBERT_SQuAD_QuAC","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_SciBERT_SQuAD_QuAC","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.scibert.by_ixa-ehu").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_SciBERT_SQuAD_QuAC| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|410.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ixa-ehu/SciBERT-SQuAD-QuAC - https://www.aclweb.org/anthology/P18-2124/ - https://arxiv.org/abs/1808.07036 --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886` is a German model originally trained by jonatasgrosman. NOTE: This model only works on a CPU; if you need to use it on a GPU, please use asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886_de_4.2.0_3.0_1664115914159.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886_de_4.2.0_3.0_1664115914159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: Named Entity Recognition in Romanian Official Documents (Medium) author: John Snow Labs name: legner_romanian_official_md date: 2022-11-10 tags: [ro, ner, legal, licensed] task: Named Entity Recognition language: ro edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is the medium version of the NER model that extracts PER (Person), LOC (Location), ORG (Organization), DATE and LEGAL entities from Romanian official documents. Unlike the small version, it labels all entities related to the legal domain as LEGAL. ## Predicted Entities `PER`, `LOC`, `ORG`, `DATE`, `LEGAL` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/LEGNER_ROMANIAN_OFFICIAL/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_romanian_official_md_ro_1.0.0_3.0_1668083301892.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_romanian_official_md_ro_1.0.0_3.0_1668083301892.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_base_cased", "ro")\ .setInputCols("sentence", "token")\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_romanian_official_md", "ro", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame([["""Anexa nr. 1 la Ordinul ministrului sănătății nr. 1.468 / 2018 pentru aprobarea prețurilor maximale ale medicamentelor de uz uman, valabile în România, care pot fi utilizate / comercializate de către deținătorii de autorizație de punere pe piață a medicamentelor sau reprezentanții acestora, distribuitorii angro și furnizorii de servicii medicale și medicamente pentru acele medicamente care fac obiectul unei relații contractuale cu Ministerul Sănătății, casele de asigurări de sănătate și / sau direcțiile de sănătate publică județene și a municipiului București, cuprinse în Catalogul național al prețurilor medicamentelor autorizate de punere pe piață în România, a prețurilor de referință generice și a prețurilor de referință inovative, publicat în Monitorul Oficial al României, Partea I nr. 
989 și 989 bis din 22 noiembrie 2018, cu modificările și completările ulterioare, se modifică și se completează conform anexei care face parte integrantă din prezentul ordin."""]]).toDF("text") result = model.transform(data) ```
## Results ```bash +----------------------------------------------+-----+ |chunk |label| +----------------------------------------------+-----+ |Ordinul ministrului sănătății nr. 1.468 / 2018|LEGAL| |România |LOC | |Ministerul Sănătății |ORG | |București |LOC | |România |LOC | |Monitorul Oficial al României |ORG | |22 noiembrie 2018 |DATE | +----------------------------------------------+-----+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_romanian_official_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ro| |Size:|16.5 MB| ## References Dataset is available [here](https://zenodo.org/record/7025333#.Y2zsquxBx83). ## Benchmarking ```bash label precision recall f1-score support DATE 0.84 0.92 0.88 218 LEGAL 0.89 0.96 0.92 337 LOC 0.82 0.77 0.79 158 ORG 0.87 0.88 0.88 463 PER 0.97 0.97 0.97 87 micro-avg 0.87 0.90 0.89 1263 macro-avg 0.88 0.90 0.89 1263 weighted-avg 0.87 0.90 0.89 1263 ``` --- layout: model title: German Electra Embeddings (from deepset) author: John Snow Labs name: electra_embeddings_gelectra_large_generator date: 2022-05-17 tags: [de, open_source, electra, embeddings] task: Embeddings language: de edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `gelectra-large-generator` is a German model originally trained by `deepset`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_gelectra_large_generator_de_3.4.4_3.0_1652786854236.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_gelectra_large_generator_de_3.4.4_3.0_1652786854236.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_gelectra_large_generator","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_gelectra_large_generator","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_gelectra_large_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|de| |Size:|194.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/deepset/gelectra-large-generator - https://arxiv.org/pdf/2010.10906.pdf - https://deepset.ai/german-bert - https://deepset.ai/germanquad - https://github.com/deepset-ai/FARM - https://github.com/deepset-ai/haystack/ - https://twitter.com/deepset_ai - https://www.linkedin.com/company/deepset-ai/ - https://haystack.deepset.ai/community/join - https://github.com/deepset-ai/haystack/discussions - https://deepset.ai - http://www.deepset.ai/jobs --- layout: model title: Legal Deterioration Of The Environment Document Classifier (EURLEX) author: John Snow Labs name: legclf_deterioration_of_the_environment_bert date: 2023-03-06 tags: [en, legal, classification, clauses, deterioration_of_the_environment, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_deterioration_of_the_environment_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether or not it belongs to the class Deterioration_of_The_Environment (binary classification) according to EuroVoc labels. 
## Predicted Entities `Deterioration_of_The_Environment`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_deterioration_of_the_environment_bert_en_1.0.0_3.0_1678111773583.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_deterioration_of_the_environment_bert_en_1.0.0_3.0_1678111773583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_deterioration_of_the_environment_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
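The Benchmarking section of this card reports per-class precision, recall, and F1. A pure-Python sketch of how such scores are computed from parallel gold/predicted label lists (the toy labels are illustrative, not actual evaluation data):

```python
def prf(gold, pred, label):
    """Precision, recall, and F1 for one class from parallel label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
    fp = sum(1 for g, p in zip(gold, pred) if p == label and g != label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy gold/predicted labels for the two classes of this binary classifier.
gold = ["Other", "Deterioration_of_The_Environment", "Other", "Other"]
pred = ["Other", "Deterioration_of_The_Environment",
        "Deterioration_of_The_Environment", "Other"]
print(prf(gold, pred, "Deterioration_of_The_Environment"))
```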
## Results ```bash +----------------------------------+ |result                            | +----------------------------------+ |[Deterioration_of_The_Environment]| |[Other]                           | |[Other]                           | |[Deterioration_of_The_Environment]| +----------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_deterioration_of_the_environment_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Deterioration_of_The_Environment 0.92 0.91 0.91 196 Other 0.91 0.92 0.91 191 accuracy - - 0.91 387 macro-avg 0.91 0.91 0.91 387 weighted-avg 0.91 0.91 0.91 387 ``` --- layout: model title: English DistilBertForQuestionAnswering model (from pakupoko) author: John Snow Labs name: distilbert_qa_bizlin_distil_model date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bizlin-distil-model` is an English model originally trained by `pakupoko`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_bizlin_distil_model_en_4.0.0_3.0_1654723375668.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_bizlin_distil_model_en_4.0.0_3.0_1654723375668.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bizlin_distil_model","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bizlin_distil_model","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.by_pakupoko").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
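The nlu one-liner above joins the question and context with a `|||` separator inside a single string. A hypothetical helper (not part of the nlu API) showing how such a string splits back into its two parts:

```python
def split_qa(joined, sep="|||"):
    """Split a 'question|||context' string into its two parts."""
    question, _, context = joined.partition(sep)
    return question.strip(), context.strip()

print(split_qa("What is my name?|||My name is Clara and I live in Berkeley."))
```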
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_bizlin_distil_model| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|104.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/pakupoko/bizlin-distil-model --- layout: model title: English RobertaForQuestionAnswering (from deepset) author: John Snow Labs name: roberta_qa_tinyroberta_squad2 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tinyroberta-squad2` is an English model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_tinyroberta_squad2_en_4.0.0_3.0_1655740021196.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_tinyroberta_squad2_en_4.0.0_3.0_1655740021196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_tinyroberta_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_tinyroberta_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.tiny.by_deepset").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_tinyroberta_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|307.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/deepset/tinyroberta-squad2 - https://www.linkedin.com/company/deepset-ai/ - https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/ - https://arxiv.org/pdf/1909.10351.pdf - https://haystack.deepset.ai/community/join - https://github.com/deepset-ai/haystack - http://deepset.ai/ - https://haystack.deepset.ai/ - http://www.deepset.ai/jobs - https://twitter.com/deepset_ai - https://github.com/deepset-ai/haystack/discussions - https://github.com/deepset-ai/haystack/ - https://deepset.ai - https://deepset.ai/germanquad - https://haystack.deepset.ai - https://deepset.ai/german-bert --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_6 date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-1024-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_6_en_4.0.0_3.0_1654537973401.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_6_en_4.0.0_3.0_1654537973401.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_6","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_1024d_seed_6").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|390.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-1024-finetuned-squad-seed-6 --- layout: model title: English DistilBertForQuestionAnswering model (from adamlin) author: John Snow Labs name: distilbert_qa_base_cased_sgd_qa_step5000 date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-sgd_qa-step5000` is an English model originally trained by `adamlin`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_sgd_qa_step5000_en_4.0.0_3.0_1654723671592.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_sgd_qa_step5000_en_4.0.0_3.0_1654723671592.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_sgd_qa_step5000","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_sgd_qa_step5000","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.base_cased.by_adamlin").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
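The NLU one-liners in these cards pass the question and its context as a single string joined by the `|||` separator. A minimal helper (the function name is hypothetical, not part of the NLU API) that builds such an input:

```python
def to_nlu_qa_input(question: str, context: str) -> str:
    """Join a question and its context with the '|||' separator
    expected by the nlu.load(...).predict(...) QA one-liners."""
    return f"{question}|||{context}"

# Builds: "What is my name?|||My name is Clara and I live in Berkeley."
text = to_nlu_qa_input("What is my name?",
                       "My name is Clara and I live in Berkeley.")
```

The resulting string can then be passed directly to `predict()` as shown in the NLU snippet above.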
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_cased_sgd_qa_step5000| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/adamlin/distilbert-base-cased-sgd_qa-step5000 --- layout: model title: Sentiment Analysis for Urdu (IMDB Review dataset) author: John Snow Labs name: sentimentdl_urduvec_imdb date: 2021-01-09 task: Sentiment Analysis language: ur edition: Spark NLP 2.7.1 spark_version: 2.4 tags: [ur, open_source, sentiment] supported: true annotator: SentimentDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Analyse sentiment in reviews by classifying them as `positive` or `negative`. This model is trained using `urduvec_140M_300d` word embeddings. The word embeddings are converted to sentence embeddings before being fed to the sentiment classifier, which uses a DL architecture to classify sentences. ## Predicted Entities `positive`, `negative` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentimentdl_urduvec_imdb_ur_2.7.1_2.4_1610185467237.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentimentdl_urduvec_imdb_ur_2.7.1_2.4_1610185467237.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, SentenceEmbeddings.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel()\ .pretrained('urduvec_140M_300d', 'ur')\ .setInputCols(["sentence", 'token'])\ .setOutputCol("word_embeddings") sentence_embeddings = SentenceEmbeddings() \ .setInputCols(["sentence", "word_embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") classifier = SentimentDLModel.pretrained('sentimentdl_urduvec_imdb', 'ur')\ .setInputCols(['sentence_embeddings'])\ .setOutputCol('sentiment') nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, sentence_embeddings, classifier]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate(["مجھے واقعی یہ شو سند ہے۔ یہی وجہ ہے کہ مجھے حال ہی میں یہ جان کر مایوسی ہوئی ہے کہ جارج لوپیز ایک ", "بالکل بھی اچھ ،ی کام نہیں کیا گیا ، پوری فلم صرف گرڈج تھی اور کہیں بھی بے ترتیب لوگوں کو ہلاک نہیں"]) ``` {:.nlu-block} ```python import nlu text = ["مجھے واقعی یہ شو سند ہے۔ یہی وجہ ہے کہ مجھے حال ہی میں یہ جان کر مایوسی ہوئی ہے کہ جارج لوپیز ایک ", "بالکل بھی اچھ ،ی کام نہیں کیا گیا ، پوری فلم صرف گرڈج تھی اور کہیں بھی بے ترتیب لوگوں کو ہلاک نہیں"] urdusent_df = nlu.load('ur.sentiment').predict(text, output_level='sentence') urdusent_df ```
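The conversion from word embeddings to sentence embeddings with `setPoolingStrategy("AVERAGE")` is an element-wise mean over the token vectors. A minimal NumPy sketch of that pooling step (random vectors stand in for `urduvec_140M_300d` output, which is 300-dimensional):

```python
import numpy as np

# Stand-in for WordEmbeddingsModel output: 4 tokens x 300 dimensions.
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(4, 300))

# AVERAGE pooling: the mean over the token axis yields a single
# 300-dimensional sentence vector, which the classifier consumes.
sentence_vector = word_vectors.mean(axis=0)
assert sentence_vector.shape == (300,)
```

This is only an illustration of the pooling arithmetic; inside the pipeline the SentenceEmbeddings annotator performs it per sentence.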
## Results ```bash | | document | sentiment | |---:|---------------------------------------------------------------------------------------------------------:|--------------:| | 0 |مجھے واقعی یہ شو سند ہے۔ یہی وجہ ہے کہ مجھے حال ہی میں یہ جان کر مایوسی ہوئی ہے کہ جارج لوپیز ایک | positive | | 1 |بالکل بھی اچھ ،ی کام نہیں کیا گیا ، پوری فلم صرف گرڈج تھی اور کہیں بھی بے ترتیب لوگوں کو ہلاک نہیں | negative | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentimentdl_urduvec_imdb| |Compatibility:|Spark NLP 2.7.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[sentiment]| |Language:|ur| |Dependencies:|urduvec_140M_300d| ## Data Source This model is trained using data from https://www.kaggle.com/akkefa/imdb-dataset-of-50k-movie-translated-urdu-reviews ## Benchmarking ```bash loss: 2428.622 - acc: 0.8181 - val_acc: 80.0 ``` --- layout: model title: Arabic BertForMaskedLM Base Cased model (from CAMeL-Lab) author: John Snow Labs name: bert_embeddings_base_arabic_camel_mix date: 2022-12-02 tags: [ar, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ar edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-arabic-camelbert-mix` is an Arabic model originally trained by `CAMeL-Lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabic_camel_mix_ar_4.2.4_3.0_1670015990923.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabic_camel_mix_ar_4.2.4_3.0_1670015990923.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabic_camel_mix","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabic_camel_mix","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_arabic_camel_mix| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|409.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-mix - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://catalog.ldc.upenn.edu/LDC2011T11 - http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus - https://vlo.clarin.eu/search;jsessionid=31066390B2C9E8C6304845BA79869AC1?1&q=osian - https://archive.org/details/arwiki-20190201 - https://oscar-corpus.com/ - https://zenodo.org/record/3891466#.YEX4-F0zbzc - https://github.com/google-research/bert - https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/tokenization.py#L286-L297 - https://github.com/CAMeL-Lab/camel_tools --- layout: model title: Split Sentences in Healthcare Texts author: John Snow Labs name: sentence_detector_dl_healthcare class: DeepSentenceDetector language: en nav_key: models repository: clinical/models date: 2020-09-13 task: Sentence Detection edition: Healthcare NLP 2.6.0 spark_version: 2.4 tags: [clinical,sentence_detection,en] supported: true annotator: SentenceDetectorDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/SENTENCE_DETECTOR_HC/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sentence_detector_dl_healthcare_en_2.6.0_2.4_1600001082565.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sentence_detector_dl_healthcare_en_2.6.0_2.4_1600001082565.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl_healthcare","en","clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentences") sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL])) sd_model.fullAnnotate("""John loves Mary.Mary loves Peter. Peter loves Helen .Helen loves John; Total: four people involved.""") ``` ```scala val documenter = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val pipeline = new Pipeline().setStages(Array(documenter, model)) val data = Seq("John loves Mary.Mary loves Peter. Peter loves Helen .Helen loves John; Total: four people involved.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.detect_sentence.clinical").predict("""John loves Mary.Mary loves Peter. Peter loves Helen .Helen loves John; Total: four people involved.""") ```
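The example text is deliberately hard for rule-based splitters: `Mary.Mary` has no space after the period, and `Helen .` has a stray space before it. A quick illustration of how far a naive regex splitter gets on the same input (illustrative only; the pretrained model is what handles these cases):

```python
import re

text = ("John loves Mary.Mary loves Peter. Peter loves Helen ."
        "Helen loves John; Total: four people involved.")

# Naive rule: split wherever sentence-final punctuation is followed
# by whitespace.
naive = re.split(r"(?<=[.?!])\s+", text)

# Only 2 chunks come out, versus the 5 sentences the model recovers:
# "Mary.Mary" and ".Helen" are missed because no whitespace follows
# the period.
assert len(naive) == 2
```

This is exactly the gap the neural boundary detector closes in the Results section below.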
{:.h2_title} ## Results ```bash +---+------------------------------+ | 0 | John loves Mary. | +---+------------------------------+ | 1 | Mary loves Peter | +---+------------------------------+ | 2 | Peter loves Helen . | +---+------------------------------+ | 3 | Helen loves John; | +---+------------------------------+ | 4 | Total: four people involved. | +---+------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---------------|-------------------------------------------| | Name: | sentence_detector_dl_healthcare | | Type: | DeepSentenceDetector | | Compatibility: | Spark NLP 2.6.0+ | | License: | Licensed | | Edition: | Official | |Input labels: | [document] | |Output labels: | sentence | | Language: | en | {:.h2_title} ## Data Source Healthcare SDDL model is trained on domain (healthcare) specific text, annotated internally, to generalize further on clinical notes. {:.h2_title} ## Benchmarking ```bash label Accuracy Recall Prec F1 0 0.98 1.00 0.96 0.98 ``` --- layout: model title: Finnish Lemmatizer author: John Snow Labs name: lemma date: 2020-05-05 12:35:00 +0800 task: Lemmatization language: fi edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [lemmatizer, fi] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. 
{:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_fi_2.5.0_2.4_1588671290521.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_fi_2.5.0_2.4_1588671290521.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "fi") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Sen lisäksi, että hän on pohjoisen kuningas, John Snow on englantilainen lääkäri ja johtava anestesian ja lääketieteellisen hygienian kehittämisessä.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "fi") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Sen lisäksi, että hän on pohjoisen kuningas, John Snow on englantilainen lääkäri ja johtava anestesian ja lääketieteellisen hygienian kehittämisessä.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Sen lisäksi, että hän on pohjoisen kuningas, John Snow on englantilainen lääkäri ja johtava anestesian ja lääketieteellisen hygienian kehittämisessä."""] lemma_df = nlu.load('fi.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=2, result='se', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=4, end=10, result='lisäksi', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=11, end=11, result=',', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=13, end=16, result='että', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=18, end=20, result='hän', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|fi| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Google T5 (Text-To-Text Transfer Transformer) Small author: John Snow Labs name: t5_small date: 2021-01-08 task: [Question Answering, Summarization, Translation] language: en nav_key: models edition: Spark NLP 2.7.1 spark_version: 2.4 tags: [open_source, t5, summarization, translation, en, seq2seq] supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The T5 transformer model described in the seminal paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". This model can perform a variety of tasks, such as text summarization, question answering, and translation. More details about using the model can be found in the paper (https://arxiv.org/pdf/1910.10683.pdf). 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/T5TRANSFORMER/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/T5TRANSFORMER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_en_2.7.1_2.4_1610133219885.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_en_2.7.1_2.4_1610133219885.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Either set the following tasks or have them inline with your input: - summarize: - translate English to German: - translate English to French: - stsb sentence1: Big news. sentence2: No idea. The full list of tasks is in the Appendix of the paper: https://arxiv.org/pdf/1910.10683.pdf
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") t5 = T5Transformer() \ .pretrained("t5_small") \ .setTask("summarize:")\ .setMaxOutputLength(200)\ .setInputCols(["documents"]) \ .setOutputCol("summaries") pipeline = Pipeline().setStages([document_assembler, t5]) results = pipeline.fit(data_df).transform(data_df) results.select("summaries.result").show(truncate=False) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("documents") val t5 = T5Transformer .pretrained("t5_small") .setTask("summarize:") .setInputCols(Array("documents")) .setOutputCol("summaries") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val model = pipeline.fit(dataDf) val results = model.transform(dataDf) results.select("summaries.result").show(truncate = false) ``` {:.nlu-block} ```python import nlu nlu.load("en.t5.small").predict("""Put your text here.""") ```
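Task prefixes are plain strings prepended to the input text, so switching tasks requires no change to the model itself. A small helper (hypothetical, not part of Spark NLP) that builds the prefixed inputs listed above:

```python
def t5_input(task: str, text: str) -> str:
    """Prepend a T5 task prefix, e.g. 'summarize:' or
    'translate English to German:', to the input text."""
    return f"{task} {text}"

def t5_stsb_input(sentence1: str, sentence2: str) -> str:
    """Build the input for the STS-B sentence-similarity task."""
    return f"stsb sentence1: {sentence1} sentence2: {sentence2}"

# → "translate English to German: Hello world"
translation_input = t5_input("translate English to German:", "Hello world")
# → "stsb sentence1: Big news. sentence2: No idea."
stsb_input = t5_stsb_input("Big news.", "No idea.")
```

Equivalently, `setTask("summarize:")` in the pipeline above applies the same prefixing to every input row.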
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small| |Compatibility:|Spark NLP 2.7.1+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[t5]| |Language:|en| ## Data Source https://huggingface.co/t5-small --- layout: model title: Translate English to North Germanic languages Pipeline author: John Snow Labs name: translate_en_gmq date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, gmq, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `gmq` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_gmq_xx_2.7.0_2.4_1609698966439.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_gmq_xx_2.7.0_2.4_1609698966439.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_gmq", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_gmq", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.gmq').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_gmq| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from gdario) author: John Snow Labs name: bert_qa_biobert_bioasq date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobert_bioasq` is an English model originally trained by `gdario`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_bioasq_en_4.0.0_3.0_1654185669067.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_bioasq_en_4.0.0_3.0_1654185669067.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biobert_bioasq","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_biobert_bioasq","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.biobert").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_biobert_bioasq| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|403.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/gdario/biobert_bioasq --- layout: model title: English asr_wav2vec2_xls_r_300m_kh TFWav2Vec2ForCTC from kongkeaouch author: John Snow Labs name: pipeline_asr_wav2vec2_xls_r_300m_kh date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_kh` is an English model originally trained by kongkeaouch. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xls_r_300m_kh_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_kh_en_4.2.0_3.0_1664025155355.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_kh_en_4.2.0_3.0_1664025155355.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_300m_kh', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_300m_kh", lang = "en") val annotations = pipeline.transform(audioDF) ```
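Both snippets above assume an existing `audioDF` whose `audio_content` column holds the raw waveform as an array of floats. As a rough sketch (not part of this card: the file name is hypothetical, and wav2vec2 models generally expect 16 kHz mono input), such floats can be produced from a 16-bit PCM WAV file with the Python standard library:

```python
import struct
import wave

def wav_to_floats(path):
    """Read a mono 16-bit PCM WAV file and return samples as floats in [-1, 1]."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        raw = wf.readframes(frames)
        # "<h" = little-endian signed 16-bit; one value per mono frame
        samples = struct.unpack("<" + "h" * frames, raw)
    return [s / 32768.0 for s in samples]

# floats = wav_to_floats("speech_16khz_mono.wav")  # hypothetical file
# audioDF = spark.createDataFrame([(floats,)], ["audio_content"])
```

The commented lines show how the float list would then be wrapped into the single-column DataFrame that the pipeline's `AudioAssembler` stage reads.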
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xls_r_300m_kh| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal Pledge Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_pledge_agreement_bert date: 2022-12-06 tags: [en, legal, classification, agreement, pledge, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_pledge_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `pledge-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `pledge-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_pledge_agreement_bert_en_1.0.0_3.0_1670349668991.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_pledge_agreement_bert_en_1.0.0_3.0_1670349668991.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_pledge_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[pledge-agreement]| |[other]| |[other]| |[pledge-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_pledge_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.96 0.98 0.97 65 pledge-agreement 0.96 0.88 0.92 26 accuracy - - 0.96 91 macro-avg 0.96 0.93 0.94 91 weighted-avg 0.96 0.96 0.96 91 ``` --- layout: model title: Chinese Bert Embeddings (Roberta, Whole Word Masking) author: John Snow Labs name: bert_embeddings_chinese_roberta_wwm_ext date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `chinese-roberta-wwm-ext` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_wwm_ext_zh_3.4.2_3.0_1649668840010.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_wwm_ext_zh_3.4.2_3.0_1649668840010.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_wwm_ext","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_wwm_ext","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.chinese_roberta_wwm_ext").predict("""I love Spark NLP""") ```
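Once the pipeline has run, the `embeddings` column holds one vector per token. A common downstream step is comparing two such vectors by cosine similarity; the sketch below uses short toy vectors as stand-ins for the real 768-dimensional BERT output (the numbers are illustrative only):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# toy 4-dimensional stand-ins for two token embeddings
v1 = [0.2, 0.1, -0.4, 0.3]
v2 = [0.4, 0.2, -0.8, 0.6]  # parallel to v1, so similarity is ~1.0
similarity = cosine(v1, v2)
```

In practice the vectors would come from the `embeddings` annotations produced by the pipeline above rather than being written by hand.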
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_wwm_ext| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|383.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/chinese-roberta-wwm-ext - https://arxiv.org/abs/1906.08101 - https://github.com/google-research/bert - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://arxiv.org/abs/2004.13922 --- layout: model title: English asr_model_2 TFWav2Vec2ForCTC from niclas author: John Snow Labs name: pipeline_asr_model_2 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_model_2` is an English model originally trained by niclas. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_model_2_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_model_2_en_4.2.0_3.0_1664097773728.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_model_2_en_4.2.0_3.0_1664097773728.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_model_2', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_model_2", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_model_2| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English T5ForConditionalGeneration Base Cased model (from google) author: John Snow Labs name: t5_efficient_base_el16 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-el16` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_el16_en_4.3.0_3.0_1675110985082.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_el16_en_4.3.0_3.0_1675110985082.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_base_el16","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_base_el16","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_base_el16| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|529.5 MB| ## References - https://huggingface.co/google/t5-efficient-base-el16 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Legal Financing And Investment Document Classifier (EURLEX) author: John Snow Labs name: legclf_financing_and_investment_bert date: 2023-03-06 tags: [en, legal, classification, clauses, financing_and_investment, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_financing_and_investment_bert model, a Bert Sentence Embeddings Document Classifier, classifies whether the document belongs to the class Financing_and_Investment or not (Binary Classification) according to EuroVoc labels.
## Predicted Entities `Financing_and_Investment`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_financing_and_investment_bert_en_1.0.0_3.0_1678111700202.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_financing_and_investment_bert_en_1.0.0_3.0_1678111700202.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_financing_and_investment_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
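The macro and weighted averages in the Benchmarking section below follow directly from the per-class precision/recall and support counts, which makes a quick sanity check easy (scores copied from the table; the helper names are just for illustration):

```python
# per-class (precision, recall, support) from the benchmarking table
scores = {
    "Financing_and_Investment": (0.84, 0.91, 45),
    "Other": (0.91, 0.84, 50),
}

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

total = sum(s for _, _, s in scores.values())
# macro average: unweighted mean over classes
macro_f1 = sum(f1(p, r) for p, r, _ in scores.values()) / len(scores)
# weighted average: mean over classes weighted by support
weighted_p = sum(p * s for p, _, s in scores.values()) / total
print(round(macro_f1, 2), round(weighted_p, 2))  # 0.87 0.88, as in the table
```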
## Results ```bash +-------+ |result| +-------+ |[Financing_and_Investment]| |[Other]| |[Other]| |[Financing_and_Investment]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_financing_and_investment_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.7 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Financing_and_Investment 0.84 0.91 0.87 45 Other 0.91 0.84 0.87 50 accuracy - - 0.87 95 macro-avg 0.87 0.88 0.87 95 weighted-avg 0.88 0.87 0.87 95 ``` --- layout: model title: Javanese DistilBERT Embeddings (Small, Imdb) author: John Snow Labs name: distilbert_embeddings_javanese_distilbert_small_imdb date: 2022-04-12 tags: [distilbert, embeddings, jv, open_source] task: Embeddings language: jv edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `javanese-distilbert-small-imdb` is a Javanese model originally trained by `w11wo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_javanese_distilbert_small_imdb_jv_3.4.2_3.0_1649783783892.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_javanese_distilbert_small_imdb_jv_3.4.2_3.0_1649783783892.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_javanese_distilbert_small_imdb","jv") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_javanese_distilbert_small_imdb","jv") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("jv.embed.javanese_distilbert_small_imdb").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_javanese_distilbert_small_imdb| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|jv| |Size:|248.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/w11wo/javanese-distilbert-small-imdb - https://arxiv.org/abs/1910.01108 - https://github.com/sgugger - https://w11wo.github.io/ --- layout: model title: Translate Shona to English Pipeline author: John Snow Labs name: translate_sn_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, sn, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `sn` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_sn_en_xx_2.7.0_2.4_1609688774813.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_sn_en_xx_2.7.0_2.4_1609688774813.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_sn_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_sn_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.sn.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_sn_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Sentence Entity Resolver for RxNorm According to National Institute of Health (NIH) Database (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_rxnorm_nih date: 2023-02-22 tags: [entity_resolution, rxnorm, clinical, en, licensed] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 4.3.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes according to the National Institute of Health (NIH) database using `sbiobert_base_cased_mli` Sentence Bert Embeddings. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_nih_en_4.3.0_3.0_1677106956679.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_nih_en_4.3.0_3.0_1677106956679.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(['DRUG'])\ .setPreservePosition(False) chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_nih","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter, chunk2doc, sbert_embedder, rxnorm_resolver]) data = spark.createDataFrame([["""She is given folic acid 1 mg daily , levothyroxine 0.1 mg and aspirin 81 mg daily ."""]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") 
.setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("sbert_embeddings") val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_nih","en", "clinical/models") .setInputCols("sbert_embeddings") .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter, chunk2doc, sbert_embedder, rxnorm_resolver)) val data = Seq("""She is given folic acid 1 mg daily , levothyroxine 0.1 mg and aspirin 81 mg daily and metformin 100 mg, coumadin 5 mg.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ```
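The resolver's `setDistanceFunction("EUCLIDEAN")` means candidate RxNorm codes are ranked by the Euclidean distance between the chunk's sentence embedding and each candidate's embedding, closest first. A toy plain-Python illustration (the two-dimensional vectors and code names are made up for the example):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [0.0, 0.0]  # stand-in for the chunk's sentence embedding
candidates = {"code_a": [3.0, 4.0], "code_b": [1.0, 1.0]}
# rank candidate codes by distance to the query embedding, closest first
ranked = sorted(candidates, key=lambda c: euclidean(query, candidates[c]))
print(ranked[0])  # code_b is closer (distance ~1.41 vs 5.0)
```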
## Results ```bash | | sent_id | ner_chunk | entity | rxnorm_code | all_codes | resolutions | |---:|----------:|:---------------------|:---------|--------------:|:------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------| | 0 | 0 | folic acid 1 mg | DRUG | 12281181 | ['12281181', '12283696', '12270292', '12306595', 1227889...| ['folic acid 1 MG [folic acid 1 MG]', 'folic acid 1.1 MG [folic acid 1.1 MG]', 'folic acid 1 MG/ML [folic acid 1 MG/ML]', 'folic a...| | 1 | 0 | levothyroxine 0.1 mg | DRUG | 12275630 | ['12275630', '12275646', '12301585', '12306484', 1235044...| ['levothyroxine sodium 0.1 MG [levothyroxine sodium 0.1 MG]', 'levothyroxine sodium 0.01 MG [levothyroxine sodium 0.01 MG]', 'levo...| | 2 | 0 | aspirin 81 mg | DRUG | 12278696 | ['12278696', '12299811', '12298729', '12311168', '1230631...| ['aspirin 81 MG [aspirin 81 MG]', 'aspirin 81 MG [YSP Aspirin] [aspirin 81 MG [YSP Aspirin]]', 'aspirin 81 MG [Med Aspirin] [aspir...| | 3 | 0 | metformin 100 mg | DRUG | 12282749 | ['12282749', '3735316', '12279966', '1509573', '3736179'... | ['metformin hydrochloride 100 MG/ML [metformin hydrochloride 100 MG/ML]', 'metFORMIN hydrochloride 100 MG/ML [metFORMIN hydrochlor...| | 4 | 0 | coumadin 5 mg | DRUG | 1768579 | ['1768579', '12534260', '1780903', '1768951', '1510873' ... | ['coumarin 5 MG [coumarin 5 MG]', 'vericiguat 5 MG [vericiguat 5 MG]', 'pridinol 5 MG [pridinol 5 MG]', 'propinox 5 MG [propinox 5...| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_rxnorm_nih| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[rxnorm_code]| |Language:|en| |Size:|818.8 MB| |Case sensitive:|false| ## References Trained on February 2023 with `sbiobert_base_cased_mli` embeddings. 
https://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html --- layout: model title: Translate Gun to English Pipeline author: John Snow Labs name: translate_guw_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, guw, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `guw` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_guw_en_xx_2.7.0_2.4_1609688796603.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_guw_en_xx_2.7.0_2.4_1609688796603.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_guw_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_guw_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.guw.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_guw_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Italian Named Entity Recognition (from gunghio) author: John Snow Labs name: xlmroberta_ner_xlm_roberta_base_finetuned_panx_ner date: 2022-05-17 tags: [xlm_roberta, ner, token_classification, it, open_source] task: Named Entity Recognition language: it edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-ner` is an Italian model originally trained by `gunghio`. ## Predicted Entities `LOC`, `ORG`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_panx_ner_it_3.4.2_3.0_1652808069992.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_panx_ner_it_3.4.2_3.0_1652808069992.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_panx_ner","it") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Adoro Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_panx_ner","it") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Adoro Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_xlm_roberta_base_finetuned_panx_ner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|it| |Size:|878.3 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/gunghio/xlm-roberta-base-finetuned-panx-ner --- layout: model title: Pipeline to Detect Normalized Genes and Human Phenotypes author: John Snow Labs name: ner_human_phenotype_gene_clinical_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, gene, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_human_phenotype_gene_clinical](https://nlp.johnsnowlabs.com/2021/03/31/ner_human_phenotype_gene_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_clinical_pipeline_en_3.4.1_3.0_1647867667569.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_clinical_pipeline_en_3.4.1_3.0_1647867667569.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_human_phenotype_gene_clinical_pipeline", "en", "clinical/models") pipeline.annotate("Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).") ``` ```scala val pipeline = new PretrainedPipeline("ner_human_phenotype_gene_clinical_pipeline", "en", "clinical/models") pipeline.annotate("Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.human_phnotype_gene_clinical.pipeline").predict("""Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).""") ```
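In the Results below, `begin` and `end` are character offsets into the annotated text, inclusive at both ends, so a chunk is recovered as `text[begin:end + 1]`. Using the offsets reported for the polyhydramnios row:

```python
text = ("Here we presented a case (BS type) of a 17 years old female presented "
        "with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which "
        "was alleviated after treatment with celecoxib and vitamin D(3).")

begin, end = 75, 88  # offsets reported for the polyhydramnios chunk
chunk = text[begin:end + 1]  # end offset is inclusive, hence the +1
print(chunk)  # polyhydramnios
```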
## Results ```bash +----+------------------+---------+-------+----------+ | | chunk | begin | end | entity | +====+==================+=========+=======+==========+ | 0 | BS type | 29 | 32 | GENE | +----+------------------+---------+-------+----------+ | 1 | polyhydramnios | 75 | 88 | HP | +----+------------------+---------+-------+----------+ | 2 | polyuria | 91 | 98 | HP | +----+------------------+---------+-------+----------+ | 3 | nephrocalcinosis | 101 | 116 | HP | +----+------------------+---------+-------+----------+ | 4 | hypokalemia | 122 | 132 | HP | +----+------------------+---------+-------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_human_phenotype_gene_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_roberta_hier_quadruplet_epochs_1_shard_1_squad2.0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_hier_quadruplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_hier_quadruplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739428334.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_hier_quadruplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739428334.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_roberta_hier_quadruplet_epochs_1_shard_1_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_rule_based_roberta_hier_quadruplet_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base_rule_based_hier_quadruplet_epochs_1_shard_1.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_roberta_hier_quadruplet_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|460.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_hier_quadruplet_epochs_1_shard_1_squad2.0 --- layout: model title: Chinese BertForMaskedLM Cased model (from qinluo) author: John Snow Labs name: bert_embeddings_wo_chinese_plus date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `wobert-chinese-plus` is a Chinese model originally trained by `qinluo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_wo_chinese_plus_zh_4.2.4_3.0_1670327320217.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_wo_chinese_plus_zh_4.2.4_3.0_1670327320217.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_wo_chinese_plus","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_wo_chinese_plus","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_wo_chinese_plus| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|467.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/qinluo/wobert-chinese-plus - https://github.com/ZhuiyiTechnology/WoBERT - https://github.com/JunnYu/WoBERT_pytorch --- layout: model title: Chunk Entity Resolver RxNorm-scdc author: John Snow Labs name: chunkresolve_rxnorm_scdc_healthcare date: 2021-04-16 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to RxNorm codes using chunk embeddings (augmented with synonyms, four times richer than previous resolver). ## Predicted Entities RxNorm codes {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_RXNORM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_scdc_healthcare_en_3.0.0_3.0_1618605170280.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_scdc_healthcare_en_3.0.0_3.0_1618605170280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... resolver = ChunkEntityResolverModel.pretrained("chunkresolve_rxnorm_scdc_healthcare","en","clinical/models") .setInputCols("token","chunk_embeddings") .setOutputCol("entity") pipeline = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, resolver]) data = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation."]]).toDF("text") model = pipeline.fit(data) results = model.transform(data) ... ``` ```scala ... val resolver = ChunkEntityResolverModel.pretrained("chunkresolve_rxnorm_scdc_healthcare","en","clinical/models") .setInputCols("token","chunk_embeddings") .setOutputCol("entity") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, wordEmbeddings, clinicalNer, nerConverter, chunkEmbeddings, resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.").toDF("text") val result = pipeline.fit(data).transform(data) ```
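In the resolver output, each row's `target_text` packs the ranked candidate descriptions into one `:::`-separated string. A minimal splitting sketch in plain Python (the string is abbreviated from the sample results):

```python
# One ":::"-separated candidate string, abbreviated from the sample output
target_text = (
    "Dapagliflozin Tablets:::dapagliflozin 5 mg oral tablet"
    ":::dapagliflozin 10 mg oral tablet"
)

# Candidates arrive best-ranked first
candidates = target_text.split(":::")
print(candidates[0])
```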
## Results ```bash +---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ | chunk| entity| target_text| code|confidence| +---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ | metformin|TREATMENT|metFORMIN compounding powder:::Metformin Hydrochloride Powder:::metFORMIN 500 mg oral tablet:::me...| 601021| 0.2364| | glipizide|TREATMENT|Glipizide Powder:::Glipizide Crystal:::Glipizide Tablets:::glipiZIDE 5 mg oral tablet:::glipiZIDE...| 241604| 0.3647| |dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG|TREATMENT|Ezetimibe and Atorvastatin Tablets:::Amlodipine and Atorvastatin Tablets:::Atorvastatin Calcium T...|1422084| 0.3407| | dapagliflozin|TREATMENT|Dapagliflozin Tablets:::dapagliflozin 5 mg oral tablet:::dapagliflozin 10 mg oral tablet:::Dapagl...|1488568| 0.7070| +---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|chunkresolve_rxnorm_scdc_healthcare| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token, chunk_embeddings]| |Output Labels:|[rxnorm_code]| |Language:|en| --- layout: model title: Italian Electra Embeddings (from dbmdz) author: John Snow Labs name: electra_embeddings_electra_base_italian_xxl_cased_generator date: 2022-05-17 tags: [it, open_source, electra, embeddings] task: Embeddings language: it edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra 
Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-italian-xxl-cased-generator` is an Italian model originally trained by `dbmdz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_italian_xxl_cased_generator_it_3.4.4_3.0_1652786574536.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_italian_xxl_cased_generator_it_3.4.4_3.0_1652786574536.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_italian_xxl_cased_generator","it") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Adoro Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_italian_xxl_cased_generator","it") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Adoro Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electra_base_italian_xxl_cased_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|it| |Size:|128.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/dbmdz/electra-base-italian-xxl-cased-generator - http://opus.nlpl.eu/ - https://traces1.inria.fr/oscar/ - https://github.com/dbmdz/berts/issues/7 - https://github.com/stefan-it/turkish-bert/tree/master/electra - https://github.com/stefan-it/italian-bertelectra - https://github.com/dbmdz/berts/issues/new --- layout: model title: Smaller BERT Embeddings (L-12_H-512_A-8) author: John Snow Labs name: small_bert_L12_512 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L12_512_en_2.6.0_2.4_1598344865471.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L12_512_en_2.6.0_2.4_1598344865471.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L12_512", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L12_512", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L12_512').predict(text, output_level='token') embeddings_df ```
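Each token in the `embeddings` output column receives one 512-dimensional vector; downstream comparisons between tokens are commonly done with cosine similarity. A minimal sketch on made-up 3-dimensional stand-ins (not actual model output):

```python
import math

# Made-up stand-ins for two token vectors (real vectors are 512-dimensional,
# read from the "embeddings" output column)
v_love = [-0.33, 0.96, -0.11]
v_nlp = [0.36, 0.36, 0.89]

dot = sum(a * b for a, b in zip(v_love, v_nlp))
# Cosine similarity: dot product over the product of Euclidean norms, in [-1, 1]
cos = dot / (math.hypot(*v_love) * math.hypot(*v_nlp))
print(round(cos, 4))
```

Multi-argument `math.hypot` requires Python 3.8+.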
{:.h2_title} ## Results ```bash token en_embed_bert_small_L12_512_embeddings I [0.5089142322540283, -0.21703988313674927, -0.... love [-0.3273950517177582, 0.9550480842590332, -0.1... NLP [0.3552919626235962, 0.3629235625267029, 0.891... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|small_bert_L12_512| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|512| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-512_A-8/1 --- layout: model title: Arabic Bert Embeddings (Base, Arabert Model) author: John Snow Labs name: bert_embeddings_bert_base_arabert date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabert` is an Arabic model originally trained by `aubmindlab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabert_ar_3.4.2_3.0_1649677303708.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabert_ar_3.4.2_3.0_1649677303708.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabert","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabert","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.bert_base_arabert").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_arabert| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|507.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/aubmindlab/bert-base-arabert - https://github.com/google-research/bert - https://arxiv.org/abs/2003.00104 - https://github.com/WissamAntoun/pydata_khobar_meetup - http://alt.qcri.org/farasa/segmenter.html - https://github.com/google-research/bert/blob/master/multilingual.md - https://github.com/elnagara/HARD-Arabic-Dataset - https://www.aclweb.org/anthology/D15-1299 - https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf - https://github.com/mohamedadaly/LABR - http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp - https://github.com/husseinmozannar/SOQAL - https://github.com/aub-mind/arabert/blob/master/AraBERT/README.md - https://arxiv.org/abs/2003.00104v2 - https://archive.org/details/arwiki-20190201 - https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4 - https://www.aclweb.org/anthology/W19-4619 - https://sites.aub.edu.lb/mindlab/ - https://www.yakshof.com/#/ - https://www.behance.net/rahalhabib - https://www.linkedin.com/in/wissam-antoun-622142b4/ - https://twitter.com/wissam_antoun - https://github.com/WissamAntoun - https://www.linkedin.com/in/fadybaly/ - https://twitter.com/fadybaly - https://github.com/fadybaly --- layout: model title: Pipeline to Summarize Clinical Question Notes author: John Snow Labs name: summarizer_clinical_questions_pipeline date: 2023-05-31 tags: [licensed, en, clinical, summarization, question] task: Summarization language: en edition: Healthcare NLP 4.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover 
use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [summarizer_clinical_questions](https://nlp.johnsnowlabs.com/2023/04/03/summarizer_clinical_questions_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_questions_pipeline_en_4.4.1_3.0_1685530642775.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_questions_pipeline_en_4.4.1_3.0_1685530642775.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("summarizer_clinical_questions_pipeline", "en", "clinical/models") text = """ Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. """ result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("summarizer_clinical_questions_pipeline", "en", "clinical/models") val text = """ Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. 
I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. """ val result = pipeline.fullAnnotate(text) ```
## Results ```bash What are the treatments for hyperthyroidism? ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_clinical_questions_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|936.7 MB| ## Included Models - DocumentAssembler - MedicalSummarizer --- layout: model title: Word2Vec Embeddings in Aragonese (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, an, open_source] task: Embeddings language: an edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_an_3.4.1_3.0_1647282522053.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_an_3.4.1_3.0_1647282522053.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","an") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","an") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("an.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|an| |Size:|212.8 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Pipeline to Classify Texts into 4 News Categories author: John Snow Labs name: bert_sequence_classifier_age_news_pipeline date: 2022-02-23 tags: [ag_news, news, bert, bert_sequence, classification, en, open_source] task: Text Classification language: en nav_key: models edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [bert_sequence_classifier_age_news_en](https://nlp.johnsnowlabs.com/2021/11/07/bert_sequence_classifier_age_news_en.html) which is imported from `HuggingFace`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_age_news_pipeline_en_3.4.0_3.0_1645616467835.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_age_news_pipeline_en_3.4.0_3.0_1645616467835.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python news_pipeline = PretrainedPipeline("bert_sequence_classifier_age_news_pipeline", lang = "en") news_pipeline.annotate("Microsoft has taken its first step into the metaverse.") ``` ```scala val news_pipeline = new PretrainedPipeline("bert_sequence_classifier_age_news_pipeline", lang = "en") news_pipeline.annotate("Microsoft has taken its first step into the metaverse.") ```
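The pipeline predicts one of the four AG News categories: World, Sports, Business, or Sci/Tech. A toy sketch for validating the predicted label in plain Python (the list shape mirrors the sample result; the helper name `top_label` is illustrative, not part of the library):

```python
# The four AG News classes this classifier was trained on
AG_NEWS_LABELS = {"World", "Sports", "Business", "Sci/Tech"}

def top_label(prediction):
    """Return the predicted class, rejecting anything outside the label set."""
    label = prediction[0]
    if label not in AG_NEWS_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return label

print(top_label(["Sci/Tech"]))
```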
## Results ```bash ['Sci/Tech'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_age_news_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|42.4 MB| ## Included Models - DocumentAssembler - TokenizerModel - BertForSequenceClassification --- layout: model title: English asr_wav2vec2_ksponspeech TFWav2Vec2ForCTC from Taeham author: John Snow Labs name: pipeline_asr_wav2vec2_ksponspeech date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_ksponspeech` is an English pipeline originally trained by Taeham. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_ksponspeech_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_ksponspeech_en_4.2.0_3.0_1664102640740.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_ksponspeech_en_4.2.0_3.0_1664102640740.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_ksponspeech', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_ksponspeech", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_ksponspeech| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Italian T5ForConditionalGeneration Small Cased model (from efederici) author: John Snow Labs name: t5_it5_efficient_small_lfqa date: 2023-01-30 tags: [it, open_source, t5, tensorflow] task: Text Generation language: it edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `it5-efficient-small-lfqa` is an Italian model originally trained by `efederici`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_lfqa_it_4.3.0_3.0_1675103827826.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_lfqa_it_4.3.0_3.0_1675103827826.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_it5_efficient_small_lfqa","it") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_it5_efficient_small_lfqa","it") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_it5_efficient_small_lfqa| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|it| |Size:|594.0 MB| ## References - https://huggingface.co/efederici/it5-efficient-small-lfqa --- layout: model title: Legal NER (Parties, Dates, Document Type - sm) author: John Snow Labs name: legner_contract_doc_parties date: 2022-08-16 tags: [en, legal, ner, agreements, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description IMPORTANT: Don't run this model on the whole legal agreement. Instead: - Split the document by paragraphs. You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration; - Use the `legclf_introduction_clause` Text Classifier to select only those paragraphs. This is a Legal NER model, aimed at processing the first page of agreements, where information can be found about: - Parties of the contract/agreement; - Aliases of those parties, or how those parties will be referred to further on in the document; - Document Type; - Effective Date of the agreement. This model can be used together with its Relation Extraction counterpart, `legre_contract_doc_parties`, to retrieve the relations between these entities. Other models are available to detect other parts of the document, such as Headers/Subheaders, Signers, "Will-do" clauses, etc. 
## Predicted Entities `PARTY`, `EFFDATE`, `DOC`, `ALIAS` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/LEGALNER_PARTIES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_contract_doc_parties_en_1.0.0_3.2_1660647946284.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_contract_doc_parties_en_1.0.0_3.2_1660647946284.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained('legner_contract_doc_parties', 'en', 'legal/models')\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = [""" INTELLECTUAL PROPERTY AGREEMENT This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties"). """] res = model.transform(spark.createDataFrame([text]).toDF("text")) ```
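The paragraph-splitting step recommended in the description can be sketched in plain Python (a minimal sketch only; `split_paragraphs` is an illustrative helper, not a Spark NLP API — in practice each paragraph would then be scored with `legclf_introduction_clause` and only the selected ones passed to this NER pipeline):

```python
def split_paragraphs(text: str) -> list:
    """Split a raw agreement into paragraphs on blank lines,
    dropping empty fragments, before classification and NER."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

# Illustrative, truncated contract text.
contract = (
    "INTELLECTUAL PROPERTY AGREEMENT\n\n"
    "This INTELLECTUAL PROPERTY AGREEMENT (this \"Agreement\"), dated as of "
    "December 31, 2018, is entered into by and between the Parties.\n\n"
    "WHEREAS, the Parties wish to set forth their respective rights."
)
paragraphs = split_paragraphs(contract)
# Each element of `paragraphs` can be classified, and only the selected
# paragraphs are fed to the legner_contract_doc_parties pipeline above.
```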
## Results ```bash +------------+---------+ | token|ner_label| +------------+---------+ |INTELLECTUAL| B-DOC| | PROPERTY| I-DOC| | AGREEMENT| I-DOC| | This| O| |INTELLECTUAL| B-DOC| | PROPERTY| I-DOC| | AGREEMENT| I-DOC| | (| O| | this| O| | "| O| | Agreement| O| | "),| O| | dated| O| | as| O| | of| O| | December|B-EFFDATE| | 31|I-EFFDATE| | ,|I-EFFDATE| | 2018|I-EFFDATE| | (| O| | the| O| | "| O| | Effective| O| | Date| O| | ")| O| | is| O| | entered| O| | into| O| | by| O| | and| O| | between| O| | Armstrong| B-PARTY| | Flooring| I-PARTY| | ,| I-PARTY| | Inc| I-PARTY| | .,| O| | a| O| | Delaware| O| | corporation| O| | ("| O| | Seller| B-ALIAS| | ")| O| | and| O| | AFI| B-PARTY| | Licensing| I-PARTY| | LLC| I-PARTY| | ,| O| | a| O| | Delaware| O| | limited| O| | liability| O| | company| O| | ("| O| | Licensing| B-ALIAS| | "| O| | and| O| | together| O| | with| O| | Seller| B-ALIAS| | ,| O| | "| O| | Arizona| B-ALIAS| | ")| O| | and| O| | AHF| B-PARTY| | Holding| I-PARTY| | ,| I-PARTY| | Inc| I-PARTY| | .| O| | (| O| | formerly| O| | known| O| | as| O| | Tarzan| O| | HoldCo| O| | ,| O| | Inc| O| | .),| O| | a| O| | Delaware| O| | corporation| O| | ("| O| | Buyer| B-ALIAS| | ")| O| | and| O| | Armstrong| B-PARTY| | Hardwood| I-PARTY| | Flooring| I-PARTY| | Company| I-PARTY| ------------------------ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_contract_doc_parties| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.5 MB| ## References Manual annotations on CUAD dataset ## Benchmarking ```bash label tp fp fn prec rec f1 I-PARTY 262 20 61 0.92907804 0.8111455 0.8661157 B-EFFDATE 22 4 9 0.84615386 0.7096774 0.77192986 B-DOC 38 4 12 0.9047619 0.76 0.82608694 I-EFFDATE 95 9 19 0.91346157 0.8333333 0.8715596 I-DOC 93 12 5 0.8857143 0.9489796 0.9162561 B-PARTY 88 10 29 0.8979592 0.75213677 0.81860465 
B-ALIAS 64 7 14 0.90140843 0.82051283 0.8590604 ``` --- layout: model title: English RobertaForQuestionAnswering Tiny Cased model (from deepset) author: John Snow Labs name: roberta_qa_tiny_squad2 date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tinyroberta-squad2` is an English model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_tiny_squad2_en_4.2.4_3.0_1669988560342.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_tiny_squad2_en_4.2.4_3.0_1669988560342.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_tiny_squad2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_tiny_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_tiny_squad2| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|307.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/deepset/tinyroberta-squad2 - https://haystack.deepset.ai/tutorials/first-qa-system - https://arxiv.org/pdf/1909.10351.pdf - https://github.com/deepset-ai/haystack - https://github.com/deepset-ai/haystack/ - https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/ - http://deepset.ai/ - https://haystack.deepset.ai/ - https://deepset.ai/german-bert - https://deepset.ai/germanquad - https://github.com/deepset-ai/haystack - https://docs.haystack.deepset.ai - https://haystack.deepset.ai/community/join - https://twitter.com/deepset_ai - https://www.linkedin.com/company/deepset-ai/ - https://haystack.deepset.ai/community - https://github.com/deepset-ai/haystack/discussions - https://deepset.ai - http://www.deepset.ai/jobs - https://paperswithcode.com/sota?task=Question+Answering&dataset=squad_v2 --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_4_h_768 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-4_H-768` is a Chinese model originally trained by `uer`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_768_zh_4.2.4_3.0_1670325955053.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_768_zh_4.2.4_3.0_1670325955053.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_768","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_768","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_4_h_768| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|170.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-4_H-768 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://github.com/dbiir/UER-py/ - https://cloud.tencent.com/ --- layout: model title: English Biomedical ElectraForQuestionAnswering model author: John Snow Labs name: electra_qa_BioM_Base_SQuAD2_BioASQ8B date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true recommended: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Biomedical Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BioM-ELECTRA-Base-SQuAD2-BioASQ8B` is an English model originally trained by `sultan`. This model is fine-tuned on the SQuAD2.0 dataset and then on the BioASQ8B-Factoid training dataset. We convert the BioASQ8B-Factoid training dataset to SQuAD1.1 format and train and evaluate our model (BioM-ELECTRA-Base-SQuAD2) on this dataset. 
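The BioASQ-to-SQuAD conversion mentioned above can be sketched as follows (a hedged illustration: the field names follow the public SQuAD1.1 JSON schema, while the helper name and the sample record are invented for this example):

```python
def bioasq_factoid_to_squad(qid, question, snippet, answer):
    """Map one BioASQ factoid item to a SQuAD1.1-style record.
    SQuAD1.1 stores each answer with its character offset in the context."""
    start = snippet.find(answer)
    if start == -1:
        raise ValueError("answer span not found in snippet")
    return {
        "id": qid,
        "question": question,
        "context": snippet,
        "answers": [{"text": answer, "answer_start": start}],
    }

# Invented sample item, for illustration only.
record = bioasq_factoid_to_squad(
    "factoid-0001",
    "Which hormone is produced by the pineal gland?",
    "The pineal gland secretes melatonin, a hormone regulating sleep.",
    "melatonin",
)
```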
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_BioM_Base_SQuAD2_BioASQ8B_en_4.0.0_3.0_1655918942331.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_BioM_Base_SQuAD2_BioASQ8B_en_4.0.0_3.0_1655918942331.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_BioM_Base_SQuAD2_BioASQ8B","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_BioM_Base_SQuAD2_BioASQ8B","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_bioasq8b.electra.base").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_BioM_Base_SQuAD2_BioASQ8B| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|403.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/sultan/BioM-ELECTRA-Base-SQuAD2-BioASQ8B --- layout: model title: Fast and Accurate Language Identification - 43 Languages (CNN) author: John Snow Labs name: ld_wiki_tatoeba_cnn_43 date: 2020-12-05 task: Language Detection language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [language_detection, open_source, xx] supported: true annotator: LanguageDetectorDL article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Language detection and identification is the task of automatically detecting the language(s) present in a document based on the content of the document. ``LanguageDetectorDL`` is an annotator that detects the language of documents or sentences depending on the ``inputCols``. In addition, ``LanguageDetectorDL`` can accurately detect language from documents with mixed languages by coalescing sentences and selecting the best candidate. We have designed and developed Deep Learning models using CNNs in TensorFlow/Keras. The models are trained on large datasets such as Wikipedia and Tatoeba, with high accuracy evaluated on the Europarl dataset. The output is a language code in Wiki Code style: [https://en.wikipedia.org/wiki/List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias). 
This model can detect the following languages: `Arabic`, `Belarusian`, `Bulgarian`, `Czech`, `Danish`, `German`, `Greek`, `English`, `Esperanto`, `Spanish`, `Estonian`, `Persian`, `Finnish`, `French`, `Hebrew`, `Hindi`, `Hungarian`, `Interlingua`, `Indonesian`, `Icelandic`, `Italian`, `Japanese`, `Korean`, `Latin`, `Lithuanian`, `Latvian`, `Macedonian`, `Marathi`, `Dutch`, `Polish`, `Portuguese`, `Romanian`, `Russian`, `Slovak`, `Slovenian`, `Serbian`, `Swedish`, `Tagalog`, `Turkish`, `Tatar`, `Ukrainian`, `Vietnamese`, `Chinese`. ## Predicted Entities `ar`, `be`, `bg`, `cs`, `da`, `de`, `el`, `en`, `eo`, `es`, `et`, `fa`, `fi`, `fr`, `he`, `hi`, `hu`, `ia`, `id`, `is`, `it`, `ja`, `ko`, `la`, `lt`, `lv`, `mk`, `mr`, `nl`, `pl`, `pt`, `ro`, `ru`, `sk`, `sl`, `sr`, `sv`, `tl`, `tr`, `tt`, `uk`, `vi`, `zh`. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ld_wiki_tatoeba_cnn_43_xx_2.7.0_2.4_1607184003726.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ld_wiki_tatoeba_cnn_43_xx_2.7.0_2.4_1607184003726.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... language_detector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_43", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("language") languagePipeline = Pipeline(stages=[documentAssembler, sentenceDetector, language_detector]) light_pipeline = LightPipeline(languagePipeline.fit(spark.createDataFrame([['']]).toDF("text"))) result = light_pipeline.fullAnnotate("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.") ``` ```scala ... val languageDetector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_43", "xx") .setInputCols("sentence") .setOutputCol("language") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, languageDetector)) val data = Seq("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala."] lang_df = nlu.load('xx.classify.wiki_43').predict(text, output_level='sentence') lang_df ```
## Results ```bash 'fr' ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ld_wiki_tatoeba_cnn_43| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[language]| |Language:|xx| ## Data Source Wikipedia and Tatoeba ## Benchmarking ```bash Evaluated on the Europarl dataset, which the model has never seen: +--------+-----+-------+------------------+ |src_lang|count|correct| precision| +--------+-----+-------+------------------+ | fr| 1000| 1000| 1.0| | nl| 1000| 999| 0.999| | sv| 1000| 999| 0.999| | pt| 1000| 999| 0.999| | it| 1000| 999| 0.999| | es| 1000| 999| 0.999| | fi| 1000| 999| 0.999| | el| 1000| 998| 0.998| | de| 1000| 997| 0.997| | da| 1000| 997| 0.997| | en| 1000| 995| 0.995| | lt| 1000| 986| 0.986| | hu| 880| 867|0.9852272727272727| | pl| 914| 899|0.9835886214442013| | ro| 784| 765|0.9757653061224489| | et| 928| 899| 0.96875| | cs| 1000| 967| 0.967| | sk| 1000| 966| 0.966| | bg| 1000| 960| 0.96| | sl| 914| 860|0.9409190371991247| | lv| 916| 856|0.9344978165938864| +--------+-----+-------+------------------+ +-------+--------------------+ |summary| precision| +-------+--------------------+ | count| 21| | mean| 0.9832737168612825| | stddev|0.020064155103808722| | min| 0.9344978165938864| | max| 1.0| +-------+--------------------+ ``` --- layout: model title: Legal Law Area Prediction Classifier (French) author: John Snow Labs name: legclf_law_area_prediction_french date: 2023-03-29 tags: [fr, licensed, classification, legal, tensorflow] task: Text Classification language: fr edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a multiclass classification model that identifies law area labels (civil_law, penal_law, public_law, social_law) in French court cases. 
## Predicted Entities `civil_law`, `penal_law`, `public_law`, `social_law` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_law_area_prediction_french_fr_1.0.0_3.0_1680094841099.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_law_area_prediction_french_fr_1.0.0_3.0_1680094841099.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_multi_cased", "xx")\ .setInputCols(["document"]) \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_law_area_prediction_french", "fr", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, docClassifier ]) df = spark.createDataFrame([["par ces motifs, le Juge unique prononce : 1. Le recours est irrecevable. 2. Il n'est pas perçu de frais judiciaires. 3. Le présent arrêt est communiqué aux parties, au Tribunal administratif fédéral et à l'Office fédéral des assurances sociales. Lucerne, le 2 juin 2016 Au nom de la IIe Cour de droit social du Tribunal fédéral suisse Le Juge unique : Meyer Le Greffier : Cretton"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) result.select("text", "category.result").show(truncate=100) ```
## Results ```bash +----------------------------------------------------------------------------------------------------+------------+ | text| result| +----------------------------------------------------------------------------------------------------+------------+ |par ces motifs, le Juge unique prononce : 1. Le recours est irrecevable. 2. Il n'est pas perçu de...|[social_law]| +----------------------------------------------------------------------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_law_area_prediction_french| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|fr| |Size:|22.3 MB| ## References Train dataset available [here](https://huggingface.co/datasets/rcds/legal_criticality_prediction) ## Benchmarking ```bash label precision recall f1-score support civil_law 0.93 0.91 0.92 613 penal_law 0.94 0.96 0.95 579 public_law 0.92 0.91 0.92 605 social_law 0.97 0.98 0.97 478 accuracy - - 0.94 2275 macro-avg 0.94 0.94 0.94 2275 weighted-avg 0.94 0.94 0.94 2275 ``` --- layout: model title: English BertForQuestionAnswering model (from hendrixcosta) author: John Snow Labs name: bert_qa_bertimbau_squad1.1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bertimbau-squad1.1` is an English model originally trained by `hendrixcosta`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bertimbau_squad1.1_en_4.0.0_3.0_1654185392313.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bertimbau_squad1.1_en_4.0.0_3.0_1654185392313.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bertimbau_squad1.1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bertimbau_squad1.1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_hendrixcosta").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bertimbau_squad1.1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/hendrixcosta/bertimbau-squad1.1 --- layout: model title: Classification of Self-Reported Intimate Partner Violence (BioBERT) author: John Snow Labs name: bert_sequence_classifier_self_reported_partner_violence_tweet date: 2022-07-28 tags: [sequence_classification, bert, classifier, clinical, en, licensed, public_health, partner_violence, tweet] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Classification of Self-Reported Intimate Partner Violence on Twitter. This model detects potential IPV victims on social media platforms (in English tweets). 
## Predicted Entities `intimate_partner_violence`, `non-intimate_partner_violence` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_PARTNER_VIOLENCE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4SC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_self_reported_partner_violence_tweet_en_4.0.0_3.0_1658999356448.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_self_reported_partner_violence_tweet_en_4.0.0_3.0_1658999356448.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from pyspark.sql.types import StringType

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_self_reported_partner_violence_tweet", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier])

example = spark.createDataFrame(
    ["I am fed up with this toxic relation.I hate my husband.",
     "Can i say something real quick I ve never been one to publicly drag an ex partner and sometimes I regret that. I ve been reflecting on the harm, abuse and violence that was done to me and those bitches are truly lucky I chose peace amp therapy because they are trash forreal."],
    StringType()).toDF("text")

result = pipeline.fit(example).transform(example)
result.select("text", "class.result").show(truncate=False)
```

```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_self_reported_partner_violence_tweet", "en", "clinical/models")
    .setInputCols(Array("document", "token"))
    .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))

// a couple of simple examples
val example = Seq(
    "I am fed up with this toxic relation.I hate my husband.",
    "Can i say something real quick I ve never been one to publicly drag an ex partner and sometimes I regret that. I ve been reflecting on the harm, abuse and violence that was done to me and those bitches are truly lucky I chose peace amp therapy because they are trash forreal."
).toDF("text")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.classify.self_reported_partner_violence").predict("""Can i say something real quick I ve never been one to publicly drag an ex partner and sometimes I regret that. I ve been reflecting on the harm, abuse and violence that was done to me and those bitches are truly lucky I chose peace amp therapy because they are trash forreal.""")
```
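Internally, the classifier produces one logit per predicted entity and selects the label with the highest softmax probability. The sketch below illustrates that final decision step in plain Python; the logits are made up for illustration and are not output from this model.

```python
import math

# Labels in the order this model card documents them
LABELS = ["intimate_partner_violence", "non-intimate_partner_violence"]

def softmax(logits):
    # subtract the max logit for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

# illustrative logits for a single tweet (not real model output)
label, confidence = predict_label([2.1, -0.4])
print(label, round(confidence, 3))  # -> intimate_partner_violence 0.924
```

The winning label is what ends up in the `class.result` column of the pipeline output.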
## Results

```bash
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------+
|text                                                                                                                                                                                                                                                                               |result                         |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------+
|I am fed up with this toxic relation.I hate my husband.                                                                                                                                                                                                                            |[non-intimate_partner_violence]|
|Can i say something real quick I ve never been one to publicly drag an ex partner and sometimes I regret that. I ve been reflecting on the harm, abuse and violence that was done to me and those bitches are truly lucky I chose peace amp therapy because they are trash forreal.  |[intimate_partner_violence]    |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_self_reported_partner_violence_tweet|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|406.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

[SMM4H 2022](https://healthlanguageprocessing.org/smm4h-2022/)

## Benchmarking

```bash
                        label  precision  recall  f1-score  support
    intimate_partner_violence       0.96    0.97      0.97      630
non-intimate_partner_violence       0.75    0.69      0.72       78
                     accuracy          -       -      0.94      708
                    macro-avg       0.86    0.83      0.84      708
                 weighted-avg       0.94    0.94      0.94      708
```

---
layout: model
title: Clinical Deidentification (Spanish)
author: John Snow Labs
name: clinical_deidentification
date: 2022-02-17
tags: [deid, es, licensed, clinical]
task: De-identification
language: es
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pipeline is trained with sciwiki_300d embeddings and can be used to deidentify PHI information from medical texts in Spanish. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask, fake or obfuscate the following entities: `AGE`, `DATE`, `PROFESSION`, `E-MAIL`, `USERNAME`, `LOCATION`, `DOCTOR`, `HOSPITAL`, `PATIENT`, `URL`, `IP`, `MEDICALRECORD`, `IDNUM`, `ORGANIZATION`, `PHONE`, `ZIP`, `ACCOUNT`, `SSN`, `PLATE`, `SEX` and `IPADDR`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_es_3.4.1_3.0_1645118722536.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_es_3.4.1_3.0_1645118722536.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from johnsnowlabs import * deid_pipeline = PretrainedPipeline("clinical_deidentification", "es", "clinical/models") sample = """Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. 
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com """

result = deid_pipeline.annotate(sample)
```

```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "es", "clinical/models")

val sample = "Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. 
Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com " val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("es.deid.clinical").predict("""Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. 
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com """) ```
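The Results section that follows renders the same clinical note four ways: masked with entity labels, masked with same-length chars, masked with fixed-length chars, and obfuscated with fake values. As a rough illustration of those four policies (not the pipeline's actual implementation), here is a minimal pure-Python sketch over hand-labeled entity spans:

```python
def deidentify(text, entities, mode, fakes=None):
    """entities: list of (start, end, label) character spans, non-overlapping."""
    out, last = [], 0
    for start, end, label in sorted(entities):
        out.append(text[last:start])
        chunk = text[start:end]
        if mode == "entity_labels":
            # replace the chunk with its entity label
            out.append(f"<{label}>")
        elif mode == "same_length_chars":
            # bracketed asterisks, preserving the chunk's length
            out.append("[" + "*" * max(len(chunk) - 2, 0) + "]")
        elif mode == "fixed_length_chars":
            out.append("****")
        elif mode == "obfuscate":
            # swap in a fake value of the same entity type
            out.append(fakes.get(label, chunk))
        last = end
    out.append(text[last:])
    return "".join(out)

text = "Nombre: Jose. Localidad: Madrid."
entities = [(8, 12, "PATIENT"), (25, 31, "LOCATION")]
print(deidentify(text, entities, "entity_labels"))
# -> Nombre: <PATIENT>. Localidad: <LOCATION>.
print(deidentify(text, entities, "same_length_chars"))
# -> Nombre: [**]. Localidad: [****].
```

The real `DeIdentificationModel` additionally handles overlapping chunks, date shifting, and consistent fake-value generation; this sketch only shows the shape of the four output styles.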
## Results ```bash Masked with entity labels ------------------------------ Datos del paciente. Nombre: . Apellidos: . NHC: . NASS: 04. Domicilio: , 5 B.. Localidad/ Provincia: . CP: . Datos asistenciales. Fecha de nacimiento: . País: . Edad: años Sexo: . Fecha de Ingreso: . : María Merino Viveros NºCol: . Informe clínico del paciente: de años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. 
Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. Servicio de Endocrinología y Nutrición km 12,500 28905 - () Correo electrónico: Masked with chars ------------------------------ Datos del paciente. Nombre: [**] . Apellidos: [*************]. NHC: [*****]. NASS: ** [******] 04. Domicilio: [*******************], 5 B.. Localidad/ Provincia: [****]. CP: [***]. Datos asistenciales. Fecha de nacimiento: [********]. País: [****]. Edad: ** años Sexo: *. Fecha de Ingreso: [********]. [****]: María Merino Viveros NºCol: ** ** [***]. 
Informe clínico del paciente: [***] de ** años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en [*********] en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. 
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: [*************] +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente [****] significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. [******************] [******************************] Servicio de Endocrinología y Nutrición [*****************] km 12,500 28905 [****] - [****] ([****]) Correo electrónico: [********************] Masked with fixed length chars ------------------------------ Datos del paciente. Nombre: **** . Apellidos: ****. NHC: ****. NASS: **** **** 04. Domicilio: ****, 5 B.. Localidad/ Provincia: ****. CP: ****. Datos asistenciales. Fecha de nacimiento: ****. País: ****. Edad: **** años Sexo: ****. Fecha de Ingreso: ****. ****: María Merino Viveros NºCol: **** **** ****. 
Informe clínico del paciente: **** de **** años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en **** en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. 
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: **** +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente **** significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. **** **** Servicio de Endocrinología y Nutrición **** km 12,500 28905 **** - **** (****) Correo electrónico: **** Obfuscated ------------------------------ Datos del paciente. Nombre: Sr. Lerma . Apellidos: Aristides Gonzalez Gelabert. NHC: BBBBBBBBQR648597. NASS: 041010000011 RZRM020101906017 04. Domicilio: Valencia, 5 B.. Localidad/ Provincia: Madrid. CP: 99335. Datos asistenciales. Fecha de nacimiento: 25/04/1977. País: Barcelona. Edad: 8 años Sexo: F.. Fecha de Ingreso: 02/08/2018. transportista: María Merino Viveros NºCol: olegario10 olegario10 felisa78. 
Informe clínico del paciente: RZRM020101906017 de 8 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Madrid en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. 
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Dra. Laguna +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente 041010000011 significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
Reinaldo Manjón Malo Barcelona Servicio de Endocrinología y Nutrición Valencia km 12,500 28905 Bilbao - Madrid (Barcelona) Correo electrónico: quintanasalome@example.net
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|clinical_deidentification|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|es|
|Size:|281.3 MB|

## Included Models

- nlp.DocumentAssembler
- nlp.SentenceDetectorDLModel
- nlp.TokenizerModel
- nlp.WordEmbeddingsModel
- medical.NerModel
- nlp.NerConverter
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ChunkMergeModel
- medical.DeIdentificationModel
- medical.DeIdentificationModel
- medical.DeIdentificationModel
- medical.DeIdentificationModel
- Finisher

---
layout: model
title: English BertForQuestionAnswering model (from gerardozq)
author: John Snow Labs
name: bert_qa_biobert_v1.1_pubmed_finetuned_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobert_v1.1_pubmed-finetuned-squad` is an English model originally trained by `gerardozq`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_v1.1_pubmed_finetuned_squad_en_4.0.0_3.0_1654185735686.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_v1.1_pubmed_finetuned_squad_en_4.0.0_3.0_1654185735686.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biobert_v1.1_pubmed_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_biobert_v1.1_pubmed_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad_pubmed.biobert").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
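Under the hood, an extractive QA head like this one scores every token twice (as a possible answer start and as a possible answer end), and the predicted answer is the span whose start and end scores sum highest, with end ≥ start. Spark NLP performs this decoding internally; as an illustration only, here is a minimal pure-Python sketch of that step (the token list and scores below are made-up values, not real model output):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick (i, j) maximizing start_scores[i] + end_scores[j], with i <= j < i + max_len."""
    best, best_ij = float("-inf"), (0, 0)
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best:
                best, best_ij = s + end_scores[j], (i, j)
    return best_ij

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 3.1, 0.0, 0.0, 0.1, 0.0, 0.5, 0.0]  # toy start logits
end   = [0.0, 0.1, 0.0, 2.8, 0.1, 0.0, 0.0, 0.0, 0.9, 0.2]  # toy end logits
i, j = best_span(start, end)
print(tokens[i : j + 1])  # ['Clara']
```

The `max_len` cap mirrors the common practice of limiting answer length during span decoding.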
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_biobert_v1.1_pubmed_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|403.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/gerardozq/biobert_v1.1_pubmed-finetuned-squad --- layout: model title: Hocr for table recognition author: John Snow Labs name: hocr_table_recognition date: 2023-01-23 tags: [en, licensed] task: HOCR Table Recognition language: en nav_key: models edition: Visual NLP 4.2.4 spark_version: 3.2.1 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Table structure recognition based on hocr with Tesseract architecture. Tesseract has been trained on a variety of datasets to improve its recognition capabilities. These datasets include images of text in various languages and scripts, as well as images with different font styles, sizes, and orientations. The training process involves feeding the engine with a large number of images and their corresponding text, allowing the engine to learn the patterns and characteristics of different text styles. One of the most important datasets used in training Tesseract is the UNLV dataset, which contains over 400,000 images of text in different languages, scripts, and font styles. This dataset is widely used in the OCR community and has been instrumental in improving the accuracy of Tesseract. Other datasets that have been used in training Tesseract include the ICDAR dataset, the IIIT-HWS dataset, and the RRC-GV-WS dataset. In addition to these datasets, Tesseract also uses a technique called adaptive training, where the engine can continuously improve its recognition capabilities by learning from new images and text. 
This allows Tesseract to adapt to new text styles and languages, and improve its overall accuracy. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/ocr/IMAGE_TABLE_RECOGNITION_HOCR/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://github.com/JohnSnowLabs/spark-ocr-workshop/tree/master/jupyter/SparkOcrImageTableRecognitionWHOCR.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python binary_to_image = BinaryToImage() \ .setInputCol("content") \ .setOutputCol("image") table_detector = ImageTableDetector.pretrained("general_model_table_detection_v2", "en", "clinical/ocr") \ .setInputCol("image") \ .setOutputCol("table_regions") splitter = ImageSplitRegions() \ .setInputCol("image") \ .setInputRegionsCol("table_regions") \ .setOutputCol("table_image") \ .setDropCols("image") \ .setImageType(ImageType.TYPE_BYTE_GRAY) \ .setExplodeCols([]) text_detector = ImageTextDetectorV2.pretrained("image_text_detector_v2", "en", "clinical/ocr") \ .setInputCol("image") \ .setOutputCol("text_regions") \ .setWithRefiner(True) draw_regions = ImageDrawRegions() \ .setInputCol("image") \ .setInputRegionsCol("text_regions") \ .setOutputCol("image_with_regions") \ .setRectColor(Color.green) \ .setRotated(True) img_to_hocr = ImageToTextV2.pretrained("ocr_small_printed", "en", "clinical/ocr") \ .setInputCols(["image", "text_regions"]) \ .setUsePandasUdf(False) \ .setOutputFormat(OcrOutputFormat.HOCR) \ .setOutputCol("hocr") \ .setGroupImages(False) hocr_to_table = HocrToTextTable() \ .setInputCol("hocr") \ .setRegionCol("table_regions") \ .setOutputCol("tables") pipeline = PipelineModel(stages=[ binary_to_image, table_detector, splitter, text_detector, draw_regions, img_to_hocr, hocr_to_table ]) imagePath = "data/tab_images_hocr_1/table4_1.jpg" image_df = spark.read.format("binaryFile").load(imagePath) result = pipeline.transform(image_df).cache() ``` ```scala val binary_to_image = new BinaryToImage() .setInputCol("content") .setOutputCol("image") val table_detector = ImageTableDetector.pretrained("general_model_table_detection_v2", "en", "clinical/ocr") .setInputCol("image") .setOutputCol("table_regions") val splitter = new ImageSplitRegions() .setInputCol("image") .setInputRegionsCol("table_regions") .setOutputCol("table_image") .setDropCols("image") .setImageType(ImageType.TYPE_BYTE_GRAY) 
.setExplodeCols(Array()) val text_detector = ImageTextDetectorV2.pretrained("image_text_detector_v2", "en", "clinical/ocr") .setInputCol("image") .setOutputCol("text_regions") .setWithRefiner(true) val draw_regions = new ImageDrawRegions() .setInputCol("image") .setInputRegionsCol("text_regions") .setOutputCol("image_with_regions") .setRectColor(Color.green) .setRotated(true) val img_to_hocr = ImageToTextV2.pretrained("ocr_small_printed", "en", "clinical/ocr") .setInputCols(Array("image", "text_regions")) .setUsePandasUdf(false) .setOutputFormat(OcrOutputFormat.HOCR) .setOutputCol("hocr") .setGroupImages(false) val hocr_to_table = new HocrToTextTable() .setInputCol("hocr") .setRegionCol("table_regions") .setOutputCol("tables") val pipeline = new PipelineModel().setStages(Array( binary_to_image, table_detector, splitter, text_detector, draw_regions, img_to_hocr, hocr_to_table)) val imagePath = "data/tab_images_hocr_1/table4_1.jpg" val image_df = spark.read.format("binaryFile").load(imagePath) val result = pipeline.transform(image_df).cache() ```
## Example {%- capture input_image -%} ![Screenshot](/assets/images/examples_ocr/image13.png) {%- endcapture -%} {%- capture output_image -%} ![Screenshot](/assets/images/examples_ocr/image13_out.png) {%- endcapture -%} {% include templates/input_output_image.md input_image=input_image output_image=output_image %} ## Output text ```bash text_regions table_image pagenum modificationTime path table_regions length image image_with_regions hocr tables exception table_index [{0, 0, 566.32025... {file:/content/ta... 0 2023-01-23 08:21:... file:/content/tab... {0, 0, 40.0, 0.0,... 172124 {file:/content/ta... {file:/content/ta... ``` [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_12_en_4.1.0_3.0_1660171578567.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_12_en_4.1.0_3.0_1660171578567.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_rust_image_classification_12", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_rust_image_classification_12", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_rust_image_classification_12| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Mongolian RobertaForTokenClassification Base Cased model (from onon214) author: John Snow Labs name: roberta_token_classifier_base_ner_demo date: 2023-03-01 tags: [mn, open_source, roberta, token_classification, ner, tensorflow] task: Named Entity Recognition language: mn edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-ner-demo` is a Mongolian model originally trained by `onon214`. ## Predicted Entities `MISC`, `LOC`, `PER`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_base_ner_demo_mn_4.3.0_3.0_1677703536380.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_base_ner_demo_mn_4.3.0_3.0_1677703536380.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_base_ner_demo","mn") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_base_ner_demo","mn") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
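The `ner` column holds one IOB-style tag per token (`B-PER`, `I-PER`, `O`, ...). In Spark NLP you would normally append an `NerConverter` stage to group these tags into entity chunks; purely as an illustration of what that grouping does, here is a minimal pure-Python sketch (the token/tag lists are made-up examples, not model output):

```python
def iob_to_chunks(tokens, tags):
    """Group IOB tags (B-X / I-X / O) into (entity_text, label) chunks."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):  # a new entity always starts on B-
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)  # continuation of the open entity
        else:  # O tag, or an I- tag that doesn't continue anything
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(iob_to_chunks(
    ["John", "Snow", "lives", "in", "Ulaanbaatar"],
    ["B-PER", "I-PER", "O", "O", "B-LOC"],
))  # [('John Snow', 'PER'), ('Ulaanbaatar', 'LOC')]
```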
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_token_classifier_base_ner_demo| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|mn| |Size:|466.3 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/onon214/roberta-base-ner-demo --- layout: model title: Fast Neural Machine Translation Model from Igbo to English author: John Snow Labs name: opus_mt_ig_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ig, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `ig` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ig_en_xx_2.7.0_2.4_1609163903644.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ig_en_xx_2.7.0_2.4_1609163903644.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ig_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ig_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ig.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ig_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Right of setoff Clause Binary Classifier author: John Snow Labs name: legclf_right_of_setoff_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `right-of-setoff` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. 
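Paragraph splitting by multiline, as recommended above, can be as simple as splitting the document on runs of blank lines before sending each piece to the classifier. A minimal, dependency-free sketch (this is an illustration, not the workshop notebook's exact code):

```python
import re

def split_paragraphs(text):
    """Split a document into paragraphs on runs of one or more blank lines."""
    pieces = re.split(r"\n\s*\n", text)
    return [p.strip() for p in pieces if p.strip()]

doc = "Clause 1. Setoff...\n\nClause 2. Governing law...\n\n\nClause 3. Notices..."
print(split_paragraphs(doc))  # three clause-level paragraphs
```

Each returned paragraph can then be loaded into the `clause_text` column of the DataFrame fed to the pipeline.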
## Predicted Entities `other`, `right-of-setoff` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_right_of_setoff_clause_en_1.0.0_3.2_1660122968803.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_right_of_setoff_clause_en_1.0.0_3.2_1660122968803.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_right_of_setoff_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
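The macro- and weighted-average rows reported in this card's Benchmarking section are simple functions of the per-label rows: an unweighted mean of the F1 scores versus a support-weighted mean. A quick arithmetic check using that table's F1 and support values:

```python
# (f1, support) per label, taken from the Benchmarking table of this card
per_label = {"other": (0.97, 94), "right-of-setoff": (0.85, 22)}

macro_f1 = sum(f1 for f1, _ in per_label.values()) / len(per_label)
support = sum(n for _, n in per_label.values())
weighted_f1 = sum(f1 * n for f1, n in per_label.values()) / support

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.91 0.95
```

The weighted average leans toward the majority `other` class, which is why it sits above the macro average here.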
## Results ```bash +-----------------+ | result| +-----------------+ |[right-of-setoff]| | [other]| | [other]| |[right-of-setoff]| +-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_right_of_setoff_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.95 0.99 0.97 94 right-of-setoff 0.94 0.77 0.85 22 accuracy - - 0.95 116 macro-avg 0.95 0.88 0.91 116 weighted-avg 0.95 0.95 0.95 116 ``` --- layout: model title: Legal Erisa Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_erisa_bert date: 2023-03-05 tags: [en, legal, classification, clauses, erisa, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Erisa` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. 
If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Erisa`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_erisa_bert_en_1.0.0_3.0_1678050577180.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_erisa_bert_en_1.0.0_3.0_1678050577180.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_erisa_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Erisa]| |[Other]| |[Other]| |[Erisa]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_erisa_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Erisa 0.97 1.00 0.99 35 Other 1.00 0.98 0.99 56 accuracy - - 0.99 91 macro-avg 0.99 0.99 0.99 91 weighted-avg 0.99 0.99 0.99 91 ``` --- layout: model title: Turkish ElectraForQuestionAnswering model (from enelpi) Discriminator Version-2 author: John Snow Labs name: electra_qa_base_discriminator_finetuned_squadv2 date: 2022-06-22 tags: [tr, open_source, electra, question_answering] task: Question Answering language: tr edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-discriminator-finetuned_squadv2_tr` is a Turkish model originally trained by `enelpi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_base_discriminator_finetuned_squadv2_tr_4.0.0_3.0_1655920605376.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_base_discriminator_finetuned_squadv2_tr_4.0.0_3.0_1655920605376.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_discriminator_finetuned_squadv2","tr") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_discriminator_finetuned_squadv2","tr") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("tr.answer_question.squadv2.electra.base_v2").predict("""Benim adım ne?|||Benim adım Clara ve Berkeley'de yaşıyorum.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_base_discriminator_finetuned_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|tr| |Size:|412.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/enelpi/electra-base-discriminator-finetuned_squadv2_tr --- layout: model title: English RoBERTa Embeddings (from abhi1nandy2) author: John Snow Labs name: roberta_embeddings_Bible_roberta_base date: 2022-04-14 tags: [roberta, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `Bible-roberta-base` is an English model originally trained by `abhi1nandy2`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_Bible_roberta_base_en_3.4.2_3.0_1649947380949.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_Bible_roberta_base_en_3.4.2_3.0_1649947380949.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_Bible_roberta_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_Bible_roberta_base","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.Bible_roberta_base").predict("""I love Spark NLP""") ```
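The `embeddings` column holds one dense vector per token, and a common downstream step is comparing such vectors with cosine similarity. A minimal, framework-free sketch of that comparison (the vectors below are toy values, not real model output):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1, v2 = [0.2, 0.1, 0.4], [0.4, 0.2, 0.8]  # toy 3-d "embeddings"; v2 is parallel to v1
print(round(cosine(v1, v2), 4))  # 1.0
```

In practice you would pull the real vectors out of `result` (e.g. via a `Finisher` or `selectExpr` on the embeddings annotations) before comparing them.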
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_Bible_roberta_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|468.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/abhi1nandy2/Bible-roberta-base - https://www.kaggle.com/oswinrh/bible --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_ft_news date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta_FT_newsqa` is a English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ft_news_en_4.3.0_3.0_1674222981909.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ft_news_en_4.3.0_3.0_1674222981909.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ft_news","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ft_news","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_ft_news| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|458.6 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/roberta_FT_newsqa --- layout: model title: English asr_wav2vec2_cetuc_sid_voxforge_mls_0 TFWav2Vec2ForCTC from joaoalvarenga author: John Snow Labs name: pipeline_asr_wav2vec2_cetuc_sid_voxforge_mls_0 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_cetuc_sid_voxforge_mls_0` is an English model originally trained by joaoalvarenga. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_cetuc_sid_voxforge_mls_0_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_cetuc_sid_voxforge_mls_0_en_4.2.0_3.0_1664022807196.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_cetuc_sid_voxforge_mls_0_en_4.2.0_3.0_1664022807196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_cetuc_sid_voxforge_mls_0', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_cetuc_sid_voxforge_mls_0", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_cetuc_sid_voxforge_mls_0| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Part of Speech for Chinese author: John Snow Labs name: pos_ctb9 date: 2021-01-03 task: Part of Speech Tagging language: zh edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [pos, zh, cn, open_source] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates the part of speech of tokens in a text. The parts of speech annotated include PN (pronoun), CC (coordinating conjunction), and 39 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. ## Predicted Entities `AD`, `AS`, `BA`, `CC`, `CD`, `CS`, `DEC`, `DEG`, `DER`, `DEV`, `DT`, `EM`, `ETC`, `FW`, `IC`, `IJ`, `JJ`, `LB`, `LC`, `M`, `MSP`, `MSP-2`, `NN`, `NN-SHORT`, `NOI`, `NR`, `NR-SHORT`, `NT`, `NT-SHORT`, `OD`, `ON`, `P`, `PN`, `PU`, `SB`, `SP`, `URL`, `VA`, `VC`, `VE`, and `VV` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ctb9_zh_2.7.0_2.4_1609696404134.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ctb9_zh_2.7.0_2.4_1609696404134.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an nlp pipeline after tokenization.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh")\ .setInputCols(["sentence"])\ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ctb9", "zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[document_assembler, sentence_detector, word_segmenter, pos]) example = spark.createDataFrame([['然而,这样的处理也衍生了一些问题。']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh") .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ctb9", "zh") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter, pos)) val data = Seq("然而,这样的处理也衍生了一些问题。").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""然而,这样的处理也衍生了一些问题。"""] pos_df = nlu.load('zh.pos.ctb9').predict(text, output_level='token') pos_df ```
## Results ```bash +-----+---+ |token|pos_tag| +-----+---+ |然而 |AD | |, |PU | |这样 |PN | |的 |DEG| |处理 |NN | |也 |AD | |衍生 |VV | |了 |AS | |一些 |CD | |问题 |NN | |。 |PU | +-----+---+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ctb9| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[pos]| |Language:|zh| ## Data Source The model was trained on the Microsoft Research Asia (MSRA) data set available on the Second International Chinese Word Segmentation Bakeoff [SIGHAN 2005](http://sighan.cs.uchicago.edu/bakeoff2005/) ## Benchmarking ```bash | Tag | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | AD | 0.92 | 0.94 | 0.93 | 23017 | | AS | 0.92 | 0.93 | 0.92 | 2415 | | BA | 0.95 | 0.94 | 0.95 | 376 | | CC | 0.85 | 0.82 | 0.83 | 2388 | | CD | 0.95 | 0.95 | 0.95 | 6758 | | CS | 0.93 | 0.92 | 0.92 | 525 | | DEC | 0.75 | 0.78 | 0.76 | 4422 | | DEG | 0.86 | 0.85 | 0.85 | 6094 | | DER | 0.87 | 0.79 | 0.83 | 251 | | DEV | 0.86 | 0.73 | 0.79 | 291 | | DT | 0.94 | 0.91 | 0.92 | 4409 | | EM | 0.74 | 0.72 | 0.73 | 32 | | ETC | 0.97 | 0.98 | 0.97 | 294 | | FW | 0.00 | 0.00 | 0.00 | 2 | | IC | 0.11 | 0.01 | 0.02 | 117 | | IJ | 0.95 | 0.95 | 0.95 | 2153 | | JJ | 0.80 | 0.77 | 0.78 | 4693 | | LB | 0.78 | 0.70 | 0.73 | 105 | | LC | 0.95 | 0.97 | 0.96 | 2660 | | M | 0.97 | 0.97 | 0.97 | 6467 | | MSP | 0.83 | 0.82 | 0.83 | 428 | | MSP-2 | 0.00 | 0.00 | 0.00 | 1 | | NN | 0.93 | 0.94 | 0.93 | 49159 | | NN-SHORT | 0.00 | 0.00 | 0.00 | 1 | | NOI | 0.17 | 0.06 | 0.09 | 48 | | NR | 0.95 | 0.90 | 0.93 | 13220 | | NR-SHORT | 0.00 | 0.00 | 0.00 | 8 | | NT | 0.95 | 0.95 | 0.95 | 3723 | | NT-SHORT | 1.00 | 0.20 | 0.33 | 5 | | OD | 0.91 | 0.86 | 0.88 | 587 | | ON | 0.33 | 0.23 | 0.27 | 13 | | P | 0.90 | 0.92 | 0.91 | 7442 | | PN | 0.95 | 0.95 | 0.95 | 11011 | | PU | 0.99 | 0.99 | 0.99 | 36717 | | SB | 0.89 | 0.90 | 0.90 | 197 | | SP | 0.91 | 0.90 | 0.91 | 5466 | 
| URL | 0.75 | 0.75 | 0.75 | 8 | | VA | 0.84 | 0.83 | 0.83 | 4866 | | VC | 0.98 | 0.98 | 0.98 | 4434 | | VE | 0.95 | 0.95 | 0.95 | 2274 | | VV | 0.91 | 0.91 | 0.91 | 35240 | | | | | | | | accuracy | 0.93 | 242317 | | | | macro avg | 0.76 | 0.72 | 0.73 | 242317 | | weighted avg | 0.93 | 0.93 | 0.93 | 242317 | ``` --- layout: model title: Pipeline to Detect Anatomical and Observation Entities in Chest Radiology Reports author: John Snow Labs name: ner_chexpert_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, chexpert, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_chexpert](https://nlp.johnsnowlabs.com/2021/09/30/ner_chexpert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chexpert_pipeline_en_3.4.1_3.0_1647867766035.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chexpert_pipeline_en_3.4.1_3.0_1647867766035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_chexpert_pipeline", "en", "clinical/models") pipeline.annotate("FINAL REPORT HISTORY : Chest tube leak , to assess for pneumothorax . FINDINGS : In comparison with study of ___ , the endotracheal tube and Swan - Ganz catheter have been removed . The left chest tube remains in place and there is no evidence of pneumothorax. Mild atelectatic changes are seen at the left base.") ``` ```scala val pipeline = new PretrainedPipeline("ner_chexpert_pipeline", "en", "clinical/models") pipeline.annotate("FINAL REPORT HISTORY : Chest tube leak , to assess for pneumothorax . FINDINGS : In comparison with study of ___ , the endotracheal tube and Swan - Ganz catheter have been removed . The left chest tube remains in place and there is no evidence of pneumothorax. Mild atelectatic changes are seen at the left base.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.chexpert.pipeline").predict("""FINAL REPORT HISTORY : Chest tube leak , to assess for pneumothorax . FINDINGS : In comparison with study of ___ , the endotracheal tube and Swan - Ganz catheter have been removed . The left chest tube remains in place and there is no evidence of pneumothorax. Mild atelectatic changes are seen at the left base.""") ```
## Results ```bash | | chunk | label | |---:|:-------------------------|:--------| | 0 | endotracheal tube | OBS | | 1 | Swan - Ganz catheter | OBS | | 2 | left chest | ANAT | | 3 | tube | OBS | | 4 | in place | OBS | | 5 | pneumothorax | OBS | | 6 | Mild atelectatic changes | OBS | | 7 | left base | ANAT | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_chexpert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Spanish RobertaForQuestionAnswering Large Cased model (from BSC-TeMU) author: John Snow Labs name: roberta_qa_bsc_temu_large_bne_s_c date: 2022-12-02 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-bne-sqac` is a Spanish model originally trained by `BSC-TeMU`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_bsc_temu_large_bne_s_c_es_4.2.4_3.0_1669987001377.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_bsc_temu_large_bne_s_c_es_4.2.4_3.0_1669987001377.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_bsc_temu_large_bne_s_c","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_bsc_temu_large_bne_s_c","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_bsc_temu_large_bne_s_c| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|es| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/BSC-TeMU/roberta-large-bne-sqac - https://arxiv.org/abs/1907.11692 - http://www.bne.es/en/Inicio/index.html - https://github.com/PlanTL-SANIDAD/lm-spanish - https://arxiv.org/abs/2107.07253 --- layout: model title: Bangla BertForQuestionAnswering model (from sagorsarker) author: John Snow Labs name: bert_qa_mbert_bengali_tydiqa_qa date: 2022-06-02 tags: [bn, open_source, question_answering, bert] task: Question Answering language: bn edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mbert-bengali-tydiqa-qa` is a Bangla model originally trained by `sagorsarker`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_bengali_tydiqa_qa_bn_4.0.0_3.0_1654188244734.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_bengali_tydiqa_qa_bn_4.0.0_3.0_1654188244734.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_bengali_tydiqa_qa","bn") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_mbert_bengali_tydiqa_qa","bn") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("bn.answer_question.tydiqa.multi_lingual_bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")  ```
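A note on the nlu one-liner above: the question and context are passed as a single string joined by a `|||` separator. A minimal helper for building such inputs (the separator convention is copied from the example on this page; the helper name is ours, not part of the nlu API):

```python
SEP = "|||"

def make_qa_input(question: str, context: str) -> str:
    # Join question and context with the "|||" separator used in the
    # nlu question-answering examples.
    return f"{question}{SEP}{context}"

print(make_qa_input("What's my name?", "My name is Clara and I live in Berkeley."))
```

This keeps the question/context boundary unambiguous when batching several examples into a single list of strings.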
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_mbert_bengali_tydiqa_qa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|bn| |Size:|626.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/sagorsarker/mbert-bengali-tydiqa-qa - https://github.com/sagorbrur - https://github.com/sagorbrur/bntransformer - https://github.com/google-research-datasets/tydiqa - https://www.linkedin.com/in/sagor-sarker/ - https://www.kaggle.com/ --- layout: model title: Malay (macrolanguage) BertForQuestionAnswering model (from zhufy) author: John Snow Labs name: bert_qa_squad_ms_bert_base date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: ms edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squad-ms-bert-base` is a Malay (macrolanguage) model originally trained by `zhufy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_squad_ms_bert_base_ms_4.0.0_3.0_1654192110158.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_squad_ms_bert_base_ms_4.0.0_3.0_1654192110158.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_squad_ms_bert_base","ms") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_squad_ms_bert_base","ms") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("ms.answer_question.squad.bert.ms_tuned.base.by_zhufy").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")  ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_squad_ms_bert_base| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ms| |Size:|412.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/zhufy/squad-ms-bert-base - https://github.com/huseinzol05/malay-dataset/tree/master/question-answer/squad --- layout: model title: Russian Part of Speech Tagger (from KoichiYasuoka) author: John Snow Labs name: bert_pos_bert_base_russian_upos date: 2022-05-09 tags: [bert, pos, part_of_speech, ru, open_source] task: Part of Speech Tagging language: ru edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-russian-upos` is a Russian model originally trained by `KoichiYasuoka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_russian_upos_ru_3.4.2_3.0_1652091813748.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_russian_upos_ru_3.4.2_3.0_1652091813748.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_russian_upos","ru") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Я люблю Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_russian_upos","ru") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Я люблю Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_russian_upos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ru| |Size:|665.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/KoichiYasuoka/bert-base-russian-upos - https://universaldependencies.org/ru/ - https://universaldependencies.org/u/pos/ - https://github.com/KoichiYasuoka/esupar --- layout: model title: Word Embeddings for Hindi (hindi_cc_300d) author: John Snow Labs name: hindi_cc_300d date: 2021-02-03 task: Embeddings language: hi edition: Spark NLP 2.7.2 spark_version: 2.4 tags: [embeddings, open_source, hi] supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained on Common Crawl and Wikipedia using fastText. It is trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives. The model gives 300 dimensional vector outputs per token. The output vectors map words into a meaningful space where the distance between the vectors is related to semantic similarity of words. These embeddings can be used in multiple tasks like semantic word similarity, named entity recognition, sentiment analysis, and classification. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/hindi_cc_300d_hi_2.7.2_2.4_1612362695785.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/hindi_cc_300d_hi_2.7.2_2.4_1612362695785.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of a pipeline after tokenization.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = WordEmbeddingsModel.pretrained("hindi_cc_300d", "hi") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("hi.embed").predict("""Put your text here.""") ```
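The description notes that distances between these 300-dimensional vectors track semantic similarity. A self-contained sketch of the standard measure, cosine similarity, using toy low-dimensional vectors as stand-ins rather than real model output:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors:
    # close to 1.0 for similar words, near 0.0 for unrelated ones.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional stand-ins for the model's 300-dimensional vectors.
king = [0.9, 0.1, 0.3]
queen = [0.85, 0.15, 0.35]
apple = [0.1, 0.9, 0.2]

# A word should score higher against a related word than an unrelated one.
print(cosine_similarity(king, queen) > cosine_similarity(king, apple))
```

The same computation applied to the `embeddings` column of the pipeline output is what powers semantic-similarity use cases mentioned above.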
## Results ```bash The model gives 300 dimensional feature vector output per token. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|hindi_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 2.7.2+| |License:|Open Source| |Input Labels:|[document, token]| |Output Labels:|[word_embeddings]| |Language:|hi| |Case sensitive:|false| |Dimension:|300| ## Data Source This model is imported from https://fasttext.cc/docs/en/crawl-vectors.html --- layout: model title: English asr_wav2vec2_thai_ASR TFWav2Vec2ForCTC from Rattana author: John Snow Labs name: pipeline_asr_wav2vec2_thai_ASR date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_thai_ASR` is an English model originally trained by Rattana. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_thai_ASR_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_thai_ASR_en_4.2.0_3.0_1664112707199.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_thai_ASR_en_4.2.0_3.0_1664112707199.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_thai_ASR', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_thai_ASR", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_thai_ASR| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten TFWav2Vec2ForCTC from patrickvonplaten author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten` is an English model originally trained by patrickvonplaten. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten_en_4.2.0_3.0_1664114081296.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten_en_4.2.0_3.0_1664114081296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|349.3 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English asr_temp TFWav2Vec2ForCTC from ying-tina author: John Snow Labs name: pipeline_asr_temp date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_temp` is an English model originally trained by ying-tina. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_temp_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_temp_en_4.2.0_3.0_1664110837777.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_temp_en_4.2.0_3.0_1664110837777.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_temp', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_temp", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_temp| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English ElectraForQuestionAnswering model (from sultan) author: John Snow Labs name: electra_qa_BioM_Base_SQuAD2 date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BioM-ELECTRA-Base-SQuAD2` is an English model originally trained by `sultan`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_BioM_Base_SQuAD2_en_4.0.0_3.0_1655918898262.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_BioM_Base_SQuAD2_en_4.0.0_3.0_1655918898262.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_BioM_Base_SQuAD2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_BioM_Base_SQuAD2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.electra.base.by_sultan").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")  ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_BioM_Base_SQuAD2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|403.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/sultan/BioM-ELECTRA-Base-SQuAD2 - https://github.com/salrowili/BioM-Transformers --- layout: model title: NER Pipeline for Clinical Problems (reduced taxonomy) - Voice of the Patient author: John Snow Labs name: ner_vop_problem_reduced_pipeline date: 2023-06-10 tags: [licensed, pipeline, ner, en, vop, problem] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline extracts mentions of clinical problems from health-related text in colloquial language. All problem entities are merged into one generic Problem class. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_reduced_pipeline_en_4.4.3_3.0_1686420051472.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_reduced_pipeline_en_4.4.3_3.0_1686420051472.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_vop_problem_reduced_pipeline", "en", "clinical/models") pipeline.annotate(" I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms. ") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_vop_problem_reduced_pipeline", "en", "clinical/models") val result = pipeline.annotate(" I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms. ") ```
## Results ```bash | chunk | ner_label | |:---------------------|:------------| | pain | Problem | | fatigue | Problem | | rheumatoid arthritis | Problem | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_problem_reduced_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|791.6 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: NER Model for 10 High Resourced Languages author: John Snow Labs name: xlm_roberta_large_token_classifier_hrl date: 2021-12-26 tags: [arabic, german, english, spanish, french, italian, latvian, dutch, portuguese, chinese, xlm, roberta, ner, xx, open_source] task: Named Entity Recognition language: xx edition: Spark NLP 3.3.4 spark_version: 2.4 supported: true recommended: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model was imported from `Hugging Face` and it's been fine-tuned for 10 high resourced languages (Arabic, German, English, Spanish, French, Italian, Latvian, Dutch, Portuguese and Chinese), leveraging `XLM-RoBERTa` embeddings and `XlmRobertaForTokenClassification` for NER purposes. 
## Predicted Entities `ORG`, `PER`, `LOC` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_HRL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_HRL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_hrl_xx_3.3.4_2.4_1640520352673.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_hrl_xx_3.3.4_2.4_1640520352673.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlm_roberta_large_token_classifier_hrl", "xx")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = """يمكنكم مشاهدة أمير منطقة الرياض الأمير فيصل بن بندر بن عبد العزيز في كل مناسبة وافتتاح تتعلق بمشاريع التعليم والصحة وخدمة الطرق والمشاريع الثقافية في منطقة الرياض.""" result = model.transform(spark.createDataFrame([[text]]).toDF("text")) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlm_roberta_large_token_classifier_hrl", "xx") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val example = Seq("يمكنكم مشاهدة أمير منطقة الرياض الأمير فيصل بن بندر بن عبد العزيز في كل مناسبة وافتتاح تتعلق بمشاريع التعليم والصحة وخدمة الطرق والمشاريع الثقافية في منطقة الرياض.").toDF("text") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("xx.ner.high_resourced_lang").predict("""يمكنكم مشاهدة أمير منطقة الرياض الأمير فيصل بن بندر بن عبد العزيز في كل مناسبة وافتتاح تتعلق بمشاريع التعليم والصحة وخدمة الطرق والمشاريع الثقافية في منطقة الرياض.""") ```
## Results ```bash +---------------------------+---------+ |chunk |ner_label| +---------------------------+---------+ |الرياض |LOC | |فيصل بن بندر بن عبد العزيز |PER | |الرياض |LOC | +---------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_large_token_classifier_hrl| |Compatibility:|Spark NLP 3.3.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|1.8 GB| |Case sensitive:|true| |Max sentence length:|256| ## Data Source [https://huggingface.co/Davlan/xlm-roberta-large-ner-hrl](https://huggingface.co/Davlan/xlm-roberta-large-ner-hrl) --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4CHEMD_Chem_Original_SciBERT_384 date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Chem-Original-SciBERT-384` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Original_SciBERT_384_en_4.0.0_3.0_1657108742652.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Original_SciBERT_384_en_4.0.0_3.0_1657108742652.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Original_SciBERT_384","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Original_SciBERT_384","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4CHEMD_Chem_Original_SciBERT_384| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|410.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4CHEMD-Chem-Original-SciBERT-384 --- layout: model title: Legal Agreement and Plan of Reorganization Document Classifier (Longformer) author: John Snow Labs name: legclf_agreement_and_plan_of_reorganization date: 2022-12-06 tags: [en, legal, classification, agreement, reorganization, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_agreement_and_plan_of_reorganization` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `agreement-and-plan-of-reorganization` or not (Binary Classification). Longformer models are limited to 4096 tokens, so only the first 4096 tokens of each document are taken into account. We have found that for the large majority of documents in legal corpora, provided they are clean and contain only the legal document without extra preamble, 4096 tokens are enough for Document Classification. If that is not the case for your data, let us know and we can apply another approach: splitting the document into 4096-token chunks, averaging their embeddings, and training on the averaged version, which means the whole document is taken into account. In theory, however, this should not be required.
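The chunk-and-average fallback mentioned above can be illustrated in plain Python. This is a hedged sketch with a hypothetical helper name, not Spark NLP code: it averages each 4096-token chunk's embeddings and then averages the chunk means, so every part of the document contributes to the final vector.

```python
def chunk_and_average(token_embeddings, chunk_size=4096):
    """Average per-chunk mean embeddings so the whole document is represented.

    token_embeddings: list of equal-length float vectors, one per token.
    Returns a single document-level vector (empty list for empty input).
    """
    if not token_embeddings:
        return []
    dim = len(token_embeddings[0])
    chunk_means = []
    for start in range(0, len(token_embeddings), chunk_size):
        chunk = token_embeddings[start:start + chunk_size]
        # Mean vector of this chunk, dimension by dimension.
        chunk_means.append([sum(vec[i] for vec in chunk) / len(chunk) for i in range(dim)])
    # Second-level average over the chunk means.
    return [sum(mean[i] for mean in chunk_means) / len(chunk_means) for i in range(dim)]
```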
## Predicted Entities `agreement-and-plan-of-reorganization`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_agreement_and_plan_of_reorganization_en_1.0.0_3.0_1670357495950.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_agreement_and_plan_of_reorganization_en_1.0.0_3.0_1670357495950.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_agreement_and_plan_of_reorganization", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[agreement-and-plan-of-reorganization]| |[other]| |[other]| |[agreement-and-plan-of-reorganization]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_agreement_and_plan_of_reorganization| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support agreement-and-plan-of-reorganization 1.00 0.92 0.96 52 other 0.97 1.00 0.98 111 accuracy - - 0.98 163 macro-avg 0.98 0.96 0.97 163 weighted-avg 0.98 0.98 0.98 163 ``` --- layout: model title: English Electra Embeddings (from google) author: John Snow Labs name: electra_embeddings_electra_large_generator date: 2022-05-17 tags: [en, open_source, electra, embeddings] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-large-generator` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_large_generator_en_3.4.4_3.0_1652786652489.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_large_generator_en_3.4.4_3.0_1652786652489.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_large_generator","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_large_generator","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electra_large_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|192.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/google/electra-large-generator - https://arxiv.org/pdf/1406.2661.pdf - https://rajpurkar.github.io/SQuAD-explorer/ - https://openreview.net/pdf?id=r1xMH1BtvB - https://gluebenchmark.com/ - https://rajpurkar.github.io/SQuAD-explorer/ - https://www.clips.uantwerpen.be/conll2000/chunking/ --- layout: model title: English DistilBertForQuestionAnswering model (from aszidon) Custom Version-4 author: John Snow Labs name: distilbert_qa_custom4 date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbertcustom4` is a English model originally trained by `aszidon`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom4_en_4.0.0_3.0_1654728016252.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom4_en_4.0.0_3.0_1654728016252.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.custom4.by_aszidon").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_custom4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/aszidon/distilbertcustom4 --- layout: model title: Entity Recognizer LG author: John Snow Labs name: entity_recognizer_lg date: 2022-06-25 tags: [ru, open_source] task: Named Entity Recognition language: ru edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_lg is a pretrained pipeline that performs basic text processing steps and recognizes entities. It covers most of the common text processing tasks on your dataframe. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_ru_4.0.0_3.0_1656125353536.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_ru_4.0.0_3.0_1656125353536.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("entity_recognizer_lg", "ru") result = pipeline.annotate("""I love johnsnowlabs! """) ``` {:.nlu-block} ```python import nlu nlu.load("ru.ner.lg").predict("""I love johnsnowlabs! """) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_lg| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|ru| |Size:|2.5 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - NerDLModel - NerConverter --- layout: model title: Legal Competition Document Classifier (EURLEX) author: John Snow Labs name: legclf_competition_bert date: 2023-03-06 tags: [en, legal, classification, clauses, competition, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_competition_bert model, a Bert Sentence Embeddings Document Classifier, classifies whether the document belongs to the class Competition or not (Binary Classification) according to EuroVoc labels. ## Predicted Entities `Competition`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_competition_bert_en_1.0.0_3.0_1678111605884.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_competition_bert_en_1.0.0_3.0_1678111605884.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_competition_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Competition]| |[Other]| |[Other]| |[Competition]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_competition_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.7 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Competition 0.93 0.87 0.9 363 Other 0.87 0.92 0.9 333 accuracy - - 0.9 696 macro-avg 0.90 0.90 0.9 696 weighted-avg 0.90 0.90 0.9 696 ``` --- layout: model title: Pipeline to Detect Adverse Drug Events (MedicalBertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_ade_binary_pipeline date: 2023-03-20 tags: [clinical, ade, licensed, public_health, token_classification, ner, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [bert_token_classifier_ner_ade_binary](https://nlp.johnsnowlabs.com/2022/07/27/bert_token_classifier_ner_ade_binary_en_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ade_binary_pipeline_en_4.3.0_3.2_1679299868936.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ade_binary_pipeline_en_4.3.0_3.2_1679299868936.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_ade_binary_pipeline", "en", "clinical/models") text = '''I used to be on paxil but that made me more depressed and prozac made me angry, Maybe cos of the insulin blocking effect of seroquel but i do feel sugar crashes when eat fast carbs.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_ade_binary_pipeline", "en", "clinical/models") val text = "I used to be on paxil but that made me more depressed and prozac made me angry, Maybe cos of the insulin blocking effect of seroquel but i do feel sugar crashes when eat fast carbs." val result = pipeline.fullAnnotate(text) ```
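The `fullAnnotate` output can be post-processed in plain Python, for example to keep only high-confidence chunks. A hedged sketch: the dict layout below mirrors the chunk/label/confidence columns this pipeline returns (values taken from the sample sentence), and `filter_by_confidence` is a hypothetical helper, not part of the pipeline API.

```python
def filter_by_confidence(chunks, threshold=0.95):
    """Keep only entity chunks whose confidence meets the threshold."""
    return [c for c in chunks if c["confidence"] >= threshold]

rows = [
    {"ner_chunk": "depressed", "ner_label": "ADE", "confidence": 0.990846},
    {"ner_chunk": "angry", "ner_label": "ADE", "confidence": 0.972025},
    {"ner_chunk": "sugar crashes", "ner_label": "ADE", "confidence": 0.933623},
]
kept = filter_by_confidence(rows)  # drops "sugar crashes" at the 0.95 threshold
```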
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:--------------|--------:|------:|:------------|-------------:| | 0 | depressed | 44 | 52 | ADE | 0.990846 | | 1 | angry | 73 | 77 | ADE | 0.972025 | | 2 | sugar crashes | 147 | 159 | ADE | 0.933623 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_ade_binary_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Portuguese asr_bp_tedx100_xlsr TFWav2Vec2ForCTC from lgris author: John Snow Labs name: asr_bp_tedx100_xlsr date: 2022-09-26 tags: [wav2vec2, pt, audio, open_source, asr] task: Automatic Speech Recognition language: pt edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_bp_tedx100_xlsr` is a Portuguese model originally trained by lgris. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_bp_tedx100_xlsr_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_bp_tedx100_xlsr_pt_4.2.0_3.0_1664192496842.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_bp_tedx100_xlsr_pt_4.2.0_3.0_1664192496842.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_bp_tedx100_xlsr", "pt")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_bp_tedx100_xlsr", "pt") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
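Wav2Vec2ForCTC predicts one token per audio frame, and the transcript comes from CTC decoding, which merges repeated tokens and removes blanks. Spark NLP handles this internally; the sketch below (with an assumed blank id of 0) only illustrates the greedy collapse step:

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse frame-level CTC predictions: merge consecutive repeats, drop blanks."""
    collapsed = []
    previous = None
    for token in frame_ids:
        # Keep a token only when it starts a new run and is not the blank symbol.
        if token != previous and token != blank_id:
            collapsed.append(token)
        previous = token
    return collapsed
```

Note that a blank between two identical tokens separates them into two runs, which is how CTC can emit doubled letters.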
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_bp_tedx100_xlsr| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|pt| |Size:|756.0 MB| --- layout: model title: Detect Pathogen, Medical Condition and Medicine author: John Snow Labs name: ner_pathogen date: 2022-06-28 tags: [licensed, clinical, en, ner, pathogen, medical_condition, medicine] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained named entity recognition (NER) model is a deep learning model for detecting medical conditions (influenza, headache, malaria, etc.), medicine (aspirin, penicillin, methotrexate) and pathogens (Corona Virus, Zika Virus, E. Coli, etc.) in clinical texts. It is trained using the `MedicalNerApproach` annotator, which allows training generic NER models based on neural networks. ## Predicted Entities `Pathogen`, `MedicalCondition`, `Medicine` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_pathogen_en_4.0.0_3.0_1656419618392.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_pathogen_en_4.0.0_3.0_1656419618392.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl_healthcare", "en", 'clinical/models') \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical" ,"en", "clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_pathogen", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner_model, ner_converter]) data = spark.createDataFrame([["""Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. This can progress to loss of skin color, a fast heart rate as it becomes more severe. 
While it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical" ,"en", "clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_pathogen", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter)) val data = Seq("""Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. This can progress to loss of skin color, a fast heart rate as it becomes more severe. While it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.pathogen").predict("""Racecadotril is an antisecretory medication and it has better tolerability than loperamide. 
Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. This can progress to loss of skin color, a fast heart rate as it becomes more severe. While it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions.""") ```
## Results ```bash +---------------+----------------+ |chunk |ner_label | +---------------+----------------+ |Racecadotril |Medicine | |loperamide |Medicine | |Diarrhea |MedicalCondition| |dehydration |MedicalCondition| |skin color |MedicalCondition| |fast heart rate|MedicalCondition| |rabies virus |Pathogen | |Lyssavirus |Pathogen | |Ephemerovirus |Pathogen | +---------------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_pathogen| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|14.6 MB| ## References Trained on [dataset](https://www.kaggle.com/datasets/finalepoch/medical-ner) to get a model for Named Entity Recognition. ## Benchmarking ```bash label tp fp fn total precision recall f1 Pathogen 15.0 3.0 9.0 24.0 0.8333 0.625 0.7143 Medicine 15.0 2.0 0.0 15.0 0.8824 1.0 0.9375 MedicalCondition 53.0 2.0 6.0 59.0 0.9636 0.8983 0.9298 macro - - - - - - 0.8605 micro - - - - - - 0.8782 ``` --- layout: model title: Russian BertForQuestionAnswering Cased model (from ruselkomp) author: John Snow Labs name: bert_qa_deep_pavlov_full date: 2022-07-07 tags: [ru, open_source, bert, question_answering] task: Question Answering language: ru edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deep-pavlov-full` is a Russian model originally trained by `ruselkomp`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_deep_pavlov_full_ru_4.0.0_3.0_1657189256159.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_deep_pavlov_full_ru_4.0.0_3.0_1657189256159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_deep_pavlov_full","ru") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Как меня зовут?", "Меня зовут Клара, и я живу в Беркли."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_deep_pavlov_full","ru") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Как меня зовут?", "Меня зовут Клара, и я живу в Беркли.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_deep_pavlov_full| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ru| |Size:|665.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ruselkomp/deep-pavlov-full --- layout: model title: Pipeline to Detect Living Species author: John Snow Labs name: ner_living_species_biobert_pipeline date: 2023-03-20 tags: [ner, en, clinical, licensed, biobert] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_living_species_biobert](https://nlp.johnsnowlabs.com/2022/06/22/ner_living_species_biobert_en_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_biobert_pipeline_en_4.3.0_3.2_1679309343209.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_biobert_pipeline_en_4.3.0_3.2_1679309343209.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_living_species_biobert_pipeline", "en", "clinical/models") text = '''42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_living_species_biobert_pipeline", "en", "clinical/models") val text = "42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. 
In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:------------------------|--------:|------:|:------------|-------------:| | 0 | woman | 12 | 16 | HUMAN | 0.9999 | | 1 | bacterial | 145 | 153 | SPECIES | 0.9981 | | 2 | Fusarium spp | 337 | 348 | SPECIES | 0.9873 | | 3 | patient | 355 | 361 | HUMAN | 0.9991 | | 4 | species | 507 | 513 | SPECIES | 0.9926 | | 5 | Fusarium solani complex | 522 | 544 | SPECIES | 0.8422 | | 6 | antifungals | 792 | 802 | SPECIES | 0.9929 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.3 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from iis2009002) author: John Snow Labs name: xlmroberta_ner_iis2009002_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `iis2009002`. 
## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_iis2009002_base_finetuned_panx_de_4.1.0_3.0_1660433851832.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_iis2009002_base_finetuned_panx_de_4.1.0_3.0_1660433851832.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_iis2009002_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_iis2009002_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_iis2009002_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/iis2009002/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: English BertForQuestionAnswering Cased model (from yossra) author: John Snow Labs name: bert_qa_yossra_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `yossra`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_yossra_finetuned_squad_en_4.0.0_3.0_1657186848644.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_yossra_finetuned_squad_en_4.0.0_3.0_1657186848644.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_yossra_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_yossra_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_yossra_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/yossra/bert-finetuned-squad --- layout: model title: English asr_wav2vec2_base_timit_demo_colab40 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab40 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab40` is an English model originally trained by hassnain. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab40_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab40_en_4.2.0_3.0_1664020850617.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab40_en_4.2.0_3.0_1664020850617.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab40", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab40", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab40| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: Finnish asr_wav2vec2_xlsr_train_aug_bigLM_1B TFWav2Vec2ForCTC from RASMUS author: John Snow Labs name: pipeline_asr_wav2vec2_xlsr_train_aug_bigLM_1B date: 2022-09-25 tags: [wav2vec2, fi, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_train_aug_bigLM_1B` is a Finnish model originally trained by RASMUS. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xlsr_train_aug_bigLM_1B_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_train_aug_bigLM_1B_fi_4.2.0_3.0_1664097620804.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_train_aug_bigLM_1B_fi_4.2.0_3.0_1664097620804.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xlsr_train_aug_bigLM_1B', lang = 'fi') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xlsr_train_aug_bigLM_1B", lang = "fi") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xlsr_train_aug_bigLM_1B| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| |Size:|3.6 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal Parties Clause Binary Classifier author: John Snow Labs name: legclf_parties_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `parties` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at the sentence level. If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
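The paragraph splitting and 512-token check described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the workshop notebook's exact code; the whitespace token count used here is only a rough proxy for the model's actual tokenizer.

```python
import re

def split_paragraphs(text, max_tokens=512):
    """Split a legal document into paragraph-level chunks on blank lines,
    so each chunk can be classified independently, and flag chunks whose
    rough (whitespace) token count exceeds the embedding limit."""
    chunks = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    too_long = [c for c in chunks if len(c.split()) > max_tokens]
    return chunks, too_long

doc = ("THIS AGREEMENT is made by and between Party A and Party B.\n\n"
       "IN WITNESS WHEREOF, the parties have executed this Agreement.")
chunks, too_long = split_paragraphs(doc)
print(len(chunks))   # number of paragraph-level chunks
print(len(too_long)) # chunks likely to exceed the 512-token limit
```

Each resulting chunk can then be passed as `clause_text` to the classifier pipeline shown below, instead of the whole document at once.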
## Predicted Entities `other`, `parties` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_parties_clause_en_1.0.0_3.2_1660123811373.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_parties_clause_en_1.0.0_3.2_1660123811373.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_parties_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +---------+ |   result| +---------+ |[parties]| |  [other]| |  [other]| |[parties]| +---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_parties_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.95 0.97 0.96 91 parties 0.90 0.84 0.87 32 accuracy - - 0.93 123 macro-avg 0.92 0.91 0.91 123 weighted-avg 0.93 0.93 0.93 123 ``` --- layout: model title: English LongformerForQuestionAnswering model (from Nomi97) author: John Snow Labs name: longformer_qa_Chatbot date: 2022-06-26 tags: [en, open_source, longformer, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: LongformerForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Chatbot_QA` is an English model originally trained by `Nomi97`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/longformer_qa_Chatbot_en_4.0.0_3.0_1656255131812.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/longformer_qa_Chatbot_en_4.0.0_3.0_1656255131812.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_Chatbot","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_Chatbot","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.longformer.by_Nomi97").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|longformer_qa_Chatbot| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|546.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Nomi97/Chatbot_QA --- layout: model title: Part of Speech for Arabic author: John Snow Labs name: pos_ud_padt date: 2021-03-09 tags: [part_of_speech, open_source, arabic, pos_ud_padt, ar] task: Part of Speech Tagging language: ar edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - X - VERB - NOUN - ADJ - ADP - PUNCT - NUM - None - PRON - SCONJ - CCONJ - DET - PART - ADV - SYM - AUX - PROPN - INTJ {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_padt_ar_3.0.0_3.0_1615292251530.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_padt_ar_3.0.0_3.0_1615292251530.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") pos_tagger = PerceptronModel.pretrained("pos_ud_padt", "ar") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos_tagger ]) example = spark.createDataFrame([['مرحبا من جون سنو مختبرات! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val posTagger = PerceptronModel.pretrained("pos_ud_padt", "ar") .setInputCols("sentence", "token") .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, posTagger)) val data = Seq("مرحبا من جون سنو مختبرات! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["مرحبا من جون سنو مختبرات! "] token_df = nlu.load('ar.pos').predict(text) token_df ```
## Results ```bash token pos 0 مرحبا NOUN 1 من ADP 2 جون X 3 سنو X 4 مختبرات NOUN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_padt| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|ar| --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_bert_small_finetuned_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-small-finetuned-squad` is a English model orginally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_small_finetuned_squad_en_4.0.0_3.0_1654184762850.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_small_finetuned_squad_en_4.0.0_3.0_1654184762850.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_small_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_small_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.small.by_anas-awadalla").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_small_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|107.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-small-finetuned-squad --- layout: model title: Translate Artificial languages to English Pipeline author: John Snow Labs name: translate_art_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, art, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `art` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_art_en_xx_2.7.0_2.4_1609686264952.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_art_en_xx_2.7.0_2.4_1609686264952.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_art_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_art_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.art.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_art_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from mrm8488)
author: John Snow Labs
name: t5_base_finetuned_squadv2
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-finetuned-squadv2` is an English model originally trained by `mrm8488`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_squadv2_en_4.3.0_3.0_1675109106503.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_squadv2_en_4.3.0_3.0_1675109106503.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_base_finetuned_squadv2", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_base_finetuned_squadv2", "en")
  .setInputCols("document")
  .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_base_finetuned_squadv2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|861.2 MB| ## References - https://huggingface.co/mrm8488/t5-base-finetuned-squadv2 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://rajpurkar.github.io/SQuAD-explorer/ - https://arxiv.org/pdf/1910.10683.pdf - https://i.imgur.com/jVFMMWR.png - https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb - https://twitter.com/mrm8488 - https://www.linkedin.com/in/manuel-romero-cs/ --- layout: model title: Spanish BertForMaskedLM Base Cased model (from dccuchile) author: John Snow Labs name: bert_embeddings_base_spanish_wwm_cased date: 2022-12-02 tags: [es, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: es edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-cased` is a Spanish model originally trained by `dccuchile`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_spanish_wwm_cased_es_4.2.4_3.0_1670018860888.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_spanish_wwm_cased_es_4.2.4_3.0_1670018860888.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_spanish_wwm_cased", "es") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_spanish_wwm_cased", "es")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_spanish_wwm_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|412.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased - https://github.com/google-research/bert - https://github.com/josecannete/spanish-corpora - https://github.com/google-research/bert/blob/master/multilingual.md - https://users.dcc.uchile.cl/~jperez/beto/uncased_2M/tensorflow_weights.tar.gz - https://users.dcc.uchile.cl/~jperez/beto/uncased_2M/pytorch_weights.tar.gz - https://users.dcc.uchile.cl/~jperez/beto/cased_2M/tensorflow_weights.tar.gz - https://users.dcc.uchile.cl/~jperez/beto/cased_2M/pytorch_weights.tar.gz - https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1827 - https://www.kaggle.com/nltkdata/conll-corpora - https://github.com/gchaperon/beto-benchmarks/blob/master/conll2002/dev_results_beto-cased_conll2002.txt - https://github.com/facebookresearch/MLDoc - https://github.com/gchaperon/beto-benchmarks/blob/master/MLDoc/dev_results_beto-cased_mldoc.txt - https://github.com/gchaperon/beto-benchmarks/blob/master/MLDoc/dev_results_beto-uncased_mldoc.txt - https://github.com/google-research-datasets/paws/tree/master/pawsx - https://github.com/facebookresearch/XNLI - https://colab.research.google.com/drive/1uRwg4UmPgYIqGYY4gW_Nsw9782GFJbPt - https://www.adere.so/ - https://imfd.cl/en/ - https://www.tensorflow.org/tfrc - https://users.dcc.uchile.cl/~jperez/papers/pml4dc2020.pdf - https://github.com/google-research/bert/blob/master/multilingual.md - https://arxiv.org/pdf/1904.09077.pdf - https://arxiv.org/pdf/1906.01502.pdf - https://arxiv.org/abs/1812.10464 - https://arxiv.org/pdf/1901.07291.pdf - https://arxiv.org/pdf/1904.02099.pdf - https://arxiv.org/pdf/1906.01569.pdf - https://arxiv.org/abs/1908.11828 --- layout: model 
title: Detect Normalized Genes and Human Phenotypes author: John Snow Labs name: ner_human_phenotype_go_clinical date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model can be used to detect normalized mentions of genes (go) and human phenotypes (hp) in medical text. ## Predicted Entities : `GO`, `HP` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_HUMAN_PHENOTYPE_GO_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_go_clinical_en_3.0.0_3.0_1617209694955.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_go_clinical_en_3.0.0_3.0_1617209694955.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_human_phenotype_go_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("entities") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate("Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.") ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_human_phenotype_go_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter 
= new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("entities") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.human_phenotype.go_clinical").predict("""Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.""") ```
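The micro- and macro-averaged rows in the Benchmarking table of this card follow directly from the per-label tp/fp/fn counts. A minimal, plain-Python sketch (numbers copied from the table; not part of the Spark NLP API):

```python
# Per-label (tp, fp, fn) counts from the Benchmarking table below.
counts = {
    "B-GO": (1530, 129, 57),
    "B-HP": (950, 133, 130),
    "I-HP": (253, 46, 68),
    "I-GO": (4550, 344, 154),
}

# Micro-average: pool the counts across labels, then compute the scores once.
tp = sum(c[0] for c in counts.values())  # 7283
fp = sum(c[1] for c in counts.values())  # 652
fn = sum(c[2] for c in counts.values())  # 409

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Macro-average would instead be the unweighted mean of the per-label rows.
print(round(precision, 6), round(recall, 6), round(f1, 6))
# → 0.917832 0.946828 0.932105 (the Micro-average row)
```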
## Results

```bash
+----+--------------------------+---------+-------+----------+
|    | chunk                    |   begin |   end | entity   |
+====+==========================+=========+=======+==========+
|  0 | tumor                    |      39 |    43 | HP       |
+----+--------------------------+---------+-------+----------+
|  1 | tricarboxylic acid cycle |      79 |   102 | GO       |
+----+--------------------------+---------+-------+----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_human_phenotype_go_clinical|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|

## Benchmarking

```bash
|    |         label |   tp |  fp |  fn |     prec |      rec |       f1 |
|---:|--------------:|-----:|----:|----:|---------:|---------:|---------:|
|  0 |          B-GO | 1530 | 129 |  57 | 0.922242 | 0.964083 | 0.942699 |
|  1 |          B-HP |  950 | 133 | 130 | 0.877193 | 0.87963  | 0.87841  |
|  2 |          I-HP |  253 |  46 |  68 | 0.846154 | 0.788162 | 0.816129 |
|  3 |          I-GO | 4550 | 344 | 154 | 0.92971  | 0.967262 | 0.948114 |
|  4 | Macro-average | 7283 | 652 | 409 | 0.893825 | 0.899784 | 0.896795 |
|  5 | Micro-average | 7283 | 652 | 409 | 0.917832 | 0.946828 | 0.932105 |
```
---
layout: model
title: English asr_wav2vec_large_xlsr_korean TFWav2Vec2ForCTC from fleek
author: John Snow Labs
name: pipeline_asr_wav2vec_large_xlsr_korean
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec_large_xlsr_korean` is an English model originally trained by fleek.
NOTE: This pipeline only works on a CPU; if you need to run this pipeline on a GPU device, please use pipeline_asr_wav2vec_large_xlsr_korean_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec_large_xlsr_korean_en_4.2.0_3.0_1664098611361.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec_large_xlsr_korean_en_4.2.0_3.0_1664098611361.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec_large_xlsr_korean', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec_large_xlsr_korean", lang = "en") val annotations = pipeline.transform(audioDF) ```
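The `audioDF` above is assumed to already hold the raw audio as an array of floats in an `audio_content` column (Wav2Vec2 models are typically trained on 16 kHz mono audio). As a hedged, stdlib-only sketch of producing such floats from a 16-bit PCM WAV file (the helper name and the synthetic tone are illustrative, not part of Spark NLP):

```python
import math
import struct
import wave

def wav_to_floats(path):
    """Decode a 16-bit PCM mono WAV file into floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        assert wf.getnchannels() == 1 and wf.getsampwidth() == 2
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Self-contained demo: write one second of a 440 Hz tone at 16 kHz, read it back.
rate = 16000
pcm = [int(32767 * math.sin(2 * math.pi * 440 * t / rate)) for t in range(rate)]
with wave.open("tone.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(rate)
    wf.writeframes(struct.pack("<%dh" % len(pcm), *pcm))

floats = wav_to_floats("tone.wav")
print(len(floats))  # 16000
```

In a Spark session, these floats could then be placed into the expected column with something like `spark.createDataFrame([(floats,)], ["audio_content"])` (column name assumed from the model examples in this document).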
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec_large_xlsr_korean| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Fast Neural Machine Translation Model from Turkic Languages to English author: John Snow Labs name: opus_mt_trk_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, trk, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `trk` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_trk_en_xx_2.7.0_2.4_1609167597007.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_trk_en_xx_2.7.0_2.4_1609167597007.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_trk_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols("document")
  .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_trk_en", "xx")
  .setInputCols("sentence")
  .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("Your sentence to translate!").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.trk.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_trk_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from shaojie) author: John Snow Labs name: distilbert_qa_shaojie_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is a English model originally trained by `shaojie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_shaojie_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772539534.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_shaojie_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772539534.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_shaojie_base_uncased_finetuned_squad","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_shaojie_base_uncased_finetuned_squad","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_shaojie_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/shaojie/distilbert-base-uncased-finetuned-squad --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from lorenzkuhn) author: John Snow Labs name: roberta_qa_lorenzkuhn_base_finetuned_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad` is a English model originally trained by `lorenzkuhn`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_lorenzkuhn_base_finetuned_squad_en_4.3.0_3.0_1674217419784.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_lorenzkuhn_base_finetuned_squad_en_4.3.0_3.0_1674217419784.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_lorenzkuhn_base_finetuned_squad","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_lorenzkuhn_base_finetuned_squad","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_lorenzkuhn_base_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.2 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/lorenzkuhn/roberta-base-finetuned-squad
---
layout: model
title: Summarize Radiology Reports
author: John Snow Labs
name: summarizer_radiology
date: 2023-04-23
tags: [clinical, licensed, en, summarization, tensorflow, radiology]
task: Summarization
language: en
edition: Healthcare NLP 4.4.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: MedicalSummarizer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is capable of summarizing radiology reports while preserving important information such as imaging tests and findings.

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_radiology_en_4.4.0_3.0_1682218525772.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_radiology_en_4.4.0_3.0_1682218525772.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") summarizer = MedicalSummarizer.pretrained("summarizer_radiology", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("summary")\ .setMaxTextLength(512)\ .setMaxNewTokens(512) pipeline = Pipeline(stages=[ document, summarizer ]) text = """INDICATIONS: Peripheral vascular disease with claudication. RIGHT: 1. Normal arterial imaging of right lower extremity. 2. Peak systolic velocity is normal. 3. Arterial waveform is triphasic. 4. Ankle brachial index is 0.96. LEFT: 1. Normal arterial imaging of left lower extremity. 2. Peak systolic velocity is normal. 3. Arterial waveform is triphasic throughout except in posterior tibial artery where it is biphasic. 4. Ankle brachial index is 1.06. IMPRESSION: Normal arterial imaging of both lower lobes. """ data = spark.createDataFrame([[text]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val summarizer = MedicalSummarizer.pretrained("summarizer_radiology", "en", "clinical/models") .setInputCols("document") .setOutputCol("summary") .setMaxTextLength(512) .setMaxNewTokens(512) val pipeline = new Pipeline().setStages(Array(document_assembler, summarizer)) val text = """INDICATIONS: Peripheral vascular disease with claudication. RIGHT: 1. Normal arterial imaging of right lower extremity. 2. Peak systolic velocity is normal. 3. Arterial waveform is triphasic. 4. Ankle brachial index is 0.96. LEFT: 1. Normal arterial imaging of left lower extremity. 2. Peak systolic velocity is normal. 3. Arterial waveform is triphasic throughout except in posterior tibial artery where it is biphasic. 4. Ankle brachial index is 1.06. IMPRESSION: Normal arterial imaging of both lower lobes. 
""" val data = Seq(text).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results

```bash
The patient has peripheral vascular disease with claudication. The right lower extremity shows normal arterial imaging, but the peak systolic velocity is normal. The arterial waveform is triphasic throughout, except for the posterior tibial artery, which is biphasic. The ankle brachial index is 0.96. The impression is normal arterial imaging of both lower lobes.
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|summarizer_radiology|
|Compatibility:|Healthcare NLP 4.4.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|920.4 MB|
---
layout: model
title: English BertForQuestionAnswering model (from AnonymousSub)
author: John Snow Labs
name: bert_qa_news_pretrain_bert_FT_new_newsqa
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `news_pretrain_bert_FT_new_newsqa` is an English model originally trained by `AnonymousSub`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_news_pretrain_bert_FT_new_newsqa_en_4.0.0_3.0_1654188909680.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_news_pretrain_bert_FT_new_newsqa_en_4.0.0_3.0_1654188909680.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_news_pretrain_bert_FT_new_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_news_pretrain_bert_FT_new_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.bert.new.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_news_pretrain_bert_FT_new_newsqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/AnonymousSub/news_pretrain_bert_FT_new_newsqa
---
layout: model
title: Modern Greek (1453-) asr_greek_lsr_1 TFWav2Vec2ForCTC from skylord
author: John Snow Labs
name: asr_greek_lsr_1
date: 2022-09-25
tags: [wav2vec2, el, audio, open_source, asr]
task: Automatic Speech Recognition
language: el
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_greek_lsr_1` is a Modern Greek (1453-) model originally trained by skylord.

NOTE: This model only works on a CPU; if you need to run this model on a GPU device, please use asr_greek_lsr_1_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_greek_lsr_1_el_4.2.0_3.0_1664110738611.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_greek_lsr_1_el_4.2.0_3.0_1664110738611.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_greek_lsr_1", "el")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_greek_lsr_1", "el") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_greek_lsr_1| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|el| |Size:|1.2 GB| --- layout: model title: English BertForQuestionAnswering model (from healx) author: John Snow Labs name: bert_qa_biomedical_slot_filling_reader_base date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biomedical-slot-filling-reader-base` is an English model originally trained by `healx`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_biomedical_slot_filling_reader_base_en_4.0.0_3.0_1654185786674.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_biomedical_slot_filling_reader_base_en_4.0.0_3.0_1654185786674.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biomedical_slot_filling_reader_base","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_biomedical_slot_filling_reader_base","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bio_medical.bert.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_biomedical_slot_filling_reader_base| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/healx/biomedical-slot-filling-reader-base - https://arxiv.org/abs/2109.08564 --- layout: model title: Legal Dissolution Clause Binary Classifier author: John Snow Labs name: legclf_dissolution_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `dissolution` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `dissolution` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_dissolution_clause_en_1.0.0_3.2_1660122374357.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_dissolution_clause_en_1.0.0_3.2_1660122374357.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_dissolution_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------------+ |result| +-------------+ |[dissolution]| |[other]| |[other]| |[dissolution]| +-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_dissolution_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support dissolution 0.95 0.90 0.92 39 other 0.97 0.99 0.98 137 accuracy - - 0.97 176 macro-avg 0.96 0.94 0.95 176 weighted-avg 0.97 0.97 0.97 176 ``` --- layout: model title: Clinical Portuguese Bert Embeddings (Biomedical) author: John Snow Labs name: biobert_embeddings_biomedical date: 2022-04-11 tags: [biobert, embeddings, pt, open_source] task: Embeddings language: pt edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BioBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `biobertpt-bio` is a Portuguese model originally trained by `pucpr`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/biobert_embeddings_biomedical_pt_3.4.2_3.0_1649687586887.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/biobert_embeddings_biomedical_pt_3.4.2_3.0_1649687586887.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("biobert_embeddings_biomedical","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Odeio o cancro"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("biobert_embeddings_biomedical","pt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Odeio o cancro").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.embed.gs_biomedical").predict("""Odeio o cancro""") ```
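The `embeddings` column produced above holds one vector per token; downstream these vectors are typically compared with cosine similarity. A self-contained illustration of that computation only (the toy 3-dimensional vectors stand in for real 768-dimensional BERT vectors and are not model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy stand-ins for two token vectors:
v1 = [0.8, 0.1, 0.2]
v2 = [0.7, 0.2, 0.25]
similarity = cosine_similarity(v1, v2)  # close to 1.0 for near-parallel vectors
```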
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|biobert_embeddings_biomedical| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|pt| |Size:|667.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/pucpr/biobertpt-bio - https://aclanthology.org/2020.clinicalnlp-1.7/ - https://github.com/HAILab-PUCPR/BioBERTpt --- layout: model title: Adverse Drug Events Classifier (BERT) author: John Snow Labs name: bert_sequence_classifier_ade date: 2022-02-08 tags: [bert, sequence_classification, en, licensed] task: Text Classification language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Classify texts/sentences in two categories: - `True` : The sentence is talking about a possible ADE. - `False` : The sentence doesn’t have any information about an ADE. This model is a [BioBERT-based](https://github.com/dmis-lab/biobert) classifier. ## Predicted Entities `True`, `False` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CLASSIFICATION_ADE/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/08.3.MedicalBertForSequenceClassification_in_SparkNLP.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_ade_en_3.4.1_3.0_1644324436716.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_ade_en_3.4.1_3.0_1644324436716.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_ade", "en", "clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame([["I felt a bit drowsy and had blurred vision after taking Aspirin."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_ade", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq("I felt a bit drowsy and had blurred vision after taking Aspirin.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.ade.seq_biobert").predict("""I felt a bit drowsy and had blurred vision after taking Aspirin.""") ```
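A note on the aggregate scores reported under Benchmarking: the weighted averages are support-weighted means of the per-label rows. A minimal sketch of that arithmetic, with values copied from the table:

```python
def weighted_average(scores, supports):
    """Support-weighted mean of per-label scores, as used for "weighted-avg" rows."""
    return sum(s * n for s, n in zip(scores, supports)) / sum(supports)

# Per-label F1 and support for this model (labels False, True):
f1 = [0.97, 0.86]
support = [6884, 1398]
print(round(weighted_average(f1, support), 2))  # -> 0.95, the reported weighted-avg F1
```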
## Results ```bash +----------------------------------------------------------------+------+ |text |result| +----------------------------------------------------------------+------+ |I felt a bit drowsy and had blurred vision after taking Aspirin.|[True]| +----------------------------------------------------------------+------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_ade| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.0 MB| |Case sensitive:|true| |Max sentence length:|128| ## References This model is trained on a custom dataset comprising CADEC, DRUG-AE and Twimed. ## Benchmarking ```bash label precision recall f1-score support False 0.97 0.97 0.97 6884 True 0.87 0.85 0.86 1398 accuracy 0.95 0.95 0.95 8282 macro-avg 0.92 0.91 0.91 8282 weighted-avg 0.95 0.95 0.95 8282 ``` --- layout: model title: English RobertaForQuestionAnswering (from AyushPJ) author: John Snow Labs name: roberta_qa_ai_club_inductions_21_nlp_roBERTa_base_squad_v2 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ai-club-inductions-21-nlp-roBERTa-base-squad-v2` is an English model originally trained by `AyushPJ`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ai_club_inductions_21_nlp_roBERTa_base_squad_v2_en_4.0.0_3.0_1655727560873.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ai_club_inductions_21_nlp_roBERTa_base_squad_v2_en_4.0.0_3.0_1655727560873.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_ai_club_inductions_21_nlp_roBERTa_base_squad_v2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_ai_club_inductions_21_nlp_roBERTa_base_squad_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base_v2.by_AyushPJ").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_ai_club_inductions_21_nlp_roBERTa_base_squad_v2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|465.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AyushPJ/ai-club-inductions-21-nlp-roBERTa-base-squad-v2 --- layout: model title: Legal Listing Clause Binary Classifier author: John Snow Labs name: legclf_listing_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `listing` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `listing` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_listing_clause_en_1.0.0_3.2_1660123698313.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_listing_clause_en_1.0.0_3.2_1660123698313.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_listing_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +---------+ |result| +---------+ |[listing]| |[other]| |[other]| |[listing]| +---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_listing_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support listing 1.00 0.97 0.99 38 other 0.99 1.00 0.99 92 accuracy - - 0.99 130 macro-avg 0.99 0.99 0.99 130 weighted-avg 0.99 0.99 0.99 130 ``` --- layout: model title: English Named Entity Recognition (from lucifermorninstar011) author: John Snow Labs name: distilbert_ner_autotrain_lucifer_morningstar_job_859227344 date: 2022-05-16 tags: [distilbert, ner, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-lucifer_morningstar_job-859227344` is an English model originally trained by `lucifermorninstar011`. ## Predicted Entities `Job`, `OOV` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_autotrain_lucifer_morningstar_job_859227344_en_3.4.2_3.0_1652721635851.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_autotrain_lucifer_morningstar_job_859227344_en_3.4.2_3.0_1652721635851.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_autotrain_lucifer_morningstar_job_859227344","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_autotrain_lucifer_morningstar_job_859227344","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
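The `ner` column produced above holds one IOB-style tag per token. Grouping those tags back into entity chunks (which Spark NLP normally delegates to its `NerConverter` annotator) boils down to the following logic; this is an illustrative re-implementation, not the library code:

```python
def bio_to_chunks(tokens, tags):
    """Group parallel token/IOB-tag lists into (label, text) entity chunks."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((label, " ".join(current)))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                chunks.append((label, " ".join(current)))
            current, label = [], None
    if current:
        chunks.append((label, " ".join(current)))
    return chunks

print(bio_to_chunks(
    ["I", "work", "as", "a", "data", "scientist"],
    ["O", "O", "O", "O", "B-Job", "I-Job"],
))  # -> [('Job', 'data scientist')]
```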
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_ner_autotrain_lucifer_morningstar_job_859227344| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/lucifermorninstar011/autotrain-lucifer_morningstar_job-859227344 --- layout: model title: Spanish DistilBertForQuestionAnswering model (from CenIA) MLQA author: John Snow Labs name: distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_mlqa date: 2022-06-08 tags: [es, open_source, distilbert, question_answering] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distillbert-base-spanish-uncased-finetuned-qa-mlqa` is a Spanish model originally trained by `CenIA`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_mlqa_es_4.0.0_3.0_1654728089558.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_mlqa_es_4.0.0_3.0_1654728089558.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_mlqa","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["¿Cuál es mi nombre?", "Mi nombre es Clara y vivo en Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_mlqa","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("¿Cuál es mi nombre?", "Mi nombre es Clara y vivo en Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.mlqa.distil_bert.base_uncased").predict("""¿Cuál es mi nombre?|||Mi nombre es Clara y vivo en Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_mlqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|250.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/CenIA/distillbert-base-spanish-uncased-finetuned-qa-mlqa --- layout: model title: English Bert Embeddings Cased model (from aditeyabaral) author: John Snow Labs name: bert_embeddings_carlbert_webex_mlm_spatial date: 2023-02-22 tags: [open_source, bert, bert_embeddings, bertformaskedlm, en, tensorflow] task: Embeddings language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `carlbert-webex-mlm-spatial` is a English model originally trained by `aditeyabaral`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_carlbert_webex_mlm_spatial_en_4.3.0_3.0_1677087512961.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_carlbert_webex_mlm_spatial_en_4.3.0_3.0_1677087512961.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_carlbert_webex_mlm_spatial","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_carlbert_webex_mlm_spatial","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_carlbert_webex_mlm_spatial| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|406.6 MB| |Case sensitive:|true| ## References https://huggingface.co/aditeyabaral/carlbert-webex-mlm-spatial --- layout: model title: Stopwords Remover for Swedish language (386 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, sv, open_source] task: Stop Words Removal language: sv edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_sv_3.4.1_3.0_1646672982973.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_sv_3.4.1_3.0_1646672982973.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","sv") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Du är inte bättre än jag"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","sv") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Du är inte bättre än jag").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("sv.stopwords").predict("""Du är inte bättre än jag""") ```
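Under the hood, a stopwords cleaner keeps only the tokens that are absent from its pretrained stopword list. The core operation can be sketched in plain Python (the stopwords below are an illustrative subset, not the model's full 386-entry list):

```python
def remove_stopwords(tokens, stopwords):
    """Keep only the tokens absent from the stopword list (case-insensitive)."""
    stops = {w.lower() for w in stopwords}
    return [t for t in tokens if t.lower() not in stops]

swedish_stops = ["du", "är", "inte", "än", "jag"]  # illustrative subset
print(remove_stopwords("Du är inte bättre än jag".split(), swedish_stops))  # -> ['bättre']
```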
## Results ```bash +------+ |result| +------+ |[är] | +------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|sv| |Size:|2.5 KB| --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4CHEMD_Chem_Original_BioBERT_512 date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Chem-Original-BioBERT-512` is a English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Original_BioBERT_512_en_4.0.0_3.0_1657108546349.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Original_BioBERT_512_en_4.0.0_3.0_1657108546349.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Original_BioBERT_512","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Original_BioBERT_512","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4CHEMD_Chem_Original_BioBERT_512| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|403.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4CHEMD-Chem-Original-BioBERT-512 --- layout: model title: English image_classifier_vit_trainer_rare_puppers ViTForImageClassification from nateraw author: John Snow Labs name: image_classifier_vit_trainer_rare_puppers date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_trainer_rare_puppers` is an English model originally trained by nateraw. ## Predicted Entities `corgi`, `samoyed`, `shiba inu` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_trainer_rare_puppers_en_4.1.0_3.0_1660169700721.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_trainer_rare_puppers_en_4.1.0_3.0_1660169700721.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_trainer_rare_puppers", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_trainer_rare_puppers", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_trainer_rare_puppers| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Pipeline to Detect PHI (Deidentification) author: John Snow Labs name: ner_deid_large_pipeline date: 2023-03-13 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_deid_large](https://nlp.johnsnowlabs.com/2021/03/31/ner_deid_large_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_large_pipeline_en_4.3.0_3.2_1678736100712.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_large_pipeline_en_4.3.0_3.2_1678736100712.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_large_pipeline", "en", "clinical/models") text = '''HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_large_pipeline", "en", "clinical/models") val text = "HISTORY OF PRESENT ILLNESS: Mr. 
Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.deid_large.pipeline").predict("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. 
On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.""") ```
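Downstream de-identification typically uses the `begin`/`end` character offsets of the detected PHI chunks to mask them in the original text. A minimal plain-Python sketch of that masking step (illustrative only — this is not the Spark NLP de-identification API, and the example text and offsets are hypothetical):

```python
def mask_phi(text, chunks):
    """Replace each detected PHI span with its entity label.

    `chunks` is a list of (begin, end, label) triples with inclusive end
    offsets, as reported in the pipeline results. Spans are replaced from
    right to left so earlier offsets stay valid.
    """
    for begin, end, label in sorted(chunks, key=lambda c: c[0], reverse=True):
        text = text[:begin] + f"<{label}>" + text[end + 1:]
    return text

print(mask_phi("Mr. Smith was seen on 02/04/2003.",
               [(4, 8, "NAME"), (22, 31, "DATE")]))
# Mr. <NAME> was seen on <DATE>.
```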
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:----------------|--------:|------:|:------------|-------------:| | 0 | Smith | 32 | 36 | NAME | 0.9998 | | 1 | VA Hospital | 184 | 194 | LOCATION | 0.68335 | | 2 | Day Hospital | 258 | 269 | LOCATION | 0.7763 | | 3 | 02/04/2003 | 341 | 350 | DATE | 1 | | 4 | Smith | 374 | 378 | NAME | 0.9993 | | 5 | Day Hospital | 397 | 408 | LOCATION | 0.7522 | | 6 | Smith | 782 | 786 | NAME | 0.9998 | | 7 | Smith | 1131 | 1135 | NAME | 0.9997 | | 8 | 7 Ardmore Tower | 1153 | 1167 | LOCATION | 0.739867 | | 9 | Hart | 1221 | 1224 | NAME | 0.9995 | | 10 | Smith | 1231 | 1235 | NAME | 0.9998 | | 11 | 02/07/2003 | 1329 | 1338 | DATE | 1 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_large_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English RoBERTa Embeddings (Sampling strategy 'sim select') author: John Snow Labs name: roberta_embeddings_distilroberta_base_climate_s date: 2022-04-14 tags: [roberta, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilroberta-base-climate-s` is an English model originally trained by `climatebert`. Sampling strategy s: As expressed in the author's paper [here](https://arxiv.org/pdf/2110.12010.pdf), s is "sim select", meaning the 70% most similar sentences of one of the corpora were used, discarding the rest. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distilroberta_base_climate_s_en_3.4.2_3.0_1649946847931.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distilroberta_base_climate_s_en_3.4.2_3.0_1649946847931.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_distilroberta_base_climate_s","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_distilroberta_base_climate_s","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_distilroberta_base_climate_s| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|310.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/climatebert/distilroberta-base-climate-s - https://arxiv.org/abs/2110.12010 --- layout: model title: Stop Words Cleaner for Latvian author: John Snow Labs name: stopwords_lv date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: lv edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, lv] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_lv_lv_2.5.4_2.4_1594742439893.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_lv_lv_2.5.4_2.4_1594742439893.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_lv", "lv") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Džons Snovs ir ne tikai ziemeļu karalis, bet arī angļu ārsts un anestēzijas un medicīniskās higiēnas attīstības līderis.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_lv", "lv") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Džons Snovs ir ne tikai ziemeļu karalis, bet arī angļu ārsts un anestēzijas un medicīniskās higiēnas attīstības līderis.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Džons Snovs ir ne tikai ziemeļu karalis, bet arī angļu ārsts un anestēzijas un medicīniskās higiēnas attīstības līderis."""] stopword_df = nlu.load('lv.stopwords').predict(text) stopword_df[['cleanTokens']] ```
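Conceptually, the cleaner keeps only tokens that do not appear in the language's stop-word dictionary. A minimal plain-Python sketch of that filtering step (the word list below is an illustrative subset, not the actual `stopwords_lv` dictionary):

```python
# Illustrative subset of Latvian stop words (not the model's actual dictionary)
LV_STOPWORDS = {"ir", "ne", "tikai", "bet", "arī", "un"}

def clean_tokens(tokens, stopwords=LV_STOPWORDS, case_sensitive=False):
    """Drop tokens found in the stop-word list, mirroring what the
    StopWordsCleaner annotator does (it is case-insensitive by default)."""
    if case_sensitive:
        return [t for t in tokens if t not in stopwords]
    return [t for t in tokens if t.lower() not in stopwords]

print(clean_tokens(["Džons", "Snovs", "ir", "ne", "tikai", "ziemeļu", "karalis"]))
# ['Džons', 'Snovs', 'ziemeļu', 'karalis']
```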
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=4, result='Džons', metadata={'sentence': '0'}), Row(annotatorType='token', begin=6, end=10, result='Snovs', metadata={'sentence': '0'}), Row(annotatorType='token', begin=24, end=30, result='ziemeļu', metadata={'sentence': '0'}), Row(annotatorType='token', begin=32, end=38, result='karalis', metadata={'sentence': '0'}), Row(annotatorType='token', begin=39, end=39, result=',', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_lv| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|lv| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: French RobertaForQuestionAnswering (from Gantenbein) author: John Snow Labs name: roberta_qa_ADDI_FR_XLM_R date: 2022-06-20 tags: [open_source, question_answering, roberta] task: Question Answering language: fr edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ADDI-FR-XLM-R` is a French model originally trained by `Gantenbein`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ADDI_FR_XLM_R_fr_4.0.0_3.0_1655726482123.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ADDI_FR_XLM_R_fr_4.0.0_3.0_1655726482123.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_ADDI_FR_XLM_R","fr") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_ADDI_FR_XLM_R","fr") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("fr.answer_question.xlm_roberta.fr_tuned.by_Gantenbein").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
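Under the hood, extractive QA models of this kind score every context token as a potential answer start and answer end, and the predicted answer is the highest-scoring valid span. A rough plain-Python sketch of that span-selection step (a simplification with toy scores, not the Spark NLP internals):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) token indices maximizing the summed start and
    end scores, subject to start <= end and a maximum answer length."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best_score, best = score, (i, j)
    return best

# Toy scores: token 1 is the best start, token 2 the best end.
print(best_span([0.1, 5.0, 0.2], [0.0, 0.1, 4.0]))  # (1, 2)
```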
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_ADDI_FR_XLM_R| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|fr| |Size:|422.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Gantenbein/ADDI-FR-XLM-R --- layout: model title: Relation Extraction Between Body Parts and Procedures author: John Snow Labs name: redl_bodypart_procedure_test_biobert date: 2023-01-14 tags: [relation_extraction, en, clinical, dl, licensed, tensorflow] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Relation extraction between body part entities such as `Internal_organ_or_component` and `External_body_part_or_region`, and procedure/test entities. `1`: the body part and the test/procedure are related to each other. `0`: the body part and the test/procedure are not related to each other. ## Predicted Entities `1`, `0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_procedure_test_biobert_en_4.2.4_3.0_1673714088228.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_procedure_test_biobert_en_4.2.4_3.0_1673714088228.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") words_embedder = WordEmbeddingsModel() \ .pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverterInternal() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel() \ .pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") # Set a filter on pairs of named entities which will be treated as relation candidates re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks")\ .setRelationPairs(["external_body_part_or_region-test"]) # This model was trained on sentence-level relations. # Such models can also be trained on document-level relations - in which case, use "document" instead of "sentence" as input at prediction time. 
re_model = RelationExtractionDLModel()\ .pretrained('redl_bodypart_procedure_test_biobert', 'en', "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) data = spark.createDataFrame([['''TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.''']]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols(Array("sentences")) .setOutputCol("tokens") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") 
.setRelationPairs(Array("external_body_part_or_region-test")) // This model was trained on sentence-level relations. // Such models can also be trained on document-level relations - in which case, use "document" instead of "sentence" as input at prediction time. val re_model = RelationExtractionDLModel() .pretrained("redl_bodypart_procedure_test_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("""TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.bodypart.procedure").predict("""TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.""") ```
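The `setPredictionThreshold(0.5)` setting simply discards candidate entity pairs whose relation score falls below the threshold. A conceptual plain-Python sketch (the second candidate and its score are illustrative, not model output):

```python
def filter_relations(candidates, threshold=0.5):
    """Keep (entity1, entity2, score) triples whose relation score clears
    the prediction threshold, mirroring setPredictionThreshold."""
    return [(e1, e2) for e1, e2, score in candidates if score >= threshold]

print(filter_relations([("chest", "portable ultrasound", 0.99953),
                        ("chest", "EKG", 0.31)]))
# [('chest', 'portable ultrasound')]
```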
## Results ```bash | | relation | entity1 | chunk1 | entity2 | chunk2 | confidence | |---:|-----------:|:-----------------------------|:---------|:----------|:--------------------|-------------:| | 0 | 1 | External_body_part_or_region | chest | Test | portable ultrasound | 0.99953 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_bodypart_procedure_test_biobert| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|401.7 MB| ## References Trained on a custom internal dataset. ## Benchmarking ```bash label Recall Precision F1 Support 0 0.338 0.472 0.394 325 1 0.904 0.843 0.872 1275 Avg. 0.621 0.657 0.633 - ``` --- layout: model title: Translate Morisyen to English Pipeline author: John Snow Labs name: translate_mfe_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, mfe, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `mfe` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_mfe_en_xx_2.7.0_2.4_1609688095194.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_mfe_en_xx_2.7.0_2.4_1609688095194.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_mfe_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_mfe_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.mfe.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_mfe_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Multilingual XLMRobertaForTokenClassification Base Cased model (from haesun) author: John Snow Labs name: xlmroberta_ner_haesun_base_finetuned_panx_all date: 2022-08-13 tags: [xx, open_source, xlm_roberta, ner] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-all` is a Multilingual model originally trained by `haesun`. ## Predicted Entities `ORG`, `LOC`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_haesun_base_finetuned_panx_all_xx_4.1.0_3.0_1660428351266.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_haesun_base_finetuned_panx_all_xx_4.1.0_3.0_1660428351266.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_haesun_base_finetuned_panx_all","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_haesun_base_finetuned_panx_all","xx") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
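The `NerConverter` stage groups the classifier's token-level B-/I- tags into entity chunks. A simplified plain-Python version of that grouping (illustrative only — not the Spark NLP implementation):

```python
def bio_to_chunks(tokens, tags):
    """Merge B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" tag or an I- tag that does not continue the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(bio_to_chunks(["John", "Snow", "lives", "in", "London"],
                    ["B-PER", "I-PER", "O", "O", "B-LOC"]))
# [('John Snow', 'PER'), ('London', 'LOC')]
```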
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_haesun_base_finetuned_panx_all| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|862.6 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/haesun/xlm-roberta-base-finetuned-panx-all --- layout: model title: Ocr pipeline with Rest-Api author: John Snow Labs name: ocr_restapi date: 2023-01-03 tags: [en, licensed, ocr, RestApi] task: Ocr RestApi language: en nav_key: models edition: Visual NLP 4.0.0 spark_version: 3.2.1 supported: true annotator: OcrRestApi article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description RestAPI pipeline implementation for the OCR task, using tesseract models. Tesseract is an Optical Character Recognition (OCR) engine developed by Google. It is an open-source tool that can be used to recognize text in images and convert it into machine-readable text. The engine is based on a neural network architecture and uses machine learning algorithms to improve its accuracy over time. Tesseract has been trained on a variety of datasets to improve its recognition capabilities. These datasets include images of text in various languages and scripts, as well as images with different font styles, sizes, and orientations. The training process involves feeding the engine with a large number of images and their corresponding text, allowing the engine to learn the patterns and characteristics of different text styles. One of the most important datasets used in training Tesseract is the UNLV dataset, which contains over 400,000 images of text in different languages, scripts, and font styles. This dataset is widely used in the OCR community and has been instrumental in improving the accuracy of Tesseract. 
Other datasets that have been used in training Tesseract include the ICDAR dataset, the IIIT-HWS dataset, and the RRC-GV-WS dataset. In addition to these datasets, Tesseract also uses a technique called adaptive training, where the engine can continuously improve its recognition capabilities by learning from new images and text. This allows Tesseract to adapt to new text styles and languages, and improve its overall accuracy. ## Predicted Entities {:.btn-box} [Open in Colab](https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/tutorials/Certification_Trainings/6.2.SparkOcrRestApi.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import base64 import json import tempfile import pkg_resources import requests from pyspark.ml import PipelineModel from pyspark.sql import functions as f from pyspark.sql.types import StructType, BinaryType from sparkocr.transformers import * from sparkocr.enums import ImageType binary_to_image = BinaryToImage() \ .setInputCol("content") \ .setOutputCol("image") \ .setImageType(ImageType.TYPE_3BYTE_BGR) ocr = ImageToText() \ .setInputCol("image") \ .setOutputCol("text") pipeline = PipelineModel(stages=[ binary_to_image, ocr ]) ## Start server SERVER_HOST = "localhost" SERVER_PORT = 8889 SERVER_API_NAME = "spark_ocr_api" checkpoint_dir = tempfile.TemporaryDirectory("_spark_ocr_server_checkpoint") df = spark.readStream.server() \ .address(SERVER_HOST, SERVER_PORT, SERVER_API_NAME) \ .load() \ .parseRequest(SERVER_API_NAME, schema=StructType().add("image", BinaryType())) \ .withColumn("path", f.lit("")) \ .withColumnRenamed("image", "content") replies = pipeline.transform(df)\ .makeReply("text") server = replies\ .writeStream \ .server() \ .replyTo(SERVER_API_NAME) \ .queryName("spark_ocr") \ .option("checkpointLocation", checkpoint_dir.name) \ .start() ## Call API imagePath = pkg_resources.resource_filename('sparkocr', '/resources/ocr/images/check.jpg') with open(imagePath, "rb") as image_file: im_bytes = image_file.read() im_b64 = base64.b64encode(im_bytes).decode("utf8") headers = {'Content-type': 'application/json', 'Accept': 'text/plain'} payload = json.dumps({"image": im_b64}) r = requests.post(data=payload, headers=headers, url=f"http://{SERVER_HOST}:{SERVER_PORT}/{SERVER_API_NAME}") ```
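The client call above simply base64-encodes the raw image bytes into a JSON field, and the server decodes the same field back to bytes before OCR. A minimal, standalone sketch of that round trip (using placeholder bytes instead of a real image file):

```python
import base64
import json

# Placeholder for raw image bytes that would be read from disk
im_bytes = b"\x89PNG\r\n\x1a\nfake-image-bytes"

# Client side: bytes -> base64 text -> JSON payload
payload = json.dumps({"image": base64.b64encode(im_bytes).decode("utf8")})

# Server side: JSON payload -> base64 text -> original bytes
decoded = base64.b64decode(json.loads(payload)["image"])
assert decoded == im_bytes
```

Base64 is used because JSON cannot carry arbitrary binary data; the encoding is lossless, so the server recovers the exact bytes the client read from disk.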
## Example ### Input: ![Screenshot](/assets/images/examples_ocr/image2.png) ## Output text ```bash Response: STARBUCKS Store #19208 11902 Euclid Avenue Cleveland, OH (216) 229-U749 CHK 664250 12/07/2014 06:43 PM 112003. Drawers 2. Reg: 2 ¥t Pep Mocha 4.5 Sbux Card 495 AMXARKERARANG 228 Subtotal $4.95 Total $4.95 Change Cue BO LOO - Check Closed ~ "49/07/2014 06:43 py oBUX Card «3228 New Balance: 37.45 Card is registertd ``` ## Model Information {:.table-model} |---|---| |Model Name:|ocr_restapi| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| --- layout: model title: Pipeline to Extract Mentions of Response to Cancer Treatment author: John Snow Labs name: ner_oncology_response_to_treatment_pipeline date: 2023-03-09 tags: [licensed, clinical, en, oncology, ner, treatment] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_oncology_response_to_treatment](https://nlp.johnsnowlabs.com/2022/11/24/ner_oncology_response_to_treatment_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_response_to_treatment_pipeline_en_4.3.0_3.2_1678349824229.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_response_to_treatment_pipeline_en_4.3.0_3.2_1678349824229.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_oncology_response_to_treatment_pipeline", "en", "clinical/models") text = '''She completed her first-line therapy, but some months later there was recurrence of the breast cancer.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_oncology_response_to_treatment_pipeline", "en", "clinical/models") val text = "She completed her first-line therapy, but some months later there was recurrence of the breast cancer." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-------------|--------:|------:|:----------------------|-------------:| | 0 | recurrence | 70 | 79 | Response_To_Treatment | 0.9767 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_response_to_treatment_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English RobertaForQuestionAnswering Cased model (from skandaonsolve) author: John Snow Labs name: roberta_qa_finetuned_timeentities date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-finetuned-timeentities` is an English model originally trained by `skandaonsolve`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_timeentities_en_4.3.0_3.0_1674220613032.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_timeentities_en_4.3.0_3.0_1674220613032.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_timeentities","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_timeentities","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_finetuned_timeentities| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/skandaonsolve/roberta-finetuned-timeentities --- layout: model title: English BertForQuestionAnswering model (from mrm8488) author: John Snow Labs name: bert_qa_bert_tiny_2_finetuned_squadv2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-tiny-2-finetuned-squadv2` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_tiny_2_finetuned_squadv2_en_4.0.0_3.0_1654184806087.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_tiny_2_finetuned_squadv2_en_4.0.0_3.0_1654184806087.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_tiny_2_finetuned_squadv2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_tiny_2_finetuned_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.tiny_v2.by_mrm8488").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_tiny_2_finetuned_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|19.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/bert-tiny-2-finetuned-squadv2 --- layout: model title: Part of Speech for Icelandic author: John Snow Labs name: pos_icepahc date: 2021-03-23 tags: [pos, open_source, is] supported: true task: Part of Speech Tagging language: is edition: Spark NLP 2.7.5 spark_version: 2.4 annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron` architecture. ## Predicted Entities - ADJ - ADP - ADV - AUX - CCONJ - DET - NOUN - NUM - PART - PRON - PROPN - PUNCT - VERB - X {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_icepahc_is_2.7.5_2.4_1616509019245.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_icepahc_is_2.7.5_2.4_1616509019245.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_icepahc", "is")\ .setInputCols(["document", "token"])\ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Númerið blikkaði á skjánum eins og einmana vekjaraklukka um nótt á níundu hæð í gamalli blokk í austurbæ Reykjavíkur .']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_icepahc", "is") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Númerið blikkaði á skjánum eins og einmana vekjaraklukka um nótt á níundu hæð í gamalli blokk í austurbæ Reykjavíkur .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Númerið blikkaði á skjánum eins og einmana vekjaraklukka um nótt á níundu hæð í gamalli blokk í austurbæ Reykjavíkur ."] token_df = nlu.load('is.pos.icepahc').predict(text) token_df ```
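The `averaged perceptron` architecture named in the description scores each candidate tag from sparse features of a token and nudges feature weights toward the gold tag whenever the current guess is wrong (in the full algorithm the weights are additionally averaged over all updates to reduce variance; that step is omitted here). A toy illustration of the error-driven update, not the Spark NLP implementation — the feature template and training data are invented for brevity:

```python
from collections import defaultdict

def features(tokens, i):
    """Tiny hypothetical feature template: the word itself and its suffix."""
    w = tokens[i]
    return [f"word={w}", f"suffix={w[-2:]}"]

def train(sentences, tags, n_iter=5):
    # weights[feature][tag] -> score
    weights = defaultdict(lambda: defaultdict(float))
    for _ in range(n_iter):
        for tokens, gold in zip(sentences, tags):
            for i, g in enumerate(gold):
                feats = features(tokens, i)
                scores = defaultdict(float)
                for feat in feats:
                    for tag, w in weights[feat].items():
                        scores[tag] += w
                guess = max(scores, key=scores.get) if scores else "NOUN"
                if guess != g:  # perceptron update on error
                    for feat in feats:
                        weights[feat][g] += 1.0
                        weights[feat][guess] -= 1.0
    return weights

def predict(weights, tokens):
    out = []
    for i in range(len(tokens)):
        scores = defaultdict(float)
        for feat in features(tokens, i):
            for tag, w in weights[feat].items():
                scores[tag] += w
        out.append(max(scores, key=scores.get) if scores else "NOUN")
    return out

w = train([["hundurinn", "hleypur"]], [["NOUN", "VERB"]])
print(predict(w, ["hundurinn", "hleypur"]))  # -> ['NOUN', 'VERB']
```

The pretrained `pos_icepahc` model ships the learned weights, so at inference time only the prediction pass runs.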
## Results ```bash +----------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+ |text |result | +----------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+ |Númerið blikkaði á skjánum eins og einmana vekjaraklukka um nótt á níundu hæð í gamalli blokk í austurbæ Reykjavíkur .|[NOUN, VERB, ADP, NOUN, ADV, ADP, ADJ, NOUN, ADP, NOUN, ADP, ADJ, NOUN, ADP, ADJ, NOUN, ADP, PROPN, PROPN, PUNCT]| +----------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_icepahc| |Compatibility:|Spark NLP 2.7.5+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[pos]| |Language:|is| ## Data Source The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) data set. 
## Benchmarking ```bash | | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | ADJ | 0.81 | 0.74 | 0.78 | 5906 | | ADP | 0.95 | 0.96 | 0.96 | 15548 | | ADV | 0.90 | 0.90 | 0.90 | 10631 | | AUX | 0.92 | 0.93 | 0.92 | 7416 | | CCONJ | 0.96 | 0.97 | 0.96 | 8437 | | DET | 0.89 | 0.87 | 0.88 | 7476 | | INTJ | 0.95 | 0.77 | 0.85 | 131 | | NOUN | 0.90 | 0.92 | 0.91 | 20726 | | NUM | 0.75 | 0.83 | 0.79 | 655 | | PART | 0.96 | 0.96 | 0.96 | 1703 | | PRON | 0.94 | 0.96 | 0.95 | 16852 | | PROPN | 0.89 | 0.89 | 0.89 | 4444 | | PUNCT | 0.98 | 0.98 | 0.98 | 16434 | | SCONJ | 0.94 | 0.94 | 0.94 | 5663 | | VERB | 0.92 | 0.90 | 0.91 | 17329 | | X | 0.60 | 0.30 | 0.40 | 346 | | accuracy | | | 0.92 | 139697 | | macro avg | 0.89 | 0.86 | 0.87 | 139697 | | weighted avg | 0.92 | 0.92 | 0.92 | 139697 | ``` --- layout: model title: Part of Speech for Marathi author: John Snow Labs name: pos_ud_ufal date: 2021-03-09 tags: [part_of_speech, open_source, marathi, pos_ud_ufal, mr] task: Part of Speech Tagging language: mr edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. 
## Predicted Entities - DET - AUX - NOUN - PUNCT - PRON - ADJ - CCONJ - ADV - VERB - SCONJ - NUM - ADP - INTJ - PROPN {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_ufal_mr_3.0.0_3.0_1615292224912.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_ufal_mr_3.0.0_3.0_1615292224912.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_ufal", "mr") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['जॉन हिम लॅब्समधून हॅलो! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_ufal", "mr") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("जॉन हिम लॅब्समधून हॅलो! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["जॉन हिम लॅब्समधून हॅलो! "] token_df = nlu.load('mr.pos').predict(text) token_df ```
## Results ```bash token pos 0 जॉन PROPN 1 हिम NOUN 2 लॅब्समधून ADJ 3 हॅलो VERB 4 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_ufal| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|mr| --- layout: model title: Japanese Word Segmentation author: John Snow Labs name: wordseg_gsd_ud date: 2021-01-03 task: Word Segmentation language: ja edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [word_segmentation, ja, open_source] supported: true annotator: WordSegmenterModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WordSegmenterModel (WSM) is based on a maximum entropy probability model that detects word boundaries. Japanese, like Chinese, is written without white space between words, so a computer-based application cannot know _a priori_ which sequence of ideograms forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step. References: - Xue, Nianwen. "Chinese word segmentation as character tagging." International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing. 2003. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_gsd_ud_ja_2.7.0_2.4_1609692613721.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_gsd_ud_ja_2.7.0_2.4_1609692613721.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline as a substitute for the Tokenizer stage.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") word_segmenter = WordSegmenterModel.pretrained('wordseg_gsd_ud', 'ja')\ .setInputCols("document")\ .setOutputCol("token") pipeline = Pipeline(stages=[ document_assembler, word_segmenter ]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) example = spark.createDataFrame([['清代は湖北省が置かれ、そのまま現代の行政区分になっている。']], ["text"]) result = model.transform(example) ``` ```scala val document_assembler = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") .setInputCols("document") .setOutputCol("token") val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter)) val data = Seq("清代は湖北省が置かれ、そのまま現代の行政区分になっている。").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""清代は湖北省が置かれ、そのまま現代の行政区分になっている。"""] token_df = nlu.load('ja.segment_words').predict(text, output_level='token') token_df ```
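As the cited Xue (2003) paper frames it, segmentation is reduced to per-character tagging: each character receives a label such as B (begins a word) or I (continues the current word), and tokens are read off the label sequence. A minimal sketch of just the decoding step, with hand-written labels standing in for what the maximum entropy model would actually predict:

```python
def labels_to_tokens(chars, labels):
    """Group characters into words from B/I (begin/inside) labels."""
    tokens = []
    for ch, lab in zip(chars, labels):
        if lab == "B" or not tokens:
            tokens.append(ch)       # a B label starts a new word
        else:
            tokens[-1] += ch        # an I label extends the current word
    return tokens

chars = list("清代は湖北省")
labels = ["B", "I", "B", "B", "I", "B"]
print(labels_to_tokens(chars, labels))  # -> ['清代', 'は', '湖北', '省']
```

This mirrors the start of the Results row below the usage example: the model's character-level decisions are what split 湖北省 into 湖北 and 省.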
## Results ```bash +----------------------------------------------------------+------------------------------------------------------------------------------------------------+ |text |result | +----------------------------------------------------------+------------------------------------------------------------------------------------------------+ |清代は湖北省が置かれ、そのまま現代の行政区分になっている。|[清代, は, 湖北, 省, が, 置か, れ, 、, その, まま, 現代, の, 行政, 区分, に, なっ, て, いる, 。]| +----------------------------------------------------------+------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wordseg_gsd_ud| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[token]| |Language:|ja| ## Data Source We trained this model on the [Universal Dependencies](https://universaldependencies.org) data set from Google (GSD-UD). > Asahara, M., Kanayama, H., Tanaka, T., Miyao, Y., Uematsu, S., Mori, S., Matsumoto, Y., Omura, M., & Murawaki, Y. (2018). Universal Dependencies Version 2 for Japanese. In LREC-2018. ## Benchmarking ```bash | Model | precision | recall | f1-score | |---------------|--------------|--------------|--------------| | JA_UD-GSD | 0.7687 | 0.8048 | 0.7863 | ``` --- layout: model title: English BertForTokenClassification Cased model (from test123) author: John Snow Labs name: bert_token_classifier_autonlp_ingredient_pseudo_label_training_ner_29576765 date: 2022-11-30 tags: [en, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`autonlp-ingredient_pseudo_label_training_ner-29576765` is an English model originally trained by `test123`. ## Predicted Entities `I`, `B` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autonlp_ingredient_pseudo_label_training_ner_29576765_en_4.2.4_3.0_1669814198805.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autonlp_ingredient_pseudo_label_training_ner_29576765_en_4.2.4_3.0_1669814198805.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autonlp_ingredient_pseudo_label_training_ner_29576765","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autonlp_ingredient_pseudo_label_training_ner_29576765","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_autonlp_ingredient_pseudo_label_training_ner_29576765| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/test123/autonlp-ingredient_pseudo_label_training_ner-29576765 --- layout: model title: Legal NER for NDA (Confidential Information-Permissions) author: John Snow Labs name: legner_nda_confidential_information_permissions date: 2023-04-06 tags: [en, licensed, legal, ner, nda, permission] task: Named Entity Recognition language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a NER model, aimed to be run **only** after detecting the `USE_OF_CONF_INFO` clause with a proper classifier (use legmulticlf_mnda_sections_paragraph_other for that purpose). It will extract the following entities: `PERMISSION`, `PERMISSION_SUBJECT`, `PERMISSION_OBJECT`, and `PERMISSION_IND_OBJECT`. ## Predicted Entities `PERMISSION`, `PERMISSION_SUBJECT`, `PERMISSION_OBJECT`, `PERMISSION_IND_OBJECT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_nda_confidential_information_permissions_en_1.0.0_3.0_1680814300223.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_nda_confidential_information_permissions_en_1.0.0_3.0_1680814300223.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_nda_confidential_information_permissions", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""The interested party may disclose the information to its financing sources and potential financing sources provided that such financing sources are bound by the terms of this non-disclosure agreement and agree to keep the information confidential."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
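The `NerConverter` stage at the end of the pipeline merges the token-level IOB tags emitted by the NER model (e.g. `B-PERMISSION_SUBJECT`, `I-PERMISSION_SUBJECT`, `O`) into whole chunks. A simplified illustration of that grouping logic, not the Spark NLP implementation, using hand-written tags for part of the example sentence:

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, label) pairs; O tokens are skipped."""
    chunks = []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            chunks.append([tok, tag[2:]])              # start a new chunk
        elif tag.startswith("I-") and chunks and chunks[-1][1] == tag[2:]:
            chunks[-1][0] += " " + tok                 # extend the open chunk
    return [tuple(c) for c in chunks]

tokens = ["The", "interested", "party", "may", "disclose", "the", "information"]
tags = ["O", "B-PERMISSION_SUBJECT", "I-PERMISSION_SUBJECT", "O", "B-PERMISSION", "O", "B-PERMISSION_OBJECT"]
print(iob_to_chunks(tokens, tags))
# -> [('interested party', 'PERMISSION_SUBJECT'), ('disclose', 'PERMISSION'), ('information', 'PERMISSION_OBJECT')]
```

The `ner_chunk` column produced by the pipeline contains exactly such merged spans, together with begin/end offsets and confidence metadata.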
## Results ```bash +---------------------------+---------------------+ |chunk |ner_label | +---------------------------+---------------------+ |interested party |PERMISSION_SUBJECT | |disclose |PERMISSION | |information |PERMISSION_OBJECT | |financing sources |PERMISSION_IND_OBJECT| |potential financing sources|PERMISSION_IND_OBJECT| +---------------------------+---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_nda_confidential_information_permissions| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References In-house annotations on the Non-disclosure Agreements ## Benchmarking ```bash label precision recall f1-score support PERMISSION 1.00 1.00 1.00 9 PERMISSION_IND_OBJECT 1.00 0.67 0.80 9 PERMISSION_OBJECT 0.91 1.00 0.95 10 PERMISSION_SUBJECT 0.90 1.00 0.95 9 micro-avg 0.94 0.92 0.93 37 macro-avg 0.95 0.92 0.92 37 weighted-avg 0.95 0.92 0.93 37 ``` --- layout: model title: English RobertaForQuestionAnswering (from nlpconnect) author: John Snow Labs name: roberta_qa_dpr_nq_reader_roberta_base_v2 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dpr-nq-reader-roberta-base-v2` is an English model originally trained by `nlpconnect`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_dpr_nq_reader_roberta_base_v2_en_4.0.0_3.0_1655728533632.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_dpr_nq_reader_roberta_base_v2_en_4.0.0_3.0_1655728533632.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_dpr_nq_reader_roberta_base_v2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_dpr_nq_reader_roberta_base_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.base_v2").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
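In the NLU one-liner above, the question and its context are passed as a single string joined by the `|||` separator. Building that input programmatically can be sketched in plain Python (`make_qa_input` is a hypothetical helper, not part of the nlu API):

```python
def make_qa_input(question: str, context: str) -> str:
    """Join a question and its context with the '|||' separator
    that nlu's question-answering predict() expects in one string."""
    return f"{question}|||{context}"

# Example matching the snippet above.
qa_input = make_qa_input("What's my name?", "My name is Clara and I live in Berkeley.")
print(qa_input)
```

The same string can then be passed straight to `nlu.load(...).predict(...)`.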
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_dpr_nq_reader_roberta_base_v2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|466.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/nlpconnect/dpr-nq-reader-roberta-base-v2 --- layout: model title: Translate Germanic languages to English Pipeline author: John Snow Labs name: translate_gem_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, gem, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `gem` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_gem_en_xx_2.7.0_2.4_1609691653097.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_gem_en_xx_2.7.0_2.4_1609691653097.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_gem_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_gem_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.gem.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_gem_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForTokenClassification Base Uncased model (from Datasaur) author: John Snow Labs name: distilbert_token_classifier_base_uncased_finetuned_conll2003 date: 2023-03-14 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-conll2003` is an English model originally trained by `Datasaur`. ## Predicted Entities `PER`, `ORG`, `MISC`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_uncased_finetuned_conll2003_en_4.3.1_3.0_1678783094748.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_uncased_finetuned_conll2003_en_4.3.1_3.0_1678783094748.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_uncased_finetuned_conll2003","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_uncased_finetuned_conll2003","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
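The token classifier emits one IOB-style tag per token (`B-PER`, `I-PER`, `O`, ...). Collapsing tagged tokens into entity chunks, which Spark NLP's NerConverter does internally, can be sketched in plain Python; this is a simplified illustration, not the library's implementation:

```python
def iob_to_chunks(tokens, tags):
    """Collapse (token, IOB tag) pairs into (chunk, label) pairs.
    Simplified: an I- tag with a different label also starts a new chunk."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag == "O":
            if current:
                chunks.append((" ".join(current), label))
                current, label = [], None
        elif tag.startswith("B-") or label != tag[2:]:
            if current:  # close the previous chunk before opening a new one
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        else:  # I- tag continuing the open chunk
            current.append(tok)
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(iob_to_chunks(["John", "Snow", "lives", "in", "London"],
                    ["B-PER", "I-PER", "O", "O", "B-LOC"]))
```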
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_base_uncased_finetuned_conll2003| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Datasaur/distilbert-base-uncased-finetuned-conll2003 --- layout: model title: English asr_xlsr_punctuation TFWav2Vec2ForCTC from boris author: John Snow Labs name: pipeline_asr_xlsr_punctuation date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xlsr_punctuation` is an English model originally trained by boris. NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use pipeline_asr_xlsr_punctuation_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xlsr_punctuation_en_4.2.0_3.0_1664020787424.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xlsr_punctuation_en_4.2.0_3.0_1664020787424.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_xlsr_punctuation', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_xlsr_punctuation", lang = "en") val annotations = pipeline.transform(audioDF) ```
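The `audioDF` passed to the pipeline above must carry raw audio as arrays of floats (Wav2Vec2-style models typically expect mono audio at 16 kHz; verify the exact sampling rate against the upstream model). Converting signed 16-bit PCM integer samples into the normalized [-1, 1] float range can be sketched in plain Python; the function name and the mono/16 kHz assumptions are illustrative:

```python
def pcm16_to_float(samples):
    """Scale signed 16-bit PCM samples (range -32768..32767) into
    [-1.0, 1.0] floats, the representation Wav2Vec2-style models consume."""
    return [s / 32768.0 for s in samples]

# A few raw PCM samples and their normalized counterparts.
audio_floats = pcm16_to_float([0, 16384, -32768, 32767])
print(audio_floats)
```

The resulting float list is the kind of value you would place in the `audio_content` column of `audioDF`.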
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_xlsr_punctuation| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Part of Speech for Portuguese author: John Snow Labs name: pos_ud_bosque date: 2020-05-03 12:54:00 +0800 task: Part of Speech Tagging language: pt edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [pos, pt] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_bosque_pt_2.5.0_2.4_1588499443093.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_bosque_pt_2.5.0_2.4_1588499443093.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_bosque", "pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Além de ser o rei do norte, John Snow é um médico inglês e líder no desenvolvimento de anestesia e higiene médica.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_bosque", "pt") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Além de ser o rei do norte, John Snow é um médico inglês e líder no desenvolvimento de anestesia e higiene médica.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Além de ser o rei do norte, John Snow é um médico inglês e líder no desenvolvimento de anestesia e higiene médica."""] pos_df = nlu.load('pt.pos.ud_bosque').predict(text, output_level='token') pos_df ```
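Each annotation returned by `fullAnnotate` carries its tag in the `result` field, as shown in the Results section. Tallying tag frequencies from such output can be sketched in plain Python; `pos_results` here is a hypothetical list of tag strings, not real pipeline output:

```python
from collections import Counter

# Hypothetical tag sequence, shaped like the 'result' fields of Spark NLP 'pos' annotations.
pos_results = ["ADV", "ADP", "AUX", "DET", "NOUN", "ADP", "NOUN"]

tag_counts = Counter(pos_results)
print(tag_counts.most_common(2))  # the most frequent tags first
```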
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=3, result='ADV', metadata={'word': 'Além'}), Row(annotatorType='pos', begin=5, end=6, result='ADP', metadata={'word': 'de'}), Row(annotatorType='pos', begin=8, end=10, result='AUX', metadata={'word': 'ser'}), Row(annotatorType='pos', begin=12, end=12, result='DET', metadata={'word': 'o'}), Row(annotatorType='pos', begin=14, end=16, result='NOUN', metadata={'word': 'rei'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_bosque| |Type:|pos| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|pt| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: English RobertaForSequenceClassification Cased model (from lucianpopa) author: John Snow Labs name: roberta_classifier_autonlp_sst1_529214890 date: 2022-12-09 tags: [en, open_source, roberta, sequence_classification, classification] task: Text Classification language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-SST1-529214890` is an English model originally trained by `lucianpopa`.
## Predicted Entities `1`, `0`, `4`, `3`, `2` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_classifier_autonlp_sst1_529214890_en_4.2.4_3.0_1670622060169.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_classifier_autonlp_sst1_529214890_en_4.2.4_3.0_1670622060169.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autonlp_sst1_529214890","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_classifier]) data = spark.createDataFrame([["I love you!"], ["I feel lucky to be here."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autonlp_sst1_529214890","en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_classifier)) val data = Seq("I love you!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
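The predicted classes `0`–`4` follow SST-1's five-way sentiment scale. Assuming the conventional ordering (0 = very negative through 4 = very positive; this ordering is an assumption, verify it against the upstream model card), mapping a class index to a readable label can be sketched as:

```python
# Assumed SST-1 ordering; confirm against the upstream model before relying on it.
SST1_LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]

def label_name(class_index: int) -> str:
    """Translate a predicted class index into its human-readable label."""
    return SST1_LABELS[int(class_index)]

print(label_name(4))  # "very positive" under this assumed mapping
```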
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_classifier_autonlp_sst1_529214890| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|435.0 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/lucianpopa/autonlp-SST1-529214890 --- layout: model title: Smaller BERT Sentence Embeddings (L-10_H-128_A-2) author: John Snow Labs name: sent_small_bert_L10_128 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L10_128_en_2.6.0_2.4_1598350346103.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L10_128_en_2.6.0_2.4_1598350346103.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L10_128", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer'], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L10_128", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L10_128').predict(text, output_level='sentence') embeddings_df ```
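Sentence embeddings such as the 128-dimensional vectors this model produces are usually compared with cosine similarity. A minimal plain-Python sketch (the vectors here are toy 4-d stand-ins, not real model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy vectors standing in for real 128-d sentence embeddings.
print(cosine_similarity([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0]))  # ~1.0 for identical vectors
print(cosine_similarity([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]))  # 0.0 for orthogonal vectors
```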
{:.h2_title} ## Results ```bash en_embed_sentence_small_bert_L10_128_embeddings sentence [-0.3761860430240631, -0.04432673007249832, 0.... I hate cancer [-0.17762605845928192, -0.7492673397064209, -0... Antibiotics aren't painkiller ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L10_128| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|128| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/1](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/1) --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from Moussab) author: John Snow Labs name: roberta_qa_deepset_base_squad2_orkg_what_1e_4 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deepset-roberta-base-squad2-orkg-what-1e-4` is an English model originally trained by `Moussab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_what_1e_4_en_4.3.0_3.0_1674209722792.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_what_1e_4_en_4.3.0_3.0_1674209722792.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_what_1e_4","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_what_1e_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_deepset_base_squad2_orkg_what_1e_4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Moussab/deepset-roberta-base-squad2-orkg-what-1e-4 --- layout: model title: HCP Consult Classification Pipeline - Voice of the Patient author: John Snow Labs name: bert_sequence_classifier_vop_hcp_consult_pipeline date: 2023-06-14 tags: [licensed, en, clinical, classification, vop] task: Text Classification language: en edition: Healthcare NLP 4.4.3 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline includes the Medical Bert for Sequence Classification model to identify texts that mention an HCP consult. The pipeline is built on top of the [bert_sequence_classifier_vop_hcp_consult](https://nlp.johnsnowlabs.com/2023/06/13/bert_sequence_classifier_vop_hcp_consult_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_hcp_consult_pipeline_en_4.4.3_3.2_1686708308086.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_hcp_consult_pipeline_en_4.4.3_3.2_1686708308086.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_sequence_classifier_vop_hcp_consult_pipeline", "en", "clinical/models") pipeline.annotate("My son has been to two doctors who gave him antibiotic drops but they also say the problem might related to allergies.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_sequence_classifier_vop_hcp_consult_pipeline", "en", "clinical/models") val result = pipeline.annotate("My son has been to two doctors who gave him antibiotic drops but they also say the problem might related to allergies.") ```
## Results ```bash | text | prediction | |:-----------------------------------------------------------------------------------------------------------------------|:-----------------| | My son has been to two doctors who gave him antibiotic drops but they also say the problem might related to allergies. | Consulted_By_HCP | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_vop_hcp_consult_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|406.4 MB| ## Included Models - DocumentAssembler - TokenizerModel - MedicalBertForSequenceClassification --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_8 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-256-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_8_en_4.0.0_3.0_1655732330256.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_8_en_4.0.0_3.0_1655732330256.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_8","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_8","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_256d_seed_8").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_8| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|426.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-256-finetuned-squad-seed-8 --- layout: model title: Pipeline to Detect diseases in Medical Text (biobert) author: John Snow Labs name: ner_diseases_biobert_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, disease, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_diseases_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_diseases_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_diseases_biobert_pipeline_en_3.4.1_3.0_1647871907471.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_diseases_biobert_pipeline_en_3.4.1_3.0_1647871907471.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_diseases_biobert_pipeline", "en", "clinical/models") pipeline.annotate("""Detection of various other intracellular signaling proteins is also described. Multiple autoimmune syndrome has been detected. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. She has Chikungunya virus disease story also. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""") ``` ```scala val pipeline = new PretrainedPipeline("ner_diseases_biobert_pipeline", "en", "clinical/models") pipeline.annotate("Detection of various other intracellular signaling proteins is also described. Multiple autoimmune syndrome has been detected. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. She has Chikungunya virus disease story also. 
We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.diseases_biobert.pipeline").predict("""Detection of various other intracellular signaling proteins is also described. Multiple autoimmune syndrome has been detected. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. She has Chikungunya virus disease story also. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""") ```
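Downstream code often groups the extracted `(chunk, ner_label)` pairs by label, like the Disease chunks shown in the Results table. A minimal plain-Python sketch with hypothetical pairs shaped like the pipeline's `ner_chunk` output:

```python
from collections import defaultdict

# Hypothetical (chunk, label) pairs shaped like the pipeline's ner_chunk output.
chunks = [
    ("autoimmune syndrome", "Disease"),
    ("human T-cell leukemia", "Disease"),
    ("Chikungunya virus disease", "Disease"),
]

# Group chunk texts under their entity label.
by_label = defaultdict(list)
for chunk, label in chunks:
    by_label[label].append(chunk)

print(dict(by_label))
```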
## Results ```bash +-------------------------+---------+ |chunk |ner_label| +-------------------------+---------+ |autoimmune syndrome |Disease | |human T-cell leukemia |Disease | |T-cell leukemia |Disease | |Chikungunya virus disease|Disease | +-------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_diseases_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.1 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter --- layout: model title: Detect PHI for Generic Deidentification in Romanian (BERT) author: John Snow Labs name: ner_deid_generic_bert date: 2022-07-06 tags: [deidentification, bert, phi, ner, generic, ro, licensed] task: Named Entity Recognition language: ro edition: Healthcare NLP 3.5.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Romanian) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It is trained with bert_base_cased embeddings and can detect 7 generic entities. This NER model is trained on a combination of custom datasets with several data augmentation mechanisms.
## Predicted Entities `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_bert_ro_3.5.0_3.0_1657112906624.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_bert_ro_3.5.0_3.0_1657112906624.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")\ .setInputCols(["sentence","token"])\ .setOutputCol("word_embeddings") clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_bert", "ro", "clinical/models")\ .setInputCols(["sentence","token","word_embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter]) text = """ Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""" data = spark.createDataFrame([[text]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro") .setInputCols(Array("sentence","token")) .setOutputCol("word_embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_bert", "ro", "clinical/models") .setInputCols(Array("sentence","token","word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( 
documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter)) val text = """Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""" val data = Seq(text).toDS.toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ro.med_ner.deid_generic_bert").predict(""" Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""") ```
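A `fullAnnotate`/`annotate` call returns each output column as a parallel list, so the detected chunks and their PHI labels line up by index. A minimal, self-contained sketch of pairing them (the sample dictionary below is illustrative, not actual pipeline output):

```python
# Illustrative annotate()-style output: parallel lists of chunks and labels.
sample = {
    "ner_chunk": ["BUREAN MARIA", "77", "2450502264401"],
    "ner_label": ["NAME", "AGE", "ID"],
}

def chunks_with_labels(result):
    """Pair each detected chunk with its predicted PHI label."""
    return list(zip(result["ner_chunk"], result["ner_label"]))

print(chunks_with_labels(sample))
```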
## Results ```bash +----------------------------+---------+ |chunk |ner_label| +----------------------------+---------+ |Spitalul Pentru Ochi de Deal|LOCATION | |Drumul Oprea Nr |LOCATION | |972 |LOCATION | |Vaslui |LOCATION | |737405 |LOCATION | |+40(235)413773 |CONTACT | |25 May 2022 |DATE | |BUREAN MARIA |NAME | |77 |AGE | |Agota Evelyn Tımar |NAME | |2450502264401 |ID | +----------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic_bert| |Compatibility:|Healthcare NLP 3.5.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ro| |Size:|16.5 MB| ## References - Custom John Snow Labs datasets - Data augmentation techniques ## Benchmarking ```bash label precision recall f1-score support AGE 0.95 0.97 0.96 1186 CONTACT 0.99 0.98 0.98 366 DATE 0.96 0.92 0.94 4518 ID 1.00 1.00 1.00 679 LOCATION 0.91 0.90 0.90 1683 NAME 0.93 0.96 0.94 2916 PROFESSION 0.87 0.85 0.86 161 micro-avg 0.94 0.94 0.94 11509 macro-avg 0.94 0.94 0.94 11509 weighted-avg 0.95 0.94 0.94 11509 ``` --- layout: model title: Part of Speech for Chinese author: John Snow Labs name: pos_ud_gsd date: 2020-05-04 20:11:00 +0800 task: Part of Speech Tagging language: zh edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [pos, zh] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. 
{:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_zh_2.5.0_2.4_1588611712161.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_zh_2.5.0_2.4_1588611712161.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_gsd", "zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("除了担任北方国王之外,约翰·斯诺(John Snow)是一位英国医师,也是麻醉和医疗卫生发展的领导者。") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_gsd", "zh") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("除了担任北方国王之外,约翰·斯诺(John Snow)是一位英国医师,也是麻醉和医疗卫生发展的领导者。").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""除了担任北方国王之外,约翰·斯诺(John Snow)是一位英国医师,也是麻醉和医疗卫生发展的领导者。"""] pos_df = nlu.load('zh.pos.ud_gsd').predict(text, output_level='token') pos_df ```
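With a `LightPipeline`, `annotate()` likewise returns parallel lists per output column; a small sketch of pairing tokens with their universal POS tags (the sample values are illustrative, not actual model output):

```python
# Illustrative annotate()-style output with parallel token/pos lists.
sample = {
    "token": ["约翰", "是", "医师"],
    "pos": ["PROPN", "VERB", "NOUN"],
}

def token_tag_pairs(annotation):
    """Zip tokens with their predicted universal POS tags."""
    return list(zip(annotation["token"], annotation["pos"]))

pairs = token_tag_pairs(sample)
print(pairs)
```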
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=20, result='NOUN', metadata={'word': '除了担任北方国王之外,约翰·斯诺(John'}), Row(annotatorType='pos', begin=22, end=50, result='X', metadata={'word': 'Snow)是一位英国医师,也是麻醉和医疗卫生发展的领导者。'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_gsd| |Type:|pos| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|zh| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: English DistilBertForQuestionAnswering model (from Slavka) author: John Snow Labs name: distilbert_qa_distil_bert_finetuned_log_parser_1 date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distil-bert-finetuned-log-parser-1` is an English model originally trained by `Slavka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_distil_bert_finetuned_log_parser_1_en_4.0.0_3.0_1654723459354.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_distil_bert_finetuned_log_parser_1_en_4.0.0_3.0_1654723459354.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distil_bert_finetuned_log_parser_1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distil_bert_finetuned_log_parser_1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.log_parser.by_Slavka").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
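The NLU one-liner above packs the question and its context into a single string separated by `|||`. A tiny helper (hypothetical name, shown only to document that separator convention) for building such inputs:

```python
def qa_input(question: str, context: str) -> str:
    """Join a question and its context with the '|||' separator used in the NLU example."""
    return f"{question}|||{context}"

print(qa_input("What is my name?", "My name is Clara and I live in Berkeley."))
```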
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_distil_bert_finetuned_log_parser_1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Slavka/distil-bert-finetuned-log-parser-1 --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from Rishav-hub) author: John Snow Labs name: xlmroberta_ner_rishav_hub_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `Rishav-hub`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_rishav_hub_base_finetuned_panx_de_4.1.0_3.0_1660430288579.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_rishav_hub_base_finetuned_panx_de_4.1.0_3.0_1660430288579.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_rishav_hub_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_rishav_hub_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_rishav_hub_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Rishav-hub/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Fast Neural Machine Translation Model from Central Bikol to English author: John Snow Labs name: opus_mt_bcl_en date: 2021-06-01 tags: [open_source, seq2seq, translation, bcl, en, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). source languages: bcl target languages: en {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_en_xx_3.1.0_2.4_1622554147726.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_en_xx_3.1.0_2.4_1622554147726.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_bcl_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) data = spark.createDataFrame([["text to translate"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_bcl_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Central Bikol.translate_to.English').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_bcl_en| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from Ilocano to English author: John Snow Labs name: opus_mt_ilo_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ilo, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `ilo` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ilo_en_xx_2.7.0_2.4_1609165094786.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ilo_en_xx_2.7.0_2.4_1609165094786.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ilo_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ilo_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ilo.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
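Because the sentence detector splits the input first, the translation comes back as one fragment per detected sentence. A minimal sketch (illustrative values, not actual model output) of stitching those fragments back into one document-level string:

```python
# Illustrative per-sentence translation fragments.
fragments = ["This is the first sentence.", "This is the second."]

def join_translation(parts):
    """Concatenate sentence-level translations into a single string."""
    return " ".join(p.strip() for p in parts)

print(join_translation(fragments))
```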
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ilo_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Sentence Entity Resolver for RxNorm (sbiobert_jsl_cased embeddings) author: John Snow Labs name: sbiobertresolve_rxnorm_augmented_cased date: 2021-12-28 tags: [en, clinical, entity_resolution, licensed] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes using `sbiobert_jsl_cased` Sentence Bert Embeddings. It is trained on the augmented version of the dataset which is used in previous RxNorm resolver models. Additionally, this model returns concept classes of the drugs in all_k_aux_labels column. ## Predicted Entities `RxNorm Codes`, `Concept Classes` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_augmented_cased_en_3.3.4_2.4_1640687886477.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_augmented_cased_en_3.3.4_2.4_1640687886477.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings.pretrained('sbiobert_jsl_cased', 'en','clinical/models')\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented_cased", "en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("rxnorm_code")\ .setDistanceFunction("EUCLIDEAN") rxnorm_pipelineModel = PipelineModel( stages = [ documentAssembler, sbert_embedder, rxnorm_resolver]) light_model = LightPipeline(rxnorm_pipelineModel) result = light_model.fullAnnotate(["Coumadin 5 mg", "aspirin", "Neurontin 300"]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_jsl_cased", "en", "clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sbert_embeddings") val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented_cased", "en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("rxnorm_code") .setDistanceFunction("EUCLIDEAN") val rxnorm_pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, rxnorm_resolver)) val rxnorm_pipelineModel = rxnorm_pipeline.fit(Seq("").toDS.toDF("text")) val light_model = new LightPipeline(rxnorm_pipelineModel) val result = light_model.fullAnnotate(Array("Coumadin 5 mg", "aspirin", "Neurontin 300")) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.rxnorm.augmented_cased").predict("""Coumadin 5 mg""") ```
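The resolver's `all_k_*` metadata columns pack the top-k candidates into single `:::`-delimited strings, in rank order. A minimal sketch of unpacking them into aligned lists (the sample values below are illustrative):

```python
# Illustrative resolver metadata: top-k candidates packed into ':::'-joined strings.
row = {
    "all_k_results": "855333:::645146:::432467",
    "all_k_distances": "7.1909:::8.2961:::8.3727",
}

def split_candidates(row, sep=":::"):
    """Unpack each ':::'-delimited metadata field into a rank-ordered list."""
    return {key: value.split(sep) for key, value in row.items()}

candidates = split_candidates(row)
print(candidates["all_k_results"])
```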
## Results ```bash | | RxNormCode | Resolution | all_k_results | all_k_distances | all_k_cosine_distances | all_k_resolutions | all_k_aux_labels | |---:|-------------:|:-------------------------------------------|:----------------------------------|:----------------------------------|:----------------------------------|:----------------------------------------------------------------|:----------------------------------| | 0 | 855333 | warfarin sodium 5 MG [Coumadin] | 855333:::645146:::432467:::438... | 7.1909:::8.2961:::8.3727:::8.3... | 0.0887:::0.1170:::0.1176:::0.1... | warfarin sodium 5 MG [Coumadin]:::minoxidil 50 MG/ML Topical... | Branded Drug Comp:::Clinical D... | | 1 | 1537020 | aspirin Effervescent Oral Tablet | 1537020:::1191:::437779:::7244... | 0.0000:::0.0000:::8.2570:::8.8... | 0.0000:::0.0000:::0.1147:::0.1... | aspirin Effervescent Oral Tablet:::aspirin:::aspirin / sulfu... | Clinical Drug Form:::Ingredien... | | 2 | 105029 | gabapentin 300 MG Oral Capsule [Neurontin] | 105029:::2180332:::105852:::19... | 8.7466:::10.7744:::11.1256:::1... | 0.1212:::0.1843:::0.1981:::0.2... | gabapentin 300 MG Oral Capsule [Neurontin]:::darolutamide 30... | Branded Drug:::Branded Drug Co... 
| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_rxnorm_augmented_cased| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[rxnorm_code]| |Language:|en| |Size:|972.4 MB| |Case sensitive:|false| --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from mrm8488) author: John Snow Labs name: roberta_qa_ruperta_base_finetuned_squadv2 date: 2022-12-02 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `RuPERTa-base-finetuned-squadv2` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ruperta_base_finetuned_squadv2_es_4.2.4_3.0_1669984933151.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ruperta_base_finetuned_squadv2_es_4.2.4_3.0_1669984933151.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ruperta_base_finetuned_squadv2","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ruperta_base_finetuned_squadv2","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_ruperta_base_finetuned_squadv2| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|470.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mrm8488/RuPERTa-base-finetuned-squadv2 --- layout: model title: RCT Classifier (BioBERT) author: John Snow Labs name: bert_sequence_classifier_rct_biobert date: 2022-03-01 tags: [licensed, en, rct, bert, sequence_classification] task: Text Classification language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [BioBERT-based](https://github.com/dmis-lab/biobert) classifier that can classify the sections within the abstracts of scientific articles regarding randomized clinical trials (RCT). ## Predicted Entities `BACKGROUND`, `CONCLUSIONS`, `METHODS`, `OBJECTIVE`, `RESULTS` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_rct_biobert_en_3.4.1_3.0_1646127001699.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_rct_biobert_en_3.4.1_3.0_1646127001699.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequenceClassifier_loaded = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_rct_biobert", "en", "clinical/models")\ .setInputCols(["document",'token'])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier_loaded ]) data = spark.createDataFrame([["""Previous attempts to prevent all the unwanted postoperative responses to major surgery with an epidural hydrophilic opioid , morphine , have not succeeded . The authors ' hypothesis was that the lipophilic opioid fentanyl , infused epidurally close to the spinal-cord opioid receptors corresponding to the dermatome of the surgical incision , gives equal pain relief but attenuates postoperative hormonal and metabolic responses more effectively than does systemic fentanyl ."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_rct_biobert", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq("Previous attempts to prevent all the unwanted postoperative responses to major surgery with an epidural hydrophilic opioid , morphine , have not succeeded . 
The authors ' hypothesis was that the lipophilic opioid fentanyl , infused epidurally close to the spinal-cord opioid receptors corresponding to the dermatome of the surgical incision , gives equal pain relief but attenuates postoperative hormonal and metabolic responses more effectively than does systemic fentanyl .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical_trials").predict("""Previous attempts to prevent all the unwanted postoperative responses to major surgery with an epidural hydrophilic opioid , morphine , have not succeeded . The authors ' hypothesis was that the lipophilic opioid fentanyl , infused epidurally close to the spinal-cord opioid receptors corresponding to the dermatome of the surgical incision , gives equal pain relief but attenuates postoperative hormonal and metabolic responses more effectively than does systemic fentanyl .""") ```
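Applied across all sentences of an abstract, the classifier yields one section label per sentence; a minimal sketch (illustrative labels, not actual model output) of tallying those predictions to summarize an abstract's structure:

```python
from collections import Counter

# Illustrative per-sentence section predictions for one abstract.
preds = ["BACKGROUND", "METHODS", "METHODS", "RESULTS"]

# Count how often each section label was predicted.
counts = Counter(preds)
print(counts.most_common())
```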
## Results ```bash +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+ |text |class | +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+ |[Previous attempts to prevent all the unwanted postoperative responses to major surgery with an epidural hydrophilic opioid , morphine , have not succeeded . 
The authors ' hypothesis was that the lipophilic opioid fentanyl , infused epidurally close to the spinal-cord opioid receptors corresponding to the dermatome of the surgical incision , gives equal pain relief but attenuates postoperative hormonal and metabolic responses more effectively than does systemic fentanyl .]|[BACKGROUND]| +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_rct_biobert| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.0 MB| |Case sensitive:|true| |Max sentence length:|128| ## References https://arxiv.org/abs/1710.06071 ## Benchmarking ```bash label precision recall f1-score support BACKGROUND 0.77 0.86 0.81 2000 CONCLUSIONS 0.96 0.95 0.95 2000 METHODS 0.96 0.98 0.97 2000 OBJECTIVE 0.85 0.77 0.81 2000 RESULTS 0.98 0.95 0.96 2000 accuracy 0.9 0.9 0.9 10000 macro-avg 0.9 0.9 0.9 10000 weighted-avg 0.9 0.9 0.9 10000 ``` --- layout: model title: Relation extraction between body parts and direction entities author: John Snow Labs name: re_bodypart_directions date: 2021-01-18 task: Relation Extraction language: en nav_key: models edition: Spark NLP for Healthcare 2.7.1 spark_version: 2.4 tags: [en, relation_extraction, clinical, licensed] supported: true annotator: RelationExtractionModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Relation extraction 
between body part entities (`Internal_organ_or_component`, `External_body_part_or_region`) and `Direction` entities in clinical texts. `1`: there is a relation between the body part entity and the direction entity; `0`: there is no relation between them.

## Predicted Entities

`0`, `1`

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb#scrollTo=D8TtVuN-Ee8s){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_bodypart_directions_en_2.7.1_2.4_1610983817042.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_bodypart_directions_en_2.7.1_2.4_1610983817042.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use

The table below shows the `re_bodypart_directions` RE model, its labels, the optimal NER model, and the meaningful relation pairs.

| RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS |
|:----------------------:|:---------------:|:---------:|:--------:|
| re_bodypart_directions | 0,1 | ner_jsl | “direction-external_body_part_or_region”, “external_body_part_or_region-direction”, “direction-internal_organ_or_component”, “internal_organ_or_component-direction” |
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") words_embedder = WordEmbeddingsModel()\ .pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentences", "tokens"])\ .setOutputCol("embeddings") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") ner_tagger = MedicalNerModel()\ .pretrained("jsl_ner_wip_greedy_clinical","en","clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_chunker = NerConverterInternal()\ .setInputCols(["sentences", "tokens", "ner_tags"])\ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel()\ .pretrained("dependency_conllu", "en")\ .setInputCols(["sentences", "pos_tags", "tokens"])\ .setOutputCol("dependencies") pair_list = ['direction-internal_organ_or_component', 'internal_organ_or_component-direction'] re_model = RelationExtractionModel().pretrained("re_bodypart_directions","en","clinical/models")\ .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\ .setOutputCol("relations")\ .setMaxSyntacticDistance(4)\ .setRelationPairs(pair_list) pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, words_embedder, pos_tagger, ner_tagger, ner_chunker, dependency_parser, re_model]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = LightPipeline(model).fullAnnotate(''' MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia ''') ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") 
.setOutputCol("sentences")

val tokenizer = new Tokenizer()
    .setInputCols("sentences")
    .setOutputCol("tokens")

val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens"))
    .setOutputCol("embeddings")

val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens"))
    .setOutputCol("pos_tags")

val ner_tagger = MedicalNerModel.pretrained("jsl_ner_wip_greedy_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens", "embeddings"))
    .setOutputCol("ner_tags")

val ner_chunker = new NerConverterInternal()
    .setInputCols(Array("sentences", "tokens", "ner_tags"))
    .setOutputCol("ner_chunks")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
    .setInputCols(Array("sentences", "pos_tags", "tokens"))
    .setOutputCol("dependencies")

val pair_list = Array("direction-internal_organ_or_component", "internal_organ_or_component-direction")

val re_model = RelationExtractionModel.pretrained("re_bodypart_directions", "en", "clinical/models")
    .setInputCols(Array("embeddings", "pos_tags", "ner_chunks", "dependencies"))
    .setOutputCol("relations")
    .setMaxSyntacticDistance(4)
    .setRelationPairs(pair_list)

val nlpPipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, words_embedder, pos_tagger, ner_tagger, ner_chunker, dependency_parser, re_model))

val text = """ MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia """

val data = Seq(text).toDS.toDF("text")

val results = nlpPipeline.fit(data).transform(data)
```
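Downstream, one typically keeps only the pairs the model labels `1`. A minimal post-processing sketch over rows shaped like the Results table below (the row dicts are illustrative stand-ins, not the annotator's actual output schema):

```python
# Illustrative rows mirroring the Results table; not real annotator output objects.
rows = [
    {"relation": "1", "entity1": "Direction", "chunk1": "upper",
     "entity2": "Internal_organ_or_component", "chunk2": "brain stem"},
    {"relation": "0", "entity1": "Direction", "chunk1": "upper",
     "entity2": "Internal_organ_or_component", "chunk2": "cerebellum"},
    {"relation": "1", "entity1": "Direction", "chunk1": "left",
     "entity2": "Internal_organ_or_component", "chunk2": "cerebellum"},
]

# Keep only positively related (direction, body part) pairs.
positive_pairs = [(r["chunk1"], r["chunk2"]) for r in rows if r["relation"] == "1"]
print(positive_pairs)  # [('upper', 'brain stem'), ('left', 'cerebellum')]
```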
## Results

```bash
| index | relations | entity1                     | entity1_begin | entity1_end | chunk1     | entity2                     | entity2_begin | entity2_end | chunk2        | confidence |
|-------|-----------|-----------------------------|---------------|-------------|------------|-----------------------------|---------------|-------------|---------------|------------|
| 0     | 1         | Direction                   | 35            | 39          | upper      | Internal_organ_or_component | 41            | 50          | brain stem    | 0.9999989  |
| 1     | 0         | Direction                   | 35            | 39          | upper      | Internal_organ_or_component | 59            | 68          | cerebellum    | 0.99992585 |
| 2     | 0         | Direction                   | 35            | 39          | upper      | Internal_organ_or_component | 81            | 93          | basil ganglia | 0.9999999  |
| 3     | 0         | Internal_organ_or_component | 41            | 50          | brain stem | Direction                   | 54            | 57          | left          | 0.999811   |
| 4     | 0         | Internal_organ_or_component | 41            | 50          | brain stem | Direction                   | 75            | 79          | right         | 0.9998203  |
| 5     | 1         | Direction                   | 54            | 57          | left       | Internal_organ_or_component | 59            | 68          | cerebellum    | 1.0        |
| 6     | 0         | Direction                   | 54            | 57          | left       | Internal_organ_or_component | 81            | 93          | basil ganglia | 0.97616416 |
| 7     | 0         | Internal_organ_or_component | 59            | 68          | cerebellum | Direction                   | 75            | 79          | right         | 0.953046   |
| 8     | 1         | Direction                   | 75            | 79          | right      | Internal_organ_or_component | 81            | 93          | basil ganglia | 1.0        |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|re_bodypart_directions|
|Type:|re|
|Compatibility:|Spark NLP 2.7.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]|
|Output Labels:|[relations]|
|Language:|en|
|Dependencies:|embeddings_clinical|

## Data Source

Trained on data gathered and manually annotated by John Snow Labs

## Benchmarking

```bash
label  recall  precision  f1
0      0.87    0.9        0.88
1      0.99    0.99       0.99
```
---
layout: model
title: English BertForQuestionAnswering Cased model (from enoriega)
author: John Snow Labs
name: bert_qa_rule_softmatching
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_softmatching` is an English model originally trained by `enoriega`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_rule_softmatching_en_4.0.0_3.0_1657191287470.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_rule_softmatching_en_4.0.0_3.0_1657191287470.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_softmatching","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_softmatching","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
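Under the hood, extractive QA models of this kind score every token as a potential answer start and answer end, and the answer is the highest-scoring valid span of the context. A pure-Python sketch of that span-selection step (the toy logit values are made up for illustration):

```python
# Toy start/end scores per context token; in the real model these come from the transformer head.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.2, 0.1, 3.5, 0.1, 0.1, 0.1, 0.1, 0.3, 0.0]
end_scores   = [0.0, 0.1, 0.1, 3.1, 0.1, 0.1, 0.1, 0.1, 0.4, 0.0]

def best_span(starts, ends, max_len=15):
    """Return (i, j) maximising starts[i] + ends[j] subject to i <= j < i + max_len."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(starts):
        for j in range(i, min(i + max_len, len(ends))):
            if s + ends[j] > best_score:
                best_score, best = s + ends[j], (i, j)
    return best

i, j = best_span(start_scores, end_scores)
print(" ".join(tokens[i:j + 1]))  # Clara
```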
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_rule_softmatching|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/enoriega/rule_softmatching

---
layout: model
title: English RobertaForQuestionAnswering Cased model (from AlirezaBaneshi)
author: John Snow Labs
name: roberta_qa_autotrain_test2_756523213
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-test2-756523213` is an English model originally trained by `AlirezaBaneshi`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_autotrain_test2_756523213_en_4.3.0_3.0_1674209108948.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_autotrain_test2_756523213_en_4.3.0_3.0_1674209108948.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_autotrain_test2_756523213","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_autotrain_test2_756523213","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_autotrain_test2_756523213|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|415.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/AlirezaBaneshi/autotrain-test2-756523213

---
layout: model
title: Swahili XLMRobertaForTokenClassification Base Cased model (from mbeukman)
author: John Snow Labs
name: xlmroberta_ner_base_finetuned_amharic_finetuned_ner_swahili
date: 2022-08-01
tags: [sw, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: sw
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-amharic-finetuned-ner-swahili` is a Swahili model originally trained by `mbeukman`.

## Predicted Entities

`PER`, `LOC`, `ORG`, `DATE`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_amharic_finetuned_ner_swahili_sw_4.1.0_3.0_1659353848315.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_amharic_finetuned_ner_swahili_sw_4.1.0_3.0_1659353848315.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_amharic_finetuned_ner_swahili","sw") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_amharic_finetuned_ner_swahili","sw")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("document", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
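The `NerConverter` stage turns per-token IOB tags into entity chunks. Conceptually, the merging works as in this simplified pure-Python sketch (not the converter's actual implementation; the Swahili tokens and tags are illustrative):

```python
def iob_to_chunks(tokens, tags):
    """Merge B-/I- IOB tags into (text, label) chunks; 'O' tokens are skipped."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(iob_to_chunks(
    ["Rais", "Samia", "Suluhu", "alifika", "Dodoma"],
    ["O", "B-PER", "I-PER", "O", "B-LOC"],
))  # [('Samia Suluhu', 'PER'), ('Dodoma', 'LOC')]
```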
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_base_finetuned_amharic_finetuned_ner_swahili|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|sw|
|Size:|1.0 GB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-amharic-finetuned-ner-swahili
- https://arxiv.org/abs/2103.11811
- https://github.com/Michael-Beukman/NERTransfer
- https://github.com/masakhane-io/masakhane-ner

---
layout: model
title: Mapping Companies to NASDAQ Stock Screener by Ticker
author: John Snow Labs
name: finmapper_nasdaq_ticker_stock_screener
date: 2023-01-19
tags: [en, finance, licensed, nasdaq, ticker]
task: Chunk Mapping
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: ChunkMapperModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Given a ticker, this model retrieves the following information about a company from the NASDAQ Stock Screener:

- Country
- IPO_Year
- Industry
- Last_Sale
- Market_Cap
- Name
- Net_Change
- Percent_Change
- Sector
- Ticker
- Volume

First extract the ticker symbol from the financial text with the `finner_ticker` model, then retrieve the detailed company information with this ChunkMapper model.

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finmapper_nasdaq_ticker_stock_screener_en_1.0.0_3.0_1674157233652.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finmapper_nasdaq_ticker_stock_screener_en_1.0.0_3.0_1674157233652.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') tokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = finance.NerModel.pretrained("finner_ticker", "en", "finance/models")\ .setInputCols(["document", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") CM = finance.ChunkMapperModel.pretrained('finmapper_nasdaq_ticker_stock_screener', 'en', 'finance/models')\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings") pipeline = nlp.Pipeline().setStages([document_assembler, tokenizer, embeddings, ner_model, ner_converter, CM]) text = ["""There are some serious purchases and sales of AMZN stock today."""] test_data = spark.createDataFrame([text]).toDF("text") model = pipeline.fit(test_data) result = model.transform(test_data).select('mappings').collect() ```
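Conceptually, the chunk mapper is a lookup from a recognized ticker chunk to its NASDAQ Stock Screener record. A minimal pure-Python stand-in (the dictionary entry mirrors the AMZN example in the Results section; `lookup` is a hypothetical helper, not the annotator's API):

```python
# Tiny stand-in for the mapper's lookup table; values mirror the AMZN example below.
NASDAQ_SCREENER = {
    "AMZN": {
        "Country": "United States",
        "IPO_Year": "1997",
        "Sector": "Consumer Discretionary",
        "Name": "Amazon.com Inc. Common Stock",
    },
}

def lookup(ticker_chunk):
    """Return the screener record for a ticker chunk, or None if unmapped."""
    return NASDAQ_SCREENER.get(ticker_chunk.upper())

print(lookup("AMZN")["Sector"])  # Consumer Discretionary
```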
## Results

```bash
{
    "Country": "United States",
    "IPO_Year": "1997",
    "Industry": "Catalog/Specialty Distribution",
    "Last_Sale": "$98.12",
    "Market_Cap": "9.98556270184E11",
    "Name": "Amazon.com Inc. Common Stock",
    "Net_Change": "2.85",
    "Percent_Change": "2.991%",
    "Sector": "Consumer Discretionary",
    "Ticker": "AMZN",
    "Volume": "85412563"
}
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|finmapper_nasdaq_ticker_stock_screener|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|584.5 KB|

## References

https://www.nasdaq.com/market-activity/stocks/screener

---
layout: model
title: Korean Bert Embeddings (from kykim)
author: John Snow Labs
name: bert_embeddings_bert_kor_base
date: 2022-04-11
tags: [bert, embeddings, ko, open_source]
task: Embeddings
language: ko
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-kor-base` is a Korean model originally trained by `kykim`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_kor_base_ko_3.4.2_3.0_1649675505476.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_kor_base_ko_3.4.2_3.0_1649675505476.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_kor_base","ko") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["나는 Spark NLP를 좋아합니다"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_kor_base","ko") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("나는 Spark NLP를 좋아합니다").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ko.embed.bert_kor_base").predict("""나는 Spark NLP를 좋아합니다""") ```
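Token embeddings like these are commonly mean-pooled into a single vector per sentence and compared by cosine similarity. A small pure-Python sketch with made-up 3-dimensional vectors (real `bert_embeddings_bert_kor_base` vectors are much higher-dimensional):

```python
import math

def mean_pool(token_vectors):
    """Average token vectors into one sentence-level vector."""
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(len(token_vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Made-up 3-d "embeddings" for two short sentences.
sent_a = mean_pool([[0.2, 0.1, 0.9], [0.3, 0.0, 0.8]])
sent_b = mean_pool([[0.9, 0.1, 0.2]])

print(round(cosine(sent_a, sent_b), 3))
```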
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_kor_base|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ko|
|Size:|444.1 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/kykim/bert-kor-base
- https://github.com/kiyoungkim1/LM-kor

---
layout: model
title: English RobertaForQuestionAnswering (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_rule_based_roberta_only_classfn_twostage_epochs_1_shard_1_squad2.0
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_only_classfn_twostage_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_only_classfn_twostage_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739648723.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_only_classfn_twostage_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739648723.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_roberta_only_classfn_twostage_epochs_1_shard_1_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_rule_based_roberta_only_classfn_twostage_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base_rule_based_only_classfn_twostage_epochs_1_shard_1.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
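Because this model was fine-tuned on SQuAD 2.0, which includes unanswerable questions, a common downstream pattern is to compare the best span score against the score of the "no answer" prediction and apply a threshold. A schematic pure-Python sketch (the scores and threshold are illustrative; Spark NLP itself simply returns the predicted answer text):

```python
def pick_answer(best_span_text, best_span_score, null_score, threshold=0.0):
    """Return the span if it beats the null ('no answer') score by the threshold, else empty."""
    if best_span_score - null_score > threshold:
        return best_span_text
    return ""

print(pick_answer("Clara", best_span_score=6.6, null_score=1.2))   # Clara
print(pick_answer("London", best_span_score=2.0, null_score=4.5))  # prints nothing: judged unanswerable
```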
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_rule_based_roberta_only_classfn_twostage_epochs_1_shard_1_squad2.0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|463.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/AnonymousSub/rule_based_roberta_only_classfn_twostage_epochs_1_shard_1_squad2.0

---
layout: model
title: Spanish BERT Sentence Base Uncased Embedding
author: John Snow Labs
name: sent_bert_base_uncased
date: 2021-09-06
tags: [spanish, open_source, bert_sentence_embeddings, uncased, es]
task: Embeddings
language: es
edition: Spark NLP 3.2.2
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

BETO is a BERT model trained on a large Spanish corpus. BETO is similar in size to BERT-Base and was trained with the Whole Word Masking technique. The original release provides TensorFlow and PyTorch checkpoints for the uncased and cased versions, along with results on Spanish benchmarks comparing BETO with Multilingual BERT and other (non-BERT-based) models.

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_base_uncased_es_3.2.2_3.0_1630926281024.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_base_uncased_es_3.2.2_3.0_1630926281024.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_uncased", "es") \
    .setInputCols("sentence") \
    .setOutputCol("bert_sentence")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings])
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_uncased", "es")
    .setInputCols("sentence")
    .setOutputCol("bert_sentence")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings))
```

{:.nlu-block}
```python
import nlu
nlu.load("es.embed_sentence.bert.base_uncased").predict("""Put your text here.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sent_bert_base_uncased|
|Compatibility:|Spark NLP 3.2.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[bert_sentence]|
|Language:|es|
|Case sensitive:|true|

## Data Source

The model is imported from: https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased

---
layout: model
title: Thai Part of Speech Tagger (from KoichiYasuoka)
author: John Snow Labs
name: bert_pos_bert_base_thai_upos
date: 2022-05-09
tags: [bert, pos, part_of_speech, th, open_source]
task: Part of Speech Tagging
language: th
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Part of Speech tagging model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-thai-upos` is a Thai model originally trained by `KoichiYasuoka`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_thai_upos_th_3.4.2_3.0_1652092549919.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_thai_upos_th_3.4.2_3.0_1652092549919.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_thai_upos","th") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["ฉันรัก Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_thai_upos","th") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("ฉันรัก Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
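The pipeline emits one Universal POS tag per token. Downstream, the (token, tag) pairs can be filtered, for example to keep content words; a small sketch with illustrative Thai output (not actual model predictions):

```python
# Illustrative (token, UPOS) pairs; see https://universaldependencies.org/u/pos/ for the tag set.
tagged = [("ฉัน", "PRON"), ("รัก", "VERB"), ("Spark", "PROPN"), ("NLP", "PROPN")]

CONTENT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}

content_words = [tok for tok, tag in tagged if tag in CONTENT_TAGS]
print(content_words)  # ['รัก', 'Spark', 'NLP']
```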
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_thai_upos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|th| |Size:|345.8 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/KoichiYasuoka/bert-base-thai-upos - https://universaldependencies.org/u/pos/ - https://github.com/KoichiYasuoka/esupar --- layout: model title: Detect Genetic Cancer Entities author: John Snow Labs name: ner_cancer_genetics date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The Named Entity Recognition annotator allows a generic model to be trained by utilizing a deep learning algorithm (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. This is a pretrained named entity recognition deep learning model for biology and genetics terms. ## Predicted Entities ``DNA``, ``RNA``, ``cell_line``, ``cell_type``, ``protein``.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_TUMOR/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_cancer_genetics_en_3.0.0_3.0_1617209717722.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_cancer_genetics_en_3.0.0_3.0_1617209717722.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_cancer_genetics", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. 
The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.']], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_cancer_genetics", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. 
The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.cancer").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.""") ```
## Results ```bash +-------------------+---------+ | token|ner_label| +-------------------+---------+ | The| O| | human|B-protein| | KCNJ9|I-protein| | (| O| | Kir|B-protein| | 3.3|I-protein| | ,| O| | GIRK3|B-protein| | )| O| | is| O| | a| O| | member| O| | of| O| | the| O| |G-protein-activated|B-protein| | inwardly|I-protein| | rectifying|I-protein| | potassium|I-protein| | (|I-protein| | GIRK|I-protein| | )|I-protein| | channel|I-protein| | family|I-protein| | .| O| | Here| O| | we| O| | describe| O| | the| O| |genomicorganization| O| | of| O| | the| O| | KCNJ9| B-DNA| | locus| I-DNA| | on| O| | chromosome| B-DNA| | 1q21-23| I-DNA| +-------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_cancer_genetics| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on Cancer Genetics (CG) task of the BioNLP Shared Task 2013 with `embeddings_clinical`. 
https://aclanthology.org/W13-2008/ ## Benchmarking ```bash label tp fp fn prec rec f1 B-cell_line 581 148 151 0.79698217 0.79371583 0.79534566 I-DNA 2751 735 317 0.7891566 0.89667535 0.8394873 I-protein 4416 867 565 0.8358887 0.88656896 0.8604832 B-protein 5288 763 660 0.8739051 0.8890383 0.8814068 I-cell_line 1148 244 301 0.82471263 0.79227054 0.80816615 I-RNA 221 60 27 0.78647685 0.891129 0.83553874 B-RNA 157 40 36 0.79695433 0.8134715 0.8051282 B-cell_type 1127 292 250 0.7942213 0.8184459 0.8061516 I-cell_type 1547 392 263 0.7978339 0.85469615 0.82528675 B-DNA 1513 444 387 0.77312213 0.7963158 0.7845475 Macro-average prec: 0.8069253, rec: 0.84323275, f1: 0.82467955 Micro-average prec: 0.82471186, rec: 0.86377037, f1: 0.84378934 ``` --- layout: model title: Named Entity Recognition Profiling (Biobert) author: John Snow Labs name: ner_profiling_biobert date: 2021-11-03 tags: [ner, ner_profiling, clinical, licensed, en, biobert] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.3.1 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to explore all the available pretrained NER models at once. When you run this pipeline over your text, you will end up with the predictions coming out of each pretrained clinical NER model trained with `biobert_pubmed_base_cased`. It has been updated by adding NER model outputs to the previous version. 
Here are the NER models that this pretrained pipeline includes: `ner_jsl_enriched_biobert`, `ner_clinical_biobert`, `ner_chemprot_biobert`, `ner_jsl_greedy_biobert`, `ner_bionlp_biobert`, `ner_human_phenotype_go_biobert`, `jsl_rd_ner_wip_greedy_biobert`, `ner_posology_large_biobert`, `ner_risk_factors_biobert`, `ner_anatomy_coarse_biobert`, `ner_deid_enriched_biobert`, `ner_human_phenotype_gene_biobert`, `ner_jsl_biobert`, `ner_events_biobert`, `ner_deid_biobert`, `ner_posology_biobert`, `ner_diseases_biobert`, `jsl_ner_wip_greedy_biobert`, `ner_ade_biobert`, `ner_anatomy_biobert`, `ner_cellular_biobert` . {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.2.Pretrained_NER_Profiling_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_profiling_biobert_en_3.3.1_2.4_1635977081207.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_profiling_biobert_en_3.3.1_2.4_1635977081207.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline ner_profiling_pipeline = PretrainedPipeline('ner_profiling_biobert', 'en', 'clinical/models') result = ner_profiling_pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val ner_profiling_pipeline = PretrainedPipeline("ner_profiling_biobert", "en", "clinical/models") val result = ner_profiling_pipeline.annotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .""") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.profiling_biobert").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .""") ```
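The `annotate()` call above returns a plain Python dictionary mapping output-column names to lists of strings, one entry per model in the pipeline. A minimal sketch of grouping those outputs by model follows; the `<model>_chunks` key naming is an illustrative assumption, so inspect `result.keys()` for the actual column names in your run:

```python
def outputs_per_model(result, suffix="_chunks"):
    """Group annotate() outputs by the NER model that produced them.

    Assumes (hypothetically) each model's chunk column is named
    "<model>_chunks"; check result.keys() for the real names.
    """
    return {key[: -len(suffix)]: chunks
            for key, chunks in result.items()
            if key.endswith(suffix)}

# Mock result mirroring the shape annotate() returns (dict of lists).
sample = {
    "token": ["gestational", "diabetes", "mellitus"],
    "ner_diseases_biobert_chunks": ["gestational diabetes mellitus"],
    "ner_risk_factors_biobert_chunks": ["diabetes mellitus"],
}
by_model = outputs_per_model(sample)
# by_model maps "ner_diseases_biobert" and "ner_risk_factors_biobert"
# to their respective chunk lists.
```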
## Results ```bash ******************** ner_diseases_biobert Model Results ******************** [('gestational diabetes mellitus', 'Disease'), ('type two diabetes mellitus', 'Disease'), ('T2DM', 'Disease'), ('HTG-induced pancreatitis', 'Disease'), ('hepatitis', 'Disease'), ('obesity', 'Disease'), ('polyuria', 'Disease'), ('polydipsia', 'Disease'), ('poor appetite', 'Disease'), ('vomiting', 'Disease')] ******************** ner_events_biobert Model Results ******************** [('gestational diabetes mellitus', 'PROBLEM'), ('eight years', 'DURATION'), ('presentation', 'OCCURRENCE'), ('type two diabetes mellitus ( T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('three years', 'DURATION'), ('presentation', 'OCCURRENCE'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index', 'TEST'), ('BMI', 'TEST'), ('presented', 'OCCURRENCE'), ('a one-week', 'DURATION'), ('polyuria', 'PROBLEM'), ('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')] ******************** ner_jsl_biobert Model Results ******************** [('28-year-old', 'Age'), ('female', 'Gender'), ('gestational diabetes mellitus', 'Diabetes'), ('eight years prior', 'RelativeDate'), ('type two diabetes mellitus', 'Diabetes'), ('T2DM', 'Disease_Syndrome_Disorder'), ('HTG-induced pancreatitis', 'Disease_Syndrome_Disorder'), ('three years prior', 'RelativeDate'), ('acute', 'Modifier'), ('hepatitis', 'Disease_Syndrome_Disorder'), ('obesity', 'Obesity'), ('body mass index', 'BMI'), ('BMI ) of 33.5 kg/m2', 'BMI'), ('one-week', 'Duration'), ('polyuria', 'Symptom'), ('polydipsia', 'Symptom'), ('poor appetite', 'Symptom'), ('vomiting', 'Symptom')] ******************** ner_clinical_biobert Model Results ******************** [('gestational diabetes mellitus', 'PROBLEM'), ('subsequent type two diabetes mellitus ( T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index ( BMI )', 
'TEST'), ('polyuria', 'PROBLEM'), ('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')] ******************** ner_risk_factors_biobert Model Results ******************** [('diabetes mellitus', 'DIABETES'), ('subsequent type two diabetes mellitus', 'DIABETES'), ('obesity', 'OBESE')] ... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_profiling_biobert| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.3.1+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel (x21) - NerConverter (x21) - Finisher --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_64_finetuned_squad_seed_8 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-64-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_64_finetuned_squad_seed_8_en_4.3.0_3.0_1674216234526.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_64_finetuned_squad_seed_8_en_4.3.0_3.0_1674216234526.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_64_finetuned_squad_seed_8","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_64_finetuned_squad_seed_8","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_64_finetuned_squad_seed_8| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|419.6 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-64-finetuned-squad-seed-8 --- layout: model title: Legal Condemnation Clause Binary Classifier author: John Snow Labs name: legclf_condemnation_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `condemnation` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `condemnation` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_condemnation_clause_en_1.0.0_3.2_1660122247655.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_condemnation_clause_en_1.0.0_3.2_1660122247655.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_condemnation_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
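As the description suggests, long contracts are best split into paragraph-sized chunks before classification. A framework-free sketch of the multiline (blank-line) paragraph splitting mentioned above; the `split_into_paragraphs` helper name and the commented `createDataFrame` line are illustrative, not part of the library:

```python
import re

def split_into_paragraphs(text: str):
    """Split a document on blank lines, dropping empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = (
    "12. CONDEMNATION. If the Premises are taken by eminent domain...\n"
    "\n"
    "13. NOTICES. All notices shall be delivered in writing..."
)
# One row per paragraph, ready to feed the pipeline above.
rows = [[p] for p in split_into_paragraphs(doc)]
# df = spark.createDataFrame(rows).toDF("clause_text")
```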
## Results ```bash +--------------+ |result | +--------------+ |[condemnation]| |[other] | |[other] | |[condemnation]| +--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_condemnation_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support condemnation 0.96 1.00 0.98 23 other 1.00 0.99 0.99 85 accuracy - - 0.99 108 macro-avg 0.98 0.99 0.99 108 weighted-avg 0.99 0.99 0.99 108 ``` --- layout: model title: Fon asr_fonxlsr TFWav2Vec2ForCTC from chrisjay author: John Snow Labs name: asr_fonxlsr date: 2022-09-24 tags: [wav2vec2, fon, audio, open_source, asr] task: Automatic Speech Recognition language: fon edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_fonxlsr` is a Fon model originally trained by chrisjay. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_fonxlsr_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_fonxlsr_fon_4.2.0_3.0_1664024800003.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_fonxlsr_fon_4.2.0_3.0_1664024800003.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_fonxlsr", "fon")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_fonxlsr", "fon") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
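The snippets above assume an `audioDf` whose `audio_content` column holds the raw waveform as an array of floats (wav2vec2 models expect 16 kHz mono input). A stdlib-only sketch of decoding 16-bit PCM WAV bytes into such floats; the `wav_bytes_to_floats` helper and the commented `createDataFrame` line are illustrative, not part of Spark NLP:

```python
import io
import struct
import wave

def wav_bytes_to_floats(data: bytes):
    """Decode 16-bit mono PCM WAV bytes into floats in [-1.0, 1.0]."""
    with wave.open(io.BytesIO(data)) as wf:
        assert wf.getsampwidth() == 2 and wf.getnchannels() == 1
        frames = wf.readframes(wf.getnframes())
    return [s / 32768.0 for s in struct.unpack("<%dh" % (len(frames) // 2), frames)]

# Build a tiny synthetic 16 kHz WAV in memory so the helper runs end to end.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(struct.pack("<4h", 0, 16384, 0, -16384))

floats = wav_bytes_to_floats(buf.getvalue())
# audioDf = spark.createDataFrame([(floats,)], ["audio_content"])  # hypothetical
```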
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_fonxlsr| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|fon| |Size:|1.2 GB| --- layout: model title: Translate English to Mon-Khmer languages Pipeline author: John Snow Labs name: translate_en_mkh date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, mkh, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `mkh` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_mkh_xx_2.7.0_2.4_1609689989219.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_mkh_xx_2.7.0_2.4_1609689989219.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_mkh", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_mkh", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.mkh').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_mkh| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Clinical Major Concepts to UMLS Code Pipeline author: John Snow Labs name: umls_major_concepts_resolver_pipeline date: 2023-03-30 tags: [en, umls, licensed, pipeline, resolver, clinical] task: Pipeline Healthcare language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps entities (Clinical Major Concepts) with their corresponding UMLS CUI codes. You’ll just feed your text and it will return the corresponding UMLS codes. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/umls_major_concepts_resolver_pipeline_en_4.3.2_3.2_1680192225130.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/umls_major_concepts_resolver_pipeline_en_4.3.2_3.2_1680192225130.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("umls_major_concepts_resolver_pipeline", "en", "clinical/models") pipeline.annotate("The patient complains of pustules after falling from stairs. She has been advised Arthroscopy by her primary care physician") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = PretrainedPipeline("umls_major_concepts_resolver_pipeline", "en", "clinical/models") val result = pipeline.annotate("The patient complains of pustules after falling from stairs. She has been advised Arthroscopy by her primary care physician") ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.umls_major_concepts_resolver").predict("""The patient complains of pustules after falling from stairs. She has been advised Arthroscopy by her primary care physician""") ```
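`annotate()` returns a dict of parallel lists, so the chunk/label/code triples can be reassembled by zipping those lists. A small sketch follows; the `chunk`, `ner_label`, and `umls_code` key names are assumptions for illustration, so check `result.keys()` for the actual output columns of your pipeline:

```python
def to_rows(result, chunk_key="chunk", label_key="ner_label", code_key="umls_code"):
    """Zip the parallel annotate() lists into (chunk, label, code) triples."""
    return list(zip(result[chunk_key], result[label_key], result[code_key]))

# Mock result mirroring the dict-of-lists shape annotate() returns.
sample = {
    "chunk": ["pustules", "Arthroscopy"],
    "ner_label": ["Sign_or_Symptom", "Therapeutic_or_Preventive_Procedure"],
    "umls_code": ["C0241157", "C0179144"],
}
rows = to_rows(sample)
```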
## Results ```bash +-----------+-----------------------------------+---------+ |chunk |ner_label |umls_code| +-----------+-----------------------------------+---------+ |pustules |Sign_or_Symptom |C0241157 | |stairs |Daily_or_Recreational_Activity |C4300351 | |Arthroscopy|Therapeutic_or_Preventive_Procedure|C0179144 | +-----------+-----------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|umls_major_concepts_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|3.0 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger --- layout: model title: English DistilBertForQuestionAnswering model (from machine2049) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_squad_ date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad_distilbert` is an English model originally trained by `machine2049`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad__en_4.0.0_3.0_1654726959258.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad__en_4.0.0_3.0_1654726959258.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_machine2049").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_squad_| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/machine2049/distilbert-base-uncased-finetuned-squad_distilbert --- layout: model title: Swedish asr_test_by_marma TFWav2Vec2ForCTC from marma author: John Snow Labs name: asr_test_by_marma date: 2022-09-25 tags: [wav2vec2, sv, audio, open_source, asr] task: Automatic Speech Recognition language: sv edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_test_by_marma` is a Swedish model originally trained by marma. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_test_by_marma_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_test_by_marma_sv_4.2.0_3.0_1664116267395.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_test_by_marma_sv_4.2.0_3.0_1664116267395.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_test_by_marma", "sv")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_test_by_marma", "sv") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
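The `audioDf` DataFrame referenced above is assumed to hold an `audio_content` column of floating-point samples. A common way to produce such samples is to decode 16-bit PCM audio into floats in `[-1.0, 1.0]`. The following is a minimal, framework-free sketch of that conversion; the function name is illustrative and not part of the Spark NLP API:

```python
import struct

def pcm16_to_floats(pcm_bytes: bytes) -> list:
    """Decode little-endian 16-bit PCM samples into floats in [-1.0, 1.0]."""
    n = len(pcm_bytes) // 2
    # "<%dh" unpacks n little-endian signed 16-bit integers
    samples = struct.unpack("<%dh" % n, pcm_bytes[: n * 2])
    return [s / 32768.0 for s in samples]
```

A list of such floats per row can then be wrapped in a Spark DataFrame to serve as the `audio_content` input.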
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_test_by_marma| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|sv| |Size:|756.2 MB| --- layout: model title: Chinese Word Segmentation author: John Snow Labs name: wordseg_msra date: 2021-01-03 task: Word Segmentation language: zh edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, word_segmentation, zh, cn] supported: true annotator: WordSegmenterModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WordSegmenterModel (WSM) is based on a maximum entropy probability model to detect word boundaries in Chinese text. Chinese text is written without white space between the words, and a computer-based application cannot know _a priori_ which sequence of ideograms forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step. References: - Xue, Nianwen. "Chinese word segmentation as character tagging." International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing. 2003. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_msra_zh_2.7.0_2.4_1609693916888.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_msra_zh_2.7.0_2.4_1609693916888.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline as a substitute for the Tokenizer stage.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") word_segmenter = WordSegmenterModel.pretrained('wordseg_msra', 'zh')\ .setInputCols("document")\ .setOutputCol("token") pipeline = Pipeline(stages=[document_assembler, word_segmenter]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) example = spark.createDataFrame([['然而,这样的处理也衍生了一些问题。']], ["text"]) result = model.transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val word_segmenter = WordSegmenterModel.pretrained("wordseg_msra", "zh") .setInputCols("document") .setOutputCol("token") val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter)) val data = Seq("然而,这样的处理也衍生了一些问题。").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""然而,这样的处理也衍生了一些问题。"""] token_df = nlu.load('zh.segment_words.msra').predict(text, output_level='token') token_df ```
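Under the character-tagging formulation cited above (Xue, 2003), each character is assigned a position tag such as B (begin), M (middle), E (end), or S (single-character word), and segmentation reduces to decoding those tags back into words. The sketch below illustrates only the decoding step of that general technique, not the WordSegmenterModel implementation itself:

```python
def decode_bmes(chars, tags):
    """Group characters into words from B/M/E/S position tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":          # single-character word
            if current:
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":        # start a new word
            if current:
                words.append(current)
            current = ch
        elif tag == "M":        # continue the current word
            current += ch
        else:                   # "E": close the current word
            current += ch
            words.append(current)
            current = ""
    if current:                 # flush a dangling partial word
        words.append(current)
    return words
```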
## Results ```bash +----------------------------------+--------------------------------------------------------+ |text |result | +----------------------------------+--------------------------------------------------------+ |然而,这样的处理也衍生了一些问题。|[然而, ,, 这样, 的, 处理, 也, 衍生, 了, 一些, 问题, 。]| +----------------------------------+--------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wordseg_msra| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[token]| |Language:|zh| ## Data Source We trained this model on the Microsoft Research Asia (MSRA) data set available from the Second International Chinese Word Segmentation Bakeoff [SIGHAN 2005](http://sighan.cs.uchicago.edu/bakeoff2005) ## Benchmarking ```bash | Model | precision | recall | f1-score | |---------------|--------------|--------------|--------------| | WORDSEG_CTB | 0.6453 | 0.6341 | 0.6397 | | WORDSEG_WEIBO | 0.5454 | 0.5655 | 0.5553 | | WORDSEG_MSRA | 0.5984 | 0.6088 | 0.6035 | | WORDSEG_PKU | 0.6094 | 0.6321 | 0.6206 | | WORDSEG_LARGE | 0.6326 | 0.6269 | 0.6297 | ``` --- layout: model title: Multilingual XLMRobertaForTokenClassification Base Cased model (from V3RX2000) author: John Snow Labs name: xlmroberta_ner_v3rx2000_base_finetuned_panx_all date: 2022-08-13 tags: [xx, open_source, xlm_roberta, ner] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-all` is a Multilingual model originally trained by `V3RX2000`. 
## Predicted Entities `ORG`, `LOC`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_v3rx2000_base_finetuned_panx_all_xx_4.1.0_3.0_1660427771019.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_v3rx2000_base_finetuned_panx_all_xx_4.1.0_3.0_1660427771019.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_v3rx2000_base_finetuned_panx_all","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_v3rx2000_base_finetuned_panx_all","xx") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_v3rx2000_base_finetuned_panx_all| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|861.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/V3RX2000/xlm-roberta-base-finetuned-panx-all --- layout: model title: English Part of Speech Tagger (Large, UPOS-Universal Part-Of-Speech) author: John Snow Labs name: roberta_pos_roberta_large_english_upos date: 2022-05-03 tags: [roberta, pos, part_of_speech, en, open_source] task: Part of Speech Tagging language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-large-english-upos` is an English model originally trained by `KoichiYasuoka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_pos_roberta_large_english_upos_en_3.4.2_3.0_1651596140502.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_pos_roberta_large_english_upos_en_3.4.2_3.0_1651596140502.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_pos_roberta_large_english_upos","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_pos_roberta_large_english_upos","en") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.pos.roberta_large_english_upos").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_pos_roberta_large_english_upos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/KoichiYasuoka/roberta-large-english-upos - https://universaldependencies.org/en/ - https://universaldependencies.org/u/pos/ - https://github.com/KoichiYasuoka/esupar --- layout: model title: Stop Words Cleaner for Polish author: John Snow Labs name: stopwords_pl date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: pl edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, pl] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_pl_pl_2.5.4_2.4_1594742438519.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_pl_pl_2.5.4_2.4_1594742438519.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_pl", "pl") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Oprócz bycia królem północy, John Snow jest angielskim lekarzem i liderem w rozwoju anestezjologii i higieny medycznej.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_pl", "pl") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Oprócz bycia królem północy, John Snow jest angielskim lekarzem i liderem w rozwoju anestezjologii i higieny medycznej.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Oprócz bycia królem północy, John Snow jest angielskim lekarzem i liderem w rozwoju anestezjologii i higieny medycznej."""] stopword_df = nlu.load('pl.stopwords').predict(text) stopword_df[['cleanTokens']] ```
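Conceptually, the StopWordsCleaner stage filters tokens against a fixed stop-word dictionary. The sketch below illustrates that idea in plain Python; the tiny Polish word list is purely illustrative and is not the model's actual dictionary:

```python
# A hypothetical, deliberately small stop-word set for illustration only.
STOPWORDS_PL = {"i", "w", "jest", "o", "na", "z"}

def remove_stopwords(tokens, stopwords=STOPWORDS_PL):
    """Drop tokens whose lowercase form appears in the stop-word set."""
    return [t for t in tokens if t.lower() not in stopwords]
```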
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=5, result='Oprócz', metadata={'sentence': '0'}), Row(annotatorType='token', begin=7, end=11, result='bycia', metadata={'sentence': '0'}), Row(annotatorType='token', begin=13, end=18, result='królem', metadata={'sentence': '0'}), Row(annotatorType='token', begin=20, end=26, result='północy', metadata={'sentence': '0'}), Row(annotatorType='token', begin=27, end=27, result=',', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_pl| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|pl| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Duplets) author: John Snow Labs name: distilbert_qa_duplets_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Duplets`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_duplets_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768419922.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_duplets_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768419922.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_duplets_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_duplets_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_duplets_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Duplets/distilbert-base-uncased-finetuned-squad --- layout: model title: French CamemBert Embeddings (from Henrywang) author: John Snow Labs name: camembert_embeddings_Henrywang_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `Henrywang`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Henrywang_generic_model_fr_3.4.4_3.0_1653986322898.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Henrywang_generic_model_fr_3.4.4_3.0_1653986322898.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Henrywang_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Henrywang_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_Henrywang_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/Henrywang/dummy-model --- layout: model title: Modern Greek (1453-) asr_greek_lsr_1 TFWav2Vec2ForCTC from skylord author: John Snow Labs name: pipeline_asr_greek_lsr_1 date: 2022-09-25 tags: [wav2vec2, el, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: el edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_greek_lsr_1` is a Modern Greek (1453-) model originally trained by skylord. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_greek_lsr_1_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_greek_lsr_1_el_4.2.0_3.0_1664111543823.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_greek_lsr_1_el_4.2.0_3.0_1664111543823.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_greek_lsr_1', lang = 'el') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_greek_lsr_1", lang = "el") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_greek_lsr_1| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|el| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Ganda asr_wav2vec2_luganda_by_cahya TFWav2Vec2ForCTC from cahya author: John Snow Labs name: asr_wav2vec2_luganda_by_cahya date: 2022-09-24 tags: [wav2vec2, lg, audio, open_source, asr] task: Automatic Speech Recognition language: lg edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_luganda_by_cahya` is a Ganda model originally trained by cahya. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_luganda_by_cahya_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_luganda_by_cahya_lg_4.2.0_3.0_1664037739573.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_luganda_by_cahya_lg_4.2.0_3.0_1664037739573.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_luganda_by_cahya", "lg")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_luganda_by_cahya", "lg") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_luganda_by_cahya| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|lg| |Size:|1.2 GB| --- layout: model title: Fast Neural Machine Translation Model from San Salvador Kongo to English author: John Snow Labs name: opus_mt_kwy_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, kwy, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `kwy` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_kwy_en_xx_2.7.0_2.4_1609170973770.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_kwy_en_xx_2.7.0_2.4_1609170973770.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_kwy_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_kwy_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.kwy.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_kwy_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: BioBERT Embeddings (PMC) author: John Snow Labs name: biobert_pmc_base_cased date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains the pre-trained weights of BioBERT, a language representation model for the biomedical domain, especially designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/biobert_pmc_base_cased_en_2.6.0_2.4_1598343018425.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/biobert_pmc_base_cased_en_2.6.0_2.4_1598343018425.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("biobert_pmc_base_cased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("biobert_pmc_base_cased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I hate cancer").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer"] embeddings_df = nlu.load('en.embed.biobert.pmc_base_cased').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash token en_embed_biobert_pmc_base_cased_embeddings I [0.0654267892241478, 0.06330983340740204, 0.13... hate [0.3058323264122009, 0.4778319299221039, -0.09... cancer [0.3130614757537842, 0.024675076827406883, -0.... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|biobert_pmc_base_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert) --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728` is a German model originally trained by jonatasgrosman. 
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728_de_4.2.0_3.0_1664115088643.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728_de_4.2.0_3.0_1664115088643.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
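Both snippets above assume an `audioDf` DataFrame whose `audio_content` column already holds the raw audio samples. As a hedged sketch (16 kHz mono audio normalized to floats in [-1, 1] is an assumption based on typical Wav2vec2 inputs, and the file name `tone.wav` is illustrative), such a float array can be produced from a WAV file with the Python standard library alone:

```python
import math
import struct
import wave

# Write a 1-second, 16 kHz mono test tone so the example is self-contained.
with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 16-bit PCM
    w.setframerate(16000)
    frames = b"".join(
        struct.pack("<h", int(32767 * math.sin(2 * math.pi * 440 * t / 16000)))
        for t in range(16000)
    )
    w.writeframes(frames)

def wav_to_floats(path):
    """Read 16-bit PCM WAV samples and normalize them to floats in [-1, 1]."""
    with wave.open(path, "rb") as r:
        raw = r.readframes(r.getnframes())
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [s / 32768.0 for s in samples]

floats = wav_to_floats("tone.wav")
print(len(floats))  # → 16000
```

A list like `floats` is what would populate each row of the `audio_content` column before `AudioAssembler` runs.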
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: Abkhazian asr_wav2vec2_common_voice_ab_demo TFWav2Vec2ForCTC from patrickvonplaten author: John Snow Labs name: asr_wav2vec2_common_voice_ab_demo date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_common_voice_ab_demo` is an Abkhazian model originally trained by patrickvonplaten. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_common_voice_ab_demo_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_common_voice_ab_demo_ab_4.2.0_3.0_1664042256123.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_common_voice_ab_demo_ab_4.2.0_3.0_1664042256123.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_common_voice_ab_demo", "ab")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_common_voice_ab_demo", "ab") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_common_voice_ab_demo| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ab| |Size:|1.2 GB| --- layout: model title: Mapping Entities with Corresponding RxNorm Codes and Normalized Names author: John Snow Labs name: rxnorm_normalized_mapper date: 2022-09-29 tags: [en, clinical, licensed, rxnorm, chunk_mapping] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps entities with their corresponding RxNorm codes and normalized RxNorm resolutions. ## Predicted Entities `rxnorm_code`, `normalized_name` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_normalized_mapper_en_4.1.0_3.0_1664443862683.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_normalized_mapper_en_4.1.0_3.0_1664443862683.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") posology_ner_model = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("posology_ner") posology_ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "posology_ner"])\ .setOutputCol("ner_chunk") chunkerMapper = ChunkMapperModel.pretrained("rxnorm_normalized_mapper", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings")\ .setRels(["rxnorm_code", "normalized_name"]) mapper_pipeline = Pipeline().setStages([ document_assembler, sentence_detector, tokenizer, word_embeddings, posology_ner_model, posology_ner_converter, chunkerMapper]) data = spark.createDataFrame([["The patient was given Zyrtec 10 MG, Adapin 10 MG Oral Capsule, Septi-Soothe 0.5 Topical Spray"]]).toDF("text") result= mapper_pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val posology_ner_model = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) 
.setOutputCol("posology_ner") val posology_ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "posology_ner")) .setOutputCol("ner_chunk") val chunkerMapper = ChunkMapperModel.pretrained("rxnorm_normalized_mapper", "en", "clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("mappings") .setRels(Array("rxnorm_code", "normalized_name")) val mapper_pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, posology_ner_model, posology_ner_converter, chunkerMapper)) val data = Seq("The patient was given Zyrtec 10 MG, Adapin 10 MG Oral Capsule, Septi-Soothe 0.5 Topical Spray").toDS.toDF("text") val result = mapper_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.rxnorm_normalized").predict("""The patient was given Zyrtec 10 MG, Adapin 10 MG Oral Capsule, Septi-Soothe 0.5 Topical Spray""") ```
## Results ```bash +------------------------------+-----------+--------------------------------------------------------------+ |ner_chunk |rxnorm_code|normalized_name | +------------------------------+-----------+--------------------------------------------------------------+ |Zyrtec 10 MG |1011483 |cetirizine hydrochloride 10 MG [Zyrtec] | |Adapin 10 MG Oral Capsule |1000050 |doxepin hydrochloride 10 MG Oral Capsule [Adapin] | |Septi-Soothe 0.5 Topical Spray|1000046 |chlorhexidine diacetate 0.5 MG/ML Topical Spray [Septi-Soothe]| +------------------------------+-----------+--------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|rxnorm_normalized_mapper| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|10.7 MB| --- layout: model title: English DistilBertForQuestionAnswering Cased model (from Slavka) author: John Snow Labs name: distilbert_qa_bert_base_cased_finetuned_log_parser_winlogbeat date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased-finetuned-log-parser-winlogbeat` is an English model originally trained by `Slavka`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_bert_base_cased_finetuned_log_parser_winlogbeat_en_4.0.0_3.0_1654723355284.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_bert_base_cased_finetuned_log_parser_winlogbeat_en_4.0.0_3.0_1654723355284.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bert_base_cased_finetuned_log_parser_winlogbeat","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bert_base_cased_finetuned_log_parser_winlogbeat","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.base_cased").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_bert_base_cased_finetuned_log_parser_winlogbeat| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Slavka/bert-base-cased-finetuned-log-parser-winlogbeat --- layout: model title: Summarize clinical notes author: John Snow Labs name: summarizer_clinical_jsl date: 2023-03-25 tags: [en, licensed, clinical, summarization, tensorflow] task: Summarization language: en edition: Healthcare NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: MedicalSummarizer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Summarize clinical notes, encounters, critical care notes, discharge notes, reports, etc. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/MEDICAL_TEXT_SUMMARIZATION/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/32.Medical_Text_Summarization.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_jsl_en_4.3.1_3.0_1679772340755.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_jsl_en_4.3.1_3.0_1679772340755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document = DocumentAssembler().setInputCol('text').setOutputCol('document') summarizer = MedicalSummarizer.pretrained("summarizer_clinical_jsl", "en", "clinical/models").setInputCols(['document'])\ .setOutputCol('summary')\ .setMaxTextLength(512)\ .setMaxNewTokens(512) pipeline = sparknlp.base.Pipeline(stages=[ document, summarizer ]) text = """Patient with hypertension, syncope, and spinal stenosis - for recheck. (Medical Transcription Sample Report) SUBJECTIVE: The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS: Reviewed and unchanged from the dictation on 12/03/2003. MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash.""" data = spark.createDataFrame([[text]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val summarizer = MedicalSummarizer.pretrained("summarizer_clinical_jsl", "en", "clinical/models") .setInputCols(Array("document")) .setOutputCol("summary") .setMaxTextLength(512) .setMaxNewTokens(512) val pipeline = new Pipeline().setStages(Array(document_assembler, summarizer)) val text = """Patient with hypertension, syncope, and spinal stenosis - for recheck. (Medical Transcription Sample Report) SUBJECTIVE: The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS: Reviewed and unchanged from the dictation on 12/03/2003.
MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash.""" val data = Seq(text).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash A 78-year-old female with hypertension, syncope, and spinal stenosis returns for recheck. She denies chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. She is on multiple medications and has Elocon cream and Synalar cream for rash. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_clinical_jsl| |Compatibility:|Healthcare NLP 4.3.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|920.1 MB| ## References Trained on in-house curated dataset --- layout: model title: Sentence Entity Resolver for billable ICD10-CM HCC codes (Slim, JSL Medium Bert) author: John Snow Labs name: sbertresolve_icd10cm_slim_billable_hcc_med date: 2021-05-21 tags: [licensed, clinical, en, entity_resolution] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD10-CM codes using sentence embeddings. The model has been augmented with synonyms; synonyms with low cosine similarity are dropped, which keeps the model slim. It utilises the fine-tuned `sbert_jsl_medium_uncased` Sentence BERT model. ## Predicted Entities Outputs 7-digit billable ICD codes. In the result, look for the aux_label parameter in the metadata to get the HCC status. The HCC status can be split to obtain further information: billable status, HCC status, and HCC score.
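The combined aux_label value can be unpacked in post-processing. A minimal sketch, assuming the three fields arrive as a single string joined by a `||` separator (the separator, field order, and helper name are assumptions; check the metadata of your own output):

```python
# Hypothetical helper: unpack an aux_label value such as "1||1||8" into
# billable status, HCC status, and HCC score. The "||" separator is an
# assumption about the metadata format, not a documented contract.
def parse_hcc_status(aux_label, sep="||"):
    billable, hcc, score = aux_label.split(sep)
    return {
        "billable": billable == "1",   # is the code 7-digit billable?
        "hcc": hcc == "1",             # does the code map to an HCC category?
        "hcc_score": int(score),       # the HCC score
    }

print(parse_hcc_status("1||1||8"))
# → {'billable': True, 'hcc': True, 'hcc_score': 8}
```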
{:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10cm_slim_billable_hcc_med_en_3.0.4_2.4_1621590174924.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10cm_slim_billable_hcc_med_en_3.0.4_2.4_1621590174924.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbert_jsl_medium_uncased","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sbert_embeddings") icd10_resolver = SentenceEntityResolverModel\ .pretrained("sbertresolve_icd10cm_slim_billable_hcc_med","en", "clinical/models")\ .setInputCols(["document", "sbert_embeddings"])\ .setOutputCol("icd10cm_code")\ .setDistanceFunction("EUCLIDEAN")\ .setReturnCosineDistances(True) bert_pipeline_icd = Pipeline(stages = [document_assembler, sbert_embedder, icd10_resolver]) data = spark.createDataFrame([["metastatic lung cancer"]]).toDF("text") results = bert_pipeline_icd.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbert_jsl_medium_uncased","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sbert_embeddings") val icd10_resolver = SentenceEntityResolverModel .pretrained("sbertresolve_icd10cm_slim_billable_hcc_med","en", "clinical/models") .setInputCols(Array("document", "sbert_embeddings")) .setOutputCol("icd10cm_code") .setDistanceFunction("EUCLIDEAN") .setReturnCosineDistances(true) val bert_pipeline_icd = new Pipeline().setStages(Array(document_assembler, sbert_embedder, icd10_resolver)) val data = Seq("metastatic lung cancer").toDF("text") val result = bert_pipeline_icd.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd10cm.slim_billable_hcc_med").predict("""metastatic lung cancer""") ```
## Results ```bash | | chunks | code | resolutions | all_codes | billable_hcc_status_score | all_distances | |---:|:-----------------------|:-------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------|:----------------------------|:-------------------------------------------------------------------------------------------------------------------------| | 0 | metastatic lung cancer | C7800 | ['cancer metastatic to lung', 'metastasis from malignant tumor of lung', 'cancer metastatic to left lung', 'history of cancer metastatic to lung', 'metastatic cancer', 'history of cancer metastatic to lung (situation)', 'metastatic adenocarcinoma to bilateral lungs', 'cancer metastatic to chest wall', 'metastatic malignant neoplasm to left lower lobe of lung', 'metastatic carcinoid tumour', 'cancer metastatic to respiratory tract', 'metastatic carcinoid tumor'] | ['C7800', 'C349', 'C7801', 'Z858', 'C800', 'Z8511', 'C780', 'C798', 'C7802', 'C799', 'C7830', 'C7B00'] | ['1', '1', '8'] | ['0.0464', '0.0829', '0.0852', '0.0860', '0.0914', '0.0989', '0.1133', '0.1220', '0.1220', '0.1253', '0.1249', '0.1260'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbertresolve_icd10cm_slim_billable_hcc_med| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk, sbert_embeddings]| |Output Labels:|[icd10_code]| |Language:|en| |Case sensitive:|false| --- layout: model title: English 
DistilBertForQuestionAnswering model (from twmkn9) author: John Snow Labs name: distilbert_qa_base_uncased_squad2 date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-squad2` is an English model originally trained by `twmkn9`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_en_4.0.0_3.0_1654727261779.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_en_4.0.0_3.0_1654727261779.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.distil_bert.base_uncased.by_twmkn9").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/twmkn9/distilbert-base-uncased-squad2 --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_6_h_128 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-6_H-128` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_128_zh_4.2.4_3.0_1670325968748.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_128_zh_4.2.4_3.0_1670325968748.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_128","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_128","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_6_h_128| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|15.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-6_H-128 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://github.com/dbiir/UER-py/ - https://cloud.tencent.com/ --- layout: model title: Legal Transition Services Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_transition_services_agreement_bert date: 2022-11-25 tags: [en, legal, classification, agreement, transition_services, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_transition_services_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `transition-services-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `transition-services-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_transition_services_agreement_bert_en_1.0.0_3.0_1669372317483.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_transition_services_agreement_bert_en_1.0.0_3.0_1669372317483.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_transition_services_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[transition-services-agreement]| |[other]| |[other]| |[transition-services-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_transition_services_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.99 0.98 0.98 82 transition-services-agreement 0.94 0.97 0.96 34 accuracy - - 0.97 116 macro-avg 0.97 0.97 0.97 116 weighted-avg 0.97 0.97 0.97 116 ``` --- layout: model title: Legal Letter Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_letter_agreement_bert date: 2022-11-24 tags: [en, legal, classification, agreement, letter, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_letter_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `letter-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `letter-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_letter_agreement_bert_en_1.0.0_3.0_1669315660461.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_letter_agreement_bert_en_1.0.0_3.0_1669315660461.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_letter_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[letter-agreement]| |[other]| |[other]| |[letter-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_letter_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support letter-agreement 0.94 0.87 0.90 38 other 0.93 0.97 0.95 65 accuracy - - 0.93 103 macro-avg 0.93 0.92 0.93 103 weighted-avg 0.93 0.93 0.93 103 ``` --- layout: model title: Stop Words Cleaner for Galician author: John Snow Labs name: stopwords_gl date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: gl edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, gl] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions.
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_gl_gl_2.5.4_2.4_1594742441210.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_gl_gl_2.5.4_2.4_1594742441210.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_gl", "gl") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Ademais de ser o rei do norte, John Snow é un médico inglés e un líder no desenvolvemento da anestesia e a hixiene médica.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_gl", "gl") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Ademais de ser o rei do norte, John Snow é un médico inglés e un líder no desenvolvemento da anestesia e a hixiene médica.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Ademais de ser o rei do norte, John Snow é un médico inglés e un líder no desenvolvemento da anestesia e a hixiene médica."""] stopword_df = nlu.load('gl.stopwords').predict(text) stopword_df[["cleanTokens"]] ```
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=6, result='Ademais', metadata={'sentence': '0'}), Row(annotatorType='token', begin=17, end=19, result='rei', metadata={'sentence': '0'}), Row(annotatorType='token', begin=24, end=28, result='norte', metadata={'sentence': '0'}), Row(annotatorType='token', begin=29, end=29, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=31, end=34, result='John', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_gl| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|gl| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: Turkish BertForQuestionAnswering model (from yunusemreemik) author: John Snow Labs name: bert_qa_logo_qna_model date: 2022-06-02 tags: [tr, open_source, question_answering, bert] task: Question Answering language: tr edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `logo-qna-model` is a Turkish model originally trained by `yunusemreemik`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_logo_qna_model_tr_4.0.0_3.0_1654188164539.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_logo_qna_model_tr_4.0.0_3.0_1654188164539.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_logo_qna_model","tr") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_logo_qna_model","tr") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("tr.answer_question.bert.by_yunusemreemik").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
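The nlu one-liner above packs the question and its context into a single string separated by `|||`. A tiny helper (the function name is ours, not part of nlu) keeps that formatting in one place:

```python
def qa_input(question: str, context: str) -> str:
    """Join a question and its context with the '|||' separator used in
    the nlu question-answering predict() example above."""
    return f"{question}|||{context}"

# qa_input("What's my name?", "My name is Clara and I live in Berkeley.")
# -> "What's my name?|||My name is Clara and I live in Berkeley."
```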
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_logo_qna_model| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|tr| |Size:|412.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/yunusemreemik/logo-qna-model --- layout: model title: Swedish asr_Wav2Vec2_large_xlsr_welsh TFWav2Vec2ForCTC from Srulikbdd author: John Snow Labs name: asr_Wav2Vec2_large_xlsr_welsh date: 2022-09-25 tags: [wav2vec2, sv, audio, open_source, asr] task: Automatic Speech Recognition language: sv edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Wav2Vec2_large_xlsr_welsh` is a Swedish model originally trained by Srulikbdd. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_Wav2Vec2_large_xlsr_welsh_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Wav2Vec2_large_xlsr_welsh_sv_4.2.0_3.0_1664115056212.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Wav2Vec2_large_xlsr_welsh_sv_4.2.0_3.0_1664115056212.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_Wav2Vec2_large_xlsr_welsh", "sv")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_Wav2Vec2_large_xlsr_welsh", "sv") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_Wav2Vec2_large_xlsr_welsh| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|sv| |Size:|1.2 GB| --- layout: model title: Bert for Sequence Classification (Question vs Statement) author: John Snow Labs name: bert_sequence_classifier_question_statement date: 2021-11-04 tags: [question, statement, en, open_source] task: Text Classification language: en nav_key: models edition: Spark NLP 3.3.2 spark_version: 3.0 supported: true annotator: BertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model classifies sentences as either questions or statements. It was imported from Hugging Face (https://huggingface.co/shahrukhx01/question-vs-statement-classifier) and trained based on Haystack (https://github.com/deepset-ai/haystack/issues/611). ## Predicted Entities `question`, `statement` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_question_statement_en_3.3.2_3.0_1636038134936.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_question_statement_en_3.3.2_3.0_1636038134936.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") seq = BertForSequenceClassification.pretrained('bert_sequence_classifier_question_statement', 'en')\ .setInputCols(["token", "sentence"])\ .setOutputCol("label")\ .setCaseSensitive(True) pipeline = Pipeline(stages = [ documentAssembler, sentenceDetector, tokenizer, seq]) test_sentences = ["""What feature in your car did you not realize you had until someone else told you about it? Years ago, my Dad bought me a cute little VW Beetle. The first day I had it, me and my BFF were sitting in my car looking at everything. When we opened the center console, we had quite the scare. Inside was a hollowed out, plastic phallic looking object with tiny spikes on it. My friend and I literally screamed in horror. It was clear to us that somehow someone left their “toy” in my new car! We were shook, as they say. This was my car, I had to do something. So, I used a pen to pick up the nasty looking thing and threw it out. We freaked out about how gross it was and then we forgot about it… until my Dad called me. My Dad said: How’s the new car? Have you seen the flower holder in the center console? 
To summarize, we thought a flower vase was an XXX item… In our defense, this is a picture of a VW Beetle flower holder."""] import pandas as pd data = spark.createDataFrame(pd.DataFrame({'text': test_sentences})) res = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val seq = BertForSequenceClassification.pretrained("bert_sequence_classifier_question_statement", "en") .setInputCols(Array("token", "sentence")) .setOutputCol("label") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, seq)) val test_sentences = "What feature in your car did you not realize you had until someone else told you about it? Years ago, my Dad bought me a cute little VW Beetle. The first day I had it, me and my BFF were sitting in my car looking at everything. When we opened the center console, we had quite the scare. Inside was a hollowed out, plastic phallic looking object with tiny spikes on it. My friend and I literally screamed in horror. It was clear to us that somehow someone left their “toy” in my new car! We were shook, as they say. This was my car, I had to do something. So, I used a pen to pick up the nasty looking thing and threw it out. We freaked out about how gross it was and then we forgot about it… until my Dad called me. My Dad said: How’s the new car? Have you seen the flower holder in the center console? To summarize, we thought a flower vase was an XXX item… In our defense, this is a picture of a VW Beetle flower holder."
val example = Seq(test_sentences).toDF("text") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.question_vs_statement").predict("""What feature in your car did you not realize you had until someone else told you about it? Years ago, my Dad bought me a cute little VW Beetle. The first day I had it, me and my BFF were sitting in my car looking at everything. When we opened the center console, we had quite the scare. Inside was a hollowed out, plastic phallic looking object with tiny spikes on it. My friend and I literally screamed in horror. It was clear to us that somehow someone left their “toy” in my new car! We were shook, as they say. This was my car, I had to do something. So, I used a pen to pick up the nasty looking thing and threw it out. We freaked out about how gross it was and then we forgot about it… until my Dad called me. My Dad said: How’s the new car? Have you seen the flower holder in the center console? To summarize, we thought a flower vase was an XXX item… In our defense, this is a picture of a VW Beetle flower holder.""") ```
## Results ```bash +------------------------------------------------------------------------------------------+---------+ | sentence| label| +------------------------------------------------------------------------------------------+---------+ |What feature in your car did you not realize you had until someone else told you about it?| question| | Years ago, my Dad bought me a cute little VW Beetle.|statement| | The first day I had it, me and my BFF were sitting in my car looking at everything.|statement| | When we opened the center console, we had quite the scare.|statement| | Inside was a hollowed out, plastic phallic looking object with tiny spikes on it.|statement| | My friend and I literally screamed in horror.|statement| | It was clear to us that somehow someone left their “toy” in my new car!|statement| | We were shook, as they say.|statement| | This was my car, I had to do something.|statement| | So, I used a pen to pick up the nasty looking thing and threw it out.|statement| |We freaked out about how gross it was and then we forgot about it… until my Dad called me.|statement| | My Dad said: How’s the new car?| question| | Have you seen the flower holder in the center console?| question| | To summarize, we thought a flower vase was an XXX item…|statement| | In our defense, this is a picture of a VW Beetle flower holder.|statement| +------------------------------------------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_question_statement| |Compatibility:|Spark NLP 3.3.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[label]| |Language:|en| |Case sensitive:|true| ## Data Source https://github.com/deepset-ai/haystack/issues/611 ## Benchmarking ```bash Extracted from https://github.com/deepset-ai/haystack/issues/611 precision recall f1-score support statement 0.94 0.94 0.94 16105 question 0.96 0.96 0.96 26198 accuracy 0.95 42303 macro avg 0.95 0.95 0.95 42303 weighted avg 0.95 0.95 0.95 42303 ``` --- layout: model title: Legal Court Judgment Prediction (Portuguese) author: John Snow Labs name: legclf_judgment_prediction date: 2023-04-06 tags: [pt, licensed, legal, classification, tensorflow] task: Text Classification language: pt edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a multiclass classification model that identifies court decisions in the State Supreme Court, using the following classes: - no: The appeal was denied - partial: For partially favourable decisions - yes: For fully favourable decisions ## Predicted Entities `no`, `partial`, `yes` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_judgment_prediction_pt_1.0.0_3.0_1680778981035.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_judgment_prediction_pt_1.0.0_3.0_1680778981035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") seq_classifier = legal.BertForSequenceClassification.pretrained("legclf_judgment_prediction", "pt", "legal/models") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, seq_classifier ]) # a simple example example = spark.createDataFrame([["PENAL. PROCESSO PENAL. APELAÇÃO. HOMICÍDIO QUALIFICADO. ARGUIÇÃO DE NULIDADE EM DECORRÊNCIA DA INSCONSTITUCIONALIDADE DO ARTIGO 457 DO CÓDIGO DE PROCESSO PENAL. AFASTADA. PLEITO DE REDIMENSIONAMENTO DA PENA. DOSIMETRIA QUE MERECE RETOQUES. AFASTADA A VALORAÇÃO DESFAVORÁVEL DAS CIRCUNSTÂNCIAS JUDICIAIS DOS ANTECEDENTES E DA PERSONALIDADE DO AGENTE. MANTIDA A CULPABILIDADE, CIRCUNSTÂNCIAS DO DELITO E CONSEQUÊNCIAS DO CRIME. APELO CONHECIDO E PARCIALMENTE PROVIDO. 1 Não há falar em ocorrência de nulidade não caso concreto, não existindo qualquer inconstitucionalidade em virtude do texto legal do ARTIGO 457 do Código de Processo Penal, não tendo ocorrido o adiamento da sessão do júri em virtude da ausência do acusado, conforme alegado. Pelo contrário, este foi devidamente intimado por edital e, mesmo assim, restou ausente. 2 A justificativa apresentada pelo magistrado singular acerca da culpabilidade considerou a alta reprovabilidade da conduta do réu, em virtude da premeditação e frieza na prática delitiva, considerando que o acusado foi até a casa da vítima com o intuito de ceifar a sua vida, experimentando assim a consequência da transgressão, estando acertada a valoração negativa desta circunstância judicial."]]).toDF("text") result = pipeline.fit(example).transform(example) # result is a DataFrame result.select("text", "class.result").show(truncate=100) ```
## Results ```bash +----------------------------------------------------------------------------------------------------+---------+ | text| result| +----------------------------------------------------------------------------------------------------+---------+ |PENAL. PROCESSO PENAL. APELAÇÃO. HOMICÍDIO QUALIFICADO. ARGUIÇÃO DE NULIDADE EM DECORRÊNCIA DA IN...|[partial]| +----------------------------------------------------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_judgment_prediction| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|pt| |Size:|408.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## Benchmarking ```bash label precision recall f1-score support no 0.76 0.77 0.76 86 partial 0.79 0.71 0.75 75 yes 0.71 0.78 0.74 76 accuracy - - 0.75 237 macro-avg 0.75 0.75 0.75 237 weighted-avg 0.75 0.75 0.75 237 ``` --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_42 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-128-finetuned-squad-seed-42` is an English model originally trained by `anas-awadalla`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_42_en_4.0.0_3.0_1657184284783.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_42_en_4.0.0_3.0_1657184284783.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_42","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_42","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_42| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-128-finetuned-squad-seed-42 --- layout: model title: English image_classifier_vit__spectrogram ViTForImageClassification from prashanth0205 author: John Snow Labs name: image_classifier_vit__spectrogram date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit__spectrogram` is an English model originally trained by prashanth0205. ## Predicted Entities `female`, `male` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit__spectrogram_en_4.1.0_3.0_1660170988681.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit__spectrogram_en_4.1.0_3.0_1660170988681.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit__spectrogram", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit__spectrogram", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit__spectrogram| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English BertForQuestionAnswering model (from tli8hf) author: John Snow Labs name: bert_qa_unqover_bert_base_uncased_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `unqover-bert-base-uncased-squad` is an English model originally trained by `tli8hf`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_unqover_bert_base_uncased_squad_en_4.0.0_3.0_1654192543690.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_unqover_bert_base_uncased_squad_en_4.0.0_3.0_1654192543690.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_unqover_bert_base_uncased_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_unqover_bert_base_uncased_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased.by_tli8hf").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_unqover_bert_base_uncased_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/tli8hf/unqover-bert-base-uncased-squad --- layout: model title: Pipeline to Detect Units and Measurements author: John Snow Labs name: ner_measurements_clinical_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, measurements, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_measurements_clinical](https://nlp.johnsnowlabs.com/2021/04/01/ner_measurements_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_measurements_clinical_pipeline_en_3.4.1_3.0_1647870532389.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_measurements_clinical_pipeline_en_3.4.1_3.0_1647870532389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_measurements_clinical_pipeline", "en", "clinical/models") pipeline.annotate("EXAMPLE MEDICAL TEXT") ``` ```scala val pipeline = new PretrainedPipeline("ner_measurements_clinical_pipeline", "en", "clinical/models") pipeline.annotate("EXAMPLE MEDICAL TEXT") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical_measurements.pipeline").predict("""EXAMPLE MEDICAL TEXT""") ```
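`annotate` returns a plain Python dict mapping each output column name to a list of result strings. A small helper for pairing two of those columns elementwise; the column names used here (`ner_chunk`, `ner_label`) are illustrative assumptions, so inspect the dict's keys for the actual names this pipeline produces:

```python
def pair_columns(annotations, left="ner_chunk", right="ner_label"):
    """Zip two columns of an annotate() result dict into (left, right) pairs.
    Missing columns yield an empty list rather than raising."""
    return list(zip(annotations.get(left, []), annotations.get(right, [])))

# With a mocked annotate() result (real keys depend on the pipeline):
mock = {"ner_chunk": ["10 mg", "2 cm"], "ner_label": ["Measurements", "Measurements"]}
pairs = pair_columns(mock)  # [("10 mg", "Measurements"), ("2 cm", "Measurements")]
```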
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_measurements_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English asr_wav2vec2_cetuc_sid_voxforge_mls_0 TFWav2Vec2ForCTC from joaoalvarenga author: John Snow Labs name: asr_wav2vec2_cetuc_sid_voxforge_mls_0 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_cetuc_sid_voxforge_mls_0` is an English model originally trained by joaoalvarenga. NOTE: This model only works on a CPU; if you need to run it on a GPU device, please use `asr_wav2vec2_cetuc_sid_voxforge_mls_0_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_cetuc_sid_voxforge_mls_0_en_4.2.0_3.0_1664022744764.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_cetuc_sid_voxforge_mls_0_en_4.2.0_3.0_1664022744764.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_cetuc_sid_voxforge_mls_0", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_cetuc_sid_voxforge_mls_0", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_cetuc_sid_voxforge_mls_0| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: SDOH Insurance Type For Classification author: John Snow Labs name: genericclassifier_sdoh_insurance_type_sbiobert_cased_mli date: 2023-04-28 tags: [en, insurance, sdoh, social_determinants, public_health, classification, licensed] task: Text Classification language: en edition: Healthcare NLP 4.4.0 spark_version: 3.0 supported: true annotator: GenericClassifierModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Generic Classifier model detects the patient's insurance type. The classifier assumes that the patient **has insurance**: if the insurance type is not mentioned or not known, it is labeled "Other", and if the insurance is "Tricare" or "VA (Veterans Affairs)", it is labeled "Military". The model is trained using the GenericClassifierApproach annotator. `Employer`: Employer insurance. `Medicaid`: Medicaid insurance. `Medicare`: Medicare insurance. `Military`: "Tricare" or "VA (Veterans Affairs)" insurance. `Private`: Private insurance. `Other`: Other insurance, or the insurance type is not mentioned in the clinical notes or is not known. 
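The label scheme above can be sketched as a keyword lookup. This is an illustration of the labeling rules only, not the model: the real classifier is trained over sentence embeddings, and the `insurance_label` helper below (with its keyword list) is hypothetical:

```python
# Toy approximation of the label scheme: "Tricare"/"VA" -> Military,
# explicit insurer names map to their label, everything else -> Other.
def insurance_label(text):
    t = text.lower()
    if "tricare" in t or "va " in t or "veteran" in t:
        return "Military"
    for label in ("Employer", "Medicaid", "Medicare", "Private"):
        if label.lower() in t:
            return label
    return "Other"

print(insurance_label("The patient has VA insurance."))   # -> Military
print(insurance_label("She is under Medicare insurance"))  # -> Medicare
```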
## Predicted Entities `Employer`, `Medicaid`, `Medicare`, `Military`, `Private`, `Other` {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/social_determinant){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_insurance_type_sbiobert_cased_mli_en_4.4.0_3.0_1682694596560.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_insurance_type_sbiobert_cased_mli_en_4.4.0_3.0_1682694596560.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") features_asm = FeaturesAssembler()\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("features") generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_insurance_type_sbiobert_cased_mli", 'en', 'clinical/models')\ .setInputCols(["features"])\ .setOutputCol("prediction") pipeline = Pipeline(stages=[ document_assembler, sentence_embeddings, features_asm, generic_classifier ]) text_list = [ "The patient has VA insurance.", "She is under Medicare insurance", "The patient has good coverage of Private insurance", """Medical File for John Smith, Male, Age 42 Chief Complaint: Patient complains of nausea, vomiting, and shortness of breath. History of Present Illness: The patient has a history of hypertension and diabetes, which are both poorly controlled. The patient has been feeling unwell for the past week, with symptoms including nausea, vomiting, and shortness of breath. Upon examination, the patient was found to have a high serum creatinine level of 5.8 mg/dL, indicating renal failure. Past Medical History: The patient has a history of hypertension and diabetes, which have been poorly controlled due to poor medication adherence. The patient also has a history of smoking, which has been a contributing factor to the development of renal failure. Medications: The patient is currently taking Metformin and Lisinopril for the management of diabetes and hypertension, respectively. However, due to poor Medicaid coverage, the patient is unable to afford some of the medications prescribed by his physician. 
Insurance Status: The patient has Medicaid insurance, which provides poor coverage for some of the medications needed to manage his medical conditions, including those related to his renal failure. Physical Examination: Upon physical examination, the patient appears pale and lethargic. Blood pressure is 160/100 mmHg, heart rate is 90 beats per minute, and respiratory rate is 20 breaths per minute. There is diffuse abdominal tenderness on palpation, and lung auscultation reveals bilateral rales. Diagnosis: The patient is diagnosed with acute renal failure, likely due to uncontrolled hypertension and poorly managed diabetes. Treatment: The patient is started on intravenous fluids and insulin to manage his blood sugar levels. Due to the patient's poor Medicaid coverage, the physician works with the patient to identify alternative medications that are more affordable and will still provide effective management of his medical conditions. Follow-Up: The patient is advised to follow up with his primary care physician for ongoing management of his renal failure and other medical conditions. The patient is also referred to a nephrologist for further evaluation and management of his renal failure. """, """Certainly, here is an example case study for a patient with private insurance: Case Study for Emily Chen, Female, Age 38 Chief Complaint: Patient reports chronic joint pain and stiffness. History of Present Illness: The patient has been experiencing chronic joint pain and stiffness, particularly in the hands, knees, and ankles. The pain is worse in the morning and improves throughout the day. The patient has also noticed some swelling and redness in the affected joints. Past Medical History: The patient has a history of osteoarthritis, which has been gradually worsening over the past several years. The patient has tried over-the-counter pain relievers and joint supplements, but has not found significant relief. 
Medications: The patient is currently taking over-the-counter pain relievers and joint supplements for the management of her osteoarthritis. Insurance Status: The patient has private insurance, which provides comprehensive coverage for her medical care, including specialist visits and prescription medications. Physical Examination: Upon physical examination, the patient has tenderness and swelling in multiple joints, particularly the hands, knees, and ankles. Range of motion is limited due to pain and stiffness. Diagnosis: The patient is diagnosed with osteoarthritis, a chronic degenerative joint disease that causes pain, swelling, and stiffness in the affected joints. Treatment: The patient is prescribed a nonsteroidal anti-inflammatory drug (NSAID) to manage pain and inflammation. The physician also recommends physical therapy to improve range of motion and strengthen the affected joints. The patient is advised to continue taking joint supplements for ongoing joint health. Follow-Up: The patient is advised to follow up with the physician in 4-6 weeks to monitor response to treatment and make any necessary adjustments. The patient is also referred to a rheumatologist for further evaluation and management of her osteoarthritis.""", """ Medical File for John Doe, Male, Age 72 Chief Complaint: Patient reports shortness of breath and fatigue. History of Present Illness: The patient has been experiencing shortness of breath and fatigue for the past several weeks. The patient reports difficulty performing daily activities and has noticed a decrease in exercise tolerance. Past Medical History: The patient has a history of hypertension, hyperlipidemia, and coronary artery disease. The patient has undergone a coronary artery bypass graft (CABG) surgery in the past. Medications: The patient is currently taking several medications, including a beta blocker, a statin, and a diuretic, for the management of his medical conditions. 
Insurance Status: The patient has good coverage of Medicare insurance, which provides comprehensive coverage for his medical care, including specialist visits, diagnostic tests, and prescription medications. Physical Examination: Upon physical examination, the patient has crackles in the lungs and peripheral edema. Blood pressure is elevated, and heart sounds are irregular. Diagnosis: The patient is diagnosed with congestive heart failure, a chronic condition in which the heart cannot pump blood effectively to meet the body's needs. Treatment: The patient is admitted to the hospital for further evaluation and management of his congestive heart failure. Treatment includes diuresis to remove excess fluid, medication management to control blood pressure and heart rate, and oxygen therapy to improve breathing. The patient is also advised to follow a low-sodium diet and to monitor his fluid intake closely. Follow-Up: The patient is advised to follow up with his primary care physician and cardiologist regularly to monitor his heart function and adjust treatment as necessary. 
The patient is also referred to cardiac rehabilitation to improve his exercise tolerance and overall cardiovascular health."""] df = spark.createDataFrame(text_list, StringType()).toDF("text") result = pipeline.fit(df).transform(df) result.select("text", "prediction.result").show(truncate=100) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence_embeddings") val features_asm = new FeaturesAssembler() .setInputCols("sentence_embeddings") .setOutputCol("features") val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_insurance_type_sbiobert_cased_mli", "en", "clinical/models") .setInputCols("features") .setOutputCol("prediction") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_embeddings, features_asm, generic_classifier)) val data = Seq(Array( "The patient has VA insurance.", "She is under Medicare insurance", "The patient has good coverage of Private insurance", """Medical File for John Smith, Male, Age 42 Chief Complaint: Patient complains of nausea, vomiting, and shortness of breath. History of Present Illness: The patient has a history of hypertension and diabetes, which are both poorly controlled. The patient has been feeling unwell for the past week, with symptoms including nausea, vomiting, and shortness of breath. Upon examination, the patient was found to have a high serum creatinine level of 5.8 mg/dL, indicating renal failure. Past Medical History: The patient has a history of hypertension and diabetes, which have been poorly controlled due to poor medication adherence. The patient also has a history of smoking, which has been a contributing factor to the development of renal failure. 
Medications: The patient is currently taking Metformin and Lisinopril for the management of diabetes and hypertension, respectively. However, due to poor Medicaid coverage, the patient is unable to afford some of the medications prescribed by his physician. Insurance Status: The patient has Medicaid insurance, which provides poor coverage for some of the medications needed to manage his medical conditions, including those related to his renal failure. Physical Examination: Upon physical examination, the patient appears pale and lethargic. Blood pressure is 160/100 mmHg, heart rate is 90 beats per minute, and respiratory rate is 20 breaths per minute. There is diffuse abdominal tenderness on palpation, and lung auscultation reveals bilateral rales. Diagnosis: The patient is diagnosed with acute renal failure, likely due to uncontrolled hypertension and poorly managed diabetes. Treatment: The patient is started on intravenous fluids and insulin to manage his blood sugar levels. Due to the patient's poor Medicaid coverage, the physician works with the patient to identify alternative medications that are more affordable and will still provide effective management of his medical conditions. Follow-Up: The patient is advised to follow up with his primary care physician for ongoing management of his renal failure and other medical conditions. The patient is also referred to a nephrologist for further evaluation and management of his renal failure. """, """Certainly, here is an example case study for a patient with private insurance: Case Study for Emily Chen, Female, Age 38 Chief Complaint: Patient reports chronic joint pain and stiffness. History of Present Illness: The patient has been experiencing chronic joint pain and stiffness, particularly in the hands, knees, and ankles. The pain is worse in the morning and improves throughout the day. The patient has also noticed some swelling and redness in the affected joints. 
Past Medical History: The patient has a history of osteoarthritis, which has been gradually worsening over the past several years. The patient has tried over-the-counter pain relievers and joint supplements, but has not found significant relief. Medications: The patient is currently taking over-the-counter pain relievers and joint supplements for the management of her osteoarthritis. Insurance Status: The patient has private insurance, which provides comprehensive coverage for her medical care, including specialist visits and prescription medications. Physical Examination: Upon physical examination, the patient has tenderness and swelling in multiple joints, particularly the hands, knees, and ankles. Range of motion is limited due to pain and stiffness. Diagnosis: The patient is diagnosed with osteoarthritis, a chronic degenerative joint disease that causes pain, swelling, and stiffness in the affected joints. Treatment: The patient is prescribed a nonsteroidal anti-inflammatory drug (NSAID) to manage pain and inflammation. The physician also recommends physical therapy to improve range of motion and strengthen the affected joints. The patient is advised to continue taking joint supplements for ongoing joint health. Follow-Up: The patient is advised to follow up with the physician in 4-6 weeks to monitor response to treatment and make any necessary adjustments. The patient is also referred to a rheumatologist for further evaluation and management of her osteoarthritis.""", """ Medical File for John Doe, Male, Age 72 Chief Complaint: Patient reports shortness of breath and fatigue. History of Present Illness: The patient has been experiencing shortness of breath and fatigue for the past several weeks. The patient reports difficulty performing daily activities and has noticed a decrease in exercise tolerance. Past Medical History: The patient has a history of hypertension, hyperlipidemia, and coronary artery disease. 
The patient has undergone a coronary artery bypass graft (CABG) surgery in the past. Medications: The patient is currently taking several medications, including a beta blocker, a statin, and a diuretic, for the management of his medical conditions. Insurance Status: The patient has good coverage of Medicare insurance, which provides comprehensive coverage for his medical care, including specialist visits, diagnostic tests, and prescription medications. Physical Examination: Upon physical examination, the patient has crackles in the lungs and peripheral edema. Blood pressure is elevated, and heart sounds are irregular. Diagnosis: The patient is diagnosed with congestive heart failure, a chronic condition in which the heart cannot pump blood effectively to meet the body's needs. Treatment: The patient is admitted to the hospital for further evaluation and management of his congestive heart failure. Treatment includes diuresis to remove excess fluid, medication management to control blood pressure and heart rate, and oxygen therapy to improve breathing. The patient is also advised to follow a low-sodium diet and to monitor his fluid intake closely. Follow-Up: The patient is advised to follow up with his primary care physician and cardiologist regularly to monitor his heart function and adjust treatment as necessary. The patient is also referred to cardiac rehabilitation to improve his exercise tolerance and overall cardiovascular health.""")).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash +----------------------------------------------------------------------------------------------------+----------+ | text| result| +----------------------------------------------------------------------------------------------------+----------+ | The patient has VA insurance.|[Military]| | She is under Medicare insurance|[Medicare]| |Medical File for John Smith, Male, Age 42\n\nChief Complaint: Patient complains of nausea, vomiti...|[Medicaid]| |Certainly, here is an example case study for a patient with private insurance:\n\nCase Study for ...| [Private]| |\nMedical File for John Doe, Male, Age 72\n\nChief Complaint: Patient reports shortness of breath...|[Medicare]| +----------------------------------------------------------------------------------------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|genericclassifier_sdoh_insurance_type_sbiobert_cased_mli| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[features]| |Output Labels:|[prediction]| |Language:|en| |Size:|3.4 MB| |Dependencies:|sbiobert_base_cased_mli| ## References Internal SDOH project ## Benchmarking ```bash label precision recall f1-score support Employer 0.67 0.82 0.74 17 Medicaid 0.89 0.80 0.84 61 Medicare 0.85 0.89 0.87 38 Military 0.76 0.89 0.82 18 Other 0.56 0.45 0.50 11 Private 0.80 0.77 0.79 31 accuracy - - 0.81 176 macro-avg 0.75 0.77 0.76 176 weighted-avg 0.81 0.81 0.81 176 ``` --- layout: model title: Translate Wallisian to English Pipeline author: John Snow Labs name: translate_wls_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, wls, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure 
C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team; many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `wls` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_wls_en_xx_2.7.0_2.4_1609688000528.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_wls_en_xx_2.7.0_2.4_1609688000528.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_wls_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_wls_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.wls.translate_to.en').predict(text, output_level='sentence') translate_df ```
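Since translation is expensive on long inputs, splitting text into sentences before translating keeps sequence lengths manageable (the pretrained pipeline handles this internally via its sentence detector stage). A naive, regex-based splitter for illustration only, not the detector the pipeline actually uses:

```python
import re

# Split on whitespace that follows sentence-ending punctuation.
# A deliberately naive sketch: it will mis-split abbreviations like "Dr.".
def split_sentences(text):
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("Hello there. How are you? Fine!"))
# -> ['Hello there.', 'How are you?', 'Fine!']
```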
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_wls_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: BERT Embeddings (Base Cased) author: John Snow Labs name: bert_base_cased date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains a deep bidirectional transformer trained on Wikipedia and the BookCorpus. The details are described in the paper "[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_cased_en_2.6.0_2.4_1598340336670.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_cased_en_2.6.0_2.4_1598340336670.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("bert_base_cased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("bert_base_cased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.base_cased').predict(text, output_level='token') embeddings_df ```
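The per-token vectors this model produces (768 dimensions each) are typically compared with cosine similarity. A dependency-free sketch with toy 3-dimensional vectors standing in for real BERT outputs:

```python
import math

# Cosine similarity: dot product of the vectors over the product of
# their Euclidean norms; 1.0 for identical directions, 0.0 for orthogonal.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(round(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]), 3))  # -> 1.0
print(round(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 3))  # -> 0.0
```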
{:.h2_title} ## Results ```bash token en_embed_bert_base_cased_embeddings I [0.43879568576812744, -0.40160006284713745, 0.... love [0.21737590432167053, -0.3865768313407898, -0.... NLP [-0.16226479411125183, -0.053727392107248306, ... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/google/bert_cased_L-12_H-768_A-12/1](https://tfhub.dev/google/bert_cased_L-12_H-768_A-12/1) --- layout: model title: Multilingual BERT Embeddings (Base Cased) author: John Snow Labs name: bert_multi_cased date: 2020-08-25 task: Embeddings language: xx edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, xx] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains a deep bidirectional transformer trained on Wikipedia and the BookCorpus. The details are described in the paper "[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_multi_cased_xx_2.6.0_2.4_1598341875191.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_multi_cased_xx_2.6.0_2.4_1598341875191.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("bert_multi_cased", "xx") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love Spark NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("bert_multi_cased", "xx") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love Spark NLP"] embeddings_df = nlu.load('xx.embed.bert_multi_cased').predict(text, output_level='token') embeddings_df ```
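Multilingual BERT shares one WordPiece vocabulary across languages: out-of-vocabulary words are split into subwords by greedy longest-match, with continuation pieces prefixed by `##`. A toy sketch of that algorithm (tiny made-up vocabulary, not the real one):

```python
# Greedy longest-match WordPiece tokenization of a single word.
def wordpiece(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            cand = word[start:end] if start == 0 else "##" + word[start:end]
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:      # no piece matches: the word is unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece("unaffable", vocab))  # -> ['un', '##aff', '##able']
```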
{:.h2_title} ## Results ```bash xx_embed_bert_multi_cased_embeddings token [0.31631314754486084, -0.5579454898834229, 0.1... I [-0.1488783359527588, -0.27264419198036194, -0... love [0.0496230386197567, -0.43625175952911377, -0.... Spark [-0.2838578224182129, -0.7103433012962341, 0.4... NLP ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_multi_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[xx]| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3 --- layout: model title: Korean BertForQuestionAnswering model (from bespin-global) author: John Snow Labs name: bert_qa_klue_bert_base_aihub_mrc date: 2022-06-02 tags: [ko, open_source, question_answering, bert] task: Question Answering language: ko edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `klue-bert-base-aihub-mrc` is a Korean model originally trained by `bespin-global`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_klue_bert_base_aihub_mrc_ko_4.0.0_3.0_1654188035750.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_klue_bert_base_aihub_mrc_ko_4.0.0_3.0_1654188035750.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_klue_bert_base_aihub_mrc","ko") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_klue_bert_base_aihub_mrc","ko") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("ko.answer_question.klue.bert.base_aihub.by_bespin-global").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_klue_bert_base_aihub_mrc| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ko| |Size:|413.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/bespin-global/klue-bert-base-aihub-mrc - https://github.com/KLUE-benchmark/KLUE - https://www.bespinglobal.com/ - https://aihub.or.kr/aidata/86 --- layout: model title: Relation Extraction between Tests, Results, and Dates author: John Snow Labs name: re_test_result_date date: 2021-02-24 tags: [licensed, en, clinical, relation_extraction] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 2.7.4 spark_version: 2.4 supported: true annotator: RelationExtractionModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Relation extraction between lab test names and their findings, measurements, results, and dates. ## Predicted Entities `is_finding_of`, `is_result_of`, `is_date_of`, `O`. 
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CLINICAL_DATE/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb#scrollTo=D8TtVuN-Ee8s){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_test_result_date_en_2.7.4_2.4_1614168615976.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_test_result_date_en_2.7.4_2.4_1614168615976.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use

Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, PerceptronModel, DependencyParserModel, WordEmbeddingsModel, NerDLModel, NerConverter, RelationExtractionModel.

The table below lists the `re_test_result_date` RE model, its labels, the optimal NER model, and the meaningful relation pairs.

| RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS |
|:-------------------:|:------------------------------------------:|:---------:|:---------|
| re_test_result_date | is_finding_of, is_result_of, is_date_of, O | ner_jsl | ["test-test_result", "test_result-test", "test-date", "date-test", "test-imagingfindings", "imagingfindings-test", "test-ekg_findings", "ekg_findings-test", "date-test_result", "test_result-date", "date-imagingfindings", "imagingfindings-date", "date-ekg_findings", "ekg_findings-date"] |
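As an aside, `setRelationPairs` takes directed `"entity1-entity2"` strings, which is why each pair in the table appears in both directions. A plain-Python sketch (illustrative only, not a Spark NLP API) that expands undirected entity pairs into the directed strings:

```python
def build_relation_pairs(entity_pairs):
    """Expand undirected (entity1, entity2) tuples into the directed
    'entity1-entity2' strings expected by setRelationPairs."""
    pairs = []
    for a, b in entity_pairs:
        for p in (f"{a}-{b}", f"{b}-{a}"):
            if p not in pairs:  # keep insertion order, drop duplicates
                pairs.append(p)
    return pairs

pairs = build_relation_pairs([("test", "test_result"), ("test", "date")])
# -> ["test-test_result", "test_result-test", "test-date", "date-test"]
```

The resulting list can then be passed to `.setRelationPairs(...)` in the pipeline example.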
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentences", "tokens"])\ .setOutputCol("embeddings") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") ner_tagger = MedicalNerModel().pretrained('jsl_ner_wip_clinical',"en","clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_chunker = NerConverterInternal()\ .setInputCols(["sentences", "tokens", "ner_tags"])\ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel()\ .pretrained("dependency_conllu", "en")\ .setInputCols(["sentences", "pos_tags", "tokens"])\ .setOutputCol("dependencies") re_model = RelationExtractionModel().pretrained("re_test_result_date", "en", 'clinical/models')\ .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\ .setOutputCol("relations")\ .setMaxSyntacticDistance(4)\ .setPredictionThreshold(0.9)\ .setRelationPairs(["external_body_part_or_region-test"])# Possible relation pairs. Default: All Relations. 
nlp_pipeline = Pipeline(stages=[document_assembler, sentencer, tokenizer, word_embeddings, pos_tagger, ner_tagger, ner_chunker, dependency_parser, re_model])

light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

results = light_pipeline.fullAnnotate("""He was advised chest X-ray or CT scan after checking his SpO2 which was <= 93%""")
```
```scala
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentencer = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentences")

val tokenizer = new Tokenizer()
  .setInputCols("sentences")
  .setOutputCol("tokens")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens"))
  .setOutputCol("embeddings")

val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens"))
  .setOutputCol("pos_tags")

val ner_tagger = MedicalNerModel.pretrained("jsl_ner_wip_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens", "embeddings"))
  .setOutputCol("ner_tags")

val ner_chunker = new NerConverterInternal()
  .setInputCols(Array("sentences", "tokens", "ner_tags"))
  .setOutputCol("ner_chunks")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentences", "pos_tags", "tokens"))
  .setOutputCol("dependencies")

val re_model = RelationExtractionModel.pretrained("re_test_result_date", "en", "clinical/models")
  .setInputCols(Array("embeddings", "pos_tags", "ner_chunks", "dependencies"))
  .setOutputCol("relations")
  .setMaxSyntacticDistance(4)
  .setPredictionThreshold(0.9)
  .setRelationPairs(Array("external_body_part_or_region-test")) // Possible relation pairs. Default: all relations.
val nlp_pipeline = new Pipeline().setStages(Array(document_assembler, sentencer, tokenizer, word_embeddings, pos_tagger, ner_tagger, ner_chunker, dependency_parser, re_model))

val text = """He was advised chest X-ray or CT scan after checking his SpO2 which was <= 93%"""

val data = Seq(text).toDS.toDF("text")

val results = nlp_pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.relation.test_result_date").predict("""He was advised chest X-ray or CT scan after checking his SpO2 which was <= 93%""")
```
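For inspection, the relation annotations returned by `fullAnnotate` can be flattened into rows like those in the Results section. The exact annotation object shape varies across Spark NLP versions, so this plain-Python sketch uses dicts that mirror the assumed metadata keys (`entity1`, `chunk1`, `entity2`, `chunk2`):

```python
def relations_to_rows(relations):
    """Flatten relation annotations (represented here as plain dicts)
    into tabular rows; assumes entity/chunk info lives in metadata."""
    rows = []
    for rel in relations:
        meta = rel["metadata"]
        rows.append({
            "relation": rel["result"],
            "entity1": meta["entity1"], "chunk1": meta["chunk1"],
            "entity2": meta["entity2"], "chunk2": meta["chunk2"],
        })
    return rows

sample = [{"result": "is_result_of",
           "metadata": {"entity1": "TEST", "chunk1": "SpO2",
                        "entity2": "MEASUREMENTS", "chunk2": "93%"}}]
rows = relations_to_rows(sample)
# rows[0]["relation"] == "is_result_of"
```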
## Results

```bash
| index | relations    | entity1      | chunk1              | entity2      | chunk2  |
|-------|--------------|--------------|---------------------|--------------|---------|
| 0     | O            | TEST         | chest X-ray         | MEASUREMENTS | 93%     |
| 1     | O            | TEST         | CT scan             | MEASUREMENTS | 93%     |
| 2     | is_result_of | TEST         | SpO2                | MEASUREMENTS | 93%     |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|re_test_result_date|
|Type:|re|
|Compatibility:|Healthcare NLP 2.7.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]|
|Output Labels:|[relations]|
|Language:|en|

## Data Source

Trained on internal data.

## Benchmarking

```bash
| relation        | prec |
|-----------------|------|
| O               | 0.77 |
| is_finding_of   | 0.80 |
| is_result_of    | 0.96 |
| is_date_of      | 0.94 |
```
---
layout: model
title: English image_classifier_vit_base_mri ViTForImageClassification from raedinkhaled
author: John Snow Labs
name: image_classifier_vit_base_mri
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_mri` is an English model originally trained by raedinkhaled.

## Predicted Entities

`cad`, `healthy`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_mri_en_4.1.0_3.0_1660168752129.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_mri_en_4.1.0_3.0_1660168752129.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_mri", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_mri", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_base_mri|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|
---
layout: model
title: English DistilBertForQuestionAnswering model (from sunitha)
author: John Snow Labs
name: distilbert_qa_base_uncased_3feb_2022_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-3feb-2022-finetuned-squad` is an English model originally trained by `sunitha`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_3feb_2022_finetuned_squad_en_4.0.0_3.0_1654723730880.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_3feb_2022_finetuned_squad_en_4.0.0_3.0_1654723730880.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_3feb_2022_finetuned_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_3feb_2022_finetuned_squad","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_sunitha").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_3feb_2022_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/sunitha/distilbert-base-uncased-3feb-2022-finetuned-squad
---
layout: model
title: Medical Spell Checker Pipeline
author: John Snow Labs
name: spellcheck_clinical_pipeline
date: 2022-04-14
tags: [spellcheck, medical, medical_spell_checker, spell_corrector, spell_pipeline, en, licensed, clinical]
task: Spell Check
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 2.4
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained medical spellchecker pipeline is built on top of the [spellcheck_clinical](https://nlp.johnsnowlabs.com/2022/04/14/spellcheck_clinical_en_2_4.html) model. This pipeline is for PySpark 2.4.x users with Spark NLP 3.4.1.

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CONTEXTUAL_SPELL_CHECKER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_pipeline_en_3.4.1_2.4_1649930943224.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_pipeline_en_3.4.1_2.4_1649930943224.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("spellcheck_clinical_pipeline", "en", "clinical/models") example = ["Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.", "With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.", "Abdomen is sort, nontender, and nonintended.", "Patient not showing pain or any wealth problems.", "No cute distress"] pipeline.fullAnnotate(example) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("spellcheck_clinical_pipeline", "en", "clinical/models") val example = Array("Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.", "With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.", "Abdomen is sort, nontender, and nonintended.", "Patient not showing pain or any wealth problems.", "No cute distress") pipeline.fullAnnotate(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.spell.clinical.pipeline").predict("""Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.""") ```
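Since the pipeline returns both the original `token` list and the corrected `checked` list aligned one-to-one, the corrections it made can be listed with a small helper. This is plain illustrative Python, not part of the pipeline itself:

```python
def corrections(tokens, checked):
    """Pair each original token with its correction, keeping only changes.
    Assumes the two lists are aligned token-for-token."""
    return [(orig, fix) for orig, fix in zip(tokens, checked) if orig != fix]

print(corrections(["No", "cute", "distress"], ["No", "acute", "distress"]))
# [('cute', 'acute')]
```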
## Results ```bash [{'checked': ['With','the','cell','of','physical','therapy','the','patient','was','ambulated','and','on','postoperative',',','the','patient','tolerating','a','post','surgical','soft','diet','.'], 'document': ['Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.'], 'token': ['Witth','the','hell','of','phisical','terapy','the','patient','was','imbulated','and','on','postoperative',',','the','impatient','tolerating','a','post','curgical','soft','diet','.']}, {'checked': ['With','pain','well','controlled','on','oral','pain','medications',',','she','was','discharged','to','rehabilitation','facility','.'], 'document': ['With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.'], 'token': ['With','paint','wel','controlled','on','orall','pain','medications',',','she','was','discharged','too','reihabilitation','facilitay','.']}, {'checked': ['Abdomen','is','soft',',','nontender',',','and','nondistended','.'], 'document': ['Abdomen is sort, nontender, and nonintended.'], 'token': ['Abdomen','is','sort',',','nontender',',','and','nonintended','.']}, {'checked': ['Patient','not','showing','pain','or','any','health','problems','.'], 'document': ['Patient not showing pain or any wealth problems.'], 'token': ['Patient','not','showing','pain','or','any','wealth','problems','.']}, {'checked': ['No', 'acute', 'distress'], 'document': ['No cute distress'], 'token': ['No', 'cute', 'distress']}] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|spellcheck_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|141.3 MB| ## Included Models - DocumentAssembler - TokenizerModel - ContextSpellCheckerModel --- layout: model title: German asr_exp_w2v2t_vp_s946 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: 
pipeline_asr_exp_w2v2t_vp_s946
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2t_vp_s946` is a German model originally trained by jonatasgrosman.

NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2t_vp_s946_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2t_vp_s946_de_4.2.0_3.0_1664110443438.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2t_vp_s946_de_4.2.0_3.0_1664110443438.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2t_vp_s946', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2t_vp_s946", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_exp_w2v2t_vp_s946|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|de|
|Size:|1.2 GB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: Detect Living Species
author: John Snow Labs
name: bert_token_classifier_ner_living_species
date: 2022-06-27
tags: [es, ner, clinical, licensed, bertfortokenclassification]
task: Named Entity Recognition
language: es
edition: Healthcare NLP 3.5.3
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Extract living species from clinical texts in Spanish, which is critical to scientific disciplines like medicine, biology, ecology/biodiversity, nutrition, and agriculture. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. It is trained on the [LivingNER](https://temu.bsc.es/livingner/) corpus, which is composed of clinical case reports extracted from miscellaneous medical specialties, including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others.

## Predicted Entities

`HUMAN`, `SPECIES`

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_es_3.5.3_3.0_1656316616890.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_es_3.5.3_3.0_1656316616890.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_living_species", "es", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, ner_model, ner_converter ]) data = spark.createDataFrame([["""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. 
Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual."""]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_living_species", "es", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("ner")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, ner_model, ner_converter))

val data = Seq("""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l.
Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.classify.bert_token.ner_living_species").predict("""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.""") ```
## Results ```bash +--------------+-------+ |ner_chunk |label | +--------------+-------+ |Lactante varón|HUMAN | |familiares |HUMAN | |personales |HUMAN | |neonatal |HUMAN | |legumbres |SPECIES| |lentejas |SPECIES| |garbanzos |SPECIES| |legumbres |SPECIES| |madre |HUMAN | |Cacahuete |SPECIES| |padres |HUMAN | +--------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_living_species| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|es| |Size:|410.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References [https://temu.bsc.es/livingner/](https://temu.bsc.es/livingner/) ## Benchmarking ```bash label precision recall f1-score support B-HUMAN 0.96 0.99 0.97 3281 B-SPECIES 0.89 0.94 0.91 3712 I-HUMAN 0.86 0.75 0.80 297 I-SPECIES 0.88 0.90 0.89 1732 micro-avg 0.91 0.94 0.93 9022 macro-avg 0.90 0.89 0.89 9022 weighted-avg 0.91 0.94 0.93 9022 ``` --- layout: model title: Legal Management Clause Binary Classifier author: John Snow Labs name: legclf_management_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `management` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification as sentence level. 
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into account that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial linked above).

This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `management`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_management_clause_en_1.0.0_3.2_1660123720911.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_management_clause_en_1.0.0_3.2_1660123720911.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_management_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
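The paragraph splitting recommended above can be sketched in plain Python. This is an illustrative pre-processing step under stated assumptions, not the workshop notebook's code; the whitespace token count is only a rough upper-bound check, since the model's 512-token limit applies to subword tokens:

```python
import re

def split_paragraphs(text, max_tokens=512):
    """Split a document on blank lines and flag each paragraph that fits
    within a rough whitespace-token budget."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

doc = ("Clause 1. The Board shall manage the affairs of the Company.\n\n"
       "Clause 2. All notices shall be in writing.")
chunks = split_paragraphs(doc)
# two paragraphs, each flagged True (within budget)
```

Each paragraph that fits the budget can then be fed to the classifier as a separate row of `clause_text`.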
## Results

```bash
+------------+
|      result|
+------------+
|[management]|
|     [other]|
|     [other]|
|[management]|
+------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_management_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
  management       0.96    0.90      0.93       71
       other       0.95    0.98      0.96      136
    accuracy          -       -      0.95      207
   macro-avg       0.95    0.94      0.95      207
weighted-avg       0.95    0.95      0.95      207
```
---
layout: model
title: Relation Extraction between Tumors and Sizes
author: John Snow Labs
name: re_oncology_size_wip
date: 2022-09-26
tags: [licensed, clinical, oncology, relation_extraction, en]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RelationExtractionModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This relation extraction model links Tumor_Size extractions to their corresponding Tumor_Finding extractions.

## Predicted Entities

`is_size_of`, `O`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_oncology_size_wip_en_4.0.0_3.0_1664230171831.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_oncology_size_wip_en_4.0.0_3.0_1664230171831.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(['document'])\ .setOutputCol('sentence') tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", 'token']) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") re_model = RelationExtractionModel.pretrained("re_oncology_size_wip", "en", "clinical/models") \ .setInputCols(["embeddings", "pos_tags", "ner_chunk", "dependencies"]) \ .setOutputCol("relation_extraction") \ .setRelationPairs(["Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding"]) \ .setMaxSyntacticDistance(10) pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_model]) data = spark.createDataFrame([["The patient presented a 2 cm mass in her left breast, and the tumor in her other breast was 3 cm long."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = 
SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentence", "pos_tags", "token")) .setOutputCol("dependencies") val re_model = RelationExtractionModel.pretrained("re_oncology_size_wip", "en", "clinical/models") .setInputCols(Array("embeddings", "pos_tags", "ner_chunk", "dependencies")) .setOutputCol("relation_extraction") .setRelationPairs(Array("Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding")) .setMaxSyntacticDistance(10) val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_model)) val data = Seq("The patient presented a 2 cm mass in her left breast, and the tumor in her other breast was 3 cm long.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.oncology.size_wip").predict("""The patient presented a 2 cm mass in her left breast, and the tumor in her other breast was 3 cm long.""") ```
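After the pipeline output is collected to the driver, it is common to keep only relations above a confidence threshold. A minimal, library-free sketch of that filtering step — the row layout below mirrors the Results table and is an assumption for illustration, not the exact schema of the `relation_extraction` column:

```python
# Filter extracted relations by confidence (hypothetical row layout).
def filter_relations(rows, min_confidence=0.5):
    """Keep relation rows whose confidence meets the threshold."""
    return [r for r in rows if float(r["confidence"]) >= min_confidence]

rows = [
    {"chunk1": "2 cm", "entity1": "Tumor_Size", "chunk2": "mass",
     "entity2": "Tumor_Finding", "relation": "is_size_of", "confidence": 0.853271},
    {"chunk1": "tumor", "entity1": "Tumor_Finding", "chunk2": "3 cm",
     "entity2": "Tumor_Size", "relation": "is_size_of", "confidence": 0.815623},
]
print(filter_relations(rows, min_confidence=0.84))  # keeps only the first row
```

A threshold around 0.8-0.9 trades recall for precision; the right value depends on the downstream use.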
## Results ```bash | chunk1 | entity1 | chunk2 | entity2 | relation | confidence | |:---------|:--------------|:---------|:--------------|:-----------|-------------:| | 2 cm | Tumor_Size | mass | Tumor_Finding | is_size_of | 0.853271 | | tumor | Tumor_Finding | 3 cm | Tumor_Size | is_size_of | 0.815623 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_oncology_size_wip| |Type:|re| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]| |Output Labels:|[relations]| |Language:|en| |Size:|267.5 KB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label recall precision f1 is_size_of 0.89 0.77 0.83 O 0.75 0.88 0.81 macro-avg 0.82 0.82 0.82 ``` --- layout: model title: English image_classifier_vit_exper6_mesum5 ViTForImageClassification from sudo-s author: John Snow Labs name: image_classifier_vit_exper6_mesum5 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_exper6_mesum5` is an English model originally trained by sudo-s.
## Predicted Entities `45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper6_mesum5_en_4.1.0_3.0_1660168138363.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper6_mesum5_en_4.1.0_3.0_1660168138363.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_exper6_mesum5", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_exper6_mesum5", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
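Because this model's classes are exposed as bare integer indices (see Predicted Entities above), mapping predictions back to readable names requires the label map from the original training run, which lives in the upstream Hugging Face config rather than in this card. A small decoding sketch with a purely hypothetical mapping:

```python
# Hypothetical index-to-name map; the real one comes from the training config.
id2label = {0: "class_a", 1: "class_b", 2: "class_c"}

def decode_predictions(pred_indices, id2label, unknown="unknown"):
    """Translate integer class indices (possibly strings) into readable labels."""
    return [id2label.get(int(i), unknown) for i in pred_indices]

print(decode_predictions(["1", "0", "7"], id2label))
# indices outside the map fall back to "unknown"
```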
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_exper6_mesum5| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.3 MB| --- layout: model title: English Named Entity Recognition (from CouchCat) author: John Snow Labs name: distilbert_ner_ma_ner_v7_distil date: 2022-05-16 tags: [distilbert, ner, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ma_ner_v7_distil` is an English model originally trained by `CouchCat`. ## Predicted Entities `MATR`, `PERS`, `TIME`, `MISC`, `PAD`, `PROD`, `BRND` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_ma_ner_v7_distil_en_3.4.2_3.0_1652721967576.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_ma_ner_v7_distil_en_3.4.2_3.0_1652721967576.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_ma_ner_v7_distil","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_ma_ner_v7_distil","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
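The token classifier emits one IOB-style tag per token (e.g. `B-PERS`, `I-PERS`, `O`); downstream code usually merges these into entity chunks, which is what Spark NLP's `NerConverter` annotator does. A dependency-free sketch of that grouping logic, for illustration only (the example tags are hypothetical):

```python
def bio_to_chunks(tokens, tags):
    """Merge IOB tags (B-X / I-X / O) into (entity_type, text) chunks."""
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = (tag[2:], [token])          # start a new chunk
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)              # continue the open chunk
        else:
            if current:
                chunks.append(current)
            current = None                        # O tag (or stray I-) closes it
    if current:
        chunks.append(current)
    return [(label, " ".join(words)) for label, words in chunks]

print(bio_to_chunks(
    ["I", "love", "Spark", "NLP"],
    ["O", "O", "B-PROD", "I-PROD"],
))  # → [('PROD', 'Spark NLP')]
```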
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_ner_ma_ner_v7_distil| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/CouchCat/ma_ner_v7_distil --- layout: model title: English BertForQuestionAnswering model (from aodiniz) author: John Snow Labs name: bert_qa_bert_uncased_L_4_H_256_A_4_squad2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-4_H-256_A-4_squad2` is an English model originally trained by `aodiniz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_256_A_4_squad2_en_4.0.0_3.0_1654185268072.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_256_A_4_squad2_en_4.0.0_3.0_1654185268072.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_4_H_256_A_4_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_uncased_L_4_H_256_A_4_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.uncased_4l_256d_a4a_256d").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
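After `collect()`, the `answer` column holds a list of annotation structs per row; pulling out the predicted span text is a short helper. A sketch over plain dicts that mimic the annotation layout (the exact field names here are an assumption for illustration):

```python
# Rows shaped like collected Spark NLP annotations (assumed layout).
def extract_answers(rows):
    """Return the first predicted answer string per row, or None if empty."""
    out = []
    for row in rows:
        annotations = row.get("answer", [])
        out.append(annotations[0]["result"] if annotations else None)
    return out

rows = [
    {"answer": [{"annotatorType": "chunk", "result": "Clara"}]},
    {"answer": []},  # model found no span for this question/context pair
]
print(extract_answers(rows))  # → ['Clara', None]
```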
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_uncased_L_4_H_256_A_4_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|42.1 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-4_H-256_A-4_squad2 --- layout: model title: English BertForQuestionAnswering model (from AnonymousSub) author: John Snow Labs name: bert_qa_bert_FT_newsqa date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_FT_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_FT_newsqa_en_4.0.0_3.0_1654185068956.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_FT_newsqa_en_4.0.0_3.0_1654185068956.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_FT_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_FT_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.bert.ft.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_FT_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/bert_FT_newsqa --- layout: model title: English image_classifier_vit_new_york_tokyo_london ViTForImageClassification from Suzana author: John Snow Labs name: image_classifier_vit_new_york_tokyo_london date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_new_york_tokyo_london` is an English model originally trained by Suzana. ## Predicted Entities `London`, `New York`, `Tokyo` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_new_york_tokyo_london_en_4.1.0_3.0_1660171162315.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_new_york_tokyo_london_en_4.1.0_3.0_1660171162315.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_new_york_tokyo_london", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_new_york_tokyo_london", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_new_york_tokyo_london| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: French Legal Roberta Embeddings author: John Snow Labs name: roberta_base_french_legal date: 2023-02-16 tags: [fr, french, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: fr edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-french-roberta-base` is a French model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_french_legal_fr_4.2.4_3.0_1676580048854.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_french_legal_fr_4.2.4_3.0_1676580048854.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_french_legal", "fr")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_french_legal", "fr") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") ```
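The annotator above emits one vector per token; a common way to compare two sentences is to mean-pool the token vectors and score the pooled vectors with cosine similarity. A dependency-free sketch of that post-processing (mean pooling is a widely used convention, not something mandated by this model):

```python
import math

def mean_pool(token_vectors):
    """Average equally sized token vectors into one sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

sent1 = mean_pool([[1.0, 0.0], [0.0, 1.0]])   # -> [0.5, 0.5]
sent2 = mean_pool([[2.0, 2.0]])               # -> [2.0, 2.0]
print(round(cosine(sent1, sent2), 3))         # parallel vectors -> 1.0
```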
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_french_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|416.2 MB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-french-roberta-base --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from tiennvcs) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_infov date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-infovqa` is an English model originally trained by `tiennvcs`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_infov_en_4.3.0_3.0_1672768056130.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_infov_en_4.3.0_3.0_1672768056130.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_infov","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_infov","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_infov| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/tiennvcs/distilbert-base-uncased-finetuned-infovqa --- layout: model title: English ElectraForQuestionAnswering model (from ptran74) Version-2 author: John Snow Labs name: electra_qa_DSPFirst_Finetuning_2 date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `DSPFirst-Finetuning-2` is an English model originally trained by `ptran74`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_DSPFirst_Finetuning_2_en_4.0.0_3.0_1655919435626.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_DSPFirst_Finetuning_2_en_4.0.0_3.0_1655919435626.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_DSPFirst_Finetuning_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_DSPFirst_Finetuning_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.electra.finetuning_2").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_DSPFirst_Finetuning_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ptran74/DSPFirst-Finetuning-2 - https://github.gatech.edu/pages/VIP-ITS/textbook_SQuAD_explore/explore/textbookv1.0/textbook/ --- layout: model title: OCR small for handwritten text author: John Snow Labs name: ocr_small_handwritten date: 2022-02-17 tags: [en, licensed] task: OCR Text Detection & Recognition language: en nav_key: models edition: Visual NLP 3.3.3 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Small OCR model for recognizing handwritten text, based on the TrOCR architecture. The TrOCR model was proposed in TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to perform optical character recognition (OCR). The abstract from the paper is the following: Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation.
The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/ocr_small_handwritten_en_3.3.3_2.4_1645080334390.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/ocr_small_handwritten_en_3.3.3_2.4_1645080334390.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ocr = ImageToTextv2().pretrained("ocr_small_handwritten", "en", "clinical/ocr") ocr.setInputCols(["image"]) ocr.setOutputCol("text") result = ocr.transform(image_text_lines_df).collect() print(result[0].text) ```
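OCR output quality is commonly scored with character error rate (CER): the edit distance between the predicted transcription and a reference, divided by the reference length. A minimal sketch for spot-checking transcriptions against ground truth (this helper is not part of the Visual NLP API):

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert / delete / substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(prediction, reference):
    """Character error rate: edits needed per reference character."""
    return levenshtein(prediction, reference) / max(len(reference), 1)

print(cer("he1lo world", "hello world"))  # one substitution across 11 characters
```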
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ocr_small_handwritten| |Type:|ocr| |Compatibility:|Visual NLP 3.3.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|146.7 MB| --- layout: model title: Fast Neural Machine Translation Model from Afro-Asiatic languages to English author: John Snow Labs name: opus_mt_afa_en date: 2021-06-01 tags: [open_source, seq2seq, translation, afa, en, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). source languages: afa target languages: en {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_afa_en_xx_3.1.0_2.4_1622562470297.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_afa_en_xx_3.1.0_2.4_1622562470297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_afa_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_afa_en", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Afro-Asiatic languages.translate_to.English').predict(text, output_level='sentence') translate_df ```
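Transformer translation models work best on sentence-sized inputs, which is why the pipeline runs sentence detection before `MarianTransformer`. When translating many documents, grouping detected sentences into bounded batches keeps memory use predictable; a simple dependency-free sketch (the character budget is an illustrative choice, not a library parameter):

```python
def batch_sentences(sentences, max_chars=200):
    """Group sentences into batches whose total length stays under max_chars."""
    batches, current, size = [], [], 0
    for sent in sentences:
        if current and size + len(sent) > max_chars:
            batches.append(current)   # close the current batch
            current, size = [], 0
        current.append(sent)
        size += len(sent)
    if current:
        batches.append(current)
    return batches

sents = ["short one.", "another short sentence.", "x" * 190]
print([len(b) for b in batch_sentences(sents, max_chars=200)])  # → [2, 1]
```

A sentence longer than the budget still forms its own single-element batch rather than being dropped.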
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_afa_en| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Sentence Entity Resolver for ICD-10-CM (general 3 character codes - augmented) author: John Snow Labs name: sbiobertresolve_icd10cm_generalised_augmented date: 2023-05-31 tags: [licensed, en, clinical, entity_resolution, icd10cm] task: Entity Resolution language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD-10-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It predicts ICD-10-CM codes up to 3 characters (according to the ICD-10-CM code structure, the first three characters represent the general type of injury or disease). ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_generalised_augmented_en_4.4.2_3.0_1685508789416.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_generalised_augmented_en_4.4.2_3.0_1685508789416.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(['PROBLEM'])

chunk2doc = Chunk2Doc()\
    .setInputCols("ner_chunk")\
    .setOutputCol("ner_chunk_doc")

sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
    .setInputCols(["ner_chunk_doc"])\
    .setOutputCol("sbert_embeddings")

icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_generalised_augmented","en", "clinical/models")\
    .setInputCols(["sbert_embeddings"])\
    .setOutputCol("resolution")\
    .setDistanceFunction("EUCLIDEAN")

nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver])

data = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, associated with obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection."]]).toDF("text")

results = nlpPipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence","token"))
    .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence","token","embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence","token","ner"))
    .setOutputCol("ner_chunk")
    .setWhiteList("PROBLEM")

val chunk2doc = new Chunk2Doc()
    .setInputCols("ner_chunk")
    .setOutputCol("ner_chunk_doc")

val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
    .setInputCols("ner_chunk_doc")
    .setOutputCol("sbert_embeddings")

val icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_generalised_augmented","en", "clinical/models")
    .setInputCols("sbert_embeddings")
    .setOutputCol("resolution")
    .setDistanceFunction("EUCLIDEAN")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver))

val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, associated with obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.").toDS().toDF("text")

val result = pipeline.fit(data).transform(data)
```
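As the description notes, this resolver generalises ICD-10-CM codes to their first three characters (the category level). A minimal plain-Python sketch of that truncation, for intuition only (the helper name is ours, not part of Spark NLP):

```python
# Hypothetical helper (not part of Spark NLP): illustrate the 3-character
# "generalised" ICD-10-CM category described above, e.g. 'O24.4' -> 'O24'.
def icd10cm_category(code: str) -> str:
    # The category is everything before the decimal point, capped at 3 chars.
    return code.split(".")[0][:3]

print(icd10cm_category("O24.4"))   # O24 (gestational diabetes category)
print(icd10cm_category("E66.9"))   # E66 (obesity category)
print(icd10cm_category("R35"))     # R35 (already a 3-character category)
```

These 3-character categories are what appear in the `icd10_code` column of the Results section.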
## Results

```bash
+-------------------------------------+-------+----------+---------------------------------------------------------------------------+-----------------------------------------------------------------+
|                            ner_chunk| entity|icd10_code|                                                                resolutions|                                                        all_codes|
+-------------------------------------+-------+----------+---------------------------------------------------------------------------+-----------------------------------------------------------------+
|        gestational diabetes mellitus|PROBLEM|       O24|[gestational diabetes mellitus [gestational diabetes mellitus], history ...|                                                  [O24, Z86, Z87]|
|subsequent type two diabetes mellitus|PROBLEM|       O24|[pre-existing type 2 diabetes mellitus [pre-existing type 2 diabetes mel...|                                             [O24, E11, E13, Z86]|
|                              obesity|PROBLEM|       E66|[obesity [obesity, unspecified], obese [body mass index [bmi] 40.0-44.9,...|                         [E66, Z68, Q13, Z86, E34, H35, Z83, Q55]|
|                    a body mass index|PROBLEM|       Z68|[finding of body mass index [body mass index [bmi] 40.0-44.9, adult], ob...|                    [Z68, E66, R22, R41, M62, P29, R19, R89, M21]|
|                             polyuria|PROBLEM|       R35|[polyuria [polyuria], polyuric state (disorder) [diabetes insipidus], he...|[R35, E23, R31, R82, N40, E72, O04, R30, R80, E88, N03, P96, N02]|
|                           polydipsia|PROBLEM|       R63|[polydipsia [polydipsia], psychogenic polydipsia [other impulse disorder...|[R63, F63, E23, O40, G47, M79, R06, H53, I44, Q30, I45, R00, M35]|
|                        poor appetite|PROBLEM|       R63|[poor appetite [anorexia], poor feeding [feeding problem of newborn, uns...|[R63, P92, R43, E86, R19, F52, Z72, R06, Z76, R53, R45, F50, R10]|
|                             vomiting|PROBLEM|       R11|[vomiting [vomiting], periodic vomiting [cyclical vomiting, in migraine,...|                                                  [R11, G43, P92]|
|        a respiratory tract infection|PROBLEM|       J98|[respiratory tract infection [other specified respiratory disorders], up...|     [J98, J06, A49, J22, J20, Z59, T17, J04, Z13, J18, P28, J39]|
+-------------------------------------+-------+----------+---------------------------------------------------------------------------+-----------------------------------------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_icd10cm_generalised_augmented|
|Compatibility:|Healthcare NLP 4.4.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[icd10cm_code]|
|Language:|en|
|Size:|1.4 GB|
|Case sensitive:|false|

---
layout: model
title: English asr_wav2vec2_base_cynthia_tedlium_2500_v2 TFWav2Vec2ForCTC from huyue012
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_cynthia_tedlium_2500_v2
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_cynthia_tedlium_2500_v2` is an English model originally trained by huyue012.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_cynthia_tedlium_2500_v2_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_cynthia_tedlium_2500_v2_en_4.2.0_3.0_1664040558353.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_cynthia_tedlium_2500_v2_en_4.2.0_3.0_1664040558353.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_cynthia_tedlium_2500_v2', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_cynthia_tedlium_2500_v2", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_cynthia_tedlium_2500_v2|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|349.2 MB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: Portuguese asr_bp_voxforge1_xlsr TFWav2Vec2ForCTC from lgris
author: John Snow Labs
name: asr_bp_voxforge1_xlsr
date: 2022-09-26
tags: [wav2vec2, pt, audio, open_source, asr]
task: Automatic Speech Recognition
language: pt
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_bp_voxforge1_xlsr` is a Portuguese model originally trained by lgris.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_bp_voxforge1_xlsr_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_bp_voxforge1_xlsr_pt_4.2.0_3.0_1664193020078.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_bp_voxforge1_xlsr_pt_4.2.0_3.0_1664193020078.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
    .setInputCol("audio_content") \
    .setOutputCol("audio_assembler")

speech_to_text = Wav2Vec2ForCTC \
    .pretrained("asr_bp_voxforge1_xlsr", "pt")\
    .setInputCols("audio_assembler") \
    .setOutputCol("text")

pipeline = Pipeline(stages=[
  audio_assembler,
  speech_to_text,
])

pipelineModel = pipeline.fit(audioDf)

pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
    .setInputCol("audio_content")
    .setOutputCol("audio_assembler")

val speechToText = Wav2Vec2ForCTC
    .pretrained("asr_bp_voxforge1_xlsr", "pt")
    .setInputCols("audio_assembler")
    .setOutputCol("text")

val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))

val pipelineModel = pipeline.fit(audioDf)

val pipelineDF = pipelineModel.transform(audioDf)
```
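For intuition only: a Wav2Vec2ForCTC model emits one label per audio frame, and CTC decoding collapses repeated labels and removes blank symbols to produce the transcript. A toy greedy sketch in plain Python (this is not Spark NLP's actual decoder):

```python
def ctc_greedy_collapse(frame_labels, blank="_"):
    """Collapse repeated per-frame labels, then drop CTC blank symbols."""
    collapsed = []
    prev = None
    for lab in frame_labels:
        if lab != prev:          # keep only label changes
            collapsed.append(lab)
        prev = lab
    return "".join(l for l in collapsed if l != blank)

# 'o' held for two frames, blanks separating characters:
print(ctc_greedy_collapse(list("oo_l__aa")))  # ola
```

The blank symbol is what lets CTC output genuinely doubled letters (a blank between two identical labels keeps them from being merged).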
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_bp_voxforge1_xlsr|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|pt|
|Size:|1.2 GB|

---
layout: model
title: RxNorm Cd ChunkResolver
author: John Snow Labs
name: chunkresolve_rxnorm_cd_clinical
date: 2021-04-16
tags: [entity_resolution, clinical, licensed, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
deprecated: true
annotator: ChunkEntityResolverModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Entity Resolution model based on KNN using Word Embeddings + Word Movers Distance.

## Predicted Entities

RxNorm Codes and their normalized definition with `clinical_embeddings`.

{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_cd_clinical_en_3.0.0_3.0_1618603400196.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_cd_clinical_en_3.0.0_3.0_1618603400196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
rxnorm_resolver = ChunkEntityResolverModel\
    .pretrained('chunkresolve_rxnorm_cd_clinical', 'en', "clinical/models")\
    .setEnableLevenshtein(True)\
    .setNeighbours(200).setAlternatives(5).setDistanceWeights([3,11,0,0,0,9])\
    .setInputCols(['token', 'chunk_embeddings'])\
    .setOutputCol('rxnorm_resolution')\
    .setPoolingStrategy("MAX")

pipeline_rxnorm = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, rxnorm_resolver])

data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation."""]]).toDF("text")

model = pipeline_rxnorm.fit(data)
results = model.transform(data)
```
```scala
...
val rxnorm_resolver = ChunkEntityResolverModel
    .pretrained("chunkresolve_rxnorm_cd_clinical", "en", "clinical/models")
    .setEnableLevenshtein(true)
    .setNeighbours(200).setAlternatives(5).setDistanceWeights(Array(3,11,0,0,0,9))
    .setInputCols(Array("token", "chunk_embeddings"))
    .setOutputCol("rxnorm_resolution")
    .setPoolingStrategy("MAX")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, rxnorm_resolver))

val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
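Since `setAlternatives(5)` was requested, the resolver returns the candidate target texts as a single `:::`-delimited string (visible in the `target_text` column of the Results section). Splitting it back into a list is straightforward plain Python:

```python
# The resolver's alternative resolutions arrive as one ':::'-joined string;
# the sample value below is taken from the Results section of this card.
raw = "Dapagliflozin Tablets:::dapagliflozin 5 mg oral tablet:::dapagliflozin 10 mg oral tablet"
alternatives = raw.split(":::")

print(len(alternatives))  # 3
print(alternatives[0])    # Dapagliflozin Tablets
```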
## Results

```bash
+---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+
|                                                          chunk|   entity|                                                                                         target_text|   code|confidence|
+---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+
|                                                      metformin|TREATMENT|metFORMIN compounding powder:::Metformin Hydrochloride Powder:::metFORMIN 500 mg oral tablet:::me...| 601021|    0.2364|
|                                                      glipizide|TREATMENT|Glipizide Powder:::Glipizide Crystal:::Glipizide Tablets:::glipiZIDE 5 mg oral tablet:::glipiZIDE...| 241604|    0.3647|
|dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG|TREATMENT|Ezetimibe and Atorvastatin Tablets:::Amlodipine and Atorvastatin Tablets:::Atorvastatin Calcium T...|1422084|    0.3407|
|                                                  dapagliflozin|TREATMENT|Dapagliflozin Tablets:::dapagliflozin 5 mg oral tablet:::dapagliflozin 10 mg oral tablet:::Dapagl...|1488568|    0.7070|
+---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|chunkresolve_rxnorm_cd_clinical|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[token, chunk_embeddings]|
|Output Labels:|[rxnorm]|
|Language:|en|

---
layout: model
title: English BertForQuestionAnswering model (from ruselkomp)
author: John Snow Labs
name: bert_qa_tests_finetuned_squad_test_bert_2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tests-finetuned-squad-test-bert-2` is an English model originally trained by `ruselkomp`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_tests_finetuned_squad_test_bert_2_en_4.0.0_3.0_1654192311570.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_tests_finetuned_squad_test_bert_2_en_4.0.0_3.0_1654192311570.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tests_finetuned_squad_test_bert_2","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_tests_finetuned_squad_test_bert_2","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.v2.by_ruselkomp").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
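For background: extractive QA models of this kind score every candidate answer span by summing a per-token start logit and end logit, then take the highest-scoring valid span. A toy plain-Python sketch of that selection step (not the model's actual code; the toy logits below are invented for illustration):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Return (start, end) token indices maximising start_logit + end_logit."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        # Only consider spans that start at i and stay within max_len tokens.
        for j in range(i, min(i + max_len, len(end_logits))):
            if s + end_logits[j] > best_score:
                best, best_score = (i, j), s + end_logits[j]
    return best

# Toy logits over tokens ["My", "name", "is", "Clara"]: token 3 wins both.
print(best_span([0.1, 0.2, 0.1, 3.0], [0.0, 0.1, 0.2, 2.5]))  # (3, 3)
```

The `j >= i` constraint is what rules out spans whose end precedes their start.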
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_tests_finetuned_squad_test_bert_2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.6 GB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/ruselkomp/tests-finetuned-squad-test-bert-2

---
layout: model
title: English image_classifier_vit_generation_xyz ViTForImageClassification from chradden
author: John Snow Labs
name: image_classifier_vit_generation_xyz
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_generation_xyz` is an English model originally trained by chradden.

## Predicted Entities

`Generation Alpha`, `Millennials`, `Generation X`, `Generation Z`, `Baby Boomers`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_generation_xyz_en_4.1.0_3.0_1660171707791.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_generation_xyz_en_4.1.0_3.0_1660171707791.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
image_assembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

imageClassifier = ViTForImageClassification \
    .pretrained("image_classifier_vit_generation_xyz", "en")\
    .setInputCols("image_assembler") \
    .setOutputCol("class")

pipeline = Pipeline(stages=[
  image_assembler,
  imageClassifier,
])

pipelineModel = pipeline.fit(imageDF)

pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
    .setInputCol("image")
    .setOutputCol("image_assembler")

val imageClassifier = ViTForImageClassification
    .pretrained("image_classifier_vit_generation_xyz", "en")
    .setInputCols("image_assembler")
    .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))

val pipelineModel = pipeline.fit(imageDF)

val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_generation_xyz|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|

---
layout: model
title: English DistilBertForTokenClassification Cased model (from Neurona)
author: John Snow Labs
name: distilbert_token_classifier_cpener_test
date: 2023-03-03
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cpener-test` is an English model originally trained by `Neurona`.

## Predicted Entities

`cpe_version`, `cpe_product`, `cpe_vendor`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_cpener_test_en_4.3.0_3.0_1677881384855.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_cpener_test_en_4.3.0_3.0_1677881384855.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_cpener_test","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_cpener_test","en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
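The `ner` column above holds one BIO tag per token (e.g. `B-cpe_product`, `I-cpe_version`, `O`). Grouping those tags into entity chunks, which is what a `NerConverter` stage would do if you appended one to the pipeline, can be sketched in plain Python (a simplified illustration, not Spark NLP's implementation; the sample tokens below are invented):

```python
def bio_to_chunks(tokens, tags):
    """Group B-/I- BIO tags into (entity_type, chunk_text) pairs."""
    chunks, cur_type, cur_toks = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):                       # a new chunk begins
            if cur_toks:
                chunks.append((cur_type, " ".join(cur_toks)))
            cur_type, cur_toks = tag[2:], [tok]
        elif tag.startswith("I-") and cur_type == tag[2:]:
            cur_toks.append(tok)                       # continue current chunk
        else:                                          # 'O' or inconsistent tag
            if cur_toks:
                chunks.append((cur_type, " ".join(cur_toks)))
            cur_type, cur_toks = None, []
    if cur_toks:
        chunks.append((cur_type, " ".join(cur_toks)))
    return chunks

print(bio_to_chunks(
    ["nginx", "1", ".", "19"],
    ["B-cpe_product", "B-cpe_version", "I-cpe_version", "I-cpe_version"]))
# [('cpe_product', 'nginx'), ('cpe_version', '1 . 19')]
```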
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_cpener_test|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/Neurona/cpener-test

---
layout: model
title: Fast Neural Machine Translation Model from Multiple languages to English
author: John Snow Labs
name: opus_mt_mul_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, mul, en, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.

- source languages: `mul`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_mul_en_xx_2.7.0_2.4_1609166361160.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_mul_en_xx_2.7.0_2.4_1609166361160.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_mul_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate(["text to translate"])
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_mul_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.mul.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_mul_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Pipeline to Detect Problems, Tests and Treatments (ner_clinical_large)
author: John Snow Labs
name: ner_clinical_large_pipeline
date: 2023-03-15
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [ner_clinical_large](https://nlp.johnsnowlabs.com/2021/03/31/ner_clinical_large_en.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_large_pipeline_en_4.3.0_3.2_1678876271920.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_large_pipeline_en_4.3.0_3.2_1678876271920.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_clinical_large_pipeline", "en", "clinical/models") text = '''The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_clinical_large_pipeline", "en", "clinical/models") val text = "The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. 
The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical_large.pipeline").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. 
BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.""") ```
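The rows shown under Results can be produced by walking the `ner_chunk` annotations returned by `fullAnnotate`. The sketch below is framework-free and illustrative only: the `Ann` dataclass is a stand-in for Spark NLP's annotation objects, which are assumed here to expose `result`, `begin`, `end`, and a `metadata` dict carrying `entity` and `confidence` keys.

```python
from dataclasses import dataclass, field

@dataclass
class Ann:
    # Stand-in for a Spark NLP annotation; only the fields used below.
    result: str
    begin: int
    end: int
    metadata: dict = field(default_factory=dict)

def tabulate_chunks(annotations):
    """Flatten ner_chunk annotations into (chunk, begin, end, label, confidence) rows."""
    rows = []
    for ann in annotations:
        rows.append((
            ann.result,
            ann.begin,
            ann.end,
            ann.metadata.get("entity"),
            float(ann.metadata.get("confidence", 0.0)),
        ))
    return rows

# Two chunks from the Results table below, as mock annotations.
chunks = [
    Ann("breast cancer", 1042, 1054, {"entity": "PROBLEM", "confidence": "0.9604"}),
    Ann("taxanes", 1144, 1150, {"entity": "TREATMENT", "confidence": "0.9999"}),
]
for row in tabulate_chunks(chunks):
    print(row)
```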
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:------------------------------------------------------------|--------:|------:|:------------|-------------:| | 0 | the G-protein-activated inwardly rectifying potassium (GIRK | 48 | 106 | TREATMENT | 0.6926 | | 1 | the genomicorganization | 142 | 164 | TREATMENT | 0.80715 | | 2 | a candidate gene forType II diabetes mellitus | 210 | 254 | PROBLEM | 0.754343 | | 3 | byapproximately | 380 | 394 | TREATMENT | 0.7924 | | 4 | single nucleotide polymorphisms | 464 | 494 | TREATMENT | 0.636967 | | 5 | aVal366Ala substitution | 532 | 554 | PROBLEM | 0.53615 | | 6 | an 8 base-pair | 561 | 574 | PROBLEM | 0.607733 | | 7 | insertion/deletion | 581 | 598 | PROBLEM | 0.8692 | | 8 | Ourexpression studies | 601 | 621 | TEST | 0.89975 | | 9 | the transcript in various humantissues | 648 | 685 | PROBLEM | 0.83306 | | 10 | fat andskeletal muscle | 749 | 770 | PROBLEM | 0.778133 | | 11 | furtherstudies | 830 | 843 | PROBLEM | 0.8789 | | 12 | the KCNJ9 protein | 864 | 880 | TREATMENT | 0.561033 | | 13 | evaluation | 892 | 901 | TEST | 0.9981 | | 14 | Type II diabetes | 940 | 955 | PROBLEM | 0.698967 | | 15 | the treatment | 1025 | 1037 | TREATMENT | 0.81195 | | 16 | breast cancer | 1042 | 1054 | PROBLEM | 0.9604 | | 17 | the standard therapy | 1067 | 1086 | TREATMENT | 0.757767 | | 18 | anthracyclines | 1125 | 1138 | TREATMENT | 0.9999 | | 19 | taxanes | 1144 | 1150 | TREATMENT | 0.9999 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical_large_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18) author: John Snow Labs name: 
distilbert_qa_base_uncased_becasv2_2 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-becasv2-2` is an English model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becasv2_2_en_4.3.0_3.0_1672767690511.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becasv2_2_en_4.3.0_3.0_1672767690511.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becasv2_2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becasv2_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
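Under the hood, extractive QA models like this one score candidate answer spans with start and end logits over the context tokens, and the `answer` column holds the highest-scoring valid span. A framework-free sketch of that span selection (illustrative only, not Spark NLP's internal code; the logit values are made up):

```python
def best_span(start_logits, end_logits, max_len=30):
    """Pick (start, end) maximizing start_logits[s] + end_logits[e], with s <= e < s + max_len."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.0, 0.0, 0.0, 0.3, 0.0]
end   = [0.0, 0.1, 0.0, 4.8, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```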
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_becasv2_2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Evelyn18/distilbert-base-uncased-becasv2-2 --- layout: model title: English asr_autonlp_hindi_asr TFWav2Vec2ForCTC from abhishek author: John Snow Labs name: asr_autonlp_hindi_asr date: 2022-09-26 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_autonlp_hindi_asr` is an English model originally trained by abhishek. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_autonlp_hindi_asr_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_autonlp_hindi_asr_en_4.2.0_3.0_1664195323489.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_autonlp_hindi_asr_en_4.2.0_3.0_1664195323489.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_autonlp_hindi_asr", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_autonlp_hindi_asr", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_autonlp_hindi_asr| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: English RobertaForSequenceClassification Cased model (from jawadhussein462) author: John Snow Labs name: roberta_classifier_autotrain_neurips_chanllenge_1287149282 date: 2022-12-09 tags: [en, open_source, roberta, sequence_classification, classification, tensorflow] task: Text Classification language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true recommended: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-neurips_chanllenge-1287149282` is an English model originally trained by `jawadhussein462`. ## Predicted Entities `1`, `0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_classifier_autotrain_neurips_chanllenge_1287149282_en_4.2.4_3.0_1670624021899.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_classifier_autotrain_neurips_chanllenge_1287149282_en_4.2.4_3.0_1670624021899.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autotrain_neurips_chanllenge_1287149282","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_classifier]) data = spark.createDataFrame([["I love you!"], ["I feel lucky to be here."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autotrain_neurips_chanllenge_1287149282","en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_classifier)) val data = Seq("I love you!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_classifier_autotrain_neurips_chanllenge_1287149282| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/jawadhussein462/autotrain-neurips_chanllenge-1287149282 --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_EManuals_RoBERTa_squad2.0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `EManuals_RoBERTa_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_EManuals_RoBERTa_squad2.0_en_4.0.0_3.0_1655726695679.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_EManuals_RoBERTa_squad2.0_en_4.0.0_3.0_1655726695679.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_EManuals_RoBERTa_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_EManuals_RoBERTa_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.emanuals.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_EManuals_RoBERTa_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|466.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/EManuals_RoBERTa_squad2.0 --- layout: model title: German asr_wav2vec2_large_xlsr_53_german_by_facebook TFWav2Vec2ForCTC from facebook author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_german_by_facebook date: 2022-09-24 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_german_by_facebook` is a German model originally trained by facebook. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_german_by_facebook_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_german_by_facebook_de_4.2.0_3.0_1664026408621.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_german_by_facebook_de_4.2.0_3.0_1664026408621.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_german_by_facebook", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_german_by_facebook", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_german_by_facebook| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|332.6 MB| --- layout: model title: Detect Radiology Concepts - WIP (biobert) author: John Snow Labs name: jsl_rd_ner_wip_greedy_biobert date: 2021-07-26 tags: [licensed, clinical, en, ner] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.1.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract clinical entities from Radiology reports using pretrained NER model. ## Predicted Entities `Test_Result`, `OtherFindings`, `BodyPart`, `ImagingFindings`, `Disease_Syndrome_Disorder`, `ImagingTest`, `Measurements`, `Procedure`, `Score`, `Test`, `Medical_Device`, `Direction`, `Symptom`, `Imaging_Technique`, `ManualFix`, `Units` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_rd_ner_wip_greedy_biobert_en_3.1.3_3.0_1627305153541.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_rd_ner_wip_greedy_biobert_en_3.1.3_3.0_1627305153541.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = BertEmbeddings.pretrained('biobert_pubmed_base_cased') \ .setInputCols(['sentence', 'token']) \ .setOutputCol('embeddings') clinical_ner = MedicalNerModel.pretrained("jsl_rd_ner_wip_greedy_biobert", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. 
This may represent benign fibrous tissue or a lipoma."]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols("sentence", "token") .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("jsl_rd_ner_wip_greedy_biobert", "en", "clinical/models") .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.radiology.wip_greedy_biobert").predict("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""") ```
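The `NerConverter` stage in the pipeline above merges token-level B-/I-/O tags into the chunks shown under Results. A minimal, framework-free sketch of that merging logic (assumed behavior for illustration, not Spark NLP's actual implementation):

```python
def bio_to_chunks(tokens, tags):
    """Merge B-/I- tagged tokens into (chunk_text, entity) pairs; 'O' tokens are skipped."""
    chunks, current_toks, current_label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new chunk, closing any open one.
            if current_toks:
                chunks.append((" ".join(current_toks), current_label))
            current_toks, current_label = [tok], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_toks.append(tok)  # continue the open chunk
        else:
            # "O", or an I- tag that does not continue the open chunk
            if current_toks:
                chunks.append((" ".join(current_toks), current_label))
            current_toks, current_label = [], None
    if current_toks:
        chunks.append((" ".join(current_toks), current_label))
    return chunks

tokens = ["isoechoic", "echotexture", "to", "the", "adjacent", "muscle"]
tags = ["B-ImagingFindings", "I-ImagingFindings", "O", "O", "O", "B-BodyPart"]
print(bio_to_chunks(tokens, tags))
# -> [('isoechoic echotexture', 'ImagingFindings'), ('muscle', 'BodyPart')]
```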
## Results ```bash | | chunk | entity | |---:|:----------------------|:--------------------------| | 0 | Bilateral | Direction | | 1 | breast | BodyPart | | 2 | ultrasound | ImagingTest | | 3 | ovoid mass | ImagingFindings | | 4 | 0.5 x 0.5 x 0.4 | Measurements | | 5 | cm | Units | | 6 | left | Direction | | 7 | shoulder | BodyPart | | 8 | mass | ImagingFindings | | 9 | isoechoic echotexture | ImagingFindings | | 10 | muscle | BodyPart | | 11 | internal color flow | ImagingFindings | | 12 | benign fibrous tissue | ImagingFindings | | 13 | lipoma | Disease_Syndrome_Disorder | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|jsl_rd_ner_wip_greedy_biobert| |Compatibility:|Healthcare NLP 3.1.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on Dataset annotated by John Snow Labs ## Benchmarking ```bash label tp fp fn prec rec f1 B-Units 253 7 11 0.9730769 0.9583333 0.9656488 B-Medical_Device 382 109 74 0.7780040 0.8377193 0.8067581 B-BodyPart 2645 347 276 0.8840241 0.9055118 0.8946389 I-BodyPart 645 142 135 0.819568 0.8269231 0.8232291 B-Imaging_Technique 137 36 33 0.7919075 0.8058823 0.7988338 B-Procedure 260 93 130 0.7365439 0.6666667 0.6998653 B-Direction 1573 136 123 0.9204213 0.9274764 0.9239353 I-ImagingTest 30 9 32 0.7692308 0.4838709 0.5940594 I-Test_Result 2 0 0 1 1 1 B-Measurements 452 24 30 0.9495798 0.9377593 0.9436326 B-ImagingFindings 1929 679 542 0.7396472 0.7806556 0.7595984 B-Test 146 17 49 0.8957055 0.7487179 0.8156425 B-ManualFix 2 0 2 1 0.5 0.6666667 I-Procedure 147 91 106 0.6176470 0.5810277 0.598778 I-Imaging_Technique 75 63 26 0.5434782 0.7425743 0.6276151 I-Measurements 45 3 6 0.9375 0.8823529 0.9090909 B-ImagingTest 328 36 85 0.9010989 0.7941888 0.8442728 I-Test 26 9 34 0.7428571 0.4333333 0.5473684 I-Symptom 138 62 142 0.69 0.4928571 0.575 I-ImagingFindings 1348 617 662 0.6860051 0.6706468 0.678239 
B-Disease_Syndrome_Disorder 1068 298 243 0.7818448 0.8146453 0.7979080 B-Symptom 523 110 190 0.8262243 0.7335203 0.7771174 I-Disease_Syndrome_Disorder 377 168 171 0.6917431 0.6879562 0.6898445 I-Medical_Device 369 72 62 0.8367347 0.8561485 0.8463302 I-Direction 352 38 41 0.9025641 0.8956743 0.899106 Macro-average 13272 3200 3313 0.7195612 0.6489194 0.6824170 Micro-average 13272 3200 3313 0.8057309 0.8002412 0.8029767 ``` --- layout: model title: Legal Power of attorney Clause Binary Classifier (md) author: John Snow Labs name: legclf_power_of_attorney_md date: 2022-11-25 tags: [en, legal, classification, document, agreement, contract, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `power-of-attorney` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `power-of-attorney` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_power_of_attorney_md_en_1.0.0_3.0_1669376514743.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_power_of_attorney_md_en_1.0.0_3.0_1669376514743.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_power_of_attorney_md", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
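The "paragraph splitting (by multiline)" pre-processing recommended in the description can be done before the pipeline with plain Python. This is a hypothetical pre-processing helper, not a Spark NLP component; the sample clauses are made up:

```python
import re

def split_paragraphs(text):
    """Split a long document on blank lines into clause-sized pieces for classification."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = (
    "POWER OF ATTORNEY.\nEach party hereby appoints the other as its attorney-in-fact...\n"
    "\n"
    "GOVERNING LAW.\nThis Agreement shall be governed by the laws of the State of Delaware..."
)
for clause in split_paragraphs(doc):
    print(clause.splitlines()[0])  # one candidate clause per paragraph
```

Each resulting piece can then be fed to the classifier as a separate row in the `clause_text` column.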
## Results ```bash +-------------------+ | result| +-------------------+ |[power-of-attorney]| | [other]| | [other]| |[power-of-attorney]| +-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_power_of_attorney_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash precision recall f1-score support other 0.89 1.00 0.94 39 power-of-attorney 1.00 0.79 0.88 24 accuracy 0.92 63 macro avg 0.94 0.90 0.91 63 weighted avg 0.93 0.92 0.92 63 ``` --- layout: model title: ALBERT Large CoNLL-03 NER Pipeline author: John Snow Labs name: albert_large_token_classifier_conll03_pipeline date: 2022-04-23 tags: [open_source, ner, token_classifier, albert, conll03, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [albert_large_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/26/albert_large_token_classifier_conll03_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_large_token_classifier_conll03_pipeline_en_3.4.1_3.0_1650710898514.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_large_token_classifier_conll03_pipeline_en_3.4.1_3.0_1650710898514.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("albert_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("albert_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PER | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_large_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|64.4 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - AlbertForTokenClassification - NerConverter - Finisher --- layout: model title: English T5ForConditionalGeneration Small Cased model (from allenai) author: John Snow Labs name: t5_small_squad2_next_word_generator_squad date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-squad2-next-word-generator-squad` is an English model originally trained by `allenai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_squad2_next_word_generator_squad_en_4.3.0_3.0_1675155704406.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_squad2_next_word_generator_squad_en_4.3.0_3.0_1675155704406.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_squad2_next_word_generator_squad","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_squad2_next_word_generator_squad","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_squad2_next_word_generator_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|148.1 MB| ## References - https://huggingface.co/allenai/t5-small-squad2-next-word-generator-squad --- layout: model title: Part of Speech for Japanese author: John Snow Labs name: pos_ud_gsd date: 2021-03-09 tags: [part_of_speech, open_source, japanese, pos_ud_gsd, ja] task: Part of Speech Tagging language: ja edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - NOUN - ADP - VERB - SCONJ - AUX - PUNCT - PART - DET - NUM - ADV - PRON - ADJ - PROPN - CCONJ - SYM - INTJ {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_ja_3.0.0_3.0_1615292368738.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_ja_3.0.0_3.0_1615292368738.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_gsd", "ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['ジョンスノーラボからこんにちは! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_gsd", "ja") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("ジョンスノーラボからこんにちは! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["ジョンスノーラボからこんにちは! "] token_df = nlu.load('ja.pos.ud_gsd').predict(text) token_df
```
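Once the pipeline has run, each token in the `pos` column carries one of the predicted tags listed above. A minimal pure-Python sketch of tallying the predicted tags, assuming the (token, tag) pairs have already been collected from the result DataFrame (the pair values below are illustrative, not real model output):

```python
from collections import Counter

# Illustrative (token, tag) pairs, as one might collect them from the
# "pos" annotations of the result DataFrame.
pairs = [
    ("ジョンスノーラボ", "PROPN"),
    ("から", "ADP"),
    ("こんにちは", "INTJ"),
    ("!", "PUNCT"),
]

# Count how often each tag was predicted.
tag_counts = Counter(tag for _, tag in pairs)
print(tag_counts.most_common())
```

This kind of tag-frequency summary is a quick sanity check that the tagger is producing a plausible distribution over the predicted entities.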
## Results ```bash token pos 0 ジョンス NOUN 1 ノ NOUN 2 ー NOUN 3 ラ NOUN 4 ボ NOUN 5 から ADP 6 こん NOUN 7 に ADP 8 ち NOUN 9 は ADP 10 ! VERB ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_gsd| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|ja| --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from Moussab) author: John Snow Labs name: roberta_qa_deepset_base_squad2_orkg_how_5e_05 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deepset-roberta-base-squad2-orkg-how-5e-05` is an English model originally trained by `Moussab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_how_5e_05_en_4.3.0_3.0_1674209532316.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_how_5e_05_en_4.3.0_3.0_1674209532316.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_how_5e_05","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_how_5e_05","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_deepset_base_squad2_orkg_how_5e_05| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Moussab/deepset-roberta-base-squad2-orkg-how-5e-05 --- layout: model title: Pipeline for Adverse Drug Events author: John Snow Labs name: explain_clinical_doc_ade date: 2022-06-30 tags: [en, clinical, licensed, ade, pipeline] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pipeline for Adverse Drug Events (ADE) with `ner_ade_biobert`, `assertion_dl_biobert`, `classifierdl_ade_conversational_biobert`, and `re_ade_biobert`. It classifies the document, extracts ADE and DRUG clinical entities, assigns assertion status to ADE entities, and relates drugs with their ADEs. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_ade_en_4.0.0_3.0_1656581944018.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_ade_en_4.0.0_3.0_1656581944018.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("explain_clinical_doc_ade", "en", "clinical/models") res = pipeline.fullAnnotate("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val era_pipeline = new PretrainedPipeline("explain_clinical_doc_ade", "en", "clinical/models") val result = era_pipeline.fullAnnotate("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""")(0) ``` {:.nlu-block} ```python import nlu nlu.load("en.explain_doc.clinical_ade").predict("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""") ```
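The relation extractor scores every ADE–DRUG pair; a common post-processing step is to keep only the pairs labeled `1` (related). A minimal pure-Python sketch, assuming the relation rows have already been collected from the pipeline output into tuples (the values below mirror the example sentence above):

```python
# Rows as (chunk1, entity1, chunk2, entity2, relation), as one might
# collect them from the relation-extraction output of fullAnnotate.
rows = [
    ("severe fatigue", "ADE", "Lipitor", "DRUG", "1"),
    ("cramps", "ADE", "Lipitor", "DRUG", "0"),
    ("severe fatigue", "ADE", "voltaren", "DRUG", "0"),
    ("cramps", "ADE", "voltaren", "DRUG", "1"),
]

# Keep only ADE chunks actually related (label "1") to a drug.
related = [(c1, c2) for c1, _, c2, _, rel in rows if rel == "1"]
print(related)
```

With the rows above this keeps `severe fatigue` paired with `Lipitor` and `cramps` paired with `voltaren`.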
## Results ```bash Class: True NER_Assertion: | | chunk | entity | assertion | |----|-------------------------|------------|-------------| | 0 | Lipitor | DRUG | - | | 1 | severe fatigue | ADE | Conditional | | 2 | voltaren | DRUG | - | | 3 | cramps | ADE | Conditional | Relations: | | chunk1 | entity1 | chunk2 | entity2 | relation | |----|-------------------------------|------------|-------------|---------|----------| | 0 | severe fatigue | ADE | Lipitor | DRUG | 1 | | 1 | cramps | ADE | Lipitor | DRUG | 0 | | 2 | severe fatigue | ADE | voltaren | DRUG | 0 | | 3 | cramps | ADE | voltaren | DRUG | 1 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_clinical_doc_ade| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|484.6 MB| ## Included Models - DocumentAssembler - TokenizerModel - BertEmbeddings - SentenceEmbeddings - ClassifierDLModel - MedicalNerModel - NerConverterInternalModel - PerceptronModel - DependencyParserModel - RelationExtractionModel - NerConverterInternalModel - AssertionDLModel --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_roberta_bert_quadruplet_epochs_1_shard_1_squad2.0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_bert_quadruplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_bert_quadruplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739275203.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_bert_quadruplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739275203.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_roberta_bert_quadruplet_epochs_1_shard_1_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_rule_based_roberta_bert_quadruplet_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base_rule_based_quadruplet_epochs_1_shard_1.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_roberta_bert_quadruplet_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|460.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_bert_quadruplet_epochs_1_shard_1_squad2.0 --- layout: model title: Chinese BertForMaskedLM Base Cased model (from ptrsxu) author: John Snow Labs name: bert_embeddings_ptrsxu_base_chinese date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-chinese` is a Chinese model originally trained by `ptrsxu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_ptrsxu_base_chinese_zh_4.2.4_3.0_1670016435043.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_ptrsxu_base_chinese_zh_4.2.4_3.0_1670016435043.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_ptrsxu_base_chinese","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_ptrsxu_base_chinese","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
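Token vectors from the `embeddings` column are typically compared downstream with cosine similarity. A minimal sketch on plain Python lists, assuming two vectors have already been collected from the result (the 4-dimensional toy values below stand in for the model's real embedding vectors):

```python
import math

def cosine(u, v):
    # cosine(u, v) = dot(u, v) / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors; real values would come from the "embeddings" annotations.
vec_a = [0.1, 0.3, -0.2, 0.7]
vec_b = [0.1, 0.3, -0.2, 0.7]
print(cosine(vec_a, vec_b))  # identical vectors score (approximately) 1.0
```

Higher scores mean the two tokens (or pooled sentences) are closer in the embedding space.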
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_ptrsxu_base_chinese| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|383.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/ptrsxu/bert-base-chinese - https://aclanthology.org/2021.acl-long.330.pdf - https://dl.acm.org/doi/pdf/10.1145/3442188.3445922 --- layout: model title: Fast Neural Machine Translation Model from Spanish to English author: John Snow Labs name: opus_mt_es_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, es, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial ones. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `es` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_es_en_xx_2.7.0_2.4_1609165183882.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_es_en_xx_2.7.0_2.4_1609165183882.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_es_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Hola, ¿cómo estás?") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_es_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Hola, ¿cómo estás?").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.es.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_es_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering Uncased model (from roshnir) author: John Snow Labs name: bert_qa_multi_uncased_trained_squadv2 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-multi-uncased-trained-squadv2` is an English model originally trained by `roshnir`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_multi_uncased_trained_squadv2_en_4.0.0_3.0_1657187851046.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_multi_uncased_trained_squadv2_en_4.0.0_3.0_1657187851046.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_multi_uncased_trained_squadv2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_multi_uncased_trained_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_multi_uncased_trained_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|626.2 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/roshnir/bert-multi-uncased-trained-squadv2 - https://aclanthology.org/2020.acl-main.421.pdf --- layout: model title: Tagalog Electra Embeddings (from jcblaise) author: John Snow Labs name: electra_embeddings_electra_tagalog_small_uncased_generator date: 2022-05-17 tags: [tl, open_source, electra, embeddings] task: Embeddings language: tl edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-tagalog-small-uncased-generator` is a Tagalog model originally trained by `jcblaise`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_tagalog_small_uncased_generator_tl_3.4.4_3.0_1652786769151.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_tagalog_small_uncased_generator_tl_3.4.4_3.0_1652786769151.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_tagalog_small_uncased_generator","tl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Mahilig ako sa Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_tagalog_small_uncased_generator","tl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Mahilig ako sa Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electra_tagalog_small_uncased_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|tl| |Size:|18.9 MB| |Case sensitive:|false| ## References - https://huggingface.co/jcblaise/electra-tagalog-small-uncased-generator - https://blaisecruz.com --- layout: model title: English asr_wav2vec2_base_100h_by_facebook TFWav2Vec2ForCTC from facebook author: John Snow Labs name: pipeline_asr_wav2vec2_base_100h_by_facebook date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_by_facebook` is an English model originally trained by facebook. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_100h_by_facebook_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_by_facebook_en_4.2.0_3.0_1664038682928.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_by_facebook_en_4.2.0_3.0_1664038682928.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_100h_by_facebook', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_100h_by_facebook", lang = "en") val annotations = pipeline.transform(audioDF) ```
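The `audioDF` above is expected to carry the raw waveform as an array of floats (column `audio_content`, per the AudioAssembler examples elsewhere on this page). A sketch of decoding a 16-bit mono PCM WAV into normalized floats with the Python standard library; the final Spark step is shown only as a comment since it needs a live session, and the file name is a placeholder:

```python
import wave
import struct

def wav_to_floats(path):
    """Read a 16-bit PCM WAV and return samples normalized to [-1.0, 1.0]."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit PCM"
        n_frames = w.getnframes()
        n_channels = w.getnchannels()
        raw = w.readframes(n_frames)
    # Little-endian signed 16-bit samples, one value per channel per frame.
    samples = struct.unpack("<%dh" % (n_frames * n_channels), raw)
    return [s / 32768.0 for s in samples]

# floats = wav_to_floats("sample.wav")
# audioDF = spark.createDataFrame([[floats]], ["audio_content"])
```

`wave.open` also accepts file-like objects, so the same helper works on in-memory buffers.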
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_100h_by_facebook| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|227.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18) author: John Snow Labs name: distilbert_qa_base_uncased_becas_2 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-becas-2` is an English model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_2_en_4.3.0_3.0_1672767456835.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_2_en_4.3.0_3.0_1672767456835.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_becas_2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|243.9 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Evelyn18/distilbert-base-uncased-becas-2 --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_64_finetuned_squad_seed_10 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-64-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_64_finetuned_squad_seed_10_en_4.3.0_3.0_1674215893984.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_64_finetuned_squad_seed_10_en_4.3.0_3.0_1674215893984.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_64_finetuned_squad_seed_10","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_64_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_64_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|419.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-64-finetuned-squad-seed-10 --- layout: model title: Named Entity Recognition Profiling (Clinical) author: John Snow Labs name: ner_profiling_clinical date: 2022-01-18 tags: [ner, ner_profiling, clinical, en, licensed] task: [Named Entity Recognition, Pipeline Healthcare] language: en nav_key: models edition: Healthcare NLP 3.3.1 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to explore all the available pretrained NER models at once. When you run this pipeline over your text, you will end up with the predictions coming out of each pretrained clinical NER model trained with `embeddings_clinical`. Compared to the previous version, it has been updated with new clinical NER models and their outputs.
Here are the NER models that this pretrained pipeline includes: `ner_ade_clinical`, `ner_posology_greedy`, `ner_risk_factors`, `jsl_ner_wip_clinical`, `ner_human_phenotype_gene_clinical`, `jsl_ner_wip_greedy_clinical`, `ner_cellular`, `ner_cancer_genetics`, `jsl_ner_wip_modifier_clinical`, `ner_drugs_greedy`, `ner_deid_sd_large`, `ner_diseases`, `nerdl_tumour_demo`, `ner_deid_subentity_augmented`, `ner_jsl_enriched`, `ner_genetic_variants`, `ner_bionlp`, `ner_measurements_clinical`, `ner_diseases_large`, `ner_radiology`, `ner_deid_augmented`, `ner_anatomy`, `ner_chemprot_clinical`, `ner_posology_experimental`, `ner_drugs`, `ner_deid_sd`, `ner_posology_large`, `ner_deid_large`, `ner_posology`, `ner_deidentify_dl`, `ner_deid_enriched`, `ner_bacterial_species`, `ner_drugs_large`, `ner_clinical_large`, `jsl_rd_ner_wip_greedy_clinical`, `ner_medmentions_coarse`, `ner_radiology_wip_clinical`, `ner_clinical`, `ner_chemicals`, `ner_deid_synthetic`, `ner_events_clinical`, `ner_posology_small`, `ner_anatomy_coarse`, `ner_human_phenotype_go_clinical`, `ner_jsl_slim`, `ner_jsl`, `ner_jsl_greedy`, `ner_events_admission_clinical`, `ner_chexpert` . {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.2.Pretrained_NER_Profiling_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_profiling_clinical_en_3.3.1_2.4_1642496753293.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_profiling_clinical_en_3.3.1_2.4_1642496753293.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline ner_profiling_pipeline = PretrainedPipeline('ner_profiling_clinical', 'en', 'clinical/models') result = ner_profiling_pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val ner_profiling_pipeline = PretrainedPipeline("ner_profiling_clinical", "en", "clinical/models") val result = ner_profiling_pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.profiling_clinical").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .""") ```
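The `annotate()` call returns one list of predictions per output column, so grouping chunks by the model that produced them is a small dictionary transform. A minimal sketch on a stubbed result (the column names below are hypothetical; inspect `result.keys()` on the real output to see the actual names):

```python
# Stubbed annotate() output — real column names may differ; check result.keys().
result = {
    "ner_jsl_chunk": ["28-year-old", "female", "obesity"],
    "ner_clinical_chunk": ["gestational diabetes mellitus", "obesity"],
    "sentence": ["A 28-year-old female presented with polyuria ."],
}

def group_by_model(res, suffix="_chunk"):
    """Collect chunk predictions keyed by the NER model that produced them."""
    return {k[: -len(suffix)]: v for k, v in res.items() if k.endswith(suffix)}

by_model = group_by_model(result)
for model, chunks in sorted(by_model.items()):
    print(model, "->", chunks)
```

The same loop is how the per-model result listings in the next section can be reproduced from a single `annotate()` call.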
## Results ```bash ******************** ner_jsl Model Results ******************** [('28-year-old', 'Age'), ('female', 'Gender'), ('gestational diabetes mellitus', 'Diabetes'), ('eight years prior', 'RelativeDate'), ('subsequent', 'Modifier'), ('type two diabetes mellitus', 'Diabetes'), ('T2DM', 'Diabetes'), ('HTG-induced pancreatitis', 'Disease_Syndrome_Disorder'), ('three years prior', 'RelativeDate'), ('acute', 'Modifier'), ('hepatitis', 'Communicable_Disease'), ('obesity', 'Obesity'), ('body mass index', 'Symptom'), ('33.5 kg/m2', 'Weight'), ('one-week', 'Duration'), ('polyuria', 'Symptom'), ('polydipsia', 'Symptom'), ('poor appetite', 'Symptom'), ('vomiting', 'Symptom')] ******************** ner_diseases_large Model Results ******************** [('gestational diabetes mellitus', 'Disease'), ('diabetes mellitus', 'Disease'), ('T2DM', 'Disease'), ('pancreatitis', 'Disease'), ('hepatitis', 'Disease'), ('obesity', 'Disease'), ('polyuria', 'Disease'), ('polydipsia', 'Disease'), ('vomiting', 'Disease')] ******************** ner_radiology Model Results ******************** [('gestational diabetes mellitus', 'Disease_Syndrome_Disorder'), ('type two diabetes mellitus', 'Disease_Syndrome_Disorder'), ('T2DM', 'Disease_Syndrome_Disorder'), ('HTG-induced pancreatitis', 'Disease_Syndrome_Disorder'), ('acute hepatitis', 'Disease_Syndrome_Disorder'), ('obesity', 'Disease_Syndrome_Disorder'), ('body', 'BodyPart'), ('mass index', 'Symptom'), ('BMI', 'Test'), ('33.5', 'Measurements'), ('kg/m2', 'Units'), ('polyuria', 'Symptom'), ('polydipsia', 'Symptom'), ('poor appetite', 'Symptom'), ('vomiting', 'Symptom')] ******************** ner_clinical Model Results ******************** [('gestational diabetes mellitus', 'PROBLEM'), ('subsequent type two diabetes mellitus', 'PROBLEM'), ('T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index', 'PROBLEM'), ('BMI', 'TEST'), ('polyuria', 'PROBLEM'), 
('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')] ******************** ner_medmentions_coarse Model Results ******************** [('female', 'Organism_Attribute'), ('diabetes mellitus', 'Disease_or_Syndrome'), ('diabetes mellitus', 'Disease_or_Syndrome'), ('T2DM', 'Disease_or_Syndrome'), ('HTG-induced pancreatitis', 'Disease_or_Syndrome'), ('associated with', 'Qualitative_Concept'), ('acute hepatitis', 'Disease_or_Syndrome'), ('obesity', 'Disease_or_Syndrome'), ('body mass index', 'Clinical_Attribute'), ('BMI', 'Clinical_Attribute'), ('polyuria', 'Sign_or_Symptom'), ('polydipsia', 'Sign_or_Symptom'), ('poor appetite', 'Sign_or_Symptom'), ('vomiting', 'Sign_or_Symptom')] ... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_profiling_clinical| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.3.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|2.4 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - 
NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - Finisher --- layout: model title: Part of Speech for Swedish author: John Snow Labs name: pos_talbanken date: 2021-03-23 tags: [sv, open_source] supported: true task: Part of Speech Tagging language: sv edition: Spark NLP 2.7.5 spark_version: 2.4 annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron` architecture. 
## Predicted Entities - ADJ - ADP - ADV - AUX - CCONJ - DET - NOUN - NUM - PART - PRON - PROPN - PUNCT - VERB {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_talbanken_sv_2.7.5_2.4_1616511099635.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_talbanken_sv_2.7.5_2.4_1616511099635.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_talbanken", "sv")\ .setInputCols(["document", "token"])\ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([["' Medicinsk bildtolk ' också skall fungera som hjälpmedel för läkaren att klarlägga sjukdomsbilden utan att patienten behöver säga ett ord ."]], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_talbanken", "sv") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("' Medicinsk bildtolk ' också skall fungera som hjälpmedel för läkaren att klarlägga sjukdomsbilden utan att patienten behöver säga ett ord .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["' Medicinsk bildtolk ' också skall fungera som hjälpmedel för läkaren att klarlägga sjukdomsbilden utan att patienten behöver säga ett ord ."] token_df = nlu.load('sv.pos.talbanken').predict(text) token_df ```
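The transformed DataFrame carries parallel token and POS-tag arrays, so aligning them is a simple zip. A minimal sketch on stubbed lists taken from the example sentence (in a real pipeline these would come from the `token` and `pos` result columns):

```python
# Stubbed result arrays for the first few tokens of the example sentence.
tokens = ["'", "Medicinsk", "bildtolk", "'", "också"]
tags = ["PUNCT", "ADJ", "NOUN", "PUNCT", "ADV"]

def pair_tokens_tags(tokens, tags):
    """Align tokens with their predicted POS tags; the arrays must match in length."""
    assert len(tokens) == len(tags), "token/tag arrays out of sync"
    return list(zip(tokens, tags))

pairs = pair_tokens_tags(tokens, tags)
print(pairs)
```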
## Results ```bash +---------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+ |text |result | +---------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+ |' Medicinsk bildtolk ' också skall fungera som hjälpmedel för läkaren att klarlägga sjukdomsbilden utan att patienten behöver säga ett ord . |[PUNCT, ADJ, NOUN, PUNCT, ADV, AUX, VERB, SCONJ, NOUN, ADP, NOUN, PART, VERB, NOUN, ADP, SCONJ, NOUN, AUX, VERB, DET, NOUN, PUNCT]| +---------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_talbanken| |Compatibility:|Spark NLP 2.7.5+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[pos]| |Language:|sv| ## Data Source The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) data set. 
## Benchmarking ```bash | | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | ADJ | 0.88 | 0.89 | 0.89 | 1826 | | ADP | 0.96 | 0.96 | 0.96 | 2298 | | ADV | 0.91 | 0.87 | 0.89 | 1528 | | AUX | 0.91 | 0.93 | 0.92 | 1021 | | CCONJ | 0.95 | 0.94 | 0.94 | 791 | | DET | 0.92 | 0.95 | 0.93 | 1015 | | INTJ | 1.00 | 0.33 | 0.50 | 3 | | NOUN | 0.94 | 0.95 | 0.95 | 4711 | | NUM | 0.98 | 0.96 | 0.97 | 357 | | PART | 0.93 | 0.94 | 0.94 | 406 | | PRON | 0.94 | 0.91 | 0.92 | 1449 | | PROPN | 0.88 | 0.83 | 0.85 | 243 | | PUNCT | 0.97 | 0.98 | 0.98 | 2104 | | SCONJ | 0.86 | 0.82 | 0.84 | 491 | | SYM | 0.50 | 1.00 | 0.67 | 1 | | VERB | 0.90 | 0.90 | 0.90 | 2142 | | accuracy | | | 0.93 | 20386 | | macro avg | 0.90 | 0.89 | 0.88 | 20386 | | weighted avg | 0.93 | 0.93 | 0.93 | 20386 | ``` --- layout: model title: Detect Adverse Drug Events (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_ade date: 2021-09-30 tags: [adverse, ade, bertfortokenclassification, ner, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.2.2 spark_version: 2.4 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect adverse reactions of drugs in reviews, tweets, and medical text using the pretrained NER model. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. 
## Predicted Entities `DRUG`, `ADE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ade_en_3.2.2_2.4_1633008677011.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ade_en_3.2.2_2.4_1633008677011.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_ade", "en", "clinical/models")\ .setInputCols("token", "document")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter]) p_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']}))) test_sentence = """Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""" result = p_model.transform(spark.createDataFrame(pd.DataFrame({'text': [test_sentence]}))) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_ade", "en", "clinical/models") .setInputCols(Array("token", "document")) .setOutputCol("ner") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("document","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter)) val data = Seq("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . 
Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_ade").predict("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""") ```
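The `NerConverter` stage merges the classifier's token-level IOB tags into entity chunks like those shown in the results table. A minimal illustration of that merging idea on stubbed tags (this is not Spark NLP's actual implementation, just the principle):

```python
def merge_bio(tokens, tags):
    """Merge IOB-tagged tokens into (chunk, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any chunk in progress.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # Continuation of the current entity.
            current.append(tok)
        else:
            # Outside tag (or inconsistent I-) ends the current chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Been", "taking", "Lipitor", "have", "experienced", "severe", "fatigue"]
tags = ["O", "O", "B-DRUG", "O", "O", "B-ADE", "I-ADE"]
print(merge_bio(tokens, tags))  # [('Lipitor', 'DRUG'), ('severe fatigue', 'ADE')]
```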
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |Lipitor |DRUG | |severe fatigue|ADE | |voltaren |DRUG | |cramps |ADE | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_ade| |Compatibility:|Healthcare NLP 3.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Case sensitive:|true| |Max sentence length:|512| ## Data Source This model is trained on a custom dataset by John Snow Labs. ## Benchmarking ```bash label precision recall f1-score support B-ADE 0.93 0.79 0.85 2694 B-DRUG 0.97 0.87 0.92 9539 I-ADE 0.93 0.73 0.82 3236 I-DRUG 0.95 0.82 0.88 6115 accuracy - - 0.83 21584 macro-avg 0.84 0.84 0.84 21584 weighted-avg 0.95 0.83 0.89 21584 ``` --- layout: model title: Financial Assertion Status (Negation) author: John Snow Labs name: finassertion_negation date: 2023-01-01 tags: [negation, en, licensed] task: Assertion Status language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Financial Negation model, aimed at identifying whether an NER entity is negated in its context. ## Predicted Entities `positive`, `negative` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finassertion_negation_en_1.0.0_3.0_1672578587267.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finassertion_negation_en_1.0.0_3.0_1672578587267.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pyspark.sql.functions as F document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") finassertion = finance.AssertionDLModel.pretrained("finassertion_negation", "en", "finance/models")\ .setInputCols(["sentence", "ner_chunk", "embeddings"])\ .setOutputCol("finlabel") pipe = nlp.Pipeline(stages = [ document_assembler, sentence_detector, tokenizer, embeddings, ner, ner_converter, finassertion]) text = "Gradio INC will not be entering into a joint agreement with Hugging Face, Inc." sdf = spark.createDataFrame([[text]]).toDF("text") res = pipe.fit(sdf).transform(sdf) res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.finlabel.result)).alias("cols"))\ .select(F.expr("cols['0']").alias("ner_chunk"), F.expr("cols['1']").alias("assertion")).show(200, truncate=100) ```
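Once chunks and assertion labels are zipped as in the `arrays_zip` snippet above, isolating the negated mentions is a one-line filter. A sketch on stubbed (chunk, assertion) pairs:

```python
# Stubbed (ner_chunk, assertion) pairs as produced by the zip in the pipeline above.
pairs = [("Gradio INC", "negative"), ("Hugging Face, Inc", "positive")]

def negated_entities(pairs):
    """Return only the chunks whose assertion status is 'negative'."""
    return [chunk for chunk, status in pairs if status == "negative"]

print(negated_entities(pairs))  # ['Gradio INC']
```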
## Results ```bash +-----------------+---------+ | ner_chunk|assertion| +-----------------+---------+ | Gradio INC| negative| |Hugging Face, Inc| positive| +-----------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finassertion_negation| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| |Size:|2.2 MB| ## References In-house annotated legal sentences ## Benchmarking ```bash label tp fp fn prec rec f1 negative 26 0 1 1.0 0.962963 0.9811321 positive 38 1 0 0.974359 1.0 0.987013 Macro-average 64 1 1 0.9871795 0.9814815 0.9843222 Micro-average 0.9846154 0.9846154 0.9846154 ``` --- layout: model title: English BertForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_0 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-256-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_0_en_4.0.0_3.0_1657192274652.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_0_en_4.0.0_3.0_1657192274652.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
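Extractive QA models of this kind score every context token as a candidate answer start and end, then return the highest-scoring valid span. A toy sketch of that selection step (the scores below are made up for illustration; the real model derives them from BERT logits):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) pair maximising start+end score, with end >= start."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within a bounded span length.
        for e in range(s, min(s + max_len, len(end_scores))):
            if s_score + end_scores[e] > best_score:
                best_score = s_score + end_scores[e]
                best = (s, e)
    return best

context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley"]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 0.3]
end = [0.0, 0.1, 0.0, 4.8, 0.2, 0.0, 0.1, 0.0, 0.5]
s, e = best_span(start, end)
answer = " ".join(context[s:e + 1])
print(answer)  # Clara
```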
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|384.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-256-finetuned-squad-seed-0 --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_2 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-256-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_2_en_4.0.0_3.0_1655732129289.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_2_en_4.0.0_3.0_1655732129289.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_256d_seed_2").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|427.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-256-finetuned-squad-seed-2 --- layout: model title: Legal Services Clause Binary Classifier author: John Snow Labs name: legclf_services_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `services` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline makes the model see only sentences rather than the whole text, so it is better to skip it unless you want to do Binary Classification at the sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that this model's embeddings support up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, obtaining as output a True/False value for each of the legal clause models you have added. ## Predicted Entities `other`, `services` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_services_clause_en_1.0.0_3.2_1660123991970.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_services_clause_en_1.0.0_3.2_1660123991970.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_services_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
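When several binary clause classifiers run in one pipeline, their per-paragraph categories fold naturally into a clause-presence map. A sketch on stubbed outputs (the classifier names and labels below are hypothetical examples of the pattern, not a list of real models):

```python
# Hypothetical per-classifier outputs for one paragraph: each binary model
# emits its clause name when it fires, or "other" when it does not.
outputs = {
    "legclf_services_clause": "services",
    "legclf_confidentiality_clause": "other",
}

def clause_flags(outputs):
    """True where a classifier recognised its clause, False on 'other'."""
    return {name: label != "other" for name, label in outputs.items()}

print(clause_flags(outputs))
```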
## Results ```bash +----------+ |    result| +----------+ |[services]| |   [other]| |   [other]| |[services]| +----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_services_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.97 0.95 0.96 78 services 0.89 0.94 0.91 34 accuracy - - 0.95 112 macro-avg 0.93 0.94 0.94 112 weighted-avg 0.95 0.95 0.95 112 ``` --- layout: model title: Icelandic NER Pipeline author: John Snow Labs name: roberta_token_classifier_icelandic_ner_pipeline date: 2022-06-25 tags: [open_source, ner, token_classifier, roberta, icelandic, is] task: Named Entity Recognition language: is edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [roberta_token_classifier_icelandic_ner](https://nlp.johnsnowlabs.com/2021/12/06/roberta_token_classifier_icelandic_ner_is.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_icelandic_ner_pipeline_is_4.0.0_3.0_1656122302435.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_icelandic_ner_pipeline_is_4.0.0_3.0_1656122302435.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("roberta_token_classifier_icelandic_ner_pipeline", lang = "is") pipeline.annotate("Ég heiti Peter Fergusson. Ég hef búið í New York síðan í október 2011 og unnið hjá Tesla Motor og þénað 100K $ á ári.") ``` ```scala val pipeline = new PretrainedPipeline("roberta_token_classifier_icelandic_ner_pipeline", lang = "is") pipeline.annotate("Ég heiti Peter Fergusson. Ég hef búið í New York síðan í október 2011 og unnið hjá Tesla Motor og þénað 100K $ á ári.") ```
## Results ```bash +----------------+------------+ |chunk |ner_label | +----------------+------------+ |Peter Fergusson |Person | |New York |Location | |október 2011 |Date | |Tesla Motor |Organization| |100K $ |Money | +----------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_token_classifier_icelandic_ner_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|is| |Size:|457.5 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - RoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from yohein) author: John Snow Labs name: distilbert_qa_yohein_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `yohein`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_yohein_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773298911.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_yohein_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773298911.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_yohein_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_yohein_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_yohein_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/yohein/distilbert-base-uncased-finetuned-squad --- layout: model title: Pipeline to Detect biological concepts (biobert) author: John Snow Labs name: ner_bionlp_biobert_pipeline date: 2023-03-20 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_bionlp_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_bionlp_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_bionlp_biobert_pipeline_en_4.3.0_3.2_1679313010526.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_bionlp_biobert_pipeline_en_4.3.0_3.2_1679313010526.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_bionlp_biobert_pipeline", "en", "clinical/models") text = '''Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_bionlp_biobert_pipeline", "en", "clinical/models") val text = "Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay" val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.bionlp_biobert.pipeline").predict("""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay""") ```
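The `begin` and `end` values returned for each chunk are inclusive character offsets into the input string, so the original chunk text can be recovered by slicing one position past `end`. A minimal sketch in plain Python (the `chunk_at` helper is illustrative, not part of Spark NLP; the text is the example sentence above):

```python
# begin/end offsets in fullAnnotate results are inclusive character indices
# into the input text; a chunk is recovered by slicing one past `end`.
text = ("Both the erbA IRES and the erbA/myb virus constructs transformed "
        "erythroid cells after infection of bone marrow or blastoderm cultures.")

def chunk_at(text, begin, end):
    # end is inclusive, so slice up to end + 1
    return text[begin:end + 1]

print(chunk_at(text, 9, 12))   # -> erbA
print(chunk_at(text, 65, 79))  # -> erythroid cells
```

The same convention applies to the offsets shown in the Results tables of the other NER pipelines on this page.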
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:--------------------|--------:|------:|:-----------------------|-------------:| | 0 | erbA | 9 | 12 | Gene_or_gene_product | 1 | | 1 | IRES | 14 | 17 | Organism | 0.754 | | 2 | virus | 36 | 40 | Organism | 0.9999 | | 3 | erythroid cells | 65 | 79 | Cell | 0.99855 | | 4 | bone | 100 | 103 | Multi-tissue_structure | 0.9794 | | 5 | marrow | 105 | 110 | Multi-tissue_structure | 0.9631 | | 6 | blastoderm cultures | 115 | 133 | Cell | 0.9868 | | 7 | IRES virus | 149 | 158 | Organism | 0.99985 | | 8 | erbA | 236 | 239 | Gene_or_gene_product | 0.9977 | | 9 | IRES virus | 241 | 250 | Organism | 0.9911 | | 10 | blastoderm | 259 | 268 | Cell | 0.9941 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_bionlp_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.2 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Pipeline to Detect Radiology Related Entities author: John Snow Labs name: ner_radiology_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_radiology](https://nlp.johnsnowlabs.com/2021/03/31/ner_radiology_en.html) model.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_radiology_pipeline_en_4.3.0_3.2_1678865918152.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_radiology_pipeline_en_4.3.0_3.2_1678865918152.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_radiology_pipeline", "en", "clinical/models") text = '''Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_radiology_pipeline", "en", "clinical/models") val text = "Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.radiology.pipeline").predict("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-----------------------------------------|--------:|------:|:--------------------------|-------------:| | 0 | Bilateral breast | 0 | 15 | BodyPart | 0.945 | | 1 | ultrasound | 17 | 26 | ImagingTest | 0.6734 | | 2 | ovoid mass | 78 | 87 | ImagingFindings | 0.6095 | | 3 | 0.5 x 0.5 x 0.4 | 113 | 127 | Measurements | 0.98158 | | 4 | cm | 129 | 130 | Units | 0.9696 | | 5 | anteromedial aspect of the left shoulder | 163 | 202 | BodyPart | 0.750517 | | 6 | mass | 210 | 213 | ImagingFindings | 0.9711 | | 7 | isoechoic echotexture | 228 | 248 | ImagingFindings | 0.80105 | | 8 | muscle | 266 | 271 | BodyPart | 0.7963 | | 9 | internal color flow | 294 | 312 | ImagingFindings | 0.477233 | | 10 | benign fibrous tissue | 334 | 354 | ImagingFindings | 0.524067 | | 11 | lipoma | 361 | 366 | Disease_Syndrome_Disorder | 0.6081 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_radiology_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Translate Italian to English Pipeline author: John Snow Labs name: translate_it_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, it, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. 
Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `it` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_it_en_xx_2.7.0_2.4_1609689489920.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_it_en_xx_2.7.0_2.4_1609689489920.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_it_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_it_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.it.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_it_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Portuguese BertForTokenClassification Cased model (from pucpr) author: John Snow Labs name: bert_token_classifier_clinicalnerpt_diagnostic date: 2022-11-30 tags: [pt, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: pt edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `clinicalnerpt-diagnostic` is a Portuguese model originally trained by `pucpr`. ## Predicted Entities `DiagnosticProcedure` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_clinicalnerpt_diagnostic_pt_4.2.4_3.0_1669822359730.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_clinicalnerpt_diagnostic_pt_4.2.4_3.0_1669822359730.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_clinicalnerpt_diagnostic","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_clinicalnerpt_diagnostic","pt") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_clinicalnerpt_diagnostic| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|pt| |Size:|665.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/pucpr/clinicalnerpt-diagnostic - https://www.aclweb.org/anthology/2020.clinicalnlp-1.7/ - https://github.com/HAILab-PUCPR/SemClinBr - https://github.com/HAILab-PUCPR/BioBERTpt --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_recipe_triplet_recipes_base_easy_timestep_squadv2_epochs_3 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `recipe_triplet_recipes-roberta-base_EASY_TIMESTEP_squadv2_epochs_3` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_recipe_triplet_recipes_base_easy_timestep_squadv2_epochs_3_en_4.3.0_3.0_1674212108351.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_recipe_triplet_recipes_base_easy_timestep_squadv2_epochs_3_en_4.3.0_3.0_1674212108351.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_recipe_triplet_recipes_base_easy_timestep_squadv2_epochs_3","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_recipe_triplet_recipes_base_easy_timestep_squadv2_epochs_3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_recipe_triplet_recipes_base_easy_timestep_squadv2_epochs_3| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|467.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/recipe_triplet_recipes-roberta-base_EASY_TIMESTEP_squadv2_epochs_3 --- layout: model title: Lemmatizer (Dutch, SpacyLookup) author: John Snow Labs name: lemma_spacylookup date: 2022-03-03 tags: [open_source, lemmatizer, nl] task: Lemmatization language: nl edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Dutch Lemmatizer is a scalable, production-ready version of the Rule-based Lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_nl_3.4.1_3.0_1646316580197.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_nl_3.4.1_3.0_1646316580197.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","nl") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Je bent niet beter dan ik"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","nl") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("Je bent niet beter dan ik").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("nl.lemma").predict("""Je bent niet beter dan ik""") ```
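Conceptually, a SpacyLookup-style lemmatizer is just a per-token dictionary lookup with an identity fallback for out-of-vocabulary tokens, which is why tokens absent from the table pass through unchanged. A toy sketch in plain Python (the lookup entries are illustrative assumptions, not taken from the actual model files):

```python
# Toy lookup-table lemmatizer: per-token dictionary lookup, identity fallback.
# The table entries here are illustrative, NOT read from the real model.
LOOKUP = {"bent": "zijn", "beter": "goed"}  # hypothetical Dutch entries

def lemmatize(tokens, table):
    # tokens missing from the table are returned unchanged
    return [table.get(tok, tok) for tok in tokens]

tokens = ["Je", "bent", "niet", "beter", "dan", "ik"]
print(lemmatize(tokens, {}))
# -> ['Je', 'bent', 'niet', 'beter', 'dan', 'ik']  (empty table: all pass through)
```

With a populated table, only the tokens that have entries are rewritten; everything else, including punctuation and proper nouns, is left as-is.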
## Results ```bash +--------------------------------+ |result | +--------------------------------+ |[Je, bent, niet, beter, dan, ik]| +--------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma_spacylookup| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|nl| |Size:|2.5 MB| --- layout: model title: Detect Problems, Tests and Treatments (ner_clinical_large) author: John Snow Labs name: ner_clinical date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terms. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. ## Predicted Entities `PROBLEM`, `TEST`, `TREATMENT`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_EVENTS_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_en_3.0.0_3.0_1617208419368.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_en_3.0.0_3.0_1617208419368.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. 
BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.']], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. 
The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.""") ```
## Results ```bash +-----------------------------------------------------------+---------+ |chunk |ner_label| +-----------------------------------------------------------+---------+ |the G-protein-activated inwardly rectifying potassium (GIRK|TREATMENT| |the genomicorganization |TREATMENT| |a candidate gene forType II diabetes mellitus |PROBLEM | |byapproximately |TREATMENT| |single nucleotide polymorphisms |TREATMENT| |aVal366Ala substitution |TREATMENT| |an 8 base-pair |TREATMENT| |insertion/deletion |PROBLEM | |Ourexpression studies |TEST | |the transcript in various humantissues |PROBLEM | |fat andskeletal muscle |PROBLEM | |furtherstudies |PROBLEM | |the KCNJ9 protein |TREATMENT| |evaluation |TEST | |Type II diabetes |PROBLEM | |the treatment |TREATMENT| |breast cancer |PROBLEM | |the standard therapy |TREATMENT| |anthracyclines |TREATMENT| |taxanes |TREATMENT| +-----------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on augmented 2010 i2b2 challenge data with 'embeddings_clinical'. 
https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|--------------:|------:|------:|------:|---------:|---------:|---------:| | 0 | I-TREATMENT | 6625 | 1187 | 1329 | 0.848054 | 0.832914 | 0.840416 | | 1 | I-PROBLEM | 15142 | 1976 | 2542 | 0.884566 | 0.856254 | 0.87018 | | 2 | B-PROBLEM | 11005 | 1065 | 1587 | 0.911765 | 0.873968 | 0.892466 | | 3 | I-TEST | 6748 | 923 | 1264 | 0.879677 | 0.842237 | 0.86055 | | 4 | B-TEST | 8196 | 942 | 1029 | 0.896914 | 0.888455 | 0.892665 | | 5 | B-TREATMENT | 8271 | 1265 | 1073 | 0.867345 | 0.885167 | 0.876165 | | 6 | Macro-average | 55987 | 7358 | 8824 | 0.881387 | 0.863166 | 0.872181 | | 7 | Micro-average | 55987 | 7358 | 8824 | 0.883842 | 0.86385 | 0.873732 | ``` --- layout: model title: Spanish ElectraForQuestionAnswering Small model (from mrm8488) author: John Snow Labs name: electra_qa_electricidad_small_finetuned_squadv1 date: 2022-06-22 tags: [es, open_source, electra, question_answering] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electricidad-small-finetuned-squadv1-es` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_electricidad_small_finetuned_squadv1_es_4.0.0_3.0_1655921719427.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_electricidad_small_finetuned_squadv1_es_4.0.0_3.0_1655921719427.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_electricidad_small_finetuned_squadv1","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["¿Cuál es mi nombre?", "Mi nombre es Clara y vivo en Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_electricidad_small_finetuned_squadv1","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("¿Cuál es mi nombre?", "Mi nombre es Clara y vivo en Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.squad.electra.small").predict("""¿Cuál es mi nombre?|||Mi nombre es Clara y vivo en Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_electricidad_small_finetuned_squadv1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|51.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/electricidad-small-finetuned-squadv1-es - https://github.com/ccasimiro88/TranslateAlignRetrieve/tree/master/SQuAD-es-v1.1 --- layout: model title: Smaller BERT Sentence Embeddings (L-4_H-512_A-8) author: John Snow Labs name: sent_small_bert_L4_512 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L4_512_en_2.6.0_2.4_1598350568942.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L4_512_en_2.6.0_2.4_1598350568942.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L4_512", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L4_512", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L4_512').predict(text, output_level='sentence') embeddings_df ```
{:.h2_title} ## Results ```bash en_embed_sentence_small_bert_L4_512_embeddings sentence [0.2255069762468338, 0.14144930243492126, 0.67... I hate cancer [-0.5351444482803345, 0.36734339594841003, 0.1... Antibiotics aren't painkiller ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L4_512| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|512| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1 --- layout: model title: Finnish asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP TFWav2Vec2ForCTC from Finnish-NLP author: John Snow Labs name: asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP` is a Finnish model originally trained by Finnish-NLP. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP_fi_4.2.0_3.0_1664025529159.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP_fi_4.2.0_3.0_1664025529159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP", "fi")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP", "fi") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xlsr_1b_finnish_lm_by_Finnish_NLP| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|fi| |Size:|3.6 GB| --- layout: model title: Pipeline to Detect Adverse Drug Events (healthcare) author: John Snow Labs name: ner_ade_healthcare_pipeline date: 2022-03-22 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_ade_healthcare](https://nlp.johnsnowlabs.com/2021/04/01/ner_ade_healthcare_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_ade_healthcare_pipeline_en_3.4.1_3.0_1647944180015.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_ade_healthcare_pipeline_en_3.4.1_3.0_1647944180015.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_ade_healthcare_pipeline", "en", "clinical/models") pipeline.fullAnnotate("Been taking Lipitor for 15 years, have experienced severe fatigue a lot!!!. Doctor moved me to voltaren 2 months ago, so far, have only experienced cramps") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_ade_healthcare_pipeline", "en", "clinical/models") pipeline.fullAnnotate("Been taking Lipitor for 15 years, have experienced severe fatigue a lot!!!. Doctor moved me to voltaren 2 months ago, so far, have only experienced cramps") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.healthcare_ade.pipeline").predict("""Been taking Lipitor for 15 years, have experienced severe fatigue a lot!!!. Doctor moved me to voltaren 2 months ago, so far, have only experienced cramps""") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |Lipitor |DRUG | |severe fatigue|ADE | |voltaren |DRUG | |cramps |ADE | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_ade_healthcare_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|513.5 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Legal Participations Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_participations_bert date: 2023-03-05 tags: [en, legal, classification, clauses, participations, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Participations` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that this model's embeddings support up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Participations`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_participations_bert_en_1.0.0_3.0_1678050674961.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_participations_bert_en_1.0.0_3.0_1678050674961.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_participations_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
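As suggested above, one simple way to prepare big legal documents for this classifier is paragraph splitting by multiline breaks, feeding each provision to the pipeline separately. A minimal sketch of that pre-processing step (the `split_paragraphs` helper is hypothetical, not part of Spark NLP):

```python
import re

def split_paragraphs(text: str):
    """Split a legal document into provisions on blank lines (multiline split)."""
    # Two or more consecutive newlines mark a provision (paragraph) boundary.
    paragraphs = re.split(r"\n{2,}", text)
    # Drop empty fragments and strip surrounding whitespace.
    return [p.strip() for p in paragraphs if p.strip()]

doc = (
    "Section 1. Participations.\nEach Lender may sell participations.\n\n"
    "Section 2. Notices.\nAll notices shall be in writing."
)
provisions = split_paragraphs(doc)
print(len(provisions))  # one row per provision for the classifier DataFrame
```

Each resulting provision can then become one row of the `text` column in the DataFrame passed to the pipeline, so the classifier sees a whole clause rather than single sentences.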
## Results ```bash +----------------+ |result | +----------------+ |[Participations]| |[Other] | |[Other] | |[Participations]| +----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_participations_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.93 0.96 0.95 73 Participations 0.94 0.91 0.92 53 accuracy - - 0.94 126 macro-avg 0.94 0.93 0.93 126 weighted-avg 0.94 0.94 0.94 126 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_512_finetuned_squad_seed_2 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-512-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_512_finetuned_squad_seed_2_en_4.3.0_3.0_1674215491724.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_512_finetuned_squad_seed_2_en_4.3.0_3.0_1674215491724.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_512_finetuned_squad_seed_2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_512_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_512_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|432.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-512-finetuned-squad-seed-2 --- layout: model title: Stopwords Remover for Ligurian language (162 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, lij, open_source] task: Stop Words Removal language: lij edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_lij_3.4.1_3.0_1646653713114.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_lij_3.4.1_3.0_1646653713114.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","lij") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Unde é a marina?"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","lij") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Unde é a marina?").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("lij.stopwords").predict("""Unde é a marina?""") ```
## Results ```bash +-----------------+ |result | +-----------------+ |[Unde, marina, ?]| +-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|lij| |Size:|1.8 KB| --- layout: model title: English BertForQuestionAnswering model (from haddadalwi) author: John Snow Labs name: bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad_finetuned_islamic_squad date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-whole-word-masking-finetuned-squad-finetuned-islamic-squad` is an English model originally trained by `haddadalwi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad_finetuned_islamic_squad_en_4.0.0_3.0_1654537141717.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad_finetuned_islamic_squad_en_4.0.0_3.0_1654537141717.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad_finetuned_islamic_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad_finetuned_islamic_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.large_uncased.by_haddadalwi").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad_finetuned_islamic_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/haddadalwi/bert-large-uncased-whole-word-masking-finetuned-squad-finetuned-islamic-squad --- layout: model title: Sentence Entity Resolver for Billable ICD10-CM HCC Codes author: John Snow Labs name: sbiobertresolve_icd10cm_augmented_billable_hcc date: 2021-11-01 tags: [icd10cm, hcc, entity_resolution, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.1 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD10-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings, and it supports 7-digit codes with HCC status. It has been updated by dropping the invalid codes that existed in previous versions. In the result, look for the `all_k_aux_labels` parameter in the metadata to get the HCC status. The HCC status string can be split to obtain further information: `billable status`, `hcc status`, and `hcc score`.
## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_billable_hcc_en_3.3.1_2.4_1635784379929.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_billable_hcc_en_3.3.1_2.4_1635784379929.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The ```sbiobertresolve_icd10cm_augmented_billable_hcc``` resolver model must be used with ```sbiobert_base_cased_mli``` as embeddings and ```ner_clinical``` as the NER model, with ```PROBLEM``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... c2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sentence_embeddings") icd_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sentence_embeddings"]) \ .setOutputCol("icd10cm_code")\ .setDistanceFunction("EUCLIDEAN") resolver_pipeline = Pipeline( stages = [ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, clinical_ner, ner_converter_icd, c2doc, sbert_embedder, icd_resolver ]) data_ner = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection."]]).toDF("text") results = resolver_pipeline.fit(data_ner).transform(data_ner) ``` ```scala ... 
val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val icd10_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd10cm.augmented_billable").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.""") ```
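As described above, the `all_k_aux_labels` metadata packs the billable status, HCC status, and HCC score for each candidate code as `||`-separated fields, with one `:::`-separated entry per code (e.g. `1||1||22:::1||0||0`, as in the results below). A minimal sketch of how such a string could be parsed after running the pipeline (the `parse_hcc_status` helper is hypothetical, not part of Spark NLP):

```python
def parse_hcc_status(aux_labels: str):
    """Parse an all_k_aux_labels string into one dict per candidate ICD10-CM code."""
    parsed = []
    for entry in aux_labels.split(":::"):       # one entry per candidate code
        billable, hcc, score = entry.split("||")  # billable status || hcc status || hcc score
        parsed.append({"billable": billable, "hcc_status": hcc, "hcc_score": score})
    return parsed

# Example value in the shape shown in the Results table below.
print(parse_hcc_status("1||1||22:::1||0||0"))
```

The order of entries matches the order of codes in `all_codes`, so the two lists can be zipped together to attach HCC information to each resolved code.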
## Results ```bash +-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+----------------------------------------------------------------------+ | ner_chunk| entity|icd10cm_code| resolutions| all_codes| billable_hcc| +-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+----------------------------------------------------------------------+ | gestational diabetes mellitus|PROBLEM| O2441|gestational diabetes mellitus:::postpartum gestational diabetes mel...| O2441:::O2443:::Z8632:::Z875:::O2431:::O2411:::O244:::O241:::O2481|0||0||0:::0||0||0:::1||0||0:::0||0||0:::0||0||0:::0||0||0:::0||0||0...| |subsequent type two diabetes mellitus|PROBLEM| O2411|pre-existing type 2 diabetes mellitus:::disorder associated with ty...|O2411:::E118:::E11:::E139:::E119:::E113:::E1144:::Z863:::Z8639:::E1...|0||0||0:::1||1||18:::0||0||0:::1||1||19:::1||1||19:::0||0||0:::1||1...| | T2DM|PROBLEM| E11|t2dm [type 2 diabetes mellitus]:::tndm2:::t2 category:::sma2:::nf2:...|E11:::P702:::C801:::G121:::Q850:::C779:::C509:::C439:::E723:::C5700...|0||0||0:::1||0||0:::1||1||12:::1||1||72:::0||0||0:::1||1||10:::0||0...| | HTG-induced pancreatitis|PROBLEM| K8520|alcohol-induced pancreatitis:::pancreatitis:::drug induced acute pa...|K8520:::K859:::K853:::K8590:::K85:::F102:::K858:::K8591:::K852:::K8...|1||0||0:::0||0||0:::0||0||0:::1||0||0:::0||0||0:::0||0||0:::0||0||0...| | acute hepatitis|PROBLEM| K720|acute hepatitis:::acute hepatitis a:::acute infectious hepatitis:::...|K720:::B15:::B179:::B172:::Z0389:::B159:::B150:::B16:::K752:::K712:...|0||0||0:::0||0||0:::1||0||0:::1||0||0:::1||0||0:::1||0||0:::1||0||0...| | obesity|PROBLEM| E669|obesity:::abdominal obesity:::obese:::central obesity:::overweight 
...|E669:::E668:::Z6841:::Q130:::E66:::E6601:::Z8639:::E349:::H3550:::Z...|1||0||0:::1||0||0:::1||1||22:::1||0||0:::0||0||0:::1||1||22:::1||0|...| | a body mass index|PROBLEM| Z6841|finding of body mass index:::observation of body mass index:::mass ...|Z6841:::E669:::R229:::Z681:::R223:::R221:::Z68:::R222:::R220:::R418...|1||1||22:::1||0||0:::1||0||0:::1||0||0:::0||0||0:::1||0||0:::0||0||...| | polyuria|PROBLEM| R35|polyuria:::polyuric state:::polyuric state (disorder):::hematuria::...|R35:::R358:::E232:::R31:::R350:::R8299:::N401:::E723:::O048:::R300:...|0||0||0:::1||0||0:::1||1||23:::0||0||0:::1||0||0:::0||0||0:::1||0||...| | polydipsia|PROBLEM| R631|polydipsia:::psychogenic polydipsia:::primary polydipsia:::psychoge...|R631:::F6389:::E232:::F639:::O40:::G475:::M7989:::R632:::R061:::H53...|1||0||0:::1||1||nan:::1||1||23:::1||1||nan:::0||0||0:::0||0||0:::1|...| | poor appetite|PROBLEM| R630|poor appetite:::poor feeding:::bad taste in mouth:::unpleasant tast...|R630:::P929:::R438:::R432:::E86:::R196:::F520:::Z724:::R0689:::Z768...|1||0||0:::1||0||0:::1||0||0:::1||0||0:::0||0||0:::1||0||0:::1||0||0...| | vomiting|PROBLEM| R111|vomiting:::intermittent vomiting:::vomiting symptoms:::periodic vom...| R111:::R11:::R1110:::G43A1:::P921:::P9209:::G43A:::R1113:::R110|0||0||0:::0||0||0:::1||0||0:::1||1||nan:::1||0||0:::1||0||0:::0||0|...| | a respiratory tract infection|PROBLEM| J988|respiratory tract infection:::upper respiratory tract infection:::b...|J988:::J069:::A499:::J22:::J209:::Z593:::T17:::J0410:::Z1383:::J189...|1||0||0:::1||0||0:::1||0||0:::1||0||0:::1||0||0:::1||0||0:::0||0||0...| +-------------------------------------+-------+------------+----------------------------------------------------------------------+----------------------------------------------------------------------+----------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model 
Name:|sbiobertresolve_icd10cm_augmented_billable_hcc| |Compatibility:|Healthcare NLP 3.3.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on the 01 November 2021 ICD10CM Dataset. --- layout: model title: English XlmRoBertaForQuestionAnswering (from teacookies) author: John Snow Labs name: xlm_roberta_qa_autonlp_roberta_base_squad2_24465525 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-roberta-base-squad2-24465525` is an English model originally trained by `teacookies`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465525_en_4.0.0_3.0_1655987202658.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465525_en_4.0.0_3.0_1655987202658.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465525","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465525","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.xlm_roberta.base_24465525.by_teacookies").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_roberta_base_squad2_24465525| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|887.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-roberta-base-squad2-24465525 --- layout: model title: Bangla BertForMaskedLM Base Cased model (from sagorsarker) author: John Snow Labs name: bert_embeddings_bangla_base date: 2022-12-02 tags: [bn, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: bn edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bangla-bert-base` is a Bangla model originally trained by `sagorsarker`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bangla_base_bn_4.2.4_3.0_1670015550585.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bangla_base_bn_4.2.4_3.0_1670015550585.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_bangla_base","bn") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_bangla_base","bn") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
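The `embeddings` column produced by this pipeline carries one vector per token. A common downstream step is comparing two such vectors with cosine similarity; the sketch below is plain Python over already-extracted vectors (a hypothetical helper, not a Spark NLP API):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 2-d vectors standing in for real 768-d BERT embeddings.
print(round(cosine([1.0, 0.0], [1.0, 1.0]), 4))  # 0.7071
```

In practice the vectors come from the `embeddings` annotations in `result`; identical vectors score 1.0, orthogonal ones 0.0.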
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bangla_base| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|bn| |Size:|617.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/sagorsarker/bangla-bert-base - https://github.com/sagorbrur/bangla-bert - https://arxiv.org/abs/1810.04805 - https://github.com/google-research/bert - https://oscar-corpus.com/ - https://dumps.wikimedia.org/bnwiki/latest/ - https://github.com/sagorbrur/bnlp - https://twitter.com/mapmeld - https://github.com/rezacsedu/Classification_Benchmarks_Benglai_NLP - https://github.com/sagorbrur/bangla-bert/blob/master/notebook/bangla-bert-evaluation-classification-task.ipynb - https://github.com/sagorbrur/bangla-bert/tree/master/evaluations/wikiann - https://arxiv.org/abs/2012.14353 - https://arxiv.org/abs/2104.08613 - https://arxiv.org/abs/2107.03844 - https://arxiv.org/abs/2101.00204 - https://github.com/sagorbrur - https://www.tensorflow.org/tfrc --- layout: model title: TREC(50) Question Classifier author: John Snow Labs name: classifierdl_use_trec50 date: 2021-01-08 task: Text Classification language: en nav_key: models edition: Spark NLP 2.7.1 spark_version: 2.4 tags: [classifier, text_classification, en, open_source] supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Classify open-domain, fact-based questions into subcategories of the following broad semantic categories: Abbreviation, Description, Entities, Human Beings, Locations or Numeric Values.
## Predicted Entities ``ENTY_animal``, ``ENTY_body``, ``ENTY_color``, ``ENTY_cremat``, ``ENTY_currency``, ``ENTY_dismed``, ``ENTY_event``, ``ENTY_food``, ``ENTY_instru``, ``ENTY_lang``, ``ENTY_letter``, ``ENTY_other``, ``ENTY_plant``, ``ENTY_product``, ``ENTY_religion``, ``ENTY_sport``, ``ENTY_substance``, ``ENTY_symbol``, ``ENTY_techmeth``, ``ENTY_termeq``, ``ENTY_veh``, ``ENTY_word``, ``DESC_def``, ``DESC_desc``, ``DESC_manner``, ``DESC_reason``, ``HUM_gr``, ``HUM_ind``, ``HUM_title``, ``HUM_desc``, ``LOC_city``, ``LOC_country``, ``LOC_mount``, ``LOC_other``, ``LOC_state``, ``NUM_code``, ``NUM_count``, ``NUM_date``, ``NUM_dist``, ``NUM_money``, ``NUM_ord``, ``NUM_other``, ``NUM_period``, ``NUM_perc``, ``NUM_speed``, ``NUM_temp``, ``NUM_volsize``, ``NUM_weight``, ``ABBR_abb``, ``ABBR_exp``. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_EN_TREC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_EN_TREC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_trec50_en_2.7.1_2.4_1610118328412.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_trec50_en_2.7.1_2.4_1610118328412.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") use = UniversalSentenceEncoder.pretrained(lang="en") \ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") document_classifier = ClassifierDLModel.pretrained('classifierdl_use_trec50', 'en') \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") nlpPipeline = Pipeline(stages=[documentAssembler, use, document_classifier]) light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate('When did the construction of stone circles begin in the UK?') ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val use = UniversalSentenceEncoder.pretrained(lang="en") .setInputCols(Array("document")) .setOutputCol("sentence_embeddings") val document_classifier = ClassifierDLModel.pretrained("classifierdl_use_trec50", "en") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, use, document_classifier)) val data = Seq("When did the construction of stone circles begin in the UK?").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""When did the construction of stone circles begin in the UK?"""] trec50_df = nlu.load('en.classify.trec50.use').predict(text, output_level = "document") trec50_df[["document", "trec50"]] ```
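Each TREC-50 label concatenates a broad category prefix with a subcategory suffix (e.g. `NUM_date`). A small plain-Python helper (hypothetical, not part of Spark NLP) can split a predicted label back into the two levels listed in the description:

```python
# Hypothetical helper: map a fine-grained TREC-50 label such as "NUM_date"
# to its broad semantic category name and its subcategory.
BROAD = {
    "ABBR": "Abbreviation", "DESC": "Description", "ENTY": "Entities",
    "HUM": "Human Beings", "LOC": "Locations", "NUM": "Numeric Values",
}

def split_trec50_label(label):
    coarse, _, fine = label.partition("_")
    return BROAD.get(coarse, coarse), fine

print(split_trec50_label("NUM_date"))  # ('Numeric Values', 'date')
```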
## Results ```bash +------------------------------------------------------------------------------------------------+------------+ |document |class | +------------------------------------------------------------------------------------------------+------------+ |When did the construction of stone circles begin in the UK? | NUM_date | +------------------------------------------------------------------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_use_trec50| |Compatibility:|Spark NLP 2.7.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| ## Data Source This model is trained on the 50-class version of the TREC dataset. http://search.r-project.org/library/textdata/html/dataset_trec.html --- layout: model title: Legal Letters of credit Clause Binary Classifier author: John Snow Labs name: legclf_letters_of_credit_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `letters-of-credit` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. 
If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `letters-of-credit` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_letters_of_credit_clause_en_1.0.0_3.2_1660122609389.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_letters_of_credit_clause_en_1.0.0_3.2_1660122609389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_letters_of_credit_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
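The paragraph-by-multiline splitting recommended above can be approximated in plain Python before feeding each piece to the classifier. This is only a sketch (not the Legal NLP splitters from the linked tutorial), splitting on blank lines:

```python
import re

def split_paragraphs(text):
    """Split a long legal document into candidate clause paragraphs
    on blank lines (one or more empty/whitespace-only lines)."""
    return [p.strip() for p in re.split(r"\n\s*\n+", text) if p.strip()]

doc = "9.1 Letters of Credit. The Borrower may request...\n\n9.2 Fees. Fees accrue..."
for clause in split_paragraphs(doc):
    print(clause[:30])
```

Each resulting paragraph can then be loaded into the `clause_text` column shown in the pipeline above.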
## Results ```bash +-------+ | result| +-------+ |[letters-of-credit]| |[other]| |[other]| |[letters-of-credit]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_letters_of_credit_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support letters-of-credit 0.91 0.88 0.89 24 other 0.96 0.98 0.97 84 accuracy - - 0.95 108 macro-avg 0.94 0.93 0.93 108 weighted-avg 0.95 0.95 0.95 108 ``` --- layout: model title: Legal Delegation of duties Clause Binary Classifier author: John Snow Labs name: legclf_delegation_of_duties_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `delegation-of-duties` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `delegation-of-duties` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_delegation_of_duties_clause_en_1.0.0_3.2_1660122337035.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_delegation_of_duties_clause_en_1.0.0_3.2_1660122337035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_delegation_of_duties_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[delegation-of-duties]| |[other]| |[other]| |[delegation-of-duties]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_delegation_of_duties_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support delegation-of-duties 0.97 0.94 0.95 33 other 0.98 0.99 0.99 103 accuracy - - 0.98 136 macro-avg 0.97 0.96 0.97 136 weighted-avg 0.98 0.98 0.98 136 ``` --- layout: model title: Bangla RobertaForSequenceClassification Cased model (from neuralspace) author: John Snow Labs name: roberta_classifier_autotrain_citizen_nlu_bn_1370652766 date: 2022-12-09 tags: [bn, open_source, roberta, sequence_classification, classification, tensorflow] task: Text Classification language: bn edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-citizen_nlu_bn-1370652766` is a Bangla model originally trained by `neuralspace`. 
## Predicted Entities `ReportingMissingPets`, `EligibilityForBloodDonationCovidGap`, `ReportingPropertyTakeOver`, `IntentForBloodReceivalAppointment`, `EligibilityForBloodDonationSTD`, `InquiryForDoctorConsultation`, `InquiryOfCovidSymptoms`, `InquiryForVaccineCount`, `InquiryForCovidPrevention`, `InquiryForVaccinationRequirements`, `EligibilityForBloodDonationForPregnantWomen`, `ReportingCyberCrime`, `ReportingHitAndRun`, `ReportingTresspassing`, `InquiryofBloodDonationRequirements`, `ReportingMurder`, `ReportingVehicleAccident`, `ReportingMissingPerson`, `EligibilityForBloodDonationAgeLimit`, `ReportingAnimalPoaching`, `InquiryOfEmergencyContact`, `InquiryForQuarantinePeriod`, `ContactRealPerson`, `IntentForBloodDonationAppointment`, `ReportingMissingVehicle`, `InquiryForCovidRecentCasesCount`, `InquiryOfContact`, `StatusOfFIR`, `InquiryofVaccinationAgeLimit`, `InquiryForCovidTotalCasesCount`, `EligibilityForBloodDonationGap`, `InquiryofPostBloodDonationEffects`, `InquiryofPostBloodReceivalCareSchemes`, `EligibilityForBloodReceiversBloodGroup`, `EligitbilityForVaccine`, `InquiryOfLockdownDetails`, `ReportingSexualAssault`, `InquiryForVaccineCost`, `InquiryForCovidDeathCount`, `ReportingDrugConsumption`, `ReportingDrugTrafficing`, `InquiryofPostBloodDonationCertificate`, `ReportingDowry`, `ReportingChildAbuse`, `ReportingAnimalAbuse`, `InquiryofPostBloodReceivalEffects`, `Eligibility For BloodDonationWithComorbidities`, `InquiryOfTiming`, `InquiryForCovidActiveCasesCount`, `InquiryOfLocation`, `InquiryofPostBloodDonationCareSchemes`, `ReportingTheft`, `InquiryForTravelRestrictions`, `ReportingDomesticViolence`, `InquiryofBloodReceivalRequirements` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_classifier_autotrain_citizen_nlu_bn_1370652766_bn_4.2.4_3.0_1670623640434.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_classifier_autotrain_citizen_nlu_bn_1370652766_bn_4.2.4_3.0_1670623640434.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autotrain_citizen_nlu_bn_1370652766","bn") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_classifier]) data = spark.createDataFrame([["I love you!"], ["I feel lucky to be here."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autotrain_citizen_nlu_bn_1370652766","bn") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_classifier)) val data = Seq("I love you!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_classifier_autotrain_citizen_nlu_bn_1370652766| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|bn| |Size:|312.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/neuralspace/autotrain-citizen_nlu_bn-1370652766 --- layout: model title: Pipeline to Detect Radiology Entities, Assign Assertion Status and Find Relations author: John Snow Labs name: explain_clinical_doc_radiology date: 2023-04-20 tags: [licensed, clinical, en, ner, assertion, relation_extraction, radiology] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pipeline for detecting radiology entities with the `ner_radiology` NER model, assigning their assertion status with the `assertion_dl_radiology` model, and extracting relations between radiology-related terminology with the `re_test_problem_finding` relation extraction model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_radiology_en_4.3.0_3.2_1682019248720.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_radiology_en_4.3.0_3.2_1682019248720.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("explain_clinical_doc_radiology", "en", "clinical/models") text = """Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""" result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("explain_clinical_doc_radiology", "en", "clinical/models") val text = """Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""" val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.explain_doc.clinical_radiology.pipeline").predict("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""") ```
## Results ```bash +----+------------------------------------------+---------------------------+ | | chunks | entities | |---:|:-----------------------------------------|:--------------------------| | 0 | Bilateral breast | BodyPart | | 1 | ultrasound | ImagingTest | | 2 | ovoid mass | ImagingFindings | | 3 | 0.5 x 0.5 x 0.4 | Measurements | | 4 | cm | Units | | 5 | anteromedial aspect of the left shoulder | BodyPart | | 6 | mass | ImagingFindings | | 7 | isoechoic echotexture | ImagingFindings | | 8 | muscle | BodyPart | | 9 | internal color flow | ImagingFindings | | 10 | benign fibrous tissue | ImagingFindings | | 11 | lipoma | Disease_Syndrome_Disorder | +----+------------------------------------------+---------------------------+ +----+-----------------------+---------------------------+-------------+ | | chunks | entities | assertion | |---:|:----------------------|:--------------------------|:------------| | 0 | ultrasound | ImagingTest | Confirmed | | 1 | ovoid mass | ImagingFindings | Confirmed | | 2 | mass | ImagingFindings | Confirmed | | 3 | isoechoic echotexture | ImagingFindings | Confirmed | | 4 | internal color flow | ImagingFindings | Negative | | 5 | benign fibrous tissue | ImagingFindings | Suspected | | 6 | lipoma | Disease_Syndrome_Disorder | Suspected | +----+-----------------------+---------------------------+-------------+ +---------+-----------------+-----------------------+---------------------------+------------+ |relation | entity1 | chunk1 | entity2 | chunk2 | |--------:|:----------------|:----------------------|:--------------------------|:-----------| | 1 | ImagingTest | ultrasound | ImagingFindings | ovoid mass | | 0 | ImagingFindings | benign fibrous tissue | Disease_Syndrome_Disorder | lipoma | +---------+-----------------+-----------------------+---------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_clinical_doc_radiology| |Type:|pipeline| 
|Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel - NerConverterInternalModel - AssertionDLModel - PerceptronModel - DependencyParserModel - RelationExtractionModel --- layout: model title: English asr_Fine_Tuned_XLSR_English TFWav2Vec2ForCTC from Sania67 author: John Snow Labs name: asr_Fine_Tuned_XLSR_English date: 2022-09-26 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Fine_Tuned_XLSR_English` is an English model originally trained by Sania67. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_Fine_Tuned_XLSR_English_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Fine_Tuned_XLSR_English_en_4.2.0_3.0_1664199621537.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Fine_Tuned_XLSR_English_en_4.2.0_3.0_1664199621537.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_Fine_Tuned_XLSR_English", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_Fine_Tuned_XLSR_English", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
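Both snippets above assume an `audioDf` DataFrame whose `audio_content` column holds raw float samples. The sketch below builds such floats from a mono 16-bit PCM WAV file using only the Python standard library; the 16 kHz sampling-rate note and the single-column layout are assumptions for illustration, not stated in this card:

```python
import array
import wave

def wav_to_floats(path):
    """Read a mono 16-bit PCM WAV file into a list of floats in [-1, 1],
    the kind of sample data placed in the `audio_content` column."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expects 16-bit PCM"
        pcm = array.array("h", wf.readframes(wf.getnframes()))
    return [s / 32768.0 for s in pcm]

# Sketch of assembling the DataFrame used above (requires a SparkSession):
# audioDf = spark.createDataFrame([(wav_to_floats("speech.wav"),)], ["audio_content"])
```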
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_Fine_Tuned_XLSR_English| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_6_h_256 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-6_H-256` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_256_zh_4.2.4_3.0_1670021717667.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_256_zh_4.2.4_3.0_1670021717667.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_256","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_256","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_6_h_256| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|39.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-6_H-256 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://github.com/dbiir/UER-py/ - https://cloud.tencent.com/ --- layout: model title: Sentence Entity Resolver for Snomed Aux Concepts, INT version (``sbiobert_base_cased_mli`` embeddings) author: John Snow Labs name: sbiobertresolve_snomed_auxConcepts_int language: en nav_key: models repository: clinical/models date: 2020-11-27 task: Entity Resolution edition: Healthcare NLP 2.6.4 spark_version: 2.4 tags: [clinical,entity_resolution,en] supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model maps extracted medical entities to Snomed codes (with Morph Abnormality, Procedure, Substance, Physical Object, Body Structure concepts from INT version) using chunk embeddings. {:.h2_title} ## Predicted Entities Snomed Codes and their normalized definition with ``sbiobert_base_cased_mli`` embeddings. 
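Under the hood, this kind of resolution is a nearest-neighbour search between the chunk's sentence embedding and pre-computed embeddings of each code's description. A toy plain-Python sketch of that step with Euclidean distance (the codes and 2-d vectors below are made up for illustration; the real model uses `sbiobert_base_cased_mli` vectors):

```python
import math

def resolve(chunk_vec, code_index):
    """Return the code whose reference embedding is closest
    to the chunk embedding under Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(code_index, key=lambda code: dist(chunk_vec, code_index[code]))

# Toy index: two made-up 2-d "embeddings" keyed by placeholder codes.
index = {"CODE_A": [1.0, 0.0], "CODE_B": [0.0, 1.0]}
print(resolve([0.9, 0.1], index))  # CODE_A
```

The pipelines below do the same thing at scale, with a resolver trained over the full SNOMED concept inventory.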
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_auxConcepts_int_en_2.6.4_2.4_1606235764318.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_auxConcepts_int_en_2.6.4_2.4_1606235764318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") snomed_aux_int_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_auxConcepts_int","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_aux_int_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala ...
val chunk2doc = new Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val snomed_aux_int_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_snomed_auxConcepts_int","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_aux_int_resolver)) val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.h2_title} ## Results ```bash +--------------------+-----+---+---------+---------------+----------+--------------------+--------------------+ | chunk|begin|end| entity| code|confidence| resolutions| codes| +--------------------+-----+---+---------+---------------+----------+--------------------+--------------------+ | hypertension| 68| 79| PROBLEM| 148439002| 0.2138|risk factors pres...|148439002:::42595...| |chronic renal ins...| 83|109| PROBLEM| 722403003| 0.8517|gastrointestinal ...|722403003:::13781...| | COPD| 113|116| PROBLEM|845101000000100| 0.0962|management of chr...|845101000000100::...| | 
gastritis| 120|128| PROBLEM| 711498001| 0.3398|magnetic resonanc...|711498001:::71771...| | TIA| 136|138| PROBLEM| 449758002| 0.1927|traumatic infarct...|449758002:::85844...| |a non-ST elevatio...| 182|202| PROBLEM| 1411000087101| 0.0823|ct of left knee::...|1411000087101:::3...| |Guaiac positive s...| 208|229| PROBLEM| 388507006| 0.0555|asparagus rast:::...|388507006:::71771...| |cardiac catheteri...| 295|317| TEST| 41976001| 0.9790|cardiac catheteri...|41976001:::705921...| | PTCA| 324|327|TREATMENT| 312644004| 0.0616|angioplasty of po...|312644004:::41507...| | mid LAD lesion| 332|345| PROBLEM| 91749005| 0.1399|structure of firs...|91749005:::917470...| +--------------------+-----+---+---------+---------------+----------+--------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---------------|---------------------| | Name: | sbiobertresolve_snomed_auxConcepts_int | | Type: | SentenceEntityResolverModel | | Compatibility: | Spark NLP 2.6.4 + | | License: | Licensed | | Edition: | Official | |Input labels: | [ner_chunk, chunk_embeddings] | |Output labels: | [resolution] | | Language: | en | | Dependencies: | sbiobert_base_cased_mli | {:.h2_title} ## Data Source Trained on SNOMED (INT version) Findings with ``sbiobert_base_cased_mli`` sentence embeddings. http://www.snomed.org/ --- layout: model title: Translate North Germanic languages to English Pipeline author: John Snow Labs name: translate_gmq_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, gmq, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. 
Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `gmq` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_gmq_en_xx_2.7.0_2.4_1609687798521.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_gmq_en_xx_2.7.0_2.4_1609687798521.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_gmq_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_gmq_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.gmq.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_gmq_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Chinese Bert Embeddings (Large, Roberta, Whole Word Masking) author: John Snow Labs name: bert_embeddings_chinese_roberta_wwm_ext_large date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `chinese-roberta-wwm-ext-large` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_wwm_ext_large_zh_3.4.2_3.0_1649668978927.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_wwm_ext_large_zh_3.4.2_3.0_1649668978927.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_wwm_ext_large","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_wwm_ext_large","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.chinese_roberta_wwm_ext_large").predict("""I love Spark NLP""") ```
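Downstream components typically compare the per-token vectors written to the `embeddings` column, most often by cosine similarity. A minimal pure-Python helper, shown with toy 2-dimensional vectors rather than real model output (this large model emits 1024-dimensional vectors):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(cosine([1.0, 0.0], [2.0, 0.0]))  # prints 1.0
print(cosine([1.0, 0.0], [0.0, 3.0]))  # prints 0.0
```

In practice you would read the vectors out of the `embeddings` annotation column of `result` instead of hard-coding them.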
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_wwm_ext_large| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|1.2 GB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/chinese-roberta-wwm-ext-large - https://arxiv.org/abs/1906.08101 - https://github.com/google-research/bert - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://arxiv.org/abs/2004.13922 --- layout: model title: Italian DistilBERT Embeddings author: John Snow Labs name: distilbert_embeddings_BERTino date: 2022-04-12 tags: [distilbert, embeddings, it, open_source] task: Embeddings language: it edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `BERTino` is an Italian model originally trained by `indigo-ai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_BERTino_it_3.4.2_3.0_1649783809783.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_BERTino_it_3.4.2_3.0_1649783809783.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_BERTino","it") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Adoro Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_BERTino","it") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Adoro Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("it.embed.BERTino").predict("""Adoro Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_BERTino| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|it| |Size:|253.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/indigo-ai/BERTino - https://indigo.ai/en/ - https://www.corpusitaliano.it/ - https://corpora.dipintra.it/public/run.cgi/corp_info?corpname=itwac_full - https://universaldependencies.org/treebanks/it_partut/index.html - https://universaldependencies.org/treebanks/it_isdt/index.html - https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500 --- layout: model title: Fast Neural Machine Translation Model from Dutch to English author: John Snow Labs name: opus_mt_nl_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, nl, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
- source languages: `nl` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_nl_en_xx_2.7.0_2.4_1609164556626.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_nl_en_xx_2.7.0_2.4_1609164556626.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_nl_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Your Dutch text here"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_nl_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your Dutch text here").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.nl.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_nl_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: bert_qa_rule_based_hier_triplet_epochs_1_shard_1_kldiv_squad2.0 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_hier_triplet_epochs_1_shard_1_kldiv_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_hier_triplet_epochs_1_shard_1_kldiv_squad2.0_en_4.0.0_3.0_1657191165597.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_hier_triplet_epochs_1_shard_1_kldiv_squad2.0_en_4.0.0_3.0_1657191165597.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_hier_triplet_epochs_1_shard_1_kldiv_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_hier_triplet_epochs_1_shard_1_kldiv_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
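Under the hood, extractive QA models like this score every token as a candidate answer start and answer end; the returned answer is the highest-scoring valid span of the context. A simplified pure-Python sketch with invented scores (the real annotator operates on the model's logits):

```python
def best_span(start_scores, end_scores, max_len=30):
    """Pick the (start, end) pair maximizing start+end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            if s_score + end_scores[e] > best_score:
                best_score = s_score + end_scores[e]
                best = (s, e)
    return best

# Invented per-token scores for the example context; the model strongly
# favors "Clara" as both the start and the end of the answer.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]
end_scores   = [0.0, 0.1, 0.0, 4.8, 0.2, 0.0, 0.1, 0.0, 0.5, 0.0]
s, e = best_span(start_scores, end_scores)
print(" ".join(tokens[s:e + 1]))  # prints Clara
```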
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_rule_based_hier_triplet_epochs_1_shard_1_kldiv_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/rule_based_hier_triplet_epochs_1_shard_1_kldiv_squad2.0 --- layout: model title: English DistilBertForQuestionAnswering Cased model (from aszidon) author: John Snow Labs name: distilbert_qa_custom5 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbertcustom5` is an English model originally trained by `aszidon`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom5_en_4.3.0_3.0_1672774713572.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom5_en_4.3.0_3.0_1672774713572.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom5","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom5","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_custom5| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/aszidon/distilbertcustom5 --- layout: model title: Multilingual BertForQuestionAnswering model (from vanichandna) author: John Snow Labs name: bert_qa_bert_base_multilingual_cased_finetuned_squadv1 date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased-finetuned-squadv1` is a Multilingual model originally trained by `vanichandna`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_finetuned_squadv1_xx_4.0.0_3.0_1654180211247.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_finetuned_squadv1_xx_4.0.0_3.0_1654180211247.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_multilingual_cased_finetuned_squadv1","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_multilingual_cased_finetuned_squadv1","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("xx.answer_question.squad.bert.multilingual_base_cased.by_vanichandna").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_multilingual_cased_finetuned_squadv1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/vanichandna/bert-base-multilingual-cased-finetuned-squadv1 --- layout: model title: Legal Joint Filing Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_joint_filing_agreement_bert date: 2022-11-24 tags: [en, legal, classification, agreement, joint_filing, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_joint_filing_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `joint-filing-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `joint-filing-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_joint_filing_agreement_bert_en_1.0.0_3.0_1669315294183.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_joint_filing_agreement_bert_en_1.0.0_3.0_1669315294183.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_joint_filing_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
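The precision/recall/F1 figures reported in the Benchmarking section follow the standard definitions, with F1 the harmonic mean of precision and recall. A quick sanity check against the reported `joint-filing-agreement` row (precision 1.00, recall 0.97):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(1.00, 0.97), 2))  # prints 0.98, matching the reported f1-score
print(round(f1(0.98, 1.00), 2))  # 'other' class: prints 0.99
```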
## Results

```bash
+------------------------+
|result                  |
+------------------------+
|[joint-filing-agreement]|
|[other]                 |
|[other]                 |
|[joint-filing-agreement]|
+------------------------+
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_joint_filing_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support joint-filing-agreement 1.00 0.97 0.98 31 other 0.98 1.00 0.99 65 accuracy - - 0.99 96 macro-avg 0.99 0.98 0.99 96 weighted-avg 0.99 0.99 0.99 96 ``` --- layout: model title: English BertForQuestionAnswering model (from peggyhuang) author: John Snow Labs name: bert_qa_finetune_bert_base_v1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `finetune-bert-base-v1` is an English model originally trained by `peggyhuang`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_finetune_bert_base_v1_en_4.0.0_3.0_1654187728270.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_finetune_bert_base_v1_en_4.0.0_3.0_1654187728270.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_finetune_bert_base_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_finetune_bert_base_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.base.by_peggyhuang").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_finetune_bert_base_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/peggyhuang/finetune-bert-base-v1 --- layout: model title: Recognize Entities DL Pipeline for Spanish - Small author: John Snow Labs name: entity_recognizer_sm date: 2021-03-22 tags: [open_source, spanish, entity_recognizer_sm, pipeline, es] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: es edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_sm is a pretrained pipeline that performs the most common text processing steps (tokenization, embeddings, and named entity recognition) on your dataframe. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_es_3.0.0_3.0_1616441492784.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_es_3.0.0_3.0_1616441492784.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('entity_recognizer_sm', lang = 'es') annotations = pipeline.fullAnnotate("Hola de John Snow Labs!")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("entity_recognizer_sm", lang = "es") val result = pipeline.fullAnnotate("Hola de John Snow Labs!")(0) ``` {:.nlu-block} ```python import nlu text = ["Hola de John Snow Labs!"] result_df = nlu.load('es.ner').predict(text) result_df ```
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:-----------------------------|:----------------------------|:----------------------------------------|:-----------------------------|:---------------------------------------|:-----------------------| | 0 | ['Hola de John Snow Labs! '] | ['Hola de John Snow Labs!'] | ['Hola', 'de', 'John', 'Snow', 'Labs!'] | [[0.1754499971866607,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'B-MISC'] | ['John Snow', 'Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_sm| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|es| --- layout: model title: Legal Non Competition Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_non_competition_agreement_bert date: 2023-01-26 tags: [en, legal, classification, non_competition, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_non_competition_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `non-competition-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. 
## Predicted Entities `non-competition-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_non_competition_agreement_bert_en_1.0.0_3.0_1674734515053.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_non_competition_agreement_bert_en_1.0.0_3.0_1674734515053.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_non_competition_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
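This card ships only the Python snippet. For parity with the language switcher, a minimal Scala sketch of the same three-stage pipeline is shown below; it assumes the licensed model resolves through the standard `ClassifierDLModel.pretrained(name, lang, remoteLoc)` loader with `"legal/models"` as the remote location, so verify the class names against your Legal NLP release.

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings
import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLModel
import org.apache.spark.ml.Pipeline
import spark.implicits._

// Assemble raw text into a document annotation
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// Sentence-level BERT embeddings feed the classifier
val embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

// Binary classifier: non-competition-agreement vs other
val docClassifier = ClassifierDLModel.pretrained("legclf_non_competition_agreement_bert", "en", "legal/models")
  .setInputCols("sentence_embeddings")
  .setOutputCol("category")

val pipeline = new Pipeline().setStages(Array(documentAssembler, embeddings, docClassifier))

val df = Seq("YOUR TEXT HERE").toDF("text")
val result = pipeline.fit(df).transform(df)
```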
## Results ```bash +-------+ |result| +-------+ |[non-competition-agreement]| |[other]| |[other]| |[non-competition-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_non_competition_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support non-competition-agreement 0.88 0.93 0.90 54 other 0.96 0.94 0.95 116 accuracy - - 0.94 170 macro-avg 0.92 0.93 0.93 170 weighted-avg 0.94 0.94 0.94 170 ``` --- layout: model title: BERT multilingual base model (uncased) author: John Snow Labs name: bert_base_multilingual_uncased date: 2021-05-20 tags: [xx, multilingual, embeddings, bert, open_source] task: Embeddings language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained model on the top 102 languages with the largest Wikipedia using a masked language modeling (MLM) objective. It was introduced in [this paper](https://arxiv.org/abs/1810.04805) and first released in [this repository](https://github.com/google-research/bert). This model is uncased: it does not make a difference between english and English. BERT is a transformers model pretrained on a large corpus of multilingual data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts. 
More precisely, it was pretrained with two objectives: - Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence. - Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict whether the two sentences followed each other or not. This way, the model learns an inner representation of the languages it was trained on that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard classifier using the features produced by the BERT model as inputs. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_multilingual_uncased_xx_3.1.0_2.4_1621519949446.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_multilingual_uncased_xx_3.1.0_2.4_1621519949446.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_base_multilingual_uncased", "xx") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_base_multilingual_uncased", "xx") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("xx.embed.bert_base_multilingual_uncased").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_multilingual_uncased| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[embeddings]| |Language:|xx| |Case sensitive:|true| ## Data Source [https://huggingface.co/bert-base-multilingual-uncased](https://huggingface.co/bert-base-multilingual-uncased) --- layout: model title: Legal Execution in counterparts Clause Binary Classifier author: John Snow Labs name: legclf_execution_in_counterparts_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `execution-in-counterparts` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `execution-in-counterparts` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_execution_in_counterparts_clause_en_1.0.0_3.2_1660122419235.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_execution_in_counterparts_clause_en_1.0.0_3.2_1660122419235.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_execution_in_counterparts_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
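Only the Python snippet is provided above. A minimal Scala equivalent, sketched under the assumption that the licensed model loads through the standard `ClassifierDLModel.pretrained(name, lang, remoteLoc)` signature with `"legal/models"` as the remote location, could look like this:

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings
import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLModel
import org.apache.spark.ml.Pipeline
import spark.implicits._

// Note the input column is "clause_text", matching the Python example
val documentAssembler = new DocumentAssembler()
  .setInputCol("clause_text")
  .setOutputCol("document")

val embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

// Binary classifier: execution-in-counterparts vs other
val docClassifier = ClassifierDLModel.pretrained("legclf_execution_in_counterparts_clause", "en", "legal/models")
  .setInputCols("sentence_embeddings")
  .setOutputCol("category")

val pipeline = new Pipeline().setStages(Array(documentAssembler, embeddings, docClassifier))

val df = Seq("YOUR TEXT HERE").toDF("clause_text")
val result = pipeline.fit(df).transform(df)
```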
## Results ```bash +-------+ | result| +-------+ |[execution-in-counterparts]| |[other]| |[other]| |[execution-in-counterparts]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_execution_in_counterparts_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support execution-in-counterparts 1.00 0.95 0.98 42 other 0.98 1.00 0.99 128 accuracy - - 0.99 170 macro-avg 0.99 0.98 0.98 170 weighted-avg 0.99 0.99 0.99 170 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from shahma) author: John Snow Labs name: distilbert_qa_shahma_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `shahma`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_shahma_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772506673.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_shahma_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772506673.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_shahma_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_shahma_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_shahma_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/shahma/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal Applicable Law Clause Binary Classifier author: John Snow Labs name: legclf_applic_law_clause date: 2023-02-13 tags: [en, legal, classification, applicable, law, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `applic_law` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `applic_law`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_applic_law_clause_en_1.0.0_3.0_1676302179504.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_applic_law_clause_en_1.0.0_3.0_1676302179504.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_applic_law_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
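For readers using the Scala tab, here is a minimal sketch mirroring the Python stages above; it assumes the licensed model is reachable via the standard `ClassifierDLModel.pretrained(name, lang, remoteLoc)` loader with `"legal/models"` as the remote location.

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings
import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLModel
import org.apache.spark.ml.Pipeline
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

// Binary classifier: applic_law vs other
val docClassifier = ClassifierDLModel.pretrained("legclf_applic_law_clause", "en", "legal/models")
  .setInputCols("sentence_embeddings")
  .setOutputCol("category")

val pipeline = new Pipeline().setStages(Array(documentAssembler, embeddings, docClassifier))

val df = Seq("YOUR TEXT HERE").toDF("text")
val result = pipeline.fit(df).transform(df)
```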
## Results ```bash +-------+ |result| +-------+ |[applic_law]| |[other]| |[other]| |[applic_law]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_applic_law_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support applic_law 1.00 0.95 0.97 20 other 0.93 1.00 0.96 13 accuracy - - 0.97 33 macro-avg 0.96 0.97 0.97 33 weighted-avg 0.97 0.97 0.97 33 ``` --- layout: model title: Legal Liens Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_liens_bert date: 2023-03-05 tags: [en, legal, classification, clauses, liens, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from the US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Liens` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. 
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Liens`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_liens_bert_en_1.0.0_3.0_1678050683070.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_liens_bert_en_1.0.0_3.0_1678050683070.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_liens_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
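A Scala counterpart to the Python pipeline above, sketched under the assumption that the licensed model resolves through the standard `ClassifierDLModel.pretrained(name, lang, remoteLoc)` loader with `"legal/models"` as the remote location:

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings
import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLModel
import org.apache.spark.ml.Pipeline
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

// Binary classifier: Liens vs Other
val docClassifier = ClassifierDLModel.pretrained("legclf_liens_bert", "en", "legal/models")
  .setInputCols("sentence_embeddings")
  .setOutputCol("category")

val pipeline = new Pipeline().setStages(Array(documentAssembler, embeddings, docClassifier))

val df = Seq("YOUR TEXT HERE").toDF("text")
val result = pipeline.fit(df).transform(df)
```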
## Results ```bash +-------+ |result| +-------+ |[Liens]| |[Other]| |[Other]| |[Liens]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_liens_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Liens 0.90 0.84 0.87 31 Other 0.91 0.94 0.93 53 accuracy - - 0.90 84 macro-avg 0.90 0.89 0.90 84 weighted-avg 0.90 0.90 0.90 84 ``` --- layout: model title: German asr_wav2vec2_large_xlsr_german_demo TFWav2Vec2ForCTC from marcel author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_german_demo date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_german_demo` is a German model originally trained by marcel. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_german_demo_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_german_demo_de_4.2.0_3.0_1664103879889.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_german_demo_de_4.2.0_3.0_1664103879889.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_german_demo', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_german_demo", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_german_demo| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Fast Neural Machine Translation Model from English to Tigrinya author: John Snow Labs name: opus_mt_en_ti date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, ti, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `ti` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ti_xx_2.7.0_2.4_1609170948430.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ti_xx_2.7.0_2.4_1609170948430.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_ti", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_ti", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.ti').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_ti| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal No Material Adverse Effect Clause Binary Classifier author: John Snow Labs name: legclf_no_material_adverse_effect_clause date: 2023-01-29 tags: [en, legal, classification, material, adverse, effect, clauses, no_material_adverse_effect, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `no-material-adverse-effect` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `no-material-adverse-effect`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_no_material_adverse_effect_clause_en_1.0.0_3.0_1674993807450.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_no_material_adverse_effect_clause_en_1.0.0_3.0_1674993807450.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_no_material_adverse_effect_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
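As with the other clause classifiers on this page, a minimal Scala sketch of the same pipeline follows; it assumes the licensed model loads through the standard `ClassifierDLModel.pretrained(name, lang, remoteLoc)` signature with `"legal/models"` as the remote location.

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings
import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLModel
import org.apache.spark.ml.Pipeline
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

// Binary classifier: no-material-adverse-effect vs other
val docClassifier = ClassifierDLModel.pretrained("legclf_no_material_adverse_effect_clause", "en", "legal/models")
  .setInputCols("sentence_embeddings")
  .setOutputCol("category")

val pipeline = new Pipeline().setStages(Array(documentAssembler, embeddings, docClassifier))

val df = Seq("YOUR TEXT HERE").toDF("text")
val result = pipeline.fit(df).transform(df)
```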
## Results ```bash +-------+ |result| +-------+ |[no-material-adverse-effect]| |[other]| |[other]| |[no-material-adverse-effect]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_no_material_adverse_effect_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support no-material-adverse-effect 1.00 1.00 1.00 60 other 1.00 1.00 1.00 105 accuracy - - 1.00 165 macro-avg 1.00 1.00 1.00 165 weighted-avg 1.00 1.00 1.00 165 ``` --- layout: model title: Detect Living Species (roberta_embeddings_BR_BERTo) author: John Snow Labs name: ner_living_species_roberta date: 2022-06-22 tags: [pt, ner, clinical, licensed, roberta] task: Named Entity Recognition language: pt edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract living species from clinical texts in Portuguese, which is critical to scientific disciplines like medicine, biology, ecology/biodiversity, nutrition and agriculture. This model is trained using `roberta_embeddings_BR_BERTo` embeddings. It is trained on the [LivingNER](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) corpus that is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. **NOTE:** 1. The text files were translated from Spanish with a neural machine translation system. 2. The annotations were translated with the same neural machine translation system. 3. The translated annotations were transferred to the translated text files using an annotation transfer technology. 
## Predicted Entities `HUMAN`, `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_roberta_pt_3.5.3_3.0_1655923058986.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_roberta_pt_3.5.3_3.0_1655923058986.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_BR_BERTo","pt")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_living_species_roberta", "pt", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""Mulher de 23 anos, de Capinota, Cochabamba, Bolívia. Ela está no nosso país há quatro anos. Frequentou o departamento de emergência obstétrica onde foi encontrada grávida de 37 semanas, com um colo dilatado de 5 cm e membranas rompidas. 
O obstetra de emergência realizou um teste de estreptococos negativo e solicitou um hemograma, glucose, bioquímica básica, HBV, HCV e serologia da sífilis."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_BR_BERTo","pt") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_living_species_roberta", "pt", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""Mulher de 23 anos, de Capinota, Cochabamba, Bolívia. Ela está no nosso país há quatro anos. Frequentou o departamento de emergência obstétrica onde foi encontrada grávida de 37 semanas, com um colo dilatado de 5 cm e membranas rompidas. O obstetra de emergência realizou um teste de estreptococos negativo e solicitou um hemograma, glucose, bioquímica básica, HBV, HCV e serologia da sífilis.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.med_ner.living_species.roberta").predict("""Mulher de 23 anos, de Capinota, Cochabamba, Bolívia. Ela está no nosso país há quatro anos. Frequentou o departamento de emergência obstétrica onde foi encontrada grávida de 37 semanas, com um colo dilatado de 5 cm e membranas rompidas. 
O obstetra de emergência realizou um teste de estreptococos negativo e solicitou um hemograma, glucose, bioquímica básica, HBV, HCV e serologia da sífilis.""") ```
## Results ```bash +-------------+-------+ |ner_chunk |label | +-------------+-------+ |Mulher |HUMAN | |grávida |HUMAN | |estreptococos|SPECIES| |HBV |SPECIES| |HCV |SPECIES| |sífilis |SPECIES| +-------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_roberta| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|pt| |Size:|16.4 MB| ## References [https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) ## Benchmarking ```bash label precision recall f1-score support B-HUMAN 0.86 0.91 0.88 2827 B-SPECIES 0.52 0.86 0.65 2796 I-HUMAN 0.79 0.43 0.55 180 I-SPECIES 0.62 0.81 0.70 1099 micro-avg 0.65 0.86 0.74 6902 macro-avg 0.69 0.75 0.70 6902 weighted-avg 0.68 0.86 0.75 6902 ``` --- layout: model title: Fast Neural Machine Translation Model from Azerbaijani to English author: John Snow Labs name: opus_mt_az_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, az, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `az` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_az_en_xx_2.7.0_2.4_1609163603785.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_az_en_xx_2.7.0_2.4_1609163603785.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_az_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_az_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.az.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_az_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Japanese Bert Embeddings (Base, Character Tokenization) author: John Snow Labs name: bert_embeddings_bert_base_japanese_char date: 2022-04-11 tags: [bert, embeddings, ja, open_source] task: Embeddings language: ja edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-japanese-char` is a Japanese model originally trained by `cl-tohoku`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_japanese_char_ja_3.4.2_3.0_1649674188981.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_japanese_char_ja_3.4.2_3.0_1649674188981.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_japanese_char","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["私はSpark NLPを愛しています"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_japanese_char","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("私はSpark NLPを愛しています").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ja.embed.bert_base_japanese_char").predict("""私はSpark NLPを愛しています""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_japanese_char| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Size:|334.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/cl-tohoku/bert-base-japanese-char - https://github.com/google-research/bert - https://github.com/cl-tohoku/bert-japanese/tree/v1.0 - https://github.com/attardi/wikiextractor - https://taku910.github.io/mecab/ - https://creativecommons.org/licenses/by-sa/3.0/ - https://www.tensorflow.org/tfrc/ --- layout: model title: Legal International Affairs Document Classifier (EURLEX) author: John Snow Labs name: legclf_international_affairs_bert date: 2023-03-06 tags: [en, legal, classification, clauses, international_affairs, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_international_affairs_bert model, a Bert Sentence Embeddings Document Classifier, classifies whether the document belongs to the class International_Affairs or not (Binary Classification) according to EuroVoc labels.
## Predicted Entities `International_Affairs`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_international_affairs_bert_en_1.0.0_3.0_1678111736686.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_international_affairs_bert_en_1.0.0_3.0_1678111736686.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_international_affairs_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[International_Affairs]| |[Other]| |[Other]| |[International_Affairs]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_international_affairs_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.7 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support International_Affairs 0.85 0.95 0.90 43 Other 0.96 0.86 0.91 51 accuracy - - 0.90 94 macro-avg 0.91 0.91 0.90 94 weighted-avg 0.91 0.90 0.90 94 ``` --- layout: model title: Spanish RobertaForQuestionAnswering Cased model (from stevemobs) author: John Snow Labs name: roberta_qa_quales_iberlef_squad_2 date: 2023-01-20 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `quales-iberlef-squad_2` is a Spanish model originally trained by `stevemobs`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_quales_iberlef_squad_2_es_4.3.0_3.0_1674212017084.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_quales_iberlef_squad_2_es_4.3.0_3.0_1674212017084.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_quales_iberlef_squad_2","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_quales_iberlef_squad_2","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_quales_iberlef_squad_2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|es| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/stevemobs/quales-iberlef-squad_2 --- layout: model title: Sentence Entity Resolver for CPT codes (Augmented) author: John Snow Labs name: sbiobertresolve_cpt_procedures_augmented date: 2021-06-15 tags: [cpt, lincensed, en, clinical, licensed] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to CPT codes using Sentence Bert Embeddings. ## Predicted Entities CPT codes and their descriptions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cpt_procedures_augmented_en_3.1.0_3.0_1623789734339.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cpt_procedures_augmented_en_3.1.0_3.0_1623789734339.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") jsl_sbert_embedder = BertSentenceEmbeddings\ .pretrained('sbiobert_base_cased_mli','en','clinical/models')\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") cpt_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_cpt_procedures_augmented", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("cpt_code") cpt_pipelineModel = PipelineModel( stages = [ documentAssembler, jsl_sbert_embedder, cpt_resolver]) cpt_lp = LightPipeline(cpt_pipelineModel) result = cpt_lp.fullAnnotate("heart surgery") ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sbert_embeddings") val cpt_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_cpt_procedures_augmented", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("cpt_code") val cpt_pipeline = new Pipeline().setStages(Array(document_assembler, sbert_embedder, cpt_resolver)) val cpt_pipelineModel = cpt_pipeline.fit(Seq("").toDF("text")) val cpt_lp = new LightPipeline(cpt_pipelineModel) val result = cpt_lp.fullAnnotate("heart surgery") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.cpt.procedures_augmented").predict("""heart surgery""") ```
## Results ```bash | | chunks | code | resolutions | all_codes | all_distances | |---:|:--------------|:----- |:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------|:--------------------------------------| | 0 | heart surgery | 33258 | [Cardiac surgery procedure [Operative tissue ablation and reconstruction of atria, performed at the time of other cardiac procedure(s), extensive (eg, maze procedure), without cardiopulmonary bypass (List separately in addition to code for primary procedure)], Cardiac surgery procedure [Unlisted procedure, cardiac surgery], Heart procedure [Interrogation device evaluation (in person) of intracardiac ischemia monitoring system with analysis, review, and report], Heart procedure [Insertion or removal and replacement of intracardiac ischemia monitoring system including imaging supervision and interpretation when performed and intra-operative interrogation and programming when performed; device only], ...]| [33258, 33999, 0306T, 0304T, ...] | [0.1031, 0.1031, 0.1377, 0.1377, ...] 
| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_cpt_procedures_augmented| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[cpt_code]| |Language:|en| |Case sensitive:|true| ## Data Source Trained on Current Procedural Terminology dataset with `sbiobert_base_cased_mli` sentence embeddings. --- layout: model title: Translate Brazilian Sign Language to English Pipeline author: John Snow Labs name: translate_bzs_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, bzs, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. - source languages: `bzs` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_bzs_en_xx_2.7.0_2.4_1609688839270.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_bzs_en_xx_2.7.0_2.4_1609688839270.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_bzs_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_bzs_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.bzs.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_bzs_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from Afrikaans to German author: John Snow Labs name: opus_mt_af_de date: 2021-06-01 tags: [open_source, seq2seq, translation, af, de, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). source languages: af target languages: de {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_af_de_xx_3.1.0_2.4_1622561187540.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_af_de_xx_3.1.0_2.4_1622561187540.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_af_de", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_af_de", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Afrikaans.translate_to.German').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_af_de| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Smaller BERT Sentence Embeddings (L-8_H-512_A-8) author: John Snow Labs name: sent_small_bert_L8_512 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L8_512_en_2.6.0_2.4_1598350686215.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L8_512_en_2.6.0_2.4_1598350686215.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L8_512", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer'], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L8_512", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L8_512').predict(text, output_level='sentence') embeddings_df ```
{:.h2_title} ## Results ```bash en_embed_sentence_small_bert_L8_512_embeddings sentence [0.07683686912059784, -0.09125291556119919, 1.... I hate cancer [0.05132533982396126, 0.16612868010997772, -0.... Antibiotics aren't painkiller ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L8_512| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|512| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1 --- layout: model title: Legal Control Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_control_agreement date: 2022-11-10 tags: [legal, licensed, classification, control, agreement, en] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_control_agreement` model is a Legal Longformer Document Classifier that classifies whether a document belongs to the class `control-agreement` or not (Binary Classification). Longformers are limited to 4096 tokens, so only the first 4096 tokens of each document are taken into account. We have found that for the large majority of documents in legal corpora, provided they are clean and contain only the legal document without any extra preamble, 4096 tokens are enough for Document Classification. If not, let us know and we can take another approach for you: splitting the document into 4096-token chunks, averaging their embeddings, and training on the averaged version, which means the whole document is taken into account. In practice, however, this should not be required.
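The chunk-and-average fallback mentioned in the description can be sketched roughly as follows. This is a minimal illustration only: `embed_window` is a hypothetical stand-in for the Longformer encoder (here it returns a deterministic 8-dimensional placeholder vector), and the window size mirrors the 4096-token limit.

```python
import numpy as np

MAX_TOKENS = 4096  # Longformer input limit mentioned above

def embed_window(tokens):
    """Stand-in encoder: one fixed-size vector per window.
    A real pipeline would run the Longformer model here."""
    rng = np.random.default_rng(len(tokens))  # deterministic placeholder
    return rng.standard_normal(8)

def document_embedding(tokens, max_tokens=MAX_TOKENS):
    """Split the token list into consecutive windows of at most
    `max_tokens`, embed each window, and average the window vectors
    so the whole document contributes to the final representation."""
    windows = [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]
    return np.mean([embed_window(w) for w in windows], axis=0)

doc = ["tok"] * 10000          # longer than a single 4096-token window
vec = document_embedding(doc)  # three windows averaged into one vector
```

A classifier would then be trained on these averaged vectors instead of the truncated single-window embeddings.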
## Predicted Entities `control-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_control_agreement_en_1.0.0_3.0_1668109264398.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_control_agreement_en_1.0.0_3.0_1668109264398.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token") \ .setOutputCol("embeddings") sembeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_control_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, sembeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[control-agreement]| |[other]| |[other]| |[control-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_control_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support control-agreement 0.93 1.00 0.97 28 other 1.00 0.97 0.98 66 accuracy - - 0.98 94 macro-avg 0.97 0.98 0.98 94 weighted-avg 0.98 0.98 0.98 94 ``` --- layout: model title: Detect Assertion Status from Demographic Entities author: John Snow Labs name: assertion_oncology_demographic_binary_wip date: 2022-10-01 tags: [licensed, clinical, oncology, en, assertion] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects whether a demographic entity refers to the patient or to someone else.
## Predicted Entities `Patient`, `Someone_Else` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_demographic_binary_wip_en_4.1.0_3.0_1664642285987.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_demographic_binary_wip_en_4.1.0_3.0_1664642285987.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(["Age", "Gender"]) assertion = AssertionDLModel.pretrained("assertion_oncology_demographic_binary_wip", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion]) data = spark.createDataFrame([["One sister was diagnosed with breast cancer at the age of 40."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") 
.setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Age", "Gender")) val clinical_assertion = AssertionDLModel.pretrained("assertion_oncology_demographic_binary_wip","en","clinical/models") .setInputCols(Array("sentence","ner_chunk","embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, clinical_assertion)) val data = Seq("One sister was diagnosed with breast cancer at the age of 40.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.oncology_demographic_binary_wip").predict("""One sister was diagnosed with breast cancer at the age of 40.""") ```
## Results ```bash | chunk | ner_label | assertion | |:----------|:------------|:-------------| | sister | Gender | Someone_Else | | age of 40 | Age | Someone_Else | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_oncology_demographic_binary_wip| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion_pred]| |Language:|en| |Size:|1.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label precision recall f1-score support Patient 0.94 0.94 0.94 32.0 Someone_Else 0.92 0.92 0.92 24.0 macro-avg 0.93 0.93 0.93 56.0 weighted-avg 0.93 0.93 0.93 56.0 ``` --- layout: model title: Detect 10 Different Entities in Hebrew (hebrewner_cc_300d) author: John Snow Labs name: hebrewner_cc_300d date: 2022-07-26 tags: [ner, open_source, he] task: Named Entity Recognition language: he edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses Hebrew word embeddings to find 10 different types of entities in Hebrew text. It is trained using `hebrew_cc_300d` word embeddings; please use the same embeddings in the pipeline. ## Predicted Entities `PERS`, `LOC`, `ORG`, `DATE`, `TIME`, `MONEY`, `PERCENT`, `MISC_AFF`, `MISC_ENT`, `MISC_EVENT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/hebrewner_cc_300d_he_4.0.0_3.0_1658872858968.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/hebrewner_cc_300d_he_4.0.0_3.0_1658872858968.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("hebrew_cc_300d", "he") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner = NerDLModel.pretrained("hebrewner_cc_300d", "he") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate("""ב- 25 לאוגוסט עצר השב"כ את מוחמד אבו-ג'וייד , אזרח ירדני , שגויס לארגון הפת"ח והופעל על ידי חיזבאללה. אבו-ג'וייד התכוון להקים חוליות טרור בגדה ובקרב ערביי ישראל , לבצע פיגוע ברכבת ישראל בנהריה , לפגוע במטרות ישראליות בירדן ולחטוף חיילים כדי לשחרר אסירים ביטחוניים.""") ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("hebrew_cc_300d", "he") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("hebrewner_cc_300d", "he") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""ב- 25 לאוגוסט עצר השב"כ את מוחמד אבו-ג"וייד , אזרח ירדני , שגויס לארגון הפת"ח והופעל על ידי חיזבאללה. אבו-ג"וייד התכוון להקים חוליות טרור בגדה ובקרב ערביי ישראל , לבצע פיגוע ברכבת ישראל בנהריה , לפגוע במטרות ישראליות בירדן ולחטוף חיילים כדי לשחרר אסירים ביטחוניים.""").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("he.ner").predict("""ב- 25 לאוגוסט עצר השב"כ את מוחמד אבו-ג'וייד , אזרח ירדני , שגויס לארגון הפת"ח והופעל על ידי חיזבאללה. 
אבו-ג'וייד התכוון להקים חוליות טרור בגדה ובקרב ערביי ישראל , לבצע פיגוע ברכבת ישראל בנהריה , לפגוע במטרות ישראליות בירדן ולחטוף חיילים כדי לשחרר אסירים ביטחוניים.""") ```
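The per-label scores in the benchmark below are derived from the reported tp/fp/fn counts in the usual way. For example, the `B-DATE` row (tp=50, fp=4, fn=7) reproduces as:

```python
tp, fp, fn = 50, 4, 7  # the B-DATE row of the benchmark

precision = tp / (tp + fp)                          # 50/54 ≈ 0.925926
recall = tp / (tp + fn)                             # 50/57 ≈ 0.877193
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.900901

print(round(precision, 6), round(recall, 6), round(f1, 6))
```

The micro-average row applies the same formulas to the summed counts, while the macro-average is the unweighted mean of the per-label scores.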
## Results ```bash | | ner_chunk | entity_label | |---:|------------------:|---------------:| | 0 | 25 לאוגוסט | DATE | | 1 | השב"כ | ORG | | 2 | מוחמד אבו-ג'וייד | PERS | | 3 | ירדני | MISC_AFF | | 4 | הפת"ח | ORG | | 5 | חיזבאללה | ORG | | 6 | אבו-ג'וייד | PERS | | 7 | בגדה | LOC | | 8 | ערביי | MISC_AFF | | 9 | ישראל | LOC | | 10 | ברכבת ישראל | ORG | | 11 | בנהריה | LOC | | 12 | ישראליות | MISC_AFF | | 13 | בירדן | LOC | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|hebrewner_cc_300d| |Type:|ner| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token, word_embeddings]| |Output Labels:|[ner]| |Language:|he| |Size:|14.9 MB| |Dependencies:|hebrew_cc_300d| ## References This model is trained on a dataset obtained from https://www.cs.bgu.ac.il/~elhadad/nlpproj/naama/ ## Benchmarking ```bash label tp fp fn prec rec f1 I-TIME 5 2 0 0.714286 1 0.833333 I-MISC_AFF 2 0 3 1 0.4 0.571429 B-MISC_EVENT 7 0 1 1 0.875 0.933333 B-LOC 180 24 37 0.882353 0.829493 0.855107 I-ORG 124 47 38 0.725146 0.765432 0.744745 B-DATE 50 4 7 0.925926 0.877193 0.900901 I-PERS 157 10 15 0.94012 0.912791 0.926254 I-DATE 39 7 8 0.847826 0.829787 0.83871 B-MISC_AFF 132 11 9 0.923077 0.93617 0.929577 I-MISC_EVENT 6 0 2 1 0.75 0.857143 B-TIME 4 0 1 1 0.8 0.888889 I-PERCENT 8 0 0 1 1 1 I-MISC_ENT 11 3 10 0.785714 0.52381 0.628571 B-MISC_ENT 8 1 5 0.888889 0.615385 0.727273 I-LOC 79 18 23 0.814433 0.77451 0.79397 B-PERS 231 22 26 0.913044 0.898833 0.905882 B-MONEY 36 2 2 0.947368 0.947368 0.947368 B-PERCENT 28 3 0 0.903226 1 0.949152 B-ORG 166 41 37 0.801932 0.817734 0.809756 I-MONEY 61 1 1 0.983871 0.983871 0.983871 Macro-average 1334 196 225 0.899861 0.826869 0.861822 Micro-average 1334 196 225 0.871895 0.855677 0.86371 ``` --- layout: model title: Pipeline to Extract Biomarker Information author: John Snow Labs name: ner_biomarker_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named 
Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_biomarker](https://nlp.johnsnowlabs.com/2021/11/26/ner_biomarker_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_biomarker_pipeline_en_3.4.1_3.0_1647871954538.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_biomarker_pipeline_en_3.4.1_3.0_1647871954538.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_biomarker_pipeline", "en", "clinical/models") pipeline.annotate("Here , we report the first case of an intraductal tubulopapillary neoplasm of the pancreas with clear cell morphology . Immunohistochemistry revealed positivity for Pan-CK , CK7 , CK8/18 , MUC1 , MUC6 , carbonic anhydrase IX , CD10 , EMA , β-catenin and e-cadherin ") ``` ```scala val pipeline = new PretrainedPipeline("ner_biomarker_pipeline", "en", "clinical/models") pipeline.annotate("Here , we report the first case of an intraductal tubulopapillary neoplasm of the pancreas with clear cell morphology . Immunohistochemistry revealed positivity for Pan-CK , CK7 , CK8/18 , MUC1 , MUC6 , carbonic anhydrase IX , CD10 , EMA , β-catenin and e-cadherin ") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.biomarker.pipeline").predict("""Here , we report the first case of an intraductal tubulopapillary neoplasm of the pancreas with clear cell morphology . Immunohistochemistry revealed positivity for Pan-CK , CK7 , CK8/18 , MUC1 , MUC6 , carbonic anhydrase IX , CD10 , EMA , β-catenin and e-cadherin """) ```
## Results ```bash | | ner_chunk | entity | confidence | |---:|:-------------------------|:----------------------|-------------:| | 0 | intraductal | CancerModifier | 0.9934 | | 1 | tubulopapillary | CancerModifier | 0.6403 | | 2 | neoplasm of the pancreas | CancerDx | 0.758825 | | 3 | clear cell | CancerModifier | 0.9633 | | 4 | Immunohistochemistry | Test | 0.9534 | | 5 | positivity | Biomarker_Measurement | 0.8795 | | 6 | Pan-CK | Biomarker | 0.9975 | | 7 | CK7 | Biomarker | 0.9975 | | 8 | CK8/18 | Biomarker | 0.9987 | | 9 | MUC1 | Biomarker | 0.9967 | | 10 | MUC6 | Biomarker | 0.9972 | | 11 | carbonic anhydrase IX | Biomarker | 0.937567 | | 12 | CD10 | Biomarker | 0.9974 | | 13 | EMA | Biomarker | 0.9899 | | 14 | β-catenin | Biomarker | 0.8059 | | 15 | e-cadherin | Biomarker | 0.9806 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_biomarker_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English image_classifier_vit_base_beans ViTForImageClassification from karthiksv author: John Snow Labs name: image_classifier_vit_base_beans date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_beans` is an English model originally trained by karthiksv.
## Predicted Entities `angular_leaf_spot`, `bean_rust`, `healthy` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_beans_en_4.1.0_3.0_1660169273658.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_beans_en_4.1.0_3.0_1660169273658.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_beans", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_beans", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_beans| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Finnish asr_wav2vec2_large_xlsr_53_finnish_by_Tommi TFWav2Vec2ForCTC from Tommi author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_Tommi date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_finnish_by_Tommi` is a Finnish model originally trained by Tommi. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_Tommi_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_Tommi_fi_4.2.0_3.0_1664021001412.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_Tommi_fi_4.2.0_3.0_1664021001412.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_Tommi', lang = 'fi') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_Tommi", lang = "fi") val annotations = pipeline.transform(audioDF) ```
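ASR pipelines like this one expect `audioDF` to hold mono floating-point samples at 16 kHz, the rate Wav2Vec2 models were trained on. If your source audio uses another sample rate it must be resampled first; below is a minimal linear-interpolation sketch in NumPy (for production use a proper resampler such as `librosa.resample`, which handles anti-aliasing):

```python
import numpy as np

def resample_linear(samples: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Naive linear-interpolation resampling to target_sr (Wav2Vec2 expects 16 kHz)."""
    duration = len(samples) / orig_sr
    n_target = int(round(duration * target_sr))
    old_t = np.linspace(0.0, duration, num=len(samples), endpoint=False)
    new_t = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(new_t, old_t, samples).astype(np.float32)

# One second of a 440 Hz tone at 44.1 kHz, resampled to 16 kHz.
tone = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)
audio_16k = resample_linear(tone, 44100)
print(len(audio_16k))  # 16000
```

The resulting float array (as a Python list) is what goes into the audio content column of `audioDF`.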
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_Tommi| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Word2Vec Embeddings in Urdu (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, ur, open_source] task: Embeddings language: ur edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ur_3.4.1_3.0_1647465404657.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ur_3.4.1_3.0_1647465404657.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ur") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["مجھے سپارک این ایل پی سے محبت ہے"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ur") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("مجھے سپارک این ایل پی سے محبت ہے").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ur.embed.w2v_cc_300d").predict("""مجھے سپارک این ایل پی سے محبت ہے""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|ur| |Size:|672.4 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Japanese Bert Embeddings (Base) author: John Snow Labs name: bert_embeddings_bert_base_japanese_char_extended date: 2022-04-11 tags: [bert, embeddings, ja, open_source] task: Embeddings language: ja edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-japanese-char-extended` is a Japanese model originally trained by `KoichiYasuoka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_japanese_char_extended_ja_3.4.2_3.0_1649674670666.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_japanese_char_extended_ja_3.4.2_3.0_1649674670666.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_japanese_char_extended","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["私はSpark NLPを愛しています"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_japanese_char_extended","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("私はSpark NLPを愛しています").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ja.embed.bert_base_japanese_char_extended").predict("""私はSpark NLPを愛しています""") ```
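The `char` in this model's name indicates character-level tokenization: Japanese text is split into individual characters rather than words before the embedding lookup. A rough illustration of character splitting (the model's actual WordPiece vocabulary may group some symbols differently, especially Latin script):

```python
text = "私はSpark NLPを愛しています"
chars = list(text)  # one entry per Unicode character
print(chars[:4])  # ['私', 'は', 'S', 'p']
```

Character-level vocabularies avoid out-of-vocabulary issues for languages like Japanese that are written without word boundaries.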
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_japanese_char_extended| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Size:|341.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/KoichiYasuoka/bert-base-japanese-char-extended --- layout: model title: English T5ForConditionalGeneration Small Cased model (from mrm8488) author: John Snow Labs name: t5_small_finetuned_imdb_sentiment date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-finetuned-imdb-sentiment` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_finetuned_imdb_sentiment_en_4.3.0_3.0_1675126033560.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_finetuned_imdb_sentiment_en_4.3.0_3.0_1675126033560.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_finetuned_imdb_sentiment","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_finetuned_imdb_sentiment","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_finetuned_imdb_sentiment| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|267.7 MB| ## References - https://huggingface.co/mrm8488/t5-small-finetuned-imdb-sentiment - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/pdf/1910.10683.pdf - https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67 - https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb - https://github.com/patil-suraj - https://twitter.com/mrm8488 - https://www.linkedin.com/in/manuel-romero-cs/ --- layout: model title: English asr_asr_with_transformers_wav2vec2 TFWav2Vec2ForCTC from osanseviero author: John Snow Labs name: asr_asr_with_transformers_wav2vec2 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_asr_with_transformers_wav2vec2` is an English model originally trained by osanseviero.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_asr_with_transformers_wav2vec2_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_asr_with_transformers_wav2vec2_en_4.2.0_3.0_1664043824301.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_asr_with_transformers_wav2vec2_en_4.2.0_3.0_1664043824301.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_asr_with_transformers_wav2vec2", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_asr_with_transformers_wav2vec2", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_asr_with_transformers_wav2vec2| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|227.6 MB| --- layout: model title: Lithuanian asr_common_voice_lithuanian_fairseq TFWav2Vec2ForCTC from birgermoell author: John Snow Labs name: asr_common_voice_lithuanian_fairseq date: 2022-09-26 tags: [wav2vec2, lt, audio, open_source, asr] task: Automatic Speech Recognition language: lt edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_common_voice_lithuanian_fairseq` is a Lithuanian model originally trained by birgermoell. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_common_voice_lithuanian_fairseq_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_common_voice_lithuanian_fairseq_lt_4.2.0_3.0_1664202849902.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_common_voice_lithuanian_fairseq_lt_4.2.0_3.0_1664202849902.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_common_voice_lithuanian_fairseq", "lt")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_common_voice_lithuanian_fairseq", "lt") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_common_voice_lithuanian_fairseq| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|lt| |Size:|228.0 MB| --- layout: model title: English RobertaForQuestionAnswering (from ydshieh) author: John Snow Labs name: roberta_qa_ydshieh_roberta_base_squad2 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2` is an English model originally trained by `ydshieh`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ydshieh_roberta_base_squad2_en_4.0.0_3.0_1655735038173.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ydshieh_roberta_base_squad2_en_4.0.0_3.0_1655735038173.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_ydshieh_roberta_base_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_ydshieh_roberta_base_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base.by_ydshieh").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
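Extractive QA models of this kind score every context token as a potential answer start and answer end, and the predicted answer is the best-scoring span. A toy illustration of the decoding step with invented logits (not the Spark NLP internals):

```python
import numpy as np

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Hypothetical start/end logits for the question "What's my name?"
start_logits = np.array([0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1])
end_logits   = np.array([0.1, 0.1, 0.1, 4.8, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1])

start = int(np.argmax(start_logits))
end = int(np.argmax(end_logits[start:])) + start  # the end must not precede the start
print(" ".join(tokens[start:end + 1]))  # Clara
```

Real decoders additionally cap the span length and compare against a "no answer" score for SQuAD 2.0-style unanswerable questions.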
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_ydshieh_roberta_base_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ydshieh/roberta-base-squad2 - https://www.linkedin.com/company/deepset-ai/ - https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/ - https://haystack.deepset.ai/community/join - https://github.com/deepset-ai/FARM/issues/552 - https://github.com/deepset-ai/FARM - http://www.deepset.ai/jobs - https://twitter.com/deepset_ai - https://github.com/deepset-ai/FARM/blob/master/examples/question_answering.py - https://github.com/deepset-ai/haystack/discussions - https://github.com/deepset-ai/haystack/ - https://deepset.ai - https://deepset.ai/germanquad - https://deepset.ai/german-bert --- layout: model title: Catalan, Valencian asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala TFWav2Vec2ForCTC from softcatala author: John Snow Labs name: pipeline_asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala date: 2022-09-24 tags: [wav2vec2, ca, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: ca edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala` is a Catalan, Valencian model originally trained by softcatala. 
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala_ca_4.2.0_3.0_1664037119962.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala_ca_4.2.0_3.0_1664037119962.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala', lang = 'ca') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala", lang = "ca") val annotations = pipeline.transform(audioDF) ```
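The snippet above assumes `audioDF` already holds decoded audio. A sketch, assuming the `audio_content` column expects floating-point samples: this stdlib-only example generates a short 16-bit PCM mono WAV in memory and decodes it to normalized floats, the kind of array such a column would hold (the function names here are illustrative, not Spark NLP API):

```python
# Stdlib-only sketch: decode 16-bit PCM WAV audio into normalized floats,
# the assumed shape of an AudioAssembler "audio_content" column.
import io
import math
import struct
import wave

def wav_bytes(freq_hz=440.0, sample_rate=16000, n_samples=1600):
    """Generate an in-memory 16-bit PCM mono WAV containing a sine tone."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)          # 2 bytes = 16-bit samples
        w.setframerate(sample_rate)
        samples = (int(32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
                   for i in range(n_samples))
        w.writeframes(b"".join(struct.pack("<h", s) for s in samples))
    return buf.getvalue()

def wav_to_floats(data):
    """Decode 16-bit PCM WAV bytes into floats in [-1.0, 1.0]."""
    with wave.open(io.BytesIO(data), "rb") as w:
        raw = w.readframes(w.getnframes())
    return [s / 32768.0 for (s,) in struct.iter_unpack("<h", raw)]

floats = wav_to_floats(wav_bytes())
print(len(floats), min(floats), max(floats))
```

In practice the resulting list of floats would be placed in a Spark DataFrame column and fed to the pretrained pipeline shown above.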
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_100k_voxpopuli_catala_by_softcatala| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|ca| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: SDOH Housing Insecurity For Classification author: John Snow Labs name: genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli date: 2023-05-30 tags: [en, licensed, biobert, sdoh, housing, generic_classifier, housing_insecurity] task: Text Classification language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: GenericClassifierModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Generic Classifier model detects whether the patient has housing insecurity. If the clinical note mentions patient housing problems, the model identifies them; if there is no housing issue or it is not mentioned in the text, the note is labeled “No_Housing_Insecurity_Or_Not_Mentioned”. The model was trained using the GenericClassifierApproach annotator. `Housing_Insecurity`: The patient has housing problems. `No_Housing_Insecurity_Or_Not_Mentioned`: The patient has no housing problems, or housing is not mentioned in the clinical notes. 
## Predicted Entities `Housing_Insecurity`, `No_Housing_Insecurity_Or_Not_Mentioned` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/SOCIAL_DETERMINANT_CLASSIFICATION_GENERIC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/SOCIAL_DETERMINANT_CLASSIFICATION.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli_en_4.4.2_3.0_1685474945970.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli_en_4.4.2_3.0_1685474945970.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") features_asm = FeaturesAssembler()\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("features") generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli", 'en', 'clinical/models')\ .setInputCols(["features"])\ .setOutputCol("prediction") pipeline = Pipeline(stages=[ document_assembler, sentence_embeddings, features_asm, generic_classifier ]) text_list = ["The patient is homeless.", "Patient is a 50-year-old male who no has stable housing. He recently underwent a hip replacement surgery and has made a full recovery. ", "Patient is a 25-year-old female who has her private housing. She presented with symptoms of a urinary tract infection and was diagnosed with the condition. Her living situation has allowed her to receive prompt medical care and treatment, and she has made a full recovery. ", """Patient: Mary H. Background: Mary is a 40-year-old woman who has been diagnosed with asthma and allergies. She has been managing her conditions with medication and regular follow-up appointments with her healthcare provider. She lives in a rented apartment with her husband and two children and has been stably housed for the past five years. Presenting problem: Mary presents to the clinic for a routine check-up and reports no significant changes in her health status or symptoms related to her asthma or allergies. However, she expresses concerns about the quality of the air in her apartment and potential environmental triggers that could impact her health. Medical history: Mary has a medical history of asthma and allergies. 
She takes an inhaler and antihistamines to manage her conditions. Social history: Mary is married with two children and lives in a rented apartment. She and her husband both work full-time jobs and have health insurance. They have savings and are able to cover basic expenses. Assessment: The clinician assesses Mary's medical conditions and determines that her asthma and allergies are stable and well-controlled. The clinician also assesses Mary's housing situation and determines that her apartment building is in good condition and does not present any immediate environmental hazards. Plan: The clinician advises Mary to continue to monitor her health conditions and to report any changes or concerns to her healthcare team. The clinician also prescribes a referral to an allergist who can provide additional evaluation and treatment for her allergies. The clinician recommends that Mary and her family take steps to minimize potential environmental triggers in their apartment, such as avoiding smoking and using air purifiers. The clinician advises Mary to continue to maintain her stable housing situation and to seek assistance if any financial or housing issues arise. """, """Patient: Sarah L. Background: Sarah is a 35-year-old woman who has been experiencing housing insecurity for the past year. She was evicted from her apartment due to an increase in rent, which she could not afford, and has been staying with friends and family members ever since. She works as a part-time sales associate at a retail store and has no medical insurance. Presenting problem: Sarah presents to the clinic with complaints of increased stress and anxiety related to her housing insecurity. She reports feeling constantly on edge and worried about where she will sleep each night. She is also having difficulty concentrating at work and has been missing shifts due to her anxiety. Medical history: Sarah has no significant medical history and takes no medications. 
Social history: Sarah is currently single and has no children. She has a high school diploma but has not attended college. She has been working at her current job for three years and earns minimum wage. She has no savings and relies on her income to cover basic expenses. Assessment: The clinician assesses Sarah's mental health and determines that she is experiencing symptoms of anxiety and depression related to her housing insecurity. The clinician also assesses Sarah's housing situation and determines that she is at risk for homelessness if she is unable to secure stable housing soon. Plan: The clinician refers Sarah to a social worker who can help her connect with local housing resources, including subsidized housing programs and emergency shelters. The clinician also prescribes an antidepressant medication to help manage her symptoms of anxiety and depression. The clinician advises Sarah to continue to seek employment opportunities that may offer higher pay and stability."""] df = spark.createDataFrame(text_list, StringType()).toDF("text") result = pipeline.fit(df).transform(df) result.select("text", "prediction.result").show(truncate=100) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence_embeddings") val features_asm = new FeaturesAssembler() .setInputCols("sentence_embeddings") .setOutputCol("features") val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli", "en", "clinical/models") .setInputCols("features") .setOutputCol("prediction") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_embeddings, features_asm, generic_classifier)) val data = Seq(Array("The patient is homeless.", "Patient is a 50-year-old male who no has stable housing. 
He recently underwent a hip replacement surgery and has made a full recovery. ", "Patient is a 25-year-old female who has her private housing. She presented with symptoms of a urinary tract infection and was diagnosed with the condition. Her living situation has allowed her to receive prompt medical care and treatment, and she has made a full recovery. ", """Patient: Mary H. Background: Mary is a 40-year-old woman who has been diagnosed with asthma and allergies. She has been managing her conditions with medication and regular follow-up appointments with her healthcare provider. She lives in a rented apartment with her husband and two children and has been stably housed for the past five years. Presenting problem: Mary presents to the clinic for a routine check-up and reports no significant changes in her health status or symptoms related to her asthma or allergies. However, she expresses concerns about the quality of the air in her apartment and potential environmental triggers that could impact her health. Medical history: Mary has a medical history of asthma and allergies. She takes an inhaler and antihistamines to manage her conditions. Social history: Mary is married with two children and lives in a rented apartment. She and her husband both work full-time jobs and have health insurance. They have savings and are able to cover basic expenses. Assessment: The clinician assesses Mary's medical conditions and determines that her asthma and allergies are stable and well-controlled. The clinician also assesses Mary's housing situation and determines that her apartment building is in good condition and does not present any immediate environmental hazards. Plan: The clinician advises Mary to continue to monitor her health conditions and to report any changes or concerns to her healthcare team. The clinician also prescribes a referral to an allergist who can provide additional evaluation and treatment for her allergies. 
The clinician recommends that Mary and her family take steps to minimize potential environmental triggers in their apartment, such as avoiding smoking and using air purifiers. The clinician advises Mary to continue to maintain her stable housing situation and to seek assistance if any financial or housing issues arise. """, """Patient: Sarah L. Background: Sarah is a 35-year-old woman who has been experiencing housing insecurity for the past year. She was evicted from her apartment due to an increase in rent, which she could not afford, and has been staying with friends and family members ever since. She works as a part-time sales associate at a retail store and has no medical insurance. Presenting problem: Sarah presents to the clinic with complaints of increased stress and anxiety related to her housing insecurity. She reports feeling constantly on edge and worried about where she will sleep each night. She is also having difficulty concentrating at work and has been missing shifts due to her anxiety. Medical history: Sarah has no significant medical history and takes no medications. Social history: Sarah is currently single and has no children. She has a high school diploma but has not attended college. She has been working at her current job for three years and earns minimum wage. She has no savings and relies on her income to cover basic expenses. Assessment: The clinician assesses Sarah's mental health and determines that she is experiencing symptoms of anxiety and depression related to her housing insecurity. The clinician also assesses Sarah's housing situation and determines that she is at risk for homelessness if she is unable to secure stable housing soon. Plan: The clinician refers Sarah to a social worker who can help her connect with local housing resources, including subsidized housing programs and emergency shelters. The clinician also prescribes an antidepressant medication to help manage her symptoms of anxiety and depression. 
The clinician advises Sarah to continue to seek employment opportunities that may offer higher pay and stability.""")).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash +----------------------------------------------------------------------------------------------------+----------------------------------------+ | text| result| +----------------------------------------------------------------------------------------------------+----------------------------------------+ | The patient is homeless.| [Housing_Insecurity]| |Patient is a 50-year-old male who no has stable housing. He recently underwent a hip replacement ...| [Housing_Insecurity]| |Patient is a 25-year-old female who has her private housing. She presented with symptoms of a uri...|[No_Housing_Insecurity_Or_Not_Mentioned]| |Patient: Mary H. Background: Mary is a 40-year-old woman who has been diagnosed with asthma and a...|[No_Housing_Insecurity_Or_Not_Mentioned]| |Patient: Sarah L. Background: Sarah is a 35-year-old woman who has been experiencing housing ins...| [Housing_Insecurity]| +----------------------------------------------------------------------------------------------------+----------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[features]| |Output Labels:|[prediction]| |Language:|en| |Size:|3.4 MB| |Dependencies:|sbiobert_base_cased_mli| ## References Internal SDOH Project ## Benchmarking ```bash label precision recall f1-score support Housing_Insecurity 0.81 0.92 0.86 37 No_Housing_Insecurity_Or_Not_Mentioned 0.95 0.87 0.90 60 accuracy - - 0.89 97 macro-avg 0.88 0.89 0.88 97 weighted-avg 0.89 0.89 0.89 97 ``` --- layout: model title: English BertForQuestionAnswering model (from vanichandna) author: John Snow Labs name: bert_qa_muril_finetuned_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: 
true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `muril-finetuned-squad` is an English model originally trained by `vanichandna`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_muril_finetuned_squad_en_4.0.0_3.0_1654188629286.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_muril_finetuned_squad_en_4.0.0_3.0_1654188629286.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_muril_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_muril_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_vanichandna").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_muril_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|891.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/vanichandna/muril-finetuned-squad --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_roberta_only_classfn_epochs_1_shard_1_squad2.0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_only_classfn_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_only_classfn_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739600551.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_only_classfn_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739600551.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_roberta_only_classfn_epochs_1_shard_1_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_rule_based_roberta_only_classfn_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base_rule_based_only_classfn_epochs_1_shard_1.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_roberta_only_classfn_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|460.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_only_classfn_epochs_1_shard_1_squad2.0 --- layout: model title: Translate English to Ilocano Pipeline author: John Snow Labs name: translate_en_ilo date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, ilo, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `ilo` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ilo_xx_2.7.0_2.4_1609690588205.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ilo_xx_2.7.0_2.4_1609690588205.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ilo", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ilo", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ilo').predict(text, output_level='sentence') translate_df ```
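Since the description notes that translation cost grows quickly with sequence length, long documents are usually split into sentence-sized chunks before calling `annotate`. A naive, stdlib-only splitter sketch (illustrative only; inside Spark NLP a SentenceDetector stage would normally do this):

```python
# Greedy sentence-chunking sketch: pack sentences into chunks no longer than
# max_chars, so each call to the translation pipeline stays short.
import re

def chunk_sentences(text, max_chars=200):
    """Greedily pack sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than split.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

text = "First sentence. Second sentence! Third one? A fourth, longer sentence to pad things out."
for c in chunk_sentences(text, max_chars=40):
    print(c)
```

Each chunk can then be passed to `pipeline.annotate(...)` and the translations concatenated.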
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_ilo| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Castilian, Spanish BertForQuestionAnswering model (from CenIA) author: John Snow Labs name: bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_sqac date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-uncased-finetuned-qa-sqac` is a Castilian, Spanish model originally trained by `CenIA`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_sqac_es_4.0.0_3.0_1654180585886.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_sqac_es_4.0.0_3.0_1654180585886.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_sqac","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_sqac","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.sqac.bert.base_uncased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_sqac| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|es| |Size:|410.2 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/CenIA/bert-base-spanish-wwm-uncased-finetuned-qa-sqac --- layout: model title: Estonian Lemmatizer author: John Snow Labs name: lemma date: 2020-11-28 task: Lemmatization language: et edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [lemmatizer, et, open_source] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes the context surrounding a word into consideration to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_et_2.7.0_2.4_1606580379171.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_et_2.7.0_2.4_1606580379171.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of a pipeline after tokenisation.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "et") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate(['üheksandana üheksas üheksanda Üheksas']) ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "et") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("üheksandana üheksas üheksanda Üheksas").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["üheksandana üheksas üheksanda Üheksas"] lemma_df = nlu.load('et.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
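Conceptually, the lemmatizer maps every inflected surface form to one root, as the description explains. A toy, dictionary-only sketch of that idea (hand-picked entries for the example sentence; the real model is trained on Universal Dependencies data and also uses context for ambiguous forms):

```python
# Toy lemmatizer sketch: map inflected Estonian forms of "ninth" to one root.
# The dictionary entries are illustrative, not taken from the actual model.
LEMMA_DICT = {
    "üheksandana": "üheksas",
    "üheksas": "üheksas",
    "üheksanda": "üheksas",
}

def lemmatize(tokens):
    """Look up each token's root, falling back to the lowercased token itself."""
    return [LEMMA_DICT.get(t.lower(), t.lower()) for t in tokens]

print(lemmatize("üheksandana üheksas üheksanda Üheksas".split()))
```

All four surface forms collapse to the single lemma `üheksas`, matching the annotations shown in the Results section below.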
## Results ```bash {'lemma': [Annotation(token, 0, 10, üheksas, {'sentence': '0'}), Annotation(token, 12, 18, üheksas, {'sentence': '0'}), Annotation(token, 20, 28, üheksas, {'sentence': '0'}), Annotation(token, 30, 36, üheksas, {'sentence': '0'})]} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|et| ## Data Source This model is trained on data obtained from [https://universaldependencies.org/](https://universaldependencies.org/) --- layout: model title: Pipeline to Detect problem, test, treatment in medical text (biobert) author: John Snow Labs name: ner_clinical_biobert_pipeline date: 2023-03-20 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_clinical_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_clinical_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_biobert_pipeline_en_4.3.0_3.2_1679314695992.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_biobert_pipeline_en_4.3.0_3.2_1679314695992.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_clinical_biobert_pipeline", "en", "clinical/models") text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_clinical_biobert_pipeline", "en", "clinical/models") val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. 
The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical_biobert.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
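`fullAnnotate` returns the pipeline output as annotation entries; a common next step is flattening the `ner_chunk` annotations into plain (chunk, begin, end, label, confidence) rows like the Results table. A minimal sketch, assuming dict-shaped annotations with `result`/`begin`/`end`/`metadata` fields — in the live Spark NLP API these may instead be `Annotation` objects exposing the same fields as attributes, so adapt the accessors to your version:

```python
# Flatten ner_chunk annotations into plain tuples.
# Field names ('result', 'begin', 'end', 'metadata') follow the usual
# Spark NLP annotation layout -- verify against your installed version.

def chunks_to_rows(annotations):
    """Turn a list of ner_chunk annotations into (text, begin, end, label, confidence) tuples."""
    rows = []
    for ann in annotations:
        rows.append((
            ann["result"],
            ann["begin"],
            ann["end"],
            ann["metadata"].get("entity"),
            ann["metadata"].get("confidence"),
        ))
    return rows

# Mocked annotations mirroring two rows of this card's Results table:
mock = [
    {"result": "congestion", "begin": 62, "end": 71,
     "metadata": {"entity": "PROBLEM", "confidence": "0.5069"}},
    {"result": "Tylenol", "begin": 345, "end": 351,
     "metadata": {"entity": "TREATMENT", "confidence": "0.665"}},
]
print(chunks_to_rows(mock))
```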
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:----------------------------------------------------|--------:|------:|:------------|-------------:| | 0 | congestion | 62 | 71 | PROBLEM | 0.5069 | | 1 | some mild problems with his breathing while feeding | 163 | 213 | PROBLEM | 0.694063 | | 2 | any perioral cyanosis | 233 | 253 | PROBLEM | 0.6493 | | 3 | retractions | 258 | 268 | PROBLEM | 0.9971 | | 4 | a tactile temperature | 302 | 322 | PROBLEM | 0.8294 | | 5 | Tylenol | 345 | 351 | TREATMENT | 0.665 | | 6 | some decreased p.o | 372 | 389 | PROBLEM | 0.771067 | | 7 | His normal breast-feeding | 400 | 424 | TEST | 0.736767 | | 8 | his respiratory congestion | 488 | 513 | PROBLEM | 0.745767 | | 9 | more tired | 545 | 554 | PROBLEM | 0.6514 | | 10 | fussy | 569 | 573 | PROBLEM | 0.6512 | | 11 | albuterol treatments | 637 | 656 | TREATMENT | 0.8917 | | 12 | His urine output | 675 | 690 | TEST | 0.7114 | | 13 | any diarrhea | 832 | 843 | PROBLEM | 0.73595 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.1 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Pipeline for Mapping SNOMED Codes to Their Corresponding UMLS Codes author: John Snow Labs name: snomed_umls_mapping date: 2023-03-29 tags: [en, licensed, clinical, pipeline, chunk_mapping, snomed, umls] task: Chunk Mapping language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the `snomed_umls_mapper` model.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/snomed_umls_mapping_en_4.3.2_3.2_1680124512179.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/snomed_umls_mapping_en_4.3.2_3.2_1680124512179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("snomed_umls_mapping", "en", "clinical/models") result = pipeline.fullAnnotate("733187009 449433008 51264003") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("snomed_umls_mapping", "en", "clinical/models") val result = pipeline.fullAnnotate("733187009 449433008 51264003") ``` {:.nlu-block} ```python import nlu nlu.load("en.snomed.umls.mapping").predict("""Put your text here.""") ```
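Conceptually, the chunk mapper performs a code-to-code lookup. The sketch below hand-writes the three pairs shown in this card's sample output (assuming the Results columns pair codes positionally); the real pipeline resolves codes from a pretrained mapping resource, not a literal dict:

```python
# Toy SNOMED -> UMLS lookup using the pairs from this card's sample output.
SNOMED_TO_UMLS = {
    "733187009": "C4546029",
    "449433008": "C3164619",
    "51264003": "C0271267",
}

def map_codes(text):
    """Map whitespace-separated SNOMED codes to UMLS CUIs (None if unknown)."""
    return [(code, SNOMED_TO_UMLS.get(code)) for code in text.split()]

print(map_codes("733187009 449433008 51264003"))
```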
## Results ```bash | | snomed_code | umls_code | |---:|:------------|:----------| | 0 | 733187009 | C4546029 | | 1 | 449433008 | C3164619 | | 2 | 51264003 | C0271267 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|snomed_umls_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|5.1 MB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel --- layout: model title: English BertForQuestionAnswering Cased model (from Nadav) author: John Snow Labs name: bert_qa_macsquad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `MacSQuAD` is an English model originally trained by `Nadav`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_macsquad_en_4.0.0_3.0_1657182038098.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_macsquad_en_4.0.0_3.0_1657182038098.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_macsquad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_macsquad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
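Under the hood, an extractive QA head like this one scores each context token as a possible answer start and end, then picks the span maximising start score plus end score with start ≤ end. A dependency-free sketch with toy scores (not real model output; the token list and numbers are illustrative only):

```python
# Pick the best (start, end) answer span from per-token start/end scores.
def best_span(start_scores, end_scores, max_len=30):
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        # only consider spans of bounded length that end at or after i
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best[2]:
                best = (i, j, s + end_scores[j])
    return best[:2]

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 2.5, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]
end   = [0.0, 0.1, 0.2, 2.8, 0.1, 0.0, 0.0, 0.1, 0.4, 0.0]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```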
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_macsquad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Nadav/MacSQuAD --- layout: model title: Stop Words Cleaner for Indonesian author: John Snow Labs name: stopwords_id date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: id edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, id] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_id_id_2.5.4_2.4_1594742441630.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_id_id_2.5.4_2.4_1594742441630.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_id", "id") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Selain menjadi raja utara, John Snow adalah seorang dokter Inggris dan pemimpin dalam pengembangan anestesi dan kebersihan medis.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_id", "id") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Selain menjadi raja utara, John Snow adalah seorang dokter Inggris dan pemimpin dalam pengembangan anestesi dan kebersihan medis.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Selain menjadi raja utara, John Snow adalah seorang dokter Inggris dan pemimpin dalam pengembangan anestesi dan kebersihan medis."""] stopword_df = nlu.load('id.stopwords').predict(text) stopword_df[['cleanTokens']] ```
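What `StopWordsCleaner` does, conceptually, is drop every token that appears in a fixed stop-word list (case-insensitively here, since this model is marked `Case sensitive: false`). A minimal sketch; the Indonesian stop words below are a tiny illustrative subset chosen to match the sample output, not the model's actual list:

```python
# Tiny illustrative subset of Indonesian stop words (NOT the model's real list).
STOP_WORDS = {"selain", "adalah", "seorang", "dan", "dalam"}

def clean_tokens(tokens):
    """Keep only tokens not in the stop-word list, ignoring case."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(clean_tokens(["Selain", "menjadi", "raja", "utara", ",", "John", "Snow"]))
# -> ['menjadi', 'raja', 'utara', ',', 'John', 'Snow']
```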
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=7, end=13, result='menjadi', metadata={'sentence': '0'}), Row(annotatorType='token', begin=15, end=18, result='raja', metadata={'sentence': '0'}), Row(annotatorType='token', begin=20, end=24, result='utara', metadata={'sentence': '0'}), Row(annotatorType='token', begin=25, end=25, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=27, end=30, result='John', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_id| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|id| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: Persian DistilBERT Embeddings (from HooshvareLab) author: John Snow Labs name: distilbert_embeddings_distilbert_fa_zwnj_base date: 2022-04-12 tags: [distilbert, embeddings, fa, open_source] task: Embeddings language: fa edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-fa-zwnj-base` is a Persian model originally trained by `HooshvareLab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_fa_zwnj_base_fa_3.4.2_3.0_1649783880670.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_fa_zwnj_base_fa_3.4.2_3.0_1649783880670.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_fa_zwnj_base","fa") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["من عاشق جرقه NLP هستم"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_fa_zwnj_base","fa") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("من عاشق جرقه NLP هستم").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fa.embed.distilbert_fa_zwnj_base").predict("""من عاشق جرقه NLP هستم""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_fa_zwnj_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|fa| |Size:|282.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/HooshvareLab/distilbert-fa-zwnj-base - https://github.com/hooshvare/parsbert/issues --- layout: model title: Extract Cancer Therapies and Posology Information author: John Snow Labs name: ner_oncology_unspecific_posology_healthcare date: 2023-01-11 tags: [licensed, clinical, oncology, en, ner, treatment, posology] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts mentions of treatments and posology information using unspecific labels (low granularity). ## Predicted Entities `Posology_Information`, `Cancer_Therapy` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_unspecific_posology_healthcare_en_4.2.4_3.0_1673475870938.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_unspecific_posology_healthcare_en_4.2.4_3.0_1673475870938.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel()\ .pretrained("embeddings_healthcare_100d", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel\ .pretrained("ner_oncology_unspecific_posology_healthcare", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. 
She is currently receiving his second cycle of chemotherapy and is in good overall condition."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel .pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel() .pretrained("embeddings_healthcare_100d", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_unspecific_posology_healthcare", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving his second cycle of chemotherapy and is in good overall condition.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_unspecific_posology_healthcare").predict("""The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving his second cycle of chemotherapy and is in good overall condition.""") ```
## Results ```bash | chunk | ner_label | |:-----------------|:---------------------| | adriamycin | Cancer_Therapy | | 60 mg/m2 | Posology_Information | | cyclophosphamide | Cancer_Therapy | | 600 mg/m2 | Posology_Information | | over six courses | Posology_Information | | second cycle | Posology_Information | | chemotherapy | Cancer_Therapy | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_unspecific_posology_healthcare| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|33.8 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Posology_Information 1435 102 210 1645 0.93 0.87 0.90 Cancer_Therapy 1281 116 125 1406 0.92 0.91 0.91 macro-avg 2716 218 335 3051 0.93 0.89 0.91 micro-avg 2716 218 335 3051 0.93 0.89 0.91 ``` --- layout: model title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket TFWav2Vec2ForCTC from lilitket author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket` is an English model originally trained by lilitket.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket_en_4.2.0_3.0_1664095206434.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket_en_4.2.0_3.0_1664095206434.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket", lang = "en") val annotations = pipeline.transform(audioDF) ```
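The pipeline above expects `audioDF` rows to carry the raw waveform as an array of floats (16 kHz mono is typical for wav2vec2 models). A sketch of decoding a 16-bit PCM WAV into such an array with only the standard library — in practice loaders like librosa or soundfile are the usual choice, and the Spark DataFrame wrapping at the end is an assumption about the expected column layout:

```python
import io
import struct
import wave

def wav_to_floats(wav_bytes):
    """Decode 16-bit PCM WAV bytes into floats in [-1.0, 1.0]."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit PCM"
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Round-trip demo: write two samples into an in-memory WAV, read them back.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<2h", 0, 16384))
floats = wav_to_floats(buf.getvalue())
print(floats)  # [0.0, 0.5]

# With a Spark session running, this list becomes one row of audioDF, e.g.:
# audioDF = spark.createDataFrame([[floats]], ["audio_content"])
```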
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English RobertaForQuestionAnswering Cased model (from LucasS) author: John Snow Labs name: roberta_qa_robertaabsa date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `robertaABSA` is an English model originally trained by `LucasS`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_robertaabsa_en_4.3.0_3.0_1674222776379.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_robertaabsa_en_4.3.0_3.0_1674222776379.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_robertaabsa","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_robertaabsa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_robertaabsa| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|437.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/LucasS/robertaABSA --- layout: model title: Financial English BERT Embeddings (Number masking) author: John Snow Labs name: bert_embeddings_sec_bert_num date: 2022-04-12 tags: [bert, embeddings, en, open_source, financial] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Financial Pretrained BERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `sec-bert-num` is an English model originally trained by `nlpaueb`. This model is the same as Bert Base, but we replace every number token with a [NUM] pseudo-token, handling all numeric expressions in a uniform manner (disallowing their fragmentation). If you are interested in Financial Embeddings, take a look also at these two models: [sec-base](https://nlp.johnsnowlabs.com/2022/04/12/bert_embeddings_sec_bert_base_en_3_0.html): Same as Bert Base but trained with financial documents. [sec-shape](https://nlp.johnsnowlabs.com/2022/04/12/bert_embeddings_sec_bert_sh_en_3_0.html): Same as Bert sec-base but we replace numbers with pseudo-tokens that represent the number’s shape, so numeric expressions (of known shapes) are no longer fragmented, e.g., '53.2' becomes '[XX.X]' and '40,200.5' becomes '[XX,XXX.X]'.
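The [NUM] masking and shape pseudo-tokens described above can be sketched as a regex pass over the token stream. This mirrors the preprocessing idea only, not the models' exact tokenizers:

```python
import re

# Tokens made of digits, commas and dots (starting with a digit) count as numbers.
NUM_RE = re.compile(r"^\d[\d,.]*$")

def to_num_token(token):
    """sec-bert-num style: replace any number token with [NUM]."""
    return "[NUM]" if NUM_RE.match(token) else token

def to_shape_token(token):
    """sec-bert-shape style: replace each digit with X, keeping punctuation."""
    if not NUM_RE.match(token):
        return token
    return "[" + re.sub(r"\d", "X", token) + "]"

print(to_num_token("53.2"))        # [NUM]
print(to_shape_token("53.2"))      # [XX.X]
print(to_shape_token("40,200.5"))  # [XX,XXX.X]
```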
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_sec_bert_num_en_3.4.2_3.0_1649759295271.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_sec_bert_num_en_3.4.2_3.0_1649759295271.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_num","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_num","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.sec_bert_num").predict("""I love Spark NLP""") ```
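The token embeddings produced by the pipeline above are plain float vectors; a common follow-up is comparing them with cosine similarity. A dependency-free sketch with toy 4-dimensional vectors (the real model emits 768-dimensional ones):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

a = [0.2, 0.1, 0.9, 0.3]
b = [0.2, 0.1, 0.9, 0.3]
c = [0.9, 0.8, 0.1, 0.0]
print(round(cosine(a, b), 3))       # identical vectors -> 1.0
print(cosine(a, c) < cosine(a, b))  # dissimilar pair scores lower -> True
```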
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_sec_bert_num| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|409.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/nlpaueb/sec-bert-num - https://arxiv.org/abs/2203.06482 - http://nlp.cs.aueb.gr/ --- layout: model title: English AlbertForQuestionAnswering Large model (from elgeish) author: John Snow Labs name: albert_qa_cs224n_squad2.0_large_v2 date: 2022-06-24 tags: [en, open_source, albert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: AlBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cs224n-squad2.0-albert-large-v2` is an English model originally trained by `elgeish`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_cs224n_squad2.0_large_v2_en_4.0.0_3.0_1656064295571.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_cs224n_squad2.0_large_v2_en_4.0.0_3.0_1656064295571.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_cs224n_squad2.0_large_v2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_cs224n_squad2.0_large_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.albert.large_v2").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_qa_cs224n_squad2.0_large_v2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|63.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/elgeish/cs224n-squad2.0-albert-large-v2 - http://web.stanford.edu/class/cs224n/project/default-final-project-handout.pdf - https://rajpurkar.github.io/SQuAD-explorer/ - https://github.com/elgeish/squad/tree/master/data --- layout: model title: Turkish DistilBERT Embeddings (from Geotrend) author: John Snow Labs name: distilbert_embeddings_distilbert_base_tr_cased date: 2022-04-12 tags: [distilbert, embeddings, tr, open_source] task: Embeddings language: tr edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-tr-cased` is a Turkish model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_tr_cased_tr_3.4.2_3.0_1649783637258.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_tr_cased_tr_3.4.2_3.0_1649783637258.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_tr_cased","tr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Spark NLP'yi seviyorum"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_tr_cased","tr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Spark NLP'yi seviyorum").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("tr.embed.distilbert_base_cased").predict("""Spark NLP'yi seviyorum""") ```
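The `embeddings` column holds one dense vector per token. A common downstream step is comparing two token vectors with cosine similarity. The sketch below shows that computation in plain Python; the 3-dimensional vectors are toy values for illustration, not real DistilBERT output (real vectors have hundreds of dimensions).

```python
import math

# Illustrative sketch: cosine similarity between two token vectors.
# The 3-d vectors are toy values, not real DistilBERT output.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v_spark = [0.2, 0.1, 0.9]
v_nlp = [0.25, 0.05, 0.85]
print(cosine(v_spark, v_nlp))  # close to 1.0 for similar vectors
```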
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_base_tr_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|tr| |Size:|216.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/distilbert-base-tr-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Cyberbullying Classifier author: John Snow Labs name: classifierdl_use_cyberbullying date: 2021-01-09 task: Text Classification language: en nav_key: models edition: Spark NLP 2.7.1 spark_version: 2.4 tags: [open_source, en, classifier] supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Identify Racism, Sexism or Neutral tweets. ## Predicted Entities `neutral`, `racism`, `sexism` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN_CYBERBULLYING/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN_CYBERBULLYING.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_cyberbullying_en_2.7.1_2.4_1610188083627.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_cyberbullying_en_2.7.1_2.4_1610188083627.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") use = UniversalSentenceEncoder.pretrained('tfhub_use', lang="en") \ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") document_classifier = ClassifierDLModel.pretrained('classifierdl_use_cyberbullying', 'en') \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") nlp_pipeline = Pipeline(stages=[document_assembler, use, document_classifier]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate('@geeky_zekey Thanks for showing again that blacks are the biggest racists. Blocked') ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val use = UniversalSentenceEncoder.pretrained(lang="en") .setInputCols(Array("document")) .setOutputCol("sentence_embeddings") val document_classifier = ClassifierDLModel.pretrained("classifierdl_use_cyberbullying", "en") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, use, document_classifier)) val data = Seq("@geeky_zekey Thanks for showing again that blacks are the biggest racists. Blocked").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""@geeky_zekey Thanks for showing again that blacks are the biggest racists. Blocked"""] cyberbull_df = nlu.load('classify.cyberbullying.use').predict(text, output_level='document') cyberbull_df[["document", "cyberbullying"]] ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.cyberbullying").predict("""@geeky_zekey Thanks for showing again that blacks are the biggest racists. Blocked""") ```
## Results ```bash +--------------------------------------------------------------------------------------------------------+------------+ |document |class | +--------------------------------------------------------------------------------------------------------+------------+ |@geeky_zekey Thanks for showing again that blacks are the biggest racists. Blocked. | racism | +--------------------------------------------------------------------------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_use_cyberbullying| |Compatibility:|Spark NLP 2.7.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Dependencies:|tfhub_use| ## Data Source This model is trained on a cyberbullying detection dataset: https://raw.githubusercontent.com/dhavalpotdar/cyberbullying-detection/master/data/data/data.csv ## Benchmarking ```bash precision recall f1-score support neutral 0.72 0.76 0.74 700 racism 0.89 0.94 0.92 773 sexism 0.82 0.71 0.76 622 accuracy 0.81 2095 macro avg 0.81 0.80 0.80 2095 weighted avg 0.81 0.81 0.81 2095 ``` --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from haesun) author: John Snow Labs name: xlmroberta_ner_haesun_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `haesun`.
## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_haesun_base_finetuned_panx_de_4.1.0_3.0_1660433456248.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_haesun_base_finetuned_panx_de_4.1.0_3.0_1660433456248.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_haesun_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_haesun_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
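The `NerConverter` stage merges token-level `B-`/`I-` tags into whole entity chunks in the `ner_chunk` column. The sketch below illustrates that merging logic in plain Python; the German sentence and its tags are invented for illustration, not actual model output.

```python
# Illustrative sketch of BIO-tag merging, the kind of step NerConverter performs.
# Tokens and tags below are invented, not real model output.
def bio_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Angela", "Merkel", "besuchte", "Berlin", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(bio_to_chunks(tokens, tags))  # [('Angela Merkel', 'PER'), ('Berlin', 'LOC')]
```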
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_haesun_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/haesun/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: English BertForQuestionAnswering Cased model (from aozorahime) author: John Snow Labs name: bert_qa_my_new_model date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `my-new-model` is an English model originally trained by `aozorahime`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_my_new_model_en_4.0.0_3.0_1657190466916.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_my_new_model_en_4.0.0_3.0_1657190466916.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_my_new_model","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_my_new_model","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_my_new_model| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/aozorahime/my-new-model --- layout: model title: Pipeline to Detect diseases in medical text (biobert) author: John Snow Labs name: ner_diseases_biobert_pipeline date: 2023-03-20 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_diseases_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_diseases_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_diseases_biobert_pipeline_en_4.3.0_3.2_1679315318481.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_diseases_biobert_pipeline_en_4.3.0_3.2_1679315318481.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_diseases_biobert_pipeline", "en", "clinical/models") text = '''Indomethacin resulted in histopathologic findings typical of interstitial cystitis, such as leaky bladder epithelium and mucosal mastocytosis. The true incidence of nonsteroidal anti-inflammatory drug-induced cystitis in humans must be clarified by prospective clinical trials. An open-label phase II study of low-dose thalidomide in androgen-independent prostate cancer.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_diseases_biobert_pipeline", "en", "clinical/models") val text = "Indomethacin resulted in histopathologic findings typical of interstitial cystitis, such as leaky bladder epithelium and mucosal mastocytosis. The true incidence of nonsteroidal anti-inflammatory drug-induced cystitis in humans must be clarified by prospective clinical trials. An open-label phase II study of low-dose thalidomide in androgen-independent prostate cancer." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.diseases_biobert.pipeline").predict("""Indomethacin resulted in histopathologic findings typical of interstitial cystitis, such as leaky bladder epithelium and mucosal mastocytosis. The true incidence of nonsteroidal anti-inflammatory drug-induced cystitis in humans must be clarified by prospective clinical trials. An open-label phase II study of low-dose thalidomide in androgen-independent prostate cancer.""") ```
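`fullAnnotate` returns annotations that carry the matched text, character offsets, and metadata such as the entity label and confidence. The plain-Python sketch below mimics that shape with minimal dicts to show how a results table like the one that follows can be assembled; the field names and values are illustrative assumptions, and real Spark NLP `Annotation` objects may expose them differently.

```python
# Hedged sketch: turning annotation-like records into tabular rows.
# The dicts below only mimic the shape of fullAnnotate output; field
# names of real Annotation objects may differ.
annotations = [
    {"result": "interstitial cystitis", "begin": 61, "end": 81,
     "metadata": {"entity": "Disease", "confidence": "0.99655"}},
    {"result": "mastocytosis", "begin": 129, "end": 140,
     "metadata": {"entity": "Disease", "confidence": "0.8569"}},
]

rows = [
    (a["result"], a["begin"], a["end"], a["metadata"]["entity"],
     float(a["metadata"]["confidence"]))
    for a in annotations
]
for chunk, begin, end, label, conf in rows:
    print(f"{chunk:<22} {begin:>5} {end:>5} {label:<8} {conf:.5f}")
```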
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:----------------------|--------:|------:|:------------|-------------:| | 0 | interstitial cystitis | 61 | 81 | Disease | 0.99655 | | 1 | mastocytosis | 129 | 140 | Disease | 0.8569 | | 2 | cystitis | 209 | 216 | Disease | 0.9717 | | 3 | prostate cancer | 355 | 369 | Disease | 0.85965 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_diseases_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.2 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Chinese BertForMaskedLM Cased model (from hfl) author: John Snow Labs name: bert_embeddings_rbt3 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rbt3` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt3_zh_4.2.4_3.0_1670327065192.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt3_zh_4.2.4_3.0_1670327065192.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt3","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt3","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_rbt3| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|144.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/rbt3 - https://arxiv.org/abs/1906.08101 - https://github.com/google-research/bert - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://arxiv.org/abs/2004.13922 --- layout: model title: English asr_wav2vec2_base_test TFWav2Vec2ForCTC from cahya author: John Snow Labs name: pipeline_asr_wav2vec2_base_test date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_test` is an English model originally trained by cahya. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_test_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_test_en_4.2.0_3.0_1664035679567.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_test_en_4.2.0_3.0_1664035679567.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_test', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_test", lang = "en") val annotations = pipeline.transform(audioDF) ```
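Once the pipeline has produced transcripts, a standard way to judge their quality is word error rate (WER): the word-level edit distance between a reference transcript and the hypothesis, divided by the reference length. The sketch below computes it in plain Python; the reference/hypothesis pair is invented for illustration and is not output of this pipeline.

```python
# Sketch: word error rate (WER), the usual metric for evaluating ASR transcripts.
# The reference/hypothesis pair below is invented for illustration.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion out of six words
```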
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_test| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|348.7 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from jgammack) author: John Snow Labs name: distilbert_qa_mtl_base_uncased_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `MTL-distilbert-base-uncased-squad` is an English model originally trained by `jgammack`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_mtl_base_uncased_squad_en_4.3.0_3.0_1672765509575.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_mtl_base_uncased_squad_en_4.3.0_3.0_1672765509575.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_mtl_base_uncased_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_mtl_base_uncased_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_mtl_base_uncased_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/jgammack/MTL-distilbert-base-uncased-squad --- layout: model title: Russian RoBERTa Embeddings author: John Snow Labs name: roberta_embeddings_ruRoberta_large date: 2022-04-14 tags: [roberta, embeddings, ru, open_source] task: Embeddings language: ru edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `ruRoberta-large` is a Russian model originally trained by `sberbank-ai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_ruRoberta_large_ru_3.4.2_3.0_1649947722752.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_ruRoberta_large_ru_3.4.2_3.0_1649947722752.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_ruRoberta_large","ru") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Я люблю искра NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_ruRoberta_large","ru") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Я люблю искра NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ru.embed.ruRoberta_large").predict("""Я люблю искра NLP""") ```
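To get a single vector for a whole sentence from the token-level `embeddings` column, a common approach is mean pooling: averaging the token vectors component-wise. The sketch below shows the idea in plain Python; the 3-dimensional toy vectors are invented for illustration, not real ruRoberta-large output (its vectors are much larger).

```python
# Illustrative sketch: mean pooling token vectors into one sentence vector.
# The 3-d vectors are toy values, not real ruRoberta-large output.
def mean_pool(token_vectors):
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

toy_vectors = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [2.0, 1.0, 1.0]]
print(mean_pool(toy_vectors))  # [2.0, 1.0, 1.0]
```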
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_ruRoberta_large| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ru| |Size:|1.3 GB| |Case sensitive:|true| ## References - https://huggingface.co/sberbank-ai/ruRoberta-large - https://sberdevices.ru/ --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from leabum) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_cuad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-cuad` is an English model originally trained by `leabum`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_cuad_en_4.3.0_3.0_1672767856883.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_cuad_en_4.3.0_3.0_1672767856883.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_cuad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_cuad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_cuad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/leabum/distilbert-base-uncased-finetuned-cuad --- layout: model title: Emotional Stress Classifier (BERT) author: John Snow Labs name: bert_sequence_classifier_stress date: 2022-06-28 tags: [sequence_classification, bert, en, licensed, stress, mental, public_health] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [PHS-BERT-based](https://huggingface.co/publichealthsurveillance/PHS-BERT) classifier that can classify whether the content of a text expresses emotional stress. ## Predicted Entities `no stress`, `stress` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_STRESS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4SC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_stress_en_4.0.0_3.0_1656438010655.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_stress_en_4.0.0_3.0_1656438010655.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_stress", "en", "clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame([["No place in my city has shelter space for us, and I won't put my baby on the literal street. What cities have good shelter programs for homeless mothers and children?"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_stress", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq("No place in my city has shelter space for us, and I won't put my baby on the literal street. What cities have good shelter programs for homeless mothers and children?").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.stress").predict("""No place in my city has shelter space for us, and I won't put my baby on the literal street. What cities have good shelter programs for homeless mothers and children?""") ```
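For intuition about the `class` column: a sequence-classification head ends in a softmax over the label logits, and the highest-probability label is reported. The sketch below shows that final step in plain Python; the two logits are invented to illustrate a case where `stress` wins, and are not real model output.

```python
import math

# Illustrative sketch: softmax over label logits, the last step of a
# sequence-classification head. The logits below are invented values.
def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["no stress", "stress"]
probs = softmax([-1.2, 2.3])
print(labels[probs.index(max(probs))])  # stress
```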
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+ |text | class| +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+ |No place in my city has shelter space for us, and I won't put my baby on the literal street. What cities have good shelter programs for homeless mothers and children?|[stress]| +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_stress| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References [Dreaddit dataset](https://arxiv.org/abs/1911.00133) ## Benchmarking ```bash label precision recall f1-score support no-stress 0.83 0.82 0.83 334 stress 0.85 0.85 0.85 377 accuracy - - 0.84 711 macro-avg 0.84 0.84 0.84 711 weighted-avg 0.84 0.84 0.84 711 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from rpv) author: John Snow Labs name: distilbert_qa_rpv_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`distilbert-base-uncased-finetuned-squad` is an English model originally trained by `rpv`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_rpv_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772340836.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_rpv_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772340836.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_rpv_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_rpv_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_rpv_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/rpv/distilbert-base-uncased-finetuned-squad --- layout: model title: Japanese asr_wav2vec2_large_xlsr_japanese_hiragana TFWav2Vec2ForCTC from vumichien author: John Snow Labs name: asr_wav2vec2_large_xlsr_japanese_hiragana date: 2022-09-25 tags: [wav2vec2, ja, audio, open_source, asr] task: Automatic Speech Recognition language: ja edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_japanese_hiragana` is a Japanese model originally trained by vumichien. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_large_xlsr_japanese_hiragana_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_japanese_hiragana_ja_4.2.0_3.0_1664122415445.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_japanese_hiragana_ja_4.2.0_3.0_1664122415445.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_japanese_hiragana", "ja")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_japanese_hiragana", "ja") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
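The snippets above assume an `audioDf` DataFrame whose `audio_content` column holds the raw audio as an array of floats. As a sketch (the helper name and the 16-bit mono WAV assumption are ours, not part of this card), such float arrays can be produced from a WAV file with the Python standard library:

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit mono WAV file into the list of floats expected in
    the "audio_content" column (hypothetical helper, for illustration)."""
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    # Normalize 16-bit PCM samples to the [-1.0, 1.0) range.
    return [s / 32768.0 for s in samples]
```

The resulting lists can then be assembled into a Spark DataFrame, e.g. `spark.createDataFrame([[wav_to_floats("sample.wav")]], ["audio_content"])`, before being passed to the pipeline.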
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_japanese_hiragana| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ja| |Size:|1.2 GB| --- layout: model title: Pipeline to Detect PHI in Text (enriched-biobert) author: John Snow Labs name: ner_deid_enriched_biobert_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, deidentification, enriched_biobert, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deid_enriched_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_deid_enriched_biobert_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_enriched_biobert_pipeline_en_3.4.1_3.0_1647868393082.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_enriched_biobert_pipeline_en_3.4.1_3.0_1647868393082.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_enriched_biobert_pipeline", "en", "clinical/models") pipeline.annotate("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""") ``` ```scala val pipeline = new PretrainedPipeline("ner_deid_enriched_biobert_pipeline", "en", "clinical/models") pipeline.annotate("A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.") ``` {:.nlu-block} ```python import nlu nlu.load("en.deid.ner_enriched_biobert.pipeline").predict("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""") ```
## Results ```bash +-----------------------------+------------+ |chunks |entities | +-----------------------------+------------+ |2093-01-13 |DATE | |David Hale |DOCTOR | |Hendrickson, Ora |DOCTOR | |7194334 |PHONE | |01/13/93 |DATE | |Oliveira |DOCTOR | |1-11-2000 |DATE | |Cocke County Baptist Hospital|HOSPITAL | |0295 Keats Street |STREET | |(302) 786-5227 |PHONE | |Brothers Coal-Mine |ORGANIZATION| +-----------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_enriched_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.0 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter --- layout: model title: English T5ForConditionalGeneration Tiny Cased model (from google) author: John Snow Labs name: t5_efficient_tiny_el2 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-el2` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_el2_en_4.3.0_3.0_1675123363126.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_el2_en_4.3.0_3.0_1675123363126.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_tiny_el2","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_tiny_el2","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
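When replacing the `PUT YOUR STRING HERE` placeholder, note that T5-style checkpoints are conventionally prompted with a task prefix (e.g. `summarize:` or `translate English to German:`); which prefixes this particular checkpoint was trained on is not stated in the card, so treat the prefixes below as assumptions and check the upstream model card. A trivial helper for composing such inputs:

```python
def t5_input(task_prefix, text):
    """Prepend a T5-style task prefix to the input text.
    The set of valid prefixes is checkpoint-specific; "summarize" here
    is only an example, not a documented prefix for this model."""
    return f"{task_prefix}: {text}"

prompt = t5_input("summarize", "Spark NLP scales NLP pipelines on Spark clusters.")
```

The resulting string would then be used in place of the placeholder in `spark.createDataFrame([[prompt]]).toDF("text")`.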
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_tiny_el2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|59.2 MB| ## References - https://huggingface.co/google/t5-efficient-tiny-el2 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_dl8 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-dl8` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dl8_en_4.3.0_3.0_1675118817239.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dl8_en_4.3.0_3.0_1675118817239.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_dl8","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_dl8","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_dl8| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|164.0 MB| ## References - https://huggingface.co/google/t5-efficient-small-dl8 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English T5ForConditionalGeneration Base Cased model (from google) author: John Snow Labs name: t5_efficient_base_el2 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-el2` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_el2_en_4.3.0_3.0_1675111126562.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_el2_en_4.3.0_3.0_1675111126562.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_base_el2","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_base_el2","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_base_el2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|339.1 MB| ## References - https://huggingface.co/google/t5-efficient-base-el2 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Detect PHI for Deidentification (Glove - Subentity) author: John Snow Labs name: ner_deid_subentity_glove date: 2021-06-06 tags: [ner, deid, licensed, en, glove, clinical] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The Named Entity Recognition annotator allows a generic model to be trained with a deep learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Absolute) is a Named Entity Recognition model that annotates text to find protected health information that may need to be deidentified. It detects 23 entities. This NER model is trained on a combination of the i2b2 train set and an augmented version of it, using GloVe-100d embeddings. We adhered to the official annotation guideline (AG) of the 2014 i2b2 Deid challenge while annotating new datasets for this model. 
All the details regarding the nuances and explanations for the AG can be found here: [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/) ## Predicted Entities `MEDICALRECORD`, `ORGANIZATION`, `DOCTOR`, `USERNAME`, `PROFESSION`, `HEALTHPLAN`, `URL`, `CITY`, `DATE`, `LOCATION-OTHER`, `STATE`, `PATIENT`, `DEVICE`, `COUNTRY`, `ZIP`, `PHONE`, `HOSPITAL`, `EMAIL`, `IDNUM`, `STREET`, `BIOID`, `FAX`, `AGE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/deid_ner_subentity_glove_en_3.0.4_3.0_1623015533538.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/deid_ner_subentity_glove_en_3.0.4_3.0_1623015533538.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d') \ .setInputCols(['sentence', 'token']) \ .setOutputCol('embeddings') deid_ner = MedicalNerModel.pretrained("ner_deid_subentity_glove", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk_subentity") nlpPipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, glove_embeddings, deid_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame(pd.DataFrame({"text": ["""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. 
Phone +1 (302) 786-5227."""]}))) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val glove_embeddings = WordEmbeddingsModel.pretrained("glove_100d") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val deid_ner = MedicalNerModel.pretrained("ner_deid_subentity_glove", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk_subentity") val nlpPipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, glove_embeddings, deid_ner, ner_converter)) val data = Seq("A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227.").toDS.toDF("text") val result = nlpPipeline.fit(data).transform(data) ```
## Results ```bash +-----------------------------+-------------+ |chunk |ner_label | +-----------------------------+-------------+ |2093-01-13 |DATE | |David Hale |DOCTOR | |Hendrickson, Ora |PATIENT | |7194334 |MEDICALRECORD| |01/13/93 |DATE | |Oliveira |DOCTOR | |25 |AGE | |1-11-2000 |DATE | |Cocke County Baptist Hospital|HOSPITAL | |0295 Keats Street |STREET | |+1 (302) 786-5227 |PHONE | +-----------------------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity_glove| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source A custom dataset created from the i2b2-PHI train set and its augmented version is used. --- layout: model title: Translate Yoruba to English Pipeline author: John Snow Labs name: translate_yo_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, yo, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `yo` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_yo_en_xx_2.7.0_2.4_1609688368244.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_yo_en_xx_2.7.0_2.4_1609688368244.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_yo_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_yo_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.yo.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_yo_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Sentence Entity Resolver for UMLS CUI Codes (Disease or Syndrome) author: John Snow Labs name: sbiobertresolve_umls_disease_syndrome date: 2021-10-11 tags: [entity_resolution, licensed, clinical, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.2.3 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities to UMLS CUI codes. It is trained on the `2021AB` UMLS dataset. The complete dataset has 127 different categories, and this model is trained on the `Disease or Syndrome` category using `sbiobert_base_cased_mli` embeddings. ## Predicted Entities `Predicts UMLS codes for Diseases & Syndromes medical concepts` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_disease_syndrome_en_3.2.3_3.0_1633911418710.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_disease_syndrome_en_3.2.3_3.0_1633911418710.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The ```sbiobertresolve_umls_disease_syndrome``` resolver model must be used with ```sbiobert_base_cased_mli``` as embeddings and ```ner_jsl``` as the NER model. 
The following entities should be set in ```.setWhiteList()```: ```Cerebrovascular_Disease, Communicable_Disease, Diabetes, Disease_Syndrome_Disorder, Heart_Disease, Hyperlipidemia, Hypertension, Injury_or_Poisoning, Kidney_Disease, Obesity, Oncological, Overweight, Psychological_Condition, Symptom, VS_Finding, ImagingFindings, EKG_Findings```.
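Conceptually, the whitelist makes the NER converter keep only chunks whose label is in the allowed set, so that only disease-related chunks reach the resolver. A minimal pure-Python sketch of that filtering step (the chunk data here is invented purely for illustration):

```python
# Labels the resolver should receive, as listed above for .setWhiteList().
WHITELIST = {
    "Cerebrovascular_Disease", "Communicable_Disease", "Diabetes",
    "Disease_Syndrome_Disorder", "Heart_Disease", "Hyperlipidemia",
    "Hypertension", "Injury_or_Poisoning", "Kidney_Disease", "Obesity",
    "Oncological", "Overweight", "Psychological_Condition", "Symptom",
    "VS_Finding", "ImagingFindings", "EKG_Findings",
}

def filter_chunks(chunks, whitelist=WHITELIST):
    """Keep only (text, label) chunks whose label is whitelisted,
    mimicking what the converter's whitelist does inside the pipeline."""
    return [c for c in chunks if c[1] in whitelist]

chunks = [("gestational diabetes mellitus", "Diabetes"),
          ("28-year-old", "Age"),
          ("polyuria", "Symptom")]
kept = filter_chunks(chunks)  # the "Age" chunk is dropped
```

In the actual pipeline this is configured once via `setWhiteList([...])` on the NER converter stage rather than done by hand.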
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli",'en','clinical/models')\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_umls_disease_syndrome","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") pipeline = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver]) data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting."""]]).toDF("text") results = pipeline.fit(data).transform(data) ``` ```scala ... 
val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_umls_disease_syndrome", "en", "clinical/models") .setInputCols(Array("ner_chunk_doc", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val p_model = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.").toDF("text") val res = p_model.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.umls_disease_syndrome").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.""") ```
## Results ```bash | | chunk | code | code_description | all_k_code_desc | all_k_codes | |---:|:--------------------------------------|:---------|:--------------------------------------|:-------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | 0 | gestational diabetes mellitus | C0085207 | gestational diabetes mellitus | ['C0085207', 'C0032969', 'C2063017', 'C1283034', 'C0271663'] | ['gestational diabetes mellitus', 'pregnancy diabetes mellitus', 'pregnancy complicated by diabetes mellitus', 'maternal diabetes mellitus', 'gestational diabetes mellitus, a2'] | | 1 | subsequent type two diabetes mellitus | C0348921 | pre-existing type 2 diabetes mellitus | ['C0348921', 'C1719939', 'C0011860', 'C0877302', 'C0271640'] | ['pre-existing type 2 diabetes mellitus', 'disorder associated with type 2 diabetes mellitus', 'diabetes mellitus, type 2', 'insulin-requiring type 2 diabetes mellitus', 'secondary diabetes mellitus'] | | 2 | HTG-induced pancreatitis | C0376670 | alcohol-induced pancreatitis | ['C0376670', 'C1868971', 'C4302243', 'C0267940', 'C2350449'] | ['alcohol-induced pancreatitis', 'toxic pancreatitis', 'igg4-related pancreatitis', 'hemorrhage pancreatitis', 'graft pancreatitis'] | | 3 | an acute hepatitis | C0019159 | acute hepatitis | ['C0019159', 'C0276434', 'C0267797', 'C1386146', 'C2063407'] | ['acute hepatitis a', 'acute hepatitis a', 'acute hepatitis', 'acute infectious hepatitis', 'acute hepatitis e'] | | 4 | obesity | C0028754 | obesity | ['C0028754', 'C0342940', 'C0342942', 'C0857116', 'C1561826'] | ['obesity', 'abdominal obesity', 'generalized obesity', 'obesity gross', 'overweight and obesity'] | | 5 | polyuria | C0018965 | hematuria | ['C0018965', 'C0151582', 'C3888890', 'C0268556', 'C2936921'] | ['hematuria', 'uricosuria', 'polyuria-polydipsia 
syndrome', 'saccharopinuria', 'saccharopinuria'] | | 6 | polydipsia | C0268813 | primary polydipsia | ['C0268813', 'C0030508', 'C3888890', 'C0393777', 'C0206085'] | ['primary polydipsia', 'parasomnia', 'polyuria-polydipsia syndrome', 'hypnogenic paroxysmal dystonias', 'periodic hypersomnias'] | | 7 | poor appetite | C0003123 | lack of appetite | ['C0003123', 'C0011168', 'C0162429', 'C1282895', 'C0039338'] | ['lack of appetite', 'poor swallowing', 'poor nutrition', 'neurologic unpleasant taste', 'taste dis'] | | 8 | vomiting | C0152164 | periodic vomiting | ['C0152164', 'C0267172', 'C0152517', 'C0011119', 'C0152227'] | ['periodic vomiting', 'habit vomiting', 'viral vomiting', 'choking', 'tearing'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_umls_disease_syndrome| |Compatibility:|Healthcare NLP 3.2.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_chunk_embeddings]| |Output Labels:|[output]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on the `2021AB` UMLS dataset's `Disease or Syndrome` category. https://www.nlm.nih.gov/research/umls/index.html --- layout: model title: English image_classifier_vit_rare_puppers ViTForImageClassification from eliwill author: John Snow Labs name: image_classifier_vit_rare_puppers date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rare_puppers` is an English model originally trained by eliwill. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rare_puppers_en_4.1.0_3.0_1660167215470.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rare_puppers_en_4.1.0_3.0_1660167215470.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_rare_puppers", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_rare_puppers", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
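The classifier emits one class label per image, chosen as the highest-scoring class. As a rough pure-Python illustration of that final step (the `scores` dict and its labels are made up for the example, not the actual output schema of `ViTForImageClassification`):

```python
def top_label(scores):
    """Return the (label, score) pair with the highest score."""
    return max(scores.items(), key=lambda kv: kv[1])

# Hypothetical class probabilities for one image
scores = {"samoyed": 0.07, "shiba_inu": 0.81, "corgi": 0.12}
label, score = top_label(scores)
```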
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_rare_puppers| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Translate English to Kinyarwanda Pipeline author: John Snow Labs name: translate_en_rw date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, rw, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `rw` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_rw_xx_2.7.0_2.4_1609686452332.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_rw_xx_2.7.0_2.4_1609686452332.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_rw", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_rw", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.rw').predict(text, output_level='sentence') translate_df ```
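Since the translation module is expensive on long inputs, it can help to pre-split large texts into sentence batches before calling `annotate`. A minimal pure-Python sketch of such a splitter (a naive punctuation-based heuristic, independent of Spark NLP's own sentence detectors):

```python
import re

def batch_sentences(text, batch_size=8):
    """Split text on sentence-ending punctuation and group sentences into batches."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [sentences[i:i + batch_size] for i in range(0, len(sentences), batch_size)]

batches = batch_sentences("One. Two! Three?", batch_size=2)
```

Each batch can then be passed to the pipeline in turn, keeping individual calls short.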
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_rw| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from South Caucasian Languages to English author: John Snow Labs name: opus_mt_ccs_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ccs, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `ccs` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ccs_en_xx_2.7.0_2.4_1609169090699.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ccs_en_xx_2.7.0_2.4_1609169090699.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ccs_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ccs_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ccs.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ccs_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English Bert Embeddings (from anferico) author: John Snow Labs name: bert_embeddings_bert_for_patents date: 2022-04-11 tags: [bert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-for-patents` is an English model originally trained by `anferico`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_for_patents_en_3.4.2_3.0_1649671629607.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_for_patents_en_3.4.2_3.0_1649671629607.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_for_patents","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_for_patents","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.bert_for_patents").predict("""I love Spark NLP""") ```
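The token vectors produced in the `embeddings` column are typically compared with cosine similarity. A minimal pure-Python sketch of that metric (illustrative only; the tiny vectors below are made up, not real BERT embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for two token embeddings
sim = cosine([0.2, 0.8, 0.1], [0.25, 0.75, 0.05])
```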
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_for_patents| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| ## References - https://huggingface.co/anferico/bert-for-patents - https://cloud.google.com/blog/products/ai-machine-learning/how-ai-improves-patent-analysis - https://services.google.com/fh/files/blogs/bert_for_patents_white_paper.pdf - https://github.com/google/patents-public-data/blob/master/models/BERT%20for%20Patents.md - https://github.com/ec-jrc/Patents4IPPC - https://picampus-school.com/ - https://ec.europa.eu/jrc/en --- layout: model title: Spanish RoBERTa Embeddings (Large) author: John Snow Labs name: roberta_embeddings_roberta_large_bne date: 2022-04-14 tags: [roberta, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-large-bne` is a Spanish model originally trained by `PlanTL-GOB-ES`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_large_bne_es_3.4.2_3.0_1649945069671.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_large_bne_es_3.4.2_3.0_1649945069671.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_large_bne","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_large_bne","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.roberta_large_bne").predict("""Me encanta chispa nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_roberta_large_bne| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|848.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne - https://arxiv.org/abs/1907.11692 - http://www.bne.es/en/Inicio/index.html - https://github.com/PlanTL-GOB-ES/lm-spanish - https://arxiv.org/abs/2107.07253 --- layout: model title: Liability and Contra-Liability NER (Small) author: John Snow Labs name: finner_contraliability date: 2022-12-15 tags: [en, finance, contra, liability, licensed, ner] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a financial model to detect LIABILITY and CONTRA_LIABILITY mentions in texts. - CONTRA_LIABILITY: Negative liability account that offsets the liability account (e.g. paying a debt) - LIABILITY: Current or Long-Term Liability (not from stockholders) Please note this model requires some tokenization configuration to extract the currency (see python snippet below). ## Predicted Entities `LIABILITY`, `CONTRA_LIABILITY` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_contraliability_en_1.0.0_3.0_1671136444267.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_contraliability_en_1.0.0_3.0_1671136444267.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token")\ .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$','€']) embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512) ner_model = finance.NerModel.pretrained("finner_contraliability", "en", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""Reducing total debt continues to be a top priority , and we remain on track with our target of reducing overall debt levels by $ 15 billion by the end of 2025 ."""]]).toDF("text") model = pipeline.fit(data) result = model.transform(data) result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \ .select(F.expr("cols['0']").alias("text"), F.expr("cols['1']['entity']").alias("label")).show(200, truncate = False) ```
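In the pipeline above, `NerConverter` is what groups per-token BIO tags into entity chunks. As a rough illustration of that idea, a minimal pure-Python sketch (a hypothetical helper, not the Spark NLP implementation):

```python
def bio_to_chunks(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, chunk_text) pairs."""
    chunks, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity starts here
            if current:
                chunks.append((ctype, " ".join(current)))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(tok)           # continue the open entity
        else:                             # O tag (or inconsistent I-) closes any open entity
            if current:
                chunks.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        chunks.append((ctype, " ".join(current)))
    return chunks

# Tokens and tags taken from the Results table below this section
chunks = bio_to_chunks(["Reducing", "total", "debt", "continues"],
                       ["O", "B-LIABILITY", "I-LIABILITY", "O"])
```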
## Results ```bash +---------+------------------+----------+ | token| ner_label|confidence| +---------+------------------+----------+ | Reducing| O| 0.9997| | total| B-LIABILITY| 0.7884| | debt| I-LIABILITY| 0.8479| |continues| O| 1.0| | to| O| 1.0| | be| O| 1.0| | a| O| 1.0| | top| O| 1.0| | priority| O| 1.0| | ,| O| 1.0| | and| O| 1.0| | we| O| 1.0| | remain| O| 1.0| | on| O| 1.0| | track| O| 1.0| | with| O| 1.0| | our| O| 1.0| | target| O| 1.0| | of| O| 1.0| | reducing| O| 0.9993| | overall| O| 0.9969| | debt|B-CONTRA_LIABILITY| 0.5686| | levels|I-CONTRA_LIABILITY| 0.6611| | by| O| 0.9996| | $| O| 1.0| | 15| O| 1.0| | billion| O| 1.0| | by| O| 1.0| | the| O| 1.0| | end| O| 1.0| | of| O| 1.0| | 2025| O| 1.0| | .| O| 1.0| +---------+------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_contraliability| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.2 MB| ## References In-house annotations on Earning Calls and 10-K Filings combined. 
## Benchmarking ```bash label precision recall f1-score support B-CONTRA_LIABILITY 0.7660 0.7200 0.7423 50 B-LIABILITY 0.8947 0.8990 0.8969 208 I-CONTRA_LIABILITY 0.7838 0.6304 0.6988 46 I-LIABILITY 0.8780 0.8929 0.8854 411 accuracy - - 0.9805 8299 macro-avg 0.8626 0.8267 0.8429 8299 weighted-avg 0.9803 0.9805 0.9803 8299 ``` --- layout: model title: Recognize Entities DL Pipeline for Norwegian (Bokmal) - Medium author: John Snow Labs name: entity_recognizer_md date: 2021-03-22 tags: [open_source, norwegian_bokmal, entity_recognizer_md, pipeline, "no"] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: "no" edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_md is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps. It performs most of the common text processing tasks on your dataframe {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_no_3.0.0_3.0_1616451623734.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_no_3.0.0_3.0_1616451623734.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('entity_recognizer_md', lang = 'no') annotations = pipeline.fullAnnotate("Hei fra John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("entity_recognizer_md", lang = "no") val result = pipeline.fullAnnotate("Hei fra John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hei fra John Snow Labs! "] result_df = nlu.load('no.ner.md').predict(text) result_df ```
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:-----------------------------|:----------------------------|:----------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Hei fra John Snow Labs! '] | ['Hei fra John Snow Labs!'] | ['Hei', 'fra', 'John', 'Snow', 'Labs!'] | [[0.1868100017309188,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|no| --- layout: model title: HCP Consult Classifier (BioBERT) author: John Snow Labs name: bert_sequence_classifier_vop_hcp_consult date: 2023-06-13 tags: [licensed, en, classification, vop, clinical, tensorflow] task: Text Classification language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true engine: tensorflow annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [BioBERT based](https://github.com/dmis-lab/biobert) classifier that can identify texts that mention a HCP consult. ## Predicted Entities `Consulted_By_HCP`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_hcp_consult_en_4.4.3_3.0_1686679279680.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_hcp_consult_en_4.4.3_3.0_1686679279680.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_vop_hcp_consult", "en", "clinical/models")\ .setInputCols(["document",'token'])\ .setOutputCol("prediction") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame(["hi does anybody have feet aches with anxiety, i do suffer from anxiety but never had anything wrong with my feet before", "My son has been to two doctors who gave him antibiotic drops but they also say the problem might related to allergies."], StringType()).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_vop_hcp_consult", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("prediction") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq(Array("hi does anybody have feet aches with anxiety, i do suffer from anxiety but never had anything wrong with my feet before", "My son has been to two doctors who gave him antibiotic drops but they also say the problem might related to allergies.")).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
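Under the hood, a sequence classifier like this maps per-class logits through a softmax and picks the highest-probability label. A minimal pure-Python sketch of that last step (the logit values and label order are made up for the example, not taken from this model):

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Consulted_By_HCP", "Other"]   # hypothetical label order
probs = softmax([2.1, 0.3])              # hypothetical logits for one text
predicted = labels[probs.index(max(probs))]
```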
## Results ```bash +-----------------------------------------------------------------------------------------------------------------------+------------------+ |text |result | +-----------------------------------------------------------------------------------------------------------------------+------------------+ |hi does anybody have feet aches with anxiety, i do suffer from anxiety but never had anything wrong with my feet before|[Other] | |My son has been to two doctors who gave him antibiotic drops but they also say the problem might related to allergies. |[Consulted_By_HCP]| +-----------------------------------------------------------------------------------------------------------------------+------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_vop_hcp_consult| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset “Hello,I’m 20 year old girl. I’m diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I’m taking weekly supplement of vitamin D and 1000 mcg b12 daily. I’m taking homeopathy medicine for 40 days and took 2nd test after 30 days. 
My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I’m facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.” ## Benchmarking ```bash label precision recall f1-score support Consulted_By_HCP 0.670412 0.730612 0.699219 245 Other 0.848624 0.807860 0.827740 458 accuracy - - 0.780939 703 macro_avg 0.759518 0.769236 0.763480 703 weighted_avg 0.786516 0.780939 0.782950 703 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Ewe author: John Snow Labs name: opus_mt_en_ee date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, ee, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `en` - target languages: `ee` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ee_xx_2.7.0_2.4_1609167230649.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ee_xx_2.7.0_2.4_1609167230649.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_ee", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_ee", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.ee').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_ee| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Termination Clause Binary Classifier (md) author: John Snow Labs name: legclf_termination_md date: 2022-11-25 tags: [en, legal, classification, document, agreement, contract, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `termination` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at the sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `termination` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_termination_md_en_1.0.0_3.0_1669376522291.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_termination_md_en_1.0.0_3.0_1669376522291.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_termination_md", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
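When several binary clause classifiers run over the same clause text, their outputs can be collected into a single map of True/False flags. A minimal pure-Python sketch of that aggregation (the second classifier name is hypothetical, used only to illustrate combining results):

```python
def clause_flags(results):
    """Map each classifier name to True when it predicted its own clause type
    (i.e. anything other than the 'other' label)."""
    return {name: label != "other" for name, label in results.items()}

# Hypothetical predictions from two clause classifiers on one clause
flags = clause_flags({
    "legclf_termination_md": "termination",
    "legclf_confidentiality_md": "other",  # hypothetical model name
})
```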
## Results ```bash +-------------+ | result| +-------------+ |[termination]| | [other]| | [other]| |[termination]| +-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_termination_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scrapped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash precision recall f1-score support other 0.95 0.97 0.96 39 termination 0.97 0.94 0.95 32 accuracy 0.96 71 macro avg 0.96 0.96 0.96 71 weighted avg 0.96 0.96 0.96 71 ``` --- layout: model title: Chinese Bert Embeddings (from celtics1863) author: John Snow Labs name: bert_embeddings_env_bert_chinese date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `env-bert-chinese` is a Chinese model originally trained by `celtics1863`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_env_bert_chinese_zh_3.4.2_3.0_1649670664238.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_env_bert_chinese_zh_3.4.2_3.0_1649670664238.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_env_bert_chinese","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_env_bert_chinese","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.env_bert_chinese").predict("""I love Spark NLP""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_env_bert_chinese|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|384.1 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/celtics1863/env-bert-chinese

---
layout: model
title: NER Pipeline for 10 African Languages
author: John Snow Labs
name: xlm_roberta_large_token_classifier_masakhaner_pipeline
date: 2022-06-27
tags: [masakhaner, african, xlm_roberta, multilingual, pipeline, amharic, hausa, igbo, kinyarwanda, luganda, swahilu, wolof, yoruba, nigerian, pidgin, xx, open_source]
task: Named Entity Recognition
language: xx
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on the [xlm_roberta_large_token_classifier_masakhaner](https://nlp.johnsnowlabs.com/2021/12/06/xlm_roberta_large_token_classifier_masakhaner_xx.html) NER model, which is imported from `HuggingFace`.

## Predicted Entities

`PER`, `ORG`, `LOC`, `DATE`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/Ner_masakhaner/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/Ner_masakhaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_masakhaner_pipeline_xx_4.0.0_3.0_1656369154380.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_masakhaner_pipeline_xx_4.0.0_3.0_1656369154380.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
masakhaner_pipeline = PretrainedPipeline("xlm_roberta_large_token_classifier_masakhaner_pipeline", lang = "xx")

masakhaner_pipeline.annotate("አህመድ ቫንዳ ከ3-10-2000 ጀምሮ በአዲስ አበባ ኖሯል።")
```
```scala
val masakhaner_pipeline = new PretrainedPipeline("xlm_roberta_large_token_classifier_masakhaner_pipeline", lang = "xx")

masakhaner_pipeline.annotate("አህመድ ቫንዳ ከ3-10-2000 ጀምሮ በአዲስ አበባ ኖሯል።")
```
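Outside Spark, the chunk annotations returned by a pipeline can be flattened into plain (chunk, label) pairs for downstream use. The sketch below is pure Python and illustrative only: the dict shape (a `result` string plus an `entity` key under `metadata`) mirrors Spark NLP chunk annotations, but the sample data and helper name are assumptions, not the actual API objects.

```python
# Hypothetical chunk annotations, shaped like Spark NLP's ner_chunk output:
# each carries the matched text ("result") and its label in "metadata".
sample = [
    {"result": "አህመድ ቫንዳ", "metadata": {"entity": "PER"}},
    {"result": "ከ3-10-2000 ጀምሮ", "metadata": {"entity": "DATE"}},
    {"result": "በአዲስ አበባ", "metadata": {"entity": "LOC"}},
]

def to_pairs(chunks):
    """Flatten chunk annotations into (chunk_text, label) tuples."""
    return [(c["result"], c["metadata"]["entity"]) for c in chunks]

pairs = to_pairs(sample)
```

The resulting pairs correspond to the chunk/ner_label rows shown in the Results section below.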
## Results

```bash
+----------------+---------+
|chunk           |ner_label|
+----------------+---------+
|አህመድ ቫንዳ       |PER      |
|ከ3-10-2000 ጀምሮ |DATE     |
|በአዲስ አበባ       |LOC      |
+----------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_roberta_large_token_classifier_masakhaner_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|xx|
|Size:|1.8 GB|

## Included Models

- DocumentAssembler
- SentenceDetector
- TokenizerModel
- XlmRoBertaForTokenClassification
- NerConverter
- Finisher

---
layout: model
title: English DistilBertForQuestionAnswering model (from hiiii23)
author: John Snow Labs
name: distilbert_qa_hiiii23_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `hiiii23`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_hiiii23_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725446833.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_hiiii23_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725446833.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hiiii23_base_uncased_finetuned_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hiiii23_base_uncased_finetuned_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_hiiii23").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
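After `transform`, the `answer` column holds one list of annotations per row. A small pure-Python helper shows how the predicted span could be pulled out once the rows are collected to the driver; the dicts below (with a `result` key) are a hypothetical stand-in for the real annotation objects, not the exact Spark NLP API.

```python
def first_answer(rows, default=""):
    """Return the first non-empty answer string across per-row annotation lists."""
    for row in rows:
        for ann in row:
            if ann.get("result"):
                return ann["result"]
    return default

# Hypothetical collected rows, shaped like [[{..., "result": "<span>"}], ...]
rows = [[{"annotatorType": "chunk", "result": "Clara"}]]
answer = first_answer(rows)   # → "Clara"
```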
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_hiiii23_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/hiiii23/distilbert-base-uncased-finetuned-squad --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_4_h_512 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-4_H-512` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_512_zh_4.2.4_3.0_1670325937593.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_512_zh_4.2.4_3.0_1670325937593.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_512","zh") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_512","zh")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_4_h_512|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|90.1 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/uer/chinese_roberta_L-4_H-512
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://cloud.tencent.com/

---
layout: model
title: Spanish Named Entity Recognition (Base, CAPITEL competition at IberLEF 2020 dataset)
author: John Snow Labs
name: roberta_ner_roberta_base_bne_capitel_ner
date: 2022-05-03
tags: [roberta, ner, open_source, es]
task: Named Entity Recognition
language: es
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-base-bne-capitel-ner` is a Spanish model originally trained by `PlanTL-GOB-ES`.

## Predicted Entities

`ORG`, `LOC`, `PER`, `OTH`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_ner_roberta_base_bne_capitel_ner_es_3.4.2_3.0_1651593219771.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_ner_roberta_base_bne_capitel_ner_es_3.4.2_3.0_1651593219771.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_roberta_base_bne_capitel_ner","es") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_roberta_base_bne_capitel_ner","es") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.ner.roberta_base_bne_capitel_ner").predict("""Amo Spark NLP""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_ner_roberta_base_bne_capitel_ner|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|es|
|Size:|457.2 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne-capitel-ner
- https://arxiv.org/abs/1907.11692
- http://www.bne.es/en/Inicio/index.html
- https://sites.google.com/view/capitel2020
- https://github.com/PlanTL-GOB-ES/lm-spanish
- https://arxiv.org/abs/2107.07253

---
layout: model
title: Detect PHI for Deidentification purposes (Portuguese)
author: John Snow Labs
name: ner_deid_subentity
date: 2022-04-13
tags: [deid, deidentification, pt, licensed]
task: De-identification
language: pt
edition: Healthcare NLP 3.4.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.

Deidentification NER (Portuguese) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 15 entities. This NER model is trained with a combination of custom datasets and several data augmentation mechanisms.
## Predicted Entities `PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `ID`, `STREET`, `SEX`, `EMAIL`, `ZIP`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_pt_3.4.2_3.0_1649840643338.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_pt_3.4.2_3.0_1649840643338.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.WordEmbeddingsModel.pretrained("w2v_cc_300d", "pt")\ .setInputCols(["sentence","token"])\ .setOutputCol("word_embeddings") clinical_ner = medical.NerModel.pretrained("ner_deid_subentity", "pt", "clinical/models")\ .setInputCols(["sentence","token","word_embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter]) text = [''' Detalhes do paciente. Nome do paciente: Pedro Gonçalves NHC: 2569870. Endereço: Rua Das Flores 23. Cidade/ Província: Porto. Código Postal: 21754-987. Dados de cuidados. Data de nascimento: 10/10/1963. Idade: 53 anos Sexo: Homen Data de admissão: 17/06/2016. 
Doutora: Maria Santos
''']

data = spark.createDataFrame([text]).toDF("text")

result = nlpPipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "pt")
    .setInputCols(Array("sentence","token"))
    .setOutputCol("word_embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "pt", "clinical/models")
    .setInputCols(Array("sentence","token","word_embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    clinical_ner,
    ner_converter))

val text = """Detalhes do paciente.
Nome do paciente: Pedro Gonçalves
NHC: 2569870.
Endereço: Rua Das Flores 23.
Cidade/ Província: Porto.
Código Postal: 21754-987.
Dados de cuidados.
Data de nascimento: 10/10/1963.
Idade: 53 anos
Sexo: Homen
Data de admissão: 17/06/2016.
Doutora: Maria Santos"""

val data = Seq(text).toDF("text")

val results = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("pt.med_ner.deid.subentity").predict("""
Detalhes do paciente.
Nome do paciente: Pedro Gonçalves
NHC: 2569870.
Endereço: Rua Das Flores 23.
Cidade/ Província: Porto.
Código Postal: 21754-987.
Dados de cuidados.
Data de nascimento: 10/10/1963.
Idade: 53 anos
Sexo: Homen
Data de admissão: 17/06/2016.
Doutora: Maria Santos
""")
```
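Once the pipeline has flagged PHI chunks, a common follow-up step is masking them in the raw text. This is a minimal pure-Python sketch of that step; the `chunks` list below is a hypothetical stand-in shaped like (chunk, label) pairs derived from the pipeline's `ner_chunk` output, and a simple string replacement is used rather than offset-based masking.

```python
def mask_phi(text, chunks):
    """Replace each detected PHI chunk with a <LABEL> placeholder."""
    for chunk, label in chunks:
        text = text.replace(chunk, f"<{label}>")
    return text

# Hypothetical detections for a fragment of the sample note above
sample = "Nome do paciente: Pedro Gonçalves. Data de admissão: 17/06/2016."
chunks = [("Pedro Gonçalves", "PATIENT"), ("17/06/2016", "DATE")]

masked = mask_phi(sample, chunks)
# masked == "Nome do paciente: <PATIENT>. Data de admissão: <DATE>."
```

For production de-identification, offset-based replacement (using the begin/end positions carried by the annotations) is safer than plain substring replacement, since the same surface string may occur in non-PHI contexts.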
## Results ```bash +-----------------+------------+ |chunk |ner_label | +-----------------+------------+ |Pedro Gonçalves |PATIENT | |2569870 |ID | |Rua Das Flores 23|STREET | |Porto |ORGANIZATION| |21754-987 |ID | |10/10/1963 |DATE | |53 |AGE | |17/06/2016 |DATE | |Maria Santos |DOCTOR | +-----------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity| |Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|pt| |Size:|15.0 MB| ## References - Custom John Snow Labs datasets - Data augmentation techniques ## Benchmarking ```bash label tp fp fn total precision recall f1 PATIENT 2142.0 186.0 59.0 2201.0 0.9201 0.9732 0.9459 HOSPITAL 248.0 10.0 46.0 294.0 0.9612 0.8435 0.8986 DATE 1306.0 26.0 15.0 1321.0 0.9805 0.9886 0.9845 ORGANIZATION 3038.0 31.0 156.0 3194.0 0.9899 0.9512 0.9701 CITY 1836.0 58.0 15.0 1851.0 0.9694 0.9919 0.9805 ID 56.0 8.0 7.0 63.0 0.875 0.8889 0.8819 STREET 155.0 0.0 0.0 155.0 1.0 1.0 1.0 SEX 658.0 20.0 19.0 677.0 0.9705 0.9719 0.9712 EMAIL 131.0 0.0 0.0 131.0 1.0 1.0 1.0 ZIP 125.0 2.0 0.0 125.0 0.9843 1.0 0.9921 PROFESSION 237.0 15.0 39.0 276.0 0.9405 0.8587 0.8977 PHONE 64.0 2.0 0.0 64.0 0.9697 1.0 0.9846 COUNTRY 502.0 27.0 74.0 576.0 0.949 0.8715 0.9086 DOCTOR 461.0 35.0 31.0 492.0 0.9294 0.937 0.9332 AGE 538.0 17.0 8.0 546.0 0.9694 0.9853 0.9773 macro - - - - - - 0.9551 micro - - - - - - 0.9619 ``` --- layout: model title: Word2Vec Embeddings in Georgian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-15 tags: [cc, embeddings, fastText, word2vec, ka, open_source] task: Embeddings language: ka edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ka_3.4.1_3.0_1647374663429.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ka_3.4.1_3.0_1647374663429.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ka") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["მე მიყვარს ნაპერწკალი NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ka") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("მე მიყვარს ნაპერწკალი NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ka.embed.w2v_cc_300d").predict("""მე მიყვარს ნაპერწკალი NLP""") ```
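Each token in the `embeddings` output column carries a 300-dimensional vector. Once collected to the driver, comparing two tokens reduces to cosine similarity between plain float lists; the sketch below uses toy 3-d vectors as stand-ins for the model's 300-d embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d stand-ins for the model's 300-d token vectors
v1 = [0.2, 0.1, 0.4]
v2 = [0.4, 0.2, 0.8]   # parallel to v1, so similarity is ~1.0

print(round(cosine(v1, v2), 4))   # → 1.0
```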
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|ka|
|Size:|909.1 MB|
|Case sensitive:|false|
|Dimension:|300|

---
layout: model
title: Table recognition
author: John Snow Labs
name: table_recognition
date: 2023-01-03
tags: [en, licensed, ocr, table_recognition]
task: Table Recognition
language: en
nav_key: models
edition: Visual NLP 4.1.0
spark_version: 3.3.0
supported: true
annotator: TableRecognition
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model shows the capabilities for table recognition and free-text extraction using OCR techniques. Table recognition is performed with a CascadeTabNet model. CascadeTabNet is a machine learning model for table detection in document images. It is based on a cascaded architecture, a two-stage process where the model first detects candidate regions that may contain tables, and then classifies these regions as tables or non-tables. The model is trained using a dataset of document images where the tables have been manually annotated.

The benchmark results show that the model is able to detect tables in document images with high accuracy. On the ICDAR2013 table competition dataset, CascadeTabNet achieved an F1-score of 0.85, which is considered a good score on this dataset. On the COCO-Text dataset, the model achieved a precision of 0.82 and a recall of 0.79, which are also considered good scores. In addition, the model has been evaluated on the UNLV dataset, where it achieved a precision of 0.87 and a recall of 0.83.
## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/ocr/IMAGE_TABLE_DETECTION/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/tutorials/Certification_Trainings/2.2.Spark_OCR_training_Table_recognition.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
binary_to_image = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image") \
    .setImageType(ImageType.TYPE_3BYTE_BGR)

# Detect tables on the page using a pretrained model.
# It can be fine-tuned to get more accurate results on more specific documents.
table_detector = ImageTableDetector.pretrained("general_model_table_detection_v2", "en", "clinical/ocr") \
    .setInputCol("image") \
    .setOutputCol("region")

# Draw the detected table regions on the page
draw_regions = ImageDrawRegions() \
    .setInputCol("image") \
    .setInputRegionsCol("region") \
    .setOutputCol("image_with_regions") \
    .setRectColor(Color.red)

# Extract table regions to separate images
splitter = ImageSplitRegions() \
    .setInputCol("image") \
    .setInputRegionsCol("region") \
    .setOutputCol("table_image") \
    .setDropCols("image")

# Detect cells on the table image
cell_detector = ImageTableCellDetector() \
    .setInputCol("table_image") \
    .setOutputCol("cells") \
    .setAlgoType("morphops") \
    .setDrawDetectedLines(True)

# Extract text from the detected cells
table_recognition = ImageCellsToTextTable() \
    .setInputCol("table_image") \
    .setCellsCol('cells') \
    .setMargin(3) \
    .setStrip(True) \
    .setOutputCol('table')

# Erase detected table regions
fill_regions = ImageDrawRegions() \
    .setInputCol("image") \
    .setInputRegionsCol("region") \
    .setOutputCol("image_1") \
    .setRectColor(Color.white) \
    .setFilledRect(True)

# OCR
ocr = ImageToText() \
    .setInputCol("image_1") \
    .setOutputCol("text") \
    .setOcrParams(["preserve_interword_spaces=1"]) \
    .setKeepLayout(True) \
    .setOutputSpaceCharacterWidth(8)

pipeline_table = PipelineModel(stages=[
    binary_to_image,
    table_detector,
    draw_regions,
    fill_regions,
    splitter,
    cell_detector,
    table_recognition,
    ocr
])

imagePath = "/content/cTDaR_t10096.jpg"
df = spark.read.format("binaryFile").load(imagePath)

tables_results = pipeline_table.transform(df).cache()
```
```scala
val binary_to_image = new BinaryToImage()
    .setInputCol("content")
    .setOutputCol("image")
    .setImageType(ImageType.TYPE_3BYTE_BGR)

// Detect tables on the page using a pretrained model.
// It can be fine-tuned to get more accurate results on more specific documents.
val table_detector = ImageTableDetector
    .pretrained("general_model_table_detection_v2", "en", "clinical/ocr")
    .setInputCol("image")
    .setOutputCol("region")

// Draw the detected table regions on the page
val draw_regions = new ImageDrawRegions()
    .setInputCol("image")
    .setInputRegionsCol("region")
    .setOutputCol("image_with_regions")
    .setRectColor(Color.red)

// Extract table regions to separate images
val splitter = new ImageSplitRegions()
    .setInputCol("image")
    .setInputRegionsCol("region")
    .setOutputCol("table_image")
    .setDropCols("image")

// Detect cells on the table image
val cell_detector = new ImageTableCellDetector()
    .setInputCol("table_image")
    .setOutputCol("cells")
    .setAlgoType("morphops")
    .setDrawDetectedLines(true)

// Extract text from the detected cells
val table_recognition = new ImageCellsToTextTable()
    .setInputCol("table_image")
    .setCellsCol("cells")
    .setMargin(3)
    .setStrip(true)
    .setOutputCol("table")

// Erase detected table regions
val fill_regions = new ImageDrawRegions()
    .setInputCol("image")
    .setInputRegionsCol("region")
    .setOutputCol("image_1")
    .setRectColor(Color.white)
    .setFilledRect(true)

// OCR
val ocr = new ImageToText()
    .setInputCol("image_1")
    .setOutputCol("text")
    .setOcrParams(Array("preserve_interword_spaces=1"))
    .setKeepLayout(true)
    .setOutputSpaceCharacterWidth(8)

val pipeline_table = new Pipeline().setStages(Array(
    binary_to_image,
    table_detector,
    draw_regions,
    fill_regions,
    splitter,
    cell_detector,
    table_recognition,
    ocr))

val imagePath = "/content/cTDaR_t10096.jpg"
val df = spark.read.format("binaryFile").load(imagePath)

val tables_results = pipeline_table.fit(df).transform(df).cache()
```
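Downstream of the cell-text extraction stage, recognized tables are often exported as CSV. A pure-Python sketch of that last step; the nested-list `table` below is an illustrative stand-in, not the transformer's actual output schema.

```python
import csv
import io

def cells_to_csv(rows):
    """Serialize a table, given as rows of cell strings, into CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)          # handles quoting of commas in cell text
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical recognized table: one list of cell strings per row
table = [["Name", "Qty"], ["bolts", "12"], ["nuts, small", "7"]]
csv_text = cells_to_csv(table)
```

Using the `csv` module (rather than `",".join`) keeps cells that themselves contain commas, such as `"nuts, small"`, correctly quoted.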
## Example

### Input Image

![Screenshot](/assets/images/examples_ocr/table1_1_doc.png)

### Table Structure Recognition

{%- capture td_image -%}
![Screenshot](/assets/images/examples_ocr/table1_2_detection.png)
{%- endcapture -%}

{%- capture tsr_detection -%}
![Screenshot](/assets/images/examples_ocr/table1_3_structure.png)
{%- endcapture -%}

{% include templates/input_output_image.md
input_image=td_image
output_image=tsr_detection
%}

## Model Information

{:.table-model}
|---|---|
|Model Name:|table_recognition|
|Compatibility:|Visual NLP 4.1.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|

---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from huxxx657)
author: John Snow Labs
name: roberta_qa_base_finetuned_scrambled_squad_10
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-scrambled-squad-10` is an English model originally trained by `huxxx657`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_scrambled_squad_10_en_4.3.0_3.0_1674216712953.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_scrambled_squad_10_en_4.3.0_3.0_1674216712953.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_scrambled_squad_10","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_scrambled_squad_10","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_finetuned_scrambled_squad_10|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/huxxx657/roberta-base-finetuned-scrambled-squad-10

---
layout: model
title: Legal Natural And Applied Sciences Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_natural_and_applied_sciences_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, natural_and_applied_sciences, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.

The `legclf_natural_and_applied_sciences_bert` model is a Bert Sentence Embeddings document classifier that determines whether a given document belongs to the class `Natural_and_Applied_Sciences` or not (binary classification), according to EuroVoc labels.

## Predicted Entities

`Natural_and_Applied_Sciences`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_natural_and_applied_sciences_bert_en_1.0.0_3.0_1678111577098.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_natural_and_applied_sciences_bert_en_1.0.0_3.0_1678111577098.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_natural_and_applied_sciences_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +------------------------------+ |result | +------------------------------+ |[Natural_and_Applied_Sciences]| |[Other] | |[Other] | |[Natural_and_Applied_Sciences]| +------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_natural_and_applied_sciences_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.7 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Natural_and_Applied_Sciences 0.95 0.90 0.92 99 Other 0.90 0.95 0.92 91 accuracy - - 0.92 190 macro-avg 0.92 0.92 0.92 190 weighted-avg 0.92 0.92 0.92 190 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from Li) author: John Snow Labs name: roberta_qa_li_base_squad2 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2` is an English model originally trained by `Li`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_li_base_squad2_en_4.3.0_3.0_1674218901284.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_li_base_squad2_en_4.3.0_3.0_1674218901284.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_li_base_squad2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_li_base_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_li_base_squad2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|462.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Li/roberta-base-squad2 - https://rajpurkar.github.io/SQuAD-explorer - https://rajpurkar.github.io/SQuAD-explorer/ --- layout: model title: Explain Document pipeline for Danish (explain_document_lg) author: John Snow Labs name: explain_document_lg date: 2021-03-23 tags: [open_source, danish, explain_document_lg, pipeline, da] supported: true task: [Named Entity Recognition, Lemmatization] language: da edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_lg is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps and recognizes entities. It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_da_3.0.0_3.0_1616524893608.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_lg_da_3.0.0_3.0_1616524893608.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_lg', lang = 'da') annotations = pipeline.fullAnnotate("Hej fra John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_lg", lang = "da") val result = pipeline.fullAnnotate("Hej fra John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hej fra John Snow Labs! "] result_df = nlu.load('da.explain.lg').predict(text) result_df ```
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:-----------------------------|:----------------------------|:----------------------------------------|:----------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Hej fra John Snow Labs! '] | ['Hej fra John Snow Labs!'] | ['Hej', 'fra', 'John', 'Snow', 'Labs!'] | ['Hej', 'fra', 'John', 'Snow', 'Labs!'] | ['NOUN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[-0.025171000510454,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_lg| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|da| --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_6 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-16-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_6_en_4.0.0_3.0_1655731850822.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_6_en_4.0.0_3.0_1655731850822.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_6","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_seed_6").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|415.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-16-finetuned-squad-seed-6 --- layout: model title: Translate English to Vietnamese Pipeline author: John Snow Labs name: translate_en_vi date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, vi, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `vi` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_vi_xx_2.7.0_2.4_1609698576382.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_vi_xx_2.7.0_2.4_1609698576382.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_vi", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_vi", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.vi').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_vi| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Arabic ElectraForQuestionAnswering model (from aymanm419) Version-2 author: John Snow Labs name: electra_qa_araElectra_SQUAD_ARCD_768 date: 2022-06-22 tags: [ar, open_source, electra, question_answering] task: Question Answering language: ar edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `araElectra-SQUAD-ARCD-768` is an Arabic model originally trained by `aymanm419`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_araElectra_SQUAD_ARCD_768_ar_4.0.0_3.0_1655920218769.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_araElectra_SQUAD_ARCD_768_ar_4.0.0_3.0_1655920218769.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_araElectra_SQUAD_ARCD_768","ar") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_araElectra_SQUAD_ARCD_768","ar") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.answer_question.squad_arcd.electra.768d").predict("""ما هو اسمي؟|||اسمي كلارا وأنا أعيش في بيركلي.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_araElectra_SQUAD_ARCD_768| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ar| |Size:|504.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/aymanm419/araElectra-SQUAD-ARCD-768 --- layout: model title: Forward-Looking Statements Classification author: John Snow Labs name: finclf_bert_fls date: 2022-09-06 tags: [en, finance, forward, looking, statements, fls, licensed] task: Text Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Text Classification model aimed to detect, at sentence or paragraph level, whether there are forward-looking statements (FLS). FLS are beliefs and opinions about a firm's future events or results, usually present in documents such as financial reports. Identifying forward-looking statements in corporate reports can assist investors in financial analysis. This model was originally trained on 3,500 manually annotated sentences from the Management Discussion and Analysis section of annual reports of Russell 3000 firms, and then fine-tuned in house by JSL on low-performing examples. ## Predicted Entities `Specific FLS`, `Non-specific FLS`, `Not FLS` {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_bert_fls_en_1.0.0_3.2_1662468990598.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_bert_fls_en_1.0.0_3.2_1662468990598.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = nlp.Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = finance.BertForSequenceClassification.pretrained("finclf_bert_fls", "en", "finance/models")\ .setInputCols(["document",'token'])\ .setOutputCol("class") pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) # couple of simple examples example = spark.createDataFrame([["Global economy will increase during the next year."]]).toDF("text") result = pipeline.fit(example).transform(example) # result is a DataFrame result.select("text", "class.result").show() ```
## Results ```bash +--------------------+--------------+ | text| result| +--------------------+--------------+ |Global economy wi...|[Specific FLS]| +--------------------+--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_bert_fls| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|412.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References In-house annotations on 10K financial reports and reports from Russell 3000 firms ## Benchmarking ```bash label precision recall f1-score support Specific_FLS 0.96 0.93 0.94 311 Non-specific_FLS 0.91 0.94 0.92 215 Not_FLS 0.84 0.87 0.85 70 accuracy - - 0.92 596 macro-avg 0.90 0.91 0.91 596 weighted-avg 0.93 0.92 0.92 596 ``` --- layout: model title: Pipeline to Detect PHI for Deidentification in Romanian (Word2Vec) author: John Snow Labs name: ner_deid_subentity_pipeline date: 2023-03-09 tags: [ner, deidentification, word2vec, ro, licensed] task: Named Entity Recognition language: ro edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/06/27/ner_deid_w2v_subentity_ro_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_pipeline_ro_4.3.0_3.2_1678386065654.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_pipeline_ro_4.3.0_3.2_1678386065654.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
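The weighted averages in the benchmarking table above are support-weighted means of the per-class scores, while the macro average is their unweighted mean. As a quick sanity check, the following pure-Python sketch (variable names are illustrative; the inputs are the rounded per-class F1 values and supports copied from the table) reproduces the reported weighted F1 of 0.92 — note that the macro average in the table may differ in the last digit from one recomputed here, because the published per-class scores are rounded:

```python
# Per-class F1 and support, copied from the benchmarking table above.
scores = {
    "Specific_FLS":     (0.94, 311),
    "Non-specific_FLS": (0.92, 215),
    "Not_FLS":          (0.85, 70),
}

# Weighted average: each class's F1 contributes in proportion to its support.
total = sum(n for _, n in scores.values())
weighted_f1 = sum(f1 * n for f1, n in scores.values()) / total

# Macro average: plain unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1 for f1, _ in scores.values()) / len(scores)

print(total, round(weighted_f1, 2), round(macro_f1, 2))
```

Running this gives a weighted F1 of 0.92 over 596 examples, matching the table's weighted-avg row.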
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_subentity_pipeline", "ro", "clinical/models") text = '''Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_subentity_pipeline", "ro", "clinical/models") val text = "Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401" val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-----------------------------|--------:|------:|:------------|-------------:| | 0 | Spitalul Pentru Ochi de Deal | 0 | 27 | HOSPITAL | 0.5594 | | 1 | Drumul Oprea Nr. 972 | 30 | 49 | STREET | 0.99724 | | 2 | Vaslui | 51 | 56 | CITY | 1 | | 3 | 737405 | 59 | 64 | ZIP | 1 | | 4 | +40(235)413773 | 79 | 92 | PHONE | 1 | | 5 | 25 May 2022 | 119 | 129 | DATE | 1 | | 6 | BUREAN MARIA | 158 | 169 | PATIENT | 0.9515 | | 7 | 77 | 180 | 181 | AGE | 1 | | 8 | Agota Evelyn Tımar | 191 | 208 | DOCTOR | 0.8149 | | 9 | 2450502264401 | 218 | 230 | IDNUM | 1 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|ro| |Size:|1.2 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English RobertaForQuestionAnswering (from squirro) author: John Snow Labs name: roberta_qa_distilroberta_base_squad_v2 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-base-squad_v2` is an English model originally trained by `squirro`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_distilroberta_base_squad_v2_en_4.0.0_3.0_1655728374337.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_distilroberta_base_squad_v2_en_4.0.0_3.0_1655728374337.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_distilroberta_base_squad_v2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_distilroberta_base_squad_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.distilled_base_v2").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_distilroberta_base_squad_v2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|307.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/squirro/distilroberta-base-squad_v2 - https://paperswithcode.com/sota?task=Question+Answering&dataset=The+Stanford+Question+Answering+Dataset - https://www.linkedin.com/showcase/the-squirro-academy - https://twitter.com/Squirro - https://www.instagram.com/squirro/ - http://squirro.com - https://www.linkedin.com/company/squirroag - https://www.facebook.com/squirro - https://rajpurkar.github.io/SQuAD-explorer/ --- layout: model title: Portuguese Large Legal Bert Embeddings author: John Snow Labs name: bert_embeddings_bert_large_cased_pt_lenerbr date: 2022-04-11 tags: [bert, embeddings, pt, open_source] task: Embeddings language: pt edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-cased-pt-lenerbr` is a Portuguese model originally trained by `pierreguillou`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_cased_pt_lenerbr_pt_3.4.2_3.0_1649673910638.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_cased_pt_lenerbr_pt_3.4.2_3.0_1649673910638.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_cased_pt_lenerbr","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Eu amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_cased_pt_lenerbr","pt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Eu amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.embed.bert_large_cased_pt_lenerbr").predict("""Eu amo Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_large_cased_pt_lenerbr| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|pt| |Size:|1.3 GB| |Case sensitive:|true| ## References - https://huggingface.co/pierreguillou/bert-large-cased-pt-lenerbr - https://medium.com/@pierre_guillou/nlp-modelos-e-web-app-para-reconhecimento-de-entidade-nomeada-ner-no-dom%C3%ADnio-jur%C3%ADdico-b658db55edfb - https://github.com/piegu/language-models/blob/master/Finetuning_language_model_BERtimbau_LeNER_Br.ipynb - https://paperswithcode.com/sota?task=Fill+Mask&dataset=pierreguillou%2Flener_br_finetuning_language_model --- layout: model title: Pipeline to Detect Cellular/Molecular Biology Entities author: John Snow Labs name: bert_token_classifier_ner_cellular_pipeline date: 2022-03-09 tags: [cellular, ner, bert_for_token_classifier, en, licensed, clinical] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [bert_token_classifier_ner_cellular](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_cellular_en.html) model. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CELLULAR/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_CELLULAR.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_cellular_pipeline_en_3.4.1_3.0_1646826493144.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_cellular_pipeline_en_3.4.1_3.0_1646826493144.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline cellular_pipeline = PretrainedPipeline("bert_token_classifier_ner_cellular_pipeline", "en", "clinical/models") cellular_pipeline.fullAnnotate("Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val cellular_pipeline = new PretrainedPipeline("bert_token_classifier_ner_cellular_pipeline", "en", "clinical/models") cellular_pipeline.fullAnnotate("Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. 
We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.cellular_pipeline").predict("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""") ```
## Results ```bash +-----------------------------------------------------------+---------+ |chunk |ner_label| +-----------------------------------------------------------+---------+ |intracellular signaling proteins |protein | |human T-cell leukemia virus type 1 promoter |DNA | |Tax |protein | |Tax-responsive element 1 |DNA | |cyclic AMP-responsive members |protein | |CREB/ATF family |protein | |transcription factors |protein | |Tax |protein | |human T-cell leukemia virus type 1 Tax-responsive element 1|DNA | |TRE-1 |DNA | |lacZ gene |DNA | |CYC1 promoter |DNA | |TRE-1 |DNA | |cyclic AMP response element-binding protein |protein | |CREB |protein | |CREB |protein | |GAL4 activation domain |protein | |GAD |protein | |reporter gene |DNA | |Tax |protein | +-----------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_cellular_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.4 MB| ## Included Models - DocumentAssembler - TokenizerModel - MedicalBertForTokenClassifier - NerConverter - Finisher --- layout: model title: Classifier for Genders - BIOBERT author: John Snow Labs name: classifierdl_gender_biobert date: 2020-12-16 task: Text Classification language: en nav_key: models edition: Healthcare NLP 2.6.5 spark_version: 2.4 tags: [classifier, en, clinical, licensed] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model classifies the gender of the patient in the clinical document. {:.h2_title} ## Predicted Entities `Female`, `Male`, `Unknown`.
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CLINICAL_CLASSIFICATION.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifierdl_gender_biobert_en_2.6.4_2.4_1608119684447.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifierdl_gender_biobert_en_2.6.4_2.4_1608119684447.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use To classify your text, you can use this model as part of an NLP pipeline with the following stages: DocumentAssembler, Tokenizer, BertEmbeddings (`biobert_pubmed_base_cased`), SentenceEmbeddings, ClassifierDLModel.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... biobert_embeddings = BertEmbeddings.pretrained('biobert_pubmed_base_cased') \ .setInputCols(["document","token"])\ .setOutputCol("bert_embeddings") sentence_embeddings = SentenceEmbeddings() \ .setInputCols(["document", "bert_embeddings"]) \ .setOutputCol("sentence_bert_embeddings") \ .setPoolingStrategy("AVERAGE") gender_classifier = ClassifierDLModel.pretrained('classifierdl_gender_biobert', 'en', 'clinical/models') \ .setInputCols(["document", "sentence_bert_embeddings"]) \ .setOutputCol("gender") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, biobert_embeddings, sentence_embeddings, gender_classifier]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate("""social history: shows that does not smoke cigarettes or drink alcohol, lives in a nursing home. family history: shows a family history of breast cancer.""") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val biobert_embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("document","token")) .setOutputCol("bert_embeddings") val sentence_embeddings = new SentenceEmbeddings() .setInputCols(Array("document", "bert_embeddings")) .setOutputCol("sentence_bert_embeddings") .setPoolingStrategy("AVERAGE") val genderClassifier = ClassifierDLModel.pretrained("classifierdl_gender_biobert", "en", "clinical/models") .setInputCols(Array("document", "sentence_bert_embeddings")) .setOutputCol("gender") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, biobert_embeddings, sentence_embeddings, genderClassifier)) val data = Seq("""social history: shows that does not smoke cigarettes or drink alcohol, lives in a nursing home. 
family history: shows a family history of breast cancer.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.gender.biobert").predict("""social history: shows that does not smoke cigarettes or drink alcohol, lives in a nursing home. family history: shows a family history of breast cancer.""") ```
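The `SentenceEmbeddings` stage above uses `.setPoolingStrategy("AVERAGE")`, which collapses the per-token BioBERT vectors into a single document vector by element-wise averaging. A toy sketch of that reduction (real BioBERT vectors are 768-dimensional; the 3-dimensional vectors here are illustrative only):

```python
# "AVERAGE" pooling: the sentence vector is the element-wise mean
# of its token vectors.
def average_pool(token_vectors):
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

tokens = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]  # two toy token embeddings
print(average_pool(tokens))  # [2.0, 2.0, 2.0]
```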
{:.h2_title} ## Results ```bash Female ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_gender_biobert| |Type:|ClassifierDLModel| |Compatibility:|Healthcare NLP 2.6.5+| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Case sensitive:|True| {:.h2_title} ## Data Source This model is trained on more than four thousand clinical documents (radiology reports, pathology reports, clinical visits etc.), annotated internally. {:.h2_title} ## Benchmarking ```bash label precision recall f1-score support Female 0.9224 0.8954 0.9087 239 Male 0.7895 0.8468 0.8171 124 Unknown 0.8077 0.7778 0.7925 54 accuracy 0.8657 417 macro-avg 0.8399 0.8400 0.8394 417 weighted-avg 0.8680 0.8657 0.8664 417 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Celtic Languages author: John Snow Labs name: opus_mt_en_cel date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, cel, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `en` - target languages: `cel` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_cel_xx_2.7.0_2.4_1609163584650.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_cel_xx_2.7.0_2.4_1609163584650.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_cel", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_cel", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.cel').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_cel| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from mrm8488) author: John Snow Labs name: roberta_qa_base_1b_1_finetuned_squadv2 date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-1B-1-finetuned-squadv2` is a English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_1b_1_finetuned_squadv2_en_4.2.4_3.0_1669985564517.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_1b_1_finetuned_squadv2_en_4.2.4_3.0_1669985564517.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_1b_1_finetuned_squadv2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_1b_1_finetuned_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
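Extractive QA models such as `RoBertaForQuestionAnswering` score every context token as a candidate answer start and answer end, then return the span that maximizes the combined score. A simplified sketch of that span selection with made-up scores (the real annotator handles tokenization, batching, and score normalization internally):

```python
# Pick the (start, end) token span maximizing start_score + end_score,
# subject to start <= end and a maximum span length.
def best_span(start_scores, end_scores, max_len=15):
    best = (0, 0, float("-inf"))
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            if ss + end_scores[e] > best[2]:
                best = (s, e, ss + end_scores[e])
    return best[:2]

# Toy scores for tokens ["My", "name", "is", "Clara", "."];
# "Clara" (index 3) scores highest as both start and end.
start = [0.1, 0.2, 0.1, 2.5, 0.3]
end = [0.0, 0.1, 0.2, 2.0, 0.4]
print(best_span(start, end))  # (3, 3)
```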
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_1b_1_finetuned_squadv2| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|447.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mrm8488/roberta-base-1B-1-finetuned-squadv2 - https://rajpurkar.github.io/SQuAD-explorer/explore/v2.0/dev/ - https://twitter.com/mrm8488 - https://www.linkedin.com/in/manuel-romero-cs/ --- layout: model title: Translate Polish to English Pipeline author: John Snow Labs name: translate_pl_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, pl, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `pl` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_pl_en_xx_2.7.0_2.4_1609690625108.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_pl_en_xx_2.7.0_2.4_1609690625108.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_pl_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_pl_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.pl.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_pl_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal NER for NDA (Remedies Clauses) author: John Snow Labs name: legner_nda_remedies date: 2023-04-16 tags: [en, licensed, ner, legal, nda, remedies] task: Named Entity Recognition language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a NER model, aimed to be run **only** after detecting the `REMEDIES` clause with a proper classifier (use `legmulticlf_mnda_sections_paragraph_other` for that purpose). It will extract the following entities: `CURRENCY`, `NUMERIC_REMEDY`, and `REMEDY_TYPE`. ## Predicted Entities `CURRENCY`, `NUMERIC_REMEDY`, `REMEDY_TYPE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_nda_remedies_en_1.0.0_3.0_1681687124993.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_nda_remedies_en_1.0.0_3.0_1681687124993.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_nda_remedies", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""The breaching party shall pay the non-breaching party liquidated damages of $ 1,000 per day for each day that the breach of this NDA continues."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
## Results ```bash +------------------+--------------+ |chunk |ner_label | +------------------+--------------+ |liquidated damages|REMEDY_TYPE | |$ |CURRENCY | |1,000 |NUMERIC_REMEDY| +------------------+--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_nda_remedies| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References In-house annotations on the Non-disclosure Agreements ## Benchmarking ```bash label precision recall f1-score support CURRENCY 1.00 1.00 1.00 11 NUMERIC_REMEDY 1.00 1.00 1.00 11 REMEDY_TYPE 0.86 0.94 0.90 32 micro-avg 0.91 0.96 0.94 54 macro-avg 0.95 0.98 0.97 54 weighted-avg 0.92 0.96 0.94 54 ``` --- layout: model title: Fast Neural Machine Translation Model from Atlantic-Congo Languages to English author: John Snow Labs name: opus_mt_alv_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, alv, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `alv` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_alv_en_xx_2.7.0_2.4_1609169974046.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_alv_en_xx_2.7.0_2.4_1609169974046.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_alv_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_alv_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.alv.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_alv_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Detect Social Determinants of Health Mentions author: John Snow Labs name: ner_sdoh_mentions_pipeline date: 2023-03-08 tags: [en, licensed, ner, sdoh, mentions, clinical] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_sdoh_mentions](https://nlp.johnsnowlabs.com/2022/12/18/ner_sdoh_mentions_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_mentions_pipeline_en_4.3.0_3.2_1678281267173.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_mentions_pipeline_en_4.3.0_3.2_1678281267173.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_sdoh_mentions_pipeline", "en", "clinical/models") text = '''Mr. Known lastname 9880 is a pleasant, cooperative gentleman with a long standing history (20 years) diverticulitis. He is married and has 3 children. He works in a bank. He denies any alcohol or intravenous drug use. He has been smoking for many years.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_sdoh_mentions_pipeline", "en", "clinical/models") val text = "Mr. Known lastname 9880 is a pleasant, cooperative gentleman with a long standing history (20 years) diverticulitis. He is married and has 3 children. He works in a bank. He denies any alcohol or intravenous drug use. He has been smoking for many years." val result = pipeline.fullAnnotate(text) ```
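Each chunk in the pipeline output carries a confidence score (see the Results table below this pipeline's card). A small, hypothetical post-processing helper that keeps only high-confidence entities, with plain tuples standing in for the (chunk, entity, confidence) rows:

```python
# Keep only rows whose confidence meets the threshold.
# Rows are (chunk, entity, confidence) tuples mimicking pipeline output.
def filter_by_confidence(rows, threshold):
    return [r for r in rows if r[2] >= threshold]

rows = [
    ("married", "sdoh_community", 0.9972),
    ("intravenous drug", "behavior_drug", 0.9803),
    ("smoking", "behavior_tobacco", 0.9997),
]
print(filter_by_confidence(rows, 0.99))  # drops the 0.9803 row
```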
## Results ```bash | | chunks | begin | end | entities | confidence | |---:|:-----------------|--------:|------:|:-----------------|-------------:| | 0 | married | 123 | 129 | sdoh_community | 0.9972 | | 1 | children | 141 | 148 | sdoh_community | 0.9999 | | 2 | works | 154 | 158 | sdoh_economics | 0.9995 | | 3 | alcohol | 185 | 191 | behavior_alcohol | 0.9925 | | 4 | intravenous drug | 196 | 211 | behavior_drug | 0.9803 | | 5 | smoking | 230 | 236 | behavior_tobacco | 0.9997 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_sdoh_mentions_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Yoruba Named Entity Recognition (from mbeukman) author: John Snow Labs name: xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_yoruba date: 2022-05-17 tags: [xlm_roberta, ner, token_classification, yo, open_source] task: Named Entity Recognition language: yo edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-swahili-finetuned-ner-yoruba` is a Yoruba model orginally trained by `mbeukman`. 
## Predicted Entities `PER`, `ORG`, `LOC`, `DATE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_yoruba_yo_3.4.2_3.0_1652808736465.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_yoruba_yo_3.4.2_3.0_1652808736465.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_yoruba","yo") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Mo nifẹ Snark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_yoruba","yo") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Mo nifẹ Snark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_yoruba| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|yo| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-swahili-finetuned-ner-yoruba - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://www.apache.org/licenses/LICENSE-2.0 - https://github.com/Michael-Beukma --- layout: model title: NER on Capital Calls (Small) author: John Snow Labs name: finner_capital_calls date: 2023-02-01 tags: [capital, calls, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: FinanceNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a `small` capital call NER, trained to extract contact and financial information from Capital Call Notices. 
These are the entities retrieved by the model: ``` Financial information: FUND: Name of the Fund called ORG: Organization asking the Fund for the Capital AMOUNT: Amount called by ORG to FUND DUE_DATE: Due date of the call ACCOUNT_NAME: Organization's Bank Account Name ACCOUNT_NUMBER: Organization's Bank Account Number ABA: Routing Number (ABA) BANK_NAME: Name of the Bank BANK_ADDRESS: Contact address of the Bank Contact information: PHONE: Contact Phone PERSON: Contact Person BANK_CONTACT: Person to contact in Bank EMAIL: Contact Email Other additional information, not directly involved in the call: OTHER_PERSON: Other people detected (people signing the call, people to whom the Notice is addressed, etc.) OTHER_PERCENTAGE: Percentages mentioned OTHER_DATE: Other dates mentioned, not Due Date OTHER_AMOUNT: Other amounts mentioned OTHER_ADDRESS: Other addresses mentioned OTHER_ORG: Other organizations mentioned ``` ## Predicted Entities `FUND`, `ORG`, `AMOUNT`, `DUE_DATE`, `ACCOUNT_NAME`, `ACCOUNT_NUMBER`, `BANK_NAME`, `BANK_ADDRESS`, `PHONE`, `PERSON`, `BANK_CONTACT`, `EMAIL`, `OTHER_PERSON`, `OTHER_PERCENTAGE`, `OTHER_DATE`, `OTHER_AMOUNT`, `OTHER_ADDRESS`, `OTHER_ORG`, `ABA` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FINNER_CAPITAL_CALLS){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_capital_calls_en_1.0.0_3.0_1675250939298.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_capital_calls_en_1.0.0_3.0_1675250939298.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from pyspark.sql import functions as F documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence = nlp.SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = nlp.Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512) ner = finance.NerModel.pretrained('finner_capital_calls', 'en', 'finance/models')\ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") converter = finance.NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ pipeline = nlp.Pipeline(stages=[documentAssembler, sentence, tokenizer, embeddings, ner, converter ]) df = spark.createDataFrame([[""]]).toDF("text") model = pipeline.fit(df) lp = nlp.LightPipeline(model) text = """Dear Charlotte R. Davis, We hope this message finds you well. This is to inform you that a capital call for Upfront Ventures has been initiated. The amount requested is 7000 EUR and is due on 01.01.2024. Kindly wire transfer the funds to the following account: Account Green Planet Solutions LLC Account Number 1234567-1XX Routing Number 51903761 Bank First Republic Bank If you require any further information, please do not hesitate to reach out to us at 3055 550818 or coxeric@example.com. Thank you for your prompt attention to this matter. 
Best regards, James Wilson""" result = model.transform(spark.createDataFrame([[text]]).toDF("text")) result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols")) \ .select(F.expr("cols['0']").alias("chunk"), F.expr("cols['1']['entity']").alias("ner_label"), F.expr("cols['1']['confidence']").alias("confidence")).show(truncate=False) ```
## Results ```bash +--------------------------+--------------+----------+ |chunk |ner_label |confidence| +--------------------------+--------------+----------+ |Charlotte R. Davis |OTHER_PERSON |0.971875 | |Upfront Ventures |FUND |1.0 | |7000 EUR |AMOUNT |1.0 | |01.01.2024 |DUE_DATE |1.0 | |Green Planet Solutions LLC|ACCOUNT_NAME |0.999875 | |1234567-1XX |ACCOUNT_NUMBER|1.0 | |51903761 |ABA |1.0 | |First Republic Bank |BANK_NAME |0.9999333 | |3055 550818 |PHONE |1.0 | |coxeric@example.com |EMAIL |1.0 | |James Wilson |OTHER_PERSON |1.0 | +--------------------------+--------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_capital_calls| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References In-house capital call notices ## Benchmarking ```bash Total test loss: 0.3542 Avg test loss: 0.0295 label tp fp fn prec rec f1 B-PERSON 47 0 0 1.0 1.0 1.0 I-OTHER_PERSON 214 0 0 1.0 1.0 1.0 I-AMOUNT 127 0 0 1.0 1.0 1.0 I-OTHER_PERCENTAGE 37 0 0 1.0 1.0 1.0 B-OTHER_DATE 25 0 0 1.0 1.0 1.0 I-BANK_ADDRESS 121 0 0 1.0 1.0 1.0 B-AMOUNT 170 0 0 1.0 1.0 1.0 B-OTHER_AMOUNT 409 0 0 1.0 1.0 1.0 I-ORG 311 18 0 0.9452888 1.0 0.971875 B-PHONE 79 0 0 1.0 1.0 1.0 I-DUE_DATE 153 0 0 1.0 1.0 1.0 B-FUND 124 0 0 1.0 1.0 1.0 B-ABA 97 0 0 1.0 1.0 1.0 I-ACCOUNT_NAME 223 0 0 1.0 1.0 1.0 I-OTHER_DATE 25 0 0 1.0 1.0 1.0 I-PHONE 119 0 0 1.0 1.0 1.0 B-BANK_ADDRESS 39 0 0 1.0 1.0 1.0 B-OTHER_ORG 139 0 6 1.0 0.95862067 0.97887325 I-OTHER_AMOUNT 307 0 0 1.0 1.0 1.0 I-FUND 131 0 0 1.0 1.0 1.0 I-BANK_NAME 139 0 0 1.0 1.0 1.0 B-EMAIL 73 0 0 1.0 1.0 1.0 I-BANK_CONTACT 52 0 0 1.0 1.0 1.0 B-BANK_CONTACT 30 0 0 1.0 1.0 1.0 B-OTHER_PERSON 116 0 0 1.0 1.0 1.0 B-ACCOUNT_NAME 97 0 0 1.0 1.0 1.0 B-DUE_DATE 127 0 0 1.0 1.0 1.0 B-OTHER_ADDRESS 11 0 0 1.0 1.0 1.0 B-ORG 147 6 0 0.9607843 1.0 0.98 B-BANK_NAME 113 0 0 1.0 1.0 1.0 
B-OTHER_PERCENTAGE 74 0 0 1.0 1.0 1.0 I-OTHER_ADDRESS 38 0 0 1.0 1.0 1.0 B-ACCOUNT_NUMBER 97 0 0 1.0 1.0 1.0 I-PERSON 109 0 0 1.0 1.0 1.0 I-OTHER_ORG 283 0 18 1.0 0.9401993 0.969178 I-ACCOUNT_NUMBER 32 0 0 1.0 1.0 1.0 Macro-average 4435 24 24 0.997391 0.9971894 0.9972902 Micro-average 4435 24 24 0.99461764 0.99461764 0.99461764 ``` --- layout: model title: English LongformerForQuestionAnswering model (from manishiitg) author: John Snow Labs name: longformer_qa_recruit date: 2022-06-26 tags: [en, open_source, longformer, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: LongformerForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `longformer-recruit-qa` is an English model originally trained by `manishiitg`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/longformer_qa_recruit_en_4.0.0_3.0_1656255562609.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/longformer_qa_recruit_en_4.0.0_3.0_1656255562609.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_recruit","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_recruit","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.longformer.by_manishiitg").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
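Internally, an extractive question-answering annotator like the one above scores every token of the context as a potential answer start and answer end, then returns the highest-scoring valid span (start before end). A toy, pure-Python sketch of that span selection; the tokens and logit values are made up for illustration:

```python
def best_span(start_logits, end_logits, max_len=None):
    """Return (start, end) maximizing start_logits[s] + end_logits[e] with s <= e."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, len(end_logits)):
            # Optionally cap the answer length, as QA heads usually do.
            if max_len is not None and e - s + 1 > max_len:
                break
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best

# Hypothetical per-token logits for the example context.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 3.5, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]
end   = [0.0, 0.1, 0.0, 3.1, 0.2, 0.0, 0.0, 0.1, 0.4, 0.0]
s, e = best_span(start, end)
print(tokens[s:e + 1])  # ['Clara']
```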
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|longformer_qa_recruit| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|556.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/manishiitg/longformer-recruit-qa --- layout: model title: English BertForTokenClassification Cased model (from nguyenkhoa2407) author: John Snow Labs name: bert_token_classifier_autotrain_ner_favsbot date: 2022-11-30 tags: [en, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-bert-NER-favsbot` is an English model originally trained by `nguyenkhoa2407`. ## Predicted Entities `TIME`, `SORT`, `PER`, `LOC`, `TAG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autotrain_ner_favsbot_en_4.2.4_3.0_1669814444554.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autotrain_ner_favsbot_en_4.2.4_3.0_1669814444554.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autotrain_ner_favsbot","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autotrain_ner_favsbot","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
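The pipeline above emits one BIO-style tag per token in the `ner` column; to get entity chunks you would normally append Spark NLP's `NerConverter`. The grouping it performs can be sketched in plain Python; the tokens and tags below are hypothetical:

```python
def bio_to_chunks(tokens, tags):
    """Group BIO tags (B-X / I-X / O) into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # Continue the open chunk only if the label matches.
            current.append(token)
        else:
            # O tag (or a dangling I-) closes any open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Alice", "lives", "in", "New", "York"]
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_chunks(tokens, tags))  # [('Alice', 'PER'), ('New York', 'LOC')]
```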
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_autotrain_ner_favsbot| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/nguyenkhoa2407/autotrain-bert-NER-favsbot --- layout: model title: Adverse Drug Events Classifier (DistilBERT) author: John Snow Labs name: distilbert_sequence_classifier_ade date: 2022-02-08 tags: [bert, sequence_classification, en, licensed] task: Text Classification language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: MedicalDistilBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Classifies a text/sentence into two categories: `True` : The sentence is talking about a possible ADE. `False` : The sentence doesn't have any information about an ADE. This model is a [DistilBERT](https://huggingface.co/distilbert-base-cased)-based classifier. Please note that there is no bio-version of DistilBERT, so the performance may not be on par with BioBERT-based classifiers.
## Predicted Entities `True`, `False` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CLASSIFICATION_ADE/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/08.3.MedicalBertForSequenceClassification_in_SparkNLP.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/distilbert_sequence_classifier_ade_en_3.4.1_3.0_1644352732829.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/distilbert_sequence_classifier_ade_en_3.4.1_3.0_1644352732829.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequenceClassifier = MedicalDistilBertForSequenceClassification.pretrained("distilbert_sequence_classifier_ade", "en", "clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame([["I felt a bit drowsy and had blurred vision after taking Aspirin."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val sequenceClassifier = MedicalDistilBertForSequenceClassification.pretrained("distilbert_sequence_classifier_ade", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) val data = Seq("I felt a bit drowsy and had blurred vision after taking Aspirin.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.ade.seq_distilbert").predict("""I felt a bit drowsy and had blurred vision after taking Aspirin.""") ```
## Results ```bash +----------------------------------------------------------------+------+ |text |result| +----------------------------------------------------------------+------+ |I felt a bit drowsy and had blurred vision after taking Aspirin.|[True]| +----------------------------------------------------------------+------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_sequence_classifier_ade| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.3 MB| |Case sensitive:|true| |Max sentence length:|128| ## References This model is trained on a custom dataset comprising CADEC, DRUG-AE and Twimed. ## Benchmarking ```bash label precision recall f1-score support False 0.93 0.93 0.93 6935 True 0.64 0.65 0.65 1347 accuracy 0.88 0.88 0.88 8282 macro-avg 0.79 0.79 0.79 8282 weighted-avg 0.89 0.88 0.89 8282 ``` --- layout: model title: English RobertaForQuestionAnswering Cased model (from anablasi) author: John Snow Labs name: roberta_qa_model_10k date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `model_10k_qa` is an English model originally trained by `anablasi`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_model_10k_en_4.3.0_3.0_1674211482662.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_model_10k_en_4.3.0_3.0_1674211482662.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_model_10k","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_model_10k","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_model_10k| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|467.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anablasi/model_10k_qa --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_6 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-128-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_6_en_4.0.0_3.0_1657184326520.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_6_en_4.0.0_3.0_1657184326520.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_6","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-128-finetuned-squad-seed-6 --- layout: model title: English DistilBertForQuestionAnswering model (from threem) Squad1 author: John Snow Labs name: distilbert_qa_mysquadv2_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mysquadv2-finetuned-squad` is an English model originally trained by `threem`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_mysquadv2_finetuned_squad_en_4.0.0_3.0_1654728438175.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_mysquadv2_finetuned_squad_en_4.0.0_3.0_1654728438175.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_mysquadv2_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_mysquadv2_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.distil_bert.by_threem").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_mysquadv2_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/threem/mysquadv2-finetuned-squad --- layout: model title: Detect Oncology-Specific Entities author: John Snow Labs name: ner_oncology_limited_80p_for_benchmarks date: 2023-04-03 tags: [licensed, clinical, en, oncology, biomarker, treatment] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description `Important Note:` This model is trained with a partial version of the dataset used to train [ner_oncology](https://nlp.johnsnowlabs.com/2022/11/24/ner_oncology_en.html), and is meant to be used for the benchmarking runs at [LLMs Healthcare Benchmarks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/academic/LLMs_in_Healthcare). This model extracts more than 40 oncology-related entities, including therapies, tests and staging. Definitions of Predicted Entities: `Adenopathy`: Mentions of pathological findings of the lymph nodes. `Age`: All mentions of ages, past or present, related to the patient or to anybody else. `Biomarker`: Biological molecules that indicate the presence or absence of cancer, or the type of cancer. Oncogenes are excluded from this category. `Biomarker_Result`: Terms or values that are identified as the result of a biomarker. `Cancer_Dx`: Mentions of cancer diagnoses (such as “breast cancer”) or pathological types that are usually used as synonyms for “cancer” (e.g. “carcinoma”). When anatomical references are present, they are included in the Cancer_Dx extraction.
`Cancer_Score`: Clinical or imaging scores that are specific for cancer settings (e.g. “BI-RADS” or “Allred score”). `Cancer_Surgery`: Terms that indicate surgery as a form of cancer treatment. `Chemotherapy`: Mentions of chemotherapy drugs, or unspecific words such as “chemotherapy”. `Cycle_Count`: The total number of cycles being administered of an oncological therapy (e.g. “5 cycles”). `Cycle_Day`: References to the day of the cycle of oncological therapy (e.g. “day 5”). `Cycle_Number`: The number of the cycle of an oncological therapy that is being applied (e.g. “third cycle”). `Date`: Mentions of exact dates, in any format, including day number, month and/or year. `Death_Entity`: Words that indicate the death of the patient or someone else (including family members), such as “died” or “passed away”. `Direction`: Directional and laterality terms, such as “left”, “right”, “bilateral”, “upper” and “lower”. `Dosage`: The quantity prescribed by the physician for an active ingredient. `Duration`: Words indicating the duration of a treatment (e.g. “for 2 weeks”). `Frequency`: Words indicating the frequency of treatment administration (e.g. “daily” or “bid”). `Gender`: Gender-specific nouns and pronouns (including words such as “him” or “she”, and family members such as “father”). `Grade`: All pathological grading of tumors (e.g. “grade 1”) or degrees of cellular differentiation (e.g. “well-differentiated”). `Histological_Type`: Histological variants or cancer subtypes, such as “papillary”, “clear cell” or “medullary”. `Hormonal_Therapy`: Mentions of hormonal drugs used to treat cancer, or unspecific words such as “hormonal therapy”. `Imaging_Test`: Imaging tests mentioned in texts, such as “chest CT scan”. `Immunotherapy`: Mentions of immunotherapy drugs, or unspecific words such as “immunotherapy”. `Invasion`: Mentions that refer to tumor invasion, such as “invasion” or “involvement”. Metastases or lymph node involvement are excluded from this category.
`Line_Of_Therapy`: Explicit references to the line of therapy of an oncological therapy (e.g. “first-line treatment”). `Metastasis`: Terms that indicate a metastatic disease. Anatomical references are not included in these extractions. `Oncogene`: Mentions of genes that are implicated in the etiology of cancer. `Pathology_Result`: The findings of a biopsy from the pathology report that are not covered by another entity (e.g. “malignant ductal cells”). `Pathology_Test`: Mentions of biopsies or tests that use tissue samples. `Performance_Status`: Mentions of performance status scores, such as ECOG and Karnofsky. The name of the score is extracted together with the result (e.g. “ECOG performance status of 4”). `Race_Ethnicity`: The race and ethnicity categories include racial and national origin or sociocultural groups. `Radiotherapy`: Terms that indicate the use of radiotherapy. `Response_To_Treatment`: Terms describing the patient's clinical progress in relation to cancer treatment, including “recurrence”, “bad response” or “improvement”. `Relative_Date`: Temporal references that are relative to the date of the text or to any other specific date (e.g. “yesterday” or “three years later”). `Route`: Words indicating the type of administration route (such as “PO” or “transdermal”). `Site_Bone`: Anatomical terms that refer to the human skeleton. `Site_Brain`: Anatomical terms that refer to the central nervous system (including the brain stem and the cerebellum). `Site_Breast`: Anatomical terms that refer to the breasts. `Site_Liver`: Anatomical terms that refer to the liver. `Site_Lung`: Anatomical terms that refer to the lungs. `Site_Lymph_Node`: Anatomical terms that refer to lymph nodes, excluding adenopathies. `Site_Other_Body_Part`: Relevant anatomical terms that are not included in the rest of the anatomical entities. `Smoking_Status`: All mentions of smoking related to the patient or to someone else. `Staging`: Mentions of cancer stage such as “stage 2b” or “T2N1M0”.
It also includes words such as “in situ”, “early-stage” or “advanced”. `Targeted_Therapy`: Mentions of targeted therapy drugs, or unspecific words such as “targeted therapy”. `Tumor_Finding`: All nonspecific terms that may be related to tumors, either malignant or benign (for example: “mass”, “tumor”, “lesion”, or “neoplasm”). `Tumor_Size`: Size of the tumor, including numerical value and unit of measurement (e.g. “3 cm”). `Unspecific_Therapy`: Terms that indicate a known cancer therapy but that is not specific to any other therapy entity (e.g. “chemoradiotherapy” or “adjuvant therapy”). ## Predicted Entities `Histological_Type`, `Direction`, `Staging`, `Cancer_Score`, `Imaging_Test`, `Cycle_Number`, `Tumor_Finding`, `Site_Lymph_Node`, `Invasion`, `Response_To_Treatment`, `Smoking_Status`, `Tumor_Size`, `Cycle_Count`, `Adenopathy`, `Age`, `Biomarker_Result`, `Unspecific_Therapy`, `Site_Breast`, `Chemotherapy`, `Targeted_Therapy`, `Radiotherapy`, `Performance_Status`, `Pathology_Test`, `Site_Other_Body_Part`, `Cancer_Surgery`, `Line_Of_Therapy`, `Pathology_Result`, `Hormonal_Therapy`, `Site_Bone`, `Biomarker`, `Immunotherapy`, `Cycle_Day`, `Frequency`, `Route`, `Duration`, `Death_Entity`, `Metastasis`, `Site_Liver`, `Cancer_Dx`, `Grade`, `Date`, `Site_Lung`, `Site_Brain`, `Relative_Date`, `Race_Ethnicity`, `Gender`, `Oncogene`, `Dosage`, `Radiation_Dose` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_limited_80p_for_benchmarks_en_4.3.2_3.0_1680548141397.zip){:.button.button-orange} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_limited_80p_for_benchmarks_en_4.3.2_3.0_1680548141397.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_limited_80p_for_benchmarks", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["""The had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to the residual breast. The cancer recurred as a right lung metastasis 13 years later. 
The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_limited_80p_for_benchmarks", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to the residual breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash +------------------------------+-----+---+---------------------+ |chunk |begin|end|ner_label | +------------------------------+-----+---+---------------------+ |left |31 |34 |Direction | |mastectomy |36 |45 |Cancer_Surgery | |axillary lymph node dissection|54 |83 |Cancer_Surgery | |left |91 |94 |Direction | |breast cancer |96 |108|Cancer_Dx | |twenty years ago |110 |125|Relative_Date | |tumor |132 |136|Tumor_Finding | |positive |142 |149|Biomarker_Result | |ER |155 |156|Biomarker | |PR |162 |163|Biomarker | |radiotherapy |183 |194|Radiotherapy | |breast |229 |234|Site_Breast | |cancer |241 |246|Cancer_Dx | |recurred |248 |255|Response_To_Treatment| |right |262 |266|Direction | |lung |268 |271|Site_Lung | |metastasis |273 |282|Metastasis | |13 years later |284 |297|Relative_Date | |adriamycin |346 |355|Chemotherapy | |60 mg/m2 |358 |365|Dosage | |cyclophosphamide |372 |387|Chemotherapy | |600 mg/m2 |390 |398|Dosage | |six courses |406 |416|Cycle_Count | |first line |422 |431|Line_Of_Therapy | +------------------------------+-----+---+---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_limited_80p_for_benchmarks| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|15.3 MB| ## Benchmarking ```bash label tp fp fn total precision recall f1 Histological_Type 135.0 60.0 68.0 203.0 0.6923 0.665 0.6784 Direction 642.0 170.0 149.0 791.0 0.7906 0.8116 0.801 Staging 122.0 20.0 29.0 151.0 0.8592 0.8079 0.8328 Cancer_Score 14.0 8.0 8.0 22.0 0.6364 0.6364 0.6364 Imaging_Test 743.0 197.0 155.0 898.0 0.7904 0.8274 0.8085 Cycle_Number 45.0 34.0 10.0 55.0 0.5696 0.8182 0.6716 Tumor_Finding 948.0 104.0 110.0 1058.0 0.9011 0.896 0.8986 Site_Lymph_Node 199.0 62.0 57.0 256.0 0.7625 0.7773 0.7698 Invasion 116.0 33.0 20.0 136.0 0.7785 0.8529 0.814 Response_To_Treat... 
244.0 145.0 144.0 388.0 0.6272 0.6289 0.6281 Smoking_Status 54.0 7.0 2.0 56.0 0.8852 0.9643 0.9231 Cycle_Count 96.0 26.0 32.0 128.0 0.7869 0.75 0.768 Tumor_Size 205.0 43.0 49.0 254.0 0.8266 0.8071 0.8167 Adenopathy 29.0 6.0 4.0 33.0 0.8286 0.8788 0.8529 Age 212.0 14.0 18.0 230.0 0.9381 0.9217 0.9298 Biomarker_Result 593.0 138.0 122.0 715.0 0.8112 0.8294 0.8202 Unspecific_Therapy 124.0 47.0 50.0 174.0 0.7251 0.7126 0.7188 Site_Breast 96.0 13.0 14.0 110.0 0.8807 0.8727 0.8767 Chemotherapy 570.0 40.0 65.0 635.0 0.9344 0.8976 0.9157 Targeted_Therapy 173.0 11.0 17.0 190.0 0.9402 0.9105 0.9251 Radiotherapy 128.0 26.0 21.0 149.0 0.8312 0.8591 0.8449 Performance_Status 29.0 10.0 14.0 43.0 0.7436 0.6744 0.7073 Pathology_Test 395.0 157.0 105.0 500.0 0.7156 0.79 0.751 Site_Other_Body_Part 682.0 284.0 370.0 1052.0 0.706 0.6483 0.6759 Cancer_Surgery 388.0 100.0 75.0 463.0 0.7951 0.838 0.816 Line_Of_Therapy 30.0 9.0 8.0 38.0 0.7692 0.7895 0.7792 Pathology_Result 135.0 162.0 169.0 304.0 0.4545 0.4441 0.4493 Hormonal_Therapy 95.0 9.0 10.0 105.0 0.9135 0.9048 0.9091 Site_Bone 171.0 42.0 68.0 239.0 0.8028 0.7155 0.7566 Immunotherapy 57.0 31.0 13.0 70.0 0.6477 0.8143 0.7215 Biomarker 759.0 161.0 118.0 877.0 0.825 0.8655 0.8447 Cycle_Day 84.0 32.0 32.0 116.0 0.7241 0.7241 0.7241 Frequency 199.0 33.0 32.0 231.0 0.8578 0.8615 0.8596 Route 88.0 12.0 23.0 111.0 0.88 0.7928 0.8341 Duration 184.0 60.0 110.0 294.0 0.7541 0.6259 0.684 Death_Entity 36.0 3.0 2.0 38.0 0.9231 0.9474 0.9351 Metastasis 307.0 18.0 21.0 328.0 0.9446 0.936 0.9403 Site_Liver 138.0 46.0 35.0 173.0 0.75 0.7977 0.7731 Cancer_Dx 720.0 112.0 100.0 820.0 0.8654 0.878 0.8717 Grade 48.0 21.0 13.0 61.0 0.6957 0.7869 0.7385 Date 365.0 17.0 16.0 381.0 0.9555 0.958 0.9567 Site_Lung 354.0 100.0 87.0 441.0 0.7797 0.8027 0.7911 Site_Brain 133.0 28.0 59.0 192.0 0.8261 0.6927 0.7535 Relative_Date 365.0 264.0 80.0 445.0 0.5803 0.8202 0.6797 Race_Ethnicity 51.0 10.0 5.0 56.0 0.8361 0.9107 0.8718 Gender 1267.0 15.0 3.0 1270.0 0.9883 
0.9976 0.9929 Dosage 337.0 45.0 78.0 415.0 0.8822 0.812 0.8457 Oncogene 135.0 57.0 78.0 213.0 0.7031 0.6338 0.6667 Radiation_Dose 35.0 10.0 6.0 41.0 0.7778 0.8537 0.814 macro - - - - - - 0.7974 micro - - - - - - 0.8154 ``` --- layout: model title: English asr_wav2vec2_base_960h_4_gram TFWav2Vec2ForCTC from patrickvonplaten author: John Snow Labs name: asr_wav2vec2_base_960h_4_gram date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_960h_4_gram` is an English model originally trained by patrickvonplaten. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_960h_4_gram_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_960h_4_gram_en_4.2.0_3.0_1664022178956.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_960h_4_gram_en_4.2.0_3.0_1664022178956.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_960h_4_gram", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_960h_4_gram", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
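The Wav2Vec2ForCTC annotator used above emits one symbol prediction per audio frame and then collapses those frames into a transcript with CTC decoding: repeated symbols are merged and blank symbols are dropped. A toy, model-free sketch of the greedy collapse rule (the frame symbols below are fabricated for illustration; a real model produces one per ~20 ms of audio):

```python
# Toy greedy CTC collapse, the decoding rule behind Wav2Vec2ForCTC outputs.
# The frame-level symbols are made up; a real model emits them per audio frame.
BLANK = "_"
frames = ["h", "h", "_", "e", "e", "_", "l", "l", "_", "l", "o", "o"]

def ctc_collapse(frames):
    out = []
    prev = None
    for sym in frames:
        # Keep a symbol only when it differs from the previous frame and
        # is not the CTC blank; this merges repeats and removes blanks.
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

print(ctc_collapse(frames))  # "hello"
```

Note how the blank between the two `l` runs is what allows a genuinely doubled letter to survive the merge.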
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_960h_4_gram| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|227.6 MB| --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_2 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-16-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_2_en_4.0.0_3.0_1657184494597.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_2_en_4.0.0_3.0_1657184494597.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
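Under the hood, an extractive question-answering head like this one scores every context token as a possible answer start and as a possible answer end; the returned answer is the best-scoring start–end span. A toy, model-free sketch of that span-decoding step (the scores below are fabricated for illustration; a real model produces them from the question and context):

```python
# Toy span decoding as performed on the logits of an extractive QA head.
# The scores are made up for illustration; they are NOT real model output.
context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.0, 0.0, 0.9, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0]
end_scores   = [0.0, 0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0, 0.3, 0.0]

# Search all valid spans (start <= end) for the highest combined score.
best = max(
    ((s, e) for s in range(len(context)) for e in range(s, len(context))),
    key=lambda se: start_scores[se[0]] + end_scores[se[1]],
)
answer = " ".join(context[best[0]:best[1] + 1])
print(answer)  # "Clara"
```

Production decoders also cap the span length and skip spans overlapping the question, but the core argmax over start/end pairs is the same.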
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-16-finetuned-squad-seed-2 --- layout: model title: Icelandic RobertaForQuestionAnswering Cased model (from vesteinn) author: John Snow Labs name: roberta_qa_icebert date: 2022-12-02 tags: [is, open_source, roberta, question_answering, tensorflow] task: Question Answering language: is edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `IceBERT-QA` is an Icelandic model originally trained by `vesteinn`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_icebert_is_4.2.4_3.0_1669972802947.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_icebert_is_4.2.4_3.0_1669972802947.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_icebert","is") \ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_icebert","is") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_icebert| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|is| |Size:|463.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/vesteinn/IceBERT-QA --- layout: model title: Hindi DistilBERT Embeddings (from Geotrend) author: John Snow Labs name: distilbert_embeddings_distilbert_base_hi_cased date: 2022-04-12 tags: [distilbert, embeddings, hi, open_source] task: Embeddings language: hi edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-hi-cased` is a Hindi model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_hi_cased_hi_3.4.2_3.0_1649783460616.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_hi_cased_hi_3.4.2_3.0_1649783460616.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_hi_cased","hi") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["मुझे स्पार्क एनएलपी पसंद है"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_hi_cased","hi") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("मुझे स्पार्क एनएलपी पसंद है").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("hi.embed.distilbert_base_hi_cased").predict("""मुझे स्पार्क एनएलपी पसंद है""") ```
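The `embeddings` column produced above holds one dense vector per token; downstream tasks usually compare such vectors with cosine similarity. A minimal, model-free sketch of that comparison (the 4-dimensional vectors are fabricated for illustration; the real DistilBERT output has 768 dimensions):

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Illustrative toy vectors, NOT real model output.
v_king  = [0.9, 0.1, 0.4, 0.2]
v_queen = [0.8, 0.2, 0.5, 0.1]
v_car   = [0.1, 0.9, 0.0, 0.7]

# Semantically related words should score higher than unrelated ones.
print(cosine(v_king, v_queen) > cosine(v_king, v_car))  # True
```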
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_base_hi_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|hi| |Size:|177.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/distilbert-base-hi-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Spanish RobertaForTokenClassification Cased model (from mrm8488) author: John Snow Labs name: roberta_ner_finetuned_bioclinical date: 2022-07-18 tags: [open_source, roberta, ner, bioclinical, es] task: Named Entity Recognition language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa NER model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bioclinical-roberta-es-finenuned-clinical-ner` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_ner_finetuned_bioclinical_es_4.0.0_3.0_1658155068450.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_ner_finetuned_bioclinical_es_4.0.0_3.0_1658155068450.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") ner = RoBertaForTokenClassification.pretrained("roberta_ner_finetuned_bioclinical","es") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, ner]) data = spark.createDataFrame([["PUT YOUR STRING HERE."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val ner = RoBertaForTokenClassification.pretrained("roberta_ner_finetuned_bioclinical","es") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, ner)) val data = Seq("PUT YOUR STRING HERE.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
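The `ner` column above carries one BIO tag per token; entity chunks are recovered by grouping a `B-` tag with its following `I-` tags (in Spark NLP this is the job of `NerConverter`). A toy, model-free sketch of that grouping (the tokens and tags below are illustrative, not real model output):

```python
# Toy BIO-to-chunk grouping, the step NerConverter performs after tagging.
# Example tokens and tags are illustrative Spanish clinical text, NOT model output.
tokens = ["Paciente", "con", "diabetes", "mellitus", "tipo", "2"]
tags   = ["O", "O", "B-DISEASE", "I-DISEASE", "I-DISEASE", "I-DISEASE"]

def bio_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag closes any open chunk and starts a new one.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)  # continue the open chunk
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(bio_chunks(tokens, tags))  # [('diabetes mellitus tipo 2', 'DISEASE')]
```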
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_ner_finetuned_bioclinical| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|es| |Size:|441.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References https://huggingface.co/mrm8488/bioclinical-roberta-es-finenuned-clinical-ner --- layout: model title: Fast Neural Machine Translation Model from Armenian to English author: John Snow Labs name: opus_mt_hy_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, hy, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `hy` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_hy_en_xx_2.7.0_2.4_1609169598231.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_hy_en_xx_2.7.0_2.4_1609169598231.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_hy_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_hy_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.hy.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
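The sentence detector sits before the MarianTransformer because the model translates one sentence at a time: documents are split, translated sentence by sentence, and rejoined. A toy sketch of that flow, where `translate()` is a placeholder stand-in for the pretrained model, not a real API:

```python
# Sentence-wise translation flow, with a fake translate() standing in for
# the Marian model. The tagged output is purely illustrative.
def translate(sentence):
    return "<en>" + sentence + "</en>"  # placeholder for the model call

def translate_document(text):
    # Naive split on periods for illustration; the real pipeline uses
    # SentenceDetectorDLModel, which handles abbreviations and edge cases.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return " ".join(translate(s) for s in sentences)

print(translate_document("First sentence. Second sentence."))
```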
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_hy_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Persian ALBERT Embeddings (from m3hrdadfi) author: John Snow Labs name: albert_embeddings_albert_fa_base_v2 date: 2022-04-14 tags: [albert, embeddings, fa, open_source] task: Embeddings language: fa edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: AlBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `albert-fa-base-v2` is a Persian model originally trained by `m3hrdadfi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_fa_base_v2_fa_3.4.2_3.0_1649954318874.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_fa_base_v2_fa_3.4.2_3.0_1649954318874.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_fa_base_v2","fa") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["من عاشق جرقه NLP هستم"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_fa_base_v2","fa") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("من عاشق جرقه NLP هستم").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fa.embed.albert").predict("""من عاشق جرقه NLP هستم""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_embeddings_albert_fa_base_v2| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|fa| |Size:|69.2 MB| |Case sensitive:|false| ## References - https://huggingface.co/m3hrdadfi/albert-fa-base-v2 - https://dumps.wikimedia.org/fawiki/ - https://github.com/miras-tech/MirasText - https://bigbangpage.com/ - https://www.chetor.com/ - https://www.eligasht.com/Blog/ - https://www.digikala.com/mag/ - https://www.ted.com/talks - https://github.com/m3hrdadfi/albert-persian - https://github.com/hooshvare/parsbert --- layout: model title: Lemmatizer (Portuguese, SpacyLookup) author: John Snow Labs name: lemma_spacylookup date: 2022-03-08 tags: [open_source, lemmatizer, pt] task: Lemmatization language: pt edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Portuguese Lemmatizer is a scalable, production-ready version of the rule-based lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_pt_3.4.1_3.0_1646753629532.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_pt_3.4.1_3.0_1646753629532.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","pt") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Você não é melhor que eu"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","pt") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("Você não é melhor que eu").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.lemma.spacylookup").predict("""Você não é melhor que eu""") ```
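A lookup lemmatizer is, at its core, a dictionary from inflected surface forms to lemmas, with out-of-vocabulary tokens passed through unchanged. A toy Python sketch of that mechanism (the dictionary entries are illustrative, NOT the model's real lookup table):

```python
# Toy dictionary-lookup lemmatization mirroring the pipeline above.
# These two entries are illustrative Portuguese pairs, not the model's data.
lookup = {"é": "ser", "somos": "ser"}

def lemmatize(tokens):
    # Out-of-vocabulary tokens fall back to their surface form, which is
    # exactly what a lookup lemmatizer does for unknown words.
    return [lookup.get(t, t) for t in tokens]

print(lemmatize(["Você", "não", "é", "melhor", "que", "eu"]))
# ['Você', 'não', 'ser', 'melhor', 'que', 'eu']
```

Note the output matches the pipeline's own result for the same sentence: only "é" is mapped to its lemma "ser".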
## Results ```bash +---------------------------------+ |result | +---------------------------------+ |[Você, não, ser, melhor, que, eu]| +---------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma_spacylookup| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|pt| |Size:|8.9 MB| --- layout: model title: Detect Assertion Status (assertion_dl_en) author: John Snow Labs name: assertion_dl_en date: 2020-01-30 task: Assertion Status language: en nav_key: models edition: Healthcare NLP 2.4.0 spark_version: 2.4 tags: [clinical, licensed, ner, en] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Deep learning named entity recognition model for assertions. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. {:.h2_title} ## Predicted Entities ``hypothetical``, ``present``, ``absent``, ``possible``, ``conditional``, ``associated_with_someone_else``.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_dl_en_2.4.0_2.4_1580237286004.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_dl_en_2.4.0_2.4_1580237286004.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter, AssertionDLModel.
{% include programmingLanguageSelectScalaPython.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained()\ .setInputCols("document")\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") clinical_assertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") nlpPipeline = Pipeline(stages=[document_assembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) light_result = LightPipeline(model).fullAnnotate('Patient has a headache for the last 2 weeks and appears anxious when she walks fast. No alopecia noted. 
She denies pain')[0] ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val clinical_assertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") .setInputCols(Array("sentence", "ner_chunk", "embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(document_assembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion)) val data = Seq("Patient has a headache for the last 2 weeks and appears anxious when she walks fast. No alopecia noted. She denies pain").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ```
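In the pipeline output, chunk *i* in `ner_chunk.result` corresponds to label *i* in `assertion.result`, so pairing entities with their assertion status is a simple zip. A model-free sketch of that pairing, using values taken from the example sentence above:

```python
# Chunk i in ner_chunk.result lines up with label i in assertion.result.
# Values mirror the example sentence run through the pipeline above.
chunks = ["a headache", "anxious", "alopecia", "pain"]
labels = ["present", "conditional", "absent", "absent"]

pairs = dict(zip(chunks, labels))
print(pairs["alopecia"])  # absent
```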
## Results The output is a dataframe with a sentence per row and an ``"assertion"`` column containing all of the assertion labels in the sentence. The assertion column also contains assertion character indices, and other metadata. To get only the entity chunks and assertion labels, without the metadata, select ``"ner_chunk.result"`` and ``"assertion.result"`` from your output dataframe. ```bash | | chunks | entities | assertion | |---|------------|----------|-------------| | 0 | a headache | PROBLEM | present | | 1 | anxious | PROBLEM | conditional | | 2 | alopecia | PROBLEM | absent | | 3 | pain | PROBLEM | absent | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_dl| |Type:|ner| |Compatibility:|Spark NLP 2.4.0| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence, ner_chunk, embeddings]| |Output Labels:|[assertion]| |Language:|[en]| |Case sensitive:|false| ## Data Source Trained on 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text with 'embeddings_clinical'. 
https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ ## Benchmarking ```bash label prec rec f1 absent 0.94 0.87 0.91 associated_with_someone_else 0.81 0.73 0.76 conditional 0.78 0.24 0.37 hypothetical 0.89 0.75 0.81 possible 0.70 0.52 0.60 present 0.91 0.97 0.94 Macro-average 0.84 0.68 0.73 Micro-average 0.91 0.91 0.91 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from ms12345) author: John Snow Labs name: roberta_qa_ms12345_base_squad2_finetuned_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-finetuned-squad` is an English model originally trained by `ms12345`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ms12345_base_squad2_finetuned_squad_en_4.3.0_3.0_1674219374435.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ms12345_base_squad2_finetuned_squad_en_4.3.0_3.0_1674219374435.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ms12345_base_squad2_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ms12345_base_squad2_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_ms12345_base_squad2_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/ms12345/roberta-base-squad2-finetuned-squad --- layout: model title: Fast Neural Machine Translation Model from American Sign Language to French author: John Snow Labs name: opus_mt_ase_fr date: 2021-06-01 tags: [open_source, seq2seq, translation, ase, fr, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). source languages: ase target languages: fr {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ase_fr_xx_3.1.0_2.4_1622561105602.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ase_fr_xx_3.1.0_2.4_1622561105602.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ase_fr", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ase_fr", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.American Sign Language.translate_to.French').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ase_fr| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering model (from mrm8488) Xqua author: John Snow Labs name: distilbert_qa_multi_finetuned_for_xqua_on_tydiqa date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-multi-finetuned-for-xqua-on-tydiqa` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_multi_finetuned_for_xqua_on_tydiqa_en_4.0.0_3.0_1654727619668.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_multi_finetuned_for_xqua_on_tydiqa_en_4.0.0_3.0_1654727619668.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_multi_finetuned_for_xqua_on_tydiqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_multi_finetuned_for_xqua_on_tydiqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.tydiqa.distil_bert").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_multi_finetuned_for_xqua_on_tydiqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|505.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/distilbert-multi-finetuned-for-xqua-on-tydiqa - https://ai.google.com/research/tydiqa - https://github.com/google-research-datasets/tydiqa/blob/master/README.md#the-tasks - https://twitter.com/mrm8488 --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_10 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-32-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_10_en_4.0.0_3.0_1657184919799.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_10_en_4.0.0_3.0_1657184919799.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_10","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-32-finetuned-squad-seed-10 --- layout: model title: English asr_wav2vec2_base_100h_test TFWav2Vec2ForCTC from saahith author: John Snow Labs name: pipeline_asr_wav2vec2_base_100h_test date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_test` is an English model originally trained by saahith. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_100h_test_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_test_en_4.2.0_3.0_1664094940990.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_test_en_4.2.0_3.0_1664094940990.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_100h_test', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_100h_test", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_100h_test| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|227.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English T5ForConditionalGeneration Cased model (from osanseviero) author: John Snow Labs name: t5_finetuned_test date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-finetuned-test` is an English model originally trained by `osanseviero`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_finetuned_test_en_4.3.0_3.0_1675124670488.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_finetuned_test_en_4.3.0_3.0_1675124670488.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_finetuned_test","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_finetuned_test","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_finetuned_test| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|286.9 MB| ## References - https://huggingface.co/osanseviero/t5-finetuned-test - https://medium.com/@priya.dwivedi/fine-tuning-a-t5-transformer-for-any-summarization-task-82334c64c81 --- layout: model title: News Classifier Pipeline for German text author: John Snow Labs name: classifierdl_bert_news_pipeline date: 2021-08-13 tags: [de, classifier, pipeline, news, open_source] task: Pipeline Public language: de edition: Spark NLP 3.1.3 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pre-trained pipeline classifies German texts of news. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_DE_NEWS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_DE_NEWS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_bert_news_pipeline_de_3.1.3_2.4_1628851787696.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_bert_news_pipeline_de_3.1.3_2.4_1628851787696.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("classifierdl_bert_news_pipeline", lang = "de") result = pipeline.fullAnnotate("""Niki Lauda in einem McLaren MP 4/2 TAG Turbo. Mit diesem Gefährt sicherte sich der Österreicher 1984 seinen dritten Weltmeistertitel, einen halben (!)""") ``` ```scala val pipeline = new PretrainedPipeline("classifierdl_bert_news_pipeline", "de") val result = pipeline.fullAnnotate("Niki Lauda in einem McLaren MP 4/2 TAG Turbo. Mit diesem Gefährt sicherte sich der Österreicher 1984 seinen dritten Weltmeistertitel, einen halben (!)")(0) ```
## Results ```bash ["Sport"] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_bert_news_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.1.3+| |License:|Open Source| |Edition:|Official| |Language:|de| ## Included Models - DocumentAssembler - BertSentenceEmbeddings - ClassifierDLModel --- layout: model title: English DistilBertForTokenClassification Base Cased model (from 51la5) author: John Snow Labs name: distilbert_token_classifier_base_ner date: 2023-03-06 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-NER` is an English model originally trained by `51la5`. ## Predicted Entities `LOC`, `ORG`, `PER`, `MISC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_ner_en_4.3.1_3.0_1678133783319.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_ner_en_4.3.1_3.0_1678133783319.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_ner","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_ner","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_base_ner| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/51la5/distilbert-base-NER - https://paperswithcode.com/sota?task=Token+Classification&dataset=conll2003 --- layout: model title: English ALBERT Embeddings (x-large) author: John Snow Labs name: albert_embeddings_albert_xlarge_v1 date: 2022-04-14 tags: [albert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: AlBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `albert-xlarge-v1` is an English model originally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_xlarge_v1_en_3.4.2_3.0_1649954213986.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_xlarge_v1_en_3.4.2_3.0_1649954213986.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_xlarge_v1","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_xlarge_v1","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.albert_xlarge_v1").predict("""I love Spark NLP""") ```
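The `embeddings` column above holds one vector per token; for sentence-level tasks these are often mean-pooled into a single vector before being fed to a classifier. A minimal sketch of mean pooling in plain Python (the 2-dimensional toy vectors are illustrative only; real ALBERT-xlarge vectors are much wider):

```python
def mean_pool(token_embeddings):
    """Average a list of equal-length token vectors into one sentence vector."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[i] for vec in token_embeddings) / n for i in range(dim)]

# Toy 2-dimensional "token embeddings" for a three-token sentence
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(mean_pool(tokens))  # [0.5, 0.5]
```

In a Spark NLP pipeline the equivalent step is usually done with a `SentenceEmbeddings` annotator set to average pooling rather than by hand.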
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_embeddings_albert_xlarge_v1| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|221.6 MB| |Case sensitive:|false| ## References - https://huggingface.co/albert-xlarge-v1 - https://arxiv.org/abs/1909.11942 - https://github.com/google-research/albert - https://yknzhu.wixsite.com/mbweb - https://en.wikipedia.org/wiki/English_Wikipedia --- layout: model title: Recognize Entities DL Pipeline for Russian - Medium author: John Snow Labs name: entity_recognizer_md date: 2021-03-22 tags: [open_source, russian, entity_recognizer_md, pipeline, ru] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: ru edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_md is a pretrained pipeline that processes text with a simple sequence of basic steps, covering most of the common text processing tasks on your dataframe. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_ru_3.0.0_3.0_1616448672830.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_ru_3.0.0_3.0_1616448672830.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('entity_recognizer_md', lang = 'ru') annotations = pipeline.fullAnnotate("Здравствуйте из Джона Снежных Лабораторий! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("entity_recognizer_md", lang = "ru") val result = pipeline.fullAnnotate("Здравствуйте из Джона Снежных Лабораторий! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Здравствуйте из Джона Снежных Лабораторий! "] result_df = nlu.load('ru.ner.md').predict(text) result_df ```
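`fullAnnotate` typically returns one dictionary per input text, keyed by the pipeline's output columns, with each value a list of annotation objects whose `result` field holds the text. A minimal sketch of pulling out the entity chunks; the `SimpleNamespace` objects below are a hypothetical stand-in for Spark NLP's annotation class so the snippet runs without Spark:

```python
from types import SimpleNamespace

def extract_entities(annotations):
    """Return the plain-text chunks stored under the 'entities' key."""
    return [ann.result for ann in annotations.get("entities", [])]

# Mocked fullAnnotate() output for the Russian example above (assumption):
annotations = {
    "token": [SimpleNamespace(result=t) for t in
              ["Здравствуйте", "из", "Джона", "Снежных", "Лабораторий!"]],
    "entities": [SimpleNamespace(result="Джона Снежных Лабораторий!")],
}

print(extract_entities(annotations))  # ['Джона Снежных Лабораторий!']
```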
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:------------------------------------------------|:-----------------------------------------------|:-----------------------------------------------------------|:-----------------------------|:--------------------------------------|:-------------------------------| | 0 | ['Здравствуйте из Джона Снежных Лабораторий! '] | ['Здравствуйте из Джона Снежных Лабораторий!'] | ['Здравствуйте', 'из', 'Джона', 'Снежных', 'Лабораторий!'] | [[0.0, 0.0, 0.0, 0.0,.,...]] | ['O', 'O', 'B-LOC', 'I-LOC', 'I-LOC'] | ['Джона Снежных Лабораторий!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|ru| --- layout: model title: RE Pipeline between Body Parts and Direction Entities author: John Snow Labs name: re_bodypart_directions_pipeline date: 2023-06-13 tags: [licensed, clinical, relation_extraction, body_part, directions, en] task: Relation Extraction language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [re_bodypart_directions](https://nlp.johnsnowlabs.com/2021/01/18/re_bodypart_directions_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_bodypart_directions_pipeline_en_4.4.4_3.2_1686664392280.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_bodypart_directions_pipeline_en_4.4.4_3.2_1686664392280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("re_bodypart_directions_pipeline", "en", "clinical/models") pipeline.fullAnnotate("MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("re_bodypart_directions_pipeline", "en", "clinical/models") pipeline.fullAnnotate("MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia") ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.bodypart_directions.pipeline").predict("""MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia""") ```
## Results ```bash | index | relations | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |-------|-----------|-----------------------------|---------------|-------------|------------|-----------------------------|---------------|-------------|---------------|------------| | 0 | 1 | Direction | 35 | 39 | upper | Internal_organ_or_component | 41 | 50 | brain stem | 0.9999989 | | 1 | 0 | Direction | 35 | 39 | upper | Internal_organ_or_component | 59 | 68 | cerebellum | 0.99992585 | | 2 | 0 | Direction | 35 | 39 | upper | Internal_organ_or_component | 81 | 93 | basil ganglia | 0.9999999 | | 3 | 0 | Internal_organ_or_component | 41 | 50 | brain stem | Direction | 54 | 57 | left | 0.999811 | | 4 | 0 | Internal_organ_or_component | 41 | 50 | brain stem | Direction | 75 | 79 | right | 0.9998203 | | 5 | 1 | Direction | 54 | 57 | left | Internal_organ_or_component | 59 | 68 | cerebellum | 1.0 | | 6 | 0 | Direction | 54 | 57 | left | Internal_organ_or_component | 81 | 93 | basil ganglia | 0.97616416 | | 7 | 0 | Internal_organ_or_component | 59 | 68 | cerebellum | Direction | 75 | 79 | right | 0.953046 | | 8 | 1 | Direction | 75 | 79 | right | Internal_organ_or_component | 81 | 93 | basil ganglia | 1.0 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_bodypart_directions_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - PerceptronModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - DependencyParserModel - RelationExtractionModel --- layout: model title: Icelandic DistilBertForTokenClassification Cased model (from m3hrdadfi) author: John Snow Labs name: distilbert_token_classifier_typo_detector date: 2023-03-06 tags: [is, open_source, distilbert, token_classification, ner, tensorflow] task: 
Named Entity Recognition language: is edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `typo-detector-distilbert-is` is an Icelandic model originally trained by `m3hrdadfi`. ## Predicted Entities `TYPO` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/dtilbert_token_classifier_typo_detector_is_4.3.1_3.0_1678134296845.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/dtilbert_token_classifier_typo_detector_is_4.3.1_3.0_1678134296845.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("dtilbert_token_classifier_typo_detector","is") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("dtilbert_token_classifier_typo_detector","is") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|dtilbert_token_classifier_typo_detector| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|is| |Size:|505.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/m3hrdadfi/typo-detector-distilbert-is - https://github.com/m3hrdadfi/typo-detector/issues --- layout: model title: English asr_wav2vec2_base_timit_demo_colab11_by_sameearif88 TFWav2Vec2ForCTC from sameearif88 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab11_by_sameearif88 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab11_by_sameearif88` is an English model originally trained by sameearif88. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab11_by_sameearif88_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab11_by_sameearif88_en_4.2.0_3.0_1664021315565.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab11_by_sameearif88_en_4.2.0_3.0_1664021315565.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab11_by_sameearif88', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab11_by_sameearif88", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab11_by_sameearif88| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English XlmRoBertaForQuestionAnswering (from Dongjae) author: John Snow Labs name: xlm_roberta_qa_mrc2reader date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mrc2reader` is an English model originally trained by `Dongjae`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_mrc2reader_en_4.0.0_3.0_1655987882933.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_mrc2reader_en_4.0.0_3.0_1655987882933.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_mrc2reader","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_mrc2reader","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.xlm_roberta").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
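For intuition, span-extraction QA heads like this one score each token position as a candidate answer start and end; the predicted answer is the highest-scoring valid span. The following is a self-contained illustration of that selection step with made-up scores, not Spark NLP's internal implementation:

```python
def best_span(start_scores, end_scores, max_len=30):
    """Pick the (start, end) token pair maximizing start_scores[i] + end_scores[j],
    subject to i <= j and span length <= max_len."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best_score, best = score, (i, j)
    return best

# Toy example with hypothetical per-token scores:
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.0, 0.1, 0.0, 1.0, 0.0]
end   = [0.0, 0.1, 0.0, 4.0, 0.2, 0.0, 0.0, 0.1, 2.0, 0.0]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # → Clara
```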
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_mrc2reader| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.9 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Dongjae/mrc2reader --- layout: model title: Legal Dispute Resolve Clause Binary Classifier author: John Snow Labs name: legclf_dispute_resol_clause date: 2023-02-13 tags: [en, legal, classification, dispute, resolve, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `dispute_resol` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a True/False value for each of the legal clause models you add. ## Predicted Entities `dispute_resol`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_dispute_resol_clause_en_1.0.0_3.0_1676303229805.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_dispute_resol_clause_en_1.0.0_3.0_1676303229805.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_dispute_resol_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
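The description above recommends splitting long documents into paragraphs before classification rather than using sentence splitters. A minimal sketch of the "paragraph splitting (by multiline)" technique, splitting on blank lines (the exact splitting logic in the Legal NLP tutorial may differ):

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("DISPUTE RESOLUTION.\nAny dispute shall be settled by arbitration.\n\n"
       "GOVERNING LAW.\nThis Agreement is governed by the laws of Delaware.")
paragraphs = split_paragraphs(doc)

# Each paragraph then becomes one row of the DataFrame fed to the pipeline:
# df = spark.createDataFrame([[p] for p in paragraphs]).toDF("text")
```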
## Results ```bash +-------+ |result| +-------+ |[dispute_resol]| |[other]| |[other]| |[dispute_resol]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_dispute_resol_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support dispute_resol 0.94 0.89 0.92 19 other 0.83 0.91 0.87 11 accuracy - - 0.90 30 macro-avg 0.89 0.90 0.89 30 weighted-avg 0.90 0.90 0.90 30 ``` --- layout: model title: RE Pipeline between Problem, Test, and Findings in Reports author: John Snow Labs name: re_test_problem_finding_pipeline date: 2023-06-13 tags: [licensed, clinical, relation_extraction, problem, test, findings, en] task: Relation Extraction language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [re_test_problem_finding](https://nlp.johnsnowlabs.com/2021/04/19/re_test_problem_finding_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_test_problem_finding_pipeline_en_4.4.4_3.2_1686665114725.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_test_problem_finding_pipeline_en_4.4.4_3.2_1686665114725.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("re_test_problem_finding_pipeline", "en", "clinical/models") pipeline.fullAnnotate("Targeted biopsy of this lesion for histological correlation should be considered.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("re_test_problem_finding_pipeline", "en", "clinical/models") pipeline.fullAnnotate("Targeted biopsy of this lesion for histological correlation should be considered.") ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.test_problem_finding.pipeline").predict("""Targeted biopsy of this lesion for histological correlation should be considered.""") ```
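Relation extraction pipelines like this one first detect entity chunks and then score candidate entity pairs within a sentence. A toy sketch of the candidate-pair generation step, with label names taken from the results table below as an assumption (this is illustrative only, not the pipeline's internal logic):

```python
from itertools import product

def candidate_pairs(entities,
                    allowed=frozenset({("TEST", "PROBLEM"), ("PROCEDURE", "SYMPTOM")})):
    """Return (chunk1, chunk2) pairs whose (label1, label2) combination is allowed."""
    pairs = []
    for e1, e2 in product(entities, repeat=2):
        if e1 is not e2 and (e1["label"], e2["label"]) in allowed:
            pairs.append((e1["chunk"], e2["chunk"]))
    return pairs

ents = [{"chunk": "biopsy", "label": "PROCEDURE"},
        {"chunk": "lesion", "label": "SYMPTOM"}]
print(candidate_pairs(ents))  # → [('biopsy', 'lesion')]
```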
## Results ```bash | index | relations | entity1 | chunk1 | entity2 | chunk2 | |-------|--------------|--------------|---------------------|--------------|---------| | 0 | 1 | PROCEDURE | biopsy | SYMPTOM | lesion | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_test_problem_finding_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - PerceptronModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - DependencyParserModel - RelationExtractionModel --- layout: model title: Pipeline to Detect Clinical Entities (jsl_ner_wip_clinical) author: John Snow Labs name: jsl_ner_wip_clinical_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [jsl_ner_wip_clinical](https://nlp.johnsnowlabs.com/2021/03/31/jsl_ner_wip_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_clinical_pipeline_en_4.3.0_3.2_1678875196882.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_clinical_pipeline_en_4.3.0_3.2_1678875196882.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("jsl_ner_wip_clinical_pipeline", "en", "clinical/models") text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("jsl_ner_wip_clinical_pipeline", "en", "clinical/models") val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. 
The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl_wip_clinical.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:------------------------------------------|--------:|------:|:-----------------------------|-------------:| | 0 | 21-day-old | 17 | 26 | Age | 0.9984 | | 1 | Caucasian | 28 | 36 | Race_Ethnicity | 1 | | 2 | male | 38 | 41 | Gender | 0.9986 | | 3 | for 2 days | 48 | 57 | Duration | 0.678133 | | 4 | congestion | 62 | 71 | Symptom | 0.9693 | | 5 | mom | 75 | 77 | Gender | 0.7091 | | 6 | yellow | 99 | 104 | Modifier | 0.667 | | 7 | discharge | 106 | 114 | Symptom | 0.3037 | | 8 | nares | 135 | 139 | External_body_part_or_region | 0.89 | | 9 | she | 147 | 149 | Gender | 0.9992 | | 10 | mild | 168 | 171 | Modifier | 0.8106 | | 11 | problems with his breathing while feeding | 173 | 213 | Symptom | 0.500483 | | 12 | perioral cyanosis | 237 | 253 | Symptom | 0.54895 | | 13 | retractions | 258 | 268 | Symptom | 0.9847 | | 14 | One day ago | 272 | 282 | RelativeDate | 0.550167 | | 15 | mom | 285 | 287 | Gender | 0.573 | | 16 | Tylenol | 345 | 351 | Drug_BrandName | 0.9958 | | 17 | Baby | 354 | 357 | Age | 0.9989 | | 18 | decreased p.o. intake | 377 | 397 | Symptom | 0.22495 | | 19 | His | 400 | 402 | Gender | 0.9997 | | 20 | 20 minutes | 439 | 448 | Duration | 0.1453 | | 21 | q.2h. 
to | 450 | 457 | Frequency | 0.413667 | | 22 | 5 to 10 minutes | 459 | 473 | Duration | 0.152125 | | 23 | his | 488 | 490 | Gender | 0.9987 | | 24 | respiratory congestion | 492 | 513 | VS_Finding | 0.6458 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|jsl_ner_wip_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Ganda asr_wav2vec2_luganda_by_indonesian_nlp TFWav2Vec2ForCTC from indonesian-nlp author: John Snow Labs name: pipeline_asr_wav2vec2_luganda_by_indonesian_nlp date: 2022-09-24 tags: [wav2vec2, lg, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: lg edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_luganda_by_indonesian_nlp` is a Ganda pipeline originally trained by indonesian-nlp. NOTE: This pipeline only works on a CPU; if you need to run this pipeline on a GPU device, please use pipeline_asr_wav2vec2_luganda_by_indonesian_nlp_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_luganda_by_indonesian_nlp_lg_4.2.0_3.0_1664036315040.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_luganda_by_indonesian_nlp_lg_4.2.0_3.0_1664036315040.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_luganda_by_indonesian_nlp', lang = 'lg') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_luganda_by_indonesian_nlp", lang = "lg") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_luganda_by_indonesian_nlp| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|lg| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal Limitation Of Liability Clause Binary Classifier author: John Snow Labs name: legclf_limitation_of_liability_clause date: 2022-12-18 tags: [en, legal, classification, licensed, clause, bert, limitation, of, liability, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the limitation-of-liability clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a True/False value for each of the legal clause models you add. ## Predicted Entities `limitation-of-liability`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_limitation_of_liability_clause_en_1.0.0_3.0_1671393635939.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_limitation_of_liability_clause_en_1.0.0_3.0_1671393635939.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_limitation_of_liability_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
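Since the sentence embeddings behind this classifier handle at most 512 tokens, a quick pre-check can flag inputs that need further splitting. A rough whitespace-token heuristic (the model's actual tokenizer may count differently, so treat the threshold as approximate):

```python
def needs_splitting(text, max_tokens=512):
    """Rough check: compare the whitespace token count to the embedding limit."""
    return len(text.split()) > max_tokens

short_clause = "Neither party shall be liable for indirect damages."
long_clause = "whereas " * 600  # 600 whitespace tokens, well over the limit

print(needs_splitting(short_clause))  # → False
print(needs_splitting(long_clause))   # → True
```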
## Results ```bash +-------+ |result| +-------+ |[limitation-of-liability]| |[other]| |[other]| |[limitation-of-liability]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_limitation_of_liability_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support limitation-of-liability 0.93 0.90 0.91 29 other 0.93 0.95 0.94 39 accuracy - - 0.93 68 macro-avg 0.93 0.92 0.92 68 weighted-avg 0.93 0.93 0.93 68 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Irish author: John Snow Labs name: opus_mt_en_ga date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, ga, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
- source languages: `en` - target languages: `ga` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ga_xx_2.7.0_2.4_1609170911340.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ga_xx_2.7.0_2.4_1609170911340.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_ga", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_ga", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.ga').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_ga| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Introductory Clause Binary Classifier author: John Snow Labs name: legclf_introduction_clause date: 2022-11-17 tags: [introduction, parties, document, en, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the first introductory clause, where the Document Type and the Parties are mentioned. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a True/False value for each of the legal clause models you add. ## Predicted Entities `introduction`, `other` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_introduction_clause_en_1.0.0_3.0_1668680203953.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_introduction_clause_en_1.0.0_3.0_1668680203953.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_introduction_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[introduction]| |[other]| |[other]| |[introduction]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_introduction_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|23.2 MB| ## References Legal documents, scraped from the Internet and classified in-house, including the CUAD dataset. ## Benchmarking ```bash label precision recall f1-score support introduction 1.00 0.98 0.99 99 other 0.99 1.00 0.99 151 accuracy - - 0.99 250 macro-avg 0.99 0.99 0.99 250 weighted-avg 0.99 0.99 0.99 250 ``` --- layout: model title: Legal Indemnifications Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_indemnifications_bert date: 2023-03-05 tags: [en, legal, classification, clauses, indemnifications, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Indemnifications` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a True/False value for each of the legal clause models you add. ## Predicted Entities `Indemnifications`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_indemnifications_bert_en_1.0.0_3.0_1678050524998.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_indemnifications_bert_en_1.0.0_3.0_1678050524998.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_indemnifications_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
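The benchmarking tables in these cards report per-label precision, recall, and F1 plus macro averages. As a small sketch with made-up counts (not this model's actual confusion matrix), here is how those numbers are derived from true/false positives and false negatives:

```python
def prf(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for a binary clause classifier:
p, r, f1 = prf(tp=90, fp=10, fn=10)
print(round(p, 2), round(r, 2), round(f1, 2))  # → 0.9 0.9 0.9
```

The macro average reported in the tables is the unweighted mean of the per-label scores; the weighted average weights each label by its support.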
## Results ```bash +-------+ |result| +-------+ |[Indemnifications]| |[Other]| |[Other]| |[Indemnifications]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_indemnifications_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Indemnifications 0.95 0.89 0.92 106 Other 0.92 0.96 0.94 141 accuracy - - 0.93 247 macro-avg 0.93 0.93 0.93 247 weighted-avg 0.93 0.93 0.93 247 ``` --- layout: model title: English image_classifier_vit_base_beans_demo ViTForImageClassification from nateraw author: John Snow Labs name: image_classifier_vit_base_beans_demo date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_beans_demo` is an English model originally trained by nateraw. ## Predicted Entities `angular_leaf_spot`, `bean_rust`, `healthy` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_beans_demo_en_4.1.0_3.0_1660168105525.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_beans_demo_en_4.1.0_3.0_1660168105525.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_beans_demo", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_beans_demo", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_beans_demo| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Sentence Entity Resolver for Snomed Concepts, INT version (``sbiobert_base_cased_mli`` embeddings) author: John Snow Labs name: sbiobertresolve_snomed_findings_int date: 2021-05-16 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to Snomed codes (INT version) using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It loads about 6x faster than previous versions, and the load process is more memory-friendly: peak memory required during loading is lower, reducing the chances of OOM exceptions and thus relaxing hardware requirements. ## Predicted Entities Predicts Snomed Codes and their normalized definition for each chunk. 
{:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_findings_int_en_3.0.4_3.0_1621189624936.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_findings_int_en_3.0.4_3.0_1621189624936.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") chunk2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") snomed_int_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_findings_int","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_int_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . 
Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("sbert_embeddings") val snomed_int_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_findings_int","en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_int_resolver)) val data = Seq("""This is an 82 - year-old male with a history of prior tobacco use , 
hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.snomed.findings_int").predict("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""") ```
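Conceptually, the resolver performs a nearest-neighbor search: each NER chunk's sentence embedding is compared against precomputed embeddings of SNOMED terms using the configured `EUCLIDEAN` distance, and the closest code wins. A minimal, self-contained sketch of that lookup (the three-dimensional vectors below are toy values for illustration, not the real model's embeddings):

```python
import math

# Toy embedding index: SNOMED code -> (term, vector). Vectors are made up.
index = {
    "38341003": ("hypertensive disorder", [0.9, 0.1, 0.0]),
    "13645005": ("chronic obstructive lung disease", [0.1, 0.9, 0.2]),
    "4556007":  ("gastritis", [0.0, 0.2, 0.9]),
}

def euclidean(a, b):
    # Plain Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def resolve(chunk_embedding):
    # Return the code whose term embedding is closest to the chunk embedding.
    return min(index, key=lambda code: euclidean(chunk_embedding, index[code][1]))

print(resolve([0.85, 0.15, 0.05]))  # "38341003", nearest to "hypertensive disorder"
```

The real resolver works the same way at scale, over sentence-BERT embeddings of the full SNOMED Findings vocabulary, and also returns the runner-up candidates and their distances.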
## Results ```bash +--------------------+-----+---+---------+---------------+----------+--------------------+--------------------+ | chunk|begin|end| entity| code|confidence| resolutions| codes| +--------------------+-----+---+---------+---------------+----------+--------------------+--------------------+ | hypertension| 68| 79| PROBLEM| 266285003| 0.8867|rheumatic myocard...|266285003:::15529...| |chronic renal ins...| 83|109| PROBLEM| 236425005| 0.2470|chronic renal imp...|236425005:::90688...| | COPD| 113|116| PROBLEM| 413839001| 0.0720|chronic lung dise...|413839001:::41384...| | gastritis| 120|128| PROBLEM| 266502003| 0.3240|acute peptic ulce...|266502003:::45560...| | TIA| 136|138| PROBLEM|353101000119105| 0.0727|prostatic intraep...|353101000119105::...| |a non-ST elevatio...| 182|202| PROBLEM| 233843008| 0.2846|silent myocardial...|233843008:::71942...| |Guaiac positive s...| 208|229| PROBLEM| 168319009| 0.1167|stool culture pos...|168319009:::70396...| |cardiac catheteri...| 295|317| TEST| 301095005| 0.2137|cardiac finding::...|301095005:::25090...| | PTCA| 324|327|TREATMENT|842741000000109| 0.0631|occlusion of post...|842741000000109::...| | mid LAD lesion| 332|345| PROBLEM| 449567000| 0.0808|overriding left v...|449567000:::25342...| +--------------------+-----+---+---------+---------------+----------+--------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_snomed_findings_int| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk, sbert_embeddings]| |Output Labels:|[snomed_int_code]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on SNOMED (INT version) Findings with ``sbiobert_base_cased_mli`` sentence embeddings. 
http://www.snomed.org/ --- layout: model title: English DistilBertForQuestionAnswering model (from holtin) Squad2 author: John Snow Labs name: distilbert_qa_base_uncased_holtin_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-holtin-finetuned-squad` is an English model originally trained by `holtin`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_holtin_finetuned_squad_en_4.0.0_3.0_1654727128971.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_holtin_finetuned_squad_en_4.0.0_3.0_1654727128971.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_holtin_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_holtin_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_holtin").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_holtin_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/holtin/distilbert-base-uncased-holtin-finetuned-squad --- layout: model title: Legal Annual Bonus Clause Binary Classifier author: John Snow Labs name: legclf_annual_bonus_clause date: 2023-01-29 tags: [en, legal, classification, annual, bonus, clauses, annual_bonus, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `annual-bonus` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
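The paragraph-splitting advice above can be sketched in plain Python: break the document on blank lines and flag any piece whose rough token count exceeds the 512-token budget. This is only an illustration; a whitespace count underestimates BERT's subword token count, and the tutorial linked above uses Spark NLP annotators for the real thing.

```python
MAX_TOKENS = 512  # token budget of the sentence-embedding model

def split_paragraphs(text):
    # Split on blank lines, dropping empty pieces.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def needs_further_split(paragraph, budget=MAX_TOKENS):
    # Whitespace tokens as a cheap lower bound on subword tokens.
    return len(paragraph.split()) > budget

doc = ("SECTION 1. ANNUAL BONUS.\nThe Executive shall be eligible for an annual bonus.\n\n"
       "SECTION 2. MISCELLANEOUS.\nOther provisions follow.")
paragraphs = split_paragraphs(doc)
print(len(paragraphs))                                   # 2
print(any(needs_further_split(p) for p in paragraphs))   # False
```

Each resulting piece can then be fed to the classifier as its own row, so the model always sees a clause-sized span rather than a whole agreement.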
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `annual-bonus`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_annual_bonus_clause_en_1.0.0_3.0_1675005760165.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_annual_bonus_clause_en_1.0.0_3.0_1675005760165.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_annual_bonus_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[annual-bonus]| |[other]| |[other]| |[annual-bonus]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_annual_bonus_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support annual-bonus 1.00 0.94 0.97 33 other 0.95 1.00 0.97 39 accuracy - - 0.97 72 macro-avg 0.98 0.97 0.97 72 weighted-avg 0.97 0.97 0.97 72 ``` --- layout: model title: Pipeline to Detect Clinical Entities (WIP Greedy) author: John Snow Labs name: jsl_ner_wip_greedy_biobert_pipeline date: 2022-03-21 tags: [licensed, ner, wip, biobert, greedy, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [jsl_ner_wip_greedy_biobert](https://nlp.johnsnowlabs.com/2021/07/26/jsl_ner_wip_greedy_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_greedy_biobert_pipeline_en_3.4.1_3.0_1647866004113.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_greedy_biobert_pipeline_en_3.4.1_3.0_1647866004113.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("jsl_ner_wip_greedy_biobert_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` ```scala val pipeline = new PretrainedPipeline("jsl_ner_wip_greedy_biobert_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. 
His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.greedy_wip_biobert.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
## Results ```bash | | chunk | entity | |---:|:-----------------------------------------------|:-----------------------------| | 0 | 21-day-old | Age | | 1 | Caucasian | Race_Ethnicity | | 2 | male | Gender | | 3 | for 2 days | Duration | | 4 | congestion | Symptom | | 5 | mom | Gender | | 6 | suctioning yellow discharge | Symptom | | 7 | nares | External_body_part_or_region | | 8 | she | Gender | | 9 | mild problems with his breathing while feeding | Symptom | | 10 | perioral cyanosis | Symptom | | 11 | retractions | Symptom | | 12 | One day ago | RelativeDate | | 13 | mom | Gender | | 14 | tactile temperature | Symptom | | 15 | Tylenol | Drug | | 16 | Baby | Age | | 17 | decreased p.o. intake | Symptom | | 18 | His | Gender | | 19 | breast-feeding | External_body_part_or_region | | 20 | q.2h | Frequency | | 21 | to 5 to 10 minutes | Duration | | 22 | his | Gender | | 23 | respiratory congestion | Symptom | | 24 | He | Gender | | 25 | tired | Symptom | | 26 | fussy | Symptom | | 27 | over the past 2 days | RelativeDate | | 28 | albuterol | Drug | | 29 | ER | Clinical_Dept | | 30 | His | Gender | | 31 | urine output has also decreased | Symptom | | 32 | he | Gender | | 33 | per 24 hours | Frequency | | 34 | he | Gender | | 35 | per 24 hours | Frequency | | 36 | Mom | Gender | | 37 | diarrhea | Symptom | | 38 | His | Gender | | 39 | bowel | Internal_organ_or_component | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|jsl_ner_wip_greedy_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter --- layout: model title: Legal Alias Pipeline author: John Snow Labs name: legpipe_alias date: 2023-04-30 tags: [en, legal, ner, pipeline, alias, licensed] task: Pipeline Legal language: en edition: Legal NLP 1.0.0 spark_version: 
3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline detects names in quotes and brackets, like ("Supplier"), ("Recipient"), ("Disclosing Parties"), etc., which are very common in legal agreements to reference the parties. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legpipe_alias_en_1.0.0_3.0_1682861474127.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legpipe_alias_en_1.0.0_3.0_1682861474127.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python legal_pipeline = nlp.PretrainedPipeline("legpipe_alias", "en", "legal/models") text = ["""MUTUAL NON-DISCLOSURE AGREEMENT This Mutual Non-Disclosure Agreement (the “Agreement”) is made on _________ by and between: John Snow Labs, a Delaware corporation, registered at 16192 Coastal Highway, Lewes, Delaware 19958 (“John Snow Labs”), and Acentos, S.L, a Spanish corporation, registered at Gran Via 71, 2º floor (“Company”), (each a “party” and together the “parties”). Recitals: John Snow Labs and Company intend to explore the possibility of a business relationship between each other, whereby each party (“Discloser”) may disclose sensitive information to the other party (“Recipient”). The parties agree as follows:"""] result = legal_pipeline.annotate(text) ```
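For intuition, the quoted-alias pattern this pipeline targets (a parenthesized, quoted defined term such as (“Company”)) can be approximated with a plain regular expression. This is a rough sketch only, not the ContextualParserModel logic the pipeline actually uses:

```python
import re

# Match ( "Alias" ) with straight or curly quotes and optional inner spaces.
ALIAS = re.compile(r'\(\s*[“"]([^”"]+)[”"]\s*\)')

text = ('John Snow Labs, a Delaware corporation (“John Snow Labs”), and '
        'Acentos, S.L, a Spanish corporation (“Company”), each a “party”.')

print(ALIAS.findall(text))  # ['John Snow Labs', 'Company']
```

Note that a bare quoted term like “party” with no surrounding brackets is deliberately not matched, mirroring the pipeline's focus on bracketed party aliases.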
## Results ```bash ['(“John Snow Labs”)', '(“Company”)', '( “ Discloser ” )', '(“Recipient”)'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legpipe_alias| |Type:|pipeline| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|13.1 KB| ## Included Models - DocumentAssembler - TokenizerModel - ContextualParserModel --- layout: model title: Swahili XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_yoruba_finetuned_ner_swahili date: 2022-08-01 tags: [sw, open_source, xlm_roberta, ner] task: Named Entity Recognition language: sw edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-yoruba-finetuned-ner-swahili` is a Swahili model originally trained by `mbeukman`. ## Predicted Entities `DATE`, `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_yoruba_finetuned_ner_swahili_sw_4.1.0_3.0_1659356236533.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_yoruba_finetuned_ner_swahili_sw_4.1.0_3.0_1659356236533.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_yoruba_finetuned_ner_swahili","sw") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_yoruba_finetuned_ner_swahili","sw") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_yoruba_finetuned_ner_swahili| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|sw| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-yoruba-finetuned-ner-swahili - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://github.com/masakhane-io/masakhane-ner --- layout: model title: Finnish asr_wav2vec2_xlsr_1b_finnish TFWav2Vec2ForCTC from aapot author: John Snow Labs name: asr_wav2vec2_xlsr_1b_finnish date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_1b_finnish` is a Finnish model originally trained by aapot. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_xlsr_1b_finnish_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_1b_finnish_fi_4.2.0_3.0_1664018554548.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_1b_finnish_fi_4.2.0_3.0_1664018554548.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xlsr_1b_finnish", "fi")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xlsr_1b_finnish", "fi") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
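The snippets above assume an `audioDf` with a column of raw audio floats. As a sketch of how that column might be prepared (the file path and the commented Spark line are illustrative, and the model expects 16 kHz mono input), 16-bit PCM WAV audio can be converted to normalized floats with the Python standard library:

```python
import wave
import struct

def wav_to_floats(path):
    # Read 16-bit PCM samples and normalize to [-1.0, 1.0].
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit PCM"
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# floats = wav_to_floats("sample.wav")                               # hypothetical file
# audioDf = spark.createDataFrame([[floats]], ["audio_content"])    # column the assembler reads
```

The column name `audio_content` matches what `AudioAssembler.setInputCol` consumes in the pipeline above.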
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xlsr_1b_finnish| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|fi| |Size:|3.6 GB| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from monakth) author: John Snow Labs name: distilbert_qa_base_uncased_fine_tuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distillbert-base-uncased-fine-tuned-squad` is an English model originally trained by `monakth`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_fine_tuned_squad_en_4.3.0_3.0_1672774847284.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_fine_tuned_squad_en_4.3.0_3.0_1672774847284.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_fine_tuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_fine_tuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_fine_tuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/monakth/distillbert-base-uncased-fine-tuned-squad --- layout: model title: English image_classifier_vit_modeversion1_m6_e4 ViTForImageClassification from sudo-s author: John Snow Labs name: image_classifier_vit_modeversion1_m6_e4 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_modeversion1_m6_e4` is an English model originally trained by sudo-s.
## Predicted Entities `45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_modeversion1_m6_e4_en_4.1.0_3.0_1660171803744.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_modeversion1_m6_e4_en_4.1.0_3.0_1660171803744.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_modeversion1_m6_e4", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_modeversion1_m6_e4", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_modeversion1_m6_e4| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.3 MB| --- layout: model title: Malay ALBERT Embeddings (Large) author: John Snow Labs name: albert_embeddings_albert_large_bahasa_cased date: 2022-04-14 tags: [albert, embeddings, ms, open_source] task: Embeddings language: ms edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: AlBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `albert-large-bahasa-cased` is a Malay model originally trained by `malay-huggingface`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_large_bahasa_cased_ms_3.4.2_3.0_1649954345847.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_large_bahasa_cased_ms_3.4.2_3.0_1649954345847.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_large_bahasa_cased","ms") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Saya suka Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_large_bahasa_cased","ms") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Saya suka Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ms.embed.albert").predict("""Saya suka Spark NLP""") ```
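A common downstream use of the token embeddings produced above is semantic similarity. The cosine computation itself is plain arithmetic and can be sketched without Spark (the vectors below are hypothetical stand-ins for embedding output, not values from this model):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# hypothetical 4-dimensional stand-ins for two token embeddings
v1 = [0.2, 0.1, 0.4, 0.3]
v2 = [0.4, 0.2, 0.8, 0.6]  # same direction, twice the magnitude
print(round(cosine_similarity(v1, v2), 6))  # → 1.0
```

In practice the vectors would come from the `embeddings` column of `result` above.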
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_embeddings_albert_large_bahasa_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ms| |Size:|68.8 MB| |Case sensitive:|false| ## References - https://huggingface.co/malay-huggingface/albert-large-bahasa-cased - https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean - https://github.com/huseinzol05/malay-dataset/tree/master/corpus/pile - https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/albert --- layout: model title: Japanese BertForMaskedLM Base Cased model (from cl-tohoku) author: John Snow Labs name: bert_embeddings_base_japanese_v2 date: 2022-12-02 tags: [ja, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ja edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-japanese-v2` is a Japanese model originally trained by `cl-tohoku`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_japanese_v2_ja_4.2.4_3.0_1670018249307.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_japanese_v2_ja_4.2.4_3.0_1670018249307.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_japanese_v2","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_japanese_v2","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_japanese_v2| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Size:|417.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/cl-tohoku/bert-base-japanese-v2 - https://github.com/google-research/bert - https://pypi.org/project/unidic-lite/ - https://github.com/cl-tohoku/bert-japanese/tree/v2.0 - https://taku910.github.io/mecab/ - https://github.com/neologd/mecab-ipadic-neologd - https://github.com/polm/fugashi - https://github.com/polm/unidic-lite - https://www.tensorflow.org/tfrc/ - https://creativecommons.org/licenses/by-sa/3.0/ --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_4 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-64-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_4_en_4.0.0_3.0_1654191739285.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_4_en_4.0.0_3.0_1654191739285.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_64d_seed_4").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|378.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-64-finetuned-squad-seed-4 --- layout: model title: Translate Welsh to English Pipeline author: John Snow Labs name: translate_cy_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, cy, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `cy` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_cy_en_xx_2.7.0_2.4_1609689849644.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_cy_en_xx_2.7.0_2.4_1609689849644.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_cy_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_cy_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.cy.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_cy_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_4 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-256-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_4_en_4.0.0_3.0_1657184760188.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_4_en_4.0.0_3.0_1657184760188.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-256-finetuned-squad-seed-4 --- layout: model title: Legal Duration and termination Clause Binary Classifier author: John Snow Labs name: legclf_duration_and_termination_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `duration-and-termination` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
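The paragraph splitting mentioned above (by multiline) amounts to cutting the document on blank lines before classification. A minimal plain-Python sketch of that idea, not the workshop's implementation:

```python
import re

def split_paragraphs(document: str) -> list:
    """Split a legal document into candidate clause texts on blank lines."""
    paragraphs = re.split(r"\n\s*\n", document)
    return [p.strip() for p in paragraphs if p.strip()]

# illustrative contract fragment
contract = "1. TERM.\nThis Agreement lasts one year.\n\n2. PAYMENT.\nFees are due monthly."
for clause in split_paragraphs(contract):
    print(repr(clause))
```

Each resulting paragraph can then go into the text column fed to the classification pipeline.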
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `duration-and-termination` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_duration_and_termination_clause_en_1.0.0_3.2_1660122389421.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_duration_and_termination_clause_en_1.0.0_3.2_1660122389421.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_duration_and_termination_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +--------------------------+ |result                    | +--------------------------+ |[duration-and-termination]| |[other]                   | |[other]                   | |[duration-and-termination]| +--------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_duration_and_termination_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support duration-and-termination 0.93 0.89 0.91 28 other 0.97 0.98 0.98 107 accuracy - - 0.96 135 macro-avg 0.95 0.94 0.94 135 weighted-avg 0.96 0.96 0.96 135 ``` --- layout: model title: Detect Living Species author: John Snow Labs name: ner_living_species date: 2022-06-22 tags: [en, ner, clinical, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract living species from clinical texts, which is critical to scientific disciplines like medicine, biology, ecology/biodiversity, nutrition and agriculture. It is trained on the [LivingNER](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) corpus that is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. **NOTE :** 1. The text files were translated from Spanish with a neural machine translation system. 2. The annotations were translated with the same neural machine translation system. 3. The translated annotations were transferred to the translated text files using an annotation transfer technology.
## Predicted Entities `HUMAN`, `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_en_3.5.3_3.0_1655888659088.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_en_3.5.3_3.0_1655888659088.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\ .setInputCols("sentence","token")\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_living_species", "en","clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex.
In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_living_species", "en","clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. 
In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.living_species").predict("""42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL.""") ```
## Results ```bash +-----------------------+-------+ |ner_chunk |label | +-----------------------+-------+ |woman |HUMAN | |bacterial |SPECIES| |Fusarium spp |SPECIES| |patient |HUMAN | |species |SPECIES| |Fusarium solani complex|SPECIES| |antifungals |SPECIES| +-----------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|15.1 MB| ## References [https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) ## Benchmarking ```bash label precision recall f1-score support B-HUMAN 0.84 0.96 0.90 2950 B-SPECIES 0.73 0.92 0.81 3129 I-HUMAN 0.69 0.68 0.69 145 I-SPECIES 0.66 0.89 0.76 1166 micro-avg 0.76 0.93 0.83 7390 macro-avg 0.73 0.86 0.79 7390 weighted-avg 0.76 0.93 0.83 7390 ``` --- layout: model title: Pipeline to Detect PHI in text (ner_deid_sd_large) author: John Snow Labs name: ner_deid_sd_large_pipeline date: 2023-03-13 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deid_sd_large](https://nlp.johnsnowlabs.com/2021/04/01/ner_deid_sd_large_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_sd_large_pipeline_en_4.3.0_3.2_1678733016225.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_sd_large_pipeline_en_4.3.0_3.2_1678733016225.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_sd_large_pipeline", "en", "clinical/models") text = '''Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_sd_large_pipeline", "en", "clinical/models") val text = "Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.deid.med_ner_large.pipeline").predict("""Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.""") ```
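Each chunk returned by `fullAnnotate` carries inclusive begin/end character offsets. A typical de-identification step is to mask the original text with those offsets; a minimal plain-Python sketch (the `mask_entities` helper and its sample chunks are illustrative, not part of the pipeline):

```python
def mask_entities(text, chunks):
    """Replace each (begin, end, label) chunk with <label>.

    Offsets are inclusive, as in Spark NLP annotations; chunks are
    applied right-to-left so earlier offsets stay valid.
    """
    masked = text
    for begin, end, label in sorted(chunks, key=lambda c: c[0], reverse=True):
        masked = masked[:begin] + f"<{label}>" + masked[end + 1:]
    return masked

# illustrative offsets for a fragment of the sample text
text = "PCP : Oliveira , 25 years old ."
chunks = [(6, 13, "NAME"), (17, 18, "AGE")]
print(mask_entities(text, chunks))  # → PCP : <NAME> , <AGE> years old .
```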
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:------------------------------|--------:|------:|:------------|-------------:| | 0 | 2093-01-13 | 14 | 23 | DATE | 0.9999 | | 1 | David Hale | 27 | 36 | NAME | 0.90085 | | 2 | Hendrickson Ora | 55 | 69 | NAME | 0.94935 | | 3 | 7194334 | 78 | 84 | ID | 0.9988 | | 4 | 01/13/93 | 93 | 100 | DATE | 0.9913 | | 5 | Oliveira | 110 | 117 | NAME | 0.9924 | | 6 | 25 | 121 | 122 | AGE | 0.987 | | 7 | 2079-11-09 | 150 | 159 | DATE | 0.9952 | | 8 | Cocke County Baptist Hospital | 163 | 191 | LOCATION | 0.795975 | | 9 | 0295 Keats Street | 195 | 211 | LOCATION | 0.741567 | | 10 | 302-786-5227 | 221 | 232 | CONTACT | 0.984 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_sd_large_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_6_h_256 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-6_H-256` is a Chinese model originally trained by `uer`. 
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_256_zh_4.2.4_3.0_1670325978179.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_256_zh_4.2.4_3.0_1670325978179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_256","zh") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_256","zh")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
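The `embeddings` column holds one vector per token (256-dimensional here, the H-256 in the model name); downstream, token vectors are often compared with cosine similarity. A minimal sketch in plain Python, independent of Spark NLP:

```python
import math

def cosine(u, v):
    # Cosine similarity of two equal-length vectors: dot(u, v) / (|u| * |v|).
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0
```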
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_6_h_256|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|39.2 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/uer/chinese_roberta_L-6_H-256
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://cloud.tencent.com/

---
layout: model
title: English RobertaForSequenceClassification Base Cased model (from mrm8488)
author: John Snow Labs
name: roberta_sequence_classifier_distilroberta_base_finetuned_suicide_depression
date: 2022-07-13
tags: [en, open_source, roberta, sequence_classification]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForSequenceClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-base-finetuned-suicide-depression` is an English model originally trained by `mrm8488`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_distilroberta_base_finetuned_suicide_depression_en_4.0.0_3.0_1657715865562.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_distilroberta_base_finetuned_suicide_depression_en_4.0.0_3.0_1657715865562.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_distilroberta_base_finetuned_suicide_depression","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, classifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_distilroberta_base_finetuned_suicide_depression","en")
  .setInputCols(Array("document", "token"))
  .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, classifier))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
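Under the hood, a sequence classifier projects the pooled sentence representation to one logit per label and normalizes with softmax; the `class` annotation carries the argmax label. A plain-Python sketch of that last step (the logit values and the label order are hypothetical, for illustration only):

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

labels = ["depression", "suicide"]  # hypothetical label order
probs = softmax([0.3, 1.1])         # hypothetical logits from the model head
pred = labels[max(range(len(probs)), key=probs.__getitem__)]
print(pred)  # suicide
```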
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_sequence_classifier_distilroberta_base_finetuned_suicide_depression|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|309.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/mrm8488/distilroberta-base-finetuned-suicide-depression
- https://github.com/ayaanzhaque/SDCNL

---
layout: model
title: German Electra Embeddings (from deepset)
author: John Snow Labs
name: electra_embeddings_gelectra_base_generator
date: 2022-05-17
tags: [de, open_source, electra, embeddings]
task: Embeddings
language: de
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `gelectra-base-generator` is a German model originally trained by `deepset`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_gelectra_base_generator_de_3.4.4_3.0_1652786833144.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_gelectra_base_generator_de_3.4.4_3.0_1652786833144.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

embeddings = BertEmbeddings.pretrained("electra_embeddings_gelectra_base_generator","de") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])

data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("electra_embeddings_gelectra_base_generator","de")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))

val data = Seq("Ich liebe Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
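The `embeddings` column again holds one vector per token; a common way to get a single sentence vector is to mean-pool the token vectors. A minimal plain-Python sketch, independent of Spark NLP (the two-dimensional toy vectors are for illustration only):

```python
def mean_pool(token_vectors):
    # Average a list of equal-length token vectors into one sentence vector.
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```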
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|electra_embeddings_gelectra_base_generator|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|de|
|Size:|128.3 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/deepset/gelectra-base-generator
- https://arxiv.org/pdf/2010.10906.pdf
- https://deepset.ai/german-bert
- https://deepset.ai/germanquad
- https://github.com/deepset-ai/FARM
- https://github.com/deepset-ai/haystack/
- https://twitter.com/deepset_ai
- https://www.linkedin.com/company/deepset-ai/
- https://haystack.deepset.ai/community/join
- https://github.com/deepset-ai/haystack/discussions
- https://deepset.ai
- http://www.deepset.ai/jobs

---
layout: model
title: Detect Persons, Locations, Organizations and Misc Entities in English
author: gokhanturer
name: Ner_conll2003_100d
date: 2022-02-08
tags: [open_source, ner, glove_100d, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.1.2
spark_version: 3.0
supported: false
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This NER model, trained with GloVe 100d word embeddings, annotates text to find entities such as the names of people, places, and organizations.
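The model emits one BIO tag per token (`B-PER`, `I-PER`, ..., `O`); entity chunks are recovered by grouping each `B-` tag with the `I-` tags of the same label that follow it, similar in spirit to what Spark NLP's NerConverter produces. A minimal plain-Python sketch of that grouping:

```python
def bio_to_chunks(tokens, tags):
    # Group B-/I- tagged tokens into (entity_text, label) chunks.
    # Stray I- tags without a matching preceding B- are dropped in this sketch.
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(bio_to_chunks(
    ["Peter", "Blackburn", "visited", "Brussels"],
    ["B-PER", "I-PER", "O", "B-LOC"],
))  # [('Peter Blackburn', 'PER'), ('Brussels', 'LOC')]
```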
```python
nerdl_model = NerDLModel.pretrained("Ner_conll2003_100d", "en", "@gokhanturer")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")
```

## Predicted Entities

`PER`, `LOC`, `ORG`, `MISC`

{:.btn-box}
[Open in Colab](https://colab.research.google.com/drive/1KtA_K-7_xO0oxQ7DhJtU5RPHtGnMzK8z#scrollTo=BP1iPII8PTdb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/community.johnsnowlabs.com/gokhanturer/Ner_conll2003_100d_en_3.1.2_3.0_1644322842689.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://community.johnsnowlabs.com/gokhanturer/Ner_conll2003_100d_en_3.1.2_3.0_1644322842689.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Colab Setup

In [1]:
! pip install -q pyspark==3.1.2 spark-nlp
! pip install -q spark-nlp-display

In [3]:
import sparknlp
spark = sparknlp.start(gpu = True)

from sparknlp.base import *
from sparknlp.annotator import *
import pyspark.sql.functions as F
from sparknlp.training import CoNLL

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)
Spark NLP version 3.4.0
Apache Spark version: 3.1.2

CONLL Data Prep

In [2]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.train
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.testa

Train Data

In [5]:
with open ("eng.train") as f:
    train_data = f.read()
print (train_data[:500])

-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . 
O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER

BRUSSELS NNP B-NP B-LOC
1996-08-22 CD I-NP O

The DT B-NP O
European NNP I-NP B-ORG
Commission NNP I-NP I-ORG
said VBD B-VP O
on IN B-PP O
Thursday NNP B-NP O
it PRP B-NP O
disagreed VBD B-VP O
with IN B-PP O
German JJ B-NP B-MISC
advice NN I-NP O
to TO B-PP O
consumers NNS B-NP

In [6]:
train_data = CoNLL().readDataset(spark, 'eng.train')
train_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|EU rejects German...|[{document, 0, 47...|[{document, 0, 47...|[{token, 0, 1, EU...|[{pos, 0, 1, NNP,...|[{named_entity, 0...|
|     Peter Blackburn|[{document, 0, 14...|[{document, 0, 14...|[{token, 0, 4, Pe...|[{pos, 0, 4, NNP,...|[{named_entity, 0...|
| BRUSSELS 1996-08-22|[{document, 0, 18...|[{document, 0, 18...|[{token, 0, 7, BR...|[{pos, 0, 7, NNP,...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows

In [7]:
train_data.count()
Out[7]: 14041

In [8]:
train_data.select(F.explode(F.arrays_zip('token.result', 'pos.result', 'label.result')).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("pos"),
            F.expr("cols['2']").alias("ner_label")).show(truncate=50)

+----------+---+---------+
|     token|pos|ner_label|
+----------+---+---------+
|        EU|NNP|    B-ORG|
|   rejects|VBZ|        O|
|    German| JJ|   B-MISC|
|      call| NN|        O|
|        to| TO|        O|
|   boycott| VB|        O|
|   British| JJ|   B-MISC|
|      lamb| NN|        O|
|         .|  .|        O|
|     Peter|NNP|    B-PER|
| Blackburn|NNP|    I-PER|
|  BRUSSELS|NNP|    B-LOC|
|1996-08-22| CD|        O|
|       The| DT|        O|
|  European|NNP|    B-ORG|
|Commission|NNP|    I-ORG|
|      said|VBD|        O|
|        on| IN|        O|
|  Thursday|NNP|        O|
|        it|PRP|        O|
+----------+---+---------+
only showing top 20 rows

In [9]:
train_data.select(F.explode(F.arrays_zip("token.result","label.result")).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ground_truth")).groupBy("ground_truth").count().orderBy("count", ascending=False).show(100,truncate=False)

+------------+------+
|ground_truth|count |
+------------+------+
|O           |169578|
|B-LOC       |7140  |
|B-PER       |6600  |
|B-ORG       |6321  |
|I-PER       |4528  |
|I-ORG       |3704  |
|B-MISC      |3438  |
|I-LOC       |1157  |
|I-MISC      |1155  |
+------------+------+

In [10]:
#conll_data.select(F.countDistinct("label.result")).show()
#conll_data.groupBy("label.result").count().show(truncate=False)

train_data = train_data.withColumn('unique', F.array_distinct("label.result"))\
    .withColumn('c', F.size('unique'))\
    .filter(F.col('c')>1)

train_data.select(F.explode(F.arrays_zip('token.result','label.result')).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ground_truth"))\
    .groupBy('ground_truth')\
    .count()\
    .orderBy('count', ascending=False)\
    .show(100,truncate=False)

+------------+------+
|ground_truth|count |
+------------+------+
|O           |137736|
|B-LOC       |7125  |
|B-PER       |6596  |
|B-ORG       |6288  |
|I-PER       |4528  |
|I-ORG       |3704  |
|B-MISC      |3437  |
|I-LOC       |1157  |
|I-MISC      |1155  |
+------------+------+

Test Data

In [11]:
with open ("eng.testa") as f:
    test_data = f.read()
print (test_data[:500])

-DOCSTART- -X- -X- O

CRICKET NNP B-NP O
- : O O
LEICESTERSHIRE NNP B-NP B-ORG
TAKE NNP I-NP O
OVER IN B-PP O
AT NNP B-NP O
TOP NNP I-NP O
AFTER NNP I-NP O
INNINGS NNP I-NP O
VICTORY NN I-NP O
. . O O

LONDON NNP B-NP B-LOC
1996-08-30 CD I-NP O

West NNP B-NP B-MISC
Indian NNP I-NP I-MISC
all-rounder NN I-NP O
Phil NNP I-NP B-PER
Simmons NNP I-NP I-PER
took VBD B-VP O
four CD B-NP O
for IN B-PP O
38 CD B-NP O
on IN B-PP O
Friday NNP B-NP O
as IN B-PP O
Leicestershire NNP B-NP B-ORG
beat VBD B-VP

In [12]:
test_data = CoNLL().readDataset(spark, 'eng.testa')
test_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|CRICKET - LEICEST...|[{document, 0, 64...|[{document, 0, 64...|[{token, 0, 6, CR...|[{pos, 0, 6, NNP,...|[{named_entity, 0...|
|   LONDON 1996-08-30|[{document, 0, 16...|[{document, 0, 16...|[{token, 0, 5, LO...|[{pos, 0, 5, NNP,...|[{named_entity, 0...|
|West Indian all-r...|[{document, 0, 18...|[{document, 0, 18...|[{token, 0, 3, We...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows

In [13]:
test_data.count()
Out[13]: 3250

In [14]:
test_data.select(F.explode(F.arrays_zip("token.result","label.result")).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ground_truth")).groupBy("ground_truth").count().orderBy("count", ascending=False).show(100,truncate=False)

+------------+-----+
|ground_truth|count|
+------------+-----+
|O           |42759|
|B-PER       |1842 |
|B-LOC       |1837 |
|B-ORG       |1341 |
|I-PER       |1307 |
|B-MISC      |922  |
|I-ORG       |751  |
|I-MISC      |346  |
|I-LOC       |257  |
+------------+-----+

NERDL Model with Glove_100d

In [15]:
glove_embeddings = WordEmbeddingsModel.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]

In [16]:
glove_embeddings.transform(test_data).write.parquet('test_data_embeddings.parquet')

In [17]:
nerTagger = NerDLApproach()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(8)\
    .setLr(0.002)\
    .setDropout(0.5)\
    .setBatchSize(16)\
    .setRandomSeed(0)\
    .setVerbose(1)\
    .setEvaluationLogExtended(True) \
    .setEnableOutputLogs(True)\
    .setIncludeConfidence(True)\
    .setTestDataset('test_data_embeddings.parquet')\
    .setEnableMemoryOptimizer(False)

ner_pipeline = Pipeline(stages=[
    glove_embeddings,
    nerTagger
])

In [19]:
%%time
ner_model = ner_pipeline.fit(train_data)
CPU times: user 10.6 s, sys: 1.08 s, total: 11.7 s
Wall time: 35min 21s

In [20]:
!cd ~/annotator_logs/ && ls -lt
total 16
-rw-r--r-- 1 root root 13178 Feb  6 17:05 NerDLApproach_c5bf4e4c6211.log

In [21]:
!cat ~/annotator_logs/NerDLApproach_c5bf4e4c6211.log
Name of the selected graph: ner-dl/blstm_10_100_128_120.pb
Training started - total epochs: 8 - lr: 0.002 - batch size: 16 - labels: 9 - chars: 82 - training examples: 11079
Epoch 1/8 started, lr: 0.002, dataset size: 11079
Epoch 1/8 - 159.93s - loss: 2234.436 - batches: 695
Quality on test dataset:
time to finish evaluation: 17.74s
label    tp    fp   fn   prec        rec         f1
B-LOC    1695  94   142  0.94745666  0.92270005  0.93491447
I-ORG    528   76   223  0.8741722   0.7030626   0.77933586
I-MISC   255   88   91   0.7434402   0.7369942   0.74020314
I-LOC    189   14   68   0.9310345   0.73540854  0.8217391
I-PER    1270  59   37   0.95560575  0.9716909   0.9635812
B-MISC   797   142  125  0.84877527  0.8644252   0.85652876
B-ORG    1139  170  202  0.8701299   0.8493661   0.85962266
B-PER    1802  176  40   0.91102123  0.9782845   0.94345546
tp: 7675 fp: 819 fn: 928 labels: 8
Macro-average prec: 0.8852045, rec: 0.84524155, f1: 0.86476153
Micro-average prec: 0.903579, rec: 0.8921307, f1: 0.8978184
...
Epoch 8/8 started, lr: 0.0019323673, dataset size: 11079
Epoch 8/8 - 133.22s - loss: 299.64578 - batches: 695
Quality on test dataset:
time to finish evaluation: 8.91s
label    tp    fp   fn   prec        rec         f1
B-LOC    1746  56   91   0.9689234   0.9504627   0.95960426
I-ORG    673   77   78   0.8973333   0.8961385   0.8967355
I-MISC   270   43   76   0.8626198   0.7803468   0.8194234
I-LOC    223   10   34   0.95708156  0.8677043   0.9102041
I-PER    1272  41   35   0.9687738   0.9732211   0.9709923
B-MISC   832   109  90   0.88416576  0.9023861   0.893183
B-ORG    1264  143  77   0.8983653   0.94258016  0.9199418
B-PER    1801  76   41   0.95950985  0.9777416   0.9685399
tp: 8081 fp: 555 fn: 522 labels: 8
Macro-average prec: 0.9245966, rec: 0.9113227, f1: 0.91791165
Micro-average prec: 0.93573415, rec: 0.9393235, f1: 0.9375254

In [22]:
import pyspark.sql.functions as F

predictions = ner_model.transform(test_data)

predictions.select(F.explode(F.arrays_zip('token.result','label.result','ner.result')).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ground_truth"),
            F.expr("cols['2']").alias("prediction")).show(truncate=False)

+--------------+------------+----------+
|token         |ground_truth|prediction|
+--------------+------------+----------+
|CRICKET       |O           |O         |
|-             |O           |O         |
|LEICESTERSHIRE|B-ORG       |B-ORG     |
|TAKE          |O           |O         |
|OVER          |O           |O         |
|AT            |O           |O         |
|TOP           |O           |O         |
|AFTER         |O           |O         |
|INNINGS       |O           |O         |
|VICTORY       |O           |O         |
|. 
|O           |O         |
|LONDON        |B-LOC       |B-LOC     |
|1996-08-30    |O           |O         |
|West          |B-MISC      |B-MISC    |
|Indian        |I-MISC      |I-MISC    |
|all-rounder   |O           |O         |
|Phil          |B-PER       |B-PER     |
|Simmons       |I-PER       |I-PER     |
|took          |O           |O         |
|four          |O           |O         |
+--------------+------------+----------+
only showing top 20 rows

In [23]:
from sklearn.metrics import classification_report

preds_df = predictions.select(F.explode(F.arrays_zip('token.result','label.result','ner.result')).alias("cols")) \
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ground_truth"),
            F.expr("cols['2']").alias("prediction")).toPandas()

print (classification_report(preds_df['ground_truth'], preds_df['prediction']))

              precision    recall  f1-score   support

       B-LOC       0.97      0.95      0.96      1837
      B-MISC       0.88      0.90      0.89       922
       B-ORG       0.90      0.94      0.92      1341
       B-PER       0.96      0.98      0.97      1842
       I-LOC       0.96      0.87      0.91       257
      I-MISC       0.86      0.78      0.82       346
       I-ORG       0.90      0.90      0.90       751
       I-PER       0.97      0.97      0.97      1307
           O       1.00      1.00      1.00     42759

    accuracy                           0.99     51362
   macro avg       0.93      0.92      0.93     51362
weighted avg       0.99      0.99      0.99     51362

Saving the Trained Model

In [24]:
ner_model.stages
Out[24]: [WORD_EMBEDDINGS_MODEL_48cffc8b9a76, NerDLModel_6a88a8ead3fd]

In [25]:
ner_model.stages[1].write().overwrite().save("/content/drive/MyDrive/SparkNLPTask/Ner_glove_100d_e8_b16_lr0.02")

Prediction Pipeline

In [28]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

glove_embeddings = WordEmbeddingsModel.pretrained()\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

loaded_ner_model = NerDLModel.load("/content/drive/MyDrive/SparkNLPTask/Ner_glove_100d_e8_b16_lr0.02")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_span")

ner_prediction_pipeline = Pipeline(stages = [
    document,
    sentence,
    token,
    glove_embeddings,
    loaded_ner_model,
    converter
])

empty_data = spark.createDataFrame([['']]).toDF("text")
prediction_model = ner_prediction_pipeline.fit(empty_data)
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]

In [33]:
text = '''
The final has its own Merseyside subplot, as it will pit Liverpool forwards Mo Salah (of Egypt: pictured above, in white, in the semi-final) and Sadio Mané (of Senegal) against each other. They are just two of the African stars to play for European clubs—the world’s strongest. In fact, only four teams in the English Premier League don’t have a player from the continent. Besides Mr Salah and Mr Mané, Riyad Mahrez of Algeria is at Manchester City, Wilfred Ndidi of Nigeria and Chelsea boasts Edouard Mendy, Senegal’s goalkeeper, and Hakim Ziyech of Morocco. In Italy’s Serie A, Kalidou Koulibaly of Senegal plays for Napoli and Franck Kessie of the Ivory Coast turns out for AC Milan. Eric Maxim Choupo-Moting of Cameroon and Bouna Sarr of Senegal both play for Bayern Munich, the dominant club in Germany’s Bundesliga. 
''' sample_data = spark.createDataFrame([[text]]).toDF("text") sample_data.show(truncate=False) +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |text | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | The final has its own Merseyside subplot, as it will pit Liverpool forwards Mo Salah (of Egypt: pictured above, in white, in the semi-final) and Sadio Mané (of Senegal) against each other. 
They are just two of the African stars to play for European clubs—the world’s strongest. In fact, only four teams in the English Premier League don’t have a player from the continent. Besides Mr Salah and Mr Mané, Riyad Mahrez of Algeria is at Manchester City, Wilfred Ndidi of Nigeria and Chelsea boasts Edouard Mendy, Senegal’s goalkeeper, and Hakim Ziyech of Morocco. In Italy’s Serie A, Kalidou Koulibaly of Senegal plays for Napoli and Franck Kessie of the Ivory Coast turns out for AC Milan. Eric Maxim Choupo-Moting of Cameroon and Bouna Sarr of Senegal both play for Bayern Munich, the dominant club in Germany’s Bundesliga. | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ In [34]: preds = prediction_model.transform(sample_data) result_df = preds.select(F.explode(F.arrays_zip("ner_span.result","ner_span.metadata")).alias("entities")) \ .select(F.expr("entities['0']").alias("chunk"), F.expr("entities['1'].entity").alias("entity")).show(truncate=False) +---------------+------+ |chunk |entity| +---------------+------+ |Merseyside |ORG | |Liverpool |ORG | |Mo Salah |PER | |Egypt |LOC | |Sadio Mané |PER | |Senegal |LOC | |African |MISC | |European |MISC | |English |MISC | |Premier League |ORG | |Mr Salah 
|PER | |Mr Mané |PER | |Riyad Mahrez |PER | |Algeria |LOC | |Manchester City|LOC | |Wilfred Ndidi |PER | |Nigeria |LOC | |Chelsea |ORG | |Edouard Mendy |PER | |Senegal’s |PER | +---------------+------+ only showing top 20 rows In [35]: from sparknlp.base import LightPipeline light_model = LightPipeline(prediction_model) result = light_model.annotate(text) list(zip(result['token'], result['ner'])) Out[35]: [('The', 'O'), ('final', 'O'), ('has', 'O'), ('its', 'O'), ('own', 'O'), ('Merseyside', 'B-ORG'), ('subplot', 'O'), (',', 'O'), ('as', 'O'), ('it', 'O'), ('will', 'O'), ('pit', 'O'), ('Liverpool', 'B-ORG'), ('forwards', 'O'), ('Mo', 'B-PER'), ('Salah', 'I-PER'), ('(', 'O'), ('of', 'O'), ('Egypt', 'B-LOC'), (':', 'O'), ('pictured', 'O'), ('above', 'O'), (',', 'O'), ('in', 'O'), ('white', 'O'), (',', 'O'), ('in', 'O'), ('the', 'O'), ('semi-final', 'O'), (')', 'O'), ('and', 'O'), ('Sadio', 'B-PER'), ('Mané', 'I-PER'), ('(', 'O'), ('of', 'O'), ('Senegal', 'B-LOC'), (')', 'O'), ('against', 'O'), ('each', 'O'), ('other', 'O'), ('.', 'O'), ('They', 'O'), ('are', 'O'), ('just', 'O'), ('two', 'O'), ('of', 'O'), ('the', 'O'), ('African', 'B-MISC'), ('stars', 'O'), ('to', 'O'), ('play', 'O'), ('for', 'O'), ('European', 'B-MISC'), ('clubs—the', 'O'), ('world’s', 'O'), ('strongest', 'O'), ('.', 'O'), ('In', 'O'), ('fact', 'O'), (',', 'O'), ('only', 'O'), ('four', 'O'), ('teams', 'O'), ('in', 'O'), ('the', 'O'), ('English', 'B-MISC'), ('Premier', 'B-ORG'), ('League', 'I-ORG'), ('don’t', 'O'), ('have', 'O'), ('a', 'O'), ('player', 'O'), ('from', 'O'), ('the', 'O'), ('continent', 'O'), ('.', 'O'), ('Besides', 'O'), ('Mr', 'B-PER'), ('Salah', 'I-PER'), ('and', 'O'), ('Mr', 'B-PER'), ('Mané', 'I-PER'), (',', 'O'), ('Riyad', 'B-PER'), ('Mahrez', 'I-PER'), ('of', 'O'), ('Algeria', 'B-LOC'), ('is', 'O'), ('at', 'O'), ('Manchester', 'B-LOC'), ('City', 'I-LOC'), (',', 'O'), ('Wilfred', 'B-PER'), ('Ndidi', 'I-PER'), ('of', 'O'), ('Nigeria', 'B-LOC'), ('and', 'O'), ('Chelsea', 'B-ORG'), 
('boasts', 'O'), ('Edouard', 'B-PER'), ('Mendy', 'I-PER'), (',', 'O'), ('Senegal’s', 'B-PER'), ('goalkeeper', 'O'), (',', 'O'), ('and', 'O'), ('Hakim', 'B-PER'), ('Ziyech', 'I-PER'), ('of', 'O'), ('Morocco', 'B-LOC'), ('.', 'O'), ('In', 'O'), ('Italy’s', 'B-MISC'), ('Serie', 'I-MISC'), ('A', 'I-MISC'), (',', 'O'), ('Kalidou', 'B-PER'), ('Koulibaly', 'I-PER'), ('of', 'O'), ('Senegal', 'B-LOC'), ('plays', 'O'), ('for', 'O'), ('Napoli', 'B-ORG'), ('and', 'O'), ('Franck', 'B-PER'), ('Kessie', 'I-PER'), ('of', 'O'), ('the', 'O'), ('Ivory', 'B-LOC'), ('Coast', 'I-LOC'), ('turns', 'O'), ('out', 'O'), ('for', 'O'), ('AC', 'B-ORG'), ('Milan', 'I-ORG'), ('.', 'O'), ('Eric', 'B-PER'), ('Maxim', 'I-PER'), ('Choupo-Moting', 'I-PER'), ('of', 'O'), ('Cameroon', 'B-LOC'), ('and', 'O'), ('Bouna', 'B-PER'), ('Sarr', 'I-PER'), ('of', 'O'), ('Senegal', 'B-LOC'), ('both', 'O'), ('play', 'O'), ('for', 'O'), ('Bayern', 'B-ORG'), ('Munich', 'I-ORG'), (',', 'O'), ('the', 'O'), ('dominant', 'O'), ('club', 'O'), ('in', 'O'), ('Germany’s', 'B-MISC'), ('Bundesliga', 'I-MISC'), ('.', 'O')] In [37]: import pandas as pd result = light_model.fullAnnotate(text) ner_df= pd.DataFrame([(int(x.metadata['sentence']), x.result, x.begin, x.end, y.result) for x,y in zip(result[0]["token"], result[0]["ner"])], columns=['sent_id','token','start','end','ner']) ner_df.head(15) Out[37]: sent_id token start end ner 0 0 The 1 3 O 1 0 final 5 9 O 2 0 has 11 13 O 3 0 its 15 17 O 4 0 own 19 21 O 5 0 Merseyside 23 32 B-ORG 6 0 subplot 34 40 O 7 0 , 41 41 O 8 0 as 43 44 O 9 0 it 46 47 O 10 0 will 49 52 O 11 0 pit 54 56 O 12 0 Liverpool 58 66 B-ORG 13 0 forwards 68 75 O 14 0 Mo 77 78 B-PER Highlight Entities In [38]: ann_text = light_model.fullAnnotate(text)[0] ann_text.keys() Out[38]: dict_keys(['document', 'ner_span', 'token', 'ner', 'embeddings', 'sentence']) In [39]: from sparknlp_display import NerVisualizer visualiser = NerVisualizer() print ('Standard Output') visualiser.display(ann_text, label_col='ner_span', 
document_col='document') Standard Output The final has its own Merseyside ORG subplot, as it will pit Liverpool ORG forwards Mo Salah PER (of Egypt LOC: pictured above, in white, in the semi-final) and Sadio Mané PER (of Senegal LOC) against each other. They are just two of the African MISC stars to play for European MISC clubs—the world’s strongest. In fact, only four teams in the English MISC Premier League ORG don’t have a player from the continent. Besides Mr Salah PER and Mr Mané PER, Riyad Mahrez PER of Algeria LOC is at Manchester City LOC, Wilfred Ndidi PER of Nigeria LOC and Chelsea ORG boasts Edouard Mendy PER, Senegal’s PER goalkeeper, and Hakim Ziyech PER of Morocco LOC. In Italy’s Serie A MISC, Kalidou Koulibaly PER of Senegal LOC plays for Napoli ORG and Franck Kessie PER of the Ivory Coast LOC turns out for AC Milan ORG. Eric Maxim Choupo-Moting PER of Cameroon LOC and Bouna Sarr PER of Senegal LOC both play for Bayern Munich ORG, the dominant club in Germany’s Bundesliga MISC. Streamlit In [14]: ! pip install -q pyspark==3.1.2 spark-nlp ! pip install -q spark-nlp-display In [ ]: !pip install streamlit !pip install pyngrok==4.1.1 In [2]: ! wget https://raw.githubusercontent.com/gokhanturer/JSL/main/streamlit_me_ner_model.py --2022-02-06 22:39:33-- https://raw.githubusercontent.com/gokhanturer/JSL/main/streamlit_me_ner_model.py Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ... Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected. HTTP request sent, awaiting response... 
200 OK Length: 7979 (7.8K) [text/plain] Saving to: ‘streamlit_me_ner_model.py.3’ streamlit_me_ner_mo 100%[===================>] 7.79K --.-KB/s in 0s 2022-02-06 22:39:34 (93.4 MB/s) - ‘streamlit_me_ner_model.py.3’ saved [7979/7979] In [3]: !ngrok authtoken 24jtZ2Watn1mc1bSG6v19fel7p1_2bYeRjRkniKqqhfgRs6ub Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml In [5]: !streamlit run streamlit_me_ner_model.py &>/dev/null& In [6]: from pyngrok import ngrok public_url = ngrok.connect(port='8501') public_url Out[6]: 'http://2d54-34-125-109-11.ngrok.io' In [7]: !killall ngrok public_url = ngrok.connect(port='8501') public_url Out[7]: 'http://df30-34-125-109-11.ngrok.io' ```
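The `NerConverter` stage used throughout this example merges token-level IOB tags (`B-PER`, `I-PER`, ...) into entity chunks such as `Mo Salah → PER`. As a rough sketch of that merging logic in plain Python (an illustration only, not the actual Spark NLP implementation; `merge_bio` is a hypothetical helper):

```python
def merge_bio(tagged_tokens):
    """Merge (token, IOB-tag) pairs into (chunk, entity) pairs.

    Simplified sketch of what NerConverter does: a B- tag opens a new
    chunk, an I- tag of the same entity extends it, anything else closes it.
    """
    chunks, current_tokens, current_label = [], [], None
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:  # "O" or an inconsistent I- tag closes any open chunk
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks

print(merge_bio([("Mo", "B-PER"), ("Salah", "I-PER"), ("of", "O"), ("Egypt", "B-LOC")]))
# [('Mo Salah', 'PER'), ('Egypt', 'LOC')]
```

This is why the chunk/entity table above pairs `Mo Salah` with `PER` and `Egypt` with `LOC` even though the model emits one tag per token.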
## Results ```bash +---------------+------+ |chunk |entity| +---------------+------+ |Merseyside |ORG | |Liverpool |ORG | |Mo Salah |PER | |Egypt |LOC | |Sadio Mané |PER | |Senegal |LOC | |African |MISC | |European |MISC | |English |MISC | |Premier League |ORG | |Mr Salah |PER | |Mr Mané |PER | |Riyad Mahrez |PER | |Algeria |LOC | |Manchester City|LOC | |Wilfred Ndidi |PER | |Nigeria |LOC | |Chelsea |ORG | |Edouard Mendy |PER | |Senegal’s |PER | +---------------+------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|Ner_conll2003_100d| |Type:|ner| |Compatibility:|Spark NLP 3.1.2+| |License:|Open Source| |Edition:|Community| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|14.3 MB| |Dependencies:|glove100d| ## References This model is trained based on data from : https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.train https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.testa ## Benchmarking ```bash label precision recall f1-score support B-LOC 0.97 0.95 0.96 1837 B-MISC 0.88 0.90 0.89 922 B-ORG 0.90 0.94 0.92 1341 B-PER 0.96 0.98 0.97 1842 I-LOC 0.96 0.87 0.91 257 I-MISC 0.86 0.78 0.82 346 I-ORG 0.90 0.90 0.90 751 I-PER 0.97 0.97 0.97 1307 O 1.00 1.00 1.00 42759 accuracy - - 0.99 51362 macro-avg 0.93 0.92 0.93 51362 weighted-avg 0.99 0.99 0.99 51362 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from akr) author: John Snow Labs name: distilbert_qa_akr_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and 
curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `akr`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_akr_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769755011.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_akr_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769755011.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_akr_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_akr_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
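For context on what the annotator returns: extractive QA models of this kind score every context token as a candidate answer start and end, and the best-scoring valid span becomes the answer. A toy decoding sketch in plain Python (illustrative only — Spark NLP performs this internally; `best_span` and the scores below are made-up stand-ins):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) token pair maximizing start + end score,
    with end >= start and a bounded span length."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best[2]:
                best = (s, e, score)
    return best[:2]

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 4.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
end   = [0.0, 0.1, 0.0, 3.5, 0.0, 0.0, 0.0, 0.0, 1.2, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```

With the example question "What's my name?", the highest-scoring span over the context is the single token `Clara`, matching the answer the pipeline would return.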
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_akr_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/akr/distilbert-base-uncased-finetuned-squad --- layout: model title: English asr_wav2vec2_large_960h TFWav2Vec2ForCTC from facebook author: John Snow Labs name: asr_wav2vec2_large_960h date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_960h` is an English model originally trained by facebook. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_960h_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_960h_en_4.2.0_3.0_1664016568413.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_960h_en_4.2.0_3.0_1664016568413.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_960h", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_960h", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
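For background, `Wav2Vec2ForCTC` predicts one character per audio frame and collapses the frame sequence with CTC decoding: consecutive duplicates are merged, then blank tokens are dropped. A greedy sketch of that collapse (illustrative only; the annotator does this internally, and `_` stands in for the blank token):

```python
def ctc_greedy_collapse(frames, blank="_"):
    """Collapse a per-frame character sequence CTC-style:
    merge consecutive duplicates, then drop blank tokens."""
    out, prev = [], None
    for ch in frames:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

# A blank between two identical characters keeps both,
# which is how CTC preserves the double L in HELLO.
print(ctc_greedy_collapse("HH_EE_L_LL_O"))  # HELLO
```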
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_960h| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|755.4 MB| --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_el12 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-el12` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el12_en_4.3.0_3.0_1675119407178.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el12_en_4.3.0_3.0_1675119407178.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_el12","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_el12","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_el12| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|184.0 MB| ## References - https://huggingface.co/google/t5-efficient-small-el12 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English BertForQuestionAnswering model (from Sotireas) author: John Snow Labs name: bert_qa_Sotireas_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext-ContaminationQAmodel_PubmedBERT` is an English model originally trained by `Sotireas`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Sotireas_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT_en_4.0.0_3.0_1654176486544.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Sotireas_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT_en_4.0.0_3.0_1654176486544.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Sotireas_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_Sotireas_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_Sotireas_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|408.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Sotireas/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext-ContaminationQAmodel_PubmedBERT --- layout: model title: Legal Preamble Clause Binary Classifier author: John Snow Labs name: legclf_preamble_clause date: 2023-02-13 tags: [en, legal, classification, preamble, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `preamble` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values, one for each of the legal clause models you have added. ## Predicted Entities `preamble`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_preamble_clause_en_1.0.0_3.0_1676302301456.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_preamble_clause_en_1.0.0_3.0_1676302301456.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_preamble_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
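As the description above recommends, long documents should be pre-split (for example, by multiline paragraphs) before classification, since the underlying sentence embeddings cover at most 512 tokens. A minimal pre-splitting sketch (plain Python; `split_paragraphs` is a hypothetical helper, and the workshop tutorial linked above covers more robust strategies):

```python
import re

def split_paragraphs(text, max_tokens=512):
    """Split a document on blank lines and flag whether each paragraph
    fits the embedding model's 512-token window (rough whitespace count)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

doc = "WHEREAS, the parties wish to...\n\nNOW, THEREFORE, the parties agree..."
for para, fits in split_paragraphs(doc):
    print(fits, para[:30])
```

Each resulting paragraph would then become one row of the `df` passed to the pipeline above.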
## Results ```bash +----------+ |result | +----------+ |[preamble]| |[other] | |[other] | |[preamble]| +----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_preamble_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 0.88 0.93 16 preamble 0.91 1.00 0.95 21 accuracy - - 0.95 37 macro-avg 0.96 0.94 0.94 37 weighted-avg 0.95 0.95 0.95 37 ``` --- layout: model title: Pipeline to Map SNOMED Codes to Their Corresponding ICDO Codes author: John Snow Labs name: snomed_icdo_mapping date: 2022-06-27 tags: [snomed, icdo, pipeline, chunk_mapper, clinical, licensed, en] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the `snomed_icdo_mapper` model. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/snomed_icdo_mapping_en_3.5.3_3.0_1656364941154.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/snomed_icdo_mapping_en_3.5.3_3.0_1656364941154.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("snomed_icdo_mapping", "en", "clinical/models") result= pipeline.fullAnnotate("10376009 2026006 26638004") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("snomed_icdo_mapping", "en", "clinical/models") val result= pipeline.fullAnnotate("10376009 2026006 26638004") ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.snomed_to_icdo.pipe").predict("""10376009 2026006 26638004""") ```
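Conceptually, the chunk-mapping step behaves like a dictionary lookup from a source code to its target code. A toy plain-Python illustration using the code pairs shown in the Results section (the real pipeline resolves these through its bundled `ChunkMapperModel`, and the positional pairing here is an assumption based on that table):

```python
# Hypothetical lookup table mirroring the pipeline's SNOMED -> ICDO mapping
# for the three codes used in the example call above.
snomed_to_icdo = {"10376009": "8050/2", "2026006": "9014/0", "26638004": "8322/0"}

codes = "10376009 2026006 26638004".split()
print([snomed_to_icdo.get(c, "NONE") for c in codes])
# ['8050/2', '9014/0', '8322/0']
```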
## Results ```bash | | snomed_code | icdo_code | |---:|:------------|:----------| | 0 | 10376009 | 8050/2 | | 1 | 2026006 | 9014/0 | | 2 | 26638004 | 8322/0 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|snomed_icdo_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|208.7 KB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel --- layout: model title: Tagalog RoBERTa Embeddings (Base) author: John Snow Labs name: roberta_embeddings_roberta_tagalog_base date: 2022-04-14 tags: [roberta, embeddings, tl, open_source] task: Embeddings language: tl edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-tagalog-base` is a Tagalog model originally trained by `jcblaise`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_tagalog_base_tl_3.4.2_3.0_1649948855487.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_tagalog_base_tl_3.4.2_3.0_1649948855487.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_tagalog_base","tl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Gustung-gusto ko ang Spark NLP."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_tagalog_base","tl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Gustung-gusto ko ang Spark NLP.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("tl.embed.roberta_tagalog_base").predict("""Gustung-gusto ko ang Spark NLP.""") ```
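The pipeline's `embeddings` column holds one dense vector per token; a common downstream use is comparing tokens (or averaged sentences) by cosine similarity. A minimal plain-Python sketch (the three-dimensional vectors below are stand-ins — real RoBERTa base vectors have 768 dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

v1 = [0.2, 0.1, 0.4]   # stand-in token vectors
v2 = [0.2, 0.1, 0.4]
v3 = [-0.4, 0.0, 0.1]
print(round(cosine(v1, v2), 2))  # 1.0
print(cosine(v1, v3) < cosine(v1, v2))  # True
```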
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_roberta_tagalog_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|tl| |Size:|410.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/jcblaise/roberta-tagalog-base - https://blaisecruz.com --- layout: model title: Western Frisian BertForMaskedLM Base Cased model (from GroNLP) author: John Snow Labs name: bert_embeddings_base_dutch_cased_frisian date: 2022-12-02 tags: [fy, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: fy edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-dutch-cased-frisian` is a Western Frisian model originally trained by `GroNLP`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_dutch_cased_frisian_fy_4.2.4_3.0_1670016581644.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_dutch_cased_frisian_fy_4.2.4_3.0_1670016581644.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_dutch_cased_frisian","fy") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_dutch_cased_frisian","fy")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_dutch_cased_frisian| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|fy| |Size:|351.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/GroNLP/bert-base-dutch-cased-frisian - https://arxiv.org/abs/2105.02855 - https://github.com/wietsedv/low-resource-adapt - https://github.com/wietsedv/bertje --- layout: model title: English BertForQuestionAnswering model (from ncduy) author: John Snow Labs name: bert_qa_MiniLM_L12_H384_uncased_finetuned_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `MiniLM-L12-H384-uncased-finetuned-squad` is an English model originally trained by `ncduy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_MiniLM_L12_H384_uncased_finetuned_squad_en_4.0.0_3.0_1654178848040.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_MiniLM_L12_H384_uncased_finetuned_squad_en_4.0.0_3.0_1654178848040.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_MiniLM_L12_H384_uncased_finetuned_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_MiniLM_L12_H384_uncased_finetuned_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.mini_lm_base_uncased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
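Internally, an extractive QA model of this kind scores every context token as a candidate answer start and end, and the returned answer is the highest-scoring valid span. A toy sketch of that span selection, independent of Spark NLP (scores are made up):

```python
import numpy as np

# Made-up start/end scores over a 6-token context; a real model produces
# one score per token for "answer starts here" and "answer ends here".
start_logits = np.array([0.1, 0.2, 5.0, 0.3, 0.1, 0.2])
end_logits   = np.array([0.1, 0.2, 0.3, 0.4, 4.8, 0.2])

def best_span(start_logits, end_logits, max_len=30):
    """Pick the (start, end) pair with the highest combined score,
    subject to start <= end and a maximum span length."""
    best, best_score = (0, 0), float("-inf")
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

span = best_span(start_logits, end_logits)  # tokens 2 through 4
```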
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_MiniLM_L12_H384_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|124.3 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/ncduy/MiniLM-L12-H384-uncased-finetuned-squad --- layout: model title: ALBERT Large CoNLL-03 NER Pipeline author: John Snow Labs name: albert_large_token_classifier_conll03_pipeline date: 2022-06-19 tags: [open_source, ner, token_classifier, albert, conll03, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [albert_large_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/26/albert_large_token_classifier_conll03_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655653727302.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655653727302.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("albert_large_token_classifier_conll03_pipeline", lang = "en")

pipeline.annotate("My name is John and I work at John Snow Labs.")
```
```scala
val pipeline = new PretrainedPipeline("albert_large_token_classifier_conll03_pipeline", lang = "en")

pipeline.annotate("My name is John and I work at John Snow Labs.")
```
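The pipeline's NerConverter stage is what turns token-level IOB tags into the entity chunks shown in the results. A rough standalone sketch of that collapsing step (toy tokens and tags; not the library's actual implementation):

```python
def iob_to_chunks(tokens, tags):
    """Collapse token-level IOB tags into (chunk_text, label) pairs,
    roughly what the pipeline's NerConverter stage does."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["My", "name", "is", "John", "and", "I", "work", "at", "John", "Snow", "Labs", "."]
tags   = ["O", "O", "O", "B-PER", "O", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O"]
chunks = iob_to_chunks(tokens, tags)
# chunks == [("John", "PER"), ("John Snow Labs", "ORG")]
```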
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PER | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_large_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|64.4 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - AlbertForTokenClassification - NerConverter - Finisher --- layout: model title: Sentence Entity Resolver for ICD-O (sbiobertresolve_icdo_augmented) author: John Snow Labs name: sbiobertresolve_icdo_augmented date: 2022-06-06 tags: [licensed, clinical, en, icdo, entity_resolution] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.5.2 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted clinical entities to ICD-O codes using `sbiobert_base_cased_mli` Sentence BERT Embeddings. Given an oncological entity found in the text (via NER models like `ner_jsl`), it returns top terms and resolutions along with the corresponding ICD-O codes to present more granularity with respect to body parts mentioned. It also returns the original `Topography` and `Histology` codes, and their descriptions. ## Predicted Entities `ICD-O Codes` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icdo_augmented_en_3.5.2_3.0_1654546345691.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icdo_augmented_en_3.5.2_3.0_1654546345691.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(["Oncological"])

c2doc = Chunk2Doc()\
    .setInputCols("ner_chunk")\
    .setOutputCol("ner_chunk_doc")

sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["ner_chunk_doc"])\
    .setOutputCol("sentence_embeddings")

resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icdo_augmented", "en", "clinical/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("resolution")\
    .setDistanceFunction("EUCLIDEAN")

resolver_pipeline = Pipeline(
    stages = [
        document_assembler,
        sentenceDetectorDL,
        tokenizer,
        word_embeddings,
        ner,
        ner_converter,
        c2doc,
        sbert_embedder,
        resolver
])

data = spark.createDataFrame([["""TRAF6 is a putative oncogene in a variety of cancers including urothelial cancer , and malignant melanoma. 
WWP2 appears to regulate the expression of the well characterized tumor and tensin homolog (PTEN) in endometroid adenocarcinoma and squamous cell carcinoma."""]]).toDF("text") result = resolver_pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Oncological")) val c2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("sentence_embeddings") val resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icdo_augmented", "en", "clinical/models") .setInputCols(Array("sentence_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val resolver_pipeline = new Pipeline().setStages(Array(document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter, c2doc, sbert_embedder, resolver)) val data = Seq("""TRAF6 is a putative oncogene in a variety of cancers including urothelial cancer , and malignant melanoma. 
WWP2 appears to regulate the expression of the well characterized tumor and tensin homolog (PTEN) in endometroid adenocarcinoma and squamous cell carcinoma.""").toDS.toDF("text") val results = resolver_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icdo_augmented").predict("""TRAF6 is a putative oncogene in a variety of cancers including urothelial cancer , and malignant melanoma. WWP2 appears to regulate the expression of the well characterized tumor and tensin homolog (PTEN) in endometroid adenocarcinoma and squamous cell carcinoma.""") ```
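Conceptually, the resolver is a nearest-neighbour lookup: every candidate code has a precomputed description embedding, and the entity chunk's embedding is matched by Euclidean distance, as selected with `setDistanceFunction("EUCLIDEAN")`. A toy sketch with made-up 3-dimensional vectors (real sentence embeddings are much higher-dimensional):

```python
import numpy as np

# Hypothetical code "dictionary": each ICD-O code paired with a tiny
# embedding of its description (values are invented for illustration).
codes = {
    "8720/3": np.array([0.9, 0.1, 0.0]),  # malignant melanoma
    "8070/3": np.array([0.1, 0.8, 0.2]),  # squamous cell carcinoma
    "8000/1": np.array([0.3, 0.3, 0.9]),  # tumor, NOS
}

def resolve(chunk_embedding, codes):
    """Return the code whose description embedding is nearest
    to the chunk embedding in Euclidean distance."""
    return min(codes, key=lambda c: float(np.linalg.norm(codes[c] - chunk_embedding)))

query = np.array([0.85, 0.15, 0.05])  # embedding of an extracted chunk
assert resolve(query, codes) == "8720/3"
```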
## Results ```bash +--------------------------+-----------+---------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+ | chunk| entity|icdo_code| all_k_resolutions| all_k_codes| +--------------------------+-----------+---------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+ | cancers|Oncological| 8000/3|cancer:::carcinoma:::carcinomatosis:::neoplasms:::ceruminous carcinoma::...|8000/3:::8010/3:::8010/9:::800:::8420/3:::8140/3:::8010/3-C76.0:::8010/6...| | urothelial cancer|Oncological| 8120/3|urothelial carcinoma:::urothelial carcinoma in situ of urinary system:::...|8120/3:::8120/2-C68.9:::8010/3-C68.9:::8130/3-C68.9:::8070/3-C68.9:::813...| | malignant melanoma|Oncological| 8720/3|malignant melanoma:::malignant melanoma, of skin:::malignant melanoma, o...|8720/3:::8720/3-C44.9:::8720/3-C06.9:::8720/3-C69.9:::8721/3:::8720/3-C0...| | tumor|Oncological| 8000/1|tumor:::tumorlet:::tumor cells:::askin tumor:::tumor, secondary:::pilar ...|8000/1:::8040/1:::8001/1:::9365/3:::8000/6:::8103/0:::9364/3:::8940/0:::...| |endometroid adenocarcinoma|Oncological| 8380/3|endometrioid adenocarcinoma:::endometrioid adenoma:::scirrhous adenocarc...|8380/3:::8380/0:::8141/3-C54.1:::8560/3-C54.1:::8260/3-C54.1:::8380/3-C5...| | squamous cell carcinoma|Oncological| 8070/3|squamous cell carcinoma:::verrucous squamous cell carcinoma:::squamous c...|8070/3:::8051/3:::8070/2:::8052/3:::8070/3-C44.5:::8075/3:::8560/3:::807...| +--------------------------+-----------+---------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icdo_augmented| |Compatibility:|Healthcare NLP 3.5.2+| 
|License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[icdo_code]| |Language:|en| |Size:|175.7 MB| |Case sensitive:|false| ## References Trained on ICD-O Histology Behaviour dataset with sbiobert_base_cased_mli sentence embeddings. https://apps.who.int/iris/bitstream/handle/10665/96612/9789241548496_eng.pdf --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_dl12 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-dl12` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dl12_en_4.3.0_3.0_1675118545379.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dl12_en_4.3.0_3.0_1675118545379.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_small_dl12","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_small_dl12","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_dl12| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|196.2 MB| ## References - https://huggingface.co/google/t5-efficient-small-dl12 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English image_classifier_vit_blocks ViTForImageClassification from lazyturtl author: John Snow Labs name: image_classifier_vit_blocks date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_blocks` is an English model originally trained by lazyturtl. ## Predicted Entities `red color`, `orange color`, `green color`, `cyan color`, `yellow color`, `blue color` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_blocks_en_4.1.0_3.0_1660166657299.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_blocks_en_4.1.0_3.0_1660166657299.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_blocks", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_blocks", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_blocks| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Fast Neural Machine Translation Model from Danish to English author: John Snow Labs name: opus_mt_da_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, da, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `da` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_da_en_xx_2.7.0_2.4_1609167272759.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_da_en_xx_2.7.0_2.4_1609167272759.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_da_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

data = ["Jeg elsker Spark NLP."]  # example Danish input
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_da_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("Jeg elsker Spark NLP.").toDS.toDF("text")  // example Danish input
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.da.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_da_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Indonesian RobertaForMaskedLM Small Cased model (from w11wo) author: John Snow Labs name: roberta_embeddings_indo_small date: 2022-12-12 tags: [id, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: id edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `indo-roberta-small` is an Indonesian model originally trained by `w11wo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indo_small_id_4.2.4_3.0_1670858716049.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indo_small_id_4.2.4_3.0_1670858716049.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_indo_small","id") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_indo_small","id")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_indo_small| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|id| |Size:|313.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/w11wo/indo-roberta-small - https://arxiv.org/abs/1907.11692 - https://github.com/sgugger - https://w11wo.github.io/ --- layout: model title: English image_classifier_vit_base_patch16_224_recylce_ft ViTForImageClassification from NhatPham author: John Snow Labs name: image_classifier_vit_base_patch16_224_recylce_ft date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_patch16_224_recylce_ft` is an English model originally trained by NhatPham. ## Predicted Entities `Non-Recycle`, `Object`, `Recycle` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_recylce_ft_en_4.1.0_3.0_1660167955114.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_recylce_ft_en_4.1.0_3.0_1660167955114.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_patch16_224_recylce_ft", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_patch16_224_recylce_ft", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_patch16_224_recylce_ft| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Relation Extraction Model Clinical author: John Snow Labs name: re_drug_drug_interaction_clinical class: RelationExtractionModel language: en nav_key: models repository: clinical/models date: 2020-09-03 task: Relation Extraction edition: Healthcare NLP 2.5.5 spark_version: 2.4 tags: [clinical,licensed,relation extraction,en] supported: true annotator: RelationExtractionModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Relation Extraction model based on syntactic features using deep learning. This model can be used to identify drug-drug interaction relationships among drug entities. ## Predicted Entities ``DDI-advise``, ``DDI-effect``, ``DDI-mechanism``, ``DDI-int``, ``DDI-false`` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_drug_drug_interaction_clinical_en_2.5.5_2.4_1599156924424.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_drug_drug_interaction_clinical_en_2.5.5_2.4_1599156924424.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use The table below lists the `re_drug_drug_interaction_clinical` RE model, its labels, the optimal NER model, and the meaningful relation pairs. 
| RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS |
|:---------------------------------:|-----------------------------------------------------------|:------------:|---------------|
| re_drug_drug_interaction_clinical | DDI-advise, DDI-effect, DDI-mechanism, DDI-int, DDI-false | ner_posology | ["drug-drug"] |
{% include programmingLanguageSelectScalaPython.html %}
```python
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

ner_tagger = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens", "embeddings"])\
    .setOutputCol("ner_tags")

ner_converter = NerConverter()\
    .setInputCols(["sentences", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

ddi_re_model = RelationExtractionModel.pretrained("re_drug_drug_interaction_clinical","en","clinical/models")\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("category")

nlp_pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, words_embedder, pos_tagger, ner_tagger, ner_converter, dependency_parser, ddi_re_model])

light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

annotations = light_pipeline.fullAnnotate("""When carbamazepine is withdrawn from the combination therapy, aripiprazole dose should then be reduced. 
If additional adrenergic drugs are to be administered by any route, they should be used with caution because the pharmacologically predictable sympathetic effects of Metformin may be potentiated""")
```
```scala
val documenter = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentencer = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentences")

val tokenizer = new Tokenizer()
    .setInputCols("sentences")
    .setOutputCol("tokens")

val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens"))
    .setOutputCol("embeddings")

val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens"))
    .setOutputCol("pos_tags")

val ner_tagger = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens", "embeddings"))
    .setOutputCol("ner_tags")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentences", "tokens", "ner_tags"))
    .setOutputCol("ner_chunks")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
    .setInputCols(Array("sentences", "pos_tags", "tokens"))
    .setOutputCol("dependencies")

val ddi_re_model = RelationExtractionModel.pretrained("re_drug_drug_interaction_clinical", "en", "clinical/models")
    .setInputCols(Array("embeddings", "pos_tags", "ner_chunks", "dependencies"))
    .setOutputCol("category")

val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, words_embedder, pos_tagger, ner_tagger, ner_converter, dependency_parser, ddi_re_model))

val data = Seq("""When carbamazepine is withdrawn from the combination therapy, aripiprazole dose should then be reduced.
If additional adrenergic drugs are to be administered by any route, they should be used with caution because the pharmacologically predictable sympathetic effects of Metformin may be potentiated""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ```
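`fullAnnotate` returns one dictionary of annotation lists per input document; each relation annotation carries its label in `result` and the participating chunks in `metadata`. The helper below is a small post-processing sketch run on mock annotations — the dictionary keys mirror the usual Spark NLP relation metadata, but verify them against your version:

```python
# Collect (relation, chunk1, chunk2) triples, dropping DDI-false predictions.
# The dicts mimic Spark NLP relation annotations; the exact keys are assumptions.
def extract_relations(annotations, drop=("DDI-false",)):
    rows = []
    for ann in annotations:
        if ann["result"] in drop:
            continue
        m = ann["metadata"]
        rows.append((ann["result"], m["chunk1"], m["chunk2"]))
    return rows

mock = [
    {"result": "DDI-advise",
     "metadata": {"chunk1": "carbamazepine", "chunk2": "aripiprazole"}},
    {"result": "DDI-false",
     "metadata": {"chunk1": "Metformin", "chunk2": "adrenergic drugs"}},
]
print(extract_relations(mock))
# → [('DDI-advise', 'carbamazepine', 'aripiprazole')]
```

Filtering out `DDI-false` up front is usually what you want, since it marks entity pairs the model considers unrelated.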
{:.h2_title}
## Results

```bash
| relation   | entity1 | entity1_begin | entity1_end | chunk1        | entity2 | entity2_begin | entity2_end | chunk2       |
| DDI-advise | DRUG    | 5             | 17          | carbamazepine | DRUG    | 62            | 73          | aripiprazole |
```

{:.model-param}
## Model Information

{:.table-model}
|----------------|-----------------------------------------|
| Name: | re_drug_drug_interaction_clinical |
| Type: | RelationExtractionModel |
| Compatibility: | Spark NLP 2.5.5+ |
| License: | Licensed |
| Edition: | Official |
| Input labels: | [word_embeddings, chunk, pos, dependency] |
| Output labels: | [category] |
| Language: | en |
| Case sensitive: | False |
| Dependencies: | embeddings_clinical |

{:.h2_title}
## Data Source
Trained on data gathered and manually annotated by John Snow Labs.

{:.h2_title}
## Benchmarking
```bash
+-------------+------+------+------+
| relation    |recall| prec | f1   |
+-------------+------+------+------+
| DDI-effect  | 0.76 | 0.38 | 0.51 |
| DDI-false   | 0.72 | 0.97 | 0.83 |
| DDI-advise  | 0.74 | 0.39 | 0.51 |
+-------------+------+------+------+
```
---
layout: model
title: English RobertaForQuestionAnswering Cased model (from comacrae)
author: John Snow Labs
name: roberta_qa_edav3
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-edav3` is an English model originally trained by `comacrae`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_edav3_en_4.3.0_3.0_1674220150986.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_edav3_en_4.3.0_3.0_1674220150986.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_edav3","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_edav3","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_edav3| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/comacrae/roberta-edav3 --- layout: model title: Multilingual T5ForConditionalGeneration Base Cased model (from Voicelab) author: John Snow Labs name: t5_vlt5_base_keywords date: 2023-01-31 tags: [en, pl, open_source, t5, xx, tensorflow] task: Text Generation language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `vlt5-base-keywords` is a Multilingual model originally trained by `Voicelab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_vlt5_base_keywords_xx_4.3.0_3.0_1675158538277.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_vlt5_base_keywords_xx_4.3.0_3.0_1675158538277.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_vlt5_base_keywords","xx") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_vlt5_base_keywords","xx")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_vlt5_base_keywords|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|xx|
|Size:|1.1 GB|

## References

- https://huggingface.co/Voicelab/vlt5-base-keywords
- https://nlp-demo-1.voicelab.ai/
- https://arxiv.org/abs/2209.14008
- https://voicelab.ai/contact/
---
layout: model
title: Athena Conditions Entity Resolver (Healthcare)
author: John Snow Labs
name: chunkresolve_athena_conditions_healthcare
class: ChunkEntityResolverModel
language: en
nav_key: models
repository: clinical/models
date: 2020-09-16
task: Entity Resolution
edition: Healthcare NLP 2.6.0
spark_version: 2.4
tags: [clinical,licensed,entity_resolution,en]
deprecated: true
annotator: ChunkEntityResolverModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

{:.h2_title}
## Description
Entity Resolution model based on KNN using Word Embeddings + Word Movers Distance.

## Predicted Entities
Athena codes and their normalized definitions.

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_athena_conditions_healthcare_en_2.6.0_2.4_1600265258887.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_athena_conditions_healthcare_en_2.6.0_2.4_1600265258887.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

{:.h2_title}
## How to use
This model requires `embeddings_healthcare_100d` and `ner_healthcare` in the pipeline you use.
{% include programmingLanguageSelectScalaPython.html %}
```python
...
athena_re_model = ChunkEntityResolverModel.pretrained("chunkresolve_athena_conditions_healthcare","en","clinical/models")\
    .setInputCols("token","chunk_embeddings")\
    .setOutputCol("entity")

pipeline_athena = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, ner_model, ner_converter, chunk_embeddings, athena_re_model])

data = spark.createDataFrame([["""The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion."""]]).toDF("text")

model = pipeline_athena.fit(data)

results = model.transform(data)
```
```scala
val athena_re_model = ChunkEntityResolverModel.pretrained("chunkresolve_athena_conditions_healthcare","en","clinical/models")
    .setInputCols("token","chunk_embeddings")
    .setOutputCol("entity")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, ner_model, ner_converter, chunk_embeddings, athena_re_model))

val data = Seq("The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot.
She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion.").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.h2_title} ## Results ```bash chunk entity athena_description athena_code 0 a cold PROBLEM Intolerant of cold 4213725 1 cough PROBLEM Cough 254761 2 runny nose PROBLEM O/E - nose 4156058 3 fever PROBLEM Fever 437663 4 difficulty breathing PROBLEM Difficulty breathing 4041664 5 her cough PROBLEM Does cough 4122567 6 dry PROBLEM Dry eyes 4036620 7 hacky PROBLEM Resolving infantile idiopathic scoliosis 44833868 8 physical exam TEST Physical angioedema 37110554 9 a right TM PROBLEM Tuberculosis of thyroid gland, unspecified 44819346 10 fairly congested PROBLEM Tonsil congested 4116401 11 Amoxil TREATMENT Amoxycillin overdose 4173544 12 Aldex TREATMENT Oral lesion 43530620 13 difficulty breathing PROBLEM Difficulty breathing 4041664 14 more congested PROBLEM Nasal congestion 4195085 15 a temperature TEST Tolerance of ambient temperature - finding 4271383 16 congestion PROBLEM Nasal congestion 4195085 ``` {:.model-param} ## Model Information {:.table-model} |----------------|-------------------------------------------| | Name: | chunkresolve_athena_conditions_healthcare | | Type: | ChunkEntityResolverModel | | Compatibility: | 2.6.0 | | License: | Licensed | |Edition:|Official| | |Input labels: | [token, chunk_embeddings] | |Output labels: | [entity] | | Language: | en | | Case sensitive: | True | | Dependencies: | embeddings_healthcare_100d | {:.h2_title} ## Data Source Trained on Athena dataset. --- layout: model title: German BertForMaskedLM Base Cased model (from deepset) author: John Snow Labs name: bert_embeddings_g_base date: 2022-12-02 tags: [de, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: de edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`gbert-base` is a German model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_g_base_de_4.2.4_3.0_1670022131926.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_g_base_de_4.2.4_3.0_1670022131926.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_g_base","de") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_g_base","de")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_g_base| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|412.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/deepset/gbert-base - https://arxiv.org/pdf/2010.10906.pdf - https://arxiv.org/pdf/2010.10906.pdf - https://workablehr.s3.amazonaws.com/uploads/account/logo/476306/logo - https://deepset.ai/german-bert - https://deepset.ai/germanquad - https://github.com/deepset-ai/FARM - https://github.com/deepset-ai/haystack/ - https://twitter.com/deepset_ai - https://www.linkedin.com/company/deepset-ai/ - https://haystack.deepset.ai/community/join - https://github.com/deepset-ai/haystack/discussions - https://deepset.ai - http://www.deepset.ai/jobs --- layout: model title: Part of Speech for Norwegian Nynorsk author: John Snow Labs name: pos_ud_nynorsk date: 2021-03-09 tags: [part_of_speech, open_source, norwegian_nynorsk, pos_ud_nynorsk, nn] task: Part of Speech Tagging language: nn edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. 
## Predicted Entities - DET - NOUN - ADP - PUNCT - CCONJ - PRON - VERB - PROPN - AUX - ADJ - ADV - SCONJ - PART - INTJ - NUM - X - SYM {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_nynorsk_nn_3.0.0_3.0_1615292123096.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_nynorsk_nn_3.0.0_3.0_1615292123096.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pos = PerceptronModel.pretrained("pos_ud_nynorsk", "nn") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    pos
])

example = spark.createDataFrame([['Hello from John Snow Labs!']], ["text"])

result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val pos = PerceptronModel.pretrained("pos_ud_nynorsk", "nn")
    .setInputCols(Array("document", "token"))
    .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))

val data = Seq("Hello from John Snow Labs!").toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["Hello from John Snow Labs!"]
token_df = nlu.load('nn.pos.ud_nynorsk').predict(text)
token_df
```
## Results ```bash token pos 0 Hello PROPN 1 from NOUN 2 John NOUN 3 Snow PROPN 4 Labs PROPN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_nynorsk| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|nn| --- layout: model title: Recognize Entities OntoNotes pipeline - BERT Medium author: John Snow Labs name: onto_recognize_entities_bert_medium date: 2021-03-23 tags: [open_source, english, onto_recognize_entities_bert_medium, pipeline, en] supported: true task: [Named Entity Recognition, Lemmatization] language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The onto_recognize_entities_bert_medium is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps and recognizes entities . It performs most of the common text processing tasks on your dataframe {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_medium_en_3.0.0_3.0_1616477173790.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_medium_en_3.0.0_3.0_1616477173790.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('onto_recognize_entities_bert_medium', lang = 'en')

annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0]

annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("onto_recognize_entities_bert_medium", lang = "en")

val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0)
```

{:.nlu-block}
```python
import nlu
text = ["Hello from John Snow Labs ! "]
result_df = nlu.load('en.ner.onto.bert.medium').predict(text)
result_df
```
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:-----------------------------|:-------------------------------------------|:-------------------| | 0 | ['Hello from John Snow Labs ! '] | ['Hello from John Snow Labs !'] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | [[0.0365490540862083,.,...]] | ['O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O'] | ['John Snow Labs'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_recognize_entities_bert_medium| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| --- layout: model title: Company Name to IRS (Edgar database) author: John Snow Labs name: finel_edgar_irs date: 2022-08-30 tags: [en, finance, companies, edgar, licensed] task: Entity Resolution language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an Entity Linking / Entity Resolution model, which allows you to retrieve the IRS number of a company given its name, using SEC Edgar database. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/ER_EDGAR_CRUNCHBASE/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finel_edgar_irs_en_1.0.0_3.2_1661866402930.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finel_edgar_irs_en_1.0.0_3.2_1661866402930.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("ner_chunk")

embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
    .setInputCols("ner_chunk") \
    .setOutputCol("sentence_embeddings")

resolver = finance.SentenceEntityResolverModel.pretrained("finel_edgar_irs", "en", "finance/models")\
    .setInputCols(["ner_chunk", "sentence_embeddings"]) \
    .setOutputCol("irs_code")\
    .setDistanceFunction("EUCLIDEAN")

pipeline = nlp.Pipeline(stages = [
    documentAssembler,
    embeddings,
    resolver])

# LightPipeline requires a fitted PipelineModel, so fit on an empty DataFrame first
pipelineModel = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

lp = nlp.LightPipeline(pipelineModel)

lp.fullAnnotate("CONTACT GOLD")
```
## Results ```bash +--------------+-----------+---------------------------------------------------------+--------------------------------------------------------+-------------------------------------------+ | chunk| code | all_codes| resolutions | all_distances| +--------------+-----------+---------------------------------------------------------+--------------------------------------------------------+-------------------------------------------+ | CONTACT GOLD | 981369960| [981369960, 271989147, 208531222, 273566922, 270348508] |[981369960, 271989147, 208531222, 273566922, 270348508] | [0.1733, 0.3700, 0.3867, 0.4103, 0.4121] | +--------------+-----------+---------------------------------------------------------+--------------------------------------------------------+-------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finel_edgar_irs| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[company_irs_number]| |Language:|en| |Size:|313.8 MB| |Case sensitive:|false| ## References In-house scrapping and postprocessing of SEC Edgar Database --- layout: model title: Word2Vec Embeddings in Palatine German (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, pfl, open_source] task: Embeddings language: pfl edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. 
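As the results table shows, the resolver returns the candidate codes and their distances as aligned lists, with the winning `code` being the candidate at the smallest distance. A small pure-Python sketch of that ranking, using the values from the table above (the list alignment is an assumption about the output format):

```python
# Rank resolver candidates by distance; the smallest distance wins.
# Values mirror the results table above; aligned lists are assumed.
def best_resolution(all_codes, all_distances):
    ranked = sorted(zip(all_codes, all_distances), key=lambda cd: cd[1])
    return ranked[0]

codes = ["981369960", "271989147", "208531222", "273566922", "270348508"]
dists = [0.1733, 0.3700, 0.3867, 0.4103, 0.4121]
print(best_resolution(codes, dists))  # → ('981369960', 0.1733)
```

Keeping the full ranked list (rather than just the top hit) is useful when you want to apply a distance threshold before trusting a resolution.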
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_pfl_3.4.1_3.0_1647451106726.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_pfl_3.4.1_3.0_1647451106726.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","pfl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","pfl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pfl.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|pfl|
|Size:|92.0 MB|
|Case sensitive:|false|
|Dimension:|300|
---
layout: model
title: Legal Method Of Exercise Clause Binary Classifier
author: John Snow Labs
name: legclf_method_of_exercise_clause
date: 2023-01-27
tags: [en, legal, classification, method, exercise, clauses, method_of_exercise, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `method-of-exercise` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.

If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
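The paragraph-splitting approach recommended above can be sketched in a few lines: split on blank lines, then flag chunks that likely exceed the 512-token embedding window. Whitespace tokenization here is only a rough stand-in for the model's real tokenizer, so treat the counts as an approximation:

```python
import re

# Split a long legal document into paragraphs on blank lines, then flag
# whether each chunk likely fits the 512-token embedding window.
# Whitespace splitting only approximates the model's actual tokenizer.
def split_paragraphs(text, max_tokens=512):
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

doc = ("METHOD OF EXERCISE.\nThe Option may be exercised by written notice.\n\n"
       "GOVERNING LAW.\nThis Agreement is governed by Delaware law.")
for para, fits in split_paragraphs(doc):
    print(fits, para.splitlines()[0])
```

Each resulting paragraph can then be classified independently, which is how you would scan a long contract for `method-of-exercise` clauses.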
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`method-of-exercise`, `other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_method_of_exercise_clause_en_1.0.0_3.0_1674821693129.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_method_of_exercise_clause_en_1.0.0_3.0_1674821693129.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_method_of_exercise_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[method-of-exercise]| |[other]| |[other]| |[method-of-exercise]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_method_of_exercise_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scrapped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support method-of-exercise 0.97 1.00 0.98 32 other 1.00 0.97 0.99 38 accuracy - - 0.99 70 macro-avg 0.98 0.99 0.99 70 weighted-avg 0.99 0.99 0.99 70 ``` --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-1024-finetuned-squad-seed-0` is a English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_0_en_4.0.0_3.0_1655730608726.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_0_en_4.0.0_3.0_1655730608726.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_1024d_seed_0").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|439.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-1024-finetuned-squad-seed-0 --- layout: model title: Recognize Entities DL Pipeline for Finnish - Medium author: John Snow Labs name: entity_recognizer_md date: 2021-03-22 tags: [open_source, finnish, entity_recognizer_md, pipeline, fi] supported: true task: [Named Entity Recognition, Lemmatization] language: fi edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_md is a pretrained pipeline that performs basic text processing steps and recognizes entities. It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_fi_3.0.0_3.0_1616456428015.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_fi_3.0.0_3.0_1616456428015.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('entity_recognizer_md', lang = 'fi') annotations = pipeline.fullAnnotate("Hei John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("entity_recognizer_md", lang = "fi") val result = pipeline.fullAnnotate("Hei John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hei John Snow Labs! "] result_df = nlu.load('fi.ner.md').predict(text) result_df ```
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:-------------------------|:------------------------|:---------------------------------|:-----------------------------|:---------------------------------|:--------------------| | 0 | ['Hei John Snow Labs! '] | ['Hei John Snow Labs!'] | ['Hei', 'John', 'Snow', 'Labs!'] | [[0.1868100017309188,.,...]] | ['O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| --- layout: model title: Drug Reviews Classifier (BioBERT) author: John Snow Labs name: bert_sequence_classifier_drug_reviews_webmd date: 2022-07-28 tags: [en, clinical, licensed, public_health, classifier, sequence_classification, drug, review] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [BioBERT based](https://github.com/dmis-lab/biobert) classifier that can classify drug reviews from WebMD.com ## Predicted Entities `negative`, `positive` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_CHANGE_DRUG_TREATMENT/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4SC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_drug_reviews_webmd_en_4.0.0_3.0_1659008484818.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_drug_reviews_webmd_en_4.0.0_3.0_1659008484818.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from pyspark.sql.types import StringType document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_drug_reviews_webmd", "en", "clinical/models")\ .setInputCols(["document",'token'])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame(["While it has worked for me, the sweating and chills especially at night when trying to sleep are very off putting and I am not sure if I will stick with it very much longer. My eyese no longer feel like there is something in them and my mouth is definitely not as dry as before but the side effects are too invasive for my liking.", "I previously used Cheratussin but was now dispensed Guaifenesin AC as a cheaper alternative. This stuff does n t work as good as Cheratussin and taste like cherry flavored sugar water."], StringType()).toDF("text") result = pipeline.fit(data).transform(data) result.select("text", "class.result").show(truncate=False) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_drug_reviews_webmd", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) val data = Seq(Array("While it has worked for me, the sweating and chills especially at night when trying to sleep are very off putting and I am not sure if I will stick with it very much longer. 
My eyese no longer feel like there is something in them and my mouth is definitely not as dry as before but the side effects are too invasive for my liking.", "I previously used Cheratussin but was now dispensed Guaifenesin AC as a cheaper alternative. This stuff does n t work as good as Cheratussin and taste like cherry flavored sugar water.")).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.drug_reviews").predict("""While it has worked for me, the sweating and chills especially at night when trying to sleep are very off putting and I am not sure if I will stick with it very much longer. My eyese no longer feel like there is something in them and my mouth is definitely not as dry as before but the side effects are too invasive for my liking.""") ```
## Results ```bash +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+ |text |result | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+ |While it has worked for me, the sweating and chills especially at night when trying to sleep are very off putting and I am not sure if I will stick with it very much longer. My eyese no longer feel like there is something in them and my mouth is definitely not as dry as before but the side effects are too invasive for my liking.|[negative]| |I previously used Cheratussin but was now dispensed Guaifenesin AC as a cheaper alternative. This stuff does n t work as good as Cheratussin and taste like cherry flavored sugar water . 
|[positive]| +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_drug_reviews_webmd| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## Benchmarking ```bash label precision recall f1-score support negative 0.8589 0.8234 0.8408 1042 positive 0.8612 0.8901 0.8754 1283 accuracy - - 0.8602 2325 macro-avg 0.8600 0.8568 0.8581 2325 weighted-avg 0.8602 0.8602 0.8599 2325 ``` --- layout: model title: English BertForQuestionAnswering model (from peggyhuang) author: John Snow Labs name: bert_qa_nolog_SciBert_v2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `nolog-SciBert-v2` is an English model originally trained by `peggyhuang`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_nolog_SciBert_v2_en_4.0.0_3.0_1654188991457.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_nolog_SciBert_v2_en_4.0.0_3.0_1654188991457.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_nolog_SciBert_v2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_nolog_SciBert_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.scibert.v2").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_nolog_SciBert_v2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|410.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/peggyhuang/nolog-SciBert-v2 --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Filial) author: John Snow Labs name: distilbert_qa_filial_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Filial`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_filial_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768555427.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_filial_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768555427.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_filial_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_filial_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_filial_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Filial/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal Waiver Of Subrogation Clause Binary Classifier author: John Snow Labs name: legclf_waiver_of_subrogation_clause date: 2023-01-29 tags: [en, legal, classification, waiver, subrogation, clauses, waiver_of_subrogation, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `waiver-of-subrogation` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
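The first splitting option mentioned above, paragraph splitting by multiline, can be sketched in plain Python before building the Spark DataFrame. The helper name, regex, and sample clauses below are illustrative assumptions, not part of Legal NLP:

```python
import re

def split_paragraphs(text):
    # Split on blank lines ("paragraph splitting by multiline"), dropping
    # empty chunks, so each chunk stays well within the 512-token limit
    # of the sentence embeddings used by the classifier.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("12.1 Waiver of Subrogation. Each party hereby waives any right of recovery...\n\n"
       "12.2 Notices. All notices under this Agreement shall be in writing...")
chunks = split_paragraphs(doc)
# Each chunk can then become one row of the DataFrame fed to the pipeline, e.g.:
# df = spark.createDataFrame([[c] for c in chunks]).toDF("text")
```

Classifying paragraph-sized chunks instead of the whole document lets you locate which section of a long contract actually contains the clause.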
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `waiver-of-subrogation`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_waiver_of_subrogation_clause_en_1.0.0_3.0_1675005652788.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_waiver_of_subrogation_clause_en_1.0.0_3.0_1675005652788.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_waiver_of_subrogation_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[waiver-of-subrogation]| |[other]| |[other]| |[waiver-of-subrogation]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_waiver_of_subrogation_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.95 0.97 0.96 39 waiver-of-subrogation 0.97 0.94 0.95 32 accuracy - - 0.96 71 macro-avg 0.96 0.96 0.96 71 weighted-avg 0.96 0.96 0.96 71 ``` --- layout: model title: Spanish BertForMaskedLM Base Uncased model (from dccuchile) author: John Snow Labs name: bert_embeddings_base_spanish_wwm_uncased date: 2022-12-02 tags: [es, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: es edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-uncased` is a Spanish model originally trained by `dccuchile`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_spanish_wwm_uncased_es_4.2.4_3.0_1670018913004.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_spanish_wwm_uncased_es_4.2.4_3.0_1670018913004.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_spanish_wwm_uncased","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_spanish_wwm_uncased","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_spanish_wwm_uncased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|412.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased - https://github.com/google-research/bert - https://github.com/josecannete/spanish-corpora - https://github.com/google-research/bert/blob/master/multilingual.md - https://users.dcc.uchile.cl/~jperez/beto/uncased_2M/tensorflow_weights.tar.gz - https://users.dcc.uchile.cl/~jperez/beto/uncased_2M/pytorch_weights.tar.gz - https://users.dcc.uchile.cl/~jperez/beto/cased_2M/tensorflow_weights.tar.gz - https://users.dcc.uchile.cl/~jperez/beto/cased_2M/pytorch_weights.tar.gz - https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1827 - https://www.kaggle.com/nltkdata/conll-corpora - https://github.com/gchaperon/beto-benchmarks/blob/master/conll2002/dev_results_beto-cased_conll2002.txt - https://github.com/facebookresearch/MLDoc - https://github.com/gchaperon/beto-benchmarks/blob/master/MLDoc/dev_results_beto-cased_mldoc.txt - https://github.com/gchaperon/beto-benchmarks/blob/master/MLDoc/dev_results_beto-uncased_mldoc.txt - https://github.com/google-research-datasets/paws/tree/master/pawsx - https://github.com/facebookresearch/XNLI - https://colab.research.google.com/drive/1uRwg4UmPgYIqGYY4gW_Nsw9782GFJbPt - https://www.adere.so/ - https://imfd.cl/en/ - https://www.tensorflow.org/tfrc - https://users.dcc.uchile.cl/~jperez/papers/pml4dc2020.pdf - https://github.com/google-research/bert/blob/master/multilingual.md - https://arxiv.org/pdf/1904.09077.pdf - https://arxiv.org/pdf/1906.01502.pdf - https://arxiv.org/abs/1812.10464 - https://arxiv.org/pdf/1901.07291.pdf - https://arxiv.org/pdf/1904.02099.pdf - https://arxiv.org/pdf/1906.01569.pdf - https://arxiv.org/abs/1908.11828 --- layout: 
model title: German XLMRobertaForTokenClassification Base Cased model (from V3RX2000) author: John Snow Labs name: xlmroberta_ner_v3rx2000_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `V3RX2000`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_v3rx2000_base_finetuned_panx_de_4.1.0_3.0_1660430662562.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_v3rx2000_base_finetuned_panx_de_4.1.0_3.0_1660430662562.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_v3rx2000_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_v3rx2000_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_v3rx2000_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/V3RX2000/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Pipeline to Summarize Clinical Notes in Laymen Terms author: John Snow Labs name: summarizer_clinical_laymen_pipeline date: 2023-06-06 tags: [licensed, en, clinical, summarization, laymen_terms] task: Summarization language: en edition: Healthcare NLP 4.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [summarizer_clinical_laymen](https://nlp.johnsnowlabs.com/2023/05/31/summarizer_clinical_laymen_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_laymen_pipeline_en_4.4.1_3.0_1686085843660.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_laymen_pipeline_en_4.4.1_3.0_1686085843660.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("summarizer_clinical_laymen_pipeline", "en", "clinical/models") text = """ Olivia Smith was seen in my office for evaluation for elective surgical weight loss on October 6, 2008. Olivia Smith is a 34-year-old female with a BMI of 43. She is 5'6" tall and weighs 267 pounds. She is motivated to attempt surgical weight loss because she has been overweight for over 20 years and wants to have more energy and improve her self-image. She is not only affected physically, but also socially by her weight. When she loses weight she always regains it and she always gains back more weight than she has lost. At one time, she lost 100 pounds and gained the weight back within a year. She has tried numerous commercial weight loss programs including Weight Watcher's for four months in 1992 with 15-pound weight loss, RS for two months in 1990 with six-pound weight loss, Slim Fast for six weeks in 2004 with eight-pound weight loss, an exercise program for two months in 2007 with a five-pound weight loss, Atkin's Diet for three months in 2008 with a ten-pound weight loss, and Dexatrim for one month in 2005 with a five-pound weight loss. She has also tried numerous fat reduction or fad diets. She was on Redux for nine months with a 100-pound weight loss. PAST MEDICAL HISTORY: She has a history of hypertension and shortness of breath. PAST SURGICAL HISTORY: Pertinent for cholecystectomy. PSYCHOLOGICAL HISTORY: Negative. SOCIAL HISTORY: She is single. She drinks alcohol once a week. She does not smoke. FAMILY HISTORY: Pertinent for obesity and hypertension. MEDICATIONS: Include Topamax 100 mg twice daily, Zoloft 100 mg twice daily, Abilify 5 mg daily, Motrin 800 mg daily, and a multivitamin. ALLERGIES: She has no known drug allergies. REVIEW OF SYSTEMS: Negative. PHYSICAL EXAM: This is a pleasant female in no acute distress. 
Alert and oriented x 3. HEENT: Normocephalic, atraumatic. Extraocular muscles intact, nonicteric sclerae. Chest is clear to auscultation bilaterally. Cardiovascular is normal sinus rhythm. Abdomen is obese, soft, nontender and nondistended. Extremities show no edema, clubbing or cyanosis. ASSESSMENT/PLAN: This is a 34-year-old female with a BMI of 43 who is interested in surgical weight via the gastric bypass as opposed to Lap-Band. Olivia Smith will be asking for a letter of medical necessity from Dr. Andrew Johnson. She will also see my nutritionist and social worker and have an upper endoscopy. Once this is completed, we will submit her to her insurance company for approval. """ result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("summarizer_clinical_laymen_pipeline", "en", "clinical/models") val text = """ Olivia Smith was seen in my office for evaluation for elective surgical weight loss on October 6, 2008. Olivia Smith is a 34-year-old female with a BMI of 43. She is 5'6" tall and weighs 267 pounds. She is motivated to attempt surgical weight loss because she has been overweight for over 20 years and wants to have more energy and improve her self-image. She is not only affected physically, but also socially by her weight. When she loses weight she always regains it and she always gains back more weight than she has lost. At one time, she lost 100 pounds and gained the weight back within a year. She has tried numerous commercial weight loss programs including Weight Watcher's for four months in 1992 with 15-pound weight loss, RS for two months in 1990 with six-pound weight loss, Slim Fast for six weeks in 2004 with eight-pound weight loss, an exercise program for two months in 2007 with a five-pound weight loss, Atkin's Diet for three months in 2008 with a ten-pound weight loss, and Dexatrim for one month in 2005 with a five-pound weight loss. 
She has also tried numerous fat reduction or fad diets. She was on Redux for nine months with a 100-pound weight loss. PAST MEDICAL HISTORY: She has a history of hypertension and shortness of breath. PAST SURGICAL HISTORY: Pertinent for cholecystectomy. PSYCHOLOGICAL HISTORY: Negative. SOCIAL HISTORY: She is single. She drinks alcohol once a week. She does not smoke. FAMILY HISTORY: Pertinent for obesity and hypertension. MEDICATIONS: Include Topamax 100 mg twice daily, Zoloft 100 mg twice daily, Abilify 5 mg daily, Motrin 800 mg daily, and a multivitamin. ALLERGIES: She has no known drug allergies. REVIEW OF SYSTEMS: Negative. PHYSICAL EXAM: This is a pleasant female in no acute distress. Alert and oriented x 3. HEENT: Normocephalic, atraumatic. Extraocular muscles intact, nonicteric sclerae. Chest is clear to auscultation bilaterally. Cardiovascular is normal sinus rhythm. Abdomen is obese, soft, nontender and nondistended. Extremities show no edema, clubbing or cyanosis. ASSESSMENT/PLAN: This is a 34-year-old female with a BMI of 43 who is interested in surgical weight via the gastric bypass as opposed to Lap-Band. Olivia Smith will be asking for a letter of medical necessity from Dr. Andrew Johnson. She will also see my nutritionist and social worker and have an upper endoscopy. Once this is completed, we will submit her to her insurance company for approval. """ val result = pipeline.fullAnnotate(text) ```
## Results ```bash This is a clinical note about a 34-year-old woman who is interested in having weight loss surgery. She has been overweight for over 20 years and wants to have more energy and improve her self-image. She has tried many diets and weight loss programs, but has not been successful in keeping the weight off. She has a history of hypertension and shortness of breath, but is not allergic to any medications. She will have an upper endoscopy and will be contacted by a nutritionist and social worker. The plan is to have her weight loss surgery through the gastric bypass, rather than Lap-Band. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_clinical_laymen_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|937.2 MB| ## Included Models - DocumentAssembler - MedicalSummarizer --- layout: model title: English BertForQuestionAnswering Mini Cased model (from M-FAC) author: John Snow Labs name: bert_qa_mini_finetuned_squadv2 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-mini-finetuned-squadv2` is an English model originally trained by `M-FAC`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mini_finetuned_squadv2_en_4.0.0_3.0_1657187810266.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mini_finetuned_squadv2_en_4.0.0_3.0_1657187810266.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mini_finetuned_squadv2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mini_finetuned_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
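Under the hood, `BertForQuestionAnswering` scores every context token as a candidate answer start and answer end, then returns the best-scoring valid span. A minimal pure-Python sketch of that span-selection step (the scores below are toy values invented for illustration, not the model's real logits):

```python
# Toy illustration of extractive-QA span selection: pick the (start, end)
# pair with the highest combined score, subject to start <= end and a
# maximum answer length.
def best_span(start_scores, end_scores, max_len=15):
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 3.1, 0.0, 0.1, 0.0, 0.0, 0.4, 0.0]  # toy start scores
end = [0.0, 0.1, 0.0, 2.8, 0.2, 0.0, 0.1, 0.0, 0.9, 0.0]  # toy end scores
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # prints "Clara"
```

This mirrors the kind of span the pipeline surfaces in its `answer` output column.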
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_mini_finetuned_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|42.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/M-FAC/bert-mini-finetuned-squadv2 - https://arxiv.org/pdf/2107.03356.pdf - https://github.com/IST-DASLab/M-FAC --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_10 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-64-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_10_en_4.0.0_3.0_1657185376585.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_10_en_4.0.0_3.0_1657185376585.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_10","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(False) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_64_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-64-finetuned-squad-seed-10 --- layout: model title: Financial Question Answering (RoBerta) author: John Snow Labs name: finqa_roberta date: 2022-08-09 tags: [en, finance, qa, licensed] task: Question Answering language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Financial RoBERTa-based Question Answering model, trained on SQuAD v2 and fine-tuned on proprietary financial questions and answers. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finqa_roberta_en_1.0.0_3.2_1660054527812.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finqa_roberta_en_1.0.0_3.2_1660054527812.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) spanClassifier = nlp.RoBertaForQuestionAnswering.pretrained("finqa_roberta","en", "finance/models") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = nlp.Pipeline(stages=[documentAssembler, spanClassifier]) example = spark.createDataFrame([["What is the current total Operating Profit?", "Operating profit totaled EUR 9.4 mn , down from EUR 11.7 mn in 2004"]]).toDF("question", "context") result = pipeline.fit(example).transform(example) result.select('answer.result').show() ```
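Extractive QA returns a raw span from the context, which for financial text can carry trailing words beyond the figure of interest (e.g. `9.4 mn , down from EUR 11.7`). A small post-processing sketch, assuming you only want the leading numeric figure from the span (this helper is an illustration, not part of the pipeline):

```python
import re

def leading_figure(span):
    """Return the first number in an extracted answer span as a float, or None."""
    m = re.search(r"\d+(?:[.,]\d+)?", span)
    return float(m.group().replace(",", ".")) if m else None

print(leading_figure("9.4 mn , down from EUR 11.7"))  # prints 9.4
```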
## Results ```bash `9.4 mn , down from EUR 11.7` ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finqa_roberta| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|248.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References Trained on SQuAD v2, fine-tuned on proprietary financial questions and answers. --- layout: model title: English XlmRoBertaForQuestionAnswering (from tyqiangz) author: John Snow Labs name: xlm_roberta_qa_xlm_roberta_base_finetuned_chaii date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-chaii` is an English model originally trained by `tyqiangz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_finetuned_chaii_en_4.0.0_3.0_1655989602942.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_finetuned_chaii_en_4.0.0_3.0_1655989602942.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_base_finetuned_chaii","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_base_finetuned_chaii","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.chaii.xlm_roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_roberta_base_finetuned_chaii| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|861.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/tyqiangz/xlm-roberta-base-finetuned-chaii --- layout: model title: Dutch Named Entity Recognition (from Davlan) author: John Snow Labs name: xlmroberta_ner_xlm_roberta_base_ner_hrl date: 2022-05-17 tags: [xlm_roberta, ner, token_classification, nl, open_source] task: Named Entity Recognition language: nl edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-ner-hrl` is a Dutch model originally trained by `Davlan`. ## Predicted Entities `PER`, `ORG`, `LOC`, `DATE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_ner_hrl_nl_3.4.2_3.0_1652809603204.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_ner_hrl_nl_3.4.2_3.0_1652809603204.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_ner_hrl","nl") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ik hou van Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_ner_hrl","nl") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ik hou van Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_xlm_roberta_base_ner_hrl| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|nl| |Size:|855.9 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Davlan/xlm-roberta-base-ner-hrl - https://camel.abudhabi.nyu.edu/anercorp/ - https://www.clips.uantwerpen.be/conll2003/ner/ - https://www.clips.uantwerpen.be/conll2002/ner/ --- layout: model title: Recognize Entities DL pipeline for English - BERT author: John Snow Labs name: recognize_entities_bert date: 2021-03-23 tags: [open_source, english, recognize_entities_bert, pipeline, en] supported: true task: [Named Entity Recognition, Lemmatization] language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The recognize_entities_bert pipeline is a pretrained pipeline that performs basic text processing steps and recognizes entities. It covers most of the common text processing tasks on your DataFrame. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/recognize_entities_bert_en_3.0.0_3.0_1616473903583.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/recognize_entities_bert_en_3.0.0_3.0_1616473903583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('recognize_entities_bert', lang = 'en') annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("recognize_entities_bert", lang = "en") val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs ! "] result_df = nlu.load('en.ner.bert').predict(text) result_df ```
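The pipeline's NER stage emits one IOB-style tag per token (the `ner` output), and a converter stage then merges consecutive tagged tokens into entity chunks (the `entities` output). A simplified pure-Python sketch of that merging step, grouping runs of consecutive non-`O` tokens:

```python
# Merge IOB-style per-token tags into entity chunk strings by grouping
# consecutive non-"O" tokens (a simplification of what the pipeline's
# converter stage does).
def merge_chunks(tokens, tags):
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        if tag != "O":
            current.append(token)
        elif current:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tokens = ["Hello", "from", "John", "Snow", "Labs", "!"]
tags = ["O", "O", "I-PER", "I-PER", "I-ORG", "O"]
print(merge_chunks(tokens, tags))  # prints ['John Snow Labs']
```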
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:-----------------------------|:-------------------------------------------|:-------------------| | 0 | ['Hello from John Snow Labs ! '] | ['Hello from John Snow Labs !'] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | [[-0.085488274693489,.,...]] | ['O', 'O', 'I-PER', 'I-PER', 'I-ORG', 'O'] | ['John Snow Labs'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|recognize_entities_bert| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| --- layout: model title: Fast Neural Machine Translation Model from Kirundi to English author: John Snow Labs name: opus_mt_rn_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, rn, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `rn` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_rn_en_xx_2.7.0_2.4_1609167024774.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_rn_en_xx_2.7.0_2.4_1609167024774.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_rn_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_rn_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.rn.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_rn_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from juliusco) author: John Snow Labs name: distilbert_qa_juliusco_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `juliusco`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_juliusco_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771577414.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_juliusco_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771577414.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_juliusco_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(False) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_juliusco_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_juliusco_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/juliusco/distilbert-base-uncased-finetuned-squad --- layout: model title: English XlmRoBertaForQuestionAnswering (from teacookies) author: John Snow Labs name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265902 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-more_fine_tune_24465520-26265902` is an English model originally trained by `teacookies`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265902_en_4.0.0_3.0_1655984889587.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265902_en_4.0.0_3.0_1655984889587.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265902","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265902","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265902").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265902| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|888.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265902 --- layout: model title: Detect Assertion Status (assertion_wip_large) author: John Snow Labs name: jsl_assertion_wip_large date: 2021-01-18 task: Assertion Status language: en nav_key: models edition: Healthcare NLP 2.7.0 spark_version: 2.4 tags: [clinical, licensed, assertion, en, ner] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The deep neural network architecture for assertion status detection in Spark NLP is based on a BiLSTM framework, and is a modified version of the architecture proposed by Fancellu et al. (Fancellu, Lopez, and Webber 2016). Its goal is to classify the assertions made on given medical concepts as present, absent, or possible in the patient, conditionally present under certain circumstances, hypothetically present at some future point, or mentioned in the patient report but associated with someone else (Uzuner et al. 2011). Unlike our other assertion models, in-house annotations on a curated dataset (6K clinical notes) are used to augment the base assertion dataset (2010 i2b2/VA). {:.h2_title} ## Predicted Entities `present`, `absent`, `possible`, `planned`, `someoneelse`, `past`.
{:.btn-box} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_assertion_wip_large_en_2.6.5_2.4_1609091911183.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_assertion_wip_large_en_2.6.5_2.4_1609091911183.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter, AssertionDLModel.
{% include programmingLanguageSelectScalaPython.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = sparknlp.annotators.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") clinical_assertion = AssertionDLModel.pretrained("jsl_assertion_wip_large", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion]) light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. 
His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```

```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val clinical_assertion = AssertionDLModel.pretrained("jsl_assertion_wip_large", "en", "clinical/models")
    .setInputCols(Array("sentence", "ner_chunk", "embeddings"))
    .setOutputCol("assertion")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion))

val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days.
The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ```
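Each assertion annotation carries the chunk's begin and end character indices into the input text, and Spark NLP end indices are inclusive. As a quick plain-Python sanity check, the indices reported for the `21-day-old` Age chunk in the results table of this card recover the chunk from the example document:

```python
# opening of the example document used above
text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion"

# begin/end indices reported for the "21-day-old" Age chunk (end is inclusive)
begin, end = 17, 26
chunk = text[begin:end + 1]
print(chunk)  # 21-day-old
```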
{:.h2_title}
## Results

The output is a dataframe with a sentence per row and an ``"assertion"`` column containing all of the assertion labels in the sentence. The assertion column also contains assertion character indices, and other metadata. To get only the entity chunks and assertion labels, without the metadata, select ``"ner_chunk.result"`` and ``"assertion.result"`` from your output dataframe.

```bash
+-----------------------------------------+-----+---+----------------------------+-------+-----------+
|chunk                                    |begin|end|ner_label                   |sent_id|assertion  |
+-----------------------------------------+-----+---+----------------------------+-------+-----------+
|21-day-old                               |17   |26 |Age                         |0      |present    |
|Caucasian                                |28   |36 |Race_Ethnicity              |0      |present    |
|male                                     |38   |41 |Gender                      |0      |someoneelse|
|for 2 days                               |48   |57 |Duration                    |0      |present    |
|congestion                               |62   |71 |Symptom                     |0      |present    |
|mom                                      |75   |77 |Gender                      |0      |someoneelse|
|yellow                                   |99   |104|Modifier                    |0      |present    |
|discharge                                |106  |114|Symptom                     |0      |present    |
|nares                                    |135  |139|External_body_part_or_region|0      |someoneelse|
|she                                      |147  |149|Gender                      |0      |present    |
|mild                                     |168  |171|Modifier                    |0      |present    |
|problems with his breathing while feeding|173  |213|Symptom                     |0      |present    |
|perioral cyanosis                        |237  |253|Symptom                     |0      |absent     |
|retractions                              |258  |268|Symptom                     |0      |absent     |
|One day ago                              |272  |282|RelativeDate                |1      |someoneelse|
|mom                                      |285  |287|Gender                      |1      |someoneelse|
|Tylenol                                  |345  |351|Drug_BrandName              |1      |someoneelse|
|Baby                                     |354  |357|Age                         |2      |someoneelse|
|decreased p.o. intake                    |377  |397|Symptom                     |2      |someoneelse|
|His                                      |400  |402|Gender                      |3      |someoneelse|
+-----------------------------------------+-----+---+----------------------------+-------+-----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|jsl_assertion_wip_large|
|Type:|ner|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|License:|Licensed|
|Input Labels:|[sentence, ner_chunk, embeddings]|
|Output Labels:|[assertion]|
|Language:|[en]|
|Case sensitive:|false|

{:.h2_title}
## Data Source

Trained on the 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text, with 'embeddings_clinical'. https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/

{:.h2_title}
## Benchmarking

```bash
label          prec   rec    f1
absent         0.957  0.949  0.953
someoneelse    0.958  0.936  0.947
planned        0.766  0.657  0.707
possible       0.852  0.884  0.868
past           0.894  0.890  0.892
present        0.902  0.917  0.910
Macro-average  0.888  0.872  0.880
Micro-average  0.908  0.908  0.908
```

---
layout: model
title: English T5ForConditionalGeneration Large Cased model (from google)
author: John Snow Labs
name: t5_efficient_large_nl10
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-large-nl10` is an English model originally trained by `google`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_large_nl10_en_4.3.0_3.0_1675116839759.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_large_nl10_en_4.3.0_3.0_1675116839759.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_large_nl10","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_large_nl10","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
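T5 models use a text-to-text interface in which the task is signalled by a plain-text prefix prepended to the input string. The prefixes below follow the original T5 conventions and are an illustration only; whether this particular checkpoint responds to a given prefix depends on how it was trained:

```python
# standard T5-style task prefixes (illustrative; not specific to this checkpoint)
text = "PUT YOUR STRING HERE"
summarization_input = "summarize: " + text
translation_input = "translate English to German: " + text
print(summarization_input)  # summarize: PUT YOUR STRING HERE
```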
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_large_nl10|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|689.4 MB|

## References

- https://huggingface.co/google/t5-efficient-large-nl10
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145

---
layout: model
title: Albanian BertForQuestionAnswering model (from vanadhi)
author: John Snow Labs
name: bert_qa_bert_base_uncased_fiqa_flm_sq_flit
date: 2022-06-02
tags: [open_source, question_answering, bert]
task: Question Answering
language: sq
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-fiqa-flm-sq-flit` is an Albanian model originally trained by `vanadhi`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_fiqa_flm_sq_flit_sq_4.0.0_3.0_1654181262795.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_fiqa_flm_sq_flit_sq_4.0.0_3.0_1654181262795.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_fiqa_flm_sq_flit","sq") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```

```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_bert_base_uncased_fiqa_flm_sq_flit","sq")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("sq.answer_question.bert.base_uncased.by_vanadhi").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
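In the NLU one-liner above, the question and its context are passed as a single string joined by `|||`. A plain-Python illustration of how such a string decomposes into the two parts:

```python
# NLU's question-answering input convention: question ||| context
s = "What's my name?|||My name is Clara and I live in Berkeley."
question, context = s.split("|||")
print(question)  # What's my name?
print(context)   # My name is Clara and I live in Berkeley.
```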
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_fiqa_flm_sq_flit| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|sq| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/vanadhi/bert-base-uncased-fiqa-flm-sq-flit - https://drive.google.com/file/d/1BlWaV-qVPfpGyJoWQJU9bXQgWCATgxEP/view --- layout: model title: Spanish T5ForConditionalGeneration Cased model (from JorgeSarry) author: John Snow Labs name: t5_est5base date: 2023-01-30 tags: [es, open_source, t5, tensorflow] task: Text Generation language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `est5base` is a Spanish model originally trained by `JorgeSarry`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_est5base_es_4.3.0_3.0_1675101719578.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_est5base_es_4.3.0_3.0_1675101719578.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_est5base","es") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_est5base","es")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_est5base| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|es| |Size:|511.9 MB| ## References - https://huggingface.co/JorgeSarry/est5base - https://towardsdatascience.com/how-to-adapt-a-multilingual-t5-model-for-a-single-language-b9f94f3d9c90 --- layout: model title: English BertForQuestionAnswering Large Cased model (from Slavka) author: John Snow Labs name: bert_qa_base_cased_finetuned_log_parser_winlogbeat_nowhitespace_larg date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased-finetuned-log-parser-winlogbeat_nowhitespace_large` is a English model originally trained by `Slavka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_cased_finetuned_log_parser_winlogbeat_nowhitespace_larg_en_4.0.0_3.0_1657182801327.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_cased_finetuned_log_parser_winlogbeat_nowhitespace_larg_en_4.0.0_3.0_1657182801327.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_cased_finetuned_log_parser_winlogbeat_nowhitespace_larg","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_cased_finetuned_log_parser_winlogbeat_nowhitespace_larg","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_cased_finetuned_log_parser_winlogbeat_nowhitespace_larg| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Slavka/bert-base-cased-finetuned-log-parser-winlogbeat_nowhitespace_large --- layout: model title: Japanese BertForMaskedLM Base Cased model (from cl-tohoku) author: John Snow Labs name: bert_embeddings_base_japanese_char_whole_word_masking date: 2022-12-02 tags: [ja, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ja edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-japanese-char-whole-word-masking` is a Japanese model originally trained by `cl-tohoku`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_japanese_char_whole_word_masking_ja_4.2.4_3.0_1670018214237.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_japanese_char_whole_word_masking_ja_4.2.4_3.0_1670018214237.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_japanese_char_whole_word_masking","ja") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_japanese_char_whole_word_masking","ja")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
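As the model name indicates, this checkpoint was pretrained with character-level tokenization of Japanese text (each character is one token) plus whole-word masking. The Spark NLP annotator handles tokenization internally; in plain Python, character-level tokenization is simply:

```python
# character-level tokenization: each character becomes one token
# (illustration only; the annotator performs this internally)
text = "日本語"
char_tokens = list(text)
print(char_tokens)  # ['日', '本', '語']
```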
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_japanese_char_whole_word_masking| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Size:|334.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/cl-tohoku/bert-base-japanese-char-whole-word-masking - https://github.com/google-research/bert - https://github.com/cl-tohoku/bert-japanese/tree/v1.0 - https://github.com/attardi/wikiextractor - https://taku910.github.io/mecab/ - https://creativecommons.org/licenses/by-sa/3.0/ - https://www.tensorflow.org/tfrc/ --- layout: model title: English image_classifier_vit_my_bean_VIT ViTForImageClassification from woojinSong author: John Snow Labs name: image_classifier_vit_my_bean_VIT date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_my_bean_VIT` is a English model originally trained by woojinSong. ## Predicted Entities `angular_leaf_spot`, `bean_rust`, `healthy` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_my_bean_VIT_en_4.1.0_3.0_1660167368304.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_my_bean_VIT_en_4.1.0_3.0_1660167368304.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_my_bean_VIT", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_my_bean_VIT", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
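The classifier assigns each image one of the predicted entities (`angular_leaf_spot`, `bean_rust`, `healthy`). Image classification heads of this kind conventionally take the argmax over softmax-normalized class logits; a generic plain-Python sketch with made-up logits (not this model's actual outputs):

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["angular_leaf_spot", "bean_rust", "healthy"]  # this model's classes
probs = softmax([2.0, 0.5, 0.1])                        # hypothetical logits
prediction = labels[probs.index(max(probs))]
print(prediction)  # angular_leaf_spot
```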
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_my_bean_VIT| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Word2Vec Embeddings in Sinhala (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, si, open_source] task: Embeddings language: si edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_si_3.4.1_3.0_1647457358445.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_si_3.4.1_3.0_1647457358445.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","si") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["මම ස්පර්ක් එන්එල්පී වලට කැමතියි"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","si") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("මම ස්පර්ක් එන්එල්පී වලට කැමතියි").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("si.embed.w2v_cc_300d").predict("""මම ස්පර්ක් එන්එල්පී වලට කැමතියි""") ```
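The lookup maps each token to a fixed 300-dimensional vector; a common downstream use is comparing tokens by cosine similarity. A generic plain-Python sketch (toy 3-d vectors stand in for the model's 300-d embeddings):

```python
import math

def cosine_similarity(u, v):
    # cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy vectors standing in for 300-d word embeddings
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```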
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|si| |Size:|471.2 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from Evelyn18) author: John Snow Labs name: roberta_qa_base_bne_becas date: 2023-01-20 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bne-ROBERTaBECAS` is a Spanish model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_bne_becas_es_4.3.0_3.0_1674212894934.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_bne_becas_es_4.3.0_3.0_1674212894934.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

question_answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_bne_becas","es")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[document_assembler, question_answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val questionAnswering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_bne_becas","es")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, questionAnswering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_bne_becas| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|es| |Size:|420.6 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Evelyn18/roberta-base-bne-ROBERTaBECAS --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_doddle124578 TFWav2Vec2ForCTC from doddle124578 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_doddle124578 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_base_timit_demo_colab_by_doddle124578` is a English model originally trained by doddle124578. NOTE: This pipeline only works on a CPU, if you need to use this pipeline on a GPU device please use pipeline_asr_wav2vec2_base_timit_demo_colab_by_doddle124578_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_doddle124578_en_4.2.0_3.0_1664036115160.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_doddle124578_en_4.2.0_3.0_1664036115160.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_doddle124578', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_doddle124578", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_doddle124578|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|355.0 MB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: Extract Granular Anatomical Entities from Oncology Texts
author: John Snow Labs
name: ner_oncology_anatomy_granular_wip
date: 2022-10-01
tags: [licensed, clinical, oncology, en, ner, anatomy]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model extracts mentions of anatomical entities using granular labels.

Definitions of Predicted Entities:

- `Direction`: Directional and laterality terms, such as "left", "right", "bilateral", "upper" and "lower".
- `Site_Bone`: Anatomical terms that refer to the human skeleton.
- `Site_Brain`: Anatomical terms that refer to the central nervous system (including the brain stem and the cerebellum).
- `Site_Breast`: Anatomical terms that refer to the breasts.
- `Site_Liver`: Anatomical terms that refer to the liver.
- `Site_Lung`: Anatomical terms that refer to the lungs.
- `Site_Lymph_Node`: Anatomical terms that refer to lymph nodes, excluding adenopathies.
- `Site_Other_Body_Part`: Relevant anatomical terms that are not included in the rest of the anatomical entities.
## Predicted Entities `Direction`, `Site_Bone`, `Site_Brain`, `Site_Breast`, `Site_Liver`, `Site_Lung`, `Site_Lymph_Node`, `Site_Other_Body_Part` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_granular_wip_en_4.0.0_3.0_1664584284877.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_granular_wip_en_4.0.0_3.0_1664584284877.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_anatomy_granular_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_anatomy_granular_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) 
.setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_anatomy_granular_wip").predict("""The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.""") ```
## Results ```bash | chunk | ner_label | |:--------|:------------| | left | Direction | | breast | Site_Breast | | lungs | Site_Lung | | liver | Site_Liver | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_anatomy_granular_wip| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|859.9 KB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Direction 601.0 150.0 133.0 734.0 0.80 0.82 0.81 Site_Lymph_Node 415.0 31.0 51.0 466.0 0.93 0.89 0.91 Site_Breast 98.0 6.0 20.0 118.0 0.94 0.83 0.88 Site_Other_Body_Part 713.0 277.0 388.0 1101.0 0.72 0.65 0.68 Site_Bone 176.0 30.0 56.0 232.0 0.85 0.76 0.80 Site_Liver 134.0 77.0 36.0 170.0 0.64 0.79 0.70 Site_Lung 337.0 70.0 106.0 443.0 0.83 0.76 0.79 Site_Brain 164.0 58.0 36.0 200.0 0.74 0.82 0.78 macro_avg 2638.0 699.0 826.0 3464.0 0.81 0.79 0.80 micro_avg NaN NaN NaN NaN 0.79 0.76 0.78 ``` --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from edwardjross) author: John Snow Labs name: xlmroberta_ner_edwardjross_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `edwardjross`. 
## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_edwardjross_base_finetuned_panx_de_4.1.0_3.0_1660432506605.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_edwardjross_base_finetuned_panx_de_4.1.0_3.0_1660432506605.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_edwardjross_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_edwardjross_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_edwardjross_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/edwardjross/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Legal Name Clause Binary Classifier author: John Snow Labs name: legclf_name_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `name` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the same tutorial linked above).
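The paragraph-splitting option listed above can be approximated in a few lines of plain Python. This is a minimal sketch that splits on blank lines, not the workshop's implementation:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs on blank lines (multiline splitting)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Hypothetical two-clause document, used only for illustration.
doc = "1. DEFINITIONS.\nCapitalized terms are defined below.\n\n2. NAME.\nThe name of the company is ..."
paragraphs = split_paragraphs(doc)
print(len(paragraphs))  # 2
```

Each resulting paragraph can then be fed to the classifier as a `clause_text` row.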
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing as output a True/False value for each legal clause model you add. ## Predicted Entities `other`, `name` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_name_clause_en_1.0.0_3.2_1660122668905.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_name_clause_en_1.0.0_3.2_1660122668905.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_name_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[name]| |[other]| |[other]| |[name]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_name_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support name 0.98 0.91 0.94 46 other 0.97 0.99 0.98 148 accuracy - - 0.97 194 macro-avg 0.98 0.95 0.96 194 weighted-avg 0.97 0.97 0.97 194 ``` --- layout: model title: Spanish Deberta Embeddings model (from plncmm) author: John Snow Labs name: deberta_embeddings_cowese_base date: 2023-03-12 tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, es, tensorflow] task: Embeddings language: es edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DeBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mdeberta-cowese-base-es` is a Spanish model originally trained by `plncmm`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_cowese_base_es_4.3.1_3.0_1678657528702.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_cowese_base_es_4.3.1_3.0_1678657528702.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_cowese_base","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_cowese_base","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|deberta_embeddings_cowese_base| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|es| |Size:|1.0 GB| |Case sensitive:|false| ## References https://huggingface.co/plncmm/mdeberta-cowese-base-es --- layout: model title: Fast Neural Machine Translation Model from English to Ga author: John Snow Labs name: opus_mt_en_gaa date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, gaa, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `gaa` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_gaa_xx_2.7.0_2.4_1609169994333.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_gaa_xx_2.7.0_2.4_1609169994333.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_gaa", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_gaa", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.gaa').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_gaa| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English XLMRobertaForTokenClassification Base Cased model (from tner) author: John Snow Labs name: xlmroberta_ner_base_bionlp2004 date: 2022-08-13 tags: [en, open_source, xlm_roberta, ner] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-bionlp2004` is a English model originally trained by `tner`. ## Predicted Entities `protein`, `dna`, `cell line`, `rna`, `cell type` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_bionlp2004_en_4.1.0_3.0_1660426076485.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_bionlp2004_en_4.1.0_3.0_1660426076485.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_bionlp2004","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_bionlp2004","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_bionlp2004| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|783.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/tner/xlm-roberta-base-bionlp2004 - https://github.com/asahi417/tner --- layout: model title: Detect Problems, Tests and Treatments (ner_clinical_large) author: John Snow Labs name: ner_clinical_large date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terms. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state of the art model for NER: Chiu & Nicols, Named Entity Recognition with Bidirectional LSTM-CNN. ## Predicted Entities `PROBLEM`, `TEST`, `TREATMENT`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_EVENTS_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_large_en_3.0.0_3.0_1617206114650.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_large_en_3.0.0_3.0_1617206114650.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. 
BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.']], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. 
The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ```
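The Benchmarking section further below reports both macro- and micro-averaged scores; the micro average pools the tp/fp/fn counts across all labels before computing the metrics. An illustrative check in plain Python, using the pooled totals from that table:

```python
# Pool tp/fp/fn across all labels (totals from the Benchmarking table),
# then compute micro-averaged precision, recall, and F1.
tp, fp, fn = 55987, 7358, 8824

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"{precision:.4f} {recall:.4f} {f1:.4f}")
```

These reproduce the Micro-average row (0.883842 / 0.86385 / 0.873732) to rounding.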
## Results ```bash +-----------------------------------------------------------+---------+ |chunk |ner_label| +-----------------------------------------------------------+---------+ |the G-protein-activated inwardly rectifying potassium (GIRK|TREATMENT| |the genomicorganization |TREATMENT| |a candidate gene forType II diabetes mellitus |PROBLEM | |byapproximately |TREATMENT| |single nucleotide polymorphisms |TREATMENT| |aVal366Ala substitution |TREATMENT| |an 8 base-pair |TREATMENT| |insertion/deletion |PROBLEM | |Ourexpression studies |TEST | |the transcript in various humantissues |PROBLEM | |fat andskeletal muscle |PROBLEM | |furtherstudies |PROBLEM | |the KCNJ9 protein |TREATMENT| |evaluation |TEST | |Type II diabetes |PROBLEM | |the treatment |TREATMENT| |breast cancer |PROBLEM | |the standard therapy |TREATMENT| |anthracyclines |TREATMENT| |taxanes |TREATMENT| +-----------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical_large| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on augmented 2010 i2b2 challenge data with 'embeddings_clinical'. 
https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|--------------:|------:|------:|------:|---------:|---------:|---------:| | 0 | I-TREATMENT | 6625 | 1187 | 1329 | 0.848054 | 0.832914 | 0.840416 | | 1 | I-PROBLEM | 15142 | 1976 | 2542 | 0.884566 | 0.856254 | 0.87018 | | 2 | B-PROBLEM | 11005 | 1065 | 1587 | 0.911765 | 0.873968 | 0.892466 | | 3 | I-TEST | 6748 | 923 | 1264 | 0.879677 | 0.842237 | 0.86055 | | 4 | B-TEST | 8196 | 942 | 1029 | 0.896914 | 0.888455 | 0.892665 | | 5 | B-TREATMENT | 8271 | 1265 | 1073 | 0.867345 | 0.885167 | 0.876165 | | 6 | Macro-average | 55987 | 7358 | 8824 | 0.881387 | 0.863166 | 0.872181 | | 7 | Micro-average | 55987 | 7358 | 8824 | 0.883842 | 0.86385 | 0.873732 | ``` --- layout: model title: Legal Undertaking for costs Clause Binary Classifier author: John Snow Labs name: legclf_undertaking_for_costs_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `undertaking-for-costs` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the same tutorial linked above). This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing as output a True/False value for each legal clause model you add. ## Predicted Entities `other`, `undertaking-for-costs` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_undertaking_for_costs_clause_en_1.0.0_3.2_1660123148146.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_undertaking_for_costs_clause_en_1.0.0_3.2_1660123148146.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_undertaking_for_costs_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
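As noted in the description, the sentence embeddings cover up to 512 tokens, so longer clause texts can be pre-chunked before entering the pipeline. A rough sketch using whitespace tokens (illustrative only — the model applies its own tokenizer, so treat 512 as an approximate budget):

```python
def chunk_by_tokens(text: str, max_tokens: int = 512) -> list:
    """Greedily group whitespace tokens into chunks of at most max_tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]

# 1100 tokens split into 512 + 512 + 76.
chunks = chunk_by_tokens("word " * 1100)
print([len(c.split()) for c in chunks])  # [512, 512, 76]
```

Each chunk can then be classified independently as its own `clause_text` row.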
## Results ```bash +-------+ | result| +-------+ |[undertaking-for-costs]| |[other]| |[other]| |[undertaking-for-costs]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_undertaking_for_costs_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.7 MB| ## References Legal documents scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.98 0.98 0.98 44 undertaking-for-costs 0.91 0.91 0.91 11 accuracy - - 0.96 55 macro-avg 0.94 0.94 0.94 55 weighted-avg 0.96 0.96 0.96 55 ``` --- layout: model title: French CamemBert Embeddings (from aliasdasd) author: John Snow Labs name: camembert_embeddings_aliasdasd_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `aliasdasd`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_aliasdasd_generic_model_fr_3.4.4_3.0_1653987389063.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_aliasdasd_generic_model_fr_3.4.4_3.0_1653987389063.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_aliasdasd_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_aliasdasd_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
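Downstream, the per-token vectors produced in the `embeddings` column are often compared by cosine similarity. A minimal pure-Python sketch of that comparison, using toy 3-dimensional vectors standing in for two token embeddings (illustrative only, independent of Spark NLP):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

v1, v2 = [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]
print(round(cosine_similarity(v1, v2), 2))  # 0.5
```

Real CamemBERT vectors are 768-dimensional, but the computation is identical.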
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_aliasdasd_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/aliasdasd/dummy-model --- layout: model title: Pipeline to Summarize Radiology Reports author: John Snow Labs name: summarizer_radiology_pipeline date: 2023-05-29 tags: [licensed, en, clinical, summarization, radiology] task: Summarization language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [summarizer_radiology](https://nlp.johnsnowlabs.com/2023/04/23/summarizer_jsl_radiology_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_radiology_pipeline_en_4.4.2_3.0_1685401622765.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_radiology_pipeline_en_4.4.2_3.0_1685401622765.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("summarizer_radiology_pipeline", "en", "clinical/models") text = """INDICATIONS: Peripheral vascular disease with claudication. RIGHT: 1. Normal arterial imaging of right lower extremity. 2. Peak systolic velocity is normal. 3. Arterial waveform is triphasic. 4. Ankle brachial index is 0.96. LEFT: 1. Normal arterial imaging of left lower extremity. 2. Peak systolic velocity is normal. 3. Arterial waveform is triphasic throughout except in posterior tibial artery where it is biphasic. 4. Ankle brachial index is 1.06. IMPRESSION: Normal arterial imaging of both lower lobes. """ result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("summarizer_radiology_pipeline", "en", "clinical/models") val text = """INDICATIONS: Peripheral vascular disease with claudication. RIGHT: 1. Normal arterial imaging of right lower extremity. 2. Peak systolic velocity is normal. 3. Arterial waveform is triphasic. 4. Ankle brachial index is 0.96. LEFT: 1. Normal arterial imaging of left lower extremity. 2. Peak systolic velocity is normal. 3. Arterial waveform is triphasic throughout except in posterior tibial artery where it is biphasic. 4. Ankle brachial index is 1.06. IMPRESSION: Normal arterial imaging of both lower lobes. """ val result = pipeline.fullAnnotate(text) ```
## Results ```bash The patient has peripheral vascular disease with claudication. The right lower extremity shows normal arterial imaging, but the peak systolic velocity is normal. The arterial waveform is triphasic throughout, except for the posterior tibial artery, which is biphasic. The ankle brachial index is 0.96. The impression is normal arterial imaging of both lower lobes. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_radiology_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|937.2 MB| ## Included Models - DocumentAssembler - MedicalSummarizer --- layout: model title: Multilingual XLMRobertaForTokenClassification Large Cased model (from oliverguhr) author: John Snow Labs name: xlmroberta_ner_fullstop_punctuation_multilang_larg date: 2022-08-01 tags: [xx, open_source, xlm_roberta, ner] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fullstop-punctuation-multilang-large` is a Multilingual model originally trained by `oliverguhr`. ## Predicted Entities `?`, `:`, `,`, `-`, `0`, `.` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_fullstop_punctuation_multilang_larg_xx_4.1.0_3.0_1659353733692.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_fullstop_punctuation_multilang_larg_xx_4.1.0_3.0_1659353733692.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_fullstop_punctuation_multilang_larg","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_fullstop_punctuation_multilang_larg","xx") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_fullstop_punctuation_multilang_larg| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|1.8 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large --- layout: model title: French CamemBert Embeddings (from wangst) author: John Snow Labs name: camembert_embeddings_wangst_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `wangst`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_wangst_generic_model_fr_3.4.4_3.0_1653990612908.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_wangst_generic_model_fr_3.4.4_3.0_1653990612908.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_wangst_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_wangst_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_wangst_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/wangst/dummy-model --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_6 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-512-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_6_en_4.0.0_3.0_1655733114451.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_6_en_4.0.0_3.0_1655733114451.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_6","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_512d_seed_6").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|432.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-512-finetuned-squad-seed-6 --- layout: model title: English BertForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_4 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-512-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_4_en_4.0.0_3.0_1657192672919.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_4_en_4.0.0_3.0_1657192672919.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|387.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-512-finetuned-squad-seed-4 --- layout: model title: Longformer Large NER Pipeline author: John Snow Labs name: longformer_large_token_classifier_conll03_pipeline date: 2022-06-19 tags: [open_source, ner, token_classifier, longformer, conll, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [longformer_large_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/10/09/longformer_large_token_classifier_conll03_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/longformer_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655653984745.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/longformer_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655653984745.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("longformer_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I am working at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("longformer_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I am working at John Snow Labs.") ```
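For quick experiments, `annotate` returns a plain Python dict mapping output column names to lists of annotated strings. The sketch below simulates that shape rather than running the pipeline; the keys and sample values are illustrative assumptions, not captured output from this model:

```python
# Hedged sketch (not captured output): PretrainedPipeline.annotate returns a
# dict of output-column name -> list of strings. We simulate one such result
# to show how detected chunks pair up with their entity labels.
sample = {
    "chunk": ["John", "John Snow Labs"],
    "ner_label": ["PER", "ORG"],
}

# Pair each detected chunk with its entity label.
pairs = list(zip(sample["chunk"], sample["ner_label"]))
print(pairs)  # [('John', 'PER'), ('John Snow Labs', 'ORG')]
```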
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PER | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|longformer_large_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.5 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - LongformerForTokenClassification - NerConverter - Finisher --- layout: model title: Spanish RobertaForQuestionAnswering (from stevemobs) author: John Snow Labs name: roberta_qa_roberta_large_fine_tuned_squad_es_stevemobs date: 2022-06-21 tags: [es, open_source, question_answering, roberta] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-fine-tuned-squad-es` is a Spanish model originally trained by `stevemobs`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_fine_tuned_squad_es_stevemobs_es_4.0.0_3.0_1655791128840.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_fine_tuned_squad_es_stevemobs_es_4.0.0_3.0_1655791128840.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_fine_tuned_squad_es_stevemobs","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_large_fine_tuned_squad_es_stevemobs","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.squad.roberta.large").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_large_fine_tuned_squad_es_stevemobs| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|es| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/stevemobs/roberta-large-fine-tuned-squad-es --- layout: model title: English asr_wav2vec2_large_xlsr_moroccan TFWav2Vec2ForCTC from othrif author: John Snow Labs name: asr_wav2vec2_large_xlsr_moroccan date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_moroccan` is an English model originally trained by othrif. NOTE: This model only works on a CPU; if you need to run it on a GPU device, please use asr_wav2vec2_large_xlsr_moroccan_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_moroccan_en_4.2.0_3.0_1664097975990.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_moroccan_en_4.2.0_3.0_1664097975990.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_moroccan", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_moroccan", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
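The snippet above assumes an existing `audioDf` holding a column of raw float samples. A minimal sketch of producing such samples, assuming 16 kHz mono input (the rate Wav2Vec2 checkpoints are trained on) and synthesizing a tone in place of a real recording:

```python
import math

# Wav2Vec2 models expect 16 kHz mono audio as an array of floats.
SAMPLE_RATE = 16_000

# Synthesize one second of a quiet 440 Hz tone instead of loading a wav file.
samples = [0.1 * math.sin(2 * math.pi * 440.0 * t / SAMPLE_RATE)
           for t in range(SAMPLE_RATE)]

# With an active Spark session, the DataFrame the pipeline expects would be:
# audioDf = spark.createDataFrame([[samples]]).toDF("audio_content")
print(len(samples))  # 16000
```

In practice the float array would come from decoding a real audio file resampled to 16 kHz; the tone here only stands in for that data.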
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_moroccan| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Fast Neural Machine Translation Model from Umbundu to English author: John Snow Labs name: opus_mt_umb_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, umb, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `umb` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_umb_en_xx_2.7.0_2.4_1609169540084.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_umb_en_xx_2.7.0_2.4_1609169540084.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_umb_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_umb_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.umb.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_umb_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from Isoko to English author: John Snow Labs name: opus_mt_iso_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, iso, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `iso` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_iso_en_xx_2.7.0_2.4_1609169073963.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_iso_en_xx_2.7.0_2.4_1609169073963.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_iso_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_iso_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.iso.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_iso_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English asr_models_6 TFWav2Vec2ForCTC from niclas author: John Snow Labs name: pipeline_asr_models_6 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_models_6` is an English model originally trained by niclas. NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use pipeline_asr_models_6_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_models_6_en_4.2.0_3.0_1664098783377.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_models_6_en_4.2.0_3.0_1664098783377.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_models_6', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_models_6", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_models_6| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English image_classifier_vit_occupation_prediction ViTForImageClassification from darshanz author: John Snow Labs name: image_classifier_vit_occupation_prediction date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_occupation_prediction` is an English model originally trained by darshanz. ## Predicted Entities `anchor`, `professor`, `doctor`, `farmer`, `athlete` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_occupation_prediction_en_4.1.0_3.0_1660168721579.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_occupation_prediction_en_4.1.0_3.0_1660168721579.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_occupation_prediction", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_occupation_prediction", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
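The example above assumes an `imageDF` already loaded. Spark can read image folders with its built-in `image` data source; the path below is a placeholder, and the sketch only lists the fields that source produces rather than running Spark:

```python
# With an active Spark session, loading a folder of images would look like:
# imageDF = spark.read.format("image").option("dropInvalid", True).load("path/to/images/")
#
# Each row of that DataFrame carries a single struct column named "image"
# whose fields (Spark's built-in image schema) are listed below.
image_struct_fields = ["origin", "height", "width", "nChannels", "mode", "data"]
print(image_struct_fields)
```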
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_occupation_prediction| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English RobertaForQuestionAnswering Cased model (from comacrae) author: John Snow Labs name: roberta_qa_unaugv3 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-unaugv3` is an English model originally trained by `comacrae`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_unaugv3_en_4.3.0_3.0_1674222699716.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_unaugv3_en_4.3.0_3.0_1674222699716.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_unaugv3","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_unaugv3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_unaugv3| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/comacrae/roberta-unaugv3 --- layout: model title: English BertForQuestionAnswering model (from xraychen) author: John Snow Labs name: bert_qa_squad_baseline date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squad-baseline` is an English model originally trained by `xraychen`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_squad_baseline_en_4.0.0_3.0_1654191934724.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_squad_baseline_en_4.0.0_3.0_1654191934724.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_squad_baseline","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_squad_baseline","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base.by_xraychen").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_squad_baseline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/xraychen/squad-baseline --- layout: model title: English asr_wav2vec2_base_rj_try_5 TFWav2Vec2ForCTC from rjrohit author: John Snow Labs name: asr_wav2vec2_base_rj_try_5 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_rj_try_5` is an English model originally trained by rjrohit. NOTE: This model works only on a CPU. If you need to run it on a GPU device, please use asr_wav2vec2_base_rj_try_5_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_rj_try_5_en_4.2.0_3.0_1664102610582.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_rj_try_5_en_4.2.0_3.0_1664102610582.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_rj_try_5", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_rj_try_5", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_rj_try_5| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: ICD10CM Musculoskeletal Entity Resolver author: John Snow Labs name: chunkresolve_icd10cm_musculoskeletal_clinical class: ChunkEntityResolverModel language: en nav_key: models repository: clinical/models date: 2020-04-28 task: Entity Resolution edition: Healthcare NLP 2.4.5 spark_version: 2.4 tags: [clinical,licensed,entity_resolution,en] deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Entity resolution model based on KNN using word embeddings and Word Mover's Distance. ## Predicted Entities ICD10-CM codes and their normalized definitions with `clinical_embeddings`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_ICD10_CM.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_musculoskeletal_clinical_en_2.4.5_2.4_1588103998999.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_musculoskeletal_clinical_en_2.4.5_2.4_1588103998999.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python ... muscu_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_musculoskeletal_clinical","en","clinical/models")\ .setInputCols("token","chunk_embeddings")\ .setOutputCol("entity") pipeline_puerile = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, muscu_resolver]) data = spark.createDataFrame([["""The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion."""]]).toDF("text") model = pipeline_puerile.fit(data) results = model.transform(data) ``` ```scala ... val muscu_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_musculoskeletal_clinical","en","clinical/models") .setInputCols(Array("token","chunk_embeddings")) .setOutputCol("entity") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, muscu_resolver)) val data = Seq("The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. 
She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion.").toDF("text") val result = pipeline.fit(data).transform(data) ```
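The lookup described above (KNN over word embeddings with Word Mover's Distance) can be sketched conceptually. The following is an illustrative toy only, with hand-made 2-D vectors, hypothetical codes, and a relaxed one-directional WMD; it is not the licensed model's implementation:

```python
import numpy as np

# Toy 2-D word vectors standing in for clinical embeddings (illustrative only).
VECS = {
    "cough":    np.array([1.0, 0.1]),
    "dry":      np.array([0.9, 0.2]),
    "fracture": np.array([0.0, 1.0]),
    "toe":      np.array([0.1, 0.9]),
}

def relaxed_wmd(words_a, words_b):
    """Relaxed one-directional Word Mover's Distance:
    each word in words_a travels to its nearest word in words_b."""
    return sum(
        min(np.linalg.norm(VECS[a] - VECS[b]) for b in words_b)
        for a in words_a
    ) / len(words_a)

# Hypothetical code -> description-token index (not real ICD-10-CM data).
INDEX = {
    "CODE_A": ["dry", "cough"],
    "CODE_B": ["fracture", "toe"],
}

def resolve(chunk_tokens, k=1):
    """KNN over the index under relaxed WMD: the k nearest codes win."""
    return sorted(INDEX, key=lambda c: relaxed_wmd(chunk_tokens, INDEX[c]))[:k]
```

With these toy vectors, `resolve(["cough"])` returns `["CODE_A"]`, because the chunk's words travel the shortest total distance to that code's description tokens.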
{:.h2_title} ## Results ```bash chunk entity icd10_muscu_description icd10_muscu_code 0 a cold, cough PROBLEM Postprocedural hemorrhage of a musculoskeletal... M96831 1 runny nose PROBLEM Acquired deformity of nose M950 2 fever PROBLEM Periodic fever syndromes M041 3 difficulty breathing PROBLEM Other dentofacial functional abnormalities M2659 4 her cough PROBLEM Cervicalgia M542 5 physical exam TEST Pathological fracture, unspecified toe(s), seq... M84479S 6 fairly congested PROBLEM Synovial hypertrophy, not elsewhere classified... M67262 7 Amoxil TREATMENT Torticollis M436 8 Aldex TREATMENT Other soft tissue disorders related to use, ov... M7088 9 difficulty breathing PROBLEM Other dentofacial functional abnormalities M2659 10 more congested PROBLEM Pain in unspecified ankle and joints of unspec... M25579 11 trouble sleeping PROBLEM Low back pain M545 12 congestion PROBLEM Progressive systemic sclerosis M340 ``` {:.model-param} ## Model Information {:.table-model} |----------------|-----------------------------------------------| | Name: | chunkresolve_icd10cm_musculoskeletal_clinical | | Type: | ChunkEntityResolverModel | | Compatibility: | Spark NLP 2.4.5+ | | License: | Licensed | |Edition:|Official| | |Input labels: | [token, chunk_embeddings] | |Output labels: | [entity] | | Language: | en | | Case sensitive: | True | | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained on ICD10CM Dataset Range: M0000-M9979XXS https://www.icd10data.com/ICD10CM/Codes/M00-M99 --- layout: model title: Fast Neural Machine Translation Model from French-Based Creoles And Pidgins to English author: John Snow Labs name: opus_mt_cpf_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, cpf, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine 
Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with contributions from many academic groups (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial partners. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `cpf` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_cpf_en_xx_2.7.0_2.4_1609168557473.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_cpf_en_xx_2.7.0_2.4_1609168557473.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_cpf_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_cpf_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.cpf.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_cpf_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering (from huxxx657) author: John Snow Labs name: roberta_qa_huxxx657_roberta_base_finetuned_squad date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad` is an English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_huxxx657_roberta_base_finetuned_squad_en_4.0.0_3.0_1655734309712.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_huxxx657_roberta_base_finetuned_squad_en_4.0.0_3.0_1655734309712.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_huxxx657_roberta_base_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_huxxx657_roberta_base_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base.by_huxxx657").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_huxxx657_roberta_base_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-squad --- layout: model title: Extract Clinical Problem Entities (low granularity) from Voice of the Patient Documents (embeddings_clinical_medium) author: John Snow Labs name: ner_vop_problem_reduced_emb_clinical_medium date: 2023-06-07 tags: [licensed, clinical, ner, en, vop, problem] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts clinical problems from documents written in the patient's own words. The taxonomy is reduced (one label covers all clinical problems). ## Predicted Entities `Problem`, `HealthStatus`, `Modifier` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_reduced_emb_clinical_medium_en_4.4.3_3.0_1686148297394.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_reduced_emb_clinical_medium_en_4.4.3_3.0_1686148297394.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_problem_reduced_emb_clinical_medium", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. 
After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_problem_reduced_emb_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:---------------------|:------------| | pain | Problem | | fatigue | Problem | | rheumatoid arthritis | Problem | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_problem_reduced_emb_clinical_medium| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical_medium| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
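The precision, recall, and F1 columns in the benchmarking table that follows are derived from the tp/fp/fn counts in the standard way; a minimal sketch:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive
    and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts for the Problem label, taken from the benchmarking table.
p, r, f = prf(tp=6190, fp=1000, fn=1020)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.86 0.86 0.86

# Micro average: pool tp/fp/fn across all labels before dividing.
p, r, f = prf(tp=7101, fp=1253, fn=1355)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.85 0.84 0.84
```

The macro average, by contrast, computes per-label scores first and then averages them, which is why its row shows different figures from the micro row despite identical pooled counts.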
## Benchmarking ```bash label tp fp fn total precision recall f1 Problem 6190 1000 1020 7210 0.86 0.86 0.86 HealthStatus 92 32 15 107 0.74 0.86 0.80 Modifier 819 221 320 1139 0.79 0.72 0.75 macro_avg 7101 1253 1355 8456 0.80 0.81 0.80 micro_avg 7101 1253 1355 8456 0.85 0.84 0.84 ``` --- layout: model title: Pipeline to Detect Living Species(roberta_embeddings_BR_BERTo) author: John Snow Labs name: ner_living_species_roberta_pipeline date: 2023-03-13 tags: [pt, ner, clinical, licensed, roberta] task: Named Entity Recognition language: pt edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_living_species_roberta](https://nlp.johnsnowlabs.com/2022/06/22/ner_living_species_roberta_pt_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_roberta_pipeline_pt_4.3.0_3.2_1678732150750.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_roberta_pipeline_pt_4.3.0_3.2_1678732150750.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_living_species_roberta_pipeline", "pt", "clinical/models") text = '''Mulher de 23 anos, de Capinota, Cochabamba, Bolívia. Ela está no nosso país há quatro anos. Frequentou o departamento de emergência obstétrica onde foi encontrada grávida de 37 semanas, com um colo dilatado de 5 cm e membranas rompidas. O obstetra de emergência realizou um teste de estreptococos negativo e solicitou um hemograma, glucose, bioquímica básica, HBV, HCV e serologia da sífilis.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_living_species_roberta_pipeline", "pt", "clinical/models") val text = "Mulher de 23 anos, de Capinota, Cochabamba, Bolívia. Ela está no nosso país há quatro anos. Frequentou o departamento de emergência obstétrica onde foi encontrada grávida de 37 semanas, com um colo dilatado de 5 cm e membranas rompidas. O obstetra de emergência realizou um teste de estreptococos negativo e solicitou um hemograma, glucose, bioquímica básica, HBV, HCV e serologia da sífilis." val result = pipeline.fullAnnotate(text) ```
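The tabular results in the next section come from flattening the `fullAnnotate` output. A minimal sketch of that flattening step, using plain dicts as stand-ins for Spark NLP `Annotation` objects (the values below are mocked for illustration; real NER-chunk annotations carry `result`, `begin`, `end` and a `metadata` map with `entity` and `confidence` keys):

```python
# Mocked ner_chunk annotations (stand-ins for Spark NLP Annotation objects).
mock_chunks = [
    {"result": "Mulher", "begin": 0, "end": 5,
     "metadata": {"entity": "HUMAN", "confidence": "0.9975"}},
    {"result": "HBV", "begin": 360, "end": 362,
     "metadata": {"entity": "SPECIES", "confidence": "0.9911"}},
]

# Flatten each chunk into one row: (chunk, begin, end, label, confidence).
rows = [
    (c["result"], c["begin"], c["end"],
     c["metadata"]["entity"], float(c["metadata"]["confidence"]))
    for c in mock_chunks
]

for chunk, begin, end, label, conf in rows:
    print(f"{chunk:<8} {begin:>4} {end:>4} {label:<8} {conf:.4f}")
```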
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:--------------|--------:|------:|:------------|-------------:| | 0 | Mulher | 0 | 5 | HUMAN | 0.9975 | | 1 | país | 71 | 74 | HUMAN | 0.8869 | | 2 | grávida | 163 | 169 | HUMAN | 0.9702 | | 3 | estreptococos | 283 | 295 | SPECIES | 0.9211 | | 4 | HBV | 360 | 362 | SPECIES | 0.9911 | | 5 | HCV | 365 | 367 | SPECIES | 0.9858 | | 6 | sífilis | 384 | 390 | SPECIES | 0.8898 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_roberta_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|pt| |Size:|654.0 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - RoBertaEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Detect Clinical Entities (ner_jsl) author: John Snow Labs name: ner_jsl date: 2021-06-24 tags: [ner, licensed, en, clinical] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 2.4 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terminology. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, "Named Entity Recognition with Bidirectional LSTM-CNNs". This model is the official version of the jsl_ner_wip_clinical model. Definitions of Predicted Entities: - `Injury_or_Poisoning`: Physical harm or injury caused to the body, including harm caused by accidents, falls, or poisoning of a patient or someone else. - `Direction`: All the information relating to the laterality of the internal and external organs. - `Test`: Mentions of laboratory, pathology, and radiological tests. 
- `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient. - `Death_Entity`: Mentions that indicate the death of a patient. - `Relationship_Status`: State of a patient's romantic or social relationships (e.g. single, married, divorced). - `Duration`: The duration of a medical treatment or medication use. - `Respiration`: Number of breaths per minute. - `Hyperlipidemia`: Terms that indicate hyperlipidemia with relevant subtypes and synonyms. - `Birth_Entity`: Mentions that indicate giving birth. - `Age`: All mentions of ages, past or present, related to the patient or anybody else. - `Labour_Delivery`: Extractions include stages of labor and delivery. - `Family_History_Header`: Identifies section headers that correspond to Family History of the patient. - `BMI`: Numeric values and other text information related to Body Mass Index. - `Temperature`: All mentions that refer to body temperature. - `Alcohol`: Terms that indicate alcohol use, abuse or drinking issues of a patient or someone else. - `Kidney_Disease`: Terms that refer to any kidney diseases (includes mentions of modifiers such as "Acute" or "Chronic"). - `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else. - `Medical_History_Header`: Identifies section headers that correspond to Past Medical History of a patient. - `Cerebrovascular_Disease`: All terms that refer to cerebrovascular diseases and events. - `Oxygen_Therapy`: Breathing support triggered by the patient or entirely or partially by machine (e.g. ventilator, BPAP, CPAP). - `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements. - `Psychological_Condition`: All mental health diagnoses, disorders, conditions or syndromes of a patient or someone else. - `Heart_Disease`: All mentions of acquired, congenital or degenerative heart diseases. 
- `Employment`: All mentions of patient or provider occupational titles and employment status.
- `Obesity`: Terms related to a patient being obese (overweight and BMI are extracted as different labels).
- `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.).
- `Pregnancy`: All terms related to pregnancy (excluding terms that are extracted with their specific labels, such as "Labour_Delivery" etc.).
- `ImagingFindings`: All mentions of radiographic and imaging findings.
- `Procedure`: All mentions of invasive medical or surgical procedures or treatments.
- `Medical_Device`: All mentions related to medical devices and supplies.
- `Race_Ethnicity`: All terms that refer to the racial and national origin of sociocultural groups.
- `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital Signs headers are extracted separately with their specific labels).
- `Symptom`: All the symptoms mentioned in the document, of a patient or someone else.
- `Treatment`: Includes therapeutic and minimally invasive treatments and procedures (invasive treatments or procedures are extracted as "Procedure").
- `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs).
- `Route`: Drug and medication administration routes, as described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Drug_Ingredient`: Active ingredient(s) found in drug products.
- `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted.
- `Diet`: All mentions and information regarding a patient's dietary habits.
- `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by the naked eye.
- `LDL`: All mentions related to the lab test and results for LDL (Low Density Lipoprotein).
- `VS_Finding`: Qualitative data (e.g. Fever, Cyanosis, Tachycardia) and any other symptoms that refer to vital signs.
- `Allergen`: Allergen related extractions mentioned in the document.
- `EKG_Findings`: All mentions of EKG readings.
- `Imaging_Technique`: All mentions of special radiographic views or special imaging techniques used in radiology.
- `Triglycerides`: All terms related to the specific lab test for triglycerides.
- `RelativeTime`: Time references that are relative to different times or events (e.g. words such as "approximately", "in the morning").
- `Gender`: Gender-specific nouns and pronouns.
- `Pulse`: Peripheral heart rate, without advanced information like measurement location.
- `Social_History_Header`: Identifies section headers that correspond to the Social History of a patient.
- `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs).
- `Diabetes`: All terms related to diabetes mellitus.
- `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately.
- `Internal_organ_or_component`: All mentions related to internal body parts or organs that cannot be examined by the naked eye.
- `Clinical_Dept`: Terms that indicate the medical and/or surgical departments.
- `Form`: Drug and medication forms, as described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients.
- `Strength`: Potency of one unit of drug (or a combination of drugs); the measurement units available are described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Fetus_NewBorn`: All terms related to fetus, infant or newborn (excluding terms that are extracted with their specific labels, such as "Labour_Delivery", "Pregnancy" etc.).
- `RelativeDate`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "approximately two years ago", "about two days ago").
- `Height`: All mentions related to a patient's height.
- `Test_Result`: Terms related to all the test results present in the document (clinical test results are included).
- `Sexually_Active_or_Sexual_Orientation`: All terms that are related to sexuality, sexual orientations and sexual activity.
- `Frequency`: Frequency of administration for a dose prescribed.
- `Time`: Specific time references (hour and/or minutes).
- `Weight`: All mentions related to a patient's weight.
- `Vaccine`: Generic and brand names of vaccines or vaccination procedures.
- `Vital_Signs_Header`: Identifies section headers that correspond to the Vital Signs of a patient.
- `Communicable_Disease`: Includes all mentions of communicable diseases.
- `Dosage`: Quantity prescribed by the physician for an active ingredient; the measurement units available are described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Overweight`: Terms related to the patient being overweight (BMI and Obesity are extracted separately).
- `Hypertension`: All terms related to Hypertension (quantitative data such as 150/100 is extracted as Blood_Pressure).
- `HDL`: Terms related to the lab test for HDL (High Density Lipoprotein).
- `Total_Cholesterol`: Terms related to the lab test and results for cholesterol.
- `Smoking`: All mentions of the smoking status of a patient.
- `Date`: Mentions of an exact date, in any format, including day number, month and/or year.

## Predicted Entities

`Injury_or_Poisoning`, `Direction`, `Test`, `Admission_Discharge`, `Death_Entity`, `Relationship_Status`, `Duration`, `Respiration`, `Hyperlipidemia`, `Birth_Entity`, `Age`, `Labour_Delivery`, `Family_History_Header`, `BMI`, `Temperature`, `Alcohol`, `Kidney_Disease`, `Oncological`, `Medical_History_Header`, `Cerebrovascular_Disease`, `Oxygen_Therapy`, `O2_Saturation`, `Psychological_Condition`, `Heart_Disease`, `Employment`, `Obesity`, `Disease_Syndrome_Disorder`, `Pregnancy`, `ImagingFindings`, `Procedure`, `Medical_Device`, `Race_Ethnicity`, `Section_Header`, `Symptom`, `Treatment`, `Substance`, `Route`, `Drug_Ingredient`, `Blood_Pressure`, `Diet`, `External_body_part_or_region`, `LDL`, `VS_Finding`, `Allergen`, `EKG_Findings`, `Imaging_Technique`, `Triglycerides`, `RelativeTime`, `Gender`, `Pulse`, `Social_History_Header`, `Substance_Quantity`, `Diabetes`, `Modifier`, `Internal_organ_or_component`, `Clinical_Dept`, `Form`, `Drug_BrandName`, `Strength`, `Fetus_NewBorn`, `RelativeDate`, `Height`, `Test_Result`, `Sexually_Active_or_Sexual_Orientation`, `Frequency`, `Time`, `Weight`, `Vaccine`, `Vital_Signs_Header`, `Communicable_Disease`, `Dosage`, `Overweight`, `Hypertension`, `HDL`, `Total_Cholesterol`, `Smoking`, `Date`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_en_3.1.0_2.4_1624566960534.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_en_3.1.0_2.4_1624566960534.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") jsl_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("jsl_ner") jsl_ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "jsl_ner"]) \ .setOutputCol("ner_chunk") jsl_ner_pipeline = Pipeline().setStages([ documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter]) jsl_ner_model = jsl_ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame([["""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature."""]]).toDF("text") result = jsl_ner_model.transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val jsl_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("jsl_ner") val jsl_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "jsl_ner")) .setOutputCol("ner_chunk") val jsl_ner_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter)) val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text") val result = jsl_ner_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
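The per-entity precision, recall and F1 figures reported in the Benchmarking section below come directly from the tp/fp/fn counts in the same table. As a quick sanity check, here is a small standalone sketch of those formulas (plain Python for illustration, not a Spark NLP API), reproduced against the `VS_Finding` and `Respiration` rows:

```python
# Illustration only: how the precision/recall/F1 columns in the
# Benchmarking table relate to the tp/fp/fn counts.
def ner_metrics(tp: float, fp: float, fn: float):
    precision = tp / (tp + fp)   # of predicted entities, how many were right
    recall = tp / (tp + fn)      # of gold entities, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 4), round(recall, 4), round(f1, 4)

# VS_Finding row: tp=235, fp=46, fn=43
print(ner_metrics(235, 46, 43))  # (0.8363, 0.8453, 0.8408)
```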
## Results ```bash +-----------------------------------------+----------------------------+ |chunk |ner_label | +-----------------------------------------+----------------------------+ |21-day-old |Age | |Caucasian |Race_Ethnicity | |male |Gender | |for 2 days |Duration | |congestion |Symptom | |mom |Gender | |yellow |Modifier | |discharge |Symptom | |nares |External_body_part_or_region| |she |Gender | |mild |Modifier | |problems with his breathing while feeding|Symptom | |perioral cyanosis |Symptom | |retractions |Symptom | |One day ago |RelativeDate | |mom |Gender | |Tylenol |Drug_BrandName | |Baby |Age | |decreased p.o. intake |Symptom | |His |Gender | +-----------------------------------------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on data gathered and manually annotated by John Snow Labs. https://www.johnsnowlabs.com/data/ ## Benchmarking ```bash entity tp fp fn total precision recall f1 VS_Finding 235.0 46.0 43.0 278.0 0.8363 0.8453 0.8408 Direction 3972.0 465.0 458.0 4430.0 0.8952 0.8966 0.8959 Respiration 82.0 4.0 4.0 86.0 0.9535 0.9535 0.9535 Cerebrovascular_D... 93.0 20.0 24.0 117.0 0.823 0.7949 0.8087 Family_History_He... 88.0 6.0 3.0 91.0 0.9362 0.967 0.9514 Heart_Disease 447.0 82.0 119.0 566.0 0.845 0.7898 0.8164 RelativeTime 158.0 80.0 59.0 217.0 0.6639 0.7281 0.6945 Strength 624.0 58.0 53.0 677.0 0.915 0.9217 0.9183 Smoking 121.0 11.0 4.0 125.0 0.9167 0.968 0.9416 Medical_Device 3716.0 491.0 466.0 4182.0 0.8833 0.8886 0.8859 Pulse 136.0 22.0 14.0 150.0 0.8608 0.9067 0.8831 Psychological_Con... 
135.0 9.0 29.0 164.0 0.9375 0.8232 0.8766 Overweight 2.0 1.0 0.0 2.0 0.6667 1.0 0.8 Triglycerides 3.0 0.0 2.0 5.0 1.0 0.6 0.75 Obesity 42.0 5.0 6.0 48.0 0.8936 0.875 0.8842 Admission_Discharge 318.0 24.0 11.0 329.0 0.9298 0.9666 0.9478 HDL 3.0 0.0 0.0 3.0 1.0 1.0 1.0 Diabetes 110.0 14.0 8.0 118.0 0.8871 0.9322 0.9091 Section_Header 3740.0 148.0 157.0 3897.0 0.9619 0.9597 0.9608 Age 627.0 75.0 48.0 675.0 0.8932 0.9289 0.9107 O2_Saturation 34.0 14.0 17.0 51.0 0.7083 0.6667 0.6869 Kidney_Disease 96.0 12.0 34.0 130.0 0.8889 0.7385 0.8067 Test 2504.0 545.0 498.0 3002.0 0.8213 0.8341 0.8276 Communicable_Disease 21.0 10.0 6.0 27.0 0.6774 0.7778 0.7241 Hypertension 162.0 5.0 10.0 172.0 0.9701 0.9419 0.9558 External_body_par... 2626.0 356.0 413.0 3039.0 0.8806 0.8641 0.8723 Oxygen_Therapy 81.0 15.0 14.0 95.0 0.8438 0.8526 0.8482 Modifier 2341.0 404.0 539.0 2880.0 0.8528 0.8128 0.8324 Test_Result 1007.0 214.0 255.0 1262.0 0.8247 0.7979 0.8111 BMI 9.0 1.0 0.0 9.0 0.9 1.0 0.9474 Labour_Delivery 57.0 23.0 33.0 90.0 0.7125 0.6333 0.6706 Employment 271.0 59.0 55.0 326.0 0.8212 0.8313 0.8262 Fetus_NewBorn 66.0 33.0 51.0 117.0 0.6667 0.5641 0.6111 Clinical_Dept 923.0 110.0 83.0 1006.0 0.8935 0.9175 0.9053 Time 29.0 13.0 16.0 45.0 0.6905 0.6444 0.6667 Procedure 3185.0 462.0 501.0 3686.0 0.8733 0.8641 0.8687 Diet 36.0 20.0 45.0 81.0 0.6429 0.4444 0.5255 Oncological 459.0 61.0 55.0 514.0 0.8827 0.893 0.8878 LDL 3.0 0.0 3.0 6.0 1.0 0.5 0.6667 Symptom 7104.0 1302.0 1200.0 8304.0 0.8451 0.8555 0.8503 Temperature 116.0 6.0 8.0 124.0 0.9508 0.9355 0.9431 Vital_Signs_Header 215.0 29.0 24.0 239.0 0.8811 0.8996 0.8903 Relationship_Status 49.0 2.0 1.0 50.0 0.9608 0.98 0.9703 Total_Cholesterol 11.0 4.0 5.0 16.0 0.7333 0.6875 0.7097 Blood_Pressure 158.0 18.0 22.0 180.0 0.8977 0.8778 0.8876 Injury_or_Poisoning 579.0 130.0 127.0 706.0 0.8166 0.8201 0.8184 Drug_Ingredient 1716.0 153.0 132.0 1848.0 0.9181 0.9286 0.9233 Treatment 136.0 36.0 60.0 196.0 0.7907 0.6939 0.7391 Pregnancy 123.0 36.0 51.0 
174.0 0.7736 0.7069 0.7387 Vaccine 13.0 2.0 6.0 19.0 0.8667 0.6842 0.7647 Disease_Syndrome_... 2981.0 559.0 446.0 3427.0 0.8421 0.8699 0.8557 Height 30.0 10.0 15.0 45.0 0.75 0.6667 0.7059 Frequency 595.0 99.0 138.0 733.0 0.8573 0.8117 0.8339 Route 858.0 76.0 89.0 947.0 0.9186 0.906 0.9123 Duration 351.0 99.0 108.0 459.0 0.78 0.7647 0.7723 Death_Entity 43.0 14.0 5.0 48.0 0.7544 0.8958 0.819 Internal_organ_or... 6477.0 972.0 991.0 7468.0 0.8695 0.8673 0.8684 Alcohol 80.0 18.0 13.0 93.0 0.8163 0.8602 0.8377 Substance_Quantity 6.0 7.0 4.0 10.0 0.4615 0.6 0.5217 Date 498.0 38.0 19.0 517.0 0.9291 0.9632 0.9459 Hyperlipidemia 47.0 3.0 3.0 50.0 0.94 0.94 0.94 Social_History_He... 99.0 7.0 7.0 106.0 0.934 0.934 0.934 Race_Ethnicity 116.0 0.0 0.0 116.0 1.0 1.0 1.0 Imaging_Technique 40.0 18.0 47.0 87.0 0.6897 0.4598 0.5517 Drug_BrandName 859.0 62.0 61.0 920.0 0.9327 0.9337 0.9332 RelativeDate 566.0 124.0 143.0 709.0 0.8203 0.7983 0.8091 Gender 6096.0 80.0 101.0 6197.0 0.987 0.9837 0.9854 Dosage 244.0 31.0 57.0 301.0 0.8873 0.8106 0.8472 Form 234.0 32.0 55.0 289.0 0.8797 0.8097 0.8432 Medical_History_H... 114.0 9.0 10.0 124.0 0.9268 0.9194 0.9231 Birth_Entity 4.0 2.0 3.0 7.0 0.6667 0.5714 0.6154 Substance 59.0 8.0 11.0 70.0 0.8806 0.8429 0.8613 Sexually_Active_o... 5.0 3.0 4.0 9.0 0.625 0.5556 0.5882 Weight 90.0 10.0 21.0 111.0 0.9 0.8108 0.8531 macro - - - - - - 0.8148 micro - - - - - - 0.8788 ``` --- layout: model title: Swahili BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_sw_cased date: 2022-12-02 tags: [sw, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: sw edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`bert-base-sw-cased` is a Swahili model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_sw_cased_sw_4.2.4_3.0_1670018949309.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_sw_cased_sw_4.2.4_3.0_1670018949309.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_sw_cased","sw") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_sw_cased","sw")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
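The `embeddings` column produced above holds one dense vector per token; downstream, such vectors are usually compared with cosine similarity. A minimal, self-contained sketch of that comparison (the vector values below are made up for illustration and are not actual model output):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [0.2, 0.1, -0.4]
v2 = [0.2, 0.1, -0.4]
v3 = [-0.2, -0.1, 0.4]

print(round(cosine_similarity(v1, v2), 6))  # 1.0 (same direction)
print(round(cosine_similarity(v1, v3), 6))  # -1.0 (opposite direction)
```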
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_sw_cased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|sw|
|Size:|370.9 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/Geotrend/bert-base-sw-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers

---
layout: model
title: Legal Whereas Clause Binary Classifier
author: John Snow Labs
name: legclf_cuad_whereas_clause
date: 2022-09-20
tags: [en, legal, classification, clauses, whereas, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `whereas` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
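One simple version of the paragraph-splitting idea mentioned above is to split on blank lines and greedily pack paragraphs while staying under a token budget. The sketch below uses naive whitespace tokens and a tiny budget purely for illustration; a real subword tokenizer counts tokens differently, so treat the 512 limit conservatively:

```python
def split_into_chunks(text: str, max_tokens: int = 512):
    """Split a document on blank lines, then pack consecutive paragraphs
    into chunks whose (whitespace) token count stays under max_tokens.
    A single oversized paragraph still becomes its own chunk."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for p in paragraphs:
        n = len(p.split())
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(p)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = "WHEREAS the parties agree.\n\n" + "Second clause here. " * 3 + "\n\nThird clause."
for chunk in split_into_chunks(doc, max_tokens=8):
    print(repr(chunk))
```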
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `whereas`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cuad_whereas_clause_en_1.0.0_3.2_1663693211440.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cuad_whereas_clause_en_1.0.0_3.2_1663693211440.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("clause_text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = nlp.ClassifierDLModel.pretrained("legclf_cuad_whereas_clause", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    embeddings,
    docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text")

model = nlpPipeline.fit(df)

result = model.transform(df)
```
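In the Benchmarking section below, the `weighted-avg` row is the support-weighted mean of the per-label scores (as in scikit-learn's `classification_report`). A minimal sketch of that derivation, using the `other`/`whereas` F1 values and supports from the table (illustration only, not a Spark NLP API):

```python
def weighted_average(scores, supports):
    # Support-weighted mean of per-label scores.
    total = sum(supports)
    return sum(s * w for s, w in zip(scores, supports)) / total

f1_scores = [0.96, 0.94]   # other, whereas
supports = [67, 41]
print(round(weighted_average(f1_scores, supports), 2))  # 0.95
```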
## Results

```bash
+---------+
|   result|
+---------+
|[whereas]|
|  [other]|
|  [other]|
|[whereas]|
+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_cuad_whereas_clause|
|Type:|legal|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.4 MB|

## References

In-house annotations on the CUAD dataset

## Benchmarking

```bash
       label  precision  recall  f1-score  support
       other       0.98    0.94      0.96       67
     whereas       0.91    0.98      0.94       41
    accuracy          -       -      0.95      108
   macro-avg       0.95    0.96      0.95      108
weighted-avg       0.96    0.95      0.95      108
```

---
layout: model
title: Icelandic RobertaForQuestionAnswering (from saattrupdan)
author: John Snow Labs
name: roberta_qa_icebert_texas_squad_is_saattrupdan
date: 2022-06-21
tags: [is, open_source, question_answering, roberta]
task: Question Answering
language: is
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `icebert-texas-squad-is` is an Icelandic model originally trained by `saattrupdan`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_icebert_texas_squad_is_saattrupdan_is_4.0.0_3.0_1655789280018.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_icebert_texas_squad_is_saattrupdan_is_4.0.0_3.0_1655789280018.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_icebert_texas_squad_is_saattrupdan","is") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = RoBertaForQuestionAnswering
    .pretrained("roberta_qa_icebert_texas_squad_is_saattrupdan","is")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("is.answer_question.squad.roberta").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
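Under the hood, extractive QA models of this kind score every token as a potential answer start and answer end, then return the span with the highest combined score (start ≤ end). Spark NLP performs this decoding internally; the toy sketch below only illustrates the idea, and the tokens and scores are made up:

```python
def best_span(start_scores, end_scores, max_len=15):
    # Pick (i, j) maximizing start_scores[i] + end_scores[j],
    # with i <= j and a bounded span length, as in SQuAD-style decoding.
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best[2]:
                best = (i, j, s + end_scores[j])
    return best[0], best[1]

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 3.0, 0.1, 0.1, 0.1, 0.1, 0.2, 0.1]
end = [0.1, 0.1, 0.1, 2.5, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]

i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```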
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_icebert_texas_squad_is_saattrupdan|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|is|
|Size:|455.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/saattrupdan/icebert-texas-squad-is

---
layout: model
title: English RobertaForQuestionAnswering Cased model (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_fpdm_ft_news
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_roberta_FT_newsqa` is an English model originally trained by `AnonymousSub`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_ft_news_en_4.3.0_3.0_1674211000201.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_ft_news_en_4.3.0_3.0_1674211000201.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_ft_news","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_ft_news","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_fpdm_ft_news|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|458.6 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/AnonymousSub/fpdm_roberta_FT_newsqa

---
layout: model
title: Hindi Named Entity Recognition (from l3cube-pune)
author: John Snow Labs
name: bert_ner_hing_bert_lid
date: 2022-05-09
tags: [bert, ner, token_classification, hi, open_source]
task: Named Entity Recognition
language: hi
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `hing-bert-lid` is a Hindi model originally trained by `l3cube-pune`.

## Predicted Entities

`EN`, `HI`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_hing_bert_lid_hi_3.4.2_3.0_1652097677950.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_hing_bert_lid_hi_3.4.2_3.0_1652097677950.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_hing_bert_lid","hi") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["मुझे स्पार्क एनएलपी बहुत पसंद है"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_hing_bert_lid","hi") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("मुझे स्पार्क एनएलपी बहुत पसंद है").toDF("text") val result = pipeline.fit(data).transform(data) ```
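Since this model tags every token as `EN` or `HI`, a common follow-up step is aggregating the token-level tags into a sentence-level language estimate for code-mixed (Hinglish) text. A small sketch of that aggregation (the tags below are made up for illustration, not actual model output):

```python
from collections import Counter

def dominant_language(tags):
    """Return the majority language tag and its share among
    token-level language-ID tags such as 'EN' / 'HI'."""
    counts = Counter(tags)
    tag, freq = counts.most_common(1)[0]
    return tag, freq / len(tags)

# Hypothetical token tags for a code-mixed sentence
tags = ["HI", "HI", "EN", "EN", "HI", "HI"]
lang, share = dominant_language(tags)
print(lang, round(share, 2))  # HI 0.67
```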
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_hing_bert_lid| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|hi| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/l3cube-pune/hing-bert-lid - https://github.com/l3cube-pune/code-mixed-nlp - https://arxiv.org/abs/2204.08398 --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_4 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-1024-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_4_en_4.3.0_3.0_1674213450529.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_4_en_4.3.0_3.0_1674213450529.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_4","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|439.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-1024-finetuned-squad-seed-4 --- layout: model title: Fast Neural Machine Translation Model from Central Bikol to Swedish author: John Snow Labs name: opus_mt_bcl_sv date: 2021-06-01 tags: [open_source, seq2seq, translation, bcl, sv, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team; many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. source languages: bcl target languages: sv {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_sv_xx_3.1.0_2.4_1622552497758.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_sv_xx_3.1.0_2.4_1622552497758.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_bcl_sv", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_bcl_sv", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Central Bikol.translate_to.Swedish').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_bcl_sv| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English ElectraForQuestionAnswering Large model (from mrm8488) author: John Snow Labs name: electra_qa_large_finetuned_squadv1 date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-large-finetuned-squadv1` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_large_finetuned_squadv1_en_4.0.0_3.0_1655920852902.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_large_finetuned_squadv1_en_4.0.0_3.0_1655920852902.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_large_finetuned_squadv1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_large_finetuned_squadv1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.electra.large.by_mrm8488").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""")
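Span-based QA annotators like `BertForQuestionAnswering` score every token as a candidate answer start and answer end, then return the span with the best combined score. A minimal sketch of that selection step in plain Python, using invented logits rather than real model output:

```python
# Toy answer-span selection from start/end logits.
# The tokens and logit values below are invented for illustration only.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.2, 0.1, 3.5, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end_logits = [0.1, 0.1, 0.1, 2.9, 0.2, 0.1, 0.1, 0.1, 0.4, 0.1]

def best_span(start_logits, end_logits, max_len=8):
    """Return (start, end) maximizing start+end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

s, e = best_span(start_logits, end_logits)
answer = " ".join(tokens[s:e + 1])  # "Clara"
```

The real annotator does this over WordPiece tokens inside the JVM; the sketch only illustrates why the `answer` column always contains a contiguous substring of the context.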
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_large_finetuned_squadv1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/electra-large-finetuned-squadv1 --- layout: model title: Multilingual BertForQuestionAnswering model (from horsbug98) author: John Snow Labs name: bert_qa_Part_1_mBERT_Model_E1 date: 2022-06-02 tags: [en, ar, bn, fi, id, ja, sw, ko, ru, te, th, open_source, question_answering, bert, xx] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Part_1_mBERT_Model_E1` is a multilingual model originally trained by `horsbug98`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Part_1_mBERT_Model_E1_xx_4.0.0_3.0_1654178889694.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Part_1_mBERT_Model_E1_xx_4.0.0_3.0_1654178889694.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Part_1_mBERT_Model_E1","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_Part_1_mBERT_Model_E1","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("xx.answer_question.tydiqa.multi_lingual_bert").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_Part_1_mBERT_Model_E1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/horsbug98/Part_1_mBERT_Model_E1 --- layout: model title: BERT Sentence Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on SST-2 author: John Snow Labs name: sent_bert_wiki_books_sst2 date: 2021-08-31 tags: [en, open_source, sentence_embeddings, wikipedia_dataset, books_corpus_dataset, sst_2_dataset] task: Embeddings language: en nav_key: models edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses a BERT base architecture initialized from https://tfhub.dev/google/experts/bert/wiki_books/1 and fine-tuned on SST-2. It is a BERT base architecture, but some changes have been made to the original training and export scheme based on more recent learnings. The model is intended for a variety of English NLP tasks; because the pre-training data contains more formal text, it may not generalize to colloquial text such as social media posts or messages. The fine-tuning task uses the Stanford Sentiment Treebank (SST-2) dataset to predict the sentiment of a given sentence, so this model is recommended for sentiment analysis tasks.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_wiki_books_sst2_en_3.2.0_3.0_1630412133457.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_wiki_books_sst2_en_3.2.0_3.0_1630412133457.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_wiki_books_sst2", "en") \ .setInputCols("sentence") \ .setOutputCol("bert_sentence") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings]) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_wiki_books_sst2", "en") .setInputCols("sentence") .setOutputCol("bert_sentence") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings)) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] sent_embeddings_df = nlu.load('en.embed_sentence.bert.wiki_books_sst2').predict(text, output_level='sentence') sent_embeddings_df ```
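The `bert_sentence` column holds one fixed-size vector per sentence, and downstream tasks usually compare those vectors, most often with cosine similarity. A minimal sketch in plain Python, using short made-up vectors in place of the embeddings the model actually produces:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy 4-dimensional stand-ins for real sentence embeddings.
v_a = [0.9, 0.1, 0.3, 0.0]    # hypothetical embedding of a positive review
v_b = [0.8, 0.2, 0.4, 0.1]    # hypothetical embedding of a similar positive review
v_c = [-0.7, 0.9, -0.2, 0.5]  # hypothetical embedding of an unrelated sentence

# Semantically close sentences should score higher than unrelated ones.
assert cosine_similarity(v_a, v_b) > cosine_similarity(v_a, v_c)
```

In practice you would extract the real vectors from the `bert_sentence` annotation's embeddings field and compare those instead of these toy values.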
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_bert_wiki_books_sst2| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[bert_sentence]| |Language:|en| |Case sensitive:|false| ## Data Source [1]: [Wikipedia dataset](https://dumps.wikimedia.org/) [2]: [BooksCorpus dataset](http://yknzhu.wixsite.com/mbweb) [3]: [Stanford Sentiment Treebank (SST-2) dataset](https://nlp.stanford.edu/sentiment/index.html) This model has been imported from: https://tfhub.dev/google/experts/bert/wiki_books/sst2/2 --- layout: model title: English DistilBertForSequenceClassification Base Uncased model (from mrm8488) author: John Snow Labs name: distilbert_classifier_base_uncased_newspop_student date: 2022-07-20 tags: [open_source, distilbert, sequence_classifier, classification, newspop, en] task: Text Classification language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Classification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-newspop-student` is an English model originally trained by `mrm8488`. ## Predicted Entities `palestine`, `obama`, `microsoft`, `economy` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_classifier_base_uncased_newspop_student_en_4.0.0_3.0_1658326819970.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_classifier_base_uncased_newspop_student_en_4.0.0_3.0_1658326819970.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") seq = DistilBertForSequenceClassification.pretrained("distilbert_classifier_base_uncased_newspop_student","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, seq]) data = spark.createDataFrame([["PUT YOUR STRING HERE."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val seq = DistilBertForSequenceClassification.pretrained("distilbert_classifier_base_uncased_newspop_student","en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, seq)) val data = Seq("PUT YOUR STRING HERE.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
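The `class` column reports the label whose logit wins after a softmax over the four classes listed under Predicted Entities. That final step can be sketched in plain Python; the label set comes from this card, but the logit values are invented for illustration:

```python
import math

labels = ["palestine", "obama", "microsoft", "economy"]
logits = [0.2, -1.1, 3.4, 0.7]  # made-up scores, not real model output

def softmax(xs):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
predicted = labels[probs.index(max(probs))]  # "microsoft" for these logits
```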
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_classifier_base_uncased_newspop_student| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|249.8 MB| |Case sensitive:|true| |Max sentence length:|128| ## References https://huggingface.co/mrm8488/distilbert-base-uncased-newspop-student --- layout: model title: English BertForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: bert_qa_rule_based_hier_triplet_epochs_1_shard_1_squad2.0 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_hier_triplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_hier_triplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1657191205969.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_hier_triplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1657191205969.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_hier_triplet_epochs_1_shard_1_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_hier_triplet_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_rule_based_hier_triplet_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/rule_based_hier_triplet_epochs_1_shard_1_squad2.0 --- layout: model title: Stopwords Remover for Kyrgyz language (96 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, ky, open_source] task: Stop Words Removal language: ky edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_ky_3.4.1_3.0_1646673146794.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_ky_3.4.1_3.0_1646673146794.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","ky") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Сен менден артык эмессиң"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","ky") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Сен менден артык эмессиң").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ky.stopwords").predict("""Сен менден артык эмессиң""") ```
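`StopWordsCleaner` simply drops every token found in its pretrained stopword list (here, 96 Kyrgyz entries from stopwords-iso). The core operation can be sketched in plain Python with a tiny illustrative English set standing in for the real list:

```python
# Tiny illustrative stopword set; the pretrained model ships the
# 96-entry Kyrgyz list from stopwords-iso, not this one.
stopwords = {"the", "is", "a", "of"}

def clean_tokens(tokens, stopwords):
    """Keep tokens whose lowercase form is not in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

cleaned = clean_tokens(["The", "cat", "is", "on", "a", "mat"], stopwords)
# cleaned == ["cat", "on", "mat"]
```

This sketch matches case-insensitively; the Spark NLP annotator has its own case-sensitivity setting, so treat the lowercasing here as an assumption of the example.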
## Results ```bash +-----------------------------+ |result | +-----------------------------+ |[Сен, менден, артык, эмессиң]| +-----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|ky| |Size:|1.8 KB| --- layout: model title: Swahili XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_luganda_finetuned_ner_swahili date: 2022-08-01 tags: [sw, open_source, xlm_roberta, ner] task: Named Entity Recognition language: sw edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-luganda-finetuned-ner-swahili` is a Swahili model originally trained by `mbeukman`. ## Predicted Entities `PER`, `DATE`, `ORG`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_luganda_finetuned_ner_swahili_sw_4.1.0_3.0_1659354352086.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_luganda_finetuned_ner_swahili_sw_4.1.0_3.0_1659354352086.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_luganda_finetuned_ner_swahili","sw") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_luganda_finetuned_ner_swahili","sw") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
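`NerConverter` merges the per-token `B-`/`I-` tags emitted in the `ner` column into whole entity chunks. The merge logic can be sketched in plain Python with an invented Swahili tag sequence (the `PER`/`LOC` labels come from this card; the sentence and tags are illustrative only):

```python
def merge_bio(tokens, tags):
    """Group B-X / I-X tagged tokens into (text, label) chunks."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # an "O" tag or inconsistent I- tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Rais", "Samia", "Suluhu", "alitembelea", "Dodoma"]
tags = ["O", "B-PER", "I-PER", "O", "B-LOC"]
chunks = merge_bio(tokens, tags)  # [("Samia Suluhu", "PER"), ("Dodoma", "LOC")]
```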
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_luganda_finetuned_ner_swahili| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|sw| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-luganda-finetuned-ner-swahili - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://github.com/masakhane-io/masakhane-ner --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-32-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_0_en_4.0.0_3.0_1655732398416.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_0_en_4.0.0_3.0_1655732398416.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_32d_seed_0").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|417.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-32-finetuned-squad-seed-0 --- layout: model title: Chinese BertForMaskedLM Cased model (from hfl) author: John Snow Labs name: bert_embeddings_rbtl3 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rbtl3` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbtl3_zh_4.2.4_3.0_1670327142675.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbtl3_zh_4.2.4_3.0_1670327142675.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbtl3","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbtl3","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_rbtl3| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|228.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/rbtl3 - https://arxiv.org/abs/1906.08101 - https://github.com/google-research/bert - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://arxiv.org/abs/2004.13922 --- layout: model title: English T5ForConditionalGeneration Cased model (from gokceuludogan) author: John Snow Labs name: t5_t2t_adex_prompt date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t2t-adeX-prompt` is an English model originally trained by `gokceuludogan`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_t2t_adex_prompt_en_4.3.0_3.0_1675107607187.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_t2t_adex_prompt_en_4.3.0_3.0_1675107607187.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_t2t_adex_prompt","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_t2t_adex_prompt","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_t2t_adex_prompt| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|925.3 MB| ## References - https://huggingface.co/gokceuludogan/t2t-adeX-prompt - https://github.com/gokceuludogan/boun-tabi-smm4h22 --- layout: model title: Detect PHI for Deidentification purposes (Spanish, augmented) author: John Snow Labs name: ner_deid_subentity_augmented date: 2022-02-16 tags: [deid, es, licensed] task: De-identification language: es edition: Healthcare NLP 3.3.4 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow a generic model to be trained using a deep learning architecture (char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, "Named Entity Recognition with Bidirectional LSTM-CNNs". Deidentification NER (Spanish) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 17 entities, more than the previously released `ner_deid_subentity` model. This NER model is trained on a combination of custom datasets, the Spanish CoNLL 2002, MeddoProf, and MeddoCan datasets, and includes several data augmentation mechanisms. 
## Predicted Entities `PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `ID`, `STREET`, `USERNAME`, `SEX`, `EMAIL`, `ZIP`, `MEDICALRECORD`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_ES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_augmented_es_3.3.4_3.0_1645006642756.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_augmented_es_3.3.4_3.0_1645006642756.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("word_embeddings") clinical_ner = medical.NerModel.pretrained("ner_deid_subentity_augmented", "es", "clinical/models")\ .setInputCols(["sentence","token","word_embeddings"])\ .setOutputCol("ner") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner]) text = [''' Antonio Miguel Martínez, varón de de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. 
'''] data = spark.createDataFrame([text]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_augmented", "es", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner)) val text = "Antonio Miguel Martínez, varón de de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos." val data = Seq(text).toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.deid.subentity_augmented").predict(""" Antonio Miguel Martínez, varón de de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. """) ```
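The model emits one BIO tag per token (as shown in the Results section). In a full pipeline these tags are usually merged into entity chunks with `NerConverter`; the underlying logic can be sketched in plain Python:

```python
def bio_to_chunks(tokens, tags):
    """Merge BIO-tagged tokens into (text, label) entity chunks."""
    chunks, current_tokens, current_label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:  # "O" (or an inconsistent I- tag) closes any open chunk
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks

# Tokens and tags taken from the Results table below.
tokens = ["Antonio", "Miguel", "Martínez", ",", "varón"]
tags = ["B-PATIENT", "I-PATIENT", "I-PATIENT", "O", "B-SEX"]
print(bio_to_chunks(tokens, tags))
# [('Antonio Miguel Martínez', 'PATIENT'), ('varón', 'SEX')]
```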
## Results ```bash +------------+------------+ | token| ner_label| +------------+------------+ | Antonio| B-PATIENT| | Miguel| I-PATIENT| | Martínez| I-PATIENT| | ,| O| | varón| B-SEX| | de| O| | de| O| | 35| B-AGE| | años| O| | de| O| | edad| O| | ,| O| | de| O| | profesión| O| | auxiliar|B-PROFESSION| | de|I-PROFESSION| | enfermería|I-PROFESSION| | y| O| | nacido| O| | en| O| | Cadiz| B-CITY| | ,| O| | España| B-COUNTRY| | .| O| | Aún| O| | no| O| | estaba| O| | vacunado| O| | ,| O| | se| O| | infectó| O| | con| O| | Covid-19| O| | el| O| | dia| O| | 14| B-DATE| | de| I-DATE| | Marzo| I-DATE| | y| O| | tuvo| O| | que| O| | ir| O| | al| O| | Hospital| O| | Fue| O| | tratado| O| | con| O| | anticuerpos| O| |monoclonales| O| | en| O| | la| O| | Clinica| B-HOSPITAL| | San| I-HOSPITAL| | Carlos| I-HOSPITAL| | .| O| +------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity_augmented| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, word_embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|15.0 MB| ## References - Internal JSL annotated corpus - [Spanish conLL](https://www.clips.uantwerpen.be/conll2002/ner/data/) - [MeddoProf](https://temu.bsc.es/meddoprof/data/) - [MeddoCan](https://temu.bsc.es/meddocan/) ## Benchmarking ```bash label tp fp fn total precision recall f1 PATIENT 2022.0 224.0 140.0 2162.0 0.9003 0.9352 0.9174 HOSPITAL 259.0 35.0 50.0 309.0 0.881 0.8382 0.859 DATE 1023.0 12.0 12.0 1035.0 0.9884 0.9884 0.9884 ORGANIZATION 2624.0 516.0 544.0 3168.0 0.8357 0.8283 0.832 CITY 1561.0 339.0 266.0 1827.0 0.8216 0.8544 0.8377 ID 36.0 1.0 3.0 39.0 0.973 0.9231 0.9474 STREET 197.0 14.0 9.0 206.0 0.9336 0.9563 0.9448 USERNAME 10.0 6.0 1.0 11.0 0.625 0.9091 0.7407 SEX 682.0 13.0 11.0 693.0 0.9813 0.9841 0.9827 EMAIL 134.0 0.0 1.0 135.0 1.0 0.9926 0.9963 ZIP 141.0 2.0 1.0 142.0 0.986 0.993 0.9895 MEDICALRECORD 29.0 5.0 0.0 29.0 0.8529 
1.0 0.9206 PROFESSION 252.0 27.0 25.0 277.0 0.9032 0.9097 0.9065 PHONE 51.0 11.0 0.0 51.0 0.8226 1.0 0.9027 COUNTRY 505.0 74.0 82.0 587.0 0.8722 0.8603 0.8662 DOCTOR 444.0 26.0 48.0 492.0 0.9447 0.9024 0.9231 AGE 549.0 15.0 7.0 556.0 0.9734 0.9874 0.9804 macro - - - - - - 0.9138 micro - - - - - - 0.8930 ``` --- layout: model title: Legal Separation Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_separation_agreement_bert date: 2022-11-25 tags: [en, legal, classification, agreement, separation, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_separation_agreement_bert` model is a Bert Sentence Embeddings Document Classifier that predicts whether a document belongs to the `separation-agreement` class or not (binary classification). Compared to the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `separation-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_separation_agreement_bert_en_1.0.0_3.0_1669371119187.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_separation_agreement_bert_en_1.0.0_3.0_1669371119187.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_separation_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
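The weighted averages reported in the Benchmarking section below are the per-class scores weighted by support; a quick sanity check in plain Python, using the per-class precision values copied from the table (small differences from the reported figure are expected, since the per-class values are themselves rounded):

```python
def weighted_average(scores, supports):
    """Support-weighted mean of per-class scores."""
    total = sum(supports)
    return sum(s * w for s, w in zip(scores, supports)) / total

# Per-class precision and support for `other` and `separation-agreement`.
precisions = [0.93, 0.88]
supports = [82, 35]

# Close to the 0.91 weighted-avg precision reported in the table.
print(round(weighted_average(precisions, supports), 3))
```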
## Results ```bash +----------------------+ |result | +----------------------+ |[separation-agreement]| |[other] | |[other] | |[separation-agreement]| +----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_separation_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.93 0.95 0.94 82 separation-agreement 0.88 0.83 0.85 35 accuracy - - 0.91 117 macro-avg 0.90 0.89 0.90 117 weighted-avg 0.91 0.91 0.91 117 ``` --- layout: model title: English asr_wav2vec2_base_100h_ngram TFWav2Vec2ForCTC from saahith author: John Snow Labs name: pipeline_asr_wav2vec2_base_100h_ngram date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_ngram` is an English model originally trained by saahith. NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use pipeline_asr_wav2vec2_base_100h_ngram_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_ngram_en_4.2.0_3.0_1664042368247.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_ngram_en_4.2.0_3.0_1664042368247.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_100h_ngram', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_100h_ngram", lang = "en") val annotations = pipeline.transform(audioDF) ```
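The `audioDF` fed to the pipeline needs a column of raw audio samples as floats; Wav2Vec2 models are typically trained on 16 kHz mono audio. A sketch of producing such a float array in plain Python (a synthetic sine tone stands in here for samples loaded from a real recording, and `audio_content` is assumed to be the column the pipeline's `AudioAssembler` stage reads):

```python
import math

SAMPLE_RATE = 16_000  # Wav2Vec2 models generally expect 16 kHz mono input

# One second of a 440 Hz sine tone as plain Python floats,
# standing in for samples loaded from a real WAV file.
audio_samples = [
    math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
    for t in range(SAMPLE_RATE)
]

print(len(audio_samples))  # 16000

# With a Spark session available, the DataFrame could then be built
# along these lines (not run here):
# audioDF = spark.createDataFrame([(audio_samples,)], ["audio_content"])
```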
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_100h_ngram| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|227.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English T5ForConditionalGeneration Base Cased model (from google) author: John Snow Labs name: t5_efficient_base_dl6 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-dl6` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_dl6_en_4.3.0_3.0_1675110178520.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_dl6_en_4.3.0_3.0_1675110178520.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_base_dl6","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_base_dl6","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_base_dl6| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|366.3 MB| ## References - https://huggingface.co/google/t5-efficient-base-dl6 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English DistilBertForQuestionAnswering Cased model (from stevhliu) author: John Snow Labs name: distilbert_qa_my_awesome_model date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `my_awesome_qa_model` is an English model originally trained by `stevhliu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_my_awesome_model_en_4.3.0_3.0_1672775319390.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_my_awesome_model_en_4.3.0_3.0_1672775319390.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_my_awesome_model","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_my_awesome_model","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
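Under the hood, extractive QA models like this one score every token of the context as a potential answer start and end, and the answer is the highest-scoring valid span. The selection step can be sketched in plain Python (the logits below are illustrative values, not real model outputs):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick the (start, end) pair maximizing start+end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j, e in enumerate(end_logits):
            if i <= j < i + max_len and s + e > best_score:
                best_score, best = s + e, (i, j)
    return best

# Illustrative logits over the tokens of the example context.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.2, 0.1, 4.5, 0.0, 0.1, 0.0, 0.1, 1.2, 0.0]
end_logits = [0.0, 0.1, 0.2, 4.1, 0.1, 0.0, 0.1, 0.0, 1.5, 0.1]

start, end = best_span(start_logits, end_logits)
print(" ".join(tokens[start:end + 1]))  # Clara
```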
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_my_awesome_model| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/stevhliu/my_awesome_qa_model --- layout: model title: Clinical Deidentification (Spanish) author: John Snow Labs name: clinical_deidentification date: 2023-06-13 tags: [deid, es, licensed] task: De-identification language: es edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline is trained with sciwiki_300d embeddings and can be used to deidentify PHI information from medical texts in Spanish. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask, fake or obfuscate the following entities: `AGE`, `DATE`, `PROFESSION`, `E-MAIL`, `USERNAME`, `LOCATION`, `DOCTOR`, `HOSPITAL`, `PATIENT`, `URL`, `IP`, `MEDICALRECORD`, `IDNUM`, `ORGANIZATION`, `PHONE`, `ZIP`, `ACCOUNT`, `SSN`, `PLATE`, `SEX` and `IPADDR` ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_es_4.4.4_3.2_1686663754234.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_es_4.4.4_3.2_1686663754234.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from johnsnowlabs import * deid_pipeline = PretrainedPipeline("clinical_deidentification", "es", "clinical/models") sample = """Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. 
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com """ result = deid_pipeline.annotate(sample) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "es", "clinical/models") val sample = "Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. 
Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com " val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("es.deid.clinical").predict("""Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. 
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com """) ```
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from johnsnowlabs import * deid_pipeline = PretrainedPipeline("clinical_deidentification", "es", "clinical/models") sample = """Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. 
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com """

result = deid_pipeline.annotate(sample)
```

```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "es", "clinical/models")

val sample = "Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. 
Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com " val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("es.deid.clinical").predict("""Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. 
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com """) ```
## Results ```bash Results Masked with entity labels ------------------------------ Datos del paciente. Nombre: . Apellidos: . NHC: . NASS: 04. Domicilio: , 5 B.. Localidad/ Provincia: . CP: . Datos asistenciales. Fecha de nacimiento: . País: . Edad: años Sexo: . Fecha de Ingreso: . : María Merino Viveros NºCol: . Informe clínico del paciente: de años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. 
VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. Servicio de Endocrinología y Nutrición km 12,500 28905 - () Correo electrónico: Masked with chars ------------------------------ Datos del paciente. Nombre: [**] . Apellidos: [*************]. NHC: [*****]. NASS: ** [******] 04. Domicilio: [*******************], 5 B.. Localidad/ Provincia: [****]. CP: [***]. Datos asistenciales. Fecha de nacimiento: [********]. País: [****]. Edad: ** años Sexo: *. Fecha de Ingreso: [********]. [****]: María Merino Viveros NºCol: ** ** [***]. 
Informe clínico del paciente: [***] de ** años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en [*********] en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. 
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: [*************] +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente [****] significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. [******************] [******************************] Servicio de Endocrinología y Nutrición [*****************] km 12,500 28905 [****] - [****] ([****]) Correo electrónico: [********************] Masked with fixed length chars ------------------------------ Datos del paciente. Nombre: **** . Apellidos: ****. NHC: ****. NASS: **** **** 04. Domicilio: ****, 5 B.. Localidad/ Provincia: ****. CP: ****. Datos asistenciales. Fecha de nacimiento: ****. País: ****. Edad: **** años Sexo: ****. Fecha de Ingreso: ****. ****: María Merino Viveros NºCol: **** **** ****. 
Informe clínico del paciente: **** de **** años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en **** en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. 
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: **** +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente **** significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. **** **** Servicio de Endocrinología y Nutrición **** km 12,500 28905 **** - **** (****) Correo electrónico: **** Obfuscated ------------------------------ Datos del paciente. Nombre: Sr. Lerma . Apellidos: Aristides Gonzalez Gelabert. NHC: BBBBBBBBQR648597. NASS: 041010000011 RZRM020101906017 04. Domicilio: Valencia, 5 B.. Localidad/ Provincia: Madrid. CP: 99335. Datos asistenciales. Fecha de nacimiento: 25/04/1977. País: Barcelona. Edad: 8 años Sexo: F.. Fecha de Ingreso: 02/08/2018. transportista: María Merino Viveros NºCol: olegario10 olegario10 felisa78. 
Informe clínico del paciente: RZRM020101906017 de 8 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Madrid en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. 
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Dra. Laguna +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente 041010000011 significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
Reinaldo Manjón Malo Barcelona Servicio de Endocrinología y Nutrición Valencia km 12,500 28905 Bilbao - Madrid (Barcelona) Correo electrónico: quintanasalome@example.net
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|clinical_deidentification|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|es|
|Size:|281.3 MB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ChunkMergeModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel
- Finisher

---
layout: model
title: English image_classifier_vit_rust_image_classification_2 ViTForImageClassification from SummerChiam
author: John Snow Labs
name: image_classifier_vit_rust_image_classification_2
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rust_image_classification_2` is an English model originally trained by SummerChiam.
## Predicted Entities

`nonrust`, `rust`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_2_en_4.1.0_3.0_1660166968473.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_2_en_4.1.0_3.0_1660166968473.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
image_assembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

imageClassifier = ViTForImageClassification \
    .pretrained("image_classifier_vit_rust_image_classification_2", "en") \
    .setInputCols("image_assembler") \
    .setOutputCol("class")

pipeline = Pipeline(stages=[
    image_assembler,
    imageClassifier,
])

pipelineModel = pipeline.fit(imageDF)

pipelineDF = pipelineModel.transform(imageDF)
```

```scala
val imageAssembler = new ImageAssembler()
  .setInputCol("image")
  .setOutputCol("image_assembler")

val imageClassifier = ViTForImageClassification
  .pretrained("image_classifier_vit_rust_image_classification_2", "en")
  .setInputCols("image_assembler")
  .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))

val pipelineModel = pipeline.fit(imageDF)

val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_rust_image_classification_2|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|

---
layout: model
title: Legal Employee benefit plans Clause Binary Classifier (md)
author: John Snow Labs
name: legclf_employee_benefit_plans_md
date: 2022-11-25
tags: [en, legal, classification, document, agreement, contract, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `employee-benefit-plans` clause type. To use this model, make sure you provide enough context as input.

Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into account that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
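As a rough illustration of the paragraph-splitting technique mentioned above, the sketch below splits a document on blank lines and flags chunks that may exceed the 512-token limit. It is a minimal, hypothetical example in plain Python (the whitespace-token count is only a crude proxy for the model's actual tokenizer), not part of the Spark NLP API:

```python
import re

def split_paragraphs(text: str, max_tokens: int = 512):
    """Split a document on blank lines ("paragraph splitting by multiline")
    and flag whether each chunk fits the rough token budget.

    Whitespace tokens are used as a crude proxy for the embedding
    model's tokenizer, so treat the flag as a heuristic only."""
    chunks = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(chunk, len(chunk.split()) <= max_tokens) for chunk in chunks]

doc = "First clause paragraph.\n\nSecond clause paragraph."
for chunk, fits in split_paragraphs(doc):
    print(fits, chunk)
```

Each resulting chunk can then be fed to the classifier as a separate row of the `clause_text` column.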
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `employee-benefit-plans`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_employee_benefit_plans_md_en_1.0.0_3.0_1669376476318.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_employee_benefit_plans_md_en_1.0.0_3.0_1669376476318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("clause_text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = legal.ClassifierDLModel.pretrained("legclf_employee_benefit_plans_md", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    embeddings,
    docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text")

model = nlpPipeline.fit(df)

result = model.transform(df)
```
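When several binary clause classifiers are run over the same text, their per-clause labels can be collected into one True/False map per document. A minimal pure-Python sketch of that merging step (hypothetical helper and labels, not part of the library API):

```python
# Hypothetical sketch: merge the outputs of several binary clause
# classifiers into one True/False map per document.
def merge_clause_predictions(predictions):
    """predictions maps classifier name -> predicted label;
    the label 'other' means the clause was not detected."""
    return {clause: label != "other" for clause, label in predictions.items()}

flags = merge_clause_predictions({
    "employee-benefit-plans": "employee-benefit-plans",
    "indemnifications": "other",
})
print(flags)  # {'employee-benefit-plans': True, 'indemnifications': False}
```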
## Results

```bash
+------------------------+
|result                  |
+------------------------+
|[employee-benefit-plans]|
|[other]                 |
|[other]                 |
|[employee-benefit-plans]|
+------------------------+
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_employee_benefit_plans_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking

```bash
                        precision  recall  f1-score  support
employee-benefit-plans       0.97    1.00      0.98       28
                 other       1.00    0.97      0.99       39
              accuracy                         0.99       67
             macro avg       0.98    0.99      0.98       67
          weighted avg       0.99    0.99      0.99       67
```

--- layout: model title: English RobertaForQuestionAnswering Base Cased model (from huxxx657) author: John Snow Labs name: roberta_qa_base_finetuned_scrambled_squad_5 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-scrambled-squad-5` is an English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_scrambled_squad_5_en_4.3.0_3.0_1674216944002.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_scrambled_squad_5_en_4.3.0_3.0_1674216944002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_scrambled_squad_5","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_scrambled_squad_5","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_finetuned_scrambled_squad_5| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-scrambled-squad-5 --- layout: model title: NER Model Finder with Sentence Entity Resolvers (sbert_jsl_medium_uncased) author: John Snow Labs name: sbertresolve_ner_model_finder date: 2021-11-24 tags: [ner, licensed, en, clinical, entity_resolver] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.2 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities (NER labels) to the most appropriate NER model using `sbert_jsl_medium_uncased` Sentence Bert Embeddings. Given the entity name, it will return a list of pretrained NER models having that entity or similar ones. ## Predicted Entities `NER Model Names` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_ner_model_finder_en_3.3.2_2.4_1637764208798.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_ner_model_finder_en_3.3.2_2.4_1637764208798.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("ner_chunk")

sbert_embedder = BertSentenceEmbeddings\
    .pretrained("sbert_jsl_medium_uncased", "en", "clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("sbert_embeddings")

ner_model_finder = SentenceEntityResolverModel\
    .pretrained("sbertresolve_ner_model_finder", "en", "clinical/models")\
    .setInputCols(["ner_chunk", "sbert_embeddings"])\
    .setOutputCol("model_names")\
    .setDistanceFunction("EUCLIDEAN")

ner_model_finder_pipelineModel = PipelineModel(stages=[documentAssembler, sbert_embedder, ner_model_finder])

light_pipeline = LightPipeline(ner_model_finder_pipelineModel)

annotations = light_pipeline.fullAnnotate("medication")
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("ner_chunk")

val sbert_embedder = BertSentenceEmbeddings
    .pretrained("sbert_jsl_medium_uncased", "en", "clinical/models")
    .setInputCols(Array("ner_chunk"))
    .setOutputCol("sbert_embeddings")

val ner_model_finder = SentenceEntityResolverModel
    .pretrained("sbertresolve_ner_model_finder", "en", "clinical/models")
    .setInputCols(Array("ner_chunk", "sbert_embeddings"))
    .setOutputCol("model_names")
    .setDistanceFunction("EUCLIDEAN")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, ner_model_finder))
val ner_model_finder_pipelineModel = pipeline.fit(Seq("").toDF("text"))

val light_pipeline = new LightPipeline(ner_model_finder_pipelineModel)

val annotations = light_pipeline.fullAnnotate("medication")
```

{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.ner.model_finder").predict("""Put your text here.""")
```
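The resolver returns its candidate matches inside the annotation metadata as a `:::`-separated string (visible in the Results section below). The helper below sketches, in plain Python, how such a string could be unpacked; the metadata key name `all_k_resolutions` is an assumption here, so inspect the `.metadata` dict of a real annotation to confirm it.

```python
# Helper to unpack the ':::'-separated resolution metadata attached to
# Sentence Entity Resolver annotations. The key name is an assumption.
def parse_resolutions(metadata: dict, key: str = "all_k_resolutions") -> list:
    raw = metadata.get(key, "")
    return [item.strip() for item in raw.split(":::") if item.strip()]

# Simulated metadata, shaped like the output shown in the Results section:
example_metadata = {"all_k_resolutions": "medication:::drug:::treatment"}
print(parse_resolutions(example_metadata))  # ['medication', 'drug', 'treatment']
```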
## Results

```bash
entity: medication

models (closest match):
['ner_posology', 'ner_posology_large', 'ner_posology_small', 'ner_posology_greedy', 'ner_drugs_large', 'ner_posology_experimental', 'ner_drugs_greedy', 'ner_ade_clinical', 'ner_jsl_slim', 'ner_posology_healthcare', 'ner_ade_healthcare', 'jsl_ner_wip_modifier_clinical', 'ner_ade_clinical', 'ner_jsl_greedy', 'ner_risk_factors']

resolutions (':::'-separated in the raw output) and the model list returned for each:
medication              -> ['ner_posology', 'ner_posology_large', 'ner_posology_small', 'ner_posology_greedy', 'ner_drugs_large', 'ner_posology_experimental', 'ner_drugs_greedy', 'ner_ade_clinical', 'ner_jsl_slim', 'ner_posology_healthcare', 'ner_ade_healthcare', 'jsl_ner_wip_modifier_clinical', 'ner_ade_clinical', 'ner_jsl_greedy', 'ner_risk_factors']
drug                    -> ['ner_posology', 'ner_posology_large', 'ner_posology_small', 'ner_posology_greedy', 'ner_drugs_large', 'ner_posology_experimental', 'ner_drugs_greedy', 'ner_jsl_slim', 'ner_posology_healthcare', 'ner_ade_healthcare', 'jsl_ner_wip_modifier_clinical', 'ner_ade_clinical', 'ner_jsl_greedy', 'ner_risk_factors']
treatment               -> ['jsl_rd_ner_wip_greedy_clinical', 'ner_clinical_large', 'ner_healthcare', 'ner_jsl_enriched', 'ner_clinical', 'ner_jsl_slim', 'ner_covid_trials', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_events_admission_clinical', 'ner_events_healthcare', 'ner_events_clinical', 'ner_jsl_greedy']
therapeutic procedure   -> ['ner_medmentions_coarse']
drug ingredient         -> ['ner_jsl_enriched', 'ner_covid_trials', 'ner_jsl', 'ner_medmentions_coarse']
drug chemical           -> ['ner_drugs']
diagnostic aid          -> ['ner_clinical_icdem', 'ner_medmentions_coarse']
substance               -> ['jsl_rd_ner_wip_greedy_clinical', 'ner_jsl_enriched', 'ner_medmentions_coarse', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_greedy']
medical device          -> ['jsl_rd_ner_wip_greedy_clinical', 'ner_jsl_enriched', 'ner_medmentions_coarse', 'ner_radiology_wip_clinical', 'ner_jsl_slim', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_greedy', 'ner_radiology']
diagnostic procedure    -> ['ner_medmentions_coarse', 'ner_clinical_icdem']
administration          -> ['ner_posology_experimental']
measurement             -> ['jsl_rd_ner_wip_greedy_clinical', 'ner_measurements_clinical', 'ner_radiology_wip_clinical', 'ner_radiology']
drug strength           -> ['jsl_rd_ner_wip_greedy_clinical', 'ner_posology_small', 'ner_jsl_enriched', 'ner_posology_experimental', 'ner_posology_large', 'ner_posology_healthcare', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_posology_greedy', 'ner_posology', 'ner_jsl_greedy']
physiological reaction  -> ['ner_covid_trials', 'ner_medmentions_coarse', 'jsl_rd_ner_wip_greedy_clinical', 'ner_jsl_enriched', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_greedy']
patient                 -> ['ner_deid_subentity_augmented', 'ner_deid_subentity_glove', 'ner_deidentify_dl', 'ner_deid_enriched']
vaccine                 -> ['jsl_rd_ner_wip_greedy_clinical', 'ner_jsl_enriched', 'ner_covid_trials', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_greedy']
psychological condition -> ['ner_medmentions_coarse', 'jsl_rd_ner_wip_greedy_clinical', 'ner_jsl_enriched', 'ner_jsl', 'jsl_ner_wip_modifier_clinical', 'ner_jsl_greedy']
abbreviation            -> ['ner_chemd_clinical']
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbertresolve_ner_model_finder| |Compatibility:|Healthcare NLP 3.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sbert_embeddings]| |Output Labels:|[models]| |Language:|en| |Case sensitive:|false| --- layout: model title: English T5ForConditionalGeneration Tiny Cased model (from google) author: John Snow Labs name: t5_efficient_tiny_dl6 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models 
edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-dl6` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_dl6_en_4.3.0_3.0_1675123262536.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_dl6_en_4.3.0_3.0_1675123262536.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_tiny_dl6", "en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_tiny_dl6", "en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
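T5-family models are conditioned by prepending a task prefix (e.g. `summarize:`, `translate English to German:`) to the input text. The helper below sketches building prefixed rows before creating the DataFrame; whether this particular efficient-tiny checkpoint responds usefully to a given prefix without further fine-tuning is an assumption to verify.

```python
# Build T5 inputs by prepending a task prefix to each text. The prefixes
# are the conventional T5 ones; behavior of this specific checkpoint on
# them is not guaranteed.
def with_task_prefix(texts, task="summarize"):
    return [f"{task}: {t}" for t in texts]

rows = with_task_prefix(["Spark NLP provides production-grade NLP annotators."])
# data = spark.createDataFrame([[r] for r in rows]).toDF("text")
print(rows[0])  # summarize: Spark NLP provides production-grade NLP annotators.
```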
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_tiny_dl6| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|62.3 MB| ## References - https://huggingface.co/google/t5-efficient-tiny-dl6 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Financial NER (Headers / Subheaders) author: John Snow Labs name: finner_headers date: 2022-08-29 tags: [en, finance, ner, headers, splitting, sections, licensed] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Named Entity Recognition model, which will help you split long financial documents into smaller sections. To do that, it detects Headers and Subheaders of different sections. You can then use the beginning and end information in the metadata to retrieve the text between those headers. This model has been trained on 10-K filings, with the following HEADER and SUBHEADER annotation guidelines: - PART I, PART II, etc are HEADERS - Item 1, Item 2, etc are also HEADERS - Item 1A, 2B, etc are SUBHEADERS - 1., 2., 2.1, etc. are SUBHEADERS - Other kinds of short section names are also SUBHEADERS For more information about long document splitting, see [this](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb) workshop entry. 
## Predicted Entities `HEADER`, `SUBHEADER` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FINNER_HEADERS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_headers_en_1.0.0_3.2_1661771922923.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_headers_en_1.0.0_3.2_1661771922923.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = finance.NerModel.pretrained('finner_headers', 'en', 'finance/models')\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = [""" 2. Definitions. For purposes of this Agreement, the following terms have the meanings ascribed thereto in this Section 1. 2. Appointment as Reseller. 2.1 Appointment. The Company hereby [***]. Allscripts may also disclose Company's pricing information relating to its Merchant Processing Services and facilitate procurement of Merchant Processing Services on behalf of Sublicensed Customers, including, without limitation by references to such pricing information and Merchant Processing Services in Customer Agreements. 6 2.2 Customer Agreements."""] res = model.transform(spark.createDataFrame([text]).toDF("text")) ```
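As the description suggests, the begin/end offsets in the detected chunks' metadata can be used to slice out the text between consecutive headers. Below is a minimal sketch in plain Python over `(begin, end)` offset pairs; the offsets and sample document are illustrative, not real model output.

```python
# Slice the text between consecutive detected HEADER/SUBHEADER chunks,
# given their (begin, end) character offsets in document order.
def sections_between_headers(text: str, header_spans: list) -> list:
    """header_spans: list of (begin, end) offsets, end inclusive (as in
    Spark NLP annotations). Returns the text between consecutive headers,
    plus the text after the last header."""
    sections = []
    for (b1, e1), (b2, _) in zip(header_spans, header_spans[1:]):
        sections.append(text[e1 + 1:b2].strip(" ."))  # drop surrounding punctuation
    sections.append(text[header_spans[-1][1] + 1:].strip(" ."))
    return sections

doc = "Definitions. Terms are defined here. Appointment. Reseller is appointed."
spans = [(0, 10), (37, 47)]  # illustrative offsets of "Definitions" and "Appointment"
print(sections_between_headers(doc, spans))
```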
## Results ```bash +-----------+-----------+----------+ | token| ner_label|confidence| +-----------+-----------+----------+ | 2| O| 0.576| | .| O| 0.9612| |Definitions|B-SUBHEADER| 0.9993| | .| O| 0.9755| | For| O| 0.9966| | purposes| O| 0.9863| | of| O| 0.9878| | this| O| 0.9974| | Agreement| O| 0.9994| | ,| O| 0.9999| | the| O| 1.0| | following| O| 1.0| | terms| O| 1.0| | have| O| 1.0| | the| O| 1.0| | meanings| O| 1.0| | ascribed| O| 1.0| | thereto| O| 1.0| | in| O| 1.0| | this| O| 1.0| | Section| O| 0.9985| | 1| O| 0.9999| | .| O| 0.9972| | 2| O| 0.9686| | .| O| 0.9834| |Appointment|B-SUBHEADER| 0.767| | as|I-SUBHEADER| 0.9479| | Reseller|I-SUBHEADER| 0.8429| | .| O| 0.9944| | 2.1|B-SUBHEADER| 0.6278| |Appointment|I-SUBHEADER| 0.6599| | .| O| 0.9972| | The| O| 0.9987| | Company| O| 0.9889| | hereby| O| 0.9914| | [***]| O| 0.9996| | .| O| 0.9999| | Allscripts| O| 0.9843| | may| O| 0.9989| | also| O| 0.9967| | disclose| O| 0.9949| | Company's| O| 0.9976| | pricing| O| 0.9999| |information| O| 0.9999| | relating| O| 0.9999| | to| O| 0.9998| | its| O| 0.9992| | Merchant| O| 0.9671| | Processing| O| 0.8411| | Services| O| 0.9662| +-----------+-----------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_headers| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References In-house annotations on 10-k filings ## Benchmarking ```bash label tp fp fn prec rec f1 I-HEADER 2835 9 8 0.996835 0.9971860 0.9970107 B-SUBHEADER 963 135 131 0.877049 0.8802559 0.87864965 I-SUBHEADER 2573 219 152 0.921561 0.9442202 0.9327533 B-HEADER 425 1 1 0.997652 0.9976526 0.9976526 Macro-average 6796 364 292 0.948274 0.9548287 0.95154047 Micro-average 6796 364 292 0.949162 0.9588036 0.9539584 ``` --- layout: model title: Estonian XlmRoBertaForQuestionAnswering (from anukaver) author: John Snow Labs 
name: xlm_roberta_qa_xlm_roberta_est_qa date: 2022-06-23 tags: [open_source, question_answering, xlmroberta] task: Question Answering language: et edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-est-qa` is an Estonian model originally trained by `anukaver`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_est_qa_et_4.0.0_3.0_1655992235241.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_est_qa_et_4.0.0_3.0_1655992235241.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_est_qa","et") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_est_qa","et") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("et.answer_question.xlm_roberta.by_anukaver").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
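The predicted span lands in the `answer` output column. The helper below sketches extracting the answer text and a confidence value from that output in plain Python; the dict shape and the `score` metadata key are assumptions standing in for real annotation objects, so check the `.metadata` of an actual result before relying on them.

```python
# Extract (answer_text, score) pairs from simulated annotate() output.
# The 'score' metadata key is an assumption; inspect real output to confirm.
def best_answer(rows, col="answer"):
    out = []
    for row in rows:
        for ann in row.get(col, []):
            out.append((ann["result"], float(ann["metadata"].get("score", 0.0))))
    return out

# Simulated output with plain dicts standing in for annotation objects:
fake = [{"answer": [{"result": "Clara", "metadata": {"score": "0.97"}}]}]
print(best_answer(fake))  # [('Clara', 0.97)]
```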
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_roberta_est_qa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|et| |Size:|883.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anukaver/xlm-roberta-est-qa --- layout: model title: English BertForQuestionAnswering model (from madlag) author: John Snow Labs name: bert_qa_bert_base_uncased_squadv1_x2.44_f87.7_d26_hybrid_filled_v1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squadv1-x2.44-f87.7-d26-hybrid-filled-v1` is an English model originally trained by `madlag`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squadv1_x2.44_f87.7_d26_hybrid_filled_v1_en_4.0.0_3.0_1654181609802.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squadv1_x2.44_f87.7_d26_hybrid_filled_v1_en_4.0.0_3.0_1654181609802.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_squadv1_x2.44_f87.7_d26_hybrid_filled_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_squadv1_x2.44_f87.7_d26_hybrid_filled_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased_x2.32_f86.6_d15_hybrid_v1.by_madlag").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_squadv1_x2.44_f87.7_d26_hybrid_filled_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|174.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/madlag/bert-base-uncased-squadv1-x2.44-f87.7-d26-hybrid-filled-v1 - https://rajpurkar.github.io/SQuAD-explorer - https://www.aclweb.org/anthology/N19-1423.pdf --- layout: model title: Fast Neural Machine Translation Model from English to Japanese author: John Snow Labs name: opus_mt_en_jap date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, jap, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `jap` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_jap_xx_2.7.0_2.4_1609168310758.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_jap_xx_2.7.0_2.4_1609168310758.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_jap", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_jap", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.jap').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_jap| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RoBERTa Embeddings (from jackaduma) author: John Snow Labs name: roberta_embeddings_SecRoBERTa date: 2022-04-14 tags: [roberta, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `SecRoBERTa` is an English model originally trained by `jackaduma`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_SecRoBERTa_en_3.4.2_3.0_1649946774316.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_SecRoBERTa_en_3.4.2_3.0_1649946774316.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_SecRoBERTa","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_SecRoBERTa","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.SecRoBERTa").predict("""I love Spark NLP""") ```
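Each annotation in the `embeddings` output column carries one vector per token; a common downstream step is comparing such vectors with cosine similarity. A small self-contained sketch of that comparison (the three vectors are toy values, not actual SecRoBERTa embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 3-dimensional vectors standing in for real 768-d token embeddings
v1 = [0.2, 0.1, 0.9]
v2 = [0.21, 0.09, 0.88]   # close to v1
v3 = [-0.7, 0.5, 0.0]     # far from v1
assert cosine(v1, v2) > cosine(v1, v3)
```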
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_SecRoBERTa| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|314.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/jackaduma/SecRoBERTa - https://github.com/jackaduma/SecBERT/ - https://github.com/kbandla/APTnotes - https://stucco.github.io/data/ - https://ebiquity.umbc.edu/_file_directory_/papers/943.pdf - https://competitions.codalab.org/competitions/17262 - https://github.com/allenai/scibert --- layout: model title: Social Determinants of Health (clinical_large) author: John Snow Labs name: ner_sdoh_emb_clinical_large_wip date: 2023-04-17 tags: [en, clinical_large, social_determinants, public_health, ner, sdoh, pyspark_30, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts terminology related to Social Determinants of Health from various kinds of biomedical documents.
## Predicted Entities `Access_To_Care`, `Age`, `Alcohol`, `Chidhood_Event`, `Communicable_Disease`, `Community_Safety`, `Diet`, `Disability`, `Eating_Disorder`, `Education`, `Employment`, `Environmental_Condition`, `Exercise`, `Family_Member`, `Financial_Status`, `Food_Insecurity`, `Gender`, `Geographic_Entity`, `Healthcare_Institution`, `Housing`, `Hyperlipidemia`, `Hypertension`, `Income`, `Insurance_Status`, `Language`, `Legal_Issues`, `Marital_Status`, `Mental_Health`, `Obesity`, `Other_Disease`, `Other_SDoH_Keywords`, `Population_Group`, `Quality_Of_Life`, `Race_Ethnicity`, `Sexual_Activity`, `Sexual_Orientation`, `Smoking`, `Social_Exclusion`, `Social_Support`, `Spiritual_Beliefs`, `Substance_Duration`, `Substance_Frequency`, `Substance_Quantity`, `Substance_Use`, `Transportation`, `Violence_Or_Abuse` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/SOCIAL_DETERMINANT_NER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/SOCIAL_DETERMINANT_NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_emb_clinical_large_wip_en_4.3.2_3.0_1681756284245.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_emb_clinical_large_wip_en_4.3.2_3.0_1681756284245.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_sdoh_emb_clinical_large_wip", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter ]) sample_texts = [["Smith is a 55 years old, divorced Mexcian American woman with financial problems. She speaks spanish. She lives in an apartment. She has been struggling with diabetes for the past 10 years and has recently been experiencing frequent hospitalizations due to uncontrolled blood sugar levels. Smith works as a cleaning assistant and does not have access to health insurance or paid sick leave. She has a son student at college. Pt with likely long-standing depression. She is aware she needs rehab. Pt reprots having her catholic faith as a means of support as well. She has long history of etoh abuse, beginning in her teens. She reports she has been a daily drinker for 30 years, most recently drinking beer daily. She smokes a pack of cigarettes a day. 
She had DUI back in April and was due to be in court this week."]] data = spark.createDataFrame(sample_texts).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_sdoh_emb_clinical_large_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter )) val data = Seq("Smith is a 55 years old, divorced Mexcian American woman with financial problems. She speaks spanish. She lives in an apartment. She has been struggling with diabetes for the past 10 years and has recently been experiencing frequent hospitalizations due to uncontrolled blood sugar levels. Smith works as a cleaning assistant and does not have access to health insurance or paid sick leave. She has a son student at college. Pt with likely long-standing depression. She is aware she needs rehab. Pt reprots having her catholic faith as a means of support as well. She has long history of etoh abuse, beginning in her teens. She reports she has been a daily drinker for 30 years, most recently drinking beer daily. She smokes a pack of cigarettes a day. 
She had DUI back in April and was due to be in court this week.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
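`NerConverterInternal` merges the token-level IOB tags in the `ner` column into labelled chunks in `ner_chunk`. A minimal pure-Python sketch of that merging logic (the tokens and tags below are illustrative, not actual model output):

```python
def iob_to_chunks(tokens, tags):
    """Collapse IOB tags (B-X / I-X / O) into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" or an inconsistent I- tag closes the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

# Illustrative tagging of a fragment of the sample text above
tokens = ["She", "smokes", "a", "pack", "of", "cigarettes", "a", "day"]
tags = ["O", "B-Smoking", "I-Smoking", "I-Smoking", "I-Smoking",
        "I-Smoking", "I-Smoking", "I-Smoking"]
print(iob_to_chunks(tokens, tags))
# [('smokes a pack of cigarettes a day', 'Smoking')]
```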
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_sdoh_emb_clinical_large_wip| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.0 MB| |Dependencies:|embeddings_clinical_large| ## References Internal SHOP Project ## Benchmarking ```bash label precision recall f1-score support Employment 0.94 0.96 0.95 2075 Social_Support 0.91 0.90 0.90 658 Other_SDoH_Keywords 0.82 0.87 0.85 259 Healthcare_Institution 0.99 0.95 0.97 781 Alcohol 0.96 0.97 0.96 258 Gender 0.99 0.99 0.99 4957 Other_Disease 0.89 0.94 0.91 583 Access_To_Care 0.86 0.88 0.87 520 Mental_Health 0.89 0.81 0.85 494 Age 0.92 0.96 0.94 433 Marital_Status 1.00 1.00 1.00 92 Substance_Quantity 0.88 0.86 0.87 58 Substance_Use 0.91 0.97 0.94 192 Family_Member 0.97 0.99 0.98 2094 Financial_Status 0.86 0.65 0.74 124 Race_Ethnicity 0.93 0.93 0.93 27 Insurance_Status 0.93 0.87 0.90 85 Spiritual_Beliefs 0.86 0.81 0.83 52 Housing 0.88 0.85 0.87 400 Geographic_Entity 0.86 0.88 0.87 113 Disability 0.93 0.93 0.93 44 Quality_Of_Life 0.89 0.75 0.81 67 Income 0.89 0.77 0.83 31 Education 0.85 0.88 0.86 58 Transportation 0.86 0.89 0.88 57 Legal_Issues 0.72 0.91 0.80 47 Smoking 0.98 0.97 0.98 66 Substance_Frequency 0.93 0.75 0.83 57 Hypertension 1.00 1.00 1.00 21 Violence_Or_Abuse 0.83 0.62 0.71 63 Exercise 0.96 0.88 0.92 57 Diet 0.95 0.87 0.91 70 Sexual_Orientation 0.68 1.00 0.81 13 Language 0.89 0.73 0.80 22 Social_Exclusion 0.96 0.90 0.93 29 Substance_Duration 0.75 0.85 0.80 39 Communicable_Disease 1.00 0.84 0.91 31 Chidhood_Event 0.88 0.61 0.72 23 Community_Safety 0.95 0.93 0.94 44 Population_Group 0.89 0.62 0.73 13 Hyperlipidemia 0.78 1.00 0.88 7 Food_Insecurity 1.00 0.93 0.96 29 Eating_Disorder 0.67 0.92 0.77
13 Sexual_Activity 0.84 0.90 0.87 29 Environmental_Condition 1.00 1.00 1.00 20 Obesity 1.00 1.00 1.00 12 micro-avg 0.95 0.95 0.95 15217 macro-avg 0.90 0.88 0.88 15217 weighted-avg 0.95 0.95 0.95 15217 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from Nadav) author: John Snow Labs name: roberta_qa_base_squad_finetuned_on_runaways date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad-finetuned-on-runaways-en` is an English model originally trained by `Nadav`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad_finetuned_on_runaways_en_4.3.0_3.0_1674218728000.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad_finetuned_on_runaways_en_4.3.0_3.0_1674218728000.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad_finetuned_on_runaways","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad_finetuned_on_runaways","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_squad_finetuned_on_runaways| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|467.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Nadav/roberta-base-squad-finetuned-on-runaways-en --- layout: model title: Chinese T5ForConditionalGeneration Cased model (from IDEA-CCNL) author: John Snow Labs name: t5_randeng_77m_multitask_chinese date: 2023-01-30 tags: [zh, open_source, t5, tensorflow] task: Text Generation language: zh edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Randeng-T5-77M-MultiTask-Chinese` is a Chinese model originally trained by `IDEA-CCNL`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_randeng_77m_multitask_chinese_zh_4.3.0_3.0_1675098367899.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_randeng_77m_multitask_chinese_zh_4.3.0_3.0_1675098367899.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_randeng_77m_multitask_chinese","zh") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_randeng_77m_multitask_chinese","zh") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_randeng_77m_multitask_chinese| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|zh| |Size:|349.2 MB| ## References - https://huggingface.co/IDEA-CCNL/Randeng-T5-77M-MultiTask-Chinese - https://github.com/IDEA-CCNL/Fengshenbang-LM - https://fengshenbang-doc.readthedocs.io/ - http://jmlr.org/papers/v21/20-074.html - https://github.com/IDEA-CCNL/Fengshenbang-LM/tree/main/fengshen/examples/pretrain_t5 - https://github.com/IDEA-CCNL/Fengshenbang-LM/tree/main/fengshen/examples/mt5_summary - https://arxiv.org/abs/2209.02970 --- layout: model title: English BertForQuestionAnswering Cased model (from clementgyj) author: John Snow Labs name: bert_qa_finetuned_squad_50k date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad-50k` is an English model originally trained by `clementgyj`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_finetuned_squad_50k_en_4.0.0_3.0_1657186899398.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_finetuned_squad_50k_en_4.0.0_3.0_1657186899398.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_finetuned_squad_50k","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_finetuned_squad_50k","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_finetuned_squad_50k| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/clementgyj/bert-finetuned-squad-50k --- layout: model title: French CamemBert Embeddings (from fjluque) author: John Snow Labs name: camembert_embeddings_fjluque_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `fjluque`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_fjluque_generic_model_fr_3.4.4_3.0_1653988580244.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_fjluque_generic_model_fr_3.4.4_3.0_1653988580244.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_fjluque_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_fjluque_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_fjluque_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/fjluque/dummy-model --- layout: model title: English asr_Urdu_repo TFWav2Vec2ForCTC from bilalahmed15 author: John Snow Labs name: asr_Urdu_repo date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Urdu_repo` is an English model originally trained by bilalahmed15. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_Urdu_repo_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Urdu_repo_en_4.2.0_3.0_1664107184096.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Urdu_repo_en_4.2.0_3.0_1664107184096.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_Urdu_repo", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_Urdu_repo", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
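Wav2Vec2ForCTC predicts one character distribution per audio frame, and the transcript in the `text` column comes from greedy CTC decoding: collapse consecutive repeated labels, then drop the blank symbol. A toy pure-Python sketch of that decoding step (the frame labels below are illustrative, not actual model output):

```python
def ctc_greedy_decode(frame_labels, blank="_"):
    """Collapse consecutive duplicate labels, then drop blanks —
    the greedy decoding rule used with CTC-trained acoustic models."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return "".join(out)

# 13 illustrative frame-level argmax labels ("_" is the CTC blank)
frames = list("hh_e_ll_llo__")
print(ctc_greedy_decode(frames))  # "hello"
```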
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_Urdu_repo| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Finnish BERT Sentence Embeddings (Base Cased) author: John Snow Labs name: sent_bert_finnish_cased date: 2020-08-31 task: Embeddings language: fi edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, fi] supported: true deprecated: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A version of Google's BERT deep transfer learning model for Finnish. The model can be fine-tuned to achieve state-of-the-art results for various Finnish natural language processing tasks. `FinBERT` features a custom 50,000 wordpiece vocabulary that has much better coverage of Finnish words. `FinBERT` has been pre-trained for 1 million steps on over 3 billion tokens (24B characters) of Finnish text drawn from news, online discussion, and internet crawls. By contrast, Multilingual BERT was trained on Wikipedia texts, where the Finnish Wikipedia text is approximately 3% of the amount used to train `FinBERT`. These features allow `FinBERT` to outperform not only Multilingual BERT but also all previously proposed models when fine-tuned for Finnish natural language processing tasks. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_finnish_cased_fi_2.6.0_2.4_1598897560014.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_finnish_cased_fi_2.6.0_2.4_1598897560014.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_bert_finnish_cased", "fi") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['Vihaan syöpää'], ['antibiootit eivät ole kipulääkkeitä']], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_bert_finnish_cased", "fi") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("Vihaan syöpää","antibiootit eivät ole kipulääkkeitä").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Vihaan syöpää","antibiootit eivät ole kipulääkkeitä"] embeddings_df = nlu.load('fi.embed_sentence.bert.cased').predict(text, output_level='sentence') embeddings_df ```
{:.h2_title} ## Results ```bash sentence fi_embed_sentence_bert_cased_embeddings Vihaan syöpää [-0.32807931303977966, -0.18222537636756897, 0... antibiootit eivät ole kipulääkkeitä [-0.192955881357193, -0.11151257902383804, 0.7... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_bert_finnish_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[fi]| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from https://github.com/TurkuNLP/FinBERT --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from SEISHIN) author: John Snow Labs name: distilbert_qa_seishin_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `SEISHIN`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_seishin_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769121878.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_seishin_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769121878.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_seishin_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_seishin_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
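Extractive QA models like this one score every token of the context as a potential answer start and answer end, and the predicted answer is the highest-scoring valid span. The sketch below illustrates that span-selection step with toy scores; it is a simplified illustration, not the Spark NLP internals:

```python
def best_span(start_scores, end_scores, max_len=15):
    # Pick (start, end) maximizing start_scores[s] + end_scores[e]
    # subject to s <= e and a maximum span length.
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 2.5, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]
end   = [0.0, 0.1, 0.0, 2.0, 0.1, 0.0, 0.0, 0.0, 0.4, 0.1]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # prints "Clara"
```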
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_seishin_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/SEISHIN/distilbert-base-uncased-finetuned-squad --- layout: model title: Translate English to Bislama Pipeline author: John Snow Labs name: translate_en_bi date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, bi, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `bi` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_bi_xx_2.7.0_2.4_1609698734459.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_bi_xx_2.7.0_2.4_1609698734459.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_bi", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_bi", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.bi').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_bi| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Extract Demographic Entities from Social Determinants of Health Texts author: John Snow Labs name: ner_sdoh_demographics_wip date: 2023-02-10 tags: [licensed, clinical, social_determinants, en, ner, demographics, sdoh, public_health] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.8 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts demographic information related to Social Determinants of Health from various kinds of biomedical documents. ## Predicted Entities `Family_Member`, `Age`, `Gender`, `Geographic_Entity`, `Race_Ethnicity`, `Language`, `Spiritual_Beliefs` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/SOCIAL_DETERMINANT_NER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/SOCIAL_DETERMINANT_NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_demographics_wip_en_4.2.8_3.0_1675998706136.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_demographics_wip_en_4.2.8_3.0_1675998706136.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from pyspark.sql.types import StringType document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_sdoh_demographics_wip", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter ]) sample_texts = ["SOCIAL HISTORY: He is a former tailor from Korea.", "He lives alone,single and no children.", "Pt is a 61 years old married, Caucasian, Catholic woman. Pt speaks English reasonably well."] data = spark.createDataFrame(sample_texts, StringType()).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_sdoh_demographics_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter )) val data = Seq("Pt is a 61 years old married, Caucasian, Catholic woman. Pt speaks English reasonably well.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
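The `NerConverterInternal` stage merges token-level IOB tags into entity chunks. Its grouping logic can be sketched as follows (hypothetical tags, simplified — the real annotator also carries character-level begin/end offsets):

```python
def iob_to_chunks(tokens, tags):
    # Group consecutive B-/I- tags of the same entity into (chunk, label) pairs.
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Pt", "is", "a", "61", "years", "old", "woman"]
tags = ["O", "O", "O", "B-Age", "I-Age", "I-Age", "B-Gender"]
print(iob_to_chunks(tokens, tags))  # [('61 years old', 'Age'), ('woman', 'Gender')]
```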
## Results ```bash +-----------------+-----+---+------------+ |ner_label |begin|end|chunk | +-----------------+-----+---+------------+ |Gender |16 |17 |He | |Geographic_Entity|43 |47 |Korea | |Gender |0 |1 |He | |Family_Member |29 |36 |children | |Age |8 |19 |61 years old| |Race_Ethnicity |30 |38 |Caucasian | |Spiritual_Beliefs|41 |48 |Catholic | |Gender |50 |54 |woman | |Language |67 |73 |English | +-----------------+-----+---+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_sdoh_demographics_wip| |Compatibility:|Healthcare NLP 4.2.8+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|858.4 KB| ## Benchmarking ```bash label tp fp fn total precision recall f1 Age 1346.0 73.0 74.0 1420.0 0.948555 0.947887 0.948221 Spiritual_Beliefs 100.0 13.0 16.0 116.0 0.884956 0.862069 0.873362 Family_Member 4468.0 134.0 43.0 4511.0 0.970882 0.990468 0.980577 Race_Ethnicity 56.0 0.0 13.0 69.0 1.000000 0.811594 0.896000 Gender 9825.0 67.0 247.0 10072.0 0.993227 0.975477 0.984272 Geographic_Entity 225.0 9.0 29.0 254.0 0.961538 0.885827 0.922131 Language 51.0 9.0 5.0 56.0 0.850000 0.910714 0.879310 ``` --- layout: model title: Explain Document ML Pipeline for English author: John Snow Labs name: explain_document_ml date: 2022-06-24 tags: [open_source, english, explain_document_ml, pipeline, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_ml is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps and recognizes entities.
It performs most of the common text processing tasks on your dataframe. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_ml_en_4.0.0_3.0_1656066222624.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_ml_en_4.0.0_3.0_1656066222624.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_ml', lang = 'en') annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_ml", lang = "en") val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs ! "] result_df = nlu.load('en.explain').predict(text) result_df ```
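The spell-checking stage of this pipeline (`NorvigSweetingModel`) corrects tokens like `Snwow` by generating small edits of the word and keeping candidates found in its vocabulary. A toy sketch of that idea, using a tiny hand-picked vocabulary rather than the model's shipped dictionary:

```python
import string

def edits1(word):
    # All strings one deletion, transposition, replacement, or insertion away.
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, vocab):
    # Keep known words; otherwise pick a vocabulary word one edit away.
    if word in vocab:
        return word
    candidates = edits1(word) & vocab
    return min(candidates) if candidates else word

vocab = {"hello", "from", "john", "snow", "labs"}
print(correct("snwow", vocab))  # deleting the extra "w" yields "snow"
```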
## Results ```bash | | document | sentence | token | spell | lemmas | stems | pos | |---:|:---------------------------------|:---------------------------------|:-------------------------------------------------|:------------------------------------------------|:------------------------------------------------|:-----------------------------------------------|:---------------------------------------| | 0 | ['Hello fronm John Snwow Labs!'] | ['Hello fronm John Snwow Labs!'] | ['Hello', 'fronm', 'John', 'Snwow', 'Labs', '!'] | ['Hello', 'front', 'John', 'Snow', 'Labs', '!'] | ['Hello', 'front', 'John', 'Snow', 'Labs', '!'] | ['hello', 'front', 'john', 'snow', 'lab', '!'] | ['UH', 'NN', 'NNP', 'NNP', 'NNP', '.'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_ml| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|9.6 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - NorvigSweetingModel - LemmatizerModel - Stemmer - PerceptronModel --- layout: model title: Legal Tariff Policy Document Classifier (EURLEX) author: John Snow Labs name: legclf_tariff_policy_bert date: 2023-03-06 tags: [en, legal, classification, clauses, tariff_policy, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
The legclf_tariff_policy_bert model is a BERT Sentence Embeddings Document Classifier that determines whether a given document belongs to the class Tariff_Policy or not (binary classification) according to EuroVoc labels. ## Predicted Entities `Tariff_Policy`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_tariff_policy_bert_en_1.0.0_3.0_1678111753015.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_tariff_policy_bert_en_1.0.0_3.0_1678111753015.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_tariff_policy_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
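The precision, recall, and F1 figures reported in the Benchmarking section below follow directly from true-positive, false-positive, and false-negative counts. A minimal sketch of that computation (the counts here are hypothetical):

```python
def prf1(tp, fp, fn):
    # Precision, recall, and their harmonic mean (F1), guarding against
    # division by zero when a label never occurs or is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# A hypothetical label with 87 true positives, 13 false positives,
# and 15 false negatives.
p, r, f = prf1(87, 13, 15)
print(round(p, 2), round(r, 2), round(f, 2))
```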
## Results ```bash +---------------+ |result | +---------------+ |[Tariff_Policy]| |[Other] | |[Other] | |[Tariff_Policy]| +---------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_tariff_policy_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.7 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.82 0.85 0.83 969 Tariff_Policy 0.87 0.85 0.86 1175 accuracy - - 0.85 2144 macro-avg 0.85 0.85 0.85 2144 weighted-avg 0.85 0.85 0.85 2144 ``` --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_4 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-32-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_4_en_4.0.0_3.0_1657185004862.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_4_en_4.0.0_3.0_1657185004862.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-32-finetuned-squad-seed-4 --- layout: model title: Detect Subentity PHI for Deidentification (Arabic) author: John Snow Labs name: ner_deid_subentity date: 2023-05-31 tags: [licensed, clinical, ner, deidentification, arabic, ar] task: Named Entity Recognition language: ar edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Arabic) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 17 entities. This NER model is trained with a combination of custom datasets and several data augmentation mechanisms. This model uses Word2Vec Arabic clinical embeddings.
## Predicted Entities `PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `STREET`, `USERNAME`, `SEX`, `IDNUM`, `EMAIL`, `ZIP`, `MEDICALRECORD`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_ar_4.4.2_3.0_1685559675615.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_ar_4.4.2_3.0_1685559675615.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "ar", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter]) text = ''' عالج الدكتور محمد المريض أحمد البالغ من العمر 55 سنة في 15/05/2000 في مستشفى مدينة الرباط. رقم هاتفه هو 0610948235 وبريده الإلكتروني mohamedmell@gmail.com. 
''' data = spark.createDataFrame([[text]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar") .setInputCols(Array("sentence","token")) .setOutputCol("word_embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "ar", "clinical/models") .setInputCols(Array("sentence","token","word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter)) val text = """عالج الدكتور محمد المريض أحمد البالغ من العمر 55 سنة في 15/05/2000 في مستشفى مدينة الرباط. رقم هاتفه هو 0610948235 وبريده الإلكتروني mohamedmell@gmail.com.""" val data = Seq(text).toDS.toDF("text") val results = pipeline.fit(data).transform(data) ```
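A typical next step after detecting PHI chunks is masking them in the original text. Below is a minimal sketch using inclusive character offsets like the `begin`/`end` values the NER converter produces (a hypothetical helper for illustration, not the Healthcare NLP `DeIdentification` annotator):

```python
def mask_phi(text, chunks):
    # chunks: list of (begin, end, label) with inclusive end offsets.
    # Replace right-to-left so earlier offsets stay valid.
    for begin, end, label in sorted(chunks, reverse=True):
        text = text[:begin] + "<" + label + ">" + text[end + 1:]
    return text

text = "Dr. Ahmed treated the patient on 15/05/2000."
chunks = [(4, 8, "DOCTOR"), (33, 42, "DATE")]
print(mask_phi(text, chunks))  # Dr. <DOCTOR> treated the patient on <DATE>.
```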
## Results ```bash +---------------------+---------+ |chunk |ner_label| +---------------------+---------+ |محمد |DOCTOR | |55 سنة |AGE | |15/05/2000 |DATE | |الرباط |CITY | |0610948235 |PHONE | |mohamedmell@gmail.com|EMAIL | +---------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ar| |Size:|15.0 MB| ## Benchmarking ```bash label tp fp fn total precision recall f1 PATIENT 196.0 26.0 32.0 228.0 0.8829 0.8596 0.8711 HOSPITAL 193.0 41.0 37.0 230.0 0.8248 0.8391 0.8319 DATE 877.0 14.0 8.0 885.0 0.9843 0.991 0.9876 ORGANIZATION 41.0 11.0 6.0 47.0 0.7885 0.8723 0.8283 CITY 260.0 8.0 5.0 265.0 0.9701 0.9811 0.9756 STREET 103.0 3.0 0.0 103.0 0.9717 1.0 0.9856 USERNAME 8.0 0.0 0.0 8.0 1.0 1.0 1.0 SEX 300.0 9.0 69.0 369.0 0.9709 0.813 0.885 IDNUM 13.0 1.0 0.0 13.0 0.9286 1.0 0.963 EMAIL 112.0 5.0 0.0 112.0 0.9573 1.0 0.9782 ZIP 80.0 4.0 0.0 80.0 0.9524 1.0 0.9756 MEDICALRECORD 17.0 1.0 0.0 17.0 0.9444 1.0 0.9714 PROFESSION 303.0 27.0 32.0 335.0 0.9182 0.9045 0.9113 PHONE 38.0 4.0 2.0 40.0 0.9048 0.95 0.9268 COUNTRY 158.0 10.0 8.0 166.0 0.9405 0.9518 0.9461 DOCTOR 440.0 23.0 34.0 474.0 0.9503 0.9283 0.9392 AGE 610.0 18.0 7.0 617.0 0.9713 0.9887 0.9799 macro - - - - - - 0.9386 micro - - - - - - 0.9434 ``` --- layout: model title: Word2Vec Embeddings in Minangkabau (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, min, open_source] task: Embeddings language: min edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_min_3.4.1_3.0_1647446021762.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_min_3.4.1_3.0_1647446021762.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","min") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","min") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("min.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
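Conceptually, a word-embeddings lookup annotator is a dictionary from token to vector, returning a default vector for out-of-vocabulary tokens. A toy sketch of that lookup (the table values are hypothetical; the real model is a 300-dimensional, case-insensitive lookup):

```python
def lookup_embeddings(tokens, table, dim=3):
    # Return one vector per token; OOV tokens get a zero vector,
    # and lookup is case-insensitive to mirror the model card above.
    zero = [0.0] * dim
    return [table.get(token.lower(), zero) for token in tokens]

table = {
    "spark": [0.1, 0.2, 0.3],
    "nlp": [0.4, 0.5, 0.6],
}
print(lookup_embeddings(["I", "love", "Spark", "NLP"], table))
```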
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|min| |Size:|142.8 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English RobertaForQuestionAnswering (from Teepika) author: John Snow Labs name: roberta_qa_roberta_base_squad2_finetuned_selqa date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-finetuned-selqa` is an English model originally trained by `Teepika`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2_finetuned_selqa_en_4.0.0_3.0_1655735329360.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2_finetuned_selqa_en_4.0.0_3.0_1655735329360.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_squad2_finetuned_selqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_squad2_finetuned_selqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base.by_Teepika").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_squad2_finetuned_selqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Teepika/roberta-base-squad2-finetuned-selqa --- layout: model title: English XlmRoBertaForQuestionAnswering (from teacookies) author: John Snow Labs name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265899 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-more_fine_tune_24465520-26265899` is an English model originally trained by `teacookies`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265899_en_4.0.0_3.0_1655984561078.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265899_en_4.0.0_3.0_1655984561078.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265899","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265899","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265899").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265899| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|888.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265899 --- layout: model title: Stopwords Remover for Lithuanian language (1314 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, lt, open_source] task: Stop Words Removal language: lt edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_lt_3.4.1_3.0_1646673057631.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_lt_3.4.1_3.0_1646673057631.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","lt") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Jūs nesate geresnis už mane"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stopWords = StopWordsCleaner.pretrained("stopwords_iso","lt") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stopWords)) val data = Seq("Jūs nesate geresnis už mane").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("lt.stopwords").predict("""Jūs nesate geresnis už mane""") ```
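The pretrained `StopWordsCleaner` above simply drops every token that appears in its stopword list. A pure-Python sketch of that filtering logic, using an illustrative three-word subset of the full 1314-entry Lithuanian list (the real annotator also exposes a case-sensitivity switch via `setCaseSensitive`):

```python
# Illustrative subset of the Lithuanian stopword list; the pretrained
# model ships the full 1314-entry list from stopwords-iso.
stopwords = {"jūs", "už", "mane"}

tokens = ["Jūs", "nesate", "geresnis", "už", "mane"]

# Drop every token found in the stopword list (case-insensitive match)
clean_tokens = [t for t in tokens if t.lower() not in stopwords]

print(clean_tokens)  # ['nesate', 'geresnis']
```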
## Results ```bash +------------------+ |result | +------------------+ |[nesate, geresnis]| +------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|lt| |Size:|4.8 KB| --- layout: model title: English RobertaForQuestionAnswering (from billfrench) author: John Snow Labs name: roberta_qa_cyberlandr_door date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cyberlandr-door` is an English model originally trained by `billfrench`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_cyberlandr_door_en_4.0.0_3.0_1655728103928.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_cyberlandr_door_en_4.0.0_3.0_1655728103928.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_cyberlandr_door","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_cyberlandr_door","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.by_billfrench").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_cyberlandr_door| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|413.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/billfrench/cyberlandr-door --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from arunkumar629) author: John Snow Labs name: distilbert_qa_arunkumar629_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `arunkumar629`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_arunkumar629_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769951769.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_arunkumar629_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769951769.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_arunkumar629_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_arunkumar629_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_arunkumar629_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/arunkumar629/distilbert-base-uncased-finetuned-squad --- layout: model title: English asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you TFWav2Vec2ForCTC from project2you author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you` is an English model originally trained by project2you. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you_en_4.2.0_3.0_1664110153624.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you_en_4.2.0_3.0_1664110153624.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you", lang = "en") val annotations = pipeline.transform(audioDF) ```
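`pipeline.transform` above expects `audioDF` to carry raw audio as an array of floats, typically 16 kHz mono for Wav2Vec2 models. The sketch below synthesizes one second of a 440 Hz tone with the standard library as a stand-in for real speech; in practice the floats would come from a WAV file (e.g. via `librosa` or `soundfile`), and the commented `createDataFrame` call assumes an active Spark session:

```python
import math

SAMPLE_RATE = 16000  # Hz; the rate Wav2Vec2 models are trained on
FREQ = 440.0         # Hz; a stand-in tone instead of real speech

# One second of a sine wave as plain Python floats in [-1.0, 1.0]
samples = [math.sin(2 * math.pi * FREQ * t / SAMPLE_RATE)
           for t in range(SAMPLE_RATE)]

# With a running Spark session this becomes the expected input DataFrame:
# audioDF = spark.createDataFrame([(samples,)], ["audio_content"])
```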
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering model (from Intel) author: John Snow Labs name: bert_qa_bert_large_uncased_squadv1.1_sparse_80_1x4_block_pruneofa date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-squadv1.1-sparse-80-1x4-block-pruneofa` is an English model originally trained by `Intel`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_squadv1.1_sparse_80_1x4_block_pruneofa_en_4.0.0_3.0_1654536766214.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_squadv1.1_sparse_80_1x4_block_pruneofa_en_4.0.0_3.0_1654536766214.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_squadv1.1_sparse_80_1x4_block_pruneofa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_large_uncased_squadv1.1_sparse_80_1x4_block_pruneofa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.large_uncased_sparse_80_1x4_block_pruneofa.by_Intel").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_large_uncased_squadv1.1_sparse_80_1x4_block_pruneofa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|437.9 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Intel/bert-large-uncased-squadv1.1-sparse-80-1x4-block-pruneofa - https://arxiv.org/abs/2111.05754 - https://github.com/IntelLabs/Model-Compression-Research-Package/tree/main/research/prune-once-for-all --- layout: model title: Detect Founding / Listing dates in texts (small) author: John Snow Labs name: finner_wiki_founding_dates date: 2023-01-15 tags: [listing, founding, establishment, dates, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an NER model aimed at detecting Establishment (Founding) and Listing dates of companies. It was trained on Wikipedia texts about companies. ## Predicted Entities `FOUNDING_DATE`, `LISTING_DATE`, `O` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_wiki_founding_dates_en_1.0.0_3.0_1673798045941.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_wiki_founding_dates_en_1.0.0_3.0_1673798045941.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python text = "The Toro Company, formerly known as the Toro Motor Company, is an American company founded in 1980. It was listed on the NASDAQ Global Market in August 2000. It design and operates lawn mowers and snow blowers and irrigation system supplies." df = spark.createDataFrame([[text]]).toDF("text") documenter = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencizer = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512) chunks = finance.NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") ner = finance.NerModel().pretrained("finner_wiki_founding_dates", "en", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") pipe = nlp.Pipeline(stages=[documenter, sentencizer, tokenizer, embeddings, ner, chunks]) model = pipe.fit(df) res = model.transform(df) res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.begin, res.ner_chunk.end, res.ner_chunk.metadata)).alias("cols")) \ .select(F.expr("cols['3']['sentence']").alias("sentence_id"), F.expr("cols['0']").alias("chunk"), F.expr("cols['2']").alias("end"), F.expr("cols['3']['entity']").alias("ner_label"))\ .filter("ner_label!='O'")\ .show(truncate=False) ```
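The precision, recall, and F1 figures reported under Benchmarking below follow directly from the per-label true-positive/false-positive/false-negative counts. A quick check using the `B-LISTING_DATE` row (tp=10, fp=0, fn=4):

```python
tp, fp, fn = 10, 0, 4  # B-LISTING_DATE counts from the Benchmarking table

precision = tp / (tp + fp)                          # 10 / 10
recall = tp / (tp + fn)                             # 10 / 14
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, round(recall, 8), round(f1, 7))
# 1.0 0.71428571 0.8333333
```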
## Results ```bash +-----------+-----------+---+-------------+ |sentence_id|chunk |end|ner_label | +-----------+-----------+---+-------------+ |0 |1980 |97 |FOUNDING_DATE| |1 |August 2000|155|LISTING_DATE | +-----------+-----------+---+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_wiki_founding_dates| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|1.1 MB| ## References Wikipedia ## Benchmarking ```bash label tp fp fn prec rec f1 B-LISTING_DATE 10 0 4 1.0 0.71428573 0.8333334 B-FOUNDING_DATE 18 3 2 0.85714287 0.9 0.87804884 I-LISTING_DATE 8 0 1 1.0 0.8888889 0.94117653 Macro-average 36 4 9 0.9 0.8 0.8470588 Micro-average 36 4 9 0.9 0.8 0.8470588 ``` --- layout: model title: Document Visual Question Answering with DONUT author: John Snow Labs name: docvqa_donut_base date: 2023-01-17 tags: [en, licensed] task: Document Visual Question Answering language: en nav_key: models edition: Visual NLP 4.3.0 spark_version: 3.2.1 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Document understanding transformer (Donut) model pretrained for the Document Visual Question Answering (DocVQA) task. The dataset is from the Document Visual Question Answering [competition](https://rrc.cvc.uab.es/?ch=17) and consists of 50K questions defined on more than 12K documents. Donut is a new method of document understanding that utilizes an OCR-free end-to-end Transformer model. Donut does not require off-the-shelf OCR engines/APIs, yet it shows state-of-the-art performances on various visual document understanding tasks, such as visual document classification or information extraction (a.k.a. document parsing).
The paper, [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664), was written by Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. DocVQA seeks to inspire a “purpose-driven” point of view in Document Analysis and Recognition research, where the document content is extracted and used to respond to high-level tasks defined by the human consumers of this information. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/ocr/VISUAL_QUESTION_ANSWERING/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/Cards/SparkOcrVisualQuestionAnswering.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/docvqa_donut_base_en_4.3.0_3.0_1673269990044.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/docvqa_donut_base_en_4.3.0_3.0_1673269990044.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python binary_to_image = BinaryToImage()\ .setInputCol("content") \ .setOutputCol("image") \ .setImageType(ImageType.TYPE_3BYTE_BGR) visual_question_answering = VisualQuestionAnswering()\ .pretrained("docvqa_donut_base", "en", "clinical/ocr")\ .setInputCol(["image"])\ .setOutputCol("answers")\ .setQuestionsCol("questions") # OCR pipeline pipeline = PipelineModel(stages=[ binary_to_image, visual_question_answering ]) test_image_path = pkg_resources.resource_filename('sparkocr', 'resources/ocr/vqa/agenda.png') bin_df = spark.read.format("binaryFile").load(test_image_path) questions = [["When it finish the Coffee Break?", "Who is giving the Introductory Remarks?", "Who is going to take part of the individual interviews?"]] questions_df = spark.createDataFrame([questions]) questions_df = questions_df.withColumnRenamed("_1", "questions") image_and_questions = bin_df.join(questions_df) results = pipeline.transform(image_and_questions).cache() results.select(results.answers).show(truncate=False) ``` ```scala val binary_to_image = new BinaryToImage() .setInputCol("content") .setOutputCol("image") .setImageType(ImageType.TYPE_3BYTE_BGR) val visual_question_answering = VisualQuestionAnswering() .pretrained("docvqa_donut_base", "en", "clinical/ocr") .setInputCol(Array("image")) .setOutputCol("answers") .setQuestionsCol("questions") // OCR pipeline val pipeline = new PipelineModel().setStages(Array( binary_to_image, visual_question_answering)) val test_image_path = "resources/ocr/vqa/agenda.png" // sample image shipped with the sparkocr Python package val bin_df = spark.read.format("binaryFile").load(test_image_path) val questions = Array("When it finish the Coffee Break?", "Who is giving the Introductory Remarks?", "Who is going to take part of the individual interviews?") val questions_df = Seq(Tuple1(questions.toSeq)).toDF("questions") val image_and_questions = bin_df.join(questions_df) val results = pipeline.transform(image_and_questions).cache() results.select("answers").show(false) ```
## Example ### Input: ```bash +-----------------------------------------------------------------------------------------------------------------------------------+ |questions | +-----------------------------------------------------------------------------------------------------------------------------------+ |[When it finish the Coffee Break?, Who is giving the Introductory Remarks?, Who is going to take part of the individual interviews?]| +-----------------------------------------------------------------------------------------------------------------------------------+ ``` ![Screenshot](/assets/images/examples_ocr/image12.png) ### Output: ```bash +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |answers | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |[When it finish the Coffee Break? -> 11:44 to 11:39 a.m., Who is giving the Introductory Remarks? -> lee a. waller, trrf vice presi- dent, Who is going to take part of the individual interviews? -> trrf treasurer]| +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` ## Model Information {:.table-model} |---|---| |Model Name:|docvqa_donut_base| |Type:|ocr| |Compatibility:|Visual NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| --- layout: model title: ELECTRA Sentence Embeddings (ELECTRA Base) author: John Snow Labs name: sent_electra_base_uncased date: 2020-08-27 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description ELECTRA is a BERT-like model that is pre-trained as a discriminator in a set-up resembling a generative adversarial network (GAN). It was originally published by Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning: [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/forum?id=r1xMH1BtvB), ICLR 2020. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_electra_base_uncased_en_2.6.0_2.4_1598489784655.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_electra_base_uncased_en_2.6.0_2.4_1598489784655.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_electra_base_uncased", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer'], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_electra_base_uncased", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.electra_base_uncased').predict(text, output_level='sentence') embeddings_df ```
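Sentence vectors like the 768-dimensional ones this model writes to `sentence_embeddings` are usually compared with cosine similarity. A pure-Python sketch on short stand-in vectors (real ones would be pulled out of the result DataFrame):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Stand-in 4-dimensional vectors; the model's real output has 768 dimensions
v1 = [0.18, -0.20, 0.25, 0.11]
v2 = [0.17, -0.21, 0.23, 0.14]

similarity = cosine_similarity(v1, v2)
print(round(similarity, 2))  # close to 1.0 for near-identical vectors
```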
{:.h2_title} ## Results ```bash sentence en_embed_sentence_electra_base_uncased_embeddings I hate cancer [0.18555310368537903, -0.1990899294614792, 0.2... Antibiotics aren't painkiller [-0.23764970898628235, -0.21351191401481628, -... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_electra_base_uncased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/google/electra_base/2 --- layout: model title: English image_classifier_vit_base_xray_pneumonia ViTForImageClassification from nickmuchi author: John Snow Labs name: image_classifier_vit_base_xray_pneumonia date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_xray_pneumonia` is an English model originally trained by nickmuchi. ## Predicted Entities `NORMAL`, `PNEUMONIA` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_xray_pneumonia_en_4.1.0_3.0_1660170972982.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_xray_pneumonia_en_4.1.0_3.0_1660170972982.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_xray_pneumonia", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_xray_pneumonia", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_xray_pneumonia| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English DistilBertForQuestionAnswering model (from Plimpton) author: John Snow Labs name: distilbert_qa_Plimpton_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Plimpton`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Plimpton_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724353175.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Plimpton_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724353175.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Plimpton_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Plimpton_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Plimpton").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_Plimpton_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Plimpton/distilbert-base-uncased-finetuned-squad --- layout: model title: XLM-RoBERTa 40-Language NER Pipeline author: John Snow Labs name: xlm_roberta_token_classifier_ner_40_lang_pipeline date: 2022-06-27 tags: [open_source, ner, token_classifier, xlm_roberta, multilang, "40", xx] task: Named Entity Recognition language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [xlm_roberta_token_classifier_ner_40_lang](https://nlp.johnsnowlabs.com/2021/09/28/xlm_roberta_token_classifier_ner_40_lang_xx.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_token_classifier_ner_40_lang_pipeline_xx_4.0.0_3.0_1656370754079.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_token_classifier_ner_40_lang_pipeline_xx_4.0.0_3.0_1656370754079.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("xlm_roberta_token_classifier_ner_40_lang_pipeline", lang = "xx") pipeline.annotate(["My name is John and I work at John Snow Labs.", "انا اسمي احمد واعمل في ارامكو"]) ``` ```scala val pipeline = new PretrainedPipeline("xlm_roberta_token_classifier_ner_40_lang_pipeline", lang = "xx") pipeline.annotate(Array("My name is John and I work at John Snow Labs.", "انا اسمي احمد واعمل في ارامكو")) ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PER | |John Snow Labs|ORG | |احمد |PER | |ارامكو |ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_token_classifier_ner_40_lang_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|xx| |Size:|967.7 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - XlmRoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: Pipeline to Detect Clinical Entities (ner_jsl_enriched) author: John Snow Labs name: ner_jsl_enriched_pipeline date: 2023-03-14 tags: [ner, licensed, clinical, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_jsl_enriched](https://nlp.johnsnowlabs.com/2021/10/22/ner_jsl_enriched_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_pipeline_en_4.3.0_3.2_1678779376891.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_pipeline_en_4.3.0_3.2_1678779376891.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_jsl_enriched_pipeline", "en", "clinical/models") text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_jsl_enriched_pipeline", "en", "clinical/models") val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. 
The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl_enriched.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:------------------------------------------|--------:|------:|:-----------------------------|-------------:| | 0 | 21-day-old | 17 | 26 | Age | 0.9993 | | 1 | Caucasian | 28 | 36 | Race_Ethnicity | 0.9993 | | 2 | male | 38 | 41 | Gender | 0.999 | | 3 | 2 days | 52 | 57 | Duration | 0.8576 | | 4 | congestion | 62 | 71 | Symptom | 0.9892 | | 5 | mom | 75 | 77 | Gender | 0.9877 | | 6 | suctioning yellow discharge | 88 | 114 | Symptom | 0.2232 | | 7 | nares | 135 | 139 | External_body_part_or_region | 0.87 | | 8 | she | 147 | 149 | Gender | 0.9965 | | 9 | mild | 168 | 171 | Modifier | 0.6063 | | 10 | problems with his breathing while feeding | 173 | 213 | Symptom | 0.610967 | | 11 | perioral cyanosis | 237 | 253 | Symptom | 0.5396 | | 12 | retractions | 258 | 268 | Symptom | 0.9941 | | 13 | One day ago | 272 | 282 | RelativeDate | 0.870133 | | 14 | mom | 285 | 287 | Gender | 0.9974 | | 15 | tactile temperature | 304 | 322 | Symptom | 0.43565 | | 16 | Tylenol | 345 | 351 | Drug_BrandName | 0.9926 | | 17 | Baby | 354 | 357 | Age | 0.9976 | | 18 | decreased p.o. intake | 377 | 397 | Symptom | 0.5397 | | 19 | His | 400 | 402 | Gender | 0.9998 | | 20 | 20 minutes q.2h. 
to 5 to 10 minutes | 439 | 473 | Duration | 0.3732 | | 21 | his | 488 | 490 | Gender | 0.9461 | | 22 | respiratory congestion | 492 | 513 | Symptom | 0.5958 | | 23 | He | 516 | 517 | Gender | 0.9998 | | 24 | tired | 550 | 554 | Symptom | 0.9595 | | 25 | fussy | 569 | 573 | Symptom | 0.8263 | | 26 | over the past 2 days | 575 | 594 | RelativeDate | 0.49826 | | 27 | albuterol | 637 | 645 | Drug_Ingredient | 0.993 | | 28 | ER | 671 | 672 | Clinical_Dept | 0.998 | | 29 | His | 675 | 677 | Gender | 0.9998 | | 30 | urine output has also decreased | 679 | 709 | Symptom | 0.26296 | | 31 | he | 721 | 722 | Gender | 0.9924 | | 32 | per 24 hours | 760 | 771 | Frequency | 0.4958 | | 33 | he | 778 | 779 | Gender | 0.9951 | | 34 | per 24 hours | 807 | 818 | Frequency | 0.484933 | | 35 | Mom | 821 | 823 | Gender | 0.999 | | 36 | diarrhea | 836 | 843 | Symptom | 0.9995 | | 37 | His | 846 | 848 | Gender | 0.9998 | | 38 | bowel | 850 | 854 | Internal_organ_or_component | 0.9675 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_enriched_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Swedish BertForQuestionAnswering model (from KB) author: John Snow Labs name: bert_qa_bert_base_swedish_cased_squad_experimental date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: sv edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`bert-base-swedish-cased-squad-experimental` is a Swedish model originally trained by `KB`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_swedish_cased_squad_experimental_sv_4.0.0_3.0_1654180645183.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_swedish_cased_squad_experimental_sv_4.0.0_3.0_1654180645183.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_swedish_cased_squad_experimental","sv") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_swedish_cased_squad_experimental","sv") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("sv.answer_question.squad.bert.base_cased.by_KB").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
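The nlu one-liner above packs the question and the context into a single string separated by `|||`. A tiny helper (hypothetical, not part of the nlu package) makes that format explicit:

```python
# Hypothetical helper: build the "question|||context" string that the
# nlu question-answering one-liners in these examples consume.
def to_nlu_qa(question: str, context: str) -> str:
    return f"{question}|||{context}"

print(to_nlu_qa("What's my name?", "My name is Clara and I live in Berkeley."))
```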
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_swedish_cased_squad_experimental| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|sv| |Size:|465.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/KB/bert-base-swedish-cased-squad-experimental --- layout: model title: Language Detection & Identification Pipeline - 220 Languages author: John Snow Labs name: detect_language_220 date: 2020-12-05 task: [Pipeline Public, Language Detection, Sentence Detection] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [language_detection, open_source, pipeline, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Language detection and identification is the task of automatically detecting the language(s) present in a document based on its content. ``LanguageDetectorDL`` is an annotator that detects the language of documents or sentences, depending on the ``inputCols``. In addition, ``LanguageDetectorDL`` can accurately detect the language of documents with mixed languages by coalescing sentences and selecting the best candidate. We have designed and developed Deep Learning models using CNN architectures in TensorFlow/Keras. The models are trained on large datasets such as Wikipedia and Tatoeba, and evaluated with high accuracy on the Europarl dataset. The output is a language code in Wiki Code style: [https://en.wikipedia.org/wiki/List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias).
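The sentence-coalescing idea can be illustrated with a toy sketch in plain Python (this is not the annotator's actual implementation, and the probability values are made up for illustration): average the per-sentence language probability vectors and pick the highest-scoring language for the whole document.

```python
# Toy illustration of coalescing: average per-sentence probabilities
# and select the best document-level candidate.
def coalesce_language(sentence_probs):
    """sentence_probs: list of dicts mapping language code -> probability."""
    totals = {}
    for probs in sentence_probs:
        for lang, p in probs.items():
            totals[lang] = totals.get(lang, 0.0) + p
    n = len(sentence_probs)
    avg = {lang: s / n for lang, s in totals.items()}
    return max(avg, key=avg.get)

doc = [
    {"en": 0.90, "fr": 0.10},  # English sentence
    {"en": 0.20, "fr": 0.80},  # French sentence
    {"en": 0.85, "fr": 0.15},  # English sentence
]
print(coalesce_language(doc))  # -> en
```

Even with one strongly French sentence, the averaged scores still favor English for the document as a whole, which is the behavior the description refers to for mixed-language documents.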
This pipeline can detect the following languages: ## Predicted Entities `Achinese`, `Afrikaans`, `Tosk Albanian`, `Amharic`, `Aragonese`, `Old English`, `Arabic`, `Egyptian Arabic`, `Assamese`, `Asturian`, `Avaric`, `Aymara`, `Azerbaijani`, `South Azerbaijani`, `Bashkir`, `Bavarian`, `bat-smg`, `Central Bikol`, `Belarusian`, `Bulgarian`, `bh`, `Bengali`, `Tibetan`, `Bishnupriya`, `Breton`, `Russia Buriat`, `Catalan`, `Min Dong Chinese`, `Chechen`, `Cebuano`, `Central Kurdish (Soranî)`, `Corsican`, `Crimean Tatar`, `Czech`, `Kashubian`, `Chuvash`, `Welsh`, `Danish`, `German`, `Dimli (individual language)`, `Lower Sorbian`, `Dhivehi`, `Greek`, `eml`, `English`, `Esperanto`, `Spanish`, `Estonian`, `Basque`, `Extremaduran`, `Persian`, `Finnish`, `fiu-vro`, `Faroese`, `French`, `Arpitan`, `Friulian`, `Frisian`, `Irish`, `Gagauz`, `Scottish Gaelic`, `Galician`, `Guarani`, `Konkani (Goan)`, `Gujarati`, `Manx`, `Hausa`, `Hakka Chinese`, `Hebrew`, `Hindi`, `Fiji Hindi`, `Upper Sorbian`, `Haitian Creole`, `Hungarian`, `Armenian`, `Interlingua`, `Indonesian`, `Interlingue`, `Igbo`, `Ilocano`, `Ido`, `Icelandic`, `Italian`, `Japanese`, `Jamaican Patois`, `Lojban`, `Javanese`, `Georgian`, `Karakalpak`, `Kabyle`, `Kabardian`, `Kazakh`, `Khmer`, `Kannada`, `Korean`, `Komi-Permyak`, `Karachay-Balkar`, `Kölsch`, `Kurdish`, `Komi`, `Cornish`, `Kyrgyz`, `Latin`, `Ladino`, `Luxembourgish`, `Lezghian`, `Luganda`, `Limburgan`, `Ligurian`, `Lombard`, `Lingala`, `Lao`, `Northern Luri`, `Lithuanian`, `Latvian`, `Maithili`, `map-bms`, `Malagasy`, `Meadow Mari`, `Maori`, `Minangkabau`, `Macedonian`, `Malayalam`, `Mongolian`, `Marathi`, `Hill Mari`, `Maltese`, `Mirandese`, `Burmese`, `Erzya`, `Mazanderani`, `Nahuatl`, `Neapolitan`, `Low German (Low Saxon)`, `nds-nl`, `Nepali`, `Newari`, `Dutch`, `Norwegian Nynorsk`, `Norwegian`, `Narom`, `Pedi`, `Navajo`, `Occitan`, `Livvi`, `Oromo`, `Odia (Oriya)`, `Ossetian`, `Punjabi (Eastern)`, `Pangasinan`, `Kapampangan`, `Papiamento`, `Picard`, 
`Palatine German`, `Polish`, `Punjabi (Western)`, `Pashto`, `Portuguese`, `Quechua`, `Romansh`, `Romanian`, `roa-tara`, `Russian`, `Rusyn`, `Kinyarwanda`, `Sanskrit`, `Yakut`, `Sardinian`, `Sicilian`, `Scots`, `Sindhi`, `Northern Sami`, `Sinhala`, `Slovak`, `Slovenian`, `Shona`, `Somali`, `Albanian`, `Serbian`, `Saterland Frisian`, `Sundanese`, `Swedish`, `Swahili`, `Silesian`, `Tamil`, `Tulu`, `Telugu`, `Tetun`, `Tajik`, `Thai`, `Turkmen`, `Tagalog`, `Setswana`, `Tongan`, `Turkish`, `Tatar`, `Tuvinian`, `Udmurt`, `Uyghur`, `Ukrainian`, `Urdu`, `Uzbek`, `Venetian`, `Veps`, `Vietnamese`, `Vlaams`, `Volapük`, `Walloon`, `Waray`, `Wolof`, `Shanghainese`, `Xhosa`, `Mingrelian`, `Yiddish`, `Yoruba`, `Zeeuws`, `Chinese`, `zh-classical`, `zh-min-nan`, `zh-yue`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/LANGUAGE_DETECTOR/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/detect_language_220_xx_2.7.0_2.4_1607185721383.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/detect_language_220_xx_2.7.0_2.4_1607185721383.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("detect_language_220", lang = "xx") pipeline.annotate("French author who helped pioneer the science-fiction genre.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("detect_language_220", lang = "xx") pipeline.annotate("French author who helped pioneer the science-fiction genre.") ``` {:.nlu-block} ```python import nlu text = ["French author who helped pioneer the science-fiction genre."] lang_df = nlu.load("xx.classify.lang.220").predict(text) lang_df ```
## Results ```bash {'document': ['French author who helped pioneer the science-fiction genre.'], 'language': ['en']} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|detect_language_220| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - LanguageDetectorDL --- layout: model title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011) author: John Snow Labs name: distilbert_token_classifier_autotrain_name_all_904029577 date: 2023-03-14 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-name_all-904029577` is an English model originally trained by `ismail-lucifer011`. ## Predicted Entities `Name`, `OOV` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_all_904029577_en_4.3.1_3.0_1678783486467.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_all_904029577_en_4.3.1_3.0_1678783486467.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_all_904029577","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_all_904029577","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_name_all_904029577| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ismail-lucifer011/autotrain-name_all-904029577 --- layout: model title: English DistilBertForQuestionAnswering model (from aszidon) Custom author: John Snow Labs name: distilbert_qa_custom date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbertcustom` is an English model originally trained by `aszidon`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom_en_4.0.0_3.0_1654727944431.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom_en_4.0.0_3.0_1654727944431.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.custom.by_aszidon").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_custom| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/aszidon/distilbertcustom --- layout: model title: Legal Time Clause Binary Classifier author: John Snow Labs name: legclf_time_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `time` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend you split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
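A minimal paragraph-splitting sketch in plain Python, along the lines of the multiline technique above (the helper name and the fixed-window fallback are illustrative, not part of Legal NLP): split on blank lines, then window any paragraph that would still exceed the ~512-token embedding limit.

```python
import re

# Illustrative splitter: break a long legal document into paragraph-sized
# chunks that each stay within a token budget (e.g. the model's 512 tokens).
def split_paragraphs(text, max_tokens=512):
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    for p in paragraphs:
        tokens = p.split()
        # Fall back to fixed-size windows if a paragraph is still too long.
        for i in range(0, len(tokens), max_tokens):
            chunks.append(" ".join(tokens[i:i + max_tokens]))
    return chunks

doc = ("Term.\n\nThis Agreement shall remain in force for five (5) years.\n\n"
       "Governing Law.\n\nThis Agreement is governed by the laws of Delaware.")
for chunk in split_paragraphs(doc):
    print(chunk)
```

Each resulting chunk can then be fed as a separate row to the classifier pipeline shown below.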
## Predicted Entities `other`, `time` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_time_clause_en_1.0.0_3.2_1660124089546.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_time_clause_en_1.0.0_3.2_1660124089546.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_time_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[time]| |[other]| |[other]| |[time]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_time_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.98 0.99 0.99 304 time 0.98 0.96 0.97 150 accuracy - - 0.98 454 macro-avg 0.98 0.98 0.98 454 weighted-avg 0.98 0.98 0.98 454 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Cased model (from SauravMaheshkar) author: John Snow Labs name: distilbert_qa_base_cased_led_chaii date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-chaii` is an English model originally trained by `SauravMaheshkar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_chaii_en_4.3.0_3.0_1672766429348.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_chaii_en_4.3.0_3.0_1672766429348.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_chaii","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_chaii","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_cased_led_chaii| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/SauravMaheshkar/distilbert-base-cased-distilled-chaii --- layout: model title: Sentence Entity Resolver for Snomed Concepts, Body Structure Version author: John Snow Labs name: sbertresolve_snomed_bodyStructure_med date: 2021-07-08 tags: [snomed, en, entity_resolution, licensed] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical (anatomical structures) entities to Snomed codes (body structure version) using sentence embeddings. ## Predicted Entities Snomed Codes and their normalized definition with `sbert_jsl_medium_uncased ` embeddings. {:.btn-box} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_snomed_bodyStructure_med_en_3.1.0_2.4_1625772026635.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_snomed_bodyStructure_med_en_3.1.0_2.4_1625772026635.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") jsl_sbert_embedder = BertSentenceEmbeddings\ .pretrained('sbert_jsl_medium_uncased','en','clinical/models')\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") snomed_resolver = SentenceEntityResolverModel\ .pretrained("sbertresolve_snomed_bodyStructure_med", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("snomed_code") snomed_pipelineModel = PipelineModel( stages = [ documentAssembler, jsl_sbert_embedder, snomed_resolver]) snomed_lp = LightPipeline(snomed_pipelineModel) result = snomed_lp.fullAnnotate("Amputation stump") ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbert_jsl_medium_uncased","en","clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sbert_embeddings") val snomed_resolver = SentenceEntityResolverModel .pretrained("sbertresolve_snomed_bodyStructure_med", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("snomed_code") val snomed_pipeline = new Pipeline().setStages(Array(document_assembler, sbert_embedder, snomed_resolver)) val snomed_pipelineModel = snomed_pipeline.fit(Seq("").toDF("text")) val snomed_lp = new LightPipeline(snomed_pipelineModel) val result = snomed_lp.fullAnnotate("Amputation stump") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.snomed_body_structure_med").predict("""Amputation stump""") ```
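For intuition, a sentence-embedding resolver of this kind performs nearest-neighbour search: the embedding of the extracted chunk is compared against precomputed embeddings of every candidate code, and codes are ranked by distance. A minimal pure-Python sketch with invented toy vectors and a two-entry code index — not the model's real index or embedding dimensionality:

```python
import math

# Toy sketch of sentence-embedding entity resolution: rank candidate codes
# by cosine distance to the chunk embedding. The vectors and the code index
# below are invented for illustration; the real model uses high-dimensional
# sBERT embeddings over the full SNOMED body-structure vocabulary.
def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def resolve(chunk_vec, code_index):
    # sort candidate codes by distance to the chunk, best (smallest) first
    return sorted(code_index, key=lambda code: cosine_distance(chunk_vec, code_index[code]))

code_index = {
    "38033009": [1.0, 0.0, 0.1],   # toy vector standing in for "Amputation stump"
    "771359009": [0.8, 0.4, 0.0],  # toy vector for "Amputation stump of upper limb"
}
print(resolve([0.98, 0.05, 0.1], code_index))  # best-matching code first
```

The `all_codes` and `all_distances` columns in the model's output correspond to exactly this kind of ranked candidate list.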
## Results ```bash | | chunks | code | resolutions | all_codes | all_distances | |---:|:-----------------|:---------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------| | 0 | amputation stump | 38033009 | [Amputation stump, Amputation stump of upper limb, Amputation stump of left upper limb, Amputation stump of lower limb, Amputation stump of left lower limb, Amputation stump of right upper limb, Amputation stump of right lower limb, ...]| ['38033009', '771359009', '771364008', '771358001', '771367001', '771365009', '771368006', ...] | ['0.0000', '0.0773', '0.0858', '0.0863', '0.0905', '0.0911', '0.0972', ...] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbertresolve_snomed_bodyStructure_med| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[snomed_code]| |Language:|en| |Case sensitive:|true| ## Data Source https://www.snomed.org/ --- layout: model title: German asr_exp_w2v2t_vp_s962 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_exp_w2v2t_vp_s962 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_exp_w2v2t_vp_s962` is a German model originally trained by jonatasgrosman. 
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2t_vp_s962_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2t_vp_s962_de_4.2.0_3.0_1664111856089.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2t_vp_s962_de_4.2.0_3.0_1664111856089.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2t_vp_s962', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2t_vp_s962", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_exp_w2v2t_vp_s962| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Detect Anatomical References (biobert) author: John Snow Labs name: ner_anatomy_biobert date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect anatomical sites and references in medical text using a pretrained NER model. ## Predicted Entities `tissue_structure`, `Organism_substance`, `Developing_anatomical_structure`, `Cell`, `Cellular_component`, `Immaterial_anatomical_entity`, `Organ`, `Pathological_formation`, `Organism_subdivision`, `Anatomical_system`, `Tissue` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ANATOMY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_biobert_en_3.0.0_3.0_1617260624773.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_biobert_en_3.0.0_3.0_1617260624773.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_anatomy_biobert", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_anatomy_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("EXAMPLE_TEXT").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu
nlu.load("en.med_ner.anatomy.biobert").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_anatomy_biobert| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Benchmarking ```bash +-------------------------------+-----+----+----+-----+---------+------+------+ | entity| tp| fp| fn|total|precision|recall| f1| +-------------------------------+-----+----+----+-----+---------+------+------+ | Organ| 53.0|17.0|12.0| 65.0| 0.7571|0.8154|0.7852| | Pathological_formation| 83.0|23.0|14.0| 97.0| 0.783|0.8557|0.8177| | Organism_substance| 42.0| 1.0|14.0| 56.0| 0.9767| 0.75|0.8485| | tissue_structure|131.0|28.0|49.0|180.0| 0.8239|0.7278|0.7729| | Cellular_component| 17.0| 0.0|20.0| 37.0| 1.0|0.4595|0.6296| | Tissue| 27.0| 4.0|16.0| 43.0| 0.871|0.6279|0.7297| | Anatomical_system| 15.0| 3.0| 8.0| 23.0| 0.8333|0.6522|0.7317| |Developing_anatomical_structure| 2.0| 1.0| 3.0| 5.0| 0.6667| 0.4| 0.5| | Immaterial_anatomical_entity| 7.0| 2.0| 6.0| 13.0| 0.7778|0.5385|0.6364| | Cell|180.0| 6.0|15.0|195.0| 0.9677|0.9231|0.9449| | Organism_subdivision| 11.0| 5.0|10.0| 21.0| 0.6875|0.5238|0.5946| +-------------------------------+-----+----+----+-----+---------+------+------+ +------------------+ | macro| +------------------+ |0.7264701979913192| +------------------+ +------------------+ | micro| +------------------+ |0.8108878300337679| +------------------+ ``` --- layout: model title: Part of Speech for Hindi author: John Snow Labs name: pos_ud_hdtb date: 2020-07-29 23:34:00 +0800 task: Part of Speech Tagging language: hi edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [pos, hi] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. 
The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_hdtb_hi_2.5.5_2.4_1596054066666.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_hdtb_hi_2.5.5_2.4_1596054066666.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_hdtb", "hi") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("उत्तर के राजा होने के अलावा, जॉन स्नो एक अंग्रेजी चिकित्सक और संज्ञाहरण और चिकित्सा स्वच्छता के विकास में अग्रणी है।") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_hdtb", "hi") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("उत्तर के राजा होने के अलावा, जॉन स्नो एक अंग्रेजी चिकित्सक और संज्ञाहरण और चिकित्सा स्वच्छता के विकास में अग्रणी है।").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""उत्तर के राजा होने के अलावा, जॉन स्नो एक अंग्रेजी चिकित्सक और संज्ञाहरण और चिकित्सा स्वच्छता के विकास में अग्रणी है।"""] pos_df = nlu.load('hi.pos').predict(text, output_level='token') pos_df ```
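`PerceptronModel` belongs to the averaged-perceptron family of taggers: for each token, every candidate tag accumulates the learned weights of the currently active features (word form, previous tag, affixes, and so on), and the highest-scoring tag wins. A toy sketch of just that scoring step, with invented feature names and weights:

```python
# Toy sketch of the scoring step inside a perceptron-style POS tagger.
# The feature names and weights below are invented for illustration;
# the real model learns them from the UD-HDTB treebank.
def predict_tag(features, weights):
    scores = {}
    for feat in features:
        for tag, w in weights.get(feat, {}).items():
            scores[tag] = scores.get(tag, 0.0) + w
    return max(scores, key=scores.get)

weights = {
    "word=राजा": {"NOUN": 2.0, "PROPN": 0.5},   # "राजा" (king) usually a noun
    "prev_tag=ADP": {"NOUN": 1.0, "VERB": 0.2},  # nouns often follow adpositions
}
print(predict_tag(["word=राजा", "prev_tag=ADP"], weights))  # NOUN
```

During training, the weights of the winning and correct tags are nudged after each mistake, which is what makes the tagger improve.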
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=4, result='PROPN', metadata={'word': 'उत्तर'}), Row(annotatorType='pos', begin=6, end=7, result='ADP', metadata={'word': 'के'}), Row(annotatorType='pos', begin=9, end=12, result='NOUN', metadata={'word': 'राजा'}), Row(annotatorType='pos', begin=14, end=17, result='VERB', metadata={'word': 'होने'}), Row(annotatorType='pos', begin=19, end=20, result='ADP', metadata={'word': 'के'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_hdtb| |Type:|pos| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|hi| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Turkish BERT Base Uncased (BERTurk) author: John Snow Labs name: bert_base_turkish_uncased date: 2021-05-20 tags: [open_source, embeddings, bert, turkish, tr] task: Embeddings language: tr edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description BERTurk is a community-driven uncased BERT model for Turkish. Some datasets used for pretraining and evaluation were contributed by the Turkish NLP community, which also chose the model name: BERTurk. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_turkish_uncased_tr_3.1.0_2.4_1621510523359.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_turkish_uncased_tr_3.1.0_2.4_1621510523359.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_base_turkish_uncased", "tr") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_base_turkish_uncased", "tr") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("tr.embed.bert.uncased").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_turkish_uncased| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[embeddings]| |Language:|tr| |Case sensitive:|true| ## Data Source [https://huggingface.co/dbmdz/bert-base-turkish-uncased](https://huggingface.co/dbmdz/bert-base-turkish-uncased) ## Benchmarking ```bash For results on PoS tagging or NER tasks, please refer to [this repository](https://github.com/stefan-it/turkish-bert). ``` --- layout: model title: RE Pipeline between Body Parts and Procedures author: John Snow Labs name: re_bodypart_proceduretest_pipeline date: 2023-06-13 tags: [licensed, clinical, relation_extraction, body_part, procedures, en] task: Relation Extraction language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [re_bodypart_proceduretest](https://nlp.johnsnowlabs.com/2021/01/18/re_bodypart_proceduretest_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_bodypart_proceduretest_pipeline_en_4.4.4_3.2_1686664541054.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_bodypart_proceduretest_pipeline_en_4.4.4_3.2_1686664541054.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("re_bodypart_proceduretest_pipeline", "en", "clinical/models") pipeline.fullAnnotate("TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("re_bodypart_proceduretest_pipeline", "en", "clinical/models") pipeline.fullAnnotate("TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.") ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.bodypart_proceduretest.pipeline").predict("""TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.""") ```
## Results ```bash Results | index | relations | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |-------|-----------|------------------------------|---------------|-------------|--------|---------|---------------|-------------|---------------------|------------| | 0 | 1 | External_body_part_or_region | 94 | 98 | chest | Test | 117 | 135 | portable ultrasound | 1.0 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_bodypart_proceduretest_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - PerceptronModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - DependencyParserModel - RelationExtractionModel --- layout: model title: Sentence Detection in Malayalam Text author: John Snow Labs name: sentence_detector_dl date: 2021-08-30 tags: [ml, sentence_detection, open_source] task: Sentence Detection language: ml edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation.
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_ml_3.2.0_3.0_1630336657068.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_ml_3.2.0_3.0_1630336657068.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "ml") \ .setInputCols(["document"]) \ .setOutputCol("sentences") sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL])) sd_model.fullAnnotate("""ഇംഗ്ലീഷ് വായിക്കുന്ന ഖണ്ഡികകളുടെ മികച്ച ഉറവിടം തേടുകയാണോ? നിങ്ങൾ ശരിയായ സ്ഥലത്ത് എത്തിയിരിക്കുന്നു. അടുത്തിടെ നടത്തിയ ഒരു പഠനമനുസരിച്ച്, ഇന്നത്തെ യുവാക്കളിൽ വായനാശീലം അതിവേഗം കുറഞ്ഞുവരികയാണ്. ഒരു നിശ്ചിത സെക്കൻഡിൽ കൂടുതൽ ഒരു ഇംഗ്ലീഷ് വായന ഖണ്ഡികയിൽ ശ്രദ്ധ കേന്ദ്രീകരിക്കാൻ അവർക്ക് കഴിയില്ല! കൂടാതെ, വായന എല്ലാ മത്സര പരീക്ഷകളുടെയും അവിഭാജ്യ ഘടകമായിരുന്നു. അതിനാൽ, നിങ്ങളുടെ വായനാ കഴിവുകൾ എങ്ങനെ മെച്ചപ്പെടുത്താം? ഈ ചോദ്യത്തിനുള്ള ഉത്തരം യഥാർത്ഥത്തിൽ മറ്റൊരു ചോദ്യമാണ്: വായനാ വൈദഗ്ധ്യത്തിന്റെ ഉപയോഗം എന്താണ്? വായനയുടെ പ്രധാന ലക്ഷ്യം 'അർത്ഥവത്താക്കുക' എന്നതാണ്.""") ``` ```scala val documenter = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "ml") .setInputCols(Array("document")) .setOutputCol("sentence") val pipeline = new Pipeline().setStages(Array(documenter, model)) val data = Seq("ഇംഗ്ലീഷ് വായിക്കുന്ന ഖണ്ഡികകളുടെ മികച്ച ഉറവിടം തേടുകയാണോ? നിങ്ങൾ ശരിയായ സ്ഥലത്ത് എത്തിയിരിക്കുന്നു. അടുത്തിടെ നടത്തിയ ഒരു പഠനമനുസരിച്ച്, ഇന്നത്തെ യുവാക്കളിൽ വായനാശീലം അതിവേഗം കുറഞ്ഞുവരികയാണ്. ഒരു നിശ്ചിത സെക്കൻഡിൽ കൂടുതൽ ഒരു ഇംഗ്ലീഷ് വായന ഖണ്ഡികയിൽ ശ്രദ്ധ കേന്ദ്രീകരിക്കാൻ അവർക്ക് കഴിയില്ല! കൂടാതെ, വായന എല്ലാ മത്സര പരീക്ഷകളുടെയും അവിഭാജ്യ ഘടകമായിരുന്നു. അതിനാൽ, നിങ്ങളുടെ വായനാ കഴിവുകൾ എങ്ങനെ മെച്ചപ്പെടുത്താം? ഈ ചോദ്യത്തിനുള്ള ഉത്തരം യഥാർത്ഥത്തിൽ മറ്റൊരു ചോദ്യമാണ്: വായനാ വൈദഗ്ധ്യത്തിന്റെ ഉപയോഗം എന്താണ്? 
വായനയുടെ പ്രധാന ലക്ഷ്യം 'അർത്ഥവത്താക്കുക' എന്നതാണ്.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load('ml.sentence_detector').predict("ഇംഗ്ലീഷ് വായിക്കുന്ന ഖണ്ഡികകളുടെ മികച്ച ഉറവിടം തേടുകയാണോ? നിങ്ങൾ ശരിയായ സ്ഥലത്ത് എത്തിയിരിക്കുന്നു. അടുത്തിടെ നടത്തിയ ഒരു പഠനമനുസരിച്ച്, ഇന്നത്തെ യുവാക്കളിൽ വായനാശീലം അതിവേഗം കുറഞ്ഞുവരികയാണ്. ഒരു നിശ്ചിത സെക്കൻഡിൽ കൂടുതൽ ഒരു ഇംഗ്ലീഷ് വായന ഖണ്ഡികയിൽ ശ്രദ്ധ കേന്ദ്രീകരിക്കാൻ അവർക്ക് കഴിയില്ല! കൂടാതെ, വായന എല്ലാ മത്സര പരീക്ഷകളുടെയും അവിഭാജ്യ ഘടകമായിരുന്നു. അതിനാൽ, നിങ്ങളുടെ വായനാ കഴിവുകൾ എങ്ങനെ മെച്ചപ്പെടുത്താം? ഈ ചോദ്യത്തിനുള്ള ഉത്തരം യഥാർത്ഥത്തിൽ മറ്റൊരു ചോദ്യമാണ്: വായനാ വൈദഗ്ധ്യത്തിന്റെ ഉപയോഗം എന്താണ്? വായനയുടെ പ്രധാന ലക്ഷ്യം 'അർത്ഥവത്താക്കുക' എന്നതാണ്.", output_level ='sentence') ```
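To see why a learned boundary detector is useful, compare it with a naive rule-based baseline that splits on `.`, `!`, or `?` followed by whitespace. Such a rule mis-splits abbreviations, which is one of the cases a neural model like SentenceDetectorDL is trained to handle:

```python
import re

# Naive rule-based sentence splitter, shown only as a contrast to the
# neural SentenceDetectorDLModel above: split after ., !, or ? when
# followed by whitespace. It has no notion of abbreviations.
def naive_split(text):
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# The abbreviation "Dr." is wrongly treated as a sentence boundary.
print(naive_split("Dr. Smith arrived. He was late!"))
```

A learned detector avoids this by conditioning on the characters around each candidate boundary rather than on a fixed punctuation rule.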
## Results ```bash +----------------------------------------------------------------------------------------------------+ |result | +----------------------------------------------------------------------------------------------------+ |[ഇംഗ്ലീഷ് വായിക്കുന്ന ഖണ്ഡികകളുടെ മികച്ച ഉറവിടം തേടുകയാണോ?] | |[നിങ്ങൾ ശരിയായ സ്ഥലത്ത് എത്തിയിരിക്കുന്നു.] | |[അടുത്തിടെ നടത്തിയ ഒരു പഠനമനുസരിച്ച്, ഇന്നത്തെ യുവാക്കളിൽ വായനാശീലം അതിവേഗം കുറഞ്ഞുവരികയാണ്.] | |[ഒരു നിശ്ചിത സെക്കൻഡിൽ കൂടുതൽ ഒരു ഇംഗ്ലീഷ് വായന ഖണ്ഡികയിൽ ശ്രദ്ധ കേന്ദ്രീകരിക്കാൻ അവർക്ക് കഴിയില്ല!]| |[കൂടാതെ, വായന എല്ലാ മത്സര പരീക്ഷകളുടെയും അവിഭാജ്യ ഘടകമായിരുന്നു.] | |[അതിനാൽ, നിങ്ങളുടെ വായനാ കഴിവുകൾ എങ്ങനെ മെച്ചപ്പെടുത്താം?] | |[ഈ ചോദ്യത്തിനുള്ള ഉത്തരം യഥാർത്ഥത്തിൽ മറ്റൊരു ചോദ്യമാണ്:] | |[വായനാ വൈദഗ്ധ്യത്തിന്റെ ഉപയോഗം എന്താണ്?] | |[വായനയുടെ പ്രധാന ലക്ഷ്യം 'അർത്ഥവത്താക്കുക' എന്നതാണ്.] | +----------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentence_detector_dl| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[sentences]| |Language:|ml| ## Benchmarking ```bash label Accuracy Recall Prec F1 0 0.98 1.00 0.96 0.98 ``` --- layout: model title: English BertForQuestionAnswering model (from kaporter) author: John Snow Labs name: bert_qa_kaporter_bert_base_uncased_finetuned_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-finetuned-squad` is an English model originally trained by `kaporter`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_kaporter_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654181111131.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_kaporter_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654181111131.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_kaporter_bert_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_kaporter_bert_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased.by_kaporter").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
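After the encoder scores every context token, extractive QA models of this kind select the answer as the (start, end) token pair with the highest combined score, subject to start ≤ end and a maximum span length. A toy sketch of that selection step, with invented token scores (the real annotator works on logits over subword tokens):

```python
# Toy sketch of answer-span selection in extractive QA: choose the
# (start, end) pair maximising start_score + end_score, with start <= end
# and a bounded span length. Scores below are invented for illustration.
def best_span(start_scores, end_scores, max_len=15):
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley"]
start = [0.1, 0.2, 0.1, 5.0, 0.0, 0.1, 0.0, 0.0, 0.3]
end   = [0.0, 0.1, 0.2, 4.8, 0.1, 0.0, 0.0, 0.1, 0.2]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```

The `answer` column produced by the pipeline corresponds to the text covered by this best-scoring span.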
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_kaporter_bert_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/kaporter/bert-base-uncased-finetuned-squad --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-128-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_0_en_4.0.0_3.0_1655731053643.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_0_en_4.0.0_3.0_1655731053643.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_128d_seed_0").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|422.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-128-finetuned-squad-seed-0 --- layout: model title: TREC(6) Question Classifier author: John Snow Labs name: classifierdl_use_trec6 date: 2021-01-08 task: Text Classification language: en nav_key: models edition: Spark NLP 2.7.1 spark_version: 2.4 tags: [classifier, open_source, en, text_classification] supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Classify open-domain, fact-based questions into one of the following broad semantic categories: Abbreviation, Description, Entities, Human Beings, Locations, or Numeric Values. ## Predicted Entities ``ABBR``, ``DESC``, ``NUM``, ``ENTY``, ``LOC``, ``HUM``. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_EN_TREC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_EN_TREC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_trec6_en_2.7.1_2.4_1610118062425.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_trec6_en_2.7.1_2.4_1610118062425.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") use = UniversalSentenceEncoder.pretrained(lang="en") \ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") document_classifier = ClassifierDLModel.pretrained('classifierdl_use_trec6', 'en') \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") nlpPipeline = Pipeline(stages=[documentAssembler, use, document_classifier]) light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate('When did the construction of stone circles begin in the UK?') ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val use = UniversalSentenceEncoder.pretrained(lang="en") .setInputCols(Array("document")) .setOutputCol("sentence_embeddings") val document_classifier = ClassifierDLModel.pretrained("classifierdl_use_trec6", "en") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, use, document_classifier)) val data = Seq("When did the construction of stone circles begin in the UK?").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""When did the construction of stone circles begin in the UK?"""] trec6_df = nlu.load('en.classify.trec6.use').predict(text, output_level='document') trec6_df[["document", "trec6"]] ```
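`fullAnnotate` returns one annotation map per input text, keyed by output column. A minimal sketch of reading the predicted label follows; the structure is modeled here as plain dicts (real results hold `Annotation` objects, and the score value is assumed), but the access pattern is the same:

```python
# Plain-dict sketch of a LightPipeline.fullAnnotate() result for one text.
# Keys are the pipeline's output columns; the class annotation carries the
# predicted label in `result` (confidence value below is assumed).
annotations = [{
    "document": ["When did the construction of stone circles begin in the UK?"],
    "class": [{"result": "NUM", "metadata": {"NUM": "0.98"}}],
}]

predicted_label = annotations[0]["class"][0]["result"]
print(predicted_label)  # NUM
```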
## Results ```bash +------------------------------------------------------------------------------------------------+------------+ |document |class | +------------------------------------------------------------------------------------------------+------------+ |When did the construction of stone circles begin in the UK? | NUM | +------------------------------------------------------------------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_use_trec6| |Compatibility:|Spark NLP 2.7.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| ## Benchmarking ```bash precision recall f1-score support ABBR 0.00 0.00 0.00 26 DESC 0.89 0.96 0.92 343 ENTY 0.86 0.86 0.86 391 HUM 0.91 0.90 0.91 366 LOC 0.88 0.91 0.89 233 NUM 0.94 0.94 0.94 274 accuracy 0.89 1633 macro avg 0.75 0.76 0.75 1633 weighted avg 0.88 0.89 0.89 1633 ``` ## Data Source This model is trained on the six-class version of the TREC dataset. http://search.r-project.org/library/textdata/html/dataset_trec.html --- layout: model title: English RobertaForQuestionAnswering Cased model (from AlirezaBaneshi) author: John Snow Labs name: roberta_qa_autotrain_test2_756523214 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-test2-756523214` is an English model originally trained by `AlirezaBaneshi`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_autotrain_test2_756523214_en_4.3.0_3.0_1674209197246.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_autotrain_test2_756523214_en_4.3.0_3.0_1674209197246.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_autotrain_test2_756523214","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_autotrain_test2_756523214","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_autotrain_test2_756523214| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|415.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AlirezaBaneshi/autotrain-test2-756523214 --- layout: model title: Legal NER in Greek Legislations author: John Snow Labs name: legner_greek_legislation date: 2023-04-25 tags: [el, legal, ner, licensed, legislation] task: Named Entity Recognition language: el edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Legal NER model extracts the following entities from Greek legislation: - `FACILITY`: Facilities, such as police stations, departments, etc. - `GPE`: Geopolitical Entity; any reference to a geopolitical entity (e.g., country, city, Greek administrative unit, etc.) - `LEG_REF`: Legislation Reference; any reference to Greek or European legislation - `ORG`: Organization; any reference to a public or private organization - `PER`: Any formal name of a person mentioned in the text - `PUBLIC_DOC`: Public Document Reference ## Predicted Entities `FACILITY`, `GPE`, `LEG_REF`, `PUBLIC_DOC`, `PER`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_greek_legislation_el_1.0.0_3.0_1682420832367.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_greek_legislation_el_1.0.0_3.0_1682420832367.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pandas as pd document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_el_cased","el")\ .setInputCols(["document", "token"])\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_greek_legislation", "el", "legal/models")\ .setInputCols(["document", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text_list = ["""3 του άρθρου 5 του ν. 3148/2003, όπως ισχύει, αντικαθίσταται ως εξής""", """1 του άρθρου 1 ασκούνται πλέον από την ΕΥΔΕ/ΕΣΕΑ μέσα σε δύο μήνες από την έναρξη ισχύος του παρόντος Διατάγματος.""", """Ο Πρόεδρος της Επιτροπής και τα τέσσερα μέλη με ισάριθμα αναπληρωματικά εκλέγονται μεταξύ των δημοτών του Δήμου Κυθήρων.""", """Τη με αριθ. 117/Σ.10η/25 Ιουλ 2016 γνωμοδότηση του Ανωτάτου Στρατιωτικού Συμβουλίου."""] result = model.transform(spark.createDataFrame(pd.DataFrame({"text" : text_list}))) ```
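Each annotation in the `ner_chunk` column produced by the `NerConverter` stage stores the chunk text in its `result` field and the entity label under `metadata["entity"]`. A minimal plain-Python sketch of flattening those annotations into (chunk, label) pairs, using two values taken from the example texts:

```python
# Stand-in for annotations in the `ner_chunk` column (NerConverter output):
# chunk text lives in `result`, the entity label in metadata["entity"].
ner_chunks = [
    {"result": "ν. 3148/2003", "metadata": {"entity": "LEG_REF"}},
    {"result": "Δήμου Κυθήρων", "metadata": {"entity": "GPE"}},
]

pairs = [(c["result"], c["metadata"]["entity"]) for c in ner_chunks]
print(pairs)  # [('ν. 3148/2003', 'LEG_REF'), ('Δήμου Κυθήρων', 'GPE')]
```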
## Results ```bash +----------------------------------------+----------+ |chunk |ner_label | +----------------------------------------+----------+ |ν. 3148/2003 |LEG_REF | |ΕΥΔΕ/ΕΣΕΑ |ORG | |Δήμου Κυθήρων |GPE | |αριθ. 117/Σ.10η/25 Ιουλ 2016 γνωμοδότηση|PUBLIC_DOC| +----------------------------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_greek_legislation| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|el| |Size:|16.4 MB| ## References In-house annotations ## Benchmarking ```bash label precision recall f1-score support FACILITY 0.94 0.80 0.86 64 GPE 0.77 0.83 0.80 136 LEG_REF 0.94 0.90 0.92 93 ORG 0.85 0.74 0.79 173 PER 0.72 0.71 0.71 58 PUBLIC_DOC 0.76 0.82 0.79 39 micro-avg 0.83 0.80 0.81 563 macro-avg 0.83 0.80 0.81 563 weighted-avg 0.84 0.80 0.82 563 ``` --- layout: model title: Relation Extraction between Biomarkers and Results (ReDL) author: John Snow Labs name: redl_oncology_biomarker_result_biobert_wip date: 2023-01-15 tags: [licensed, clinical, oncology, en, relation_extraction, test, biomarker, tensorflow] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This relation extraction model links Biomarker and Oncogene extractions to their corresponding Biomarker_Result extractions. 
## Predicted Entities `is_finding_of`, `O` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_oncology_biomarker_result_biobert_wip_en_4.2.4_3.0_1673766618517.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_oncology_biomarker_result_biobert_wip_en_4.2.4_3.0_1673766618517.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use relation pairs to include only the combinations of entities that are relevant in your case.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") re_ner_chunk_filter = RENerChunksFilter()\ .setInputCols(["ner_chunk", "dependencies"])\ .setOutputCol("re_ner_chunk")\ .setMaxSyntacticDistance(10)\ .setRelationPairs(["Biomarker-Biomarker_Result", "Biomarker_Result-Biomarker", "Oncogene-Biomarker_Result", "Biomarker_Result-Oncogene"]) re_model = RelationExtractionDLModel.pretrained("redl_oncology_biomarker_result_biobert_wip", "en", "clinical/models")\ .setInputCols(["re_ner_chunk", "sentence"])\ .setOutputCol("relation_extraction") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model]) data = spark.createDataFrame([["Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. 
The test was positive for ER and PR, and negative for HER2."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentence", "pos_tags", "token")) .setOutputCol("dependencies") val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunk", "dependencies")) .setOutputCol("re_ner_chunk") .setMaxSyntacticDistance(10) .setRelationPairs(Array("Biomarker-Biomarker_Result", "Biomarker_Result-Biomarker", "Oncogene-Biomarker_Result", "Biomarker_Result-Oncogene")) val re_model = RelationExtractionDLModel.pretrained("redl_oncology_biomarker_result_biobert_wip", "en", "clinical/models") .setInputCols(Array("re_ner_chunk", "sentence")) .setOutputCol("relation_extraction") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model)) val data = 
Seq("Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.oncology_biomarker_result_biobert_wip").predict("""Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.""") ```
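Each annotation in the `relation_extraction` column stores the predicted label in `result` and the two linked chunks in its metadata. A minimal plain-Python sketch of keeping only genuinely related pairs; the metadata keys mirror Spark NLP's relation output schema, and the values below are assumed for illustration:

```python
# Plain-dict stand-in for `relation_extraction` annotations: the relation
# label sits in `result`; linked chunks, entity types, and confidence sit in
# metadata (values here are assumed, not taken from the model).
relations = [
    {"result": "is_finding_of",
     "metadata": {"chunk1": "positive", "entity1": "Biomarker_Result",
                  "chunk2": "ER", "entity2": "Biomarker", "confidence": "0.992"}},
    {"result": "O",
     "metadata": {"chunk1": "positive", "entity1": "Biomarker_Result",
                  "chunk2": "HER2", "entity2": "Oncogene", "confidence": "0.998"}},
]

# Drop pairs labeled "O" (no relation between the two chunks):
linked = [(r["metadata"]["chunk1"], r["result"], r["metadata"]["chunk2"])
          for r in relations if r["result"] != "O"]
print(linked)  # [('positive', 'is_finding_of', 'ER')]
```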
## Results ```bash +-------------+----------------+-------------+-----------+--------+----------------+-------------+-----------+--------------------+----------+ | relation| entity1|entity1_begin|entity1_end| chunk1| entity2|entity2_begin|entity2_end| chunk2|confidence| +-------------+----------------+-------------+-----------+--------+----------------+-------------+-----------+--------------------+----------+ |is_finding_of|Biomarker_Result| 25| 32|negative| Biomarker| 38| 67|thyroid transcrip...|0.99808085| |is_finding_of|Biomarker_Result| 25| 32|negative| Biomarker| 73| 78| napsin|0.99637383| |is_finding_of|Biomarker_Result| 96| 103|positive| Biomarker| 109| 110| ER|0.99221414| |is_finding_of|Biomarker_Result| 96| 103|positive| Biomarker| 116| 117| PR| 0.9893672| | O|Biomarker_Result| 96| 103|positive| Oncogene| 137| 140| HER2| 0.9986272| | O| Biomarker| 109| 110| ER|Biomarker_Result| 124| 131| negative| 0.9999089| | O| Biomarker| 116| 117| PR|Biomarker_Result| 124| 131| negative| 0.9998932| |is_finding_of|Biomarker_Result| 124| 131|negative| Oncogene| 137| 140| HER2|0.98810333| +-------------+----------------+-------------+-----------+--------+----------------+-------------+-----------+--------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_oncology_biomarker_result_biobert_wip| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|401.7 MB| ## References In-house annotated oncology case reports. 
## Benchmarking ```bash label recall precision f1 O 0.93 0.97 0.95 is_finding_of 0.97 0.93 0.95 macro-avg 0.95 0.95 0.95 ``` --- layout: model title: German RobertaForQuestionAnswering (from Gantenbein) author: John Snow Labs name: roberta_qa_ADDI_DE_RoBERTa date: 2022-06-20 tags: [open_source, question_answering, roberta] task: Question Answering language: de edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ADDI-DE-RoBERTa` is a German model originally trained by `Gantenbein`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ADDI_DE_RoBERTa_de_4.0.0_3.0_1655726326883.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ADDI_DE_RoBERTa_de_4.0.0_3.0_1655726326883.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_ADDI_DE_RoBERTa","de") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_ADDI_DE_RoBERTa","de") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("de.answer_question.roberta.de_tuned.by_Gantenbein").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_ADDI_DE_RoBERTa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|de| |Size:|422.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Gantenbein/ADDI-DE-RoBERTa --- layout: model title: Detect Adverse Drug Events (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_ade date: 2022-01-04 tags: [ner, bertfortokenclassification, adverse, ade, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.0 spark_version: 2.4 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect adverse reactions of drugs in reviews, tweets, and medical text using the pretrained NER model. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. ## Predicted Entities `DRUG`, `ADE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ade_en_3.4.0_2.4_1641283944065.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ade_en_3.4.0_2.4_1641283944065.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_ade", "en", "clinical/models")\ .setInputCols("token", "document")\ .setOutputCol("ner")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) ner_converter = NerConverter() \ .setInputCols(["document","token","ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter]) data = spark.createDataFrame([["""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.""" ]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_ade", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("ner") .setCaseSensitive(true) .setMaxSentenceLength(512) val ner_converter = new NerConverter() .setInputCols(Array("document","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, tokenClassifier, ner_converter)) val data = Seq("""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. 
The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_ade").predict("""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.""") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |Lipitor |DRUG | |severe fatigue|ADE | |voltaren |DRUG | |cramps |ADE | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_ade| |Compatibility:|Healthcare NLP 3.4.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## Data Source This model is trained on a custom dataset by John Snow Labs. ## Benchmarking ```bash label precision recall f1-score support B-ADE 0.93 0.79 0.85 2694 B-DRUG 0.97 0.87 0.92 9539 I-ADE 0.93 0.73 0.82 3236 I-DRUG 0.95 0.82 0.88 6115 accuracy - - 0.83 21584 macro-avg 0.84 0.84 0.84 21584 weighted-avg 0.95 0.83 0.89 21584 ``` --- layout: model title: Indonesian RoBERTa Embeddings (Base) author: John Snow Labs name: roberta_embeddings_indonesian_roberta_base date: 2022-04-14 tags: [roberta, embeddings, id, open_source] task: Embeddings language: id edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `indonesian-roberta-base` is an Indonesian model originally trained by `flax-community`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indonesian_roberta_base_id_3.4.2_3.0_1649948386496.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indonesian_roberta_base_id_3.4.2_3.0_1649948386496.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indonesian_roberta_base","id") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Saya suka percikan NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indonesian_roberta_base","id") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Saya suka percikan NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("id.embed.indonesian_roberta_base").predict("""Saya suka percikan NLP""") ```
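The `embeddings` column produced above holds one vector per token. A common follow-up is mean-pooling the token vectors into a single sentence vector; the toy sketch below uses assumed 3-dimensional vectors for brevity (the real base model emits 768-dimensional ones):

```python
# Toy sketch of mean-pooling token embeddings into a sentence vector.
# The two vectors below are made up; in practice they come from the
# `embeddings` field of each token annotation.
token_vectors = [
    [0.1, 0.2, 0.3],
    [0.3, 0.2, 0.1],
]

dim = len(token_vectors[0])
sentence_vector = [sum(v[i] for v in token_vectors) / len(token_vectors)
                   for i in range(dim)]
print(sentence_vector)
```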
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_indonesian_roberta_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|id| |Size:|468.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/flax-community/indonesian-roberta-base - https://arxiv.org/abs/1907.11692 - https://hf.co/w11wo - https://hf.co/stevenlimcorn - https://hf.co/munggok - https://hf.co/chewkokwah --- layout: model title: Pipeline to Detect clinical entities (ner_healthcare_slim) author: John Snow Labs name: ner_healthcare_slim_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, de] task: Named Entity Recognition language: de edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_healthcare_slim](https://nlp.johnsnowlabs.com/2021/04/01/ner_healthcare_slim_de.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_slim_pipeline_de_4.3.0_3.2_1678879973742.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_slim_pipeline_de_4.3.0_3.2_1678879973742.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_healthcare_slim_pipeline", "de", "clinical/models") text = '''Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist Hernia femoralis, Akne, einseitig, ein hochmalignes bronchogenes Karzinom, das überwiegend im Zentrum der Lunge, in einem Hauptbronchus entsteht. Die mittlere Prävalenz wird auf 1/20.000 geschätzt.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_healthcare_slim_pipeline", "de", "clinical/models") val text = "Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist Hernia femoralis, Akne, einseitig, ein hochmalignes bronchogenes Karzinom, das überwiegend im Zentrum der Lunge, in einem Hauptbronchus entsteht. Die mittlere Prävalenz wird auf 1/20.000 geschätzt." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-----------------------------------|--------:|------:|:------------------|-------------:| | 0 | Bronchialkarzinom | 17 | 33 | MEDICAL_CONDITION | 0.9988 | | 1 | Lungenkrebs | 50 | 60 | MEDICAL_CONDITION | 0.9931 | | 2 | SCLC | 63 | 66 | MEDICAL_CONDITION | 0.9957 | | 3 | Hernia | 73 | 78 | MEDICAL_CONDITION | 0.8134 | | 4 | femoralis | 80 | 88 | BODY_PART | 0.8001 | | 5 | Akne | 91 | 94 | MEDICAL_CONDITION | 0.9678 | | 6 | hochmalignes bronchogenes Karzinom | 112 | 145 | MEDICAL_CONDITION | 0.6409 | | 7 | Lunge | 179 | 183 | BODY_PART | 0.9729 | | 8 | Hauptbronchus | 195 | 207 | BODY_PART | 0.9987 | | 9 | Prävalenz | 232 | 240 | MEDICAL_CONDITION | 0.9986 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_healthcare_slim_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|de| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Word2Vec Embeddings in Croatian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, hr, open_source] task: Embeddings language: hr edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_hr_3.4.1_3.0_1647292091930.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_hr_3.4.1_3.0_1647292091930.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","hr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Volim iskru nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","hr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Volim iskru nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("hr.embed.w2v_cc_300d").predict("""Volim iskru nlp""") ```
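With tokens mapped to 300-dimensional vectors, a typical downstream use is comparing words by cosine similarity of their embeddings. A toy sketch follows; the two 3-dimensional vectors are made up for illustration (the labels "pas"/"mačka" and their values are assumptions, not model output):

```python
import math

# Cosine similarity between two embedding vectors: dot product divided by
# the product of the vector norms.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v_pas = [0.9, 0.1, 0.2]    # assumed vector for "pas" (dog)
v_macka = [0.8, 0.2, 0.1]  # assumed vector for "mačka" (cat)
print(round(cosine(v_pas, v_macka), 2))  # 0.99
```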
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|hr| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Chinese BertForMaskedLM Cased model (from ptrsxu) author: John Snow Labs name: bert_embeddings_ptrsxu_chinese_wwm_ext date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese-bert-wwm-ext` is a Chinese model originally trained by `ptrsxu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_ptrsxu_chinese_wwm_ext_zh_4.2.4_3.0_1670020981050.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_ptrsxu_chinese_wwm_ext_zh_4.2.4_3.0_1670020981050.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_ptrsxu_chinese_wwm_ext","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_ptrsxu_chinese_wwm_ext","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_ptrsxu_chinese_wwm_ext| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|383.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/ptrsxu/chinese-bert-wwm-ext - https://arxiv.org/abs/1906.08101 - https://github.com/google-research/bert - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://arxiv.org/abs/2004.13922 --- layout: model title: Smaller BERT Embeddings (L-10_H-256_A-4) author: John Snow Labs name: small_bert_L10_256 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L10_256_en_2.6.0_2.4_1598344485022.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L10_256_en_2.6.0_2.4_1598344485022.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L10_256", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L10_256", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L10_256').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash token en_embed_bert_small_L10_256_embeddings I [0.14484411478042603, -0.8349236249923706, -1.... love [-0.7449802160263062, -0.4852253794670105, -0.... NLP [-0.03900821506977081, -0.044783130288124084, ... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|small_bert_L10_256| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|en| |Dimension:|256| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-256_A-4/1 --- layout: model title: Bert Embeddings Romanian (Base Cased) author: John Snow Labs name: bert_base_cased date: 2021-09-13 tags: [open_source, embeddings, ro] task: Embeddings language: ro edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains a deep bidirectional transformer trained on Wikipedia and the BookCorpus in the Romanian language. The details are described in the paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_cased_ro_3.2.0_3.0_1631533635237.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_cased_ro_3.2.0_3.0_1631533635237.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash Generates 768-dimensional embedding vectors per token ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_cased| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ro| |Case sensitive:|true| ## Data Source This model is imported from https://huggingface.co/dumitrescustefan/bert-base-romanian-cased-v1 --- layout: model title: English DistilBertForQuestionAnswering Base Cased model (from monakth) author: John Snow Labs name: distilbert_qa_monakth_base_cased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-finetuned-squad` is an English model originally trained by `monakth`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_monakth_base_cased_finetuned_squad_en_4.3.0_3.0_1672766954155.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_monakth_base_cased_finetuned_squad_en_4.3.0_3.0_1672766954155.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_monakth_base_cased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_monakth_base_cased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
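For intuition about what the `answer` column contains: extractive QA models such as this one score each context token as a candidate answer start and as a candidate answer end, and the prediction is the highest-scoring span with start ≤ end. A toy pure-Python sketch of that selection step (the scores below are made up for illustration; they are not the model's real logits):

```python
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Hypothetical per-token start/end scores (higher = more likely).
start_scores = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end_scores   = [0.1, 0.1, 0.1, 4.0, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]

# Pick the (start, end) pair with the best combined score, requiring start <= end.
best = max(
    ((s, e) for s in range(len(tokens)) for e in range(s, len(tokens))),
    key=lambda se: start_scores[se[0]] + end_scores[se[1]],
)
answer = " ".join(tokens[best[0]:best[1] + 1])
print(answer)  # -> Clara
```

Real models also cap the span length and mask out question tokens, but the argmax-over-spans idea is the same.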
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_monakth_base_cased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/monakth/distilbert-base-cased-finetuned-squad --- layout: model title: English XlmRoBertaForQuestionAnswering (from airesearch) author: John Snow Labs name: xlm_roberta_qa_xlm_roberta_base_finetune_qa date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetune-qa` is an English model originally trained by `airesearch`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_finetune_qa_en_4.0.0_3.0_1655989494932.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_finetune_qa_en_4.0.0_3.0_1655989494932.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_base_finetune_qa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_base_finetune_qa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.xlm_roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_roberta_base_finetune_qa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|864.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/airesearch/xlm-roberta-base-finetune-qa - https://wandb.ai/cstorm125/wangchanberta-qa - https://github.com/vistec-AI/thai2transformers/blob/dev/scripts/downstream/train_question_answering_lm_finetuning.py --- layout: model title: Detect Cellular/Molecular Biology Entities (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_cellular date: 2021-11-03 tags: [bertfortokenclassification, ner, cellular, en, clinical, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.0 spark_version: 2.4 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects molecular biology-related terms in medical texts. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. 
## Predicted Entities `DNA`, `Cell_type`, `Cell_line`, `RNA`, `Protein` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_cellular_en_3.3.0_2.4_1635938889847.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_cellular_en_3.3.0_2.4_1635938889847.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_cellular", "en", "clinical/models")\ .setInputCols("token", "document")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, sentence_detector, tokenizer, tokenClassifier, ner_converter]) p_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']}))) test_sentence = """Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. 
Tax alone was also inactive.""" result = p_model.transform(spark.createDataFrame(pd.DataFrame({'text': [test_sentence]}))) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_cellular", "en", "clinical/models") .setInputCols(Array("token", "document")) .setOutputCol("ner") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("document","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence_detector, tokenizer, tokenClassifier, ner_converter)) val data = Seq("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.cellular").predict("""Detection of various other intracellular signaling proteins is also described. 
Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""") ```
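The `ner_converter` stage in the pipeline above merges the classifier's token-level IOB tags (`B-DNA`, `I-DNA`, `O`, ...) into entity chunks. A minimal pure-Python sketch of that merging logic (illustrative only, not the Spark NLP implementation):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB tags: B-X starts a chunk, I-X continues it, O closes it."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # O tag, or an I- tag that does not continue the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Binding", "of", "Tax", "to", "Tax-responsive", "element", "1"]
tags   = ["O", "O", "B-protein", "O", "B-DNA", "I-DNA", "I-DNA"]
print(iob_to_chunks(tokens, tags))
# -> [('Tax', 'protein'), ('Tax-responsive element 1', 'DNA')]
```

The begin/end character offsets reported by the real converter come from the token annotations; this sketch keeps only the chunk text and label.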
## Results ```bash +-------------------------------------------+---------+ |chunk |ner_label| +-------------------------------------------+---------+ |intracellular signaling proteins |protein | |human T-cell leukemia virus type 1 promoter|DNA | |Tax |protein | |Tax-responsive element 1 |DNA | |cyclic AMP-responsive members |protein | |CREB/ATF family |protein | |transcription factors |protein | |Tax |protein | |human T-cell leukemia virus type 1 |DNA | |Tax-responsive element 1 |DNA | |TRE-1 |DNA | |lacZ gene |DNA | |CYC1 promoter |DNA | |TRE-1 |DNA | |cyclic AMP response element-binding protein|protein | |CREB |protein | |CREB |protein | |GAL4 activation domain |protein | |GAD |protein | |reporter gene |DNA | |Tax |protein | +-------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_cellular| |Compatibility:|Healthcare NLP 3.3.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Case sensitive:|true| |Max sentence length:|512| ## Data Source Trained on the JNLPBA corpus containing more than 2,404 publication abstracts. 
http://www.geniaproject.org/ ## Benchmarking ```bash label precision recall f1-score support B-DNA 0.87 0.77 0.82 1056 B-RNA 0.85 0.79 0.82 118 B-cell_line 0.66 0.70 0.68 500 B-cell_type 0.87 0.75 0.81 1921 B-protein 0.90 0.85 0.88 5067 I-DNA 0.93 0.86 0.90 1789 I-RNA 0.92 0.84 0.88 187 I-cell_line 0.67 0.76 0.71 989 I-cell_type 0.92 0.76 0.84 2991 I-protein 0.94 0.80 0.87 4774 accuracy - - 0.80 19392 macro-avg 0.76 0.81 0.78 19392 weighted-avg 0.89 0.80 0.85 19392 ``` --- layout: model title: English DistilBertForQuestionAnswering Cased model (from shwetha) author: John Snow Labs name: distilbert_qa_autotrain_user_954831770 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-qa-user-954831770` is an English model originally trained by `shwetha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_autotrain_user_954831770_en_4.3.0_3.0_1672765643101.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_autotrain_user_954831770_en_4.3.0_3.0_1672765643101.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_autotrain_user_954831770","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_autotrain_user_954831770","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_autotrain_user_954831770| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/shwetha/autotrain-qa-user-954831770 --- layout: model title: Swedish asr_lm_swedish TFWav2Vec2ForCTC from birgermoell author: John Snow Labs name: pipeline_asr_lm_swedish date: 2022-09-25 tags: [wav2vec2, sv, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: sv edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_lm_swedish` is a Swedish model originally trained by birgermoell. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_lm_swedish_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_lm_swedish_sv_4.2.0_3.0_1664117937565.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_lm_swedish_sv_4.2.0_3.0_1664117937565.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_lm_swedish', lang = 'sv') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_lm_swedish", lang = "sv") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_lm_swedish| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|sv| |Size:|757.4 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English T5ForConditionalGeneration Tiny Cased model (from google) author: John Snow Labs name: t5_efficient_tiny_nl24 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-nl24` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nl24_en_4.3.0_3.0_1675123867646.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nl24_en_4.3.0_3.0_1675123867646.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_tiny_nl24","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_tiny_nl24","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_tiny_nl24| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|117.0 MB| ## References - https://huggingface.co/google/t5-efficient-tiny-nl24 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: French CamemBert Embeddings (from Leisa) author: John Snow Labs name: camembert_embeddings_Leisa_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `Leisa`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Leisa_generic_model_fr_3.4.4_3.0_1653986639094.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Leisa_generic_model_fr_3.4.4_3.0_1653986639094.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Leisa_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Leisa_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_Leisa_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/Leisa/dummy-model --- layout: model title: Fast and Accurate Language Identification - 231 Languages (CNN) author: John Snow Labs name: ld_wiki_cnn_231 date: 2020-12-05 task: Language Detection language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [language_detection, open_source, xx] supported: true annotator: LanguageDetectorDL article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Language detection and identification is the task of automatically detecting the language(s) present in a document based on the content of the document. ``LanguageDetectorDL`` is an annotator that detects the language of documents or sentences depending on the ``inputCols``. In addition, ``LanguageDetectorDL`` can accurately detect language from documents with mixed languages by coalescing sentences and selecting the best candidate. We have designed and developed Deep Learning models using CNNs in TensorFlow/Keras. The model is trained on a Wikipedia dataset and achieves high accuracy when evaluated on the Europarl dataset. The output is a language code in Wiki Code style: [https://en.wikipedia.org/wiki/List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias). 
This model can detect the following languages: `Achinese`, `Afrikaans`, `Tosk Albanian`, `Amharic`, `Aragonese`, `Old English`, `Arabic`, `Egyptian Arabic`, `Assamese`, `Asturian`, `Avaric`, `Aymara`, `Azerbaijani`, `South Azerbaijani`, `Bashkir`, `Bavarian`, `bat-smg`, `Central Bikol`, `Belarusian`, `Bulgarian`, `bh`, `Banjar`, `Bengali`, `Tibetan`, `Bishnupriya`, `Breton`, `Bosnian`, `Russia Buriat`, `Catalan`, `cbk-zam`, `Min Dong Chinese`, `Chechen`, `Cebuano`, `Central Kurdish (Soranî)`, `Corsican`, `Crimean Tatar`, `Czech`, `Kashubian`, `Chuvash`, `Welsh`, `Danish`, `German`, `Dimli (individual language)`, `Lower Sorbian`, `dty`, `Dhivehi`, `Greek`, `eml`, `English`, `Esperanto`, `Spanish`, `Estonian`, `Extremaduran`, `Persian`, `Finnish`, `fiu-vro`, `Faroese`, `French`, `Arpitan`, `Friulian`, `Frisian`, `Irish`, `Gagauz`, `Scottish Gaelic`, `Galician`, `Gilaki`, `Guarani`, `Konkani (Goan)`, `Gujarati`, `Manx`, `Hausa`, `Hakka Chinese`, `Hebrew`, `Hindi`, `Fiji Hindi`, `Croatian`, `Upper Sorbian`, `Haitian Creole`, `Hungarian`, `Armenian`, `Interlingua`, `Indonesian`, `Interlingue`, `Igbo`, `Ilocano`, `Ido`, `Icelandic`, `Italian`, `Japanese`, `Jamaican Patois`, `Lojban`, `Javanese`, `Georgian`, `Karakalpak`, `Kabyle`, `Kabardian`, `Kazakh`, `Khmer`, `Kannada`, `Korean`, `Komi-Permyak`, `Karachay-Balkar`, `Kölsch`, `Kurdish`, `Komi`, `Cornish`, `Kyrgyz`, `Latin`, `Ladino`, `Luxembourgish`, `Lezghian`, `Luganda`, `Limburgan`, `Ligurian`, `Lombard`, `Lingala`, `Lao`, `Northern Luri`, `Lithuanian`, `Latgalian`, `Latvian`, `Maithili`, `map-bms`, `Moksha`, `Malagasy`, `Meadow Mari`, `Maori`, `Minangkabau`, `Macedonian`, `Malayalam`, `Mongolian`, `Marathi`, `Hill Mari`, `Malay`, `Maltese`, `Mirandese`, `Burmese`, `Erzya`, `Mazanderani`, `Nahuatl`, `Neapolitan`, `Low German (Low Saxon)`, `nds-nl`, `Nepali`, `Newari`, `Dutch`, `Norwegian Nynorsk`, `Norwegian`, `Narom`, `Pedi`, `Navajo`, `Occitan`, `Livvi`, `Oromo`, `Odia (Oriya)`, `Ossetian`, `Punjabi (Eastern)`, 
`Pangasinan`, `Kapampangan`, `Papiamento`, `Picard`, `Pennsylvania German`, `Palatine German`, `Polish`, `Punjabi (Western)`, `Pashto`, `Portuguese`, `Quechua`, `Romansh`, `Romanian`, `roa-rup`, `roa-tara`, `Russian`, `Rusyn`, `Kinyarwanda`, `Sanskrit`, `Yakut`, `Sardinian`, `Sicilian`, `Sindhi`, `Northern Sami`, `Serbo-Croatian`, `Sinhala`, `Slovak`, `Slovenian`, `Shona`, `Somali`, `Albanian`, `Serbian`, `Sranan Tongo`, `Saterland Frisian`, `Sundanese`, `Swedish`, `Swahili`, `Silesian`, `Tamil`, `Tulu`, `Telugu`, `Tetun`, `Tajik`, `Thai`, `Turkmen`, `Tagalog`, `Setswana`, `Tongan`, `Turkish`, `Tatar`, `Tuvinian`, `Udmurt`, `Uyghur`, `Ukrainian`, `Urdu`, `Uzbek`, `Venetian`, `Veps`, `Vietnamese`, `Vlaams`, `Volapük`, `Walloon`, `Waray`, `Wolof`, `Shanghainese`, `Xhosa`, `Mingrelian`, `Yiddish`, `Yoruba`, `Zeeuws`, `Chinese`, `zh-classical`, `zh-min-nan`, `zh-yue`. ## Predicted Entities `ace`, `af`, `als`, `am`, `an`, `ang`, `ar`, `arz`, `as`, `ast`, `av`, `ay`, `az`, `azb`, `ba`, `bar`, `bat-smg`, `bcl`, `be`, `bg`, `bh`, `bjn`, `bn`, `bo`, `bpy`, `br`, `bs`, `bxr`, `ca`, `cbk-zam`, `cdo`, `ce`, `ceb`, `ckb`, `co`, `crh`, `cs`, `csb`, `cv`, `cy`, `da`, `de`, `diq`, `dsb`, `dty`, `dv`, `el`, `eml`, `en`, `eo`, `es`, `et`, `ext`, `fa`, `fi`, `fiu-vro`, `fo`, `fr`, `frp`, `fur`, `fy`, `ga`, `gag`, `gd`, `gl`, `glk`, `gn`, `gom`, `gu`, `gv`, `ha`, `hak`, `he`, `hi`, `hif`, `hr`, `hsb`, `ht`, `hu`, `hy`, `ia`, `id`, `ie`, `ig`, `ilo`, `io`, `is`, `it`, `ja`, `jam`, `jbo`, `jv`, `ka`, `kaa`, `kab`, `kbd`, `kk`, `km`, `kn`, `ko`, `koi`, `krc`, `ksh`, `ku`, `kv`, `kw`, `ky`, `la`, `lad`, `lb`, `lez`, `lg`, `li`, `lij`, `lmo`, `ln`, `lo`, `lrc`, `lt`, `ltg`, `lv`, `mai`, `map-bms`, `mdf`, `mg`, `mhr`, `mi`, `min`, `mk`, `ml`, `mn`, `mr`, `mrj`, `ms`, `mt`, `mwl`, `my`, `myv`, `mzn`, `nah`, `nap`, `nds`, `nds-nl`, `ne`, `new`, `nl`, `nn`, `no`, `nrm`, `nso`, `nv`, `oc`, `olo`, `om`, `or`, `os`, `pa`, `pag`, `pam`, `pap`, `pcd`, `pdc`, `pfl`, `pl`, `pnb`, `ps`, `pt`, `qu`, 
`rm`, `ro`, `roa-rup`, `roa-tara`, `ru`, `rue`, `rw`, `sa`, `sah`, `sc`, `scn`, `sd`, `se`, `sh`, `si`, `sk`, `sl`, `sn`, `so`, `sq`, `sr`, `srn`, `stq`, `su`, `sv`, `sw`, `szl`, `ta`, `tcy`, `te`, `tet`, `tg`, `th`, `tk`, `tl`, `tn`, `to`, `tr`, `tt`, `tyv`, `udm`, `ug`, `uk`, `ur`, `uz`, `vec`, `vep`, `vi`, `vls`, `vo`, `wa`, `war`, `wo`, `wuu`, `xh`, `xmf`, `yi`, `yo`, `zea`, `zh`, `zh-classical`, `zh-min-nan`, `zh-yue`. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ld_wiki_cnn_231_xx_2.7.0_2.4_1607183625658.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ld_wiki_cnn_231_xx_2.7.0_2.4_1607183625658.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... language_detector = LanguageDetectorDL.pretrained("ld_wiki_cnn_231", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("language") languagePipeline = Pipeline(stages=[documentAssembler, sentenceDetector, language_detector]) light_pipeline = LightPipeline(languagePipeline.fit(spark.createDataFrame([['']]).toDF("text"))) result = light_pipeline.fullAnnotate("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.") ``` ```scala ... val languageDetector = LanguageDetectorDL.pretrained("ld_wiki_cnn_231", "xx") .setInputCols("sentence") .setOutputCol("language") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, languageDetector)) val data = Seq("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala."] lang_df = nlu.load('xx.classify.wiki_231').predict(text, output_level='sentence') lang_df ```
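`LightPipeline.fullAnnotate` returns one dictionary per input text, mapping each output column to a list of annotation objects. Below is a minimal sketch of pulling the top language code and its score out of such a result; the `Annotation` class here is a hypothetical stand-in for the real `sparknlp.annotation.Annotation` (the `result` and `metadata` field names match the real API, but the scores are illustrative, not actual model output):

```python
# Stand-in for sparknlp.annotation.Annotation: the real objects expose
# .result (the predicted language code) and .metadata (per-language scores).
class Annotation:
    def __init__(self, result, metadata):
        self.result = result
        self.metadata = metadata

def top_language(annotated_row):
    """Return (code, confidence) from one fullAnnotate result dict."""
    ann = annotated_row["language"][0]  # one prediction per document/sentence
    return ann.result, float(ann.metadata[ann.result])

# Shape mirrors the 'fr' example in the Results section (scores made up).
row = {"language": [Annotation("fr", {"fr": "0.999", "en": "0.001"})]}
print(top_language(row))  # ('fr', 0.999)
```

The `metadata` dictionary holds a score per candidate language, so the same structure can also be used to inspect runner-up candidates.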
## Results ```bash 'fr' ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ld_wiki_cnn_231| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[language]| |Language:|xx| ## Data Source Wikipedia ## Benchmarking ```bash Evaluated on the Europarl dataset, which the model has never seen: +--------+-----+-------+------------------+ |src_lang|count|correct| precision| +--------+-----+-------+------------------+ | fr| 1000| 996| 0.996| | fi| 1000| 995| 0.995| | sv| 1000| 994| 0.994| | en| 1000| 991| 0.991| | pt| 1000| 988| 0.988| | de| 1000| 986| 0.986| | it| 1000| 982| 0.982| | es| 1000| 977| 0.977| | nl| 1000| 974| 0.974| | lt| 1000| 969| 0.969| | hu| 880| 850|0.9659090909090909| | lv| 916| 884|0.9650655021834061| | el| 1000| 965| 0.965| | pl| 914| 882|0.9649890590809628| | cs| 1000| 964| 0.964| | da| 1000| 963| 0.963| | et| 928| 892|0.9612068965517241| | bg| 1000| 954| 0.954| | sk| 1000| 945| 0.945| | ro| 784| 738|0.9413265306122449| | sl| 914| 850|0.9299781181619255| +--------+-----+-------+------------------+ +-------+--------------------+ |summary| precision| +-------+--------------------+ | count| 21| | mean| 0.9700702474999693| | stddev|0.018256955176991118| | min| 0.9299781181619255| | max| 0.996| +-------+--------------------+ ``` --- layout: model title: English image_classifier_vit_upside_down_classifier ViTForImageClassification from daveni author: John Snow Labs name: image_classifier_vit_upside_down_classifier date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_upside_down_classifier` is an English model 
originally trained by daveni. ## Predicted Entities `original`, `upside_down` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_upside_down_classifier_en_4.1.0_3.0_1660166128533.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_upside_down_classifier_en_4.1.0_3.0_1660166128533.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_upside_down_classifier", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_upside_down_classifier", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
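The model distinguishes the two classes `original` and `upside_down`, i.e. whether an image has been rotated 180°. As a side note, upside-down training examples for a classifier like this are commonly generated by rotating correctly oriented images; here is a toy, pure-Python sketch of that rotation on a nested-list "image" (illustrative only — the Spark NLP pipeline itself consumes an `imageDF` of real image files):

```python
def rotate_180(pixels):
    """Rotate a 2-D pixel grid by 180 degrees: reverse the row order,
    then reverse each row."""
    return [row[::-1] for row in pixels[::-1]]

img = [[1, 2],
       [3, 4]]
print(rotate_180(img))  # [[4, 3], [2, 1]]
```

Applying the rotation twice recovers the original grid, which is a handy sanity check when building such an augmented dataset.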
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_upside_down_classifier| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Telugu BertForMaskedLM Cased model (from neuralspace-reverie) author: John Snow Labs name: bert_embeddings_indic_transformers date: 2022-12-02 tags: [te, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: te edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `indic-transformers-te-bert` is a Telugu model originally trained by `neuralspace-reverie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_te_4.2.4_3.0_1670022427927.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_te_4.2.4_3.0_1670022427927.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","te") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","te") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
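Each annotation in the `embeddings` output column carries one fixed-length float vector per token (typically 768 dimensions for a BERT base model such as this one). A common downstream use is comparing tokens via cosine similarity; here is a small pure-Python sketch with made-up three-dimensional stand-in vectors (collecting the real vectors from `result` is omitted):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for per-token embedding vectors.
vec_love = [0.2, 0.8, 0.1]
vec_like = [0.25, 0.75, 0.05]
vec_rock = [0.9, -0.1, 0.4]

print(cosine_similarity(vec_love, vec_like))  # close to 1.0
print(cosine_similarity(vec_love, vec_rock))  # noticeably lower
```

Values near 1.0 indicate semantically similar tokens, which is the property the description of these embeddings relies on.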
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_indic_transformers| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|te| |Size:|611.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-te-bert - https://oscar-corpus.com/ --- layout: model title: English RobertaForQuestionAnswering Cased model (from skandaonsolve) author: John Snow Labs name: roberta_qa_finetuned_facility date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-finetuned-facility` is an English model originally trained by `skandaonsolve`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_facility_en_4.3.0_3.0_1674220319873.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_facility_en_4.3.0_3.0_1674220319873.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_facility","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_facility","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_finetuned_facility| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/skandaonsolve/roberta-finetuned-facility --- layout: model title: Named Entity Recognition Profiling (Clinical) author: John Snow Labs name: ner_profiling_clinical date: 2023-05-04 tags: [licensed, en, clinical, profiling, ner_profiling, ner] task: [Named Entity Recognition, Pipeline Healthcare] language: en edition: Healthcare NLP 4.4.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to explore all the available pretrained NER models at once. When you run this pipeline over your text, you will end up with the predictions coming out of each pretrained clinical NER model trained with `embeddings_clinical`. Since the previous version, new clinical NER models and their outputs have been added. 
Here are the NER models that this pretrained pipeline includes: `jsl_ner_wip_clinical`, `jsl_ner_wip_greedy_clinical`, `jsl_ner_wip_modifier_clinical`, `jsl_rd_ner_wip_greedy_clinical`, `ner_abbreviation_clinical`, `ner_ade_binary`, `ner_ade_clinical`, `ner_anatomy`, `ner_anatomy_coarse`, `ner_bacterial_species`, `ner_biomarker`, `ner_biomedical_bc2gm`, `ner_bionlp`, `ner_cancer_genetics`, `ner_cellular`, `ner_chemd_clinical`, `ner_chemicals`, `ner_chemprot_clinical`, `ner_chexpert`, `ner_clinical`, `ner_clinical_large`, `ner_clinical_trials_abstracts`, `ner_covid_trials`, `ner_deid_augmented`, `ner_deid_enriched`, `ner_deid_generic_augmented`, `ner_deid_large`, `ner_deid_sd`, `ner_deid_sd_large`, `ner_deid_subentity_augmented`, `ner_deid_subentity_augmented_i2b2`, `ner_deid_synthetic`, `ner_diseases`, `ner_diseases_large`, `ner_drugprot_clinical`, `ner_drugs`, `ner_drugs_greedy`, `ner_drugs_large`, `ner_eu_clinical_case`, `ner_eu_clinical_condition`, `ner_events_admission_clinical`, `ner_events_clinical`, `ner_financial_contract`, `ner_genetic_variants`, `ner_healthcare`, `ner_human_phenotype_gene_clinical`, `ner_human_phenotype_go_clinical`, `ner_jsl`, `ner_jsl_enriched`, `ner_jsl_slim`, `ner_living_species`, `ner_measurements_clinical`, `ner_medmentions_coarse`, `ner_nature_nero_clinical`, `ner_nihss`, `ner_oncology`, `ner_oncology_anatomy_general`, `ner_oncology_anatomy_granular`, `ner_oncology_biomarker`, `ner_oncology_demographics`, `ner_oncology_diagnosis`, `ner_oncology_posology`, `ner_oncology_response_to_treatment`, `ner_oncology_test`, `ner_oncology_therapy`, `ner_oncology_tnm`, `ner_oncology_unspecific_posology`, `ner_oncology_wip`, `ner_pathogen`, `ner_posology`, `ner_posology_experimental`, `ner_posology_greedy`, `ner_posology_large`, `ner_posology_small`, `ner_radiology`, `ner_radiology_wip_clinical`, `ner_risk_factors`, `ner_sdoh_access_to_healthcare_wip`, `ner_sdoh_community_condition_wip`, `ner_sdoh_demographics_wip`, 
`ner_sdoh_health_behaviours_problems_wip`, `ner_sdoh_income_social_status_wip`, `ner_sdoh_mentions`, `ner_sdoh_slim_wip`, `ner_sdoh_social_environment_wip`, `ner_sdoh_substance_usage_wip`, `ner_sdoh_wip`, `ner_supplement_clinical`, `ner_vop_anatomy_wip`, `ner_vop_clinical_dept_wip`, `ner_vop_demographic_wip`, `ner_vop_problem_reduced_wip`, `ner_vop_problem_wip`, `ner_vop_slim_wip`, `ner_vop_temporal_wip`, `ner_vop_test_wip`, `ner_vop_treatment_wip`, `ner_vop_wip`, `nerdl_tumour_demo` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_profiling_clinical_en_4.4.0_3.2_1683225723531.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_profiling_clinical_en_4.4.0_3.2_1683225723531.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline ner_profiling_pipeline = PretrainedPipeline("ner_profiling_clinical", "en", "clinical/models") result = ner_profiling_pipeline.annotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.""") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val ner_profiling_pipeline = PretrainedPipeline("ner_profiling_clinical", "en", "clinical/models") val result = ner_profiling_pipeline.annotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.""") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.profiling_clinical").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.""") ```
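Under the hood, each NER model in the pipeline emits token-level IOB tags, and the `(chunk, label)` pairs shown in the Results section come from merging consecutive `B-`/`I-` tags. Here is a minimal pure-Python sketch of that merge, using illustrative tokens and tags (the real tag sets vary per model):

```python
def merge_bio(tokens, tags):
    """Collapse IOB2 tags into (chunk, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # "O" (or a stray tag) ends any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["A", "28-year-old", "female", "with", "gestational", "diabetes", "mellitus"]
tags   = ["O", "B-Age", "B-Gender", "O", "B-Diabetes", "I-Diabetes", "I-Diabetes"]
print(merge_bio(tokens, tags))
# [('28-year-old', 'Age'), ('female', 'Gender'), ('gestational diabetes mellitus', 'Diabetes')]
```

In the actual pipeline output this merge is already done by the `NerConverter` stages; the sketch only shows the logic behind the grouped results.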
## Results ```bash ******************** ner_jsl Model Results ******************** [('28-year-old', 'Age'), ('female', 'Gender'), ('gestational diabetes mellitus', 'Diabetes'), ('eight years prior', 'RelativeDate'), ('subsequent', 'Modifier'), ('type two diabetes mellitus', 'Diabetes'), ('T2DM', 'Diabetes'), ('HTG-induced pancreatitis', 'Disease_Syndrome_Disorder'), ('three years prior', 'RelativeDate'), ('acute', 'Modifier'), ('hepatitis', 'Communicable_Disease'), ('obesity', 'Obesity'), ('body mass index', 'Symptom'), ('33.5 kg/m2', 'Weight'), ('one-week', 'Duration'), ('polyuria', 'Symptom'), ('polydipsia', 'Symptom'), ('poor appetite', 'Symptom'), ('vomiting', 'Symptom')] ******************** ner_diseases_large Model Results ******************** [('gestational diabetes mellitus', 'Disease'), ('diabetes mellitus', 'Disease'), ('T2DM', 'Disease'), ('pancreatitis', 'Disease'), ('hepatitis', 'Disease'), ('obesity', 'Disease'), ('polyuria', 'Disease'), ('polydipsia', 'Disease'), ('vomiting', 'Disease')] ******************** ner_radiology Model Results ******************** [('gestational diabetes mellitus', 'Disease_Syndrome_Disorder'), ('type two diabetes mellitus', 'Disease_Syndrome_Disorder'), ('T2DM', 'Disease_Syndrome_Disorder'), ('HTG-induced pancreatitis', 'Disease_Syndrome_Disorder'), ('acute hepatitis', 'Disease_Syndrome_Disorder'), ('obesity', 'Disease_Syndrome_Disorder'), ('body', 'BodyPart'), ('mass index', 'Symptom'), ('BMI', 'Test'), ('33.5', 'Measurements'), ('kg/m2', 'Units'), ('polyuria', 'Symptom'), ('polydipsia', 'Symptom'), ('poor appetite', 'Symptom'), ('vomiting', 'Symptom')] ******************** ner_clinical Model Results ******************** [('gestational diabetes mellitus', 'PROBLEM'), ('subsequent type two diabetes mellitus', 'PROBLEM'), ('T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index', 'PROBLEM'), ('BMI', 'TEST'), ('polyuria', 'PROBLEM'), 
('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')] ******************** ner_medmentions_coarse Model Results ******************** [('female', 'Organism_Attribute'), ('diabetes mellitus', 'Disease_or_Syndrome'), ('diabetes mellitus', 'Disease_or_Syndrome'), ('T2DM', 'Disease_or_Syndrome'), ('HTG-induced pancreatitis', 'Disease_or_Syndrome'), ('associated with', 'Qualitative_Concept'), ('acute hepatitis', 'Disease_or_Syndrome'), ('obesity', 'Disease_or_Syndrome'), ('body mass index', 'Clinical_Attribute'), ('BMI', 'Clinical_Attribute'), ('polyuria', 'Sign_or_Symptom'), ('polydipsia', 'Sign_or_Symptom'), ('poor appetite', 'Sign_or_Symptom'), ('vomiting', 'Sign_or_Symptom')] ... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_profiling_clinical| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|3.1 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - 
NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - 
MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - Finisher --- layout: model title: Named Entity Recognition (NER) Model in Norwegian (Norne 6B 300) author: John Snow Labs name: norne_6B_300 date: 2020-05-06 task: Named Entity Recognition language: "no" edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [ner, nn, nb, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Norne is a Named Entity Recognition (or NER) model for Norwegian, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. Norne 6B 300 is trained with GloVe 6B 300 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Derived-`DRV`, Product-`PROD`, Geo-political Entities Location-`GPE_LOC`, Geo-political Entities Organization-`GPE_ORG`, Event-`EVT`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_NO/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_NO.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/norne_6B_300_no_2.5.0_2.4_1588781290264.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/norne_6B_300_no_2.5.0_2.4_1588781290264.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained('glove_6B_300', lang='xx') \ .setInputCols(['document', 'token']) \ .setOutputCol('embeddings') ner_model = NerDLModel.pretrained("norne_6B_300", "no") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, programvareutvikler, investor og filantrop. Han er mest kjent som medgründer av Microsoft Corporation. I løpet av sin karriere hos Microsoft hadde Gates stillingene som styreleder, administrerende direktør (CEO), president og sjef programvarearkitekt, samtidig som han var den største individuelle aksjonæren fram til mai 2014. Han er en av de mest kjente gründere og pionerene i mikrodatarevolusjon på 1970- og 1980-tallet. Han er født og oppvokst i Seattle, Washington, og grunnla Microsoft sammen med barndomsvennen Paul Allen i 1975, i Albuquerque, New Mexico; det fortsatte å bli verdens største programvare for datamaskinprogramvare. Gates ledet selskapet som styreleder og administrerende direktør til han gikk av som konsernsjef i januar 2000, men han forble styreleder og ble sjef for programvarearkitekt. I løpet av slutten av 1990-tallet hadde Gates blitt kritisert for sin forretningstaktikk, som har blitt ansett som konkurransedyktig. Denne uttalelsen er opprettholdt av en rekke dommer. I juni 2006 kunngjorde Gates at han skulle gå over til en deltidsrolle hos Microsoft og på heltid ved Bill & Melinda Gates Foundation, den private veldedige stiftelsen som han og kona, Melinda Gates, opprettet i 2000. [ 9] Han overførte gradvis arbeidsoppgavene sine til Ray Ozzie og Craig Mundie. 
Han trakk seg som styreleder for Microsoft i februar 2014 og tiltrådte et nytt verv som teknologirådgiver for å støtte den nyutnevnte administrerende direktøren Satya Nadella.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang="xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("norne_6B_300", "no") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, programvareutvikler, investor og filantrop. Han er mest kjent som medgründer av Microsoft Corporation. I løpet av sin karriere hos Microsoft hadde Gates stillingene som styreleder, administrerende direktør (CEO), president og sjef programvarearkitekt, samtidig som han var den største individuelle aksjonæren fram til mai 2014. Han er en av de mest kjente gründere og pionerene i mikrodatarevolusjon på 1970- og 1980-tallet. Han er født og oppvokst i Seattle, Washington, og grunnla Microsoft sammen med barndomsvennen Paul Allen i 1975, i Albuquerque, New Mexico; det fortsatte å bli verdens største programvare for datamaskinprogramvare. Gates ledet selskapet som styreleder og administrerende direktør til han gikk av som konsernsjef i januar 2000, men han forble styreleder og ble sjef for programvarearkitekt. I løpet av slutten av 1990-tallet hadde Gates blitt kritisert for sin forretningstaktikk, som har blitt ansett som konkurransedyktig. Denne uttalelsen er opprettholdt av en rekke dommer. I juni 2006 kunngjorde Gates at han skulle gå over til en deltidsrolle hos Microsoft og på heltid ved Bill & Melinda Gates Foundation, den private veldedige stiftelsen som han og kona, Melinda Gates, opprettet i 2000. 
[ 9] Han overførte gradvis arbeidsoppgavene sine til Ray Ozzie og Craig Mundie. Han trakk seg som styreleder for Microsoft i februar 2014 og tiltrådte et nytt verv som teknologirådgiver for å støtte den nyutnevnte administrerende direktøren Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, programvareutvikler, investor og filantrop. Han er mest kjent som medgründer av Microsoft Corporation. I løpet av sin karriere hos Microsoft hadde Gates stillingene som styreleder, administrerende direktør (CEO), president og sjef programvarearkitekt, samtidig som han var den største individuelle aksjonæren fram til mai 2014. Han er en av de mest kjente gründere og pionerene i mikrodatarevolusjon på 1970- og 1980-tallet. Han er født og oppvokst i Seattle, Washington, og grunnla Microsoft sammen med barndomsvennen Paul Allen i 1975, i Albuquerque, New Mexico; det fortsatte å bli verdens største programvare for datamaskinprogramvare. Gates ledet selskapet som styreleder og administrerende direktør til han gikk av som konsernsjef i januar 2000, men han forble styreleder og ble sjef for programvarearkitekt. I løpet av slutten av 1990-tallet hadde Gates blitt kritisert for sin forretningstaktikk, som har blitt ansett som konkurransedyktig. Denne uttalelsen er opprettholdt av en rekke dommer. I juni 2006 kunngjorde Gates at han skulle gå over til en deltidsrolle hos Microsoft og på heltid ved Bill & Melinda Gates Foundation, den private veldedige stiftelsen som han og kona, Melinda Gates, opprettet i 2000. Han overførte gradvis arbeidsoppgavene sine til Ray Ozzie og Craig Mundie. 
Han trakk seg som styreleder for Microsoft i februar 2014 og tiltrådte et nytt verv som teknologirådgiver for å støtte den nyutnevnte administrerende direktøren Satya Nadella."""] ner_df = nlu.load('no.ner.norne.glove.6B_300').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
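The `ner_converter` stage in the pipeline above merges the token-level IOB tags produced by the NER model into entity chunks. A minimal plain-Python sketch of that merging step (toy tokens and tags, not the Spark NLP implementation):

```python
# Toy sketch of what a NerConverter-style step does: merge IOB tags
# emitted by the NER model into (chunk_text, label) pairs.
def iob_to_chunks(tokens, tags):
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = [tok, tag[2:]]
        elif tag.startswith("I-") and current and tag[2:] == current[1]:
            current[0] += " " + tok
        else:  # "O" tag or inconsistent continuation closes the open chunk
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(text, label) for text, label in chunks]

tokens = ["William", "Henry", "Gates", "III", "grunnla", "Microsoft"]
tags = ["B-PER", "I-PER", "I-PER", "I-PER", "O", "B-ORG"]
print(iob_to_chunks(tokens, tags))
# [('William Henry Gates III', 'PER'), ('Microsoft', 'ORG')]
```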
{:.h2_title} ## Results ```bash +-------------------------------+---------+ |chunk |ner_label| +-------------------------------+---------+ |William Henry Gates III |PER | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PER | |CEO |ORG | |Seattle |GPE_LOC | |Washington |GPE_LOC | |Microsoft |ORG | |Paul Allen |PER | |Albuquerque |GPE_LOC | |New Mexico |GPE_LOC | |Gates |PER | |Gates |PER | |Gates |PER | |Microsoft |ORG | |Bill & Melinda Gates Foundation|ORG | |Melinda Gates |PER | |Ray Ozzie |PER | |Craig Mundie |PER | |Microsoft |ORG | +-------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|norne_6B_300| |Type:|ner| |Compatibility:| Spark NLP 2.5.0+| |Edition:|Official| |License:|Open Source| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|no| |Case sensitive:|false| {:.h2_title} ## Data Source Detailed information can be found in [https://www.aclweb.org/anthology/2020.lrec-1.559.pdf](https://www.aclweb.org/anthology/2020.lrec-1.559.pdf) --- layout: model title: Clinical Deidentification (Italian) author: John Snow Labs name: clinical_deidentification date: 2023-06-13 tags: [deidentification, pipeline, it, licensed] task: De-identification language: it edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to de-identify protected health information (PHI) in Italian medical texts. The pipeline can mask and obfuscate the following entities: `DATE`, `AGE`, `SEX`, `PROFESSION`, `ORGANIZATION`, `PHONE`, `E-MAIL`, `ZIP`, `STREET`, `CITY`, `COUNTRY`, `PATIENT`, `DOCTOR`, `HOSPITAL`, `MEDICALRECORD`, `SSN`, `IDNUM`, `ACCOUNT`, `PLATE`, `USERNAME`, `URL`, and `IPADDR`. 
## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_it_4.4.4_3.2_1686664266856.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_it_4.4.4_3.2_1686664266856.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "it", "clinical/models") sample = """RAPPORTO DI RICOVERO NOME: Lodovico Fibonacci CODICE FISCALE: MVANSK92F09W408A INDIRIZZO: Viale Burcardo 7 CITTÀ : Napoli CODICE POSTALE: 80139 DATA DI NASCITA: 03/03/1946 ETÀ: 70 anni SESSO: M EMAIL: lpizzo@tim.it DATA DI AMMISSIONE: 12/12/2016 DOTTORE: Eva Viviani RAPPORTO CLINICO: 70 anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno. È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale. L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV. L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale. L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml. INDIRIZZATO A: Dott. 
Bruno Ferrabosco - ASL Napoli 1 Centro, Dipartimento di Endocrinologia e Nutrizione - Stretto Scamarcio 320, 80138 Napoli EMAIL: bferrabosco@poste.it""" result = deid_pipeline.annotate(sample) ``` ```scala val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "it", "clinical/models") val sample = "RAPPORTO DI RICOVERO NOME: Lodovico Fibonacci CODICE FISCALE: MVANSK92F09W408A INDIRIZZO: Viale Burcardo 7 CITTÀ : Napoli CODICE POSTALE: 80139 DATA DI NASCITA: 03/03/1946 ETÀ: 70 anni SESSO: M EMAIL: lpizzo@tim.it DATA DI AMMISSIONE: 12/12/2016 DOTTORE: Eva Viviani RAPPORTO CLINICO: 70 anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno. È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale. L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV. L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale. L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml. INDIRIZZATO A: Dott. 
Bruno Ferrabosco - ASL Napoli 1 Centro, Dipartimento di Endocrinologia e Nutrizione - Stretto Scamarcio 320, 80138 Napoli EMAIL: bferrabosco@poste.it" val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("it.deid.clinical").predict("""RAPPORTO DI RICOVERO NOME: Lodovico Fibonacci CODICE FISCALE: MVANSK92F09W408A INDIRIZZO: Viale Burcardo 7 CITTÀ : Napoli CODICE POSTALE: 80139 DATA DI NASCITA: 03/03/1946 ETÀ: 70 anni SESSO: M EMAIL: lpizzo@tim.it DATA DI AMMISSIONE: 12/12/2016 DOTTORE: Eva Viviani RAPPORTO CLINICO: 70 anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno. È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale. L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV. L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale. L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml. INDIRIZZATO A: Dott. Bruno Ferrabosco - ASL Napoli 1 Centro, Dipartimento di Endocrinologia e Nutrizione - Stretto Scamarcio 320, 80138 Napoli EMAIL: bferrabosco@poste.it""") ```
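The pipeline produces four de-identified variants of the input, shown in the Results section: entity-label masks, same-length character masks, fixed-length character masks, and obfuscation with realistic surrogate values. A toy plain-Python sketch of these masking policies only (the entity spans and surrogate values below are made up; the real pipeline detects spans with NER, contextual parsers, and regex matchers):

```python
# Toy illustration of the four masking policies applied by de-identification
# pipelines. Spans and surrogate values are hypothetical, not model output.

def mask(text, spans, policy):
    """spans: list of (start, end, label); policy: one of 'entity_labels',
    'same_length_chars', 'fixed_length_chars', 'obfuscate'."""
    fakes = {"DOCTOR": "Mario Rossi", "CITY": "Milano"}  # assumed surrogates
    out, last = [], 0
    for start, end, label in sorted(spans):
        out.append(text[last:start])
        chunk = text[start:end]
        if policy == "entity_labels":
            out.append(f"<{label}>")
        elif policy == "same_length_chars":  # same visual length as the chunk
            out.append("[" + "*" * max(len(chunk) - 2, 1) + "]")
        elif policy == "fixed_length_chars":
            out.append("****")
        else:  # obfuscate: replace with a plausible fake of the same type
            out.append(fakes.get(label, "****"))
        last = end
    out.append(text[last:])
    return "".join(out)

text = "DOTTORE: Eva Viviani CITTÀ: Napoli"
spans = [(9, 20, "DOCTOR"), (28, 34, "CITY")]
print(mask(text, spans, "entity_labels"))
# DOTTORE: <DOCTOR> CITTÀ: <CITY>
```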
## Results ```bash Results Masked with entity labels ------------------------------ RAPPORTO DI RICOVERO NOME: CODICE FISCALE: INDIRIZZO: CITTÀ : CODICE POSTALE: DATA DI NASCITA: ETÀ: anni SESSO: EMAIL: DATA DI AMMISSIONE: DOTTORE: RAPPORTO CLINICO: anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno. È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale. L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV. L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale. L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml. INDIRIZZATO A: Dott. - , Dipartimento di Endocrinologia e Nutrizione - , EMAIL: Masked with chars ------------------------------ RAPPORTO DI RICOVERO NOME: [****************] CODICE FISCALE: [**************] INDIRIZZO: [**************] CITTÀ : [****] CODICE POSTALE: [***]DATA DI NASCITA: [********] ETÀ: **anni SESSO: * EMAIL: [***********] DATA DI AMMISSIONE: [********] DOTTORE: [*********] RAPPORTO CLINICO: **anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno. 
È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale. L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV. L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale. L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml. INDIRIZZATO A: Dott. [**************] - [*****************], Dipartimento di Endocrinologia e Nutrizione - [*******************], [***] [****] EMAIL: [******************] Masked with fixed length chars ------------------------------ RAPPORTO DI RICOVERO NOME: **** CODICE FISCALE: **** INDIRIZZO: **** CITTÀ : **** CODICE POSTALE: ****DATA DI NASCITA: **** ETÀ: **** anni SESSO: **** EMAIL: **** DATA DI AMMISSIONE: **** DOTTORE: **** RAPPORTO CLINICO: **** anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno. È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale. L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV. L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale. 
L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml. INDIRIZZATO A: Dott. **** - ****, Dipartimento di Endocrinologia e Nutrizione - ****, **** **** EMAIL: **** Obfuscated ------------------------------ RAPPORTO DI RICOVERO NOME: Scotto-Polani CODICE FISCALE: ECI-QLN77G15L455Y INDIRIZZO: Viale Orlando 808 CITTÀ : Sesto Raimondo CODICE POSTALE: 53581DATA DI NASCITA: 09/03/1946 ETÀ: 5 anni SESSO: U EMAIL: HenryWatson@world.com DATA DI AMMISSIONE: 10/01/2017 DOTTORE: Sig. Fredo Marangoni RAPPORTO CLINICO: 5 anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno. È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale. L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV. L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale. L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml. INDIRIZZATO A: Dott. Antonio Rusticucci - ASL 7 DI CARBONIA AZIENDA U.S.L. N. 
7, Dipartimento di Endocrinologia e Nutrizione - Via Giorgio 0 Appartamento 26, 03461 Sesto Raimondo EMAIL: murat.g@jsl.com ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|it| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ContextualParserModel - ContextualParserModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: Pipeline to Map SNOMED Codes to Their Corresponding ICD10-CM Codes author: John Snow Labs name: snomed_icd10cm_mapping date: 2022-06-27 tags: [pipeline, snomed, icd10cm, chunk_mapper, clinical, licensed, en] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the `snomed_icd10cm_mapper` model. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/snomed_icd10cm_mapping_en_3.5.3_3.0_1656363315439.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/snomed_icd10cm_mapping_en_3.5.3_3.0_1656363315439.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("snomed_icd10cm_mapping", "en", "clinical/models") result = pipeline.fullAnnotate("128041000119107 292278006 293072005") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("snomed_icd10cm_mapping", "en", "clinical/models") val result = pipeline.fullAnnotate("128041000119107 292278006 293072005") ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.snomed_to_icd10cm.pipe").predict("""128041000119107 292278006 293072005""") ```
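At its core, a chunk-mapper pipeline resolves each recognized code through a lookup table. A minimal plain-Python sketch of that idea (the three code pairs are taken from this card's Results table; the pretrained model carries the full mapping):

```python
# Minimal sketch of a SNOMED -> ICD-10-CM chunk-mapper lookup.
# The three pairs come from this card's Results table; the pretrained
# model contains the complete mapping.
snomed_to_icd10cm = {
    "128041000119107": "K22.70",
    "292278006": "T43.595",
    "293072005": "T37.1X5",
}

def map_codes(text):
    # Tokenize on whitespace and map each recognized SNOMED code.
    return [snomed_to_icd10cm.get(tok, "NONE") for tok in text.split()]

print(map_codes("128041000119107 292278006 293072005"))
# ['K22.70', 'T43.595', 'T37.1X5']
```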
## Results ```bash | | snomed_code | icd10cm_code | |---:|:----------------------------------------|:---------------------------| | 0 | 128041000119107 | 292278006 | 293072005 | K22.70 | T43.595 | T37.1X5 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|snomed_icd10cm_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.5 MB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel --- layout: model title: Translate Lingala to English Pipeline author: John Snow Labs name: translate_ln_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ln, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. Use of an accelerator such as a GPU is recommended. 
- source languages: `ln` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ln_en_xx_2.7.0_2.4_1609698622095.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ln_en_xx_2.7.0_2.4_1609698622095.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ln_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ln_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ln.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ln_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Dutch DistilBERT Embeddings (from Geotrend) author: John Snow Labs name: distilbert_embeddings_distilbert_base_nl_cased date: 2022-04-12 tags: [distilbert, embeddings, nl, open_source] task: Embeddings language: nl edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-nl-cased` is a Dutch model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_nl_cased_nl_3.4.2_3.0_1649783996172.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_nl_cased_nl_3.4.2_3.0_1649783996172.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_nl_cased","nl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ik hou van vonk nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_nl_cased","nl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ik hou van vonk nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("nl.embed.distilbert_base_cased").predict("""Ik hou van vonk nlp""") ```
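Token embeddings produced this way are typically compared with cosine similarity. A self-contained sketch using toy 3-dimensional vectors in place of the real 768-dimensional DistilBERT outputs (the words and vectors are illustrative only):

```python
import math

# Cosine similarity between two equal-length vectors.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 3-d vectors standing in for 768-d DistilBERT token embeddings;
# "hond"/"kat" are assumed closer to each other than to "fiets".
hond, kat, fiets = [0.9, 0.1, 0.2], [0.8, 0.2, 0.3], [0.1, 0.9, 0.1]
print(cosine(hond, kat) > cosine(hond, fiets))  # True
```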
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_base_nl_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|nl| |Size:|229.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/distilbert-base-nl-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Legal Agreement and Plan of Reorganization Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_agreement_and_plan_of_reorganization_bert date: 2022-12-06 tags: [en, legal, classification, agreement, plan, reorganization, bert, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_agreement_and_plan_of_reorganization_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `agreement-and-plan-of-reorganization` or not (binary classification). Compared with the Longformer-based model, this model is lighter and faster at inference. ## Predicted Entities `agreement-and-plan-of-reorganization`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_agreement_and_plan_of_reorganization_bert_en_1.0.0_3.0_1670349241846.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_agreement_and_plan_of_reorganization_bert_en_1.0.0_3.0_1670349241846.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_agreement_and_plan_of_reorganization_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
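The classifier maps a single sentence-embedding vector to one of the two labels. As a rough illustration of that decision only (toy 2-d vectors and centroids, not the trained ClassifierDL weights), a nearest-centroid sketch:

```python
import math

# Hypothetical 2-d class centroids standing in for the learned decision
# boundary over sentence embeddings.
centroids = {
    "agreement-and-plan-of-reorganization": [0.8, 0.2],
    "other": [0.1, 0.9],
}

def classify(embedding):
    # Pick the label whose centroid is closest in Euclidean distance.
    return min(centroids, key=lambda label: math.dist(embedding, centroids[label]))

print(classify([0.7, 0.3]))  # agreement-and-plan-of-reorganization
```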
## Results ```bash +-------+ |result| +-------+ |[agreement-and-plan-of-reorganization]| |[other]| |[other]| |[agreement-and-plan-of-reorganization]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_agreement_and_plan_of_reorganization_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support agreement-and-plan-of-reorganization 1.00 1.00 1.00 31 other 1.00 1.00 1.00 35 accuracy - - 1.00 66 macro-avg 1.00 1.00 1.00 66 weighted-avg 1.00 1.00 1.00 66 ``` --- layout: model title: Arabic BertForMaskedLM Base Cased model (from aubmindlab) author: John Snow Labs name: bert_embeddings_base_arabertv2 date: 2022-12-02 tags: [ar, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ar edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-arabertv2` is an Arabic model originally trained by `aubmindlab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabertv2_ar_4.2.4_3.0_1670015875947.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabertv2_ar_4.2.4_3.0_1670015875947.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabertv2","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabertv2","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_arabertv2| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|507.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/aubmindlab/bert-base-arabertv2 - https://github.com/google-research/bert - https://arxiv.org/abs/2003.00104 - https://github.com/WissamAntoun/pydata_khobar_meetup - http://alt.qcri.org/farasa/segmenter.html - https://github.com/google-research/bert/blob/master/multilingual.md - https://github.com/elnagara/HARD-Arabic-Dataset - https://www.aclweb.org/anthology/D15-1299 - https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf - https://github.com/mohamedadaly/LABR - http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp - https://github.com/husseinmozannar/SOQAL - https://github.com/aub-mind/arabert/blob/master/AraBERT/README.md - https://arxiv.org/abs/2003.00104v2 - https://archive.org/details/arwiki-20190201 - https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4 - https://www.aclweb.org/anthology/W19-4619 - https://sites.aub.edu.lb/mindlab/ - https://www.yakshof.com/#/ - https://www.behance.net/rahalhabib - https://www.linkedin.com/in/wissam-antoun-622142b4/ - https://twitter.com/wissam_antoun - https://github.com/WissamAntoun - https://www.linkedin.com/in/fadybaly/ - https://twitter.com/fadybaly - https://github.com/fadybaly --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_256_finetuned_squad_seed_6 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: 
tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-256-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_256_finetuned_squad_seed_6_en_4.3.0_3.0_1674214999722.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_256_finetuned_squad_seed_6_en_4.3.0_3.0_1674214999722.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_256_finetuned_squad_seed_6","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_256_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
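Under the hood, extractive QA models like this one score each token as a potential answer start and end; the predicted answer is the highest-scoring valid span (start before end). A toy sketch of that selection step, with hand-made logits rather than real model output:

```python
def best_span(start_logits, end_logits, max_len=30):
    # Pick the (start, end) token pair maximizing start+end score, with start <= end.
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.0, 0.2, 3.0, 0.1, 0.0, 0.0, 0.1, 0.5, 0.0]
end_logits   = [0.0, 0.1, 0.0, 2.5, 0.2, 0.0, 0.1, 0.0, 0.4, 0.1]
s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```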
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_256_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|426.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-256-finetuned-squad-seed-6 --- layout: model title: English DistilBertForQuestionAnswering model (from huxxx657) author: John Snow Labs name: distilbert_qa_huxxx657_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_huxxx657_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725541025.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_huxxx657_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725541025.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_huxxx657_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_huxxx657_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_huxxx657").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
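The NLU one-liner above packs the question and context into a single string separated by `|||`. A sketch of splitting such an input back into its two parts (the separator convention comes from the snippet above; the helper name is ours):

```python
def split_qa_input(text, sep="|||"):
    # Split an NLU-style "question|||context" string into question and context.
    question, _, context = text.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa_input("What is my name?|||My name is Clara and I live in Berkeley.")
print(q)  # -> What is my name?
print(c)  # -> My name is Clara and I live in Berkeley.
```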
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_huxxx657_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/huxxx657/distilbert-base-uncased-finetuned-squad --- layout: model title: Pipeline to Detect Genes and Human Phenotypes (biobert) author: John Snow Labs name: ner_human_phenotype_gene_biobert_pipeline date: 2023-03-20 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_human_phenotype_gene_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_human_phenotype_gene_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_biobert_pipeline_en_4.3.0_3.2_1679315678860.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_biobert_pipeline_en_4.3.0_3.2_1679315678860.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_human_phenotype_gene_biobert_pipeline", "en", "clinical/models") text = '''Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_human_phenotype_gene_biobert_pipeline", "en", "clinical/models") val text = "Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3)." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.human_phenotype_gene_biobert.pipeline").predict("""Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).""") ```
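`fullAnnotate` returns a list of annotation objects per input; the results table in the next section is typically built by pulling the chunk text, character offsets, and entity label out of each `ner_chunk` annotation. A sketch over a mocked annotation structure (field names follow Spark NLP's Annotation, but the data here is hand-made from the expected output):

```python
# Mocked fullAnnotate-style chunk annotations: result text, offsets, and metadata.
mock_chunks = [
    {"result": "polyhydramnios", "begin": 75, "end": 88,
     "metadata": {"entity": "HP", "confidence": "0.9949"}},
    {"result": "polyuria", "begin": 91, "end": 98,
     "metadata": {"entity": "HP", "confidence": "0.9955"}},
]

def to_rows(chunks):
    # Flatten chunk annotations into (text, begin, end, label) tuples for tabulation.
    return [(c["result"], c["begin"], c["end"], c["metadata"]["entity"]) for c in chunks]

for row in to_rows(mock_chunks):
    print(row)
```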
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-----------------|--------:|------:|:------------|-------------:| | 0 | type | 29 | 32 | GENE | 0.9977 | | 1 | polyhydramnios | 75 | 88 | HP | 0.9949 | | 2 | polyuria | 91 | 98 | HP | 0.9955 | | 3 | nephrocalcinosis | 101 | 116 | HP | 0.995 | | 4 | hypokalemia | 122 | 132 | HP | 0.9986 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_human_phenotype_gene_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.2 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Classify first part of agreements (Parties, Agreement type) author: John Snow Labs name: legclf_introduction_clause_cuad date: 2022-11-25 tags: [en, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the first part of a legal agreement, where `PARTIES`, `AGREEMENT TYPE` and `ALIASES` or `ROLES` of the parties are described. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. 
If you have big legal documents, and you want to look for clauses, we recommend you split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `introduction`, `other` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_introduction_clause_cuad_en_1.0.0_3.0_1669371916085.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_introduction_clause_cuad_en_1.0.0_3.0_1669371916085.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_introduction_clause_cuad", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
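The description above recommends splitting long documents into paragraphs before classification, since the sentence embeddings accept at most 512 tokens. A minimal sketch of paragraph splitting by blank lines, one of the techniques the linked tutorial covers (the sample agreement text is illustrative):

```python
import re

def split_paragraphs(text):
    # Split on runs of blank lines and drop empty fragments.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("THIS AGREEMENT is made by and between the parties named below...\n\n"
       "1. DEFINITIONS\nTerms used herein shall have the meanings set forth...\n\n"
       "2. TERM\nThis Agreement shall remain in effect until terminated...")
paragraphs = split_paragraphs(doc)
print(len(paragraphs))  # -> 3
```

Each resulting paragraph can then be fed to the classifier as a separate row in the `clause_text` column.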
## Results ```bash +-------+ | result| +-------+ |[introduction]| |[other]| |[other]| |[introduction]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_introduction_clause_cuad| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support introduction 1.00 0.98 0.99 99 other 0.99 1.00 0.99 151 accuracy - - 0.99 250 macro-avg 0.99 0.99 0.99 250 weighted-avg 0.99 0.99 0.99 250 ``` --- layout: model title: Extract Entities Related to TNM Staging author: John Snow Labs name: ner_oncology_tnm date: 2022-11-24 tags: [licensed, en, clinical, oncology, ner] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts staging information and mentions related to tumors, lymph nodes and metastases. Definitions of Predicted Entities: - `Cancer_Dx`: Mentions of cancer diagnoses (such as "breast cancer") or pathological types that are usually used as synonyms for "cancer" (e.g. "carcinoma"). When anatomical references are present, they are included in the Cancer_Dx extraction. - `Lymph_Node`: Mentions of lymph nodes and pathological findings of the lymph nodes. - `Lymph_Node_Modifier`: Words that refer to a lymph node being abnormal (such as "enlargement"). - `Metastasis`: Terms that indicate a metastatic disease. Anatomical references are not included in these extractions. - `Staging`: Mentions of cancer stage such as "stage 2b" or "T2N1M0". It also includes words such as "in situ", "early-stage" or "advanced". 
- `Tumor`: All nonspecific terms that may be related to tumors, either malignant or benign (for example: "mass", "tumor", "lesion", or "neoplasm"). - `Tumor_Description`: Information related to tumor characteristics, such as size, presence of invasion, grade and histological type. ## Predicted Entities `Cancer_Dx`, `Lymph_Node`, `Lymph_Node_Modifier`, `Metastasis`, `Staging`, `Tumor`, `Tumor_Description` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_tnm_en_4.2.2_3.0_1669308699155.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_tnm_en_4.2.2_3.0_1669308699155.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_tnm", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The final diagnosis was metastatic breast carcinoma, and it was classified as T2N1M1 stage IV. 
The histological grade of this 4 cm tumor was grade 2."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_tnm", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The final diagnosis was metastatic breast carcinoma, and it was classified as T2N1M1 stage IV. The histological grade of this 4 cm tumor was grade 2.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_tnm").predict("""The final diagnosis was metastatic breast carcinoma, and it was classified as T2N1M1 stage IV. The histological grade of this 4 cm tumor was grade 2.""") ```
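Staging chunks like `T2N1M1` follow the TNM convention (tumor extent, node involvement, metastasis). A small regex sketch that breaks such a code into its components for post-processing (illustrative only; the NER model itself returns the whole chunk as a single `Staging` entity):

```python
import re

# T0-T4/Tx, N0-N3/Nx, M0-M1/Mx, per the common TNM shorthand.
TNM_PATTERN = re.compile(r"T(?P<T>[0-4x])N(?P<N>[0-3x])M(?P<M>[01x])", re.IGNORECASE)

def parse_tnm(code):
    # Return the T/N/M components of a TNM staging code, or None if absent.
    m = TNM_PATTERN.search(code)
    return m.groupdict() if m else None

print(parse_tnm("T2N1M1"))   # -> {'T': '2', 'N': '1', 'M': '1'}
print(parse_tnm("stage IV")) # -> None
```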
## Results ```bash | chunk | ner_label | |:-----------------|:------------------| | metastatic | Metastasis | | breast carcinoma | Cancer_Dx | | T2N1M1 stage IV | Staging | | 4 cm | Tumor_Description | | tumor | Tumor | | grade 2 | Tumor_Description | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_tnm| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.2 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Lymph_Node 570 77 77 647 0.88 0.88 0.88 Staging 232 22 26 258 0.91 0.90 0.91 Lymph_Node_Modifier 30 5 5 35 0.86 0.86 0.86 Tumor_Description 2651 581 490 3141 0.82 0.84 0.83 Tumor 1116 72 141 1257 0.94 0.89 0.91 Metastasis 358 15 12 370 0.96 0.97 0.96 Cancer_Dx 1302 87 92 1394 0.94 0.93 0.94 macro_avg 6259 859 843 7102 0.90 0.90 0.90 micro_avg 6259 859 843 7102 0.88 0.88 0.88 ``` --- layout: model title: English BertForQuestionAnswering model (from aodiniz) author: John Snow Labs name: bert_qa_bert_uncased_L_4_H_512_A_8_squad2_covid_qna date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-4_H-512_A-8_squad2_covid-qna` is an English model originally trained by `aodiniz`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_512_A_8_squad2_covid_qna_en_4.0.0_3.0_1654185314705.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_512_A_8_squad2_covid_qna_en_4.0.0_3.0_1654185314705.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_4_H_512_A_8_squad2_covid_qna","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_uncased_L_4_H_512_A_8_squad2_covid_qna","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_covid.bert.uncased_4l_512d_a8a_512d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
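The Scala snippet above sets a max sentence length of 512; contexts longer than that must be windowed before inference. A minimal sliding-window sketch over token lists (window and stride values here are illustrative, not the model's built-in behavior):

```python
def window_tokens(tokens, window=512, stride=128):
    # Yield overlapping windows so long contexts fit the model's max length.
    if len(tokens) <= window:
        return [tokens]
    out = []
    start = 0
    while start < len(tokens):
        out.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += window - stride  # advance by window minus overlap
    return out

chunks = window_tokens(list(range(1000)), window=512, stride=128)
print(len(chunks))  # -> 3 overlapping windows covering all 1000 tokens
```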
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_uncased_L_4_H_512_A_8_squad2_covid_qna| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|107.2 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-4_H-512_A-8_squad2_covid-qna --- layout: model title: English asr_wav2vec2_base_demo_colab_by_thyagosme TFWav2Vec2ForCTC from thyagosme author: John Snow Labs name: asr_wav2vec2_base_demo_colab_by_thyagosme date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_demo_colab_by_thyagosme` is an English model originally trained by thyagosme. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_demo_colab_by_thyagosme_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_demo_colab_by_thyagosme_en_4.2.0_3.0_1664107996154.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_demo_colab_by_thyagosme_en_4.2.0_3.0_1664107996154.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_demo_colab_by_thyagosme", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_demo_colab_by_thyagosme", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
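The snippets above assume an `audioDf` whose `audio_content` column already holds the raw waveform as an array of floats. Wav2vec2 models expect mono samples normalized to [-1, 1]; a sketch of scaling signed 16-bit PCM integers into that range (how you decode the file itself — librosa, soundfile, etc. — is up to you):

```python
def pcm16_to_float(samples):
    # Scale signed 16-bit PCM integers into [-1.0, 1.0] floats.
    return [s / 32768.0 for s in samples]

pcm = [0, 16384, -16384, 32767, -32768]
floats = pcm16_to_float(pcm)
print(floats)  # 0.0, 0.5, -0.5, ~1.0, -1.0
```

The resulting float list is the shape of value that `AudioAssembler` reads from the `audio_content` column.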
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_demo_colab_by_thyagosme| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|354.9 MB| --- layout: model title: Voice of the Patients (embeddings_clinical_medium) author: John Snow Labs name: ner_vop_emb_clinical_medium_wip date: 2023-04-12 tags: [licensed, clinical, en, ner, vop, patient] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.0 spark_version: [3.0, 3.2] supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts healthcare-related terms from documents written in the patient’s own words. Note: the ‘wip’ suffix indicates that model development is work-in-progress; the model will be finalised and its performance improved in upcoming releases. ## Predicted Entities `Allergen`, `SubstanceQuantity`, `RaceEthnicity`, `Measurements`, `InjuryOrPoisoning`, `Treatment`, `Modifier`, `TestResult`, `MedicalDevice`, `Vaccine`, `Frequency`, `HealthStatus`, `Route`, `RelationshipStatus`, `Procedure`, `Duration`, `DateTime`, `AdmissionDischarge`, `Disease`, `Test`, `Substance`, `Laterality`, `Symptom`, `ClinicalDept`, `Dosage`, `Age`, `Drug`, `VitalTest`, `PsychologicalCondition`, `Form`, `BodyPart`, `Employment`, `Gender` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/VOICE_OF_THE_PATIENTS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_emb_clinical_medium_wip_en_4.4.0_3.0_1681315530573.zip){:.button.button-orange} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_emb_clinical_medium_wip_en_4.4.0_3.0_1681315530573.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_emb_clinical_medium_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. 
I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_emb_clinical_medium_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . 
Also i have b12 deficiency and vitamin D deficiency so I"m taking weekly supplement of vitamin D and 1000 mcg b12 daily. I"m taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I"m facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results

```bash
| chunk                | ner_label              |
|:---------------------|:-----------------------|
| 20 year old          | Age                    |
| girl                 | Gender                 |
| hyperthyroid         | Disease                |
| 1 month ago          | DateTime               |
| weak                 | Symptom                |
| light                | Symptom                |
| panic attacks        | PsychologicalCondition |
| depression           | PsychologicalCondition |
| left                 | Laterality             |
| chest                | BodyPart               |
| pain                 | Symptom                |
| increased            | TestResult             |
| heart rate           | VitalTest              |
| rapidly              | Modifier               |
| weight loss          | Symptom                |
| 4 months             | Duration               |
| hospital             | ClinicalDept           |
| discharged           | AdmissionDischarge     |
| hospital             | ClinicalDept           |
| blood tests          | Test                   |
| brain                | BodyPart               |
| mri                  | Test                   |
| ultrasound scan      | Test                   |
| endoscopy            | Procedure              |
| doctors              | Employment             |
| homeopathy doctor    | Employment             |
| he                   | Gender                 |
| hyperthyroid         | Disease                |
| TSH                  | Test                   |
| 0.15                 | TestResult             |
| T3                   | Test                   |
| T4                   | Test                   |
| normal               | TestResult             |
| b12 deficiency       | Disease                |
| vitamin D deficiency | Disease                |
| weekly               | Frequency              |
| supplement           | Drug                   |
| vitamin D            | Drug                   |
| 1000 mcg             | Dosage                 |
| b12                  | Drug                   |
| daily                | Frequency              |
| homeopathy medicine  | Treatment              |
| 40 days              | Duration               |
| after 30 days        | DateTime               |
| TSH                  | Test                   |
| 0.5                  | TestResult             |
| now                  | DateTime               |
| weakness             | Symptom                |
| depression           | PsychologicalCondition |
| last week            | DateTime               |
| rapid                | TestResult             |
| heartrate            | VitalTest              |
| allopathy medicine   | Treatment              |
| homeopathy           | Treatment              |
| thyroid              | BodyPart               |
| allopathy            | Treatment              |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_vop_emb_clinical_medium_wip|
|Compatibility:|Healthcare NLP 4.4.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|3.9 MB|
|Dependencies:|embeddings_clinical_medium|

## References

In-house annotated health-related text in colloquial language.

## Sample text from the training dataset

Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
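The precision, recall and F1 figures in the benchmarking table that follows are derived from the per-label true-positive (tp), false-positive (fp) and false-negative (fn) counts in the standard way; a minimal sketch that reproduces the micro-average row:

```python
def prf(tp: int, fp: int, fn: int):
    """Standard precision/recall/F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# micro-average counts from the benchmarking table below
p, r, f1 = prf(tp=17771, fp=3511, fn=3385)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.84 0.84 0.84
```

The macro-average row instead takes the unweighted mean of the per-label scores, which is why it is pulled down by labels with few examples.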
## Benchmarking ```bash label tp fp fn total precision recall f1 Allergen 0 1 8 8 0.00 0.00 0.00 SubstanceQuantity 9 14 18 27 0.39 0.33 0.36 RaceEthnicity 2 0 6 8 1.00 0.25 0.40 Measurements 41 25 33 74 0.62 0.55 0.59 InjuryOrPoisoning 66 37 51 117 0.64 0.56 0.60 Treatment 96 39 46 142 0.71 0.68 0.69 Modifier 642 268 271 913 0.71 0.70 0.70 TestResult 394 185 154 548 0.68 0.72 0.70 MedicalDevice 177 76 67 244 0.70 0.73 0.71 Vaccine 20 4 12 32 0.83 0.63 0.71 Frequency 456 144 187 643 0.76 0.71 0.73 HealthStatus 60 4 38 98 0.94 0.61 0.74 Route 24 4 12 36 0.86 0.67 0.75 RelationshipStatus 19 3 9 28 0.86 0.68 0.76 Procedure 286 91 80 366 0.76 0.78 0.77 Duration 846 227 269 1115 0.79 0.76 0.77 DateTime 1813 455 391 2204 0.80 0.82 0.81 AdmissionDischarge 19 1 8 27 0.95 0.70 0.81 Disease 1247 318 256 1503 0.80 0.83 0.81 Test 734 150 175 909 0.83 0.81 0.82 Substance 156 48 22 178 0.76 0.88 0.82 Laterality 440 91 78 518 0.83 0.85 0.84 Symptom 3069 566 630 3699 0.84 0.83 0.84 ClinicalDept 205 35 31 236 0.85 0.87 0.86 Dosage 273 42 49 322 0.87 0.85 0.86 Age 294 60 29 323 0.83 0.91 0.87 Drug 1035 188 100 1135 0.85 0.91 0.88 VitalTest 144 23 13 157 0.86 0.92 0.89 PsychologicalCondition 284 32 30 314 0.90 0.90 0.90 Form 234 32 17 251 0.88 0.93 0.91 BodyPart 2532 256 213 2745 0.91 0.92 0.92 Employment 980 65 62 1042 0.94 0.94 0.94 Gender 1174 27 20 1194 0.98 0.98 0.98 macro_avg 17771 3511 3385 21156 0.79 0.73 0.75 micro_avg 17771 3511 3385 21156 0.84 0.84 0.84 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from amitjohn007) author: John Snow Labs name: roberta_qa_amitjohn007_base_finetuned_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained 
RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad` is an English model originally trained by `amitjohn007`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_amitjohn007_base_finetuned_squad_en_4.3.0_3.0_1674217120272.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_amitjohn007_base_finetuned_squad_en_4.3.0_3.0_1674217120272.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_amitjohn007_base_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_amitjohn007_base_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
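For context on what lands in the `answer` column: extractive question-answering models of this family score each token as a possible answer start and answer end, and decoding selects the highest-scoring valid span from the context. A toy, self-contained sketch of that selection step (the scores below are invented for illustration, not produced by the model):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) token pair maximizing start+end score, with end >= start."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 1.0, 0.0]  # hypothetical start logits
end   = [0.0, 0.1, 0.0, 4.5, 0.2, 0.0, 0.0, 0.0, 1.2, 0.0]  # hypothetical end logits
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```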
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_amitjohn007_base_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/amitjohn007/roberta-base-finetuned-squad --- layout: model title: Named Entity Recognition (NER) Model in Danish (Dane 840B 300) author: John Snow Labs name: dane_ner_840B_300 date: 2020-08-30 task: Named Entity Recognition language: da edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [ner, da, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description DaNE is a Named Entity Recognition (or NER) model for Danish, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. Dane NER 840B 300 is trained with GloVe 840B 300 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_DA/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/dane_ner_840B_300_da_2.6.0_2.4_1598810268070.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/dane_ner_840B_300_da_2.6.0_2.4_1598810268070.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang = "xx") \ .setInputCols(['document', 'token']) \ .setOutputCol('embeddings') ner_model = NerDLModel.pretrained("dane_ner_840B_300", "da") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, softwareudvikler, investor og filantrop. Han er bedst kendt som medstifter af Microsoft Corporation. I løbet af sin karriere hos Microsoft havde Gates stillinger som formand, administrerende direktør (administrerende direktør), præsident og chefsoftwarearkitekt, samtidig med at han var den største individuelle aktionær indtil maj 2014. Han er en af \u200b\u200bde mest kendte iværksættere og pionerer inden for mikrocomputerrevolution i 1970'erne og 1980'erne. Født og opvokset i Seattle, Washington, var Gates grundlægger af Microsoft sammen med barndomsvennen Paul Allen i 1975 i Albuquerque, New Mexico; det fortsatte med at blive verdens største virksomhed inden for personlig computersoftware. Gates førte virksomheden som formand og administrerende direktør, indtil han trådte tilbage som administrerende direktør i januar 2000, men han forblev formand og blev chefsoftwarearkitekt. I slutningen af \u200b\u200b1990'erne var Gates blevet kritiseret for sin forretningstaktik, der er blevet betragtet som konkurrencebegrænsende. Denne udtalelse er blevet opretholdt ved adskillige retsafgørelser. 
I juni 2006 meddelte Gates, at han ville overgå til en deltidsrolle i Microsoft og fuldtidsarbejde i Bill & Melinda Gates Foundation, det private velgørende fundament, som han og hans kone, Melinda Gates, oprettede i 2000. Han overførte gradvist sine pligter til Ray Ozzie og Craig Mundie. Han trådte tilbage som formand for Microsoft i februar 2014 og tiltrådte en ny stilling som teknologirådgiver for at støtte den nyudnævnte administrerende direktør Satya Nadella."]], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang = "xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("dane_ner_840B_300", "da") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, softwareudvikler, investor og filantrop. Han er bedst kendt som medstifter af Microsoft Corporation. I løbet af sin karriere hos Microsoft havde Gates stillinger som formand, administrerende direktør (administrerende direktør), præsident og chefsoftwarearkitekt, samtidig med at han var den største individuelle aktionær indtil maj 2014. Han er en af de mest kendte iværksættere og pionerer inden for mikrocomputerrevolution i 1970'erne og 1980'erne. Født og opvokset i Seattle, Washington, var Gates grundlægger af Microsoft sammen med barndomsvennen Paul Allen i 1975 i Albuquerque, New Mexico; det fortsatte med at blive verdens største virksomhed inden for personlig computersoftware. Gates førte virksomheden som formand og administrerende direktør, indtil han trådte tilbage som administrerende direktør i januar 2000, men han forblev formand og blev chefsoftwarearkitekt. 
I slutningen af ​​1990'erne var Gates blevet kritiseret for sin forretningstaktik, der er blevet betragtet som konkurrencebegrænsende. Denne udtalelse er blevet opretholdt ved adskillige retsafgørelser. I juni 2006 meddelte Gates, at han ville overgå til en deltidsrolle i Microsoft og fuldtidsarbejde i Bill & Melinda Gates Foundation, det private velgørende fundament, som han og hans kone, Melinda Gates, oprettede i 2000. Han overførte gradvist sine pligter til Ray Ozzie og Craig Mundie. Han trådte tilbage som formand for Microsoft i februar 2014 og tiltrådte en ny stilling som teknologirådgiver for at støtte den nyudnævnte administrerende direktør Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, softwareudvikler, investor og filantrop. Han er bedst kendt som medstifter af Microsoft Corporation. I løbet af sin karriere hos Microsoft havde Gates stillinger som formand, administrerende direktør (administrerende direktør), præsident og chefsoftwarearkitekt, samtidig med at han var den største individuelle aktionær indtil maj 2014. Han er en af ​​de mest kendte iværksættere og pionerer inden for mikrocomputerrevolution i 1970'erne og 1980'erne. Født og opvokset i Seattle, Washington, var Gates grundlægger af Microsoft sammen med barndomsvennen Paul Allen i 1975 i Albuquerque, New Mexico; det fortsatte med at blive verdens største virksomhed inden for personlig computersoftware. Gates førte virksomheden som formand og administrerende direktør, indtil han trådte tilbage som administrerende direktør i januar 2000, men han forblev formand og blev chefsoftwarearkitekt. I slutningen af ​​1990'erne var Gates blevet kritiseret for sin forretningstaktik, der er blevet betragtet som konkurrencebegrænsende. Denne udtalelse er blevet opretholdt ved adskillige retsafgørelser. 
I juni 2006 meddelte Gates, at han ville overgå til en deltidsrolle i Microsoft og fuldtidsarbejde i Bill & Melinda Gates Foundation, det private velgørende fundament, som han og hans kone, Melinda Gates, oprettede i 2000. Han overførte gradvist sine pligter til Ray Ozzie og Craig Mundie. Han trådte tilbage som formand for Microsoft i februar 2014 og tiltrådte en ny stilling som teknologirådgiver for at støtte den nyudnævnte administrerende direktør Satya Nadella."""] ner_df = nlu.load('da.ner.840B300D').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title}
## Results

```bash
+-------------------------------+---------+
|chunk                          |ner_label|
+-------------------------------+---------+
|William Henry Gates            |PER      |
|amerikansk                     |MISC     |
|Microsoft Corporation          |ORG      |
|Microsoft                      |ORG      |
|Gates                          |PER      |
|1970'erne                      |MISC     |
|1980'erne                      |MISC     |
|Seattle                        |LOC      |
|Washington                     |LOC      |
|Gates                          |PER      |
|Microsoft                      |ORG      |
|Paul Allen                     |PER      |
|Albuquerque                    |LOC      |
|New Mexico                     |LOC      |
|1990'erne                      |MISC     |
|Gates                          |MISC     |
|Gates                          |PER      |
|Microsoft                      |ORG      |
|Bill & Melinda Gates Foundation|PER      |
|Melinda Gates                  |PER      |
+-------------------------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|dane_ner_840B_300|
|Type:|ner|
|Compatibility:|Spark NLP 2.6.0+|
|Edition:|Official|
|License:|Open Source|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|da|
|Case sensitive:|false|

{:.h2_title}
## Data Source

The detailed information can be found from [https://www.aclweb.org/anthology/2020.lrec-1.565.pdf](https://www.aclweb.org/anthology/2020.lrec-1.565.pdf)

--- layout: model title: Legal Parties in interest Clause Binary Classifier author: John Snow Labs name: legclf_parties_in_interest_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `parties-in-interest` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. 
If you have big legal documents and you want to look for clauses, we recommend splitting the documents first, using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings used by this model allow up to 512 tokens; if your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `parties-in-interest` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_parties_in_interest_clause_en_1.0.0_3.2_1660122826311.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_parties_in_interest_clause_en_1.0.0_3.2_1660122826311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_parties_in_interest_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
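The description above recommends paragraph splitting (by multiline) before classifying long documents; a minimal stdlib sketch of that preprocessing step, using hypothetical clause text:

```python
import re

def split_paragraphs(text: str):
    """Split a document on blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Clause 1. The parties agree...\n\nClause 2. In the interest of...\n\nSignatures."
for clause in split_paragraphs(doc):
    print(clause)
```

Each resulting piece can then be fed to the classifier as a separate row of the `clause_text` column.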
## Results

```bash
+---------------------+
|result               |
+---------------------+
|[parties-in-interest]|
|[other]              |
|[other]              |
|[parties-in-interest]|
+---------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_parties_in_interest_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.8 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
label                precision  recall  f1-score  support
other                0.99       0.99    0.99      106
parties-in-interest  0.98       0.98    0.98      49
accuracy             -          -       0.99      155
macro-avg            0.99       0.99    0.99      155
weighted-avg         0.99       0.99    0.99      155
```

--- layout: model title: Longformer Base NER Pipeline author: John Snow Labs name: longformer_base_token_classifier_conll03_pipeline date: 2022-06-19 tags: [ner, longformer, pipeline, conll, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [longformer_base_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/10/09/longformer_base_token_classifier_conll03_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/longformer_base_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655653913352.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/longformer_base_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655653913352.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("longformer_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I am working at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("longformer_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I am working at John Snow Labs.") ```
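The NerConverter stage in this pipeline is what merges the token-level B-/I-/O tags predicted by the Longformer model into entity chunks; a simplified sketch of that conversion (the tags below are hand-written for illustration, not actual model output):

```python
def bio_to_chunks(tokens, tags):
    """Merge BIO-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity starts
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:  # entity continues
            current.append(tok)
        else:                              # outside any entity
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = "My name is John and I am working at John Snow Labs .".split()
tags = ["O", "O", "O", "B-PER", "O", "O", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O"]
print(bio_to_chunks(tokens, tags))  # [('John', 'PER'), ('John Snow Labs', 'ORG')]
```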
## Results

```bash
+--------------+---------+
|chunk         |ner_label|
+--------------+---------+
|John          |PER      |
|John Snow Labs|ORG      |
+--------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|longformer_base_token_classifier_conll03_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|516.0 MB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- LongformerForTokenClassification
- NerConverter
- Finisher

--- layout: model title: English image_classifier_vit_test ViTForImageClassification from flyswot author: John Snow Labs name: image_classifier_vit_test date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_test` is an English model originally trained by flyswot. ## Predicted Entities `EDGE + SPINE`, `OTHER`, `PAGE + FOLIO`, `FLYSHEET`, `CONTAINER`, `CONTROL SHOT`, `COVER`, `SCROLL` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_test_en_4.1.0_3.0_1660169623877.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_test_en_4.1.0_3.0_1660169623877.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_test", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_test", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
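For context on the `class` output: an image classifier like this one produces one score per class, and the predicted label is the class whose softmax probability is highest. A toy sketch of that final step over the classes listed above, with invented scores:

```python
import math

def softmax(scores):
    """Convert raw per-class scores into probabilities summing to 1."""
    m = max(scores)                                # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["EDGE + SPINE", "OTHER", "PAGE + FOLIO", "FLYSHEET",
          "CONTAINER", "CONTROL SHOT", "COVER", "SCROLL"]
scores = [0.2, 0.1, 3.4, 0.0, -1.2, 0.3, 0.9, -0.5]  # hypothetical logits
probs = softmax(scores)
print(labels[probs.index(max(probs))])  # PAGE + FOLIO
```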
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_test| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|21.4 MB| --- layout: model title: Extract treatment entities (Voice of the Patients) author: John Snow Labs name: ner_vop_treatment_wip date: 2023-04-20 tags: [licensed, clinical, en, ner, vop, patient, treatment] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts treatment-related entities from documents written in the patient's own words. Note: the 'wip' suffix indicates that the model development is work-in-progress; the model will be finalized and its performance will be improved in upcoming releases. ## Predicted Entities `Treatment`, `Frequency`, `Procedure`, `Route`, `Duration`, `Dosage`, `Drug`, `Form` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_treatment_wip_en_4.4.0_3.0_1682013186202.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_treatment_wip_en_4.4.0_3.0_1682013186202.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_vop_treatment_wip", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter])

data = spark.createDataFrame([["My grandpa was diagnosed with type 2 diabetes and had to make some changes to his lifestyle. He also takes metformin and glipizide to help regulate his blood sugar levels. It's been a bit of an adjustment, but he's doing well."]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_vop_treatment_wip", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))

val data = Seq("My grandpa was diagnosed with type 2 diabetes and had to make some changes to his lifestyle. He also takes metformin and glipizide to help regulate his blood sugar levels. It's been a bit of an adjustment, but he's doing well.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
## Results

```bash
| chunk     | ner_label |
|:----------|:----------|
| metformin | Drug      |
| glipizide | Drug      |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_vop_treatment_wip|
|Compatibility:|Healthcare NLP 4.4.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|3.9 MB|
|Dependencies:|embeddings_clinical|

## References

In-house annotated health-related text in colloquial language.

## Sample text from the training dataset

Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
## Benchmarking

```bash
label       tp     fp    fn    total  precision  recall  f1
Treatment    144    44    67    211   0.77       0.68    0.72
Frequency    801   111   271   1072   0.88       0.75    0.81
Procedure    505   104   112    617   0.83       0.82    0.82
Route         29     3     8     37   0.91       0.78    0.84
Duration    1926   382   345   2271   0.83       0.85    0.84
Dosage       350    56    49    399   0.86       0.88    0.87
Drug        1210   125   108   1318   0.91       0.92    0.91
Form         235    23    18    253   0.91       0.93    0.92
macro_avg   5200   848   978   6178   0.86       0.83    0.84
micro_avg   5200   848   978   6178   0.86       0.84    0.85
```

--- layout: model title: Adverse Drug Events Classifier (LogReg) author: John Snow Labs name: classifier_logreg_ade date: 2023-05-11 tags: [text_classification, ade, clinical, licensed, logreg, en] task: Text Classification language: en edition: Healthcare NLP 4.4.1 spark_version: 3.0 supported: true annotator: DocumentLogRegClassifierModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained with the Logistic Regression algorithm and classifies a text/sentence into two categories: `True` : the sentence mentions a possible ADE; `False` : the sentence doesn't contain any information about an ADE. The corpus used for model training is the ADE-Corpus-V2 Dataset: Adverse Drug Reaction Data, a dataset for classifying whether a sentence is ADE-related (True) or not (False). ## Predicted Entities `True`, `False` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifier_logreg_ade_en_4.4.1_3.0_1683817451286.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifier_logreg_ade_en_4.4.1_3.0_1683817451286.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

logreg = DocumentLogRegClassifierModel.pretrained("classifier_logreg_ade", "en", "clinical/models")\
    .setInputCols("token")\
    .setOutputCol("prediction")

clf_Pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    logreg])

data = spark.createDataFrame([["""None of the patients required treatment for the overdose."""], ["""Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient."""]]).toDF("text")

result = clf_Pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val logreg = DocumentLogRegClassifierModel.pretrained("classifier_logreg_ade", "en", "clinical/models")
    .setInputCols("token")
    .setOutputCol("prediction")

val clf_Pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, logreg))

val data = Seq(
  "None of the patients required treatment for the overdose.",
  "Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.").toDF("text")

val result = clf_Pipeline.fit(data).transform(data)
```
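Conceptually, a logistic-regression document classifier scores a bag of tokens against learned per-token weights and thresholds the sigmoid of that score. A toy sketch of that scoring step — the weights and bias below are invented for illustration and are not the model's actual parameters:

```python
# Illustrative only: bag-of-words logistic regression inference.
# WEIGHTS and BIAS are made-up values, not the trained model's.
import math

WEIGHTS = {"induced": 1.4, "asthma": 0.6, "overdose": -0.2, "required": -0.9}
BIAS = -0.5

def predict_ade(tokens):
    score = BIAS + sum(WEIGHTS.get(t.lower(), 0.0) for t in tokens)
    prob = 1.0 / (1.0 + math.exp(-score))          # sigmoid
    return ("True" if prob >= 0.5 else "False"), round(prob, 3)

print(predict_ade("aspirin induced asthma patient".split()))
print(predict_ade("no treatment required for the overdose".split()))
```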
## Results

```bash
+----------------------------------------------------------------------------------------+-------+
|text                                                                                    |result |
+----------------------------------------------------------------------------------------+-------+
|Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.|[True] |
|None of the patients required treatment for the overdose.                               |[False]|
+----------------------------------------------------------------------------------------+-------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|classifier_logreg_ade|
|Compatibility:|Healthcare NLP 4.4.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[prediction]|
|Language:|en|
|Size:|595.7 KB|

## References

The corpus used for model training is the ADE-Corpus-V2 dataset (Adverse Drug Reaction Data), a dataset for classifying whether a sentence is ADE-related (`True`) or not (`False`).

Reference: Gurulingappa et al., Benchmark Corpus to Support Information Extraction for Adverse Drug Effects, JBI, 2012.
http://www.sciencedirect.com/science/article/pii/S1532046412000615

## Benchmarking

```bash
       label  precision  recall  f1-score  support
       False       0.91    0.92      0.92     3362
        True       0.79    0.79      0.79     1361
    accuracy          -       -      0.88     4723
   macro_avg       0.85    0.85      0.85     4723
weighted_avg       0.88    0.88      0.88     4723
```

---
layout: model
title: English asr_english_filipino_wav2vec2_l_xls_r_test_09 TFWav2Vec2ForCTC from Khalsuu
author: John Snow Labs
name: pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_09
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_english_filipino_wav2vec2_l_xls_r_test_09` is an English model originally trained by Khalsuu.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_09_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_09_en_4.2.0_3.0_1664119422310.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_09_en_4.2.0_3.0_1664119422310.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_09', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_09", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_09| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: German BertForMaskedLM Large Cased model (from deepset) author: John Snow Labs name: bert_embeddings_g_large date: 2022-12-02 tags: [de, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: de edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `gbert-large` is a German model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_g_large_de_4.2.4_3.0_1670022204873.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_g_large_de_4.2.4_3.0_1670022204873.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_g_large","de") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_g_large","de")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
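After `transform`, each token row in the `embeddings` column carries a dense vector. A common downstream use of such vectors is comparing them with cosine similarity; a minimal sketch over toy 3-dimensional vectors (real gbert-large vectors are much larger):

```python
# Cosine similarity between two embedding vectors: dot product divided
# by the product of their Euclidean norms.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))  # 1.0 for identical directions
```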
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_g_large|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|de|
|Size:|1.3 GB|
|Case sensitive:|true|

## References

- https://huggingface.co/deepset/gbert-large
- https://arxiv.org/pdf/2010.10906.pdf
- http://deepset.ai/
- https://haystack.deepset.ai/
- https://deepset.ai/german-bert
- https://deepset.ai/germanquad
- https://github.com/deepset-ai/haystack
- https://docs.haystack.deepset.ai
- https://haystack.deepset.ai/community
- https://twitter.com/deepset_ai
- https://www.linkedin.com/company/deepset-ai/
- https://github.com/deepset-ai/haystack/discussions
- https://deepset.ai
- http://www.deepset.ai/jobs

---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_4_h_512
date: 2022-12-02
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-4_H-512` is a Chinese model originally trained by `uer`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_512_zh_4.2.4_3.0_1670021676308.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_512_zh_4.2.4_3.0_1670021676308.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_512","zh") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_512","zh")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_4_h_512|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|90.1 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/uer/chinese_roberta_L-4_H-512
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://cloud.tencent.com/

---
layout: model
title: English BertForQuestionAnswering model (from Vasanth)
author: John Snow Labs
name: bert_qa_bert_base_uncased_qa_squad2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-qa-squad2` is an English model originally trained by `Vasanth`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_qa_squad2_en_4.0.0_3.0_1654181282352.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_qa_squad2_en_4.0.0_3.0_1654181282352.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_qa_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_qa_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.base_uncased.by_Vasanth").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
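Under the hood, extractive QA models like this one score every context token as a possible answer start and end, then return the highest-scoring valid span. A toy sketch of that span-selection step (not Spark NLP's actual decoding code; the scores below are invented):

```python
# Pick the (start, end) pair maximizing start_score + end_score,
# subject to start <= end and a maximum span length.
def best_span(start_scores, end_scores, max_len=15):
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley"]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3]
end   = [0.1, 0.1, 0.2, 4.8, 0.1, 0.1, 0.1, 0.1, 0.4]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```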
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_uncased_qa_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/Vasanth/bert-base-uncased-qa-squad2

---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk TFWav2Vec2ForCTC from krirk
author: John Snow Labs
name: asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk` is an English model originally trained by krirk.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk_en_4.2.0_3.0_1664042673180.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk_en_4.2.0_3.0_1664042673180.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xls_r_300m_turkish_colab_by_krirk|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|

---
layout: model
title: English AlbertForQuestionAnswering model (from rowan1224)
author: John Snow Labs
name: albert_qa_slp
date: 2022-06-24
tags: [en, open_source, albert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: AlBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `albert-slp` is an English model originally trained by `rowan1224`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_slp_en_4.0.0_3.0_1656063737900.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_slp_en_4.0.0_3.0_1656063737900.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_slp","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_slp","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.albert.by_rowan1224").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_qa_slp| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|42.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/rowan1224/albert-slp --- layout: model title: Fast Neural Machine Translation Model from English to Italic Languages author: John Snow Labs name: opus_mt_en_itc date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, itc, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `itc` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_itc_xx_2.7.0_2.4_1609170676219.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_itc_xx_2.7.0_2.4_1609170676219.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_itc", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])

light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

result = light_pipeline.fullAnnotate("Your text to translate here.")
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_itc", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("Your text to translate here.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.itc').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_en_itc|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from saraks)
author: John Snow Labs
name: distilbert_qa_cuad_parties_08_25
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cuad-distil-parties-08-25` is an English model originally trained by `saraks`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_parties_08_25_en_4.3.0_3.0_1672766229642.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_parties_08_25_en_4.3.0_3.0_1672766229642.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_parties_08_25","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_parties_08_25","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_cuad_parties_08_25|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/saraks/cuad-distil-parties-08-25

---
layout: model
title: Multilingual DistilBertForQuestionAnswering model (from ZYW) Squad
author: John Snow Labs
name: distilbert_qa_squad_en_de_es_vi_zh_model
date: 2022-06-08
tags: [en, de, vi, zh, es, open_source, distilbert, question_answering, xx]
task: Question Answering
language: xx
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squad-en-de-es-vi-zh-model` is a multilingual model originally trained by `ZYW`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_squad_en_de_es_vi_zh_model_xx_4.0.0_3.0_1654728800543.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_squad_en_de_es_vi_zh_model_xx_4.0.0_3.0_1654728800543.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_en_de_es_vi_zh_model","xx") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_en_de_es_vi_zh_model","xx")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("xx.answer_question.squad.distil_bert._en_de_es_vi_zh_tuned.by_ZYW").predict("""PUT YOUR QUESTION HERE|||"PUT YOUR CONTEXT HERE""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_squad_en_de_es_vi_zh_model|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|xx|
|Size:|505.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/ZYW/squad-en-de-es-vi-zh-model

---
layout: model
title: Fast and Accurate Language Identification - 220 Languages (CNN)
author: John Snow Labs
name: ld_wiki_tatoeba_cnn_220
date: 2020-12-05
task: Language Detection
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [language_detection, open_source, xx]
supported: true
annotator: LanguageDetectorDL
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Language detection and identification is the task of automatically detecting the language(s) present in a document based on its content. ``LanguageDetectorDL`` is an annotator that detects the language of documents or sentences depending on the ``inputCols``. In addition, ``LanguageDetectorDL`` can accurately detect language from documents with mixed languages by coalescing sentences and selecting the best candidate.

We have designed and developed Deep Learning models using CNNs in TensorFlow/Keras. The model is trained on large datasets such as Wikipedia and Tatoeba, with high accuracy evaluated on the Europarl dataset. The output is a language code in Wiki Code style: [https://en.wikipedia.org/wiki/List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias).
This model can detect the following languages: `Achinese`, `Afrikaans`, `Tosk Albanian`, `Amharic`, `Aragonese`, `Old English`, `Arabic`, `Egyptian Arabic`, `Assamese`, `Asturian`, `Avaric`, `Aymara`, `Azerbaijani`, `South Azerbaijani`, `Bashkir`, `Bavarian`, `bat-smg`, `Central Bikol`, `Belarusian`, `Bulgarian`, `bh`, `Bengali`, `Tibetan`, `Bishnupriya`, `Breton`, `Russia Buriat`, `Catalan`, `Min Dong Chinese`, `Chechen`, `Cebuano`, `Central Kurdish (Soranî)`, `Corsican`, `Crimean Tatar`, `Czech`, `Kashubian`, `Chuvash`, `Welsh`, `Danish`, `German`, `Dimli (individual language)`, `Lower Sorbian`, `Dhivehi`, `Greek`, `eml`, `English`, `Esperanto`, `Spanish`, `Estonian`, `Basque`, `Extremaduran`, `Persian`, `Finnish`, `fiu-vro`, `Faroese`, `French`, `Arpitan`, `Friulian`, `Frisian`, `Irish`, `Gagauz`, `Scottish Gaelic`, `Galician`, `Guarani`, `Konkani (Goan)`, `Gujarati`, `Manx`, `Hausa`, `Hakka Chinese`, `Hebrew`, `Hindi`, `Fiji Hindi`, `Upper Sorbian`, `Haitian Creole`, `Hungarian`, `Armenian`, `Interlingua`, `Indonesian`, `Interlingue`, `Igbo`, `Ilocano`, `Ido`, `Icelandic`, `Italian`, `Japanese`, `Jamaican Patois`, `Lojban`, `Javanese`, `Georgian`, `Karakalpak`, `Kabyle`, `Kabardian`, `Kazakh`, `Khmer`, `Kannada`, `Korean`, `Komi-Permyak`, `Karachay-Balkar`, `Kölsch`, `Kurdish`, `Komi`, `Cornish`, `Kyrgyz`, `Latin`, `Ladino`, `Luxembourgish`, `Lezghian`, `Luganda`, `Limburgan`, `Ligurian`, `Lombard`, `Lingala`, `Lao`, `Northern Luri`, `Lithuanian`, `Latvian`, `Maithili`, `map-bms`, `Malagasy`, `Meadow Mari`, `Maori`, `Minangkabau`, `Macedonian`, `Malayalam`, `Mongolian`, `Marathi`, `Hill Mari`, `Maltese`, `Mirandese`, `Burmese`, `Erzya`, `Mazanderani`, `Nahuatl`, `Neapolitan`, `Low German (Low Saxon)`, `nds-nl`, `Nepali`, `Newari`, `Dutch`, `Norwegian Nynorsk`, `Norwegian`, `Narom`, `Pedi`, `Navajo`, `Occitan`, `Livvi`, `Oromo`, `Odia (Oriya)`, `Ossetian`, `Punjabi (Eastern)`, `Pangasinan`, `Kapampangan`, `Papiamento`, `Picard`, `Palatine German`, `Polish`, 
`Punjabi (Western)`, `Pashto`, `Portuguese`, `Quechua`, `Romansh`, `Romanian`, `roa-tara`, `Russian`, `Rusyn`, `Kinyarwanda`, `Sanskrit`, `Yakut`, `Sardinian`, `Sicilian`, `Scots`, `Sindhi`, `Northern Sami`, `Sinhala`, `Slovak`, `Slovenian`, `Shona`, `Somali`, `Albanian`, `Serbian`, `Saterland Frisian`, `Sundanese`, `Swedish`, `Swahili`, `Silesian`, `Tamil`, `Tulu`, `Telugu`, `Tetun`, `Tajik`, `Thai`, `Turkmen`, `Tagalog`, `Setswana`, `Tongan`, `Turkish`, `Tatar`, `Tuvinian`, `Udmurt`, `Uyghur`, `Ukrainian`, `Urdu`, `Uzbek`, `Venetian`, `Veps`, `Vietnamese`, `Vlaams`, `Volapük`, `Walloon`, `Waray`, `Wolof`, `Shanghainese`, `Xhosa`, `Mingrelian`, `Yiddish`, `Yoruba`, `Zeeuws`, `Chinese`, `zh-classical`, `zh-min-nan`, `zh-yue`. ## Predicted Entities `ace`, `af`, `als`, `am`, `an`, `ang`, `ar`, `arz`, `as`, `ast`, `av`, `ay`, `az`, `azb`, `ba`, `bar`, `bat-smg`, `bcl`, `be`, `bg`, `bh`, `bn`, `bo`, `bpy`, `br`, `bxr`, `ca`, `cdo`, `ce`, `ceb`, `ckb`, `co`, `crh`, `cs`, `csb`, `cv`, `cy`, `da`, `de`, `diq`, `dsb`, `dv`, `el`, `eml`, `en`, `eo`, `es`, `et`, `eu`, `ext`, `fa`, `fi`, `fiu-vro`, `fo`, `fr`, `frp`, `fur`, `fy`, `ga`, `gag`, `gd`, `gl`, `gn`, `gom`, `gu`, `gv`, `ha`, `hak`, `he`, `hi`, `hif`, `hsb`, `ht`, `hu`, `hy`, `ia`, `id`, `ie`, `ig`, `ilo`, `io`, `is`, `it`, `ja`, `jam`, `jbo`, `jv`, `ka`, `kaa`, `kab`, `kbd`, `kk`, `km`, `kn`, `ko`, `koi`, `krc`, `ksh`, `ku`, `kv`, `kw`, `ky`, `la`, `lad`, `lb`, `lez`, `lg`, `li`, `lij`, `lmo`, `ln`, `lo`, `lrc`, `lt`, `lv`, `mai`, `map-bms`, `mg`, `mhr`, `mi`, `min`, `mk`, `ml`, `mn`, `mr`, `mrj`, `mt`, `mwl`, `my`, `myv`, `mzn`, `nah`, `nap`, `nds`, `nds-nl`, `ne`, `new`, `nl`, `nn`, `no`, `nrm`, `nso`, `nv`, `oc`, `olo`, `om`, `or`, `os`, `pa`, `pag`, `pam`, `pap`, `pcd`, `pfl`, `pl`, `pnb`, `ps`, `pt`, `qu`, `rm`, `ro`, `roa-tara`, `ru`, `rue`, `rw`, `sa`, `sah`, `sc`, `scn`, `sco`, `sd`, `se`, `si`, `sk`, `sl`, `sn`, `so`, `sq`, `sr`, `stq`, `su`, `sv`, `sw`, `szl`, `ta`, `tcy`, `te`, `tet`, `tg`, `th`, `tk`, 
`tl`, `tn`, `to`, `tr`, `tt`, `tyv`, `udm`, `ug`, `uk`, `ur`, `uz`, `vec`, `vep`, `vi`, `vls`, `vo`, `wa`, `war`, `wo`, `wuu`, `xh`, `xmf`, `yi`, `yo`, `zea`, `zh`, `zh-classical`, `zh-min-nan`, `zh-yue`. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ld_wiki_tatoeba_cnn_220_xx_2.7.0_2.4_1607184539094.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ld_wiki_tatoeba_cnn_220_xx_2.7.0_2.4_1607184539094.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... language_detector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_220", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("language") languagePipeline = Pipeline(stages=[documentAssembler, sentenceDetector, language_detector]) light_pipeline = LightPipeline(languagePipeline.fit(spark.createDataFrame([['']]).toDF("text"))) result = light_pipeline.fullAnnotate("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.") ``` ```scala ... val languageDetector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_220", "xx") .setInputCols("sentence") .setOutputCol("language") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, languageDetector)) val data = Seq("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala."] lang_df = nlu.load('xx.classify.wiki_220').predict(text, output_level='sentence') lang_df ```
## Results ```bash 'fr' ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ld_wiki_tatoeba_cnn_220| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[language]| |Language:|xx| ## Data Source Wikipedia and Tatoeba ## Benchmarking ```bash Evaluated on Europarl dataset which the model has never seen: +--------+-----+-------+------------------+ |src_lang|count|correct| precision| +--------+-----+-------+------------------+ | sv| 1000| 999| 0.999| | fr| 1000| 999| 0.999| | fi| 1000| 998| 0.998| | it| 1000| 997| 0.997| | pt| 1000| 995| 0.995| | el| 1000| 994| 0.994| | de| 1000| 993| 0.993| | en| 1000| 990| 0.99| | nl| 1000| 987| 0.987| | hu| 880| 866|0.9840909090909091| | da| 1000| 980| 0.98| | es| 1000| 976| 0.976| | ro| 784| 765|0.9757653061224489| | et| 928| 905|0.9752155172413793| | lt| 1000| 975| 0.975| | cs| 1000| 973| 0.973| | pl| 914| 889|0.9726477024070022| | sk| 1000| 941| 0.941| | bg| 1000| 939| 0.939| | lv| 916| 857|0.9355895196506551| | sl| 914| 789|0.8632385120350109| +--------+-----+-------+------------------+ +-------+-------------------+ |summary| precision| +-------+-------------------+ | count| 21| | mean| 0.9734546412641623| | stddev|0.03176749551086062| | min| 0.8632385120350109| | max| 0.999| +-------+-------------------+ ``` --- layout: model title: Hausa Named Entity Recognition (from mbeukman) author: John Snow Labs name: xlmroberta_ner_xlm_roberta_base_finetuned_ner_hausa date: 2022-05-17 tags: [xlm_roberta, ner, token_classification, ha, open_source] task: Named Entity Recognition language: ha edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`xlm-roberta-base-finetuned-ner-hausa` is a Hausa model originally trained by `mbeukman`. ## Predicted Entities `PER`, `ORG`, `LOC`, `DATE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_ner_hausa_ha_3.4.2_3.0_1652808366927.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_ner_hausa_ha_3.4.2_3.0_1652808366927.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_ner_hausa","ha") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ina son Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_ner_hausa","ha") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ina son Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
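The token classifier emits one BIO-style tag per token; downstream, those tags are typically collapsed into entity chunks (e.g. with Spark NLP's `NerConverter`). A minimal, self-contained sketch of that collapsing step, in plain Python and independent of Spark NLP (the sample tags here are invented for illustration, not the model's actual output):

```python
# Collapse token-level BIO tags into (label, chunk) pairs.
def bio_to_chunks(tokens, tags):
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; flush any open chunk first.
            if current:
                chunks.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # Continuation of the open entity.
            current[1].append(token)
        else:
            # "O" tag (or inconsistent I- tag): close any open chunk.
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(words)) for label, words in chunks]

print(bio_to_chunks(
    ["Ina", "son", "Spark", "NLP"],
    ["O", "O", "B-ORG", "I-ORG"],
))  # [('ORG', 'Spark NLP')]
```

In the pipeline above, this role is played by adding an `NerConverter` stage after the token classifier rather than by hand-written code.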
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_xlm_roberta_base_finetuned_ner_hausa| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ha| |Size:|775.3 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-ner-hausa - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://www.apache.org/licenses/LICENSE-2.0 --- layout: model title: English BertForQuestionAnswering Small Cased model (from motiondew) author: John Snow Labs name: bert_qa_sd3_small date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-sd3-small` is an English model originally trained by `motiondew`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_sd3_small_en_4.0.0_3.0_1657188265535.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_sd3_small_en_4.0.0_3.0_1657188265535.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sd3_small","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sd3_small","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
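Conceptually, an extractive question-answering head scores every token as a candidate answer start and answer end, then returns the best-scoring valid span (start before end, within a length limit). A toy sketch of that span-selection step in plain Python (this is not the Spark NLP internals, and the scores below are invented for illustration):

```python
# Pick the highest-scoring (start, end) span with start <= end.
def best_span(tokens, start_scores, end_scores, max_len=8):
    best = (float("-inf"), 0, 0)  # (score, start_idx, end_idx)
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(tokens))):
            score = s + end_scores[j]
            if score > best[0]:
                best = (score, i, j)
    _, i, j = best
    return " ".join(tokens[i:j + 1])

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 3.0, 0.1, 0.1, 0.1, 0.1, 0.5, 0.0]
end   = [0.0, 0.1, 0.1, 2.5, 0.1, 0.1, 0.1, 0.1, 0.4, 0.0]
print(best_span(tokens, start, end))  # Clara
```

In the pipeline above, this logic runs inside the pretrained annotator; the `answer` output column carries the selected span.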
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_sd3_small| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/motiondew/bert-sd3-small --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from drewski) author: John Snow Labs name: distilbert_qa_drewski_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `drewski`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_drewski_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770549448.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_drewski_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770549448.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_drewski_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(False) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_drewski_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_drewski_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/drewski/distilbert-base-uncased-finetuned-squad --- layout: model title: English Bert Embeddings (from kornosk) author: John Snow Labs name: bert_embeddings_bert_political_election2020_twitter_mlm date: 2022-04-11 tags: [bert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-political-election2020-twitter-mlm` is an English model originally trained by `kornosk`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_political_election2020_twitter_mlm_en_3.4.2_3.0_1649672268288.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_political_election2020_twitter_mlm_en_3.4.2_3.0_1649672268288.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_political_election2020_twitter_mlm","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_political_election2020_twitter_mlm","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.bert_political_election2020_twitter_mlm").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_political_election2020_twitter_mlm| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|410.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/kornosk/bert-political-election2020-twitter-mlm - https://www.aclweb.org/anthology/2021.naacl-main.376 - https://github.com/GU-DataLab/stance-detection-KE-MLM --- layout: model title: Typed Dependency Parsing pipeline for English author: John Snow Labs name: dependency_parse date: 2021-03-27 tags: [pipeline, dependency_parsing, untyped_dependency_parsing, typed_dependency_parsing, labelled_dependency_parsing, unlabelled_dependency_parsing, en, open_source] supported: true task: [Dependency Parser, Pipeline Public] language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: Pipeline article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Typed Dependency parser, trained on the CoNLL dataset. Dependency parsing is the task of extracting a dependency parse of a sentence that represents its grammatical structure and defines the relationships between “head” words and the words that modify those heads.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/dependency_parse_en_3.0.0_3.0_1616864258046.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/dependency_parse_en_3.0.0_3.0_1616864258046.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('dependency_parse', lang = 'en') annotations = pipeline.fullAnnotate("Dependencies represents relationships betweens words in a Sentence")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("dependency_parse", lang = "en") val result = pipeline.fullAnnotate("Dependencies represents relationships betweens words in a Sentence")(0) ``` {:.nlu-block} ```python import nlu nlu.load("dep.typed").predict("Dependencies represents relationships betweens words in a Sentence") ```
## Results ```bash +---------------------------------------------------------------------------------+--------------------------------------------------------+ |result |result | +---------------------------------------------------------------------------------+--------------------------------------------------------+ |[ROOT, Dependencies, represents, words, relationships, Sentence, Sentence, words]|[root, parataxis, nsubj, amod, nsubj, case, nsubj, flat]| +---------------------------------------------------------------------------------+--------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|dependency_parse| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - SentenceDetector - Tokenizer - PerceptronModel - DependencyParserModel - TypedDependencyParserModel --- layout: model title: Word2Vec Embeddings in Quechua (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, qu, open_source] task: Embeddings language: qu edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_qu_3.4.1_3.0_1647453594212.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_qu_3.4.1_3.0_1647453594212.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","qu") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","qu") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("qu.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|qu| |Size:|110.0 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Lemmatizer (Czech, SpacyLookup) author: John Snow Labs name: lemma_spacylookup date: 2022-03-03 tags: [open_source, lemmatizer, cs] task: Lemmatization language: cs edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Czech Lemmatizer is a scalable, production-ready version of the rule-based lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_cs_3.4.1_3.0_1646316557035.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_cs_3.4.1_3.0_1646316557035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","cs") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Nejste lepší než já"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","cs") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("Nejste lepší než já").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("cs.lemma.spacylookup").predict("""Nejste lepší než já""") ```
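A lookup lemmatizer of this kind is conceptually just a table mapping inflected forms to lemmas, with unknown tokens passed through unchanged. A minimal sketch in plain Python (the two table entries below are invented for illustration and are not the model's actual lookup data):

```python
# Tiny illustrative lookup table: inflected form -> lemma.
LOOKUP = {"nejste": "být", "lepší": "dobrý"}

def lemmatize(tokens, table=LOOKUP):
    # Unknown tokens fall back to themselves.
    return [table.get(tok.lower(), tok) for tok in tokens]

print(lemmatize(["Nejste", "lepší", "než", "já"]))
```

The pretrained `LemmatizerModel` above ships the full Czech table and applies exactly this kind of per-token lookup at scale.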
## Results ```bash +-------------------------+ |result | +-------------------------+ |[Nejste, lepšit, než, já]| +-------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma_spacylookup| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|cs| |Size:|379.0 KB| --- layout: model title: Clinical Deidentification (glove) author: John Snow Labs name: clinical_deidentification_glove date: 2023-06-13 tags: [deidentification, en, licensed, pipeline] task: Pipeline Healthcare language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline is trained with lightweight `glove_100d` embeddings and can be used to deidentify protected health information (PHI) in medical texts. The PHI will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`, `CITY`, `COUNTRY`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `STREET`, `USERNAME`, `ZIP`, `ACCOUNT`, `LICENSE`, `VIN`, `SSN`, `DLN`, `PLATE`, `IPADDR`, `EMAIL` entities. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_glove_en_4.4.4_3.2_1686663547185.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_glove_en_4.4.4_3.2_1686663547185.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification_glove", "en", "clinical/models") sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""" result = deid_pipeline.annotate(sample) print("\n".join(result['masked'])) print("\n".join(result['masked_with_chars'])) print("\n".join(result['masked_fixed_length_chars'])) print("\n".join(result['obfuscated'])) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = new PretrainedPipeline("clinical_deidentification_glove","en","clinical/models") val result = deid_pipeline.annotate("Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.") ``` {:.nlu-block} ```python import nlu nlu.load("en.deid.glove_pipeline").predict("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""") ```
## Results ```bash Results Masked with entity labels ------------------------------ Name : , Record date: , # . Dr. , ID: , IP . He is a male was admitted to the for cystectomy on . Patient's VIN : , SSN , Driver's license . Phone , , , E-MAIL: . Masked with chars ------------------------------ Name : [**************], Record date: [********], # [****]. Dr. [********], ID: [********], IP [************]. He is a [*********] male was admitted to the [**********] for cystectomy on [******]. Patient's VIN : [***************], SSN [**********], Driver's license [*********]. Phone [************], [***************], [***********], E-MAIL: [*************]. Masked with fixed length chars ------------------------------ Name : ****, Record date: ****, # ****. Dr. ****, ID: ****, IP ****. He is a **** male was admitted to the **** for cystectomy on ****. Patient's VIN : ****, SSN ****, Driver's license ****. Phone ****, ****, ****, E-MAIL: ****. Obfuscated ------------------------------ Name : Berneta Anis, Record date: 2093-02-19, # U4660137. Dr. Dr Worley Colonel, ID: ZJ:9570208, IP 005.005.005.005. He is a 67 male was admitted to the ST. LUKE'S HOSPITAL AT THE VINTAGE for cystectomy on 06-02-1981. Patient's VIN : 3CCCC22DDDD333888, SSN SSN-618-77-1042, Driver's license W693817528998. Phone 0496 46 46 70, 3100 weston rd, Shattuck, E-MAIL: Freddi@hotmail.com. 
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification_glove| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|181.4 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ChunkMergeModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: Financial English BERT Embeddings (Number shape masking) author: John Snow Labs name: bert_embeddings_sec_bert_sh date: 2022-04-12 tags: [bert, embeddings, en, open_source, financial] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Financial BERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `sec-bert-shape` is an English model originally trained by `nlpaueb`. This model is the same as BERT Base, but we replace numbers with pseudo-tokens that represent the number’s shape, so numeric expressions (of known shapes) are no longer fragmented, e.g., '53.2' becomes '[XX.X]' and '40,200.5' becomes '[XX,XXX.X]'. If you are interested in Financial Embeddings, also take a look at these two models: - [sec-base](https://nlp.johnsnowlabs.com/2022/04/12/bert_embeddings_sec_bert_base_en_3_0.html): Same as BERT Base but trained with financial documents.
- [sec-num](https://nlp.johnsnowlabs.com/2022/04/12/bert_embeddings_sec_bert_num_en_3_0.html): Same as BERT sec-base, but we replace every number token with a [NUM] pseudo-token, handling all numeric expressions in a uniform manner and disallowing their fragmentation. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_sec_bert_sh_en_3.4.2_3.0_1649758845734.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_sec_bert_sh_en_3.4.2_3.0_1649758845734.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_sh","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_sh","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.sec_bert_sh").predict("""I love Spark NLP""") ```
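The number-shape masking described above can be sketched as a small preprocessing step. This is an illustrative helper, not the model's actual tokenizer; the function name and regex are assumptions:

```python
import re

def mask_number_shape(text):
    # Illustrative sketch only: replace each number with a pseudo-token
    # encoding its shape, e.g. '53.2' -> '[XX.X]', '40,200.5' -> '[XX,XXX.X]'
    def shape(match):
        return "[" + re.sub(r"\d", "X", match.group(0)) + "]"
    return re.sub(r"\d[\d,.]*\d|\d", shape, text)

print(mask_number_shape("Revenue grew 53.2% to 40,200.5 million."))
# -> Revenue grew [XX.X]% to [XX,XXX.X] million.
```

The model itself applies this kind of mapping during its own tokenization; the sketch is only meant to show why numeric expressions of a known shape map to a single vocabulary entry instead of being fragmented into subword pieces.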
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_sec_bert_sh| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|409.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/nlpaueb/sec-bert-shape - https://arxiv.org/abs/2203.06482 - http://nlp.cs.aueb.gr/ --- layout: model title: Stopwords Remover for Serbian language (389 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, sr, open_source] task: Stop Words Removal language: sr edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_sr_3.4.1_3.0_1646673003296.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_sr_3.4.1_3.0_1646673003296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","sr") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Ниси бољи од мене"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","sr") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Ниси бољи од мене").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("sr.stopwords").predict("""Ниси бољи од мене""") ```
## Results ```bash +------+ |result| +------+ |[бољи]| +------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|sr| |Size:|2.7 KB| --- layout: model title: Sentence Entity Resolver for LOINC (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_loinc_augmented date: 2021-11-23 tags: [loinc, entity_resolution, clinical, en, licensed] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.2 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted clinical NER entities to LOINC codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It is trained on an augmented version of the dataset used in previous LOINC resolver models. ## Predicted Entities {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_loinc_augmented_en_3.3.2_2.4_1637664939262.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_loinc_augmented_en_3.3.2_2.4_1637664939262.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The ```sbiobertresolve_loinc_augmented``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_jsl``` as the NER model.
Set ```Test, BMI, HDL, LDL, Medical_Device, Temperature, Total_Cholesterol, Triglycerides, Blood_Pressure``` in ```.setWhiteList()``` to restrict which entity types are passed to the resolver.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols("document")\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical','en', 'clinical/models')\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(['Test']) chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_loinc_augmented","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("loinc_code")\ .setDistanceFunction("EUCLIDEAN") pipeline_loinc = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, ner, ner_converter, chunk2doc, sbert_embedder, resolver]) data = spark.createDataFrame([["""The patient is a 22-year-old female with a history of obesity. She has a Body mass index (BMI) of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126. 
Her hgba1c is 8.2%."""]]).toDF("text") results = pipeline_loinc.fit(data).transform(data) ``` ```scala val documentAssembler = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Test")) val chunk2doc = Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_loinc_augmented", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("loinc_code") .setDistanceFunction("EUCLIDEAN") val pipeline_loinc = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, ner, ner_converter, chunk2doc, sbert_embedder, resolver)) val data = Seq("The patient is a 22-year-old female with a history of obesity. She has a Body mass index (BMI) of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126. 
Her hgba1c is 8.2%.").toDF("text") val result = pipeline_loinc.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.loinc.augmented").predict("""The patient is a 22-year-old female with a history of obesity. She has a Body mass index (BMI) of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126. Her hgba1c is 8.2%.""") ```
## Results ```bash +--------------------------+-----+---+------+----------+----------+--------------------------------------------------+--------------------------------------------------+ | chunk|begin|end|entity|confidence|Loinc_Code| all_codes| resolutions| +--------------------------+-----+---+------+----------+----------+--------------------------------------------------+--------------------------------------------------+ | Body mass index| 74| 88| Test|0.39306664| LP35925-4|LP35925-4:::BDYCRC:::LP172732-2:::39156-5:::LP7...|body mass index:::body circumference:::body mus...| |aspartate aminotransferase| 111|136| Test| 0.74925| LP15426-7|LP15426-7:::14409-7:::LP307348-5:::LP15333-5:::...|aspartate aminotransferase::: aspartate transam...| | alanine aminotransferase| 146|169| Test| 0.9579| LP15333-5|LP15333-5:::LP307326-1:::16324-6:::LP307348-5::...|alanine aminotransferase:::alanine aminotransfe...| | hgba1c| 180|185| Test| 0.1118| 17855-8|17855-8:::4547-6:::55139-0:::72518-4:::45190-6:...| hba1c::: hgb a1::: hb1::: hcds1::: hhc1::: htr...| +--------------------------+-----+---+------+----------+----------+--------------------------------------------------+--------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_loinc_augmented| |Compatibility:|Healthcare NLP 3.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[loinc_code]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on standard LOINC coding system. 
--- layout: model title: Generic Deidentification NER (SEC Bert Embeddings) author: John Snow Labs name: finner_deid_sec date: 2023-02-24 tags: [deid, deidentification, anonymization, en, licensed] task: Named Entity Recognition language: en edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: FinanceNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an NER model that detects generic entities which may need to be masked or obfuscated to comply with regulations such as GDPR and CCPA. Note that this is only an NER model; for end-to-end anonymization, try the full De-identification pipelines available in Models Hub. The only difference between this model and `finner_deid` is the embeddings used. ## Predicted Entities `AGE`, `CITY`, `COUNTRY`, `DATE`, `EMAIL`, `FAX`, `LOCATION-OTHER`, `ORG`, `PERSON`, `PHONE`, `PROFESSION`, `STATE`, `STREET`, `URL`, `ZIP` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/DEID_FIN/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_deid_sec_en_1.0.0_3.0_1677282571388.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_deid_sec_en_1.0.0_3.0_1677282571388.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = finance.NerModel.pretrained('finner_deid_sec', "en", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = [""" This LICENSE AND DEVELOPMENT AGREEMENT (this Agreement) is entered into effective as of Nov. 02, 2019 (the Effective Date) by and between Bioeq IP AG, having its principal place of business at 333 Twin Dolphin Drive, Suite 600, Redwood City, CA, 94065, USA (Licensee). """] res = model.transform(spark.createDataFrame([text]).toDF("text")) ```
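Once entities are detected, masking can be sketched as a simple post-processing step over the chunk offsets the NER converter produces. This is a hypothetical illustration (helper name and offsets are assumptions); for production use, prefer the full De-identification pipelines mentioned above:

```python
def mask_entities(text, chunks):
    # Hypothetical sketch: replace each detected chunk with its label.
    # chunks: list of (begin, end, label) with inclusive character offsets,
    # as produced by the NER converter above. Process right-to-left so
    # earlier offsets stay valid after each replacement.
    for begin, end, label in sorted(chunks, reverse=True):
        text = text[:begin] + "<" + label + ">" + text[end + 1:]
    return text

text = "entered into effective as of Nov. 02, 2019 by Bioeq IP AG"
print(mask_entities(text, [(29, 41, "DATE")]))
# -> entered into effective as of <DATE> by Bioeq IP AG
```

Obfuscation (replacing a chunk with a realistic fake value instead of a label) follows the same offset logic, which is what the licensed De-identification annotators implement for you.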
## Results ```bash +-----------+----------------+ | token| ner_label| +-----------+----------------+ | This| O| | LICENSE| O| | AND| O| |DEVELOPMENT| O| | AGREEMENT| O| | (| O| | this| O| | Agreement| O| | )| O| | is| O| | entered| O| | into| O| | effective| O| | as| O| | of| O| | Nov| B-DATE| | .| I-DATE| | 02| I-DATE| | ,| I-DATE| | 2019| I-DATE| | (| O| | the| O| | Effective| O| | Date| O| | )| O| | by| O| | and| O| | between| O| | Bioeq| O| | IP| O| | AG| O| | ,| O| | having| O| | its| O| | principal| O| | place| O| | of| O| | business| O| | at| O| | 333| B-STREET| | Twin| I-STREET| | Dolphin| I-STREET| | Drive| I-STREET| | ,| O| | Suite|B-LOCATION-OTHER| | 600|I-LOCATION-OTHER| | ,| O| | Redwood| B-CITY| | City| I-CITY| | ,| O| | CA| B-STATE| | ,| O| | 94065| B-ZIP| | ,| O| | USA| B-STATE| | (| O| | Licensee| O| | ).| O| +-----------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_deid_sec| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.4 MB| ## References In-house annotated documents with protected information ## Benchmarking ```bash label precision recall f1-score support B-AGE 0.96 0.89 0.92 245 B-CITY 0.85 0.86 0.86 123 B-COUNTRY 0.86 0.67 0.75 36 B-DATE 0.98 0.97 0.97 2352 B-ORG 0.75 0.71 0.73 38 B-PERSON 0.97 0.94 0.95 1348 B-PHONE 0.86 0.80 0.83 86 B-PROFESSION 0.93 0.75 0.83 84 B-STATE 0.92 0.89 0.91 102 B-STREET 0.99 0.91 0.95 89 I-CITY 0.82 0.77 0.79 35 I-COUNTRY 1.00 0.50 0.67 6 I-DATE 0.96 0.95 0.96 402 I-ORG 0.71 0.86 0.77 28 I-PERSON 0.98 0.96 0.97 1240 I-PHONE 0.91 0.92 0.92 77 I-PROFESSION 0.96 0.79 0.87 70 I-STATE 1.00 0.62 0.77 8 I-STREET 0.98 0.94 0.96 188 I-ZIP 0.84 0.97 0.90 60 O 1.00 1.00 1.00 194103 accuracy - - 1.00 200762 macro-avg 0.72 0.62 0.65 200762 weighted-avg 1.00 1.00 1.00 200762 ``` --- layout: model title: Fast Neural Machine Translation Model from 
Italic Languages to English author: John Snow Labs name: opus_mt_itc_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, itc, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `itc` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_itc_en_xx_2.7.0_2.4_1609170571711.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_itc_en_xx_2.7.0_2.4_1609170571711.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_itc_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_itc_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.itc.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_itc_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Navteca's Tapas Table Understanding (Large, WTQ) author: John Snow Labs name: table_qa_tapas_large_finetuned_wtq date: 2022-09-30 tags: [en, table, qa, question, answering, open_source] task: Table Question Answering language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: TapasForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a zero-shot Table Understanding model that lets you carry out Question Answering on Spark DataFrames. If your table is stored in a file format such as CSV, load it into Spark before using this model. Size of this model: Large. Has aggregation operations: Yes. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_large_finetuned_wtq_en_4.2.0_3.0_1664530763103.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_large_finetuned_wtq_en_4.2.0_3.0_1664530763103.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python json_data = """ { "header": ["name", "money", "age"], "rows": [ ["Donald Trump", "$100,000,000", "75"], ["Elon Musk", "$20,000,000,000,000", "55"] ] } """ queries = [ "Who earns less than 200,000,000?", "Who earns 100,000,000?", "How much money has Donald Trump?", "How old are they?", ] data = spark.createDataFrame([ [json_data, " ".join(queries)] ]).toDF("table_json", "questions") document_assembler = MultiDocumentAssembler() \ .setInputCols("table_json", "questions") \ .setOutputCols("document_table", "document_questions") sentence_detector = SentenceDetector() \ .setInputCols(["document_questions"]) \ .setOutputCol("questions") table_assembler = TableAssembler()\ .setInputCols(["document_table"])\ .setOutputCol("table") tapas = TapasForQuestionAnswering\ .pretrained("table_qa_tapas_large_finetuned_wtq","en")\ .setInputCols(["questions", "table"])\ .setOutputCol("answers") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, table_assembler, tapas ]) model = pipeline.fit(data) model\ .transform(data)\ .selectExpr("explode(answers) AS answer")\ .select("answer")\ .show(truncate=False) ```
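If your table lives in a CSV file rather than a JSON string, one way to build the `{"header": ..., "rows": ...}` layout shown above is a small conversion helper (the function name is illustrative, not a Spark NLP API):

```python
import csv
import io
import json

def csv_to_table_json(csv_text):
    # Illustrative sketch: turn CSV text into the {"header", "rows"}
    # JSON layout consumed by TableAssembler / TapasForQuestionAnswering.
    # csv.reader handles quoted fields such as "$100,000,000" correctly.
    rows = list(csv.reader(io.StringIO(csv_text)))
    return json.dumps({"header": rows[0], "rows": rows[1:]})

csv_text = 'name,money,age\nDonald Trump,"$100,000,000",75\n'
print(csv_to_table_json(csv_text))
```

The resulting string can be put into the `table_json` column of the DataFrame exactly as in the example above.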
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |answer | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} | |{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|table_qa_tapas_large_finetuned_wtq| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| ## References https://www.microsoft.com/en-us/download/details.aspx?id=54253 https://github.com/ppasupat/WikiTableQuestions --- layout: model title: Arabic BertForMaskedLM Mini Cased model (from asafaya) author: John Snow Labs name: bert_embeddings_mini_arabic date: 2022-12-02 tags: [ar, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ar edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated 
to provide scalability and production-readiness using Spark NLP. `bert-mini-arabic` is an Arabic model originally trained by `asafaya`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_mini_arabic_ar_4.2.4_3.0_1670020664579.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_mini_arabic_ar_4.2.4_3.0_1670020664579.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_mini_arabic","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_mini_arabic","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_mini_arabic| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|43.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/asafaya/bert-mini-arabic - https://traces1.inria.fr/oscar/ - http://commoncrawl.org/ - https://dumps.wikimedia.org/backup-index.html - https://github.com/google-research/bert - https://www.tensorflow.org/tfrc - https://github.com/alisafaya/Arabic-BERT --- layout: model title: Legal United Nations Document Classifier (EURLEX) author: John Snow Labs name: legclf_united_nations_bert date: 2023-03-06 tags: [en, legal, classification, clauses, united_nations, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_united_nations_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class United_Nations or not (binary classification) according to EuroVoc labels. ## Predicted Entities `United_Nations`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_united_nations_bert_en_1.0.0_3.0_1678111655117.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_united_nations_bert_en_1.0.0_3.0_1678111655117.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_united_nations_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[United_Nations]| |[Other]| |[Other]| |[United_Nations]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_united_nations_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.85 0.85 0.85 20 United_Nations 0.87 0.87 0.87 23 accuracy - - 0.86 43 macro-avg 0.86 0.86 0.86 43 weighted-avg 0.86 0.86 0.86 43 ``` --- layout: model title: English BertForQuestionAnswering model (from datauma) author: John Snow Labs name: bert_qa_datauma_bert_finetuned_squad date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `datauma`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_datauma_bert_finetuned_squad_en_4.0.0_3.0_1654535608413.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_datauma_bert_finetuned_squad_en_4.0.0_3.0_1654535608413.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_datauma_bert_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_datauma_bert_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_datauma").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_datauma_bert_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/datauma/bert-finetuned-squad --- layout: model title: English BertForQuestionAnswering model (from peggyhuang) author: John Snow Labs name: bert_qa_bert_base_uncased_coqa date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-coqa` is an English model originally trained by `peggyhuang`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_coqa_en_4.0.0_3.0_1654180732707.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_coqa_en_4.0.0_3.0_1654180732707.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_coqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_coqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.base_uncased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_coqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/peggyhuang/bert-base-uncased-coqa --- layout: model title: Universal Sentence Encoder Multilingual Large (tfhub_use_multi_lg) author: John Snow Labs name: tfhub_use_multi_lg date: 2021-05-06 tags: [xx, open_source, embeddings] task: Embeddings language: xx edition: Spark NLP 3.0.0 spark_version: 3.0 deprecated: true annotator: UniversalSentenceEncoder article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks. The model is trained and optimized for greater-than-word-length text, such as sentences, phrases, or short paragraphs. It is trained on a variety of data sources and tasks with the aim of dynamically accommodating a wide range of natural language understanding tasks. The input is variable-length text and the output is a 512-dimensional vector. The universal-sentence-encoder model was trained with a deep averaging network (DAN) encoder. The text encoder supports 16 languages: Arabic, Chinese (simplified), Chinese (traditional), English, French, German, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish, Thai, Turkish, and Russian. The details are described in the paper "[Multilingual Universal Sentence Encoder for Semantic Retrieval](https://arxiv.org/abs/1907.04307)". Note: This model only works on Linux and macOS operating systems and is not compatible with Windows due to an incompatibility of the SentencePiece library. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/tfhub_use_multi_lg_xx_3.0.0_3.0_1620294638956.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/tfhub_use_multi_lg_xx_3.0.0_3.0_1620294638956.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_multi_lg", "xx") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") ``` ```scala val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_multi_lg", "xx") .setInputCols("sentence") .setOutputCol("sentence_embeddings") ``` {:.nlu-block} ```python import nlu text = ["I love NLP", "Me encanta usar SparkNLP"] embeddings_df = nlu.load('xx.use.multi_lg').predict(text, output_level='sentence') embeddings_df ```
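The 512-dimensional sentence vectors produced above are typically compared with cosine similarity for semantic-similarity tasks such as the STS benchmark. A minimal, dependency-free sketch of that comparison — the toy 4-dimensional vectors are invented stand-ins for real 512-dimensional model outputs:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional stand-ins for the model's 512-dimensional outputs.
emb_en = [0.10, 0.30, -0.20, 0.70]   # e.g. embedding of "I love NLP"
emb_es = [0.12, 0.28, -0.18, 0.69]   # e.g. embedding of "Me encanta usar SparkNLP"

print(round(cosine_similarity(emb_en, emb_es), 3))
```

Semantically close sentences yield similarity near 1, unrelated ones near 0.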
## Results ```bash It gives a 512-dimensional vector for each sentence. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|tfhub_use_multi_lg| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|xx| ## Data Source This embeddings model is imported from [https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3](https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3) ## Benchmarking ```bash - We apply this model to the STS benchmark for semantic similarity. Results are shown below: STSBenchmark | dev | test | -----------------------------------|--------|-------| Correlation coefficient of Pearson | 0.837 | 0.825 | - For semantic similarity retrieval, we evaluate the model on the [Quora and AskUbuntu retrieval tasks](https://arxiv.org/abs/1811.08008). Results are shown below: Dataset | Quora | AskUbuntu | Average | -----------------------|-------|-----------|---------| Mean Average Precision | 89.1 | 42.3 | 65.7 | - For translation pair retrieval, we evaluate the model on the United Nations Parallel Corpus. Results are shown below: Language Pair | en-es | en-fr | en-ru | en-zh | ---------------|--------|-------|-------|-------| Precision@1 | 86.1 | 83.3 | 88.9 | 78.8 | ``` --- layout: model title: Portuguese asr_bp_sid10_xlsr TFWav2Vec2ForCTC from lgris author: John Snow Labs name: asr_bp_sid10_xlsr date: 2022-09-26 tags: [wav2vec2, pt, audio, open_source, asr] task: Automatic Speech Recognition language: pt edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_bp_sid10_xlsr` is a Portuguese model originally trained by lgris. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_bp_sid10_xlsr_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_bp_sid10_xlsr_pt_4.2.0_3.0_1664191635179.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_bp_sid10_xlsr_pt_4.2.0_3.0_1664191635179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_bp_sid10_xlsr", "pt")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_bp_sid10_xlsr", "pt") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_bp_sid10_xlsr| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|pt| |Size:|756.4 MB| --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from janeel) author: John Snow Labs name: roberta_qa_janeel_base_finetuned_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad` is an English model originally trained by `janeel`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_janeel_base_finetuned_squad_en_4.3.0_3.0_1674217296605.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_janeel_base_finetuned_squad_en_4.3.0_3.0_1674217296605.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_janeel_base_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_janeel_base_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
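Under the hood, extractive QA models like this one score every context token as a potential answer start and answer end, then select the highest-scoring valid span. A simplified, framework-free illustration of that span selection — the tokens and logits below are fabricated for demonstration; Spark NLP performs this step internally:

```python
def best_span(start_logits, end_logits, max_len=10):
    """Pick the (start, end) pair maximizing start_logits[s] + end_logits[e], with s <= e."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Fabricated per-token logits for the context "My name is Clara and I live in Berkeley."
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.0]
end_logits   = [0.0, 0.1, 0.2, 4.8, 0.1, 0.1, 0.1, 0.1, 0.4, 0.0]

s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # prints "Clara"
```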
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_janeel_base_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/janeel/roberta-base-finetuned-squad --- layout: model title: Pipeline to Detect Clinical Entities author: John Snow Labs name: ner_jsl_greedy_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_jsl_greedy](https://nlp.johnsnowlabs.com/2021/06/24/ner_jsl_greedy_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_greedy_pipeline_en_3.4.1_3.0_1647869775586.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_greedy_pipeline_en_3.4.1_3.0_1647869775586.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_jsl_greedy_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` ```scala val pipeline = new PretrainedPipeline("ner_jsl_greedy_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. 
His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl_greedy.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
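The pipeline's annotation output can be post-processed into (chunk, label) pairs like those shown in the Results table. A small sketch assuming aligned chunk/label lists — the dictionary below is a hypothetical stand-in, and the exact key names depend on the pipeline's configured output columns:

```python
# Hypothetical annotate()-style output: aligned lists of chunks and labels.
annotations = {
    "ner_chunk": ["21-day-old", "Caucasian", "male", "Tylenol"],
    "ner_label": ["Age", "Race_Ethnicity", "Gender", "Drug"],
}

def to_pairs(annotations):
    """Pair each detected chunk with its entity label."""
    return list(zip(annotations["ner_chunk"], annotations["ner_label"]))

for chunk, label in to_pairs(annotations):
    print(f"{chunk}\t{label}")
```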
## Results ```bash +----------------------------------------------+----------------------------+ |chunk |ner_label | +----------------------------------------------+----------------------------+ |21-day-old |Age | |Caucasian |Race_Ethnicity | |male |Gender | |for 2 days |Duration | |congestion |Symptom | |mom |Gender | |suctioning yellow discharge |Symptom | |nares |External_body_part_or_region| |she |Gender | |mild problems with his breathing while feeding|Symptom | |perioral cyanosis |Symptom | |retractions |Symptom | |One day ago |RelativeDate | |mom |Gender | |tactile temperature |Symptom | |Tylenol |Drug | |Baby |Age | |decreased p.o. intake |Symptom | |His |Gender | |20 minutes |Duration | +----------------------------------------------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_greedy_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Detect Problems, Tests and Treatments (ner_crf) author: John Snow Labs name: ner_crf class: NerCrfModel language: en nav_key: models repository: clinical/models date: 2020-01-28 task: Named Entity Recognition edition: Healthcare NLP 2.4.0 spark_version: 2.4 tags: [ner] supported: true annotator: NerCrfModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description The Named Entity Recognition annotator allows a generic model to be trained with a CRF algorithm. Clinical NER (Large) is a Named Entity Recognition model that annotates text to find references to clinical events. The entities it annotates are Problem, Treatment, and Test. Clinical NER is trained with the 'embeddings_clinical' word embeddings model, so be sure to use the same embeddings in the pipeline. 
## Predicted Entities `Problem`, `Test`, `Treatment` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_crf_en_2.4.0_2.4_1580237286004.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_crf_en_2.4.0_2.4_1580237286004.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python model = NerCrfModel.pretrained("ner_crf","en","clinical/models")\ .setInputCols("sentence","token","pos","word_embeddings")\ .setOutputCol("ner") ``` ```scala val model = NerCrfModel.pretrained("ner_crf","en","clinical/models") .setInputCols("sentence","token","pos","word_embeddings") .setOutputCol("ner") ```
{:.model-param} ## Model Information {:.table-model} |---------------|---------------------------------------| | Name: | ner_crf | | Type: | NerCrfModel | | Compatibility: | Spark NLP 2.4.0+ | | License: | Licensed | | Edition: | Official | |Input labels: | [sentence, token, pos, word_embeddings] | |Output labels: | [ner] | | Language: | en | | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained on augmented i2b2 data with the `embeddings_clinical` embeddings model. --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_hier_triplet_0.1_epochs_1_shard_1_squad2.0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_hier_triplet_0.1_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_hier_triplet_0.1_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223509314.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_hier_triplet_0.1_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223509314.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_hier_triplet_0.1_epochs_1_shard_1_squad2.0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_hier_triplet_0.1_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_hier_triplet_0.1_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|460.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_hier_triplet_0.1_epochs_1_shard_1_squad2.0 --- layout: model title: Extract Clinical Department Entities from Voice of the Patient Documents (embeddings_clinical) author: John Snow Labs name: ner_vop_clinical_dept date: 2023-06-06 tags: [licensed, clinical, en, ner, vop] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts medical device and clinical department mentions from documents written in the patient's own words. ## Predicted Entities `AdmissionDischarge`, `ClinicalDept`, `MedicalDevice` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_clinical_dept_en_4.4.3_3.0_1686074506621.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_clinical_dept_en_4.4.3_3.0_1686074506621.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_clinical_dept", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["My little brother is having surgery tomorrow in the orthopedic department. He is getting a titanium plate put in his leg to help it heal faster. 
Wishing him a speedy recovery!"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_clinical_dept", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("My little brother is having surgery tomorrow in the orthopedic department. He is getting a titanium plate put in his leg to help it heal faster. Wishing him a speedy recovery!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:----------------------|:--------------| | orthopedic department | ClinicalDept | | titanium plate | MedicalDevice | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_clinical_dept| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
## Benchmarking ```bash label tp fp fn total precision recall f1 AdmissionDischarge 29 1 5 34 0.97 0.85 0.91 ClinicalDept 289 31 37 326 0.90 0.89 0.89 MedicalDevice 253 97 79 332 0.72 0.76 0.74 macro_avg 571 129 121 692 0.86 0.83 0.85 micro_avg 571 129 121 692 0.82 0.83 0.82 ``` --- layout: model title: Arabic Electra Embeddings (from aubmindlab) author: John Snow Labs name: electra_embeddings_araelectra_base_generator date: 2022-05-17 tags: [ar, open_source, electra, embeddings] task: Embeddings language: ar edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `araelectra-base-generator` is an Arabic model originally trained by `aubmindlab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_araelectra_base_generator_ar_3.4.4_3.0_1652786188141.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_araelectra_base_generator_ar_3.4.4_3.0_1652786188141.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_araelectra_base_generator","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_araelectra_base_generator","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب الشرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_araelectra_base_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ar| |Size:|222.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/aubmindlab/araelectra-base-generator - https://arxiv.org/pdf/1406.2661.pdf - https://arxiv.org/abs/2012.15516 - https://archive.org/details/arwiki-20190201 - https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4 - https://www.aclweb.org/anthology/W19-4619 - https://sites.aub.edu.lb/mindlab/ - https://www.yakshof.com/#/ - https://www.behance.net/rahalhabib - https://www.linkedin.com/in/wissam-antoun-622142b4/ - https://twitter.com/wissam_antoun - https://github.com/WissamAntoun - https://www.linkedin.com/in/fadybaly/ - https://twitter.com/fadybaly - https://github.com/fadybaly --- layout: model title: English DistilBertForTokenClassification Cased model (from Neurona) author: John Snow Labs name: distilbert_token_classifier_cpener_test date: 2023-03-14 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cpener-test` is an English model originally trained by `Neurona`. 
## Predicted Entities `cpe_vendor`, `cpe_version`, `cpe_product` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_cpener_test_en_4.3.1_3.0_1678783066417.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_cpener_test_en_4.3.1_3.0_1678783066417.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_cpener_test","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_cpener_test","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
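The `ner` output column holds IOB-style tags (e.g. `B-cpe_vendor`, `I-cpe_product`, `O`) that are usually merged into entity chunks. A framework-free sketch of that merging step — the tokens and tags below are invented for illustration:

```python
def bio_to_chunks(tokens, tags):
    """Merge B-/I- tagged tokens into (chunk_text, label) spans; O tokens close a span."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

# Invented example in the spirit of this model's CPE entities.
tokens = ["Apache", "Tomcat", "9.0.1", "by", "Apache"]
tags = ["B-cpe_product", "I-cpe_product", "B-cpe_version", "O", "B-cpe_vendor"]
print(bio_to_chunks(tokens, tags))
```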
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_cpener_test| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Neurona/cpener-test --- layout: model title: Legal Warranties Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_warranties_bert date: 2023-03-05 tags: [en, legal, classification, clauses, warranties, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Warranties` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc.
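The first technique above, paragraph splitting by multiline, can be sketched in plain Python (an illustrative helper only; the workshop tutorial linked above uses Spark NLP annotators for this):

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into candidate provisions on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "WARRANTIES. Seller warrants that...\n\nGOVERNING LAW. This Agreement..."
print(split_paragraphs(doc))  # two provisions, one per paragraph
```

Each resulting paragraph can then be fed to the classifier as its own row, so the model always sees a full provision rather than isolated sentences.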
Note that this model's embeddings allow up to 512 tokens. If your input is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Warranties`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_warranties_bert_en_1.0.0_3.0_1678049972315.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_warranties_bert_en_1.0.0_3.0_1678049972315.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_warranties_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Warranties]| |[Other]| |[Other]| |[Warranties]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_warranties_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.86 0.87 0.86 83 Warranties 0.81 0.80 0.80 59 accuracy - - 0.84 142 macro-avg 0.83 0.83 0.83 142 weighted-avg 0.84 0.84 0.84 142 ``` --- layout: model title: Modern Greek (1453-) asr_xlsr_53_wav2vec_greek TFWav2Vec2ForCTC from harshit345 author: John Snow Labs name: asr_xlsr_53_wav2vec_greek date: 2022-09-25 tags: [wav2vec2, el, audio, open_source, asr] task: Automatic Speech Recognition language: el edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xlsr_53_wav2vec_greek` is a Modern Greek (1453-) model originally trained by harshit345. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_xlsr_53_wav2vec_greek_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xlsr_53_wav2vec_greek_el_4.2.0_3.0_1664109101429.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xlsr_53_wav2vec_greek_el_4.2.0_3.0_1664109101429.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_xlsr_53_wav2vec_greek", "el")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_xlsr_53_wav2vec_greek", "el") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
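Under the hood, Wav2Vec2ForCTC emits frame-level character scores that are decoded with CTC. Greedy CTC decoding first merges repeated frame outputs, then drops the blank symbol — a toy sketch of that principle (illustration only; the annotator handles decoding internally):

```python
def ctc_greedy_decode(frames, blank="_"):
    """Collapse repeated frame labels and remove the CTC blank symbol."""
    out, prev = [], None
    for ch in frames:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_greedy_decode(list("hh_e_ll_lloo")))  # -> hello
```

Note the blank lets CTC keep genuine doubled letters: "ll_ll" decodes to "ll", while "llll" collapses to a single "l".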
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_xlsr_53_wav2vec_greek| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|el| |Size:|1.2 GB| --- layout: model title: Google T5 (Text-To-Text Transfer Transformer) Base author: John Snow Labs name: t5_base date: 2021-01-08 task: [Question Answering, Summarization, Translation] language: en nav_key: models edition: Spark NLP 2.7.1 spark_version: 2.4 tags: [open_source, t5, summarization, translation, en, seq2seq] supported: true recommended: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The T5 transformer model described in the seminal paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". This model can perform a variety of tasks, such as text summarization, question answering, and translation. More details about using the model can be found in the paper (https://arxiv.org/pdf/1910.10683.pdf). {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/T5TRANSFORMER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/T5TRANSFORMER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_en_2.7.1_2.4_1610133506835.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_en_2.7.1_2.4_1610133506835.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Either set the following tasks or have them inline with your input: - summarize: - translate English to German: - translate English to French: - stsb sentence1: Big news. sentence2: No idea. 
The full list of tasks is in the Appendix of the paper: https://arxiv.org/pdf/1910.10683.pdf
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") t5 = T5Transformer() \ .pretrained("t5_base") \ .setTask("summarize:")\ .setMaxOutputLength(200)\ .setInputCols(["documents"]) \ .setOutputCol("summaries") pipeline = Pipeline().setStages([document_assembler, t5]) results = pipeline.fit(data_df).transform(data_df) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("documents") val t5 = T5Transformer .pretrained("t5_base") .setTask("summarize:") .setInputCols(Array("documents")) .setOutputCol("summaries") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val result = pipeline.fit(dataDf).transform(dataDf) ``` {:.nlu-block} ```python import nlu nlu.load("en.t5.base").predict("""Put your text here.""") ```
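T5 is a single text-to-text model: the task is selected purely by a textual prefix prepended to the input, which is what `setTask` does. A hypothetical helper mirroring the prefixes listed above:

```python
def with_task(task: str, text: str) -> str:
    """Prepend a T5 task prefix, as setTask does before inference."""
    return f"{task} {text}"

print(with_task("summarize:", "Spark NLP provides production-grade NLP annotators."))
print(with_task("translate English to German:", "Good morning."))
```

Because the task is just text, the same loaded model can serve summarization, translation, and similarity scoring without reloading.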
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_base| |Compatibility:|Spark NLP 2.7.1+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[t5]| |Language:|en| ## Data Source https://huggingface.co/t5-base --- layout: model title: Voice of the Patients author: John Snow Labs name: ner_vop_wip date: 2023-05-19 tags: [licensed, clinical, en, ner, vop, patient] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts healthcare-related terms from documents written in the patient’s own words. Note: the ‘wip’ suffix indicates that model development is work-in-progress; the model will be finalised and its performance will be improved in upcoming releases. ## Predicted Entities `TestResult`, `SubstanceQuantity`, `InjuryOrPoisoning`, `Treatment`, `Modifier`, `HealthStatus`, `MedicalDevice`, `Procedure`, `Symptom`, `Frequency`, `RelationshipStatus`, `Duration`, `Allergen`, `VitalTest`, `Disease`, `Dosage`, `AdmissionDischarge`, `Test`, `Laterality`, `Route`, `DateTime`, `Drug`, `ClinicalDept`, `Vaccine`, `Form`, `Substance`, `PsychologicalCondition`, `Age`, `BodyPart`, `Employment`, `Gender` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_wip_en_4.4.2_3.0_1684508941946.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_wip_en_4.4.2_3.0_1684508941946.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. 
I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . 
Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
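The `NerConverterInternal` stage merges the token-level BIO tags produced by `MedicalNerModel` into entity chunks. A simplified plain-Python version of that merge, for illustration only (the real converter also handles confidence scores and whitelists):

```python
def bio_to_chunks(tokens, tags):
    """Merge BIO-tagged tokens into (chunk_text, label) pairs."""
    chunks, cur_toks, cur_label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur_toks:
                chunks.append((" ".join(cur_toks), cur_label))
            cur_toks, cur_label = [tok], tag[2:]
        elif tag.startswith("I-") and cur_label == tag[2:]:
            cur_toks.append(tok)
        else:  # "O" or an inconsistent I- tag closes the open chunk
            if cur_toks:
                chunks.append((" ".join(cur_toks), cur_label))
            cur_toks, cur_label = [], None
    if cur_toks:
        chunks.append((" ".join(cur_toks), cur_label))
    return chunks

tokens = ["I", "have", "left", "chest", "pain"]
tags   = ["O", "O", "B-Laterality", "B-BodyPart", "B-Symptom"]
print(bio_to_chunks(tokens, tags))
# -> [('left', 'Laterality'), ('chest', 'BodyPart'), ('pain', 'Symptom')]
```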
## Results ```bash | chunk | ner_label | |:---------------------|:-----------------------| | 20 year old | Age | | girl | Gender | | hyperthyroid | Disease | | 1 month ago | DateTime | | weak | Symptom | | light | Symptom | | panic attacks | PsychologicalCondition | | depression | PsychologicalCondition | | left | Laterality | | chest | BodyPart | | pain | Symptom | | increased | TestResult | | heart rate | VitalTest | | rapidly | Modifier | | weight loss | Symptom | | 4 months | Duration | | hospital | ClinicalDept | | discharged | AdmissionDischarge | | hospital | ClinicalDept | | blood tests | Test | | brain | BodyPart | | mri | Test | | ultrasound scan | Test | | endoscopy | Procedure | | doctors | Employment | | homeopathy doctor | Employment | | he | Gender | | hyperthyroid | Disease | | TSH | Test | | 0.15 | TestResult | | T3 | Test | | T4 | Test | | normal | TestResult | | b12 deficiency | Disease | | vitamin D deficiency | Disease | | weekly | Frequency | | supplement | Drug | | vitamin D | Drug | | 1000 mcg | Dosage | | b12 | Drug | | daily | Frequency | | homeopathy medicine | Drug | | 40 days | Duration | | after 30 days | DateTime | | TSH | Test | | 0.5 | TestResult | | now | DateTime | | weakness | Symptom | | depression | PsychologicalCondition | | last week | DateTime | | rapid heartrate | Symptom | | allopathy medicine | Drug | | homeopathy | Treatment | | thyroid | BodyPart | | allopathy | Treatment | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_wip| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.9 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset "Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. 
I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you." 
## Benchmarking ```bash label tp fp fn total precision recall f1 TestResult 353 98 171 524 0.78 0.67 0.72 SubstanceQuantity 60 20 25 85 0.75 0.71 0.73 InjuryOrPoisoning 122 37 54 176 0.77 0.69 0.73 Treatment 150 30 78 228 0.83 0.66 0.74 Modifier 817 214 322 1139 0.79 0.72 0.75 HealthStatus 80 24 27 107 0.77 0.75 0.76 MedicalDevice 250 71 82 332 0.78 0.75 0.77 Procedure 576 156 129 705 0.79 0.82 0.80 Symptom 3831 858 744 4575 0.82 0.84 0.83 Frequency 865 147 214 1079 0.85 0.80 0.83 RelationshipStatus 19 2 5 24 0.90 0.79 0.84 Duration 1845 244 465 2310 0.88 0.80 0.84 Allergen 38 4 8 46 0.90 0.83 0.86 VitalTest 143 16 29 172 0.90 0.83 0.86 Disease 1745 296 270 2015 0.85 0.87 0.86 Dosage 348 48 64 412 0.88 0.84 0.86 AdmissionDischarge 29 4 5 34 0.88 0.85 0.87 Test 1064 136 144 1208 0.89 0.88 0.88 Laterality 542 68 86 628 0.89 0.86 0.88 Route 42 5 6 48 0.89 0.88 0.88 DateTime 4075 706 327 4402 0.85 0.93 0.89 Drug 1323 196 117 1440 0.87 0.92 0.89 ClinicalDept 280 25 46 326 0.92 0.86 0.89 Vaccine 37 4 5 42 0.90 0.88 0.89 Form 252 34 14 266 0.88 0.95 0.91 Substance 398 58 23 421 0.87 0.95 0.91 PsychologicalCondition 411 42 33 444 0.91 0.93 0.92 Age 529 44 53 582 0.92 0.91 0.92 BodyPart 2730 224 170 2900 0.92 0.94 0.93 Employment 1168 37 75 1243 0.97 0.94 0.95 Gender 1292 21 25 1317 0.98 0.98 0.98 macro_avg 25414 3869 3816 29230 0.86 0.84 0.85 micro_avg 25414 3869 3816 29230 0.87 0.87 0.87 ``` --- layout: model title: Medical Spell Checker author: John Snow Labs name: spellcheck_clinical date: 2022-04-14 tags: [spellcheck, medical, medical_spell_checker, spell_checker, spelling_corrector, en, licensed, clinical] task: Spell Check language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 2.4 supported: true annotator: SpellCheckModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Contextual Spell Checker is a sequence-to-sequence model that detects and corrects spelling errors in your medical input text. 
It’s based on a Levenshtein automaton for generating candidate corrections and a neural language model for ranking them. This model has been trained on a dataset containing data from different sources: MTSamples, i2b2 clinical notes, and several specific medical corpora. You can download the model that comes fully pretrained and ready to use. However, you can still customize it further without the need for re-training a new model from scratch. This can be accomplished by providing custom definitions for the word classes the model has been trained on, namely Dates, Numbers, Ages, Units, and Medications. This model is trained for PySpark 2.4.x users with Spark NLP 3.4.1. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CONTEXTUAL_SPELL_CHECKER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_en_3.4.1_2.4_1649926082521.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_en_3.4.1_2.4_1649926082521.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token")\ .setContextChars(["*", "-", "“", "(", "[", "\n", ".","\"", "”", ",", "?", ")", "]", "!", ";", ":", "'s", "’s"]) spellModel = ContextSpellCheckerModel\ .pretrained('spellcheck_clinical', 'en', 'clinical/models')\ .setInputCols("token")\ .setOutputCol("checked") pipeline = Pipeline(stages = [ documentAssembler, tokenizer, spellModel]) light_pipeline = LightPipeline(pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) example = ["Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.", "With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.", "Abdomen is sort, nontender, and nonintended.", "Patient not showing pain or any wealth problems.", "No cute distress"] result = light_pipeline.annotate(example) ``` ```scala val assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") .setContextChars(Array("*", "-", "“", "(", "[", "\n", ".","\"", "”", ",", "?", ")", "]", "!", ";", ":", "'s", "’s")) val spellChecker = ContextSpellCheckerModel. pretrained("spellcheck_clinical", "en", "clinical/models"). setInputCols("token"). 
setOutputCol("checked") val pipeline = new Pipeline().setStages(Array( assembler, tokenizer, spellChecker)) val light_pipeline = new LightPipeline(pipeline.fit(Seq("").toDF("text"))) val text = Array("Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.", "With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.", "Abdomen is sort, nontender, and nonintended.", "Patient not showing pain or any wealth problems.", "No cute distress") val result = light_pipeline.annotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.spell.clinical").predict("""Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.""") ```
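The candidate-generation stage described above is driven by edit distance. A compact dynamic-programming Levenshtein implementation, for illustration (the production model uses a Levenshtein automaton over its vocabulary, which avoids computing the distance pairwise):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions from a to b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# "hell" -> "cell" is one substitution away, so "cell" is a viable
# candidate; the language model then picks the best candidate in context.
print(levenshtein("hell", "cell"))  # -> 1
```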
## Results ```bash [{'checked': ['With','the','cell','of','physical','therapy','the','patient','was','ambulated','and','on','postoperative',',','the','patient','tolerating','a','post','surgical','soft','diet','.'], 'document': ['Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.'], 'token': ['Witth','the','hell','of','phisical','terapy','the','patient','was','imbulated','and','on','postoperative',',','the','impatient','tolerating','a','post','curgical','soft','diet','.']}, {'checked': ['With','pain','well','controlled','on','oral','pain','medications',',','she','was','discharged','to','rehabilitation','facility','.'], 'document': ['With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.'], 'token': ['With','paint','wel','controlled','on','orall','pain','medications',',','she','was','discharged','too','reihabilitation','facilitay','.']}, {'checked': ['Abdomen','is','soft',',','nontender',',','and','nondistended','.'], 'document': ['Abdomen is sort, nontender, and nonintended.'], 'token': ['Abdomen','is','sort',',','nontender',',','and','nonintended','.']}, {'checked': ['Patient','not','showing','pain','or','any','health','problems','.'], 'document': ['Patient not showing pain or any wealth problems.'], 'token': ['Patient','not','showing','pain','or','any','wealth','problems','.']}, {'checked': ['No', 'acute', 'distress'], 'document': ['No cute distress'], 'token': ['No', 'cute', 'distress']}] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|spellcheck_clinical| |Compatibility:|Healthcare NLP 3.4.1| |License:|Licensed| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[corrected]| |Language:|en| |Size:|141.2 MB| ## References MTSamples, i2b2 clinical notes, and several specific medical corpora. 
--- layout: model title: English DistilBertForQuestionAnswering model (from kaggleodin) author: John Snow Labs name: distilbert_qa_kaggleodin_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `kaggleodin`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_kaggleodin_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725640096.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_kaggleodin_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725640096.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kaggleodin_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kaggleodin_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_kaggleodin").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
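Extractive QA heads of this kind score every context token as a possible answer start and end; the predicted answer is the span (i, j), j ≥ i, that maximizes the combined score. A simplified sketch of that span argmax (illustrative only, not the DistilBertForQuestionAnswering internals):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) token indices maximizing start + end score."""
    best_i, best_j, best_score = 0, 0, float("-inf")
    for i, s in enumerate(start_scores):
        # only consider spans of bounded length starting at i
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_i, best_j, best_score = i, j, s + end_scores[j]
    return best_i, best_j

# tokens: ["My", "name", "is", "Clara"]; both argmaxes land on "Clara"
print(best_span([0.1, 0.2, 0.1, 2.5], [0.0, 0.1, 0.2, 2.8]))  # -> (3, 3)
```

The j ≥ i constraint is what rules out inverted spans that a naive independent argmax over starts and ends could produce.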
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_kaggleodin_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/kaggleodin/distilbert-base-uncased-finetuned-squad --- layout: model title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77 TFWav2Vec2ForCTC from emeson77 author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77` is an English model originally trained by emeson77. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77_en_4.2.0_3.0_1664037259824.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77_en_4.2.0_3.0_1664037259824.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_emeson77| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English T5ForConditionalGeneration Base Cased model (from mrm8488) author: John Snow Labs name: t5_base_finetuned_break_data date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-finetuned-break_data` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_break_data_en_4.3.0_3.0_1675108822999.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_break_data_en_4.3.0_3.0_1675108822999.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_base_finetuned_break_data","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_base_finetuned_break_data","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_base_finetuned_break_data| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|878.4 MB| ## References - https://huggingface.co/mrm8488/t5-base-finetuned-break_data - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/pdf/1910.10683.pdf - https://i.imgur.com/jVFMMWR.png - https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb - https://twitter.com/psuraj28 - https://twitter.com/mrm8488 - https://www.linkedin.com/in/manuel-romero-cs/ --- layout: model title: English asr_wav2vec2_base_checkpoint_6 TFWav2Vec2ForCTC from jiobiala24 author: John Snow Labs name: asr_wav2vec2_base_checkpoint_6 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_checkpoint_6` is an English model originally trained by jiobiala24. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_checkpoint_6_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_checkpoint_6_en_4.2.0_3.0_1664020777200.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_checkpoint_6_en_4.2.0_3.0_1664020777200.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_checkpoint_6", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_checkpoint_6", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
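The snippets above assume an existing `audioDf` with an `audio_content` column holding float arrays. A minimal sketch of how such a column could be built from a 16-bit PCM WAV file, using only the standard library (assumptions: 16 kHz mono input, an active `spark` session, and a hypothetical file path):

```python
# Sketch only: convert 16-bit PCM WAV samples to normalized floats in [-1.0, 1.0],
# which is the kind of float-array column AudioAssembler consumes.
import struct
import wave

def wav_to_floats(path):
    # Read raw little-endian signed 16-bit frames and normalize by 2^15.
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Hypothetical usage with an active Spark session:
# audio = wav_to_floats("speech.wav")
# audioDf = spark.createDataFrame([[audio]], ["audio_content"])
```

Resampling to the model's expected rate (typically 16 kHz for Wav2vec2) would be an extra step not shown here.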
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_checkpoint_6| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|349.2 MB| --- layout: model title: English asr_wav2vec2_murad_with_some_data TFWav2Vec2ForCTC from MBMMurad author: John Snow Labs name: pipeline_asr_wav2vec2_murad_with_some_data date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_murad_with_some_data` is an English model originally trained by MBMMurad. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_murad_with_some_data_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_murad_with_some_data_en_4.2.0_3.0_1664111468486.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_murad_with_some_data_en_4.2.0_3.0_1664111468486.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_murad_with_some_data', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_murad_with_some_data", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_murad_with_some_data| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Fast Neural Machine Translation Model from English to Baltic Languages author: John Snow Labs name: opus_mt_en_bat date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, bat, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is developed mainly by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. - source languages: `en` - target languages: `bat` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_bat_xx_2.7.0_2.4_1609166860296.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_bat_xx_2.7.0_2.4_1609166860296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_bat", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_bat", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.bat').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_bat| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from vuiseng9) author: John Snow Labs name: bert_qa_bert_base_squadv1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-squadv1` is an English model originally trained by `vuiseng9`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_squadv1_en_4.0.0_3.0_1654180625513.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_squadv1_en_4.0.0_3.0_1654180625513.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_squadv1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_squadv1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base.by_vuiseng9").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
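As intuition for what the `answer` column contains: extractive QA models of this kind score each context token as a potential answer start and end, and return the highest-scoring valid span. A toy sketch with made-up logits (not the model's actual scores or tokenization):

```python
# Toy span selection: pick (start, end) maximizing start_logit + end_logit,
# subject to end >= start and a maximum span length.
def best_span(tokens, start_logits, end_logits, max_len=10):
    best, score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(tokens))):
            if s + end_logits[j] > score:
                score = s + end_logits[j]
                best = (i, j)
    return " ".join(tokens[best[0]:best[1] + 1])

context = "My name is Clara and I live in Berkeley .".split()
# Made-up logits peaking on "Clara" (index 3):
starts = [0, 0, 0, 5, 0, 0, 0, 0, 0, 0]
ends = [0, 0, 0, 5, 0, 0, 0, 0, 0, 0]
print(best_span(context, starts, ends))  # → Clara
```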
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_squadv1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/vuiseng9/bert-base-squadv1 --- layout: model title: Stopwords Remover for Polish language (381 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, pl, open_source] task: Stop Words Removal language: pl edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_pl_3.4.1_3.0_1646673188092.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_pl_3.4.1_3.0_1646673188092.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","pl") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Nie jesteś lepszy ode mnie"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val stop_words = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = StopWordsCleaner.pretrained("stopwords_iso","pl") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Nie jesteś lepszy ode mnie").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pl.stopwords").predict("""Nie jesteś lepszy ode mnie""") ```
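Conceptually, the cleaning step is a case-insensitive membership test against the stop-word list. A toy sketch (the two stop words below are assumptions for illustration, not the model's 381-entry Polish list):

```python
# Toy StopWordsCleaner: keep tokens whose lowercase form is not a stop word.
stop_words = {"nie", "mnie"}  # illustrative subset only

def clean(tokens, stop_words):
    return [t for t in tokens if t.lower() not in stop_words]

print(clean("Nie jesteś lepszy ode mnie".split(), stop_words))
# → ['jesteś', 'lepszy', 'ode']
```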
## Results ```bash +---------------------+ |result | +---------------------+ |[jesteś, lepszy, ode]| +---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|pl| |Size:|2.4 KB| --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from chiendvhust) author: John Snow Labs name: roberta_qa_chiendvhust_base_finetuned_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad` is an English model originally trained by `chiendvhust`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_chiendvhust_base_finetuned_squad_en_4.3.0_3.0_1674217182008.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_chiendvhust_base_finetuned_squad_en_4.3.0_3.0_1674217182008.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_chiendvhust_base_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_chiendvhust_base_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_chiendvhust_base_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|457.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/chiendvhust/roberta-base-finetuned-squad --- layout: model title: Translate English to Sango Pipeline author: John Snow Labs name: translate_en_sg date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, sg, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is developed mainly by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. Note that this module is computationally expensive, especially on longer sequences; the use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `sg` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_sg_xx_2.7.0_2.4_1609686246000.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_sg_xx_2.7.0_2.4_1609686246000.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_sg", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_sg", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.sg').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_sg| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Word2Vec Embeddings in Afrikaans (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, af, open_source] task: Embeddings language: af edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_af_3.4.1_3.0_1647281785039.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_af_3.4.1_3.0_1647281785039.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","af") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ek is lief vir Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","af") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ek is lief vir Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("af.embed.w2v_cc_300d").predict("""Ek is lief vir Spark NLP""") ```
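Per token, an embeddings lookup annotator of this kind maps each token to a fixed vector from its vocabulary, falling back to a zero vector for out-of-vocabulary tokens. A toy sketch with a 3-dimensional, made-up vocabulary (the real model uses 300 dimensions):

```python
# Toy embedding lookup: token -> fixed vector, zeros when out of vocabulary.
import math

vectors = {"spark": [0.1, 0.3, 0.5], "nlp": [0.2, 0.1, 0.4]}  # made-up values

def embed(tokens, vectors, dim=3):
    return [vectors.get(t.lower(), [0.0] * dim) for t in tokens]

def cosine(a, b):
    # Cosine similarity, a common way to compare looked-up vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

emb = embed(["Spark", "NLP", "onbekend"], vectors)
```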
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|af| |Size:|515.9 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from hogger32) author: John Snow Labs name: distilbert_qa_hogger32_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `hogger32`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_hogger32_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771247105.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_hogger32_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771247105.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hogger32_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hogger32_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_hogger32_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/hogger32/distilbert-base-uncased-finetuned-squad --- layout: model title: RCT Binary Classifier (USE) Pipeline author: John Snow Labs name: rct_binary_classifier_use_pipeline date: 2022-06-06 tags: [licensed, rct, clinical, classifier, en] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.4.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [rct_binary_classifier_use](https://nlp.johnsnowlabs.com/2022/05/27/rct_binary_classifier_use_en_3_0.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CLASSIFICATION_RCT/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CLASSIFICATION_RCT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rct_binary_classifier_use_pipeline_en_3.4.2_3.0_1654517524887.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rct_binary_classifier_use_pipeline_en_3.4.2_3.0_1654517524887.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("rct_binary_classifier_use_pipeline", "en", "clinical/models") result = pipeline.annotate("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. """) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("rct_binary_classifier_use_pipeline", "en", "clinical/models") val result = pipeline.annotate("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. 
This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. """) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.rct_binary_use.pipeline").predict("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. 
The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. """) ```
## Results ```bash
+----+--------------------+
|rct |text                |
+----+--------------------+
|true|Abstract:Based on...|
+----+--------------------+
``` {:.model-param} ## Model Information 
{:.table-model} |---|---| |Model Name:|rct_binary_classifier_use_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|991.2 MB| ## Included Models - DocumentAssembler - UniversalSentenceEncoder - ClassifierDLModel --- layout: model title: Spanish BertForQuestionAnswering model (from MMG) author: John Snow Labs name: bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad date: 2022-06-02 tags: [es, open_source, question_answering, bert] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-cased-finetuned-sqac-finetuned-squad` is a Spanish model originally trained by `MMG`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad_es_4.0.0_3.0_1654180513805.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad_es_4.0.0_3.0_1654180513805.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.squad_sqac.bert.base_cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
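For intuition about what the span classifier does internally, the decoding step that turns per-token start/end scores into an answer span can be sketched in plain Python. The scores below are made up for illustration; the real model computes them with BERT over the tokenized question and context.

```python
# Illustrative sketch of extractive-QA span decoding (not the model's actual code).
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) token pair with the highest combined score,
    subject to end >= start and a maximum span length."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best[2]:
                best = (s, e, score)
    return best[0], best[1]

# Hypothetical per-token scores for the example context above.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 3.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0]
end = [0.0, 0.1, 0.0, 2.5, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s : e + 1]))  # Clara
```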
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|410.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/MMG/bert-base-spanish-wwm-cased-finetuned-sqac-finetuned-squad --- layout: model title: Bulgarian Lemmatizer author: John Snow Labs name: lemma date: 2020-05-05 11:14:00 +0800 task: Lemmatization language: bg edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [lemmatizer, bg] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_bg_2.5.0_2.4_1588666297763.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_bg_2.5.0_2.4_1588666297763.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "bg") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Освен че е крал на север, Джон Сноу е английски лекар и лидер в развитието на анестезия и медицинска хигиена.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "bg") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Освен че е крал на север, Джон Сноу е английски лекар и лидер в развитието на анестезия и медицинска хигиена.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Освен че е крал на север, Джон Сноу е английски лекар и лидер в развитието на анестезия и медицинска хигиена."""] lemma_df = nlu.load('bg.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
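The core idea behind the model can be illustrated with a toy form-to-lemma lookup. The example pairs mirror the sample output in the Results section (e.g. the copula "е" maps to "съм"); the pretrained model learns such mappings from treebank data and, unlike this dictionary, also uses the surrounding context to disambiguate forms like "крал" ("king" vs. past participle of "steal").

```python
# Toy sketch only: a static form-to-lemma dictionary, with no context handling.
toy_lemmas = {
    "е": "съм",       # "is" -> lemma "to be"
    "крал": "крада",  # one possible reading: participle of "to steal"
}

def lemmatize(token: str) -> str:
    """Return the lemma for a known surface form, else the token unchanged."""
    return toy_lemmas.get(token, token)

print(lemmatize("е"))      # съм
print(lemmatize("Освен"))  # unknown form passes through unchanged: Освен
```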
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=4, result='Освен', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=6, end=7, result='че', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=9, end=9, result='съм', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=11, end=14, result='крада', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=16, end=17, result='на', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|bg| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from carolgao66) author: John Snow Labs name: distilbert_qa_carolgao66_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `carolgao66`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_carolgao66_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770348185.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_carolgao66_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770348185.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_carolgao66_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_carolgao66_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_carolgao66_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/carolgao66/distilbert-base-uncased-finetuned-squad --- layout: model title: Fast Neural Machine Translation Model from Indo-Iranian Languages to English author: John Snow Labs name: opus_mt_iir_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, iir, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `iir` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_iir_en_xx_2.7.0_2.4_1609163487551.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_iir_en_xx_2.7.0_2.4_1609163487551.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_iir_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_iir_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.iir.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_iir_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Dutch RoBERTa Embeddings author: John Snow Labs name: roberta_embeddings_robbert_v2_dutch_base date: 2022-04-14 tags: [roberta, embeddings, nl, open_source] task: Embeddings language: nl edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `robbert-v2-dutch-base` is a Dutch model originally trained by `pdelobelle`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_robbert_v2_dutch_base_nl_3.4.2_3.0_1649949003731.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_robbert_v2_dutch_base_nl_3.4.2_3.0_1649949003731.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_robbert_v2_dutch_base","nl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ik hou van vonk nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_robbert_v2_dutch_base","nl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ik hou van vonk nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("nl.embed.robbert_v2_dutch_base").predict("""Ik hou van vonk nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_robbert_v2_dutch_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|nl| |Size:|438.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/pdelobelle/robbert-v2-dutch-base - https://github.com/iPieter/RobBERT - https://scholar.google.com/scholar?oi=bibs&hl=en&cites=7180110604335112086 - https://www.aclweb.org/anthology/2021.wassa-1.27/ - https://arxiv.org/pdf/2001.06286.pdf - https://biblio.ugent.be/publication/8704637/file/8704638.pdf - https://arxiv.org/pdf/2004.02814.pdf - https://github.com/proycon/deepfrog - https://arxiv.org/pdf/2010.13652.pdf - https://www.cambridge.org/core/journals/natural-language-engineering/article/abs/automatic-classification-of-participant-roles-in-cyberbullying-can-we-detect-victims-bullies-and-bystanders-in-social-media-text/A2079C2C738C29428E666810B8903342 - https://gitlab.com/spelfouten/dutch-simpletransformers/ - https://arxiv.org/pdf/2101.05716.pdf - https://medium.com/broadhorizon-cmotions/nlp-with-r-part-5-state-of-the-art-in-nlp-transformers-bert-3449e3cd7494 - https://people.cs.kuleuven.be/~pieter.delobelle/robbert/ - https://arxiv.org/abs/2001.06286 - https://arxiv.org/abs/1907.11692 - https://github.com/pytorch/fairseq/tree/master/examples/roberta - https://github.com/benjaminvdb/110kDBRD - https://www.statmt.org/europarl/ - https://arxiv.org/abs/2001.02943 - https://universaldependencies.org/treebanks/nl_lassysmall/index.html - 
https://www.clips.uantwerpen.be/conll2002/ner/ - https://oscar-corpus.com/ - https://github.com/iPieter/RobBERT#how-to-replicate-our-paper-experiments - https://arxiv.org/abs/1909.11942 - https://camembert-model.fr/ - https://en.wikipedia.org/wiki/Robbert - https://muppet.fandom.com/wiki/Bert - https://github.com/iPieter/RobBERT/blob/master/res/robbert_logo.png - https://people.cs.kuleuven.be/~pieter.delobelle - https://thomaswinters.be - https://people.cs.kuleuven.be/~bettina.berendt/ --- layout: model title: Word2Vec Embeddings in Basque (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, eu, open_source] task: Embeddings language: eu edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_eu_3.4.1_3.0_1647285496461.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_eu_3.4.1_3.0_1647285496461.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","eu") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Txinparta nlp maite dut"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","eu") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Txinparta nlp maite dut").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("eu.embed.w2v_cc_300d").predict("""Txinparta nlp maite dut""") ```
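Downstream, the 300-dimensional vectors that land in the `embeddings` column are usually compared with cosine similarity. A minimal, self-contained sketch of that comparison (using toy 3-d vectors, not actual model output):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(cosine([1.0, 0.0, 0.0], [2.0, 0.0, 0.0]))  # 1.0
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```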
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|eu| |Size:|1.1 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Clean patterns pipeline for English author: John Snow Labs name: clean_pattern date: 2021-03-24 tags: [open_source, english, clean_pattern, pipeline, en] supported: true task: [Named Entity Recognition, Lemmatization] language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The clean_pattern pipeline is a pretrained pipeline that performs basic text processing steps and recognizes entities. It covers most of the common text processing tasks on your dataframe. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/clean_pattern_en_3.0.0_3.0_1616544446008.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/clean_pattern_en_3.0.0_3.0_1616544446008.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('clean_pattern', lang = 'en') annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("clean_pattern", lang = "en") val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs ! "] result_df = nlu.load('en.clean.pattern').predict(text) result_df ```
## Results ```bash | | document | sentence | token | normal | |---:|:-----------|:-----------|:----------|:----------| | 0 | ['Hello'] | ['Hello'] | ['Hello'] | ['Hello'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clean_pattern| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| --- layout: model title: Detect Assertion Status (assertion_wip) author: John Snow Labs name: jsl_assertion_wip date: 2021-01-18 task: Assertion Status language: en nav_key: models edition: Healthcare NLP 2.7.0 spark_version: 2.4 tags: [clinical, licensed, assertion, en, ner] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The deep neural network architecture for assertion status detection in Spark NLP is based on a BiLSTM framework, and is a modified version of the architecture proposed by Fancellu et al. (Fancellu, Lopez, and Webber 2016). Its goal is to classify the assertions made on given medical concepts as present, absent, or possible in the patient; conditionally present in the patient under certain circumstances; hypothetically present in the patient at some future point; or mentioned in the patient report but associated with someone else (Uzuner et al. 2011). {:.h2_title} ## Predicted Entities `Present`, `Absent`, `Possible`, `Planned`, `Someoneelse`, `Past`, `Family`, `None`, `Hypothetical`. 
{:.btn-box} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_assertion_wip_en_2.6.1_2.4_1606860510166.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_assertion_wip_en_2.6.1_2.4_1606860510166.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter, AssertionDLModel.
{% include programmingLanguageSelectScalaPython.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") clinical_assertion = AssertionDLModel.pretrained("jsl_assertion_wip", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = model.transform(spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."]], ["text"])) ``` ```scala ... 
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val nerConverter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val clinical_assertion = AssertionDLModel.pretrained("jsl_assertion_wip", "en", "clinical/models") .setInputCols(Array("sentence", "ner_chunk", "embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, nerConverter, clinical_assertion)) val data = Seq("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.").toDF("text") val result = pipeline.fit(data).transform(data) ```
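Conceptually, the AssertionDLModel emits exactly one assertion label per NER chunk, so results are usually read by pairing the two columns position by position. A toy sketch of that pairing, using made-up lists standing in for `ner_chunk.result` and `assertion.result`:

```python
# Illustrative stand-ins for the chunk and assertion result columns
# (values echo the kind of output shown in the Results section).
chunks = ["congestion", "perioral cyanosis", "retractions"]
labels = ["Present", "Absent", "Absent"]

# One assertion label per chunk, so a positional zip recovers the pairs.
paired = list(zip(chunks, labels))
for chunk, label in paired:
    print(f"{chunk} -> {label}")
```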
{:.h2_title} ## Results The output is a dataframe with a sentence per row and an ``"assertion"`` column containing all of the assertion labels in the sentence. The assertion column also contains assertion character indices and other metadata. To get only the entity chunks and assertion labels, without the metadata, select ``"ner_chunk.result"`` and ``"assertion.result"`` from your output dataframe. ```bash
+-----------------------------------------+-----+---+----------------------------+-------+---------+
|chunk                                    |begin|end|ner_label                   |sent_id|assertion|
+-----------------------------------------+-----+---+----------------------------+-------+---------+
|21-day-old                               |17   |26 |Age                         |0      |Family   |
|Caucasian                                |28   |36 |Race_Ethnicity              |0      |Family   |
|male                                     |38   |41 |Gender                      |0      |Family   |
|for 2 days                               |48   |57 |Duration                    |0      |Family   |
|congestion                               |62   |71 |Symptom                     |0      |Present  |
|mom                                      |75   |77 |Gender                      |0      |Family   |
|yellow                                   |99   |104|Modifier                    |0      |Family   |
|discharge                                |106  |114|Symptom                     |0      |Family   |
|nares                                    |135  |139|External_body_part_or_region|0      |Family   |
|she                                      |147  |149|Gender                      |0      |Family   |
|mild                                     |168  |171|Modifier                    |0      |Family   |
|problems with his breathing while feeding|173  |213|Symptom                     |0      |Present  |
|perioral cyanosis                        |237  |253|Symptom                     |0      |Absent   |
|retractions                              |258  |268|Symptom                     |0      |Absent   |
|One day ago                              |272  |282|RelativeDate                |1      |Family   |
|mom                                      |285  |287|Gender                      |1      |Family   |
|Tylenol                                  |345  |351|Drug_BrandName              |1      |Family   |
|Baby                                     |354  |357|Age                         |2      |Family   |
|decreased p.o. intake                    |377  |397|Symptom                     |2      |Family   |
|His                                      |400  |402|Gender                      |3      |Family   |
+-----------------------------------------+-----+---+----------------------------+-------+---------+
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|jsl_assertion_wip| |Type:|ner| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence, ner_chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| |Case sensitive:|false| {:.h2_title} ## Data Source Trained on 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text with 'embeddings_clinical'. https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ {:.h2_title} ## Benchmarking ```bash
label          prec   rec    f1
Absent         0.970  0.943  0.956
Someoneelse    0.868  0.775  0.819
Planned        0.721  0.754  0.737
Possible       0.852  0.884  0.868
Past           0.811  0.823  0.817
Present        0.833  0.866  0.849
Family         0.872  0.921  0.896
None           0.609  0.359  0.452
Hypothetical   0.722  0.810  0.763
Macro-average  0.888  0.872  0.880
Micro-average  0.908  0.908  0.908
``` --- layout: model title: English DistilBertForQuestionAnswering Cased model (from saraks) author: John Snow Labs name: distilbert_qa_cuad_parties_cased_08_31_v1 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cuad-distil-parties-cased-08-31-v1` is an English model originally trained by `saraks`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_parties_cased_08_31_v1_en_4.3.0_3.0_1672766262518.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_parties_cased_08_31_v1_en_4.3.0_3.0_1672766262518.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_parties_cased_08_31_v1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_parties_cased_08_31_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_cuad_parties_cased_08_31_v1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/saraks/cuad-distil-parties-cased-08-31-v1 --- layout: model title: Ganda asr_wav2vec2_xlsr_multilingual_56 TFWav2Vec2ForCTC from voidful author: John Snow Labs name: asr_wav2vec2_xlsr_multilingual_56 date: 2022-09-24 tags: [wav2vec2, lg, audio, open_source, asr] task: Automatic Speech Recognition language: lg edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_xlsr_multilingual_56` is a Ganda model originally trained by voidful. NOTE: This model only works on a CPU, if you need to use this model on a GPU device please use asr_wav2vec2_xlsr_multilingual_56_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_multilingual_56_lg_4.2.0_3.0_1664035818043.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_multilingual_56_lg_4.2.0_3.0_1664035818043.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xlsr_multilingual_56", "lg")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xlsr_multilingual_56", "lg") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
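Under the hood, a Wav2Vec2ForCTC model predicts one character label per audio frame, and CTC decoding collapses those frames into text by merging adjacent repeats and dropping the blank token. A self-contained sketch of greedy CTC collapse, using a made-up frame sequence and a hypothetical `_` blank symbol rather than the model's real vocabulary:

```python
BLANK = "_"  # hypothetical blank symbol; real models use an index from their vocabulary

def ctc_greedy_collapse(frame_labels):
    """Collapse per-frame CTC labels: merge adjacent repeats, then drop blanks."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev:       # a change of label starts a new emission
            if lab != BLANK:  # blank frames separate repeats but emit nothing
                out.append(lab)
        prev = lab
    return "".join(out)

# A toy frame sequence such as "hh_e_lll_l_oo" decodes to "hello".
print(ctc_greedy_collapse(list("hh_e_lll_l_oo")))  # -> hello
```

This is why the blank token matters: without it, genuinely doubled letters (as in "hello") could not be distinguished from the same frame repeated.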
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xlsr_multilingual_56| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|lg| |Size:|1.2 GB| --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from Evelyn18) author: John Snow Labs name: roberta_qa_base_spanish_squades_becasincentivos2 date: 2023-01-20 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-squades-becasIncentivos2` is a Spanish model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becasincentivos2_es_4.3.0_3.0_1674218030841.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becasincentivos2_es_4.3.0_3.0_1674218030841.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becasincentivos2","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becasincentivos2","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_spanish_squades_becasincentivos2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|459.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Evelyn18/roberta-base-spanish-squades-becasIncentivos2 --- layout: model title: English RobertaForQuestionAnswering (from sunitha) author: John Snow Labs name: roberta_qa_roberta_customds_finetune date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-customds-finetune` is an English model originally trained by `sunitha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_customds_finetune_en_4.0.0_3.0_1655735713085.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_customds_finetune_en_4.0.0_3.0_1655735713085.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_customds_finetune","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_customds_finetune","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.by_sunitha").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
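Extractive QA models like this one score every context token as a candidate answer start and answer end, and the predicted answer is the highest-scoring valid span. A toy sketch of that span selection with made-up scores (not real model logits):

```python
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Hypothetical start/end scores for the question "What's my name?".
start_scores = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end_scores   = [0.1, 0.1, 0.1, 4.5, 0.2, 0.1, 0.1, 0.1, 0.4, 0.1]

def best_span(starts, ends, max_len=15):
    """Return (i, j) maximizing starts[i] + ends[j] subject to i <= j < i + max_len."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(starts):
        for j in range(i, min(i + max_len, len(ends))):
            if s + ends[j] > best_score:
                best_score, best = s + ends[j], (i, j)
    return best

i, j = best_span(start_scores, end_scores)
print(" ".join(tokens[i:j + 1]))  # -> Clara
```

The `i <= j` constraint is what rules out degenerate spans where the end would precede the start.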
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_customds_finetune| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/sunitha/roberta-customds-finetune --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from abhinavkulkarni) author: John Snow Labs name: distilbert_qa_abhinavkulkarni_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `abhinavkulkarni`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_abhinavkulkarni_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769656788.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_abhinavkulkarni_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769656788.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_abhinavkulkarni_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_abhinavkulkarni_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_abhinavkulkarni_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/abhinavkulkarni/distilbert-base-uncased-finetuned-squad --- layout: model title: Translate Waray to English Pipeline author: John Snow Labs name: translate_war_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, war, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `war` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_war_en_xx_2.7.0_2.4_1609700727408.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_war_en_xx_2.7.0_2.4_1609700727408.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_war_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_war_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.war.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_war_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal The merger Clause Binary Classifier author: John Snow Labs name: legclf_the_merger_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `the-merger` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. 
## Predicted Entities `other`, `the-merger` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_the_merger_clause_en_1.0.0_3.2_1660123103140.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_the_merger_clause_en_1.0.0_3.2_1660123103140.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_the_merger_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
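As the description notes, several binary clause classifiers can be run over the same text and their outputs collected into a series of True/False flags, one per clause type. A toy pure-Python sketch of that aggregation, using hypothetical stand-in classifiers rather than the pretrained models:

```python
def aggregate_clause_flags(clause_text, classifiers):
    """Run each (name, fn) binary classifier and map clause type -> True/False."""
    return {name: fn(clause_text) == name for name, fn in classifiers}

# Hypothetical stand-ins for models such as legclf_the_merger_clause; each
# returns its positive class name or "other", mirroring the Results format.
classifiers = [
    ("the-merger", lambda t: "the-merger" if "merger" in t.lower() else "other"),
    ("choice-of-law", lambda t: "choice-of-law" if "governed by the laws" in t.lower() else "other"),
]

flags = aggregate_clause_flags(
    "This Agreement constitutes the entire merger of prior understandings.",
    classifiers,
)
print(flags)  # -> {'the-merger': True, 'choice-of-law': False}
```

In a real Spark NLP pipeline each stand-in would be a separate ClassifierDLModel stage writing to its own output column.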
## Results ```bash
+------------+
|result      |
+------------+
|[the-merger]|
|[other]     |
|[other]     |
|[the-merger]|
+------------+
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_the_merger_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash
label         precision  recall  f1-score  support
other         0.98       1.00    0.99      98
the-merger    1.00       0.95    0.97      38
accuracy      -          -       0.99      136
macro-avg     0.99       0.97    0.98      136
weighted-avg  0.99       0.99    0.99      136
``` --- layout: model title: Pipeline to Detect Clinical Events author: John Snow Labs name: ner_events_healthcare_pipeline date: 2022-03-22 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_events_healthcare](https://nlp.johnsnowlabs.com/2021/04/01/ner_events_healthcare_en.html) model. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_EVENTS_CLINICAL/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_EVENTS_CLINICAL.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_events_healthcare_pipeline_en_3.4.1_3.0_1647943997404.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_events_healthcare_pipeline_en_3.4.1_3.0_1647943997404.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_events_healthcare_pipeline", "en", "clinical/models") pipeline.fullAnnotate("The patient presented to the emergency room last evening") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_events_healthcare_pipeline", "en", "clinical/models") pipeline.fullAnnotate("The patient presented to the emergency room last evening") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.healthcare_events.pipeline").predict("""The patient presented to the emergency room last evening""") ```
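The chunk/entity pairs this pipeline returns are produced by its NerConverter stage, which groups the NER model's per-token IOB tags into entity chunks. A self-contained sketch of that IOB-to-chunk grouping, with toy tags chosen to mirror the example sentence (not the pipeline's real output):

```python
def iob_to_chunks(tokens, tags):
    """Group tokens tagged B-X / I-X into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # B- always opens a new chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)             # I- extends the open chunk of the same type
        else:                               # O, or an I- with no matching open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["The", "patient", "presented", "to", "the", "emergency", "room", "last", "evening"]
tags = ["O", "O", "B-EVIDENTIAL", "O", "B-CLINICAL_DEPT", "I-CLINICAL_DEPT",
        "I-CLINICAL_DEPT", "B-DATE", "I-DATE"]
print(iob_to_chunks(tokens, tags))
```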
## Results ```bash
+------------------+-------------+
|chunks            |entities     |
+------------------+-------------+
|presented         |EVIDENTIAL   |
|the emergency room|CLINICAL_DEPT|
|last evening      |DATE         |
+------------------+-------------+
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_events_healthcare_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|513.6 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Translate English to Chuukese Pipeline author: John Snow Labs name: translate_en_chk date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, chk, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `chk` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_chk_xx_2.7.0_2.4_1609689317048.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_chk_xx_2.7.0_2.4_1609689317048.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_chk", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_chk", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.chk').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_chk| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from dmis-lab) author: John Snow Labs name: bert_qa_biobert_base_cased_v1.1_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobert-base-cased-v1.1-squad` is an English model originally trained by `dmis-lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_base_cased_v1.1_squad_en_4.0.0_3.0_1654185575469.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_base_cased_v1.1_squad_en_4.0.0_3.0_1654185575469.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biobert_base_cased_v1.1_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_biobert_base_cased_v1.1_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.biobert.base_cased.by_dmis-lab").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_biobert_base_cased_v1.1_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|403.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/dmis-lab/biobert-base-cased-v1.1-squad --- layout: model title: Korean Lemmatizer author: John Snow Labs name: lemma date: 2021-01-15 task: Lemmatization language: ko edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [ko, lemmatizer, open_source] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/TEXT_PREPROCESSING/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TEXT_PREPROCESSING.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_ko_2.7.0_2.4_1610747055280.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_ko_2.7.0_2.4_1610747055280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") word_segmenter = WordSegmenterModel.pretrained('wordseg_kaist_ud', 'ko')\ .setInputCols("document")\ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma", "ko") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, word_segmenter, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) results = light_pipeline.fullAnnotate(["이렇게되면이러한인간형을다투어본받으려할것이틀림없다."]) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val word_segmenter = WordSegmenterModel.pretrained("wordseg_kaist_ud", "ko") .setInputCols("document") .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma", "ko") .setInputCols("token") .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter, lemmatizer)) val data = Seq("이렇게되면이러한인간형을다투어본받으려할것이틀림없다.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["이렇게되면이러한인간형을다투어본받으려할것이틀림없다."] lemma_df = nlu.load('ko.lemma').predict(text, output_level = "document") lemma_df.lemma.values[0] ```
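The lemmas this model emits keep their morpheme boundaries marked with `+` (e.g. `되+면` is the root `되` plus the ending `면`), as shown in the Results section. A minimal pure-Python sketch of splitting those strings into root and suffixes — the sample list below is illustrative, not actual model output:

```python
# Split '+'-joined lemmas from the Korean lemmatizer into (root, suffixes) pairs.
# The sample strings are illustrative; real ones come from the "lemma" annotations.
def split_morphemes(lemmas):
    pairs = []
    for lemma in lemmas:
        root, *suffixes = lemma.split("+")
        pairs.append((root, suffixes))
    return pairs

sample = ["이렇게", "되+면", "인간형+을", "틀림없+다"]
print(split_morphemes(sample))
```

In practice you would feed this the `result` strings extracted from `fullAnnotate` output rather than a hand-written list.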
## Results ```bash {'lemma': [Annotation(token, 0, 2, 이렇게, {'sentence': '0'}), Annotation(token, 3, 4, 되+면, {'sentence': '0'}), Annotation(token, 5, 7, 이러한+ㄴ, {'sentence': '0'}), Annotation(token, 8, 11, 인간형+을, {'sentence': '0'}), Annotation(token, 12, 15, 다투어본, {'sentence': '0'}), Annotation(token, 16, 18, 받으할, {'sentence': '0'}), Annotation(token, 18, 18, 려, {'sentence': '0'}), Annotation(token, 20, 21, 것+이, {'sentence': '0'}), Annotation(token, 22, 25, 틀림없+다, {'sentence': '0'}), Annotation(token, 26, 26, ., {'sentence': '0'})]} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|ko| ## Data Source The model was trained on the Universal Dependencies treebank from the _Korea Advanced Institute of Science and Technology (KAIST)_ dataset. Reference: - Building Universal Dependency Treebanks in Korean, Jayeol Chun, Na-Rae Han, Jena D. Hwang, and Jinho D. Choi. In Proceedings of the 11th International Conference on Language Resources and Evaluation, LREC'18, Miyazaki, Japan, 2018. --- layout: model title: Legal Choice of law Clause Binary Classifier (md) author: John Snow Labs name: legclf_choice_of_law_md date: 2023-01-11 tags: [en, legal, classification, document, agreement, contract, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `choice-of-law` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting them using any of the techniques covered in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Keep in mind that this model's embeddings support up to 512 tokens; if your text is longer, consider splitting it into smaller pieces (see the same tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output one True/False value per clause classifier you have added. ## Predicted Entities `other`, `choice-of-law` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_choice_of_law_md_en_1.0.0_3.0_1673460245223.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_choice_of_law_md_en_1.0.0_3.0_1673460245223.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_choice_of_law_md", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
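Since each clause classifier is binary (its own label vs. `other`), running several of them over the same clause text yields one label per model. A small pure-Python sketch — with mocked labels, not actual model output — of collapsing those labels into a clause-name → True/False map:

```python
# Collapse per-classifier labels into True/False flags per clause type.
# The label values below are mocked; in a real pipeline they would come from
# something like result.select("category.result").collect() for each model.
def clause_flags(predictions):
    """predictions: {clause_name: predicted_label}. True if the clause fired."""
    return {clause: label != "other" for clause, label in predictions.items()}

preds = {"choice-of-law": "choice-of-law", "dispute-resolution": "other"}
print(clause_flags(preds))  # {'choice-of-law': True, 'dispute-resolution': False}
```

The clause name `dispute-resolution` is just a placeholder for any other binary clause classifier you add to the pipeline.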
## Results ```bash +-------+ | result| +-------+ |[choice-of-law]| |[other]| |[other]| |[choice-of-law]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_choice_of_law_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents + Lawinsider categorization ## Benchmarking ```bash precision recall f1-score support amendments-and-waivers 1.00 0.97 0.99 35 other 0.97 1.00 0.99 39 accuracy 0.99 74 macro avg 0.99 0.99 0.99 74 weighted avg 0.99 0.99 0.99 74 ``` --- layout: model title: English T5ForConditionalGeneration Cased model (from ThomasNLG) author: John Snow Labs name: t5_qa_webnlg_synth date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-qa_webnlg_synth-en` is an English model originally trained by `ThomasNLG`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_qa_webnlg_synth_en_4.3.0_3.0_1675125486836.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_qa_webnlg_synth_en_4.3.0_3.0_1675125486836.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_qa_webnlg_synth","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_qa_webnlg_synth","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_qa_webnlg_synth| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|271.7 MB| ## References - https://huggingface.co/ThomasNLG/t5-qa_webnlg_synth-en - https://github.com/ThomasScialom/QuestEval - https://arxiv.org/abs/2104.07555 --- layout: model title: Legal Position and duties Clause Binary Classifier author: John Snow Labs name: legclf_position_and_duties_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `position-and-duties` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have long legal documents and want to look for clauses, we recommend splitting them using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Keep in mind that this model's embeddings support up to 512 tokens; if your text is longer, consider splitting it into smaller pieces (see the same tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output one True/False value per clause classifier you have added. ## Predicted Entities `other`, `position-and-duties` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_position_and_duties_clause_en_1.0.0_3.2_1660122849201.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_position_and_duties_clause_en_1.0.0_3.2_1660122849201.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_position_and_duties_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[position-and-duties]| |[other]| |[other]| |[position-and-duties]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_position_and_duties_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 0.99 0.99 92 position-and-duties 0.97 1.00 0.99 38 accuracy - - 0.99 130 macro-avg 0.99 0.99 0.99 130 weighted-avg 0.99 0.99 0.99 130 ``` --- layout: model title: Swahili XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_igbo_finetuned_ner_swahili date: 2022-08-01 tags: [sw, open_source, xlm_roberta, ner] task: Named Entity Recognition language: sw edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-igbo-finetuned-ner-swahili` is a Swahili model originally trained by `mbeukman`. ## Predicted Entities `PER`, `DATE`, `ORG`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_igbo_finetuned_ner_swahili_sw_4.1.0_3.0_1659354042745.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_igbo_finetuned_ner_swahili_sw_4.1.0_3.0_1659354042745.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_igbo_finetuned_ner_swahili","sw") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_igbo_finetuned_ner_swahili","sw") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
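Downstream, the `ner_chunk` column is typically flattened into (chunk text, entity label) pairs. A pure-Python sketch of grouping chunks by entity type (`PER`, `DATE`, `ORG`, `LOC`) — the sample pairs below are invented for illustration, not actual model output:

```python
from collections import defaultdict

# Group NER chunks by their entity label. The (text, label) pairs are mocked;
# real ones would be extracted from the "ner_chunk" annotations of the result.
def group_entities(chunks):
    grouped = defaultdict(list)
    for text, label in chunks:
        grouped[label].append(text)
    return dict(grouped)

sample = [("Nairobi", "LOC"), ("Asha", "PER"), ("Kenya", "LOC")]
print(group_entities(sample))  # {'LOC': ['Nairobi', 'Kenya'], 'PER': ['Asha']}
```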
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_igbo_finetuned_ner_swahili| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|sw| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-igbo-finetuned-ner-swahili - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://github.com/masakhane-io/masakhane-ner --- layout: model title: Translate English to Mossi Pipeline author: John Snow Labs name: translate_en_mos date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, mos, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `mos` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_mos_xx_2.7.0_2.4_1609686900824.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_mos_xx_2.7.0_2.4_1609686900824.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_mos", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_mos", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.mos').predict(text, output_level='sentence') translate_df ```
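`annotate` returns a plain dict mapping output column names to lists of strings. A pure-Python sketch of pairing source sentences with their translations, assuming the Marian stage writes to a `translation` key (an assumption about this pipeline's column naming — check the keys of the returned dict); the dict below is mocked, not actual pipeline output:

```python
# Pair input sentences with translated output strings.
# Assumes the pipeline's output dict has a "translation" key (hypothetical here;
# inspect annotate(...)'s keys for the actual column name).
def pair_translations(sources, annotated):
    return list(zip(sources, annotated.get("translation", [])))

mock = {"translation": ["<translated text>"]}
print(pair_translations(["Your sentence to translate!"], mock))
```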
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_mos| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English asr_wav2vec2_25_1Aug_2022 TFWav2Vec2ForCTC from Roshana author: John Snow Labs name: pipeline_asr_wav2vec2_25_1Aug_2022 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_25_1Aug_2022` is an English model originally trained by Roshana. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_25_1Aug_2022_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_25_1Aug_2022_en_4.2.0_3.0_1664116524444.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_25_1Aug_2022_en_4.2.0_3.0_1664116524444.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_25_1Aug_2022', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_25_1Aug_2022", lang = "en") val annotations = pipeline.transform(audioDF) ```
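The `audioDF` passed to the pipeline must carry the raw audio as an array of floats (typically 16 kHz mono, normalized to [-1, 1]). A stdlib-only sketch of turning a 16-bit PCM WAV file into such a list — the function name and normalization convention are illustrative, not taken from this card:

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit mono PCM WAV and return samples normalized to [-1.0, 1.0]."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit PCM"
        frames = w.readframes(w.getnframes())
    # '<h' = little-endian signed 16-bit; divide by 2**15 to normalize.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]
```

The resulting list can then be loaded into a single-column Spark DataFrame (e.g. `spark.createDataFrame([[floats]]).toDF("audio_content")`) before calling `pipeline.transform`, mirroring the `audio_content` column used elsewhere in these cards.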
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_25_1Aug_2022| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English T5ForConditionalGeneration Base Cased model (from google) author: John Snow Labs name: t5_efficient_base_dl4 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-dl4` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_dl4_en_4.3.0_3.0_1675110033833.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_dl4_en_4.3.0_3.0_1675110033833.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_base_dl4","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_base_dl4","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_base_dl4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|329.9 MB| ## References - https://huggingface.co/google/t5-efficient-base-dl4 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Translate English to Basque (family) Pipeline author: John Snow Labs name: translate_en_euq date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, euq, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `euq` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_euq_xx_2.7.0_2.4_1609689947319.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_euq_xx_2.7.0_2.4_1609689947319.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_euq", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_euq", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.euq').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_euq| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from franklu) author: John Snow Labs name: bert_qa_pubmed_bert_squadv2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `pubmed_bert_squadv2` is an English model originally trained by `franklu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_pubmed_bert_squadv2_en_4.0.0_3.0_1654189059722.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_pubmed_bert_squadv2_en_4.0.0_3.0_1654189059722.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_pubmed_bert_squadv2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_pubmed_bert_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_pubmed.bert.v2").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
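When the same question is run against several candidate contexts, each row yields one answer annotation. Assuming the annotation metadata carries a confidence under a `score` key (an assumption about the metadata layout — inspect the real output to confirm), a pure-Python sketch of keeping the highest-scoring answer; the dicts below are mocked, not actual model output:

```python
# Pick the highest-confidence answer from a list of answer annotations.
# The "score" metadata key and the dict shape are assumptions for illustration.
def best_answer(annotations):
    return max(annotations, key=lambda a: float(a["metadata"]["score"]))["result"]

candidates = [
    {"result": "Clara", "metadata": {"score": "0.93"}},
    {"result": "Berkeley", "metadata": {"score": "0.12"}},
]
print(best_answer(candidates))  # Clara
```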
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_pubmed_bert_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|408.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/franklu/pubmed_bert_squadv2 - https://rajpurkar.github.io/SQuAD-explorer/ --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578 TFWav2Vec2ForCTC from doddle124578 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578` is an English model originally trained by doddle124578. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578_en_4.2.0_3.0_1664037306955.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578_en_4.2.0_3.0_1664037306955.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_1_by_doddle124578| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Oncology Pipeline for Biomarkers author: John Snow Labs name: oncology_biomarker_pipeline date: 2023-03-29 tags: [licensed, pipeline, oncology, biomarker, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline includes Named-Entity Recognition, Assertion Status and Relation Extraction models to extract information from oncology texts. This pipeline focuses on entities related to biomarkers. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/oncology_biomarker_pipeline_en_4.3.2_3.2_1680112789514.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/oncology_biomarker_pipeline_en_4.3.2_3.2_1680112789514.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("oncology_biomarker_pipeline", "en", "clinical/models") text = '''Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("oncology_biomarker_pipeline", "en", "clinical/models") val text = "Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.oncology_biomarker.pipeline").predict("""Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.""") ```
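The relation-extraction stages of this pipeline emit rows of (chunk1, entity1, chunk2, entity2, relation), as the Results section shows. A pure-Python sketch of turning those rows into a biomarker → status map — the tuples below are illustrative, not actual pipeline output:

```python
# Build a biomarker -> status map from relation-extraction rows, keeping only
# links between a Biomarker_Result chunk ("positive"/"negative") and a
# biomarker chunk. The rows are mocked; real ones come from the relation
# columns of fullAnnotate output.
def biomarker_status(rows):
    status = {}
    for chunk1, entity1, chunk2, entity2, relation in rows:
        if relation in ("is_related_to", "is_finding_of") and entity1 == "Biomarker_Result":
            status[chunk2] = chunk1
    return status

rows = [
    ("negative", "Biomarker_Result", "napsin A", "Biomarker", "is_related_to"),
    ("positive", "Biomarker_Result", "ER", "Biomarker", "is_related_to"),
    ("Immunohistochemistry", "Pathology_Test", "negative", "Biomarker_Result", "O"),
]
print(biomarker_status(rows))  # {'napsin A': 'negative', 'ER': 'positive'}
```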
## Results ```bash " ******************** ner_oncology_wip results ******************** | chunk | ner_label | |:-------------------------------|:-----------------| | negative | Biomarker_Result | | thyroid transcription factor-1 | Biomarker | | napsin | Biomarker | | positive | Biomarker_Result | | ER | Biomarker | | PR | Biomarker | | negative | Biomarker_Result | | HER2 | Oncogene | ******************** ner_oncology_biomarker_wip results ******************** | chunk | ner_label | |:-------------------------------|:-----------------| | negative | Biomarker_Result | | thyroid transcription factor-1 | Biomarker | | napsin A | Biomarker | | positive | Biomarker_Result | | ER | Biomarker | | PR | Biomarker | | negative | Biomarker_Result | | HER2 | Biomarker | ******************** ner_oncology_test_wip results ******************** | chunk | ner_label | |:-------------------------------|:-----------------| | Immunohistochemistry | Pathology_Test | | negative | Biomarker_Result | | thyroid transcription factor-1 | Biomarker | | napsin A | Biomarker | | positive | Biomarker_Result | | ER | Biomarker | | PR | Biomarker | | negative | Biomarker_Result | | HER2 | Oncogene | ******************** ner_biomarker results ******************** | chunk | ner_label | |:-------------------------------|:----------------------| | Immunohistochemistry | Test | | negative | Biomarker_Measurement | | thyroid transcription factor-1 | Biomarker | | napsin A | Biomarker | | positive | Biomarker_Measurement | | ER | Biomarker | | PR | Biomarker | | negative | Biomarker_Measurement | | HER2 | Biomarker | ******************** assertion_oncology_wip results ******************** | chunk | ner_label | assertion | |:-------------------------------|:---------------|:------------| | Immunohistochemistry | Pathology_Test | Past | | thyroid transcription factor-1 | Biomarker | Present | | napsin A | Biomarker | Present | | ER | Biomarker | Present | | PR | Biomarker | Present | | HER2 | Oncogene | 
Present | ******************** assertion_oncology_test_binary_wip results ******************** | chunk | ner_label | assertion | |:-------------------------------|:---------------|:----------------| | Immunohistochemistry | Pathology_Test | Medical_History | | thyroid transcription factor-1 | Biomarker | Medical_History | | napsin A | Biomarker | Medical_History | | ER | Biomarker | Medical_History | | PR | Biomarker | Medical_History | | HER2 | Oncogene | Medical_History | ******************** re_oncology_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:---------------------|:-----------------|:-------------------------------|:-----------------|:--------------| | Immunohistochemistry | Pathology_Test | negative | Biomarker_Result | O | | negative | Biomarker_Result | thyroid transcription factor-1 | Biomarker | is_related_to | | negative | Biomarker_Result | napsin A | Biomarker | is_related_to | | positive | Biomarker_Result | ER | Biomarker | is_related_to | | positive | Biomarker_Result | PR | Biomarker | is_related_to | | positive | Biomarker_Result | HER2 | Oncogene | O | | ER | Biomarker | negative | Biomarker_Result | O | | PR | Biomarker | negative | Biomarker_Result | O | | negative | Biomarker_Result | HER2 | Oncogene | is_related_to | ******************** re_oncology_granular_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:---------------------|:-----------------|:-------------------------------|:-----------------|:--------------| | Immunohistochemistry | Pathology_Test | negative | Biomarker_Result | O | | negative | Biomarker_Result | thyroid transcription factor-1 | Biomarker | is_finding_of | | negative | Biomarker_Result | napsin A | Biomarker | is_finding_of | | positive | Biomarker_Result | ER | Biomarker | is_finding_of | | positive | Biomarker_Result | PR | Biomarker | is_finding_of | | positive | Biomarker_Result | HER2 | Oncogene | is_finding_of | | ER | Biomarker | 
negative | Biomarker_Result | O | | PR | Biomarker | negative | Biomarker_Result | O | | negative | Biomarker_Result | HER2 | Oncogene | is_finding_of | ******************** re_oncology_biomarker_result_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:---------------------|:-----------------|:-------------------------------|:-----------------|:--------------| | Immunohistochemistry | Pathology_Test | negative | Biomarker_Result | is_finding_of | | negative | Biomarker_Result | thyroid transcription factor-1 | Biomarker | is_finding_of | | negative | Biomarker_Result | napsin A | Biomarker | is_finding_of | | positive | Biomarker_Result | ER | Biomarker | is_finding_of | | positive | Biomarker_Result | PR | Biomarker | is_finding_of | | positive | Biomarker_Result | HER2 | Oncogene | O | | ER | Biomarker | negative | Biomarker_Result | O | | PR | Biomarker | negative | Biomarker_Result | O | | negative | Biomarker_Result | HER2 | Oncogene | is_finding_of | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|oncology_biomarker_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ChunkMergeModel - AssertionDLModel - AssertionDLModel - PerceptronModel - DependencyParserModel - RelationExtractionModel - RelationExtractionModel - RelationExtractionModel --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from NitishKumar) author: John Snow Labs name: distilbert_qa_nitishkumar_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models 
edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `NitishKumar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_nitishkumar_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768854083.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_nitishkumar_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768854083.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_nitishkumar_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_nitishkumar_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
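Under the hood, extractive QA models like this one score every token as a potential answer start and answer end; the returned answer is the highest-scoring valid span. A toy sketch of that selection step (illustrative logits, not the model's API):

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Pick (start, end) maximizing start_logits[s] + end_logits[e], with e >= s."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Toy logits over the context tokens "My name is Clara and ...":
start = [0.1, 0.2, 0.1, 3.0, 0.1, 0.1]
end   = [0.1, 0.1, 0.1, 2.5, 0.2, 0.1]
print(best_span(start, end))  # (3, 3) -> the single token "Clara"
```

The `max_answer_len` cap and the `e >= s` constraint rule out degenerate spans, which is why the model returns a short phrase rather than an arbitrary slice of the context.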
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_nitishkumar_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/NitishKumar/distilbert-base-uncased-finetuned-squad --- layout: model title: Urdu Bert Embeddings (from Geotrend) author: John Snow Labs name: bert_embeddings_bert_base_ur_cased date: 2022-04-11 tags: [bert, embeddings, ur, open_source] task: Embeddings language: ur edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-ur-cased` is an Urdu model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_ur_cased_ur_3.4.2_3.0_1649676499960.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_ur_cased_ur_3.4.2_3.0_1649676499960.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_ur_cased","ur") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["مجھے سپارک این ایل پی سے محبت ہے"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_ur_cased","ur") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("مجھے سپارک این ایل پی سے محبت ہے").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ur.embed.bert_cased").predict("""مجھے سپارک این ایل پی سے محبت ہے""") ```
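Each row of the `embeddings` output carries one dense vector per token (768 dimensions for this BERT-base model). A common downstream use is comparing tokens or pooled sentences by cosine similarity; a minimal sketch with toy vectors standing in for the real embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 4-d vectors in place of 768-d BERT token embeddings.
print(cosine([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0]))  # 1.0 (identical)
print(cosine([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]))  # 0.0 (orthogonal)
```

In practice the vectors would be pulled from the `embeddings` column of `result` (and typically mean-pooled per sentence) before comparison.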
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_ur_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ur| |Size:|348.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-ur-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Multilingual XLMRobertaForTokenClassification Cased model (from magistermilitum) author: John Snow Labs name: xlmroberta_ner_roberta_multilingual_medieval date: 2022-08-13 tags: [xx, open_source, xlm_roberta, ner] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-multilingual-medieval-ner` is a Multilingual model originally trained by `magistermilitum`. ## Predicted Entities `LOC`, `L-PERS`, `PERS` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_roberta_multilingual_medieval_xx_4.1.0_3.0_1660422872636.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_roberta_multilingual_medieval_xx_4.1.0_3.0_1660422872636.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_roberta_multilingual_medieval","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_roberta_multilingual_medieval","xx") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
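`NerConverter` merges the token-level IOB tags emitted in the `ner` column into entity chunks. The grouping logic is roughly the following pure-Python sketch (illustrative, not the annotator's implementation), using this card's `PERS` label on a medieval-charter-style example:

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (label, text) chunks."""
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new chunk begins
            if current:
                chunks.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)        # continue the open chunk
        else:                             # "O" or a label break closes it
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(words)) for label, words in chunks]

tokens = ["Ego", "Arnulfus", "de", "Gant", "dedi"]
tags   = ["O", "B-PERS", "I-PERS", "I-PERS", "O"]
print(iob_to_chunks(tokens, tags))  # [('PERS', 'Arnulfus de Gant')]
```

The `ner_chunk` column produced by the pipeline contains the result of exactly this kind of merge, with character offsets attached.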
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_roberta_multilingual_medieval| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|1.8 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/magistermilitum/roberta-multilingual-medieval-ner --- layout: model title: English BertForQuestionAnswering Cased model (from motiondew) author: John Snow Labs name: bert_qa_set_date_1_lr_2e_5_bs_32_ep_3 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-set_date_1-lr-2e-5-bs-32-ep-3` is an English model originally trained by `motiondew`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_1_lr_2e_5_bs_32_ep_3_en_4.0.0_3.0_1657188305649.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_1_lr_2e_5_bs_32_ep_3_en_4.0.0_3.0_1657188305649.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_1_lr_2e_5_bs_32_ep_3","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_1_lr_2e_5_bs_32_ep_3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_set_date_1_lr_2e_5_bs_32_ep_3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/motiondew/bert-set_date_1-lr-2e-5-bs-32-ep-3 --- layout: model title: Arabic ElectraForQuestionAnswering model (from aymanm419) Version-1 author: John Snow Labs name: electra_qa_araElectra_SQUAD_ARCD date: 2022-06-22 tags: [ar, open_source, electra, question_answering] task: Question Answering language: ar edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `araElectra-SQUAD-ARCD` is an Arabic model originally trained by `aymanm419`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_araElectra_SQUAD_ARCD_ar_4.0.0_3.0_1655920164320.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_araElectra_SQUAD_ARCD_ar_4.0.0_3.0_1655920164320.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_araElectra_SQUAD_ARCD","ar") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_araElectra_SQUAD_ARCD","ar") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.answer_question.squad_arcd.electra").predict("""ما هو اسمي؟|||اسمي كلارا وأنا أعيش في بيركلي.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_araElectra_SQUAD_ARCD| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ar| |Size:|504.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/aymanm419/araElectra-SQUAD-ARCD --- layout: model title: Translate English to Multiple languages Pipeline author: John Snow Labs name: translate_en_mul date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, mul, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `mul` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_mul_xx_2.7.0_2.4_1609689926198.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_mul_xx_2.7.0_2.4_1609689926198.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_mul", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_mul", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.mul').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_mul| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from English to Indo-European Languages author: John Snow Labs name: opus_mt_en_ine date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, ine, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `ine` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ine_xx_2.7.0_2.4_1609164123331.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ine_xx_2.7.0_2.4_1609164123331.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_ine", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_ine", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.ine').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_ine| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: RoBERTa Large CoNLL-03 NER Pipeline author: ahmedlone127 name: roberta_large_token_classifier_conll03_pipeline date: 2022-06-14 tags: [open_source, ner, token_classifier, roberta, conll03, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: false article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [roberta_large_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/26/roberta_large_token_classifier_conll03_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/roberta_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655220223619.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/roberta_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655220223619.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("roberta_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("roberta_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PERSON | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_large_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Community| |Language:|en| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - RoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: English DistilBertForQuestionAnswering Cased model (from motiondew) author: John Snow Labs name: distilbert_qa_motiondew_finetuned date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-finetuned` is an English model originally trained by `motiondew`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_motiondew_finetuned_en_4.3.0_3.0_1672774065577.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_motiondew_finetuned_en_4.3.0_3.0_1672774065577.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_motiondew_finetuned","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_motiondew_finetuned","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_motiondew_finetuned| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/motiondew/distilbert-finetuned --- layout: model title: Legal Question Answering (Bert, Large) author: John Snow Labs name: legqa_bert_large date: 2022-08-09 tags: [en, legal, qa, licensed] task: Question Answering language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Legal Bert-based Question Answering model, trained on squad-v2, finetuned on proprietary Legal questions and answers. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legqa_bert_large_en_1.0.0_3.2_1660053509660.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legqa_bert_large_en_1.0.0_3.2_1660053509660.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) spanClassifier = nlp.BertForQuestionAnswering.pretrained("legqa_bert_large","en", "legal/models") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = nlp.Pipeline().setStages([ documentAssembler, spanClassifier ]) example = spark.createDataFrame([["Who was subjected to torture?", "The applicant submitted that her husband was subjected to treatment amounting to abuse whilst in the custody of police."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) result.select('answer.result').show() ```
## Results ```bash `her husband` ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legqa_bert_large| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References Trained on squad-v2, finetuned on proprietary Legal questions and answers. --- layout: model title: RE Pipeline between Body Parts and Procedures author: John Snow Labs name: re_bodypart_proceduretest_pipeline date: 2022-03-31 tags: [licensed, clinical, relation_extraction, body_part, procedures, en] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [re_bodypart_proceduretest](https://nlp.johnsnowlabs.com/2021/01/18/re_bodypart_proceduretest_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_BODYPART_ENT/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.1.Clinical_Relation_Extraction_BodyParts_Models.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_bodypart_proceduretest_pipeline_en_3.4.1_3.0_1648733647318.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_bodypart_proceduretest_pipeline_en_3.4.1_3.0_1648733647318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("re_bodypart_proceduretest_pipeline", "en", "clinical/models") pipeline.fullAnnotate("TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("re_bodypart_proceduretest_pipeline", "en", "clinical/models") pipeline.fullAnnotate("TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.") ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.bodypart_proceduretest.pipeline").predict("""TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.""") ```
## Results ```bash | index | relations | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |-------|-----------|------------------------------|---------------|-------------|--------|---------|---------------|-------------|---------------------|------------| | 0 | 1 | External_body_part_or_region | 94 | 98 | chest | Test | 117 | 135 | portable ultrasound | 1.0 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_bodypart_proceduretest_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - PerceptronModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - DependencyParserModel - RelationExtractionModel --- layout: model title: Word2Vec Embeddings in Yoruba (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, yo, open_source] task: Embeddings language: yo edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_yo_3.4.1_3.0_1647467698125.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_yo_3.4.1_3.0_1647467698125.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","yo") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Mo nifẹ Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","yo") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Mo nifẹ Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("yo.embed.w2v_cc_300d").predict("""Mo nifẹ Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|yo| |Size:|85.4 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Clinical Deidentification Pipeline (Romanian) author: John Snow Labs name: clinical_deidentification date: 2023-06-13 tags: [licensed, clinical, ro, deid, deidentification] task: Pipeline Healthcare language: ro edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline is trained with `w2v_cc_300d` Romanian embeddings and can be used to deidentify PHI information from medical texts in Romanian. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask, fake or obfuscate the following entities: `AGE`, `CITY`, `COUNTRY`, `DATE`, `DOCTOR`, `EMAIL`, `FAX`, `HOSPITAL`, `IDNUM`, `LOCATION-OTHER`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `PROFESSION`, `STREET`, `ZIP`, `ACCOUNT`, `LICENSE`, `PLATE` ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_ro_4.4.4_3.2_1686665695668.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_ro_4.4.4_3.2_1686665695668.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "ro", "clinical/models") sample = """Medic : Dr. Agota EVELYN, C.N.P : 2450502264401, Data setului de analize: 25 May 2022 Varsta : 77, Nume si Prenume : BUREAN MARIA Tel: +40(235)413773, E-mail : hale@gmail.com, Licență : B004256985M, Înmatriculare : CD205113, Cont : FXHZ7170951927104999, Spitalul Pentru Ochi de Deal Drumul Oprea Nr. 972 Vaslui, 737405 """ result = deid_pipeline.annotate(sample) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "ro", "clinical/models") val sample = """Medic : Dr. Agota EVELYN, C.N.P : 2450502264401, Data setului de analize: 25 May 2022 Varsta : 77, Nume si Prenume : BUREAN MARIA Tel: +40(235)413773, E-mail : hale@gmail.com, Licență : B004256985M, Înmatriculare : CD205113, Cont : FXHZ7170951927104999, Spitalul Pentru Ochi de Deal Drumul Oprea Nr. 972 Vaslui, 737405 """ val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("ro.deid.clinical").predict("""Medic : Dr. Agota EVELYN, C.N.P : 2450502264401, Data setului de analize: 25 May 2022 Varsta : 77, Nume si Prenume : BUREAN MARIA Tel: +40(235)413773, E-mail : hale@gmail.com, Licență : B004256985M, Înmatriculare : CD205113, Cont : FXHZ7170951927104999, Spitalul Pentru Ochi de Deal Drumul Oprea Nr. 972 Vaslui, 737405 """) ```
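For intuition, the "masked with fixed length chars" policy shown in the Results section amounts to replacing each detected PHI span with a constant token. The sketch below is a toy illustration only — entity spans are hard-coded assumptions here, whereas in the real pipeline they come from the NER and contextual-parser stages:

```python
# Toy illustration of fixed-length-char masking; NOT the Spark NLP
# DeIdentificationModel implementation. Spans are (begin, end) character
# offsets, end exclusive, and are assumed to be already detected.
def mask_fixed_length(text, spans, token="****"):
    """Replace each (begin, end) span in text with a fixed token."""
    out, last = [], 0
    for begin, end in sorted(spans):
        out.append(text[last:begin])  # keep text before the entity
        out.append(token)             # mask the entity itself
        last = end
    out.append(text[last:])           # keep the tail
    return "".join(out)

# Hypothetical spans for a doctor name and an age in a Romanian note:
masked = mask_fixed_length(
    "Medic : Dr. Agota EVELYN, Varsta : 77", [(12, 24), (35, 37)]
)
# → "Medic : Dr. ****, Varsta : ****"
```

The real pipeline additionally offers label masking, same-length char masking, and obfuscation with realistic fake values, as its outputs show.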
## Results ```bash Results Masked with entity labels ------------------------------ Medic : Dr. <DOCTOR>, C.N.P : <IDNUM>, Data setului de analize: <DATE> Varsta : <AGE>, Nume si Prenume : <PATIENT> Tel: <PHONE>, E-mail : <EMAIL>, Licență : <LICENSE>, Înmatriculare : <PLATE>, Cont : <ACCOUNT>, <HOSPITAL> <STREET> <CITY>, <ZIP> Masked with chars ------------------------------ Medic : Dr. [**********], C.N.P : [***********], Data setului de analize: [*********] Varsta : **, Nume si Prenume : [**********] Tel: [************], E-mail : [************], Licență : [*********], Înmatriculare : [******], Cont : [******************], [**************************] [******************] [****], [****] Masked with fixed length chars ------------------------------ Medic : Dr. ****, C.N.P : ****, Data setului de analize: **** Varsta : ****, Nume si Prenume : **** Tel: ****, E-mail : ****, Licență : ****, Înmatriculare : ****, Cont : ****, **** **** ****, **** Obfuscated ------------------------------ Medic : Dr. Doina Gheorghiu, C.N.P : 6794561192919, Data setului de analize: 01-04-2001 Varsta : 91, Nume si Prenume : Dragomir Emilia Tel: 0248 551 376, E-mail : tudorsmaranda@kappa.ro, Licență : T003485962M, Înmatriculare : AR-65-UPQ, Cont : KHHO5029180812813651, Centrul Medical de Evaluare si Recuperare pentru Copii si Tineri Cristian Serban Buzias Aleea Voinea Curcani, 328479 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|ro| |Size:|1.2 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ChunkMergeModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: English asr_wav2vec2_large_960h_lv60 TFWav2Vec2ForCTC from facebook author: 
John Snow Labs name: pipeline_asr_wav2vec2_large_960h_lv60 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_960h_lv60` is an English model originally trained by facebook. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use `pipeline_asr_wav2vec2_large_960h_lv60_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_960h_lv60_en_4.2.0_3.0_1664017406218.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_960h_lv60_en_4.2.0_3.0_1664017406218.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_960h_lv60', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_960h_lv60", lang = "en") val annotations = pipeline.transform(audioDF) ```
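`audioDF` above is assumed to contain raw audio as arrays of floats. One way to decode a 16-bit PCM WAV file into such floats using only the Python standard library — the file path and column name below are illustrative assumptions, not part of the pipeline:

```python
# Hypothetical helper: decode a 16-bit mono PCM WAV file into the list of
# floats in [-1.0, 1.0) that an ASR pipeline's AudioAssembler stage expects.
import struct
import wave

def wav_to_floats(path):
    """Read 16-bit PCM samples and scale them to floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "expects 16-bit PCM"
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# The floats can then be wrapped in a Spark DataFrame, e.g.:
# audioDF = spark.createDataFrame([(wav_to_floats("sample.wav"),)],
#                                 ["audio_content"])
```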
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_960h_lv60| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|757.4 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_10_h_768 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-10_H-768` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_10_h_768_zh_4.2.4_3.0_1670021510353.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_10_h_768_zh_4.2.4_3.0_1670021510353.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_10_h_768","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_10_h_768","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
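The vectors in the `embeddings` output column are typically compared with cosine similarity. A minimal, framework-independent sketch of that comparison (plain Python, not a Spark NLP API):

```python
# Cosine similarity between two embedding vectors of equal length;
# 1.0 means identical direction, 0.0 means orthogonal.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```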
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_10_h_768| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|330.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-10_H-768 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Legal Amendments Clause Binary Classifier author: John Snow Labs name: legclf_amendments_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `amendments` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `amendments` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_amendments_clause_en_1.0.0_3.2_1660122105448.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_amendments_clause_en_1.0.0_3.2_1660122105448.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_amendments_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
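The "paragraph splitting (by multiline)" recommended in the description above can be sketched in a few lines before feeding each piece to the classifier. This is an illustrative snippet, not part of the model or the workshop notebook:

```python
# Split a long legal document on blank lines (two or more consecutive
# newlines), dropping empty fragments, so each paragraph can be
# classified separately within the 512-token embedding limit.
import re

def split_paragraphs(text):
    return [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]
```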
## Results ```bash +-------+ |result| +-------+ |[amendments]| |[other]| |[other]| |[amendments]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_amendments_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.3 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support amendments 0.95 0.96 0.95 227 other 0.99 0.98 0.98 673 accuracy - - 0.98 900 macro-avg 0.97 0.97 0.97 900 weighted-avg 0.98 0.98 0.98 900 ``` --- layout: model title: Legal Intercreditor Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_intercreditor_agreement_bert date: 2022-11-24 tags: [en, legal, classification, agreement, intercreditor, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_intercreditor_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `intercreditor-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `intercreditor-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_intercreditor_agreement_bert_en_1.0.0_3.0_1669314986657.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_intercreditor_agreement_bert_en_1.0.0_3.0_1669314986657.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_intercreditor_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[intercreditor-agreement]| |[other]| |[other]| |[intercreditor-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_intercreditor_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support intercreditor-agreement 0.87 0.82 0.84 33 other 0.93 0.95 0.94 82 accuracy - - 0.91 115 macro-avg 0.90 0.88 0.89 115 weighted-avg 0.91 0.91 0.91 115 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from ModelTC) author: John Snow Labs name: roberta_qa_modeltc_base_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad` is an English model originally trained by `ModelTC`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_modeltc_base_squad_en_4.3.0_3.0_1674218615238.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_modeltc_base_squad_en_4.3.0_3.0_1674218615238.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_modeltc_base_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_modeltc_base_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
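For intuition, extractive QA heads like this one score each token of the context as a candidate answer start and answer end, then pick the highest-scoring valid span. A toy sketch of that selection logic (illustrative only, not the Spark NLP internals; scores are made-up inputs):

```python
# Pick the (start, end) token pair maximizing start_scores[i] + end_scores[j]
# subject to j >= i and a maximum span length, as extractive QA models do.
def best_span(start_scores, end_scores, max_len=30):
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

# With made-up scores, the span covering tokens 1..2 wins:
# best_span([0.1, 2.0, 0.0], [0.0, 0.5, 3.0]) → (1, 2)
```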
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_modeltc_base_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/ModelTC/roberta-base-squad --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_256_finetuned_squad_seed_4 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-256-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_256_finetuned_squad_seed_4_en_4.3.0_3.0_1674214918693.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_256_finetuned_squad_seed_4_en_4.3.0_3.0_1674214918693.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_256_finetuned_squad_seed_4","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_256_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_256_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|427.6 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-256-finetuned-squad-seed-4 --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_16_finetuned_squad_seed_42 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-16-finetuned-squad-seed-42` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_16_finetuned_squad_seed_42_en_4.3.0_3.0_1674214395082.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_16_finetuned_squad_seed_42_en_4.3.0_3.0_1674214395082.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_16_finetuned_squad_seed_42","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_16_finetuned_squad_seed_42","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_16_finetuned_squad_seed_42| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|425.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-16-finetuned-squad-seed-42 --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_42 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-128-finetuned-squad-seed-42` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_42_en_4.0.0_3.0_1655731309098.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_42_en_4.0.0_3.0_1655731309098.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_42","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_42","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_128d_seed_42").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
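The NLU one-liner above packs the question and context into a single string separated by `|||`. A tiny helper (the function name is ours, not part of NLU) to build that input:

```python
def nlu_qa_input(question: str, context: str) -> str:
    """Join a question and its context with the '|||' separator
    expected by nlu.load(...).predict() for question answering."""
    return f"{question}|||{context}"

# Example:
# nlu_qa_input("What's my name?", "My name is Clara and I live in Berkeley.")
```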
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_42| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|430.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-128-finetuned-squad-seed-42 --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458` is a German model originally trained by jonatasgrosman. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458_de_4.2.0_3.0_1664117864895.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458_de_4.2.0_3.0_1664117864895.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
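The snippets above assume an existing `audioDf` with a float-array `audio_content` column, but don't show how to build it. As one possible sketch (not the only way to load audio), a 16-bit PCM WAV file can be decoded into normalized floats with the Python standard library; the final `createDataFrame` hand-off is commented out and assumes a running Spark session:

```python
import struct
import wave


def wav_to_floats(path_or_file):
    """Decode a 16-bit PCM WAV file into a list of floats in [-1.0, 1.0],
    the shape of data the AudioAssembler's input column expects."""
    with wave.open(path_or_file, "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit PCM samples"
        frames = w.readframes(w.getnframes())
    # '<h' = little-endian signed 16-bit; normalize by the int16 range
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]


# Hypothetical hand-off to the pipeline above (column name must match
# the AudioAssembler's setInputCol):
# floats = wav_to_floats("speech.wav")
# audioDf = spark.createDataFrame([[floats]], ["audio_content"])
```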
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: Portuguese Part of Speech Tagger (from Emanuel) author: John Snow Labs name: bert_pos_autonlp_pos_tag_bosque date: 2022-05-09 tags: [bert, pos, part_of_speech, pt, open_source] task: Part of Speech Tagging language: pt edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `autonlp-pos-tag-bosque` is a Portuguese model originally trained by `Emanuel`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_autonlp_pos_tag_bosque_pt_3.4.2_3.0_1652091764630.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_autonlp_pos_tag_bosque_pt_3.4.2_3.0_1652091764630.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_autonlp_pos_tag_bosque","pt") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Eu amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_autonlp_pos_tag_bosque","pt") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Eu amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_autonlp_pos_tag_bosque| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|pt| |Size:|406.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Emanuel/autonlp-pos-tag-bosque --- layout: model title: Portuguese Lemmatizer author: John Snow Labs name: lemma date: 2020-05-03 12:54:00 +0800 task: Lemmatization language: pt edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [lemmatizer, pt] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_pt_2.5.0_2.4_1588499301752.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_pt_2.5.0_2.4_1588499301752.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "pt") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Além de ser o rei do norte, John Snow é um médico inglês e líder no desenvolvimento de anestesia e higiene médica.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "pt") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Além de ser o rei do norte, John Snow é um médico inglês e líder no desenvolvimento de anestesia e higiene médica.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Além de ser o rei do norte, John Snow é um médico inglês e líder no desenvolvimento de anestesia e higiene médica."""] lemma_df = nlu.load('pt.lemma').predict(text, output_level='token') lemma_df.lemma.values[0] ```
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=3, result='Além', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=5, end=6, result='de', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=8, end=10, result='ser', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=12, end=12, result='o', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=14, end=16, result='rei', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|pt| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Context Spell Checker Pipeline for English author: John Snow Labs name: spellcheck_dl_pipeline date: 2022-04-18 tags: [spellcheck, spell, spellcheck_pipeline, spelling_corrector, en, open_source] task: Spell Check language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained spellchecker pipeline is built on top of the [spellcheck_dl](https://nlp.johnsnowlabs.com/2022/04/02/spellcheck_dl_en_2_4.html) model. This pipeline is for PySpark 2.4.x users with Spark NLP 3.4.2 and above.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CONTEXTUAL_SPELL_CHECKER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spellcheck_dl_pipeline_en_3.4.2_2.4_1650285592232.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/spellcheck_dl_pipeline_en_3.4.2_2.4_1650285592232.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("spellcheck_dl_pipeline", lang = "en") text = ["During the summer we have the best ueather.", "I have a black ueather jacket, so nice."] pipeline.annotate(text) ``` ```scala val pipeline = new PretrainedPipeline("spellcheck_dl_pipeline", lang = "en") val example = Array("During the summer we have the best ueather.", "I have a black ueather jacket, so nice.") pipeline.annotate(example) ```
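Each `annotate()` output pairs a `token` list with a `checked` list. A small helper (ours, not part of Spark NLP) to pull out just the pairs the spell checker changed; it assumes the two lists are aligned one-to-one, which holds for simple substitutions like the examples above:

```python
def corrections(tokens, checked):
    """Return (original, corrected) pairs where the spell checker
    changed a token. Assumes tokens and checked align one-to-one."""
    return [(t, c) for t, c in zip(tokens, checked) if t != c]


# With the first example sentence's output:
# corrections(
#     ['During', 'the', 'summer', 'we', 'have', 'the', 'best', 'ueather', '.'],
#     ['During', 'the', 'summer', 'we', 'have', 'the', 'best', 'weather', '.'],
# )  → [('ueather', 'weather')]
```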
## Results ```bash [{'checked': ['During', 'the', 'summer', 'we', 'have', 'the', 'best', 'weather', '.'], 'document': ['During the summer we have the best ueather.'], 'token': ['During', 'the', 'summer', 'we', 'have', 'the', 'best', 'ueather', '.']}, {'checked': ['I', 'have', 'a', 'black', 'leather', 'jacket', ',', 'so', 'nice', '.'], 'document': ['I have a black ueather jacket, so nice.'], 'token': ['I', 'have', 'a', 'black', 'ueather', 'jacket', ',', 'so', 'nice', '.']}] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|spellcheck_dl_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|99.4 MB| ## Included Models - DocumentAssembler - TokenizerModel - ContextSpellCheckerModel --- layout: model title: Word2Vec Embeddings in Chuvash (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, cv, open_source] task: Embeddings language: cv edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_cv_3.4.1_3.0_1647291066615.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_cv_3.4.1_3.0_1647291066615.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","cv") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","cv") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("cv.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
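Each token in the `embeddings` column above carries a 300-dimensional vector, and a common downstream step is comparing two such vectors by cosine similarity. A minimal, framework-free sketch (the toy 2-d vectors in the comment stand in for real 300-d outputs):

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors (e.g. OOV tokens)
    return dot / (norm_u * norm_v)


# cosine_similarity([1.0, 0.0], [1.0, 0.0])  → 1.0 (identical direction)
# cosine_similarity([1.0, 0.0], [0.0, 1.0])  → 0.0 (orthogonal)
```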
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|cv| |Size:|251.9 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Translate Finnish to English Pipeline author: John Snow Labs name: translate_fi_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, fi, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `fi` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_fi_en_xx_2.7.0_2.4_1609698992009.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_fi_en_xx_2.7.0_2.4_1609698992009.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_fi_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_fi_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.fi.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_fi_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Russian T5ForConditionalGeneration Small Cased model (from cointegrated) author: John Snow Labs name: t5_rut5_small_chitchat date: 2023-01-30 tags: [ru, open_source, t5, tensorflow] task: Text Generation language: ru edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rut5-small-chitchat` is a Russian model originally trained by `cointegrated`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_rut5_small_chitchat_ru_4.3.0_3.0_1675106805725.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_rut5_small_chitchat_ru_4.3.0_3.0_1675106805725.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_rut5_small_chitchat","ru") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_rut5_small_chitchat","ru") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_rut5_small_chitchat| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ru| |Size:|277.4 MB| ## References - https://huggingface.co/cointegrated/rut5-small-chitchat --- layout: model title: Legal Further Assurances Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_further_assurances_bert date: 2023-03-05 tags: [en, legal, classification, clauses, further_assurances, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Further_Assurances` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Further_Assurances`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_further_assurances_bert_en_1.0.0_3.0_1678050719026.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_further_assurances_bert_en_1.0.0_3.0_1678050719026.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_further_assurances_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
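As the description notes, several of these binary clause classifiers can be run over the same provision and their True/False-style outputs collected. A sketch of the aggregation step only; the dict stands in for the per-model predictions, and every model name other than `legclf_further_assurances_bert` is hypothetical:

```python
def detected_clauses(labels_by_model):
    """Keep the clause types that fired, i.e. every predicted label
    that is not the binary classifiers' negative class 'Other'."""
    return sorted(label for label in labels_by_model.values() if label != "Other")


# detected_clauses({
#     "legclf_further_assurances_bert": "Further_Assurances",
#     "legclf_confidentiality_bert": "Other",  # hypothetical sibling model
# })  → ["Further_Assurances"]
```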
## Results ```bash +-------+ |result| +-------+ |[Further_Assurances]| |[Other]| |[Other]| |[Further_Assurances]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_further_assurances_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Further_Assurances 0.93 0.97 0.95 120 Other 0.98 0.94 0.96 147 accuracy - - 0.96 267 macro-avg 0.95 0.96 0.95 267 weighted-avg 0.96 0.96 0.96 267 ``` --- layout: model title: Classifier for Adverse Drug Events in Small Conversations author: John Snow Labs name: classifierdl_ade_conversational_biobert date: 2021-01-21 task: Text Classification language: en nav_key: models edition: Healthcare NLP 2.7.1 spark_version: 2.4 tags: [en, licensed, classifier, clinical] supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Classifies sentences into two categories: `True` : The sentence is talking about a possible ADE. `False` : The sentence doesn't have any information about an ADE.
## Predicted Entities `True`, `False` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifierdl_ade_conversational_biobert_en_2.7.1_2.4_1611246389884.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifierdl_ade_conversational_biobert_en_2.7.1_2.4_1611246389884.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document") tokenizer = Tokenizer().setInputCols(['document']).setOutputCol('token') embeddings = BertEmbeddings.pretrained('biobert_pubmed_base_cased')\ .setInputCols(["document", 'token'])\ .setOutputCol("word_embeddings") sentence_embeddings = SentenceEmbeddings() \ .setInputCols(["document", "word_embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") classifier = ClassifierDLModel.pretrained('classifierdl_ade_conversational_biobert', 'en', 'clinical/models')\ .setInputCols(['document', 'token', 'sentence_embeddings']).setOutputCol('class') nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings, sentence_embeddings, classifier]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate(["I feel a bit drowsy & have a little blurred vision after taking an insulin", "I feel great after taking tylenol"]) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.ade.conversational").predict("""I feel a bit drowsy & have a little blurred vision after taking an insulin""") ```
## Results ```bash | | text | label | |--:|:---------------------------------------------------------------------------|:------| | 0 | I feel a bit drowsy & have a little blurred vision after taking an insulin | True | | 1 | I feel great after taking tylenol | False | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_ade_conversational_biobert| |Compatibility:|Spark NLP 2.7.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Dependencies:|biobert_pubmed_base_cased| ## Data Source Trained on a custom dataset comprising CADEC, DRUG-AE and Twimed. ## Benchmarking ```bash precision recall f1-score support False 0.91 0.94 0.93 5706 True 0.80 0.70 0.74 1800 micro avg 0.89 0.89 0.89 7506 macro avg 0.85 0.82 0.84 7506 weighted avg 0.88 0.89 0.88 7506 ``` --- layout: model title: Smaller BERT Sentence Embeddings (L-10_H-768_A-12) author: John Snow Labs name: sent_small_bert_L10_768 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L10_768_en_2.6.0_2.4_1598351479319.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L10_768_en_2.6.0_2.4_1598351479319.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L10_768", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer'], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L10_768", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L10_768').predict(text, output_level='sentence') embeddings_df ```
{:.h2_title} ## Results ```bash en_embed_sentence_small_bert_L10_768_embeddings sentence [-0.6537564396858215, -0.2422734946012497, -0.... I hate cancer [0.06436929106712341, -0.34515661001205444, 0.... Antibiotics aren't painkiller ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L10_768| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|en| |Dimension:|768| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-768_A-12/1 --- layout: model title: Pipeline to Extract Negation and Uncertainty Entities from Spanish Medical Texts author: John Snow Labs name: ner_negation_uncertainty_pipeline date: 2023-03-09 tags: [es, clinical, licensed, ner, unc, usco, neg, nsco, negation, uncertainty] task: Named Entity Recognition language: es edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_negation_uncertainty](https://nlp.johnsnowlabs.com/2022/08/13/ner_negation_uncertainty_es_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_negation_uncertainty_pipeline_es_4.3.0_3.2_1678359171669.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_negation_uncertainty_pipeline_es_4.3.0_3.2_1678359171669.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("ner_negation_uncertainty_pipeline", "es", "clinical/models")

text = '''Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa).'''

result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = new PretrainedPipeline("ner_negation_uncertainty_pipeline", "es", "clinical/models")

val text = "Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa)."

val result = pipeline.fullAnnotate(text)
```
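`fullAnnotate` returns chunk annotations carrying the matched text and its label. Assuming each chunk exposes a `result` field and a `metadata` dict with an `entity` key, as Spark NLP annotations do, the (chunk, label) pairs shown under Results can be collected like this (the `Annotation` class below is a stand-in for the real annotation objects):

```python
from dataclasses import dataclass

# Stand-in for a Spark NLP chunk annotation; the real objects expose
# the same result / metadata fields.
@dataclass
class Annotation:
    result: str
    metadata: dict

def chunks_to_pairs(annotations):
    # Collect (chunk text, NER label) pairs from ner_chunk annotations.
    return [(a.result, a.metadata.get("entity")) for a in annotations]

sample = [
    Annotation("probable de", {"entity": "UNC"}),
    Annotation("cirrosis hepática", {"entity": "USCO"}),
]
print(chunks_to_pairs(sample))
```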
## Results ```bash +------------------------------------------------------+---------+ |chunk |ner_label| +------------------------------------------------------+---------+ |probable de |UNC | |cirrosis hepática |USCO | |no |NEG | |conocida previamente |NSCO | |no |NEG | |se realizó paracentesis control por escasez de liquido|NSCO | |susceptible de |UNC | |ca basocelular perlado |USCO | +------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_negation_uncertainty_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|es| |Size:|318.6 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - RoBertaEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Sentence Detection in Bosnian Text author: John Snow Labs name: sentence_detector_dl date: 2021-08-30 tags: [bs, sentence_detection, open_source] task: Sentence Detection language: bs edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_bs_3.2.0_3.0_1630317779410.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_bs_3.2.0_3.0_1630317779410.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "bs") \ .setInputCols(["document"]) \ .setOutputCol("sentences") sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL])) sd_model.fullAnnotate("""Tražite sjajan izvor čitanja odlomaka na engleskom? Došli ste na pravo mjesto. Prema nedavnom istraživanju, navika čitanja u današnjoj mladosti brzo se smanjuje. Ne mogu se usredotočiti na dati odlomak za čitanje engleskog jezika duže od nekoliko sekundi! Takođe, čitanje je bilo i jeste sastavni dio svih takmičarskih ispita. Dakle, kako poboljšati svoje vještine čitanja? Odgovor na ovo pitanje zapravo je drugo pitanje: Kakva je korist od vještine čitanja? Glavna svrha čitanja je 'imati smisla'.""") ``` ```scala val documenter = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "bs") .setInputCols(Array("document")) .setOutputCol("sentence") val pipeline = new Pipeline().setStages(Array(documenter, model)) val data = Seq("Tražite sjajan izvor čitanja odlomaka na engleskom? Došli ste na pravo mjesto. Prema nedavnom istraživanju, navika čitanja u današnjoj mladosti brzo se smanjuje. Ne mogu se usredotočiti na dati odlomak za čitanje engleskog jezika duže od nekoliko sekundi! Takođe, čitanje je bilo i jeste sastavni dio svih takmičarskih ispita. Dakle, kako poboljšati svoje vještine čitanja? Odgovor na ovo pitanje zapravo je drugo pitanje: Kakva je korist od vještine čitanja? Glavna svrha čitanja je 'imati smisla'.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python nlu.load('bs.sentence_detector').predict("Tražite sjajan izvor čitanja odlomaka na engleskom? Došli ste na pravo mjesto. 
Prema nedavnom istraživanju, navika čitanja u današnjoj mladosti brzo se smanjuje. Ne mogu se usredotočiti na dati odlomak za čitanje engleskog jezika duže od nekoliko sekundi! Takođe, čitanje je bilo i jeste sastavni dio svih takmičarskih ispita. Dakle, kako poboljšati svoje vještine čitanja? Odgovor na ovo pitanje zapravo je drugo pitanje: Kakva je korist od vještine čitanja? Glavna svrha čitanja je 'imati smisla'.", output_level ='sentence') ```
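`LightPipeline`-style annotate output maps each output column to a plain list of strings, which makes the detected sentences easy to post-process. A small sketch, assuming the output column is named `sentences` as in the Python snippet above:

```python
def sentence_stats(annotated):
    # `annotated` mimics LightPipeline annotate output: a dict mapping the
    # output column name to the list of detected sentence strings.
    sents = annotated["sentences"]
    return len(sents), max(sents, key=len)

sample = {"sentences": [
    "Tražite sjajan izvor čitanja odlomaka na engleskom?",
    "Došli ste na pravo mjesto.",
]}
count, longest = sentence_stats(sample)
print(count, longest)
```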
## Results

```bash
+-----------------------------------------------------------------------------------------------+
|result                                                                                         |
+-----------------------------------------------------------------------------------------------+
|[Tražite sjajan izvor čitanja odlomaka na engleskom?]                                          |
|[Došli ste na pravo mjesto.]                                                                   |
|[Prema nedavnom istraživanju, navika čitanja u današnjoj mladosti brzo se smanjuje.]           |
|[Ne mogu se usredotočiti na dati odlomak za čitanje engleskog jezika duže od nekoliko sekundi!]|
|[Takođe, čitanje je bilo i jeste sastavni dio svih takmičarskih ispita.]                       |
|[Dakle, kako poboljšati svoje vještine čitanja?]                                               |
|[Odgovor na ovo pitanje zapravo je drugo pitanje:]                                             |
|[Kakva je korist od vještine čitanja?]                                                         |
|[Glavna svrha čitanja je 'imati smisla'.]                                                      |
+-----------------------------------------------------------------------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sentence_detector_dl|
|Compatibility:|Spark NLP 3.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[sentences]|
|Language:|bs|

## Benchmarking

```bash
label  Accuracy  Recall  Prec  F1
0      0.98      1.00    0.96  0.98
```

---
layout: model
title: Italian BertForMaskedLM Base Cased model (from dbmdz)
author: John Snow Labs
name: bert_embeddings_base_italian_xxl_cased
date: 2022-12-02
tags: [it, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: it
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-italian-xxl-cased` is an Italian model originally trained by `dbmdz`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_italian_xxl_cased_it_4.2.4_3.0_1670017995735.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_italian_xxl_cased_it_4.2.4_3.0_1670017995735.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_italian_xxl_cased","it") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_italian_xxl_cased","it")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
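The annotator emits one embedding vector per token. A common way to derive a single sentence vector from them is mean pooling; a minimal sketch with illustrative 3-dimensional stand-ins for the model's real 768-dimensional token vectors:

```python
def mean_pool(token_vectors):
    # Average a list of equal-length token vectors into one sentence vector.
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

# Three illustrative token vectors (the real model emits 768 dimensions).
tokens = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [2.0, 4.0, 1.0]]
print(mean_pool(tokens))
```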
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_italian_xxl_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|it| |Size:|415.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/dbmdz/bert-base-italian-xxl-cased - http://opus.nlpl.eu/ - https://traces1.inria.fr/oscar/ - https://github.com/dbmdz/berts/issues/7 - https://github.com/stefan-it/turkish-bert/tree/master/electra - https://github.com/stefan-it/italian-bertelectra - https://github.com/dbmdz/berts/issues/new --- layout: model title: Telugu BertForMaskedLM Cased model (from neuralspace-reverie) author: John Snow Labs name: bert_embeddings_indic_transformers date: 2022-12-06 tags: [te, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: te edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `indic-transformers-te-bert` is a Telugu model originally trained by `neuralspace-reverie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_te_4.2.4_3.0_1670326679548.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_te_4.2.4_3.0_1670326679548.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","te") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_indic_transformers","te")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_indic_transformers|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|te|
|Size:|611.9 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/neuralspace-reverie/indic-transformers-te-bert
- https://oscar-corpus.com/

---
layout: model
title: English T5ForConditionalGeneration Cased model (from benjamyu)
author: John Snow Labs
name: t5_autotrain_ms_2_1174443640
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-ms-2-1174443640` is an English model originally trained by `benjamyu`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_autotrain_ms_2_1174443640_en_4.3.0_3.0_1675099983295.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_autotrain_ms_2_1174443640_en_4.3.0_3.0_1675099983295.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_autotrain_ms_2_1174443640","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_autotrain_ms_2_1174443640","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
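Transformer inputs are bounded by the model's maximum sequence length, so long documents are usually split into pieces before being fed through the pipeline. A simple word-bounded chunker (the budget of 50 words is an illustrative assumption, not a property of this model):

```python
def chunk_words(text, max_words=50):
    # Split text into word-bounded chunks of at most max_words words,
    # so each piece fits within the transformer's input budget.
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

print(chunk_words("one two three four five", max_words=2))
```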
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_autotrain_ms_2_1174443640|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|916.0 MB|

## References

- https://huggingface.co/benjamyu/autotrain-ms-2-1174443640

---
layout: model
title: English BertForTokenClassification Cased model (from Lucifermorningstar011)
author: John Snow Labs
name: bert_token_classifier_autotrain_final_784824206
date: 2022-11-30
tags: [en, open_source, bert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-final-784824206` is an English model originally trained by `Lucifermorningstar011`.

## Predicted Entities

`9`, `0`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autotrain_final_784824206_en_4.2.4_3.0_1669814522383.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autotrain_final_784824206_en_4.2.4_3.0_1669814522383.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autotrain_final_784824206","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autotrain_final_784824206","en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
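Downstream of a token classifier, consecutive token tags are normally merged into entity chunks; in Spark NLP pipelines a `NerConverter` stage performs this step. A sketch of the underlying merge for a standard `B-`/`I-`/`O` tagging scheme (note this particular model's labels are `9` and `0`, so adapt the scheme accordingly):

```python
def bio_to_chunks(tokens, tags):
    # Merge BIO-tagged tokens into (chunk_text, label) spans.
    chunks, cur, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur:
                chunks.append((" ".join(cur), label))
            cur, label = [tok], tag[2:]
        elif tag.startswith("I-") and cur and tag[2:] == label:
            cur.append(tok)
        else:
            if cur:
                chunks.append((" ".join(cur), label))
            cur, label = [], None
    if cur:
        chunks.append((" ".join(cur), label))
    return chunks

print(bio_to_chunks(
    ["John", "Smith", "visited", "Paris"],
    ["B-PER", "I-PER", "O", "B-LOC"],
))
```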
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_autotrain_final_784824206|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/Lucifermorningstar011/autotrain-final-784824206

---
layout: model
title: English BertForQuestionAnswering model (from jackh1995)
author: John Snow Labs
name: bert_qa_bert_finetuned_jackh1995
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned` is an English model originally trained by `jackh1995`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_jackh1995_en_4.0.0_3.0_1654534832705.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_jackh1995_en_4.0.0_3.0_1654534832705.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_finetuned_jackh1995","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_bert_finetuned_jackh1995","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.bert.by_jackh1995").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
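The NLU one-liner above packs the question and the context into a single string separated by `|||`. A tiny helper for building such inputs (the helper name is ours, not part of the library):

```python
def qa_input(question, context, sep="|||"):
    # Join a question and its context with the "|||" separator used by
    # the NLU question-answering loaders.
    return f"{question}{sep}{context}"

print(qa_input("What's my name?", "My name is Clara and I live in Berkeley."))
```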
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_finetuned_jackh1995| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|381.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/jackh1995/bert-finetuned --- layout: model title: Named Entity Recognition (NER) Model in Swedish (GloVe 840B 300) author: John Snow Labs name: swedish_ner_840B_300 date: 2020-08-30 task: Named Entity Recognition language: sv edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [ner, sv, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Swedish NER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. The model is trained with GloVe 840B 300 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Product-`PRO`, Date-`DATE`, Event-`EVENT`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_SV/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/swedish_ner_840B_300_sv_2.6.0_2.4_1598810268072.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/swedish_ner_840B_300_sv_2.6.0_2.4_1598810268072.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang = "xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("swedish_ner_840B_300", "sv") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (född 28 oktober 1955) är en amerikansk affärsmagnat, mjukvaruutvecklare, investerare och filantrop. Han är mest känd som medgrundare av Microsoft Corporation. Under sin karriär på Microsoft innehade Gates befattningar som styrelseordförande, verkställande direktör (VD), VD och programvaruarkitekt samtidigt som han var den största enskilda aktieägaren fram till maj 2014. Han är en av de mest kända företagarna och pionjärerna inom mikrodatorrevolutionen på 1970- och 1980-talet. Född och uppvuxen i Seattle, Washington, grundade Gates Microsoft tillsammans med barndomsvän Paul Allen 1975 i Albuquerque, New Mexico; det blev vidare världens största datorprogramföretag. Gates ledde företaget som styrelseordförande och VD tills han avgick som VD i januari 2000, men han förblev ordförande och blev chef för programvaruarkitekt. Under slutet av 1990-talet hade Gates kritiserats för sin affärstaktik, som har ansetts konkurrensbegränsande. Detta yttrande har upprätthållits genom många domstolsbeslut. I juni 2006 meddelade Gates att han skulle gå över till en deltidsroll på Microsoft och heltid på Bill & Melinda Gates Foundation, den privata välgörenhetsstiftelsen som han och hans fru, Melinda Gates, grundade 2000. Han överförde gradvis sina uppgifter till Ray Ozzie och Craig Mundie. 
Han avgick som styrelseordförande i Microsoft i februari 2014 och tillträdde en ny tjänst som teknologrådgivare för att stödja den nyutnämnda VD Satya Nadella.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang = "xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("swedish_ner_840B_300", "sv") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (född 28 oktober 1955) är en amerikansk affärsmagnat, mjukvaruutvecklare, investerare och filantrop. Han är mest känd som medgrundare av Microsoft Corporation. Under sin karriär på Microsoft innehade Gates befattningar som styrelseordförande, verkställande direktör (VD), VD och programvaruarkitekt samtidigt som han var den största enskilda aktieägaren fram till maj 2014. Han är en av de mest kända företagarna och pionjärerna inom mikrodatorrevolutionen på 1970- och 1980-talet. Född och uppvuxen i Seattle, Washington, grundade Gates Microsoft tillsammans med barndomsvän Paul Allen 1975 i Albuquerque, New Mexico; det blev vidare världens största datorprogramföretag. Gates ledde företaget som styrelseordförande och VD tills han avgick som VD i januari 2000, men han förblev ordförande och blev chef för programvaruarkitekt. Under slutet av 1990-talet hade Gates kritiserats för sin affärstaktik, som har ansetts konkurrensbegränsande. Detta yttrande har upprätthållits genom många domstolsbeslut. I juni 2006 meddelade Gates att han skulle gå över till en deltidsroll på Microsoft och heltid på Bill & Melinda Gates Foundation, den privata välgörenhetsstiftelsen som han och hans fru, Melinda Gates, grundade 2000. Han överförde gradvis sina uppgifter till Ray Ozzie och Craig Mundie. 
Han avgick som styrelseordförande i Microsoft i februari 2014 och tillträdde en ny tjänst som teknologrådgivare för att stödja den nyutnämnda VD Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (född 28 oktober 1955) är en amerikansk affärsmagnat, mjukvaruutvecklare, investerare och filantrop. Han är mest känd som medgrundare av Microsoft Corporation. Under sin karriär på Microsoft innehade Gates befattningar som styrelseordförande, verkställande direktör (VD), VD och programvaruarkitekt samtidigt som han var den största enskilda aktieägaren fram till maj 2014. Han är en av de mest kända företagarna och pionjärerna inom mikrodatorrevolutionen på 1970- och 1980-talet. Född och uppvuxen i Seattle, Washington, grundade Gates Microsoft tillsammans med barndomsvän Paul Allen 1975 i Albuquerque, New Mexico; det blev vidare världens största datorprogramföretag. Gates ledde företaget som styrelseordförande och VD tills han avgick som VD i januari 2000, men han förblev ordförande och blev chef för programvaruarkitekt. Under slutet av 1990-talet hade Gates kritiserats för sin affärstaktik, som har ansetts konkurrensbegränsande. Detta yttrande har upprätthållits genom många domstolsbeslut. I juni 2006 meddelade Gates att han skulle gå över till en deltidsroll på Microsoft och heltid på Bill & Melinda Gates Foundation, den privata välgörenhetsstiftelsen som han och hans fru, Melinda Gates, grundade 2000. Han överförde gradvis sina uppgifter till Ray Ozzie och Craig Mundie. Han avgick som styrelseordförande i Microsoft i februari 2014 och tillträdde en ny tjänst som teknologrådgivare för att stödja den nyutnämnda VD Satya Nadella."""] ner_df = nlu.load('sv.ner.840B_300').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title} ## Results ```bash +------------------------+---------+ |chunk |ner_label| +------------------------+---------+ |William Henry Gates |PER | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |MISC | |Seattle |LOC | |Washington |LOC | |Gates Microsoft |ORG | |Paul Allen |PER | |Albuquerque |LOC | |New Mexico |MISC | |Gates |MISC | |Gates |MISC | |Gates |MISC | |Microsoft |ORG | |Bill |MISC | |Melinda Gates Foundation|MISC | |Melinda Gates |MISC | |Ray Ozzie |PER | |Craig Mundie |PER | |Microsoft |ORG | +------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|swedish_ner_840B_300| |Type:|ner| |Compatibility:| Spark NLP 2.6.0+| |Edition:|Official| |License:|Open Source| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|sv| |Case sensitive:|false| {:.h2_title} ## Data Source Trained on a custom dataset with multi-lingual GloVe Embeddings ``glove_840B_300``. --- layout: model title: Part of Speech for Finnish author: John Snow Labs name: pos_ud_tdt date: 2020-05-04 23:32:00 +0800 task: Part of Speech Tagging language: fi edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [pos, fi] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. 
{:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_tdt_fi_2.5.0_2.4_1588622348985.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_tdt_fi_2.5.0_2.4_1588622348985.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_tdt", "fi") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Sen lisäksi, että hän on pohjoisen kuningas, John Snow on englantilainen lääkäri ja johtava anestesian ja lääketieteellisen hygienian kehittämisessä.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_tdt", "fi") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Sen lisäksi, että hän on pohjoisen kuningas, John Snow on englantilainen lääkäri ja johtava anestesian ja lääketieteellisen hygienian kehittämisessä.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Sen lisäksi, että hän on pohjoisen kuningas, John Snow on englantilainen lääkäri ja johtava anestesian ja lääketieteellisen hygienian kehittämisessä."""] pos_df = nlu.load('fi.pos.ud_tdt').predict(text, output_level='token') pos_df ```
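The token-level output shown under Results pairs each word with its predicted tag. A quick sketch of tallying tag frequencies, assuming rows shaped like the `result`/`word` fields in that output:

```python
from collections import Counter

def tag_counts(rows):
    # `rows` mimics the flattened pos output: dicts with the predicted tag
    # under "result" and the source word alongside it.
    return Counter(r["result"] for r in rows)

sample = [
    {"result": "PRON", "word": "Sen"},
    {"result": "ADP", "word": "lisäksi"},
    {"result": "PRON", "word": "hän"},
]
print(tag_counts(sample))
```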
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=2, result='PRON', metadata={'word': 'Sen'}), Row(annotatorType='pos', begin=4, end=10, result='ADP', metadata={'word': 'lisäksi'}), Row(annotatorType='pos', begin=11, end=11, result='PUNCT', metadata={'word': ','}), Row(annotatorType='pos', begin=13, end=16, result='SCONJ', metadata={'word': 'että'}), Row(annotatorType='pos', begin=18, end=20, result='PRON', metadata={'word': 'hän'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_tdt| |Type:|pos| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|fi| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: PICO Classifier (BERT) author: John Snow Labs name: bert_sequence_classifier_pico_biobert date: 2022-02-07 tags: [bert, sequence_classification, en, licensed] task: Text Classification language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Classify medical text according to the PICO framework. This model is a [BioBERT-based](https://github.com/dmis-lab/biobert) classifier. ## Predicted Entities `CONCLUSIONS`, `DESIGN_SETTING`, `INTERVENTION`, `PARTICIPANTS`, `FINDINGS`, `MEASUREMENTS`, `AIMS` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_pico_biobert_en_3.4.1_3.0_1644265236813.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_pico_biobert_en_3.4.1_3.0_1644265236813.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_pico_biobert", "en", "clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame([["To compare the results of recording enamel opacities using the TF and modified DDE indices."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_pico_biobert", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq("""To compare the results of recording enamel opacities using the TF and modified DDE indices.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.pico.seq_biobert").predict("""To compare the results of recording enamel opacities using the TF and modified DDE indices.""") ```
## Results ```bash +-------------------------------------------------------------------------------------------+------+ |text |result| +-------------------------------------------------------------------------------------------+------+ |To compare the results of recording enamel opacities using the TF and modified DDE indices.|[AIMS]| +-------------------------------------------------------------------------------------------+------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_pico_biobert| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.0 MB| |Case sensitive:|true| |Max sentence length:|128| ## References This model is trained on a custom dataset derived from a PICO classification dataset. ## Benchmarking ```bash label precision recall f1-score support AIMS 0.92 0.94 0.93 3813 CONCLUSIONS 0.85 0.86 0.86 4314 DESIGN_SETTING 0.88 0.78 0.83 5628 FINDINGS 0.91 0.92 0.91 9242 INTERVENTION 0.71 0.78 0.74 2331 MEASUREMENTS 0.79 0.87 0.83 3219 PARTICIPANTS 0.86 0.81 0.83 2723 accuracy - - 0.86 31270 macro-avg 0.85 0.85 0.85 31270 weighted-avg 0.87 0.86 0.86 31270 ``` --- layout: model title: English RobertaForQuestionAnswering Cased model (from benny6) author: John Snow Labs name: roberta_qa_tydi date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-tydiqa` is an English model originally trained by `benny6`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_tydi_en_4.3.0_3.0_1674222584111.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_tydi_en_4.3.0_3.0_1674222584111.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_tydi","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_tydi","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_tydi| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|471.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/benny6/roberta-tydiqa --- layout: model title: Legal Cooperation Clause Binary Classifier author: John Snow Labs name: legclf_cooperation_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `cooperation` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. 
## Predicted Entities `other`, `cooperation` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cooperation_clause_en_1.0.0_3.2_1660122299955.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cooperation_clause_en_1.0.0_3.2_1660122299955.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_cooperation_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------------+ | result| +-------------+ |[cooperation]| | [other]| | [other]| |[cooperation]| +-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_cooperation_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support cooperation 0.91 0.94 0.93 34 other 0.98 0.97 0.97 96 accuracy - - 0.96 130 macro-avg 0.95 0.95 0.95 130 weighted-avg 0.96 0.96 0.96 130 ``` --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_ali221000262 TFWav2Vec2ForCTC from ali221000262 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_ali221000262 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_ali221000262` is an English model originally trained by ali221000262. 
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_by_ali221000262_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_ali221000262_en_4.2.0_3.0_1664036650196.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_ali221000262_en_4.2.0_3.0_1664036650196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_ali221000262', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_ali221000262", lang = "en") val annotations = pipeline.transform(audioDF) ```
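The `audioDF` referenced above is expected to hold raw floating-point audio samples. As a hedged sketch using only the Python standard library (real projects typically use `librosa` or `soundfile`, and Wav2Vec2 models generally expect 16 kHz mono input; `wav_to_floats` is a hypothetical helper, not part of Spark NLP), a 16-bit PCM WAV file can be decoded into floats like this:

```python
import struct
import wave

def wav_to_floats(path):
    """Decode a 16-bit PCM WAV file into a list of floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expects 16-bit PCM"
        n_values = wf.getnframes() * wf.getnchannels()
        frames = wf.readframes(wf.getnframes())
    # '<h' = little-endian signed 16-bit, the WAV sample format
    values = struct.unpack("<%dh" % n_values, frames)
    return [v / 32768.0 for v in values]
```

The resulting list could then be placed in a one-column Spark DataFrame for the pipeline above; the exact column name and schema the pipeline expects should be checked against the Spark NLP documentation.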
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_ali221000262| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|354.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Sentence Entity Resolver for billable ICD10-CM HCC codes author: John Snow Labs name: sbiobertresolve_icd10cm_augmented_billable_hcc date: 2021-02-06 task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 2.7.3 spark_version: 2.4 tags: [licensed, clinical, en, entity_resolution] supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD10-CM codes using chunk embeddings (augmented with synonyms, four times richer than the previous resolver). It also adds support for 7-digit codes with HCC status. ## Predicted Entities Outputs 7-digit billable ICD codes. In the result, look for the `aux_label` parameter in the metadata to get the HCC status. The HCC status can be split to get further information: `billable status`, `hcc status`, and `hcc score`. For example, in the example shared below, the billable status is `1`, the hcc status is `1`, and the hcc score is `8`. 
{:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_billable_hcc_en_2.7.3_2.4_1612609178670.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_billable_hcc_en_2.7.3_2.4_1612609178670.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The ```sbiobertresolve_icd10cm_augmented_billable_hcc``` resolver model must be used with ```sbiobert_base_cased_mli``` as embeddings and ```ner_clinical``` as the NER model, with ```PROBLEM``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sbert_embeddings") icd10_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models") \ .setInputCols(["document", "sbert_embeddings"]) \ .setOutputCol("icd10cm_code")\ .setDistanceFunction("EUCLIDEAN").setReturnCosineDistances(True) bert_pipeline_icd = Pipeline(stages = [document_assembler, sbert_embedder, icd10_resolver]) data = spark.createDataFrame([["metastatic lung cancer"]]).toDF("text") results = bert_pipeline_icd.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sbert_embeddings") val icd10_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models") .setInputCols(Array("document", "sbert_embeddings")) .setOutputCol("icd10cm_code") .setDistanceFunction("EUCLIDEAN") .setReturnCosineDistances(true) val bert_pipeline_icd = new Pipeline().setStages(Array(document_assembler, sbert_embedder, icd10_resolver)) val data = Seq("metastatic lung cancer").toDF("text") val result = bert_pipeline_icd.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd10cm.augmented_billable").predict("""metastatic lung cancer""") ```
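The description above notes that the HCC status packed into `aux_label` can be split into billable status, HCC status, and HCC score. A minimal sketch in plain Python (the `||` separator and the `parse_hcc_status` helper are assumptions, not part of Spark NLP; check the actual metadata emitted by your version):

```python
def parse_hcc_status(aux_label, sep="||"):
    """Split a resolver aux_label such as '1||1||8' into its three parts.

    The '||' separator is an assumption; adjust it to whatever your
    Spark NLP version actually emits in the aux_label metadata field.
    """
    billable, hcc_status, hcc_score = aux_label.split(sep)
    return {"billable": billable, "hcc_status": hcc_status, "hcc_score": hcc_score}

print(parse_hcc_status("1||1||8"))  # {'billable': '1', 'hcc_status': '1', 'hcc_score': '8'}
```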
## Results ```bash | | chunks | code | resolutions | all_codes | billable_hcc_status_score | all_distances | |---:|:-----------------------|:-------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------|:----------------------------|:-------------------------------------------------------------------------------------------------------------------------| | 0 | metastatic lung cancer | C7800 | ['cancer metastatic to lung', 'metastasis from malignant tumor of lung', 'cancer metastatic to left lung', 'history of cancer metastatic to lung', 'metastatic cancer', 'history of cancer metastatic to lung (situation)', 'metastatic adenocarcinoma to bilateral lungs', 'cancer metastatic to chest wall', 'metastatic malignant neoplasm to left lower lobe of lung', 'metastatic carcinoid tumour', 'cancer metastatic to respiratory tract', 'metastatic carcinoid tumor'] | ['C7800', 'C349', 'C7801', 'Z858', 'C800', 'Z8511', 'C780', 'C798', 'C7802', 'C799', 'C7830', 'C7B00'] | ['1', '1', '8'] | ['0.0464', '0.0829', '0.0852', '0.0860', '0.0914', '0.0989', '0.1133', '0.1220', '0.1220', '0.1253', '0.1249', '0.1260'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icd10cm_augmented_billable_hcc| |Compatibility:|Healthcare NLP 2.7.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| --- layout: model title: Detect PHI for Generic Deidentification in Romanian 
(BERT) author: John Snow Labs name: ner_deid_generic_bert date: 2022-08-15 tags: [licensed, clinical, ro, deidentification, phi, generic, bert] task: Named Entity Recognition language: ro edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true recommended: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Deidentification NER (Romanian) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It uses a Deep Learning architecture (Char CNNs - BiLSTM - CRF with word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. It is trained with bert_base_cased embeddings and can detect 7 generic entities. This NER model is trained on a combination of custom datasets with several data augmentation mechanisms. ## Predicted Entities `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_bert_ro_4.0.2_3.0_1660551174367.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_bert_ro_4.0.2_3.0_1660551174367.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")\ .setInputCols(["sentence","token"])\ .setOutputCol("word_embeddings") clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_bert", "ro", "clinical/models")\ .setInputCols(["sentence","token","word_embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter]) text = """ Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""" data = spark.createDataFrame([[text]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro") .setInputCols(Array("sentence","token")) .setOutputCol("word_embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_bert", "ro", "clinical/models") .setInputCols(Array("sentence","token","word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new 
Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter)) val text = """Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""" val data = Seq(text).toDS.toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ro.med_ner.deid_generic_bert").predict(""" Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""") ```
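A common next step after de-identification NER is replacing each detected chunk with its label. A minimal sketch in plain Python (independent of Spark; `mask_phi` is a hypothetical helper, and a production de-identifier would use the chunk begin/end offsets rather than string replacement):

```python
def mask_phi(text, chunks):
    """Replace each detected (chunk, label) pair with a <LABEL> placeholder."""
    for chunk, label in chunks:
        text = text.replace(chunk, "<" + label + ">")
    return text

# Chunk/label pairs mirror the model's output shown in the Results section.
masked = mask_phi(
    "Nume si Prenume : BUREAN MARIA, Varsta: 77",
    [("BUREAN MARIA", "NAME"), ("77", "AGE")],
)
print(masked)  # Nume si Prenume : <NAME>, Varsta: <AGE>
```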
## Results ```bash +----------------------------+---------+ |chunk |ner_label| +----------------------------+---------+ |Spitalul Pentru Ochi de Deal|LOCATION | |Drumul Oprea Nr |LOCATION | |972 |LOCATION | |Vaslui |LOCATION | |737405 |LOCATION | |+40(235)413773 |CONTACT | |25 May 2022 |DATE | |BUREAN MARIA |NAME | |77 |AGE | |Agota Evelyn Tımar |NAME | |2450502264401 |ID | +----------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic_bert| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ro| |Size:|16.3 MB| ## References - Custom John Snow Labs datasets - Data augmentation techniques ## Benchmarking ```bash label precision recall f1-score support AGE 0.95 0.97 0.96 1186 CONTACT 0.99 0.98 0.98 366 DATE 0.96 0.92 0.94 4518 ID 1.00 1.00 1.00 679 LOCATION 0.91 0.90 0.90 1683 NAME 0.93 0.96 0.94 2916 PROFESSION 0.87 0.85 0.86 161 micro-avg 0.94 0.94 0.94 11509 macro-avg 0.94 0.94 0.94 11509 weighted-avg 0.95 0.94 0.94 11509 ``` --- layout: model title: English T5ForConditionalGeneration Small Cased model (from describeai) author: John Snow Labs name: t5_gemini_small date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `gemini-small` is an English model originally trained by `describeai`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_gemini_small_en_4.3.0_3.0_1675102559187.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_gemini_small_en_4.3.0_3.0_1675102559187.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_gemini_small","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_gemini_small","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_gemini_small| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|916.0 MB| ## References - https://huggingface.co/describeai/gemini-small - https://www.describe-ai.com/gemini --- layout: model title: Fast Neural Machine Translation Model from Afrikaans to English author: John Snow Labs name: opus_mt_af_en date: 2021-06-01 tags: [open_source, seq2seq, translation, af, en, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. source languages: af target languages: en {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_af_en_xx_3.1.0_2.4_1622558064281.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_af_en_xx_3.1.0_2.4_1622558064281.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_af_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) data = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_af_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("YOUR TEXT HERE").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Afrikaans.translate_to.English').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_af_en| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Japanese Bert Embeddings (Large) author: John Snow Labs name: bert_embeddings_bert_large_japanese_char_extended date: 2022-04-11 tags: [bert, embeddings, ja, open_source] task: Embeddings language: ja edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-japanese-char-extended` is a Japanese model originally trained by `KoichiYasuoka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_japanese_char_extended_ja_3.4.2_3.0_1649674799994.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_japanese_char_extended_ja_3.4.2_3.0_1649674799994.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_japanese_char_extended","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["私はSpark NLPを愛しています"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_japanese_char_extended","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("私はSpark NLPを愛しています").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ja.embed.bert_large_japanese_char_extended").predict("""私はSpark NLPを愛しています""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_large_japanese_char_extended| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Size:|1.2 GB| |Case sensitive:|true| ## References - https://huggingface.co/KoichiYasuoka/bert-large-japanese-char-extended --- layout: model title: Javanese RoBERTa Embeddings (Small, IMDB Movie Review) author: John Snow Labs name: roberta_embeddings_javanese_roberta_small_imdb date: 2022-04-14 tags: [roberta, embeddings, jv, open_source] task: Embeddings language: jv edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `javanese-roberta-small-imdb` is a Javanese model originally trained by `w11wo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_javanese_roberta_small_imdb_jv_3.4.2_3.0_1649948176711.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_javanese_roberta_small_imdb_jv_3.4.2_3.0_1649948176711.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_javanese_roberta_small_imdb","jv") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_javanese_roberta_small_imdb","jv") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("jv.embed.javanese_roberta_small_imdb").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_javanese_roberta_small_imdb| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|jv| |Size:|468.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/w11wo/javanese-roberta-small-imdb - https://arxiv.org/abs/1907.11692 - https://github.com/sgugger - https://w11wo.github.io/ --- layout: model title: Embeddings Clinical (Medium) author: John Snow Labs name: embeddings_clinical_medium date: 2023-04-07 tags: [licensed, en, clinical, embeddings] task: Embeddings language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained on a list of clinical and biomedical datasets curated in-house, using the word2vec algorithm. The dataset curation cut-off date is March 2023, so the model is expected to generalize better on recent content. The model is around 1 GB in size and produces 200-dimensional vectors. Our benchmark tests indicate that our legacy clinical embeddings (embeddings_clinical) can be replaced with this one while training a new model (existing/previous models will still need to use the legacy embeddings that they're trained with). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/embeddings_clinical_medium_en_4.3.2_3.0_1680835759101.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/embeddings_clinical_medium_en_4.3.2_3.0_1680835759101.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium","en","clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("word_embeddings") ``` ```scala val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium","en","clinical/models") .setInputCols(Array("document","token")) .setOutputCol("word_embeddings") ```
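Word embeddings like these are typically compared with cosine similarity, so that terms used in similar clinical contexts score close to 1. The sketch below uses short made-up vectors (stand-ins for the model's 200-dimensional output) purely to illustrate the comparison; the words and numbers are not actual model output.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional stand-ins for the model's 200-dimensional vectors.
v_fever = [0.9, 0.1, 0.3, 0.0]
v_pyrexia = [0.8, 0.2, 0.4, 0.1]
v_table = [0.0, 0.9, 0.0, 0.8]

# A clinically related pair should score higher than an unrelated one.
print(cosine_similarity(v_fever, v_pyrexia) > cosine_similarity(v_fever, v_table))
```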
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|embeddings_clinical_medium| |Type:|embeddings| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[word_embeddings]| |Language:|en| |Size:|787.5 MB| |Case sensitive:|true| |Dimension:|200| --- layout: model title: Swedish asr_wav2vec2_large_xlsr_swedish TFWav2Vec2ForCTC from marma author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_swedish date: 2022-09-25 tags: [wav2vec2, sv, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: sv edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_swedish` is a Swedish model originally trained by marma. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_swedish_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_swedish_sv_4.2.0_3.0_1664118734633.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_swedish_sv_4.2.0_3.0_1664118734633.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_swedish', lang = 'sv') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_swedish", lang = "sv") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_swedish| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|sv| |Size:|756.1 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Drug Substance to UMLS Code Pipeline author: John Snow Labs name: umls_drug_substance_resolver_pipeline date: 2023-04-11 tags: [licensed, clinical, en, umls, pipeline, drug, substance] task: Chunk Mapping language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps entities (Drug Substances) to their corresponding UMLS CUI codes. You just feed in your text and it will return the corresponding UMLS codes. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/umls_drug_substance_resolver_pipeline_en_4.3.2_3.0_1681217098344.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/umls_drug_substance_resolver_pipeline_en_4.3.2_3.0_1681217098344.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("umls_drug_substance_resolver_pipeline", "en", "clinical/models") result = pipeline.annotate("The patient was given metformin, lenvatinib and Magnesium hydroxide 100mg/1ml") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = PretrainedPipeline("umls_drug_substance_resolver_pipeline", "en", "clinical/models") val result = pipeline.annotate("The patient was given metformin, lenvatinib and Magnesium hydroxide 100mg/1ml") ``` {:.nlu-block} ```python +-----------------------------+---------+---------+ |chunk |ner_label|umls_code| +-----------------------------+---------+---------+ |metformin |DRUG |C0025598 | |lenvatinib |DRUG |C2986924 | |Magnesium hydroxide 100mg/1ml|DRUG |C1134402 | +-----------------------------+---------+---------+ ```
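The rows above (chunk, ner_label, umls_code) can be post-processed into a simple chunk-to-CUI lookup for downstream use. This is a minimal sketch over the exact rows shown in the sample output, not part of the pipeline itself:

```python
# Rows as shown in the sample pipeline output above: (chunk, ner_label, umls_code).
rows = [
    ("metformin", "DRUG", "C0025598"),
    ("lenvatinib", "DRUG", "C2986924"),
    ("Magnesium hydroxide 100mg/1ml", "DRUG", "C1134402"),
]

# Build a chunk -> UMLS CUI lookup, keeping only DRUG entities.
chunk_to_cui = {chunk: cui for chunk, label, cui in rows if label == "DRUG"}

print(chunk_to_cui["metformin"])  # C0025598
```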
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|umls_drug_substance_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|5.1 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger --- layout: model title: Multilingual DistilBertForQuestionAnswering Base Cased model (from monakth) author: John Snow Labs name: distilbert_qa_monakth_base_case_finetuned_squad date: 2023-01-03 tags: [xx, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-multilingual-cased-finetuned-squad` is a Multilingual model originally trained by `monakth`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_monakth_base_case_finetuned_squad_xx_4.3.0_3.0_1672767194965.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_monakth_base_case_finetuned_squad_xx_4.3.0_3.0_1672767194965.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_monakth_base_case_finetuned_squad","xx")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_monakth_base_case_finetuned_squad","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
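Extractive question-answering models of this family score every context token as a candidate answer start and answer end, and the returned answer is the highest-scoring valid span. The sketch below illustrates only that selection step; the tokens and scores are made up for illustration and are not model output:

```python
def best_span(start_scores, end_scores, max_len=15):
    # Pick (start, end) maximizing start_scores[s] + end_scores[e],
    # subject to s <= e and a maximum span length.
    best, best_score = None, float("-inf")
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = ss + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.0, 0.2, 3.1, 0.0, 0.1, 0.0, 0.0, 0.5, 0.0]
end_scores   = [0.0, 0.1, 0.0, 2.8, 0.2, 0.0, 0.1, 0.0, 0.9, 0.0]
s, e = best_span(start_scores, end_scores)
print(tokens[s:e + 1])  # ['Clara']
```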
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_monakth_base_case_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|505.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/monakth/distilbert-base-multilingual-cased-finetuned-squad --- layout: model title: Pipeline to Extract Pharmacological Entities From Spanish Medical Texts (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_pharmacology_pipeline date: 2023-03-20 tags: [es, clinical, licensed, token_classification, bert, ner, pharmacology] task: Named Entity Recognition language: es edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_pharmacology](https://nlp.johnsnowlabs.com/2022/08/11/bert_token_classifier_pharmacology_es_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_pharmacology_pipeline_es_4.3.0_3.2_1679298404485.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_pharmacology_pipeline_es_4.3.0_3.2_1679298404485.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_pharmacology_pipeline", "es", "clinical/models") text = '''Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa).''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_pharmacology_pipeline", "es", "clinical/models") val text = "Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa)." val result = pipeline.fullAnnotate(text) ```
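The `begin`/`end` columns in the pipeline's output are character offsets into the input text, and Spark NLP annotation offsets are inclusive on both ends. A quick sketch of slicing a chunk back out of the sample text used above (hence the `end + 1` in the Python slice):

```python
# Opening of the sample text passed to the pipeline above.
text = ("Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, "
        "urea 63 mg/dl, CA 19.9 64,1 U/ml.")

def chunk_at(text, begin, end):
    # Spark NLP offsets are inclusive on both ends, so add 1 to the end index.
    return text[begin:end + 1]

print(chunk_at(text, 32, 44))  # creatinkinasa
print(chunk_at(text, 54, 56))  # LDH
```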
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:----------------|--------:|------:|:--------------|-------------:| | 0 | creatinkinasa | 32 | 44 | PROTEINAS | 0.999973 | | 1 | LDH | 54 | 56 | PROTEINAS | 0.999972 | | 2 | urea | 66 | 69 | NORMALIZABLES | 0.999977 | | 3 | CA 19.9 | 81 | 87 | PROTEINAS | 0.999964 | | 4 | vimentina | 139 | 147 | PROTEINAS | 0.999961 | | 5 | S-100 | 150 | 154 | PROTEINAS | 0.999861 | | 6 | HMB-45 | 157 | 162 | PROTEINAS | 0.999965 | | 7 | actina | 166 | 171 | PROTEINAS | 0.999967 | | 8 | Cisplatino | 220 | 229 | NORMALIZABLES | 0.999988 | | 9 | Interleukina II | 232 | 246 | PROTEINAS | 0.999965 | | 10 | Dacarbacina | 249 | 259 | NORMALIZABLES | 0.999988 | | 11 | Interferon alfa | 263 | 277 | PROTEINAS | 0.999961 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_pharmacology_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|es| |Size:|410.6 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: XLNet Large CoNLL-03 NER Pipeline author: John Snow Labs name: xlnet_large_token_classifier_conll03_pipeline date: 2022-06-24 tags: [open_source, ner, token_classifier, xlnet, conll03, large, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [xlnet_large_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/28/xlnet_large_token_classifier_conll03_en.html) model. 
## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlnet_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1656081418731.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlnet_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1656081418731.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("xlnet_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("xlnet_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PERSON | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlnet_large_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|32.1 KB| ## Included Models - DocumentAssembler - TokenizerModel - NormalizerModel --- layout: model title: Detect PHI for Deidentification purposes (Spanish, Roberta, augmented) author: John Snow Labs name: ner_deid_subentity_roberta_augmented date: 2022-02-15 tags: [deid, es, licensed] task: De-identification language: es edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Spanish) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 17 entities, which is more than the previously released `ner_deid_subentity_roberta` model. This NER model is trained with a combination of custom datasets, the Spanish CoNLL 2002, MeddoProf and MeddoCan datasets, and includes several data augmentation mechanisms. This version uses RoBERTa clinical embeddings. A related model, `ner_deid_subentity_augmented`, uses Sciwi 300d embeddings instead of RoBERTa. 
## Predicted Entities `PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `ID`, `STREET`, `USERNAME`, `SEX`, `EMAIL`, `ZIP`, `MEDICALRECORD`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_ES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_roberta_augmented_es_3.3.4_2.4_1644927666923.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_roberta_augmented_es_3.3.4_2.4_1644927666923.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = medical.NerModel.pretrained("ner_deid_subentity_roberta_augmented", "es", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, roberta_embeddings, clinical_ner]) text = [''' Antonio Miguel Martínez, varón de de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. 
'''] df = spark.createDataFrame([text]).toDF("text") results = nlpPipeline.fit(df).transform(df) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val roberta_embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_roberta_augmented", "es", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, roberta_embeddings, clinical_ner)) val text = "Antonio Miguel Martínez, varón de de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos." val df = Seq(text).toDF("text") val results = pipeline.fit(df).transform(df) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.deid.subentity.roberta").predict(""" Antonio Miguel Martínez, varón de de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. """) ```
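The model emits token-level BIO tags, which a `NerConverter` stage would normally group into entity chunks. A simplified sketch of that grouping logic (this re-implements the idea for illustration; it is not the actual NerConverter code), applied to the opening tokens of the sample text:

```python
def bio_to_chunks(tokens, tags):
    # Group BIO-tagged tokens into (chunk_text, label) pairs.
    chunks, current_tokens, current_label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:  # "O" tag, or an I- tag that does not continue the open chunk
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks

tokens = ["Antonio", "Miguel", "Martínez", ",", "varón"]
tags = ["B-PATIENT", "I-PATIENT", "I-PATIENT", "O", "B-SEX"]
print(bio_to_chunks(tokens, tags))
# [('Antonio Miguel Martínez', 'PATIENT'), ('varón', 'SEX')]
```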
## Results ```bash +------------+------------+ | token| ner_label| +------------+------------+ | Antonio| B-PATIENT| | Miguel| I-PATIENT| | Martínez| I-PATIENT| | ,| O| | varón| B-SEX| | de| O| | de| O| | 35| B-AGE| | años| O| | de| O| | edad| O| | ,| O| | de| O| | profesión| O| | auxiliar|B-PROFESSION| | de|I-PROFESSION| | enfermería|I-PROFESSION| | y| O| | nacido| O| | en| O| | Cadiz| B-CITY| | ,| O| | España| B-COUNTRY| | .| O| | Aún| O| | no| O| | estaba| O| | vacunado| O| | ,| O| | se| O| | infectó| O| | con| O| | Covid-19| O| | el| O| | dia| O| | 14| B-DATE| | de| I-DATE| | Marzo| I-DATE| | y| O| | tuvo| O| | que| O| | ir| O| | al| O| | Hospital| O| | Fue| O| | tratado| O| | con| O| | anticuerpos| O| |monoclonales| O| | en| O| | la| O| | Clinica| B-HOSPITAL| | San| I-HOSPITAL| | Carlos| I-HOSPITAL| | .| O| +------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity_roberta_augmented| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|16.3 MB| ## References - Internal JSL annotated corpus - [Spanish conLL](https://www.clips.uantwerpen.be/conll2002/ner/data/) - [MeddoProf](https://temu.bsc.es/meddoprof/data/) - [MeddoCan](https://temu.bsc.es/meddocan/) ## Benchmarking ```bash label tp fp fn total precision recall f1 PATIENT 1874.0 165.0 186.0 2060.0 0.9191 0.9097 0.9144 HOSPITAL 241.0 19.0 54.0 295.0 0.9269 0.8169 0.8685 DATE 954.0 17.0 15.0 969.0 0.9825 0.9845 0.9835 ORGANIZATION 2521.0 483.0 468.0 2989.0 0.8392 0.8434 0.8413 CITY 1464.0 369.0 289.0 1753.0 0.7987 0.8351 0.8165 ID 35.0 1.0 0.0 35.0 0.9722 1.0 0.9859 STREET 194.0 8.0 6.0 200.0 0.9604 0.97 0.9652 USERNAME 7.0 0.0 4.0 11.0 1.0 0.6364 0.7778 SEX 618.0 9.0 9.0 627.0 0.9856 0.9856 0.9856 EMAIL 134.0 0.0 0.0 134.0 1.0 1.0 1.0 ZIP 138.0 0.0 1.0 139.0 1.0 0.9928 0.9964 MEDICALRECORD 29.0 10.0 0.0 29.0 0.7436 1.0 0.8529 
PROFESSION 231.0 20.0 30.0 261.0 0.9203 0.8851 0.9023 PHONE 44.0 0.0 6.0 50.0 1.0 0.88 0.9362 COUNTRY 458.0 96.0 103.0 561.0 0.8267 0.8164 0.8215 DOCTOR 432.0 38.0 48.0 480.0 0.9191 0.9 0.9095 AGE 509.0 9.0 10.0 519.0 0.9826 0.9807 0.9817 macro - - - - - - 0.9141 micro - - - - - - 0.8891 ``` --- layout: model title: English Deberta Embeddings model (from scales-okn) author: John Snow Labs name: deberta_embeddings_docket_language_model date: 2023-03-13 tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, en, tensorflow] task: Embeddings language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DeBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `docket-language-model` is an English model originally trained by `scales-okn`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_docket_language_model_en_4.3.1_3.0_1678704745796.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_docket_language_model_en_4.3.1_3.0_1678704745796.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_docket_language_model","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_docket_language_model","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|deberta_embeddings_docket_language_model| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|1.6 GB| |Case sensitive:|false| ## References https://huggingface.co/scales-okn/docket-language-model --- layout: model title: Detect Person, Location, Organization, and Miscellaneous entities in Arabic (ANERcorp) author: John Snow Labs name: aner_cc_300d date: 2022-07-26 tags: [ner, ar, open_source] task: Named Entity Recognition language: ar edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses Arabic word embeddings to find 4 different types of entities in Arabic text. It is trained using `arabic_w2v_cc_300d` word embeddings, so please use the same embeddings in the pipeline. ## Predicted Entities `PER`, `LOC`, `ORG`, `MIS` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/aner_cc_300d_ar_4.0.0_3.0_1658869537384.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/aner_cc_300d_ar_4.0.0_3.0_1658869537384.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner = NerDLModel.pretrained("aner_cc_300d", "ar") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate("في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز") ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("aner_cc_300d", "ar") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.ner").predict("""في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز""") ```
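The per-label precision, recall and F1 values reported in the benchmarking section are derived from the tp/fp/fn counts. A quick sketch of those formulas, checked against the B-LOC row (tp=163, fp=28, fn=34):

```python
def prf1(tp, fp, fn):
    # Standard precision / recall / F1 from true-positive, false-positive and
    # false-negative counts; F1 simplifies to 2*tp / (2*tp + fp + fn).
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1

# B-LOC row from the benchmarking table: tp=163, fp=28, fn=34.
p, r, f = prf1(163, 28, 34)
print(round(p, 6), round(r, 6), round(f, 6))  # 0.853403 0.827411 0.840206
```

The micro-average applies the same formulas to the summed counts over all labels, while the macro-average takes the unweighted mean of the per-label scores.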
## Results ```bash | | ner_chunk | entity | |---:|-------------------------:|-------------:| | 0 | قوات الثورة العربية | ORG | | 1 | دمشق | LOC | | 2 | الإنكليز | PER | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|aner_cc_300d| |Type:|ner| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token, word_embeddings]| |Output Labels:|[ner]| |Language:|ar| |Size:|14.9 MB| |Dependencies:|arabic_w2v_cc_300d| ## References This model is trained on data obtained from http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp ## Benchmarking ```bash label tp fp fn prec rec f1 B-LOC 163 28 34 0.853403 0.827411 0.840206 I-ORG 60 10 5 0.857142 0.923077 0.888889 I-MIS 124 53 53 0.700565 0.700565 0.700565 I-LOC 64 20 23 0.761904 0.735632 0.748538 B-MIS 297 71 52 0.807065 0.851003 0.828452 I-PER 84 23 13 0.785046 0.865979 0.823530 B-ORG 54 9 12 0.857142 0.818181 0.837210 B-PER 182 26 33 0.875 0.846512 0.860520 Macro-average 1028 240 225 0.812159 0.821045 0.816578 Micro-average 1028 240 225 0.810726 0.820431 0.815550 ``` --- layout: model title: English DistilBertForQuestionAnswering model (from hark99) author: John Snow Labs name: distilbert_qa_hark99_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `hark99`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_hark99_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725361304.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_hark99_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725361304.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hark99_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hark99_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_hark99").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_hark99_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/hark99/distilbert-base-uncased-finetuned-squad --- layout: model title: French CamemBert Embeddings (from peterhsu) author: John Snow Labs name: camembert_embeddings_peterhsu_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `peterhsu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_peterhsu_generic_model_fr_3.4.4_3.0_1653989980368.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_peterhsu_generic_model_fr_3.4.4_3.0_1653989980368.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_peterhsu_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_peterhsu_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_peterhsu_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/peterhsu/dummy-model --- layout: model title: Recognize Entities OntoNotes - ELECTRA Base author: John Snow Labs name: onto_recognize_entities_electra_base date: 2020-12-09 task: [Named Entity Recognition, Sentence Detection, Embeddings, Pipeline Public] language: en nav_key: models edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [en, open_source, pipeline] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pre-trained pipeline containing an NerDL model. The NER model is trained on OntoNotes 5.0 with `electra_base_uncased` embeddings. It can extract the following 18 entities: ## Predicted Entities `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_electra_base_en_2.7.0_2.4_1607511462448.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_electra_base_en_2.7.0_2.4_1607511462448.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('onto_recognize_entities_electra_base') result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("onto_recognize_entities_electra_base") val result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.") ``` {:.nlu-block} ```python import nlu text = ["""Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament."""] ner_df = nlu.load('en.ner.onto.electra.base').predict(text, output_level='chunk') ner_df[["entities", "entities_class"]] ```
{:.h2_title} ## Results ```bash +------------+---------+ |chunk |ner_label| +------------+---------+ |Johnson |PERSON | |first |ORDINAL | |2001 |DATE | |Parliament |ORG | |eight years |DATE | |London |GPE | |2008 to 2016|DATE | |Parliament |ORG | +------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_recognize_entities_electra_base| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - Tokenizer - BertEmbeddings - NerDLModel - NerConverter --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_nl36 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-nl36` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl36_en_4.3.0_3.0_1675122498742.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl36_en_4.3.0_3.0_1675122498742.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_nl36","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_nl36","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_nl36| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|572.2 MB| ## References - https://huggingface.co/google/t5-efficient-small-nl36 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English DistilBertForQuestionAnswering Cased model (from minhdang241) author: John Snow Labs name: distilbert_qa_robust_tapt date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `robustqa-tapt` is an English model originally trained by `minhdang241`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_robust_tapt_en_4.3.0_3.0_1672775384901.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_robust_tapt_en_4.3.0_3.0_1672775384901.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_robust_tapt","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_robust_tapt","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_robust_tapt| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/minhdang241/robustqa-tapt --- layout: model title: Relation Extraction between different oncological entity types using granular classes (ReDL) author: John Snow Labs name: redl_oncology_granular_biobert_wip date: 2023-01-15 tags: [licensed, clinical, oncology, en, relation_extraction, temporal, test, biomarker, anatomy, tensorflow] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Using this relation extraction model, four relation types can be identified: is_date_of (between date entities and other clinical entities), is_size_of (between Tumor_Finding and Tumor_Size), is_location_of (between anatomical entities and other entities) and is_finding_of (between test entities and their results). 
## Predicted Entities `is_date_of`, `is_finding_of`, `is_location_of`, `is_size_of`, `O` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_oncology_granular_biobert_wip_en_4.2.4_3.0_1673768709402.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_oncology_granular_biobert_wip_en_4.2.4_3.0_1673768709402.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use relation pairs to include only the combinations of entities that are relevant in your case.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") re_ner_chunk_filter = RENerChunksFilter()\ .setInputCols(["ner_chunk", "dependencies"])\ .setOutputCol("re_ner_chunk")\ .setMaxSyntacticDistance(10)\ .setRelationPairs(["Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery"]) re_model = RelationExtractionDLModel.pretrained("redl_oncology_granular_biobert_wip", "en", "clinical/models")\ .setInputCols(["re_ner_chunk", "sentence"])\ .setOutputCol("relation_extraction") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model]) data = spark.createDataFrame([["A mastectomy was performed two months ago, and a 3 cm mass was extracted."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val 
document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentence", "pos_tags", "token")) .setOutputCol("dependencies") val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunk", "dependencies")) .setOutputCol("re_ner_chunk") .setMaxSyntacticDistance(10) .setRelationPairs(Array("Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery")) val re_model = RelationExtractionDLModel.pretrained("redl_oncology_granular_biobert_wip", "en", "clinical/models") .setPredictionThreshold(0.5f) .setInputCols(Array("re_ner_chunk", "sentence")) .setOutputCol("relation_extraction") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("A mastectomy was performed two months ago.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` 
{:.nlu-block} ```python import nlu nlu.load("en.relation.oncology_granular_biobert_wip").predict("""A mastectomy was performed two months ago, and a 3 cm mass was extracted.""") ```
## Results ```bash +----------+--------------+-------------+-----------+----------+-------------+-------------+-----------+--------------+----------+ | relation| entity1|entity1_begin|entity1_end| chunk1| entity2|entity2_begin|entity2_end| chunk2|confidence| +----------+--------------+-------------+-----------+----------+-------------+-------------+-----------+--------------+----------+ |is_date_of|Cancer_Surgery| 2| 11|mastectomy|Relative_Date| 27| 40|two months ago| 0.9652523| |is_size_of| Tumor_Size| 49| 52| 3 cm|Tumor_Finding| 54| 57| mass|0.81723577| +----------+--------------+-------------+-----------+----------+-------------+-------------+-----------+--------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_oncology_granular_biobert_wip| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|401.7 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label recall precision f1 O 0.83 0.91 0.87 is_date_of 0.82 0.80 0.81 is_finding_of 0.92 0.85 0.88 is_location_of 0.95 0.85 0.90 is_size_of 0.91 0.80 0.85 macro-avg 0.89 0.84 0.86 ``` --- layout: model title: Detect Entities Related to Cancer Diagnosis author: John Snow Labs name: ner_oncology_diagnosis date: 2022-11-24 tags: [licensed, clinical, en, ner, oncology] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts entities related to cancer diagnosis, such as Metastasis, Histological_Type or Invasion. Definitions of Predicted Entities: - `Adenopathy`: Mentions of pathological findings of the lymph nodes. - `Cancer_Dx`: Mentions of cancer diagnoses (such as "breast cancer") or pathological types that are usually used as synonyms for "cancer" (e.g. "carcinoma"). 
When anatomical references are present, they are included in the Cancer_Dx extraction. - `Cancer_Score`: Clinical or imaging scores that are specific for cancer settings (e.g. "BI-RADS" or "Allred score"). - `Grade`: All pathological grading of tumors (e.g. "grade 1") or degrees of cellular differentiation (e.g. "well-differentiated"). - `Histological_Type`: Histological variants or cancer subtypes, such as "papillary", "clear cell" or "medullary". - `Invasion`: Mentions that refer to tumor invasion, such as "invasion" or "involvement". Metastases or lymph node involvement are excluded from this category. - `Metastasis`: Terms that indicate a metastatic disease. Anatomical references are not included in these extractions. - `Pathology_Result`: The findings of a biopsy from the pathology report that are not covered by another entity (e.g. "malignant ductal cells"). - `Performance_Status`: Mentions of performance status scores, such as ECOG and Karnofsky. The name of the score is extracted together with the result (e.g. "ECOG performance status of 4"). - `Staging`: Mentions of cancer stage such as "stage 2b" or "T2N1M0". It also includes words such as "in situ", "early-stage" or "advanced". - `Tumor_Finding`: All nonspecific terms that may be related to tumors, either malignant or benign (for example: "mass", "tumor", "lesion", or "neoplasm"). - `Tumor_Size`: Size of the tumor, including numerical value and unit of measurement (e.g. "3 cm"). 
## Predicted Entities `Adenopathy`, `Cancer_Dx`, `Cancer_Score`, `Grade`, `Histological_Type`, `Invasion`, `Metastasis`, `Pathology_Result`, `Performance_Status`, `Staging`, `Tumor_Finding`, `Tumor_Size` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_diagnosis_en_4.2.2_3.0_1669300474926.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_diagnosis_en_4.2.2_3.0_1669300474926.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_diagnosis", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["Two years ago, the patient presented with a tumor in her left breast and adenopathies. She was diagnosed with invasive ductal carcinoma. 
Last week she was also found to have a lung metastasis."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_diagnosis", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Two years ago, the patient presented with a tumor in her left breast and adenopathies. She was diagnosed with invasive ductal carcinoma. Last week she was also found to have a lung metastasis.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_diagnosis").predict("""Two years ago, the patient presented with a tumor in her left breast and adenopathies. She was diagnosed with invasive ductal carcinoma. Last week she was also found to have a lung metastasis.""") ```
## Results ```bash | chunk | ner_label | |:-------------|:------------------| | tumor | Tumor_Finding | | adenopathies | Adenopathy | | invasive | Histological_Type | | ductal | Histological_Type | | carcinoma | Cancer_Dx | | metastasis | Metastasis | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_diagnosis| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.3 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Histological_Type 354 63 99 453 0.85 0.78 0.81 Staging 234 27 24 258 0.90 0.91 0.90 Cancer_Score 36 15 26 62 0.71 0.58 0.64 Tumor_Finding 1121 83 136 1257 0.93 0.89 0.91 Invasion 154 27 27 181 0.85 0.85 0.85 Tumor_Size 1058 126 71 1129 0.89 0.94 0.91 Adenopathy 66 10 30 96 0.87 0.69 0.77 Performance_Status 116 15 19 135 0.89 0.86 0.87 Pathology_Result 852 686 290 1142 0.55 0.75 0.64 Metastasis 356 15 14 370 0.96 0.96 0.96 Cancer_Dx 1302 88 92 1394 0.94 0.93 0.94 Grade 201 23 35 236 0.90 0.85 0.87 macro_avg 5850 1178 863 6713 0.85 0.83 0.84 micro_avg 5850 1178 863 6713 0.85 0.87 0.86 ``` --- layout: model title: Part of Speech for Chinese author: John Snow Labs name: pos_ud_gsd_trad date: 2021-03-09 tags: [part_of_speech, open_source, chinese, pos_ud_gsd_trad, zh] task: Part of Speech Tagging language: zh edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. 
## Predicted Entities - AUX - ADJ - PUNCT - ADV - VERB - NUM - NOUN - PRON - PART - ADP - DET - CCONJ - PROPN - X - SYM {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_trad_zh_3.0.0_3.0_1615292436582.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_trad_zh_3.0.0_3.0_1615292436582.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_gsd_trad", "zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['从John Snow Labs你好! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_gsd_trad", "zh") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("从John Snow Labs你好! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["从John Snow Labs你好! "] token_df = nlu.load('zh.pos.ud_gsd_trad').predict(text) token_df ```
## Results ```bash token pos 0 从 PROPN 1 JohnSnowLabs X 2 你 PRON 3 好 ADJ 4 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_gsd_trad| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|zh| --- layout: model title: English DistilBertForQuestionAnswering model (from Ayoola) author: John Snow Labs name: distilbert_qa_Ayoola_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Ayoola`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Ayoola_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724089038.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Ayoola_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724089038.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Ayoola_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Ayoola_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Ayoola").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_Ayoola_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Ayoola/distilbert-base-uncased-finetuned-squad --- layout: model title: Swedish BERT Base Cased Embedding author: John Snow Labs name: bert_base_cased date: 2021-09-07 tags: [open_source, bert_embeddings, swedish, cased, sv] task: Embeddings language: sv edition: Spark NLP 3.2.2 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20GB of text (200M sentences, 3000M tokens) from various sources (books, news, government publications, Swedish Wikipedia and internet forums) aiming to provide a representative BERT model for Swedish text. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_cased_sv_3.2.2_3.0_1630999671555.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_cased_sv_3.2.2_3.0_1630999671555.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_base_cased", "sv") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_base_cased", "sv") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("sv.embed.bert.base_cased").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_cased| |Compatibility:|Spark NLP 3.2.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|sv| |Case sensitive:|true| ## Data Source The model is imported from: https://huggingface.co/KB/bert-base-swedish-cased --- layout: model title: Legal No violation Clause Binary Classifier author: John Snow Labs name: legclf_no_violation_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `no-violation` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
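The paragraph-splitting option mentioned above (splitting by multiline) and the 512-token check can be sketched in plain Python before anything reaches the Spark pipeline. This is an illustrative sketch, not the workshop code; the whitespace token count is only a rough proxy for the embedding model's subword count:

```python
import re

MAX_TOKENS = 512  # approximate limit of the sentence embeddings used here

def split_paragraphs(text):
    """Split a document into candidate clauses on blank lines (multiline)."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

def too_long(paragraph, limit=MAX_TOKENS):
    """Rough whitespace-token check; real subword counts will be higher."""
    return len(paragraph.split()) > limit

document = "Clause 1. No violation shall occur...\n\nClause 2. Governing law..."
clauses = [c for c in split_paragraphs(document) if not too_long(c)]
print(len(clauses))  # 2
```

Each surviving fragment can then be fed to the classifier as a row of the `clause_text` column shown below.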
## Predicted Entities `other`, `no-violation` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_no_violation_clause_en_1.0.0_3.2_1660122713710.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_no_violation_clause_en_1.0.0_3.2_1660122713710.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_no_violation_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +--------------+ |        result| +--------------+ |[no-violation]| |       [other]| |       [other]| |[no-violation]| +--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_no_violation_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support no-violation 1.00 1.00 1.00 35 other 1.00 1.00 1.00 93 accuracy - - 1.00 128 macro-avg 1.00 1.00 1.00 128 weighted-avg 1.00 1.00 1.00 128 ``` --- layout: model title: Fast Neural Machine Translation Model from Lunda to English author: John Snow Labs name: opus_mt_lun_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, lun, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list).
- source languages: `lun` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_lun_en_xx_2.7.0_2.4_1609167008180.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_lun_en_xx_2.7.0_2.4_1609167008180.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_lun_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_lun_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.lun.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_lun_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Portuguese BertForMaskedLM Cased model (from pucpr) author: John Snow Labs name: bert_embeddings_biobertpt_all date: 2022-12-02 tags: [pt, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: pt edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobertpt-all` is a Portuguese model originally trained by `pucpr`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_biobertpt_all_pt_4.2.4_3.0_1670020710320.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_biobertpt_all_pt_4.2.4_3.0_1670020710320.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_biobertpt_all","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_biobertpt_all","pt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_biobertpt_all| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|pt| |Size:|667.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/pucpr/biobertpt-all - https://www.aclweb.org/anthology/2020.clinicalnlp-1.7/ - https://www.aclweb.org/anthology/2020.clinicalnlp-1.7/ - https://github.com/HAILab-PUCPR/BioBERTpt --- layout: model title: Detect Entities Related to Cancer Therapies author: John Snow Labs name: ner_oncology_therapy date: 2022-11-24 tags: [clinical, en, licensed, oncology, treatment, ner] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts entities related to oncology therapies using granular labels, including mentions of treatments, posology information and line of therapy. Definitions of Predicted Entities: - `Cancer_Surgery`: Terms that indicate surgery as a form of cancer treatment. - `Chemotherapy`: Mentions of chemotherapy drugs, or unspecific words such as "chemotherapy". - `Cycle_Count`: The total number of cycles being administered of an oncological therapy (e.g. "5 cycles"). - `Cycle_Day`: References to the day of the cycle of oncological therapy (e.g. "day 5"). - `Cycle_Number`: The number of the cycle of an oncological therapy that is being applied (e.g. "third cycle"). - `Dosage`: The quantity prescribed by the physician for an active ingredient. - `Duration`: Words indicating the duration of a treatment (e.g. "for 2 weeks"). - `Frequency`: Words indicating the frequency of treatment administration (e.g. "daily" or "bid"). 
- `Hormonal_Therapy`: Mentions of hormonal drugs used to treat cancer, or unspecific words such as "hormonal therapy". - `Immunotherapy`: Mentions of immunotherapy drugs, or unspecific words such as "immunotherapy". - `Line_Of_Therapy`: Explicit references to the line of therapy of an oncological therapy (e.g. "first-line treatment"). - `Radiotherapy`: Terms that indicate the use of Radiotherapy. - `Radiation_Dose`: Dose used in radiotherapy. - `Response_To_Treatment`: Terms related to clinical progress of the patient related to cancer treatment, including "recurrence", "bad response" or "improvement". - `Route`: Words indicating the type of administration route (such as "PO" or "transdermal"). - `Targeted_Therapy`: Mentions of targeted therapy drugs, or unspecific words such as "targeted therapy". - `Unspecific_Therapy`: Terms that indicate a known cancer therapy but that is not specific to any other therapy entity (e.g. "chemoradiotherapy" or "adjuvant therapy"). ## Predicted Entities `Cancer_Surgery`, `Chemotherapy`, `Cycle_Count`, `Cycle_Day`, `Cycle_Number`, `Dosage`, `Duration`, `Frequency`, `Hormonal_Therapy`, `Immunotherapy`, `Line_Of_Therapy`, `Radiotherapy`, `Radiation_Dose`, `Response_To_Treatment`, `Route`, `Targeted_Therapy`, `Unspecific_Therapy` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_therapy_en_4.2.2_3.0_1669308088671.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_therapy_en_4.2.2_3.0_1669308088671.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_therapy", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast. The cancer recurred as a right lung metastasis 13 years later.
The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_therapy", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_therapy").predict("""The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast.
The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.""") ```
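Downstream of the pipeline, the collected (chunk, label) pairs can be regrouped — for example separating therapy mentions from posology details such as dosage and cycle information. The grouping below is an illustrative plain-Python sketch; the bucket assignment is an assumption for the example, not part of the model:

```python
# Illustrative coarse grouping of the model's granular labels.
POSOLOGY = {"Dosage", "Duration", "Frequency", "Route", "Cycle_Count",
            "Cycle_Day", "Cycle_Number", "Radiation_Dose"}

def group_entities(pairs):
    """Split (chunk, label) pairs into therapy mentions vs posology details."""
    therapies = [(c, l) for c, l in pairs if l not in POSOLOGY]
    posology = [(c, l) for c, l in pairs if l in POSOLOGY]
    return therapies, posology

pairs = [("adriamycin", "Chemotherapy"), ("60 mg/m2", "Dosage"),
         ("six courses", "Cycle_Count"), ("first line", "Line_Of_Therapy")]
therapies, posology = group_entities(pairs)
print(len(therapies), len(posology))  # 2 2
```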
## Results ```bash | chunk | ner_label | |:-------------------------------|:----------------------| | mastectomy | Cancer_Surgery | | axillary lymph node dissection | Cancer_Surgery | | radiotherapy | Radiotherapy | | recurred | Response_To_Treatment | | adriamycin | Chemotherapy | | 60 mg/m2 | Dosage | | cyclophosphamide | Chemotherapy | | 600 mg/m2 | Dosage | | six courses | Cycle_Count | | first line | Line_Of_Therapy | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_therapy| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Cycle_Number 78 41 19 97 0.66 0.80 0.72 Response_To_Treatment 451 205 145 596 0.69 0.76 0.72 Cycle_Count 210 75 20 230 0.74 0.91 0.82 Unspecific_Therapy 189 76 89 278 0.71 0.68 0.70 Chemotherapy 831 87 48 879 0.91 0.95 0.92 Targeted_Therapy 194 28 34 228 0.87 0.85 0.86 Radiotherapy 279 35 31 310 0.89 0.90 0.89 Cancer_Surgery 720 192 99 819 0.79 0.88 0.83 Line_Of_Therapy 95 6 11 106 0.94 0.90 0.92 Hormonal_Therapy 170 6 15 185 0.97 0.92 0.94 Immunotherapy 96 17 32 128 0.85 0.75 0.80 Cycle_Day 205 38 43 248 0.84 0.83 0.84 Frequency 363 33 64 427 0.92 0.85 0.88 Route 93 6 20 113 0.94 0.82 0.88 Duration 527 102 234 761 0.84 0.69 0.76 Dosage 959 63 101 1060 0.94 0.90 0.92 Radiation_Dose 106 12 20 126 0.90 0.84 0.87 macro_avg 5566 1022 1025 6591 0.85 0.84 0.84 micro_avg 5566 1022 1025 6591 0.85 0.84 0.84 ``` --- layout: model title: Translate Setswana to English Pipeline author: John Snow Labs name: translate_tn_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, tn, en, xx] supported: true annotator: PipelineModel article_header: type: cover 
use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `tn` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_tn_en_xx_2.7.0_2.4_1609686822973.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_tn_en_xx_2.7.0_2.4_1609686822973.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_tn_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_tn_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.tn.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_tn_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English image_classifier_vit_base_patch16_224_in21k_aidSat ViTForImageClassification from YKXBCi author: John Snow Labs name: image_classifier_vit_base_patch16_224_in21k_aidSat date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_patch16_224_in21k_aidSat` is an English model originally trained by YKXBCi. ## Predicted Entities `Square`, `Farmland`, `BaseballField`, `Park`, `Commercial`, `Pond`, `Airport`, `SparseResidential`, `Church`, `School`, `Viaduct`, `Stadium`, `Desert`, `BareLand`, `MediumResidential`, `Center`, `Industrial`, `Playground`, `Port`, `DenseResidential`, `StorageTanks`, `Beach`, `Bridge`, `Mountain`, `River`, `Meadow`, `Resort`, `Parking`, `Forest`, `RailwayStation` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_in21k_aidSat_en_4.1.0_3.0_1660167644527.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_in21k_aidSat_en_4.1.0_3.0_1660167644527.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_patch16_224_in21k_aidSat", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_patch16_224_in21k_aidSat", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_patch16_224_in21k_aidSat| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.0 MB| --- layout: model title: Arabic Part of Speech Tagger (DA-Dialectal Arabic dataset, Modern Standard Arabic-MSA POS) author: John Snow Labs name: bert_pos_bert_base_arabic_camelbert_da_pos_msa date: 2022-04-26 tags: [bert, pos, part_of_speech, ar, open_source] task: Part of Speech Tagging language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-da-pos-msa` is an Arabic model originally trained by `CAMeL-Lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_da_pos_msa_ar_3.4.2_3.0_1650993280099.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_da_pos_msa_ar_3.4.2_3.0_1650993280099.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_da_pos_msa","ar") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_da_pos_msa","ar") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("أنا أحب الشرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.pos.arabic_camelbert_da_pos_msa").predict("""أنا أحب الشرارة NLP""") ```
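After collecting the token texts and predicted tags (for example from a LightPipeline's output lists for the `token` and `pos` columns), pairing them is a positional zip. A minimal plain-Python sketch; the tag strings shown are hypothetical placeholders, not the model's actual tagset output for this sentence:

```python
def pair_tokens_with_tags(tokens, tags):
    """Align token strings with their POS tags positionally."""
    if len(tokens) != len(tags):
        raise ValueError("token/tag count mismatch")
    return list(zip(tokens, tags))

tokens = ["أنا", "أحب", "الشرارة", "NLP"]
tags = ["pron", "verb", "noun", "noun"]  # hypothetical tags for illustration
for token, tag in pair_tokens_with_tags(tokens, tags):
    print(f"{token}\t{tag}")
```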
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_arabic_camelbert_da_pos_msa| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ar| |Size:|407.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-da-pos-msa - https://dl.acm.org/doi/pdf/10.5555/1621804.1621808 - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://github.com/CAMeL-Lab/camel_tools --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from hamishm) author: John Snow Labs name: distilbert_qa_hamishm_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `hamishm`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_hamishm_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771049910.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_hamishm_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771049910.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hamishm_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hamishm_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_hamishm_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/hamishm/distilbert-base-uncased-finetuned-squad --- layout: model title: Detect diseases in medical text (biobert) author: John Snow Labs name: ner_diseases_biobert date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract entities pertaining to different types of general diseases using pretrained NER model. ## Predicted Entities `Disease` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DIAG_PROC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_diseases_biobert_en_3.0.0_3.0_1617260638998.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_diseases_biobert_en_3.0.0_3.0_1617260638998.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_diseases_biobert", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_diseases_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("EXAMPLE_TEXT").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu
nlu.load("en.med_ner.diseases.biobert").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_diseases_biobert| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| --- layout: model title: Czech asr_wav2vec2_xls_r_300m_250 TFWav2Vec2ForCTC from comodoro author: John Snow Labs name: pipeline_asr_wav2vec2_xls_r_300m_250 date: 2022-09-25 tags: [wav2vec2, cs, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: cs edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_250` is a Czech model originally trained by comodoro. NOTE: This pipeline only works on a CPU; if you need to run this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xls_r_300m_250_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_250_cs_4.2.0_3.0_1664119400609.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_250_cs_4.2.0_3.0_1664119400609.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_300m_250', lang = 'cs') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_300m_250", lang = "cs") val annotations = pipeline.transform(audioDF) ```
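Under the hood, the Wav2Vec2ForCTC stage in this pipeline emits a label per audio frame and then applies greedy CTC decoding: consecutive duplicates are collapsed and blank tokens are dropped. The following is a minimal pure-Python sketch of that decoding step only, using a hypothetical frame-label sequence and blank symbol rather than this model's actual vocabulary:

```python
# Illustrative sketch of greedy CTC decoding as performed inside a
# Wav2Vec2ForCTC stage: collapse repeated labels, then drop the blank.
# The labels and blank symbol below are hypothetical, not the model's own.
from itertools import groupby

BLANK = "<pad>"  # assumed CTC blank token; vocabularies differ per model

def ctc_greedy_decode(frame_labels):
    """Collapse consecutive duplicate labels, then remove blank tokens."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in collapsed if label != BLANK)

frames = ["<pad>", "a", "a", "<pad>", "h", "o", "o", "j", "<pad>"]
print(ctc_greedy_decode(frames))  # -> "ahoj"
```

This is only a conceptual illustration; the pipeline performs this decoding internally, so no extra step is needed when using `PretrainedPipeline`.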
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xls_r_300m_250| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|cs| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from linh101201) author: John Snow Labs name: roberta_qa_linh101201_base_finetuned_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad` is an English model originally trained by `linh101201`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_linh101201_base_finetuned_squad_en_4.3.0_3.0_1674217360099.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_linh101201_base_finetuned_squad_en_4.3.0_3.0_1674217360099.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_linh101201_base_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_linh101201_base_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
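For context, a question-answering head of this kind scores every context token as a potential answer start and end, and the predicted answer is the highest-scoring valid span. A minimal sketch of that span-selection logic, using made-up scores rather than real model outputs:

```python
# Sketch of extractive-QA span selection. The token list and start/end
# scores below are hypothetical, not actual model logits.
def best_span(start_scores, end_scores, max_len=15):
    """Return (i, j) maximizing start_scores[i] + end_scores[j], i <= j."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best_score, best = score, (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end   = [0.1, 0.1, 0.1, 4.8, 0.2, 0.1, 0.1, 0.1, 0.4, 0.1]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # -> "Clara"
```

The annotator handles this selection internally and returns the answer text in the `answer` column.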
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_linh101201_base_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|424.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/linh101201/roberta-base-finetuned-squad --- layout: model title: English XLMRobertaForTokenClassification Large Uncased model (from asahi417) author: John Snow Labs name: xlmroberta_ner_tner_large_uncased_ontonotes5 date: 2022-08-13 tags: [en, open_source, xlm_roberta, ner] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tner-xlm-roberta-large-uncased-ontonotes5` is an English model originally trained by `asahi417`. ## Predicted Entities `language`, `time`, `percent`, `quantity`, `product`, `ordinal number`, `cardinal number`, `event`, `geopolitical area`, `facility`, `organization`, `work of art`, `group`, `money`, `law`, `person`, `location`, `date` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_large_uncased_ontonotes5_en_4.1.0_3.0_1660425481816.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_large_uncased_ontonotes5_en_4.1.0_3.0_1660425481816.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_large_uncased_ontonotes5","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_large_uncased_ontonotes5","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
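The `NerConverter` stage groups the token-level IOB tags produced by the classifier into entity chunks. A rough pure-Python sketch of that grouping logic, with hypothetical tokens and tags (not actual model output):

```python
# Sketch of what a NerConverter-style step does: group IOB tags into
# (chunk_text, entity_label) pairs. Inputs below are hypothetical.
def iob_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity begins
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)          # continue the open entity
        else:                              # "O" tag or inconsistent I- tag
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["John", "Snow", "Labs", "is", "in", "Delaware"]
tags = ["B-ORG", "I-ORG", "I-ORG", "O", "O", "B-GPE"]
print(iob_to_chunks(tokens, tags))
# -> [('John Snow Labs', 'ORG'), ('Delaware', 'GPE')]
```

In the pipeline above, this grouping happens inside `NerConverter` and the chunks arrive in the `ner_chunk` column.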
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_tner_large_uncased_ontonotes5| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|1.8 GB| |Case sensitive:|false| |Max sentence length:|256| ## References - https://huggingface.co/asahi417/tner-xlm-roberta-large-uncased-ontonotes5 - https://github.com/asahi417/tner --- layout: model title: English asr_wav2vec2_xls_r_timit_tokenizer_base TFWav2Vec2ForCTC from hrdipto author: John Snow Labs name: pipeline_asr_wav2vec2_xls_r_timit_tokenizer_base date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_timit_tokenizer_base` is an English model originally trained by hrdipto. NOTE: This pipeline only works on a CPU; if you need to run this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xls_r_timit_tokenizer_base_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_timit_tokenizer_base_en_4.2.0_3.0_1664040322145.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_timit_tokenizer_base_en_4.2.0_3.0_1664040322145.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_timit_tokenizer_base', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_timit_tokenizer_base", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xls_r_timit_tokenizer_base| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|349.4 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Japanese Part of Speech Tagger (from KoichiYasuoka) author: John Snow Labs name: bert_pos_bert_base_japanese_upos date: 2022-05-09 tags: [bert, pos, part_of_speech, ja, open_source] task: Part of Speech Tagging language: ja edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-japanese-upos` is a Japanese model originally trained by `KoichiYasuoka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_japanese_upos_ja_3.4.2_3.0_1652091854179.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_japanese_upos_ja_3.4.2_3.0_1652091854179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_japanese_upos","ja") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Spark NLPが大好きです"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_japanese_upos","ja") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Spark NLPが大好きです").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_japanese_upos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ja| |Size:|338.8 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/KoichiYasuoka/bert-base-japanese-upos - https://universaldependencies.org/u/pos/ - https://github.com/KoichiYasuoka/esupar --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from vinitharaj) author: John Snow Labs name: distilbert_qa_vinitharaj_base_uncased_finetuned_squad2 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad2` is an English model originally trained by `vinitharaj`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_vinitharaj_base_uncased_finetuned_squad2_en_4.3.0_3.0_1672773632350.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_vinitharaj_base_uncased_finetuned_squad2_en_4.3.0_3.0_1672773632350.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vinitharaj_base_uncased_finetuned_squad2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vinitharaj_base_uncased_finetuned_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_vinitharaj_base_uncased_finetuned_squad2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/vinitharaj/distilbert-base-uncased-finetuned-squad2 --- layout: model title: English BertForQuestionAnswering model (from AnonymousSub) author: John Snow Labs name: bert_qa_bert_FT_new_newsqa date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_FT_new_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_FT_new_newsqa_en_4.0.0_3.0_1654185046001.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_FT_new_newsqa_en_4.0.0_3.0_1654185046001.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_FT_new_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_FT_new_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.bert.ft_new.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_FT_new_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/bert_FT_new_newsqa --- layout: model title: Fast Neural Machine Translation Model from Luo (Kenya and Tanzania) to English author: John Snow Labs name: opus_mt_luo_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, luo, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `luo` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_luo_en_xx_2.7.0_2.4_1609167362702.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_luo_en_xx_2.7.0_2.4_1609167362702.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_luo_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_luo_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.luo.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_luo_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Vietnamese BertForQuestionAnswering model (from nvkha) author: John Snow Labs name: bert_qa_bert_qa_vi_nvkha date: 2022-06-03 tags: [open_source, question_answering, bert] task: Question Answering language: vi edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-qa-vi` is a Vietnamese model originally trained by `nvkha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_qa_vi_nvkha_vi_4.0.0_3.0_1654249815695.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_qa_vi_nvkha_vi_4.0.0_3.0_1654249815695.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_qa_vi_nvkha","vi") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_qa_vi_nvkha","vi") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("vi.answer_question.bert.by_nvkha").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_qa_vi_nvkha| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|vi| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/nvkha/bert-qa-vi --- layout: model title: Translate Tongan to English Pipeline author: John Snow Labs name: translate_to_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, to, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. - source languages: `to` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_to_en_xx_2.7.0_2.4_1609688214187.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_to_en_xx_2.7.0_2.4_1609688214187.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_to_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_to_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.to.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_to_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Summarize Clinical Notes in Layman Terms author: John Snow Labs name: summarizer_clinical_laymen date: 2023-05-31 tags: [licensed, en, clinical, summarization, tensorflow] task: Summarization language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true engine: tensorflow annotator: MedicalSummarizer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a modified Flan-T5-based (LLM) summarization model, fine-tuned on a custom dataset by John Snow Labs to avoid clinical jargon in its summaries. It can generate summaries of up to 512 tokens given an input text of at most 1024 tokens. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_laymen_en_4.4.2_3.0_1685557633038.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_laymen_en_4.4.2_3.0_1685557633038.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") summarizer = MedicalSummarizer.pretrained("summarizer_clinical_laymen", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("summary")\ .setMaxNewTokens(512) pipeline = Pipeline(stages=[ document_assembler, summarizer ]) text ="""Olivia Smith was seen in my office for evaluation for elective surgical weight loss on October 6, 2008. Olivia Smith is a 34-year-old female with a BMI of 43. She is 5'6" tall and weighs 267 pounds. She is motivated to attempt surgical weight loss because she has been overweight for over 20 years and wants to have more energy and improve her self-image. She is not only affected physically, but also socially by her weight. When she loses weight she always regains it and she always gains back more weight than she has lost. At one time, she lost 100 pounds and gained the weight back within a year. She has tried numerous commercial weight loss programs including Weight Watcher's for four months in 1992 with 15-pound weight loss, RS for two months in 1990 with six-pound weight loss, Slim Fast for six weeks in 2004 with eight-pound weight loss, an exercise program for two months in 2007 with a five-pound weight loss, Atkin's Diet for three months in 2008 with a ten-pound weight loss, and Dexatrim for one month in 2005 with a five-pound weight loss. She has also tried numerous fat reduction or fad diets. She was on Redux for nine months with a 100-pound weight loss.\n\nPAST MEDICAL HISTORY: She has a history of hypertension and shortness of breath.\n\nPAST SURGICAL HISTORY: Pertinent for cholecystectomy.\n\nPSYCHOLOGICAL HISTORY: Negative.\n\nSOCIAL HISTORY: She is single. She drinks alcohol once a week. 
She does not smoke.\n\nFAMILY HISTORY: Pertinent for obesity and hypertension.\n\nMEDICATIONS: Include Topamax 100 mg twice daily, Zoloft 100 mg twice daily, Abilify 5 mg daily, Motrin 800 mg daily, and a multivitamin.\n\nALLERGIES: She has no known drug allergies.\n\nREVIEW OF SYSTEMS: Negative.\n\nPHYSICAL EXAM: This is a pleasant female in no acute distress. Alert and oriented x 3. HEENT: Normocephalic, atraumatic. Extraocular muscles intact, nonicteric sclerae. Chest is clear to auscultation bilaterally. Cardiovascular is normal sinus rhythm. Abdomen is obese, soft, nontender and nondistended. Extremities show no edema, clubbing or cyanosis.\n\nASSESSMENT/PLAN: This is a 34-year-old female with a BMI of 43 who is interested in surgical weight via the gastric bypass as opposed to Lap-Band. Olivia Smith will be asking for a letter of medical necessity from Dr. Andrew Johnson. She will also see my nutritionist and social worker and have an upper endoscopy. Once this is completed, we will submit her to her insurance company for approval. """ data = spark.createDataFrame([[text]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val summarizer = MedicalSummarizer.pretrained("summarizer_clinical_laymen", "en", "clinical/models") .setInputCols("document") .setOutputCol("summary") .setMaxNewTokens(512) val pipeline = new Pipeline().setStages(Array(document_assembler, summarizer)) val data = Seq("""Olivia Smith was seen in my office for evaluation for elective surgical weight loss on October 6, 2008. Olivia Smith is a 34-year-old female with a BMI of 43. She is 5'6" tall and weighs 267 pounds. She is motivated to attempt surgical weight loss because she has been overweight for over 20 years and wants to have more energy and improve her self-image. She is not only affected physically, but also socially by her weight. 
When she loses weight she always regains it and she always gains back more weight than she has lost. At one time, she lost 100 pounds and gained the weight back within a year. She has tried numerous commercial weight loss programs including Weight Watcher's for four months in 1992 with 15-pound weight loss, RS for two months in 1990 with six-pound weight loss, Slim Fast for six weeks in 2004 with eight-pound weight loss, an exercise program for two months in 2007 with a five-pound weight loss, Atkin's Diet for three months in 2008 with a ten-pound weight loss, and Dexatrim for one month in 2005 with a five-pound weight loss. She has also tried numerous fat reduction or fad diets. She was on Redux for nine months with a 100-pound weight loss.\n\nPAST MEDICAL HISTORY: She has a history of hypertension and shortness of breath.\n\nPAST SURGICAL HISTORY: Pertinent for cholecystectomy.\n\nPSYCHOLOGICAL HISTORY: Negative.\n\nSOCIAL HISTORY: She is single. She drinks alcohol once a week. She does not smoke.\n\nFAMILY HISTORY: Pertinent for obesity and hypertension.\n\nMEDICATIONS: Include Topamax 100 mg twice daily, Zoloft 100 mg twice daily, Abilify 5 mg daily, Motrin 800 mg daily, and a multivitamin.\n\nALLERGIES: She has no known drug allergies.\n\nREVIEW OF SYSTEMS: Negative.\n\nPHYSICAL EXAM: This is a pleasant female in no acute distress. Alert and oriented x 3. HEENT: Normocephalic, atraumatic. Extraocular muscles intact, nonicteric sclerae. Chest is clear to auscultation bilaterally. Cardiovascular is normal sinus rhythm. Abdomen is obese, soft, nontender and nondistended. Extremities show no edema, clubbing or cyanosis.\n\nASSESSMENT/PLAN: This is a 34-year-old female with a BMI of 43 who is interested in surgical weight via the gastric bypass as opposed to Lap-Band. Olivia Smith will be asking for a letter of medical necessity from Dr. Andrew Johnson. She will also see my nutritionist and social worker and have an upper endoscopy. 
Once this is completed, we will submit her to her insurance company for approval.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash ['This is a clinical note about a 34-year-old woman who is interested in having weight loss surgery. She has been overweight for over 20 years and wants to have more energy and improve her self-image. She has tried many diets and weight loss programs, but has not been successful in keeping the weight off. She has a history of hypertension and shortness of breath, but is not allergic to any medications. She will have an upper endoscopy and will be contacted by a nutritionist and social worker. The plan is to have her weight loss surgery through the gastric bypass, rather than Lap-Band.'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_clinical_laymen| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|920.5 MB| --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from danhsf) author: John Snow Labs name: xlmroberta_ner_danhsf_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `danhsf`. 
## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_danhsf_base_finetuned_panx_de_4.1.0_3.0_1660432036772.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_danhsf_base_finetuned_panx_de_4.1.0_3.0_1660432036772.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_danhsf_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_danhsf_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_danhsf_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/danhsf/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Legal Investment Advisory Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_investment_advisory_agreement date: 2022-12-06 tags: [en, legal, classification, agreement, investment, advisory, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_investment_advisory_agreement` model is a Legal Longformer Document Classifier that predicts whether a document belongs to the class `investment-advisory-agreement` or not (binary classification). Longformer models are limited to 4096 tokens, so only the first 4096 tokens of each document are taken into account. In our experience, 4096 tokens are enough to classify the vast majority of documents in legal corpora, provided they are clean and contain only the legal document itself, without extra leading material. If that is not the case for your data, let us know and we can take another approach: splitting each document into 4096-token chunks, averaging the chunk embeddings, and training on the averaged version, so that the whole document is taken into account. In theory, however, this should not be required.
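The chunk-and-average fallback mentioned above can be sketched as follows. This is a minimal, hypothetical illustration, not part of the released model: `embed_window` stands in for whatever encoder produces one vector per 4096-token window, and the toy stub is only there to make the sketch runnable.

```python
import numpy as np

MAX_LEN = 4096  # Longformer window limit

def chunk(tokens, size=MAX_LEN):
    """Split a token list into consecutive windows of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def document_embedding(tokens, embed_window, size=MAX_LEN):
    """Average the embeddings of all windows so no part of the document is dropped."""
    windows = chunk(tokens, size)
    vectors = np.stack([embed_window(w) for w in windows])
    return vectors.mean(axis=0)

# Toy stub: "embeds" a window as [its length, 1.0]; a real encoder goes here.
toy_embed = lambda w: np.array([float(len(w)), 1.0])

doc = ["tok"] * 10000            # a 10000-token "document"
vec = document_embedding(doc, toy_embed)
print(len(chunk(doc)))           # 3 windows: 4096 + 4096 + 1808 tokens
print(vec)
```

Training then proceeds on the averaged vector instead of the truncated first window.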
## Predicted Entities `investment-advisory-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_investment_advisory_agreement_en_1.0.0_3.0_1670357942473.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_investment_advisory_agreement_en_1.0.0_3.0_1670357942473.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_investment_advisory_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[investment-advisory-agreement]| |[other]| |[other]| |[investment-advisory-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_investment_advisory_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.3 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support investment-advisory-agreement 0.97 0.98 0.98 60 other 0.99 0.98 0.99 111 accuracy - - 0.98 171 macro-avg 0.98 0.98 0.98 171 weighted-avg 0.98 0.98 0.98 171 ``` --- layout: model title: Visual NER - CORD (Receipts) author: John Snow Labs name: visualner_receipts date: 2022-09-21 tags: [xx, licensed] task: OCR Object Detection language: xx edition: Visual NLP 4.0.0 spark_version: 3.2 supported: true annotator: VisualDocumentNERv21 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Visual NER model, trained on top of LayoutLMV2 to detect regions in tickets.
This model can be used, for example, downstream of the Binary Image Classifier of Tickets, available at https://nlp.johnsnowlabs.com/2022/09/07/finvisualclf_vit_tickets_en.html ## Predicted Entities `COMPANY`, `DATE`, `AMOUNT`, `NAME`, `NUM`, `UNITPRICE`, `CNT`, `DISCOUNTPRICE`, `PRICE`, `ITEMSUBTOTAL`, `VATyn`, `SUBTOTAL`, `TOTALDISCOUNT`, `SERVICEPRICE`, `OTHERSVCPRICE`, `TAX`, `TOTAL`, `CASH`, `CHANGE`, `CREDITCARD`, `EMONEY` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/visualner_receipts_xx_4.0.0_3.2_1663753935456.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/visualner_receipts_xx_4.0.0_3.2_1663753935456.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python binary_to_image = BinaryToImage()\ .setOutputCol("image") \ .setImageType(ImageType.TYPE_3BYTE_BGR) img_to_hocr = ImageToHocr()\ .setInputCol("image")\ .setOutputCol("hocr")\ .setIgnoreResolution(False)\ .setOcrParams(["preserve_interword_spaces=0"]) tokenizer = HocrTokenizer()\ .setInputCol("hocr")\ .setOutputCol("token") doc_ner = VisualDocumentNerV21()\ .pretrained("visualner_receipts", "xx", "clinical/ocr")\ .setInputCols(["token", "image"])\ .setOutputCol("entities") draw = ImageDrawAnnotations() \ .setInputCol("image") \ .setInputChunksCol("entities") \ .setOutputCol("image_with_annotations") \ .setFontSize(10) \ .setLineWidth(4)\ .setRectColor(Color.red) # OCR pipeline pipeline = PipelineModel(stages=[ binary_to_image, img_to_hocr, tokenizer, doc_ner, draw ]) import pyspark.sql.functions as f bin_df = spark.read.format("binaryFile").load('data/t01.jpg') bin_df.show() results = pipeline.transform(bin_df).cache() res = results.collect() ## since pyspark 2.3 doesn't have element_at, 'getItem' is invoked path_array = f.split(results['path'], '/') # from pyspark 2.4 onwards: # results.withColumn("filename", f.element_at(f.split("path", "/"), -1)) \ results.withColumn('filename', path_array.getItem(f.size(path_array)- 1)) \ .withColumn("exploded_entities", f.explode("entities")) \ .select("filename", "exploded_entities") \ .show(truncate=False) ```
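As a hedged post-processing sketch (not part of the pipeline above): each detected entity carries a metadata map with `x`, `y`, `width`, and `height` as strings, as shown in the Results section, so a small helper can turn that map into pixel corner coordinates for drawing or cropping. The `example` dict below mirrors one row of the Results output.

```python
def to_bbox(metadata):
    """Convert an entity metadata dict into (x1, y1, x2, y2) pixel corners."""
    x, y = int(metadata["x"]), int(metadata["y"])
    w, h = int(metadata["width"]), int(metadata["height"])
    return (x, y, x + w, y + h)

# Metadata copied from the first entity in the Results section below.
example = {"confidence": "95", "width": "66", "x": "306", "y": "229",
           "word": "#010029", "height": "17"}
print(to_bbox(example))  # (306, 229, 372, 246)
```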
## Results ```bash +----------+-------------------------------------------------------------------------------------------------------------------------------------------+ |filename |exploded_entities | +----------+-------------------------------------------------------------------------------------------------------------------------------------------+ |test0.jpeg|{named_entity, 24, 24, UNITPRICE-B, {confidence -> 95, width -> 66, x -> 306, y -> 229, word -> #010029, token -> #, height -> 17}, []} | |test0.jpeg|{named_entity, 32, 35, NAME-B, {confidence -> 91, width -> 38, x -> 200, y -> 250, word -> Sale, token -> sale, height -> 17}, []} | |test0.jpeg|{named_entity, 37, 37, OTHERS, {confidence -> 91, width -> 8, x -> 249, y -> 253, word -> #, token -> #, height -> 15}, []} | |test0.jpeg|{named_entity, 39, 47, NUM-B, {confidence -> 96, width -> 83, x -> 270, y -> 252, word -> 143710882, token -> 143710882, height -> 17}, []}| |test0.jpeg|{named_entity, 49, 52, NAME-B, {confidence -> 96, width -> 37, x -> 191, y -> 274, word -> Team, token -> team, height -> 17}, []} | |test0.jpeg|{named_entity, 66, 68, CNT-B, {confidence -> 88, width -> 28, x -> 82, y -> 296, word -> Jan, token -> jan, height -> 16}, []} | |test0.jpeg|{named_entity, 114, 114, OTHERS, {confidence -> 63, width -> 27, x -> 229, y -> 323, word -> ***, token -> *, height -> 13}, []} | +----------+-------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|visualner_receipts| |Type:|ocr| |Compatibility:|Visual NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Language:|xx| |Size:|744.4 MB| ## References CORD --- layout: model title: English DistilBertForQuestionAnswering Cased model (from autoevaluate) author: John Snow Labs name: distilbert_qa_extractive date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] 
task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `extractive-question-answering` is an English model originally trained by `autoevaluate`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_extractive_en_4.3.0_3.0_1672775125688.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_extractive_en_4.3.0_3.0_1672775125688.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_extractive","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_extractive","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_extractive| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/autoevaluate/extractive-question-answering --- layout: model title: BERTje A Dutch BERT model author: John Snow Labs name: bert_base_dutch_cased date: 2021-05-20 tags: [open_source, embeddings, bert, dutch, nl] task: Embeddings language: nl edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description BERTje is a Dutch pre-trained BERT model developed at the University of Groningen. For details, check out our paper on [arXiv](https://arxiv.org/abs/1912.09582), the code on [Github](https://github.com/wietsedv/bertje) and related work on [Semantic Scholar](https://www.semanticscholar.org/paper/BERTje%3A-A-Dutch-BERT-Model-Vries-Cranenburgh/a4d5e425cac0bf84c86c0c9f720b6339d6288ffa). The paper and Github page mention fine-tuned models that are available [here](https://huggingface.co/wietsedv). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_dutch_cased_nl_3.1.0_3.0_1621500934814.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_dutch_cased_nl_3.1.0_3.0_1621500934814.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python # document_assembler, sentence_detector and tokenizer are the standard upstream Spark NLP stages producing the "sentence" and "token" columns embeddings = BertEmbeddings.pretrained("bert_base_dutch_cased", "nl") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala // document_assembler, sentence_detector and tokenizer are the standard upstream Spark NLP stages producing the "sentence" and "token" columns val embeddings = BertEmbeddings.pretrained("bert_base_dutch_cased", "nl") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("nl.embed.bert").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_dutch_cased| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[embeddings]| |Language:|nl| |Case sensitive:|true| ## Data Source https://huggingface.co/wietsedv/bert-base-dutch-cased ## Benchmarking ```bash The arXiv paper lists benchmarks. Here are a couple of comparisons between BERTje, multilingual BERT, BERT-NL, and RobBERT that were done after writing the paper. Unlike some other comparisons, the fine-tuning procedures for these benchmarks are identical for each pre-trained model. You may be able to achieve higher scores for individual models by optimizing fine-tuning procedures. More experimental results will be added to this page when they are finished. Technical details about how these models were fine-tuned will be published later, along with downloadable fine-tuned checkpoints. All of the tested models are *base* sized (12 layers) with cased tokenization. Headers in the tables below link to original data sources. Scores link to the model pages that correspond to that specific fine-tuned model. These tables will be updated as more fine-tuned models are made available. 
### Named Entity Recognition | Model | [CoNLL-2002](https://www.clips.uantwerpen.be/conll2002/ner/) | [SoNaR-1](https://ivdnt.org/downloads/taalmaterialen/tstc-sonar-corpus) | spaCy UD LassySmall | | ---------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- | | **BERTje** | [**90.24**](https://huggingface.co/wietsedv/bert-base-dutch-cased-finetuned-conll2002-ner) | [**84.93**](https://huggingface.co/wietsedv/bert-base-dutch-cased-finetuned-sonar-ner) | [86.10](https://huggingface.co/wietsedv/bert-base-dutch-cased-finetuned-udlassy-ner) | | [mBERT](https://github.com/google-research/bert/blob/master/multilingual.md) | [88.61](https://huggingface.co/wietsedv/bert-base-multilingual-cased-finetuned-conll2002-ner) | [84.19](https://huggingface.co/wietsedv/bert-base-multilingual-cased-finetuned-sonar-ner) | [**86.77**](https://huggingface.co/wietsedv/bert-base-multilingual-cased-finetuned-udlassy-ner) | | [BERT-NL](http://textdata.nl) | 85.05 | 80.45 | 81.62 | | [RobBERT](https://github.com/iPieter/RobBERT) | 84.72 | 81.98 | 79.84 | ### Part-of-speech tagging | Model | [UDv2.5 LassySmall](https://universaldependencies.org/treebanks/nl_lassysmall/index.html) | | ---------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- | | **BERTje** | **96.48** | | [mBERT](https://github.com/google-research/bert/blob/master/multilingual.md) | 96.20 | | [BERT-NL](http://textdata.nl) | 96.10 | | [RobBERT](https://github.com/iPieter/RobBERT) | 95.91 | ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from ParulChaudhari) author: John Snow Labs 
name: distilbert_qa_parulchaudhari_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `ParulChaudhari`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_parulchaudhari_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768887661.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_parulchaudhari_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768887661.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_parulchaudhari_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_parulchaudhari_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_parulchaudhari_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/ParulChaudhari/distilbert-base-uncased-finetuned-squad --- layout: model title: Sentiment Analysis of German news author: John Snow Labs name: bert_sequence_classifier_news_sentiment date: 2022-01-18 tags: [german, sentiment, bert_sequence, de, open_source] task: Sentiment Analysis language: de edition: Spark NLP 3.3.4 spark_version: 3.0 supported: true annotator: BertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model was imported from `Hugging Face` ([link](https://huggingface.co/mdraw/german-news-sentiment-bert)) and it's been finetuned on news texts about migration for German language, leveraging `Bert` embeddings and `BertForSequenceClassification` for text classification purposes. ## Predicted Entities `positive`, `negative`, `neutral` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_news_sentiment_de_3.3.4_3.0_1642504435983.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_news_sentiment_de_3.3.4_3.0_1642504435983.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = BertForSequenceClassification \ .pretrained('bert_sequence_classifier_news_sentiment', 'de') \ .setInputCols(['token', 'document']) \ .setOutputCol('class') pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier]) example = spark.createDataFrame([['Die Zahl der Flüchtlinge in Deutschland steigt von Tag zu Tag.']]).toDF("text") result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_news_sentiment", "de") .setInputCols("document", "token") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) val example = Seq("Die Zahl der Flüchtlinge in Deutschland steigt von Tag zu Tag.").toDS.toDF("text") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("de.classify.news_sentiment.bert").predict("""Die Zahl der Flüchtlinge in Deutschland steigt von Tag zu Tag.""") ```
## Results ```bash ['neutral'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_news_sentiment| |Compatibility:|Spark NLP 3.3.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|de| |Size:|408.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## Data Source [https://wortschatz.uni-leipzig.de/en/download/German](https://wortschatz.uni-leipzig.de/en/download/German) --- layout: model title: English asr_wav2vec2_large_tedlium TFWav2Vec2ForCTC from sanchit-gandhi author: John Snow Labs name: asr_wav2vec2_large_tedlium date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_tedlium` is an English model originally trained by sanchit-gandhi. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_tedlium_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_tedlium_en_4.2.0_3.0_1664094417234.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_tedlium_en_4.2.0_3.0_1664094417234.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_tedlium", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_tedlium", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
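The snippets above assume an `audioDf` whose `audio_content` column already holds raw audio as float samples. As a hedged, Spark-free sketch of producing such samples, the helper below reads a 16 kHz mono 16-bit PCM WAV with only the Python standard library and normalizes it to floats in [-1, 1]; the `tone.wav` file and its contents are synthetic, created just for the demonstration.

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit PCM WAV file and return samples normalized to [-1, 1]."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expects 16-bit PCM"
        raw = wf.readframes(wf.getnframes())
    ints = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [s / 32768.0 for s in ints]

# Write a tiny synthetic WAV so the helper can be demonstrated end to end.
with wave.open("tone.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(struct.pack("<4h", 0, 16384, 0, -16384))

samples = wav_to_floats("tone.wav")
print(samples)  # [0.0, 0.5, 0.0, -0.5]
```

In practice the resulting float list would be placed in a DataFrame row under `audio_content` before running the pipeline.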
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_tedlium| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from ashhyun) author: John Snow Labs name: distilbert_qa_ashhyun_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `ashhyun`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_ashhyun_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770017698.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_ashhyun_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770017698.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ashhyun_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ashhyun_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_ashhyun_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/ashhyun/distilbert-base-uncased-finetuned-squad --- layout: model title: English image_classifier_vit_beer_vs_wine ViTForImageClassification from filipafcastro author: John Snow Labs name: image_classifier_vit_beer_vs_wine date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_beer_vs_wine` is an English model originally trained by filipafcastro. ## Predicted Entities `beer`, `wine` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_beer_vs_wine_en_4.1.0_3.0_1660166570749.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_beer_vs_wine_en_4.1.0_3.0_1660166570749.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_beer_vs_wine", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_beer_vs_wine", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_beer_vs_wine| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Fast Neural Machine Translation Model from Ndonga to English author: John Snow Labs name: opus_mt_ng_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ng, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `ng` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ng_en_xx_2.7.0_2.4_1609169685419.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ng_en_xx_2.7.0_2.4_1609169685419.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ng_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ng_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ng.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ng_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English T5ForConditionalGeneration Cased model (from OnsElleuch) author: John Snow Labs name: t5_logisgenerator date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `logisgenerator` is an English model originally trained by `OnsElleuch`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_logisgenerator_en_4.3.0_3.0_1675104908400.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_logisgenerator_en_4.3.0_3.0_1675104908400.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_logisgenerator","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_logisgenerator","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_logisgenerator| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|280.2 MB| ## References - https://huggingface.co/OnsElleuch/logisgenerator - https://pypi.org/project/keytotext/ - https://pepy.tech/project/keytotext - https://colab.research.google.com/github/gagan3012/keytotext/blob/master/notebooks/K2T.ipynb - https://share.streamlit.io/gagan3012/keytotext/UI/app.py - https://github.com/gagan3012/keytotext#api - https://hub.docker.com/r/gagan30/keytotext - https://keytotext.readthedocs.io/en/latest/?badge=latest - https://github.com/psf/black - https://socialify.git.ci/gagan3012/keytotext/image?description=1&forks=1&language=1&owner=1&stargazers=1&theme=Light --- layout: model title: Augment Company Names with NASDAQ database author: John Snow Labs name: finmapper_nasdaq_companyname date: 2022-08-09 tags: [en, finance, companies, tickers, nasdaq, data, augmentation, licensed] task: Chunk Mapping language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Given the extracted name of a company, this model provides information about that company, including its Industry, Sector and Trading Symbol (ticker). It can optionally be combined with Entity Resolution to first normalize the company name.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FIN_LEG_COMPANY_AUGMENTATION/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finmapper_nasdaq_companyname_en_1.0.0_3.2_1660038424307.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finmapper_nasdaq_companyname_en_1.0.0_3.2_1660038424307.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') tokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = finance.NerModel.pretrained('finner_orgs_prods_alias', 'en', 'finance/models')\ .setInputCols(["document", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") # Optional: To normalize the ORG name using NASDAQ data before the mapping ########################################################################## chunkToDoc = nlp.Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") chunk_embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use_lg", "en")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("chunk_embeddings") use_er_model = finance.SentenceEntityResolverModel.pretrained('finel_nasdaq_data_company_name', 'en', 'finance/models')\ .setInputCols("chunk_embeddings")\ .setOutputCol('normalized')\ .setDistanceFunction("EUCLIDEAN") ########################################################################## # Use "ner_chunk" as the input column instead of "normalized" to skip normalization CM = finance.ChunkMapperModel()\ .pretrained('finmapper_nasdaq_companyname', 'en', 'finance/models')\ .setInputCols(["normalized"])\ .setOutputCol("mappings") pipeline = nlp.Pipeline().setStages([document_assembler, tokenizer, embeddings, ner_model, ner_converter, chunkToDoc, # Optional for normalization chunk_embeddings, # Optional for normalization use_er_model, # Optional for normalization CM]) text = """Altaba Inc. is a company which ...""" test_data = spark.createDataFrame([[text]]).toDF("text") model = pipeline.fit(test_data) lp = nlp.LightPipeline(model) lp.fullAnnotate(text) ```
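Each mapping annotation returned by the pipeline carries its relation type (`ticker`, `company_name`, `industry`, ...) in its metadata and the mapped value in `result`. For downstream use, those per-relation rows can be flattened into one plain dictionary per company. A minimal pure-Python sketch (no Spark required; the sample pairs are copied from the Altaba Inc. output, and the helper name is illustrative):

```python
# Sample (relation, result) pairs as produced for the chunk "Altaba Inc.".
rows = [
    ("ticker", "AABA"),
    ("company_name", "Altaba Inc."),
    ("short_name", "Altaba"),
    ("industry", "Asset Management"),
    ("sector", "Financial Services"),
]

def mappings_to_dict(pairs):
    """Flatten (relation, result) pairs into a single record for one chunk."""
    return {relation: result for relation, result in pairs}

record = mappings_to_dict(rows)
print(record["ticker"])  # AABA
```

In the real pipeline the pairs would come from `annotation.metadata["relation"]` and `annotation.result` of each row in the `mappings` column.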
## Results ```bash [Row(mappings=[Row(annotatorType='labeled_dependency', begin=0, end=10, result='AABA', metadata={'sentence': '0', 'chunk': '0', 'entity': 'Altaba Inc.', 'relation': 'ticker', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=10, result='Altaba Inc.', metadata={'sentence': '0', 'chunk': '0', 'entity': 'Altaba Inc.', 'relation': 'company_name', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=10, result='Altaba', metadata={'sentence': '0', 'chunk': '0', 'entity': 'Altaba Inc.', 'relation': 'short_name', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=10, result='Asset Management', metadata={'sentence': '0', 'chunk': '0', 'entity': 'Altaba Inc.', 'relation': 'industry', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=10, result='Financial Services', metadata={'sentence': '0', 'chunk': '0', 'entity': 'Altaba Inc.', 'relation': 'sector', 'all_relations': ''}, embeddings=[])])] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finmapper_nasdaq_companyname| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|210.5 KB| ## References https://data.world/johnsnowlabs/list-of-companies-in-nasdaq-exchanges --- layout: model title: Stopwords Remover for Greek (modern) language (663 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, el, open_source] task: Stop Words Removal language: el edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_el_3.4.1_3.0_1646672928398.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_el_3.4.1_3.0_1646672928398.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","el") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Δεν είστε καλύτεροι από μένα"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","el") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Δεν είστε καλύτεροι από μένα").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("el.stopwords").predict("""Δεν είστε καλύτεροι από μένα""") ```
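Conceptually, the cleaner just drops every token whose lowercased form appears in the ISO stop-word list. A pure-Python sketch of that logic (the three Greek stop words used here are an assumed illustrative subset of the full 663-entry list):

```python
# Illustrative subset of the Greek ISO stop-word list (663 entries in full).
STOP_WORDS = {"δεν", "είστε", "από"}

def clean_tokens(tokens):
    """Keep only tokens whose lowercased form is not a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "Δεν είστε καλύτεροι από μένα".split()
print(clean_tokens(tokens))  # ['καλύτεροι', 'μένα']
```

Note the lowercasing: it is what lets the capitalized sentence-initial "Δεν" match the list entry "δεν".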
## Results ```bash +-----------------+ |result | +-----------------+ |[καλύτεροι, μένα]| +-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|el| |Size:|3.8 KB| --- layout: model title: Translate Caucasian languages to English Pipeline author: John Snow Labs name: translate_cau_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, cau, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `cau` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_cau_en_xx_2.7.0_2.4_1609687516556.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_cau_en_xx_2.7.0_2.4_1609687516556.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_cau_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_cau_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.cau.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_cau_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Sentence Entity Resolver for ICD10-CM (Augmented) author: John Snow Labs name: sbiobertresolve_icd10cm_augmented date: 2023-05-31 tags: [licensed, en, clinical, entity_resolution, icd10cm] task: Entity Resolution language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD-10-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It has also been augmented with synonyms to make it more accurate. ## Predicted Entities `ICD-10-CM Codes` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_4.4.2_3.0_1685503130827.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_4.4.2_3.0_1685503130827.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("word_embeddings") ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token", "word_embeddings"])\ .setOutputCol("ner")\ ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(["PROBLEM"]) c2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sentence_embeddings")\ .setCaseSensitive(False) icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented", "en", "clinical/models") \ .setInputCols(["sentence_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") resolver_pipeline = Pipeline(stages = [document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter, c2doc, sbert_embedder, icd_resolver]) data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, associated with obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. 
Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection."""]]).toDF("text") result = resolver_pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") .setWhiteList("PROBLEM") val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("sbert_embeddings") val icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented", "en", "clinical/models") .setInputCols("sbert_embeddings") .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, associated with obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor 
appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ```
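The resolution step configured above with `setDistanceFunction("EUCLIDEAN")` is, at its core, a nearest-neighbour search: each chunk's sentence embedding is compared against pre-computed embeddings of the ICD-10-CM code descriptions, and the closest code wins. A toy pure-Python sketch of that selection step (the 3-dimensional vectors and the two candidate codes are made up for illustration; real sentence embeddings have hundreds of dimensions and the code table has tens of thousands of entries):

```python
import math

# Hypothetical, tiny embedding table: ICD-10-CM code -> toy 3-d vector.
CODE_VECTORS = {
    "E66.9": (0.9, 0.1, 0.0),   # obesity, unspecified
    "R63.0": (0.0, 0.8, 0.3),   # anorexia / poor appetite
}

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def resolve(chunk_vector):
    """Return the code whose embedding is nearest to the chunk embedding."""
    return min(CODE_VECTORS, key=lambda code: euclidean(chunk_vector, CODE_VECTORS[code]))

print(resolve((0.85, 0.2, 0.05)))  # E66.9
```

The `resolutions`/`all_codes` columns in the output are simply the k nearest candidates ranked by this distance, with the top-ranked one reported as the code.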
## Results ```bash +-------------------------------------+-------+----------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+ | ner_chunk| entity|icd10_code| resolutions| all_codes| +-------------------------------------+-------+----------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+ | gestational diabetes mellitus|PROBLEM| O24.4|[gestational diabetes mellitus [gestational diabetes mellitus], gestatio...| [O24.4, O24.41, O24.43, Z86.32, Z87.5, O24.31, O24.11, O24.1, O24.81]| |subsequent type two diabetes mellitus|PROBLEM| O24.11|[pre-existing type 2 diabetes mellitus [pre-existing type 2 diabetes mel...|[O24.11, E11.8, E11, E13.9, E11.9, E11.3, E11.44, Z86.3, Z86.39, E11.32,...| | obesity|PROBLEM| E66.9|[obesity [obesity, unspecified], abdominal obesity [other obesity], obes...|[E66.9, E66.8, Z68.41, Q13.0, E66, E66.01, Z86.39, E34.9, H35.50, Z83.49...| | a body mass index|PROBLEM| Z68.41|[finding of body mass index [body mass index [bmi] 40.0-44.9, adult], ob...|[Z68.41, E66.9, R22.9, Z68.1, R22.3, R22.1, Z68, R22.2, R22.0, R41.89, M...| | polyuria|PROBLEM| R35|[polyuria [polyuria], nocturnal polyuria [nocturnal polyuria], polyuric ...|[R35, R35.81, R35.8, E23.2, R31, R35.0, R82.99, N40.1, E72.3, O04.8, R30...| | polydipsia|PROBLEM| R63.1|[polydipsia [polydipsia], psychogenic polydipsia [other impulse disorder...|[R63.1, F63.89, E23.2, F63.9, O40, G47.5, M79.89, R63.2, R06.1, H53.8, I...| | poor appetite|PROBLEM| R63.0|[poor appetite [anorexia], poor feeding [feeding problem of newborn, uns...|[R63.0, P92.9, R43.8, R43.2, E86, R19.6, F52.0, Z72.4, R06.89, Z76.89, R...| | vomiting|PROBLEM| R11.1|[vomiting [vomiting], intermittent vomiting [nausea and vomiting], vomit...| [R11.1, R11, R11.10, G43.A1, P92.1, P92.09, G43.A, R11.13, R11.0]| | a respiratory 
tract infection|PROBLEM| J98.8|[respiratory tract infection [other specified respiratory disorders], up...|[J98.8, J06.9, A49.9, J22, J20.9, Z59.3, T17, J04.10, Z13.83, J18.9, P28...| +-------------------------------------+-------+----------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icd10cm_augmented| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Size:|1.4 GB| |Case sensitive:|false| --- layout: model title: Kannada RoBERTa Embeddings (from Chakita) author: John Snow Labs name: roberta_embeddings_KNUBert date: 2022-04-14 tags: [roberta, embeddings, kn, open_source] task: Embeddings language: kn edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `KNUBert` is a Kannada model originally trained by `Chakita`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_KNUBert_kn_3.4.2_3.0_1649948307214.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_KNUBert_kn_3.4.2_3.0_1649948307214.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_KNUBert","kn") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["ನಾನು ಸ್ಪಾರ್ಕ್ ಎನ್ಎಲ್ಪಿ ಪ್ರೀತಿಸುತ್ತೇನೆ"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_KNUBert","kn") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("ನಾನು ಸ್ಪಾರ್ಕ್ ಎನ್ಎಲ್ಪಿ ಪ್ರೀತಿಸುತ್ತೇನೆ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("kn.embed.KNUBert").predict("""ನಾನು ಸ್ಪಾರ್ಕ್ ಎನ್ಎಲ್ಪಿ ಪ್ರೀತಿಸುತ್ತೇನೆ""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_KNUBert| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|kn| |Size:|314.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/Chakita/KNUBert --- layout: model title: Relation extraction between Drugs and ADE (ReDL) author: John Snow Labs name: redl_ade_biobert date: 2023-01-14 tags: [relation_extraction, en, clinical, licensed, ade, biobert, tensorflow] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is an end-to-end trained BioBERT model that relates drugs to the adverse reactions they cause: it predicts whether an adverse event was caused by a drug. `1` indicates that the adverse event and drug entities are related; `0` indicates that they are not. ## Predicted Entities `0`, `1` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ADE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/RE_ADE.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_ade_biobert_en_4.2.4_3.0_1673708531142.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_ade_biobert_en_4.2.4_3.0_1673708531142.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") words_embedder = WordEmbeddingsModel() \ .pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverterInternal() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel()\ .pretrained("dependency_conllu", "en")\ .setInputCols(["sentences", "pos_tags", "tokens"])\ .setOutputCol("dependencies") # Set a filter on pairs of named entities which will be treated as relation candidates re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks")\ .setRelationPairs(['ade-drug', 'drug-ade']) # The dataset this model is trained on is sentence-wise. # This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. 
re_model = RelationExtractionDLModel()\ .pretrained('redl_ade_biobert', 'en', "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) light_pipeline = LightPipeline(pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) text ="""Been taking Lipitor for 15 years , have experienced severe fatigue a lot. The doctor moved me to voltarene 2 months ago, so far I have only had muscle cramps.""" annotations = light_pipeline.fullAnnotate(text) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) 
.setOutputCol("re_ner_chunks") .setRelationPairs(Array("drug-ade", "ade-drug")) // The dataset this model is trained on is annotated sentence-wise. // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. val re_model = RelationExtractionDLModel() .pretrained("redl_ade_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, words_embedder, ner_tagger, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot. The doctor moved me to voltarene 2 months ago, so far I have only had muscle cramps.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.ade").predict("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot. The doctor moved me to voltarene 2 months ago, so far I have only had muscle cramps.""") ```
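Once collected from Spark, the `relations` output can be post-processed in plain Python. The sketch below is illustrative only: the dicts stand in for the metadata that `fullAnnotate` attaches to each relation annotation (field names are assumptions), and the threshold mirrors the `setPredictionThreshold(0.5)` call above.

```python
# Sketch: filtering relation predictions by confidence outside Spark.
# The dicts are a hypothetical stand-in for relation annotation metadata.
def filter_relations(relations, threshold=0.5):
    """Keep (chunk1, relation, chunk2) triples at or above the confidence threshold."""
    triples = []
    for rel in relations:
        if float(rel["confidence"]) >= threshold:
            triples.append((rel["chunk1"], rel["result"], rel["chunk2"]))
    return triples

predicted = [
    {"result": "1", "chunk1": "Lipitor", "chunk2": "severe fatigue", "confidence": "0.998156"},
    {"result": "1", "chunk1": "voltarene", "chunk2": "muscle cramps", "confidence": "0.985513"},
    {"result": "0", "chunk1": "Lipitor", "chunk2": "muscle cramps", "confidence": "0.41"},
]

print(filter_relations(predicted))
# [('Lipitor', '1', 'severe fatigue'), ('voltarene', '1', 'muscle cramps')]
```

Raising the threshold trades recall for precision, exactly as `setPredictionThreshold` does on the annotator itself.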
## Results ```bash | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |-----------:|:----------|----------------:|--------------:|:----------|:----------|----------------:|--------------:|:---------------|-------------:| | 1 | DRUG | 12 | 18 | Lipitor | ADE | 52 | 65 | severe fatigue | 0.998156 | | 1 | DRUG | 97 | 105 | voltarene | ADE | 144 | 156 | muscle cramps | 0.985513 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_ade_biobert| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|401.7 MB| ## References This model is trained on custom data annotated by JSL. ## Benchmarking ```bash label Recall Precision F1 Support 0 0.829 0.895 0.861 1146 1 0.955 0.923 0.939 2454 Avg. 0.892 0.909 0.900 - Weighted-Avg. 0.915 0.914 0.914 - ``` --- layout: model title: English BertForQuestionAnswering model (from MrAnderson) author: John Snow Labs name: bert_qa_bert_base_512_full_trivia date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-512-full-trivia` is an English model originally trained by `MrAnderson`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_512_full_trivia_en_4.0.0_3.0_1654179670286.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_512_full_trivia_en_4.0.0_3.0_1654179670286.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_512_full_trivia","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_512_full_trivia","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.trivia.bert.base_512d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
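The NLU one-liner above packs the question and context into a single string separated by `|||`. A minimal sketch of how such a packed pair can be split (the helper name is hypothetical, not part of the NLU API):

```python
def split_qa(packed, sep="|||"):
    """Split a 'question|||context' string into its two parts."""
    question, _, context = packed.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
print(c)  # My name is Clara and I live in Berkeley.
```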
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_512_full_trivia| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/MrAnderson/bert-base-512-full-trivia --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_only_classfn_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223765535.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223765535.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_only_classfn_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|460.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_only_classfn_epochs_1_shard_1_squad2.0 --- layout: model title: Extract Mentions of Response to Cancer Treatment author: John Snow Labs name: ner_oncology_response_to_treatment_wip date: 2022-10-01 tags: [licensed, clinical, oncology, en, ner, treatment] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts entities related to the patient's response to the oncology treatment, including clinical response and changes in tumor size. Definitions of Predicted Entities: - `Line_Of_Therapy`: Explicit references to the line of therapy of an oncological therapy (e.g. "first-line treatment"). - `Response_To_Treatment`: Terms related to clinical progress of the patient related to cancer treatment, including "recurrence", "bad response" or "improvement". - `Size_Trend`: Terms related to the changes in the size of the tumor (such as "growth" or "reduced in size"). 
## Predicted Entities `Line_Of_Therapy`, `Response_To_Treatment`, `Size_Trend` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_response_to_treatment_wip_en_4.0.0_3.0_1664585303681.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_response_to_treatment_wip_en_4.0.0_3.0_1664585303681.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_response_to_treatment_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["She completed her first-line therapy, but some months later there was recurrence of the breast cancer. 
"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_response_to_treatment_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("She completed her first-line therapy, but some months later there was recurrence of the breast cancer. ").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_response_to_treatment_wip").predict("""She completed her first-line therapy, but some months later there was recurrence of the breast cancer. """) ```
## Results ```bash | chunk | ner_label | |:-----------|:----------------------| | first-line | Line_Of_Therapy | | recurrence | Response_To_Treatment | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_response_to_treatment_wip| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|848.8 KB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Response_To_Treatment 233.0 81.0 120.0 353.0 0.74 0.66 0.70 Size_Trend 31.0 34.0 45.0 76.0 0.48 0.41 0.44 Line_Of_Therapy 82.0 11.0 5.0 87.0 0.88 0.94 0.91 macro_avg 346.0 126.0 170.0 516.0 0.70 0.67 0.68 micro_avg NaN NaN NaN NaN 0.73 0.67 0.70 ``` --- layout: model title: Recognize Entities OntoNotes pipeline - BERT Large author: John Snow Labs name: onto_recognize_entities_bert_large date: 2021-03-23 tags: [open_source, english, onto_recognize_entities_bert_large, pipeline, en] supported: true task: [Named Entity Recognition, Lemmatization] language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The onto_recognize_entities_bert_large is a pretrained pipeline that performs basic text processing steps and recognizes entities. 
It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_large_en_3.0.0_3.0_1616475201428.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_large_en_3.0.0_3.0_1616475201428.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('onto_recognize_entities_bert_large', lang = 'en') annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("onto_recognize_entities_bert_large", lang = "en") val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs ! "] result_df = nlu.load('en.ner.onto.bert.large').predict(text) result_df ```
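`fullAnnotate` returns one result per input text, keyed by each stage's output column. A minimal sketch, over a hypothetical hand-written result with annotation values simplified to plain strings, of pairing tokens with their NER tags:

```python
# Hypothetical stand-in for one fullAnnotate result: a dict keyed by
# the pipeline's output columns (values simplified to plain strings).
annotation = {
    "document": ["Hello from John Snow Labs ! "],
    "token": ["Hello", "from", "John", "Snow", "Labs", "!"],
    "ner": ["O", "O", "B-ORG", "I-ORG", "I-ORG", "O"],
    "entities": ["John Snow Labs"],
}

# Pair each token with its predicted tag and keep only the tagged ones
tagged = [(tok, tag) for tok, tag in zip(annotation["token"], annotation["ner"]) if tag != "O"]
print(tagged)  # [('John', 'B-ORG'), ('Snow', 'I-ORG'), ('Labs', 'I-ORG')]
```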
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:-----------------------------|:-------------------------------------------|:-------------------| | 0 | ['Hello from John Snow Labs ! '] | ['Hello from John Snow Labs !'] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | [[-0.262016534805297,.,...]] | ['O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O'] | ['John Snow Labs'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_recognize_entities_bert_large| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| --- layout: model title: English T5ForConditionalGeneration Base Cased model (from google) author: John Snow Labs name: t5_efficient_base_nh24 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-nh24` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nh24_en_4.3.0_3.0_1675113000788.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nh24_en_4.3.0_3.0_1675113000788.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_base_nh24","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_base_nh24","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_base_nh24| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|636.9 MB| ## References - https://huggingface.co/google/t5-efficient-base-nh24 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English BertForQuestionAnswering model (from armageddon) author: John Snow Labs name: bert_qa_bert_base_uncased_squad2_covid_qa_deepset date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squad2-covid-qa-deepset` is an English model originally trained by `armageddon`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squad2_covid_qa_deepset_en_4.0.0_3.0_1654181524986.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squad2_covid_qa_deepset_en_4.0.0_3.0_1654181524986.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_squad2_covid_qa_deepset","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_squad2_covid_qa_deepset","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_covid.bert.base_uncased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_squad2_covid_qa_deepset| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/armageddon/bert-base-uncased-squad2-covid-qa-deepset --- layout: model title: Sentiment Analysis on texts about Airlines author: John Snow Labs name: distilbert_base_sequence_classifier_airlines date: 2022-02-18 tags: [airlines, distilbert, sequence_classification, en, open_source] task: Text Classification language: en nav_key: models edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: DistilBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model was imported from `Hugging Face` ([link](https://huggingface.co/tasosk/distilbert-base-uncased-airlines)) and it has been trained on the tasosk/airlines dataset, leveraging `Distil-BERT` embeddings and `DistilBertForSequenceClassification` for text classification purposes. The model classifies texts into two categories: `YES` for positive comments, and `NO` for negative. ## Predicted Entities `YES`, `NO` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_base_sequence_classifier_airlines_en_3.4.0_3.0_1645179643194.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_base_sequence_classifier_airlines_en_3.4.0_3.0_1645179643194.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = DistilBertForSequenceClassification\ .pretrained('distilbert_base_sequence_classifier_airlines', 'en') \ .setInputCols(['token', 'document']) \ .setOutputCol('class') pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier]) example = spark.createDataFrame([["Jersey to London Gatwick with easyJet and another great flight. Due to the flight time, airport check-in was not open, however I'd checked in a few days before with the easyJet app which was very quick and convenient. Boarding was quick and we left a few minutes early, which is a bonus. The cabin crew were friendly and the aircraft was clean and comfortable. We arrived at Gatwick 5-10 minutes early, and disembarking was as quick as boarding. On the way back, we were about half an hour early landing, which was fantastic. For the short flight from JER-LGW, easyJet are ideal and a bit better than British Airways in my opinion, and the fares are just unmissable. Both flights for two adults cost £180. easyJet can expect my business in the near future."]]).toDF("text") result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val sequenceClassifier = DistilBertForSequenceClassification.pretrained("distilbert_base_sequence_classifier_airlines", "en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) val example = Seq("Jersey to London Gatwick with easyJet and another great flight. 
Due to the flight time, airport check-in was not open, however I'd checked in a few days before with the easyJet app which was very quick and convenient. Boarding was quick and we left a few minutes early, which is a bonus. The cabin crew were friendly and the aircraft was clean and comfortable. We arrived at Gatwick 5-10 minutes early, and disembarking was as quick as boarding. On the way back, we were about half an hour early landing, which was fantastic. For the short flight from JER-LGW, easyJet are ideal and a bit better than British Airways in my opinion, and the fares are just unmissable. Both flights for two adults cost £180. easyJet can expect my business in the near future.").toDF("text") val result = pipeline.fit(example).transform(example) ```
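Under the hood, a sequence classifier scores each class and the annotator emits the label with the highest score, which is how the `YES` in the Results below is produced. A minimal sketch of that final step (the helper function and the label order are illustrative assumptions, not the annotator's API):

```python
def to_label(probs, labels=("NO", "YES")):
    """Pick the label whose class probability is highest (hypothetical post-processing)."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best]

print(to_label([0.07, 0.93]))  # YES
print(to_label([0.80, 0.20]))  # NO
```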
## Results ```bash ['YES'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_base_sequence_classifier_airlines| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[class]| |Language:|en| |Size:|249.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References [https://huggingface.co/datasets/tasosk/airlines](https://huggingface.co/datasets/tasosk/airlines) ## Benchmarking ```bash label score accuracy 0.9288 f1 0.9289 ``` --- layout: model title: Pipeline to Detect Chemicals in Medical Texts author: John Snow Labs name: bert_token_classifier_ner_chemicals_pipeline date: 2022-03-14 tags: [chemicals, bert_token_classifier, pipeline, ner, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_chemicals](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_chemicals_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CHEMICALS/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemicals_pipeline_en_3.4.1_3.0_1647256416720.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemicals_pipeline_en_3.4.1_3.0_1647256416720.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python chemicals_pipeline = PretrainedPipeline("bert_token_classifier_ner_chemicals_pipeline", "en", "clinical/models") chemicals_pipeline.annotate("""The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.""") ``` ```scala val chemicals_pipeline = new PretrainedPipeline("bert_token_classifier_ner_chemicals_pipeline", "en", "clinical/models") chemicals_pipeline.annotate("The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.chemicals_pipeline").predict("""The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.""") ```
## Results ```bash +---------------------------+---------+ |chunk |ner_label| +---------------------------+---------+ |p - choloroaniline |CHEM | |chlorhexidine - digluconate|CHEM | |kanamycin |CHEM | |colistin |CHEM | |povidone - iodine |CHEM | +---------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_chemicals_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.3 MB| ## Included Models - DocumentAssembler - TokenizerModel - MedicalBertForTokenClassifier - NerConverter - Finisher --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from botika) author: John Snow Labs name: distilbert_qa_botika_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `botika`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_botika_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770249638.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_botika_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770249638.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_botika_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_botika_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_botika_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/botika/distilbert-base-uncased-finetuned-squad --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from graviraja) author: John Snow Labs name: distilbert_qa_graviraja_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `graviraja`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_graviraja_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770884217.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_graviraja_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770884217.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_graviraja_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_graviraja_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_graviraja_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/graviraja/distilbert-base-uncased-finetuned-squad --- layout: model title: Extract Demographic Entities from Voice of the Patient Documents (embeddings_clinical_large) author: John Snow Labs name: ner_vop_demographic_emb_clinical_large date: 2023-06-06 tags: [licensed, clinical, ner, en, vop, demographic] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts demographic terms from documents written in the patient’s own words. ## Predicted Entities `Gender`, `Employment`, `RaceEthnicity`, `Age`, `Substance`, `RelationshipStatus`, `SubstanceQuantity` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_demographic_emb_clinical_large_en_4.4.3_3.0_1686075195884.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_demographic_emb_clinical_large_en_4.4.3_3.0_1686075195884.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_demographic_emb_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["My grandma, who's 85 and Black, just had a pacemaker implanted in the cardiology department. 
The doctors say it'll help regulate her heartbeat and prevent future complications."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_demographic_emb_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("My grandma, who's 85 and Black, just had a pacemaker implanted in the cardiology department. The doctors say it'll help regulate her heartbeat and prevent future complications.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
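The `NerConverterInternal` stage at the end of the pipeline above merges the token-level BIO tags produced by the NER model into the entity chunks shown in the results. A minimal pure-Python sketch of that merging logic (not the actual implementation):

```python
def bio_to_chunks(tokens, labels):
    """Merge B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, current_label = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current:
                chunks.append((" ".join(current), current_label))
            current, current_label = [token], label[2:]
        elif label.startswith("I-") and current_label == label[2:]:
            current.append(token)
        else:  # "O" or a stray I- tag closes any open chunk
            if current:
                chunks.append((" ".join(current), current_label))
            current, current_label = [], None
    if current:
        chunks.append((" ".join(current), current_label))
    return chunks

tokens = ["My", "grandma", ",", "who's", "85", "and", "Black", ","]
labels = ["O", "B-Gender", "O", "B-Age", "I-Age", "O", "B-RaceEthnicity", "O"]
print(bio_to_chunks(tokens, labels))
# -> [('grandma', 'Gender'), ("who's 85", 'Age'), ('Black', 'RaceEthnicity')]
```

The chunks in the results table below correspond to exactly this kind of grouping over the model's BIO output.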
## Results ```bash | chunk | ner_label | |:---------|:--------------| | grandma | Gender | | who's 85 | Age | | Black | RaceEthnicity | | doctors | Employment | | her | Gender | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_demographic_emb_clinical_large| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical_large| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
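The precision, recall, and F1 figures in the benchmarking table that follows are derived from the per-label tp/fp/fn counts. As a sanity check, they can be recomputed directly; for example, for the micro average (tp=3528, fp=191, fn=177):

```python
def prf(tp, fp, fn):
    """Precision, recall, F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Micro-average counts from the benchmarking table
p, r, f1 = prf(tp=3528, fp=191, fn=177)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# -> precision=0.95 recall=0.95 f1=0.95
```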
## Benchmarking ```bash label tp fp fn total precision recall f1 Gender 1298 21 19 1317 0.98 0.99 0.98 Employment 1180 50 63 1243 0.96 0.95 0.95 RaceEthnicity 31 2 2 33 0.94 0.94 0.94 Age 549 45 33 582 0.92 0.94 0.93 Substance 391 56 30 421 0.87 0.93 0.90 RelationshipStatus 18 3 6 24 0.86 0.75 0.80 SubstanceQuantity 61 14 24 85 0.81 0.72 0.76 macro_avg 3528 191 177 3705 0.91 0.89 0.89 micro_avg 3528 191 177 3705 0.95 0.95 0.95 ``` --- layout: model title: Sentence Entity Resolver for Snomed Concepts, CT version (``sbiobert_base_cased_mli`` embeddings) author: John Snow Labs name: sbiobertresolve_snomed_findings language: en nav_key: models repository: clinical/models date: 2020-11-27 task: Entity Resolution edition: Healthcare NLP 2.6.4 spark_version: 2.4 tags: [clinical,entity_resolution,en] supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model maps extracted medical entities to Snomed codes (CT version) using chunk embeddings. {:.h2_title} ## Predicted Entities Snomed Codes and their normalized definition with ``sbiobert_base_cased_mli`` embeddings. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_findings_en_2.6.4_2.4_1606235762315.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_findings_en_2.6.4_2.4_1606235762315.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use The ```sbiobertresolve_snomed_findings``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_clinical``` as the NER model. There is no need to set ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") snomed_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_findings","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala ... 
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val snomed_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_snomed_findings","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_resolver)) val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.h2_title} ## Results ```bash +--------------------+-----+---+---------+---------+----------+--------------------+--------------------+ | chunk|begin|end| entity| code|confidence| resolutions| codes| +--------------------+-----+---+---------+---------+----------+--------------------+--------------------+ | hypertension| 68| 79| PROBLEM| 38341003| 0.3234|hypertension:::hy...|38341003:::155295...| |chronic renal ins...| 83|109| PROBLEM|723190009| 0.7522|chronic renal ins...|723190009:::70904...| | COPD| 113|116| PROBLEM| 13645005| 0.1226|copd - chronic ob...|13645005:::155565...| | gastritis| 120|128| PROBLEM|235653009| 
0.2444|gastritis:::gastr...|235653009:::45560...| | TIA| 136|138| PROBLEM|275382005| 0.0766|cerebral trauma (...|275382005:::44739...| |a non-ST elevatio...| 182|202| PROBLEM|233843008| 0.2224|silent myocardial...|233843008:::19479...| |Guaiac positive s...| 208|229| PROBLEM| 59614000| 0.9678|guaiac-positive s...|59614000:::703960...| |cardiac catheteri...| 295|317| TEST|301095005| 0.2584|cardiac finding::...|301095005:::25090...| | PTCA| 324|327|TREATMENT|373108000| 0.0809|post percutaneous...|373108000:::25103...| | mid LAD lesion| 332|345| PROBLEM|449567000| 0.0900|overriding left v...|449567000:::46140...| +--------------------+-----+---+---------+---------+----------+--------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---------------|---------------------| | Name: | sbiobertresolve_snomed_findings | | Type: | SentenceEntityResolverModel | | Compatibility: | Spark NLP 2.6.4 + | | License: | Licensed | | Edition: | Official | |Input labels: | [ner_chunk, chunk_embeddings] | |Output labels: | [resolution] | | Language: | en | | Dependencies: | sbiobert_base_cased_mli | {:.h2_title} ## Data Source Trained on SNOMED (CT version) Findings with ``sbiobert_base_cased_mli`` sentence embeddings. http://www.snomed.org/ --- layout: model title: Fast Neural Machine Translation Model from Swedish to English author: John Snow Labs name: opus_mt_sv_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, sv, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. 
Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `sv` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_sv_en_xx_2.7.0_2.4_1609170150107.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_sv_en_xx_2.7.0_2.4_1609170150107.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_sv_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["PUT YOUR SWEDISH TEXT HERE"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_sv_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("PUT YOUR SWEDISH TEXT HERE").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.sv.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
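The sentence detector runs before `MarianTransformer` because the translation model processes one sentence at a time and has a limited input length. As a rough intuition for what that stage does, here is a crude regex-based stand-in for sentence splitting (the pretrained `sentence_detector_dl` model is far more robust than this):

```python
import re

def naive_sentence_split(text):
    """Split on sentence-final punctuation followed by whitespace.

    A toy approximation of sentence detection; it mishandles
    abbreviations, quotes, and many real-world cases.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(naive_sentence_split("Jag bor i Stockholm. Var bor du?"))
# -> ['Jag bor i Stockholm.', 'Var bor du?']
```

Each resulting sentence would then be translated independently and the translations collected in the `translation` column.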
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_sv_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from madlag) author: John Snow Labs name: bert_qa_bert_large_uncased_wwm_squadv2_x2.15_f83.2_d25_hybrid_v1 date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-wwm-squadv2-x2.15-f83.2-d25-hybrid-v1` is an English model originally trained by `madlag`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_wwm_squadv2_x2.15_f83.2_d25_hybrid_v1_en_4.0.0_3.0_1654537627313.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_wwm_squadv2_x2.15_f83.2_d25_hybrid_v1_en_4.0.0_3.0_1654537627313.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_wwm_squadv2_x2.15_f83.2_d25_hybrid_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_large_uncased_wwm_squadv2_x2.15_f83.2_d25_hybrid_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.large_uncased_v2_x2.15_f83.2_d25_hybrid.by_madlag").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
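In the NLU one-liner above, the question and context are packed into a single string separated by `|||`. A small sketch of how such a packed string maps to the two-column (question, context) shape the Spark pipelines use (the actual parsing is handled inside the nlu library):

```python
def split_qa(packed, sep="|||"):
    """Split a 'question|||context' string into its two parts."""
    question, _, context = packed.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # -> What's my name?
print(c)  # -> My name is Clara and I live in Berkeley.
```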
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_large_uncased_wwm_squadv2_x2.15_f83.2_d25_hybrid_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|455.4 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/madlag/bert-large-uncased-wwm-squadv2-x2.15-f83.2-d25-hybrid-v1 - https://rajpurkar.github.io/SQuAD-explorer - https://www.aclweb.org/anthology/N19-1423.pdf --- layout: model title: Detect PHI for Deidentification purposes (Spanish, reduced entities, augmented data) author: John Snow Labs name: ner_deid_generic_augmented date: 2022-02-16 tags: [deid, es, licensed] task: De-identification language: es edition: Healthcare NLP 3.3.4 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow a generic model to be trained using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Spanish) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 8 entities (1 more than the `ner_deid_generic` NER model). This NER model is trained with a combination of custom datasets, the Spanish CoNLL 2002 corpus, the MeddoProf dataset, and several data augmentation mechanisms, and has been augmented with the MEDDOCAN Spanish Deidentification corpus (compared to `ner_deid_generic`, which does not include it). It's a generalized version of `ner_deid_subentity_augmented`.
## Predicted Entities `CONTACT`, `NAME`, `DATE`, `ID`, `LOCATION`, `PROFESSION`, `AGE`, `SEX` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_ES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_augmented_es_3.3.4_3.0_1645006125653.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_augmented_es_3.3.4_3.0_1645006125653.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("word_embeddings") clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "es", "clinical/models")\ .setInputCols(["sentence","token","word_embeddings"])\ .setOutputCol("ner") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner]) text = [''' Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. 
'''] df = spark.createDataFrame([text]).toDF("text") results = nlpPipeline.fit(df).transform(df) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("word_embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "es", "clinical/models") .setInputCols(Array("sentence","token","word_embeddings")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner)) val text = "Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos." val df = Seq(text).toDF("text") val results = pipeline.fit(df).transform(df) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.deid.generic_augmented").predict(""" Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. """) ```
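Once PHI chunks have been extracted, a typical de-identification step replaces each detected chunk with its entity label. The snippet below is a minimal, illustrative masking step over (chunk, label) pairs like those this model produces; in production, Spark NLP for Healthcare's `DeIdentification` annotator handles masking and obfuscation:

```python
def mask_phi(text, chunks):
    """Replace each detected PHI chunk with a <LABEL> placeholder."""
    for chunk, label in chunks:
        text = text.replace(chunk, f"<{label}>")
    return text

text = "Antonio Miguel Martínez, un varón de 35 años de edad."
chunks = [("Antonio Miguel Martínez", "NAME"), ("un varón", "SEX"), ("35", "AGE")]
print(mask_phi(text, chunks))
# -> <NAME>, <SEX> de <AGE> años de edad.
```

Naive string replacement is only a sketch; the real annotator works on character offsets, which avoids accidentally masking repeated substrings.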
## Results ```bash +------------+------------+ | token| ner_label| +------------+------------+ | Antonio| B-NAME| | Miguel| I-NAME| | Martínez| I-NAME| | ,| O| | un| B-SEX| | varón| I-SEX| | de| O| | 35| B-AGE| | años| O| | de| O| | edad| O| | ,| O| | de| O| | profesión| O| | auxiliar|B-PROFESSION| | de|I-PROFESSION| | enfermería|I-PROFESSION| | y| O| | nacido| O| | en| O| | Cadiz| B-LOCATION| | ,| O| | España| B-LOCATION| | .| O| | Aún| O| | no| O| | estaba| O| | vacunado| O| | ,| O| | se| O| | infectó| O| | con| O| | Covid-19|B-PROFESSION| | el| O| | dia| O| | 14| O| | de| O| | Marzo| O| | y| O| | tuvo| O| | que| O| | ir| O| | al| O| | Hospital| B-LOCATION| | Fue| O| | tratado| O| | con| O| | anticuerpos| O| |monoclonales| O| | en| O| | la| O| | Clinica| B-LOCATION| | San| I-LOCATION| | Carlos| I-LOCATION| | .| O| +------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic_augmented| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, word_embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|15.0 MB| ## References - Internal JSL annotated corpus - [Spanish conLL](https://www.clips.uantwerpen.be/conll2002/ner/data/) - [MeddoProf](https://temu.bsc.es/meddoprof/data/) - [MeddoCan](https://temu.bsc.es/meddocan/) ## Benchmarking ```bash label tp fp fn total precision recall f1 CONTACT 185.0 3.0 0.0 185.0 0.984 1.0 0.992 NAME 2066.0 138.0 106.0 2172.0 0.9374 0.9512 0.9442 DATE 1017.0 18.0 18.0 1035.0 0.9826 0.9826 0.9826 ORGANIZATION 2468.0 482.0 332.0 2800.0 0.8366 0.8814 0.8584 ID 65.0 5.0 3.0 68.0 0.9286 0.9559 0.942 SEX 678.0 8.0 15.0 693.0 0.9883 0.9784 0.9833 LOCATION 2532.0 358.0 420.0 2952.0 0.8761 0.8577 0.8668 PROFESSION 246.0 9.0 31.0 277.0 0.9647 0.8881 0.9248 AGE 547.0 8.0 9.0 556.0 0.9856 0.9838 0.9847 macro - - - - - - 0.9421 micro - - - - - - 0.9092 ``` --- layout: model title: English DistilBertForQuestionAnswering model 
(from aszidon) Custom Version-3 author: John Snow Labs name: distilbert_qa_custom3 date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbertcustom3` is an English model originally trained by `aszidon`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom3_en_4.0.0_3.0_1654727988638.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom3_en_4.0.0_3.0_1654727988638.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom3","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.custom3.by_aszidon").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_custom3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/aszidon/distilbertcustom3 --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from claytonsamples) author: John Snow Labs name: xlmroberta_ner_claytonsamples_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `claytonsamples`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_claytonsamples_base_finetuned_panx_de_4.1.0_3.0_1660431662923.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_claytonsamples_base_finetuned_panx_de_4.1.0_3.0_1660431662923.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_claytonsamples_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_claytonsamples_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_claytonsamples_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/claytonsamples/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Thai BertForQuestionAnswering model (from airesearch) author: John Snow Labs name: bert_qa_bert_base_multilingual_cased_finetune_qa date: 2022-06-02 tags: [th, open_source, question_answering, bert] task: Question Answering language: th edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased-finetune-qa` is a Thai model originally trained by `airesearch`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_finetune_qa_th_4.0.0_3.0_1654179974020.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_finetune_qa_th_4.0.0_3.0_1654179974020.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_multilingual_cased_finetune_qa","th") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_multilingual_cased_finetune_qa","th") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("th.answer_question.bert.multilingual_base_cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_multilingual_cased_finetune_qa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|th| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/airesearch/bert-base-multilingual-cased-finetune-qa - https://github.com/vistec-AI/thai2transformers/blob/dev/scripts/downstream/train_question_answering_lm_finetuning.py - https://wandb.ai/cstorm125/wangchanberta-qa --- layout: model title: Stopwords Remover for French language (507 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, fr, open_source] task: Stop Words Removal language: fr edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_fr_3.4.1_3.0_1646673106300.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_fr_3.4.1_3.0_1646673106300.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","fr") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Tu n'es pas mieux que moi"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","fr") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Tu n'es pas mieux que moi").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fr.stopwords").predict("""Tu n'es pas mieux que moi""") ```
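Conceptually, `StopWordsCleaner` just drops every token that appears in its stop list. A minimal pure-Python sketch of that behavior, using a tiny illustrative subset of French stop words rather than the model's full 507-entry list:

```python
# Tiny illustrative subset of French stop words; the real stopwords_iso
# model for French carries 507 entries.
FRENCH_STOPWORDS = {"tu", "es", "pas", "que", "moi"}

def remove_stopwords(tokens):
    # Keep only tokens whose lowercased form is not in the stop list.
    return [t for t in tokens if t.lower() not in FRENCH_STOPWORDS]

tokens = ["Tu", "n'es", "pas", "mieux", "que", "moi"]
print(remove_stopwords(tokens))  # → ["n'es", 'mieux']
```

Note that the surviving tokens match the `cleanTokens` shown in the Results section: `n'es` is kept because the contracted form is not itself a stop-list entry.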
## Results ```bash +-------------+ |result | +-------------+ |[n'es, mieux]| +-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|fr| |Size:|2.9 KB| --- layout: model title: ALBERT Embeddings (XLarge Uncased) author: John Snow Labs name: albert_xlarge_uncased date: 2020-04-28 task: Embeddings language: en nav_key: models edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: AlBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description ALBERT is "A Lite" version of BERT, a popular unsupervised language representation learning algorithm. ALBERT uses parameter-reduction techniques that allow for large-scale configurations, overcome previous memory limitations, and achieve better behavior with respect to model degradation. The details are described in the paper "[ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.](https://arxiv.org/abs/1909.11942)" {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_xlarge_uncased_en_2.5.0_2.4_1588073443653.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_xlarge_uncased_en_2.5.0_2.4_1588073443653.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = AlbertEmbeddings.pretrained("albert_xlarge_uncased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = AlbertEmbeddings.pretrained("albert_xlarge_uncased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.albert.xlarge_uncased').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash token en_embed_albert_xlarge_uncased_embeddings I [-0.4735468626022339, -0.03991951420903206, -1... love [-0.4254034459590912, -0.371383935213089, -0.3... NLP [0.7200506329536438, -0.12543179094791412, -0.... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_xlarge_uncased| |Type:|embeddings| |Compatibility:|Spark NLP 2.5.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|2048| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/google/albert_xlarge/3](https://tfhub.dev/google/albert_xlarge/3) --- layout: model title: Polish T5ForConditionalGeneration Base Cased model (from azwierzc) author: John Snow Labs name: t5_plt5_base_poquad date: 2023-01-30 tags: [pl, open_source, t5, tensorflow] task: Text Generation language: pl edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `plt5-base-poquad` is a Polish model originally trained by `azwierzc`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_plt5_base_poquad_pl_4.3.0_3.0_1675106743524.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_plt5_base_poquad_pl_4.3.0_3.0_1675106743524.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_plt5_base_poquad","pl") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_plt5_base_poquad","pl") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_plt5_base_poquad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|pl| |Size:|1.1 GB| ## References - https://huggingface.co/azwierzc/plt5-base-poquad --- layout: model title: Legal NER for MAPA(Multilingual Anonymisation for Public Administrations) author: John Snow Labs name: legner_mapa date: 2023-04-28 tags: [cs, licensed, legal, ner, mapa] task: Named Entity Recognition language: cs edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union. This model extracts `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, and `PERSON` entities from `Czech` documents. ## Predicted Entities `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, `PERSON` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_mapa_cs_1.0.0_3.0_1682668776380.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_mapa_cs_1.0.0_3.0_1682668776380.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_base_czech_legal","cs")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_mapa", "cs", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""V roce 2007 uzavřela společnost Alpenrind, dříve S GmbH, se společností Martin-Meat usazenou v Maďarsku smlouvu, podle níž se posledně uvedená společnost zavázala k porcování masa a jeho balení v rozsahu 25 půlek jatečně upravených těl skotu týdně."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
## Results ```bash +-----------+------------+ |chunk |ner_label | +-----------+------------+ |2007 |DATE | |Alpenrind |ORGANISATION| |Martin-Meat|ORGANISATION| |Maďarsku |ADDRESS | |25 půlek |AMOUNT | +-----------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_mapa| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|cs| |Size:|1.4 MB| ## References The dataset is available [here](https://huggingface.co/datasets/joelito/mapa). ## Benchmarking ```bash label precision recall f1-score support ADDRESS 0.80 0.67 0.73 36 AMOUNT 1.00 1.00 1.00 5 DATE 0.98 0.98 0.98 56 ORGANISATION 0.64 0.66 0.65 32 PERSON 0.75 0.82 0.78 66 micro-avg 0.81 0.82 0.81 195 macro-avg 0.83 0.82 0.83 195 weighted-avg 0.81 0.82 0.81 195 ``` --- layout: model title: English RobertaForQuestionAnswering Cased model (from shmuelamar) author: John Snow Labs name: roberta_qa_re date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `REQA-RoBERTa` is an English model originally trained by `shmuelamar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_re_en_4.3.0_3.0_1674208450623.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_re_en_4.3.0_3.0_1674208450623.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_re","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_re","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_re| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/shmuelamar/REQA-RoBERTa --- layout: model title: Detect Clinical Entities in Romanian (w2v_cc_300d) author: John Snow Labs name: ner_clinical date: 2022-07-01 tags: [clinical, ro, ner, w2v, licensed] task: Named Entity Recognition language: ro edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract clinical entities from Romanian clinical texts. This model is trained using Romanian `w2v_cc_300d` embeddings. ## Predicted Entities `Measurements`, `Form`, `Symptom`, `Route`, `Procedure`, `Disease_Syndrome_Disorder`, `Score`, `Drug_Ingredient`, `Pulse`, `Frequency`, `Date`, `Body_Part`, `Drug_Brand_Name`, `Time`, `Direction`, `Dosage`, `Medical_Device`, `Imaging_Technique`, `Test`, `Imaging_Findings`, `Imaging_Test`, `Test_Result`, `Weight`, `Clinical_Dept`, `Units` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_ro_4.0.0_3.0_1656687302322.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_ro_4.0.0_3.0_1656687302322.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "ro") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "ro", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) sample_text = """ Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. 
Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.""" data = spark.createDataFrame([[sample_text]]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ro") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "ro","clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter)) val data = Seq("""Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ro.med_ner.clinical").predict(""" Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. 
Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.""") ```
## Results ```bash +--------------------------+-------------------------+ |chunks |entities | +--------------------------+-------------------------+ |Angio CT |Imaging_Test | |cardio-toracic |Body_Part | |Atrezie |Disease_Syndrome_Disorder| |valva pulmonara |Body_Part | |Hipoplazie |Disease_Syndrome_Disorder| |VS |Body_Part | |Atrezie |Disease_Syndrome_Disorder| |VAV stang |Body_Part | |Anastomoza Glenn |Disease_Syndrome_Disorder| |Sp |Body_Part | |Tromboza |Disease_Syndrome_Disorder| |Sectia Clinica Cardiologie|Clinical_Dept | |GE Revolution HD |Medical_Device | |Branula albastra |Medical_Device | |membrului superior |Body_Part | |drept |Direction | |30 ml |Dosage | |Iomeron 350 |Drug_Ingredient | |2.2 ml/s |Dosage | |20 ml |Dosage | +--------------------------+-------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ro| |Size:|15.0 MB| ## Benchmarking ```bash label precision recall f1-score support Body_Part 0.87 0.90 0.88 689 Clinical_Dept 0.68 0.62 0.65 97 Date 1.00 0.99 0.99 87 Direction 0.64 0.74 0.69 50 Disease_Syndrome_Disorder 0.69 0.66 0.67 123 Dosage 0.74 0.97 0.84 38 Drug_Ingredient 0.98 0.92 0.95 48 Form 1.00 1.00 1.00 6 Imaging_Findings 0.74 0.76 0.75 202 Imaging_Technique 0.92 0.88 0.90 26 Imaging_Test 0.93 0.97 0.95 208 Measurements 0.70 0.67 0.69 214 Medical_Device 0.92 0.81 0.86 42 Pulse 0.82 1.00 0.90 9 Route 0.97 0.91 0.94 33 Score 0.91 0.95 0.93 41 Time 1.00 1.00 1.00 28 Units 0.60 0.89 0.71 88 Weight 1.00 1.00 1.00 9 micro-avg 0.82 0.84 0.83 2054 macro-avg 0.70 0.72 0.71 2054 weighted-avg 0.81 0.84 0.82 2054 ``` --- layout: model title: Part of Speech for Norwegian author: John Snow Labs name: pos_ud_bokmaal date: 2022-01-11 tags: [pos, norwegian, nb, open_source] task: Part of Speech Tagging language: nb edition: Spark NLP 
3.4.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. This model was trained using the dataset available at https://universaldependencies.org {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_bokmaal_nb_3.4.0_3.0_1641902661339.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_bokmaal_nb_3.4.0_3.0_1641902661339.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pos = PerceptronModel.pretrained("pos_ud_bokmaal", "nb") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Annet enn å være kongen i nord, er John Snow en engelsk lege og en leder innen utvikling av anestesi og medisinsk hygiene.") ``` ```scala val pos = PerceptronModel.pretrained("pos_ud_bokmaal", "nb") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Annet enn å være kongen i nord, er John Snow en engelsk lege og en leder innen utvikling av anestesi og medisinsk hygiene.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Annet enn å være kongen i nord, er John Snow en engelsk lege og en leder innen utvikling av anestesi og medisinsk hygiene."""] pos_df = nlu.load('nb.pos.ud_bokmaal').predict(text) pos_df ```
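The `pos` column produced above carries Universal Dependencies (UD) part-of-speech tags. For readability, here is a small, hypothetical lookup table (only a handful of the 17 universal tags, chosen to cover the ones this model commonly emits):

```python
# Hypothetical lookup for a few Universal Dependencies POS tags;
# the full UD tag set defines 17 universal tags.
UPOS = {
    "NOUN": "noun",
    "PRON": "pronoun",
    "DET": "determiner",
    "AUX": "auxiliary",
    "PART": "particle",
    "SCONJ": "subordinating conjunction",
    "CCONJ": "coordinating conjunction",
}

def describe(tags):
    # Map each tag to a human-readable description, defaulting to "unknown".
    return [UPOS.get(t, "unknown") for t in tags]

print(describe(["DET", "SCONJ", "PART", "AUX", "NOUN"]))
# → ['determiner', 'subordinating conjunction', 'particle', 'auxiliary', 'noun']
```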
## Results ```bash [Row(annotatorType='pos', begin=0, end=4, result='DET', metadata={'word': 'Annet'}), Row(annotatorType='pos', begin=6, end=8, result='SCONJ', metadata={'word': 'enn'}), Row(annotatorType='pos', begin=10, end=10, result='PART', metadata={'word': 'å'}), Row(annotatorType='pos', begin=12, end=15, result='AUX', metadata={'word': 'være'}), Row(annotatorType='pos', begin=17, end=22, result='NOUN', metadata={'word': 'kongen'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_bokmaal| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Language:|nb| |Size:|17.7 KB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - PerceptronModel --- layout: model title: French CamemBert Embeddings (from ppletscher) author: John Snow Labs name: camembert_embeddings_dummy date: 2022-05-23 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy` is a French model originally trained by `ppletscher`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_dummy_fr_3.4.4_3.0_1653321214351.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_dummy_fr_3.4.4_3.0_1653321214351.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_dummy","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_dummy","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_dummy| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/ppletscher/dummy --- layout: model title: SDOH Under Treatment For Classification author: John Snow Labs name: genericclassifier_sdoh_under_treatment_sbiobert_cased_mli date: 2023-04-27 tags: [en, licensed, clinical, sdoh, generic_classifier, under_treatment, biobert] task: Text Classification language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true annotator: GenericClassifierModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Generic Classifier model is intended for detecting if the patient is under treatment or not. If under treatment is not mentioned in the text, it is regarded as “not under treatment”. The model is trained by using GenericClassifierApproach annotator. `Under_Treatment`: The patient is under treatment. `Not_Under_Treatment_Or_Not_Mentioned`: The patient is not under treatment or it is not mentioned in the clinical notes. ## Predicted Entities `Under_Treatment`, `Not_Under_Treatment_Or_Not_Mentioned` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_under_treatment_sbiobert_cased_mli_en_4.3.2_3.0_1682608513576.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_under_treatment_sbiobert_cased_mli_en_4.3.2_3.0_1682608513576.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") features_asm = FeaturesAssembler()\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("features") generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_under_treatment_sbiobert_cased_mli", 'en', 'clinical/models')\ .setInputCols(["features"])\ .setOutputCol("prediction") pipeline = Pipeline(stages=[ document_assembler, sentence_embeddings, features_asm, generic_classifier ]) text_list = ["""Sarah, a 55-year-old woman with a history of high cholesterol and a family history of heart disease, presented to her primary care physician with complaints of chest pain and shortness of breath. After a thorough evaluation, Sarah was diagnosed with coronary artery disease (CAD), a condition that can lead to heart attacks and other serious complications. To manage her CAD, Sarah was started on a treatment plan that included medication to lower her cholesterol and blood pressure, as well as aspirin to prevent blood clots. In addition to medication, Sarah was advised to make lifestyle modifications such as improving her diet, quitting smoking, and increasing physical activity. Over the course of several months, Sarah's symptoms improved, and follow-up tests showed that her cholesterol and blood pressure were within the target range. However, Sarah continued to experience occasional chest pain, and her medication regimen was adjusted accordingly. With regular follow-up appointments and adherence to her treatment plan, Sarah's CAD remained under control, and she was able to resume her normal activities with improved quality of life. 
""", """John, a 60-year-old man with a history of smoking and high blood pressure, presented to his primary care physician with complaints of chest pain and shortness of breath. Further tests revealed that John had a blockage in one of his coronary arteries, which required urgent intervention. However, John was hesitant to undergo treatment, citing concerns about potential complications and side effects of medications and procedures. Despite the physician's recommendations and attempts to educate John about the risks of leaving the blockage untreated, John ultimately chose not to pursue any treatment. Over the next several months, John continued to experience symptoms, which progressively worsened, and he ultimately required hospitalization for a heart attack. The medical team attempted to intervene at that point, but the damage to John's heart was severe, and his prognosis was poor. """] df = spark.createDataFrame(text_list, StringType()).toDF("text") result = pipeline.fit(df).transform(df) result.select("text", "prediction.result").show(truncate=100) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence_embeddings") val features_asm = new FeaturesAssembler() .setInputCols("sentence_embeddings") .setOutputCol("features") val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_under_treatment_sbiobert_cased_mli", "en", "clinical/models") .setInputCols("features") .setOutputCol("prediction") val pipeline = new PipelineModel().setStages(Array( document_assembler, sentence_embeddings, features_asm, generic_classifier)) val data = Seq(Array("""Sarah, a 55-year-old woman with a history of high cholesterol and a family history of heart disease, presented to her primary care physician with complaints of chest pain and shortness of 
breath. After a thorough evaluation, Sarah was diagnosed with coronary artery disease (CAD), a condition that can lead to heart attacks and other serious complications. To manage her CAD, Sarah was started on a treatment plan that included medication to lower her cholesterol and blood pressure, as well as aspirin to prevent blood clots. In addition to medication, Sarah was advised to make lifestyle modifications such as improving her diet, quitting smoking, and increasing physical activity. Over the course of several months, Sarah's symptoms improved, and follow-up tests showed that her cholesterol and blood pressure were within the target range. However, Sarah continued to experience occasional chest pain, and her medication regimen was adjusted accordingly. With regular follow-up appointments and adherence to her treatment plan, Sarah's CAD remained under control, and she was able to resume her normal activities with improved quality of life. """, """John, a 60-year-old man with a history of smoking and high blood pressure, presented to his primary care physician with complaints of chest pain and shortness of breath. Further tests revealed that John had a blockage in one of his coronary arteries, which required urgent intervention. However, John was hesitant to undergo treatment, citing concerns about potential complications and side effects of medications and procedures. Despite the physician's recommendations and attempts to educate John about the risks of leaving the blockage untreated, John ultimately chose not to pursue any treatment. Over the next several months, John continued to experience symptoms, which progressively worsened, and he ultimately required hospitalization for a heart attack. The medical team attempted to intervene at that point, but the damage to John's heart was severe, and his prognosis was poor. """)).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
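The `prediction.result` column carries one of the two labels per document. A small post-processing helper (the names below are illustrative, not part of Spark NLP) can map the raw labels back to the readable definitions given in the description:

```python
# Hypothetical post-processing helper: map raw classifier labels to readable text.
LABEL_DESCRIPTIONS = {
    "Under_Treatment": "The patient is under treatment.",
    "Not_Under_Treatment_Or_Not_Mentioned": (
        "The patient is not under treatment, or treatment is not mentioned."
    ),
}

def describe(labels):
    """Translate a list of predicted labels into readable sentences."""
    return [LABEL_DESCRIPTIONS.get(label, "Unknown label: " + label) for label in labels]

print(describe(["Under_Treatment"]))
```

In practice the label lists would come from collecting the `prediction.result` column of the transformed DataFrame.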
## Results ```bash +----------------------------------------------------------------------------------------------------+--------------------------------------+ | text| result| +----------------------------------------------------------------------------------------------------+--------------------------------------+ |Sarah, a 55-year-old woman with a history of high cholesterol and a family history of heart disea...| [Under_Treatment]| |John, a 60-year-old man with a history of smoking and high blood pressure, presented to his prima...|[Not_Under_Treatment_Or_Not_Mentioned]| +----------------------------------------------------------------------------------------------------+--------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|genericclassifier_sdoh_under_treatment_sbiobert_cased_mli| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[features]| |Output Labels:|[prediction]| |Language:|en| |Size:|3.4 MB| |Dependencies:|sbiobert_base_cased_mli| ## References Internal SDOH Project ## Benchmarking ```bash label precision recall f1-score support Not_Under_Treatment_Or_Not_Mentioned 0.86 0.68 0.76 222 Under_Treatment 0.86 0.94 0.90 450 accuracy - - 0.86 672 macro-avg 0.86 0.81 0.83 672 weighted-avg 0.86 0.86 0.85 672 ``` --- layout: model title: English BertForQuestionAnswering model (from madlag) author: John Snow Labs name: bert_qa_bert_large_uncased_wwm_squadv2_x2.63_f82.6_d16_hybrid_v1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1` is an English model originally trained by `madlag`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_wwm_squadv2_x2.63_f82.6_d16_hybrid_v1_en_4.0.0_3.0_1654183687318.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_wwm_squadv2_x2.63_f82.6_d16_hybrid_v1_en_4.0.0_3.0_1654183687318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_wwm_squadv2_x2.63_f82.6_d16_hybrid_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_large_uncased_wwm_squadv2_x2.63_f82.6_d16_hybrid_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.large_uncased_v2_x2.63_f82.6_d16_hybrid.by_madlag").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
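The nlu one-liner above concatenates question and context into a single string separated by `|||`. A tiny helper (the function name is ours, purely for illustration) makes that input convention explicit:

```python
def to_nlu_qa_input(question, context, sep="|||"):
    """Join question and context with the separator nlu expects for QA models."""
    return f"{question}{sep}{context}"

print(to_nlu_qa_input("What's my name?", "My name is Clara and I live in Berkeley."))
```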
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_large_uncased_wwm_squadv2_x2.63_f82.6_d16_hybrid_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|349.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/madlag/bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1 - https://rajpurkar.github.io/SQuAD-explorer - https://www.aclweb.org/anthology/N19-1423.pdf --- layout: model title: German XlmRoBertaForQuestionAnswering (from bhavikardeshna) author: John Snow Labs name: xlm_roberta_qa_xlm_roberta_base_german date: 2022-06-23 tags: [de, open_source, question_answering, xlmroberta] task: Question Answering language: de edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-german` is a German model originally trained by `bhavikardeshna`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_german_de_4.0.0_3.0_1655989699247.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_german_de_4.0.0_3.0_1655989699247.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_base_german","de") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_base_german","de") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("de.answer_question.xlm_roberta.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_roberta_base_german| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|de| |Size:|883.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/bhavikardeshna/xlm-roberta-base-german --- layout: model title: English BertForSequenceClassification Tiny Cased model (from mrm8488) author: John Snow Labs name: bert_sequence_classifier_tiny_finetuned_fake_news_detection date: 2022-07-13 tags: [en, open_source, bert, sequence_classification] task: Text Classification language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-tiny-finetuned-fake-news-detection` is a English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_tiny_finetuned_fake_news_detection_en_4.0.0_3.0_1657720809070.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_tiny_finetuned_fake_news_detection_en_4.0.0_3.0_1657720809070.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") classifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_tiny_finetuned_fake_news_detection","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, classifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val classifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_tiny_finetuned_fake_news_detection","en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, classifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
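Under the hood, the classifier scores each class and the highest-probability label lands in the `class` column. As a rough sketch of how raw class scores become probabilities (the logit values below are made up, and the two-class setup mirrors a fake/real news decision):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(logits)                      # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for ["fake", "real"]; the winning class is argmax.
probs = softmax([2.0, 0.0])
print(probs)
```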
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_tiny_finetuned_fake_news_detection| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|16.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mrm8488/bert-tiny-finetuned-fake-news-detection --- layout: model title: Translate South Slavic languages to English Pipeline author: John Snow Labs name: translate_zls_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, zls, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `zls` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_zls_en_xx_2.7.0_2.4_1609688390022.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_zls_en_xx_2.7.0_2.4_1609688390022.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_zls_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_zls_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.zls.translate_to.en').predict(text, output_level='sentence') translate_df ```
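Since translation is computationally expensive, long inputs are often fed to `annotate` in batches rather than all at once. A minimal batching helper (not part of Spark NLP; shown only as a sketch) for splitting a list of sentences into fixed-size chunks:

```python
def batch(sentences, size):
    """Split a list of sentences into fixed-size batches for translation."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]

print(batch(["s1", "s2", "s3", "s4", "s5"], 2))  # [['s1', 's2'], ['s3', 's4'], ['s5']]
```

Each batch could then be passed to `pipeline.annotate(...)` in turn, trading latency for a bounded memory footprint.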
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_zls_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Detect Assertion Status from Smoking Status Entity author: John Snow Labs name: assertion_oncology_smoking_status_wip date: 2022-10-01 tags: [licensed, clinical, oncology, en, assertion] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects the assertion status of the Smoking_Status entity. It classifies extractions as Present, Past or Absent. ## Predicted Entities `Absent`, `Past`, `Present` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_smoking_status_wip_en_4.1.0_3.0_1664641973214.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_smoking_status_wip_en_4.1.0_3.0_1664641973214.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(["Smoking_Status"]) assertion = AssertionDLModel.pretrained("assertion_oncology_smoking_status_wip", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion]) data = spark.createDataFrame([["The patient quit smoking three years ago."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", 
"embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Smoking_Status")) val clinical_assertion = AssertionDLModel.pretrained("assertion_oncology_smoking_status_wip","en","clinical/models") .setInputCols(Array("sentence","ner_chunk","embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion)) val data = Seq("""The patient quit smoking three years ago.""").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.oncology_smoking_status").predict("""The patient quit smoking three years ago.""") ```
## Results ```bash | chunk | ner_label | assertion | |:--------|:---------------|:------------| | smoking | Smoking_Status | Past | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_oncology_smoking_status_wip| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion_pred]| |Language:|en| |Size:|1.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label precision recall f1-score support Absent 0.75 1.00 0.86 12.0 Past 0.78 0.93 0.85 15.0 Present 1.00 0.46 0.63 13.0 macro-avg 0.84 0.80 0.78 40.0 weighted-avg 0.84 0.80 0.78 40.0 ``` --- layout: model title: English RobertaForQuestionAnswering (from huxxx657) author: John Snow Labs name: roberta_qa_roberta_base_finetuned_scrambled_squad_5 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-scrambled-squad-5` is a English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_scrambled_squad_5_en_4.0.0_3.0_1655734166418.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_scrambled_squad_5_en_4.0.0_3.0_1655734166418.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_finetuned_scrambled_squad_5","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_finetuned_scrambled_squad_5","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_scrambled_5.by_huxxx657").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_finetuned_scrambled_squad_5| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|463.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-scrambled-squad-5 --- layout: model title: Legal Legends Clause Binary Classifier author: John Snow Labs name: legclf_legends_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `legends` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `legends` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_legends_clause_en_1.0.0_3.2_1660123668543.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_legends_clause_en_1.0.0_3.2_1660123668543.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_legends_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
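For the paragraph-splitting strategy suggested in the description, a minimal splitter by multiline (a simplified stand-in for the techniques in the workshop tutorial, and the sample clause text is made up) could look like:

```python
import re

def split_paragraphs(text):
    """Split a long legal document into paragraphs on blank lines,
    one of the splitting strategies suggested above."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Clause 1. Legends shall appear on each certificate...\n\nClause 2. Governing law..."
print(split_paragraphs(doc))
```

Each resulting paragraph can then be classified independently, keeping every input under the 512-token limit of the embeddings.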
## Results ```bash +-------+ | result| +-------+ |[legends]| |[other]| |[other]| |[legends]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_legends_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support legends 0.98 0.98 0.98 57 other 0.99 0.99 0.99 128 accuracy - - 0.99 185 macro-avg 0.99 0.99 0.99 185 weighted-avg 0.99 0.99 0.99 185 ``` --- layout: model title: German asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s377 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s377 date: 2022-09-26 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s377` is a German model originally trained by jonatasgrosman.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s377_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s377_de_4.2.0_3.0_1664189496988.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s377_de_4.2.0_3.0_1664189496988.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s377", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s377", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
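The snippets above assume an `audioDf` whose `audio_content` column holds arrays of raw audio samples. As a minimal sketch of producing such floats with only the Python standard library (the file name, and the 16 kHz mono 16-bit WAV format Wav2Vec2 expects, are assumptions for illustration):

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit mono WAV file and return samples scaled to [-1.0, 1.0]."""
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    # Each sample is a little-endian signed 16-bit integer.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# The resulting list of floats can then populate the column the AudioAssembler reads:
# audioDf = spark.createDataFrame([[wav_to_floats("sample.wav")]], ["audio_content"])
```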
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s377| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: Fast Neural Machine Translation Model from Celtic Languages to English author: John Snow Labs name: opus_mt_cel_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, cel, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `cel` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_cel_en_xx_2.7.0_2.4_1609168916224.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_cel_en_xx_2.7.0_2.4_1609168916224.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_cel_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Your sentence to translate!"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_cel_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.cel.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_cel_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412` is a German model originally trained by jonatasgrosman. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412_de_4.2.0_3.0_1664112235753.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412_de_4.2.0_3.0_1664112235753.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: Sentence Embeddings - sbert medium (tuned) author: John Snow Labs name: sbert_jsl_medium_uncased date: 2021-05-14 tags: [embeddings, clinical, licensed, en] task: Embeddings language: en nav_key: models edition: Healthcare NLP 3.0.3 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained to generate contextual sentence embeddings of input sentences. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_medium_uncased_en_3.0.3_2.4_1621017111185.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_medium_uncased_en_3.0.3_2.4_1621017111185.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sbiobert_embeddings = BertSentenceEmbeddings\ .pretrained("sbert_jsl_medium_uncased","en","clinical/models")\ .setInputCols(["sentence"])\ .setOutputCol("sbert_embeddings") ``` ```scala val sbiobert_embeddings = BertSentenceEmbeddings .pretrained("sbert_jsl_medium_uncased","en","clinical/models") .setInputCols(Array("sentence")) .setOutputCol("sbert_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed_sentence.bert.jsl_medium_uncased").predict("""Put your text here.""") ```
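Downstream use of these sentence vectors is typically a cosine-similarity comparison, which is also how the STS(cos) figure in the benchmark is computed. A minimal, self-contained sketch with toy vectors standing in for the model's 768-dimensional output:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Identical directions score 1.0; orthogonal directions score 0.0.
print(cosine([1.0, 0.0], [1.0, 0.0]))  # -> 1.0
```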
## Results ```bash Gives a 768-dimensional vector representation of the sentence. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbert_jsl_medium_uncased| |Compatibility:|Healthcare NLP 3.0.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Case sensitive:|false| ## Data Source Tuned on the MedNLI dataset ## Benchmarking ```bash MedNLI Score Acc 0.724 STS(cos) 0.743 ``` --- layout: model title: Translate English to Tongan Pipeline author: John Snow Labs name: translate_en_to date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, to, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `to` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_to_xx_2.7.0_2.4_1609686800591.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_to_xx_2.7.0_2.4_1609686800591.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_to", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_to", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.to').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_to| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering (from rahulchakwate) author: John Snow Labs name: roberta_qa_roberta_large_finetuned_squad date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-finetuned-squad` is an English model originally trained by `rahulchakwate`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_finetuned_squad_en_4.0.0_3.0_1655736911099.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_finetuned_squad_en_4.0.0_3.0_1655736911099.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_large_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.large.by_rahulchakwate").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
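For intuition about what the `answer` column contains, here is a simplified, hypothetical sketch of extractive span selection: score every (start, end) token pair from per-token logits and keep the best one. The real model scores WordPiece tokens and applies extra constraints such as a maximum answer length; the logits below are toy values, not model output.

```python
def best_span(start_logits, end_logits):
    """Return the (start, end) indices maximizing start + end score, with end >= start."""
    best = (0, 0, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        for e in range(s, len(end_logits)):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best[0], best[1]

tokens = ["My", "name", "is", "Clara"]
start, end = best_span([0.1, 0.2, 0.1, 2.0], [0.1, 0.1, 0.2, 2.5])
print(" ".join(tokens[start:end + 1]))  # -> Clara
```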
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_large_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/rahulchakwate/roberta-large-finetuned-squad --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_6 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-1024-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_6_en_4.3.0_3.0_1674213584769.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_6_en_4.3.0_3.0_1674213584769.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_6","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|439.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-1024-finetuned-squad-seed-6 --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_128_finetuned_squad_seed_8 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-128-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_128_finetuned_squad_seed_8_en_4.3.0_3.0_1674214068825.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_128_finetuned_squad_seed_8_en_4.3.0_3.0_1674214068825.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_128_finetuned_squad_seed_8","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_128_finetuned_squad_seed_8","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_128_finetuned_squad_seed_8| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|423.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-128-finetuned-squad-seed-8 --- layout: model title: English image_classifier_vit_dog ViTForImageClassification from Sena author: John Snow Labs name: image_classifier_vit_dog date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_dog` is an English model originally trained by Sena. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_dog_en_4.1.0_3.0_1660169568437.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_dog_en_4.1.0_3.0_1660169568437.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_dog", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_dog", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
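The classifier's `class` output comes from a softmax over per-label scores produced by the model's classification head. A small, self-contained sketch of that final step (toy logits, not the model's real values):

```python
import math

def softmax(logits):
    """Numerically stable softmax: shift by the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# The predicted class is the index with the highest probability.
print(max(range(len(probs)), key=probs.__getitem__))  # -> 0
```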
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_dog| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Pipeline to Detect Adverse Drug Events (MedicalBertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ade_tweet_binary_pipeline date: 2023-03-20 tags: [clinical, licensed, ade, en, medicalbertfortokenclassification, ner] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ade_tweet_binary](https://nlp.johnsnowlabs.com/2022/07/29/bert_token_classifier_ade_tweet_binary_en_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ade_tweet_binary_pipeline_en_4.3.0_3.2_1679298990358.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ade_tweet_binary_pipeline_en_4.3.0_3.2_1679298990358.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ade_tweet_binary_pipeline", "en", "clinical/models") text = '''I used to be on paxil but that made me more depressed and prozac made me angry. Maybe cos of the insulin blocking effect of seroquel but i do feel sugar crashes when eat fast carbs.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ade_tweet_binary_pipeline", "en", "clinical/models") val text = "I used to be on paxil but that made me more depressed and prozac made me angry. Maybe cos of the insulin blocking effect of seroquel but i do feel sugar crashes when eat fast carbs." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-----------------|--------:|------:|:------------|-------------:| | 0 | depressed | 44 | 52 | ADE | 0.999755 | | 1 | angry | 73 | 77 | ADE | 0.999608 | | 2 | insulin blocking | 97 | 112 | ADE | 0.738712 | | 3 | sugar crashes | 147 | 159 | ADE | 0.993742 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ade_tweet_binary_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Multilingual XLMRoBerta Embeddings (from castorini) author: John Snow Labs name: xlmroberta_embeddings_afriberta_large date: 2022-05-13 tags: [ha, yo, ig, am, so, open_source, xlm_roberta, embeddings, xx, afriberta] task: Embeddings language: xx edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: XlmRoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `afriberta_large` is a Multilingual model originally trained by `castorini`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_afriberta_large_xx_3.4.4_3.0_1652439242600.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_afriberta_large_xx_3.4.4_3.0_1652439242600.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_afriberta_large","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_afriberta_large","xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_embeddings_afriberta_large| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|xx| |Size:|471.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/castorini/afriberta_large - https://github.com/keleog/afriberta --- layout: model title: Detect Chemical Compounds and Genes author: John Snow Labs name: ner_chemprot_clinical date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a pre-trained model that can be used to automatically detect all chemical compounds and gene mentions from medical texts. ## Predicted Entities: `CHEMICAL`, `GENE-Y`, `GENE-N` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CHEMPROT_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_CHEMPROT_CLINICAL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chemprot_clinical_en_3.0.0_3.0_1617208430062.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chemprot_clinical_en_3.0.0_3.0_1617208430062.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_chemprot_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium."]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_chemprot_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, 
word_embeddings, ner, ner_converter)) val data = Seq("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.chemprot.clinical").predict("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""") ```
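The `NerConverter` stage in the pipeline above merges the model's token-level IOB tags into the entity chunks shown in the results. A simplified, pure-Python re-implementation of that merging logic for intuition (the tags below are toy values, not actual model output):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens (B-X / I-X / O) into (chunk_text, label) pairs."""
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = [tok, tag[2:]]  # start a new chunk
        elif tag.startswith("I-") and current and tag[2:] == current[1]:
            current[0] += " " + tok   # continue the open chunk
        else:
            if current:
                chunks.append(current)
            current = None            # O tag (or mismatched I-) closes the chunk
    if current:
        chunks.append(current)
    return [(text, label) for text, label in chunks]

print(iob_to_chunks(
    ["Keratinocyte", "growth", "factor", "and", "acidic"],
    ["B-GENE-Y", "I-GENE-Y", "I-GENE-Y", "O", "B-CHEMICAL"]))
# -> [('Keratinocyte growth factor', 'GENE-Y'), ('acidic', 'CHEMICAL')]
```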
## Results ```bash +----+---------------------------------+---------+-------+----------+ | | chunk | begin | end | entity | +====+=================================+=========+=======+==========+ | 0 | Keratinocyte growth factor | 0 | 25 | GENE-Y | +----+---------------------------------+---------+-------+----------+ | 1 | acidic fibroblast growth factor | 31 | 61 | GENE-Y | +----+---------------------------------+---------+-------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_chemprot_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source This model was trained on the ChemProt corpus using 'embeddings_clinical' embeddings. Make sure you use the same embeddings when running the model. ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|:--------------|-------:|------:|-----:|---------:|---------:|---------:| | 0 | B-GENE-Y | 4650 | 1090 | 838 | 0.810105 | 0.847303 | 0.828286 | | 1 | B-GENE-N | 1732 | 981 | 1019 | 0.638408 | 0.629589 | 0.633968 | | 2 | I-GENE-Y | 1846 | 571 | 573 | 0.763757 | 0.763125 | 0.763441 | | 3 | B-CHEMICAL | 7512 | 804 | 1136 | 0.903319 | 0.86864 | 0.88564 | | 4 | I-CHEMICAL | 1059 | 169 | 253 | 0.862378 | 0.807165 | 0.833858 | | 5 | I-GENE-N | 1393 | 853 | 598 | 0.620214 | 0.699648 | 0.657541 | | 6 | Macro-average | 18192 | 4468 | 4417 | 0.766363 | 0.769245 | 0.767801 | | 7 | Micro-average | 18192 | 4468 | 4417 | 0.802824 | 0.804635 | 0.803729 | ``` --- layout: model title: Arabic BertForMaskedLM Large Cased model (from aubmindlab) author: John Snow Labs name: bert_embeddings_large_arabertv02 date: 2022-12-02 tags: [ar, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ar edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## 
Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-arabertv02` is an Arabic model originally trained by `aubmindlab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_arabertv02_ar_4.2.4_3.0_1670019689670.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_arabertv02_ar_4.2.4_3.0_1670019689670.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_arabertv02","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_arabertv02","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
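Downstream, each token in the `embeddings` output column carries a fixed-size float vector (1024 dimensions for a BERT-large model such as this one). Comparing two tokens then reduces to cosine similarity over those vectors; a minimal sketch, with short made-up vectors standing in for real embeddings pulled from the result:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy stand-ins for token vectors taken from result["embeddings"].
v1 = [0.1, 0.3, -0.2, 0.7]
v2 = [0.2, 0.25, -0.1, 0.6]
print(round(cosine(v1, v2), 3))  # close to 1.0 for similar vectors
```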
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_large_arabertv02| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|1.4 GB| |Case sensitive:|true| ## References - https://huggingface.co/aubmindlab/bert-large-arabertv02 - https://github.com/google-research/bert - https://arxiv.org/abs/2003.00104 - https://github.com/WissamAntoun/pydata_khobar_meetup - http://alt.qcri.org/farasa/segmenter.html - https://github.com/google-research/bert/blob/master/multilingual.md - https://github.com/elnagara/HARD-Arabic-Dataset - https://www.aclweb.org/anthology/D15-1299 - https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf - https://github.com/mohamedadaly/LABR - http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp - https://github.com/husseinmozannar/SOQAL - https://github.com/aub-mind/arabert/blob/master/AraBERT/README.md - https://arxiv.org/abs/2003.00104v2 - https://archive.org/details/arwiki-20190201 - https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4 - https://www.aclweb.org/anthology/W19-4619 - https://sites.aub.edu.lb/mindlab/ - https://www.yakshof.com/#/ - https://www.behance.net/rahalhabib - https://www.linkedin.com/in/wissam-antoun-622142b4/ - https://twitter.com/wissam_antoun - https://github.com/WissamAntoun - https://www.linkedin.com/in/fadybaly/ - https://twitter.com/fadybaly - https://github.com/fadybaly --- layout: model title: Map Companies to their Acquisitions (wikipedia, en) author: John Snow Labs name: finmapper_wikipedia_parentcompanies date: 2023-01-13 tags: [parent, companies, subsidiaries, en, licensed] task: Chunk Mapping language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: 
"Python-Scala-Java" --- ## Description This model allows you, given an extracted ORG entity, to retrieve all of its parent companies, subsidiaries, acquired companies and/or companies in the same group. IMPORTANT: This requires an exact match with the name as it appears in Wikidata. If you are not sure the name matches, please run `finel_wikipedia_parentcompanies` first to normalize the company name. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finmapper_wikipedia_parentcompanies_en_1.0.0_3.0_1673610612510.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finmapper_wikipedia_parentcompanies_en_1.0.0_3.0_1673610612510.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = finance.NerModel.pretrained('finner_orgs_prods_alias', 'en', 'finance/models')\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") # Optional: To normalize the ORG name using Wikipedia data before the mapping ########################################################################## chunkToDoc = nlp.Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") chunk_embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \ .setInputCols("ner_chunk_doc") \ .setOutputCol("sentence_embeddings") use_er_model = finance.SentenceEntityResolverModel.pretrained("finel_wikipedia_parentcompanies", "en", "finance/models") \ .setInputCols(["ner_chunk_doc", "sentence_embeddings"]) \ .setOutputCol("normalized")\ .setDistanceFunction("EUCLIDEAN") ########################################################################## cm = finance.ChunkMapperModel()\ .pretrained("finmapper_wikipedia_parentcompanies", "en", "finance/models")\ .setInputCols(["normalized"])\ .setOutputCol("mappings") # or ner_chunk for non normalized versions nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter, chunkToDoc, chunk_embeddings, use_er_model, cm ]) text = ["""Barclays is an American multinational bank which operates worldwide."""] test_data = spark.createDataFrame([text]).toDF("text") model 
= nlpPipeline.fit(test_data) lp = nlp.LightPipeline(model) lp.annotate(text) ```
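`annotate` returns a plain Python dict; its `mappings` entry (shown in the Results section) is a flat list that mixes Wikidata entity URIs, property URIs and labels. Splitting it by URI prefix takes a couple of comprehensions; the sample values here are abbreviated from the Results output:

```python
# Split a finmapper-style mappings list into Wikidata entities,
# properties and plain labels (sample values from the Results output).
mappings = [
    "http://www.wikidata.org/entity/Q245343",
    "Barclays@en-ca",
    "http://www.wikidata.org/prop/direct/P355",
    "is_parent_of",
    "London Stock Exchange@en",
]
entities = [m for m in mappings if m.startswith("http://www.wikidata.org/entity/")]
props = [m for m in mappings if m.startswith("http://www.wikidata.org/prop/")]
labels = [m for m in mappings if not m.startswith("http://")]
print(entities, props, labels)
```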
## Results ```bash {'mappings': ['http://www.wikidata.org/entity/Q245343', 'Barclays@en-ca', 'http://www.wikidata.org/prop/direct/P355', 'is_parent_of', 'London Stock Exchange@en', 'BARC', 'בנק ברקליס@he', 'http://www.wikidata.org/entity/Q29488227'], ... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finmapper_wikipedia_parentcompanies| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|852.6 KB| ## References Wikidata --- layout: model title: Modern Greek (1453-) asr_wav2vec2_large_xlsr_greek_1 TFWav2Vec2ForCTC from skylord author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_greek_1 date: 2022-09-25 tags: [wav2vec2, el, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: el edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_greek_1` is a Modern Greek (1453-) model originally trained by skylord. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_greek_1_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_greek_1_el_4.2.0_3.0_1664110254247.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_greek_1_el_4.2.0_3.0_1664110254247.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_greek_1', lang = 'el') annotations = pipeline.transform(audioDF) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_greek_1", lang = "el") val annotations = pipeline.transform(audioDF) ```
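The pipeline consumes a DataFrame whose `audio_content` column holds arrays of floats (16 kHz mono audio). Assuming the input is a mono 16-bit PCM WAV file, the standard-library `wave` module is enough to produce that array; the helper name below is illustrative:

```python
import struct
import wave

def wav_to_floats(path):
    """Read a mono 16-bit PCM WAV file into floats in [-1, 1] plus its sample rate."""
    with wave.open(path, "rb") as wav:
        assert wav.getnchannels() == 1 and wav.getsampwidth() == 2
        frames = wav.readframes(wav.getnframes())
        samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
        return [s / 32768.0 for s in samples], wav.getframerate()

# floats, rate = wav_to_floats("speech_16khz.wav")
# audioDF = spark.createDataFrame([[floats]], ["audio_content"])
```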
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_greek_1| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|el| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: NER Pipeline - Voice of the Patient author: John Snow Labs name: ner_vop_pipeline date: 2023-06-09 tags: [pipeline, ner, en, licensed, vop] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline includes the full taxonomy Named-Entity Recognition model to extract information from health-related text in colloquial language. This pipeline extracts diagnoses, treatments, tests, anatomical references and demographic entities. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_pipeline_en_4.4.3_3.0_1686338017684.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_pipeline_en_4.4.3_3.0_1686338017684.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_vop_pipeline", "en", "clinical/models") pipeline.annotate(" Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. ") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_vop_pipeline", "en", "clinical/models") val result = pipeline.annotate(" Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. 
Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. ") ```
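The annotate output pairs each extracted chunk with its label, as the Results table below shows in full; tallying how often each label fires is a `collections.Counter` over those pairs. A sketch using a handful of pairs copied from the table:

```python
from collections import Counter

# A few (chunk, label) pairs abbreviated from the Results table.
pairs = [
    ("20 year old", "Age"),
    ("girl", "Gender"),
    ("hyperthyroid", "Disease"),
    ("panic attacks", "PsychologicalCondition"),
    ("depression", "PsychologicalCondition"),
    ("TSH", "Test"),
    ("0.15", "TestResult"),
]
label_counts = Counter(label for _, label in pairs)
print(label_counts.most_common(1))  # → [('PsychologicalCondition', 2)]
```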
## Results ```bash | chunk | ner_label | |:---------------------|:-----------------------| | 20 year old | Age | | girl | Gender | | hyperthyroid | Disease | | 1 month ago | DateTime | | weak | Symptom | | light | Symptom | | panic attacks | PsychologicalCondition | | depression | PsychologicalCondition | | left | Laterality | | chest | BodyPart | | pain | Symptom | | increased | TestResult | | heart rate | VitalTest | | rapidly | Modifier | | weight loss | Symptom | | 4 months | Duration | | hospital | ClinicalDept | | discharged | AdmissionDischarge | | hospital | ClinicalDept | | blood tests | Test | | brain | BodyPart | | mri | Test | | ultrasound scan | Test | | endoscopy | Procedure | | doctors | Employment | | homeopathy doctor | Employment | | he | Gender | | hyperthyroid | Disease | | TSH | Test | | 0.15 | TestResult | | T3 | Test | | T4 | Test | | normal | TestResult | | b12 deficiency | Disease | | vitamin D deficiency | Disease | | weekly | Frequency | | supplement | Drug | | vitamin D | Drug | | 1000 mcg | Dosage | | b12 | Drug | | daily | Frequency | | homeopathy medicine | Drug | | 40 days | Duration | | after 30 days | DateTime | | TSH | Test | | 0.5 | TestResult | | now | DateTime | | weakness | Symptom | | depression | PsychologicalCondition | | last week | DateTime | | rapid | TestResult | | heartrate | VitalTest | | homeopathy | Treatment | | thyroid | BodyPart | | allopathy | Drug | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|791.7 MB| |Dependencies:|embeddings_clinical_medium| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English image_classifier_vit_deit_base_patch16_224 ViTForImageClassification from facebook author: John Snow Labs name: 
image_classifier_vit_deit_base_patch16_224 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_deit_base_patch16_224` is an English model originally trained by facebook. ## Predicted Entities `turnstile`, `damselfly`, `mixing bowl`, `sea snake`, `cockroach, roach`, `buckle`, `beer glass`, `bulbul`, `lumbermill, sawmill`, `whippet`, `Australian terrier`, `television, television system`, `hoopskirt, crinoline`, `horse cart, horse-cart`, `guillotine`, `malamute, malemute, Alaskan malamute`, `coyote, prairie wolf, brush wolf, Canis latrans`, `colobus, colobus monkey`, `hognose snake, puff adder, sand viper`, `sock`, `burrito`, `printer`, `bathing cap, swimming cap`, `chiton, coat-of-mail shell, sea cradle, polyplacophore`, `Rottweiler`, `cello, violoncello`, `pitcher, ewer`, `computer keyboard, keypad`, `bow`, `peacock`, `ballplayer, baseball player`, `refrigerator, icebox`, `solar dish, solar collector, solar furnace`, `passenger car, coach, carriage`, `African chameleon, Chamaeleo chamaeleon`, `oboe, hautboy, hautbois`, `toyshop`, `Leonberg`, `howler monkey, howler`, `bluetick`, `African elephant, Loxodonta africana`, `American lobster, Northern lobster, Maine lobster, Homarus americanus`, `combination lock`, `black-and-tan coonhound`, `bonnet, poke bonnet`, `harvester, reaper`, `Appenzeller`, `iron, smoothing iron`, `electric locomotive`, `lycaenid, lycaenid butterfly`, `sandbar, sand bar`, `Cardigan, Cardigan Welsh corgi`, `pencil sharpener`, `jean, blue jean, denim`, `backpack, back pack, knapsack, packsack, rucksack, haversack`, `monitor`, `ice cream, icecream`, `apiary, bee
house`, `water jug`, `American coot, marsh hen, mud hen, water hen, Fulica americana`, `ground beetle, carabid beetle`, `jigsaw puzzle`, `ant, emmet, pismire`, `wreck`, `kuvasz`, `gyromitra`, `Ibizan hound, Ibizan Podenco`, `brown bear, bruin, Ursus arctos`, `bolo tie, bolo, bola tie, bola`, `Pembroke, Pembroke Welsh corgi`, `French bulldog`, `prison, prison house`, `ballpoint, ballpoint pen, ballpen, Biro`, `stage`, `airliner`, `dogsled, dog sled, dog sleigh`, `redshank, Tringa totanus`, `menu`, `Indian cobra, Naja naja`, `swab, swob, mop`, `window screen`, `brain coral`, `artichoke, globe artichoke`, `loupe, jeweler's loupe`, `loudspeaker, speaker, speaker unit, loudspeaker system, speaker system`, `panpipe, pandean pipe, syrinx`, `wok`, `croquet ball`, `plate`, `scoreboard`, `Samoyed, Samoyede`, `ocarina, sweet potato`, `beaver`, `borzoi, Russian wolfhound`, `horizontal bar, high bar`, `stretcher`, `seat belt, seatbelt`, `obelisk`, `forklift`, `feather boa, boa`, `frying pan, frypan, skillet`, `barbershop`, `hamper`, `face powder`, `Siamese cat, Siamese`, `ladle`, `dingo, warrigal, warragal, Canis dingo`, `mountain tent`, `head cabbage`, `echidna, spiny anteater, anteater`, `Polaroid camera, Polaroid Land camera`, `dumbbell`, `espresso`, `notebook, notebook computer`, `Norfolk terrier`, `binoculars, field glasses, opera glasses`, `carpenter's kit, tool kit`, `moving van`, `catamaran`, `tiger beetle`, `bikini, two-piece`, `Siberian husky`, `studio couch, day bed`, `bulletproof vest`, `lawn mower, mower`, `promontory, headland, head, foreland`, `soap dispenser`, `vulture`, `dam, dike, dyke`, `brambling, Fringilla montifringilla`, `toilet tissue, toilet paper, bathroom tissue`, `ringlet, ringlet butterfly`, `tiger cat`, `mobile home, manufactured home`, `Norwich terrier`, `little blue heron, Egretta caerulea`, `English setter`, `Tibetan mastiff`, `rocking chair, rocker`, `mask`, `maze, labyrinth`, `bookcase`, `viaduct`, `sweatshirt`, `plow, plough`, `basenji`, 
`typewriter keyboard`, `Windsor tie`, `coral fungus`, `desktop computer`, `Kerry blue terrier`, `Angora, Angora rabbit`, `can opener, tin opener`, `shield, buckler`, `triumphal arch`, `horned viper, cerastes, sand viper, horned asp, Cerastes cornutus`, `miniature schnauzer`, `tape player`, `jaguar, panther, Panthera onca, Felis onca`, `hook, claw`, `file, file cabinet, filing cabinet`, `chime, bell, gong`, `shower curtain`, `window shade`, `acoustic guitar`, `gas pump, gasoline pump, petrol pump, island dispenser`, `cicada, cicala`, `Petri dish`, `paintbrush`, `banana`, `chickadee`, `mountain bike, all-terrain bike, off-roader`, `lighter, light, igniter, ignitor`, `oil filter`, `cab, hack, taxi, taxicab`, `Christmas stocking`, `rugby ball`, `black widow, Latrodectus mactans`, `bustard`, `fiddler crab`, `web site, website, internet site, site`, `chocolate sauce, chocolate syrup`, `chainlink fence`, `fireboat`, `cocktail shaker`, `airship, dirigible`, `projectile, missile`, `bagel, beigel`, `screwdriver`, `oystercatcher, oyster catcher`, `pot, flowerpot`, `water bottle`, `Loafer`, `drumstick`, `soccer ball`, `cairn, cairn terrier`, `padlock`, `tow truck, tow car, wrecker`, `bloodhound, sleuthhound`, `punching bag, punch bag, punching ball, punchball`, `great grey owl, great gray owl, Strix nebulosa`, `scale, weighing machine`, `trench coat`, `briard`, `cheetah, chetah, Acinonyx jubatus`, `entertainment center`, `Boston bull, Boston terrier`, `Arabian camel, dromedary, Camelus dromedarius`, `steam locomotive`, `coil, spiral, volute, whorl, helix`, `plane, carpenter's plane, woodworking plane`, `gondola`, `spider web, spider's web`, `bathtub, bathing tub, bath, tub`, `pelican`, `miniature poodle`, `cowboy boot`, `perfume, essence`, `lakeside, lakeshore`, `timber wolf, grey wolf, gray wolf, Canis lupus`, `moped`, `sunscreen, sunblock, sun blocker`, `Brabancon griffon`, `puffer, pufferfish, blowfish, globefish`, `lifeboat`, `pool table, billiard table, snooker table`, 
`Bouvier des Flandres, Bouviers des Flandres`, `Pomeranian`, `theater curtain, theatre curtain`, `marimba, xylophone`, `baboon`, `vacuum, vacuum cleaner`, `pill bottle`, `pick, plectrum, plectron`, `hen`, `American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier`, `digital watch`, `pier`, `oxygen mask`, `Tibetan terrier, chrysanthemum dog`, `ostrich, Struthio camelus`, `water ouzel, dipper`, `drilling platform, offshore rig`, `magnetic compass`, `throne`, `butternut squash`, `minibus`, `EntleBucher`, `carousel, carrousel, merry-go-round, roundabout, whirligig`, `hot pot, hotpot`, `rain barrel`, `wood rabbit, cottontail, cottontail rabbit`, `miniature pinscher`, `partridge`, `three-toed sloth, ai, Bradypus tridactylus`, `English springer, English springer spaniel`, `corkscrew, bottle screw`, `fur coat`, `robin, American robin, Turdus migratorius`, `dowitcher`, `ruddy turnstone, Arenaria interpres`, `water snake`, `stove`, `Great Pyrenees`, `soft-coated wheaten terrier`, `carbonara`, `snail`, `breastplate, aegis, egis`, `wolf spider, hunting spider`, `hatchet`, `CD player`, `axolotl, mud puppy, Ambystoma mexicanum`, `pomegranate`, `poncho`, `leatherback turtle, leatherback, leathery turtle, Dermochelys coriacea`, `lorikeet`, `spatula`, `jay`, `platypus, duckbill, duckbilled platypus, duck-billed platypus, Ornithorhynchus anatinus`, `stethoscope`, `flagpole, flagstaff`, `coho, cohoe, coho salmon, blue jack, silver salmon, Oncorhynchus kisutch`, `agama`, `red wolf, maned wolf, Canis rufus, Canis niger`, `beaker`, `eft`, `pretzel`, `brassiere, bra, bandeau`, `frilled lizard, Chlamydosaurus kingi`, `joystick`, `goldfish, Carassius auratus`, `fig`, `maypole`, `caldron, cauldron`, `admiral`, `impala, Aepyceros melampus`, `spotted salamander, Ambystoma maculatum`, `syringe`, `hog, pig, grunter, squealer, Sus scrofa`, `handkerchief, hankie, hanky, hankey`, `tarantula`, `cheeseburger`, `pinwheel`, `sax, saxophone`, `dung beetle`, 
`broccoli`, `cassette player`, `milk can`, `traffic light, traffic signal, stoplight`, `shovel`, `sarong`, `tabby, tabby cat`, `parallel bars, bars`, `ladybug, ladybeetle, lady beetle, ladybird, ladybird beetle`, `quill, quill pen`, `giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca`, `steel drum`, `quail`, `Blenheim spaniel`, `wig`, `hamster`, `ice lolly, lolly, lollipop, popsicle`, `seashore, coast, seacoast, sea-coast`, `chest`, `worm fence, snake fence, snake-rail fence, Virginia fence`, `missile`, `beer bottle`, `yellow lady's slipper, yellow lady-slipper, Cypripedium calceolus, Cypripedium parviflorum`, `breakwater, groin, groyne, mole, bulwark, seawall, jetty`, `white wolf, Arctic wolf, Canis lupus tundrarum`, `guacamole`, `porcupine, hedgehog`, `trolleybus, trolley coach, trackless trolley`, `greenhouse, nursery, glasshouse`, `trimaran`, `Italian greyhound`, `potter's wheel`, `jacamar`, `wallet, billfold, notecase, pocketbook`, `Lakeland terrier`, `green lizard, Lacerta viridis`, `indigo bunting, indigo finch, indigo bird, Passerina cyanea`, `green mamba`, `walking stick, walkingstick, stick insect`, `crossword puzzle, crossword`, `eggnog`, `barrow, garden cart, lawn cart, wheelbarrow`, `remote control, remote`, `bicycle-built-for-two, tandem bicycle, tandem`, `wool, woolen, woollen`, `black grouse`, `abaya`, `marmoset`, `golf ball`, `jeep, landrover`, `Mexican hairless`, `dishwasher, dish washer, dishwashing machine`, `jersey, T-shirt, tee shirt`, `planetarium`, `goose`, `mailbox, letter box`, `capuchin, ringtail, Cebus capucinus`, `marmot`, `orangutan, orang, orangutang, Pongo pygmaeus`, `coffeepot`, `ambulance`, `shopping basket`, `pop bottle, soda bottle`, `red fox, Vulpes vulpes`, `crash helmet`, `street sign`, `affenpinscher, monkey pinscher, monkey dog`, `Arctic fox, white fox, Alopex lagopus`, `sidewinder, horned rattlesnake, Crotalus cerastes`, `ruffed grouse, partridge, Bonasa umbellus`, `muzzle`, `measuring cup`, `canoe`, `reflex 
camera`, `fox squirrel, eastern fox squirrel, Sciurus niger`, `French loaf`, `killer whale, killer, orca, grampus, sea wolf, Orcinus orca`, `dial telephone, dial phone`, `thimble`, `bubble`, `vizsla, Hungarian pointer`, `running shoe`, `mailbag, postbag`, `radio telescope, radio reflector`, `piggy bank, penny bank`, `Chihuahua`, `chambered nautilus, pearly nautilus, nautilus`, `Airedale, Airedale terrier`, `kimono`, `green snake, grass snake`, `rubber eraser, rubber, pencil eraser`, `upright, upright piano`, `orange`, `revolver, six-gun, six-shooter`, `ashcan, trash can, garbage can, wastebin, ash bin, ash-bin, ashbin, dustbin, trash barrel, trash bin`, `drum, membranophone, tympan`, `Dungeness crab, Cancer magister`, `lipstick, lip rouge`, `gong, tam-tam`, `fountain`, `tub, vat`, `malinois`, `sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita`, `German short-haired pointer`, `apron`, `Irish setter, red setter`, `dishrag, dishcloth`, `school bus`, `candle, taper, wax light`, `bib`, `cash machine, cash dispenser, automated teller machine, automatic teller machine, automated teller, automatic teller, ATM`, `power drill`, `English foxhound`, `miniskirt, mini`, `swing`, `slug`, `hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa`, `rifle`, `Saluki, gazelle hound`, `Sealyham terrier, Sealyham`, `bullet train, bullet`, `hyena, hyaena`, `ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus`, `toy terrier`, `goblet`, `safe`, `cup`, `electric guitar`, `red wine`, `restaurant, eating house, eating place, eatery`, `wall clock`, `washbasin, handbasin, washbowl, lavabo, wash-hand basin`, `red-breasted merganser, Mergus serrator`, `crate`, `banded gecko`, `hippopotamus, hippo, river horse, Hippopotamus amphibius`, `tick`, `tripod`, `sombrero`, `desk`, `sea slug, nudibranch`, `racer, race car, racing car`, `pizza, pizza pie`, `dining table, board`, `Saint Bernard, St Bernard`, `komondor`, `electric ray, crampfish, numbfish, torpedo`, 
`prairie chicken, prairie grouse, prairie fowl`, `coffee mug`, `hammer`, `golfcart, golf cart`, `unicycle, monocycle`, `bison`, `soup bowl`, `rapeseed`, `golden retriever`, `plastic bag`, `grey fox, gray fox, Urocyon cinereoargenteus`, `water tower`, `house finch, linnet, Carpodacus mexicanus`, `barbell`, `hair slide`, `tiger, Panthera tigris`, `black-footed ferret, ferret, Mustela nigripes`, `meat loaf, meatloaf`, `hand blower, blow dryer, blow drier, hair dryer, hair drier`, `overskirt`, `gibbon, Hylobates lar`, `Gila monster, Heloderma suspectum`, `toucan`, `snowmobile`, `pencil box, pencil case`, `scuba diver`, `cloak`, `Sussex spaniel`, `otter`, `Greater Swiss Mountain dog`, `great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias`, `torch`, `magpie`, `tiger shark, Galeocerdo cuvieri`, `wing`, `Border collie`, `bell cote, bell cot`, `sea anemone, anemone`, `teapot`, `sea urchin`, `screen, CRT screen`, `bookshop, bookstore, bookstall`, `oscilloscope, scope, cathode-ray oscilloscope, CRO`, `crib, cot`, `police van, police wagon, paddy wagon, patrol wagon, wagon, black Maria`, `hartebeest`, `manhole cover`, `iPod`, `rock python, rock snake, Python sebae`, `nipple`, `suspension bridge`, `safety pin`, `sea lion`, `cougar, puma, catamount, mountain lion, painter, panther, Felis concolor`, `mantis, mantid`, `wardrobe, closet, press`, `projector`, `Granny Smith`, `diamondback, diamondback rattlesnake, Crotalus adamanteus`, `pirate, pirate ship`, `espresso maker`, `African hunting dog, hyena dog, Cape hunting dog, Lycaon pictus`, `cradle`, `common newt, Triturus vulgaris`, `tricycle, trike, velocipede`, `bobsled, bobsleigh, bob`, `thunder snake, worm snake, Carphophis amoenus`, `thresher, thrasher, threshing machine`, `banjo`, `armadillo`, `pajama, pyjama, pj's, jammies`, `ski`, `Maltese dog, Maltese terrier, Maltese`, `leafhopper`, `book jacket, dust cover, dust jacket, dust wrapper`, `silky terrier, Sydney silky`, `Shih-Tzu`, `wallaby, 
brush kangaroo`, `cardigan`, `sturgeon`, `freight car`, `home theater, home theatre`, `sundial`, `African crocodile, Nile crocodile, Crocodylus niloticus`, `odometer, hodometer, mileometer, milometer`, `sliding door`, `vine snake`, `West Highland white terrier`, `mongoose`, `hornbill`, `beagle`, `European gallinule, Porphyrio porphyrio`, `submarine, pigboat, sub, U-boat`, `Komodo dragon, Komodo lizard, dragon lizard, giant lizard, Varanus komodoensis`, `cock`, `pedestal, plinth, footstall`, `accordion, piano accordion, squeeze box`, `gown`, `lynx, catamount`, `guenon, guenon monkey`, `Walker hound, Walker foxhound`, `standard schnauzer`, `reel`, `hip, rose hip, rosehip`, `grasshopper, hopper`, `Dutch oven`, `stone wall`, `hard disc, hard disk, fixed disk`, `snow leopard, ounce, Panthera uncia`, `shopping cart`, `digital clock`, `hourglass`, `Border terrier`, `Old English sheepdog, bobtail`, `academic gown, academic robe, judge's robe`, `spiny lobster, langouste, rock lobster, crawfish, crayfish, sea crawfish`, `spotlight, spot`, `dome`, `barn spider, Araneus cavaticus`, `bee eater`, `basketball`, `cliff dwelling`, `folding chair`, `isopod`, `Doberman, Doberman pinscher`, `bittern`, `sunglasses, dark glasses, shades`, `picket fence, paling`, `Crock Pot`, `ibex, Capra ibex`, `neck brace`, `cardoon`, `cassette`, `amphibian, amphibious vehicle`, `minivan`, `analog clock`, `trailer truck, tractor trailer, trucking rig, rig, articulated lorry, semi`, `yurt`, `cliff, drop, drop-off`, `Bernese mountain dog`, `teddy, teddy bear`, `sloth bear, Melursus ursinus, Ursus ursinus`, `bassoon`, `toaster`, `ptarmigan`, `Gordon setter`, `night snake, Hypsiglena torquata`, `grand piano, grand`, `purse`, `clumber, clumber spaniel`, `shoji`, `hair spray`, `maillot`, `knee pad`, `space heater`, `bottlecap`, `chiffonier, commode`, `chain saw, chainsaw`, `sulphur butterfly, sulfur butterfly`, `pay-phone, pay-station`, `kelpie`, `mouse, computer mouse`, `car wheel`, `cornet, horn, trumpet, 
trump`, `container ship, containership, container vessel`, `matchstick`, `scabbard`, `American black bear, black bear, Ursus americanus, Euarctos americanus`, `langur`, `rock crab, Cancer irroratus`, `lionfish`, `speedboat`, `black stork, Ciconia nigra`, `knot`, `disk brake, disc brake`, `mosquito net`, `white stork, Ciconia ciconia`, `abacus`, `titi, titi monkey`, `grocery store, grocery, food market, market`, `waffle iron`, `pickelhaube`, `wooden spoon`, `Norwegian elkhound, elkhound`, `earthstar`, `sewing machine`, `balance beam, beam`, `potpie`, `chain mail, ring mail, mail, chain armor, chain armour, ring armor, ring armour`, `Staffordshire bullterrier, Staffordshire bull terrier`, `switch, electric switch, electrical switch`, `dhole, Cuon alpinus`, `paddle, boat paddle`, `limousine, limo`, `Shetland sheepdog, Shetland sheep dog, Shetland`, `space bar`, `library`, `paddlewheel, paddle wheel`, `alligator lizard`, `Band Aid`, `Persian cat`, `bull mastiff`, `tailed frog, bell toad, ribbed toad, tailed toad, Ascaphus trui`, `sports car, sport car`, `football helmet`, `laptop, laptop computer`, `lens cap, lens cover`, `tennis ball`, `violin, fiddle`, `lab coat, laboratory coat`, `cinema, movie theater, movie theatre, movie house, picture palace`, `weasel`, `bow tie, bow-tie, bowtie`, `macaw`, `dough`, `whiskey jug`, `microphone, mike`, `spoonbill`, `bassinet`, `mud turtle`, `velvet`, `warthog`, `plunger, plumber's helper`, `dugong, Dugong dugon`, `honeycomb`, `badger`, `dragonfly, darning needle, devil's darning needle, sewing needle, snake feeder, snake doctor, mosquito hawk, skeeter hawk`, `bee`, `doormat, welcome mat`, `fountain pen`, `giant schnauzer`, `assault rifle, assault gun`, `limpkin, Aramus pictus`, `siamang, Hylobates syndactylus, Symphalangus syndactylus`, `albatross, mollymawk`, `confectionery, confectionary, candy store`, `harp`, `parachute, chute`, `barrel, cask`, `tank, army tank, armored combat vehicle, armoured combat vehicle`, `collie`, `kite`, 
`puck, hockey puck`, `stupa, tope`, `buckeye, horse chestnut, conker`, `patio, terrace`, `broom`, `Dandie Dinmont, Dandie Dinmont terrier`, `scorpion`, `agaric`, `balloon`, `bucket, pail`, `squirrel monkey, Saimiri sciureus`, `Eskimo dog, husky`, `zebra`, `garter snake, grass snake`, `indri, indris, Indri indri, Indri brevicaudatus`, `tractor`, `guinea pig, Cavia cobaya`, `maraca`, `red-backed sandpiper, dunlin, Erolia alpina`, `bullfrog, Rana catesbeiana`, `trilobite`, `Japanese spaniel`, `gorilla, Gorilla gorilla`, `monastery`, `centipede`, `terrapin`, `llama`, `long-horned beetle, longicorn, longicorn beetle`, `boxer`, `curly-coated retriever`, `mortar`, `hammerhead, hammerhead shark`, `goldfinch, Carduelis carduelis`, `garden spider, Aranea diademata`, `stopwatch, stop watch`, `grey whale, gray whale, devilfish, Eschrichtius gibbosus, Eschrichtius robustus`, `leaf beetle, chrysomelid`, `birdhouse`, `king crab, Alaska crab, Alaskan king crab, Alaska king crab, Paralithodes camtschatica`, `stole`, `bell pepper`, `radiator`, `flatworm, platyhelminth`, `mushroom`, `Scotch terrier, Scottish terrier, Scottie`, `liner, ocean liner`, `toilet seat`, `lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens`, `zucchini, courgette`, `harvestman, daddy longlegs, Phalangium opilio`, `Newfoundland, Newfoundland dog`, `flamingo`, `whiptail, whiptail lizard`, `geyser`, `cleaver, meat cleaver, chopper`, `sea cucumber, holothurian`, `American egret, great white heron, Egretta albus`, `parking meter`, `beacon, lighthouse, beacon light, pharos`, `coucal`, `motor scooter, scooter`, `mitten`, `cannon`, `weevil`, `megalith, megalithic structure`, `stinkhorn, carrion fungus`, `ear, spike, capitulum`, `box turtle, box tortoise`, `snowplow, snowplough`, `tench, Tinca tinca`, `modem`, `tobacco shop, tobacconist shop, tobacconist`, `barn`, `skunk, polecat, wood pussy`, `African grey, African gray, Psittacus erithacus`, `Madagascar cat, ring-tailed lemur, Lemur catta`, 
`holster`, `barometer`, `sleeping bag`, `washer, automatic washer, washing machine`, `recreational vehicle, RV, R.V.`, `drake`, `tray`, `butcher shop, meat market`, `china cabinet, china closet`, `medicine chest, medicine cabinet`, `photocopier`, `Yorkshire terrier`, `starfish, sea star`, `racket, racquet`, `park bench`, `Labrador retriever`, `whistle`, `clog, geta, patten, sabot`, `volcano`, `quilt, comforter, comfort, puff`, `leopard, Panthera pardus`, `cauliflower`, `swimming trunks, bathing trunks`, `American chameleon, anole, Anolis carolinensis`, `alp`, `mortarboard`, `barracouta, snoek`, `cocker spaniel, English cocker spaniel, cocker`, `space shuttle`, `beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon`, `harmonica, mouth organ, harp, mouth harp`, `gasmask, respirator, gas helmet`, `wombat`, `Model T`, `wild boar, boar, Sus scrofa`, `hermit crab`, `flat-coated retriever`, `rotisserie`, `jinrikisha, ricksha, rickshaw`, `trifle`, `bannister, banister, balustrade, balusters, handrail`, `go-kart`, `bakery, bakeshop, bakehouse`, `ski mask`, `dock, dockage, docking facility`, `Egyptian cat`, `oxcart`, `redbone`, `shoe shop, shoe-shop, shoe store`, `convertible`, `ox`, `crayfish, crawfish, crawdad, crawdaddy`, `cowboy hat, ten-gallon hat`, `conch`, `spaghetti squash`, `toy poodle`, `saltshaker, salt shaker`, `microwave, microwave oven`, `triceratops`, `necklace`, `castle`, `streetcar, tram, tramcar, trolley, trolley car`, `eel`, `diaper, nappy, napkin`, `standard poodle`, `prayer rug, prayer mat`, `radio, wireless`, `crane`, `envelope`, `rule, ruler`, `gar, garfish, garpike, billfish, Lepisosteus osseus`, `spider monkey, Ateles geoffroyi`, `Irish wolfhound`, `German shepherd, German shepherd dog, German police dog, alsatian`, `umbrella`, `sunglass`, `aircraft carrier, carrier, flattop, attack aircraft carrier`, `water buffalo, water ox, Asiatic buffalo, Bubalus bubalis`, `jellyfish`, `groom, bridegroom`, `tree frog, tree-frog`, 
`steel arch bridge`, `lemon`, `pickup, pickup truck`, `vault`, `groenendael`, `baseball`, `junco, snowbird`, `maillot, tank suit`, `gazelle`, `jack-o'-lantern`, `military uniform`, `slide rule, slipstick`, `wire-haired fox terrier`, `acorn squash`, `electric fan, blower`, `Brittany spaniel`, `chimpanzee, chimp, Pan troglodytes`, `pillow`, `binder, ring-binder`, `schipperke`, `Afghan hound, Afghan`, `plate rack`, `car mirror`, `hand-held computer, hand-held microcomputer`, `papillon`, `schooner`, `Bedlington terrier`, `cellular telephone, cellular phone, cellphone, cell, mobile phone`, `altar`, `Chesapeake Bay retriever`, `cabbage butterfly`, `polecat, fitch, foulmart, foumart, Mustela putorius`, `comic book`, `French horn, horn`, `daisy`, `organ, pipe organ`, `mashed potato`, `acorn`, `fly`, `chain`, `American alligator, Alligator mississipiensis`, `mink`, `garbage truck, dustcart`, `totem pole`, `wine bottle`, `strawberry`, `cricket`, `European fire salamander, Salamandra salamandra`, `coral reef`, `Welsh springer spaniel`, `bighorn, bighorn sheep, cimarron, Rocky Mountain bighorn, Rocky Mountain sheep, Ovis canadensis`, `snorkel`, `bald eagle, American eagle, Haliaeetus leucocephalus`, `meerkat, mierkat`, `grille, radiator grille`, `nematode, nematode worm, roundworm`, `anemone fish`, `corn`, `loggerhead, loggerhead turtle, Caretta caretta`, `palace`, `suit, suit of clothes`, `pineapple, ananas`, `macaque`, `ping-pong ball`, `ram, tup`, `church, church building`, `koala, koala bear, kangaroo bear, native bear, Phascolarctos cinereus`, `hare`, `bath towel`, `strainer`, `yawl`, `otterhound, otter hound`, `table lamp`, `king snake, kingsnake`, `lotion`, `lion, king of beasts, Panthera leo`, `thatch, thatched roof`, `basset, basset hound`, `black and gold garden spider, Argiope aurantia`, `barber chair`, `proboscis monkey, Nasalis larvatus`, `consomme`, `Irish terrier`, `Irish water spaniel`, `common iguana, iguana, Iguana iguana`, `Weimaraner`, `Great Dane`, `pug, 
pug-dog`, `rhinoceros beetle`, `vase`, `brass, memorial tablet, plaque`, `kit fox, Vulpes macrotis`, `king penguin, Aptenodytes patagonica`, `vending machine`, `dalmatian, coach dog, carriage dog`, `rock beauty, Holocanthus tricolor`, `pole`, `cuirass`, `bolete`, `jackfruit, jak, jack`, `monarch, monarch butterfly, milkweed butterfly, Danaus plexippus`, `chow, chow chow`, `nail`, `packet`, `half track`, `Lhasa, Lhasa apso`, `boathouse`, `hay`, `valley, vale`, `slot, one-armed bandit`, `volleyball`, `carton`, `shower cap`, `tile roof`, `lacewing, lacewing fly`, `patas, hussar monkey, Erythrocebus patas`, `boa constrictor, Constrictor constrictor`, `black swan, Cygnus atratus`, `lampshade, lamp shade`, `mousetrap`, `crutch`, `vestment`, `Pekinese, Pekingese, Peke`, `tusker`, `warplane, military plane`, `sandal`, `screw`, `custard apple`, `Scottish deerhound, deerhound`, `spindle`, `keeshond`, `hummingbird`, `letter opener, paper knife, paperknife`, `cucumber, cuke`, `bearskin, busby, shako`, `fire engine, fire truck`, `trombone`, `ringneck snake, ring-necked snake, ring snake`, `sorrel`, `fire screen, fireguard`, `paper towel`, `flute, transverse flute`, `hotdog, hot dog, red hot`, `Indian elephant, Elephas maximus`, `mosque`, `stingray`, `Rhodesian ridgeback`, `four-poster` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_deit_base_patch16_224_en_4.1.0_3.0_1660165615831.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_deit_base_patch16_224_en_4.1.0_3.0_1660165615831.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_deit_base_patch16_224", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_deit_base_patch16_224", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_deit_base_patch16_224| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|324.7 MB| --- layout: model title: Financial 10K Filings NER author: John Snow Labs name: finner_10k_summary date: 2022-08-17 tags: [en, finance, ner, annual, reports, 10k, filings, licensed] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description IMPORTANT: Don't run this model on the whole financial report. Instead: - Split the report by paragraphs; - Use the `finclf_form_10k_summary_item` Text Classifier to select only the relevant paragraphs. This financial NER model is designed to process the first summary page of 10K filings and extract information about the company submitting the filing: trading data, address and phone numbers, CFN, IRS number, etc. ## Predicted Entities `ADDRESS`, `CFN`, `FISCAL_YEAR`, `IRS`, `ORG`, `PHONE`, `STATE`, `STOCK_EXCHANGE`, `TICKER`, `TITLE_CLASS`, `TITLE_CLASS_VALUE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FINNER_SEC10K_FIRSTPAGE/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_10k_summary_en_1.0.0_3.2_1660732829888.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_10k_summary_en_1.0.0_3.2_1660732829888.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from pyspark.sql import functions as F document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") \ .setCustomBounds(["\n\n"]) tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_finbert_pretrain_yiyanghkust","en")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) ner_model = finance.NerModel.pretrained("finner_10k_summary","en","finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame([["""ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES AND EXCHANGE ACT OF 1934 For the annual period ended January 31, 2021 or TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the transition period from________to_______ Commission File Number: 001-38856 PAGERDUTY, INC. (Exact name of registrant as specified in its charter) Delaware 27-2793871 (State or other jurisdiction of incorporation or organization) (I.R.S.
Employer Identification Number) 600 Townsend St., Suite 200, San Francisco, CA 94103 (844) 800-3889 (Address, including zip code, and telephone number, including area code, of registrant’s principal executive offices) Securities registered pursuant to Section 12(b) of the Act: Title of each class Trading symbol(s) Name of each exchange on which registered Common Stock, $0.000005 par value, PD New York Stock Exchange"""]]).toDF("text") result = model.transform(data) result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \ .select(F.expr("cols['0']").alias("ticker"), F.expr("cols['1']['entity']").alias("label")).show(50, truncate = False) ```
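The final `select` above uses `arrays_zip` to pair each extracted chunk with the entity label stored in its metadata. The same pairing logic in plain Python, with a few of the sample chunks used as assumed illustration values:

```python
# Plain-Python analogue of the arrays_zip + select step: pair each
# NER chunk with the entity label from its metadata map.
# The values below are illustrative samples, not live model output.
chunks = ["001-38856", "PAGERDUTY, INC", "PD"]                           # ner_chunk.result
metadata = [{"entity": "CFN"}, {"entity": "ORG"}, {"entity": "TICKER"}]  # ner_chunk.metadata

pairs = [(chunk, meta["entity"]) for chunk, meta in zip(chunks, metadata)]
print(pairs)  # [('001-38856', 'CFN'), ('PAGERDUTY, INC', 'ORG'), ('PD', 'TICKER')]
```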
## Results ```bash +----------------------------------------------+-----------------+ |ticker |label | +----------------------------------------------+-----------------+ |January 31, 2021 |FISCAL_YEAR | |001-38856 |CFN | |PAGERDUTY, INC |ORG | |Delaware |STATE | |27-2793871 |IRS | |600 Townsend St., Suite 200, San Francisco, CA|ADDRESS | |(844) 800-3889 |PHONE | |Common Stock |TITLE_CLASS | |$0.000005 |TITLE_CLASS_VALUE| |PD |TICKER | |New York Stock Exchange |STOCK_EXCHANGE | +----------------------------------------------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_10k_summary| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|1.1 MB| ## References Manual annotations on 10-K Filings ## Benchmarking ```bash label precision recall f1-score support B-TITLE_CLASS 1.00 1.00 1.00 15 I-TITLE_CLASS 1.00 1.00 1.00 21 B-ORG 0.84 0.66 0.74 62 I-ORG 0.88 0.76 0.82 93 B-STOCK_EXCHANGE 0.86 0.86 0.86 14 I-STOCK_EXCHANGE 0.98 0.98 0.98 50 B-PHONE 0.95 0.87 0.91 23 I-PHONE 0.95 1.00 0.98 60 B-STATE 0.89 0.85 0.87 20 B-IRS 1.00 0.88 0.93 16 B-ADDRESS 0.94 0.83 0.88 18 I-ADDRESS 0.92 0.97 0.94 144 B-TICKER 0.86 0.92 0.89 13 B-FISCAL_YEAR 0.96 0.88 0.92 50 I-FISCAL_YEAR 0.93 0.92 0.92 125 B-TITLE_CLASS_VALUE 1.00 0.93 0.97 15 B-CFN 0.92 1.00 0.96 12 micro avg 0.93 0.89 0.91 751 macro avg 0.84 0.81 0.82 751 weighted avg 0.92 0.89 0.91 751 ``` --- layout: model title: English DistilBertForQuestionAnswering model (from sshleifer) author: John Snow Labs name: distilbert_qa_tiny_base_cased_distilled_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: 
"Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tiny-distilbert-base-cased-distilled-squad` is an English model originally trained by `sshleifer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_tiny_base_cased_distilled_squad_en_4.0.0_3.0_1654728916941.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_tiny_base_cased_distilled_squad_en_4.0.0_3.0_1654728916941.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_tiny_base_cased_distilled_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_tiny_base_cased_distilled_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_tiny_cased").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
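In the NLU snippet, question and context are passed as a single string joined by `|||`. Splitting such a combined string back into its two parts is a one-liner (a minimal illustration of the convention used above, not part of the NLU API):

```python
# Split a combined "question|||context" string, as used in the NLU
# example, back into its two parts.
combined = "What is my name?|||My name is Clara and I live in Berkeley."
question, context = combined.split("|||", 1)
print(question)  # What is my name?
print(context)   # My name is Clara and I live in Berkeley.
```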
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_tiny_base_cased_distilled_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|641.4 KB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/sshleifer/tiny-distilbert-base-cased-distilled-squad --- layout: model title: Stop Words Cleaner for Latin author: John Snow Labs name: stopwords_la date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: la edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, la] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_la_la_2.5.4_2.4_1594742439769.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_la_la_2.5.4_2.4_1594742439769.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_la", "la") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Alius est esse regem Aquilonis, et de Anglis medicus et nives Ioannes dux in progressus medicinae anesthesia et hygiene.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_la", "la") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Alius est esse regem Aquilonis, et de Anglis medicus et nives Ioannes dux in progressus medicinae anesthesia et hygiene.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Alius est esse regem Aquilonis, et de Anglis medicus et nives Ioannes dux in progressus medicinae anesthesia et hygiene."""] stopword_df = nlu.load('la.stopwords').predict(text) stopword_df[['cleanTokens']] ```
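Conceptually, `StopWordsCleaner` keeps only the tokens that do not appear in its stopword list. A plain-Python sketch of that filtering step, using a tiny hypothetical Latin stoplist (the pretrained model ships a far larger one):

```python
# Hypothetical subset of a Latin stopword list, for illustration only.
stopwords = {"est", "et", "de", "in"}

tokens = ["Alius", "est", "esse", "regem", "Aquilonis", ","]
clean_tokens = [t for t in tokens if t.lower() not in stopwords]
print(clean_tokens)  # ['Alius', 'esse', 'regem', 'Aquilonis', ',']
```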
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=4, result='Alius', metadata={'sentence': '0'}), Row(annotatorType='token', begin=10, end=13, result='esse', metadata={'sentence': '0'}), Row(annotatorType='token', begin=15, end=19, result='regem', metadata={'sentence': '0'}), Row(annotatorType='token', begin=21, end=29, result='Aquilonis', metadata={'sentence': '0'}), Row(annotatorType='token', begin=30, end=30, result=',', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_la| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|la| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: English BertForQuestionAnswering model (from vuiseng9) author: John Snow Labs name: bert_qa_bert_l_squadv1.1_sl256 date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-l-squadv1.1-sl256` is an English model originally trained by `vuiseng9`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_l_squadv1.1_sl256_en_4.0.0_3.0_1654536057484.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_l_squadv1.1_sl256_en_4.0.0_3.0_1654536057484.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_l_squadv1.1_sl256","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_l_squadv1.1_sl256","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.sl256.by_vuiseng9").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_l_squadv1.1_sl256| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/vuiseng9/bert-l-squadv1.1-sl256 --- layout: model title: Bangla RoBERTa Embeddings (from neuralspace-reverie) author: John Snow Labs name: roberta_embeddings_indic_transformers_bn_roberta date: 2022-04-14 tags: [roberta, embeddings, bn, open_source] task: Embeddings language: bn edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `indic-transformers-bn-roberta` is a Bangla model originally trained by `neuralspace-reverie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indic_transformers_bn_roberta_bn_3.4.2_3.0_1649947557406.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indic_transformers_bn_roberta_bn_3.4.2_3.0_1649947557406.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indic_transformers_bn_roberta","bn") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["আমি স্পার্ক এনএলপি ভালোবাসি"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indic_transformers_bn_roberta","bn") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("আমি স্পার্ক এনএলপি ভালোবাসি").toDF("text") val result = pipeline.fit(data).transform(data) ```
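The `embeddings` column holds one dense vector per token. Such vectors are typically fed to downstream annotators or compared directly, for example with cosine similarity; a minimal sketch over two short hypothetical vectors (the real model emits much higher-dimensional ones):

```python
import math

# Cosine similarity between two token-embedding vectors; the short
# vectors below are hypothetical stand-ins for real model output.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [0.2, 0.1, -0.4, 0.7]
v2 = [0.25, 0.05, -0.35, 0.6]
print(round(cosine(v1, v2), 3))  # 0.993
```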
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_indic_transformers_bn_roberta| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|bn| |Size:|312.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-bn-roberta - https://oscar-corpus.com/ --- layout: model title: Pipeline to Detect Genes and Human Phenotypes author: John Snow Labs name: ner_human_phenotype_gene_biobert_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, gene, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_human_phenotype_gene_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_human_phenotype_gene_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_biobert_pipeline_en_3.4.1_3.0_1647867336282.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_biobert_pipeline_en_3.4.1_3.0_1647867336282.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_human_phenotype_gene_biobert_pipeline", "en", "clinical/models") pipeline.annotate("Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).") ``` ```scala val pipeline = new PretrainedPipeline("ner_human_phenotype_gene_biobert_pipeline", "en", "clinical/models") pipeline.annotate("Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.human_phenotype_gene_biobert.pipeline").predict("""Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).""") ```
## Results ```bash +----------------+--------+ |chunks |entities| +----------------+--------+ |type |GENE | |polyhydramnios |HP | |polyuria |HP | |nephrocalcinosis|HP | |hypokalemia |HP | +----------------+--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_human_phenotype_gene_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.1 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter --- layout: model title: English BertForQuestionAnswering model (from xraychen) author: John Snow Labs name: bert_qa_mqa_unsupsim date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mqa-unsupsim` is an English model originally trained by `xraychen`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mqa_unsupsim_en_4.0.0_3.0_1654188398566.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mqa_unsupsim_en_4.0.0_3.0_1654188398566.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mqa_unsupsim","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_mqa_unsupsim","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.unsupsim.by_xraychen").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_mqa_unsupsim| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/xraychen/mqa-unsupsim --- layout: model title: Pipeline to Detect Normalized Genes and Human Phenotypes (biobert) author: John Snow Labs name: ner_human_phenotype_go_biobert_pipeline date: 2023-03-20 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_human_phenotype_go_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_human_phenotype_go_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_go_biobert_pipeline_en_4.3.0_3.2_1679315805636.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_go_biobert_pipeline_en_4.3.0_3.2_1679315805636.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_human_phenotype_go_biobert_pipeline", "en", "clinical/models") text = '''Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_human_phenotype_go_biobert_pipeline", "en", "clinical/models") val text = "Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.phenotype_go_biobert.pipeline").predict("""Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-------------------------|--------:|------:|:------------|-------------:| | 0 | tumor | 39 | 43 | HP | 1 | | 1 | tricarboxylic acid cycle | 79 | 102 | GO | 0.999867 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_human_phenotype_go_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.1 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: English RobertaForSequenceClassification Cased model (from joey234) author: John Snow Labs name: roberta_classifier_cuenb_mnli date: 2022-12-09 tags: [en, open_source, roberta, sequence_classification, classification, tensorflow] task: Text Classification language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cuenb-mnli` is an English model originally trained by `joey234`. ## Predicted Entities `entailment`, `contradiction`, `neutral` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_classifier_cuenb_mnli_en_4.2.4_3.0_1670624962389.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_classifier_cuenb_mnli_en_4.2.4_3.0_1670624962389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_cuenb_mnli","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_classifier]) data = spark.createDataFrame([["I love you!"], ["I feel lucky to be here."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_cuenb_mnli","en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_classifier)) val data = Seq("I love you!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_classifier_cuenb_mnli| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|468.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/joey234/cuenb-mnli - https://paperswithcode.com/sota?task=Text+Classification&dataset=GLUE+MNLI --- layout: model title: Fast Neural Machine Translation Model from Austro-Asiatic languages to English author: John Snow Labs name: opus_mt_aav_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, aav, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `aav` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_aav_en_xx_2.7.0_2.4_1609169255439.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_aav_en_xx_2.7.0_2.4_1609169255439.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_aav_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_aav_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.aav.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_aav_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Sentence Entity Resolver for Billable ICD10-CM HCC Codes (sbiobertresolve_icd10cm_slim_billable_hcc) author: John Snow Labs name: sbiobertresolve_icd10cm_slim_billable_hcc date: 2021-05-25 tags: [icd10cm, slim, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.3 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts to ICD10-CM codes using sentence BioBERT embeddings. In this model, synonyms having low cosine similarity to unnormalized terms are dropped. It also returns the official resolution text within the brackets inside the metadata. The model is augmented with synonyms, and previous augmentations are flexed according to cosine distances to unnormalized terms (ground truths). ## Predicted Entities Outputs 7-digit billable ICD codes. In the result, look for the aux_label parameter in the metadata to get the HCC status. The HCC status can be split into three parts: billable status, HCC status, and HCC score. For example, in the example shared below the billable status is 1, the HCC status is 1, and the HCC score is 11.
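The three-part HCC status described above can be unpacked with a few lines of plain Python. This is an illustrative sketch, not part of the model card's pipeline: the `||` separator is an assumption about how aux_label is serialized, so inspect the metadata returned by your own pipeline to confirm the exact format before relying on it.

```python
# Hypothetical helper: splits an aux_label string such as "1||1||11" into its
# three components (billable status, HCC status, HCC score). The "||"
# separator is an assumption; verify it against your pipeline's metadata.
def split_hcc_status(aux_label: str):
    billable, hcc_status, hcc_score = aux_label.split("||")
    return billable, hcc_status, hcc_score

print(split_hcc_status("1||1||11"))  # -> ('1', '1', '11')
```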
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_slim_billable_hcc_en_3.0.3_2.4_1621942329774.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_slim_billable_hcc_en_3.0.3_2.4_1621942329774.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sbert_embeddings") icd10_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_icd10cm_slim_billable_hcc","en", "clinical/models") \ .setInputCols(["document", "sbert_embeddings"]) \ .setOutputCol("icd10cm_code")\ .setDistanceFunction("EUCLIDEAN")\ .setReturnCosineDistances(True) bert_pipeline_icd = Pipeline(stages = [document_assembler, sbert_embedder, icd10_resolver]) data = spark.createDataFrame([["bladder cancer"]]).toDF("text") results = bert_pipeline_icd.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sbert_embeddings") val icd10_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_icd10cm_slim_billable_hcc","en", "clinical/models") .setInputCols(Array("document", "sbert_embeddings")) .setOutputCol("icd10cm_code") .setDistanceFunction("EUCLIDEAN") .setReturnCosineDistances(true) val bert_pipeline_icd = new Pipeline().setStages(Array(document_assembler, sbert_embedder, icd10_resolver)) val data = Seq("bladder cancer").toDF("text") val result = bert_pipeline_icd.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd10cm.slim_billable_hcc").predict("""bladder cancer""")
```
## Results ```bash | | chunks | code | resolutions | all_codes | billable_hcc_status_score | all_distances | |---:|:---------------|:--------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|-------------------------------------------:|:----------------------------|:---------------------------------------------------------| | 0 | bladder cancer | C671 |[bladder cancer, dome [Malignant neoplasm of dome of bladder], cancer of the urinary bladder [Malignant neoplasm of bladder, unspecified], adenocarcinoma, bladder neck [Malignant neoplasm of bladder neck], cancer in situ of urinary bladder [Carcinoma in situ of bladder], cancer of the urinary bladder, ureteric orifice [Malignant neoplasm of ureteric orifice], tumor of bladder neck [Neoplasm of unspecified behavior of bladder], cancer of the urethra [Malignant neoplasm of urethra]]| [C671, C679, C675, D090, C676, D494, C680] | ['1', '1', '11'] | [0.0685, 0.0709, 0.0963, 0.0978, 0.1068, 0.1080, 0.1211] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icd10cm_slim_billable_hcc| |Compatibility:|Healthcare NLP 3.0.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Case sensitive:|false| --- layout: model title: Legal No Waivers Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_no_waivers_bert date: 2023-03-05 tags: [en, legal, classification, clauses, no_waivers, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 
spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `No_Waivers` clause type. To use this model, make sure you provide enough context as an input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
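The paragraph-splitting option mentioned above (splitting "by multiline") can be sketched in a few lines of plain Python. This is a minimal illustration of the idea, not the workshop notebook's actual code: it treats any run of two or more newlines as a provision boundary.

```python
import re

def split_paragraphs(text: str):
    # Split a document into provisions on blank lines (two or more newlines),
    # dropping empty fragments and surrounding whitespace.
    return [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]

doc = "Section 1. Counterparts.\n\nSection 2. No waiver.\n\n\nSection 3. Notices."
print(split_paragraphs(doc))
```

Each resulting paragraph can then be fed to the classifier as its own row, keeping every input under the 512-token limit of the embeddings.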
## Predicted Entities `No_Waivers`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_no_waivers_bert_en_1.0.0_3.0_1678050601610.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_no_waivers_bert_en_1.0.0_3.0_1678050601610.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_no_waivers_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +------------+ |result| +------------+ |[No_Waivers]| |[Other]| |[Other]| |[No_Waivers]| +------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_no_waivers_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support No_Waivers 0.90 0.98 0.94 54 Other 0.99 0.92 0.95 73 accuracy - - 0.94 127 macro-avg 0.94 0.95 0.94 127 weighted-avg 0.95 0.94 0.95 127 ``` --- layout: model title: English DistilBertForQuestionAnswering Small Cased model (from ncduy) author: John Snow Labs name: distilbert_qa_base_cased_led_squad_finetuned_small date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-finetuned-squad-small` is an English model originally trained by `ncduy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_finetuned_small_en_4.3.0_3.0_1672766627900.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_finetuned_small_en_4.3.0_3.0_1672766627900.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_finetuned_small","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_finetuned_small","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_cased_led_squad_finetuned_small| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ncduy/distilbert-base-cased-distilled-squad-finetuned-squad-small --- layout: model title: French CamemBert Embeddings (from mbateman) author: John Snow Labs name: camembert_embeddings_mbateman_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `mbateman`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_mbateman_generic_model_fr_3.4.4_3.0_1653989653583.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_mbateman_generic_model_fr_3.4.4_3.0_1653989653583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_mbateman_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_mbateman_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_mbateman_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/mbateman/dummy-model --- layout: model title: English BertForQuestionAnswering model (from bioformers) author: John Snow Labs name: bert_qa_bioformer_cased_v1.0_squad1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bioformer-cased-v1.0-squad1` is an English model originally trained by `bioformers`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bioformer_cased_v1.0_squad1_en_4.0.0_3.0_1654185774933.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bioformer_cased_v1.0_squad1_en_4.0.0_3.0_1654185774933.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bioformer_cased_v1.0_squad1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bioformer_cased_v1.0_squad1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bioformer.cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bioformer_cased_v1.0_squad1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|158.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/bioformers/bioformer-cased-v1.0-squad1 - https://rajpurkar.github.io/SQuAD-explorer - https://arxiv.org/pdf/1910.01108.pdf --- layout: model title: Abkhazian asr_speech_sprint_test TFWav2Vec2ForCTC from Mofe author: John Snow Labs name: pipeline_asr_speech_sprint_test date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_speech_sprint_test` is an Abkhazian model originally trained by Mofe. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_speech_sprint_test_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_speech_sprint_test_ab_4.2.0_3.0_1664021417370.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_speech_sprint_test_ab_4.2.0_3.0_1664021417370.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('pipeline_asr_speech_sprint_test', lang = 'ab') annotations = pipeline.transform(audioDF) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("pipeline_asr_speech_sprint_test", lang = "ab") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_speech_sprint_test| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|ab| |Size:|452.6 KB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal Counterparts Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_counterparts_bert date: 2023-03-05 tags: [en, legal, classification, clauses, counterparts, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Counterparts` clause type. To use this model, make sure you provide enough context as an input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Counterparts`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_counterparts_bert_en_1.0.0_3.0_1678050553108.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_counterparts_bert_en_1.0.0_3.0_1678050553108.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_counterparts_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +--------------+ |result| +--------------+ |[Counterparts]| |[Other]| |[Other]| |[Counterparts]| +--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_counterparts_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Counterparts 1.00 0.99 1.0 302 Other 0.99 1.00 1.0 338 accuracy - - 1.0 640 macro-avg 1.00 1.00 1.0 640 weighted-avg 1.00 1.00 1.0 640 ``` --- layout: model title: Portuguese RoBERTa Embeddings (from rdenadai) author: John Snow Labs name: roberta_embeddings_BR_BERTo date: 2022-04-14 tags: [roberta, embeddings, pt, open_source] task: Embeddings language: pt edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `BR_BERTo` is a Portuguese model originally trained by `rdenadai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_BR_BERTo_pt_3.4.2_3.0_1649947632437.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_BR_BERTo_pt_3.4.2_3.0_1649947632437.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_BR_BERTo","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Eu amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_BR_BERTo","pt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Eu amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.embed.BR_BERTo").predict("""Eu amo Spark NLP""") ```
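Each token in the `embeddings` column carries a dense vector, and downstream tasks often compare tokens by cosine similarity. A minimal sketch in plain Python (the 4-dimensional vectors are hypothetical stand-ins for BR_BERTo's real, much larger embedding vectors):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical low-dimensional vectors standing in for two token embeddings.
u = [0.2, -0.1, 0.4, 0.3]
v = [0.1, -0.2, 0.5, 0.2]
print(round(cosine(u, v), 3))
```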
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_BR_BERTo| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|pt| |Size:|637.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/rdenadai/BR_BERTo - https://github.com/rdenadai/BR-BERTo --- layout: model title: Chinese BertForMaskedLM Base Cased model author: John Snow Labs name: bert_embeddings_base_chinese date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-chinese` is a Chinese model originally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_chinese_zh_4.2.4_3.0_1670016364528.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_chinese_zh_4.2.4_3.0_1670016364528.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_chinese","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_chinese","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_chinese| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|383.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/bert-base-chinese - https://aclanthology.org/2021.acl-long.330.pdf - https://dl.acm.org/doi/pdf/10.1145/3442188.3445922 --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_12_h_128 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-12_H-128` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_128_zh_4.2.4_3.0_1670021529933.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_128_zh_4.2.4_3.0_1670021529933.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_128","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_128","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_12_h_128| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|20.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-12_H-128 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Legal Sick leave Clause Binary Classifier author: John Snow Labs name: legclf_sick_leave_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `sick-leave` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `sick-leave` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_sick_leave_clause_en_1.0.0_3.2_1660124014656.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_sick_leave_clause_en_1.0.0_3.2_1660124014656.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_sick_leave_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
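The description above recommends paragraph-level splitting for long documents, since the sentence embeddings cover at most 512 tokens. A minimal splitter sketch in plain Python (whitespace tokenization is a rough stand-in for the model's real tokenizer, and the 512 budget is taken from the description):

```python
MAX_TOKENS = 512  # token budget stated in the model description

def split_paragraphs(text, max_tokens=MAX_TOKENS):
    """Split on blank lines and flag paragraphs that fit the token budget
    (whitespace tokenization approximates the real tokenizer)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

doc = "Clause one text here.\n\nClause two, a sick-leave provision."
for para, fits in split_paragraphs(doc):
    print(fits, para)
```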
## Results ```bash +-------+ | result| +-------+ |[sick-leave]| |[other]| |[other]| |[sick-leave]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_sick_leave_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.96 0.99 0.98 80 sick-leave 0.97 0.93 0.95 42 accuracy - - 0.97 122 macro-avg 0.97 0.96 0.96 122 weighted-avg 0.97 0.97 0.97 122 ``` --- layout: model title: Marathi Bert Embeddings (from monsoon-nlp) author: John Snow Labs name: bert_embeddings_muril_adapted_local date: 2022-04-11 tags: [bert, embeddings, mr, open_source] task: Embeddings language: mr edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `muril-adapted-local` is a Marathi model originally trained by `monsoon-nlp`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_mr_3.4.2_3.0_1649675136793.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_mr_3.4.2_3.0_1649675136793.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","mr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["मला स्पार्क एनएलपी आवडते"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","mr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("मला स्पार्क एनएलपी आवडते").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("mr.embed.muril_adapted_local").predict("""मला स्पार्क एनएलपी आवडते""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_muril_adapted_local| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|mr| |Size:|888.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/monsoon-nlp/muril-adapted-local - https://tfhub.dev/google/MuRIL/1 --- layout: model title: Detect Assertion Status (assertion_dl) - supports confidence scores author: John Snow Labs name: assertion_dl date: 2021-01-26 task: Assertion Status language: en nav_key: models edition: Healthcare NLP 2.7.2 spark_version: 2.4 tags: [assertion, en, licensed, clinical] supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Assign assertion status to clinical entities extracted by NER based on their context in the text. ## Predicted Entities `absent`, `present`, `conditional`, `associated_with_someone_else`, `hypothetical`, `possible`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_dl_en_2.7.2_2.4_1611647201607.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_dl_en_2.7.2_2.4_1611647201607.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") clinical_assertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion ]) data = spark.createDataFrame([["The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family.', 'Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population.', 'The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively.', 'We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair', '(bp) insertion/deletion.', 'Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle.', 'The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the 
KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes."]]).toDF("text") model = nlpPipeline.fit(data) result = model.transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val clinical_assertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") .setInputCols(Array("sentence", "ner_chunk", "embeddings")) .setOutputCol("assertion") val nlpPipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion )) val text = """The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family.', 'Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population.', 'The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively.', 'We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair', '(bp) insertion/deletion.', 'Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two 
major insulin-responsive tissues: fat andskeletal muscle.', 'The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.""" val data = Seq(text).toDS.toDF("text") val results = nlpPipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family.', 'Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population.', 'The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively.', 'We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair', '(bp) insertion/deletion.', 'Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle.', 'The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.""") ```
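A common post-processing step is to keep only the findings whose assertion status is `present`, dropping negated or speculative mentions. A minimal sketch over hypothetical `(chunk, ner_label, assertion)` triples of the shape this pipeline produces (the triples are illustrative, not model output):

```python
# Hypothetical (chunk, ner_label, assertion) triples of the shape
# produced by the NER + assertion pipeline above.
rows = [
    ("Type II diabetes", "PROBLEM", "present"),
    ("evaluation", "TEST", "possible"),
    ("headache", "PROBLEM", "absent"),
]

# Keep only findings actually asserted as present.
present = [chunk for chunk, _, status in rows if status == "present"]
print(present)
```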
## Results ```bash +-----------------------------------------------------------+---------+-----------+ |chunk |ner_label|assertion | +-----------------------------------------------------------+---------+-----------+ |the G-protein-activated inwardly rectifying potassium (GIRK|TREATMENT|conditional| |the genomicorganization |TREATMENT|present | |a candidate gene forType II diabetes mellitus |PROBLEM |present | |byapproximately |TREATMENT|present | |single nucleotide polymorphisms |TREATMENT|present | |aVal366Ala substitution |TREATMENT|present | |insertion/deletion |PROBLEM |present | |'Ourexpression studies |TEST |present | |the transcript in various humantissues |PROBLEM |present | |fat andskeletal muscle |PROBLEM |possible | |furtherstudies |PROBLEM |present | |the KCNJ9 protein |TREATMENT|present | |evaluation |TEST |possible | |Type II diabetes |PROBLEM |present | +-----------------------------------------------------------+---------+-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_dl| |Compatibility:|Spark NLP 2.7.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| ## Data Source Trained on i2b2 assertion data ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|:-----------------------------|------:|-----:|-----:|---------:|---------:|---------:| | 0 | absent | 791 | 47 | 80 | 0.943914 | 0.908152 | 0.925688 | | 1 | present | 2499 | 169 | 120 | 0.936657 | 0.954181 | 0.945338 | | 2 | conditional | 23 | 19 | 21 | 0.547619 | 0.522727 | 0.534884 | | 3 | associated_with_someone_else | 38 | 2 | 11 | 0.95 | 0.77551 | 0.853933 | | 4 | hypothetical | 163 | 19 | 21 | 0.895604 | 0.88587 | 0.89071 | | 5 | possible | 119 | 61 | 64 | 0.661111 | 0.650273 | 0.655647 | | 6 | Macro-average | 3633 | 317 | 317 | 0.822484 | 0.782786 | 0.802144 | | 7 | Micro-average | 3633 | 317 | 317 | 0.919747 | 0.919747 | 0.919747 | ``` 
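The micro-averaged row follows directly from the pooled tp/fp/fn counts in the benchmarking table; a quick arithmetic check:

```python
# Pooled counts from the micro-average row of the benchmark table.
tp, fp, fn = 3633, 317, 317

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# With fp == fn, precision, recall, and F1 all coincide.
print(round(precision, 6), round(recall, 6), round(f1, 6))
```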
--- layout: model title: English RobertaForQuestionAnswering Cased model (from peggyhuang) author: John Snow Labs name: roberta_qa_canard date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-canard` is an English model originally trained by `peggyhuang`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_canard_en_4.3.0_3.0_1674219967411.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_canard_en_4.3.0_3.0_1674219967411.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_canard","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_canard","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
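Under the hood, span-extraction QA heads score candidate start and end positions and the answer is sliced out of the context. A minimal illustration in plain Python (the character offsets are hypothetical, not produced by the model here):

```python
context = "My name is Clara and I live in Berkeley."

# Hypothetical character offsets of the kind a span-extraction head yields.
start, end = 11, 16
answer = context[start:end]
print(answer)
```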
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_canard| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|465.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/peggyhuang/roberta-canard --- layout: model title: Detect Problems, Tests and Treatments (ner_healthcare) author: John Snow Labs name: ner_healthcare_en date: 2020-03-26 task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 2.4.4 spark_version: 2.4 tags: [ner, en, licensed, clinical] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Pretrained named entity recognition deep learning model for healthcare. Includes Problem, Test and Treatment entities. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNN. {:.h2_title} ## Predicted Entities ``PROBLEM``, ``TEST``, ``TREATMENT``. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CLINICAL/){:.button.button-orange} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_en_2.4.4_2.4_1585188313964.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_en_2.4.4_2.4_1585188313964.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
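NerConverter's token-to-chunk step can be pictured as a small BIO aggregation; a sketch in plain Python (the tokens and tags below are a hypothetical example, not model output):

```python
def bio_to_chunks(tokens, tags):
    """Collapse token-level B-/I- tags into (chunk, label) pairs,
    roughly what NerConverter does inside the pipeline."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O" tag or inconsistent I- tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["She", "was", "on", "metformin", "for", "T2DM"]
tags = ["O", "O", "O", "B-TREATMENT", "O", "B-PROBLEM"]
print(bio_to_chunks(tokens, tags))
```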
{% include programmingLanguageSelectScalaPython.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_healthcare", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) data = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG ."]]).toDF("text") model = nlpPipeline.fit(data) results = model.transform(data) ``` ```scala ... val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_healthcare", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") ... 
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG .").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.h2_title} ## Results The output is a dataframe with a sentence per row and a ``"ner"`` column containing all of the entity labels in the sentence, entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select ``"token.result"`` and ``"ner.result"`` from your output dataframe or add the ``"Finisher"`` to the end of your pipeline. ```bash | | chunk | ner_label | |---|-------------------------------|-----------| | 0 | a respiratory tract infection | PROBLEM | | 1 | metformin | TREATMENT | | 2 | glipizide | TREATMENT | | 3 | dapagliflozin | TREATMENT | | 4 | T2DM | PROBLEM | | 5 | atorvastatin | TREATMENT | | 6 | gemfibrozil | TREATMENT | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_healthcare_en_2.4.4_2.4| |Type:|ner| |Compatibility:|Spark NLP 2.4.4| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|[en]| |Case sensitive:|false| {:.h2_title} ## Data Source Trained on 2010 i2b2 challenge data with 'embeddings_healthcare_100d'. 
https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ {:.h2_title} ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|:--------------|------:|------:|------:|---------:|---------:|---------:| | 0 | I-TREATMENT | 6625 | 1187 | 1329 | 0.848054 | 0.832914 | 0.840416 | | 1 | I-PROBLEM | 15142 | 1976 | 2542 | 0.884566 | 0.856254 | 0.87018 | | 2 | B-PROBLEM | 11005 | 1065 | 1587 | 0.911765 | 0.873968 | 0.892466 | | 3 | I-TEST | 6748 | 923 | 1264 | 0.879677 | 0.842237 | 0.86055 | | 4 | B-TEST | 8196 | 942 | 1029 | 0.896914 | 0.888455 | 0.892665 | | 5 | B-TREATMENT | 8271 | 1265 | 1073 | 0.867345 | 0.885167 | 0.876165 | | 6 | Macro-average | 55987 | 7358 | 8824 | 0.881387 | 0.863166 | 0.872181 | | 7 | Micro-average | 55987 | 7358 | 8824 | 0.883842 | 0.86385 | 0.873732 | ``` --- layout: model title: Pipeline to Detect Anatomical Regions author: John Snow Labs name: bert_token_classifier_ner_anatomy_pipeline date: 2022-03-21 tags: [licensed, ner, anatomy, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [bert_token_classifier_ner_anatomy](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_anatomy_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_anatomy_pipeline_en_3.4.1_3.0_1647857125493.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_anatomy_pipeline_en_3.4.1_3.0_1647857125493.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python anatomy_pipeline = PretrainedPipeline("bert_token_classifier_ner_anatomy_pipeline", "en", "clinical/models") anatomy_pipeline.annotate("This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now. General: Well-developed female, in no acute distress, afebrile. HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.Neck: No lymphadenopathy. Chest: Clear. Abdomen: Positive bowel sounds and soft. Dermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.") ``` ```scala val anatomy_pipeline = new PretrainedPipeline("bert_token_classifier_ner_anatomy_pipeline", "en", "clinical/models") anatomy_pipeline.annotate("This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now. General: Well-developed female, in no acute distress, afebrile. HEENT: Sclerae and conjunctivae clear. 
Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist. Neck: No lymphadenopathy. Chest: Clear. Abdomen: Positive bowel sounds and soft. Dermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.anatomy_pipeline").predict("""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now. General: Well-developed female, in no acute distress, afebrile. HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.Neck: No lymphadenopathy. Chest: Clear. Abdomen: Positive bowel sounds and soft. Dermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.""") ```
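The `annotate()` call above returns a plain Python dict of output columns. As a minimal, hypothetical sketch of pairing the extracted chunks with their labels — the key names `ner_chunk` and `ner_label` and the hard-coded values are assumptions for illustration, not the guaranteed output schema — one could zip the two lists:

```python
# Toy stand-in for the dict returned by annotate(); real key names may differ.
annotations = {
    "ner_chunk": ["great toe", "skin", "conjunctivae"],
    "ner_label": ["Multi-tissue_structure", "Organ", "Multi-tissue_structure"],
}

# Pair each anatomical chunk with its predicted label.
pairs = list(zip(annotations["ner_chunk"], annotations["ner_label"]))
```

Each pair then corresponds to one row of the results table shown below.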
## Results ```bash +-------------------+----------------------+ |chunk |ner_label | +-------------------+----------------------+ |great toe |Multi-tissue_structure| |skin |Organ | |conjunctivae |Multi-tissue_structure| |Extraocular muscles|Multi-tissue_structure| |Nares |Multi-tissue_structure| |turbinates |Multi-tissue_structure| |Oropharynx |Multi-tissue_structure| |Mucous membranes |Tissue | |Neck |Organism_subdivision | |bowel |Organ | |great toe |Multi-tissue_structure| |skin |Organ | |toenails |Organism_subdivision | |foot |Organism_subdivision | |great toe |Multi-tissue_structure| |toenails |Organism_subdivision | +-------------------+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_anatomy_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.8 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverter --- layout: model title: English RobertaForQuestionAnswering (from twmkn9) author: John Snow Labs name: roberta_qa_distilroberta_base_squad2 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-base-squad2` is an English model originally trained by `twmkn9`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_distilroberta_base_squad2_en_4.0.0_3.0_1655728339007.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_distilroberta_base_squad2_en_4.0.0_3.0_1655728339007.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_distilroberta_base_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_distilroberta_base_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.distilled_base.by_twmkn9").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
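In the NLU snippet above, the question and its context are passed as a single string joined by the `|||` separator. A small sketch of that convention — assuming NLU splits on this literal separator, which is how the snippet formats its input:

```python
question = "What's my name?"
context = "My name is Clara and I live in Berkeley."

# NLU-style payload: question and context joined by the "|||" separator.
payload = f"{question}|||{context}"

# Splitting once on the separator recovers the original pair.
q, c = payload.split("|||", 1)
```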
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_distilroberta_base_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|307.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/twmkn9/distilroberta-base-squad2 --- layout: model title: Legal Duties Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_duties_bert date: 2023-03-05 tags: [en, legal, classification, clauses, duties, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Duties` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Duties`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_duties_bert_en_1.0.0_3.0_1678050512891.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_duties_bert_en_1.0.0_3.0_1678050512891.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_duties_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
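The description above recommends splitting large documents into provisions before classification. A minimal sketch of the multiline (blank-line) paragraph split, independent of Spark NLP — the sample clauses are invented for illustration:

```python
import re

def split_paragraphs(text: str) -> list:
    # Split on one or more blank lines, dropping empty fragments,
    # so each provision stays within the 512-token embedding limit.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = (
    "1. Duties. The Employee shall perform such duties as assigned.\n\n"
    "2. Term. This Agreement shall remain in force for two years.\n\n"
    "3. Governing Law. This Agreement is governed by the laws of Delaware."
)
paragraphs = split_paragraphs(doc)
# Each paragraph can then become one row of the input DataFrame,
# e.g. spark.createDataFrame([[p] for p in paragraphs]).toDF("text")
```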
## Results ```bash +--------+ |result  | +--------+ |[Duties]| |[Other] | |[Other] | |[Duties]| +--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_duties_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Duties 0.92 0.92 0.92 39 Other 0.95 0.95 0.95 61 accuracy - - 0.94 100 macro-avg 0.94 0.94 0.94 100 weighted-avg 0.94 0.94 0.94 100 ``` --- layout: model title: Relation extraction between body parts and direction entities (ReDL). author: John Snow Labs name: redl_bodypart_direction_biobert date: 2021-02-04 task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 2.7.3 spark_version: 2.4 tags: [licensed, clinical, en, relation_extraction] supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Relation extraction between body part entities like `Internal_organ_or_component` and `External_body_part_or_region`, and direction entities like `upper` and `lower`, in clinical texts. `1` indicates that the body part and direction entity are related; `0` indicates that they are not related.
## Predicted Entities `0`, `1` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.1.Clinical_Relation_Extraction_BodyParts_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_direction_biobert_en_2.7.3_2.4_1612447753332.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_direction_biobert_en_2.7.3_2.4_1612447753332.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = sparknlp.annotators.Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") words_embedder = WordEmbeddingsModel() \ .pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverter() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel() \ .pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") # Set a filter on pairs of named entities which will be treated as relation candidates re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks")\ .setRelationPairs(['direction-external_body_part_or_region', 'external_body_part_or_region-direction', 'direction-internal_organ_or_component', 'internal_organ_or_component-direction' ]) # The dataset this model is trained on is sentence-wise. # This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input.
re_model = RelationExtractionDLModel()\ .pretrained('redl_bodypart_direction_biobert', 'en', "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) text = "MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia" data = spark.createDataFrame([[text]]).toDF("text") p_model = pipeline.fit(data) result = p_model.transform(data) ``` ```scala ... val documenter = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = sparknlp.annotators.Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") .setRelationPairs(Array("direction-external_body_part_or_region",
"external_body_part_or_region-direction", "direction-internal_organ_or_component", "internal_organ_or_component-direction")) // The dataset this model is trained on is sentence-wise. // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. val re_model = RelationExtractionDLModel() .pretrained("redl_bodypart_direction_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation").predict("""MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia""") ```
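Since the model emits `1` for related body-part/direction pairs and `0` otherwise, a typical post-processing step keeps only pairs predicted as related above the confidence threshold. A toy sketch on hard-coded rows — the tuple layout here is illustrative only, not the actual output schema of the pipeline:

```python
# (relation, chunk1, chunk2, confidence) — illustrative rows only.
rows = [
    ("1", "upper", "brain stem", 0.9999989),
    ("0", "upper", "cerebellum", 0.99992585),
    ("1", "left", "cerebellum", 1.0),
    ("1", "right", "basil ganglia", 1.0),
]

# Keep only the pairs the model marked as related, above the 0.5 threshold.
related = [(c1, c2) for rel, c1, c2, conf in rows if rel == "1" and conf >= 0.5]
```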
## Results ```bash | index | relations | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |-------|-----------|-----------------------------|---------------|-------------|------------|-----------------------------|-------------|-------------|---------------|------------| | 0 | 1 | Direction | 35 | 39 | upper | Internal_organ_or_component | 41 | 50 | brain stem | 0.9999989 | | 1 | 0 | Direction | 35 | 39 | upper | Internal_organ_or_component | 59 | 68 | cerebellum | 0.99992585 | | 2 | 0 | Direction | 35 | 39 | upper | Internal_organ_or_component | 81 | 93 | basil ganglia | 0.9999999 | | 3 | 0 | Internal_organ_or_component | 41 | 50 | brain stem | Direction | 54 | 57 | left | 0.999811 | | 4 | 0 | Internal_organ_or_component | 41 | 50 | brain stem | Direction | 75 | 79 | right | 0.9998203 | | 5 | 1 | Direction | 54 | 57 | left | Internal_organ_or_component | 59 | 68 | cerebellum | 1.0 | | 6 | 0 | Direction | 54 | 57 | left | Internal_organ_or_component | 81 | 93 | basil ganglia | 0.97616416 | | 7 | 0 | Internal_organ_or_component | 59 | 68 | cerebellum | Direction | 75 | 79 | right | 0.953046 | | 8 | 1 | Direction | 75 | 79 | right | Internal_organ_or_component | 81 | 93 | basil ganglia | 1.0 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_bodypart_direction_biobert| |Compatibility:|Healthcare NLP 2.7.3+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Data Source Trained on an internal dataset. ## Benchmarking ```bash Relation Recall Precision F1 Support 0 0.856 0.873 0.865 153 1 0.986 0.984 0.985 1347 Avg. 
0.921 0.929 0.925 ``` --- layout: model title: Pipeline to Extract Entities in Spanish Clinical Trial Abstracts (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_clinical_trials_abstracts_pipeline date: 2023-03-20 tags: [es, clinical, licensed, token_classification, bert, ner] task: Named Entity Recognition language: es edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_clinical_trials_abstracts](https://nlp.johnsnowlabs.com/2022/08/11/bert_token_classifier_ner_clinical_trials_abstracts_es_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_trials_abstracts_pipeline_es_4.3.0_3.2_1679298645358.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_trials_abstracts_pipeline_es_4.3.0_3.2_1679298645358.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_clinical_trials_abstracts_pipeline", "es", "clinical/models") text = '''Efecto de la suplementación con ácido fólico sobre los niveles de homocisteína total en pacientes en hemodiálisis. La hiperhomocisteinemia es un marcador de riesgo independiente de morbimortalidad cardiovascular. Hemos prospectivamente reducir los niveles de homocisteína total (tHcy) mediante suplemento con ácido fólico y vitamina B6 (pp), valorando su posible correlación con dosis de diálisis, función residual y parámetros nutricionales.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_clinical_trials_abstracts_pipeline", "es", "clinical/models") val text = "Efecto de la suplementación con ácido fólico sobre los niveles de homocisteína total en pacientes en hemodiálisis. La hiperhomocisteinemia es un marcador de riesgo independiente de morbimortalidad cardiovascular. Hemos prospectivamente reducir los niveles de homocisteína total (tHcy) mediante suplemento con ácido fólico y vitamina B6 (pp), valorando su posible correlación con dosis de diálisis, función residual y parámetros nutricionales." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:------------------------|--------:|------:|:------------|-------------:| | 0 | suplementación | 13 | 26 | PROC | 0.999993 | | 1 | ácido fólico | 32 | 43 | CHEM | 0.999753 | | 2 | niveles de homocisteína | 55 | 77 | PROC | 0.997803 | | 3 | hemodiálisis | 101 | 112 | PROC | 0.999993 | | 4 | hiperhomocisteinemia | 118 | 137 | DISO | 0.999995 | | 5 | niveles de homocisteína | 248 | 270 | PROC | 0.999988 | | 6 | tHcy | 279 | 282 | PROC | 0.999989 | | 7 | ácido fólico | 309 | 320 | CHEM | 0.999987 | | 8 | vitamina B6 | 324 | 334 | CHEM | 0.999967 | | 9 | pp | 337 | 338 | CHEM | 0.999889 | | 10 | diálisis | 388 | 395 | PROC | 0.999993 | | 11 | función residual | 398 | 414 | PROC | 0.999948 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_clinical_trials_abstracts_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|es| |Size:|410.6 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Mapping Drugs With Their Corresponding Actions And Treatments author: John Snow Labs name: drug_action_treatment_mapper date: 2022-04-04 tags: [en, chunkmapping, chunkmapper, drug, action, treatment, licensed] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.4.2 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps drugs with their corresponding `action` and `treatment`. 
`action` refers to the function of the drug in various body systems; `treatment` refers to the disease the drug is used to treat. ## Predicted Entities `action`, `treatment` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/drug_action_treatment_mapper_en_3.4.2_3.0_1649098201229.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/drug_action_treatment_mapper_en_3.4.2_3.0_1649098201229.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") ner = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_drug_development_trials", "en", "clinical/models")\ .setInputCols(["token","sentence"])\ .setOutputCol("ner") nerconverter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("drug") chunkerMapper = ChunkMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models") \ .setInputCols("drug")\ .setOutputCol("relations")\ .setRel("treatment") #or action pipeline = Pipeline().setStages([document_assembler, sentence_detector, tokenizer, ner, nerconverter, chunkerMapper]) text = ["""The patient is a 71-year-old female patient of Dr. X. and she was given Aklis and Dermovate. 
Current Medications: Diprivan, Proventil """] data = spark.createDataFrame([text]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = Tokenizer() .setInputCols("sentence") .setOutputCol("token") val ner = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_drug_development_trials", "en", "clinical/models") .setInputCols(Array("token","sentence")) .setOutputCol("ner") val nerconverter = NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("drug") val chunkerMapper = ChunkMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models") .setInputCols("drug") .setOutputCol("relations") .setRel("treatment") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, ner, nerconverter, chunkerMapper )) val data = Seq("""The patient is a 71-year-old female patient of Dr. X. and she was given Aklis and Dermovate. Current Medications: Diprivan, Proventil""").toDF("text") val res = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.drug_to_action_treatment").predict(""" The patient is a 71-year-old female patient of Dr. X. and she was given Aklis and Dermovate. Current Medications: Diprivan, Proventil """) ```
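The mapper returns multiple candidate actions or treatments for a single drug joined by the `:::` separator, as in the results table. A minimal sketch of unpacking such a value — the raw string is modeled on the style of the results, not a guaranteed output:

```python
def parse_mappings(raw: str) -> list:
    # ChunkMapper concatenates multiple mappings with ":::".
    return [m.strip() for m in raw.split(":::") if m.strip()]

treatments = parse_mappings("Discoid Lupus Erythematosus:::Psoriasis:::Eczema")
```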
## Results ```bash +---------+------------------+--------------------------------------------------------------+ |Drug |Treats |Pharmaceutical Action | +---------+------------------+--------------------------------------------------------------+ |Aklis |Hyperlipidemia |Hypertension:::Diabetic Kidney Disease:::Cerebrovascular... | |Dermovate|Lupus |Discoid Lupus Erythematosus:::Empeines:::Psoriasis:::Eczema...| |Diprivan |Infection |Laryngitis:::Pneumonia:::Pharyngitis | |Proventil|Addison's Disease |Allergic Conjunctivitis:::Anemia:::Ankylosing Spondylitis | +---------+------------------+--------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|drug_action_treatment_mapper| |Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|8.7 MB| --- layout: model title: Translate English to Afrikaans Pipeline author: John Snow Labs name: translate_en_af date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, af, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences.
The use of an accelerator such as GPU is recommended. - source languages: `en` - target languages: `af` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_af_xx_2.7.0_2.4_1609689571052.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_af_xx_2.7.0_2.4_1609689571052.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_af", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_af", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.af').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_af| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from batterydata) author: John Snow Labs name: bert_qa_batteryonlybert_uncased_squad_v1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `batteryonlybert-uncased-squad-v1` is an English model originally trained by `batterydata`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_batteryonlybert_uncased_squad_v1_en_4.0.0_3.0_1654179355858.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_batteryonlybert_uncased_squad_v1_en_4.0.0_3.0_1654179355858.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_batteryonlybert_uncased_squad_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_batteryonlybert_uncased_squad_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad_battery.bert.uncased_only_bert.by_batterydata").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_batteryonlybert_uncased_squad_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|408.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/batterydata/batteryonlybert-uncased-squad-v1 - https://github.com/ShuHuang/batterybert --- layout: model title: Fast Neural Machine Translation Model from Xhosa to English author: John Snow Labs name: opus_mt_xh_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, xh, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `xh` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_xh_en_xx_2.7.0_2.4_1609164176872.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_xh_en_xx_2.7.0_2.4_1609164176872.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_xh_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_xh_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.xh.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_xh_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering model (from holtin) Squad author: John Snow Labs name: distilbert_qa_holtin_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `holtin`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_holtin_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725474082.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_holtin_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725474082.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_holtin_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_holtin_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased_v2.by_holtin").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_holtin_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/holtin/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal Consents Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_consents_bert date: 2023-03-05 tags: [en, legal, classification, clauses, consents, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Consents` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Consents`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_consents_bert_en_1.0.0_3.0_1678050585575.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_consents_bert_en_1.0.0_3.0_1678050585575.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_consents_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
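Before classification, long filings can be broken into provision-sized chunks, as recommended in the Description. A minimal, framework-free sketch of the "paragraph splitting (by multiline)" approach — the helper name and sample contract here are illustrative, not part of the Legal NLP API:

```python
import re

def split_provisions(text: str) -> list:
    """Split a contract into candidate provisions on blank lines."""
    # Two or more consecutive newlines mark a paragraph boundary.
    chunks = re.split(r"\n\s*\n", text)
    # Drop empty fragments and surrounding whitespace.
    return [c.strip() for c in chunks if c.strip()]

contract = "1. Consents.\nNo consent is required.\n\n2. Notices.\nAll notices shall be in writing."
provisions = split_provisions(contract)
print(provisions)  # one string per provision paragraph
```

Each resulting chunk can then be fed to the classifier pipeline above as one row of the `text` column.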
## Results ```bash +-------+ |result| +-------+ |[Consents]| |[Other]| |[Other]| |[Consents]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_consents_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Consents 0.81 0.94 0.87 49 Other 0.95 0.85 0.90 72 accuracy - - 0.88 121 macro-avg 0.88 0.89 0.88 121 weighted-avg 0.89 0.88 0.89 121 ``` --- layout: model title: English Named Entity Recognition (from dslim) author: John Snow Labs name: bert_ner_bert_base_NER date: 2022-05-09 tags: [bert, ner, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-NER` is an English model originally trained by `dslim`. ## Predicted Entities `LOC`, `PER`, `ORG`, `MISC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_NER_en_3.4.2_3.0_1652096558277.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_NER_en_3.4.2_3.0_1652096558277.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_NER","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_NER","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
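The `ner` column produced above holds IOB-style tags (`B-PER`, `I-PER`, `O`, …), one per token. In Spark NLP itself these are usually merged into entity chunks with `NerConverter`; as a plain-Python illustration of what that merging does (the helper name and sample tags below are illustrative only):

```python
def bio_to_spans(tokens, tags):
    """Merge IOB tags into (entity_type, text) spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span, closing any open one.
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:
            # "O" or an inconsistent I- tag closes the open span.
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]

tokens = ["I", "love", "Spark", "NLP"]
tags = ["O", "O", "B-ORG", "I-ORG"]
print(bio_to_spans(tokens, tags))  # [('ORG', 'Spark NLP')]
```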
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_bert_base_NER| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/dslim/bert-base-NER - https://www.aclweb.org/anthology/W03-0419.pdf - https://arxiv.org/pdf/1810.04805 - https://github.com/google-research/bert/issues/223 --- layout: model title: English image_classifier_vit_electric_2 ViTForImageClassification from smc author: John Snow Labs name: image_classifier_vit_electric_2 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_electric_2` is an English model originally trained by smc. ## Predicted Entities `poles`, `transformers` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_electric_2_en_4.1.0_3.0_1660171547079.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_electric_2_en_4.1.0_3.0_1660171547079.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_electric_2", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_electric_2", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_electric_2| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Urdu BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_ur_cased date: 2022-12-02 tags: [ur, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ur edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-ur-cased` is an Urdu model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_ur_cased_ur_4.2.4_3.0_1670019274115.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_ur_cased_ur_4.2.4_3.0_1670019274115.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_ur_cased","ur") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_ur_cased","ur") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
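The `embeddings` column produced above holds one vector per token. A common downstream step is comparing tokens by cosine similarity; a minimal plain-Python sketch (the 4-dimensional vectors are illustrative only — real BERT vectors are 768-dimensional):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Illustrative token vectors; identical vectors score ~1.0.
v1 = [0.1, 0.3, -0.2, 0.5]
v2 = [0.1, 0.3, -0.2, 0.5]
print(cosine(v1, v2))
```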
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_ur_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ur| |Size:|348.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-ur-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: English T5ForConditionalGeneration Large Cased model (from google) author: John Snow Labs name: t5_efficient_large_nl12 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-large-nl12` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_large_nl12_en_4.3.0_3.0_1675117150157.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_large_nl12_en_4.3.0_3.0_1675117150157.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_large_nl12","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_large_nl12","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_large_nl12| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|802.0 MB| ## References - https://huggingface.co/google/t5-efficient-large-nl12 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Fast Neural Machine Translation Model from English to Hungarian author: John Snow Labs name: opus_mt_en_hu date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, hu, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `hu` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_hu_xx_2.7.0_2.4_1609170999360.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_hu_xx_2.7.0_2.4_1609170999360.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_hu", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_hu", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.hu').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_hu| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate English to Pohnpeian Pipeline author: John Snow Labs name: translate_en_pon date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, pon, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `pon` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_pon_xx_2.7.0_2.4_1609690686455.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_pon_xx_2.7.0_2.4_1609690686455.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_pon", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_pon", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.pon').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_pon| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from Rundi to English author: John Snow Labs name: opus_mt_run_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, run, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `run` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_run_en_xx_2.7.0_2.4_1609166179021.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_run_en_xx_2.7.0_2.4_1609166179021.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_run_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_run_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.run.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_run_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4CHEMD_Chem_Modified_SciBERT_512 date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Chem-Modified-SciBERT-512` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Modified_SciBERT_512_en_4.0.0_3.0_1657108506933.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Modified_SciBERT_512_en_4.0.0_3.0_1657108506933.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Modified_SciBERT_512","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Modified_SciBERT_512","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4CHEMD_Chem_Modified_SciBERT_512| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|410.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4CHEMD-Chem-Modified-SciBERT-512 --- layout: model title: Financial Sentiment Analysis (Lithuanian) author: John Snow Labs name: finclf_bert_sentiment_analysis date: 2022-10-22 tags: [lt, legal, classification, sentiment, analysis, licensed] task: Text Classification language: lt edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: FinanceBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Lithuanian Sentiment Analysis Text Classifier, which will determine whether a text expresses a Positive or a Negative emotion. ## Predicted Entities `POS`,`NEG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_bert_sentiment_analysis_lt_1.0.0_3.0_1666475378253.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_bert_sentiment_analysis_lt_1.0.0_3.0_1666475378253.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python # Test classifier in Spark NLP pipeline document_assembler = nlp.DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = nlp.Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') # Load newly trained classifier sequenceClassifier_loaded = finance.BertForSequenceClassification.pretrained("finclf_bert_sentiment_analysis", "lt", "finance/models")\ .setInputCols(["document",'token'])\ .setOutputCol("class") pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier_loaded ]) # Generating example example = spark.createDataFrame([["Pagalbos paraðiuto laukiantis verslas priemones vertina teigiamai tik yra keli „jeigu“"]]).toDF("text") result = pipeline.fit(example).transform(example) # Checking results result.select("text", "class.result").show(truncate=False) ```
## Results ```bash +---------------------------------------------------------------------------------------+------+ |text |result| +---------------------------------------------------------------------------------------+------+ |Pagalbos parašiuto laukiantis verslas priemones vertina teigiamai tik yra keli „jeigu“|[POS] | +---------------------------------------------------------------------------------------+------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_bert_sentiment_analysis| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|lt| |Size:|406.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References An in-house augmented version of [this dataset](https://www.kaggle.com/datasets/rokastrimaitis/lithuanian-financial-news-dataset-and-bigrams?select=dataset%28original%29.csv), removing the NEU tag. ## Benchmarking ```bash label precision recall f1-score support NEG 0.80 0.76 0.78 509 POS 0.90 0.92 0.91 1167 accuracy - - 0.87 1676 macro-avg 0.85 0.84 0.84 1676 weighted-avg 0.87 0.87 0.87 1676 ``` --- layout: model title: Pipeline to Classify Texts into TREC-6 Classes author: John Snow Labs name: bert_sequence_classifier_trec_coarse_pipeline date: 2022-02-23 tags: [bert_sequence, trec, coarse, bert, en, open_source] task: Text Classification language: en nav_key: models edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of [bert_sequence_classifier_trec_coarse_en](https://nlp.johnsnowlabs.com/2021/11/06/bert_sequence_classifier_trec_coarse_en.html). The TREC dataset for question classification consists of open-domain, fact-based questions divided into broad semantic categories. You can check the official documentation of the dataset, entities, etc. 
[here](https://search.r-project.org/CRAN/refmans/textdata/html/dataset_trec.html). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_trec_coarse_pipeline_en_3.4.0_3.0_1645609565500.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_trec_coarse_pipeline_en_3.4.0_3.0_1645609565500.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python trec_pipeline = PretrainedPipeline("bert_sequence_classifier_trec_coarse_pipeline", lang = "en") trec_pipeline.annotate("Germany is the largest country in Europe economically.") ``` ```scala val trec_pipeline = new PretrainedPipeline("bert_sequence_classifier_trec_coarse_pipeline", lang = "en") trec_pipeline.annotate("Germany is the largest country in Europe economically.") ```
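For quick reference, the six coarse TREC classes the pipeline can return are summarized below. The label set comes from the TREC-6 dataset referenced in the description; the short human-readable glosses are a summary for orientation, not part of the model output:

```python
# TREC-6 coarse question categories with short glosses
# (labels per the TREC question-classification dataset;
# glosses are informal summaries, not model metadata)
trec_coarse_labels = {
    "ABBR": "abbreviation",
    "DESC": "description or abstract concept",
    "ENTY": "entity",
    "HUM": "human being",
    "LOC": "location",
    "NUM": "numeric value",
}

# The example sentence in this card about Germany is classified as LOC
print(trec_coarse_labels["LOC"])
```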
## Results ```bash ['LOC'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_trec_coarse_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|406.6 MB| ## Included Models - DocumentAssembler - TokenizerModel - BertForSequenceClassification --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_only_classfn_twostage_epochs_1_shard_1_squad2.0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_only_classfn_twostage_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_only_classfn_twostage_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223837089.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_only_classfn_twostage_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223837089.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_only_classfn_twostage_epochs_1_shard_1_squad2.0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_only_classfn_twostage_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_only_classfn_twostage_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|463.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_only_classfn_twostage_epochs_1_shard_1_squad2.0 --- layout: model title: Pipeline to Detect PHI for Deidentification (Generic) author: John Snow Labs name: ner_deid_generic_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, deid, de] task: Named Entity Recognition language: de edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/01/06/ner_deid_generic_de.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_pipeline_de_3.4.1_3.0_1647888023955.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_pipeline_de_3.4.1_3.0_1647888023955.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_generic_pipeline", "de", "clinical/models") pipeline.annotate("Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.") ``` ```scala val pipeline = new PretrainedPipeline("ner_deid_generic_pipeline", "de", "clinical/models") pipeline.annotate("Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.") ``` {:.nlu-block} ```python import nlu nlu.load("de.med_ner.deid_generic.pipeline").predict("""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""") ```
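Once `annotate()` returns, chunk texts and their labels can be paired with ordinary Python. The dictionary below mirrors the Results table rather than a live call, and the key names (`chunk`, `ner_label`) are assumptions about this pipeline's output columns, not verified API names:

```python
# Mock of the pipeline output, shaped like the Results table of this card;
# the dictionary keys are illustrative assumptions, not verified column names.
annotations = {
    "chunk": ["Michael Berger", "12 Dezember 2018",
              "St. Elisabeth-Krankenhaus in Bad Kissingen", "Berger", "76"],
    "ner_label": ["NAME", "DATE", "LOCATION", "NAME", "AGE"],
}

# Pair each detected PHI chunk with its label
pairs = list(zip(annotations["chunk"], annotations["ner_label"]))
for chunk, label in pairs:
    print(f"{chunk} -> {label}")
```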
## Results ```bash +-----------------------------------------+---------+ |chunk |ner_label| +-----------------------------------------+---------+ |Michael Berger |NAME | |12 Dezember 2018 |DATE | |St. Elisabeth-Krankenhaus in Bad Kissingen|LOCATION | |Berger |NAME | |76 |AGE | +-----------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|de| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Fast Neural Machine Translation Model from Arabic to Spanish author: John Snow Labs name: opus_mt_ar_es date: 2021-06-01 tags: [open_source, seq2seq, translation, ar, es, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. 
- source languages: `ar` - target languages: `es` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ar_es_xx_3.1.0_2.4_1622550859700.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ar_es_xx_3.1.0_2.4_1622550859700.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ar_es", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ar_es", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Arabic.translate_to.Spanish').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ar_es| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Base Cased model (from nlpunibo) author: John Snow Labs name: distilbert_qa_base_config2 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert_base_config2` is an English model originally trained by `nlpunibo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_config2_en_4.3.0_3.0_1672774448417.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_config2_en_4.3.0_3.0_1672774448417.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_config2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_config2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_config2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/nlpunibo/distilbert_base_config2 --- layout: model title: Telugu Bert Embeddings author: John Snow Labs name: bert_embeddings_telugu_bertu date: 2022-04-11 tags: [bert, embeddings, te, open_source] task: Embeddings language: te edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `telugu_bertu` is a Telugu model originally trained by `kuppuluri`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_telugu_bertu_te_3.4.2_3.0_1649675264476.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_telugu_bertu_te_3.4.2_3.0_1649675264476.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_telugu_bertu","te") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["నేను స్పార్క్ nlp ను ప్రేమిస్తున్నాను"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_telugu_bertu","te") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("నేను స్పార్క్ nlp ను ప్రేమిస్తున్నాను").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("te.embed.telugu_bertu").predict("""నేను స్పార్క్ nlp ను ప్రేమిస్తున్నాను""") ```
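Downstream, the vectors in the `embeddings` column are typically compared with cosine similarity (for a BERT-base model like this one the vectors are usually 768-dimensional). A minimal, Spark-free sketch with toy stand-in vectors, illustrative only:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 4-dimensional stand-ins for token embeddings (not real model output)
u = [0.1, 0.3, -0.2, 0.7]
v = [0.1, 0.3, -0.2, 0.7]
w = [-0.7, 0.2, 0.1, -0.3]

print(cosine(u, v))  # identical vectors: similarity close to 1.0
print(cosine(u, w))  # dissimilar vectors: noticeably lower similarity
```

With the real model, `u` and `v` would be vectors pulled from the `embeddings` output column of the pipeline above.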
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_telugu_bertu| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|te| |Size:|415.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/kuppuluri/telugu_bertu --- layout: model title: Translate English to Russian Pipeline author: John Snow Labs name: translate_en_ru date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, ru, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `ru` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ru_xx_2.7.0_2.4_1609687763987.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ru_xx_2.7.0_2.4_1609687763987.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ru", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ru", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ru').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_ru| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate English to Lushai Pipeline author: John Snow Labs name: translate_en_lus date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, lus, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `lus` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_lus_xx_2.7.0_2.4_1609690402942.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_lus_xx_2.7.0_2.4_1609690402942.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_lus", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_lus", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.lus').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_lus| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English T5ForConditionalGeneration Base Cased model (from mrm8488) author: John Snow Labs name: t5_base_finetuned_span_sentiment_extraction date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-finetuned-span-sentiment-extraction` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_span_sentiment_extraction_en_4.3.0_3.0_1675109003319.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_span_sentiment_extraction_en_4.3.0_3.0_1675109003319.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_base_finetuned_span_sentiment_extraction","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_base_finetuned_span_sentiment_extraction","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_base_finetuned_span_sentiment_extraction| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|905.9 MB| ## References - https://huggingface.co/mrm8488/t5-base-finetuned-span-sentiment-extraction - https://twitter.com/AND__SO - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://www.kaggle.com/c/tweet-sentiment-extraction - https://arxiv.org/pdf/1910.10683.pdf - https://github.com/enzoampil/t5-intro/blob/master/t5_qa_training_pytorch_span_extraction.ipynb - https://github.com/enzoampil - https://twitter.com/mrm8488 - https://www.linkedin.com/in/manuel-romero-cs/ --- layout: model title: English image_classifier_vit_Infrastructures ViTForImageClassification from drab author: John Snow Labs name: image_classifier_vit_Infrastructures date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_Infrastructures` is an English model originally trained by drab. 
## Predicted Entities `Cooling tower`, `Transmission grid`, `Wind turbines` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Infrastructures_en_4.1.0_3.0_1660167727547.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Infrastructures_en_4.1.0_3.0_1660167727547.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_Infrastructures", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_Infrastructures", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_Infrastructures| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Kinyarwanda XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_kinyarwanda_finetuned_ner_kinyarwand date: 2022-08-01 tags: [rw, open_source, xlm_roberta, ner] task: Named Entity Recognition language: rw edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-kinyarwanda-finetuned-ner-kinyarwanda` is a Kinyarwanda model originally trained by `mbeukman`. ## Predicted Entities `PER`, `DATE`, `ORG`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_kinyarwanda_finetuned_ner_kinyarwand_rw_4.1.0_3.0_1659354149074.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_kinyarwanda_finetuned_ner_kinyarwand_rw_4.1.0_3.0_1659354149074.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_kinyarwanda_finetuned_ner_kinyarwand","rw") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_kinyarwanda_finetuned_ner_kinyarwand","rw") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_kinyarwanda_finetuned_ner_kinyarwand| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|rw| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-kinyarwanda-finetuned-ner-kinyarwanda - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://github.com/masakhane-io/masakhane-ner --- layout: model title: Detect Entities Related to Cancer Therapies author: John Snow Labs name: ner_oncology_therapy_wip date: 2022-09-30 tags: [licensed, clinical, oncology, en, ner, treatment] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts entities related to oncology therapies using granular labels, including mentions of treatments, posology information and line of therapy. Definitions of Predicted Entities: - `Cancer_Surgery`: Terms that indicate surgery as a form of cancer treatment. - `Chemotherapy`: Mentions of chemotherapy drugs, or unspecific words such as "chemotherapy". - `Cycle_Count`: The total number of cycles being administered of an oncological therapy (e.g. "5 cycles"). - `Cycle_Day`: References to the day of the cycle of oncological therapy (e.g. "day 5"). - `Cycle_Number`: The number of the cycle of an oncological therapy that is being applied (e.g. "third cycle"). - `Dosage`: The quantity prescribed by the physician for an active ingredient. - `Duration`: Words indicating the duration of a treatment (e.g. "for 2 weeks"). - `Frequency`: Words indicating the frequency of treatment administration (e.g. "daily" or "bid"). 
- `Hormonal_Therapy`: Mentions of hormonal drugs used to treat cancer, or unspecific words such as "hormonal therapy".
- `Immunotherapy`: Mentions of immunotherapy drugs, or unspecific words such as "immunotherapy".
- `Line_Of_Therapy`: Explicit references to the line of therapy of an oncological therapy (e.g. "first-line treatment").
- `Radiotherapy`: Terms that indicate the use of radiotherapy.
- `Radiation_Dose`: Dose used in radiotherapy.
- `Response_To_Treatment`: Terms related to the clinical progress of the patient related to cancer treatment, including "recurrence", "bad response" or "improvement".
- `Route`: Words indicating the type of administration route (such as "PO" or "transdermal").
- `Targeted_Therapy`: Mentions of targeted therapy drugs, or unspecific words such as "targeted therapy".
- `Unspecific_Therapy`: Terms that indicate a known cancer therapy but are not specific to any other therapy entity (e.g. "chemoradiotherapy" or "adjuvant therapy").

## Predicted Entities

`Cancer_Surgery`, `Chemotherapy`, `Cycle_Count`, `Cycle_Day`, `Cycle_Number`, `Dosage`, `Duration`, `Frequency`, `Hormonal_Therapy`, `Immunotherapy`, `Line_Of_Therapy`, `Radiotherapy`, `Radiation_Dose`, `Response_To_Treatment`, `Route`, `Targeted_Therapy`, `Unspecific_Therapy`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_therapy_wip_en_4.0.0_3.0_1664557936894.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 
URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_therapy_wip_en_4.0.0_3.0_1664557936894.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_oncology_therapy_wip", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter])

data = spark.createDataFrame([["The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy."]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_oncology_therapy_wip", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))

val data = Seq("The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu

nlu.load("en.med_ner.oncology_therapy_wip").predict("""The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.""")
```
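Downstream of `ner_chunk`, posology entities such as `Dosage` often need to be tied back to the drug they modify. A naive pure-Python sketch over (chunk, label) pairs like those this model produces (a rough, illustrative stand-in for proper relation extraction, not part of the model):

```python
def attach_dosages(pairs):
    """Attach each Dosage chunk to the nearest preceding drug chunk.
    Naive heuristic: assumes the dosage follows its drug in reading order."""
    drug_labels = {"Chemotherapy", "Immunotherapy", "Targeted_Therapy", "Hormonal_Therapy"}
    out, last_drug = [], None
    for chunk, label in pairs:
        if label in drug_labels:
            last_drug = chunk
        elif label == "Dosage" and last_drug is not None:
            out.append((last_drug, chunk))
    return out

pairs = [
    ("adriamycin", "Chemotherapy"),
    ("60 mg/m2", "Dosage"),
    ("cyclophosphamide", "Chemotherapy"),
    ("600 mg/m2", "Dosage"),
    ("six courses", "Cycle_Count"),
]
print(attach_dosages(pairs))
# [('adriamycin', '60 mg/m2'), ('cyclophosphamide', '600 mg/m2')]
```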
## Results ```bash | chunk | ner_label | |:-------------------------------|:----------------------| | mastectomy | Cancer_Surgery | | axillary lymph node dissection | Cancer_Surgery | | radiotherapy | Radiotherapy | | recurred | Response_To_Treatment | | adriamycin | Chemotherapy | | 60 mg/m2 | Dosage | | cyclophosphamide | Chemotherapy | | 600 mg/m2 | Dosage | | six courses | Cycle_Count | | first line | Line_Of_Therapy | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_therapy_wip| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|869.8 KB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Cycle_Number 58.0 18.0 15.0 73.0 0.76 0.79 0.78 Response_To_Treatment 249.0 80.0 180.0 429.0 0.76 0.58 0.66 Cycle_Count 151.0 48.0 24.0 175.0 0.76 0.86 0.81 Unspecific_Therapy 167.0 88.0 67.0 234.0 0.65 0.71 0.68 Chemotherapy 535.0 30.0 83.0 618.0 0.95 0.87 0.90 Targeted_Therapy 144.0 9.0 35.0 179.0 0.94 0.80 0.87 Radiotherapy 188.0 17.0 34.0 222.0 0.92 0.85 0.88 Cancer_Surgery 526.0 60.0 119.0 645.0 0.90 0.82 0.85 Line_Of_Therapy 73.0 14.0 14.0 87.0 0.84 0.84 0.84 Hormonal_Therapy 95.0 1.0 21.0 116.0 0.99 0.82 0.90 Immunotherapy 90.0 58.0 21.0 111.0 0.61 0.81 0.69 Cycle_Day 149.0 33.0 34.0 183.0 0.82 0.81 0.82 Frequency 287.0 35.0 62.0 349.0 0.89 0.82 0.86 Route 82.0 17.0 15.0 97.0 0.83 0.85 0.84 Duration 399.0 95.0 148.0 547.0 0.81 0.73 0.77 Dosage 718.0 38.0 109.0 827.0 0.95 0.87 0.91 Radiation_Dose 84.0 15.0 12.0 96.0 0.85 0.88 0.86 macro_avg 3995.0 656.0 993.0 4988.0 0.84 0.81 0.82 micro_avg NaN NaN NaN NaN 0.86 0.80 0.83 ``` --- layout: model title: English image_classifier_vit_roomclassifier ViTForImageClassification from lazyturtl author: John Snow Labs name: image_classifier_vit_roomclassifier date: 2022-08-10 tags: [vit, en, images, 
open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_roomclassifier` is an English model originally trained by lazyturtl.

## Predicted Entities

`Kitchen`, `Bedroom`, `Bathroom`, `Livingroom`, `DinningRoom`, `Laundry room`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_roomclassifier_en_4.1.0_3.0_1660171006278.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_roomclassifier_en_4.1.0_3.0_1660171006278.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_roomclassifier", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_roomclassifier", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
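The classifier returns one of the raw labels listed under Predicted Entities, including the misspelled `DinningRoom` and the one-word `Livingroom`. A small pure-Python helper for turning raw labels into display names (the mapping is illustrative and not part of the model):

```python
def pretty_room(label):
    """Map the classifier's raw room labels to readable display names.
    Unknown labels pass through unchanged."""
    display = {
        "Livingroom": "Living room",
        "DinningRoom": "Dining room",  # label is misspelled in the model itself
    }
    return display.get(label, label)

print([pretty_room(l) for l in ["Kitchen", "DinningRoom", "Livingroom"]])
# ['Kitchen', 'Dining room', 'Living room']
```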
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_roomclassifier|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|

---
layout: model
title: SDOH Environment Status Classification
author: John Snow Labs
name: bert_sequence_classifier_sdoh_environment_status
date: 2022-12-18
tags: [en, clinical, sdoh, licensed, sequence_classification, environment_status, classifier]
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 4.2.2
spark_version: 3.0
supported: true
annotator: MedicalBertForSequenceClassification
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model classifies clinical text according to the patient's environment status, looking for indications of housing or homelessness. A discharge summary was classified as True for SDOH Environment if there was any indication of housing, False if the patient was homeless, and None if there was no related passage.

## Predicted Entities

`True`, `False`, `None`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_sdoh_environment_status_en_4.2.2_3.0_1671371837321.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_sdoh_environment_status_en_4.2.2_3.0_1671371837321.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_sdoh_environment_status", "en", "clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) sample_texts = ["The patient is a 29-year-old female with a history of renal transplant in 2097, who had worsening renal failure for the past several months. Her chief complaints were hypotension and seizure. months prior to admission and had been more hypertensive recently, requiring blood pressure medications. She was noted to have worsening renal function secondary to recent preeclampsia and her blood pressure control was thought to be secondary to renal failure.", "Mr Known lastname 19017 is a 66 year-old man with a PMHx of stage 4 COPD (FEV1 0.65L;FEV1/FVC 37% predicted in 4-14) on 4L home o2 with numerous hospitalizations for COPD exacerbations and intubation, hypertension, coronary artery disease, GERD who presents with SOB and CP. He is admitted to the ICU for management of dyspnea and hypotension.", "He was deemed Child's B in 2156-5-17 with ongoing ethanol abuse, admitted to Intensive Care Unit due to acute decompensation of chronic liver disease due to alcoholic hepatitis and Escherichia coli sepsis. after being hit in the head with the a bottle and dropping to the floor in the apartment. They had Trauma work him up including a head computerized tomography scan which was negative. He had abdominal pain for approximately one month with increasing abdominal girth, was noted to be febrile to 100 degrees on presentation and was tachycardiac 130, stable blood pressures. 
He was noted to have distended abdomen with diffuse tenderness computerized tomography scan of the abdomen which showed ascites and large nodule of the liver, splenomegaly, paraesophageal varices and loops of thickened bowel."] data = spark.createDataFrame(sample_texts, StringType()).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_sdoh_environment_status", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) val data = Seq("The patient is a 29-year-old female with a history of renal transplant in 2097, who had worsening renal failure for the past several months. Her chief complaints were hypotension and seizure. months prior to admission and had been more hypertensive recently, requiring blood pressure medications. She was noted to have worsening renal function secondary to recent preeclampsia and her blood pressure control was thought to be secondary to renal failure.") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.bert_sequence.sdoh.environment_status").predict("""He was deemed Child's B in 2156-5-17 with ongoing ethanol abuse, admitted to Intensive Care Unit due to acute decompensation of chronic liver disease due to alcoholic hepatitis and Escherichia coli sepsis. after being hit in the head with the a bottle and dropping to the floor in the apartment. They had Trauma work him up including a head computerized tomography scan which was negative. 
He had abdominal pain for approximately one month with increasing abdominal girth, was noted to be febrile to 100 degrees on presentation and was tachycardiac 130, stable blood pressures. He was noted to have distended abdomen with diffuse tenderness computerized tomography scan of the abdomen which showed ascites and large nodule of the liver, splenomegaly, paraesophageal varices and loops of thickened bowel.""") ```
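The three labels are easy to misread: `True` means housing is mentioned, `False` means the patient is homeless, and `None` means there is no related passage. A tiny helper making that explicit when presenting predictions (the wording of the display strings is illustrative):

```python
def environment_status(label):
    """Translate the classifier's raw label into a readable housing status."""
    return {
        "True": "housing mentioned",
        "False": "homeless",
        "None": "no related passage",
    }[label]

print([environment_status(l) for l in ["None", "False", "True"]])
# ['no related passage', 'homeless', 'housing mentioned']
```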
## Results ```bash +----------------------------------------------------------------------------------------------------+-------+ | text| result| +----------------------------------------------------------------------------------------------------+-------+ |The patient is a 29-year-old female with a history of renal transplant in 2097, who had worsening...| [None]| |Mr Known lastname 19017 is a 66 year-old man with a PMHx of stage 4 COPD (FEV1 0.65L;FEV1/FVC 37%...|[False]| |He was deemed Child's B in 2156-5-17 with ongoing ethanol abuse, admitted to Intensive Care Unit ...| [True]| +----------------------------------------------------------------------------------------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_sdoh_environment_status| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|410.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## Benchmarking ```bash label precision recall f1-score support None 0.89 0.78 0.83 277 False 0.86 0.93 0.90 419 True 0.67 1.00 0.80 6 accuracy - - 0.87 702 macro-avg 0.81 0.90 0.84 702 weighted-avg 0.87 0.87 0.87 702 ``` --- layout: model title: Danish asr_wav2vec2_xls_r_300m_ftspeech TFWav2Vec2ForCTC from saattrupdan author: John Snow Labs name: asr_wav2vec2_xls_r_300m_ftspeech date: 2022-09-25 tags: [wav2vec2, da, audio, open_source, asr] task: Automatic Speech Recognition language: da edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_xls_r_300m_ftspeech` is a Danish model originally trained by saattrupdan. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_xls_r_300m_ftspeech_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_ftspeech_da_4.2.0_3.0_1664101599815.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_ftspeech_da_4.2.0_3.0_1664101599815.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xls_r_300m_ftspeech", "da")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xls_r_300m_ftspeech", "da") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
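`AudioAssembler` reads a column of audio samples as floats, so the `audioDf` above has to be built from decoded audio; XLS-R checkpoints are typically trained on 16 kHz mono input. A minimal standard-library sketch for decoding a 16-bit PCM WAV file into normalized floats (the helper name is illustrative, and the 16 kHz/mono requirement is an assumption to verify against the checkpoint):

```python
import struct
import wave

def wav_to_floats(path):
    """Decode a 16-bit PCM mono WAV file into floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2 and w.getnchannels() == 1, "expect 16-bit mono"
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]
```

A list of such float arrays can then be parallelized into the `audio_content` column that `AudioAssembler` consumes.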
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_xls_r_300m_ftspeech|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|da|
|Size:|1.2 GB|

---
layout: model
title: English RobertaForQuestionAnswering Large Cased model (from tli8hf)
author: John Snow Labs
name: roberta_qa_unqover_large_news
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `unqover-roberta-large-newsqa` is an English model originally trained by `tli8hf`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_unqover_large_news_en_4.3.0_3.0_1674224676431.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_unqover_large_news_en_4.3.0_3.0_1674224676431.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_unqover_large_news","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_unqover_large_news","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
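Several question/context pairs can be scored in one pass by building the rows first and handing them to `spark.createDataFrame(rows).toDF("question", "context")`, exactly as in the snippet above. A tiny sketch of the row-building step (the sample pairs are illustrative):

```python
def qa_rows(questions, contexts):
    """Zip questions with their contexts into (question, context) rows,
    validating that every question has exactly one context."""
    if len(questions) != len(contexts):
        raise ValueError("each question needs exactly one context")
    return list(zip(questions, contexts))

rows = qa_rows(
    ["What's my name?", "Where do I live?"],
    ["My name is Clara and I live in Berkeley."] * 2,
)
print(rows[1])
# ('Where do I live?', 'My name is Clara and I live in Berkeley.')
```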
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_unqover_large_news|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/tli8hf/unqover-roberta-large-newsqa

---
layout: model
title: Financial Legal proceedings Item Binary Classifier
author: John Snow Labs
name: finclf_legal_proceedings_item
date: 2022-08-10
tags: [en, finance, classification, 10k, annual, reports, sec, filings, licensed]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `legal_proceedings` item type of 10K Annual Reports. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level. If you have big financial documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Finance NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
## Predicted Entities `other`, `legal_proceedings` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_legal_proceedings_item_en_1.0.0_3.2_1660154442715.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_legal_proceedings_item_en_1.0.0_3.2_1660154442715.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") useEmbeddings = nlp.UniversalSentenceEncoder.pretrained() \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("finclf_legal_proceedings_item", "en", "finance/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, useEmbeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
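The description above recommends paragraph splitting (by multiline) before classification and notes the 512-token limit. A minimal standard-library sketch of that preprocessing step (illustrative; the word count is only a rough proxy for the embedding model's token count):

```python
import re

def split_paragraphs(text, max_words=512):
    """Split a document on blank lines and keep only paragraphs within a
    rough word budget, since the model's embeddings allow up to 512 tokens."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [p for p in paragraphs if len(p.split()) <= max_words]

doc = "ITEM 3. LEGAL PROCEEDINGS\n\nThe Company is a party to various lawsuits.\n\n"
print(split_paragraphs(doc))
# ['ITEM 3. LEGAL PROCEEDINGS', 'The Company is a party to various lawsuits.']
```

Each returned paragraph can then be fed to the pipeline as one row of the `text` column.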
## Results

```bash
+-------------------+
|             result|
+-------------------+
|[legal_proceedings]|
|            [other]|
|            [other]|
|[legal_proceedings]|
+-------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|finclf_legal_proceedings_item|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.6 MB|

## References

Weak labelling on documents from the Edgar database.

## Benchmarking

```bash
            label  precision  recall  f1-score  support
legal_proceedings       0.96    0.88      0.92       25
            other       0.92    0.97      0.95       36
         accuracy          -       -      0.93       61
        macro-avg       0.94    0.93      0.93       61
     weighted-avg       0.94    0.93      0.93       61
```

---
layout: model
title: English T5ForConditionalGeneration Cased model (from lordtt13)
author: John Snow Labs
name: t5_inshorts
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-inshorts` is an English model originally trained by `lordtt13`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_inshorts_en_4.3.0_3.0_1675124897561.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_inshorts_en_4.3.0_3.0_1675124897561.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_inshorts","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_inshorts","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
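`t5-inshorts` is a summarization fine-tune, and T5-style models are usually steered with a task prefix prepended to the input text. A small sketch of building prefixed inputs (whether this checkpoint expects the standard `summarize: ` prefix is an assumption; check the Hugging Face model card):

```python
def to_t5_input(texts, prefix="summarize: "):
    """Prepend a T5-style task prefix to each input text before inference."""
    return [prefix + t for t in texts]

print(to_t5_input(["The government announced new rules today."]))
# ['summarize: The government announced new rules today.']
```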
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_inshorts|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|927.0 MB|

## References

- https://huggingface.co/lordtt13/t5-inshorts
- https://arxiv.org/abs/1910.10683
- https://www.kaggle.com/shashichander009/inshorts-news-data
- https://github.com/lordtt13/transformers-experiments/blob/master/Custom%20Tasks/fine-tune-t5-for-summarization.ipynb
- https://github.com/lordtt13
- https://www.linkedin.com/in/tanmay-thakur-6bb5a9154/

---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab240 TFWav2Vec2ForCTC from hassnain
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_colab240
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab240` is an English model originally trained by hassnain.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab240_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab240_en_4.2.0_3.0_1664023948102.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab240_en_4.2.0_3.0_1664023948102.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab240', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab240", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab240| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Russian T5ForConditionalGeneration Small Cased model (from cointegrated) author: John Snow Labs name: t5_rut5_small_normalizer date: 2023-01-30 tags: [ru, open_source, t5, tensorflow] task: Text Generation language: ru edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rut5-small-normalizer` is a Russian model originally trained by `cointegrated`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_rut5_small_normalizer_ru_4.3.0_3.0_1675106835622.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_rut5_small_normalizer_ru_4.3.0_3.0_1675106835622.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_rut5_small_normalizer","ru") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_rut5_small_normalizer","ru")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_rut5_small_normalizer|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|ru|
|Size:|277.8 MB|

## References

- https://huggingface.co/cointegrated/rut5-small-normalizer
- https://github.com/natasha/natasha
- https://github.com/kmike/pymorphy2
- https://wortschatz.uni-leipzig.de/en/download/Russian

---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from jgammack)
author: John Snow Labs
name: roberta_qa_jgammack_base_squad
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad` is an English model originally trained by `jgammack`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_jgammack_base_squad_en_4.3.0_3.0_1674218670079.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_jgammack_base_squad_en_4.3.0_3.0_1674218670079.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_jgammack_base_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_jgammack_base_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_jgammack_base_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/jgammack/roberta-base-squad --- layout: model title: Pipeline to Detect Drug Information author: John Snow Labs name: ner_posology_biobert_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, drug, en] task: [Named Entity Recognition, Pipeline Healthcare] language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_posology_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_posology_biobert_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_POSOLOGY.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_biobert_pipeline_en_3.4.1_3.0_1647871826696.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_biobert_pipeline_en_3.4.1_3.0_1647871826696.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_posology_biobert_pipeline", "en", "clinical/models") pipeline.fullAnnotate('The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. 
daily, and Bactrim DS b.i.d.') ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_posology_biobert_pipeline", "en", "clinical/models") pipeline.fullAnnotate("The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. 
daily, and Bactrim DS b.i.d.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology_biobert.pipeline").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""") ```
## Results ```bash +--------------+---------+ |chunks |entities | +--------------+---------+ |insulin |DRUG | |Bactrim |DRUG | |for 14 days |DURATION | |Fragmin |DRUG | |5000 units |DOSAGE | |subcutaneously|ROUTE | |daily |FREQUENCY| |Xenaderm |DRUG | |topically |ROUTE | |b.i.d |FREQUENCY| |Lantus |DRUG | |40 units |DOSAGE | |subcutaneously|ROUTE | |at bedtime |FREQUENCY| |OxyContin |DRUG | |30 mg |STRENGTH | |p.o |ROUTE | |q.12 h |FREQUENCY| |folic acid |DRUG | |1 mg |STRENGTH | +--------------+---------+ only showing top 20 rows ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.0 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter --- layout: model title: Pipeline to Detect Clinical Entities (ner_jsl) author: John Snow Labs name: ner_jsl_pipeline date: 2023-03-09 tags: [ner, licensed, en, clinical] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_jsl](https://nlp.johnsnowlabs.com/2022/10/19/ner_jsl_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_pipeline_en_4.3.0_3.2_1678353833465.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_pipeline_en_4.3.0_3.2_1678353833465.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_jsl_pipeline", "en", "clinical/models") text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). Additionally, there is no side effect observed after Influenza vaccine. One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_jsl_pipeline", "en", "clinical/models") val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). Additionally, there is no side effect observed after Influenza vaccine. One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. 
He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). Additionally, there is no side effect observed after Influenza vaccine. One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
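Each extracted chunk in the Results section carries a confidence score, so low-confidence chunks can be screened out before downstream use. A hedged sketch over plain (chunk, label, confidence) tuples; the tuple layout mirrors the Results table and is illustrative, not a pipeline API:

```python
def filter_by_confidence(rows, threshold=0.9):
    """Keep (chunk, label, confidence) rows whose confidence meets the threshold."""
    return [r for r in rows if r[2] >= threshold]

rows = [
    ("21-day-old", "Age", 0.997),
    ("suctioning yellow discharge", "Symptom", 0.268),
    ("Tylenol", "Drug_BrandName", 0.9988),
]
print(filter_by_confidence(rows))
```

Choosing the threshold is a trade-off between precision and recall and should be tuned on your own data.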
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:------------------------------------------|--------:|------:|:-----------------------------|-------------:| | 0 | 21-day-old | 17 | 26 | Age | 0.997 | | 1 | Caucasian | 28 | 36 | Race_Ethnicity | 0.9995 | | 2 | male | 38 | 41 | Gender | 0.9998 | | 3 | 2 days | 52 | 57 | Duration | 0.805 | | 4 | congestion | 62 | 71 | Symptom | 0.9049 | | 5 | mom | 75 | 77 | Gender | 0.9907 | | 6 | suctioning yellow discharge | 88 | 114 | Symptom | 0.268133 | | 7 | nares | 135 | 139 | External_body_part_or_region | 0.7284 | | 8 | she | 147 | 149 | Gender | 0.9978 | | 9 | mild | 168 | 171 | Modifier | 0.7517 | | 10 | problems with his breathing while feeding | 173 | 213 | Symptom | 0.664583 | | 11 | perioral cyanosis | 237 | 253 | Symptom | 0.6869 | | 12 | retractions | 258 | 268 | Symptom | 0.9912 | | 13 | Influenza vaccine | 325 | 341 | Vaccine_Name | 0.833 | | 14 | One day ago | 344 | 354 | RelativeDate | 0.8667 | | 15 | mom | 357 | 359 | Gender | 0.9991 | | 16 | tactile temperature | 376 | 394 | Symptom | 0.3339 | | 17 | Tylenol | 417 | 423 | Drug_BrandName | 0.9988 | | 18 | Baby | 426 | 429 | Age | 0.9634 | | 19 | decreased p.o | 449 | 461 | Symptom | 0.75925 | | 20 | His | 472 | 474 | Gender | 0.9998 | | 21 | 20 minutes | 511 | 520 | Duration | 0.48575 | | 22 | 5 to 10 minutes | 531 | 545 | Duration | 0.526575 | | 23 | his | 560 | 562 | Gender | 0.988 | | 24 | respiratory congestion | 564 | 585 | Symptom | 0.6168 | | 25 | He | 588 | 589 | Gender | 0.9992 | | 26 | tired | 622 | 626 | Symptom | 0.8745 | | 27 | fussy | 641 | 645 | Symptom | 0.8509 | | 28 | over the past 2 days | 647 | 666 | RelativeDate | 0.60494 | | 29 | albuterol | 709 | 717 | Drug_Ingredient | 0.9876 | | 30 | ER | 743 | 744 | Clinical_Dept | 0.9974 | | 31 | His | 747 | 749 | Gender | 0.9996 | | 32 | urine output has also decreased | 751 | 781 | Symptom | 0.39878 | | 33 | he | 793 | 794 | Gender | 0.997 | | 34 | per 24 hours | 832 
| 843 | Frequency | 0.462333 | | 35 | he | 850 | 851 | Gender | 0.9983 | | 36 | per 24 hours | 879 | 890 | Frequency | 0.562167 | | 37 | Mom | 893 | 895 | Gender | 0.9997 | | 38 | diarrhea | 908 | 915 | Symptom | 0.9956 | | 39 | His | 918 | 920 | Gender | 0.9997 | | 40 | bowel | 922 | 926 | Internal_organ_or_component | 0.9218 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Relation Extraction Between Body Parts and Direction Entities (ReDL) author: John Snow Labs name: redl_bodypart_direction_biobert date: 2021-06-01 tags: [licensed, en, clinical, relation_extraction] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.0.3 spark_version: 3.0 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Relation extraction between body part entities such as `Internal_organ_or_component` and `External_body_part_or_region`, and direction entities such as `upper` and `lower`, in clinical texts. `1` : indicates that the body part and direction entity are related, `0` : indicates that they are not related. 
## Predicted Entities `0`, `1` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.1.Clinical_Relation_Extraction_BodyParts_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_direction_biobert_en_3.0.3_3.0_1622564511730.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_direction_biobert_en_3.0.3_3.0_1622564511730.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = sparknlp.annotators.Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") words_embedder = WordEmbeddingsModel() \ .pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverter() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel() \ .pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") # Set a filter on pairs of named entities which will be treated as relation candidates re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks")\ .setRelationPairs(['direction-external_body_part_or_region', 'external_body_part_or_region-direction', 'direction-internal_organ_or_component', 'internal_organ_or_component-direction' ]) # The dataset this model was trained on is sentence-wise. # This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. 
re_model = RelationExtractionDLModel()\ .pretrained('redl_bodypart_direction_biobert', 'en', "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) data = spark.createDataFrame([[''' MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia ''']]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala ... val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols(Array("sentences")) .setOutputCol("tokens") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") .setRelationPairs(Array("direction-external_body_part_or_region", 
"external_body_part_or_region-direction", "direction-internal_organ_or_component", "internal_organ_or_component-direction")) // The dataset this model was trained on is sentence-wise. // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. val re_model = RelationExtractionDLModel() .pretrained("redl_bodypart_direction_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation").predict("""MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia""") ```
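`setRelationPairs` takes lowercase `entityA-entityB` strings, and both directions of each pair usually need to be listed, as in the snippet above. The expansion can be sketched with a small helper (illustrative only, not part of the Spark NLP API):

```python
def directed_pairs(*entity_pairs):
    """Expand (a, b) entity label tuples into the lowercase 'a-b' and 'b-a'
    strings expected by RENerChunksFilter's setRelationPairs."""
    out = []
    for a, b in entity_pairs:
        out.append(f"{a.lower()}-{b.lower()}")
        out.append(f"{b.lower()}-{a.lower()}")
    return out

pairs = directed_pairs(
    ("Direction", "External_body_part_or_region"),
    ("Direction", "Internal_organ_or_component"),
)
print(pairs)
```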
## Results ```bash | index | relations | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |-------|-----------|-----------------------------|---------------|-------------|------------|-----------------------------|-------------|-------------|---------------|------------| | 0 | 1 | Direction | 35 | 39 | upper | Internal_organ_or_component | 41 | 50 | brain stem | 0.9999989 | | 1 | 0 | Direction | 35 | 39 | upper | Internal_organ_or_component | 59 | 68 | cerebellum | 0.99992585 | | 2 | 0 | Direction | 35 | 39 | upper | Internal_organ_or_component | 81 | 93 | basil ganglia | 0.9999999 | | 3 | 0 | Internal_organ_or_component | 41 | 50 | brain stem | Direction | 54 | 57 | left | 0.999811 | | 4 | 0 | Internal_organ_or_component | 41 | 50 | brain stem | Direction | 75 | 79 | right | 0.9998203 | | 5 | 1 | Direction | 54 | 57 | left | Internal_organ_or_component | 59 | 68 | cerebellum | 1.0 | | 6 | 0 | Direction | 54 | 57 | left | Internal_organ_or_component | 81 | 93 | basil ganglia | 0.97616416 | | 7 | 0 | Internal_organ_or_component | 59 | 68 | cerebellum | Direction | 75 | 79 | right | 0.953046 | | 8 | 1 | Direction | 75 | 79 | right | Internal_organ_or_component | 81 | 93 | basil ganglia | 1.0 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_bodypart_direction_biobert| |Compatibility:|Healthcare NLP 3.0.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Case sensitive:|true| ## Data Source Trained on an internal dataset. ## Benchmarking ```bash Relation Recall Precision F1 Support 0 0.856 0.873 0.865 153 1 0.986 0.984 0.985 1347 Avg. 
0.921 0.929 0.925 - ``` --- layout: model title: Clinical Deidentification author: John Snow Labs name: clinical_deidentification date: 2022-03-03 tags: [deidentification, en, licensed, pipeline, clinical] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to deidentify PHI (protected health information) in medical texts. The PHI will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`, `CITY`, `COUNTRY`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `STREET`, `USERNAME`, `ZIP`, `ACCOUNT`, `LICENSE`, `VIN`, `SSN`, `DLN`, `PLATE`, `IPADDR`, `EMAIL` entities. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_en_3.4.1_2.4_1646340071616.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_en_3.4.1_2.4_1646340071616.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "en", "clinical/models") sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""" result = deid_pipeline.annotate(sample) print("\n".join(result['masked'])) print("\n".join(result['masked_with_chars'])) print("\n".join(result['masked_fixed_length_chars'])) print("\n".join(result['obfuscated'])) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = new PretrainedPipeline("clinical_deidentification","en","clinical/models") val sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""" val result =deid_pipeline .annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("en.de_identify.clinical_pipeline").predict("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""") ```
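The fixed-length masking style produced by the pipeline can be illustrated with a plain-Python sketch. This is only a simplified illustration of replacing character spans with `****`; the actual pipeline resolves entity spans with its own NER and `DeIdentificationModel` stages, and the spans below are hypothetical:

```python
def mask_fixed_length(text, spans, mask="****"):
    """Replace each (begin, end) character span (end inclusive) with a fixed mask."""
    out, cursor = [], 0
    for begin, end in sorted(spans):
        out.append(text[cursor:begin])  # keep text up to the entity
        out.append(mask)                # drop the entity itself
        cursor = end + 1
    out.append(text[cursor:])
    return "".join(out)

text = "Dr. John Green, ID: 1231511863."
# Hypothetical (begin, end) spans for the NAME and ID entities.
print(mask_fixed_length(text, [(4, 13), (20, 29)]))  # -> Dr. ****, ID: ****.
```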
## Results ```bash Masked with entity labels ------------------------------ Name : , Record date: , # . Dr. , ID, IP . He is a male was admitted to the for cystectomy on . Patient's VIN : , SSN , Driver's license . Phone , , , E-MAIL: . Masked with chars ------------------------------ Name : [**************], Record date: [********], # [****]. Dr. [********], ID[**********], IP [************]. He is a [*********] male was admitted to the [**********] for cystectomy on [******]. Patient's VIN : [***************], SSN [**********], Driver's license [*********]. Phone [************], [***************], [***********], E-MAIL: [*************]. Masked with fixed length chars ------------------------------ Name : ****, Record date: ****, # ****. Dr. ****, ID****, IP ****. He is a **** male was admitted to the **** for cystectomy on ****. Patient's VIN : ****, SSN ****, Driver's license ****. Phone ****, ****, ****, E-MAIL: ****. Obfuscated ------------------------------ Name : Berneta Phenes, Record date: 2093-03-14, # Y5003067. Dr. Dr Gaston Margo, IDOX:8976967, IP 001.001.001.001. He is a 91 male was admitted to the MADONNA REHABILITATION HOSPITAL for cystectomy on 07-22-1994. Patient's VIN : 5eeee44ffff555666, SSN 999-84-3686, Driver's license S99956482. Phone 74 617 042, 1407 west stassney lane, Edmonton, E-MAIL: Carliss@hotmail.com. 
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ChunkMergeModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: English asr_final_wav2vec2_urdu_asr_project TFWav2Vec2ForCTC from Raffay author: John Snow Labs name: pipeline_asr_final_wav2vec2_urdu_asr_project date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_final_wav2vec2_urdu_asr_project` is an English model originally trained by Raffay. 
NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use pipeline_asr_final_wav2vec2_urdu_asr_project_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_final_wav2vec2_urdu_asr_project_en_4.2.0_3.0_1664020194849.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_final_wav2vec2_urdu_asr_project_en_4.2.0_3.0_1664020194849.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_final_wav2vec2_urdu_asr_project', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_final_wav2vec2_urdu_asr_project", lang = "en") val annotations = pipeline.transform(audioDF) ```
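The pipeline's `AudioAssembler` stage consumes raw audio as an array of floats (Wav2Vec2 models are typically trained on 16 kHz mono audio), so each row of `audioDF` must carry such an array. A hedged, stdlib-only sketch of decoding 16-bit PCM WAV bytes into floats; in practice, libraries such as librosa are commonly used to load and resample audio:

```python
import io
import struct
import wave

def wav_to_floats(wav_bytes):
    """Decode 16-bit PCM WAV bytes into a list of floats in [-1.0, 1.0)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit PCM"
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a tiny in-memory WAV purely for demonstration.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)       # mono
    w.setsampwidth(2)       # 16-bit
    w.setframerate(16000)   # 16 kHz, the usual Wav2Vec2 sample rate
    w.writeframes(struct.pack("<3h", 0, 16384, -16384))
print(wav_to_floats(buf.getvalue()))  # [0.0, 0.5, -0.5]
```

The resulting float list is what would populate the audio column consumed by the pipeline.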
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_final_wav2vec2_urdu_asr_project| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|349.3 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Voice of the Patients author: John Snow Labs name: ner_vop_slim_wip date: 2023-02-25 tags: [ner, clinical, en, licensed, vop, voice, patient] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.3.1 spark_version: 3.0 supported: true recommended: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts healthcare-related terms from documents written in the patient's own words. Note: The 'wip' suffix indicates that model development is a work in progress; the model will be finalised and its performance improved in upcoming releases. 
## Predicted Entities `AdmissionDischarge`, `Age`, `BodyPart`, `ClinicalDept`, `DateTime`, `Disease`, `Dosage_Strength`, `Drug`, `Duration`, `Employment`, `Form`, `Frequency`, `Gender`, `Laterality`, `Procedure`, `PsychologicalCondition`, `RelationshipStatus`, `Route`, `Symptom`, `Test`, `Vaccine`, `VitalTest` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/VOICE_OF_THE_PATIENTS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_slim_wip_en_4.3.1_3.0_1677342424243.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_slim_wip_en_4.3.1_3.0_1677342424243.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_vop_slim_wip", "en", "clinical/models")\ .setInputCols(["sentence", "token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter ]) sample_texts = ["Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. 
I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you."] data = spark.createDataFrame(sample_texts, StringType()).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_vop_slim_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter )) val data = Seq("Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. 
Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash +--------------------+-----+----+----------------------+ |chunk |begin|end |ner_label | +--------------------+-----+----+----------------------+ |20 year old |10 |20 |Age | |girl |22 |25 |Gender | |hyperthyroid |47 |58 |Disease | |1 month ago |60 |70 |DateTime | |weak |87 |90 |Symptom | |panic attacks |122 |134 |PsychologicalCondition| |depression |137 |146 |PsychologicalCondition| |left |149 |152 |Laterality | |chest |154 |158 |BodyPart | |pain |160 |163 |Symptom | |heart rate |176 |185 |VitalTest | |weight loss |196 |206 |Symptom | |4 months |215 |222 |Duration | |hospital |258 |265 |ClinicalDept | |discharged |276 |285 |AdmissionDischarge | |hospital |292 |299 |ClinicalDept | |blood tests |319 |329 |Test | |brain |332 |336 |BodyPart | |mri |338 |340 |Test | |ultrasound scan |343 |357 |Test | |endoscopy |360 |368 |Procedure | |doctors |391 |397 |Employment | |homeopathy doctor |486 |502 |Employment | |he |512 |513 |Gender | |hyperthyroid |546 |557 |Disease | |TSH |566 |568 |Test | |T3 |579 |580 |Test | |T4 |586 |587 |Test | |b12 deficiency |613 |626 |Disease | |vitamin D deficiency|632 |651 |Disease | |weekly |667 |672 |Frequency | |supplement |674 |683 |Drug | |vitamin D |688 |696 |Drug | |1000 mcg |702 |709 |Dosage_Strength | |b12 |711 |713 |Drug | |daily |715 |719 |Frequency | |homeopathy medicine |733 |751 |Drug | |40 days |757 |763 |Duration | |after 30 days |783 |795 |DateTime | |TSH |801 |803 |Test | |now |812 |814 |DateTime | |weakness |849 |856 |Symptom | |depression |862 |871 |PsychologicalCondition| |last week |912 |920 |DateTime | |rapid heartrate |960 |974 |Symptom | |thyroid |1074 |1080|BodyPart | +--------------------+-----+----+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_slim_wip| |Compatibility:|Healthcare NLP 4.3.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|2.4 MB|
## Benchmarking ```bash label tp fp fn total precision recall f1 Route 25.0 4.0 15.0 40.0 0.8621 0.625 0.7246 Procedure 161.0 48.0 70.0 231.0 0.7703 0.697 0.7318 Vaccine 22.0 7.0 8.0 30.0 0.7586 0.7333 0.7458 RelationshipStatus 6.0 2.0 2.0 8.0 0.75 0.75 0.75 Disease 884.0 201.0 285.0 1169.0 0.8147 0.7562 0.7844 Frequency 342.0 61.0 113.0 455.0 0.8486 0.7516 0.7972 Duration 720.0 188.0 146.0 866.0 0.793 0.8314 0.8117 Test 478.0 106.0 103.0 581.0 0.8185 0.8227 0.8206 Symptom 1569.0 337.0 340.0 1909.0 0.8232 0.8219 0.8225 DateTime 1558.0 277.0 296.0 1854.0 0.849 0.8403 0.8447 ClinicalDept 157.0 9.0 48.0 205.0 0.9458 0.7659 0.8464 Form 110.0 28.0 11.0 121.0 0.7971 0.9091 0.8494 Dosage_Strength 184.0 25.0 33.0 217.0 0.8804 0.8479 0.8638 Drug 672.0 109.0 103.0 775.0 0.8604 0.8671 0.8638 VitalTest 73.0 7.0 16.0 89.0 0.9125 0.8202 0.8639 Laterality 262.0 43.0 38.0 300.0 0.859 0.8733 0.8661 Age 236.0 42.0 14.0 250.0 0.8489 0.944 0.8939 PsychologicalCond... 144.0 20.0 14.0 158.0 0.878 0.9114 0.8944 BodyPart 1319.0 139.0 160.0 1479.0 0.9047 0.8918 0.8982 Employment 541.0 25.0 77.0 618.0 0.9558 0.8754 0.9139 AdmissionDischarge 13.0 0.0 1.0 14.0 1.0 0.9286 0.963 Gender 548.0 26.0 12.0 560.0 0.9547 0.9786 0.9665 ``` --- layout: model title: Legal Cooperation Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_cooperation_bert date: 2023-03-05 tags: [en, legal, classification, clauses, cooperation, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR.
Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Cooperation` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Cooperation`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cooperation_bert_en_1.0.0_3.0_1678050678952.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cooperation_bert_en_1.0.0_3.0_1678050678952.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_cooperation_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
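The paragraph splitting (by multiline) recommended in the description can be sketched as a small pre-processing step before the classifier; this is an illustrative snippet, not part of the official pipeline, and `split_paragraphs` is a hypothetical helper name assuming plain-text input with blank-line paragraph boundaries:

```python
import re

def split_paragraphs(document: str) -> list:
    """Illustrative helper (not a Spark NLP API): split a legal document
    into provisions on blank lines, so each paragraph can be sent to the
    classifier as its own row and stay within the 512-token embedding limit
    mentioned in the description."""
    paragraphs = re.split(r"\n\s*\n", document)
    return [p.strip() for p in paragraphs if p.strip()]

doc = ("Section 1. Cooperation.\nThe parties shall cooperate in good faith.\n\n"
       "Section 2. Notices.\nAll notices shall be in writing.")
provisions = split_paragraphs(doc)
print(provisions)
```

Each resulting provision can then be placed in its own row of the `text` column before `nlpPipeline.fit(df)`.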
## Results ```bash +-------+ |result| +-------+ |[Cooperation]| |[Other]| |[Other]| |[Cooperation]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_cooperation_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Cooperation 0.83 0.92 0.87 52 Other 0.94 0.86 0.90 74 accuracy - - 0.89 126 macro-avg 0.88 0.89 0.89 126 weighted-avg 0.89 0.89 0.89 126 ``` --- layout: model title: Sentence Entity Resolver for Billable ICD10-CM HCC Codes (sbertresolve_icd10cm_slim_billable_hcc) author: John Snow Labs name: sbertresolve_icd10cm_slim_billable_hcc date: 2023-05-31 tags: [icd10cm, licensed, slim, en, clinical] task: Entity Resolution language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts to ICD-10-CM codes using sentence bert embeddings. In this model, synonyms having low cosine similarity to unnormalized terms are dropped. It also returns the official resolution text within the brackets inside the metadata. The model is augmented with synonyms, and previous augmentations are flexed according to cosine distances to unnormalized terms (ground truths). It outputs 7-digit billable ICD codes. In the result, look for the aux_label parameter in the metadata to get Hierarchical Condition Categories (HCC) status. This column can be divided to get further details: `billable status || hcc status || hcc score`. For example, if `all_k_aux_labels` is `1||1||19`, the `billable status` is 1, the `hcc status` is 1, and the `hcc score` is 19.
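The `||`-delimited `all_k_aux_labels` value described above can be unpacked with a few lines of plain Python once you have the metadata string; `parse_aux_label` is an illustrative helper name, not part of the Spark NLP API:

```python
def parse_aux_label(aux_label: str) -> dict:
    """Illustrative helper: split an `all_k_aux_labels` value such as
    "1||1||19" into its three parts, following the
    `billable status || hcc status || hcc score` layout stated above."""
    billable, hcc_status, hcc_score = aux_label.split("||")
    return {
        "billable_status": billable,
        "hcc_status": hcc_status,
        "hcc_score": hcc_score,
    }

print(parse_aux_label("1||1||19"))
# {'billable_status': '1', 'hcc_status': '1', 'hcc_score': '19'}
```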
## Predicted Entities {:.btn-box} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10cm_slim_billable_hcc_en_4.4.2_3.0_1685498851777.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10cm_slim_billable_hcc_en_4.4.2_3.0_1685498851777.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(['PROBLEM']) chunk2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") bert_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased", "en", "clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("bert_embeddings")\ .setCaseSensitive(False) icd10_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_icd10cm_slim_billable_hcc", "en", "clinical/models")\ .setInputCols(["bert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, bert_embeddings, icd10_resolver]) data_ner = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, associated with obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, and vomiting. 
Two weeks prior to presentation, she was treated with a five-day course of amoxicillin."]]).toDF("text") results = nlpPipeline.fit(data_ner).transform(data_ner) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") .setWhiteList("PROBLEM") val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val bert_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased", "en", "clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("bert_embeddings") .setCaseSensitive(false) val icd10_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_icd10cm_slim_billable_hcc", "en", "clinical/models") .setInputCols("bert_embeddings") .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, bert_embeddings, icd10_resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, associated with obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, and vomiting. 
Two weeks prior to presentation, she was treated with a five-day course of amoxicillin.").toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash +-------------------------------------+-------+----------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+ | ner_chunk| entity|icd10_code| resolutions| all_codes| +-------------------------------------+-------+----------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+ | gestational diabetes mellitus|PROBLEM| O24.4|[gestational diabetes mellitus [gestational diabetes mellitus], gestatio...|[O24.4, O24.43, P70.2, O24.434, O24.430, O24.435, O24.41, E13, O24.13, O...| |subsequent type two diabetes mellitus|PROBLEM| E11|[type 2 diabetes mellitus [type 2 diabetes mellitus], type 1 diabetes me...|[E11, E10, E11.3, E11.65, E11.1, E10.3, E11.64, E11.0, E11.5, E10.59, E1...| | obesity|PROBLEM| E66.1|[drug-induced obesity [drug-induced obesity], obesity due to excess calo...|[E66.1, E66.0, E66.9, O99.210, O99.213, O99.212, E66, O99.211, E66.8, E6...| | a body mass index|PROBLEM| Z68|[body mass index [bmi] [body mass index [bmi]], body mass index [bmi] 70...|[Z68, Z68.45, Z68.4, Z68.1, Z68.2, Z68.22, Z68.21, Z68.25, Z68.43, Z68.2...| | polyuria|PROBLEM| R35|[polyuria [polyuria], biliuria [biliuria], chyluria [chyluria], anuria a...|[R35, R82.2, R82.0, R34, R80, R35.89, R35.8, R82.991, R82.4, D75.1, L68....| | polydipsia|PROBLEM| R63.1|[polydipsia [polydipsia], polymyositis [polymyositis], polyhydramnios [p...|[R63.1, M33.2, O40, F63.3, Q69, O15, K22.81, N89.7, M30.0, N47.1, D72.82...| | vomiting|PROBLEM| R11.1|[vomiting [vomiting], vomiting of newborn [vomiting of newborn], nausea ...|[R11.1, P92.0, R11, R11.12, G43.A, R11.0, R11.13, R11.14, P92.01, O21, O...| 
+-------------------------------------+-------+----------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbertresolve_icd10cm_slim_billable_hcc| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[bert_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Size:|297.8 MB| |Case sensitive:|false| --- layout: model title: Legal Prices Document Classifier (EURLEX) author: John Snow Labs name: legclf_prices_bert date: 2023-03-06 tags: [en, legal, classification, clauses, prices, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_prices_bert model is a Bert Sentence Embeddings Document Classifier that classifies whether a given document belongs to the class Prices or not (Binary Classification) according to EuroVoc labels. ## Predicted Entities `Prices`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_prices_bert_en_1.0.0_3.0_1678111798081.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_prices_bert_en_1.0.0_3.0_1678111798081.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_prices_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Prices]| |[Other]| |[Other]| |[Prices]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_prices_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.7 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.92 0.91 0.92 671 Prices 0.92 0.93 0.92 724 accuracy - - 0.92 1395 macro-avg 0.92 0.92 0.92 1395 weighted-avg 0.92 0.92 0.92 1395 ``` --- layout: model title: Pipeline to Detect clinical events author: John Snow Labs name: ner_events_healthcare_pipeline date: 2023-03-14 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_events_healthcare](https://nlp.johnsnowlabs.com/2021/04/01/ner_events_healthcare_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_events_healthcare_pipeline_en_4.3.0_3.2_1678837044873.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_events_healthcare_pipeline_en_4.3.0_3.2_1678837044873.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_events_healthcare_pipeline", "en", "clinical/models") text = '''The patient presented to the emergency room last evening.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_events_healthcare_pipeline", "en", "clinical/models") val text = "The patient presented to the emergency room last evening." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.healthcare_events.pipeline").predict("""The patient presented to the emergency room last evening.""") ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-------------------|--------:|------:|:--------------|-------------:| | 0 | presented | 12 | 20 | EVIDENTIAL | 0.6769 | | 1 | the emergency room | 25 | 42 | CLINICAL_DEPT | 0.835967 | | 2 | last evening | 44 | 55 | DATE | 0.59135 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_events_healthcare_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|513.8 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Financial Relation Extraction (Alias) author: John Snow Labs name: finre_org_prod_alias date: 2022-08-17 tags: [en, finance, re, relations, licensed] task: Relation Extraction language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model can be used to extract Aliases of Companies or Product names. An "Alias" is a name used in a document to refer to the original name of a company or product. Examples: - John Snow Labs, also known as JSL - John Snow Labs ("JSL") - etc ## Predicted Entities `has_alias`, `has_collective_alias` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finre_org_prod_alias_en_1.0.0_3.2_1660739200247.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finre_org_prod_alias_en_1.0.0_3.2_1660739200247.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\ .setInputCols(["document", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") reDL = finance.RelationExtractionDLModel()\ .pretrained("finre_org_prod_alias", "en", "finance/models")\ .setPredictionThreshold(0.1)\ .setInputCols(["ner_chunk", "document"])\ .setOutputCol("relations") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, ner_model, ner_converter, reDL]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text=''' On March 12, 2020 we closed a Loan and Security Agreement with Hitachi Capital America Corp. ("Hitachi") the terms of which are described in this report which replaced our credit facility with Western Alliance Bank. ''' lmodel = LightPipeline(model) lmodel.fullAnnotate(text) ```
## Results ```bash relation entity1 entity1_begin entity1_end chunk1 entity2 entity2_begin entity2_end chunk2 confidence has_alias ORG 64 92 Hitachi Capital America Corp. ALIAS 96 102 Hitachi 0.9983972 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finre_org_prod_alias| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|409.9 MB| ## References Manual annotations on CUAD dataset and 10K filings ## Benchmarking ```bash label Recall Precision F1 Support has_alias 0.920 1.000 0.958 50 has_collective_alias 1.000 0.750 0.857 6 no_rel 1.000 0.957 0.978 44 Avg. 0.973 0.902 0.931 - Weighted-Avg. 0.960 0.966 0.961 - ``` --- layout: model title: Translate Chinyanja to English Pipeline author: John Snow Labs name: translate_ny_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ny, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `ny` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ny_en_xx_2.7.0_2.4_1609699579874.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ny_en_xx_2.7.0_2.4_1609699579874.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ny_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ny_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ny.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ny_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Maintenance Of Insurance Clause Binary Classifier author: John Snow Labs name: legclf_maintenance_of_insurance_clause date: 2023-01-29 tags: [en, legal, classification, maintenance, insurance, clauses, maintenance_of_insurance, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `maintenance-of-insurance` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`maintenance-of-insurance`, `other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_maintenance_of_insurance_clause_en_1.0.0_3.0_1675004526424.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_maintenance_of_insurance_clause_en_1.0.0_3.0_1675004526424.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_maintenance_of_insurance_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+--------------------------+
|result                    |
+--------------------------+
|[maintenance-of-insurance]|
|[other]                   |
|[other]                   |
|[maintenance-of-insurance]|
+--------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_maintenance_of_insurance_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
                   label  precision  recall  f1-score  support
maintenance-of-insurance       0.98    0.98      0.98       56
                   other       0.99    0.99      0.99      106
                accuracy          -       -      0.99      162
               macro-avg       0.99    0.99      0.99      162
            weighted-avg       0.99    0.99      0.99      162
```

---
layout: model
title: Multilingual XLMRoBerta Embeddings (from hfl)
author: John Snow Labs
name: xlmroberta_embeddings_cino_large
date: 2022-05-14
tags: [zh, ko, xlm_roberta, embeddings, xx, open_source]
task: Embeddings
language: xx
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: XlmRoBertaEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained XLMRoBERTa Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cino-large` is a Multilingual model originally trained by `hfl`.

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_cino_large_xx_3.4.4_3.0_1652531331865.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_cino_large_xx_3.4.4_3.0_1652531331865.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_cino_large", "xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_cino_large", "xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlmroberta_embeddings_cino_large|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|xx|
|Size:|2.2 GB|
|Case sensitive:|true|

## References

- https://huggingface.co/hfl/cino-large
- https://github.com/ymcui/Chinese-Minority-PLM
- https://github.com/ymcui/MacBERT
- https://github.com/ymcui/Chinese-BERT-wwm
- https://github.com/ymcui/Chinese-ELECTRA
- https://github.com/ymcui/Chinese-XLNet
- https://github.com/airaria/TextBrewer
- https://github.com/ymcui/HFL-Anthology

---
layout: model
title: Legal National Accounts Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_national_accounts_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, national_accounts, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the `legclf_national_accounts_bert` model, a Bert Sentence Embeddings Document Classifier, classifies whether the document belongs to the class `National_Accounts` or not (Binary Classification) according to EuroVoc labels.
## Predicted Entities `National_Accounts`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_national_accounts_bert_en_1.0.0_3.0_1678111671494.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_national_accounts_bert_en_1.0.0_3.0_1678111671494.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_national_accounts_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-------------------+
|result             |
+-------------------+
|[National_Accounts]|
|[Other]            |
|[Other]            |
|[National_Accounts]|
+-------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_national_accounts_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.4 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
            label  precision  recall  f1-score  support
National_Accounts       0.96    1.00      0.98       27
            Other       1.00    0.95      0.97       19
         accuracy          -       -      0.98       46
        macro-avg       0.98    0.97      0.98       46
     weighted-avg       0.98    0.98      0.98       46
```

---
layout: model
title: English asr_vakyansh_wav2vec2_indian_english_enm_700 TFWav2Vec2ForCTC from Harveenchadha
author: John Snow Labs
name: asr_vakyansh_wav2vec2_indian_english_enm_700
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_vakyansh_wav2vec2_indian_english_enm_700` is an English model originally trained by Harveenchadha.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_vakyansh_wav2vec2_indian_english_enm_700_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_vakyansh_wav2vec2_indian_english_enm_700_en_4.2.0_3.0_1664094785788.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_vakyansh_wav2vec2_indian_english_enm_700_en_4.2.0_3.0_1664094785788.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_vakyansh_wav2vec2_indian_english_enm_700", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_vakyansh_wav2vec2_indian_english_enm_700", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_vakyansh_wav2vec2_indian_english_enm_700|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|227.6 MB|

---
layout: model
title: Diseases and Syndromes to UMLS Code Pipeline
author: John Snow Labs
name: umls_disease_syndrome_resolver_pipeline
date: 2022-07-26
tags: [en, licensed, umls, pipeline]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline maps entities (Diseases and Syndromes) to their corresponding UMLS CUI codes. You’ll just feed your text, and it will return the corresponding UMLS codes.

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/umls_disease_syndrome_resolver_pipeline_en_4.0.0_3.0_1658825573136.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/umls_disease_syndrome_resolver_pipeline_en_4.0.0_3.0_1658825573136.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("umls_disease_syndrome_resolver_pipeline", "en", "clinical/models")

pipeline.annotate("A 34-year-old female with a history of poor appetite, gestational diabetes mellitus, acyclovir allergy and polyuria")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = PretrainedPipeline("umls_disease_syndrome_resolver_pipeline", "en", "clinical/models")

val result = pipeline.annotate("A 34-year-old female with a history of poor appetite, gestational diabetes mellitus, acyclovir allergy and polyuria")
```

{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.umls_disease_syndrome_resolver").predict("""A 34-year-old female with a history of poor appetite, gestational diabetes mellitus, acyclovir allergy and polyuria""")
```
## Results ```bash +-----------------------------+---------+---------+ |chunk |ner_label|umls_code| +-----------------------------+---------+---------+ |poor appetite |PROBLEM |C0003123 | |gestational diabetes mellitus|PROBLEM |C0085207 | |acyclovir allergy |PROBLEM |C0571297 | |polyuria |PROBLEM |C0018965 | +-----------------------------+---------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|umls_disease_syndrome_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|3.4 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger --- layout: model title: Pipeline to Extract Anatomical Entities from Oncology Texts author: John Snow Labs name: ner_oncology_anatomy_general_pipeline date: 2023-03-08 tags: [licensed, clinical, en, oncology, anatomy, ner] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_oncology_anatomy_general](https://nlp.johnsnowlabs.com/2022/11/24/ner_oncology_anatomy_general_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_general_pipeline_en_4.3.0_3.2_1678282717993.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_general_pipeline_en_4.3.0_3.2_1678282717993.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_oncology_anatomy_general_pipeline", "en", "clinical/models") text = '''The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_oncology_anatomy_general_pipeline", "en", "clinical/models") val text = "The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-------------|--------:|------:|:----------------|-------------:| | 0 | left | 36 | 39 | Direction | 0.9825 | | 1 | breast | 41 | 46 | Anatomical_Site | 0.9005 | | 2 | lungs | 82 | 86 | Anatomical_Site | 0.9735 | | 3 | liver | 99 | 103 | Anatomical_Site | 0.9817 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_anatomy_general_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Sentence Entity Resolver for ICD10-GM author: John Snow Labs name: sbertresolve_icd10gm date: 2021-09-16 tags: [icd10gm, en, clinical, licensed, de] task: Entity Resolution language: de edition: Healthcare NLP 3.2.2 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD10-GM codes for the German language using `sent_bert_base_cased` (de) embeddings. 
## Predicted Entities `ICD10GM Codes` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_GM_DE/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/14.German_Healthcare_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10gm_de_3.2.2_2.4_1631814227170.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10gm_de_3.2.2_2.4_1631814227170.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")

sbert_embedder = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "de")\
.setInputCols(["ner_chunk"])\
.setOutputCol("sbert_embeddings")

icd10gm_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_icd10gm", "de", "clinical/models")\
.setInputCols(["sbert_embeddings"])\
.setOutputCol("icd10gm_code")

icd10gm_pipelineModel = PipelineModel(
    stages = [
        documentAssembler,
        sbert_embedder,
        icd10gm_resolver])

icd_lp = LightPipeline(icd10gm_pipelineModel)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("ner_chunk")

val sbert_embedder = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "de")
.setInputCols("ner_chunk")
.setOutputCol("sbert_embeddings")

val icd10gm_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_icd10gm", "de", "clinical/models")
.setInputCols("sbert_embeddings")
.setOutputCol("icd10gm_code")

val icd10gm_pipelineModel = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, icd10gm_resolver)).fit(Seq("").toDF("text"))

val icd_lp = new LightPipeline(icd10gm_pipelineModel)
```

{:.nlu-block}
```python
import nlu
nlu.load("de.resolve.icd10gm").predict("""Put your text here.""")
```
## Results ```bash | | chunks | code | resolutions | all_codes | all_distances | |---:|:--------|:--------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|----------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | 0 | Dyspnoe | C671 |Dyspnoe, Schlafapnoe, Dysphonie, Frühsyphilis, Hyperzementose, Hypertrichose, Makrostomie, Dystonie, Nokardiose, Lebersklerose, Dyspareunie, Schizophrenie, Skoliose, Dysurie, Diphyllobothriose, Heterophorie, Rektozele, Enophthalmus, Amyloidose, Hyperventilation, Neurasthenie, Sarkoidose, Psoriasis-Arthropathie, Hyperodontie, Enteroptose| [R06.0, G47.3, R49.0, A51, K03.4, L68, Q18.4, G24, A43, K74.1, N94.1, F20, M41, R30.0, B70.0, H50.5, N81.6, H05.4, E85, R06.4, F48.0, D86, L40.5, K00.1, K63.4] | [0.0000, 2.5602, 3.0529, 3.3310, 3.4645, 3.7148, 3.7568, 3.8115, 3.8557, 3.8577, 3.9448, 3.9681, 3.9799, 3.9889, 4.0036, 4.0773, 4.0825, 4.1342, 4.2031, 4.2155, 4.2313, 4.2341, 4.2775, 4.2802, 4.2823] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbertresolve_icd10gm| |Compatibility:|Healthcare NLP 3.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[icd10_gm2022_de_code]| |Language:|de| |Case sensitive:|false| --- layout: model title: Legal Sub Advisory Agreement Document Binary Classifier (Longformer) author: John Snow Labs name: legclf_sub_advisory_agreement date: 
2022-12-18
tags: [en, legal, classification, licensed, document, longformer, sub, advisory, agreement, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The `legclf_sub_advisory_agreement` model is a Longformer Document Classifier used to classify if the document belongs to the class `sub-advisory-agreement` or not (Binary Classification).

Longformers have a restriction of 4096 tokens, so only the first 4096 tokens will be taken into account. We have realised that, for the big majority of the documents in legal corpora, if they are clean and only contain the legal document without any extra information before it, 4096 tokens are enough to perform Document Classification.

If your document needs to process more than 4096 tokens, you can try the following: split the document into chunks of 4096 tokens, average their embeddings, and train with the averaged version, which means the whole document is taken into account.

## Predicted Entities

`sub-advisory-agreement`, `other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_sub_advisory_agreement_en_1.0.0_3.0_1671393680636.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_sub_advisory_agreement_en_1.0.0_3.0_1671393680636.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_sub_advisory_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
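The chunk-and-average workaround mentioned in this model's description can be sketched in plain Python. This is a hypothetical illustration only (`chunk` and `average_embeddings` are made-up helper names, not Spark NLP API, and the vectors below stand in for real Longformer chunk embeddings):

```python
def chunk(tokens, size=4096):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def average_embeddings(vectors):
    """Element-wise average of one embedding vector per chunk."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

tokens = ["tok"] * 10000                 # a document longer than 4096 tokens
chunks = chunk(tokens)                   # three chunks: 4096 + 4096 + 1808
chunk_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # placeholder embeddings
doc_embedding = average_embeddings(chunk_vectors)
```

Training on such averaged vectors means the whole document contributes to the classification, at the cost of blurring positional information.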
## Results

```bash
+------------------------+
|result                  |
+------------------------+
|[sub-advisory-agreement]|
|[other]                 |
|[other]                 |
|[sub-advisory-agreement]|
+------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_sub_advisory_agreement|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.4 MB|

## References

Legal documents, scraped from the Internet and classified in-house + SEC documents

## Benchmarking

```bash
                 label  precision  recall  f1-score  support
                 other       0.99    1.00      0.99      202
sub-advisory-agreement       1.00    0.97      0.99      104
              accuracy          -       -      0.99      306
             macro-avg       0.99    0.99      0.99      306
          weighted-avg       0.99    0.99      0.99      306
```

---
layout: model
title: English RobertaForQuestionAnswering (from saburbutt)
author: John Snow Labs
name: roberta_qa_roberta_base_tweetqa_model
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta_base_tweetqa_model` is an English model originally trained by `saburbutt`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_tweetqa_model_en_4.0.0_3.0_1655739012245.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_tweetqa_model_en_4.0.0_3.0_1655739012245.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_tweetqa_model","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_tweetqa_model","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.trivia.roberta.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_tweetqa_model| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|432.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/saburbutt/roberta_base_tweetqa_model --- layout: model title: Finnish BERT Embeddings (Base Cased) author: John Snow Labs name: bert_finnish_cased date: 2020-08-31 task: Embeddings language: fi edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, fi] supported: true deprecated: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A version of Google's BERT deep transfer learning model for Finnish. The model can be fine-tuned to achieve state-of-the-art results for various Finnish natural language processing tasks. `FinBERT` features a custom 50,000 wordpiece vocabulary that has much better coverage of Finnish words. `FinBERT` has been pre-trained for 1 million steps on over 3 billion tokens (24B characters) of Finnish text drawn from news, online discussion, and internet crawls. By contrast, Multilingual BERT was trained on Wikipedia texts, where the Finnish Wikipedia text is approximately 3% of the amount used to train `FinBERT`. These features allow `FinBERT` to outperform not only Multilingual BERT but also all previously proposed models when fine-tuned for Finnish natural language processing tasks. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_finnish_cased_fi_2.6.0_2.4_1598896927571.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_finnish_cased_fi_2.6.0_2.4_1598896927571.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("bert_finnish_cased", "fi") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['Rakastan NLP: tä']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("bert_finnish_cased", "fi") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("Rakastan NLP: tä").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Rakastan NLP: tä"] embeddings_df = nlu.load('fi.embed.bert.cased.').predict(text, output_level='token') embeddings_df ```
{:.h2_title}
## Results

```bash
fi_embed_bert_cased__embeddings                     token
[0.09888151288032532, -0.72500079870224, 1.001...   Rakastan
[0.46280959248542786, -0.7008669972419739, 0.9...   NLP
[0.061913054436445236, 1.1024340391159058, 0.9...   :
[1.0134484767913818, -0.822027325630188, 1.353...   tä
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_finnish_cased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|fi|
|Dimension:|768|
|Case sensitive:|true|

{:.h2_title}
## Data Source

The model is imported from https://github.com/TurkuNLP/FinBERT

---
layout: model
title: Arabic BertForQuestionAnswering model (from gfdgdfgdg)
author: John Snow Labs
name: bert_qa_arap_qa_bert
date: 2022-06-02
tags: [ar, open_source, question_answering, bert]
task: Question Answering
language: ar
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `arap_qa_bert` is an Arabic model originally trained by `gfdgdfgdg`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_arap_qa_bert_ar_4.0.0_3.0_1654179127284.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_arap_qa_bert_ar_4.0.0_3.0_1654179127284.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_arap_qa_bert","ar") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_arap_qa_bert","ar") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("ar.answer_question.bert").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_arap_qa_bert|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|ar|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/gfdgdfgdg/arap_qa_bert

---
layout: model
title: Stopwords Remover for Estonian language (35 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, et, open_source]
task: Stop Words Removal
language: et
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_et_3.4.1_3.0_1646672955769.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_et_3.4.1_3.0_1646672955769.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

stop_words = StopWordsCleaner.pretrained("stopwords_iso","et") \
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])

example = spark.createDataFrame([["Sa ei ole parem kui mina"]], ["text"])

results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val stopWords = StopWordsCleaner.pretrained("stopwords_iso","et")
    .setInputCols(Array("token"))
    .setOutputCol("cleanTokens")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stopWords))

val data = Seq("Sa ei ole parem kui mina").toDF("text")

val results = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("et.stopwords").predict("""Sa ei ole parem kui mina""")
```
## Results

```bash
+------------------+
|result            |
+------------------+
|[ole, parem, mina]|
+------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|et|
|Size:|1.4 KB|

---
layout: model
title: English T5ForConditionalGeneration Cased model (from dsivakumar)
author: John Snow Labs
name: t5_text2sql
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `text2sql` is an English model originally trained by `dsivakumar`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_text2sql_en_4.3.0_3.0_1675157342237.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_text2sql_en_4.3.0_3.0_1675157342237.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_text2sql","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_text2sql","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_text2sql|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|283.3 MB|

## References

- https://huggingface.co/dsivakumar/text2sql

---
layout: model
title: English asr_wav2vec2_base_timit_moaiz_exp1 TFWav2Vec2ForCTC from moaiz237
author: John Snow Labs
name: asr_wav2vec2_base_timit_moaiz_exp1
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_moaiz_exp1` is an English model originally trained by moaiz237.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_moaiz_exp1_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_moaiz_exp1_en_4.2.0_3.0_1664036350853.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_moaiz_exp1_en_4.2.0_3.0_1664036350853.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_moaiz_exp1", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_moaiz_exp1", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_moaiz_exp1|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|355.0 MB|

---
layout: model
title: Pipeline to Detect PHI for Deidentification (ner_deidentify_dl)
author: John Snow Labs
name: ner_deidentify_dl_pipeline
date: 2023-03-13
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [ner_deidentify_dl](https://nlp.johnsnowlabs.com/2021/03/31/ner_deidentify_dl_en.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deidentify_dl_pipeline_en_4.3.0_3.2_1678735442586.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deidentify_dl_pipeline_en_4.3.0_3.2_1678735442586.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deidentify_dl_pipeline", "en", "clinical/models") text = '''Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deidentify_dl_pipeline", "en", "clinical/models") val text = "Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.deidentify.pipeline").predict("""Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.""") ```
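Each entity chunk produced by the pipeline carries `begin`/`end` character offsets into the input text (inclusive on both ends). A common follow-up to detection is masking the detected PHI; this is a minimal, framework-free sketch of that post-processing step (the `mask_phi` helper and the `<LABEL>` placeholder convention are illustrative assumptions, not part of the pipeline API):

```python
def mask_phi(text, chunks):
    """Replace each (begin, end, label) span in text with <LABEL>.

    Offsets are inclusive on both ends, matching Spark NLP chunk metadata.
    Spans are applied right-to-left so earlier offsets stay valid.
    """
    for begin, end, label in sorted(chunks, key=lambda c: c[0], reverse=True):
        text = text[:begin] + "<" + label + ">" + text[end + 1:]
    return text

masked = mask_phi(
    "Record date : 2093-01-13 , David Hale , M.D .",
    [(14, 23, "DATE"), (27, 36, "DOCTOR")],
)
print(masked)  # Record date : <DATE> , <DOCTOR> , M.D .
```

The same idea scales to the full chunk list returned by `fullAnnotate`; in production, the licensed `DeIdentification` annotators cover this (and obfuscation) natively.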
## Results

```bash
|    | ner_chunks                    |   begin |   end | ner_label     |   confidence |
|---:|:------------------------------|--------:|------:|:--------------|-------------:|
|  0 | 2093-01-13                    |      14 |    23 | DATE          |      0.9999  |
|  1 | David Hale                    |      27 |    36 | DOCTOR        |      0.9913  |
|  2 | Hendrickson Ora               |      55 |    69 | DOCTOR        |      0.95565 |
|  3 | 7194334                       |      78 |    84 | MEDICALRECORD |      0.9857  |
|  4 | 01/13/93                      |      93 |   100 | DATE          |      0.999   |
|  5 | Oliveira                      |     110 |   117 | DOCTOR        |      0.9966  |
|  6 | 25                            |     121 |   122 | AGE           |      0.6433  |
|  7 | 2079-11-09                    |     150 |   159 | DATE          |      0.9984  |
|  8 | Cocke County Baptist Hospital |     163 |   191 | HOSPITAL      |      0.9466  |
|  9 | Keats Street                  |     200 |   211 | STREET        |      0.91485 |
| 10 | 302-786-5227                  |     221 |   232 | PHONE         |      0.7415  |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_deidentify_dl_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel

---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from arshiya20)
author: John Snow Labs
name: distilbert_qa_arshiya20_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `arshiya20`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_arshiya20_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769918439.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_arshiya20_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769918439.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_arshiya20_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_arshiya20_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_arshiya20_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/arshiya20/distilbert-base-uncased-finetuned-squad

---
layout: model
title: Translate Mon-Khmer languages to English Pipeline
author: John Snow Labs
name: translate_mkh_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, mkh, en, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `mkh` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_mkh_en_xx_2.7.0_2.4_1609698555980.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_mkh_en_xx_2.7.0_2.4_1609698555980.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_mkh_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_mkh_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.mkh.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_mkh_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English ElectraForQuestionAnswering model (from ahotrod)
author: John Snow Labs
name: electra_qa_large_discriminator_squad2_512
date: 2022-06-22
tags: [en, open_source, electra, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra_large_discriminator_squad2_512` is an English model originally trained by `ahotrod`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_large_discriminator_squad2_512_en_4.0.0_3.0_1655921513131.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_large_discriminator_squad2_512_en_4.0.0_3.0_1655921513131.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_large_discriminator_squad2_512","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_large_discriminator_squad2_512","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.electra.large_512d").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_large_discriminator_squad2_512| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ahotrod/electra_large_discriminator_squad2_512 --- layout: model title: Multilingual DistilBertForQuestionAnswering Cased model (from ZYW) author: John Snow Labs name: distilbert_qa_model_zyw date: 2023-01-03 tags: [xx, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `en-de-model` is a Multilingual model originally trained by `ZYW`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_model_zyw_xx_4.3.0_3.0_1672774991529.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_model_zyw_xx_4.3.0_3.0_1672774991529.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_model_zyw","xx")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_model_zyw","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_model_zyw|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|xx|
|Size:|505.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/ZYW/en-de-model

---
layout: model
title: English BertForQuestionAnswering model (from madlag)
author: John Snow Labs
name: bert_qa_bert_large_uncased_squadv2
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-squadv2` is an English model originally trained by `madlag`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_squadv2_en_4.0.0_3.0_1654536822607.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_squadv2_en_4.0.0_3.0_1654536822607.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_squadv2","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_bert_large_uncased_squadv2","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.bert.large_uncased_v2.by_madlag").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_large_uncased_squadv2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/madlag/bert-large-uncased-squadv2
- https://arxiv.org/pdf/1810.04805v2.pdf

---
layout: model
title: English DistilBertForQuestionAnswering model (from mvonwyl)
author: John Snow Labs
name: distilbert_qa_mvonwyl_base_uncased_finetuned_squad2
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad2` is an English model originally trained by `mvonwyl`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_mvonwyl_base_uncased_finetuned_squad2_en_4.0.0_3.0_1654726852807.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_mvonwyl_base_uncased_finetuned_squad2_en_4.0.0_3.0_1654726852807.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_mvonwyl_base_uncased_finetuned_squad2","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_mvonwyl_base_uncased_finetuned_squad2","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.distil_bert.base_uncased").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_mvonwyl_base_uncased_finetuned_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/mvonwyl/distilbert-base-uncased-finetuned-squad2

---
layout: model
title: Context Spell Checker Pipeline for English
author: John Snow Labs
name: spellcheck_dl_pipeline
date: 2022-04-14
tags: [spellcheck, spelling_corrector, en, open_source]
task: Spell Check
language: en
nav_key: models
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained spellchecker pipeline is built on top of the [spellcheck_dl](https://nlp.johnsnowlabs.com/2022/03/28/spellcheck_dl_en_3_0.html) model.

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CONTEXTUAL_SPELL_CHECKER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spellcheck_dl_pipeline_en_3.4.1_3.0_1649937383379.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/spellcheck_dl_pipeline_en_3.4.1_3.0_1649937383379.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("spellcheck_dl_pipeline", lang = "en") text = ["During the summer we have the best ueather.", "I have a black ueather jacket, so nice."] pipeline.annotate(text) ``` ```scala val pipeline = new PretrainedPipeline("spellcheck_dl_pipeline", lang = "en") val example = Array("During the summer we have the best ueather.", "I have a black ueather jacket, so nice.") pipeline.annotate(example) ```
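`annotate` returns the original `token` list alongside the corrected `checked` list, so the individual corrections can be recovered by pairing the two lists up. A small illustrative helper for doing so (the `corrections` function is a sketch, not part of the pipeline API, and assumes the spell checker preserves tokenization so both lists align 1:1):

```python
def corrections(tokens, checked):
    """Pair original tokens with checked tokens and keep only the changed pairs."""
    return [(orig, fixed) for orig, fixed in zip(tokens, checked) if orig != fixed]

# One annotation dict shaped like the pipeline's output for the first sentence
result = {
    "token":   ["During", "the", "summer", "we", "have", "the", "best", "ueather", "."],
    "checked": ["During", "the", "summer", "we", "have", "the", "best", "weather", "."],
}
print(corrections(result["token"], result["checked"]))  # [('ueather', 'weather')]
```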
## Results

```bash
[{'checked': ['During', 'the', 'summer', 'we', 'have', 'the', 'best', 'weather', '.'],
  'document': ['During the summer we have the best ueather.'],
  'token': ['During', 'the', 'summer', 'we', 'have', 'the', 'best', 'ueather', '.']},
 {'checked': ['I', 'have', 'a', 'black', 'leather', 'jacket', ',', 'so', 'nice', '.'],
  'document': ['I have a black ueather jacket, so nice.'],
  'token': ['I', 'have', 'a', 'black', 'ueather', 'jacket', ',', 'so', 'nice', '.']}]
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|spellcheck_dl_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|99.7 MB|

## Included Models

- DocumentAssembler
- TokenizerModel
- ContextSpellCheckerModel

---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from mrm8488)
author: John Snow Labs
name: t5_base_finetuned_cuad
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `T5-base-finetuned-cuad` is an English model originally trained by `mrm8488`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_cuad_en_4.3.0_3.0_1675099512929.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_cuad_en_4.3.0_3.0_1675099512929.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_base_finetuned_cuad","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_base_finetuned_cuad","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_base_finetuned_cuad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|913.5 MB|

## References

- https://huggingface.co/mrm8488/T5-base-finetuned-cuad

---
layout: model
title: English ElectraForQuestionAnswering model (from valhalla)
author: John Snow Labs
name: electra_qa_base_discriminator_finetuned_squadv1
date: 2022-06-22
tags: [en, open_source, electra, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-discriminator-finetuned_squadv1` is an English model originally trained by `valhalla`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_base_discriminator_finetuned_squadv1_en_4.0.0_3.0_1655920510384.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_base_discriminator_finetuned_squadv1_en_4.0.0_3.0_1655920510384.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_discriminator_finetuned_squadv1","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_discriminator_finetuned_squadv1","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.electra.base.by_valhalla").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
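Under the hood, an extractive QA model like this one scores candidate start and end token positions and returns the context span between the best-scoring pair; the annotator handles tokenization and scoring internally. A toy, illustrative sketch of that span-selection step over whitespace tokens (not the actual Spark NLP implementation, and real models score subword tokens):

```python
# Illustrative only: pick the highest-scoring start position, then the
# highest-scoring end position at or after it, and return that span.
def extract_span(context_tokens, start_scores, end_scores):
    start = max(range(len(start_scores)), key=start_scores.__getitem__)
    end = max(range(start, len(end_scores)), key=end_scores.__getitem__)
    return " ".join(context_tokens[start:end + 1])

tokens = "My name is Clara and I live in Berkeley .".split()
start_scores = [0.0, 0.1, 0.0, 0.9, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0]
end_scores   = [0.0, 0.0, 0.0, 0.8, 0.1, 0.0, 0.0, 0.0, 0.1, 0.0]
print(extract_span(tokens, start_scores, end_scores))  # → Clara
```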
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_base_discriminator_finetuned_squadv1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|408.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/valhalla/electra-base-discriminator-finetuned_squadv1 - https://github.com/patil-suraj/ --- layout: model title: Clinical Deidentification (Spanish) author: John Snow Labs name: clinical_deidentification date: 2022-02-14 tags: [deid, es, licensed] task: De-identification language: es edition: Healthcare NLP 3.3.4 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline is trained with sciwiki_300d embeddings and can be used to deidentify PHI information from medical texts in Spanish. The PHI information will be masked and obfuscated in the resulting text. 
The pipeline can mask, fake, or obfuscate the following entities: `AGE`, `DATE`, `PROFESSION`, `E-MAIL`, `USERNAME`, `LOCATION`, `DOCTOR`, `HOSPITAL`, `PATIENT`, `URL`, `IP`, `MEDICALRECORD`, `IDNUM`, `ORGANIZATION`, `PHONE`, `ZIP`, `ACCOUNT`, `SSN`, `PLATE`, `SEX`, and `IPADDR`.

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_es_3.3.4_3.0_1644832415526.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_es_3.3.4_3.0_1644832415526.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "es", "clinical/models") sample = """Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. 
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com """

result = deid_pipeline.annotate(sample)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "es", "clinical/models")

val sample = "Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. 
Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com" val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("es.deid.clinical").predict("""Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. 
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com """) ```
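`annotate()` returns a plain Python dictionary keyed by output column, with one string per sentence in each value. A small sketch of selecting one de-identification style from that dictionary follows; the key names used below (`masked`, `obfuscated`) are assumptions inferred from the styles shown in the Results section, so check `result.keys()` on your installation before using them:

```python
# Hypothetical post-processing of the annotate() output dictionary.
# The "masked"/"obfuscated" key names are assumptions, not guaranteed here.
def pick_deid_style(result: dict, style: str = "obfuscated") -> str:
    if style not in result:
        raise KeyError(f"style {style!r} not in pipeline output: {sorted(result)}")
    # Each output column is a list of sentence strings; rejoin them.
    return " ".join(result[style])

# Toy stand-in for a real annotate() result, for illustration only.
fake_result = {
    "masked": ["Nombre: <PATIENT>."],
    "obfuscated": ["Nombre: Sr. Lerma."],
}
print(pick_deid_style(fake_result))  # → Nombre: Sr. Lerma.
```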
## Results ```bash Masked with entity labels ------------------------------ Datos del paciente. Nombre: . Apellidos: . NHC: . NASS: 04. Domicilio: , 5 B.. Localidad/ Provincia: . CP: . Datos asistenciales. Fecha de nacimiento: . País: . Edad: años Sexo: . Fecha de Ingreso: . : María Merino Viveros NºCol: . Informe clínico del paciente: de años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. 
Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. Servicio de Endocrinología y Nutrición km 12,500 28905 - () Correo electrónico: Masked with chars ------------------------------ Datos del paciente. Nombre: [**] . Apellidos: [*************]. NHC: [*****]. NASS: ** [******] 04. Domicilio: [*******************], 5 B.. Localidad/ Provincia: [****]. CP: [***]. Datos asistenciales. Fecha de nacimiento: [********]. País: [****]. Edad: ** años Sexo: *. Fecha de Ingreso: [********]. [****]: María Merino Viveros NºCol: ** ** [***]. 
Informe clínico del paciente: [***] de ** años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en [*********] en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. 
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: [*************] +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente [****] significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. [******************] [******************************] Servicio de Endocrinología y Nutrición [*****************] km 12,500 28905 [****] - [****] ([****]) Correo electrónico: [********************] Masked with fixed length chars ------------------------------ Datos del paciente. Nombre: **** . Apellidos: ****. NHC: ****. NASS: **** **** 04. Domicilio: ****, 5 B.. Localidad/ Provincia: ****. CP: ****. Datos asistenciales. Fecha de nacimiento: ****. País: ****. Edad: **** años Sexo: ****. Fecha de Ingreso: ****. ****: María Merino Viveros NºCol: **** **** ****. 
Informe clínico del paciente: **** de **** años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en **** en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. 
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: **** +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente **** significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. **** **** Servicio de Endocrinología y Nutrición **** km 12,500 28905 **** - **** (****) Correo electrónico: **** Obfuscated ------------------------------ Datos del paciente. Nombre: Sr. Lerma . Apellidos: Aristides Gonzalez Gelabert. NHC: BBBBBBBBQR648597. NASS: 041010000011 RZRM020101906017 04. Domicilio: Valencia, 5 B.. Localidad/ Provincia: Madrid. CP: 99335. Datos asistenciales. Fecha de nacimiento: 25/04/1977. País: Barcelona. Edad: 8 años Sexo: F.. Fecha de Ingreso: 02/08/2018. transportista: María Merino Viveros NºCol: olegario10 olegario10 felisa78. 
Informe clínico del paciente: RZRM020101906017 de 8 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Madrid en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. 
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Dra. Laguna +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente 041010000011 significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
Reinaldo Manjón Malo Barcelona Servicio de Endocrinología y Nutrición Valencia km 12,500 28905 Bilbao - Madrid (Barcelona) Correo electrónico: quintanasalome@example.net
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|clinical_deidentification|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.3.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|es|
|Size:|281.3 MB|

## Included Models

- nlp.DocumentAssembler
- nlp.SentenceDetectorDLModel
- nlp.TokenizerModel
- nlp.WordEmbeddingsModel
- medical.NerModel
- nlp.NerConverter
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ChunkMergeModel
- medical.DeIdentificationModel
- medical.DeIdentificationModel
- medical.DeIdentificationModel
- medical.DeIdentificationModel
- Finisher

---
layout: model
title: Pipeline to Detect Chemical Compounds and Genes
author: John Snow Labs
name: bert_token_classifier_ner_chemprot_pipeline
date: 2022-03-15
tags: [bert_token_classifier, ner, chemprot, en, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 2.4
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [bert_token_classifier_ner_chemprot](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_chemprot_en.html) model.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CHEMPROT_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_CHEMPROT_CLINICAL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemprot_pipeline_en_3.4.1_2.4_1647341872538.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemprot_pipeline_en_3.4.1_2.4_1647341872538.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
chemprot_pipeline = PretrainedPipeline("bert_token_classifier_ner_chemprot_pipeline", "en", "clinical/models")

chemprot_pipeline.annotate("Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.")
```
```scala
val chemprot_pipeline = new PretrainedPipeline("bert_token_classifier_ner_chemprot_pipeline", "en", "clinical/models")

chemprot_pipeline.annotate("Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.")
```

{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.chemprot_pipeline").predict("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""")
```
## Results

```bash
+-------------------------------+---------+
|chunk                          |ner_label|
+-------------------------------+---------+
|Keratinocyte growth factor     |GENE-Y   |
|acidic fibroblast growth factor|GENE-Y   |
+-------------------------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_chemprot_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|404.3 MB|

## Included Models

- DocumentAssembler
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverter
- Finisher

---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from yashwantk)
author: John Snow Labs
name: distilbert_qa_yashwantk_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `yashwantk`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_yashwantk_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773265482.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_yashwantk_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773265482.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_yashwantk_base_uncased_finetuned_squad","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_yashwantk_base_uncased_finetuned_squad","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_yashwantk_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/yashwantk/distilbert-base-uncased-finetuned-squad

---
layout: model
title: Legal Non solicitation Clause Binary Classifier
author: John Snow Labs
name: legclf_non_solicitation_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `non-solicitation` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level.

If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
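Paragraph splitting by multiline, for instance, can be done with a few lines of plain Python before the text reaches the pipeline. This is a minimal sketch under the assumption that paragraphs are separated by blank lines; the helper name is illustrative, not part of the library:

```python
import re

def split_into_paragraphs(text: str) -> list:
    """Split a document on blank lines (one or more empty lines between paragraphs).

    Illustrative helper, not a Spark NLP API."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "First clause text.\n\nSecond clause text.\n\n\nThird clause text."
paragraphs = split_into_paragraphs(doc)
# Each paragraph can then be loaded as a separate row of the input DataFrame.
```

Each resulting piece stays well under the 512-token limit for typical clauses, and the classifier sees one candidate clause per row.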
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `non-solicitation`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_non_solicitation_clause_en_1.0.0_3.2_1660122751584.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_non_solicitation_clause_en_1.0.0_3.2_1660122751584.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("clause_text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = nlp.ClassifierDLModel.pretrained("legclf_non_solicitation_clause", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    embeddings,
    docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text")

model = nlpPipeline.fit(df)

result = model.transform(df)
```
## Results

```bash
+------------------+
|            result|
+------------------+
|[non-solicitation]|
|           [other]|
|           [other]|
|[non-solicitation]|
+------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_non_solicitation_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.8 MB|

## References

Legal documents, scraped from the Internet and classified in-house.

## Benchmarking

```bash
           label  precision  recall  f1-score  support
non-solicitation       1.00    0.95      0.98       22
           other       0.99    1.00      0.99       84
        accuracy          -       -      0.99      106
       macro-avg       0.99    0.98      0.99      106
    weighted-avg       0.99    0.99      0.99      106
```

---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab4 TFWav2Vec2ForCTC from sameearif88
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_colab4
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab4` is an English model originally trained by sameearif88.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab4_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab4_en_4.2.0_3.0_1664019147552.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab4_en_4.2.0_3.0_1664019147552.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab4', lang = 'en')

annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab4", lang = "en")

val annotations = pipeline.transform(audioDF)
```
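The `audioDF` above is assumed to hold one column of raw audio samples as normalized floats. If your recordings are 16-bit PCM mono WAV files, those float arrays can be produced with the Python standard library alone. This is a hedged sketch: the helper name and the `audio_content` column layout are assumptions, not part of the pipeline's documented API:

```python
import struct
import wave

def read_wav_as_floats(path: str) -> list:
    """Decode a 16-bit PCM mono WAV file into samples normalized to [-1.0, 1.0].

    Illustrative helper; real projects may prefer a library such as librosa."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expects 16-bit PCM"
        frames = wf.readframes(wf.getnframes())
    # '<h' = little-endian signed 16-bit; divide by 32768 to normalize
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# floats = read_wav_as_floats("speech.wav")
# audioDF = spark.createDataFrame([(floats,)], ["audio_content"])  # assumed column name
```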
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab4|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|355.0 MB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: English BertForQuestionAnswering Cased model (from clementgyj)
author: John Snow Labs
name: bert_qa_clementgyj_finetuned_squad
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `clementgyj`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_clementgyj_finetuned_squad_en_4.0.0_3.0_1657186454797.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_clementgyj_finetuned_squad_en_4.0.0_3.0_1657186454797.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_clementgyj_finetuned_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_clementgyj_finetuned_squad","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_clementgyj_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/clementgyj/bert-finetuned-squad

---
layout: model
title: Translate English to Welsh Pipeline
author: John Snow Labs
name: translate_en_cy
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, cy, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `en`
- target languages: `cy`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_cy_xx_2.7.0_2.4_1609698924389.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_cy_xx_2.7.0_2.4_1609698924389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("translate_en_cy", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = new PretrainedPipeline("translate_en_cy", lang = "xx")
pipeline.annotate("Your sentence to translate!")
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.cy').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_en_cy|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English image_classifier_vit_deit_small_patch16_224 ViTForImageClassification from facebook
author: John Snow Labs
name: image_classifier_vit_deit_small_patch16_224
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_deit_small_patch16_224` is an English model originally trained by facebook.
## Predicted Entities `turnstile`, `damselfly`, `mixing bowl`, `sea snake`, `cockroach, roach`, `buckle`, `beer glass`, `bulbul`, `lumbermill, sawmill`, `whippet`, `Australian terrier`, `television, television system`, `hoopskirt, crinoline`, `horse cart, horse-cart`, `guillotine`, `malamute, malemute, Alaskan malamute`, `coyote, prairie wolf, brush wolf, Canis latrans`, `colobus, colobus monkey`, `hognose snake, puff adder, sand viper`, `sock`, `burrito`, `printer`, `bathing cap, swimming cap`, `chiton, coat-of-mail shell, sea cradle, polyplacophore`, `Rottweiler`, `cello, violoncello`, `pitcher, ewer`, `computer keyboard, keypad`, `bow`, `peacock`, `ballplayer, baseball player`, `refrigerator, icebox`, `solar dish, solar collector, solar furnace`, `passenger car, coach, carriage`, `African chameleon, Chamaeleo chamaeleon`, `oboe, hautboy, hautbois`, `toyshop`, `Leonberg`, `howler monkey, howler`, `bluetick`, `African elephant, Loxodonta africana`, `American lobster, Northern lobster, Maine lobster, Homarus americanus`, `combination lock`, `black-and-tan coonhound`, `bonnet, poke bonnet`, `harvester, reaper`, `Appenzeller`, `iron, smoothing iron`, `electric locomotive`, `lycaenid, lycaenid butterfly`, `sandbar, sand bar`, `Cardigan, Cardigan Welsh corgi`, `pencil sharpener`, `jean, blue jean, denim`, `backpack, back pack, knapsack, packsack, rucksack, haversack`, `monitor`, `ice cream, icecream`, `apiary, bee house`, `water jug`, `American coot, marsh hen, mud hen, water hen, Fulica americana`, `ground beetle, carabid beetle`, `jigsaw puzzle`, `ant, emmet, pismire`, `wreck`, `kuvasz`, `gyromitra`, `Ibizan hound, Ibizan Podenco`, `brown bear, bruin, Ursus arctos`, `bolo tie, bolo, bola tie, bola`, `Pembroke, Pembroke Welsh corgi`, `French bulldog`, `prison, prison house`, `ballpoint, ballpoint pen, ballpen, Biro`, `stage`, `airliner`, `dogsled, dog sled, dog sleigh`, `redshank, Tringa totanus`, `menu`, `Indian cobra, Naja naja`, `swab, swob, mop`, `window screen`, 
`brain coral`, `artichoke, globe artichoke`, `loupe, jeweler's loupe`, `loudspeaker, speaker, speaker unit, loudspeaker system, speaker system`, `panpipe, pandean pipe, syrinx`, `wok`, `croquet ball`, `plate`, `scoreboard`, `Samoyed, Samoyede`, `ocarina, sweet potato`, `beaver`, `borzoi, Russian wolfhound`, `horizontal bar, high bar`, `stretcher`, `seat belt, seatbelt`, `obelisk`, `forklift`, `feather boa, boa`, `frying pan, frypan, skillet`, `barbershop`, `hamper`, `face powder`, `Siamese cat, Siamese`, `ladle`, `dingo, warrigal, warragal, Canis dingo`, `mountain tent`, `head cabbage`, `echidna, spiny anteater, anteater`, `Polaroid camera, Polaroid Land camera`, `dumbbell`, `espresso`, `notebook, notebook computer`, `Norfolk terrier`, `binoculars, field glasses, opera glasses`, `carpenter's kit, tool kit`, `moving van`, `catamaran`, `tiger beetle`, `bikini, two-piece`, `Siberian husky`, `studio couch, day bed`, `bulletproof vest`, `lawn mower, mower`, `promontory, headland, head, foreland`, `soap dispenser`, `vulture`, `dam, dike, dyke`, `brambling, Fringilla montifringilla`, `toilet tissue, toilet paper, bathroom tissue`, `ringlet, ringlet butterfly`, `tiger cat`, `mobile home, manufactured home`, `Norwich terrier`, `little blue heron, Egretta caerulea`, `English setter`, `Tibetan mastiff`, `rocking chair, rocker`, `mask`, `maze, labyrinth`, `bookcase`, `viaduct`, `sweatshirt`, `plow, plough`, `basenji`, `typewriter keyboard`, `Windsor tie`, `coral fungus`, `desktop computer`, `Kerry blue terrier`, `Angora, Angora rabbit`, `can opener, tin opener`, `shield, buckler`, `triumphal arch`, `horned viper, cerastes, sand viper, horned asp, Cerastes cornutus`, `miniature schnauzer`, `tape player`, `jaguar, panther, Panthera onca, Felis onca`, `hook, claw`, `file, file cabinet, filing cabinet`, `chime, bell, gong`, `shower curtain`, `window shade`, `acoustic guitar`, `gas pump, gasoline pump, petrol pump, island dispenser`, `cicada, cicala`, `Petri dish`, `paintbrush`, 
`banana`, `chickadee`, `mountain bike, all-terrain bike, off-roader`, `lighter, light, igniter, ignitor`, `oil filter`, `cab, hack, taxi, taxicab`, `Christmas stocking`, `rugby ball`, `black widow, Latrodectus mactans`, `bustard`, `fiddler crab`, `web site, website, internet site, site`, `chocolate sauce, chocolate syrup`, `chainlink fence`, `fireboat`, `cocktail shaker`, `airship, dirigible`, `projectile, missile`, `bagel, beigel`, `screwdriver`, `oystercatcher, oyster catcher`, `pot, flowerpot`, `water bottle`, `Loafer`, `drumstick`, `soccer ball`, `cairn, cairn terrier`, `padlock`, `tow truck, tow car, wrecker`, `bloodhound, sleuthhound`, `punching bag, punch bag, punching ball, punchball`, `great grey owl, great gray owl, Strix nebulosa`, `scale, weighing machine`, `trench coat`, `briard`, `cheetah, chetah, Acinonyx jubatus`, `entertainment center`, `Boston bull, Boston terrier`, `Arabian camel, dromedary, Camelus dromedarius`, `steam locomotive`, `coil, spiral, volute, whorl, helix`, `plane, carpenter's plane, woodworking plane`, `gondola`, `spider web, spider's web`, `bathtub, bathing tub, bath, tub`, `pelican`, `miniature poodle`, `cowboy boot`, `perfume, essence`, `lakeside, lakeshore`, `timber wolf, grey wolf, gray wolf, Canis lupus`, `moped`, `sunscreen, sunblock, sun blocker`, `Brabancon griffon`, `puffer, pufferfish, blowfish, globefish`, `lifeboat`, `pool table, billiard table, snooker table`, `Bouvier des Flandres, Bouviers des Flandres`, `Pomeranian`, `theater curtain, theatre curtain`, `marimba, xylophone`, `baboon`, `vacuum, vacuum cleaner`, `pill bottle`, `pick, plectrum, plectron`, `hen`, `American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier`, `digital watch`, `pier`, `oxygen mask`, `Tibetan terrier, chrysanthemum dog`, `ostrich, Struthio camelus`, `water ouzel, dipper`, `drilling platform, offshore rig`, `magnetic compass`, `throne`, `butternut squash`, `minibus`, `EntleBucher`, `carousel, carrousel, 
merry-go-round, roundabout, whirligig`, `hot pot, hotpot`, `rain barrel`, `wood rabbit, cottontail, cottontail rabbit`, `miniature pinscher`, `partridge`, `three-toed sloth, ai, Bradypus tridactylus`, `English springer, English springer spaniel`, `corkscrew, bottle screw`, `fur coat`, `robin, American robin, Turdus migratorius`, `dowitcher`, `ruddy turnstone, Arenaria interpres`, `water snake`, `stove`, `Great Pyrenees`, `soft-coated wheaten terrier`, `carbonara`, `snail`, `breastplate, aegis, egis`, `wolf spider, hunting spider`, `hatchet`, `CD player`, `axolotl, mud puppy, Ambystoma mexicanum`, `pomegranate`, `poncho`, `leatherback turtle, leatherback, leathery turtle, Dermochelys coriacea`, `lorikeet`, `spatula`, `jay`, `platypus, duckbill, duckbilled platypus, duck-billed platypus, Ornithorhynchus anatinus`, `stethoscope`, `flagpole, flagstaff`, `coho, cohoe, coho salmon, blue jack, silver salmon, Oncorhynchus kisutch`, `agama`, `red wolf, maned wolf, Canis rufus, Canis niger`, `beaker`, `eft`, `pretzel`, `brassiere, bra, bandeau`, `frilled lizard, Chlamydosaurus kingi`, `joystick`, `goldfish, Carassius auratus`, `fig`, `maypole`, `caldron, cauldron`, `admiral`, `impala, Aepyceros melampus`, `spotted salamander, Ambystoma maculatum`, `syringe`, `hog, pig, grunter, squealer, Sus scrofa`, `handkerchief, hankie, hanky, hankey`, `tarantula`, `cheeseburger`, `pinwheel`, `sax, saxophone`, `dung beetle`, `broccoli`, `cassette player`, `milk can`, `traffic light, traffic signal, stoplight`, `shovel`, `sarong`, `tabby, tabby cat`, `parallel bars, bars`, `ladybug, ladybeetle, lady beetle, ladybird, ladybird beetle`, `quill, quill pen`, `giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca`, `steel drum`, `quail`, `Blenheim spaniel`, `wig`, `hamster`, `ice lolly, lolly, lollipop, popsicle`, `seashore, coast, seacoast, sea-coast`, `chest`, `worm fence, snake fence, snake-rail fence, Virginia fence`, `missile`, `beer bottle`, `yellow lady's slipper, yellow 
lady-slipper, Cypripedium calceolus, Cypripedium parviflorum`, `breakwater, groin, groyne, mole, bulwark, seawall, jetty`, `white wolf, Arctic wolf, Canis lupus tundrarum`, `guacamole`, `porcupine, hedgehog`, `trolleybus, trolley coach, trackless trolley`, `greenhouse, nursery, glasshouse`, `trimaran`, `Italian greyhound`, `potter's wheel`, `jacamar`, `wallet, billfold, notecase, pocketbook`, `Lakeland terrier`, `green lizard, Lacerta viridis`, `indigo bunting, indigo finch, indigo bird, Passerina cyanea`, `green mamba`, `walking stick, walkingstick, stick insect`, `crossword puzzle, crossword`, `eggnog`, `barrow, garden cart, lawn cart, wheelbarrow`, `remote control, remote`, `bicycle-built-for-two, tandem bicycle, tandem`, `wool, woolen, woollen`, `black grouse`, `abaya`, `marmoset`, `golf ball`, `jeep, landrover`, `Mexican hairless`, `dishwasher, dish washer, dishwashing machine`, `jersey, T-shirt, tee shirt`, `planetarium`, `goose`, `mailbox, letter box`, `capuchin, ringtail, Cebus capucinus`, `marmot`, `orangutan, orang, orangutang, Pongo pygmaeus`, `coffeepot`, `ambulance`, `shopping basket`, `pop bottle, soda bottle`, `red fox, Vulpes vulpes`, `crash helmet`, `street sign`, `affenpinscher, monkey pinscher, monkey dog`, `Arctic fox, white fox, Alopex lagopus`, `sidewinder, horned rattlesnake, Crotalus cerastes`, `ruffed grouse, partridge, Bonasa umbellus`, `muzzle`, `measuring cup`, `canoe`, `reflex camera`, `fox squirrel, eastern fox squirrel, Sciurus niger`, `French loaf`, `killer whale, killer, orca, grampus, sea wolf, Orcinus orca`, `dial telephone, dial phone`, `thimble`, `bubble`, `vizsla, Hungarian pointer`, `running shoe`, `mailbag, postbag`, `radio telescope, radio reflector`, `piggy bank, penny bank`, `Chihuahua`, `chambered nautilus, pearly nautilus, nautilus`, `Airedale, Airedale terrier`, `kimono`, `green snake, grass snake`, `rubber eraser, rubber, pencil eraser`, `upright, upright piano`, `orange`, `revolver, six-gun, six-shooter`, `ashcan, 
trash can, garbage can, wastebin, ash bin, ash-bin, ashbin, dustbin, trash barrel, trash bin`, `drum, membranophone, tympan`, `Dungeness crab, Cancer magister`, `lipstick, lip rouge`, `gong, tam-tam`, `fountain`, `tub, vat`, `malinois`, `sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita`, `German short-haired pointer`, `apron`, `Irish setter, red setter`, `dishrag, dishcloth`, `school bus`, `candle, taper, wax light`, `bib`, `cash machine, cash dispenser, automated teller machine, automatic teller machine, automated teller, automatic teller, ATM`, `power drill`, `English foxhound`, `miniskirt, mini`, `swing`, `slug`, `hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa`, `rifle`, `Saluki, gazelle hound`, `Sealyham terrier, Sealyham`, `bullet train, bullet`, `hyena, hyaena`, `ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus`, `toy terrier`, `goblet`, `safe`, `cup`, `electric guitar`, `red wine`, `restaurant, eating house, eating place, eatery`, `wall clock`, `washbasin, handbasin, washbowl, lavabo, wash-hand basin`, `red-breasted merganser, Mergus serrator`, `crate`, `banded gecko`, `hippopotamus, hippo, river horse, Hippopotamus amphibius`, `tick`, `tripod`, `sombrero`, `desk`, `sea slug, nudibranch`, `racer, race car, racing car`, `pizza, pizza pie`, `dining table, board`, `Saint Bernard, St Bernard`, `komondor`, `electric ray, crampfish, numbfish, torpedo`, `prairie chicken, prairie grouse, prairie fowl`, `coffee mug`, `hammer`, `golfcart, golf cart`, `unicycle, monocycle`, `bison`, `soup bowl`, `rapeseed`, `golden retriever`, `plastic bag`, `grey fox, gray fox, Urocyon cinereoargenteus`, `water tower`, `house finch, linnet, Carpodacus mexicanus`, `barbell`, `hair slide`, `tiger, Panthera tigris`, `black-footed ferret, ferret, Mustela nigripes`, `meat loaf, meatloaf`, `hand blower, blow dryer, blow drier, hair dryer, hair drier`, `overskirt`, `gibbon, Hylobates lar`, `Gila monster, Heloderma suspectum`, `toucan`, 
`snowmobile`, `pencil box, pencil case`, `scuba diver`, `cloak`, `Sussex spaniel`, `otter`, `Greater Swiss Mountain dog`, `great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias`, `torch`, `magpie`, `tiger shark, Galeocerdo cuvieri`, `wing`, `Border collie`, `bell cote, bell cot`, `sea anemone, anemone`, `teapot`, `sea urchin`, `screen, CRT screen`, `bookshop, bookstore, bookstall`, `oscilloscope, scope, cathode-ray oscilloscope, CRO`, `crib, cot`, `police van, police wagon, paddy wagon, patrol wagon, wagon, black Maria`, `hartebeest`, `manhole cover`, `iPod`, `rock python, rock snake, Python sebae`, `nipple`, `suspension bridge`, `safety pin`, `sea lion`, `cougar, puma, catamount, mountain lion, painter, panther, Felis concolor`, `mantis, mantid`, `wardrobe, closet, press`, `projector`, `Granny Smith`, `diamondback, diamondback rattlesnake, Crotalus adamanteus`, `pirate, pirate ship`, `espresso maker`, `African hunting dog, hyena dog, Cape hunting dog, Lycaon pictus`, `cradle`, `common newt, Triturus vulgaris`, `tricycle, trike, velocipede`, `bobsled, bobsleigh, bob`, `thunder snake, worm snake, Carphophis amoenus`, `thresher, thrasher, threshing machine`, `banjo`, `armadillo`, `pajama, pyjama, pj's, jammies`, `ski`, `Maltese dog, Maltese terrier, Maltese`, `leafhopper`, `book jacket, dust cover, dust jacket, dust wrapper`, `silky terrier, Sydney silky`, `Shih-Tzu`, `wallaby, brush kangaroo`, `cardigan`, `sturgeon`, `freight car`, `home theater, home theatre`, `sundial`, `African crocodile, Nile crocodile, Crocodylus niloticus`, `odometer, hodometer, mileometer, milometer`, `sliding door`, `vine snake`, `West Highland white terrier`, `mongoose`, `hornbill`, `beagle`, `European gallinule, Porphyrio porphyrio`, `submarine, pigboat, sub, U-boat`, `Komodo dragon, Komodo lizard, dragon lizard, giant lizard, Varanus komodoensis`, `cock`, `pedestal, plinth, footstall`, `accordion, piano accordion, squeeze box`, `gown`, `lynx, catamount`, 
`guenon, guenon monkey`, `Walker hound, Walker foxhound`, `standard schnauzer`, `reel`, `hip, rose hip, rosehip`, `grasshopper, hopper`, `Dutch oven`, `stone wall`, `hard disc, hard disk, fixed disk`, `snow leopard, ounce, Panthera uncia`, `shopping cart`, `digital clock`, `hourglass`, `Border terrier`, `Old English sheepdog, bobtail`, `academic gown, academic robe, judge's robe`, `spiny lobster, langouste, rock lobster, crawfish, crayfish, sea crawfish`, `spotlight, spot`, `dome`, `barn spider, Araneus cavaticus`, `bee eater`, `basketball`, `cliff dwelling`, `folding chair`, `isopod`, `Doberman, Doberman pinscher`, `bittern`, `sunglasses, dark glasses, shades`, `picket fence, paling`, `Crock Pot`, `ibex, Capra ibex`, `neck brace`, `cardoon`, `cassette`, `amphibian, amphibious vehicle`, `minivan`, `analog clock`, `trailer truck, tractor trailer, trucking rig, rig, articulated lorry, semi`, `yurt`, `cliff, drop, drop-off`, `Bernese mountain dog`, `teddy, teddy bear`, `sloth bear, Melursus ursinus, Ursus ursinus`, `bassoon`, `toaster`, `ptarmigan`, `Gordon setter`, `night snake, Hypsiglena torquata`, `grand piano, grand`, `purse`, `clumber, clumber spaniel`, `shoji`, `hair spray`, `maillot`, `knee pad`, `space heater`, `bottlecap`, `chiffonier, commode`, `chain saw, chainsaw`, `sulphur butterfly, sulfur butterfly`, `pay-phone, pay-station`, `kelpie`, `mouse, computer mouse`, `car wheel`, `cornet, horn, trumpet, trump`, `container ship, containership, container vessel`, `matchstick`, `scabbard`, `American black bear, black bear, Ursus americanus, Euarctos americanus`, `langur`, `rock crab, Cancer irroratus`, `lionfish`, `speedboat`, `black stork, Ciconia nigra`, `knot`, `disk brake, disc brake`, `mosquito net`, `white stork, Ciconia ciconia`, `abacus`, `titi, titi monkey`, `grocery store, grocery, food market, market`, `waffle iron`, `pickelhaube`, `wooden spoon`, `Norwegian elkhound, elkhound`, `earthstar`, `sewing machine`, `balance beam, beam`, `potpie`, `chain 
mail, ring mail, mail, chain armor, chain armour, ring armor, ring armour`, `Staffordshire bullterrier, Staffordshire bull terrier`, `switch, electric switch, electrical switch`, `dhole, Cuon alpinus`, `paddle, boat paddle`, `limousine, limo`, `Shetland sheepdog, Shetland sheep dog, Shetland`, `space bar`, `library`, `paddlewheel, paddle wheel`, `alligator lizard`, `Band Aid`, `Persian cat`, `bull mastiff`, `tailed frog, bell toad, ribbed toad, tailed toad, Ascaphus trui`, `sports car, sport car`, `football helmet`, `laptop, laptop computer`, `lens cap, lens cover`, `tennis ball`, `violin, fiddle`, `lab coat, laboratory coat`, `cinema, movie theater, movie theatre, movie house, picture palace`, `weasel`, `bow tie, bow-tie, bowtie`, `macaw`, `dough`, `whiskey jug`, `microphone, mike`, `spoonbill`, `bassinet`, `mud turtle`, `velvet`, `warthog`, `plunger, plumber's helper`, `dugong, Dugong dugon`, `honeycomb`, `badger`, `dragonfly, darning needle, devil's darning needle, sewing needle, snake feeder, snake doctor, mosquito hawk, skeeter hawk`, `bee`, `doormat, welcome mat`, `fountain pen`, `giant schnauzer`, `assault rifle, assault gun`, `limpkin, Aramus pictus`, `siamang, Hylobates syndactylus, Symphalangus syndactylus`, `albatross, mollymawk`, `confectionery, confectionary, candy store`, `harp`, `parachute, chute`, `barrel, cask`, `tank, army tank, armored combat vehicle, armoured combat vehicle`, `collie`, `kite`, `puck, hockey puck`, `stupa, tope`, `buckeye, horse chestnut, conker`, `patio, terrace`, `broom`, `Dandie Dinmont, Dandie Dinmont terrier`, `scorpion`, `agaric`, `balloon`, `bucket, pail`, `squirrel monkey, Saimiri sciureus`, `Eskimo dog, husky`, `zebra`, `garter snake, grass snake`, `indri, indris, Indri indri, Indri brevicaudatus`, `tractor`, `guinea pig, Cavia cobaya`, `maraca`, `red-backed sandpiper, dunlin, Erolia alpina`, `bullfrog, Rana catesbeiana`, `trilobite`, `Japanese spaniel`, `gorilla, Gorilla gorilla`, `monastery`, `centipede`, `terrapin`, 
`llama`, `long-horned beetle, longicorn, longicorn beetle`, `boxer`, `curly-coated retriever`, `mortar`, `hammerhead, hammerhead shark`, `goldfinch, Carduelis carduelis`, `garden spider, Aranea diademata`, `stopwatch, stop watch`, `grey whale, gray whale, devilfish, Eschrichtius gibbosus, Eschrichtius robustus`, `leaf beetle, chrysomelid`, `birdhouse`, `king crab, Alaska crab, Alaskan king crab, Alaska king crab, Paralithodes camtschatica`, `stole`, `bell pepper`, `radiator`, `flatworm, platyhelminth`, `mushroom`, `Scotch terrier, Scottish terrier, Scottie`, `liner, ocean liner`, `toilet seat`, `lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens`, `zucchini, courgette`, `harvestman, daddy longlegs, Phalangium opilio`, `Newfoundland, Newfoundland dog`, `flamingo`, `whiptail, whiptail lizard`, `geyser`, `cleaver, meat cleaver, chopper`, `sea cucumber, holothurian`, `American egret, great white heron, Egretta albus`, `parking meter`, `beacon, lighthouse, beacon light, pharos`, `coucal`, `motor scooter, scooter`, `mitten`, `cannon`, `weevil`, `megalith, megalithic structure`, `stinkhorn, carrion fungus`, `ear, spike, capitulum`, `box turtle, box tortoise`, `snowplow, snowplough`, `tench, Tinca tinca`, `modem`, `tobacco shop, tobacconist shop, tobacconist`, `barn`, `skunk, polecat, wood pussy`, `African grey, African gray, Psittacus erithacus`, `Madagascar cat, ring-tailed lemur, Lemur catta`, `holster`, `barometer`, `sleeping bag`, `washer, automatic washer, washing machine`, `recreational vehicle, RV, R.V.`, `drake`, `tray`, `butcher shop, meat market`, `china cabinet, china closet`, `medicine chest, medicine cabinet`, `photocopier`, `Yorkshire terrier`, `starfish, sea star`, `racket, racquet`, `park bench`, `Labrador retriever`, `whistle`, `clog, geta, patten, sabot`, `volcano`, `quilt, comforter, comfort, puff`, `leopard, Panthera pardus`, `cauliflower`, `swimming trunks, bathing trunks`, `American chameleon, anole, Anolis carolinensis`, `alp`, 
`mortarboard`, `barracouta, snoek`, `cocker spaniel, English cocker spaniel, cocker`, `space shuttle`, `beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon`, `harmonica, mouth organ, harp, mouth harp`, `gasmask, respirator, gas helmet`, `wombat`, `Model T`, `wild boar, boar, Sus scrofa`, `hermit crab`, `flat-coated retriever`, `rotisserie`, `jinrikisha, ricksha, rickshaw`, `trifle`, `bannister, banister, balustrade, balusters, handrail`, `go-kart`, `bakery, bakeshop, bakehouse`, `ski mask`, `dock, dockage, docking facility`, `Egyptian cat`, `oxcart`, `redbone`, `shoe shop, shoe-shop, shoe store`, `convertible`, `ox`, `crayfish, crawfish, crawdad, crawdaddy`, `cowboy hat, ten-gallon hat`, `conch`, `spaghetti squash`, `toy poodle`, `saltshaker, salt shaker`, `microwave, microwave oven`, `triceratops`, `necklace`, `castle`, `streetcar, tram, tramcar, trolley, trolley car`, `eel`, `diaper, nappy, napkin`, `standard poodle`, `prayer rug, prayer mat`, `radio, wireless`, `crane`, `envelope`, `rule, ruler`, `gar, garfish, garpike, billfish, Lepisosteus osseus`, `spider monkey, Ateles geoffroyi`, `Irish wolfhound`, `German shepherd, German shepherd dog, German police dog, alsatian`, `umbrella`, `sunglass`, `aircraft carrier, carrier, flattop, attack aircraft carrier`, `water buffalo, water ox, Asiatic buffalo, Bubalus bubalis`, `jellyfish`, `groom, bridegroom`, `tree frog, tree-frog`, `steel arch bridge`, `lemon`, `pickup, pickup truck`, `vault`, `groenendael`, `baseball`, `junco, snowbird`, `maillot, tank suit`, `gazelle`, `jack-o'-lantern`, `military uniform`, `slide rule, slipstick`, `wire-haired fox terrier`, `acorn squash`, `electric fan, blower`, `Brittany spaniel`, `chimpanzee, chimp, Pan troglodytes`, `pillow`, `binder, ring-binder`, `schipperke`, `Afghan hound, Afghan`, `plate rack`, `car mirror`, `hand-held computer, hand-held microcomputer`, `papillon`, `schooner`, `Bedlington terrier`, `cellular telephone, cellular phone, 
cellphone, cell, mobile phone`, `altar`, `Chesapeake Bay retriever`, `cabbage butterfly`, `polecat, fitch, foulmart, foumart, Mustela putorius`, `comic book`, `French horn, horn`, `daisy`, `organ, pipe organ`, `mashed potato`, `acorn`, `fly`, `chain`, `American alligator, Alligator mississipiensis`, `mink`, `garbage truck, dustcart`, `totem pole`, `wine bottle`, `strawberry`, `cricket`, `European fire salamander, Salamandra salamandra`, `coral reef`, `Welsh springer spaniel`, `bighorn, bighorn sheep, cimarron, Rocky Mountain bighorn, Rocky Mountain sheep, Ovis canadensis`, `snorkel`, `bald eagle, American eagle, Haliaeetus leucocephalus`, `meerkat, mierkat`, `grille, radiator grille`, `nematode, nematode worm, roundworm`, `anemone fish`, `corn`, `loggerhead, loggerhead turtle, Caretta caretta`, `palace`, `suit, suit of clothes`, `pineapple, ananas`, `macaque`, `ping-pong ball`, `ram, tup`, `church, church building`, `koala, koala bear, kangaroo bear, native bear, Phascolarctos cinereus`, `hare`, `bath towel`, `strainer`, `yawl`, `otterhound, otter hound`, `table lamp`, `king snake, kingsnake`, `lotion`, `lion, king of beasts, Panthera leo`, `thatch, thatched roof`, `basset, basset hound`, `black and gold garden spider, Argiope aurantia`, `barber chair`, `proboscis monkey, Nasalis larvatus`, `consomme`, `Irish terrier`, `Irish water spaniel`, `common iguana, iguana, Iguana iguana`, `Weimaraner`, `Great Dane`, `pug, pug-dog`, `rhinoceros beetle`, `vase`, `brass, memorial tablet, plaque`, `kit fox, Vulpes macrotis`, `king penguin, Aptenodytes patagonica`, `vending machine`, `dalmatian, coach dog, carriage dog`, `rock beauty, Holocanthus tricolor`, `pole`, `cuirass`, `bolete`, `jackfruit, jak, jack`, `monarch, monarch butterfly, milkweed butterfly, Danaus plexippus`, `chow, chow chow`, `nail`, `packet`, `half track`, `Lhasa, Lhasa apso`, `boathouse`, `hay`, `valley, vale`, `slot, one-armed bandit`, `volleyball`, `carton`, `shower cap`, `tile roof`, `lacewing, lacewing 
fly`, `patas, hussar monkey, Erythrocebus patas`, `boa constrictor, Constrictor constrictor`, `black swan, Cygnus atratus`, `lampshade, lamp shade`, `mousetrap`, `crutch`, `vestment`, `Pekinese, Pekingese, Peke`, `tusker`, `warplane, military plane`, `sandal`, `screw`, `custard apple`, `Scottish deerhound, deerhound`, `spindle`, `keeshond`, `hummingbird`, `letter opener, paper knife, paperknife`, `cucumber, cuke`, `bearskin, busby, shako`, `fire engine, fire truck`, `trombone`, `ringneck snake, ring-necked snake, ring snake`, `sorrel`, `fire screen, fireguard`, `paper towel`, `flute, transverse flute`, `hotdog, hot dog, red hot`, `Indian elephant, Elephas maximus`, `mosque`, `stingray`, `Rhodesian ridgeback`, `four-poster` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_deit_small_patch16_224_en_4.1.0_3.0_1660165708246.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_deit_small_patch16_224_en_4.1.0_3.0_1660165708246.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_deit_small_patch16_224", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_deit_small_patch16_224", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_deit_small_patch16_224| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|83.2 MB| --- layout: model title: Detect concepts in drug development trials (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_drug_development_trials date: 2022-06-18 tags: [ner, en, bertfortokenclassification, clinical, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a BertForTokenClassification NER model that identifies concepts related to drug development in free text, including `Trial Groups`, efficacy and safety `End Points`, `Hazard Ratio`, and others. ## Predicted Entities `Hazard_Ratio`, `Confidence_Interval`, `Patient_Count`, `Trial_Group`, `Patient_Group`, `Duration`, `Confidence_level`, `P_Value`, `Confidence_Range`, `End_Point`, `Follow_Up`, `ADE`, `Value`, `DATE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DRUGS_DEVELOPMENT_TRIALS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_drug_development_trials_en_3.4.1_3.0_1655578771078.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_drug_development_trials_en_3.4.1_3.0_1655578771078.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_drug_development_trials", "en", "clinical/models")\ .setInputCols("token", "sentence")\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) test_sentence = """In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan.""" data = spark.createDataFrame([[test_sentence]]).toDF('text') result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_drug_development_trials", "en", "clinical/models") .setInputCols(Array("token", "sentence")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val data = Seq("In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.ner.drug_development_trials").predict("""In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan.""") ```
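The pipeline's `ner_chunk` column yields (chunk, label) pairs like those shown in the Results section. As a minimal post-processing sketch (pure Python, independent of Spark NLP; the pairs below are illustrative values, not actual model output), the extracted entities can be tallied per label:

```python
from collections import Counter

# Illustrative (chunk, label) pairs as they might be collected from the
# pipeline's "ner_chunk" output column; values here are hypothetical.
rows = [
    ("overall survival", "End_Point"),
    ("without topotecan", "Trial_Group"),
    ("4.0", "Value"),
    ("3.6 months", "Value"),
    ("23", "Patient_Count"),
    ("63", "Patient_Count"),
]

# Count how many chunks each entity label received.
label_counts = Counter(label for _, label in rows)
print(label_counts.most_common())
```

A tally like this is a quick sanity check that the entity distribution matches expectations before running the model at scale.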
## Results ```bash +-----------------+-------------+ |chunk |ner_label | +-----------------+-------------+ |median |Duration | |overall survival |End_Point | |with |Trial_Group | |without topotecan|Trial_Group | |4.0 |Value | |3.6 months |Value | |23 |Patient_Count| |63 |Patient_Count| |55 |Patient_Count| |33 patients |Patient_Count| |topotecan |Trial_Group | |11 |Patient_Count| |61 |Patient_Count| |66 |Patient_Count| |32 patients |Patient_Count| |without topotecan|Trial_Group | +-----------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_drug_development_trials| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|400.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References Trained on data obtained from `clinicaltrials.gov` and annotated in-house. ## Benchmarking ```bash label prec rec f1 support B-ADE 0.50 0.33 0.40 3 B-Confidence_Interval 0.46 1.00 0.63 12 B-Confidence_Range 1.00 0.98 0.99 42 B-Confidence_level 1.00 0.67 0.81 43 B-DATE 0.95 0.93 0.94 40 B-Duration 1.00 0.82 0.90 11 B-End_Point 0.91 0.98 0.95 54 B-Follow_Up 1.00 1.00 1.00 2 B-Hazard_Ratio 0.77 1.00 0.87 24 B-P_Value 1.00 0.56 0.71 9 B-Patient_Count 1.00 0.95 0.97 19 B-Patient_Group 0.79 0.63 0.70 43 B-Trial_Group 0.96 0.94 0.95 274 B-Value 0.98 0.83 0.90 77 I-ADE 0.71 1.00 0.83 12 I-Confidence_Range 0.98 1.00 0.99 43 I-DATE 0.95 1.00 0.98 60 I-Duration 1.00 1.00 1.00 1 I-End_Point 0.92 1.00 0.96 44 I-Follow_Up 1.00 1.00 1.00 2 I-P_Value 0.82 1.00 0.90 18 I-Patient_Count 0.00 0.00 0.00 0 I-Patient_Group 0.79 0.94 0.86 187 I-Trial_Group 0.92 0.90 0.91 156 I-Value 1.00 1.00 1.00 10 O 0.98 0.98 0.98 2622 accuracy - - 0.96 3808 macro-avg 0.86 0.86 0.85 3808 weighted-avg 0.96 0.96 0.96 3808 ``` --- layout: model title: Legal Insolvency Clause Binary Classifier author: John Snow Labs name: legclf_insolvency_clause 
date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `insolvency` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have large legal documents and want to look for clauses, we recommend splitting them using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings of this model allow up to 512 tokens; if your text is longer, consider splitting it into smaller pieces (see the same tutorial linked above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, producing a True/False value for each of the legal clause models you have added. 
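The paragraph-splitting approach mentioned above can be sketched in plain Python (a hypothetical helper, not part of Spark NLP; real projects should prefer the splitting annotators from the linked tutorial): split on blank lines, then window any paragraph that exceeds a rough whitespace-token budget so every chunk fits the 512-token embedding limit.

```python
def split_into_paragraphs(text, max_tokens=512):
    """Split on blank lines, then sub-split any paragraph whose
    whitespace-token count exceeds max_tokens (a rough proxy for
    the real subword token count)."""
    chunks = []
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        tokens = para.split()
        if len(tokens) <= max_tokens:
            chunks.append(para)
        else:
            # Fall back to fixed-size windows of max_tokens tokens.
            for i in range(0, len(tokens), max_tokens):
                chunks.append(" ".join(tokens[i:i + max_tokens]))
    return chunks

doc = "First clause paragraph.\n\nSecond clause paragraph."
print(split_into_paragraphs(doc))
```

Each resulting chunk can then be fed to the classifier as a separate row of the `clause_text` column.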
## Predicted Entities `other`, `insolvency` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_insolvency_clause_en_1.0.0_3.2_1660122549928.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_insolvency_clause_en_1.0.0_3.2_1660122549928.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_insolvency_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
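Combining several binary clause classifiers, as the description suggests, amounts to collecting each classifier's output column and mapping it to a boolean. A minimal sketch (pure Python; `merge_clause_predictions` and the clause names are hypothetical, not part of the Spark NLP API):

```python
def merge_clause_predictions(predictions):
    """predictions maps a clause name (e.g. 'insolvency') to the label
    string the corresponding binary classifier returned for one document;
    each classifier emits either its clause name or 'other'."""
    return {clause: label == clause for clause, label in predictions.items()}

# One document, two classifiers: the insolvency classifier fired,
# the confidentiality classifier returned 'other'.
print(merge_clause_predictions({"insolvency": "insolvency",
                                "confidentiality": "other"}))
```

The result is the "series of True/False values" described above, one per clause classifier added to the pipeline.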
## Results ```bash +-------+ | result| +-------+ |[insolvency]| |[other]| |[other]| |[insolvency]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_insolvency_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house. ## Benchmarking ```bash label precision recall f1-score support insolvency 0.98 0.98 0.98 43 other 0.99 0.99 0.99 101 accuracy - - 0.99 144 macro-avg 0.98 0.98 0.98 144 weighted-avg 0.99 0.99 0.99 144 ``` --- layout: model title: Dutch BertForQuestionAnswering model (from henryk) author: John Snow Labs name: bert_qa_bert_base_multilingual_cased_finetuned_dutch_squad2 date: 2022-06-02 tags: [nl, open_source, question_answering, bert] task: Question Answering language: nl edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased-finetuned-dutch-squad2` is a Dutch model originally trained by `henryk`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_finetuned_dutch_squad2_nl_4.0.0_3.0_1654180038184.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_finetuned_dutch_squad2_nl_4.0.0_3.0_1654180038184.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_multilingual_cased_finetuned_dutch_squad2","nl") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_multilingual_cased_finetuned_dutch_squad2","nl") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("nl.answer_question.squadv2.bert.multilingual_base_cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_multilingual_cased_finetuned_dutch_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|nl| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/henryk/bert-base-multilingual-cased-finetuned-dutch-squad2 - https://www.linkedin.com/in/henryk-borzymowski-0755a2167/ - https://rajpurkar.github.io/SQuAD-explorer/ - https://github.com/google-research/bert/blob/master/multilingual.md --- layout: model title: Ukrainian XLMRoBerta Embeddings (from ukr-models) author: John Snow Labs name: xlmroberta_embeddings_xlm_roberta_base date: 2022-05-14 tags: [open_source, embeddings, uk, xlm_roberta] task: Embeddings language: uk edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: XlmRoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRoBERTa Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-uk` is a Ukrainian model originally trained by `ukr-models`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_xlm_roberta_base_uk_3.4.4_3.0_1652533212195.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_xlm_roberta_base_uk_3.4.4_3.0_1652533212195.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_xlm_roberta_base","uk") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Я люблю Spark NLP."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_xlm_roberta_base","uk") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Я люблю Spark NLP.").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_embeddings_xlm_roberta_base| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|uk| |Size:|264.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/ukr-models/xlm-roberta-base-uk --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from 21iridescent) author: John Snow Labs name: roberta_qa_base_finetuned_squad2_lwt date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-base-finetuned-squad2-lwt` is an English model originally trained by `21iridescent`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_squad2_lwt_en_4.3.0_3.0_1674210398377.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_squad2_lwt_en_4.3.0_3.0_1674210398377.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_squad2_lwt","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_squad2_lwt","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_finetuned_squad2_lwt| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|307.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/21iridescent/distilroberta-base-finetuned-squad2-lwt --- layout: model title: Chinese BertForQuestionAnswering model (from voidful) author: John Snow Labs name: bert_qa_question_answering_zh_voidful date: 2022-06-03 tags: [open_source, question_answering, bert] task: Question Answering language: zh edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `question-answering-zh` is a Chinese model originally trained by `voidful`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_question_answering_zh_voidful_zh_4.0.0_3.0_1654250011767.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_question_answering_zh_voidful_zh_4.0.0_3.0_1654250011767.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_question_answering_zh_voidful","zh") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_question_answering_zh_voidful","zh") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("zh.answer_question.bert.by_voidful").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_question_answering_zh_voidful| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|zh| |Size:|381.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/voidful/question-answering-zh --- layout: model title: Russian T5ForConditionalGeneration Cased model (from IlyaGusev) author: John Snow Labs name: t5_sber_rut5_filler date: 2023-01-30 tags: [ru, open_source, t5, tensorflow] task: Text Generation language: ru edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `sber_rut5_filler` is a Russian model originally trained by `IlyaGusev`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_sber_rut5_filler_ru_4.3.0_3.0_1675107176054.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_sber_rut5_filler_ru_4.3.0_3.0_1675107176054.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_sber_rut5_filler","ru") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_sber_rut5_filler","ru") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_sber_rut5_filler| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ru| |Size:|927.4 MB| ## References - https://huggingface.co/IlyaGusev/sber_rut5_filler --- layout: model title: Bengali Lemmatizer author: John Snow Labs name: lemma date: 2021-01-20 task: Lemmatization language: bn edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [bn, lemmatizer, open_source] supported: true annotator: LemmatizerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/TEXT_PREPROCESSING/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TEXT_PREPROCESSING.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_bn_2.7.0_2.4_1611163691269.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_bn_2.7.0_2.4_1611163691269.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma", "bn") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) results = light_pipeline.fullAnnotate(["একদিন প্রাতে বৈদ্যনাথের মার্বলমণ্ডিত দালানে একটি স্থূলোদর সন্ন্যাসী দুইসের মোহনভোগ এবং দেড়সের দুগ্ধ সেবায় নিযুক্ত আছে বৈদ্যনাথ গায়ে একখানি চাদর দিয়া জোড়করে একান্ত বিনীতভাবে ভূতলে বসিয়া ভক্তিভরে পবিত্র ভোজনব্যাপার নিরীক্ষণ করিতেছিলেন এমন সময় কোনোমতে দ্বারীদের দৃষ্টি এড়াইয়া জীর্ণদেহ বালক সহিত একটি অতি শীর্ণকায়া রমণী গৃহে প্রবেশ করিয়া ক্ষীণস্বরে কহিল বাবু দুটি খেতে দাও"]) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma", "bn") .setInputCols("token") .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("একদিন প্রাতে বৈদ্যনাথের মার্বলমণ্ডিত দালানে একটি স্থূলোদর সন্ন্যাসী দুইসের মোহনভোগ এবং দেড়সের দুগ্ধ সেবায় নিযুক্ত আছে বৈদ্যনাথ গায়ে একখানি চাদর দিয়া জোড়করে একান্ত বিনীতভাবে ভূতলে বসিয়া ভক্তিভরে পবিত্র ভোজনব্যাপার নিরীক্ষণ করিতেছিলেন এমন সময় কোনোমতে দ্বারীদের দৃষ্টি এড়াইয়া জীর্ণদেহ বালক সহিত একটি অতি শীর্ণকায়া রমণী গৃহে প্রবেশ করিয়া ক্ষীণস্বরে কহিল বাবু দুটি খেতে দাও").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["একদিন প্রাতে বৈদ্যনাথের মার্বলমণ্ডিত দালানে একটি স্থূলোদর সন্ন্যাসী দুইসের মোহনভোগ এবং দেড়সের দুগ্ধ সেবায় নিযুক্ত আছে বৈদ্যনাথ গায়ে একখানি চাদর দিয়া জোড়করে একান্ত বিনীতভাবে 
ভূতলে বসিয়া ভক্তিভরে পবিত্র ভোজনব্যাপার নিরীক্ষণ করিতেছিলেন এমন সময় কোনোমতে দ্বারীদের দৃষ্টি এড়াইয়া জীর্ণদেহ বালক সহিত একটি অতি শীর্ণকায়া রমণী গৃহে প্রবেশ করিয়া ক্ষীণস্বরে কহিল বাবু দুটি খেতে দাও"] lemma_df = nlu.load('bn.lemma').predict(text, output_level = "document") lemma_df.lemma.values[0] ```
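The idea above — collapsing every inflected form of a word to one root — can be sketched as a plain dictionary lookup. This is a toy illustration with made-up English entries, not the Bengali dictionary or the context-sensitive logic this model actually uses:

```python
# Toy dictionary-based lemmatizer. The form-to-root entries below are
# illustrative English examples only; the pretrained model relies on a
# learned Bengali dictionary plus surrounding context.
LEMMA_DICT = {
    "ran": "run", "running": "run", "runs": "run",
    "went": "go", "gone": "go", "goes": "go",
}

def lemmatize(tokens):
    # Unknown forms fall back to the surface form itself.
    return [LEMMA_DICT.get(token.lower(), token) for token in tokens]
```

A context-sensitive lemmatizer such as this model goes further and disambiguates forms whose correct root depends on the surrounding words, which a flat lookup table cannot do.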
## Results ```bash {'lemma': [Annotation(token, 0, 4, একদিন, {'sentence': '0'}), Annotation(token, 6, 11, প্রাতঃ, {'sentence': '0'}), Annotation(token, 13, 22, বৈদ্যনাথ, {'sentence': '0'}), Annotation(token, 24, 35, মার্বলমণ্ডিত, {'sentence': '0'}), Annotation(token, 37, 42, দালান, {'sentence': '0'}), Annotation(token, 44, 47, এক, {'sentence': '0'}), Annotation(token, 49, 56, স্থূলউদর, {'sentence': '0'}), Annotation(token, 58, 66, সন্ন্যাসী, {'sentence': '0'}), Annotation(token, 68, 73, দুইসের, {'sentence': '0'}), Annotation(token, 75, 81, মোহনভোগ, {'sentence': '0'}), Annotation(token, 83, 85, এবং, {'sentence': '0'}), Annotation(token, 87, 93, দেড়সের, {'sentence': '0'}), Annotation(token, 95, 99, দুগ্ধ, {'sentence': '0'}), Annotation(token, 101, 105, সেবা, {'sentence': '0'}), Annotation(token, 107, 113, নিযুক্ত, {'sentence': '0'}), Annotation(token, 115, 117, আছে, {'sentence': '0'}), Annotation(token, 119, 126, বৈদ্যনাথ, {'sentence': '0'}), Annotation(token, 128, 131, গা, {'sentence': '0'}), Annotation(token, 133, 138, একখান, {'sentence': '0'}), Annotation(token, 140, 143, চাদর, {'sentence': '0'}), Annotation(token, 145, 148, দেওয়া, {'sentence': '0'}), Annotation(token, 150, 156, জোড়কর, {'sentence': '0'}), Annotation(token, 158, 163, একান্ত, {'sentence': '0'}), Annotation(token, 165, 173, বিনীতভাব, {'sentence': '0'}), Annotation(token, 175, 179, ভূতল, {'sentence': '0'}), Annotation(token, 181, 185, বসা, {'sentence': '0'}), Annotation(token, 187, 194, ভক্তিভরা, {'sentence': '0'}), Annotation(token, 196, 201, পবিত্র, {'sentence': '0'}), Annotation(token, 203, 213, ভোজনব্যাপার, {'sentence': '0'}), Annotation(token, 215, 222, নিরীক্ষণ, {'sentence': '0'}), Annotation(token, 224, 233, করা, {'sentence': '0'}), Annotation(token, 235, 237, এমন, {'sentence': '0'}), Annotation(token, 239, 241, সময়, {'sentence': '0'}), Annotation(token, 243, 249, কোনোমত, {'sentence': '0'}), Annotation(token, 251, 259, দ্বারী, {'sentence': '0'}), Annotation(token, 261, 266, দৃষ্টি, 
{'sentence': '0'}), Annotation(token, 268, 274, এড়ানো, {'sentence': '0'}), Annotation(token, 276, 283, জীর্ণদেহ, {'sentence': '0'}), Annotation(token, 285, 288, বালক, {'sentence': '0'}), Annotation(token, 290, 293, সহিত, {'sentence': '0'}), Annotation(token, 295, 298, এক, {'sentence': '0'}), Annotation(token, 300, 302, অতি, {'sentence': '0'}), Annotation(token, 304, 312, শীর্ণকায়া, {'sentence': '0'}), Annotation(token, 314, 317, রমণী, {'sentence': '0'}), Annotation(token, 319, 322, গৃহ, {'sentence': '0'}), Annotation(token, 324, 329, প্রবেশ, {'sentence': '0'}), Annotation(token, 331, 335, বিশ্বাস, {'sentence': '0'}), Annotation(token, 337, 346, ক্ষীণস্বর, {'sentence': '0'}), Annotation(token, 348, 351, কহা, {'sentence': '0'}), Annotation(token, 353, 356, বাবু, {'sentence': '0'}), Annotation(token, 358, 361, দুই, {'sentence': '0'}), Annotation(token, 363, 366, খাওয়া, {'sentence': '0'}), Annotation(token, 368, 370, দাওয়া, {'sentence': '0'})]} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[token]| |Language:|bn| ## Data Source The model was trained on the annotated Bengali data set from the [Indian Statistics Institute](https://www.isical.ac.in). Reference: - A. Chakrabarty, O.A. Pandit, and U. Garain (2017): Context Sensitive Lemmatization Using Two Successive Bidirectional Gated Recurrent Networks, in ACL 2017:1481-1491. 
--- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from Evelyn18) author: John Snow Labs name: roberta_qa_base_spanish_squades_v2 date: 2023-01-20 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-squades-robertav2` is a Spanish model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_v2_es_4.3.0_3.0_1674218559260.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_v2_es_4.3.0_3.0_1674218559260.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_v2","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_v2","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_spanish_squades_v2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|460.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Evelyn18/roberta-base-spanish-squades-robertav2 --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from adi1494) author: John Snow Labs name: distilbert_qa_adi1494_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `adi1494`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_adi1494_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769689530.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_adi1494_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769689530.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_adi1494_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_adi1494_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_adi1494_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/adi1494/distilbert-base-uncased-finetuned-squad --- layout: model title: Twi T5ForConditionalGeneration Cased model (from clhuang) author: John Snow Labs name: t5_hotel_review_sentiment date: 2023-01-31 tags: [tw, open_source, t5, tensorflow] task: Text Generation language: tw edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-hotel-review-sentiment` is a Twi model originally trained by `clhuang`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_hotel_review_sentiment_tw_4.3.0_3.0_1675124731516.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_hotel_review_sentiment_tw_4.3.0_3.0_1675124731516.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_hotel_review_sentiment","tw") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_hotel_review_sentiment","tw") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_hotel_review_sentiment| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|tw| |Size:|1.0 GB| ## References - https://huggingface.co/clhuang/t5-hotel-review-sentiment --- layout: model title: Word Embeddings for Persian (persian_w2v_cc_300d) author: John Snow Labs name: persian_w2v_cc_300d date: 2020-12-05 task: Embeddings language: fa edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [embeddings, fa, open_source] supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained on Common Crawl and Wikipedia using fastText. It is trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives. The model gives 300 dimensional vector outputs per token. The output vectors map words into a meaningful space where the distance between the vectors is related to semantic similarity of words. These embeddings can be used in multiple tasks like semantic word similarity, named entity recognition, sentiment analysis, and classification. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/persian_w2v_cc_300d_fa_2.7.0_2.4_1607169840793.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/persian_w2v_cc_300d_fa_2.7.0_2.4_1607169840793.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of a pipeline after tokenization.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("persian_w2v_cc_300d", "fa") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['من یادگیری ماشین را دوست دارم']], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("persian_w2v_cc_300d", "fa") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("من یادگیری ماشین را دوست دارم").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""من یادگیری ماشین را دوست دارم"""] farvec_df = nlu.load('fa.embed.word2vec.300d').predict(text, output_level='token') farvec_df ```
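The claim above — that distance between vectors tracks semantic similarity — is usually checked with cosine similarity over the 300-dimensional token vectors in the `embeddings` column. A minimal pure-Python sketch (toy vectors, not real model output):

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors:
    # close to 1.0 for near-synonyms, close to 0.0 for unrelated tokens.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

In practice the vectors would come from the `embeddings` annotations in `result`; the function itself is the same regardless of dimension.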
{:.h2_title} ## Results The model gives 300-dimensional Word2Vec feature vector outputs per token. ```bash | token | fa_embed_word2vec_300d_embeddings |-------|-------------------------------------------------- | من | [-0.3861289620399475, -0.08295578509569168, -0... | را | [-0.15430298447608948, -0.24924889206886292, 0... | دوست | [0.07587642222642899, -0.24341894686222076, 0.... | دارم | [0.0899219810962677, -0.21863090991973877, 0.4... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|persian_w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[word_embeddings]| |Language:|fa| |Case sensitive:|false| |Dimension:|300| ## Data Source This model is imported from https://fasttext.cc/docs/en/crawl-vectors.html --- layout: model title: Legal Introduction Clause Binary Classifier author: John Snow Labs name: legclf_introduction_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `introduction` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. 
If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `introduction` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_introduction_clause_en_1.0.0_3.2_1660122579442.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_introduction_clause_en_1.0.0_3.2_1660122579442.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_introduction_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
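The paragraph splitting (by multiline) recommended above can be approximated with a plain-Python split on blank lines — a minimal sketch under that assumption, not the workshop notebook's actual implementation:

```python
import re

def split_paragraphs(text):
    # Split on one or more blank lines and keep non-empty chunks;
    # each chunk would then become one row of the `clause_text` column
    # fed to the pipeline above.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
```

Classifying each resulting chunk separately keeps every input under the 512-token embedding limit while still giving the model whole-clause context.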
## Results ```bash +--------------+ |        result| +--------------+ |[introduction]| |       [other]| |       [other]| |[introduction]| +--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_introduction_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support introduction 0.82 0.84 0.83 32 other 0.94 0.93 0.94 90 accuracy - - 0.91 122 macro-avg 0.88 0.89 0.88 122 weighted-avg 0.91 0.91 0.91 122 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from lorenzkuhn) author: John Snow Labs name: distilbert_qa_lorenzkuhn_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `lorenzkuhn`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_lorenzkuhn_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772042605.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_lorenzkuhn_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772042605.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_lorenzkuhn_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_lorenzkuhn_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_lorenzkuhn_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/lorenzkuhn/distilbert-base-uncased-finetuned-squad --- layout: model title: Translate Kwangali to English Pipeline author: John Snow Labs name: translate_kwn_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, kwn, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `kwn` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_kwn_en_xx_2.7.0_2.4_1609698648068.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_kwn_en_xx_2.7.0_2.4_1609698648068.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_kwn_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_kwn_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.kwn.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_kwn_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Spanish RoBERTa Embeddings (Base, Random Sampling) author: John Snow Labs name: roberta_embeddings_bertin_base_random date: 2022-04-14 tags: [roberta, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bertin-base-random` is a Spanish model originally trained by `bertin-project`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_bertin_base_random_es_3.4.2_3.0_1649945870509.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_bertin_base_random_es_3.4.2_3.0_1649945870509.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_bertin_base_random","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_bertin_base_random","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.bertin_base_random").predict("""Me encanta chispa nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_bertin_base_random| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|234.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/bertin-project/bertin-base-random --- layout: model title: English asr_wav2vec2_large_xlsr_ksponspeech_1_20 TFWav2Vec2ForCTC from cheulyop author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_ksponspeech_1_20 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_ksponspeech_1_20` is an English model originally trained by cheulyop. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_ksponspeech_1_20_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_ksponspeech_1_20_en_4.2.0_3.0_1664097449426.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_ksponspeech_1_20_en_4.2.0_3.0_1664097449426.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_ksponspeech_1_20', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_ksponspeech_1_20", lang = "en") val annotations = pipeline.transform(audioDF) ```
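Both snippets assume an `audioDF` whose `audio_content` column contains the raw waveform as an array of floats. A minimal standard-library sketch of producing such floats from a 16-bit PCM mono WAV file (the file name is illustrative; any loader yielding normalized floats at the model's expected sampling rate works):

```python
import wave
import struct

def wav_to_floats(path):
    # Read a 16-bit PCM mono WAV file and normalize samples to [-1.0, 1.0],
    # the raw-float representation the pipeline's AudioAssembler consumes.
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Hypothetical usage with an existing Spark session:
# audioDF = spark.createDataFrame([(wav_to_floats("sample.wav"),)], ["audio_content"])
# annotations = pipeline.transform(audioDF)
```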
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_ksponspeech_1_20| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering model (from internetoftim) author: John Snow Labs name: bert_qa_demo date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `demo` is an English model originally trained by `internetoftim`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_demo_en_4.0.0_3.0_1654187421235.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_demo_en_4.0.0_3.0_1654187421235.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_demo","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_demo","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.by_internetoftim").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
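Under the hood, BertForQuestionAnswering scores every context token as a possible answer start and end; the predicted answer is the span with the highest combined score. A toy, framework-free sketch of that span-selection step (the logits below are made up; the real model produces them from the question/context pair):

```python
def best_span(start_logits, end_logits, max_span_len=30):
    # Pick indices (i, j) with i <= j maximizing start_logits[i] + end_logits[j],
    # limiting the span length so degenerate long answers are not selected.
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_span_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best_score:
                best_score, best = score, (i, j)
    return best
```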
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_demo| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|798.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/internetoftim/demo --- layout: model title: Translate Indo-European languages to English Pipeline author: John Snow Labs name: translate_ine_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ine, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `ine` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ine_en_xx_2.7.0_2.4_1609691766618.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ine_en_xx_2.7.0_2.4_1609691766618.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ine_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ine_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ine.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ine_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Stop Words Cleaner for Afrikaans author: John Snow Labs name: stopwords_af date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: af edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, af] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_af_af_2.5.4_2.4_1594742440083.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_af_af_2.5.4_2.4_1594742440083.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_af", "af") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Anders as die koning van die noorde, is John Snow 'n Engelse dokter en 'n leier in die ontwikkeling van narkose en mediese higiëne.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_af", "af") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Anders as die koning van die noorde, is John Snow 'n Engelse dokter en 'n leier in die ontwikkeling van narkose en mediese higiëne.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Anders as die koning van die noorde, is John Snow 'n Engelse dokter en 'n leier in die ontwikkeling van narkose en mediese higiëne."""] stopword_df = nlu.load('af.stopwords').predict(text) stopword_df[["cleanTokens"]] ```
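Conceptually, the cleaner drops every token that appears in the language's stop-word list and keeps the rest. A minimal pure-Python sketch (the tiny Afrikaans list here is hypothetical and for illustration only; the pretrained model ships its own, much larger list):

```python
# Hypothetical miniature Afrikaans stop-word list, for illustration only.
STOPWORDS_AF = {"as", "die", "van", "is", "in", "en", "'n"}

def clean_tokens(tokens):
    # Keep only tokens absent from the stop-word list (case-insensitive match).
    return [t for t in tokens if t.lower() not in STOPWORDS_AF]
```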
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=5, result='Anders', metadata={'sentence': '0'}), Row(annotatorType='token', begin=14, end=19, result='koning', metadata={'sentence': '0'}), Row(annotatorType='token', begin=29, end=34, result='noorde', metadata={'sentence': '0'}), Row(annotatorType='token', begin=35, end=35, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=40, end=43, result='John', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_af| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|af| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: English T5ForConditionalGeneration Small Cased model (from doc2query) author: John Snow Labs name: t5_stackexchange_title_body_small_v1 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `stackexchange-title-body-t5-small-v1` is an English model originally trained by `doc2query`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_stackexchange_title_body_small_v1_en_4.3.0_3.0_1675107510591.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_stackexchange_title_body_small_v1_en_4.3.0_3.0_1675107510591.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_stackexchange_title_body_small_v1","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_stackexchange_title_body_small_v1","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_stackexchange_title_body_small_v1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|347.9 MB| ## References - https://huggingface.co/doc2query/stackexchange-title-body-t5-small-v1 - https://arxiv.org/abs/1904.08375 - https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf - https://arxiv.org/abs/2104.08663 - https://github.com/UKPLab/beir - https://www.sbert.net/examples/unsupervised_learning/query_generation/README.html --- layout: model title: Sentence Entity Resolver for UMLS CUI Codes author: John Snow Labs name: sbiobertresolve_umls_findings date: 2021-05-16 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts to 4 major categories of UMLS CUI codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It has a faster load time, with a speedup of about 6X compared to previous versions. The load process is also more memory-friendly: the maximum memory required during loading is smaller, reducing the chance of OOM exceptions and thus relaxing hardware requirements.
## Predicted Entities This model returns CUI (concept unique identifier) codes for 200K concepts from clinical findings. See https://www.nlm.nih.gov/research/umls/index.html for details. {:.btn-box} [Live Demo](http://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_findings_en_3.0.4_3.0_1621189546348.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_findings_en_3.0.4_3.0_1621189546348.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") stopwords = StopWordsCleaner.pretrained()\ .setInputCols("token")\ .setOutputCol("cleanTokens")\ .setCaseSensitive(False) word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "cleanTokens"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "cleanTokens", "ner"]) \ .setOutputCol("entities") chunk2doc = Chunk2Doc()\ .setInputCols("entities")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli",'en','clinical/models')\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_umls_findings","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") pipeline = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver]) data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a 
one-week history of polyuria, polydipsia, poor appetite, and vomiting."""]]).toDF("text") results = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val stopwords = StopWordsCleaner.pretrained() .setInputCols("token") .setOutputCol("cleanTokens") .setCaseSensitive(false) val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "cleanTokens")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "cleanTokens", "ner")) .setOutputCol("entities") val chunk2doc = new Chunk2Doc() .setInputCols("entities") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("sbert_embeddings") val resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_umls_findings","en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver)) val text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to 
presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.""" val data = Seq(text).toDS.toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.umls.findings").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.""") ```
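The resolver embeds each detected chunk with the sentence-BERT stage and returns the code whose stored embedding is nearest under the configured distance (`EUCLIDEAN` above). A toy, framework-free sketch of that nearest-neighbor lookup (vectors and code labels are invented for illustration):

```python
import math

def resolve(query, candidates):
    # candidates: mapping of code -> embedding vector.
    # Return the code whose embedding minimizes Euclidean distance to `query`.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(candidates, key=lambda code: dist(query, candidates[code]))
```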
## Results ```bash | | ner_chunk | cui_code | |---:|:--------------------------------------|:-----------| | 0 | gestational diabetes mellitus | C2183115 | | 1 | subsequent type two diabetes mellitus | C3532488 | | 2 | T2DM | C3280267 | | 3 | HTG-induced pancreatitis | C4554179 | | 4 | an acute hepatitis | C4750596 | | 5 | obesity | C1963185 | | 6 | a body mass index | C0578022 | | 7 | polyuria | C3278312 | | 8 | polydipsia | C3278316 | | 9 | poor appetite | C0541799 | | 10 | vomiting | C0042963 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_umls_findings| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[umls_code]| |Language:|en| |Case sensitive:|false| ## Data Source [https://www.nlm.nih.gov/research/umls/index.html](https://www.nlm.nih.gov/research/umls/index.html) --- layout: model title: English BertForQuestionAnswering model (from nntadotzip) author: John Snow Labs name: bert_qa_bert_base_cased_IUChatbot_ontologyDts date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased-IUChatbot-ontologyDts` is an English model originally trained by `nntadotzip`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_cased_IUChatbot_ontologyDts_en_4.0.0_3.0_1654179690784.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_cased_IUChatbot_ontologyDts_en_4.0.0_3.0_1654179690784.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_cased_IUChatbot_ontologyDts","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_cased_IUChatbot_ontologyDts","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.base_cased.by_nntadotzip").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_cased_IUChatbot_ontologyDts| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/nntadotzip/bert-base-cased-IUChatbot-ontologyDts --- layout: model title: SDOH Insurance Status For Classification author: John Snow Labs name: genericclassifier_sdoh_insurance_status_sbiobert_cased_mli date: 2023-04-27 tags: [en, licensed, sdoh, social_determinants, generic_classifier, biobert] task: Text Classification language: en edition: Healthcare NLP 4.4.0 spark_version: 3.0 supported: true annotator: GenericClassifierModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Generic Classifier model is intended for detecting whether the patient has insurance or not. If the patient's insurance status is not mentioned or is unknown, it is regarded as "Unknown". The model is trained by using GenericClassifierApproach annotator. `Insured`: The patient has insurance. `Uninsured`: The patient has no insurance. `Unknown`: Insurance status is not mentioned in the clinical notes or is unknown. ## Predicted Entities `Insured`, `Uninsured`, `Unknown` {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/social_determinant){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_insurance_status_sbiobert_cased_mli_en_4.4.0_3.0_1682623268182.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_insurance_status_sbiobert_cased_mli_en_4.4.0_3.0_1682623268182.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") features_asm = FeaturesAssembler()\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("features") generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_insurance_status_sbiobert_cased_mli", 'en', 'clinical/models')\ .setInputCols(["features"])\ .setOutputCol("prediction") pipeline = Pipeline(stages=[ document_assembler, sentence_embeddings, features_asm, generic_classifier ]) text_list = ["The patient has VA insurance.", "She doesn't have any kind of insurance", """Patient: Mary H. Background: Mary is a 40-year-old woman who has been diagnosed with asthma and allergies. She has been managing her conditions with medication and regular follow-up appointments with her healthcare provider. She lives in a rented apartment with her husband and two children and has been stably housed for the past five years. Presenting problem: Mary presents to the clinic for a routine check-up and reports no significant changes in her health status or symptoms related to her asthma or allergies. However, she expresses concerns about the quality of the air in her apartment and potential environmental triggers that could impact her health. Medical history: Mary has a medical history of asthma and allergies. She takes an inhaler and antihistamines to manage her conditions. Social history: Mary is married with two children and lives in a rented apartment. She and her husband both work full-time jobs and have health insurance. They have savings and are able to cover basic expenses. Assessment: The clinician assesses Mary's medical conditions and determines that her asthma and allergies are stable and well-controlled. 
The clinician also assesses Mary's housing situation and determines that her apartment building is in good condition and does not present any immediate environmental hazards. Plan: The clinician advises Mary to continue to monitor her health conditions and to report any changes or concerns to her healthcare team. The clinician also prescribes a referral to an allergist who can provide additional evaluation and treatment for her allergies. The clinician recommends that Mary and her family take steps to minimize potential environmental triggers in their apartment, such as avoiding smoking and using air purifiers. The clinician advises Mary to continue to maintain her stable housing situation and to seek assistance if any financial or housing issues arise. """, """Patient: Sarah L. Background: Sarah is a 35-year-old woman who has been experiencing housing insecurity for the past year. She was evicted from her apartment due to an increase in rent, which she could not afford, and has been staying with friends and family members ever since. She works as a part-time sales associate at a retail store and has no medical insurance. Presenting problem: Sarah presents to the clinic with complaints of increased stress and anxiety related to her housing insecurity. She reports feeling constantly on edge and worried about where she will sleep each night. She is also having difficulty concentrating at work and has been missing shifts due to her anxiety. Medical history: Sarah has no significant medical history and takes no medications. Social history: Sarah is currently single and has no children. She has a high school diploma but has not attended college. She has been working at her current job for three years and earns minimum wage. She has no savings and relies on her income to cover basic expenses. Assessment: The clinician assesses Sarah's mental health and determines that she is experiencing symptoms of anxiety and depression related to her housing insecurity. 
The clinician also assesses Sarah's housing situation and determines that she is at risk for homelessness if she is unable to secure stable housing soon. Plan: The clinician refers Sarah to a social worker who can help her connect with local housing resources, including subsidized housing programs and emergency shelters. The clinician also prescribes an antidepressant medication to help manage her symptoms of anxiety and depression. The clinician advises Sarah to continue to seek employment opportunities that may offer higher pay and stability."""] df = spark.createDataFrame(text_list, StringType()).toDF("text") result = pipeline.fit(df).transform(df) result.select("text", "prediction.result").show(truncate=100) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence_embeddings") val features_asm = new FeaturesAssembler() .setInputCols("sentence_embeddings") .setOutputCol("features") val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_insurance_status_sbiobert_cased_mli", "en", "clinical/models") .setInputCols("features") .setOutputCol("prediction") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_embeddings, features_asm, generic_classifier)) val data = Seq(Array("The patient has VA insurance.", "She doesn't have any kind of insurance", """Patient: Mary H. Background: Mary is a 40-year-old woman who has been diagnosed with asthma and allergies. She has been managing her conditions with medication and regular follow-up appointments with her healthcare provider. She lives in a rented apartment with her husband and two children and has been stably housed for the past five years. 
Presenting problem: Mary presents to the clinic for a routine check-up and reports no significant changes in her health status or symptoms related to her asthma or allergies. However, she expresses concerns about the quality of the air in her apartment and potential environmental triggers that could impact her health. Medical history: Mary has a medical history of asthma and allergies. She takes an inhaler and antihistamines to manage her conditions. Social history: Mary is married with two children and lives in a rented apartment. She and her husband both work full-time jobs and have health insurance. They have savings and are able to cover basic expenses. Assessment: The clinician assesses Mary's medical conditions and determines that her asthma and allergies are stable and well-controlled. The clinician also assesses Mary's housing situation and determines that her apartment building is in good condition and does not present any immediate environmental hazards. Plan: The clinician advises Mary to continue to monitor her health conditions and to report any changes or concerns to her healthcare team. The clinician also prescribes a referral to an allergist who can provide additional evaluation and treatment for her allergies. The clinician recommends that Mary and her family take steps to minimize potential environmental triggers in their apartment, such as avoiding smoking and using air purifiers. The clinician advises Mary to continue to maintain her stable housing situation and to seek assistance if any financial or housing issues arise. """, """Patient: Sarah L. Background: Sarah is a 35-year-old woman who has been experiencing housing insecurity for the past year. She was evicted from her apartment due to an increase in rent, which she could not afford, and has been staying with friends and family members ever since. She works as a part-time sales associate at a retail store and has no medical insurance. 
Presenting problem: Sarah presents to the clinic with complaints of increased stress and anxiety related to her housing insecurity. She reports feeling constantly on edge and worried about where she will sleep each night. She is also having difficulty concentrating at work and has been missing shifts due to her anxiety. Medical history: Sarah has no significant medical history and takes no medications. Social history: Sarah is currently single and has no children. She has a high school diploma but has not attended college. She has been working at her current job for three years and earns minimum wage. She has no savings and relies on her income to cover basic expenses. Assessment: The clinician assesses Sarah's mental health and determines that she is experiencing symptoms of anxiety and depression related to her housing insecurity. The clinician also assesses Sarah's housing situation and determines that she is at risk for homelessness if she is unable to secure stable housing soon. Plan: The clinician refers Sarah to a social worker who can help her connect with local housing resources, including subsidized housing programs and emergency shelters. The clinician also prescribes an antidepressant medication to help manage her symptoms of anxiety and depression. The clinician advises Sarah to continue to seek employment opportunities that may offer higher pay and stability.""")).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash +----------------------------------------------------------------------------------------------------+-----------+ | text| result| +----------------------------------------------------------------------------------------------------+-----------+ | The patient has VA insurance.| [Insured]| | She doesn't have any kind of insurance|[Uninsured]| |Patient: Mary H.\n\nBackground: Mary is a 40-year-old woman who has been diagnosed with asthma an...| [Insured]| |Patient: Sarah L.\n\nBackground: Sarah is a 35-year-old woman who has been experiencing housing i...|[Uninsured]| +----------------------------------------------------------------------------------------------------+-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|genericclassifier_sdoh_insurance_status_sbiobert_cased_mli| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[features]| |Output Labels:|[prediction]| |Language:|en| |Size:|3.4 MB| |Dependencies:|sbiobert_base_cased_mli| ## References SDOH internal project ## Benchmarking ```bash label precision recall f1-score support Insured 0.94 0.90 0.92 145 Uninsured 0.84 0.89 0.86 36 Unknown 0.72 0.82 0.77 38 accuracy - - 0.88 219 macro-avg 0.84 0.87 0.85 219 weighted-avg 0.89 0.88 0.88 219 ``` --- layout: model title: English image_classifier_vit_age_classifier ViTForImageClassification from ibombonato author: John Snow Labs name: image_classifier_vit_age_classifier date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_age_classifier` is an English model originally trained by 
ibombonato. ## Predicted Entities `60-70`, `40-50`, `70-80`, `80-90`, `10-20`, `50-60`, `90-100`, `30-40`, `0-10`, `20-30` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_age_classifier_en_4.1.0_3.0_1660169996525.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_age_classifier_en_4.1.0_3.0_1660169996525.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_age_classifier", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_age_classifier", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
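The `class` output column holds one of the age-bucket labels listed under Predicted Entities (for example `20-30`). If a numeric age estimate is needed downstream, a label can be reduced to its bucket midpoint. The helper below is purely illustrative and not part of Spark NLP:

```python
# Hypothetical post-processing helper: map a predicted age-bucket label
# such as "20-30" to the numeric midpoint of that bucket.
def bucket_midpoint(label: str) -> float:
    lo, hi = (int(x) for x in label.split("-"))
    return (lo + hi) / 2

print(bucket_midpoint("20-30"))   # 25.0
print(bucket_midpoint("90-100"))  # 95.0
```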
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_age_classifier| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English T5ForConditionalGeneration Cased model (from gagan3012) author: John Snow Labs name: t5_gagan3012_k2t_test date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `k2t-test` is an English model originally trained by `gagan3012`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_gagan3012_k2t_test_en_4.3.0_3.0_1675103953989.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_gagan3012_k2t_test_en_4.3.0_3.0_1675103953989.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_gagan3012_k2t_test","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_gagan3012_k2t_test","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_gagan3012_k2t_test| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|242.7 MB| ## References - https://huggingface.co/gagan3012/k2t-test - https://pypi.org/project/keytotext/ - https://pepy.tech/project/keytotext - https://colab.research.google.com/github/gagan3012/keytotext/blob/master/notebooks/K2T.ipynb - https://share.streamlit.io/gagan3012/keytotext/UI/app.py - https://github.com/gagan3012/keytotext#api - https://hub.docker.com/r/gagan30/keytotext - https://keytotext.readthedocs.io/en/latest/?badge=latest - https://github.com/psf/black - https://socialify.git.ci/gagan3012/keytotext/image?description=1&forks=1&language=1&owner=1&stargazers=1&theme=Light --- layout: model title: Translate Semitic languages to English Pipeline author: John Snow Labs name: translate_sem_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, sem, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `sem` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_sem_en_xx_2.7.0_2.4_1609686594071.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_sem_en_xx_2.7.0_2.4_1609686594071.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_sem_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_sem_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.sem.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_sem_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Korean Electra Embeddings (from monologg) author: John Snow Labs name: electra_embeddings_koelectra_base_generator date: 2022-05-17 tags: [ko, open_source, electra, embeddings] task: Embeddings language: ko edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `koelectra-base-generator` is a Korean model originally trained by `monologg`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_koelectra_base_generator_ko_3.4.4_3.0_1652786875395.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_koelectra_base_generator_ko_3.4.4_3.0_1652786875395.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_koelectra_base_generator","ko") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["나는 Spark NLP를 좋아합니다"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_koelectra_base_generator","ko") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("나는 Spark NLP를 좋아합니다").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_koelectra_base_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ko| |Size:|130.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/monologg/koelectra-base-generator - https://github.com/monologg/KoELECTRA/blob/master/README_EN.md --- layout: model title: Legal Capital Expenditures Clause Binary Classifier author: John Snow Labs name: legclf_capital_expenditures_clause date: 2023-01-27 tags: [en, legal, classification, capital, expenditures, clauses, capital_expenditures, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `capital-expenditures` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `capital-expenditures`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_capital_expenditures_clause_en_1.0.0_3.0_1674820293638.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_capital_expenditures_clause_en_1.0.0_3.0_1674820293638.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_capital_expenditures_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
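The description above recommends splitting large documents into paragraphs before classification, since the embeddings only cover 512 tokens. This is a minimal, stdlib-only sketch of that pre-processing step (not the workshop's implementation; the rough whitespace token count is an assumption, not the model's actual tokenizer):

```python
import re

def split_paragraphs(text, max_tokens=512):
    """Split a document on blank lines and flag paragraphs that likely
    exceed the embeddings' token limit (rough whitespace-based count)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    too_long = [p for p in paragraphs if len(p.split()) > max_tokens]
    return paragraphs, too_long

doc = "Section 1. Capital Expenditures shall not exceed...\n\nSection 2. Other provisions..."
paragraphs, too_long = split_paragraphs(doc)
print(len(paragraphs))  # 2
```

Each returned paragraph can then be fed to the pipeline above as a separate row of the `text` column.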
## Results ```bash +----------------------+ |result | +----------------------+ |[capital-expenditures]| |[other] | |[other] | |[capital-expenditures]| +----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_capital_expenditures_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support capital-expenditures 1.00 0.97 0.99 37 other 0.97 1.00 0.99 38 accuracy - - 0.99 75 macro-avg 0.99 0.99 0.99 75 weighted-avg 0.99 0.99 0.99 75 ``` --- layout: model title: English asr_wav2vec2_base_10k_voxpopuli TFWav2Vec2ForCTC from facebook author: John Snow Labs name: pipeline_asr_wav2vec2_base_10k_voxpopuli date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_10k_voxpopuli` is an English model originally trained by facebook. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_10k_voxpopuli_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_10k_voxpopuli_en_4.2.0_3.0_1664022541668.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_10k_voxpopuli_en_4.2.0_3.0_1664022541668.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_10k_voxpopuli', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_10k_voxpopuli", lang = "en") val annotations = pipeline.transform(audioDF) ```
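The `audioDF` passed to `transform` above needs an `audio_content` column holding the waveform as normalized float samples. Assuming the source audio is 16-bit little-endian PCM (as in a plain WAV payload), this stdlib-only sketch shows the conversion; building the Spark DataFrame from the resulting list is omitted:

```python
import struct

def pcm16_to_floats(raw: bytes) -> list:
    """Convert raw 16-bit little-endian PCM bytes into normalized
    [-1.0, 1.0] float samples, the representation ASR pipelines consume."""
    count = len(raw) // 2
    samples = struct.unpack("<%dh" % count, raw)
    return [s / 32768.0 for s in samples]

# Three synthetic samples: silence, half amplitude, full negative amplitude.
raw = struct.pack("<3h", 0, 16384, -32768)
print(pcm16_to_floats(raw))  # [0.0, 0.5, -1.0]
```

In practice a library such as librosa or soundfile would do this (plus resampling to the model's expected sample rate); the point here is only the target float representation.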
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_10k_voxpopuli| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|227.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: French CamemBert Embeddings (from mdroth) author: John Snow Labs name: camembert_embeddings_generic_model_R91m date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model_R91m` is a French model originally trained by `mdroth`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_generic_model_R91m_fr_3.4.4_3.0_1653991354882.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_generic_model_R91m_fr_3.4.4_3.0_1653991354882.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_generic_model_R91m","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_generic_model_R91m","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_generic_model_R91m| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/mdroth/dummy-model_R91m --- layout: model title: Lemmatizer (Swedish, SpacyLookup) author: John Snow Labs name: lemma_spacylookup date: 2022-03-03 tags: [open_source, lemmatizer, sv] task: Lemmatization language: sv edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Swedish Lemmatizer is a scalable, production-ready version of the rule-based Lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_sv_3.4.1_3.0_1646316517673.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_sv_3.4.1_3.0_1646316517673.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","sv") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Du är inte bättre än jag"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","sv") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("Du är inte bättre än jag").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("sv.lemma.spacylookup").predict("""Du är inte bättre än jag""") ```
## Results ```bash +------------------------------+ |result | +------------------------------+ |[Du, vara, inte, god, än, jag]| +------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma_spacylookup| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|sv| |Size:|8.0 MB| --- layout: model title: English asr_wav2vec2_large_xls_r_300m_georgian_large TFWav2Vec2ForCTC from RaphaelKalandadze author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_georgian_large date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_georgian_large` is an English model originally trained by RaphaelKalandadze. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_300m_georgian_large_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_georgian_large_en_4.2.0_3.0_1664120201846.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_georgian_large_en_4.2.0_3.0_1664120201846.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_georgian_large", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_georgian_large", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_georgian_large| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Spanish RobertaForQuestionAnswering (from IIC) author: John Snow Labs name: roberta_qa_roberta_base_spanish_squades date: 2022-06-20 tags: [es, open_source, question_answering, roberta] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-squades` is a Spanish model originally trained by `IIC`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_spanish_squades_es_4.0.0_3.0_1655734721755.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_spanish_squades_es_4.0.0_3.0_1655734721755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_spanish_squades","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_spanish_squades","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.squad.roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
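In the NLU one-liner above, the question and its context are packed into a single string separated by `|||`. A tiny helper makes the convention explicit; the helper names are illustrative only and are not part of the `nlu` package:

```python
SEP = "|||"

def pack_qa(question: str, context: str) -> str:
    # NLU question-answering inputs use the form "question|||context".
    return question + SEP + context

def unpack_qa(packed: str):
    # Split only on the first separator, in case the context contains "|||".
    question, context = packed.split(SEP, 1)
    return question, context

packed = pack_qa("What's my name?", "My name is Clara and I live in Berkeley.")
print(unpack_qa(packed)[0])  # What's my name?
```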
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_spanish_squades| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|es| |Size:|459.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/IIC/roberta-base-spanish-squades - https://paperswithcode.com/sota?task=question-answering&dataset=squad_es - https://arxiv.org/abs/2107.07253 --- layout: model title: Castilian, Spanish BertForQuestionAnswering model (from CenIA) author: John Snow Labs name: bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_mlqa date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-cased-finetuned-qa-mlqa` is a Castilian, Spanish model originally trained by `CenIA`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_mlqa_es_4.0.0_3.0_1654180397621.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_mlqa_es_4.0.0_3.0_1654180397621.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_mlqa","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_mlqa","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.mlqa.bert.base_cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_mlqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|es| |Size:|410.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/CenIA/bert-base-spanish-wwm-cased-finetuned-qa-mlqa --- layout: model title: English RobertaForQuestionAnswering (from huxxx657) author: John Snow Labs name: roberta_qa_roberta_base_finetuned_squad_3 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad-3` is an English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_squad_3_en_4.0.0_3.0_1655734505387.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_squad_3_en_4.0.0_3.0_1655734505387.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_finetuned_squad_3","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_finetuned_squad_3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_v3.by_huxxx657").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_finetuned_squad_3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-squad-3 --- layout: model title: English image_classifier_vit_mit_indoor_scenes ViTForImageClassification from vincentclaes author: John Snow Labs name: image_classifier_vit_mit_indoor_scenes date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_mit_indoor_scenes` is an English model originally trained by vincentclaes. 
## Predicted Entities `airport_inside`, `bowling`, `buffet`, `movietheater`, `clothingstore`, `inside_bus`, `fastfood_restaurant`, `operating_room`, `corridor`, `cloister`, `stairscase`, `auditorium`, `meeting_room`, `livingroom`, `videostore`, `bathroom`, `inside_subway`, `bedroom`, `casino`, `tv_studio`, `classroom`, `laboratorywet`, `nursery`, `office`, `deli`, `prisoncell`, `dentaloffice`, `restaurant_kitchen`, `studiomusic`, `locker_room`, `restaurant`, `laundromat`, `dining_room`, `subway`, `gameroom`, `museum`, `mall`, `garage`, `elevator`, `jewelleryshop`, `kindergarden`, `toystore`, `concert_hall`, `artstudio`, `kitchen`, `florist`, `waitingroom`, `grocerystore`, `library`, `bar`, `computerroom`, `trainstation`, `lobby`, `church_inside`, `pantry`, `closet`, `children_room`, `hairsalon`, `shoeshop`, `greenhouse`, `bookstore`, `bakery`, `poolinside`, `warehouse`, `winecellar`, `hospitalroom`, `gym` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_mit_indoor_scenes_en_4.1.0_3.0_1660170571863.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_mit_indoor_scenes_en_4.1.0_3.0_1660170571863.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_mit_indoor_scenes", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_mit_indoor_scenes", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_mit_indoor_scenes| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.1 MB| --- layout: model title: Generic Deidentification NER author: John Snow Labs name: finner_deid date: 2022-08-09 tags: [en, finance, ner, deid, licensed] task: [De-identification, Named Entity Recognition] language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a NER model that detects generic entities which may need to be masked or obfuscated to comply with regulations such as GDPR and CCPA. This is just the NER component; make sure you also try the full de-identification pipelines available in the Models Hub. ## Predicted Entities `AGE`, `CITY`, `COUNTRY`, `DATE`, `EMAIL`, `FAX`, `LOCATION-OTHER`, `ORG`, `PERSON`, `PHONE`, `PROFESSION`, `STATE`, `STREET`, `URL`, `ZIP` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/DEID_FIN/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_deid_en_1.0.0_3.2_1660050720560.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_deid_en_1.0.0_3.2_1660050720560.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = finance.NerModel.pretrained('finner_deid', "en", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = [""" This LICENSE AND DEVELOPMENT AGREEMENT (this Agreement) is entered into effective as of Nov. 02, 2019 (the Effective Date) by and between Bioeq IP AG, having its principal place of business at 333 Twin Dolphin Drive, Suite 600, Redwood City, CA, 94065, USA (Licensee). """] res = model.transform(spark.createDataFrame([text]).toDF("text")) ```
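Downstream of the NER stage, the `NerConverter` merges token-level IOB tags into entity chunks. The following pure-Python sketch illustrates that merging logic on tokens from the example above (an illustration of the IOB scheme, not Spark NLP's actual implementation):

```python
def merge_bio(tokens, tags):
    """Merge (token, IOB-tag) pairs into (chunk, label) entities,
    mimicking what Spark NLP's NerConverter does."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag starts a new entity; flush any open one first.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # An I- tag with a matching label continues the open entity.
            current.append(tok)
        else:
            # An O tag (or label mismatch) closes the open entity.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["as", "of", "Nov", ".", "02", ",", "2019", "("]
tags   = ["O", "O", "B-DATE", "I-DATE", "I-DATE", "I-DATE", "I-DATE", "O"]
print(merge_bio(tokens, tags))  # [('Nov . 02 , 2019', 'DATE')]
```

This is why the token-level output below tags `Nov` as `B-DATE` and the following tokens as `I-DATE`: together they form a single `DATE` chunk.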
## Results ```bash +-----------+----------------+ | token| ner_label| +-----------+----------------+ | This| O| | LICENSE| O| | AND| O| |DEVELOPMENT| O| | AGREEMENT| O| | (| O| | this| O| | Agreement| O| | )| O| | is| O| | entered| O| | into| O| | effective| O| | as| O| | of| O| | Nov| B-DATE| | .| I-DATE| | 02| I-DATE| | ,| I-DATE| | 2019| I-DATE| | (| O| | the| O| | Effective| O| | Date| O| | )| O| | by| O| | and| O| | between| O| | Bioeq| O| | IP| O| | AG| O| | ,| O| | having| O| | its| O| | principal| O| | place| O| | of| O| | business| O| | at| O| | 333| B-STREET| | Twin| I-STREET| | Dolphin| I-STREET| | Drive| I-STREET| | ,| O| | Suite|B-LOCATION-OTHER| | 600|I-LOCATION-OTHER| | ,| O| | Redwood| B-CITY| | City| I-CITY| | ,| O| | CA| B-STATE| | ,| O| | 94065| B-ZIP| | ,| O| | USA| B-STATE| | (| O| | Licensee| O| | ).| O| +-----------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_deid| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.4 MB| ## References In-house annotated documents with protected information ## Benchmarking ```bash label precision recall f1-score support B-AGE 0.96 0.89 0.92 245 B-CITY 0.85 0.86 0.86 123 B-COUNTRY 0.86 0.67 0.75 36 B-DATE 0.98 0.97 0.97 2352 B-ORG 0.75 0.71 0.73 38 B-PERSON 0.97 0.94 0.95 1348 B-PHONE 0.86 0.80 0.83 86 B-PROFESSION 0.93 0.75 0.83 84 B-STATE 0.92 0.89 0.91 102 B-STREET 0.99 0.91 0.95 89 I-CITY 0.82 0.77 0.79 35 I-COUNTRY 1.00 0.50 0.67 6 I-DATE 0.96 0.95 0.96 402 I-ORG 0.71 0.86 0.77 28 I-PERSON 0.98 0.96 0.97 1240 I-PHONE 0.91 0.92 0.92 77 I-PROFESSION 0.96 0.79 0.87 70 I-STATE 1.00 0.62 0.77 8 I-STREET 0.98 0.94 0.96 188 I-ZIP 0.84 0.97 0.90 60 O 1.00 1.00 1.00 194103 accuracy - - 1.00 200762 macro-avg 0.72 0.62 0.65 200762 weighted-avg 1.00 1.00 1.00 200762 ``` --- layout: model title: Clinical QA BioGPT (JSL) 
author: John Snow Labs name: biogpt_chat_jsl date: 2023-04-12 tags: [licensed, en, clinical, text_generation, biogpt, tensorflow] task: Text Generation language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true engine: tensorflow annotator: MedicalTextGenerator article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is based on BioGPT, fine-tuned on medical conversations in clinical settings, and can answer clinical questions related to symptoms, drugs, tests, and diseases. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/BIOGPT_CHAT_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/33.Biogpt_Chat_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/biogpt_chat_jsl_en_4.3.2_3.0_1681319163583.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/biogpt_chat_jsl_en_4.3.2_3.0_1681319163583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") gpt_qa = MedicalTextGenerator.pretrained("biogpt_chat_jsl", "en", "clinical/models")\ .setInputCols("documents")\ .setOutputCol("answer") pipeline = Pipeline().setStages([document_assembler, gpt_qa]) data = spark.createDataFrame([["How to treat asthma ?"]]).toDF("text") pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("documents") val gptQA = MedicalTextGenerator.pretrained("biogpt_chat_jsl", "en", "clinical/models") .setInputCols("documents") .setOutputCol("answer") val pipeline = new Pipeline().setStages(Array(document_assembler, gptQA)) val text = "How to treat asthma ?" val data = Seq(text).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash ['Asthma is itself an allergic disease due to cold or dust or pollen or grass etc. irrespective of the triggering factor. You can go for pulmonary function tests if not done. Treatment is mainly symptomatic which might require inhalation steroids, beta agonists, anticholinergics as MDI or rota haler as a regular treatment. To decrease the inflammation of bronchi and bronchioles, you might be given oral antihistamines with mast cell stabilizers (montelukast) and steroids (prednisolone) with nebulization and frequently steam inhalation. To decrease the bronchoconstriction caused by allergens, you might be given oral antihistamines with mast cell stabilizers (montelukast) and steroids (prednisolone) with nebulization and frequently steam inhalation. The best way to cure any allergy is a complete avoidance of allergen or triggering factor. Consult your pulmonologist for further advise.'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|biogpt_chat_jsl| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.4 GB| |Case sensitive:|true| --- layout: model title: English BertForQuestionAnswering Large Uncased model (from michaelrglass) author: John Snow Labs name: bert_qa_large_uncased_ssp date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-sspt` is an English model originally trained by `michaelrglass`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_large_uncased_ssp_en_4.0.0_3.0_1657187762696.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_large_uncased_ssp_en_4.0.0_3.0_1657187762696.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_large_uncased_ssp","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(False) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_large_uncased_ssp","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_large_uncased_ssp| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|796.4 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/michaelrglass/bert-large-uncased-sspt --- layout: model title: Detect Diagnosis, Symptoms, Drugs, Labs and Demographics (ner_jsl_enriched) author: John Snow Labs name: ner_jsl_enriched date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terminology. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNN. Definitions of Predicted Entities: - `Age`: All mentions of ages, past or present, related to the patient or anybody else. - `Dosage`: Quantity prescribed by the physician for an active ingredient; measurement units are described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_Name`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients. - `Frequency`: Frequency of administration for a dose prescribed. - `Gender`: Gender-specific nouns and pronouns. - `Symptom`: All the symptoms mentioned in the document, of a patient or someone else. - `Allergen`: Allergen-related extractions mentioned in the document. 
- `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted. - `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately. - `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments. - `Pulse`: Peripheral heart rate, without advanced information like measurement location. - `Respiration`: Number of breaths per minute. - `Route`: Drug and medication administration routes as described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital Signs headers are extracted separately with their specific labels). - `Temperature`: All mentions that refer to body temperature. - `Weight`: All mentions related to a patient's weight. - `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs). 
## Predicted Entities `Age`, `Name`, `Procedure`, `Pulse_Rate`, `Temperature`, `Gender`, `Frequency`, `Route`, `Diagnosis`, `Allergenic_substance`, `Weight`, `Respiratory_Rate`, `Symptom_Name`, `Causative_Agents(Virus_and_Bacteria)`, `Modifier`, `Blood_Pressure`, `Dosage`, `Drug_Name`, `Negation`, `Lab_Result`, `Section_Name`, `Maybe`, `Substance_Name`, `Procedure_Name`, `Lab_Name`, `O2_Saturation` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_en_3.0.0_3.0_1617209691808.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_en_3.0.0_3.0_1617209691808.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") jsl_ner = MedicalNerModel.pretrained("ner_jsl_enriched", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("jsl_ner") jsl_ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "jsl_ner"]) \ .setOutputCol("ner_chunk") jsl_ner_pipeline = Pipeline().setStages([ documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter]) jsl_ner_model = jsl_ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame([["""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature."""]]).toDF("text") result = jsl_ner_model.transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val jsl_ner = MedicalNerModel.pretrained("ner_jsl_enriched", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("jsl_ner") val jsl_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "jsl_ner")) .setOutputCol("ner_chunk") val jsl_ner_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter)) val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text") val result = jsl_ner_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl.enriched").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
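Once the pipeline has produced (chunk, label) pairs, a common follow-up step is to tally predictions per entity type. As a minimal sketch, the pairs extracted from the example document can be counted with a standard-library `Counter`:

```python
from collections import Counter

# (chunk, label) pairs as produced by the NER pipeline on the example text
chunks = [
    ("21-day-old", "Age"), ("male", "Gender"), ("congestion", "Symptom_Name"),
    ("mom", "Gender"), ("suctioning yellow discharge", "Symptom_Name"),
    ("she", "Gender"), ("problems with his breathing", "Symptom_Name"),
    ("perioral cyanosis", "Symptom_Name"), ("retractions", "Symptom_Name"),
    ("mom", "Gender"), ("Tylenol", "Drug_Name"), ("His", "Gender"),
    ("his", "Gender"), ("respiratory congestion", "Symptom_Name"),
    ("He", "Gender"), ("tired", "Symptom_Name"), ("fussy", "Symptom_Name"),
    ("albuterol", "Drug_Name"),
]
counts = Counter(label for _, label in chunks)
print(counts)  # counts: Symptom_Name 8, Gender 7, Drug_Name 2, Age 1
```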
## Results ```bash +---------------------------+------------+ |chunk |ner | +---------------------------+------------+ |21-day-old |Age | |male |Gender | |congestion |Symptom_Name| |mom |Gender | |suctioning yellow discharge|Symptom_Name| |she |Gender | |problems with his breathing|Symptom_Name| |perioral cyanosis |Symptom_Name| |retractions |Symptom_Name| |mom |Gender | |Tylenol |Drug_Name | |His |Gender | |his |Gender | |respiratory congestion |Symptom_Name| |He |Gender | |tired |Symptom_Name| |fussy |Symptom_Name| |albuterol |Drug_Name | +---------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_enriched| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on data gathered and manually annotated by John Snow Labs. https://www.johnsnowlabs.com/data/ ## Benchmarking ```bash label tp fp fn prec rec f1 B-Pulse_Rate 80 26 9 0.754717 0.898876 0.820513 I-Diagnosis 2341 1644 1129 0.587453 0.67464 0.628035 I-Procedure_Name 2209 1128 1085 0.661972 0.670613 0.666265 B-Lab_Result 432 107 263 0.801484 0.621583 0.700162 B-Dosage 465 179 81 0.72205 0.851648 0.781513 I-Causative_Agents_(Virus_and_Bacteria) 9 3 10 0.75 0.473684 0.580645 B-Name 648 295 510 0.687169 0.559585 0.616849 I-Name 917 427 665 0.682292 0.579646 0.626794 B-Weight 52 25 9 0.675325 0.852459 0.753623 B-Symptom_Name 4244 1911 1776 0.689521 0.704983 0.697166 I-Maybe 25 15 63 0.625 0.284091 0.390625 I-Symptom_Name 1920 1584 2503 0.547945 0.434095 0.48442 B-Modifier 1399 704 942 0.66524 0.597608 0.629613 B-Blood_Pressure 82 21 7 0.796117 0.921348 0.854167 B-Frequency 290 93 97 0.75718 0.749354 0.753247 I-Gender 29 19 25 0.604167 0.537037 0.568627 I-Age 3 6 11 0.333333 0.214286 0.26087 B-Drug_Name 1762 500 271 0.778957 0.866699 0.820489 B-Substance_Name 143 32 53 0.817143 0.729592 0.770889 B-Temperature 58 23 11 
0.716049 0.84058 0.773333 B-Section_Name 2700 294 177 0.901804 0.938478 0.919775 I-Route 131 165 177 0.442568 0.425325 0.433775 B-Maybe 108 47 164 0.696774 0.397059 0.505855 B-Gender 5156 685 68 0.882726 0.986983 0.931948 I-Dosage 435 182 87 0.705024 0.833333 0.763828 B-Causative_Agents_(Virus_and_Bacteria) 21 17 6 0.552632 0.777778 0.646154 I-Frequency 278 131 191 0.679707 0.592751 0.633257 B-Age 352 34 21 0.911917 0.9437 0.927536 I-Lab_Result 27 20 170 0.574468 0.137056 0.221311 B-Negation 1501 311 341 0.828366 0.814875 0.821565 B-Diagnosis 2657 1281 1049 0.674708 0.716945 0.695186 I-Section_Name 3876 1304 188 0.748263 0.95374 0.838598 B-Route 466 286 123 0.619681 0.791172 0.695004 I-Negation 80 152 190 0.344828 0.296296 0.318725 B-Procedure_Name 1453 739 562 0.662865 0.721092 0.690754 I-Allergenic_substance 6 1 7 0.857143 0.461538 0.6 B-Allergenic_substance 74 31 23 0.704762 0.762887 0.732673 I-Weight 46 43 17 0.516854 0.730159 0.605263 B-Lab_Name 639 189 287 0.771739 0.690065 0.72862 I-Modifier 104 156 417 0.4 0.199616 0.266325 I-Temperature 2 7 13 0.222222 0.133333 0.166667 I-Drug_Name 334 237 290 0.584939 0.535256 0.558996 I-Lab_Name 271 157 140 0.633178 0.659367 0.646007 B-Respiratory_Rate 46 6 5 0.884615 0.901961 0.893204 Macro-average 37896 15237 14343 0.621144 0.562248 0.59023 Micro-average 37896 15237 14343 0.713229 0.725435 0.71928 ``` --- layout: model title: English DistilBertForQuestionAnswering model (from jhoonk) author: John Snow Labs name: distilbert_qa_jhoonk_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`distilbert-base-uncased-finetuned-squad` is an English model originally trained by `jhoonk`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_jhoonk_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725579171.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_jhoonk_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725579171.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_jhoonk_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(False) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_jhoonk_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_jhoonk").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_jhoonk_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/jhoonk/distilbert-base-uncased-finetuned-squad --- layout: model title: English DistilBertForQuestionAnswering model (from tucan9389) author: John Snow Labs name: distilbert_qa_tucan9389_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `tucan9389`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_tucan9389_base_uncased_finetuned_squad_en_4.0.0_3.0_1654726443115.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_tucan9389_base_uncased_finetuned_squad_en_4.0.0_3.0_1654726443115.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_tucan9389_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_tucan9389_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_tucan9389").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_tucan9389_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/tucan9389/distilbert-base-uncased-finetuned-squad --- layout: model title: Detect Stock Markets in texts author: John Snow Labs name: finner_wiki_stockexchange date: 2023-01-15 tags: [stock, exchange, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an NER model aimed at detecting Stock Exchange / Stock Market names or abbreviations. It was trained on Wikipedia texts about companies. ## Predicted Entities `STOCK_EXCHANGE`, `O` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_wiki_stockexchange_en_1.0.0_3.0_1673796187398.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_wiki_stockexchange_en_1.0.0_3.0_1673796187398.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from johnsnowlabs import nlp, finance import pyspark.sql.functions as F documenter = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencizer = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512) chunks = finance.NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") ner = finance.NerModel.pretrained("finner_wiki_stockexchange", "en", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") pipe = nlp.Pipeline(stages=[documenter, sentencizer, tokenizer, embeddings, ner, chunks]) model = pipe.fit(df) res = model.transform(df) res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.begin, res.ner_chunk.end, res.ner_chunk.metadata)).alias("cols")) \ .select(F.expr("cols['3']['sentence']").alias("sentence_id"), F.expr("cols['0']").alias("chunk"), F.expr("cols['2']").alias("end"), F.expr("cols['3']['entity']").alias("ner_label"))\ .filter("ner_label!='O'")\ .show(truncate=False) ```
## Results ```bash +-----------+------+---+--------------+ |sentence_id|chunk |end|ner_label | +-----------+------+---+--------------+ |0 |NASDAQ|126|STOCK_EXCHANGE| +-----------+------+---+--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_wiki_stockexchange| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|1.1 MB| ## References Wikipedia ## Benchmarking ```bash label tp fp fn prec rec f1 I-STOCK_EXCHANGE 21 0 0 1.0 1.0 1.0 B-STOCK_EXCHANGE 18 1 0 0.94736844 1.0 0.972973 Macro-average 39 1 0 0.9736842 1.0 0.9866667 Micro-average 39 1 0 0.975 1.0 0.98734176 ``` --- layout: model title: English T5ForConditionalGeneration Cased model (from ceshine) author: John Snow Labs name: t5_paraphrase_paws_msrp_opinosis date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-paraphrase-paws-msrp-opinosis` is an English model originally trained by `ceshine`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_paraphrase_paws_msrp_opinosis_en_4.3.0_3.0_1675125013995.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_paraphrase_paws_msrp_opinosis_en_4.3.0_3.0_1675125013995.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_paraphrase_paws_msrp_opinosis","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_paraphrase_paws_msrp_opinosis","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_paraphrase_paws_msrp_opinosis| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|923.5 MB| ## References - https://huggingface.co/ceshine/t5-paraphrase-paws-msrp-opinosis - https://github.com/ceshine/finetuning-t5/tree/master/paraphrase --- layout: model title: ELMo Embeddings author: John Snow Labs name: elmo date: 2020-01-31 task: Embeddings language: en nav_key: models edition: Spark NLP 2.4.0 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: ElmoEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Computes contextualized word representations using character-based word representations and bidirectional LSTMs. This model outputs fixed embeddings at each LSTM layer and a learnable aggregation of the 3 layers. * `word_emb`: the character-based word representations with shape [batch_size, max_length, 512]. * `lstm_outputs1`: the first LSTM hidden state with shape [batch_size, max_length, 1024]. * `lstm_outputs2`: the second LSTM hidden state with shape [batch_size, max_length, 1024]. * `elmo`: the weighted sum of the 3 layers, where the weights are trainable. This tensor has shape [batch_size, max_length, 1024]. The complex architecture achieves state of the art results on several benchmarks. Note that this is a very computationally expensive module compared to word embedding modules that only perform embedding lookups. The use of an accelerator is recommended. The details are described in the paper "[Deep contextualized word representations](https://arxiv.org/abs/1802.05365)". 
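The learnable aggregation of the three layers can be pictured as a softmax-weighted sum. The sketch below is a minimal pure-Python illustration (not Spark NLP code): the weight vector `s` and scalar `gamma` stand in for the parameters ELMo learns, and the two-dimensional vectors are toy values in place of the real 1024-dimensional layer outputs.

```python
import math

def elmo_aggregate(layers, s, gamma=1.0):
    """Weighted sum of per-layer representations, as in the ELMo paper:
    elmo = gamma * sum_j softmax(s)_j * layer_j.
    `layers` is a list of equally sized vectors, one per layer."""
    exps = [math.exp(w) for w in s]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax over the layer weights
    dim = len(layers[0])
    return [gamma * sum(weights[j] * layers[j][i] for j in range(len(layers)))
            for i in range(dim)]

# Toy example: with equal weights the result is the plain average of the layers.
layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(elmo_aggregate(layers, s=[0.0, 0.0, 0.0]))  # ≈ [3.0, 4.0]
```

During training the `s` weights shift toward whichever layers help the downstream task; `setPoolingLayer("elmo")` in the snippets below selects this aggregated output rather than a single fixed layer.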
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/elmo_en_2.4.0_2.4_1580488815299.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/elmo_en_2.4.0_2.4_1580488815299.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = ElmoEmbeddings.pretrained("elmo", "en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") \ .setPoolingLayer("elmo") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = ElmoEmbeddings.pretrained("elmo", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") .setPoolingLayer("elmo") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.elmo').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash en_embed_elmo_embeddings token [0.6922717690467834, -0.32613131403923035, 0.2... I [-0.7348151206970215, -0.09645576030015945, -0... love [0.16370956599712372, 0.1217058077454567, -0.0... NLP ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|elmo| |Type:|embeddings| |Compatibility:|Spark NLP 2.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|en| |Dimension:|512 or 1024| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/google/elmo/3](https://tfhub.dev/google/elmo/3) --- layout: model title: English BertForQuestionAnswering model (from andi611) author: John Snow Labs name: bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_Pwhatisthe_conll2003_with_neg_with_repeat date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-whole-word-masking-squad2-with-ner-Pwhatisthe-conll2003-with-neg-with-repeat` is an English model originally trained by `andi611`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_Pwhatisthe_conll2003_with_neg_with_repeat_en_4.0.0_3.0_1654537369079.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_Pwhatisthe_conll2003_with_neg_with_repeat_en_4.0.0_3.0_1654537369079.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_Pwhatisthe_conll2003_with_neg_with_repeat","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_Pwhatisthe_conll2003_with_neg_with_repeat","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_conll.bert.large_uncased_pwhatisthe.by_andi611").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_Pwhatisthe_conll2003_with_neg_with_repeat| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/andi611/bert-large-uncased-whole-word-masking-squad2-with-ner-Pwhatisthe-conll2003-with-neg-with-repeat --- layout: model title: Fast Neural Machine Translation Model from English to Greek Languages author: John Snow Labs name: opus_mt_en_grk date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, grk, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `en` - target languages: `grk` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_grk_xx_2.7.0_2.4_1609163504692.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_grk_xx_2.7.0_2.4_1609163504692.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_grk", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_grk", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.grk').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_grk| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from baru98) author: John Snow Labs name: distilbert_qa_baru98_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `baru98`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_baru98_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770148875.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_baru98_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770148875.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_baru98_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_baru98_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_baru98_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/baru98/distilbert-base-uncased-finetuned-squad --- layout: model title: Detect Clinical Entities in Romanian (Bert, Base, Cased) author: John Snow Labs name: ner_clinical_bert date: 2022-11-22 tags: [licensed, clinical, ro, ner, bert] task: Named Entity Recognition language: ro edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract clinical entities from Romanian clinical texts. This model is trained using `bert_base_cased` embeddings. ## Predicted Entities `Measurements`, `Form`, `Symptom`, `Route`, `Procedure`, `Disease_Syndrome_Disorder`, `Score`, `Drug_Ingredient`, `Pulse`, `Frequency`, `Date`, `Body_Part`, `Drug_Brand_Name`, `Time`, `Direction`, `Medical_Device`, `Imaging_Technique`, `Test`, `Imaging_Findings`, `Imaging_Test`, `Test_Result`, `Weight`, `Clinical_Dept`, `Units` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_bert_ro_4.2.2_3.0_1669124033852.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_bert_ro_4.2.2_3.0_1669124033852.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical_bert","ro","clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter]) data = spark.createDataFrame([[""" Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. 
Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min."""]]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_clinical_bert", "ro", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ro.embed.clinical.bert.base_cased").predict(""" Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp.
Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.""") ```
## Results ```bash +--------------------------+-------------------------+ |chunks |entities | +--------------------------+-------------------------+ |Angio CT cardio-toracic |Imaging_Test | |Atrezie |Disease_Syndrome_Disorder| |valva pulmonara |Body_Part | |Hipoplazie |Disease_Syndrome_Disorder| |VS |Body_Part | |Atrezie |Disease_Syndrome_Disorder| |VAV stang |Body_Part | |Anastomoza Glenn |Disease_Syndrome_Disorder| |Tromboza |Disease_Syndrome_Disorder| |Sectia Clinica Cardiologie|Clinical_Dept | |GE Revolution HD |Medical_Device | |Branula albastra |Medical_Device | |membrului superior drept |Body_Part | |Scout |Body_Part | |30 ml |Dosage | |Iomeron 350 |Drug_Ingredient | |2.2 ml/s |Dosage | |20 ml |Dosage | |ser fiziologic |Drug_Ingredient | |angio-CT |Imaging_Test | +--------------------------+-------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical_bert| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ro| |Size:|16.3 MB| ## Benchmarking ```bash label precision recall f1-score support Body_Part 0.91 0.93 0.92 679 Clinical_Dept 0.68 0.65 0.67 97 Date 0.99 0.99 0.99 87 Direction 0.66 0.76 0.70 50 Disease_Syndrome_Disorder 0.73 0.76 0.74 121 Dosage 0.78 1.00 0.87 38 Drug_Ingredient 0.90 0.94 0.92 48 Form 1.00 1.00 1.00 6 Imaging_Findings 0.86 0.82 0.84 201 Imaging_Technique 0.92 0.92 0.92 26 Imaging_Test 0.93 0.98 0.95 205 Measurements 0.71 0.69 0.70 214 Medical_Device 0.85 0.81 0.83 42 Pulse 0.82 1.00 0.90 9 Route 1.00 0.91 0.95 33 Score 1.00 0.98 0.99 41 Time 1.00 1.00 1.00 28 Units 0.60 0.93 0.73 88 Weight 0.82 1.00 0.90 9 micro-avg 0.84 0.87 0.86 2037 macro-avg 0.70 0.74 0.72 2037 weighted-avg 0.84 0.87 0.85 2037 ``` --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4CHEMD_Chem_Original_PubMedBERT_384 
date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Chem-Original-PubMedBERT-384` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Original_PubMedBERT_384_en_4.0.0_3.0_1657108669858.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Original_PubMedBERT_384_en_4.0.0_3.0_1657108669858.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Original_PubMedBERT_384","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Original_PubMedBERT_384","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4CHEMD_Chem_Original_PubMedBERT_384| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4CHEMD-Chem-Original-PubMedBERT-384 --- layout: model title: Mapping Diseases from the KEGG Database to Their Corresponding Categories, Descriptions and Clinical Vocabularies author: John Snow Labs name: kegg_disease_mapper date: 2022-11-18 tags: [disease, category, description, icd10, icd11, mesh, brite, en, clinical, chunk_mapper, licensed] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps diseases with their corresponding `category`, `description`, `icd10_code`, `icd11_code`, `mesh_code`, and hierarchical `brite_code`. This model was trained with the data from the KEGG database. ## Predicted Entities `category`, `description`, `icd10_code`, `icd11_code`, `mesh_code`, `brite_code` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/kegg_disease_mapper_en_4.2.2_3.0_1668794743905.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/kegg_disease_mapper_en_4.2.2_3.0_1668794743905.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_diseases", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

chunkerMapper = ChunkMapperModel.pretrained("kegg_disease_mapper", "en", "clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("mappings")\
    .setRels(["description", "category", "icd10_code", "icd11_code", "mesh_code", "brite_code"])

pipeline = Pipeline().setStages([
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    ner,
    converter,
    chunkerMapper])

text = "A 55-year-old female with a history of myopia, kniest dysplasia and prostate cancer. She was on glipizide , and dapagliflozin for congenital nephrogenic diabetes insipidus."

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_diseases", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val chunkerMapper = ChunkMapperModel.pretrained("kegg_disease_mapper", "en", "clinical/models")
    .setInputCols("ner_chunk")
    .setOutputCol("mappings")
    .setRels(Array("description", "category", "icd10_code", "icd11_code", "mesh_code", "brite_code"))

val pipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    ner,
    converter,
    chunkerMapper))

val text = "A 55-year-old female with a history of myopia, kniest dysplasia and prostate cancer. She was on glipizide , and dapagliflozin for congenital nephrogenic diabetes insipidus."

val data = Seq(text).toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.kegg_disease").predict("""A 55-year-old female with a history of myopia, kniest dysplasia and prostate cancer. She was on glipizide , and dapagliflozin for congenital nephrogenic diabetes insipidus.""")
```
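The mapper emits one (chunk, relation, value) triple per relation in the `mappings` column; the Results table below is essentially a pivot of those triples into one row per chunk. A pure-Python sketch of that reshaping, using made-up triples in the same shape:

```python
# Illustrative only: pivot (chunk, relation, value) triples -- the flattened
# shape of the ChunkMapperModel output -- into one row per detected chunk.
triples = [
    ("myopia", "category", "Nervous system disease"),
    ("myopia", "icd10_code", "H52.1"),
    ("prostate cancer", "category", "Cancer"),
    ("prostate cancer", "icd10_code", "C61"),
]

rows = {}
for chunk, relation, value in triples:
    rows.setdefault(chunk, {})[relation] = value

print(rows["myopia"]["icd10_code"])            # H52.1
print(rows["prostate cancer"]["category"])     # Cancer
```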
## Results ```bash +-----------------------------------------+--------------------------------------------------+-----------------------+----------+----------+---------+-----------------------+ | ner_chunk| description| category|icd10_code|icd11_code|mesh_code| brite_code| +-----------------------------------------+--------------------------------------------------+-----------------------+----------+----------+---------+-----------------------+ | myopia|Myopia is the most common ocular disorder world...| Nervous system disease| H52.1| 9D00.0| D009216| 08402,08403| | kniest dysplasia|Kniest dysplasia is an autosomal dominant chond...|Congenital malformation| Q77.7| LD24.3| C537207| 08402,08403| | prostate cancer|Prostate cancer constitutes a major health prob...| Cancer| C61| 2C82| NONE|08402,08403,08442,08441| |congenital nephrogenic diabetes insipidus|Nephrogenic diabetes insipidus (NDI) is charact...| Urinary system disease| N25.1| GB90.4A| D018500| 08402,08403| +-----------------------------------------+--------------------------------------------------+-----------------------+----------+----------+---------+-----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|kegg_disease_mapper| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|595.6 KB| --- layout: model title: Multilingual BertForQuestionAnswering model (from salti) author: John Snow Labs name: bert_qa_salti_bert_base_multilingual_cased_finetuned_squad date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and 
production-readiness using Spark NLP. `bert-base-multilingual-cased-finetuned-squad` is a Multilingual model originally trained by `salti`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_salti_bert_base_multilingual_cased_finetuned_squad_xx_4.0.0_3.0_1654180181226.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_salti_bert_base_multilingual_cased_finetuned_squad_xx_4.0.0_3.0_1654180181226.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_salti_bert_base_multilingual_cased_finetuned_squad","xx") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_salti_bert_base_multilingual_cased_finetuned_squad","xx")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("xx.answer_question.squad.bert.multilingual_base_cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
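Extractive QA models like this one score every token position as a potential answer start and end, and the predicted answer is the highest-scoring valid (start ≤ end) span. A minimal pure-Python sketch of that selection step, with toy scores rather than real model output:

```python
# Toy illustration of extractive-QA span selection: pick the (start, end)
# pair with the highest combined score, subject to start <= end.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end_scores   = [0.1, 0.1, 0.1, 4.0, 0.2, 0.1, 0.1, 0.1, 0.5, 0.1]

best = max(
    ((s, e) for s in range(len(tokens)) for e in range(s, len(tokens))),
    key=lambda span: start_scores[span[0]] + end_scores[span[1]],
)
print(" ".join(tokens[best[0]:best[1] + 1]))  # Clara
```

In the real annotator this happens over subword logits inside the model; the sketch only shows why the `answer` output is a contiguous substring of the context.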
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_salti_bert_base_multilingual_cased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|xx|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/salti/bert-base-multilingual-cased-finetuned-squad
- https://wandb.ai/salti/mBERT_QA/runs/wkqzhrp2
---
layout: model
title: English BertForQuestionAnswering model (from aodiniz)
author: John Snow Labs
name: bert_qa_bert_uncased_L_10_H_512_A_8_squad2_covid_qna
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-10_H-512_A-8_squad2_covid-qna` is an English model originally trained by `aodiniz`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_10_H_512_A_8_squad2_covid_qna_en_4.0.0_3.0_1654185208664.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_10_H_512_A_8_squad2_covid_qna_en_4.0.0_3.0_1654185208664.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_10_H_512_A_8_squad2_covid_qna","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_bert_uncased_L_10_H_512_A_8_squad2_covid_qna","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2_covid.bert.uncased_10l_512d_a8a_512d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_uncased_L_10_H_512_A_8_squad2_covid_qna|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|178.3 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/aodiniz/bert_uncased_L-10_H-512_A-8_squad2_covid-qna
---
layout: model
title: Turkish BertForQuestionAnswering model (from oguzhanolm)
author: John Snow Labs
name: bert_qa_loodos_bert_base_uncased_QA_fine_tuned
date: 2022-06-02
tags: [tr, open_source, question_answering, bert]
task: Question Answering
language: tr
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `loodos-bert-base-uncased-QA-fine-tuned` is a Turkish model originally trained by `oguzhanolm`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_loodos_bert_base_uncased_QA_fine_tuned_tr_4.0.0_3.0_1654188187941.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_loodos_bert_base_uncased_QA_fine_tuned_tr_4.0.0_3.0_1654188187941.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_loodos_bert_base_uncased_QA_fine_tuned","tr") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_loodos_bert_base_uncased_QA_fine_tuned","tr")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("tr.answer_question.bert.base_uncased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_loodos_bert_base_uncased_QA_fine_tuned|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|tr|
|Size:|412.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/oguzhanolm/loodos-bert-base-uncased-QA-fine-tuned
- https://github.com/TQuad/turkish-nlp-qa-dataset
- https://paperswithcode.com/sota?task=Question+Answering&dataset=TQuAD
---
layout: model
title: English RobertaForQuestionAnswering Large Cased model (from navteca)
author: John Snow Labs
name: roberta_qa_navteca_large_squad2
date: 2022-12-02
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-squad2` is an English model originally trained by `navteca`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_navteca_large_squad2_en_4.2.4_3.0_1669988103804.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_navteca_large_squad2_en_4.2.4_3.0_1669988103804.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_navteca_large_squad2","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_navteca_large_squad2","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_navteca_large_squad2|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/navteca/roberta-large-squad2
- https://rajpurkar.github.io/SQuAD-explorer/
---
layout: model
title: RE Pipeline between Dates and Clinical Entities
author: John Snow Labs
name: re_date_clinical_pipeline
date: 2022-03-31
tags: [licensed, clinical, relation_extraction, date, en]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [re_date_clinical](https://nlp.johnsnowlabs.com/2021/01/18/re_date_clinical_en.html) model.

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb#scrollTo=d5wI11ptz7hi){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_date_clinical_pipeline_en_3.4.1_3.0_1648734471721.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_date_clinical_pipeline_en_3.4.1_3.0_1648734471721.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("re_date_clinical_pipeline", "en", "clinical/models") pipeline.fullAnnotate("This 73 y/o patient had CT on 1/12/95, with progressive memory and cognitive decline since 8/11/94.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("re_date_clinical_pipeline", "en", "clinical/models") pipeline.fullAnnotate("This 73 y/o patient had CT on 1/12/95, with progressive memory and cognitive decline since 8/11/94.") ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.date_clinical.pipeline").predict("""This 73 y/o patient had CT on 1/12/95, with progressive memory and cognitive decline since 8/11/94.""") ```
## Results

```bash
|   | relations | entity1 | entity1_begin | entity1_end | chunk1                                   | entity2 | entity2_begin | entity2_end | chunk2  | confidence |
|---|-----------|---------|---------------|-------------|------------------------------------------|---------|---------------|-------------|---------|------------|
| 0 | 1         | Test    | 24            | 25          | CT                                       | Date    | 31            | 37          | 1/12/95 | 1.0        |
| 1 | 1         | Symptom | 45            | 84          | progressive memory and cognitive decline | Date    | 92            | 98          | 8/11/94 | 1.0        |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|re_date_clinical_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|

## Included Models

- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- PerceptronModel
- DependencyParserModel
- RelationExtractionModel
---
layout: model
title: Legal Shareholder Services Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_shareholder_services_agreement_bert
date: 2023-01-26
tags: [en, legal, classification, shareholder, agreement, licensed, bert, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The `legclf_shareholder_services_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `shareholder-services-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time.
## Predicted Entities `shareholder-services-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_shareholder_services_agreement_bert_en_1.0.0_3.0_1674735062182.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_shareholder_services_agreement_bert_en_1.0.0_3.0_1674735062182.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_shareholder_services_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+--------------------------------+
|result                          |
+--------------------------------+
|[shareholder-services-agreement]|
|[other]                         |
|[other]                         |
|[shareholder-services-agreement]|
+--------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_shareholder_services_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.3 MB|

## References

Legal documents, scraped from the Internet, and classified in-house + SEC documents

## Benchmarking

```bash
                         label  precision  recall  f1-score  support
                         other       0.99    0.99      0.99      116
shareholder-services-agreement       0.98    0.98      0.98       48
                      accuracy          -       -      0.99      164
                     macro-avg       0.99    0.99      0.99      164
                  weighted-avg       0.99    0.99      0.99      164
```
---
layout: model
title: Smaller BERT Embeddings (L-4_H-768_A-12)
author: John Snow Labs
name: small_bert_L4_768
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L4_768_en_2.6.0_2.4_1598345024690.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L4_768_en_2.6.0_2.4_1598345024690.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L4_768", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L4_768", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L4_768').predict(text, output_level='token') embeddings_df ```
{:.h2_title}
## Results

```bash
token   en_embed_bert_small_L4_768_embeddings
I       [0.14908359944820404, -0.06654861569404602, 0....
love    [0.9139627814292908, 0.2444770336151123, 0.952...
NLP     [1.1467561721801758, -0.11340214312076569, 1.1...
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|small_bert_L4_768|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|[en]|
|Dimension:|768|
|Case sensitive:|false|

{:.h2_title}
## Data Source

The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-768_A-12/1
---
layout: model
title: Legal Maintenance Clause Binary Classifier
author: John Snow Labs
name: legclf_maintenance_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `maintenance` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `maintenance`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_maintenance_clause_en_1.0.0_3.2_1660123713462.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_maintenance_clause_en_1.0.0_3.2_1660123713462.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_maintenance_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
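The splitting advice above — paragraph splitting by multiline, keeping each piece within the 512-token embedding limit — can be sketched in plain Python. Note the assumptions: whitespace tokens are only a rough stand-in for the model's subword tokenizer, and the 512 budget is applied as a hard wrap:

```python
import re

def split_paragraphs(text, max_tokens=512):
    """Split text on blank lines, then hard-wrap any paragraph over the token budget.

    Whitespace tokenization only approximates the BERT tokenizer's subword count,
    so in practice a safety margin below 512 is advisable.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    pieces = []
    for p in paragraphs:
        words = p.split()
        for i in range(0, len(words), max_tokens):
            pieces.append(" ".join(words[i:i + max_tokens]))
    return pieces

doc = "Clause 1. Maintenance duties...\n\nClause 2. Term and termination..."
print(split_paragraphs(doc))
# ['Clause 1. Maintenance duties...', 'Clause 2. Term and termination...']
```

Each returned piece can then be fed as a separate row of the `clause_text` column in the pipeline above.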
## Results

```bash
+-------------+
|result       |
+-------------+
|[maintenance]|
|[other]      |
|[other]      |
|[maintenance]|
+-------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_maintenance_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.1 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
 maintenance       0.93    0.91      0.92       95
       other       0.96    0.97      0.96      201
    accuracy          -       -      0.95      296
   macro-avg       0.95    0.94      0.94      296
weighted-avg       0.95    0.95      0.95      296
```
---
layout: model
title: Smaller BERT Sentence Embeddings (L-2_H-256_A-4)
author: John Snow Labs
name: sent_small_bert_L2_256
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertSentenceEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L2_256_en_2.6.0_2.4_1598350372298.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L2_256_en_2.6.0_2.4_1598350372298.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_256", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer'], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_256", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L2_256').predict(text, output_level='sentence') embeddings_df ```
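The `sentence_embeddings` column produced above holds one fixed-size 256-dimensional vector per sentence. A common downstream use is semantic similarity; as a minimal illustration in plain Python (no Spark required; the toy vectors below stand in for real embeddings), cosine similarity between two embedding vectors can be computed as:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# identical direction -> 1.0, orthogonal -> 0.0
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

In practice you would apply this to the vectors extracted from the `sentence_embeddings` annotations in the transformed DataFrame.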
{:.h2_title} ## Results ```bash en_embed_sentence_small_bert_L2_256_embeddings sentence [-1.0944892168045044, -1.8199821710586548, 1.4... I hate cancer [-0.8097536563873291, -1.0587245225906372, 1.2... Antibiotics aren't painkiller ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L2_256| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|256| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-256_A-4/1 --- layout: model title: Swahili XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_wolof_finetuned_ner_swahili date: 2022-08-01 tags: [sw, open_source, xlm_roberta, ner] task: Named Entity Recognition language: sw edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-wolof-finetuned-ner-swahili` is a Swahili model originally trained by `mbeukman`. ## Predicted Entities `DATE`, `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_wolof_finetuned_ner_swahili_sw_4.1.0_3.0_1659356044720.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_wolof_finetuned_ner_swahili_sw_4.1.0_3.0_1659356044720.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_wolof_finetuned_ner_swahili","sw") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_wolof_finetuned_ner_swahili","sw") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
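The `NerConverter` stage above groups token-level IOB tags (`B-PER`, `I-PER`, `O`, …) into entity chunks. For intuition only, the grouping it performs is roughly equivalent to this plain-Python sketch (the real annotator handles more edge cases and carries character offsets):

```python
def bio_to_chunks(tokens, tags):
    """Group parallel token/IOB-tag lists into (chunk_text, label) pairs."""
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = [token, tag[2:]]
        elif tag.startswith("I-") and current and tag[2:] == current[1]:
            current[0] += " " + token  # continue the open chunk
        else:
            if current:
                chunks.append(current)
                current = None
    if current:
        chunks.append(current)
    return [tuple(c) for c in chunks]

# e.g. ["John", "Smith", "lives", "in", "Nairobi"]
#      ["B-PER", "I-PER", "O", "O", "B-LOC"]
# -> [("John Smith", "PER"), ("Nairobi", "LOC")]
```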
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_wolof_finetuned_ner_swahili| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|sw| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-wolof-finetuned-ner-swahili - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://github.com/masakhane-io/masakhane-ner --- layout: model title: Ukrainian Named Entity Recognition (from ukr-models) author: John Snow Labs name: xlmroberta_ner_uk_ner date: 2022-05-17 tags: [xlm_roberta, ner, token_classification, uk, open_source] task: Named Entity Recognition language: uk edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `uk-ner` is a Ukrainian model originally trained by `ukr-models`. ## Predicted Entities `LOC`, `ORG`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_uk_ner_uk_3.4.2_3.0_1652808897121.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_uk_ner_uk_3.4.2_3.0_1652808897121.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_uk_ner","uk") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Я люблю Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_uk_ner","uk") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Я люблю Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_uk_ner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|uk| |Size:|406.3 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ukr-models/uk-ner - https://pypi.org/project/tokenize_uk/ --- layout: model title: English asr_liepa_lithuanian TFWav2Vec2ForCTC from birgermoell author: John Snow Labs name: pipeline_asr_liepa_lithuanian date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_liepa_lithuanian` is an English model originally trained by birgermoell. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_liepa_lithuanian_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_liepa_lithuanian_en_4.2.0_3.0_1664111880638.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_liepa_lithuanian_en_4.2.0_3.0_1664111880638.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_liepa_lithuanian', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_liepa_lithuanian", lang = "en") val annotations = pipeline.transform(audioDF) ```
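The `transform` call above expects an `audioDF` whose audio column contains samples as floats. As a hedged sketch (the exact column name and schema may vary with your Spark NLP version, and the 44-byte WAV header offset assumes a canonical PCM WAV file), raw 16-bit PCM bytes can be normalized to the expected float range like this, with the Spark-side assembly shown in comments:

```python
import struct

def pcm16_to_floats(pcm_bytes):
    """Decode little-endian signed 16-bit PCM into floats in [-1.0, 1.0)."""
    n = len(pcm_bytes) // 2
    samples = struct.unpack("<%dh" % n, pcm_bytes[: n * 2])
    return [s / 32768.0 for s in samples]

# Spark-side sketch (assumes an active session with Spark NLP started):
# floats = pcm16_to_floats(open("speech.wav", "rb").read()[44:])
# audioDF = spark.createDataFrame([[floats]]).toDF("audio_content")
```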
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_liepa_lithuanian| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|228.1 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Detect Radiology Related Entities author: John Snow Labs name: ner_radiology date: 2021-01-18 task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Healthcare 2.7.0 spark_version: 2.4 tags: [en, ner, licensed, clinical] supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for radiology related texts and reports. ## Predicted Entities `ImagingTest`, `Imaging_Technique`, `ImagingFindings`, `OtherFindings`, `BodyPart`, `Direction`, `Test`, `Symptom`, `Disease_Syndrome_Disorder`, `Medical_Device`, `Procedure`, `Measurements`, `Units` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_RADIOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_radiology_en_2.7.0_2.4_1610995075088.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_radiology_en_2.7.0_2.4_1610995075088.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") radiology_ner = NerDLModel.pretrained("ner_radiology", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("entities") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, radiology_ner, ner_converter]) data = spark.createDataFrame([["Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. 
This may represent benign fibrous tissue or a lipoma."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val radiology_ner = NerDLModel.pretrained("ner_radiology", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("entities") val nlpPipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, radiology_ner, ner_converter)) val data = Seq("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""").toDS.toDF("text") val result = nlpPipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.radiology").predict("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""") ```
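Once the pipeline has run, the `entities` column holds labeled chunks such as `0.5 x 0.5 x 0.4` (`Measurements`) immediately followed by `cm` (`Units`). If you need the two joined for reporting, a simple post-processing pass over the extracted (text, label) pairs can stitch each `Units` chunk onto the preceding `Measurements` chunk — a sketch only, with the combined label `Measurement+Unit` chosen here for illustration:

```python
def merge_measurements(chunks):
    """Attach each Units chunk to the Measurements chunk directly before it."""
    merged, i = [], 0
    while i < len(chunks):
        text, label = chunks[i]
        if (label == "Measurements" and i + 1 < len(chunks)
                and chunks[i + 1][1] == "Units"):
            merged.append((text + " " + chunks[i + 1][0], "Measurement+Unit"))
            i += 2  # consume the Units chunk as well
        else:
            merged.append((text, label))
            i += 1
    return merged
```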
## Results ```bash | | chunks | entities | |----|-----------------------|---------------------------| | 0 | Bilateral | Direction | | 1 | breast | BodyPart | | 2 | ultrasound | ImagingTest | | 3 | ovoid mass | ImagingFindings | | 4 | 0.5 x 0.5 x 0.4 | Measurements | | 5 | cm | Units | | 6 | anteromedial aspect | Direction | | 7 | left | Direction | | 8 | shoulder | BodyPart | | 9 | mass | ImagingFindings | | 10 | isoechoic echotexture | ImagingFindings | | 11 | muscle | BodyPart | | 12 | internal color flow | ImagingFindings | | 13 | benign fibrous tissue | ImagingFindings | | 14 | lipoma | Disease_Syndrome_Disorder | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_radiology| |Type:|ner| |Compatibility:|Spark NLP 2.7.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Dependencies:|embeddings_clinical| ## Data Source Trained on a custom dataset comprising MIMIC-CXR and MT Radiology texts ## Benchmarking ```bash label tp fp fn total precision recall f1 OtherFindings 8.0 15.0 63.0 71.0 0.3478 0.1127 0.1702 Measurements 481.0 30.0 15.0 496.0 0.9413 0.9698 0.9553 Direction 650.0 137.0 94.0 744.0 0.8259 0.8737 0.8491 ImagingFindings 1345.0 355.0 324.0 1669.0 0.7912 0.8059 0.7985 BodyPart 1942.0 335.0 290.0 2232.0 0.8529 0.8701 0.8614 Medical_Device 236.0 75.0 64.0 300.0 0.7588 0.7867 0.7725 Test 222.0 41.0 48.0 270.0 0.8441 0.8222 0.833 Procedure 269.0 117.0 116.0 385.0 0.6969 0.6987 0.6978 ImagingTest 263.0 50.0 43.0 306.0 0.8403 0.8595 0.8498 Symptom 498.0 101.0 132.0 630.0 0.8314 0.7905 0.8104 Disease_Syndrome_... 
1180.0 258.0 200.0 1380.0 0.8206 0.8551 0.8375 Units 269.0 10.0 2.0 271.0 0.9642 0.9926 0.9782 Imaging_Technique 140.0 38.0 25.0 165.0 0.7865 0.8485 0.8163 macro - - - - - - 0.7524 micro - - - - - - 0.8315 ``` --- layout: model title: Translate English to Kirundi Pipeline author: John Snow Labs name: translate_en_rn date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, rn, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `rn` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_rn_xx_2.7.0_2.4_1609699240949.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_rn_xx_2.7.0_2.4_1609699240949.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_rn", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_rn", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.rn').predict(text, output_level='sentence') translate_df ```
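Because translation cost grows quickly with sequence length, long inputs are best fed to `annotate` one sentence at a time. The pipeline contains its own sentence detection, so the naive pre-splitter below is purely illustrative of the idea:

```python
import re

def naive_sentence_split(text):
    """Split text on sentence-final punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# for sent in naive_sentence_split(long_document):
#     pipeline.annotate(sent)
```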
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_rn| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Hindi asr_Wav2Vec2_xls_r_300m_final TFWav2Vec2ForCTC from LegolasTheElf author: John Snow Labs name: asr_Wav2Vec2_xls_r_300m_final date: 2022-09-26 tags: [wav2vec2, hi, audio, open_source, asr] task: Automatic Speech Recognition language: hi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Wav2Vec2_xls_r_300m_final` is a Hindi model originally trained by LegolasTheElf. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_Wav2Vec2_xls_r_300m_final_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Wav2Vec2_xls_r_300m_final_hi_4.2.0_3.0_1664193590997.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Wav2Vec2_xls_r_300m_final_hi_4.2.0_3.0_1664193590997.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_Wav2Vec2_xls_r_300m_final", "hi")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_Wav2Vec2_xls_r_300m_final", "hi") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
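Wav2Vec2 models are typically trained on 16 kHz audio, so input sampled at other rates should be resampled before it reaches the pipeline. Dedicated audio libraries (e.g. librosa) do this properly; the linear-interpolation sketch below only shows the idea in plain Python:

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono signal by linear interpolation (illustrative only)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        # position of output sample i in the source signal
        pos = i * (len(samples) - 1) / max(n_out - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```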
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_Wav2Vec2_xls_r_300m_final| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|hi| |Size:|1.2 GB| --- layout: model title: Telugu Word Embeddings (DistilBERT) author: John Snow Labs name: distilbert_uncased date: 2021-12-14 tags: [open_source, embeddings, te, distilbert, telugu] task: Embeddings language: te edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a DistilBERT language model pre-trained on a ~2 GB monolingual training corpus. The pre-training data was taken mostly from OSCAR. This model can be fine-tuned on various downstream tasks like text classification, POS-tagging, question-answering, etc. Embeddings from this model can also be used for feature-based training. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_uncased_te_3.1.0_3.0_1639472349482.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_uncased_te_3.1.0_3.0_1639472349482.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentence_detector = SentenceDetector() \ .setInputCols(['document'])\ .setOutputCol('sentence') tokenizer = Tokenizer()\ .setInputCols(['sentence']) \ .setOutputCol('token') distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_uncased", "te")\ .setInputCols(["sentence",'token'])\ .setOutputCol("embeddings")\ .setCaseSensitive(False) nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, distilbert_loaded]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['ఆంగ్ల పఠన పేరాల యొక్క గొప్ప మూలం కోసం చూస్తున్నారా? మీరు సరైన స్థలానికి వచ్చారు.']], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_uncased", "te") .setInputCols("sentence","token") .setOutputCol("embeddings") .setCaseSensitive(false) val nlp_pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, distilbert_loaded)) val data = Seq("ఆంగ్ల పఠన పేరాల యొక్క గొప్ప మూలం కోసం చూస్తున్నారా? మీరు సరైన స్థలానికి వచ్చారు.").toDF("text") val result = nlp_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("te.embed.distilbert").predict("""ఆంగ్ల పఠన పేరాల యొక్క గొప్ప మూలం కోసం చూస్తున్నారా? మీరు సరైన స్థలానికి వచ్చారు.""") ```
## Results ```bash | Token | Embeddings | |---------- |---------------------------------------------------------------------- | | ఆంగ్ల | [1.3526772, 0.66792506, 0.80145407, -1.7625582, 1.222954, ...] | | పఠన | [2.0163336, -0.11855277, -0.71146804, -0.3959106, -0.63389313, ...] | | పేరాల | [1.0630535, 0.2587409, 0.09540532, -0.048794597, 1.0124478, ...] | | యొక్క | [1.4005142, 0.43655983, 0.5112152, -0.9843408, 0.9581941, ...] | | గొప్ప | [1.6955082, 0.40451798, 0.8449157, -1.0998198, 0.80302626, ...] | | మూలం | [2.0383484, 0.90390867, -0.52174926, -0.637539, 0.29188454, ...] | | కోసం | [1.3596793, 1.0218208, 0.26274702, -0.2437865, 0.50547075, ...] | | చూస్తున్నారా | [1.4825231, 0.6084269, 1.5597858, -1.0951629, 0.33125773, ...] | | ? | [2.86698, -0.07081009, 0.078133255, 0.43756652, 0.05295326, ...] | | మీరు | [1.0796824, 0.35925022, 0.51510495, -0.9841369, 0.39694318, ...] | | సరైన | [1.1148729, -0.004858747, 0.041157544, -0.5826167, 0.24176109, ...] | | స్థలానికి | [1.2047833, 0.119116426, -0.039619423, -0.48747823, 0.15061232, ...] | | వచ్చారు | [1.1785411, -0.013213344, 0.14526407, -0.60479, 0.031448614, ...] | | . | [0.76072985, -1.9430697, -1.6266187, 0.46296686, -2.2197602, ...] 
| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_uncased| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|te| |Size:|249.4 MB| |Case sensitive:|false| --- layout: model title: Side Effect Classification Pipeline - Voice of the Patient author: John Snow Labs name: bert_sequence_classifier_vop_side_effect_pipeline date: 2023-06-13 tags: [pipeline, classification, side_effect, vop, clinical, en, licensed] task: Text Classification language: en edition: Healthcare NLP 4.4.3 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline includes the Medical Bert for Sequence Classification model to classify health-related text in colloquial language according to the presence or absence of mentions of side effects. The pipeline is built on top of the [bert_sequence_classifier_vop_side_effect](https://nlp.johnsnowlabs.com/2023/05/24/bert_sequence_classifier_vop_side_effect_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_side_effect_pipeline_en_4.4.3_3.2_1686700519111.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_side_effect_pipeline_en_4.4.3_3.2_1686700519111.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use This pipeline includes the Medical Bert for Sequence Classification model to classify health-related text in colloquial language according to the presence or absence of mentions of side effects.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_sequence_classifier_vop_side_effect_pipeline", "en", "clinical/models") pipeline.annotate("I felt kind of dizzy after taking that medication for a month.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_sequence_classifier_vop_side_effect_pipeline", "en", "clinical/models") val result = pipeline.annotate("I felt kind of dizzy after taking that medication for a month.") ```
## Results ```bash | text | prediction | |:---------------------------------------------------------------|:-------------| | I felt kind of dizzy after taking that medication for a month. | True | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_vop_side_effect_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|406.4 MB| ## Included Models - DocumentAssembler - TokenizerModel - MedicalBertForSequenceClassification --- layout: model title: Legal Successors Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_successors_bert date: 2023-03-05 tags: [en, legal, classification, clauses, successors, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Successors` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. 
If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Successors`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_successors_bert_en_1.0.0_3.0_1678049877265.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_successors_bert_en_1.0.0_3.0_1678049877265.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_successors_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
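As the description recommends, big documents should be split into provisions before classification. A minimal blank-line paragraph splitter is sketched below; the Legal NLP Workshop notebooks linked above cover more robust splitting strategies (headers, subheaders, etc.):

```python
import re

def split_paragraphs(text):
    """Split a document into paragraphs on blank (possibly whitespace) lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# each resulting paragraph can then become one row of the classifier's input DataFrame
```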
## Results ```bash +-------+ |result| +-------+ |[Successors]| |[Other]| |[Other]| |[Successors]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_successors_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 1.00 1.00 1.0 76 Successors 1.00 1.00 1.0 55 accuracy - - 1.0 131 macro-avg 1.00 1.00 1.0 131 weighted-avg 1.00 1.00 1.0 131 ``` --- layout: model title: English image_classifier_vit_dog_vs_chicken ViTForImageClassification from lewtun author: John Snow Labs name: image_classifier_vit_dog_vs_chicken date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_dog_vs_chicken` is an English model originally trained by lewtun. ## Predicted Entities `crispy fried chicken`, `poodle` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_dog_vs_chicken_en_4.1.0_3.0_1660167393770.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_dog_vs_chicken_en_4.1.0_3.0_1660167393770.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_dog_vs_chicken", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_dog_vs_chicken", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_dog_vs_chicken| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English DistilBertForQuestionAnswering Cased model (from Ahmed007) author: John Snow Labs name: distilbert_qa_close_book date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `close-book` is an English model originally trained by `Ahmed007`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_close_book_en_4.3.0_3.0_1672765961995.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_close_book_en_4.3.0_3.0_1672765961995.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_close_book","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_close_book","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
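The extracted span ends up in the `answer` column. A small, hypothetical sketch of pulling it out of a collected row (the dictionary shape is an assumption based on typical Spark NLP annotation output):

```python
# Hypothetical collected row: the answer column holds a list with the extracted span.
row = {"answer": ["Clara"]}

def first_answer(row):
    # Return the extracted span, or an empty string if nothing was extracted.
    return row["answer"][0] if row["answer"] else ""

print(first_answer(row))  # Clara
```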
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_close_book| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Ahmed007/close-book --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18) author: John Snow Labs name: distilbert_qa_base_uncased_becas_0 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-becas-0` is an English model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_0_en_4.3.0_3.0_1672767389948.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_0_en_4.3.0_3.0_1672767389948.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_becas_0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Evelyn18/distilbert-base-uncased-becas-0 --- layout: model title: Translate Afro-Asiatic languages to English Pipeline author: John Snow Labs name: translate_afa_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, afa, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended. - source languages: `afa` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_afa_en_xx_2.7.0_2.4_1609698877865.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_afa_en_xx_2.7.0_2.4_1609698877865.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_afa_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_afa_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.afa.translate_to.en').predict(text, output_level='sentence') translate_df ```
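`pipeline.annotate(...)` returns a dictionary of annotation lists keyed by output column. Assuming the translated sentences come back under a `translation` key (an assumption; the exact key depends on the pipeline's output column names), they can be joined back into a single document-level string:

```python
# Hypothetical annotate() output: sentence-level translations under a "translation" key.
annotations = {"translation": ["Your sentence", "to translate!"]}

def join_translation(anno):
    # Concatenate sentence-level translations into one string; empty if the key is absent.
    return " ".join(anno.get("translation", []))

print(join_translation(annotations))  # Your sentence to translate!
```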
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_afa_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Extract Clinical Entities from Voice of the Patient Documents (embeddings_clinical_large) author: John Snow Labs name: ner_vop_emb_clinical_large date: 2023-06-06 tags: [licensed, clinical, en, ner, vop, patient] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts healthcare-related terms from the documents transferred from the patient’s own sentences. ## Predicted Entities `Gender`, `Employment`, `BodyPart`, `Vaccine`, `Age`, `PsychologicalCondition`, `Form`, `Substance`, `ClinicalDept`, `Drug`, `AdmissionDischarge`, `DateTime`, `Test`, `Laterality`, `Route`, `Disease`, `VitalTest`, `Dosage`, `Duration`, `Frequency`, `Allergen`, `Symptom`, `Procedure`, `RelationshipStatus`, `HealthStatus`, `Modifier`, `MedicalDevice`, `SubstanceQuantity`, `InjuryOrPoisoning`, `TestResult`, `Treatment` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_emb_clinical_large_en_4.4.3_3.0_1686073213251.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_emb_clinical_large_en_4.4.3_3.0_1686073213251.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_emb_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. 
I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_emb_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . 
Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
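Chunks and their labels can then be paired from the `ner_chunk` column. A hedged sketch over plain Python lists (the `chunks`/`metadata` shapes are assumptions, mirroring `ner_chunk.result` and `ner_chunk.metadata` after collecting the result):

```python
# Hypothetical collected annotations: parallel lists of chunk texts and metadata dicts.
chunks = ["20 year old", "girl", "hyperthyroid"]
metadata = [{"entity": "Age"}, {"entity": "Gender"}, {"entity": "Disease"}]

def pair_chunks(chunks, metadata):
    # Zip each chunk with the entity label stored in its metadata.
    return [(c, m.get("entity")) for c, m in zip(chunks, metadata)]

for chunk, label in pair_chunks(chunks, metadata):
    print(f"{chunk}\t{label}")
```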
## Results ```bash | chunk | ner_label | |:---------------------|:-----------------------| | 20 year old | Age | | girl | Gender | | hyperthyroid | Disease | | 1 month ago | DateTime | | weak | Symptom | | panic attacks | PsychologicalCondition | | depression | PsychologicalCondition | | left | Laterality | | chest | BodyPart | | pain | Symptom | | increased | TestResult | | heart rate | VitalTest | | rapidly | Modifier | | weight loss | Symptom | | 4 months | Duration | | hospital | ClinicalDept | | discharged | AdmissionDischarge | | hospital | ClinicalDept | | blood tests | Test | | brain | BodyPart | | mri | Test | | ultrasound scan | Test | | endoscopy | Procedure | | doctors | Employment | | homeopathy doctor | Employment | | he | Gender | | hyperthyroid | Disease | | TSH | Test | | 0.15 | TestResult | | T3 | Test | | T4 | Test | | normal | TestResult | | b12 deficiency | Disease | | vitamin D deficiency | Disease | | weekly | Frequency | | supplement | Drug | | vitamin D | Drug | | 1000 mcg | Dosage | | b12 | Drug | | daily | Frequency | | homeopathy medicine | Drug | | 40 days | Duration | | 2nd test | Test | | after 30 days | DateTime | | TSH | Test | | 0.5 | TestResult | | now | DateTime | | weakness | Symptom | | depression | PsychologicalCondition | | last week | DateTime | | rapid | TestResult | | heartrate | VitalTest | | homeopathy | Treatment | | thyroid | BodyPart | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_emb_clinical_large| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.9 MB| |Dependencies:|embeddings_clinical_large| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. 
I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
## Benchmarking ```bash label tp fp fn total precision recall f1 Gender 1293 17 24 1317 0.99 0.98 0.98 Employment 1184 52 59 1243 0.96 0.95 0.96 BodyPart 2731 204 169 2900 0.93 0.94 0.94 Vaccine 40 5 2 42 0.89 0.95 0.92 Age 536 42 46 582 0.93 0.92 0.92 PsychologicalCondition 411 38 33 444 0.92 0.93 0.92 Form 248 33 18 266 0.88 0.93 0.91 Substance 394 60 27 421 0.87 0.94 0.90 ClinicalDept 291 37 35 326 0.89 0.89 0.89 Drug 1342 230 98 1440 0.85 0.93 0.89 AdmissionDischarge 28 1 6 34 0.97 0.82 0.89 DateTime 3998 579 404 4402 0.87 0.91 0.89 Test 1043 116 165 1208 0.90 0.86 0.88 Laterality 566 101 62 628 0.85 0.90 0.87 Route 41 5 7 48 0.89 0.85 0.87 Disease 1747 269 268 2015 0.87 0.87 0.87 VitalTest 151 25 21 172 0.86 0.88 0.87 Dosage 342 51 70 412 0.87 0.83 0.85 Duration 1943 302 367 2310 0.87 0.84 0.85 Frequency 910 208 169 1079 0.81 0.84 0.83 Allergen 33 1 13 46 0.97 0.72 0.83 Symptom 3557 517 1018 4575 0.87 0.78 0.82 Procedure 576 132 129 705 0.81 0.82 0.82 RelationshipStatus 18 2 6 24 0.90 0.75 0.82 HealthStatus 79 23 28 107 0.77 0.74 0.76 Modifier 805 213 334 1139 0.79 0.71 0.75 MedicalDevice 240 70 92 332 0.77 0.72 0.75 SubstanceQuantity 57 15 28 85 0.79 0.67 0.73 InjuryOrPoisoning 116 28 60 176 0.81 0.66 0.73 TestResult 353 95 171 524 0.79 0.67 0.73 Treatment 134 44 94 228 0.75 0.59 0.66 macro_avg 25207 3515 4023 29230 0.87 0.83 0.85 micro_avg 25207 3515 4023 29230 0.88 0.86 0.87 ``` --- layout: model title: Detect PHI for Deidentification (Glove, 7 labels) author: John Snow Labs name: ner_deid_generic_glove date: 2021-06-06 tags: [deid, clinical, glove, licensed, ner, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity recognition annotator allows for a generic model to be trained by utilizing a deep learning algorithm (Char CNNs - BiLSTM - CRF - word 
embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Absolute) is a Named Entity Recognition model that annotates text to find protected health information that may need to be deidentified. It detects 7 entities. This NER model is trained on a combination of the i2b2 training set and an augmented version of it, using GloVe-100d embeddings. We adhered to the official annotation guideline (AG) of the 2014 i2b2 Deid challenge while annotating new datasets for this model. All details regarding the nuances of and rationale for the AG can be found here: [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/) ## Predicted Entities `DATE`, `NAME`, `LOCATION`, `PROFESSION`, `CONTACT`, `AGE`, `ID` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/deid_ner_generic_glove_en_3.0.4_3.0_1623015102478.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/deid_ner_generic_glove_en_3.0.4_3.0_1623015102478.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pandas as pd document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d') \ .setInputCols(['sentence', 'token']) \ .setOutputCol('embeddings') deid_ner = MedicalNerModel.pretrained("ner_deid_generic_glove", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, glove_embeddings, deid_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame(pd.DataFrame({"text": ["""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. 
Phone +1 (302) 786-5227."""]}))) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val glove_embeddings = WordEmbeddingsModel.pretrained("glove_100d") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val deid_ner = MedicalNerModel.pretrained("ner_deid_generic_glove", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val nlpPipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, glove_embeddings, deid_ner, ner_converter)) val data = Seq("A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227.").toDS.toDF("text") val result = nlpPipeline.fit(data).transform(data) ```
## Results ```bash +-----------------------------+---------+ |chunk |ner_label| +-----------------------------+---------+ |2093-01-13 |DATE | |David Hale |NAME | |Hendrickson, Ora |NAME | |7194334 |ID | |01/13/93 |DATE | |Oliveira |NAME | |25 |AGE | |1-11-2000 |DATE | |Cocke County Baptist Hospital|LOCATION | |0295 Keats Street |LOCATION | |(302) 786-5227 |CONTACT | +-----------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic_glove| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source A custom dataset created from the i2b2-PHI training set and an augmented version of it is used. --- layout: model title: German BertForTokenClassification Cased model (from Sahajtomar) author: John Snow Labs name: bert_token_classifier_ner_legal date: 2022-11-30 tags: [de, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: de edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `NER_legal_de` is a German model originally trained by `Sahajtomar`. 
## Predicted Entities `UN`, `ST`, `VO`, `VS`, `ORG`, `STR`, `RS`, `LIT`, `PER`, `GRT`, `LD`, `INN`, `VT`, `AN`, `EUN`, `MRK`, `RR`, `GS`, `LDS` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_ner_legal_de_4.2.4_3.0_1669814042794.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_ner_legal_de_4.2.4_3.0_1669814042794.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_legal","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_legal","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|de| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Sahajtomar/NER_legal_de --- layout: model title: Medical Question Answering (biogpt) author: John Snow Labs name: medical_qa_biogpt date: 2023-03-09 tags: [licensed, clinical, en, gpt, biogpt, pubmed, question_answering, tensorflow] task: Question Answering language: en edition: Healthcare NLP 4.3.1 spark_version: 3.0 supported: true recommended: true engine: tensorflow annotator: MedicalQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is directly ported from the official BioGPT [implementation](https://github.com/microsoft/BioGPT) that is trained on PubMed abstracts and then fine-tuned on the PubMedQA dataset. It is the baseline version called [BioGPT-QA-PubMedQA-BioGPT](https://msramllasc.blob.core.windows.net/modelrelease/BioGPT/checkpoints/QA-PubMedQA-BioGPT.tgz). It can generate two types of answers, controlled by the question type: `"short"` (producing yes/no/maybe answers) and `"long"` (producing full answers). 
## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/BIOGPT_MEDICAL_QUESTION_ANSWERING/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/31.Medical_Question_Answering.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/medical_qa_biogpt_en_4.3.1_3.0_1678355315206.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/medical_qa_biogpt_en_4.3.1_3.0_1678355315206.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler()\ .setInputCols("question", "context")\ .setOutputCols("document_question", "document_context") med_qa = sparknlp_jsl.annotators.MedicalQuestionAnswering\ .pretrained("medical_qa_biogpt","en","clinical/models")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setMaxNewTokens(30)\ .setTopK(1)\ .setQuestionType("long") # "short" pipeline = Pipeline(stages=[document_assembler, med_qa]) paper_abstract = "The visual indexing theory proposed by Zenon Pylyshyn (Cognition, 32, 65–97, 1989) predicts that visual attention mechanisms are employed when mental images are projected onto a visual scene. Recent eye-tracking studies have supported this hypothesis by showing that people tend to look at empty places where requested information has been previously presented. However, it has remained unclear to what extent this behavior is related to memory performance. The aim of the present study was to explore whether the manipulation of spatial attention can facilitate memory retrieval. In two experiments, participants were asked first to memorize a set of four objects and then to determine whether a probe word referred to any of the objects. The results of both experiments indicate that memory accuracy is not affected by the current focus of attention and that all the effects of directing attention to specific locations on response times can be explained in terms of stimulus–stimulus and stimulus–response spatial compatibility." long_question = "What is the effect of directing attention on memory?" yes_no_question = "Does directing attention improve memory for items?" 
data = spark.createDataFrame( [ [long_question, paper_abstract, "long"], [yes_no_question, paper_abstract, "short"], ] ).toDF("question", "context", "question_type") pipeline.fit(data).transform(data.where("question_type == 'long'"))\ .select("answer.result")\ .show(truncate=False) pipeline.fit(data).transform(data.where("question_type == 'short'"))\ .select("answer.result")\ .show(truncate=False) ``` ```scala val document_assembler = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val med_qa = MedicalQuestionAnswering .pretrained("medical_qa_biogpt", "en", "clinical/models") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setMaxNewTokens(30) .setTopK(1) .setQuestionType("long") // "short" val pipeline = new Pipeline().setStages(Array(document_assembler, med_qa)) val paper_abstract = "The visual indexing theory proposed by Zenon Pylyshyn (Cognition, 32, 65–97, 1989) predicts that visual attention mechanisms are employed when mental images are projected onto a visual scene. Recent eye-tracking studies have supported this hypothesis by showing that people tend to look at empty places where requested information has been previously presented. However, it has remained unclear to what extent this behavior is related to memory performance. The aim of the present study was to explore whether the manipulation of spatial attention can facilitate memory retrieval. In two experiments, participants were asked first to memorize a set of four objects and then to determine whether a probe word referred to any of the objects. The results of both experiments indicate that memory accuracy is not affected by the current focus of attention and that all the effects of directing attention to specific locations on response times can be explained in terms of stimulus–stimulus and stimulus–response spatial compatibility." 
val long_question = "What is the effect of directing attention on memory?"
val yes_no_question = "Does directing attention improve memory for items?"

val data = Seq(
    (long_question, paper_abstract, "long"),
    (yes_no_question, paper_abstract, "short"))
    .toDS.toDF("question", "context", "question_type")

val result = pipeline.fit(data).transform(data)
```
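Because `setQuestionType` is fixed per annotator instance, the example filters the DataFrame by `question_type` and transforms each subset separately. The routing step can be sketched in plain Python; this is an illustrative sketch of the row layout used above, not part of the Spark NLP API:

```python
# Illustrative sketch: group (question, context, question_type) rows by type,
# mirroring the two filtered transform() calls above (one per question type).
def route_questions(rows):
    buckets = {"long": [], "short": []}
    for question, context, qtype in rows:
        buckets[qtype].append((question, context))
    return buckets

rows = [
    ("What is the effect of directing attention on memory?", "…abstract…", "long"),
    ("Does directing attention improve memory for items?", "…abstract…", "short"),
]
```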
## Results ```bash +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |result | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |[the present study investigated whether directing spatial attention to one location in a visual array would enhance memory for the array features. participants memorized two]| +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|medical_qa_biogpt| |Compatibility:|Healthcare NLP 4.3.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.1 GB| |Case sensitive:|true| --- layout: model title: Detect Genes/Proteins (BC2GM) in Medical Texts author: John Snow Labs name: ner_biomedical_bc2gm_pipeline date: 2022-06-22 tags: [licensed, clinical, en, ner, bc2gm, gene_protein, gene, protein, biomedical] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_biomedical_bc2gm](https://nlp.johnsnowlabs.com/2022/05/10/ner_biomedical_bc2gm_en_3_0.html) model. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.Pretrained_Clinical_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_biomedical_bc2gm_pipeline_en_3.5.3_3.0_1655893015210.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_biomedical_bc2gm_pipeline_en_3.5.3_3.0_1655893015210.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_biomedical_bc2gm_pipeline", "en", "clinical/models") result = pipeline.fullAnnotate("""Immunohistochemical staining was positive for S-100 in all 9 cases stained, positive for HMB-45 in 9 (90%) of 10, and negative for cytokeratin in all 9 cases in which myxoid melanoma remained in the block after previous sections.""") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_biomedical_bc2gm_pipeline", "en", "clinical/models") val result = pipeline.fullAnnotate("""Immunohistochemical staining was positive for S-100 in all 9 cases stained, positive for HMB-45 in 9 (90%) of 10, and negative for cytokeratin in all 9 cases in which myxoid melanoma remained in the block after previous sections""") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.biomedical_bc2gm.pipeline").predict("""Immunohistochemical staining was positive for S-100 in all 9 cases stained, positive for HMB-45 in 9 (90%) of 10, and negative for cytokeratin in all 9 cases in which myxoid melanoma remained in the block after previous sections.""") ```
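`fullAnnotate` returns, for each input text, a dict of annotation lists keyed by output column. A small post-processing helper can flatten the NER output into (chunk, label) rows; note that the `ner_chunk` key and the `entity` metadata field are assumptions for illustration, not guaranteed by this card:

```python
from collections import namedtuple

# Minimal stand-in for Spark NLP's Annotation type (illustrative; real
# annotations expose .result and .metadata among other fields).
Annotation = namedtuple("Annotation", ["result", "metadata"])

def chunks_to_rows(annotated, chunk_col="ner_chunk"):
    """Flatten one fullAnnotate() result dict into (chunk, ner_label) pairs."""
    return [(a.result, a.metadata.get("entity")) for a in annotated.get(chunk_col, [])]

# Mock output shaped like the entities found in the example sentence:
mock = {"ner_chunk": [
    Annotation("S-100", {"entity": "GENE_PROTEIN"}),
    Annotation("HMB-45", {"entity": "GENE_PROTEIN"}),
    Annotation("cytokeratin", {"entity": "GENE_PROTEIN"}),
]}
```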
## Results ```bash +-----------+------------+ |chunk |ner_label | +-----------+------------+ |S-100 |GENE_PROTEIN| |HMB-45 |GENE_PROTEIN| |cytokeratin|GENE_PROTEIN| +-----------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_biomedical_bc2gm_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Japanese BertForQuestionAnswering Base model (from KoichiYasuoka) author: John Snow Labs name: bert_qa_base_japanese_wikipedia_ud_head date: 2022-06-28 tags: [ja, open_source, bert, question_answering] task: Question Answering language: ja edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-japanese-wikipedia-ud-head` is a Japanese model originally trained by `KoichiYasuoka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_japanese_wikipedia_ud_head_ja_4.0.0_3.0_1656413487634.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_japanese_wikipedia_ud_head_ja_4.0.0_3.0_1656413487634.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_japanese_wikipedia_ud_head","ja") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["私の名前は何ですか?", "私の名前はクララで、私はバークレーに住んでいます。"]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_japanese_wikipedia_ud_head","ja")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("私の名前は何ですか?", "私の名前はクララで、私はバークレーに住んでいます。")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("ja.answer_question.wikipedia.bert.base").predict("""私の名前は何ですか?|||"私の名前はクララで、私はバークレーに住んでいます。""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_japanese_wikipedia_ud_head| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ja| |Size:|338.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/KoichiYasuoka/bert-base-japanese-wikipedia-ud-head - https://github.com/UniversalDependencies/UD_Japanese-GSDLUW --- layout: model title: Pipeline to Detect Cellular/Molecular Biology Entities author: John Snow Labs name: ner_cellular_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_cellular](https://nlp.johnsnowlabs.com/2021/03/31/ner_cellular_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_cellular_pipeline_en_3.4.1_3.0_1647870149220.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_cellular_pipeline_en_3.4.1_3.0_1647870149220.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_cellular_pipeline", "en", "clinical/models") pipeline.annotate("Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.") ``` ```scala val pipeline = new PretrainedPipeline("ner_cellular_pipeline", "en", "clinical/models") pipeline.annotate("Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. 
Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.cellular.pipeline").predict("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""") ```
## Results

```bash
+-----------------------------------------------------------+---------+
|chunk                                                      |ner      |
+-----------------------------------------------------------+---------+
|intracellular signaling proteins                           |protein  |
|human T-cell leukemia virus type 1 promoter                |DNA      |
|Tax                                                        |protein  |
|Tax-responsive element 1                                   |DNA      |
|cyclic AMP-responsive members                              |protein  |
|CREB/ATF family                                            |protein  |
|transcription factors                                      |protein  |
|Tax                                                        |protein  |
|human T-cell leukemia virus type 1 Tax-responsive element 1|DNA      |
|TRE-1),                                                    |DNA      |
|lacZ gene                                                  |DNA      |
|CYC1 promoter                                              |DNA      |
|TRE-1                                                      |DNA      |
|cyclic AMP response element-binding protein                |protein  |
|CREB                                                       |protein  |
|CREB                                                       |protein  |
|GAL4 activation domain                                     |protein  |
|GAD                                                        |protein  |
|reporter gene                                              |DNA      |
|Tax                                                        |protein  |
+-----------------------------------------------------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_cellular_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter

---
layout: model
title: English BertForTokenClassification Uncased model (from ghadeermobasher)
author: John Snow Labs
name: bert_ner_BC4_Original_scibert_scivocab_uncased
date: 2022-07-06
tags: [bert, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4-Original-scibert_scivocab_uncased` is an English model originally trained by `ghadeermobasher`.
## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_Original_scibert_scivocab_uncased_en_4.0.0_3.0_1657108142723.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_Original_scibert_scivocab_uncased_en_4.0.0_3.0_1657108142723.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token")

tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_Original_scibert_scivocab_uncased","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_Original_scibert_scivocab_uncased","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier))

val data = Seq("PUT YOUR STRING HERE").toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4_Original_scibert_scivocab_uncased| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|410.4 MB| |Case sensitive:|false| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4-Original-scibert_scivocab_uncased --- layout: model title: Hindi RobertaForMaskedLM Cased model (from neuralspace-reverie) author: John Snow Labs name: roberta_embeddings_indic_transformers date: 2022-12-12 tags: [hi, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: hi edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `indic-transformers-hi-roberta` is a Hindi model originally trained by `neuralspace-reverie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indic_transformers_hi_4.2.4_3.0_1670858655317.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indic_transformers_hi_4.2.4_3.0_1670858655317.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_indic_transformers","hi") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_indic_transformers","hi")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_indic_transformers| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|hi| |Size:|313.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-hi-roberta - https://oscar-corpus.com/ --- layout: model title: Korean ElectraForQuestionAnswering model (from monologg) author: John Snow Labs name: electra_qa_base_v2_finetuned_korquad date: 2022-06-22 tags: [ko, open_source, electra, question_answering] task: Question Answering language: ko edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `koelectra-base-v2-finetuned-korquad` is a Korean model originally trained by `monologg`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_base_v2_finetuned_korquad_ko_4.0.0_3.0_1655922101194.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_base_v2_finetuned_korquad_ko_4.0.0_3.0_1655922101194.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_v2_finetuned_korquad","ko") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_v2_finetuned_korquad","ko")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("ko.answer_question.korquad.electra.base_v2.by_monologg").predict("""내 이름은 무엇입니까?|||"제 이름은 클라라이고 저는 버클리에 살고 있습니다.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_base_v2_finetuned_korquad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ko| |Size:|412.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/monologg/koelectra-base-v2-finetuned-korquad --- layout: model title: Company Name Normalization to Edgar Database author: John Snow Labs name: legel_edgar_company_name date: 2022-08-30 tags: [en, legal, companies, edgar, licensed] task: Entity Resolution language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an Entity Linking / Entity Resolution model, which allows you to map an extracted Company Name from any NER model, to the name used by SEC in Edgar Database. This can come in handy to afterwards use Edgar Chunk Mappers with the output of this resolution, to carry out data augmentation and retrieve additional information stored in Edgar Database about a company. For more information about data augmentation, check `Chunk Mapping` task in Models Hub. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/ER_EDGAR_CRUNCHBASE/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legel_edgar_company_name_en_1.0.0_3.2_1661866264706.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legel_edgar_company_name_en_1.0.0_3.2_1661866264706.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("ner_chunk")

embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
    .setInputCols("ner_chunk") \
    .setOutputCol("sentence_embeddings")

resolver = legal.SentenceEntityResolverModel.pretrained("legel_edgar_company_name", "en", "legal/models")\
    .setInputCols(["ner_chunk", "sentence_embeddings"]) \
    .setOutputCol("irs_code")\
    .setDistanceFunction("EUCLIDEAN")

pipeline = nlp.Pipeline(
    stages = [
        documentAssembler,
        embeddings,
        resolver])

pipelineModel = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

lp = nlp.LightPipeline(pipelineModel)

lp.fullAnnotate("CONTACT GOLD")
```
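The resolver also returns its full candidate list in the annotation metadata (compare the `all_codes` and `all_distances` columns in the Results). A sketch of ranking those candidates by distance; the metadata key names and the `:::` separator used here are assumptions for illustration, not documented by this card:

```python
def rank_candidates(metadata):
    """Pair candidate codes, company names, and distances from a resolver
    annotation's metadata, sorted by ascending distance (best match first).
    Key names and the ':::' separator are illustrative assumptions."""
    codes = metadata["all_k_results"].split(":::")
    names = metadata["all_k_resolutions"].split(":::")
    dists = [float(d) for d in metadata["all_k_distances"].split(":::")]
    return sorted(zip(codes, names, dists), key=lambda t: t[2])

# Mock metadata shaped like the Results row for "CONTACT GOLD":
mock = {
    "all_k_results": "981369960:::271989147",
    "all_k_resolutions": "Contact Gold Corp:::Guskin Gold Corp",
    "all_k_distances": "0.1733:::0.3700",
}
```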
## Results ```bash +--------------+----------+---------------------------------------------------------+--------------------------------------------------------------------------------------------+-------------------------------------------+ | chunk | code | all_codes | resolutions | all_distances | +--------------+----------+---------------------------------------------------------+--------------------------------------------------------------------------------------------+-------------------------------------------+ | CONTACT GOLD | 981369960| [981369960, 271989147, 208531222, 273566922, 270348508] |[Contact Gold Corp, Guskin Gold Corp, Yinfu Gold Corp, MAGELLAN GOLD Corp, Star Gold Corp] | [0.1733, 0.3700, 0.3867, 0.4103, 0.4121] | +--------------+----------+---------------------------------------------------------+--------------------------------------------------------------------------------------------+-------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legel_edgar_company_name| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[original_company_name]| |Language:|en| |Size:|315.1 MB| |Case sensitive:|false| ## References In-house scrapping and postprocessing of SEC Edgar Database --- layout: model title: English T5ForConditionalGeneration Large Cased model (from google) author: John Snow Labs name: t5_efficient_large_dm256 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`t5-efficient-large-dm256` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_large_dm256_en_4.3.0_3.0_1675115135612.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_large_dm256_en_4.3.0_3.0_1675115135612.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_large_dm256","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_large_dm256","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_large_dm256|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|370.2 MB|

## References

- https://huggingface.co/google/t5-efficient-large-dm256
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145

---
layout: model
title: Mapping SNOMED Codes with Their Corresponding ICD10-CM Codes
author: John Snow Labs
name: snomed_icd10cm_mapper
date: 2022-06-26
tags: [clinical, licensed, icd10cm, chunk_mapper, en, snomed]
task: Chunk Mapping
language: en
nav_key: models
edition: Healthcare NLP 3.5.3
spark_version: 3.0
supported: true
annotator: ChunkMapperModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained model maps SNOMED codes to corresponding ICD10-CM codes.

## Predicted Entities

`icd10cm_code`

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/snomed_icd10cm_mapper_en_3.5.3_3.0_1656266755041.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/snomed_icd10cm_mapper_en_3.5.3_3.0_1656266755041.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") snomed_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_snomed_conditions", "en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("snomed_code")\ .setDistanceFunction("EUCLIDEAN") chunkerMapper = ChunkMapperModel.pretrained("snomed_icd10cm_mapper", "en", "clinical/models")\ .setInputCols(["snomed_code"])\ .setOutputCol("icd10cm_mappings")\ .setRels(["icd10cm_code"]) pipeline = Pipeline( stages = [ documentAssembler, sbert_embedder, snomed_resolver, chunkerMapper ]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) light_pipeline= LightPipeline(model) result = light_pipeline.fullAnnotate("Radiating chest pain") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased", "en", "clinical/models") .setInputCols("ner_chunk") .setOutputCol("sbert_embeddings") val snomed_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_snomed_conditions", "en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("snomed_code") .setDistanceFunction("EUCLIDEAN") val chunkerMapper = ChunkMapperModel.pretrained("snomed_icd10cm_mapper", "en", "clinical/models") .setInputCols("snomed_code") .setOutputCol("icd10cm_mappings") .setRels(Array("icd10cm_code")) val pipeline = new Pipeline(stages = Array( documentAssembler, sbert_embedder, snomed_resolver, chunkerMapper)) val data = Seq("Radiating chest pain").toDS.toDF("text") val result= pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu 
nlu.load("en.snomed_to_icd10cm").predict("""Radiating chest pain""") ```
## Results ```bash | | ner_chunk | snomed_code | icd10cm_mappings | |---:|:---------------------|--------------:|:-------------------| | 0 | Radiating chest pain | 10000006 | R07.9 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|snomed_icd10cm_mapper| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[snomed_code]| |Output Labels:|[mappings]| |Language:|en| |Size:|1.5 MB| ## References This pretrained model maps SNOMED codes to corresponding ICD10-CM codes under the Unified Medical Language System (UMLS). --- layout: model title: Pipeline to Summarize Clinical Notes author: John Snow Labs name: summarizer_clinical_jsl_pipeline date: 2023-05-31 tags: [licensed, en, clinical, summarization] task: Summarization language: en edition: Healthcare NLP 4.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [summarizer_clinical_jsl](https://nlp.johnsnowlabs.com/2023/03/25/summarizer_clinical_jsl.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_jsl_pipeline_en_4.4.1_3.0_1685534653791.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_jsl_pipeline_en_4.4.1_3.0_1685534653791.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("summarizer_clinical_jsl_pipeline", "en", "clinical/models") text = """Patient with hypertension, syncope, and spinal stenosis - for recheck. (Medical Transcription Sample Report) SUBJECTIVE: The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS: Reviewed and unchanged from the dictation on 12/03/2003. MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash. """ result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("summarizer_clinical_jsl_pipeline", "en", "clinical/models") val text = """Patient with hypertension, syncope, and spinal stenosis - for recheck. (Medical Transcription Sample Report) SUBJECTIVE: The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS: Reviewed and unchanged from the dictation on 12/03/2003. MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash. """ val result = pipeline.fullAnnotate(text) ```
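When reading the output, note that `fullAnnotate` generally returns one dictionary per input text, keyed by output column, whose values are lists of annotation objects exposing a `.result` field. The stubbed sketch below shows how the summary text can be pulled out of that shape; the `Annotation` class is a stand-in and the column name `summary` is an assumption, not confirmed by this card.

```python
# Stub standing in for a Spark NLP annotation object (only .result is used).
class Annotation:
    def __init__(self, result):
        self.result = result

def extract_text(full_annotate_result, column):
    """Join the .result fields of one output column of the first document."""
    return " ".join(a.result for a in full_annotate_result[0][column])

# Hypothetical output shape; the real column name may differ per pipeline.
fake = [{"summary": [Annotation("A 78-year-old female returns for recheck.")]}]
print(extract_text(fake, "summary"))
```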
## Results ```bash A 78-year-old female with hypertension, syncope, and spinal stenosis returns for recheck. She denies chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. She is on multiple medications and has Elocon cream and Synalar cream for rash. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_clinical_jsl_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|936.7 MB| ## Included Models - DocumentAssembler - MedicalSummarizer --- layout: model title: Stopwords Remover for Portuguese language (416 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, pt, open_source] task: Stop Words Removal language: pt edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_pt_3.4.1_3.0_1646673119843.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_pt_3.4.1_3.0_1646673119843.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

stop_words = StopWordsCleaner.pretrained("stopwords_iso", "pt") \
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])

example = spark.createDataFrame([["Você não é melhor que eu"]], ["text"])
results = pipeline.fit(example).transform(example)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val stop_words = StopWordsCleaner.pretrained("stopwords_iso", "pt")
    .setInputCols(Array("token"))
    .setOutputCol("cleanTokens")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))

val data = Seq("Você não é melhor que eu").toDF("text")
val results = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("pt.stopwords").predict("""Você não é melhor que eu""")
```
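The cleaning step itself is a simple set-membership filter over tokens. A pure-Python sketch of that filtering, using only a tiny, illustrative subset of the 416-entry ISO Portuguese stopword list the model actually ships:

```python
# Illustrative subset of the ISO Portuguese stopword list; the pretrained
# StopWordsCleaner model contains all 416 entries.
STOPWORDS_PT = {"você", "não", "é", "que", "eu"}

def clean_tokens(tokens, stopwords=STOPWORDS_PT):
    """Keep only tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in stopwords]

print(clean_tokens(["Você", "não", "é", "melhor", "que", "eu"]))  # ['melhor']
```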
## Results ```bash +--------+ |result | +--------+ |[melhor]| +--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|pt| |Size:|2.5 KB| --- layout: model title: Sentence Entity Resolver for HGNC author: John Snow Labs name: sbiobertresolve_hgnc date: 2023-03-26 tags: [hgnc, entity_resolution, clinical, en, licensed] task: Entity Resolution language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted gene names and their short-form abbreviations to HGNC codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. Also, it returns the locus groups and locus types of the genes as aux labels separated by || under the metadata. ## Predicted Entities `HGNC Codes`, `Locus Group`, `Locus Type` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_hgnc_en_4.3.2_3.0_1679847290330.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_hgnc_en_4.3.2_3.0_1679847290330.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_human_phenotype_gene_clinical", "en", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(['GENE']) chunk2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_hgnc","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver]) clinical_note = ["Recent studies have suggested a potential link between the double homeobox 4 like 20 (pseudogene), also known as DUX4L20, and FBXO48 and RNA guanine-7 methyltransferase "] data= spark.createDataFrame([clinical_note]).toDF('text') results = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = 
SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_human_phenotype_gene_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    .setWhiteList(Array("GENE"))

val chunk2doc = new Chunk2Doc()
    .setInputCols("ner_chunk")
    .setOutputCol("ner_chunk_doc")

val sbert_embedder = BertSentenceEmbeddings
    .pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
    .setInputCols(Array("ner_chunk_doc"))
    .setOutputCol("sbert_embeddings")

val resolver = SentenceEntityResolverModel
    .pretrained("sbiobertresolve_hgnc", "en", "clinical/models")
    .setInputCols(Array("ner_chunk", "sbert_embeddings"))
    .setOutputCol("resolution")
    .setDistanceFunction("EUCLIDEAN")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver))

val data = Seq("Recent studies have suggested a potential link between the double homeobox 4 like 20 (pseudogene), also known as DUX4L20, and FBXO48 and RNA guanine-7 methyltransferase ").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
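The aux labels in the resolver's metadata pair each candidate's locus group with its locus type (the Results table shows them joined with `::`). A small parser for one such entry; the exact metadata layout beyond this delimiter is an assumption.

```python
def parse_aux(entry):
    """Split a 'locus group :: locus type' aux entry into a tuple."""
    group, locus_type = (part.strip() for part in entry.split("::", 1))
    return group, locus_type

# Sample entry mirrored from the Results table.
print(parse_aux("protein-coding gene :: gene with protein product"))
```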
## Results ```bash | sent_id | ner_chunk | entity | HGNC Code | all_codes | resolutions | AUX | |----------:|:------------|:---------|:-------------|:------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------| | 0 | DUX4L20 | GENE | HGNC:50801 | ['HGNC:50801', 'HGNC:31982', 'HGNC:42423', 'HGNC:39776', 'HGNC:42023'...| ['DUX4L20 [double homeobox 4 like 20 (pseudogene)]', 'ANKRD20A4P [ankyrin repeat domain 20 family member A4, pseudogene]', ...| [pseudogene :: pseudogene, pseudogene :: pseudogene, non-coding RNA :: RNA, long non-coding, pseudogene :: pseudogene...| | 0 | FBXO48 | GENE | HGNC:33857 | ['HGNC:33857', 'HGNC:4930', 'HGNC:16653', 'HGNC:13114', 'HGNC:23535' ...| ['FBXO48 [F-box protein 48]', 'ZBTB48 [zinc finger and BTB domain containing 48]', 'MRPL48 [mitochondrial ribosomal protein' ...| [protein-coding gene :: gene with protein product, protein-coding gene :: gene with protein product, protein-coding gene...| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_hgnc| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[hgnc_code]| |Language:|en| |Size:|251.9 MB| |Case sensitive:|false| ## References https://evs.nci.nih.gov/ftp1/NCI_Thesaurus/ --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_128_finetuned_squad_seed_10 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: 
RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-128-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_128_finetuned_squad_seed_10_en_4.3.0_3.0_1674213671532.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_128_finetuned_squad_seed_10_en_4.3.0_3.0_1674213671532.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_128_finetuned_squad_seed_10","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_128_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
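Extractive QA models like this one score every token as a candidate answer start and end; the answer is the span with the highest combined score where start <= end. A toy sketch of that decoding step (the logits below are made up for illustration, not model output):

```python
def best_span(start_logits, end_logits):
    """Return (start, end) indices maximizing start + end score, start <= end."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, len(end_logits)):
            if s + end_logits[j] > best_score:
                best_score = s + end_logits[j]
                best = (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start  = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]  # toy values
end    = [0.0, 0.1, 0.0, 4.5, 0.2, 0.0, 0.0, 0.1, 0.4, 0.0]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```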
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_few_shot_k_128_finetuned_squad_seed_10|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|423.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-128-finetuned-squad-seed-10

---
layout: model
title: Spanish Named Entity Recognition (from MMG)
author: John Snow Labs
name: xlmroberta_ner_xlm_roberta_large_ner_spanish
date: 2022-05-17
tags: [xlm_roberta, ner, token_classification, es, open_source]
task: Named Entity Recognition
language: es
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-large-ner-spanish` is a Spanish model originally trained by `MMG`.

## Predicted Entities

`LOC`, `ORG`, `MISC`, `PER`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_large_ner_spanish_es_3.4.2_3.0_1652807519626.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_large_ner_spanish_es_3.4.2_3.0_1652807519626.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_large_ner_spanish","es") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_large_ner_spanish","es") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_xlm_roberta_large_ner_spanish|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|es|
|Size:|1.8 GB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/MMG/xlm-roberta-large-ner-spanish

---
layout: model
title: Translate English to Kongo Pipeline
author: John Snow Labs
name: translate_en_kg
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, kg, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `en`
- target languages: `kg`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_kg_xx_2.7.0_2.4_1609687119024.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_kg_xx_2.7.0_2.4_1609687119024.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_kg", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_kg", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.kg').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_en_kg|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English asr_wav2vec2_base_10k_voxpopuli TFWav2Vec2ForCTC from facebook
author: John Snow Labs
name: asr_wav2vec2_base_10k_voxpopuli
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_10k_voxpopuli` is an English model originally trained by facebook.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_10k_voxpopuli_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_10k_voxpopuli_en_4.2.0_3.0_1664022512444.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_10k_voxpopuli_en_4.2.0_3.0_1664022512444.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_10k_voxpopuli", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_10k_voxpopuli", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
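Wav2Vec2ForCTC emits one label per audio frame, and CTC decoding collapses repeated labels and removes the blank symbol to produce the transcript. A greedy-decoding sketch in plain Python, using `_` as a stand-in blank token (the real model's blank symbol and vocabulary differ):

```python
def ctc_greedy_decode(frames, blank="_"):
    """Collapse consecutive repeats, then drop the blank symbol."""
    decoded, prev = [], None
    for label in frames:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return "".join(decoded)

# Frame-level predictions for a short utterance (illustrative).
print(ctc_greedy_decode(list("hh_e_ll_lloo")))  # hello
```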
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_10k_voxpopuli|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|227.9 MB|

---
layout: model
title: English BertForQuestionAnswering model (from TingChenChang)
author: John Snow Labs
name: bert_qa_bert_multi_cased_finetuned_xquadv1_finetuned_squad_colab
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-multi-cased-finetuned-xquadv1-finetuned-squad-colab` is an English model originally trained by `TingChenChang`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_multi_cased_finetuned_xquadv1_finetuned_squad_colab_en_4.0.0_3.0_1654184544770.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_multi_cased_finetuned_xquadv1_finetuned_squad_colab_en_4.0.0_3.0_1654184544770.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_multi_cased_finetuned_xquadv1_finetuned_squad_colab", "en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(example).transform(example)
```

```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_bert_multi_cased_finetuned_xquadv1_finetuned_squad_colab", "en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.xquad_squad.bert.cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_multi_cased_finetuned_xquadv1_finetuned_squad_colab| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/TingChenChang/bert-multi-cased-finetuned-xquadv1-finetuned-squad-colab --- layout: model title: Detect Anatomical Regions (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_anatomy date: 2021-09-30 tags: [anatomy, bertfortokenclassification, en, licensed, ner] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.2.2 spark_version: 2.4 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for anatomy terms. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. 
## Predicted Entities `Anatomical_system`, `Cell`, `Cellular_component`, `Developing_anatomical_structure`, `Immaterial_anatomical_entity`, `Multi-tissue_structure`, `Organ`, `Organism_subdivision`, `Organism_substance`, `Pathological_formation`, `Tissue` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_anatomy_en_3.2.2_2.4_1632991802389.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_anatomy_en_3.2.2_2.4_1632991802389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentenceDetector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_anatomy", "en", "clinical/models")\ .setInputCols("token", "sentence")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) pp_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']}))) test_sentence = """This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. 
Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short."""

result = pp_model.transform(spark.createDataFrame(pd.DataFrame({'text': [test_sentence]})))
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_anatomy", "en", "clinical/models")
    .setInputCols(Array("token", "sentence"))
    .setOutputCol("ner")
    .setCaseSensitive(true)

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))

val data = Seq("""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin.
Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_anatomy").predict("""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.""") ```
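The NerConverter step groups B-/I- tagged tokens into the chunks shown in the Results section. A minimal pure-Python sketch of that BIO merging, using tags from this model's label set:

```python
def bio_to_chunks(tokens, tags):
    """Merge BIO-tagged tokens into (chunk_text, label) pairs."""
    chunks = []
    current, label = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" tag, or an "I-" with no open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["redness", "along", "her", "right", "great", "toe"]
tags = ["O", "O", "O", "O", "B-Multi-tissue_structure", "I-Multi-tissue_structure"]
print(bio_to_chunks(tokens, tags))  # [('great toe', 'Multi-tissue_structure')]
```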
## Results ```bash +-------------------+----------------------+ |chunk |ner_label | +-------------------+----------------------+ |great toe |Multi-tissue_structure| |skin |Organ | |conjunctivae |Multi-tissue_structure| |Extraocular muscles|Multi-tissue_structure| |Nares |Multi-tissue_structure| |turbinates |Multi-tissue_structure| |Oropharynx |Multi-tissue_structure| |Mucous membranes |Tissue | |Neck |Organism_subdivision | |bowel |Organ | |great toe |Multi-tissue_structure| |skin |Organ | |toenails |Organism_subdivision | |foot |Organism_subdivision | |great toe |Multi-tissue_structure| |toenails |Organism_subdivision | +-------------------+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_anatomy| |Compatibility:|Healthcare NLP 3.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Case sensitive:|true| |Max sentence length:|512| ## Data Source Trained on the Anatomical Entity Mention (AnEM) corpus with 'embeddings_clinical'. 
http://www.nactem.ac.uk/anatomy/ ## Benchmarking ```bash label precision recall f1-score support B-Anatomical_system 1.00 0.50 0.67 4 B-Cell 0.89 0.96 0.92 74 B-Cellular_component 0.97 0.81 0.88 36 B-Developing_anatomical_structure 1.00 0.50 0.67 6 B-Immaterial_anatomical_entity 0.60 0.75 0.67 4 B-Multi-tissue_structure 0.75 0.86 0.80 58 B-Organ 0.86 0.88 0.87 48 B-Organism_subdivision 0.62 0.42 0.50 12 B-Organism_substance 0.89 0.81 0.85 31 B-Pathological_formation 0.91 0.91 0.91 32 B-Tissue 0.94 0.76 0.84 21 I-Anatomical_system 1.00 1.00 1.00 1 I-Cell 1.00 0.84 0.91 62 I-Cellular_component 0.92 0.85 0.88 13 I-Developing_anatomical_structure 1.00 1.00 1.00 1 I-Immaterial_anatomical_entity 1.00 1.00 1.00 1 I-Multi-tissue_structure 1.00 0.77 0.87 26 I-Organ 0.80 0.80 0.80 5 I-Organism_substance 1.00 0.71 0.83 7 I-Pathological_formation 1.00 0.94 0.97 16 I-Tissue 0.93 0.93 0.93 15 accuracy - - 0.84 473 macro-avg 0.87 0.77 0.83 473 weighted-avg 0.90 0.84 0.86 473 ``` --- layout: model title: Legal Remedies Clause Binary Classifier author: John Snow Labs name: legclf_remedies_clause date: 2023-02-13 tags: [en, legal, classification, remedies, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `remedies` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. 
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model support up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `remedies`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_remedies_clause_en_1.0.0_3.0_1676303332118.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_remedies_clause_en_1.0.0_3.0_1676303332118.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_remedies_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
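The model card above recommends paragraph splitting (by multiline) before classifying large documents. As a minimal sketch of that preprocessing step, independent of Spark NLP, the function below splits raw text on blank lines; the function name and the sample clause text are illustrative assumptions, not part of the library API:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs on runs of blank lines.

    Each returned paragraph can then become one row of the `text`
    column that the classification pipeline above consumes.
    """
    # A blank line is a newline, optional whitespace, then another newline.
    parts = re.split(r"\n\s*\n", text)
    # Drop empty fragments and surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]

document = "REMEDIES.\nThe parties agree...\n\nGOVERNING LAW.\nThis Agreement..."
paragraphs = split_paragraphs(document)
# Each paragraph would then be classified separately, e.g.:
# df = spark.createDataFrame([[p] for p in paragraphs]).toDF("text")
```

Splitting by headers / subheaders follows the same pattern with a header-matching regex instead of the blank-line one.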
## Results ```bash +-------+ |result| +-------+ |[remedies]| |[other]| |[other]| |[remedies]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_remedies_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 1.00 1.00 10 remedies 1.00 1.00 1.00 19 accuracy - - 1.00 29 macro-avg 1.00 1.00 1.00 29 weighted-avg 1.00 1.00 1.00 29 ``` --- layout: model title: Legal Form Of Indemnification Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_form_of_indemnification_agreement_bert date: 2023-01-26 tags: [en, legal, classification, indemnification, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_form_of_indemnification_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether the document belongs to the class `form-of-indemnification-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. 
## Predicted Entities `form-of-indemnification-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_form_of_indemnification_agreement_bert_en_1.0.0_3.0_1674732486525.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_form_of_indemnification_agreement_bert_en_1.0.0_3.0_1674732486525.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_form_of_indemnification_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
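Several of these `legclf_*` binary classifiers can be chained in one Spark NLP pipeline, each writing to its own output column, yielding a series of True/False values per clause type. A hedged sketch of how the per-classifier labels might then be collapsed into clause flags; the function name and column-naming convention (`category_<clause>`) are illustrative assumptions, not an API of the library:

```python
def collect_clause_flags(row: dict, clause_names: list) -> dict:
    """Turn per-classifier category outputs into clause -> bool flags.

    `row` maps an output column (e.g. 'category_remedies') to the
    predicted label; a clause is flagged True when its classifier
    predicted the clause name rather than 'other'.
    """
    return {
        name: row.get(f"category_{name}") == name
        for name in clause_names
    }

row = {"category_remedies": "remedies", "category_warranty": "other"}
flags = collect_clause_flags(row, ["remedies", "warranty"])
# → {"remedies": True, "warranty": False}
```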
## Results ```bash +-------+ |result| +-------+ |[form-of-indemnification-agreement]| |[other]| |[other]| |[form-of-indemnification-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_form_of_indemnification_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support form-of-indemnification-agreement 1.00 0.97 0.99 36 other 0.98 1.00 0.99 65 accuracy - - 0.99 101 macro-avg 0.99 0.99 0.99 101 weighted-avg 0.99 0.99 0.99 101 ``` --- layout: model title: English ElectraForQuestionAnswering model (from PremalMatalia) author: John Snow Labs name: electra_qa_base_best_squad2 date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-best-squad2` is an English model originally trained by `PremalMatalia`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_base_best_squad2_en_4.0.0_3.0_1655920339398.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_base_best_squad2_en_4.0.0_3.0_1655920339398.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_best_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_best_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.electra.base").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_base_best_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|408.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/PremalMatalia/electra-base-best-squad2 --- layout: model title: Catalan, Valencian asr_wav2vec2_large_100k_voxpopuli_catala_by_ccoreilly TFWav2Vec2ForCTC from ccoreilly author: John Snow Labs name: pipeline_asr_wav2vec2_large_100k_voxpopuli_catala_by_ccoreilly date: 2022-09-24 tags: [wav2vec2, ca, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: ca edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_100k_voxpopuli_catala_by_ccoreilly` is a Catalan, Valencian model originally trained by ccoreilly. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_100k_voxpopuli_catala_by_ccoreilly_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_100k_voxpopuli_catala_by_ccoreilly_ca_4.2.0_3.0_1664035897662.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_100k_voxpopuli_catala_by_ccoreilly_ca_4.2.0_3.0_1664035897662.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_100k_voxpopuli_catala_by_ccoreilly', lang = 'ca') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_100k_voxpopuli_catala_by_ccoreilly", lang = "ca") val annotations = pipeline.transform(audioDF) ```
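The snippets above assume an `audioDF` whose audio column holds the raw floating-point samples of each recording. As a minimal, hedged sketch of preparing that input from a 16-bit PCM WAV file with only the Python standard library; the function name and the commented DataFrame construction are illustrative assumptions, not part of the pipeline API:

```python
import struct
import wave

def wav_to_floats(path: str) -> list:
    """Read a 16-bit PCM WAV file and return samples scaled to [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        raw = wf.readframes(wf.getnframes())
    # Unpack little-endian signed 16-bit samples and normalise by 2**15.
    samples = struct.unpack("<" + "h" * (len(raw) // 2), raw)
    return [s / 32768.0 for s in samples]

# floats = wav_to_floats("speech.wav")
# audioDF = spark.createDataFrame([[floats]], ["audio_content"])
```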
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_100k_voxpopuli_catala_by_ccoreilly| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|ca| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Detect Persons, Locations, Organizations and Misc Entities in Russian (WikiNER 6B 300) author: John Snow Labs name: wikiner_6B_300 date: 2020-03-16 task: Named Entity Recognition language: ru edition: Spark NLP 2.4.4 spark_version: 2.4 tags: [ner, ru, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 6B 300 is trained with GloVe 6B 300 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_RU){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_RU.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_6B_300_ru_2.4.4_2.4_1584014001694.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_6B_300_ru_2.4.4_2.4_1584014001694.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang="xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("wikiner_6B_300", "ru") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['Уильям Генри Гейтс III (родился 28 октября 1955 года) - американский бизнес-магнат, разработчик программного обеспечения, инвестор и филантроп. Он наиболее известен как соучредитель корпорации Microsoft. За время своей карьеры в Microsoft Гейтс занимал должности председателя, главного исполнительного директора (CEO), президента и главного архитектора программного обеспечения, а также был крупнейшим индивидуальным акционером до мая 2014 года. Он является одним из самых известных предпринимателей и пионеров микрокомпьютерная революция 1970-х и 1980-х годов. Гейтс родился и вырос в Сиэтле, штат Вашингтон, в 1975 году вместе с другом детства Полом Алленом в Альбукерке, штат Нью-Мексико, и основал компанию Microsoft. она стала крупнейшей в мире компанией-разработчиком программного обеспечения для персональных компьютеров. Гейтс руководил компанией в качестве председателя и генерального директора, пока в январе 2000 года не ушел с поста генерального директора, но остался председателем и стал главным архитектором программного обеспечения. В конце 1990-х Гейтс подвергся критике за свою деловую тактику, которая считалась антиконкурентной. Это мнение было подтверждено многочисленными судебными решениями. 
В июне 2006 года Гейтс объявил, что перейдет на неполный рабочий день в Microsoft и будет работать на полную ставку в Фонде Билла и Мелинды Гейтс, частном благотворительном фонде, который он и его жена Мелинда Гейтс создали в 2000 году. [ 9] Постепенно он передал свои обязанности Рэю Оззи и Крейгу Манди. Он ушел с поста президента Microsoft в феврале 2014 года и занял новую должность консультанта по технологиям для поддержки вновь назначенного генерального директора Сатья Наделла.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang="xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("wikiner_6B_300", "ru") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("Уильям Генри Гейтс III (родился 28 октября 1955 года) - американский бизнес-магнат, разработчик программного обеспечения, инвестор и филантроп. Он наиболее известен как соучредитель корпорации Microsoft. За время своей карьеры в Microsoft Гейтс занимал должности председателя, главного исполнительного директора (CEO), президента и главного архитектора программного обеспечения, а также был крупнейшим индивидуальным акционером до мая 2014 года. Он является одним из самых известных предпринимателей и пионеров микрокомпьютерная революция 1970-х и 1980-х годов. Гейтс родился и вырос в Сиэтле, штат Вашингтон, в 1975 году вместе с другом детства Полом Алленом в Альбукерке, штат Нью-Мексико, и основал компанию Microsoft. она стала крупнейшей в мире компанией-разработчиком программного обеспечения для персональных компьютеров. 
Гейтс руководил компанией в качестве председателя и генерального директора, пока в январе 2000 года не ушел с поста генерального директора, но остался председателем и стал главным архитектором программного обеспечения. В конце 1990-х Гейтс подвергся критике за свою деловую тактику, которая считалась антиконкурентной. Это мнение было подтверждено многочисленными судебными решениями. В июне 2006 года Гейтс объявил, что перейдет на неполный рабочий день в Microsoft и будет работать на полную ставку в Фонде Билла и Мелинды Гейтс, частном благотворительном фонде, который он и его жена Мелинда Гейтс создали в 2000 году. [ 9] Постепенно он передал свои обязанности Рэю Оззи и Крейгу Манди. Он ушел с поста президента Microsoft в феврале 2014 года и занял новую должность консультанта по технологиям для поддержки вновь назначенного генерального директора Сатья Наделла.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Уильям Генри Гейтс III (родился 28 октября 1955 года) - американский бизнес-магнат, разработчик программного обеспечения, инвестор и филантроп. Он наиболее известен как соучредитель корпорации Microsoft. За время своей карьеры в Microsoft Гейтс занимал должности председателя, главного исполнительного директора (CEO), президента и главного архитектора программного обеспечения, а также был крупнейшим индивидуальным акционером до мая 2014 года. Он является одним из самых известных предпринимателей и пионеров микрокомпьютерная революция 1970-х и 1980-х годов. Гейтс родился и вырос в Сиэтле, штат Вашингтон, в 1975 году вместе с другом детства Полом Алленом в Альбукерке, штат Нью-Мексико, и основал компанию Microsoft. она стала крупнейшей в мире компанией-разработчиком программного обеспечения для персональных компьютеров. 
Гейтс руководил компанией в качестве председателя и генерального директора, пока в январе 2000 года не ушел с поста генерального директора, но остался председателем и стал главным архитектором программного обеспечения. В конце 1990-х Гейтс подвергся критике за свою деловую тактику, которая считалась антиконкурентной. Это мнение было подтверждено многочисленными судебными решениями. В июне 2006 года Гейтс объявил, что перейдет на неполный рабочий день в Microsoft и будет работать на полную ставку в Фонде Билла и Мелинды Гейтс, частном благотворительном фонде, который он и его жена Мелинда Гейтс создали в 2000 году. [ 9] Постепенно он передал свои обязанности Рэю Оззи и Крейгу Манди. Он ушел с поста президента Microsoft в феврале 2014 года и занял новую должность консультанта по технологиям для поддержки вновь назначенного генерального директора Сатья Наделла."""] ner_df = nlu.load('ru.ner.wikiner.glove.6B_300').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title} ## Results ```bash +----------------------+---------+ |chunk |ner_label| +----------------------+---------+ |Уильям Генри Гейтс III|PER | |Microsoft |ORG | |Microsoft Гейтс |ORG | |CEO |ORG | |Гейтс |PER | |Сиэтле |LOC | |Вашингтон |LOC | |Полом Алленом |PER | |Альбукерке |LOC | |Нью-Мексико |LOC | |Microsoft |ORG | |Гейтс |PER | |Гейтс |PER | |Это мнение |MISC | |Гейтс |PER | |Microsoft |ORG | |Фонде Билла |MISC | |Мелинды Гейтс |PER | |Мелинда Гейтс |PER | |Постепенно |PER | +----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wikiner_6B_300| |Type:|ner| |Compatibility:| Spark NLP 2.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ru| |Case sensitive:|false| {:.h2_title} ## Data Source The model is trained based on data from [https://ru.wikipedia.org](https://ru.wikipedia.org) --- layout: model title: English BertForTokenClassification Cased model (from testing) author: John Snow Labs name: bert_token_classifier_autonlp_ingredient_sentiment_analysis_19126711 date: 2022-11-30 tags: [en, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-ingredient_sentiment_analysis-19126711` is an English model originally trained by `testing`. 
## Predicted Entities `I`, `B` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autonlp_ingredient_sentiment_analysis_19126711_en_4.2.4_3.0_1669814269719.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autonlp_ingredient_sentiment_analysis_19126711_en_4.2.4_3.0_1669814269719.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autonlp_ingredient_sentiment_analysis_19126711","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autonlp_ingredient_sentiment_analysis_19126711","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_autonlp_ingredient_sentiment_analysis_19126711| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/testing/autonlp-ingredient_sentiment_analysis-19126711 --- layout: model title: Fast Neural Machine Translation Model from Marshallese to English author: John Snow Labs name: opus_mt_mh_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, mh, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `mh` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_mh_en_xx_2.7.0_2.4_1609170065092.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_mh_en_xx_2.7.0_2.4_1609170065092.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_mh_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_mh_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.mh.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_mh_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Ukrainian Part of Speech Tagger (Base) author: John Snow Labs name: bert_pos_bert_base_slavic_cyrillic_upos date: 2022-04-26 tags: [bert, pos, part_of_speech, uk, open_source] task: Part of Speech Tagging language: uk edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-slavic-cyrillic-upos` is a Ukrainian model originally trained by `KoichiYasuoka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_slavic_cyrillic_upos_uk_3.4.2_3.0_1650993899127.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_slavic_cyrillic_upos_uk_3.4.2_3.0_1650993899127.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_slavic_cyrillic_upos","uk") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Я люблю Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_slavic_cyrillic_upos","uk") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Я люблю Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("uk.pos.bert_base_slavic_cyrillic_upos").predict("""Я люблю Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_slavic_cyrillic_upos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|uk| |Size:|668.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/KoichiYasuoka/bert-base-slavic-cyrillic-upos - https://universaldependencies.org/be/ - https://universaldependencies.org/bg/ - https://universaldependencies.org/ru/ - https://universaldependencies.org/treebanks/sr_set/ - https://universaldependencies.org/treebanks/uk_iu/ - https://universaldependencies.org/u/pos/ - https://github.com/KoichiYasuoka/esupar --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from risethi) author: John Snow Labs name: distilbert_qa_risethi_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is a English model originally trained by `risethi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_risethi_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772308308.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_risethi_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772308308.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_risethi_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_risethi_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_risethi_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/risethi/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal Court Decisions Unanimity Prediction (Portuguese) author: John Snow Labs name: legclf_court_decisions_unanimity date: 2023-04-14 tags: [pt, licensed, legal, classification, tensorflow] task: Text Classification language: pt edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a multiclass classification model that identifies whether State Supreme Court decisions were unanimous (`unanimity`), not unanimous (`not_unanimity`), or undetermined (`not_determined`). ## Predicted Entities `unanimity`, `not_unanimity`, `not_determined` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_court_decisions_unanimity_pt_1.0.0_3.0_1681479825814.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_court_decisions_unanimity_pt_1.0.0_3.0_1681479825814.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") seq_classifier = legal.BertForSequenceClassification.pretrained("legclf_court_decisions_unanimity", "pt", "legal/models") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, seq_classifier ]) # Example text example = spark.createDataFrame([["ACÓRDÃO/SALVO-CONDUTO PENAL E PROCESSUAL PENAL HABEAS CORPUS. CRIMES CONTRA A HONRA. ALEGAÇÃO DE INCOMPETÊNCIA DO JUÍZO, MANIPULAÇÃO DAS PROVAS NA AÇÃO ORIGINÁRIA E DEFEITO DE REPRESENTAÇÃO. PRELIMINARES AFASTADAS. IMPOSIÇÃO DE MEDIDAS CAUTELARES DIVERSAS DE PRISÃO. NECESSIDADE DE DEMONSTRAÇÃO DA ADEQUAÇÃO E PROPORCIONALIDADE DAS MEDIDAS. PRECEDENTES DO STJ. DECISÃO DESPROVIDA DE FUNDAMENTAÇÃO. MERA REFERÊNCIA À GRAVIDADE DOS FATOS NARRADOS. NULIDADE. INTELIGÊNCIA DO ARTIGO 93, IX, DA CRFB. CONSTRANGIMENTO ILEGAL EVIDENCIADO. ORDEM CONCEDIDA. 1 Observando que as penas máximas cominadas aos crimes imputados aos querelantes, em concurso material, excedem a dois anos, não há que se falar em competência dos Juizados Especiais Criminais. 2 Na estreita via do mandamus, não se faz possível avaliar eventual manipulação de provas procedida pela vítima nos autos originários. 3 Embora menos gravosas que a segregação cautelar, as medidas cautelares possuem o condão de restringir direitos individuais, razão pela qual não dispensam fundamentação acerca da sua necessidade e adequação. 
4 Estando a decisão atacada totalmente desprovida de fundamentação, forçoso declarar a sua nulidade, revogando as medidas cautelares impostas."]]).toDF("text") empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) result = model.transform(example) # result is a DataFrame result.select("text", "class.result").show(truncate=200) ```
## Results ```bash +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+ | text| result| +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+ |ACÓRDÃO/SALVO-CONDUTO PENAL E PROCESSUAL PENAL HABEAS CORPUS. CRIMES CONTRA A HONRA. ALEGAÇÃO DE INCOMPETÊNCIA DO JUÍZO, MANIPULAÇÃO DAS PROVAS NA AÇÃO ORIGINÁRIA E DEFEITO DE REPRESENTAÇÃO. PRELIM...|[not_determined]| +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_court_decisions_unanimity| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|pt| |Size:|309.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References Train dataset available [here](https://huggingface.co/datasets/joelito/brazilian_court_decisions) ## Benchmarking ```bash label precision recall f1-score support not_determined 0.76 0.80 0.78 116 not_unanimity 0.95 0.78 0.86 23 unanimity 0.73 0.71 0.72 90 accuracy - - 0.76 229 macro-avg 0.81 0.77 0.79 229 weighted-avg 0.77 0.76 0.76 229 ``` --- layout: model title: English BertForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_6 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: 
BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-16-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_6_en_4.0.0_3.0_1657192170921.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_6_en_4.0.0_3.0_1657192170921.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_6","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|375.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-16-finetuned-squad-seed-6 --- layout: model title: Fast Neural Machine Translation Model from English to West Slavic languages author: John Snow Labs name: opus_mt_en_zlw date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, zlw, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `zlw` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_zlw_xx_2.7.0_2.4_1609164254897.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_zlw_xx_2.7.0_2.4_1609164254897.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_zlw", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_zlw", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.zlw').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_zlw| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Shunichiro) author: John Snow Labs name: distilbert_qa_shunichiro_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Shunichiro`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_shunichiro_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769221544.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_shunichiro_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769221544.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_shunichiro_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_shunichiro_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_shunichiro_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Shunichiro/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal Validity Clause Binary Classifier author: John Snow Labs name: legclf_validity_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `validity` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
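As a rough illustration of the first splitting technique listed above (paragraph splitting by multiline), here is a minimal sketch using only the Python standard library; the helper name `split_paragraphs` and the sample text are ours for illustration and are not part of Spark NLP:

```python
import re

def split_paragraphs(text: str) -> list:
    """Illustrative helper (not a Spark NLP API): split a legal document
    into candidate clause chunks on blank lines, so each chunk can be
    classified as its own row."""
    # One or more blank lines separate paragraphs.
    chunks = re.split(r"\n\s*\n", text)
    # Drop whitespace-only chunks and trim the rest.
    return [c.strip() for c in chunks if c.strip()]

doc = "12. VALIDITY.\nThis Agreement shall remain valid...\n\n13. NOTICES.\nAll notices shall be in writing..."
print(split_paragraphs(doc))
```

Each resulting chunk would then be fed to the classifier pipeline as a separate row, which also helps keep each input under the 512-token limit.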
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `validity` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_validity_clause_en_1.0.0_3.2_1660124112127.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_validity_clause_en_1.0.0_3.2_1660124112127.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_validity_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +----------+ | result| +----------+ |[validity]| |[other]| |[other]| |[validity]| +----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_validity_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.96 0.98 0.97 89 validity 0.94 0.89 0.92 37 accuracy - - 0.95 126 macro-avg 0.95 0.93 0.94 126 weighted-avg 0.95 0.95 0.95 126 ``` --- layout: model title: Translate English to Finno-Ugrian languages Pipeline author: John Snow Labs name: translate_en_fiu date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, fiu, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `fiu` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_fiu_xx_2.7.0_2.4_1609687072182.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_fiu_xx_2.7.0_2.4_1609687072182.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_fiu", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_fiu", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.fiu').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_fiu| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Mapping Entities with Corresponding ICD-10-CM Codes author: John Snow Labs name: icd10cm_mapper date: 2022-10-29 tags: [icd10cm, chunk_mapper, clinical, licensed, en] task: Chunk Mapping language: en nav_key: models edition: Spark NLP for Healthcare 4.2.1 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps entities with their corresponding ICD-10-CM codes. ## Predicted Entities `icd10cm_code` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10cm_mapper_en_4.2.1_3.0_1667082016627.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10cm_mapper_en_4.2.1_3.0_1667082016627.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel\ .pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel\ .pretrained("ner_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols("sentence", "token", "ner")\ .setOutputCol("ner_chunk") chunkerMapper = ChunkMapperModel\ .pretrained("icd10cm_mapper", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings")\ .setRels(["icd10cm_code"]) mapper_pipeline = Pipeline().setStages([ document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter, chunkerMapper]) test_data = spark.createDataFrame([["A 35-year-old male with a history of primary leiomyosarcoma of neck, gestational diabetes mellitus diagnosed eight years prior to presentation and presented with a one-week history of polydipsia, poor appetite, and vomiting."]]).toDF("text") mapper_model = mapper_pipeline.fit(test_data) result= mapper_model.transform(test_data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel .pretrained("ner_clinical", "en", 
"clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols("sentence", "token", "ner") .setOutputCol("ner_chunk") val chunkerMapper = ChunkMapperModel .pretrained("icd10cm_mapper", "en", "clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("mappings") .setRels(Array("icd10cm_code")) val mapper_pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter, chunkerMapper)) val data = Seq("A 35-year-old male with a history of primary leiomyosarcoma of neck, gestational diabetes mellitus diagnosed eight years prior to presentation and presented with a one-week history of polydipsia, poor appetite, and vomiting.").toDS.toDF("text") val result = mapper_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.icd10cm").predict("""A 35-year-old male with a history of primary leiomyosarcoma of neck, gestational diabetes mellitus diagnosed eight years prior to presentation and presented with a one-week history of polydipsia, poor appetite, and vomiting.""") ```
## Results ```bash +------------------------------+-------+------------+ |ner_chunk |entity |icd10cm_code| +------------------------------+-------+------------+ |primary leiomyosarcoma of neck|PROBLEM|C49.0 | |gestational diabetes mellitus |PROBLEM|O24.919 | |polydipsia |PROBLEM|R63.1 | |poor appetite |PROBLEM|R63.0 | |vomiting |PROBLEM|R11.10 | +------------------------------+-------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd10cm_mapper| |Compatibility:|Spark NLP for Healthcare 4.2.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|6.2 MB| --- layout: model title: Sentence Entity Resolver for billable ICD10-CM HCC codes (Slim) author: John Snow Labs name: sbiobertresolve_icd10cm_slim_billable_hcc date: 2021-05-21 tags: [licensed, clinical, en, entity_resolution] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD10-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. This model has been augmented with synonyms and synonyms having low cosine similarity are dropped, making the model slim. ## Predicted Entities Outputs 7-digit billable ICD codes. In the result, look for aux_label parameter in the metadata to get HCC status. The HCC status can be divided to get further information: billable status, hcc status, and hcc score. For example, in the example shared below the billable status is 1, hcc status is 1, and hcc score is 8. 
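The `aux_label` metadata can be unpacked with a few lines of plain Python. This is a hedged sketch: we assume the three fields arrive joined by pipe delimiters (e.g. `1||1||8`); the exact delimiter may vary by version, and `parse_hcc_status` is our own illustrative helper, not a Spark NLP API:

```python
def parse_hcc_status(aux_label: str) -> dict:
    """Illustrative helper: split an aux_label such as '1||1||8' into
    billable status, HCC status, and HCC score (delimiter is an assumption)."""
    # Normalize '||' to '|' so both delimiter variants parse the same way.
    parts = [p for p in aux_label.replace("||", "|").split("|") if p]
    billable, hcc_status, hcc_score = parts
    return {"billable": billable, "hcc_status": hcc_status, "hcc_score": hcc_score}

print(parse_hcc_status("1||1||8"))  # the example values from the description
```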
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_slim_billable_hcc_en_3.0.4_2.4_1621588560429.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_slim_billable_hcc_en_3.0.4_2.4_1621588560429.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sbert_embeddings") icd10_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_icd10cm_slim_billable_hcc","en", "clinical/models")\ .setInputCols(["document", "sbert_embeddings"])\ .setOutputCol("icd10cm_code")\ .setDistanceFunction("EUCLIDEAN")\ .setReturnCosineDistances(True) bert_pipeline_icd = Pipeline(stages = [document_assembler, sbert_embedder, icd10_resolver]) data = spark.createDataFrame([["metastatic lung cancer"]]).toDF("text") results = bert_pipeline_icd.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sbert_embeddings") val icd10_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_icd10cm_slim_billable_hcc","en", "clinical/models") .setInputCols(Array("document", "sbert_embeddings")) .setOutputCol("icd10cm_code") .setDistanceFunction("EUCLIDEAN") .setReturnCosineDistances(true) val bert_pipeline_icd = new Pipeline().setStages(Array(document_assembler, sbert_embedder, icd10_resolver)) val data = Seq("metastatic lung cancer").toDF("text") val result = bert_pipeline_icd.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd10cm.slim_billable_hcc").predict("""metastatic lung cancer""") ```
## Results ```bash | | chunks | code | resolutions | all_codes | billable_hcc_status_score | all_distances | |---:|:-----------------------|:-------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------|:----------------------------|:-------------------------------------------------------------------------------------------------------------------------| | 0 | metastatic lung cancer | C7800 | ['cancer metastatic to lung', 'metastasis from malignant tumor of lung', 'cancer metastatic to left lung', 'history of cancer metastatic to lung', 'metastatic cancer', 'history of cancer metastatic to lung (situation)', 'metastatic adenocarcinoma to bilateral lungs', 'cancer metastatic to chest wall', 'metastatic malignant neoplasm to left lower lobe of lung', 'metastatic carcinoid tumour', 'cancer metastatic to respiratory tract', 'metastatic carcinoid tumor'] | ['C7800', 'C349', 'C7801', 'Z858', 'C800', 'Z8511', 'C780', 'C798', 'C7802', 'C799', 'C7830', 'C7B00'] | ['1', '1', '8'] | ['0.0464', '0.0829', '0.0852', '0.0860', '0.0914', '0.0989', '0.1133', '0.1220', '0.1220', '0.1253', '0.1249', '0.1260'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icd10cm_slim_billable_hcc| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Case sensitive:|false| --- layout: model title: Fast Neural Machine Translation 
Model from English to Congo Swahili author: John Snow Labs name: opus_mt_en_swc date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, swc, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `swc` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_swc_xx_2.7.0_2.4_1609167640396.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_swc_xx_2.7.0_2.4_1609167640396.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_swc", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Put the text you want to translate here."] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_swc", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Put the text you want to translate here.").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.swc').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_swc| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Toxic content classifier for German author: John Snow Labs name: distilbert_base_sequence_classifier_toxicity date: 2022-02-16 tags: [toxic, distilbert, sequence_classifier, de, open_source] task: Text Classification language: de edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: DistilBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model was imported from `Hugging Face` ([link](https://huggingface.co/ml6team/distilbert-base-german-cased-toxic-comments)) and it's been trained on GermEval21 and IWG Hatespeech datasets for the German language, leveraging `Distil-BERT` embeddings and `DistilBertForSequenceClassification` for text classification purposes. ## Predicted Entities `non_toxic`, `toxic` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_base_sequence_classifier_toxicity_de_3.4.0_3.0_1645021339319.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_base_sequence_classifier_toxicity_de_3.4.0_3.0_1645021339319.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = DistilBertForSequenceClassification\ .pretrained('distilbert_base_sequence_classifier_toxicity', 'de') \ .setInputCols(['token', 'document']) \ .setOutputCol('class') pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier]) example = spark.createDataFrame([["Natürlich kann ich von zuwanderern mehr erwarten. muss ich sogar. sie müssen die sprache lernen, sie müssen die gepflogenheiten lernen und sich in die gesellschaft einfügen. dass muss ich nicht weil ich mich schon in die gesellschaft eingefügt habe. egal wo du hin ziehst, nirgendwo wird dir soviel zucker in den arsch geblasen wie in deutschland."]]).toDF("text") result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val sequenceClassifier = DistilBertForSequenceClassification.pretrained("distilbert_base_sequence_classifier_toxicity", "de") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) val example = Seq("Natürlich kann ich von zuwanderern mehr erwarten. muss ich sogar. sie müssen die sprache lernen, sie müssen die gepflogenheiten lernen und sich in die gesellschaft einfügen. dass muss ich nicht weil ich mich schon in die gesellschaft eingefügt habe. egal wo du hin ziehst, nirgendwo wird dir soviel zucker in den arsch geblasen wie in deutschland.").toDF("text") val result = pipeline.fit(example).transform(example) ```
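A sequence classifier like this one ultimately picks the label whose logit is highest after a softmax over the two classes. A minimal sketch of that final step, using entirely hypothetical logit values rather than actual model outputs:

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["non_toxic", "toxic"]
logits = [-1.3, 2.1]  # hypothetical outputs of the classification head for one comment
probs = softmax(logits)
predicted = labels[probs.index(max(probs))]
print(predicted)  # -> toxic
```

Subtracting the maximum logit before exponentiating avoids overflow without changing the resulting probabilities.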
## Results ```bash ['toxic'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_base_sequence_classifier_toxicity| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[class]| |Language:|de| |Size:|252.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - [GermEval21](https://github.com/germeval2021toxic/SharedTask/tree/main/Data%20Sets) - [IWG Hatespeech](https://github.com/UCSM-DUE/IWG_hatespeech_public) --- layout: model title: Legal Environmental Policy Document Classifier (EURLEX) author: John Snow Labs name: legclf_environmental_policy_bert date: 2023-03-06 tags: [en, legal, classification, clauses, environmental_policy, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_environmental_policy_bert model is a Bert Sentence Embeddings Document Classifier that determines whether a given document belongs to the class Environmental_Policy or not (binary classification) according to EuroVoc labels. ## Predicted Entities `Environmental_Policy`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_environmental_policy_bert_en_1.0.0_3.0_1678111880590.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_environmental_policy_bert_en_1.0.0_3.0_1678111880590.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_environmental_policy_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Environmental_Policy]| |[Other]| |[Other]| |[Environmental_Policy]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_environmental_policy_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Environmental_Policy 0.85 0.86 0.86 227 Other 0.85 0.84 0.85 216 accuracy - - 0.85 443 macro-avg 0.85 0.85 0.85 443 weighted-avg 0.85 0.85 0.85 443 ``` --- layout: model title: Sentence Entity Resolver for Clinical Abbreviations and Acronyms (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_clinical_abbreviation_acronym date: 2022-01-03 tags: [abbreviation, entity_resolver, licensed, en, clinical, acronym] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical abbreviations and acronyms to their meanings using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It is the first primitive version of abbreviation resolution and will be improved further in the following releases. 
## Predicted Entities `Abbreviation Meanings` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_clinical_abbreviation_acronym_en_3.3.4_2.4_1641239524364.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_clinical_abbreviation_acronym_en_3.3.4_2.4_1641239524364.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["document", "token"])\ .setOutputCol("word_embeddings") clinical_ner = MedicalNerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models") \ .setInputCols(["document", "token", "word_embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["document", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(['ABBR']) sentence_chunk_embeddings = BertSentenceChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\ .setInputCols(["document", "ner_chunk"])\ .setOutputCol("sentence_embeddings")\ .setChunkWeight(0.5)\ .setCaseSensitive(True) abbr_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_clinical_abbreviation_acronym", "en", "clinical/models") \ .setInputCols(["sentence_embeddings"]) \ .setOutputCol("abbr_meaning")\ .setDistanceFunction("EUCLIDEAN") resolver_pipeline = Pipeline( stages = [ document_assembler, tokenizer, word_embeddings, clinical_ner, ner_converter, sentence_chunk_embeddings, abbr_resolver ]) text = "The patient admitted from the IR for aggressive irrigation of the Miami pouch. DISCHARGE DIAGNOSES: 1. A 58-year-old female with a history of stage 2 squamous cell carcinoma of the cervix status post total pelvic exenteration in 1991."
sample_text = spark.createDataFrame([[text]]).toDF('text') abbr_result = resolver_pipeline.fit(sample_text).transform(sample_text) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("document", "token")) .setOutputCol("word_embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models") .setInputCols(Array("document", "token", "word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("ABBR")) val sentence_chunk_embeddings = BertSentenceChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols(Array("document", "ner_chunk")) .setOutputCol("sentence_embeddings") .setChunkWeight(0.5) .setCaseSensitive(true) val abbr_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_clinical_abbreviation_acronym", "en", "clinical/models") .setInputCols(Array("sentence_embeddings")) .setOutputCol("abbr_meaning") .setDistanceFunction("EUCLIDEAN") val resolver_pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, word_embeddings, clinical_ner, ner_converter, sentence_chunk_embeddings, abbr_resolver)) val sample_text = Seq("The patient admitted from the IR for aggressive irrigation of the Miami pouch. DISCHARGE DIAGNOSES: 1.
A 58-year-old female with a history of stage 2 squamous cell carcinoma of the cervix status post total pelvic exenteration in 1991.").toDF("text") val abbr_result = resolver_pipeline.fit(sample_text).transform(sample_text) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.clinical_abbreviation_acronym").predict("""The patient admitted from the IR for aggressive irrigation of the Miami pouch. DISCHARGE DIAGNOSES: 1. A 58-year-old female with a history of stage 2 squamous cell carcinoma of the cervix status post total pelvic exenteration in 1991.""") ```
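The resolver returns its top-k alternatives packed into `:::`-joined strings in metadata fields such as `all_k_resolutions` and `all_k_cosine_distances`. A small helper, sketched here in plain Python with the candidate values for the abbreviation `IR`, can unpack them into ranked (meaning, distance) pairs:

```python
def parse_topk(all_resolutions, all_distances, sep=":::"):
    """Zip ':::'-joined resolver output columns into ranked (meaning, distance) pairs."""
    meanings = all_resolutions.split(sep)
    distances = [float(d) for d in all_distances.split(sep)]
    return list(zip(meanings, distances))

# Candidate meanings and distances for the chunk "IR".
resolutions = "interventional radiology:::immediate-release:::(stage) IA:::intraarterial"
distances = "0.0156:::0.0945:::0.1046:::0.1111"

topk = parse_topk(resolutions, distances)
print(topk[0])  # ('interventional radiology', 0.0156)
```

The first pair is the resolver's chosen meaning; the remaining pairs are useful when you want to review near-miss candidates or apply a distance threshold.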
## Results ```bash +-------+---------+------+------------------------+-------------------------------------------------------------------------+-----------------+---------------------------------+ |sent_id|ner_chunk|entity| abbr_meaning| all_k_results|all_k_resolutions| all_k_cosine_distances| +-------+---------+------+------------------------+-------------------------------------------------------------------------+-----------------+---------------------------------+ | 0| IR| ABBR|interventional radiology|interventional radiology:::immediate-release:::(stage) IA:::intraarterial|IR:::IR:::IA:::IA|0.0156:::0.0945:::0.1046:::0.1111| +-------+---------+------+------------------------+-------------------------------------------------------------------------+-----------------+---------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_clinical_abbreviation_acronym| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[abbr_meaning]| |Language:|en| |Size:|105.3 MB| |Case sensitive:|false| --- layout: model title: Legal Chemistry Document Classifier (EURLEX) author: John Snow Labs name: legclf_chemistry_bert date: 2023-03-06 tags: [en, legal, classification, clauses, chemistry, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. 
The legclf_chemistry_bert model is a Bert Sentence Embeddings Document Classifier that determines whether a given document belongs to the class Chemistry or not (binary classification) according to EuroVoc labels. ## Predicted Entities `Chemistry`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_chemistry_bert_en_1.0.0_3.0_1678111646937.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_chemistry_bert_en_1.0.0_3.0_1678111646937.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_chemistry_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Chemistry]| |[Other]| |[Other]| |[Chemistry]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_chemistry_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Chemistry 0.84 0.92 0.88 180 Other 0.93 0.85 0.89 212 accuracy - - 0.88 392 macro-avg 0.88 0.89 0.88 392 weighted-avg 0.89 0.88 0.88 392 ``` --- layout: model title: Pipeline to Detect Clinical Events author: John Snow Labs name: ner_events_clinical_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_events_clinical](https://nlp.johnsnowlabs.com/2021/03/31/ner_events_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_events_clinical_pipeline_en_4.3.0_3.2_1678866574652.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_events_clinical_pipeline_en_4.3.0_3.2_1678866574652.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_events_clinical_pipeline", "en", "clinical/models") text = '''The patient presented to the emergency room last evening.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_events_clinical_pipeline", "en", "clinical/models") val text = "The patient presented to the emergency room last evening." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.events_clinical.pipeline").predict("""The patient presented to the emergency room last evening.""") ```
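The `begin` and `end` values this pipeline reports for each chunk are character offsets into the input string, with `end` inclusive, so slicing the original text with `end + 1` recovers the chunk. A quick plain-Python sanity check using the sample sentence above:

```python
def chunk_text(text, begin, end):
    """Spark NLP annotations use inclusive character offsets, so slice with end + 1."""
    return text[begin:end + 1]

text = "The patient presented to the emergency room last evening."

print(chunk_text(text, 12, 20))  # presented
print(chunk_text(text, 25, 42))  # the emergency room
print(chunk_text(text, 44, 55))  # last evening
```

Keeping this inclusive-offset convention in mind avoids off-by-one errors when mapping chunks back onto the source document.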
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-------------------|--------:|------:|:--------------|-------------:| | 0 | presented | 12 | 20 | OCCURRENCE | 0.7132 | | 1 | the emergency room | 25 | 42 | CLINICAL_DEPT | 0.723267 | | 2 | last evening | 44 | 55 | DATE | 0.90555 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_events_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English LongformerForQuestionAnswering model (from abhijithneilabraham) author: John Snow Labs name: longformer_qa_covid date: 2022-06-26 tags: [en, open_source, longformer, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: LongformerForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `longformer_covid_qa` is a English model originally trained by `abhijithneilabraham`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/longformer_qa_covid_en_4.0.0_3.0_1656255806950.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/longformer_qa_covid_en_4.0.0_3.0_1656255806950.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_covid","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifer = LongformerForQuestionAnswering.pretrained("longformer_qa_covid","en") .setInputCols(Array("document", "token")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq("What is my name?", "My name is Clara and I live in Berkeley.").toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.covid.longformer").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
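Extractive question-answering models of this kind score every context token as a possible answer start and answer end, then return the span with the best combined score (subject to the end not preceding the start). A toy sketch of that span-selection step, with entirely hypothetical logits and a hypothetical helper name:

```python
def best_span(start_logits, end_logits, max_len=10):
    """Pick (start, end) maximizing start_logit + end_logit with end >= start."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Hypothetical logits peaking at "Clara" for the question "What is my name?"
start_logits = [0.1, 0.2, 0.1, 5.0, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]
end_logits   = [0.0, 0.1, 0.2, 4.8, 0.1, 0.0, 0.0, 0.1, 0.4, 0.0]

s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```

The `max_len` cap is a common practical constraint that keeps the search from returning implausibly long answer spans.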
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|longformer_qa_covid| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|556.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/abhijithneilabraham/longformer_covid_qa - https://github.com/abhijithneilabraham/Covid-QA --- layout: model title: Smaller BERT Embeddings (L-8_H-768_A-12) author: John Snow Labs name: small_bert_L8_768 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L8_768_en_2.6.0_2.4_1598345245072.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L8_768_en_2.6.0_2.4_1598345245072.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L8_768", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L8_768", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L8_768').predict(text, output_level='token') embeddings_df ```
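Once extracted, the 768-dimensional token vectors can be compared directly, for example with cosine similarity to judge how related two tokens are. A minimal sketch with tiny hypothetical vectors standing in for real BERT embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-d vectors standing in for 768-d BERT token embeddings (hypothetical values).
love = [0.2, 0.9, 0.1, 0.4]
like = [0.25, 0.85, 0.05, 0.5]
rock = [0.9, -0.3, 0.8, -0.1]

print(cosine_similarity(love, like) > cosine_similarity(love, rock))  # True
```

With real embeddings the same comparison lets you cluster tokens, deduplicate phrases, or feed similarity scores into downstream components.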
{:.h2_title} ## Results ```bash token en_embed_bert_small_L8_768_embeddings I [-0.17749905586242676, 0.12339337915182114, -0... love [-0.19556283950805664, -0.24686720967292786, 0... NLP [-0.5068103671073914, -0.5276650786399841, -0.... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|small_bert_L8_768| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-768_A-12/1 --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from ftorres) author: John Snow Labs name: distilbert_qa_ftorres_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is a English model originally trained by `ftorres`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_ftorres_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770750585.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_ftorres_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770750585.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ftorres_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ftorres_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_ftorres_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/ftorres/distilbert-base-uncased-finetuned-squad --- layout: model title: Match Pattern author: John Snow Labs name: match_pattern date: 2022-06-19 tags: [en, open_source] task: Text Classification language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The match_pattern is a pretrained pipeline that processes text with a few basic steps and matches patterns. It performs most of the common text processing tasks on your dataframe. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/match_pattern_en_4.0.0_3.0_1655653735848.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/match_pattern_en_4.0.0_3.0_1655653735848.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("match_pattern", "en") result = pipeline.annotate("""I love johnsnowlabs! """) ```
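The RegexMatcherModel stage inside this pipeline tags substrings that match a set of regex rules, each rule paired with an identifier. A rough plain-Python sketch of the same idea using `re`, with hypothetical rules rather than the pipeline's actual rule file:

```python
import re

def match_patterns(text, rules):
    """Return (match, identifier, begin, end) tuples for each regex rule,
    with end inclusive, similar in spirit to what a RegexMatcher stage emits."""
    hits = []
    for pattern, identifier in rules:
        for m in re.finditer(pattern, text):
            hits.append((m.group(), identifier, m.start(), m.end() - 1))
    return hits

# Hypothetical rules: ISO dates and capitalized words.
rules = [
    (r"\d{4}-\d{2}-\d{2}", "date"),
    (r"\b[A-Z][a-z]+\b", "capitalized"),
]

print(match_patterns("Released on 2022-06-19 by John", rules))
```

Each hit carries both the matched text and its (inclusive) character offsets, which is what lets downstream stages anchor matches back into the document.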
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|match_pattern| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|29.0 KB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - RegexMatcherModel --- layout: model title: Word2Vec Embeddings in Emiliano-Romagnolo (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, eml, open_source] task: Embeddings language: eml edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_eml_3.4.1_3.0_1647294563985.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_eml_3.4.1_3.0_1647294563985.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","eml") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","eml") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("eml.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|eml|
|Size:|98.3 MB|
|Case sensitive:|false|
|Dimension:|300|

---
layout: model
title: Translate English to Cebuano Pipeline
author: John Snow Labs
name: translate_en_ceb
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, ceb, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list).

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `en`
- target languages: `ceb`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ceb_xx_2.7.0_2.4_1609686571279.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ceb_xx_2.7.0_2.4_1609686571279.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ceb", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ceb", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ceb').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_en_ceb|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English BertForQuestionAnswering model (from huggingface-course)
author: John Snow Labs
name: bert_qa_huggingface_course_bert_finetuned_squad_accelerate
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad-accelerate` is an English model originally trained by `huggingface-course`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_huggingface_course_bert_finetuned_squad_accelerate_en_4.0.0_3.0_1654535856581.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_huggingface_course_bert_finetuned_squad_accelerate_en_4.0.0_3.0_1654535856581.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_huggingface_course_bert_finetuned_squad_accelerate","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
  .setInputCols("question", "context")
  .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
  .pretrained("bert_qa_huggingface_course_bert_finetuned_squad_accelerate","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
  ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
  ("What's my name?", "My name is Clara and I live in Berkeley."))
  .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.accelerate.by_huggingface-course").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_huggingface_course_bert_finetuned_squad_accelerate|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/huggingface-course/bert-finetuned-squad-accelerate

---
layout: model
title: Legal Underwriting Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_underwriting_agreement_bert
date: 2022-12-06
tags: [en, legal, classification, agreement, underwriting, licensed, bert, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The `legclf_underwriting_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `underwriting-agreement` or not (binary classification). Unlike the Longformer model, this model is lighter in terms of inference time.

## Predicted Entities

`underwriting-agreement`, `other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_underwriting_agreement_bert_en_1.0.0_3.0_1670349963889.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_underwriting_agreement_bert_en_1.0.0_3.0_1670349963889.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_underwriting_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
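This card shows only the Python (johnsnowlabs wrapper) version of the pipeline. For the Scala side of the language switcher, a sketch of the equivalent stages follows; the class names (`BertSentenceEmbeddings`, `ClassifierDLModel`) are assumptions based on the open-source Spark NLP Scala API, and the licensed Legal NLP package must be on the classpath for the `legal/models` download to resolve:

```scala
// Sketch: Scala equivalent of the Python pipeline above (assumed API).
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

val docClassifier = ClassifierDLModel.pretrained("legclf_underwriting_agreement_bert", "en", "legal/models")
  .setInputCols(Array("sentence_embeddings"))
  .setOutputCol("category")

val pipeline = new Pipeline().setStages(Array(documentAssembler, embeddings, docClassifier))

val df = Seq("YOUR TEXT HERE").toDF("text")
val result = pipeline.fit(df).transform(df)
```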
## Results

```bash
+------------------------+
|result                  |
+------------------------+
|[underwriting-agreement]|
|[other]                 |
|[other]                 |
|[underwriting-agreement]|
+------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_underwriting_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|

## References

Legal documents, scraped from the Internet and classified in-house, plus SEC documents

## Benchmarking

```bash
                 label  precision  recall  f1-score  support
                 other       0.97    0.97      0.97       65
underwriting-agreement       0.94    0.94      0.94       36
              accuracy          -       -      0.96      101
             macro-avg       0.96    0.96      0.96      101
          weighted-avg       0.96    0.96      0.96      101
```

---
layout: model
title: Translate Walloon to English Pipeline
author: John Snow Labs
name: translate_wa_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, wa, en, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list).

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `wa` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_wa_en_xx_2.7.0_2.4_1609690770769.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_wa_en_xx_2.7.0_2.4_1609690770769.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_wa_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_wa_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.wa.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_wa_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Detect Assertion Status from Oncology Treatments author: John Snow Labs name: assertion_oncology_treatment_binary_wip date: 2022-10-11 tags: [licensed, clinical, oncology, en, assertion, treatment] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects the assertion status of oncology treatment entities. The model identifies if the treatment has been used (Present_Or_Past status), or if it is mentioned but in hypothetical or absent sentences (Hypothetical_Or_Absent status). ## Predicted Entities `Hypothetical_Or_Absent`, `Present_Or_Past` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_treatment_binary_wip_en_4.0.0_3.0_1665521859244.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_treatment_binary_wip_en_4.0.0_3.0_1665521859244.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(["Cancer_Surgery", "Chemotherapy"]) assertion = AssertionDLModel.pretrained("assertion_oncology_treatment_binary_wip", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion]) data = spark.createDataFrame([["The patient underwent a mastectomy two years ago. 
We recommend to start chemotherapy."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Cancer_Surgery", "Chemotherapy")) val clinical_assertion = AssertionDLModel.pretrained("assertion_oncology_treatment_binary_wip","en","clinical/models") .setInputCols(Array("sentence","ner_chunk","embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion)) val data = Seq("""The patient underwent a mastectomy two years ago. We recommend to start chemotherapy.""").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.oncology_treatment_binary").predict("""The patient underwent a mastectomy two years ago. We recommend to start chemotherapy.""") ```
## Results ```bash | chunk | ner_label | assertion | |:-------------|:---------------|:-----------------------| | mastectomy | Cancer_Surgery | Present_Or_Past | | chemotherapy | Chemotherapy | Hypothetical_Or_Absent | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_oncology_treatment_binary_wip| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion_pred]| |Language:|en| |Size:|1.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label precision recall f1-score support Hypothetical_Or_Absent 0.78 0.83 0.81 170.0 Present_Or_Past 0.81 0.76 0.78 160.0 macro-avg 0.80 0.79 0.79 330.0 weighted-avg 0.79 0.79 0.79 330.0 ``` --- layout: model title: Sundanese asr_wav2vec2_indonesian_javanese_sundanese TFWav2Vec2ForCTC from indonesian-nlp author: John Snow Labs name: pipeline_asr_wav2vec2_indonesian_javanese_sundanese date: 2022-09-24 tags: [wav2vec2, su, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: su edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_indonesian_javanese_sundanese` is a Sundanese model originally trained by indonesian-nlp. 
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_indonesian_javanese_sundanese_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_indonesian_javanese_sundanese_su_4.2.0_3.0_1664038727353.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_indonesian_javanese_sundanese_su_4.2.0_3.0_1664038727353.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_indonesian_javanese_sundanese', lang = 'su') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_indonesian_javanese_sundanese", lang = "su") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_indonesian_javanese_sundanese| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|su| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Spanish RobertaForQuestionAnswering (from BSC-TeMU) author: John Snow Labs name: roberta_qa_BSC_TeMU_roberta_base_bne_sqac date: 2022-06-20 tags: [es, open_source, question_answering, roberta] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bne-sqac` is a Spanish model originally trained by `BSC-TeMU`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_BSC_TeMU_roberta_base_bne_sqac_es_4.0.0_3.0_1655730098181.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_BSC_TeMU_roberta_base_bne_sqac_es_4.0.0_3.0_1655730098181.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_BSC_TeMU_roberta_base_bne_sqac","es") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
  .setInputCols("question", "context")
  .setOutputCols("document_question", "document_context")

val spanClassifier = RoBertaForQuestionAnswering
  .pretrained("roberta_qa_BSC_TeMU_roberta_base_bne_sqac","es")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
  ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
  ("What's my name?", "My name is Clara and I live in Berkeley."))
  .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.sqac.roberta.base.by_BSC-TeMU").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_BSC_TeMU_roberta_base_bne_sqac|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|460.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/BSC-TeMU/roberta-base-bne-sqac
- https://github.com/PlanTL-SANIDAD/lm-spanish
- http://www.bne.es/en/Inicio/index.html
- https://arxiv.org/abs/1907.11692
- https://arxiv.org/abs/2107.07253

---
layout: model
title: Translate Eastern Malayo-Polynesian languages to English Pipeline
author: John Snow Labs
name: translate_pqe_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, pqe, en, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list).

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `pqe` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_pqe_en_xx_2.7.0_2.4_1609699605019.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_pqe_en_xx_2.7.0_2.4_1609699605019.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_pqe_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_pqe_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.pqe.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_pqe_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Persian XLMRobertaForTokenClassification Base Cased model (from BK-V) author: John Snow Labs name: xlmroberta_ner_base_finetuned_peyma date: 2022-08-14 tags: [fa, open_source, xlm_roberta, ner] task: Named Entity Recognition language: fa edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-peyma-fa` is a Persian model originally trained by `BK-V`. ## Predicted Entities `LOC`, `PER`, `MON`, `ORG`, `TIM`, `PCT`, `DAT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_peyma_fa_4.1.0_3.0_1660447374860.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_peyma_fa_4.1.0_3.0_1660447374860.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_peyma","fa") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_peyma","fa")
  .setInputCols(Array("document", "token"))
  .setOutputCol("ner")

val ner_converter = new NerConverter()
  .setInputCols(Array("document", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_base_finetuned_peyma|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|fa|
|Size:|843.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/BK-V/xlm-roberta-base-finetuned-peyma-fa

---
layout: model
title: Estonian CamemBert Embeddings (from EMBEDDIA)
author: John Snow Labs
name: camembert_embeddings_est_roberta
date: 2022-05-31
tags: [et, open_source, camembert, embeddings]
task: Embeddings
language: et
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `est-roberta` is an Estonian model originally trained by `EMBEDDIA`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_est_roberta_et_3.4.4_3.0_1653991675972.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_est_roberta_et_3.4.4_3.0_1653991675972.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_est_roberta","et") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ma armastan sädet nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_est_roberta","et") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ma armastan sädet nlp").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_est_roberta| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|et| |Size:|280.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/EMBEDDIA/est-roberta - https://camembert-model.fr/ --- layout: model title: Vietnamese T5ForConditionalGeneration Base Cased model (from VietAI) author: John Snow Labs name: t5_vit5_base date: 2023-01-31 tags: [vi, open_source, t5, tensorflow] task: Text Generation language: vi edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `vit5-base` is a Vietnamese model originally trained by `VietAI`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_vit5_base_vi_4.3.0_3.0_1675158262308.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_vit5_base_vi_4.3.0_3.0_1675158262308.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_vit5_base","vi") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_vit5_base","vi")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
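T5 generates its output autoregressively, choosing one token at a time. The toy greedy-decoding loop below illustrates the idea with a made-up scoring table — the tokens and scores are invented for illustration, not the model's real vocabulary or weights:

```python
# Toy next-token scores: for each previous token, candidate -> score.
# Purely illustrative; T5's real decoder scores come from the network.
scores = {
    "<s>": {"xin": 0.7, "chao": 0.2},
    "xin": {"chao": 0.9, "</s>": 0.1},
    "chao": {"</s>": 0.8, "xin": 0.1},
}

def greedy_decode(start="<s>", max_len=10):
    # Repeatedly pick the highest-scoring next token until </s>.
    tokens, current = [], start
    for _ in range(max_len):
        candidates = scores.get(current, {})
        if not candidates:
            break
        current = max(candidates, key=candidates.get)
        if current == "</s>":
            break
        tokens.append(current)
    return " ".join(tokens)

print(greedy_decode())  # prints: xin chao
```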
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_vit5_base|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|vi|
|Size:|485.0 MB|

## References

- https://huggingface.co/VietAI/vit5-base
- https://github.com/vietai/ViT5

---
layout: model
title: Extract relations between problem, treatment and test entities (ReDL)
author: John Snow Labs
name: redl_clinical_biobert
date: 2023-01-06
tags: [licensed, clinical, en, relation_extraction, tensorflow]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Extracts eight relation types between problem, treatment, and test entities, such as `TrIP` (a given treatment has improved a medical problem).

## Predicted Entities

`PIP`, `TeCP`, `TeRP`, `TrAP`, `TrCP`, `TrIP`, `TrNAP`, `TrWP`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_clinical_biobert_en_4.2.4_3.0_1673020165617.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_clinical_biobert_en_4.2.4_3.0_1673020165617.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = sparknlp.annotators.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

ner_tagger = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens", "embeddings"])\
    .setOutputCol("ner_tags")

ner_converter = NerConverter()\
    .setInputCols(["sentences", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

# Set a filter on the pairs of named entities that will be treated as relation candidates
re_ner_chunk_filter = RENerChunksFilter()\
    .setInputCols(["ner_chunks", "dependencies"])\
    .setMaxSyntacticDistance(10)\
    .setOutputCol("re_ner_chunks")\
    .setRelationPairs(["problem-test", "problem-treatment"])

# This model was trained on sentence-level relations. A model trained on
# document-level relations would instead take "document" as input at prediction time.
re_model = RelationExtractionDLModel()\ .pretrained('redl_clinical_biobert', 'en', "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) text ="""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation. Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . 
However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge . 
""" data = spark.createDataFrame([[text]]).toDF("text") p_model = pipeline.fit(data) result = p_model.transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols(Array("sentences")) .setOutputCol("tokens") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") .setRelationPairs(Array("problem-test", "problem-treatment")) // The dataset this model is trained to is sentence-wise. // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. 
val re_model = RelationExtractionDLModel() .pretrained("redl_clinical_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation. Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . 
However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.clinical").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation. 
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge . """) ```
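`RENerChunksFilter` keeps only entity-chunk pairs whose types match `setRelationPairs` before the relation classifier scores them. A simplified plain-Python sketch of that pairing step (the chunk list, pair syntax, and output shape are illustrative assumptions, not the annotator's internal code):

```python
from itertools import combinations

# Illustrative entity chunks: (chunk text, entity label)
chunks = [
    ("amoxicillin", "TREATMENT"),
    ("a respiratory tract infection", "PROBLEM"),
    ("serum glucose", "TEST"),
]

allowed = {"problem-test", "problem-treatment"}

def candidate_pairs(chunks, allowed):
    # Keep only pairs whose (sorted, lowercased) type combination is allowed.
    pairs = []
    for (t1, e1), (t2, e2) in combinations(chunks, 2):
        key = "-".join(sorted([e1.lower(), e2.lower()]))
        if key in allowed:
            pairs.append((t1, t2))
    return pairs

print(candidate_pairs(chunks, allowed))  # two candidate pairs survive the filter
```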
## Results

```bash
+--------+---------+-------------+-----------+--------------------+---------+-------------+-----------+--------------------+----------+
|relation| entity1|entity1_begin|entity1_end| chunk1| entity2|entity2_begin|entity2_end| chunk2|confidence|
+--------+---------+-------------+-----------+--------------------+---------+-------------+-----------+--------------------+----------+
| TrAP|TREATMENT| 512| 522| amoxicillin| PROBLEM| 528| 556|a respiratory tra...|0.99796957|
| TrAP|TREATMENT| 571| 579| metformin| PROBLEM| 617| 620| T2DM|0.99757993|
| TrAP|TREATMENT| 599| 611| dapagliflozin| PROBLEM| 659| 661| HTG| 0.996036|
| TrAP| PROBLEM| 617| 620| T2DM|TREATMENT| 626| 637| atorvastatin| 0.9693424|
| TrAP| PROBLEM| 617| 620| T2DM|TREATMENT| 643| 653| gemfibrozil|0.99460286|
| TeRP| TEST| 739| 758|Physical examination| PROBLEM| 796| 810| dry oral mucosa|0.99775106|
| TeRP| TEST| 830| 854|her abdominal exa...| PROBLEM| 875| 884| tenderness|0.99272937|
| TeRP| TEST| 830| 854|her abdominal exa...| PROBLEM| 888| 895| guarding| 0.9840321|
| TeRP| TEST| 830| 854|her abdominal exa...| PROBLEM| 902| 909| rigidity| 0.9883966|
| TeRP| TEST| 1246| 1258| blood samples| PROBLEM| 1265| 1274| hemolyzing| 0.9534202|
| TeRP| TEST| 1507| 1517| her glucose| PROBLEM| 1553| 1566| still elevated| 0.9464761|
| TeRP| PROBLEM| 1553| 1566| still elevated| TEST| 1576| 1592| serum bicarbonate| 0.9428323|
| TeRP| PROBLEM| 1553| 1566| still elevated| TEST| 1656| 1661| lipase| 0.9558198|
| TeRP| PROBLEM| 1553| 1566| still elevated| TEST| 1670| 1672| U/L| 0.9214444|
| TeRP| TEST| 1676| 1702|The β-hydroxybuty...| PROBLEM| 1733| 1740| elevated| 0.9863963|
| TrAP|TREATMENT| 1937| 1951| an insulin drip| PROBLEM| 1957| 1961| euDKA| 0.9852455|
| O| PROBLEM| 1957| 1961| euDKA| TEST| 1991| 2003| the anion gap|0.94141793|
| O| PROBLEM| 1957| 1961| euDKA| TEST| 2015| 2027| triglycerides| 0.9622529|
+--------+---------+-------------+-----------+--------------------+---------+-------------+-----------+--------------------+----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|redl_clinical_biobert|
|Compatibility:|Healthcare NLP 4.2.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|402.0 MB|

## Benchmarking

```bash
Relation  Recall  Precision  F1     Support
PIP       0.859   0.878      0.869  1435
TeCP      0.629   0.782      0.697  337
TeRP      0.903   0.929      0.916  2034
TrAP      0.872   0.866      0.869  1693
TrCP      0.641   0.677      0.659  340
TrIP      0.517   0.796      0.627  151
TrNAP     0.402   0.672      0.503  112
TrWP      0.257   0.824      0.392  109
Avg.      0.635   0.803      0.691  -
```

---
layout: model
title: English RobertaForQuestionAnswering (from comacrae)
author: John Snow Labs
name: roberta_qa_roberta_unaugv3
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-unaugv3` is an English model originally trained by `comacrae`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_unaugv3_en_4.0.0_3.0_1655738558018.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_unaugv3_en_4.0.0_3.0_1655738558018.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_unaugv3","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = RoBertaForQuestionAnswering
    .pretrained("roberta_qa_roberta_unaugv3","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.roberta.unaugv3.by_comacrae").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
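Under the hood, extractive QA models like this one score every token as a potential answer start and answer end; the returned answer is the highest-scoring valid span. A toy decoding sketch with made-up logits (not the model's real scores or tokenization):

```python
def best_span(start_scores, end_scores, max_len=15):
    # Pick the (start, end) pair with the highest combined score,
    # subject to start <= end and a maximum span length.
    best = (0, 0, float("-inf"))
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = ss + end_scores[e]
            if score > best[2]:
                best = (s, e, score)
    return best[:2]

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 4.0, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]  # toy logits
end   = [0.1, 0.1, 0.2, 3.5, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]

s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # prints: Clara
```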
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_unaugv3|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/comacrae/roberta-unaugv3

---
layout: model
title: English BertForQuestionAnswering Cased model (from AnonymousSub)
author: John Snow Labs
name: bert_qa_specter_model_squad2.0
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `specter-bert-model_squad2.0` is an English model originally trained by `AnonymousSub`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_specter_model_squad2.0_en_4.0.0_3.0_1657192817801.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_specter_model_squad2.0_en_4.0.0_3.0_1657192817801.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_specter_model_squad2.0","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_specter_model_squad2.0","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_specter_model_squad2.0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/AnonymousSub/specter-bert-model_squad2.0

---
layout: model
title: English BertForQuestionAnswering Cased model (from SebastianS)
author: John Snow Labs
name: bert_qa_sebastians_finetuned_squad_accelera
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad-accelerate` is an English model originally trained by `SebastianS`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_sebastians_finetuned_squad_accelera_en_4.0.0_3.0_1657187023304.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_sebastians_finetuned_squad_accelera_en_4.0.0_3.0_1657187023304.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sebastians_finetuned_squad_accelera","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sebastians_finetuned_squad_accelera","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_sebastians_finetuned_squad_accelera|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/SebastianS/bert-finetuned-squad-accelerate

---
layout: model
title: Financial Sentiment Analysis
author: John Snow Labs
name: finclf_bert_sentiment
date: 2022-09-06
tags: [en, finance, sentiment, analysis, classification, licensed]
task: Sentiment Analysis
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Sentiment Analysis model fine-tuned on 12K+ manually annotated (positive, negative, neutral) analyst reports on top of Financial Bert Embeddings. It achieves superior performance on the financial tone analysis task.

## Predicted Entities

`positive`, `negative`, `neutral`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN_FINANCE/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_bert_sentiment_en_1.0.0_3.2_1662469460654.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_bert_sentiment_en_1.0.0_3.2_1662469460654.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = nlp.Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

sequenceClassifier = finance.BertForSequenceClassification.pretrained("finclf_bert_sentiment", "en", "finance/models")\
    .setInputCols(["document", 'token'])\
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

# A simple example
example = spark.createDataFrame([["Stocks rallied and the British pound gained."]]).toDF("text")

result = pipeline.fit(example).transform(example)

# result is a DataFrame
result.select("text", "class.result").show()
```
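The classifier head produces one raw score per sentiment label, and the predicted class is the argmax after a softmax. A toy sketch of that final step (the logit values are invented for illustration, not real model output):

```python
import math

LABELS = ["positive", "negative", "neutral"]

def predict_label(logits):
    # Softmax turns raw scores into probabilities; argmax picks the label.
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = probs.index(max(probs))
    return LABELS[best], probs[best]

label, prob = predict_label([2.1, -0.3, 0.4])  # toy logits
print(label)  # prints: positive
```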
## Results

```bash
+--------------------+----------+
|                text|    result|
+--------------------+----------+
|Stocks rallied an...|[Positive]|
+--------------------+----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|finclf_bert_sentiment|
|Type:|finance|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|412.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

In-house annotations on financial reports

## Benchmarking

```bash
label         precision  recall  f1-score  support
neutral       0.91       0.87    0.89      588
positive      0.76       0.81    0.78      251
negative      0.83       0.87    0.85      131
accuracy      -          -       0.86      970
macro-avg     0.83       0.85    0.84      970
weighted-avg  0.86       0.86    0.86      970
```

---
layout: model
title: English XlmRoBertaForQuestionAnswering (from saburbutt)
author: John Snow Labs
name: xlm_roberta_qa_xlmroberta_large_tweetqa
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlmroberta_large_tweetqa` is an English model originally trained by `saburbutt`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlmroberta_large_tweetqa_en_4.0.0_3.0_1655997777896.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlmroberta_large_tweetqa_en_4.0.0_3.0_1655997777896.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlmroberta_large_tweetqa","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = XlmRoBertaForQuestionAnswering
    .pretrained("xlm_roberta_qa_xlmroberta_large_tweetqa","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.trivia.xlmr_roberta.large").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlmroberta_large_tweetqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.8 GB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/saburbutt/xlmroberta_large_tweetqa

---
layout: model
title: English BertForTokenClassification Base Uncased model (from ml6team)
author: John Snow Labs
name: bert_token_classifier_base_uncased_city_country_ner
date: 2022-11-30
tags: [en, open_source, bert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-city-country-ner` is an English model originally trained by `ml6team`.

## Predicted Entities

`CITY`, `COUNTRY`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_base_uncased_city_country_ner_en_4.2.4_3.0_1669815124980.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_base_uncased_city_country_ner_en_4.2.4_3.0_1669815124980.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_base_uncased_city_country_ner","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_base_uncased_city_country_ner","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
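Downstream, Spark NLP's `NerConverter` is typically appended to this pipeline to merge the token-level `ner` labels into entity chunks. As a rough stand-alone illustration only (not the annotator's actual implementation, and assuming the usual `B-`/`I-`/`O` tagging scheme with labels such as `B-CITY`), the merging logic looks like this:

```python
def bio_to_chunks(tokens, labels):
    """Merge BIO token labels (e.g. B-CITY, I-CITY, O) into (chunk, entity) pairs."""
    chunks, current, entity = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current:  # close the previous chunk before opening a new one
                chunks.append((" ".join(current), entity))
            current, entity = [token], label[2:]
        elif label.startswith("I-") and entity == label[2:]:
            current.append(token)  # continue the open chunk
        else:  # "O" or a mismatched I- tag ends any open chunk
            if current:
                chunks.append((" ".join(current), entity))
            current, entity = [], None
    if current:
        chunks.append((" ".join(current), entity))
    return chunks

print(bio_to_chunks(
    ["I", "flew", "from", "New", "York", "to", "France"],
    ["O", "O", "O", "B-CITY", "I-CITY", "O", "B-COUNTRY"],
))
# [('New York', 'CITY'), ('France', 'COUNTRY')]
```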
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_base_uncased_city_country_ner| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|256| ## References - https://huggingface.co/ml6team/bert-base-uncased-city-country-ner - https://www.cs.utexas.edu/~eunsol/html_pages/open_entity.html --- layout: model title: German asr_wav2vec2_xls_r_300m_english_by_aware_ai TFWav2Vec2ForCTC from aware-ai author: John Snow Labs name: asr_wav2vec2_xls_r_300m_english_by_aware_ai date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_english_by_aware_ai` is a German model originally trained by aware-ai. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_xls_r_300m_english_by_aware_ai_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_english_by_aware_ai_de_4.2.0_3.0_1664097408175.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_english_by_aware_ai_de_4.2.0_3.0_1664097408175.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xls_r_300m_english_by_aware_ai", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xls_r_300m_english_by_aware_ai", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xls_r_300m_english_by_aware_ai| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: Korean Bert Embeddings Cased model (from onlydj96) author: John Snow Labs name: bert_embeddings_pretrain date: 2023-02-20 tags: [open_source, bert, bert_embeddings, bertformaskedlm, ko, tensorflow] task: Embeddings language: ko edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_pretrain` is a Korean model originally trained by `onlydj96`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_pretrain_ko_4.3.0_3.0_1676925661631.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_pretrain_ko_4.3.0_3.0_1676925661631.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_pretrain","ko") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_pretrain","ko") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_pretrain| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ko| |Size:|415.5 MB| |Case sensitive:|true| ## References https://huggingface.co/onlydj96/bert_pretrain --- layout: model title: Pipeline to Detect Temporal Relations for Clinical Events (Enriched) author: John Snow Labs name: re_temporal_events_enriched_clinical_pipeline date: 2023-06-13 tags: [licensed, clinical, relation_extraction, event, enriched, en] task: Relation Extraction language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [re_temporal_events_enriched_clinical](https://nlp.johnsnowlabs.com/2020/09/28/re_temporal_events_enriched_clinical_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_temporal_events_enriched_clinical_pipeline_en_4.4.4_3.2_1686664970911.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_temporal_events_enriched_clinical_pipeline_en_4.4.4_3.2_1686664970911.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("re_temporal_events_enriched_clinical_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 56-year-old right-handed female with longstanding intermittent right low back pain, who was involved in a motor vehicle accident in September of 2005. At that time, she did not notice any specific injury, but five days later, she started getting abnormal right low back pain.") ``` ```scala val pipeline = new PretrainedPipeline("re_temporal_events_enriched_clinical_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 56-year-old right-handed female with longstanding intermittent right low back pain, who was involved in a motor vehicle accident in September of 2005. At that time, she did not notice any specific injury, but five days later, she started getting abnormal right low back pain.") ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.temproal_enriched.pipeline").predict("""The patient is a 56-year-old right-handed female with longstanding intermittent right low back pain, who was involved in a motor vehicle accident in September of 2005. At that time, she did not notice any specific injury, but five days later, she started getting abnormal right low back pain.""") ```
## Results ```bash Results +----+------------+-----------+-----------------+---------------+-----------------------------------------------+------------+-----------------+---------------+--------------------------+--------------+ | | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | +====+============+===========+=================+===============+===============================================+============+=================+===============+==========================+==============+ | 0 | OVERLAP | PROBLEM | 54 | 98 | longstanding intermittent right low back pain | OCCURRENCE | 121 | 144 | a motor vehicle accident | 0.532308 | +----+------------+-----------+-----------------+---------------+-----------------------------------------------+------------+-----------------+---------------+--------------------------+--------------+ | 1 | AFTER | DATE | 171 | 179 | that time | PROBLEM | 201 | 219 | any specific injury | 0.577288 | +----+------------+-----------+-----------------+---------------+-----------------------------------------------+------------+-----------------+---------------+--------------------------+--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_temporal_events_enriched_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - PerceptronModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - DependencyParserModel - RelationExtractionModel --- layout: model title: Dutch BertForMaskedLM Base Cased model (from GroNLP) author: John Snow Labs name: bert_embeddings_base_dutch_cased date: 2022-12-02 tags: [nl, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: nl edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: 
type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-dutch-cased` is a Dutch model originally trained by `GroNLP`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_dutch_cased_nl_4.2.4_3.0_1670016541889.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_dutch_cased_nl_4.2.4_3.0_1670016541889.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_dutch_cased","nl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_dutch_cased","nl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_dutch_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|nl| |Size:|409.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/GroNLP/bert-base-dutch-cased - https://www.semanticscholar.org/author/Wietse-de-Vries/144611157 - https://www.semanticscholar.org/author/Andreas-van-Cranenburgh/2791585 - https://www.semanticscholar.org/author/Arianna-Bisazza/3242253 - https://www.semanticscholar.org/author/Tommaso-Caselli/1864635 - https://www.semanticscholar.org/author/Gertjan-van-Noord/143715131 - https://www.semanticscholar.org/author/M.-Nissim/2742475 - https://arxiv.org/abs/1912.09582 - https://github.com/wietsedv/bertje - https://www.semanticscholar.org/paper/BERTje%3A-A-Dutch-BERT-Model-Vries-Cranenburgh/a4d5e425cac0bf84c86c0c9f720b6339d6288ffa - https://www.clips.uantwerpen.be/conll2002/ner/ - https://ivdnt.org/downloads/taalmaterialen/tstc-sonar-corpus - https://github.com/google-research/bert/blob/master/multilingual.md - http://textdata.nl - https://github.com/iPieter/RobBERT - https://universaldependencies.org/treebanks/nl_lassysmall/index.html - https://github.com/google-research/bert/blob/master/multilingual.md - http://textdata.nl - https://github.com/iPieter/RobBERT --- layout: model title: Chinese BertForQuestionAnswering model (from jackh1995) author: John Snow Labs name: bert_qa_roberta_base_chinese_extractive_qa_scratch date: 2022-06-02 tags: [zh, open_source, question_answering, bert] task: Question Answering language: zh edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`roberta-base-chinese-extractive-qa-scratch` is a Chinese model originally trained by `jackh1995`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_roberta_base_chinese_extractive_qa_scratch_zh_4.0.0_3.0_1654189279202.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_roberta_base_chinese_extractive_qa_scratch_zh_4.0.0_3.0_1654189279202.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_roberta_base_chinese_extractive_qa_scratch","zh") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_roberta_base_chinese_extractive_qa_scratch","zh") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("zh.answer_question.bert.base.by_jackh1995").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_roberta_base_chinese_extractive_qa_scratch| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|zh| |Size:|408.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/jackh1995/roberta-base-chinese-extractive-qa-scratch --- layout: model title: Dutch BERT Base Cased Embedding author: John Snow Labs name: bert_base_cased date: 2021-09-07 tags: [dutch, open_source, bert_embeddings, cased, nl] task: Embeddings language: nl edition: Spark NLP 3.2.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description BERTje is a Dutch pre-trained BERT model developed at the University of Groningen. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_cased_nl_3.2.2_3.0_1630999717658.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_cased_nl_3.2.2_3.0_1630999717658.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_base_cased", "nl") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_base_cased", "nl") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("nl.embed.bert.base_cased").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_cased| |Compatibility:|Spark NLP 3.2.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|nl| |Case sensitive:|true| ## Data Source The model is imported from: https://huggingface.co/GroNLP/bert-base-dutch-cased --- layout: model title: FastText Word Embeddings in German author: John Snow Labs name: w2v_cc_300d class: WordEmbeddingsModel language: de repository: clinical/models date: 2020-09-06 task: Embeddings edition: Healthcare NLP 2.5.5 spark_version: 2.4 tags: [embeddings, de, licensed] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/14.German_Healthcare_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/w2v_cc_300d_de_2.5.5_2.4_1599428063692.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/w2v_cc_300d_de_2.5.5_2.4_1599428063692.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python model = WordEmbeddingsModel.pretrained("w2v_cc_300d","de","clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("word_embeddings") ``` ```scala val model = WordEmbeddingsModel.pretrained("w2v_cc_300d","de","clinical/models") .setInputCols(Array("document","token")) .setOutputCol("word_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("de.embed.w2v").predict("""Put your text here.""") ```
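The annotator maps each token to a 300-dimensional vector. Downstream components usually compare such vectors with cosine similarity; as a small, self-contained illustration (the 3-dimensional vectors below are toy stand-ins for illustration, not actual model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy stand-ins for the embeddings of two related German tokens:
v_arzt = [0.2, 0.8, 0.1]
v_doktor = [0.25, 0.75, 0.05]
print(cosine_similarity(v_arzt, v_doktor) > 0.9)  # True: similar tokens point the same way
```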
{:.h2_title} ## Results Word2Vec feature vectors based on ``w2v_cc_300d``. {:.model-param} ## Model Information {:.table-model} |---------------|---------------------| | Name: | w2v_cc_300d | | Type: | WordEmbeddingsModel | | Compatibility: | Healthcare NLP 2.5.5+ | | License: | Licensed | | Edition: | Official | |Input labels: | [document, token] | |Output labels: | [word_embeddings] | | Language: | de | | Dimension: | 300 | {:.h2_title} ## Data Source FastText Common Crawl word embeddings for German: https://fasttext.cc/docs/en/crawl-vectors.html --- layout: model title: Sentence Detection in Ukrainian Text author: John Snow Labs name: sentence_detector_dl date: 2021-08-30 tags: [open_source, uk, sentence_detection] task: Sentence Detection language: uk edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_uk_3.2.0_3.0_1630322414306.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_uk_3.2.0_3.0_1630322414306.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "uk") \ .setInputCols(["document"]) \ .setOutputCol("sentences") sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL])) sd_model.fullAnnotate("""Шукаєте чудове джерело англійського читання абзаців? Ви потрапили в потрібне місце. Згідно з останнім дослідженням, звичка читати у сучасної молоді стрімко знижується. Вони не можуть зосередитися на даному абзаці читання англійською мовою більше кількох секунд! Крім того, читання було і є невід’ємною частиною всіх конкурсних іспитів. Отже, як покращити свої навички читання? Відповідь на це питання насправді інше питання: Яка користь від навичок читання? Основна мета читання - «мати сенс».""") ``` ```scala val documenter = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "uk") .setInputCols(Array("document")) .setOutputCol("sentence") val pipeline = new Pipeline().setStages(Array(documenter, model)) val data = Seq("Шукаєте чудове джерело англійського читання абзаців? Ви потрапили в потрібне місце. Згідно з останнім дослідженням, звичка читати у сучасної молоді стрімко знижується. Вони не можуть зосередитися на даному абзаці читання англійською мовою більше кількох секунд! Крім того, читання було і є невід’ємною частиною всіх конкурсних іспитів. Отже, як покращити свої навички читання? Відповідь на це питання насправді інше питання: Яка користь від навичок читання? Основна мета читання - «мати сенс».").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python nlu.load('uk.sentence_detector').predict('Шукаєте чудове джерело англійського читання абзаців? Ви потрапили в потрібне місце. Згідно з останнім дослідженням, звичка читати у сучасної молоді стрімко знижується. 
Вони не можуть зосередитися на даному абзаці читання англійською мовою більше кількох секунд! Крім того, читання було і є невід’ємною частиною всіх конкурсних іспитів. Отже, як покращити свої навички читання? Відповідь на це питання насправді інше питання: Яка користь від навичок читання? Основна мета читання - «мати сенс».', output_level ='sentence') ```
## Results ```bash +-----------------------------------------------------------------------------------------------+ |result | +-----------------------------------------------------------------------------------------------+ |[Шукаєте чудове джерело англійського читання абзаців?] | |[Ви потрапили в потрібне місце.] | |[Згідно з останнім дослідженням, звичка читати у сучасної молоді стрімко знижується.] | |[Вони не можуть зосередитися на даному абзаці читання англійською мовою більше кількох секунд!]| |[Крім того, читання було і є невід’ємною частиною всіх конкурсних іспитів.] | |[Отже, як покращити свої навички читання?] | |[Відповідь на це питання насправді інше питання:] | |[Яка користь від навичок читання?] | |[Основна мета читання - «мати сенс».] | +-----------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentence_detector_dl| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[sentences]| |Language:|uk| ## Benchmarking ```bash label Accuracy Recall Prec F1 0 0.98 1.00 0.96 0.98 ``` --- layout: model title: Pipeline to Detect Drug Chemicals author: John Snow Labs name: bert_token_classifier_ner_drugs_pipeline date: 2022-03-21 tags: [licensed, ner, drugs, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_drugs](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_drugs_en.html) model. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_drugs_pipeline_en_3.4.1_3.0_1647864317028.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_drugs_pipeline_en_3.4.1_3.0_1647864317028.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("bert_token_classifier_ner_drugs_pipeline", "en", "clinical/models") pipeline.annotate("The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulnessof vinorelbine monotherapy in patients with advanced or recurrent breast cancerafter standard therapy, we evaluated the efficacy and safety of vinorelbine inpatients previously treated with anthracyclines and taxanes.") ``` ```scala val pipeline = new PretrainedPipeline("bert_token_classifier_ner_drugs_pipeline", "en", "clinical/models") pipeline.annotate("The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. 
Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulnessof vinorelbine monotherapy in patients with advanced or recurrent breast cancerafter standard therapy, we evaluated the efficacy and safety of vinorelbine inpatients previously treated with anthracyclines and taxanes.") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_ade.pipeline").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. 
Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulness of vinorelbine monotherapy in patients with advanced or recurrent breast cancer after standard therapy, we evaluated the efficacy and safety of vinorelbine in patients previously treated with anthracyclines and taxanes.""") ```
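The `annotate()` call above returns plain lists of strings without entity metadata; to pair each chunk with its label you would typically call `fullAnnotate()` and flatten the chunk annotations. A minimal pure-Python sketch of that flattening step, using dict-shaped stand-ins for Spark NLP's annotation fields (`result`, `metadata["entity"]`) — the sample values are illustrative, not real pipeline output:

```python
# Hedged sketch: pair ner_chunk results with their entity labels, in the
# shape produced by fullAnnotate-style output. Sample data is illustrative.
def chunks_with_labels(annotations):
    """Flatten chunk annotations into (chunk, label) pairs."""
    return [(a["result"], a["metadata"]["entity"]) for a in annotations]

sample = [
    {"result": "potassium", "metadata": {"entity": "DrugChem"}},
    {"result": "vinorelbine", "metadata": {"entity": "DrugChem"}},
]
print(chunks_with_labels(sample))
# [('potassium', 'DrugChem'), ('vinorelbine', 'DrugChem')]
```

The same pairs are what the results table below renders as `chunk` / `ner_label` columns.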
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |potassium |DrugChem | |nucleotide |DrugChem | |anthracyclines|DrugChem | |taxanes |DrugChem | |vinorelbine |DrugChem | |vinorelbine |DrugChem | |anthracyclines|DrugChem | |taxanes |DrugChem | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_drugs_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverter --- layout: model title: English RobertaForQuestionAnswering (from Palak) author: John Snow Labs name: roberta_qa_distilroberta_base_squad date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-base_squad` is an English model originally trained by `Palak`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_distilroberta_base_squad_en_4.0.0_3.0_1655728407851.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_distilroberta_base_squad_en_4.0.0_3.0_1655728407851.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_distilroberta_base_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_distilroberta_base_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.distilled_base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_distilroberta_base_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|307.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Palak/distilroberta-base_squad --- layout: model title: German asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s527 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s527 date: 2022-09-26 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s527` is a German model originally trained by jonatasgrosman. NOTE: This model only works on a CPU; if you need to run it on a GPU device, please use asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s527_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s527_de_4.2.0_3.0_1664190631948.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s527_de_4.2.0_3.0_1664190631948.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s527", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s527", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
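The snippets above assume an `audioDf` whose `audio_content` column holds raw audio as an array of floats; Wav2Vec2 models expect mono 16 kHz samples normalized to [-1, 1]. A stdlib-only sketch of converting 16-bit PCM bytes into that form (the Spark DataFrame wrapping is shown as a comment, since it needs a live session; all names here are illustrative):

```python
import struct

def pcm16_to_floats(raw_bytes):
    """Convert little-endian 16-bit PCM bytes to floats in [-1.0, 1.0]."""
    n = len(raw_bytes) // 2
    samples = struct.unpack("<%dh" % n, raw_bytes[: n * 2])
    return [s / 32768.0 for s in samples]

# Illustrative input: two int16 samples, 16384 (~half scale) and -32768 (full negative scale)
floats = pcm16_to_floats(struct.pack("<2h", 16384, -32768))
print(floats)  # [0.5, -1.0]

# With a live SparkSession, this list would become the expected input, e.g.:
# audioDf = spark.createDataFrame([(floats,)], ["audio_content"])
```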
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_vp_100k_accent_germany_10_austria_0_s527| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: Arabic BertForMaskedLM Base Cased model (from CAMeL-Lab) author: John Snow Labs name: bert_embeddings_base_arabic_camel_msa_quarter date: 2022-12-02 tags: [ar, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ar edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-arabic-camelbert-msa-quarter` is an Arabic model originally trained by `CAMeL-Lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabic_camel_msa_quarter_ar_4.2.4_3.0_1670016162499.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabic_camel_msa_quarter_ar_4.2.4_3.0_1670016162499.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabic_camel_msa_quarter","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabic_camel_msa_quarter","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
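The `embeddings` column produced above is typically consumed downstream, for instance by comparing token or sentence vectors with cosine similarity. A minimal stdlib sketch on two illustrative 3-dimensional stand-ins (real vectors from this model are 768-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Illustrative 3-d stand-ins for 768-d BERT token embeddings
v1, v2 = [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]
print(round(cosine_similarity(v1, v2), 3))  # 0.5
```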
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_arabic_camel_msa_quarter| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|409.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa-quarter - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://catalog.ldc.upenn.edu/LDC2011T11 - http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus - https://vlo.clarin.eu/search;jsessionid=31066390B2C9E8C6304845BA79869AC1?1&q=osian - https://archive.org/details/arwiki-20190201 - https://oscar-corpus.com/ - https://github.com/google-research/bert - https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/tokenization.py#L286-L297 - https://github.com/CAMeL-Lab/camel_tools - https://github.com/CAMeL-Lab/CAMeLBERT --- layout: model title: Fast Neural Machine Translation Model from English to Igbo author: John Snow Labs name: opus_mt_en_ig date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, ig, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
- source languages: `en` - target languages: `ig` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ig_xx_2.7.0_2.4_1609169417196.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ig_xx_2.7.0_2.4_1609169417196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_ig", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_ig", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.ig').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_ig| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Javanese RobertaForMaskedLM Small Cased model (from w11wo) author: John Snow Labs name: roberta_embeddings_javanese_small date: 2022-12-12 tags: [jv, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: jv edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `javanese-roberta-small` is a Javanese model originally trained by `w11wo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_javanese_small_jv_4.2.4_3.0_1670858755321.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_javanese_small_jv_4.2.4_3.0_1670858755321.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_javanese_small","jv") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_javanese_small","jv") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_javanese_small| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|jv| |Size:|468.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/w11wo/javanese-roberta-small - https://arxiv.org/abs/1907.11692 - https://github.com/sgugger - https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb - https://w11wo.github.io/ --- layout: model title: Detect PHI for Deidentification in Romanian (Word2Vec) author: John Snow Labs name: ner_deid_subentity date: 2022-06-27 tags: [ner, deidentification, word2vec, ro, licensed] task: Named Entity Recognition language: ro edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art NER model: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Romanian) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 17 entities. This NER model is trained with a combination of custom datasets and several data augmentation mechanisms. 
## Predicted Entities `AGE`, `CITY`, `COUNTRY`, `DATE`, `DOCTOR`, `EMAIL`, `FAX`, `HOSPITAL`, `IDNUM`, `LOCATION-OTHER`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `PROFESSION`, `STREET`, `ZIP` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_ro_4.0.0_3.0_1656316441636.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_ro_4.0.0_3.0_1656316441636.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ro")\ .setInputCols(["sentence","token"])\ .setOutputCol("word_embeddings") clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "ro", "clinical/models")\ .setInputCols(["sentence","token","word_embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter]) text = """ Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""" data = spark.createDataFrame([[text]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ro") .setInputCols(Array("sentence","token")) .setOutputCol("word_embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "ro", "clinical/models") .setInputCols(Array("sentence","token","word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( documentAssembler, 
sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter)) val text = """Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""" val data = Seq(text).toDS.toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ro.med_ner.deid.subentity").predict(""" Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""") ```
## Results ```bash +----------------------------+---------+ |chunk |ner_label| +----------------------------+---------+ |Spitalul Pentru Ochi de Deal|HOSPITAL | |Drumul Oprea Nr. |STREET | |Vaslui |CITY | |737405 |ZIP | |+40(235)413773 |PHONE | |25 May 2022 |DATE | |BUREAN MARIA |PATIENT | |77 |AGE | |Agota Evelyn |DOCTOR | |2450502264401 |IDNUM | +----------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ro| |Size:|15.1 MB| ## References - Custom John Snow Labs datasets - Data augmentation techniques ## Benchmarking ```bash label precision recall f1-score support AGE 0.96 0.99 0.97 1235 CITY 0.97 0.95 0.96 307 COUNTRY 0.92 0.74 0.82 115 DATE 0.94 0.89 0.91 5006 DOCTOR 0.96 0.96 0.96 2064 EMAIL 1.00 1.00 1.00 8 FAX 1.00 0.95 0.97 56 HOSPITAL 0.78 0.83 0.80 919 IDNUM 0.98 1.00 0.99 239 LOCATION-OTHER 1.00 0.85 0.92 13 MEDICALRECORD 1.00 1.00 1.00 455 ORGANIZATION 0.34 0.41 0.37 82 PATIENT 0.85 0.90 0.87 954 PHONE 0.97 0.98 0.98 315 PROFESSION 0.87 0.80 0.83 173 STREET 0.99 0.99 0.99 174 ZIP 0.99 0.97 0.98 140 micro-avg 0.92 0.91 0.92 12255 macro-avg 0.91 0.89 0.90 12255 weighted-avg 0.93 0.91 0.92 12255 ``` --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4CHEMD_Chem_Modified_BioBERT_384 date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`BC4CHEMD-Chem-Modified-BioBERT-384` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Modified_BioBERT_384_en_4.0.0_3.0_1657108284010.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Modified_BioBERT_384_en_4.0.0_3.0_1657108284010.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Modified_BioBERT_384","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Modified_BioBERT_384","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4CHEMD_Chem_Modified_BioBERT_384| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|403.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4CHEMD-Chem-Modified-BioBERT-384 --- layout: model title: English Named Entity Recognition (Large, CoNLL) author: John Snow Labs name: roberta_ner_roberta_large_ner_english date: 2022-05-03 tags: [roberta, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-large-ner-english` is an English model originally trained by `Jean-Baptiste`. ## Predicted Entities `MISC`, `ORG`, `LOC`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_ner_roberta_large_ner_english_en_3.4.2_3.0_1651593899023.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_ner_roberta_large_ner_english_en_3.4.2_3.0_1651593899023.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_roberta_large_ner_english","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_roberta_large_ner_english","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.ner.roberta_large_ner_english").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_ner_roberta_large_ner_english| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Jean-Baptiste/roberta-large-ner-english - https://medium.com/@jean-baptiste.polle/lstm-model-for-email-signature-detection-8e990384fefa --- layout: model title: Financial Acquisitions Section Binary Classifier author: John Snow Labs name: finclf_acquisitions_item date: 2022-11-03 tags: [en, finance, classification, 10k, annual, reports, sec, filings, licensed] task: Text Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: FinanceClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `acquisitions` item type of 10-K Annual Reports. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it is better to skip it unless you want to do binary classification at sentence level. If you have large financial documents and want to look for clauses, we recommend splitting them using any of the techniques available in our Spark NLP for Finance Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that this model's embeddings allow up to 512 tokens. 
If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). ## Predicted Entities `other`, `acquisitions` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_acquisitions_item_en_1.0.0_3.0_1667484190818.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_acquisitions_item_en_1.0.0_3.0_1667484190818.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") useEmbeddings = nlp.UniversalSentenceEncoder.pretrained() \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("finclf_acquisitions_item", "en", "finance/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, useEmbeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
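The description above recommends paragraph splitting (by multiline) rather than sentence splitting before classifying long filings. A minimal sketch of that pre-processing step, independent of Spark NLP (the document text and variable names are illustrative):

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Item 2. Acquisitions.\nWe acquired FooCorp.\n\nItem 3. Legal Proceedings.\nNone."
paragraphs = split_paragraphs(doc)
print(len(paragraphs))  # 2

# Each paragraph would then become one row of the `text` column fed to the
# classification pipeline, e.g.:
# df = spark.createDataFrame([[p] for p in paragraphs]).toDF("text")
```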
## Results ```bash +-------+ | result| +-------+ |[acquisitions]| |[other]| |[other]| |[acquisitions]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_acquisitions_item| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.5 MB| ## References Weak labelling on documents from the Edgar database ## Benchmarking ```bash label precision recall f1-score support acquisitions 0.95 0.89 0.92 136 other 0.96 0.98 0.97 412 accuracy - - 0.96 548 macro-avg 0.95 0.94 0.95 548 weighted-avg 0.96 0.96 0.96 548 ``` --- layout: model title: English T5ForConditionalGeneration Cased model (from cometrain) author: John Snow Labs name: t5_stocks_news date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `stocks-news-t5` is an English model originally trained by `cometrain`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_stocks_news_en_4.3.0_3.0_1675107545220.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_stocks_news_en_4.3.0_3.0_1675107545220.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_stocks_news","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_stocks_news","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_stocks_news| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|277.2 MB| ## References - https://huggingface.co/cometrain/stocks-news-t5 - https://www.kaggle.com/datasets/sbhatti/financial-sentiment-analysis --- layout: model title: Spanish BERT Base Uncased Embedding author: John Snow Labs name: bert_base_uncased date: 2021-09-07 tags: [uncased, spanish, bert_embeddings, open_source, es] task: Embeddings language: es edition: Spark NLP 3.2.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description BETO is a BERT model trained on a big Spanish corpus. BETO is of size similar to a BERT-Base and was trained with the Whole Word Masking technique. Below you find Tensorflow and Pytorch checkpoints for the uncased and cased versions, as well as some results for Spanish benchmarks comparing BETO with Multilingual BERT as well as other (not BERT-based) models. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_uncased_es_3.2.2_3.0_1630999639144.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_uncased_es_3.2.2_3.0_1630999639144.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols("document") \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_base_uncased", "es") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_base_uncased", "es") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.bert.base_uncased").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_uncased| |Compatibility:|Spark NLP 3.2.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Case sensitive:|true| ## Data Source The model is imported from: https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased --- layout: model title: Legal Indemnity Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_indemnity_agreement_bert date: 2022-12-06 tags: [en, legal, classification, agreement, indemnity, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_indemnity_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `indemnity-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `indemnity-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_indemnity_agreement_bert_en_1.0.0_3.0_1670349475264.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_indemnity_agreement_bert_en_1.0.0_3.0_1670349475264.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_indemnity_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
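The macro and weighted averages reported in this card's Benchmarking section follow directly from the per-class rows. A quick stdlib check using the per-class F1 scores and supports from that table (0.85/28 for `indemnity-agreement`, 0.89/35 for `other`):

```python
# Recompute the averaged F1 figures from the per-class benchmark rows.
f1 = {"indemnity-agreement": 0.85, "other": 0.89}
support = {"indemnity-agreement": 28, "other": 35}

macro_f1 = sum(f1.values()) / len(f1)                      # unweighted mean
total = sum(support.values())
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total  # support-weighted mean

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.87 0.87, matching the table
```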
## Results ```bash +-------+ |result| +-------+ |[indemnity-agreement]| |[other]| |[other]| |[indemnity-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_indemnity_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support indemnity-agreement 0.88 0.82 0.85 28 other 0.86 0.91 0.89 35 accuracy - - 0.87 63 macro-avg 0.87 0.87 0.87 63 weighted-avg 0.87 0.87 0.87 63 ``` --- layout: model title: English DistilBertForQuestionAnswering Cased model (from saraks) author: John Snow Labs name: distilbert_qa_cuad_agreement_date_08_25 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cuad-distil-agreement_date-08-25` is an English model originally trained by `saraks`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_agreement_date_08_25_en_4.3.0_3.0_1672765994991.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_agreement_date_08_25_en_4.3.0_3.0_1672765994991.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_agreement_date_08_25","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_agreement_date_08_25","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
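Extractive QA models like this one return a span over the context rather than generated text: the `answer` annotation carries character offsets into `document_context`. A stdlib sketch of slicing an answer out, with hypothetical offsets chosen for the example pair above (inclusive-end offsets assumed, as in Spark NLP annotations):

```python
def span_answer(context: str, begin: int, end: int) -> str:
    """Slice an extractive-QA answer out of the context (inclusive offsets)."""
    return context[begin:end + 1]

context = "My name is Clara and I live in Berkeley."
# Hypothetical span for illustration: characters 11..15 cover "Clara".
print(span_answer(context, 11, 15))  # Clara
```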
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_cuad_agreement_date_08_25| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/saraks/cuad-distil-agreement_date-08-25 --- layout: model title: Legal Restricted Stock Unit Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_restricted_stock_unit_agreement_bert date: 2022-12-06 tags: [en, legal, classification, agreement, restricted, stock, unit, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_restricted_stock_unit_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `restricted-stock-unit-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `restricted-stock-unit-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_restricted_stock_unit_agreement_bert_en_1.0.0_3.0_1670349793942.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_restricted_stock_unit_agreement_bert_en_1.0.0_3.0_1670349793942.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_restricted_stock_unit_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[restricted-stock-unit-agreement]| |[other]| |[other]| |[restricted-stock-unit-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_restricted_stock_unit_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.94 0.94 0.94 35 restricted-stock-unit-agreement 0.95 0.95 0.95 42 accuracy - - 0.95 77 macro-avg 0.95 0.95 0.95 77 weighted-avg 0.95 0.95 0.95 77 ``` --- layout: model title: Polish Lemmatizer author: John Snow Labs name: lemma date: 2020-05-03 18:18:00 +0800 task: Lemmatization language: pl edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [lemmatizer, pl] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous.
{:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_pl_2.5.0_2.4_1588518491035.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_pl_2.5.0_2.4_1588518491035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "pl") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Oprócz bycia królem północy, John Snow jest angielskim lekarzem i liderem w rozwoju anestezjologii i higieny medycznej.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "pl") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Oprócz bycia królem północy, John Snow jest angielskim lekarzem i liderem w rozwoju anestezjologii i higieny medycznej.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Oprócz bycia królem północy, John Snow jest angielskim lekarzem i liderem w rozwoju anestezjologii i higieny medycznej."""] lemma_df = nlu.load('pl.lemma').predict(text, output_level='token') lemma_df.lemma.values[0] ```
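Each lemma annotation keeps the original token's offsets plus the root form in `result`, so a lemmatized version of the text can be rebuilt by joining the `result` fields in order. A plain-Python sketch over mocked rows shaped like those in the Results section:

```python
# Mock of the first lemma annotations for the Polish example sentence.
rows = [
    {"begin": 0, "end": 5, "result": "oprócz"},
    {"begin": 7, "end": 11, "result": "bycia"},
    {"begin": 13, "end": 18, "result": "król"},
    {"begin": 20, "end": 26, "result": "północ"},
]

# Join the root forms to get the start of the lemmatized sentence.
lemmatized = " ".join(r["result"] for r in rows)
print(lemmatized)  # oprócz bycia król północ
```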
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=5, result='oprócz', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=7, end=11, result='bycia', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=13, end=18, result='król', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=20, end=26, result='północ', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=27, end=27, result=',', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|pl| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Law Spanish Text Classification (from `hackathon-pln-es`) author: John Snow Labs name: roberta_jurisbert_class_tratados_internacionales_sistema_universal date: 2022-05-20 tags: [roberta, ner, text_classification, es, open_source] task: Text Classification language: es edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: RoBertaForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `jurisbert-class-tratados-internacionales-sistema-universal` is a Spanish model originally trained by `hackathon-pln-es`.
## Predicted Entities `Convención sobre la Eliminación de todas las formas de Discriminación contra la Mujer`, `Convención sobre los Derechos de las Personas con Discapacidad`, `Convención Internacional Sobre la Eliminación de Todas las Formas de Discriminación Racial`, `Convención contra la Tortura y otros Tratos o Penas Crueles, Inhumanos o Degradantes`, `Convención Internacional sobre la Protección de los Derechos de todos los Trabajadores Migratorios y de sus Familias`, `Convención de los Derechos del Niño`, `Pacto Internacional de Derechos Económicos, Sociales y Culturales`, `Pacto Internacional de Derechos Civiles y Políticos` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_jurisbert_class_tratados_internacionales_sistema_universal_es_3.4.4_3.0_1653050297872.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_jurisbert_class_tratados_internacionales_sistema_universal_es_3.4.4_3.0_1653050297872.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = RoBertaForSequenceClassification.pretrained("roberta_jurisbert_class_tratados_internacionales_sistema_universal","es") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Me encanta Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForSequenceClassification.pretrained("roberta_jurisbert_class_tratados_internacionales_sistema_universal","es") .setInputCols(Array("sentence", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Me encanta Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_jurisbert_class_tratados_internacionales_sistema_universal| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|es| |Size:|466.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References https://huggingface.co/hackathon-pln-es/jurisbert-class-tratados-internacionales-sistema-universal --- layout: model title: English RobertaForQuestionAnswering Large Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_large_few_shot_k_32_finetuned_squad_seed_0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-few-shot-k-32-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_large_few_shot_k_32_finetuned_squad_seed_0_en_4.3.0_3.0_1674221604163.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_large_few_shot_k_32_finetuned_squad_seed_0_en_4.3.0_3.0_1674221604163.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_few_shot_k_32_finetuned_squad_seed_0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_few_shot_k_32_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_large_few_shot_k_32_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-large-few-shot-k-32-finetuned-squad-seed-0 --- layout: model title: Pipeline to Detect PHI for Deidentification (Enriched) author: John Snow Labs name: ner_deid_enriched_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, deidentification, enriched, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deid_enriched](https://nlp.johnsnowlabs.com/2021/03/31/ner_deid_enriched_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_enriched_pipeline_en_3.4.1_3.0_1647867856711.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_enriched_pipeline_en_3.4.1_3.0_1647867856711.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_enriched_pipeline", "en", "clinical/models") pipeline.annotate("HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Remedy City Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Remedy City Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.") ``` ```scala val pipeline = new PretrainedPipeline("ner_deid_enriched_pipeline", "en", "clinical/models") pipeline.annotate("HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. 
He underwent a resection there. He was to be admitted to the Remedy City Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Remedy City Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.deid_enriched.pipeline").predict("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Remedy City Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Remedy City Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. 
A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.""") ```
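The pipeline's NER stage pairs each detected PHI chunk with a label; a common follow-up step in deidentification is masking those chunks in the source text. A minimal stdlib sketch (the chunk/label pairs mirror the kind of output shown in the Results section; the masking helper itself is illustrative, not part of this pipeline):

```python
def mask_phi(text: str, chunks: list[tuple[str, str]]) -> str:
    """Replace each detected PHI chunk with a <LABEL> placeholder."""
    for chunk, label in chunks:
        text = text.replace(chunk, f"<{label}>")
    return text

note = "Mr. Smith was seen at the VA Hospital on 02/04/2003."
chunks = [("Smith", "PATIENT"), ("VA Hospital", "HOSPITAL"), ("02/04/2003", "DATE")]
print(mask_phi(note, chunks))
# Mr. <PATIENT> was seen at the <HOSPITAL> on <DATE>.
```

Production deidentification would use offset-based replacement (naive string replacement can over-match), but the idea is the same.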
## Results ```bash +--------------------+---------+ |chunk |ner_label| +--------------------+---------+ |Smith |PATIENT | |VA Hospital |HOSPITAL | |Remedy City Hospital|HOSPITAL | |02/04/2003 |DATE | |Smith |PATIENT | |Remedy City Hospital|HOSPITAL | |Smith |PATIENT | |Smith |PATIENT | |Hart |DOCTOR | |Smith |PATIENT | |02/07/2003 |DATE | +--------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_enriched_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Word2Vec Embeddings in Estonian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-15 tags: [cc, embeddings, fastText, word2vec, et, open_source] task: Embeddings language: et edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_et_3.4.1_3.0_1647370411777.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_et_3.4.1_3.0_1647370411777.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","et") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","et") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("et.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
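Token vectors from an embeddings lookup are typically compared downstream with cosine similarity. A stdlib sketch on toy 3-dimensional vectors (this model's actual vectors are 300-dimensional):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction, so similarity is 1.0
print(round(cosine(v1, v2), 6))  # 1.0
```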
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|et| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Pipeline to Detect Subentity PHI for Deidentification (Arabic) author: John Snow Labs name: ner_deid_subentity_pipeline date: 2023-06-13 tags: [licensed, clinical, deidentification, ar, pipeline] task: Pipeline Healthcare language: ar edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deid_subentity](https://nlp.johnsnowlabs.com/2023/05/29/ner_deid_subentity_ar.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_pipeline_ar_4.4.4_3.2_1686666161768.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_pipeline_ar_4.4.4_3.2_1686666161768.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_subentity_pipeline", "ar", "clinical/models") text= ''' ملاحظات سريرية - مريض الربو. التاريخ: 16 أبريل 2000. اسم المريضة: ليلى حسن. العنوان: شارع المعرفة، مبنى رقم 789، حي الأمانة، جدة. الرمز البريدي: 54321. البلد: المملكة العربية السعودية. اسم المستشفى: مستشفى النور. اسم الطبيب: د. أميرة أحمد. تفاصيل الحالة: المريضة ليلى حسن، البالغة من العمر 35 عامًا، تعاني من مرض الربو المزمن. تشكو من ضيق التنفس والسعال المتكرر والشهيق الشديد. تم تشخيصها بمرض الربو بناءً على تاريخها الطبي واختبارات وظائف الرئة. الخطة: تم وصف مضادات الالتهاب غير الستيرويدية والموسعات القصبية لتحسين التنفس وتقليل التهيج. يجب على المريضة حمل معها جهاز الاستنشاق في حالة حدوث نوبة ربو حادة. يتعين على المريضة تجنب التحسس من العوامل المسببة للربو، مثل الدخان والغبار والحيوانات الأليفة. يجب مراقبة وظائف الرئة بانتظام ومتابعة التعليمات الطبية المتعلقة بمرض الربو. تعليم المريضة بشأن كيفية استخدام جهاز الاستنشاق بشكل صحيح وتقنيات التنفس الصحيح. ''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_subentity_pipeline", "ar", "clinical/models") val text = "ملاحظات سريرية - مريض الربو. التاريخ: 16 أبريل 2000. اسم المريضة: ليلى حسن. العنوان: شارع المعرفة، مبنى رقم 789، حي الأمانة، جدة. الرمز البريدي: 54321. البلد: المملكة العربية السعودية. اسم المستشفى: مستشفى النور. اسم الطبيب: د. أميرة أحمد. تفاصيل الحالة: المريضة ليلى حسن، البالغة من العمر 35 عامًا، تعاني من مرض الربو المزمن. تشكو من ضيق التنفس والسعال المتكرر والشهيق الشديد. تم تشخيصها بمرض الربو بناءً على تاريخها الطبي واختبارات وظائف الرئة. الخطة: تم وصف مضادات الالتهاب غير الستيرويدية والموسعات القصبية لتحسين التنفس وتقليل التهيج. يجب على المريضة حمل معها جهاز الاستنشاق في حالة حدوث نوبة ربو حادة. يتعين على المريضة تجنب التحسس من العوامل المسببة للربو، مثل الدخان والغبار والحيوانات الأليفة. 
يجب مراقبة وظائف الرئة بانتظام ومتابعة التعليمات الطبية المتعلقة بمرض الربو. تعليم المريضة بشأن كيفية استخدام جهاز الاستنشاق بشكل صحيح وتقنيات التنفس الصحيح." val result = pipeline.fullAnnotate(text) ```
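Downstream de-identification typically replaces each detected chunk with its entity label. A minimal sketch of that masking step, assuming the chunk/label pairs have already been extracted from the `fullAnnotate` output (the helper below is illustrative, not a Spark NLP API):

```python
def mask_entities(text, chunks):
    # chunks: list of (chunk_text, entity_label) pairs from the NER output.
    # Each detected chunk is replaced by its label in angle brackets.
    for chunk, label in chunks:
        text = text.replace(chunk, f"<{label}>")
    return text

sample = "اسم الطبيب: د. أميرة أحمد."
print(mask_entities(sample, [("أميرة أحمد", "DOCTOR")]))
# -> "اسم الطبيب: د. <DOCTOR>."
```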
## Results

```bash
Results

+---------------+--------+
|chunks         |entities|
+---------------+--------+
|16 أبريل 2000 |DATE    |
|ليلى حسن |PATIENT |
|789، |ZIP |
|جدة |CITY |
|54321 |ZIP |
|المملكة العربية|CITY |
|السعودية |COUNTRY |
|النور |HOSPITAL|
|أميرة أحمد |DOCTOR |
|ليلى |PATIENT |
|35 |AGE |
+---------------+--------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_deid_subentity_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|ar|
|Size:|1.2 GB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter

---
layout: model
title: Turkish ElectraForQuestionAnswering model (from enelpi)
author: John Snow Labs
name: electra_qa_enelpi_squad
date: 2022-06-22
tags: [tr, open_source, electra, question_answering]
task: Question Answering
language: tr
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-tr-enelpi-squad-qa` is a Turkish model originally trained by `enelpi`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_enelpi_squad_tr_4.0.0_3.0_1655921380832.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_enelpi_squad_tr_4.0.0_3.0_1655921380832.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_enelpi_squad","tr") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_enelpi_squad","tr")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("tr.answer_question.squad.electra").predict("""Benim adım ne?|||Benim adım Clara ve Berkeley'de yaşıyorum.""")
```
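Under the hood, extractive QA heads like this one score every token as a possible answer start and end; the predicted answer is the span with the best combined score. A toy sketch of that selection step in pure Python, independent of the Spark NLP annotator (scores and tokens here are made up):

```python
def best_span(start_scores, end_scores, max_len=30):
    # Pick (start, end) maximizing start_scores[s] + end_scores[e],
    # subject to s <= e and a maximum span length.
    best, best_score = (0, 0), float("-inf")
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = ss + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best

tokens = ["Benim", "adım", "Clara"]
start = [0.1, 0.2, 2.0]
end = [0.0, 0.1, 2.5]
s, e = best_span(start, end)
print(tokens[s:e + 1])  # -> ['Clara']
```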
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_enelpi_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|tr| |Size:|412.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/enelpi/electra-tr-enelpi-squad-qa --- layout: model title: Thai XlmRoBertaForQuestionAnswering (from wicharnkeisei) author: John Snow Labs name: xlm_roberta_qa_thai_xlm_roberta_base_squad2 date: 2022-06-23 tags: [th, open_source, question_answering, xlmroberta] task: Question Answering language: th edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `thai-xlm-roberta-base-squad2` is a Thai model originally trained by `wicharnkeisei`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_thai_xlm_roberta_base_squad2_th_4.0.0_3.0_1655988133743.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_thai_xlm_roberta_base_squad2_th_4.0.0_3.0_1655988133743.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_thai_xlm_roberta_base_squad2","th") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_thai_xlm_roberta_base_squad2","th") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("th.answer_question.squadv2.xlm_roberta.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_thai_xlm_roberta_base_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|th|
|Size:|881.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/wicharnkeisei/thai-xlm-roberta-base-squad2
- https://github.com/iapp-technology/iapp-wiki-qa-dataset

---
layout: model
title: Pdf processing
author: John Snow Labs
name: pdf_processing
date: 2023-01-03
tags: [en, licensed, ocr, pdf_processing]
task: Document Pdf Processing
language: en
nav_key: models
edition: Visual NLP 4.0.0
spark_version: 3.2.1
supported: true
annotator: PdfProcessing
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model obtains the text of an input PDF document with a Tesseract pretrained model. Tesseract is an Optical Character Recognition (OCR) engine developed by Google. It is an open-source tool that can be used to recognize text in images and convert it into machine-readable text. The engine is based on a neural network architecture and uses machine learning algorithms to improve its accuracy over time.

Tesseract has been trained on a variety of datasets to improve its recognition capabilities. These datasets include images of text in various languages and scripts, as well as images with different font styles, sizes, and orientations. The training process involves feeding the engine a large number of images and their corresponding text, allowing the engine to learn the patterns and characteristics of different text styles.

One of the most important datasets used in training Tesseract is the UNLV dataset, which contains over 400,000 images of text in different languages, scripts, and font styles. This dataset is widely used in the OCR community and has been instrumental in improving the accuracy of Tesseract.
Other datasets that have been used in training Tesseract include the ICDAR dataset, the IIIT-HWS dataset, and the RRC-GV-WS dataset. In addition to these datasets, Tesseract also uses a technique called adaptive training, where the engine can continuously improve its recognition capabilities by learning from new images and text. This allows Tesseract to adapt to new text styles and languages, and improve its overall accuracy. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/ocr/PDF_TO_TEXT/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/tutorials/Certification_Trainings/2.1.Pdf_processing.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
import pkg_resources
from pyspark.ml import PipelineModel
from sparkocr.transformers import *

pdf_to_text = PdfToText() \
    .setInputCol("content") \
    .setOutputCol("text") \
    .setSplitPage(True) \
    .setExtractCoordinates(True) \
    .setStoreSplittedPdf(True)

pdf_to_image = PdfToImage() \
    .setInputCol("content") \
    .setOutputCol("image") \
    .setKeepInput(True)

ocr = ImageToText() \
    .setInputCol("image") \
    .setOutputCol("text") \
    .setConfidenceThreshold(60)

pipeline = PipelineModel(stages=[
    pdf_to_text,
    pdf_to_image,
    ocr
])

pdf_path = pkg_resources.resource_filename('sparkocr', 'resources/ocr/pdfs/*.pdf')
pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()

result = pipeline.transform(pdf_example_df).cache()
```
```scala
val pdf_to_text = new PdfToText()
  .setInputCol("content")
  .setOutputCol("text")
  .setSplitPage(true)
  .setExtractCoordinates(true)
  .setStoreSplittedPdf(true)

val pdf_to_image = new PdfToImage()
  .setInputCol("content")
  .setOutputCol("image")
  .setKeepInput(true)

val ocr = new ImageToText()
  .setInputCol("image")
  .setOutputCol("text")
  .setConfidenceThreshold(60)

val pipeline = new Pipeline().setStages(Array(
  pdf_to_text,
  pdf_to_image,
  ocr))

// Point this at a folder of PDFs on your system.
val pdf_path = "resources/ocr/pdfs/*.pdf"
val pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()

val result = pipeline.fit(pdf_example_df).transform(pdf_example_df).cache()
```
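The `setConfidenceThreshold(60)` call drops low-confidence OCR output before it reaches downstream stages. The filtering idea can be sketched in plain Python (the `(text, confidence)` pair layout here is illustrative, not the actual Spark OCR result schema):

```python
def filter_by_confidence(lines, threshold=60):
    # Keep only OCR lines whose recognition confidence meets the threshold.
    return [text for text, conf in lines if conf >= threshold]

ocr_lines = [("Patient Name", 81.2), ("F1nanc1al Numbe", 42.5), ("Date of Birth", 78.5)]
print(filter_by_confidence(ocr_lines))
# -> ['Patient Name', 'Date of Birth']
```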
## Example ### Input: Representation of images 2 and 3 of the example: ![Screenshot](/assets/images/examples_ocr/image7.png) ## Output text ```bash +--------------------+-------------------+------+--------------------+--------------------+-----------------+------------------+--------------------+--------------------+-----------+-------+-----------+----------------+---------+ | path| modificationTime|length| text| positions| height_dimension| width_dimension| content| image|total_pages|pagenum|documentnum| confidence|exception| +--------------------+-------------------+------+--------------------+--------------------+-----------------+------------------+--------------------+--------------------+-----------+-------+-----------+----------------+---------+ |file:/Users/nmeln...|2022-07-14 15:38:51|693743|Patient Nam\nFina...|[{[{Patient Nam\n...|1587.780029296875|1205.8299560546875|[25 50 44 46 2D 3...|{file:/Users/nmel...| 1| 0| 0|81.2276874118381| null| |file:/Users/nmeln...|2022-07-14 15:38:51|693743|Patient Name\nFin...|[{[{Patient Name\...|1583.780029296875|1217.8299560546875|[25 50 44 46 2D 3...|{file:/Users/nmel...| 1| 0| 0|78.5234429732613| null| |file:/Users/nmeln...|2022-07-14 15:38:51| 70556|Alexandria is the...|[{[{A, 0, 72.024,...| 792.0| 612.0|[25 50 44 46 2D 3...| null| null| 0| 0| null| null| |file:/Users/nmeln...|2022-07-14 15:38:51| 70556|Alexandria was fo...|[{[{A, 1, 72.024,...| 792.0| 612.0|[25 50 44 46 2D 3...| null| null| 0| 0| null| null| |file:/Users/nmeln...|2022-07-14 15:38:51| 11601|8 i , . ! 
\n9 i ,...|[{[{8, 0, 72.0604...| 843.0| 596.0|[25 50 44 46 2D 3...| null| null| 0| 0| null| null| +--------------------+-------------------+------+--------------------+--------------------+-----------------+------------------+--------------------+--------------------+-----------+-------+-----------+----------------+---------+ ``` ```bash text 0 Patient Nam Financial Numbe Random Hospital Date of Birth Patient Location Chief Complaint Shortness of breath History of Present Illness Patient is an 84-year-old male wilh a past medical history of hypertension, HFpEF last known EF 55%, mild to moderate TA, pulmonary hypertension, permanent atrial fibrillation on Eliquis, history of GI blesd, CK-M8, and anemia who presents with full weeks oi ccneralized fatigue and fecling unwell. He also notes some shortness oi Breath and worsening dyspnea willy minimal exerlion. His major complaints are shoulder and joint pains. diffusely. He also complains of "bone pain’. He denics having any fevers or cnills. e demes having any chest pain, palpitalicns, He denies any worse extremity swelling than his baseline. He states he’s been compliant with his mcdications. Although he stales he ran out of his Eliquis & few weeks ago. He denies having any blood in his stools or mc!ena, although he does take iron pills and states his stools arc irequently black. His hemoglobin Is al baseline. Twelve-lead EKG showing atrial fibrillation, RBBB, LAFB, PVC. Chest x-ray showing new small right creater than left pleural effusions with mild pulmonary vascular congestion. BNP increased to 2800, up fram 1900. Tropoain 0.03. Renal function at baseline. Hemoaglopin at baseline. She normally takes 80 mq of oral Lasix daily. He was given 80 mg of IV Lasix in the ED. He is currently net negative close to 1 L. He is still on 2 L nasal cannula. ' Ss 5 A 10 system roview af systems was completed and negative except as documented in HPI. 
Physical Exam Vitals & Measurements T: 36.8 °C (Oral) TMIN: 36.8 "C (Oral) TMAX: 37.0 °C (Oral) HR: 54 RR: 7 BP: 140/63 WT: 100.3 KG Pulse Ox: 100 % Oxygen: 2 L'min via Nasal Cannula GENERAL: no acute distress HEAD: normecephalic EYES‘EARS‘NOSE/THAOAT: nupils are equal. normal oropharynx NECK: normal inspection RESPIRATORY: no respiratory distress, no rales on my exam CARDIOVASCULAR: irregular. brady. no murmurs, rubs or galleps ABDOMEN: soft, non-tendes EXTREMITIES: Bilateral chronic venous stasis changes NEUROLOGIC: alert and osieniec x 3. no gross motar or sensory deficils AssessmenvPlan Acute on chronic diastolic CHF (congestive heart failure) Acute on chronic diastolic heart failure exacerbation. Small pleural effusions dilaterally with mild pulmonary vascular congesiion on chest x-ray, slighi elevation in BNR. We'll continue 1 more day af IV diuresis with 80 mg IV Lasix. He may have had a viral infection which precipilated this. We'll add Tylenol jor his joint paias. Continue atenclol and chiorthalidone. AF - Atrial fibrillation Permanent atrial fibrillation. Rates bradycardic in the &0s. Continue atenolol with hola parameters. Coniinue Eliquis for stroke prevention. No evidence oj bleeding, hemog'abin at baseline. Printed: 7/17/2017 13:01 EDT Page 16 of 42 Arincitis CHF - Congestive heart failure Chronic kidney disease Chronic venous insufficiency Edema GI bleeding Glaucoma Goul Hypertension Peptic ulcer Peripheral ncuropathy Peripheral vascular disease Pulmonary hypertension Tricuspid regurgitation Historical No qualifying data Procedure/Surgical History duodenal resection, duodenojcjunostomy. small bowel enterolomy, removal of foreign object and repair oi enterotomy (05/2 1/20 14), colonoscopy (12/10/2013), egd (1209/2013), H/O endoscopy (07/2013), H’O colonoscopy (03/2013), pilonidal cyst removal at base of spine (1981), laser eye surgery ior glaucoma. lesions on small intestine closed up. 
Home Medications Home allopurinol 300 mg oral tablet, 300 MG= 1 TAB, PO. Daily atenolol 25 mg oral tablet, 25 MG= 1 TAB, PO, Daily chtorthalidone 25 mg oral tablet, 23 MG= 1 TAB, PO, MVE Combigan 0.2%-0.5% ophthalmic solution, 1 DROP, Both Eyes, Q12H Eliquis 5 mg oral lablet, 5 MG= 1 TAB, PO, BID lerrous sulfate 925 mg (65 nig elemental iron) oral tablet, 325 MG= 1 TAB, PO, Daily Lasix 80 mg oral tabic:. 80 MG= | TAB. PO, BID omeprazole 20 mg oral delayed scicasc capsule, 20 MG= 1 CAP, PO, BID Percocei 5/325 oral tablet. | TAB, PO. QAM potassium chloride 20 mEq oral tablet, extended release, 20 MEO= 1 TAB, PO, Daily sertraline 50 mg oral tablet, 75 MG= 1,5 TAB, PQ. Daiiy triamcinolone 0.71% lopical cream, 1 APP, Topical, Daily lriamcmnolone 0.1% lopical ominient, 1 APP. Topical, Daily PowerChart 1 Patient Name Financial Number Date of Girth Patient Location Random Hospital H & P Anemia Vitamin D2 50,000 intl units (1.25 ma) oral ALBASeRne capsule, 1 TAS, PO, Veexly-Tue Arthritis Allergies Tylenol for pain. Patient also takes Percocet alt home, will add this cn. Chronic kidney disease AY baseline. Monitor while divresing. Hypertension Blood pressures within tolerable ranges. Pulmonary hypertension Tricuspid regurgitation Mild-to-moderaie on echocardiogram last year sholliisn (cout) sulfa drug (maculopapular rash) Social History Ever Smoked tobacco: Former Smoker Alcohol use - frequency; None Drug use: Never Lab Results O7/16/9 7 05:30 to O7/16/17 05:30 Attending physician note-the patient was interviewed and examined. The appropriatc information in power chart was reviewed. The patient was discussed wilh Dr, Persad. 143 1L 981H 26? Patient may have @ mild degree of heart failure. He and his wife were more concernes with ee Ins peripheral edema. He has underlying renal insufficiency as well. We'll try to diurese him to his “dry" weight. We will then try to adjust his medications to kcep him within & narrow range of [hat weight. 
We will stop his atenolol this point since he is relatively bradycardic anc observe his heart rate on the cardiac monitor. He will progress with his care and aclivily as tolerated. 102 07/16/17 05:30 to O7/ 16/17 05:30 05:30 GLU 102 mg/dL Printed: 7/1 7/2017 13:01 EDT Page 17 of 42 NA K CL TOTAL COZ BUN CRT ANION GAP CA CBC with diff WBC HGB HCT RBC MCV MICH MCHC RDW MPV 143 MMOL/L 3.6 MMOL/L 98 MMOL/L 40 MMOL/L 26 mg/dL. 1.23 mg/dL 5 7.9 mg/dL 07/16/17 05:30 3.4/ nl 10.1 G/DL 32.4 %o 3.41 /PL 95.0 FL 29.6 pg 31.2 % 15,9 %o 10.7 FL PowerChart 2 Alexandria is the second-largest city in Egypt and a major economic centre, extending about 32 km (20 mi) along the coast of the Mediterranean Sea in the north central part of the country. Its low elevation on the Nile delta makes it highly vulnerable to rising sea levels. Alexandria is an important industrial center because of its natural gas and oil pipelines from Suez. Alexandria is also a popular tourist destination. 3 Alexandria was founded around a small, ancient Egyptian town c. 332 BC by Alexander the Great,[4] king of Macedon and leader of the Greek League of Corinth, during his conquest of the Achaemenid Empire. Alexandria became an important center of Hellenistic civilization and remained the capital of Ptolemaic Egypt and Roman and Byzantine Egypt for almost 1,000 years, until the Muslim conquest of Egypt in AD 641, when a new capital was founded at Fustat (later absorbed into Cairo). Hellenistic Alexandria was best known for the Lighthouse of Alexandria (Pharos), one of the Seven Wonders of the Ancient World; its Great Library (the largest in the ancient world); and the Necropolis, one of the Seven Wonders of the Middle Ages. Alexandria was at one time the second most powerful city of the ancient Mediterranean region, after Rome. 
Ongoing maritime archaeology in the harbor of Alexandria, which began in 1994, is revealing details of Alexandria both before the arrival of Alexander, when a city named Rhacotis existed there, and during the Ptolemaic dynasty. 4 8 i , . ! 9 i , . ! 10 i , . ! 11 i , . ! 12 i , . ! 13 i , . ! 14 i , . !
```

## Model Information

{:.table-model}
|---|---|
|Model Name:|pdf_processing|
|Compatibility:|Visual NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|

---
layout: model
title: Swedish BertForTokenClassification Cased model (from hkaraoguz)
author: John Snow Labs
name: bert_token_classifier_swedish_ner
date: 2022-11-29
tags: [sv, open_source, bert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: sv
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BERT_swedish-ner` is a Swedish model originally trained by `hkaraoguz`.

## Predicted Entities

`PER`, `ORG`, `LOC`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_swedish_ner_sv_4.2.4_3.0_1669733190021.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_swedish_ner_sv_4.2.4_3.0_1669733190021.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_swedish_ner","sv") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_swedish_ner","sv")
  .setInputCols(Array("document", "token"))
  .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
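To group the token-level `ner` predictions into entity chunks (the job a downstream `NerConverter` stage performs), the core BIO-merging logic looks roughly like this (a pure-Python sketch, not the Spark NLP implementation; the Swedish example tokens are made up):

```python
def bio_to_chunks(tokens, tags):
    # Merge B-/I- tagged tokens into (chunk, label) pairs; O tokens close chunks.
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(bio_to_chunks(["Anna", "Svensson", "bor", "i", "Stockholm"],
                    ["B-PER", "I-PER", "O", "O", "B-LOC"]))
# -> [('Anna Svensson', 'PER'), ('Stockholm', 'LOC')]
```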
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_swedish_ner|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|sv|
|Size:|465.8 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/hkaraoguz/BERT_swedish-ner
- https://paperswithcode.com/sota?task=Token+Classification&dataset=wikiann

---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_base_few_shot_k_512_finetuned_squad_seed_0
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-512-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_512_finetuned_squad_seed_0_en_4.3.0_3.0_1674215332011.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_512_finetuned_squad_seed_0_en_4.3.0_3.0_1674215332011.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_512_finetuned_squad_seed_0","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_512_finetuned_squad_seed_0","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_few_shot_k_512_finetuned_squad_seed_0|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|433.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-512-finetuned-squad-seed-0

---
layout: model
title: Word2Vec Embeddings in Western Panjabi (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, pnb, open_source]
task: Embeddings
language: pnb
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Word Embeddings lookup annotator that maps tokens to vectors.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_pnb_3.4.1_3.0_1647467596944.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_pnb_3.4.1_3.0_1647467596944.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","pnb") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","pnb") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pnb.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|pnb| |Size:|134.7 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: German asr_exp_w2v2t_r_wav2vec2_s37 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_exp_w2v2t_r_wav2vec2_s37 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2t_r_wav2vec2_s37` is a German model originally trained by jonatasgrosman. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2t_r_wav2vec2_s37_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2t_r_wav2vec2_s37_de_4.2.0_3.0_1664108261011.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2t_r_wav2vec2_s37_de_4.2.0_3.0_1664108261011.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2t_r_wav2vec2_s37', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2t_r_wav2vec2_s37", lang = "de") val annotations = pipeline.transform(audioDF) ```
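The `audioDF` above is assumed to hold one processed audio file per row as an array of floats sampled at 16 kHz, which is what Wav2Vec2 pipelines expect. As a minimal stdlib-only sketch of how such an array can be produced from a mono, 16-bit PCM WAV file (the `load_pcm16` helper is our own illustration, not a Spark NLP API):

```python
import struct
import wave

def load_pcm16(path, expected_rate=16000):
    """Read a mono, 16-bit PCM WAV file and return samples normalized to [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expected 16-bit samples"
        assert wf.getnchannels() == 1, "expected mono audio"
        assert wf.getframerate() == expected_rate, "Wav2Vec2 models expect 16 kHz input"
        raw = wf.readframes(wf.getnframes())
    # 16-bit little-endian signed integers -> floats in [-1.0, 1.0]
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [s / 32768.0 for s in samples]
```

The resulting list can then be wrapped into a single-column DataFrame, e.g. `audioDF = spark.createDataFrame([[floats]], ["audio_content"])`; in practice, libraries such as librosa are often used instead to resample arbitrary audio to 16 kHz first.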
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_exp_w2v2t_r_wav2vec2_s37| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal Assignments Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_assignments_bert date: 2023-03-05 tags: [en, legal, classification, clauses, assignments, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset is aimed at contract provision (paragraph) classification. The contract provisions come from contracts obtained from the US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Assignments` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Assignments`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_assignments_bert_en_1.0.0_3.0_1678050646575.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_assignments_bert_en_1.0.0_3.0_1678050646575.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_assignments_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
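A minimal sketch of the paragraph splitting (by multiline) recommended in the description, done in plain Python before building the Spark DataFrame; note that `split_paragraphs` is our own helper, not part of Spark NLP:

```python
import re

def split_paragraphs(text):
    """Split a document into provisions on blank lines (newline, optional spaces, newline)."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

# Hypothetical two-provision document for illustration.
doc = ("1. ASSIGNMENT.\nNeither party may assign this Agreement.\n\n"
       "2. NOTICES.\nAll notices shall be in writing.")
paragraphs = split_paragraphs(doc)  # one candidate clause per paragraph
```

Each resulting paragraph can then become one row of the DataFrame fed to the classifier, e.g. `spark.createDataFrame([[p] for p in paragraphs]).toDF("text")`.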
## Results ```bash +-------------+ |result | +-------------+ |[Assignments]| |[Other] | |[Other] | |[Assignments]| +-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_assignments_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Assignments 0.93 0.94 0.93 157 Other 0.95 0.94 0.94 188 accuracy - - 0.94 345 macro-avg 0.94 0.94 0.94 345 weighted-avg 0.94 0.94 0.94 345 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Fijian author: John Snow Labs name: opus_mt_en_fj date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, fj, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `en` - target languages: `fj` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_fj_xx_2.7.0_2.4_1609164632179.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_fj_xx_2.7.0_2.4_1609164632179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_fj", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_fj", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.fj').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_fj| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Japanese Part of Speech Tagger (from KoichiYasuoka) author: John Snow Labs name: bert_pos_bert_base_japanese_unidic_luw_upos date: 2022-05-09 tags: [bert, pos, part_of_speech, ja, open_source] task: Part of Speech Tagging language: ja edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-japanese-unidic-luw-upos` is a Japanese model originally trained by `KoichiYasuoka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_japanese_unidic_luw_upos_ja_3.4.2_3.0_1652092034890.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_japanese_unidic_luw_upos_ja_3.4.2_3.0_1652092034890.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_japanese_unidic_luw_upos","ja") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Spark NLPが大好きです"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_japanese_unidic_luw_upos","ja") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Spark NLPが大好きです").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_japanese_unidic_luw_upos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ja| |Size:|415.3 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/KoichiYasuoka/bert-base-japanese-unidic-luw-upos - https://universaldependencies.org/u/pos/ - https://pypi.org/project/fugashi - https://pypi.org/project/unidic-lite - https://pypi.org/project/pytokenizations - http://id.nii.ac.jp/1001/00216223/ - https://github.com/KoichiYasuoka/esupar --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from FabianWillner) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_trivia date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-triviaqa` is an English model originally trained by `FabianWillner`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_trivia_en_4.3.0_3.0_1672773828302.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_trivia_en_4.3.0_3.0_1672773828302.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_trivia","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_trivia","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_trivia| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/FabianWillner/distilbert-base-uncased-finetuned-triviaqa --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_0 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-1024-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_0_en_4.0.0_3.0_1654189490798.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_0_en_4.0.0_3.0_1654189490798.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_1024d_seed_0").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|390.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-1024-finetuned-squad-seed-0 --- layout: model title: Part of Speech for Afrikaans author: John Snow Labs name: pos_afribooms date: 2021-04-06 tags: [pos, open_source, af] task: Part of Speech Tagging language: af edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron` architecture. ## Predicted Entities - ADJ - ADP - ADV - AUX - CCONJ - DET - NOUN - NUM - PART - PRON - PROPN - PUNCT - SCONJ - SYM - VERB - X {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_afribooms_af_3.0.0_3.0_1617749039095.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_afribooms_af_3.0.0_3.0_1617749039095.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_afribooms", "af")\ .setInputCols(["document", "token"])\ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Die kodes wat gebruik word , moet duidelik en verstaanbaar vir leerders en ouers wees .']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_afribooms", "af") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Die kodes wat gebruik word , moet duidelik en verstaanbaar vir leerders en ouers wees .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Die kodes wat gebruik word , moet duidelik en verstaanbaar vir leerders en ouers wees ."] token_df = nlu.load('af.pos.afribooms').predict(text) token_df ```
## Results ```bash +---------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+ |text |result | +---------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+ |Die kodes wat gebruik word , moet duidelik en verstaanbaar vir leerders en ouers wees .|[DET, NOUN, PRON, VERB, AUX, PUNCT, AUX, ADJ, CCONJ, ADJ, ADP, NOUN, CCONJ, NOUN, AUX, PUNCT]| +---------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_afribooms| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[pos]| |Language:|af| ## Data Source The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) data set. 
## Benchmarking ```bash | | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | ADJ | 0.59 | 0.65 | 0.62 | 665 | | ADP | 0.76 | 0.79 | 0.77 | 1299 | | ADV | 0.72 | 0.69 | 0.71 | 523 | | AUX | 0.86 | 0.83 | 0.84 | 663 | | CCONJ | 0.70 | 0.71 | 0.70 | 380 | | DET | 0.84 | 0.70 | 0.76 | 1014 | | NOUN | 0.68 | 0.72 | 0.70 | 2025 | | NUM | 0.89 | 0.81 | 0.85 | 42 | | PART | 0.67 | 0.68 | 0.67 | 322 | | PRON | 0.87 | 0.87 | 0.87 | 794 | | PROPN | 0.87 | 0.67 | 0.75 | 156 | | PUNCT | 0.68 | 0.70 | 0.69 | 877 | | SCONJ | 0.82 | 0.85 | 0.83 | 210 | | SYM | 0.86 | 0.88 | 0.87 | 142 | | VERB | 0.69 | 0.71 | 0.70 | 889 | | X | 0.24 | 0.14 | 0.18 | 64 | | accuracy | | | 0.73 | 10065 | | macro avg | 0.73 | 0.71 | 0.72 | 10065 | | weighted avg | 0.74 | 0.73 | 0.73 | 10065 | ``` --- layout: model title: Social Determinants of Health (clinical_large) author: John Snow Labs name: ner_sdoh_emb_clinical_large_wip date: 2023-04-12 tags: [en, clinical_large, social_determinants, public_health, ner, sdoh, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts terminology related to Social Determinants of Health from various kinds of biomedical documents. 
## Predicted Entities `Access_To_Care`, `Age`, `Alcohol`, `Chidhood_Event`, `Communicable_Disease`, `Community_Safety`, `Diet`, `Disability`, `Eating_Disorder`, `Education`, `Employment`, `Environmental_Condition`, `Exercise`, `Family_Member`, `Financial_Status`, `Food_Insecurity`, `Gender`, `Geographic_Entity`, `Healthcare_Institution`, `Housing`, `Hyperlipidemia`, `Hypertension`, `Income`, `Insurance_Status`, `Language`, `Legal_Issues`, `Marital_Status`, `Mental_Health`, `Obesity`, `Other_Disease`, `Other_SDoH_Keywords`, `Population_Group`, `Quality_Of_Life`, `Race_Ethnicity`, `Sexual_Activity`, `Sexual_Orientation`, `Smoking`, `Social_Exclusion`, `Social_Support`, `Spiritual_Beliefs`, `Substance_Duration`, `Substance_Frequency`, `Substance_Quantity`, `Substance_Use`, `Transportation`, `Violence_Or_Abuse` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/SOCIAL_DETERMINANT_NER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/SOCIAL_DETERMINANT_NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_emb_clinical_large_wip_en_4.3.2_3.2_1681303888925.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_emb_clinical_large_wip_en_4.3.2_3.2_1681303888925.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_sdoh_emb_clinical_large_wip", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter ]) sample_texts = [["Smith is a 55 years old, divorced Mexcian American woman with financial problems. She speaks spanish. She lives in an apartment. She has been struggling with diabetes for the past 10 years and has recently been experiencing frequent hospitalizations due to uncontrolled blood sugar levels. Smith works as a cleaning assistant and does not have access to health insurance or paid sick leave. She has a son student at college. Pt with likely long-standing depression. She is aware she needs rehab. Pt reprots having her catholic faith as a means of support as well. She has long history of etoh abuse, beginning in her teens. She reports she has been a daily drinker for 30 years, most recently drinking beer daily. She smokes a pack of cigarettes a day. 
She had DUI back in April and was due to be in court this week."]] data = spark.createDataFrame(sample_texts).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_sdoh_emb_clinical_large_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter )) val data = Seq("Smith is a 55 years old, divorced Mexcian American woman with financial problems. She speaks spanish. She lives in an apartment. She has been struggling with diabetes for the past 10 years and has recently been experiencing frequent hospitalizations due to uncontrolled blood sugar levels. Smith works as a cleaning assistant and does not have access to health insurance or paid sick leave. She has a son student at college. Pt with likely long-standing depression. She is aware she needs rehab. Pt reprots having her catholic faith as a means of support as well. She has long history of etoh abuse, beginning in her teens. She reports she has been a daily drinker for 30 years, most recently drinking beer daily. She smokes a pack of cigarettes a day. 
She had DUI back in April and was due to be in court this week.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash +------------------+-----+---+-------------------+ |chunk |begin|end|ner_label | +------------------+-----+---+-------------------+ |55 years old |11 |22 |Age | |divorced |25 |32 |Marital_Status | |Mexcian |34 |40 |Race_Ethnicity | |American |42 |49 |Race_Ethnicity | |woman |51 |55 |Gender | |financial problems|62 |79 |Financial_Status | |She |82 |84 |Gender | |spanish |93 |99 |Language | |She |102 |104|Gender | |apartment |118 |126|Housing | |She |129 |131|Gender | |diabetes |158 |165|Other_Disease | |hospitalizations |233 |248|Other_SDoH_Keywords| |cleaning assistant|307 |324|Employment | |health insurance |354 |369|Insurance_Status | |She |391 |393|Gender | |son |401 |403|Family_Member | |student |405 |411|Education | |college |416 |422|Education | |depression |454 |463|Mental_Health | |She |466 |468|Gender | |she |479 |481|Gender | |rehab |489 |493|Access_To_Care | |her |514 |516|Gender | |catholic faith |518 |531|Spiritual_Beliefs | |support |547 |553|Social_Support | |She |565 |567|Gender | |etoh abuse |589 |598|Alcohol | |her |614 |616|Gender | |teens |618 |622|Age | |She |625 |627|Gender | |she |637 |639|Gender | |daily |652 |656|Substance_Quantity | |drinker |658 |664|Alcohol | |30 years |670 |677|Substance_Duration | |drinking beer |694 |706|Alcohol | |daily |708 |712|Substance_Frequency| |She |715 |717|Gender | |smokes |719 |724|Smoking | |a pack |726 |731|Substance_Quantity | |cigarettes |736 |745|Smoking | |a day |747 |751|Substance_Frequency| |She |754 |756|Gender | |DUI |762 |764|Legal_Issues | +------------------+-----+---+-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_sdoh_emb_clinical_large_wip| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.0 MB| |Dependencies:|embeddings_clinical_large| ## References Internal SHOP Project ## Benchmarking ```bash 
label precision recall f1-score support Employment 0.94 0.96 0.95 2075 Social_Support 0.91 0.90 0.90 658 Other_SDoH_Keywords 0.82 0.87 0.85 259 Healthcare_Institution 0.99 0.95 0.97 781 Alcohol 0.96 0.97 0.96 258 Gender 0.99 0.99 0.99 4957 Other_Disease 0.89 0.94 0.91 583 Access_To_Care 0.86 0.88 0.87 520 Mental_Health 0.89 0.81 0.85 494 Age 0.92 0.96 0.94 433 Marital_Status 1.00 1.00 1.00 92 Substance_Quantity 0.88 0.86 0.87 58 Substance_Use 0.91 0.97 0.94 192 Family_Member 0.97 0.99 0.98 2094 Financial_Status 0.86 0.65 0.74 124 Race_Ethnicity 0.93 0.93 0.93 27 Insurance_Status 0.93 0.87 0.90 85 Spiritual_Beliefs 0.86 0.81 0.83 52 Housing 0.88 0.85 0.87 400 Geographic_Entity 0.86 0.88 0.87 113 Disability 0.93 0.93 0.93 44 Quality_Of_Life 0.89 0.75 0.81 67 Income 0.89 0.77 0.83 31 Education 0.85 0.88 0.86 58 Transportation 0.86 0.89 0.88 57 Legal_Issues 0.72 0.91 0.80 47 Smoking 0.98 0.97 0.98 66 Substance_Frequency 0.93 0.75 0.83 57 Hypertension 1.00 1.00 1.00 21 Violence_Or_Abuse 0.83 0.62 0.71 63 Exercise 0.96 0.88 0.92 57 Diet 0.95 0.87 0.91 70 Sexual_Orientation 0.68 1.00 0.81 13 Language 0.89 0.73 0.80 22 Social_Exclusion 0.96 0.90 0.93 29 Substance_Duration 0.75 0.85 0.80 39 Communicable_Disease 1.00 0.84 0.91 31 Chidhood_Event 0.88 0.61 0.72 23 Community_Safety 0.95 0.93 0.94 44 Population_Group 0.89 0.62 0.73 13 Hyperlipidemia 0.78 1.00 0.88 7 Food_Insecurity 1.00 0.93 0.96 29 Eating_Disorder 0.67 0.92 0.77 13 Sexual_Activity 0.84 0.90 0.87 29 Environmental_Condition 1.00 1.00 1.00 20 Obesity 1.00 1.00 1.00 12 micro-avg 0.95 0.95 0.95 15217 macro-avg 0.90 0.88 0.88 15217 weighted-avg 0.95 0.95 0.95 15217 ``` --- layout: model title: Legal Indemnification Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_indemnification_agreement date: 2022-11-10 tags: [en, legal, licensed, classification, indemnification] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true 
annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_indemnification_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `indemnification-agreement` or not (Binary Classification). Longformers have a limit of 4096 tokens, so only the first 4096 tokens will be taken into account. We have found that for the vast majority of documents in legal corpora, if they are clean and contain only the legal document without any extra preceding information, 4096 tokens are enough to perform Document Classification. If not, let us know and we can carry out another approach for you: splitting the document into chunks of 4096 tokens, averaging their embeddings, and training with the averaged version, which means the whole document is taken into account. In theory, however, this should not be required. ## Predicted Entities `indemnification-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_indemnification_agreement_en_1.0.0_3.0_1668108212729.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_indemnification_agreement_en_1.0.0_3.0_1668108212729.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = nlp.ClassifierDLModel.pretrained("legclf_indemnification_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
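The description mentions a fallback for documents longer than 4096 tokens: chunk the document, embed each chunk, and average the embeddings so the whole document contributes. That pre-processing is not part of the pipeline above; the sketch below is a minimal plain-Python illustration of the chunk-then-average idea, with mocked embedding vectors standing in for real Longformer output.

```python
# Illustrative only: split a token list into 4096-token windows and average
# per-window embedding vectors into one document vector. The vectors here are
# mocked; in practice each window would be embedded by the Longformer model.
def chunk(tokens, size=4096):
    """Consecutive windows of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def average_embeddings(vectors):
    """Element-wise mean over equal-length embedding vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

tokens = ["tok"] * 10000
windows = chunk(tokens)                              # 3 windows: 4096, 4096, 1808 tokens
window_embs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # one mocked vector per window
doc_emb = average_embeddings(window_embs)            # [3.0, 4.0]
```

The averaged vector can then be fed to the classifier in place of the first-window embedding, at the cost of blurring positional information across chunks.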
## Results ```bash +-------+ | result| +-------+ |[indemnification-agreement]| |[other]| |[other]| |[indemnification-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_indemnification_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.1 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support indemnification-agreement 1.00 1.00 1.00 31 other 1.00 1.00 1.00 85 accuracy - - 1.00 116 macro-avg 1.00 1.00 1.00 116 weighted-avg 1.00 1.00 1.00 116 ``` --- layout: model title: English DistilBertForMaskedLM Cased model (from joaogante) author: John Snow Labs name: distilbert_embeddings_test_text date: 2022-12-12 tags: [en, open_source, distilbert_embeddings, distilbertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `test_text` is an English model originally trained by `joaogante`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_test_text_en_4.2.4_3.0_1670865040247.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_test_text_en_4.2.4_3.0_1670865040247.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_test_text","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(False) pipeline = Pipeline(stages=[documentAssembler, tokenizer, distilbert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_test_text","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, distilbert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_test_text| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| ## References - https://huggingface.co/joaogante/test_text - https://arxiv.org/abs/1910.01108 - https://yknzhu.wixsite.com/mbweb - https://en.wikipedia.org/wiki/English_Wikipedia --- layout: model title: English BertForQuestionAnswering model (from youngjae) author: John Snow Labs name: bert_qa_youngjae_bert_finetuned_squad date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `youngjae`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_youngjae_bert_finetuned_squad_en_4.0.0_3.0_1654535781311.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_youngjae_bert_finetuned_squad_en_4.0.0_3.0_1654535781311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_youngjae_bert_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_youngjae_bert_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_youngjae").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_youngjae_bert_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/youngjae/bert-finetuned-squad --- layout: model title: Spanish Bert Embeddings (Base, Question, Squades) author: John Snow Labs name: bert_embeddings_dpr_spanish_question_encoder_squades_base date: 2022-04-11 tags: [bert, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `dpr-spanish-question_encoder-squades-base` is a Spanish model originally trained by `IIC`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_dpr_spanish_question_encoder_squades_base_es_3.4.2_3.0_1649671254424.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_dpr_spanish_question_encoder_squades_base_es_3.4.2_3.0_1649671254424.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_dpr_spanish_question_encoder_squades_base","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_dpr_spanish_question_encoder_squades_base","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.dpr_spanish_question_encoder_squades_base").predict("""Me encanta chispa nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_dpr_spanish_question_encoder_squades_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|412.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/IIC/dpr-spanish-question_encoder-squades-base - https://arxiv.org/abs/2004.04906 - https://github.com/facebookresearch/DPR - https://paperswithcode.com/sota?task=text+similarity&dataset=squad_es --- layout: model title: English AlbertForQuestionAnswering model (from saburbutt) author: John Snow Labs name: albert_qa_generic date: 2022-06-24 tags: [en, open_source, albert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: AlBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `qa-generic` is an English model originally trained by `saburbutt`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_generic_en_4.0.0_3.0_1656064418789.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_generic_en_4.0.0_3.0_1656064418789.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_generic","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_generic","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.albert").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_qa_generic| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|771.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/saburbutt/testing --- layout: model title: Translate English to Portuguese-based languages Pipeline author: John Snow Labs name: translate_en_cpp date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, cpp, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `cpp` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_cpp_xx_2.7.0_2.4_1609699630268.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_cpp_xx_2.7.0_2.4_1609699630268.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_cpp", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_cpp", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.cpp').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_cpp| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate Czech to English Pipeline author: John Snow Labs name: translate_cs_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, cs, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `cs` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_cs_en_xx_2.7.0_2.4_1609690011748.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_cs_en_xx_2.7.0_2.4_1609690011748.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_cs_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_cs_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.cs.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_cs_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Breton Lemmatizer author: John Snow Labs name: lemma date: 2020-07-29 23:34:00 +0800 task: Lemmatization language: br edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [lemmatizer, br] supported: true annotator: LemmatizerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb#scrollTo=bbzEH9u7tdxR){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_br_2.5.5_2.4_1596054394143.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_br_2.5.5_2.4_1596054394143.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "br") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Distaolit dimp hon dleoù evel m'hor bo ivez distaolet d'hon dleourion.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "br") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Distaolit dimp hon dleoù evel m'hor bo ivez distaolet d'hon dleourion.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Distaolit dimp hon dleoù evel m'hor bo ivez distaolet d'hon dleourion."""] lemma_df = nlu.load('br.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
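The `fullAnnotate` call above returns annotation rows like those shown in the Results section below. A minimal sketch of collecting just the lemma strings from that structure, using plain dicts to mock the rows (real rows are produced by Spark NLP):

```python
# Mocked annotation rows mirroring the shape of the lemmatizer output;
# in practice they come from light_pipeline.fullAnnotate(...).
rows = [
    {"annotatorType": "token", "begin": 0, "end": 8, "result": "Distaolit"},
    {"annotatorType": "token", "begin": 10, "end": 13, "result": "_"},
    {"annotatorType": "token", "begin": 15, "end": 17, "result": "kaout"},
]

# Each row's `result` field holds one lemma; collect them in order.
lemmas = [r["result"] for r in rows]  # ['Distaolit', '_', 'kaout']
```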
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=8, result='Distaolit', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=10, end=13, result='_', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=15, end=17, result='kaout', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=19, end=23, result='dleoù', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=25, end=28, result='evel', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|br| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Word2Vec Embeddings in Punjabi (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, pa, open_source] task: Embeddings language: pa edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_pa_3.4.1_3.0_1647294419824.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_pa_3.4.1_3.0_1647294419824.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","pa") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["ਮੈਨੂੰ ਸਪਾਰਕ ਐਨਐਲਪੀ ਪਸੰਦ ਹੈ"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","pa") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("ਮੈਨੂੰ ਸਪਾਰਕ ਐਨਐਲਪੀ ਪਸੰਦ ਹੈ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pa.embed.w2v_cc_300d").predict("""ਮੈਨੂੰ ਸਪਾਰਕ ਐਨਐਲਪੀ ਪਸੰਦ ਹੈ""") ```
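A common downstream use of the 300-dimensional token vectors produced above is measuring semantic similarity between words. The cosine-similarity helper below is a generic sketch (pure Python, toy 3-d vectors standing in for real 300-d embeddings), not part of the model's API:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors: 1.0 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Identical directions score 1.0; orthogonal directions score 0.0.
same = cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
orth = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

In practice the vectors would be read from the `embeddings` column of the transformed DataFrame.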
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|pa| |Size:|232.9 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English BertForQuestionAnswering model (from youngjae) author: John Snow Labs name: bert_qa_youngjae_bert_finetuned_squad_accelerate date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad-accelerate` is a English model orginally trained by `youngjae`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_youngjae_bert_finetuned_squad_accelerate_en_4.0.0_3.0_1654535900019.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_youngjae_bert_finetuned_squad_accelerate_en_4.0.0_3.0_1654535900019.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_youngjae_bert_finetuned_squad_accelerate","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_youngjae_bert_finetuned_squad_accelerate","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.accelerate.by_youngjae").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_youngjae_bert_finetuned_squad_accelerate| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/youngjae/bert-finetuned-squad-accelerate --- layout: model title: Legal Consideration Clause Binary Classifier author: John Snow Labs name: legclf_consideration_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `consideration` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `consideration` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_consideration_clause_en_1.0.0_3.2_1660122285073.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_consideration_clause_en_1.0.0_3.2_1660122285073.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_consideration_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
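The description recommends splitting long documents before classification so each chunk fits within the 512-token embedding window. A minimal sketch of the "paragraph splitting (by multiline)" technique, independent of the pipeline above (the sample clause texts are invented for illustration):

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines so each paragraph can be classified separately."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = (
    "1. CONSIDERATION. In consideration of the mutual covenants herein...\n\n"
    "2. TERM. This Agreement shall commence on the Effective Date...\n\n"
    "3. GOVERNING LAW. This Agreement shall be governed by..."
)
chunks = split_paragraphs(doc)  # 3 clause-level chunks to feed the classifier
```

Each resulting chunk can then be loaded into the `clause_text` column of the DataFrame used by the pipeline above.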
## Results ```bash +-------+ | result| +-------+ |[consideration]| |[other]| |[other]| |[consideration]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_consideration_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support consideration 0.73 0.65 0.69 34 other 0.91 0.94 0.92 128 accuracy - - 0.88 162 macro-avg 0.82 0.79 0.81 162 weighted-avg 0.87 0.88 0.87 162 ``` --- layout: model title: Chinese BertForMaskedLM Cased model (from hfl) author: John Snow Labs name: bert_embeddings_chinese_wwm date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese-bert-wwm` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_wwm_zh_4.2.4_3.0_1670020906241.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_wwm_zh_4.2.4_3.0_1670020906241.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_wwm","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_wwm","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_wwm| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|383.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/chinese-bert-wwm - https://arxiv.org/abs/1906.08101 - https://github.com/google-research/bert - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://arxiv.org/abs/2004.13922 --- layout: model title: Context Spell Checker Pipeline for English author: John Snow Labs name: spellcheck_dl_pipeline date: 2022-06-19 tags: [spellcheck, spell, spellcheck_pipeline, spelling_corrector, en, open_source] task: Spell Check language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained spellchecker pipeline is built on top of the [spellcheck_dl](https://nlp.johnsnowlabs.com/2022/04/02/spellcheck_dl_en_2_4.html) model. This pipeline is intended for PySpark 2.4.x users with Spark NLP 3.4.2 and above.
## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CONTEXTUAL_SPELL_CHECKER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spellcheck_dl_pipeline_en_4.0.0_3.0_1655653892700.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/spellcheck_dl_pipeline_en_4.0.0_3.0_1655653892700.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("spellcheck_dl_pipeline", lang = "en") text = ["During the summer we have the best ueather.", "I have a black ueather jacket, so nice."] pipeline.annotate(text) ``` ```scala val pipeline = new PretrainedPipeline("spellcheck_dl_pipeline", lang = "en") val example = Array("During the summer we have the best ueather.", "I have a black ueather jacket, so nice.") pipeline.annotate(example) ```
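In the example sentences above, the misspelling "ueather" must be corrected to "weather" in one context and "leather" in the other. Both candidates are exactly one edit away, which is why a context-aware checker is needed; plain edit distance cannot choose between them. A minimal stdlib sketch (illustrative only, not the model's actual algorithm):

```python
# Illustrative only: isolated Levenshtein distance cannot disambiguate
# "ueather" -> "weather" vs "leather"; both are one substitution away.
# The ContextSpellCheckerModel resolves the tie from the surrounding words.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("ueather", "weather"))  # 1
print(levenshtein("ueather", "leather"))  # 1
```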
## Results ```bash [{'checked': ['During', 'the', 'summer', 'we', 'have', 'the', 'best', 'weather', '.'], 'document': ['During the summer we have the best ueather.'], 'token': ['During', 'the', 'summer', 'we', 'have', 'the', 'best', 'ueather', '.']}, {'checked': ['I', 'have', 'a', 'black', 'leather', 'jacket', ',', 'so', 'nice', '.'], 'document': ['I have a black ueather jacket, so nice.'], 'token': ['I', 'have', 'a', 'black', 'ueather', 'jacket', ',', 'so', 'nice', '.']}] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|spellcheck_dl_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|99.8 MB| ## Included Models - DocumentAssembler - TokenizerModel - ContextSpellCheckerModel --- layout: model title: Translate Ndonga to English Pipeline author: John Snow Labs name: translate_ng_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ng, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `ng` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ng_en_xx_2.7.0_2.4_1609691829310.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ng_en_xx_2.7.0_2.4_1609691829310.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ng_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ng_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ng.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ng_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English asr_finetuned_audio_transcriber TFWav2Vec2ForCTC from mfreihaut author: John Snow Labs name: pipeline_asr_finetuned_audio_transcriber date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_finetuned_audio_transcriber` is an English model originally trained by mfreihaut. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_finetuned_audio_transcriber_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_finetuned_audio_transcriber_en_4.2.0_3.0_1664122195152.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_finetuned_audio_transcriber_en_4.2.0_3.0_1664122195152.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_finetuned_audio_transcriber', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_finetuned_audio_transcriber", lang = "en") val annotations = pipeline.transform(audioDF) ```
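The pipeline expects `audioDF` to contain a column of floating-point audio samples (typically 16 kHz mono). As a sketch under stated assumptions (mono, 16-bit PCM input; the helper name `wav_to_floats` and the final DataFrame line are illustrative, not part of the Spark NLP API), raw WAV bytes can be decoded to normalized floats using only the Python standard library:

```python
import array
import io
import wave

def wav_to_floats(wav_bytes: bytes) -> list:
    """Decode mono 16-bit PCM WAV bytes into floats in [-1.0, 1.0).

    Assumes the input is already mono 16-bit PCM; real recordings may
    also need resampling to the model's expected rate (e.g. 16 kHz).
    """
    with wave.open(io.BytesIO(wav_bytes)) as wav:
        assert wav.getnchannels() == 1 and wav.getsampwidth() == 2
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    return [s / 32768.0 for s in samples]

# The resulting float list is what the pipeline's AudioAssembler stage
# consumes, e.g. (hypothetical column name shown for illustration):
# audioDF = spark.createDataFrame([[wav_to_floats(raw)]], ["audio_content"])
```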
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_finetuned_audio_transcriber| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Arabic Part of Speech Tagger (CA-Classical Arabic dataset, Gulf Arabic POS) author: John Snow Labs name: bert_pos_bert_base_arabic_camelbert_ca_pos_glf date: 2022-04-26 tags: [bert, pos, part_of_speech, ar, open_source] task: Part of Speech Tagging language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-ca-pos-glf` is an Arabic model originally trained by `CAMeL-Lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_ca_pos_glf_ar_3.4.2_3.0_1650993546516.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_ca_pos_glf_ar_3.4.2_3.0_1650993546516.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_ca_pos_glf","ar") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_ca_pos_glf","ar") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("أنا أحب الشرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.pos.arabic_camelbert_ca_pos_glf").predict("""أنا أحب الشرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_arabic_camelbert_ca_pos_glf| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ar| |Size:|407.4 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-ca-pos-glf - https://camel.abudhabi.nyu.edu/annotated-gumar-corpus/ - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://github.com/CAMeL-Lab/camel_tools --- layout: model title: Spanish DistilBertForQuestionAnswering Base Uncased model (from CenIA) author: John Snow Labs name: distilbert_qa_distillbert_base_uncased_finetuned_ml date: 2023-01-03 tags: [es, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distillbert-base-spanish-uncased-finetuned-qa-mlqa` is a Spanish model originally trained by `CenIA`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_distillbert_base_uncased_finetuned_ml_es_4.3.0_3.0_1672774747112.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_distillbert_base_uncased_finetuned_ml_es_4.3.0_3.0_1672774747112.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distillbert_base_uncased_finetuned_ml","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distillbert_base_uncased_finetuned_ml","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_distillbert_base_uncased_finetuned_ml| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|250.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/CenIA/distillbert-base-spanish-uncased-finetuned-qa-mlqa --- layout: model title: Detect PHI for Deidentification (Augmented) author: John Snow Labs name: ner_deid_synthetic date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Deidentification NER (Augmented) is a Named Entity Recognition model that annotates text to find protected health information that may need to be deidentified. We adhered to the official annotation guidelines (AG) of the 2014 i2b2 Deid challenge while annotating new datasets for this model.
All the details regarding the nuances of and rationale behind the AG can be found here: [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/) ## Predicted Entities `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_synthetic_en_3.0.0_3.0_1617209707564.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_synthetic_en_3.0.0_3.0_1617209707564.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_deid_synthetic","en","clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital, Dr. John Green (2347165768). He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed.
The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same. ']], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_deid_synthetic", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital, Dr. John Green (2347165768). He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. 
A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same. """).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.deid.synthetic").predict("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital, Dr. John Green (2347165768). He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. 
The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same. """) ```
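The micro-averaged scores in the Benchmarking section of this card follow directly from the aggregate true-positive, false-positive, and false-negative counts (tp=6900, fp=435, fn=410). A quick check, independent of Spark NLP:

```python
# Recompute the micro-averaged benchmark row from aggregate counts.
tp, fp, fn = 6900, 435, 410

precision = tp / (tp + fp)          # 6900 / 7335
recall = tp / (tp + fn)             # 6900 / 7310
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 6))  # 0.940695
print(round(recall, 6))     # 0.943912
print(round(f1, 6))         # 0.942301
```

Unlike the macro average, which weights all entity types equally, the micro average is dominated by frequent labels such as DATE and NAME.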
## Results ```bash +---------------+---------+ |chunk |ner_label| +---------------+---------+ |Smith |NAME | |VA Hospital |LOCATION | |John Green |NAME | |2347165768 |ID | |Day Hospital |LOCATION | |02/04/2003 |DATE | |Smith |NAME | |Day Hospital |LOCATION | |Smith |NAME | |Smith |NAME | |7 Ardmore Tower|LOCATION | |Hart |NAME | |Smith |NAME | |02/07/2003 |DATE | +---------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_synthetic| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on augmented synthetic PHI data modeled on the n2c2 2014 De-identification and Heart Disease Risk Factors Challenge datasets (https://portal.dbmi.hms.harvard.edu/projects/n2c2-2014/), using embeddings_clinical embeddings ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|--------------:|------:|------:|------:|---------:|---------:|---------:| | 0 | I-NAME | 1096 | 47 | 80 | 0.95888 | 0.931973 | 0.945235 | | 1 | I-CONTACT | 93 | 0 | 4 | 1 | 0.958763 | 0.978947 | | 2 | I-AGE | 3 | 1 | 6 | 0.75 | 0.333333 | 0.461538 | | 3 | B-DATE | 2078 | 42 | 52 | 0.980189 | 0.975587 | 0.977882 | | 4 | I-DATE | 474 | 39 | 25 | 0.923977 | 0.9499 | 0.936759 | | 5 | I-LOCATION | 755 | 68 | 76 | 0.917375 | 0.908544 | 0.912938 | | 6 | I-PROFESSION | 78 | 8 | 9 | 0.906977 | 0.896552 | 0.901734 | | 7 | B-NAME | 1182 | 101 | 36 | 0.921278 | 0.970443 | 0.945222 | | 8 | B-AGE | 259 | 10 | 11 | 0.962825 | 0.959259 | 0.961039 | | 9 | B-ID | 146 | 8 | 11 | 0.948052 | 0.929936 | 0.938907 | | 10 | B-PROFESSION | 76 | 9 | 21 | 0.894118 | 0.783505 | 0.835165 | | 11 | B-LOCATION | 556 | 87 | 71 | 0.864697 | 0.886762 | 0.875591 | | 12 | I-ID | 64 | 8 | 3 | 0.888889 | 0.955224 | 0.920863 | | 13 | B-CONTACT | 40 | 7 | 5 | 0.851064 | 0.888889 | 0.869565 | | 14 | Macro-average | 6900 | 435 | 410 | 0.912023 | 0.880619 | 0.896046 |
| 15 | Micro-average | 6900 | 435 | 410 | 0.940695 | 0.943912 | 0.942301 | ``` --- layout: model title: English BertForQuestionAnswering model (from howey) author: John Snow Labs name: bert_qa_howey_bert_large_uncased_squad date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-squad` is an English model originally trained by `howey`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_howey_bert_large_uncased_squad_en_4.0.0_3.0_1654536567405.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_howey_bert_large_uncased_squad_en_4.0.0_3.0_1654536567405.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_howey_bert_large_uncased_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_howey_bert_large_uncased_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.large_uncased.by_howey").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_howey_bert_large_uncased_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/howey/bert-large-uncased-squad --- layout: model title: Word2Vec Embeddings in Ossetic (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, os, open_source] task: Embeddings language: os edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_os_3.4.1_3.0_1647451060944.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_os_3.4.1_3.0_1647451060944.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","os") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","os") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("os.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
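Downstream, the 300-dimensional token vectors produced by this model are typically compared with cosine similarity. A minimal stdlib sketch (the 3-d vectors below are toy stand-ins for real 300-d embedding rows, not actual model output):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Identical directions score 1.0; orthogonal directions score 0.0.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```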
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|os| |Size:|83.8 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Legal Question Answering (RoBerta, CUAD, Large) author: John Snow Labs name: legqa_roberta_cuad_large date: 2023-01-30 tags: [en, licensed, tensorflow] task: Question Answering language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Legal RoBerta-based Question Answering model, trained on squad-v2 and fine-tuned on the CUAD dataset. To use it, a specific prompt is required. This is an example prompt for extracting PARTIES: ``` "Highlight the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. Details: The two or more parties who signed the contract" ``` ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legqa_roberta_cuad_large_en_1.0.0_3.0_1675082400688.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legqa_roberta_cuad_large_en_1.0.0_3.0_1675082400688.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) spanClassifier = nlp.RoBertaForQuestionAnswering.pretrained("legqa_roberta_cuad_large","en", "legal/models") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = nlp.Pipeline().setStages([ documentAssembler, spanClassifier ]) text = """THIS CREDIT AGREEMENT is dated as of April 29, 2010, and is made by and among P.H. GLATFELTER COMPANY, a Pennsylvania corporation ( the "COMPANY") and certain of its subsidiaries. Identified on the signature pages hereto (each a "BORROWER" and collectively, the "BORROWERS"), each of the GUARANTORS (as hereinafter defined), the LENDERS (as hereinafter defined), PNC BANK, NATIONAL ASSOCIATION, in its capacity as agent for the Lenders under this Agreement (hereinafter referred to in such capacity as the "ADMINISTRATIVE AGENT"), and, for the limited purpose of public identification in trade tables, PNC CAPITAL MARKETS LLC and CITIZENS BANK OF PENNSYLVANIA, as joint arrangers and joint bookrunners, and CITIZENS BANK OF PENNSYLVANIA, as syndication agent.""".replace('\n',' ') questions = ['"Highlight the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. Details: The two or more parties who signed the contract"'] qt = [ [q,text] for q in questions ] example = spark.createDataFrame(qt).toDF("question", "context") result = pipeline.fit(example).transform(example) result.select('document_question.result', 'answer.result').show(truncate=False) ```
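The CUAD-style prompt follows a fixed template, so it can be assembled programmatically for any extraction category. A minimal sketch, assuming the template shown above; the category/description pairs in the dictionary are illustrative, not the exhaustive official CUAD list:

```python
# Build a CUAD-style extraction prompt for a given category.
# The template mirrors the "Parties" example above; the category
# descriptions here are illustrative, not the official CUAD text.
CUAD_CATEGORIES = {
    "Parties": "The two or more parties who signed the contract",
    "Agreement Date": "The date of the contract",
}

def build_cuad_prompt(category: str) -> str:
    details = CUAD_CATEGORIES[category]
    return (
        f'Highlight the parts (if any) of this contract related to '
        f'"{category}" that should be reviewed by a lawyer. '
        f'Details: {details}'
    )

print(build_cuad_prompt("Parties"))
```

Each generated prompt would then be paired with the contract text as a (question, context) row, as in the pipeline code above.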
## Results ```bash ["Highlight the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. Details: The two or more parties who signed the contract"]|[P . H . GLATFELTER COMPANY]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legqa_roberta_cuad_large| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|454.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References SQuAD, fine-tuned with CUAD-based question answering --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_twostagetriplet_epochs_1_shard_1_squad2.0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_twostagetriplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_twostagetriplet_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223967265.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_twostagetriplet_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223967265.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_twostagetriplet_epochs_1_shard_1_squad2.0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_twostagetriplet_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
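Under the hood, extractive QA annotators like this one score every token as a candidate answer start and end, and return the best-scoring valid span from the context. A toy pure-Python sketch of that selection step; the tokenization and scores below are made-up illustrations, not the model's actual outputs:

```python
# Pick the best (start, end) answer span from per-token scores,
# requiring start <= end and a bounded span length.
def best_span(start_scores, end_scores, max_len=15):
    best, best_score = (0, 0), float("-inf")
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = ss + end_scores[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Illustrative tokens and scores for the example in the card.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end   = [0.1, 0.1, 0.1, 4.8, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```

In the Spark NLP pipeline above, this selection happens inside the annotator; the chosen span is what appears in the `answer` column's `result` field.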
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_twostagetriplet_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|306.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_twostagetriplet_epochs_1_shard_1_squad2.0 --- layout: model title: English BertForTokenClassification Uncased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4_Modified_bluebert_pubmed_uncased_L_12_H_768_A_12 date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4_Modified-bluebert_pubmed_uncased_L-12_H-768_A-12` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_Modified_bluebert_pubmed_uncased_L_12_H_768_A_12_en_4.0.0_3.0_1657109104665.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_Modified_bluebert_pubmed_uncased_L_12_H_768_A_12_en_4.0.0_3.0_1657109104665.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_Modified_bluebert_pubmed_uncased_L_12_H_768_A_12","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_Modified_bluebert_pubmed_uncased_L_12_H_768_A_12","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4_Modified_bluebert_pubmed_uncased_L_12_H_768_A_12| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|407.6 MB| |Case sensitive:|false| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4_Modified-bluebert_pubmed_uncased_L-12_H-768_A-12 --- layout: model title: Japanese BertForQuestionAnswering Base Cased model (from chiba) author: John Snow Labs name: bert_qa_base_japanese_whole_word_masking_tes date: 2022-07-07 tags: [ja, open_source, bert, question_answering] task: Question Answering language: ja edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-japanese-whole-word-masking_test` is a Japanese model originally trained by `chiba`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_japanese_whole_word_masking_tes_ja_4.0.0_3.0_1657183050287.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_japanese_whole_word_masking_tes_ja_4.0.0_3.0_1657183050287.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_japanese_whole_word_masking_tes","ja") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["私の名前は何ですか?", "私の名前はクララで、私はバークレーに住んでいます。"]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_japanese_whole_word_masking_tes","ja") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("私の名前は何ですか?", "私の名前はクララで、私はバークレーに住んでいます。")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_japanese_whole_word_masking_tes| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ja| |Size:|413.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/chiba/bert-base-japanese-whole-word-masking_test --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from rsvp-ai) author: John Snow Labs name: roberta_qa_bertserini_base date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bertserini-roberta-base` is an English model originally trained by `rsvp-ai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_bertserini_base_en_4.3.0_3.0_1674209264464.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_bertserini_base_en_4.3.0_3.0_1674209264464.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_bertserini_base","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_bertserini_base","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_bertserini_base| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|449.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/rsvp-ai/bertserini-roberta-base --- layout: model title: English T5ForConditionalGeneration Small Cased model (from mtreviso) author: John Snow Labs name: t5_ct5_small_wiki_l2r date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ct5-small-en-wiki-l2r` is an English model originally trained by `mtreviso`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_ct5_small_wiki_l2r_en_4.3.0_3.0_1675100797165.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_ct5_small_wiki_l2r_en_4.3.0_3.0_1675100797165.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_ct5_small_wiki_l2r","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_ct5_small_wiki_l2r","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_ct5_small_wiki_l2r| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|288.7 MB| ## References - https://huggingface.co/mtreviso/ct5-small-en-wiki-l2r - https://github.com/mtreviso/chunked-t5 --- layout: model title: Portuguese Bert Embeddings (Large) author: John Snow Labs name: bert_embeddings_bert_large_portuguese_cased date: 2022-04-11 tags: [bert, embeddings, pt, open_source] task: Embeddings language: pt edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-portuguese-cased` is a Portuguese model originally trained by `neuralmind`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_portuguese_cased_pt_3.4.2_3.0_1649673670038.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_portuguese_cased_pt_3.4.2_3.0_1649673670038.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_portuguese_cased","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Eu amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_portuguese_cased","pt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Eu amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.embed.bert_large_portuguese_cased").predict("""Eu amo Spark NLP""") ```
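Each token in the resulting `embeddings` column carries a dense vector, and a common downstream use is comparing tokens by cosine similarity. A minimal pure-Python sketch of that comparison; the toy 3-dimensional vectors below stand in for the real BERT vectors and are purely illustrative:

```python
import math

# Cosine similarity between two embedding vectors.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for token embeddings: two city-like
# tokens should be closer to each other than to an unrelated one.
v_rio = [0.9, 0.1, 0.2]
v_lisboa = [0.8, 0.2, 0.1]
v_carro = [0.1, 0.9, 0.7]

print(cosine(v_rio, v_lisboa) > cosine(v_rio, v_carro))  # -> True
```

With the pipeline above, the real vectors can be pulled from `result.select("embeddings.embeddings")` and compared the same way.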
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_large_portuguese_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|pt| |Size:|1.3 GB| |Case sensitive:|true| ## References - https://huggingface.co/neuralmind/bert-large-portuguese-cased - https://github.com/neuralmind-ai/portuguese-bert/ --- layout: model title: Detect Entities Related to Cancer Diagnosis author: John Snow Labs name: ner_oncology_diagnosis date: 2022-10-25 tags: [licensed, clinical, oncology, en, ner] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Healthcare 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts entities related to cancer diagnosis, such as Metastasis, Histological_Type or Invasion. Definitions of Predicted Entities: - `Adenopathy`: Mentions of pathological findings of the lymph nodes. - `Cancer_Dx`: Mentions of cancer diagnoses (such as "breast cancer") or pathological types that are usually used as synonyms for "cancer" (e.g. "carcinoma"). When anatomical references are present, they are included in the Cancer_Dx extraction. - `Cancer_Score`: Clinical or imaging scores that are specific for cancer settings (e.g. "BI-RADS" or "Allred score"). - `Grade`: All pathological grading of tumors (e.g. "grade 1") or degrees of cellular differentiation (e.g. "well-differentiated") - `Histological_Type`: Histological variants or cancer subtypes, such as "papillary", "clear cell" or "medullary". - `Invasion`: Mentions that refer to tumor invasion, such as "invasion" or "involvement". Metastases or lymph node involvement are excluded from this category. - `Metastasis`: Terms that indicate a metastatic disease. Anatomical references are not included in these extractions. 
- `Pathology_Result`: The findings of a biopsy from the pathology report that is not covered by another entity (e.g. "malignant ductal cells"). - `Performance_Status`: Mentions of performance status scores, such as ECOG and Karnofsky. The name of the score is extracted together with the result (e.g. "ECOG performance status of 4"). - `Staging`: Mentions of cancer stage such as "stage 2b" or "T2N1M0". It also includes words such as "in situ", "early-stage" or "advanced". - `Tumor_Finding`: All nonspecific terms that may be related to tumors, either malignant or benign (for example: "mass", "tumor", "lesion", or "neoplasm"). - `Tumor_Size`: Size of the tumor, including numerical value and unit of measurement (e.g. "3 cm"). ## Predicted Entities `Adenopathy`, `Cancer_Dx`, `Cancer_Score`, `Grade`, `Histological_Type`, `Invasion`, `Metastasis`, `Pathology_Result`, `Performance_Status`, `Staging`, `Tumor_Finding`, `Tumor_Size` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_diagnosis_en_4.0.0_3.0_1666719602276.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_diagnosis_en_4.0.0_3.0_1666719602276.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_diagnosis", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["Two years ago, the patient presented with a tumor in her left breast and adenopathies. She was diagnosed with invasive ductal carcinoma. 
Last week she was also found to have a lung metastasis."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_diagnosis", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Two years ago, the patient presented with a tumor in her left breast and adenopathies. She was diagnosed with invasive ductal carcinoma. Last week she was also found to have a lung metastasis.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_diagnosis").predict("""Two years ago, the patient presented with a tumor in her left breast and adenopathies. She was diagnosed with invasive ductal carcinoma. Last week she was also found to have a lung metastasis.""") ```
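The `NerConverter` stage above turns the model's per-token IOB tags into the labeled chunks shown in the results. A simplified pure-Python sketch of that merging; the tag sequence is an illustrative assumption of what the model emits for the sample sentence, and real `NerConverter` also tracks character offsets and confidence scores:

```python
# Merge IOB tags into (chunk, label) pairs, as NerConverter does
# (simplified: ignores sentence boundaries, offsets, and scores).
def iob_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" tag ends any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["invasive", "ductal", "carcinoma", "and", "lung", "metastasis"]
tags = ["B-Histological_Type", "B-Histological_Type", "B-Cancer_Dx", "O", "O", "B-Metastasis"]
print(iob_to_chunks(tokens, tags))
# -> [('invasive', 'Histological_Type'), ('ductal', 'Histological_Type'),
#     ('carcinoma', 'Cancer_Dx'), ('metastasis', 'Metastasis')]
```

Note how "lung" stays outside the chunk: per the entity definitions above, anatomical references are not included in Metastasis extractions.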
## Results ```bash | chunk | ner_label | |:-------------|:------------------| | tumor | Tumor_Finding | | adenopathies | Adenopathy | | invasive | Histological_Type | | ductal | Histological_Type | | carcinoma | Cancer_Dx | | metastasis | Metastasis | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_diagnosis| |Compatibility:|Spark NLP for Healthcare 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.3 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Histological_Type 354 63 99 453 0.85 0.78 0.81 Staging 234 27 24 258 0.90 0.91 0.90 Cancer_Score 36 15 26 62 0.71 0.58 0.64 Tumor_Finding 1121 83 136 1257 0.93 0.89 0.91 Invasion 154 27 27 181 0.85 0.85 0.85 Tumor_Size 1058 126 71 1129 0.89 0.94 0.91 Adenopathy 66 10 30 96 0.87 0.69 0.77 Performance_Status 116 15 19 135 0.89 0.86 0.87 Pathology_Result 852 686 290 1142 0.55 0.75 0.64 Metastasis 356 15 14 370 0.96 0.96 0.96 Cancer_Dx 1302 88 92 1394 0.94 0.93 0.94 Grade 201 23 35 236 0.90 0.85 0.87 macro_avg 5850 1178 863 6713 0.85 0.83 0.84 micro_avg 5850 1178 863 6713 0.85 0.87 0.86 ``` --- layout: model title: English Named Entity Recognition (from abhibisht89) author: John Snow Labs name: bert_ner_spanbert_large_cased_finetuned_ade_corpus_v2 date: 2022-05-09 tags: [bert, ner, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `spanbert-large-cased-finetuned-ade_corpus_v2` is an English model originally trained by `abhibisht89`.
## Predicted Entities `DRUG`, `ADR` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_spanbert_large_cased_finetuned_ade_corpus_v2_en_3.4.2_3.0_1652096991524.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_spanbert_large_cased_finetuned_ade_corpus_v2_en_3.4.2_3.0_1652096991524.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_spanbert_large_cased_finetuned_ade_corpus_v2","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_spanbert_large_cased_finetuned_ade_corpus_v2","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_spanbert_large_cased_finetuned_ade_corpus_v2| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/abhibisht89/spanbert-large-cased-finetuned-ade_corpus_v2 --- layout: model title: Extract textual entities in biomedical texts author: John Snow Labs name: ner_nature_nero_clinical date: 2022-02-08 tags: [ner, en, clinical, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is based on the NERO corpus and extracts general-purpose textual entities. It was trained to counter the claims made in https://www.nature.com/articles/s41540-021-00200-x regarding Spark NLP's performance, demonstrating that better results can be achieved than those reported there.
So, **this model is not meant to be used in production.** ## Predicted Entities `Organismpart`, `Chromosome`, `Physicalphenomenon`, `Abstractconcept`, `Gene`, `Meas`, `Machineactivity`, `Warfarin`, `Gen`, `Aminoacidpeptide`, `Language`, `P`, `Quantityormeasurement`, `Disease`, `Process`, `Propernamedgeographicallocation`, `Duration`, `Medicalprocedureordevice`, `Citation`, `Geographicnotproper`, `Atom`, `Gp`, `Medicaldevice`, `Namedentity`, `Unpropernamedgeographicallocation`, `Persongroup`, `Unit`, `Bodypart`, `Unconjugated`, `Timepoint`, `Protein`, `Publishedsourceofinformation`, `Quantity`, `Dr`, `Organism`, `Nonproteinornucleicacidchemical`, `G`, `Researchactivity`, `Drug`, `Measurement`, `Cells`, `Journal`, `Relationshipphrase`, `Medicalprocedure`, `Geographiclocation`, `Groupofpeople`, `Person`, `Tissue`, `Mentalprocess`, `Facility`, `Chemical`, `Geneorproteingroup`, `Ion`, `Food`, `Aminoacid`, `N`, `Biologicalprocess`, `Cell`, `Researchactivty`, `Publicationorcitation`, `Molecularprocess`, `Experimentalfactor`, `Medicalfinding`, `Nucleicacid`, `Laboratoryexperimentalfactor`, `Relationship`, `Geographicallocation`, `Geneorprotein`, `Smallmolecule`, `Partofprotein`, `Thing`, `Quantityormeasure`, `Environmentalfactor`, `Intellectualproduct`, `R`, `Molecule`, `Time`, `Anatomicalpart`, `Cellcomponent`, `Nucleicacidsubstance` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_nature_nero_clinical_en_3.3.4_3.0_1644358495292.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_nature_nero_clinical_en_3.3.4_3.0_1644358495292.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_nature_nero_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature."]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_nature_nero_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.nero_clinical.nature").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
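The `NerConverter` stage in the pipelines above merges token-level BIO tags into entity chunks before they reach the results table. As a rough plain-Python illustration of that merging logic (a conceptual sketch only, not Spark NLP's actual implementation):

```python
def merge_bio(tokens, tags):
    """Merge token-level BIO tags into (chunk_text, entity_label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close any chunk already in progress
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)  # continue the open chunk
        else:  # "O", or an I- tag with no matching B-: close the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(merge_bio(
    ["no", "perioral", "cyanosis", "or", "retractions"],
    ["O", "B-Medicalfinding", "I-Medicalfinding", "O", "O"]))
# → [('perioral cyanosis', 'Medicalfinding')]
```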
## Results ```bash | | chunk | entity | |---:|:---------------------------------------------|:----------------------| | 0 | perioral cyanosis | Medicalfinding | | 1 | One day | Duration | | 2 | mom | Namedentity | | 3 | tactile temperature | Quantityormeasurement | | 4 | patient Tylenol | Chemical | | 5 | decreased p.o. intake | Medicalprocedure | | 6 | normal breast-feeding | Medicalfinding | | 7 | 20 minutes q.2h | Timepoint | | 8 | 5 to 10 minutes | Duration | | 9 | respiratory congestion | Medicalfinding | | 10 | past 2 days | Duration | | 11 | parents | Persongroup | | 12 | improvement | Process | | 13 | albuterol treatments | Medicalprocedure | | 14 | ER | Bodypart | | 15 | urine output | Quantityormeasurement | | 16 | 8 to 10 wet and 5 dirty diapers per 24 hours | Measurement | | 17 | 4 wet diapers per 24 hours | Measurement | | 18 | Mom | Person | | 19 | diarrhea | Medicalfinding | | 20 | bowel movements | Biologicalprocess | | 21 | soft in nature | Biologicalprocess | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_nature_nero_clinical| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|15.1 MB| ## References This model is based on https://www.nature.com/articles/s41540-021-00200-x and a response to: https://static-content.springer.com/esm/art%3A10.1038%2Fs41540-021-00200-x/MediaObjects/41540_2021_200_MOESM1_ESM.pdf ## Benchmarking ```bash label tp fp fn prec rec f1 B-Atom 11 7 48 0.6111111 0.18644068 0.2857143 I-Laboratoryexperimentalfactor 0 3 55 0.0 0.0 0.0 B-Disease 489 232 251 0.6782247 0.6608108 0.6694045 B-Partofprotein 77 61 67 0.557971 0.5347222 0.54609925 B-Nonproteinornucleicacidchemical 45 58 190 0.4368932 0.19148937 0.2662722 I-Propernamedgeographicallocation 60 33 31 0.6451613 0.6593407 0.6521739 B-Bodypart 648 336 343 0.6585366 0.65388495 0.6562025 B-Protein 832 504 440 0.6227545 0.6540881 
0.63803685 I-Unit 0 0 4 0.0 0.0 0.0 B-Chemical 1390 1066 926 0.5659609 0.6001727 0.5825649 B-Publicationorcitation 2 10 9 0.16666667 0.18181819 0.17391303 I-Smallmolecule 34 201 353 0.14468086 0.087855294 0.10932476 I-Abstractconcept 0 0 4 0.0 0.0 0.0 I-Nucleicacid 449 213 249 0.67824775 0.6432665 0.6602941 B-Drug 306 170 213 0.64285713 0.5895954 0.6150754 B-Thing 0 0 19 0.0 0.0 0.0 I-Citation 6 14 17 0.3 0.26086956 0.27906975 I-Aminoacid 80 43 95 0.6504065 0.45714286 0.53691274 B-Medicalprocedureordevice 1 0 2 1.0 0.33333334 0.5 I-Nucleicacidsubstance 3 3 14 0.5 0.1764706 0.26086956 I-Gp 1244 860 427 0.5912548 0.7444644 0.6590729 I-Geographicallocation 0 18 17 0.0 0.0 0.0 I-Molecule 15 114 58 0.11627907 0.20547946 0.14851485 B-R 0 0 1 0.0 0.0 0.0 I-Measurement 1700 893 546 0.6556113 0.75690114 0.7026245 I-Intellectualproduct 197 251 311 0.43973213 0.38779527 0.41213387 B-Anatomicalpart 0 22 38 0.0 0.0 0.0 B-Gp 3597 1426 818 0.71610594 0.81472254 0.7622378 B-Person 105 42 80 0.71428573 0.5675676 0.63253015 I-Aminoacidpeptide 55 38 35 0.5913978 0.6111111 0.6010929 B-Environmentalfactor 24 30 40 0.44444445 0.375 0.40677968 B-Cellcomponent 188 146 191 0.56287426 0.49604222 0.5273493 I-Groupofpeople 1 1 10 0.5 0.09090909 0.15384614 I-Chromosome 39 27 12 0.59090906 0.7647059 0.6666667 B-G 0 0 1 0.0 0.0 0.0 I-Publishedsourceofinformation 122 100 131 0.5495495 0.48221344 0.5136842 I-Disease 710 313 239 0.69403714 0.74815595 0.72008115 I-Time 19 32 86 0.37254903 0.18095239 0.24358974 I-Relationship 41 18 33 0.69491524 0.5540541 0.6165413 I-Nonproteinornucleicacidchemical 32 87 257 0.26890758 0.11072665 0.15686275 I-Molecularprocess 1257 1057 589 0.5432152 0.68093175 0.6043269 I-Persongroup 587 199 233 0.7468193 0.71585363 0.7310087 B-Laboratoryexperimentalfactor 0 2 42 0.0 0.0 0.0 I-Mentalprocess 24 30 97 0.44444445 0.1983471 0.2742857 B-Aminoacidpeptide 33 26 44 0.55932206 0.42857143 0.48529413 B-Food 63 29 54 0.6847826 0.53846157 0.6028708 B-Journal 0 0 3 0.0 0.0 0.0 
I-Quantityormeasure 0 2 4 0.0 0.0 0.0 I-Cell 1035 212 252 0.829992 0.8041958 0.81689036 B-Tissue 57 41 53 0.5816327 0.5181818 0.5480769 I-Medicaldevice 51 58 53 0.4678899 0.4903846 0.47887325 B-Mentalprocess 57 49 156 0.5377358 0.26760563 0.35736677 I-Bodypart 659 406 366 0.61877936 0.6429268 0.630622 I-Researchactivity 1073 568 410 0.65386957 0.7235334 0.6869398 I-Atom 11 2 51 0.84615386 0.17741935 0.29333335 B-Namedentity 173 423 504 0.29026845 0.25553915 0.2717989 B-Quantityormeasure 2 2 6 0.5 0.25 0.33333334 B-Citation 0 2 3 0.0 0.0 0.0 I-Cellcomponent 183 166 180 0.5243553 0.5041322 0.51404494 B-Unit 0 0 3 0.0 0.0 0.0 I-Person 41 50 33 0.45054945 0.5540541 0.49696973 I-Quantityormeasurement 202 418 557 0.32580644 0.26613966 0.2929659 B-Organismpart 23 54 41 0.2987013 0.359375 0.32624114 B-Cell 723 206 231 0.7782562 0.7578616 0.76792353 I-Chemical 1898 1128 832 0.62723064 0.6952381 0.65948576 I-Medicalfinding 1749 1267 1362 0.5799072 0.56219864 0.5709156 B-Process 1522 1421 1714 0.51715934 0.47033376 0.49263635 I-Food 75 39 60 0.65789473 0.5555556 0.60240966 I-Duration 344 269 169 0.5611746 0.6705653 0.61101246 I-Experimentalfactor 59 173 200 0.25431034 0.22779922 0.24032587 I-Quantity 742 670 750 0.52549577 0.49731904 0.5110193 B-Physicalphenomenon 1 2 11 0.33333334 0.083333336 0.13333334 I-Medicalprocedureordevice 3 0 3 1.0 0.5 0.6666667 B-Aminoacid 86 41 109 0.6771653 0.44102564 0.5341615 B-Quantity 613 554 645 0.5252785 0.4872814 0.5055671 B-Cells 0 0 2 0.0 0.0 0.0 I-Gene 134 44 95 0.752809 0.58515286 0.65847665 B-Medicalfinding 1913 1389 1296 0.5793458 0.59613585 0.5876209 I-Tissue 70 58 42 0.546875 0.625 0.5833333 B-Molecule 20 48 62 0.29411766 0.24390244 0.26666665 I-Organism 959 373 282 0.71997 0.7727639 0.74543333 I-Medicalprocedure 775 475 370 0.62 0.6768559 0.64718163 B-Unpropernamedgeographicallocation 7 5 25 0.5833333 0.21875 0.3181818 I-Timepoint 3 9 32 0.25 0.08571429 0.12765957 I-Organismpart 15 47 27 0.24193548 0.35714287 0.28846154 
I-Biologicalprocess 798 787 835 0.50347 0.48867115 0.49596024 B-Time 47 49 97 0.48958334 0.3263889 0.39166668 B-Experimentalfactor 40 100 136 0.2857143 0.22727273 0.2531646 B-Nucleicacid 364 234 337 0.6086956 0.5192582 0.5604311 B-Propernamedgeographicallocation 112 38 22 0.74666667 0.8358209 0.7887324 B-Publishedsourceofinformation 252 95 125 0.7262248 0.66843504 0.69613266 I-Unpropernamedgeographicallocation 5 4 21 0.5555556 0.1923077 0.2857143 I-Protein 1114 717 449 0.6084107 0.7127319 0.65645254 B-Molecularprocess 907 696 609 0.5658141 0.59828496 0.5815967 B-Quantityormeasurement 171 293 438 0.36853448 0.28078818 0.31873253 B-Intellectualproduct 159 175 215 0.4760479 0.42513368 0.44915253 B-Persongroup 783 245 223 0.76167315 0.77833 0.7699115 I-Cells 0 0 2 0.0 0.0 0.0 B-Researchactivity 869 448 373 0.65983295 0.69967794 0.67917156 I-Environmentalfactor 20 18 35 0.5263158 0.36363637 0.43010756 B-Gene 64 37 105 0.63366336 0.37869823 0.47407413 B-Groupofpeople 3 2 10 0.6 0.23076923 0.33333334 B-Geneorproteingroup 241 191 203 0.5578704 0.5427928 0.5502283 B-Facility 119 134 180 0.47035572 0.3979933 0.43115944 B-Timepoint 10 15 30 0.4 0.25 0.30769232 B-Organism 869 290 297 0.7497843 0.745283 0.7475268 B-Duration 274 186 137 0.59565216 0.6666667 0.6291619 I-Facility 182 212 226 0.46192893 0.44607842 0.45386532 I-Process 589 880 1114 0.40095302 0.34586024 0.3713745 B-Language 0 0 2 0.0 0.0 0.0 B-Medicaldevice 29 44 56 0.39726028 0.34117648 0.36708862 B-Ion 54 23 37 0.7012987 0.5934066 0.64285713 I-Partofprotein 126 131 153 0.49027237 0.4516129 0.47014928 B-Gen 0 0 1 0.0 0.0 0.0 B-Geneorprotein 0 1 64 0.0 0.0 0.0 I-Thing 0 0 16 0.0 0.0 0.0 I-Gen 0 0 3 0.0 0.0 0.0 I-Geneorproteingroup 398 345 286 0.5356662 0.58187133 0.5578136 B-Abstractconcept 0 0 7 0.0 0.0 0.0 B-Chromosome 37 24 24 0.60655737 0.60655737 0.60655737 B-Relationship 133 64 59 0.6751269 0.6927083 0.6838046 B-Smallmolecule 19 62 229 0.2345679 0.076612905 0.115501516 I-Physicalphenomenon 0 2 7 0.0 0.0 0.0 
I-Ion 102 41 126 0.7132867 0.4473684 0.5498652 I-Drug 173 133 169 0.5653595 0.50584793 0.5339506 I-Anatomicalpart 0 63 48 0.0 0.0 0.0 B-Measurement 639 451 298 0.5862385 0.68196374 0.6304884 I-Publicationorcitation 10 46 52 0.17857143 0.16129032 0.16949153 B-Geographicallocation 4 16 20 0.2 0.16666667 0.18181819 I-Journal 0 0 11 0.0 0.0 0.0 B-Relationshipphrase 0 0 1 0.0 0.0 0.0 B-Nucleicacidsubstance 2 2 16 0.5 0.11111111 0.18181819 B-Biologicalprocess 779 703 831 0.525641 0.48385093 0.503881 I-Geneorprotein 0 2 33 0.0 0.0 0.0 B-Medicalprocedure 820 450 346 0.6456693 0.703259 0.6732348 I-Namedentity 140 447 381 0.23850085 0.268714 0.25270757 Macro-average 41221 28282 28209 0.4370514 0.3767836 0.40468597 Micro-average 41221 28282 28209 0.5930823 0.5937059 0.5933939 ``` --- layout: model title: MNDA / NDA Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_nda_agreements date: 2023-02-08 tags: [nda, mnda, en, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_nda_agreements` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `nda` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. 
## Predicted Entities `nda`, `other` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_nda_agreements_en_1.0.0_3.0_1675880803682.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_nda_agreements_en_1.0.0_3.0_1675880803682.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_nda_agreements", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[nda]| |[other]| |[other]| |[nda]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_nda_agreements| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support nda 0.93 0.96 0.95 135 other 0.97 0.95 0.96 181 accuracy - - 0.95 316 macro-avg 0.95 0.95 0.95 316 weighted-avg 0.95 0.95 0.95 316 ``` --- layout: model title: Javanese BertForMaskedLM Small Cased model (from w11wo) author: John Snow Labs name: bert_embeddings_javanese_small date: 2022-12-02 tags: [jv, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: jv edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `javanese-bert-small` is a Javanese model originally trained by `w11wo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_javanese_small_jv_4.2.4_3.0_1670022474239.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_javanese_small_jv_4.2.4_3.0_1670022474239.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_javanese_small","jv") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_javanese_small","jv") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
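The `embeddings` column produced above holds one dense vector per token. A common downstream use is measuring similarity between those vectors; a minimal cosine-similarity sketch in plain Python (the toy 4-dimensional vectors are illustrative stand-ins, not actual model output, which for BERT-style models is typically 768-dimensional):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy stand-ins for two token embedding vectors.
vec_a, vec_b = [0.9, 0.1, 0.3, 0.0], [0.8, 0.2, 0.4, 0.1]
print(cosine(vec_a, vec_b))  # a value close to 1.0 means similar directions
```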
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_javanese_small| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|jv| |Size:|410.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/w11wo/javanese-bert-small - https://arxiv.org/abs/1810.04805 - https://github.com/sgugger - https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb - https://w11wo.github.io/ --- layout: model title: Fast Neural Machine Translation Model from Tongan to English author: John Snow Labs name: opus_mt_to_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, to, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `to` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_to_en_xx_2.7.0_2.4_1609167464976.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_to_en_xx_2.7.0_2.4_1609167464976.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_to_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Your text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_to_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.to.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_to_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Closings Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_closings_bert date: 2023-03-05 tags: [en, legal, classification, clauses, closings, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Closings` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do Binary Classification at sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Closings`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_closings_bert_en_1.0.0_3.0_1678049939782.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_closings_bert_en_1.0.0_3.0_1678049939782.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_closings_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
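As noted in the description, long contracts should be split into provisions before classification. A minimal paragraph-splitting sketch (plain Python, splitting on blank lines; the variable names and sample contract text are illustrative, not part of the model's API):

```python
import re

def split_paragraphs(document: str):
    """Split a document into provisions on blank lines (multiline splitting)."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

contract = """SECTION 9. CLOSINGS. The closing shall take place on the third business day.

SECTION 10. NOTICES. All notices shall be in writing."""

paragraphs = split_paragraphs(contract)
# Each paragraph would then become one row of the classifier's input DataFrame:
# df = spark.createDataFrame([[p] for p in paragraphs]).toDF("text")
print(len(paragraphs))  # → 2
```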
## Results ```bash +-------+ |result| +-------+ |[Closings]| |[Other]| |[Other]| |[Closings]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_closings_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Closings 0.96 0.92 0.94 60 Other 0.94 0.98 0.96 83 accuracy - - 0.95 143 macro-avg 0.95 0.95 0.95 143 weighted-avg 0.95 0.95 0.95 143 ``` --- layout: model title: Pipeline to Detect Radiology Concepts - WIP (biobert) author: John Snow Labs name: jsl_rd_ner_wip_greedy_biobert_pipeline date: 2023-03-20 tags: [licensed, clinical, en, ner] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [jsl_rd_ner_wip_greedy_biobert](https://nlp.johnsnowlabs.com/2021/07/26/jsl_rd_ner_wip_greedy_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_rd_ner_wip_greedy_biobert_pipeline_en_4.3.0_3.2_1679310411354.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_rd_ner_wip_greedy_biobert_pipeline_en_4.3.0_3.2_1679310411354.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("jsl_rd_ner_wip_greedy_biobert_pipeline", "en", "clinical/models") text = '''Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("jsl_rd_ner_wip_greedy_biobert_pipeline", "en", "clinical/models") val text = "Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.wip_greedy_biobert.pipeline").predict("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:----------------------|--------:|------:|:--------------------------|-------------:| | 0 | Bilateral | 0 | 8 | Direction | 0.9875 | | 1 | breast | 10 | 15 | BodyPart | 0.6109 | | 2 | ultrasound | 17 | 26 | ImagingTest | 0.7902 | | 3 | ovoid mass | 78 | 87 | ImagingFindings | 0.42185 | | 4 | 0.5 x 0.5 x 0.4 | 113 | 127 | Measurements | 0.9406 | | 5 | cm | 129 | 130 | Units | 1 | | 6 | left | 190 | 193 | Direction | 0.5566 | | 7 | shoulder | 195 | 202 | BodyPart | 0.6228 | | 8 | mass | 210 | 213 | ImagingFindings | 0.9463 | | 9 | isoechoic echotexture | 228 | 248 | ImagingFindings | 0.4332 | | 10 | muscle | 266 | 271 | BodyPart | 0.7148 | | 11 | internal color flow | 294 | 312 | ImagingFindings | 0.3726 | | 12 | benign fibrous tissue | 334 | 354 | ImagingFindings | 0.484533 | | 13 | lipoma | 361 | 366 | Disease_Syndrome_Disorder | 0.8955 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|jsl_rd_ner_wip_greedy_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.8 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: English XlmRoBertaForQuestionAnswering (from teacookies) author: John Snow Labs name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265904 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`autonlp-more_fine_tune_24465520-26265904` is an English model originally trained by `teacookies`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265904_en_4.0.0_3.0_1655985110516.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265904_en_4.0.0_3.0_1655985110516.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265904","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265904","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265904").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
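Under the hood, an extractive QA head scores each context token as a potential answer start and end, and the returned answer is the best-scoring valid span. A toy sketch of that span selection in plain Python (the logits below are made up for illustration; this is not the model's real internals):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick the (start, end) token span maximizing start+end score, start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.0, 0.1, 0.0, 0.3, 0.0]
end   = [0.0, 0.1, 0.2, 4.8, 0.1, 0.0, 0.0, 0.1, 0.2, 0.0]
s, e = best_span(start, end)
print(" ".join(context[s:e + 1]))  # → Clara
```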
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265904| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|888.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265904 --- layout: model title: German asr_wav2vec2_xls_r_300m_english_by_aware_ai TFWav2Vec2ForCTC from aware-ai author: John Snow Labs name: pipeline_asr_wav2vec2_xls_r_300m_english_by_aware_ai date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_english_by_aware_ai` is a German model originally trained by aware-ai. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xls_r_300m_english_by_aware_ai_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_english_by_aware_ai_de_4.2.0_3.0_1664097471522.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_english_by_aware_ai_de_4.2.0_3.0_1664097471522.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_300m_english_by_aware_ai', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_300m_english_by_aware_ai", lang = "de") val annotations = pipeline.transform(audioDF) ```
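The `transform` call above expects an `audioDF` whose audio column holds the raw waveform as an array of floats. As an illustrative sketch (not part of Spark NLP), a 16-bit PCM mono WAV file can be decoded into such floats with Python's standard library; `wav_to_floats` is a hypothetical helper name:

```python
# Sketch: decode a 16-bit PCM mono WAV file into the list of floats that an
# audio DataFrame column typically holds. `wav_to_floats` is an illustrative
# helper, not a Spark NLP API; it assumes 16-bit mono PCM input.
import struct
import wave

def wav_to_floats(path):
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2 and wf.getnchannels() == 1
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]  # normalize int16 to [-1.0, 1.0)
```

The returned list could then be placed in a single-column Spark DataFrame (e.g. `spark.createDataFrame([[floats]]).toDF("audio_content")`) before calling the pipeline.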
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xls_r_300m_english_by_aware_ai| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Translate Nyaneka to English Pipeline author: John Snow Labs name: translate_nyk_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, nyk, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `nyk` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_nyk_en_xx_2.7.0_2.4_1609689243589.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_nyk_en_xx_2.7.0_2.4_1609689243589.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_nyk_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_nyk_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.nyk.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_nyk_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Exceptions Clause Binary Classifier author: John Snow Labs name: legclf_exceptions_clause date: 2023-02-13 tags: [en, legal, classification, exceptions, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `exceptions` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `exceptions`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_exceptions_clause_en_1.0.0_3.0_1676303705879.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_exceptions_clause_en_1.0.0_3.0_1676303705879.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_exceptions_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
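The paragraph-splitting approach suggested in the description (splitting by multiline) can be sketched in plain Python; `split_paragraphs` is an illustrative helper, not part of Legal NLP:

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines (one or more empty lines),
    dropping whitespace-only fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Each paragraph can then be fed to the classifier as its own DataFrame row.
paragraphs = split_paragraphs("Clause 1. ...\n\nClause 2. ...\n\n\nClause 3. ...")
```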
## Results ```bash +-------+ |result| +-------+ |[exceptions]| |[other]| |[other]| |[exceptions]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_exceptions_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support exceptions 1.00 1.00 1.00 17 other 1.00 1.00 1.00 8 accuracy - - 1.00 25 macro-avg 1.00 1.00 1.00 25 weighted-avg 1.00 1.00 1.00 25 ``` --- layout: model title: Fast Neural Machine Translation Model from Setswana to English author: John Snow Labs name: opus_mt_tn_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, tn, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
- source languages: `tn` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_tn_en_xx_2.7.0_2.4_1609163658011.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_tn_en_xx_2.7.0_2.4_1609163658011.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_tn_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_tn_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.tn.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_tn_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from aodiniz) author: John Snow Labs name: bert_qa_bert_uncased_L_2_H_512_A_8_squad2_covid_qna date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-2_H-512_A-8_squad2_covid-qna` is an English model originally trained by `aodiniz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_2_H_512_A_8_squad2_covid_qna_en_4.0.0_3.0_1654185242186.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_2_H_512_A_8_squad2_covid_qna_en_4.0.0_3.0_1654185242186.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_2_H_512_A_8_squad2_covid_qna","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_uncased_L_2_H_512_A_8_squad2_covid_qna","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_covid.bert.uncased_2l_512d_a8a_512d").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
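In the nlu one-liner above, the question and context are packed into a single string separated by `|||`. A tiny helper (an illustrative assumption, not an nlu API) makes the format explicit:

```python
def qa_input(question, context):
    """Pack a question/context pair into the single-string format used in
    the nlu example above (separator: '|||'). Hypothetical helper name."""
    return f"{question}|||{context}"

qa_input("What's my name?", "My name is Clara and I live in Berkeley.")
# → "What's my name?|||My name is Clara and I live in Berkeley."
```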
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_uncased_L_2_H_512_A_8_squad2_covid_qna| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|83.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-2_H-512_A-8_squad2_covid-qna --- layout: model title: Fast Neural Machine Translation Model from Arabic to Greek author: John Snow Labs name: opus_mt_ar_el date: 2021-06-01 tags: [open_source, seq2seq, translation, ar, el, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `ar` - target languages: `el` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ar_el_xx_3.1.0_2.4_1622550439728.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ar_el_xx_3.1.0_2.4_1622550439728.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ar_el", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ar_el", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Arabic.translate_to.Greek').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ar_el| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Detect PHI for Deidentification purposes (Spanish, reduced entities, augmented data, Roberta) author: John Snow Labs name: ner_deid_generic_roberta_augmented date: 2022-02-15 tags: [deid, es, licensed] task: De-identification language: es edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Spanish) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 8 entities (1 more than the `ner_deid_generic_roberta` ner model). This NER model is trained with a combination of custom datasets, the Spanish CoNLL 2002 corpus, the MeddoProf dataset and several data augmentation mechanisms, and has been augmented with the MEDDOCAN Spanish de-identification corpus (compared to `ner_deid_generic_roberta`, which does not include it). It's a generalized version of `ner_deid_subentity_roberta_augmented`. This model is based on RoBERTa embeddings. The `ner_deid_generic_augmented` model, which uses Sciwi 300d embeddings, is also available.
## Predicted Entities `CONTACT`, `NAME`, `DATE`, `ID`, `LOCATION`, `PROFESSION`, `AGE`, `SEX` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_ES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_roberta_augmented_es_3.3.4_2.4_1644926722996.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_roberta_augmented_es_3.3.4_2.4_1644926722996.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = medical.NerModel.pretrained("ner_deid_generic_roberta_augmented", "es", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, roberta_embeddings, clinical_ner]) text = [''' Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. 
'''] df = spark.createDataFrame([text]).toDF("text") results = nlpPipeline.fit(df).transform(df) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val roberta_embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_roberta_augmented", "es", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, roberta_embeddings, clinical_ner)) val text = "Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos." val df = Seq(text).toDF("text") val results = pipeline.fit(df).transform(df) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.deid.generic.roberta").predict(""" Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. """) ```
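Downstream de-identification typically replaces the detected chunks with placeholders. As an illustrative sketch (plain Python, not the Spark NLP de-identification annotator), IOB-labelled token/label pairs like the ones this model produces can be masked as follows:

```python
def mask_phi(tokens):
    """Collapse IOB-labelled tokens into text with entity placeholders.
    `tokens` is a list of (token, label) pairs, e.g. ("Antonio", "B-NAME").
    Illustrative helper only."""
    out = []
    for token, label in tokens:
        if label.startswith("B-"):
            out.append("<" + label[2:] + ">")  # start of an entity chunk
        elif label.startswith("I-"):
            continue                           # inside a chunk: already masked
        else:
            out.append(token)                  # "O": keep the token as-is
    return " ".join(out)

mask_phi([("Antonio", "B-NAME"), ("Miguel", "I-NAME"), ("Martínez", "I-NAME"),
          (",", "O"), ("un", "B-SEX"), ("varón", "I-SEX")])
# → "<NAME> , <SEX>"
```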
## Results ```bash +------------+------------+ | token| ner_label| +------------+------------+ | Antonio| B-NAME| | Miguel| I-NAME| | Martínez| I-NAME| | ,| O| | un| B-SEX| | varón| I-SEX| | de| O| | 35| B-AGE| | años| O| | de| O| | edad| O| | ,| O| | de| O| | profesión| O| | auxiliar|B-PROFESSION| | de|I-PROFESSION| | enfermería|I-PROFESSION| | y| O| | nacido| O| | en| O| | Cadiz| B-LOCATION| | ,| O| | España| B-LOCATION| | .| O| | Aún| O| | no| O| | estaba| O| | vacunado| O| | ,| O| | se| O| | infectó| O| | con| O| | Covid-19|B-PROFESSION| | el| O| | dia| O| | 14| O| | de| O| | Marzo| O| | y| O| | tuvo| O| | que| O| | ir| O| | al| O| | Hospital| B-LOCATION| | Fue| O| | tratado| O| | con| O| | anticuerpos| O| |monoclonales| O| | en| O| | la| O| | Clinica| B-LOCATION| | San| I-LOCATION| | Carlos| I-LOCATION| | .| O| +------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic_roberta_augmented| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|16.3 MB| ## References - Internal JSL annotated corpus - [Spanish conLL](https://www.clips.uantwerpen.be/conll2002/ner/data/) - [MeddoProf](https://temu.bsc.es/meddoprof/data/) - [MeddoCan](https://temu.bsc.es/meddocan/) ## Benchmarking ```bash label tp fp fn total precision recall f1 CONTACT 177.0 3.0 6.0 183.0 0.9833 0.9672 0.9752 NAME 1963.0 159.0 123.0 2086.0 0.9251 0.941 0.933 DATE 953.0 18.0 16.0 969.0 0.9815 0.9835 0.9825 ORGANIZATION 2320.0 520.0 362.0 2682.0 0.8169 0.865 0.8403 ID 63.0 7.0 1.0 64.0 0.9 0.9844 0.9403 SEX 619.0 14.0 8.0 627.0 0.9779 0.9872 0.9825 LOCATION 2388.0 470.0 423.0 2811.0 0.8355 0.8495 0.8425 PROFESSION 233.0 15.0 28.0 261.0 0.9395 0.8927 0.9155 AGE 516.0 16.0 3.0 519.0 0.9699 0.9942 0.9819 macro - - - - - - 0.9326 micro - - - - - - 0.8943 ``` --- layout: model title: English BertForSequenceClassification Tiny 
Cased model (from mrm8488) author: John Snow Labs name: bert_sequence_classifier_tiny_finetuned_sms_spam_detection date: 2022-07-13 tags: [en, open_source, bert, sequence_classification] task: Text Classification language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-tiny-finetuned-sms-spam-detection` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_tiny_finetuned_sms_spam_detection_en_4.0.0_3.0_1657720825763.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_tiny_finetuned_sms_spam_detection_en_4.0.0_3.0_1657720825763.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") classifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_tiny_finetuned_sms_spam_detection","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, classifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val classifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_tiny_finetuned_sms_spam_detection","en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, classifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_tiny_finetuned_sms_spam_detection| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|16.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mrm8488/bert-tiny-finetuned-sms-spam-detection --- layout: model title: Swedish BertForMaskedLM Base Cased model (from KBLab) author: John Snow Labs name: bert_embeddings_kblab_base_swedish_cased date: 2022-12-02 tags: [sv, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: sv edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-swedish-cased` is a Swedish model originally trained by `KBLab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_kblab_base_swedish_cased_sv_4.2.4_3.0_1670019031106.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_kblab_base_swedish_cased_sv_4.2.4_3.0_1670019031106.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_kblab_base_swedish_cased","sv") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_kblab_base_swedish_cased","sv") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
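Embeddings like those produced above are usually consumed through vector similarity. A minimal cosine-similarity sketch in plain Python (illustrative only; in practice the vectors would come from the `embeddings` column of the result):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cosine_similarity([1.0, 0.0], [1.0, 0.0])  # identical directions → 1.0
```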
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_kblab_base_swedish_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|sv| |Size:|468.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/KBLab/bert-base-swedish-cased --- layout: model title: English RobertaForQuestionAnswering (from z-uo) author: John Snow Labs name: roberta_qa_roberta_qasper date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-qasper` is an English model originally trained by `z-uo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_qasper_en_4.0.0_3.0_1655738258703.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_qasper_en_4.0.0_3.0_1655738258703.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_qasper","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_qasper","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.by_z-uo").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_qasper| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/z-uo/roberta-qasper --- layout: model title: Legal Title Clause Binary Classifier author: John Snow Labs name: legclf_title_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `title` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings used by this model allow up to 512 tokens; if your documents are longer, consider splitting them into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities `other`, `title` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_title_clause_en_1.0.0_3.2_1660123118000.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_title_clause_en_1.0.0_3.2_1660123118000.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_title_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
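The benchmarking section below reports per-label precision, recall, F1, and support. As a sanity check, the support-weighted average can be reproduced from the per-label rows. A minimal sketch (numbers copied from the benchmarking table; small deviations from the reported 0.95 come from using the rounded per-label scores):

```python
# Per-label (F1, support) pairs from the legclf_title_clause benchmarking table.
scores = {"other": (0.96, 179), "title": (0.90, 65)}

total = sum(support for _, support in scores.values())
weighted_f1 = sum(f1 * support for f1, support in scores.values()) / total

# Should land near the reported weighted-avg F1 of 0.95
# (exact agreement would require the unrounded per-label scores).
print(round(weighted_f1, 2))
```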
## Results ```bash +-------+ | result| +-------+ |[title]| |[other]| |[other]| |[title]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_title_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.95 0.98 0.96 179 title 0.93 0.86 0.90 65 accuracy - - 0.95 244 macro-avg 0.94 0.92 0.93 244 weighted-avg 0.95 0.95 0.95 244 ``` --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_8_h_512 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-8_H-512` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_8_h_512_zh_4.2.4_3.0_1670326051660.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_8_h_512_zh_4.2.4_3.0_1670326051660.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_8_h_512","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_8_h_512","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_8_h_512| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|137.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-8_H-512 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Legal Lease agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_lease_agreement date: 2022-10-24 tags: [en, legal, classification, document, agreement, contract, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_lease_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class lease-agreement or not (Binary Classification). Longformers have a limit of 4096 tokens, so only the first 4096 tokens will be taken into account. We have found that, for the vast majority of documents in legal corpora, if they are clean and contain only the legal document without any extra information before it, 4096 tokens are enough to perform Document Classification. If not, let us know and we can carry out another approach for you: splitting the document into chunks of 4096 tokens, averaging their embeddings, and training on the averaged version, which means the whole document is taken into account. In theory, however, this should not be required.
## Predicted Entities `other`, `lease-agreement` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_lease_agreement_en_1.0.0_3.0_1666620984031.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_lease_agreement_en_1.0.0_3.0_1666620984031.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token") \ .setOutputCol("embeddings") sembeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_lease_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, sembeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-----------------+ |           result| +-----------------+ |[lease-agreement]| |[other]| |[other]| |[lease-agreement]| +-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_lease_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support lease-agreement 0.94 0.97 0.96 33 other 0.98 0.97 0.98 62 accuracy - - 0.97 95 macro-avg 0.96 0.97 0.97 95 weighted-avg 0.97 0.97 0.97 95 ``` --- layout: model title: Pipeline to Mapping ICD10-CM Codes with Their Corresponding ICD-9-CM Codes author: John Snow Labs name: icd10_icd9_mapping date: 2023-03-29 tags: [en, licensed, icd10cm, icd9, pipeline, chunk_mapping] task: Chunk Mapping language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the `icd10_icd9_mapper` model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10_icd9_mapping_en_4.3.2_3.2_1680118330093.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10_icd9_mapping_en_4.3.2_3.2_1680118330093.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("icd10_icd9_mapping", "en", "clinical/models") result = pipeline.fullAnnotate("Z833 A0100 A000") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("icd10_icd9_mapping", "en", "clinical/models") val result = pipeline.fullAnnotate("Z833 A0100 A000") ``` {:.nlu-block} ```python import nlu nlu.load("en.icd10_icd9.mapping").predict("""Put your text here.""") ```
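Conceptually, the ChunkMapperModel inside this pipeline behaves like a lookup from ICD-10-CM codes to ICD-9-CM codes. A toy, illustrative sketch (the three entries are taken from the example results below; the real model covers the full code set, and `map_codes` is our hypothetical helper, not a pipeline API):

```python
# Illustrative ICD-10-CM -> ICD-9-CM lookup; entries taken from the
# example results of the icd10_icd9_mapping pipeline.
icd10_to_icd9 = {"Z833": "V180", "A0100": "0020", "A000": "0010"}

def map_codes(text):
    # Split on whitespace and map each recognized code; unknown codes
    # fall back to "NONE", mirroring chunk-mapper behavior.
    return [icd10_to_icd9.get(code, "NONE") for code in text.split()]

print(map_codes("Z833 A0100 A000"))  # ['V180', '0020', '0010']
```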
## Results ```bash |    | icd10_code | icd9_code | |---:|:-----------|:----------| |  0 | Z833       | V180      | |  1 | A0100      | 0020      | |  2 | A000       | 0010      | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd10_icd9_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|589.6 KB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel --- layout: model title: Temporality / Certainty Assertion Status author: John Snow Labs name: finassertion_time date: 2022-09-27 tags: [en, licensed] task: Assertion Status language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an Assertion Status Model aimed at detecting temporality (PRESENT, PAST, FUTURE) or certainty (POSSIBLE) in your financial documents. ## Predicted Entities `PRESENT`, `PAST`, `FUTURE`, `POSSIBLE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FINASSERTION_TEMPORALITY){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finassertion_time_en_1.0.0_3.0_1664274273525.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finassertion_time_en_1.0.0_3.0_1664274273525.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner = finance.BertForTokenClassification.pretrained("finner_bert_roles","en","finance/models")\ .setInputCols("token", "document")\ .setOutputCol("ner")\ .setCaseSensitive(True) chunk_converter = nlp.NerConverter() \ .setInputCols(["document", "token", "ner"]) \ .setOutputCol("ner_chunk") assertion = finance.AssertionDLModel.pretrained("finassertion_time", "en", "finance/models")\ .setInputCols(["document", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, ner, chunk_converter, assertion ]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) lp = nlp.LightPipeline(model) texts = ["John Crawford will be hired by Atlantic Inc as CTO"] lp.annotate(texts) ```
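The benchmarking table below reports tp/fp/fn counts alongside precision, recall, and F1; the metrics follow directly from the counts. A quick sketch reproducing the PRESENT row:

```python
# Derive precision/recall/F1 from the tp/fp/fn counts reported for
# the PRESENT label in the finassertion_time benchmarking table.
tp, fp, fn = 201, 11, 16

precision = tp / (tp + fp)  # 201 / 212
recall = tp / (tp + fn)     # 201 / 217
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))
```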
## Results ```bash chunk,begin,end,entity_type,assertion CTO,47,49,ROLE,FUTURE ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finassertion_time| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, doc_chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| |Size:|2.2 MB| ## References In-house annotations on financial and legal corpora ## Benchmarking ```bash label tp fp fn prec rec f1 PRESENT 201 11 16 0.9481132 0.92626727 0.937063 POSSIBLE 171 3 6 0.98275864 0.9661017 0.9743589 FUTURE 119 6 4 0.952 0.96747965 0.95967746 PAST 270 16 10 0.9440559 0.96428573 0.9540636 Macro-average 761 36 36 0.9567319 0.9560336 0.95638263 Micro-average 761 36 36 0.9548306 0.9548306 0.9548306 ``` --- layout: model title: Pipeline to Extract Anatomical Entities from Oncology Texts author: John Snow Labs name: ner_oncology_anatomy_general_healthcare_pipeline date: 2023-03-08 tags: [licensed, clinical, oncology, en, ner, anatomy] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_oncology_anatomy_general_healthcare](https://nlp.johnsnowlabs.com/2023/01/11/ner_oncology_anatomy_general_healthcare_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_general_healthcare_pipeline_en_4.3.0_3.2_1678268209780.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_general_healthcare_pipeline_en_4.3.0_3.2_1678268209780.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_oncology_anatomy_general_healthcare_pipeline", "en", "clinical/models") text = " The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver. " result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_oncology_anatomy_general_healthcare_pipeline", "en", "clinical/models") val text = " The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver. " val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | chunks | begin | end | entities | confidence | |---:|:---------|--------:|------:|:----------------|-------------:| | 0 | left | 37 | 40 | Direction | 0.9948 | | 1 | breast | 42 | 47 | Anatomical_Site | 0.5814 | | 2 | lungs | 83 | 87 | Anatomical_Site | 0.9486 | | 3 | liver | 100 | 104 | Anatomical_Site | 0.9646 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_anatomy_general_healthcare_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|533.2 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English RobertaForQuestionAnswering (from huxxx657) author: John Snow Labs name: roberta_qa_roberta_base_finetuned_scrambled_squad_10 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-scrambled-squad-10` is an English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_scrambled_squad_10_en_4.0.0_3.0_1655733955391.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_scrambled_squad_10_en_4.0.0_3.0_1655733955391.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_finetuned_scrambled_squad_10","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_finetuned_scrambled_squad_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_scrambled_10.by_huxxx657").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_finetuned_scrambled_squad_10| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|463.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-scrambled-squad-10 --- layout: model title: Translate Swedish to English Pipeline author: John Snow Labs name: translate_sv_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, sv, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `sv` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_sv_en_xx_2.7.0_2.4_1609699177957.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_sv_en_xx_2.7.0_2.4_1609699177957.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_sv_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_sv_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.sv.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_sv_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Czech asr_wav2vec2_large_xlsr_czech TFWav2Vec2ForCTC from arampacha author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_czech date: 2022-09-25 tags: [wav2vec2, cs, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: cs edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_czech` is a Czech model originally trained by arampacha. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_czech_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_czech_cs_4.2.0_3.0_1664120879670.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_czech_cs_4.2.0_3.0_1664120879670.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_czech', lang = 'cs') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_czech", lang = "cs") val annotations = pipeline.transform(audioDF) ```
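The pipeline expects `audioDF` to contain raw audio as an array of floats (mono, 16 kHz for Wav2Vec2 models). As an illustration only (the `wav_to_floats` helper and the synthetic tone are ours, not part of the pipeline), 16-bit PCM WAV data can be converted to that representation with the standard library:

```python
import io
import math
import struct
import wave

def wav_to_floats(fileobj):
    # Convert 16-bit PCM WAV samples to floats in [-1.0, 1.0],
    # the representation Wav2Vec2 pipelines consume as audio content.
    with wave.open(fileobj, "rb") as w:
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Self-contained demo: synthesize 0.1 s of a 440 Hz tone at 16 kHz.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    pcm = b"".join(
        struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * 440 * i / 16000)))
        for i in range(1600)
    )
    w.writeframes(pcm)

buf.seek(0)
floats = wav_to_floats(buf)  # values like these would populate audioDF
```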
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_czech| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|cs| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Multilingual T5ForConditionalGeneration Small Cased model (from hackathon-pln-es) author: John Snow Labs name: t5_small_finetuned_spanish_to_quechua date: 2023-01-31 tags: [qu, es, open_source, t5, xx, tensorflow] task: Text Generation language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-finetuned-spanish-to-quechua` is a Multilingual model originally trained by `hackathon-pln-es`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_finetuned_spanish_to_quechua_xx_4.3.0_3.0_1675126096661.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_finetuned_spanish_to_quechua_xx_4.3.0_3.0_1675126096661.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_finetuned_spanish_to_quechua","xx") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_finetuned_spanish_to_quechua","xx") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_finetuned_spanish_to_quechua| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|xx| |Size:|282.6 MB| ## References - https://huggingface.co/hackathon-pln-es/t5-small-finetuned-spanish-to-quechua --- layout: model title: Part of Speech for Telugu (pos_mtg) author: John Snow Labs name: pos_mtg date: 2021-03-10 tags: [open_source, pos, te] supported: true task: Part of Speech Tagging language: te edition: Spark NLP 2.7.5 spark_version: 2.4 annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron` architecture. ## Predicted Entities - ADJ - ADP - ADV - AUX - CCONJ - DET - NOUN - NUM - PART - PRON - PROPN - PUNCT - VERB - X {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_mtg_te_2.7.5_2.4_1615400812325.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_mtg_te_2.7.5_2.4_1615400812325.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_mtg", "te") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['ఆయన వస్తున్నారా , లేదా ?']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_mtg", "te") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("ఆయన వస్తున్నారా , లేదా ?").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["ఆయన వస్తున్నారా , లేదా ?"] token_df = nlu.load('te.pos.mtg').predict(text) token_df ```
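The macro and weighted averages in the benchmarking section below are the plain and support-weighted means of the per-label F1 rows. Reproducing them from the rounded per-label scores is a useful consistency check (small deviations from the reported values come from rounding):

```python
# Per-label (F1, support) pairs from the pos_mtg benchmarking table.
rows = {
    "ADJ": (0.44, 5),   "ADP": (0.55, 7),    "ADV": (0.72, 31),   "CCONJ": (0.00, 1),
    "DET": (0.89, 18),  "INTJ": (0.00, 0),   "NOUN": (0.79, 171), "NUM": (0.56, 12),
    "PART": (0.00, 2),  "PRON": (0.91, 122), "PROPN": (0.77, 21), "PUNCT": (0.99, 165),
    "SCONJ": (0.83, 5), "VERB": (0.88, 161),
}

macro_f1 = sum(f1 for f1, _ in rows.values()) / len(rows)
total = sum(sup for _, sup in rows.values())
weighted_f1 = sum(f1 * sup for f1, sup in rows.values()) / total

# Close to the reported macro avg 0.59 and weighted avg 0.86.
print(round(macro_f1, 2), round(weighted_f1, 2))
```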
## Results ```bash +------------------------+--------------------------------+ |text |result | +------------------------+--------------------------------+ |ఆయన వస్తున్నారా , లేదా ?|[PRON, VERB, PUNCT, VERB, PUNCT]| +------------------------+--------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_mtg| |Compatibility:|Spark NLP 2.7.5+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[pos]| |Language:|te| ## Data Source The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) data set. ## Benchmarking ```bash | | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | ADJ | 0.50 | 0.40 | 0.44 | 5 | | ADP | 0.75 | 0.43 | 0.55 | 7 | | ADV | 0.78 | 0.68 | 0.72 | 31 | | CCONJ | 0.00 | 0.00 | 0.00 | 1 | | DET | 0.89 | 0.89 | 0.89 | 18 | | INTJ | 0.00 | 0.00 | 0.00 | 0 | | NOUN | 0.82 | 0.76 | 0.79 | 171 | | NUM | 0.83 | 0.42 | 0.56 | 12 | | PART | 0.00 | 0.00 | 0.00 | 2 | | PRON | 0.88 | 0.93 | 0.91 | 122 | | PROPN | 0.69 | 0.86 | 0.77 | 21 | | PUNCT | 0.99 | 0.99 | 0.99 | 165 | | SCONJ | 0.71 | 1.00 | 0.83 | 5 | | VERB | 0.84 | 0.92 | 0.88 | 161 | | accuracy | | | 0.87 | 721 | | macro avg | 0.62 | 0.59 | 0.59 | 721 | | weighted avg | 0.87 | 0.87 | 0.86 | 721 | ``` --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_izzy_lazerson TFWav2Vec2ForCTC from izzy-lazerson author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab_by_izzy_lazerson date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark 
NLP.`asr_wav2vec2_base_timit_demo_colab_by_izzy_lazerson` is an English model originally trained by izzy-lazerson. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_by_izzy_lazerson_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_izzy_lazerson_en_4.2.0_3.0_1664112196900.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_izzy_lazerson_en_4.2.0_3.0_1664112196900.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_by_izzy_lazerson", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_by_izzy_lazerson", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
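The `audioDf` referenced above is assumed to already hold an `audio_content` column of raw audio samples. As a minimal, stdlib-only sketch of producing such float arrays from a 16-bit PCM mono WAV file (the file name and helper name are hypothetical; Wav2vec2 models typically expect 16 kHz mono input):

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit PCM mono WAV file and return samples scaled to [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        # this sketch only handles 16-bit mono files
        assert wf.getsampwidth() == 2 and wf.getnchannels() == 1
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# floats = wav_to_floats("speech_16khz.wav")  # hypothetical file
# audioDf = spark.createDataFrame([[floats]], ["audio_content"])
```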
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab_by_izzy_lazerson| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|354.9 MB| --- layout: model title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_patrickvonplaten TFWav2Vec2ForCTC from patrickvonplaten author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_patrickvonplaten date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_large_xls_r_300m_turkish_colab_by_patrickvonplaten` is an English model originally trained by patrickvonplaten. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_patrickvonplaten_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_patrickvonplaten_en_4.2.0_3.0_1664105340184.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_patrickvonplaten_en_4.2.0_3.0_1664105340184.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_patrickvonplaten', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_patrickvonplaten", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_patrickvonplaten| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Italian BertForMaskedLM Base Cased model (from dbmdz) author: John Snow Labs name: bert_embeddings_base_italian_cased date: 2022-12-02 tags: [it, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: it edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-italian-cased` is an Italian model originally trained by `dbmdz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_italian_cased_it_4.2.4_3.0_1670017918448.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_italian_cased_it_4.2.4_3.0_1670017918448.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_italian_cased","it") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_italian_cased","it") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_italian_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|it| |Size:|412.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/dbmdz/bert-base-italian-cased - http://opus.nlpl.eu/ - https://traces1.inria.fr/oscar/ - https://github.com/dbmdz/berts/issues/7 - https://github.com/stefan-it/turkish-bert/tree/master/electra - https://github.com/stefan-it/italian-bertelectra - https://github.com/dbmdz/berts/issues/new --- layout: model title: Human Rights Articles Classification author: John Snow Labs name: legclf_human_rights date: 2022-08-09 tags: [es, legal, conventions, classification, en, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Roberta-based Legal Sequence Classifier NLP model to label texts about Human Rights in Spanish as one of the following categories: - `Artículo 1. Obligación de Respetar los Derechos` - `Artículo 2. Deber de Adoptar Disposiciones de Derecho Interno` - `Artículo 3. Derecho al Reconocimiento de la Personalidad Jurídica` - `Artículo 4. Derecho a la Vida` - `Artículo 5. Derecho a la Integridad Personal` - `Artículo 6. Prohibición de la Esclavitud y Servidumbre` - `Artículo 7. Derecho a la Libertad Personal` - `Artículo 8. Garantías Judiciales` - `Artículo 9. Principio de Legalidad y de Retroactividad` - `Artículo 11. Protección de la Honra y de la Dignidad` - `Artículo 12. Libertad de Conciencia y de Religión` - `Artículo 13. Libertad de Pensamiento y de Expresión` - `Artículo 14. Derecho de Rectificación o Respuesta` - `Artículo 15. Derecho de Reunión` - `Artículo 16. Libertad de Asociación` - `Artículo 17. Protección a la Familia` - `Artículo 18. 
Derecho al Nombre` - `Artículo 19. Derechos del Niño` - `Artículo 20. Derecho a la Nacionalidad` - `Artículo 22. Derecho de Circulación y de Residencia` - `Artículo 23. Derechos Políticos` - `Artículo 24. Igualdad ante la Ley` - `Artículo 25. Protección Judicial` - `Artículo 26. Desarrollo Progresivo` - `Artículo 27. Suspensión de Garantías` - `Artículo 28. Cláusula Federal` - `Artículo 21. Derecho a la Propiedad Privada` - `Artículo 29. Normas de Interpretación` - `Artículo 30. Alcance de las Restricciones` - `Artículo 63.1 Reparaciones` This model was originally trained on 6089 legal texts about the American Convention on Human Rights (see the original work [here](https://huggingface.co/hackathon-pln-es/jurisbert-clas-art-convencion-americana-dh)). It has been fine-tuned on the International Convention of Human Rights and other similar documents (for example, https://www.ohchr.org/sites/default/files/UDHR/Documents/UDHR_Translations/spn.pdf). ## Predicted Entities `Artículo 1. Obligación de Respetar los Derechos`, `Artículo 2. Deber de Adoptar Disposiciones de Derecho Interno`, `Artículo 3. Derecho al Reconocimiento de la Personalidad Jurídica`, `Artículo 4. Derecho a la Vida`, `Artículo 5. Derecho a la Integridad Personal`, `Artículo 6. Prohibición de la Esclavitud y Servidumbre`, `Artículo 7. Derecho a la Libertad Personal`, `Artículo 8. Garantías Judiciales`, `Artículo 9. Principio de Legalidad y de Retroactividad`, `Artículo 11. Protección de la Honra y de la Dignidad`, `Artículo 12. Libertad de Conciencia y de Religión`, `Artículo 13. Libertad de Pensamiento y de Expresión`, `Artículo 14. Derecho de Rectificación o Respuesta`, `Artículo 15. Derecho de Reunión`, `Artículo 16. Libertad de Asociación`, `Artículo 17. Protección a la Familia`, `Artículo 18. Derecho al Nombre`, `Artículo 19. Derechos del Niño`, `Artículo 20. Derecho a la Nacionalidad`, `Artículo 22. Derecho de Circulación y de Residencia`, `Artículo 23. Derechos Políticos`, `Artículo 24.
Igualdad ante la Ley`, `Artículo 25. Protección Judicial`, `Artículo 26. Desarrollo Progresivo`, `Artículo 27. Suspensión de Garantías`, `Artículo 28. Cláusula Federal`, `Artículo 21. Derecho a la Propiedad Privada`, `Artículo 29. Normas de Interpretación`, `Artículo 30. Alcance de las Restricciones`, `Artículo 63.1 Reparaciones` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_human_rights_en_1.0.0_3.2_1660057114857.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_human_rights_en_1.0.0_3.2_1660057114857.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = nlp.RoBertaForSequenceClassification.pretrained("legclf_human_rights","en", "legal/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("class") pipeline = nlp.Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) text = """Todos los seres humanos nacen libres e iguales en dignidad y derechos y, dotados como están de razón y conciencia, deben comportarse fraternalmente los unos con los otros.""" data = spark.createDataFrame([[text]]).toDF("text") result = pipeline.fit(data).transform(data) ```
## Results ```bash +--------------------+-------------------------------------------------+ |text |result | +--------------------+-------------------------------------------------+ |Todos los seres h...|[Artículo 1. Obligación de Respetar los Derechos]| +--------------------+-------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_human_rights| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|466.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References This model was originally trained on 6089 legal texts about the American Convention on Human Rights (see the original work [here](https://huggingface.co/hackathon-pln-es/jurisbert-clas-art-convencion-americana-dh)). It has been fine-tuned on the International Convention of Human Rights and other similar documents (for example, https://www.ohchr.org/sites/default/files/UDHR/Documents/UDHR_Translations/spn.pdf). ## Benchmarking ```bash label precision recall f1-score support accuracy - - 0.91 98 macro-avg 0.92 0.91 0.91 98 weighted-avg 0.92 0.90 0.91 98 ``` --- layout: model title: Arabic BertForQuestionAnswering Cased model (from Sh3ra) author: John Snow Labs name: bert_qa_arabert_finetuned_arcd date: 2022-07-07 tags: [ar, open_source, bert, question_answering] task: Question Answering language: ar edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `arabert-finetuned-arcd` is an Arabic model originally trained by `Sh3ra`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_arabert_finetuned_arcd_ar_4.0.0_3.0_1657182523872.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_arabert_finetuned_arcd_ar_4.0.0_3.0_1657182523872.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_arabert_finetuned_arcd","ar") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_arabert_finetuned_arcd","ar") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_arabert_finetuned_arcd| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ar| |Size:|505.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Sh3ra/arabert-finetuned-arcd --- layout: model title: Legal Monetary Relations Document Classifier (EURLEX) author: John Snow Labs name: legclf_monetary_relations_bert date: 2023-03-06 tags: [en, legal, classification, clauses, monetary_relations, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_monetary_relations_bert model, a Bert Sentence Embeddings Document Classifier, predicts whether the document belongs to the Monetary_Relations class or not (binary classification) according to EuroVoc labels. ## Predicted Entities `Monetary_Relations`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_monetary_relations_bert_en_1.0.0_3.0_1678111802226.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_monetary_relations_bert_en_1.0.0_3.0_1678111802226.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_monetary_relations_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +--------------------+ |result | +--------------------+ |[Monetary_Relations]| |[Other] | |[Other] | |[Monetary_Relations]| +--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_monetary_relations_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Training dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Monetary_Relations 0.75 0.88 0.81 24 Other 0.90 0.79 0.84 34 accuracy - - 0.83 58 macro-avg 0.82 0.83 0.83 58 weighted-avg 0.84 0.83 0.83 58 ``` --- layout: model title: English BertForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_2 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-512-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_2_en_4.0.0_3.0_1657192626029.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_2_en_4.0.0_3.0_1657192626029.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|387.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-512-finetuned-squad-seed-2 --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-512-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_0_en_4.0.0_3.0_1655732836624.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_0_en_4.0.0_3.0_1655732836624.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_512d_seed_0").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
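In the nlu one-liner above, the question and the context travel in a single string separated by `|||`. A minimal sketch of that packing convention itself (plain Python; this only illustrates the string format shown above, not nlu's internals):

```python
SEP = "|||"  # separator used in the nlu call above

def pack_qa(question, context):
    """Join a question/context pair into one packed string."""
    return question + SEP + context

def unpack_qa(packed):
    """Recover the (question, context) pair from a packed string."""
    question, _, context = packed.partition(SEP)
    return question, context

packed = pack_qa("What's my name?", "My name is Clara and I live in Berkeley.")
```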
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|432.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-512-finetuned-squad-seed-0 --- layout: model title: Word Embeddings for French (word2vec_wac_200) author: John Snow Labs name: word2vec_wac_200 date: 2022-02-01 tags: [fr, open_source] task: Embeddings language: fr edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This French Word2Vec model was trained by [Jean-Philippe Fauconnier](https://fauconnier.github.io/) on the [frWaC Corpus](https://wacky.sslmit.unibo.it/doku.php?id=corpora) with a window size of 100 and an embedding dimension of 200. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/word2vec_wac_200_fr_3.4.0_3.0_1643751580352.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/word2vec_wac_200_fr_3.4.0_3.0_1643751580352.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("word2vec_wac_200", "fr")\ .setInputCols(["document", "token"])\ .setOutputCol("embeddings") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("word2vec_wac_200", "fr") .setInputCols("document", "token") .setOutputCol("embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("fr.embed.word2vec_wac_200").predict("""Put your text here.""") ```
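Each token annotated by the model above carries a 200-dimensional vector, and a common downstream use is comparing words by cosine similarity. A toy sketch with hypothetical 3-component vectors (real vectors from this model have 200 components and different values):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# hypothetical toy vectors, not actual model output
vec_chat = [0.9, 0.1, 0.0]
vec_chien = [0.8, 0.2, 0.1]
similarity = cosine(vec_chat, vec_chien)
```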
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|word2vec_wac_200| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|118.0 MB| |Case sensitive:|false| |Dimension:|200| ## References This model was trained by [Jean-Philippe Fauconnier](https://fauconnier.github.io/) on the [frWaC Corpus](https://wacky.sslmit.unibo.it/doku.php?id=corpora). [[1]](#1) [1] Fauconnier, Jean-Philippe (2015), French Word Embeddings, http://fauconnier.github.io --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Raphaelg9) author: John Snow Labs name: distilbert_qa_raphaelg9_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Raphaelg9`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_raphaelg9_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769023119.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_raphaelg9_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769023119.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_raphaelg9_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_raphaelg9_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_raphaelg9_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Raphaelg9/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal Execution Clause Binary Classifier author: John Snow Labs name: legclf_execution_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `execution` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
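The paragraph-splitting option mentioned above (splitting by multiline breaks) can be sketched in plain Python before the chunks are sent to the classifier; the helper name is hypothetical and not part of Spark NLP:

```python
def split_paragraphs(text):
    """Split a document on blank lines and drop empty chunks."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

# hypothetical two-paragraph document
doc = (
    "IN WITNESS WHEREOF, the parties hereto have executed this Agreement.\n\n"
    "This Agreement shall be governed by the laws of the State of New York."
)
chunks = split_paragraphs(doc)
# each chunk would then populate the "clause_text" column consumed by the pipeline
```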
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `execution` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_execution_clause_en_1.0.0_3.2_1660123518825.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_execution_clause_en_1.0.0_3.2_1660123518825.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_execution_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
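As noted above, several binary clause classifiers can be chained in one pipeline, each contributing its own True/False signal. A minimal plain-Python sketch of collapsing the per-model `category` outputs into one dict of flags (the second clause name, `warranty`, is a hypothetical example, not a specific model from this card):

```python
# Turn the `category` outputs of several binary clause classifiers into a
# single dict of True/False flags. Each binary classifier predicts either
# its own clause name (e.g. "execution") or "other".
def clause_flags(predictions):
    """predictions: dict mapping clause name -> list of predicted labels,
    e.g. {"execution": ["execution"], "warranty": ["other"]}."""
    return {
        clause: any(label == clause for label in labels)
        for clause, labels in predictions.items()
    }

flags = clause_flags({"execution": ["execution"], "warranty": ["other"]})
# flags == {"execution": True, "warranty": False}
```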
## Results ```bash +-------+ | result| +-------+ |[execution]| |[other]| |[other]| |[execution]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_execution_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support execution 0.87 0.77 0.82 44 other 0.89 0.94 0.91 84 accuracy - - 0.88 128 macro-avg 0.88 0.86 0.87 128 weighted-avg 0.88 0.88 0.88 128 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from victorlee071200) author: John Snow Labs name: distilbert_qa_victorlee071200_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `victorlee071200`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_victorlee071200_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773002397.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_victorlee071200_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773002397.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_victorlee071200_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_victorlee071200_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_victorlee071200_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/victorlee071200/distilbert-base-uncased-finetuned-squad --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_ff3000 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-ff3000` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_ff3000_en_4.3.0_3.0_1675120955500.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_ff3000_en_4.3.0_3.0_1675120955500.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_ff3000","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_ff3000","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
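T5-style models are usually steered with a task prefix prepended to the input text (e.g. `summarize:`); the placeholder string above would normally carry such a prefix. A small illustrative helper for building those inputs — the prefix shown is a common convention, not one documented for this specific checkpoint:

```python
def t5_prompt(task_prefix: str, text: str) -> str:
    """Prepend a task prefix to the input text, the usual way
    T5-style models are told which task to perform."""
    return f"{task_prefix.rstrip()} {text.strip()}"

p = t5_prompt("summarize:", "  Spark NLP provides pretrained pipelines.  ")
# p == "summarize: Spark NLP provides pretrained pipelines."
```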
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_ff3000| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|171.8 MB| ## References - https://huggingface.co/google/t5-efficient-small-ff3000 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Abkhazian asr_wav2vec2_xls_r_300m_ab_CV8 TFWav2Vec2ForCTC from emre author: John Snow Labs name: pipeline_asr_wav2vec2_xls_r_300m_ab_CV8 date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_ab_CV8` is an Abkhazian model originally trained by emre. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xls_r_300m_ab_CV8_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_ab_CV8_ab_4.2.0_3.0_1664020110360.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_ab_CV8_ab_4.2.0_3.0_1664020110360.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_300m_ab_CV8', lang = 'ab') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_300m_ab_CV8", lang = "ab") val annotations = pipeline.transform(audioDF) ```
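The `audioDF` passed to the pipeline above is expected to hold each recording as an array of float samples (the column feeding the pipeline's AudioAssembler). A sketch, using only the Python standard library, of decoding a 16-bit mono PCM WAV into normalized floats; the Spark `createDataFrame` step itself is omitted, and the round-trip below just demonstrates the decoding:

```python
import io
import struct
import wave

def wav_to_floats(wav_bytes: bytes) -> list:
    """Decode 16-bit PCM mono WAV bytes into floats in [-1.0, 1.0]."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        assert wf.getsampwidth() == 2 and wf.getnchannels() == 1
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<" + "h" * (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a tiny in-memory WAV just to demonstrate the round-trip.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)      # mono
    wf.setsampwidth(2)      # 16-bit
    wf.setframerate(16000)  # wav2vec2 models typically expect 16 kHz
    wf.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))

floats = wav_to_floats(buf.getvalue())
# floats[:3] == [0.0, 0.5, -0.5]
```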
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xls_r_300m_ab_CV8| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|ab| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Arabic XlmRoBertaForQuestionAnswering (from salti) author: John Snow Labs name: xlm_roberta_qa_xlm_roberta_large_arabic_qa date: 2022-06-23 tags: [ar, open_source, question_answering, xlmroberta] task: Question Answering language: ar edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-large-arabic_qa` is an Arabic model originally trained by `salti`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_large_arabic_qa_ar_4.0.0_3.0_1655992517026.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_large_arabic_qa_ar_4.0.0_3.0_1655992517026.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_large_arabic_qa","ar") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_large_arabic_qa","ar") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("ar.answer_question.xlm_roberta.large").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
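The NLU snippet above joins the question and its context into a single string separated by `|||`. A tiny illustrative helper for building such inputs (a convenience for this card's examples, not part of the NLU API):

```python
def nlu_qa_input(question: str, context: str) -> str:
    """Join a question and its context with the '|||' separator used by
    nlu.load(...).predict for question-answering models."""
    return f"{question}|||{context}"

s = nlu_qa_input("What's my name?", "My name is Clara and I live in Berkeley.")
# s == "What's my name?|||My name is Clara and I live in Berkeley."
```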
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_roberta_large_arabic_qa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|ar| |Size:|1.8 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/salti/xlm-roberta-large-arabic_qa --- layout: model title: Pipeline to Detect Adverse Drug Events (biobert) author: John Snow Labs name: ner_ade_biobert_pipeline date: 2023-03-20 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_ade_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_ade_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_ade_biobert_pipeline_en_4.3.0_3.2_1679311852288.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_ade_biobert_pipeline_en_4.3.0_3.2_1679311852288.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_ade_biobert_pipeline", "en", "clinical/models") text = '''Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_ade_biobert_pipeline", "en", "clinical/models") val text = "Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps" val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.biobert_ade.pipeline").predict("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""") ```
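`fullAnnotate` returns, for each input text, a dict mapping output column names to lists of annotation objects. A plain-Python sketch of flattening the NER chunk annotations into `(chunk, begin, end, label, confidence)` rows, assuming the usual `result`, `begin`, `end`, and `metadata` fields on each annotation (the mocked annotation below is for demonstration only):

```python
from types import SimpleNamespace

def chunks_to_rows(annotations):
    """Flatten ner_chunk annotations into
    (chunk, begin, end, label, confidence) tuples."""
    return [
        (
            ann.result,
            ann.begin,
            ann.end,
            ann.metadata.get("entity"),
            float(ann.metadata.get("confidence", 0.0)),
        )
        for ann in annotations
    ]

# Mocked annotation mirroring Spark NLP's Annotation fields.
demo = [SimpleNamespace(result="Lipitor", begin=12, end=18,
                        metadata={"entity": "DRUG", "confidence": "0.9996"})]
# chunks_to_rows(demo) == [("Lipitor", 12, 18, "DRUG", 0.9996)]
```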
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:---------------|--------:|------:|:------------|-------------:| | 0 | Lipitor | 12 | 18 | DRUG | 0.9996 | | 1 | severe fatigue | 52 | 65 | ADE | 0.7588 | | 2 | voltaren | 97 | 104 | DRUG | 0.998 | | 3 | cramps | 152 | 157 | ADE | 0.9258 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_ade_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.0 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: English RobertaForQuestionAnswering (from shahrukhx01) author: John Snow Labs name: roberta_qa_roberta_base_squad2_boolq_baseline date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-boolq-baseline` is an English model originally trained by `shahrukhx01`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2_boolq_baseline_en_4.0.0_3.0_1655735101028.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2_boolq_baseline_en_4.0.0_3.0_1655735101028.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_squad2_boolq_baseline","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_squad2_boolq_baseline","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base.by_shahrukhx01").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_squad2_boolq_baseline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/shahrukhx01/roberta-base-squad2-boolq-baseline --- layout: model title: Part of Speech for German author: John Snow Labs name: pos_ud_hdt date: 2021-03-08 tags: [part_of_speech, open_source, german, pos_ud_hdt, de] task: Part of Speech Tagging language: de edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - ADP - DET - ADJ - NOUN - VERB - PRON - PROPN - X - PUNCT - CCONJ - NUM - ADV - AUX - SCONJ - PART - INTJ {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/german/pretrained-german-models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_hdt_de_3.0.0_3.0_1615230160154.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_hdt_de_3.0.0_3.0_1615230160154.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") posTagger = PerceptronModel.pretrained("pos_ud_hdt", "de") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, posTagger ]) data = spark.createDataFrame([["Hallo aus John Snow Labs! "]], ["text"]) result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_hdt", "de") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Hallo aus John Snow Labs! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Hallo aus John Snow Labs!"] token_df = nlu.load('de.pos').predict(text) token_df ```
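The pipeline emits parallel arrays of tokens and tags (`token.result` and `pos.result`). A plain-Python sketch of zipping them into the token/tag rows shown under Results:

```python
def tag_rows(tokens, tags):
    """Pair each token with its predicted POS tag, as in the results table."""
    if len(tokens) != len(tags):
        raise ValueError("token/tag length mismatch")
    return list(zip(tokens, tags))

rows = tag_rows(["Hallo", "aus", "John", "Snow", "Labs", "!"],
                ["NOUN", "ADP", "PROPN", "PROPN", "PROPN", "PUNCT"])
# rows[0] == ("Hallo", "NOUN")
```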
## Results ```bash token pos 0 Hallo NOUN 1 aus ADP 2 John PROPN 3 Snow PROPN 4 Labs PROPN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_hdt| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|de| --- layout: model title: English XlmRoBertaForQuestionAnswering (from teacookies) author: John Snow Labs name: xlm_roberta_qa_autonlp_roberta_base_squad2_24465516 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-roberta-base-squad2-24465516` is an English model originally trained by `teacookies`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465516_en_4.0.0_3.0_1655986218284.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465516_en_4.0.0_3.0_1655986218284.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465516","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465516","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.xlm_roberta.base_24465516.by_teacookies").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_roberta_base_squad2_24465516| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|887.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-roberta-base-squad2-24465516 --- layout: model title: Entity Recognizer LG author: John Snow Labs name: entity_recognizer_lg date: 2022-06-25 tags: [da, open_source] task: Named Entity Recognition language: da edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_lg is a pretrained pipeline that can be used to process text with a simple pipeline that performs basic processing steps and recognizes entities. It performs most of the common text processing tasks on your DataFrame. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_da_4.0.0_3.0_1656129041439.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_da_4.0.0_3.0_1656129041439.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("entity_recognizer_lg", "da") result = pipeline.annotate("""I love johnsnowlabs! """) ``` {:.nlu-block} ```python import nlu nlu.load("da.ner.lg").predict("""I love johnsnowlabs! """) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_lg| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|da| |Size:|2.5 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - NerDLModel - NerConverter --- layout: model title: English asr_wav2vec2_large_xls_r_300m_CN_colab TFWav2Vec2ForCTC from li666 author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_CN_colab date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_CN_colab` is an English model originally trained by li666. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_300m_CN_colab_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_CN_colab_en_4.2.0_3.0_1664094455307.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_CN_colab_en_4.2.0_3.0_1664094455307.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_CN_colab", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_CN_colab", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_CN_colab| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Sentiment Analysis for Food Reviews author: John Snow Labs name: distilbert_base_sequence_classifier_food date: 2022-02-14 tags: [food, amazon, distilbert, sequence_classification, en, open_source] task: Text Classification language: en nav_key: models edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: DistilBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model was imported from `Hugging Face` ([link](https://huggingface.co/Tejas3/distillbert_base_uncased_amazon_food_review_300)) and was trained on the Amazon Food Review dataset, leveraging `Distil-BERT` embeddings and `DistilBertForSequenceClassification` for text classification purposes. The model classifies the sentiment of food-review texts as `Positive` or `Negative`. ## Predicted Entities `Positive`, `Negative` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_base_sequence_classifier_food_en_3.4.0_3.0_1644846756022.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_base_sequence_classifier_food_en_3.4.0_3.0_1644846756022.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = DistilBertForSequenceClassification.pretrained("distilbert_base_sequence_classifier_food", "en")\ .setInputCols(["document",'token'])\ .setOutputCol("class") pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier]) light_pipeline = LightPipeline(pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) result = light_pipeline.annotate("The first time I ever used them was about 9 months ago. The food that came was just left at my front doorstep all over the place. Bread was smashed, bananas nearly rotten, and containers crushed. Given the weather, I decided to give them another try. This time my order was cancelled 6 hours before delivery. When cancelled, they didn't even give an explanation as to why it was cancelled. Amazon just needs to close up this portion of their shop.") ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val sequenceClassifier = DistilBertForSequenceClassification.pretrained("distilbert_base_sequence_classifier_food", "en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) val example = Seq("The first time I ever used them was about 9 months ago. The food that came was just left at my front doorstep all over the place. Bread was smashed, bananas nearly rotten, and containers crushed. Given the weather, I decided to give them another try. This time my order was cancelled 6 hours before delivery. When cancelled, they didn't even give an explanation as to why it was cancelled. Amazon just needs to close up this portion of their shop.").toDS.toDF("text") val result = pipeline.fit(example).transform(example) ```
## Results ```bash ['Negative'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_base_sequence_classifier_food| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[class]| |Language:|en| |Size:|249.8 MB| |Case sensitive:|true| |Max sentence length:|256| --- layout: model title: Italian T5ForConditionalGeneration Small Cased model (from efederici) author: John Snow Labs name: t5_it5_efficient_small_fanpage date: 2023-01-30 tags: [it, open_source, t5, tensorflow] task: Text Generation language: it edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `it5-efficient-small-fanpage` is an Italian model originally trained by `efederici`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_fanpage_it_4.3.0_3.0_1675103771407.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_fanpage_it_4.3.0_3.0_1675103771407.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_it5_efficient_small_fanpage","it") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_it5_efficient_small_fanpage","it") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_it5_efficient_small_fanpage| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|it| |Size:|593.8 MB| ## References - https://huggingface.co/efederici/it5-efficient-small-fanpage --- layout: model title: English XlmRoBertaForQuestionAnswering model (from deepset) author: John Snow Labs name: xlm_roberta_base_qa_squad2 date: 2022-06-15 tags: [open_source, xlmroberta, question_answering, en] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-squad2` is an English model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_base_qa_squad2_en_4.0.0_3.0_1655293726154.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_base_qa_squad2_en_4.0.0_3.0_1655293726154.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_base_qa_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_base_qa_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.xlm_roberta.base").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_base_qa_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|875.0 MB| |Case sensitive:|false| |Max sentence length:|512| ## References https://huggingface.co/deepset/xlm-roberta-base-squad2 --- layout: model title: English image_classifier_vit_base_patch16_224_finetuned_largerDataSet_docSeperator_more_labels_all_apache2 ViTForImageClassification from V0ltron author: John Snow Labs name: image_classifier_vit_base_patch16_224_finetuned_largerDataSet_docSeperator_more_labels_all_apache2 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_patch16_224_finetuned_largerDataSet_docSeperator_more_labels_all_apache2` is an English model originally trained by V0ltron. ## Predicted Entities `FullMed`, `MiddleLetter`, `BoundaryLetter`, `BoundaryMed`, `MiddleMed`, `FullLetter` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_finetuned_largerDataSet_docSeperator_more_labels_all_apache2_en_4.1.0_3.0_1660165490741.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_finetuned_largerDataSet_docSeperator_more_labels_all_apache2_en_4.1.0_3.0_1660165490741.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_patch16_224_finetuned_largerDataSet_docSeperator_more_labels_all_apache2", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_patch16_224_finetuned_largerDataSet_docSeperator_more_labels_all_apache2", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_patch16_224_finetuned_largerDataSet_docSeperator_more_labels_all_apache2| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Legal Publicity Clause Binary Classifier author: John Snow Labs name: legclf_publicity_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `publicity` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial linked above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in the Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. 
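The paragraph-level splitting recommended above can be sketched in a few lines of plain Python before the text reaches the pipeline. This is a minimal, illustrative helper (the `split_into_paragraphs` name and the sample contract are assumptions, not part of this model or the tutorial):

```python
import re

def split_into_paragraphs(text):
    # Paragraph splitting "by multiline": break on one or more blank lines
    # and drop empty fragments, so each chunk stays within the 512-token limit.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

contract = ("1. PUBLICITY. Neither party shall issue a press release without prior consent.\n\n"
            "2. GOVERNING LAW. This Agreement is governed by the laws of Delaware.")
clauses = split_into_paragraphs(contract)
# Each resulting clause can then be loaded as a separate "clause_text" row.
```

Each element of `clauses` would then become one row of the `clause_text` column consumed by the classifier pipeline below.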
## Predicted Entities `other`, `publicity` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_publicity_clause_en_1.0.0_3.2_1660122871432.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_publicity_clause_en_1.0.0_3.2_1660122871432.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_publicity_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[publicity]| |[other]| |[other]| |[publicity]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_publicity_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.94 0.98 0.96 96 publicity 0.94 0.84 0.89 37 accuracy - - 0.94 133 macro-avg 0.94 0.91 0.92 133 weighted-avg 0.94 0.94 0.94 133 ``` --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from hackathon-pln-es) author: John Snow Labs name: roberta_qa_base_bne_squad2 date: 2022-12-02 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bne-squad2-es` is a Spanish model originally trained by `hackathon-pln-es`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_bne_squad2_es_4.2.4_3.0_1669985982782.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_bne_squad2_es_4.2.4_3.0_1669985982782.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_bne_squad2","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_bne_squad2","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_bne_squad2| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|456.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/hackathon-pln-es/roberta-base-bne-squad2-es --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Qiliang) author: John Snow Labs name: distilbert_qa_qiliang_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Qiliang`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_qiliang_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768989901.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_qiliang_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768989901.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_qiliang_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_qiliang_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_qiliang_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Qiliang/distilbert-base-uncased-finetuned-squad --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_2 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-256-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_2_en_4.0.0_3.0_1657184713981.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_2_en_4.0.0_3.0_1657184713981.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-256-finetuned-squad-seed-2 --- layout: model title: Medical Question Answering (biogpt_pubmed_qa) author: John Snow Labs name: biogpt_pubmed_qa date: 2023-03-09 tags: [licensed, clinical, en, gpt, biogpt, pubmed, question_answering, tensorflow] task: Question Answering language: en edition: Healthcare NLP 4.3.1 spark_version: 3.0 published: false engine: tensorflow annotator: MedicalQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model has been trained on medical documents and can generate two types of answers, short and long. Two question types are supported: `"short"` (producing yes/no/maybe answers) and `"long"` (producing full answers). ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/biogpt_pubmed_qa_en_4.3.1_3.0_1678372297946.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/biogpt_pubmed_qa_en_4.3.1_3.0_1678372297946.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler()\ .setInputCols("question", "context")\ .setOutputCols("document_question", "document_context") med_qa = sparknlp_jsl.annotators.MedicalQuestionAnswering\ .pretrained("biogpt_pubmed_qa","en","clinical/models")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setMaxNewTokens(30)\ .setTopK(1)\ .setQuestionType("long") # "short" pipeline = Pipeline(stages=[document_assembler, med_qa]) paper_abstract = "The visual indexing theory proposed by Zenon Pylyshyn (Cognition, 32, 65-97, 1989) predicts that visual attention mechanisms are employed when mental images are projected onto a visual scene." long_question = "What is the effect of directing attention on memory?" yes_no_question = "Does directing attention improve memory for items?" data = spark.createDataFrame( [ [long_question, paper_abstract, "long"], [yes_no_question, paper_abstract, "short"], ] ).toDF("question", "context", "question_type") pipeline.fit(data).transform(data.where("question_type == 'long'"))\ .select("answer.result")\ .show(truncate=False) pipeline.fit(data).transform(data.where("question_type == 'short'"))\ .select("answer.result")\ .show(truncate=False) ``` ```scala val document_assembler = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val med_qa = MedicalQuestionAnswering .pretrained("biogpt_pubmed_qa","en","clinical/models") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setMaxNewTokens(30) .setTopK(1) .setQuestionType("long") // "short" val pipeline = new Pipeline().setStages(Array(document_assembler, med_qa)) val paper_abstract = "The visual indexing theory proposed by Zenon Pylyshyn (Cognition, 32, 65-97, 1989) predicts that visual attention mechanisms are employed when mental images are projected onto a visual scene." 
val long_question = "What is the effect of directing attention on memory?" val yes_no_question = "Does directing attention improve memory for items?" val data = Seq( (long_question, paper_abstract, "long"), (yes_no_question, paper_abstract, "short")) .toDS.toDF("question", "context", "question_type") val result = pipeline.fit(data).transform(data) ```
## Results ```bash +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |result | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |[the present study investigated whether directing spatial attention to one location in a visual array would enhance memory for the array features. participants memorized two]| +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|biogpt_pubmed_qa| |Compatibility:|Healthcare NLP 4.3.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.1 GB| |Case sensitive:|true| --- layout: model title: Legal Indebtedness Clause Binary Classifier author: John Snow Labs name: legclf_indebtedness_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `indebtedness` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. 
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial linked above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in the Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `indebtedness` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_indebtedness_clause_en_1.0.0_3.2_1660123601510.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_indebtedness_clause_en_1.0.0_3.2_1660123601510.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_indebtedness_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
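As the description notes, several binary clause classifiers can be run over the same clause text, each contributing one True/False decision. A minimal sketch of how the per-model labels could be merged into such flags (plain Python; the `merge_clause_flags` helper and the sample labels are illustrative, not a Spark NLP API):

```python
def merge_clause_flags(labels_by_model):
    # Each binary clause model answers with its clause name or "other";
    # map that to a True/False flag per clause classifier.
    return {model: label != "other" for model, label in labels_by_model.items()}

flags = merge_clause_flags({
    "legclf_indebtedness_clause": "indebtedness",
    "legclf_publicity_clause": "other",
})
# flags -> {"legclf_indebtedness_clause": True, "legclf_publicity_clause": False}
```

In practice the labels would come from each model's `category` output column for the same `clause_text` row.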
## Results ```bash +-------+ | result| +-------+ |[indebtedness]| |[other]| |[other]| |[indebtedness]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_indebtedness_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support indebtedness 0.95 0.88 0.91 24 other 0.96 0.98 0.97 66 accuracy - - 0.96 90 macro-avg 0.96 0.93 0.94 90 weighted-avg 0.96 0.96 0.95 90 ``` --- layout: model title: English asr_wav2vec2_25_1Aug_2022 TFWav2Vec2ForCTC from Roshana author: John Snow Labs name: asr_wav2vec2_25_1Aug_2022 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_25_1Aug_2022` is an English model originally trained by Roshana. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_25_1Aug_2022_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_25_1Aug_2022_en_4.2.0_3.0_1664115989562.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_25_1Aug_2022_en_4.2.0_3.0_1664115989562.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_25_1Aug_2022", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_25_1Aug_2022", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_25_1Aug_2022| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-64-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_2_en_4.0.0_3.0_1654191719402.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_2_en_4.0.0_3.0_1654191719402.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_64d_seed_2").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
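For illustration only (this is not Spark NLP's API), the span-selection step that extractive QA heads like this one perform can be sketched in plain Python: the model scores every context position as a possible answer start and end, and the best-scoring valid span is returned. The scores below are made up:

```python
def best_span(start_scores, end_scores, max_len=30):
    """Pick (start, end) maximizing start + end score, with end >= start."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        # Only consider spans of bounded length starting at i.
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best[0], best[1]

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]
end   = [0.0, 0.1, 0.0, 4.0, 0.2, 0.0, 0.0, 0.0, 0.5, 0.1]
s, e = best_span(start, end)
# tokens[s:e+1] -> ["Clara"]
```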
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|378.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-64-finetuned-squad-seed-2 --- layout: model title: Stopwords Remover for English language (306 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, en, open_source] task: Stop Words Removal language: en nav_key: models edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_en_3.4.1_3.0_1646672348840.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_en_3.4.1_3.0_1646672348840.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","en") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["You are not better than me"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stopWords = StopWordsCleaner.pretrained("stopwords_iso","en") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stopWords)) val data = Seq("You are not better than me").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.stopwords").predict("""You are not better than me""") ```
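Conceptually, `StopWordsCleaner` keeps only the tokens that are absent from its pretrained 306-entry list. A minimal pure-Python sketch of that filtering step, using a tiny illustrative subset rather than the real list:

```python
# Tiny illustrative subset -- the pretrained model ships 306 English entries.
STOPWORDS = {"you", "are", "not", "than", "me"}

def clean_tokens(tokens, stopwords=STOPWORDS):
    """Drop every token whose lowercase form appears in the stopword list."""
    return [t for t in tokens if t.lower() not in stopwords]

clean_tokens("You are not better than me".split())  # -> ["better"]
```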
## Results ```bash +--------+ |result | +--------+ |[better]| +--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|en| |Size:|2.1 KB| --- layout: model title: English RobertaForQuestionAnswering Cased model (from cgou) author: John Snow Labs name: roberta_qa_fin_v1_finetuned_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fin_RoBERTa-v1-finetuned-squad` is an English model originally trained by `cgou`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_fin_v1_finetuned_squad_en_4.3.0_3.0_1674210819999.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_fin_v1_finetuned_squad_en_4.3.0_3.0_1674210819999.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fin_v1_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fin_v1_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_fin_v1_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|248.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/cgou/fin_RoBERTa-v1-finetuned-squad --- layout: model title: Translate Uralic languages to English Pipeline author: John Snow Labs name: translate_urj_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, urj, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `urj` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_urj_en_xx_2.7.0_2.4_1609699134258.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_urj_en_xx_2.7.0_2.4_1609699134258.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_urj_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_urj_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.urj.translate_to.en').predict(text, output_level='sentence') translate_df ```
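Because long inputs dominate the translation cost, one practical pattern is to pre-batch sentences under a size budget and call `pipeline.annotate` once per batch. The helper below is a hypothetical sketch, not part of Spark NLP:

```python
def chunk_sentences(sentences, max_chars=512):
    """Group sentences into batches whose total length stays under max_chars.

    A sentence longer than the budget still gets its own batch.
    """
    batches, current, size = [], [], 0
    for s in sentences:
        if current and size + len(s) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(s)
        size += len(s)
    if current:
        batches.append(current)
    return batches

# Hypothetical usage (requires the loaded pipeline from the snippet above):
# for batch in chunk_sentences(my_sentences):
#     print(pipeline.annotate(" ".join(batch)))
```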
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_urj_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English image_classifier_vit_lotr ViTForImageClassification from jjhoffstein author: John Snow Labs name: image_classifier_vit_lotr date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_lotr` is a English model originally trained by jjhoffstein. ## Predicted Entities `gandalf`, `gollum`, `aragorn`, `frodo`, `legolas` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_lotr_en_4.1.0_3.0_1660168933925.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_lotr_en_4.1.0_3.0_1660168933925.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_lotr", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_lotr", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
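The `class` output holds the top label among the five predicted entities. For illustration only, the softmax selection such a classification head performs can be sketched in plain Python (the logits below are made up, not real model outputs):

```python
import math

LABELS = ["gandalf", "gollum", "aragorn", "frodo", "legolas"]

def predict(logits, labels=LABELS):
    """Softmax the logits and return (best_label, probability)."""
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    i = max(range(len(probs)), key=probs.__getitem__)
    return labels[i], probs[i]
```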
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_lotr| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Tagalog RoBERTa Embeddings (Large) author: John Snow Labs name: roberta_embeddings_roberta_tagalog_large date: 2022-04-14 tags: [roberta, embeddings, tl, open_source] task: Embeddings language: tl edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-tagalog-large` is a Tagalog model originally trained by `jcblaise`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_tagalog_large_tl_3.4.2_3.0_1649948937589.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_tagalog_large_tl_3.4.2_3.0_1649948937589.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_tagalog_large","tl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Gustung-gusto ko ang Spark NLP."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_tagalog_large","tl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Gustung-gusto ko ang Spark NLP.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("tl.embed.roberta_tagalog_large").predict("""Gustung-gusto ko ang Spark NLP.""") ```
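The `embeddings` column carries one dense vector per token. A typical downstream use is comparing tokens or pooled sentences by cosine similarity; a minimal sketch with made-up vectors (real vectors from this large model would have far more dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Parallel vectors score 1.0, orthogonal vectors score 0.0.
cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # -> 1.0 (up to float rounding)
```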
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_roberta_tagalog_large| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|tl| |Size:|1.3 GB| |Case sensitive:|true| ## References - https://huggingface.co/jcblaise/roberta-tagalog-large - https://blaisecruz.com --- layout: model title: Named Entity Recognition (NER) Model in Finnish (GloVe 840B 300) author: John Snow Labs name: finnish_ner_840B_300 date: 2020-09-01 task: Named Entity Recognition language: fi edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [ner, fi, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Finnish NER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. The model is trained with GloVe 840B 300 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Product-`PRO`, Date-`DATE`, Event-`EVENT`.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/PP_EXPLAIN_DOCUMENT_FI/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/PP_EXPLAIN_DOCUMENT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/finnish_ner_840B_300_fi_2.6.0_2.4_1598965807720.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/finnish_ner_840B_300_fi_2.6.0_2.4_1598965807720.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("finnish_ner_840B_300", "fi") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (28. lokakuuta 1955) on amerikkalaisia \u200b\u200bulkoministeriöitä, ohjelmistoja, sijoittajia ja filantroppeja. Microsoft on toiminut Microsoft Corporationin välittäjänä. I løbet af sin karriere hos Microsoft havde Gates stillinger som formand, administrerende direktør (administratorerendeøøør), præsident og chefsoftwarearkitekt, samtidig med at han var den største individualelle aktionær indtil maj 2014. mikrotietokonevoluutioille i 1970'erne 1980 1980erne. Født and opvokset i Seattle, Washington, var Gates grundlægger af Microsoft sammen med barndomsvennen Paul Allen i 1975 i Albuquerque, New Mexico; Det fortsatte med at blive verdens største virksomhed inden for personlig tietokoneohjelmistot. Gates førte virksomheden som formand and administratorer direktør, indtil han trådte tilbage som administrerende direktør tammikuu 2000, miehet han forblev formand blev chefsoftwarearkitekt. Olen slutningen 1990'erne var Gates blevet kritiseret for syn forretningstaktik, der er blevet betragtet som konkurrencebegrænsende. Denne udtalelse er blevet opretholdt ved adskillige retsafgørelser. Kesäkuun 2006 Meddelte Gates, at han ville overgå til en deltidsrolle i Microsoft og fuldtidsarbejde i Bill & Melinda Gates Foundation, det private velgørende fundament, som han og hans kone, Melinda Gates, oprettede i 2000. 
Han overførte gradvist sine pligter Tilaaja Ray Ozzie ja Craig Mundie. Han trådte tilbage som formand for Microsoft helmikuussa 2014 ja tiltrådte en ny stilling som teknologiatietojen antaja at støtte den nyudnævnte adminerende direktør Satya Nadella."]], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("finnish_ner_840B_300", "fi") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (28. lokakuuta 1955) on amerikkalaisia ​​ulkoministeriöitä, ohjelmistoja, sijoittajia ja filantroppeja. Microsoft on toiminut Microsoft Corporationin välittäjänä. I løbet af sin karriere hos Microsoft havde Gates stillinger som formand, administrerende direktør (administratorerendeøøør), præsident og chefsoftwarearkitekt, samtidig med at han var den største individualelle aktionær indtil maj 2014. mikrotietokonevoluutioille i 1970"erne 1980 1980erne. Født and opvokset i Seattle, Washington, var Gates grundlægger af Microsoft sammen med barndomsvennen Paul Allen i 1975 i Albuquerque, New Mexico; Det fortsatte med at blive verdens største virksomhed inden for personlig tietokoneohjelmistot. Gates førte virksomheden som formand and administratorer direktør, indtil han trådte tilbage som administrerende direktør tammikuu 2000, miehet han forblev formand blev chefsoftwarearkitekt. Olen slutningen 1990"erne var Gates blevet kritiseret for syn forretningstaktik, der er blevet betragtet som konkurrencebegrænsende. Denne udtalelse er blevet opretholdt ved adskillige retsafgørelser. 
Kesäkuun 2006 Meddelte Gates, at han ville overgå til en deltidsrolle i Microsoft og fuldtidsarbejde i Bill & Melinda Gates Foundation, det private velgørende fundament, som han og hans kone, Melinda Gates, oprettede i 2000. [9] Han overførte gradvist sine pligter Tilaaja Ray Ozzie ja Craig Mundie. Han trådte tilbage som formand for Microsoft helmikuussa 2014 ja tiltrådte en ny stilling som teknologiatietojen antaja at støtte den nyudnævnte adminerende direktør Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (28. lokakuuta 1955) on amerikkalaisia ​​ulkoministeriöitä, ohjelmistoja, sijoittajia ja filantroppeja. Microsoft on toiminut Microsoft Corporationin välittäjänä. I løbet af sin karriere hos Microsoft havde Gates stillinger som formand, administrerende direktør (administratorerendeøøør), præsident og chefsoftwarearkitekt, samtidig med at han var den største individualelle aktionær indtil maj 2014. mikrotietokonevoluutioille i 1970'erne 1980 1980erne. Født and opvokset i Seattle, Washington, var Gates grundlægger af Microsoft sammen med barndomsvennen Paul Allen i 1975 i Albuquerque, New Mexico; Det fortsatte med at blive verdens største virksomhed inden for personlig tietokoneohjelmistot. Gates førte virksomheden som formand and administratorer direktør, indtil han trådte tilbage som administrerende direktør tammikuu 2000, miehet han forblev formand blev chefsoftwarearkitekt. Olen slutningen 1990'erne var Gates blevet kritiseret for syn forretningstaktik, der er blevet betragtet som konkurrencebegrænsende. Denne udtalelse er blevet opretholdt ved adskillige retsafgørelser. Kesäkuun 2006 Meddelte Gates, at han ville overgå til en deltidsrolle i Microsoft og fuldtidsarbejde i Bill & Melinda Gates Foundation, det private velgørende fundament, som han og hans kone, Melinda Gates, oprettede i 2000. 
[9] Han overførte gradvist sine pligter Tilaaja Ray Ozzie ja Craig Mundie. Han trådte tilbage som formand for Microsoft helmikuussa 2014 ja tiltrådte en ny stilling som teknologiatietojen antaja at støtte den nyudnævnte adminerende direktør Satya Nadella."""] ner_df = nlu.load('fi.ner.840B_300d').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title} ## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |William Henry Gates III|PER | |lokakuuta 1955 |DATE | |ulkoministeriöitä |ORG | |Microsoft |ORG | |Microsoft Corporationin|ORG | |Microsoft |ORG | |Gates |PER | |maj 2014 |DATE | |1970'erne 1980 1980erne|DATE | |Seattle |LOC | |Washington |LOC | |Gates |PER | |Microsoft |ORG | |Paul Allen |PER | |1975 |DATE | |Albuquerque |LOC | |New Mexico |LOC | |Det |LOC | |Gates |PER | |tammikuu 2000 |DATE | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finnish_ner_840B_300| |Type:|ner| |Compatibility:|Spark NLP 2.6.0+| |Edition:|Official| |License:|Open Source| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|fi| |Case sensitive:|false| {:.h2_title} ## Data Source Detailed information can be found in [https://www.aclweb.org/anthology/2020.lrec-1.567.pdf](https://www.aclweb.org/anthology/2020.lrec-1.567.pdf) --- layout: model title: Japanese Electra Embeddings (from izumi-lab) author: John Snow Labs name: electra_embeddings_electra_small_paper_japanese_generator date: 2022-05-17 tags: [ja, open_source, electra, embeddings] task: Embeddings language: ja edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-small-paper-japanese-generator` is a Japanese model originally trained by `izumi-lab`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_small_paper_japanese_generator_ja_3.4.4_3.0_1652786712318.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_small_paper_japanese_generator_ja_3.4.4_3.0_1652786712318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_small_paper_japanese_generator","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Spark NLPが大好きです"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_small_paper_japanese_generator","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Spark NLPが大好きです").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electra_small_paper_japanese_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ja| |Size:|19.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/izumi-lab/electra-small-paper-japanese-generator - https://github.com/google-research/electra - https://github.com/retarfi/language-pretraining/tree/v1.0 - https://arxiv.org/abs/2003.10555 - https://creativecommons.org/licenses/by-sa/4.0/ --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from tabo) author: John Snow Labs name: distilbert_qa_tabo_base_uncased_finetuned_squad2 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad2` is an English model originally trained by `tabo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_tabo_base_uncased_finetuned_squad2_en_4.3.0_3.0_1672773599663.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_tabo_base_uncased_finetuned_squad2_en_4.3.0_3.0_1672773599663.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_tabo_base_uncased_finetuned_squad2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_tabo_base_uncased_finetuned_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_tabo_base_uncased_finetuned_squad2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/tabo/distilbert-base-uncased-finetuned-squad2 --- layout: model title: Legal Demand Registration Clause Binary Classifier author: John Snow Labs name: legclf_demand_registration_clause date: 2022-12-07 tags: [en, legal, classification, clauses, demand_registration, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `demand-registration` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Keep in mind that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `demand-registration`, `other` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_demand_registration_clause_en_1.0.0_3.0_1670444564459.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_demand_registration_clause_en_1.0.0_3.0_1670444564459.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_demand_registration_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[demand-registration]| |[other]| |[other]| |[demand-registration]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_demand_registration_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support demand-registration 1.00 0.96 0.98 28 other 0.99 1.00 0.99 73 accuracy - - 0.99 101 macro-avg 0.99 0.98 0.99 101 weighted-avg 0.99 0.99 0.99 101 ``` --- layout: model title: English T5ForConditionalGeneration Cased model (from google) author: John Snow Labs name: t5_flan_base date: 2023-03-01 tags: [open_source, t5, flan, xx, tensorflow] task: Text Generation language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `flan-t5-base` is an English model originally trained by google. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_flan_base_xx_4.3.0_3.0_1677702524850.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_flan_base_xx_4.3.0_3.0_1677702524850.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_flan_base","xx") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_flan_base","xx") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_flan_base| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|xx| |Size:|1.0 GB| ## References https://huggingface.co/google/flan-t5-base --- layout: model title: Legal Change In Control Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_change_in_control_bert date: 2023-03-05 tags: [en, legal, classification, clauses, change_in_control, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Change_In_Control` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Change_In_Control`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_change_in_control_bert_en_1.0.0_3.0_1678050687048.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_change_in_control_bert_en_1.0.0_3.0_1678050687048.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_change_in_control_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
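Because the sentence embeddings accept at most 512 tokens, it is worth checking paragraph length before classification. A rough sketch — whitespace tokens only approximate the model's actual subword count, so treat the check as a heuristic:

```python
MAX_TOKENS = 512

def within_token_budget(text: str, limit: int = MAX_TOKENS) -> bool:
    # Whitespace splitting is a cheap proxy for the real tokenizer;
    # subword tokenization usually produces somewhat more tokens.
    return len(text.split()) <= limit

clause = "Upon a Change in Control, all outstanding awards shall vest immediately."
fits = within_token_budget(clause)
```

Paragraphs failing the check can be split further with the techniques listed in the description.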
## Results ```bash +-------+ |result| +-------+ |[Change_In_Control]| |[Other]| |[Other]| |[Change_In_Control]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_change_in_control_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Change_In_Control 0.95 0.97 0.96 38 Other 0.98 0.97 0.98 61 accuracy - - 0.97 99 macro-avg 0.97 0.97 0.97 99 weighted-avg 0.97 0.97 0.97 99 ``` --- layout: model title: Fast Neural Machine Translation Model from Tiv to English author: John Snow Labs name: opus_mt_tiv_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, tiv, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `tiv` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_tiv_en_xx_2.7.0_2.4_1609167406296.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_tiv_en_xx_2.7.0_2.4_1609167406296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_tiv_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_tiv_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.tiv.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_tiv_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Financial 10K Filings NER (Summary page, SEC embeddings) author: John Snow Labs name: finner_sec_10k_summary date: 2022-10-25 tags: [sec, 10k, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true recommended: true annotator: FinanceNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description IMPORTANT: Don't run this model on the whole financial report. Instead: - Split by paragraphs; - Use the `finclf_form_10k_summary_item` Text Classifier to select only those paragraphs. This Financial NER Model is aimed at processing the first summary page of 10-K filings and extracting information about the company submitting the filing, trading data, address / phones, CFN, IRS, etc. This is another version of `finner_10k_summary` trained with better financial embeddings (`bert_embeddings_sec_bert_base`). ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_sec_10k_summary_en_1.0.0_3.0_1666711517681.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_sec_10k_summary_en_1.0.0_3.0_1666711517681.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pyspark.sql.functions as F document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") \ .setCustomBounds(["\n\n"]) tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) ner_model = finance.NerModel.pretrained("finner_sec_10k_summary","en","finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame([["""ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES AND EXCHANGE ACT OF 1934 For the annual period ended January 31, 2021 or TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the transition period from________to_______ Commission File Number: 001-38856 PAGERDUTY, INC. (Exact name of registrant as specified in its charter) Delaware 27-2793871 (State or other jurisdiction of incorporation or organization) (I.R.S. 
Employer Identification Number) 600 Townsend St., Suite 200, San Francisco, CA 94103 (844) 800-3889 (Address, including zip code, and telephone number, including area code, of registrant’s principal executive offices) Securities registered pursuant to Section 12(b) of the Act: Title of each class Trading symbol(s) Name of each exchange on which registered Common Stock, $0.000005 par value, PD New York Stock Exchange"""]]).toDF("text") result = model.transform(data) result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \ .select(F.expr("cols['0']").alias("ticker"), F.expr("cols['1']['entity']").alias("label")).show(50, truncate = False) ```
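The `NerConverter` stage above merges token-level B-/I- (IOB2) tags into entity chunks such as `PAGERDUTY, INC` → `ORG`. A minimal sketch of that merging logic in plain Python:

```python
def bio_to_chunks(tokens, tags):
    """Merge IOB2 tags into (entity_text, label) chunks."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new chunk, closing any open one.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # Continuation of the current chunk.
            current.append(token)
        else:
            # O tag (or inconsistent I- tag) closes the open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["PAGERDUTY", ",", "INC", "trades", "as", "PD"]
tags = ["B-ORG", "I-ORG", "I-ORG", "O", "O", "B-TICKER"]
chunks = bio_to_chunks(tokens, tags)
# -> [("PAGERDUTY , INC", "ORG"), ("PD", "TICKER")]
```

The benchmarking table for this model reports scores per B-/I- tag, which is exactly the token-level view this merging collapses.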
## Results ```bash +----------------------------------------------+-----------------+ |ticker |label | +----------------------------------------------+-----------------+ |January 31, 2021 |FISCAL_YEAR | |001-38856 |CFN | |PAGERDUTY, INC |ORG | |Delaware |STATE | |27-2793871 |IRS | |600 Townsend St., Suite 200, San Francisco, CA|ADDRESS | |(844) 800-3889 |PHONE | |Common Stock |TITLE_CLASS | |$0.000005 |TITLE_CLASS_VALUE| |PD |TICKER | |New York Stock Exchange |STOCK_EXCHANGE | +----------------------------------------------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_sec_10k_summary| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.5 MB| ## References Manual annotations on 10-K Filings ## Benchmarking ```bash label precision recall f1-score support B-TITLE_CLASS 1.00 1.00 1.00 15 I-TITLE_CLASS 1.00 1.00 1.00 21 B-ORG 0.84 0.66 0.74 62 I-ORG 0.88 0.76 0.82 93 B-STOCK_EXCHANGE 0.86 0.86 0.86 14 I-STOCK_EXCHANGE 0.98 0.98 0.98 50 B-PHONE 0.95 0.87 0.91 23 I-PHONE 0.95 1.00 0.98 60 B-STATE 0.89 0.85 0.87 20 B-IRS 1.00 0.88 0.93 16 B-ADDRESS 0.94 0.83 0.88 18 I-ADDRESS 0.92 0.97 0.94 144 B-TICKER 0.86 0.92 0.89 13 B-FISCAL_YEAR 0.96 0.88 0.92 50 I-FISCAL_YEAR 0.93 0.92 0.92 125 B-TITLE_CLASS_VALUE 1.00 0.93 0.97 15 B-CFN 0.92 1.00 0.96 12 micro-avg 0.93 0.89 0.91 751 macro-avg 0.84 0.81 0.82 751 weighted-avg 0.92 0.89 0.91 751 ``` --- layout: model title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011) author: John Snow Labs name: distilbert_token_classifier_autotrain_company_all_903429540 date: 2023-03-06 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: 
type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-company_all-903429540` is an English model originally trained by `ismail-lucifer011`. ## Predicted Entities `Company`, `OOV` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_company_all_903429540_en_4.3.1_3.0_1678134087932.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_company_all_903429540_en_4.3.1_3.0_1678134087932.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_company_all_903429540","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_company_all_903429540","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_company_all_903429540| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ismail-lucifer011/autotrain-company_all-903429540 --- layout: model title: Mapping SNOMED Codes with Their Corresponding UMLS Codes author: John Snow Labs name: snomed_umls_mapper date: 2022-06-27 tags: [snomed, umls, chunk_mapper, clinical, licensed, en] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps SNOMED codes to corresponding UMLS codes under the Unified Medical Language System (UMLS). ## Predicted Entities `umls_code` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/snomed_umls_mapper_en_3.5.3_3.0_1656313105018.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/snomed_umls_mapper_en_3.5.3_3.0_1656313105018.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbert_jsl_medium_uncased", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") snomed_resolver = SentenceEntityResolverModel\ .pretrained("sbertresolve_snomed_conditions", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("snomed_code")\ .setDistanceFunction("EUCLIDEAN") chunkerMapper = ChunkMapperModel\ .pretrained("snomed_umls_mapper", "en", "clinical/models") \ .setInputCols(["snomed_code"])\ .setOutputCol("umls_mappings")\ .setRels(["umls_code"]) pipeline = Pipeline(stages = [ documentAssembler, sbert_embedder, snomed_resolver, chunkerMapper ]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) light_pipeline = LightPipeline(model) result = light_pipeline.fullAnnotate("Radiating chest pain") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbert_jsl_medium_uncased", "en", "clinical/models") .setInputCols("ner_chunk") .setOutputCol("sbert_embeddings") val snomed_resolver = SentenceEntityResolverModel .pretrained("sbertresolve_snomed_conditions", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("snomed_code") .setDistanceFunction("EUCLIDEAN") val chunkerMapper = ChunkMapperModel .pretrained("snomed_umls_mapper", "en", "clinical/models") .setInputCols("snomed_code") .setOutputCol("umls_mappings") .setRels(Array("umls_code")) val pipeline = new Pipeline().setStages(Array( documentAssembler, sbert_embedder, snomed_resolver, chunkerMapper )) val data = Seq("Radiating chest pain").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu 
nlu.load("en.snomed_to_umls").predict("""Radiating chest pain""") ```
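Conceptually, the chunk mapper is a lookup table from SNOMED codes to UMLS codes. A toy sketch seeded with the single pair from this card's sample output (SNOMED `10000006` → UMLS `C0232289`); the pretrained model ships the full relation:

```python
# Toy subset of the SNOMED -> UMLS relation; illustrative only.
SNOMED_TO_UMLS = {
    "10000006": "C0232289",  # Radiating chest pain
}

def map_snomed_to_umls(snomed_code: str):
    # Returns None when the code has no known UMLS mapping.
    return SNOMED_TO_UMLS.get(snomed_code)

umls = map_snomed_to_umls("10000006")
```

This is why the mapper sits after the resolver in the pipeline: it consumes codes, not raw text.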
## Results ```bash | | ner_chunk | snomed_code | umls_mappings | |---:|:---------------------|--------------:|:----------------| | 0 | Radiating chest pain | 10000006 | C0232289 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|snomed_umls_mapper| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[snomed_code]| |Output Labels:|[mappings]| |Language:|en| |Size:|5.1 MB| --- layout: model title: Part of Speech for Russian author: John Snow Labs name: pos_ud_gsd date: 2020-03-12 13:48:00 +0800 task: Part of Speech Tagging language: ru edition: Spark NLP 2.4.4 spark_version: 2.4 tags: [pos, ru] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_ru_2.4.4_2.4_1584013495069.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_ru_2.4.4_2.4_1584013495069.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_gsd", "ru") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Помимо того, что он король севера, Джон Сноу - английский врач и лидер в разработке анестезии и медицинской гигиены.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_gsd", "ru") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Помимо того, что он король севера, Джон Сноу - английский врач и лидер в разработке анестезии и медицинской гигиены.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Помимо того, что он король севера, Джон Сноу - английский врач и лидер в разработке анестезии и медицинской гигиены."""] pos_df = nlu.load('ru.pos.ud_gsd').predict(text, output_level='token') pos_df ```
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=5, result='ADP', metadata={'word': 'Помимо'}), Row(annotatorType='pos', begin=7, end=10, result='PRON', metadata={'word': 'того'}), Row(annotatorType='pos', begin=11, end=11, result='PUNCT', metadata={'word': ','}), Row(annotatorType='pos', begin=13, end=15, result='SCONJ', metadata={'word': 'что'}), Row(annotatorType='pos', begin=17, end=18, result='PRON', metadata={'word': 'он'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_gsd| |Type:|pos| |Compatibility:|Spark NLP 2.4.4| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|ru| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Sentence Entity Resolver for Snomed Aux Concepts, CT version (``sbiobert_base_cased_mli`` embeddings) author: John Snow Labs name: sbiobertresolve_snomed_auxConcepts language: en nav_key: models repository: clinical/models date: 2020-11-27 task: Entity Resolution edition: Healthcare NLP 2.6.4 spark_version: 2.4 tags: [clinical,entity_resolution,en] supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model maps extracted medical entities to Snomed codes (with Morph Abnormality, Procedure, Substance, Physical Object, Body Structure concepts from CT version) using chunk embeddings. {:.h2_title} ## Predicted Entities Snomed Codes and their normalized definition with ``sbiobert_base_cased_mli`` embeddings. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_auxConcepts_en_2.6.4_2.4_1606235765319.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_auxConcepts_en_2.6.4_2.4_1606235765319.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") snomed_aux_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_auxConcepts","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_aux_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala ... 
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val snomed_aux_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_snomed_auxConcepts","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_aux_resolver)) val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.h2_title} ## Results ```bash +--------------------+-----+---+---------+---------+----------+--------------------+--------------------+ | chunk|begin|end| entity| code|confidence| resolutions| codes| +--------------------+-----+---+---------+---------+----------+--------------------+--------------------+ | hypertension| 68| 79| PROBLEM| 73578008| 0.5321|hyperdistension::...|73578008:::147849...| |chronic renal ins...| 83|109| PROBLEM|151137006| 0.3356|renal function ob...|151137006:::26681...| | COPD| 113|116| PROBLEM|395159008| 0.1374|chronic obstructi...|395159008:::39059...| | gastritis| 120|128| PROBLEM| 
57632009| 0.1274|gastroplication::...|57632009:::216090...| | TIA| 136|138| PROBLEM|449758002| 0.1984|traumatic infarct...|449758002:::85844...| |a non-ST elevatio...| 182|202| PROBLEM|713264002| 0.0941|nontraumatic rupt...|713264002:::31036...| |Guaiac positive s...| 208|229| PROBLEM| 25580003| 0.0906|faecaloma:::faeca...|25580003:::891580...| |cardiac catheteri...| 295|317| TEST| 41976001| 0.4957|cardiac catheteri...|41976001:::141945...| | PTCA| 324|327|TREATMENT|309817004| 0.0660|pulmonary angiogr...|309817004:::31264...| | mid LAD lesion| 332|345| PROBLEM|193467007| 0.1213|mid portion of an...|193467007:::91748...| +--------------------+-----+---+---------+---------+----------+--------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---------------|---------------------| | Name: | sbiobertresolve_snomed_auxConcepts | | Type: | SentenceEntityResolverModel | | Compatibility: | Spark NLP 2.6.4 + | | License: | Licensed | | Edition: | Official | |Input labels: | [ner_chunk, chunk_embeddings] | |Output labels: | [resolution] | | Language: | en | | Dependencies: | sbiobert_base_cased_mli | {:.h2_title} ## Data Source Trained on SNOMED (CT version) Findings with ``sbiobert_base_cased_mli`` sentence embeddings. http://www.snomed.org/ --- layout: model title: Hindi Lemmatizer author: John Snow Labs name: lemma date: 2020-07-29 23:34:00 +0800 task: Lemmatization language: hi edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [lemmatizer, hi] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. 
The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb#scrollTo=bbzEH9u7tdxR){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_hi_2.5.5_2.4_1596054396201.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_hi_2.5.5_2.4_1596054396201.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "hi") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("उत्तर के राजा होने के अलावा, जॉन स्नो एक अंग्रेजी चिकित्सक और संज्ञाहरण और चिकित्सा स्वच्छता के विकास में अग्रणी है।") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "hi") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("उत्तर के राजा होने के अलावा, जॉन स्नो एक अंग्रेजी चिकित्सक और संज्ञाहरण और चिकित्सा स्वच्छता के विकास में अग्रणी है।").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""उत्तर के राजा होने के अलावा, जॉन स्नो एक अंग्रेजी चिकित्सक और संज्ञाहरण और चिकित्सा स्वच्छता के विकास में अग्रणी है।"""] lemma_df = nlu.load('hi.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
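A `LightPipeline.fullAnnotate` call returns plain Python structures (one dict per input document, keyed by output column), so the lemmas can be post-processed without a Spark session. The sketch below uses a hypothetical stand-in for such a result rather than real model output:

```python
# A hypothetical stand-in for what LightPipeline.fullAnnotate returns:
# one dict per input document, keyed by output column name.
result = [{
    "lemma": [
        {"annotatorType": "token", "begin": 0, "end": 4, "result": "उत्तर"},
        {"annotatorType": "token", "begin": 6, "end": 7, "result": "का"},
        {"annotatorType": "token", "begin": 9, "end": 12, "result": "राजा"},
    ]
}]

def collect_lemmas(annotated_docs):
    """Flatten the lemma annotations of every document into one list of strings."""
    return [ann["result"] for doc in annotated_docs for ann in doc["lemma"]]

print(collect_lemmas(result))  # ['उत्तर', 'का', 'राजा']
```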
{:.h2_title}
## Results

```bash
[Row(annotatorType='token', begin=0, end=4, result='उत्तर', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=6, end=7, result='विपिन', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=9, end=12, result='राजा', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=14, end=17, result='हो', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=19, end=20, result='विपिन', metadata={'sentence': '0'}, embeddings=[]),
...]
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|lemma|
|Type:|lemmatizer|
|Compatibility:|Spark NLP 2.5.5+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[lemma]|
|Language:|hi|
|Case sensitive:|false|
|License:|Open Source|

{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)

---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from dspg)
author: John Snow Labs
name: distilbert_qa_dspg_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `dspg`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_dspg_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770583896.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_dspg_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770583896.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_dspg_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_dspg_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
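Each row of the `answer` output column carries the predicted span text plus metadata. As a minimal post-processing sketch, here is one way to keep only the highest-scoring answer; the rows and the `"score"` metadata key are assumptions standing in for real pipeline output:

```python
# Sketch: selecting the highest-scoring span from QA output rows.
# These rows are a hypothetical stand-in for the `answer` column;
# the "score" metadata key is an assumption, not confirmed by this card.
rows = [
    {"result": "Clara", "metadata": {"score": "0.98"}},
    {"result": "Berkeley", "metadata": {"score": "0.41"}},
]

def best_answer(answer_rows):
    """Return the answer text with the highest confidence score."""
    return max(answer_rows, key=lambda r: float(r["metadata"]["score"]))["result"]

print(best_answer(rows))  # Clara
```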
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_dspg_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/dspg/distilbert-base-uncased-finetuned-squad

---
layout: model
title: English BertForTokenClassification Cased model (from Shenzy2)
author: John Snow Labs
name: bert_token_classifier_autotrain_tk_1181244086
date: 2022-11-30
tags: [en, open_source, bert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-tk-1181244086` is an English model originally trained by `Shenzy2`.

## Predicted Entities

`design_concept`, `ui_element`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autotrain_tk_1181244086_en_4.2.4_3.0_1669814679080.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autotrain_tk_1181244086_en_4.2.4_3.0_1669814679080.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autotrain_tk_1181244086","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autotrain_tk_1181244086","en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_autotrain_tk_1181244086| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Shenzy2/autotrain-tk-1181244086 --- layout: model title: Part of Speech for Indonesian author: John Snow Labs name: pos_ud_gsd date: 2020-07-29 23:34:00 +0800 task: Part of Speech Tagging language: id edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [pos, id] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_id_2.5.5_2.4_1596054136894.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_id_2.5.5_2.4_1596054136894.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_gsd", "id") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Selain menjadi raja utara, John Snow adalah seorang dokter Inggris dan pemimpin dalam pengembangan anestesi dan kebersihan medis.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_gsd", "id") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Selain menjadi raja utara, John Snow adalah seorang dokter Inggris dan pemimpin dalam pengembangan anestesi dan kebersihan medis.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Selain menjadi raja utara, John Snow adalah seorang dokter Inggris dan pemimpin dalam pengembangan anestesi dan kebersihan medis."""] pos_df = nlu.load('id.pos').predict(text, output_level='token') pos_df ```
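Once the annotations are collected into plain Python rows, a simple tag frequency count is often the first sanity check. This is a minimal sketch; the rows mirror the shape of the annotations shown in this card's Results section:

```python
from collections import Counter

# Sketch: tallying predicted POS tags. The rows below mirror the
# shape of the annotations shown in the Results section of this card.
rows = [
    {"result": "ADP", "metadata": {"word": "Selain"}},
    {"result": "VERB", "metadata": {"word": "menjadi"}},
    {"result": "NOUN", "metadata": {"word": "raja"}},
    {"result": "NOUN", "metadata": {"word": "utara"}},
    {"result": "PUNCT", "metadata": {"word": ","}},
]

tag_counts = Counter(row["result"] for row in rows)
print(tag_counts["NOUN"])  # 2
```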
{:.h2_title}
## Results

```bash
[Row(annotatorType='pos', begin=0, end=5, result='ADP', metadata={'word': 'Selain'}),
Row(annotatorType='pos', begin=7, end=13, result='VERB', metadata={'word': 'menjadi'}),
Row(annotatorType='pos', begin=15, end=18, result='NOUN', metadata={'word': 'raja'}),
Row(annotatorType='pos', begin=20, end=24, result='NOUN', metadata={'word': 'utara'}),
Row(annotatorType='pos', begin=25, end=25, result='PUNCT', metadata={'word': ','}),
...]
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pos_ud_gsd|
|Type:|pos|
|Compatibility:|Spark NLP 2.5.5+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[pos]|
|Language:|id|
|Case sensitive:|false|
|License:|Open Source|

{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)

---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from google)
author: John Snow Labs
name: t5_efficient_base_el4
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-el4` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_el4_en_4.3.0_3.0_1675111271789.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_el4_en_4.3.0_3.0_1675111271789.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_base_el4","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_base_el4","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_base_el4|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|366.4 MB|

## References

- https://huggingface.co/google/t5-efficient-base-el4
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145

---
layout: model
title: German NER for Laws
author: John Snow Labs
name: legner_courts
date: 2022-10-02
tags: [de, legal, ner, laws, court, licensed]
task: Named Entity Recognition
language: de
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model can be used to detect legal entities in German text, predicting up to 19 different labels:

```
| tag | meaning
-----------------
| AN  | Anwalt
| EUN | Europäische Norm
| GS  | Gesetz
| GRT | Gericht
| INN | Institution
| LD  | Land
| LDS | Landschaft
| LIT | Literatur
| MRK | Marke
| ORG | Organisation
| PER | Person
| RR  | Richter
| RS  | Rechtsprechung
| ST  | Stadt
| STR | Straße
| UN  | Unternehmen
| VO  | Verordnung
| VS  | Vorschrift
| VT  | Vertrag
```

German Named Entity Recognition model, trained using a Deep Learning architecture (CharCNN + LSTM) with a Court Decisions (2017-2018) dataset (check `Data Source` section).
You can also find transformer-based versions of this model in our Models Hub (`legner_bert_base_courts` and `legner_bert_large_courts`).

## Predicted Entities

`STR`, `LIT`, `PER`, `EUN`, `VT`, `MRK`, `INN`, `UN`, `RS`, `ORG`, `GS`, `VS`, `LDS`, `GRT`, `VO`, `RR`, `LD`, `AN`, `ST`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_LEGAL_DE/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_courts_de_1.0.0_3.0_1664706079878.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_courts_de_1.0.0_3.0_1664706079878.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
word_embeddings = nlp.WordEmbeddingsModel.pretrained("w2v_cc_300d",'de','clinical/models')\
    .setInputCols(["sentence", 'token'])\
    .setOutputCol("embeddings")\
    .setCaseSensitive(False)

legal_ner = legal.NerModel.pretrained("legner_courts",'de','legal/models') \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
...
legal_pred_pipeline = nlp.Pipeline(stages = [document_assembler, sentence_detector, tokenizer, word_embeddings, legal_ner, ner_converter])

legal_light_model = nlp.LightPipeline(legal_pred_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

result = legal_light_model.fullAnnotate('''Jedoch wird der Verkehr darin naheliegend den Namen eines der bekanntesten Flüsse Deutschlands erkennen, welcher als Seitenfluss des Rheins durch Oberfranken, Unterfranken und Südhessen fließt und bei Mainz in den Rhein mündet. Klein , in : Maunz / Schmidt-Bleibtreu / Klein / Bethge , BVerfGG , § 19 Rn. 9 Richtlinien zur Bewertung des Grundvermögens – BewRGr – vom19. I September 1966 (BStBl I, S.890) ''')
```
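With 19 possible labels, the extracted chunks are often easier to review grouped by entity type. A minimal pure-Python sketch, using a few of the chunk/label pairs from this card's Results table as sample data:

```python
from collections import defaultdict

# Sketch: grouping recognized chunks by entity label. The chunk/label
# pairs are a few of those shown in the Results table of this card.
chunks = [
    ("Deutschlands", "LD"),
    ("Rheins", "LDS"),
    ("Oberfranken", "LDS"),
    ("Mainz", "ST"),
]

by_label = defaultdict(list)
for text, label in chunks:
    by_label[label].append(text)

print(dict(by_label))  # {'LD': ['Deutschlands'], 'LDS': ['Rheins', 'Oberfranken'], 'ST': ['Mainz']}
```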
## Results ```bash +---+---------------------------------------------------+----------+ | # | Chunks | Entities | +---+---------------------------------------------------+----------+ | 0 | Deutschlands | LD | +---+---------------------------------------------------+----------+ | 1 | Rheins | LDS | +---+---------------------------------------------------+----------+ | 2 | Oberfranken | LDS | +---+---------------------------------------------------+----------+ | 3 | Unterfranken | LDS | +---+---------------------------------------------------+----------+ | 4 | Südhessen | LDS | +---+---------------------------------------------------+----------+ | 5 | Mainz | ST | +---+---------------------------------------------------+----------+ | 6 | Rhein | LDS | +---+---------------------------------------------------+----------+ | 7 | Klein , in : Maunz / Schmidt-Bleibtreu / Klein... | LIT | +---+---------------------------------------------------+----------+ | 8 | Richtlinien zur Bewertung des Grundvermögens –... | VS | +---+---------------------------------------------------+----------+ | 9 | I September 1966 (BStBl I, S.890) | VS | +---+---------------------------------------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_courts| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|de| |Size:|15.0 MB| ## References The dataset used to train this model is taken from Leitner, et.al (2019) Leitner, E., Rehm, G., and Moreno-Schneider, J. (2019). Fine-grained Named Entity Recognition in Legal Documents. In Maribel Acosta, et al., editors, Semantic Systems. The Power of AI and Knowledge Graphs. Proceedings of the 15th International Conference (SEMANTiCS2019), number 11702 in Lecture Notes in Computer Science, pages 272–287, Karlsruhe, Germany, 9. Springer. 10/11 September 2019. 
Source of the annotated text: Court decisions from 2017 and 2018 were selected for the dataset, published online by the Federal Ministry of Justice and Consumer Protection. The documents originate from seven federal courts: Federal Labour Court (BAG), Federal Fiscal Court (BFH), Federal Court of Justice (BGH), Federal Patent Court (BPatG), Federal Social Court (BSG), Federal Constitutional Court (BVerfG) and Federal Administrative Court (BVerwG).

## Benchmarking

```bash
label          prec       rec        f1
Macro-average  0.9210195  0.9186192  0.9198177
Micro-average  0.9833763  0.9837547  0.9835655
```

---
layout: model
title: Legal Agricultural Activity Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_agricultural_activity_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, agricultural_activity, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.

Given a document, the `legclf_agricultural_activity_bert` model, a Bert Sentence Embeddings Document Classifier, determines whether the document belongs to the Agricultural_Activity class or not (binary classification) according to EuroVoc labels.
## Predicted Entities `Agricultural_Activity`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_agricultural_activity_bert_en_1.0.0_3.0_1678111704283.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_agricultural_activity_bert_en_1.0.0_3.0_1678111704283.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_agricultural_activity_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
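Since this is a binary classifier, the string labels in the `category` column are usually mapped to booleans for downstream filtering. A minimal sketch, with the labels following this card's Predicted Entities (`Agricultural_Activity`, `Other`):

```python
# Sketch: converting the classifier's string labels into booleans
# for downstream filtering. Labels follow this card's Predicted
# Entities (`Agricultural_Activity`, `Other`).
predictions = ["Agricultural_Activity", "Other", "Other", "Agricultural_Activity"]

flags = [label == "Agricultural_Activity" for label in predictions]
print(flags)  # [True, False, False, True]
```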
## Results

```bash
+-----------------------+
|result                 |
+-----------------------+
|[Agricultural_Activity]|
|[Other]                |
|[Other]                |
|[Agricultural_Activity]|
+-----------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_agricultural_activity_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.0 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
                label  precision  recall  f1-score  support
Agricultural_Activity       0.83    0.88      0.86      804
                Other       0.86    0.79      0.82      694
             accuracy          -       -      0.84     1498
            macro-avg       0.84    0.84      0.84     1498
         weighted-avg       0.84    0.84      0.84     1498
```

---
layout: model
title: Translate Xhosa to English Pipeline
author: John Snow Labs
name: translate_xh_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, xh, en, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.

Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `xh` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_xh_en_xx_2.7.0_2.4_1609690334845.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_xh_en_xx_2.7.0_2.4_1609690334845.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_xh_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_xh_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.xh.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_xh_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English BertForQuestionAnswering model (from aodiniz)
author: John Snow Labs
name: bert_qa_bert_uncased_L_4_H_256_A_4_squad2_covid_qna
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-4_H-256_A-4_squad2_covid-qna` is an English model originally trained by `aodiniz`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_256_A_4_squad2_covid_qna_en_4.0.0_3.0_1654185276053.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_256_A_4_squad2_covid_qna_en_4.0.0_3.0_1654185276053.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_4_H_256_A_4_squad2_covid_qna","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_bert_uncased_L_4_H_256_A_4_squad2_covid_qna","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2_covid.bert.uncased_4l_256d_a4a_256d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_uncased_L_4_H_256_A_4_squad2_covid_qna|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|42.1 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/aodiniz/bert_uncased_L-4_H-256_A-4_squad2_covid-qna

---
layout: model
title: Pipeline to Detect PHI in Text
author: John Snow Labs
name: ner_deid_sd_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, deidentification, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [ner_deid_sd](https://nlp.johnsnowlabs.com/2021/04/01/ner_deid_sd_en.html) model.

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_sd_pipeline_en_3.4.1_3.0_1647869878449.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_sd_pipeline_en_3.4.1_3.0_1647869878449.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_sd_pipeline", "en", "clinical/models") pipeline.annotate("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""") ``` ```scala val pipeline = new PretrainedPipeline("ner_deid_sd_pipeline", "en", "clinical/models") pipeline.annotate("A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.") ``` {:.nlu-block} ```python import nlu nlu.load("en.deid.sd.pipeline").predict("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""") ```
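The pipeline only detects PHI; masking it is a separate step. The sketch below shows one simple masking strategy (replace each detected chunk with its entity label), using chunk/label pairs that mirror part of this pipeline's Results table. It is an illustration only; a production de-identifier should work from the character offsets in the annotations rather than string replacement:

```python
# Sketch: one way to mask detected PHI, replacing each chunk with its
# entity label. The chunk/label pairs mirror part of this pipeline's
# Results table; real de-identification should use offsets, not replace().
text = "Record date : 2093-01-13, David Hale, M.D."
entities = [("2093-01-13", "DATE"), ("David Hale", "NAME")]

masked = text
for chunk, label in entities:
    masked = masked.replace(chunk, f"<{label}>")

print(masked)  # Record date : <DATE>, <NAME>, M.D.
```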
## Results

```bash
+-----------------------------+--------+
|chunks                       |entities|
+-----------------------------+--------+
|2093-01-13                   |DATE    |
|David Hale                   |NAME    |
|Hendrickson, Ora             |NAME    |
|7194334                      |ID      |
|01/13/93                     |DATE    |
|Oliveira                     |NAME    |
|1-11-2000                    |DATE    |
|Cocke County Baptist Hospital|LOCATION|
|0295 Keats Street            |LOCATION|
|(302) 786-5227               |CONTACT |
|Brothers Coal-Mine           |LOCATION|
+-----------------------------+--------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_deid_sd_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter

---
layout: model
title: English BertForQuestionAnswering model (from MichelBartels)
author: John Snow Labs
name: bert_qa_tinybert_6l_768d_squad2_large_teacher_finetuned_step1
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tinybert-6l-768d-squad2-large-teacher-finetuned-step1` is an English model originally trained by `MichelBartels`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_tinybert_6l_768d_squad2_large_teacher_finetuned_step1_en_4.0.0_3.0_1654192496308.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_tinybert_6l_768d_squad2_large_teacher_finetuned_step1_en_4.0.0_3.0_1654192496308.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tinybert_6l_768d_squad2_large_teacher_finetuned_step1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_tinybert_6l_768d_squad2_large_teacher_finetuned_step1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.large_tiny_768d_v2.by_MichelBartels").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_tinybert_6l_768d_squad2_large_teacher_finetuned_step1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|249.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/MichelBartels/tinybert-6l-768d-squad2-large-teacher-finetuned-step1 --- layout: model title: Modern Greek (1453-) asr_xlsr_53_wav2vec_greek TFWav2Vec2ForCTC from harshit345 author: John Snow Labs name: pipeline_asr_xlsr_53_wav2vec_greek date: 2022-09-25 tags: [wav2vec2, el, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: el edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xlsr_53_wav2vec_greek` is a Modern Greek (1453-) model originally trained by harshit345. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_xlsr_53_wav2vec_greek_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xlsr_53_wav2vec_greek_el_4.2.0_3.0_1664109717228.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xlsr_53_wav2vec_greek_el_4.2.0_3.0_1664109717228.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_xlsr_53_wav2vec_greek', lang = 'el') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_xlsr_53_wav2vec_greek", lang = "el") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_xlsr_53_wav2vec_greek| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|el| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Translate English to Hungarian Pipeline author: John Snow Labs name: translate_en_hu date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, hu, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `hu` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_hu_xx_2.7.0_2.4_1609700744893.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_hu_xx_2.7.0_2.4_1609700744893.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_hu", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_hu", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.hu').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_hu| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: French CamemBert Embeddings (from jcai1) author: John Snow Labs name: camembert_embeddings_jcai1_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `jcai1`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_jcai1_generic_model_fr_3.4.4_3.0_1653988792605.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_jcai1_generic_model_fr_3.4.4_3.0_1653988792605.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_jcai1_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_jcai1_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_jcai1_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/jcai1/dummy-model --- layout: model title: Japanese Part of Speech Tagger (from KoichiYasuoka) author: John Snow Labs name: bert_pos_bert_large_japanese_unidic_luw_upos date: 2022-05-09 tags: [bert, pos, part_of_speech, ja, open_source] task: Part of Speech Tagging language: ja edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-japanese-unidic-luw-upos` is a Japanese model originally trained by `KoichiYasuoka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_large_japanese_unidic_luw_upos_ja_3.4.2_3.0_1652092222361.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_large_japanese_unidic_luw_upos_ja_3.4.2_3.0_1652092222361.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_large_japanese_unidic_luw_upos","ja") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Spark NLPが大好きです"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_large_japanese_unidic_luw_upos","ja") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Spark NLPが大好きです").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_large_japanese_unidic_luw_upos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ja| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/KoichiYasuoka/bert-large-japanese-unidic-luw-upos - https://universaldependencies.org/u/pos/ - https://pypi.org/project/fugashi - https://pypi.org/project/unidic-lite - https://pypi.org/project/pytokenizations - http://id.nii.ac.jp/1001/00216223/ - https://github.com/KoichiYasuoka/esupar --- layout: model title: Legal Loan Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_loan_agreement date: 2022-10-24 tags: [en, legal, classification, document, agreement, contract, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_loan_agreement` model is a Legal Longformer Document Classifier that classifies whether a document belongs to the class loan-agreement or not (Binary Classification). Longformers have a limit of 4096 tokens, so only the first 4096 tokens of a document are taken into account. We have found that for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document itself without extra leading text, 4096 tokens are enough to perform Document Classification. If not, let us know and we can take another approach for you: splitting the document into 4096-token chunks, averaging the chunk embeddings, and training on the averaged version, which means the whole document is taken into account. In theory, however, this should not be required.
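The chunk-and-average fallback described above can be sketched as follows. This is only an illustrative outline, not shipped functionality: `embed_chunk` is a hypothetical stand-in for a real Longformer embedding call, and the 4096-token window mirrors the limit mentioned in the description.

```python
import numpy as np

MAX_LEN = 4096  # Longformer input limit

def embed_chunk(tokens):
    # Hypothetical stand-in for a real Longformer embedding call:
    # here each token is mapped to a dummy 3-dimensional vector.
    return np.array([[len(t), 1.0, 0.0] for t in tokens])

def document_embedding(tokens, max_len=MAX_LEN):
    """Split a long token list into max_len-sized chunks, embed each chunk,
    average token vectors within each chunk, then average across chunks."""
    chunks = [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
    chunk_vecs = [embed_chunk(c).mean(axis=0) for c in chunks]
    return np.mean(chunk_vecs, axis=0)

doc = ["token"] * 10000           # a document longer than one 4096-token window
vec = document_embedding(doc)     # 3 chunks averaged into one document vector
print(len(doc), vec.shape)        # 10000 (3,)
```

The averaged vector approximates a whole-document embedding, so no part of a long document is simply dropped as it would be with plain truncation.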
## Predicted Entities `other`, `loan-agreement` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_loan_agreement_en_1.0.0_3.0_1666620991453.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_loan_agreement_en_1.0.0_3.0_1666620991453.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token") \ .setOutputCol("embeddings") sembeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_loan_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, sembeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[loan-agreement]| |[other]| |[other]| |[loan-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_loan_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.7 MB| ## References Legal documents, scraped from the Internet and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support loan-agreement 0.92 0.92 0.92 39 other 0.95 0.95 0.95 62 accuracy - - 0.94 101 macro-avg 0.94 0.94 0.94 101 weighted-avg 0.94 0.94 0.94 101 ``` --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_6 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-1024-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_6_en_4.0.0_3.0_1655730927446.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_6_en_4.0.0_3.0_1655730927446.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_6","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_1024d_seed_6").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|439.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-1024-finetuned-squad-seed-6 --- layout: model title: French CamemBert Embeddings (from ankitkupadhyay) author: John Snow Labs name: camembert_embeddings_ankitkupadhyay_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `ankitkupadhyay`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_ankitkupadhyay_generic_model_fr_3.4.4_3.0_1653987503170.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_ankitkupadhyay_generic_model_fr_3.4.4_3.0_1653987503170.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_ankitkupadhyay_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_ankitkupadhyay_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_ankitkupadhyay_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/ankitkupadhyay/dummy-model --- layout: model title: Pipeline to find clinical events and find temporal relations (ERA) author: John Snow Labs name: explain_clinical_doc_era date: 2021-04-01 tags: [pipeline, en, licensed, clinical] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pipeline with `ner_clinical_events`, `assertion_dl` and `re_temporal_events_clinical`. It will extract clinical entities, assign assertion status and find temporal relationships between clinical entities. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.Pretrained_Clinical_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_era_en_3.0.0_3.0_1617297404938.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_era_en_3.0.0_3.0_1617297404938.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python era_pipeline = PretrainedPipeline('explain_clinical_doc_era', 'en', 'clinical/models') annotations = era_pipeline.fullAnnotate("""She is admitted to The John Hopkins Hospital 2 days ago with a history of gestational diabetes mellitus diagnosed. She denied pain and any headache. She was seen by the endocrinology service and she was discharged on 03/02/2018 on 40 units of insulin glargine, 12 units of insulin lispro, and metformin 1000 mg two times a day. She had close follow-up with endocrinology post discharge. """)[0] ``` ```scala val era_pipeline = new PretrainedPipeline("explain_clinical_doc_era", "en", "clinical/models") val result = era_pipeline.fullAnnotate("""She is admitted to The John Hopkins Hospital 2 days ago with a history of gestational diabetes mellitus diagnosed. She denied pain and any headache. She was seen by the endocrinology service and she was discharged on 03/02/2018 on 40 units of insulin glargine, 12 units of insulin lispro, and metformin 1000 mg two times a day. She had close follow-up with endocrinology post discharge. """)(0) ``` {:.nlu-block} ```python import nlu nlu.load("en.explain_doc.era").predict("""She is admitted to The John Hopkins Hospital 2 days ago with a history of gestational diabetes mellitus diagnosed. She denied pain and any headache. She was seen by the endocrinology service and she was discharged on 03/02/2018 on 40 units of insulin glargine, 12 units of insulin lispro, and metformin 1000 mg two times a day. She had close follow-up with endocrinology post discharge. """) ```
## Results ```bash | | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |---:|:-----------|:--------------|----------------:|--------------:|:--------------------------|:--------------|----------------:|--------------:|:------------------------------|-------------:| | 0 | AFTER | OCCURRENCE | 7 | 14 | admitted | CLINICAL_DEPT | 19 | 43 | The John Hopkins Hospital | 0.963836 | | 1 | BEFORE | OCCURRENCE | 7 | 14 | admitted | DATE | 45 | 54 | 2 days ago | 0.587098 | | 2 | BEFORE | OCCURRENCE | 7 | 14 | admitted | PROBLEM | 74 | 102 | gestational diabetes mellitus | 0.999991 | | 3 | OVERLAP | CLINICAL_DEPT | 19 | 43 | The John Hopkins Hospital | DATE | 45 | 54 | 2 days ago | 0.996056 | | 4 | BEFORE | CLINICAL_DEPT | 19 | 43 | The John Hopkins Hospital | PROBLEM | 74 | 102 | gestational diabetes mellitus | 0.995216 | | 5 | OVERLAP | DATE | 45 | 54 | 2 days ago | PROBLEM | 74 | 102 | gestational diabetes mellitus | 0.996954 | | 6 | BEFORE | EVIDENTIAL | 119 | 124 | denied | PROBLEM | 126 | 129 | pain | 1 | | 7 | BEFORE | EVIDENTIAL | 119 | 124 | denied | PROBLEM | 135 | 146 | any headache | 1 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_clinical_doc_era| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| --- layout: model title: Typed Dependency Parsing pipeline for English author: John Snow Labs name: dependency_parse date: 2022-06-29 tags: [pipeline, dependency_parsing, untyped_dependency_parsing, typed_dependency_parsing, laballed_depdency_parsing, unlaballed_depdency_parsing, en, open_source] task: Dependency Parser language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: Pipeline article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Typed Dependency parser, trained on the CoNLL dataset.
Dependency parsing is the task of extracting a dependency parse of a sentence that represents its grammatical structure and defines the relationships between "head" words and the words that modify those heads. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/dependency_parse_en_4.0.0_3.0_1656456276940.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/dependency_parse_en_4.0.0_3.0_1656456276940.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('dependency_parse', lang = 'en') annotations = pipeline.fullAnnotate("Dependencies represents relationships betweens words in a Sentence")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("dependency_parse", lang = "en") val result = pipeline.fullAnnotate("Dependencies represents relationships betweens words in a Sentence")(0) ``` {:.nlu-block} ```python import nlu nlu.load("dep.typed").predict("Dependencies represents relationships betweens words in a Sentence") ```
## Results ```bash +---------------------------------------------------------------------------------+--------------------------------------------------------+ |result |result | +---------------------------------------------------------------------------------+--------------------------------------------------------+ |[ROOT, Dependencies, represents, words, relationships, Sentence, Sentence, words]|[root, parataxis, nsubj, amod, nsubj, case, nsubj, flat]| +---------------------------------------------------------------------------------+--------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|dependency_parse| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|24.1 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - PerceptronModel - DependencyParserModel - TypedDependencyParserModel --- layout: model title: Legal Agreement Of Purchase And Sale Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_agreement_of_purchase_and_sale_bert date: 2023-01-26 tags: [en, legal, classification, agreement, purchase, sale, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_agreement_of_purchase_and_sale_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `agreement-of-purchase-and-sale` or not (Binary Classification). Compared to the Longformer model, this model is lighter and has faster inference.
## Predicted Entities `agreement-of-purchase-and-sale`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_agreement_of_purchase_and_sale_bert_en_1.0.0_3.0_1674731506904.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_agreement_of_purchase_and_sale_bert_en_1.0.0_3.0_1674731506904.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_agreement_of_purchase_and_sale_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[agreement-of-purchase-and-sale]| |[other]| |[other]| |[agreement-of-purchase-and-sale]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_agreement_of_purchase_and_sale_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Legal documents, scraped from the Internet and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support agreement-of-purchase-and-sale 0.96 0.98 0.97 51 other 0.99 0.98 0.99 116 accuracy - - 0.98 167 macro-avg 0.98 0.98 0.98 167 weighted-avg 0.98 0.98 0.98 167 ``` --- layout: model title: English asr_wav2vec2_base_100h_test TFWav2Vec2ForCTC from saahith author: John Snow Labs name: asr_wav2vec2_base_100h_test date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_test` is an English model originally trained by saahith. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_100h_test_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_100h_test_en_4.2.0_3.0_1664094915243.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_100h_test_en_4.2.0_3.0_1664094915243.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_100h_test", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_100h_test", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
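The snippets above assume an `audioDf` whose `audio_content` column already holds the raw waveform as an array of floats. How you produce that array is up to you; a minimal, Spark-free sketch using only the Python standard library (the file name and the 16 kHz / 16-bit mono format are assumptions — match whatever the model expects):

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit PCM mono WAV file and normalise samples to [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2 and wf.getnchannels() == 1
        frames = wf.readframes(wf.getnframes())
    # Little-endian signed 16-bit samples -> floats in [-1, 1).
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Write a tiny synthetic WAV so the sketch is self-contained.
with wave.open("tone.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(struct.pack("<4h", 0, 16384, 0, -16384))

floats = wav_to_floats("tone.wav")
print(floats)  # [0.0, 0.5, 0.0, -0.5]
```

The resulting list is what you would wrap in a single-column DataFrame, e.g. `spark.createDataFrame([[floats]], ["audio_content"])`, before handing it to `AudioAssembler`.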
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_100h_test| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|227.9 MB| --- layout: model title: English BertForQuestionAnswering model (from mrm8488) author: John Snow Labs name: bert_qa_spanbert_finetuned_squadv1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-finetuned-squadv1` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_finetuned_squadv1_en_4.0.0_3.0_1654191785442.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_finetuned_squadv1_en_4.0.0_3.0_1654191785442.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_finetuned_squadv1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_finetuned_squadv1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
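Under the hood, extractive question answering scores every token of the context as a candidate answer start and end, and the `answer` column carries the best-scoring span. A toy sketch of that span selection (the scores below are invented for the "Clara" example; the real model derives them from BERT logits):

```python
# Illustrative only: pick the span (i, j), i <= j, maximising
# start_score[i] + end_score[j]. Scores are made up for the demo.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.2, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end_scores   = [0.1, 0.1, 0.1, 0.8, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]

best = max(
    ((i, j) for i in range(len(tokens)) for j in range(i, len(tokens))),
    key=lambda span: start_scores[span[0]] + end_scores[span[1]],
)
answer = " ".join(tokens[best[0]:best[1] + 1])
print(answer)  # Clara
```

Real implementations additionally cap the span length and mask out spans that fall inside the question; this sketch keeps only the core argmax-over-spans idea.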
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_finetuned_squadv1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|402.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/spanbert-finetuned-squadv1 - https://arxiv.org/abs/1907.10529 - https://twitter.com/mrm8488 - https://github.com/facebookresearch - https://github.com/facebookresearch/SpanBERT - https://github.com/facebookresearch/SpanBERT#pre-trained-models - https://rajpurkar.github.io/SQuAD-explorer/ - https://www.linkedin.com/in/manuel-romero-cs/ --- layout: model title: Fast Neural Machine Translation Model from English to Afrikaans author: John Snow Labs name: opus_mt_en_af date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, af, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list).
- source languages: `en` - target languages: `af` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_af_xx_2.7.0_2.4_1609164103105.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_af_xx_2.7.0_2.4_1609164103105.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_af", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Your sentence to translate!"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_af", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.af').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_af| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Sourabh714) author: John Snow Labs name: distilbert_qa_sourabh714_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Sourabh714`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_sourabh714_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769323017.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_sourabh714_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769323017.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sourabh714_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sourabh714_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_sourabh714_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Sourabh714/distilbert-base-uncased-finetuned-squad --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from atoivat) author: John Snow Labs name: distilbert_qa_atoivat_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `atoivat`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_atoivat_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770083162.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_atoivat_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770083162.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_atoivat_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_atoivat_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_atoivat_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/atoivat/distilbert-base-uncased-finetuned-squad --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from msms) author: John Snow Labs name: distilbert_qa_msms_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `msms`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_msms_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772240858.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_msms_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772240858.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_msms_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_msms_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_msms_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/msms/distilbert-base-uncased-finetuned-squad --- layout: model title: Abkhazian asr_xls_r_ab_test_by_baaastien TFWav2Vec2ForCTC from baaastien author: John Snow Labs name: pipeline_asr_xls_r_ab_test_by_baaastien date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_ab_test_by_baaastien` is an Abkhazian model originally trained by baaastien. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_xls_r_ab_test_by_baaastien_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_ab_test_by_baaastien_ab_4.2.0_3.0_1664020833113.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_ab_test_by_baaastien_ab_4.2.0_3.0_1664020833113.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_xls_r_ab_test_by_baaastien', lang = 'ab') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_xls_r_ab_test_by_baaastien", lang = "ab") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_xls_r_ab_test_by_baaastien| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|ab| |Size:|448.1 KB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Recognize Entities DL Pipeline for Dutch - Small author: John Snow Labs name: entity_recognizer_sm date: 2021-03-22 tags: [open_source, dutch, entity_recognizer_sm, pipeline, nl] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: nl edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_sm is a pretrained pipeline that performs the basic text processing steps and covers most of the common text processing tasks on your dataframe. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_nl_3.0.0_3.0_1616442569062.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_nl_3.0.0_3.0_1616442569062.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('entity_recognizer_sm', lang = 'nl') annotations = pipeline.fullAnnotate("Hallo van John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("entity_recognizer_sm", lang = "nl") val result = pipeline.fullAnnotate("Hallo van John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hallo van John Snow Labs! "] result_df = nlu.load('nl.ner').predict(text) result_df ```
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:-------------------------------|:------------------------------|:------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Hallo van John Snow Labs! '] | ['Hallo van John Snow Labs!'] | ['Hallo', 'van', 'John', 'Snow', 'Labs!'] | [[0.3653799891471863,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_sm| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|nl| --- layout: model title: Translate Ruund to English Pipeline author: John Snow Labs name: translate_rnd_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, rnd, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `rnd` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_rnd_en_xx_2.7.0_2.4_1609689061627.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_rnd_en_xx_2.7.0_2.4_1609689061627.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_rnd_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_rnd_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.rnd.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_rnd_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_dm256 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-dm256` is a English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dm256_en_4.3.0_3.0_1675119242272.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dm256_en_4.3.0_3.0_1675119242272.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_dm256","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_dm256","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_dm256| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|74.2 MB| ## References - https://huggingface.co/google/t5-efficient-small-dm256 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: NER Model for German author: John Snow Labs name: xlm_roberta_large_token_classifier_conll03 date: 2021-12-25 tags: [german, xlm, roberta, ner, de, open_source] task: Named Entity Recognition language: de edition: Spark NLP 3.3.4 spark_version: 2.4 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model was imported from `Hugging Face` and it's been fine-tuned for the German language, leveraging `XLM-RoBERTa` embeddings and `XlmRobertaForTokenClassification` for NER purposes. ## Predicted Entities `PER`, `ORG`, `LOC`, `MISC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_conll03_de_3.3.4_2.4_1640443988652.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_conll03_de_3.3.4_2.4_1640443988652.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlm_roberta_large_token_classifier_conll03", "de")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = """Ibser begann seine Karriere beim ASK Ebreichsdorf. 2004 wechselte er zu Admira Wacker Mödling, wo er auch in der Akademie spielte.""" result = model.transform(spark.createDataFrame([[text]]).toDF("text")) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlm_roberta_large_token_classifier_conll03", "de") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val example = Seq("Ibser begann seine Karriere beim ASK Ebreichsdorf. 2004 wechselte er zu Admira Wacker Mödling, wo er auch in der Akademie spielte.").toDS.toDF("text") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("de.ner.xlm").predict("""Ibser begann seine Karriere beim ASK Ebreichsdorf. 2004 wechselte er zu Admira Wacker Mödling, wo er auch in der Akademie spielte.""") ```
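The `NerConverter` stage at the end of the pipeline merges token-level IOB tags into entity chunks. A minimal sketch of that merge, using tags that mirror the first sentence of the example (illustrative only; the real annotator also carries character offsets and metadata):

```python
# Merge B-/I- tagged tokens into (chunk, label) pairs, as NerConverter does.
tokens = ["Ibser", "begann", "seine", "Karriere", "beim", "ASK", "Ebreichsdorf", "."]
tags   = ["B-PER", "O", "O", "O", "O", "B-ORG", "I-ORG", "O"]

chunks, current = [], None
for tok, tag in zip(tokens, tags):
    if tag.startswith("B-"):          # new entity starts
        if current:
            chunks.append(current)
        current = (tok, tag[2:])
    elif tag.startswith("I-") and current:  # entity continues
        current = (current[0] + " " + tok, current[1])
    else:                             # outside any entity
        if current:
            chunks.append(current)
        current = None
if current:
    chunks.append(current)

print(chunks)  # [('Ibser', 'PER'), ('ASK Ebreichsdorf', 'ORG')]
```

These (chunk, label) pairs are exactly what the `ner_chunk` column and the results table below report.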
## Results ```bash +----------------------+---------+ |chunk |ner_label| +----------------------+---------+ |Ibser |PER | |ASK Ebreichsdorf |ORG | |Admira Wacker Mödling |ORG | +----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_large_token_classifier_conll03| |Compatibility:|Spark NLP 3.3.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|de| |Size:|1.8 GB| |Case sensitive:|true| |Max sentence length:|256| ## Data Source [https://huggingface.co/xlm-roberta-large-finetuned-conll03-german](https://huggingface.co/xlm-roberta-large-finetuned-conll03-german) --- layout: model title: Legal Standstill Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_standstill_agreement_bert date: 2023-01-29 tags: [en, legal, classification, standstill, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_standstill_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `standstill-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `standstill-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_standstill_agreement_bert_en_1.0.0_3.0_1674990206494.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_standstill_agreement_bert_en_1.0.0_3.0_1674990206494.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_standstill_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +----------------------+ |result| +----------------------+ |[standstill-agreement]| |[other]| |[other]| |[standstill-agreement]| +----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_standstill_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.91 0.97 0.94 98 standstill-agreement 0.93 0.82 0.87 51 accuracy - - 0.92 149 macro-avg 0.92 0.90 0.91 149 weighted-avg 0.92 0.92 0.92 149 ``` --- layout: model title: Ganda asr_wav2vec2_xlsr_multilingual_56 TFWav2Vec2ForCTC from voidful author: John Snow Labs name: pipeline_asr_wav2vec2_xlsr_multilingual_56 date: 2022-09-24 tags: [wav2vec2, lg, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: lg edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_multilingual_56` is a Ganda model originally trained by voidful. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xlsr_multilingual_56_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_multilingual_56_lg_4.2.0_3.0_1664035886266.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_multilingual_56_lg_4.2.0_3.0_1664035886266.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xlsr_multilingual_56', lang = 'lg') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xlsr_multilingual_56", lang = "lg") val annotations = pipeline.transform(audioDF) ```
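The Wav2Vec2ForCTC stage inside this pipeline performs CTC decoding internally. Conceptually, greedy CTC decoding collapses repeated frame-level predictions and drops the blank symbol; the sketch below illustrates that rule with a toy label sequence (the `_` blank and the characters are illustrative, not the model's actual vocabulary):

```python
from itertools import groupby

BLANK = "_"  # illustrative blank symbol; real models use a dedicated blank id

def ctc_greedy_decode(frame_labels):
    """Collapse consecutive repeats, then drop blanks -- the greedy CTC rule."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in collapsed if label != BLANK)

print(ctc_greedy_decode(list("hh_e_ll_llo_")))  # -> "hello"
```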
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xlsr_multilingual_56| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|lg| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English DistilBertForQuestionAnswering model (from tabo) author: John Snow Labs name: distilbert_qa_checkpoint_500_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `checkpoint-500-finetuned-squad` is an English model originally trained by `tabo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_checkpoint_500_finetuned_squad_en_4.0.0_3.0_1654723402338.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_checkpoint_500_finetuned_squad_en_4.0.0_3.0_1654723402338.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_checkpoint_500_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_checkpoint_500_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
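Extractive QA models such as this one score each token as a potential answer start or end, and the predicted answer is the highest-scoring valid span. The toy sketch below illustrates that span-selection step with made-up logits (an illustration of the idea, not the model's real internals):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick the (start, end) pair maximizing start+end score with start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]  # made-up logits
end   = [0.1, 0.1, 0.2, 4.0, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # -> "Clara"
```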
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_checkpoint_500_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/tabo/checkpoint-500-finetuned-squad --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from davidenam) author: John Snow Labs name: xlmroberta_ner_davinam_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `davidenam`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_davinam_base_finetuned_panx_de_4.1.0_3.0_1660432159326.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_davinam_base_finetuned_panx_de_4.1.0_3.0_1660432159326.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_davinam_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_davinam_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
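The NerConverter stage merges token-level IOB tags into entity chunks such as PER, LOC, and ORG. A minimal pure-Python sketch of that merging rule (illustrative only; the real annotator also tracks character offsets):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB tags (B-X starts a chunk, I-X continues it) into (text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(iob_to_chunks(
    ["Angela", "Merkel", "besuchte", "Berlin"],
    ["B-PER", "I-PER", "O", "B-LOC"],
))  # -> [('Angela Merkel', 'PER'), ('Berlin', 'LOC')]
```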
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_davinam_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/davidenam/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: English image_classifier_vit_modeversion1_m7_e4 ViTForImageClassification from sudo-s author: John Snow Labs name: image_classifier_vit_modeversion1_m7_e4 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_modeversion1_m7_e4` is an English model originally trained by sudo-s.
## Predicted Entities `45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_modeversion1_m7_e4_en_4.1.0_3.0_1660171370244.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_modeversion1_m7_e4_en_4.1.0_3.0_1660171370244.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_modeversion1_m7_e4", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_modeversion1_m7_e4", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
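ViTForImageClassification produces one score per class (for this model, the numeric ids listed under Predicted Entities); the predicted label is the argmax after softmax. An illustrative sketch with dummy logits and a small subset of the label ids:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

labels = ["0", "1", "2", "3"]        # illustrative subset of the 149 class ids
logits = [0.5, 3.2, -1.0, 0.8]       # dummy scores, not real model output
probs = softmax(logits)
print(labels[probs.index(max(probs))])  # -> "1"
```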
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_modeversion1_m7_e4| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.3 MB| --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s368 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s368 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s368` is a German model originally trained by jonatasgrosman. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s368_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s368_de_4.2.0_3.0_1664117272348.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s368_de_4.2.0_3.0_1664117272348.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s368", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s368", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
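The `audio_content` column consumed by AudioAssembler holds the raw waveform as an array of floats, conventionally 16 kHz mono normalized to [-1.0, 1.0). A small sketch of converting 16-bit PCM samples to that representation (dividing by 32768 is the usual convention, not a Spark NLP API detail):

```python
import struct

def pcm16_to_floats(pcm_bytes: bytes) -> list:
    """Decode little-endian 16-bit PCM and normalize to [-1.0, 1.0)."""
    n = len(pcm_bytes) // 2
    samples = struct.unpack(f"<{n}h", pcm_bytes)
    return [s / 32768.0 for s in samples]

raw = struct.pack("<3h", 0, 16384, -32768)  # three toy samples
print(pcm16_to_floats(raw))  # -> [0.0, 0.5, -1.0]
```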
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s368| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: English RobertaForQuestionAnswering (from avioo1) author: John Snow Labs name: roberta_qa_avioo1_roberta_base_squad2_finetuned_squad date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-finetuned-squad` is an English model originally trained by `avioo1`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_avioo1_roberta_base_squad2_finetuned_squad_en_4.0.0_3.0_1655735375943.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_avioo1_roberta_base_squad2_finetuned_squad_en_4.0.0_3.0_1655735375943.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_avioo1_roberta_base_squad2_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_avioo1_roberta_base_squad2_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base.by_avioo1").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_avioo1_roberta_base_squad2_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/avioo1/roberta-base-squad2-finetuned-squad --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_10 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-32-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_10_en_4.0.0_3.0_1655732483088.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_10_en_4.0.0_3.0_1655732483088.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_10","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_32d_seed_10").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|417.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-32-finetuned-squad-seed-10 --- layout: model title: Recognize Entities DL Pipeline for Polish - Medium author: John Snow Labs name: entity_recognizer_md date: 2021-03-22 tags: [open_source, polish, entity_recognizer_md, pipeline, pl] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: pl edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_md is a pretrained pipeline that processes text with a simple sequence of basic annotation steps, covering most of the common text processing tasks on your dataframe. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_pl_3.0.0_3.0_1616450153520.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_pl_3.0.0_3.0_1616450153520.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('entity_recognizer_md', lang = 'pl') annotations = pipeline.fullAnnotate("Witaj z John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("entity_recognizer_md", lang = "pl") val result = pipeline.fullAnnotate("Witaj z John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Witaj z John Snow Labs! "] result_df = nlu.load('pl.ner.md').predict(text) result_df ```
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:-----------------------------|:----------------------------|:----------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Witaj z John Snow Labs! '] | ['Witaj z John Snow Labs!'] | ['Witaj', 'z', 'John', 'Snow', 'Labs!'] | [[0.0, 0.0, 0.0, 0.0,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|pl| --- layout: model title: Chinese BertForTokenClassification Base Cased model (from ckiplab) author: John Snow Labs name: bert_token_classifier_base_chinese_ws date: 2022-11-30 tags: [zh, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-chinese-ws` is a Chinese model originally trained by `ckiplab`. ## Predicted Entities `B`, `I` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_base_chinese_ws_zh_4.2.4_3.0_1669814825671.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_base_chinese_ws_zh_4.2.4_3.0_1669814825671.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_base_chinese_ws","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_base_chinese_ws","zh") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
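This word-segmentation model tags each Chinese character `B` (begins a word) or `I` (inside a word); grouping characters by those tags yields the segmented text. A pure-Python sketch of that post-processing step:

```python
def tags_to_words(chars, tags):
    """Group characters into words: 'B' opens a new word, 'I' extends the current one."""
    words = []
    for char, tag in zip(chars, tags):
        if tag == "B" or not words:
            words.append(char)
        else:
            words[-1] += char
    return words

print(tags_to_words(list("我喜欢自然语言处理"),
                    ["B", "B", "I", "B", "I", "I", "I", "B", "I"]))
# -> ['我', '喜欢', '自然语言', '处理']
```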
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_base_chinese_ws| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|zh| |Size:|381.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/ckiplab/bert-base-chinese-ws - https://github.com/ckiplab/ckip-transformers - https://muyang.pro - https://ckip.iis.sinica.edu.tw --- layout: model title: English BertForMaskedLM Cased model (from antoinelouis) author: John Snow Labs name: bert_embeddings_netbert date: 2022-12-02 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true recommended: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `netbert` is an English model originally trained by `antoinelouis`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_netbert_en_4.2.4_3.0_1670022703984.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_netbert_en_4.2.4_3.0_1670022703984.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_netbert","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_netbert","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
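A common downstream use of the `embeddings` column is comparing token or sentence vectors by cosine similarity. A dependency-free sketch (the toy 2- and 3-dimensional vectors stand in for real BERT-base outputs; zero vectors are not handled):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(round(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 4))  # parallel -> 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 4))            # orthogonal -> 0.0
```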
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_netbert| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|405.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/antoinelouis/netbert - https://github.com/antoiloui/netbert/blob/master/docs/index.md --- layout: model title: Multilingual BertForQuestionAnswering model (from alon-albalak) author: John Snow Labs name: bert_qa_bert_base_multilingual_xquad date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-xquad` is a Multilingual model originally trained by `alon-albalak`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_xquad_xx_4.0.0_3.0_1654180335035.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_xquad_xx_4.0.0_3.0_1654180335035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_multilingual_xquad","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_multilingual_xquad","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("xx.answer_question.xquad.bert.multilingual_base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_multilingual_xquad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|626.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/alon-albalak/bert-base-multilingual-xquad - https://github.com/deepmind/xquad --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from tusbaki) author: John Snow Labs name: distilbert_qa_tusbaki_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `tusbaki`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_tusbaki_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772936018.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_tusbaki_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772936018.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_tusbaki_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_tusbaki_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_tusbaki_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/tusbaki/distilbert-base-uncased-finetuned-squad --- layout: model title: English XlmRoBertaForQuestionAnswering (from Srini99) author: John Snow Labs name: xlm_roberta_qa_TQA date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `TQA` is an English model originally trained by `Srini99`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_TQA_en_4.0.0_3.0_1655983669052.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_TQA_en_4.0.0_3.0_1655983669052.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_TQA","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_TQA","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.xlm_roberta.by_Srini99").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_TQA| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|2.1 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Srini99/TQA --- layout: model title: POS Tagger Clinical author: John Snow Labs name: pos_clinical class: PerceptronModel language: en nav_key: models repository: clinical/models date: 2019-04-30 task: Part of Speech Tagging edition: Healthcare NLP 2.0.2 spark_version: 2.4 tags: [clinical, licensed, pos,en] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Sets a Part-Of-Speech tag to each word within a sentence. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/pos_clinical_en_2.0.2_2.4_1556660550177.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/pos_clinical_en_2.0.2_2.4_1556660550177.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_clinical","en","clinical/models")\ .setInputCols(["token","sentence"])\ .setOutputCol("pos") pos_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(pos_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("""He was given boluses of MS04 with some effect, he has since been placed on a PCA - he take 80mg of oxycontin at home, his PCA dose is ~ 2 the morphine dose of the oxycontin, he has also received ativan for anxiety.""") ``` ```scala val pos = PerceptronModel.pretrained("pos_clinical","en","clinical/models") .setInputCols("token","sentence") .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("He was given boluses of MS04 with some effect, he has since been placed on a PCA - he take 80mg of oxycontin at home, his PCA dose is ~ 2 the morphine dose of the oxycontin, he has also received ativan for anxiety.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.pos.clinical").predict("""He was given boluses of MS04 with some effect, he has since been placed on a PCA - he take 80mg of oxycontin at home, his PCA dose is ~ 2 the morphine dose of the oxycontin, he has also received ativan for anxiety.""") ```
{:.h2_title} ## Results ```bash [Annotation(pos, 0, 1, NN, {'word': 'He'}), Annotation(pos, 3, 5, VBD, {'word': 'was'}), Annotation(pos, 7, 11, VVN, {'word': 'given'}), Annotation(pos, 13, 19, NNS, {'word': 'boluses'}), Annotation(pos, 21, 22, II, {'word': 'of'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---------------|---------------------| | Name: | pos_clinical | | Type: | PerceptronModel | | Compatibility: | Spark NLP 2.0.2+ | | License: | Licensed | | Edition: | Official | |Input labels: | [token, sentence] | |Output labels: | [pos] | | Language: | en | | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained with MedPost dataset. --- layout: model title: Detect problem, test, treatment in medical text (biobert) author: John Snow Labs name: ner_clinical_biobert date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect problem, test, treatment in medical text using pretrained NER model. ## Predicted Entities `PROBLEM`, `TREATMENT`, `TEST` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_biobert_en_3.0.0_3.0_1617260812919.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_biobert_en_3.0.0_3.0_1617260812919.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical_biobert", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("entities") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature."]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_clinical_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("entities") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val text = """The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""" val data = Seq(text).toDS.toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical.biobert").predict("""Put your text here.""") ```
## Results ```bash +---------------------------------------------------+---------+ |chunk |ner_label| +---------------------------------------------------+---------+ |congestion |PROBLEM | |some mild problems with his breathing while feeding|PROBLEM | |any perioral cyanosis |PROBLEM | |retractions |PROBLEM | |a tactile temperature |PROBLEM | |Tylenol |TREATMENT| |some decreased p.o |PROBLEM | |His normal breast-feeding |TEST | |his respiratory congestion |PROBLEM | |more tired |PROBLEM | |fussy |PROBLEM | |albuterol treatments |TREATMENT| |His urine output |TEST | |any diarrhea |PROBLEM | +---------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical_biobert| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| --- layout: model title: Smaller BERT Sentence Embeddings (L-12_H-768_A-12) author: John Snow Labs name: sent_small_bert_L12_768 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L12_768_en_2.6.0_2.4_1598351662548.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L12_768_en_2.6.0_2.4_1598351662548.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L12_768", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L12_768", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L12_768').predict(text, output_level='sentence') embeddings_df ```
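Sentence embeddings like those produced above are usually compared with cosine similarity. A minimal pure-Python sketch (the short vectors here are illustrative stand-ins, not real 768-dimensional model output):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two equal-length vectors:
    # dot(a, b) / (|a| * |b|).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dim vectors standing in for 768-dim BERT sentence embeddings.
v1 = [0.12, -0.28, 0.16, 0.05]
v2 = [-0.29, -0.14, 0.02, 0.11]
print(round(cosine_similarity(v1, v2), 4))
```

In practice you would read the vectors out of the `sentence_embeddings` annotation column (or the NLU dataframe) instead of hard-coding them.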
{:.h2_title} ## Results ```bash sentence en_embed_sentence_small_bert_L12_768_embeddings 0 I hate cancer [0.128899946808815, -0.2827132046222687, 0.165... 1 Antibiotics aren't painkiller [-0.2957196533679962, -0.1430252194404602, 0.0... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L12_768| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|en| |Dimension:|768| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-768_A-12/1 --- layout: model title: Google's Tapas Table Understanding (Base, WIKISQL) author: John Snow Labs name: table_qa_tapas_base_finetuned_wikisql_supervised date: 2022-09-30 tags: [en, table, qa, question, answering, open_source] task: Table Question Answering language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: TapasForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Zero-shot Table Understanding Model which allows you to carry out Question Answering on Spark DataFrames. If your table is stored in a file format such as CSV, load it into Spark first. Size of this model: Base. Has aggregation operations?: True ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_base_finetuned_wikisql_supervised_en_4.2.0_3.0_1664530699686.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_base_finetuned_wikisql_supervised_en_4.2.0_3.0_1664530699686.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python json_data = """ { "header": ["name", "money", "age"], "rows": [ ["Donald Trump", "$100,000,000", "75"], ["Elon Musk", "$20,000,000,000,000", "55"] ] } """ queries = [ "Who earns less than 200,000,000?", "Who earns 100,000,000?", "How much money has Donald Trump?", "How old are they?", ] data = spark.createDataFrame([ [json_data, " ".join(queries)] ]).toDF("table_json", "questions") document_assembler = MultiDocumentAssembler() \ .setInputCols("table_json", "questions") \ .setOutputCols("document_table", "document_questions") sentence_detector = SentenceDetector() \ .setInputCols(["document_questions"]) \ .setOutputCol("questions") table_assembler = TableAssembler()\ .setInputCols(["document_table"])\ .setOutputCol("table") tapas = TapasForQuestionAnswering\ .pretrained("table_qa_tapas_base_finetuned_wikisql_supervised","en")\ .setInputCols(["questions", "table"])\ .setOutputCol("answers") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, table_assembler, tapas ]) model = pipeline.fit(data) model\ .transform(data)\ .selectExpr("explode(answers) AS answer")\ .select("answer")\ .show(truncate=False) ```
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |answer | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} | |{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|table_qa_tapas_base_finetuned_wikisql_supervised| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|413.9 MB| |Case sensitive:|false| ## References https://www.microsoft.com/en-us/download/details.aspx?id=54253 https://github.com/ppasupat/WikiTableQuestions https://github.com/salesforce/WikiSQL --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_512_finetuned_squad_seed_4 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: 
RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-512-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_512_finetuned_squad_seed_4_en_4.3.0_3.0_1674215570077.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_512_finetuned_squad_seed_4_en_4.3.0_3.0_1674215570077.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_512_finetuned_squad_seed_4","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_512_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_512_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|432.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-512-finetuned-squad-seed-4 --- layout: model title: Pipeline to Detect Mentions of Tumors in Text author: John Snow Labs name: nerdl_tumour_demo_pipeline date: 2023-03-14 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [nerdl_tumour_demo](https://nlp.johnsnowlabs.com/2021/04/01/nerdl_tumour_demo_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/nerdl_tumour_demo_pipeline_en_4.3.0_3.2_1678837464087.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/nerdl_tumour_demo_pipeline_en_4.3.0_3.2_1678837464087.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("nerdl_tumour_demo_pipeline", "en", "clinical/models") text = '''The final diagnosis was metastatic breast carcinoma, and it was classified as T2N1M1 stage IV. The histological grade of this 4 cm tumor was grade 2.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("nerdl_tumour_demo_pipeline", "en", "clinical/models") val text = "The final diagnosis was metastatic breast carcinoma, and it was classified as T2N1M1 stage IV. The histological grade of this 4 cm tumor was grade 2." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-----------------|--------:|------:|:-------------|:-------------| | 0 | breast carcinoma | 35 | 50 | Localization | | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|nerdl_tumour_demo_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Detect clinical entities (ner_healthcare_slim) author: John Snow Labs name: ner_healthcare_slim date: 2021-04-01 tags: [ner, clinical, licensed, de] task: Named Entity Recognition language: de edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect clinical entities in German text using pretrained NER model ## Predicted Entities `TREATMENT`, `PERSON`, `BODY_PART`, `TIME_INFORMATION`, `MEDICAL_CONDITION` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_HEALTHCARE_DE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_slim_de_3.0.0_3.0_1617260856273.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_slim_de_3.0.0_3.0_1617260856273.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_healthcare_slim", "de", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("entities") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_healthcare_slim", "de", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("entities") val pipeline = new Pipeline().setStages(Array(document_assembler, 
sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("EXAMPLE_TEXT").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.med_ner").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_healthcare_slim| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|de| --- layout: model title: Legal Health and Safety Clause Binary Classifier author: John Snow Labs name: legclf_health_and_safety_clause date: 2022-12-07 tags: [en, legal, classification, clauses, health_and_safety, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `health-and-safety` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. 
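Paragraph splitting by multiline, mentioned above as a pre-processing step, can be approximated with a few lines of plain Python before the text ever reaches Spark (a sketch, not the workshop's exact implementation; the sample document is illustrative):

```python
import re

def split_paragraphs(text):
    # Split on runs of blank lines and drop empty fragments, so each
    # clause candidate reaches the classifier with enough context.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = (
    "HEALTH AND SAFETY.\nThe Tenant shall comply with all applicable "
    "health and safety regulations.\n\n"
    "GOVERNING LAW.\nThis Agreement shall be governed by the laws of "
    "the State of New York.\n"
)
for clause in split_paragraphs(doc):
    print(clause.splitlines()[0])
```

Each returned paragraph can then be placed in the `text` column of the DataFrame fed to the classification pipeline.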
## Predicted Entities `health-and-safety`, `other` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_health_and_safety_clause_en_1.0.0_3.0_1670445028648.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_health_and_safety_clause_en_1.0.0_3.0_1670445028648.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_health_and_safety_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
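When this classifier is combined with other clause classifiers, as the Description suggests, each model contributes one label per document; collapsing those labels into per-clause booleans is a small post-processing step. A minimal sketch in plain Python — the label scheme mirrors this model's `health-and-safety`/`other` output, while the second clause name is hypothetical:

```python
def clause_flags(predictions: dict) -> dict:
    """Turn each clause classifier's predicted label into True (clause
    detected) or False (the model predicted 'other')."""
    return {clause: label != "other" for clause, label in predictions.items()}

flags = clause_flags({
    "health-and-safety": "health-and-safety",  # this model fired
    "confidentiality": "other",                # hypothetical second classifier
})
# flags == {"health-and-safety": True, "confidentiality": False}
```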
## Results

```bash
+-------------------+
|result             |
+-------------------+
|[health-and-safety]|
|[other]            |
|[other]            |
|[health-and-safety]|
+-------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_health_and_safety_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.9 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
            label  precision  recall  f1-score  support
health-and-safety       1.00    0.85      0.92       27
            other       0.95    1.00      0.97       73
         accuracy          -       -      0.96      100
        macro-avg       0.97    0.93      0.95      100
     weighted-avg       0.96    0.96      0.96      100
```

---
layout: model
title: English AlbertForQuestionAnswering model (from mfeb)
author: John Snow Labs
name: albert_qa_xxlarge_v2_squad2
date: 2022-06-24
tags: [en, open_source, albert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: AlBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `albert-xxlarge-v2-squad2` is an English model originally trained by `mfeb`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_xxlarge_v2_squad2_en_4.0.0_3.0_1656064012371.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_xxlarge_v2_squad2_en_4.0.0_3.0_1656064012371.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_xxlarge_v2_squad2","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_xxlarge_v2_squad2","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.albert.xxl_v2").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|albert_qa_xxlarge_v2_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|771.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/mfeb/albert-xxlarge-v2-squad2

---
layout: model
title: Detect Posology concepts (clinical_medium)
author: John Snow Labs
name: ner_posology_emb_clinical_medium
date: 2023-04-12
tags: [ner, licensed, english, clinical, posology, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model detects Drug, Dosage, and administration instructions in text using a pretrained NER model.

## Predicted Entities

`DRUG`, `STRENGTH`, `FREQUENCY`, `DURATION`, `DOSAGE`, `ROUTE`, `FORM`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_POSOLOGY.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_emb_clinical_medium_en_4.3.2_3.0_1681315841950.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_emb_clinical_medium_en_4.3.2_3.0_1681315841950.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") posology_ner = MedicalNerModel.pretrained('ner_posology_emb_clinical_medium' , "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("posology_ner") posology_ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "posology_ner"]) \ .setOutputCol("posology_ner_chunk") posology_ner_pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, word_embeddings, posology_ner, posology_ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") posology_ner_model = posology_ner_pipeline.fit(empty_data) results = posology_ner_model.transform(spark.createDataFrame([["The patient has been advised Aspirin 81 milligrams QDay. insulin 50 units in a.m. HCTZ 50 mg QDay. 
Nitroglycerin 1/150 sublingually."]]).toDF("text"))
```
```scala
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val posology_ner_model = MedicalNerModel.pretrained("ner_posology_emb_clinical_medium", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("posology_ner")

val posology_ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "posology_ner"))
  .setOutputCol("posology_ner_chunk")

val posology_pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, posology_ner_model, posology_ner_converter))

val data = Seq("""The patient has been advised Aspirin 81 milligrams QDay. insulin 50 units in a.m. HCTZ 50 mg QDay. Nitroglycerin 1/150 sublingually.""").toDS.toDF("text")

val result = posology_pipeline.fit(data).transform(data)
```
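Each entry of the `posology_ner_chunk` column is an annotation whose `result`, `begin`, `end`, and `metadata['entity']` fields map directly onto the table shown under Results. A minimal sketch of that flattening over dict-shaped annotations (the real objects are Spark NLP Annotation rows with the same fields):

```python
def chunks_to_rows(annotations: list) -> list:
    """Flatten chunk annotations into (chunk, begin, end, entity) tuples."""
    return [
        (a["result"], a["begin"], a["end"], a["metadata"]["entity"])
        for a in annotations
    ]

rows = chunks_to_rows([
    {"result": "Aspirin", "begin": 29, "end": 35, "metadata": {"entity": "DRUG"}},
    {"result": "81 milligrams", "begin": 37, "end": 49, "metadata": {"entity": "STRENGTH"}},
])
# rows[0] == ("Aspirin", 29, 35, "DRUG")
```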
## Results

```bash
|    | chunks        |   begin |   end | entities   |
|---:|:--------------|--------:|------:|:-----------|
|  0 | Aspirin       |      29 |    35 | DRUG       |
|  1 | 81 milligrams |      37 |    49 | STRENGTH   |
|  2 | QDay          |      51 |    54 | FREQUENCY  |
|  3 | insulin       |      57 |    63 | DRUG       |
|  4 | 50 units      |      65 |    72 | STRENGTH   |
|  5 | HCTZ          |      82 |    85 | DRUG       |
|  6 | 50 mg         |      87 |    91 | STRENGTH   |
|  7 | QDay          |      93 |    96 | FREQUENCY  |
|  8 | Nitroglycerin |      99 |   111 | DRUG       |
|  9 | 1/150         |     113 |   117 | STRENGTH   |
| 10 | sublingually  |     119 |   130 | ROUTE      |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_posology_emb_clinical_medium|
|Compatibility:|Healthcare NLP 4.3.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|2.8 MB|

## Benchmarking

```bash
       label  precision  recall  f1-score  support
        DRUG       0.91    0.91      0.91     2252
    STRENGTH       0.88    0.93      0.91     2290
   FREQUENCY       0.90    0.94      0.92     1782
    DURATION       0.78    0.84      0.81      463
      DOSAGE       0.66    0.63      0.65      476
       ROUTE       0.89    0.89      0.89      394
        FORM       0.86    0.76      0.81      773
   micro avg       0.87    0.89      0.88     8430
   macro avg       0.84    0.84      0.84     8430
weighted avg       0.87    0.89      0.88     8430
```

---
layout: model
title: Spanish Part of Speech Tagger (Base)
author: John Snow Labs
name: roberta_pos_roberta_base_bne_capitel_pos
date: 2022-05-03
tags: [roberta, pos, part_of_speech, es, open_source]
task: Part of Speech Tagging
language: es
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
recommended: true
annotator: RoBertaForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-base-bne-capitel-pos` is a Spanish model originally trained by `PlanTL-GOB-ES`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_pos_roberta_base_bne_capitel_pos_es_3.4.2_3.0_1651591392282.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_pos_roberta_base_bne_capitel_pos_es_3.4.2_3.0_1651591392282.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_pos_roberta_base_bne_capitel_pos","es") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_pos_roberta_base_bne_capitel_pos","es") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.pos.roberta_base_bne_capitel_pos").predict("""Amo Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_pos_roberta_base_bne_capitel_pos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|es| |Size:|447.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne-capitel-pos - https://arxiv.org/abs/1907.11692 - http://www.bne.es/en/Inicio/index.html - https://sites.google.com/view/capitel2020 - https://github.com/PlanTL-GOB-ES/lm-spanish - https://arxiv.org/abs/2107.07253 --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from venkateshdas) author: John Snow Labs name: roberta_qa_base_squad2_ta_qna_10e date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-ta-qna-roberta10e` is a English model originally trained by `venkateshdas`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_ta_qna_10e_en_4.3.0_3.0_1674219728754.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_ta_qna_10e_en_4.3.0_3.0_1674219728754.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2_ta_qna_10e","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2_ta_qna_10e","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_squad2_ta_qna_10e| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/venkateshdas/roberta-base-squad2-ta-qna-roberta10e --- layout: model title: English BertForQuestionAnswering Cased model (from motiondew) author: John Snow Labs name: bert_qa_set_date_2_lr_2e_5_bs_32_ep_3 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-set_date_2-lr-2e-5-bs-32-ep-3` is a English model originally trained by `motiondew`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_2_lr_2e_5_bs_32_ep_3_en_4.0.0_3.0_1657188406969.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_2_lr_2e_5_bs_32_ep_3_en_4.0.0_3.0_1657188406969.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_2_lr_2e_5_bs_32_ep_3","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_2_lr_2e_5_bs_32_ep_3","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_set_date_2_lr_2e_5_bs_32_ep_3|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/motiondew/bert-set_date_2-lr-2e-5-bs-32-ep-3

---
layout: model
title: Legal Consulting agreement Document Classifier (Longformer)
author: John Snow Labs
name: legclf_consulting_agreement
date: 2022-10-24
tags: [en, legal, classification, document, agreement, contract, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The `legclf_consulting_agreement` model is a Legal Longformer Document Classifier to classify whether a document belongs to the class `consulting-agreement` or not (Binary Classification).

Longformers have a limit of 4096 tokens, so only the first 4096 tokens will be taken into account. We have found that for the vast majority of documents in legal corpora, if they are clean and contain only the legal document without any extra information before it, 4096 tokens are enough to perform Document Classification. If not, let us know and we can carry out another approach for you: splitting the document into chunks of 4096 tokens, averaging their embeddings, and training with the averaged version, which means the whole document is taken into account. In theory, though, this should not be required.
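The chunk-and-average fallback described above can be sketched as follows — a toy illustration with a stand-in embedder, not the Legal NLP implementation:

```python
def average_chunk_embeddings(tokens, embed_chunk, max_len=4096):
    """Embed a long token sequence chunk by chunk (max_len tokens each)
    and average the chunk vectors component-wise into one document vector."""
    chunks = [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
    vectors = [embed_chunk(chunk) for chunk in chunks]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# stand-in embedder: represents a chunk by its length plus a constant feature;
# a real pipeline would call the Longformer on each chunk instead
doc_vector = average_chunk_embeddings(
    list(range(10000)), lambda chunk: [float(len(chunk)), 1.0]
)
```

The averaged vector then replaces the single-pass sentence embedding as the classifier's input.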
## Predicted Entities `other`, `consulting-agreement` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_consulting_agreement_en_1.0.0_3.0_1666620877422.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_consulting_agreement_en_1.0.0_3.0_1666620877422.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token") \ .setOutputCol("embeddings") sembeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_consulting_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, sembeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+----------------------+
|result                |
+----------------------+
|[consulting-agreement]|
|[other]               |
|[other]               |
|[consulting-agreement]|
+----------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_consulting_agreement|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.3 MB|

## References

Legal documents, scraped from the Internet, and classified in-house + SEC documents

## Benchmarking

```bash
               label  precision  recall  f1-score  support
consulting-agreement       0.92    0.85      0.88       26
               other       0.91    0.96      0.93       45
            accuracy          -       -      0.92       71
           macro-avg       0.92    0.90      0.91       71
        weighted-avg       0.92    0.92      0.91       71
```

---
layout: model
title: Polish DistilBERT Embeddings
author: John Snow Labs
name: distilbert_embeddings_distilbert_base_pl_cased
date: 2022-04-12
tags: [distilbert, embeddings, pl, open_source]
task: Embeddings
language: pl
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
recommended: true
annotator: DistilBertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-pl-cased` is a Polish model originally trained by `Geotrend`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_pl_cased_pl_3.4.2_3.0_1649783929723.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_pl_cased_pl_3.4.2_3.0_1649783929723.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_pl_cased","pl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Kocham iskra NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_pl_cased","pl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Kocham iskra NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pl.embed.distilbert_base_cased").predict("""Kocham iskra NLP""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_embeddings_distilbert_base_pl_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|pl|
|Size:|225.6 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/Geotrend/distilbert-base-pl-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers

---
layout: model
title: English image_classifier_vit_ak__base_patch16_224_in21k_image_classification ViTForImageClassification from amitkayal
author: John Snow Labs
name: image_classifier_vit_ak__base_patch16_224_in21k_image_classification
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_ak__base_patch16_224_in21k_image_classification` is an English model originally trained by amitkayal.
## Predicted Entities `1647932708266`, `1647153073068`, `1648120851131`, `1647323870424`, `1647324831223`, `1648529040870`, `1647667510657`, `1647838548516`, `1647491988625`, `1647068900550`, `1647241002660`, `1647155940977`, `1648444866188`, `1648287279256`, `1648360065352`, `1646891748795`, `1647175174130`, `1647667513009`, `1647503446037`, `1648616524252`, `1647241522127`, `1648454681973`, `1647581857611`, `1647233597062`, `1647933757599`, `1647420385320`, `1648361316938`, `1647856085470`, `1647243922705`, `1647497162634`, `1647237935761`, `1648369037326`, `1648115339027`, `1647153746047`, `1648273037858`, `1647150662937`, `1646893741640`, `1647845708343`, `1647147746930`, `1648366090438`, `1647156193650`, `1648537457250`, `1647149733777`, `1648443306030`, `1648646879260`, `1648001685069`, `1648528121469`, `1647156345180`, `1648456544611`, `1648107120561`, `1648359826755`, `1648366661601`, `1647666899143`, `1647935446369`, `1647668429439`, `1647936918167`, `1647235612241`, `1648041520955`, `1647243637048`, `1647680921307`, `1647081327122`, `1647087753595`, `1648528673166`, `1648710643516`, `1647945116578`, `1647846670493`, `1648536302434`, `1647761641671`, `1647325936506`, `1647325395017`, `1647234311073`, `1647759532201`, `1647241685563`, `1647935761690`, `1647846942176`, `1648698605364`, `1647933173332`, `1647420602021`, `1647159771571`, `1647324549954`, `1647065648037`, `1648536755030`, `1647924696448`, `1647927510285`, `1646892894926`, `1647580734898`, `1648287633901`, `1648442962568`, `1648368434262`, `1646988520214`, `1648279394220`, `1647150432271`, `1648643933549`, `1647448253535`, `1647929786244`, `1648370352609`, `1647330838532`, `1647147396410`, `1648644032811`, `1647140664832`, `1648536664835`, `1647410160165`, `1647164611497`, `1648183419560`, `1647773005511`, `1646034720737`, `1647328253213`, `1647155473555`, `1647953067595`, `1648538213890`, `1647409255750`, `1647682028193`, `1648116419206`, `1647329803500`, `1647154529007`, `1648099843046`, 
`1647248948963`, `1648279061829`, `1648296194026`, `1648108046332`, `1648113127522`, `1648455583722`, `1647761642926`, `1648533677853`, `1647940472293`, `1648701651498`, `1648456403645`, `1647752727972`, `1647494398054`, `1647674319450`, `1646887547179`, `1647158162746`, `1647176402807`, `1647065469305`, `1647838279102`, `1647674991869`, `1648113569157`, `1647067760203`, `1648365815895`, `1647330963213`, `1647405478288`, `1648372401817`, `1648103720352`, `1648115162538`, `1647784242846`, `1647402768175`, `1647490410305`, `1648286780106`, `1648625933250`, `1648534124563`, `1647173945909`, `1647235889634`, `1648525350277`, `1647236596892`, `1648292928751`, `1648706018510`, `1648024698508`, `1648707239343`, `1647767885862`, `1647240848064`, `1648280849373`, `1648406457408`, `1647236473208`, `1647157232763`, `1647147535719`, `1648706506117`, `1648706245642`, `1647933556772`, `1648616269757`, `1646809025517`, `1647420348512`, `1647399162926`, `1647843749559`, `1647242632376`, `1648182014371`, `1647321722450`, `1648369375824`, `1647331957791`, `1647494265176`, `1647252866283`, `1647930894634`, `1646809256434`, `1648537084774`, `1648450201448`, `1646885340637`, `1647760733996`, `1648287529293` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_ak__base_patch16_224_in21k_image_classification_en_4.1.0_3.0_1660167449128.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_ak__base_patch16_224_in21k_image_classification_en_4.1.0_3.0_1660167449128.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_ak__base_patch16_224_in21k_image_classification", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_ak__base_patch16_224_in21k_image_classification", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_ak__base_patch16_224_in21k_image_classification| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.4 MB| --- layout: model title: Part of Speech for Slovak author: John Snow Labs name: pos_ud_snk date: 2020-05-04 23:32:00 +0800 task: Part of Speech Tagging language: sk edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [pos, sk] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_snk_sk_2.5.0_2.4_1588622627281.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_snk_sk_2.5.0_2.4_1588622627281.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_snk", "sk") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Okrem toho, že je kráľom severu, je John Snow anglickým lekárom a lídrom vo vývoji anestézie a lekárskej hygieny.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_snk", "sk") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Okrem toho, že je kráľom severu, je John Snow anglickým lekárom a lídrom vo vývoji anestézie a lekárskej hygieny.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Okrem toho, že je kráľom severu, je John Snow anglickým lekárom a lídrom vo vývoji anestézie a lekárskej hygieny."""] pos_df = nlu.load('sk.pos.ud_snk').predict(text, output_level='token') pos_df ```
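The annotations returned by `fullAnnotate` carry inclusive `begin`/`end` character offsets into the input text (visible in the Results section). A minimal pure-Python sketch of how such offsets map back to surface words; `slice_tokens` is a hypothetical helper, not a Spark NLP API:

```python
# Recover surface words from inclusive (begin, end) character offsets,
# as produced by Spark NLP annotations.
def slice_tokens(text, offsets):
    # end is inclusive, so slice up to end + 1
    return [text[b:e + 1] for b, e in offsets]

text = "Okrem toho, že je kráľom severu"
# Offsets for the first five tokens of the example sentence.
tokens = slice_tokens(text, [(0, 4), (6, 9), (10, 10), (12, 13), (15, 16)])
# -> ['Okrem', 'toho', ',', 'že', 'je']
```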
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=4, result='ADP', metadata={'word': 'Okrem'}), Row(annotatorType='pos', begin=6, end=9, result='DET', metadata={'word': 'toho'}), Row(annotatorType='pos', begin=10, end=10, result='PUNCT', metadata={'word': ','}), Row(annotatorType='pos', begin=12, end=13, result='SCONJ', metadata={'word': 'že'}), Row(annotatorType='pos', begin=15, end=16, result='AUX', metadata={'word': 'je'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_snk| |Type:|pos| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|sk| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: German Electra Embeddings (from stefan-it) author: John Snow Labs name: electra_embeddings_electra_base_gc4_64k_300000_cased_generator date: 2022-05-17 tags: [de, open_source, electra, embeddings] task: Embeddings language: de edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-gc4-64k-300000-cased-generator` is a German model originally trained by `stefan-it`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_300000_cased_generator_de_3.4.4_3.0_1652786364457.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_300000_cased_generator_de_3.4.4_3.0_1652786364457.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_300000_cased_generator","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_300000_cased_generator","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
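The `embeddings` column in the result holds one vector per token. A common follow-up step, not shown above, is mean-pooling those vectors into a single sentence embedding; a minimal pure-Python sketch of the arithmetic, with toy 4-dimensional vectors standing in for the model's real output:

```python
def mean_pool(token_vectors):
    # Element-wise average of equal-length token embedding vectors,
    # yielding one sentence-level vector.
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

sentence_vec = mean_pool([[1.0, 2.0, 0.0, 4.0], [3.0, 0.0, 2.0, 0.0]])
# -> [2.0, 1.0, 1.0, 2.0]
```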
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electra_base_gc4_64k_300000_cased_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|de| |Size:|222.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/stefan-it/electra-base-gc4-64k-300000-cased-generator - https://german-nlp-group.github.io/projects/gc4-corpus.html - https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_6 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-128-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_6_en_4.0.0_3.0_1655731375791.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_6_en_4.0.0_3.0_1655731375791.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_6","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_128d_seed_6").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
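In the NLU one-liner above, the question and context are packed into a single string separated by `|||`. A minimal sketch of that convention; `split_qa` is a hypothetical helper, assuming `|||` appears only as the separator:

```python
def split_qa(payload, sep="|||"):
    # Split a "question|||context" payload into its two parts,
    # splitting at most once in case the context contains the separator.
    question, context = payload.split(sep, 1)
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
# q -> "What's my name?"
# c -> "My name is Clara and I live in Berkeley."
```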
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|422.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-128-finetuned-squad-seed-6 --- layout: model title: French CamemBert Embeddings (from devtrent) author: John Snow Labs name: camembert_embeddings_devtrent_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `devtrent`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_devtrent_generic_model_fr_3.4.4_3.0_1653987835010.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_devtrent_generic_model_fr_3.4.4_3.0_1653987835010.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_devtrent_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_devtrent_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
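Embedding vectors like those produced above are typically compared with cosine similarity. A minimal pure-Python sketch of the formula, using illustrative vectors rather than real CamemBERT output:

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

same = cosine_similarity([1.0, 0.0], [1.0, 0.0])        # identical -> 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated -> 0.0
```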
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_devtrent_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/devtrent/dummy-model --- layout: model title: Recognize Entities DL pipeline for German - Large author: John Snow Labs name: entity_recognizer_lg date: 2021-03-23 tags: [open_source, german, entity_recognizer_lg, pipeline, de] supported: true task: [Named Entity Recognition, Lemmatization] language: de edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_lg is a pretrained pipeline that performs basic text processing steps and recognizes entities. It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_de_3.0.0_3.0_1616491780964.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_de_3.0.0_3.0_1616491780964.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('entity_recognizer_lg', lang = 'de') annotations = pipeline.fullAnnotate("Hallo aus John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("entity_recognizer_lg", lang = "de") val result = pipeline.fullAnnotate("Hallo aus John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hallo aus John Snow Labs! "] result_df = nlu.load('de.ner.recognizer.lg').predict(text) result_df ```
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:-------------------------------|:------------------------------|:------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Hallo aus John Snow Labs! '] | ['Hallo aus John Snow Labs!'] | ['Hallo', 'aus', 'John', 'Snow', 'Labs!'] | [[-0.245989993214607,.,...]] | ['O', 'O', 'I-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_lg| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|de| --- layout: model title: Relation extraction between dates and clinical entities author: John Snow Labs name: re_date_clinical date: 2021-01-18 task: Relation Extraction language: en nav_key: models edition: Spark NLP for Healthcare 2.7.1 spark_version: 2.4 tags: [en, relation_extraction, clinical, licensed] supported: true annotator: RelationExtractionModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Relation extraction between a date entity and other related clinical entities. `1`: there is a relation between the date entity and the other clinical entity; `0`: there is no relation between them.
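The RE PAIRS list in the table below is mechanical: each date-type entity (`date`, `time`, `relativedate`) is paired with each `ner_jsl` clinical entity, in both directions. A minimal sketch generating such pairs, using a truncated sample of entity names for illustration only:

```python
from itertools import product

date_entities = ["date", "time", "relativedate"]
clinical_entities = ["admission_discharge", "alcohol", "allergen"]  # sample only

# Each (date-type, clinical) combination yields a pair in both directions.
pairs = []
for d, c in product(date_entities, clinical_entities):
    pairs += [f"{d}-{c}", f"{c}-{d}"]
# e.g. "date-admission_discharge", "admission_discharge-date", ...
```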
## Predicted Entities `0`, `1` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_date_clinical_en_2.7.1_2.4_1611000334654.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_date_clinical_en_2.7.1_2.4_1611000334654.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The table below lists the `re_date_clinical` RE model, its labels, the optimal NER model, and the meaningful relation pairs. | RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS | |:----------------:|:---------------:|:---------:|:---------| | re_date_clinical | 0,1 | ner_jsl | [“date-admission_discharge”,
“admission_discharge-date”,
“date-alcohol”,
“alcohol-date”,
“date-allergen”,
“allergen-date”,
“date-bmi”,
“bmi-date”,
“date-birth_entity”,
“birth_entity-date”,
“date-blood_pressure”,
“blood_pressure-date”,
“date-cerebrovascular_disease”,
“cerebrovascular_disease-date”,
“date-clinical_dept”,
“clinical_dept-date”,
“date-communicable_disease”,
“communicable_disease-date”,
“date-death_entity”,
“death_entity-date”,
“date-diabetes”,
“diabetes-date”,
“date-diet”,
“diet-date”,
“date-disease_syndrome_disorder”,
“disease_syndrome_disorder-date”,
“date-drug_brandname”,
“drug_brandname-date”,
“date-drug_ingredient”,
“drug_ingredient-date”,
“date-ekg_findings”,
“ekg_findings-date”,
“date-external_body_part_or_region”,
“external_body_part_or_region-date”,
“date-fetus_newborn”,
“fetus_newborn-date”,
“date-hdl”,
“hdl-date”,
“date-heart_disease”,
“heart_disease-date”,
“date-height”,
“height-date”,
“date-hyperlipidemia”,
“hyperlipidemia-date”,
“date-hypertension”,
“hypertension-date”,
“date-imagingfindings”,
“imagingfindings-date”,
“date-imaging_technique”,
“imaging_technique-date”,
“date-injury_or_poisoning”,
“injury_or_poisoning-date”,
“date-internal_organ_or_component”,
“internal_organ_or_component-date”,
“date-kidney_disease”,
“kidney_disease-date”,
“date-ldl”,
“ldl-date”,
“date-modifier”,
“modifier-date”,
“date-o2_saturation”,
“o2_saturation-date”,
“date-obesity”,
“obesity-date”,
“date-oncological”,
“oncological-date”,
“date-overweight”,
“overweight-date”,
“date-oxygen_therapy”,
“oxygen_therapy-date”,
“date-pregnancy”,
“pregnancy-date”,
“date-procedure”,
“procedure-date”,
“date-psychological_condition”,
“psychological_condition-date”,
“date-pulse”,
“pulse-date”,
“date-respiration”,
“respiration-date”,
“date-smoking”,
“smoking-date”,
“date-substance”,
“substance-date”,
“date-substance_quantity”,
“substance_quantity-date”,
“date-symptom”,
“symptom-date”,
“date-temperature”,
“temperature-date”,
“date-test”,
“test-date”,
“date-test_result”,
“test_result-date”,
“date-total_cholesterol”,
“total_cholesterol-date”,
“date-treatment”,
“treatment-date”,
“date-triglycerides”,
“triglycerides-date”,
“date-vs_finding”,
“vs_finding-date”,
“date-vaccine”,
“vaccine-date”,
“date-vital_signs_header”,
“vital_signs_header-date”,
“date-weight”,
“weight-date”,
“time-admission_discharge”,
“admission_discharge-time”,
“time-alcohol”,
“alcohol-time”,
“time-allergen”,
“allergen-time”,
“time-bmi”,
“bmi-time”,
“time-birth_entity”,
"birth_entity-time",
"time-blood_pressure",
"blood_pressure-time",
"time-cerebrovascular_disease",
"cerebrovascular_disease-time",
"time-clinical_dept",
"clinical_dept-time",
"time-communicable_disease",
"communicable_disease-time",
"time-death_entity",
"death_entity-time",
"time-diabetes",
"diabetes-time",
"time-diet",
"diet-time",
"time-disease_syndrome_disorder",
"disease_syndrome_disorder-time",
"time-drug_brandname",
"drug_brandname-time",
"time-drug_ingredient",
"drug_ingredient-time",
"time-ekg_findings",
"ekg_findings-time",
"time-external_body_part_or_region",
"external_body_part_or_region-time",
"time-fetus_newborn",
"fetus_newborn-time",
"time-hdl",
"hdl-time",
"time-heart_disease",
"heart_disease-time",
"time-height",
"height-time",
"time-hyperlipidemia",
"hyperlipidemia-time",
"time-hypertension",
"hypertension-time",
"time-imagingfindings",
"imagingfindings-time",
"time-imaging_technique",
"imaging_technique-time",
"time-injury_or_poisoning",
"injury_or_poisoning-time",
"time-internal_organ_or_component",
"internal_organ_or_component-time",
"time-kidney_disease",
"kidney_disease-time",
"time-ldl",
"ldl-time",
"time-modifier",
"modifier-time",
"time-o2_saturation",
"o2_saturation-time",
"time-obesity",
"obesity-time",
"time-oncological",
"oncological-time",
"time-overweight",
"overweight-time",
"time-oxygen_therapy",
"oxygen_therapy-time",
"time-pregnancy",
"pregnancy-time",
"time-procedure",
"procedure-time",
"time-psychological_condition",
"psychological_condition-time",
"time-pulse",
"pulse-time",
"time-respiration",
"respiration-time",
"time-smoking",
"smoking-time",
"time-substance",
"substance-time",
"time-substance_quantity",
"substance_quantity-time",
"time-symptom",
"symptom-time",
"time-temperature",
"temperature-time",
"time-test",
"test-time",
"time-test_result",
"test_result-time",
"time-total_cholesterol",
"total_cholesterol-time",
"time-treatment",
"treatment-time",
"time-triglycerides",
"triglycerides-time",
"time-vs_finding",
"vs_finding-time",
"time-vaccine",
"vaccine-time",
"time-vital_signs_header",
"vital_signs_header-time",
"time-weight",
"weight-time",
"relativedate-admission_discharge",
"admission_discharge-relativedate",
"relativedate-alcohol",
"alcohol-relativedate",
"relativedate-allergen",
"allergen-relativedate",
"relativedate-bmi",
"bmi-relativedate",
"relativedate-birth_entity",
"birth_entity-relativedate",
"relativedate-blood_pressure",
"blood_pressure-relativedate",
"relativedate-cerebrovascular_disease",
"cerebrovascular_disease-relativedate",
"relativedate-clinical_dept",
"clinical_dept-relativedate",
"relativedate-communicable_disease",
"communicable_disease-relativedate",
"relativedate-death_entity",
"death_entity-relativedate",
"relativedate-diabetes",
"diabetes-relativedate",
"relativedate-diet",
"diet-relativedate",
"relativedate-disease_syndrome_disorder",
"disease_syndrome_disorder-relativedate",
"relativedate-drug_brandname",
"drug_brandname-relativedate",
"relativedate-drug_ingredient",
"drug_ingredient-relativedate",
"relativedate-ekg_findings",
"ekg_findings-relativedate",
"relativedate-external_body_part_or_region",
"external_body_part_or_region-relativedate",
"relativedate-fetus_newborn",
"fetus_newborn-relativedate",
"relativedate-hdl",
"hdl-relativedate",
"relativedate-heart_disease",
"heart_disease-relativedate",
"relativedate-height",
"height-relativedate",
"relativedate-hyperlipidemia",
"hyperlipidemia-relativedate",
"relativedate-hypertension",
"hypertension-relativedate",
"relativedate-imagingfindings",
"imagingfindings-relativedate",
"relativedate-imaging_technique",
"imaging_technique-relativedate",
"relativedate-injury_or_poisoning",
"injury_or_poisoning-relativedate",
"relativedate-internal_organ_or_component",
"internal_organ_or_component-relativedate",
"relativedate-kidney_disease",
"kidney_disease-relativedate",
"relativedate-ldl",
"ldl-relativedate",
"relativedate-modifier",
"modifier-relativedate",
"relativedate-o2_saturation",
"o2_saturation-relativedate",
"relativedate-obesity",
"obesity-relativedate",
"relativedate-oncological",
"oncological-relativedate",
"relativedate-overweight",
"overweight-relativedate",
"relativedate-oxygen_therapy",
"oxygen_therapy-relativedate",
"relativedate-pregnancy",
"pregnancy-relativedate",
"relativedate-procedure",
"procedure-relativedate",
"relativedate-psychological_condition",
"psychological_condition-relativedate",
"relativedate-pulse",
"pulse-relativedate",
"relativedate-respiration",
"respiration-relativedate",
"relativedate-smoking",
"smoking-relativedate",
"relativedate-substance",
"substance-relativedate",
"relativedate-substance_quantity",
"substance_quantity-relativedate",
"relativedate-symptom",
"symptom-relativedate",
"relativedate-temperature",
"temperature-relativedate",
"relativedate-test",
"test-relativedate",
"relativedate-test_result",
"test_result-relativedate",
"relativedate-total_cholesterol",
"total_cholesterol-relativedate",
"relativedate-treatment",
"treatment-relativedate",
"relativedate-triglycerides",
"triglycerides-relativedate",
"relativedate-vs_finding",
"vs_finding-relativedate",
"relativedate-vaccine",
"vaccine-relativedate",
"relativedate-vital_signs_header",
"vital_signs_header-relativedate",
"relativedate-weight",
"weight-relativedate",
"relativetime-admission_discharge",
"admission_discharge-relativetime",
"relativetime-alcohol",
"alcohol-relativetime",
"relativetime-allergen",
"allergen-relativetime",
"relativetime-bmi",
"bmi-relativetime",
"relativetime-birth_entity",
"birth_entity-relativetime",
"relativetime-blood_pressure",
"blood_pressure-relativetime",
"relativetime-cerebrovascular_disease",
"cerebrovascular_disease-relativetime",
"relativetime-clinical_dept",
"clinical_dept-relativetime",
"relativetime-communicable_disease",
"communicable_disease-relativetime",
"relativetime-death_entity",
"death_entity-relativetime",
"relativetime-diabetes",
"diabetes-relativetime",
"relativetime-diet",
"diet-relativetime",
"relativetime-disease_syndrome_disorder",
"disease_syndrome_disorder-relativetime",
"relativetime-drug_brandname",
"drug_brandname-relativetime",
"relativetime-drug_ingredient",
"drug_ingredient-relativetime",
"relativetime-ekg_findings",
"ekg_findings-relativetime",
"relativetime-external_body_part_or_region",
"external_body_part_or_region-relativetime",
"relativetime-fetus_newborn",
"fetus_newborn-relativetime",
"relativetime-hdl",
"hdl-relativetime",
"relativetime-heart_disease",
"heart_disease-relativetime",
"relativetime-height",
"height-relativetime",
"relativetime-hyperlipidemia",
"hyperlipidemia-relativetime",
"relativetime-hypertension",
"hypertension-relativetime",
"relativetime-imagingfindings",
"imagingfindings-relativetime",
"relativetime-imaging_technique",
"imaging_technique-relativetime",
"relativetime-injury_or_poisoning",
"injury_or_poisoning-relativetime",
"relativetime-internal_organ_or_component",
"internal_organ_or_component-relativetime",
"relativetime-kidney_disease",
"kidney_disease-relativetime",
"relativetime-ldl",
"ldl-relativetime",
"relativetime-modifier",
"modifier-relativetime",
"relativetime-o2_saturation",
"o2_saturation-relativetime",
"relativetime-obesity",
"obesity-relativetime",
"relativetime-oncological",
"oncological-relativetime",
"relativetime-overweight",
"overweight-relativetime",
"relativetime-oxygen_therapy",
"oxygen_therapy-relativetime",
"relativetime-pregnancy",
"pregnancy-relativetime",
"relativetime-procedure",
"procedure-relativetime",
"relativetime-psychological_condition",
"psychological_condition-relativetime",
"relativetime-pulse",
"pulse-relativetime",
"relativetime-respiration",
"respiration-relativetime",
"relativetime-smoking",
"smoking-relativetime",
"relativetime-substance",
"substance-relativetime",
"relativetime-substance_quantity",
"substance_quantity-relativetime",
"relativetime-symptom",
"symptom-relativetime",
"relativetime-temperature",
"temperature-relativetime",
"relativetime-test",
"test-relativetime",
"relativetime-test_result",
"test_result-relativetime",
"relativetime-total_cholesterol",
"total_cholesterol-relativetime",
"relativetime-treatment",
"treatment-relativetime",
"relativetime-triglycerides",
"triglycerides-relativetime",
"relativetime-vs_finding",
"vs_finding-relativetime",
"relativetime-vaccine",
"vaccine-relativetime",
"relativetime-vital_signs_header",
"vital_signs_header-relativetime",
"relativetime-weight",
"weight-relativetime"] | Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, PerceptronModel, DependencyParserModel, WordEmbeddingsModel, NerDLModel, NerConverter, RelationExtractionModel.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentences", "tokens"])\ .setOutputCol("embeddings") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") ner_tagger = MedicalNerModel().pretrained("jsl_ner_wip_greedy_clinical","en","clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_chunker = NerConverterInternal()\ .setInputCols(["sentences", "tokens", "ner_tags"])\ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel()\ .pretrained("dependency_conllu", "en")\ .setInputCols(["sentences", "pos_tags", "tokens"])\ .setOutputCol("dependencies") re_model = RelationExtractionModel().pretrained("re_date_clinical", "en", "clinical/models")\ .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\ .setOutputCol("relations")\ .setMaxSyntacticDistance(3)\ .setPredictionThreshold(0.9)\ .setRelationPairs(["test-date", "symptom-date"]) # Possible relation pairs. Default: All Relations. 
nlp_pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, word_embeddings, pos_tagger, ner_tagger, ner_chunker, dependency_parser, re_model]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate('''This 73 y/o patient had CT on 1/12/95, with progressive memory and cognitive decline since 8/11/94.''') ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val ner_tagger = MedicalNerModel.pretrained("jsl_ner_wip_greedy_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_chunker = new NerConverterInternal() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") val re_model = RelationExtractionModel.pretrained("re_date_clinical", "en", "clinical/models") .setInputCols(Array("embeddings", "pos_tags", "ner_chunks", "dependencies")) .setOutputCol("relations") .setMaxSyntacticDistance(3) /* default: 0 */ .setPredictionThreshold(0.9) /* default: 0.5 */ .setRelationPairs(Array("test-date", "symptom-date")) // Possible relation pairs. Default: all relations.
val nlpPipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, word_embeddings, pos_tagger, ner_tagger, ner_chunker, dependency_parser, re_model)) val lightPipeline = new LightPipeline(nlpPipeline.fit(Seq("").toDS.toDF("text"))) val annotations = lightPipeline.fullAnnotate("""This 73 y/o patient had CT on 1/12/95, with progressive memory and cognitive decline since 8/11/94.""") ```
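The relations returned by `fullAnnotate` can be filtered further by their prediction confidence in plain Python. A minimal sketch, assuming the annotations have already been flattened into dictionaries whose keys mirror the Results table below; the `filter_relations` helper is hypothetical, not part of Spark NLP:

```python
def filter_relations(relations, min_confidence=0.9):
    """Keep only predicted relations at or above the confidence threshold."""
    return [r for r in relations if r["confidence"] >= min_confidence]

# Illustrative predictions in the shape of the Results table (not real output).
relations = [
    {"relation": "1", "chunk1": "CT", "chunk2": "1/12/95", "confidence": 1.0},
    {"relation": "1", "chunk1": "progressive memory and cognitive decline",
     "chunk2": "8/11/94", "confidence": 0.62},
]

kept = filter_relations(relations)
print(len(kept))  # the low-confidence pair is dropped -> 1
```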
## Results ```bash | | relations | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |---|-----------|---------|---------------|-------------|------------------------------------------|---------|---------------|-------------|---------|------------| | 0 | 1 | Test | 24 | 25 | CT | Date | 31 | 37 | 1/12/95 | 1.0 | | 1 | 1 | Symptom | 45 | 84 | progressive memory and cognitive decline | Date | 92 | 98 | 8/11/94 | 1.0 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_date_clinical| |Type:|re| |Compatibility:|Spark NLP 2.7.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]| |Output Labels:|[relations]| |Language:|en| |Dependencies:|embeddings_clinical| ## Data Source Trained on data gathered and manually annotated by John Snow Labs ## Benchmarking ```bash label recall precision f1 0 0.74 0.71 0.72 1 0.94 0.95 0.94 ``` --- layout: model title: English image_classifier_vit_rare_bottle ViTForImageClassification from tmoodley author: John Snow Labs name: image_classifier_vit_rare_bottle date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rare_bottle` is an English model originally trained by tmoodley.
## Predicted Entities `Don Julio`, `bacardi`, `Jack Daniels`, `johnny walker`, `Southern Comfort` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rare_bottle_en_4.1.0_3.0_1660172098511.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rare_bottle_en_4.1.0_3.0_1660172098511.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_rare_bottle", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_rare_bottle", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_rare_bottle| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: German BERT Base Cased Model author: John Snow Labs name: bert_base_german_cased date: 2021-05-20 tags: [open_source, embeddings, bert, german, de] task: Embeddings language: de edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The source data for the model consists of a recent Wikipedia dump, EU Bookshop corpus, Open Subtitles, CommonCrawl, ParaCrawl and News Crawl. This results in a dataset with a size of 16GB and 2,350,234,427 tokens. The model was trained with an initial sequence length of 512 subwords; training was performed for 1.5M steps. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_german_cased_de_3.1.0_3.0_1621502949396.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_german_cased_de_3.1.0_3.0_1621502949396.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_base_german_cased", "de") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_base_german_cased", "de") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("de.embed.bert").predict("""Put your text here.""") ```
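Downstream, the token vectors produced by the embeddings stage (768-dimensional for BERT base) are commonly compared with cosine similarity. A minimal self-contained sketch in plain Python; the short vectors below are illustrative stand-ins, not real model output:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative 4-dimensional stand-ins for real 768-dim BERT vectors.
v_hund = [0.8, 0.1, 0.3, 0.2]
v_katze = [0.7, 0.2, 0.4, 0.1]
v_auto = [0.1, 0.9, 0.0, 0.5]

# "Hund" should be closer to "Katze" than to "Auto".
print(cosine_similarity(v_hund, v_katze) > cosine_similarity(v_hund, v_auto))  # True
```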
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_german_cased| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[embeddings]| |Language:|de| |Case sensitive:|true| ## Data Source https://huggingface.co/dbmdz/bert-base-german-cased ## Benchmarking For results on downstream tasks like NER or PoS tagging, please refer to [this repository](https://github.com/stefan-it/fine-tuned-berts-seq). --- layout: model title: English ElectraForQuestionAnswering model (from Andranik) author: John Snow Labs name: electra_qa_TestQA2 date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `TestQA2` is an English model originally trained by `Andranik`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_TestQA2_en_4.0.0_3.0_1655920070692.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_TestQA2_en_4.0.0_3.0_1655920070692.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_TestQA2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_TestQA2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.electra").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_TestQA2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Andranik/TestQA2 --- layout: model title: Hindi DistilBERT Embeddings (from neuralspace-reverie) author: John Snow Labs name: distilbert_embeddings_indic_transformers_hi_distilbert date: 2022-04-12 tags: [distilbert, embeddings, hi, open_source] task: Embeddings language: hi edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `indic-transformers-hi-distilbert` is a Hindi model originally trained by `neuralspace-reverie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_indic_transformers_hi_distilbert_hi_3.4.2_3.0_1649783440506.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_indic_transformers_hi_distilbert_hi_3.4.2_3.0_1649783440506.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_indic_transformers_hi_distilbert","hi") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["मुझे स्पार्क एनएलपी पसंद है"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_indic_transformers_hi_distilbert","hi") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("मुझे स्पार्क एनएलपी पसंद है").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("hi.embed.indic_transformers_hi_distilbert").predict("""मुझे स्पार्क एनएलपी पसंद है""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_indic_transformers_hi_distilbert| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|hi| |Size:|247.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-hi-distilbert - https://oscar-corpus.com/ --- layout: model title: English RobertaForQuestionAnswering (from nlpunibo) author: John Snow Labs name: roberta_qa_nlpunibo_roberta date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta` is an English model originally trained by `nlpunibo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_nlpunibo_roberta_en_4.0.0_3.0_1655729625379.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_nlpunibo_roberta_en_4.0.0_3.0_1655729625379.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_nlpunibo_roberta","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_nlpunibo_roberta","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.by_nlpunibo").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_nlpunibo_roberta| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|463.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/nlpunibo/roberta --- layout: model title: Vietnamese T5ForConditionalGeneration Small Cased model (from NlpHUST) author: John Snow Labs name: t5_vi_small date: 2023-01-31 tags: [vi, open_source, t5, tensorflow] task: Text Generation language: vi edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-vi-en-small` is a Vietnamese model originally trained by `NlpHUST`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_vi_small_vi_4.3.0_3.0_1675156710542.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_vi_small_vi_4.3.0_3.0_1675156710542.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_vi_small","vi") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_vi_small","vi") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_vi_small| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|vi| |Size:|819.8 MB| ## References - https://huggingface.co/NlpHUST/t5-vi-en-small --- layout: model title: English DistilBertForTokenClassification Cased model (from f2io) author: John Snow Labs name: distilbert_token_classifier_ner_roles_openapi date: 2023-03-06 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ner-roles-openapi` is an English model originally trained by `f2io`. ## Predicted Entities `LOC`, `OR`, `PRG`, `ROLE`, `ORG`, `PER`, `ENTITY`, `MISC`, `ACTION` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_ner_roles_openapi_en_4.3.1_3.0_1678133755717.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_ner_roles_openapi_en_4.3.1_3.0_1678133755717.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_ner_roles_openapi","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_ner_roles_openapi","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_ner_roles_openapi| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/f2io/ner-roles-openapi --- layout: model title: Translate English to Umbundu Pipeline author: John Snow Labs name: translate_en_umb date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, umb, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `umb` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_umb_xx_2.7.0_2.4_1609689689023.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_umb_xx_2.7.0_2.4_1609689689023.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_umb", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_umb", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.umb').predict(text, output_level='sentence') translate_df ```
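Because translation cost grows quickly with sequence length, long inputs are often split into sentence-sized chunks before being passed to `annotate`. A minimal sketch of such a splitter in plain Python, independent of Spark NLP; the `max_chars` budget is an illustrative assumption, not a library parameter:

```python
import re

def split_into_chunks(text, max_chars=200):
    """Greedily pack sentences into chunks no longer than max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Each chunk can then be translated separately, e.g.
# for chunk in split_into_chunks(long_text): pipeline.annotate(chunk)
chunks = split_into_chunks("First sentence. Second sentence! Third one?", max_chars=20)
```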
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_umb| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_el48 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-el48` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el48_en_4.3.0_3.0_1675120232514.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el48_en_4.3.0_3.0_1675120232514.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_el48","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_el48","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_el48| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|401.7 MB| ## References - https://huggingface.co/google/t5-efficient-small-el48 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English asr_models_6 TFWav2Vec2ForCTC from niclas author: John Snow Labs name: asr_models_6 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_models_6` is an English model originally trained by niclas. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_models_6_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_models_6_en_4.2.0_3.0_1664098721737.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_models_6_en_4.2.0_3.0_1664098721737.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_models_6", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_models_6", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_models_6| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Catalan Named Entity Recognition from project-aina author: cayorodriguez name: ner_bsc date: 2022-07-07 tags: [ner, bsc, projecte_aina, ca, open_source] task: Named Entity Recognition language: ca edition: Spark NLP 3.4.4 spark_version: 3.0 supported: false recommended: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Based on the huggingface model `projecte-aina/roberta-base-ca-cased-ner`, this model requires a specific tokenizer (look at the Python Examples section). ## Predicted Entities `PER`, `ORG`, `LOC`, `MISC` {:.btn-box} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/cayorodriguez/ner_bsc_ca_3.4.4_3.0_1657197794383.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/cayorodriguez/ner_bsc_ca_3.4.4_3.0_1657197794383.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') ex_list = ["aprox\.","pàg\.","p\.ex\.","gen\.","feb\.","abr\.","jul\.","set\.","oct\.","nov\.","dec\.","dr\.","dra\.","sr\.","sra\.","srta\.","núm\.","st\.","sta\.","pl\.","etc\.", "ex\."] ex_list_all = [] ex_list_all.extend(ex_list) ex_list_all.extend([x[0].upper() + x[1:] for x in ex_list]) ex_list_all.extend([x.upper() for x in ex_list]) tokenizer = Tokenizer() \ .setInputCols(['document']).setOutputCol('token')\ .setInfixPatterns(["(d|D)(els)","(d|D)(el)","(a|A)(ls)","(a|A)(l)","(p|P)(els)","(p|P)(el)",\ "([A-zÀ-ú_@]+)(-[A-zÀ-ú_@]+)",\ "(d'|D')([·A-zÀ-ú@_]+)(\.|\"|;|:|!|\?|\-|\(|\)|”|“|'|,)+","(l'|L')([·A-zÀ-ú_]+)(\.|\"|;|:|!|\?|\-|\(|\)|”|“|'|,)+", \ "(l'|l'|s'|s'|d'|d'|m'|m'|n'|n'|D'|D'|L'|L'|S'|S'|N'|N'|M'|M')([A-zÀ-ú_]+)",\ """([A-zÀ-ú·]+)(\.|,|\)|\?|!|;|\:|\"|”)(\.|,|\)|\?|!|;|\:|\"|”)+""",\ "([A-zÀ-ú·]+)('l|'ns|'t|'m|'n|-les|-la|-lo|-li|-los|-me|-nos|-te|-vos|-se|-hi|-ne|-ho)(\.|,|;|:|\?|,)+",\ "([A-zÀ-ú·]+)('l|'ns|'t|'m|'n|-les|-la|-lo|-li|-los|-me|-nos|-te|-vos|-se|-hi|-ne|-ho)",\ "(\.|\"|;|:|!|\?|\-|\(|\)|”|“|')+([0-9A-zÀ-ú_]+)",\ "([0-9A-zÀ-ú·]+)(\.|\"|;|:|!|\?|\(|\)|”|“|'|,|%)",\ "(\.|\"|;|:|!|\?|\-|\(|\)|”|“|,)+([0-9]+)(\.|\"|;|:|!|\?|\-|\(|\)|”|“|,)+",\ "(d'|D'|l'|L')([·A-zÀ-ú@_]+)('l|'ns|'t|'m|'n|-les|-la|-lo|-li|-los|-me|-nos|-te|-vos|-se|-hi|-ne|-ho)(\.|\"|;|:|!|\?|\-|\(|\)|”|“|,)", \ "([\.|\"|;|:|!|\?|\-|\(|\)|”|“|,]+)([\.|\"|;|:|!|\?|\-|\(|\)|”|“|,]+)"]) \ .setExceptions(ex_list_all) tokenClassifier = RoBertaForTokenClassification \ .pretrained('ner_bsc', 'ca', '@cayorodriguez') \ .setInputCols(['token', 'document']) \ .setOutputCol('ner') \ .setCaseSensitive(True) \ .setMaxSentenceLength(512) ner_converter = NerConverter() \ .setInputCols(['document', 'token', 'ner']) \ .setOutputCol('entities') pipeline = Pipeline(stages=[ document_assembler, tokenizer, tokenClassifier, 
ner_converter ]) example = spark.createDataFrame([['El meu nom es Carlos i visc a Catalunya!']]).toDF("text") result = pipeline.fit(example).transform(example) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_bsc| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Community| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ca| |Size:|445.9 MB| |Case sensitive:|true| |Max sentence length:|128| ## References projecte-aina/ancora-ca-ner @ huggingface --- layout: model title: Part of Speech for Russian author: John Snow Labs name: pos_ud_gsd date: 2021-03-08 tags: [part_of_speech, open_source, russian, pos_ud_gsd, ru] task: Part of Speech Tagging language: ru edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - ADP - NOUN - PROPN - CCONJ - PUNCT - PRON - VERB - DET - ADJ - SCONJ - AUX - PART - NUM - ADV - SYM - X {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_ru_3.0.0_3.0_1615230203372.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_ru_3.0.0_3.0_1615230203372.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_gsd", "ru") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Здравствуйте из Джона Снежных Лабораторий! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_gsd", "ru") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Здравствуйте из Джона Снежных Лабораторий! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Здравствуйте из Джона Снежных Лабораторий! "] token_df = nlu.load('ru.pos').predict(text) token_df ```
## Results ```bash token pos 0 Здравствуйте NOUN 1 из ADP 2 Джона PROPN 3 Снежных ADJ 4 Лабораторий PROPN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_gsd| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|ru| --- layout: model title: Finnish BertForMaskedLM Base Cased model (from TurkuNLP) author: John Snow Labs name: bert_embeddings_base_finnish_cased_v1 date: 2022-12-02 tags: [fi, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: fi edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-finnish-cased-v1` is a Finnish model originally trained by `TurkuNLP`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_finnish_cased_v1_fi_4.2.4_3.0_1670017531921.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_finnish_cased_v1_fi_4.2.4_3.0_1670017531921.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_finnish_cased_v1","fi") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_finnish_cased_v1","fi") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_finnish_cased_v1| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|fi| |Size:|467.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1 - https://arxiv.org/abs/1912.07076 - https://github.com/google-research/bert - https://github.com/google-research/bert/blob/master/multilingual.md - https://raw.githubusercontent.com/TurkuNLP/FinBERT/master/img/yle-ylilauta-curves.png - https://fasttext.cc/ - https://github.com/spyysalo/finbert-text-classification - https://github.com/spyysalo/yle-corpus - https://github.com/spyysalo/ylilauta-corpus - https://arxiv.org/abs/1908.04212 - https://github.com/Traubert/FiNer-rules - https://arxiv.org/pdf/1908.04212.pdf - https://github.com/jouniluoma/keras-bert-ner - https://github.com/mpsilfve/finer-data - https://universaldependencies.org/ - https://github.com/spyysalo/bert-pos - http://hdl.handle.net/11234/1-2837 - http://dl.turkunlp.org/finbert/bert-base-finnish-uncased.zip - http://dl.turkunlp.org/finbert/bert-base-finnish-cased.zip --- layout: model title: English DistilBertForQuestionAnswering Cased model (from aszidon) author: John Snow Labs name: distilbert_qa_custom3 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbertcustom3` is an English model originally trained by `aszidon`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom3_en_4.3.0_3.0_1672774647602.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom3_en_4.3.0_3.0_1672774647602.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom3","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_custom3| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/aszidon/distilbertcustom3 --- layout: model title: Chinese Part of Speech Tagger (UPOS, Chinese Wikipedia Texts) author: John Snow Labs name: bert_pos_chinese_bert_wwm_ext_upos date: 2022-04-26 tags: [bert, pos, part_of_speech, zh, open_source] task: Part of Speech Tagging language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `chinese-bert-wwm-ext-upos` is a Chinese model originally trained by `KoichiYasuoka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_chinese_bert_wwm_ext_upos_zh_3.4.2_3.0_1650993237648.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_chinese_bert_wwm_ext_upos_zh_3.4.2_3.0_1650993237648.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_chinese_bert_wwm_ext_upos","zh") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_chinese_bert_wwm_ext_upos","zh") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.pos.chinese_bert_wwm_ext_upos").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_chinese_bert_wwm_ext_upos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|zh| |Size:|382.0 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/KoichiYasuoka/chinese-bert-wwm-ext-upos - https://universaldependencies.org/u/pos/ - https://github.com/KoichiYasuoka/esupar --- layout: model title: French RobertaForQuestionAnswering (from Gantenbein) author: John Snow Labs name: roberta_qa_ADDI_FR_RoBERTa date: 2022-06-20 tags: [open_source, question_answering, roberta] task: Question Answering language: fr edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ADDI-FR-RoBERTa` is a French model originally trained by `Gantenbein`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ADDI_FR_RoBERTa_fr_4.0.0_3.0_1655726415021.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ADDI_FR_RoBERTa_fr_4.0.0_3.0_1655726415021.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_ADDI_FR_RoBERTa","fr") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_ADDI_FR_RoBERTa","fr") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("fr.answer_question.roberta.fr_tuned.by_Gantenbein").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_ADDI_FR_RoBERTa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|fr| |Size:|422.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Gantenbein/ADDI-FR-RoBERTa --- layout: model title: Legal NER from Sigma Absa Dataset (PER+Pronouns) author: John Snow Labs name: legner_sigma_absa_people date: 2022-12-16 tags: [sigma, absa, people, pronouns, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Legal NER model trained on the Sigma Absa Dataset for legal sentiment analysis on legal parties, including coreference pronouns (he, him, their...). This is the first component, which extracts those people names and pronouns as NER entities. The second component, `legassertion_sigma_absa_sentiment`, performs Assertion Status detection to retrieve the sentiment. ## Predicted Entities `PER`, `O` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_sigma_absa_people_en_1.0.0_3.0_1671202164090.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_sigma_absa_people_en_1.0.0_3.0_1671202164090.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner = legal.NerDLModel.pretrained("legner_sigma_absa_people", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("label") pipe = nlp.Pipeline(stages = [ document_assembler, sentence_detector, tokenizer, embeddings, ner]) text = "Petitioner Jae Lee moved to the United States from South Korea with his parents when he was 13." sdf = spark.createDataFrame([[text]]).toDF("text") res = pipe.fit(sdf).transform(sdf) import pyspark.sql.functions as F res.select(F.explode(F.arrays_zip(res.token.result, res.label.result, res.label.metadata)).alias("cols"))\ .select(F.expr("cols['0']").alias("token"), F.expr("cols['1']").alias("ner_label"), F.expr("cols['2']['confidence']").alias("confidence")).show(200, truncate=100) ```
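For reference, the precision, recall, and F1 columns reported in the Benchmarking section follow the standard definitions over per-token true positives (tp), false positives (fp), and false negatives (fn). A minimal sketch (plain Python, independent of Spark NLP) reproducing the B-PER row of that table:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# B-PER row of the benchmarking table: tp=777, fp=11, fn=15
p, r, f = prf(777, 11, 15)
# precision ≈ 0.9860406, recall ≈ 0.9810606, F1 ≈ 0.9835443
```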
## Results ```bash +----------+---------+----------+ | token|ner_label|confidence| +----------+---------+----------+ |Petitioner| B-PER| 0.9997| | Jae| I-PER| 0.9952| | Lee| I-PER| 0.9951| | moved| O| 1.0| | to| O| 1.0| | the| O| 1.0| | United| O| 1.0| | States| O| 1.0| | from| O| 1.0| | South| O| 1.0| | Korea| O| 1.0| | with| O| 1.0| | his| B-PER| 1.0| | parents| O| 0.9998| | when| O| 1.0| | he| B-PER| 1.0| | was| O| 1.0| | 13| O| 1.0| | .| O| 1.0| +----------+---------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_sigma_absa_people| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.1 MB| ## References https://metatext.io/datasets/sigmalaw-absa ## Benchmarking ```bash label tp fp fn prec rec f1 I-PER 43 2 0 0.95555556 1.0 0.97727275 B-PER 777 11 15 0.9860406 0.9810606 0.98354435 Macro-average 820 13 15 0.9707981 0.9905303 0.9805649 Micro-average 820 13 15 0.9843938 0.98203593 0.9832135 ``` --- layout: model title: English Bert Embeddings (from beatrice-portelli) author: John Snow Labs name: bert_embeddings_DiLBERT date: 2022-04-11 tags: [bert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `DiLBERT` is an English model originally trained by `beatrice-portelli`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_DiLBERT_en_3.4.2_3.0_1649672580183.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_DiLBERT_en_3.4.2_3.0_1649672580183.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_DiLBERT","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_DiLBERT","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.DiLBERT").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_DiLBERT| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|410.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/beatrice-portelli/DiLBERT - https://github.com/KevinRoitero/dilbert --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_nl16 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-nl16` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl16_en_4.3.0_3.0_1675121587021.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl16_en_4.3.0_3.0_1675121587021.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_nl16","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_nl16","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_nl16| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|288.9 MB| ## References - https://huggingface.co/google/t5-efficient-small-nl16 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Legal Confidential Information Clause Binary Classifier author: John Snow Labs name: legclf_confidential_information_clause date: 2022-12-18 tags: [en, legal, classification, licensed, clause, bert, confidential, information, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the confidential-information clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it’s better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `confidential-information`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_confidential_information_clause_en_1.0.0_3.0_1671393646578.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_confidential_information_clause_en_1.0.0_3.0_1671393646578.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_confidential_information_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
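The paragraph splitting (by multiline) recommended in the description can be done before the pipeline, so each paragraph is classified on its own row. A minimal standalone sketch in plain Python — the regex and helper below are illustrative assumptions, not part of the Spark NLP API:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs on blank lines (multiline split)."""
    # Two or more consecutive newlines mark a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)
    # Drop empty fragments and surrounding whitespace.
    return [p.strip() for p in paragraphs if p.strip()]

doc = ("CONFIDENTIALITY.\nEach party shall keep the terms secret.\n\n"
       "GOVERNING LAW.\nThis Agreement is governed by the laws of Delaware.")
print(split_paragraphs(doc))
```

Each resulting paragraph can then be loaded as a separate row of the `text` column fed to the classifier above, keeping every input under the 512-token limit.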
## Results ```bash +-------+ |result| +-------+ |[confidential-information]| |[other]| |[other]| |[confidential-information]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_confidential_information_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support confidential-information 1.00 1.00 1.00 25 other 1.00 1.00 1.00 39 accuracy - - 1.00 64 macro-avg 1.00 1.00 1.00 64 weighted-avg 1.00 1.00 1.00 64 ``` --- layout: model title: Vaccine Sentiment Classifier (PHS-BERT) author: John Snow Labs name: classifierdl_vaccine_sentiment date: 2022-07-28 tags: [public_health, en, licensed, vaccine_sentiment, classification] task: Sentiment Analysis language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [PHS-BERT](https://arxiv.org/abs/2204.04521)-based sentiment analysis model that can extract information from COVID-19 Vaccine-related tweets. The model predicts whether a tweet contains positive, negative, or neutral sentiments about COVID-19 Vaccines. ## Predicted Entities `negative`, `positive`, `neutral` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifierdl_vaccine_sentiment_en_4.0.0_3.0_1658998378316.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifierdl_vaccine_sentiment_en_4.0.0_3.0_1658998378316.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") bert_embeddings = BertEmbeddings.pretrained("bert_embeddings_phs_bert", "en", "public/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") embeddingsSentence = SentenceEmbeddings() \ .setInputCols(["sentence", "embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") classifierdl = ClassifierDLModel.pretrained('classifierdl_vaccine_sentiment', "en", "clinical/models")\ .setInputCols(['sentence_embeddings'])\ .setOutputCol('class') pipeline = Pipeline( stages = [ document_assembler, tokenizer, bert_embeddings, embeddingsSentence, classifierdl ]) text_list = ['A little bright light for an otherwise dark week. Thanks researchers, and frontline workers. Onwards.', 'People with a history of severe allergic reaction to any component of the vaccine should not take.', '43 million doses of vaccines administrated worldwide...Production capacity of CHINA to reach 4 b'] data = spark.createDataFrame(text_list, StringType()).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_phs_bert", "en") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val sentence_embeddings = new SentenceEmbeddings() .setInputCols(Array("sentence", "embeddings")) .setOutputCol("sentence_embeddings") .setPoolingStrategy("AVERAGE") val classifier = ClassifierDLModel.pretrained("classifierdl_vaccine_sentiment", "en", "clinical/models") .setInputCols(Array("sentence_embeddings")) .setOutputCol("class") val 
bert_clf_pipeline = new Pipeline().setStages(Array(documenter, tokenizer, embeddings, sentence_embeddings, classifier)) val data = Seq("A little bright light for an otherwise dark week. Thanks researchers, and frontline workers. Onwards.", "People with a history of severe allergic reaction to any component of the vaccine should not take.", "43 million doses of vaccines administrated worldwide...Production capacity of CHINA to reach 4 b").toDS.toDF("text") val result = bert_clf_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.vaccine_sentiment").predict("""A little bright light for an otherwise dark week. Thanks researchers, and frontline workers. Onwards.""") ```
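The `AVERAGE` pooling strategy set on `SentenceEmbeddings` above simply averages a sentence's token vectors into one sentence vector. A tiny standalone illustration in plain Python (a deliberate simplification, not the Spark NLP implementation):

```python
def average_pooling(token_vectors):
    """Average a list of equally-sized token embedding vectors."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    # Component-wise mean over all token vectors.
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Three toy 4-dimensional token embeddings.
tokens = [[1.0, 2.0, 3.0, 4.0],
          [3.0, 2.0, 1.0, 0.0],
          [2.0, 2.0, 2.0, 2.0]]
print(average_pooling(tokens))  # → [2.0, 2.0, 2.0, 2.0]
```

The resulting fixed-length vector is what the downstream `ClassifierDLModel` consumes as `sentence_embeddings`.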
## Results ```bash +-----------------------------------------------------------------------------------------------------+----------+ |text |class | +-----------------------------------------------------------------------------------------------------+----------+ |A little bright light for an otherwise dark week. Thanks researchers, and frontline workers. Onwards.|[positive]| |People with a history of severe allergic reaction to any component of the vaccine should not take. |[negative]| |43 million doses of vaccines administrated worldwide...Production capacity of CHINA to reach 4 b |[neutral] | +-----------------------------------------------------------------------------------------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_vaccine_sentiment| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|24.2 MB| ## References Curated from several academic and in-house datasets. ## Benchmarking ```bash label precision recall f1-score support neutral 0.76 0.72 0.74 1008 positive 0.80 0.79 0.80 966 negative 0.76 0.81 0.78 916 accuracy - - 0.77 2890 macro-avg 0.77 0.77 0.77 2890 weighted-avg 0.77 0.77 0.77 2890 ``` --- layout: model title: Portuguese asr_bp_commonvoice10_xlsr TFWav2Vec2ForCTC from lgris author: John Snow Labs name: asr_bp_commonvoice10_xlsr date: 2022-09-26 tags: [wav2vec2, pt, audio, open_source, asr] task: Automatic Speech Recognition language: pt edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_bp_commonvoice10_xlsr` is a Portuguese model originally trained by lgris. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_bp_commonvoice10_xlsr_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_bp_commonvoice10_xlsr_pt_4.2.0_3.0_1664196471529.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_bp_commonvoice10_xlsr_pt_4.2.0_3.0_1664196471529.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_bp_commonvoice10_xlsr", "pt")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_bp_commonvoice10_xlsr", "pt") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
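Under the hood, a Wav2Vec2ForCTC model emits one character prediction per audio frame, and CTC decoding collapses those frames into text. A minimal greedy CTC collapse in plain Python (an illustrative simplification, not the annotator's actual decoder):

```python
def ctc_greedy_collapse(frame_labels, blank="-"):
    """Collapse repeated frame labels, then drop CTC blank symbols."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            # Keep a label only when it differs from the previous frame.
            out.append(label)
        prev = label
    return "".join(out)

# Per-frame argmax labels for the word "ola" ("-" is the CTC blank).
frames = ["-", "o", "o", "-", "l", "l", "a", "a", "-"]
print(ctc_greedy_collapse(frames))  # → ola
```

Note how the blank symbol lets the decoder keep genuine double letters: `["h", "h", "-", "e", "l", "-", "l", "o"]` collapses to `hello`, not `helo`.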
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_bp_commonvoice10_xlsr| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|pt| |Size:|756.4 MB| --- layout: model title: Fast Neural Machine Translation Model from English to Sango author: John Snow Labs name: opus_mt_en_sg date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, sg, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial ones. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `en` - target languages: `sg` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_sg_xx_2.7.0_2.4_1609163620103.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_sg_xx_2.7.0_2.4_1609163620103.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_sg", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_sg", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.sg').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_sg| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Arabic BertForQuestionAnswering model (from gfdgdfgdg) author: John Snow Labs name: bert_qa_arap_qa_bert_large_v2 date: 2022-06-02 tags: [ar, open_source, question_answering, bert] task: Question Answering language: ar edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `arap_qa_bert_large_v2` is an Arabic model originally trained by `gfdgdfgdg`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_arap_qa_bert_large_v2_ar_4.0.0_3.0_1654179148138.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_arap_qa_bert_large_v2_ar_4.0.0_3.0_1654179148138.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_arap_qa_bert_large_v2","ar") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_arap_qa_bert_large_v2","ar") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("ar.answer_question.bert.large_v2").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
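Extractive QA models like `BertForQuestionAnswering` score every context token as a possible answer start and end, then return the best-scoring span. A toy standalone illustration of that span-selection step (plain Python with made-up scores; a deliberate simplification of the real model):

```python
def best_span(tokens, start_scores, end_scores, max_len=8):
    """Pick the span (s, e) maximizing start_scores[s] + end_scores[e], s <= e."""
    best = (0, 0, float("-inf"))
    for s in range(len(tokens)):
        # Bound span length, as real QA decoders do.
        for e in range(s, min(len(tokens), s + max_len)):
            score = start_scores[s] + end_scores[e]
            if score > best[2]:
                best = (s, e, score)
    return " ".join(tokens[best[0]:best[1] + 1])

context = "My name is Clara and I live in Berkeley".split()
# Hypothetical start/end logits peaking on "Clara".
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 0.3]
end   = [0.0, 0.1, 0.0, 6.0, 0.2, 0.0, 0.1, 0.0, 0.4]
print(best_span(context, start, end))  # → Clara
```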
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_arap_qa_bert_large_v2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ar| |Size:|1.4 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/gfdgdfgdg/arap_qa_bert_large_v2 --- layout: model title: English RobertaForQuestionAnswering Cased model (from Andranik) author: John Snow Labs name: roberta_qa_test_v1 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `TestQaV1` is an English model originally trained by `Andranik`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_test_v1_en_4.3.0_3.0_1674208905005.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_test_v1_en_4.3.0_3.0_1674208905005.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_test_v1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_test_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_test_v1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Andranik/TestQaV1 --- layout: model title: XLNet Base CoNLL-03 NER Pipeline author: John Snow Labs name: xlnet_base_token_classifier_conll03_pipeline date: 2022-04-21 tags: [ner, english, xlnet, base, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [xlnet_base_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/28/xlnet_base_token_classifier_conll03_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlnet_base_token_classifier_conll03_pipeline_en_3.4.1_3.0_1650540715035.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlnet_base_token_classifier_conll03_pipeline_en_3.4.1_3.0_1650540715035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("xlnet_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("xlnet_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ```
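The `NerConverter` stage included in this pipeline merges token-level IOB tags into the entity chunks shown in the results below. A minimal standalone sketch of that merging logic (plain Python; an illustration, not the annotator's source):

```python
def iob_to_chunks(tokens, tags):
    """Group B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)  # continue the current chunk
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = "My name is John and I work at John Snow Labs .".split()
tags = ["O", "O", "O", "B-PER", "O", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O"]
print(iob_to_chunks(tokens, tags))  # → [('John', 'PER'), ('John Snow Labs', 'ORG')]
```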
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PER | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlnet_base_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|438.5 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - XlnetForTokenClassification - NerConverter - Finisher --- layout: model title: English asr_xlsr_53_bemba_5hrs TFWav2Vec2ForCTC from csikasote author: John Snow Labs name: pipeline_asr_xlsr_53_bemba_5hrs date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_xlsr_53_bemba_5hrs` is an English model originally trained by csikasote. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_xlsr_53_bemba_5hrs_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xlsr_53_bemba_5hrs_en_4.2.0_3.0_1664035866138.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xlsr_53_bemba_5hrs_en_4.2.0_3.0_1664035866138.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_xlsr_53_bemba_5hrs', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_xlsr_53_bemba_5hrs", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_xlsr_53_bemba_5hrs| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Stop Words Cleaner for Armenian author: John Snow Labs name: stopwords_hy date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: hy edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, hy] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_hy_hy_2.5.4_2.4_1594742439626.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_hy_hy_2.5.4_2.4_1594742439626.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_hy", "hy") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Հյուսիսային թագավոր լինելուց բացի, Johnոն Սնոուն անգլիացի բժիշկ է և անզգայացման և բժշկական հիգիենայի զարգացման առաջատար:") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_hy", "hy") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Հյուսիսային թագավոր լինելուց բացի, Johnոն Սնոուն անգլիացի բժիշկ է և անզգայացման և բժշկական հիգիենայի զարգացման առաջատար:").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Հյուսիսային թագավոր լինելուց բացի, Johnոն Սնոուն անգլիացի բժիշկ է և անզգայացման և բժշկական հիգիենայի զարգացման առաջատար"""] stopword_df = nlu.load('hy.stopwords').predict(text) stopword_df[['cleanTokens']] ```
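Conceptually, a stop-words cleaner just filters tokens against a fixed word list. A tiny standalone sketch of that idea in plain Python (the English stop words below are illustrative examples only; the actual model ships an Armenian list):

```python
def clean_tokens(tokens, stopwords):
    """Keep only tokens whose lowercase form is not in the stop-word set."""
    return [t for t in tokens if t.lower() not in stopwords]

# Hypothetical stop-word set for illustration.
stopwords = {"the", "a", "of", "and", "is"}
tokens = "John Snow is a leader in the development of anaesthesia".split()
print(clean_tokens(tokens, stopwords))
# → ['John', 'Snow', 'leader', 'in', 'development', 'anaesthesia']
```

The case-insensitive lookup mirrors the model's `Case sensitive: false` setting shown in the table below.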
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=10, result='Հյուսիսային', metadata={'sentence': '0'}), Row(annotatorType='token', begin=12, end=18, result='թագավոր', metadata={'sentence': '0'}), Row(annotatorType='token', begin=20, end=27, result='լինելուց', metadata={'sentence': '0'}), Row(annotatorType='token', begin=29, end=32, result='բացի', metadata={'sentence': '0'}), Row(annotatorType='token', begin=33, end=33, result=',', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_hy| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|hy| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: English asr_wav2vec2_base_timit_demo_colab1_by_sherry7144 TFWav2Vec2ForCTC from sherry7144 author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab1_by_sherry7144 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_base_timit_demo_colab1_by_sherry7144` is an English model originally trained by sherry7144. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab1_by_sherry7144_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab1_by_sherry7144_en_4.2.0_3.0_1664018634076.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab1_by_sherry7144_en_4.2.0_3.0_1664018634076.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab1_by_sherry7144", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab1_by_sherry7144", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab1_by_sherry7144| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: Legal Settlement Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_settlement_agreement_bert date: 2022-11-25 tags: [en, legal, classification, agreement, settlement, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_settlement_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `settlement-agreement` or not (Binary Classification). Unlike the Longformer-based version, this model is lighter and faster at inference. ## Predicted Entities `settlement-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_settlement_agreement_bert_en_1.0.0_3.0_1669371428132.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_settlement_agreement_bert_en_1.0.0_3.0_1669371428132.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_settlement_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[settlement-agreement]| |[other]| |[other]| |[settlement-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_settlement_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.93 0.95 0.94 65 settlement-agreement 0.90 0.85 0.88 33 accuracy - - 0.92 98 macro-avg 0.91 0.90 0.91 98 weighted-avg 0.92 0.92 0.92 98 ``` --- layout: model title: Word2Vec Embeddings in English (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-15 tags: [cc, embeddings, fastText, word2vec, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_en_3.4.1_3.0_1647368332088.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_en_3.4.1_3.0_1647368332088.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from pythonist) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_pubmedbykrs date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-pubmedbykrs` is an English model originally trained by `pythonist`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_pubmedbykrs_en_4.3.0_3.0_1672768156281.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_pubmedbykrs_en_4.3.0_3.0_1672768156281.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_pubmedbykrs","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_pubmedbykrs","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_pubmedbykrs| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/pythonist/distilbert-base-uncased-finetuned-pubmedbykrs --- layout: model title: Google T5 for Closed Book Question Answering Small author: John Snow Labs name: google_t5_small_ssm_nq date: 2021-01-08 task: Question Answering language: en nav_key: models edition: Spark NLP 2.7.1 spark_version: 2.4 tags: [open_source, t5, seq2seq, question_answering, en] supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The model was pre-trained using T5's denoising objective on [C4](https://huggingface.co/datasets/c4), subsequently additionally pre-trained using REALM's salient span masking objective on Wikipedia, and finally fine-tuned on [Natural Questions (NQ)](https://huggingface.co/datasets/natural_questions). Note: The model was fine-tuned on 100% of the train splits of Natural Questions (NQ) for 10k steps. 
Other community checkpoints are available on the Hugging Face model hub. Paper: [How Much Knowledge Can You Pack Into the Parameters of a Language Model?](https://arxiv.org/abs/2002.08910) {:.btn-box} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/T5TRANSFORMER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/google_t5_small_ssm_nq_en_2.7.1_2.4_1610137175322.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/google_t5_small_ssm_nq_en_2.7.1_2.4_1610137175322.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Set one of the following task prefixes, or include the prefix inline with your input: - nq question: - trivia question: - question: - nq:
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") t5 = T5Transformer \ .pretrained("google_t5_small_ssm_nq") \ .setTask("nq:")\ .setMaxOutputLength(200)\ .setInputCols(["documents"]) \ .setOutputCol("answer") pipeline = Pipeline().setStages([document_assembler, t5]) data_df = spark.createDataFrame([["Who invented the telephone?"]]).toDF("text") results = pipeline.fit(data_df).transform(data_df) results.select("answer.result").show(truncate=False) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("documents") val t5 = T5Transformer .pretrained("google_t5_small_ssm_nq") .setTask("nq:") .setInputCols(Array("documents")) .setOutputCol("answer") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val dataDf = Seq("Who invented the telephone?").toDF("text") val model = pipeline.fit(dataDf) val results = model.transform(dataDf) results.select("answer.result").show(truncate = false) ``` {:.nlu-block} ```python import nlu nlu.load("en.t5").predict("""Put your text here.""") ```
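Conceptually, `setTask("nq:")` just prepends the chosen task prefix to each input document before it reaches the T5 model. A minimal illustrative sketch of that convention (`with_task` is a hypothetical helper, not part of the Spark NLP API):

```python
# Illustrative only: mimic what a T5 task prefix does to the input text.
# The prefix strings are the ones listed in this card; with_task() is a
# hypothetical helper, not a Spark NLP function.
def with_task(prefix: str, texts: list) -> list:
    """Prepend a T5 task prefix to each input text."""
    return [f"{prefix} {t.strip()}" for t in texts]

print(with_task("nq:", ["Who invented the telephone?"]))
# ['nq: Who invented the telephone?']
```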
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|google_t5_small_ssm_nq| |Compatibility:|Spark NLP 2.7.1+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[t5]| |Language:|en| ## Data Source The model was pre-trained using T5's denoising objective on C4, subsequently additionally pre-trained using REALM's salient span masking objective on Wikipedia, and finally fine-tuned on Natural Questions (NQ). --- layout: model title: Malay ALBERT Embeddings (Base) author: John Snow Labs name: albert_embeddings_albert_base_bahasa_cased date: 2022-04-14 tags: [albert, embeddings, ms, open_source] task: Embeddings language: ms edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: AlBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `albert-base-bahasa-cased` is a Malay model originally trained by `malay-huggingface`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_base_bahasa_cased_ms_3.4.2_3.0_1649954337875.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_base_bahasa_cased_ms_3.4.2_3.0_1649954337875.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_base_bahasa_cased","ms") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Saya suka Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_base_bahasa_cased","ms") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Saya suka Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ms.embed.albert_base_bahasa_cased").predict("""Saya suka Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_embeddings_albert_base_bahasa_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ms| |Size:|45.7 MB| |Case sensitive:|false| ## References - https://huggingface.co/malay-huggingface/albert-base-bahasa-cased - https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean - https://github.com/huseinzol05/malay-dataset/tree/master/corpus/pile - https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/albert --- layout: model title: Fast Neural Machine Translation Model from Finnish to English author: John Snow Labs name: opus_mt_fi_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, fi, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `fi` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_fi_en_xx_2.7.0_2.4_1609170172454.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_fi_en_xx_2.7.0_2.4_1609170172454.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_fi_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Hyvää huomenta!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_fi_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Hyvää huomenta!").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.fi.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
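`fullAnnotate` returns one dictionary of annotations per input text, and the document-level translation is just the concatenation of the sentence-level `translation` results. A rough post-processing sketch, using a stand-in annotation class rather than the real Spark NLP `Annotation`:

```python
# Illustrative only: the stand-in class below mimics the one attribute of
# Spark NLP's Annotation that matters here; real fullAnnotate() output has
# the same overall shape (a list of dicts of annotation lists).
class Annotation:
    def __init__(self, result):
        self.result = result

full_annotate_output = [{"translation": [Annotation("Good morning!"),
                                         Annotation("How are you?")]}]

def join_translations(output):
    """Concatenate the sentence-level translation results of the first document."""
    return " ".join(a.result for a in output[0]["translation"])

print(join_translations(full_annotate_output))  # Good morning! How are you?
```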
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_fi_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from leixu) author: John Snow Labs name: xlmroberta_ner_leixu_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `leixu`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_leixu_base_finetuned_panx_de_4.1.0_3.0_1660435051026.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_leixu_base_finetuned_panx_de_4.1.0_3.0_1660435051026.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_leixu_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_leixu_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_leixu_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/leixu/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: English BertForQuestionAnswering model (from MichelBartels) author: John Snow Labs name: bert_qa_tinybert_6l_768d_squad2_large_teacher_finetuned date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tinybert-6l-768d-squad2-large-teacher-finetuned` is an English model originally trained by `MichelBartels`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_tinybert_6l_768d_squad2_large_teacher_finetuned_en_4.0.0_3.0_1654192477274.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_tinybert_6l_768d_squad2_large_teacher_finetuned_en_4.0.0_3.0_1654192477274.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tinybert_6l_768d_squad2_large_teacher_finetuned","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_tinybert_6l_768d_squad2_large_teacher_finetuned","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.large_tiny_768d.by_MichelBartels").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
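The NLU snippet above packs the question and the context into a single string separated by `|||`. A tiny sketch of that input convention (`qa_input` is a made-up convenience wrapper, not part of the nlu API):

```python
# nlu's question-answering models take "question|||context" strings;
# qa_input() is a hypothetical helper that builds that format.
def qa_input(question: str, context: str) -> str:
    """Join a question and its context with nlu's '|||' separator."""
    return f"{question}|||{context}"

print(qa_input("What's my name?", "My name is Clara and I live in Berkeley."))
# What's my name?|||My name is Clara and I live in Berkeley.
```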
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_tinybert_6l_768d_squad2_large_teacher_finetuned| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|249.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/MichelBartels/tinybert-6l-768d-squad2-large-teacher-finetuned --- layout: model title: XLNet Base CoNLL-03 NER Pipeline author: John Snow Labs name: xlnet_base_token_classifier_conll03_pipeline date: 2022-06-19 tags: [ner, english, xlnet, base, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [xlnet_base_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/28/xlnet_base_token_classifier_conll03_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlnet_base_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655654102760.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlnet_base_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655654102760.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("xlnet_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("xlnet_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ```
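`annotate` returns the recognized chunks and their labels as parallel lists; the field names below are assumptions matching this pipeline's output and may differ for other pipelines. A small sketch that pairs them up, reproducing the shape of the Results table:

```python
# Illustrative only: the dict below is a hand-built stand-in for the
# annotate() output of the sentence used in this card; real field names
# vary by pipeline.
annotations = {
    "ner_chunk": ["John", "John Snow Labs"],
    "ner_label": ["PER", "ORG"],
}

def chunks_with_labels(result):
    """Zip chunk texts with their predicted entity labels."""
    return list(zip(result["ner_chunk"], result["ner_label"]))

print(chunks_with_labels(annotations))
# [('John', 'PER'), ('John Snow Labs', 'ORG')]
```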
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PER | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlnet_base_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|438.5 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - XlnetForTokenClassification - NerConverter - Finisher --- layout: model title: Fast Neural Machine Translation Model from English to South Slavic author: John Snow Labs name: opus_mt_en_zls date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, zls, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `zls` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_zls_xx_2.7.0_2.4_1609170551787.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_zls_xx_2.7.0_2.4_1609170551787.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_zls", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("I love learning languages.") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_zls", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("I love learning languages.").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.zls').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_zls| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pre-trained Pipeline for Few-NERD NER Model author: John Snow Labs name: nerdl_fewnerd_subentity_100d_pipeline date: 2021-09-28 tags: [fewnerd, ner, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.3.0 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on Few-NERD/inter public dataset and it extracts 66 entities that are in general scope. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_FEW_NERD/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_FewNERD.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/nerdl_fewnerd_subentity_100d_pipeline_en_3.3.0_2.4_1632828937931.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/nerdl_fewnerd_subentity_100d_pipeline_en_3.3.0_2.4_1632828937931.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python fewnerd_pipeline = PretrainedPipeline("nerdl_fewnerd_subentity_100d_pipeline", lang = "en") fewnerd_pipeline.annotate("""12 Corazones ('12 Hearts') is Spanish-language dating game show produced in the United States for the television network Telemundo since January 2005, based on its namesake Argentine TV show format. The show is filmed in Los Angeles and revolves around the twelve Zodiac signs that identify each contestant. In 2008, Ho filmed a cameo in the Steven Spielberg feature film The Cloverfield Paradox, as a news pundit.""") ``` ```scala val pipeline = new PretrainedPipeline("nerdl_fewnerd_subentity_100d_pipeline", lang = "en") val result = pipeline.fullAnnotate("12 Corazones ('12 Hearts') is Spanish-language dating game show produced in the United States for the television network Telemundo since January 2005, based on its namesake Argentine TV show format. The show is filmed in Los Angeles and revolves around the twelve Zodiac signs that identify each contestant. In 2008, Ho filmed a cameo in the Steven Spielberg feature film The Cloverfield Paradox, as a news pundit.")(0) ```
## Results ```bash +-----------------------+----------------------------+ |chunk |ner_label | +-----------------------+----------------------------+ |Corazones ('12 Hearts')|art-broadcastprogram | |Spanish-language |other-language | |United States |location-GPE | |Telemundo |organization-media/newspaper| |Argentine TV |organization-media/newspaper| |Los Angeles |location-GPE | |Steven Spielberg |person-director | |Cloverfield Paradox |art-film | +-----------------------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|nerdl_fewnerd_subentity_100d_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.3.0+| |License:|Open Source| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - NerDLModel - NerConverter - Finisher --- layout: model title: Augment Tickers with NASDAQ database author: John Snow Labs name: finmapper_nasdaq_ticker date: 2022-08-09 tags: [en, finance, companies, tickers, nasdaq, licensed] task: Chunk Mapping language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Given a ticker, this model retrieves information about the company, including its name, industry, and sector. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FIN_LEG_COMPANY_AUGMENTATION/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finmapper_nasdaq_ticker_en_1.0.0_3.2_1660038524908.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finmapper_nasdaq_ticker_en_1.0.0_3.2_1660038524908.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') tokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") tokenClassifier = nlp.RoBertaForTokenClassification.pretrained("finner_roberta_ticker", "en", "finance/models")\ .setInputCols(["document",'token'])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") CM = finance.ChunkMapperModel()\ .pretrained('finmapper_nasdaq_ticker', 'en', 'finance/models')\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings")\ .setRel('company_name') pipeline = Pipeline().setStages([document_assembler, tokenizer, tokenClassifier, ner_converter, CM]) text = ["""There are some serious purchases and sales of AMZN stock today."""] test_data = spark.createDataFrame([text]).toDF("text") model = pipeline.fit(test_data) res = model.transform(test_data) res.select('mappings').collect() ```
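Each collected `mappings` annotation carries its relation name in its metadata, one row per relation, as the Results section below the code shows. A hedged post-processing sketch that collapses those rows into a single lookup dict (the row dicts are hand-built stand-ins, not the real `Row` objects):

```python
# Illustrative only: stand-ins for the metadata of collected 'mappings'
# annotations; the real output has the same relation/result structure.
rows = [
    {"result": "AMZN", "metadata": {"entity": "AMZN", "relation": "ticker"}},
    {"result": "Amazon.com Inc.", "metadata": {"entity": "AMZN", "relation": "company_name"}},
    {"result": "Consumer Cyclical", "metadata": {"entity": "AMZN", "relation": "sector"}},
]

def mappings_by_relation(mapping_rows):
    """Index mapping results by their relation name."""
    return {r["metadata"]["relation"]: r["result"] for r in mapping_rows}

print(mappings_by_relation(rows)["company_name"])  # Amazon.com Inc.
```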
## Results ```bash [Row(mappings=[Row(annotatorType='labeled_dependency', begin=46, end=49, result='AMZN', metadata={'sentence': '0', 'chunk': '0', 'entity': 'AMZN', 'relation': 'ticker', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=46, end=49, result='Amazon.com Inc.', metadata={'sentence': '0', 'chunk': '0', 'entity': 'AMZN', 'relation': 'company_name', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=46, end=49, result='Amazon.com', metadata={'sentence': '0', 'chunk': '0', 'entity': 'AMZN', 'relation': 'short_name', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=46, end=49, result='Retail - Apparel & Specialty', metadata={'sentence': '0', 'chunk': '0', 'entity': 'AMZN', 'relation': 'industry', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=46, end=49, result='Consumer Cyclical', metadata={'sentence': '0', 'chunk': '0', 'entity': 'AMZN', 'relation': 'sector', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=57, end=61, result='NONE', metadata={'sentence': '0', 'chunk': '1', 'entity': 'today'}, embeddings=[])])] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finmapper_nasdaq_ticker| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|210.3 KB| ## References https://data.world/johnsnowlabs/list-of-companies-in-nasdaq-exchanges --- layout: model title: Vietnamese BertForQuestionAnswering model (from tiennvcs) author: John Snow Labs name: bert_qa_bert_base_uncased_finetuned_vi_infovqa date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: vi edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: 
"Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-finetuned-vi-infovqa` is a Vietnamese model orginally trained by `tiennvcs`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_finetuned_vi_infovqa_vi_4.0.0_3.0_1654181242200.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_finetuned_vi_infovqa_vi_4.0.0_3.0_1654181242200.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_finetuned_vi_infovqa","vi") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_finetuned_vi_infovqa","vi") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("vi.answer_question.bert.vi_infovqa.base_uncased.by_tiennvcs").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
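The `nlu.load(...).predict(...)` one-liner above packs the question and the context into a single string separated by `|||`. A minimal sketch of that string convention (inferred from the example, not an official nlu API):

```python
SEP = "|||"

def join_qc(question, context):
    """Join a question and a context into the single-string payload."""
    return f"{question}{SEP}{context}"

def split_qc(payload):
    """Recover (question, context) from the joined payload."""
    question, _, context = payload.partition(SEP)
    return question, context

payload = join_qc("What's my name?", "My name is Clara and I live in Berkeley.")
print(split_qc(payload)[0])  # What's my name?
```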
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_finetuned_vi_infovqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|vi| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/tiennvcs/bert-base-uncased-finetuned-vi-infovqa --- layout: model title: English asr_wolof_ASR TFWav2Vec2ForCTC from khady author: John Snow Labs name: pipeline_asr_wolof_ASR date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wolof_ASR` is an English model originally trained by khady. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wolof_ASR_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wolof_ASR_en_4.2.0_3.0_1664024690847.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wolof_ASR_en_4.2.0_3.0_1664024690847.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wolof_ASR', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wolof_ASR", lang = "en") val annotations = pipeline.transform(audioDF) ```
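ASR pipelines like this one consume the raw waveform as an array of floats. A stdlib-only sketch of decoding a WAV file into that representation (mono 16-bit PCM assumed; this is an illustration of the data preparation, not part of the Spark NLP API):

```python
import os
import struct
import tempfile
import wave

def wav_to_floats(path):
    """Read a mono 16-bit PCM WAV file and scale samples to [-1.0, 1.0]."""
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "sketch assumes 16-bit PCM"
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Self-contained demo: write three 16-bit samples, read them back as floats.
path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<3h", 0, 16384, -32768))
print(wav_to_floats(path))  # [0.0, 0.5, -1.0]
```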
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wolof_ASR| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: General model for table detection author: John Snow Labs name: general_model_table_detection_v2 date: 2023-01-10 tags: [en, licensed] task: OCR Table Detection language: en nav_key: models edition: Visual NLP 4.1.0 spark_version: 3.2.1 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description For table detection, the use of CascadeTabNet is proposed. It is a Cascade mask Region-based CNN High-Resolution Network (Cascade mask R-CNN HRNet) based model that detects tables in input images. The model is evaluated on the ICDAR 2013, ICDAR 2019 and TableBank public datasets. It achieved 3rd rank in the ICDAR 2019 post-competition results for table detection while attaining the best accuracy results on the ICDAR 2013 and TableBank datasets. Here, the CascadeTabNet general model for table detection is used, inspired by https://arxiv.org/abs/2004.12629 ## Predicted Entities {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/Cards/SparkOcrImageTableDetection.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/general_model_table_detection_v2_en_3.3.0_3.0_1623301511401.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/general_model_table_detection_v2_en_3.3.0_3.0_1623301511401.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python binary_to_image = BinaryToImage() \ .setInputCol("content") \ .setOutputCol("image") \ .setImageType(ImageType.TYPE_3BYTE_BGR) table_detector = ImageTableDetector.pretrained("general_model_table_detection_v2", "en", "clinical/ocr") \ .setInputCol("image") \ .setOutputCol("table_regions") draw_regions = ImageDrawRegions() \ .setInputCol("image") \ .setInputRegionsCol("table_regions") \ .setOutputCol("image_with_regions") \ .setRectColor(Color.red) pipeline = PipelineModel(stages=[ binary_to_image, table_detector, draw_regions ]) # Download image: # !wget -q https://github.com/JohnSnowLabs/spark-ocr-workshop/raw/4.0.0-release-candidate/jupyter/data/tab_images/cTDaR_t10168.jpg imagePath = "cTDaR_t10168.jpg" image_df = spark.read.format("binaryFile").load(imagePath) result = pipeline.transform(image_df) ``` ```scala val binary_to_image = new BinaryToImage() .setInputCol("content") .setOutputCol("image") .setImageType(ImageType.TYPE_3BYTE_BGR) val table_detector = ImageTableDetector.pretrained("general_model_table_detection_v2", "en", "clinical/ocr") .setInputCol("image") .setOutputCol("table_regions") val draw_regions = new ImageDrawRegions() .setInputCol("image") .setInputRegionsCol("table_regions") .setOutputCol("image_with_regions") .setRectColor(Color.red) val pipeline = new PipelineModel().setStages(Array( binary_to_image, table_detector, draw_regions)) // Download image: // !wget -q https://github.com/JohnSnowLabs/spark-ocr-workshop/raw/4.0.0-release-candidate/jupyter/data/tab_images/cTDaR_t10168.jpg val imagePath = "cTDaR_t10168.jpg" val image_df = spark.read.format("binaryFile").load(imagePath) val result = pipeline.transform(image_df) ```
## Example {%- capture input_image -%} ![Screenshot](/assets/images/examples_ocr/image5.png) {%- endcapture -%} {%- capture output_image -%} ![Screenshot](/assets/images/examples_ocr/image5_out.png) {%- endcapture -%} {% include templates/input_output_image.md input_image=input_image output_image=output_image %} {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ocr_table_detection_general_model| |Type:|ocr| |Compatibility:|Visual NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Output Labels:|[table regions]| |Language:|en| ## Benchmarking ```bash 3rd rank in ICDAR 2019 post-competition 1st rank in ICDAR 2013 and TableBank dataset ``` --- layout: model title: Relation Extraction between different oncological entity types (granular version) author: John Snow Labs name: re_oncology_granular_wip date: 2022-09-27 tags: [licensed, clinical, oncology, en, relation_extraction, temporal, test, biomarker, anatomy] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: RelationExtractionModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Using this relation extraction model, four relation types can be identified: is_date_of (between date entities and other clinical entities), is_size_of (between Tumor_Finding and Tumor_Size), is_location_of (between anatomical entities and other entities) and is_finding_of (between test entities and their results). 
## Predicted Entities `is_size_of`, `is_finding_of`, `is_date_of`, `is_location_of`, `O` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_oncology_granular_wip_en_4.0.0_3.0_1664301874672.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_oncology_granular_wip_en_4.0.0_3.0_1664301874672.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use relation pairs to include only the combinations of entities that are relevant in your case.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") re_model = RelationExtractionModel.pretrained("re_oncology_granular_wip", "en", "clinical/models") \ .setInputCols(["embeddings", "pos_tags", "ner_chunk", "dependencies"]) \ .setOutputCol("relation_extraction") \ .setRelationPairs(["Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery"]) \ .setMaxSyntacticDistance(10) pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_model]) data = spark.createDataFrame([["A mastectomy was performed two months ago, and a 3 cm mass was extracted."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = 
SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentence", "pos_tags", "token")) .setOutputCol("dependencies") val re_model = RelationExtractionModel.pretrained("re_oncology_granular_wip", "en", "clinical/models") .setInputCols(Array("embeddings", "pos_tags", "ner_chunk", "dependencies")) .setOutputCol("relation_extraction") .setRelationPairs(Array("Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery")) .setMaxSyntacticDistance(10) val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_model)) val data = Seq("A mastectomy was performed two months ago, and a 3 cm mass was extracted.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.oncology_granular_wip").predict("""A mastectomy was performed two months ago, and a 3 cm mass was extracted.""") ```
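Each extracted relation carries a confidence score, as the Results section shows. A plain-Python sketch (field names assumed to mirror that table, not a Spark NLP API) of keeping only confident, non-`O` relations:

```python
def filter_relations(relations, threshold=0.9):
    """Keep relations that are not 'O' and meet the confidence threshold."""
    return [r for r in relations if r["relation"] != "O" and r["confidence"] >= threshold]

# Illustrative rows echoing the Results table, plus one rejected 'O' row.
rows = [
    {"chunk1": "mastectomy", "chunk2": "two months ago", "relation": "is_date_of", "confidence": 0.91336143},
    {"chunk1": "3 cm", "chunk2": "mass", "relation": "is_size_of", "confidence": 0.96745735},
    {"chunk1": "mass", "chunk2": "today", "relation": "O", "confidence": 0.99},
]
print([r["relation"] for r in filter_relations(rows)])  # ['is_date_of', 'is_size_of']
```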
## Results ```bash chunk1 entity1 chunk2 entity2 relation confidence mastectomy Cancer_Surgery two months ago Relative_Date is_date_of 0.91336143 3 cm Tumor_Size mass Tumor_Finding is_size_of 0.96745735 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_oncology_granular_wip| |Type:|re| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]| |Output Labels:|[relations]| |Language:|en| |Size:|267.2 KB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash relation recall precision f1 is_size_of 0.96 0.73 0.83 O 0.67 0.93 0.78 is_finding_of 0.94 0.75 0.83 is_date_of 0.94 0.54 0.69 is_location_of 0.94 0.81 0.87 macro-avg 0.89 0.75 0.80 ``` --- layout: model title: Legal Social Affairs Document Classifier (EURLEX) author: John Snow Labs name: legclf_social_affairs_bert date: 2023-03-06 tags: [en, legal, classification, clauses, social_affairs, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_social_affairs_bert model, a Bert Sentence Embeddings Document Classifier, classifies whether the document belongs to the class Social_Affairs or not (Binary Classification) according to EuroVoc labels.
## Predicted Entities `Social_Affairs`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_social_affairs_bert_en_1.0.0_3.0_1678111486819.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_social_affairs_bert_en_1.0.0_3.0_1678111486819.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_social_affairs_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
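The precision, recall and F1 figures reported under Benchmarking follow the standard definitions over a confusion matrix. A quick sketch with illustrative counts (chosen here to roughly reproduce the Social_Affairs row; they are not the actual evaluation counts):

```python
def prf(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = prf(tp=47, fp=11, fn=9)  # hypothetical counts for support=56
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.81 0.84 0.82
```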
## Results ```bash +-------+ |result| +-------+ |[Social_Affairs]| |[Other]| |[Other]| |[Social_Affairs]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_social_affairs_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.84 0.81 0.83 59 Social_Affairs 0.81 0.84 0.82 56 accuracy - - 0.83 115 macro-avg 0.83 0.83 0.83 115 weighted-avg 0.83 0.83 0.83 115 ``` --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_6 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-32-finetuned-squad-seed-6` is a English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_6_en_4.0.0_3.0_1655732701264.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_6_en_4.0.0_3.0_1655732701264.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_6","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_32d_seed_6").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|417.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-32-finetuned-squad-seed-6 --- layout: model title: ELECTRA Embeddings (ELECTRA Large) author: John Snow Labs name: electra_large_uncased date: 2020-08-27 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description ELECTRA is a BERT-like model that is pre-trained as a discriminator in a set-up resembling a generative adversarial network (GAN). It was originally published by: Kevin Clark and Minh-Thang Luong and Quoc V. Le and Christopher D. Manning: [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/forum?id=r1xMH1BtvB), ICLR 2020. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_large_uncased_en_2.6.0_2.4_1598485645331.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_large_uncased_en_2.6.0_2.4_1598485645331.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("electra_large_uncased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("electra_large_uncased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.electra.large_uncased').predict(text, output_level='token') embeddings_df ```
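Token vectors like the ones shown under Results are typically compared with cosine similarity. A stdlib-only sketch of that comparison (illustrative 2-d vectors, not actual ELECTRA embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```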
{:.h2_title} ## Results ```bash en_embed_electra_large_uncased_embeddings token [0.1289837807416916, -0.18811583518981934, 0.0... I [-0.02723774127662182, 0.0757141262292862, 0.3... love [0.4146347939968109, -0.31447598338127136, -0.... NLP ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_large_uncased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|1024| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/google/electra_large/2 --- layout: model title: BERT Sentence Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on MNLI author: John Snow Labs name: sent_bert_wiki_books_mnli date: 2021-08-31 tags: [en, open_source, sentence_embeddings, wikipedia_dataset, books_corpus_dataset, mnli_dataset] task: Embeddings language: en nav_key: models edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses a BERT base architecture initialized from https://tfhub.dev/google/experts/bert/wiki_books/1 and fine-tuned on MNLI. This is a BERT base architecture but some changes have been made to the original training and export scheme based on more recent learnings. This model is intended to be used for a variety of English NLP tasks. The pre-training data contains more formal text and the model may not generalize to more colloquial text such as social media or messages. This model is fine-tuned on the MNLI and is recommended for use in natural language inference tasks. The MNLI fine-tuning task is a textual entailment task and includes data from a range of genres of spoken and written text. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_wiki_books_mnli_en_3.2.0_3.0_1630412088632.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_wiki_books_mnli_en_3.2.0_3.0_1630412088632.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_wiki_books_mnli", "en") \ .setInputCols("sentence") \ .setOutputCol("bert_sentence") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings ]) ``` ```scala val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_wiki_books_mnli", "en") .setInputCols("sentence") .setOutputCol("bert_sentence") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings )) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] sent_embeddings_df = nlu.load('en.embed_sentence.bert.wiki_books_mnli').predict(text, output_level='sentence') sent_embeddings_df ```
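Sentence embeddings aggregate token-level vectors into a single vector per sentence. Mean pooling is one common aggregation; the sketch below shows it for intuition only (it is a simplification, not what `BertSentenceEmbeddings` does internally):

```python
def mean_pool(token_vectors):
    """Average token vectors component-wise into one sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Two illustrative 2-d token vectors pooled into one sentence vector.
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```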
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_bert_wiki_books_mnli| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[bert_sentence]| |Language:|en| |Case sensitive:|false| ## Data Source [1]: [Wikipedia dataset](https://dumps.wikimedia.org/) [2]: [BooksCorpus dataset](http://yknzhu.wixsite.com/mbweb) [3]: [MNLI dataset](https://cims.nyu.edu/~sbowman/multinli/) This Model has been imported from: https://tfhub.dev/google/experts/bert/wiki_books/mnli/2 --- layout: model title: Sentence Entity Resolver for SNOMED codes (procedures and measurements) author: John Snow Labs name: sbiobertresolve_snomed_procedures_measurements date: 2021-11-11 tags: [en, entity_resolution, licensed, clinical] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps medical entities to SNOMED codes using Sentence Bert Embeddings. The corpus of this model includes `Procedures` and `Measurement` domains. ## Predicted Entities `SNOMED Codes` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_procedures_measurements_en_3.3.0_3.0_1636595616820.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_procedures_measurements_en_3.3.0_3.0_1636595616820.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings\ .pretrained('sbiobert_base_cased_mli','en','clinical/models')\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_snomed_procedures_measurements", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("snomed_code") pipelineModel = PipelineModel( stages = [ documentAssembler, sbert_embedder, resolver]) l_model = LightPipeline(pipelineModel) result = l_model.fullAnnotate(['coronary calcium score', 'heart surgery', 'ct scan', 'bp value']) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sbert_embeddings") val resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_snomed_procedures_measurements", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("snomed_code") val pipelineModel = new PipelineModel().setStages(Array(document_assembler, sbert_embedder, resolver)) val l_model = new LightPipeline(pipelineModel) val result = l_model.fullAnnotate(Array("coronary calcium score", "heart surgery", "ct scan", "bp value")) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.snomed.procedures_measurements").predict("""coronary calcium score""")
## Results ```bash | | chunk | code | code_description | all_k_code_desc | all_k_codes | |---:|:-----------------------|----------:|:------------------------------|:--------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------| | 0 | coronary calcium score | 450360000 | Coronary artery calcium score | ['450360000', '450734004', '1086491000000104', '1086481000000101', '762241007'] | ['Coronary artery calcium score', 'Coronary artery calcium score', 'Dundee Coronary Risk Disk score', 'Dundee Coronary Risk rank', 'Dundee Coronary Risk Disk'] | | 1 | heart surgery | 2598006 | Open heart surgery | ['2598006', '64915003', '119766003', '34068001', '233004008'] | ['Open heart surgery', 'Operation on heart', 'Heart reconstruction', 'Heart valve replacement', 'Coronary sinus operation'] | | 2 | ct scan | 303653007 | CT of head | ['303653007', '431864000', '363023007', '418272005', '241577003'] | ['CT of head', 'CT guided injection', 'CT of site', 'CT angiography', 'CT of spine'] | | 3 | bp value | 75367002 | Blood pressure | ['75367002', '6797001', '723232008', '46973005', '427732000'] | ['Blood pressure', 'Mean blood pressure', 'Average blood pressure', 'Blood pressure taking', 'Speed of blood pressure response'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_snomed_procedures_measurements| |Compatibility:|Healthcare NLP 3.3.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_chunk_embeddings]| |Output Labels:|[output]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on `SNOMED` code dataset with `sbiobert_base_cased_mli` sentence embeddings. 
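The `all_k_codes` and `all_k_code_desc` columns in the Results table above are parallel, ranked candidate lists (note the duplicated top description for the first chunk). A plain-Python sketch of pairing and de-duplicating them while preserving rank order:

```python
def ranked_unique(codes, descriptions):
    """Pair candidate codes with descriptions, dropping repeated descriptions."""
    seen, out = set(), []
    for code, desc in zip(codes, descriptions):
        if desc not in seen:
            seen.add(desc)
            out.append((code, desc))
    return out

# Candidate lists copied from the first Results row.
codes = ['450360000', '450734004', '1086491000000104']
descs = ['Coronary artery calcium score', 'Coronary artery calcium score', 'Dundee Coronary Risk Disk score']
print(ranked_unique(codes, descs))
```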
--- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from aytugkaya) author: John Snow Labs name: xlmroberta_ner_aytugkaya_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `aytugkaya`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_aytugkaya_base_finetuned_panx_de_4.1.0_3.0_1660431282851.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_aytugkaya_base_finetuned_panx_de_4.1.0_3.0_1660431282851.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_aytugkaya_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_aytugkaya_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_aytugkaya_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/aytugkaya/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Korean BertForQuestionAnswering model (from ainize) author: John Snow Labs name: bert_qa_ainize_klue_bert_base_mrc date: 2022-06-02 tags: [ko, open_source, question_answering, bert] task: Question Answering language: ko edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `klue-bert-base-mrc` is a Korean model originally trained by `ainize`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_ainize_klue_bert_base_mrc_ko_4.0.0_3.0_1654188058903.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_ainize_klue_bert_base_mrc_ko_4.0.0_3.0_1654188058903.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_ainize_klue_bert_base_mrc","ko") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_ainize_klue_bert_base_mrc","ko") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("ko.answer_question.klue.bert.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_ainize_klue_bert_base_mrc| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ko| |Size:|413.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ainize/klue-bert-base-mrc - https://ainize.ai/ - https://main-klue-mrc-bert-scy6500.endpoint.ainize.ai/ - https://ainize.ai/scy6500/KLUE-MRC-BERT?branch=main - https://ainize.ai/teachable-nlp - https://link.ainize.ai/3FjvBVn --- layout: model title: Part of Speech for Bihari author: John Snow Labs name: pos_ud_bhtb date: 2021-03-09 tags: [part_of_speech, open_source, bihari, pos_ud_bhtb, bh] task: Part of Speech Tagging language: bh edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - NOUN - CCONJ - ADJ - PUNCT - ADP - VERB - PROPN - NUM - AUX - DET - PRON - SCONJ - PART - ADV - INTJ - X {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_bhtb_bh_3.0.0_3.0_1615292414604.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_bhtb_bh_3.0.0_3.0_1615292414604.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_bhtb", "bh") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Hello from John Snow Labs!']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_bhtb", "bh") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Hello from John Snow Labs!").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs!"] token_df = nlu.load('bh.pos').predict(text) token_df ```
## Results ```bash token pos 0 Hello NOUN 1 from NOUN 2 John NOUN 3 Snow NOUN 4 Labs NOUN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_bhtb| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|bh| --- layout: model title: English T5ForConditionalGeneration Large Cased model (from google) author: John Snow Labs name: t5_efficient_large_nl8 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-large-nl8` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_large_nl8_en_4.3.0_3.0_1675118058956.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_large_nl8_en_4.3.0_3.0_1675118058956.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_large_nl8","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_large_nl8","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_large_nl8| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|576.4 MB| ## References - https://huggingface.co/google/t5-efficient-large-nl8 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Hindi Part of Speech Tagger (from sagorsarker) author: John Snow Labs name: bert_pos_codeswitch_hineng_pos_lince date: 2022-05-09 tags: [bert, pos, part_of_speech, hi, open_source] task: Part of Speech Tagging language: hi edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `codeswitch-hineng-pos-lince` is a Hindi model originally trained by `sagorsarker`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_codeswitch_hineng_pos_lince_hi_3.4.2_3.0_1652091720542.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_codeswitch_hineng_pos_lince_hi_3.4.2_3.0_1652091720542.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_codeswitch_hineng_pos_lince","hi") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["मुझे स्पार्क एनएलपी बहुत पसंद है"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_codeswitch_hineng_pos_lince","hi") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("मुझे स्पार्क एनएलपी बहुत पसंद है").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_codeswitch_hineng_pos_lince| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|hi| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/sagorsarker/codeswitch-hineng-pos-lince - https://ritual.uh.edu/lince/home - https://github.com/sagorbrur/codeswitch --- layout: model title: English image_classifier_vit_rust_image_classification_4 ViTForImageClassification from SummerChiam author: John Snow Labs name: image_classifier_vit_rust_image_classification_4 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rust_image_classification_4` is an English model originally trained by SummerChiam. ## Predicted Entities `nonrust`, `rust` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_4_en_4.1.0_3.0_1660167562540.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_4_en_4.1.0_3.0_1660167562540.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_rust_image_classification_4", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_rust_image_classification_4", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_rust_image_classification_4| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English BertForQuestionAnswering model (from batterydata) author: John Snow Labs name: bert_qa_batterybert_uncased_squad_v1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `batterybert-uncased-squad-v1` is an English model originally trained by `batterydata`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_batterybert_uncased_squad_v1_en_4.0.0_3.0_1654179312654.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_batterybert_uncased_squad_v1_en_4.0.0_3.0_1654179312654.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_batterybert_uncased_squad_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_batterybert_uncased_squad_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad_battery.bert.uncased.by_batterydata").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_batterybert_uncased_squad_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|407.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/batterydata/batterybert-uncased-squad-v1 - https://github.com/ShuHuang/batterybert --- layout: model title: Extract temporal relations among clinical events (ReDL) author: John Snow Labs name: redl_temporal_events_biobert date: 2021-07-24 tags: [relation_extraction, en, clinical, licensed] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.0.3 spark_version: 2.4 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extracts temporal relations between clinical events, i.e., whether an event occurred before, after, or overlapping another event. ## Predicted Entities `AFTER`, `BEFORE`, `OVERLAP` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CLINICAL_EVENTS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_temporal_events_biobert_en_3.0.3_2.4_1627121501681.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_temporal_events_biobert_en_3.0.3_2.4_1627121501681.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = sparknlp.annotators.Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") words_embedder = WordEmbeddingsModel() \ .pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel.pretrained("ner_events_clinical", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverter() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel() \ .pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks") re_model = RelationExtractionDLModel()\ .pretrained("redl_temporal_events_biobert", "en", "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) text = "She is diagnosed with cancer in 1991. Then she was admitted to Mayo Clinic in May 2000 and discharged in October 2001" data = spark.createDataFrame([[text]]).toDF("text") p_model = pipeline.fit(data) result = p_model.transform(data) ``` ```scala ... 
val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_events_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") // The dataset this model is trained on is sentence-wise. // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. val re_model = RelationExtractionDLModel.pretrained("redl_temporal_events_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("""She is diagnosed with cancer in 1991. 
Then she was admitted to Mayo Clinic in May 2000 and discharged in October 2001""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.temporal_events").predict("""She is diagnosed with cancer in 1991. Then she was admitted to Mayo Clinic in May 2000 and discharged in October 2001""") ```
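Downstream, one often keeps only confident predictions. A minimal illustrative sketch, assuming the relations have already been collected into Python dicts with a `confidence` field (the field names below mirror the results table; the actual Spark NLP annotation metadata keys may differ, and the sample dicts are made up for illustration):

```python
# Hypothetical filter over extracted relations: drop predictions whose
# confidence falls below a threshold. Input dicts are illustrative only.
def filter_relations(relations, threshold=0.75):
    """Keep relation dicts with confidence at or above the threshold."""
    return [r for r in relations if r["confidence"] >= threshold]

relations = [
    {"relation": "BEFORE", "chunk1": "diagnosed", "chunk2": "cancer", "confidence": 0.7817},
    {"relation": "OVERLAP", "chunk1": "Mayo Clinic", "chunk2": "October 2001", "confidence": 0.5408},
]
kept = filter_relations(relations)
print([r["relation"] for r in kept])
# → ['BEFORE']
```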
## Results ```bash +--------+-------------+-------------+-----------+-----------+-------------+-------------+-----------+------------+----------+ |relation| entity1|entity1_begin|entity1_end| chunk1| entity2|entity2_begin|entity2_end| chunk2|confidence| +--------+-------------+-------------+-----------+-----------+-------------+-------------+-----------+------------+----------+ | BEFORE| OCCURRENCE| 7| 15| diagnosed| PROBLEM| 22| 27| cancer|0.78168863| | OVERLAP| PROBLEM| 22| 27| cancer| DATE| 32| 35| 1991| 0.8492274| | AFTER| OCCURRENCE| 51| 58| admitted|CLINICAL_DEPT| 63| 73| Mayo Clinic|0.85629463| | BEFORE| OCCURRENCE| 51| 58| admitted| OCCURRENCE| 91| 100| discharged| 0.6843513| | OVERLAP|CLINICAL_DEPT| 63| 73|Mayo Clinic| DATE| 78| 85| May 2000| 0.7844673| | BEFORE|CLINICAL_DEPT| 63| 73|Mayo Clinic| OCCURRENCE| 91| 100| discharged|0.60411876| | OVERLAP|CLINICAL_DEPT| 63| 73|Mayo Clinic| DATE| 105| 116|October 2001| 0.540761| | BEFORE| DATE| 78| 85| May 2000| OCCURRENCE| 91| 100| discharged| 0.6042761| | OVERLAP| DATE| 78| 85| May 2000| DATE| 105| 116|October 2001|0.64867175| | BEFORE| OCCURRENCE| 91| 100| discharged| DATE| 105| 116|October 2001| 0.5302478| +--------+-------------+-------------+-----------+-----------+-------------+-------------+-----------+------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_temporal_events_biobert| |Compatibility:|Healthcare NLP 3.0.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Case sensitive:|true| ## Data Source Trained on temporal clinical events benchmark dataset. ## Benchmarking ```bash Relation Recall Precision F1 Support AFTER 0.332 0.655 0.440 2123 BEFORE 0.868 0.908 0.887 13817 OVERLAP 0.887 0.733 0.802 7860 Avg. 
0.695 0.765 0.710 - ``` --- layout: model title: Chinese Bert Embeddings (6-layer) author: John Snow Labs name: bert_embeddings_rbt6 date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `rbt6` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt6_zh_3.4.2_3.0_1649669338288.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt6_zh_3.4.2_3.0_1649669338288.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_rbt6","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_rbt6","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.rbt6").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_rbt6| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|224.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/rbt6 - https://arxiv.org/abs/1906.08101 - https://github.com/google-research/bert - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://arxiv.org/abs/2004.13922 --- layout: model title: Part of Speech for Czech author: John Snow Labs name: pos_cac date: 2021-03-23 tags: [cs, open_source, pos] supported: true task: Part of Speech Tagging language: cs edition: Spark NLP 2.7.5 spark_version: 2.4 annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron` architecture. 
## Predicted Entities - ADJ - ADP - ADV - AUX - CCONJ - DET - INTJ - NOUN - NUM - PART - PRON - PROPN - PUNCT - SCONJ - SYM - VERB {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_cac_cs_2.7.5_2.4_1616507827439.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_cac_cs_2.7.5_2.4_1616507827439.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_cac", "cs")\ .setInputCols(["document", "token"])\ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([["'Soustředit všechny propagační a agitační prostředky na vytvoření veřejného mínění , které bude podporovat přesnou , tvořivou a iniciativní práci ."]], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_cac", "cs") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("'Soustředit všechny propagační a agitační prostředky na vytvoření veřejného mínění , které bude podporovat přesnou , tvořivou a iniciativní práci .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Soustředit všechny propagační a agitační prostředky na vytvoření veřejného mínění , které bude podporovat přesnou , tvořivou a iniciativní práci ."] token_df = nlu.load('cs.pos.cac').predict(text) token_df ```
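Once the pipeline has run, the parallel `token` and `pos` result arrays can be zipped into (token, tag) pairs for inspection. A minimal illustrative sketch (the tags below are made up for the example, not actual model output):

```python
# Hypothetical pairing of tokens with predicted POS tags. Assumes the
# tagger returns token and tag arrays of equal length, as in the
# results table for this model.
tokens = ["Soustředit", "všechny", "propagační", "."]
tags = ["VERB", "DET", "ADJ", "PUNCT"]  # illustrative tags only
pairs = list(zip(tokens, tags))
print(pairs[0])
# → ('Soustředit', 'VERB')
```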
## Results ```bash +---------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ |text |result | +---------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ |'Soustředit všechny propagační a agitační prostředky na vytvoření veřejného mínění , které bude podporovat přesnou , tvořivou a iniciativní práci .|[NOUN, VERB, DET, ADJ, CCONJ, ADJ, NOUN, ADP, NOUN, ADJ, NOUN, PUNCT, DET, AUX, VERB, ADJ, PUNCT, ADJ, CCONJ, ADJ, NOUN, PUNCT]| +---------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_cac| |Compatibility:|Spark NLP 2.7.5+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[pos]| |Language:|cs| ## Data Source The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) data set. 
## Benchmarking

```bash
|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| ADJ          | 0.96      | 0.97   | 0.97     | 1597    |
| ADP          | 1.00      | 0.99   | 1.00     | 1031    |
| ADV          | 0.99      | 0.97   | 0.98     | 552     |
| AUX          | 0.93      | 0.97   | 0.95     | 302     |
| CCONJ        | 1.00      | 1.00   | 1.00     | 575     |
| DET          | 1.00      | 0.98   | 0.99     | 418     |
| INTJ         | 1.00      | 1.00   | 1.00     | 1       |
| NOUN         | 0.97      | 0.98   | 0.97     | 3023    |
| NUM          | 1.00      | 0.97   | 0.99     | 103     |
| PART         | 0.99      | 0.92   | 0.96     | 92      |
| PRON         | 0.98      | 0.97   | 0.98     | 353     |
| PROPN        | 0.93      | 0.91   | 0.92     | 138     |
| PUNCT        | 1.00      | 1.00   | 1.00     | 1423    |
| SCONJ        | 0.99      | 0.99   | 0.99     | 175     |
| SYM          | 1.00      | 1.00   | 1.00     | 39      |
| VERB         | 0.97      | 0.94   | 0.95     | 1040    |
| accuracy     |           |        | 0.98     | 10862   |
| macro avg    | 0.98      | 0.97   | 0.98     | 10862   |
| weighted avg | 0.98      | 0.98   | 0.98     | 10862   |
```

---
layout: model
title: English BertForQuestionAnswering model (from andresestevez)
author: John Snow Labs
name: bert_qa_andresestevez_bert_finetuned_squad_accelerate
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad-accelerate` is an English model originally trained by `andresestevez`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_andresestevez_bert_finetuned_squad_accelerate_en_4.0.0_3.0_1654535829208.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_andresestevez_bert_finetuned_squad_accelerate_en_4.0.0_3.0_1654535829208.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_andresestevez_bert_finetuned_squad_accelerate","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_andresestevez_bert_finetuned_squad_accelerate","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_andresestevez").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_andresestevez_bert_finetuned_squad_accelerate|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/andresestevez/bert-finetuned-squad-accelerate

---
layout: model
title: Detect Persons, Locations, Organizations and Misc Entities in Portuguese (WikiNER 6B 300)
author: John Snow Labs
name: wikiner_6B_300
date: 2020-05-10
task: Named Entity Recognition
language: pt
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [ner, pt, open_source]
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 6B 300 is trained with GloVe 6B 300 word embeddings, so be sure to use the same embeddings in the pipeline.

{:.h2_title}
## Predicted Entities

Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`.
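The point that embeddings place semantically similar words closer together can be made concrete with cosine similarity. A minimal sketch using toy 3-dimensional vectors (the words and values below are invented for illustration; real GloVe 6B 300 vectors have 300 learned dimensions):

```python
import math

# Toy 3-d "embeddings" (made-up values, not actual GloVe vectors).
vec = {
    "rei":    [0.90, 0.80, 0.10],  # "king"
    "rainha": [0.85, 0.75, 0.20],  # "queen"
    "banana": [0.10, 0.20, 0.90],
}

def cosine(a, b):
    """Cosine similarity: dot product of a and b over the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Related words score higher than unrelated ones.
assert cosine(vec["rei"], vec["rainha"]) > cosine(vec["rei"], vec["banana"])
```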
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_PT){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_PT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_6B_300_pt_2.5.0_2.4_1588495233641.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_6B_300_pt_2.5.0_2.4_1588495233641.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang="xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("wikiner_6B_300", "pt") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (nascido em 28 de outubro de 1955) é um magnata americano de negócios, desenvolvedor de software, investidor e filantropo. Ele é mais conhecido como co-fundador da Microsoft Corporation. Durante sua carreira na Microsoft, Gates ocupou os cargos de presidente, diretor executivo (CEO), presidente e diretor de arquitetura de software, além de ser o maior acionista individual até maio de 2014. Ele é um dos empreendedores e pioneiros mais conhecidos da revolução dos microcomputadores nas décadas de 1970 e 1980. Nascido e criado em Seattle, Washington, Gates co-fundou a Microsoft com o amigo de infância Paul Allen em 1975, em Albuquerque, Novo México; tornou-se a maior empresa de software de computador pessoal do mundo. Gates liderou a empresa como presidente e CEO até deixar o cargo em janeiro de 2000, mas ele permaneceu como presidente e tornou-se arquiteto-chefe de software. No final dos anos 90, Gates foi criticado por suas táticas de negócios, que foram consideradas anticompetitivas. Esta opinião foi confirmada por várias decisões judiciais. Em junho de 2006, Gates anunciou que iria passar para um cargo de meio período na Microsoft e trabalhar em período integral na Fundação Bill & Melinda Gates, a fundação de caridade privada que ele e sua esposa, Melinda Gates, estabeleceram em 2000. 
Ele gradualmente transferiu seus deveres para Ray Ozzie e Craig Mundie. Ele deixou o cargo de presidente da Microsoft em fevereiro de 2014 e assumiu um novo cargo como consultor de tecnologia para apoiar a recém-nomeada CEO Satya Nadella.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang="xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("wikiner_6B_300", "pt") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (nascido em 28 de outubro de 1955) é um magnata americano de negócios, desenvolvedor de software, investidor e filantropo. Ele é mais conhecido como co-fundador da Microsoft Corporation. Durante sua carreira na Microsoft, Gates ocupou os cargos de presidente, diretor executivo (CEO), presidente e diretor de arquitetura de software, além de ser o maior acionista individual até maio de 2014. Ele é um dos empreendedores e pioneiros mais conhecidos da revolução dos microcomputadores nas décadas de 1970 e 1980. Nascido e criado em Seattle, Washington, Gates co-fundou a Microsoft com o amigo de infância Paul Allen em 1975, em Albuquerque, Novo México; tornou-se a maior empresa de software de computador pessoal do mundo. Gates liderou a empresa como presidente e CEO até deixar o cargo em janeiro de 2000, mas ele permaneceu como presidente e tornou-se arquiteto-chefe de software. No final dos anos 90, Gates foi criticado por suas táticas de negócios, que foram consideradas anticompetitivas. Esta opinião foi confirmada por várias decisões judiciais. 
Em junho de 2006, Gates anunciou que iria passar para um cargo de meio período na Microsoft e trabalhar em período integral na Fundação Bill & Melinda Gates, a fundação de caridade privada que ele e sua esposa, Melinda Gates, estabeleceram em 2000. Ele gradualmente transferiu seus deveres para Ray Ozzie e Craig Mundie. Ele deixou o cargo de presidente da Microsoft em fevereiro de 2014 e assumiu um novo cargo como consultor de tecnologia para apoiar a recém-nomeada CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (nascido em 28 de outubro de 1955) é um magnata americano de negócios, desenvolvedor de software, investidor e filantropo. Ele é mais conhecido como co-fundador da Microsoft Corporation. Durante sua carreira na Microsoft, Gates ocupou os cargos de presidente, diretor executivo (CEO), presidente e diretor de arquitetura de software, além de ser o maior acionista individual até maio de 2014. Ele é um dos empreendedores e pioneiros mais conhecidos da revolução dos microcomputadores nas décadas de 1970 e 1980. Nascido e criado em Seattle, Washington, Gates co-fundou a Microsoft com o amigo de infância Paul Allen em 1975, em Albuquerque, Novo México; tornou-se a maior empresa de software de computador pessoal do mundo. Gates liderou a empresa como presidente e CEO até deixar o cargo em janeiro de 2000, mas ele permaneceu como presidente e tornou-se arquiteto-chefe de software. No final dos anos 90, Gates foi criticado por suas táticas de negócios, que foram consideradas anticompetitivas. Esta opinião foi confirmada por várias decisões judiciais. Em junho de 2006, Gates anunciou que iria passar para um cargo de meio período na Microsoft e trabalhar em período integral na Fundação Bill & Melinda Gates, a fundação de caridade privada que ele e sua esposa, Melinda Gates, estabeleceram em 2000. 
Ele gradualmente transferiu seus deveres para Ray Ozzie e Craig Mundie. Ele deixou o cargo de presidente da Microsoft em fevereiro de 2014 e assumiu um novo cargo como consultor de tecnologia para apoiar a recém-nomeada CEO Satya Nadella."""] ner_df = nlu.load('pt.ner.wikiner.glove.6B_300').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title} ## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |William Henry Gates III|PER | |Ele |MISC | |Microsoft Corporation |ORG | |Durante |MISC | |Microsoft |ORG | |Gates |PER | |CEO |ORG | |Ele |MISC | |Nascido |MISC | |Seattle |LOC | |Washington |LOC | |Gates |PER | |Microsoft |ORG | |Paul Allen |PER | |Albuquerque |LOC | |Novo México |LOC | |Gates |PER | |CEO |ORG | |Gates |PER | |Gates |PER | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wikiner_6B_300| |Type:|ner| |Compatibility:| Spark NLP 2.5.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|pt| |Case sensitive:|false| {:.h2_title} ## Data Source The model was trained based on data from [https://pt.wikipedia.org](https://pt.wikipedia.org) --- layout: model title: Stopwords Remover for Catalan language (280 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, ca, open_source] task: Stop Words Removal language: ca edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_ca_3.4.1_3.0_1646673133236.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_ca_3.4.1_3.0_1646673133236.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

stop_words = StopWordsCleaner.pretrained("stopwords_iso","ca") \
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])

example = spark.createDataFrame([["No ets millor que jo"]], ["text"])

results = pipeline.fit(example).transform(example)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val stop_words = StopWordsCleaner.pretrained("stopwords_iso","ca")
    .setInputCols(Array("token"))
    .setOutputCol("cleanTokens")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))

val data = Seq("No ets millor que jo").toDF("text")

val results = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("ca.stopwords").predict("""No ets millor que jo""")
```
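Conceptually, the `StopWordsCleaner` stage just filters the token stream against a fixed word list. A plain-Python sketch of that idea, using a guessed handful of Catalan stopwords rather than the model's actual 280-entry ISO list:

```python
# A guessed subset of Catalan stopwords (illustration only; the real
# stopwords_iso model ships 280 entries).
stopwords = {"no", "ets", "que", "jo"}

tokens = "No ets millor que jo".split()

# Keep only tokens that are not in the stopword list (case-insensitive).
clean = [t for t in tokens if t.lower() not in stopwords]
```

With this subset the sketch reproduces the `[millor]` result shown below the example.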
## Results ```bash +--------+ |result | +--------+ |[millor]| +--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|ca| |Size:|2.1 KB| --- layout: model title: Pipeline to Detect Drug Chemicals author: John Snow Labs name: ner_drugs_large_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_drugs_large](https://nlp.johnsnowlabs.com/2021/03/31/ner_drugs_large_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_drugs_large_pipeline_en_4.3.0_3.2_1678867384370.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_drugs_large_pipeline_en_4.3.0_3.2_1678867384370.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_drugs_large_pipeline", "en", "clinical/models") text = '''The patient is a 40-year-old white male who presents with a chief complaint of 'chest pain'. The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. He has been advised Aspirin 81 milligrams QDay. Humulin N. insulin 50 units in a.m. HCTZ 50 mg QDay. Nitroglycerin 1/150 sublingually PRN chest pain..''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_drugs_large_pipeline", "en", "clinical/models") val text = "The patient is a 40-year-old white male who presents with a chief complaint of 'chest pain'. The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. He has been advised Aspirin 81 milligrams QDay. Humulin N. insulin 50 units in a.m. HCTZ 50 mg QDay. Nitroglycerin 1/150 sublingually PRN chest pain.." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.drugs_large.pipeline").predict("""The patient is a 40-year-old white male who presents with a chief complaint of 'chest pain'. The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. He has been advised Aspirin 81 milligrams QDay. Humulin N. insulin 50 units in a.m. HCTZ 50 mg QDay. Nitroglycerin 1/150 sublingually PRN chest pain..""") ```
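Each chunk in the `fullAnnotate` output carries `begin`/`end` character offsets into the input text; in Spark NLP annotations the `end` offset is inclusive, so slicing with `end + 1` recovers the chunk. A small plain-Python sketch (the offsets here are computed for a toy sentence rather than taken from the pipeline):

```python
# Illustration only (no Spark NLP): begin/end are inclusive character
# indices into the original text, as in the results table below.
text = "He has been advised Aspirin 81 milligrams QDay."

begin = text.index("Aspirin")
end = begin + len("Aspirin 81 milligrams") - 1  # inclusive end offset

# Slice with end + 1 because Python slice ends are exclusive.
chunk = text[begin:end + 1]
```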
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:---------------------------------|--------:|------:|:------------|-------------:| | 0 | Aspirin 81 milligrams | 306 | 326 | DRUG | 0.8401 | | 1 | Humulin N | 334 | 342 | DRUG | 0.95755 | | 2 | insulin 50 units | 345 | 360 | DRUG | 0.847067 | | 3 | HCTZ 50 mg | 370 | 379 | DRUG | 0.875567 | | 4 | Nitroglycerin 1/150 sublingually | 387 | 418 | DRUG | 0.845967 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_drugs_large_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Basque RoBERTa Embeddings Cased model (from mrm8488) author: John Snow Labs name: roberta_embeddings_robasqu date: 2022-07-14 tags: [eu, open_source, roberta, embeddings] task: Embeddings language: eu edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `RoBasquERTa` is a Basque model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_robasqu_eu_4.0.0_3.0_1657809995026.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_robasqu_eu_4.0.0_3.0_1657809995026.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_robasqu","eu") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_robasqu","eu")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_robasqu|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|eu|
|Size:|313.0 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/mrm8488/RoBasquERTa

---
layout: model
title: Indonesian BertForMaskedLM Base Cased model (from cahya)
author: John Snow Labs
name: bert_embeddings_base_indonesian_522m
date: 2022-12-02
tags: [id, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: id
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-indonesian-522M` is an Indonesian model originally trained by `cahya`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_indonesian_522m_id_4.2.4_3.0_1670017835135.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_indonesian_522m_id_4.2.4_3.0_1670017835135.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_indonesian_522m","id") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_indonesian_522m","id")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_indonesian_522m|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|id|
|Size:|415.3 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/cahya/bert-base-indonesian-522M
- https://github.com/cahya-wirawan/indonesian-language-models/tree/master/Transformers

---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18)
author: John Snow Labs
name: distilbert_qa_base_uncased_becas_4
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-becas-4` is an English model originally trained by `Evelyn18`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_4_en_4.3.0_3.0_1672767525583.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_4_en_4.3.0_3.0_1672767525583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_4","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_4","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_becas_4|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/Evelyn18/distilbert-base-uncased-becas-4

---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves TFWav2Vec2ForCTC from tonyalves
author: John Snow Labs
name: asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves` is an English model originally trained by tonyalves.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves_en_4.2.0_3.0_1664108795578.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves_en_4.2.0_3.0_1664108795578.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
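Under the hood, a Wav2Vec2ForCTC model emits one label per audio frame, and greedy CTC decoding turns that frame sequence into text by merging consecutive duplicates and then dropping blanks. A toy sketch of that collapse step (pure Python; the blank symbol and frame labels below are invented for illustration, not taken from this model's vocabulary):

```python
BLANK = "_"  # placeholder CTC blank symbol for this sketch

def ctc_greedy_collapse(frames):
    """Collapse a per-frame best-path label sequence into text:
    merge consecutive duplicate labels, then drop blank symbols."""
    out = []
    prev = None
    for label in frames:
        if label != prev:  # keep only the first of each run of duplicates
            out.append(label)
        prev = label
    return "".join(c for c in out if c != BLANK)

# Per-frame argmax labels for a short made-up utterance.
frames = list("hh_eee_l_ll_oo_")
decoded = ctc_greedy_collapse(frames)  # "hello"
```

Note how the blank between the two `l` runs is what keeps the doubled letter from being merged away.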
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_colab_by_tonyalves| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Pipeline to Detect Mentions of General Medical Terms author: John Snow Labs name: ner_medmentions_coarse_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_medmentions_coarse](https://nlp.johnsnowlabs.com/2021/04/01/ner_medmentions_coarse_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_medmentions_coarse_pipeline_en_3.4.1_3.0_1647870771676.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_medmentions_coarse_pipeline_en_3.4.1_3.0_1647870771676.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_medmentions_coarse_pipeline", "en", "clinical/models") pipeline.annotate("EXAMPLE MEDICAL TEXT") ``` ```scala val pipeline = new PretrainedPipeline("ner_medmentions_coarse_pipeline", "en", "clinical/models") pipeline.annotate("EXAMPLE MEDICAL TEXT") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.medmentions_coarse.pipeline").predict("""EXAMPLE MEDICAL TEXT""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_medmentions_coarse_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English XlmRoBertaForQuestionAnswering (from teacookies) author: John Snow Labs name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265910 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-more_fine_tune_24465520-26265910` is an English model originally trained by `teacookies`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265910_en_4.0.0_3.0_1655985771660.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265910_en_4.0.0_3.0_1655985771660.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265910","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265910","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265910").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265910| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|888.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265910 --- layout: model title: Multilingual T5ForConditionalGeneration Cased model (from csebuetnlp) author: John Snow Labs name: t5_banglat5_nmt_en2bn date: 2023-01-30 tags: [bn, en, open_source, t5, xx, tensorflow] task: Text Generation language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `banglat5_nmt_en_bn` is a Multilingual model originally trained by `csebuetnlp`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_banglat5_nmt_en2bn_xx_4.3.0_3.0_1675100250892.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_banglat5_nmt_en2bn_xx_4.3.0_3.0_1675100250892.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_banglat5_nmt_en2bn","xx") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_banglat5_nmt_en2bn","xx") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_banglat5_nmt_en2bn| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|xx| |Size:|1.0 GB| ## References - https://huggingface.co/csebuetnlp/banglat5_nmt_en_bn - https://github.com/csebuetnlp/normalizer --- layout: model title: English BertForQuestionAnswering model (from madlag) author: John Snow Labs name: bert_qa_bert_base_uncased_squad1.1_block_sparse_0.20_v1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squad1.1-block-sparse-0.20-v1` is an English model originally trained by `madlag`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squad1.1_block_sparse_0.20_v1_en_4.0.0_3.0_1654181440387.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squad1.1_block_sparse_0.20_v1_en_4.0.0_3.0_1654181440387.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_squad1.1_block_sparse_0.20_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_squad1.1_block_sparse_0.20_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased_x1.16_f88.1_d8_unstruct_v1.by_madlag").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_squad1.1_block_sparse_0.20_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|179.0 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/madlag/bert-base-uncased-squad1.1-block-sparse-0.20-v1 - https://rajpurkar.github.io/SQuAD-explorer - https://www.aclweb.org/anthology/N19-1423.pdf - https://www.aclweb.org/anthology/N19-1423/ - https://arxiv.org/abs/2005.07683 --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_cline_techqa date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cline-techqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_cline_techqa_en_4.0.0_3.0_1655727943280.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_cline_techqa_en_4.0.0_3.0_1655727943280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_cline_techqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_cline_techqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.techqa_cline.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_cline_techqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|466.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/cline-techqa --- layout: model title: ICD10CM ChunkResolver author: John Snow Labs name: chunkresolve_icd10cm_diseases_clinical class: ChunkEntityResolverModel language: en nav_key: models repository: clinical/models date: 2020-04-28 task: Entity Resolution edition: Healthcare NLP 2.4.5 spark_version: 2.4 tags: [clinical,licensed,entity_resolution,en] deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Entity resolution model based on KNN using word embeddings and Word Mover's Distance. ## Predicted Entities ICD10-CM Codes and their normalized definition with ``clinical_embeddings``. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_ICD10_CM.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_diseases_clinical_en_2.4.5_2.4_1588105984876.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_diseases_clinical_en_2.4.5_2.4_1588105984876.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python ... icd10cmResolver = ChunkEntityResolverModel.pretrained('chunkresolve_icd10cm_diseases_clinical', 'en', "clinical/models")\ .setEnableLevenshtein(True)\ .setNeighbours(200).setAlternatives(5).setDistanceWeights([3,3,2,0,0,7])\ .setInputCols('token', 'chunk_embs_jsl')\ .setOutputCol('icd10cm_resolution') pipeline_icd10 = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, jslNer, drugNer, jslConverter, drugConverter, jslChunkEmbeddings, drugChunkEmbeddings, icd10cmResolver]) empty_df = spark.createDataFrame([[""]]).toDF("text") data = ["""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret's Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU"""] pipeline_model = pipeline_icd10.fit(empty_df) light_pipeline = LightPipeline(pipeline_model) result = light_pipeline.annotate(data) ``` ```scala ... 
val icd10cmResolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_diseases_clinical", "en", "clinical/models") .setEnableLevenshtein(true) .setNeighbours(200).setAlternatives(5).setDistanceWeights(Array(3,3,2,0,0,7)) .setInputCols(Array("token", "chunk_embs_jsl")) .setOutputCol("icd10cm_resolution") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, jslNer, drugNer, jslConverter, drugConverter, jslChunkEmbeddings, drugChunkEmbeddings, icd10cmResolver)) val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret's Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU").toDF("text") val result = pipeline.fit(data).transform(data) ```
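In the results shown in the next section, each `coords` value is a `::`-separated triple. Assuming the three fields are sentence index followed by begin and end character offsets (an inference from the sample rows, not documented on this card), a small helper can unpack them:

```python
def parse_coords(coords):
    """Split a 'sentence::begin::end' coordinate string into integers.

    The field meaning is inferred from the sample output rows, e.g.
    '4::83::109' for the chunk 'chronic renal insufficiency'.
    """
    sent, begin, end = coords.split("::")
    return int(sent), int(begin), int(end)
```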
{:.h2_title} ## Results ```bash | | coords | chunk | entity | icd10cm_opts | |---|-------------|-----------------------------|-----------|-------------------------------------------------------------------------------------------| | 0 | 2::499::506 | insomnia | Diagnosis | [(G4700, Insomnia, unspecified), (G4709, Other insomnia), (F5102, Adjustment insomnia)...]| | 1 | 4::83::109 | chronic renal insufficiency | Diagnosis | [(N185, Chronic kidney disease, stage 5), (N181, Chronic kidney disease, stage 1), (N1...]| | 2 | 4::120::128 | gastritis | Diagnosis | [(K2970, Gastritis, unspecified, without bleeding), (B9681, Helicobacter pylori [H. py...]| ``` {:.model-param} ## Model Information {:.table-model} |----------------|----------------------------------------| | Name: | chunkresolve_icd10cm_diseases_clinical | | Type: | ChunkEntityResolverModel | | Compatibility: | Spark NLP 2.4.5+ | | License: | Licensed | | Edition: | Official | | Input labels: | [token, chunk_embeddings] | | Output labels: | [entity] | | Language: | en | | Case sensitive: | True | | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained on ICD10CM Dataset Range: A000-N989 Except Neoplasms and Musculoskeletal https://www.icd10data.com/ICD10CM/Codes/ --- layout: model title: English Named Entity Recognition (from browndw) author: John Snow Labs name: bert_ner_docusco_bert date: 2022-05-09 tags: [bert, ner, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `docusco-bert` is an English model originally trained by `browndw`.
## Predicted Entities `Interactive`, `AcademicTerms`, `InformationChange`, `MetadiscourseCohesive`, `FirstPerson`, `InformationPlace`, `Updates`, `InformationChangeneritive`, `Reasoning`, `PublicTerms`, `Citation`, `Future`, `CitationHedged`, `InformationExnerition`, `Contingent`, `Strategic`, `PAD`, `CitationAuthority`, `Facilitate`, `Positive`, `ConfidenceHigh`, `InformationStates`, `AcademicWritingMoves`, `Uncertainty`, `SyntacticComplexity`, `Responsibility`, `Character`, `Narrative`, `MetadiscourseInteractive`, `InformationTopics`, `ConfidenceLow`, `ConfidenceHedged`, `ForceStressed`, `Negative`, `InformationChangeNegative`, `Description`, `Inquiry`, `InformationReportVerbs` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_docusco_bert_en_3.4.2_3.0_1652097061800.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_docusco_bert_en_3.4.2_3.0_1652097061800.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_docusco_bert","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_docusco_bert","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
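The `ner` column above holds token-level IOB-style tags (`B-`/`I-` prefixes plus the entity labels listed earlier). A hedged sketch of merging a tag sequence back into labelled chunks, in case you are post-processing outside Spark NLP (which provides its own `NerConverter` for this):

```python
def iob_to_chunks(tokens, tags):
    """Merge parallel lists of tokens and IOB tags (e.g. B-Narrative,
    I-Narrative, O) into (label, text) chunks."""
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = (tag[2:], [tok])          # start a new chunk
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)              # continue the open chunk
        else:
            if current:
                chunks.append(current)
            current = None                      # O tag or dangling I- tag
    if current:
        chunks.append(current)
    return [(label, " ".join(toks)) for label, toks in chunks]
```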
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_docusco_bert| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.4 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/browndw/docusco-bert - https://www.english-corpora.org/coca/ - https://www.cmu.edu/dietrich/english/research-and-publications/docuscope.html - https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=docuscope&btnG= - https://graphics.cs.wisc.edu/WP/vep/2017/02/14/guest-post-data-mining-king-lear/ - https://journals.sagepub.com/doi/full/10.1177/2055207619844865 - https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging) - https://arxiv.org/pdf/1810.04805 --- layout: model title: Japanese ALBERT Embeddings (from ken11) author: John Snow Labs name: albert_embeddings_albert_base_japanese_v1 date: 2022-04-14 tags: [albert, embeddings, ja, open_source] task: Embeddings language: ja edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: AlBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `albert-base-japanese-v1` is a Japanese model originally trained by `ken11`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_base_japanese_v1_ja_3.4.2_3.0_1649954238475.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_base_japanese_v1_ja_3.4.2_3.0_1649954238475.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_base_japanese_v1","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["私はSpark NLPを愛しています"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_base_japanese_v1","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("私はSpark NLPを愛しています").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ja.embed.albert_base_japanese_v1").predict("""私はSpark NLPを愛しています""") ```
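The `embeddings` column produced above holds one vector per token. A common follow-up is comparing tokens by cosine similarity; a self-contained sketch in plain Python (independent of Spark NLP's own utilities, vector values here are illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```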
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_embeddings_albert_base_japanese_v1| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Size:|45.6 MB| |Case sensitive:|false| ## References - https://huggingface.co/ken11/albert-base-japanese-v1 - https://ken11.jp/blog/sentencepiece-tokenizer-bug - https://ja.wikipedia.org/wiki/Wikipedia:%E3%83%87%E3%83%BC%E3%82%BF%E3%83%99%E3%83%BC%E3%82%B9%E3%83%80%E3%82%A6%E3%83%B3%E3%83%AD%E3%83%BC%E3%83%89 - https://www.rondhuit.com/download.html#ldcc - https://github.com/google/sentencepiece - https://opensource.org/licenses/MIT --- layout: model title: Google's Tapas Table Understanding (Base, SQA) author: John Snow Labs name: table_qa_tapas_base_finetuned_sqa date: 2022-09-30 tags: [en, table, qa, question, answering, open_source] task: Table Question Answering language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: TapasForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a zero-shot table understanding model that lets you carry out question answering on Spark DataFrames. If your data is stored in a tabular file format such as CSV, load it into a Spark DataFrame before using the model. Size of this model: Base. Has aggregation operations?: False. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_base_finetuned_sqa_en_4.2.0_3.0_1664530865349.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_base_finetuned_sqa_en_4.2.0_3.0_1664530865349.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python json_data = """ { "header": ["name", "money", "age"], "rows": [ ["Donald Trump", "$100,000,000", "75"], ["Elon Musk", "$20,000,000,000,000", "55"] ] } """ queries = [ "Who earns less than 200,000,000?", "Who earns 100,000,000?", "How much money has Donald Trump?", "How old are they?", ] data = spark.createDataFrame([ [json_data, " ".join(queries)] ]).toDF("table_json", "questions") document_assembler = MultiDocumentAssembler() \ .setInputCols("table_json", "questions") \ .setOutputCols("document_table", "document_questions") sentence_detector = SentenceDetector() \ .setInputCols(["document_questions"]) \ .setOutputCol("questions") table_assembler = TableAssembler()\ .setInputCols(["document_table"])\ .setOutputCol("table") tapas = TapasForQuestionAnswering\ .pretrained("table_qa_tapas_base_finetuned_sqa","en")\ .setInputCols(["questions", "table"])\ .setOutputCol("answers") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, table_assembler, tapas ]) model = pipeline.fit(data) model\ .transform(data)\ .selectExpr("explode(answers) AS answer")\ .select("answer")\ .show(truncate=False) ```
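The `TableAssembler` above consumes the table as JSON with `header` and `rows` keys and string-valued cells. If your data starts out as a CSV file or a list of rows, a small helper (ours, not part of the Spark NLP API) can produce that shape:

```python
import json

def rows_to_table_json(header, rows):
    """Serialize a header and row list (e.g. parsed from a CSV file) into
    the {"header": [...], "rows": [[...]]} layout used in the example above,
    rendering every cell as a string."""
    return json.dumps({
        "header": list(header),
        "rows": [[str(cell) for cell in row] for row in rows],
    })
```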
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |answer | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} | |{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|table_qa_tapas_base_finetuned_sqa| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|413.9 MB| |Case sensitive:|false| ## References https://www.microsoft.com/en-us/download/details.aspx?id=54253 --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_2 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained 
Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-32-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_2_en_4.0.0_3.0_1657184964116.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_2_en_4.0.0_3.0_1657184964116.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-32-finetuned-squad-seed-2 --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18) author: John Snow Labs name: distilbert_qa_base_uncased_becasv2_4 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-becasv2-4` is an English model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becasv2_4_en_4.3.0_3.0_1672767755946.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becasv2_4_en_4.3.0_3.0_1672767755946.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becasv2_4","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becasv2_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_becasv2_4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Evelyn18/distilbert-base-uncased-becasv2-4 --- layout: model title: Legal Taxes Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_taxes_bert date: 2023-03-05 tags: [en, legal, classification, clauses, taxes, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Taxes` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. 
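The paragraph-splitting technique mentioned above (by multiline) can be sketched in plain Python — a simplified stand-in for the workshop tutorial's splitters, not the actual Legal NLP implementation:

```python
import re

def split_paragraphs(text):
    """Split a document into provisions on blank lines (multiline splitting)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("Taxes. Each party shall pay its own taxes.\n\n"
       "Assignment. This Agreement may not be assigned.")
print(split_paragraphs(doc))  # -> two separate provisions
```

Each resulting piece can then be fed to the classifier individually, keeping every input under the 512-token limit.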
Take into consideration that this model's embeddings allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers available in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Taxes`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_taxes_bert_en_1.0.0_3.0_1678050662883.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_taxes_bert_en_1.0.0_3.0_1678050662883.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_taxes_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
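Combining several binary clause classifiers into a single set of True/False flags, as described above, can be sketched as follows (plain Python; the clause names and label strings here are hypothetical examples, not actual model outputs):

```python
def combine_flags(clause_results):
    """Turn per-classifier predicted labels into a True/False flag per clause type.

    `clause_results` maps a clause type to that classifier's predicted label;
    any label other than "Other" means the clause was detected.
    """
    return {clause: label != "Other" for clause, label in clause_results.items()}

# Hypothetical outputs from two binary clause classifiers:
print(combine_flags({"Taxes": "Taxes", "Indemnification": "Other"}))
# -> {'Taxes': True, 'Indemnification': False}
```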
## Results ```bash +-------+ |result| +-------+ |[Taxes]| |[Other]| |[Other]| |[Taxes]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_taxes_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.96 0.98 0.97 167 Taxes 0.97 0.96 0.96 134 accuracy - - 0.97 301 macro-avg 0.97 0.97 0.97 301 weighted-avg 0.97 0.97 0.97 301 ``` --- layout: model title: Igbo Named Entity Recognition (from mbeukman) author: John Snow Labs name: xlmroberta_ner_xlm_roberta_base_finetuned_igbo_finetuned_ner_igbo date: 2022-05-17 tags: [xlm_roberta, ner, token_classification, ig, open_source] task: Named Entity Recognition language: ig edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-igbo-finetuned-ner-igbo` is an Igbo model originally trained by `mbeukman`. ## Predicted Entities `PER`, `ORG`, `LOC`, `DATE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_igbo_finetuned_ner_igbo_ig_3.4.2_3.0_1652809440980.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_igbo_finetuned_ner_igbo_ig_3.4.2_3.0_1652809440980.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_igbo_finetuned_ner_igbo","ig") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ahụrụ m n'anya na-atọ m ụtọ"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_igbo_finetuned_ner_igbo","ig") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ahụrụ m n'anya na-atọ m ụtọ").toDF("text") val result = pipeline.fit(data).transform(data) ```
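The `ner` column above holds token-level BIO tags; grouping them into entity chunks (what a NerConverter stage does downstream) can be sketched in plain Python, for illustration only:

```python
def bio_to_chunks(tokens, tags):
    """Group B-/I- tagged tokens into (entity_text, label) chunks."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity starts
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:  # continue the current entity
            current.append(tok)
        else:                              # O tag closes any open entity
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(bio_to_chunks(["John", "lives", "in", "Lagos"], ["B-PER", "O", "O", "B-LOC"]))
# -> [('John', 'PER'), ('Lagos', 'LOC')]
```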
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_xlm_roberta_base_finetuned_igbo_finetuned_ner_igbo| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ig| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-igbo-finetuned-ner-igbo - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://www.apache.org/licenses/LICENSE-2.0 - https://github.com/Michael-Beukman/NER --- layout: model title: English asr_wav2vec2_base_100h_13K_steps TFWav2Vec2ForCTC from patrickvonplaten author: John Snow Labs name: pipeline_asr_wav2vec2_base_100h_13K_steps date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_13K_steps` is an English model originally trained by patrickvonplaten. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_100h_13K_steps_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_13K_steps_en_4.2.0_3.0_1664101244680.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_13K_steps_en_4.2.0_3.0_1664101244680.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_100h_13K_steps', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_100h_13K_steps", lang = "en") val annotations = pipeline.transform(audioDF) ```
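Wav2Vec2 models emit a character prediction per audio frame, which is then collapsed with CTC decoding. A greedy CTC collapse (merge repeats, drop blanks) can be sketched in plain Python — a simplification of the actual decoder, with a toy vocabulary:

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then remove CTC blank tokens."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(id_to_char[i])
        prev = i
    return "".join(out)

vocab = {0: "<blank>", 1: "h", 2: "i"}
print(ctc_greedy_decode([1, 1, 0, 2, 2, 0, 0], vocab))  # -> hi
```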
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_100h_13K_steps| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|349.2 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Pipeline to Detect clinical concepts (jsl_ner_wip_modifier_clinical) author: John Snow Labs name: jsl_ner_wip_modifier_clinical_pipeline date: 2023-03-14 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [jsl_ner_wip_modifier_clinical](https://nlp.johnsnowlabs.com/2021/04/01/jsl_ner_wip_modifier_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_modifier_clinical_pipeline_en_4.3.0_3.2_1678782821242.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_modifier_clinical_pipeline_en_4.3.0_3.2_1678782821242.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("jsl_ner_wip_modifier_clinical_pipeline", "en", "clinical/models") text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature..''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("jsl_ner_wip_modifier_clinical_pipeline", "en", "clinical/models") val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. 
The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.wip_modifier_clinical.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature..""") ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-----------------------------------------------|--------:|------:|:-----------------------------|-------------:| | 0 | 21-day-old | 17 | 26 | Age | 0.9817 | | 1 | Caucasian | 28 | 36 | Race_Ethnicity | 0.9998 | | 2 | male | 38 | 41 | Gender | 0.9922 | | 3 | for 2 days | 48 | 57 | Duration | 0.6968 | | 4 | congestion | 62 | 71 | Symptom | 0.875 | | 5 | mom | 75 | 77 | Gender | 0.8156 | | 6 | suctioning yellow discharge | 88 | 114 | Symptom | 0.2697 | | 7 | nares | 135 | 139 | External_body_part_or_region | 0.6216 | | 8 | she | 147 | 149 | Gender | 0.9965 | | 9 | mild problems with his breathing while feeding | 168 | 213 | Symptom | 0.444029 | | 10 | perioral cyanosis | 237 | 253 | Symptom | 0.3283 | | 11 | retractions | 258 | 268 | Symptom | 0.957 | | 12 | One day ago | 272 | 282 | RelativeDate | 0.646267 | | 13 | mom | 285 | 287 | Gender | 0.692 | | 14 | tactile temperature | 304 | 322 | Symptom | 0.20765 | | 15 | Tylenol | 345 | 351 | Drug | 0.9951 | | 16 | Baby | 354 | 357 | Age | 0.981 | | 17 | decreased p.o. intake | 377 | 397 | Symptom | 0.437375 | | 18 | His | 400 | 402 | Gender | 0.999 | | 19 | 20 minutes | 439 | 448 | Duration | 0.20415 | | 20 | q.2h. 
| 450 | 454 | Frequency | 0.6406 | | 21 | to 5 to 10 minutes | 456 | 473 | Duration | 0.12444 | | 22 | his | 488 | 490 | Gender | 0.9904 | | 23 | respiratory congestion | 492 | 513 | Symptom | 0.5294 | | 24 | He | 516 | 517 | Gender | 0.9989 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|jsl_ner_wip_modifier_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Fast Neural Machine Translation Model from Papiamento to English author: John Snow Labs name: opus_mt_pap_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pap, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `pap` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_pap_en_xx_2.7.0_2.4_1609167979033.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_pap_en_xx_2.7.0_2.4_1609167979033.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_pap_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("PUT YOUR STRING HERE") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_pap_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.pap.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_pap_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_kv128 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-kv128` is a English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_kv128_en_4.3.0_3.0_1675121257275.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_kv128_en_4.3.0_3.0_1675121257275.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_kv128","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_kv128","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_kv128| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|183.9 MB| ## References - https://huggingface.co/google/t5-efficient-small-kv128 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Fast Neural Machine Translation Model from Slavic Languages to English author: John Snow Labs name: opus_mt_sla_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, sla, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `sla` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_sla_en_xx_2.7.0_2.4_1609164230956.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_sla_en_xx_2.7.0_2.4_1609164230956.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_sla_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("PUT YOUR STRING HERE") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_sla_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.sla.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_sla_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Understanding Restriction Level of Assignment Clauses (Bert) author: John Snow Labs name: legclf_nda_assigments_bert date: 2023-05-17 tags: [en, legal, licensed, bert, nda, classification, assigments, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Given a clause classified as `ASSIGNMENT` using the `legmulticlf_mnda_sections_paragraph_other` classifier, you can further classify its sentences as `PERMISSIVE_ASSIGNMENT`, `RESTRICTIVE_ASSIGNMENT` or `OTHER` using the `legclf_nda_assigments_bert` model. It has been trained with a state-of-the-art approach. ## Predicted Entities `PERMISSIVE_ASSIGNMENT`, `RESTRICTIVE_ASSIGNMENT`, `OTHER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_nda_assigments_bert_en_1.0.0_3.0_1684350248553.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_nda_assigments_bert_en_1.0.0_3.0_1684350248553.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pandas as pd document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") sequence_classifier = legal.BertForSequenceClassification.pretrained("legclf_nda_assigments_bert", "en", "legal/models")\ .setInputCols(["document","token"])\ .setOutputCol("class")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) clf_pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, sequence_classifier ]) empty_df = spark.createDataFrame([['']]).toDF("text") model = clf_pipeline.fit(empty_df) text_list = [ """This Agreement will be binding upon and inure to the benefit of each Party and its respective heirs, successors and assigns""", """All notices and other communications provided for in this Agreement and the other Loan Documents shall be in writing and may (subject to paragraph (b) below) be telecopied (faxed), mailed by certified mail return receipt requested, or delivered by hand or overnight courier service to the intended recipient at the addresses specified below or at such other address as shall be designated by any party listed below in a notice to the other parties listed below given in accordance with this Section.""", """This Agreement is a personal contract for XCorp, and the rights and interests of XCorp hereunder may not be sold, transferred, assigned, pledged or hypothecated except as otherwise expressly permitted by the Company""" ] df = spark.createDataFrame(pd.DataFrame({"text" : text_list})) result = model.transform(df) ```
## Results ```bash +--------------------------------------------------------------------------------+----------------------+ | text| class| +--------------------------------------------------------------------------------+----------------------+ |This Agreement will be binding upon and inure to the benefit of each Party an...| PERMISSIVE_ASSIGNMENT| |All notices and other communications provided for in this Agreement and the o...| OTHER| |This Agreement is a personal contract for XCorp, and the rights and interests...|RESTRICTIVE_ASSIGNMENT| +--------------------------------------------------------------------------------+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_nda_assigments_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## Sample text from the training dataset In-house annotations on the Non-disclosure Agreements ## Benchmarking ```bash label precision recall f1-score support OTHER 0.98 1.00 0.99 124 PERMISSIVE_ASSIGNMENT 1.00 0.93 0.97 15 RESTRICTIVE_ASSIGNMENT 1.00 0.96 0.98 25 accuracy - - 0.99 164 macro avg 0.99 0.96 0.98 164 weighted avg 0.99 0.99 0.99 164 ``` --- layout: model title: English asr_wav2vec2_base_checkpoint_14 TFWav2Vec2ForCTC from jiobiala24 author: John Snow Labs name: pipeline_asr_wav2vec2_base_checkpoint_14 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_checkpoint_14` is an English model 
originally trained by jiobiala24. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_checkpoint_14_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_checkpoint_14_en_4.2.0_3.0_1664114924936.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_checkpoint_14_en_4.2.0_3.0_1664114924936.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_checkpoint_14', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_checkpoint_14", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_checkpoint_14| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|349.1 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Medical Spell Checker author: John Snow Labs name: spellcheck_clinical date: 2021-02-16 task: Spell Check language: en nav_key: models edition: Spark NLP 2.7.2 spark_version: 2.4 tags: [spelling, spellchecker, clinical, orthographic, spell_checker, medical_spell_checker, spelling_corrector, en, licensed] supported: true annotator: SpellCheckModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Contextual Spell Checker is a sequence-to-sequence model that detects and corrects spelling errors in your input text. It's based on a Levenshtein Automaton for generating candidate corrections and a Neural Language Model for ranking corrections. This model has been trained on a dataset containing data from different sources: MTSamples, i2b2 clinical notes, and PubMed. The model comes fully pretrained and ready to use. However, you can still customize it further without re-training a new model from scratch. This can be accomplished by providing custom definitions for the word classes the model has been trained on, namely Dates, Numbers, Ages, Units, and Medications. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_en_2.7.2_2.4_1613505168792.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_en_2.7.2_2.4_1613505168792.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use To use this model, set up a pipeline and feed it tokens.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = RecursiveTokenizer()\ .setInputCols(["document"])\ .setOutputCol("token")\ .setPrefixes(["\"", "(", "[", "\n"])\ .setSuffixes([".", ",", "?", ")","!", "'s"]) spellModel = ContextSpellCheckerModel\ .pretrained('spellcheck_clinical', 'en', 'clinical/models')\ .setInputCols("token")\ .setOutputCol("checked") finisher = Finisher()\ .setInputCols("checked") pipeline = Pipeline( stages = [ documentAssembler, tokenizer, spellModel, finisher ]) empty_ds = spark.createDataFrame([[""]]).toDF("text") lp = LightPipeline(pipeline.fit(empty_ds)) example = ["Witth the hell of phisical terapy the patient was imbulated and on posoperative, the impatient tolerating a post curgical soft diet.", "With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.", "She is to also call the ofice if she has any ever greater than 101, or leeding form the surgical wounds.", "Abdomen is sort, nontender, and nonintended.", "Patient not showing pain or any wealth problems.", "No cute distress" ] lp.annotate(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.spell.clinical").predict("Witth the hell of phisical terapy the patient was imbulated and on posoperative, the impatient tolerating a post curgical soft diet.") ```
## Results ```bash [{'checked': ['With', 'the', 'help', 'of', 'physical', 'therapy', 'the', 'patient', 'was', 'ambulated', 'and', 'on', 'postoperative', ',', 'the', 'patient', 'tolerating', 'a', 'post', 'surgical', 'soft', 'diet', '.']}, {'checked': ['With', 'pain', 'well', 'controlled', 'on', 'oral', 'pain', 'medications', ',', 'she', 'was', 'discharged', 'to', 'rehabilitation', 'facility', '.']}, {'checked': ['She', 'is', 'to', 'also', 'call', 'the', 'office', 'if', 'she', 'has', 'any', 'fever', 'greater', 'than', '101', ',', 'or', 'bleeding', 'from', 'the', 'surgical', 'wounds', '.']}, {'checked': ['Abdomen', 'is', 'soft', ',', 'nontender', ',', 'and', 'nondistended', '.']}, {'checked': ['Patient', 'not', 'showing', 'pain', 'or', 'any', 'health', 'problems', '.']}, {'checked': ['No', 'acute', 'distress']}] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|spellcheck_clinical| |Compatibility:|Spark NLP 2.7.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token]| |Language:|en| ## Data Source MTSamples, i2b2 clinical notes, and PubMed. --- layout: model title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011) author: John Snow Labs name: distilbert_token_classifier_autotrain_company_vs_all_902129475 date: 2023-03-06 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-company_vs_all-902129475` is an English model originally trained by `ismail-lucifer011`.
## Predicted Entities `Company`, `OOV` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_company_vs_all_902129475_en_4.3.1_3.0_1678134115695.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_company_vs_all_902129475_en_4.3.1_3.0_1678134115695.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_company_vs_all_902129475","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_company_vs_all_902129475","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
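The resulting `ner` column carries token-level tags in BIO style over the classes listed under Predicted Entities. Downstream you usually want whole entity spans rather than per-token tags; a minimal, framework-free sketch of that merge (the token and tag lists below are made up for illustration):

```python
def merge_bio(tokens, tags):
    """Merge token-level B-/I- tags into (entity_text, label) spans.
    Tokens tagged O are skipped."""
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

entities = merge_bio(["Apple", "hired", "Tim"], ["B-Company", "O", "B-OOV"])
```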
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_company_vs_all_902129475| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ismail-lucifer011/autotrain-company_vs_all-902129475 --- layout: model title: English BertForQuestionAnswering Uncased model (from husnu) author: John Snow Labs name: bert_qa_xtremedistil_l6_h256_uncased_tquad_finetuned_lr_2e_05_epochs_6 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xtremedistil-l6-h256-uncased-TQUAD-finetuned_lr-2e-05_epochs-6` is an English model originally trained by `husnu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_xtremedistil_l6_h256_uncased_tquad_finetuned_lr_2e_05_epochs_6_en_4.0.0_3.0_1657193243137.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_xtremedistil_l6_h256_uncased_tquad_finetuned_lr_2e_05_epochs_6_en_4.0.0_3.0_1657193243137.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_xtremedistil_l6_h256_uncased_tquad_finetuned_lr_2e_05_epochs_6","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_xtremedistil_l6_h256_uncased_tquad_finetuned_lr_2e_05_epochs_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
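Under the hood, extractive QA heads score every context token as a candidate answer start and end, and the returned `answer` is the best-scoring valid span. A toy sketch of that decoding step; the scores below are invented for illustration, not the model's actual logits:

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick (start, end) maximizing start_scores[s] + end_scores[e]
    subject to s <= e < s + max_len."""
    best, best_score = (0, 0), float("-inf")
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = ss + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 1.0, 0.0]
end   = [0.0, 0.1, 0.0, 4.5, 0.2, 0.0, 0.0, 0.0, 2.0, 0.1]
s, e = best_span(start, end)
answer = tokens[s:e + 1]  # -> ["Clara"] for these made-up scores
```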
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_xtremedistil_l6_h256_uncased_tquad_finetuned_lr_2e_05_epochs_6| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|47.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/husnu/xtremedistil-l6-h256-uncased-TQUAD-finetuned_lr-2e-05_epochs-6 --- layout: model title: Fast and Accurate Language Identification - 21 Languages (BIGRU) author: John Snow Labs name: ld_tatoeba_bigru_21 date: 2020-12-05 task: Language Detection language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [language_detection, open_source, xx] supported: true annotator: LanguageDetectorDL article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Language detection and identification is the task of automatically detecting the language(s) present in a document based on its content. ``LanguageDetectorDL`` is an annotator that detects the language of documents or sentences depending on the ``inputCols``. In addition, ``LanguageDetectorDL`` can accurately detect language in documents that mix languages, by coalescing sentence-level predictions and selecting the best candidate. We have designed and developed Deep Learning models using BiGRU architectures in TensorFlow/Keras. The model is trained on the Tatoeba dataset and evaluated with high accuracy on the Europarl dataset. The output is a language code in Wiki Code style: [https://en.wikipedia.org/wiki/List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias) This model can detect the following languages: `Bulgarian`, `Czech`, `Danish`, `German`, `Greek`, `English`, `Estonian`, `Finnish`, `French`, `Hungarian`, `Italian`, `Lithuanian`, `Latvian`, `Dutch`, `Polish`, `Portuguese`, `Romanian`, `Slovak`, `Slovenian`, `Spanish`, `Swedish`.
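The coalescing step for mixed-language documents can be pictured as aggregating per-sentence predictions into one document-level label. The exact aggregation Spark NLP uses is not documented here, so treat this as an illustrative assumption: sum each language's per-sentence confidence and keep the best.

```python
from collections import defaultdict

def coalesce(sentence_preds):
    """Combine per-sentence (lang, confidence) predictions into a single
    document-level language by summing confidence per language.
    This aggregation rule is an assumption for illustration."""
    totals = defaultdict(float)
    for lang, conf in sentence_preds:
        totals[lang] += conf
    return max(totals, key=totals.get)

doc_lang = coalesce([("fr", 0.9), ("en", 0.6), ("fr", 0.8)])
```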
## Predicted Entities `bg`, `cs`, `da`, `de`, `el`, `en`, `et`, `fi`, `fr`, `hu`, `it`, `lt`, `lv`, `nl`, `pl`, `pt`, `ro`, `sk`, `sl`, `es`, `sv`. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ld_tatoeba_bigru_21_xx_2.7.0_2.4_1607183021248.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ld_tatoeba_bigru_21_xx_2.7.0_2.4_1607183021248.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... language_detector = LanguageDetectorDL.pretrained("ld_tatoeba_bigru_21", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("language") languagePipeline = Pipeline(stages=[documentAssembler, sentenceDetector, language_detector]) light_pipeline = LightPipeline(languagePipeline.fit(spark.createDataFrame([['']]).toDF("text"))) result = light_pipeline.fullAnnotate("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.") ``` ```scala ... val languageDetector = LanguageDetectorDL.pretrained("ld_tatoeba_bigru_21", "xx") .setInputCols("sentence") .setOutputCol("language") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, languageDetector)) val data = Seq("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala."] lang_df = nlu.load('xx.classify.wiki_21.bigru').predict(text, output_level='sentence') lang_df ```
## Results ```bash 'fr' ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ld_tatoeba_bigru_21| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[language]| |Language:|xx| ## Data Source Tatoeba ## Benchmarking ```bash Evaluated on the Europarl dataset, which the model has never seen: +--------+-----+-------+------------------+ |src_lang|count|correct| precision| +--------+-----+-------+------------------+ | el| 1000| 1000| 1.0| | pt| 1000| 1000| 1.0| | fr| 1000| 1000| 1.0| | it| 1000| 1000| 1.0| | de| 1000| 1000| 1.0| | es| 1000| 1000| 1.0| | nl| 1000| 999| 0.999| | en| 1000| 999| 0.999| | da| 1000| 999| 0.999| | fi| 1000| 998| 0.998| | pl| 914| 906|0.9912472647702407| | bg| 1000| 990| 0.99| | hu| 880| 870|0.9886363636363636| | ro| 784| 773| 0.985969387755102| | sv| 1000| 985| 0.985| | lt| 1000| 983| 0.983| | cs| 1000| 979| 0.979| | lv| 916| 896|0.9781659388646288| | sk| 1000| 969| 0.969| | et| 928| 888|0.9568965517241379| | sl| 914| 861|0.9420131291028446| +--------+-----+-------+------------------+ +-------+--------------------+ |summary| precision| +-------+--------------------+ | count| 21| | mean| 0.9878061255168247| | stddev|0.015811719210146295| | min| 0.9420131291028446| | max| 1.0| +-------+--------------------+ ``` --- layout: model title: English T5ForConditionalGeneration Base Cased model (from vennify) author: John Snow Labs name: t5_base_grammar_correction date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-grammar-correction` is an English model originally trained by `vennify`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_grammar_correction_en_4.3.0_3.0_1675109588406.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_grammar_correction_en_4.3.0_3.0_1675109588406.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_base_grammar_correction","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_base_grammar_correction","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
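The upstream Hugging Face card (linked under References) describes fine-tuning with a task prefix on the inputs; treat the exact prefix string below as an assumption to verify against that card. If raw sentences come back unchanged, prefixing them before the DocumentAssembler may help. A small sketch of the prefixing step:

```python
def add_task_prefix(texts, prefix="grammar: "):
    """Prepend the T5 task prefix. The 'grammar: ' value is assumed from
    the upstream model card -- verify before relying on it."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]

rows = add_task_prefix(["This sentences has has bads grammar."])
```

Spark NLP's `T5Transformer` also exposes a `setTask` setter that serves the same purpose at the annotator level.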
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_base_grammar_correction| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|904.6 MB| ## References - https://huggingface.co/vennify/t5-base-grammar-correction - https://github.com/EricFillion/happy-transformer - https://arxiv.org/abs/1702.04066 - https://www.vennify.ai/fine-tune-grammar-correction/ --- layout: model title: Translate Bantu languages to English Pipeline author: John Snow Labs name: translate_bnt_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, bnt, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `bnt` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_bnt_en_xx_2.7.0_2.4_1609698899237.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_bnt_en_xx_2.7.0_2.4_1609698899237.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_bnt_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_bnt_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.bnt.translate_to.en').predict(text, output_level='sentence') translate_df ```
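Since Marian's cost grows quickly with sequence length, feeding one sentence at a time keeps latency manageable. A rough stdlib-only splitting heuristic you could apply before `pipeline.annotate`; Spark NLP's own sentence detectors are the proper tool, and this is just a sketch:

```python
import re

def naive_sentences(text):
    """Split on ., !, or ? followed by whitespace -- a rough heuristic,
    not a replacement for a trained sentence detector."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

chunks = naive_sentences("Short inputs translate faster. Split before annotating!")
```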
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_bnt_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from English to Kongo author: John Snow Labs name: opus_mt_en_kg date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, kg, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `kg` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_kg_xx_2.7.0_2.4_1609167529740.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_kg_xx_2.7.0_2.4_1609167529740.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_kg", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_kg", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.kg').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_kg| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: bert_qa_rule_based_hier_quadruplet_epochs_1_shard_1_squad2.0 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_hier_quadruplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_hier_quadruplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1657191078345.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_hier_quadruplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1657191078345.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_hier_quadruplet_epochs_1_shard_1_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_hier_quadruplet_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_rule_based_hier_quadruplet_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/rule_based_hier_quadruplet_epochs_1_shard_1_squad2.0 --- layout: model title: English asr_wav2vec2_base_timit_moaiz_exp1 TFWav2Vec2ForCTC from moaiz237 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_moaiz_exp1 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_moaiz_exp1` is an English model originally trained by moaiz237. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_moaiz_exp1_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_moaiz_exp1_en_4.2.0_3.0_1664036388529.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_moaiz_exp1_en_4.2.0_3.0_1664036388529.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_moaiz_exp1', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_moaiz_exp1", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_moaiz_exp1| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Recognize Entities DL Pipeline for Portuguese - Medium author: John Snow Labs name: entity_recognizer_md date: 2021-03-22 tags: [open_source, portuguese, entity_recognizer_md, pipeline, pt] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: pt edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_md is a pretrained pipeline that performs the most common text processing steps (document assembly, sentence detection, tokenization, embeddings, and named entity recognition) on your dataframe. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_pt_3.0.0_3.0_1616449413384.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_pt_3.0.0_3.0_1616449413384.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('entity_recognizer_md', lang = 'pt') annotations = pipeline.fullAnnotate("Olá de John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("entity_recognizer_md", lang = "pt") val result = pipeline.fullAnnotate("Olá de John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Olá de John Snow Labs! "] result_df = nlu.load('pt.ner.md').predict(text) result_df ```
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:----------------------------|:---------------------------|:---------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Olá de John Snow Labs! '] | ['Olá de John Snow Labs!'] | ['Olá', 'de', 'John', 'Snow', 'Labs!'] | [[0.0, 0.0, 0.0, 0.0,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|pt| --- layout: model title: English asr_xlsr_punctuation TFWav2Vec2ForCTC from boris author: John Snow Labs name: asr_xlsr_punctuation date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xlsr_punctuation` is an English model originally trained by boris. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_xlsr_punctuation_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xlsr_punctuation_en_4.2.0_3.0_1664020729832.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xlsr_punctuation_en_4.2.0_3.0_1664020729832.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_xlsr_punctuation", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_xlsr_punctuation", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_xlsr_punctuation| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from bugdaryan) author: John Snow Labs name: distilbert_qa_bugdaryan_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `bugdaryan`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_bugdaryan_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770282430.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_bugdaryan_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770282430.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bugdaryan_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bugdaryan_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_bugdaryan_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/bugdaryan/distilbert-base-uncased-finetuned-squad --- layout: model title: Translate Cebuano to English Pipeline author: John Snow Labs name: translate_ceb_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ceb, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `ceb` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ceb_en_xx_2.7.0_2.4_1609691069990.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ceb_en_xx_2.7.0_2.4_1609691069990.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ceb_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ceb_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ceb.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ceb_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering Large Uncased model (from tiennvcs) author: John Snow Labs name: bert_qa_large_uncased_finetuned_vi_infovqa date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-finetuned-vi-infovqa` is an English model originally trained by `tiennvcs`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_large_uncased_finetuned_vi_infovqa_en_4.0.0_3.0_1657187434802.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_large_uncased_finetuned_vi_infovqa_en_4.0.0_3.0_1657187434802.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_large_uncased_finetuned_vi_infovqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_large_uncased_finetuned_vi_infovqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_large_uncased_finetuned_vi_infovqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/tiennvcs/bert-large-uncased-finetuned-vi-infovqa --- layout: model title: English asr_wav2vec2_base_timit_demo_colab66 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab66 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab66` is an English model originally trained by hassnain. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab66_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab66_en_4.2.0_3.0_1664024863951.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab66_en_4.2.0_3.0_1664024863951.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab66', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab66", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab66| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English DistilBertForQuestionAnswering Cased model (from AyushPJ) author: John Snow Labs name: distilbert_qa_test_squad_trained_finetuned date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `test-squad-trained-finetuned-squad` is an English model originally trained by `AyushPJ`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_test_squad_trained_finetuned_en_4.3.0_3.0_1672775755501.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_test_squad_trained_finetuned_en_4.3.0_3.0_1672775755501.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_test_squad_trained_finetuned","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_test_squad_trained_finetuned","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_test_squad_trained_finetuned| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AyushPJ/test-squad-trained-finetuned-squad --- layout: model title: Ganda asr_wav2vec2_large_xlsr_luganda TFWav2Vec2ForCTC from lucio author: John Snow Labs name: asr_wav2vec2_large_xlsr_luganda date: 2022-09-24 tags: [wav2vec2, lg, audio, open_source, asr] task: Automatic Speech Recognition language: lg edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_luganda` is a Ganda model originally trained by lucio. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_luganda_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_luganda_lg_4.2.0_3.0_1664036718968.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_luganda_lg_4.2.0_3.0_1664036718968.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_luganda", "lg")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_luganda", "lg") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_luganda| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|lg| |Size:|1.2 GB| --- layout: model title: English image_classifier_vit_base_patch16_224_in21k_bantai_v1 ViTForImageClassification from AykeeSalazar author: John Snow Labs name: image_classifier_vit_base_patch16_224_in21k_bantai_v1 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_patch16_224_in21k_bantai_v1` is an English model originally trained by AykeeSalazar. ## Predicted Entities `Public Smoking`, `Public-Drinking`, `ambiguous`, `non-violation` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_in21k_bantai_v1_en_4.1.0_3.0_1660168998995.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_in21k_bantai_v1_en_4.1.0_3.0_1660168998995.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_patch16_224_in21k_bantai_v1", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_patch16_224_in21k_bantai_v1", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_patch16_224_in21k_bantai_v1| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Legal Conditions precedent Clause Binary Classifier (md) author: John Snow Labs name: legclf_conditions_precedent_md date: 2023-01-11 tags: [en, legal, classification, document, agreement, contract, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `conditions-precedent` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
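The paragraph splitting (by multiline) mentioned above can be sketched with a simple regular expression; this is a plain-Python simplification for illustration, not the workshop notebook's own code, and the sample contract text is made up:

```python
import re

def split_paragraphs(text):
    """Split a document into paragraphs on one or more blank lines."""
    # A blank line = a newline, optional whitespace, then another newline
    chunks = re.split(r"\n\s*\n", text)
    return [c.strip() for c in chunks if c.strip()]

contract = ("Section 1. Conditions Precedent.\n\n"
            "The obligations of the parties are subject to the conditions below.\n\n\n"
            "Section 2. Term.")
# Each resulting paragraph can then be fed to the classifier pipeline separately.
paragraphs = split_paragraphs(contract)
```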
## Predicted Entities `other`, `conditions-precedent` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_conditions_precedent_md_en_1.0.0_3.0_1673460306136.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_conditions_precedent_md_en_1.0.0_3.0_1673460306136.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_conditions_precedent_md", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[conditions-precedent]| |[other]| |[other]| |[conditions-precedent]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_conditions_precedent_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash precision recall f1-score support other 1.00 0.97 0.99 39 conditions-precedent 0.97 1.00 0.98 32 accuracy 0.99 71 macro avg 0.98 0.99 0.99 71 weighted avg 0.99 0.99 0.99 71 ``` --- layout: model title: English BertForQuestionAnswering model (from mrm8488) author: John Snow Labs name: bert_qa_ManuERT_for_xqua date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ManuERT-for-xqua` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_ManuERT_for_xqua_en_4.0.0_3.0_1654178819404.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_ManuERT_for_xqua_en_4.0.0_3.0_1654178819404.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_ManuERT_for_xqua","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_ManuERT_for_xqua","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.by_mrm8488").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
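In the nlu one-liner above, the question and context are packed into a single string separated by `|||`. A tiny helper makes that convention explicit (the separator is the one shown in the snippet; the helper name is ours for illustration):

```python
def qa_input(question, context):
    """Join a question and its context with the '|||' separator used by nlu QA models."""
    return f"{question}|||{context}"

packed = qa_input("What's my name?", "My name is Clara and I live in Berkeley.")
# packed can then be passed to nlu.load(...).predict(packed)
```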
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_ManuERT_for_xqua| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/ManuERT-for-xqua --- layout: model title: English Deberta Embeddings model (from domenicrosati) author: John Snow Labs name: deberta_embeddings_v3_large_dapt_scientific_papers_pubmed date: 2023-03-13 tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, en, tensorflow] task: Embeddings language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DeBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deberta-v3-large-dapt-scientific-papers-pubmed` is an English model originally trained by `domenicrosati`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_v3_large_dapt_scientific_papers_pubmed_en_4.3.1_3.0_1678700874628.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_v3_large_dapt_scientific_papers_pubmed_en_4.3.1_3.0_1678700874628.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_v3_large_dapt_scientific_papers_pubmed","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_v3_large_dapt_scientific_papers_pubmed","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|deberta_embeddings_v3_large_dapt_scientific_papers_pubmed| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|1.6 GB| |Case sensitive:|false| ## References https://huggingface.co/domenicrosati/deberta-v3-large-dapt-scientific-papers-pubmed --- layout: model title: English asr_wav2vec2_large_robust_libri_960h TFWav2Vec2ForCTC from facebook author: John Snow Labs name: asr_wav2vec2_large_robust_libri_960h date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_robust_libri_960h` is an English model originally trained by facebook. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_robust_libri_960h_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_robust_libri_960h_en_4.2.0_3.0_1664039426765.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_robust_libri_960h_en_4.2.0_3.0_1664039426765.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_robust_libri_960h", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_robust_libri_960h", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_robust_libri_960h| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|757.5 MB| --- layout: model title: Translate Sranan Tongo to English Pipeline author: John Snow Labs name: translate_srn_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, srn, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `srn` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_srn_en_xx_2.7.0_2.4_1609689416373.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_srn_en_xx_2.7.0_2.4_1609689416373.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_srn_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_srn_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.srn.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_srn_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Spell Checker in English Text author: John Snow Labs name: check_spelling_dl date: 2021-03-23 tags: [en, open_source] supported: true task: Spell Check language: en nav_key: models edition: Spark NLP 2.7.5 spark_version: 2.4 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Spell Checker is a sequence-to-sequence pipeline that detects and corrects spelling errors in your input text. It's based on a Levenshtein Automaton for generating candidate corrections and a Neural Language Model for ranking those corrections. You can download the pretrained pipeline that comes ready to use. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/check_spelling_dl_en_2.7.5_2.4_1616498835957.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/check_spelling_dl_en_2.7.5_2.4_1616498835957.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use In order to use this pretrained pipeline, you just need to provide the text to be checked.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('check_spelling_dl', lang='en') result = pipeline.fullAnnotate("During the summer we have the hottest ueather. I have a black ueather jacket, so nice.I intrduce you to my sister, she is called ueather.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("check_spelling_dl", lang = "en") val result = pipeline.fullAnnotate("During the summer we have the hottest ueather. I have a black ueather jacket, so nice.I intrduce you to my sister, she is called ueather.") ``` {:.nlu-block} ```python import nlu nlu.load("en.spell").predict("""During the summer we have the hottest ueather. I have a black ueather jacket, so nice.I intrduce you to my sister, she is called ueather.""") ```
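The `fullAnnotate` call above returns per-column annotation lists; the (original, corrected) pairs shown in the results section can be reassembled with a small helper. A minimal sketch, assuming the `token` and `checked` outputs have already been extracted as plain string lists:

```python
def pair_corrections(tokens, checked):
    """Zip original tokens with their corrected forms and collect the changes."""
    pairs = list(zip(tokens, checked))
    changed = [(orig, fix) for orig, fix in pairs if orig != fix]
    return pairs, changed

tokens = ["the", "hottest", "ueather", "."]
checked = ["the", "hottest", "weather", "."]
pairs, changed = pair_corrections(tokens, checked)
# changed holds only the corrected tokens, e.g. [("ueather", "weather")]
```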
## Results ```bash [('During', 'During'), ('the', 'the'), ('summer', 'summer'), ('we', 'we'), ('have', 'have'), ('the', 'the'), ('hottest', 'hottest'), ('ueather', 'weather'), ('.', '.'), ('I', 'I'), ('have', 'have'), ('a', 'a'), ('black', 'black'), ('ueather', 'leather'), ('jacket', 'jacket'), (',', ','), ('so', 'so'), ('nice', 'nice'), ('.', '.'), ('I', 'I'), ('intrduce', 'introduce'), ('you', 'you'), ('to', 'to'), ('my', 'my'), ('sister', 'sister'), (',', ','), ('she', 'she'), ('is', 'is'), ('called', 'called'), ('ueather', 'Heather'), ('.', '.')] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|check_spelling_dl| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.5+| |License:|Open Source| |Edition:|Official| |Language:|en| ## Included Models `SentenceDetectorDLModel` `ContextSpellCheckerModel` --- layout: model title: English BertForQuestionAnswering model (from howey) author: John Snow Labs name: bert_qa_bert_base_uncased_squad_L3 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_base_uncased_squad_L3` is an English model originally trained by `howey`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squad_L3_en_4.0.0_3.0_1654185111402.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squad_L3_en_4.0.0_3.0_1654185111402.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_squad_L3","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_squad_L3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased_l3.by_howey").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
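The nlu snippet above packs question and context into a single string separated by `|||`. A minimal sketch of that convention (the parsing helper is illustrative, not part of the nlu API):

```python
def split_qa(text, sep="|||"):
    """Split a 'question|||context' string into its two parts."""
    question, _, context = text.partition(sep)
    return question.strip(), context.strip().strip('"')

q, c = split_qa('What\'s my name?|||"My name is Clara and I live in Berkeley."')
```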
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_squad_L3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|169.0 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/howey/bert_base_uncased_squad_L3 --- layout: model title: Spell Checking Pipeline for English author: John Snow Labs name: check_spelling date: 2021-03-26 tags: [open_source, english, check_spelling, pipeline, en] supported: true task: [Spell Check] language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The check_spelling pipeline is a pretrained pipeline that performs basic text processing steps and corrects spelling errors. It covers most of the common text processing tasks on your dataframe. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/SPELL_CHECKER_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SPELL_CHECKER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/check_spelling_en_3.0.0_3.0_1616772629811.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/check_spelling_en_3.0.0_3.0_1616772629811.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('check_spelling', lang = 'en') annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("check_spelling", lang = "en") val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs ! "] result_df = nlu.load('en.spell').predict(text) result_df ```
## Results ```bash | | document | sentence | token | checked | |---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:-----------------------------------------------| | 0 | ['I liek to live dangertus ! '] | ['I liek to live dangertus !'] | ['I', 'liek', 'to', 'live', 'dangertus', '!'] | ['I', 'like', 'to', 'live', 'dangerous', '!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|check_spelling| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - NorvigSweetingModel --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s381 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s381 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s381` is a German model originally trained by jonatasgrosman.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s381_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s381_de_4.2.0_3.0_1664112699216.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s381_de_4.2.0_3.0_1664112699216.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s381", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s381", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s381| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: Russian T5ForConditionalGeneration Base Cased model (from 0x7194633) author: John Snow Labs name: t5_keyt5_base date: 2023-01-30 tags: [ru, open_source, t5, tensorflow] task: Text Generation language: ru edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyt5-base` is a Russian model originally trained by `0x7194633`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_keyt5_base_ru_4.3.0_3.0_1675104774932.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_keyt5_base_ru_4.3.0_3.0_1675104774932.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_keyt5_base","ru") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_keyt5_base","ru") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_keyt5_base| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ru| |Size:|927.4 MB| ## References - https://huggingface.co/0x7194633/keyt5-base - https://github.com/0x7o/text2keywords - https://colab.research.google.com/github/0x7o/text2keywords/blob/main/example/keyT5_use.ipynb - https://colab.research.google.com/github/0x7o/text2keywords/blob/main/example/keyT5_train.ipynb --- layout: model title: English image_classifier_vit_violation_classification_bantai__v100ep ViTForImageClassification from AykeeSalazar author: John Snow Labs name: image_classifier_vit_violation_classification_bantai__v100ep date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_violation_classification_bantai__v100ep` is an English model originally trained by AykeeSalazar. ## Predicted Entities `Public Smoking`, `Public-Drinking`, `ambiguous`, `non-violation` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_violation_classification_bantai__v100ep_en_4.1.0_3.0_1660169126596.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_violation_classification_bantai__v100ep_en_4.1.0_3.0_1660169126596.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_violation_classification_bantai__v100ep", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_violation_classification_bantai__v100ep", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_violation_classification_bantai__v100ep| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_6 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-512-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_6_en_4.0.0_3.0_1657185261800.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_6_en_4.0.0_3.0_1657185261800.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_6","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-512-finetuned-squad-seed-6 --- layout: model title: Stopwords Remover for Ukrainian language (467 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, uk, open_source] task: Stop Words Removal language: uk edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_uk_3.4.1_3.0_1646673071386.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_uk_3.4.1_3.0_1646673071386.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","uk") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Ви не краще за мене"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","uk") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Ви не краще за мене").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("uk.stopwords").predict("""Ви не краще за мене""") ```
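The empty result shown in the next section follows directly from the cleaning step: every token of the example sentence is on the stopword list, so nothing survives. A minimal sketch of the removal logic (the listed words are assumed to be among the 467 entries):

```python
def remove_stopwords(tokens, stopwords):
    """Keep only tokens whose lowercased form is not in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

ukrainian_stopwords = {"ви", "не", "краще", "за", "мене"}  # assumed subset of the full list
cleaned = remove_stopwords(["Ви", "не", "краще", "за", "мене"], ukrainian_stopwords)
# cleaned == []
```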
## Results ```bash +------+ |result| +------+ |[] | +------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|uk| |Size:|3.1 KB| --- layout: model title: English RobertaForQuestionAnswering (from Salesforce) author: John Snow Labs name: roberta_qa_qaconv_roberta_large_squad2 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `qaconv-roberta-large-squad2` is an English model originally trained by `Salesforce`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_qaconv_roberta_large_squad2_en_4.0.0_3.0_1655729360534.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_qaconv_roberta_large_squad2_en_4.0.0_3.0_1655729360534.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_qaconv_roberta_large_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_qaconv_roberta_large_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.large.by_Salesforce").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_qaconv_roberta_large_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Salesforce/qaconv-roberta-large-squad2 --- layout: model title: Legal Employment agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_employment_agreement date: 2022-10-24 tags: [en, legal, classification, document, agreement, contract, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_employment_agreement` model is a Legal Longformer Document Classifier that classifies whether a document belongs to the class employment-agreement or not (binary classification). Longformer is limited to 4096 tokens, so only the first 4096 tokens are taken into account. In our experience, for the vast majority of documents in legal corpora, if they are clean and contain only the legal document without extra leading material, 4096 tokens are enough for Document Classification. If that is not the case for you, let us know and we can apply another approach: splitting the document into chunks of 4096 tokens, averaging their embeddings, and training on the averaged version, which means the whole document is taken into account. In theory, however, this should not be required.
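The chunk-and-average fallback described above can be sketched in a few lines; the 4096-token window matches the Longformer limit, while the helper name is illustrative:

```python
def chunk_tokens(tokens, max_len=4096):
    """Split a token list into consecutive windows of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

chunks = chunk_tokens(list(range(10000)))
# three windows: 4096, 4096 and 1808 tokens; each would be embedded
# separately and the embeddings averaged before training
```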
## Predicted Entities `other`, `employment-agreement` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_employment_agreement_en_1.0.0_3.0_1666620976596.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_employment_agreement_en_1.0.0_3.0_1666620976596.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token") \ .setOutputCol("embeddings") sembeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_employment_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, sembeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
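The `SentenceEmbeddings` stage above with `setPoolingStrategy("AVERAGE")` collapses the per-token Longformer vectors into one document vector by element-wise mean before classification. A minimal sketch of that pooling, independent of Spark NLP:

```python
def average_pool(token_vectors):
    """Element-wise mean over a list of equally sized embedding vectors."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

doc_vector = average_pool([[1.0, 2.0], [3.0, 4.0]])
# doc_vector == [2.0, 3.0]
```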
## Results ```bash +-------+ | result| +-------+ |[employment-agreement]| |[other]| |[other]| |[employment-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_employment_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|20.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support employment-agreement 0.96 1.00 0.98 24 other 1.00 0.98 0.99 62 accuracy - - 0.99 86 macro-avg 0.98 0.99 0.99 86 weighted-avg 0.99 0.99 0.99 86 ``` --- layout: model title: Legal Loan Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_loan_agreement_bert date: 2022-11-24 tags: [en, legal, classification, agreement, loan, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_loan_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `loan-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `loan-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_loan_agreement_bert_en_1.0.0_3.0_1669316061148.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_loan_agreement_bert_en_1.0.0_3.0_1669316061148.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_loan_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[loan-agreement]| |[other]| |[other]| |[loan-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_loan_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support loan-agreement 0.72 0.97 0.82 29 other 0.97 0.73 0.83 41 accuracy - - 0.83 70 macro-avg 0.84 0.85 0.83 70 weighted-avg 0.86 0.83 0.83 70 ``` --- layout: model title: Translate Wolaytta to English Pipeline author: John Snow Labs name: translate_wal_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, wal, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `wal` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_wal_en_xx_2.7.0_2.4_1609691928141.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_wal_en_xx_2.7.0_2.4_1609691928141.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_wal_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_wal_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.wal.translate_to.en').predict(text, output_level='sentence') translate_df ```
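`annotate` returns a plain dictionary mapping output column names to lists of strings. A small helper for pulling out the first translated sentence (the "translation" key is an assumption; inspect the returned dict's keys for the actual column name):

```python
def first_result(annotations, key="translation"):
    """Return the first value under the given output column, if present."""
    values = annotations.get(key, [])
    return values[0] if values else None

# e.g. on a result shaped like {"translation": ["Hello!"]}
sample = {"translation": ["Hello!"]}
```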
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_wal_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_nl32 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-nl32` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl32_en_4.3.0_3.0_1675122269324.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl32_en_4.3.0_3.0_1675122269324.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_nl32","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_nl32","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_nl32| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|514.9 MB| ## References - https://huggingface.co/google/t5-efficient-small-nl32 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Translate Gilbertese to English Pipeline author: John Snow Labs name: translate_gil_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, gil, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `gil` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_gil_en_xx_2.7.0_2.4_1609690078155.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_gil_en_xx_2.7.0_2.4_1609690078155.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_gil_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_gil_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.gil.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_gil_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering model (from rahulkuruvilla) A Version author: John Snow Labs name: distilbert_qa_COVID_DistilBERTa date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `COVID-DistilBERTa` is an English model originally trained by `rahulkuruvilla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_COVID_DistilBERTa_en_4.0.0_3.0_1654722884791.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_COVID_DistilBERTa_en_4.0.0_3.0_1654722884791.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_COVID_DistilBERTa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_COVID_DistilBERTa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq("What is my name?", "My name is Clara and I live in Berkeley.").toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.covid.distil_bert.a.by_rahulkuruvilla").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
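Under the hood, extractive question-answering annotators such as `DistilBertForQuestionAnswering` score every token of the context as a candidate answer start and answer end, and the returned answer is the best-scoring valid span. The sketch below illustrates that span-selection step with made-up logits and a hypothetical `best_span` helper; it is not the annotator's actual implementation, only the general technique.

```python
def best_span(start_logits, end_logits, max_len=30):
    # Pick (i, j) maximizing start_logits[i] + end_logits[j],
    # subject to i <= j and a bounded span length.
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best_score:
                best_score, best = score, (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Illustrative logits only -- a real model produces these per token.
start_logits = [0.1, 0.0, 0.2, 6.1, 0.0, 0.1, 0.0, 0.0, 1.2, 0.0]
end_logits   = [0.0, 0.1, 0.0, 5.8, 0.2, 0.0, 0.1, 0.0, 1.0, 0.0]

i, j = best_span(start_logits, end_logits)
print(" ".join(tokens[i : j + 1]))  # prints "Clara"
```

This is why the pipeline needs both a `document_question` and a `document_context` column: the answer is always a literal span of the context, never generated text.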
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_COVID_DistilBERTa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/rahulkuruvilla/COVID-DistilBERTa --- layout: model title: Word2Vec Embeddings in Danish (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, da, open_source] task: Embeddings language: da edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_da_3.4.1_3.0_1647293782911.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_da_3.4.1_3.0_1647293782911.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","da") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Jeg elsker Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","da") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Jeg elsker Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("da.embed.w2v_cc_300d").predict("""Jeg elsker Spark NLP""") ```
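The `embeddings` column produced above holds a plain 300-dimensional float vector per token, so downstream similarity logic can work directly on the arrays. A minimal, framework-free sketch of cosine similarity follows; the 3-d vectors are made-up stand-ins for real 300-d embeddings, used only to keep the example short.

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|); returns 0.0 for a zero vector
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

# Toy 3-d stand-ins for 300-d embedding vectors
v_king = [0.5, 0.1, 0.8]
v_queen = [0.45, 0.2, 0.75]
v_car = [-0.7, 0.9, 0.0]

# Semantically close words should score higher than unrelated ones
print(cosine_similarity(v_king, v_queen) > cosine_similarity(v_king, v_car))  # prints True
```

In a Spark job the same comparison would typically be expressed as a UDF over the `embeddings` column rather than collected to the driver.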
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|da| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English AlbertForQuestionAnswering model (from SS8) author: John Snow Labs name: albert_qa_squad_2.0 date: 2022-06-24 tags: [en, open_source, albert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: AlBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `albert_squad_2.0` is an English model originally trained by `SS8`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_squad_2.0_en_4.0.0_3.0_1656064119679.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_squad_2.0_en_4.0.0_3.0_1656064119679.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_squad_2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_squad_2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq("What is my name?", "My name is Clara and I live in Berkeley.").toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.albert").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_qa_squad_2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|42.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/SS8/albert_squad_2.0 --- layout: model title: English asr_wav2vec2_large_xlsr_nahuatl TFWav2Vec2ForCTC from tyoc213 author: John Snow Labs name: asr_wav2vec2_large_xlsr_nahuatl date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_nahuatl` is an English model originally trained by tyoc213. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_nahuatl_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_nahuatl_en_4.2.0_3.0_1664110129570.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_nahuatl_en_4.2.0_3.0_1664110129570.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_nahuatl", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_nahuatl", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
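The examples above assume an existing `audioDf` whose `audio_content` column already holds the raw audio as an array of floats. Wav2vec2 models are trained on mono 16 kHz audio with samples normalized to [-1.0, 1.0]. The sketch below shows one hedged way to produce such an array from 16-bit PCM bytes using only the standard library; the synthetic test tone and the commented-out Spark line are illustrative assumptions, not part of the model card.

```python
import math
import struct

SAMPLE_RATE = 16000  # wav2vec2 models expect 16 kHz mono audio

def pcm16_to_floats(raw_bytes):
    # Unpack little-endian signed 16-bit PCM and scale to [-1.0, 1.0]
    n = len(raw_bytes) // 2
    samples = struct.unpack("<%dh" % n, raw_bytes[: n * 2])
    return [s / 32768.0 for s in samples]

# Synthetic one-second 440 Hz tone standing in for a real recording
pcm = struct.pack(
    "<%dh" % SAMPLE_RATE,
    *(int(32767 * 0.3 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE))
      for t in range(SAMPLE_RATE)),
)
floats = pcm16_to_floats(pcm)

# With an active SparkSession `spark`, the DataFrame used above could then be:
# audioDf = spark.createDataFrame([(floats,)], ["audio_content"])
```

Real recordings would additionally need resampling to 16 kHz and down-mixing to mono before this conversion.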
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_nahuatl| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from anurag0077) author: John Snow Labs name: distilbert_qa_anurag0077_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `anurag0077`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_anurag0077_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769885692.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_anurag0077_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769885692.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_anurag0077_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_anurag0077_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_anurag0077_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anurag0077/distilbert-base-uncased-finetuned-squad --- layout: model title: English RobertaForQuestionAnswering Cased model (from skandaonsolve) author: John Snow Labs name: roberta_qa_finetuned_city date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-finetuned-city` is an English model originally trained by `skandaonsolve`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_city_en_4.3.0_3.0_1674220204421.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_city_en_4.3.0_3.0_1674220204421.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_city","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_city","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_finetuned_city| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/skandaonsolve/roberta-finetuned-city --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_ff12000 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-ff12000` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_ff12000_en_4.3.0_3.0_1675120874161.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_ff12000_en_4.3.0_3.0_1675120874161.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_ff12000","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_ff12000","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_ff12000| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|388.5 MB| ## References - https://huggingface.co/google/t5-efficient-small-ff12000 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Pipeline to Detect Time-related Terminology author: John Snow Labs name: roberta_token_classifier_timex_semeval_pipeline date: 2022-03-19 tags: [timex, semeval, ner, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [roberta_token_classifier_timex_semeval](https://nlp.johnsnowlabs.com/2021/12/28/roberta_token_classifier_timex_semeval_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_TIMEX_SEMEVAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_timex_semeval_pipeline_en_3.4.1_3.0_1647699181926.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_timex_semeval_pipeline_en_3.4.1_3.0_1647699181926.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python timex_pipeline = PretrainedPipeline("roberta_token_classifier_timex_semeval_pipeline", lang = "en") timex_pipeline.annotate("Model training was started at 22:12C and it took 3 days from Tuesday to Friday.") ``` ```scala val timex_pipeline = new PretrainedPipeline("roberta_token_classifier_timex_semeval_pipeline", lang = "en") timex_pipeline.annotate("Model training was started at 22:12C and it took 3 days from Tuesday to Friday.") ```
## Results ```bash +-------+-----------------+ |chunk |ner_label | +-------+-----------------+ |22:12C |Period | |3 |Number | |days |Calendar-Interval| |Tuesday|Day-Of-Week | |to |Between | |Friday |Day-Of-Week | +-------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_token_classifier_timex_semeval_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|439.5 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - RoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: Pipeline to Detect Cellular/Molecular Biology Entities (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_cellular_pipeline date: 2023-03-20 tags: [bertfortokenclassification, ner, cellular, en, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [bert_token_classifier_ner_cellular](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_cellular_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_cellular_pipeline_en_4.3.0_3.2_1679306596786.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_cellular_pipeline_en_4.3.0_3.2_1679306596786.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_cellular_pipeline", "en", "clinical/models") text = '''Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_cellular_pipeline", "en", "clinical/models") val text = "Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. 
We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.cellular_pipeline").predict("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:--------------------------------------------|--------:|------:|:------------|-------------:| | 0 | intracellular signaling proteins | 27 | 58 | protein | 0.999477 | | 1 | human T-cell leukemia virus type 1 promoter | 130 | 172 | DNA | 0.999333 | | 2 | Tax | 186 | 188 | protein | 0.999844 | | 3 | Tax-responsive element 1 | 193 | 216 | DNA | 0.999461 | | 4 | cyclic AMP-responsive members | 237 | 265 | protein | 0.994097 | | 5 | CREB/ATF family | 274 | 288 | protein | 0.995731 | | 6 | transcription factors | 293 | 313 | protein | 0.993303 | | 7 | Tax | 389 | 391 | protein | 0.999503 | | 8 | human T-cell leukemia virus type 1 | 396 | 429 | DNA | 0.93039 | | 9 | Tax-responsive element 1 | 431 | 454 | DNA | 0.99246 | | 10 | TRE-1 | 457 | 461 | DNA | 0.999713 | | 11 | lacZ gene | 582 | 590 | DNA | 0.998996 | | 12 | CYC1 promoter | 617 | 629 | DNA | 0.99915 | | 13 | TRE-1 | 663 | 667 | DNA | 0.99943 | | 14 | cyclic AMP response element-binding protein | 695 | 737 | protein | 0.999852 | | 15 | CREB | 740 | 743 | protein | 0.999918 | | 16 | CREB | 749 | 752 | protein | 0.999922 | | 17 | GAL4 activation domain | 767 | 788 | protein | 0.931828 | | 18 | GAD | 791 | 793 | protein | 0.999684 | | 19 | reporter gene | 848 | 860 | DNA | 0.998856 | | 20 | Tax | 863 | 865 | protein | 0.999717 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_cellular_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.9 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Chinese BertForMaskedLM Mini Cased model (from hfl) author: John Snow Labs name: bert_embeddings_minirbt_h288 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings 
language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `minirbt-h288` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_minirbt_h288_zh_4.2.4_3.0_1670022638878.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_minirbt_h288_zh_4.2.4_3.0_1670022638878.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_minirbt_h288","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_minirbt_h288","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_minirbt_h288| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|46.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/minirbt-h288 - https://github.com/iflytek/MiniRBT - https://github.com/ymcui/LERT - https://github.com/ymcui/PERT - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/iflytek/HFL-Anthology --- layout: model title: BERT Sentence Embeddings (Large Cased) author: John Snow Labs name: sent_bert_large_cased date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains a deep bidirectional transformer trained on Wikipedia and the BookCorpus. The details are described in the paper "[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_large_cased_en_2.6.0_2.4_1598346401930.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_large_cased_en_2.6.0_2.4_1598346401930.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_bert_large_cased", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer'], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_bert_large_cased", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.bert_large_cased').predict(text, output_level='sentence') embeddings_df ```
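Downstream, the `sentence_embeddings` output (one 1024-dimensional vector per sentence for this model) is typically compared with cosine similarity. A minimal, dependency-free sketch — the short vectors below are illustrative placeholders, not real model output:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Placeholder vectors standing in for real 1024-dim sentence embeddings.
v1 = [0.12, -0.34, 0.56]
v2 = [0.11, -0.33, 0.57]
v3 = [-0.56, 0.34, -0.12]

print(cosine_similarity(v1, v2))  # near-identical vectors -> close to 1.0
print(cosine_similarity(v1, v3))  # opposed vectors -> negative
```

The same function applies unchanged to the arrays returned in the `sentence_embeddings` annotations.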
{:.h2_title} ## Results ```bash token en_embed_sentence_bert_large_cased_embeddings I [[-0.6228358149528503, -0.3453695774078369, 0.... love [[-0.6228358149528503, -0.3453695774078369, 0.... NLP [[-0.6228358149528503, -0.3453695774078369, 0.... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_bert_large_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|1024| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/google/bert_cased_L-24_H-1024_A-16/1](https://tfhub.dev/google/bert_cased_L-24_H-1024_A-16/1) --- layout: model title: Pipeline to Mapping ICDO Codes with Their Corresponding SNOMED Codes author: John Snow Labs name: icdo_snomed_mapping date: 2023-03-29 tags: [en, licensed, clinical, resolver, pipeline, chunk_mapping, icdo, snomed] task: Chunk Mapping language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of `icdo_snomed_mapper` model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icdo_snomed_mapping_en_4.3.2_3.2_1680120340177.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icdo_snomed_mapping_en_4.3.2_3.2_1680120340177.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("icdo_snomed_mapping", "en", "clinical/models") result = pipeline.fullAnnotate("8120/1 8170/3 8380/3") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("icdo_snomed_mapping", "en", "clinical/models") val result = pipeline.fullAnnotate("8120/1 8170/3 8380/3") ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.icdo_to_snomed.pipe").predict("""Put your text here.""") ```
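The Results section shows the mapped codes as pipe-delimited strings; pairing each ICD-O code with its SNOMED counterpart is a small post-processing step. A sketch assuming that exact string format (`zip_codes` is an illustrative helper, not part of the library):

```python
def zip_codes(icdo_cell: str, snomed_cell: str) -> dict:
    """Pair each ICD-O code with its mapped SNOMED code from pipe-delimited cells."""
    icdo = [c.strip() for c in icdo_cell.split("|")]
    snomed = [c.strip() for c in snomed_cell.split("|")]
    return dict(zip(icdo, snomed))

# Cells as they appear in the Results table for this pipeline.
mapping = zip_codes("8120/1 | 8170/3 | 8380/3", "45083001 | 25370001 | 30289006")
print(mapping["8170/3"])  # -> 25370001
```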
## Results ```bash | | icdo_code | snomed_code | |---:|:-------------------------|:-------------------------------| | 0 | 8120/1 | 8170/3 | 8380/3 | 45083001 | 25370001 | 30289006 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icdo_snomed_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|133.3 KB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel --- layout: model title: Fast Neural Machine Translation Model from Walloon to English author: John Snow Labs name: opus_mt_wa_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, wa, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `wa` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_wa_en_xx_2.7.0_2.4_1609166413374.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_wa_en_xx_2.7.0_2.4_1609166413374.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_wa_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_wa_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.wa.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_wa_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: OCR base v2 for printed text author: John Snow Labs name: ocr_base_printed_v2 date: 2023-01-17 tags: [en, licensed] task: OCR Text Detection & Recognition language: en nav_key: models edition: Visual NLP 4.2.4 spark_version: 3.2.1 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description OCR base model v2 for recognizing printed text, based on the TrOCR architecture and pretrained on printed-text datasets. The TrOCR model was proposed in TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to perform optical character recognition (OCR). The abstract from the paper is the following: Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. 
The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/ocr/RECOGNIZE_PRINTED/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/Cards/SparkOcrImageToTextPrinted_V2.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/ocr_base_printed_v2_en_4.2.2_3.0_1670623909000.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/ocr_base_printed_v2_en_4.2.2_3.0_1670623909000.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python binary_to_image = BinaryToImage() \ .setInputCol("content") \ .setOutputCol("image") \ .setImageType(ImageType.TYPE_3BYTE_BGR) text_detector = ImageTextDetectorV2 \ .pretrained("image_text_detector_v2", "en", "clinical/ocr") \ .setInputCol("image") \ .setOutputCol("text_regions") \ .setWithRefiner(True) \ .setSizeThreshold(-1) \ .setLinkThreshold(0.3) \ .setWidth(500) ocr = ImageToTextV2Opt.pretrained("ocr_base_printed_v2", "en", "clinical/ocr") \ .setInputCols(["image", "text_regions"]) \ .setGroupImages(True) \ .setOutputCol("text") \ .setRegionsColumn("text_regions") draw_regions = ImageDrawRegions() \ .setInputCol("image") \ .setInputRegionsCol("text_regions") \ .setOutputCol("image_with_regions") \ .setRectColor(Color.green) \ .setRotated(True) pipeline = PipelineModel(stages=[ binary_to_image, text_detector, ocr, draw_regions ]) imagePath = pkg_resources.resource_filename('sparkocr', 'resources/ocr/images/check.jpg') image_df = spark.read.format("binaryFile").load(imagePath) result = pipeline.transform(image_df).cache() ``` ```scala val binary_to_image = new BinaryToImage() .setInputCol("content") .setOutputCol("image") .setImageType(ImageType.TYPE_3BYTE_BGR) val text_detector = ImageTextDetectorV2 .pretrained("image_text_detector_v2", "en", "clinical/ocr") .setInputCol("image") .setOutputCol("text_regions") .setWithRefiner(true) .setSizeThreshold(-1) .setLinkThreshold(0.3) .setWidth(500) val ocr = ImageToTextV2Opt .pretrained("ocr_base_printed_v2", "en", "clinical/ocr") .setInputCols(Array("image", "text_regions")) .setGroupImages(true) .setOutputCol("text") .setRegionsColumn("text_regions") val draw_regions = new ImageDrawRegions() .setInputCol("image") .setInputRegionsCol("text_regions") .setOutputCol("image_with_regions") .setRectColor(Color.green) .setRotated(true) val pipeline = new Pipeline().setStages(Array(binary_to_image, text_detector, ocr, draw_regions)) val imagePath = "resources/ocr/images/check.jpg" val image_df = spark.read.format("binaryFile").load(imagePath) val result = pipeline.fit(image_df).transform(image_df).cache() ```
## Example {%- capture input_image -%} ![Screenshot](/assets/images/examples_ocr/image2.png) {%- endcapture -%} {%- capture output_image -%} ![Screenshot](/assets/images/examples_ocr/image2_out3.png) {%- endcapture -%} {% include templates/input_output_image.md input_image=input_image output_image=output_image %} ## Output text ```bash STARBUCKS STORE #10208 11302 EUCLID AVENUE CLEVELAND, OH (216) 229-0749 CHK 664290 12/07/2014 06:43 PM 1912003- DRAWER: 2. REG: 2 VT PEP MOCHA SBUX CARD : 4.95 XXXXXXXXXXXX3228 SUBTOTAL: $4.95 TOTAL @ 6.95 CHANGE DUE $0.00 ................ CCHECK CLOSED 12/07/2014 06:43 PM SBUX CARD X3228 NEW BALANCE: 37.45 CARD IS REGISTERED ``` ## Model Information {:.table-model} |---|---| |Model Name:|ocr_base_printed_v2| |Type:|ocr| |Compatibility:|Visual NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| --- layout: model title: Legal Merger Clause Binary Classifier author: John Snow Labs name: legclf_merger_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `merger` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. 
If you have long legal documents and want to look for clauses, we recommend splitting the documents first using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. If your input is longer than that, consider splitting it into smaller pieces (again, see the tutorial linked above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers available in Models Hub, producing a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `merger` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_merger_clause_en_1.0.0_3.2_1660122654083.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_merger_clause_en_1.0.0_3.2_1660122654083.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_merger_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
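For reference, the weighted averages reported in Benchmarking sections like the one below are support-weighted means of the per-class scores. A quick sketch using this model's per-class F1 scores and supports (`weighted_avg` is an illustrative helper, not a library function):

```python
def weighted_avg(scores, supports):
    """Support-weighted average of per-class scores."""
    return sum(s * n for s, n in zip(scores, supports)) / sum(supports)

# Per-class f1-scores (merger, other) and supports from the Benchmarking table.
print(round(weighted_avg([0.86, 0.95], [39, 109]), 2))  # -> 0.93
```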
## Results ```bash +-------+ | result| +-------+ |[merger]| |[other]| |[other]| |[merger]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_merger_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support merger 0.83 0.90 0.86 39 other 0.96 0.94 0.95 109 accuracy - - 0.93 148 macro-avg 0.90 0.92 0.91 148 weighted-avg 0.93 0.93 0.93 148 ``` --- layout: model title: Detect Assertion Status from Oncology Entities author: John Snow Labs name: assertion_oncology_wip date: 2022-10-11 tags: [licensed, clinical, oncology, en, assertion] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects the assertion status of entities related to oncology (including diagnoses, therapies and tests). 
## Predicted Entities `Absent`, `Family`, `Hypothetical`, `Past`, `Possible`, `Present` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_wip_en_4.0.0_3.0_1665519630515.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_wip_en_4.0.0_3.0_1665519630515.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(["Cancer_Dx", "Tumor_Finding", "Cancer_Surgery", "Chemotherapy", "Pathology_Test", "Imaging_Test"]) assertion = AssertionDLModel.pretrained("assertion_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion]) data = spark.createDataFrame([["The patient is suspected to have lung cancer. Family history is positive for other cancers. The result of the biopsy was positive."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Cancer_Dx", "Tumor_Finding", "Cancer_Surgery", "Chemotherapy", "Pathology_Test", "Imaging_Test")) val assertion = AssertionDLModel.pretrained("assertion_oncology_wip","en","clinical/models") .setInputCols(Array("sentence","ner_chunk","embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion)) val data = Seq("""The patient is suspected to have lung cancer. Family history is positive for other cancers. The result of the biopsy was positive.""").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.oncology_wip").predict("""The patient is suspected to have lung cancer. Family history is positive for other cancers. The result of the biopsy was positive.""") ```
## Results ```bash | chunk | ner_label | assertion | |:--------------|:---------------|:------------| | lung cancer | Cancer_Dx | Possible | | cancers | Cancer_Dx | Family | | biopsy | Pathology_Test | Past | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_oncology_wip| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion_pred]| |Language:|en| |Size:|1.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label precision recall f1-score support Absent 0.83 0.79 0.81 368.0 Family 0.80 0.80 0.80 40.0 Hypothetical 0.65 0.57 0.61 229.0 Past 0.90 0.91 0.91 2124.0 Possible 0.64 0.61 0.63 85.0 Present 0.87 0.88 0.88 2121.0 macro-avg 0.78 0.76 0.77 4967.0 weighted-avg 0.87 0.87 0.87 4967.0 ``` --- layout: model title: Sentiment Analysis for Thai (sentiment_jager_use) author: John Snow Labs name: sentiment_jager_use date: 2021-01-14 task: Sentiment Analysis language: th edition: Spark NLP 2.7.1 spark_version: 2.4 tags: [sentiment, th, open_source] supported: true annotator: SentimentDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Analyze sentiment in reviews by classifying them as `positive` or `negative`. When the sentiment probability is below a customizable threshold (default `0.6`), the resulting document is labeled `neutral`. This model is trained on the multilingual `UniversalSentenceEncoder` sentence embeddings and uses a DL approach to classify the sentiments. 
## Predicted Entities `positive`, `negative`, `neutral` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentiment_jager_use_th_2.7.1_2.4_1610586390122.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentiment_jager_use_th_2.7.1_2.4_1610586390122.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use in the pipeline with the pretrained multi-language `UniversalSentenceEncoder` annotator `tfhub_use_multi_lg`.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") use = UniversalSentenceEncoder.pretrained("tfhub_use_multi_lg", "xx") \ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") sentimentdl = SentimentDLModel.pretrained("sentiment_jager_use", "th")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("sentiment") pipeline = Pipeline(stages = [document_assembler, use, sentimentdl]) example = spark.createDataFrame([['เเพ้ตอนnctโผล่มาตลอดเลยค่ะเเอด5555555']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val use = UniversalSentenceEncoder.pretrained("tfhub_use_multi_lg", "xx") .setInputCols(Array("document")) .setOutputCol("sentence_embeddings") val sentimentdl = SentimentDLModel.pretrained("sentiment_jager_use", "th") .setInputCols(Array("sentence_embeddings")) .setOutputCol("sentiment") val pipeline = new Pipeline().setStages(Array(document_assembler, use, sentimentdl)) val data = Seq("เเพ้ตอนnctโผล่มาตลอดเลยค่ะเเอด5555555").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""เเพ้ตอนnctโผล่มาตลอดเลยค่ะเเอด5555555"""] sentiment_df = nlu.load('th.classify.sentiment').predict(text) sentiment_df ```
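The per-class `f1-score` values reported in Benchmarking tables are simply the harmonic mean of precision and recall, so the figures for this model can be verified directly:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision/recall pairs from this model's Benchmarking table.
print(round(f1(0.94, 0.99), 2))  # negative class -> 0.96
print(round(f1(0.97, 0.87), 2))  # positive class -> 0.92
```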
## Results ```bash +-------------------------------------+----------+ |text |result | +-------------------------------------+----------+ |เเพ้ตอนnctโผล่มาตลอดเลยค่ะเเอด5555555 |[positive] | +-------------------------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentiment_jager_use| |Compatibility:|Spark NLP 2.7.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[sentiment]| |Language:|th| ## Data Source The model was trained on the custom corpus from [Jager V3](https://github.com/JagerV3/sentiment_analysis_thai). ## Benchmarking ```bash | sentiment | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | negative | 0.94 | 0.99 | 0.96 | 82 | | positive | 0.97 | 0.87 | 0.92 | 38 | | accuracy | | | 0.95 | 120 | | macro avg | 0.96 | 0.93 | 0.94 | 120 | | weighted avg | 0.95 | 0.95 | 0.95 | 120 | ``` --- layout: model title: English RobertaForQuestionAnswering (from huxxx657) author: John Snow Labs name: roberta_qa_roberta_base_finetuned_scrambled_squad_5_new date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-scrambled-squad-5-new` is an English model originally trained by `huxxx657`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_scrambled_squad_5_new_en_4.0.0_3.0_1655734213398.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_scrambled_squad_5_new_en_4.0.0_3.0_1655734213398.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_finetuned_scrambled_squad_5_new","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_finetuned_scrambled_squad_5_new","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_scrambled_sq.by_huxxx657").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_finetuned_scrambled_squad_5_new| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-scrambled-squad-5-new --- layout: model title: Finnish asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_Finnish_NLP TFWav2Vec2ForCTC from Finnish-NLP author: John Snow Labs name: asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_Finnish_NLP date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_Finnish_NLP` is a Finnish model originally trained by Finnish-NLP. NOTE: This model only works on a CPU, if you need to use this model on a GPU device please use asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_Finnish_NLP_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_Finnish_NLP_fi_4.2.0_3.0_1664039192288.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_Finnish_NLP_fi_4.2.0_3.0_1664039192288.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_Finnish_NLP", "fi")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_Finnish_NLP", "fi") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
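`AudioAssembler` reads raw audio samples from the `audio_content` column. A minimal sketch of producing such a sample array, with a synthetic tone standing in for a real recording (the 16 kHz rate is an assumption based on how XLS-R models are commonly trained, and building `audioDf` would additionally require a running SparkSession, e.g. `spark.createDataFrame([(samples,)], ["audio_content"])`, not shown here):

```python
import math

SAMPLE_RATE = 16000  # assumed rate; wav2vec2/XLS-R models typically expect 16 kHz mono

# One second of a synthetic 440 Hz tone standing in for real speech samples.
samples = [math.sin(2 * math.pi * 440.0 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]

# Each row of the input DataFrame carries one such list of floats in [-1.0, 1.0].
print(len(samples))  # 16000
```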
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_Finnish_NLP| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|fi| |Size:|3.6 GB| --- layout: model title: Swedish asr_Wav2Vec2_large_xlsr_welsh TFWav2Vec2ForCTC from Srulikbdd author: John Snow Labs name: pipeline_asr_Wav2Vec2_large_xlsr_welsh date: 2022-09-25 tags: [wav2vec2, sv, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: sv edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Wav2Vec2_large_xlsr_welsh` is a Swedish model originally trained by Srulikbdd. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_Wav2Vec2_large_xlsr_welsh_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_Wav2Vec2_large_xlsr_welsh_sv_4.2.0_3.0_1664115162978.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_Wav2Vec2_large_xlsr_welsh_sv_4.2.0_3.0_1664115162978.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_Wav2Vec2_large_xlsr_welsh', lang = 'sv') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_Wav2Vec2_large_xlsr_welsh", lang = "sv") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_Wav2Vec2_large_xlsr_welsh| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|sv| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Pipeline to Detect Genetic Cancer Entities author: John Snow Labs name: ner_cancer_genetics_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_cancer_genetics](https://nlp.johnsnowlabs.com/2021/03/31/ner_cancer_genetics_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_cancer_genetics_pipeline_en_3.4.1_3.0_1647870755233.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_cancer_genetics_pipeline_en_3.4.1_3.0_1647870755233.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_cancer_genetics_pipeline", "en", "clinical/models") pipeline.annotate("The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.") ``` ```scala val pipeline = new PretrainedPipeline("ner_cancer_genetics_pipeline", "en", "clinical/models") pipeline.annotate("The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. 
Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.cancer_genetics.pipeline").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.""") ```
## Results ```bash +-----------------------------------------------------------------------+---------+ |chunk |ner_label| +-----------------------------------------------------------------------+---------+ |human KCNJ9 |protein | |Kir 3.3 |protein | |GIRK3 |protein | |G-protein-activated inwardly rectifying potassium (GIRK) channel family|protein | |KCNJ9 locus |DNA | |chromosome 1q21-23 |DNA | |coding exons |DNA | |identified14 single nucleotide polymorphisms |DNA | |SNPs), |DNA | |KCNJ9 gene |DNA | |KCNJ9 protein |protein | |locus |DNA | +-----------------------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_cancer_genetics_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English image_classifier_vit_dog_breed_classifier ViTForImageClassification from skyau author: John Snow Labs name: image_classifier_vit_dog_breed_classifier date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_dog_breed_classifier` is an English model originally trained by skyau.
## Predicted Entities `whippet`, `cairn`, `Old_English_sheepdog`, `Rottweiler`, `American_Staffordshire_terrier`, `Blenheim_spaniel`, `Leonberg`, `bluetick`, `Yorkshire_terrier`, `African_hunting_dog`, `Doberman`, `Appenzeller`, `Boston_bull`, `German_shepherd`, `kuvasz`, `standard_poodle`, `Chesapeake_Bay_retriever`, `toy_terrier`, `Australian_terrier`, `Dandie_Dinmont`, `Brittany_spaniel`, `basenji`, `Newfoundland`, `Airedale`, `giant_schnauzer`, `Bouvier_des_Flandres`, `golden_retriever`, `Welsh_springer_spaniel`, `Pekinese`, `West_Highland_white_terrier`, `briard`, `Gordon_setter`, `Border_collie`, `Pomeranian`, `Scotch_terrier`, `malamute`, `EntleBucher`, `toy_poodle`, `Mexican_hairless`, `clumber`, `Scottish_deerhound`, `curly-coated_retriever`, `Bedlington_terrier`, `soft-coated_wheaten_terrier`, `Irish_setter`, `Lhasa`, `bloodhound`, `French_bulldog`, `standard_schnauzer`, `Chihuahua`, `borzoi`, `Sealyham_terrier`, `malinois`, `Norwegian_elkhound`, `Staffordshire_bullterrier`, `bull_mastiff`, `Ibizan_hound`, `komondor`, `Kerry_blue_terrier`, `Saint_Bernard`, `basset`, `Eskimo_dog`, `Sussex_spaniel`, `English_springer`, `flat-coated_retriever`, `cocker_spaniel`, `Tibetan_terrier`, `Shih-Tzu`, `beagle`, `silky_terrier`, `Saluki`, `vizsla`, `pug`, `Shetland_sheepdog`, `Maltese_dog`, `Norwich_terrier`, `kelpie`, `Italian_greyhound`, `Walker_hound`, `Greater_Swiss_Mountain_dog`, `miniature_schnauzer`, `Great_Pyrenees`, `Tibetan_mastiff`, `collie`, `Siberian_husky`, `Bernese_mountain_dog`, `Irish_wolfhound`, `chow`, `boxer`, `Great_Dane`, `dingo`, `Japanese_spaniel`, `Rhodesian_ridgeback`, `Border_terrier`, `Afghan_hound`, `Irish_water_spaniel`, `black-and-tan_coonhound`, `redbone`, `Norfolk_terrier`, `affenpinscher`, `Brabancon_griffon`, `miniature_pinscher`, `Labrador_retriever`, `Lakeland_terrier`, `groenendael`, `schipperke`, `papillon`, `wire-haired_fox_terrier`, `Cardigan`, `English_foxhound`, `Pembroke`, `dhole`, `German_short-haired_pointer`, 
`miniature_poodle`, `Irish_terrier`, `Weimaraner`, `otterhound`, `English_setter`, `Samoyed`, `keeshond` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_dog_breed_classifier_en_4.1.0_3.0_1660167173151.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_dog_breed_classifier_en_4.1.0_3.0_1660167173151.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_dog_breed_classifier", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_dog_breed_classifier", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_dog_breed_classifier| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.2 MB| --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_10 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-64-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_10_en_4.0.0_3.0_1655733311668.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_10_en_4.0.0_3.0_1655733311668.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_10","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_64d_seed_10").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|419.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-64-finetuned-squad-seed-10 --- layout: model title: Pipeline to Detect Diseases in Medical Text author: John Snow Labs name: bert_token_classifier_ner_ncbi_disease_pipeline date: 2023-03-20 tags: [en, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_ncbi_disease](https://nlp.johnsnowlabs.com/2022/07/25/bert_token_classifier_ner_ncbi_disease_en_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ncbi_disease_pipeline_en_4.3.0_3.2_1679303325122.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ncbi_disease_pipeline_en_4.3.0_3.2_1679303325122.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_ncbi_disease_pipeline", "en", "clinical/models") text = '''Kniest dysplasia is a moderately severe type II collagenopathy, characterized by short trunk and limbs, kyphoscoliosis, midface hypoplasia, severe myopia, and hearing loss.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_ncbi_disease_pipeline", "en", "clinical/models") val text = "Kniest dysplasia is a moderately severe type II collagenopathy, characterized by short trunk and limbs, kyphoscoliosis, midface hypoplasia, severe myopia, and hearing loss." val result = pipeline.fullAnnotate(text) ```
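`fullAnnotate` returns annotation objects that can be flattened into a table like the one in the Results section below. A minimal sketch using a stand-in `Annotation` class; the field names `result`, `begin`, `end`, and `metadata` are assumptions about the annotation shape, not a documented contract:

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    # Stand-in for a Spark NLP chunk annotation; field names are assumed.
    result: str
    begin: int
    end: int
    metadata: dict = field(default_factory=dict)

# Hard-coded chunks mirroring two rows of the documented output.
chunks = [
    Annotation("Kniest dysplasia", 0, 15, {"entity": "Disease", "confidence": "0.999886"}),
    Annotation("kyphoscoliosis", 104, 117, {"entity": "Disease", "confidence": "0.99994"}),
]

# One row per detected chunk: text, offsets, label, numeric confidence.
rows = [
    (c.result, c.begin, c.end, c.metadata["entity"], float(c.metadata["confidence"]))
    for c in chunks
]
print(rows[0])  # ('Kniest dysplasia', 0, 15, 'Disease', 0.999886)
```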
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-----------------------|--------:|------:|:------------|-------------:| | 0 | Kniest dysplasia | 0 | 15 | Disease | 0.999886 | | 1 | type II collagenopathy | 40 | 61 | Disease | 0.999934 | | 2 | kyphoscoliosis | 104 | 117 | Disease | 0.99994 | | 3 | midface hypoplasia | 120 | 137 | Disease | 0.999911 | | 4 | myopia | 147 | 152 | Disease | 0.999894 | | 5 | hearing loss | 159 | 170 | Disease | 0.999351 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_ncbi_disease_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: English BertForQuestionAnswering model (from aodiniz) author: John Snow Labs name: bert_qa_bert_uncased_L_4_H_768_A_12_squad2_covid_qna date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-4_H-768_A-12_squad2_covid-qna` is an English model originally trained by `aodiniz`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_768_A_12_squad2_covid_qna_en_4.0.0_3.0_1654185363690.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_768_A_12_squad2_covid_qna_en_4.0.0_3.0_1654185363690.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_4_H_768_A_12_squad2_covid_qna","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_uncased_L_4_H_768_A_12_squad2_covid_qna","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_covid.bert.uncased_4l_768d_a12a_768d").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_uncased_L_4_H_768_A_12_squad2_covid_qna| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|195.3 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-4_H-768_A-12_squad2_covid-qna --- layout: model title: English XlmRoBertaForQuestionAnswering (from teacookies) author: John Snow Labs name: xlm_roberta_qa_autonlp_roberta_base_squad2_24465523 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-roberta-base-squad2-24465523` is an English model originally trained by `teacookies`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465523_en_4.0.0_3.0_1655986981491.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465523_en_4.0.0_3.0_1655986981491.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465523","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465523","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.xlm_roberta.base_24465523.by_teacookies").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_roberta_base_squad2_24465523| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|887.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-roberta-base-squad2-24465523 --- layout: model title: Stop Words Cleaner for Hausa author: John Snow Labs name: stopwords_ha date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: ha edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, ha] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_ha_ha_2.5.4_2.4_1594742441392.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_ha_ha_2.5.4_2.4_1594742441392.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_ha", "ha") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Ban da kasancewarsa sarkin arewa, John Snow likita ne na Ingilishi kuma jagora a ci gaban maganin sa barci da tsabtar lafiya.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_ha", "ha") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Ban da kasancewarsa sarkin arewa, John Snow likita ne na Ingilishi kuma jagora a ci gaban maganin sa barci da tsabtar lafiya.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Ban da kasancewarsa sarkin arewa, John Snow likita ne na Ingilishi kuma jagora a ci gaban maganin sa barci da tsabtar lafiya."""] stopword_df = nlu.load('ha.stopwords').predict(text) stopword_df[['cleanTokens']] ```
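Conceptually, `StopWordsCleaner` keeps only the tokens that are absent from its stop-word list. A minimal language-agnostic sketch of that token-level filter (the Hausa stop words below are illustrative guesses, not the model's actual list):

```python
# Toy stop-word filter; the stop-word set here is illustrative only.
stop_words = {"da", "ne", "na", "a", "kuma"}

tokens = "Ban da kasancewarsa sarkin arewa".split()
clean_tokens = [t for t in tokens if t.lower() not in stop_words]
print(clean_tokens)  # ['Ban', 'kasancewarsa', 'sarkin', 'arewa']
```

Note the lowercase comparison: the model's metadata lists it as case insensitive, so "Da" and "da" would be treated alike.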
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=7, end=18, result='kasancewarsa', metadata={'sentence': '0'}), Row(annotatorType='token', begin=20, end=25, result='sarkin', metadata={'sentence': '0'}), Row(annotatorType='token', begin=27, end=31, result='arewa', metadata={'sentence': '0'}), Row(annotatorType='token', begin=32, end=32, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=34, end=37, result='John', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_ha| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|ha| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_256_finetuned_squad_seed_0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-256-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_256_finetuned_squad_seed_0_en_4.3.0_3.0_1674214651872.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_256_finetuned_squad_seed_0_en_4.3.0_3.0_1674214651872.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_256_finetuned_squad_seed_0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_256_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_256_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|427.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-256-finetuned-squad-seed-0 --- layout: model title: Pipeline to Map ICDO Codes to Their Corresponding SNOMED Codes author: John Snow Labs name: icdo_snomed_mapping date: 2022-06-27 tags: [icdo, snomed, pipeline, chunk_mapper, clinical, licensed, en] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the `icdo_snomed_mapper` model. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icdo_snomed_mapping_en_3.5.3_3.0_1656364275328.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icdo_snomed_mapping_en_3.5.3_3.0_1656364275328.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("icdo_snomed_mapping", "en", "clinical/models")

result = pipeline.fullAnnotate("8120/1 8170/3 8380/3")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = new PretrainedPipeline("icdo_snomed_mapping", "en", "clinical/models")

val result = pipeline.fullAnnotate("8120/1 8170/3 8380/3")
```

{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.icdo_to_snomed.pipe").predict("""8120/1 8170/3 8380/3""")
```
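To make the code-mapping behaviour concrete, here is a plain-Python sketch of what the chunk mapper resolves for the example input. It is illustrative only: the three ICD-O → SNOMED pairs come from the results shown on this card, while the real pipeline resolves codes through the pretrained `icdo_snomed_mapper` model rather than a hand-written dictionary.

```python
# Illustrative lookup only; the pretrained mapper covers the full terminology.
# These three pairs are taken from this card's example output.
ICDO_TO_SNOMED = {
    "8120/1": "45083001",
    "8170/3": "25370001",
    "8380/3": "30289006",
}

def map_icdo_codes(text):
    """Map whitespace-separated ICD-O codes to SNOMED codes (None if unknown)."""
    return [(code, ICDO_TO_SNOMED.get(code)) for code in text.split()]

pairs = map_icdo_codes("8120/1 8170/3 8380/3")
```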
## Results

```bash
|    | icdo_code | snomed_code |
|---:|:----------|:------------|
|  0 | 8120/1    | 45083001    |
|  1 | 8170/3    | 25370001    |
|  2 | 8380/3    | 30289006    |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|icdo_snomed_mapping|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.5.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|133.2 KB|

## Included Models

- DocumentAssembler
- TokenizerModel
- ChunkMapperModel

---
layout: model
title: Chinese BERT with Whole Word Masking
author: John Snow Labs
name: chinese_bert_wwm
date: 2021-05-20
tags: [chinese, zh, embeddings, bert, open_source]
task: Embeddings
language: zh
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

**[Pre-Training with Whole Word Masking for Chinese BERT](https://arxiv.org/abs/1906.08101)**

Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, Guoping Hu

More resources by HFL: https://github.com/ymcui/HFL-Anthology

If you find the technical report or resources useful, please cite the following technical reports in your paper.
- Primary: [https://arxiv.org/abs/2004.13922](https://arxiv.org/abs/2004.13922)
- Secondary: [https://arxiv.org/abs/1906.08101](https://arxiv.org/abs/1906.08101)

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/chinese_bert_wwm_zh_3.1.0_2.4_1621511963425.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/chinese_bert_wwm_zh_3.1.0_2.4_1621511963425.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("chinese_bert_wwm", "zh") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("chinese_bert_wwm", "zh") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.bert.wwm").predict("""Put your text here.""") ```
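To illustrate the whole-word-masking idea behind this checkpoint: with WWM, if any sub-piece of a word is selected for masking, all sub-pieces of that word are masked together. A minimal plain-Python sketch follows; the `##` continuation-piece convention and the toy tokenization are assumptions matching the original BERT WordPiece format, not this model's exact pre-processing (HFL used Chinese word segmentation to define word boundaries).

```python
def whole_word_mask(tokens, mask_indices, mask_token="[MASK]"):
    """Mask tokens so that every sub-piece of a chosen word is masked.

    A word is a head piece plus any immediately following '##' pieces.
    """
    # Group token positions into words.
    words, current = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and current:
            current.append(i)
        else:
            current = [i]
            words.append(current)
    # Mask every word that contains at least one selected position.
    out = list(tokens)
    for word in words:
        if any(i in mask_indices for i in word):
            for i in word:
                out[i] = mask_token
    return out

tokens = ["使", "用", "语", "言", "模", "##型"]   # toy sub-word output
masked = whole_word_mask(tokens, {4})            # masks both '模' and '##型'
```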
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|chinese_bert_wwm|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token, sentence]|
|Output Labels:|[embeddings]|
|Language:|zh|
|Case sensitive:|true|

## Data Source

[https://huggingface.co/hfl/chinese-bert-wwm](https://huggingface.co/hfl/chinese-bert-wwm)

## Benchmarking

```bash
|                   | BERT-Google | BERT-wwm                  | BERT-wwm-ext            | RoBERTa-wwm-ext | RoBERTa-wwm-ext-large |
|-------------------|-------------|---------------------------|-------------------------|-----------------|-----------------------|
| Masking           | WordPiece   | WWM[1]                    | WWM                     | WWM             | WWM                   |
| Type              | base        | base                      | base                    | base            | large                 |
| Data Source       | wiki        | wiki                      | wiki+ext[2]             | wiki+ext        | wiki+ext              |
| Training Tokens # | 0.4B        | 0.4B                      | 5.4B                    | 5.4B            | 5.4B                  |
| Device            | TPU Pod v2  | TPU v3                    | TPU v3                  | TPU v3          | TPU Pod v3-32[3]      |
| Training Steps    | ?           | 100K(MAX128)+100K(MAX512) | 1M(MAX128)+400K(MAX512) | 1M(MAX512)      | 2M(MAX512)            |
| Batch Size        | ?           | 2,560 / 384               | 2,560 / 384             | 384             | 512                   |
| Optimizer         | AdamW       | LAMB                      | LAMB                    | AdamW           | AdamW                 |
| Vocabulary        | 21,128      | ~BERT[4]                  | ~BERT                   | ~BERT           | ~BERT                 |
| Init Checkpoint   | Random Init | ~BERT                     | ~BERT                   | ~BERT           | Random Init           |
```

---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from google)
author: John Snow Labs
name: t5_efficient_base_ff6000
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-ff6000` is an English model originally trained by `google`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_ff6000_en_4.3.0_3.0_1675112144975.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_ff6000_en_4.3.0_3.0_1675112144975.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_base_ff6000","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_base_ff6000","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_base_ff6000|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|692.3 MB|

## References

- https://huggingface.co/google/t5-efficient-base-ff6000
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145

---
layout: model
title: Legal Cusip Numbers Clause Binary Classifier
author: John Snow Labs
name: legclf_cusip_numbers_clause
date: 2023-01-29
tags: [en, legal, classification, cusip, numbers, clauses, cusip_numbers, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `cusip-numbers` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`cusip-numbers`, `other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cusip_numbers_clause_en_1.0.0_3.0_1674994284758.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cusip_numbers_clause_en_1.0.0_3.0_1674994284758.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_cusip_numbers_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
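As a minimal illustration of the paragraph-splitting recommendation above (splitting "by multiline", i.e. on blank lines), here is a plain-Python sketch. It is independent of the Spark NLP splitting techniques in the tutorial it approximates, and the sample clause text is invented for demonstration.

```python
import re

def split_paragraphs(text):
    """Split a document into paragraphs on runs of blank lines,
    dropping empty fragments and surrounding whitespace."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("SECTION 1. CUSIP Numbers.\nThe Notes shall bear CUSIP numbers.\n\n"
       "SECTION 2. Governing Law.\nThis Agreement is governed by New York law.")
paragraphs = split_paragraphs(doc)  # each paragraph can then be classified separately
```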
## Results

```bash
+---------------+
|result         |
+---------------+
|[cusip-numbers]|
|[other]        |
|[other]        |
|[cusip-numbers]|
+---------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_cusip_numbers_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
        label  precision  recall  f1-score  support
cusip-numbers       0.93    0.96      0.95       28
        other       0.97    0.95      0.96       37
     accuracy          -       -      0.95       65
    macro-avg       0.95    0.96      0.95       65
 weighted-avg       0.95    0.95      0.95       65
```

---
layout: model
title: Detect Chemical Compounds and Genes (BertForTokenClassifier)
author: John Snow Labs
name: bert_token_classifier_ner_chemprot
date: 2022-01-06
tags: [berfortokenclassification, chemprot, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.3.4
spark_version: 2.4
supported: true
annotator: MedicalBertForTokenClassifier
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Detect chemical compounds and genes in medical text using the pretrained NER model. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP.

## Predicted Entities

`CHEMICAL`, `GENE-Y`, `GENE-N`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemprot_en_3.3.4_2.4_1641471274375.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemprot_en_3.3.4_2.4_1641471274375.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_chemprot", "en", "clinical/models")\
    .setInputCols("token", "document")\
    .setOutputCol("ner")\
    .setCaseSensitive(True)

ner_converter = NerConverter()\
    .setInputCols(["document","token","ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter])

data = spark.createDataFrame([["""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium."""]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_chemprot", "en", "clinical/models")
    .setInputCols(Array("document","token"))
    .setOutputCol("ner")
    .setCaseSensitive(true)

val ner_converter = new NerConverter()
    .setInputCols(Array("document","token","ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter))

val data = Seq("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.chemprot.bert").predict("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""")
```
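The `NerConverter` stage in the pipeline merges token-level B-/I- tags into entity chunks. A minimal sketch of that BIO-merging logic in plain Python (illustrative only; the real annotator operates on Spark NLP annotations with character offsets, not bare token lists, and the tag sequence below is assumed for the card's example sentence):

```python
def bio_to_chunks(tokens, tags):
    """Merge token-level BIO tags into (chunk_text, label) pairs."""
    chunks, cur, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # a new entity starts here
            if cur:
                chunks.append((" ".join(cur), label))
            cur, label = [tok], tag[2:]
        elif tag.startswith("I-") and cur and tag[2:] == label:
            cur.append(tok)                 # continue the open entity
        else:                               # "O" or an I- without an open chunk
            if cur:
                chunks.append((" ".join(cur), label))
            cur, label = [], None
    if cur:
        chunks.append((" ".join(cur), label))
    return chunks

tokens = ["Keratinocyte", "growth", "factor", "and", "acidic", "fibroblast",
          "growth", "factor", "are", "mitogens"]
tags = ["B-GENE-Y", "I-GENE-Y", "I-GENE-Y", "O", "B-GENE-Y", "I-GENE-Y",
        "I-GENE-Y", "I-GENE-Y", "O", "O"]
entities = bio_to_chunks(tokens, tags)
```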
## Results

```bash
+-------------------------------+---------+
|chunk                          |ner_label|
+-------------------------------+---------+
|Keratinocyte growth factor     |GENE-Y   |
|acidic fibroblast growth factor|GENE-Y   |
+-------------------------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_chemprot|
|Compatibility:|Healthcare NLP 3.3.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|404.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## Data Source

This model is trained on a [ChemProt corpus](https://biocreative.bioinformatics.udel.edu/).

## Benchmarking

```bash
       label  precision  recall  f1-score  support
  B-CHEMICAL       0.93    0.79      0.85     8649
    B-GENE-N       0.63    0.56      0.59     2752
    B-GENE-Y       0.82    0.73      0.77     5490
  I-CHEMICAL       0.90    0.79      0.84     1313
    I-GENE-N       0.72    0.62      0.67     1993
    I-GENE-Y       0.81    0.72      0.77     2420
    accuracy          -       -      0.73    22617
   macro-avg       0.75    0.74      0.75    22617
weighted-avg       0.83    0.73      0.78    22617
```

---
layout: model
title: Translate Marshallese to English Pipeline
author: John Snow Labs
name: translate_mh_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, mh, en, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `mh`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_mh_en_xx_2.7.0_2.4_1609698850382.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_mh_en_xx_2.7.0_2.4_1609698850382.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_mh_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_mh_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.mh.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_mh_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Detect PHI for Deidentification (BertForTokenClassifier)
author: John Snow Labs
name: bert_token_classifier_ner_deid
date: 2021-09-13
tags: [en, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.2.0
spark_version: 2.4
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Deidentification NER is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 23 entities. This NER model is trained with a combination of the i2b2 train set and a re-augmented version of the i2b2 train set using `BertForTokenClassification`. We adhered to the official annotation guidelines (AG) for the 2014 i2b2 Deid challenge while annotating new datasets for this model.
All the details regarding the nuances and explanations for the AG can be found here: [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/)

## Predicted Entities

`MEDICALRECORD`, `ORGANIZATION`, `DOCTOR`, `USERNAME`, `PROFESSION`, `HEALTHPLAN`, `URL`, `CITY`, `DATE`, `LOCATION-OTHER`, `STATE`, `PATIENT`, `DEVICE`, `COUNTRY`, `ZIP`, `PHONE`, `HOSPITAL`, `EMAIL`, `IDNUM`, `STREET`, `BIOID`, `FAX`, `AGE`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_deid_en_3.2.1_2.4_1631538493075.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_deid_en_3.2.1_2.4_1631538493075.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_deid", "en", "clinical/models")\
    .setInputCols("token", "document")\
    .setOutputCol("ner")\
    .setCaseSensitive(True)

ner_converter = NerConverter()\
    .setInputCols(["document","token","ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter])

sample_text = """A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine."""

df = spark.createDataFrame([[sample_text]]).toDF("text")

result = pipeline.fit(df).transform(df)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_deid", "en", "clinical/models")
    .setInputCols(Array("token", "document"))
    .setOutputCol("ner")
    .setCaseSensitive(true)

val ner_converter = new NerConverter()
    .setInputCols(Array("document","token","ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter))

val data = Seq("""A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. 
Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_deid").predict("""A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""") ```
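Downstream de-identification typically replaces each chunk the NER model detects with its entity label. A plain-Python sketch of that substitution step (illustrative only: chunk positions here are found by string search, whereas a real pipeline would use the annotator's begin/end offsets, and the chunk list is taken from this card's example results):

```python
def mask_phi(text, chunks):
    """Replace each (chunk, label) pair found in `text` with <LABEL>."""
    for chunk, label in chunks:
        text = text.replace(chunk, f"<{label}>")
    return text

text = "Record date : 2093-01-13, David Hale, M.D."
chunks = [("2093-01-13", "DATE"), ("David Hale", "DOCTOR")]
masked = mask_phi(text, chunks)  # "Record date : <DATE>, <DOCTOR>, M.D."
```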
## Results

```bash
+-----------------------------+-------------+
|chunk                        |ner_label    |
+-----------------------------+-------------+
|2093-01-13                   |DATE         |
|David Hale                   |DOCTOR       |
|Hendrickson, Ora             |PATIENT      |
|7194334                      |MEDICALRECORD|
|Oliveira                     |PATIENT      |
|Cocke County Baptist Hospital|HOSPITAL    |
|0295 Keats Street            |STREET       |
|302) 786-5227                |PHONE        |
|Brothers Coal-Mine           |ORGANIZATION |
+-----------------------------+-------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_deid|
|Compatibility:|Healthcare NLP 3.2.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|en|
|Case sensitive:|true|
|Max sentence length:|128|

## Data Source

A custom data set which is created from the i2b2-PHI train set and the re-augmented version of the i2b2-PHI train set is used.

## Benchmarking

```bash
           label  precision  recall  f1-score  support
           B-AGE       0.92    0.80      0.86     1050
          B-CITY       0.71    0.93      0.80      530
       B-COUNTRY       0.94    0.72      0.82      179
          B-DATE       0.99    0.99      0.99    20434
        B-DEVICE       0.68    0.66      0.67       35
        B-DOCTOR       0.93    0.91      0.92     3609
         B-EMAIL       0.92    1.00      0.96       11
      B-HOSPITAL       0.94    0.90      0.92     2225
         B-IDNUM       0.88    0.64      0.74     1185
B-LOCATION-OTHER       0.83    0.60      0.70       25
 B-MEDICALRECORD       0.84    0.97      0.90     2263
  B-ORGANIZATION       0.92    0.68      0.79      171
       B-PATIENT       0.89    0.86      0.88     2297
         B-PHONE       0.90    0.96      0.93      755
    B-PROFESSION       0.86    0.81      0.83      265
         B-STATE       0.97    0.87      0.92      261
        B-STREET       0.98    0.99      0.99      184
      B-USERNAME       0.96    0.78      0.86      357
           B-ZIP       0.96    0.96      0.96      444
          I-CITY       0.74    0.83      0.78      138
          I-DATE       0.98    0.96      0.97      955
        I-DEVICE       1.00    1.00      1.00        3
        I-DOCTOR       0.92    0.98      0.95     3134
      I-HOSPITAL       0.95    0.92      0.93     1322
I-LOCATION-OTHER       1.00    1.00      1.00        8
 I-MEDICALRECORD       1.00    0.91      0.95       23
  I-ORGANIZATION       0.98    0.61      0.75       77
       I-PATIENT       0.89    0.81      0.85     1281
         I-PHONE       0.97    0.96      0.97      374
    I-PROFESSION       0.95    0.82      0.88      232
        I-STREET       0.98    0.98      0.98      391
               O       1.00    1.00      1.00   585606
        accuracy          -       -      0.99   629960
       macro-avg       0.79    0.71      0.73   629960
    weighted-avg       0.99    0.99      0.99   629960
```

---
layout: model
title:
Clinical Deidentification Pipeline (Portuguese) author: John Snow Labs name: clinical_deidentification date: 2022-04-14 tags: [deid, deidentification, pt, licensed] task: [De-identification, Pipeline Healthcare] language: pt edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true recommended: false article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline is trained with `w2v_cc_300d` portuguese embeddings and can be used to deidentify PHI information from medical texts in Portuguese. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask, fake or obfuscate the following entities: `AGE`, `DATE`, `PROFESSION`, `EMAIL`, `ID`, `COUNTRY`, `STREET`, `DOCTOR`, `HOSPITAL`, `PATIENT`, `URL`, `IP`, `ORGANIZATION`, `PHONE`, `ZIP`, `ACCOUNT`, `SSN`, `PLATE`, `SEX` and `IPADDR` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_pt_3.4.1_3.0_1649956332889.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_pt_3.4.1_3.0_1649956332889.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "pt", "clinical/models") sample = """Dados do paciente. Nome: Mauro. Apelido: Gonçalves. NIF: 368503. NISS: 26 63514095. Endereço: Calle Miguel Benitez 90. CÓDIGO POSTAL: 28016. Dados de cuidados. Data de nascimento: 03/03/1946. País: Portugal. Idade: 70 anos Sexo: M. Data de admissão: 12/12/2016. Doutor: Ignacio Navarro Cuéllar NºCol: 28 28 70973. Relatório clínico do paciente: Paciente de 70 anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia. Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal. O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV. A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal. Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicéridos de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml. A citologia da urina era repetidamente desconfiada por malignidade. A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis. A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso. 
O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal. A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga. Referido por: Miguel Santos - Avenida dos Aliados, 22 Portugal E-mail: nnavcu@hotmail.com. """ result = deid_pipeline .annotate(sample) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "pt", "clinical/models") sample = "Dados do paciente. Nome: Mauro. Apelido: Gonçalves. NIF: 368503. NISS: 26 63514095. Endereço: Calle Miguel Benitez 90. CÓDIGO POSTAL: 28016. Dados de cuidados. Data de nascimento: 03/03/1946. País: Portugal. Idade: 70 anos Sexo: M. Data de admissão: 12/12/2016. Doutor: Ignacio Navarro Cuéllar NºCol: 28 28 70973. Relatório clínico do paciente: Paciente de 70 anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia. Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal. O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV. 
A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal. Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicéridos de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml. A citologia da urina era repetidamente desconfiada por malignidade. A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis. A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso. O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal. A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga. Referido por: Miguel Santos - Avenida dos Aliados, 22 Portugal E-mail: nnavcu@hotmail.com" val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("pt.deid.clinical").predict("""Dados do paciente. Nome: Mauro. Apelido: Gonçalves. NIF: 368503. NISS: 26 63514095. Endereço: Calle Miguel Benitez 90. CÓDIGO POSTAL: 28016. Dados de cuidados. Data de nascimento: 03/03/1946. País: Portugal. Idade: 70 anos Sexo: M. Data de admissão: 12/12/2016. Doutor: Ignacio Navarro Cuéllar NºCol: 28 28 70973. 
Relatório clínico do paciente: Paciente de 70 anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia. Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal. O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV. A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal. Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicéridos de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml. A citologia da urina era repetidamente desconfiada por malignidade. A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis. A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso. O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal. A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga. Referido por: Miguel Santos - Avenida dos Aliados, 22 Portugal E-mail: nnavcu@hotmail.com. """) ```
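The "Masked with chars" output in the results replaces each detected PHI chunk with a same-length mask: a bracket frame around asterisks (e.g. `Mauro` → `[***]`), falling back to bare asterisks for very short chunks (`70` → `**`). A minimal sketch of that fixed-length masking, inferred from the example output on this card (the pipeline's de-identification stage implements this internally; the exact fallback rule is an assumption):

```python
def mask_with_chars(chunk):
    """Replace a PHI chunk with a same-length mask: '[' + '*' * (n - 2) + ']'.
    Chunks shorter than 3 characters are masked with bare asterisks
    (assumed behaviour, inferred from the example output)."""
    n = len(chunk)
    if n < 3:
        return "*" * n
    return "[" + "*" * (n - 2) + "]"

print(mask_with_chars("Mauro"))      # [***]
print(mask_with_chars("Gonçalves"))  # [*******]
```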
## Results ```bash Masked with entity labels ------------------------------ Dados do . Nome: . Apelido: . NIF: . NISS: . Endereço: . CÓDIGO POSTAL: . Dados de cuidados. Data de nascimento: . País: . Idade: anos Sexo: . Data de admissão: . Doutor: Cuéllar NºCol: . Relatório clínico do : de anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia. Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal. O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV. A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal. Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicér de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml. A citologia da urina era repetidamente desconfiada por malignidade. A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis. A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso. O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal. 
A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga. Referido por: - , 22 E-mail: . Masked with chars ------------------------------ Dados do [******]. Nome: [***]. Apelido: [*******]. NIF: [****]. NISS: [*********]. Endereço: [*********************]. CÓDIGO POSTAL: [***]. Dados de cuidados. Data de nascimento: [********]. País: [******]. Idade: ** anos Sexo: *. Data de admissão: [********]. Doutor: [*************] Cuéllar NºCol: ** ** [***]. Relatório clínico do [******]: [******] de ** anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia. Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal. O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV. A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal. Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicér[**] de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml. A citologia da urina era repetidamente desconfiada por malignidade. A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis. A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso. 
O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal. A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga. Referido por: [***********] - [*****************], 22 [******] E-mail: [****************]. Masked with fixed length chars ------------------------------ Dados do ****. Nome: ****. Apelido: ****. NIF: ****. NISS: ****. Endereço: ****. CÓDIGO POSTAL: ****. Dados de cuidados. Data de nascimento: ****. País: ****. Idade: **** anos Sexo: ****. Data de admissão: ****. Doutor: **** Cuéllar NºCol: **** **** ****. Relatório clínico do ****: **** de **** anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia. Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal. O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV. A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal. Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicér**** de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml. A citologia da urina era repetidamente desconfiada por malignidade. 
A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis. A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso. O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal. A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga. Referido por: **** - ****, 22 **** E-mail: ****. Obfuscated ------------------------------ Dados do H.. Nome: Marcos Alves. Apelido: Tiago Santos. NIF: 566-445. NISS: 134544332. Endereço: Rua de Santa María, 100. CÓDIGO POSTAL: 4099. Dados de cuidados. Data de nascimento: 31/03/1946. País: Espanha. Idade: 46 anos Sexo: Mulher. Data de admissão: 06/01/2017. Doutor: Carlos Melo Cuéllar NºCol: 134544332 134544332 124 445 311. Relatório clínico do H.: M. de 46 anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia. Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal. O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV. 
A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal. Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicérHomen de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml. A citologia da urina era repetidamente desconfiada por malignidade. A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis. A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso. O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal. A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga. Referido por: Carlos Melo - Avenida Dos Aliados, 56, 22 Espanha E-mail: maria.prado@jacob.com. 
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|clinical_deidentification|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|pt|
|Size:|1.3 GB|

## Included Models

- nlp.DocumentAssembler
- nlp.SentenceDetectorDLModel
- nlp.TokenizerModel
- nlp.WordEmbeddingsModel
- medical.NerModel
- nlp.NerConverter
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- nlp.RegexMatcherModel
- nlp.RegexMatcherModel
- ChunkMergeModel
- medical.DeIdentificationModel
- medical.DeIdentificationModel
- medical.DeIdentificationModel
- medical.DeIdentificationModel
- Finisher

---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_hindi_home_colab_11 TFWav2Vec2ForCTC from nimrah
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xls_r_300m_hindi_home_colab_11
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_hindi_home_colab_11` is an English model originally trained by nimrah.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_hindi_home_colab_11_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_hindi_home_colab_11_en_4.2.0_3.0_1664117519491.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_hindi_home_colab_11_en_4.2.0_3.0_1664117519491.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_hindi_home_colab_11', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_hindi_home_colab_11", lang = "en") val annotations = pipeline.transform(audioDF) ```
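Both snippets above assume an `audioDF` DataFrame whose audio column holds the raw waveform as an array of floats. As a rough, Spark-free sketch of how such an array can be produced from a 16-bit mono PCM WAV file using only the Python standard library (wav2vec2 models expect 16 kHz audio; the file name and helper below are illustrative, not part of the pipeline API):

```python
import math
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit mono PCM WAV file and return samples as floats in [-1, 1]."""
    with wave.open(path, "rb") as f:
        assert f.getsampwidth() == 2 and f.getnchannels() == 1
        frames = f.readframes(f.getnframes())
    ints = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in ints]

# Demo: synthesize one second of a 440 Hz tone at 16 kHz and round-trip it.
rate = 16000
samples = [int(0.5 * 32767 * math.sin(2 * math.pi * 440 * t / rate)) for t in range(rate)]
with wave.open("tone.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(rate)
    f.writeframes(struct.pack("<%dh" % len(samples), *samples))

floats = wav_to_floats("tone.wav")
print(len(floats))
```

The resulting float list is what would populate the audio column before calling `pipeline.transform`.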
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_hindi_home_colab_11|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: English BertForQuestionAnswering Cased model (from AnonymousSub)
author: John Snow Labs
name: bert_qa_rule_based_hier_triplet_0.1_epochs_1_shard_1_squad2.0
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_hier_triplet_0.1_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_hier_triplet_0.1_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1657191119508.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_hier_triplet_0.1_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1657191119508.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_hier_triplet_0.1_epochs_1_shard_1_squad2.0","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_hier_triplet_0.1_epochs_1_shard_1_squad2.0","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
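For intuition: extractive QA models like this one score every token as a potential answer start and answer end, and the returned answer is the span that maximizes the combined score. A toy, framework-free sketch of that span selection step — the logits below are made up for illustration, not produced by the real model:

```python
def best_span(start_logits, end_logits, max_len=10):
    """Pick (start, end) maximizing start_logits[s] + end_logits[e], with s <= e < s + max_len."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Toy logits peaking on "Clara" (index 3) — illustrative only.
start = [0.1, 0.2, 0.3, 5.0, 0.1, 0.0, 0.1, 0.2, 1.0, 0.0]
end   = [0.0, 0.1, 0.2, 4.8, 0.1, 0.0, 0.1, 0.2, 1.2, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # → Clara
```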
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_rule_based_hier_triplet_0.1_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/rule_based_hier_triplet_0.1_epochs_1_shard_1_squad2.0 --- layout: model title: Korean BertForQuestionAnswering Base Cased model (from tucan9389) author: John Snow Labs name: bert_qa_kcbert_base_finetuned_squad date: 2022-07-07 tags: [ko, open_source, bert, question_answering] task: Question Answering language: ko edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `kcbert-base-finetuned-squad` is a Korean model originally trained by `tucan9389`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_kcbert_base_finetuned_squad_ko_4.0.0_3.0_1657189537948.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_kcbert_base_finetuned_squad_ko_4.0.0_3.0_1657189537948.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_kcbert_base_finetuned_squad","ko") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_kcbert_base_finetuned_squad","ko")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_kcbert_base_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ko| |Size:|406.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/tucan9389/kcbert-base-finetuned-squad --- layout: model title: Fast Neural Machine Translation Model from English to Tongan author: John Snow Labs name: opus_mt_en_to date: 2020-12-29 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, to, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `to` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_to_xx_2.7.0_2.4_1609254398876.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_to_xx_2.7.0_2.4_1609254398876.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_to", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```

```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols("document")
  .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_to", "xx")
  .setInputCols("sentence")
  .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.to').predict(text, output_level='sentence')
opus_df
```
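The pipeline above splits the document into sentences before translating, since Marian models operate sentence by sentence. As a rough illustration of that first step, here is a naive regex-based splitter — a crude stand-in for `SentenceDetectorDLModel`, not its actual algorithm:

```python
import re

def naive_sentences(text):
    """Naive rule-based sentence splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

doc = "Marian is fast. It powers Microsoft Translator! Does it support Tongan?"
print(naive_sentences(doc))
```

Each returned sentence would then be translated independently by the `MarianTransformer` stage.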
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_en_to|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Spanish BertForQuestionAnswering model (from MMG)
author: John Snow Labs
name: bert_qa_bert_base_spanish_wwm_cased_finetuned_squad2_es_MMG
date: 2022-06-03
tags: [es, open_source, question_answering, bert]
task: Question Answering
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-cased-finetuned-squad2-es` is a Spanish model originally trained by `MMG`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_squad2_es_MMG_es_4.0.0_3.0_1654249761122.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_squad2_es_MMG_es_4.0.0_3.0_1654249761122.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_squad2_es_MMG","es") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```

```scala
val document = new MultiDocumentAssembler()
  .setInputCols("question", "context")
  .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
  .pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_squad2_es_MMG","es")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
  ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
  ("What's my name?", "My name is Clara and I live in Berkeley."))
  .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.squadv2.bert.base_cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_spanish_wwm_cased_finetuned_squad2_es_MMG|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|410.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/MMG/bert-base-spanish-wwm-cased-finetuned-squad2-es

---
layout: model
title: French T5ForConditionalGeneration Cased model (from ZakaryaRouzki)
author: John Snow Labs
name: t5_punctuation
date: 2023-01-31
tags: [fr, open_source, t5, tensorflow]
task: Text Generation
language: fr
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-punctuation` is a French model originally trained by `ZakaryaRouzki`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_punctuation_fr_4.3.0_3.0_1675125343642.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_punctuation_fr_4.3.0_3.0_1675125343642.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_punctuation","fr") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_punctuation","fr")
  .setInputCols("document")
  .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
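To sanity-check a punctuation-restoration model's output against a reference transcript, one simple approach is to compare which punctuation mark (if any) each token carries. A small illustrative sketch — the metric and helper names here are our own, not part of Spark NLP:

```python
def punct_prf(predicted, reference, marks=".,?!"):
    """Token-level precision/recall/F1 of punctuation attachment against a reference."""
    def attach(text):
        # Pair each word with the punctuation stuck to its end, e.g. "va." -> ("va", ".").
        out = []
        for tok in text.split():
            word = tok.rstrip(marks)
            out.append((word, tok[len(word):]))
        return out
    pred, ref = attach(predicted), attach(reference)
    tp = sum(1 for p, r in zip(pred, ref) if p[1] and p[1] == r[1])
    pred_p = sum(1 for _, m in pred if m)
    ref_p = sum(1 for _, m in ref if m)
    precision = tp / pred_p if pred_p else 0.0
    recall = tp / ref_p if ref_p else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# The model got the comma right but restored "." where the reference has "?".
p, r, f = punct_prf("hello, how are you.", "hello, how are you?")
print(p, r, f)  # → 0.5 0.5 0.5
```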
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_punctuation| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|fr| |Size:|925.1 MB| ## References - https://huggingface.co/ZakaryaRouzki/t5-punctuation - https://linkedin.com/in/rouzki --- layout: model title: Stopwords Remover for Telugu language (52 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, te, open_source] task: Stop Words Removal language: te edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_te_3.4.1_3.0_1646672989663.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_te_3.4.1_3.0_1646672989663.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

stop_words = StopWordsCleaner.pretrained("stopwords_iso","te") \
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])

example = spark.createDataFrame([["మీరు నన్ను కంటే మెరుగైనది కాదు"]], ["text"])

results = pipeline.fit(example).transform(example)
```

```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val stop_words = StopWordsCleaner.pretrained("stopwords_iso","te")
  .setInputCols(Array("token"))
  .setOutputCol("cleanTokens")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))

val data = Seq("మీరు నన్ను కంటే మెరుగైనది కాదు").toDF("text")

val results = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("te.stopwords").predict("""మీరు నన్ను కంటే మెరుగైనది కాదు""")
```
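Conceptually, `StopWordsCleaner` just filters the token stream against a fixed list. A plain-Python sketch of the same operation on the example sentence, using only a one-word subset of the real 52-entry Telugu list for illustration:

```python
# Tiny illustrative subset — the real stopwords_iso Telugu list has 52 entries.
stopwords_te = {"కాదు"}

def clean_tokens(tokens, stopwords):
    """Drop tokens that appear in the stopword set (what StopWordsCleaner does)."""
    return [t for t in tokens if t not in stopwords]

tokens = "మీరు నన్ను కంటే మెరుగైనది కాదు".split()
print(clean_tokens(tokens, stopwords_te))
```

This reproduces the result shown below, where the stopword "కాదు" is removed.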
## Results

```bash
+------------------------------+
|result                        |
+------------------------------+
|[మీరు, నన్ను, కంటే, మెరుగైనది]|
+------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|te|
|Size:|1.7 KB|

---
layout: model
title: Spanish NER for Laws and Treaties/Agreements (Roberta)
author: John Snow Labs
name: legner_laws_treaties
date: 2022-09-28
tags: [es, legal, ner, laws, treaties, agreements, licensed]
task: Named Entity Recognition
language: es
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Legal Roberta Named Entity Recognition model in Spanish, able to recognize the following entities:

- LEY: Law
- TRAT_INTL: International Treaty (Agreement)

This model was originally trained on the scjn dataset, available [here](https://huggingface.co/datasets/scjnugacj/scjn_dataset_ner), and fine-tuned on scraped documents (as, for example, [this one](https://www.wipo.int/export/sites/www/pct/es/texts/pdf/pct.pdf)), improving the coverage of the original version, published [here](https://huggingface.co/datasets/scjnugacj/scjn_dataset_ner).

## Predicted Entities

`LAW`, `TRAT_INTL`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_laws_treaties_es_1.0.0_3.0_1664362398391.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_laws_treaties_es_1.0.0_3.0_1664362398391.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token")

tokenClassifier = nlp.RoBertaForTokenClassification.pretrained("legner_laws_treaties","es", "legal/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("ner")

pipeline = nlp.Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])

text = "Sin perjuicio de lo dispuesto en el párrafo b), los requisitos y los efectos de una reivindicación de prioridad presentada conforme al párrafo 1), serán los establecidos en el Artículo 4 del Acta de Estocolmo del Convenio de París para la Protección de la Propiedad Industrial."

data = spark.createDataFrame([[""]]).toDF("text")

fitmodel = pipeline.fit(data)

light_model = nlp.LightPipeline(fitmodel)

light_result = light_model.fullAnnotate(text)

chunks = []
entities = []
for n in light_result[0]['ner_chunk']:
    chunks.append(n.result)
    entities.append(n.metadata['entity'])
    print(f"{n.result} ({n.metadata['entity']})")
```
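The `ner_chunk` entries iterated above are produced by merging the token-level IOB tags emitted by the NER model into contiguous entity chunks. A minimal sketch of that merging logic on a made-up tag sequence (illustrative only, not the library's implementation):

```python
def iob_to_chunks(tokens, tags):
    """Merge token-level IOB tags into (chunk_text, entity_label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" or an inconsistent I- tag ends any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["el", "Convenio", "de", "París", "entró", "en", "vigor"]
tags = ["O", "B-TRAT_INTL", "I-TRAT_INTL", "I-TRAT_INTL", "O", "O", "O"]
print(iob_to_chunks(tokens, tags))  # → [('Convenio de París', 'TRAT_INTL')]
```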
## Results

```bash
para la Protección de la Propiedad Industrial. (TRAT_INTL)
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legner_laws_treaties|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|es|
|Size:|464.4 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

This model was originally trained on the scjn dataset, available [here](https://huggingface.co/datasets/scjnugacj/scjn_dataset_ner), and fine-tuned on scraped documents (as, for example, [this one](https://www.wipo.int/export/sites/www/pct/es/texts/pdf/pct.pdf)), improving the coverage of the original version, published [here](https://huggingface.co/datasets/scjnugacj/scjn_dataset_ner).

## Benchmarking

```bash
label         prec       rec        f1
Macro-average 0.9361195  0.9294152  0.9368145
Micro-average 0.9856711  0.9857456  0.9851656
```

---
layout: model
title: Detect Family History Status from Oncology Entities
author: John Snow Labs
name: assertion_oncology_family_history_wip
date: 2022-10-01
tags: [licensed, clinical, oncology, en, assertion, family_history]
task: Assertion Status
language: en
nav_key: models
edition: Healthcare NLP 4.1.0
spark_version: 3.0
supported: true
annotator: AssertionDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model detects entities referring to family history.
## Predicted Entities `Family_History`, `Other` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_family_history_wip_en_4.1.0_3.0_1664642142116.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_family_history_wip_en_4.1.0_3.0_1664642142116.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(["Cancer_Dx"]) assertion = AssertionDLModel.pretrained("assertion_oncology_family_history_wip", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion]) data = spark.createDataFrame([["""Her family history is positive for breast cancer in her maternal aunt."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") 
.setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Cancer_Dx")) val assertion = AssertionDLModel.pretrained("assertion_oncology_family_history_wip","en","clinical/models") .setInputCols(Array("sentence","ner_chunk","embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion)) val data = Seq("""Her family history is positive for breast cancer in her maternal aunt.""").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.oncology_family_history").predict("""Her family history is positive for breast cancer in her maternal aunt.""") ```
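Each whitelisted `ner_chunk` receives exactly one assertion label, so the chunk text, its NER label, and its assertion status line up one-to-one across the output columns. A plain-Python sketch of assembling the results table from those three aligned lists (values taken from the example sentence, not produced by running the pipeline here):

```python
# One assertion label is produced per NER chunk, so the three output
# columns zip together row by row.
chunks = ["breast cancer"]
ner_labels = ["Cancer_Dx"]
assertions = ["Family_History"]

rows = [
    {"chunk": c, "ner_label": l, "assertion": a}
    for c, l, a in zip(chunks, ner_labels, assertions)
]
print(rows)
```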
## Results ```bash | chunk | ner_label | assertion | |:--------------|:------------|:---------------| | breast cancer | Cancer_Dx | Family_History | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_oncology_family_history_wip| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion_pred]| |Language:|en| |Size:|1.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label precision recall f1-score support Family_History 0.94 0.84 0.89 37.0 Other 0.91 0.97 0.94 62.0 macro-avg 0.92 0.90 0.91 99.0 weighted-avg 0.92 0.92 0.92 99.0 ``` --- layout: model title: RCT Binary Classifier (BioBERT) Pipeline author: John Snow Labs name: bert_sequence_classifier_binary_rct_biobert_pipeline date: 2023-06-13 tags: [licensed, classifier, rct, clinical, en] task: Entity Resolution language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is a BioBERT-based classifier that can classify whether an article is a randomized clinical trial (RCT) or not. It is built on top of the [bert_sequence_classifier_binary_rct_biobert](https://nlp.johnsnowlabs.com/2022/04/25/bert_sequence_classifier_binary_rct_biobert_en_3_0.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_binary_rct_biobert_pipeline_en_4.4.4_3.2_1686665514038.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_binary_rct_biobert_pipeline_en_4.4.4_3.2_1686665514038.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_sequence_classifier_binary_rct_biobert_pipeline", "en", "clinical/models") result = pipeline.annotate("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. 
""") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_sequence_classifier_binary_rct_biobert_pipeline", "en", "clinical/models") val result = pipeline.annotate("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. """) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.bert_sequence.binary_rct_biobert.pipeline").predict("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. 
This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. """) ```
## Results ```bash Results +----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |rct |text | 
+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |True|Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. 
Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. | +----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` 
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_binary_rct_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|406.0 MB| ## Included Models - DocumentAssembler - TokenizerModel - MedicalBertForSequenceClassification --- layout: model title: Turkish BertForQuestionAnswering Base Cased model (from husnu) author: John Snow Labs name: bert_qa_base_turkish_128k_cased_tquad2_finetuned_lr_2e_05_epochs_1 date: 2022-07-07 tags: [tr, open_source, bert, question_answering] task: Question Answering language: tr edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-turkish-128k-cased-finetuned_lr-2e-05_epochs-3TQUAD2-finetuned_lr-2e-05_epochs-1` is a Turkish model originally trained by `husnu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_turkish_128k_cased_tquad2_finetuned_lr_2e_05_epochs_1_tr_4.0.0_3.0_1657183615493.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_turkish_128k_cased_tquad2_finetuned_lr_2e_05_epochs_1_tr_4.0.0_3.0_1657183615493.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_turkish_128k_cased_tquad2_finetuned_lr_2e_05_epochs_1","tr") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_turkish_128k_cased_tquad2_finetuned_lr_2e_05_epochs_1","tr") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
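Extractive QA models such as this one select a span of the context as the answer rather than generating new text. A minimal plain-Python sketch of recovering the answer string from character offsets (the offsets below are toy values chosen for the example context, not actual model output; Spark NLP annotations carry comparable `begin`/`end` fields with an inclusive end):

```python
context = "Benim adım Clara ve Berkeley'de yaşıyorum."

# Toy begin/end character offsets for the span "Clara".
# The end offset is inclusive, mirroring annotation begin/end fields.
begin, end = 11, 15
answer = context[begin:end + 1]
print(answer)
```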
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_turkish_128k_cased_tquad2_finetuned_lr_2e_05_epochs_1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|tr| |Size:|689.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/husnu/bert-base-turkish-128k-cased-finetuned_lr-2e-05_epochs-3TQUAD2-finetuned_lr-2e-05_epochs-1 --- layout: model title: Legal Transport Policy Document Classifier (EURLEX) author: John Snow Labs name: legclf_transport_policy_bert date: 2023-03-06 tags: [en, legal, classification, clauses, transport_policy, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The `legclf_transport_policy_bert` model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class Transport_Policy or not (binary classification) according to EuroVoc labels. ## Predicted Entities `Transport_Policy`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_transport_policy_bert_en_1.0.0_3.0_1678111909241.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_transport_policy_bert_en_1.0.0_3.0_1678111909241.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_transport_policy_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
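Downstream, the predicted `category` column is typically used to filter documents of interest. A plain-Python sketch of that post-processing step, with hypothetical (text, category) pairs mirroring the output schema (neither the texts nor the labels come from running the model):

```python
# Hypothetical (document text, predicted category) pairs mirroring
# the (text, category) columns of the transformed DataFrame.
rows = [
    ("Hypothetical regulation on road transport quotas", "Transport_Policy"),
    ("Hypothetical decision on agricultural subsidies", "Other"),
]

transport_docs = [text for text, category in rows if category == "Transport_Policy"]
print(transport_docs)
```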
## Results ```bash +------------------+ |result | +------------------+ |[Transport_Policy]| |[Other] | |[Other] | |[Transport_Policy]| +------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_transport_policy_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.3 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.76 0.97 0.86 40 Transport_Policy 0.98 0.78 0.87 55 accuracy - - 0.86 95 macro-avg 0.87 0.88 0.86 95 weighted-avg 0.89 0.86 0.86 95 ``` --- layout: model title: Depression Binary Classifier (PHS-BERT) author: John Snow Labs name: bert_sequence_classifier_depression_binary date: 2022-08-10 tags: [public_health, en, licensed, sequence_classification, mental_health, depression] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [PHS-BERT](https://arxiv.org/abs/2204.04521) based text classification model that can classify whether a social media text expresses depression or not. ## Predicted Entities `no-depression`, `depression` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_depression_binary_en_4.0.2_3.0_1660145810767.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_depression_binary_en_4.0.2_3.0_1660145810767.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_depression_binary", "en", "clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame([ ["I am really feeling like there are no good men. I think I would rather be alone than deal with any man again. Has anyone else felt like this? Did your feelings ever change?"], ["For all of those who suffer from and find NYE difficult, I know we can get through it together."]]).toDF("text") result = pipeline.fit(data).transform(data) result.select("text", "class.result").show(truncate=False) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_depression_binary", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq( "I am really feeling like there are no good men. I think I would rather be alone than deal with any man again. Has anyone else felt like this? Did your feelings ever change?", "For all of those who suffer from and find NYE difficult, I know we can get through it together.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.bert_sequence.depression_binary").predict("""I am really feeling like there are no good men. 
I think I would rather be alone than deal with any man again. Has anyone else felt like this? Did your feelings ever change?""") ```
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+ |text |result | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+ |I am really feeling like there are no good men. I think I would rather be alone than deal with any man again. Has anyone else felt like this? Did your feelings ever change?|[depression] | |For all of those who suffer from and find NYE difficult, I know we can get through it together. |[no-depression]| +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_depression_binary| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References Curated from several academic and in-house datasets. 
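The benchmarking table below reports per-label precision, recall, and F1. As a self-contained illustration of how such figures are derived from confusion counts (the counts below are toy values, not the model's actual evaluation data):

```python
# Toy binary confusion counts for one label:
# true positives, false positives, false negatives.
tp, fp, fn = 90, 3, 7

precision = tp / (tp + fp)            # fraction of predicted positives that are correct
recall = tp / (tp + fn)               # fraction of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(precision, 2), round(recall, 2), round(f1, 2))
```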
## Benchmarking ```bash label precision recall f1-score support no-depression 0.93 0.97 0.95 549 depression 0.97 0.93 0.95 577 accuracy - - 0.95 1126 macro-avg 0.95 0.95 0.95 1126 weighted-avg 0.95 0.95 0.95 1126 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from vinitharaj) author: John Snow Labs name: distilbert_qa_vinitharaj_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is a English model originally trained by `vinitharaj`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_vinitharaj_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773035091.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_vinitharaj_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773035091.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vinitharaj_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vinitharaj_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_vinitharaj_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/vinitharaj/distilbert-base-uncased-finetuned-squad --- layout: model title: Multilingual (Finnish, Estonian, English) Bert Embeddings (Base) author: John Snow Labs name: bert_embeddings_finest_bert date: 2022-04-11 tags: [bert, embeddings, fi, et, en, xx, multilingual, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `finest-bert` is a multilingual (Finnish, Estonian, and English) model originally trained by `EMBEDDIA`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_finest_bert_en_3.4.2_3.0_1649672041625.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_finest_bert_en_3.4.2_3.0_1649672041625.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_finest_bert","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_finest_bert","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.finest_bert").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_finest_bert| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|538.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/EMBEDDIA/finest-bert - https://arxiv.org/abs/2006.07890 --- layout: model title: Deidentify (Enriched) author: John Snow Labs name: deidentify_enriched_clinical date: 2021-01-29 task: De-identification language: en nav_key: models edition: Spark NLP for Healthcare 2.7.2 spark_version: 2.4 tags: [deidentify, en, obfuscation, licensed] supported: true annotator: DeIdentificationModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Deidentify (Enriched) is a deidentification model. It identifies instances of protected health information in text documents, and it can either obfuscate them (e.g., replacing names with different, fake names) or mask them (e.g., replacing “2020-06-04” with a placeholder tag such as “<DATE>”). This model is useful for maintaining HIPAA compliance when dealing with text documents that contain protected health information. ## Predicted Entities - PHONE - PATIENT - COUNTRY - USERNAME - LOCATION-OTHER - DATE - ID - DOCTOR - HOSPITAL - IDNUM - AGE - MEDICALRECORD - CITY - FAX - ZIP - HEALTHPLAN - PROFESSION - BIOID - URL - EMAIL - STATE - ORGANIZATION - STREET - DEVICE {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/deidentify_enriched_clinical_en_2.7.2_2.4_1611917177874.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/deidentify_enriched_clinical_en_2.7.2_2.4_1611917177874.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentenceDetector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") obfuscation = DeIdentificationModel.pretrained("deidentify_enriched_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "ner_chunk"]) \ .setOutputCol("obfuscated") \ .setMode("obfuscate") nlp_pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, obfuscation]) data = spark.createDataFrame([["""A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 
0295 Keats Street"""]]).toDF("text") result = nlp_pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val obfuscation = DeIdentificationModel.pretrained("deidentify_enriched_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "ner_chunk")) .setOutputCol("obfuscated") .setMode("obfuscate") val nlpPipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, obfuscation)) val data = Seq("""A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street""").toDS.toDF("text") val result = nlpPipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.de_identify.clinical").predict("""A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street""") ```
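As a purely illustrative aside, `mask` mode conceptually replaces each detected PHI span with a placeholder tag, while `obfuscate` (used above) swaps in fake but plausible values. A toy regex sketch of the masking idea — the model itself relies on the NER chunks from the pipeline, not on regexes:

```python
import re

text = "Record date : 2093-01-13 , David Hale , M.D ."

# Toy stand-in for mask mode: replace an ISO-style date span with a tag
masked = re.sub(r"\d{4}-\d{2}-\d{2}", "<DATE>", text)
# masked -> "Record date : <DATE> , David Hale , M.D ."
```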
## Results ```bash sentence deidentified 0 A . A . 1 Record date : 2093-01-13 , David Hale , M.D . Record date : 2093-01-18 , DR. Gregory Kaiser , M.D . 2 , Name : Hendrickson , Ora MR . , Name : Joel Vasquez MR . 3 # 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . # 67696 Date : 01/18/93 PCP : DR. Jennifer Eaton , 25 years-old , Record date : 2079-11-14 . 4 Cocke County Baptist Hospital . San Leandro Hospital – San Leandro . 5 0295 Keats Street 3744 Retreat Avenue ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|deidentify_enriched_clinical| |Compatibility:|Spark NLP 2.7.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, ner_chunk]| |Output Labels:|[deidentified]| |Language:|en| --- layout: model title: Fast Neural Machine Translation Model from Arabic to German author: John Snow Labs name: opus_mt_ar_de date: 2021-06-01 tags: [open_source, seq2seq, translation, ar, de, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
source languages: ar target languages: de {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ar_de_xx_3.1.0_2.4_1622562969038.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ar_de_xx_3.1.0_2.4_1622562969038.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ar_de", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ar_de", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Arabic.translate_to.German').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ar_de| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: RxNorm to MeSH Code Mapping author: John Snow Labs name: rxnorm_mesh_mapping date: 2021-07-01 tags: [rxnorm, mesh, en, licensed, pipeline] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps RxNorm codes to MeSH codes without using any text data. Simply feed whitespace-delimited RxNorm codes and it will return the corresponding MeSH codes as a list. If a code has no MeSH mapping, the original code is returned unchanged. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_CODE_MAPPING/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_mesh_mapping_en_3.1.0_2.4_1625126293744.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_mesh_mapping_en_3.1.0_2.4_1625126293744.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("rxnorm_mesh_mapping","en","clinical/models") pipeline.annotate("1191 6809 47613") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("rxnorm_mesh_mapping","en","clinical/models") val result = pipeline.annotate("1191 6809 47613") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.rxnorm.mesh").predict("""1191 6809 47613""") ```
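The `annotate` call returns parallel lists of input RxNorm codes and mapped MeSH codes; pairing them into a lookup table is a one-liner. A minimal sketch using the example codes above (the output dict mirrors the documented result of this pipeline):

```python
# Output shape of pipeline.annotate("1191 6809 47613"): parallel code lists
output = {"rxnorm": ["1191", "6809", "47613"],
          "mesh": ["D001241", "D008687", "D019355"]}

# Pair each RxNorm code with its mapped MeSH code
code_map = dict(zip(output["rxnorm"], output["mesh"]))
# code_map["1191"] -> "D001241" (aspirin -> Aspirin)
```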
## Results ```bash {'rxnorm': ['1191', '6809', '47613'], 'mesh': ['D001241', 'D008687', 'D019355']} Note: | RxNorm | Details | | ---------- | -------------------:| | 1191 | aspirin | | 6809 | metformin | | 47613 | calcium citrate | | MeSH | Details | | ---------- | -------------------:| | D001241 | Aspirin | | D008687 | Metformin | | D019355 | Calcium Citrate | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|rxnorm_mesh_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - TokenizerModel - LemmatizerModel - Finisher --- layout: model title: Translate Luba-Lulua to English Pipeline author: John Snow Labs name: translate_lua_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, lua, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `lua` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_lua_en_xx_2.7.0_2.4_1609698600492.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_lua_en_xx_2.7.0_2.4_1609698600492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_lua_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_lua_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.lua.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_lua_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate English to Igbo Pipeline author: John Snow Labs name: translate_en_ig date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, ig, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `ig` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ig_xx_2.7.0_2.4_1609691476741.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ig_xx_2.7.0_2.4_1609691476741.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ig", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ig", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ig').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_ig| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Multilingual XLMRobertaForTokenClassification Cased model (from SkyR) author: John Snow Labs name: xlmroberta_ner_skyr_wikineural_multilingual date: 2022-08-13 tags: [xx, open_source, xlm_roberta, ner] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `wikineural-multilingual-ner` is a Multilingual model originally trained by `SkyR`. ## Predicted Entities `ORG`, `LOC`, `PER`, `MISC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_skyr_wikineural_multilingual_xx_4.1.0_3.0_1660425623483.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_skyr_wikineural_multilingual_xx_4.1.0_3.0_1660425623483.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_skyr_wikineural_multilingual","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_skyr_wikineural_multilingual","xx") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
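`NerConverter` groups the classifier's token-level IOB tags into entity chunks. A self-contained pure-Python sketch of that grouping logic (toy tokens and tags for illustration, not this model's actual output):

```python
tokens = ["John", "Snow", "Labs", "is", "in", "Delaware"]
tags   = ["B-ORG", "I-ORG", "I-ORG", "O", "O", "B-LOC"]

chunks, current = [], None
for tok, tag in zip(tokens, tags):
    if tag.startswith("B-"):
        # A B- tag closes any open chunk and starts a new one
        if current: chunks.append(current)
        current = [tag[2:], [tok]]
    elif tag.startswith("I-") and current and current[0] == tag[2:]:
        # An I- tag of the same label extends the open chunk
        current[1].append(tok)
    else:
        # An O tag (or inconsistent I- tag) closes the open chunk
        if current: chunks.append(current)
        current = None
if current: chunks.append(current)

chunks = [(label, " ".join(words)) for label, words in chunks]
# chunks -> [("ORG", "John Snow Labs"), ("LOC", "Delaware")]
```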
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_skyr_wikineural_multilingual| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|890.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/SkyR/wikineural-multilingual-ner --- layout: model title: General model for table detection author: John Snow Labs name: ocr_table_detection_general_model date: 2021-09-04 tags: [en, licensed] task: OCR Table Detection & Recognition language: en nav_key: models edition: Visual NLP 3.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A general model for table detection, inspired by https://arxiv.org/abs/2004.12629 ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/ocr_table_detection_general_model_en_3.0.0_3.0_1630757579641.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/ocr_table_detection_general_model_en_3.0.0_3.0_1630757579641.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use This model is used by the ImageTableDetector annotator.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python binary_to_image = BinaryToImage() binary_to_image.setImageType(ImageType.TYPE_3BYTE_BGR) table_detector = ImageTableDetector \ .pretrained("general_model_table_detection_v2", "en", "clinical/ocr") \ .setInputCol("image") \ .setOutputCol("table_regions") pipeline = PipelineModel(stages=[ binary_to_image, table_detector ]) ``` ```scala var imgDf = spark.read.format("binaryFile").load(imagePath) var bin2imTransformer = new BinaryToImage() bin2imTransformer.setImageType(ImageType.TYPE_3BYTE_BGR) val dataFrame = bin2imTransformer.transform(imgDf) val tableDetector = ImageTableDetector .pretrained("general_model_table_detection_v2", "en", "clinical/ocr") .setInputCol("image") .setOutputCol("table regions") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ocr_table_detection_general_model| |Type:|ocr| |Compatibility:|Visual NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Output Labels:|[table regions]| |Language:|en| --- layout: model title: English ElectraForQuestionAnswering model (from ptran74) Version-4 author: John Snow Labs name: electra_qa_DSPFirst_Finetuning_4 date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `DSPFirst-Finetuning-4` is an English model originally trained by `ptran74`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_DSPFirst_Finetuning_4_en_4.0.0_3.0_1655919679099.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_DSPFirst_Finetuning_4_en_4.0.0_3.0_1655919679099.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_DSPFirst_Finetuning_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_DSPFirst_Finetuning_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.electra.finetuning_4").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_DSPFirst_Finetuning_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ptran74/DSPFirst-Finetuning-4 - https://github.gatech.edu/pages/VIP-ITS/textbook_SQuAD_explore/explore/textbookv1.0/textbook/ - https://github.com/patil-suraj/question_generation - https://colab.research.google.com/drive/1dJXNstk2NSenwzdtl9xA8AqjP4LL-Ks_?usp=sharing --- layout: model title: English Bert Embeddings Uncased model (from antoinev17) author: John Snow Labs name: bert_embeddings_base_uncased_issues_128 date: 2023-02-20 tags: [open_source, bert, bert_embeddings, bertformaskedlm, en, tensorflow] task: Embeddings language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-issues-128` is an English model originally trained by `antoinev17`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_uncased_issues_128_en_4.3.0_3.0_1676927301180.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_uncased_issues_128_en_4.3.0_3.0_1676927301180.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_base_uncased_issues_128","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_base_uncased_issues_128","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_uncased_issues_128| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|410.1 MB| |Case sensitive:|true| ## References https://huggingface.co/antoinev17/bert-base-uncased-issues-128 --- layout: model title: English BertForQuestionAnswering Cased model (from wiselinjayajos) author: John Snow Labs name: bert_qa_wiselinjayajos_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `wiselinjayajos`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_wiselinjayajos_finetuned_squad_en_4.0.0_3.0_1657186807960.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_wiselinjayajos_finetuned_squad_en_4.0.0_3.0_1657186807960.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_wiselinjayajos_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_wiselinjayajos_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_wiselinjayajos_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/wiselinjayajos/bert-finetuned-squad --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from Evelyn18) author: John Snow Labs name: roberta_qa_base_spanish_squades_becasincentivos6 date: 2023-01-20 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-squades-becasIncentivos6` is a Spanish model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becasincentivos6_es_4.3.0_3.0_1674218206378.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becasincentivos6_es_4.3.0_3.0_1674218206378.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becasincentivos6","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becasincentivos6","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_spanish_squades_becasincentivos6| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|459.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Evelyn18/roberta-base-spanish-squades-becasIncentivos6 --- layout: model title: Legal Method of payment Clause Binary Classifier (md) author: John Snow Labs name: legclf_method_of_payment_md date: 2022-11-25 tags: [en, legal, classification, document, agreement, contract, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `method-of-payment` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Keep in mind that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `method-of-payment` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_method_of_payment_md_en_1.0.0_3.0_1669376507093.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_method_of_payment_md_en_1.0.0_3.0_1669376507093.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_method_of_payment_md", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
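The paragraph splitting recommended in the description above can be sketched in plain Python before feeding the pipeline. This is an illustrative helper only (not the Spark NLP workshop utility): it breaks a long contract on runs of blank lines, so each candidate clause becomes one row of the `clause_text` column.

```python
import re

def split_paragraphs(text: str) -> list:
    """Paragraph splitting 'by multiline': split on runs of blank lines
    and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Hypothetical two-clause contract used only for illustration.
contract = (
    "1. PAYMENT. All fees shall be paid by wire transfer within thirty days.\n\n"
    "2. TERM. This Agreement shall remain in force for one (1) year."
)
clauses = split_paragraphs(contract)
# Each clause can then be classified separately, e.g.:
# df = spark.createDataFrame([[c] for c in clauses]).toDF("clause_text")
```

Each resulting fragment stays well under the 512-token limit of the embeddings, so the classifier sees one coherent clause at a time.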
## Results ```bash +-------------------+ | result| +-------------------+ |[method-of-payment]| |[other]| |[other]| |[method-of-payment]| +-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_method_of_payment_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash precision recall f1-score support method-of-payment 0.97 0.94 0.96 35 other 0.95 0.97 0.96 39 accuracy 0.96 74 macro avg 0.96 0.96 0.96 74 weighted avg 0.96 0.96 0.96 74 ``` --- layout: model title: English T5ForConditionalGeneration Mini Cased model (from google) author: John Snow Labs name: t5_efficient_mini_nl12 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-mini-nl12` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_mini_nl12_en_4.3.0_3.0_1675118188776.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_mini_nl12_en_4.3.0_3.0_1675118188776.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_mini_nl12","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_mini_nl12","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_mini_nl12| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|156.4 MB| ## References - https://huggingface.co/google/t5-efficient-mini-nl12 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Legal NER for NDA (Preamble Clause) author: John Snow Labs name: legner_nda_preamble date: 2023-04-06 tags: [en, licensed, legal, ner, nda, preamble, purpose] task: Named Entity Recognition language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a NER model, aimed to be run **only** after detecting the `PREAMBLE` clause with a proper classifier (use legmulticlf_mnda_sections_paragraph_other for that purpose). It will extract the following entities: `PURPOSE`, and `PURPOSE_OBJECT`. ## Predicted Entities `PURPOSE`, `PURPOSE_OBJECT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_nda_preamble_en_1.0.0_3.0_1680791988031.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_nda_preamble_en_1.0.0_3.0_1680791988031.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_nda_preamble", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""In order to facilitate the consideration and negotiation of a possible transaction involving Chordiant and Pegasystems ( referred to collectively as the "Parties" and individually as a "Party"), each Party has requested access to certain non-public information regarding the other Party and the other Party’s subsidiaries."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
## Results ```bash +-------------+--------------+ |chunk |ner_label | +-------------+--------------+ |consideration|PURPOSE | |negotiation |PURPOSE | |transaction |PURPOSE_OBJECT| +-------------+--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_nda_preamble| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References In-house annotations on Non-disclosure Agreements ## Benchmarking ```bash label precision recall f1-score support B-PURPOSE 1.00 0.93 0.97 15 B-PURPOSE_OBJECT 0.90 0.82 0.86 11 I-PURPOSE_OBJECT 1.00 0.80 0.89 5 micro-avg 0.96 0.87 0.92 31 macro-avg 0.97 0.85 0.90 31 weighted-avg 0.96 0.87 0.91 31 ``` --- layout: model title: Sentence Entity Resolver for ICD-O (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_icdo date: 2021-05-16 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD-O codes (Topography & Morphology codes) using `sbiobert_base_cased_mli` Sentence Bert Embeddings. Given an oncological entity found in the text (via NER models like ner_jsl), it returns top terms and resolutions along with the corresponding ICD-O codes to present more granularity with respect to body parts mentioned. It also returns the original `Topography` codes and `Morphology` codes, comprising `Histology` and `Behavior` codes, and their descriptions in the aux metadata. ## Predicted Entities Predicts ICD-O Codes and their normalized definition for each chunk. 
{:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icdo_en_3.0.4_3.0_1621191532225.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icdo_en_3.0.4_3.0_1621191532225.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") chunk2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") icdo_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icdo","en", "clinical/models")\ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icdo_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . 
Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("sbert_embeddings") val icdo_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icdo","en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icdo_resolver)) val data = Seq("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal 
insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icdo").predict("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""") ```
## Results ```bash +--------------------+-----+---+---------+------+----------+--------------------+--------------------+ | chunk|begin|end| entity| code|confidence| resolutions| codes| +--------------------+-----+---+---------+------+----------+--------------------+--------------------+ | hypertension| 68| 79| PROBLEM|8312/3| 0.3558|Renal cell carcin...|8312/3:::9964/3::...| |chronic renal ins...| 83|109| PROBLEM|9980/3| 0.5290|Refractory anemia...|9980/3:::8312/3::...| | COPD| 113|116| PROBLEM|9950/3| 0.2092|Polycythemia vera...|9950/3:::8141/3::...| | gastritis| 120|128| PROBLEM|8150/3| 0.1795|Islet cell carcin...|8150/3:::8153/3::...| | TIA| 136|138| PROBLEM|9570/0| 0.4843|Neuroma, NOS:::Ca...|9570/0:::8692/3::...| |a non-ST elevatio...| 182|202| PROBLEM|8343/2| 0.1914|Non-invasive EFVP...|8343/2:::9150/0::...| |Guaiac positive s...| 208|229| PROBLEM|8155/3| 0.1069|Vipoma:::Myeloid ...|8155/3:::9930/3::...| |cardiac catheteri...| 295|317| TEST|8045/3| 0.1144|Combined small ce...|8045/3:::9705/3::...| | PTCA| 324|327|TREATMENT|9735/3| 0.0924|Plasmablastic lym...|9735/3:::9365/3::...| | mid LAD lesion| 332|345| PROBLEM|9383/1| 0.0845|Subependymoma:::D...|9383/1:::8806/3::...| +--------------------+-----+---+---------+------+----------+--------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icdo| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk, sbert_embeddings]| |Output Labels:|[icdo_code]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on ICD-O Histology Behaviour dataset with ``sbiobert_base_cased_mli`` sentence embeddings. 
https://apps.who.int/iris/bitstream/handle/10665/96612/9789241548496_eng.pdf --- layout: model title: Multilingual BertForQuestionAnswering Cased model (from roshnir) author: John Snow Labs name: bert_qa_mbert_finetuned_mlqa_es_hi_dev date: 2022-07-07 tags: [xx, open_source, bert, question_answering] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mBert-finetuned-mlqa-dev-es-hi` is a Multilingual model originally trained by `roshnir`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_es_hi_dev_xx_4.0.0_3.0_1657190135587.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_es_hi_dev_xx_4.0.0_3.0_1657190135587.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_es_hi_dev","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_es_hi_dev","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_mbert_finetuned_mlqa_es_hi_dev| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|626.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/roshnir/mBert-finetuned-mlqa-dev-es-hi --- layout: model title: English BertForQuestionAnswering model (from z-uo) author: John Snow Labs name: bert_qa_bert_qasper date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-qasper` is an English model originally trained by `z-uo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_qasper_en_4.0.0_3.0_1654184675001.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_qasper_en_4.0.0_3.0_1654184675001.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_qasper","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_qasper","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.by_z-uo").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_qasper| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/z-uo/bert-qasper --- layout: model title: Classifier for Genders - SBERT author: John Snow Labs name: classifierdl_gender_sbert date: 2020-12-16 task: Text Classification language: en nav_key: models edition: Healthcare NLP 2.6.5 spark_version: 2.4 tags: [classifier, en, clinical, licensed] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model classifies the gender of the patient in the clinical document. {:.h2_title} ## Predicted Entities `Female`, `Male`, `Unknown`. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CLINICAL_CLASSIFICATION.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifierdl_gender_sbert_en_2.6.4_2.4_1608119379496.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifierdl_gender_sbert_en_2.6.4_2.4_1608119379496.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use To classify your text, you can use this model as part of an NLP pipeline with the following stages: DocumentAssembler, BertSentenceEmbeddings (``sbiobert_base_cased_mli``), ClassifierDLModel.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings")\ .setMaxSentenceLength(512) gender_classifier = ClassifierDLModel.pretrained("classifierdl_gender_sbert", "en", "clinical/models") \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") nlp_pipeline = Pipeline(stages=[document_assembler, sbert_embedder, gender_classifier]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) annotations = light_pipeline.fullAnnotate("""social history: shows that does not smoke cigarettes or drink alcohol, lives in a nursing home. family history: shows a family history of breast cancer.""") ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence_embeddings") .setMaxSentenceLength(512) val gender_classifier = ClassifierDLModel.pretrained("classifierdl_gender_sbert", "en", "clinical/models") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_embeddings, gender_classifier)) val data = Seq("""social history: shows that does not smoke cigarettes or drink alcohol, lives in a nursing home. family history: shows a family history of breast cancer.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.gender.sbert").predict("""social history: shows that does not smoke cigarettes or drink alcohol, lives in a nursing home. 
family history: shows a family history of breast cancer.""") ```
{:.h2_title} ## Results ```bash Female ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_gender_sbert| |Type:|ClassifierDLModel| |Compatibility:|Healthcare NLP 2.6.5+| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|[en]| |Case sensitive:|True| {:.h2_title} ## Data Source This model is trained on more than four thousand clinical documents (radiology reports, pathology reports, clinical visits, etc.), annotated internally. {:.h2_title} ## Benchmarking ```bash label precision recall f1-score support Female 0.9224 0.8954 0.9087 239 Male 0.7895 0.8468 0.8171 124 Unknown 0.8077 0.7778 0.7925 54 accuracy - - 0.8657 417 macro-avg 0.8399 0.8400 0.8394 417 weighted-avg 0.8680 0.8657 0.8664 417 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from seomh) author: John Snow Labs name: distilbert_qa_seomh_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `seomh`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_seomh_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772473886.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_seomh_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772473886.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_seomh_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_seomh_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
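Under the hood, extractive QA annotators such as DistilBertForQuestionAnswering score every context token as a candidate answer start and answer end, then return the highest-scoring valid span. A toy, pure-Python sketch of that span selection — the tokens and scores below are invented for illustration, not the model's real logits:

```python
def best_span(tokens, start_scores, end_scores, max_len=15):
    """Pick the (start, end) pair maximizing start+end score, with end >= start."""
    best = (float("-inf"), 0, 0)
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best[0]:
                best = (score, i, j)
    _, i, j = best
    return " ".join(tokens[i:j + 1])

# Invented scores: "Clara" dominates both the start and end distributions.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.3, 5.0, 0.1, 0.0, 0.1, 0.2, 1.0, 0.0]
end   = [0.0, 0.1, 0.2, 4.5, 0.1, 0.0, 0.1, 0.3, 1.2, 0.0]
print(best_span(tokens, start, end))  # Clara
```

The real annotator does this over subword tokens and maps the span back to the original character offsets in `document_context`.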
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_seomh_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/seomh/distilbert-base-uncased-finetuned-squad --- layout: model title: English asr_wav2vec_large_xlsr_korean TFWav2Vec2ForCTC from fleek author: John Snow Labs name: asr_wav2vec_large_xlsr_korean date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec_large_xlsr_korean` is an English model originally trained by fleek. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec_large_xlsr_korean_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec_large_xlsr_korean_en_4.2.0_3.0_1664098551065.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec_large_xlsr_korean_en_4.2.0_3.0_1664098551065.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec_large_xlsr_korean", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec_large_xlsr_korean", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
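Wav2Vec2ForCTC expects `audioDf` to carry the raw waveform as an array of floats (typically 16 kHz mono). A minimal stdlib sketch of decoding a 16-bit PCM WAV file into such an array — the `audio_content` column name matches the pipeline above, but the file path and sample-rate assumptions are illustrative:

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit mono PCM WAV file and normalize samples to [-1.0, 1.0]."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2 and w.getnchannels() == 1, "expects 16-bit mono"
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Hypothetical usage — resample to 16 kHz first if your file differs:
# audioDf = spark.createDataFrame([(wav_to_floats("sample.wav"),)], ["audio_content"])
```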
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec_large_xlsr_korean| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Legal Assignment And Subletting Clause Binary Classifier author: John Snow Labs name: legclf_assignment_and_subletting_clause date: 2023-01-29 tags: [en, legal, classification, assignment, subletting, clauses, assignment_and_subletting, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `assignment-and-subletting` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have long legal documents and you want to look for clauses, we recommend splitting them using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `assignment-and-subletting`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_assignment_and_subletting_clause_en_1.0.0_3.0_1674993574865.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_assignment_and_subletting_clause_en_1.0.0_3.0_1674993574865.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_assignment_and_subletting_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
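The Description above recommends paragraph splitting (by multiline) before feeding long documents to the classifier. Outside Spark NLP, that first technique can be sketched in a few lines — an illustrative stand-in, not the workshop tutorial's exact code:

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("ASSIGNMENT.\nTenant shall not assign this Lease.\n\n"
       "SUBLETTING.\nNo subletting without consent.")
chunks = split_paragraphs(doc)
print(len(chunks))  # 2
```

Each resulting chunk would then become one row of the `text` column shown in the pipeline above.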
## Results ```bash +-------+ |result| +-------+ |[assignment-and-subletting]| |[other]| |[other]| |[assignment-and-subletting]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_assignment_and_subletting_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support assignment-and-subletting 1.00 0.96 0.98 26 other 0.97 1.00 0.99 38 accuracy - - 0.98 64 macro-avg 0.99 0.98 0.98 64 weighted-avg 0.98 0.98 0.98 64 ``` --- layout: model title: Financial Equity Item Binary Classifier author: John Snow Labs name: finclf_equity_item date: 2022-08-10 tags: [en, finance, classification, 10k, annual, reports, sec, filings, licensed] task: Text Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `equity` item type of 10-K Annual Reports. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have long financial documents and you want to look for clauses, we recommend splitting them using any of the techniques available in our Finance NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). ## Predicted Entities `other`, `equity` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_equity_item_en_1.0.0_3.2_1660154390458.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_equity_item_en_1.0.0_3.2_1660154390458.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") useEmbeddings = nlp.UniversalSentenceEncoder.pretrained() \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("finclf_equity_item", "en", "finance/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, useEmbeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
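Because the sentence embeddings cap input at 512 tokens, long 10-K items should be chunked before classification. A rough whitespace-token chunker is sketched below; real subword counts will differ from whitespace counts, so treat the limit as approximate:

```python
def chunk_by_tokens(text, max_tokens=512):
    """Greedily pack whitespace-separated words into chunks of at most max_tokens."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

# 1100 synthetic words -> chunks of 512, 512, and 76 words.
chunks = chunk_by_tokens("equity " * 1100, max_tokens=512)
print([len(c.split()) for c in chunks])  # [512, 512, 76]
```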
## Results ```bash +-------+ | result| +-------+ |[equity]| |[other]| |[other]| |[equity]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_equity_item| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.6 MB| ## References Weak labelling on documents from the EDGAR database ## Benchmarking ```bash label precision recall f1-score support equity 0.84 0.85 0.84 66 other 0.85 0.84 0.84 67 accuracy - - 0.84 133 macro-avg 0.84 0.84 0.84 133 weighted-avg 0.84 0.84 0.84 133 ``` --- layout: model title: Legal No Defaults Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_no_defaults_bert date: 2023-03-05 tags: [en, legal, classification, clauses, no_defaults, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `No_Defaults` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting them using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `No_Defaults`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_no_defaults_bert_en_1.0.0_3.0_1678049951854.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_no_defaults_bert_en_1.0.0_3.0_1678049951854.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_no_defaults_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
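Splitting by headers / subheaders, the second technique the Description mentions, can be approximated with a simple all-caps-line heuristic — a hedged sketch, since real contracts usually need a pattern tuned to their numbering style:

```python
import re

def split_by_headers(text):
    """Start a new section whenever a line looks like an ALL-CAPS header."""
    sections, current = [], []
    for line in text.splitlines():
        if re.fullmatch(r"[A-Z][A-Z .\-]{3,}", line.strip()) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections

doc = ("NO DEFAULTS\nNeither party is in default.\n"
       "GOVERNING LAW\nNew York law applies.")
print(len(split_by_headers(doc)))  # 2
```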
## Results ```bash +-------+ |result| +-------+ |[No_Defaults]| |[Other]| |[Other]| |[No_Defaults]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_no_defaults_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support No_Defaults 0.90 0.95 0.92 37 Other 0.96 0.93 0.95 56 accuracy - - 0.94 93 macro-avg 0.93 0.94 0.93 93 weighted-avg 0.94 0.94 0.94 93 ``` --- layout: model title: German RobertaForMaskedLM Cased model (from FabianGroeger) author: John Snow Labs name: roberta_embeddings_hotelbert date: 2022-12-12 tags: [de, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: de edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `HotelBERT` is a German model originally trained by `FabianGroeger`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_hotelbert_de_4.2.4_3.0_1670858469747.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_hotelbert_de_4.2.4_3.0_1670858469747.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_hotelbert","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_hotelbert","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
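Once the `embeddings` column is collected back as plain Python lists, token vectors are usually compared with cosine similarity. A stdlib sketch — the three-dimensional vectors below are tiny stand-ins for HotelBERT's actual 768-dimensional output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy vectors standing in for two token embeddings.
v_hotel = [0.9, 0.1, 0.3]
v_zimmer = [0.8, 0.2, 0.4]
print(round(cosine(v_hotel, v_zimmer), 3))
```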
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_hotelbert| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|de| |Size:|411.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/FabianGroeger/HotelBERT --- layout: model title: English asr_Urdu_repo TFWav2Vec2ForCTC from bilalahmed15 author: John Snow Labs name: pipeline_asr_Urdu_repo date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Urdu_repo` is an English model originally trained by bilalahmed15. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_Urdu_repo_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_Urdu_repo_en_4.2.0_3.0_1664107452190.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_Urdu_repo_en_4.2.0_3.0_1664107452190.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_Urdu_repo', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_Urdu_repo", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_Urdu_repo| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal Guarantee Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_guarantee_agreement_bert date: 2022-11-24 tags: [en, legal, classification, agreement, guarantee, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_guarantee_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `guarantee-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `guarantee-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_guarantee_agreement_bert_en_1.0.0_3.0_1669313872907.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_guarantee_agreement_bert_en_1.0.0_3.0_1669313872907.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_guarantee_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
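Binary document classifiers like this one are often run side by side, one model per agreement type, and their outputs merged into a per-document report. A small sketch of that aggregation step — the second model name and all predictions here are invented:

```python
def aggregate(doc_id, predictions):
    """Map each classifier's label to True when it fired, False on 'other'."""
    return {doc_id: {name: label != "other" for name, label in predictions.items()}}

preds = {
    "legclf_guarantee_agreement_bert": "guarantee-agreement",
    "legclf_loan_agreement_bert": "other",  # hypothetical sibling model
}
report = aggregate("contract_001", preds)
print(report)
```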
## Results ```bash +-------+ |result| +-------+ |[guarantee-agreement]| |[other]| |[other]| |[guarantee-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_guarantee_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support guarantee-agreement 0.88 0.88 0.88 33 other 0.94 0.94 0.94 65 accuracy - - 0.92 98 macro-avg 0.91 0.91 0.91 98 weighted-avg 0.92 0.92 0.92 98 ``` --- layout: model title: Detect Persons, Locations, Organizations and Misc Entities in Polish (WikiNER 6B 300) author: John Snow Labs name: wikiner_6B_300 date: 2020-05-10 task: Named Entity Recognition language: pl edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [ner, pl, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 6B 300 is trained with GloVe 6B 300 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_PL){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_PL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_6B_300_pl_2.5.0_2.4_1588519719571.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_6B_300_pl_2.5.0_2.4_1588519719571.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang="xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("wikiner_6B_300", "pl") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (ur. 28 października 1955 r.) To amerykański magnat biznesowy, programista, inwestor i filantrop. Najbardziej znany jest jako współzałożyciel Microsoft Corporation. Podczas swojej kariery w Microsoft Gates zajmował stanowiska prezesa, dyrektora generalnego (CEO), prezesa i głównego architekta oprogramowania, będąc jednocześnie największym indywidualnym akcjonariuszem do maja 2014 r. Jest jednym z najbardziej znanych przedsiębiorców i pionierów rewolucja mikrokomputerowa lat 70. i 80. Urodzony i wychowany w Seattle w stanie Waszyngton, Gates był współzałożycielem Microsoftu z przyjacielem z dzieciństwa Paulem Allenem w 1975 r. W Albuquerque w Nowym Meksyku; stała się największą na świecie firmą produkującą oprogramowanie komputerowe. Gates prowadził firmę jako prezes i dyrektor generalny, aż do ustąpienia ze stanowiska dyrektora generalnego w styczniu 2000 r., Ale pozostał przewodniczącym i został głównym architektem oprogramowania. Pod koniec lat 90. Gates był krytykowany za taktykę biznesową, którą uważano za antykonkurencyjną. Opinię tę podtrzymują liczne orzeczenia sądowe. W czerwcu 2006 r. Gates ogłosił, że przejdzie do pracy w niepełnym wymiarze godzin w Microsoft i pracy w pełnym wymiarze godzin w Bill & Melinda Gates Foundation, prywatnej fundacji charytatywnej, którą on i jego żona Melinda Gates utworzyli w 2000 r. 
Stopniowo przeniósł obowiązki na Raya Ozziego i Craiga Mundie. Zrezygnował z funkcji prezesa Microsoftu w lutym 2014 r. I objął nowe stanowisko jako doradca ds. Technologii, aby wesprzeć nowo mianowaną CEO Satyę Nadellę.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang="xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("wikiner_6B_300", "pl") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (ur. 28 października 1955 r.) To amerykański magnat biznesowy, programista, inwestor i filantrop. Najbardziej znany jest jako współzałożyciel Microsoft Corporation. Podczas swojej kariery w Microsoft Gates zajmował stanowiska prezesa, dyrektora generalnego (CEO), prezesa i głównego architekta oprogramowania, będąc jednocześnie największym indywidualnym akcjonariuszem do maja 2014 r. Jest jednym z najbardziej znanych przedsiębiorców i pionierów rewolucja mikrokomputerowa lat 70. i 80. Urodzony i wychowany w Seattle w stanie Waszyngton, Gates był współzałożycielem Microsoftu z przyjacielem z dzieciństwa Paulem Allenem w 1975 r. W Albuquerque w Nowym Meksyku; stała się największą na świecie firmą produkującą oprogramowanie komputerowe. Gates prowadził firmę jako prezes i dyrektor generalny, aż do ustąpienia ze stanowiska dyrektora generalnego w styczniu 2000 r., Ale pozostał przewodniczącym i został głównym architektem oprogramowania. Pod koniec lat 90. Gates był krytykowany za taktykę biznesową, którą uważano za antykonkurencyjną. Opinię tę podtrzymują liczne orzeczenia sądowe. W czerwcu 2006 r. 
Gates ogłosił, że przejdzie do pracy w niepełnym wymiarze godzin w Microsoft i pracy w pełnym wymiarze godzin w Bill & Melinda Gates Foundation, prywatnej fundacji charytatywnej, którą on i jego żona Melinda Gates utworzyli w 2000 r. Stopniowo przeniósł obowiązki na Raya Ozziego i Craiga Mundie. Zrezygnował z funkcji prezesa Microsoftu w lutym 2014 r. I objął nowe stanowisko jako doradca ds. Technologii, aby wesprzeć nowo mianowaną CEO Satyę Nadellę.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (ur. 28 października 1955 r.) To amerykański magnat biznesowy, programista, inwestor i filantrop. Najbardziej znany jest jako współzałożyciel Microsoft Corporation. Podczas swojej kariery w Microsoft Gates zajmował stanowiska prezesa, dyrektora generalnego (CEO), prezesa i głównego architekta oprogramowania, będąc jednocześnie największym indywidualnym akcjonariuszem do maja 2014 r. Jest jednym z najbardziej znanych przedsiębiorców i pionierów rewolucja mikrokomputerowa lat 70. i 80. Urodzony i wychowany w Seattle w stanie Waszyngton, Gates był współzałożycielem Microsoftu z przyjacielem z dzieciństwa Paulem Allenem w 1975 r. W Albuquerque w Nowym Meksyku; stała się największą na świecie firmą produkującą oprogramowanie komputerowe. Gates prowadził firmę jako prezes i dyrektor generalny, aż do ustąpienia ze stanowiska dyrektora generalnego w styczniu 2000 r., Ale pozostał przewodniczącym i został głównym architektem oprogramowania. Pod koniec lat 90. Gates był krytykowany za taktykę biznesową, którą uważano za antykonkurencyjną. Opinię tę podtrzymują liczne orzeczenia sądowe. W czerwcu 2006 r. Gates ogłosił, że przejdzie do pracy w niepełnym wymiarze godzin w Microsoft i pracy w pełnym wymiarze godzin w Bill & Melinda Gates Foundation, prywatnej fundacji charytatywnej, którą on i jego żona Melinda Gates utworzyli w 2000 r. Stopniowo przeniósł obowiązki na Raya Ozziego i Craiga Mundie. 
Zrezygnował z funkcji prezesa Microsoftu w lutym 2014 r. I objął nowe stanowisko jako doradca ds. Technologii, aby wesprzeć nowo mianowaną CEO Satyę Nadellę."""] ner_df = nlu.load('pl.ner.wikiner.glove.6B_300').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
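NerDLModel tags each token with an IOB label, and the `ner_converter` stage (elided with `...` in the pipelines above) merges B-/I- runs into the chunks shown under Results. A pure-Python illustration of that merge, on invented tokens and tags:

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" or an inconsistent I- tag closes the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Paulem", "Allenem", "w", "Seattle"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(iob_to_chunks(tokens, tags))  # [('Paulem Allenem', 'PER'), ('Seattle', 'LOC')]
```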
{:.h2_title} ## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |William Henry Gates III|PER | |To amerykański |MISC | |Najbardziej |MISC | |Microsoft Corporation |ORG | |Podczas swojej |MISC | |Microsoft Gates |MISC | |CEO |ORG | |Urodzony |ORG | |Seattle |LOC | |Waszyngton |LOC | |Gates |PER | |Microsoftu |ORG | |Paulem Allenem |PER | |W Albuquerque |MISC | |Nowym Meksyku |LOC | |Gates |PER | |Ale |PER | |Gates |PER | |Opinię |MISC | |W czerwcu 2006 |MISC | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wikiner_6B_300| |Type:|ner| |Compatibility:| Spark NLP 2.5.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|pl| |Case sensitive:|false| {:.h2_title} ## Data Source The model was trained based on data from [https://pl.wikipedia.org](https://pl.wikipedia.org) --- layout: model title: English image_classifier_vit_autotrain_fashion_mnist__base ViTForImageClassification from abhishek author: John Snow Labs name: image_classifier_vit_autotrain_fashion_mnist__base date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_autotrain_fashion_mnist__base` is an English model originally trained by abhishek.
## Predicted Entities `Coat`, `Shirt`, `Sneaker`, `Sandal`, `T - shirt / top`, `Bag`, `Trouser`, `Dress`, `Ankle boot`, `Pullover` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_autotrain_fashion_mnist__base_en_4.1.0_3.0_1660170414663.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_autotrain_fashion_mnist__base_en_4.1.0_3.0_1660170414663.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_autotrain_fashion_mnist__base", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_autotrain_fashion_mnist__base", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_autotrain_fashion_mnist__base| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Sentiment Analysis Pipeline for Turkish texts author: John Snow Labs name: classifierdl_use_sentiment_pipeline date: 2021-11-04 tags: [turkish, sentiment, tr, open_source] task: Sentiment Analysis language: tr edition: Spark NLP 3.3.1 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline identifies the sentiments (positive or negative) in Turkish texts. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_sentiment_pipeline_tr_3.3.1_2.4_1636020950989.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_sentiment_pipeline_tr_3.3.1_2.4_1636020950989.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("classifierdl_use_sentiment_pipeline", lang = "tr")

result1 = pipeline.annotate("Bu sıralar kafam çok karışık.")
result2 = pipeline.annotate("Sınavımı geçtiğimi öğrenince derin bir nefes aldım.")
```
```scala
val pipeline = new PretrainedPipeline("classifierdl_use_sentiment_pipeline", lang = "tr")

val result1 = pipeline.fullAnnotate("Bu sıralar kafam çok karışık.")(0)
val result2 = pipeline.fullAnnotate("Sınavımı geçtiğimi öğrenince derin bir nefes aldım.")(0)
```
## Results

```bash
['NEGATIVE']
['POSITIVE']
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|classifierdl_use_sentiment_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.3.1+|
|License:|Open Source|
|Edition:|Official|
|Language:|tr|

## Included Models

- DocumentAssembler
- UniversalSentenceEncoder
- ClassifierDLModel

---
layout: model
title: Fast Neural Machine Translation Model from Czech to English
author: John Snow Labs
name: opus_mt_cs_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, cs, en, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

- source languages: `cs`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_cs_en_xx_2.7.0_2.4_1609166719928.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_cs_en_xx_2.7.0_2.4_1609166719928.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_cs_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_cs_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.cs.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_cs_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Detect Clinical Entities
author: John Snow Labs
name: ner_jsl_limited_80p_for_benchmarks
date: 2023-04-02
tags: [ner, licensed, en, clinical, benchmark]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

`Important Note:` This model is trained with a partial dataset that is used to train [ner_jsl](https://nlp.johnsnowlabs.com/2022/10/19/ner_jsl_en.html), and is meant to be used for the benchmarking runs at [LLMs Healthcare Benchmarks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/academic/LLMs_in_Healthcare).

Pretrained named entity recognition deep learning model for clinical terminology. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. This model is the official version of the jsl_ner_wip_clinical model.

Definitions of Predicted Entities:

- `Injury_or_Poisoning`: Physical harm or injury caused to the body, including those caused by accidents, falls, or poisoning of a patient or someone else.
- `Direction`: All the information relating to the laterality of the internal and external organs.
- `Test`: Mentions of laboratory, pathology, and radiological tests.
- `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient.
- `Death_Entity`: Mentions that indicate the death of a patient.
- `Relationship_Status`: State of the patient's romantic or social relationships (e.g. single, married, divorced).
- `Duration`: The duration of a medical treatment or medication use.
- `Respiration`: Number of breaths per minute.
- `Hyperlipidemia`: Terms that indicate hyperlipidemia with relevant subtypes and synonyms.
- `Birth_Entity`: Mentions that indicate giving birth.
- `Age`: All mentions of ages, past or present, related to the patient or to anybody else.
- `Labour_Delivery`: Extractions include stages of labor and delivery.
- `Family_History_Header`: Identifies section headers that correspond to the Family History of the patient.
- `BMI`: Numeric values and other text information related to Body Mass Index.
- `Temperature`: All mentions that refer to body temperature.
- `Alcohol`: Terms that indicate alcohol use, abuse or drinking issues of a patient or someone else.
- `Kidney_Disease`: Terms that refer to any kidney diseases (includes mentions of modifiers such as "Acute" or "Chronic").
- `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else.
- `Medical_History_Header`: Identifies section headers that correspond to the Past Medical History of a patient.
- `Cerebrovascular_Disease`: All terms that refer to cerebrovascular diseases and events.
- `Oxygen_Therapy`: Breathing support triggered by the patient or entirely or partially by machine (e.g. ventilator, BPAP, CPAP).
- `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements.
- `Psychological_Condition`: All the mental health diagnoses, disorders, conditions or syndromes of a patient or someone else.
- `Heart_Disease`: All mentions of acquired, congenital or degenerative heart diseases.
- `Employment`: All mentions of patient or provider occupational titles and employment status.
- `Obesity`: Terms related to a patient being obese (overweight and BMI are extracted as different labels).
- `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.).
- `Pregnancy`: All terms related to pregnancy (excluding terms that are extracted with their specific labels, such as "Labour_Delivery" etc.).
- `ImagingFindings`: All mentions of radiographic and imagistic findings.
- `Procedure`: All mentions of invasive medical or surgical procedures or treatments.
- `Medical_Device`: All mentions related to medical devices and supplies.
- `Race_Ethnicity`: All terms that refer to racial and national origin of sociocultural groups.
- `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital Signs headers are extracted separately with their specific labels).
- `Symptom`: All the symptoms mentioned in the document, of a patient or someone else.
- `Treatment`: Includes therapeutic and minimally invasive treatments and procedures (invasive treatments or procedures are extracted as "Procedure").
- `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs).
- `Route`: Drug and medication administration routes, as described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Drug_Ingredient`: Active ingredient(s) found in drug products.
- `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted.
- `Diet`: All mentions and information regarding the patient's dietary habits.
- `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by the naked eye.
- `LDL`: All mentions related to the lab test and results for LDL (Low Density Lipoprotein).
- `VS_Finding`: Qualitative data (e.g. Fever, Cyanosis, Tachycardia) and any other symptoms that refer to vital signs.
- `Allergen`: Allergen related extractions mentioned in the document.
- `EKG_Findings`: All mentions of EKG readings.
- `Imaging_Technique`: All mentions of special radiographic views or special imaging techniques used in radiology.
- `Triglycerides`: All terms related to the specific lab test for Triglycerides.
- `RelativeTime`: Time references that are relative to different times or events (e.g. words such as "approximately", "in the morning").
- `Gender`: Gender-specific nouns and pronouns.
- `Pulse`: Peripheral heart rate, without advanced information like measurement location.
- `Social_History_Header`: Identifies section headers that correspond to the Social History of a patient.
- `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs).
- `Diabetes`: All terms related to diabetes mellitus.
- `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately.
- `Internal_organ_or_component`: All mentions related to internal body parts or organs that cannot be examined by the naked eye.
- `Clinical_Dept`: Terms that indicate the medical and/or surgical departments.
- `Form`: Drug and medication forms, as described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients.
- `Strength`: Potency of one unit of drug (or a combination of drugs); the measurement units available are described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Fetus_NewBorn`: All terms related to fetus, infant, newborn (excluding terms that are extracted with their specific labels, such as "Labour_Delivery", "Pregnancy" etc.).
- `RelativeDate`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "approximately two years ago", "about two days ago").
- `Height`: All mentions related to a patient's height.
- `Test_Result`: Terms related to all the test results present in the document (clinical test results are included).
- `Sexually_Active_or_Sexual_Orientation`: All terms that are related to sexuality, sexual orientations and sexual activity.
- `Frequency`: Frequency of administration for a dose prescribed.
- `Time`: Specific time references (hour and/or minutes).
- `Weight`: All mentions related to a patient's weight.
- `Vaccine`: Generic and brand names of vaccines or vaccination procedures.
- `Vital_Signs_Header`: Identifies section headers that correspond to the Vital Signs of a patient.
- `Communicable_Disease`: Includes all mentions of communicable diseases.
- `Dosage`: Quantity prescribed by the physician for an active ingredient; measurement units are described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Overweight`: Terms related to the patient being overweight (BMI and Obesity are extracted separately).
- `Hypertension`: All terms related to Hypertension (quantitative data such as 150/100 is extracted as Blood_Pressure).
- `HDL`: Terms related to the lab test for HDL (High Density Lipoprotein). - `Total_Cholesterol`: Terms related to the lab test and results for cholesterol. - `Smoking`: All mentions of smoking status of a patient. - `Date`: Mentions of an exact date, in any format, including day number, month and/or year. ## Predicted Entities `Injury_or_Poisoning`, `Direction`, `Test`, `Admission_Discharge`, `Death_Entity`, `Relationship_Status`, `Duration`, `Respiration`, `Hyperlipidemia`, `Birth_Entity`, `Age`, `Labour_Delivery`, `Family_History_Header`, `BMI`, `Temperature`, `Alcohol`, `Kidney_Disease`, `Oncological`, `Medical_History_Header`, `Cerebrovascular_Disease`, `Oxygen_Therapy`, `O2_Saturation`, `Psychological_Condition`, `Heart_Disease`, `Employment`, `Obesity`, `Disease_Syndrome_Disorder`, `Pregnancy`, `ImagingFindings`, `Procedure`, `Medical_Device`, `Race_Ethnicity`, `Section_Header`, `Symptom`, `Treatment`, `Substance`, `Route`, `Drug_Ingredient`, `Blood_Pressure`, `Diet`, `External_body_part_or_region`, `LDL`, `VS_Finding`, `Allergen`, `EKG_Findings`, `Imaging_Technique`, `Triglycerides`, `RelativeTime`, `Gender`, `Pulse`, `Social_History_Header`, `Substance_Quantity`, `Diabetes`, `Modifier`, `Internal_organ_or_component`, `Clinical_Dept`, `Form`, `Drug_BrandName`, `Strength`, `Fetus_NewBorn`, `RelativeDate`, `Height`, `Test_Result`, `Sexually_Active_or_Sexual_Orientation`, `Frequency`, `Time`, `Weight`, `Vaccine`, `Vaccine_Name`, `Vital_Signs_Header`, `Communicable_Disease`, `Dosage`, `Overweight`, `Hypertension`, `HDL`, `Total_Cholesterol`, `Smoking`, `Date` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} 
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_limited_80p_for_benchmarks_en_4.3.2_3.0_1680468591578.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_limited_80p_for_benchmarks_en_4.3.2_3.0_1680468591578.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_jsl_limited_80p_for_benchmarks", "en", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") ner_pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). Additionally, there is no side effect observed after Influenza vaccine. One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
"""]]).toDF("text") result = ner_pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val jsl_ner = MedicalNerModel.pretrained("ner_jsl_limited_80p_for_benchmarks", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("jsl_ner") val jsl_ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "jsl_ner")) .setOutputCol("ner_chunk") val jsl_ner_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter)) val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). Additionally, there is no side effect observed after Influenza vaccine. One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. 
Mom denies any diarrhea.""").toDS.toDF("text") val result = jsl_ner_pipeline.fit(data).transform(data) ```
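The `NerConverterInternal` stage in the pipelines above merges the model's token-level IOB tags into the labeled chunks reported in the results. A minimal, framework-free sketch of that merge (the token/tag pairs below are illustrative, not actual model output):

```python
def iob_to_chunks(tokens, tags):
    # Merge B-/I- token tags into (chunk_text, label) pairs,
    # which is conceptually what a NerConverter stage does.
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and label != tag[2:]):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)
        else:  # an "O" tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["The", "patient", "is", "a", "21-day-old", "Caucasian", "male"]
tags = ["O", "O", "O", "O", "B-Age", "B-Race_Ethnicity", "B-Gender"]
print(iob_to_chunks(tokens, tags))
# [('21-day-old', 'Age'), ('Caucasian', 'Race_Ethnicity'), ('male', 'Gender')]
```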
## Results ```bash | | chunks | begin | end | sentence_id | entities | |---:|:------------------------------------------|--------:|------:|--------------:|:-----------------------------| | 0 | 21-day-old | 18 | 27 | 0 | Age | | 1 | Caucasian | 29 | 37 | 0 | Race_Ethnicity | | 2 | male | 39 | 42 | 0 | Gender | | 3 | 2 days | 53 | 58 | 0 | Duration | | 4 | congestion | 63 | 72 | 0 | Symptom | | 5 | mom | 76 | 78 | 0 | Gender | | 6 | suctioning yellow discharge | 89 | 115 | 0 | Symptom | | 7 | nares | 136 | 140 | 0 | External_body_part_or_region | | 8 | she | 148 | 150 | 0 | Gender | | 9 | mild | 169 | 172 | 0 | Modifier | | 10 | problems with his breathing while feeding | 174 | 214 | 0 | Symptom | | 11 | perioral cyanosis | 238 | 254 | 0 | Symptom | | 12 | retractions | 259 | 269 | 0 | Symptom | | 13 | Influenza vaccine | 326 | 342 | 1 | Vaccine_Name | | 14 | One day ago | 345 | 355 | 2 | RelativeDate | | 15 | mom | 358 | 360 | 2 | Gender | | 16 | tactile temperature | 377 | 395 | 2 | Symptom | | 17 | Tylenol | 418 | 424 | 2 | Drug_BrandName | | 18 | Baby | 427 | 430 | 3 | Age | | 19 | decreased p.o | 450 | 462 | 3 | Symptom | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_limited_80p_for_benchmarks| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|15.3 MB| ## References Trained on data gathered and manually annotated by John Snow Labs. https://www.johnsnowlabs.com/data/ ## Benchmarking ```bash label tp fp fn total precision recall f1 VS_Finding 164.0 69.0 40.0 204.0 0.7039 0.8039 0.7506 Direction 3394.0 452.0 367.0 3761.0 0.8825 0.9024 0.8923 Respiration 74.0 3.0 5.0 79.0 0.961 0.9367 0.9487 Cerebrovascular_D... 103.0 27.0 14.0 117.0 0.7923 0.8803 0.834 Family_History_He... 
80.0 1.0 1.0 81.0 0.9877 0.9877 0.9877 Heart_Disease 432.0 74.0 64.0 496.0 0.8538 0.871 0.8623 ImagingFindings 66.0 32.0 102.0 168.0 0.6735 0.3929 0.4962 RelativeTime 103.0 55.0 82.0 185.0 0.6519 0.5568 0.6006 Strength 552.0 56.0 37.0 589.0 0.9079 0.9372 0.9223 Smoking 105.0 3.0 9.0 114.0 0.9722 0.9211 0.9459 Medical_Device 3043.0 530.0 364.0 3407.0 0.8517 0.8932 0.8719 Allergen 1.0 1.0 17.0 18.0 0.5 0.0556 0.1 EKG_Findings 42.0 36.0 53.0 95.0 0.5385 0.4421 0.4855 Pulse 106.0 23.0 17.0 123.0 0.8217 0.8618 0.8413 Psychological_Con... 103.0 35.0 20.0 123.0 0.7464 0.8374 0.7893 Overweight 5.0 3.0 0.0 5.0 0.625 1.0 0.7692 Triglycerides 3.0 0.0 2.0 5.0 1.0 0.6 0.75 Obesity 34.0 4.0 7.0 41.0 0.8947 0.8293 0.8608 Admission_Discharge 283.0 30.0 7.0 290.0 0.9042 0.9759 0.9386 HDL 2.0 1.0 0.0 2.0 0.6667 1.0 0.8 Diabetes 107.0 8.0 3.0 110.0 0.9304 0.9727 0.9511 Section_Header 3184.0 185.0 118.0 3302.0 0.9451 0.9643 0.9546 Age 524.0 49.0 61.0 585.0 0.9145 0.8957 0.905 O2_Saturation 29.0 10.0 13.0 42.0 0.7436 0.6905 0.716 Kidney_Disease 82.0 8.0 17.0 99.0 0.9111 0.8283 0.8677 Test 2063.0 451.0 414.0 2477.0 0.8206 0.8329 0.8267 Communicable_Disease 22.0 9.0 9.0 31.0 0.7097 0.7097 0.7097 Hypertension 124.0 4.0 7.0 131.0 0.9688 0.9466 0.9575 External_body_par... 
2277.0 405.0 353.0 2630.0 0.849 0.8658 0.8573 Oxygen_Therapy 70.0 17.0 10.0 80.0 0.8046 0.875 0.8383 Modifier 1960.0 357.0 549.0 2509.0 0.8459 0.7812 0.8123 Test_Result 796.0 178.0 210.0 1006.0 0.8172 0.7913 0.804 BMI 4.0 0.0 1.0 5.0 1.0 0.8 0.8889 Labour_Delivery 55.0 31.0 26.0 81.0 0.6395 0.679 0.6587 Employment 192.0 24.0 48.0 240.0 0.8889 0.8 0.8421 Fetus_NewBorn 24.0 19.0 43.0 67.0 0.5581 0.3582 0.4364 Clinical_Dept 795.0 57.0 85.0 880.0 0.9331 0.9034 0.918 Time 22.0 11.0 9.0 31.0 0.6667 0.7097 0.6875 Procedure 2458.0 413.0 503.0 2961.0 0.8561 0.8301 0.8429 Diet 21.0 6.0 30.0 51.0 0.7778 0.4118 0.5385 Oncological 342.0 62.0 77.0 419.0 0.8465 0.8162 0.8311 LDL 4.0 0.0 0.0 4.0 1.0 1.0 1.0 Symptom 5777.0 1069.0 1277.0 7054.0 0.8439 0.819 0.8312 Temperature 86.0 5.0 12.0 98.0 0.9451 0.8776 0.9101 Vital_Signs_Header 201.0 23.0 14.0 215.0 0.8973 0.9349 0.9157 Relationship_Status 44.0 1.0 3.0 47.0 0.9778 0.9362 0.9565 Total_Cholesterol 10.0 5.0 7.0 17.0 0.6667 0.5882 0.625 Blood_Pressure 131.0 35.0 23.0 154.0 0.7892 0.8506 0.8188 Injury_or_Poisoning 431.0 71.0 140.0 571.0 0.8586 0.7548 0.8034 Drug_Ingredient 1508.0 106.0 158.0 1666.0 0.9343 0.9052 0.9195 Treatment 124.0 36.0 68.0 192.0 0.775 0.6458 0.7045 Pregnancy 89.0 40.0 38.0 127.0 0.6899 0.7008 0.6953 Vaccine 1.0 0.0 4.0 5.0 1.0 0.2 0.3333 Disease_Syndrome_... 2471.0 551.0 432.0 2903.0 0.8177 0.8512 0.8341 Height 12.0 3.0 9.0 21.0 0.8 0.5714 0.6667 Frequency 500.0 103.0 110.0 610.0 0.8292 0.8197 0.8244 Route 797.0 77.0 70.0 867.0 0.9119 0.9193 0.9156 Duration 258.0 52.0 106.0 364.0 0.8323 0.7088 0.7656 Death_Entity 38.0 7.0 3.0 41.0 0.8444 0.9268 0.8837 Internal_organ_or... 5434.0 839.0 984.0 6418.0 0.8663 0.8467 0.8564 Vaccine_Name 5.0 1.0 3.0 8.0 0.8333 0.625 0.7143 Alcohol 78.0 13.0 6.0 84.0 0.8571 0.9286 0.8914 Substance_Quantity 0.0 9.0 1.0 1.0 0.0 0.0 0.0 Date 455.0 25.0 12.0 467.0 0.9479 0.9743 0.9609 Hyperlipidemia 34.0 0.0 2.0 36.0 1.0 0.9444 0.9714 Social_History_He... 
75.0 3.0 3.0 78.0 0.9615 0.9615 0.9615
Imaging_Technique 25.0 16.0 23.0 48.0 0.6098 0.5208 0.5618
Race_Ethnicity 110.0 0.0 2.0 112.0 1.0 0.9821 0.991
Drug_BrandName 788.0 65.0 53.0 841.0 0.9238 0.937 0.9303
RelativeDate 488.0 150.0 98.0 586.0 0.7649 0.8328 0.7974
Gender 5189.0 66.0 55.0 5244.0 0.9874 0.9895 0.9885
Dosage 229.0 32.0 69.0 298.0 0.8774 0.7685 0.8193
Form 179.0 22.0 37.0 216.0 0.8905 0.8287 0.8585
Medical_History_H... 112.0 7.0 6.0 118.0 0.9412 0.9492 0.9451
Birth_Entity 2.0 2.0 5.0 7.0 0.5 0.2857 0.3636
Substance 60.0 6.0 11.0 71.0 0.9091 0.8451 0.8759
Sexually_Active_o... 2.0 1.0 1.0 3.0 0.6667 0.6667 0.6667
Weight 77.0 8.0 12.0 89.0 0.9059 0.8652 0.8851
macro - - - - - - 0.7914
micro - - - - - - 0.8691
```

---
layout: model
title: Danish asr_xls_r_300m_nst_cv9 TFWav2Vec2ForCTC from chcaa
author: John Snow Labs
name: pipeline_asr_xls_r_300m_nst_cv9
date: 2022-09-25
tags: [wav2vec2, da, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: da
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_300m_nst_cv9` is a Danish model originally trained by chcaa.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_xls_r_300m_nst_cv9_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_300m_nst_cv9_da_4.2.0_3.0_1664103552039.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_300m_nst_cv9_da_4.2.0_3.0_1664103552039.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_xls_r_300m_nst_cv9', lang = 'da') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_xls_r_300m_nst_cv9", lang = "da") val annotations = pipeline.transform(audioDF) ```
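The Wav2Vec2ForCTC stage inside this pipeline emits one token distribution per audio frame; the transcription is obtained by collapsing repeated frame predictions and dropping CTC blank tokens. A minimal, framework-free sketch of that greedy collapse (the token ids and toy vocabulary below are illustrative):

```python
def ctc_greedy_collapse(frame_ids, blank=0):
    # CTC decoding rule: merge consecutive duplicates, then remove blanks.
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

vocab = {1: "h", 2: "e", 3: "j"}          # toy vocabulary
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3]  # per-frame argmax token ids
print("".join(vocab[i] for i in ctc_greedy_collapse(frames)))  # hej
```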
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_xls_r_300m_nst_cv9|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|da|
|Size:|756.3 MB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: English image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house ViTForImageClassification from mayoughi
author: John Snow Labs
name: image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house` is an English model originally trained by mayoughi.

## Predicted Entities

`coffee house indoors`, `balcony`, `hospital`, `airport`, `hallway`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house_en_4.1.0_3.0_1660169311301.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house_en_4.1.0_3.0_1660169311301.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_where_am_I_hospital_balcony_hallway_airport_coffee_house| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_8_h_512 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-8_H-512` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_8_h_512_zh_4.2.4_3.0_1670021790876.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_8_h_512_zh_4.2.4_3.0_1670021790876.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_8_h_512","zh") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_8_h_512","zh")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_8_h_512| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|137.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-8_H-512 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Pipeline to Detect Clinical Entities (jsl_ner_wip_greedy_biobert) author: John Snow Labs name: jsl_ner_wip_greedy_biobert_pipeline date: 2023-03-20 tags: [ner, licensed, clinical, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [jsl_ner_wip_greedy_biobert](https://nlp.johnsnowlabs.com/2021/07/26/jsl_ner_wip_greedy_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_greedy_biobert_pipeline_en_4.3.0_3.2_1679310267372.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_greedy_biobert_pipeline_en_4.3.0_3.2_1679310267372.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("jsl_ner_wip_greedy_biobert_pipeline", "en", "clinical/models") text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("jsl_ner_wip_greedy_biobert_pipeline", "en", "clinical/models") val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. 
The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.greedy_wip_biobert.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
## Results

```bash
|    | ner_chunk                                      | begin | end | ner_label                    | confidence |
|---:|:-----------------------------------------------|------:|----:|:-----------------------------|-----------:|
|  0 | 21-day-old                                     |    17 |  26 | Age                          |          1 |
|  1 | Caucasian                                      |    28 |  36 | Race_Ethnicity               |     0.9488 |
|  2 | male                                           |    38 |  41 | Gender                       |     0.9978 |
|  3 | for 2 days                                     |    48 |  57 | Duration                     |     0.7709 |
|  4 | congestion                                     |    62 |  71 | Symptom                      |     0.5467 |
|  5 | mom                                            |    75 |  77 | Gender                       |     0.9355 |
|  6 | suctioning yellow discharge                    |    88 | 114 | Symptom                      |   0.327867 |
|  7 | nares                                          |   135 | 139 | External_body_part_or_region |     0.8963 |
|  8 | she                                            |   147 | 149 | Gender                       |      0.995 |
|  9 | mild problems with his breathing while feeding |   168 | 213 | Symptom                      |   0.588714 |
| 10 | perioral cyanosis                              |   237 | 253 | Symptom                      |    0.58635 |
| 11 | retractions                                    |   258 | 268 | Symptom                      |     0.9864 |
| 12 | One day ago                                    |   272 | 282 | RelativeDate                 |   0.755833 |
| 13 | mom                                            |   285 | 287 | Gender                       |     0.9956 |
| 14 | tactile temperature                            |   304 | 322 | Symptom                      |    0.10505 |
| 15 | Tylenol                                        |   345 | 351 | Drug                         |     0.9496 |
| 16 | Baby                                           |   354 | 357 | Age                          |      0.976 |
| 17 | decreased p.o. intake                          |   377 | 397 | Symptom                      |   0.448125 |
| 18 | His                                            |   400 | 402 | Gender                       |      0.999 |
| 19 | q.2h. to 5 to 10 minutes                       |   450 | 473 | Frequency                    |   0.298843 |
| 20 | his                                            |   488 | 490 | Gender                       |     0.9976 |
| 21 | respiratory congestion                         |   492 | 513 | VS_Finding                   |     0.6158 |
| 22 | He                                             |   516 | 517 | Gender                       |     0.9998 |
| 23 | tired                                          |   550 | 554 | Symptom                      |     0.8912 |
| 24 | fussy                                          |   569 | 573 | Symptom                      |     0.9541 |
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|jsl_ner_wip_greedy_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.8 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: English asr_wav2vec2_japanese_hiragana_vtuber TFWav2Vec2ForCTC from thunninoi author: John Snow Labs name: pipeline_asr_wav2vec2_japanese_hiragana_vtuber date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_japanese_hiragana_vtuber` is an English model originally trained by thunninoi.
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_japanese_hiragana_vtuber_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_japanese_hiragana_vtuber_en_4.2.0_3.0_1664095841326.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_japanese_hiragana_vtuber_en_4.2.0_3.0_1664095841326.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_japanese_hiragana_vtuber', lang = 'en')
annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_japanese_hiragana_vtuber", lang = "en")
val annotations = pipeline.transform(audioDF)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_japanese_hiragana_vtuber| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_4 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-256-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_4_en_4.0.0_3.0_1655732197028.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_4_en_4.0.0_3.0_1655732197028.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_4","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = RoBertaForQuestionAnswering
    .pretrained("roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_4","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.base_256d_seed_4").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|427.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-256-finetuned-squad-seed-4 --- layout: model title: English BertForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_6 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-256-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_6_en_4.0.0_3.0_1657192421420.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_6_en_4.0.0_3.0_1657192421420.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_6","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_6","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|383.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-256-finetuned-squad-seed-6 --- layout: model title: Legal Base salary Clause Binary Classifier author: John Snow Labs name: legclf_base_salary_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `base-salary` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
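As a plain-Python illustration of the paragraph-splitting preprocessing recommended above (this sketch is not part of Spark NLP or Legal NLP; the function name and regex are assumptions, and the real tutorial linked above shows the library's own techniques):

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into candidate clause paragraphs on blank lines.

    A rough stand-in for the "paragraph splitting (by multiline)" technique:
    each returned chunk can then become one row of the `clause_text` column
    fed to the classifier pipeline.
    """
    # Two or more consecutive newlines (possibly with spaces between) mark a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)
    # Drop empty fragments and surrounding whitespace.
    return [p.strip() for p in paragraphs if p.strip()]

contract = (
    "BASE SALARY. The Executive shall receive an annual base salary of $200,000.\n\n"
    "TERMINATION. Either party may terminate this Agreement with 30 days notice."
)
chunks = split_paragraphs(contract)
```

Each element of `chunks` is then short enough to carry a single clause's context into the classifier.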
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `base-salary` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_base_salary_clause_en_1.0.0_3.2_1660122165052.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_base_salary_clause_en_1.0.0_3.2_1660122165052.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("clause_text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = nlp.ClassifierDLModel.pretrained("legclf_base_salary_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    embeddings,
    docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text")

model = nlpPipeline.fit(df)

result = model.transform(df)
```
## Results

```bash
+-------------+
|       result|
+-------------+
|[base-salary]|
|      [other]|
|      [other]|
|[base-salary]|
+-------------+
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_base_salary_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking

```bash
       label  precision  recall  f1-score  support
 base-salary       0.97    1.00      0.98       30
       other       1.00    0.99      1.00      106
    accuracy          -       -      0.99      136
   macro-avg       0.98    1.00      0.99      136
weighted-avg       0.99    0.99      0.99      136
```

--- layout: model title: Legal Exercise of option Clause Binary Classifier author: John Snow Labs name: legclf_exercise_of_option_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `exercise-of-option` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `exercise-of-option` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_exercise_of_option_clause_en_1.0.0_3.2_1660122426759.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_exercise_of_option_clause_en_1.0.0_3.2_1660122426759.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("clause_text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = nlp.ClassifierDLModel.pretrained("legclf_exercise_of_option_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    embeddings,
    docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text")

model = nlpPipeline.fit(df)

result = model.transform(df)
```
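The description notes that the embeddings accept at most 512 tokens, so longer clauses can be pre-chunked before building the `clause_text` DataFrame. A minimal plain-Python sketch (whitespace splitting is only a rough proxy for the embedding model's own wordpiece tokenizer, and the function name and defaults are assumptions for illustration):

```python
def chunk_by_tokens(text: str, max_tokens: int = 512) -> list:
    """Greedily pack whitespace-separated tokens into chunks of at most max_tokens.

    Crude approximation: the real 512-token limit applies to the sentence
    embedding model's subword tokens, so leaving some headroom below 512
    is advisable in practice.
    """
    words = text.split()
    chunks = []
    for start in range(0, len(words), max_tokens):
        chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks

long_clause = " ".join(["salary"] * 1200)  # a 1200-word stand-in document
pieces = chunk_by_tokens(long_clause, max_tokens=512)
```

Each element of `pieces` can then be classified as its own row, and the per-chunk True/False outputs aggregated afterwards.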
## Results

```bash
+--------------------+
|              result|
+--------------------+
|[exercise-of-option]|
|             [other]|
|             [other]|
|[exercise-of-option]|
+--------------------+
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_exercise_of_option_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking

```bash
             label  precision  recall  f1-score  support
exercise-of-option       0.97    0.90      0.93       39
             other       0.96    0.99      0.97       91
          accuracy          -       -      0.96      130
         macro-avg       0.96    0.94      0.95      130
      weighted-avg       0.96    0.96      0.96      130
```

--- layout: model title: English BertForMaskedLM Cased model (from philschmid) author: John Snow Labs name: bert_embeddings_fin_pretrain_yiyanghkust date: 2022-12-02 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `finbert-pretrain-yiyanghkust` is an English model originally trained by `philschmid`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_fin_pretrain_yiyanghkust_en_4.2.4_3.0_1670022094241.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_fin_pretrain_yiyanghkust_en_4.2.4_3.0_1670022094241.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_fin_pretrain_yiyanghkust","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_fin_pretrain_yiyanghkust","en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_fin_pretrain_yiyanghkust| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|412.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/philschmid/finbert-pretrain-yiyanghkust - https://arxiv.org/abs/2006.08097 --- layout: model title: English BertForQuestionAnswering model (from Salesforce) author: John Snow Labs name: bert_qa_qaconv_bert_large_uncased_whole_word_masking_squad2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `qaconv-bert-large-uncased-whole-word-masking-squad2` is an English model originally trained by `Salesforce`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_qaconv_bert_large_uncased_whole_word_masking_squad2_en_4.0.0_3.0_1654189117017.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_qaconv_bert_large_uncased_whole_word_masking_squad2_en_4.0.0_3.0_1654189117017.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_qaconv_bert_large_uncased_whole_word_masking_squad2","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_qaconv_bert_large_uncased_whole_word_masking_squad2","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.bert.large_uncased.by_Salesforce").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_qaconv_bert_large_uncased_whole_word_masking_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Salesforce/qaconv-bert-large-uncased-whole-word-masking-squad2 --- layout: model title: Thai DistilBERT Embeddings author: John Snow Labs name: distilbert_embeddings_distilbert_base_th_cased date: 2022-04-12 tags: [distilbert, embeddings, th, open_source] task: Embeddings language: th edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-th-cased` is a Thai model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_th_cased_th_3.4.2_3.0_1649784017728.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_th_cased_th_3.4.2_3.0_1649784017728.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_th_cased","th") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])

data = spark.createDataFrame([["ฉันรัก Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_th_cased","th")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))

val data = Seq("ฉันรัก Spark NLP").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("th.embed.distilbert_base_cased").predict("""ฉันรัก Spark NLP""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_base_th_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|th| |Size:|185.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/distilbert-base-th-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Translate Efik to English Pipeline author: John Snow Labs name: translate_efi_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, efi, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `efi` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_efi_en_xx_2.7.0_2.4_1609699653371.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_efi_en_xx_2.7.0_2.4_1609699653371.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_efi_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_efi_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.efi.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_efi_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: RxNorm Scd ChunkResolver author: John Snow Labs name: chunkresolve_rxnorm_scd_clinical class: ChunkEntityResolverModel language: en nav_key: models repository: clinical/models date: 2020-07-27 task: Entity Resolution edition: Healthcare NLP 2.5.1 spark_version: 2.4 tags: [clinical,licensed,entity_resolution,en] deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Entity Resolution model based on KNN using Word Embeddings + Word Mover's Distance. ## Predicted Entities RxNorm Codes and their normalized definition with `clinical_embeddings`. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_scd_clinical_en_2.5.1_2.4_1595813884363.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_scd_clinical_en_2.5.1_2.4_1595813884363.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python ... rxnormResolver = ChunkEntityResolverModel.pretrained('chunkresolve_rxnorm_scd_clinical', 'en', "clinical/models")\ .setEnableLevenshtein(True)\ .setNeighbours(200)\ .setAlternatives(5)\ .setDistanceWeights([3,3,2,0,0,7])\ .setInputCols(['token', 'chunk_embs_drug'])\ .setOutputCol('rxnorm_resolution') pipeline_rxnorm = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, jslNer, drugNer, jslConverter, drugConverter, jslChunkEmbeddings, drugChunkEmbeddings, rxnormResolver]) model = pipeline_rxnorm.fit(spark.createDataFrame([['']]).toDF("text")) results = model.transform(data) ``` ```scala ... val rxnormResolver = ChunkEntityResolverModel.pretrained("chunkresolve_rxnorm_scd_clinical", "en", "clinical/models") .setEnableLevenshtein(true) .setNeighbours(200) .setAlternatives(5) .setDistanceWeights(Array(3, 3, 2, 0, 0, 7)) .setInputCols(Array("token", "chunk_embs_drug")) .setOutputCol("rxnorm_resolution") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, jslNer, drugNer, jslConverter, drugConverter, jslChunkEmbeddings, drugChunkEmbeddings, rxnormResolver)) val result = pipeline.fit(Seq("").toDF("text")).transform(data) ```
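The resolver reports chunk coordinates as `sentence::begin::end` strings (e.g. `3::278::287` in the sample results of this card). A small helper to unpack them into integers, assuming that `::`-separated layout holds in general:

```python
def parse_coords(coords: str):
    """Unpack a 'sentence::begin::end' coordinate string into integers.

    The '::' layout is inferred from the sample output shown in this
    card; it is an assumption, not an official Spark NLP contract.
    """
    sentence, begin, end = coords.split("::")
    return int(sentence), int(begin), int(end)

print(parse_coords("3::278::287"))  # (3, 278, 287)
```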
{:.h2_title} ## Results ```bash | coords | chunk | entity | rxnorm_opts | |--------------|-------------|-----------|-----------------------------------------------------------------------------------------| | 3::278::287 | creatinine | DrugChem | [(849628, Creatinine 800 MG Oral Capsule), (252180, Urea 10 MG/ML Topical Lotion), ...] | | 7::83::93 | cholesterol | DrugChem | [(2104173, beta Sitosterol 35 MG Oral Tablet), (832876, phytosterol esters 500 MG O...] | | 10::397::406 | creatinine | DrugChem | [(849628, Creatinine 800 MG Oral Capsule), (252180, Urea 10 MG/ML Topical Lotion), ...] | ``` {:.model-param} ## Model Information {:.table-model} |----------------|----------------------------------| | Name: | chunkresolve_rxnorm_scd_clinical | | Type: | ChunkEntityResolverModel | | Compatibility: | Spark NLP 2.5.1+ | | License: | Licensed | |Edition:|Official| |Input labels: | [token, chunk_embeddings] | |Output labels: | [entity] | | Language: | en | | Case sensitive: | True | | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained on December 2019 RxNorm Clinical Drugs (TTY=SCD) ontology graph with `embeddings_clinical` https://www.nlm.nih.gov/pubs/techbull/nd19/brief/nd19_rxnorm_december_2019_release.html --- layout: model title: English BertForQuestionAnswering model (from JAlexis) author: John Snow Labs name: bert_qa_PruebaBert date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `PruebaBert` is an English model originally trained by `JAlexis`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_PruebaBert_en_4.0.0_3.0_1654179023524.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_PruebaBert_en_4.0.0_3.0_1654179023524.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_PruebaBert","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_PruebaBert","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.cord19.prueba_bert.by_JAlexis").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
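The NLU one-liner above joins the question and the context into a single string with a `|||` separator. A hedged sketch of splitting such a combined input back into its two parts (the separator convention is taken from the example above, not from an official specification):

```python
def split_qa_input(combined: str):
    """Split a 'question|||context' string into (question, context)."""
    question, context = combined.split("|||", 1)
    return question.strip(), context.strip()

q, c = split_qa_input("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
print(c)  # My name is Clara and I live in Berkeley.
```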
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_PruebaBert| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/JAlexis/PruebaBert --- layout: model title: Pipeline to Detect Clinical Entities author: John Snow Labs name: ner_jsl_enriched_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_jsl_enriched](https://nlp.johnsnowlabs.com/2021/10/22/ner_jsl_enriched_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_pipeline_en_3.4.1_3.0_1647869433247.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_pipeline_en_3.4.1_3.0_1647869433247.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_jsl_enriched_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` ```scala val pipeline = new PretrainedPipeline("ner_jsl_enriched_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. 
His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl_enriched.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
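The results table below lists each recognized chunk next to its entity label. When post-processing such output it is often handy to group chunks under their labels; a pure-Python sketch, assuming two parallel lists of chunks and labels (the sample values are taken from the results of this card):

```python
from collections import defaultdict

def group_chunks_by_entity(chunks, labels):
    """Group recognized chunks under their entity labels.

    Assumes `chunks` and `labels` are parallel lists, as in the
    results table of this card.
    """
    grouped = defaultdict(list)
    for chunk, label in zip(chunks, labels):
        grouped[label].append(chunk)
    return dict(grouped)

print(group_chunks_by_entity(
    ["21-day-old", "Caucasian", "male", "mom"],
    ["Age", "Race_Ethnicity", "Gender", "Gender"]))
# {'Age': ['21-day-old'], 'Race_Ethnicity': ['Caucasian'], 'Gender': ['male', 'mom']}
```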
## Results ```bash | | chunk | begin | end | entity | |---:|:------------------------------------------|--------:|------:|:-----------------------------| | 0 | 21-day-old | 17 | 26 | Age | | 1 | Caucasian | 28 | 36 | Race_Ethnicity | | 2 | male | 38 | 41 | Gender | | 3 | 2 days | 52 | 57 | Duration | | 4 | congestion | 62 | 71 | Symptom | | 5 | mom | 75 | 77 | Gender | | 6 | suctioning yellow discharge | 88 | 114 | Symptom | | 7 | nares | 135 | 139 | External_body_part_or_region | | 8 | she | 147 | 149 | Gender | | 9 | mild | 168 | 171 | Modifier | | 10 | problems with his breathing while feeding | 173 | 213 | Symptom | | 11 | perioral cyanosis | 237 | 253 | Symptom | | 12 | retractions | 258 | 268 | Symptom | | 13 | One day ago | 272 | 282 | RelativeDate | | 14 | mom | 285 | 287 | Gender | | 15 | tactile temperature | 304 | 322 | Symptom | | 16 | Tylenol | 345 | 351 | Drug_BrandName | | 17 | Baby | 354 | 357 | Age | | 18 | decreased p.o. intake | 377 | 397 | Symptom | | 19 | His | 400 | 402 | Gender | | 20 | q.2h | 450 | 453 | Frequency | | 21 | 5 to 10 minutes | 459 | 473 | Duration | | 22 | his | 488 | 490 | Gender | | 23 | respiratory congestion | 492 | 513 | Symptom | | 24 | He | 516 | 517 | Gender | | 25 | tired | 550 | 554 | Symptom | | 26 | fussy | 569 | 573 | Symptom | | 27 | over the past 2 days | 575 | 594 | RelativeDate | | 28 | albuterol | 637 | 645 | Drug_Ingredient | | 29 | ER | 671 | 672 | Clinical_Dept | | 30 | His | 675 | 677 | Gender | | 31 | urine output has also decreased | 679 | 709 | Symptom | | 32 | he | 721 | 722 | Gender | | 33 | per 24 hours | 760 | 771 | Frequency | | 34 | he | 778 | 779 | Gender | | 35 | per 24 hours | 807 | 818 | Frequency | | 36 | Mom | 821 | 823 | Gender | | 37 | diarrhea | 836 | 843 | Symptom | | 38 | His | 846 | 848 | Gender | | 39 | bowel | 850 | 854 | Internal_organ_or_component | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_enriched_pipeline| |Type:|pipeline| 
|Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English image_classifier_vit_string_instrument_detector ViTForImageClassification from rexoscare author: John Snow Labs name: image_classifier_vit_string_instrument_detector date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_string_instrument_detector` is a English model originally trained by rexoscare. ## Predicted Entities `Banjo`, `Guitar`, `Mandolin`, `Ukulele` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_string_instrument_detector_en_4.1.0_3.0_1660169802968.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_string_instrument_detector_en_4.1.0_3.0_1660169802968.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_string_instrument_detector", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_string_instrument_detector", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_string_instrument_detector| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011) author: John Snow Labs name: distilbert_token_classifier_autotrain_job_all_903929564 date: 2023-03-06 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-job_all-903929564` is a English model originally trained by `ismail-lucifer011`. ## Predicted Entities `Job`, `OOV` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_job_all_903929564_en_4.3.1_3.0_1678134005364.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_job_all_903929564_en_4.3.1_3.0_1678134005364.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_job_all_903929564","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_job_all_903929564","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_job_all_903929564| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ismail-lucifer011/autotrain-job_all-903929564 --- layout: model title: Translate English to Fijian Pipeline author: John Snow Labs name: translate_en_fj date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, fj, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `fj` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_fj_xx_2.7.0_2.4_1609688726797.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_fj_xx_2.7.0_2.4_1609688726797.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_fj", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_fj", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.fj').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_fj| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Illegality Clause Binary Classifier author: John Snow Labs name: legclf_illegality_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `illegality` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. 
## Predicted Entities `other`, `illegality` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_illegality_clause_en_1.0.0_3.2_1660123594068.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_illegality_clause_en_1.0.0_3.2_1660123594068.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_illegality_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
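The description above recommends splitting long documents (e.g. by paragraph) before classification, since the sentence embeddings accept up to 512 tokens. A minimal pure-Python sketch of such a splitter, using whitespace-separated words as a rough proxy for the model's actual tokenizer (so the 512 figure is only approximate):

```python
def split_paragraphs(text, max_tokens=512):
    """Greedily pack blank-line-separated paragraphs into chunks of at
    most max_tokens whitespace tokens. A single paragraph longer than
    the limit is kept whole rather than cut mid-paragraph."""
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        n = len(para.split())
        if current and count + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

print(split_paragraphs("a b c\n\nd e\n\nf", max_tokens=3))  # ['a b c', 'd e\n\nf']
```

Each returned chunk can then be fed to the classifier as a separate `clause_text` row.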
## Results ```bash +------------+ |      result| +------------+ |[illegality]| |     [other]| |     [other]| |[illegality]| +------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_illegality_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support illegality 0.95 0.95 0.95 22 other 0.98 0.98 0.98 56 accuracy - - 0.97 78 macro-avg 0.97 0.97 0.97 78 weighted-avg 0.97 0.97 0.97 78 ``` --- layout: model title: Spanish Part of Speech Tagger (Large) author: John Snow Labs name: roberta_pos_roberta_large_bne_capitel_pos date: 2022-05-03 tags: [roberta, pos, part_of_speech, es, open_source] task: Part of Speech Tagging language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-large-bne-capitel-pos` is a Spanish model originally trained by `PlanTL-GOB-ES`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_pos_roberta_large_bne_capitel_pos_es_3.4.2_3.0_1651595953421.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_pos_roberta_large_bne_capitel_pos_es_3.4.2_3.0_1651595953421.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_pos_roberta_large_bne_capitel_pos","es") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_pos_roberta_large_bne_capitel_pos","es") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.pos.roberta_large_bne_capitel_pos").predict("""Amo Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_pos_roberta_large_bne_capitel_pos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|es| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne-capitel-pos - https://arxiv.org/abs/1907.11692 - http://www.bne.es/en/Inicio/index.html - https://sites.google.com/view/capitel2020 - https://github.com/PlanTL-GOB-ES/lm-spanish - https://arxiv.org/abs/2107.07253 --- layout: model title: Marathi ALBERT Embeddings (v2) author: John Snow Labs name: albert_embeddings_marathi_albert_v2 date: 2022-04-14 tags: [albert, embeddings, mr, open_source] task: Embeddings language: mr edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: AlBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `marathi-albert-v2` is a Marathi model originally trained by `l3cube-pune`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_marathi_albert_v2_mr_3.4.2_3.0_1649954249563.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_marathi_albert_v2_mr_3.4.2_3.0_1649954249563.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = AlbertEmbeddings.pretrained("albert_embeddings_marathi_albert_v2","mr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["मला स्पार्क एनएलपी आवडते"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_marathi_albert_v2","mr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("मला स्पार्क एनएलपी आवडते").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("mr.embed.albert_v2").predict("""मला स्पार्क एनएलपी आवडते""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_embeddings_marathi_albert_v2| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|mr| |Size:|128.3 MB| |Case sensitive:|false| ## References - https://huggingface.co/l3cube-pune/marathi-albert-v2 - https://github.com/l3cube-pune/MarathiNLP - https://arxiv.org/abs/2202.01159 --- layout: model title: English asr_wav2vec2_large_xlsr_53_arabic_egyptian_by_arbml TFWav2Vec2ForCTC from arbml author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_arabic_egyptian_by_arbml date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_arabic_egyptian_by_arbml` is an English model originally trained by arbml. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_arabic_egyptian_by_arbml_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_arabic_egyptian_by_arbml_en_4.2.0_3.0_1664096871274.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_arabic_egyptian_by_arbml_en_4.2.0_3.0_1664096871274.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_arabic_egyptian_by_arbml', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_arabic_egyptian_by_arbml", lang = "en") val annotations = pipeline.transform(audioDF) ```
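The pipeline's input DataFrame carries raw audio as an array of floats (Wav2Vec2 models expect 16 kHz mono PCM). The Spark side is omitted here; a hedged stdlib-only sketch of the decoding step that turns a 16-bit WAV into such a normalized float array — the synthesized in-memory tone stands in for a real audio file, which would be opened the same way with `wave.open(path)`:

```python
import io
import math
import struct
import wave

# Synthesize a short 16 kHz mono WAV in memory (a 440 Hz tone).
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)      # 16-bit PCM
    w.setframerate(16000)
    samples = [int(32767 * 0.5 * math.sin(2 * math.pi * 440 * t / 16000))
               for t in range(1600)]  # 0.1 s of audio
    w.writeframes(struct.pack("<%dh" % len(samples), *samples))

# Decode it back into floats in [-1.0, 1.0] — the kind of array an
# "audio_content" column would carry.
buf.seek(0)
with wave.open(buf, "rb") as w:
    raw = w.readframes(w.getnframes())
    floats = [s / 32768.0 for s in struct.unpack("<%dh" % w.getnframes(), raw)]

print(len(floats))  # → 1600
```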
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_arabic_egyptian_by_arbml| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForMaskedLM Large Uncased model author: John Snow Labs name: bert_embeddings_large_uncased_whole_word_masking date: 2022-12-02 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-whole-word-masking` is an English model originally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_uncased_whole_word_masking_en_4.2.4_3.0_1670020551279.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_uncased_whole_word_masking_en_4.2.4_3.0_1670020551279.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_uncased_whole_word_masking","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_uncased_whole_word_masking","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_large_uncased_whole_word_masking| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| ## References - https://huggingface.co/bert-large-uncased-whole-word-masking - https://arxiv.org/abs/1810.04805 - https://github.com/google-research/bert - https://yknzhu.wixsite.com/mbweb - https://en.wikipedia.org/wiki/English_Wikipedia --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_10 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-512-finetuned-squad-seed-10` is a English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_10_en_4.0.0_3.0_1655732897981.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_10_en_4.0.0_3.0_1655732897981.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_10","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_512d_seed_10").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|432.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-512-finetuned-squad-seed-10 --- layout: model title: Arabic Part of Speech Tagger (Modern Standard Arabic-MSA POS, Modern Standard Arabic-MSA, Dialectal Arabic-DA, and Classical Arabic-CA) author: John Snow Labs name: bert_pos_bert_base_arabic_camelbert_mix_pos_msa date: 2022-04-26 tags: [bert, pos, part_of_speech, ar, open_source] task: Part of Speech Tagging language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-mix-pos-msa` is an Arabic model originally trained by `CAMeL-Lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_mix_pos_msa_ar_3.4.2_3.0_1650993325530.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_mix_pos_msa_ar_3.4.2_3.0_1650993325530.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_mix_pos_msa","ar") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_mix_pos_msa","ar") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("أنا أحب الشرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.pos.arabic_camelbert_mix_pos_msa").predict("""أنا أحب الشرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_arabic_camelbert_mix_pos_msa| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ar| |Size:|407.4 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-mix-pos-msa - https://dl.acm.org/doi/pdf/10.5555/1621804.1621808 - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://github.com/CAMeL-Lab/camel_tools --- layout: model title: English ElectraForQuestionAnswering model (from rowan1224) Squad author: John Snow Labs name: electra_qa_squad_slp date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-squad-slp` is an English model originally trained by `rowan1224`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_squad_slp_en_4.0.0_3.0_1655921330994.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_squad_slp_en_4.0.0_3.0_1655921330994.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_squad_slp","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_squad_slp","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.electra").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_squad_slp| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|408.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/rowan1224/electra-squad-slp --- layout: model title: English image_classifier_vit_oz_fauna ViTForImageClassification from lewtun author: John Snow Labs name: image_classifier_vit_oz_fauna date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_oz_fauna` is an English model originally trained by lewtun. ## Predicted Entities `koala`, `kookaburra`, `dingo`, `possum`, `tasmanian devil` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_oz_fauna_en_4.1.0_3.0_1660170193715.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_oz_fauna_en_4.1.0_3.0_1660170193715.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_oz_fauna", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_oz_fauna", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
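Under the hood, an image classifier's `class` output is simply the label whose score is highest. A toy pure-Python sketch of that final step over the five predicted entities — the logit values below are hypothetical, for illustration only:

```python
import math

# The classifier's "class" output is the highest-scoring label.
# Hypothetical logits over the model's five labels.
labels = ["koala", "kookaburra", "dingo", "possum", "tasmanian devil"]
logits = [2.1, 0.3, -0.5, 1.2, 0.0]

# Numerically stable softmax turns logits into probabilities.
m = max(logits)
exps = [math.exp(x - m) for x in logits]
probs = [e / sum(exps) for e in exps]

print(labels[probs.index(max(probs))])  # → koala
```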
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_oz_fauna| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English BertForQuestionAnswering model (from manishiitg) author: John Snow Labs name: bert_qa_spanbert_large_recruit_qa date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-large-recruit-qa` is an English model originally trained by `manishiitg`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_large_recruit_qa_en_4.0.0_3.0_1654191827632.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_large_recruit_qa_en_4.0.0_3.0_1654191827632.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_large_recruit_qa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_large_recruit_qa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.span_bert.large").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
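Extractive QA models like this one answer by selecting the context span whose start and end positions jointly score highest. A minimal sketch of that selection with hypothetical per-word logits (real models score subword tokens, not whole words, but the argmax logic is the same):

```python
# Hypothetical per-token logits for the example question/context above.
context = "My name is Clara and I live in Berkeley .".split()
start_logits = [0.1, 0.0, 0.2, 3.1, 0.0, 0.1, 0.0, 0.0, 0.2, 0.0]
end_logits   = [0.0, 0.1, 0.0, 2.8, 0.1, 0.0, 0.0, 0.0, 0.3, 0.0]

# Pick the (start, end) pair with the highest combined score,
# constrained so the end never precedes the start.
best, span = float("-inf"), (0, 0)
for i, s in enumerate(start_logits):
    for j in range(i, len(end_logits)):
        if s + end_logits[j] > best:
            best, span = s + end_logits[j], (i, j)

print(" ".join(context[span[0]:span[1] + 1]))  # → Clara
```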
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_large_recruit_qa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/manishiitg/spanbert-large-recruit-qa --- layout: model title: Public Health Mention Classifier (PHS-BERT) author: John Snow Labs name: classifierdl_health_mentions date: 2022-07-25 tags: [public_health, health, mention, en, licensed, classification] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [PHS-BERT](https://arxiv.org/abs/2204.04521)-based classifier that can classify public health mentions in social media text. Mentions are classified into three labels: personal health mentions, figurative mentions, and other mentions. More detailed information about the classes is as follows: `health_mention`: The text contains a health mention that specifically indicates someone's health situation, i.e. someone has a certain disease or symptoms, including death. e.g.; *My PCR test is positive. I have severe joint pain, muscle pain and a headache right now.* `other_mention`: The text contains a health mention but does not state a specific person's situation: general health mentions such as informative mentions, discussion about a disease, etc. e.g.; *Aluminum is a light metal that causes dementia and Alzheimer's disease.* `figurative_mention`: The text mentions a specific disease or symptom, but it is used metaphorically and does not contain health-related information. e.g.; *I don't wanna fall in love.
If I ever did that, I think I'd have a heart attack.* ## Predicted Entities `figurative_mention`, `other_mention`, `health_mention` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifierdl_health_mentions_en_4.0.0_3.0_1658759311177.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifierdl_health_mentions_en_4.0.0_3.0_1658759311177.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") bert_embeddings = BertEmbeddings.pretrained("bert_embeddings_phs_bert", "en", "public/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") embeddingsSentence = SentenceEmbeddings() \ .setInputCols(["sentence", "embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") classifierdl = ClassifierDLModel.pretrained('classifierdl_health_mentions', 'en', 'clinical/models')\ .setInputCols(['sentence', 'token', 'sentence_embeddings'])\ .setOutputCol('class') clf_pipeline = Pipeline( stages = [ document_assembler, tokenizer, bert_embeddings, embeddingsSentence, classifierdl ]) data = spark.createDataFrame([["I feel a bit drowsy & have a little blurred vision after taking an insulin."]]).toDF("text") result = clf_pipeline.fit(data).transform(data) result.select("text", "class.result").show(truncate=False) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_phs_bert", "en", "public/models") .setInputCols(Array("sentence", "token")) .setOutputCol("word_embeddings") val sentence_embeddings = new SentenceEmbeddings() .setInputCols(Array("sentence", "word_embeddings")) .setOutputCol("sentence_embeddings") .setPoolingStrategy("AVERAGE") val classifier = ClassifierDLModel.pretrained("classifierdl_health_mentions", "en", "clinical/models") .setInputCols(Array("sentence", "token", "sentence_embeddings")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, embeddings, sentence_embeddings, classifier)) val data = Seq("I feel a bit drowsy & have a little blurred vision after taking an insulin.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.health").predict("""I feel a bit drowsy & have a little blurred vision after taking an insulin.""") ```
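The `setPoolingStrategy("AVERAGE")` step above reduces the per-token embedding vectors to a single sentence vector by element-wise mean. A toy sketch of exactly that reduction — the 3-dimensional vectors stand in for PHS-BERT's real embedding dimensions:

```python
# Per-token vectors, one per token of a short sentence.
token_vectors = [
    [0.2, 0.4, 0.0],   # "I"
    [0.6, 0.0, 0.3],   # "feel"
    [0.1, 0.2, 0.9],   # "drowsy"
]

# AVERAGE pooling: element-wise mean across tokens.
dims = len(token_vectors[0])
sentence_vector = [
    sum(vec[d] for vec in token_vectors) / len(token_vectors)
    for d in range(dims)
]
print([round(x, 3) for x in sentence_vector])  # → [0.3, 0.2, 0.4]
```

The resulting `sentence_embeddings` vector is what the ClassifierDL stage consumes.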
## Results ```bash +---------------------------------------------------------------------------+----------------+ |text |class | +---------------------------------------------------------------------------+----------------+ |I feel a bit drowsy & have a little blurred vision after taking an insulin.|[health_mention]| +---------------------------------------------------------------------------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_health_mentions| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|24.1 MB| ## References Curated from several academic and in-house datasets. ## Benchmarking ```bash precision recall f1-score support health_mention 0.77 0.83 0.80 1375 other_mention 0.84 0.81 0.83 2102 figurative_mention 0.79 0.78 0.79 1412 accuracy - - 0.81 4889 macro-avg 0.80 0.81 0.80 4889 weighted-avg 0.81 0.81 0.81 4889 ``` --- layout: model title: Entity Recognizer LG author: John Snow Labs name: entity_recognizer_lg date: 2022-06-25 tags: [fi, open_source] task: Named Entity Recognition language: fi edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_lg is a pretrained pipeline that performs basic processing steps and recognizes entities. It covers most of the common text processing tasks on your dataframe. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_fi_4.0.0_3.0_1656125094034.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_fi_4.0.0_3.0_1656125094034.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("entity_recognizer_lg", "fi") result = pipeline.annotate("""I love johnsnowlabs! """) ``` {:.nlu-block} ```python import nlu nlu.load("fi.ner.lg").predict("""I love johnsnowlabs! """) ```
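`PretrainedPipeline.annotate()` returns a plain dict mapping each output column to a list of strings, which makes post-processing ordinary Python. A sketch of pairing NER tags with their tokens — the hard-coded dict below is an illustrative assumption standing in for real pipeline output, not an actual run:

```python
# Hypothetical annotate() result for a short input (illustrative only;
# column names and tags are assumptions about this pipeline's output).
annotations = {
    "token": ["I", "love", "johnsnowlabs", "!"],
    "ner": ["O", "O", "B-ORG", "O"],
    "entities": ["johnsnowlabs"],
}

# Collect (token, tag) pairs for everything the NER stage labeled.
labeled = [(tok, tag)
           for tok, tag in zip(annotations["token"], annotations["ner"])
           if tag != "O"]
print(labeled)  # → [('johnsnowlabs', 'B-ORG')]
```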
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_lg| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| |Size:|2.5 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - NerDLModel - NerConverter --- layout: model title: English image_classifier_vit_Test_Model ViTForImageClassification from Nonem100 author: John Snow Labs name: image_classifier_vit_Test_Model date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_Test_Model` is an English model originally trained by Nonem100. ## Predicted Entities `nachos`, `popcorn`, `cotton candy`, `hot dog`, `hamburger` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Test_Model_en_4.1.0_3.0_1660169509995.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Test_Model_en_4.1.0_3.0_1660169509995.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_Test_Model", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_Test_Model", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_Test_Model| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: ICD10CM Musculoskeletal Entity Resolver author: John Snow Labs name: chunkresolve_icd10cm_musculoskeletal_clinical date: 2021-04-02 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Entity resolution model based on KNN using word embeddings and Word Mover's Distance. ## Predicted Entities ICD10-CM codes and their normalized definitions with `clinical_embeddings`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_musculoskeletal_clinical_en_3.0.0_3.0_1617355429847.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_musculoskeletal_clinical_en_3.0.0_3.0_1617355429847.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... muscu_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_musculoskeletal_clinical","en","clinical/models")\ .setInputCols("token","chunk_embeddings")\ .setOutputCol("resolution") pipeline_puerile = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, muscu_resolver]) data = spark.createDataFrame([["""The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion."""]]).toDF("text") model = pipeline_puerile.fit(data) results = model.transform(data) ``` ```scala ... val muscu_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_musculoskeletal_clinical","en","clinical/models") .setInputCols(Array("token","chunk_embeddings")) .setOutputCol("resolution") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, muscu_resolver)) val data = Seq("The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion.").toDF("text") val result = pipeline.fit(data).transform(data) ```
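The resolver works by matching each extracted chunk's embedding against embeddings of ICD-10-CM code descriptions and returning the nearest neighbor. A minimal 1-NN sketch of that matching, using cosine distance as a simple stand-in for Word Mover's Distance and hypothetical toy vectors for two codes:

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity; smaller means more alike.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

# Hypothetical embeddings of ICD-10-CM description texts (toy values).
catalog = {
    "M541": [0.9, 0.1, 0.0],   # radiculopathy
    "M545": [0.1, 0.9, 0.2],   # low back pain
}

chunk_vec = [0.2, 0.8, 0.1]    # embedding of the extracted chunk
code = min(catalog, key=lambda c: cosine_distance(chunk_vec, catalog[c]))
print(code)  # → M545
```

The real model does the same nearest-neighbor lookup over the full ICD-10-CM musculoskeletal vocabulary.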
## Results ```bash chunk entity icd10_muscu_description icd10_muscu_code 0 a cold, cough PROBLEM Postprocedural hemorrhage of a musculoskeletal... M96831 1 runny nose PROBLEM Acquired deformity of nose M950 2 fever PROBLEM Periodic fever syndromes M041 3 difficulty breathing PROBLEM Other dentofacial functional abnormalities M2659 4 her cough PROBLEM Cervicalgia M542 5 physical exam TEST Pathological fracture, unspecified toe(s), seq... M84479S 6 fairly congested PROBLEM Synovial hypertrophy, not elsewhere classified... M67262 7 Amoxil TREATMENT Torticollis M436 8 Aldex TREATMENT Other soft tissue disorders related to use, ov... M7088 9 difficulty breathing PROBLEM Other dentofacial functional abnormalities M2659 10 more congested PROBLEM Pain in unspecified ankle and joints of unspec... M25579 11 trouble sleeping PROBLEM Low back pain M545 12 congestion PROBLEM Progressive systemic sclerosis M340 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|chunkresolve_icd10cm_musculoskeletal_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token, chunk_embeddings]| |Output Labels:|[icd10cm]| |Language:|en| --- layout: model title: Sentence Detection in Somali Text author: John Snow Labs name: sentence_detector_dl date: 2021-08-30 tags: [open_source, sentence_detection, so] task: Sentence Detection language: so edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_so_3.2.0_3.0_1630321968392.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_so_3.2.0_3.0_1630321968392.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "so") \ .setInputCols(["document"]) \ .setOutputCol("sentences") sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL])) sd_model.fullAnnotate("""Raadinta il weyn oo ka mid ah cutubyada akhriska Ingiriisiga? Waxaad timid meeshii saxda ahayd Sida laga soo xigtay daraasad dhowaan la sameeyay, caadadii wax -akhriska ee dhallinyarada maanta ayaa si degdeg ah hoos ugu dhacaysa. Waxay diiradda saari karin cutubka akhriska Ingiriisiga ee la siiyay wax ka badan dhowr ilbiriqsi! Sidoo kale, akhrintu waxay ahayd qayb muhiim ah oo ka mid ah dhammaan imtixaannada tartanka. Haddaba, sidee u hagaajin kartaa xirfadahaaga akhriska? Jawaabta su'aashan dhab ahaantii waa su'aal kale: Waa maxay isticmaalka xirfadaha akhriska? Ujeeddada ugu weyn ee wax -akhrisku waa 'macno samayn'.""") ``` ```scala val documenter = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "so") .setInputCols(Array("document")) .setOutputCol("sentence") val pipeline = new Pipeline().setStages(Array(documenter, model)) val data = Seq("Raadinta il weyn oo ka mid ah cutubyada akhriska Ingiriisiga? Waxaad timid meeshii saxda ahayd Sida laga soo xigtay daraasad dhowaan la sameeyay, caadadii wax -akhriska ee dhallinyarada maanta ayaa si degdeg ah hoos ugu dhacaysa. Waxay diiradda saari karin cutubka akhriska Ingiriisiga ee la siiyay wax ka badan dhowr ilbiriqsi! Sidoo kale, akhrintu waxay ahayd qayb muhiim ah oo ka mid ah dhammaan imtixaannada tartanka. Haddaba, sidee u hagaajin kartaa xirfadahaaga akhriska? Jawaabta su'aashan dhab ahaantii waa su'aal kale: Waa maxay isticmaalka xirfadaha akhriska? 
Ujeeddada ugu weyn ee wax -akhrisku waa 'macno samayn'.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python nlu.load('so.sentence_detector').predict("Raadinta il weyn oo ka mid ah cutubyada akhriska Ingiriisiga? Waxaad timid meeshii saxda ahayd Sida laga soo xigtay daraasad dhowaan la sameeyay, caadadii wax -akhriska ee dhallinyarada maanta ayaa si degdeg ah hoos ugu dhacaysa. Waxay diiradda saari karin cutubka akhriska Ingiriisiga ee la siiyay wax ka badan dhowr ilbiriqsi! Sidoo kale, akhrintu waxay ahayd qayb muhiim ah oo ka mid ah dhammaan imtixaannada tartanka. Haddaba, sidee u hagaajin kartaa xirfadahaaga akhriska? Jawaabta su'aashan dhab ahaantii waa su'aal kale: Waa maxay isticmaalka xirfadaha akhriska? Ujeeddada ugu weyn ee wax -akhrisku waa 'macno samayn'.", output_level ='sentence') ```
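For a rough intuition of the task, the learned detector can be contrasted with a naive rule-based splitter. The sketch below (plain Python, not part of Spark NLP) breaks after terminal punctuation; SentenceDetectorDL instead learns boundaries from data, which is what lets it handle abbreviations and irregular punctuation that defeat such hand-written rules.

```python
import re

# Naive rule-based sentence splitting: break after ., ! or ? followed by whitespace.
# SentenceDetectorDL replaces this heuristic with a learned boundary classifier.
text = ("Haddaba, sidee u hagaajin kartaa xirfadahaaga akhriska? "
        "Jawaabta su'aashan dhab ahaantii waa su'aal kale.")
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
print(sentences)
```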
## Results ```bash +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |result | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |[Raadinta il weyn oo ka mid ah cutubyada akhriska Ingiriisiga?] | |[Waxaad timid meeshii saxda ahayd Sida laga soo xigtay daraasad dhowaan la sameeyay, caadadii wax -akhriska ee dhallinyarada maanta ayaa si degdeg ah hoos ugu dhacaysa.]| |[Waxay diiradda saari karin cutubka akhriska Ingiriisiga ee la siiyay wax ka badan dhowr ilbiriqsi!] | |[Sidoo kale, akhrintu waxay ahayd qayb muhiim ah oo ka mid ah dhammaan imtixaannada tartanka.] | |[Haddaba, sidee u hagaajin kartaa xirfadahaaga akhriska?] | |[Jawaabta su'aashan dhab ahaantii waa su'aal kale:] | |[Waa maxay isticmaalka xirfadaha akhriska?] | |[Ujeeddada ugu weyn ee wax -akhrisku waa 'macno samayn'.] 
| +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentence_detector_dl| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[sentences]| |Language:|so| ## Benchmarking ```bash label Accuracy Recall Prec F1 0 0.98 1.00 0.96 0.98 ``` --- layout: model title: Pipeline to Detect Drug Information author: John Snow Labs name: ner_posology_large_biobert_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, drug, en] task: [Named Entity Recognition, Pipeline Healthcare] language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_posology_large_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_posology_large_biobert_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_POSOLOGY.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_large_biobert_pipeline_en_3.4.1_3.0_1647873520729.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_large_biobert_pipeline_en_3.4.1_3.0_1647873520729.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_posology_large_biobert_pipeline", "en", "clinical/models") pipeline.fullAnnotate('The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. 
daily, and Bactrim DS b.i.d.') ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_posology_large_biobert_pipeline", "en", "clinical/models") pipeline.fullAnnotate("The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. 
daily, and Bactrim DS b.i.d.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology_biobert_large.pipeline").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location to her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""") ```
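The chunk/entity table in the Results section is obtained by pairing each `ner_chunk` annotation's text with the entity label stored in its metadata. Below is a standalone sketch of that post-processing step, using a minimal stand-in `Annotation` class and two mocked chunks (the real objects come from `fullAnnotate`):

```python
# Stand-in for Spark NLP's Annotation: carries the chunk text and its metadata.
class Annotation:
    def __init__(self, result, metadata):
        self.result = result
        self.metadata = metadata

# Two mocked ner_chunk annotations, mirroring the pipeline output for this text.
ner_chunks = [
    Annotation("Bactrim", {"entity": "DRUG"}),
    Annotation("for 14 days", {"entity": "DURATION"}),
]

# Pair each chunk with its predicted entity label.
rows = [(a.result, a.metadata["entity"]) for a in ner_chunks]
for chunk, entity in rows:
    print(f"|{chunk:<14}|{entity:<9}|")
```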
## Results ```bash +--------------+---------+ |chunks |entities | +--------------+---------+ |Bactrim |DRUG | |for 14 days |DURATION | |Fragmin |DRUG | |5000 units |STRENGTH | |subcutaneously|ROUTE | |daily |FREQUENCY| |Xenaderm |DRUG | |topically |ROUTE | |b.i.d |FREQUENCY| |Lantus |DRUG | |40 units |STRENGTH | |subcutaneously|ROUTE | |at bedtime |FREQUENCY| |OxyContin |DRUG | |30 mg |STRENGTH | |p.o |ROUTE | |folic acid |DRUG | |1 mg |STRENGTH | |daily |FREQUENCY| |levothyroxine |DRUG | +--------------+---------+ only showing top 20 rows ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_large_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|421.9 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter --- layout: model title: Chinese T5ForConditionalGeneration Base Cased model (from Langboat) author: John Snow Labs name: t5_mengzi_base date: 2023-01-30 tags: [zh, open_source, t5, tensorflow] task: Text Generation language: zh edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mengzi-t5-base` is a Chinese model originally trained by `Langboat`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_mengzi_base_zh_4.3.0_3.0_1675105125486.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_mengzi_base_zh_4.3.0_3.0_1675105125486.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_mengzi_base","zh") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_mengzi_base","zh")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_mengzi_base| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|zh| |Size:|1.0 GB| ## References - https://huggingface.co/Langboat/mengzi-t5-base - https://arxiv.org/abs/2110.06696 --- layout: model title: Detect Entities Related to Cancer Therapies author: John Snow Labs name: ner_oncology_therapy date: 2022-10-25 tags: [licensed, clinical, oncology, en, ner, treatment] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Healthcare 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts entities related to oncology therapies using granular labels, including mentions of treatments, posology information and line of therapy. Definitions of Predicted Entities: - `Cancer_Surgery`: Terms that indicate surgery as a form of cancer treatment. - `Chemotherapy`: Mentions of chemotherapy drugs, or unspecific words such as "chemotherapy". - `Cycle_Count`: The total number of cycles being administered of an oncological therapy (e.g. "5 cycles"). - `Cycle_Day`: References to the day of the cycle of oncological therapy (e.g. "day 5"). - `Cycle_Number`: The number of the cycle of an oncological therapy that is being applied (e.g. "third cycle"). - `Dosage`: The quantity prescribed by the physician for an active ingredient. - `Duration`: Words indicating the duration of a treatment (e.g. "for 2 weeks"). - `Frequency`: Words indicating the frequency of treatment administration (e.g. "daily" or "bid"). - `Hormonal_Therapy`: Mentions of hormonal drugs used to treat cancer, or unspecific words such as "hormonal therapy". - `Immunotherapy`: Mentions of immunotherapy drugs, or unspecific words such as "immunotherapy". 
- `Line_Of_Therapy`: Explicit references to the line of therapy of an oncological therapy (e.g. "first-line treatment"). - `Radiotherapy`: Terms that indicate the use of Radiotherapy. - `Radiation_Dose`: Dose used in radiotherapy. - `Response_To_Treatment`: Terms describing the patient's clinical progress in relation to cancer treatment, including "recurrence", "bad response" or "improvement". - `Route`: Words indicating the type of administration route (such as "PO" or "transdermal"). - `Targeted_Therapy`: Mentions of targeted therapy drugs, or unspecific words such as "targeted therapy". - `Unspecific_Therapy`: Terms that indicate a known cancer therapy but are not specific to any other therapy entity (e.g. "chemoradiotherapy" or "adjuvant therapy"). ## Predicted Entities `Cancer_Surgery`, `Chemotherapy`, `Cycle_Count`, `Cycle_Day`, `Cycle_Number`, `Dosage`, `Duration`, `Frequency`, `Hormonal_Therapy`, `Immunotherapy`, `Line_Of_Therapy`, `Radiotherapy`, `Radiation_Dose`, `Response_To_Treatment`, `Route`, `Targeted_Therapy`, `Unspecific_Therapy` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_therapy_en_4.0.0_3.0_1666718855759.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_therapy_en_4.0.0_3.0_1666718855759.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_therapy", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast. The cancer recurred as a right lung metastasis 13 years later. 
The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_therapy", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_therapy").predict("""The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast. 
The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.""") ```
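One common way to consume these granular labels is to group the extracted chunks by entity type, e.g. to collect all chemotherapy drugs and dosages mentioned in a note. A standalone sketch over (chunk, label) pairs like those shown in the Results section:

```python
from collections import defaultdict

# (chunk, label) pairs as produced by the NER converter for the example text.
predictions = [
    ("adriamycin", "Chemotherapy"),
    ("60 mg/m2", "Dosage"),
    ("cyclophosphamide", "Chemotherapy"),
    ("600 mg/m2", "Dosage"),
    ("first line", "Line_Of_Therapy"),
]

# Group chunk texts under their entity label.
by_label = defaultdict(list)
for chunk, label in predictions:
    by_label[label].append(chunk)

print(dict(by_label))
```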
## Results ```bash | chunk | ner_label | |:-------------------------------|:----------------------| | mastectomy | Cancer_Surgery | | axillary lymph node dissection | Cancer_Surgery | | radiotherapy | Radiotherapy | | recurred | Response_To_Treatment | | adriamycin | Chemotherapy | | 60 mg/m2 | Dosage | | cyclophosphamide | Chemotherapy | | 600 mg/m2 | Dosage | | six courses | Cycle_Count | | first line | Line_Of_Therapy | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_therapy| |Compatibility:|Spark NLP for Healthcare 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.4 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Cycle_Number 78 41 19 97 0.66 0.80 0.72 Response_To_Treatment 451 205 145 596 0.69 0.76 0.72 Cycle_Count 210 75 20 230 0.74 0.91 0.82 Unspecific_Therapy 189 76 89 278 0.71 0.68 0.70 Chemotherapy 831 87 48 879 0.91 0.95 0.92 Targeted_Therapy 194 28 34 228 0.87 0.85 0.86 Radiotherapy 279 35 31 310 0.89 0.90 0.89 Cancer_Surgery 720 192 99 819 0.79 0.88 0.83 Line_Of_Therapy 95 6 11 106 0.94 0.90 0.92 Hormonal_Therapy 170 6 15 185 0.97 0.92 0.94 Immunotherapy 96 17 32 128 0.85 0.75 0.80 Cycle_Day 205 38 43 248 0.84 0.83 0.84 Frequency 363 33 64 427 0.92 0.85 0.88 Route 93 6 20 113 0.94 0.82 0.88 Duration 527 102 234 761 0.84 0.69 0.76 Dosage 959 63 101 1060 0.94 0.90 0.92 Radiation_Dose 106 12 20 126 0.90 0.84 0.87 macro_avg 5566 1022 1025 6591 0.85 0.84 0.84 micro_avg 5566 1022 1025 6591 0.85 0.84 0.84 ``` --- layout: model title: Clinical Deidentification (French) author: John Snow Labs name: clinical_deidentification date: 2023-06-13 tags: [deid, fr, licensed] task: De-identification language: fr edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: 
cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline de-identifies protected health information (PHI) in French medical texts. It can mask or obfuscate the following entities: `DATE`, `AGE`, `SEX`, `PROFESSION`, `ORGANIZATION`, `PHONE`, `E-MAIL`, `ZIP`, `STREET`, `CITY`, `COUNTRY`, `PATIENT`, `DOCTOR`, `HOSPITAL`, `MEDICALRECORD`, `SSN`, `IDNUM`, `ACCOUNT`, `PLATE`, `USERNAME`, `URL`, and `IPADDR`. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_fr_4.4.4_3.2_1686663918009.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_fr_4.4.4_3.2_1686663918009.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "fr", "clinical/models") sample = """COMPTE-RENDU D'HOSPITALISATION PRENOM : Jean NOM : Dubois NUMÉRO DE SÉCURITÉ SOCIALE : 1780160471058 ADRESSE : 18 Avenue Matabiau VILLE : Grenoble CODE POSTAL : 38000 DATE DE NAISSANCE : 03/03/1946 Âge : 70 ans Sexe : H COURRIEL : jdubois@hotmail.fr DATE D'ADMISSION : 12/12/2016 MÉDÉCIN : Dr Michel Renaud RAPPORT CLINIQUE : 70 ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour. Il nous a été adressé car il présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale. L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV. L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal. Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml. 
ADDRESSÉ À : Dre Marie Breton - Centre Hospitalier de Bellevue Service D'Endocrinologie et de Nutrition - Rue Paulin Bussières, 38000 Grenoble COURRIEL : mariebreton@chb.fr """ result = deid_pipeline.annotate(sample) ``` ```scala val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "fr", "clinical/models") sample = "COMPTE-RENDU D'HOSPITALISATION PRENOM : Jean NOM : Dubois NUMÉRO DE SÉCURITÉ SOCIALE : 1780160471058 ADRESSE : 18 Avenue Matabiau VILLE : Grenoble CODE POSTAL : 38000 DATE DE NAISSANCE : 03/03/1946 Âge : 70 ans Sexe : H COURRIEL : jdubois@hotmail.fr DATE D'ADMISSION : 12/12/2016 MÉDÉCIN : Dr Michel Renaud RAPPORT CLINIQUE : 70 ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour. Il nous a été adressé car il présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale. L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV. L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal. Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml. 
ADDRESSÉ À : Dre Marie Breton - Centre Hospitalier de Bellevue Service D'Endocrinologie et de Nutrition - Rue Paulin Bussières, 38000 Grenoble COURRIEL : mariebreton@chb.fr " val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("fr.deid_obfuscated").predict("""COMPTE-RENDU D'HOSPITALISATION PRENOM : Jean NOM : Dubois NUMÉRO DE SÉCURITÉ SOCIALE : 1780160471058 ADRESSE : 18 Avenue Matabiau VILLE : Grenoble CODE POSTAL : 38000 DATE DE NAISSANCE : 03/03/1946 Âge : 70 ans Sexe : H COURRIEL : jdubois@hotmail.fr DATE D'ADMISSION : 12/12/2016 MÉDÉCIN : Dr Michel Renaud RAPPORT CLINIQUE : 70 ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour. Il nous a été adressé car il présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale. L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV. L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal. Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml. ADDRESSÉ À : Dre Marie Breton - Centre Hospitalier de Bellevue Service D'Endocrinologie et de Nutrition - Rue Paulin Bussières, 38000 Grenoble COURRIEL : mariebreton@chb.fr """) ```
## Results ```bash Results Masked with entity labels ------------------------------ COMPTE-RENDU D'HOSPITALISATION PRENOM : NOM : NUMÉRO DE SÉCURITÉ SOCIALE : ADRESSE : VILLE : CODE POSTAL : DATE DE NAISSANCE : Âge : Sexe : COURRIEL : DATE D'ADMISSION : MÉDÉCIN : RAPPORT CLINIQUE : ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour. nous a été adressé car présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale. L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV. L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal. Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml. 
ADDRESSÉ À : - Service D'Endocrinologie et de Nutrition - , COURRIEL : Masked with chars ------------------------------ COMPTE-RENDU D'HOSPITALISATION PRENOM : [**] NOM : [****] NUMÉRO DE SÉCURITÉ SOCIALE : [***********] ADRESSE : [****************] VILLE : [******] CODE POSTAL : [***] DATE DE NAISSANCE : [********] Âge : [****] Sexe : * COURRIEL : [****************] DATE D'ADMISSION : [********] MÉDÉCIN : [**************] RAPPORT CLINIQUE : ** ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour. ** nous a été adressé car ** présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale. L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV. L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal. Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml. 
ADDRESSÉ À : [**************] - [****************************] Service D'Endocrinologie et de Nutrition - [******************], [***] [******] COURRIEL : [****************] Masked with fixed length chars ------------------------------ COMPTE-RENDU D'HOSPITALISATION PRENOM : **** NOM : **** NUMÉRO DE SÉCURITÉ SOCIALE : **** ADRESSE : **** VILLE : **** CODE POSTAL : **** DATE DE NAISSANCE : **** Âge : **** Sexe : **** COURRIEL : **** DATE D'ADMISSION : **** MÉDÉCIN : **** RAPPORT CLINIQUE : **** ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour. **** nous a été adressé car **** présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale. L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV. L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal. Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml. ADDRESSÉ À : **** - **** Service D'Endocrinologie et de Nutrition - ****, **** **** COURRIEL : **** Obfuscated ------------------------------ COMPTE-RENDU D'HOSPITALISATION PRENOM : Mme Ollivier NOM : Mme Traore NUMÉRO DE SÉCURITÉ SOCIALE : 164033818514436 ADRESSE : 731, boulevard de Legrand VILLE : Sainte Antoine CODE POSTAL : 37443 DATE DE NAISSANCE : 18/03/1946 Âge : 46 Sexe : Femme COURRIEL : georgeslemonnier@live.com DATE D'ADMISSION : 10/01/2017 MÉDÉCIN : Pr. 
Manon Dupuy RAPPORT CLINIQUE : 26 ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour. Homme nous a été adressé car Homme présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale. L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV. L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal. Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml. 
ADDRESSÉ À : Dr Tristan-Gilbert Poulain - CENTRE HOSPITALIER D'ORTHEZ Service D'Endocrinologie et de Nutrition - 6, avenue Pages, 37443 Sainte Antoine COURRIEL : massecatherine@bouygtel.fr ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|fr| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - RegexMatcherModel - RegexMatcherModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: English DistilBertForQuestionAnswering Cased model (from abhilash1910) author: John Snow Labs name: distilbert_qa_squadv1 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-squadv1` is an English model originally trained by `abhilash1910`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_squadv1_en_4.3.0_3.0_1672774380902.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_squadv1_en_4.3.0_3.0_1672774380902.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squadv1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squadv1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_squadv1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/abhilash1910/distilbert-squadv1 --- layout: model title: English BertForQuestionAnswering model (from mrm8488) author: John Snow Labs name: bert_qa_bert_tiny_3_finetuned_squadv2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-tiny-3-finetuned-squadv2` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_tiny_3_finetuned_squadv2_en_4.0.0_3.0_1654184985253.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_tiny_3_finetuned_squadv2_en_4.0.0_3.0_1654184985253.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_tiny_3_finetuned_squadv2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_tiny_3_finetuned_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.tiny_v3.by_mrm8488").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_tiny_3_finetuned_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|21.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/bert-tiny-3-finetuned-squadv2 --- layout: model title: Legal Criminal Law Document Classifier (EURLEX) author: John Snow Labs name: legclf_criminal_law_bert date: 2023-03-06 tags: [en, legal, classification, clauses, criminal_law, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the `legclf_criminal_law_bert` model, a Bert Sentence Embeddings Document Classifier, determines whether the document belongs to the class Criminal_Law or not (binary classification) according to EuroVoc labels. ## Predicted Entities `Criminal_Law`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_criminal_law_bert_en_1.0.0_3.0_1678111769534.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_criminal_law_bert_en_1.0.0_3.0_1678111769534.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_criminal_law_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Criminal_Law]| |[Other]| |[Other]| |[Criminal_Law]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_criminal_law_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.7 MB| ## References Training dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Criminal_Law 0.88 0.86 0.87 74 Other 0.84 0.85 0.85 61 accuracy - - 0.86 135 macro-avg 0.86 0.86 0.86 135 weighted-avg 0.86 0.86 0.86 135 ``` --- layout: model title: English DistilBertForQuestionAnswering model (from anurag0077) Squad author: John Snow Labs name: distilbert_qa_anurag0077_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `anurag0077`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_anurag0077_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725004904.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_anurag0077_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725004904.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_anurag0077_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_anurag0077_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased_v3.by_anurag0077").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_anurag0077_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anurag0077/distilbert-base-uncased-finetuned-squad --- layout: model title: Detect Oncology Entities author: John Snow Labs name: ner_oncology_wip date: 2022-07-25 tags: [licensed, english, clinical, ner, oncology, cancer, biomarker, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.5.0 spark_version: 3.0 supported: true published: false annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained named entity recognition (NER) model is a deep learning model for detecting entities related to cancer diagnosis (such as staging, grade, or mentions of metastasis), cancer treatments (chemotherapy, targeted therapy, or surgical procedures, among others), and oncological tests (pathology tests, biomarkers, oncogenes, etc.). The model was trained with the `MedicalNerApproach` annotator, which allows training generic NER models based on neural networks. 
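The training call itself is not part of this card; the following is a minimal, hypothetical sketch of how such a model is typically trained with `MedicalNerApproach` (the dataset path and all hyperparameters below are placeholders, not the actual settings used for this model):

```python
from sparknlp.training import CoNLL
from sparknlp.annotator import WordEmbeddingsModel
from sparknlp_jsl.annotator import MedicalNerApproach

# Placeholder path: the in-house oncology annotations are not public.
conll_data = CoNLL().readDataset(spark, "path/to/oncology_annotations.conll")

# Add clinical word embeddings to the training data, matching the embeddings
# used at inference time (embeddings_clinical).
embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")
training_data = embeddings.transform(conll_data)

# Illustrative hyperparameters only.
ner_approach = MedicalNerApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(10) \
    .setLr(0.001) \
    .setBatchSize(8)

ner_model = ner_approach.fit(training_data)
```

Note that this requires the licensed Healthcare NLP runtime (`sparknlp_jsl`) and a running Spark session.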
## Predicted Entities `Gender`, `Age`, `Race_Ethnicity`, `Date`, `Cancer_Dx`, `Metastasis`, `Invasion`, `Histological_Type`, `Grade`, `Tumor_Finding`, `Staging`, `Tumor_Size`, `Oncogene`, `Biomarker`, `Biomarker_Result`, `Performance_Status`, `Pathology_Test`, `Pathology_Result`, `Smoking_Status`, `Anatomical_Site`, `Direction`, `Site_Lymph_Node`, `Chemotherapy`, `Immunotherapy`, `Targeted_Therapy`, `Hormonal_Therapy`, `Unspecific_Therapy`, `Radiotherapy`, `Cancer_Surgery`, `Line_Of_Therapy`, `Response_To_Treatment`, `Radiation_Dose`, `Duration`, `Frequency`, `Cycle_Number`, `Cycle_Day`, `Dosage`, `Route`, `Relative_Date`, `Imaging_Test` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_wip_en_3.5.0_3.0_1658771306053.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_wip_en_3.5.0_3.0_1658771306053.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(['document'])\ .setOutputCol('sentence') tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained('embeddings_clinical', 'en', 'clinical/models')\ .setInputCols(["sentence", 'token']) \ .setOutputCol("embeddings") ner_oncology = MedicalNerModel.pretrained('ner_oncology_wip', 'en', 'clinical/models')\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_oncology_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner_oncology, ner_oncology_converter]) data = spark.createDataFrame([["She then sought medical attention for a breast lump that she had noticed for the past few months. This was clinically diagnosed as breast cancer. She subsequently underwent right wide local excision of the mass and axillary clearance. Histology revealed 28mm grade 3 oestrogen receptor positive, human epidermal growth factor receptor 2 negative ductal carcinoma involving 12 of 14 axillary nodes. 
An oncology referral was made."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_oncology = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_oncology_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, ner_oncology, ner_oncology_converter)) val data = Seq("""She then sought medical attention for a breast lump that she had noticed for the past few months. This was clinically diagnosed as breast cancer. She subsequently underwent right wide local excision of the mass and axillary clearance. Histology revealed 28mm grade 3 oestrogen receptor positive, human epidermal growth factor receptor 2 negative ductal carcinoma involving 12 of 14 axillary nodes. An oncology referral was made.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_wip").predict("""She then sought medical attention for a breast lump that she had noticed for the past few months. This was clinically diagnosed as breast cancer. She subsequently underwent right wide local excision of the mass and axillary clearance. 
Histology revealed 28mm grade 3 oestrogen receptor positive, human epidermal growth factor receptor 2 negative ductal carcinoma involving 12 of 14 axillary nodes. An oncology referral was made.""") ```
## Results ```bash +----------------------------------------+-----+---+-----------------+ | chunk|begin|end| ner_label| +----------------------------------------+-----+---+-----------------+ | She| 0| 2| Gender| | breast| 40| 45| Anatomical_Site| | lump| 47| 50| Tumor_Finding| | she| 57| 59| Gender| | for the past few months| 73| 95| Duration| | breast cancer| 131|143| Cancer_Dx| | She| 146|148| Gender| | right| 173|177| Direction| | wide local excision| 179|197| Cancer_Surgery| | mass| 206|209| Tumor_Finding| | axillary clearance| 215|232| Anatomical_Site| | Histology| 235|243| Pathology_Test| | oestrogen receptor| 267|284| Biomarker| | positive| 286|293| Biomarker_Result| |human epidermal growth factor receptor 2| 296|335| Oncogene| | negative| 337|344| Biomarker_Result| | ductal| 346|351|Histological_Type| | carcinoma involving 12| 353|374| Cancer_Dx| | 14 axillary nodes| 379|395| Site_Lymph_Node| +----------------------------------------+-----+---+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_wip| |Compatibility:|Healthcare NLP 3.5.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|965.8 KB| ## References Trained on case reports sampled from PubMed, and annotated in-house. 
## Benchmarking ```bash label tp fp fn total precision recall f1 Histological_Type 181.0 48.0 48.0 229.0 0.790393 0.790393 0.790393 Anatomical_Site 1166.0 401.0 208.0 1374.0 0.744097 0.848617 0.792928 Direction 391.0 75.0 159.0 550.0 0.839056 0.710909 0.769685 Staging 187.0 28.0 32.0 219.0 0.869767 0.853881 0.861751 Imaging_Test 959.0 119.0 120.0 1079.0 0.889610 0.888786 0.889198 Cycle_Number 162.0 43.0 25.0 187.0 0.790244 0.866310 0.826531 Tumor_Finding 446.0 62.0 96.0 542.0 0.877953 0.822878 0.849524 Site_Lymph_Node 414.0 86.0 73.0 487.0 0.828000 0.850103 0.838906 Invasion 21.0 8.0 28.0 49.0 0.724138 0.428571 0.538462 Response_To_Treatment 249.0 64.0 112.0 361.0 0.795527 0.689751 0.738872 Smoking_Status 18.0 10.0 7.0 25.0 0.642857 0.720000 0.679245 Tumor_Size 623.0 69.0 37.0 660.0 0.900289 0.943939 0.921598 Age 416.0 1.0 23.0 439.0 0.997602 0.947608 0.971963 Biomarker_Result 539.0 122.0 202.0 741.0 0.815431 0.727395 0.768902 Unspecific_Therapy 57.0 30.0 52.0 109.0 0.655172 0.522936 0.581633 Chemotherapy 415.0 29.0 25.0 440.0 0.934685 0.943182 0.938914 Targeted_Therapy 104.0 33.0 10.0 114.0 0.759124 0.912281 0.828685 Radiotherapy 104.0 10.0 26.0 130.0 0.912281 0.800000 0.852459 Performance_Status 61.0 6.0 35.0 96.0 0.910448 0.635417 0.748466 Pathology_Test 352.0 68.0 118.0 470.0 0.838095 0.748936 0.791011 Cancer_Surgery 291.0 41.0 38.0 329.0 0.876506 0.884498 0.880484 Line_Of_Therapy 64.0 5.0 11.0 75.0 0.927536 0.853333 0.888889 Pathology_Result 154.0 156.0 82.0 236.0 0.496774 0.652542 0.564103 Hormonal_Therapy 62.0 3.0 14.0 76.0 0.953846 0.815789 0.879433 Biomarker 734.0 236.0 103.0 837.0 0.756701 0.876941 0.812396 Immunotherapy 24.0 7.0 19.0 43.0 0.774194 0.558140 0.648649 Cycle_Day 58.0 21.0 10.0 68.0 0.734177 0.852941 0.789116 Frequency 192.0 17.0 53.0 245.0 0.918660 0.783673 0.845815 Route 26.0 3.0 41.0 67.0 0.896552 0.388060 0.541667 Duration 230.0 88.0 67.0 297.0 0.723270 0.774411 0.747967 Metastasis 171.0 2.0 22.0 193.0 0.988439 0.886010 0.934426 Cancer_Dx 
568.0 59.0 59.0 627.0 0.905901 0.905901 0.905901 Grade 15.0 9.0 73.0 88.0 0.625000 0.170455 0.267857 Date 385.0 21.0 12.0 397.0 0.948276 0.969773 0.958904 Relative_Date 348.0 85.0 91.0 439.0 0.803695 0.792711 0.798165 Race_Ethnicity 21.0 1.0 3.0 24.0 0.954545 0.875000 0.913043 Gender 619.0 12.0 10.0 629.0 0.980983 0.984102 0.982540 Dosage 512.0 45.0 107.0 619.0 0.919210 0.827141 0.870748 Oncogene 283.0 23.0 120.0 403.0 0.924837 0.702233 0.798307 Radiation_Dose 53.0 13.0 3.0 56.0 0.803030 0.946429 0.868852 Macro-average - - - - - - 0.796909 Micro-average - - - - - - 0.835757 ``` --- layout: model title: Sentence Entity Resolver for CPT codes (procedures and measurements) - Augmented author: John Snow Labs name: sbiobertresolve_cpt_procedures_measurements_augmented date: 2021-07-02 tags: [licensed, en, entity_resolution, clinical] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps medical entities to CPT codes using Sentence Bert Embeddings. The corpus of this model has been extended to measurements, and this model is capable of mapping both procedures and measurement concepts/entities to CPT codes. Measurement codes are helpful in codifying medical entities related to tests and their results. ## Predicted Entities CPT codes and their descriptions. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cpt_procedures_measurements_augmented_en_3.1.0_3.0_1625257370771.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cpt_procedures_measurements_augmented_en_3.1.0_3.0_1625257370771.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The ```sbiobertresolve_cpt_procedures_measurements_augmented``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model. For procedure entities, use ```ner_jsl``` as the NER model with ```Procedure``` set in ```.setWhiteList()```; for measurement entities, use ```ner_measurements_clinical``` as the NER model with ```Measurements``` set in ```.setWhiteList()```. Merge the chunks from the ```ner_jsl``` and ```ner_measurements_clinical``` models before passing them to the resolver.
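Putting the prerequisites above together, the NER side of the pipeline might look like the following sketch (assuming the licensed Healthcare NLP runtime; the intermediate column names are arbitrary). The merged `ner_chunk` column is what gets embedded and passed to the resolver:

```python
# Two NER models, each whitelisted to the entity type relevant for CPT resolution.
jsl_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("jsl_ner")

jsl_converter = NerConverter() \
    .setInputCols(["sentence", "token", "jsl_ner"]) \
    .setOutputCol("procedure_chunk") \
    .setWhiteList(["Procedure"])

measurement_ner = MedicalNerModel.pretrained("ner_measurements_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("measurement_ner")

measurement_converter = NerConverter() \
    .setInputCols(["sentence", "token", "measurement_ner"]) \
    .setOutputCol("measurement_chunk") \
    .setWhiteList(["Measurements"])

# Merge the two whitelisted chunk columns into a single ner_chunk column.
chunk_merger = ChunkMergeApproach() \
    .setInputCols(["procedure_chunk", "measurement_chunk"]) \
    .setOutputCol("ner_chunk")
```

These stages would sit between the document/sentence/token/embeddings stages and the sentence-embedding plus resolver stages.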
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") jsl_sbert_embedder = BertSentenceEmbeddings\ .pretrained('sbiobert_base_cased_mli','en','clinical/models')\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") cpt_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_cpt_procedures_measurements_augmented", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("cpt_code") cpt_pipelineModel = PipelineModel( stages = [ documentAssembler, jsl_sbert_embedder, cpt_resolver]) cpt_lp = LightPipeline(cpt_pipelineModel) result = cpt_lp.fullAnnotate(['calcium score', 'heart surgery']) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sbert_embeddings") val cpt_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_cpt_procedures_measurements_augmented", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("cpt_code") val cpt_pipeline = new Pipeline().setStages(Array(document_assembler, sbert_embedder, cpt_resolver)) val cpt_pipelineModel = cpt_pipeline.fit(Seq("").toDF("text")) val cpt_lp = new LightPipeline(cpt_pipelineModel) val result = cpt_lp.fullAnnotate(Array("calcium score", "heart surgery")) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.cpt.procedures_measurements").predict("""calcium score""") ```
## Results ```bash | | chunks | code | resolutions | |---:|:--------------|:----- |:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | 0 | calcium score | 82310 | Calcium measurement [Calcium; total] | | 1 | heart surgery | 33257 | Cardiac surgery procedure [Operative tissue ablation and reconstruction of atria, performed at the time of other cardiac procedure(s), limited (eg, modified maze procedure) (List separately in addition to code for primary procedure)] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_cpt_procedures_measurements_augmented| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sbert_embeddings]| |Output Labels:|[cpt_code]| |Language:|en| |Case sensitive:|true| ## Data Source Trained on Current Procedural Terminology dataset with `sbiobert_base_cased_mli` sentence embeddings. --- layout: model title: RxNorm to UMLS Code Mapping author: John Snow Labs name: rxnorm_umls_mapping date: 2021-07-01 tags: [rxnorm, umls, en, licensed, pipeline] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps RxNorm codes to UMLS codes without using any text data. You’ll just feed white space-delimited RxNorm codes and it will return the corresponding UMLS codes as a list. If there is no mapping, the original code is returned with no mapping. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_CODE_MAPPING/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_umls_mapping_en_3.1.0_2.4_1625126295049.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_umls_mapping_en_3.1.0_2.4_1625126295049.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline( "rxnorm_umls_mapping","en","clinical/models") pipeline.annotate("1161611 315677 343663") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline( "rxnorm_umls_mapping","en","clinical/models") val result = pipeline.annotate("1161611 315677 343663") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.rxnorm.umls").predict("""1161611 315677 343663""") ```
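The pipeline's `annotate` output pairs each input RxNorm code with a UMLS code by position (see the Results section), and unmapped codes pass through unchanged, as the Description notes. A minimal post-processing sketch in plain Python, using the example output dict as the assumed shape:

```python
# Mirror of the pipeline's annotate() output shown in Results: two
# parallel lists, aligned by position.
output = {
    "rxnorm": ["1161611", "315677", "343663"],
    "umls": ["C3215948", "C0984912", "C1146501"],
}

def to_mapping(annotations):
    """Zip the parallel code lists into a {rxnorm: umls} dict.

    Because unmapped inputs are echoed back unchanged, a value equal
    to its key signals that no UMLS mapping was found for that code.
    """
    return dict(zip(annotations["rxnorm"], annotations["umls"]))

mapping = to_mapping(output)
print(mapping["1161611"])  # C3215948
```

This is an illustrative helper for consuming the output, not part of the pretrained pipeline itself.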
## Results

```bash
{'rxnorm': ['1161611', '315677', '343663'],
'umls': ['C3215948', 'C0984912', 'C1146501']}

Note:

| RxNorm     | Details                  |
| ---------- | ------------------------:|
| 1161611    | metformin Pill           |
| 315677     | cimetidine 100 mg        |
| 343663     | insulin lispro 50 UNT/ML |

| UMLS       | Details                  |
| ---------- | ------------------------:|
| C3215948   | metformin pill           |
| C0984912   | cimetidine 100 mg        |
| C1146501   | insulin lispro 50 unt/ml |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|rxnorm_umls_mapping|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.1.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|

## Included Models

- DocumentAssembler
- TokenizerModel
- LemmatizerModel
- Finisher

---
layout: model
title: English Bert Embeddings (from jackaduma)
author: John Snow Labs
name: bert_embeddings_SecBERT
date: 2022-04-11
tags: [bert, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `SecBERT` is an English model originally trained by `jackaduma`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_SecBERT_en_3.4.2_3.0_1649671929316.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_SecBERT_en_3.4.2_3.0_1649671929316.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_SecBERT","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_SecBERT","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.SecBERT").predict("""I love Spark NLP""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_SecBERT|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|313.5 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/jackaduma/SecBERT
- https://github.com/jackaduma/SecBERT/
- https://github.com/kbandla/APTnotes
- https://stucco.github.io/data/
- https://ebiquity.umbc.edu/_file_directory_/papers/943.pdf
- https://competitions.codalab.org/competitions/17262
- https://github.com/allenai/scibert

---
layout: model
title: English image_classifier_vit_base_cats_vs_dogs ViTForImageClassification from akahana
author: John Snow Labs
name: image_classifier_vit_base_cats_vs_dogs
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_cats_vs_dogs` is an English model originally trained by akahana.

## Predicted Entities

`cat`, `dog`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_cats_vs_dogs_en_4.1.0_3.0_1660171931492.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_cats_vs_dogs_en_4.1.0_3.0_1660171931492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_cats_vs_dogs", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_cats_vs_dogs", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_cats_vs_dogs| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Fast Neural Machine Translation Model from Arabic to Hebrew author: John Snow Labs name: opus_mt_ar_he date: 2021-06-01 tags: [open_source, seq2seq, translation, ar, he, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). source languages: ar target languages: he {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ar_he_xx_3.1.0_2.4_1622552020327.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ar_he_xx_3.1.0_2.4_1622552020327.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_ar_he", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_ar_he", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Arabic.translate_to.Hebrew').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_ar_he|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: French CamemBert Embeddings (from Katster)
author: John Snow Labs
name: camembert_embeddings_Katster_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `Katster`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Katster_generic_model_fr_3.4.4_3.0_1653986534274.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Katster_generic_model_fr_3.4.4_3.0_1653986534274.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Katster_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Katster_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_Katster_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/Katster/dummy-model --- layout: model title: Multilingual T5ForConditionalGeneration Cased model (from Lucapro) author: John Snow Labs name: t5_test_model date: 2023-01-31 tags: [en, ro, open_source, t5, xx, tensorflow] task: Text Generation language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `test-model` is a Multilingual model originally trained by `Lucapro`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_test_model_xx_4.3.0_3.0_1675157312429.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_test_model_xx_4.3.0_3.0_1675157312429.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_test_model","xx") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_test_model","xx")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_test_model|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|xx|
|Size:|260.2 MB|

## References

- https://huggingface.co/Lucapro/test-model
- https://paperswithcode.com/sota?task=Translation&dataset=wmt16+ro-en

---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_efficient_small_el8_dl1
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-el8-dl1` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el8_dl1_en_4.3.0_3.0_1675120557059.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el8_dl1_en_4.3.0_3.0_1675120557059.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_small_el8_dl1","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_small_el8_dl1","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_small_el8_dl1|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|119.7 MB|

## References

- https://huggingface.co/google/t5-efficient-small-el8-dl1
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145

---
layout: model
title: Spam Classifier
author: John Snow Labs
name: classifierdl_use_spam
date: 2021-01-09
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 2.7.1
spark_version: 2.4
tags: [open_source, en]
supported: true
annotator: ClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Automatically identify messages as regular messages (ham) or spam.

## Predicted Entities

`spam`, `ham`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_spam_en_2.7.1_2.4_1610187019592.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_spam_en_2.7.1_2.4_1610187019592.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained('tfhub_use', lang="en") \
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

document_classifier = ClassifierDLModel.pretrained('classifierdl_use_spam', 'en') \
    .setInputCols(["document", "sentence_embeddings"]) \
    .setOutputCol("class")

nlpPipeline = Pipeline(stages=[document_assembler, use, document_classifier])

light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text")))

annotations = light_pipeline.fullAnnotate("Congratulations! You've won a $1,000 Walmart gift card. Go to http://bit.ly/1234 to claim now.")
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val use = UniversalSentenceEncoder.pretrained(lang="en")
    .setInputCols(Array("document"))
    .setOutputCol("sentence_embeddings")

val document_classifier = ClassifierDLModel.pretrained("classifierdl_use_spam", "en")
    .setInputCols(Array("document", "sentence_embeddings"))
    .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(documentAssembler, use, document_classifier))

val data = Seq("Congratulations! You've won a $1,000 Walmart gift card. Go to http://bit.ly/1234 to claim now.").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["""Congratulations! You've won a $1,000 Walmart gift card. Go to http://bit.ly/1234 to claim now."""]
spam_df = nlu.load('classify.spam.use').predict(text, output_level='document')
spam_df[["document", "spam"]]
```
## Results

```bash
+------------------------------------------------------------------------------------------------+------------+
|document                                                                                        |class       |
+------------------------------------------------------------------------------------------------+------------+
|Congratulations! You've won a $1,000 Walmart gift card. Go to http://bit.ly/1234 to claim now.  | spam       |
+------------------------------------------------------------------------------------------------+------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|classifierdl_use_spam|
|Compatibility:|Spark NLP 2.7.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Dependencies:|tfhub_use|

## Data Source

This model is trained on the UCI spam dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip

## Benchmarking

```bash
              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       966
        spam       0.95      0.95      0.95       149

    accuracy                           0.99      1115
   macro avg       0.97      0.97      0.97      1115
weighted avg       0.99      0.99      0.99      1115
```

---
layout: model
title: English image_classifier_vit_Teeth_A ViTForImageClassification from steven123
author: John Snow Labs
name: image_classifier_vit_Teeth_A
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_Teeth_A` is an English model originally trained by steven123.
## Predicted Entities `Good Teeth`, `Missing Teeth`, `Rotten Teeth` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Teeth_A_en_4.1.0_3.0_1660170593463.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Teeth_A_en_4.1.0_3.0_1660170593463.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_Teeth_A", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_Teeth_A", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_Teeth_A|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|

---
layout: model
title: Pipeline to Detect Normalized Genes and Human Phenotypes
author: John Snow Labs
name: ner_human_phenotype_go_clinical_pipeline
date: 2023-03-15
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [ner_human_phenotype_go_clinical](https://nlp.johnsnowlabs.com/2021/03/31/ner_human_phenotype_go_clinical_en.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_go_clinical_pipeline_en_4.3.0_3.2_1678874608526.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_go_clinical_pipeline_en_4.3.0_3.2_1678874608526.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_human_phenotype_go_clinical_pipeline", "en", "clinical/models") text = '''Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_human_phenotype_go_clinical_pipeline", "en", "clinical/models") val text = "Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.human_phenotype_clinical.pipeline").predict("""Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.""") ```
## Results

```bash
|    | ner_chunk                |   begin |   end | ner_label   |   confidence |
|---:|:-------------------------|--------:|------:|:------------|-------------:|
|  0 | tumor                    |      39 |    43 | HP          |       0.9996 |
|  1 | tricarboxylic acid cycle |      79 |   102 | GO          |     0.994633 |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_human_phenotype_go_clinical_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel

---
layout: model
title: Legal Board Of Directors Clause Binary Classifier
author: John Snow Labs
name: legclf_board_of_directors_clause
date: 2023-01-29
tags: [en, legal, classification, board, directors, clauses, board_of_directors, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `board-of-directors` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level.

If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
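The paragraph-splitting option above (splitting by multiline, i.e. on blank lines) can be sketched in plain Python before feeding each chunk to the classifier. This is an illustrative helper, not the workshop notebook's own code:

```python
import re

def split_paragraphs(document: str):
    """Split a legal document into paragraph chunks on blank lines
    (two or more consecutive newlines), dropping empty fragments."""
    parts = re.split(r"\n\s*\n", document)
    return [p.strip() for p in parts if p.strip()]

doc = "CLAUSE 1. The Board of Directors shall...\n\nCLAUSE 2. Confidentiality...\n\n\nCLAUSE 3. Term."
chunks = split_paragraphs(doc)
print(len(chunks))  # 3
```

Each resulting chunk would then be classified as a separate row (e.g. one row per chunk in the input DataFrame), keeping every piece under the model's context limit.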
Keep in mind that this model's embeddings allow up to 512 tokens. If your texts are longer than that, consider splitting them into smaller pieces (see the tutorial linked above).

This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`board-of-directors`, `other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_board_of_directors_clause_en_1.0.0_3.0_1675004773054.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_board_of_directors_clause_en_1.0.0_3.0_1675004773054.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_board_of_directors_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
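As the Description notes, several of these binary clause classifiers can be run over the same text and their labels collected into one True/False profile per document. A minimal aggregation sketch in plain Python, where the clause names and predicted labels are hypothetical stand-ins for each model's `category` output:

```python
def clause_profile(predictions, clause_names):
    """Collapse one label per classifier run (a clause name or "other")
    into a {clause_name: bool} profile for the document."""
    found = {p for p in predictions if p != "other"}
    return {name: (name in found) for name in clause_names}

# Hypothetical clause types and per-classifier predictions.
clauses = ["board-of-directors", "confidentiality", "termination"]
preds = ["board-of-directors", "other", "other"]

profile = clause_profile(preds, clauses)
print(profile["board-of-directors"])  # True
print(profile["termination"])         # False
```

In a Spark pipeline the same effect is achieved by adding one classifier stage per clause type and reading each stage's output column; the helper above only illustrates the final aggregation step.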
## Results

```bash
+--------------------+
|result              |
+--------------------+
|[board-of-directors]|
|[other]             |
|[other]             |
|[board-of-directors]|
+--------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_board_of_directors_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.9 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
             label  precision  recall  f1-score  support
board-of-directors       1.00    0.96      0.98       24
             other       0.97    1.00      0.99       38
          accuracy          -       -      0.98       62
         macro-avg       0.99    0.98      0.98       62
      weighted-avg       0.98    0.98      0.98       62
```

---
layout: model
title: Spanish RobertaForQuestionAnswering Base Cased model (from Evelyn18)
author: John Snow Labs
name: roberta_qa_base_spanish_squades_modelo_v1b3
date: 2023-01-20
tags: [es, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: es
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-squades-modelo-robertav1b3` is a Spanish model originally trained by `Evelyn18`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_modelo_v1b3_es_4.3.0_3.0_1674218384530.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_modelo_v1b3_es_4.3.0_3.0_1674218384530.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_modelo_v1b3","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_modelo_v1b3","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_spanish_squades_modelo_v1b3| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|460.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Evelyn18/roberta-base-spanish-squades-modelo-robertav1b3 --- layout: model title: English asr_vakyansh_wav2vec2_maithili_maim_50 TFWav2Vec2ForCTC from Harveenchadha author: John Snow Labs name: pipeline_asr_vakyansh_wav2vec2_maithili_maim_50 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_vakyansh_wav2vec2_maithili_maim_50` is an English model originally trained by Harveenchadha. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_vakyansh_wav2vec2_maithili_maim_50_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_vakyansh_wav2vec2_maithili_maim_50_en_4.2.0_3.0_1664117308568.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_vakyansh_wav2vec2_maithili_maim_50_en_4.2.0_3.0_1664117308568.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_vakyansh_wav2vec2_maithili_maim_50', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_vakyansh_wav2vec2_maithili_maim_50", lang = "en") val annotations = pipeline.transform(audioDF) ```
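The pretrained pipeline expects `audioDF` to carry a column of floating-point audio samples (Wav2Vec2 models are typically trained on 16 kHz mono audio). A sketch of preparing such a row in plain Python — a generated tone stands in for a real recording here, and in practice the floats would come from decoding a `.wav`/`.flac` file with a library of your choice (that choice is an assumption, not a requirement of the pipeline):

```python
import math

SAMPLE_RATE = 16_000  # assumed 16 kHz mono, the usual rate for Wav2Vec2 models

# One second of a 440 Hz sine tone as a stand-in for real speech samples.
samples = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
           for t in range(SAMPLE_RATE)]

rows = [(samples,)]  # one row per utterance
# audioDF = spark.createDataFrame(rows, ["audio_content"])
# annotations = pipeline.transform(audioDF)
print(len(samples))  # 16000
```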
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_vakyansh_wav2vec2_maithili_maim_50| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|227.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Detect PHI for Deidentification in Romanian (Word2Vec) author: John Snow Labs name: ner_deid_subentity date: 2022-06-27 tags: [ner, deidentification, word2vec, ro, licensed] task: Named Entity Recognition language: ro edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Romanian) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 17 entities. This NER model is trained on a combination of custom datasets with several data augmentation mechanisms.
## Predicted Entities `AGE`, `CITY`, `COUNTRY`, `DATE`, `DOCTOR`, `EMAIL`, `FAX`, `HOSPITAL`, `IDNUM`, `LOCATION-OTHER`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `PROFESSION`, `STREET`, `ZIP` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_ro_4.0.0_3.0_1656316441636.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_ro_4.0.0_3.0_1656316441636.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ro")\ .setInputCols(["sentence","token"])\ .setOutputCol("word_embeddings") clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "ro", "clinical/models")\ .setInputCols(["sentence","token","word_embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter]) text = """ Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""" data = spark.createDataFrame([[text]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ro") .setInputCols(Array("sentence","token")) .setOutputCol("word_embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "ro", "clinical/models") .setInputCols(Array("sentence","token","word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( documentAssembler, 
sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter)) val text = """Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""" val data = Seq(text).toDS.toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ro.med_ner.deid.subentity").predict(""" Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""") ```
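A common next step after de-identification NER is masking the detected chunks in the original text. A minimal plain-Python sketch using chunk/label pairs from the Results section — the `mask_phi` helper is purely illustrative (for production masking, Healthcare NLP ships dedicated de-identification annotators):

```python
def mask_phi(text, chunks):
    """Replace each detected chunk with its entity label. Illustrative only:
    a naive string replace, not Spark NLP's actual de-identification logic."""
    for chunk, label in chunks:
        text = text.replace(chunk, f"<{label}>")
    return text

text = "Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn"
chunks = [("BUREAN MARIA", "PATIENT"), ("77", "AGE"), ("Agota Evelyn", "DOCTOR")]

print(mask_phi(text, chunks))
# Nume si Prenume : <PATIENT>, Varsta: <AGE> Medic : <DOCTOR>
```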
## Results ```bash +----------------------------+---------+ |chunk |ner_label| +----------------------------+---------+ |Spitalul Pentru Ochi de Deal|HOSPITAL | |Drumul Oprea Nr. |STREET | |Vaslui |CITY | |737405 |ZIP | |+40(235)413773 |PHONE | |25 May 2022 |DATE | |BUREAN MARIA |PATIENT | |77 |AGE | |Agota Evelyn |DOCTOR | |2450502264401 |IDNUM | +----------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ro| |Size:|15.1 MB| ## References - Custom John Snow Labs datasets - Data augmentation techniques ## Benchmarking ```bash label precision recall f1-score support AGE 0.96 0.99 0.97 1235 CITY 0.97 0.95 0.96 307 COUNTRY 0.92 0.74 0.82 115 DATE 0.94 0.89 0.91 5006 DOCTOR 0.96 0.96 0.96 2064 EMAIL 1.00 1.00 1.00 8 FAX 1.00 0.95 0.97 56 HOSPITAL 0.78 0.83 0.80 919 IDNUM 0.98 1.00 0.99 239 LOCATION-OTHER 1.00 0.85 0.92 13 MEDICALRECORD 1.00 1.00 1.00 455 ORGANIZATION 0.34 0.41 0.37 82 PATIENT 0.85 0.90 0.87 954 PHONE 0.97 0.98 0.98 315 PROFESSION 0.87 0.80 0.83 173 STREET 0.99 0.99 0.99 174 ZIP 0.99 0.97 0.98 140 micro-avg 0.92 0.91 0.92 12255 macro-avg 0.91 0.89 0.90 12255 weighted-avg 0.93 0.91 0.92 12255 ``` --- layout: model title: Detect Drugs and Proteins author: John Snow Labs name: ner_drugprot_clinical date: 2021-12-20 tags: [ner, clinical, drugprot, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects chemical compounds/drugs and genes/proteins in medical text and research articles. 
Chemical compounds/drugs are labeled as `CHEMICAL`, genes/proteins as `GENE`, and mentions where the two entity types overlap (such as enzymes and small peptides) as `GENE_AND_CHEMICAL`. ## Predicted Entities `GENE`, `CHEMICAL`, `GENE_AND_CHEMICAL` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DRUG_PROT/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_drugprot_clinical_en_3.3.3_3.0_1639989110299.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_drugprot_clinical_en_3.3.3_3.0_1639989110299.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentences", "tokens"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_drugprot_clinical", "en", "clinical/models")\ .setInputCols(["sentences", "tokens", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentences", "tokens", "ner"])\ .setOutputCol("ner_chunks") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) EXAMPLE_TEXT = "Anabolic effects of clenbuterol on skeletal muscle are mediated by beta 2-adrenoceptor activation." data = spark.createDataFrame([[EXAMPLE_TEXT]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala ... 
val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols(Array("sentences")) .setOutputCol("tokens") val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_drugprot_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner")) .setOutputCol("ner_chunks") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter)) val data = Seq("""Anabolic effects of clenbuterol on skeletal muscle are mediated by beta 2-adrenoceptor activation.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.drugprot_clinical").predict("""Anabolic effects of clenbuterol on skeletal muscle are mediated by beta 2-adrenoceptor activation.""") ```
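The benchmarking section further down reports raw `tp`/`fp`/`fn` counts per label, from which the precision, recall, and F1 columns can be reproduced. A quick sketch using the `GENE` row as a worked example:

```python
def prf(tp, fp, fn):
    """Standard precision/recall/F1 from true-positive, false-positive,
    and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# GENE row from the benchmarking table: tp=7176, fp=822, fn=652
p, r, f1 = prf(7176, 822, 652)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.8972 0.9167 0.9069
```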
## Results ```bash +-------------------------------+---------+ |chunk |ner_label| +-------------------------------+---------+ |clenbuterol |CHEMICAL | |beta 2-adrenoceptor |GENE | +-------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_drugprot_clinical| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|14.7 MB| |Dependencies:|embeddings_clinical| ## Data Source This model was trained on the [DrugProt corpus](https://zenodo.org/record/5119892). ## Benchmarking ```bash label tp fp fn total precision recall f1 GENE_AND_CHEMICAL 786.0 171.0 143.0 929.0 0.8213 0.8461 0.8335 CHEMICAL 8228.0 779.0 575.0 8803.0 0.9135 0.9347 0.924 GENE 7176.0 822.0 652.0 7828.0 0.8972 0.9167 0.9069 macro - - - - - - 0.88811683 micro - - - - - - 0.91156048 ``` --- layout: model title: English BertForQuestionAnswering model (from AnonymousSub) author: John Snow Labs name: bert_qa_news_pretrain_bert_FT_newsqa date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `news_pretrain_bert_FT_newsqa` is an English model originally trained by `AnonymousSub`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_news_pretrain_bert_FT_newsqa_en_4.0.0_3.0_1654188933131.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_news_pretrain_bert_FT_newsqa_en_4.0.0_3.0_1654188933131.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_news_pretrain_bert_FT_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_news_pretrain_bert_FT_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.bert.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_news_pretrain_bert_FT_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/news_pretrain_bert_FT_newsqa --- layout: model title: Stopwords Remover for Tigrinya language (182 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, ti, open_source] task: Stop Words Removal language: ti edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_ti_3.4.1_3.0_1646673030296.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_ti_3.4.1_3.0_1646673030296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","ti") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["ይቕሬታ፣ ሓደ መደብ ገይረ ኣሎኹ።"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","ti") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("ይቕሬታ፣ ሓደ መደብ ገይረ ኣሎኹ።").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ti.stopwords").predict("""ይቕሬታ፣ ሓደ መደብ ገይረ ኣሎኹ።""") ```
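`StopWordsCleaner` simply drops any token found in the model's stopword list and keeps everything else — as the Results section illustrates, none of the example tokens appear in the 182-entry Tigrinya list, so all pass through. A language-agnostic sketch of the same filtering logic, with a hypothetical stopword set for illustration only:

```python
def clean_tokens(tokens, stopwords):
    # Keep every token not found in the stopword set. This sketch is
    # case-sensitive; the real annotator exposes its own case handling.
    return [t for t in tokens if t not in stopwords]

# Hypothetical English stopword set, not the Tigrinya list shipped with the model.
stopwords = {"the", "a", "an", "of"}
print(clean_tokens(["the", "board", "of", "directors"], stopwords))
# ['board', 'directors']
```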
## Results ```bash +---------------------------+ |result | +---------------------------+ |[ይቕሬታ፣, ሓደ, መደብ, ገይረ, ኣሎኹ።]| +---------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|ti| |Size:|2.1 KB| --- layout: model title: Modern Greek (1453-) BertForQuestionAnswering model (from Danastos) author: John Snow Labs name: bert_qa_triviaqa_bert_el_Danastos date: 2022-06-03 tags: [open_source, question_answering, bert] task: Question Answering language: el edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `triviaqa_bert_el` is a Modern Greek (1453-) model originally trained by `Danastos`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_triviaqa_bert_el_Danastos_el_4.0.0_3.0_1654250055165.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_triviaqa_bert_el_Danastos_el_4.0.0_3.0_1654250055165.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_triviaqa_bert_el_Danastos","el") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_triviaqa_bert_el_Danastos","el") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("el.answer_question.trivia.bert.by_Danastos").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_triviaqa_bert_el_Danastos| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|el| |Size:|421.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Danastos/triviaqa_bert_el --- layout: model title: Detect Clinical Entities (ner_eu_clinical_case - eu) author: John Snow Labs name: ner_eu_clinical_case date: 2023-02-02 tags: [eu, clinical, licensed, ner] task: Named Entity Recognition language: eu edition: Healthcare NLP 4.2.8 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition (NER) deep learning model for extracting clinical entities from Basque texts. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives. ## Predicted Entities `clinical_event`, `bodypart`, `clinical_condition`, `units_measurements`, `patient`, `date_time` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_eu_4.2.8_3.0_1675359410041.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_eu_4.2.8_3.0_1675359410041.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","eu")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained('ner_eu_clinical_case', "eu", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["""3 urteko mutiko bat nahasmendu autistarekin unibertsitateko ospitaleko A pediatriako ospitalean. Ez du autismoaren espektroaren nahaste edo gaixotasun familiaren aurrekaririk. Mutilari komunikazio-nahaste larria diagnostikatu zioten, elkarrekintza sozialeko zailtasunak eta prozesamendu sentsorial atzeratua. Odol-analisiak normalak izan ziren (tiroidearen hormona estimulatzailea (TSH), hemoglobina, batez besteko bolumen corpuskularra (MCV) eta ferritina). Goiko endoskopiak mukosaren azpiko tumore bat ere erakutsi zuen, urdail-irteeren guztizko oztopoa eragiten zuena. Estroma gastrointestinalaren tumore baten susmoa ikusita, distaleko gastrektomia egin zen. 
Azterketa histopatologikoak agerian utzi zuen mukosaren azpiko zelulen ugaltzea."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","eu") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_eu_clinical_case", "eu", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documenter, sentenceDetector, tokenizer, word_embeddings, ner_model, ner_converter)) val data = Seq("""3 urteko mutiko bat nahasmendu autistarekin unibertsitateko ospitaleko A pediatriako ospitalean. Ez du autismoaren espektroaren nahaste edo gaixotasun familiaren aurrekaririk. Mutilari komunikazio-nahaste larria diagnostikatu zioten, elkarrekintza sozialeko zailtasunak eta prozesamendu sentsorial atzeratua. Odol-analisiak normalak izan ziren (tiroidearen hormona estimulatzailea (TSH), hemoglobina, batez besteko bolumen corpuskularra (MCV) eta ferritina). Goiko endoskopiak mukosaren azpiko tumore bat ere erakutsi zuen, urdail-irteeren guztizko oztopoa eragiten zuena. Estroma gastrointestinalaren tumore baten susmoa ikusita, distaleko gastrektomia egin zen. Azterketa histopatologikoak agerian utzi zuen mukosaren azpiko zelulen ugaltzea.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ```
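`NerConverterInternal` turns the token-level `B-`/`I-` tags emitted by the NER model into the entity chunks shown in the Results section. An illustrative BIO-merging sketch (not Spark NLP's actual implementation), using the first tokens of the example text with assumed tags:

```python
def bio_to_chunks(tokens, tags):
    """Merge B-/I- tagged tokens into (chunk, label) pairs. Illustrative only."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the previous chunk before opening a new one
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # an "O" tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["3", "urteko", "mutiko", "bat", "nahasmendu"]
tags = ["B-patient", "I-patient", "I-patient", "I-patient", "B-clinical_event"]
print(bio_to_chunks(tokens, tags))
# [('3 urteko mutiko bat', 'patient'), ('nahasmendu', 'clinical_event')]
```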
## Results ```bash +----------------------------+------------------+ |chunk |ner_label | +----------------------------+------------------+ |3 urteko mutiko bat |patient | |nahasmendu |clinical_event | |autismoaren espektroaren |clinical_condition| |nahaste |clinical_event | |gaixotasun |clinical_event | |familiaren |patient | |aurrekaririk |clinical_event | |Mutilari |patient | |komunikazio-nahaste |clinical_event | |diagnostikatu |clinical_event | |elkarrekintza |clinical_event | |zailtasunak |clinical_event | |prozesamendu sentsorial |clinical_event | |Odol-analisiak |clinical_event | |normalak |units_measurements| |tiroidearen |bodypart | |hormona estimulatzailea |clinical_event | |TSH |clinical_event | |hemoglobina |clinical_event | |bolumen |clinical_event | |MCV |clinical_event | |ferritina |clinical_event | |Goiko |bodypart | |endoskopiak |clinical_event | |mukosaren azpiko |bodypart | |tumore |clinical_event | |erakutsi |clinical_event | |oztopoa |clinical_event | |Estroma gastrointestinalaren|clinical_event | |tumore |clinical_event | |ikusita |clinical_event | |distaleko |bodypart | |gastrektomia |clinical_event | |Azterketa |clinical_event | |agerian |clinical_event | |utzi |clinical_event | |mukosaren azpiko zelulen |bodypart | |ugaltzea |clinical_event | +----------------------------+------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_eu_clinical_case| |Compatibility:|Healthcare NLP 4.2.8+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|eu| |Size:|896.1 KB| ## References The corpus used for model training is provided by European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives. ## Sample text from the training dataset 3 urteko mutiko bat nahasmendu autistarekin unibertsitateko ospitaleko A pediatriako ospitalean. 
Ez du autismoaren espektroaren nahaste edo gaixotasun familiaren aurrekaririk. Mutilari komunikazio-nahaste larria diagnostikatu zioten, elkarrekintza sozialeko zailtasunak eta prozesamendu sentsorial atzeratua. Odol-analisiak normalak izan ziren (tiroidearen hormona estimulatzailea (TSH), hemoglobina, batez besteko bolumen corpuskularra (MCV) eta ferritina). Goiko endoskopiak mukosaren azpiko tumore bat ere erakutsi zuen, urdail-irteeren guztizko oztopoa eragiten zuena. Estroma gastrointestinalaren tumore baten susmoa ikusita, distaleko gastrektomia egin zen. Azterketa histopatologikoak agerian utzi zuen mukosaren azpiko zelulen ugaltzea. ## Benchmarking ```bash label tp fp fn total precision recall f1 date_time 103.0 13.0 26.0 129.0 0.8879 0.7984 0.8408 units_measurements 257.0 37.0 9.0 266.0 0.8741 0.9662 0.9179 clinical_condition 20.0 22.0 33.0 53.0 0.4782 0.3774 0.4211 patient 69.0 3.0 8.0 77.0 0.9583 0.8961 0.9262 clinical_event 712.0 121.0 95.0 807.0 0.8547 0.8823 0.8683 bodypart 182.0 33.0 15.0 197.0 0.8465 0.9239 0.8835 macro - - - - - - 0.8096 micro - - - - - - 0.8640 ``` --- layout: model title: English ElectraForQuestionAnswering model (from bdickson) Version-1 author: John Snow Labs name: electra_qa_small_discriminator_finetuned_squad_1 date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-small-discriminator-finetuned-squad` is an English model originally trained by `bdickson`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_small_discriminator_finetuned_squad_1_en_4.0.0_3.0_1655921242790.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_small_discriminator_finetuned_squad_1_en_4.0.0_3.0_1655921242790.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_small_discriminator_finetuned_squad_1","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_small_discriminator_finetuned_squad_1","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.electra.small.by_bdickson").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""")
```
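Under the hood, an extractive question-answering head scores every token of the context as a candidate answer start and end, and returns the highest-scoring valid span. A minimal, Spark-NLP-independent sketch of that span selection (token names and scores are made up for illustration):

```python
# Toy span selection as performed by extractive QA models: pick the
# (start, end) pair with the highest combined score, subject to
# start <= end and a maximum answer length.
def best_span(start_scores, end_scores, max_len=15):
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]
end   = [0.0, 0.1, 0.0, 4.8, 0.0, 0.0, 0.1, 0.0, 0.5, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```

The real model produces these scores from the transformer encoder; the annotator handles tokenization and span decoding internally.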
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|electra_qa_small_discriminator_finetuned_squad_1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|51.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/bdickson/electra-small-discriminator-finetuned-squad
---
layout: model
title: Fast Neural Machine Translation Model from English to Russian
author: John Snow Labs
name: opus_mt_en_ru
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, ru, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.

- source languages: `en`
- target languages: `ru`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ru_xx_2.7.0_2.4_1609164756740.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ru_xx_2.7.0_2.4_1609164756740.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_ru", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_ru", "xx")
.setInputCols("sentence")
.setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.ru').predict(text, output_level='sentence')
opus_df
```
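Because the Marian stage translates sentence by sentence, a full-document translation is reassembled from the per-sentence annotations. A sketch of that post-processing step, assuming a result shaped like the output of `fullAnnotate` (the field names and Russian strings here are purely illustrative, not actual model output):

```python
# Hypothetical fullAnnotate-style output: one dict per input row, mapping
# each output column to a list of per-sentence annotations.
annotations = [{
    "translation": [
        {"result": "Привет, мир."},
        {"result": "Как дела?"},
    ]
}]

# Join the per-sentence translations back into one document string.
translated = " ".join(a["result"] for a in annotations[0]["translation"])
print(translated)  # -> Привет, мир. Как дела?
```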
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_en_ru|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Sentence Entity Resolver for RxNorm (sbluebert_base_uncased_mli embeddings)
author: John Snow Labs
name: sbluebertresolve_rxnorm_augmented_uncased
date: 2021-12-28
tags: [en, clinical, entity_resolution, licensed]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.3.4
spark_version: 2.4
supported: true
annotator: SentenceEntityResolverModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes using `sbluebert_base_uncased_mli` Sentence Bert Embeddings. It is trained on an augmented version of the dataset used in previous RxNorm resolver models. Additionally, this model returns the concept classes of the drugs in the `all_k_aux_labels` column.

## Predicted Entities

`RxNorm Codes`, `Concept Classes`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbluebertresolve_rxnorm_augmented_uncased_en_3.3.4_2.4_1640698320751.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbluebertresolve_rxnorm_augmented_uncased_en_3.3.4_2.4_1640698320751.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")

sbert_embedder = BertSentenceEmbeddings.pretrained('sbluebert_base_uncased_mli', 'en','clinical/models')\
.setInputCols(["ner_chunk"])\
.setOutputCol("sbert_embeddings")

rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbluebertresolve_rxnorm_augmented_uncased", "en", "clinical/models") \
.setInputCols(["sbert_embeddings"]) \
.setOutputCol("rxnorm_code")\
.setDistanceFunction("EUCLIDEAN")

rxnorm_pipelineModel = PipelineModel(
    stages = [
        documentAssembler,
        sbert_embedder,
        rxnorm_resolver])

light_model = LightPipeline(rxnorm_pipelineModel)

result = light_model.fullAnnotate(["Coumadin 5 mg", "aspirin", "Neurontin 300", "avandia 4 mg"])
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("ner_chunk")

val sbert_embedder = BertSentenceEmbeddings.pretrained("sbluebert_base_uncased_mli", "en", "clinical/models")
.setInputCols(Array("ner_chunk"))
.setOutputCol("sbert_embeddings")

val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbluebertresolve_rxnorm_augmented_uncased", "en", "clinical/models")
.setInputCols(Array("sbert_embeddings"))
.setOutputCol("rxnorm_code")
.setDistanceFunction("EUCLIDEAN")

val rxnorm_pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, rxnorm_resolver))
val rxnorm_pipelineModel = rxnorm_pipeline.fit(Seq("").toDF("text"))

val light_model = new LightPipeline(rxnorm_pipelineModel)

val result = light_model.fullAnnotate(Array("Coumadin 5 mg", "aspirin", "Neurontin 300", "avandia 4 mg"))
```

{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.rxnorm.augmented_uncased").predict("""Coumadin 5 mg""")
```
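The resolver packs its top-k candidates into single `:::`-separated strings per prediction (the `all_k_results`, `all_k_resolutions`, and related columns shown in the Results table). A small helper to unpack those strings into ranked pairs — the example values are shortened from the first result row, and the helper itself is illustrative, not part of the library:

```python
# Unpack ":::"-delimited top-k resolver output into ranked (code, label) pairs.
def unpack_topk(codes, resolutions, sep=":::"):
    return list(zip(codes.split(sep), resolutions.split(sep)))

all_k_results = "855333:::432467:::438740"
all_k_resolutions = ("warfarin sodium 5 MG [Coumadin]:::"
                     "coumarin 5 MG Oral Tablet:::"
                     "warfarin sodium 5 MG Oral Tablet")
ranked = unpack_topk(all_k_results, all_k_resolutions)
print(ranked[0])  # -> ('855333', 'warfarin sodium 5 MG [Coumadin]')
```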
## Results ```bash | | RxNormCode | Resolution | all_k_results | all_k_distances | all_k_cosine_distances | all_k_resolutions | all_k_aux_labels | |---:|-------------:|:-------------------------------------------|:----------------------------------|:----------------------------------|:----------------------------------|:----------------------------------------------------------------|:----------------------------------| | 0 | 855333 | warfarin sodium 5 MG [Coumadin] | 855333:::432467:::438740:::855... | 0.0000:::1.6841:::1.6841:::3.2... | 0.0000:::0.0062:::0.0062:::0.0... | warfarin sodium 5 MG [Coumadin]:::coumarin 5 MG Oral Tablet:... | Branded Drug Comp:::Clinical D... | | 1 | 1537020 | aspirin Effervescent Oral Tablet | 1537020:::1191:::405403:::1001... | 0.0000:::0.0000:::6.0493:::6.4... | 0.0000:::0.0000:::0.0797:::0.0... | aspirin Effervescent Oral Tablet:::aspirin:::YSP Aspirin:::E... | Clinical Drug Form:::Ingredien... | | 2 | 105029 | gabapentin 300 MG Oral Capsule [Neurontin] | 105029:::1098609:::207088:::20... | 3.1683:::6.0071:::6.2050:::6.2... | 0.0227:::0.0815:::0.0862:::0.0... | gabapentin 300 MG Oral Capsule [Neurontin]:::lamotrigine 300... | Branded Drug:::Branded Drug Co... | | 3 | 261242 | rosiglitazone 4 MG Oral Tablet [Avandia] | 261242:::847706:::577784:::212... | 0.0000:::6.8783:::6.9828:::7.4... | 0.0000:::0.1135:::0.1183:::0.1... | rosiglitazone 4 MG Oral Tablet [Avandia]:::glimepiride 4 MG ... | Branded Drug:::Branded Drug Co... 
|
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sbluebertresolve_rxnorm_augmented_uncased|
|Compatibility:|Healthcare NLP 3.3.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[rxnorm_code]|
|Language:|en|
|Size:|978.4 MB|
|Case sensitive:|false|
---
layout: model
title: English RobertaForQuestionAnswering (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_0
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-16-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_0_en_4.0.0_3.0_1655731514790.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_0_en_4.0.0_3.0_1655731514790.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_seed_0").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|415.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-16-finetuned-squad-seed-0
---
layout: model
title: English T5ForConditionalGeneration Large Cased model (from google)
author: John Snow Labs
name: t5_efficient_large_nl16
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-large-nl16` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_large_nl16_en_4.3.0_3.0_1675117543768.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_large_nl16_en_4.3.0_3.0_1675117543768.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_large_nl16","en") \
.setInputCols("document") \
.setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_large_nl16","en")
.setInputCols("document")
.setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_large_nl16|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|1.0 GB|

## References

- https://huggingface.co/google/t5-efficient-large-nl16
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: Japanese Bert Embeddings (Large)
author: John Snow Labs
name: bert_embeddings_bert_large_japanese
date: 2022-04-11
tags: [bert, embeddings, ja, open_source]
task: Embeddings
language: ja
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-japanese` is a Japanese model originally trained by `cl-tohoku`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_japanese_ja_3.4.2_3.0_1649674459280.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_japanese_ja_3.4.2_3.0_1649674459280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_japanese","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["私はSpark NLPを愛しています"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_japanese","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("私はSpark NLPを愛しています").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ja.embed.bert_large_japanese").predict("""私はSpark NLPを愛しています""") ```
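The pipeline above produces one embedding vector per token in the `embeddings` column. Such vectors are typically compared with cosine similarity; a minimal, framework-independent sketch (the three-dimensional vectors are made up — real BERT-large embeddings have 1024 dimensions):

```python
import math

# Cosine similarity between two embedding vectors: the dot product of the
# vectors divided by the product of their norms. Identical directions
# score 1.0, orthogonal directions score 0.0.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

v1 = [0.2, 0.1, 0.4]
v2 = [0.2, 0.1, 0.4]
print(round(cosine(v1, v2), 2))  # -> 1.0
```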
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_large_japanese|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ja|
|Size:|1.3 GB|
|Case sensitive:|true|

## References

- https://huggingface.co/cl-tohoku/bert-large-japanese
- https://github.com/google-research/bert
- https://pypi.org/project/unidic-lite/
- https://github.com/cl-tohoku/bert-japanese/tree/v2.0
- https://taku910.github.io/mecab/
- https://github.com/neologd/mecab-ipadic-neologd
- https://github.com/polm/fugashi
- https://github.com/polm/unidic-lite
- https://www.tensorflow.org/tfrc/
- https://creativecommons.org/licenses/by-sa/3.0/
---
layout: model
title: English asr_wav2vec2_base_timit_ali_hasan_colab_EX2 TFWav2Vec2ForCTC from ali221000262
author: John Snow Labs
name: asr_wav2vec2_base_timit_ali_hasan_colab_EX2
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_ali_hasan_colab_EX2` is an English model originally trained by ali221000262.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_ali_hasan_colab_EX2_gpu instead.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_ali_hasan_colab_EX2_en_4.2.0_3.0_1664038518590.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_ali_hasan_colab_EX2_en_4.2.0_3.0_1664038518590.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_ali_hasan_colab_EX2", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_ali_hasan_colab_EX2", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_ali_hasan_colab_EX2|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|354.9 MB|
---
layout: model
title: Ganda asr_wav2vec2_luganda_by_birgermoell TFWav2Vec2ForCTC from birgermoell
author: John Snow Labs
name: asr_wav2vec2_luganda_by_birgermoell
date: 2022-09-24
tags: [wav2vec2, lg, audio, open_source, asr]
task: Automatic Speech Recognition
language: lg
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_luganda_by_birgermoell` is a Ganda model originally trained by birgermoell.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_luganda_by_birgermoell_gpu instead.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_luganda_by_birgermoell_lg_4.2.0_3.0_1664037165332.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_luganda_by_birgermoell_lg_4.2.0_3.0_1664037165332.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_luganda_by_birgermoell", "lg")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_luganda_by_birgermoell", "lg") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_luganda_by_birgermoell|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|lg|
|Size:|1.2 GB|
---
layout: model
title: English BertForQuestionAnswering model (from SauravMaheshkar)
author: John Snow Labs
name: bert_qa_bert_multi_uncased_finetuned_chaii
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-multi-uncased-finetuned-chaii` is an English model originally trained by `SauravMaheshkar`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_multi_uncased_finetuned_chaii_en_4.0.0_3.0_1654184608840.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_multi_uncased_finetuned_chaii_en_4.0.0_3.0_1654184608840.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_multi_uncased_finetuned_chaii","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_multi_uncased_finetuned_chaii","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.chaii.bert.uncased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_multi_uncased_finetuned_chaii|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|626.1 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/SauravMaheshkar/bert-multi-uncased-finetuned-chaii
---
layout: model
title: English BertForQuestionAnswering model (from Neulvo)
author: John Snow Labs
name: bert_qa_Neulvo_bert_finetuned_squad
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `Neulvo`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Neulvo_bert_finetuned_squad_en_4.0.0_3.0_1654535543683.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Neulvo_bert_finetuned_squad_en_4.0.0_3.0_1654535543683.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Neulvo_bert_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_Neulvo_bert_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_Neulvo").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_Neulvo_bert_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/Neulvo/bert-finetuned-squad
---
layout: model
title: Fast Neural Machine Translation Model from English to Welsh
author: John Snow Labs
name: opus_mt_en_cy
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, cy, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.

- source languages: `en`
- target languages: `cy`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_cy_xx_2.7.0_2.4_1609170126158.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_cy_xx_2.7.0_2.4_1609170126158.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_cy", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_cy", "xx")
.setInputCols("sentence")
.setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.cy').predict(text, output_level='sentence')
opus_df
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_cy| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English asr_wav2vec2_cetuc_sid_voxforge_mls_1 TFWav2Vec2ForCTC from joaoalvarenga author: John Snow Labs name: pipeline_asr_wav2vec2_cetuc_sid_voxforge_mls_1 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_cetuc_sid_voxforge_mls_1` is an English model originally trained by joaoalvarenga. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_cetuc_sid_voxforge_mls_1_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_cetuc_sid_voxforge_mls_1_en_4.2.0_3.0_1664023274170.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_cetuc_sid_voxforge_mls_1_en_4.2.0_3.0_1664023274170.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_cetuc_sid_voxforge_mls_1', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_cetuc_sid_voxforge_mls_1", lang = "en") val annotations = pipeline.transform(audioDF) ```
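The Wav2Vec2ForCTC stage inside this pipeline emits one token prediction per audio frame, and CTC decoding collapses those frames by merging consecutive repeats and dropping blank tokens. A minimal, framework-independent sketch of that greedy collapse (the frame IDs and blank ID below are invented for illustration, not the model's real vocabulary):

```python
# Greedy CTC collapse: merge runs of the same id, then drop the blank token.
# Frame ids and blank_id are made-up illustrative values.
def ctc_collapse(frame_ids, blank_id=0):
    collapsed = []
    prev = None
    for i in frame_ids:
        if i != prev:  # merge consecutive duplicates
            collapsed.append(i)
        prev = i
    return [i for i in collapsed if i != blank_id]  # drop blanks

# A blank between two identical ids preserves the repeat (the two 12s survive).
print(ctc_collapse([8, 8, 5, 0, 12, 0, 12, 15]))  # [8, 5, 12, 12, 15]
```

This is only the decoding idea; the pipeline performs it internally, so no extra step is needed when calling `transform`.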
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_cetuc_sid_voxforge_mls_1| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering Cased model (from Slavka) author: John Snow Labs name: bert_qa_xdzadi00_based_v4 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xdzadi00_bert-based-v4` is an English model originally trained by `Slavka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_xdzadi00_based_v4_en_4.0.0_3.0_1657193206660.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_xdzadi00_based_v4_en_4.0.0_3.0_1657193206660.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_xdzadi00_based_v4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_xdzadi00_based_v4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
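Extractive QA models such as this one score every context token as a candidate answer start and answer end; the predicted answer is the span whose combined start and end scores are highest, subject to start ≤ end. A self-contained sketch of that span selection with made-up logits (the values and `max_len` cap are illustrative, not the model's real outputs):

```python
# Pick the best (start, end) token span from start/end logits (toy values).
def best_span(start_logits, end_logits, max_len=15):
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        # only consider spans of at most max_len tokens starting at s
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

start = [0.1, 2.0, 0.3, 0.2]
end = [0.0, 0.1, 1.8, 0.4]
print(best_span(start, end))  # (1, 2): tokens 1..2 form the answer
```

The annotator returns the decoded text of that span in the `answer` column rather than raw indices.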
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_xdzadi00_based_v4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Slavka/xdzadi00_bert-based-v4 --- layout: model title: Legal Revolving Credit Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_revolving_credit_agreement_bert date: 2022-12-06 tags: [en, legal, classification, agreement, revolving, credit, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_revolving_credit_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `revolving-credit-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `revolving-credit-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_revolving_credit_agreement_bert_en_1.0.0_3.0_1670349861196.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_revolving_credit_agreement_bert_en_1.0.0_3.0_1670349861196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_revolving_credit_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[revolving-credit-agreement]| |[other]| |[other]| |[revolving-credit-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_revolving_credit_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.94 0.94 0.94 35 revolving-credit-agreement 0.93 0.93 0.93 29 accuracy - - 0.94 64 macro-avg 0.94 0.94 0.94 64 weighted-avg 0.94 0.94 0.94 64 ``` --- layout: model title: NER Pipeline for Clinical Department - Voice of the Patient author: John Snow Labs name: ner_vop_clinical_dept_pipeline date: 2023-06-10 tags: [licensed, ner, en, vop, pipeline] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline extracts mentions of clinical departments and medical devices from health-related text in colloquial language. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_clinical_dept_pipeline_en_4.4.3_3.0_1686406177343.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_clinical_dept_pipeline_en_4.4.3_3.0_1686406177343.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_vop_clinical_dept_pipeline", "en", "clinical/models") pipeline.annotate(" My little brother is having surgery tomorrow in the orthopedic department. He is getting a titanium plate put in his leg to help it heal faster. Wishing him a speedy recovery! ") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_vop_clinical_dept_pipeline", "en", "clinical/models") val result = pipeline.annotate(" My little brother is having surgery tomorrow in the orthopedic department. He is getting a titanium plate put in his leg to help it heal faster. Wishing him a speedy recovery! ") ```
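The final NerConverter stage of this pipeline merges the NER model's token-level BIO tags into the entity chunks shown in the results. A self-contained sketch of that merge step (tokens and tags are invented for illustration, and label changes inside an `I-` run are ignored for simplicity):

```python
# Merge BIO tags into (chunk, label) pairs, NerConverter-style (simplified).
def bio_to_chunks(tokens, tags):
    chunks, cur_toks, cur_label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new chunk begins
            if cur_toks:
                chunks.append((" ".join(cur_toks), cur_label))
            cur_toks, cur_label = [tok], tag[2:]
        elif tag.startswith("I-") and cur_toks:
            cur_toks.append(tok)          # continue the open chunk
        else:                             # "O" closes any open chunk
            if cur_toks:
                chunks.append((" ".join(cur_toks), cur_label))
            cur_toks, cur_label = [], None
    if cur_toks:
        chunks.append((" ".join(cur_toks), cur_label))
    return chunks

tokens = ["the", "orthopedic", "department", "and", "a", "titanium", "plate"]
tags = ["O", "B-ClinicalDept", "I-ClinicalDept", "O", "O", "B-MedicalDevice", "I-MedicalDevice"]
print(bio_to_chunks(tokens, tags))
# [('orthopedic department', 'ClinicalDept'), ('titanium plate', 'MedicalDevice')]
```

The pretrained pipeline does this internally, so `annotate` already returns merged chunks.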
## Results ```bash | chunk | ner_label | |:----------------------|:--------------| | orthopedic department | ClinicalDept | | titanium plate | MedicalDevice | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_clinical_dept_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|791.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English BertForQuestionAnswering Tiny Cased model (from MichelBartels) author: John Snow Labs name: bert_qa_tinybert_6l_768d_squad2_large_teach date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tinybert-6l-768d-squad2-large-teacher` is an English model originally trained by `MichelBartels`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_tinybert_6l_768d_squad2_large_teach_en_4.0.0_3.0_1657192953672.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_tinybert_6l_768d_squad2_large_teach_en_4.0.0_3.0_1657192953672.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tinybert_6l_768d_squad2_large_teach","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tinybert_6l_768d_squad2_large_teach","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_tinybert_6l_768d_squad2_large_teach| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|249.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/MichelBartels/tinybert-6l-768d-squad2-large-teacher --- layout: model title: Multilingual BertForQuestionAnswering Cased model (from roshnir) author: John Snow Labs name: bert_qa_mbert_finetuned_mlqa_zh_hi_dev date: 2022-07-07 tags: [xx, open_source, bert, question_answering] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mBert-finetuned-mlqa-dev-zh-hi` is a Multilingual model originally trained by `roshnir`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_zh_hi_dev_xx_4.0.0_3.0_1657190323203.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_zh_hi_dev_xx_4.0.0_3.0_1657190323203.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_zh_hi_dev","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_zh_hi_dev","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_mbert_finetuned_mlqa_zh_hi_dev| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|626.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/roshnir/mBert-finetuned-mlqa-dev-zh-hi --- layout: model title: English T5ForConditionalGeneration Mini Cased model (from google) author: John Snow Labs name: t5_efficient_mini_nl8 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-mini-nl8` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_mini_nl8_en_4.3.0_3.0_1675118402070.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_mini_nl8_en_4.3.0_3.0_1675118402070.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_mini_nl8","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_mini_nl8","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_mini_nl8| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|120.1 MB| ## References - https://huggingface.co/google/t5-efficient-mini-nl8 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Legal Sublicense Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_sublicense_agreement_bert date: 2023-01-29 tags: [en, legal, classification, sublicense, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_sublicense_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `sublicense-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `sublicense-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_sublicense_agreement_bert_en_1.0.0_3.0_1674990525969.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_sublicense_agreement_bert_en_1.0.0_3.0_1674990525969.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_sublicense_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
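In the benchmarking table reported for this model, each per-class F1 score is the harmonic mean of that class's precision and recall, and the macro average is the unweighted mean across classes. A quick plain-Python check against the reported numbers:

```python
# F1 = harmonic mean of precision and recall; inputs taken from the
# benchmarking table of this card.
def f1(p, r):
    return 2 * p * r / (p + r)

print(round(f1(0.93, 0.99), 2))  # 0.96  (other)
print(round(f1(0.97, 0.83), 2))  # 0.89  (sublicense-agreement)
print(round((0.93 + 0.97) / 2, 2))  # 0.95  (macro-avg precision)
```

The weighted average additionally scales each class by its support (99 and 41 documents here).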
## Results ```bash +-------+ |result| +-------+ |[sublicense-agreement]| |[other]| |[other]| |[sublicense-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_sublicense_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.3 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.93 0.99 0.96 99 sublicense-agreement 0.97 0.83 0.89 41 accuracy - - 0.94 140 macro-avg 0.95 0.91 0.93 140 weighted-avg 0.94 0.94 0.94 140 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Sivakumar) author: John Snow Labs name: distilbert_qa_sivakumar_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Sivakumar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_sivakumar_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769288959.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_sivakumar_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769288959.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sivakumar_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sivakumar_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_sivakumar_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Sivakumar/distilbert-base-uncased-finetuned-squad --- layout: model title: Fast Neural Machine Translation Model from Finno-Ugrian Languages to English author: John Snow Labs name: opus_mt_fiu_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, fiu, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `fiu` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_fiu_en_xx_2.7.0_2.4_1609164957344.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_fiu_en_xx_2.7.0_2.4_1609164957344.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_fiu_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_fiu_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.fiu.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_fiu_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: OCR Small for Printed Text author: John Snow Labs name: ocr_small_printed date: 2023-01-10 tags: [en, licensed] task: OCR Text Detection & Recognition language: en nav_key: models edition: Visual NLP 3.3.3 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description OCR small model for recognizing printed text, based on the TrOCR architecture. The TrOCR model was proposed in TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to perform optical character recognition (OCR). The abstract from the paper is the following: Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. 
Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/ocr/RECOGNIZE_PRINTED/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-ocr-workshop/blob/master/tutorials/Certification_Trainings/1.3.Trasformer_based_Text_Recognition.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/ocr_small_printed_en_3.3.3_2.4_1645007455031.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/ocr_small_printed_en_3.3.3_2.4_1645007455031.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python binary_to_image = BinaryToImage() \ .setInputCol("content") \ .setOutputCol("image") \ .setImageType(ImageType.TYPE_3BYTE_BGR) text_detector = ImageTextDetectorV2 \ .pretrained("image_text_detector_v2", "en", "clinical/ocr") \ .setInputCol("image") \ .setOutputCol("text_regions") \ .setWithRefiner(False) \ .setSizeThreshold(15) \ .setLinkThreshold(0.3) draw_regions = ImageDrawRegions() \ .setInputCol("image") \ .setInputRegionsCol("text_regions") \ .setOutputCol("image_with_regions") \ .setRectColor(Color.green) \ .setRotated(True) ocr = ImageToTextV2.pretrained("ocr_small_printed", "en", "clinical/ocr") \ .setInputCols(["image", "text_regions"]) \ .setOutputCol("hocr") \ .setOutputFormat(OcrOutputFormat.HOCR) \ .setGroupImages(False) pipeline = PipelineModel(stages=[ binary_to_image, text_detector, draw_regions, ocr ]) image_path = pkg_resources.resource_filename('sparkocr', 'resources/ocr/images/check.jpg') image_example_df = spark.read.format("binaryFile").load(image_path) result = pipeline.transform(image_example_df).cache() ``` ```scala val binary_to_image = new BinaryToImage() .setInputCol("content") .setOutputCol("image") .setImageType(ImageType.TYPE_3BYTE_BGR) val text_detector = ImageTextDetectorV2.pretrained("image_text_detector_v2", "en", "clinical/ocr") .setInputCol("image") .setOutputCol("text_regions") .setWithRefiner(false) .setSizeThreshold(15) .setLinkThreshold(0.3) val draw_regions = new ImageDrawRegions() .setInputCol("image") .setInputRegionsCol("text_regions") .setOutputCol("image_with_regions") .setRectColor(Color.green) .setRotated(true) val ocr = ImageToTextV2.pretrained("ocr_small_printed", "en", "clinical/ocr") .setInputCols("image", "text_regions") .setOutputCol("hocr") .setOutputFormat(OcrOutputFormat.HOCR) .setGroupImages(false) val pipeline = new Pipeline().setStages(Array(binary_to_image, text_detector, draw_regions, ocr)) val image_path = "path/to/check.jpg" // point this at a local sample image val image_example_df = spark.read.format("binaryFile").load(image_path) val result = pipeline.fit(image_example_df).transform(image_example_df).cache() ```
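With `setOutputFormat(OcrOutputFormat.HOCR)`, the recognized text arrives as hOCR markup (HTML with per-word `ocrx_word` spans) rather than plain text. A minimal, framework-independent sketch of pulling the words back out of such a string (the sample fragment and its bounding boxes are invented for illustration):

```python
import re

# Extract recognized words from an hOCR fragment (toy input below).
def hocr_words(hocr):
    return re.findall(
        r"<span[^>]*class=['\"]ocrx_word['\"][^>]*>([^<]+)</span>", hocr
    )

sample = (
    "<span class='ocrx_word' title='bbox 10 10 90 30'>STARBUCKS</span> "
    "<span class='ocrx_word' title='bbox 100 10 150 30'>STORE</span>"
)
print(hocr_words(sample))  # ['STARBUCKS', 'STORE']
```

A real hOCR document also carries line and paragraph structure plus bounding boxes in the `title` attributes; an HTML parser is the sturdier choice for production use.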
## Example {%- capture input_image -%} ![Screenshot](/assets/images/examples_ocr/image2.png) {%- endcapture -%} {%- capture output_image -%} ![Screenshot](/assets/images/examples_ocr/image2_out.png) {%- endcapture -%} {% include templates/input_output_image.md input_image=input_image output_image=output_image %} ## Output text ```bash STARBUCKS STORE #10208 11302 EUCLID AVENUE CLEVELAND OH (216) 229-0749 CHK 664290 12/07/2014 06:43 PM 1912003 DRAMER: 2 REG: 2 VT PEP MOCHA 4.95 SBUX CARD 4.95 XXXXXXXXXXXX3228 SUBTOTAL $4.95 TOTAL $4.95 CHANGE DUE $0.00 CHECK CLOSED 12/07/2014 06:43 PM SBUX CARD X3228 NEW BALANCE: 37.45 CARD IS REGISTERED ``` ## Model Information {:.table-model} |---|---| |Model Name:|ocr_small_printed| |Type:|ocr| |Compatibility:|Visual NLP 3.3.3+| |License:|Licensed| |Edition:|Official| |Language:|en| --- layout: model title: Chinese Lemmatizer author: John Snow Labs name: lemma date: 2020-05-04 20:11:00 +0800 task: Lemmatization language: zh edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [lemmatizer, zh] supported: true annotator: LemmatizerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. 
{:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_zh_2.5.0_2.4_1588611649140.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_zh_2.5.0_2.4_1588611649140.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "zh") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("除了担任北方国王之外,约翰·斯诺(John Snow)是一位英国医师,也是麻醉和医疗卫生发展的领导者。") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "zh") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("除了担任北方国王之外,约翰·斯诺(John Snow)是一位英国医师,也是麻醉和医疗卫生发展的领导者。").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""除了担任北方国王之外,约翰·斯诺(John Snow)是一位英国医师,也是麻醉和医疗卫生发展的领导者。"""] lemma_df = nlu.load('zh.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=20, result='除了担任北方国王之外,约翰·斯诺(John', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=22, end=50, result='Snow)是一位英国医师,也是麻醉和医疗卫生发展的领导者。', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|zh| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Detect Living Species author: John Snow Labs name: bert_token_classifier_ner_living_species date: 2022-06-27 tags: [it, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: it edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract living species from clinical texts in Italian which is critical to scientific disciplines like medicine, biology, ecology/biodiversity, nutrition and agriculture. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. It is trained on the [LivingNER](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) corpus that is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. **NOTE :** 1. The text files were translated from Spanish with a neural machine translation system. 2. The annotations were translated with the same neural machine translation system. 3. The translated annotations were transferred to the translated text files using an annotation transfer technology. 
## Predicted Entities `HUMAN`, `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_it_3.5.3_3.0_1656318503132.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_it_3.5.3_3.0_1656318503132.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_living_species", "it", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, ner_model, ner_converter ]) data = spark.createDataFrame([["""Una donna di 74 anni è stata ricoverata con dolore addominale diffuso, ipossia e astenia di 2 settimane di evoluzione. La sua storia personale includeva ipertensione in trattamento con amiloride/idroclorotiazide e dislipidemia controllata con lovastatina. La sua storia familiare era: madre morta di cancro gastrico, fratello con cirrosi epatica di eziologia sconosciuta e sorella con carcinoma epatocellulare. Lo studio eziologico delle diverse cause di malattia epatica cronica comprendeva: virus epatotropi (HBV, HCV) e HIV, studio dell'autoimmunità, ceruloplasmina, ferritina e porfirine nelle urine, tutti risultati negativi. 
Il paziente è stato messo in trattamento anticoagulante con acenocumarolo e diuretici a tempo indeterminato."""]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_living_species", "it", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("ner")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, ner_model, ner_converter))

val data = Seq("""Una donna di 74 anni è stata ricoverata con dolore addominale diffuso, ipossia e astenia di 2 settimane di evoluzione. La sua storia personale includeva ipertensione in trattamento con amiloride/idroclorotiazide e dislipidemia controllata con lovastatina. La sua storia familiare era: madre morta di cancro gastrico, fratello con cirrosi epatica di eziologia sconosciuta e sorella con carcinoma epatocellulare. Lo studio eziologico delle diverse cause di malattia epatica cronica comprendeva: virus epatotropi (HBV, HCV) e HIV, studio dell'autoimmunità, ceruloplasmina, ferritina e porfirine nelle urine, tutti risultati negativi.
Il paziente è stato messo in trattamento anticoagulante con acenocumarolo e diuretici a tempo indeterminato.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("it.classify.bert_token.ner_living_species").predict("""Una donna di 74 anni è stata ricoverata con dolore addominale diffuso, ipossia e astenia di 2 settimane di evoluzione. La sua storia personale includeva ipertensione in trattamento con amiloride/idroclorotiazide e dislipidemia controllata con lovastatina. La sua storia familiare era: madre morta di cancro gastrico, fratello con cirrosi epatica di eziologia sconosciuta e sorella con carcinoma epatocellulare. Lo studio eziologico delle diverse cause di malattia epatica cronica comprendeva: virus epatotropi (HBV, HCV) e HIV, studio dell'autoimmunità, ceruloplasmina, ferritina e porfirine nelle urine, tutti risultati negativi. Il paziente è stato messo in trattamento anticoagulante con acenocumarolo e diuretici a tempo indeterminato.""") ```
## Results ```bash +----------------+-------+ |ner_chunk |label | +----------------+-------+ |donna |HUMAN | |personale |HUMAN | |madre |HUMAN | |fratello |HUMAN | |sorella |HUMAN | |virus epatotropi|SPECIES| |HBV |SPECIES| |HCV |SPECIES| |HIV |SPECIES| |paziente |HUMAN | +----------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_living_species| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|it| |Size:|410.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References [https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) ## Benchmarking ```bash label precision recall f1-score support B-HUMAN 0.78 0.96 0.86 2772 B-SPECIES 0.62 0.89 0.73 2866 I-HUMAN 0.44 0.63 0.52 101 I-SPECIES 0.57 0.83 0.67 1039 micro-avg 0.67 0.91 0.77 6778 macro-avg 0.60 0.83 0.70 6778 weighted-avg 0.68 0.91 0.77 6778 ``` --- layout: model title: Word2Vec Embeddings in Bangla (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, bn, open_source] task: Embeddings language: bn edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_bn_3.4.1_3.0_1647286875201.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_bn_3.4.1_3.0_1647286875201.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","bn") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["আমি স্পার্ক এনএলপি ভালোবাসি"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","bn") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("আমি স্পার্ক এনএলপি ভালোবাসি").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("bn.embed.w2v_cc_300d").predict("""আমি স্পার্ক এনএলপি ভালোবাসি""") ```
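Embedding vectors such as the 300-dimensional ones produced by `w2v_cc_300d` are commonly compared with cosine similarity. A self-contained sketch using made-up low-dimensional vectors in place of real embeddings:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product normalized by the vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-d vectors standing in for real 300-d embeddings.
v_king = [0.9, 0.1, 0.3]
v_queen = [0.85, 0.15, 0.35]
v_banana = [0.1, 0.9, 0.0]

print(cosine(v_king, v_queen) > cosine(v_king, v_banana))  # related words score higher
```

With real embeddings the same comparison supports nearest-neighbor lookups, clustering, and semantic-similarity features downstream.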
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|bn| |Size:|859.5 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_4 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-128-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_4_en_4.0.0_3.0_1654191388378.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_4_en_4.0.0_3.0_1654191388378.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_128d_seed_4").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
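Extractive question-answering models such as the one above score every context token as a candidate answer start and end, then select the best valid span. A toy illustration with hand-picked scores (not real model logits):

```python
# One score per context token for being the answer start / end.
context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.2, 0.1, 3.0, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]
end_scores   = [0.1, 0.1, 0.1, 2.5, 0.1, 0.1, 0.1, 0.1, 0.7, 0.1]

# Pick the highest-scoring start, then the highest-scoring end
# that does not precede the start token.
start = max(range(len(context)), key=start_scores.__getitem__)
end = max(range(start, len(context)), key=end_scores.__getitem__)

answer = " ".join(context[start:end + 1])
print(answer)
```

For the question "What's my name?" the best start and end both land on "Clara", which matches the span returned in the `answer` column above.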
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|381.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-128-finetuned-squad-seed-4 --- layout: model title: English asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4 TFWav2Vec2ForCTC from chrisvinsen author: John Snow Labs name: pipeline_asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4` is an English model originally trained by chrisvinsen. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4_en_4.2.0_3.0_1664103715058.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4_en_4.2.0_3.0_1664103715058.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_xlsr_wav2vec2_base_commonvoice_demo_colab_4| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Fast Neural Machine Translation Model from Chinyanja to English author: John Snow Labs name: opus_mt_ny_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ny, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `ny` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ny_en_xx_2.7.0_2.4_1609170818337.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ny_en_xx_2.7.0_2.4_1609170818337.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_ny_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols("document")
  .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_ny_en", "xx")
  .setInputCols("sentence")
  .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.ny.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ny_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Taxation Document Classifier (EURLEX) author: John Snow Labs name: legclf_taxation_bert date: 2023-03-06 tags: [en, legal, classification, clauses, taxation, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_taxation_bert model, a Bert Sentence Embeddings Document Classifier, classifies whether the document belongs to the Taxation class or not (binary classification) according to EuroVoc labels. ## Predicted Entities `Taxation`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_taxation_bert_en_1.0.0_3.0_1678111555574.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_taxation_bert_en_1.0.0_3.0_1678111555574.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_taxation_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results
```bash
+----------+
|result    |
+----------+
|[Taxation]|
|[Other]   |
|[Other]   |
|[Taxation]|
+----------+
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_taxation_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.86 0.94 0.90 96 Taxation 0.93 0.85 0.89 100 accuracy - - 0.89 196 macro-avg 0.90 0.89 0.89 196 weighted-avg 0.90 0.89 0.89 196 ``` --- layout: model title: Pipeline to Extract Biomarkers and their Results author: John Snow Labs name: ner_oncology_biomarker_pipeline date: 2023-03-09 tags: [licensed, clinical, en, ner, oncology, biomarker] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_oncology_biomarker](https://nlp.johnsnowlabs.com/2022/11/24/ner_oncology_biomarker_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_biomarker_pipeline_en_4.3.0_3.2_1678345026649.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_biomarker_pipeline_en_4.3.0_3.2_1678345026649.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_oncology_biomarker_pipeline", "en", "clinical/models") text = '''The results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA), Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87%.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_oncology_biomarker_pipeline", "en", "clinical/models") val text = "The results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA), Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87%." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-----------------------------------------|--------:|------:|:-----------------|-------------:| | 0 | negative | 70 | 77 | Biomarker_Result | 0.9984 | | 1 | CK7 | 83 | 85 | Biomarker | 1 | | 2 | synaptophysin | 88 | 100 | Biomarker | 0.9995 | | 3 | Syn | 103 | 105 | Biomarker | 0.9979 | | 4 | chromogranin A | 109 | 122 | Biomarker | 0.9692 | | 5 | CgA | 125 | 127 | Biomarker | 0.9994 | | 6 | Muc5AC | 131 | 136 | Biomarker | 0.9987 | | 7 | human epidermal growth factor receptor-2 | 139 | 178 | Biomarker | 0.99876 | | 8 | HER2 | 181 | 184 | Biomarker | 1 | | 9 | Muc6 | 192 | 195 | Biomarker | 0.9999 | | 10 | positive | 198 | 205 | Biomarker_Result | 0.9987 | | 11 | CK20 | 211 | 214 | Biomarker | 1 | | 12 | Muc1 | 217 | 220 | Biomarker | 0.9999 | | 13 | Muc2 | 223 | 226 | Biomarker | 0.9999 | | 14 | E-cadherin | 229 | 238 | Biomarker | 0.9999 | | 15 | p53 | 245 | 247 | Biomarker | 1 | | 16 | Ki-67 index | 254 | 264 | Biomarker | 0.99465 | | 17 | 87% | 276 | 278 | Biomarker_Result | 0.9814 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_biomarker_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English RobertaForQuestionAnswering (from thatdramebaazguy) author: John Snow Labs name: roberta_qa_movie_roberta_MITmovie_squad date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging 
Face and curated to provide scalability and production-readiness using Spark NLP. `movie-roberta-MITmovie-squad` is an English model originally trained by `thatdramebaazguy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_movie_roberta_MITmovie_squad_en_4.0.0_3.0_1655729031815.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_movie_roberta_MITmovie_squad_en_4.0.0_3.0_1655729031815.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_movie_roberta_MITmovie_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_movie_roberta_MITmovie_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.mitmovie_squad.roberta.by_thatdramebaazguy").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_movie_roberta_MITmovie_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|466.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/thatdramebaazguy/movie-roberta-MITmovie-squad --- layout: model title: Multilingual XLMRobertaForTokenClassification Base Cased model (from Davlan) author: John Snow Labs name: xlmroberta_ner_base_masakhan date: 2022-08-01 tags: [xx, open_source, xlm_roberta, ner] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-masakhaner` is a Multilingual model originally trained by `Davlan`. ## Predicted Entities `DATE`, `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_masakhan_xx_4.1.0_3.0_1659356472350.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_masakhan_xx_4.1.0_3.0_1659356472350.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_masakhan","xx") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_masakhan","xx")
  .setInputCols(Array("document", "token"))
  .setOutputCol("ner")

val ner_converter = new NerConverter()
  .setInputCols(Array("document", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_masakhan| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|801.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Davlan/xlm-roberta-base-masakhaner - https://github.com/masakhane-io/masakhane-ner - https://arxiv.org/abs/2103.11811 --- layout: model title: Financial English Bert Embeddings (Base, Communication texts) author: John Snow Labs name: bert_embeddings_finbert_pretrain_yiyanghkust date: 2022-04-11 tags: [bert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Financial English Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `finbert-pretrain-yiyanghkust` is an English model originally available on Hugging Face as `yiyanghkust/finbert-pretrain`. It was trained on the following datasets: - Corporate Reports 10-K & 10-Q: 2.5B tokens - Earnings Call Transcripts: 1.3B tokens - Analyst Reports: 1.1B tokens {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_finbert_pretrain_yiyanghkust_en_3.4.2_3.0_1649672533001.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_finbert_pretrain_yiyanghkust_en_3.4.2_3.0_1649672533001.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_finbert_pretrain_yiyanghkust","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_finbert_pretrain_yiyanghkust","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.finbert_pretrain_yiyanghkust").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_finbert_pretrain_yiyanghkust| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|412.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/yiyanghkust/finbert-pretrain - https://arxiv.org/abs/2006.08097 --- layout: model title: Maltese CamemBert Embeddings (from fenrhjen) author: John Snow Labs name: camembert_embeddings_camembert_aux_amandes date: 2022-05-31 tags: [mt, open_source, camembert, embeddings] task: Embeddings language: mt edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `camembert_aux_amandes` is a Maltese model originally trained by `fenrhjen`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_camembert_aux_amandes_mt_3.4.4_3.0_1653985705496.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_camembert_aux_amandes_mt_3.4.4_3.0_1653985705496.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_camembert_aux_amandes","mt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I Love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_camembert_aux_amandes","mt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I Love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_camembert_aux_amandes| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|mt| |Size:|415.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/fenrhjen/camembert_aux_amandes --- layout: model title: Spanish RoBERTa Embeddings (Base, Stepwise) author: John Snow Labs name: roberta_embeddings_bertin_base_stepwise date: 2022-04-14 tags: [roberta, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bertin-base-stepwise` is a Spanish model originally trained by `bertin-project`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_bertin_base_stepwise_es_3.4.2_3.0_1649946195172.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_bertin_base_stepwise_es_3.4.2_3.0_1649946195172.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_bertin_base_stepwise","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_bertin_base_stepwise","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.bertin_base_stepwise").predict("""Me encanta chispa nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_bertin_base_stepwise| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|234.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/bertin-project/bertin-base-stepwise --- layout: model title: Fast Neural Machine Translation Model from English to Indo-Iranian Languages author: John Snow Labs name: opus_mt_en_iir date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, iir, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `iir` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_iir_xx_2.7.0_2.4_1609167251409.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_iir_xx_2.7.0_2.4_1609167251409.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_iir", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_iir", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.iir').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_iir| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_4 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-16-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_4_en_4.0.0_3.0_1657184537116.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_4_en_4.0.0_3.0_1657184537116.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-16-finetuned-squad-seed-4 --- layout: model title: Smaller BERT Embeddings (L-6_H-768_A-12) author: John Snow Labs name: small_bert_L6_768 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L6_768_en_2.6.0_2.4_1598345125237.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L6_768_en_2.6.0_2.4_1598345125237.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L6_768", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L6_768", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L6_768').predict(text, output_level='token') embeddings_df ```
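Downstream tasks often compare the resulting token vectors directly, for example with cosine similarity. The sketch below is a minimal pure-Python illustration; the short 4-dimensional vectors are made up to stand in for the 768-dimensional embeddings this model actually produces.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-d stand-ins for real 768-d token embeddings.
v_love = [0.25, -0.10, 0.40, 0.05]
v_like = [0.22, -0.08, 0.35, 0.07]
print(cosine(v_love, v_like))  # close to 1.0 for similar tokens
```

In practice the vectors would come from the `embeddings` column of the transformed DataFrame rather than from literals.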
{:.h2_title} ## Results ```bash token en_embed_bert_small_L6_768_embeddings I [-0.2927809953689575, -0.1677704006433487, -0.... love [-0.23247751593589783, 0.2524709105491638, -0.... NLP [0.03696206957101822, -0.11152120679616928, -0... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|small_bert_L6_768| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-768_A-12/1 --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_hady TFWav2Vec2ForCTC from hady author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_hady date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_hady` is an English model originally trained by hady.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_by_hady_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_hady_en_4.2.0_3.0_1664039200389.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_hady_en_4.2.0_3.0_1664039200389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_hady', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_hady", lang = "en") val annotations = pipeline.transform(audioDF) ```
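The snippet above assumes an `audioDF` containing raw audio samples as floats; how that DataFrame is built is not shown. One way to produce the float array — a sketch assuming 16-bit mono PCM WAV input and using only the Python standard library — is to decode the samples and normalize them to [-1.0, 1.0]:

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit mono PCM WAV file and return samples normalized to [-1.0, 1.0]."""
    with wave.open(path, "rb") as wav_file:
        assert wav_file.getsampwidth() == 2, "this sketch expects 16-bit PCM"
        frames = wav_file.readframes(wav_file.getnframes())
    # '<h' = little-endian signed 16-bit; divide by 32768 to normalize.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]
```

The resulting list can then be wrapped into a single-column Spark DataFrame (for example `spark.createDataFrame([[wav_to_floats("sample.wav")]]).toDF("audio_content")`) to serve as `audioDF`; the column name and exact schema should be checked against the Spark NLP documentation for your version.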
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_hady| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Recognize Entities DL Pipeline for Swedish - Small author: John Snow Labs name: entity_recognizer_sm date: 2021-03-22 tags: [open_source, swedish, entity_recognizer_sm, pipeline, sv] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: sv edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_sm is a pretrained pipeline that performs basic text processing steps and named entity recognition. It covers most of the common text processing tasks on your dataframe. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_sv_3.0.0_3.0_1616443122813.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_sv_3.0.0_3.0_1616443122813.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('entity_recognizer_sm', lang = 'sv') annotations = pipeline.fullAnnotate("Hej från John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("entity_recognizer_sm", lang = "sv") val result = pipeline.fullAnnotate("Hej från John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hej från John Snow Labs! "] result_df = nlu.load('sv.ner').predict(text) result_df ```
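The `entities` column this pipeline produces comes from grouping per-token IOB tags into chunks. The following pure-Python sketch illustrates that grouping logic (for explanation only — inside the pipeline, the NerConverter stage performs this step):

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (label, text) entity chunks."""
    chunks = []
    current = None  # (label, [tokens]) of the chunk being built
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:  # "O" tag, or an I- tag that does not continue the open chunk
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(toks)) for label, toks in chunks]

tokens = ["Hej", "från", "John", "Snow", "Labs!"]
tags = ["O", "O", "B-PER", "I-PER", "I-PER"]
print(iob_to_chunks(tokens, tags))  # [('PER', 'John Snow Labs!')]
```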
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:------------------------------|:-----------------------------|:-----------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Hej från John Snow Labs! '] | ['Hej från John Snow Labs!'] | ['Hej', 'från', 'John', 'Snow', 'Labs!'] | [[0.0306969992816448,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_sm| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|sv| --- layout: model title: German T5ForConditionalGeneration Small Cased model (from Shahm) author: John Snow Labs name: t5_small_german date: 2023-01-31 tags: [de, open_source, t5, tensorflow] task: Text Generation language: de edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-german` is a German model originally trained by `Shahm`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_german_de_4.3.0_3.0_1675126257283.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_german_de_4.3.0_3.0_1675126257283.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_german","de") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_german","de") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_german| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|de| |Size:|285.5 MB| ## References - https://huggingface.co/Shahm/t5-small-german - https://paperswithcode.com/sota?task=Summarization&dataset=mlsum+de --- layout: model title: English asr_wav2vec2_large_in_lm TFWav2Vec2ForCTC from crossdelenna author: John Snow Labs name: pipeline_asr_wav2vec2_large_in_lm date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_in_lm` is an English model originally trained by crossdelenna. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_in_lm_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_in_lm_en_4.2.0_3.0_1664103269999.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_in_lm_en_4.2.0_3.0_1664103269999.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_in_lm', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_in_lm", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_in_lm| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English DistilBERT Embeddings (90% sparse) author: John Snow Labs name: distilbert_embeddings_distilbert_base_uncased_sparse_90_unstructured_pruneofa date: 2022-04-12 tags: [distilbert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-uncased-sparse-90-unstructured-pruneofa` is an English model originally trained by `Intel`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_uncased_sparse_90_unstructured_pruneofa_en_3.4.2_3.0_1649783362329.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_uncased_sparse_90_unstructured_pruneofa_en_3.4.2_3.0_1649783362329.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_uncased_sparse_90_unstructured_pruneofa","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_uncased_sparse_90_unstructured_pruneofa","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.distilbert_base_uncased_sparse_90_unstructured_pruneofa").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_base_uncased_sparse_90_unstructured_pruneofa| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|123.7 MB| |Case sensitive:|false| ## References - https://huggingface.co/Intel/distilbert-base-uncased-sparse-90-unstructured-pruneofa - https://arxiv.org/abs/2111.05754 - https://github.com/IntelLabs/Model-Compression-Research-Package/tree/main/research/prune-once-for-all --- layout: model title: Detect PHI for Deidentification purposes (French, reduced entities) author: John Snow Labs name: ner_deid_generic date: 2022-02-11 tags: [deid, fr, licensed] task: De-identification language: fr edition: Healthcare NLP 3.4.1 spark_version: 2.4 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow a generic model to be trained using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (French) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 7 entities. This NER model is trained on an internally annotated custom dataset, the [French WikiNER dataset](https://metatext.io/datasets/wikiner), a [public dataset of French company names](https://www.data.gouv.fr/fr/datasets/entreprises-immatriculees-en-2017/), a [public dataset of French hospital names](https://salesdorado.com/fichiers-prospection/hopitaux/) and several data augmentation mechanisms.
## Predicted Entities `CONTACT`, `NAME`, `DATE`, `ID`, `LOCATION`, `PROFESSION`, `AGE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_fr_3.4.1_2.4_1644591444704.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_fr_3.4.1_2.4_1644591444704.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "fr")\ .setInputCols(["sentence", "token"])\ .setOutputCol("word_embeddings") clinical_ner = MedicalNerModel.pretrained("ner_deid_generic", "fr", "clinical/models")\ .setInputCols(["sentence","token", "word_embeddings"])\ .setOutputCol("ner") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner]) text = ["J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015."] data = spark.createDataFrame([text]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "fr") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic", "fr", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner)) val text = "J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015." 
val data = Seq(text).toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fr.med_ner.deid_generic").predict("""J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015.""") ```
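The pipeline above stops at token-level BIO tags (shown in the Results section); in Spark NLP, a `NerConverter` stage is what normally merges them into entity chunks. For illustration only, the BIO decoding it performs can be sketched in plain Python — a simplified stand-in, not the Spark NLP implementation:

```python
def bio_to_chunks(tokens, labels):
    """Group token-level BIO labels (e.g. B-NAME, I-NAME, O) into entity chunks."""
    chunks, current_tokens, current_label = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], label[2:]
        elif label.startswith("I-") and current_label == label[2:]:
            current_tokens.append(token)
        else:  # "O", or an I- tag that does not continue the open chunk (simplification)
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks

tokens = ["Michel", "Martinez", "(", "49", "ans", ")"]
labels = ["B-NAME", "I-NAME", "O", "B-AGE", "I-AGE", "O"]
print(bio_to_chunks(tokens, labels))
# [('Michel Martinez', 'NAME'), ('49 ans', 'AGE')]
```

In the real pipeline you would instead append a `NerConverter` stage after the NER model rather than post-processing by hand.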
## Results ```bash +------------+----------+ | token| ner_label| +------------+----------+ | J'ai| O| | vu| O| | en| O| |consultation| O| | Michel| B-NAME| | Martinez| I-NAME| | (| O| | 49| B-AGE| | ans| I-AGE| | )| O| | adressé| O| | au| O| | Centre|B-LOCATION| | Hospitalier|I-LOCATION| | De|I-LOCATION| | Plaisir|I-LOCATION| | pour| O| | un| O| | diabète| O| | mal| O| | contrôlé| O| | avec| O| | des| O| | symptômes| O| | datant| O| | de| O| | Mars| B-DATE| | 2015| I-DATE| | .| O| +------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|fr| |Size:|15.0 MB| ## References - Internal JSL annotated corpus - [French WikiNER dataset](https://metatext.io/datasets/wikiner) - [Public dataset of French company names](https://www.data.gouv.fr/fr/datasets/entreprises-immatriculees-en-2017/) - [Public dataset of French hospital names](https://salesdorado.com/fichiers-prospection/hopitaux/) ## Benchmarking ```bash label tp fp fn total precision recall f1 CONTACT 159.0 0.0 1.0 160.0 1.0 0.9938 0.9969 NAME 2633.0 111.0 197.0 2830.0 0.9595 0.9304 0.9447 DATE 2612.0 32.0 42.0 2654.0 0.9879 0.9842 0.986 ID 95.0 8.0 7.0 102.0 0.9223 0.9314 0.9268 LOCATION 3450.0 480.0 522.0 3972.0 0.8779 0.8686 0.8732 PROFESSION 326.0 54.0 82.0 408.0 0.8579 0.799 0.8274 AGE 395.0 37.0 46.0 441.0 0.9144 0.8957 0.9049 macro - - - - - - 0.9229 micro - - - - - - 0.9226 ``` --- layout: model title: German T5ForConditionalGeneration Small Cased model (from ml6team) author: John Snow Labs name: t5_mt5_small_german_finetune_mlsum date: 2023-01-30 tags: [de, open_source, t5, tensorflow] task: Text Generation language: de edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: 
"Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mt5-small-german-finetune-mlsum` is a German model originally trained by `ml6team`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_mt5_small_german_finetune_mlsum_de_4.3.0_3.0_1675106313224.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_mt5_small_german_finetune_mlsum_de_4.3.0_3.0_1675106313224.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_mt5_small_german_finetune_mlsum","de") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_mt5_small_german_finetune_mlsum","de") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
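Summaries produced by this checkpoint (fine-tuned on the German MLSUM dataset, per the model name) are typically scored with ROUGE; the References section links the `pltrdy/rouge` implementation. As a rough illustration of the metric only — not that library's code — a minimal ROUGE-1 F1 over unigram counts:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: unigram overlap between a candidate summary and a reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection = clipped counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("der hund schläft im garten", "der hund schläft"), 4))
# 0.75
```

Real evaluations also report ROUGE-2 and ROUGE-L, which the linked library computes.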
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_mt5_small_german_finetune_mlsum| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|de| |Size:|1.3 GB| ## References - https://huggingface.co/ml6team/mt5-small-german-finetune-mlsum - https://github.com/pltrdy/rouge --- layout: model title: Finance NER (10Q, lg, 139 entities) author: John Snow Labs name: finner_10q_xlbr date: 2022-12-02 tags: [10q, xlbr, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: FinanceNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an NER model covering 139 numeric financial entity types from different 10Q reports. Only the amount tokens are annotated, not the surrounding words; the context determines which of the 139 available entity types each amount belongs to. This is a large (`lg`) model, trained with 200K sentences. 
## Predicted Entities `DeferredFinanceCostsNet`, `DisposalGroupIncludingDiscontinuedOperationConsideration`, `DebtInstrumentCarryingAmount`, `CommonStockSharesAuthorized`, `RestructuringCharges`, `DeferredFinanceCostsGross`, `OperatingLeasesRentExpenseNet`, `EquityMethodInvestmentOwnershipPercentage`, `ClassOfWarrantOrRightExercisePriceOfWarrantsOrRights1`, `DebtInstrumentTerm`, `DebtInstrumentRedemptionPricePercentage`, `CommonStockCapitalSharesReservedForFutureIssuance`, `LossContingencyAccrualAtCarryingValue`, `SaleOfStockPricePerShare`, `MinorityInterestOwnershipPercentageByParent`, `PropertyPlantAndEquipmentUsefulLife`, `TreasuryStockAcquiredAverageCostPerShare`, `Goodwill`, `SupplementalInformationForPropertyCasualtyInsuranceUnderwritersPriorYearClaimsAndClaimsAdjustmentExpense`, `CommonStockParOrStatedValuePerShare`, `OperatingLeaseWeightedAverageDiscountRatePercent`, `DebtInstrumentConvertibleConversionPrice1`, `AmortizationOfIntangibleAssets`, `PreferredStockSharesAuthorized`, `OperatingLeasePayments`, `DebtInstrumentMaturityDate`, `ShareBasedCompensationArrangementByShareBasedPaymentAwardOptionsGrantsInPeriodWeightedAverageGrantDateFairValue`, `EffectiveIncomeTaxRateReconciliationAtFederalStatutoryIncomeTaxRate`, `AllocatedShareBasedCompensationExpense`, `PreferredStockDividendRatePercentage`, `StockRepurchaseProgramRemainingAuthorizedRepurchaseAmount1`, `TreasuryStockValueAcquiredCostMethod`, `ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsVestedInPeriodTotalFairValue`, `IncomeTaxExpenseBenefit`, `DerivativeFixedInterestRate`, `RelatedPartyTransactionExpensesFromTransactionsWithRelatedParty`, `PublicUtilitiesRequestedRateIncreaseDecreaseAmount`, `RestructuringAndRelatedCostExpectedCost1`, `StockRepurchaseProgramAuthorizedAmount1`, `ShareBasedCompensation`, `ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsGrantsInPeriodWeightedAverageGrantDateFairValue`, 
`LongTermDebtFairValue`, `LineOfCreditFacilityUnusedCapacityCommitmentFeePercentage`, `LineOfCreditFacilityCurrentBorrowingCapacity`, `ShareBasedCompensationArrangementByShareBasedPaymentAwardAwardVestingPeriod1`, `SharebasedCompensationArrangementBySharebasedPaymentAwardAwardVestingRightsPercentage`, `PaymentsToAcquireBusinessesGross`, `MinorityInterestOwnershipPercentageByNoncontrollingOwners`, `AntidilutiveSecuritiesExcludedFromComputationOfEarningsPerShareAmount`, `NumberOfReportableSegments`, `BusinessCombinationRecognizedIdentifiableAssetsAcquiredAndLiabilitiesAssumedIntangibleAssetsOtherThanGoodwill`, `OperatingLeaseCost`, `BusinessCombinationConsiderationTransferred1`, `UnrecognizedTaxBenefitsThatWouldImpactEffectiveTaxRate`, `CommonStockDividendsPerShareDeclared`, `AreaOfRealEstateProperty`, `LesseeOperatingLeaseTermOfContract`, `RevenueRemainingPerformanceObligation`, `RelatedPartyTransactionAmountsOfTransaction`, `InterestExpense`, `OperatingLeaseExpense`, `StockIssuedDuringPeriodSharesNewIssues`, `DebtInstrumentFaceAmount`, `CapitalizedContractCostAmortization`, `DebtInstrumentBasisSpreadOnVariableRate1`, `ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsNonvestedNumber`, `GainsLossesOnExtinguishmentOfDebt`, `LineOfCreditFacilityRemainingBorrowingCapacity`, `OperatingLeaseRightOfUseAsset`, `OperatingLeaseWeightedAverageRemainingLeaseTerm1`, `OperatingLossCarryforwards`, `ConcentrationRiskPercentage1`, `GuaranteeObligationsMaximumExposure`, `StockRepurchasedAndRetiredDuringPeriodShares`, `LesseeOperatingLeaseRenewalTerm`, `ContractWithCustomerLiabilityRevenueRecognized`, `DefinedBenefitPlanContributionsByEmployer`, `ShareBasedCompensationArrangementByShareBasedPaymentAwardOptionsGrantsInPeriodGross`, `RepaymentsOfDebt`, `EmployeeServiceShareBasedCompensationNonvestedAwardsTotalCompensationCostNotYetRecognized`, `BusinessAcquisitionPercentageOfVotingInterestsAcquired`, 
`DebtInstrumentInterestRateEffectivePercentage`, `AcquiredFiniteLivedIntangibleAssetsWeightedAverageUsefulLife`, `DebtInstrumentUnamortizedDiscount`, `ShareBasedCompensationArrangementByShareBasedPaymentAwardNumberOfSharesAuthorized`, `BusinessCombinationContingentConsiderationLiability`, `DebtInstrumentInterestRateStatedPercentage`, `LeaseAndRentalExpense`, `RevenueFromContractWithCustomerExcludingAssessedTax`, `SharePrice`, `CommonStockSharesOutstanding`, `ContractWithCustomerLiability`, `DerivativeNotionalAmount`, `RevenueFromRelatedParties`, `ShareBasedCompensationArrangementByShareBasedPaymentAwardOptionsExercisesInPeriodTotalIntrinsicValue`, `Revenues`, `EmployeeServiceShareBasedCompensationNonvestedAwardsTotalCompensationCostNotYetRecognizedShareBasedAwardsOtherThanOptions`, `AccrualForEnvironmentalLossContingencies`, `ProceedsFromIssuanceOfCommonStock`, `EmployeeServiceShareBasedCompensationTaxBenefitFromCompensationExpense`, `IncomeLossFromEquityMethodInvestments`, `NumberOfOperatingSegments`, `UnrecognizedTaxBenefits`, `RevenueFromContractWithCustomerIncludingAssessedTax`, `LossContingencyDamagesSoughtValue`, `SharebasedCompensationArrangementBySharebasedPaymentAwardExpirationPeriod`, `TreasuryStockSharesAcquired`, `FiniteLivedIntangibleAssetUsefulLife`, `BusinessCombinationRecognizedIdentifiableAssetsAcquiredAndLiabilitiesAssumedIntangibles`, `EffectiveIncomeTaxRateContinuingOperations`, `LossContingencyEstimateOfPossibleLoss`, `ShareBasedCompensationArrangementByShareBasedPaymentAwardNumberOfSharesAvailableForGrant`, `BusinessCombinationAcquisitionRelatedCosts`, `StockRepurchasedDuringPeriodShares`, `CashAndCashEquivalentsFairValueDisclosure`, `LineOfCreditFacilityInterestRateAtPeriodEnd`, `ShareBasedCompensationArrangementByShareBasedPaymentAwardEquityInstrumentsOtherThanOptionsGrantsInPeriod`, `CumulativeEffectOfNewAccountingPrincipleInPeriodOfAdoption`, `LettersOfCreditOutstandingAmount`, 
`EmployeeServiceShareBasedCompensationNonvestedAwardsTotalCompensationCostNotYetRecognizedPeriodForRecognition1`, `NumberOfRealEstateProperties`, `DebtWeightedAverageInterestRate`, `SaleOfStockNumberOfSharesIssuedInTransaction`, `AssetImpairmentCharges`, `Depreciation`, `DebtInstrumentFairValue`, `DefinedContributionPlanCostRecognized`, `InterestExpenseDebt`, `LossContingencyPendingClaimsNumber`, `PaymentsToAcquireBusinessesNetOfCashAcquired`, `BusinessAcquisitionEquityInterestsIssuedOrIssuableNumberOfSharesIssued`, `GoodwillImpairmentLoss`, `LineOfCredit`, `AmortizationOfFinancingCosts`, `EquityMethodInvestments`, `LineOfCreditFacilityCommitmentFeePercentage`, `LongTermDebt`, `LineOfCreditFacilityMaximumBorrowingCapacity`, `OperatingLeaseLiability` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FINNER_10Q_XLBR){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_10q_xlbr_en_1.0.0_3.0_1669977147020.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_10q_xlbr_en_1.0.0_3.0_1669977147020.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence = nlp.SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token")\ .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$','€']) embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512) nerTagger = finance.NerModel.pretrained('finner_10q_xlbr', 'en', 'finance/models')\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") pipeline = nlp.Pipeline(stages=[documentAssembler, sentence, tokenizer, embeddings, nerTagger]) text = """Common Stock The authorized capital of the Company is 200,000,000 common shares , par value $ 0.001 , of which 12,481,724 are issued or outstanding .""" df = spark.createDataFrame([[text]]).toDF("text") fit = pipeline.fit(df) result = fit.transform(df) result_df = result.select(F.explode(F.arrays_zip(result.token.result,result.ner.result, result.ner.metadata)).alias("cols"))\ .select(F.expr("cols['0']").alias("token"), F.expr("cols['1']").alias("ner_label"), F.expr("cols['2']['confidence']").alias("confidence")) result_df.show(50, truncate=100) ```
## Results ```bash +-----------+-------------------------------------+----------+ | token| ner_label|confidence| +-----------+-------------------------------------+----------+ | Common| O| 1.0| | Stock| O| 1.0| | The| O| 1.0| | authorized| O| 1.0| | capital| O| 1.0| | of| O| 1.0| | the| O| 1.0| | Company| O| 1.0| | is| O| 1.0| |200,000,000| B-CommonStockSharesAuthorized| 0.9932| | common| O| 1.0| | shares| O| 1.0| | ,| O| 1.0| | par| O| 1.0| | value| O| 1.0| | $| O| 1.0| | 0.001|B-CommonStockParOrStatedValuePerShare| 0.9988| | ,| O| 1.0| | of| O| 1.0| | which| O| 1.0| | 12,481,724| B-CommonStockSharesOutstanding| 0.9649| | are| O| 1.0| | issued| O| 1.0| | or| O| 1.0| |outstanding| O| 1.0| | .| O| 1.0| +-----------+-------------------------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_10q_xlbr| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|17.0 MB| ## References An in-house modified version of https://huggingface.co/datasets/nlpaueb/finer-139, re-split and filtered to focus on sentences with a higher density of tags. 
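The Benchmarking section below reports macro- and micro-averaged scores built from per-label tp/fp/fn counts: macro averages the per-label F1 scores equally, while micro pools the counts before computing a single F1. A sketch of the two schemes with made-up counts (not taken from this model's evaluation):

```python
def prf(tp, fp, fn):
    """Precision, recall, F1 from true-positive/false-positive/false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def micro_macro_f1(counts):
    """counts: {label: (tp, fp, fn)} -> (micro F1, macro F1)."""
    # Macro: average the per-label F1 scores, weighting every label equally.
    macro = sum(prf(*c)[2] for c in counts.values()) / len(counts)
    # Micro: pool all counts first, then compute a single F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = prf(tp, fp, fn)[2]
    return micro, macro

# Illustrative counts only, not this model's numbers:
counts = {"Revenues": (90, 10, 10), "SharePrice": (45, 5, 15)}
micro, macro = micro_macro_f1(counts)
print(round(micro, 4), round(macro, 4))
```

With imbalanced labels the two averages diverge: micro is dominated by frequent labels, macro exposes weak rare ones.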
## Benchmarking ```bash label tp fp fn prec rec f1 Macro-average 53613 10309 10243 0.8324958 0.8049274 0.8184795 Micro-average 53613 10309 10243 0.8387253 0.8395922 0.8391586 ``` --- layout: model title: Ganda asr_wav2vec2_large_xlsr_luganda TFWav2Vec2ForCTC from lucio author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_luganda date: 2022-09-24 tags: [wav2vec2, lg, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: lg edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_luganda` is a Ganda model originally trained by lucio. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_luganda_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_luganda_lg_4.2.0_3.0_1664036777655.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_luganda_lg_4.2.0_3.0_1664036777655.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_luganda', lang = 'lg') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_luganda", lang = "lg") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_luganda| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|lg| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Translate Tai to English Pipeline author: John Snow Labs name: translate_taw_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, taw, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors helping with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended. - source languages: `taw` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_taw_en_xx_2.7.0_2.4_1609700707816.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_taw_en_xx_2.7.0_2.4_1609700707816.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_taw_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_taw_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.taw.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_taw_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from husnu) author: John Snow Labs name: bert_qa_xtremedistil_l6_h256_uncased_finetuned_lr_2e_05_epochs_3 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xtremedistil-l6-h256-uncased-finetuned_lr-2e-05_epochs-3` is an English model originally trained by `husnu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_xtremedistil_l6_h256_uncased_finetuned_lr_2e_05_epochs_3_en_4.0.0_3.0_1654192622134.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_xtremedistil_l6_h256_uncased_finetuned_lr_2e_05_epochs_3_en_4.0.0_3.0_1654192622134.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_xtremedistil_l6_h256_uncased_finetuned_lr_2e_05_epochs_3","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_xtremedistil_l6_h256_uncased_finetuned_lr_2e_05_epochs_3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.xtremedistiled_uncased_lr_2e_05_epochs_3.by_husnu").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
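Extractive QA answers such as the `answer` column produced above are commonly scored by exact match after light normalization (lowercasing, stripping punctuation and English articles). A minimal sketch, loosely following the usual SQuAD-style convention rather than any Spark NLP API:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized prediction equals the normalized gold answer."""
    return normalize(prediction) == normalize(gold)

print(exact_match("Clara.", "clara"))
# True
```

So a prediction of "Clara." for the example question above would count as an exact match against a gold answer of "clara".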
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_xtremedistil_l6_h256_uncased_finetuned_lr_2e_05_epochs_3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|47.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/husnu/xtremedistil-l6-h256-uncased-finetuned_lr-2e-05_epochs-3 --- layout: model title: Detect Persons, Locations, Organizations and Misc Entities in Dutch (WikiNER 840B 300) author: John Snow Labs name: wikiner_840B_300 date: 2020-05-10 task: Named Entity Recognition language: nl edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [ner, nl, open_source] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 840B 300 is trained with GloVe 840B 300 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_NL){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_NL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_840B_300_nl_2.5.0_2.4_1588546201484.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_840B_300_nl_2.5.0_2.4_1588546201484.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("wikiner_840B_300", "nl") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (geboren 28 oktober 1955) is een Amerikaanse zakenmagnaat, softwareontwikkelaar, investeerder en filantroop. Hij is vooral bekend als medeoprichter van Microsoft Corporation. Tijdens zijn carrière bij Microsoft bekleedde Gates de functies van voorzitter, chief executive officer (CEO), president en chief software architect, terwijl hij ook de grootste individuele aandeelhouder was tot mei 2014. Hij is een van de bekendste ondernemers en pioniers van de microcomputerrevolutie van de jaren 70 en 80. Gates, geboren en getogen in Seattle, Washington, richtte in 1975 samen met jeugdvriend Paul Allen Microsoft op in Albuquerque, New Mexico; het werd \'s werelds grootste personal computer softwarebedrijf. Gates leidde het bedrijf als voorzitter en CEO totdat hij in januari 2000 aftrad als CEO, maar hij bleef voorzitter en werd chief software architect. Eind jaren negentig kreeg Gates kritiek vanwege zijn zakelijke tactieken, die als concurrentiebeperkend werden beschouwd. Deze mening is bevestigd door tal van gerechtelijke uitspraken. In juni 2006 kondigde Gates aan dat hij zou overgaan naar een parttime functie bij Microsoft en fulltime gaan werken bij de Bill & Melinda Gates Foundation, de particuliere liefdadigheidsstichting die hij en zijn vrouw, Melinda Gates, in 2000 hebben opgericht. 
Hij droeg geleidelijk zijn taken over aan Ray Ozzie en Craig Mundie. Hij trad in februari 2014 af als voorzitter van Microsoft en nam een nieuwe functie aan als technologieadviseur ter ondersteuning van de nieuw aangestelde CEO Satya Nadella.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("wikiner_840B_300", "nl") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (geboren 28 oktober 1955) is een Amerikaanse zakenmagnaat, softwareontwikkelaar, investeerder en filantroop. Hij is vooral bekend als medeoprichter van Microsoft Corporation. Tijdens zijn carrière bij Microsoft bekleedde Gates de functies van voorzitter, chief executive officer (CEO), president en chief software architect, terwijl hij ook de grootste individuele aandeelhouder was tot mei 2014. Hij is een van de bekendste ondernemers en pioniers van de microcomputerrevolutie van de jaren 70 en 80. Gates, geboren en getogen in Seattle, Washington, richtte in 1975 samen met jeugdvriend Paul Allen Microsoft op in Albuquerque, New Mexico; het werd 's werelds grootste personal computer softwarebedrijf. Gates leidde het bedrijf als voorzitter en CEO totdat hij in januari 2000 aftrad als CEO, maar hij bleef voorzitter en werd chief software architect. Eind jaren negentig kreeg Gates kritiek vanwege zijn zakelijke tactieken, die als concurrentiebeperkend werden beschouwd. Deze mening is bevestigd door tal van gerechtelijke uitspraken. 
In juni 2006 kondigde Gates aan dat hij zou overgaan naar een parttime functie bij Microsoft en fulltime gaan werken bij de Bill & Melinda Gates Foundation, de particuliere liefdadigheidsstichting die hij en zijn vrouw, Melinda Gates, in 2000 hebben opgericht. Hij droeg geleidelijk zijn taken over aan Ray Ozzie en Craig Mundie. Hij trad in februari 2014 af als voorzitter van Microsoft en nam een nieuwe functie aan als technologieadviseur ter ondersteuning van de nieuw aangestelde CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (geboren 28 oktober 1955) is een Amerikaanse zakenmagnaat, softwareontwikkelaar, investeerder en filantroop. Hij is vooral bekend als medeoprichter van Microsoft Corporation. Tijdens zijn carrière bij Microsoft bekleedde Gates de functies van voorzitter, chief executive officer (CEO), president en chief software architect, terwijl hij ook de grootste individuele aandeelhouder was tot mei 2014. Hij is een van de bekendste ondernemers en pioniers van de microcomputerrevolutie van de jaren 70 en 80. Gates, geboren en getogen in Seattle, Washington, richtte in 1975 samen met jeugdvriend Paul Allen Microsoft op in Albuquerque, New Mexico; het werd 's werelds grootste personal computer softwarebedrijf. Gates leidde het bedrijf als voorzitter en CEO totdat hij in januari 2000 aftrad als CEO, maar hij bleef voorzitter en werd chief software architect. Eind jaren negentig kreeg Gates kritiek vanwege zijn zakelijke tactieken, die als concurrentiebeperkend werden beschouwd. Deze mening is bevestigd door tal van gerechtelijke uitspraken. In juni 2006 kondigde Gates aan dat hij zou overgaan naar een parttime functie bij Microsoft en fulltime gaan werken bij de Bill & Melinda Gates Foundation, de particuliere liefdadigheidsstichting die hij en zijn vrouw, Melinda Gates, in 2000 hebben opgericht. 
Hij droeg geleidelijk zijn taken over aan Ray Ozzie en Craig Mundie. Hij trad in februari 2014 af als voorzitter van Microsoft en nam een nieuwe functie aan als technologieadviseur ter ondersteuning van de nieuw aangestelde CEO Satya Nadella."""] ner_df = nlu.load('nl.ner.wikiner.glove.840B_300').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title} ## Results ```bash +-------------------------------+---------+ |chunk |ner_label| +-------------------------------+---------+ |William Henry Gates III |PER | |Amerikaanse |MISC | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PER | |CEO |ORG | |Gates |PER | |Seattle |LOC | |Washington |LOC | |Paul Allen Microsoft |PER | |Albuquerque |LOC | |New Mexico |LOC | |Gates |PER | |CEO |ORG | |CEO |ORG | |Eind |ORG | |Gates |PER | |Gates |PER | |Microsoft |ORG | |Bill & Melinda Gates Foundation|ORG | +-------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wikiner_840B_300| |Type:|ner| |Compatibility:|Spark NLP 2.5.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|nl| |Case sensitive:|false| {:.h2_title} ## Data Source The model is trained on data from [https://nl.wikipedia.org](https://nl.wikipedia.org) --- layout: model title: Legal Executive Employment Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_executive_employment_agreement_bert date: 2022-11-24 tags: [en, legal, classification, agreement, executive_employment, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_executive_employment_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `executive-employment-agreement` or not (binary classification). Unlike the Longformer version, this model is lighter, with faster inference. 
## Predicted Entities `executive-employment-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_executive_employment_agreement_bert_en_1.0.0_3.0_1669313118478.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_executive_employment_agreement_bert_en_1.0.0_3.0_1669313118478.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_executive_employment_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
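The benchmarking table further down reports macro and weighted averages alongside per-class scores; the weighted average is simply each class's F1 weighted by its support. A minimal sketch of that arithmetic, using the figures from the table (the helper function itself is illustrative, not part of the library):

```python
# Weighted-average F1: each class's F1 weighted by its support,
# as in the benchmarking table (F1 0.93/0.96, supports 36/65).
def weighted_avg(scores, supports):
    total = sum(supports)
    return sum(f * s for f, s in zip(scores, supports)) / total

w_f1 = weighted_avg([0.93, 0.96], [36, 65])  # rounds to the reported 0.95
```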
## Results ```bash +-------+ |result| +-------+ |[executive-employment-agreement]| |[other]| |[other]| |[executive-employment-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_executive_employment_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support executive-employment-agreement 0.92 0.94 0.93 36 other 0.97 0.95 0.96 65 accuracy - - 0.95 101 macro-avg 0.94 0.95 0.95 101 weighted-avg 0.95 0.95 0.95 101 ``` --- layout: model title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_patrickvonplaten TFWav2Vec2ForCTC from patrickvonplaten author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_turkish_colab_by_patrickvonplaten date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_patrickvonplaten` is an English model originally trained by patrickvonplaten.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_300m_turkish_colab_by_patrickvonplaten_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_patrickvonplaten_en_4.2.0_3.0_1664105259391.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_patrickvonplaten_en_4.2.0_3.0_1664105259391.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_patrickvonplaten", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_patrickvonplaten", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_turkish_colab_by_patrickvonplaten| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Detect Diagnoses and Procedures (Spanish) author: John Snow Labs name: ner_diag_proc class: NerDLModel language: es repository: clinical/models date: 2020-07-08 task: Named Entity Recognition edition: Healthcare NLP 2.5.3 spark_version: 2.4 tags: [clinical,licensed,ner,es] supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description The Named Entity Recognition annotator allows a generic model to be trained using a deep learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. This is a pretrained named entity recognition deep learning model for diagnoses and procedures in Spanish. ## Predicted Entities ``DIAGNOSTICO``, ``PROCEDIMIENTO``. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_diag_proc_es_2.5.3_2.4_1594168623415.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_diag_proc_es_2.5.3_2.4_1594168623415.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embed = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d","es","clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("word_embeddings") model = NerDLModel.pretrained("ner_diag_proc","es","clinical/models")\ .setInputCols("sentence","token","word_embeddings")\ .setOutputCol("ner") ... nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embed, model, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['HISTORIA DE ENFERMEDAD ACTUAL: El Sr. Smith es un hombre blanco veterano de 60 años con múltiples comorbilidades, que tiene antecedentes de cáncer de vejiga diagnosticado hace aproximadamente dos años por el Hospital VA. Allí se sometió a una resección. Debía ser ingresado en el Hospital de Día para una cistectomía. Fue visto en la Clínica de Urología y Clínica de Radiología el 02/04/2003. CURSO DE HOSPITAL: El Sr. Smith se presentó en el Hospital de Día antes de la cirugía de Urología. En evaluación, EKG, ecocardiograma fue anormal, se obtuvo una consulta de Cardiología. Luego se procedió a una resonancia magnética de estrés con adenosina cardíaca, la misma fue positiva para isquemia inducible, infarto subendocárdico inferolateral leve a moderado con isquemia peri-infarto. Además, se observa isquemia inducible en el tabique lateral inferior. El Sr. Smith se sometió a un cateterismo del corazón izquierdo, que reveló una enfermedad de las arterias coronarias de dos vasos. La RCA, proximal estaba estenosada en un 95% y la distal en un 80% estenosada. La LAD media estaba estenosada en un 85% y la LAD distal estaba estenosada en un 85%. Se colocaron cuatro stents de metal desnudo Multi-Link Vision para disminuir las cuatro lesiones al 0%. Después de la intervención, el Sr. 
Smith fue admitido en 7 Ardmore Tower bajo el Servicio de Cardiología bajo la dirección del Dr. Hart. El Sr. Smith tuvo un curso hospitalario post-intervención sin complicaciones. Se mantuvo estable para el alta hospitalaria el 07/02/2003 con instrucciones de tomar Plavix diariamente durante un mes y Urología está al tanto de lo mismo.']], ["text"])) ``` ```scala ... val embed = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d","es","clinical/models") .setInputCols(Array("document","token")) .setOutputCol("word_embeddings") val model = NerDLModel.pretrained("ner_diag_proc","es","clinical/models") .setInputCols("sentence","token","word_embeddings") .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embed, model, ner_converter)) val data = Seq("HISTORIA DE ENFERMEDAD ACTUAL: El Sr. Smith es un hombre blanco veterano de 60 años con múltiples comorbilidades, que tiene antecedentes de cáncer de vejiga diagnosticado hace aproximadamente dos años por el Hospital VA. Allí se sometió a una resección. Debía ser ingresado en el Hospital de Día para una cistectomía. Fue visto en la Clínica de Urología y Clínica de Radiología el 02/04/2003. CURSO DE HOSPITAL: El Sr. Smith se presentó en el Hospital de Día antes de la cirugía de Urología. En evaluación, EKG, ecocardiograma fue anormal, se obtuvo una consulta de Cardiología. Luego se procedió a una resonancia magnética de estrés con adenosina cardíaca, la misma fue positiva para isquemia inducible, infarto subendocárdico inferolateral leve a moderado con isquemia peri-infarto. Además, se observa isquemia inducible en el tabique lateral inferior. El Sr. Smith se sometió a un cateterismo del corazón izquierdo, que reveló una enfermedad de las arterias coronarias de dos vasos. La RCA, proximal estaba estenosada en un 95% y la distal en un 80% estenosada. La LAD media estaba estenosada en un 85% y la LAD distal estaba estenosada en un 85%. 
Se colocaron cuatro stents de metal desnudo Multi-Link Vision para disminuir las cuatro lesiones al 0%. Después de la intervención, el Sr. Smith fue admitido en 7 Ardmore Tower bajo el Servicio de Cardiología bajo la dirección del Dr. Hart. El Sr. Smith tuvo un curso hospitalario post-intervención sin complicaciones. Se mantuvo estable para el alta hospitalaria el 07/02/2003 con instrucciones de tomar Plavix diariamente durante un mes y Urología está al tanto de lo mismo.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner").predict("""HISTORIA DE ENFERMEDAD ACTUAL: El Sr. Smith es un hombre blanco veterano de 60 años con múltiples comorbilidades, que tiene antecedentes de cáncer de vejiga diagnosticado hace aproximadamente dos años por el Hospital VA. Allí se sometió a una resección. Debía ser ingresado en el Hospital de Día para una cistectomía. Fue visto en la Clínica de Urología y Clínica de Radiología el 02/04/2003. CURSO DE HOSPITAL: El Sr. Smith se presentó en el Hospital de Día antes de la cirugía de Urología. En evaluación, EKG, ecocardiograma fue anormal, se obtuvo una consulta de Cardiología. Luego se procedió a una resonancia magnética de estrés con adenosina cardíaca, la misma fue positiva para isquemia inducible, infarto subendocárdico inferolateral leve a moderado con isquemia peri-infarto. Además, se observa isquemia inducible en el tabique lateral inferior. El Sr. Smith se sometió a un cateterismo del corazón izquierdo, que reveló una enfermedad de las arterias coronarias de dos vasos. La RCA, proximal estaba estenosada en un 95% y la distal en un 80% estenosada. La LAD media estaba estenosada en un 85% y la LAD distal estaba estenosada en un 85%. Se colocaron cuatro stents de metal desnudo Multi-Link Vision para disminuir las cuatro lesiones al 0%. Después de la intervención, el Sr. Smith fue admitido en 7 Ardmore Tower bajo el Servicio de Cardiología bajo la dirección del Dr. 
Hart. El Sr. Smith tuvo un curso hospitalario post-intervención sin complicaciones. Se mantuvo estable para el alta hospitalaria el 07/02/2003 con instrucciones de tomar Plavix diariamente durante un mes y Urología está al tanto de lo mismo.""") ```
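The benchmarking section below reports precision, recall, and F1 derived from raw true-positive/false-positive/false-negative counts. As a sketch of that derivation, the PROCEDIMIENTO row's figures follow directly from the standard formulas (the helper below is illustrative only):

```python
# Precision/recall/F1 from raw counts (PROCEDIMIENTO row of the
# benchmarking table: tp=2299, fp=1103, fn=860).
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = prf(2299, 1103, 860)  # rounds to the reported 0.6758, 0.7278, 0.7008
```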
## Results ```bash +----------------------+-------------+ |chunk |ner_label | +----------------------+-------------+ |ENFERMEDAD |DIAGNOSTICO | |cáncer de vejiga |DIAGNOSTICO | |resección |PROCEDIMIENTO| |cistectomía |PROCEDIMIENTO| |estrés |DIAGNOSTICO | |infarto subendocárdico|DIAGNOSTICO | |enfermedad |DIAGNOSTICO | |arterias coronarias |DIAGNOSTICO | +----------------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---------------|----------------------------------| | Name: | ner_diag_proc | | Type: | NerDLModel | | Compatibility: | 2.5.3 | | License: | Licensed | | Edition: | Official | |Input labels: | [sentence, token, word_embeddings] | |Output labels: | [ner] | | Language: | es | | Dependencies: | embeddings_scielowiki_300d | {:.h2_title} ## Data Source Trained on the CodiEsp Challenge dataset with `embeddings_scielowiki_300d` embeddings: [https://temu.bsc.es/codiesp/](https://temu.bsc.es/codiesp/) ## Benchmarking ```bash +-------------+------+------+------+------+---------+------+------+ | entity| tp| fp| fn| total|precision|recall| f1| +-------------+------+------+------+------+---------+------+------+ |PROCEDIMIENTO|2299.0|1103.0| 860.0|3159.0| 0.6758|0.7278|0.7008| | DIAGNOSTICO|6623.0|1364.0|2974.0|9597.0| 0.8292|0.6901|0.7533| +-------------+------+------+------+------+---------+------+------+ +------------------+ | macro| +------------------+ |0.7270531284138397| +------------------+ +------------------+ | micro| +------------------+ |0.7402992400932049| +------------------+ ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from silveto) author: John Snow Labs name: distilbert_qa_silveto_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description
Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `silveto`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_silveto_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772771581.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_silveto_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772771581.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_silveto_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_silveto_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) // Use a tuple so the question and context land in one row with two columns val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_silveto_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/silveto/distilbert-base-uncased-finetuned-squad --- layout: model title: Translate English to Uralic languages Pipeline author: John Snow Labs name: translate_en_urj date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, urj, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) as well as commercial ones. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `urj` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_urj_xx_2.7.0_2.4_1609687665593.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_urj_xx_2.7.0_2.4_1609687665593.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_urj", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_urj", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.urj').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_urj| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Word2Vec Embeddings in Albanian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, sq, open_source] task: Embeddings language: sq edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sq_3.4.1_3.0_1647282075905.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sq_3.4.1_3.0_1647282075905.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sq") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Unë e dua shkëndijën nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sq") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Unë e dua shkëndijën nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("sq.embed.w2v_cc_300d").predict("""Unë e dua shkëndijën nlp""") ```
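Each token in the `embeddings` output column carries a 300-dimensional vector, and downstream similarity between tokens is typically measured with cosine similarity. A minimal pure-Python sketch, with toy 3-d vectors standing in for the model's 300-d ones:

```python
import math

def cosine(u, v):
    # cosine similarity: dot product divided by the product of vector norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_same = cosine([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # 1.0 for identical vectors
sim_orth = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # 0.0 for orthogonal vectors
```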
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|sq| |Size:|684.3 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: MeSH to UMLS Code Mapping author: John Snow Labs name: mesh_umls_mapping date: 2021-07-01 tags: [mesh, umls, en, licensed, pipeline] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps MeSH codes to UMLS codes without using any text data. Feed it whitespace-delimited MeSH codes and it returns the corresponding UMLS codes as a list. If a code has no mapping, the original code is returned unchanged. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_CODE_MAPPING/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/mesh_umls_mapping_en_3.1.0_2.4_1625126292100.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/mesh_umls_mapping_en_3.1.0_2.4_1625126292100.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("mesh_umls_mapping","en","clinical/models") pipeline.annotate("C028491 D019326 C579867") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("mesh_umls_mapping","en","clinical/models") val result = pipeline.annotate("C028491 D019326 C579867") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.mesh.umls").predict("""C028491 D019326 C579867""") ```
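`pipeline.annotate` returns a dict of parallel lists: one `mesh` entry per input code, with the matching `umls` code at the same position. A small helper (hypothetical, not part of the pipeline) can zip them into an explicit code-to-code lookup:

```python
def to_lookup(annotations):
    # pair each MeSH code with the UMLS code at the same list position
    return dict(zip(annotations["mesh"], annotations["umls"]))

# output shape as produced by annotate() for "C028491 D019326 C579867"
annotations = {"mesh": ["C028491", "D019326", "C579867"],
               "umls": ["C0970275", "C0886627", "C3696376"]}
lookup = to_lookup(annotations)  # {'C028491': 'C0970275', ...}
```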
## Results ```bash {'mesh': ['C028491', 'D019326', 'C579867'], 'umls': ['C0970275', 'C0886627', 'C3696376']} Note: | MeSH | Details | | ---------- | ----------------------------:| | C028491 | 1,3-butylene glycol | | D019326 | 17-alpha-Hydroxyprogesterone | | C579867 | 3-Methylglutaconic Aciduria | | UMLS | Details | | ---------- | ---------------------------:| | C0970275 | 1,3-butylene glycol | | C0886627 | 17-hydroxyprogesterone | | C3696376 | 3-methylglutaconic aciduria | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|mesh_umls_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - TokenizerModel - LemmatizerModel - Finisher --- layout: model title: English asr_model_sid_voxforge_cetuc_1 TFWav2Vec2ForCTC from joaoalvarenga author: John Snow Labs name: asr_model_sid_voxforge_cetuc_1 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_model_sid_voxforge_cetuc_1` is an English model originally trained by joaoalvarenga. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_model_sid_voxforge_cetuc_1_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_model_sid_voxforge_cetuc_1_en_4.2.0_3.0_1664021832928.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_model_sid_voxforge_cetuc_1_en_4.2.0_3.0_1664021832928.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_model_sid_voxforge_cetuc_1", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_model_sid_voxforge_cetuc_1", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_model_sid_voxforge_cetuc_1| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Detect PHI in text (ner_deid_sd_large) author: John Snow Labs name: ner_deid_sd_large date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect PHI in text for de-identification using a pretrained NER model. We adhered to the official annotation guidelines (AG) for the 2014 i2b2 de-identification challenge while annotating new datasets for this model. All the details regarding the nuances of and explanations for the AG can be found here: [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/) ## Predicted Entities `PROFESSION`, `CONTACT`, `DATE`, `NAME`, `AGE`, `ID`, `LOCATION` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_sd_large_en_3.0.0_3.0_1617260861713.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_sd_large_en_3.0.0_3.0_1617260861713.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained()\ .setInputCols("document")\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_deid_sd_large", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentenceDetector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) sample_text = """Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . 
Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.""" df = spark.createDataFrame([[sample_text]]).toDF("text") result = nlpPipeline.fit(df).transform(df) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_sd_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter)) val data = Seq("""Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.deid.sd_large").predict("""Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.""") ```
## Results ```bash +-----------------------------+---------+ |chunk |ner_label| +-----------------------------+---------+ |2093-01-13 |DATE | |David Hale |NAME | |Hendrickson Ora |NAME | |7194334 |ID | |01/13/93 |DATE | |Oliveira |NAME | |25 |AGE | |2079-11-09 |DATE | |Cocke County Baptist Hospital|LOCATION | |0295 Keats Street |LOCATION | |302-786-5227 |CONTACT | +-----------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_sd_large| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| --- layout: model title: English asr_wav2vec2_base_timit_demo_colab0_by_sherry7144 TFWav2Vec2ForCTC from sherry7144 author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab0_by_sherry7144 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab0_by_sherry7144` is an English model originally trained by sherry7144.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab0_by_sherry7144_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab0_by_sherry7144_en_4.2.0_3.0_1664037814643.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab0_by_sherry7144_en_4.2.0_3.0_1664037814643.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab0_by_sherry7144", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab0_by_sherry7144", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
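Both snippets above assume an `audioDf` whose `audio_content` column holds arrays of floats. Raw 16-bit PCM samples are typically normalized to [-1, 1] before assembly; a minimal sketch (`pcm16_to_floats` is an illustrative helper, not a Spark NLP API):

```python
# Normalize signed 16-bit PCM integers to floats in [-1.0, 1.0],
# the shape of data the AudioAssembler's input column is expected to hold.
def pcm16_to_floats(samples):
    return [s / 32768.0 for s in samples]

floats = pcm16_to_floats([0, 16384, -32768, 32767])
# A row like (floats,) could then be used to build audioDf, e.g.
# spark.createDataFrame([(floats,)], ["audio_content"])
```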
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab0_by_sherry7144| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from Gam) author: John Snow Labs name: roberta_qa_base_finetuned_cuad_gam date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-cuad-gam` is an English model originally trained by `Gam`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_cuad_gam_en_4.3.0_3.0_1674216483026.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_cuad_gam_en_4.3.0_3.0_1674216483026.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_cuad_gam","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_cuad_gam","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_finetuned_cuad_gam| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|450.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Gam/roberta-base-finetuned-cuad-gam --- layout: model title: Finnish asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman` is a Finnish model originally trained by jonatasgrosman. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman_fi_4.2.0_3.0_1664041651614.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman_fi_4.2.0_3.0_1664041651614.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman", "fi")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman", "fi") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_finnish_by_jonatasgrosman| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|fi| |Size:|1.2 GB| --- layout: model title: Spanish Bert Embeddings (from mmaguero) author: John Snow Labs name: bert_embeddings_beto_gn_base_cased date: 2022-04-11 tags: [bert, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `beto-gn-base-cased` is a Spanish model originally trained by `mmaguero`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_beto_gn_base_cased_es_3.4.2_3.0_1649671395707.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_beto_gn_base_cased_es_3.4.2_3.0_1649671395707.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_beto_gn_base_cased","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_beto_gn_base_cased","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.beto_gn_base_cased").predict("""Me encanta chispa nlp""") ```
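The token vectors in the `embeddings` column can be compared pairwise, for example with cosine similarity. A minimal plain-Python sketch with toy vectors (no Spark NLP dependency):

```python
import math

# Cosine similarity between two dense vectors, as one might apply to
# per-token embedding arrays pulled out of the result DataFrame.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical directions)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```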
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_beto_gn_base_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|411.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/mmaguero/beto-gn-base-cased --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4CHEMD_Modified_PubMedBERT date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Modified_PubMedBERT` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Modified_PubMedBERT_en_4.0.0_3.0_1657108813220.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Modified_PubMedBERT_en_4.0.0_3.0_1657108813220.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Modified_PubMedBERT","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Modified_PubMedBERT","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4CHEMD_Modified_PubMedBERT| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|408.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4CHEMD-Modified_PubMedBERT --- layout: model title: English BertForQuestionAnswering model (from madlag) author: John Snow Labs name: bert_qa_bert_base_uncased_squad1.1_block_sparse_0.32_v1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squad1.1-block-sparse-0.32-v1` is an English model originally trained by `madlag`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squad1.1_block_sparse_0.32_v1_en_4.0.0_3.0_1654181454017.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squad1.1_block_sparse_0.32_v1_en_4.0.0_3.0_1654181454017.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_squad1.1_block_sparse_0.32_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_squad1.1_block_sparse_0.32_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased.1.1_block_sparse_0.32_v1.by_madlag").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
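Extractive QA models like this one score candidate start and end token positions and return the highest-scoring valid span. A toy sketch of that span selection with hypothetical scores (not the model's actual internals or the Spark NLP API):

```python
# Pick the (start, end) pair with the best combined score, subject to
# start <= end and a maximum span length. Scores here are made up.
def best_span(start_scores, end_scores, max_len=15):
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return best

print(best_span([0.1, 2.0, 0.3], [0.2, 0.1, 1.5]))  # (1, 2)
```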
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_squad1.1_block_sparse_0.32_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|213.4 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/madlag/bert-base-uncased-squad1.1-block-sparse-0.32-v1 - https://rajpurkar.github.io/SQuAD-explorer - https://www.aclweb.org/anthology/N19-1423.pdf - https://www.aclweb.org/anthology/N19-1423/ - https://arxiv.org/abs/2005.07683 --- layout: model title: Word2Vec Embeddings in Nepali (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, ne, open_source] task: Embeddings language: ne edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ne_3.4.1_3.0_1647448067646.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ne_3.4.1_3.0_1647448067646.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ne") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["मलाई स्पार्क एनएलपी मन पर्छ"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ne") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("मलाई स्पार्क एनएलपी मन पर्छ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ne.embed.w2v_cc_300d").predict("""मलाई स्पार्क एनएलपी मन पर्छ""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|ne| |Size:|334.2 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Chinese BertForMaskedLM Cased model (from hfl) author: John Snow Labs name: bert_embeddings_hfl_chinese_wwm_ext date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese-bert-wwm-ext` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_hfl_chinese_wwm_ext_zh_4.2.4_3.0_1670020942972.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_hfl_chinese_wwm_ext_zh_4.2.4_3.0_1670020942972.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_hfl_chinese_wwm_ext","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_hfl_chinese_wwm_ext","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_hfl_chinese_wwm_ext| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|383.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/chinese-bert-wwm-ext - https://arxiv.org/abs/1906.08101 - https://github.com/google-research/bert - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://arxiv.org/abs/2004.13922 --- layout: model title: English asr_wav2vec2_large_xlsr_sermon TFWav2Vec2ForCTC from sharonibejih author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_sermon date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_sermon` is an English model originally trained by sharonibejih.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_sermon_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_sermon_en_4.2.0_3.0_1664115294414.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_sermon_en_4.2.0_3.0_1664115294414.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_sermon', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_sermon", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_sermon| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Arabic BertForMaskedLM Large Cased model (from asafaya) author: John Snow Labs name: bert_embeddings_large_arabic date: 2022-12-02 tags: [ar, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ar edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-arabic` is an Arabic model originally trained by `asafaya`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_arabic_ar_4.2.4_3.0_1670019914064.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_arabic_ar_4.2.4_3.0_1670019914064.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_arabic","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_arabic","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_large_arabic| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|1.3 GB| |Case sensitive:|true| ## References - https://huggingface.co/asafaya/bert-large-arabic - https://traces1.inria.fr/oscar/ - http://commoncrawl.org/ - https://dumps.wikimedia.org/backup-index.html - https://github.com/google-research/bert - https://www.tensorflow.org/tfrc - https://github.com/alisafaya/Arabic-BERT --- layout: model title: Word2Vec Embeddings in Goan Konkani (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-15 tags: [cc, embeddings, fastText, word2vec, gom, open_source] task: Embeddings language: gom edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_gom_3.4.1_3.0_1647374766723.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_gom_3.4.1_3.0_1647374766723.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","gom") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","gom") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("gom.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|gom| |Size:|167.6 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English T5ForConditionalGeneration Cased model (from tscholak) author: John Snow Labs name: t5_1zha5ono date: 2023-01-30 tags: [en, open_source, t5] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `1zha5ono` is an English model originally trained by `tscholak`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_1zha5ono_en_4.3.0_3.0_1675095935006.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_1zha5ono_en_4.3.0_3.0_1675095935006.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_1zha5ono","en") \ .setInputCols(["document"]) \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_1zha5ono","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_1zha5ono| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|947.0 MB| ## References - https://huggingface.co/tscholak/1zha5ono - https://arxiv.org/abs/2109.05093 - https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#lm-adapted-t511lm100k - https://yale-lily.github.io/spider - https://github.com/ElementAI/picard --- layout: model title: ELECTRA Embeddings (ELECTRA Small) author: John Snow Labs name: electra_small_uncased date: 2020-08-27 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description ELECTRA is a BERT-like model that is pre-trained as a discriminator in a set-up resembling a generative adversarial network (GAN). It was originally published by Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning: [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/forum?id=r1xMH1BtvB), ICLR 2020. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_small_uncased_en_2.6.0_2.4_1598485458536.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_small_uncased_en_2.6.0_2.4_1598485458536.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("electra_small_uncased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("electra_small_uncased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.electra.small_uncased').predict(text, output_level='token') embeddings_df ```
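A common downstream use of these token embeddings is comparing tokens by cosine similarity. The sketch below is illustrative only: the vectors are toy 4-dimensional values standing in for real model output (the model actually emits 256-dimensional vectors per token, as listed under Model Information).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for per-token rows of the "embeddings" output column.
v_love = [1.0, 0.0, 1.0, 0.0]
v_like = [0.9, 0.1, 0.8, 0.0]
v_rock = [0.0, 1.0, 0.0, 1.0]

print(cosine_similarity(v_love, v_like))  # close to 1: similar directions
print(cosine_similarity(v_love, v_rock))  # 0.0: orthogonal directions
```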
{:.h2_title} ## Results ```bash en_embed_electra_small_uncased_embeddings token [0.7677724361419678, -0.621893048286438, -0.93... I [1.1656712293624878, -0.04712940752506256, 0.1... love [1.0845087766647339, -0.00533430278301239, -0.... NLP ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_small_uncased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|256| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/google/electra_small/2](https://tfhub.dev/google/electra_small/2) --- layout: model title: Explain Document Pipeline for Danish author: John Snow Labs name: explain_document_md date: 2021-03-22 tags: [open_source, danish, explain_document_md, pipeline, da] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: da edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_md is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps. 
It performs most of the common text processing tasks on your dataframe.

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_da_3.0.0_3.0_1616437187649.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_md_da_3.0.0_3.0_1616437187649.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('explain_document_md', lang = 'da')
annotations = pipeline.fullAnnotate("Hej fra John Snow Labs! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("explain_document_md", lang = "da")
val result = pipeline.fullAnnotate("Hej fra John Snow Labs! ")(0)
```

{:.nlu-block}
```python
import nlu

text = ["Hej fra John Snow Labs! "]
result_df = nlu.load('da.explain.md').predict(text)
result_df
```
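In the results below, the `ner` column holds token-level BIO tags that the pipeline collapses into the `entities` column. The sketch below illustrates that grouping logic in plain Python; it is a simplified stand-in for what the pipeline's NER-converter stage does, not its actual implementation.

```python
def bio_to_chunks(tokens, tags):
    """Group token-level BIO tags into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close any open chunk before starting a new one
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            current.append(token)  # continue the open chunk
        else:  # "O" or a stray I- tag ends any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Hej", "fra", "John", "Snow", "Labs!"]
tags = ["O", "O", "B-PER", "I-PER", "I-PER"]
print(bio_to_chunks(tokens, tags))  # [('John Snow Labs!', 'PER')]
```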
## Results

```bash
|    | document                     | sentence                    | token                                   | lemma                                   | pos                                        | embeddings                   | ner                                   | entities            |
|---:|:-----------------------------|:----------------------------|:----------------------------------------|:----------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------|
|  0 | ['Hej fra John Snow Labs! '] | ['Hej fra John Snow Labs!'] | ['Hej', 'fra', 'John', 'Snow', 'Labs!'] | ['Hej', 'fra', 'John', 'Snow', 'Labs!'] | ['NOUN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.4006600081920624,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|explain_document_md|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|da|

---
layout: model
title: Dutch Part of Speech Tagger (from oliverguhr)
author: John Snow Labs
name: roberta_pos_fullstop_dutch_punctuation_prediction
date: 2022-05-03
tags: [roberta, pos, part_of_speech, nl, open_source]
task: Part of Speech Tagging
language: nl
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `fullstop-dutch-punctuation-prediction` is a Dutch model originally trained by `oliverguhr`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_pos_fullstop_dutch_punctuation_prediction_nl_3.4.2_3.0_1651596311447.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_pos_fullstop_dutch_punctuation_prediction_nl_3.4.2_3.0_1651596311447.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_pos_fullstop_dutch_punctuation_prediction","nl") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ik hou van Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_pos_fullstop_dutch_punctuation_prediction","nl") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ik hou van Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("nl.pos.fullstop_dutch_punctuation_prediction").predict("""Ik hou van Spark NLP""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_pos_fullstop_dutch_punctuation_prediction|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|nl|
|Size:|436.3 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/oliverguhr/fullstop-dutch-punctuation-prediction

---
layout: model
title: German BertForTokenClassification Cased model (from elenanereiss)
author: John Snow Labs
name: bert_token_classifier_german_ler
date: 2022-11-30
tags: [de, open_source, bert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: de
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-german-ler` is a German model originally trained by `elenanereiss`.

## Predicted Entities

`EUN`, `LIT`, `RR`, `INN`, `RS`, `PER`, `VO`, `UN`, `MRK`, `AN`, `LD`, `STR`, `GRT`, `ORG`, `GS`, `VT`, `LDS`, `ST`, `VS`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_german_ler_de_4.2.4_3.0_1669815202449.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_german_ler_de_4.2.4_3.0_1669815202449.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_german_ler","de") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_german_ler","de")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_german_ler| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|de| |Size:|407.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/elenanereiss/bert-german-ler - https://github.com/elenanereiss/bert-legal-ner - https://paperswithcode.com/sota?task=Token+Classification&dataset=elenanereiss%2Fgerman-ler --- layout: model title: Latin BertForMaskedLM Cased model (from cook) author: John Snow Labs name: bert_embeddings_cicero_similis date: 2022-12-02 tags: [la, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: la edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cicero-similis` is a Latin model originally trained by `cook`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_cicero_similis_la_4.2.4_3.0_1670021953584.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_cicero_similis_la_4.2.4_3.0_1670021953584.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_cicero_similis","la") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_cicero_similis","la")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_cicero_similis|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|la|
|Size:|233.5 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/cook/cicero-similis

---
layout: model
title: Fast Neural Machine Translation Model from English to Kirundi
author: John Snow Labs
name: opus_mt_en_rn
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, rn, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

- source languages: `en`
- target languages: `rn`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_rn_xx_2.7.0_2.4_1609170714678.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_rn_xx_2.7.0_2.4_1609170714678.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_rn", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])

light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_rn", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("text to translate").toDS.toDF("text")

val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu

text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.rn').predict(text, output_level='sentence')
opus_df
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_rn| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForSequenceClassification Base Cased model (from mrm8488) author: John Snow Labs name: roberta_sequence_classifier_codebert_base_finetuned_detect_insecure_cod date: 2022-07-13 tags: [en, open_source, roberta, sequence_classification] task: Text Classification language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `codebert-base-finetuned-detect-insecure-code` is a English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_codebert_base_finetuned_detect_insecure_cod_en_4.0.0_3.0_1657716286514.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_codebert_base_finetuned_detect_insecure_cod_en_4.0.0_3.0_1657716286514.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_codebert_base_finetuned_detect_insecure_cod","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, classifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_codebert_base_finetuned_detect_insecure_cod","en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, classifier))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_sequence_classifier_codebert_base_finetuned_detect_insecure_cod| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|469.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mrm8488/codebert-base-finetuned-detect-insecure-code - https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Defect-detection - https://arxiv.org/pdf/1907.11692.pdf - https://arxiv.org/pdf/2002.08155.pdf --- layout: model title: Extract Health and Behaviours Problems Entities from Social Determinants of Health Texts author: John Snow Labs name: ner_sdoh_health_behaviours_problems_wip date: 2023-02-24 tags: [clinical, licensed, en, social_determinants, ner, public_health, sdoh, health, behaviours, problems, health_behaviours_problems] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.3.1 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts health and behaviours problems related to Social Determinants of Health from various kinds of biomedical documents. ## Predicted Entities `Diet`, `Mental_Health`, `Obesity`, `Eating_Disorder`, `Sexual_Activity`, `Disability`, `Quality_Of_Life`, `Other_Disease`, `Exercise`, `Communicable_Disease`, `Hyperlipidemia`, `Hypertension` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_health_behaviours_problems_wip_en_4.3.1_3.0_1677198610586.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_health_behaviours_problems_wip_en_4.3.1_3.0_1677198610586.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_sdoh_health_behaviours_problems_wip", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    clinical_embeddings,
    ner_model,
    ner_converter
])

sample_texts = ["She has not been getting regular exercise and not followed diet for approximately two years due to chronic sciatic pain.",
                "Medical History: The patient is a 32-year-old female who presents with a history of anxiety, depression, bulimia nervosa, elevated cholesterol, and substance abuse.",
                "Pt was intubated at the scene & currently sedated due to high BP. Also, he is currently on social security disability."]

data = spark.createDataFrame(sample_texts, StringType()).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_model = MedicalNerModel.pretrained("ner_sdoh_health_behaviours_problems_wip", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence_detector,
    tokenizer,
    clinical_embeddings,
    ner_model,
    ner_converter
))

val data = Seq("She has not been getting regular exercise for approximately two years due to chronic sciatic pain.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
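The precision, recall and F1 figures reported in the Benchmarking section below follow directly from the per-label tp/fp/fn counts via the standard formulas. A quick sanity-check sketch, using the counts reported for the `Hypertension` label:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from per-label counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypertension row of the benchmark: tp=52, fp=0, fn=2
p, r, f = prf(52, 0, 2)
print(round(p, 6), round(r, 6), round(f, 6))  # 1.0 0.962963 0.981132
```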
## Results ```bash +--------------------+-----+---+---------------+ |chunk |begin|end|ner_label | +--------------------+-----+---+---------------+ |regular exercise |25 |40 |Exercise | |diet |59 |62 |Diet | |chronic sciatic pain|99 |118|Other_Disease | |anxiety |84 |90 |Mental_Health | |depression |93 |102|Mental_Health | |bulimia nervosa |105 |119|Eating_Disorder| |elevated cholesterol|122 |141|Hyperlipidemia | |high BP |56 |62 |Hypertension | |disability |106 |115|Disability | +--------------------+-----+---+---------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_sdoh_health_behaviours_problems_wip| |Compatibility:|Healthcare NLP 4.3.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.0 MB| ## Benchmarking ```bash label tp fp fn total precision recall f1 Quality_Of_Life 127.0 19.0 3.0 130.0 0.869863 0.976923 0.920290 Eating_Disorder 56.0 5.0 0.0 56.0 0.918033 1.000000 0.957265 Obesity 16.0 2.0 7.0 23.0 0.888889 0.695652 0.780488 Exercise 103.0 6.0 5.0 108.0 0.944954 0.953704 0.949309 Communicable_Disease 61.0 11.0 5.0 66.0 0.847222 0.924242 0.884058 Hypertension 52.0 0.0 2.0 54.0 1.000000 0.962963 0.981132 Other_Disease 1068.0 85.0 79.0 1147.0 0.926279 0.931125 0.928696 Diet 66.0 12.0 15.0 81.0 0.846154 0.814815 0.830189 Disability 95.0 1.0 6.0 101.0 0.989583 0.940594 0.964467 Mental_Health 1020.0 45.0 134.0 1154.0 0.957746 0.883882 0.919333 Hyperlipidemia 19.0 1.0 2.0 21.0 0.950000 0.904762 0.926829 Sexual_Activity 82.0 15.0 6.0 88.0 0.845361 0.931818 0.886486 ``` --- layout: model title: Relation Extraction between generic entities author: John Snow Labs name: generic_re date: 2022-12-20 task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.2.3 spark_version: 3.0 tags: [re, en, clinical, licensed, relation extraction, generic] supported: true annotator: RelationExtractionModel article_header: type: cover 
use_language_switcher: "Python-Scala-Java"
---

## Description

We already have more than 80 relation extraction (RE) models that can extract relations between certain named entities. Nevertheless, there are rare entities or edge cases for which you may not find a suitable RE model, or the one you find may not work as expected due to the nature of your dataset. To ease this burden, we are releasing a generic RE model (`generic_re`) that can be used between any named entities, based on the syntactic distances, POS tags and dependency tree between the entities. You can tune this model with the `setMaxSyntacticDistance` param.

## Predicted Entities

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_POSOLOGY/){:.button.button-orange.button-orange-trans.co.button-icon}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}

## How to use
{% include programmingLanguageSelectScalaPython.html %}
```python
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = sparknlp.annotators.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

words_embedder = WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

ner_tagger = MedicalNerModel()\
    .pretrained("ner_oncology_wip", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ner_tags")

ner_chunker = NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

reModel = RelationExtractionModel()\
    .pretrained("generic_re")\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setRelationPairs(["Biomarker-Biomarker_Result", "Biomarker_Result-Biomarker", "Oncogene-Biomarker_Result", "Biomarker_Result-Oncogene", "Pathology_Test-Pathology_Result", "Pathology_Result-Pathology_Test"])\
    .setMaxSyntacticDistance(4)

pipeline = Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    words_embedder,
    pos_tagger,
    ner_tagger,
    ner_chunker,
    dependency_parser,
    reModel
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

text = """Pathology showed tumor cells, which were positive for estrogen and progesterone receptors."""

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentences")

val tokenizer = new Tokenizer()
    .setInputCols("sentences")
    .setOutputCol("tokens")

val words_embedder = WordEmbeddingsModel()
    .pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens"))
    .setOutputCol("embeddings")

val pos_tagger = PerceptronModel()
    .pretrained("pos_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens"))
    .setOutputCol("pos_tags")

val ner_tagger = MedicalNerModel()
    .pretrained("ner_oncology_wip", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens", "embeddings"))
    .setOutputCol("ner_tags")

val ner_chunker = new NerConverterInternal()
    .setInputCols(Array("sentences", "tokens", "ner_tags"))
    .setOutputCol("ner_chunks")

val dependency_parser = DependencyParserModel()
    .pretrained("dependency_conllu", "en")
    .setInputCols(Array("sentences", "pos_tags", "tokens"))
    .setOutputCol("dependencies")

val re_Model = RelationExtractionModel()
    .pretrained("generic_re")
    .setInputCols(Array("embeddings", "pos_tags", "ner_chunks", "dependencies"))
    .setOutputCol("relations")
    .setRelationPairs(Array("Biomarker-Biomarker_Result", "Biomarker_Result-Biomarker", "Oncogene-Biomarker_Result", "Biomarker_Result-Oncogene", "Pathology_Test-Pathology_Result", "Pathology_Result-Pathology_Test"))
    .setMaxSyntacticDistance(4)

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, words_embedder, pos_tagger, ner_tagger, ner_chunker, dependency_parser, re_Model))

val data = Seq("Pathology showed tumor cells, which were positive for estrogen and progesterone receptors.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.h2_title} ## Results ```bash |sentence |entity1_begin |entity1_end | chunk1 | entity1 |entity2_begin |entity2_end | chunk2 | entity2 | relation |confidence| |--------:|-------------:|-----------:|:----------|:-----------------|-------------:|-----------:|:-----------------------|:-----------------|:--------------------------------|----------| | 0 | 1 | 9 | Pathology | Pathology_Test | 18 | 28 | tumor cells | Pathology_Result | Pathology_Test-Pathology_Result | 1| | 0 | 42 | 49 | positive | Biomarker_Result | 55 | 62 | estrogen | Biomarker | Biomarker_Result-Biomarker | 1| | 0 | 42 | 49 | positive | Biomarker_Result | 68 | 89 | progesterone receptors | Biomarker | Biomarker_Result-Biomarker | 1| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|generic_re| |Compatibility:|Healthcare NLP 4.2.3+| |Edition:|Official| |License:|Licensed| |Language:|[en]| --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_ali221000262 TFWav2Vec2ForCTC from ali221000262 author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab_by_ali221000262 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_base_timit_demo_colab_by_ali221000262` is a English model originally trained by ali221000262. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_by_ali221000262_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_ali221000262_en_4.2.0_3.0_1664036614909.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_ali221000262_en_4.2.0_3.0_1664036614909.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_by_ali221000262", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_by_ali221000262", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab_by_ali221000262| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|354.9 MB| --- layout: model title: Vietnamese T5ForConditionalGeneration Base Cased model (from VietAI) author: John Snow Labs name: t5_vit5_base_vietnews_summarization date: 2023-01-31 tags: [vi, open_source, t5, tensorflow] task: Text Generation language: vi edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `vit5-base-vietnews-summarization` is a Vietnamese model originally trained by `VietAI`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_vit5_base_vietnews_summarization_vi_4.3.0_3.0_1675158454304.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_vit5_base_vietnews_summarization_vi_4.3.0_3.0_1675158454304.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_vit5_base_vietnews_summarization","vi") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_vit5_base_vietnews_summarization","vi") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_vit5_base_vietnews_summarization| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|vi| |Size:|485.0 MB| ## References - https://huggingface.co/VietAI/vit5-base-vietnews-summarization - https://paperswithcode.com/sota/abstractive-text-summarization-on-vietnews?p=vit5-pretrained-text-to-text-transformer-for - https://github.com/vietai/ViT5 - https://github.com/vietai/ViT5/blob/main/eval/Eval_vietnews_sum.ipynb --- layout: model title: English RobertaForQuestionAnswering (from huxxx657) author: John Snow Labs name: roberta_qa_roberta_base_finetuned_scrambled_squad_15_new date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-scrambled-squad-15-new` is an English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_scrambled_squad_15_new_en_4.0.0_3.0_1655734119312.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_scrambled_squad_15_new_en_4.0.0_3.0_1655734119312.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_finetuned_scrambled_squad_15_new","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_finetuned_scrambled_squad_15_new","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_scrambled_15_v2.by_huxxx657").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_finetuned_scrambled_squad_15_new| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-scrambled-squad-15-new --- layout: model title: Legal Warrant Agreement Document Binary Classifier (Longformer) author: John Snow Labs name: legclf_warrant_agreement date: 2022-12-18 tags: [en, legal, classification, licensed, document, longformer, warrant, agreement, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_warrant_agreement` model is a Longformer document classifier used to determine whether a document belongs to the class `warrant-agreement` or not (binary classification). Longformers have a limit of 4096 tokens, so only the first 4096 tokens will be taken into account. We have found that, for the large majority of documents in legal corpora, if they are clean and contain only the legal document without any extra preceding material, 4096 tokens are enough to perform document classification. If your document needs more than 4096 tokens, you can try the following: split it into chunks of 4096 tokens, average the chunk embeddings, and train with the averaged version, which means the whole document will be taken into account.
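The chunk-and-average workaround above can be sketched in plain Python. This is a minimal illustration only: `embed_chunk` is a hypothetical stand-in for the real Longformer embedding step, and only the chunking and averaging logic is shown.

```python
# Sketch: handling documents longer than the 4096-token Longformer window by
# chunking the token stream and averaging the per-chunk embeddings.

def chunk_tokens(tokens, size=4096):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def embed_chunk(chunk):
    # Placeholder: a real pipeline would return the model's embedding vector.
    # Here we return a tiny fake vector derived from the chunk length.
    return [float(len(chunk)), 1.0]

def average_embeddings(vectors):
    """Element-wise mean over the per-chunk embedding vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

tokens = ["tok"] * 10000                  # a 10k-token document
chunks = chunk_tokens(tokens)             # 3 chunks: 4096, 4096, 1808 tokens
doc_vector = average_embeddings([embed_chunk(c) for c in chunks])
```

The averaged `doc_vector` can then be used as the single document representation for training, so no part of the document is discarded.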
## Predicted Entities `warrant-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_warrant_agreement_en_1.0.0_3.0_1671393665185.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_warrant_agreement_en_1.0.0_3.0_1671393665185.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_warrant_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[warrant-agreement]| |[other]| |[other]| |[warrant-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_warrant_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.5 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 1.00 0.98 0.99 230 warrant-agreement 0.96 1.00 0.98 99 accuracy - - 0.99 329 macro-avg 0.98 0.99 0.99 329 weighted-avg 0.99 0.99 0.99 329 ``` --- layout: model title: Extract anatomical entities (Voice of the Patients) author: John Snow Labs name: ner_vop_anatomy_wip date: 2023-05-19 tags: [licensed, clinical, en, ner, vop, patient, anatomy] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts anatomical terms from documents written in the patient's own words. Note: the 'wip' suffix indicates that model development is a work in progress; the model will be finalised and its performance improved in upcoming releases. ## Predicted Entities `BodyPart`, `Laterality` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_anatomy_wip_en_4.4.2_3.0_1684511968875.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_anatomy_wip_en_4.4.2_3.0_1684511968875.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_anatomy_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["Ugh, I pulled a muscle in my neck from sleeping weird last night. 
It's like a knot in my trapezius and it hurts to turn my head."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_anatomy_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Ugh, I pulled a muscle in my neck from sleeping weird last night. It's like a knot in my trapezius and it hurts to turn my head.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:----------|:------------| | muscle | BodyPart | | neck | BodyPart | | trapezius | BodyPart | | head | BodyPart | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_anatomy_wip| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
## Benchmarking ```bash label tp fp fn total precision recall f1 BodyPart 2727 221 173 2900 0.93 0.94 0.93 Laterality 557 81 71 628 0.87 0.89 0.88 macro_avg 3284 302 244 3528 0.90 0.92 0.90 micro_avg 3284 302 244 3528 0.92 0.93 0.92 ``` --- layout: model title: English BertForQuestionAnswering model (from recobo) author: John Snow Labs name: bert_qa_chemical_bert_uncased_squad2 date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chemical-bert-uncased-squad2` is an English model originally trained by `recobo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_chemical_bert_uncased_squad2_en_4.0.0_3.0_1654537776973.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_chemical_bert_uncased_squad2_en_4.0.0_3.0_1654537776973.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_chemical_bert_uncased_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_chemical_bert_uncased_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_chemical.bert.uncased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_chemical_bert_uncased_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|410.2 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/recobo/chemical-bert-uncased-squad2 --- layout: model title: Legal Binding Effects Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_binding_effects_bert date: 2023-03-05 tags: [en, legal, classification, clauses, binding_effects, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a binary classifier (True, False) for the `Binding_Effects` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc.
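As a rough illustration of the first option, paragraph splitting "by multiline" divides a document on blank lines so each provision can be classified on its own. The regex and the minimum-length filter below are illustrative choices, not the workshop's exact code.

```python
import re

def split_paragraphs(text, min_chars=20):
    """Split a document into paragraphs on blank lines, dropping tiny fragments."""
    # Two or more consecutive newlines (possibly with whitespace) mark a break.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if len(p.strip()) >= min_chars]

doc = """SECTION 9. BINDING EFFECT.

This Agreement shall be binding upon the parties and their successors.

SECTION 10. GOVERNING LAW.

This Agreement shall be governed by the laws of the State of New York."""

paragraphs = split_paragraphs(doc, min_chars=10)
```

Each resulting paragraph can then be fed to the binary clause classifier separately, so a match anywhere in the document is not diluted by the surrounding text.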
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Binding_Effects`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_binding_effects_bert_en_1.0.0_3.0_1678050516792.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_binding_effects_bert_en_1.0.0_3.0_1678050516792.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_binding_effects_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Binding_Effects]| |[Other]| |[Other]| |[Binding_Effects]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_binding_effects_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Binding_Effects 0.91 0.89 0.90 70 Other 0.92 0.94 0.93 97 accuracy - - 0.92 167 macro-avg 0.92 0.91 0.91 167 weighted-avg 0.92 0.92 0.92 167 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Ayoola) author: John Snow Labs name: distilbert_qa_ayoola_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is a English model originally trained by `Ayoola`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_ayoola_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768287189.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_ayoola_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768287189.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ayoola_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ayoola_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_ayoola_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Ayoola/distilbert-base-uncased-finetuned-squad --- layout: model title: Spanish Bert Embeddings (from Geotrend) author: John Snow Labs name: bert_embeddings_bert_base_es_cased date: 2022-04-11 tags: [bert, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-es-cased` is a Spanish model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_es_cased_es_3.4.2_3.0_1649671116551.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_es_cased_es_3.4.2_3.0_1649671116551.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_es_cased","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_es_cased","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.bert_base_es_cased").predict("""Me encanta chispa nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_es_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|399.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-es-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Detect 10 Different Entities in Hebrew (hebrew_cc_300d embeddings) author: John Snow Labs name: hebrewner_cc_300d date: 2022-08-09 tags: [he, open_source] task: Named Entity Recognition language: he edition: Spark NLP 4.0.2 spark_version: 3.0 supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses Hebrew word embeddings to find 10 different types of entities in Hebrew text. It is trained using `hebrew_cc_300d` word embeddings, please use the same embeddings in the pipeline. Predicted entities: Persons-`PERS`, Dates-`DATE`, Organizations-`ORG`, Locations-`LOC`, Percentage-`PERCENT`, Money-`MONEY`, Time-`TIME`, Miscellaneous (Affiliation)-`MISC_AFF`, Miscellaneous (Event)-`MISC_EVENT`, Miscellaneous (Entity)-`MISC_ENT`. ## Predicted Entities `PERS`, `LOC`, `DATE`, `ORG`, `TIME`, `MONEY`, `MISC_EVENT`, `MISC_AFF`, `PERCENT`, `MISC_ENT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/hebrewner_cc_300d_he_4.0.2_3.0_1660031325511.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/hebrewner_cc_300d_he_4.0.2_3.0_1660031325511.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. 
Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("hebrew_cc_300d", "he") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = NerDLModel.pretrained("hebrewner_cc_300d", "he") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate("""ב- 25 לאוגוסט עצר השב"כ את מוחמד אבו-ג'וייד , אזרח ירדני , שגויס לארגון הפת"ח והופעל על ידי חיזבאללה. 
אבו-ג'וייד התכוון להקים חוליות טרור בגדה ובקרב ערביי ישראל , לבצע פיגוע ברכבת ישראל בנהריה , לפגוע במטרות ישראליות בירדן ולחטוף חיילים כדי לשחרר אסירים ביטחוניים.""") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("hebrew_cc_300d", "he") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("hebrewner_cc_300d", "he") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val nlp_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, ner, ner_converter)) val data = Seq("""ב- 25 לאוגוסט עצר השב"כ את מוחמד אבו-ג'וייד , אזרח ירדני , שגויס לארגון הפת"ח והופעל על ידי חיזבאללה. אבו-ג'וייד התכוון להקים חוליות טרור בגדה ובקרב ערביי ישראל , לבצע פיגוע ברכבת ישראל בנהריה , לפגוע במטרות ישראליות בירדן ולחטוף חיילים כדי לשחרר אסירים ביטחוניים.""").toDS.toDF("text") val result = nlp_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("he.ner").predict("""ח והופעל על ידי חיזבאללה. אבו-ג'וייד התכוון להקים חוליות טרור בגדה ובקרב ערביי ישראל , לבצע פיגוע ברכבת ישראל בנהריה , לפגוע במטרות ישראליות בירדן ולחטוף חיילים כדי לשחרר אסירים ביטחוניים.""") ```
## Results ```bash | | ner_chunk | entity_label | |---:|------------------:|---------------:| | 0 | 25 לאוגוסט | DATE | | 1 | השב"כ | ORG | | 2 | מוחמד אבו-ג'וייד | PERS | | 3 | ירדני | MISC_AFF | | 4 | הפת"ח | ORG | | 5 | חיזבאללה | ORG | | 6 | אבו-ג'וייד | PERS | | 7 | בגדה | LOC | | 8 | ערביי | MISC_AFF | | 9 | ישראל | LOC | | 10 | ברכבת ישראל | ORG | | 11 | בנהריה | LOC | | 12 | ישראליות | MISC_AFF | | 13 | בירדן | LOC | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|hebrewner_cc_300d| |Type:|ner| |Compatibility:|Spark NLP 4.0.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token, word_embeddings]| |Output Labels:|[ner]| |Language:|he| |Size:|14.8 MB| ## References This model is trained on a dataset obtained from [https://www.cs.bgu.ac.il/~elhadad/nlpproj/naama/](https://www.cs.bgu.ac.il/~elhadad/nlpproj/naama/) ## Benchmarking ```bash label tp fp fn prec rec f1 I-TIME 5 2 0 0.714286 1 0.833333 I-MISC_AFF 2 0 3 1 0.4 0.571429 B-MISC_EVENT 7 0 1 1 0.875 0.933333 B-LOC 180 24 37 0.882353 0.829493 0.855107 I-ORG 124 47 38 0.725146 0.765432 0.744745 B-DATE 50 4 7 0.925926 0.877193 0.900901 I-PERS 157 10 15 0.94012 0.912791 0.926254 I-DATE 39 7 8 0.847826 0.829787 0.83871 B-MISC_AFF 132 11 9 0.923077 0.93617 0.929577 I-MISC_EVENT 6 0 2 1 0.75 0.857143 B-TIME 4 0 1 1 0.8 0.888889 I-PERCENT 8 0 0 1 1 1 I-MISC_ENT 11 3 10 0.785714 0.52381 0.628571 B-MISC_ENT 8 1 5 0.888889 0.615385 0.727273 I-LOC 79 18 23 0.814433 0.77451 0.79397 B-PERS 231 22 26 0.913044 0.898833 0.905882 B-MONEY 36 2 2 0.947368 0.947368 0.947368 B-PERCENT 28 3 0 0.903226 1 0.949152 B-ORG 166 41 37 0.801932 0.817734 0.809756 I-MONEY 61 1 1 0.983871 0.983871 0.983871 Macro-average 1334 196 225 0.899861 0.826869 0.861822 Micro-average 1334 196 225 0.871895 0.855677 0.86371 ``` --- layout: model title: Detect details of cellular structures (biobert) author: John Snow Labs name: ner_cellular_biobert date: 2021-04-01 tags: [ner, clinical, licensed, 
en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract different types of cells, proteins, and their sub-structures using a pretrained NER model. ## Predicted Entities `protein`, `cell_type`, `RNA`, `DNA`, `cell_line` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CELLULAR/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_cellular_biobert_en_3.0.0_3.0_1617260803352.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_cellular_biobert_en_3.0.0_3.0_1617260803352.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_cellular_biobert", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_cellular_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu 
nlu.load("en.med_ner.cellular.biobert").predict("""Put your text here.""") ```
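The NER stage emits BIO tags (e.g. `B-DNA`, `I-DNA`, `O`) which the downstream `NerConverter` merges into chunks. A minimal, Spark-free sketch of that merging logic, using invented tokens and tags for illustration:

```python
def bio_to_chunks(tokens, tags):
    """Merge B-/I- tagged tokens into (chunk, label) pairs,
    mirroring what NerConverter does after the NER stage."""
    chunks, cur, lab = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur:
                chunks.append((" ".join(cur), lab))
            cur, lab = [tok], tag[2:]
        elif tag.startswith("I-") and cur:
            cur.append(tok)
        else:  # "O" tag closes any open chunk
            if cur:
                chunks.append((" ".join(cur), lab))
            cur, lab = [], None
    if cur:
        chunks.append((" ".join(cur), lab))
    return chunks

print(bio_to_chunks(
    ["IL-2", "gene", "expression", "in", "T", "cells"],
    ["B-DNA", "I-DNA", "O", "O", "B-cell_type", "I-cell_type"]))
# -> [('IL-2 gene', 'DNA'), ('T cells', 'cell_type')]
```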
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_cellular_biobert| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| --- layout: model title: BioBERT Embeddings (Pubmed) author: John Snow Labs name: biobert_pubmed_base_cased date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains the pre-trained weights of BioBERT, a language representation model for the biomedical domain, especially designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/biobert_pubmed_base_cased_en_2.6.0_2.4_1598342186392.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/biobert_pubmed_base_cased_en_2.6.0_2.4_1598342186392.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer']], ["text"])) ``` ```scala val embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I hate cancer").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer"] embeddings_df = nlu.load('en.embed.biobert.pubmed_base_cased').predict(text, output_level='token') embeddings_df ```
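Each token in the `embeddings` column receives a 768-dimensional vector. A common downstream use is comparing tokens with cosine similarity; this sketch uses tiny made-up 3-d vectors in place of real BioBERT embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for two token embeddings; similar
# directions give a similarity close to 1.0.
print(cosine([0.42, -0.02, 0.11], [0.40, -0.01, 0.15]))
```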
{:.h2_title} ## Results ```bash token en_embed_biobert_pubmed_base_cased_embeddings I [0.4227580428123474, -0.01985771767795086, -0.... hate [0.04862901195883751, 0.2535072863101959, -0.5... cancer [0.05491625890135765, 0.09395376592874527, -0.... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|biobert_pubmed_base_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert) --- layout: model title: Pipeline to Detect Clinical Entities author: John Snow Labs name: ner_jsl_enriched_biobert_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_jsl_enriched_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_jsl_enriched_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_biobert_pipeline_en_3.4.1_3.0_1647869600266.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_biobert_pipeline_en_3.4.1_3.0_1647869600266.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_jsl_enriched_biobert_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` ```scala val pipeline = new PretrainedPipeline("ner_jsl_enriched_biobert_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. 
His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl_enriched_biobert.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
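`annotate` returns a plain dict mapping each output column name to a list of strings, so the predictions are easy to tally. A quick sketch of counting predicted labels; the `"entities"` key and flat label list below are an assumed, simplified shape for illustration, not the pipeline's exact output:

```python
from collections import Counter

def entity_counts(annotated):
    """Tally predicted entity labels from an annotate()-style dict."""
    return Counter(annotated["entities"])

# Made-up labels echoing the Results table below the code examples.
sample = {"entities": ["Age", "Gender", "Gender", "Symptom_Name", "Gender"]}
print(entity_counts(sample).most_common(1))
# -> [('Gender', 3)]
```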
## Results ```bash +---------------------------+------------+ |chunks |entities | +---------------------------+------------+ |21-day-old |Age | |male |Gender | |mom |Gender | |she |Gender | |mild |Modifier | |problems with his breathing|Symptom_Name| |negative |Negation | |perioral cyanosis |Symptom_Name| |retractions |Symptom_Name| |mom |Gender | |Tylenol |Drug_Name | |His |Gender | |his |Gender | |respiratory congestion |Symptom_Name| |He |Gender | |tired |Symptom_Name| |fussy |Symptom_Name| |albuterol |Drug_Name | |His |Gender | |he |Gender | +---------------------------+------------+ only showing top 20 rows ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_enriched_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.2 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter --- layout: model title: English image_classifier_vit_flowers ViTForImageClassification from Sena author: John Snow Labs name: image_classifier_vit_flowers date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_flowers` is an English model originally trained by Sena. 
## Predicted Entities `nergis`, `leylak`, `karanfil`, `zambak`, `menekse` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_flowers_en_4.1.0_3.0_1660166443375.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_flowers_en_4.1.0_3.0_1660166443375.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_flowers", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_flowers", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
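The classifier's final layer scores each of the five flower labels and the highest-probability label is emitted in the `class` column. A Spark-free sketch of that last step, with made-up logits (the real model works on image tensors, not hand-written scores):

```python
import math

# The five labels this model predicts.
LABELS = ["nergis", "leylak", "karanfil", "zambak", "menekse"]

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return the label with the highest softmax probability."""
    probs = softmax(logits)
    return LABELS[probs.index(max(probs))]

print(predict([0.2, 1.1, 4.7, 0.3, -0.5]))  # -> karanfil
```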
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_flowers| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English BertForQuestionAnswering model (from EasthShin) author: John Snow Labs name: bert_qa_Klue_CommonSense_model date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Klue-CommonSense-model` is an English model originally trained by `EasthShin`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Klue_CommonSense_model_en_4.0.0_3.0_1654178775404.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Klue_CommonSense_model_en_4.0.0_3.0_1654178775404.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Klue_CommonSense_model","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_Klue_CommonSense_model","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.klue.bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
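Under the hood, an extractive QA head scores each token as a potential answer start and end, and the best-scoring valid span becomes the `answer`. A schematic of that span selection with invented per-word scores (real models score subword tokens and use logits):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (i, j) maximizing start_scores[i] + end_scores[j]
    over valid spans with j >= i and length <= max_len."""
    best, span = float("-inf"), (0, 0)
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best:
                best, span = s + end_scores[j], (i, j)
    return span

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0, 0, 0, 5, 0, 0, 0, 0, 1, 0]   # made-up start scores
end   = [0, 0, 0, 4, 0, 0, 0, 0, 2, 0]   # made-up end scores
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # -> Clara
```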
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_Klue_CommonSense_model| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|413.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/EasthShin/Klue-CommonSense-model - https://mindslab.ai:8080/kr/company - https://ainize.ai/EastHShin/Klue-CommonSense_QA?branch=main - https://ainize.ai/workspace/create?imageId=hnj95592adzr02xPTqss&git=https://github.com/EastHShin/Klue-CommonSense-workspace - https://main-klue-common-sense-qa-east-h-shin.endpoint.ainize.ai/ --- layout: model title: Smaller BERT Embeddings (L-2_H-128_A-2) author: John Snow Labs name: small_bert_L2_128 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L2_128_en_2.6.0_2.4_1598344320681.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L2_128_en_2.6.0_2.4_1598344320681.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L2_128", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L2_128", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L2_128').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash token en_embed_bert_small_L2_128_embeddings I [-1.2788691520690918, -0.011364400386810303, 0.... love [-1.4087588787078857, -0.348095178604126, -0.... NLP [-1.6277656555175781, -0.28823617100715637, ... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|small_bert_L2_128| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|128| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1) --- layout: model title: Legal Severability Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_severability_bert date: 2023-03-05 tags: [en, legal, classification, clauses, severability, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from the US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Severability` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. 
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Severability`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_severability_bert_en_1.0.0_3.0_1678050694994.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_severability_bert_en_1.0.0_3.0_1678050694994.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_severability_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
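As noted in the description, several binary clause classifiers can be run over the same provision and their outputs merged. A small sketch of that merging step; the model names in the sample dict are illustrative (only `legclf_severability_bert` appears in this card), and the `{model_name: predicted_label}` shape is an assumption about how you might collect the per-model results:

```python
def detected_clauses(results):
    """Given {model_name: predicted_label} from several binary clause
    classifiers, list the clause types detected (label != 'Other')."""
    return sorted(label for label in results.values() if label != "Other")

sample = {
    "legclf_severability_bert": "Severability",
    "legclf_confidentiality_bert": "Other",   # hypothetical sibling model
}
print(detected_clauses(sample))
# -> ['Severability']
```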
## Results ```bash +--------------+ |result        | +--------------+ |[Severability]| |[Other]       | |[Other]       | |[Severability]| +--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_severability_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.98 1.00 0.99 263 Severability 1.00 0.98 0.99 233 accuracy - - 0.99 496 macro-avg 0.99 0.99 0.99 496 weighted-avg 0.99 0.99 0.99 496 ``` --- layout: model title: Relation Extraction Between Body Parts and Procedures author: John Snow Labs name: redl_bodypart_procedure_test_biobert date: 2021-09-10 tags: [relation_extraction, en, clinical, dl, licensed] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.0.3 spark_version: 2.4 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Relation extraction between body part entities like ‘Internal_organ_or_component’, ‘External_body_part_or_region’, etc., and procedure and test entities. `1` : body part and test/procedure are related to each other. `0` : body part and test/procedure are not related to each other. ## Predicted Entities `0`, `1` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_procedure_test_biobert_en_3.0.3_2.4_1631307197287.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_procedure_test_biobert_en_3.0.3_2.4_1631307197287.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = sparknlp.annotators.Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") words_embedder = WordEmbeddingsModel() \ .pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverter() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel() \ .pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") # Set a filter on pairs of named entities which will be treated as relation candidates re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks")\ .setRelationPairs(["external_body_part_or_region-test"]) # The dataset this model is trained on is sentence-wise. # This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. 
re_model = RelationExtractionDLModel()\ .pretrained('redl_bodypart_procedure_test_biobert', 'en', "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) data = spark.createDataFrame([['''TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.''']]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala ... val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols(Array("sentences")) .setOutputCol("tokens") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks"). 
setRelationPairs(Array("external_body_part_or_region-test")) // The dataset this model is trained on is sentence-wise. // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. val re_model = RelationExtractionDLModel() .pretrained("redl_bodypart_procedure_test_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("""TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.bodypart.procedure").predict("""TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.""") ```
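The `relations` output carries a predicted relation label plus a confidence for each candidate entity pair; a typical post-processing step keeps only pairs predicted as related above a threshold. A sketch over sample rows shaped like the Results table (plain dicts stand in for the real annotation objects):

```python
def related_pairs(rows, threshold=0.5):
    """Keep (chunk1, chunk2) pairs predicted as related ('1')
    with confidence at or above the threshold."""
    return [(r["chunk1"], r["chunk2"])
            for r in rows
            if r["relation"] == "1" and r["confidence"] >= threshold]

rows = [
    {"relation": "1", "chunk1": "chest", "chunk2": "portable ultrasound",
     "confidence": 0.99953},
    {"relation": "0", "chunk1": "chest", "chunk2": "informed consent",
     "confidence": 0.84},  # made-up non-related pair for contrast
]
print(related_pairs(rows))
# -> [('chest', 'portable ultrasound')]
```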
## Results ```bash | | relation | entity1 | chunk1 | entity2 | chunk2 | confidence | |---:|-----------:|:-----------------------------|:---------|:----------|:--------------------|-------------:| | 0 | 1 | External_body_part_or_region | chest | Test | portable ultrasound | 0.99953 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_bodypart_procedure_test_biobert| |Compatibility:|Healthcare NLP 3.0.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Case sensitive:|true| ## Data Source Trained on a custom internal dataset. ## Benchmarking ```bash Relation Recall Precision F1 Support 0 0.338 0.472 0.394 325 1 0.904 0.843 0.872 1275 Avg. 0.621 0.657 0.633 - ``` --- layout: model title: English image_classifier_vit_simple_kitchen ViTForImageClassification from black author: John Snow Labs name: image_classifier_vit_simple_kitchen date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_simple_kitchen` is an English model originally trained by black. ## Predicted Entities `best kitchen island`, `kitchen cabinet`, `kitchen countertop` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_simple_kitchen_en_4.1.0_3.0_1660168529679.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_simple_kitchen_en_4.1.0_3.0_1660168529679.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_simple_kitchen", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_simple_kitchen", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_simple_kitchen|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|

---
layout: model
title: Named Entity Recognition Profiling (Clinical)
author: John Snow Labs
name: ner_profiling_clinical
date: 2022-08-30
tags: [en, clinical, profiling, ner_profiling, ner, licensed]
task: [Named Entity Recognition, Pipeline Healthcare]
language: en
nav_key: models
edition: Healthcare NLP 4.0.2
spark_version: 3.0
supported: true
recommended: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pipeline can be used to explore all the available pretrained NER models at once. Running it over your text produces the predictions of every pretrained clinical NER model trained with `embeddings_clinical`. This version adds new clinical NER models and their outputs to the previous release.
Here are the NER models that this pretrained pipeline includes: `jsl_ner_wip_clinical`, `jsl_ner_wip_greedy_clinical`, `jsl_ner_wip_modifier_clinical`, `jsl_rd_ner_wip_greedy_clinical`, `ner_abbreviation_clinical`, `ner_ade_binary`, `ner_ade_clinical`, `ner_anatomy`, `ner_anatomy_coarse`, `ner_bacterial_species`, `ner_biomarker`, `ner_biomedical_bc2gm`, `ner_bionlp`, `ner_cancer_genetics`, `ner_cellular`, `ner_chemd_clinical`, `ner_chemicals`, `ner_chemprot_clinical`, `ner_chexpert`, `ner_clinical`, `ner_clinical_large`, `ner_clinical_trials_abstracts`, `ner_covid_trials`, `ner_deid_augmented`, `ner_deid_enriched`, `ner_deid_generic_augmented`, `ner_deid_large`, `ner_deid_sd`, `ner_deid_sd_large`, `ner_deid_subentity_augmented`, `ner_deid_subentity_augmented_i2b2`, `ner_deid_synthetic`, `ner_deidentify_dl`, `ner_diseases`, `ner_diseases_large`, `ner_drugprot_clinical`, `ner_drugs`, `ner_drugs_greedy`, `ner_drugs_large`, `ner_events_admission_clinical`, `ner_events_clinical`, `ner_genetic_variants`, `ner_human_phenotype_gene_clinical`, `ner_human_phenotype_go_clinical`, `ner_jsl`, `ner_jsl_enriched`, `ner_jsl_greedy`, `ner_jsl_slim`, `ner_living_species`, `ner_measurements_clinical`, `ner_medmentions_coarse`, `ner_nature_nero_clinical`, `ner_nihss`, `ner_pathogen`, `ner_posology`, `ner_posology_experimental`, `ner_posology_greedy`, `ner_posology_large`, `ner_posology_small`, `ner_radiology`, `ner_radiology_wip_clinical`, `ner_risk_factors`, `ner_supplement_clinical`, `nerdl_tumour_demo` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.2.Pretrained_NER_Profiling_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_profiling_clinical_en_4.0.2_3.0_1661867359272.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_profiling_clinical_en_4.0.2_3.0_1661867359272.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline ner_profiling_pipeline = PretrainedPipeline('ner_profiling_clinical', 'en', 'clinical/models') result = ner_profiling_pipeline.annotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .""") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val ner_profiling_pipeline = PretrainedPipeline("ner_profiling_clinical", "en", "clinical/models") val result = ner_profiling_pipeline.annotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .""") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.profiling_clinical").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .""") ```
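The `annotate()` call returns a single flat dictionary containing the outputs of all bundled models at once. A small helper can regroup those keys per model; the `<model_name>_chunks` key convention below is an assumption based on typical profiling-pipeline outputs, so inspect `result.keys()` on your own output to confirm the suffix before relying on it:

```python
# Sketch: regroup a flat annotate() result by NER model.
# Assumption: each model contributes a key named "<model_name>_chunks";
# check result.keys() to confirm the suffix your pipeline version uses.

def group_by_model(result, suffix="_chunks"):
    """Collect {model_name: chunk_list} from a flat annotate() dict."""
    grouped = {}
    for key, value in result.items():
        if key.endswith(suffix):
            grouped[key[: -len(suffix)]] = value
    return grouped

# Toy stand-in for a real annotate() result:
sample = {
    "sentence": ["A 28-year-old female ..."],
    "ner_jsl_chunks": ["28-year-old", "female"],
    "ner_clinical_chunks": ["gestational diabetes mellitus"],
}
print(group_by_model(sample))
# {'ner_jsl': ['28-year-old', 'female'], 'ner_clinical': ['gestational diabetes mellitus']}
```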
## Results ```bash ******************** ner_jsl Model Results ******************** [('28-year-old', 'Age'), ('female', 'Gender'), ('gestational diabetes mellitus', 'Diabetes'), ('eight years prior', 'RelativeDate'), ('subsequent', 'Modifier'), ('type two diabetes mellitus', 'Diabetes'), ('T2DM', 'Diabetes'), ('HTG-induced pancreatitis', 'Disease_Syndrome_Disorder'), ('three years prior', 'RelativeDate'), ('acute', 'Modifier'), ('hepatitis', 'Communicable_Disease'), ('obesity', 'Obesity'), ('body mass index', 'Symptom'), ('33.5 kg/m2', 'Weight'), ('one-week', 'Duration'), ('polyuria', 'Symptom'), ('polydipsia', 'Symptom'), ('poor appetite', 'Symptom'), ('vomiting', 'Symptom')] ******************** ner_diseases_large Model Results ******************** [('gestational diabetes mellitus', 'Disease'), ('diabetes mellitus', 'Disease'), ('T2DM', 'Disease'), ('pancreatitis', 'Disease'), ('hepatitis', 'Disease'), ('obesity', 'Disease'), ('polyuria', 'Disease'), ('polydipsia', 'Disease'), ('vomiting', 'Disease')] ******************** ner_radiology Model Results ******************** [('gestational diabetes mellitus', 'Disease_Syndrome_Disorder'), ('type two diabetes mellitus', 'Disease_Syndrome_Disorder'), ('T2DM', 'Disease_Syndrome_Disorder'), ('HTG-induced pancreatitis', 'Disease_Syndrome_Disorder'), ('acute hepatitis', 'Disease_Syndrome_Disorder'), ('obesity', 'Disease_Syndrome_Disorder'), ('body', 'BodyPart'), ('mass index', 'Symptom'), ('BMI', 'Test'), ('33.5', 'Measurements'), ('kg/m2', 'Units'), ('polyuria', 'Symptom'), ('polydipsia', 'Symptom'), ('poor appetite', 'Symptom'), ('vomiting', 'Symptom')] ******************** ner_clinical Model Results ******************** [('gestational diabetes mellitus', 'PROBLEM'), ('subsequent type two diabetes mellitus', 'PROBLEM'), ('T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index', 'PROBLEM'), ('BMI', 'TEST'), ('polyuria', 'PROBLEM'), 
('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')] ******************** ner_medmentions_coarse Model Results ******************** [('female', 'Organism_Attribute'), ('diabetes mellitus', 'Disease_or_Syndrome'), ('diabetes mellitus', 'Disease_or_Syndrome'), ('T2DM', 'Disease_or_Syndrome'), ('HTG-induced pancreatitis', 'Disease_or_Syndrome'), ('associated with', 'Qualitative_Concept'), ('acute hepatitis', 'Disease_or_Syndrome'), ('obesity', 'Disease_or_Syndrome'), ('body mass index', 'Clinical_Attribute'), ('BMI', 'Clinical_Attribute'), ('polyuria', 'Sign_or_Symptom'), ('polydipsia', 'Sign_or_Symptom'), ('poor appetite', 'Sign_or_Symptom'), ('vomiting', 'Sign_or_Symptom')] ... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_profiling_clinical| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|2.6 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - 
NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - Finisher --- layout: model title: Fast Neural Machine Translation Model from Tetela to English author: John Snow Labs name: opus_mt_tll_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, tll, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.

- source languages: `tll`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_tll_en_xx_2.7.0_2.4_1609168380675.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_tll_en_xx_2.7.0_2.4_1609168380675.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_tll_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_tll_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.tll.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_tll_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English image_classifier_vit_huggingpics_package_demo_2 ViTForImageClassification from nateraw
author: John Snow Labs
name: image_classifier_vit_huggingpics_package_demo_2
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_huggingpics_package_demo_2` is an English model originally trained by nateraw.

## Predicted Entities

`corgi`, `husky`, `shibu inu`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_huggingpics_package_demo_2_en_4.1.0_3.0_1660169461913.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_huggingpics_package_demo_2_en_4.1.0_3.0_1660169461913.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_huggingpics_package_demo_2", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_huggingpics_package_demo_2", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_huggingpics_package_demo_2| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Recognize Posology Pipeline author: John Snow Labs name: recognize_entities_posology date: 2021-03-29 tags: [ner, named_entity_recognition, pos, parts_of_speech, posology, ner_posology, pipeline, en, licensed] task: [Named Entity Recognition, Part of Speech Tagging] language: en nav_key: models edition: Spark NLP for Healthcare 3.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline detects drugs, dosage, form, frequency, duration, route, and drug strength in text. ## Predicted Entities `DRUG`, `STRENGTH`, `DURATION`, `FREQUENCY`, `FORM`, `DOSAGE`, `ROUTE`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/recognize_entities_posology_en_3.0.0_3.0_1617042229126.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/recognize_entities_posology_en_3.0.0_3.0_1617042229126.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('recognize_entities_posology', 'en', 'clinical/models') annotations = pipeline.fullAnnotate("""The patient was perscriped 50MG penicilin for is headache""")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("recognize_entities_posology", "en", "clinical/models") val result = pipeline.fullAnnotate("""The patient was perscriped 50MG penicilin for is headache""")(0) ``` {:.nlu-block} ```python import nlu result_df = nlu.load('ner.posology').predict("""The patient was perscriped 50MG penicilin for is headache""") result_df ```
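The raw `ner` output shown in the Results section is a sequence of BIO tags aligned with tokens; `NerConverter` then merges them into entity chunks. As a simplified sketch of that merging logic (not the actual NerConverter implementation), a minimal BIO decoder looks like this:

```python
def bio_to_chunks(tokens, tags):
    """Merge B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new chunk, closing any open one.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # An I- tag continues the open chunk of the same label.
            current.append(token)
        else:
            # O tag (or a dangling I-) closes any open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

# Tokens and tags from the Results section below:
tokens = ["The", "patient", "was", "perscriped", "50MG", "penicilin", "for", "is", "headache"]
tags = ["O", "O", "O", "O", "B-Strength", "B-Drug", "O", "O", "O"]
print(bio_to_chunks(tokens, tags))
# [('50MG', 'Strength'), ('penicilin', 'Drug')]
```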
## Results ```bash +-----------------------------------------+ |result | +-----------------------------------------+ |[O, O, O, O, B-Strength, B-Drug, O, O, O]| +-----------------------------------------+ +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |ner | +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |[[named_entity, 0, 2, O, [word -> The, confidence -> 1.0], []], [named_entity, 4, 10, O, [word -> patient, confidence -> 0.9993], []], [named_entity, 12, 14, O, [word -> was, confidence -> 1.0], []], [named_entity, 16, 25, O, [word -> perscriped, confidence -> 0.9985], []], [named_entity, 27, 30, B-Strength, [word -> 50MG, confidence -> 0.9966], []], [named_entity, 32, 40, B-Drug, [word -> penicilin, confidence -> 0.9934], []], [named_entity, 42, 44, O, [word -> for, confidence -> 0.9999], []], 
[named_entity, 46, 47, O, [word -> is, confidence -> 0.9468], []], [named_entity, 49, 56, O, [word -> headache, confidence -> 0.9805], []]]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|recognize_entities_posology|
|Type:|pipeline|
|Compatibility:|Spark NLP for Healthcare 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|

## Included Models

- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- NerDLModel
- NerConverter

---
layout: model
title: German Electra Embeddings (from stefan-it)
author: John Snow Labs
name: electra_embeddings_electra_base_gc4_64k_500000_cased_generator
date: 2022-05-17
tags: [de, open_source, electra, embeddings]
task: Embeddings
language: de
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-gc4-64k-500000-cased-generator` is a German model originally trained by `stefan-it`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_500000_cased_generator_de_3.4.4_3.0_1652786423286.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_500000_cased_generator_de_3.4.4_3.0_1652786423286.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_500000_cased_generator","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_500000_cased_generator","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
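Each row of the `embeddings` output column carries one dense vector per token. A common downstream use is comparing tokens by cosine similarity; the short vectors below are toy stand-ins for illustration, not actual ELECTRA outputs:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 4-dimensional vectors standing in for token embeddings:
v_liebe = [0.2, 0.7, 0.1, 0.4]
v_mag = [0.25, 0.65, 0.05, 0.5]
v_spark = [0.9, 0.1, 0.8, 0.05]

print(cosine(v_liebe, v_mag))    # semantically close tokens -> value near 1
print(cosine(v_liebe, v_spark))  # unrelated tokens -> noticeably lower value
```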
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|electra_embeddings_electra_base_gc4_64k_500000_cased_generator|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|de|
|Size:|223.0 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/stefan-it/electra-base-gc4-64k-500000-cased-generator
- https://german-nlp-group.github.io/projects/gc4-corpus.html
- https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf

---
layout: model
title: Financial NER (xl, Extra Large)
author: John Snow Labs
name: finner_financial_xlarge
date: 2022-11-30
tags: [en, financial, ner, earning, calls, 10k, fillings, annual, reports, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: FinanceNerModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is an `xl` (extra-large) version of a financial model, trained on a combination of two datasets: Earning Calls and 10-K Filings. Please note this model requires some tokenization configuration to extract the currency (see the Python snippet below).

The aim of this model is to detect the main pieces of financial information in companies' annual reports; more specifically, it was trained on 10-K filings. The currently available entities are:

- AMOUNT: Numeric amounts, not percentages
- ASSET: Current or Fixed Asset
- ASSET_DECREASE: Decrease in the asset possession/exposure
- ASSET_INCREASE: Increase in the asset possession/exposure
- CF: Total cash flow
- CF_DECREASE: Relative decrease in cash flow
- CF_INCREASE: Relative increase in cash flow
- COUNT: Number of items (not monetary, not percentages).
- CURRENCY: The currency of the amount
- DATE: Generic dates in context where either it's not a fiscal year or it can't be asserted as such given the context
- EXPENSE: An expense or loss
- EXPENSE_DECREASE: A piece of information saying there was an expense decrease in that fiscal year
- EXPENSE_INCREASE: A piece of information saying there was an expense increase in that fiscal year
- FCF: Free Cash Flow
- FISCAL_YEAR: A date expressing the month in which the fiscal year was closed for a specific year
- KPI: Key Performance Indicator, a quantifiable measure of performance over time for a specific objective
- KPI_DECREASE: Relative decrease in a KPI
- KPI_INCREASE: Relative increase in a KPI
- LIABILITY: Current or Long-Term Liability (not from stockholders)
- LIABILITY_DECREASE: Relative decrease in liability
- LIABILITY_INCREASE: Relative increase in liability
- ORG: Mention of a company/organization name
- PERCENTAGE: Numeric amounts which are percentages
- PROFIT: Profit or Revenue
- PROFIT_DECLINE: A piece of information saying there was a profit / revenue decrease in that fiscal year
- PROFIT_INCREASE: A piece of information saying there was a profit / revenue increase in that fiscal year
- TICKER: Trading symbol of the company

You can also check the Relation Extraction model, which connects these entities together.

## Predicted Entities

`AMOUNT`, `ASSET`, `ASSET_DECREASE`, `ASSET_INCREASE`, `CF`, `CF_DECREASE`, `CF_INCREASE`, `COUNT`, `CURRENCY`, `DATE`, `EXPENSE`, `EXPENSE_DECREASE`, `EXPENSE_INCREASE`, `FCF`, `FISCAL_YEAR`, `KPI`, `KPI_DECREASE`, `KPI_INCREASE`, `LIABILITY`, `LIABILITY_DECREASE`, `LIABILITY_INCREASE`, `ORG`, `PERCENTAGE`, `PROFIT`, `PROFIT_DECLINE`, `PROFIT_INCREASE`, `TICKER`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_financial_xlarge_en_1.0.0_3.0_1669840074362.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3
URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_financial_xlarge_en_1.0.0_3.0_1669840074362.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from johnsnowlabs import nlp, finance
import pyspark.sql.functions as F

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")\
    .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$','€'])

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)

ner_model = finance.NerModel.pretrained("finner_financial_xlarge", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
])

data = spark.createDataFrame([["""License fees revenue decreased 40 %, or 0.5 million to 0.7 million for the year ended December 31, 2020 compared to 1.2 million for the year ended December 31, 2019"""]]).toDF("text")

model = pipeline.fit(data)
result = model.transform(data)

result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
      .select(F.expr("cols['0']").alias("text"),
              F.expr("cols['1']['entity']").alias("label")).show(200, truncate = False)
```
## Results ```bash +---------+----------------+----------+ | token| ner_label|confidence| +---------+----------------+----------+ | License|B-PROFIT_DECLINE| 0.9658| | fees|I-PROFIT_DECLINE| 0.7826| | revenue|I-PROFIT_DECLINE| 0.8992| |decreased| O| 1.0| | 40| B-PERCENTAGE| 0.9997| | %| O| 1.0| | ,| O| 0.9997| | or| O| 0.9999| | 0.5| B-AMOUNT| 0.9925| | million| I-AMOUNT| 0.9989| | to| O| 0.9996| | 0.7| B-AMOUNT| 0.9368| | million| I-AMOUNT| 0.9949| | for| O| 0.9999| | the| O| 0.9944| | year| O| 0.9976| | ended| O| 0.9987| | December| B-FISCAL_YEAR| 0.9941| | 31| I-FISCAL_YEAR| 0.8955| | ,| I-FISCAL_YEAR| 0.8869| | 2020| I-FISCAL_YEAR| 0.9941| | compared| O| 0.9999| | to| O| 0.9995| | 1.2| B-AMOUNT| 0.9853| | million| I-AMOUNT| 0.9831| | for| O| 0.9999| | the| O| 0.9914| | year| O| 0.9948| | ended| O| 0.9985| | December| B-FISCAL_YEAR| 0.9812| | 31| I-FISCAL_YEAR| 0.8185| | ,| I-FISCAL_YEAR| 0.8351| | 2019| I-FISCAL_YEAR| 0.9541| +---------+----------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_financial_xlarge| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.2 MB| ## References In-house annotations on Earning Calls and 10-K Filings combined. 
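The precision, recall, and F1 columns in the benchmarking table are standard derivations from the per-label tp/fp/fn counts; as a sanity check, the micro-average row can be reproduced directly:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true-positive/false-positive/false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Micro-average counts from the benchmarking table:
p, r, f1 = prf(tp=7134, fp=977, fn=728)
print(round(p, 7), round(r, 7), round(f1, 7))
# 0.8795463 0.9074027 0.8932574
```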
## Benchmarking ```bash label tp fp fn prec rec f1 B-LIABILITY_INCREASE 1 0 0 1.0 1.0 1.0 I-AMOUNT 915 97 8 0.9041502 0.9913326 0.9457364 B-COUNT 7 1 2 0.875 0.7777778 0.8235294 I-LIABILITY_INCREASE 1 0 0 1.0 1.0 1.0 B-AMOUNT 1304 124 19 0.9131653 0.9856387 0.9480189 I-KPI 2 0 7 1.0 0.22222222 0.36363637 B-DATE 525 32 44 0.94254935 0.9226714 0.9325044 I-LIABILITY 156 49 97 0.7609756 0.6166008 0.68122274 I-DATE 343 12 36 0.9661972 0.9050132 0.93460494 B-CF_DECREASE 6 1 3 0.85714287 0.6666667 0.75 I-EXPENSE 270 86 74 0.75842696 0.78488374 0.77142864 I-KPI_INCREASE 0 0 1 0.0 0.0 0.0 B-LIABILITY 82 30 46 0.73214287 0.640625 0.6833333 I-CF 420 97 84 0.8123791 0.8333333 0.82272285 I-CF_DECREASE 17 3 12 0.85 0.5862069 0.6938776 I-COUNT 7 0 0 1.0 1.0 1.0 B-FCF 5 0 0 1.0 1.0 1.0 B-PROFIT_INCREASE 54 23 22 0.7012987 0.7105263 0.7058824 B-KPI_INCREASE 1 0 2 1.0 0.33333334 0.5 B-EXPENSE 118 42 36 0.7375 0.76623374 0.75159234 I-CF_INCREASE 43 0 17 1.0 0.71666664 0.8349514 I-PERCENTAGE 4 6 0 0.4 1.0 0.5714286 I-PROFIT_DECLINE 39 11 4 0.78 0.90697676 0.8387097 I-KPI_DECREASE 1 1 0 0.5 1.0 0.6666667 B-CF_INCREASE 23 0 2 1.0 0.92 0.9583333 I-PROFIT 228 118 19 0.6589595 0.9230769 0.7689713 B-CURRENCY 943 42 12 0.9573604 0.98743457 0.972165 I-PROFIT_INCREASE 80 34 16 0.7017544 0.8333333 0.7619047 B-CF 118 32 29 0.7866667 0.8027211 0.7946128 B-PROFIT 134 55 23 0.7089947 0.85350317 0.7745664 B-PERCENTAGE 281 17 7 0.942953 0.9756944 0.95904434 B-TICKER 2 0 0 1.0 1.0 1.0 I-FISCAL_YEAR 585 17 27 0.9717608 0.9558824 0.9637562 B-ORG 2 0 0 1.0 1.0 1.0 B-PROFIT_DECLINE 22 5 8 0.8148148 0.73333335 0.7719298 B-EXPENSE_INCREASE 35 7 4 0.8333333 0.8974359 0.86419755 B-EXPENSE_DECREASE 23 3 4 0.88461536 0.8518519 0.8679245 B-FISCAL_YEAR 195 6 12 0.9701493 0.942029 0.9558824 I-EXPENSE_DECREASE 46 9 16 0.8363636 0.7419355 0.78632486 I-FCF 10 0 0 1.0 1.0 1.0 I-EXPENSE_INCREASE 83 13 9 0.8645833 0.90217394 0.88297874 Macro-average 7134 977 728 0.77496254 0.72599226 0.74967855 Micro-average 7134 977 728 0.8795463 0.9074027 0.8932574 ``` --- layout: model title: Mapping Abbreviations and Acronyms of Medical Regulatory Activities with Their Definitions author: John Snow Labs name: abbreviation_mapper date: 2022-06-26 tags: [abbreviation, definition, licensed, clinical, en, chunk_mapper] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps abbreviations and acronyms of medical regulatory activities with their `definition`. ## Predicted Entities `definition` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/abbreviation_mapper_en_3.5.3_3.0_1656250645758.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/abbreviation_mapper_en_3.5.3_3.0_1656250645758.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") abbr_ner = MedicalNerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("abbr_ner") abbr_converter = NerConverter() \ .setInputCols(["sentence", "token", "abbr_ner"]) \ .setOutputCol("abbr_ner_chunk") chunkerMapper = ChunkMapperModel.pretrained("abbreviation_mapper", "en", "clinical/models")\ .setInputCols(["abbr_ner_chunk"])\ .setOutputCol("mappings")\ .setRels(["definition"])\ .setLowerCase(True) pipeline = Pipeline().setStages([ document_assembler, sentence_detector, tokenizer, word_embeddings, abbr_ner, abbr_converter, chunkerMapper]) sample_text = ["""Gravid with estimated fetal weight of 6-6/12 pounds. LABORATORY DATA: Laboratory tests include a CBC which is normal. HIV: Negative. One-Hour Glucose: 117.
Group B strep has not been done as yet."""] data = spark.createDataFrame([sample_text]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val abbr_ner = MedicalNerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("abbr_ner") val abbr_converter = new NerConverter() .setInputCols(Array("sentence", "token", "abbr_ner")) .setOutputCol("abbr_ner_chunk") val chunkerMapper = ChunkMapperModel.pretrained("abbreviation_mapper", "en", "clinical/models") .setInputCols("abbr_ner_chunk") .setOutputCol("mappings") .setRels(Array("definition")) .setLowerCase(true) val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, abbr_ner, abbr_converter, chunkerMapper)) val sample_text = """Gravid with estimated fetal weight of 6-6/12 pounds. LABORATORY DATA: Laboratory tests include a CBC which is normal. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.""" val data = Seq(sample_text).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.abbreviation_to_definition").predict("""Gravid with estimated fetal weight of 6-6/12 pounds. LABORATORY DATA: Laboratory tests include a CBC which is normal. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.""") ```
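Conceptually, the `ChunkMapperModel` stage with `setLowerCase(True)` acts like a case-insensitive dictionary lookup over the detected NER chunks. A minimal pure-Python illustration of that behaviour (the two entries are just examples; the real model ships a much larger vocabulary):

```python
# Toy stand-in for the mapper's vocabulary (hypothetical subset).
abbreviation_map = {
    "cbc": "complete blood count",
    "hiv": "human immunodeficiency virus",
}

def map_abbreviations(chunks):
    """Lower-case each detected chunk before the lookup, as setLowerCase(True) does."""
    return {chunk: abbreviation_map.get(chunk.lower(), "NONE") for chunk in chunks}

mappings = map_abbreviations(["CBC", "HIV"])
```

Unknown chunks fall through to a `NONE` relation, mirroring how the annotator reports unmapped chunks.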
## Results ```bash +-----+----------------------------+ |chunk|mapping_result | +-----+----------------------------+ |CBC |complete blood count | |HIV |human immunodeficiency virus| +-----+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|abbreviation_mapper| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[abbr_ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|219.9 KB| ## References https://www.johnsnowlabs.com/marketplace/list-of-abbreviations-and-acronyms-for-medical-regulatory-activities/ --- layout: model title: English DistilBertForQuestionAnswering Cased model (from kevinbram) author: John Snow Labs name: distilbert_qa_testarbara date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `testarbara` is an English model originally trained by `kevinbram`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_testarbara_en_4.3.0_3.0_1672775789186.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_testarbara_en_4.3.0_3.0_1672775789186.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_testarbara","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_testarbara","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_testarbara| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/kevinbram/testarbara --- layout: model title: English asr_wav2vec2_base_100h_by_facebook TFWav2Vec2ForCTC from facebook author: John Snow Labs name: asr_wav2vec2_base_100h_by_facebook date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_by_facebook` is an English model originally trained by facebook. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_100h_by_facebook_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_100h_by_facebook_en_4.2.0_3.0_1664038653083.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_100h_by_facebook_en_4.2.0_3.0_1664038653083.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_100h_by_facebook", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_100h_by_facebook", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_100h_by_facebook| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|227.9 MB| --- layout: model title: English RobertaForQuestionAnswering (from LucasS) author: John Snow Labs name: roberta_qa_robertaABSA date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `robertaABSA` is an English model originally trained by `LucasS`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_robertaABSA_en_4.0.0_3.0_1655738674279.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_robertaABSA_en_4.0.0_3.0_1655738674279.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_robertaABSA","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_robertaABSA","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta_absa").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_robertaABSA| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|436.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/LucasS/robertaABSA --- layout: model title: Chinese BertForMaskedLM Large Cased model (from hfl) author: John Snow Labs name: bert_embeddings_chinese_mac_large date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese-macbert-large` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_mac_large_zh_4.2.4_3.0_1670021258806.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_mac_large_zh_4.2.4_3.0_1670021258806.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_mac_large","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_mac_large","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_mac_large| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|1.2 GB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/chinese-macbert-large - https://github.com/ymcui/MacBERT/blob/master/LICENSE - https://2020.emnlp.org - https://arxiv.org/abs/2004.13922 - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://github.com/chatopera/Synonyms --- layout: model title: English asr_wav2vec2_xls_r_300m TFWav2Vec2ForCTC from hgharibi author: John Snow Labs name: pipeline_asr_wav2vec2_xls_r_300m date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m` is an English model originally trained by hgharibi.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xls_r_300m_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_en_4.2.0_3.0_1664016873292.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_en_4.2.0_3.0_1664016873292.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_300m', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_300m", lang = "en") val annotations = pipeline.transform(audioDF) ```
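Both snippets above assume an `audioDF` DataFrame whose `audio_content` column holds arrays of floating-point samples. A sketch of producing such an array from a 16 kHz mono 16-bit WAV using only the standard library (the file name is hypothetical; a small test file is generated first so the example is self-contained):

```python
import math
import struct
import wave

def write_test_wav(path, seconds=0.1, rate=16000):
    """Write a small 16 kHz mono 440 Hz sine wave so the example is self-contained."""
    n = int(seconds * rate)
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 16-bit PCM
        w.setframerate(rate)
        frames = b"".join(
            struct.pack("<h", int(32767 * math.sin(2 * math.pi * 440 * i / rate)))
            for i in range(n)
        )
        w.writeframes(frames)

def wav_to_floats(path):
    """Read a mono 16-bit WAV into the list of floats the audio column expects."""
    with wave.open(path, "rb") as w:
        assert w.getnchannels() == 1 and w.getsampwidth() == 2
        raw = w.readframes(w.getnframes())
    return [s / 32768.0 for s in struct.unpack("<%dh" % (len(raw) // 2), raw)]

write_test_wav("sample.wav")
audio_content = wav_to_floats("sample.wav")
# The list can then back a one-column Spark DataFrame, e.g.:
# audioDF = spark.createDataFrame([[audio_content]], ["audio_content"])
```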
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xls_r_300m| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal General Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_general_bert date: 2023-03-05 tags: [en, legal, classification, clauses, general, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `General` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip it unless you want to do Binary Classification at sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `General`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_general_bert_en_1.0.0_3.0_1678049996624.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_general_bert_en_1.0.0_3.0_1678049996624.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_general_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
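The description above recommends paragraph-level splitting of long contracts before classification. A minimal sketch of multiline (blank-line) splitting in plain Python, producing one candidate provision per row (the sample text is invented):

```python
import re

def split_paragraphs(text):
    """Split a document into paragraphs on blank lines (multiline splitting)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "First clause text.\n\nSECTION 2\nSecond clause text.\n\n\nThird clause text."
paragraphs = split_paragraphs(doc)
# Each paragraph then becomes one row for the classifier, e.g.:
# df = spark.createDataFrame([[p] for p in paragraphs]).toDF("text")
```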
## Results ```bash +-------+ |result| +-------+ |[General]| |[Other]| |[Other]| |[General]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_general_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support General 0.77 0.81 0.79 122 Other 0.84 0.81 0.82 150 accuracy - - 0.81 272 macro-avg 0.81 0.81 0.81 272 weighted-avg 0.81 0.81 0.81 272 ``` --- layout: model title: Legal Credit agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_credit_agreement date: 2022-10-24 tags: [en, legal, classification, document, agreement, contract, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_credit_agreement` model is a Legal Longformer Document Classifier to classify whether a document belongs to the class credit-agreement or not (Binary Classification). Longformers have a restriction of 4096 tokens, so only the first 4096 tokens will be taken into account. We have found that for the large majority of documents in legal corpora, if they are clean and contain only the legal document without any extra leading material, 4096 tokens are enough to perform Document Classification. If not, let us know and we can carry out another approach for you: getting chunks of 4096 tokens, averaging their embeddings, and training on the averaged version, which means the whole document is taken into account. But this theoretically should not be required.
## Predicted Entities `other`, `credit-agreement` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_credit_agreement_en_1.0.0_3.0_1666620886293.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_credit_agreement_en_1.0.0_3.0_1666620886293.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token") \ .setOutputCol("embeddings") sembeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_credit_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, sembeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
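For documents beyond the 4096-token Longformer limit, the description above mentions chunking the text and averaging the per-chunk embeddings. A pure-Python sketch of that idea (toy two-dimensional vectors stand in for the real sentence embeddings):

```python
def chunk(tokens, size=4096):
    """Split a token sequence into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def mean_vector(vectors):
    """Element-wise average of equally sized embedding vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

# Hypothetical per-chunk sentence embeddings (toy 2-dimensional vectors):
chunk_embeddings = [[1.0, 0.0], [0.0, 1.0]]
doc_embedding = mean_vector(chunk_embeddings)
```

The averaged vector then plays the role of the document-level `sentence_embeddings` fed to the classifier.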
## Results ```bash +-------+ | result| +-------+ |[credit-agreement]| |[other]| |[other]| |[credit-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_credit_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.5 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support credit-agreement 0.94 0.94 0.94 36 other 0.97 0.97 0.97 62 accuracy - - 0.96 98 macro-avg 0.96 0.96 0.96 98 weighted-avg 0.96 0.96 0.96 98 ``` --- layout: model title: Pipeline to Map RxNORM Codes with Their Corresponding UMLS Codes author: John Snow Labs name: rxnorm_umls_mapping date: 2022-06-27 tags: [rxnorm, umls, pipeline, chunk_mapper, clinical, licensed, en] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the `rxnorm_umls_mapper` model. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_umls_mapping_en_3.5.3_3.0_1656367260714.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_umls_mapping_en_3.5.3_3.0_1656367260714.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("rxnorm_umls_mapping", "en", "clinical/models") result= pipeline.fullAnnotate("1161611 315677") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("rxnorm_umls_mapping", "en", "clinical/models") val result= pipeline.fullAnnotate("1161611 315677") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.rxnorm.umls").predict("""1161611 315677""") ```
## Results ```bash
|    | rxnorm_code | umls_code |
|---:|:------------|:----------|
|  0 | 1161611     | C3215948  |
|  1 | 315677      | C0984912  |
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|rxnorm_umls_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.9 MB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_2_h_768 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-2_H-768` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_2_h_768_zh_4.2.4_3.0_1670325904665.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_2_h_768_zh_4.2.4_3.0_1670325904665.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_2_h_768","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_2_h_768","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_2_h_768| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|117.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-2_H-768 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Relation Extraction between Biomarkers and Results (ReDL) author: John Snow Labs name: redl_oncology_biomarker_result_biobert_wip date: 2022-09-29 tags: [licensed, clinical, oncology, en, relation_extraction, test, biomarker] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This relation extraction model links Biomarker and Oncogene extractions to their corresponding Biomarker_Result extractions.
## Predicted Entities `is_finding_of`, `O` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_oncology_biomarker_result_biobert_wip_en_4.1.0_3.0_1664457944095.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_oncology_biomarker_result_biobert_wip_en_4.1.0_3.0_1664457944095.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use relation pairs to include only the combinations of entities that are relevant in your case.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") re_ner_chunk_filter = RENerChunksFilter()\ .setInputCols(["ner_chunk", "dependencies"])\ .setOutputCol("re_ner_chunk")\ .setMaxSyntacticDistance(10)\ .setRelationPairs(["Biomarker-Biomarker_Result", "Biomarker_Result-Biomarker", "Oncogene-Biomarker_Result", "Biomarker_Result-Oncogene"]) re_model = RelationExtractionDLModel.pretrained("redl_oncology_biomarker_result_biobert_wip", "en", "clinical/models")\ .setInputCols(["re_ner_chunk", "sentence"])\ .setOutputCol("relation_extraction") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model]) data = spark.createDataFrame([["Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. 
The test was positive for ER and PR, and negative for HER2."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentence", "pos_tags", "token")) .setOutputCol("dependencies") val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols("ner_chunk", "dependencies") .setOutputCol("re_ner_chunk") .setMaxSyntacticDistance(10) .setRelationPairs(Array("Biomarker-Biomarker_Result", "Biomarker_Result-Biomarker", "Oncogene-Biomarker_Result", "Biomarker_Result-Oncogene")) val re_model = RelationExtractionDLModel.pretrained("redl_oncology_biomarker_result_biobert_wip", "en", "clinical/models") .setInputCols("re_ner_chunk", "sentence") .setOutputCol("relation_extraction") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("Immunohistochemistry was 
negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.oncology_biomarker_result_biobert_wip").predict("""Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.""") ```
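Downstream, you will often want to keep only confident, non-`O` predictions from the `relation_extraction` output. A minimal post-processing sketch in plain Python (the tuple layout below is an illustrative simplification, not the exact Spark NLP annotation schema):

```python
# Filter extracted relations, keeping confident "is_finding_of" pairs.
# Each tuple mimics one extracted relation: (chunk1, chunk2, relation, confidence).
rows = [
    ("negative", "thyroid transcription factor-1", "is_finding_of", 0.998),
    ("positive", "HER2", "O", 0.999),
    ("positive", "ER", "is_finding_of", 0.992),
]

def keep(row, threshold=0.9):
    chunk1, chunk2, relation, confidence = row
    # Drop the "no relation" class and low-confidence predictions.
    return relation != "O" and confidence >= threshold

filtered = [r for r in rows if keep(r)]
print(filtered)
```

The threshold value is a tuning choice; the benchmarking table below the results gives a sense of how reliable each relation label is.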
## Results ```bash | chunk1 | entity1 | chunk2 | entity2 | relation | confidence | |----------|------------------|--------------------------------|------------------|---------------|------------| | negative | Biomarker_Result | thyroid transcription factor-1 | Biomarker | is_finding_of | 0.99808085 | | negative | Biomarker_Result | napsin | Biomarker | is_finding_of | 0.99637383 | | positive | Biomarker_Result | ER | Biomarker | is_finding_of | 0.99221414 | | positive | Biomarker_Result | PR | Biomarker | is_finding_of | 0.9893672 | | positive | Biomarker_Result | HER2 | Oncogene | O | 0.9986272 | | ER | Biomarker | negative | Biomarker_Result | O | 0.9999089 | | PR | Biomarker | negative | Biomarker_Result | O | 0.9998932 | | negative | Biomarker_Result | HER2 | Oncogene | is_finding_of | 0.98810333 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_oncology_biomarker_result_biobert_wip| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|405.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label recall precision f1 O 0.93 0.97 0.95 is_finding_of 0.97 0.93 0.95 macro-avg 0.95 0.95 0.95 ``` --- layout: model title: English asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6 TFWav2Vec2ForCTC from chrisvinsen author: John Snow Labs name: pipeline_asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6` is an English model originally trained by chrisvinsen.
NOTE: This pipeline only works on a CPU. If you need to run it on a GPU device, please use pipeline_asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6_en_4.2.0_3.0_1664107389148.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6_en_4.2.0_3.0_1664107389148.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6", lang = "en") val annotations = pipeline.transform(audioDF) ```
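The snippets above assume an existing `audioDF` whose audio column carries raw samples as an array of floats (commonly 16 kHz, normalized to [-1, 1] — an assumption based on typical Wav2Vec2 preprocessing, so verify against your Spark NLP version). A small standard-library-only sketch of producing such an array:

```python
import math

def synth_samples(freq_hz=440.0, sample_rate=16000, seconds=0.01):
    """Generate a normalized float waveform of the kind an ASR
    audio column usually holds (values in [-1.0, 1.0])."""
    n = int(sample_rate * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

samples = synth_samples()
# In Spark, a list like this would become one row of the audio column, e.g.:
# audioDF = spark.createDataFrame([(samples,)], ["audio_content"])  # hypothetical wiring
```

In practice you would decode a WAV/FLAC file to floats instead of synthesizing a tone; the shape of the data is the point here.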
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_xlsr_wav2vec2_base_commonvoice_demo_colab_6| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Portuguese BertForQuestionAnswering model (from pierreguillou) author: John Snow Labs name: bert_qa_bert_large_cased_squad_v1.1_portuguese date: 2022-06-06 tags: [pt, open_source, question_answering, bert] task: Question Answering language: pt edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-cased-squad-v1.1-portuguese` is a Portuguese model originally trained by `pierreguillou`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_cased_squad_v1.1_portuguese_pt_4.0.0_3.0_1654536169488.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_cased_squad_v1.1_portuguese_pt_4.0.0_3.0_1654536169488.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_cased_squad_v1.1_portuguese","pt") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_large_cased_squad_v1.1_portuguese","pt") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("pt.answer_question.squad.bert.large_cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
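In the NLU one-liner above, the question and context are packed into a single string separated by `|||`. A tiny helper pair (illustrative names only) keeps that wiring explicit when you build such inputs programmatically:

```python
def pack_qa(question, context, sep="|||"):
    """Join a question/context pair into the single-string
    format used by the nlu one-liner."""
    return f"{question}{sep}{context}"

def unpack_qa(packed, sep="|||"):
    """Split a packed string back into (question, context)."""
    question, _, context = packed.partition(sep)
    return question, context

packed = pack_qa("What's my name?", "My name is Clara and I live in Berkeley.")
```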
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_large_cased_squad_v1.1_portuguese| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|pt| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/pierreguillou/bert-large-cased-squad-v1.1-portuguese - https://github.com/piegu/language-models/blob/master/question_answering_BERT_large_cased_squad_v11_pt.ipynb - https://nbviewer.jupyter.org/github/piegu/language-models/blob/master/question_answering_BERT_large_cased_squad_v11_pt.ipynb - https://medium.com/@pierre_guillou/nlp-como-treinar-um-modelo-de-question-answering-em-qualquer-linguagem-baseado-no-bert-large-1c899262dd96#c2f5 - https://ailab.unb.br/ - https://www.linkedin.com/in/pierreguillou/ - http://www.deeplearningbrasil.com.br/ - https://neuralmind.ai/ - https://medium.com/@pierre_guillou/nlp-como-treinar-um-modelo-de-question-answering-em-qualquer-linguagem-baseado-no-bert-large-1c899262dd96 --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from huxxx657) author: John Snow Labs name: roberta_qa_base_finetuned_deletion_squad_15 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-deletion-squad-15` is an English model originally trained by `huxxx657`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_deletion_squad_15_en_4.3.0_3.0_1674216599443.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_deletion_squad_15_en_4.3.0_3.0_1674216599443.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_deletion_squad_15","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_deletion_squad_15","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_finetuned_deletion_squad_15| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-deletion-squad-15 --- layout: model title: ICD10CM Injuries Entity Resolver author: John Snow Labs name: chunkresolve_icd10cm_injuries_clinical class: ChunkEntityResolverModel language: en nav_key: models repository: clinical/models date: 2020-04-28 task: Entity Resolution edition: Healthcare NLP 2.4.5 spark_version: 2.4 tags: [clinical,licensed,entity_resolution,en] deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description An entity resolution model based on k-nearest neighbors (KNN) over word embeddings, using Word Mover's Distance. ## Predicted Entities ICD10-CM codes and their normalized definitions, resolved with `clinical_embeddings` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_ICD10_CM.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_injuries_clinical_en_2.4.5_2.4_1588103825347.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_injuries_clinical_en_2.4.5_2.4_1588103825347.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python ... injury_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_injuries_clinical","en","clinical/models")\ .setInputCols("token","chunk_embeddings")\ .setOutputCol("entity") pipeline_puerile = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, injury_resolver]) data = spark.createDataFrame([["""The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion."""]]).toDF("text") model = pipeline_puerile.fit(data) results = model.transform(data) ``` ```scala ... val injury_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_injuries_clinical","en","clinical/models") .setInputCols(Array("token","chunk_embeddings")) .setOutputCol("entity") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, injury_resolver)) val data = Seq("The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. 
She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion.").toDF("text") val result = pipeline.fit(data).transform(data) ```
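The resolution step described above (KNN over word embeddings) can be pictured with a toy nearest-neighbor lookup. Real Word Mover's Distance is more involved, so this centroid-distance version — with made-up embeddings and codes chosen only for illustration — is a deliberate simplification:

```python
import math

# Toy embedding table (illustrative 2-d vectors, not real clinical embeddings).
EMB = {
    "cough": (0.9, 0.1), "fever": (0.1, 0.9),
    "dry": (0.8, 0.2), "high": (0.2, 0.8),
}

# Candidate codes mapped to the words of their normalized definitions.
CANDIDATES = {
    "R05": ["cough"],     # Cough
    "R50.9": ["fever"],   # Fever, unspecified
}

def centroid(words):
    vecs = [EMB[w] for w in words if w in EMB]
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))

def resolve(chunk_words):
    """Return the candidate code whose definition centroid is nearest
    to the chunk centroid (a 1-NN lookup)."""
    q = centroid(chunk_words)
    return min(CANDIDATES, key=lambda code: math.dist(q, centroid(CANDIDATES[code])))

print(resolve(["dry", "cough"]))  # prints R05
```

The production resolver searches the full ICD-10-CM injury code space with proper sentence-level chunk embeddings; only the nearest-neighbor idea carries over.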
{:.h2_title} ## Results ```bash chunk entity icd10_inj_description icd10_inj_code 0 a cold, cough PROBLEM Insect bite (nonvenomous) of throat, sequela S1016XS 1 runny nose PROBLEM Abrasion of nose, initial encounter S0031XA 2 fever PROBLEM Blister (nonthermal) of abdominal wall, initia... S30821A 3 difficulty breathing PROBLEM Concussion without loss of consciousness, init... S060X0A 4 her cough PROBLEM Contusion of throat, subsequent encounter S100XXD 5 physical exam TEST Concussion without loss of consciousness, init... S060X0A 6 fairly congested PROBLEM Contusion and laceration of right cerebrum wit... S06310A 7 Amoxil TREATMENT Insect bite (nonvenomous), unspecified ankle, ... S90569S 8 Aldex TREATMENT Insect bite (nonvenomous) of penis, initial en... S30862A 9 difficulty breathing PROBLEM Concussion without loss of consciousness, init... S060X0A 10 more congested PROBLEM Complete traumatic amputation of two or more r... S98211S 11 trouble sleeping PROBLEM Concussion without loss of consciousness, sequela S060X0S 12 congestion PROBLEM Unspecified injury of portal vein, initial enc... 
S35319A ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Name:|chunkresolve_icd10cm_injuries_clinical| |Type:|ChunkEntityResolverModel| |Compatibility:|Spark NLP 2.4.5+| |License:|Licensed| |Edition:|Official| |Input labels:|[token, chunk_embeddings]| |Output labels:|[entity]| |Language:|en| |Case sensitive:|True| |Dependencies:|embeddings_clinical| {:.h2_title} ## Data Source Trained on the ICD10CM dataset, code range S0000XA-S98929S: https://www.icd10data.com/ICD10CM/Codes/S00-T88 --- layout: model title: English image_classifier_vit_pneumonia_test_attempt ViTForImageClassification from eren23 author: John Snow Labs name: image_classifier_vit_pneumonia_test_attempt date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_pneumonia_test_attempt` is an English model originally trained by eren23. ## Predicted Entities `NORMAL`, `PNEUMONIA` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pneumonia_test_attempt_en_4.1.0_3.0_1660167722454.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pneumonia_test_attempt_en_4.1.0_3.0_1660167722454.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_pneumonia_test_attempt", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_pneumonia_test_attempt", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_pneumonia_test_attempt| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Detect Living Species (bert_base_cased) author: John Snow Labs name: ner_living_species_bert date: 2022-06-22 tags: [es, ner, clinical, licensed, bert] task: Named Entity Recognition language: es edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract living species from clinical texts in Spanish, which is critical to scientific disciplines like medicine, biology, ecology/biodiversity, nutrition and agriculture. This model is trained using `bert_base_cased` embeddings. It is trained on the [LivingNER](https://temu.bsc.es/livingner/) corpus, which is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. ## Predicted Entities `HUMAN`, `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_es_3.5.3_3.0_1655906269739.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_es_3.5.3_3.0_1655906269739.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_base_cased", "es")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_living_species_bert", "es","clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. 
Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_base_cased", "es") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_living_species_bert", "es","clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. 
Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.living_species.bert").predict("""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.""") ```
## Results ```bash +--------------+-------+ |ner_chunk |label | +--------------+-------+ |Lactante varón|HUMAN | |familiares |HUMAN | |personales |HUMAN | |neonatal |HUMAN | |legumbres |SPECIES| |lentejas |SPECIES| |garbanzos |SPECIES| |legumbres |SPECIES| |madre |HUMAN | |Cacahuete |SPECIES| |padres |HUMAN | +--------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_bert| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|16.4 MB| ## References [https://temu.bsc.es/livingner/](https://temu.bsc.es/livingner/) ## Benchmarking ```bash label precision recall f1-score support B-HUMAN 1.00 1.00 1.00 3281 B-SPECIES 1.00 1.00 1.00 3712 I-HUMAN 1.00 0.99 0.99 297 I-SPECIES 0.90 0.99 0.95 1732 micro-avg 0.98 1.00 0.99 9022 macro-avg 0.97 0.99 0.98 9022 weighted-avg 0.98 1.00 0.99 9022 ``` --- layout: model title: Legal Limited Liability Company Operating Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_limited_liability_company_operating_agreement_bert date: 2023-02-02 tags: [en, legal, classification, limited, liability, company, operating, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_limited_liability_company_operating_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `limited-liability-company-operating-agreement` or not (binary classification). Compared with the Longformer-based variant, this model is lighter and faster at inference.
## Predicted Entities `limited-liability-company-operating-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_limited_liability_company_operating_agreement_bert_en_1.0.0_3.0_1675360839682.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_limited_liability_company_operating_agreement_bert_en_1.0.0_3.0_1675360839682.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_limited_liability_company_operating_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[limited-liability-company-operating-agreement]| |[other]| |[other]| |[limited-liability-company-operating-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_limited_liability_company_operating_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scrapped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support limited-liability-company-operating-agreement 1.00 0.94 0.97 53 other 0.98 1.00 0.99 122 accuracy - - 0.98 175 macro-avg 0.99 0.97 0.98 175 weighted-avg 0.98 0.98 0.98 175 ``` --- layout: model title: Pipeline to find clinical events and find temporal relations (ERA) author: John Snow Labs name: explain_clinical_doc_era date: 2023-04-20 tags: [pipeline, en, licensed, clinical] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pipeline with `ner_clinical_events`, `assertion_dl` and `re_temporal_events_clinical`. It will extract clinical entities, assign assertion status and find temporal relationships between clinical entities. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_era_en_4.3.0_3.2_1682022511508.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_era_en_4.3.0_3.2_1682022511508.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("explain_clinical_doc_era", "en", "clinical/models") text = """She is admitted to The John Hopkins Hospital 2 days ago with a history of gestational diabetes mellitus diagnosed. She denied pain and any headache. She was seen by the endocrinology service and she was discharged on 03/02/2018 on 40 units of insulin glargine, 12 units of insulin lispro, and metformin 1000 mg two times a day. She had close follow-up with endocrinology post discharge. """ result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("explain_clinical_doc_era", "en", "clinical/models") val text = """She is admitted to The John Hopkins Hospital 2 days ago with a history of gestational diabetes mellitus diagnosed. She denied pain and any headache. She was seen by the endocrinology service and she was discharged on 03/02/2018 on 40 units of insulin glargine, 12 units of insulin lispro, and metformin 1000 mg two times a day. She had close follow-up with endocrinology post discharge. """ val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.explain_doc.era").predict("""She is admitted to The John Hopkins Hospital 2 days ago with a history of gestational diabetes mellitus diagnosed. She denied pain and any headache. She was seen by the endocrinology service and she was discharged on 03/02/2018 on 40 units of insulin glargine, 12 units of insulin lispro, and metformin 1000 mg two times a day. She had close follow-up with endocrinology post discharge. """) ```
## Results ```bash | | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |---:|:-----------|:--------------|----------------:|--------------:|:--------------------------|:--------------|----------------:|--------------:|:------------------------------|-------------:| | 0 | AFTER | OCCURRENCE | 7 | 14 | admitted | CLINICAL_DEPT | 19 | 43 | The John Hopkins Hospital | 0.963836 | | 1 | BEFORE | OCCURRENCE | 7 | 14 | admitted | DATE | 45 | 54 | 2 days ago | 0.587098 | | 2 | BEFORE | OCCURRENCE | 7 | 14 | admitted | PROBLEM | 74 | 102 | gestational diabetes mellitus | 0.999991 | | 3 | OVERLAP | CLINICAL_DEPT | 19 | 43 | The John Hopkins Hospital | DATE | 45 | 54 | 2 days ago | 0.996056 | | 4 | BEFORE | CLINICAL_DEPT | 19 | 43 | The John Hopkins Hospital | PROBLEM | 74 | 102 | gestational diabetes mellitus | 0.995216 | | 5 | OVERLAP | DATE | 45 | 54 | 2 days ago | PROBLEM | 74 | 102 | gestational diabetes mellitus | 0.996954 | | 6 | BEFORE | EVIDENTIAL | 119 | 124 | denied | PROBLEM | 126 | 129 | pain | 1 | | 7 | BEFORE | EVIDENTIAL | 119 | 124 | denied | PROBLEM | 135 | 146 | any headache | 1 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_clinical_doc_era| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - PerceptronModel - DependencyParserModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel - RelationExtractionModel - AssertionDLModel --- layout: model title: Part of Speech for Greek author: John Snow Labs name: pos_ud_gdt date: 2021-03-08 tags: [part_of_speech, open_source, greek, pos_ud_gdt, el] task: Part of Speech Tagging language: el edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: 
"Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - DET - X - VERB - ADP - NUM - NOUN - ADV - PUNCT - CCONJ - ADJ - AUX - PROPN - PRON - SCONJ - PART - SYM {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_gdt_el_3.0.0_3.0_1615230364351.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_gdt_el_3.0.0_3.0_1615230364351.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_gdt", "el") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Γεια σας από το John Snow Labs! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_gdt", "el") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Γεια σας από το John Snow Labs! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Γεια σας από το John Snow Labs! "] token_df = nlu.load('el.pos.ud_gdt').predict(text) token_df ```
## Results ```bash token pos 0 Γεια NOUN 1 σας PRON 2 από ADP 3 το DET 4 John X 5 Snow X 6 Labs X 7 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_gdt| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|el| --- layout: model title: Pipeline to Detect Living Species(embeddings_scielo_300d) author: John Snow Labs name: ner_living_species_300_pipeline date: 2023-03-09 tags: [licensed, clinical, es, ner] task: Named Entity Recognition language: es edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_living_species_300](https://nlp.johnsnowlabs.com/2022/11/22/ner_living_species_300_es.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_300_pipeline_es_4.3.0_3.2_1678392015205.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_300_pipeline_es_4.3.0_3.2_1678392015205.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_living_species_300_pipeline", "es", "clinical/models") text = '''Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_living_species_300_pipeline", "es", "clinical/models") val text = """Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. 
Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.""" val result = pipeline.fullAnnotate(text) ```
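Once the pipeline has run, the extracted chunks (shown in the Results table for this pipeline) can be tallied per label with plain Python. A small sketch; the `(chunk, label)` pairs below are hand-copied from this pipeline's output, not produced live:

```python
from collections import Counter

# (chunk, label) pairs hand-copied from this pipeline's output (illustrative).
chunks = [
    ("Lactante varón", "HUMAN"), ("familiares", "HUMAN"), ("personales", "HUMAN"),
    ("neonatal", "HUMAN"), ("legumbres", "SPECIES"), ("lentejas", "SPECIES"),
    ("garbanzos", "SPECIES"), ("legumbres", "SPECIES"), ("madre", "HUMAN"),
    ("Cacahuete", "SPECIES"), ("padres", "HUMAN"),
]

# Count how many chunks each NER label received.
label_counts = Counter(label for _, label in chunks)
print(label_counts)  # Counter({'HUMAN': 6, 'SPECIES': 5})
```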
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:---------------|--------:|------:|:------------|-------------:| | 0 | Lactante varón | 0 | 13 | HUMAN | 0.92045 | | 1 | familiares | 41 | 50 | HUMAN | 1 | | 2 | personales | 78 | 87 | HUMAN | 1 | | 3 | neonatal | 116 | 123 | HUMAN | 0.9817 | | 4 | legumbres | 162 | 170 | SPECIES | 0.9972 | | 5 | lentejas | 243 | 250 | SPECIES | 0.9592 | | 6 | garbanzos | 254 | 262 | SPECIES | 0.9754 | | 7 | legumbres | 290 | 298 | SPECIES | 0.9975 | | 8 | madre | 334 | 338 | HUMAN | 1 | | 9 | Cacahuete | 616 | 624 | SPECIES | 0.9963 | | 10 | padres | 728 | 733 | HUMAN | 1 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_300_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|es| |Size:|230.4 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English BertForQuestionAnswering model (from Laikokwei) author: John Snow Labs name: bert_qa_Laikokwei_bert_finetuned_squad date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `Laikokwei`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Laikokwei_bert_finetuned_squad_en_4.0.0_3.0_1654535522132.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Laikokwei_bert_finetuned_squad_en_4.0.0_3.0_1654535522132.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Laikokwei_bert_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_Laikokwei_bert_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_Laikokwei").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_Laikokwei_bert_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Laikokwei/bert-finetuned-squad --- layout: model title: Named Entity Recognition Profiling (Biobert) author: John Snow Labs name: ner_profiling_biobert date: 2023-06-13 tags: [licensed, en, clinical, biobert, profiling, ner_profiling, ner] task: [Named Entity Recognition, Pipeline Healthcare] language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to explore all the available pretrained NER models at once. When you run this pipeline over your text, you will end up with the predictions coming out of each pretrained clinical NER model trained with `biobert_pubmed_base_cased`. It has been updated by adding NER model outputs to the previous version. 
Here are the NER models that this pretrained pipeline includes: `jsl_ner_wip_greedy_biobert`, `jsl_rd_ner_wip_greedy_biobert`, `ner_ade_biobert`, `ner_anatomy_biobert`, `ner_anatomy_coarse_biobert`, `ner_bionlp_biobert`, `ner_cellular_biobert`, `ner_chemprot_biobert`, `ner_clinical_biobert`, `ner_deid_biobert`, `ner_deid_enriched_biobert`, `ner_diseases_biobert`, `ner_events_biobert`, `ner_human_phenotype_gene_biobert`, `ner_human_phenotype_go_biobert`, `ner_jsl_biobert`, `ner_jsl_enriched_biobert`, `ner_jsl_greedy_biobert`, `ner_living_species_biobert`, `ner_posology_biobert`, `ner_posology_large_biobert`, `ner_risk_factors_biobert` ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_profiling_biobert_en_4.4.4_3.2_1686663346965.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_profiling_biobert_en_4.4.4_3.2_1686663346965.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline ner_profiling_pipeline = PretrainedPipeline('ner_profiling_biobert', 'en', 'clinical/models') result = ner_profiling_pipeline.annotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.""") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val ner_profiling_pipeline = PretrainedPipeline("ner_profiling_biobert", "en", "clinical/models") val result = ner_profiling_pipeline.annotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.""") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.profiling_biobert").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.""") ```
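For each included model, `annotate` yields predictions that reduce to a list of `(chunk, label)` tuples, as shown in the Results section for this pipeline. Grouping them per label is straightforward in plain Python; the tuples below are hand-copied from the `ner_risk_factors_biobert` output for illustration, not produced live:

```python
from collections import defaultdict

# (chunk, label) tuples hand-copied from the ner_risk_factors_biobert
# results of this pipeline (illustrative only).
preds = [
    ("diabetes mellitus", "DIABETES"),
    ("subsequent type two diabetes mellitus", "DIABETES"),
    ("obesity", "OBESE"),
]

def group_by_label(preds):
    """Map each label to the list of chunks predicted with it."""
    grouped = defaultdict(list)
    for chunk, label in preds:
        grouped[label].append(chunk)
    return dict(grouped)

print(group_by_label(preds))
```

The same helper works unchanged on any of the 22 models' outputs, which makes side-by-side comparison of the profiled models easier.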
## Results ```bash Results ******************** ner_diseases_biobert Model Results ******************** [('gestational diabetes mellitus', 'Disease'), ('type two diabetes mellitus', 'Disease'), ('T2DM', 'Disease'), ('HTG-induced pancreatitis', 'Disease'), ('hepatitis', 'Disease'), ('obesity', 'Disease'), ('polyuria', 'Disease'), ('polydipsia', 'Disease'), ('poor appetite', 'Disease'), ('vomiting', 'Disease')] ******************** ner_events_biobert Model Results ******************** [('gestational diabetes mellitus', 'PROBLEM'), ('eight years', 'DURATION'), ('presentation', 'OCCURRENCE'), ('type two diabetes mellitus ( T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('three years', 'DURATION'), ('presentation', 'OCCURRENCE'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index', 'TEST'), ('BMI', 'TEST'), ('presented', 'OCCURRENCE'), ('a one-week', 'DURATION'), ('polyuria', 'PROBLEM'), ('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')] ******************** ner_jsl_biobert Model Results ******************** [('28-year-old', 'Age'), ('female', 'Gender'), ('gestational diabetes mellitus', 'Diabetes'), ('eight years prior', 'RelativeDate'), ('type two diabetes mellitus', 'Diabetes'), ('T2DM', 'Disease_Syndrome_Disorder'), ('HTG-induced pancreatitis', 'Disease_Syndrome_Disorder'), ('three years prior', 'RelativeDate'), ('acute', 'Modifier'), ('hepatitis', 'Disease_Syndrome_Disorder'), ('obesity', 'Obesity'), ('body mass index', 'BMI'), ('BMI ) of 33.5 kg/m2', 'BMI'), ('one-week', 'Duration'), ('polyuria', 'Symptom'), ('polydipsia', 'Symptom'), ('poor appetite', 'Symptom'), ('vomiting', 'Symptom')] ******************** ner_clinical_biobert Model Results ******************** [('gestational diabetes mellitus', 'PROBLEM'), ('subsequent type two diabetes mellitus ( T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index ( BMI 
)', 'TEST'), ('polyuria', 'PROBLEM'), ('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')] ******************** ner_risk_factors_biobert Model Results ******************** [('diabetes mellitus', 'DIABETES'), ('subsequent type two diabetes mellitus', 'DIABETES'), ('obesity', 'OBESE')] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_profiling_biobert| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|766.6 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - Finisher --- layout: model title: Czech asr_wav2vec2_xls_r_300m_250 TFWav2Vec2ForCTC from comodoro author: John Snow Labs name: asr_wav2vec2_xls_r_300m_250 date: 2022-09-25 tags: [wav2vec2, cs, audio, open_source, asr] task: Automatic Speech Recognition language: cs edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_xls_r_300m_250` is a 
Czech model originally trained by comodoro. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_xls_r_300m_250_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_250_cs_4.2.0_3.0_1664119316377.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_250_cs_4.2.0_3.0_1664119316377.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xls_r_300m_250", "cs")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xls_r_300m_250", "cs") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xls_r_300m_250| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|cs| |Size:|1.2 GB| --- layout: model title: Sarcasm Classifier author: John Snow Labs name: classifierdl_use_sarcasm date: 2021-01-09 task: Text Classification language: en nav_key: models edition: Spark NLP 2.7.1 spark_version: 2.4 tags: [open_source, en, classifier] supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Classify if a text contains sarcasm. ## Predicted Entities `normal`, `sarcasm` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN_SARCASM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN_SARCASM.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_sarcasm_en_2.7.1_2.4_1610210956231.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_sarcasm_en_2.7.1_2.4_1610210956231.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") use = UniversalSentenceEncoder.pretrained(lang="en") \ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") document_classifier = ClassifierDLModel.pretrained('classifierdl_use_sarcasm', 'en') \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") nlpPipeline = Pipeline(stages=[documentAssembler, use, document_classifier]) light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate('If I could put into words how much I love waking up at am on Tuesdays I would') ``` ```scala val documentAssembler = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val use = UniversalSentenceEncoder.pretrained(lang="en") .setInputCols(Array("document")) .setOutputCol("sentence_embeddings") val document_classifier = ClassifierDLModel.pretrained("classifierdl_use_sarcasm", "en") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, use, document_classifier)) val data = Seq("If I could put into words how much I love waking up at am on Tuesdays I would").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""If I could put into words how much I love waking up at am on Tuesdays I would"""] sarcasm_df = nlu.load('classify.sarcasm.use').predict(text, output_level='document') sarcasm_df[["document", "sarcasm"]] ```
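The benchmarking table for this model reports per-class F1 scores derived from precision and recall. As a quick sanity check, the harmonic-mean formula reproduces the `normal` class score from the table:

```python
def f1(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# normal class in the benchmarking table: precision 0.98, recall 0.89
print(round(f1(0.98, 0.89), 2))  # 0.93, matching the reported f1-score
```

Note that the table computes F1 from unrounded precision/recall, so recomputing from the two-decimal values shown can differ in the last digit for some classes.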
## Results ```bash +--------------------------------------------------------------------------------------------------------+------------+ |document |class | +--------------------------------------------------------------------------------------------------------+------------+ |If I could put into words how much I love waking up at am on Tuesdays I would | sarcasm | +--------------------------------------------------------------------------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_use_sarcasm| |Compatibility:|Spark NLP 2.7.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Dependencies:|tfhub_use| ## Data Source http://www.cs.utah.edu/~riloff/pdfs/official-emnlp13-sarcasm.pdf ## Benchmarking ```bash precision recall f1-score support normal 0.98 0.89 0.93 495 sarcasm 0.60 0.91 0.73 93 accuracy 0.89 588 macro avg 0.79 0.90 0.83 588 weighted avg 0.92 0.89 0.90 588 ``` --- layout: model title: Spanish Bert Embeddings (Base, Passage, Allqa) author: John Snow Labs name: bert_embeddings_dpr_spanish_passage_encoder_allqa_base date: 2022-04-11 tags: [bert, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `dpr-spanish-passage_encoder-allqa-base` is a Spanish model originally trained by `IIC`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_dpr_spanish_passage_encoder_allqa_base_es_3.4.2_3.0_1649671207246.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_dpr_spanish_passage_encoder_allqa_base_es_3.4.2_3.0_1649671207246.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_dpr_spanish_passage_encoder_allqa_base","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_dpr_spanish_passage_encoder_allqa_base","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.dpr_spanish_passage_encoder_allqa_base").predict("""Me encanta chispa nlp""") ```
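The `embeddings` column produced above holds one dense vector per token, typically compared with cosine similarity in retrieval settings such as DPR. A minimal pure-Python sketch of the metric; the toy 4-dimensional vectors stand in for real 768-dimensional BERT embeddings and are not actual model output:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy vectors standing in for real 768-d embeddings (illustrative only).
print(round(cosine([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]), 3))  # 0.5
```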
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_dpr_spanish_passage_encoder_allqa_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|412.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/IIC/dpr-spanish-passage_encoder-allqa-base - https://arxiv.org/abs/2004.04906 - https://github.com/facebookresearch/DPR - https://paperswithcode.com/sota?task=text+similarity&dataset=squad_es --- layout: model title: English image_classifier_vit_vc_bantai__withoutAMBI_adunest_v1 ViTForImageClassification from AykeeSalazar author: John Snow Labs name: image_classifier_vit_vc_bantai__withoutAMBI_adunest_v1 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_vc_bantai__withoutAMBI_adunest_v1` is an English model originally trained by AykeeSalazar. ## Predicted Entities `nonViolation`, `publicDrinking`, `publicSmoking` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vc_bantai__withoutAMBI_adunest_v1_en_4.1.0_3.0_1660170865642.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vc_bantai__withoutAMBI_adunest_v1_en_4.1.0_3.0_1660170865642.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_vc_bantai__withoutAMBI_adunest_v1", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_vc_bantai__withoutAMBI_adunest_v1", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_vc_bantai__withoutAMBI_adunest_v1| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Pipeline to Detect Chemicals in Medical text (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_chemicals_pipeline date: 2023-03-20 tags: [berfortokenclassification, ner, chemicals, en, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_chemicals](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_chemicals_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemicals_pipeline_en_4.3.0_3.2_1679306458020.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemicals_pipeline_en_4.3.0_3.2_1679306458020.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_chemicals_pipeline", "en", "clinical/models") text = '''The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. "A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_chemicals_pipeline", "en", "clinical/models") val text = """The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. "A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.""" val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.chemicals_pipeline").predict("""The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. "A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.""") ```
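The `fullAnnotate` output returns each detected chunk together with inclusive `begin`/`end` character offsets into the input text. A small pure-Python sketch (no Spark NLP required) of how such inclusive offsets map back to substrings, using the example sentence above:

```python
# Spark NLP annotations carry inclusive `begin`/`end` character offsets into
# the original text, so a chunk can be recovered with text[begin:end + 1].
text = 'The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. "A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.'

def chunk_offsets(text, chunk):
    """Return the inclusive (begin, end) offsets of the first match of `chunk`."""
    begin = text.find(chunk)
    return begin, begin + len(chunk) - 1

print(chunk_offsets(text, "p - choloroaniline"))  # -> (40, 57)
print(text[40:58])                                # -> p - choloroaniline
```

These offsets match the `begin`/`end` columns in the Results table of this card.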
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:----------------------------|--------:|------:|:------------|-------------:| | 0 | p - choloroaniline | 40 | 57 | CHEM | 0.999986 | | 1 | chlorhexidine - digluconate | 90 | 116 | CHEM | 0.999989 | | 2 | kanamycin | 169 | 177 | CHEM | 0.999985 | | 3 | colistin | 181 | 188 | CHEM | 0.999982 | | 4 | povidone - iodine | 194 | 210 | CHEM | 0.99998 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_chemicals_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.9 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Translate English to Tigrinya Pipeline author: John Snow Labs name: translate_en_ti date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, ti, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `ti` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ti_xx_2.7.0_2.4_1609700767749.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ti_xx_2.7.0_2.4_1609700767749.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ti", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ti", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ti').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_ti| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Conflict of interest Clause Binary Classifier author: John Snow Labs name: legclf_conflict_of_interest_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `conflict-of-interest` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that this model's embeddings allow up to 512 tokens. If your text is longer, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. 
## Predicted Entities `other`, `conflict-of-interest` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_conflict_of_interest_clause_en_1.0.0_3.2_1660122277665.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_conflict_of_interest_clause_en_1.0.0_3.2_1660122277665.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_conflict_of_interest_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
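The Benchmarking section of this card reports macro and weighted averages over the per-class scores. A quick pure-Python sketch of how those aggregates are computed from per-class F1 and support; the card's own rounded per-class figures are used here, so the last digit can differ slightly from the exactly computed averages:

```python
# Per-class F1 and support, as reported in this card's Benchmarking section.
f1_scores = {"conflict-of-interest": 0.85, "other": 0.96}
support = {"conflict-of-interest": 32, "other": 108}

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Weighted average: mean over classes weighted by class support.
total = sum(support.values())
weighted_f1 = sum(f1_scores[c] * support[c] for c in f1_scores) / total

print(macro_f1, weighted_f1)  # roughly 0.91 and 0.93 after rounding
```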
## Results ```bash +-------+ | result| +-------+ |[conflict-of-interest]| |[other]| |[other]| |[conflict-of-interest]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_conflict_of_interest_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support conflict-of-interest 0.90 0.81 0.85 32 other 0.95 0.97 0.96 108 accuracy - - 0.94 140 macro-avg 0.92 0.89 0.91 140 weighted-avg 0.93 0.94 0.93 140 ``` --- layout: model title: English RobertaForSequenceClassification Cased model (from masifayub) author: John Snow Labs name: roberta_classifier_autotrain_pan_976832386 date: 2022-12-09 tags: [en, open_source, roberta, sequence_classification, classification, tensorflow] task: Text Classification language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-PAN-976832386` is an English model originally trained by `masifayub`. ## Predicted Entities `1`, `0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_classifier_autotrain_pan_976832386_en_4.2.4_3.0_1670623208389.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_classifier_autotrain_pan_976832386_en_4.2.4_3.0_1670623208389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autotrain_pan_976832386","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_classifier]) data = spark.createDataFrame([["I love you!"], ["I feel lucky to be here."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autotrain_pan_976832386","en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_classifier)) val data = Seq("I love you!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
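Under the hood, a sequence classifier emits one logit per label and predicts the argmax after a softmax. A small illustrative sketch with hypothetical logits (not actual outputs of this model):

```python
import math

# Hypothetical logits for the two labels this model predicts ("0" and "1").
labels = ["0", "1"]

def predict(logits):
    """Softmax over the logits, then return the top label and its probability."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]

print(predict([-1.2, 2.3]))  # "1" wins with ~0.97 probability
```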
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_classifier_autotrain_pan_976832386| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|309.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/masifayub/autotrain-PAN-976832386 --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-1024-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_0_en_4.3.0_3.0_1674213299788.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_0_en_4.3.0_3.0_1674213299788.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
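Extractive QA models of this kind score every context token as a candidate answer start and end, and return the best-scoring span. A toy pure-Python sketch of that span selection; the tokens and logits below are made up for illustration, not produced by this model:

```python
# Toy example: pick the answer span with the highest start + end logit sum,
# subject to start <= end (as extractive QA heads do).
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.2, 0.3, 4.0, 0.1, 0.0, 0.1, 0.2, 1.5, 0.0]
end_logits   = [0.0, 0.1, 0.2, 3.8, 0.1, 0.0, 0.1, 0.1, 1.0, 0.2]

best, best_score = None, float("-inf")
for s, s_logit in enumerate(start_logits):
    for e in range(s, len(end_logits)):  # enforce start <= end
        score = s_logit + end_logits[e]
        if score > best_score:
            best, best_score = (s, e), score

answer = " ".join(tokens[best[0]:best[1] + 1])
print(answer)  # -> Clara
```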
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|439.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-1024-finetuned-squad-seed-0 --- layout: model title: Medical Question Answering (biogpt_pubmed_qa) author: John Snow Labs name: biogpt_pubmed_qa date: 2023-05-15 tags: [licensed, clinical, en, biogpt, pubmed, qa, question_answering, tensorflow] task: Question Answering language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true engine: tensorflow annotator: MedicalQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model has been trained on medical documents and can generate two types of answers. Two question types are supported: `"short"`, producing yes/no/maybe answers, and `"long"`, producing full answers. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/biogpt_pubmed_qa_en_4.4.2_3.0_1684165576397.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/biogpt_pubmed_qa_en_4.4.2_3.0_1684165576397.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler()\ .setInputCols("question", "context")\ .setOutputCols("document_question", "document_context") med_qa = MedicalQuestionAnswering\ .pretrained("biogpt_pubmed_qa", "en", "clinical/models")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setMaxNewTokens(30)\ .setTopK(1)\ .setQuestionType("long") # "short" pipeline = Pipeline(stages=[document_assembler, med_qa]) paper_abstract = "The visual indexing theory proposed by Zenon Pylyshyn (Cognition, 32, 65-97, 1989) predicts that visual attention mechanisms are employed when mental images are projected onto a visual scene." long_question = "What is the effect of directing attention on memory?" yes_no_question = "Does directing attention improve memory for items?" data = spark.createDataFrame( [ [long_question, paper_abstract, "long"], [yes_no_question, paper_abstract, "short"], ] ).toDF("question", "context", "question_type") pipeline.fit(data).transform(data.where("question_type == 'long'"))\ .select("answer.result")\ .show(truncate=False) pipeline.fit(data).transform(data.where("question_type == 'short'"))\ .select("answer.result")\ .show(truncate=False) ``` ```scala val document_assembler = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val med_qa = MedicalQuestionAnswering .pretrained("biogpt_pubmed_qa","en","clinical/models") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setMaxNewTokens(30) .setTopK(1) .setQuestionType("long") // "short" val pipeline = new Pipeline().setStages(Array(document_assembler, med_qa)) val paper_abstract = "The visual indexing theory proposed by Zenon Pylyshyn (Cognition, 32, 65-97, 1989) predicts that visual attention mechanisms are employed when mental images are projected onto a visual scene." 
val long_question = "What is the effect of directing attention on memory?" val yes_no_question = "Does directing attention improve memory for items?" val data = Seq( (long_question, paper_abstract, "long"), (yes_no_question, paper_abstract, "short")) .toDS.toDF("question", "context", "question_type") val result = pipeline.fit(data).transform(data) ```
## Results ```bash +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |result | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |[the present study investigated whether directing spatial attention to one location in a visual array would enhance memory for the array features. participants memorized two]| +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|biogpt_pubmed_qa| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.1 GB| |Case sensitive:|true| --- layout: model title: Stopwords Remover for Luxembourgish language (207 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, lb, open_source] task: Stop Words Removal language: lb edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_lb_3.4.1_3.0_1646673262041.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_lb_3.4.1_3.0_1646673262041.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","lb") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Dir sidd net besser wéi ech"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","lb") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Dir sidd net besser wéi ech").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("lb.stopwords").predict("""Dir sidd net besser wéi ech""") ```
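Conceptually, `StopWordsCleaner` simply drops tokens that appear in a stopword list. A pure-Python sketch reproducing this card's example; the set below is a tiny illustrative subset of the 207-entry Luxembourgish stopwords-iso list, and matching is done case-insensitively here for simplicity:

```python
# Tiny illustrative subset of the Luxembourgish stopwords-iso list,
# just enough to cover this example sentence.
stopwords = {"dir", "net", "wéi", "ech"}

tokens = "Dir sidd net besser wéi ech".split()
clean_tokens = [t for t in tokens if t.lower() not in stopwords]
print(clean_tokens)  # -> ['sidd', 'besser']
```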
## Results ```bash +--------------+ |result | +--------------+ |[sidd, besser]| +--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|lb| |Size:|1.9 KB| --- layout: model title: Translate Azerbaijani to English Pipeline author: John Snow Labs name: translate_az_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, az, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `az` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_az_en_xx_2.7.0_2.4_1609686225262.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_az_en_xx_2.7.0_2.4_1609686225262.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_az_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_az_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.az.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_az_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Trade Document Classifier (EURLEX) author: John Snow Labs name: legclf_trade_bert date: 2023-03-06 tags: [en, legal, classification, clauses, trade, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The `legclf_trade_bert` model is a Bert Sentence Embeddings Document Classifier that determines whether a given document belongs to the Trade class or not (binary classification), according to EuroVoc labels. ## Predicted Entities `Trade`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_trade_bert_en_1.0.0_3.0_1678111687840.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_trade_bert_en_1.0.0_3.0_1678111687840.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_trade_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Trade]| |[Other]| |[Other]| |[Trade]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_trade_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.0 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.83 0.87 0.85 778 Trade 0.89 0.86 0.88 995 accuracy - - 0.86 1773 macro-avg 0.86 0.86 0.86 1773 weighted-avg 0.86 0.86 0.86 1773 ``` --- layout: model title: Sundanese RoBERTa Embeddings (from w11wo) author: John Snow Labs name: roberta_embeddings_sundanese_roberta_base date: 2022-04-14 tags: [roberta, embeddings, su, open_source] task: Embeddings language: su edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `sundanese-roberta-base` is a Sundanese model originally trained by `w11wo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_sundanese_roberta_base_su_3.4.2_3.0_1649948770581.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_sundanese_roberta_base_su_3.4.2_3.0_1649948770581.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_sundanese_roberta_base","su") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Abdi bogoh Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_sundanese_roberta_base","su") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Abdi bogoh Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("su.embed.sundanese_roberta_base").predict("""Abdi bogoh Spark NLP""") ```
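Token embeddings produced by models like this one are typically compared with cosine similarity. A minimal sketch with toy 3-dimensional vectors; the actual model emits much higher-dimensional vectors (768 for a RoBERTa-base architecture):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, not actual outputs of this model.
v1 = [0.2, 0.1, 0.4]
v2 = [0.2, 0.1, 0.4]
v3 = [-0.4, 0.3, -0.1]

print(cosine(v1, v2))  # identical vectors give similarity ~ 1.0
print(cosine(v1, v3))  # dissimilar vectors give a much lower score
```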
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_sundanese_roberta_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|su| |Size:|468.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/w11wo/sundanese-roberta-base - https://arxiv.org/abs/1907.11692 - https://hf.co/datasets/oscar - https://hf.co/datasets/mc4 - https://hf.co/datasets/cc100 - https://su.wikipedia.org/ - https://hf.co/w11wo/sundanese-roberta-base/tree/main - https://hf.co/w11wo/sundanese-roberta-base/tensorboard - https://w11wo.github.io/ --- layout: model title: English asr_wav2vec2_base_timit_demo_test_jong TFWav2Vec2ForCTC from prows12 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_test_jong date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_test_jong` is an English model originally trained by prows12. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_test_jong_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_test_jong_en_4.2.0_3.0_1664101569048.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_test_jong_en_4.2.0_3.0_1664101569048.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_test_jong', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_test_jong", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_test_jong| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|349.4 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Translate Kaonde to English Pipeline author: John Snow Labs name: translate_kqn_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, kqn, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `kqn` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_kqn_en_xx_2.7.0_2.4_1609691630870.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_kqn_en_xx_2.7.0_2.4_1609691630870.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_kqn_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_kqn_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.kqn.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_kqn_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Word2Vec Embeddings in Burmese (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, my, open_source] task: Embeddings language: my edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_my_3.4.1_3.0_1647288529217.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_my_3.4.1_3.0_1647288529217.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","my") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["ငါ Spark NLP ကိုချစ်"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","my") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("ငါ Spark NLP ကိုချစ်").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("my.embed.w2v_cc_300d").predict("""ငါ Spark NLP ကိုချစ်""") ```
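Lookup embeddings like these are typically consumed by comparing token vectors, and cosine similarity is the usual measure. A minimal, framework-free sketch with made-up 4-dimensional vectors (the real model produces 300 dimensions per token):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative vectors only; real values come from the embeddings column.
v1 = [0.1, 0.3, -0.2, 0.7]
v2 = [0.1, 0.3, -0.2, 0.7]
v3 = [-0.1, -0.3, 0.2, -0.7]

print(cosine_similarity(v1, v2))  # 1.0 (identical direction)
print(cosine_similarity(v1, v3))  # -1.0 (opposite direction)
```

In practice the vectors would be read from the `embeddings` annotations produced by the pipeline above.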
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|my| |Size:|195.6 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Legal Brokers Clause Binary Classifier author: John Snow Labs name: legclf_brokers_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `brokers` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
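The paragraph-splitting approach recommended above can be sketched outside Spark NLP as a plain split on blank lines, with a crude whitespace token count to flag pieces that might exceed the 512-token embedding limit (the function name and the naive tokenization are our assumptions; the workshop tutorial linked above covers the full techniques):

```python
def split_paragraphs(text: str, max_tokens: int = 512):
    """Split on blank lines; pair each paragraph with a flag saying
    whether its rough (whitespace) token count fits the embedding limit."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

# Illustrative two-clause document.
doc = (
    "SECTION 1. BROKERS.\n"
    "Each party represents that it has dealt with no broker.\n"
    "\n"
    "SECTION 2. NOTICES.\n"
    "All notices shall be in writing."
)
pieces = split_paragraphs(doc)
print(len(pieces))  # 2 paragraphs, each short enough to classify
```

Each resulting piece would then become one row of the `clause_text` column fed to the classifier.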
## Predicted Entities `other`, `brokers` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_brokers_clause_en_1.0.0_3.2_1660122187570.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_brokers_clause_en_1.0.0_3.2_1660122187570.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_brokers_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[brokers]| |[other]| |[other]| |[brokers]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_brokers_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support brokers 0.96 0.99 0.97 77 other 1.00 0.99 0.99 219 accuracy - - 0.99 296 macro-avg 0.98 0.99 0.98 296 weighted-avg 0.99 0.99 0.99 296 ``` --- layout: model title: Sentence Entity Resolver for ICD-10-CM Codes (sbertresolve_icd10cm_augmented) author: John Snow Labs name: sbertresolve_icd10cm_augmented date: 2023-05-31 tags: [en, clinical, licensed, icd10cm, entity_resolution] task: Entity Resolution language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts to ICD-10-CM codes using `sbert_jsl_medium_uncased` sentence bert embeddings. It also returns the official resolution text within the brackets inside the metadata. The model is augmented with synonyms, and previous augmentations are flexed according to cosine distances to unnormalized terms (ground truths).
## Predicted Entities `ICD-10-CM Codes` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10cm_augmented_en_4.4.2_3.0_1685531969657.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10cm_augmented_en_4.4.2_3.0_1685531969657.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(["PROBLEM"]) chunk2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") bert_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased", "en", "clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("bert_embeddings")\ .setCaseSensitive(False) icd10_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_icd10cm_augmented", "en", "clinical/models")\ .setInputCols(["bert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, bert_embeddings, icd10_resolver]) data_ner = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting."]]).toDF("text") results = 
nlpPipeline.fit(data_ner).transform(data_ner) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("PROBLEM")) val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val bert_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased", "en", "clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("bert_embeddings") .setCaseSensitive(false) val icd10_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_icd10cm_augmented", "en", "clinical/models") .setInputCols("bert_embeddings") .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, bert_embeddings, icd10_resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.").toDF("text") val result = 
pipeline.fit(data).transform(data) ```
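Conceptually, the resolver embeds each detected chunk and returns the candidate codes whose embeddings are nearest under the configured distance function (`EUCLIDEAN` in the pipeline above). A toy sketch of that lookup with fabricated 3-dimensional embeddings and a three-entry candidate index (the vectors and the tiny index are illustrative only; the real model searches its full trained vocabulary):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Fabricated candidate embeddings keyed by ICD-10-CM-style codes.
candidates = {
    "E66.9": [0.9, 0.1, 0.0],   # obesity, unspecified
    "R35":   [0.0, 0.8, 0.2],   # polyuria
    "R63.1": [0.1, 0.2, 0.9],   # polydipsia
}

def resolve(chunk_embedding, k=2):
    """Return the k candidate codes nearest to the chunk embedding."""
    ranked = sorted(candidates,
                    key=lambda c: euclidean(chunk_embedding, candidates[c]))
    return ranked[:k]

# A chunk embedding close to the "obesity" candidate resolves to it first.
print(resolve([0.85, 0.15, 0.05]))  # ['E66.9', 'R35']
```

The actual annotator additionally carries the resolution text and distances in the annotation metadata, as shown in the Results section.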
## Results ```bash +-------------------------------------+-------+----------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+ | ner_chunk| entity|icd10_code| resolutions| all_codes| +-------------------------------------+-------+----------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+ | gestational diabetes mellitus|PROBLEM| O24.4|[gestational diabetes mellitus [gestational diabetes mellitus], maternal...| [O24.4, O24.41, O24.43, Z86.32, K86.8, P70.2, O24.434, E10.9, O24.430]| |subsequent type two diabetes mellitus|PROBLEM| E11|[type 2 diabetes mellitus [type 2 diabetes mellitus], type ii diabetes m...|[E11, E11.9, E10.9, E10, E13.9, Z83.3, L83, E11.8, E11.32, E10.8, Z86.39...| | acute hepatitis|PROBLEM| K72.0|[acute hepatitis [acute and subacute hepatic failure], acute hepatitis a...|[K72.0, B15, B17.2, B17.1, B16, B17.9, B18.8, B15.9, K75.2, K73.9, B17.1...| | obesity|PROBLEM| E66.9|[obesity [obesity, unspecified], upper body obesity [other obesity], chi...| [E66.9, E66.8, P90, Q13.0, M79.4, Z86.39]| | a body mass index|PROBLEM| E66.9|[observation of body mass index [obesity, unspecified], finding of body ...|[E66.9, Z68.41, Z68, E66.8, Z68.45, Z68.4, Z68.1, Z68.2, R22.9, Z68.22, ...| | polyuria|PROBLEM| R35|[polyuria [polyuria], sialuria [other specified metabolic disorders], st...|[R35, E88.8, R30.0, N28.89, O04.8, R82.4, E74.8, R82.2, E73.9, R82.0, R3...| | polydipsia|PROBLEM| R63.1|[polydipsia [polydipsia], polyotia [accessory auricle], polysomia [conjo...|[R63.1, Q17.0, Q89.4, Q89.09, Q74.8, H53.8, H53.2, Q13.2, R63.8, E23.2, ...| | poor appetite|PROBLEM| R63.0|[poor appetite [anorexia], excessive appetite [polyphagia], poor feeding...|[R63.0, R63.2, P92.9, R45.81, Z55.8, R41.84, R41.3, Z74.8, R46.89, R45.8...| | vomiting|PROBLEM| R11.1|[vomiting 
[vomiting], vomiting bile [vomiting following gastrointestinal...|[R11.1, K91.0, K92.0, A08.39, R11, P92.0, P92.09, R11.12, R11.10, O21.9,...| +-------------------------------------+-------+----------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbertresolve_icd10cm_augmented| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[bert_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Size:|938.1 MB| |Case sensitive:|false| --- layout: model title: Translate English to Ewe Pipeline author: John Snow Labs name: translate_en_ee date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, ee, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `ee` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ee_xx_2.7.0_2.4_1609686384150.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ee_xx_2.7.0_2.4_1609686384150.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ee", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ee", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ee').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_ee| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English XlmRoBertaForQuestionAnswering (from teacookies) author: John Snow Labs name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265909 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-more_fine_tune_24465520-26265909` is an English model originally trained by `teacookies`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265909_en_4.0.0_3.0_1655985665052.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265909_en_4.0.0_3.0_1655985665052.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265909","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265909","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265909").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265909| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|888.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265909 --- layout: model title: English T5ForConditionalGeneration Small Cased model (from allenai) author: John Snow Labs name: t5_unifiedqa_v2_small_1251000 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `unifiedqa-v2-t5-small-1251000` is an English model originally trained by `allenai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_unifiedqa_v2_small_1251000_en_4.3.0_3.0_1675158019272.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_unifiedqa_v2_small_1251000_en_4.3.0_3.0_1675158019272.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_unifiedqa_v2_small_1251000","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_unifiedqa_v2_small_1251000","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_unifiedqa_v2_small_1251000| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|148.1 MB| ## References - https://huggingface.co/allenai/unifiedqa-v2-t5-small-1251000 - https://github.com/allenai/unifiedqa --- layout: model title: English image_classifier_vit_pond_image_classification_6 ViTForImageClassification from SummerChiam author: John Snow Labs name: image_classifier_vit_pond_image_classification_6 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_pond_image_classification_6` is an English model originally trained by SummerChiam. ## Predicted Entities `Normal`, `Boiling`, `Algae`, `NormalCement`, `NormalRain`, `BoilingNight`, `NormalNight` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_6_en_4.1.0_3.0_1660167065053.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_6_en_4.1.0_3.0_1660167065053.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_pond_image_classification_6", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_pond_image_classification_6", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_pond_image_classification_6| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Marathi DistilBertForMaskedLM Cased model (from DarshanDeshpande) author: John Snow Labs name: distilbert_embeddings_marathi date: 2022-12-12 tags: [mr, open_source, distilbert_embeddings, distilbertformaskedlm] task: Embeddings language: mr edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `marathi-distilbert` is a Marathi model originally trained by `DarshanDeshpande`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_marathi_mr_4.2.4_3.0_1670865013879.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_marathi_mr_4.2.4_3.0_1670865013879.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_marathi","mr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(False) pipeline = Pipeline(stages=[documentAssembler, tokenizer, distilbert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_marathi","mr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, distilbert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_marathi| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|mr| |Size:|247.8 MB| |Case sensitive:|false| ## References - https://huggingface.co/DarshanDeshpande/marathi-distilbert - https://github.com/DarshanDeshpande - https://www.linkedin.com/in/darshan-deshpande/ - https://github.com/Baras64 - http://www.linkedin.com/in/harsh-abhi --- layout: model title: Fast Neural Machine Translation Model from Tonga (Zambezi) to English author: John Snow Labs name: opus_mt_toi_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, toi, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `toi` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_toi_en_xx_2.7.0_2.4_1609168217047.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_toi_en_xx_2.7.0_2.4_1609168217047.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_toi_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_toi_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.toi.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_toi_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Authority Clause Binary Classifier author: John Snow Labs name: legclf_authority_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `authority` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
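As a rough illustration of the multiline paragraph splitting suggested above, here is a minimal plain-Python sketch (the function and variable names are illustrative, not part of the Spark NLP API) that breaks a long document into paragraph-sized chunks before classification:

```python
# Illustrative sketch (not part of the Spark NLP API): split a long legal
# document into paragraph-sized chunks on blank lines, so each chunk stays
# within the 512-token window of the sentence embeddings.
def split_into_paragraphs(document: str) -> list:
    paragraphs = [p.strip() for p in document.split("\n\n")]
    return [p for p in paragraphs if p]  # drop empty chunks

doc = "FIRST CLAUSE TEXT ...\n\nSECOND CLAUSE TEXT ..."
chunks = split_into_paragraphs(doc)
```

Each resulting chunk can then become a row of the `clause_text` column fed to the pipeline shown in the "How to use" section.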
## Predicted Entities `other`, `authority` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_authority_clause_en_1.0.0_3.2_1660122135152.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_authority_clause_en_1.0.0_3.2_1660122135152.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_authority_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[authority]| |[other]| |[other]| |[authority]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_authority_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support authority 0.86 0.78 0.82 23 other 0.95 0.97 0.96 99 accuracy - - 0.93 122 macro-avg 0.90 0.88 0.89 122 weighted-avg 0.93 0.93 0.93 122 ``` --- layout: model title: Word2Vec Embeddings in Tosk Albanian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, als, open_source] task: Embeddings language: als edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_als_3.4.1_3.0_1647282325294.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_als_3.4.1_3.0_1647282325294.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","als") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","als") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("als.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|als| |Size:|573.1 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Mapping Drugs With Their Corresponding Actions And Treatments author: John Snow Labs name: drug_action_treatment_mapper date: 2022-03-31 tags: [en, chunkmapping, chunkmapper, drug, action, treatment, licensed] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.0 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps drugs with their corresponding `action` and `treatment`. `action` refers to the function of the drug in various body systems, while `treatment` refers to the disease the drug is used to treat. ## Predicted Entities `action`, `treatment` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/drug_action_treatment_mapper_en_3.5.0_3.0_1648744864957.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/drug_action_treatment_mapper_en_3.5.0_3.0_1648744864957.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") ner = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_drug_development_trials", "en", "clinical/models")\ .setInputCols("token","sentence")\ .setOutputCol("ner") nerconverter = NerConverterInternal()\ .setInputCols("sentence", "token", "ner")\ .setOutputCol("drug") chunkerMapper = ChunkMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models") \ .setInputCols("drug")\ .setOutputCol("relations")\ .setRel("treatment") #or action pipeline = Pipeline().setStages([document_assembler, sentence_detector, tokenizer, ner, nerconverter, chunkerMapper]) text = [""" The patient is a 71-year-old female patient of Dr. X. and she was given Aklis and Dermovate. 
Current Medications: Diprivan, Proventil """] test_data = spark.createDataFrame([text]).toDF("text") res = pipeline.fit(test_data).transform(test_data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val ner = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_drug_development_trials", "en", "clinical/models") .setInputCols("token","sentence") .setOutputCol("ner") val nerconverter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("drug") val chunkerMapper = ChunkMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models") .setInputCols("drug") .setOutputCol("relations") .setRel("treatment") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, ner, nerconverter, chunkerMapper )) val test_data = Seq("The patient is a 71-year-old female patient of Dr. X. and she was given Aklis and Dermovate. Current Medications: Diprivan, Proventil").toDF("text") val res = pipeline.fit(test_data).transform(test_data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.drug_to_action_treatment").predict(""" The patient is a 71-year-old female patient of Dr. X. and she was given Aklis and Dermovate. Current Medications: Diprivan, Proventil """) ```
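The mapper joins multiple mapped values into a single string with a `:::` separator. A minimal plain-Python post-processing sketch (the function and variable names are illustrative, not part of the Spark NLP API) for splitting those strings into lists:

```python
# Illustrative sketch (not part of the Spark NLP API): the chunk mapper joins
# multiple mapped values into one string with a ":::" separator; split them
# into a Python list for easier downstream use.
def split_mappings(mapping: str, sep: str = ":::") -> list:
    return [v.strip() for v in mapping.split(sep) if v.strip()]

raw = "Discoid Lupus Erythematosus:::Psoriasis:::Eczema"
values = split_mappings(raw)
```

A typical value of the `relations` output column can be cleaned this way before further analysis.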
## Results ```bash +---------+------------------+--------------------------------------------------------------+ |Drug |Treats |Pharmaceutical Action | +---------+------------------+--------------------------------------------------------------+ |Aklis |Hyperlipidemia |Hypertension:::Diabetic Kidney Disease:::Cerebrovascular... | |Dermovate|Lupus |Discoid Lupus Erythematosus:::Empeines:::Psoriasis:::Eczema...| |Diprivan |Infection |Laryngitis:::Pneumonia:::Pharyngitis | |Proventil|Addison's Disease |Allergic Conjunctivitis:::Anemia:::Ankylosing Spondylitis | +---------+------------------+--------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|drug_action_treatment_mapper| |Compatibility:|Healthcare NLP 3.5.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|8.7 MB| --- layout: model title: Fast Neural Machine Translation Model from Yoruba to English author: John Snow Labs name: opus_mt_yo_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, yo, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `yo` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_yo_en_xx_2.7.0_2.4_1609167298178.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_yo_en_xx_2.7.0_2.4_1609167298178.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_yo_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_yo_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.yo.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_yo_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Arabic BertForMaskedLM Cased model (from UBC-NLP) author: John Snow Labs name: bert_embeddings_marbertv2 date: 2022-12-02 tags: [ar, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ar edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `MARBERTv2` is an Arabic model originally trained by `UBC-NLP`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_marbertv2_ar_4.2.4_3.0_1670015405976.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_marbertv2_ar_4.2.4_3.0_1670015405976.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_marbertv2","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_marbertv2","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_marbertv2| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|609.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/UBC-NLP/MARBERTv2 - https://aclanthology.org/2021.acl-long.551.pdf - https://github.com/UBC-NLP/marbert - https://doi.org/10.14288/SOCKEYE - https://www.tensorflow.org/tfrc --- layout: model title: Fast Neural Machine Translation Model from English to Kinyarwanda author: John Snow Labs name: opus_mt_en_rw date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, rw, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `rw` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_rw_xx_2.7.0_2.4_1609169329842.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_rw_xx_2.7.0_2.4_1609169329842.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_rw", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_rw", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.rw').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_rw| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Part of Speech for Indonesian author: John Snow Labs name: roberta_token_classifier_pos_tagger date: 2021-12-27 tags: [indonesian, roberta, pos, id, open_source] task: Part of Speech Tagging language: id edition: Spark NLP 3.3.4 spark_version: 2.4 supported: true annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model was imported from `Hugging Face` and it's been fine-tuned on [indonlu's](https://hf.co/datasets/indonlu) POSP dataset for the Indonesian language, leveraging `RoBERTa` embeddings and `RobertaForTokenClassification` for POS tagging purposes. ## Predicted Entities `PPO`, `KUA`, `ADV`, `PRN`, `VBI`, `PAR`, `VBP`, `NNP`, `UNS`, `VBT`, `VBL`, `NNO`, `ADJ`, `PRR`, `PRK`, `CCN`, `$$$`, `ADK`, `ART`, `CSN`, `NUM`, `SYM`, `INT`, `NEG`, `PRI`, `VBE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_pos_tagger_id_3.3.4_2.4_1640589883082.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_pos_tagger_id_3.3.4_2.4_1640589883082.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_pos_tagger", "id")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = """Budi sedang pergi ke pasar.""" result = model.transform(spark.createDataFrame([[text]]).toDF("text")) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_pos_tagger", "id") .setInputCols(Array("sentence","token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val example = Seq("Budi sedang pergi ke pasar.").toDS.toDF("text") val result = pipeline.fit(example).transform(example) ```
## Results ```bash +------+---------+ |chunk |ner_label| +------+---------+ |Budi |NNO | |sedang|ADK | |pergi |VBI | |ke |PPO | |pasar |NNO | |. |SYM | +------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_token_classifier_pos_tagger| |Compatibility:|Spark NLP 3.3.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|id| |Size:|466.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## Data Source [https://huggingface.co/w11wo/indonesian-roberta-base-posp-tagger](https://huggingface.co/w11wo/indonesian-roberta-base-posp-tagger) ## Benchmarking ```bash label score f1 0.8893 Accuracy 0.9399 ``` --- layout: model title: Marathi ALBERT Embeddings (v1) author: John Snow Labs name: albert_embeddings_marathi_albert date: 2022-04-14 tags: [albert, embeddings, mr, open_source] task: Embeddings language: mr edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: AlBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `marathi-albert` is a Marathi model originally trained by `l3cube-pune`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_marathi_albert_mr_3.4.2_3.0_1649954258953.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_marathi_albert_mr_3.4.2_3.0_1649954258953.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = AlbertEmbeddings.pretrained("albert_embeddings_marathi_albert","mr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["मला स्पार्क एनएलपी आवडते"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_marathi_albert","mr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("मला स्पार्क एनएलपी आवडते").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("mr.embed.albert").predict("""मला स्पार्क एनएलपी आवडते""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_embeddings_marathi_albert| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|mr| |Size:|45.1 MB| |Case sensitive:|false| ## References - https://huggingface.co/l3cube-pune/marathi-albert - https://github.com/l3cube-pune/MarathiNLP - https://arxiv.org/abs/2202.01159 --- layout: model title: Sentiment Analysis in Spanish author: John Snow Labs name: beto_sentiment date: 2022-10-11 tags: [beto, sentiment, bert, es, open_source] task: Text Classification language: es edition: Spark NLP 4.2.0 spark_version: [3.2, 3.0] supported: true annotator: BertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Model trained with the TASS 2020 corpus (around 5k tweets) covering several dialects of Spanish. The base model is BETO, a BERT model trained on Spanish text. ## Predicted Entities `POS`, `NEG`, `NEU` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/beto_sentiment_es_4.2.0_3.2_1665504729919.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/beto_sentiment_es_4.2.0_3.2_1665504729919.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequenceClassifier = BertForSequenceClassification.pretrained("beto_sentiment", "es")\ .setInputCols(["document", "token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) # a simple example example = spark.createDataFrame([["Te quiero. Te amo."]]).toDF("text") result = pipeline.fit(example).transform(example) # result is a DataFrame result.select("text", "class.result").show() ```
## Results ```bash +------------------+------+ | text|result| +------------------+------+ |Te quiero. Te amo.| [POS]| +------------------+------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|beto_sentiment| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|es| |Size:|412.4 MB| |Case sensitive:|true| |Max sentence length:|128| ## References https://github.com/finiteautomata/pysentimiento/ --- layout: model title: Detect PHI for Deidentification purposes (Spanish) author: John Snow Labs name: ner_deid_subentity date: 2022-01-18 tags: [deid, es, licensed] task: De-identification language: es edition: Healthcare NLP 3.3.4 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, "Named Entity Recognition with Bidirectional LSTM-CNNs". Deidentification NER (Spanish) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 13 entities. This NER model is trained with a combination of custom datasets, the CoNLL 2002 Spanish corpus, the MeddoProf dataset, and several data augmentation mechanisms.
## Predicted Entities `PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `E-MAIL`, `USERNAME`, `LOCATION`, `ZIP`, `MEDICALRECORD`, `PROFESSION`, `PHONE`, `DOCTOR`, `AGE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_ES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_es_3.3.4_3.0_1642512189785.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_es_3.3.4_3.0_1642512189785.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("word_embeddings") clinical_ner = medical.NerModel.pretrained("ner_deid_subentity", "es", "clinical/models")\ .setInputCols(["sentence","token","word_embeddings"])\ .setOutputCol("ner") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner]) text = [''' Antonio Pérez Juan, nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14/03/2020 y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. 
'''] data = spark.createDataFrame([text]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "es", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner)) val text = """Antonio Pérez Juan, nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14/03/2020 y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.""" val data = Seq(text).toDS.toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.deid.subentity").predict(""" Antonio Pérez Juan, nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14/03/2020 y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. """) ```
## Results ```bash +------------+----------+ | token| ner_label| +------------+----------+ | Antonio| B-PATIENT| | Pérez| I-PATIENT| | Juan| I-PATIENT| | ,| O| | nacido| O| | en| O| | Cadiz|B-LOCATION| | ,| O| | España|B-LOCATION| | .| O| | Aún| O| | no| O| | estaba| O| | vacunado| O| | ,| O| | se| O| | infectó| O| | con| O| | Covid-19| O| | el| O| | dia| O| | 14| B-DATE| | de| I-DATE| | Marzo| I-DATE| | y| O| | tuvo| O| | que| O| | ir| O| | al| O| | Hospital| O| | Fue| O| | tratado| O| | con| O| | anticuerpos| O| |monoclonales| O| | en| O| | la| O| | Clinica|B-HOSPITAL| | San|I-HOSPITAL| | Carlos|I-HOSPITAL| | .| O| +------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, word_embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|15.0 MB| |Dependencies:|embeddings_sciwiki_300d| ## Data Source - Internal JSL annotated corpus - [Spanish conLL](https://www.clips.uantwerpen.be/conll2002/ner/data/) - [MeddoProf](https://temu.bsc.es/meddoprof/data/) ## Benchmarking ```bash label tp fp fn total precision recall f1 PATIENT 2088.0 201.0 178.0 2266.0 0.9122 0.9214 0.9168 HOSPITAL 302.0 43.0 85.0 387.0 0.8754 0.7804 0.8251 DATE 1837.0 33.0 20.0 1857.0 0.9824 0.9892 0.9858 ORGANIZATION 2498.0 477.0 649.0 3147.0 0.8397 0.7938 0.8161 MAIL 58.0 0.0 0.0 58.0 1.0 1.0 1.0 USERNAME 90.0 0.0 15.0 105.0 1.0 0.8571 0.9231 LOCATION 1866.0 391.0 354.0 2220.0 0.8268 0.8405 0.8336 ZIP 20.0 1.0 2.0 22.0 0.9524 0.9091 0.9302 MEDICALRECORD 111.0 5.0 20.0 131.0 0.9569 0.8473 0.8988 PROFESSION 270.0 96.0 134.0 404.0 0.7377 0.6683 0.7013 PHONE 108.0 11.0 8.0 116.0 0.9076 0.931 0.9191 DOCTOR 659.0 40.0 40.0 699.0 0.9428 0.9428 0.9428 AGE 302.0 53.0 61.0 363.0 0.8507 0.832 0.8412 macro - - - - - - 0.8872247 micro - - - - - - 0.8741892 ``` --- layout: model title: English DistilBertForQuestionAnswering model (from 
vkmr) author: John Snow Labs name: distilbert_qa_vkmr_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `vkmr`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_vkmr_base_uncased_finetuned_squad_en_4.0.0_3.0_1654726538207.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_vkmr_base_uncased_finetuned_squad_en_4.0.0_3.0_1654726538207.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vkmr_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vkmr_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_vkmr").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
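Under the hood, extractive QA models like this one score each context token as a possible answer start and as a possible answer end, and the predicted answer is the best-scoring valid span. A minimal, illustrative sketch of that span selection (toy scores, not the annotator's internals):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) indices maximizing start+end score with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            if s_score + end_scores[e] > best_score:
                best_score, best = s_score + end_scores[e], (s, e)
    return best

context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 1.0, 0.0]
end = [0.0, 0.1, 0.0, 4.5, 0.2, 0.0, 0.0, 0.0, 2.0, 0.1]
s, e = best_span(start, end)
print(" ".join(context[s:e + 1]))  # → Clara
```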
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_vkmr_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/vkmr/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal Personnel Management And Staff Remuneration Document Classifier (EURLEX) author: John Snow Labs name: legclf_personnel_management_and_staff_remuneration_bert date: 2023-03-06 tags: [en, legal, classification, clauses, personnel_management_and_staff_remuneration, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_personnel_management_and_staff_remuneration_bert model, a Bert Sentence Embeddings Document Classifier, determines whether the document belongs to the class Personnel_Management_and_Staff_Remuneration or not (Binary Classification) according to EuroVoc labels. 
## Predicted Entities `Personnel_Management_and_Staff_Remuneration`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_personnel_management_and_staff_remuneration_bert_en_1.0.0_3.0_1678111663312.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_personnel_management_and_staff_remuneration_bert_en_1.0.0_3.0_1678111663312.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_personnel_management_and_staff_remuneration_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
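The macro and weighted averages reported in the Benchmarking section can be reproduced from the per-class F1 scores and supports. A quick sanity check in plain Python, with the per-class values copied from that table:

```python
# Per-class F1 and support, copied from the Benchmarking table
f1 = {"Other": 0.90, "Personnel_Management_and_Staff_Remuneration": 0.92}
support = {"Other": 47, "Personnel_Management_and_Staff_Remuneration": 53}

# Macro average: unweighted mean over classes
macro_f1 = sum(f1.values()) / len(f1)
# Weighted average: mean over classes weighted by support
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())
print(round(macro_f1, 2), round(weighted_f1, 2))  # → 0.91 0.91
```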
## Results ```bash +-------+ |result| +-------+ |[Personnel_Management_and_Staff_Remuneration]| |[Other]| |[Other]| |[Personnel_Management_and_Staff_Remuneration]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_personnel_management_and_staff_remuneration_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.93 0.87 0.90 47 Personnel_Management_and_Staff_Remuneration 0.89 0.94 0.92 53 accuracy - - 0.91 100 macro-avg 0.91 0.91 0.91 100 weighted-avg 0.91 0.91 0.91 100 ``` --- layout: model title: Detect Living Species (embeddings_scielo_300d) author: John Snow Labs name: ner_living_species_300 date: 2022-11-22 tags: [licensed, clinical, es, ner] task: Named Entity Recognition language: es edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract living species from clinical texts in Spanish, which is critical to scientific disciplines like medicine, biology, ecology/biodiversity, nutrition and agriculture. This model is trained using `embeddings_scielo_300d` embeddings. It is trained on the [LivingNER](https://temu.bsc.es/livingner/) corpus that is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. 
## Predicted Entities `HUMAN`, `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_300_es_4.2.2_3.0_1669127690723.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_300_es_4.2.2_3.0_1669127690723.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("embeddings_scielo_300d","es","clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_living_species_300", "es","clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. 
Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_scielo_300d","es","clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_living_species_300", "es","clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. 
Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.living_species.300").predict("""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.""") ```
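The `ner_chunk` column pairs each chunk with its label, as in the Results table. Collected results can then be filtered by label in plain Python (or restricted earlier via the converter's `setWhiteList`, if preferred); the sample pairs below are taken from the Results table:

```python
# (chunk, label) pairs as produced by the ner_converter stage
chunks = [("Lactante varón", "HUMAN"), ("legumbres", "SPECIES"),
          ("lentejas", "SPECIES"), ("garbanzos", "SPECIES"), ("madre", "HUMAN")]
# Keep only the SPECIES chunks
species = [text for text, label in chunks if label == "SPECIES"]
print(species)  # → ['legumbres', 'lentejas', 'garbanzos']
```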
## Results ```bash +--------------+-------+ |ner_chunk |label | +--------------+-------+ |Lactante varón|HUMAN | |familiares |HUMAN | |personales |HUMAN | |neonatal |HUMAN | |legumbres |SPECIES| |lentejas |SPECIES| |garbanzos |SPECIES| |legumbres |SPECIES| |madre |HUMAN | |Cacahuete |SPECIES| |padres |HUMAN | +--------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_300| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|15.0 MB| ## Benchmarking ```bash label precision recall f1-score support B-HUMAN 0.98 0.97 0.98 3281 B-SPECIES 0.94 0.98 0.96 3712 I-HUMAN 0.87 0.81 0.84 297 I-SPECIES 0.79 0.89 0.84 1732 micro-avg 0.92 0.95 0.94 9022 macro-avg 0.90 0.91 0.90 9022 weighted-avg 0.93 0.95 0.94 9022 ``` --- layout: model title: English asr_wav2vec2_japanese_hiragana_vtuber TFWav2Vec2ForCTC from thunninoi author: John Snow Labs name: asr_wav2vec2_japanese_hiragana_vtuber date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_japanese_hiragana_vtuber` is an English model originally trained by thunninoi. 
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_japanese_hiragana_vtuber_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_japanese_hiragana_vtuber_en_4.2.0_3.0_1664095763480.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_japanese_hiragana_vtuber_en_4.2.0_3.0_1664095763480.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_japanese_hiragana_vtuber", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_japanese_hiragana_vtuber", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
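`AudioAssembler` expects the `audio_content` column to contain raw audio samples as an array of floats (Wav2Vec2 models are typically trained on 16 kHz mono audio). A stdlib-only sketch of decoding 16-bit PCM WAV data into normalized floats; the file name and DataFrame construction at the end are hypothetical:

```python
import struct
import wave

def wav_to_floats(path_or_file):
    """Decode 16-bit PCM mono WAV data into floats in [-1.0, 1.0)."""
    with wave.open(path_or_file, "rb") as w:
        assert w.getnchannels() == 1 and w.getsampwidth() == 2
        frames = w.readframes(w.getnframes())
    # Little-endian signed 16-bit samples, scaled to the unit range
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Hypothetical usage with the pipeline above:
# floats = wav_to_floats("speech_16khz_mono.wav")
# audioDf = spark.createDataFrame([[floats]], ["audio_content"])
```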
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_japanese_hiragana_vtuber| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Spanish RoBERTa Embeddings (Base, Using Sequence Length 512) author: John Snow Labs name: roberta_embeddings_bertin_base_gaussian_exp_512seqlen date: 2022-04-14 tags: [roberta, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bertin-base-gaussian-exp-512seqlen` is a Spanish model originally trained by `bertin-project`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_bertin_base_gaussian_exp_512seqlen_es_3.4.2_3.0_1649945655435.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_bertin_base_gaussian_exp_512seqlen_es_3.4.2_3.0_1649945655435.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_bertin_base_gaussian_exp_512seqlen","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_bertin_base_gaussian_exp_512seqlen","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.bertin_base_gaussian_exp_512seqlen").predict("""Me encanta chispa nlp""") ```
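Token embeddings such as these are commonly compared with cosine similarity. A small self-contained example, using toy low-dimensional vectors standing in for the model's actual embedding vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dimensional vectors; the second is a scaled copy of the first,
# so their cosine similarity is exactly 1.0
print(round(cosine([1.0, 2.0, 0.0, 1.0], [2.0, 4.0, 0.0, 2.0]), 4))  # → 1.0
```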
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_bertin_base_gaussian_exp_512seqlen| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|234.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/bertin-project/bertin-base-gaussian-exp-512seqlen --- layout: model title: Fake News Classifier in Urdu author: John Snow Labs name: classifierdl_urduvec_fakenews date: 2021-12-29 tags: [urdu, fake_news, fake, ur, open_source] task: Text Classification language: ur edition: Spark NLP 3.3.1 spark_version: 3.0 supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model determines if news articles in Urdu are real or fake. ## Predicted Entities `real`, `fake` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_UR_FAKENEWS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_UR_FAKENEWS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_urduvec_fakenews_ur_3.3.1_3.0_1640771335815.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_urduvec_fakenews_ur_3.3.1_3.0_1640771335815.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("news") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") normalizer = Normalizer() \ .setInputCols(["token"]) \ .setOutputCol("normalized") lemma = LemmatizerModel.pretrained("lemma", "ur") \ .setInputCols(["normalized"]) \ .setOutputCol("lemma") embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur") \ .setInputCols("document", "lemma") \ .setOutputCol("embeddings") embeddingsSentence = SentenceEmbeddings() \ .setInputCols(["document", "embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") classifierdl = ClassifierDLModel.pretrained("classifierdl_urduvec_fakenews", "ur") \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") urdu_fake_pipeline = Pipeline(stages=[document_assembler, tokenizer, normalizer, lemma, embeddings, embeddingsSentence, classifierdl]) light_pipeline = LightPipeline(urdu_fake_pipeline.fit(spark.createDataFrame([['']]).toDF("news"))) result = light_pipeline.annotate("ایک امریکی تھنک ٹینک نے خبردار کیا ہے کہ جیسے جیسے چین مصنوعی ذہانت (آرٹیفیشل انٹیلی جنس) کے میدان میں ترقی کر رہا ہے، دنیا کا اقتصادی اور عسکری توازن تبدیل ہو سکتا ہے۔") result["class"] ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val normalizer = new Normalizer() .setInputCols(Array("token")) .setOutputCol("normalized") val lemma = LemmatizerModel.pretrained("lemma", "ur") .setInputCols(Array("normalized")) .setOutputCol("lemma") val embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur") .setInputCols(Array("document", "lemma")) .setOutputCol("embeddings") val embeddingsSentence = new SentenceEmbeddings() .setInputCols(Array("document", "embeddings")) 
.setOutputCol("sentence_embeddings") .setPoolingStrategy("AVERAGE") val classifier = ClassifierDLModel.pretrained("classifierdl_urduvec_fakenews", "ur") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val urdu_fake_pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, normalizer, lemma, embeddings, embeddingsSentence, classifier)) val light_pipeline = new LightPipeline(urdu_fake_pipeline.fit(Seq("").toDF("text"))) val result = light_pipeline.annotate("ایک امریکی تھنک ٹینک نے خبردار کیا ہے کہ جیسے جیسے چین مصنوعی ذہانت (آرٹیفیشل انٹیلی جنس) کے میدان میں ترقی کر رہا ہے، دنیا کا اقتصادی اور عسکری توازن تبدیل ہو سکتا ہے۔") ``` {:.nlu-block} ```python import nlu nlu.load("ur.classify.fakenews").predict("""ایک امریکی تھنک ٹینک نے خبردار کیا ہے کہ جیسے جیسے چین مصنوعی ذہانت (آرٹیفیشل انٹیلی جنس) کے میدان میں ترقی کر رہا ہے، دنیا کا اقتصادی اور عسکری توازن تبدیل ہو سکتا ہے۔""") ```
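The `SentenceEmbeddings` stage above uses the `AVERAGE` pooling strategy, i.e. the sentence vector is the element-wise mean of the word vectors. A toy illustration in plain Python:

```python
def average_pool(word_vectors):
    """Element-wise mean of word vectors, yielding one sentence vector."""
    n = len(word_vectors)
    dim = len(word_vectors[0])
    return [sum(vec[i] for vec in word_vectors) / n for i in range(dim)]

# Toy 3-dimensional word vectors standing in for 300-dimensional urduvec embeddings
print(average_pool([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]))  # → [2.0, 2.0, 2.0]
```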
## Results ```bash ['real'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_urduvec_fakenews| |Compatibility:|Spark NLP 3.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|ur| |Size:|21.5 MB| ## Data Source Combination of multiple open source data sets. ## Benchmarking ```bash label precision recall f1-score support fake 0.77 0.70 0.73 415 real 0.71 0.77 0.74 387 accuracy 0.73 802 macro-avg 0.74 0.74 0.73 802 weighted-avg 0.74 0.73 0.73 802 ``` --- layout: model title: English BertForQuestionAnswering model (from tiennvcs) author: John Snow Labs name: bert_qa_bert_base_uncased_finetuned_infovqa date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-finetuned-infovqa` is an English model originally trained by `tiennvcs`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_finetuned_infovqa_en_4.0.0_3.0_1654180950608.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_finetuned_infovqa_en_4.0.0_3.0_1654180950608.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_finetuned_infovqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_finetuned_infovqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.infovqa.base_uncased.by_tiennvcs").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_finetuned_infovqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/tiennvcs/bert-base-uncased-finetuned-infovqa --- layout: model title: Extract relations between effects of using multiple drugs (ReDL) author: John Snow Labs name: redl_drug_drug_interaction_biobert date: 2021-07-24 tags: [relation_extraction, en, licensed, clinical] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.0.3 spark_version: 2.4 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract potential improvements or harmful effects of Drug-Drug interactions (DDIs) when two or more drugs are taken at the same time or at a certain interval. ## Predicted Entities `DDI-advise`, `DDI-effect`, `DDI-false`, `DDI-int`, `DDI-mechanism` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_DRUG_DRUG_INT/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_drug_drug_interaction_biobert_en_3.0.3_2.4_1627119817997.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_drug_drug_interaction_biobert_en_3.0.3_2.4_1627119817997.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = sparknlp.annotators.Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") words_embedder = WordEmbeddingsModel() \ .pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverter() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel() \ .pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") # Set a filter on pairs of named entities which will be treated as relation candidates re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks") #.setRelationPairs(['SYMPTOM-EXTERNAL_BODY_PART_OR_REGION']) # The dataset this model is trained to is sentence-wise. # This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. 
re_model = RelationExtractionDLModel()\ .pretrained('redl_drug_drug_interaction_biobert', 'en', "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) text="""When carbamazepine is withdrawn from the combination therapy, aripiprazole dose should then be reduced. \ If additional adrenergic drugs are to be administered by any route, \ they should be used with caution because the pharmacologically predictable sympathetic effects of Metformin may be potentiated""" data = spark.createDataFrame([[text]]).toDF("text") p_model = pipeline.fit(data) result = p_model.transform(data) ``` ```scala ... val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val 
re_ner_chunk_filter = RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") // .setRelationPairs(Array("SYMPTOM-EXTERNAL_BODY_PART_OR_REGION")) // The dataset this model is trained on is sentence-wise. // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. val re_model = RelationExtractionDLModel() .pretrained("redl_drug_drug_interaction_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("""When carbamazepine is withdrawn from the combination therapy, aripiprazole dose should then be reduced. If additional adrenergic drugs are to be administered by any route, they should be used with caution because the pharmacologically predictable sympathetic effects of Metformin may be potentiated""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.drug_drug_interaction").predict("""When carbamazepine is withdrawn from the combination therapy, aripiprazole dose should then be reduced. \ If additional adrenergic drugs are to be administered by any route, \ they should be used with caution because the pharmacologically predictable sympathetic effects of Metformin may be potentiated""") ```
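The `setMaxSyntacticDistance(10)` setting above keeps only candidate entity pairs whose chunks are close in the dependency tree. A toy sketch of that filtering idea, with a made-up `heads` array and hypothetical helper names (not the Spark NLP implementation):

```python
from collections import deque

def tree_distance(heads, i, j):
    """Number of dependency edges between tokens i and j.

    `heads[k]` is the index of token k's head (-1 for the root).
    The tree is walked as an undirected graph with BFS.
    """
    adj = {k: set() for k in range(len(heads))}
    for k, h in enumerate(heads):
        if h >= 0:
            adj[k].add(h)
            adj[h].add(k)
    seen, queue = {i}, deque([(i, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == j:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")

# Toy sentence: heads[k] = index of the syntactic head of token k.
heads = [-1, 0, 1, 0, 3]            # token 0 is the root
candidates = [(1, 2), (2, 4), (1, 4)]  # hypothetical chunk-head pairs
max_distance = 2
kept = [p for p in candidates if tree_distance(heads, *p) <= max_distance]
```

Only pairs within the distance budget survive as relation candidates; the rest are never scored by the relation-extraction model.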
## Results ```bash +---------+-------+-------------+-----------+-------------+-------+-------------+-----------+------------+----------+ | relation|entity1|entity1_begin|entity1_end| chunk1|entity2|entity2_begin|entity2_end| chunk2|confidence| +---------+-------+-------------+-----------+-------------+-------+-------------+-----------+------------+----------+ |DDI-false| DRUG| 5| 17|carbamazepine| DRUG| 62| 73|aripiprazole|0.91685396| +---------+-------+-------------+-----------+-------------+-------+-------------+-----------+------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_drug_drug_interaction_biobert| |Compatibility:|Healthcare NLP 3.0.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Case sensitive:|true| ## Data Source Trained on DDI Extraction corpus. ## Benchmarking ```bash Relation Recall Precision F1 Support DDI-advise 0.758 0.874 0.812 211 DDI-effect 0.759 0.754 0.756 348 DDI-false 0.977 0.957 0.967 4097 DDI-int 0.175 0.458 0.253 63 DDI-mechanism 0.783 0.853 0.816 281 Avg. 0.690 0.779 0.721 - ``` --- layout: model title: Sentence Entity Resolver for RxNorm (sbiobert_base_cased_mli - EntityChunkEmbeddings) author: John Snow Labs name: sbiobertresolve_rxnorm_augmented_re date: 2022-02-09 tags: [rxnorm, licensed, en, clinical, entity_resolution] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.4.0 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes without specifying the relations between the entities (relations are calculated on the fly inside the annotator) using sbiobert_base_cased_mli Sentence Bert Embeddings (EntityChunkEmbeddings). 
Embeddings used in this model are calculated with the following weights: `{"DRUG": 0.8, "STRENGTH": 0.2, "ROUTE": 0.2, "FORM": 0.2}`. EntityChunkEmbeddings with these weights is required in the pipeline to get the best results. ## Predicted Entities `RxNorm Codes`, `Concept Classes` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_augmented_re_en_3.4.0_2.4_1644395696788.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_augmented_re_en_3.4.0_2.4_1644395696788.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") sentence_detector = SentenceDetector() \ .setInputCols("documents") \ .setOutputCol("sentences") tokenizer = Tokenizer() \ .setInputCols("sentences") \ .setOutputCol("tokens") embeddings = WordEmbeddingsModel() \ .pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentences", "tokens"])\ .setOutputCol("embeddings") posology_ner_model = MedicalNerModel()\ .pretrained("ner_posology_large", "en", "clinical/models")\ .setInputCols(["sentences", "tokens", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentences", "tokens", "ner"])\ .setOutputCol("ner_chunks") pos_tager = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models")\ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel()\ .pretrained("dependency_conllu", "en")\ .setInputCols(["sentences", "pos_tags", "tokens"])\ .setOutputCol("dependencies") drug_chunk_embeddings = EntityChunkEmbeddings()\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunks", "dependencies"])\ .setOutputCol("drug_chunk_embeddings")\ .setMaxSyntacticDistance(3)\ .setTargetEntities({"DRUG": ["STRENGTH", "ROUTE", "FORM"]})\ .setEntityWeights({"DRUG": 0.8, "STRENGTH": 0.2, "ROUTE": 0.2, "FORM": 0.2}) rxnorm_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_rxnorm_augmented_re", "en", "clinical/models")\ .setInputCols(["drug_chunk_embeddings"])\ .setOutputCol("rxnorm_code")\ .setDistanceFunction("EUCLIDEAN") rxnorm_weighted_pipeline_re = Pipeline( stages = [ documenter, sentence_detector, tokenizer, embeddings, posology_ner_model, ner_converter, pos_tager, dependency_parser, drug_chunk_embeddings, rxnorm_resolver ]) sampleText = ["The patient was given metformin 500 mg, 2.5 mg of coumadin and then 
ibuprofen.", "The patient was given metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG"] data_df = spark.createDataFrame([[t] for t in sampleText]).toDF("text") results = rxnorm_weighted_pipeline_re.fit(data_df).transform(data_df) ``` ```scala val documenter = DocumentAssembler() .setInputCol("text") .setOutputCol("documents") val sentence_detector = SentenceDetector() .setInputCols("documents") .setOutputCol("sentences") val tokenizer = Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val embeddings = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val posology_ner_model = MedicalNerModel() .pretrained("ner_posology_large", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner") val ner_converter = NerConverterInternal() .setInputCols(Array("sentences", "tokens", "ner")) .setOutputCol("ner_chunks") val pos_tager = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") val drug_chunk_embeddings = EntityChunkEmbeddings() .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunks", "dependencies")) .setOutputCol("drug_chunk_embeddings") .setMaxSyntacticDistance(3) .setTargetEntities(Map("DRUG" -> List("STRENGTH", "ROUTE", "FORM"))) .setEntityWeights(Map("DRUG" -> 0.8f, "STRENGTH" -> 0.2f, "ROUTE" -> 0.2f, "FORM" -> 0.2f)) val rxnorm_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_rxnorm_augmented_re", "en", "clinical/models") .setInputCols(Array("drug_chunk_embeddings")) .setOutputCol("rxnorm_code") .setDistanceFunction("EUCLIDEAN") val rxnorm_weighted_pipeline_re = new Pipeline().setStages(Array(documenter, sentence_detector,
tokenizer, embeddings, posology_ner_model, ner_converter, pos_tager, dependency_parser, drug_chunk_embeddings, rxnorm_resolver)) val sampleText = Seq("The patient was given metformin 500 mg, 2.5 mg of coumadin and then ibuprofen.", "The patient was given metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG").toDF("text") val model = rxnorm_weighted_pipeline_re.fit(sampleText) val light_model = new LightPipeline(model) val results = model.transform(sampleText) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.rxnorm.augmented_re").predict("""The patient was given metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG""") ```
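The entity weights used above (`DRUG` 0.8, the other components 0.2) amount to a weighted average over the embeddings of a drug chunk's parts, so the drug name dominates the vector sent to the resolver. A minimal numpy sketch of that weighting, with made-up 3-d vectors standing in for the real sentence embeddings:

```python
import numpy as np

# Hypothetical component embeddings for one drug mention,
# e.g. "metformin 500 mg oral tablet" (toy 3-d vectors).
components = {
    "DRUG":     np.array([1.0, 0.0, 0.0]),
    "STRENGTH": np.array([0.0, 1.0, 0.0]),
    "FORM":     np.array([0.0, 0.0, 1.0]),
}
weights = {"DRUG": 0.8, "STRENGTH": 0.2, "ROUTE": 0.2, "FORM": 0.2}

# Weighted average: the DRUG vector contributes most to the result.
num = sum(weights[name] * vec for name, vec in components.items())
den = sum(weights[name] for name in components)
chunk_embedding = num / den
```

The resulting vector is what a resolver would compare against the RxNorm code embeddings; raising the `DRUG` weight pulls the combined vector closer to the bare drug-name embedding.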
## Results ```bash +-----+----------------+--------------------------+--------------------------------------------------+ |index| chunk|rxnorm_code_weighted_08_re| Concept_Name| +-----+----------------+--------------------------+--------------------------------------------------+ | 0|metformin 500 mg| 860974|metformin hydrochloride 500 MG:::metformin 500 ...| | 0| 2.5 mg coumadin| 855313|warfarin sodium 2.5 MG [Coumadin]:::warfarin so...| | 0| ibuprofen| 1747293|ibuprofen Injection:::ibuprofen Pill:::ibuprofe...| | 1|metformin 400 mg| 332809|metformin 400 MG:::metformin 250 MG Oral Tablet...| | 1| coumadin 5 mg| 855333|warfarin sodium 5 MG [Coumadin]:::warfarin sodi...| | 1| coumadin| 202421|Coumadin:::warfarin sodium 2 MG/ML Injectable S...| | 1|amlodipine 10 MG| 308135|amlodipine 10 MG Oral Tablet:::amlodipine 10 MG...| +-----+----------------+--------------------------+--------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_rxnorm_augmented_re| |Compatibility:|Healthcare NLP 3.4.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[rxnorm_code]| |Language:|en| |Size:|759.7 MB| |Case sensitive:|false| --- layout: model title: English image_classifier_vit_grain ViTForImageClassification from roydcarlson author: John Snow Labs name: image_classifier_vit_grain date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_grain` is an English model originally trained by roydcarlson.
## Predicted Entities `teff`, `buckwheat`, `barley`, `wheat`, `millet` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_grain_en_4.1.0_3.0_1660169441159.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_grain_en_4.1.0_3.0_1660169441159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_grain", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_grain", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_grain| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Stopwords Remover for Korean language (63 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, ko, open_source] task: Stop Words Removal language: ko edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_ko_3.4.1_3.0_1646673221687.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_ko_3.4.1_3.0_1646673221687.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","ko") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["당신은 나보다 낫지 않습니다"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","ko") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("당신은 나보다 낫지 않습니다").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ko.stopwords").predict("""당신은 나보다 낫지 않습니다""") ```
## Results ```bash +--------------------------------+ |result | +--------------------------------+ |[당신은, 나보다, 낫지, 않습니다]| +--------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|ko| |Size:|1.6 KB| --- layout: model title: Word Embeddings for Hebrew (hebrew_cc_300d) author: John Snow Labs name: hebrew_cc_300d date: 2020-12-09 task: Embeddings language: he edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [embeddings, open_source, he] supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained on Common Crawl and Wikipedia using fastText. It is trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives. The model gives 300 dimensional vector outputs per token. The output vectors map words into a meaningful space where the distance between the vectors is related to semantic similarity of words. These embeddings can be used in multiple tasks like semantic word similarity, named entity recognition, sentiment analysis, and classification. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/hebrew_cc_300d_he_2.7.0_2.4_1607518856144.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/hebrew_cc_300d_he_2.7.0_2.4_1607518856144.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of a pipeline after tokenization.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = WordEmbeddingsModel.pretrained("hebrew_cc_300d", "he") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['כמו גם התקפות והאשמות נגד בראון']], ["text"])) ``` ```scala val embeddings = WordEmbeddingsModel.pretrained("hebrew_cc_300d", "he") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("כמו גם התקפות והאשמות נגד בראון").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["כמו גם התקפות והאשמות נגד בראון"] hebrevec_df = nlu.load('he.embed.cbow_300d').predict(text, output_level="token") hebrevec_df ```
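Since distances in this embedding space track semantic similarity, a typical downstream step is comparing token vectors with cosine similarity. A small numpy sketch of that comparison (random 300-d vectors stand in for real fastText outputs):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1.0 for parallel vectors, near 0.0 for unrelated ones."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
v_word = rng.normal(size=300)                 # stand-in for one 300-d embedding
v_close = v_word + 0.1 * rng.normal(size=300) # slightly perturbed "related" vector
v_far = rng.normal(size=300)                  # independent "unrelated" vector

# A related vector scores higher than an unrelated one.
assert cosine(v_word, v_close) > cosine(v_word, v_far)
```

With real embeddings, the same function ranks vocabulary words by similarity to a query word, which is the basis of the semantic-similarity use case mentioned above.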
{:.h2_title} ## Results The model gives 300-dimensional fastText (CBOW) feature vector outputs per token. ```bash | he_embed_cbow_300d_embeddings | token |----------------------------------------------------|------- | [0.01140000019222498, 0.005900000222027302, 0.... כמו | [-0.06199999898672104, 0.04879999905824661, 0.... גם | [0.041600000113248825, 0.045099999755620956, 0... התקפות | [0.01489999983459711, 0.024800000712275505, -0... והאשמות | [-0.049800001084804535, 0.05260000005364418, -... נגד | [0.01209999993443489, -0.012600000016391277, -... בראון ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|hebrew_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[word_embeddings]| |Language:|he| |Case sensitive:|false| |Dimension:|300| ## Data Source This model is imported from [https://fasttext.cc/docs/en/crawl-vectors.html](https://fasttext.cc/docs/en/crawl-vectors.html) --- layout: model title: Legal Participations Clause Binary Classifier author: John Snow Labs name: legclf_participations_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `participations` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `participations` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_participations_clause_en_1.0.0_3.2_1660122818723.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_participations_clause_en_1.0.0_3.2_1660122818723.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_participations_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
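The paragraph splitting (by multiline) recommended in the description can be approximated with a blank-line regular expression before sending each piece through the classifier. A rough sketch under that assumption (not the workshop notebook's code):

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Toy document with two clause-like paragraphs.
doc = """Section 1. Participations.
Each Lender's obligations hereunder may be sold as participations.

Section 2. Governing Law.
This Agreement shall be governed by the laws of the State of New York."""

paragraphs = split_paragraphs(doc)
# Each paragraph can then be fed to the clause classifier as `clause_text`.
```

Splitting this way keeps each clause within the 512-token budget of the sentence embeddings while preserving enough context for the classifier.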
## Results ```bash +----------------+ |          result| +----------------+ |[participations]| |         [other]| |         [other]| |[participations]| +----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_participations_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.92 0.98 0.95 57 participations 0.93 0.74 0.82 19 accuracy - - 0.92 76 macro-avg 0.93 0.86 0.89 76 weighted-avg 0.92 0.92 0.92 76 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from kizunasunhy) author: John Snow Labs name: distilbert_qa_kizunasunhy_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `kizunasunhy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_kizunasunhy_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771843588.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_kizunasunhy_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771843588.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kizunasunhy_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kizunasunhy_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_kizunasunhy_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/kizunasunhy/distilbert-base-uncased-finetuned-squad --- layout: model title: Translate English to Gilbertese Pipeline author: John Snow Labs name: translate_en_gil date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, gil, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `gil` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_gil_xx_2.7.0_2.4_1609686096384.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_gil_xx_2.7.0_2.4_1609686096384.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_gil", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_gil", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.gil').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_gil| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from Polish to English author: John Snow Labs name: opus_mt_pl_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pl, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `pl` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_pl_en_xx_2.7.0_2.4_1609167932979.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_pl_en_xx_2.7.0_2.4_1609167932979.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentences") marian = MarianTransformer.pretrained("opus_mt_pl_en", "xx")\ .setInputCols(["sentences"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_pl_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.pl.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_pl_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Detect Normalized Genes and Human Phenotypes (biobert) author: John Snow Labs name: ner_human_phenotype_go_biobert date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model can be used to detect normalized mentions of genes (go) and human phenotypes (hp) in medical text. ## Predicted Entities `HP`, `GO` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_HUMAN_PHENOTYPE_GO_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_go_biobert_en_3.0.0_3.0_1617260627136.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_go_biobert_en_3.0.0_3.0_1617260627136.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_human_phenotype_go_biobert", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_human_phenotype_go_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("EXAMPLE_TEXT").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu
nlu.load("en.med_ner.human_phenotype.go_biobert").predict("""Put your text here.""") ```
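The Benchmarking table below reports raw tp/fp/fn counts per entity; the precision, recall, and F1 columns follow directly from those counts, and the macro/micro rows aggregate them in the usual two ways. A minimal pure-Python sketch (the counts are copied from the table; the helper name `prf` is illustrative):

```python
# Precision/recall/F1 from the per-entity tp/fp/fn counts in the Benchmarking table.
counts = {
    "GO": {"tp": 7637, "fp": 579, "fn": 441},
    "HP": {"tp": 1463, "fp": 273, "fn": 222},
}

def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

scores = {label: prf(**c) for label, c in counts.items()}

# Macro F1: unweighted mean of the per-entity F1 scores.
macro_f1 = sum(s[2] for s in scores.values()) / len(scores)

# Micro F1: pool the counts across entities first, then compute F1 once.
tp = sum(c["tp"] for c in counts.values())
fp = sum(c["fp"] for c in counts.values())
fn = sum(c["fn"] for c in counts.values())
micro_f1 = prf(tp, fp, fn)[2]
```

Micro F1 weights each prediction equally, so the larger GO entity dominates it, while macro F1 weights each entity class equally.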
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_human_phenotype_go_biobert| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Benchmarking ```bash entity tp fp fn total precision recall f1 GO 7637.0 579.0 441.0 8078.0 0.9295 0.9454 0.9374 HP 1463.0 273.0 222.0 1685.0 0.8427 0.8682 0.8553 macro - - - - - - 0.89635 micro - - - - - - 0.92323 ``` --- layout: model title: Arabic BertForMaskedLM Large Cased model (from aubmindlab) author: John Snow Labs name: bert_embeddings_large_arabertv2 date: 2022-12-02 tags: [ar, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ar edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-arabertv2` is an Arabic model originally trained by `aubmindlab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_arabertv2_ar_4.2.4_3.0_1670019805854.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_arabertv2_ar_4.2.4_3.0_1670019805854.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_arabertv2","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_arabertv2","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_large_arabertv2| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|1.4 GB| |Case sensitive:|true| ## References - https://huggingface.co/aubmindlab/bert-large-arabertv2 - https://github.com/google-research/bert - https://arxiv.org/abs/2003.00104 - https://github.com/WissamAntoun/pydata_khobar_meetup - http://alt.qcri.org/farasa/segmenter.html - https://github.com/google-research/bert/blob/master/multilingual.md - https://github.com/elnagara/HARD-Arabic-Dataset - https://www.aclweb.org/anthology/D15-1299 - https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf - https://github.com/mohamedadaly/LABR - http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp - https://github.com/husseinmozannar/SOQAL - https://github.com/aub-mind/arabert/blob/master/AraBERT/README.md - https://arxiv.org/abs/2003.00104v2 - https://archive.org/details/arwiki-20190201 - https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4 - https://www.aclweb.org/anthology/W19-4619 - https://sites.aub.edu.lb/mindlab/ - https://www.yakshof.com/#/ - https://www.behance.net/rahalhabib - https://www.linkedin.com/in/wissam-antoun-622142b4/ - https://twitter.com/wissam_antoun - https://github.com/WissamAntoun - https://www.linkedin.com/in/fadybaly/ - https://twitter.com/fadybaly - https://github.com/fadybaly --- layout: model title: English RobertaForQuestionAnswering Cased model (from eAsyle) author: John Snow Labs name: roberta_qa_testabsa date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: 
RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `testABSA` is an English model originally trained by `eAsyle`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_testabsa_en_4.3.0_3.0_1674224267018.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_testabsa_en_4.3.0_3.0_1674224267018.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_testabsa","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_testabsa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_testabsa| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|426.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/eAsyle/testABSA --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_el16 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-el16` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el16_en_4.3.0_3.0_1675119492656.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el16_en_4.3.0_3.0_1675119492656.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_el16","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_el16","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_el16| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|208.2 MB| ## References - https://huggingface.co/google/t5-efficient-small-el16 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Language Detection & Identification Pipeline - 95 Languages author: John Snow Labs name: detect_language_95 date: 2020-12-05 task: [Pipeline Public, Language Detection, Sentence Detection] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [language_detection, open_source, pipeline, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Language detection and identification is the task of automatically detecting the language(s) present in a document based on the content of the document. ``LanguageDetectorDL`` is an annotator that detects the language of documents or sentences depending on the ``inputCols``. In addition, ``LanguageDetectorDL`` can accurately detect language from documents with mixed languages by coalescing sentences and selecting the best candidate. We have designed and developed Deep Learning models using CNN architectures in TensorFlow/Keras. The models are trained on large datasets such as Wikipedia and Tatoeba with high accuracy evaluated on the Europarl dataset. The output is a language code in Wiki Code style: [https://en.wikipedia.org/wiki/List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias). 
This pipeline can detect the following languages: ## Predicted Entities `Afrikaans`, `Amharic`, `Aragonese`, `Arabic`, `Assamese`, `Azerbaijani`, `Belarusian`, `Bulgarian`, `Bengali`, `Breton`, `Bosnian`, `Catalan`, `Czech`, `Welsh`, `Danish`, `German`, `Greek`, `English`, `Esperanto`, `Spanish`, `Estonian`, `Basque`, `Persian`, `Finnish`, `Faroese`, `French`, `Irish`, `Galician`, `Gujarati`, `Hebrew`, `Hindi`, `Croatian`, `Haitian Creole`, `Hungarian`, `Armenian`, `Interlingua`, `Indonesian`, `Icelandic`, `Italian`, `Japanese`, `Javanese`, `Georgian`, `Kazakh`, `Khmer`, `Kannada`, `Korean`, `Kurdish`, `Kyrgyz`, `Latin`, `Luxembourgish`, `Lao`, `Lithuanian`, `Latvian`, `Malagasy`, `Macedonian`, `Malayalam`, `Mongolian`, `Marathi`, `Malay`, `Maltese`, `Nepali`, `Dutch`, `Norwegian Nynorsk`, `Norwegian`, `Occitan`, `Odia (Oriya)`, `Punjabi (Eastern)`, `Polish`, `Pashto`, `Portuguese`, `Quechua`, `Romanian`, `Russian`, `Northern Sami`, `Sinhala`, `Slovak`, `Slovenian`, `Albanian`, `Serbian`, `Swedish`, `Swahili`, `Tamil`, `Telugu`, `Thai`, `Tagalog`, `Turkish`, `Tatar`, `Uyghur`, `Ukrainian`, `Urdu`, `Vietnamese`, `Volapük`, `Walloon`, `Xhosa`, `Chinese`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/LANGUAGE_DETECTOR/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/detect_language_95_xx_2.7.0_2.4_1607185479059.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/detect_language_95_xx_2.7.0_2.4_1607185479059.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("detect_language_95", lang = "xx") pipeline.annotate("French author who helped pioneer the science-fiction genre.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("detect_language_95", lang = "xx") pipeline.annotate("French author who helped pioneer the science-fiction genre.") ``` {:.nlu-block} ```python import nlu text = ["French author who helped pioneer the science-fiction genre."] lang_df = nlu.load("xx.classify.lang.95").predict(text) lang_df ```
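The pipeline emits Wiki-style language codes (e.g. `en`, `fr`) rather than display names, so a small lookup is handy when presenting results. A minimal illustrative sketch (the dictionary below covers only a handful of the 95 supported codes; extend it as needed):

```python
# Tiny illustrative subset of the Wiki language codes this pipeline can return.
WIKI_CODES = {
    "en": "English",
    "fr": "French",
    "de": "German",
    "ro": "Romanian",
    "ja": "Japanese",
}

def code_to_name(code: str) -> str:
    # Fall back to the raw code for anything outside the subset.
    return WIKI_CODES.get(code, code)
```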
## Results ```bash {'document': ['French author who helped pioneer the science-fiction genre.'], 'language': ['en']} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|detect_language_95| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - LanguageDetectorDL --- layout: model title: English BertForQuestionAnswering model (from vuiseng9) author: John Snow Labs name: bert_qa_bert_l_squadv1.1_sl384 date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-l-squadv1.1-sl384` is an English model originally trained by `vuiseng9`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_l_squadv1.1_sl384_en_4.0.0_3.0_1654536116442.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_l_squadv1.1_sl384_en_4.0.0_3.0_1654536116442.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_l_squadv1.1_sl384","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_l_squadv1.1_sl384","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.sl384.by_vuiseng9").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
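In the NLU one-liner above, the question and context are passed as a single string joined by `|||`. Assuming that separator convention, recovering the two parts on the pipeline side is a plain string split (the helper name `split_qa` is illustrative):

```python
def split_qa(combined: str, sep: str = "|||"):
    # Split a "question|||context" string into its two parts, trimming whitespace.
    question, _, context = combined.partition(sep)
    return question.strip(), context.strip()
```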
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_l_squadv1.1_sl384| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/vuiseng9/bert-l-squadv1.1-sl384 --- layout: model title: Pipeline to Detect Problem, Test and Treatment author: John Snow Labs name: ner_healthcare_pipeline date: 2022-03-22 tags: [licensed, ner, healthcare, treatment, problem, test, en, clinical] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_healthcare](https://nlp.johnsnowlabs.com/2021/04/21/ner_healthcare_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CLINICAL/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_pipeline_en_3.4.1_3.0_1647943495587.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_pipeline_en_3.4.1_3.0_1647943495587.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_healthcare_pipeline", "en", "clinical/models") pipeline.fullAnnotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG .") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_healthcare_pipeline", "en", "clinical/models") pipeline.fullAnnotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . 
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG .") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.healthcare_pipeline").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG .""") ```
## Results ```bash +-----------------------------+---------+ |chunks |entities | +-----------------------------+---------+ |gestational diabetes mellitus|PROBLEM | |type two diabetes mellitus |PROBLEM | |HTG-induced pancreatitis |PROBLEM | |an acute hepatitis |PROBLEM | |obesity |PROBLEM | |a body mass index |TEST | |BMI |TEST | |polyuria |PROBLEM | |polydipsia |PROBLEM | |poor appetite |PROBLEM | |vomiting |PROBLEM | |amoxicillin |TREATMENT| |a respiratory tract infection|PROBLEM | |metformin |TREATMENT| |glipizide |TREATMENT| +-----------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_healthcare_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|513.5 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Japanese Bert Embeddings (Base, Character Tokenization, v2) author: John Snow Labs name: bert_embeddings_bert_base_japanese_char_v2 date: 2022-04-11 tags: [bert, embeddings, ja, open_source] task: Embeddings language: ja edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-japanese-char-v2` is a Japanese model originally trained by `cl-tohoku`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_japanese_char_v2_ja_3.4.2_3.0_1649674322814.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_japanese_char_v2_ja_3.4.2_3.0_1649674322814.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_japanese_char_v2","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["私はSpark NLPを愛しています"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_japanese_char_v2","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("私はSpark NLPを愛しています").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ja.embed.bert_base_japanese_char_v2").predict("""私はSpark NLPを愛しています""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_japanese_char_v2| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Size:|340.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/cl-tohoku/bert-base-japanese-char-v2 - https://github.com/google-research/bert - https://pypi.org/project/unidic-lite/ - https://github.com/cl-tohoku/bert-japanese/tree/v2.0 - https://taku910.github.io/mecab/ - https://github.com/neologd/mecab-ipadic-neologd - https://github.com/polm/fugashi - https://github.com/polm/unidic-lite - https://www.tensorflow.org/tfrc/ - https://creativecommons.org/licenses/by-sa/3.0/ --- layout: model title: Stopwords Remover for Malayalam language (9 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, ml, open_source] task: Stop Words Removal language: ml edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_ml_3.4.1_3.0_1646672962533.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_ml_3.4.1_3.0_1646672962533.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","ml") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["നിങ്ങൾ എന്നെക്കാൾ മികച്ചവനല്ല"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","ml") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("നിങ്ങൾ എന്നെക്കാൾ മികച്ചവനല്ല").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ml.stopwords").predict("""നിങ്ങൾ എന്നെക്കാൾ മികച്ചവനല്ല""") ```
## Results ```bash +---------------------------------+ |result | +---------------------------------+ |[നിങ്ങൾ, എന്നെക്കാൾ, മികച്ചവനല്ല]| +---------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|ml| |Size:|1.4 KB| --- layout: model title: Moldavian, Moldovan, Romanian asr_romanian_wav2vec2 TFWav2Vec2ForCTC from gigant author: John Snow Labs name: asr_romanian_wav2vec2 date: 2022-09-25 tags: [wav2vec2, ro, audio, open_source, asr] task: Automatic Speech Recognition language: ro edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_romanian_wav2vec2` is a Moldavian, Moldovan, Romanian model originally trained by gigant. NOTE: This model only works on a CPU; if you need to run it on a GPU device, please use asr_romanian_wav2vec2_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_romanian_wav2vec2_ro_4.2.0_3.0_1664098158164.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_romanian_wav2vec2_ro_4.2.0_3.0_1664098158164.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_romanian_wav2vec2", "ro")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_romanian_wav2vec2", "ro") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_romanian_wav2vec2| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ro| |Size:|1.2 GB| --- layout: model title: Legal Powers Clause Binary Classifier author: John Snow Labs name: legclf_powers_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True/False) for the `powers` clause type. To use it, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip it unless you want to do Binary Classification at sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings used by this model allow up to 512 tokens; if your input is longer, consider splitting it into smaller pieces (see the same tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. 
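Of the splitting techniques listed above, paragraph splitting by multiline is the simplest to sketch in plain Python (a rough approximation of the workshop notebook's approach; the regex and function name are illustrative):

```python
import re

def split_paragraphs(text: str):
    # Split on one or more blank lines and drop empty fragments.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
```

Each resulting paragraph can then be fed to the classifier as its own `clause_text` row, which also helps keep every input under the 512-token embedding limit.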
## Predicted Entities `other`, `powers` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_powers_clause_en_1.0.0_3.2_1660122856675.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_powers_clause_en_1.0.0_3.2_1660122856675.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_powers_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +--------+ |  result| +--------+ |[powers]| | [other]| | [other]| |[powers]| +--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_powers_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.97 0.99 0.98 307 powers 0.97 0.93 0.95 124 accuracy - - 0.97 431 macro-avg 0.97 0.96 0.96 431 weighted-avg 0.97 0.97 0.97 431 ``` --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_dl16 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-dl16` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dl16_en_4.3.0_3.0_1675118638338.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dl16_en_4.3.0_3.0_1675118638338.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_dl16","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_dl16","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_dl16| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|228.3 MB| ## References - https://huggingface.co/google/t5-efficient-small-dl16 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_nl48 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-nl48` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl48_en_4.3.0_3.0_1675123074393.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl48_en_4.3.0_3.0_1675123074393.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_nl48","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_nl48","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_nl48| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|742.0 MB| ## References - https://huggingface.co/google/t5-efficient-small-nl48 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Resolve Company Names to Tickers using Wikidata author: John Snow Labs name: finel_wiki_parentorgs_ticker date: 2023-01-18 tags: [en, licensed] task: Entity Resolution language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model helps you retrieve the TICKER of a company using a previously detected ORG entity with NER. It also returns the normalized company name as per Wikidata, which can be retrieved from the `aux_label` column in the metadata. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finel_wiki_parentorgs_ticker_en_1.0.0_3.0_1674038769879.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finel_wiki_parentorgs_ticker_en_1.0.0_3.0_1674038769879.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \ .setInputCols("ner_chunk") \ .setOutputCol("sentence_embeddings") resolver = finance.SentenceEntityResolverModel.pretrained("finel_wiki_parentorgs_ticker", "en", "finance/models")\ .setInputCols(["sentence_embeddings"]) \ .setOutputCol("normalized_name")\ .setDistanceFunction("EUCLIDEAN") pipeline = nlp.Pipeline( stages = [ documentAssembler, embeddings, resolver ]) pipelineModel = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) lp = nlp.LightPipeline(pipelineModel) test_pred = lp.fullAnnotate('Alphabet Incorporated') print(test_pred[0]['normalized_name'][0].result) print(test_pred[0]['normalized_name'][0].metadata['all_k_aux_labels'].split(':::')[0]) ```
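In the snippet above, the `all_k_aux_labels` metadata field stores the auxiliary data of the resolution candidates joined with `:::`, best match first. A tiny helper (illustrative only; the field name and separator are taken from the snippet above) makes that parsing explicit:

```python
def top_aux_label(all_k_aux_labels: str) -> str:
    """Return the auxiliary label of the top-ranked resolution candidate.
    Candidates are stored as a ':::'-separated string, best match first."""
    return all_k_aux_labels.split(":::")[0]

# e.g. a hypothetical metadata value for 'Alphabet Incorporated'
print(top_aux_label("Alphabet Inc.:::Alphabet Co.:::Alphabet Holdings"))
```

Pulling the first element is what the example does inline with `.split(':::')[0]`; a named helper just makes the intent readable.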
## Results ```bash GOOGL Aux data: Alphabet Inc. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finel_wiki_parentorgs_ticker| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[original_company_name]| |Language:|en| |Size:|2.8 MB| |Case sensitive:|false| ## References Wikipedia dump about company subsidiaries --- layout: model title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav TFWav2Vec2ForCTC from vai6hav author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav` is an English model originally trained by vai6hav. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav_en_4.2.0_3.0_1664112729220.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav_en_4.2.0_3.0_1664112729220.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_turkish_colab_by_vai6hav| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: BERT Sequence Classification - Identify Trec Data Classes author: John Snow Labs name: bert_sequence_classifier_trec_coarse date: 2021-11-06 tags: [bert_for_sequence_classification, trec, en, open_source] task: Text Classification language: en nav_key: models edition: Spark NLP 3.3.2 spark_version: 2.4 supported: true annotator: BertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is imported from `Hugging Face-models` and it is a simple base BERT model trained on the "trec" dataset. ## Predicted Entities `DESC`, `ENTY`, `HUM`, `NUM`, `ABBR`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_trec_coarse_en_3.3.2_2.4_1636229841055.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_trec_coarse_en_3.3.2_2.4_1636229841055.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = BertForSequenceClassification \ .pretrained('bert_sequence_classifier_trec_coarse', 'en') \ .setInputCols(['token', 'document']) \ .setOutputCol('class') \ .setCaseSensitive(True) \ .setMaxSentenceLength(512) pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier]) example = spark.createDataFrame([['Germany is the largest country in Europe economically.']]).toDF("text") result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_trec_coarse", "en") .setInputCols("document", "token") .setOutputCol("class") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) val example = Seq("Germany is the largest country in Europe economically.").toDS.toDF("text") val result = pipeline.fit(example).transform(example) ```
## Results ```bash ['LOC'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_trec_coarse| |Compatibility:|Spark NLP 3.3.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[label]| |Language:|en| |Case sensitive:|true| ## Data Source [https://huggingface.co/aychang/bert-base-cased-trec-coarse](https://huggingface.co/aychang/bert-base-cased-trec-coarse) ## Benchmarking ```bash epoch: 2.0, eval_loss: 0.138086199760437 eval_runtime: 1.6132, eval_samples_per_second: 309.94 +------------+-------+-----------------+--------------+ | entity|eval_f1| eval_precision| eval_recall| +------------+-------+-----------------+--------------+ | DESC| 0.981| 0.985| 0.978| | ENTY| 0.944| 0.988| 0.904| | ABBR| 1.| 1.| 1.| | HUM| 0.992| 0.984| 1.| | NUM| 0.969| 0.941| 1.| | LOC| 0.981| 0.975| 0.987| +------------+-------+-----------------+--------------+ eval_accuracy: 0.974 ``` --- layout: model title: Detect Adverse Drug Events (MedicalBertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_ade_binary date: 2022-07-27 tags: [clinical, ade, licensed, public_health, token_classification, ner, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect adverse reactions of drugs in texts exchanged over Twitter. This model is trained with the `BertForTokenClassification` method from the transformers library and imported into Spark NLP. 
## Predicted Entities `ADE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ade_binary_en_4.0.0_3.0_1658906410540.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ade_binary_en_4.0.0_3.0_1658906410540.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from pyspark.sql.types import StringType documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained()\ .setInputCols("document")\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_ade_binary", "en", "clinical/models")\ .setInputCols("token", "sentence")\ .setOutputCol("label")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["sentence","token","label"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame(["I used to be on paxil but that made me more depressed and prozac made me angry", "Maybe cos of the insulin blocking effect of seroquel but i do feel sugar crashes when eat fast carbs."], StringType()).toDF("text") result = model.transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_ade_binary", "en", "clinical/models") .setInputCols(Array("token", "sentence")) .setOutputCol("label") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "label")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val data = Seq("I used to be on paxil but that made me more depressed and prozac made me angry", "Maybe cos of the insulin blocking effect of seroquel but i do feel sugar crashes when eat fast carbs.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.bert_token.ner_ade_bert").predict("""Maybe cos of the insulin blocking effect of seroquel but i do feel sugar crashes when eat fast carbs.""") ```
## Results ```bash +-------------+---------+ |chunk |ner_label| +-------------+---------+ |depressed |ADE | |angry |ADE | |sugar crashes|ADE | +-------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_ade_binary| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## Benchmarking ```bash label precision recall f1-score support B-ADE 0.89 0.88 0.88 3720 I-ADE 0.85 0.84 0.84 3145 O 0.98 0.98 0.98 26963 accuracy - - 0.95 33828 macro-avg 0.90 0.90 0.90 33828 weighted-avg 0.95 0.95 0.95 33828 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Vietnamese author: John Snow Labs name: opus_mt_en_vi date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, vi, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. 
- source languages: `en` - target languages: `vi` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_vi_xx_2.7.0_2.4_1609170438037.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_vi_xx_2.7.0_2.4_1609170438037.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_vi", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_vi", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.vi').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_vi| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal NER (Signers) author: John Snow Labs name: legner_signers date: 2022-08-16 tags: [en, legal, ner, agreements, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Legal NER Model, aimed to process the last page of the agreements when information can be found about: - People Signing the document; - Title of those people in their companies; - Company (Party) they represent; ## Predicted Entities `SIGNING_TITLE`, `SIGNING_PERSON`, `PARTY` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/LEGALNER_SIGNERS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_signers_en_1.0.0_3.2_1660646474494.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_signers_en_1.0.0_3.2_1660646474494.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained('legner_signers', 'en', 'legal/models')\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = """ VENDOR: VENDINGDATA CORPORATION, a Nevada corporation By: /s/ Steven J. Blad Its: Steven J. Blad CEO DISTRIBUTOR: TECHNICAL CASINO SUPPLIES LTD, an English company By: /s/ David K. Heap Its: David K. Heap Chief Executive Officer -15-""" res = model.transform(spark.createDataFrame([[text]]).toDF("text")) ```
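The `NerConverter` stage in the pipeline above merges B-/I- tagged tokens into entity chunks. Here is a minimal plain-Python sketch of that IOB-grouping logic (illustrative only, not the actual Spark NLP implementation):

```python
def iob_to_chunks(tokens, labels):
    """Group IOB-tagged tokens into (chunk_text, entity_label) pairs."""
    chunks, current, current_label = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- tag starts a new chunk; flush any chunk in progress.
            if current:
                chunks.append((" ".join(current), current_label))
            current, current_label = [token], label[2:]
        elif label.startswith("I-") and current_label == label[2:]:
            # An I- tag with a matching entity type continues the chunk.
            current.append(token)
        else:
            # An O tag (or a mismatched I- tag) ends the chunk.
            if current:
                chunks.append((" ".join(current), current_label))
            current, current_label = [], None
    if current:
        chunks.append((" ".join(current), current_label))
    return chunks

tokens = ["/s/", "Steven", "J", ".", "Blad", "CEO"]
labels = ["O", "B-SIGNING_PERSON", "I-SIGNING_PERSON", "I-SIGNING_PERSON",
          "I-SIGNING_PERSON", "B-SIGNING_TITLE"]
# → [("Steven J . Blad", "SIGNING_PERSON"), ("CEO", "SIGNING_TITLE")]
```

This is the same aggregation that turns the token-level labels shown in the results below into the chunks exposed in the `ner_chunk` column.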
## Results ```bash +-----------+----------------+ | token| ner_label| +-----------+----------------+ | VENDOR| O| | :| O| |VENDINGDATA| B-PARTY| |CORPORATION| O| | ,| O| | a| O| | Nevada| O| |corporation| O| | By| O| | :| O| | /s/| O| | Steven|B-SIGNING_PERSON| | J|I-SIGNING_PERSON| | .|I-SIGNING_PERSON| | Blad|I-SIGNING_PERSON| | Its| O| | :| O| | Steven|B-SIGNING_PERSON| | J|I-SIGNING_PERSON| | .|I-SIGNING_PERSON| | Blad|I-SIGNING_PERSON| | CEO| B-SIGNING_TITLE| |DISTRIBUTOR| O| | :| O| | TECHNICAL| B-PARTY| | CASINO| I-PARTY| | SUPPLIES| I-PARTY| | LTD| I-PARTY| | ,| O| | an| O| | English| O| | company| O| | By| O| | :| O| | /s/| O| | David|B-SIGNING_PERSON| | K|I-SIGNING_PERSON| | .|I-SIGNING_PERSON| | Heap|I-SIGNING_PERSON| | Its| O| | :| O| | David|B-SIGNING_PERSON| | K|I-SIGNING_PERSON| | .|I-SIGNING_PERSON| | Heap|I-SIGNING_PERSON| | Chief| B-SIGNING_TITLE| | Executive| I-SIGNING_TITLE| | Officer| I-SIGNING_TITLE| | -| O| | 15| O| | -| O| +-----------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_signers| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.5 MB| ## References Manual annotations on CUAD dataset ## Benchmarking ```bash label tp fp fn prec rec f1 I-PARTY 366 26 39 0.93367344 0.9037037 0.91844416 I-SIGNING_TITLE 41 0 4 1.0 0.9111111 0.95348835 I-SIGNING_PERSON 115 10 13 0.92 0.8984375 0.9090909 B-SIGNING_PERSON 46 3 11 0.93877554 0.80701756 0.8679246 B-PARTY 122 14 28 0.89705884 0.81333333 0.85314685 B-SIGNING_TITLE 26 0 2 1.0 0.9285714 0.9629629 Macro-average 716 53 97 0.9482513 0.8770291 0.91125065 Micro-average 716 53 97 0.9310793 0.8806888 0.9051833 ``` --- layout: model title: English BertForTokenClassification Cased model (from Wanjiru) author: John Snow Labs name: bert_token_classifier_autotrain_gro_ner date: 2022-11-30 tags: [en, open_source, bert, 
token_classification, ner, tensorflow] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain_gro_ner` is an English model originally trained by `Wanjiru`. ## Predicted Entities `METRIC`, `ITEM` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autotrain_gro_ner_en_4.2.4_3.0_1669814750536.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autotrain_gro_ner_en_4.2.4_3.0_1669814750536.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autotrain_gro_ner","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autotrain_gro_ner","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_autotrain_gro_ner| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Wanjiru/autotrain_gro_ner --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_large_data_seed_0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-data-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_data_seed_0_en_4.0.0_3.0_1655736575851.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_data_seed_0_en_4.0.0_3.0_1655736575851.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_data_seed_0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_large_data_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.large_seed_0.by_anas-awadalla").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_large_data_seed_0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-large-data-seed-0 --- layout: model title: English asr_wav2vec2_large_xlsr_coraa_portuguese_cv8 TFWav2Vec2ForCTC from lgris author: John Snow Labs name: asr_wav2vec2_large_xlsr_coraa_portuguese_cv8 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_coraa_portuguese_cv8` is an English model originally trained by lgris. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_large_xlsr_coraa_portuguese_cv8_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_coraa_portuguese_cv8_en_4.2.0_3.0_1664043287384.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_coraa_portuguese_cv8_en_4.2.0_3.0_1664043287384.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_coraa_portuguese_cv8", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_coraa_portuguese_cv8", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
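The snippets above assume an `audioDf` DataFrame whose `audio_content` column holds the raw waveform as an array of floats; Wav2Vec2 models expect 16 kHz mono audio. A minimal standard-library sketch of producing such an array — the sine signal is an illustrative stand-in for a real recording, not actual speech:

```python
import math

SAMPLE_RATE = 16_000  # Hz, the rate Wav2Vec2 models are trained on

def dummy_waveform(seconds: float = 1.0, freq: float = 440.0):
    """Generate a mono sine wave as a list of floats in [-1.0, 1.0]."""
    n = int(seconds * SAMPLE_RATE)
    return [math.sin(2 * math.pi * freq * t / SAMPLE_RATE) for t in range(n)]

audio = dummy_waveform(0.5)
# In Spark NLP this list would become one row of audioDf, e.g.:
# audioDf = spark.createDataFrame([(audio,)], ["audio_content"])
```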
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_coraa_portuguese_cv8| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: English RobertaForQuestionAnswering Cased model (from mrm8488) author: John Snow Labs name: roberta_qa_finetuned_squadv1 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-finetuned-squadv1` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_squadv1_en_4.3.0_3.0_1674210647091.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_squadv1_en_4.3.0_3.0_1674210647091.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_squadv1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_squadv1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_finetuned_squadv1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|307.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mrm8488/distilroberta-finetuned-squadv1 --- layout: model title: Oncology Pipeline for Biomarkers author: John Snow Labs name: oncology_biomarker_pipeline date: 2022-12-01 tags: [licensed, pipeline, oncology, biomarker, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline includes Named-Entity Recognition, Assertion Status and Relation Extraction models to extract information from oncology texts. This pipeline focuses on entities related to biomarkers. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/oncology_biomarker_pipeline_en_4.2.2_3.0_1669902355525.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/oncology_biomarker_pipeline_en_4.2.2_3.0_1669902355525.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("oncology_biomarker_pipeline", "en", "clinical/models") pipeline.annotate("Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("oncology_biomarker_pipeline", "en", "clinical/models") val result = pipeline.fullAnnotate("""Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.""")(0) ``` {:.nlu-block} ```python import nlu nlu.load("en.oncology_biomarker.pipeline").predict("""Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.""") ```
## Results ```bash ******************** ner_oncology_wip results ******************** | chunk | ner_label | |:-------------------------------|:-----------------| | negative | Biomarker_Result | | thyroid transcription factor-1 | Biomarker | | napsin | Biomarker | | positive | Biomarker_Result | | ER | Biomarker | | PR | Biomarker | | negative | Biomarker_Result | | HER2 | Oncogene | ******************** ner_oncology_biomarker_wip results ******************** | chunk | ner_label | |:-------------------------------|:-----------------| | negative | Biomarker_Result | | thyroid transcription factor-1 | Biomarker | | napsin A | Biomarker | | positive | Biomarker_Result | | ER | Biomarker | | PR | Biomarker | | negative | Biomarker_Result | | HER2 | Biomarker | ******************** ner_oncology_test_wip results ******************** | chunk | ner_label | |:-------------------------------|:-----------------| | Immunohistochemistry | Pathology_Test | | negative | Biomarker_Result | | thyroid transcription factor-1 | Biomarker | | napsin A | Biomarker | | positive | Biomarker_Result | | ER | Biomarker | | PR | Biomarker | | negative | Biomarker_Result | | HER2 | Oncogene | ******************** ner_biomarker results ******************** | chunk | ner_label | |:-------------------------------|:----------------------| | Immunohistochemistry | Test | | negative | Biomarker_Measurement | | thyroid transcription factor-1 | Biomarker | | napsin A | Biomarker | | positive | Biomarker_Measurement | | ER | Biomarker | | PR | Biomarker | | negative | Biomarker_Measurement | | HER2 | Biomarker | ******************** assertion_oncology_wip results ******************** | chunk | ner_label | assertion | |:-------------------------------|:---------------|:------------| | Immunohistochemistry | Pathology_Test | Past | | thyroid transcription factor-1 | Biomarker | Present | | napsin A | Biomarker | Present | | ER | Biomarker | Present | | PR | Biomarker | Present | | HER2 | Oncogene | 
Present | ******************** assertion_oncology_test_binary_wip results ******************** | chunk | ner_label | assertion | |:-------------------------------|:---------------|:----------------| | Immunohistochemistry | Pathology_Test | Medical_History | | thyroid transcription factor-1 | Biomarker | Medical_History | | napsin A | Biomarker | Medical_History | | ER | Biomarker | Medical_History | | PR | Biomarker | Medical_History | | HER2 | Oncogene | Medical_History | ******************** re_oncology_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:---------------------|:-----------------|:-------------------------------|:-----------------|:--------------| | Immunohistochemistry | Pathology_Test | negative | Biomarker_Result | O | | negative | Biomarker_Result | thyroid transcription factor-1 | Biomarker | is_related_to | | negative | Biomarker_Result | napsin A | Biomarker | is_related_to | | positive | Biomarker_Result | ER | Biomarker | is_related_to | | positive | Biomarker_Result | PR | Biomarker | is_related_to | | positive | Biomarker_Result | HER2 | Oncogene | O | | ER | Biomarker | negative | Biomarker_Result | O | | PR | Biomarker | negative | Biomarker_Result | O | | negative | Biomarker_Result | HER2 | Oncogene | is_related_to | ******************** re_oncology_granular_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:---------------------|:-----------------|:-------------------------------|:-----------------|:--------------| | Immunohistochemistry | Pathology_Test | negative | Biomarker_Result | O | | negative | Biomarker_Result | thyroid transcription factor-1 | Biomarker | is_finding_of | | negative | Biomarker_Result | napsin A | Biomarker | is_finding_of | | positive | Biomarker_Result | ER | Biomarker | is_finding_of | | positive | Biomarker_Result | PR | Biomarker | is_finding_of | | positive | Biomarker_Result | HER2 | Oncogene | is_finding_of | | ER | Biomarker | 
negative | Biomarker_Result | O | | PR | Biomarker | negative | Biomarker_Result | O | | negative | Biomarker_Result | HER2 | Oncogene | is_finding_of | ******************** re_oncology_biomarker_result_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:---------------------|:-----------------|:-------------------------------|:-----------------|:--------------| | Immunohistochemistry | Pathology_Test | negative | Biomarker_Result | is_finding_of | | negative | Biomarker_Result | thyroid transcription factor-1 | Biomarker | is_finding_of | | negative | Biomarker_Result | napsin A | Biomarker | is_finding_of | | positive | Biomarker_Result | ER | Biomarker | is_finding_of | | positive | Biomarker_Result | PR | Biomarker | is_finding_of | | positive | Biomarker_Result | HER2 | Oncogene | O | | ER | Biomarker | negative | Biomarker_Result | O | | PR | Biomarker | negative | Biomarker_Result | O | | negative | Biomarker_Result | HER2 | Oncogene | is_finding_of | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|oncology_biomarker_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ChunkMergeModel - AssertionDLModel - AssertionDLModel - PerceptronModel - DependencyParserModel - RelationExtractionModel - RelationExtractionModel - RelationExtractionModel --- layout: model title: Legal Termination Clause Binary Classifier (CUAD dataset, USE version) author: John Snow Labs name: legclf_cuad_termination_clause date: 2022-11-09 tags: [termination, en, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true 
annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `termination` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. This version was trained with the Universal Sentence Encoder. There is another version using Sentence Bert, called `legclf_sbert_cuad_termination_clause`. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. There are other models with a similar title; the difference is the dataset each was trained on. This one was trained on the `cuad` dataset.
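The paragraph splitting (by multiline) technique mentioned above can be sketched in a few lines of plain Python — an illustrative stand-in for the approaches in the linked tutorial, not the tutorial's own code:

```python
import re

def split_paragraphs(text: str):
    """Split a document on blank lines and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("1. TERMINATION\nThis Agreement may be terminated immediately by Developer...\n"
       "\n"
       "2. NOTICES\nAll notices shall be in writing...")
clauses = split_paragraphs(doc)
```

Each resulting chunk can then be sent to the classifier as one row of the `clause_text` column.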
## Predicted Entities `termination`, `other` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cuad_termination_clause_en_1.0.0_3.0_1667994003805.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cuad_termination_clause_en_1.0.0_3.0_1667994003805.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_cuad_termination_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([[" ---------------------\n\n This Agreement may be terminated immediately by Developer..."]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[termination]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_cuad_termination_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.5 MB| ## References In-house annotations on CUAD dataset ## Benchmarking ```bash label precision recall f1-score support other 1.00 0.97 0.99 35 termination 0.98 1.00 0.99 44 accuracy - - 0.99 79 macro-avg 0.99 0.99 0.99 79 weighted-avg 0.99 0.99 0.99 79 ``` --- layout: model title: Legal Miscellaneous provisions Clause Binary Classifier (md) author: John Snow Labs name: legclf_miscellaneous_provisions_md date: 2023-01-11 tags: [en, legal, classification, document, agreement, contract, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `miscellaneous-provisions` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `miscellaneous-provisions` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_miscellaneous_provisions_md_en_1.0.0_3.0_1673460283239.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_miscellaneous_provisions_md_en_1.0.0_3.0_1673460283239.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_miscellaneous_provisions_md", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
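As the description notes, several of these binary classifiers can be combined so that each contributes one True/False flag per clause type. A minimal pure-Python sketch of aggregating such outputs — the model names and predicted labels are hypothetical examples, not real pipeline output:

```python
def clause_flags(predictions: dict) -> dict:
    """Map each classifier's predicted label to a boolean flag.

    A prediction counts as True when the classifier returned its
    positive clause label rather than the 'other' class."""
    return {name: label != "other" for name, label in predictions.items()}

flags = clause_flags({
    "legclf_miscellaneous_provisions_md": "miscellaneous-provisions",
    "legclf_cuad_termination_clause": "other",
})
```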
## Results ```bash +-------+ | result| +-------+ |[miscellaneous-provisions]| |[other]| |[other]| |[miscellaneous-provisions]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_miscellaneous_provisions_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash precision recall f1-score support fees-and-expenses 0.92 0.89 0.91 27 other 0.93 0.95 0.94 39 accuracy 0.92 66 macro avg 0.92 0.92 0.92 66 weighted avg 0.92 0.92 0.92 66 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from georgio) author: John Snow Labs name: distilbert_qa_georgio_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `georgio`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_georgio_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770783179.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_georgio_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770783179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_georgio_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_georgio_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_georgio_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/georgio/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal Coal And Mining Industries Document Classifier (EURLEX) author: John Snow Labs name: legclf_coal_and_mining_industries_bert date: 2023-03-06 tags: [en, legal, classification, clauses, coal_and_mining_industries, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the `legclf_coal_and_mining_industries_bert` model, a Bert Sentence Embeddings Document Classifier, classifies whether or not the document belongs to the class Coal_and_Mining_Industries (Binary Classification) according to EuroVoc labels. ## Predicted Entities `Coal_and_Mining_Industries`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_coal_and_mining_industries_bert_en_1.0.0_3.0_1678111573002.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_coal_and_mining_industries_bert_en_1.0.0_3.0_1678111573002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_coal_and_mining_industries_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Coal_and_Mining_Industries]| |[Other]| |[Other]| |[Coal_and_Mining_Industries]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_coal_and_mining_industries_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.4 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Coal_and_Mining_Industries 0.91 0.88 0.90 34 Other 0.83 0.86 0.84 22 accuracy - - 0.88 56 macro-avg 0.87 0.87 0.87 56 weighted-avg 0.88 0.88 0.88 56 ``` --- layout: model title: Mapping Company Names to Edgar Database author: John Snow Labs name: legmapper_edgar_companyname date: 2022-08-18 tags: [en, legal, companies, edgar, data, augmentation, licensed] task: Chunk Mapping language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true recommended: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Chunk Mapper model allows you to, given a detected Organization with any NER model, augment it with information available in the SEC Edgar database. Some of the fields included in this Chunk Mapper are: - IRS number - Sector - Former names - Address, Phone, State - Dates when the company submitted filings - etc. IMPORTANT: Chunk Mappers work with exact matches, so before using Chunk Mapping, you need to carry out Company Name Normalization to get how the company name is stored in Edgar. To do this, use Entity Linking, more specifically the `finel_edgar_companynames` model, with the Organization Name extracted by any NER model. You will get the normalized version (by Edgar standards) of the name, which you can send to this model for data augmentation. 
## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FIN_LEG_COMPANY_AUGMENTATION/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legmapper_edgar_companyname_en_1.0.0_3.2_1660817357103.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legmapper_edgar_companyname_en_1.0.0_3.2_1660817357103.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")\ .setInputCols(["document", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") # Optional: To normalize the ORG name using NASDAQ data before the mapping ########################################################################## chunkToDoc = nlp.Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") chunk_embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \ .setInputCols("ner_chunk_doc") \ .setOutputCol("sentence_embeddings") use_er_model = legal.SentenceEntityResolverModel.pretrained("legel_edgar_company_name", "en", "legal/models") \ .setInputCols(["ner_chunk_doc", "sentence_embeddings"]) \ .setOutputCol("normalized")\ .setDistanceFunction("EUCLIDEAN") ########################################################################## cm = legal.ChunkMapperModel()\ .pretrained("legmapper_edgar_companyname", "en", "legal/models")\ .setInputCols(["normalized"])\ .setOutputCol("mappings") # or ner_chunk for non normalized versions nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, ner_model, ner_converter, chunkToDoc, chunk_embeddings, use_er_model, cm ]) text = """NIKE Inc is an American multinational corporation that is engaged in the design, development, manufacturing, and worldwide marketing and sales of footwear, apparel, equipment, accessories, and services""" test_data = spark.createDataFrame([[text]]).toDF("text") model = 
nlpPipeline.fit(test_data) lp = nlp.LightPipeline(model) lp.annotate(text) ```
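The `mappings` annotations this pipeline produces (see the Results section) are lists whose fourth element is the matched value and whose fifth is a metadata dictionary. A minimal pure-Python sketch of flattening them into a `relation → value` lookup — the helper name is hypothetical and simply assumes the structure shown in the printed output:

```python
def mappings_to_dict(mappings: list) -> dict:
    """Collect each mapping's relation and matched value into a flat dict."""
    out = {}
    for entry in mappings:
        # entry: [annotation_type, begin, end, value, metadata]
        value, meta = entry[3], entry[4]
        out[meta["relation"]] = value
    return out

sample = [
    ["labeled_dependency", 0, 22, "6798",
     {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC",
      "relation": "sic_code", "all_relations": ""}],
    ["labeled_dependency", 0, 22, "GA",
     {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC",
      "relation": "state_location", "all_relations": ""}],
]
info = mappings_to_dict(sample)
```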
## Results ```bash {"mappings": [["labeled_dependency", 0, 22, "Jamestown Invest 1, LLC", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "name", "all_relations": ""}], ["labeled_dependency", 0, 22, "REAL ESTATE INVESTMENT TRUSTS [6798]", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "sic", "all_relations": ""}], ["labeled_dependency", 0, 22, "6798", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "sic_code", "all_relations": ""}], ["labeled_dependency", 0, 22, "831529368", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "irs_number", "all_relations": ""}], ["labeled_dependency", 0, 22, "1231", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "fiscal_year_end", "all_relations": ""}], ["labeled_dependency", 0, 22, "GA", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "state_location", "all_relations": ""}], ["labeled_dependency", 0, 22, "DE", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "state_incorporation", "all_relations": ""}], ["labeled_dependency", 0, 22, "PONCE CITY MARKET", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "business_street", "all_relations": ""}], ["labeled_dependency", 0, 22, "ATLANTA", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "business_city", "all_relations": ""}], ["labeled_dependency", 0, 22, "GA", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "business_state", "all_relations": ""}], ["labeled_dependency", 0, 22, "30308", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "business_zip", "all_relations": ""}], ["labeled_dependency", 0, 22, "7708051000", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "business_phone", "all_relations": ""}], 
["labeled_dependency", 0, 22, "Jamestown Atlanta Invest 1, LLC", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "former_name", "all_relations": ""}], ["labeled_dependency", 0, 22, "20180824", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "former_name_date", "all_relations": ""}], ["labeled_dependency", 0, 22, "2019-11-21", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "date", "all_relations": "2019-10-24:::2019-11-25:::2019-11-12:::2022-01-13:::2022-03-31:::2022-04-11:::2022-07-12:::2022-06-30:::2021-01-14:::2021-04-06:::2021-03-31:::2021-04-28:::2021-06-30:::2021-09-10:::2021-09-22:::2021-09-30:::2021-10-08:::2020-03-16:::2021-12-30:::2020-04-06:::2020-04-29:::2020-06-12:::2020-07-20:::2020-07-07:::2020-07-28:::2020-07-31:::2020-09-09:::2020-09-25:::2020-10-08:::2020-11-12"}], ["labeled_dependency", 0, 22, "1751158", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "company_id", "all_relations": ""}]]} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legmapper_edgar_companyname| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|11.0 MB| ## References Manually scrapped Edgar Database --- layout: model title: XLNet Base CoNLL-03 NER Pipeline author: ahmedlone127 name: xlnet_base_token_classifier_conll03_pipeline date: 2022-06-14 tags: [ner, english, xlnet, base, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: false annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of 
[xlnet_base_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/28/xlnet_base_token_classifier_conll03_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/xlnet_base_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655215928746.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/xlnet_base_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655215928746.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("xlnet_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("xlnet_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ```
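The `NerConverter` stage in this pipeline merges token-level IOB tags into the entity chunks shown under Results. Conceptually, the merging works like the following plain-Python sketch (an illustration only, not Spark NLP's actual implementation):

```python
def iob_to_chunks(tokens, labels):
    """Group IOB-tagged tokens into (chunk_text, entity_label) pairs."""
    chunks, current, cur_label = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            # a new chunk begins; flush any chunk in progress
            if current:
                chunks.append((" ".join(current), cur_label))
            current, cur_label = [tok], lab[2:]
        elif lab.startswith("I-") and current and lab[2:] == cur_label:
            # continuation of the current chunk
            current.append(tok)
        else:
            # "O" tag (or inconsistent I- tag) ends the chunk
            if current:
                chunks.append((" ".join(current), cur_label))
            current, cur_label = [], None
    if current:
        chunks.append((" ".join(current), cur_label))
    return chunks
```

For the example sentence above, tags like `B-PER` on "John" and `B-ORG I-ORG I-ORG` on "John Snow Labs" collapse into exactly the two chunks shown in the Results table.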
## Results

```bash
+--------------+---------+
|chunk         |ner_label|
+--------------+---------+
|John          |PER      |
|John Snow Labs|ORG      |
+--------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlnet_base_token_classifier_conll03_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Community|
|Language:|en|
|Size:|438.5 MB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- XlnetForTokenClassification
- NerConverter
- Finisher

---
layout: model
title: Pipeline to Detect posology entities (large-biobert)
author: John Snow Labs
name: ner_posology_large_biobert_pipeline
date: 2023-03-20
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [ner_posology_large_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_posology_large_biobert_en.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_large_biobert_pipeline_en_4.3.0_3.2_1679315056396.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_large_biobert_pipeline_en_4.3.0_3.2_1679315056396.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_posology_large_biobert_pipeline", "en", "clinical/models") text = '''The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_posology_large_biobert_pipeline", "en", "clinical/models") val text = "The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology_biobert_large.pipeline").predict("""The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.""") ```
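Downstream of this pipeline, STRENGTH chunks such as `10 mg` or `100mg/1ml` often need to be normalized into numeric values and units. A small sketch of such a step (a hypothetical post-processing helper, not part of the pipeline itself), using only the standard library:

```python
import re

def parse_strength(chunk):
    """Split a STRENGTH chunk like '10 mg' or '100mg/1ml'
    into a list of (value, unit) pairs."""
    return [(float(v), u)
            for v, u in re.findall(r"(\d+(?:\.\d+)?)\s*([a-zA-Z]+)", chunk)]
```

For example, `parse_strength("100mg/1ml")` yields both the drug amount and the volume it is dissolved in.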
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:--------------------|--------:|------:|:------------|-------------:| | 0 | 1 | 27 | 27 | DOSAGE | 0.9998 | | 1 | capsule | 29 | 35 | FORM | 0.9978 | | 2 | Advil | 40 | 44 | DRUG | 0.9992 | | 3 | 10 mg | 46 | 50 | STRENGTH | 0.8269 | | 4 | for 5 days | 52 | 61 | DURATION | 0.978333 | | 5 | magnesium hydroxide | 67 | 85 | DRUG | 0.9783 | | 6 | 100mg/1ml | 87 | 95 | STRENGTH | 0.9336 | | 7 | suspension | 97 | 106 | FORM | 0.9999 | | 8 | PO | 108 | 109 | ROUTE | 0.9871 | | 9 | 40 units | 179 | 186 | DOSAGE | 0.6543 | | 10 | insulin glargine | 191 | 206 | DRUG | 0.97145 | | 11 | at night | 208 | 215 | FREQUENCY | 0.83505 | | 12 | 12 units | 218 | 225 | DOSAGE | 0.69795 | | 13 | insulin lispro | 230 | 243 | DRUG | 0.89265 | | 14 | with meals | 245 | 254 | FREQUENCY | 0.8772 | | 15 | metformin | 261 | 269 | DRUG | 1 | | 16 | 1000 mg | 271 | 277 | STRENGTH | 0.69955 | | 17 | two times a day | 279 | 293 | FREQUENCY | 0.758125 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_large_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.1 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Legal Payments Clause Binary Classifier author: John Snow Labs name: legclf_payments_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `payments` clause type. To use this model, make sure you provide enough context as an input. 
Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level.

If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).

This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `payments`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_payments_clause_en_1.0.0_3.2_1660122833664.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_payments_clause_en_1.0.0_3.2_1660122833664.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_payments_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+----------+
|result    |
+----------+
|[payments]|
|[other]   |
|[other]   |
|[payments]|
+----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_payments_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
       other       0.94    0.97      0.95      284
    payments       0.91    0.84      0.87      105
    accuracy          -       -      0.93      389
   macro avg       0.92    0.90      0.91      389
weighted avg       0.93    0.93      0.93      389
```

---
layout: model
title: English XlmRoBertaForQuestionAnswering (from jakobwes)
author: John Snow Labs
name: xlm_roberta_qa_xlm_roberta_squad_v1.1
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm_roberta_squad_v1.1` is an English model originally trained by `jakobwes`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_squad_v1.1_en_4.0.0_3.0_1655997017397.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_squad_v1.1_en_4.0.0_3.0_1655997017397.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_squad_v1.1","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```

```scala
val document = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = XlmRoBertaForQuestionAnswering
    .pretrained("xlm_roberta_qa_xlm_roberta_squad_v1.1","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.xlm_roberta").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
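Under the hood, extractive QA models of this kind score every token position as a potential answer start and answer end, and the answer is the span with the highest combined score. A conceptual sketch of that span selection (for illustration only, not Spark NLP's internal code):

```python
def best_span(start_scores, end_scores, max_len=30):
    """Pick (start, end) token indices maximizing
    start_scores[s] + end_scores[e], with s <= e < s + max_len."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            if s_score + end_scores[e] > best_score:
                best, best_score = (s, e), s_score + end_scores[e]
    return best
```

The constraint `s <= e` rules out inverted spans, and `max_len` keeps answers short, which is why these models return a compact span like "Clara" rather than the whole context.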
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_roberta_squad_v1.1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|819.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/jakobwes/xlm_roberta_squad_v1.1

---
layout: model
title: Detect PHI for Deidentification (ner_deidentification_dl)
author: John Snow Labs
name: ner_deidentify_dl
date: 2021-03-31
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The Named Entity Recognition annotator (NERDLModel) allows a generic model to be trained with a deep learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, "Named Entity Recognition with Bidirectional LSTM-CNNs". Deidentification NER (DL) is a Named Entity Recognition model that annotates text to find protected health information that may need to be deidentified.
## Predicted Entities

`Age`, `BIOID`, `City`, `Country`, `Date`, `Device`, `Doctor`, `EMail`, `Fax`, `Healthplan`, `Hospital`, `Idnum`, `Location-Other`, `Medicalrecord`, `Organization`, `Patient`, `Phone`, `Profession`, `State`, `Street`, `URL`, `Username`, `Zip`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deidentify_dl_en_3.0.0_3.0_1617209710705.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deidentify_dl_en_3.0.0_3.0_1617209710705.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_deidentify_dl","en","clinical/models") \ .setInputCols("sentence","token","embeddings") \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 7194334 Date : 01/13/93 PCP : Oliveira , 25 month years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 
0295 Keats Street"]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_deidentify_dl","en","clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("""A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 7194334 Date : 01/13/93 PCP : Oliveira , 25 month years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.deid").predict("""A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 7194334 Date : 01/13/93 PCP : Oliveira , 25 month years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street""") ```
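Once PHI chunks are extracted with their begin/end offsets (the `ner_chunk` column above), a basic de-identification step can replace each chunk with its label. A minimal sketch, independent of Spark NLP (the licensed `DeIdentification` annotator handles this, and much more, out of the box):

```python
def mask_phi(text, chunks):
    """Replace each (begin, end, label) chunk with '<LABEL>'.
    Offsets are inclusive, as in Spark NLP annotations.
    Chunks are applied right-to-left so earlier offsets stay valid."""
    for begin, end, label in sorted(chunks, key=lambda c: c[0], reverse=True):
        text = text[:begin] + "<" + label + ">" + text[end + 1:]
    return text
```

For example, masking the DATE and DOCTOR chunks in "Record date : 2093-01-13 , David Hale , M.D ." yields "Record date : &lt;DATE&gt; , &lt;DOCTOR&gt; , M.D .".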
## Results ```bash +---------------+-----+ |ner_label |count| +---------------+-----+ |O |28 | |I-HOSPITAL |4 | |B-DATE |3 | |I-STREET |3 | |I-PATIENT |2 | |B-DOCTOR |2 | |B-AGE |1 | |B-PATIENT |1 | |I-DOCTOR |1 | |B-MEDICALRECORD|1 | +---------------+-----+. +-----------------------------+-------------+ |chunk |ner_label | +-----------------------------+-------------+ |2093-01-13 |DATE | |David Hale |DOCTOR | |Hendrickson , Ora |PATIENT | |7194334 |MEDICALRECORD| |01/13/93 |DATE | |Oliveira |DOCTOR | |25 |AGE | |2079-11-09 |DATE | |Cocke County Baptist Hospital|HOSPITAL | |0295 Keats Street |STREET | +-----------------------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deidentify_dl| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on JSL enriched n2c2 2014: De-identification and Heart Disease Risk Factors Challenge datasets with embeddings_clinical https://portal.dbmi.hms.harvard.edu/projects/n2c2-2014/ ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|:-----------------|------:|-----:|-----:|---------:|---------:|---------:| | 1 | I-AGE | 7 | 3 | 6 | 0.7 | 0.538462 | 0.608696 | | 2 | I-DOCTOR | 800 | 27 | 94 | 0.967352 | 0.894855 | 0.929692 | | 3 | I-IDNUM | 6 | 0 | 2 | 1 | 0.75 | 0.857143 | | 4 | B-DATE | 1883 | 34 | 56 | 0.982264 | 0.971119 | 0.97666 | | 5 | I-DATE | 425 | 28 | 25 | 0.93819 | 0.944444 | 0.941307 | | 6 | B-PHONE | 29 | 7 | 9 | 0.805556 | 0.763158 | 0.783784 | | 7 | B-STATE | 87 | 4 | 11 | 0.956044 | 0.887755 | 0.920635 | | 8 | B-CITY | 35 | 11 | 26 | 0.76087 | 0.57377 | 0.654206 | | 9 | I-ORGANIZATION | 12 | 4 | 15 | 0.75 | 0.444444 | 0.55814 | | 10 | B-DOCTOR | 728 | 75 | 53 | 0.9066 | 0.932138 | 0.919192 | | 11 | I-PROFESSION | 43 | 11 | 13 | 0.796296 | 0.767857 | 0.781818 | | 12 | I-PHONE | 62 | 4 | 4 | 0.939394 | 0.939394 
| 0.939394 | | 13 | B-AGE | 234 | 13 | 16 | 0.947368 | 0.936 | 0.94165 | | 14 | B-STREET | 20 | 7 | 16 | 0.740741 | 0.555556 | 0.634921 | | 15 | I-ZIP | 60 | 3 | 2 | 0.952381 | 0.967742 | 0.96 | | 16 | I-MEDICALRECORD | 54 | 5 | 2 | 0.915254 | 0.964286 | 0.93913 | | 17 | B-ZIP | 2 | 1 | 0 | 0.666667 | 1 | 0.8 | | 18 | B-HOSPITAL | 256 | 23 | 66 | 0.917563 | 0.795031 | 0.851913 | | 19 | I-STREET | 150 | 17 | 20 | 0.898204 | 0.882353 | 0.890208 | | 20 | B-COUNTRY | 22 | 2 | 8 | 0.916667 | 0.733333 | 0.814815 | | 21 | I-COUNTRY | 1 | 0 | 0 | 1 | 1 | 1 | | 22 | I-STATE | 6 | 0 | 1 | 1 | 0.857143 | 0.923077 | | 23 | B-USERNAME | 30 | 0 | 4 | 1 | 0.882353 | 0.9375 | | 24 | I-HOSPITAL | 295 | 37 | 64 | 0.888554 | 0.821727 | 0.853835 | | 25 | I-PATIENT | 243 | 26 | 41 | 0.903346 | 0.855634 | 0.878843 | | 26 | B-PROFESSION | 52 | 8 | 17 | 0.866667 | 0.753623 | 0.806202 | | 27 | B-IDNUM | 32 | 3 | 12 | 0.914286 | 0.727273 | 0.810127 | | 28 | I-CITY | 76 | 15 | 13 | 0.835165 | 0.853933 | 0.844444 | | 29 | B-PATIENT | 337 | 29 | 40 | 0.920765 | 0.893899 | 0.907133 | | 30 | B-MEDICALRECORD | 74 | 6 | 4 | 0.925 | 0.948718 | 0.936709 | | 31 | B-ORGANIZATION | 20 | 5 | 13 | 0.8 | 0.606061 | 0.689655 | | 32 | Macro-average | 6083 | 408 | 673 | 0.7976 | 0.697533 | 0.744218 | | 33 | Micro-average | 6083 | 408 | 673 | 0.937144 | 0.900385 | 0.918397 | ``` --- layout: model title: Google's Tapas Table Understanding (Small, WIKISQL) author: John Snow Labs name: table_qa_tapas_small_finetuned_wikisql_supervised date: 2022-09-30 tags: [en, table, qa, question, answering, open_source] task: Table Question Answering language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: TapasForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Zero-shot Table Understanding Model which allows you to carry out Question Answering on Spark Dataframes. 
If your table is stored in a file format such as CSV, load it into a Spark DataFrame before using this model.

Size of this model: Small

Has aggregation operations: Yes

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_small_finetuned_wikisql_supervised_en_4.2.0_3.0_1664530735482.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_small_finetuned_wikisql_supervised_en_4.2.0_3.0_1664530735482.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python json_data = """ { "header": ["name", "money", "age"], "rows": [ ["Donald Trump", "$100,000,000", "75"], ["Elon Musk", "$20,000,000,000,000", "55"] ] } """ queries = [ "Who earns less than 200,000,000?", "Who earns 100,000,000?", "How much money has Donald Trump?", "How old are they?", ] data = spark.createDataFrame([ [json_data, " ".join(queries)] ]).toDF("table_json", "questions") document_assembler = MultiDocumentAssembler() \ .setInputCols("table_json", "questions") \ .setOutputCols("document_table", "document_questions") sentence_detector = SentenceDetector() \ .setInputCols(["document_questions"]) \ .setOutputCol("questions") table_assembler = TableAssembler()\ .setInputCols(["document_table"])\ .setOutputCol("table") tapas = TapasForQuestionAnswering\ .pretrained("table_qa_tapas_small_finetuned_wikisql_supervised","en")\ .setInputCols(["questions", "table"])\ .setOutputCol("answers") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, table_assembler, tapas ]) model = pipeline.fit(data) model\ .transform(data)\ .selectExpr("explode(answers) AS answer")\ .select("answer")\ .show(truncate=False) ```
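The `table_json` payload in the example above is plain JSON with `header` and `rows` keys. If your data starts out as CSV, as the description suggests, a small helper (a hypothetical sketch using only the standard library) can build that payload:

```python
import csv
import io
import json

def csv_to_table_json(csv_text):
    """Convert CSV text (first row = header) into the JSON structure
    expected by TableAssembler: {"header": [...], "rows": [[...], ...]}."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return json.dumps({"header": rows[0], "rows": rows[1:]})
```

Quoting matters here: values containing commas, such as `"$100,000,000"`, must be quoted in the CSV so they survive as single cells.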
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |answer | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} | |{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|table_qa_tapas_small_finetuned_wikisql_supervised| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|110.1 MB| |Case sensitive:|false| ## References https://www.microsoft.com/en-us/download/details.aspx?id=54253 https://github.com/ppasupat/WikiTableQuestions https://github.com/salesforce/WikiSQL --- layout: model title: Legal Indemnity Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_indemnity_bert date: 2023-03-05 tags: [en, legal, classification, clauses, indemnity, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover 
use_language_switcher: "Python-Scala-Java"
---

## Description

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

This model is a Binary Classifier (True, False) for the `Indemnity` clause type. To use this model, make sure you provide enough context as an input.

Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level.

If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).

This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`Indemnity`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_indemnity_bert_en_1.0.0_3.0_1678049927751.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_indemnity_bert_en_1.0.0_3.0_1678049927751.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_indemnity_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-----------+
|result     |
+-----------+
|[Indemnity]|
|[Other]    |
|[Other]    |
|[Indemnity]|
+-----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_indemnity_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.4 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
       label  precision  recall  f1-score  support
   Indemnity       0.92    0.86      0.89       28
       Other       0.91    0.95      0.93       43
    accuracy          -       -      0.92       71
   macro-avg       0.92    0.91      0.91       71
weighted-avg       0.92    0.92      0.91       71
```

---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab3_by_sherry7144 TFWav2Vec2ForCTC from sherry7144
author: John Snow Labs
name: asr_wav2vec2_base_timit_demo_colab3_by_sherry7144
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab3_by_sherry7144` is an English model originally trained by sherry7144.

NOTE: This model only works on a CPU. If you need to run it on a GPU device, please use asr_wav2vec2_base_timit_demo_colab3_by_sherry7144_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab3_by_sherry7144_en_4.2.0_3.0_1664041663422.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab3_by_sherry7144_en_4.2.0_3.0_1664041663422.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab3_by_sherry7144", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab3_by_sherry7144", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
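The examples above assume an `audioDf` DataFrame whose `audio_content` column holds arrays of floats. As a sketch of how to build it (assumptions: mono 16-bit PCM WAV input, and a column name matching the AudioAssembler input), raw audio can be converted with the standard library alone:

```python
import struct
import wave

def wav_to_floats(path):
    """Read a mono 16-bit PCM WAV file and return its samples
    as floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        assert wf.getnchannels() == 1 and wf.getsampwidth() == 2, "expects mono 16-bit PCM"
        raw = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [s / 32768.0 for s in samples]

# Hypothetical usage, assuming an active Spark session:
# audioDf = spark.createDataFrame([[wav_to_floats("sample.wav")]], ["audio_content"])
```

Libraries such as librosa can do the same (plus resampling to the model's expected sample rate); the stdlib version is shown only to make the expected data shape explicit.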
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_demo_colab3_by_sherry7144|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|355.0 MB|

---
layout: model
title: Legal Marketing Clause Binary Classifier
author: John Snow Labs
name: legclf_marketing_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `marketing` clause type. To use this model, make sure you provide enough context as an input.

Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level.

If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).

This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities `other`, `marketing` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_marketing_clause_en_1.0.0_3.2_1660122639196.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_marketing_clause_en_1.0.0_3.2_1660122639196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_marketing_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-----------+ | result| +-----------+ |[marketing]| |[other]| |[other]| |[marketing]| +-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_marketing_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.5 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support marketing 0.50 0.50 0.50 2 other 0.93 0.93 0.93 15 accuracy - - 0.88 17 macro-avg 0.72 0.72 0.72 17 weighted-avg 0.88 0.88 0.88 17 ``` --- layout: model title: Translate English to Ruund Pipeline author: John Snow Labs name: translate_en_rnd date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, rnd, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial ones. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `rnd` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_rnd_xx_2.7.0_2.4_1609688021276.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_rnd_xx_2.7.0_2.4_1609688021276.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_rnd", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_rnd", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.rnd').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_rnd| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English image_classifier_vit_electric_pole_type_classification ViTForImageClassification from smc author: John Snow Labs name: image_classifier_vit_electric_pole_type_classification date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_electric_pole_type_classification` is an English model originally trained by smc. ## Predicted Entities `R300`, `Portico`, `R200`, `Tripode`, `H`, `Poste` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_electric_pole_type_classification_en_4.1.0_3.0_1660169367781.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_electric_pole_type_classification_en_4.1.0_3.0_1660169367781.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_electric_pole_type_classification", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_electric_pole_type_classification", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_electric_pole_type_classification| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Stopwords Remover for Japanese language (153 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, ja, open_source] task: Stop Words Removal language: ja edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_ja_3.4.1_3.0_1646672369133.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_ja_3.4.1_3.0_1646672369133.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","ja") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["あなたは私よりも優れていません"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","ja") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("あなたは私よりも優れていません").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ja.stopwords").predict("""Put your text here.""") ```
## Results ```bash +--------------------------------+ |result | +--------------------------------+ |[あなたは私よりも優れていません]| +--------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|ja| |Size:|1.8 KB| --- layout: model title: Clinical Deidentification Pipeline (English) author: John Snow Labs name: clinical_deidentification date: 2022-09-14 tags: [deidendification, deid, en, licensed, clinical, pipeline] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. 
The pipeline can mask and obfuscate `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`, `CITY`, `COUNTRY`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `STREET`, `USERNAME`, `ZIP`, `ACCOUNT`, `LICENSE`, `VIN`, `SSN`, `DLN`, `PLATE`, `IPADDR`, `EMAIL` entities. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_en_4.1.0_3.2_1663183687291.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_en_4.1.0_3.2_1663183687291.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "en", "clinical/models") sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""" result = deid_pipeline.annotate(sample) print("\n".join(result['masked'])) print("\n".join(result['masked_with_chars'])) print("\n".join(result['masked_fixed_length_chars'])) print("\n".join(result['obfuscated'])) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = new PretrainedPipeline("clinical_deidentification","en","clinical/models") val sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""" val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("en.de_identify.clinical_pipeline").predict("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""") ```
## Results ```bash Masked with entity labels ------------------------------ Name : , Record date: , # . Dr. , ID, IP . He is a male was admitted to the for cystectomy on . Patient's VIN : , SSN , Driver's license . Phone , , , E-MAIL: . Masked with chars ------------------------------ Name : [**************], Record date: [********], # [****]. Dr. [********], ID[**********], IP [************]. He is a [*********] male was admitted to the [**********] for cystectomy on [******]. Patient's VIN : [***************], SSN [**********], Driver's license [*********]. Phone [************], [***************], [***********], E-MAIL: [*************]. Masked with fixed length chars ------------------------------ Name : ****, Record date: ****, # ****. Dr. ****, ID****, IP ****. He is a **** male was admitted to the **** for cystectomy on ****. Patient's VIN : ****, SSN ****, Driver's license ****. Phone ****, ****, ****, E-MAIL: ****. Obfuscated ------------------------------ Name : Craige Perks, Record date: 2093-02-06, # R2593192. Dr. Dr Felice Lacer, IDXO:4884578, IP 444.444.444.444. He is a 75 male was admitted to the MADISON VALLEY MEDICAL CENTER for cystectomy on 07-01-1972. Patient's VIN : 2BBBB11BBBB222999, SSN SSN-814-86-1962, Driver's license P055567317431. Phone 0381-6762484, Budaörsi út 14., New brunswick, E-MAIL: Reba@google.com. 
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - TextMatcherModel - ContextualParserModel - RegexMatcherModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ChunkMergeModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: English BertForMaskedLM Base Cased model (from model-attribution-challenge) author: John Snow Labs name: bert_embeddings_model_attribution_challenge_base_cased date: 2022-12-02 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased` is an English model originally trained by `model-attribution-challenge`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_model_attribution_challenge_base_cased_en_4.2.4_3.0_1670016326907.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_model_attribution_challenge_base_cased_en_4.2.4_3.0_1670016326907.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_model_attribution_challenge_base_cased","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_model_attribution_challenge_base_cased","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_model_attribution_challenge_base_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|406.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/model-attribution-challenge/bert-base-cased - https://arxiv.org/abs/1810.04805 - https://github.com/google-research/bert - https://yknzhu.wixsite.com/mbweb - https://en.wikipedia.org/wiki/English_Wikipedia --- layout: model title: Translate English to Nyaneka Pipeline author: John Snow Labs name: translate_en_nyk date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, nyk, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial ones. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `nyk` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_nyk_xx_2.7.0_2.4_1609689594225.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_nyk_xx_2.7.0_2.4_1609689594225.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_nyk", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_nyk", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.nyk').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_nyk| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Legal Proceedings Clause Binary Classifier author: John Snow Labs name: legclf_legal_proceedings_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `legal-proceedings` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it is better to skip it unless you want to do Binary Classification at sentence level. If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in the Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. 
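The combination described above — many binary classifiers, each contributing one True/False signal per clause type — can be illustrated with a plain-Python sketch. The classifier functions below are simple keyword stand-ins for illustration only, not the actual Spark NLP models:

```python
def classify_marketing(text):
    # Stand-in for a binary marketing-clause classifier
    return "marketing" in text.lower()

def classify_legal_proceedings(text):
    # Stand-in for a binary legal-proceedings-clause classifier
    return "proceeding" in text.lower()

CLASSIFIERS = {
    "marketing": classify_marketing,
    "legal-proceedings": classify_legal_proceedings,
}

def classify_clause(text):
    # One True/False value per clause model, mirroring the combined-pipeline output
    return {name: fn(text) for name, fn in CLASSIFIERS.items()}

result = classify_clause("Any legal proceeding shall be brought in Delaware.")
# → {"marketing": False, "legal-proceedings": True}
```

In a real pipeline, each stand-in would be replaced by a `ClassifierDLModel` stage sharing the same sentence embeddings.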
## Predicted Entities `other`, `legal-proceedings` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_legal_proceedings_clause_en_1.0.0_3.2_1660122602093.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_legal_proceedings_clause_en_1.0.0_3.2_1660122602093.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_legal_proceedings_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------------------+ | result| +-------------------+ |[legal-proceedings]| |[other]| |[other]| |[legal-proceedings]| +-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_legal_proceedings_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support legal-proceedings 0.96 0.79 0.87 29 other 0.95 0.99 0.97 106 accuracy - - 0.95 135 macro-avg 0.95 0.89 0.92 135 weighted-avg 0.95 0.95 0.95 135 ``` --- layout: model title: English asr_wav2vec2_large_robust_LS960 TFWav2Vec2ForCTC from leonardvorbeck author: John Snow Labs name: asr_wav2vec2_large_robust_LS960 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_robust_LS960` is an English model originally trained by leonardvorbeck. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_robust_LS960_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_robust_LS960_en_4.2.0_3.0_1664023686921.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_robust_LS960_en_4.2.0_3.0_1664023686921.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_robust_LS960", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_robust_LS960", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_robust_LS960| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|757.6 MB| --- layout: model title: Abkhazian asr_xls_r_eng TFWav2Vec2ForCTC from mattchurgin author: John Snow Labs name: asr_xls_r_eng date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_eng` is an Abkhazian model originally trained by mattchurgin. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_xls_r_eng_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xls_r_eng_ab_4.2.0_3.0_1664019618400.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xls_r_eng_ab_4.2.0_3.0_1664019618400.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_xls_r_eng", "ab")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_xls_r_eng", "ab") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_xls_r_eng| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ab| |Size:|1.4 MB| --- layout: model title: Turkish BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_tr_cased date: 2022-12-02 tags: [tr, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: tr edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-tr-cased` is a Turkish model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_tr_cased_tr_4.2.4_3.0_1670019105736.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_tr_cased_tr_4.2.4_3.0_1670019105736.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_tr_cased","tr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_tr_cased","tr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_tr_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|tr| |Size:|378.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-tr-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from jdang) author: John Snow Labs name: xlmroberta_ner_jdang_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `jdang`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jdang_base_finetuned_panx_de_4.1.0_3.0_1660434450928.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jdang_base_finetuned_panx_de_4.1.0_3.0_1660434450928.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jdang_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jdang_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_jdang_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/jdang/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from sam999) author: John Snow Labs name: distilbert_qa_sam999_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `sam999`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_sam999_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772406690.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_sam999_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772406690.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sam999_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sam999_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_sam999_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/sam999/distilbert-base-uncased-finetuned-squad --- layout: model title: Explain Document DL author: John Snow Labs name: explain_document_dl date: 2020-03-19 task: [Sentence Detection, Part of Speech Tagging, Lemmatization, Pipeline Public, Spell Check] language: en nav_key: models edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [pipeline, en, open_source] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description The ``explain_document_dl`` is a pretrained pipeline that performs the basic text processing steps: sentence detection, tokenization, spell checking, stemming, lemmatization and part-of-speech tagging. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_dl_en_2.4.3_2.4_1584626657780.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_dl_en_2.4.3_2.4_1584626657780.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('explain_document_dl', lang = 'en') annotations = pipeline.fullAnnotate("""French author who helped pioner the science-fiction genre. Verne wrate about space, air, and underwater travel before navigable aircrast and practical submarines were invented, and before any means of space travel had been devised.""")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_dl", lang = "en") val result = pipeline.fullAnnotate("French author who helped pioner the science-fiction genre. Verne wrate about space, air, and underwater travel before navigable aircrast and practical submarines were invented, and before any means of space travel had been devised.")(0) ``` {:.nlu-block} ```python import nlu text = ["""John Snow built a detailed map of all the households where people died, and came to the conclusion that the fault was one public water pump that all the victims had used."""] explain_df = nlu.load('en.explain.dl').predict(text) explain_df ```
{:.h2_title} ## Results {:.result_box} ```bash +--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ | text| document| sentence| token| spell| lemmas| stems| pos| +--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ |French author who...|[[document, 0, 23...|[[document, 0, 57...|[[token, 0, 5, Fr...|[[token, 0, 5, Fr...|[[token, 0, 5, Fr...|[[token, 0, 5, fr...|[[pos, 0, 5, JJ, ...| +--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_dl| |Type:|pipeline| |Compatibility:|Spark NLP 2.5.5+| |License:|Open Source| |Edition:|Community| |Language:|[en]| {:.h2_title} ## Included Models The explain_document_dl has one Transformer and six annotators: - DocumentAssembler - A Transformer that creates a column that contains documents. - Sentence Segmenter - An annotator that produces the sentences of the document. - Tokenizer - An annotator that produces the tokens of the sentences. - SpellChecker - An annotator that produces the spelling-corrected tokens. - Stemmer - An annotator that produces the stems of the tokens. - Lemmatizer - An annotator that produces the lemmas of the tokens. - POS Tagger - An annotator that produces the parts of speech of the associated tokens.
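As a sketch of how the `fullAnnotate` output above can be consumed, the snippet below pairs raw tokens with their spell-corrected forms. The plain dicts here are stand-ins for Spark NLP's `Annotation` objects (which expose the same field, `result`, as an attribute), and the sample values are illustrative, not actual pipeline output.

```python
# Sketch: find the tokens that the spell checker changed, from a
# fullAnnotate-style result. Plain dicts stand in for Spark NLP
# Annotation objects; real annotations use .result instead of ["result"].
def spell_corrections(annotations):
    """Return (token, correction) pairs where the spell checker changed a token."""
    tokens = [a["result"] for a in annotations["token"]]
    spells = [a["result"] for a in annotations["spell"]]
    return [(t, s) for t, s in zip(tokens, spells) if t != s]

# Illustrative sample mirroring the misspelling in the example sentence.
sample = {
    "token": [{"result": "pioner"}, {"result": "the"}, {"result": "genre"}],
    "spell": [{"result": "pioneer"}, {"result": "the"}, {"result": "genre"}],
}
print(spell_corrections(sample))  # [('pioner', 'pioneer')]
```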
--- layout: model title: Pipeline to Detect Neoplasms author: John Snow Labs name: ner_neoplasms_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, es] task: Named Entity Recognition language: es edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_neoplasms](https://nlp.johnsnowlabs.com/2021/03/31/ner_neoplasms_es.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_neoplasms_pipeline_es_4.3.0_3.2_1678879327865.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_neoplasms_pipeline_es_4.3.0_3.2_1678879327865.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_neoplasms_pipeline", "es", "clinical/models") text = '''HISTORIA DE ENFERMEDAD ACTUAL: El Sr. Smith es un hombre blanco veterano de 60 años con múltiples comorbilidades, que tiene antecedentes de cáncer de vejiga diagnosticado hace aproximadamente dos años por el Hospital VA. Allí se sometió a una resección. Debía ser ingresado en el Hospital de Día para una cistectomía. Fue visto en la Clínica de Urología y Clínica de Radiología el 02/04/2003. CURSO DE HOSPITAL: El Sr. Smith se presentó en el Hospital de Día antes de la cirugía de Urología. En evaluación, EKG, ecocardiograma fue anormal, se obtuvo una consulta de Cardiología. Luego se procedió a una resonancia magnética de estrés con adenosina cardíaca, la misma fue positiva para isquemia inducible, infarto subendocárdico inferolateral leve a moderado con isquemia peri-infarto. Además, se observa isquemia inducible en el tabique lateral inferior. El Sr. Smith se sometió a un cateterismo del corazón izquierdo, que reveló una enfermedad de las arterias coronarias de dos vasos. La RCA, proximal estaba estenosada en un 95% y la distal en un 80% estenosada. La LAD media estaba estenosada en un 85% y la LAD distal estaba estenosada en un 85%. Se colocaron cuatro stents de metal desnudo Multi-Link Vision para disminuir las cuatro lesiones al 0%. Después de la intervención, el Sr. Smith fue admitido en 7 Ardmore Tower bajo el Servicio de Cardiología bajo la dirección del Dr. Hart. El Sr. Smith tuvo un curso hospitalario post-intervención sin complicaciones. 
Se mantuvo estable para el alta hospitalaria el 07/02/2003 con instrucciones de tomar Plavix diariamente durante un mes y Urología está al tanto de lo mismo.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_neoplasms_pipeline", "es", "clinical/models") val text = "HISTORIA DE ENFERMEDAD ACTUAL: El Sr. Smith es un hombre blanco veterano de 60 años con múltiples comorbilidades, que tiene antecedentes de cáncer de vejiga diagnosticado hace aproximadamente dos años por el Hospital VA. Allí se sometió a una resección. Debía ser ingresado en el Hospital de Día para una cistectomía. Fue visto en la Clínica de Urología y Clínica de Radiología el 02/04/2003. CURSO DE HOSPITAL: El Sr. Smith se presentó en el Hospital de Día antes de la cirugía de Urología. En evaluación, EKG, ecocardiograma fue anormal, se obtuvo una consulta de Cardiología. Luego se procedió a una resonancia magnética de estrés con adenosina cardíaca, la misma fue positiva para isquemia inducible, infarto subendocárdico inferolateral leve a moderado con isquemia peri-infarto. Además, se observa isquemia inducible en el tabique lateral inferior. El Sr. Smith se sometió a un cateterismo del corazón izquierdo, que reveló una enfermedad de las arterias coronarias de dos vasos. La RCA, proximal estaba estenosada en un 95% y la distal en un 80% estenosada. La LAD media estaba estenosada en un 85% y la LAD distal estaba estenosada en un 85%. Se colocaron cuatro stents de metal desnudo Multi-Link Vision para disminuir las cuatro lesiones al 0%. Después de la intervención, el Sr. Smith fue admitido en 7 Ardmore Tower bajo el Servicio de Cardiología bajo la dirección del Dr. Hart. El Sr. Smith tuvo un curso hospitalario post-intervención sin complicaciones. Se mantuvo estable para el alta hospitalaria el 07/02/2003 con instrucciones de tomar Plavix diariamente durante un mes y Urología está al tanto de lo mismo." 
val result = pipeline.fullAnnotate(text) ```
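For reference, here is a minimal sketch of flattening a `fullAnnotate`-style result into chunk rows like those shown in the results table. The nested dicts are stand-ins for Spark NLP's `Annotation` objects (which expose the same fields, `result`, `begin`, `end` and `metadata`, as attributes); the sample values echo the first row of the pipeline's output.

```python
# Sketch: flatten fullAnnotate-style NER output into row dicts.
# Plain dicts stand in for Spark NLP Annotation objects; the field
# names (result, begin, end, metadata) match the real class.
def flatten_chunks(annotations, chunk_col="ner_chunk"):
    return [
        {
            "ner_chunk": a["result"],
            "begin": a["begin"],
            "end": a["end"],
            "ner_label": a["metadata"].get("entity"),
            "confidence": a["metadata"].get("confidence"),
        }
        for a in annotations[chunk_col]
    ]

# Illustrative sample matching the first result row of this pipeline.
sample = {
    "ner_chunk": [
        {"result": "cáncer", "begin": 140, "end": 145,
         "metadata": {"entity": "MORFOLOGIA_NEOPLASIA", "confidence": "0.9997"}},
    ]
}
rows = flatten_chunks(sample)
print(rows[0]["ner_chunk"], rows[0]["ner_label"])  # cáncer MORFOLOGIA_NEOPLASIA
```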
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:------------|--------:|------:|:---------------------|-------------:| | 0 | cáncer | 140 | 145 | MORFOLOGIA_NEOPLASIA | 0.9997 | | 1 | Multi-Link | 1195 | 1204 | MORFOLOGIA_NEOPLASIA | 0.574 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_neoplasms_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|es| |Size:|383.2 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Legal Relation Extraction (Whereas, sm, Bidirectional) author: John Snow Labs name: legre_whereas date: 2022-08-24 tags: [en, legal, re, relations, licensed] task: Relation Extraction language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description IMPORTANT: Don't run this model on the whole legal agreement. Instead: - Split by paragraphs. You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration; - Use the `legclf_cuad_whereas_clause` Text Classifier to select only these paragraphs; This is a Relation Extraction model to infer relations between elements in WHEREAS clauses, more specifically the SUBJECT, the ACTION and the OBJECT. There are two possible relations: `has_subject` and `has_object`. You can also use `legpipe_whereas`, which includes this model, its NER model and dependency parsing, to carry out chunk extraction using grammatical features (the dependency tree). This model is a `sm` model without meaningful directions in the relations (the model was not trained to understand if the direction of the relation is from left to right or right to left).
There are bigger models in Models Hub that are also trained with directed relationships. ## Predicted Entities `has_subject`, `has_object` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legre_whereas_en_1.0.0_3.2_1661341573628.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legre_whereas_en_1.0.0_3.2_1661341573628.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained('legner_whereas', 'en', 'legal/models')\ .setInputCols(["document", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") reDL = legal.RelationExtractionDLModel\ .pretrained("legre_whereas", "en", "legal/models")\ .setPredictionThreshold(0.5)\ .setInputCols(["ner_chunk", "document"])\ .setOutputCol("relations") pipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, ner_model, ner_converter, reDL ]) text = """ WHEREAS VerticalNet owns and operates a series of online communities ( as defined below ) that are accessible via the world wide web , each of which is designed to be an online gathering place for businesses of a certain type or within a certain industry ; """ data = spark.createDataFrame([[text]]).toDF("text") model = pipeline.fit(data) res = model.transform(data) ```
## Results ```bash relation entity1 entity1_begin entity1_end chunk1 entity2 entity2_begin entity2_end chunk2 confidence has_subject WHEREAS_SUBJECT 11 21 VerticalNet WHEREAS_ACTION 32 39 operates 0.9982886 has_subject WHEREAS_SUBJECT 11 21 VerticalNet WHEREAS_OBJECT 41 70 a series of online communities 0.9890683 has_object WHEREAS_ACTION 32 39 operates WHEREAS_OBJECT 41 70 a series of online communities 0.7831568 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legre_whereas| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|409.9 MB| ## References Manual annotations on CUAD dataset ## Benchmarking ```bash label Recall Precision F1 Support has_object 0.946 0.981 0.964 56 has_subject 0.952 0.988 0.969 83 no_rel 1.000 0.970 0.985 161 Avg. 0.966 0.980 0.973 - Weighted-Avg. 0.977 0.977 0.977 - ``` --- layout: model title: English DistilBertForQuestionAnswering Base Cased model (from Moussab) author: John Snow Labs name: distilbert_qa_base_cased_led_squad_orkg_how_5e_05 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-orkg-how-5e-05` is an English model originally trained by `Moussab`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_orkg_how_5e_05_en_4.3.0_3.0_1672766726781.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_orkg_how_5e_05_en_4.3.0_3.0_1672766726781.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_orkg_how_5e_05","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_orkg_how_5e_05","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_cased_led_squad_orkg_how_5e_05| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Moussab/distilbert-base-cased-distilled-squad-orkg-how-5e-05 --- layout: model title: English AlbertForQuestionAnswering model (from nlpunibo) author: John Snow Labs name: albert_qa_nlpunibo date: 2022-06-24 tags: [en, open_source, albert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: AlBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `albert` is an English model originally trained by `nlpunibo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_nlpunibo_en_4.0.0_3.0_1656063694630.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_nlpunibo_en_4.0.0_3.0_1656063694630.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_nlpunibo","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_nlpunibo","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.albert.by_nlpunibo").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_qa_nlpunibo| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|42.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/nlpunibo/albert --- layout: model title: English asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent TFWav2Vec2ForCTC from creynier author: John Snow Labs name: asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent` is an English model originally trained by creynier. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent_en_4.2.0_3.0_1664042438490.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent_en_4.2.0_3.0_1664042438490.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
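The `audioDf` used above is assumed to hold a column of raw audio samples as floats (16 kHz mono is the usual Wav2Vec2 input). As a minimal sketch of preparing such a row, the snippet below builds a synthetic one-second tone as a stand-in for real speech; in practice you would load a WAV file (e.g. with `soundfile` or `librosa`) and pass its samples as a list of floats.

```python
import math

# Synthetic 1-second, 16 kHz mono signal standing in for real speech.
SAMPLE_RATE = 16000
samples = [math.sin(2 * math.pi * 440.0 * t / SAMPLE_RATE)
           for t in range(SAMPLE_RATE)]

# Row shape expected by the AudioAssembler input column, e.g.:
#   audioDf = spark.createDataFrame([(samples,)], ["audio_content"])
row = (samples,)
print(len(row[0]))  # 16000
```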
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_5percent| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|349.4 MB| --- layout: model title: Detect Living Species (w2v_cc_300d) author: John Snow Labs name: ner_living_species date: 2022-06-23 tags: [it, ner, clinical, licensed] task: Named Entity Recognition language: it edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract living species mentions from Italian clinical texts, a task critical to scientific disciplines such as medicine, biology, ecology/biodiversity, nutrition and agriculture. This model is trained using `w2v_cc_300d` embeddings. It is trained on the [LivingNER](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) corpus, which is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. **NOTE :** 1. The text files were translated from Spanish with a neural machine translation system. 2. The annotations were translated with the same neural machine translation system. 3. The translated annotations were transferred to the translated text files using an annotation transfer technology.
## Predicted Entities `HUMAN`, `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_it_3.5.3_3.0_1655972680773.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_it_3.5.3_3.0_1655972680773.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "it")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_living_species", "it", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""Una donna di 74 anni è stata ricoverata con dolore addominale diffuso, ipossia e astenia di 2 settimane di evoluzione. La sua storia personale includeva ipertensione in trattamento con amiloride/idroclorotiazide e dislipidemia controllata con lovastatina. La sua storia familiare era: madre morta di cancro gastrico, fratello con cirrosi epatica di eziologia sconosciuta e sorella con carcinoma epatocellulare. Lo studio eziologico delle diverse cause di malattia epatica cronica comprendeva: virus epatotropi (HBV, HCV) e HIV, studio dell'autoimmunità, ceruloplasmina, ferritina e porfirine nelle urine, tutti risultati negativi. 
Il paziente è stato messo in trattamento anticoagulante con acenocumarolo e diuretici a tempo indeterminato."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "it") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_living_species", "it", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""Una donna di 74 anni è stata ricoverata con dolore addominale diffuso, ipossia e astenia di 2 settimane di evoluzione. La sua storia personale includeva ipertensione in trattamento con amiloride/idroclorotiazide e dislipidemia controllata con lovastatina. La sua storia familiare era: madre morta di cancro gastrico, fratello con cirrosi epatica di eziologia sconosciuta e sorella con carcinoma epatocellulare. Lo studio eziologico delle diverse cause di malattia epatica cronica comprendeva: virus epatotropi (HBV, HCV) e HIV, studio dell'autoimmunità, ceruloplasmina, ferritina e porfirine nelle urine, tutti risultati negativi. 
Il paziente è stato messo in trattamento anticoagulante con acenocumarolo e diuretici a tempo indeterminato.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("it.med_ner.living_species").predict("""Una donna di 74 anni è stata ricoverata con dolore addominale diffuso, ipossia e astenia di 2 settimane di evoluzione. La sua storia personale includeva ipertensione in trattamento con amiloride/idroclorotiazide e dislipidemia controllata con lovastatina. La sua storia familiare era: madre morta di cancro gastrico, fratello con cirrosi epatica di eziologia sconosciuta e sorella con carcinoma epatocellulare. Lo studio eziologico delle diverse cause di malattia epatica cronica comprendeva: virus epatotropi (HBV, HCV) e HIV, studio dell'autoimmunità, ceruloplasmina, ferritina e porfirine nelle urine, tutti risultati negativi. Il paziente è stato messo in trattamento anticoagulante con acenocumarolo e diuretici a tempo indeterminato.""") ```
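To turn the `result` DataFrame above into chunk/label pairs, a common Spark-side pattern is `F.explode(F.arrays_zip(...))` over the `ner_chunk` column's `result` and `metadata` arrays. The snippet below mimics that zip in plain Python; the sample values are illustrative stand-ins for the model's output.

```python
# Plain-Python mimic of the Spark-side pattern
#   F.explode(F.arrays_zip(result.ner_chunk.result,
#                          result.ner_chunk.metadata))
# used to turn NER output columns into (chunk, label) rows.
chunk_results = ["donna", "virus epatotropi", "HBV"]
chunk_metadata = [{"entity": "HUMAN"},
                  {"entity": "SPECIES"},
                  {"entity": "SPECIES"}]

pairs = [(chunk, meta["entity"])
         for chunk, meta in zip(chunk_results, chunk_metadata)]
print(pairs)  # [('donna', 'HUMAN'), ('virus epatotropi', 'SPECIES'), ('HBV', 'SPECIES')]
```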
## Results ```bash +----------------+-------+ |ner_chunk |label | +----------------+-------+ |donna |HUMAN | |personale |HUMAN | |madre |HUMAN | |fratello |HUMAN | |sorella |HUMAN | |virus epatotropi|SPECIES| |HBV |SPECIES| |HCV |SPECIES| |HIV |SPECIES| |paziente |HUMAN | +----------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|it| |Size:|15.1 MB| ## References [https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) ## Benchmarking ```bash label precision recall f1-score support B-HUMAN 0.83 0.97 0.89 2772 B-SPECIES 0.60 0.92 0.73 2866 I-HUMAN 0.67 0.61 0.64 101 I-SPECIES 0.56 0.87 0.68 1039 micro-avg 0.68 0.93 0.78 6778 macro-avg 0.67 0.84 0.74 6778 weighted-avg 0.69 0.93 0.79 6778 ``` --- layout: model title: Spanish RoBERTa Embeddings (Bertin Base) author: John Snow Labs name: roberta_embeddings_bertin_roberta_base_spanish date: 2022-04-14 tags: [roberta, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model for Spanish Language, trained within the Bertin project. Other non-base Bertin models can be found [here](https://nlp.johnsnowlabs.com/models?q=bertin). The model was uploaded to Hugging Face, adapted and imported into Spark NLP. `bertin-roberta-base-spanish` is a Spanish model originally trained by `bertin-project`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_bertin_roberta_base_spanish_es_3.4.2_3.0_1649945200032.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_bertin_roberta_base_spanish_es_3.4.2_3.0_1649945200032.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_bertin_roberta_base_spanish","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_bertin_roberta_base_spanish","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.bertin_roberta_base_spanish").predict("""Me encanta chispa nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_bertin_roberta_base_spanish| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|234.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/bertin-project/bertin-roberta-base-spanish - https://github.com/google/flax - https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers - https://arxiv.org/pdf/2107.07253.pdf - https://arxiv.org/abs/1907.11692 - https://www.wired.com/2017/04/courts-using-ai-sentence-criminals-must-stop-now/ - https://www.washingtonpost.com/technology/2019/05/16/police-have-used-celebrity-lookalikes-distorted-images-boost-facial-recognition-results-research-finds/ - https://www.wired.com/story/ai-college-exam-proctors-surveillance/ - https://www.eff.org/deeplinks/2020/09/students-are-pushing-back-against-proctoring-surveillance-apps - https://www.washingtonpost.com/technology/2019/10/22/ai-hiring-face-scanning-algorithm-increasingly-decides-whether-you-deserve-job/ - https://www.technologyreview.com/2021/07/21/1029860/disability-rights-employment-discrimination-ai-hiring/ - https://www.insider.com/china-is-testing-ai-recognition-on-the-uighurs-bbc-2021-5 - https://www.health.harvard.edu/blog/anti-asian-racism-breaking-through-stereotypes-and-silence-2021041522414 - https://discord.com/channels/858019234139602994/859113060068229190 --- layout: model title: Part of Speech for Indonesian author: John Snow Labs name: pos_ud_gsd date: 2021-03-09 tags: [part_of_speech, open_source, indonesian, pos_ud_gsd, id] task: Part of Speech Tagging language: id edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the 
input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - PROPN - AUX - DET - NOUN - PRON - VERB - ADP - PUNCT - ADV - CCONJ - SCONJ - NUM - ADJ - PART - SYM - X {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_id_3.0.0_3.0_1615292193447.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_id_3.0.0_3.0_1615292193447.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_gsd", "id") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Halo dari John Salju Labs! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_gsd", "id") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Halo dari John Salju Labs! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Halo dari John Salju Labs! "] token_df = nlu.load('id.pos').predict(text) token_df ```
## Results ```bash token pos 0 Halo PROPN 1 dari ADP 2 John PROPN 3 Salju PROPN 4 Labs PROPN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_gsd| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|id| --- layout: model title: Arabic Bert Embeddings (Base, MSA dataset) author: John Snow Labs name: bert_embeddings_bert_base_arabic_camelbert_msa date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-msa` is an Arabic model originally trained by `CAMeL-Lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabic_camelbert_msa_ar_3.4.2_3.0_1649678312712.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabic_camelbert_msa_ar_3.4.2_3.0_1649678312712.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabic_camelbert_msa","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabic_camelbert_msa","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.bert_base_arabic_camelbert_msa").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_arabic_camelbert_msa| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|409.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://catalog.ldc.upenn.edu/LDC2011T11 - http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus - https://vlo.clarin.eu/search;jsessionid=31066390B2C9E8C6304845BA79869AC1?1&q=osian - https://archive.org/details/arwiki-20190201 - https://oscar-corpus.com/ - https://github.com/google-research/bert - https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/tokenization.py#L286-L297 - https://github.com/CAMeL-Lab/camel_tools - https://github.com/CAMeL-Lab/CAMeLBERT --- layout: model title: German asr_wav2vec2_base_german_cv9 TFWav2Vec2ForCTC from oliverguhr author: John Snow Labs name: asr_wav2vec2_base_german_cv9 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_base_german_cv9` is a German model originally trained by oliverguhr. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_german_cv9_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_german_cv9_de_4.2.0_3.0_1664101206544.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_german_cv9_de_4.2.0_3.0_1664101206544.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_german_cv9", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_german_cv9", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_german_cv9| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|349.2 MB| --- layout: model title: Fast Neural Machine Translation Model from English to Isoko author: John Snow Labs name: opus_mt_en_iso date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, iso, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `iso` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_iso_xx_2.7.0_2.4_1609167424385.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_iso_xx_2.7.0_2.4_1609167424385.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentences") marian = MarianTransformer.pretrained("opus_mt_en_iso", "xx")\ .setInputCols(["sentences"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_iso", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.iso').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_iso| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate Armenian to English Pipeline author: John Snow Labs name: translate_hy_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, hy, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `hy` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_hy_en_xx_2.7.0_2.4_1609691902821.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_hy_en_xx_2.7.0_2.4_1609691902821.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_hy_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_hy_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.hy.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_hy_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal No conflicts Clause Binary Classifier author: John Snow Labs name: legclf_no_conflicts_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `no-conflicts` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. 
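The "paragraph splitting (by multiline)" pre-processing mentioned above can be sketched in plain Python before feeding chunks to the classifier. This is a minimal illustration only, not part of the Spark NLP API: the `split_paragraphs` helper is hypothetical, and the whitespace token count is a rough proxy for the 512-token embedding limit, not the model's actual tokenizer.

```python
import re

def split_paragraphs(text: str, max_tokens: int = 512):
    """Split a document into paragraphs on blank lines and flag any
    chunk whose rough token count may exceed the embedding limit."""
    # "By multiline": one or more blank lines terminate a paragraph.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    for p in paragraphs:
        n_tokens = len(p.split())  # crude whitespace proxy, not the real tokenizer
        chunks.append({"text": p, "may_exceed_limit": n_tokens > max_tokens})
    return chunks

doc = ("No Conflicts. The execution of this Agreement does not conflict "
       "with any other obligation of the parties.\n\n"
       "Governing Law. This Agreement shall be governed by the laws of Delaware.")
for chunk in split_paragraphs(doc):
    print(chunk["may_exceed_limit"], chunk["text"][:40])
```

Each resulting chunk can then be passed as one row of the `clause_text` column shown in the pipeline below; chunks flagged as too long should be split further before classification.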
## Predicted Entities `other`, `no-conflicts` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_no_conflicts_clause_en_1.0.0_3.2_1660122683948.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_no_conflicts_clause_en_1.0.0_3.2_1660122683948.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_no_conflicts_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +--------------+ |        result| +--------------+ |[no-conflicts]| |       [other]| |       [other]| |[no-conflicts]| +--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_no_conflicts_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support no-conflicts 0.96 0.94 0.95 54 other 0.98 0.99 0.98 153 accuracy - - 0.98 207 macro-avg 0.97 0.97 0.97 207 weighted-avg 0.98 0.98 0.98 207 ``` --- layout: model title: Part of Speech for Hungarian author: John Snow Labs name: pos_ud_szeged date: 2020-05-05 12:50:00 +0800 task: Part of Speech Tagging language: hu edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [pos, hu] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. 
{:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_szeged_hu_2.5.0_2.4_1588671966774.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_szeged_hu_2.5.0_2.4_1588671966774.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_szeged", "hu") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Az északi király kivételével John Snow angol orvos, vezető szerepet játszik az érzéstelenítés és az orvosi higiénia fejlesztésében.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_szeged", "hu") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Az északi király kivételével John Snow angol orvos, vezető szerepet játszik az érzéstelenítés és az orvosi higiénia fejlesztésében.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Az északi király kivételével John Snow angol orvos, vezető szerepet játszik az érzéstelenítés és az orvosi higiénia fejlesztésében."""] pos_df = nlu.load('hu.pos.ud_szeged').predict(text) pos_df ```
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=1, result='DET', metadata={'word': 'Az'}), Row(annotatorType='pos', begin=3, end=8, result='ADJ', metadata={'word': 'északi'}), Row(annotatorType='pos', begin=10, end=15, result='NOUN', metadata={'word': 'király'}), Row(annotatorType='pos', begin=17, end=27, result='NOUN', metadata={'word': 'kivételével'}), Row(annotatorType='pos', begin=29, end=32, result='PROPN', metadata={'word': 'John'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_szeged| |Type:|pos| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|hu| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Self Report Classification Pipeline - Voice of the Patient author: John Snow Labs name: bert_sequence_classifier_vop_self_report_pipeline date: 2023-06-14 tags: [licensed, en, clinical, vop, classification] task: Text Classification language: en edition: Healthcare NLP 4.4.3 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline includes the Medical Bert for Sequence Classification model to classify texts depending on if they are self-reported or if they refer to another person. The pipeline is built on the top of [bert_sequence_classifier_vop_self_report](https://nlp.johnsnowlabs.com/2023/06/13/bert_sequence_classifier_vop_self_report_en.html) model. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_self_report_pipeline_en_4.4.3_3.2_1686702483761.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_self_report_pipeline_en_4.4.3_3.2_1686702483761.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_sequence_classifier_vop_self_report_pipeline", "en", "clinical/models") pipeline.annotate("My friend was treated for her skin cancer two years ago.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_sequence_classifier_vop_self_report_pipeline", "en", "clinical/models") val result = pipeline.annotate("My friend was treated for her skin cancer two years ago.") ```
## Results ```bash | text | prediction | |:---------------------------------------------------------|:-------------| | My friend was treated for her skin cancer two years ago. | 3rd_Person | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_vop_self_report_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|406.4 MB| ## Included Models - DocumentAssembler - TokenizerModel - MedicalBertForSequenceClassification --- layout: model title: English Legal RobertaForMaskedLM Base Cased model author: John Snow Labs name: roberta_embeddings_legal_base date: 2022-12-12 tags: [en, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-roberta-base` is an English model originally trained by `saibo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_legal_base_en_4.2.4_3.0_1670858851348.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_legal_base_en_4.2.4_3.0_1670858851348.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_base","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_legal_base| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|468.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/saibo/legal-roberta-base - https://www.kaggle.com/uspto/patent-litigations - https://case.law/ - https://www.kaggle.com/bigquery/patents - https://www.kaggle.com/sohier/beyond-queries-exploring-the-bigquery-api --- layout: model title: English BertForQuestionAnswering model (from machine2049) author: John Snow Labs name: bert_qa_bert_base_uncased_finetuned_duorc_bert date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-finetuned-duorc_bert` is an English model originally trained by `machine2049`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_finetuned_duorc_bert_en_4.0.0_3.0_1654180931013.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_finetuned_duorc_bert_en_4.0.0_3.0_1654180931013.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_finetuned_duorc_bert","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(False) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_finetuned_duorc_bert","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.base_uncased.by_machine2049").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
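The nlu one-liner above expects each input to be the question and the context joined by a `|||` separator. Building such inputs from plain (question, context) pairs is a one-liner of its own; a minimal sketch in plain Python (no Spark or nlu session required, the pair data is illustrative):

```python
# Build nlu-style "question|||context" strings from (question, context) pairs.
# The "|||" separator follows the nlu example above.
pairs = [
    ("What's my name?", "My name is Clara and I live in Berkeley."),
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris."),
]

inputs = [f"{question}|||{context}" for question, context in pairs]
```

A list built this way can be passed directly to `predict` for batch inference.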
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_finetuned_duorc_bert| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/machine2049/bert-base-uncased-finetuned-duorc_bert --- layout: model title: Context Spell Checker Pipeline for English author: John Snow Labs name: spellcheck_dl_pipeline date: 2022-04-14 tags: [spellcheck, spell, spellcheck_pipeline, spelling_corrector, en, open_source] task: Spell Check language: en nav_key: models edition: Spark NLP 3.4.1 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained spellchecker pipeline is built on top of the [spellcheck_dl](https://nlp.johnsnowlabs.com/2022/04/01/spellcheck_dl_en_2_4.html) model. This pipeline is intended for PySpark 2.4.x users with Spark NLP 3.4.1. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CONTEXTUAL_SPELL_CHECKER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spellcheck_dl_pipeline_en_3.4.1_2.4_1649941123093.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/spellcheck_dl_pipeline_en_3.4.1_2.4_1649941123093.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("spellcheck_dl_pipeline", lang = "en") text = ["During the summer we have the best ueather.", "I have a black ueather jacket, so nice."] pipeline.annotate(text) ``` ```scala val pipeline = new PretrainedPipeline("spellcheck_dl_pipeline", lang = "en") val example = Array("During the summer we have the best ueather.", "I have a black ueather jacket, so nice.") pipeline.annotate(example) ```
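For each input text, `annotate` returns a dictionary whose `checked` key holds the corrected tokens (the `document` and `token` keys hold the original text and tokens; see the Results section of this card). Rebuilding corrected sentences from those tokens needs only a small detokenizer. A minimal standalone sketch assuming that output shape, with no Spark session required:

```python
def join_tokens(tokens):
    """Naively detokenize: attach punctuation to the preceding word."""
    out = ""
    for tok in tokens:
        if not out or tok in {".", ",", "!", "?", ";", ":"}:
            out += tok
        else:
            out += " " + tok
    return out

# Shape mirrors one entry of the pipeline's annotate() output.
annotations = [
    {"checked": ["During", "the", "summer", "we", "have", "the", "best", "weather", "."],
     "document": ["During the summer we have the best ueather."]},
]

corrected = [join_tokens(a["checked"]) for a in annotations]
```

More elaborate detokenizers would also handle quotes and contractions; for quick inspection of spell-checker output this is usually enough.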
## Results ```bash [{'checked': ['During', 'the', 'summer', 'we', 'have', 'the', 'best', 'weather', '.'], 'document': ['During the summer we have the best ueather.'], 'token': ['During', 'the', 'summer', 'we', 'have', 'the', 'best', 'ueather', '.']}, {'checked': ['I', 'have', 'a', 'black', 'leather', 'jacket', ',', 'so', 'nice', '.'], 'document': ['I have a black ueather jacket, so nice.'], 'token': ['I', 'have', 'a', 'black', 'ueather', 'jacket', ',', 'so', 'nice', '.']}] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|spellcheck_dl_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|99.7 MB| ## Included Models - DocumentAssembler - TokenizerModel - ContextSpellCheckerModel --- layout: model title: Legal Industrial Structures And Policy Document Classifier (EURLEX) author: John Snow Labs name: legclf_industrial_structures_and_policy_bert date: 2023-03-06 tags: [en, legal, classification, clauses, industrial_structures_and_policy, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_industrial_structures_and_policy_bert model is a Bert Sentence Embeddings Document Classifier that predicts whether a given document belongs to the class Industrial_Structures_and_Policy or not (binary classification) according to EuroVoc labels.
## Predicted Entities `Industrial_Structures_and_Policy`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_industrial_structures_and_policy_bert_en_1.0.0_3.0_1678111630461.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_industrial_structures_and_policy_bert_en_1.0.0_3.0_1678111630461.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_industrial_structures_and_policy_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +----------------------------------+ |result | +----------------------------------+ |[Industrial_Structures_and_Policy]| |[Other] | |[Other] | |[Industrial_Structures_and_Policy]| +----------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_industrial_structures_and_policy_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Industrial_Structures_and_Policy 0.80 0.80 0.80 40 Other 0.84 0.84 0.84 51 accuracy - - 0.82 91 macro-avg 0.82 0.82 0.82 91 weighted-avg 0.82 0.82 0.82 91 ``` --- layout: model title: Tamil Bert Embeddings author: John Snow Labs name: bert_embeddings_muril_adapted_local date: 2022-04-11 tags: [bert, embeddings, ta, open_source] task: Embeddings language: ta edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `muril-adapted-local` is a Tamil model originally trained by `monsoon-nlp`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_ta_3.4.2_3.0_1649676437079.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_ta_3.4.2_3.0_1649676437079.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","ta") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["நான் தீப்பொறி NLP ஐ நேசிக்கிறேன்"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","ta") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("நான் தீப்பொறி NLP ஐ நேசிக்கிறேன்").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ta.embed.muril_adapted_local").predict("""நான் தீப்பொறி NLP ஐ நேசிக்கிறேன்""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_muril_adapted_local| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ta| |Size:|888.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/monsoon-nlp/muril-adapted-local - https://tfhub.dev/google/MuRIL/1 --- layout: model title: Smaller BERT Sentence Embeddings (L-6_H-512_A-8) author: John Snow Labs name: sent_small_bert_L6_512 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L6_512_en_2.6.0_2.4_1598350624049.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L6_512_en_2.6.0_2.4_1598350624049.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L6_512", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L6_512", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L6_512').predict(text, output_level='sentence') embeddings_df ```
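Sentence embeddings such as the 512-dimensional vectors this model produces are typically compared with cosine similarity. A standalone sketch using only the Python standard library (the short vectors are illustrative stand-ins, not actual model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative 4-dim vectors standing in for 512-dim sentence embeddings.
v1 = [0.07, 1.18, -0.15, 0.32]
v2 = [0.27, 1.04, -0.41, 0.11]
similarity = cosine_similarity(v1, v2)  # close to 1.0 for similar sentences
```

In practice the vectors would come from the `sentence_embeddings` column of the pipeline output (or the nlu DataFrame), and a library such as NumPy would be used for speed.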
{:.h2_title} ## Results ```bash sentence en_embed_sentence_small_bert_L6_512_embeddings I hate cancer [0.07266588509082794, 1.1798694133758545, -0.1... Antibiotics aren't painkiller [0.2663014829158783, 1.0382004976272583, -0.41... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L6_512| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|en| |Dimension:|512| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-512_A-8/1 --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_256_finetuned_squad_seed_2 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-256-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_256_finetuned_squad_seed_2_en_4.3.0_3.0_1674214836583.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_256_finetuned_squad_seed_2_en_4.3.0_3.0_1674214836583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_256_finetuned_squad_seed_2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_256_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_256_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|427.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-256-finetuned-squad-seed-2 --- layout: model title: Chinese Bert Embeddings (Large) author: John Snow Labs name: bert_embeddings_jdt_fin_roberta_wwm_large date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `jdt-fin-roberta-wwm-large` is a Chinese model originally trained by `wangfan`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_jdt_fin_roberta_wwm_large_zh_3.4.2_3.0_1649670973199.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_jdt_fin_roberta_wwm_large_zh_3.4.2_3.0_1649670973199.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_jdt_fin_roberta_wwm_large","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_jdt_fin_roberta_wwm_large","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.jdt_fin_roberta_wwm_large").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_jdt_fin_roberta_wwm_large| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|1.2 GB| |Case sensitive:|true| ## References - https://huggingface.co/wangfan/jdt-fin-roberta-wwm-large --- layout: model title: Extract Treatment Entities from Voice of the Patient Documents (embeddings_clinical) author: John Snow Labs name: ner_vop_treatment date: 2023-06-06 tags: [licensed, clinical, ner, en, vop, treatment, drug] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts mentions of treatments, such as drugs, dosages, and procedures, from documents written in the patient's own words. ## Predicted Entities `Drug`, `Form`, `Dosage`, `Frequency`, `Route`, `Duration`, `Procedure`, `Treatment` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_treatment_en_4.4.3_3.0_1686077083722.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_treatment_en_4.4.3_3.0_1686077083722.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_treatment", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["My grandpa was diagnosed with type 2 diabetes and had to make some changes to his lifestyle. He also takes metformin and glipizide to help regulate his blood sugar levels. 
It's been a bit of an adjustment, but he's doing well."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_treatment", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("My grandpa was diagnosed with type 2 diabetes and had to make some changes to his lifestyle. He also takes metformin and glipizide to help regulate his blood sugar levels. It's been a bit of an adjustment, but he's doing well.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
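The `NerConverterInternal` stage in the pipeline above groups token-level IOB tags from the `ner` column into labeled chunks like those shown in the Results that follow. The core grouping logic can be sketched standalone (hypothetical tokens and tags, no Spark required):

```python
def iob_to_chunks(tokens, tags):
    """Group B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the previous chunk
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            current.append(tok)  # continue the open chunk
        else:  # "O" tag or label mismatch closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

# Illustrative IOB output for a sentence like the example above.
tokens = ["He", "takes", "metformin", "and", "glipizide", "daily"]
tags = ["O", "O", "B-Drug", "O", "B-Drug", "B-Frequency"]
chunks = iob_to_chunks(tokens, tags)
# [('metformin', 'Drug'), ('glipizide', 'Drug'), ('daily', 'Frequency')]
```

In the actual pipeline this conversion happens inside Spark; the sketch only illustrates what the `ner_chunk` column contains.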
## Results ```bash | chunk | ner_label | |:----------|:------------| | metformin | Drug | | glipizide | Drug | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_treatment| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
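The per-label scores in the Benchmarking section below follow directly from the true-positive, false-positive, and false-negative counts. A quick plain-Python sanity check against the Drug row:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from tp/fp/fn counts, rounded to 2 places."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

# Counts taken from the Drug row of the benchmark table.
drug_scores = prf(tp=1297, fp=107, fn=143)  # (0.92, 0.9, 0.91)
```

The same function reproduces the micro-average row, since micro-averaging simply pools the counts across labels before computing the metrics.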
## Benchmarking ```bash label tp fp fn total precision recall f1 Drug 1297 107 143 1440 0.92 0.90 0.91 Form 246 36 20 266 0.87 0.92 0.90 Dosage 346 53 66 412 0.87 0.84 0.85 Frequency 909 185 170 1079 0.83 0.84 0.84 Route 38 5 10 48 0.88 0.79 0.84 Duration 1825 252 485 2310 0.88 0.79 0.83 Procedure 565 101 140 705 0.85 0.80 0.82 Treatment 144 25 84 228 0.85 0.63 0.73 macro_avg 5370 764 1118 6488 0.87 0.81 0.84 micro_avg 5370 764 1118 6488 0.88 0.83 0.85 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from jsunster) author: John Snow Labs name: distilbert_qa_jsunster_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `jsunster`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_jsunster_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771511693.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_jsunster_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771511693.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_jsunster_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(False) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_jsunster_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_jsunster_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/jsunster/distilbert-base-uncased-finetuned-squad --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from LenaSchmidt) author: John Snow Labs name: distilbert_qa_lenaschmidt_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `LenaSchmidt`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_lenaschmidt_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768620901.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_lenaschmidt_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768620901.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_lenaschmidt_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(False) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_lenaschmidt_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_lenaschmidt_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/LenaSchmidt/distilbert-base-uncased-finetuned-squad --- layout: model title: Pipeline to Detect Chemical Compounds and Genes author: John Snow Labs name: bert_token_classifier_ner_chemprot_pipeline date: 2022-03-15 tags: [chemprot, bert_token_classifier, ner, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_chemprot](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_chemprot_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CHEMPROT_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_CHEMPROT_CLINICAL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemprot_pipeline_en_3.4.1_3.0_1647339959529.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemprot_pipeline_en_3.4.1_3.0_1647339959529.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python chemprot_pipeline = PretrainedPipeline("bert_token_classifier_ner_chemprot_pipeline", "en", "clinical/models") chemprot_pipeline.annotate("Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.") ``` ```scala val chemprot_pipeline = new PretrainedPipeline("bert_token_classifier_ner_chemprot_pipeline", "en", "clinical/models") chemprot_pipeline.annotate("Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.chemprot_pipeline").predict("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""") ```
## Results ```bash +-------------------------------+---------+ |chunk |ner_label| +-------------------------------+---------+ |Keratinocyte growth factor |GENE-Y | |acidic fibroblast growth factor|GENE-Y | +-------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_chemprot_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.3 MB| ## Included Models - DocumentAssembler - TokenizerModel - MedicalBertForTokenClassifier - NerConverter - Finisher --- layout: model title: English asr_wav2vec2_base_timit_demo_colab0_by_tahazakir TFWav2Vec2ForCTC from tahazakir author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab0_by_tahazakir date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab0_by_tahazakir` is an English model originally trained by tahazakir. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab0_by_tahazakir_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab0_by_tahazakir_en_4.2.0_3.0_1664037070578.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab0_by_tahazakir_en_4.2.0_3.0_1664037070578.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab0_by_tahazakir", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab0_by_tahazakir", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab0_by_tahazakir| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s534 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s534 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s534` is a German model originally trained by jonatasgrosman. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s534_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s534_de_4.2.0_3.0_1664105023083.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s534_de_4.2.0_3.0_1664105023083.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s534", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s534", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s534| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: English asr_wav2vec2_base_checkpoint_9 TFWav2Vec2ForCTC from jiobiala24 author: John Snow Labs name: pipeline_asr_wav2vec2_base_checkpoint_9 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_checkpoint_9` is an English model originally trained by jiobiala24. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_checkpoint_9_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_checkpoint_9_en_4.2.0_3.0_1664021052269.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_checkpoint_9_en_4.2.0_3.0_1664021052269.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_checkpoint_9', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_checkpoint_9", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_checkpoint_9| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|349.2 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English asr_wav2vec2_base_960h_4_gram TFWav2Vec2ForCTC from patrickvonplaten author: John Snow Labs name: pipeline_asr_wav2vec2_base_960h_4_gram date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_960h_4_gram` is an English model originally trained by patrickvonplaten. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_960h_4_gram_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_960h_4_gram_en_4.2.0_3.0_1664022212611.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_960h_4_gram_en_4.2.0_3.0_1664022212611.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_960h_4_gram', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_960h_4_gram", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_960h_4_gram| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|227.6 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Fast Neural Machine Translation Model from English to Indic Languages author: John Snow Labs name: opus_mt_en_inc date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, inc, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. - source languages: `en` - target languages: `inc` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_inc_xx_2.7.0_2.4_1609167901916.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_inc_xx_2.7.0_2.4_1609167901916.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_inc", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_inc", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.inc').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_inc| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Chunk Entity Resolver for ICD10 codes author: John Snow Labs name: chunkresolve_ICD10GM_2021 date: 2021-04-16 tags: [entity_resolution, clinical, licensed, de] task: Entity Resolution language: de edition: Healthcare NLP 3.0.0 spark_version: 3.0 deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD10-GM codes for German language using chunk embeddings (augmented with synonyms, four times richer than previous resolver). ## Predicted Entities ICD10 codes {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_GM_DE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_ICD10GM_2021_de_3.0.0_3.0_1618603791008.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_ICD10GM_2021_de_3.0.0_3.0_1618603791008.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... resolver = ChunkEntityResolverModel.pretrained("chunkresolve_ICD10GM_2021","de","clinical/models")\ .setInputCols("token","chunk_embeddings")\ .setOutputCol("entity") pipeline = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, resolver]) data = spark.createDataFrame([["metastatic lung cancer"]]).toDF("text") model = pipeline.fit(data) results = model.transform(data) ... ``` ```scala ... val resolver = ChunkEntityResolverModel.pretrained("chunkresolve_ICD10GM_2021","de","clinical/models") .setInputCols("token","chunk_embeddings") .setOutputCol("entity") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, resolver)) val data = Seq("metastatic lung cancer").toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | | chunks | code | resolutions | all_codes | billable_hcc_status_score | all_distances | |---:|:-----------------------|:-------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------|:----------------------------|:-------------------------------------------------------------------------------------------------------------------------| | 0 | metastatic lung cancer | C7800 | ['cancer metastatic to lung', 'metastasis from malignant tumor of lung', 'cancer metastatic to left lung', 'history of cancer metastatic to lung', 'metastatic cancer', 'history of cancer metastatic to lung (situation)', 'metastatic adenocarcinoma to bilateral lungs', 'cancer metastatic to chest wall', 'metastatic malignant neoplasm to left lower lobe of lung', 'metastatic carcinoid tumour', 'cancer metastatic to respiratory tract', 'metastatic carcinoid tumor'] | ['C7800', 'C349', 'C7801', 'Z858', 'C800', 'Z8511', 'C780', 'C798', 'C7802', 'C799', 'C7830', 'C7B00'] | ['1', '1', '8'] | ['0.0464', '0.0829', '0.0852', '0.0860', '0.0914', '0.0989', '0.1133', '0.1220', '0.1220', '0.1253', '0.1249', '0.1260'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|chunkresolve_ICD10GM_2021| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token, chunk_embeddings]| |Output Labels:|[recognized]| |Language:|de| --- layout: model title: Stopwords Remover for Icelandic language (152 entries) author: John Snow 
Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, is, open_source] task: Stop Words Removal language: is edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_is_3.4.1_3.0_1646673153470.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_is_3.4.1_3.0_1646673153470.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","is") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Þú ert ekki betri en ég"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","is") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Þú ert ekki betri en ég").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("is.stopwords").predict("""Þú ert ekki betri en ég""") ```
## Results ```bash +------------------------+ |result | +------------------------+ |[Þú, ert, betri, en, ég]| +------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|is| |Size:|1.9 KB| --- layout: model title: English BertForQuestionAnswering Cased model (from motiondew) author: John Snow Labs name: bert_qa_set_date_1_impartit_4 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `set_date_1-impartit_4-bert` is an English model originally trained by `motiondew`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_1_impartit_4_en_4.0.0_3.0_1657191771743.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_1_impartit_4_en_4.0.0_3.0_1657191771743.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_1_impartit_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_1_impartit_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_set_date_1_impartit_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/motiondew/set_date_1-impartit_4-bert --- layout: model title: English image_classifier_vit_animals_classifier ViTForImageClassification from victor author: John Snow Labs name: image_classifier_vit_animals_classifier date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_animals_classifier` is an English model originally trained by victor. ## Predicted Entities `lion`, `dolphin`, `hippo`, `giraffe`, `elephant` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_animals_classifier_en_4.1.0_3.0_1660168625857.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_animals_classifier_en_4.1.0_3.0_1660168625857.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_animals_classifier", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_animals_classifier", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_animals_classifier| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Pipeline to Detect Radiology Entities, Assign Assertion Status and Find Relations author: John Snow Labs name: explain_clinical_doc_radiology date: 2022-03-31 tags: [licensed, clinical, en, ner, assertion, relation_extraction, radiology] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.4.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pipeline for detecting radiology entities with the `ner_radiology` NER model, assigning their assertion status with `assertion_dl_radiology` model, and extracting relations between the diagnosis, test, and findings with `re_test_problem_finding` relation extraction model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_radiology_en_3.4.2_3.0_1648737971620.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_radiology_en_3.4.2_3.0_1648737971620.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("explain_clinical_doc_radiology", "en", "clinical/models") result = pipeline.fullAnnotate("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""")[0] ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("explain_clinical_doc_radiology", "en", "clinical/models") val result = pipeline.fullAnnotate("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""")(0) ``` {:.nlu-block} ```python import nlu nlu.load("en.explain_doc.clinical_radiology.pipeline").predict("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""") ```
## Results ```bash +----+------------------------------------------+---------------------------+ | | chunks | entities | |---:|:-----------------------------------------|:--------------------------| | 0 | Bilateral breast | BodyPart | | 1 | ultrasound | ImagingTest | | 2 | ovoid mass | ImagingFindings | | 3 | 0.5 x 0.5 x 0.4 | Measurements | | 4 | cm | Units | | 5 | anteromedial aspect of the left shoulder | BodyPart | | 6 | mass | ImagingFindings | | 7 | isoechoic echotexture | ImagingFindings | | 8 | muscle | BodyPart | | 9 | internal color flow | ImagingFindings | | 10 | benign fibrous tissue | ImagingFindings | | 11 | lipoma | Disease_Syndrome_Disorder | +----+------------------------------------------+---------------------------+ +----+-----------------------+---------------------------+-------------+ | | chunks | entities | assertion | |---:|:----------------------|:--------------------------|:------------| | 0 | ultrasound | ImagingTest | Confirmed | | 1 | ovoid mass | ImagingFindings | Confirmed | | 2 | mass | ImagingFindings | Confirmed | | 3 | isoechoic echotexture | ImagingFindings | Confirmed | | 4 | internal color flow | ImagingFindings | Negative | | 5 | benign fibrous tissue | ImagingFindings | Suspected | | 6 | lipoma | Disease_Syndrome_Disorder | Suspected | +----+-----------------------+---------------------------+-------------+ +---------+-----------------+-----------------------+---------------------------+------------+ |relation | entity1 | chunk1 | entity2 | chunk2 | |--------:|:----------------|:----------------------|:--------------------------|:-----------| | 1 | ImagingTest | ultrasound | ImagingFindings | ovoid mass | | 0 | ImagingFindings | benign fibrous tissue | Disease_Syndrome_Disorder | lipoma | +---------+-----------------+-----------------------+---------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_clinical_doc_radiology| |Type:|pipeline| 
|Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternal - NerConverterInternal - AssertionDLModel - PerceptronModel - DependencyParserModel - RelationExtractionModel --- layout: model title: Pipeline to Detect Anatomical Structures (Single Entity - embeddings_clinical) author: John Snow Labs name: ner_anatomy_coarse_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_anatomy_coarse](https://nlp.johnsnowlabs.com/2021/03/31/ner_anatomy_coarse_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_pipeline_en_4.3.0_3.2_1678862662568.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_pipeline_en_4.3.0_3.2_1678862662568.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_anatomy_coarse_pipeline", "en", "clinical/models") text = '''content in the lung tissue''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_anatomy_coarse_pipeline", "en", "clinical/models") val text = "content in the lung tissue" val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.anatomy_coarse.pipeline").predict("""content in the lung tissue""") ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-------------|--------:|------:|:------------|-------------:| | 0 | lung tissue | 15 | 25 | Anatomy | 0.99655 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_anatomy_coarse_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Detect Bacterial Species (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_bacteria date: 2021-09-30 tags: [bacteria, bertfortokenclassification, ner, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.2.2 spark_version: 2.4 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect different species of bacteria in text using a pretrained NER model. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. ## Predicted Entities `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bacteria_en_3.2.2_2.4_1632995062374.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bacteria_en_3.2.2_2.4_1632995062374.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_bacteria", "en", "clinical/models")\ .setInputCols("token", "document")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter]) p_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']}))) test_sentence = """Based on these genetic and phenotypic properties, we propose that strain SMSP (T) represents \ a novel species of the genus Methanoregula, for which we propose the name Methanoregula formicica \ sp. nov., with the type strain SMSP (T) (= NBRC 105244 (T) = DSM 22288 (T)).""" result = p_model.transform(spark.createDataFrame(pd.DataFrame({'text': [test_sentence]}))) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_bacteria", "en", "clinical/models") .setInputCols(Array("token", "document")) .setOutputCol("ner") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("document","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter)) val data = Seq("""Based on these genetic and phenotypic properties, we propose that strain SMSP (T) represents a novel species of the genus Methanoregula, for which we propose the name Methanoregula formicica sp. 
nov., with the type strain SMSP (T) (= NBRC 105244 (T) = DSM 22288 (T)).""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_bacteria").predict("""Based on these genetic and phenotypic properties, we propose that strain SMSP (T) represents \ a novel species of the genus Methanoregula, for which we propose the name Methanoregula formicica \ sp. nov., with the type strain SMSP (T) (= NBRC 105244 (T) = DSM 22288 (T)).""") ```
## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |SMSP (T) |SPECIES | |Methanoregula formicica|SPECIES | |SMSP (T) |SPECIES | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_bacteria| |Compatibility:|Healthcare NLP 3.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Case sensitive:|true| |Max sentence length:|512| ## Data Source Trained on a custom dataset by John Snow Labs. ## Benchmarking ```bash label precision recall f1-score support B-SPECIES 0.98 0.84 0.91 767 I-SPECIES 0.99 0.84 0.91 1043 accuracy - - 0.84 1810 macro-avg 0.85 0.89 0.87 1810 weighted-avg 0.99 0.84 0.91 1810 ``` --- layout: model title: Recognize Entities OntoNotes - BERT Small author: John Snow Labs name: onto_recognize_entities_bert_small date: 2020-12-09 task: [Named Entity Recognition, Sentence Detection, Embeddings, Pipeline Public] language: en nav_key: models edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [en, open_source, pipeline] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pre-trained pipeline containing a NerDLModel. The NER model was trained on OntoNotes 5.0 with `small_bert_L4_512` embeddings. It can extract the following 18 entities: ## Predicted Entities `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_small_en_2.7.0_2.4_1607511018075.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_small_en_2.7.0_2.4_1607511018075.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('onto_recognize_entities_bert_small') result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("onto_recognize_entities_bert_small") val result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.") ``` {:.nlu-block} ```python import nlu text = ["""Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament."""] ner_df = nlu.load('en.ner.onto.bert.small').predict(text, output_level='chunk') ner_df[["entities", "entities_class"]] ```
{:.h2_title} ## Results ```bash +------------+---------+ |chunk |ner_label| +------------+---------+ |Johnson |PERSON | |first |ORDINAL | |2001 |DATE | |Parliament |ORG | |eight years |DATE | |London |GPE | |2008 to 2016|DATE | |Parliament |ORG | +------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_recognize_entities_bert_small| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - Tokenizer - BertEmbeddings - NerDLModel - NerConverter --- layout: model title: Translate Philippine languages to English Pipeline author: John Snow Labs name: translate_phi_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, phi, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `phi` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_phi_en_xx_2.7.0_2.4_1609690604314.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_phi_en_xx_2.7.0_2.4_1609690604314.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_phi_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_phi_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.phi.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_phi_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from holtin) author: John Snow Labs name: distilbert_qa_holtin_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `holtin`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_holtin_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771281588.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_holtin_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771281588.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_holtin_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_holtin_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_holtin_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/holtin/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal Service agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_service_agreement date: 2022-10-24 tags: [en, legal, classification, document, agreement, contract, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_service_agreement` model is a Legal Longformer Document Classifier that determines whether a document belongs to the service-agreement class or not (Binary Classification). Longformers are limited to 4096 tokens, so only the first 4096 tokens of a document are taken into account. We have found that for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document without extra leading material, 4096 tokens are enough for Document Classification. If that is not the case, let us know and we can apply another approach: splitting the document into 4096-token chunks, averaging their embeddings, and training on the averaged version, so that the whole document is taken into account. In theory, however, this should not be required. 
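The chunk-and-average fallback described above can be sketched in plain Python. This is only an illustrative outline under stated assumptions: `chunk_tokens` and `average_embeddings` are hypothetical helpers, and the call that actually embeds a chunk is left out.

```python
# Illustrative sketch of the chunk-and-average fallback described above.
# Assumption: an upstream step has produced a token list and can embed
# each chunk into a fixed-size vector; only the chunking and averaging
# logic is shown here.

def chunk_tokens(tokens, max_len=4096):
    """Yield consecutive windows of at most max_len tokens."""
    for start in range(0, len(tokens), max_len):
        yield tokens[start:start + max_len]

def average_embeddings(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

A 10,000-token document would yield three chunks (4096, 4096 and 1808 tokens); embedding each chunk and averaging the vectors gives a single document representation that reflects the whole text rather than only its first 4096 tokens.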
## Predicted Entities `other`, `service-agreement` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_service_agreement_en_1.0.0_3.0_1666621014311.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_service_agreement_en_1.0.0_3.0_1666621014311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token") \ .setOutputCol("embeddings") sembeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_service_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, sembeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[service-agreement]| |[other]| |[other]| |[service-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_service_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.94 0.98 0.96 52 service-agreement 0.96 0.90 0.93 30 accuracy - - 0.95 82 macro avg 0.95 0.94 0.95 82 weighted avg 0.95 0.95 0.95 82 ``` --- layout: model title: Legal Currency Clause Binary Classifier author: John Snow Labs name: legclf_currency_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `currency` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, obtaining as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `currency` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_currency_clause_en_1.0.0_3.2_1660123389175.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_currency_clause_en_1.0.0_3.2_1660123389175.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_currency_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
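The paragraph-splitting (by multiline) technique recommended above can be sketched with a plain-Python helper. This is an illustrative assumption, not part of the model: `split_paragraphs` is a hypothetical name, and the sketch assumes paragraphs in the raw text are separated by blank lines.

```python
import re

# Hypothetical helper: break a long contract on blank lines so each
# piece stays within the 512-token embedding limit before classification.
def split_paragraphs(text):
    """Split on one or more blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
```

Each resulting paragraph can then be fed to the classifier as a separate row in the `clause_text` column.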
## Results ```bash +-------+ | result| +-------+ |[currency]| |[other]| |[other]| |[currency]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_currency_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support currency 1.00 0.98 0.99 49 other 0.99 1.00 0.99 96 accuracy - - 0.99 145 macro-avg 0.99 0.99 0.99 145 weighted-avg 0.99 0.99 0.99 145 ``` --- layout: model title: Clinical Deidentification Pipeline (English) author: John Snow Labs name: clinical_deidentification date: 2022-03-03 tags: [deidentification, en, licensed, pipeline, clinical] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true recommended: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. 
The pipeline can mask and obfuscate `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`, `CITY`, `COUNTRY`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `STREET`, `USERNAME`, `ZIP`, `ACCOUNT`, `LICENSE`, `VIN`, `SSN`, `DLN`, `PLATE`, `IPADDR`, `EMAIL` entities. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_en_3.4.1_3.0_1646335814560.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_en_3.4.1_3.0_1646335814560.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "en", "clinical/models") sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""" result = deid_pipeline.annotate(sample) print("\n".join(result['masked'])) print("\n".join(result['masked_with_chars'])) print("\n".join(result['masked_fixed_length_chars'])) print("\n".join(result['obfuscated'])) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = new PretrainedPipeline("clinical_deidentification","en","clinical/models") val sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""" val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("en.de_identify.clinical_pipeline").predict("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""") ```
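To illustrate the masking strategies this pipeline applies (entity labels, same-length character masks, and fixed-length character masks), the minimal sketch below shows the core substitutions. It is not the pipeline's implementation; the helper names, and the assumption that PHI chunks have already been detected by a NER stage, are hypothetical.

```python
# Hypothetical sketch of three masking modes; `chunk` is a PHI span
# that a NER stage is assumed to have detected in `text`.

def mask_with_label(text, chunk, label):
    """Replace the chunk with its entity label, e.g. <DOCTOR>."""
    return text.replace(chunk, f"<{label}>")

def mask_with_chars(text, chunk):
    """Replace the chunk with a same-length run of asterisks in brackets."""
    return text.replace(chunk, "[" + "*" * (len(chunk) - 2) + "]")

def mask_fixed_length(text, chunk):
    """Replace the chunk with a fixed four-asterisk mask."""
    return text.replace(chunk, "****")
```

Obfuscation, by contrast, swaps each chunk for a realistic fake value of the same entity type, which the pretrained pipeline handles internally.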
## Results ```bash Masked with entity labels ------------------------------ Name : , Record date: , # . Dr. , ID, IP . He is a male was admitted to the for cystectomy on . Patient's VIN : , SSN , Driver's license . Phone , , , E-MAIL: . Masked with chars ------------------------------ Name : [**************], Record date: [********], # [****]. Dr. [********], ID[**********], IP [************]. He is a [*********] male was admitted to the [**********] for cystectomy on [******]. Patient's VIN : [***************], SSN [**********], Driver's license [*********]. Phone [************], [***************], [***********], E-MAIL: [*************]. Masked with fixed length chars ------------------------------ Name : ****, Record date: ****, # ****. Dr. ****, ID****, IP ****. He is a **** male was admitted to the **** for cystectomy on ****. Patient's VIN : ****, SSN ****, Driver's license ****. Phone ****, ****, ****, E-MAIL: ****. Obfuscated ------------------------------ Name : Berneta Phenes, Record date: 2093-03-14, # Y5003067. Dr. Dr Gaston Margo, IDOX:8976967, IP 001.001.001.001. He is a 91 male was admitted to the MADONNA REHABILITATION HOSPITAL for cystectomy on 07-22-1994. Patient's VIN : 5eeee44ffff555666, SSN 999-84-3686, Driver's license S99956482. Phone 74 617 042, 1407 west stassney lane, Edmonton, E-MAIL: Carliss@hotmail.com. 
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ChunkMergeModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: English asr_asr_with_transformers_wav2vec2 TFWav2Vec2ForCTC from osanseviero author: John Snow Labs name: pipeline_asr_asr_with_transformers_wav2vec2 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_asr_with_transformers_wav2vec2` is an English model originally trained by osanseviero. 
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_asr_with_transformers_wav2vec2_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_asr_with_transformers_wav2vec2_en_4.2.0_3.0_1664043860600.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_asr_with_transformers_wav2vec2_en_4.2.0_3.0_1664043860600.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_asr_with_transformers_wav2vec2', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_asr_with_transformers_wav2vec2", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_asr_with_transformers_wav2vec2| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|227.6 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English asr_wav2vec2_large_robust_LS960 TFWav2Vec2ForCTC from leonardvorbeck author: John Snow Labs name: pipeline_asr_wav2vec2_large_robust_LS960 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_robust_LS960` is an English model originally trained by leonardvorbeck. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_robust_LS960_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_robust_LS960_en_4.2.0_3.0_1664023749748.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_robust_LS960_en_4.2.0_3.0_1664023749748.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_robust_LS960', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_robust_LS960", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_robust_LS960| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|757.6 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering model (from Alexander-Learn) author: John Snow Labs name: bert_qa_Alexander_Learn_bert_finetuned_squad date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `Alexander-Learn`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Alexander_Learn_bert_finetuned_squad_en_4.0.0_3.0_1654535398223.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Alexander_Learn_bert_finetuned_squad_en_4.0.0_3.0_1654535398223.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Alexander_Learn_bert_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_Alexander_Learn_bert_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_Alexander-Learn").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_Alexander_Learn_bert_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Alexander-Learn/bert-finetuned-squad --- layout: model title: Pipeline to Detect Cellular/Molecular Biology Entities author: John Snow Labs name: ner_cellular_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_cellular](https://nlp.johnsnowlabs.com/2021/03/31/ner_cellular_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_cellular_pipeline_en_4.3.0_3.2_1678863174517.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_cellular_pipeline_en_4.3.0_3.2_1678863174517.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_cellular_pipeline", "en", "clinical/models") text = '''Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_cellular_pipeline", "en", "clinical/models") val text = "Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. 
Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.cellular.pipeline").predict("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""") ```
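For readers who want to reshape the `fullAnnotate` output into the tabular form shown under Results, here is a minimal, framework-free sketch. The `Annotation` dataclass below is a hypothetical stand-in; in Spark NLP the objects returned for the `ner_chunk` column expose the same fields (`result`, `begin`, `end`, `metadata`).

```python
# Stand-in for sparknlp.annotation.Annotation, used only to keep this
# sketch runnable without a Spark session.
from dataclasses import dataclass, field

@dataclass
class Annotation:
    result: str
    begin: int
    end: int
    metadata: dict = field(default_factory=dict)

def chunks_to_rows(chunks):
    """Flatten ner_chunk annotations into (chunk, begin, end, label, confidence) rows."""
    return [
        (c.result, c.begin, c.end,
         c.metadata.get("entity"), float(c.metadata.get("confidence", 0.0)))
        for c in chunks
    ]

# Toy input mirroring one row of the Results table below.
demo = [Annotation("Tax", 186, 188, {"entity": "protein", "confidence": "0.9869"})]
print(chunks_to_rows(demo))  # -> [('Tax', 186, 188, 'protein', 0.9869)]
```

With real pipeline output you would pass `result[0]["ner_chunk"]` instead of the toy list.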
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:------------------------------------------------------------|--------:|------:|:------------|-------------:| | 0 | intracellular signaling proteins | 27 | 58 | protein | 0.763367 | | 1 | human T-cell leukemia virus type 1 promoter | 130 | 172 | DNA | 0.559172 | | 2 | Tax | 186 | 188 | protein | 0.9869 | | 3 | Tax-responsive element 1 | 193 | 216 | DNA | 0.973067 | | 4 | cyclic AMP-responsive members | 237 | 265 | protein | 0.6491 | | 5 | CREB/ATF family | 274 | 288 | protein | 0.85205 | | 6 | transcription factors | 293 | 313 | protein | 0.78645 | | 7 | Tax | 389 | 391 | protein | 0.9584 | | 8 | human T-cell leukemia virus type 1 Tax-responsive element 1 | 396 | 454 | DNA | 0.678156 | | 9 | TRE-1), | 457 | 463 | DNA | 0.7465 | | 10 | lacZ gene | 582 | 590 | DNA | 0.9465 | | 11 | CYC1 promoter | 617 | 629 | DNA | 0.9915 | | 12 | TRE-1 | 663 | 667 | DNA | 0.9989 | | 13 | cyclic AMP response element-binding protein | 695 | 737 | protein | 0.8081 | | 14 | CREB | 740 | 743 | protein | 0.9962 | | 15 | CREB | 749 | 752 | protein | 0.994 | | 16 | GAL4 activation domain | 767 | 788 | protein | 0.802233 | | 17 | GAD | 791 | 793 | protein | 0.9932 | | 18 | reporter gene | 848 | 860 | DNA | 0.78715 | | 19 | Tax | 863 | 865 | protein | 0.9986 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_cellular_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Translate Spanish to English Pipeline author: John Snow Labs name: translate_es_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, es, en, xx] 
supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `es` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_es_en_xx_2.7.0_2.4_1609685934740.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_es_en_xx_2.7.0_2.4_1609685934740.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_es_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_es_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.es.translate_to.en').predict(text, output_level='sentence') translate_df ```
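`annotate` returns a plain dictionary mapping each output column to a list of strings. A small sketch of joining the translated sentences back into one string; the `"translation"` key name is an assumption based on this pipeline's output column — inspect your own `annotate` result to confirm it.

```python
def join_translation(annotations: dict, key: str = "translation") -> str:
    """Concatenate the per-sentence translations from an annotate() dict.

    `annotations` is the dict returned by PretrainedPipeline.annotate();
    the key defaults to "translation" (an assumption - check your output).
    """
    return " ".join(annotations.get(key, []))

# Toy dict shaped like annotate() output, not real pipeline output.
sample = {"sentence": ["¡Hola!", "¿Cómo estás?"],
          "translation": ["Hello!", "How are you?"]}
print(join_translation(sample))  # -> "Hello! How are you?"
```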
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_es_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: German asr_wav2vec2_large_xlsr_53_german_with_lm TFWav2Vec2ForCTC from aware-ai author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_german_with_lm date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_german_with_lm` is a German model originally trained by aware-ai. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_german_with_lm_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_german_with_lm_de_4.2.0_3.0_1664098536828.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_german_with_lm_de_4.2.0_3.0_1664098536828.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_german_with_lm', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_german_with_lm", lang = "de") val annotations = pipeline.transform(audioDF) ```
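The pipeline's `transform` expects `audioDF` to hold a column of normalized float arrays (mono audio, typically 16 kHz). A stdlib-only sketch of decoding 16-bit PCM samples into such floats; reading the WAV container itself (e.g. with the `wave` module) and building the Spark DataFrame are assumed and left out.

```python
import struct

def pcm16_to_floats(raw: bytes) -> list[float]:
    """Decode little-endian 16-bit PCM bytes into floats in [-1.0, 1.0)."""
    n = len(raw) // 2
    samples = struct.unpack("<%dh" % n, raw[: n * 2])
    return [s / 32768.0 for s in samples]

# Three synthetic samples: silence, half amplitude, most-negative value.
raw = struct.pack("<3h", 0, 16384, -32768)
print(pcm16_to_floats(raw))  # -> [0.0, 0.5, -1.0]
```

The resulting list of floats is what you would place in the `audio_content` column consumed by the AudioAssembler stage inside the pipeline.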
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_german_with_lm| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Greek RobertaForMaskedLM Base Uncased model (from gealexandri) author: John Snow Labs name: roberta_embeddings_palobert_base_greek_uncased_v1 date: 2022-12-12 tags: [el, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: el edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `palobert-base-greek-uncased-v1` is a Greek model originally trained by `gealexandri`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_palobert_base_greek_uncased_v1_el_4.2.4_3.0_1670858886972.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_palobert_base_greek_uncased_v1_el_4.2.4_3.0_1670858886972.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_palobert_base_greek_uncased_v1","el") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_palobert_base_greek_uncased_v1","el") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
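Once the `embeddings` column is populated, a common follow-up is comparing token or sentence vectors. A dependency-free cosine-similarity sketch (the vectors below are toy values, not real model output):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # -> 1.0 (identical direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # -> 0.0 (orthogonal)
```

In practice you would extract the float arrays from the `embeddings` annotations in `result` and feed pairs of them to this function.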
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_palobert_base_greek_uncased_v1| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|el| |Size:|314.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/gealexandri/palobert-base-greek-uncased-v1 - https://arxiv.org/abs/1907.11692 - http://www.paloservices.com/ --- layout: model title: Part of Speech for Czech author: John Snow Labs name: pos_ud_pdt date: 2020-05-04 23:32:00 +0800 task: Part of Speech Tagging language: cs edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [pos, cs] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_pdt_cs_2.5.0_2.4_1588622155494.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_pdt_cs_2.5.0_2.4_1588622155494.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_pdt", "cs") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Kromě toho, že je králem severu, je John Snow anglickým lékařem a lídrem ve vývoji anestezie a lékařské hygieny.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_pdt", "cs") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Kromě toho, že je králem severu, je John Snow anglickým lékařem a lídrem ve vývoji anestezie a lékařské hygieny.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Kromě toho, že je králem severu, je John Snow anglickým lékařem a lídrem ve vývoji anestezie a lékařské hygieny."""] pos_df = nlu.load('cs.pos.ud_pdt').predict(text, output_level='token') pos_df ```
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=4, result='ADP', metadata={'word': 'Kromě'}), Row(annotatorType='pos', begin=6, end=9, result='DET', metadata={'word': 'toho'}), Row(annotatorType='pos', begin=10, end=10, result='PUNCT', metadata={'word': ','}), Row(annotatorType='pos', begin=12, end=13, result='SCONJ', metadata={'word': 'že'}), Row(annotatorType='pos', begin=15, end=16, result='AUX', metadata={'word': 'je'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_pdt| |Type:|pos| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|cs| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: SDOH Community Present Binary Classification author: John Snow Labs name: bert_sequence_classifier_sdoh_community_present_status date: 2022-12-18 tags: [en, licensed, clinical, sequence_classification, classifier, community_present, sdoh] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model classifies clinical documents according to the presence of social support, such as a family member or friend. A discharge summary was classified True for Community-Present if it contained passages describing active social support, and False if no such passages were found.
## Predicted Entities `True`, `False` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_sdoh_community_present_status_en_4.2.2_3.0_1671371389301.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_sdoh_community_present_status_en_4.2.2_3.0_1671371389301.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_sdoh_community_present_status", "en", "clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) sample_texts = ["Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair 2137. Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. Name (NI) past or present smoking hx, no EtOH.", "Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep Apnea. Denies tobacco and ETOH. Works as cafeteria worker."] data = spark.createDataFrame(sample_texts, StringType()).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_sdoh_community_present_status", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) val data = Seq("Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep Apnea. Denies tobacco and ETOH. 
Works as cafeteria worker.") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.bert_sequence.sdoh_community_present_status").predict("""Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair 2137. Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. Name (NI) past or present smoking hx, no EtOH.""") ```
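Cards like this one report per-label precision, recall, and F1 in a Benchmarking section. A small sketch of how those figures derive from raw counts (the counts below are toy values, not this model's confusion matrix):

```python
def prf(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from true-positive/false-positive/false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy counts: 8 correct positives, 2 false alarms, 2 misses.
p, r, f1 = prf(tp=8, fp=2, fn=2)
print(round(p, 2), round(r, 2), round(f1, 2))  # -> 0.8 0.8 0.8
```

Macro-average is then the unweighted mean of the per-label scores, while the weighted average weights each label by its support.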
## Results ```bash +----------------------------------------------------------------------------------------------------+-------+ | text| result| +----------------------------------------------------------------------------------------------------+-------+ |Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair...| [True]| |Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep...|[False]| +----------------------------------------------------------------------------------------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_sdoh_community_present_status| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|410.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## Benchmarking ```bash label precision recall f1-score support False 0.95 0.68 0.80 203 True 0.85 0.98 0.91 359 accuracy - - 0.87 562 macro-avg 0.90 0.83 0.85 562 weighted-avg 0.88 0.87 0.87 562 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Tiv author: John Snow Labs name: opus_mt_en_tiv date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, tiv, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `en` - target languages: `tiv` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_tiv_xx_2.7.0_2.4_1609170083951.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_tiv_xx_2.7.0_2.4_1609170083951.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_tiv", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_tiv", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.tiv').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_tiv| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Arabic Part of Speech Tagger (from CAMeL-Lab) author: John Snow Labs name: bert_pos_bert_base_arabic_camelbert_ca_pos_msa date: 2022-05-09 tags: [bert, pos, part_of_speech, ar, open_source] task: Part of Speech Tagging language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-ca-pos-msa` is an Arabic model originally trained by `CAMeL-Lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_ca_pos_msa_ar_3.4.2_3.0_1652092479978.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_ca_pos_msa_ar_3.4.2_3.0_1652092479978.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_ca_pos_msa","ar") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_ca_pos_msa","ar") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("أنا أحب الشرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_arabic_camelbert_ca_pos_msa| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|ar| |Size:|407.3 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-ca-pos-msa - https://dl.acm.org/doi/pdf/10.5555/1621804.1621808 - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://github.com/CAMeL-Lab/camel_tools --- layout: model title: German Bert Embeddings author: John Snow Labs name: bert_embeddings_bert_base_german_uncased date: 2022-04-11 tags: [bert, embeddings, de, open_source] task: Embeddings language: de edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-german-uncased` is a German model originally trained by `dbmdz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_german_uncased_de_3.4.2_3.0_1649675856996.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_german_uncased_de_3.4.2_3.0_1649675856996.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_german_uncased","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Funken NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_german_uncased","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Funken NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.embed.bert_base_german_uncased").predict("""Ich liebe Funken NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_german_uncased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|412.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/dbmdz/bert-base-german-uncased - https://deepset.ai/german-bert - https://deepset.ai/ - https://spacy.io/ - https://github.com/allenai/scibert - https://github.com/stefan-it/fine-tuned-berts-seq - https://github.com/dbmdz/berts/issues/new --- layout: model title: Persian Named Entity Recognition (from HooshvareLab) author: John Snow Labs name: bert_ner_bert_base_parsbert_armanner_uncased date: 2022-05-09 tags: [bert, ner, token_classification, fa, open_source] task: Named Entity Recognition language: fa edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-parsbert-armanner-uncased` is a Persian model originally trained by `HooshvareLab`. ## Predicted Entities `fac`, `pers`, `pro`, `event`, `org`, `loc` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_parsbert_armanner_uncased_fa_3.4.2_3.0_1652099598361.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_parsbert_armanner_uncased_fa_3.4.2_3.0_1652099598361.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_parsbert_armanner_uncased","fa") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["من عاشق جرقه nlp هستم"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_parsbert_armanner_uncased","fa") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("من عاشق جرقه nlp هستم").toDF("text") val result = pipeline.fit(data).transform(data) ```
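The token classifier emits one tag per token in BIO style (`B-pers`, `I-pers`, `O`, …). In full pipelines a NerConverter stage merges these into entity chunks; here is a framework-free sketch of that grouping logic, using toy tokens rather than real model output:

```python
def bio_to_chunks(tokens, tags):
    """Group (token, BIO-tag) pairs into (label, chunk_text) entity chunks."""
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = [tag[2:], [tok]]            # start a new chunk
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)                # continue the open chunk
        else:
            if current:                           # "O" or mismatched I- tag
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(words)) for label, words in chunks]

print(bio_to_chunks(["John", "Snow", "lives", "in", "London"],
                    ["B-pers", "I-pers", "O", "O", "B-loc"]))
# -> [('pers', 'John Snow'), ('loc', 'London')]
```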
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_ner_bert_base_parsbert_armanner_uncased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|fa|
|Size:|607.0 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/HooshvareLab/bert-base-parsbert-armanner-uncased
- https://arxiv.org/abs/2005.12515
- https://github.com/HaniehP/PersianNER
- https://github.com/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb
- https://colab.research.google.com/github/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb
- https://tensorflow.org/tfrc
- https://hooshvare.com
- https://www.linkedin.com/in/m3hrdadfi/
- https://twitter.com/m3hrdadfi
- https://github.com/m3hrdadfi
- https://www.linkedin.com/in/mohammad-gharachorloo/
- https://twitter.com/MGharachorloo
- https://github.com/baarsaam
- https://www.linkedin.com/in/marziehphi/
- https://twitter.com/marziehphi
- https://github.com/marziehphi
- https://www.linkedin.com/in/mohammad-manthouri-aka-mansouri-07030766/
- https://twitter.com/mmanthouri
- https://github.com/mmanthouri
- https://hooshvare.com/
- https://www.linkedin.com/company/hooshvare
- https://twitter.com/hooshvare
- https://github.com/hooshvare
- https://www.instagram.com/hooshvare/
- https://www.linkedin.com/in/sara-tabrizi-64548b79/
- https://www.behance.net/saratabrizi
- https://www.instagram.com/sara_b_tabrizi/

---
layout: model
title: Japanese BertForMaskedLM Base Cased model (from cl-tohoku)
author: John Snow Labs
name: bert_embeddings_base_japanese_char
date: 2022-12-02
tags: [ja, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: ja
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForMaskedLM model, adapted
from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-japanese-char` is a Japanese model originally trained by `cl-tohoku`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_japanese_char_ja_4.2.4_3.0_1670018147657.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_japanese_char_ja_4.2.4_3.0_1670018147657.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_japanese_char","ja") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_japanese_char","ja")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
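The pipeline above produces one dense vector per token in the `embeddings` column. A common downstream use of such vectors is comparing them with cosine similarity; the sketch below is a minimal, framework-free illustration (the toy 4-dimensional vectors are made-up stand-ins for the model's real output, not values this model returns):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real BERT token embeddings.
v1 = [0.1, 0.3, -0.2, 0.7]
v2 = [0.1, 0.3, -0.2, 0.7]

print(round(cosine_similarity(v1, v2), 3))  # identical vectors -> 1.0
```

In practice the vectors would come from the `embeddings` annotations in `result`, one per token.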
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_japanese_char|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ja|
|Size:|334.3 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/cl-tohoku/bert-base-japanese-char
- https://github.com/google-research/bert
- https://github.com/cl-tohoku/bert-japanese/tree/v1.0
- https://github.com/attardi/wikiextractor
- https://taku910.github.io/mecab/
- https://creativecommons.org/licenses/by-sa/3.0/
- https://www.tensorflow.org/tfrc/

---
layout: model
title: English BertForQuestionAnswering model (from SauravMaheshkar)
author: John Snow Labs
name: bert_qa_bert_multi_cased_finedtuned_xquad_chaii
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-multi-cased-finedtuned-xquad-chaii` is an English model originally trained by `SauravMaheshkar`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_multi_cased_finedtuned_xquad_chaii_en_4.0.0_3.0_1654183791284.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_multi_cased_finedtuned_xquad_chaii_en_4.0.0_3.0_1654183791284.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_multi_cased_finedtuned_xquad_chaii","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_multi_cased_finedtuned_xquad_chaii","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.xquad_chaii.bert.cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_multi_cased_finedtuned_xquad_chaii|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/SauravMaheshkar/bert-multi-cased-finedtuned-xquad-chaii

---
layout: model
title: Visual NER on 10K Filings (SEC)
author: John Snow Labs
name: visualner_10kfilings
date: 2023-01-10
tags: [en, licensed]
task: OCR Object Detection
language: en
nav_key: models
edition: Visual NLP 4.0.0
spark_version: 3.2
supported: true
annotator: VisualDocumentNERv21
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Visual NER model aimed at extracting the main key points in the summary page of SEC 10-K filings (Annual reports).

## Predicted Entities

`REGISTRANT`, `ADDRESS`, `PHONE`, `DATE`, `EMPLOYERIDNB`, `EXCHANGE`, `STATE`, `STOCKCLASS`, `STOCKVALUE`, `TRADINGSYMBOL`, `FILENUMBER`

{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange.button-orange-trans.co.button-icon}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/90.2.Financial_Visual_NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/visualner_10kfilings_en_4.0.0_3.2_1663769328577.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/visualner_10kfilings_en_4.0.0_3.2_1663769328577.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from pyspark.sql import functions as f

binary_to_image = BinaryToImage()\
    .setInputCol("content") \
    .setOutputCol("image") \
    .setImageType(ImageType.TYPE_3BYTE_BGR)

img_to_hocr = ImageToHocr()\
    .setInputCol("image")\
    .setOutputCol("hocr")\
    .setIgnoreResolution(False)\
    .setOcrParams(["preserve_interword_spaces=0"])

tokenizer = HocrTokenizer()\
    .setInputCol("hocr")\
    .setOutputCol("token")

doc_ner = VisualDocumentNerV21()\
    .pretrained("visualner_10kfilings", "en", "clinical/models")\
    .setInputCols(["token", "image"])\
    .setOutputCol("entities")

draw = ImageDrawAnnotations() \
    .setInputCol("image") \
    .setInputChunksCol("entities") \
    .setOutputCol("image_with_annotations") \
    .setFontSize(10) \
    .setLineWidth(4)\
    .setRectColor(Color.red)

# OCR pipeline
pipeline = PipelineModel(stages=[
    binary_to_image,
    img_to_hocr,
    tokenizer,
    doc_ner,
    draw
])

bin_df = spark.read.format("binaryFile").load('data/t01.jpg')

results = pipeline.transform(bin_df).cache()
res = results.collect()

path_array = f.split(results['path'], '/')
results.withColumn('filename', path_array.getItem(f.size(path_array) - 1)) \
    .withColumn("exploded_entities", f.explode("entities")) \
    .select("filename", "exploded_entities") \
    .show(truncate=False)
```
```scala
import org.apache.spark.sql.functions._

val binary_to_image = new BinaryToImage()
    .setInputCol("content")
    .setOutputCol("image")
    .setImageType(ImageType.TYPE_3BYTE_BGR)

val img_to_hocr = new ImageToHocr()
    .setInputCol("image")
    .setOutputCol("hocr")
    .setIgnoreResolution(false)
    .setOcrParams(Array("preserve_interword_spaces=0"))

val tokenizer = new HocrTokenizer()
    .setInputCol("hocr")
    .setOutputCol("token")

val doc_ner = VisualDocumentNerV21()
    .pretrained("visualner_10kfilings", "en", "clinical/models")
    .setInputCols(Array("token", "image"))
    .setOutputCol("entities")

val draw = new ImageDrawAnnotations()
    .setInputCol("image")
    .setInputChunksCol("entities")
    .setOutputCol("image_with_annotations")
    .setFontSize(10)
    .setLineWidth(4)
    .setRectColor(Color.red)

// OCR pipeline
val pipeline = new Pipeline().setStages(Array(
    binary_to_image,
    img_to_hocr,
    tokenizer,
    doc_ner,
    draw))

val bin_df = spark.read.format("binaryFile").load("data/t01.jpg")

val results = pipeline.fit(bin_df).transform(bin_df).cache()
val res = results.collect()

val path_array = split(col("path"), "/")
results.withColumn("filename", path_array.getItem(size(path_array) - 1))
    .withColumn("exploded_entities", explode(col("entities")))
    .select("filename", "exploded_entities")
    .show(truncate = false)
```
## Example {%- capture input_image -%} ![Screenshot](/assets/images/examples_ocr/image10.jpeg) {%- endcapture -%} {%- capture output_image -%} ![Screenshot](/assets/images/examples_ocr/image10_out.png) {%- endcapture -%} {% include templates/input_output_image.md input_image=input_image output_image=output_image %} ## Output text ```bash +--------+----------------------------------------------------------------------------------------------------------------------------------------------------------+ |filename|exploded_entities | +--------+----------------------------------------------------------------------------------------------------------------------------------------------------------+ |t01.jpg |{named_entity, 712, 716, OTHERS, {confidence -> 96, width -> 74, x -> 1557, y -> 416, word -> Ended, token -> ended, height -> 18}, []} | |t01.jpg |{named_entity, 718, 724, DATE-B, {confidence -> 96, width -> 97, x -> 1639, y -> 416, word -> January, token -> january, height -> 24}, []} | |t01.jpg |{named_entity, 726, 727, DATE-I, {confidence -> 95, width -> 34, x -> 1743, y -> 416, word -> 31,, token -> 31, height -> 22}, []} | |t01.jpg |{named_entity, 730, 733, DATE-I, {confidence -> 96, width -> 54, x -> 1785, y -> 416, word -> 2021, token -> 2021, height -> 18}, []} | |t01.jpg |{named_entity, 735, 744, OTHERS, {confidence -> 91, width -> 143, x -> 1372, y -> 472, word -> Commission, token -> commission, height -> 18}, []} | |t01.jpg |{named_entity, 746, 749, OTHERS, {confidence -> 96, width -> 36, x -> 1523, y -> 472, word -> file, token -> file, height -> 18}, []} | |t01.jpg |{named_entity, 751, 756, OTHERS, {confidence -> 92, width -> 96, x -> 1568, y -> 472, word -> number:, token -> number, height -> 18}, []} | |t01.jpg |{named_entity, 759, 761, FILENUMBER-B, {confidence -> 92, width -> 119, x -> 1675, y -> 472, word -> 001-39495, token -> 001, height -> 18}, []} | |t01.jpg |{named_entity, 769, 773, REGISTRANT-B, {confidence -> 92, width -> 136, x -> 1472, y 
-> 558, word -> ASANA,, token -> asana, height -> 31}, []} | |t01.jpg |{named_entity, 776, 778, REGISTRANT-I, {confidence -> 95, width -> 72, x -> 1620, y -> 558, word -> INC., token -> inc, height -> 25}, []} |
+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|visualner_10kfilings|
|Type:|ocr|
|Compatibility:|Visual NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|744.4 MB|

## References

SEC 10k filings

---
layout: model
title: RoBERTa Large CoNLL-03 NER Pipeline
author: John Snow Labs
name: roberta_large_token_classifier_conll03_pipeline
date: 2022-04-23
tags: [open_source, ner, token_classifier, roberta, conll03, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [roberta_large_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/26/roberta_large_token_classifier_conll03_en.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_large_token_classifier_conll03_pipeline_en_3.4.1_3.0_1650718129682.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_large_token_classifier_conll03_pipeline_en_3.4.1_3.0_1650718129682.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("roberta_large_token_classifier_conll03_pipeline", lang = "en")

pipeline.annotate("My name is John and I work at John Snow Labs.")
```
```scala
val pipeline = new PretrainedPipeline("roberta_large_token_classifier_conll03_pipeline", lang = "en")

pipeline.annotate("My name is John and I work at John Snow Labs.")
```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PERSON | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_large_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - RoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from mrm8488) author: John Snow Labs name: roberta_qa_ruperta_base_finetuned_squadv1 date: 2022-12-02 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `RuPERTa-base-finetuned-squadv1` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ruperta_base_finetuned_squadv1_es_4.2.4_3.0_1669984876949.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ruperta_base_finetuned_squadv1_es_4.2.4_3.0_1669984876949.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ruperta_base_finetuned_squadv1","es")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ruperta_base_finetuned_squadv1","es")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_ruperta_base_finetuned_squadv1|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|470.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/mrm8488/RuPERTa-base-finetuned-squadv1

---
layout: model
title: English BertForQuestionAnswering Base Uncased model (from kamalkraj)
author: John Snow Labs
name: bert_qa_base_uncased_squad_v2.0_finetuned
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squad-v2.0-finetuned` is an English model originally trained by `kamalkraj`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_squad_v2.0_finetuned_en_4.0.0_3.0_1657185760559.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_squad_v2.0_finetuned_en_4.0.0_3.0_1657185760559.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_squad_v2.0_finetuned","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_squad_v2.0_finetuned","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_squad_v2.0_finetuned|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/kamalkraj/bert-base-uncased-squad-v2.0-finetuned

---
layout: model
title: Legal Bank accounts Clause Binary Classifier
author: John Snow Labs
name: legclf_bank_accounts_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `bank-accounts` clause type. To use this model, make sure you provide enough context as input.

Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `bank-accounts`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_bank_accounts_clause_en_1.0.0_3.2_1660122150189.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_bank_accounts_clause_en_1.0.0_3.2_1660122150189.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_bank_accounts_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
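The paragraph-level splitting recommended above can be approximated without any Spark NLP component; the sketch below is a minimal illustration, assuming clauses are separated by blank lines (the sample text and variable names are illustrative only):

```python
import re

def split_into_paragraphs(text):
    # Split on runs of blank lines (the "multiline" splitting mentioned above)
    # and drop empty fragments, so each clause candidate becomes its own row.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

document = (
    "The Company shall maintain one or more bank accounts.\n\n"
    "This Agreement shall be governed by the laws of Delaware."
)

paragraphs = split_into_paragraphs(document)
# Each paragraph can then be loaded as a row of the `clause_text` column, e.g.:
# df = spark.createDataFrame([[p] for p in paragraphs]).toDF("clause_text")
print(len(paragraphs))  # 2
```

This keeps each input under the 512-token limit of the sentence embeddings far more reliably than feeding whole documents.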
## Results

```bash
+---------------+
|         result|
+---------------+
|[bank-accounts]|
|        [other]|
|        [other]|
|[bank-accounts]|
+---------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_bank_accounts_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
        label  precision  recall  f1-score  support
bank-accounts       1.00    0.98      0.99       43
        other       0.99    1.00      1.00      112
     accuracy          -       -      0.99      155
    macro-avg       1.00    0.99      0.99      155
 weighted-avg       0.99    0.99      0.99      155
```

---
layout: model
title: Bemba (Zambia) asr_wav2vec2_large_xlsr_bemba TFWav2Vec2ForCTC from csikasote
author: John Snow Labs
name: asr_wav2vec2_large_xlsr_bemba
date: 2022-09-24
tags: [wav2vec2, bem, audio, open_source, asr]
task: Automatic Speech Recognition
language: bem
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_bemba` is a Bemba (Zambia) model originally trained by csikasote.

NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_bemba_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_bemba_bem_4.2.0_3.0_1664023099610.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_bemba_bem_4.2.0_3.0_1664023099610.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_bemba", "bem")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_bemba", "bem") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
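The snippets above assume an `audioDf` whose `audio_content` column holds the raw waveform as an array of floats. One way to produce that array from a 16-bit PCM mono WAV file, using only the standard library, is sketched below (the file name and the 16 kHz assumption are illustrative; any audio library that returns normalised floats works equally well):

```python
import struct
import wave

def read_wav_as_floats(path):
    # Read 16-bit PCM samples and normalise them to [-1.0, 1.0],
    # the float representation expected in the audio_content column.
    with wave.open(path, "rb") as wav:
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Hypothetical usage with a 16 kHz mono recording:
# floats = read_wav_as_floats("speech.wav")
# audioDf = spark.createDataFrame([(floats,)], ["audio_content"])
```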
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_bemba| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|bem| |Size:|1.2 GB| --- layout: model title: Spanish ElectraForQuestionAnswering model (from hackathon-pln-es) author: John Snow Labs name: electra_qa_biomedtra_small_es_squad2 date: 2022-06-22 tags: [es, open_source, electra, question_answering] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biomedtra-small-es-squad2-es` is a Spanish model originally trained by `hackathon-pln-es`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_biomedtra_small_es_squad2_es_4.0.0_3.0_1655920306800.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_biomedtra_small_es_squad2_es_4.0.0_3.0_1655920306800.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_biomedtra_small_es_squad2","es") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["¿Cuál es mi nombre?", "Mi nombre es Clara y vivo en Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_biomedtra_small_es_squad2","es")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("¿Cuál es mi nombre?", "Mi nombre es Clara y vivo en Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.squadv2.electra.small").predict("""¿Cuál es mi nombre?|||Mi nombre es Clara y vivo en Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_biomedtra_small_es_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|51.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/hackathon-pln-es/biomedtra-small-es-squad2-es --- layout: model title: Detect Chemicals in Medical text (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_chemicals date: 2021-10-19 tags: [berfortokenclassification, ner, chemicals, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.0 spark_version: 2.4 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract different types of chemical compounds mentioned in text using pretrained NER model. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. ## Predicted Entities `CHEM` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_BERT_TOKEN_CLASSIFIER/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemicals_en_3.3.0_2.4_1634649785035.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemicals_en_3.3.0_2.4_1634649785035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_chemicals", "en", "clinical/models")\ .setInputCols("token", "document")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter]) p_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']}))) test_sentence = """The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.""" result = p_model.transform(spark.createDataFrame(pd.DataFrame({'text': [test_sentence]}))) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_chemicals", "en", "clinical/models") .setInputCols(Array("token", "document")) .setOutputCol("ner") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("document","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter)) val data = Seq("The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. 
A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_chemical").predict("""The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.""") ```
## Results ```bash +---------------------------+---------+ |chunk |ner_label| +---------------------------+---------+ |p - choloroaniline |CHEM | |chlorhexidine - digluconate|CHEM | |kanamycin |CHEM | |colistin |CHEM | |povidone - iodine |CHEM | +---------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_chemicals| |Compatibility:|Healthcare NLP 3.3.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Case sensitive:|true| |Max sentence length:|512| ## Data Source This model is trained on a custom dataset by John Snow Labs. ## Benchmarking ```bash label precision recall f1-score support B-CHEM 0.99 0.92 0.95 30731 I-CHEM 0.99 0.93 0.96 31270 accuracy - - 0.93 62001 macro-avg 0.96 0.95 0.96 62001 weighted-avg 0.99 0.93 0.96 62001 ``` --- layout: model title: Stop Words Cleaner for Japanese author: John Snow Labs name: stopwords_ja date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: ja edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, ja] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions.
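The principle can be illustrated outside Spark NLP with a minimal pure-Python sketch. Note that the stop-word set and tokens below are illustrative assumptions only, not the actual `stopwords_ja` list bundled with the model:

```python
# Illustrative only: a tiny stand-in for stop-word removal.
# The real stopwords_ja list ships with the pretrained model.
stop_words = {"の", "で", "は", "と", "です"}  # assumed subset, for demonstration

def clean_tokens(tokens):
    """Keep only tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

tokens = ["麻酔", "と", "医療", "衛生", "の", "開発"]
print(clean_tokens(tokens))  # → ['麻酔', '医療', '衛生', '開発']
```

The pretrained annotator does exactly this kind of filtering on the `token` column, emitting the surviving tokens as `cleanTokens`.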
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_ja_ja_2.5.4_2.4_1594742438927.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_ja_ja_2.5.4_2.4_1594742438927.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_ja", "ja") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("北の王であることを除いて、ジョン・スノーはイギリスの医師であり、麻酔と医療衛生の開発のリーダーです。") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_ja", "ja") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("北の王であることを除いて、ジョン・スノーはイギリスの医師であり、麻酔と医療衛生の開発のリーダーです。").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""北の王であることを除いて、ジョン・スノーはイギリスの医師であり、麻酔と医療衛生の開発のリーダーです。"""] stopword_df = nlu.load('ja.stopwords').predict(text) stopword_df[['cleanTokens']] ```
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=49, result='北の王であることを除いて、ジョン・スノーはイギリスの医師であり、麻酔と医療衛生の開発のリーダーです。', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_ja| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|ja| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: Legal Headings Clause Binary Classifier author: John Snow Labs name: legclf_headings_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `headings` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `headings` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_headings_clause_en_1.0.0_3.2_1660123578960.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_headings_clause_en_1.0.0_3.2_1660123578960.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_headings_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
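The paragraph-splitting advice from the description can be sketched in plain Python. This is a rough stand-in for the splitting utilities covered in the Legal NLP tutorial linked above, not part of the model itself; the sample clauses are invented for illustration:

```python
import re

def split_paragraphs(document: str):
    """Split a document on blank lines, dropping empty fragments.
    A crude stand-in for the paragraph-splitting ("by multiline") technique."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

doc = ("1. HEADINGS\n\n"
       "The headings in this Agreement are for convenience of reference only.\n\n"
       "2. NOTICES\n\n"
       "All notices under this Agreement shall be in writing.")
for clause in split_paragraphs(doc):
    print(clause)
```

Each resulting fragment can then be fed to the classifier as a separate row in the `clause_text` column, keeping every input comfortably under the 512-token embedding limit.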
## Results ```bash +----------+ |    result| +----------+ |[headings]| |   [other]| |   [other]| |[headings]| +----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_headings_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support headings 0.95 0.97 0.96 131 other 0.98 0.98 0.98 252 accuracy - - 0.97 383 macro-avg 0.97 0.97 0.97 383 weighted-avg 0.97 0.97 0.97 383 ``` --- layout: model title: Fast Neural Machine Translation Model from Afrikaans to Spanish author: John Snow Labs name: opus_mt_af_es date: 2021-06-01 tags: [open_source, seq2seq, translation, af, es, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
source languages: af target languages: es {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_af_es_xx_3.1.0_2.4_1622555083309.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_af_es_xx_3.1.0_2.4_1622555083309.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_af_es", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_af_es", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Afrikaans.translate_to.Spanish').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_af_es| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Detect Living Species (w2v_cc_300d) author: John Snow Labs name: ner_living_species date: 2022-06-22 tags: [pt, ner, clinical, licensed] task: Named Entity Recognition language: pt edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract living species from clinical texts in Portuguese, which is critical to scientific disciplines like medicine, biology, ecology/biodiversity, nutrition and agriculture. This model is trained using `w2v_cc_300d` embeddings. It is trained on the [LivingNER](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) corpus that is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. **NOTE:** 1. The text files were translated from Spanish with a neural machine translation system. 2. The annotations were translated with the same neural machine translation system. 3. The translated annotations were transferred to the translated text files using an annotation transfer technology.
## Predicted Entities `HUMAN`, `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_pt_3.5.3_3.0_1655922628486.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_pt_3.5.3_3.0_1655922628486.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","pt")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_living_species", "pt","clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""Uma rapariga de 16 anos com um historial pessoal de asma apresentou ao departamento de dermatologia com lesões cutâneas assintomáticas que tinham estado presentes durante 2 meses. A paciente tinha sido tratada com creme corticosteróide devido a uma suspeita inicial de eczema atópico, apesar do qual apresentava um crescimento progressivo marcado das lesões. Tinha um gato doméstico que ela nunca tinha levado ao veterinário. O exame físico revelou placas em forma de anel com uma borda periférica activa na parte superior das costas e nos aspectos laterais do pescoço e da face. Cultura local obtida por raspagem de tapete isolado Trichophyton rubrum. 
Com base em dados clínicos e cultura, foi estabelecido o diagnóstico de tinea incognito."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","pt") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_living_species", "pt","clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""Uma rapariga de 16 anos com um historial pessoal de asma apresentou ao departamento de dermatologia com lesões cutâneas assintomáticas que tinham estado presentes durante 2 meses. A paciente tinha sido tratada com creme corticosteróide devido a uma suspeita inicial de eczema atópico, apesar do qual apresentava um crescimento progressivo marcado das lesões. Tinha um gato doméstico que ela nunca tinha levado ao veterinário. O exame físico revelou placas em forma de anel com uma borda periférica activa na parte superior das costas e nos aspectos laterais do pescoço e da face. Cultura local obtida por raspagem de tapete isolado Trichophyton rubrum. 
Com base em dados clínicos e cultura, foi estabelecido o diagnóstico de tinea incognito.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.med_ner.living_species").predict("""Uma rapariga de 16 anos com um historial pessoal de asma apresentou ao departamento de dermatologia com lesões cutâneas assintomáticas que tinham estado presentes durante 2 meses. A paciente tinha sido tratada com creme corticosteróide devido a uma suspeita inicial de eczema atópico, apesar do qual apresentava um crescimento progressivo marcado das lesões. Tinha um gato doméstico que ela nunca tinha levado ao veterinário. O exame físico revelou placas em forma de anel com uma borda periférica activa na parte superior das costas e nos aspectos laterais do pescoço e da face. Cultura local obtida por raspagem de tapete isolado Trichophyton rubrum. Com base em dados clínicos e cultura, foi estabelecido o diagnóstico de tinea incognito.""") ```
## Results ```bash +-------------------+-------+ |ner_chunk |label | +-------------------+-------+ |rapariga |HUMAN | |pessoal |HUMAN | |paciente |HUMAN | |gato |SPECIES| |veterinário |HUMAN | |Trichophyton rubrum|SPECIES| +-------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|pt| |Size:|15.1 MB| ## References [https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) ## Benchmarking ```bash label precision recall f1-score support B-HUMAN 0.80 0.94 0.87 2832 B-SPECIES 0.52 0.89 0.65 2810 I-HUMAN 0.80 0.62 0.70 180 I-SPECIES 0.68 0.82 0.74 1107 micro-avg 0.64 0.89 0.75 6929 macro-avg 0.70 0.82 0.74 6929 weighted-avg 0.67 0.89 0.76 6929 ``` --- layout: model title: English DistilBertForQuestionAnswering model (from fadhilarkan) author: John Snow Labs name: distilbert_qa_fadhilarkan_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `fadhilarkan`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_fadhilarkan_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725246680.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_fadhilarkan_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725246680.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_fadhilarkan_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_fadhilarkan_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_fadhilarkan").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_fadhilarkan_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/fadhilarkan/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal NER for MAPA(Multilingual Anonymisation for Public Administrations) author: John Snow Labs name: legner_mapa date: 2023-04-26 tags: [bg, licensed, ner, legal, mapa] task: Named Entity Recognition language: bg edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union. This model extracts `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, and `PERSON` entities from `Bulgarian` documents. ## Predicted Entities `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, `PERSON` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_mapa_bg_1.0.0_3.0_1682548782666.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_mapa_bg_1.0.0_3.0_1682548782666.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_bg_cased", "bg")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_mapa", "bg", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""7 В окончателно решение № 1072 на Curtea de Apel București ( Апелативен съд Букурещ, Румъния ), 3-то гражданско отделение за малолетни и непълнолетни лица и семейноправни въпроси, от 12 юни 2013г., което е приложено към акта за преюдициално запитване и представено от г‑н Liberato, се уточнява, че„ [с] ъдът приема, че страните са сключили брак в Италия през октомври 2005 г. и до октомври 2006 г. са живели ту в Румъния, ту в Италия."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
## Results ```bash +----------------+---------+ |chunk |ner_label| +----------------+---------+ |Букурещ, Румъния|ADDRESS | |12 юни 2013г., |DATE | |г‑н Liberato |PERSON | |Италия |ADDRESS | |октомври 2005 г.|DATE | |октомври 2006 г.|DATE | |Румъния |ADDRESS | |Италия |ADDRESS | +----------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_mapa| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|bg| |Size:|1.4 MB| ## References The dataset is available [here](https://huggingface.co/datasets/joelito/mapa). ## Benchmarking ```bash label precision recall f1-score support ADDRESS 0.86 0.75 0.80 8 AMOUNT 1.00 0.64 0.78 11 DATE 0.97 0.97 0.97 65 ORGANISATION 0.81 0.86 0.83 35 PERSON 0.87 0.84 0.85 56 micro-avg 0.90 0.87 0.89 175 macro-avg 0.90 0.81 0.85 175 weighted-avg 0.90 0.87 0.89 175 ``` --- layout: model title: English BertForQuestionAnswering model (from JAlexis) author: John Snow Labs name: bert_qa_Bertv1_fine date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Bertv1_fine` is an English model originally trained by `JAlexis`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Bertv1_fine_en_4.0.0_3.0_1654176446585.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Bertv1_fine_en_4.0.0_3.0_1654176446585.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Bertv1_fine","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_Bertv1_fine","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.cord19.bert.by_JAlexis").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_Bertv1_fine| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/JAlexis/Bertv1_fine --- layout: model title: Word2Vec Embeddings in Turkish (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, tr, open_source] task: Embeddings language: tr edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_tr_3.4.1_3.0_1647463837323.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_tr_3.4.1_3.0_1647463837323.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","tr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Spark NLP'yi seviyorum"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","tr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Spark NLP'yi seviyorum").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("tr.embed.w2v_cc_300d").predict("""Spark NLP'yi seviyorum""") ```
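Once each token is mapped to a 300-dimensional vector, downstream comparisons typically reduce to vector arithmetic such as cosine similarity. A minimal pure-Python sketch follows; the three-dimensional toy vectors are illustrative assumptions only (real `w2v_cc_300d` vectors have 300 components):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d stand-ins for 300-d token embeddings.
v_kedi = [0.2, 0.8, 0.1]
v_kopek = [0.25, 0.75, 0.05]
print(round(cosine_similarity(v_kedi, v_kopek), 3))  # → 0.995
```

In a Spark NLP pipeline the per-token vectors land in the `embeddings` output column, from which they can be collected and compared in the same way.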
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|tr| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: French CamemBert Embeddings (from Hasanmurad) author: John Snow Labs name: camembert_embeddings_Hasanmurad_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `Hasanmurad`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Hasanmurad_generic_model_fr_3.4.4_3.0_1653986108833.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Hasanmurad_generic_model_fr_3.4.4_3.0_1653986108833.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Hasanmurad_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Hasanmurad_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_Hasanmurad_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/Hasanmurad/dummy-model --- layout: model title: Legal NER for MAPA (Multilingual Anonymisation for Public Administrations) author: John Snow Labs name: legner_mapa date: 2023-04-28 tags: [sk, licensed, ner, legal, mapa] task: Named Entity Recognition language: sk edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union. This model extracts `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, and `PERSON` entities from `Slovak` documents. ## Predicted Entities `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, `PERSON` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_mapa_sk_1.0.0_3.0_1682674803309.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_mapa_sk_1.0.0_3.0_1682674803309.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_base_slovak_legal","sk")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_mapa", "sk", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""Návrhom podaným 22. mája 2007 na Tribunale di Teramo ( súd v Terame, Taliansko ) požiadal pán Liberato o rozluku a o zverenie syna do svojej starostlivosti."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
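To reproduce the chunk/label table shown under Results, the `ner_chunk` annotations need to be flattened into rows. Below is a minimal sketch of that post-processing step in plain Python; the dictionary layout used here is an illustrative stand-in for Spark NLP's annotation structures, not the library's exact API:

```python
def chunks_to_rows(ner_chunks):
    """Flatten chunk annotations into (chunk, ner_label) pairs.

    Each element is assumed to be a dict with a "result" key holding the
    chunk text and a "metadata" dict holding the predicted "entity" label
    -- a simplified stand-in for Spark NLP's Annotation objects.
    """
    return [(chunk["result"], chunk["metadata"]["entity"])
            for chunk in ner_chunks]
```

In a real pipeline you would typically obtain the annotations via `LightPipeline.fullAnnotate` or by exploding the `ner_chunk` column of the transformed DataFrame.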
## Results ```bash +-------------+---------+ |chunk |ner_label| +-------------+---------+ |22. mája 2007|DATE | |Terame |ADDRESS | |Taliansko |ADDRESS | |pán Liberato |PERSON | +-------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_mapa| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|sk| |Size:|1.4 MB| ## References The dataset is available [here](https://huggingface.co/datasets/joelito/mapa). ## Benchmarking ```bash label precision recall f1-score support ADDRESS 0.88 0.85 0.86 26 AMOUNT 1.00 1.00 1.00 4 DATE 0.92 0.88 0.90 50 ORGANISATION 0.79 0.61 0.69 31 PERSON 0.66 0.86 0.75 44 micro-avg 0.80 0.82 0.81 155 macro-avg 0.85 0.84 0.84 155 weighted-avg 0.81 0.82 0.81 155 ``` --- layout: model title: Notice Clause Relation Extraction Model author: John Snow Labs name: legre_notice_clause_xs date: 2022-12-17 tags: [en, legal, relations, redl, licensed, tensorflow] task: Relation Extraction language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Relation Extraction model aimed at notice clauses, retrieving relations between entities such as NOTICE_PARTY, ADDRESS, EMAIL, and TITLE.
Make sure you run this model only on the NER entities in notice clauses, after filtering them with `legclf_notice_clause`. ## Predicted Entities `has_notice_party`, `has_address`, `has_person`, `has_phone`, `has_fax`, `has_title`, `has_email`, `has_department` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legre_notice_clause_xs_en_1.0.0_3.0_1671280929569.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legre_notice_clause_xs_en_1.0.0_3.0_1671280929569.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ sentence_detector = nlp.SentenceDetectorDLModel.pretrained()\ .setInputCols("document")\ .setOutputCol("sentence")\ .setCustomBounds(["\n\n"])\ .setUseCustomBoundsOnly(True) tokenizer = nlp.Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") pos_tagger = nlp.PerceptronModel.pretrained()\ .setInputCols(["sentence", "token"])\ .setOutputCol("pos_tags") dependency_parser = nlp.DependencyParserModel() \ .pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained('legner_notice_clause', 'en', 'legal/models') \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = nlp.NerConverter() \ .setInputCols(["sentence","token","ner"]) \ .setOutputCol("ner_chunk") re_filter = legal.RENerChunksFilter()\ .setInputCols(["ner_chunk", "dependencies"])\ .setOutputCol("re_ner_chunks")\ .setMaxSyntacticDistance(12)\ .setRelationPairs(['NAME-NOTICE_PARTY','NAME-ADDRESS','NAME-PERSON', 'NAME-TITLE','NAME-EMAIL','NAME-PHONE', 'NAME-FAX', 'NAME-DEPARTMENT']) reDL = legal.RelationExtractionDLModel.pretrained("legre_notice_clause_xs", "en", "legal/models") \ .setPredictionThreshold(0.1) \ .setInputCols(["re_ner_chunks", "sentence"]) \ .setOutputCol("relations") pipeline = nlp.Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos_tagger, dependency_parser, embeddings, ner_model, ner_converter, re_filter, reDL]) empty_df = spark.createDataFrame([['']]).toDF("text") re_model = pipeline.fit(empty_df) light_model = nlp.LightPipeline(re_model) text = """The addresses for notices shall be: IBM MSL 8501 IBM Drive 200 Baker 
Avenue Charlotte, NC 28262 Concord, MA 01742 Attn: MSL Project Office Attn: General Counsel Telephone: 704-594-1964 Telephone: 978-287-5630 Facsimile: 704-594-4108 Facsimile: 978-287-5635 Either Party may change its address for this section by giving written notice to the other Party.""" result = light_model.fullAnnotate(text) ```
## Results ```bash | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |---------------------|------------|------------------|----------------|------------|---------------|------------------|----------------|------------------------------------------------------|---------------| | has_address | NAME | 36 | 42 | IBM MSL | ADDRESS | 44 | 112 | 8501 IBM Drive 200 Baker Avenue Charlotte, NC ... | 0.9997987 | | has_notice_party | NAME | 36 | 42 | IBM MSL | DEPARTMENT | 120 | 137 | MSL Project Office | 0.34552842 | | has_title | NAME | 36 | 42 | IBM MSL | TITLE | 145 | 159 | General Counsel | 0.48349348 | | has_phone | NAME | 36 | 42 | IBM MSL | PHONE | 173 | 184 | 704-594-1964 | 0.99517375 | | has_phone | NAME | 36 | 42 | IBM MSL | PHONE | 197 | 208 | 978-287-5630 | 0.9961247 | | has_fax | NAME | 36 | 42 | IBM MSL | FAX | 221 | 232 | 704-594-4108 | 0.99340916 | | has_fax | NAME | 36 | 42 | IBM MSL | FAX | 245 | 256 | 978-287-5635 | 0.97187006 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legre_notice_clause_xs| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|402.6 MB| ## References In-house dataset ## Benchmarking ```bash label Recall Precision F1 Support has_address 0.976 1.000 0.988 41 has_department 0.667 1.000 0.800 3 has_email 1.000 1.000 1.000 7 has_fax_phone 1.000 1.000 1.000 8 has_notice_party 1.000 0.955 0.977 42 has_person 1.000 0.938 0.968 15 has_title 0.875 0.933 0.903 16 other 1.000 1.000 1.000 68 Avg. 0.940 0.978 0.954 - Weighted-Avg. 
0.980 0.980 0.979 - ``` --- layout: model title: English BertForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: bert_qa_rule_based_triplet_epochs_1_shard_1_squad2.0 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_bert_triplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_triplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1657190994171.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_triplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1657190994171.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_triplet_epochs_1_shard_1_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_triplet_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_rule_based_triplet_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/rule_based_bert_triplet_epochs_1_shard_1_squad2.0 --- layout: model title: Chinese BertForMaskedLM Small Cased model (from hfl) author: John Snow Labs name: bert_embeddings_chinese_lert_small date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese-lert-small` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_lert_small_zh_4.2.4_3.0_1670021149485.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_lert_small_zh_4.2.4_3.0_1670021149485.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_lert_small","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_lert_small","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_lert_small| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|57.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/chinese-lert-small - https://github.com/ymcui/LERT/blob/main/README_EN.md - https://arxiv.org/abs/2211.05344 --- layout: model title: Legal Administration Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_administration_agreement date: 2022-11-24 tags: [en, legal, classification, agreement, administration, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_administration_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `administration-agreement` or not (Binary Classification). Longformers have a limit of 4096 tokens, so only the first 4096 tokens will be taken into account. We have found that, for the vast majority of documents in legal corpora, if they are clean and contain only the legal document without extra material before it, 4096 tokens are enough to perform Document Classification. If not, let us know and we can take another approach for you: splitting the document into chunks of 4096 tokens, averaging their embeddings, and training with the averaged version, which means the whole document is taken into account. In theory, though, this should not be required.
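The chunk-and-average fallback described above can be sketched in plain Python/NumPy. The function name and default chunk size here are illustrative; Spark NLP does not expose this helper directly:

```python
import numpy as np


def averaged_document_embedding(token_embeddings, chunk_size=4096):
    """Split a long token-embedding sequence into fixed-size chunks,
    average each chunk, then average the per-chunk vectors so every
    part of the document contributes to a single document embedding.

    token_embeddings: array of shape (num_tokens, dim).
    Returns an array of shape (dim,).
    """
    chunks = [token_embeddings[i:i + chunk_size]
              for i in range(0, len(token_embeddings), chunk_size)]
    chunk_means = [np.mean(chunk, axis=0) for chunk in chunks]
    return np.mean(chunk_means, axis=0)
```

Note that averaging per-chunk means weights a short final chunk as heavily as the full-size ones; weight each chunk mean by its length if that matters for your corpus.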
## Predicted Entities `administration-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_administration_agreement_en_1.0.0_3.0_1669290751753.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_administration_agreement_en_1.0.0_3.0_1669290751753.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_administration_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[administration-agreement]| |[other]| |[other]| |[administration-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_administration_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support administration-agreement 0.93 0.93 0.93 29 other 0.98 0.98 0.98 90 accuracy - - 0.97 119 macro-avg 0.95 0.95 0.95 119 weighted-avg 0.97 0.97 0.97 119 ``` --- layout: model title: Legal Litigation Clause Binary Classifier author: John Snow Labs name: legclf_litigation_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `litigation` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting it into smaller pieces (you can also check the tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `litigation` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_litigation_clause_en_1.0.0_3.2_1660122631845.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_litigation_clause_en_1.0.0_3.2_1660122631845.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_litigation_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
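The paragraph-splitting preprocessing recommended in the Description (splitting on blank lines, then keeping each piece under the embedding model's 512-token budget) can be sketched as follows. The whitespace token count is a rough stand-in for the model's real tokenizer, and the function name is illustrative:

```python
def split_into_paragraphs(text, max_tokens=512):
    """Split text on blank lines, then greedily merge consecutive
    paragraphs so each resulting piece stays under max_tokens
    (counted as whitespace-separated words). A single paragraph
    longer than max_tokens is kept whole and left to the caller.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    pieces, current, count = [], [], 0
    for p in paragraphs:
        n = len(p.split())
        if current and count + n > max_tokens:
            pieces.append("\n\n".join(current))
            current, count = [], 0
        current.append(p)
        count += n
    if current:
        pieces.append("\n\n".join(current))
    return pieces
```

Each resulting piece can then be sent through the classification pipeline shown above as a separate row of the input DataFrame.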
## Results ```bash +-------+ | result| +-------+ |[litigation]| |[other]| |[other]| |[litigation]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_litigation_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support litigation 1.00 0.89 0.94 27 other 0.97 1.00 0.99 103 accuracy - - 0.98 130 macro-avg 0.99 0.94 0.96 130 weighted-avg 0.98 0.98 0.98 130 ``` --- layout: model title: Greek BertForQuestionAnswering model (from Danastos) author: John Snow Labs name: bert_qa_qacombination_bert_el_Danastos date: 2022-06-03 tags: [el, open_source, question_answering, bert] task: Question Answering language: el edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `qacombination_bert_el` is a Greek model originally trained by `Danastos`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_qacombination_bert_el_Danastos_el_4.0.0_3.0_1654249989643.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_qacombination_bert_el_Danastos_el_4.0.0_3.0_1654249989643.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_qacombination_bert_el_Danastos","el") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_qacombination_bert_el_Danastos","el") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("el.answer_question.bert").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_qacombination_bert_el_Danastos| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|el| |Size:|421.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Danastos/qacombination_bert_el --- layout: model title: Extract Granular Anatomical Entities from Oncology Texts author: John Snow Labs name: ner_oncology_anatomy_granular date: 2022-10-25 tags: [licensed, clinical, oncology, en, ner, anatomy] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Healthcare 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts mentions of anatomical entities using granular labels. Definitions of Predicted Entities: - `Direction`: Directional and laterality terms, such as "left", "right", "bilateral", "upper" and "lower". - `Site_Bone`: Anatomical terms that refer to the human skeleton. - `Site_Brain`: Anatomical terms that refer to the central nervous system (including the brain stem and the cerebellum). - `Site_Breast`: Anatomical terms that refer to the breasts. - `Site_Liver`: Anatomical terms that refer to the liver. - `Site_Lung`: Anatomical terms that refer to the lungs. - `Site_Lymph_Node`: Anatomical terms that refer to lymph nodes, excluding adenopathies. - `Site_Other_Body_Part`: Relevant anatomical terms that are not included in the rest of the anatomical entities.
## Predicted Entities `Direction`, `Site_Bone`, `Site_Brain`, `Site_Breast`, `Site_Liver`, `Site_Lung`, `Site_Lymph_Node`, `Site_Other_Body_Part` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_granular_en_4.0.0_3.0_1666722590194.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_granular_en_4.0.0_3.0_1666722590194.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_anatomy_granular", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_anatomy_granular", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new 
Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_anatomy_granular").predict("""The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.""") ```
## Results ```bash | chunk | ner_label | |:--------|:------------| | left | Direction | | breast | Site_Breast | | lungs | Site_Lung | | liver | Site_Liver | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_anatomy_granular| |Compatibility:|Spark NLP for Healthcare 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.3 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Direction 822 221 162 984 0.79 0.84 0.81 Site_Lymph_Node 481 38 70 551 0.93 0.87 0.90 Site_Breast 88 14 59 147 0.86 0.60 0.71 Site_Other_Body_Part 604 184 897 1501 0.77 0.40 0.53 Site_Bone 252 74 61 313 0.77 0.81 0.79 Site_Liver 178 92 56 234 0.66 0.76 0.71 Site_Lung 398 98 161 559 0.80 0.71 0.75 Site_Brain 197 44 82 279 0.82 0.71 0.76 macro_avg 3020 765 1548 4568 0.80 0.71 0.74 micro_avg 3020 765 1548 4568 0.80 0.66 0.71 ``` --- layout: model title: English RobertaForQuestionAnswering (from comacrae) author: John Snow Labs name: roberta_qa_roberta_edav3 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-edav3` is an English model originally trained by `comacrae`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_edav3_en_4.0.0_3.0_1655735808428.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_edav3_en_4.0.0_3.0_1655735808428.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_edav3","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_edav3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.edav3.by_comacrae").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_edav3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/comacrae/roberta-edav3 --- layout: model title: Pipeline to Detect PHI for Deidentification (Subentity-Augmented) author: John Snow Labs name: ner_deid_subentity_augmented_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, deidentification, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deid_subentity_augmented](https://nlp.johnsnowlabs.com/2021/09/03/ner_deid_subentity_augmented_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_augmented_pipeline_en_3.4.1_3.0_1647870542439.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_augmented_pipeline_en_3.4.1_3.0_1647870542439.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_deid_subentity_augmented_pipeline", "en", "clinical/models") pipeline.annotate("A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.") ``` ```scala val pipeline = new PretrainedPipeline("ner_deid_subentity_augmented_pipeline", "en", "clinical/models") pipeline.annotate("A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.") ``` {:.nlu-block} ```python import nlu nlu.load("en.deid.subentity_ner_augmented.pipeline").predict("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""") ```
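Downstream, the detected (chunk, label) pairs shown in the results below are typically used to mask PHI in the raw text. A minimal, Spark-independent sketch of that masking step (`mask_phi` is a hypothetical helper, not part of the pretrained pipeline):

```python
def mask_phi(text: str, chunks: list) -> str:
    """Replace each detected PHI chunk with its NER label, e.g. <DATE>.

    `chunks` is a list of (chunk_text, ner_label) pairs, as produced by
    an NER + converter stage. Naive string replacement is used here;
    a production deidentifier would work on character offsets instead.
    """
    for chunk, label in chunks:
        text = text.replace(chunk, f"<{label}>")
    return text

masked = mask_phi(
    "Record date : 2093-01-13, David Hale, M.D.",
    [("2093-01-13", "DATE"), ("David Hale", "DOCTOR")],
)
# masked == "Record date : <DATE>, <DOCTOR>, M.D."
```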
## Results ```bash +-----------------------------+-------------+ |chunk |ner_label | +-----------------------------+-------------+ |2093-01-13 |DATE | |David Hale |DOCTOR | |Hendrickson, Ora |PATIENT | |7194334 |MEDICALRECORD| |01/13/93 |DATE | |Oliveira |DOCTOR | |25-year-old |AGE | |1-11-2000 |DATE | |Cocke County Baptist Hospital|HOSPITAL | |0295 Keats Street. |STREET | |(302) 786-5227 |PHONE | |Brothers Coal-Mine |ORGANIZATION | +-----------------------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity_augmented_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_doddle124578 TFWav2Vec2ForCTC from doddle124578 author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab_by_doddle124578 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_doddle124578` is an English model originally trained by doddle124578.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_by_doddle124578_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_doddle124578_en_4.2.0_3.0_1664036087586.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_doddle124578_en_4.2.0_3.0_1664036087586.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_by_doddle124578", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_by_doddle124578", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab_by_doddle124578| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: English DistilBertForTokenClassification Cased model (from Lucifermorningstar011) author: John Snow Labs name: distilbert_token_classifier_autotrain_final_784824209 date: 2023-03-06 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-final-784824209` is an English model originally trained by `Lucifermorningstar011`. ## Predicted Entities `9`, `0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824209_en_4.3.1_3.0_1678134230054.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824209_en_4.3.1_3.0_1678134230054.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824209","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824209","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_final_784824209| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Lucifermorningstar011/autotrain-final-784824209 --- layout: model title: Fast Neural Machine Translation Model from English to Kaonde author: John Snow Labs name: opus_mt_en_kqn date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, kqn, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. - source languages: `en` - target languages: `kqn` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_kqn_xx_2.7.0_2.4_1609168468183.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_kqn_xx_2.7.0_2.4_1609168468183.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_kqn", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_kqn", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.kqn').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_kqn| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Relationship of the parties Clause Binary Classifier author: John Snow Labs name: legclf_relationship_of_the_parties_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `relationship_of_the_parties` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
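For reference, the paragraph-splitting approach mentioned above can be approximated with the standard library alone. This sketch (a hypothetical helper, not taken from the Legal NLP workshop) splits on blank lines and flags each paragraph by whether a rough, whitespace-based token count fits the 512-token embedding limit:

```python
import re

def split_paragraphs(text: str, max_tokens: int = 512):
    """Split a document on blank lines (multiline splitting).

    Returns (paragraph, fits) pairs, where `fits` is True when the
    rough whitespace token count is within `max_tokens`. Real subword
    tokenizers produce more tokens, so treat this only as a heuristic.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

clauses = split_paragraphs("Clause one text.\n\nClause two text.")
# clauses == [("Clause one text.", True), ("Clause two text.", True)]
```

Each returned paragraph can then be fed to the classifier as a separate row of the `clause_text` column.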
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `relationship_of_the_parties` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_relationship_of_the_parties_clause_en_1.0.0_3.2_1660122908778.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_relationship_of_the_parties_clause_en_1.0.0_3.2_1660122908778.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_relationship_of_the_parties_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-----------------------------+ |result | +-----------------------------+ |[relationship_of_the_parties]| |[other]| |[other]| |[relationship_of_the_parties]| +-----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_relationship_of_the_parties_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.99 1.00 0.99 241 relationship of the parties 0.99 0.96 0.97 78 accuracy - - 0.99 319 macro-avg 0.99 0.98 0.98 319 weighted-avg 0.99 0.99 0.99 319 ``` --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_ff9000 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-ff9000` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_ff9000_en_4.3.0_3.0_1675121175904.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_ff9000_en_4.3.0_3.0_1675121175904.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_ff9000","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_ff9000","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_ff9000| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|316.3 MB| ## References - https://huggingface.co/google/t5-efficient-small-ff9000 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English RobertaForQuestionAnswering Cased model (from pierrerappolt) author: John Snow Labs name: roberta_qa_cart date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cart` is an English model originally trained by `pierrerappolt`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_cart_en_4.2.4_3.0_1669985087510.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_cart_en_4.2.4_3.0_1669985087510.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_cart","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_cart","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_cart| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/pierrerappolt/cart --- layout: model title: Fast Neural Machine Translation Model from Italian to English author: John Snow Labs name: opus_mt_it_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, it, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. - source languages: `it` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_it_en_xx_2.7.0_2.4_1609163790941.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_it_en_xx_2.7.0_2.4_1609163790941.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_it_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_it_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.it.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_it_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Mapping Entities (Major Clinical Concepts) with Corresponding UMLS CUI Codes author: John Snow Labs name: umls_major_concepts_mapper date: 2022-07-11 tags: [umls, chunk_mapper, licensed, en] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps entities (Major Clinical Concepts) with corresponding UMLS CUI codes. ## Predicted Entities `umls_code` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/umls_major_concepts_mapper_en_4.0.0_3.0_1657579020337.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/umls_major_concepts_mapper_en_4.0.0_3.0_1657579020337.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_medmentions_coarse", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("clinical_ner") ner_model_converter = NerConverterInternal()\ .setInputCols("sentence", "token", "clinical_ner")\ .setOutputCol("ner_chunk") chunkerMapper = ChunkMapperModel.pretrained("umls_major_concepts_mapper", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings")\ .setRels(["umls_code"])\ .setLowerCase(True) mapper_pipeline = Pipeline().setStages([ document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_model_converter, chunkerMapper]) test_data = spark.createDataFrame([["The patient complains of pustules after falling from stairs. 
Also, she has a history of quadriceps tendon rupture"]]).toDF("text") result = mapper_pipeline.fit(test_data).transform(test_data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel .pretrained("ner_medmentions_coarse", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("clinical_ner") val ner_model_converter = new NerConverterInternal() .setInputCols("sentence", "token", "clinical_ner") .setOutputCol("ner_chunk") val chunkerMapper = ChunkMapperModel .pretrained("umls_major_concepts_mapper", "en", "clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("mappings") .setRels(Array("umls_code")) val mapper_pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_model_converter, chunkerMapper)) val test_data = Seq("The patient complains of pustules after falling from stairs. Also, she has a history of quadriceps tendon rupture").toDF("text") val result = mapper_pipeline.fit(test_data).transform(test_data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.umls_major_concepts_mapper").predict("""The patient complains of pustules after falling from stairs. Also, she has a history of quadriceps tendon rupture""") ```
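Conceptually, the mapper stage behaves like a normalized dictionary lookup from chunk text to UMLS CUI code, with `setLowerCase(True)` lower-casing chunks before matching. A toy, Spark-independent sketch (the codes are taken from the results below; `map_chunk` and the `"NONE"` fallback are illustrative, not the mapper's actual API):

```python
# Toy lookup table mirroring a few entries of the pretrained mapper.
UMLS_MAP = {
    "pustules": "C0241157",
    "stairs": "C4300351",
    "quadriceps tendon rupture": "C0263968",
}

def map_chunk(chunk: str) -> str:
    """Lower-case and trim the chunk before lookup, mirroring setLowerCase(True)."""
    return UMLS_MAP.get(chunk.lower().strip(), "NONE")
```

For example, `map_chunk("Pustules")` returns `"C0241157"`, while unseen chunks fall back to `"NONE"` in this sketch.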
## Results ```bash +-------------------------+---------+ |ner_chunk |umls_code| +-------------------------+---------+ |pustules |C0241157 | |stairs |C4300351 | |quadriceps tendon rupture|C0263968 | +-------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|umls_major_concepts_mapper| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|37.0 MB| ## References Data sampled from https://www.nlm.nih.gov/research/umls/index.html --- layout: model title: T5 for grammar error correction author: John Snow Labs name: t5_grammar_error_corrector date: 2022-01-12 tags: [t5, en, open_source] task: Text Generation language: en nav_key: models edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a text-to-text model fine-tuned to correct grammatical errors when the task is set to "gec:". It is based on Prithiviraj Damodaran's Gramformer model. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/T5_LINGUISTIC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/T5_LINGUISTIC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_grammar_error_corrector_en_3.4.0_3.0_1641983182673.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_grammar_error_corrector_en_3.4.0_3.0_1641983182673.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import sparknlp from sparknlp.base import * from sparknlp.annotator import * spark = sparknlp.start() documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") t5 = T5Transformer.pretrained("t5_grammar_error_corrector") \ .setTask("gec:") \ .setInputCols(["documents"]) \ .setMaxOutputLength(200) \ .setOutputCol("corrections") pipeline = Pipeline().setStages([documentAssembler, t5]) data = spark.createDataFrame([["He are moving here."]]).toDF("text") result = pipeline.fit(data).transform(data) result.select("corrections.result").show(truncate=False) ``` ```scala import spark.implicits._ import com.johnsnowlabs.nlp.base.DocumentAssembler import com.johnsnowlabs.nlp.annotators.seq2seq.T5Transformer import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("documents") val t5 = T5Transformer.pretrained("t5_grammar_error_corrector") .setTask("gec:") .setMaxOutputLength(200) .setInputCols("documents") .setOutputCol("corrections") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("He are moving here.").toDF("text") val result = pipeline.fit(data).transform(data) result.select("corrections.result").show(false) ``` {:.nlu-block} ```python import nlu nlu.load("en.t5.grammar_error_corrector").predict("""He are moving here.""") ```
## Results ```bash +--------------------+ |result | +--------------------+ |[He is moving here.]| +--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_grammar_error_corrector| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[corrections]| |Language:|en| |Size:|926.2 MB| ## Data Source Model originally from the transformers library: https://huggingface.co/prithivida/grammar_error_correcter_v1 --- layout: model title: Legal Distribution Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_distribution_agreement_bert date: 2022-11-24 tags: [en, legal, classification, agreement, distribution, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_distribution_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `distribution-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `distribution-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_distribution_agreement_bert_en_1.0.0_3.0_1669310752189.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_distribution_agreement_bert_en_1.0.0_3.0_1669310752189.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_distribution_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+------------------------+
|result                  |
+------------------------+
|[distribution-agreement]|
|[other]                 |
|[other]                 |
|[distribution-agreement]|
+------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_distribution_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|

## References

Legal documents, scraped from the Internet and classified in-house, plus SEC documents

## Benchmarking

```bash
                 label  precision  recall  f1-score  support
distribution-agreement       0.92    0.95      0.94       38
                 other       0.97    0.95      0.96       65
              accuracy          -       -      0.95      103
             macro-avg       0.95    0.95      0.95      103
          weighted-avg       0.95    0.95      0.95      103
```

---
layout: model
title: Spanish RobertaForQuestionAnswering Base Cased model (from Evelyn18)
author: John Snow Labs
name: roberta_qa_base_spanish_squades_becasincentivos4
date: 2023-01-20
tags: [es, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: es
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-squades-becasIncentivos4` is a Spanish model originally trained by `Evelyn18`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becasincentivos4_es_4.3.0_3.0_1674218146985.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becasincentivos4_es_4.3.0_3.0_1674218146985.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becasincentivos4","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becasincentivos4","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_spanish_squades_becasincentivos4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|es| |Size:|459.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Evelyn18/roberta-base-spanish-squades-becasIncentivos4 --- layout: model title: Tamil Lemmatizer author: John Snow Labs name: lemma date: 2021-04-02 tags: [ta, open_source, lemmatizer] task: Lemmatization language: ta edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: LemmatizerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a dictionary-based lemmatizer that assigns all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/TEXT_PREPROCESSING/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TEXT_PREPROCESSING.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_ta_3.0.0_3.0_1617388293492.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_ta_3.0.0_3.0_1617388293492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

lemmatizer = LemmatizerModel.pretrained("lemma", "ta") \
    .setInputCols(["token"]) \
    .setOutputCol("lemma")

pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer])

example = spark.createDataFrame([['கட்சி வெற்றி பெற்றதோடு பெற்றத் ஓடு ஆட்சியில் உள்ள கட்சிக்கு ஒரு மாற்றுக் கட்சியாக வளர்ந்துள்ளது வளர்ந்த் உள்ளது .']], ["text"])
results = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val lemmatizer = LemmatizerModel.pretrained("lemma", "ta")
    .setInputCols("token")
    .setOutputCol("lemma")

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer))

val data = Seq("கட்சி வெற்றி பெற்றதோடு பெற்றத் ஓடு ஆட்சியில் உள்ள கட்சிக்கு ஒரு மாற்றுக் கட்சியாக வளர்ந்துள்ளது வளர்ந்த் உள்ளது .").toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["கட்சி வெற்றி பெற்றதோடு பெற்றத் ஓடு ஆட்சியில் உள்ள கட்சிக்கு ஒரு மாற்றுக் கட்சியாக வளர்ந்துள்ளது வளர்ந்த் உள்ளது ."]
lemma_df = nlu.load('ta.lemma').predict(text, output_level = "document")
lemma_df.lemma.values[0]
```
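Conceptually, the annotator performs a dictionary lookup: every inflected form points at its root, and tokens absent from the dictionary pass through unchanged. A minimal sketch of that behavior (with hypothetical English entries for readability; the model's actual resource is a Tamil form-to-root dictionary):

```python
# Toy dictionary-based lemmatizer. The entries below are hypothetical
# English examples, not taken from the actual Tamil resource.
LEMMA_DICT = {
    "running": "run",
    "ran": "run",
    "runs": "run",
    "better": "good",
}

def lemmatize(tokens):
    # Tokens absent from the dictionary fall back to themselves,
    # a common behavior for dictionary lemmatizers.
    return [LEMMA_DICT.get(t.lower(), t) for t in tokens]

print(lemmatize(["She", "ran", "better", "runs"]))  # -> ['She', 'run', 'good', 'run']
```

This is why "ran", "runs" and "running" all collapse to the same root, letting downstream stages treat them as one word.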
## Results

```bash
+-------------+
|        lemma|
+-------------+
|        கட்சி|
|       வெற்றி|
|    பெற்றதோடு|
|      பெற்றத்|
|          ஓடு|
|        ஆட்சி|
|          உள்|
|    கட்சிக்கு|
|          ஒரு|
|     மாற்றுக்|
|     கட்சியாக|
|வளர்ந்துள்ளது|
|     வளர்ந்த்|
|          உள்|
|            .|
+-------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|lemma|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[lemma]|
|Language:|ta|

## Data Source

The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) version 2.7.

## Benchmarking

```bash
Precision=0.62, Recall=0.58, F1-score=0.6
```

---
layout: model
title: Multilingual XLMRoBerta Embeddings (from hfl)
author: John Snow Labs
name: xlmroberta_embeddings_cino_large_v2
date: 2022-05-13
tags: [zh, ko, open_source, xlm_roberta, embeddings, xx, cino]
task: Embeddings
language: xx
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: XlmRoBertaEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained XLMRoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `cino-large-v2` is a multilingual model originally trained by `hfl`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_cino_large_v2_xx_3.4.4_3.0_1652439564348.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_cino_large_v2_xx_3.4.4_3.0_1652439564348.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_cino_large_v2","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_cino_large_v2","xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlmroberta_embeddings_cino_large_v2|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|xx|
|Size:|1.7 GB|
|Case sensitive:|true|

## References

- https://huggingface.co/hfl/cino-large-v2
- https://github.com/ymcui/Chinese-Minority-PLM
- https://github.com/ymcui/MacBERT
- https://github.com/ymcui/Chinese-BERT-wwm
- https://github.com/ymcui/Chinese-ELECTRA
- https://github.com/ymcui/Chinese-XLNet
- https://github.com/airaria/TextBrewer
- https://github.com/ymcui/HFL-Anthology

---
layout: model
title: Chinese Word Segmentation
author: John Snow Labs
name: wordseg_ctb9
date: 2021-01-03
task: Word Segmentation
language: zh
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [zh]
supported: true
annotator: WordSegmenterModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

WordSegmenterModel (WSM) is based on a maximum entropy probability model that detects word boundaries in Chinese text. Chinese text is written without white space between words, and a computer-based application cannot know _a priori_ which sequence of ideograms forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step. We trained this model on the [Chinese Penn Treebank](https://www.cs.brandeis.edu/~clp/ctb/) version 9 data set, following the research paper by Nianwen Xue.

References:

- Xue, Nianwen, et al. Chinese Treebank 9.0 LDC2016T13. Web Download. Philadelphia: Linguistic Data Consortium, 2016.
- Xue, Nianwen. "Chinese word segmentation as character tagging." International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing. 2003.
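Xue's character-tagging approach reduces segmentation to per-character sequence labeling: each character receives a position tag, and words are read off the tag sequence. The sketch below decodes a B/M/E/S tag scheme (begin/middle/end/single, a common variant; the model's internal tag set may differ):

```python
def decode_bmes(chars, tags):
    """Rebuild words from per-character B/M/E/S position tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":            # single-character word
            if current:
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":          # start a new word
            if current:
                words.append(current)
            current = ch
        else:                     # "M" or "E": extend the current word
            current += ch
            if tag == "E":
                words.append(current)
                current = ""
    if current:                   # flush a dangling word, if any
        words.append(current)
    return words

print(decode_bmes(list("然而这样"), ["B", "E", "B", "E"]))  # -> ['然而', '这样']
```

The maximum entropy model's job is to predict those per-character tags from context features; decoding them into words is the deterministic step shown here.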
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_ctb9_zh_2.7.0_2.4_1609691676660.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_ctb9_zh_2.7.0_2.4_1609691676660.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use

Use as part of an NLP pipeline as a substitute for the Tokenizer stage.
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

word_segmenter = WordSegmenterModel.pretrained("wordseg_ctb9", "zh")\
    .setInputCols("document")\
    .setOutputCol("token")

pipeline = Pipeline(stages=[
    document_assembler,
    word_segmenter
])

ws_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

example = spark.createDataFrame([['然而,这样的处理也衍生了一些问题。']], ["text"])
result = ws_model.transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val word_segmenter = WordSegmenterModel.pretrained("wordseg_ctb9", "zh")
    .setInputCols("document")
    .setOutputCol("token")

val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter))

val data = Seq("然而,这样的处理也衍生了一些问题。").toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["""然而,这样的处理也衍生了一些问题。"""]
token_df = nlu.load('zh.segment_words.ctb9').predict(text, output_level='token')
token_df
```
## Results

```bash
+----------------------------------+--------------------------------------------------------+
|text                              |result                                                  |
+----------------------------------+--------------------------------------------------------+
|然而,这样的处理也衍生了一些问题。|[然而, ,, 这样, 的, 处理, 也, 衍生, 了, 一些, 问题, 。]|
+----------------------------------+--------------------------------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|wordseg_ctb9|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[token]|
|Language:|zh|

## Data Source

ctb9_train.chartag

## Benchmarking

```bash
| Model         | precision    | recall       | f1-score     |
|---------------|--------------|--------------|--------------|
| WORDSEG_CTB   | 0.6453       | 0.6341       | 0.6397       |
| WORDSEG_WEIBO | 0.5454       | 0.5655       | 0.5553       |
| WORDSEG_MSR   | 0.5984       | 0.6088       | 0.6035       |
| WORDSEG_PKU   | 0.6094       | 0.6321       | 0.6206       |
| WORDSEG_LARGE | 0.6326       | 0.6269       | 0.6297       |
```

---
layout: model
title: English image_classifier_vit_rust_image_classification_9 ViTForImageClassification from SummerChiam
author: John Snow Labs
name: image_classifier_vit_rust_image_classification_9
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rust_image_classification_9` is an English model originally trained by SummerChiam.
## Predicted Entities `nonrust`, `rust` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_9_en_4.1.0_3.0_1660171188424.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_9_en_4.1.0_3.0_1660171188424.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_rust_image_classification_9", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_rust_image_classification_9", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_rust_image_classification_9| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Named Entity Recognition - BERT Medium (OntoNotes) author: John Snow Labs name: onto_small_bert_L8_512 date: 2020-12-05 task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [ner, open_source, en] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Onto is a Named Entity Recognition (or NER) model trained on OntoNotes 5.0. It can extract up to 18 entities such as people, places, organizations, money, time, date, etc. This model uses the pretrained `small_bert_L8_512` embeddings model from the `BertEmbeddings` annotator as an input. ## Predicted Entities `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_small_bert_L8_512_en_2.7.0_2.4_1607199531477.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_small_bert_L8_512_en_2.7.0_2.4_1607199531477.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... ner_onto = NerDLModel.pretrained("onto_small_bert_L8_512", "en") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."]], ["text"])) ``` ```scala ... val ner_onto = NerDLModel.pretrained("onto_small_bert_L8_512", "en") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter)) val data = Seq("William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."""] ner_df = nlu.load('en.ner.onto.bert.small_l8_512').predict(text, output_level='chunk') ner_df[["entities", "entities_class"]] ```
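The `ner_converter` stage in the pipelines above groups per-token IOB tags from the NER model into the entity chunks shown in the Results. A simplified standalone sketch of that grouping (not the annotator's actual implementation; inconsistent `I-` tags are simply dropped here):

```python
def iob_to_chunks(tokens, tags):
    """Merge B-/I- tagged tokens into (text, label) chunks."""
    chunks, words, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if words:                      # close the previous chunk
                chunks.append((" ".join(words), label))
            words, label = [tok], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            words.append(tok)              # continue the current chunk
        else:                              # "O" or an inconsistent I- tag
            if words:
                chunks.append((" ".join(words), label))
            words, label = [], None
    if words:
        chunks.append((" ".join(words), label))
    return chunks

print(iob_to_chunks(
    ["William", "Henry", "Gates", "III", "was", "born", "in", "Seattle"],
    ["B-PERSON", "I-PERSON", "I-PERSON", "I-PERSON", "O", "O", "O", "B-GPE"],
))
```

Running this prints `[('William Henry Gates III', 'PERSON'), ('Seattle', 'GPE')]`, mirroring the chunk/label pairs in the Results table below.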
{:.h2_title} ## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |William Henry Gates III|PERSON | |October 28, 1955 |DATE | |American |NORP | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PERSON | |May 2014 |DATE | |the 1970s and 1980s |DATE | |Seattle |GPE | |Washington |GPE | |Gates |PERSON | |Paul Allen |PERSON | |1975 |DATE | |Albuquerque |GPE | |New Mexico |GPE | |Gates |PERSON | |January 2000 |DATE | |the late 1990s |DATE | |Gates |PERSON | |June 2006 |DATE | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_small_bert_L8_512| |Type:|ner| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source The model is trained based on data from [OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) ## Benchmarking ```bash Micro-average: prec: 0.8849518, rec: 0.85147995, f1: 0.8678933 CoNLL Eval: processed 152728 tokens with 11257 phrases; found: 11073 phrases; correct: 9556. 
accuracy: 97.26%; 9556 11257 11073 precision: 86.30%; recall: 84.89%; FB1: 85.59
CARDINAL: 798 935 929 precision: 85.90%; recall: 85.35%; FB1: 85.62 929
DATE: 1410 1602 1654 precision: 85.25%; recall: 88.01%; FB1: 86.61 1654
EVENT: 23 63 44 precision: 52.27%; recall: 36.51%; FB1: 42.99 44
FAC: 79 135 121 precision: 65.29%; recall: 58.52%; FB1: 61.72 121
GPE: 2097 2240 2244 precision: 93.45%; recall: 93.62%; FB1: 93.53 2244
LANGUAGE: 9 22 11 precision: 81.82%; recall: 40.91%; FB1: 54.55 11
LAW: 14 40 20 precision: 70.00%; recall: 35.00%; FB1: 46.67 20
LOC: 111 179 152 precision: 73.03%; recall: 62.01%; FB1: 67.07 152
MONEY: 282 314 320 precision: 88.12%; recall: 89.81%; FB1: 88.96 320
NORP: 755 841 889 precision: 84.93%; recall: 89.77%; FB1: 87.28 889
ORDINAL: 169 195 201 precision: 84.08%; recall: 86.67%; FB1: 85.35 201
ORG: 1368 1795 1624 precision: 84.24%; recall: 76.21%; FB1: 80.02 1624
PERCENT: 309 349 351 precision: 88.03%; recall: 88.54%; FB1: 88.29 351
PERSON: 1816 1988 2037 precision: 89.15%; recall: 91.35%; FB1: 90.24 2037
PRODUCT: 42 76 67 precision: 62.69%; recall: 55.26%; FB1: 58.74 67
QUANTITY: 85 105 108 precision: 78.70%; recall: 80.95%; FB1: 79.81 108
TIME: 137 212 222 precision: 61.71%; recall: 64.62%; FB1: 63.13 222
WORK_OF_ART: 52 166 79 precision: 65.82%; recall: 31.33%; FB1: 42.45 79
```

---
layout: model
title: Emotion Detection Classifier
author: John Snow Labs
name: Emotion Classifier
class: ClassifierDLModel
language: en
nav_key: models
repository: public/models
date: 03/07/2020
task: Text Classification
edition: Spark NLP 2.5.3
spark_version: 2.4
tags: [open_source, en, classifier, emotion]
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

{:.h2_title}
## Description
Automatically identify Joy, Surprise, Fear, Sadness in Tweets using our pretrained Spark NLP DL classifier.

{:.h2_title}
## Predicted Entities
``surprise``, ``sadness``, ``fear``, ``joy``.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN_EMOTION/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN_EMOTION.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_emotion_en_2.5.3_2.4_1593783319297.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_emotion_en_2.5.3_2.4_1593783319297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

classifier = ClassifierDLModel.pretrained("classifierdl_use_emotion", "en") \
    .setInputCols(["document", "sentence_embeddings"]) \
    .setOutputCol("sentiment")

nlpPipeline = Pipeline(stages=[document_assembler, use, classifier])

data = spark.createDataFrame([
    ["@Mira I just saw you on live t.v!!"],
    ["Just home from group celebration - dinner at Trattoria Gianni, then Hershey Felder's performance - AMAZING!!"],
    ["Nooooo! My dad turned off the internet so I can't listen to band music!"],
    ["My soul has just been pierced by the most evil look from @rickosborneorg. A mini panic attack and chill in bones followed soon after."]
]).toDF("text")

result = nlpPipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val use = UniversalSentenceEncoder.pretrained("tfhub_use", "en")
  .setInputCols(Array("document"))
  .setOutputCol("sentence_embeddings")

val classifier = ClassifierDLModel.pretrained("classifierdl_use_emotion", "en")
  .setInputCols(Array("document", "sentence_embeddings"))
  .setOutputCol("sentiment")

val pipeline = new Pipeline().setStages(Array(documentAssembler, use, classifier))

val data = Seq(
  "@Mira I just saw you on live t.v!!",
  "Just home from group celebration - dinner at Trattoria Gianni, then Hershey Felder's performance - AMAZING!!",
  "Nooooo! My dad turned off the internet so I can't listen to band music!",
  "My soul has just been pierced by the most evil look from @rickosborneorg. A mini panic attack and chill in bones followed soon after."
).toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["""@Mira I just saw you on live t.v!!"""]
emotion_df = nlu.load('en.classify.emotion.use').predict(text, output_level='document')
emotion_df[["document", "emotion"]]
```
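Conceptually, the classifier head above maps each sentence embedding to a score per emotion label and returns the highest-scoring one. A toy, pure-Python sketch of that final step (the scores below are made up for illustration; this is not the Spark NLP implementation):

```python
import math

# Labels taken from this model's output; the scores are invented.
LABELS = ["surprise", "joy", "sadness", "fear"]

def softmax(scores):
    # Numerically stable softmax: shift by the max before exponentiating.
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(scores):
    # Return the label with the highest probability.
    probs = softmax(scores)
    best = max(range(len(LABELS)), key=lambda i: probs[i])
    return LABELS[best], probs[best]

label, prob = predict([0.2, 3.1, 0.5, 0.4])
print(label)  # joy
```

In the real pipeline this step happens inside `ClassifierDLModel`; only the winning label lands in the `sentiment` output column.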
{:.h2_title} ## Results ```bash +-------------------------------------------------------------------------------------------------------------------------------------+---------+ |document |sentiment| +-------------------------------------------------------------------------------------------------------------------------------------+---------+ |@Mira I just saw you on live t.v!! |surprise | |Just home from group celebration - dinner at Trattoria Gianni, then Hershey Felder's performance - AMAZING!! |joy | |Nooooo! My dad turned off the internet so I can't listen to band music! |sadness | |My soul has just been pierced by the most evil look from @rickosborneorg. A mini panic attack and chill in bones followed soon after.|fear | +-------------------------------------------------------------------------------------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| | Model Name | classifierdl_use_emotion | | Model Class | ClassifierDLModel | | Spark Compatibility | 2.5.3 | | Spark NLP Compatibility | 2.4 | | License | open source | | Edition | public | | Input Labels | [document, sentence_embeddings]| | Output Labels | [class] | | Language | en | | Upstream Dependencies | tfhub_use | {:.h2_title} ## Data Source This model is trained on multiple datasets, including YouTube comments, Twitter and the ISEAR dataset.
--- layout: model title: English RobertaForQuestionAnswering Large Cased model (from akdeniz27) author: John Snow Labs name: roberta_qa_large_cuad date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-cuad` is an English model originally trained by `akdeniz27`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_large_cuad_en_4.2.4_3.0_1669987585871.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_large_cuad_en_4.2.4_3.0_1669987585871.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

question_answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_cuad", "en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[document_assembler, question_answering])

data = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val questionAnswering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_cuad", "en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, questionAnswering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_large_cuad| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/akdeniz27/roberta-large-cuad - https://www.atticusprojectai.org/cuad - https://github.com/TheAtticusProject/cuad - https://arxiv.org/abs/2103.06268 - https://aclanthology.org/2021.acl-long.330.pdf - https://dl.acm.org/doi/pdf/10.1145/3442188.3445922 - https://drive.google.com/file/d/1of37X0hAhECQ3BN_004D8gm6V88tgZaB/view?usp=sharing - https://zenodo.org/record/4599830 - https://mlco2.github.io/impact#compute - https://arxiv.org/abs/1910.09700 - https://www.atticusprojectai.org/ --- layout: model title: Legal Termination Clause Binary Classifier author: John Snow Labs name: legclf_termination_clause date: 2023-02-13 tags: [en, legal, classification, termination, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `termination` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them, unless you want to do Binary Classification at the sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting them using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings used by this model allow up to 512 tokens; if your text is longer, consider splitting it into smaller pieces (see the tutorial linked above). This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `termination`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_termination_clause_en_1.0.0_3.0_1676303921297.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_termination_clause_en_1.0.0_3.0_1676303921297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_termination_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
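The "paragraph splitting (by multiline)" strategy recommended above, together with the ~512-token embedding budget, can be sketched in plain Python before the chunks are fed to the classifier (an illustration under those assumptions, not the workshop's implementation):

```python
import re

def split_paragraphs(text, max_tokens=512):
    # Split on blank lines (two or more consecutive newlines) and drop
    # empty chunks, so each paragraph is classified independently.
    paragraphs = [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]
    # Flag paragraphs that would still exceed the token budget; a rough
    # whitespace token count stands in for the pipeline's real tokenizer.
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

doc = "TERMINATION.\nEither party may terminate this Agreement...\n\nGOVERNING LAW.\nThis Agreement shall be governed by..."
chunks = split_paragraphs(doc)
print(len(chunks))  # 2
```

Each resulting chunk can then replace `"YOUR TEXT HERE"` in the pipeline above, yielding one `termination`/`other` label per paragraph.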
## Results ```bash +-------------+ |result | +-------------+ |[termination]| |[other] | |[other] | |[termination]| +-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_termination_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 1.00 1.00 8 termination 1.00 1.00 1.00 14 accuracy - - 1.00 22 macro-avg 1.00 1.00 1.00 22 weighted-avg 1.00 1.00 1.00 22 ``` --- layout: model title: Part of Speech for Arabic author: John Snow Labs name: pos_ud_padt date: 2020-11-30 task: Part of Speech Tagging language: ar edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [pos, ar] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_padt_ar_2.7.0_2.4_1606721957579.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_padt_ar_2.7.0_2.4_1606721957579.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_padt", "ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate(["كرستيانو رونالدو لاعب برتغالي محترف يلعب في صفوف منتخب البرتغال لكرة القدم"]) ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_padt", "ar") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("كرستيانو رونالدو لاعب برتغالي محترف يلعب في صفوف منتخب البرتغال لكرة القدم").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""كرستيانو رونالدو لاعب برتغالي محترف يلعب في صفوف منتخب البرتغال لكرة القدم"""] pos_df = nlu.load('ar.pos').predict(text) pos_df ```
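Each annotation in the output carries inclusive begin/end character offsets into the original text (e.g. `Annotation(pos, 0, 7, ...)` covers characters 0 through 7). A minimal pure-Python sketch of how such offsets can be derived for whitespace-separated tokens (an illustration, not Spark NLP internals):

```python
def token_offsets(text):
    # Pair each whitespace token with its inclusive (begin, end) character
    # offsets, matching the convention used in Spark NLP annotations.
    offsets, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok) - 1  # inclusive end index
        offsets.append((tok, start, end))
        pos = end + 1
    return offsets

print(token_offsets("my name is Clara")[0])  # ('my', 0, 1)
```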
## Results ```bash {'pos': [Annotation(pos, 0, 7, X, {'word': 'كرستيانو'}), Annotation(pos, 9, 15, X, {'word': 'رونالدو'}), Annotation(pos, 17, 20, NOUN, {'word': 'لاعب'}), Annotation(pos, 22, 28, X, {'word': 'برتغالي'}), Annotation(pos, 30, 34, X, {'word': 'محترف'}), Annotation(pos, 36, 39, VERB, {'word': 'يلعب'}), Annotation(pos, 41, 42, ADP, {'word': 'في'}), Annotation(pos, 44, 47, NOUN, {'word': 'صفوف'}), Annotation(pos, 49, 53, NOUN, {'word': 'منتخب'}), Annotation(pos, 55, 62, X, {'word': 'البرتغال'}), Annotation(pos, 64, 67, CCONJ, {'word': 'لكرة'}), Annotation(pos, 69, 73, NOUN, {'word': 'القدم'})]} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_padt| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|ar| ## Data Source The model is trained on data obtained from [https://universaldependencies.org](https://universaldependencies.org) ## Benchmarking ```bash | | | precision | recall | f1-score | support | |---:|:-------------|:------------|:---------|-----------:|----------:| | 0 | ADJ | 0.90 | 0.91 | 0.91 | 2937 | | 1 | ADP | 0.99 | 1.00 | 0.99 | 4528 | | 2 | ADV | 0.96 | 0.93 | 0.95 | 104 | | 3 | AUX | 0.88 | 0.85 | 0.87 | 197 | | 4 | CCONJ | 1.00 | 0.99 | 0.99 | 1963 | | 5 | DET | 0.95 | 0.96 | 0.96 | 623 | | 6 | NOUN | 0.94 | 0.96 | 0.95 | 9547 | | 7 | NUM | 0.98 | 0.97 | 0.98 | 779 | | 8 | None | 1.00 | 1.00 | 1 | 3868 | | 9 | PART | 0.92 | 0.93 | 0.93 | 226 | | 10 | PRON | 0.99 | 1.00 | 1 | 1133 | | 11 | PROPN | 1.00 | 0.48 | 0.65 | 31 | | 12 | PUNCT | 1.00 | 1.00 | 1 | 2052 | | 13 | SCONJ | 0.99 | 0.98 | 0.98 | 534 | | 14 | SYM | 1.00 | 0.98 | 0.99 | 41 | | 15 | VERB | 0.94 | 0.93 | 0.94 | 2189 | | 16 | X | 0.80 | 0.64 | 0.71 | 1380 | | 17 | accuracy | | | 0.96 | 32132 | | 18 | macro avg | 0.95 | 0.91 | 0.93 | 32132 | | 19 | weighted avg | 0.95 | 0.96 | 0.95 | 32132 | ``` --- layout: model title: Translate Dutch to English Pipeline author: John
Snow Labs name: translate_nl_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, nl, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `nl` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_nl_en_xx_2.7.0_2.4_1609690917869.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_nl_en_xx_2.7.0_2.4_1609690917869.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_nl_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_nl_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.nl.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_nl_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Split Sentences in Healthcare Texts author: John Snow Labs name: sentence_detector_dl_healthcare date: 2021-03-16 tags: [en, sentence_detection, licensed, clinical] task: Sentence Detection language: en nav_key: models edition: Healthcare NLP 2.7.0 spark_version: 2.4 supported: true annotator: SentenceDetectorDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/SENTENCE_DETECTOR_HC/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/20.SentenceDetectorDL_Healthcare.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sentence_detector_dl_healthcare_en_2.7.0_2.4_1615880554391.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sentence_detector_dl_healthcare_en_2.7.0_2.4_1615880554391.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL]))

result = sd_model.fullAnnotate("""He was given boluses of MS04 with some effect, he has since been placed on a PCA - he take 80mg of oxycontin at home, his PCA dose is ~ 2 the morphine dose of the oxycontin, he has also received ativan for anxiety.Repleted with 20 meq kcl po, 30 mmol K-phos iv and 2 gms mag so4 iv. LASIX CHANGED TO 40 PO BID WHICH IS SAME AS HE TAKES AT HOME - RECEIVED 40 PO IN AM - 700CC U/O TOTAL FOR FLUID NEGATIVE ~ 600 THUS FAR TODAY, ~ 600 NEG LOS.""")
```
```scala
val documenter = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
  .setInputCols(Array("document"))
  .setOutputCol("sentences")

val pipeline = new Pipeline().setStages(Array(documenter, model))

val data = Seq("He was given boluses of MS04 with some effect, he has since been placed on a PCA - he take 80mg of oxycontin at home, his PCA dose is ~ 2 the morphine dose of the oxycontin, he has also received ativan for anxiety.Repleted with 20 meq kcl po, 30 mmol K-phos iv and 2 gms mag so4 iv. LASIX CHANGED TO 40 PO BID WHICH IS SAME AS HE TAKES AT HOME - RECEIVED 40 PO IN AM - 700CC U/O TOTAL FOR FLUID NEGATIVE ~ 600 THUS FAR TODAY, ~ 600 NEG LOS.").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.detect_sentence.clinical").predict("""He was given boluses of MS04 with some effect, he has since been placed on a PCA - he take 80mg of oxycontin at home, his PCA dose is ~ 2 the morphine dose of the oxycontin, he has also received ativan for anxiety.Repleted with 20 meq kcl po, 30 mmol K-phos iv and 2 gms mag so4 iv. LASIX CHANGED TO 40 PO BID WHICH IS SAME AS HE TAKES AT HOME - RECEIVED 40 PO IN AM - 700CC U/O TOTAL FOR FLUID NEGATIVE ~ 600 THUS FAR TODAY, ~ 600 NEG LOS.""")
```
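The example text contains a fused sentence boundary ("anxiety.Repleted", with no space after the period) that simple rule-based splitters miss; this is exactly where a trained model helps. A minimal illustration of the naive approach failing, in plain Python:

```python
import re

# Naive rule: split after sentence-final punctuation followed by whitespace.
# The fused boundary "anxiety.Repleted" has no whitespace, so it is missed.
text = "he has also received ativan for anxiety.Repleted with 20 meq kcl po."
naive = re.split(r"(?<=[.!?])\s+", text)
print(len(naive))  # 1 - the fused boundary goes undetected
```

The healthcare sentence detector above, by contrast, recovers both sentences, as the Results section shows.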
## Results ```bash | | sentence | |---:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | 0 | He was given boluses of MS04 with some effect, he has since been placed on a PCA - he take 80mg of oxycontin at home, his PCA dose is ~ 2 the morphine dose of the oxycontin, he has also received ativan for anxiety. | | 1 | Repleted with 20 meq kcl po, 30 mmol K-phos iv and 2 gms mag so4 iv. | | 2 | LASIX CHANGED TO 40 PO BID WHICH IS SAME AS HE TAKES AT HOME - RECEIVED 40 PO IN AM - 700CC U/O TOTAL FOR FLUID NEGATIVE ~ 600 THUS FAR TODAY, ~ 600 NEG LOS. | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentence_detector_dl_healthcare| |Compatibility:|Healthcare NLP 2.7.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[sentences]| |Language:|en| ## Data Source The healthcare SDDL model is trained on in-house, domain-specific data. ## Benchmarking ```bash label Accuracy Recall Prec F1 0 0.98 1.00 0.96 0.98 ``` --- layout: model title: English image_classifier_vit_base_patch16_224 ViTForImageClassification from google author: John Snow Labs name: image_classifier_vit_base_patch16_224 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_patch16_224` is an English model originally trained by google.
## Predicted Entities `turnstile`, `damselfly`, `mixing bowl`, `sea snake`, `cockroach, roach`, `buckle`, `beer glass`, `bulbul`, `lumbermill, sawmill`, `whippet`, `Australian terrier`, `television, television system`, `hoopskirt, crinoline`, `horse cart, horse-cart`, `guillotine`, `malamute, malemute, Alaskan malamute`, `coyote, prairie wolf, brush wolf, Canis latrans`, `colobus, colobus monkey`, `hognose snake, puff adder, sand viper`, `sock`, `burrito`, `printer`, `bathing cap, swimming cap`, `chiton, coat-of-mail shell, sea cradle, polyplacophore`, `Rottweiler`, `cello, violoncello`, `pitcher, ewer`, `computer keyboard, keypad`, `bow`, `peacock`, `ballplayer, baseball player`, `refrigerator, icebox`, `solar dish, solar collector, solar furnace`, `passenger car, coach, carriage`, `African chameleon, Chamaeleo chamaeleon`, `oboe, hautboy, hautbois`, `toyshop`, `Leonberg`, `howler monkey, howler`, `bluetick`, `African elephant, Loxodonta africana`, `American lobster, Northern lobster, Maine lobster, Homarus americanus`, `combination lock`, `black-and-tan coonhound`, `bonnet, poke bonnet`, `harvester, reaper`, `Appenzeller`, `iron, smoothing iron`, `electric locomotive`, `lycaenid, lycaenid butterfly`, `sandbar, sand bar`, `Cardigan, Cardigan Welsh corgi`, `pencil sharpener`, `jean, blue jean, denim`, `backpack, back pack, knapsack, packsack, rucksack, haversack`, `monitor`, `ice cream, icecream`, `apiary, bee house`, `water jug`, `American coot, marsh hen, mud hen, water hen, Fulica americana`, `ground beetle, carabid beetle`, `jigsaw puzzle`, `ant, emmet, pismire`, `wreck`, `kuvasz`, `gyromitra`, `Ibizan hound, Ibizan Podenco`, `brown bear, bruin, Ursus arctos`, `bolo tie, bolo, bola tie, bola`, `Pembroke, Pembroke Welsh corgi`, `French bulldog`, `prison, prison house`, `ballpoint, ballpoint pen, ballpen, Biro`, `stage`, `airliner`, `dogsled, dog sled, dog sleigh`, `redshank, Tringa totanus`, `menu`, `Indian cobra, Naja naja`, `swab, swob, mop`, `window screen`, 
`brain coral`, `artichoke, globe artichoke`, `loupe, jeweler's loupe`, `loudspeaker, speaker, speaker unit, loudspeaker system, speaker system`, `panpipe, pandean pipe, syrinx`, `wok`, `croquet ball`, `plate`, `scoreboard`, `Samoyed, Samoyede`, `ocarina, sweet potato`, `beaver`, `borzoi, Russian wolfhound`, `horizontal bar, high bar`, `stretcher`, `seat belt, seatbelt`, `obelisk`, `forklift`, `feather boa, boa`, `frying pan, frypan, skillet`, `barbershop`, `hamper`, `face powder`, `Siamese cat, Siamese`, `ladle`, `dingo, warrigal, warragal, Canis dingo`, `mountain tent`, `head cabbage`, `echidna, spiny anteater, anteater`, `Polaroid camera, Polaroid Land camera`, `dumbbell`, `espresso`, `notebook, notebook computer`, `Norfolk terrier`, `binoculars, field glasses, opera glasses`, `carpenter's kit, tool kit`, `moving van`, `catamaran`, `tiger beetle`, `bikini, two-piece`, `Siberian husky`, `studio couch, day bed`, `bulletproof vest`, `lawn mower, mower`, `promontory, headland, head, foreland`, `soap dispenser`, `vulture`, `dam, dike, dyke`, `brambling, Fringilla montifringilla`, `toilet tissue, toilet paper, bathroom tissue`, `ringlet, ringlet butterfly`, `tiger cat`, `mobile home, manufactured home`, `Norwich terrier`, `little blue heron, Egretta caerulea`, `English setter`, `Tibetan mastiff`, `rocking chair, rocker`, `mask`, `maze, labyrinth`, `bookcase`, `viaduct`, `sweatshirt`, `plow, plough`, `basenji`, `typewriter keyboard`, `Windsor tie`, `coral fungus`, `desktop computer`, `Kerry blue terrier`, `Angora, Angora rabbit`, `can opener, tin opener`, `shield, buckler`, `triumphal arch`, `horned viper, cerastes, sand viper, horned asp, Cerastes cornutus`, `miniature schnauzer`, `tape player`, `jaguar, panther, Panthera onca, Felis onca`, `hook, claw`, `file, file cabinet, filing cabinet`, `chime, bell, gong`, `shower curtain`, `window shade`, `acoustic guitar`, `gas pump, gasoline pump, petrol pump, island dispenser`, `cicada, cicala`, `Petri dish`, `paintbrush`, 
`banana`, `chickadee`, `mountain bike, all-terrain bike, off-roader`, `lighter, light, igniter, ignitor`, `oil filter`, `cab, hack, taxi, taxicab`, `Christmas stocking`, `rugby ball`, `black widow, Latrodectus mactans`, `bustard`, `fiddler crab`, `web site, website, internet site, site`, `chocolate sauce, chocolate syrup`, `chainlink fence`, `fireboat`, `cocktail shaker`, `airship, dirigible`, `projectile, missile`, `bagel, beigel`, `screwdriver`, `oystercatcher, oyster catcher`, `pot, flowerpot`, `water bottle`, `Loafer`, `drumstick`, `soccer ball`, `cairn, cairn terrier`, `padlock`, `tow truck, tow car, wrecker`, `bloodhound, sleuthhound`, `punching bag, punch bag, punching ball, punchball`, `great grey owl, great gray owl, Strix nebulosa`, `scale, weighing machine`, `trench coat`, `briard`, `cheetah, chetah, Acinonyx jubatus`, `entertainment center`, `Boston bull, Boston terrier`, `Arabian camel, dromedary, Camelus dromedarius`, `steam locomotive`, `coil, spiral, volute, whorl, helix`, `plane, carpenter's plane, woodworking plane`, `gondola`, `spider web, spider's web`, `bathtub, bathing tub, bath, tub`, `pelican`, `miniature poodle`, `cowboy boot`, `perfume, essence`, `lakeside, lakeshore`, `timber wolf, grey wolf, gray wolf, Canis lupus`, `moped`, `sunscreen, sunblock, sun blocker`, `Brabancon griffon`, `puffer, pufferfish, blowfish, globefish`, `lifeboat`, `pool table, billiard table, snooker table`, `Bouvier des Flandres, Bouviers des Flandres`, `Pomeranian`, `theater curtain, theatre curtain`, `marimba, xylophone`, `baboon`, `vacuum, vacuum cleaner`, `pill bottle`, `pick, plectrum, plectron`, `hen`, `American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier`, `digital watch`, `pier`, `oxygen mask`, `Tibetan terrier, chrysanthemum dog`, `ostrich, Struthio camelus`, `water ouzel, dipper`, `drilling platform, offshore rig`, `magnetic compass`, `throne`, `butternut squash`, `minibus`, `EntleBucher`, `carousel, carrousel, 
merry-go-round, roundabout, whirligig`, `hot pot, hotpot`, `rain barrel`, `wood rabbit, cottontail, cottontail rabbit`, `miniature pinscher`, `partridge`, `three-toed sloth, ai, Bradypus tridactylus`, `English springer, English springer spaniel`, `corkscrew, bottle screw`, `fur coat`, `robin, American robin, Turdus migratorius`, `dowitcher`, `ruddy turnstone, Arenaria interpres`, `water snake`, `stove`, `Great Pyrenees`, `soft-coated wheaten terrier`, `carbonara`, `snail`, `breastplate, aegis, egis`, `wolf spider, hunting spider`, `hatchet`, `CD player`, `axolotl, mud puppy, Ambystoma mexicanum`, `pomegranate`, `poncho`, `leatherback turtle, leatherback, leathery turtle, Dermochelys coriacea`, `lorikeet`, `spatula`, `jay`, `platypus, duckbill, duckbilled platypus, duck-billed platypus, Ornithorhynchus anatinus`, `stethoscope`, `flagpole, flagstaff`, `coho, cohoe, coho salmon, blue jack, silver salmon, Oncorhynchus kisutch`, `agama`, `red wolf, maned wolf, Canis rufus, Canis niger`, `beaker`, `eft`, `pretzel`, `brassiere, bra, bandeau`, `frilled lizard, Chlamydosaurus kingi`, `joystick`, `goldfish, Carassius auratus`, `fig`, `maypole`, `caldron, cauldron`, `admiral`, `impala, Aepyceros melampus`, `spotted salamander, Ambystoma maculatum`, `syringe`, `hog, pig, grunter, squealer, Sus scrofa`, `handkerchief, hankie, hanky, hankey`, `tarantula`, `cheeseburger`, `pinwheel`, `sax, saxophone`, `dung beetle`, `broccoli`, `cassette player`, `milk can`, `traffic light, traffic signal, stoplight`, `shovel`, `sarong`, `tabby, tabby cat`, `parallel bars, bars`, `ladybug, ladybeetle, lady beetle, ladybird, ladybird beetle`, `quill, quill pen`, `giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca`, `steel drum`, `quail`, `Blenheim spaniel`, `wig`, `hamster`, `ice lolly, lolly, lollipop, popsicle`, `seashore, coast, seacoast, sea-coast`, `chest`, `worm fence, snake fence, snake-rail fence, Virginia fence`, `missile`, `beer bottle`, `yellow lady's slipper, yellow 
lady-slipper, Cypripedium calceolus, Cypripedium parviflorum`, `breakwater, groin, groyne, mole, bulwark, seawall, jetty`, `white wolf, Arctic wolf, Canis lupus tundrarum`, `guacamole`, `porcupine, hedgehog`, `trolleybus, trolley coach, trackless trolley`, `greenhouse, nursery, glasshouse`, `trimaran`, `Italian greyhound`, `potter's wheel`, `jacamar`, `wallet, billfold, notecase, pocketbook`, `Lakeland terrier`, `green lizard, Lacerta viridis`, `indigo bunting, indigo finch, indigo bird, Passerina cyanea`, `green mamba`, `walking stick, walkingstick, stick insect`, `crossword puzzle, crossword`, `eggnog`, `barrow, garden cart, lawn cart, wheelbarrow`, `remote control, remote`, `bicycle-built-for-two, tandem bicycle, tandem`, `wool, woolen, woollen`, `black grouse`, `abaya`, `marmoset`, `golf ball`, `jeep, landrover`, `Mexican hairless`, `dishwasher, dish washer, dishwashing machine`, `jersey, T-shirt, tee shirt`, `planetarium`, `goose`, `mailbox, letter box`, `capuchin, ringtail, Cebus capucinus`, `marmot`, `orangutan, orang, orangutang, Pongo pygmaeus`, `coffeepot`, `ambulance`, `shopping basket`, `pop bottle, soda bottle`, `red fox, Vulpes vulpes`, `crash helmet`, `street sign`, `affenpinscher, monkey pinscher, monkey dog`, `Arctic fox, white fox, Alopex lagopus`, `sidewinder, horned rattlesnake, Crotalus cerastes`, `ruffed grouse, partridge, Bonasa umbellus`, `muzzle`, `measuring cup`, `canoe`, `reflex camera`, `fox squirrel, eastern fox squirrel, Sciurus niger`, `French loaf`, `killer whale, killer, orca, grampus, sea wolf, Orcinus orca`, `dial telephone, dial phone`, `thimble`, `bubble`, `vizsla, Hungarian pointer`, `running shoe`, `mailbag, postbag`, `radio telescope, radio reflector`, `piggy bank, penny bank`, `Chihuahua`, `chambered nautilus, pearly nautilus, nautilus`, `Airedale, Airedale terrier`, `kimono`, `green snake, grass snake`, `rubber eraser, rubber, pencil eraser`, `upright, upright piano`, `orange`, `revolver, six-gun, six-shooter`, `ashcan, 
trash can, garbage can, wastebin, ash bin, ash-bin, ashbin, dustbin, trash barrel, trash bin`, `drum, membranophone, tympan`, `Dungeness crab, Cancer magister`, `lipstick, lip rouge`, `gong, tam-tam`, `fountain`, `tub, vat`, `malinois`, `sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita`, `German short-haired pointer`, `apron`, `Irish setter, red setter`, `dishrag, dishcloth`, `school bus`, `candle, taper, wax light`, `bib`, `cash machine, cash dispenser, automated teller machine, automatic teller machine, automated teller, automatic teller, ATM`, `power drill`, `English foxhound`, `miniskirt, mini`, `swing`, `slug`, `hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa`, `rifle`, `Saluki, gazelle hound`, `Sealyham terrier, Sealyham`, `bullet train, bullet`, `hyena, hyaena`, `ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus`, `toy terrier`, `goblet`, `safe`, `cup`, `electric guitar`, `red wine`, `restaurant, eating house, eating place, eatery`, `wall clock`, `washbasin, handbasin, washbowl, lavabo, wash-hand basin`, `red-breasted merganser, Mergus serrator`, `crate`, `banded gecko`, `hippopotamus, hippo, river horse, Hippopotamus amphibius`, `tick`, `tripod`, `sombrero`, `desk`, `sea slug, nudibranch`, `racer, race car, racing car`, `pizza, pizza pie`, `dining table, board`, `Saint Bernard, St Bernard`, `komondor`, `electric ray, crampfish, numbfish, torpedo`, `prairie chicken, prairie grouse, prairie fowl`, `coffee mug`, `hammer`, `golfcart, golf cart`, `unicycle, monocycle`, `bison`, `soup bowl`, `rapeseed`, `golden retriever`, `plastic bag`, `grey fox, gray fox, Urocyon cinereoargenteus`, `water tower`, `house finch, linnet, Carpodacus mexicanus`, `barbell`, `hair slide`, `tiger, Panthera tigris`, `black-footed ferret, ferret, Mustela nigripes`, `meat loaf, meatloaf`, `hand blower, blow dryer, blow drier, hair dryer, hair drier`, `overskirt`, `gibbon, Hylobates lar`, `Gila monster, Heloderma suspectum`, `toucan`, 
`snowmobile`, `pencil box, pencil case`, `scuba diver`, `cloak`, `Sussex spaniel`, `otter`, `Greater Swiss Mountain dog`, `great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias`, `torch`, `magpie`, `tiger shark, Galeocerdo cuvieri`, `wing`, `Border collie`, `bell cote, bell cot`, `sea anemone, anemone`, `teapot`, `sea urchin`, `screen, CRT screen`, `bookshop, bookstore, bookstall`, `oscilloscope, scope, cathode-ray oscilloscope, CRO`, `crib, cot`, `police van, police wagon, paddy wagon, patrol wagon, wagon, black Maria`, `hartebeest`, `manhole cover`, `iPod`, `rock python, rock snake, Python sebae`, `nipple`, `suspension bridge`, `safety pin`, `sea lion`, `cougar, puma, catamount, mountain lion, painter, panther, Felis concolor`, `mantis, mantid`, `wardrobe, closet, press`, `projector`, `Granny Smith`, `diamondback, diamondback rattlesnake, Crotalus adamanteus`, `pirate, pirate ship`, `espresso maker`, `African hunting dog, hyena dog, Cape hunting dog, Lycaon pictus`, `cradle`, `common newt, Triturus vulgaris`, `tricycle, trike, velocipede`, `bobsled, bobsleigh, bob`, `thunder snake, worm snake, Carphophis amoenus`, `thresher, thrasher, threshing machine`, `banjo`, `armadillo`, `pajama, pyjama, pj's, jammies`, `ski`, `Maltese dog, Maltese terrier, Maltese`, `leafhopper`, `book jacket, dust cover, dust jacket, dust wrapper`, `silky terrier, Sydney silky`, `Shih-Tzu`, `wallaby, brush kangaroo`, `cardigan`, `sturgeon`, `freight car`, `home theater, home theatre`, `sundial`, `African crocodile, Nile crocodile, Crocodylus niloticus`, `odometer, hodometer, mileometer, milometer`, `sliding door`, `vine snake`, `West Highland white terrier`, `mongoose`, `hornbill`, `beagle`, `European gallinule, Porphyrio porphyrio`, `submarine, pigboat, sub, U-boat`, `Komodo dragon, Komodo lizard, dragon lizard, giant lizard, Varanus komodoensis`, `cock`, `pedestal, plinth, footstall`, `accordion, piano accordion, squeeze box`, `gown`, `lynx, catamount`, 
`guenon, guenon monkey`, `Walker hound, Walker foxhound`, `standard schnauzer`, `reel`, `hip, rose hip, rosehip`, `grasshopper, hopper`, `Dutch oven`, `stone wall`, `hard disc, hard disk, fixed disk`, `snow leopard, ounce, Panthera uncia`, `shopping cart`, `digital clock`, `hourglass`, `Border terrier`, `Old English sheepdog, bobtail`, `academic gown, academic robe, judge's robe`, `spiny lobster, langouste, rock lobster, crawfish, crayfish, sea crawfish`, `spotlight, spot`, `dome`, `barn spider, Araneus cavaticus`, `bee eater`, `basketball`, `cliff dwelling`, `folding chair`, `isopod`, `Doberman, Doberman pinscher`, `bittern`, `sunglasses, dark glasses, shades`, `picket fence, paling`, `Crock Pot`, `ibex, Capra ibex`, `neck brace`, `cardoon`, `cassette`, `amphibian, amphibious vehicle`, `minivan`, `analog clock`, `trailer truck, tractor trailer, trucking rig, rig, articulated lorry, semi`, `yurt`, `cliff, drop, drop-off`, `Bernese mountain dog`, `teddy, teddy bear`, `sloth bear, Melursus ursinus, Ursus ursinus`, `bassoon`, `toaster`, `ptarmigan`, `Gordon setter`, `night snake, Hypsiglena torquata`, `grand piano, grand`, `purse`, `clumber, clumber spaniel`, `shoji`, `hair spray`, `maillot`, `knee pad`, `space heater`, `bottlecap`, `chiffonier, commode`, `chain saw, chainsaw`, `sulphur butterfly, sulfur butterfly`, `pay-phone, pay-station`, `kelpie`, `mouse, computer mouse`, `car wheel`, `cornet, horn, trumpet, trump`, `container ship, containership, container vessel`, `matchstick`, `scabbard`, `American black bear, black bear, Ursus americanus, Euarctos americanus`, `langur`, `rock crab, Cancer irroratus`, `lionfish`, `speedboat`, `black stork, Ciconia nigra`, `knot`, `disk brake, disc brake`, `mosquito net`, `white stork, Ciconia ciconia`, `abacus`, `titi, titi monkey`, `grocery store, grocery, food market, market`, `waffle iron`, `pickelhaube`, `wooden spoon`, `Norwegian elkhound, elkhound`, `earthstar`, `sewing machine`, `balance beam, beam`, `potpie`, `chain 
mail, ring mail, mail, chain armor, chain armour, ring armor, ring armour`, `Staffordshire bullterrier, Staffordshire bull terrier`, `switch, electric switch, electrical switch`, `dhole, Cuon alpinus`, `paddle, boat paddle`, `limousine, limo`, `Shetland sheepdog, Shetland sheep dog, Shetland`, `space bar`, `library`, `paddlewheel, paddle wheel`, `alligator lizard`, `Band Aid`, `Persian cat`, `bull mastiff`, `tailed frog, bell toad, ribbed toad, tailed toad, Ascaphus trui`, `sports car, sport car`, `football helmet`, `laptop, laptop computer`, `lens cap, lens cover`, `tennis ball`, `violin, fiddle`, `lab coat, laboratory coat`, `cinema, movie theater, movie theatre, movie house, picture palace`, `weasel`, `bow tie, bow-tie, bowtie`, `macaw`, `dough`, `whiskey jug`, `microphone, mike`, `spoonbill`, `bassinet`, `mud turtle`, `velvet`, `warthog`, `plunger, plumber's helper`, `dugong, Dugong dugon`, `honeycomb`, `badger`, `dragonfly, darning needle, devil's darning needle, sewing needle, snake feeder, snake doctor, mosquito hawk, skeeter hawk`, `bee`, `doormat, welcome mat`, `fountain pen`, `giant schnauzer`, `assault rifle, assault gun`, `limpkin, Aramus pictus`, `siamang, Hylobates syndactylus, Symphalangus syndactylus`, `albatross, mollymawk`, `confectionery, confectionary, candy store`, `harp`, `parachute, chute`, `barrel, cask`, `tank, army tank, armored combat vehicle, armoured combat vehicle`, `collie`, `kite`, `puck, hockey puck`, `stupa, tope`, `buckeye, horse chestnut, conker`, `patio, terrace`, `broom`, `Dandie Dinmont, Dandie Dinmont terrier`, `scorpion`, `agaric`, `balloon`, `bucket, pail`, `squirrel monkey, Saimiri sciureus`, `Eskimo dog, husky`, `zebra`, `garter snake, grass snake`, `indri, indris, Indri indri, Indri brevicaudatus`, `tractor`, `guinea pig, Cavia cobaya`, `maraca`, `red-backed sandpiper, dunlin, Erolia alpina`, `bullfrog, Rana catesbeiana`, `trilobite`, `Japanese spaniel`, `gorilla, Gorilla gorilla`, `monastery`, `centipede`, `terrapin`, 
`llama`, `long-horned beetle, longicorn, longicorn beetle`, `boxer`, `curly-coated retriever`, `mortar`, `hammerhead, hammerhead shark`, `goldfinch, Carduelis carduelis`, `garden spider, Aranea diademata`, `stopwatch, stop watch`, `grey whale, gray whale, devilfish, Eschrichtius gibbosus, Eschrichtius robustus`, `leaf beetle, chrysomelid`, `birdhouse`, `king crab, Alaska crab, Alaskan king crab, Alaska king crab, Paralithodes camtschatica`, `stole`, `bell pepper`, `radiator`, `flatworm, platyhelminth`, `mushroom`, `Scotch terrier, Scottish terrier, Scottie`, `liner, ocean liner`, `toilet seat`, `lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens`, `zucchini, courgette`, `harvestman, daddy longlegs, Phalangium opilio`, `Newfoundland, Newfoundland dog`, `flamingo`, `whiptail, whiptail lizard`, `geyser`, `cleaver, meat cleaver, chopper`, `sea cucumber, holothurian`, `American egret, great white heron, Egretta albus`, `parking meter`, `beacon, lighthouse, beacon light, pharos`, `coucal`, `motor scooter, scooter`, `mitten`, `cannon`, `weevil`, `megalith, megalithic structure`, `stinkhorn, carrion fungus`, `ear, spike, capitulum`, `box turtle, box tortoise`, `snowplow, snowplough`, `tench, Tinca tinca`, `modem`, `tobacco shop, tobacconist shop, tobacconist`, `barn`, `skunk, polecat, wood pussy`, `African grey, African gray, Psittacus erithacus`, `Madagascar cat, ring-tailed lemur, Lemur catta`, `holster`, `barometer`, `sleeping bag`, `washer, automatic washer, washing machine`, `recreational vehicle, RV, R.V.`, `drake`, `tray`, `butcher shop, meat market`, `china cabinet, china closet`, `medicine chest, medicine cabinet`, `photocopier`, `Yorkshire terrier`, `starfish, sea star`, `racket, racquet`, `park bench`, `Labrador retriever`, `whistle`, `clog, geta, patten, sabot`, `volcano`, `quilt, comforter, comfort, puff`, `leopard, Panthera pardus`, `cauliflower`, `swimming trunks, bathing trunks`, `American chameleon, anole, Anolis carolinensis`, `alp`, 
`mortarboard`, `barracouta, snoek`, `cocker spaniel, English cocker spaniel, cocker`, `space shuttle`, `beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon`, `harmonica, mouth organ, harp, mouth harp`, `gasmask, respirator, gas helmet`, `wombat`, `Model T`, `wild boar, boar, Sus scrofa`, `hermit crab`, `flat-coated retriever`, `rotisserie`, `jinrikisha, ricksha, rickshaw`, `trifle`, `bannister, banister, balustrade, balusters, handrail`, `go-kart`, `bakery, bakeshop, bakehouse`, `ski mask`, `dock, dockage, docking facility`, `Egyptian cat`, `oxcart`, `redbone`, `shoe shop, shoe-shop, shoe store`, `convertible`, `ox`, `crayfish, crawfish, crawdad, crawdaddy`, `cowboy hat, ten-gallon hat`, `conch`, `spaghetti squash`, `toy poodle`, `saltshaker, salt shaker`, `microwave, microwave oven`, `triceratops`, `necklace`, `castle`, `streetcar, tram, tramcar, trolley, trolley car`, `eel`, `diaper, nappy, napkin`, `standard poodle`, `prayer rug, prayer mat`, `radio, wireless`, `crane`, `envelope`, `rule, ruler`, `gar, garfish, garpike, billfish, Lepisosteus osseus`, `spider monkey, Ateles geoffroyi`, `Irish wolfhound`, `German shepherd, German shepherd dog, German police dog, alsatian`, `umbrella`, `sunglass`, `aircraft carrier, carrier, flattop, attack aircraft carrier`, `water buffalo, water ox, Asiatic buffalo, Bubalus bubalis`, `jellyfish`, `groom, bridegroom`, `tree frog, tree-frog`, `steel arch bridge`, `lemon`, `pickup, pickup truck`, `vault`, `groenendael`, `baseball`, `junco, snowbird`, `maillot, tank suit`, `gazelle`, `jack-o'-lantern`, `military uniform`, `slide rule, slipstick`, `wire-haired fox terrier`, `acorn squash`, `electric fan, blower`, `Brittany spaniel`, `chimpanzee, chimp, Pan troglodytes`, `pillow`, `binder, ring-binder`, `schipperke`, `Afghan hound, Afghan`, `plate rack`, `car mirror`, `hand-held computer, hand-held microcomputer`, `papillon`, `schooner`, `Bedlington terrier`, `cellular telephone, cellular phone, 
cellphone, cell, mobile phone`, `altar`, `Chesapeake Bay retriever`, `cabbage butterfly`, `polecat, fitch, foulmart, foumart, Mustela putorius`, `comic book`, `French horn, horn`, `daisy`, `organ, pipe organ`, `mashed potato`, `acorn`, `fly`, `chain`, `American alligator, Alligator mississipiensis`, `mink`, `garbage truck, dustcart`, `totem pole`, `wine bottle`, `strawberry`, `cricket`, `European fire salamander, Salamandra salamandra`, `coral reef`, `Welsh springer spaniel`, `bighorn, bighorn sheep, cimarron, Rocky Mountain bighorn, Rocky Mountain sheep, Ovis canadensis`, `snorkel`, `bald eagle, American eagle, Haliaeetus leucocephalus`, `meerkat, mierkat`, `grille, radiator grille`, `nematode, nematode worm, roundworm`, `anemone fish`, `corn`, `loggerhead, loggerhead turtle, Caretta caretta`, `palace`, `suit, suit of clothes`, `pineapple, ananas`, `macaque`, `ping-pong ball`, `ram, tup`, `church, church building`, `koala, koala bear, kangaroo bear, native bear, Phascolarctos cinereus`, `hare`, `bath towel`, `strainer`, `yawl`, `otterhound, otter hound`, `table lamp`, `king snake, kingsnake`, `lotion`, `lion, king of beasts, Panthera leo`, `thatch, thatched roof`, `basset, basset hound`, `black and gold garden spider, Argiope aurantia`, `barber chair`, `proboscis monkey, Nasalis larvatus`, `consomme`, `Irish terrier`, `Irish water spaniel`, `common iguana, iguana, Iguana iguana`, `Weimaraner`, `Great Dane`, `pug, pug-dog`, `rhinoceros beetle`, `vase`, `brass, memorial tablet, plaque`, `kit fox, Vulpes macrotis`, `king penguin, Aptenodytes patagonica`, `vending machine`, `dalmatian, coach dog, carriage dog`, `rock beauty, Holocanthus tricolor`, `pole`, `cuirass`, `bolete`, `jackfruit, jak, jack`, `monarch, monarch butterfly, milkweed butterfly, Danaus plexippus`, `chow, chow chow`, `nail`, `packet`, `half track`, `Lhasa, Lhasa apso`, `boathouse`, `hay`, `valley, vale`, `slot, one-armed bandit`, `volleyball`, `carton`, `shower cap`, `tile roof`, `lacewing, lacewing 
fly`, `patas, hussar monkey, Erythrocebus patas`, `boa constrictor, Constrictor constrictor`, `black swan, Cygnus atratus`, `lampshade, lamp shade`, `mousetrap`, `crutch`, `vestment`, `Pekinese, Pekingese, Peke`, `tusker`, `warplane, military plane`, `sandal`, `screw`, `custard apple`, `Scottish deerhound, deerhound`, `spindle`, `keeshond`, `hummingbird`, `letter opener, paper knife, paperknife`, `cucumber, cuke`, `bearskin, busby, shako`, `fire engine, fire truck`, `trombone`, `ringneck snake, ring-necked snake, ring snake`, `sorrel`, `fire screen, fireguard`, `paper towel`, `flute, transverse flute`, `hotdog, hot dog, red hot`, `Indian elephant, Elephas maximus`, `mosque`, `stingray`, `Rhodesian ridgeback`, `four-poster` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_en_4.1.0_3.0_1660165449585.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_en_4.1.0_3.0_1660165449585.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_patch16_224", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_patch16_224", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
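As a sanity check on the checkpoint name, `patch16_224` describes the Vision Transformer geometry this classifier expects: 224x224-pixel inputs cut into 16x16 patches. A quick back-of-the-envelope sketch (plain Python, independent of Spark NLP):

```python
# ViT-base/patch16 geometry implied by the model name
# image_classifier_vit_base_patch16_224: 224x224 inputs, 16x16 patches.
image_size = 224
patch_size = 16

patches_per_side = image_size // patch_size   # patch tokens along one edge
num_patches = patches_per_side ** 2           # sequence length fed to the encoder
print(patches_per_side, num_patches)  # 14 196
```

Each patch becomes one token, so the transformer attends over 196 patch embeddings (plus a class token).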
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_patch16_224| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|324.8 MB| --- layout: model title: English DistilBertForQuestionAnswering Base Cased model (from ms12345) author: John Snow Labs name: distilbert_qa_ms12345_base_cased_led_squad_finetuned date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-finetuned-squad` is an English model originally trained by `ms12345`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_ms12345_base_cased_led_squad_finetuned_en_4.3.0_3.0_1672766560316.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_ms12345_base_cased_led_squad_finetuned_en_4.3.0_3.0_1672766560316.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ms12345_base_cased_led_squad_finetuned","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ms12345_base_cased_led_squad_finetuned","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
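Under the hood, extractive QA models of this kind score every context token as a possible answer start and end, and return the highest-scoring valid span. A toy sketch of that selection step (the logits below are made up for illustration, not real model output):

```python
# Toy span selection over the tokenized context
# "My name is Clara and I live in Berkeley ."
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.2, 0.1, 3.0, 0.0, 0.1, 0.0, 0.0, 0.5, 0.0]
end_logits   = [0.0, 0.1, 0.2, 2.8, 0.1, 0.0, 0.0, 0.1, 0.6, 0.0]

# Pick the (start, end) pair with end >= start maximizing the summed logits.
best = max(
    ((s, e) for s in range(len(tokens)) for e in range(s, len(tokens))),
    key=lambda se: start_logits[se[0]] + end_logits[se[1]],
)
answer = " ".join(tokens[best[0] : best[1] + 1])
print(answer)  # Clara
```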
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_ms12345_base_cased_led_squad_finetuned| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ms12345/distilbert-base-cased-distilled-squad-finetuned-squad --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from karukapur) author: John Snow Labs name: distilbert_qa_karukapur_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `karukapur`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_karukapur_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771710627.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_karukapur_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771710627.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_karukapur_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_karukapur_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_karukapur_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/karukapur/distilbert-base-uncased-finetuned-squad --- layout: model title: English T5ForConditionalGeneration Base Cased model (from google) author: John Snow Labs name: t5_efficient_base_dl2 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-dl2` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_dl2_en_4.3.0_3.0_1675109902657.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_dl2_en_4.3.0_3.0_1675109902657.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_base_dl2","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_base_dl2","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
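T5 checkpoints are steered by a task prefix embedded in the input string itself, so the `PUT YOUR STRING HERE` placeholder above would typically hold something like `summarize: <your text>`. A minimal helper sketch (the helper name and the prefix shown are examples of ours, not part of the Spark NLP API):

```python
def t5_input(task_prefix: str, text: str) -> str:
    """Prepend a T5-style task prefix (e.g. 'summarize',
    'translate English to German') to the payload text."""
    return f"{task_prefix}: {text}"

prompt = t5_input("summarize", "Spark NLP provides scalable NLP pipelines on Apache Spark.")
print(prompt)  # summarize: Spark NLP provides scalable NLP pipelines on Apache Spark.
```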
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_base_dl2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|293.7 MB| ## References - https://huggingface.co/google/t5-efficient-base-dl2 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Question Pair Classifier author: John Snow Labs name: classifierdl_electra_questionpair date: 2021-08-13 tags: [en, open_source] task: Text Classification language: en nav_key: models edition: Spark NLP 3.1.3 spark_version: 2.4 supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Identifies whether two questions are semantic duplicates of each other or ask different things. ## Predicted Entities `almost_same`, `not_same`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_QUESTIONPAIR/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_QUESTIONPAIRS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_electra_questionpair_en_3.1.3_2.4_1628840750568.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_electra_questionpair_en_3.1.3_2.4_1628840750568.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use - The model was trained with `sent_electra_large_uncased` embeddings; the same embeddings must therefore be used in the prediction pipeline.
- The question pairs should be identified with "q1" and "q2" in the text. The input text format should be as follows: `text = "q1: What is your name? q2: Who are you?"`
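The expected input string can be assembled with a small helper like the following (a sketch; the function name is ours, not part of Spark NLP):

```python
def format_question_pair(q1: str, q2: str) -> str:
    """Join two questions into the 'q1: ... q2: ...' format the classifier expects."""
    return f"q1: {q1} q2: {q2}"

text = format_question_pair("What is your name?", "Who are you?")
print(text)  # q1: What is your name? q2: Who are you?
```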
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = BertSentenceEmbeddings.pretrained("sent_electra_large_uncased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") document_classifier = ClassifierDLModel.pretrained('classifierdl_electra_questionpair', 'en') \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") nlpPipeline = Pipeline(stages=[document, embeddings, document_classifier]) light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text"))) result_1 = light_pipeline.annotate("q1: What is your favorite movie? q2: Which movie do you like most?") print(result_1["class"]) result_2 = light_pipeline.annotate("q1: What is your favorite movie? q2: Which movie genre would you like to watch?") print(result_2["class"]) ``` ```scala val document = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val embeddings = BertSentenceEmbeddings.pretrained("sent_electra_large_uncased", "en") .setInputCols("document") .setOutputCol("sentence_embeddings") val document_classifier = ClassifierDLModel.pretrained("classifierdl_electra_questionpair", "en") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val nlpPipeline = new Pipeline().setStages(Array(document, embeddings, document_classifier)) val light_pipeline = new LightPipeline(nlpPipeline.fit(Seq("").toDF("text"))) val result_1 = light_pipeline.annotate("q1: What is your favorite movie? q2: Which movie do you like most?") val result_2 = light_pipeline.annotate("q1: What is your favorite movie? q2: Which movie genre would you like to watch?") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.questionpair").predict("""q1: What is your favorite movie? q2: Which movie genre would you like to watch?""") ```
## Results ```bash ['almost_same'] ['not_same'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_electra_questionpair| |Compatibility:|Spark NLP 3.1.3+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| ## Data Source A custom dataset is used, based on this source: https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs ## Benchmarking ```bash label precision recall f1-score support almost_same 0.85 0.91 0.88 29652 not_same 0.90 0.84 0.87 29634 accuracy - - 0.88 59286 macro-avg 0.88 0.88 0.88 59286 weighted-avg 0.88 0.88 0.88 59286 ``` --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s543 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s543 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s543` is a German model originally trained by jonatasgrosman.
NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s543_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s543_de_4.2.0_3.0_1664118461374.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s543_de_4.2.0_3.0_1664118461374.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s543", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s543", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
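The pipeline above assumes an existing `audioDf` whose `audio_content` column holds arrays of floats (raw mono PCM samples, typically at 16 kHz for Wav2vec2 models). The sketch below builds one such row from a synthetic tone; in practice you would decode a real audio file (for example with `librosa.load(path, sr=16000)`) into the same float-array shape. The `spark.createDataFrame` call is left commented out because it needs a live Spark session:

```python
import math

SAMPLE_RATE = 16_000  # Wav2vec2 checkpoints are usually trained on 16 kHz audio

def sine_tone(freq_hz: float, seconds: float, rate: int = SAMPLE_RATE) -> list:
    """Return `seconds` of a sine wave as a flat list of floats in [-1, 1]."""
    n = int(rate * seconds)
    return [math.sin(2 * math.pi * freq_hz * t / rate) for t in range(n)]

samples = sine_tone(440.0, 0.5)  # half a second of a 440 Hz tone
rows = [(samples,)]              # one row per utterance
# audioDf = spark.createDataFrame(rows, ["audio_content"])  # inside a Spark session
print(len(samples))  # 8000
```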
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s543| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: Abkhazian asr_xls_r_ab_test_by_muneson TFWav2Vec2ForCTC from muneson author: John Snow Labs name: asr_xls_r_ab_test_by_muneson date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_ab_test_by_muneson` is an Abkhazian model originally trained by muneson. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_xls_r_ab_test_by_muneson_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xls_r_ab_test_by_muneson_ab_4.2.0_3.0_1664019191275.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xls_r_ab_test_by_muneson_ab_4.2.0_3.0_1664019191275.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_xls_r_ab_test_by_muneson", "ab")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_xls_r_ab_test_by_muneson", "ab") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_xls_r_ab_test_by_muneson| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ab| |Size:|446.5 KB| --- layout: model title: English BertForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_0 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-16-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_0_en_4.0.0_3.0_1657191925822.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_0_en_4.0.0_3.0_1657191925822.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_0","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_0","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|375.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-16-finetuned-squad-seed-0 --- layout: model title: Sentence Embeddings - sbiobert (tuned) author: John Snow Labs name: sbiobert_jsl_cased date: 2021-05-14 tags: [embeddings, clinical, licensed, en] task: Embeddings language: en nav_key: models edition: Healthcare NLP 3.0.3 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained to generate contextual sentence embeddings of input sentences. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobert_jsl_cased_en_3.0.3_2.4_1621017156951.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobert_jsl_cased_en_3.0.3_2.4_1621017156951.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sbiobert_embeddings = BertSentenceEmbeddings\ .pretrained("sbiobert_jsl_cased","en","clinical/models")\ .setInputCols(["sentence"])\ .setOutputCol("sbert_embeddings") ``` ```scala val sbiobert_embeddings = BertSentenceEmbeddings .pretrained("sbiobert_jsl_cased","en","clinical/models") .setInputCols(Array("sentence")) .setOutputCol("sbert_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed_sentence.biobert.jsl_cased").predict("""Put your text here.""") ```
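Downstream, these sentence vectors are usually compared with cosine similarity (the same measure behind the STS score reported in the Benchmarking section). The helper below is a plain-Python sketch of that comparison on toy 4-dimensional vectors; it is illustrative only and not part of the Spark NLP API — the real vectors are the 768-dimensional arrays found in the `sbert_embeddings` output column.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional stand-ins for the model's 768-dimensional outputs.
emb_1 = [0.1, 0.3, -0.2, 0.7]
emb_2 = [0.1, 0.3, -0.2, 0.7]
emb_3 = [-0.5, 0.1, 0.4, -0.2]

print(round(cosine_similarity(emb_1, emb_2), 3))  # identical vectors -> 1.0
```

Scores close to 1.0 indicate semantically similar sentences; values near 0 indicate unrelated ones.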
## Results

```bash
Gives a 768-dimensional vector representation of the sentence.
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sbiobert_jsl_cased|
|Compatibility:|Healthcare NLP 3.0.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Case sensitive:|true|

## Data Source

Tuned on the MedNLI and UMLS datasets.

## Benchmarking

```bash
MedNLI     Score
Acc        0.788
STS(cos)   0.736
```

---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_efficient_small_el8
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-el8` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el8_en_4.3.0_3.0_1675120502722.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el8_en_4.3.0_3.0_1675120502722.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_small_el8","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_small_el8","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_small_el8|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|159.9 MB|

## References

- https://huggingface.co/google/t5-efficient-small-el8
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145

---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from mrm8488)
author: John Snow Labs
name: t5_small_finetuned_emotion
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-finetuned-emotion` is an English model originally trained by `mrm8488`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_finetuned_emotion_en_4.3.0_3.0_1675125983774.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_finetuned_emotion_en_4.3.0_3.0_1675125983774.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_small_finetuned_emotion","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_small_finetuned_emotion","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_small_finetuned_emotion|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|263.3 MB|

## References

- https://huggingface.co/mrm8488/t5-small-finetuned-emotion
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://github.com/dair-ai/emotion_dataset
- https://arxiv.org/pdf/1910.10683.pdf
- https://i.imgur.com/jVFMMWR.png
- https://twitter.com/omarsar0
- https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb
- https://github.com/patil-suraj
- https://i.imgur.com/JBtAwPx.png
- https://twitter.com/mrm8488
- https://www.linkedin.com/in/manuel-romero-cs/

---
layout: model
title: Question classification of open-domain and fact-based questions Pipeline - TREC50
author: John Snow Labs
name: classifierdl_use_trec50_pipeline
date: 2021-01-08
task: [Text Classification, Pipeline Public]
language: en
nav_key: models
edition: Spark NLP 2.7.1
spark_version: 2.4
tags: [classifier, text_classification, pipeline, en, open_source]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Classify open-domain, fact-based questions into one of the following broad semantic categories: Abbreviation, Description, Entities, Human Beings, Locations or Numeric Values.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_EN_TREC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_EN_TREC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_trec50_pipeline_en_2.7.1_2.4_1610119592993.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_trec50_pipeline_en_2.7.1_2.4_1610119592993.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("classifierdl_use_trec50_pipeline", lang = "en") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("classifierdl_use_trec50_pipeline", lang = "en") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.trec50.component_list").predict("""Put your text here.""") ```
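The pipeline predicts fine-grained TREC-50 labels such as `NUM_date`, which combine a coarse category code with a fine subtype. Assuming the usual `COARSE_fine` label shape and the standard TREC coarse codes (assumptions worth checking against your pipeline's actual outputs), a small helper can map predictions back to the broad categories named in the description:

```python
# Standard TREC coarse-category codes (an assumption; verify against your outputs).
COARSE_NAMES = {
    "ABBR": "Abbreviation",
    "DESC": "Description",
    "ENTY": "Entities",
    "HUM": "Human Beings",
    "LOC": "Locations",
    "NUM": "Numeric Values",
}

def describe_label(label):
    # Split 'COARSE_fine' into a readable coarse name plus the fine subtype.
    coarse, _, fine = label.partition("_")
    return COARSE_NAMES.get(coarse, coarse), fine

print(describe_label("NUM_date"))  # ('Numeric Values', 'date')
```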
## Results

```bash
+------------------------------------------------------------+--------+
|document                                                    |class   |
+------------------------------------------------------------+--------+
|When did the construction of stone circles begin in the UK?|NUM_date|
+------------------------------------------------------------+--------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|classifierdl_use_trec50_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.1+|
|Edition:|Official|
|Language:|en|

---
layout: model
title: English T5ForConditionalGeneration Tiny Cased model (from google)
author: John Snow Labs
name: t5_efficient_tiny_dl2
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-dl2` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_dl2_en_4.3.0_3.0_1675123232101.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_dl2_en_4.3.0_3.0_1675123232101.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_tiny_dl2","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_tiny_dl2","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_tiny_dl2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|54.2 MB|

## References

- https://huggingface.co/google/t5-efficient-tiny-dl2
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145

---
layout: model
title: NER Pipeline for 6 Scandinavian Languages
author: John Snow Labs
name: bert_token_classifier_scandi_ner_pipeline
date: 2022-06-25
tags: [danish, norwegian, swedish, icelandic, faroese, bert, xx, open_source]
task: Named Entity Recognition
language: xx
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on the [bert_token_classifier_scandi_ner](https://nlp.johnsnowlabs.com/2021/12/09/bert_token_classifier_scandi_ner_xx.html) model, which is imported from `HuggingFace`.

## Predicted Entities

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_SCANDINAVIAN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_scandi_ner_pipeline_xx_4.0.0_3.0_1656119754783.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_scandi_ner_pipeline_xx_4.0.0_3.0_1656119754783.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
scandiner_pipeline = PretrainedPipeline("bert_token_classifier_scandi_ner_pipeline", lang = "xx")

scandiner_pipeline.annotate("Hans er professor ved Statens Universitet, som ligger i København, og han er en rigtig københavner.")
```
```scala
val scandiner_pipeline = new PretrainedPipeline("bert_token_classifier_scandi_ner_pipeline", lang = "xx")

scandiner_pipeline.annotate("Hans er professor ved Statens Universitet, som ligger i København, og han er en rigtig københavner.")
```
## Results ```bash +-------------------+---------+ |chunk |ner_label| +-------------------+---------+ |Hans |PER | |Statens Universitet|ORG | |København |LOC | |københavner |MISC | +-------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_scandi_ner_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|xx| |Size:|666.9 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - BertForTokenClassification - NerConverter - Finisher --- layout: model title: Detect Cellular/Molecular Biology Entities author: John Snow Labs name: bert_token_classifier_ner_jnlpba_cellular date: 2022-07-25 tags: [en, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects molecular biology-related terms in medical texts. The model is trained with the BertForTokenClassification method from the transformers library and imported into Spark NLP. 
## Predicted Entities `cell_line`, `cell_type`, `protein`, `DNA`, `RNA` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jnlpba_cellular_en_4.0.0_3.0_1658754953153.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jnlpba_cellular_en_4.0.0_3.0_1658754953153.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_jnlpba_cellular", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(512)

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    ner_model,
    ner_converter
])

data = spark.createDataFrame([["""The results suggest that activation of protein kinase C, but not new protein synthesis, is required for IL-2 induction of IFN-gamma and GM-CSF cytoplasmic mRNA. It also was observed that suppression of cytokine gene expression by these agents was independent of the inhibition of proliferation. These data indicate that IL-2 and IL-12 may have distinct signaling pathways leading to the induction of IFN-gammaand GM-CSFgene expression, andthatthe NK3.3 cell line may serve as a novel model for dissecting the biochemical and molecular events involved in these pathways. A functional T-cell receptor signaling pathway is required for p95vav activity. Stimulation of the T-cell antigen receptor ( TCR ) induces activation of multiple tyrosine kinases, resulting in phosphorylation of numerous intracellular substrates.
One substrate is p95vav, which is expressed exclusively in hematopoietic and trophoblast cells."""]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_jnlpba_cellular", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("ner")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, ner_model, ner_converter))

val data = Seq("""The results suggest that activation of protein kinase C, but not new protein synthesis, is required for IL-2 induction of IFN-gamma and GM-CSF cytoplasmic mRNA. It also was observed that suppression of cytokine gene expression by these agents was independent of the inhibition of proliferation. These data indicate that IL-2 and IL-12 may have distinct signaling pathways leading to the induction of IFN-gammaand GM-CSFgene expression, andthatthe NK3.3 cell line may serve as a novel model for dissecting the biochemical and molecular events involved in these pathways. A functional T-cell receptor signaling pathway is required for p95vav activity. Stimulation of the T-cell antigen receptor ( TCR ) induces activation of multiple tyrosine kinases, resulting in phosphorylation of numerous intracellular substrates.
One substrate is p95vav, which is expressed exclusively in hematopoietic and trophoblast cells.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.jnlpba_cellular").predict("""The results suggest that activation of protein kinase C, but not new protein synthesis, is required for IL-2 induction of IFN-gamma and GM-CSF cytoplasmic mRNA. It also was observed that suppression of cytokine gene expression by these agents was independent of the inhibition of proliferation. These data indicate that IL-2 and IL-12 may have distinct signaling pathways leading to the induction of IFN-gammaand GM-CSFgene expression, andthatthe NK3.3 cell line may serve as a novel model for dissecting the biochemical and molecular events involved in these pathways. A functional T-cell receptor signaling pathway is required for p95vav activity. Stimulation of the T-cell antigen receptor ( TCR ) induces activation of multiple tyrosine kinases, resulting in phosphorylation of numerous intracellular substrates. One substrate is p95vav, which is expressed exclusively in hematopoietic and trophoblast cells.""") ```
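In the pipeline above, NerConverter assembles the model's token-level BIO tags into entity chunks. The helper below is a rough, simplified sketch of that grouping logic on toy input (illustrative only, not the actual NerConverter implementation):

```python
def bio_to_chunks(tokens, tags):
    """Group (token, BIO-tag) pairs into (chunk, label) pairs --
    a simplified sketch of what NerConverter does downstream."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["activation", "of", "protein", "kinase", "C"]
tags   = ["O", "O", "B-protein", "I-protein", "I-protein"]
print(bio_to_chunks(tokens, tags))  # [('protein kinase C', 'protein')]
```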
## Results

```bash
+-------------------------------------+---------+
|ner_chunk                            |label    |
+-------------------------------------+---------+
|protein kinase C                     |protein  |
|IL-2                                 |protein  |
|IFN-gamma and GM-CSF cytoplasmic mRNA|RNA      |
|cytokine gene                        |DNA      |
|IL-2                                 |protein  |
|IL-12                                |protein  |
|IFN-gammaand GM-CSFgene              |protein  |
|NK3.3 cell line                      |cell_line|
|T-cell receptor                      |protein  |
|p95vav                               |protein  |
|T-cell antigen receptor              |protein  |
|TCR                                  |protein  |
|tyrosine kinases                     |protein  |
|p95vav                               |protein  |
|hematopoietic and trophoblast cells  |cell_type|
+-------------------------------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_jnlpba_cellular|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

[https://github.com/cambridgeltl/MTL-Bioinformatics-2016](https://github.com/cambridgeltl/MTL-Bioinformatics-2016)

## Benchmarking

```bash
label         precision  recall  f1-score  support
B-cell_line      0.5850  0.6880    0.6324      500
I-cell_line      0.6374  0.7644    0.6952      989
B-DNA            0.7187  0.7453    0.7318     1056
I-DNA            0.8134  0.8603    0.8362     1789
B-protein        0.7286  0.8429    0.7816     5067
I-protein        0.8020  0.8129    0.8074     4774
B-RNA            0.6812  0.7966    0.7344      118
I-RNA            0.8358  0.8984    0.8660      187
B-cell_type      0.7768  0.7501    0.7632     1921
I-cell_type      0.8654  0.7887    0.8253     2991
micro-avg        0.7673  0.8065    0.7864    19392
macro-avg        0.7444  0.7948    0.7673    19392
weighted-avg     0.7722  0.8065    0.7875    19392
```

---
layout: model
title: Arabic Bert Embeddings (Base, 1790k Iterations)
author: John Snow Labs
name: bert_embeddings_bert_base_qarib60_1790k
date: 2022-04-11
tags: [bert, embeddings, ar, open_source]
task: Embeddings
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-qarib60_1790k` is an Arabic model originally trained by `qarib`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_qarib60_1790k_ar_3.4.2_3.0_1649678882711.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_qarib60_1790k_ar_3.4.2_3.0_1649678882711.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_qarib60_1790k","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_qarib60_1790k","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.bert_base_qarib60_1790k").predict("""أنا أحب شرارة NLP""") ```
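The pipeline above produces one vector per token in the `embeddings` column. If a single fixed-size vector per text is needed, one common approach is mean pooling over the token vectors. The sketch below illustrates the idea on toy 3-dimensional stand-ins; the `mean_pool` helper is illustrative only and not part of the Spark NLP API (Spark NLP's sentence-embedding annotators handle this for you).

```python
def mean_pool(token_embeddings):
    """Average per-token vectors into one fixed-size vector --
    a simple pooling strategy for deriving a text-level embedding."""
    dim = len(token_embeddings[0])
    n = len(token_embeddings)
    return [sum(vec[i] for vec in token_embeddings) / n for i in range(dim)]

# Toy 3-dimensional stand-ins for the model's 768-dimensional token vectors.
token_vectors = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(mean_pool(token_vectors))  # [2.0, 3.0, 4.0]
```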
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_qarib60_1790k| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|507.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/qarib/bert-base-qarib60_1790k - http://opus.nlpl.eu/ - https://github.com/qcri/QARIB/Training_QARiB.md - https://github.com/qcri/QARIB/Using_QARiB.md --- layout: model title: Fast Neural Machine Translation Model from English to Eastern Malayo-Polynesian Languages author: John Snow Labs name: opus_mt_en_pqe date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, pqe, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `pqe` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_pqe_xx_2.7.0_2.4_1609164776742.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_pqe_xx_2.7.0_2.4_1609164776742.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_pqe", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_pqe", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.pqe').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_en_pqe|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Italian T5ForConditionalGeneration Small Cased model (from it5)
author: John Snow Labs
name: t5_it5_efficient_small_el32_ilgiornale_to_repubblica
date: 2023-01-30
tags: [it, open_source, t5, tensorflow]
task: Text Generation
language: it
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `it5-efficient-small-el32-ilgiornale-to-repubblica` is an Italian model originally trained by `it5`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_ilgiornale_to_repubblica_it_4.3.0_3.0_1675103353037.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_ilgiornale_to_repubblica_it_4.3.0_3.0_1675103353037.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_ilgiornale_to_repubblica","it") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_ilgiornale_to_repubblica","it")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_it5_efficient_small_el32_ilgiornale_to_repubblica| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|it| |Size:|593.9 MB| ## References - https://huggingface.co/it5/it5-efficient-small-el32-ilgiornale-to-repubblica - https://github.com/stefan-it - https://arxiv.org/abs/2203.03759 - https://gsarti.com - https://malvinanissim.github.io - https://arxiv.org/abs/2109.10686 - https://github.com/gsarti/it5 - https://paperswithcode.com/sota?task=Headline+style+transfer+%28Il+Giornale+to+Repubblica%29&dataset=CHANGE-IT --- layout: model title: Persian XlmRoBertaForQuestionAnswering model (from m3hrdadfi) author: John Snow Labs name: xlmroberta_qa_xlmr_large date: 2022-06-28 tags: [fa, open_source, xlmroberta, question_answering] task: Question Answering language: fa edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlmr-large-qa-fa` is a Persian model originally trained by `m3hrdadfi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_qa_xlmr_large_fa_4.0.0_3.0_1656419067604.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_qa_xlmr_large_fa_4.0.0_3.0_1656419067604.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlmroberta_qa_xlmr_large","fa") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["اسم من چیست؟", "نام من کلارا است و من در برکلی زندگی می کنم."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlmroberta_qa_xlmr_large","fa") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("اسم من چیست؟", "نام من کلارا است و من در برکلی زندگی می کنم.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fa.answer_question.xlmr_roberta.large").predict("""اسم من چیست؟|||نام من کلارا است و من در برکلی زندگی می کنم.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_qa_xlmr_large| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|fa| |Size:|1.9 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/m3hrdadfi/xlmr-large-qa-fa - https://github.com/sajjjadayobi/PersianQA - https://github.com/m3hrdadfi --- layout: model title: Detect Clinical Entities (bert_token_classifier_ner_clinical) author: John Snow Labs name: bert_token_classifier_ner_clinical date: 2022-01-06 tags: [berfortokenclassification, ner, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a BERT-based version of the ner_clinical model and performs 4% better than the legacy NER model (MedicalNerModel), which is based on a BiLSTM-CNN-Char architecture. ## Predicted Entities `PROBLEM`, `TEST`, `TREATMENT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_en_3.3.4_2.4_1641472941908.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_en_3.3.4_2.4_1641472941908.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_clinical", "en", "clinical/models")\ .setInputCols("token", "sentence")\ .setOutputCol("ner") pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, tokenClassifier ]) p_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) text = ''' A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . 
Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge . 
''' result = p_model.transform(spark.createDataFrame([[text]]).toDF("text")) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_clinical", "en", "clinical/models") .setInputCols(Array("token", "sentence")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector,tokenizer,tokenClassifier)) val data = Seq("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . 
Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge .""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_clinical").predict(""" A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . 
Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . 
It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge . """) ```
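The Results table below is chunk-level, while the token classifier above emits one BIO tag per token; in a full pipeline an `NerConverter` stage performs that grouping. As a plain-Python illustration of the aggregation step (the token and tag lists here are hypothetical, not actual pipeline output):

```python
def bio_to_chunks(tokens, tags):
    """Group token-level BIO tags (B-X, I-X, O) into (chunk, label) pairs."""
    chunks, cur_tokens, cur_label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new chunk, closing any open one.
            if cur_tokens:
                chunks.append((" ".join(cur_tokens), cur_label))
            cur_tokens, cur_label = [tok], tag[2:]
        elif tag.startswith("I-") and cur_label == tag[2:]:
            # An I- tag extends the open chunk only if the labels match.
            cur_tokens.append(tok)
        else:
            # "O", or an I- tag that does not continue the open chunk.
            if cur_tokens:
                chunks.append((" ".join(cur_tokens), cur_label))
            cur_tokens, cur_label = [], None
    if cur_tokens:
        chunks.append((" ".join(cur_tokens), cur_label))
    return chunks

# Hypothetical tagger output for a fragment of the example text:
print(bio_to_chunks(
    ["gestational", "diabetes", "mellitus", "and", "vomiting"],
    ["B-PROBLEM", "I-PROBLEM", "I-PROBLEM", "O", "B-PROBLEM"]))
# → [('gestational diabetes mellitus', 'PROBLEM'), ('vomiting', 'PROBLEM')]
```

This is only a sketch of the convention; in production, use the `NerConverter` annotator, which also preserves character offsets and metadata.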
## Results ```bash +-------------------------------+---------+ |chunk |ner_label| +-------------------------------+---------+ |gestational diabetes mellitus |PROBLEM | |type two diabetes mellitus |PROBLEM | |T2DM |PROBLEM | |HTG-induced pancreatitis |PROBLEM | |an acute hepatitis |PROBLEM | |obesity |PROBLEM | |a body mass index |TEST | |BMI |TEST | |polyuria |PROBLEM | |polydipsia |PROBLEM | |poor appetite |PROBLEM | |vomiting |PROBLEM | |amoxicillin |TREATMENT| |a respiratory tract infection |PROBLEM | |metformin |TREATMENT| |glipizide |TREATMENT| |dapagliflozin |TREATMENT| |T2DM |PROBLEM | |atorvastatin |TREATMENT| |gemfibrozil |TREATMENT| +-------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_clinical| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|128| ## Data Source Trained on augmented 2010 i2b2 challenge data with ‘embeddings_clinical’. 
https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ ## Benchmarking ```bash label precision recall f1-score support PROBLEM 0.88 0.92 0.90 30276 TEST 0.91 0.86 0.88 17237 TREATMENT 0.87 0.88 0.88 17298 O 0.97 0.97 0.97 202438 accuracy - - 0.95 267249 macro-avg 0.91 0.91 0.91 267249 weighted-avg 0.95 0.95 0.95 267249 ``` --- layout: model title: Indonesian BertForQuestionAnswering Base Cased model (from cahya) author: John Snow Labs name: bert_qa_base_indonesian_tydiqa date: 2022-07-07 tags: [id, open_source, bert, question_answering] task: Question Answering language: id edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-indonesian-tydiqa` is an Indonesian model originally trained by `cahya`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_indonesian_tydiqa_id_4.0.0_3.0_1657183004627.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_indonesian_tydiqa_id_4.0.0_3.0_1657183004627.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_indonesian_tydiqa","id") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Siapa namaku?", "Nama saya Clara dan saya tinggal di Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_indonesian_tydiqa","id") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Siapa namaku?", "Nama saya Clara dan saya tinggal di Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_indonesian_tydiqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|id| |Size:|413.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/cahya/bert-base-indonesian-tydiqa --- layout: model title: English RobertaForQuestionAnswering Cased model (from vuiseng9) author: John Snow Labs name: roberta_qa_l_squadv1.1 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-l-squadv1.1` is an English model originally trained by `vuiseng9`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_l_squadv1.1_en_4.3.0_3.0_1674220913524.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_l_squadv1.1_en_4.3.0_3.0_1674220913524.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_l_squadv1.1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_l_squadv1.1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
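Span-classification QA models such as this one score every candidate (start, end) token pair from two logit vectors and return the highest-scoring valid span as the answer. A simplified, framework-free sketch of that selection step (the logit values below are made up for illustration):

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Pick the (start, end) token-index pair with the highest combined
    score, requiring end >= start and a bounded answer length."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best_score:
                best_score, best = score, (i, j)
    return best

# Made-up logits over five tokens: the model "points" at tokens 2..3.
print(best_span([0.1, 0.2, 4.0, 0.1, 0.0], [0.0, 0.1, 0.5, 3.5, 0.2]))
# → (2, 3)
```

Inside Spark NLP this decoding happens in the annotator itself; the sketch only shows why the `answer` column contains a contiguous substring of the context.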
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_l_squadv1.1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/vuiseng9/roberta-l-squadv1.1 --- layout: model title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_masapasa TFWav2Vec2ForCTC from masapasa author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_turkish_colab_by_masapasa date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_masapasa` is an English model originally trained by masapasa. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_300m_turkish_colab_by_masapasa_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_masapasa_en_4.2.0_3.0_1664096613190.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_masapasa_en_4.2.0_3.0_1664096613190.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_masapasa", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_masapasa", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
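Both snippets assume an `audioDf` whose `audio_content` column holds each recording as an array of floats. How that DataFrame is built is up to you; one hedged, standard-library-only sketch for 16-bit mono PCM WAV files (the `audioDf` construction at the end is commented out because it needs a live Spark session, and the file name is illustrative):

```python
import struct
import wave

def wav_to_floats(path):
    """Decode a 16-bit PCM WAV file into floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "sketch handles 16-bit PCM only"
        frames = w.readframes(w.getnframes())
    # "<h" = little-endian signed 16-bit, the usual WAV sample format.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# With a Spark session available (names are assumptions, not card API):
# audioDf = spark.createDataFrame(
#     [(wav_to_floats("sample.wav"),)], ["audio_content"])
```

Wav2Vec2 models are trained on 16 kHz audio, so resample recordings to that rate before feeding them to the pipeline.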
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_turkish_colab_by_masapasa| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|756.6 MB| --- layout: model title: German Bert Embeddings (Uncased) author: John Snow Labs name: bert_embeddings_bert_base_german_dbmdz_uncased date: 2022-04-11 tags: [bert, embeddings, de, open_source] task: Embeddings language: de edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-german-dbmdz-uncased` is a German model originally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_german_dbmdz_uncased_de_3.4.2_3.0_1649675809759.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_german_dbmdz_uncased_de_3.4.2_3.0_1649675809759.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_german_dbmdz_uncased","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_german_dbmdz_uncased","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.embed.bert_base_german_dbmdz_uncased").predict("""Ich liebe Spark NLP""") ```
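The `embeddings` column holds one dense vector per token; a common downstream step is comparing such vectors with cosine similarity. A minimal, framework-free sketch, assuming the vectors have already been collected out of the DataFrame into plain Python lists:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings"; this model's real vectors are
# 768-dimensional.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # → 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # → 0.0
```

At scale, prefer a vectorized library (e.g. NumPy) over this per-element loop.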
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_german_dbmdz_uncased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|412.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/bert-base-german-dbmdz-uncased --- layout: model title: English asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8 TFWav2Vec2ForCTC from lilitket author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8` is an English model originally trained by lilitket. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8_en_4.2.0_3.0_1664120453852.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8_en_4.2.0_3.0_1664120453852.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr8| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Japanese Part of Speech Tagger (from KoichiYasuoka) author: John Snow Labs name: bert_pos_bert_large_japanese_luw_upos date: 2022-05-09 tags: [bert, pos, part_of_speech, ja, open_source] task: Part of Speech Tagging language: ja edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-japanese-luw-upos` is a Japanese model originally trained by `KoichiYasuoka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_large_japanese_luw_upos_ja_3.4.2_3.0_1652092112309.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_large_japanese_luw_upos_ja_3.4.2_3.0_1652092112309.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_large_japanese_luw_upos","ja") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Spark NLPが大好きです"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_large_japanese_luw_upos","ja") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Spark NLPが大好きです").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_large_japanese_luw_upos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ja| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/KoichiYasuoka/bert-large-japanese-luw-upos - https://universaldependencies.org/u/pos/ - http://id.nii.ac.jp/1001/00216223/ - https://github.com/KoichiYasuoka/esupar --- layout: model title: Extract Entities in Spanish Clinical Trial Abstracts author: John Snow Labs name: ner_clinical_trials_abstracts date: 2022-08-12 tags: [es, clinical, licensed, ner, clinical_abstracts, chem, diso, proc] task: Named Entity Recognition language: es edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Named Entity Recognition model detects relevant entities in Spanish clinical trial abstracts. It was trained with the MedicalNerApproach annotator, which trains generic neural-network-based NER models. The model detects pharmacological and chemical substances (CHEM), pathologies (DISO), and lab tests and diagnostic or therapeutic procedures (PROC). ## Predicted Entities `CHEM`, `DISO`, `PROC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_trials_abstracts_es_4.0.2_3.0_1660339167613.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_trials_abstracts_es_4.0.2_3.0_1660339167613.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained('ner_clinical_trials_abstracts', "es", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["""Efecto de la suplementación con ácido fólico sobre los niveles de homocisteína total en pacientes en hemodiálisis. La hiperhomocisteinemia es un marcador de riesgo independiente de morbimortalidad cardiovascular. 
Hemos prospectivamente reducir los niveles de homocisteína total (tHcy) mediante suplemento con ácido fólico y vitamina B6 (pp), valorando su posible correlación con dosis de diálisis, función residual y parámetros nutricionales."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es") .setInputCols("sentence","token") .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_clinical_trials_abstracts", "es", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documenter, sentenceDetector, tokenizer, word_embeddings, ner_model, ner_converter)) val data = Seq(Array("""Efecto de la suplementación con ácido fólico sobre los niveles de homocisteína total en pacientes en hemodiálisis. La hiperhomocisteinemia es un marcador de riesgo independiente de morbimortalidad cardiovascular. Hemos prospectivamente reducir los niveles de homocisteína total (tHcy) mediante suplemento con ácido fólico y vitamina B6 (pp), valorando su posible correlación con dosis de diálisis, función residual y parámetros nutricionales.""")).toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.clinical_trial_abstracts").predict("""Efecto de la suplementación con ácido fólico sobre los niveles de homocisteína total en pacientes en hemodiálisis. 
La hiperhomocisteinemia es un marcador de riesgo independiente de morbimortalidad cardiovascular. Hemos prospectivamente reducir los niveles de homocisteína total (tHcy) mediante suplemento con ácido fólico y vitamina B6 (pp), valorando su posible correlación con dosis de diálisis, función residual y parámetros nutricionales.""") ```
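Once the `ner_chunk` annotations have been collected to the driver, they can be flattened into plain `(chunk, label)` pairs with no Spark-specific code. A minimal sketch, assuming the annotations have been converted to dicts shaped like Spark NLP's `Annotation` objects (`result` holds the chunk text, `metadata["entity"]` the label; these field names mirror the annotation schema, the sample values below are illustrative):

```python
def chunks_with_labels(ner_chunks):
    """Return (chunk_text, entity_label) tuples from collected ner_chunk rows."""
    return [(a["result"], a["metadata"]["entity"]) for a in ner_chunks]

# Illustrative collected output for the abstract above
sample = [
    {"result": "ácido fólico", "metadata": {"entity": "CHEM"}},
    {"result": "hemodiálisis", "metadata": {"entity": "PROC"}},
]
pairs = chunks_with_labels(sample)  # [("ácido fólico", "CHEM"), ("hemodiálisis", "PROC")]
```

The same pattern applies to any Spark NLP NER model whose converter emits `ner_chunk` annotations.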
## Results ```bash +-----------------------------+---------+ |chunk |ner_label| +-----------------------------+---------+ |suplementación |PROC | |ácido fólico |CHEM | |niveles de homocisteína |PROC | |hemodiálisis |PROC | |hiperhomocisteinemia |DISO | |niveles de homocisteína total|PROC | |tHcy |PROC | |ácido fólico |CHEM | |vitamina B6 |CHEM | |pp |CHEM | |diálisis |PROC | |función residual |PROC | +-----------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical_trials_abstracts| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|16.4 MB| ## References The model is prepared using the reference paper: "A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine", Leonardo Campillos-Llanos, Ana Valverde-Mateos, Adrián Capllonch-Carrión and Antonio Moreno-Sandoval. 
BMC Medical Informatics and Decision Making volume 21, Article number: 69 (2021) ## Benchmarking ```bash label precision recall f1-score support B-DISO 0.91 0.93 0.92 2465 I-DISO 0.85 0.88 0.86 2788 B-CHEM 0.91 0.92 0.91 1558 I-CHEM 0.82 0.91 0.86 645 B-PROC 0.89 0.91 0.90 3348 I-PROC 0.80 0.87 0.83 4232 micro-avg 0.86 0.90 0.88 15036 macro-avg 0.86 0.90 0.88 15036 weighted-avg 0.86 0.90 0.88 15036 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from clevrly) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_hotpot date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-hotpot_qa` is an English model originally trained by `clevrly`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_hotpot_en_4.3.0_3.0_1672767990673.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_hotpot_en_4.3.0_3.0_1672767990673.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_hotpot","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_hotpot","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
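The `answer` column of the transformed DataFrame holds candidate answers as annotations. After collecting them to the driver, picking the top-scoring answer is a one-liner; a sketch in pure Python (the dict shape mirrors Spark NLP's annotation schema, with the score assumed to live in `metadata` — field names are illustrative):

```python
def best_answer(answers):
    """Return the answer text with the highest score, or None if there are no candidates."""
    if not answers:
        return None
    top = max(answers, key=lambda a: float(a["metadata"].get("score", 0.0)))
    return top["result"]
```

For the example above, a collected result like `[{"result": "Clara", "metadata": {"score": "0.98"}}]` would yield `"Clara"`.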
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_hotpot| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/clevrly/distilbert-base-uncased-finetuned-hotpot_qa --- layout: model title: Pipeline to Detect Chemical Compounds and Genes (BertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_chemprot_pipeline date: 2023-03-20 tags: [berfortokenclassification, chemprot, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_chemprot](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_chemprot_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemprot_pipeline_en_4.3.0_3.2_1679306959462.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemprot_pipeline_en_4.3.0_3.2_1679306959462.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_chemprot_pipeline", "en", "clinical/models") text = '''Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_chemprot_pipeline", "en", "clinical/models") val text = "Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.chemprot_pipeline").predict("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""") ```
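As the Results section below shows, this pipeline emits one chunk per token ("Keratinocyte", "growth", "factor" as three separate GENE-Y chunks). When a full mention is wanted, adjacent same-label chunks can be merged by their character offsets. A naive pure-Python sketch (offsets are inclusive, as in Spark NLP annotations; a gap of 2 allows one separating space — this is an illustrative heuristic, not part of the pipeline):

```python
def merge_adjacent(chunks, gap=2):
    """Merge same-label chunks whose begin follows the previous end by at most `gap` chars.

    Each chunk is a (text, begin, end, label) tuple with inclusive offsets.
    """
    merged = []
    for text, begin, end, label in chunks:
        if merged and label == merged[-1][3] and begin - merged[-1][2] <= gap:
            prev_text, prev_begin, _, _ = merged[-1]
            merged[-1] = (prev_text + " " + text, prev_begin, end, label)
        else:
            merged.append((text, begin, end, label))
    return merged
```

Applied to the token-level chunks in the Results table, this yields the two full mentions "Keratinocyte growth factor" and "acidic fibroblast growth factor".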
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-------------|--------:|------:|:------------|-------------:| | 0 | Keratinocyte | 0 | 11 | GENE-Y | 0.999147 | | 1 | growth | 13 | 18 | GENE-Y | 0.999752 | | 2 | factor | 20 | 25 | GENE-Y | 0.999685 | | 3 | acidic | 31 | 36 | GENE-Y | 0.999661 | | 4 | fibroblast | 38 | 47 | GENE-Y | 0.999753 | | 5 | growth | 49 | 54 | GENE-Y | 0.999771 | | 6 | factor | 56 | 61 | GENE-Y | 0.999742 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_chemprot_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.9 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: English Bert Embeddings (Base, Uncased, Unstructured, Without Classifier Layer) author: John Snow Labs name: bert_embeddings_bert_base_uncased_mnli_sparse_70_unstructured_no_classifier date: 2022-04-11 tags: [bert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-uncased-mnli-sparse-70-unstructured-no-classifier` is an English model originally trained by `Intel`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_uncased_mnli_sparse_70_unstructured_no_classifier_en_3.4.2_3.0_1649672994004.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_uncased_mnli_sparse_70_unstructured_no_classifier_en_3.4.2_3.0_1649672994004.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_uncased_mnli_sparse_70_unstructured_no_classifier","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_uncased_mnli_sparse_70_unstructured_no_classifier","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.bert_base_uncased_mnli_sparse_70_unstructured_no_classifier").predict("""I love Spark NLP""") ```
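The `embeddings` column holds one dense vector per token, and downstream tasks typically compare these vectors with cosine similarity. A self-contained sketch in plain Python (no Spark required; the vectors below are dummies, not real model output):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Dummy 3-dimensional "token embeddings" for illustration
sim = cosine([0.1, 0.9, 0.2], [0.1, 0.8, 0.3])
```

In practice the vectors would be the 768-dimensional arrays collected from the `embeddings` annotations.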
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_uncased_mnli_sparse_70_unstructured_no_classifier| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|228.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/Intel/bert-base-uncased-mnli-sparse-70-unstructured-no-classifier --- layout: model title: English T5ForConditionalGeneration Cased model (from pitehu) author: John Snow Labs name: t5_ner_conll_list date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `T5_NER_CONLL_LIST` is an English model originally trained by `pitehu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_ner_conll_list_en_4.3.0_3.0_1675099601757.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_ner_conll_list_en_4.3.0_3.0_1675099601757.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_ner_conll_list","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_ner_conll_list","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_ner_conll_list| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|272.6 MB| ## References - https://huggingface.co/pitehu/T5_NER_CONLL_LIST --- layout: model title: English Named Entity Recognition (from Jorgeutd) author: John Snow Labs name: bert_ner_bert_large_uncased_finetuned_ner date: 2022-05-09 tags: [bert, ner, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-uncased-finetuned-ner` is an English model originally trained by `Jorgeutd`. ## Predicted Entities `LOC`, `PER`, `ORG`, `MISC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_large_uncased_finetuned_ner_en_3.4.2_3.0_1652097206579.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_large_uncased_finetuned_ner_en_3.4.2_3.0_1652097206579.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_large_uncased_finetuned_ner","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_large_uncased_finetuned_ner","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
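The `ner` column contains IOB-style tags (`B-PER`, `I-PER`, `O`, ...), one per token; grouping them into entity spans is a common post-processing step. A minimal pure-Python sketch over a collected tag sequence (the tag convention follows CoNLL-2003, which this model was fine-tuned on; this helper is illustrative, not a Spark NLP API):

```python
def iob_to_spans(tags):
    """Collapse an IOB tag sequence into (label, start_idx, end_idx) spans, end exclusive."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and label != tag[2:]):
            if label is not None:
                spans.append((label, start, i))
            start, label = i, tag[2:]
        elif tag == "O" and label is not None:
            spans.append((label, start, i))
            start, label = None, None
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans
```

For example, `["B-PER", "I-PER", "O", "B-LOC"]` collapses to a two-token PER span followed by a one-token LOC span. (In Spark NLP this job is usually done by `NerConverter`; the sketch just shows the underlying logic.)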
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_bert_large_uncased_finetuned_ner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Jorgeutd/bert-large-uncased-finetuned-ner - https://paperswithcode.com/sota?task=Token+Classification&dataset=conll2003 --- layout: model title: Pipeline for detecting posology entities author: John Snow Labs name: recognize_entities_posology date: 2021-04-01 tags: [pipeline, en, licensed, clinical] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pipeline with `ner_posology`. It will only extract medication entities. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.Pretrained_Clinical_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/recognize_entities_posology_en_3.0.0_3.0_1617298186572.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/recognize_entities_posology_en_3.0.0_3.0_1617298186572.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('recognize_entities_posology', 'en', 'clinical/models') res = pipeline.fullAnnotate("""A 28-year-old female with a history of gestational diabetes mellitus, used to take metformin 1000 mg two times a day, presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . She was seen by the endocrinology service and discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals. """) ``` ```scala val era_pipeline = new PretrainedPipeline("recognize_entities_posology", "en", "clinical/models") val result = era_pipeline.fullAnnotate("""A 28-year-old female with a history of gestational diabetes mellitus, used to take metformin 1000 mg two times a day, presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . She was seen by the endocrinology service and discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals. """)(0) ``` {:.nlu-block} ```python import nlu nlu.load("en.recognize_entities.posology").predict("""A 28-year-old female with a history of gestational diabetes mellitus, used to take metformin 1000 mg two times a day, presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . She was seen by the endocrinology service and discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals. """) ```
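Posology attributes (STRENGTH, DOSAGE, FREQUENCY) usually need to be related back to a DRUG mention. Healthcare NLP ships dedicated relation-extraction models for this; as a rough illustration of the idea, a nearest-drug-by-character-distance heuristic over the collected chunks handles simple notes like the one above. A pure-Python sketch (illustrative only, with inclusive offsets as in the Results section below):

```python
def attach_to_nearest_drug(chunks):
    """Map each non-DRUG chunk to the DRUG whose span is closest in characters.

    chunks: list of (text, begin, end, label) tuples with inclusive offsets.
    Returns {drug_text: [attribute_text, ...]}.
    """
    drugs = [c for c in chunks if c[3] == "DRUG"]
    out = {d[0]: [] for d in drugs}

    def span_distance(a, b):
        # 0 if the spans overlap, otherwise the character gap between them
        return max(0, a[1] - b[2], b[1] - a[2])

    for c in chunks:
        if c[3] == "DRUG" or not drugs:
            continue
        nearest = min(drugs, key=lambda d: span_distance(c, d))
        out[nearest[0]].append(c[0])
    return out
```

On the chunks from the Results section this groups "1000 mg" and "two times a day" under metformin, "40 units" and "at night" under insulin glargine, and "12 units" and "with meals" under insulin lispro. Real relation extraction is more robust than this proximity heuristic.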
## Results ```bash | | chunk | begin | end | entity | |---:|:-----------------|--------:|------:|:----------| | 0 | metformin | 83 | 91 | DRUG | | 1 | 1000 mg | 93 | 99 | STRENGTH | | 2 | two times a day | 101 | 115 | FREQUENCY | | 3 | 40 units | 270 | 277 | DOSAGE | | 4 | insulin glargine | 282 | 297 | DRUG | | 5 | at night | 299 | 306 | FREQUENCY | | 6 | 12 units | 309 | 316 | DOSAGE | | 7 | insulin lispro | 321 | 334 | DRUG | | 8 | with meals | 336 | 345 | FREQUENCY | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|recognize_entities_posology| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English XlmRoBertaForQuestionAnswering (from seongju) author: John Snow Labs name: xlm_roberta_qa_klue_mrc_roberta_base date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `klue-mrc-roberta-base` is an English model originally trained by `seongju`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_klue_mrc_roberta_base_en_4.0.0_3.0_1655987736396.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_klue_mrc_roberta_base_en_4.0.0_3.0_1655987736396.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_klue_mrc_roberta_base","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_klue_mrc_roberta_base","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.klue.xlm_roberta.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_klue_mrc_roberta_base| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|855.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/seongju/klue-mrc-roberta-base --- layout: model title: English DistilBertForQuestionAnswering model (from akr) author: John Snow Labs name: distilbert_qa_akr_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `akr`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_akr_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724968396.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_akr_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724968396.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_akr_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_akr_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_akr").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_akr_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/akr/distilbert-base-uncased-finetuned-squad --- layout: model title: English T5ForConditionalGeneration Base Cased model (from google) author: John Snow Labs name: t5_efficient_base_dm1000 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-dm1000` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_dm1000_en_4.3.0_3.0_1675110576936.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_dm1000_en_4.3.0_3.0_1675110576936.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_base_dm1000","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_base_dm1000","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_base_dm1000| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|633.1 MB| ## References - https://huggingface.co/google/t5-efficient-base-dm1000 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Pipeline to Summarize Clinical Notes (PubMed) author: John Snow Labs name: summarizer_biomedical_pubmed_pipeline date: 2023-05-31 tags: [licensed, en, clinical, summarization] task: Summarization language: en edition: Healthcare NLP 4.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [summarizer_biomedical_pubmed](https://nlp.johnsnowlabs.com/2023/04/03/summarizer_biomedical_pubmed_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_biomedical_pubmed_pipeline_en_4.4.1_3.0_1685533294249.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_biomedical_pubmed_pipeline_en_4.4.1_3.0_1685533294249.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("summarizer_biomedical_pubmed_pipeline", "en", "clinical/models") text = """Residual disease after initial surgery for ovarian cancer is the strongest prognostic factor for survival. However, the extent of surgical resection required to achieve optimal cytoreduction is controversial. Our goal was to estimate the effect of aggressive surgical resection on ovarian cancer patient survival.\n A retrospective cohort study of consecutive patients with International Federation of Gynecology and Obstetrics stage IIIC ovarian cancer undergoing primary surgery was conducted between January 1, 1994, and December 31, 1998. The main outcome measures were residual disease after cytoreduction, frequency of radical surgical resection, and 5-year disease-specific survival.\n The study comprised 194 patients, including 144 with carcinomatosis. The mean patient age and follow-up time were 64.4 and 3.5 years, respectively. After surgery, 131 (67.5%) of the 194 patients had less than 1 cm of residual disease (definition of optimal cytoreduction). Considering all patients, residual disease was the only independent predictor of survival; the need to perform radical procedures to achieve optimal cytoreduction was not associated with a decrease in survival. For the subgroup of patients with carcinomatosis, residual disease and the performance of radical surgical procedures were the only independent predictors. Disease-specific survival was markedly improved for patients with carcinomatosis operated on by surgeons who most frequently used radical procedures compared with those least likely to use radical procedures (44% versus 17%, P < .001).\n Overall, residual disease was the only independent predictor of survival. 
Minimizing residual disease through aggressive surgical resection was beneficial, especially in patients with carcinomatosis.""" result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("summarizer_biomedical_pubmed_pipeline", "en", "clinical/models") val text = """Residual disease after initial surgery for ovarian cancer is the strongest prognostic factor for survival. However, the extent of surgical resection required to achieve optimal cytoreduction is controversial. Our goal was to estimate the effect of aggressive surgical resection on ovarian cancer patient survival.\n A retrospective cohort study of consecutive patients with International Federation of Gynecology and Obstetrics stage IIIC ovarian cancer undergoing primary surgery was conducted between January 1, 1994, and December 31, 1998. The main outcome measures were residual disease after cytoreduction, frequency of radical surgical resection, and 5-year disease-specific survival.\n The study comprised 194 patients, including 144 with carcinomatosis. The mean patient age and follow-up time were 64.4 and 3.5 years, respectively. After surgery, 131 (67.5%) of the 194 patients had less than 1 cm of residual disease (definition of optimal cytoreduction). Considering all patients, residual disease was the only independent predictor of survival; the need to perform radical procedures to achieve optimal cytoreduction was not associated with a decrease in survival. For the subgroup of patients with carcinomatosis, residual disease and the performance of radical surgical procedures were the only independent predictors. Disease-specific survival was markedly improved for patients with carcinomatosis operated on by surgeons who most frequently used radical procedures compared with those least likely to use radical procedures (44% versus 17%, P < .001).\n Overall, residual disease was the only independent predictor of survival. 
Minimizing residual disease through aggressive surgical resection was beneficial, especially in patients with carcinomatosis.""" val result = pipeline.fullAnnotate(text) ```
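A quick sanity check on a summarizer's output is its compression ratio. Assuming the pipeline's summary annotations expose a `.result` string (as `fullAnnotate` output does; the `summary` column name below is an assumption — check the pipeline's stages if it differs), a small helper:

```python
def compression_ratio(source: str, summary: str) -> float:
    """Ratio of summary length to source length, in whitespace tokens."""
    src_tokens = source.split()
    if not src_tokens:
        return 0.0
    return len(summary.split()) / len(src_tokens)

# In a live session you might call:
#   compression_ratio(text, result[0]["summary"][0].result)
# Here, on toy strings:
print(round(compression_ratio("one two three four", "one two"), 2))  # -> 0.5
```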
## Results ```bash The results of this review suggest that aggressive ovarian cancer surgery is associated with a significant reduction in the risk of recurrence and a reduction in the number of radical versus conservative surgical resections. However, the results of this review are based on only one small trial. Further research is needed to determine the role of aggressive ovarian cancer surgery in women with stage IIIC ovarian cancer. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_biomedical_pubmed_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|936.7 MB| ## Included Models - DocumentAssembler - MedicalSummarizer --- layout: model title: English RobertaForQuestionAnswering Base Uncased model (from t15) author: John Snow Labs name: roberta_qa_base_uncased_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-base-uncased-squad` is an English model originally trained by `t15`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_uncased_squad_en_4.3.0_3.0_1674210559838.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_uncased_squad_en_4.3.0_3.0_1674210559838.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_uncased_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_uncased_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
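To score several question/context pairs in one pass, build the input DataFrame from a list of tuples; the pairing logic itself is plain Python (the `question`/`context` column names match the assembler above):

```python
def make_qa_rows(questions, contexts):
    """Zip questions with contexts into (question, context) rows,
    failing loudly on mismatched lengths."""
    if len(questions) != len(contexts):
        raise ValueError("questions and contexts must align")
    return list(zip(questions, contexts))

rows = make_qa_rows(
    ["What's my name?", "Where do I live?"],
    ["My name is Clara.", "I live in Berkeley."],
)
# In a live session:
#   data = spark.createDataFrame(rows).toDF("question", "context")
print(len(rows))  # -> 2
```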
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_uncased_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|307.1 MB| |Case sensitive:|false| |Max sentence length:|256| ## References - https://huggingface.co/t15/distilroberta-base-uncased-squad --- layout: model title: English RobertaForTokenClassification Cased model (from Jean-Baptiste) author: John Snow Labs name: roberta_token_classifier_ticker date: 2023-03-01 tags: [en, open_source, roberta, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-ticker` is an English model originally trained by `Jean-Baptiste`. ## Predicted Entities `TICKER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_ticker_en_4.3.0_3.0_1677703811345.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_ticker_en_4.3.0_3.0_1677703811345.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_ticker","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_ticker","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
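In a full pipeline you would usually append a `NerConverter` stage to merge the per-token `B-TICKER`/`I-TICKER` tags into entity chunks. As a plain-Python illustration of the grouping that converter performs:

```python
def group_iob(tokens, tags):
    """Merge IOB-tagged tokens into (chunk, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close any open chunk before starting a new one
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)  # continue the open chunk
        else:  # "O" (or a stray I- tag) ends any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(group_iob(["Buy", "AAPL", "now"], ["O", "B-TICKER", "O"]))
# -> [('AAPL', 'TICKER')]
```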
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_token_classifier_ticker| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|465.3 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Jean-Baptiste/roberta-ticker - https://www.kaggle.com/omermetinn/tweets-about-the-top-companies-from-2015-to-2020 --- layout: model title: English asr_wav2vec2_base_timit_demo_colab40 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab40 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab40` is an English model originally trained by hassnain. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab40_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab40_en_4.2.0_3.0_1664020884314.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab40_en_4.2.0_3.0_1664020884314.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab40', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab40", lang = "en") val annotations = pipeline.transform(audioDF) ```
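The pipeline expects an `audio_content` column holding the waveform as floats. If your WAV decoder hands back signed 16-bit integer PCM samples (as Python's stdlib `wave` module does), normalize them to the [-1, 1] range first. A hedged sketch — the exact loading code depends on your audio library:

```python
def pcm16_to_floats(samples):
    """Scale signed 16-bit PCM samples into [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]

floats = pcm16_to_floats([0, 16384, -32768])
# In a live session you would then build the input DataFrame:
#   audioDF = spark.createDataFrame([[floats]], ["audio_content"])
#   annotations = pipeline.transform(audioDF)
print(floats)  # -> [0.0, 0.5, -1.0]
```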
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab40| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Detect Symptoms, Treatments and Other Entities in German author: John Snow Labs name: ner_healthcare date: 2021-03-31 tags: [ner, clinical, licensed, de] task: Named Entity Recognition language: de edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model can be used to detect symptoms, treatments and other entities in German-language medical text. ## Predicted Entities `DIAGLAB_PROCEDURE`, `MEDICAL_SPECIFICATION`, `MEDICAL_DEVICE`, `MEASUREMENT`, `BIOLOGICAL_CHEMISTRY`, `BODY_FLUID`, `TIME_INFORMATION`, `LOCAL_SPECIFICATION`, `BIOLOGICAL_PARAMETER`, `PROCESS`, `MEDICATION`, `DOSING`, `DEGREE`, `MEDICAL_CONDITION`, `PERSON`, `TISSUE`, `STATE_OF_HEALTH`, `BODY_PART`, `TREATMENT` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_HEALTHCARE_DE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_de_3.0.0_3.0_1617208455368.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_de_3.0.0_3.0_1617208455368.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","de","clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_healthcare", "de", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") clinical_ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("entities") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, clinical_ner_converter]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate("Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist ein hochmalignes bronchogenes Karzinom") ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","de","clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_healthcare", "de", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val clinical_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token",
"ner")) .setOutputCol("entities") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, clinical_ner_converter)) val data = Seq("Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist ein hochmalignes bronchogenes Karzinom").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.med_ner.healthcare").predict("""Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist ein hochmalignes bronchogenes Karzinom""") ```
## Results ```bash +-----------------+---------------------+-----+---+ |chunk |ner_label |begin|end| +-----------------+---------------------+-----+---+ |Kleinzellige |MEASUREMENT |4 |15 | |Bronchialkarzinom|MEDICAL_CONDITION |17 |33 | |Kleinzelliger |MEDICAL_SPECIFICATION|36 |48 | |Lungenkrebs |MEDICAL_CONDITION |50 |60 | |SCLC |MEDICAL_CONDITION |63 |66 | |Karzinom |MEDICAL_CONDITION |103 |110| +-----------------+---------------------+-----+---+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_healthcare| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|de| ## Data Source Trained on 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text with *w2v_cc_300d*. ## Benchmarking ```bash | | label | tp | fp | fn | precision| recall| f1 | |---:|--------------------:|-------:|------:|-----:|---------:|---------:|---------:| | 0 | BIOLOGICAL_PARAMETER| 103 | 52 | 57 | 0.6645 | 0.6438 | 0.654 | | 1 | BODY_FLUID | 166 | 16 | 24 | 0.9121 | 0.8737 | 0.8925 | | 2 | PERSON | 475 | 74 | 142 | 0.8652 | 0.7699 | 0.8148 | | 3 | DOSING | 38 | 14 | 31 | 0.7308 | 0.5507 | 0.6281 | | 4 | DIAGLAB_PROCEDURE | 236 | 58 | 68 | 0.8027 | 0.7763 | 0.7893 | | 5 | BODY_PART | 690 | 72 | 79 | 0.9055 | 0.8973 | 0.9014 | | 6 | MEDICATION | 391 | 117 | 167 | 0.7697 | 0.7007 | 0.7336 | | 7 | STATE_OF_HEALTH | 321 | 41 | 76 | 0.8867 | 0.8086 | 0.8458 | | 8 | LOCAL_SPECIFICATION | 57 | 19 | 24 | 0.75 | 0.7037 | 0.7261 | | 9 | MEASUREMENT | 574 | 260 | 222 | 0.6882 | 0.7211 | 0.7043 | | 10 | TREATMENT | 476 | 131 | 135 | 0.7842 | 0.7791 | 0.7816 | | 11 | MEDICAL_CONDITION | 1741 | 442 | 271 | 0.7975 | 0.8653 | 0.83 | | 12 | TIME_INFORMATION | 651 | 126 | 161 | 0.8378 | 0.8017 | 0.8194 | | 13 | BIOLOGICAL_CHEMISTRY| 192 | 55 | 60 | 0.7773 | 0.7619 | 0.7695 | ``` --- layout: model title: Chuvash RobertaForQuestionAnswering (from sunitha) 
author: John Snow Labs name: roberta_qa_CV_Merge_DS date: 2022-06-20 tags: [open_source, question_answering, roberta] task: Question Answering language: cv edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `CV_Merge_DS` is a Chuvash model originally trained by `sunitha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_CV_Merge_DS_cv_4.0.0_3.0_1655726646554.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_CV_Merge_DS_cv_4.0.0_3.0_1655726646554.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_CV_Merge_DS","cv") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_CV_Merge_DS","cv") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("cv.answer_question.roberta.cv_merge_ds.by_sunitha").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_CV_Merge_DS| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|cv| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/sunitha/CV_Merge_DS --- layout: model title: Legal Master Lease Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_master_lease_agreement date: 2022-11-10 tags: [en, legal, classification, agreement, master_lease, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_master_lease_agreement` model is a Legal Longformer Document Classifier that classifies whether a document belongs to the class `master-lease-agreement` or not (Binary Classification). Longformers are limited to 4096 tokens, so only the first 4096 tokens of each document are taken into account. We have found that, for the vast majority of documents in legal corpora, 4096 tokens are enough for Document Classification, provided the documents are clean and contain only the legal text without extra leading material. If that is not the case for your documents, let us know and we can apply another approach for you: splitting each document into 4096-token chunks, averaging their embeddings, and training on the averaged version, so that the whole document is taken into account. In practice, this should rarely be required.
## Predicted Entities `master-lease-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_master_lease_agreement_en_1.0.0_3.0_1668117131869.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_master_lease_agreement_en_1.0.0_3.0_1668117131869.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = nlp.ClassifierDLModel.pretrained("legclf_master_lease_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
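The description above mentions a fallback for documents longer than 4096 tokens: embed 4096-token chunks and average the embeddings. That chunk-and-average step is simple enough to sketch in plain Python (real embedding vectors would come from the Longformer stage; the small sizes here are just for illustration):

```python
def chunk_tokens(tokens, size=4096):
    """Split a token list into consecutive windows of at most `size`."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def average_vectors(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

print(chunk_tokens(list("abcde"), size=2))        # -> [['a', 'b'], ['c', 'd'], ['e']]
print(average_vectors([[1.0, 2.0], [3.0, 4.0]]))  # -> [2.0, 3.0]
```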
## Results ```bash +------------------------+ |                  result| +------------------------+ |[master-lease-agreement]| |                 [other]| |                 [other]| |[master-lease-agreement]| +------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_master_lease_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.4 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support master-lease-agreement 0.96 0.90 0.93 30 other 0.96 0.98 0.97 66 accuracy - - 0.96 96 macro-avg 0.96 0.94 0.95 96 weighted-avg 0.96 0.96 0.96 96 ``` --- layout: model title: Swedish BertForQuestionAnswering model (from susumu2357) author: John Snow Labs name: bert_qa_bert_base_swedish_squad2 date: 2022-06-02 tags: [sv, open_source, question_answering, bert] task: Question Answering language: sv edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-swedish-squad2` is a Swedish model originally trained by `susumu2357`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_swedish_squad2_sv_4.0.0_3.0_1654180667162.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_swedish_squad2_sv_4.0.0_3.0_1654180667162.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_swedish_squad2","sv") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_swedish_squad2","sv") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("sv.answer_question.squadv2.bert.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_swedish_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|sv| |Size:|465.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/susumu2357/bert-base-swedish-squad2 - https://github.com/susumu2357/SQuAD_v2_sv --- layout: model title: Legal Capitalization Clause Binary Classifier author: John Snow Labs name: legclf_capitalization_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `capitalization` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. If your texts are longer than that, consider splitting them into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `capitalization` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_capitalization_clause_en_1.0.0_3.2_1660122203093.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_capitalization_clause_en_1.0.0_3.2_1660122203093.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_capitalization_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
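The description above recommends paragraph splitting (by multiline) before running clause classifiers over long documents. A minimal splitter along those lines, whose output rows would feed the `clause_text` column used by the pipeline:

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines, dropping empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "CAPITALIZATION. Shares...\n\nMISCELLANEOUS. Notices..."
print(split_paragraphs(doc))
# -> ['CAPITALIZATION. Shares...', 'MISCELLANEOUS. Notices...']
```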
## Results ```bash +----------------+ |          result| +----------------+ |[capitalization]| |         [other]| |         [other]| |[capitalization]| +----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_capitalization_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support capitalization 1.00 0.89 0.94 19 other 0.96 1.00 0.98 53 accuracy - - 0.97 72 macro-avg 0.98 0.95 0.96 72 weighted-avg 0.97 0.97 0.97 72 ``` --- layout: model title: English XlmRoBertaForQuestionAnswering (from meghana) author: John Snow Labs name: xlm_roberta_qa_hitalmqa_finetuned_squad date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `hitalmqa-finetuned-squad` is an English model originally trained by `meghana`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_hitalmqa_finetuned_squad_en_4.0.0_3.0_1655987596935.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_hitalmqa_finetuned_squad_en_4.0.0_3.0_1655987596935.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_hitalmqa_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_hitalmqa_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.xlm_roberta.by_meghana").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
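Conceptually, a span-classification QA head scores every token as a candidate answer start and end, and the predicted answer is the span between the best-scoring start and end positions. A dependency-free sketch of that selection step (the token list and logit values below are toy stand-ins, not the model's real scores):

```python
def best_span(tokens, start_logits, end_logits):
    """Pick the (start, end) pair maximizing start_logits[s] + end_logits[e], with s <= e."""
    best, best_score = (0, 0), float("-inf")
    for s in range(len(tokens)):
        for e in range(s, len(tokens)):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    s, e = best
    return " ".join(tokens[s:e + 1])

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.0, 0.0, 0.0, 0.3, 0.0]  # toy start scores
end   = [0.0, 0.0, 0.0, 4.0, 0.1, 0.0, 0.0, 0.0, 0.2, 0.0]  # toy end scores
print(best_span(tokens, start, end))  # Clara
```

The annotator performs this extraction internally and returns the answer span in the `answer` output column.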
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_hitalmqa_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.8 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/meghana/hitalmqa-finetuned-squad --- layout: model title: Clinical Deidentification author: John Snow Labs name: clinical_deidentification class: PipelineModel language: en nav_key: models repository: clinical/models date: 2020-01-31 task: [De-identification, Pipeline Healthcare] edition: Healthcare NLP 2.4.0 spark_version: 2.4 tags: [pipeline, clinical, licensed] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This pipeline can be used to de-identify PHI information from medical texts. The PHI information will be obfuscated in the resulting text. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_en_2.4.0_2.4_1580481115376.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_en_2.4.0_2.4_1580481115376.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""" deid_pipeline = PretrainedPipeline("clinical_deidentification", "en", "clinical/models") result = deid_pipeline.annotate(sample) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""" val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "en", "clinical/models") val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("en.de_identify.clinical_pipeline").predict("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""") ```
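The licensed pipeline detects and obfuscates PHI internally using NER models and contextual rules. Purely as an illustration of what masking PHI means, here is a toy regex sketch over two of the patterns in the sample above (the `<SSN>`/`<PHONE>` tags and the regexes are illustrative inventions, not the pipeline's method):

```python
import re

# Toy masks for two PHI patterns from the sample text; the real pipeline
# relies on trained NER models and context rules, not regexes like these.
PATTERNS = {
    "<SSN>": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "<PHONE>": re.compile(r"\(\d{3}\) \d{3}-\d{4}"),
}

def mask_phi(text):
    """Replace each matched PHI pattern with its placeholder tag."""
    for tag, pat in PATTERNS.items():
        text = pat.sub(tag, text)
    return text

print(mask_phi("SSN #333-44-6666, Phone (302) 786-5227."))
# SSN #<SSN>, Phone <PHONE>.
```

The pretrained pipeline goes further and obfuscates (substitutes plausible fake values) rather than just masking.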
{:.model-param} ## Model Information {:.table-model} |---------------|---------------------------| | Name: | clinical_deidentification | | Type: | PipelineModel | | Compatibility: | Spark NLP 2.4.0+ | | License: | Licensed | | Edition: | Official | | Language: | en | --- layout: model title: English asr_wav2vec2_base_timit_demo_colab2_by_sherry7144 TFWav2Vec2ForCTC from sherry7144 author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab2_by_sherry7144 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_base_timit_demo_colab2_by_sherry7144` is a English model originally trained by sherry7144. NOTE: This model only works on a CPU, if you need to use this model on a GPU device please use asr_wav2vec2_base_timit_demo_colab2_by_sherry7144_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab2_by_sherry7144_en_4.2.0_3.0_1664025065732.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab2_by_sherry7144_en_4.2.0_3.0_1664025065732.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab2_by_sherry7144", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab2_by_sherry7144", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
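Wav2Vec2ForCTC produces a per-frame symbol prediction that is greedily decoded under the CTC rule: collapse consecutive duplicates, then drop the blank symbol. A minimal sketch of that decoding rule (the frame labels and `_` blank below are toy inputs, not the annotator's internals):

```python
def ctc_greedy_decode(frame_labels, blank="_"):
    """Collapse consecutive duplicate symbols, then remove CTC blanks."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev:
            out.append(lab)
        prev = lab
    return "".join(c for c in out if c != blank)

# Twelve toy frames of per-frame best symbols decode to a five-letter word.
print(ctc_greedy_decode(list("hh_e_ll_lloo")))  # hello
```

The blank between the two `ll` runs is what lets CTC preserve genuine double letters while still collapsing repeats.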
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab2_by_sherry7144| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: Earning Calls Financial NER (Specific, sm) author: John Snow Labs name: finner_earning_calls_specific_sm date: 2022-11-30 tags: [en, finance, ner, licensed, earning, calls] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: FinanceNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a `sm` (small) version of a financial model trained on Earning Calls transcripts to detect financial entities (NER model). This model is called `Specific` as it has more labels in comparison with a `Generic` version. Please note this model requires some tokenization configuration to extract the currency (see python snippet below). The currently available entities are: - AMOUNT: Numeric amounts, not percentages - ASSET: Current or Fixed Asset - ASSET_DECREASE: Decrease in the asset possession/exposure - ASSET_INCREASE: Increase in the asset possession/exposure - CF: Total cash flow  - CFO: Cash flow from operating activity - CFO_INCREASE: Cash flow from operating activity increased - COUNT: Number of items (not monetary, not percentages). 
- CURRENCY: The currency of the amount - DATE: Generic dates in context where either it's not a fiscal year or it can't be asserted as such given the context - EXPENSE: An expense or loss - EXPENSE_DECREASE: A piece of information saying there was an expense decrease in that fiscal year - EXPENSE_INCREASE: A piece of information saying there was an expense increase in that fiscal year - FCF: Free Cash Flow - FISCAL_YEAR: A date which expresses which month the fiscal year was closed for a specific year - INCOME: Any income that is reported - INCOME_INCREASE: Relative increase in income - KPI: Key Performance Indicator, a quantifiable measure of performance over time for a specific objective - KPI_DECREASE: Relative decrease in a KPI - KPI_INCREASE: Relative increase in a KPI - LIABILITY: Current or Long-Term Liability (not from stockholders) - LIABILITY_DECREASE: Relative decrease in liability - LIABILITY_INCREASE: Relative increase in liability - LOSS: Type of loss (e.g. gross, net) - ORG: Mention of a company/organization name - PERCENTAGE: Numeric amounts which are percentages - PROFIT: Profit or Revenue - PROFIT_DECLINE: A piece of information saying there was a profit / revenue decrease in that fiscal year - PROFIT_INCREASE: A piece of information saying there was a profit / revenue increase in that fiscal year - REVENUE: Revenue reported by company - REVENUE_DECLINE: Relative decrease in revenue when compared to other years - REVENUE_INCREASE: Relative increase in revenue when compared to other years - STOCKHOLDERS_EQUITY: Equity possessed by stockholders, not liability - TICKER: Trading symbol of the company ## Predicted Entities `AMOUNT`, `ASSET`, `ASSET_DECREASE`, `ASSET_INCREASE`, `CF`, `CFO`, `CFO_INCREASE`, `COUNT`, `CURRENCY`, `DATE`, `EXPENSE`, `EXPENSE_DECREASE`, `EXPENSE_INCREASE`, `FCF`, `FISCAL_YEAR`, `INCOME`, `INCOME_INCREASE`, `KPI`, `KPI_DECREASE`, `KPI_INCREASE`, `LIABILITY`, `LIABILITY_DECREASE`, `LIABILITY_INCREASE`, `LOSS`, 
`ORG`, `PERCENTAGE`, `PROFIT`, `PROFIT_DECLINE`, `PROFIT_INCREASE`, `REVENUE`, `REVENUE_DECLINE`, `REVENUE_INCREASE`, `STOCKHOLDERS_EQUITY`, `TICKER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_earning_calls_specific_sm_en_1.0.0_3.0_1669835090214.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_earning_calls_specific_sm_en_1.0.0_3.0_1669835090214.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token")\ .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$','€']) embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512) ner_model = finance.NerModel.pretrained("finner_earning_calls_specific_sm", "en", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""Adjusted EPS was ahead of our expectations at $ 1.21 , and free cash flow is also ahead of our expectations despite a $ 1.5 billion additional tax payment we made related to the R&D amortization."""]]).toDF("text") model = pipeline.fit(data) result = model.transform(data) result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \ .select(F.expr("cols['0']").alias("text"), F.expr("cols['1']['entity']").alias("label")).show(200, truncate = False) ```
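The NerConverter stage in the pipeline above merges token-level B-/I- tags into entity chunks. The same BIO-merging rule in plain Python (a self-contained sketch, not Spark NLP's implementation):

```python
def bio_to_chunks(tokens, tags):
    """Merge BIO-tagged tokens into (entity, text) chunks."""
    chunks, cur_ent, cur_toks = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):               # new chunk begins
            if cur_ent:
                chunks.append((cur_ent, " ".join(cur_toks)))
            cur_ent, cur_toks = tag[2:], [tok]
        elif tag.startswith("I-") and cur_ent == tag[2:]:
            cur_toks.append(tok)               # chunk continues
        else:                                  # O tag or label mismatch ends chunk
            if cur_ent:
                chunks.append((cur_ent, " ".join(cur_toks)))
            cur_ent, cur_toks = None, []
    if cur_ent:
        chunks.append((cur_ent, " ".join(cur_toks)))
    return chunks

print(bio_to_chunks(
    ["Adjusted", "EPS", "was", "at", "$", "1.21"],
    ["B-PROFIT", "I-PROFIT", "O", "O", "B-CURRENCY", "B-AMOUNT"]))
# [('PROFIT', 'Adjusted EPS'), ('CURRENCY', '$'), ('AMOUNT', '1.21')]
```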
## Results ```bash +------------+----------+----------+ | token| ner_label|confidence| +------------+----------+----------+ | Adjusted| B-PROFIT| 0.6957| | EPS| I-PROFIT| 0.8325| | was| O| 0.9994| | ahead| O| 0.9996| | of| O| 0.9929| | our| O| 0.9852| |expectations| O| 0.9845| | at| O| 1.0| | $|B-CURRENCY| 0.9995| | 1.21| B-AMOUNT| 1.0| | ,| O| 0.9993| | and| O| 0.9997| | free| B-FCF| 0.9883| | cash| I-FCF| 0.815| | flow| I-FCF| 0.8644| | is| O| 0.9997| | also| O| 0.9966| | ahead| O| 0.9998| | of| O| 0.9953| | our| O| 0.9877| |expectations| O| 0.994| | despite| O| 0.9997| | a| O| 0.9979| | $|B-CURRENCY| 0.9992| | 1.5| B-AMOUNT| 1.0| | billion| I-AMOUNT| 0.9997| | additional| B-EXPENSE| 0.641| | tax| I-EXPENSE| 0.3146| | payment| I-EXPENSE| 0.6099| | we| O| 0.9613| | made| O| 0.982| | related| O| 0.9732| | to| O| 0.8816| | the| O| 0.7283| | R&D| O| 0.8978| |amortization| O| 0.5825| | .| O| 1.0| +------------+----------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_earning_calls_specific_sm| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References In-house annotations on Earning Calls. 
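The Benchmarking section below reports per-label precision, recall and F1 derived from the tp/fp/fn counts. For reference, the formulas, checked against the B-FCF row (tp=13, fp=4, fn=0):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = prf(13, 4, 0)  # B-FCF row of the table below
print(round(p, 4), round(r, 4), round(f, 4))  # 0.7647 1.0 0.8667
```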
## Benchmarking ```bash label tp fp fn prec rec f1 I-REVENUE_INCREASE 53 16 52 0.76811594 0.50476193 0.6091954 I-AMOUNT 382 2 4 0.9947917 0.9896373 0.9922078 B-COUNT 14 9 1 0.6086956 0.93333334 0.73684216 B-AMOUNT 454 0 5 1.0 0.9891068 0.9945236 I-KPI 2 11 23 0.15384616 0.08 0.10526316 I-ORG 16 0 0 1.0 1.0 1.0 B-DATE 122 13 0 0.9037037 1.0 0.94941634 B-LIABILITY_DECREASE 1 1 0 0.5 1.0 0.6666667 I-DATE 3 2 0 0.6 1.0 0.75 B-LOSS 4 0 4 1.0 0.5 0.6666667 I-ASSET 6 2 14 0.75 0.3 0.42857146 I-EXPENSE 46 13 59 0.77966 0.43809524 0.5609756 I-KPI_INCREASE 1 7 13 0.125 0.071428575 0.09090909 B-REVENUE_INCREASE 60 21 34 0.7407407 0.63829786 0.6857143 I-COUNT 13 6 0 0.68421054 1.0 0.8125 I-CFO 23 1 0 0.9583333 1.0 0.9787234 B-FCF 13 4 0 0.7647059 1.0 0.8666667 B-PROFIT_INCREASE 11 11 5 0.5 0.6875 0.57894737 B-EXPENSE 26 16 45 0.61904764 0.36619717 0.460177 B-REVENUE_DECLINE 6 4 13 0.6 0.31578946 0.41379312 B-STOCKHOLDERS_EQUITY 3 0 3 1.0 0.5 0.6666667 I-PROFIT_DECLINE 4 1 7 0.8 0.36363637 0.5 I-LIABILITY_DECREASE 1 1 0 0.5 1.0 0.6666667 I-LOSS 12 0 10 1.0 0.54545456 0.7058824 I-PROFIT 148 40 10 0.78723407 0.93670887 0.8554913 B-CFO 9 1 1 0.9 0.9 0.9 B-CURRENCY 440 0 1 1.0 0.9977324 0.9988649 I-PROFIT_INCREASE 11 10 6 0.52380955 0.64705884 0.5789474 I-CURRENCY 6 0 0 1.0 1.0 1.0 B-PROFIT 93 27 16 0.775 0.853211 0.812227 B-PERCENTAGE 418 7 3 0.9835294 0.9928741 0.9881796 B-TICKER 13 0 0 1.0 1.0 1.0 I-FISCAL_YEAR 2 3 1 0.4 0.6666667 0.5 B-ORG 14 0 0 1.0 1.0 1.0 I-STOCKHOLDERS_EQUITY 6 0 2 1.0 0.75 0.85714287 I-REVENUE_DECLINE 8 9 8 0.47058824 0.5 0.4848485 B-EXPENSE_INCREASE 6 0 4 1.0 0.6 0.75 B-REVENUE 51 17 15 0.75 0.77272725 0.761194 B-FISCAL_YEAR 1 1 0 0.5 1.0 0.6666667 I-EXPENSE_DECREASE 3 3 2 0.5 0.6 0.54545456 I-FCF 26 9 0 0.74285716 1.0 0.852459 I-REVENUE 45 12 18 0.7894737 0.71428573 0.75000006 I-EXPENSE_INCREASE 8 0 3 1.0 0.72727275 0.84210527 Macro-average 2611 311 491 0.6658762 0.6029909 0.63287526 Micro-average 2611 311 91 0.8935661 0.84171504 0.8668659 ``` --- 
layout: model title: Pipeline to Extract Pharmacological Entities from Spanish Medical Texts author: John Snow Labs name: ner_pharmacology_pipeline date: 2023-03-09 tags: [es, clinical, licensed, ner, pharmacology, proteinas, normalizables] task: Named Entity Recognition language: es edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_pharmacology](https://nlp.johnsnowlabs.com/2022/08/13/ner_pharmacology_es_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_pharmacology_pipeline_es_4.3.0_3.2_1678358547733.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_pharmacology_pipeline_es_4.3.0_3.2_1678358547733.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_pharmacology_pipeline", "es", "clinical/models") text = '''e realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa).''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_pharmacology_pipeline", "es", "clinical/models") val text = "e realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa)." val result = pipeline.fullAnnotate(text) ```
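`fullAnnotate` returns each chunk with `begin`/`end` character offsets into the input text; the offsets are inclusive, so the chunk text is `text[begin:end + 1]`. A quick offset check in plain Python, using the first two chunks reported in the Results below (no Spark required):

```python
text = ("e realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, "
        "urea 63 mg/dl, CA 19.9 64,1 U/ml.")

# (begin, end) offsets as reported by the pipeline; end is inclusive.
print(text[31:43 + 1])  # creatinkinasa
print(text[53:55 + 1])  # LDH
```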
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:----------------|--------:|------:|:--------------|-------------:| | 0 | creatinkinasa | 31 | 43 | PROTEINAS | 0.9994 | | 1 | LDH | 53 | 55 | PROTEINAS | 0.9996 | | 2 | urea | 65 | 68 | NORMALIZABLES | 0.9996 | | 3 | CA 19.9 | 80 | 86 | PROTEINAS | 0.99835 | | 4 | vimentina | 138 | 146 | PROTEINAS | 0.9991 | | 5 | S-100 | 149 | 153 | PROTEINAS | 0.9996 | | 6 | HMB-45 | 156 | 161 | PROTEINAS | 0.9986 | | 7 | actina | 165 | 170 | PROTEINAS | 0.9998 | | 8 | Cisplatino | 219 | 228 | NORMALIZABLES | 0.9999 | | 9 | Interleukina II | 231 | 245 | PROTEINAS | 0.99955 | | 10 | Dacarbacina | 248 | 258 | NORMALIZABLES | 0.9996 | | 11 | Interferon alfa | 262 | 276 | PROTEINAS | 0.99935 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_pharmacology_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|es| |Size:|318.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - RoBertaEmbeddings - MedicalNerModel - NerConverter --- layout: model title: Greek BertForQuestionAnswering Cased model (from Danastos) author: John Snow Labs name: bert_qa_nq_el_4 date: 2022-07-07 tags: [el, open_source, bert, question_answering] task: Question Answering language: el edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `nq_bert_el_4` is a Greek model originally trained by `Danastos`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_nq_el_4_el_4.0.0_3.0_1657190609566.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_nq_el_4_el_4.0.0_3.0_1657190609566.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_nq_el_4","el") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Ποιο είναι το όνομά μου?", "Το όνομά μου είναι Κλάρα και μένω στο Μπέρκλεϊ."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_nq_el_4","el") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Ποιο είναι το όνομά μου?", "Το όνομά μου είναι Κλάρα και μένω στο Μπέρκλεϊ.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_nq_el_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|el| |Size:|421.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Danastos/nq_bert_el_4 --- layout: model title: German asr_wav2vec2_large_xlsr_53_german_by_oliverguhr TFWav2Vec2ForCTC from oliverguhr author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_german_by_oliverguhr date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_large_xlsr_53_german_by_oliverguhr` is a German model originally trained by oliverguhr. NOTE: This model only works on a CPU, if you need to use this model on a GPU device please use asr_wav2vec2_large_xlsr_53_german_by_oliverguhr_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_german_by_oliverguhr_de_4.2.0_3.0_1664104475873.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_german_by_oliverguhr_de_4.2.0_3.0_1664104475873.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_german_by_oliverguhr", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_german_by_oliverguhr", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_german_by_oliverguhr| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: Russian BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_ru_cased date: 2022-12-02 tags: [ru, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ru edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-ru-cased` is a Russian model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_ru_cased_ru_4.2.4_3.0_1670018825264.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_ru_cased_ru_4.2.4_3.0_1670018825264.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_ru_cased","ru") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_ru_cased","ru") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
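A common downstream use of the `embeddings` column is similarity comparison between token or sentence vectors. Cosine similarity in dependency-free Python (the two-dimensional vectors here are toy stand-ins for the model's real embedding outputs):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(round(cosine([1.0, 2.0], [2.0, 4.0]), 6))  # 1.0 (parallel vectors)
print(round(cosine([1.0, 0.0], [0.0, 1.0]), 6))  # 0.0 (orthogonal vectors)
```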
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_ru_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ru| |Size:|364.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-ru-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Spanish NER Model author: John Snow Labs name: roberta_token_classifier_bne_capitel_ner date: 2021-12-07 tags: [roberta, spanish, es, token_classifier, open_source] task: Named Entity Recognition language: es edition: Spark NLP 3.3.2 spark_version: 2.4 supported: true annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description - This model is imported from `Hugging Face`. - RoBERTa-base-bne is a transformer-based masked language model for the Spanish language. It is based on the RoBERTa base model and has been pretrained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawlings performed by the National Library of Spain (Biblioteca Nacional de España) from 2009 to 2019. ## Predicted Entities `OTH`, `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_bne_capitel_ner_es_3.3.2_2.4_1638866935540.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_bne_capitel_ner_es_3.3.2_2.4_1638866935540.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_bne_capitel_ner", "es")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = """Me llamo Antonio y trabajo en la fábrica de Mercedes-Benz en Madrid.""" result = model.transform(spark.createDataFrame([[text]]).toDF("text")) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_bne_capitel_ner", "es") .setInputCols(Array("sentence","token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val example = Seq("Me llamo Antonio y trabajo en la fábrica de Mercedes-Benz en Madrid.").toDF("text") val result = pipeline.fit(example).transform(example) ```
## Results ```bash +------------------------+---------+ |chunk |ner_label| +------------------------+---------+ |Antonio |PER | |fábrica de Mercedes-Benz|ORG | |Madrid. |LOC | +------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_token_classifier_bne_capitel_ner| |Compatibility:|Spark NLP 3.3.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|es| |Case sensitive:|true| |Max sentence length:|256| ## Data Source [https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne-capitel-ner-plus](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne-capitel-ner-plus) ## Benchmarking ```bash label score f1 0.8867 ``` --- layout: model title: Detect Genomic Variant Information (ner_genetic_variants) author: John Snow Labs name: ner_genetic_variants date: 2021-06-25 tags: [ner, en, clinical, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts genetic variant information from medical text and is trained on the tmVar 2.0 dataset, which can be found at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmvar/ ## Predicted Entities `DNAMutation`, `ProteinMutation`, `SNP` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_GEN_VAR/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_genetic_variants_en_3.1.0_3.0_1624607526370.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_genetic_variants_en_3.1.0_3.0_1624607526370.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = WordEmbeddingsModel.pretrained('embeddings_clinical', 'en', 'clinical/models') \ .setInputCols(['sentence', 'token']) \ .setOutputCol('embeddings') clinical_ner = MedicalNerModel.pretrained("ner_genetic_variants", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) data = spark.createDataFrame([["""The mutation pattern of mitochondrial DNA (mtDNA) in mainland Chinese patients with mitochondrial myopathy, encephalopathy, lactic acidosis and stroke-like episodes (MELAS) has been rarely reported, though previous data suggested that the mutation pattern of MELAS could be different among geographically localized populations. We presented the results of comprehensive mtDNA mutation analysis in 92 unrelated Chinese patients with MELAS (85 with classic MELAS and 7 with MELAS/Leigh syndrome (LS) overlap syndrome). The mtDNA A3243G mutation was the most common causal genotype in this patient group (79/92 and 85.9%). The second common gene mutation was G13513A (7/92 and 7.6%). Additionally, we identified T10191C (p.S45P) in ND3, A11470C (p. K237N) in ND4, T13046C (p.M237T) in ND5 and a large-scale deletion (13025-13033:14417-14425) involving partial ND5 and ND6 subunits of complex I in one patient each. Among them, A11470C, T13046C and the single deletion were novel mutations. 
In summary, patients with mutations affecting mitochondrially encoded complex I (MTND) reached 12.0% (11/92) in this group. It is noteworthy that all seven patients with MELAS/LS overlap syndrome were associated with MTND mutations. Our data emphasize the important role of MTND mutations in the pathogenicity of MELAS, especially MELAS/LS overlap syndrome.PURPOSE: Genes in the complement pathway, including complement factor H (CFH), C2/BF, and C3, have been reported to be associated with age-related macular degeneration (AMD). Genetic variants, single-nucleotide polymorphisms (SNPs), in these genes were geno-typed for a case-control association study in a mainland Han Chinese population. METHODS: One hundred and fifty-eight patients with wet AMD, 80 patients with soft drusen, and 220 matched control subjects were recruited among Han Chinese in mainland China. Seven SNPs in CFH and two SNPs in C2, CFB', and C3 were genotyped using the ABI SNaPshot method. A deletion of 84,682 base pairs covering the CFHR1 and CFHR3 genes was detected by direct polymerase chain reaction and gel electrophoresis. RESULTS: Four SNPs, including rs3753394 (P = 0.0276), rs800292 (P = 0.0266), rs1061170 (P = 0.00514), and rs1329428 (P = 0.0089), in CFH showed a significant association with wet AMD in the cohort of this study. A haplotype containing these four SNPs (CATA) significantly increased protection of wet AMD with a P value of 0.0005 and an odds ratio of 0.29 (95% confidence interval: 0.15-0.60). Unlike in other populations, rs2274700 and rs1410996 did not show a significant association with AMD in the Chinese population of this study. None of the SNPs in CFH showed a significant association with drusen, and none of the SNPs in CFH, C2, CFB, and C3 showed a significant association with either wet AMD or drusen in the cohort of this study. The CFHR1 and CFHR3 deletion was not polymorphic in the Chinese population and was not associated with wet AMD or drusen. 
CONCLUSION: This study showed that SNPs rs3753394 (P = 0.0276), rs800292 (P = 0.0266), rs1061170 (P = 0.00514), and rs1329428 (P = 0.0089), but not rs7535263, rs1410996, or rs2274700, in CFH were significantly associated with wet AMD in a mainland Han Chinese population. This study showed that CFH was more likely to be AMD susceptibility gene at Chr.1q31 based on the finding that the CFHR1 and CFHR3 deletion was not polymorphic in the cohort of this study, and none of the SNPs that were significantly associated with AMD in a white population in C2, CFB, and C3 genes showed a significant association with AMD."""]]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala ... val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols("sentence", "token") .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_genetic_variants", "en", "clinical/models") .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("The mutation pattern of mitochondrial DNA (mtDNA) in mainland Chinese patients with mitochondrial myopathy, encephalopathy, lactic acidosis and stroke-like episodes (MELAS) has been rarely reported, though previous data suggested that the mutation pattern of MELAS could be different among geographically localized populations. 
We presented the results of comprehensive mtDNA mutation analysis in 92 unrelated Chinese patients with MELAS (85 with classic MELAS and 7 with MELAS/Leigh syndrome (LS) overlap syndrome). The mtDNA A3243G mutation was the most common causal genotype in this patient group (79/92 and 85.9%). The second common gene mutation was G13513A (7/92 and 7.6%). Additionally, we identified T10191C (p.S45P) in ND3, A11470C (p. K237N) in ND4, T13046C (p.M237T) in ND5 and a large-scale deletion (13025-13033:14417-14425) involving partial ND5 and ND6 subunits of complex I in one patient each. Among them, A11470C, T13046C and the single deletion were novel mutations. In summary, patients with mutations affecting mitochondrially encoded complex I (MTND) reached 12.0% (11/92) in this group. It is noteworthy that all seven patients with MELAS/LS overlap syndrome were associated with MTND mutations. Our data emphasize the important role of MTND mutations in the pathogenicity of MELAS, especially MELAS/LS overlap syndrome.PURPOSE: Genes in the complement pathway, including complement factor H (CFH), C2/BF, and C3, have been reported to be associated with age-related macular degeneration (AMD). Genetic variants, single-nucleotide polymorphisms (SNPs), in these genes were geno-typed for a case-control association study in a mainland Han Chinese population. METHODS: One hundred and fifty-eight patients with wet AMD, 80 patients with soft drusen, and 220 matched control subjects were recruited among Han Chinese in mainland China. Seven SNPs in CFH and two SNPs in C2, CFB', and C3 were genotyped using the ABI SNaPshot method. A deletion of 84,682 base pairs covering the CFHR1 and CFHR3 genes was detected by direct polymerase chain reaction and gel electrophoresis. RESULTS: Four SNPs, including rs3753394 (P = 0.0276), rs800292 (P = 0.0266), rs1061170 (P = 0.00514), and rs1329428 (P = 0.0089), in CFH showed a significant association with wet AMD in the cohort of this study. 
A haplotype containing these four SNPs (CATA) significantly increased protection of wet AMD with a P value of 0.0005 and an odds ratio of 0.29 (95% confidence interval: 0.15-0.60). Unlike in other populations, rs2274700 and rs1410996 did not show a significant association with AMD in the Chinese population of this study. None of the SNPs in CFH showed a significant association with drusen, and none of the SNPs in CFH, C2, CFB, and C3 showed a significant association with either wet AMD or drusen in the cohort of this study. The CFHR1 and CFHR3 deletion was not polymorphic in the Chinese population and was not associated with wet AMD or drusen. CONCLUSION: This study showed that SNPs rs3753394 (P = 0.0276), rs800292 (P = 0.0266), rs1061170 (P = 0.00514), and rs1329428 (P = 0.0089), but not rs7535263, rs1410996, or rs2274700, in CFH were significantly associated with wet AMD in a mainland Han Chinese population. This study showed that CFH was more likely to be AMD susceptibility gene at Chr.1q31 based on the finding that the CFHR1 and CFHR3 deletion was not polymorphic in the cohort of this study, and none of the SNPs that were significantly associated with AMD in a white population in C2, CFB, and C3 genes showed a significant association with AMD.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.genetic_variants").predict("""The mutation pattern of mitochondrial DNA (mtDNA) in mainland Chinese patients with mitochondrial myopathy, encephalopathy, lactic acidosis and stroke-like episodes (MELAS) has been rarely reported, though previous data suggested that the mutation pattern of MELAS could be different among geographically localized populations. We presented the results of comprehensive mtDNA mutation analysis in 92 unrelated Chinese patients with MELAS (85 with classic MELAS and 7 with MELAS/Leigh syndrome (LS) overlap syndrome). 
The mtDNA A3243G mutation was the most common causal genotype in this patient group (79/92 and 85.9%). The second common gene mutation was G13513A (7/92 and 7.6%). Additionally, we identified T10191C (p.S45P) in ND3, A11470C (p. K237N) in ND4, T13046C (p.M237T) in ND5 and a large-scale deletion (13025-13033:14417-14425) involving partial ND5 and ND6 subunits of complex I in one patient each. Among them, A11470C, T13046C and the single deletion were novel mutations. In summary, patients with mutations affecting mitochondrially encoded complex I (MTND) reached 12.0% (11/92) in this group. It is noteworthy that all seven patients with MELAS/LS overlap syndrome were associated with MTND mutations. Our data emphasize the important role of MTND mutations in the pathogenicity of MELAS, especially MELAS/LS overlap syndrome.PURPOSE: Genes in the complement pathway, including complement factor H (CFH), C2/BF, and C3, have been reported to be associated with age-related macular degeneration (AMD). Genetic variants, single-nucleotide polymorphisms (SNPs), in these genes were geno-typed for a case-control association study in a mainland Han Chinese population. METHODS: One hundred and fifty-eight patients with wet AMD, 80 patients with soft drusen, and 220 matched control subjects were recruited among Han Chinese in mainland China. Seven SNPs in CFH and two SNPs in C2, CFB', and C3 were genotyped using the ABI SNaPshot method. A deletion of 84,682 base pairs covering the CFHR1 and CFHR3 genes was detected by direct polymerase chain reaction and gel electrophoresis. RESULTS: Four SNPs, including rs3753394 (P = 0.0276), rs800292 (P = 0.0266), rs1061170 (P = 0.00514), and rs1329428 (P = 0.0089), in CFH showed a significant association with wet AMD in the cohort of this study. A haplotype containing these four SNPs (CATA) significantly increased protection of wet AMD with a P value of 0.0005 and an odds ratio of 0.29 (95% confidence interval: 0.15-0.60). 
Unlike in other populations, rs2274700 and rs1410996 did not show a significant association with AMD in the Chinese population of this study. None of the SNPs in CFH showed a significant association with drusen, and none of the SNPs in CFH, C2, CFB, and C3 showed a significant association with either wet AMD or drusen in the cohort of this study. The CFHR1 and CFHR3 deletion was not polymorphic in the Chinese population and was not associated with wet AMD or drusen. CONCLUSION: This study showed that SNPs rs3753394 (P = 0.0276), rs800292 (P = 0.0266), rs1061170 (P = 0.00514), and rs1329428 (P = 0.0089), but not rs7535263, rs1410996, or rs2274700, in CFH were significantly associated with wet AMD in a mainland Han Chinese population. This study showed that CFH was more likely to be AMD susceptibility gene at Chr.1q31 based on the finding that the CFHR1 and CFHR3 deletion was not polymorphic in the cohort of this study, and none of the SNPs that were significantly associated with AMD in a white population in C2, CFB, and C3 genes showed a significant association with AMD.""") ```
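As a plain-Python illustration of one surface pattern this model tags as `SNP` — dbSNP reference IDs of the form `rs` followed by digits — here is a simple regex heuristic (a sketch for intuition only, not the NER model itself, which also handles `DNAMutation` and `ProteinMutation` formats that regexes capture poorly):

```python
import re

# dbSNP reference IDs: "rs" followed by one or more digits.
SNP_PATTERN = re.compile(r"\brs\d+\b")

text = ("Four SNPs, including rs3753394 (P = 0.0276), rs800292 (P = 0.0266), "
        "rs1061170 (P = 0.00514), and rs1329428 (P = 0.0089), in CFH showed "
        "a significant association with wet AMD.")

print(SNP_PATTERN.findall(text))
# ['rs3753394', 'rs800292', 'rs1061170', 'rs1329428']
```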
## Results

```bash
+---------+---------------+
|chunk    |ner_label      |
+---------+---------------+
|A3243G   |DNAMutation    |
|G13513A  |DNAMutation    |
|T10191C  |DNAMutation    |
|p.S45P   |ProteinMutation|
|A11470C  |DNAMutation    |
|p. K237N |ProteinMutation|
|T13046C  |DNAMutation    |
|p.       |ProteinMutation|
|M237T    |ProteinMutation|
|A11470C  |DNAMutation    |
|T13046C  |DNAMutation    |
|rs3753394|SNP            |
|rs800292 |SNP            |
|rs1061170|SNP            |
|rs1329428|SNP            |
|rs2274700|SNP            |
|rs1410996|SNP            |
|rs3753394|SNP            |
|rs800292 |SNP            |
|rs1061170|SNP            |
|rs1329428|SNP            |
|rs7535263|SNP            |
|rs1410996|SNP            |
|rs2274700|SNP            |
+---------+---------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_genetic_variants|
|Compatibility:|Healthcare NLP 3.1.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|

## Data Source

https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmvar/

## Benchmarking

```bash
entity           tp    fp   fn   total  precision  recall  f1
SNP              38.0  0.0  0.0  38.0   1.0        1.0     1.0
ProteinMutation  171.0 21.0 19.0 190.0  0.8906     0.9     0.8953
DNAMutation      161.0 47.0 42.0 203.0  0.774      0.7931  0.7835
```

---
layout: model
title: English BertForQuestionAnswering Cased model (from peggyhuang)
author: John Snow Labs
name: bert_qa_scibert_coqa
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `SciBERT-CoQA` is an English model originally trained by `peggyhuang`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_scibert_coqa_en_4.0.0_3.0_1657182460716.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_scibert_coqa_en_4.0.0_3.0_1657182460716.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_scibert_coqa","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_scibert_coqa","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_scibert_coqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|410.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/peggyhuang/SciBERT-CoQA

---
layout: model
title: Legal Modification and waiver Clause Binary Classifier
author: John Snow Labs
name: legclf_modification_and_waiver_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `modification-and-waiver` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level.

If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Note that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `modification-and-waiver`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_modification_and_waiver_clause_en_1.0.0_3.2_1660122661461.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_modification_and_waiver_clause_en_1.0.0_3.2_1660122661461.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_modification_and_waiver_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
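The paragraph-splitting advice from the description can be sketched in plain Python before feeding each piece to the classifier. This is a minimal sketch assuming paragraphs are separated by blank lines; the helper name and sample clauses are illustrative:

```python
import re

def split_paragraphs(document: str):
    """Split a document on blank lines (one or more) and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

doc = ("Clause 1. No amendment of this Agreement shall be effective...\n\n"
       "Clause 2. No waiver of any provision shall be deemed...\n\n\n"
       "Clause 3. If any provision is held invalid...")

for clause in split_paragraphs(doc):
    print(clause)
```

Each returned piece can then be placed in the `clause_text` column of the DataFrame shown above.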
## Results

```bash
+-------------------------+
|result                   |
+-------------------------+
|[modification-and-waiver]|
|[other]                  |
|[other]                  |
|[modification-and-waiver]|
+-------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_modification_and_waiver_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
label                    precision  recall  f1-score  support
modification-and-waiver  0.86       0.88    0.87      42
other                    0.95       0.94    0.95      106
accuracy                 -          -       0.93      148
macro-avg                0.91       0.91    0.91      148
weighted-avg             0.93       0.93    0.93      148
```

---
layout: model
title: Multilingual (English, German, Spanish) DistilBertForQuestionAnswering model (from ZYW) Squad
author: John Snow Labs
name: distilbert_qa_squad_en_de_es_model
date: 2022-06-08
tags: [en, de, es, open_source, distilbert, question_answering, xx]
task: Question Answering
language: xx
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squad-en-de-es-model` is a multilingual (English, German, Spanish) model originally trained by `ZYW`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_squad_en_de_es_model_xx_4.0.0_3.0_1654728749221.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_squad_en_de_es_model_xx_4.0.0_3.0_1654728749221.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_en_de_es_model","xx") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_en_de_es_model","xx")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("xx.answer_question.squad.distil_bert.en_de_es_tuned.by_ZYW").predict("""PUT YOUR QUESTION HERE|||PUT YOUR CONTEXT HERE""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_squad_en_de_es_model|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|xx|
|Size:|505.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/ZYW/squad-en-de-es-model

---
layout: model
title: Mapping ICD10CM Codes with Corresponding Causes and Claim Analysis Codes
author: John Snow Labs
name: icd10cm_cause_claim_mapper
date: 2023-05-11
tags: [en, licensed, chunk_mapping, icd10cm, cause, claim]
task: Chunk Mapping
language: en
edition: Healthcare NLP 4.4.0
spark_version: 3.0
supported: true
annotator: ChunkMapperModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained model maps ICD-10-CM codes to their corresponding causes and claim analysis codes. If there is no equivalent claim analysis code, the result will be `None`.

## Predicted Entities

`icd10cm_cause`, `icd10cm_claim_analysis_code`

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10cm_cause_claim_mapper_en_4.4.0_3.0_1683819210044.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10cm_cause_claim_mapper_en_4.4.0_3.0_1683819210044.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

chunk_assembler = Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("icd_chunk")

chunkerMapper = ChunkMapperModel.pretrained("icd10cm_cause_claim_mapper", "en", "clinical/models")\
    .setInputCols(["icd_chunk"])\
    .setOutputCol("mappings")\
    .setRels(["icd10cm_cause", "icd10cm_claim_analysis_code"])

pipeline = Pipeline().setStages([document_assembler, chunk_assembler, chunkerMapper])

model = pipeline.fit(spark.createDataFrame([['']]).toDF('text'))

lp = LightPipeline(model)

res = lp.fullAnnotate(["D69.51", "G43.83", "A18.03"])
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val chunk_assembler = new Doc2Chunk()
    .setInputCols("document")
    .setOutputCol("icd_chunk")

val chunkerMapper = ChunkMapperModel.pretrained("icd10cm_cause_claim_mapper", "en", "clinical/models")
    .setInputCols(Array("icd_chunk"))
    .setOutputCol("mappings")
    .setRels(Array("icd10cm_cause", "icd10cm_claim_analysis_code"))

val mapper_pipeline = new Pipeline().setStages(Array(document_assembler, chunk_assembler, chunkerMapper))

val data = Seq("D69.51", "G43.83", "A18.03").toDF("text")

val result = mapper_pipeline.fit(data).transform(data)
```
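Conceptually, the chunk mapper behaves like a keyed lookup that yields `None` for codes without an equivalent entry. A minimal plain-Python sketch of that behavior (the sample rows are a tiny hypothetical excerpt taken from the Results section, not the full mapping shipped with the model):

```python
# Tiny illustrative subset of an ICD-10-CM -> cause mapping (hypothetical excerpt).
ICD10CM_CAUSES = {
    "D69.51": ["Unintentional injuries", "Adverse effects of medical treatment"],
    "G43.83": ["Headache disorders", "Tension-type headache", "Migraine"],
    "A18.03": ["Whooping cough"],
}

def map_code(code: str):
    """Return the mapped causes, or None when no equivalent entry exists."""
    return ICD10CM_CAUSES.get(code)

print(map_code("G43.83"))  # ['Headache disorders', 'Tension-type headache', 'Migraine']
print(map_code("Z99.99"))  # None
```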
## Results

```bash
+------------+------------------------------------+---------------------------+
|icd10cm_code|cause                               |icd10cm_claim_analysis_code|
+------------+------------------------------------+---------------------------+
|D69.51      |Unintentional injuries              |D69.51                     |
|D69.51      |Adverse effects of medical treatment|D69.51                     |
|G43.83      |Headache disorders                  |G43.83                     |
|G43.83      |Tension-type headache               |G43.83                     |
|G43.83      |Migraine                            |G43.83                     |
|A18.03      |Whooping cough                      |A18.03                     |
+------------+------------------------------------+---------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|icd10cm_cause_claim_mapper|
|Compatibility:|Healthcare NLP 4.4.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|600.2 KB|

---
layout: model
title: Detect bacterial species
author: John Snow Labs
name: ner_bacterial_species
date: 2021-04-01
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Detect different species of bacteria in text using a pretrained NER model.
## Predicted Entities `SPECIES` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_BACTERIAL_SPECIES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_bacterial_species_en_3.0.0_3.0_1617260641415.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_bacterial_species_en_3.0.0_3.0_1617260641415.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_bacterial_species", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])

model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]], ["text"]))
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_bacterial_species", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))

val data = Seq("EXAMPLE_TEXT").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python import nlu nlu.load("en.med_ner.bacterial_species").predict("""Put your text here.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_bacterial_species|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|

## Benchmarking

```bash
+-------+------+-----+-----+------+---------+------+------+
| entity|    tp|   fp|   fn| total|precision|recall|    f1|
+-------+------+-----+-----+------+---------+------+------+
|SPECIES|1396.0|265.0|414.0|1810.0|   0.8405|0.7713|0.8044|
+-------+------+-----+-----+------+---------+------+------+

+------------------+
|             macro|
+------------------+
|0.8043791414577931|
+------------------+

+------------------+
|             micro|
+------------------+
|0.8043791414577931|
+------------------+
```

---
layout: model
title: English BertForQuestionAnswering model (from piEsposito)
author: John Snow Labs
name: bert_qa_braquad_bert_qna
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `braquad-bert-qna` is an English model originally trained by `piEsposito`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_braquad_bert_qna_en_4.0.0_3.0_1654537655715.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_braquad_bert_qna_en_4.0.0_3.0_1654537655715.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_braquad_bert_qna","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_braquad_bert_qna","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.by_piEsposito").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_braquad_bert_qna| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|406.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/piEsposito/braquad-bert-qna - https://github.com/piEsposito/br-quad-2.0 --- layout: model title: Pipeline to Detect Genes and Human Phenotypes author: John Snow Labs name: ner_human_phenotype_gene_clinical_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_human_phenotype_gene_clinical](https://nlp.johnsnowlabs.com/2021/03/31/ner_human_phenotype_gene_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_clinical_pipeline_en_4.3.0_3.2_1678874928254.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_clinical_pipeline_en_4.3.0_3.2_1678874928254.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_human_phenotype_gene_clinical_pipeline", "en", "clinical/models") text = '''Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_human_phenotype_gene_clinical_pipeline", "en", "clinical/models") val text = "Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3)." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.human_phnotype_gene_clinical.pipeline").predict("""Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-----------------|--------:|------:|:------------|-------------:| | 0 | type | 29 | 32 | GENE | 0.9837 | | 1 | polyhydramnios | 75 | 88 | HP | 0.987 | | 2 | polyuria | 91 | 98 | HP | 0.9964 | | 3 | nephrocalcinosis | 101 | 116 | HP | 0.9979 | | 4 | hypokalemia | 122 | 132 | HP | 0.9952 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_human_phenotype_gene_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_fpdm_roberta_FT_newsqa date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_roberta_FT_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_roberta_FT_newsqa_en_4.0.0_3.0_1655728714119.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_roberta_FT_newsqa_en_4.0.0_3.0_1655728714119.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_roberta_FT_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_fpdm_roberta_FT_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.roberta.qa_fpdm_roberta_ft_newsqa.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_fpdm_roberta_FT_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|458.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/fpdm_roberta_FT_newsqa --- layout: model title: Translate Pohnpeian to English Pipeline author: John Snow Labs name: translate_pon_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, pon, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `pon` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_pon_en_xx_2.7.0_2.4_1609689085209.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_pon_en_xx_2.7.0_2.4_1609689085209.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_pon_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_pon_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.pon.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_pon_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English XLMRobertaForTokenClassification Large Cased model (from asahi417) author: John Snow Labs name: xlmroberta_ner_tner_large_all_english date: 2022-08-13 tags: [en, open_source, xlm_roberta, ner] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tner-xlm-roberta-large-all-english` is an English model originally trained by `asahi417`. ## Predicted Entities `time`, `corporation`, `ordinal number`, `cardinal number`, `rna`, `geopolitical area`, `protein`, `product`, `percent`, `dna`, `disease`, `cell line`, `law`, `other`, `date`, `chemical`, `event`, `work of art`, `cell type`, `location`, `language`, `quantity`, `facility`, `organization`, `group`, `money`, `person` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_large_all_english_en_4.1.0_3.0_1660424076486.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_large_all_english_en_4.1.0_3.0_1660424076486.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_large_all_english","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_large_all_english","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_tner_large_all_english| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|1.8 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/asahi417/tner-xlm-roberta-large-all-english - https://github.com/asahi417/tner --- layout: model title: Pipeline to Detect Chemicals in Medical Texts author: John Snow Labs name: bert_token_classifier_ner_chemicals_pipeline date: 2022-03-14 tags: [chemicals, ner, bert_token_classifier, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_chemicals](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_chemicals_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CHEMICALS/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemicals_pipeline_en_3.4.1_2.4_1647264078537.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemicals_pipeline_en_3.4.1_2.4_1647264078537.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python chemicals_pipeline = PretrainedPipeline("bert_token_classifier_ner_chemicals_pipeline", "en", "clinical/models") chemicals_pipeline.annotate("The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.") ``` ```scala val chemicals_pipeline = new PretrainedPipeline("bert_token_classifier_ner_chemicals_pipeline", "en", "clinical/models") chemicals_pipeline.annotate("The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.chemicals_pipeline").predict("""The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.""") ```
## Results ```bash +---------------------------+---------+ |chunk |ner_label| +---------------------------+---------+ |p - choloroaniline |CHEM | |chlorhexidine - digluconate|CHEM | |kanamycin |CHEM | |colistin |CHEM | |povidone - iodine |CHEM | +---------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_chemicals_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.3 MB| ## Included Models - DocumentAssembler - TokenizerModel - MedicalBertForTokenClassifier - NerConverter - Finisher --- layout: model title: Hausa Named Entity Recognition (from mbeukman) author: John Snow Labs name: xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_hausa date: 2022-05-17 tags: [xlm_roberta, ner, token_classification, ha, open_source] task: Named Entity Recognition language: ha edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-swahili-finetuned-ner-hausa` is a Hausa model originally trained by `mbeukman`. ## Predicted Entities `PER`, `ORG`, `LOC`, `DATE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_hausa_ha_3.4.2_3.0_1652808158883.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_hausa_ha_3.4.2_3.0_1652808158883.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_hausa","ha") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ina son Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_hausa","ha") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ina son Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_xlm_roberta_base_finetuned_swahili_finetuned_ner_hausa| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ha| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-swahili-finetuned-ner-hausa - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://www.apache.org/licenses/LICENSE-2.0 - https://github.com/Michael-Beukman --- layout: model title: English asr_wav2vec2_large_a TFWav2Vec2ForCTC from yongjian author: John Snow Labs name: pipeline_asr_wav2vec2_large_a date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_a` is an English model originally trained by yongjian. NOTE: This pipeline only works on a CPU; if you need to run this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_a_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_a_en_4.2.0_3.0_1664039953191.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_a_en_4.2.0_3.0_1664039953191.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_a', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_a", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_a| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Smaller BERT Embeddings (L-12_H-256_A-4) author: John Snow Labs name: small_bert_L12_256 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L12_256_en_2.6.0_2.4_1598344517363.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L12_256_en_2.6.0_2.4_1598344517363.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L12_256", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L12_256", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L12_256').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash token en_embed_bert_small_L12_256_embeddings I [2.3376450538635254, -1.8108669519424438, 0.00... love [0.369129478931427, -0.1494956910610199, 0.391... NLP [0.31006014347076416, 0.20426464080810547, -0... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|small_bert_L12_256| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|256| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-256_A-4/1 --- layout: model title: Translate Cushitic languages to English Pipeline author: John Snow Labs name: translate_cus_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, cus, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `cus` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_cus_en_xx_2.7.0_2.4_1609690727352.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_cus_en_xx_2.7.0_2.4_1609690727352.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_cus_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_cus_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.cus.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_cus_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Sentence Entity Resolver for Billable ICD10-CM HCC Codes (sbiobertresolve_icd10cm_slim_billable_hcc) author: John Snow Labs name: sbiobertresolve_icd10cm_slim_billable_hcc date: 2022-05-11 tags: [licensed, en, clinical, icd10, entity_resolution] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.5.1 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted clinical entities to ICD-10-CM codes using `sbiobert_base_cased_mli` sentence bert embeddings. In this model, synonyms having low cosine similarity to unnormalized terms are dropped. It returns the official resolution text within the brackets and also provides billable and HCC information of the codes in the `all_k_aux_labels` parameter in the metadata. This column can be split to get further details: `billable status || hcc status || hcc score`. For example, if `all_k_aux_labels` is `1||1||19`, the `billable status` is 1, the `hcc status` is 1, and the `hcc score` is 19.
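The `billable status || hcc status || hcc score` layout described above can be unpacked with ordinary string handling, no Spark required. The sketch below is illustrative only: the helper name and dictionary keys are my own, not part of the Spark NLP API.

```python
# Illustrative helper (name and keys are assumptions, not Spark NLP API):
# split an "all_k_aux_labels" value such as "1||1||19" into its three parts,
# following the "billable status || hcc status || hcc score" layout above.
def parse_aux_label(aux_label):
    billable, hcc_status, hcc_score = aux_label.split("||")
    return {
        "billable_status": billable,
        "hcc_status": hcc_status,
        "hcc_score": hcc_score,
    }

print(parse_aux_label("1||1||19"))
# {'billable_status': '1', 'hcc_status': '1', 'hcc_score': '19'}
```

The same split can be applied to each entry of the resolver's metadata once the results are collected to the driver.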
## Predicted Entities `ICD10 Codes`, `billable status`, `hcc status`, `hcc score` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_ICD10_CM.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_slim_billable_hcc_en_3.5.1_3.0_1652294908790.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_slim_billable_hcc_en_3.5.1_3.0_1652294908790.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("word_embeddings") ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token", "word_embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(["PROBLEM"]) c2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sentence_embeddings")\ .setCaseSensitive(False) icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_slim_billable_hcc", "en", "clinical/models") \ .setInputCols(["sentence_embeddings"]) \ .setOutputCol("icd_code")\ .setDistanceFunction("EUCLIDEAN") resolver_pipeline = Pipeline( stages = [ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter, c2doc, sbert_embedder, icd_resolver ]) data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis and obesity , presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.
Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection."""]]).toDF("text") result = resolver_pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("word_embeddings") val ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("PROBLEM")) val c2doc = new Chunk2Doc() .setInputCols(Array("ner_chunk")) .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sentence_embeddings") .setCaseSensitive(false) val resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_slim_billable_hcc", "en", "clinical/models") .setInputCols(Array("sentence_embeddings")) .setOutputCol("icd_code") .setDistanceFunction("EUCLIDEAN") val resolver_pipeline = new Pipeline().setStages(Array(document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter, c2doc, sbert_embedder, resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis 
three years prior to presentation, associated with acute hepatitis and obesity , presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.").toDS.toDF("text") val results = resolver_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd10cm.slim_billable_hcc").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis and obesity , presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.""") ```
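In the results that follow, the resolver packs its top-k candidates into `:::`-separated strings, and each entry of `all_k_aux_labels` carries three `||`-separated fields that, judging from this model's predicted entities, correspond to billable status, hcc status, and hcc score. A minimal sketch of unpacking that metadata; the helper name and the field order are illustrative assumptions, not part of the Spark NLP API:

```python
# Illustrative helper (not part of Spark NLP): unpack the ":::"-joined
# top-k metadata columns into per-candidate dicts. The "||"-separated
# field order (billable status, hcc status, hcc score) is an assumption
# based on this model's predicted entities.
def parse_icd_candidates(all_k_codes: str, all_k_aux_labels: str):
    candidates = []
    for code, aux in zip(all_k_codes.split(":::"), all_k_aux_labels.split(":::")):
        billable, hcc_status, hcc_score = aux.split("||")
        candidates.append(
            {"code": code, "billable": billable,
             "hcc_status": hcc_status, "hcc_score": hcc_score}
        )
    return candidates

# Values taken from the T2DM row of the results table:
top = parse_icd_candidates("E11.9:::E75.00", "1||1||19:::1||1||52")[0]
# top == {"code": "E11.9", "billable": "1", "hcc_status": "1", "hcc_score": "19"}
```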
## Results ```bash +-------------------------------------+-------+--------+-------------------------------------------------------------------+--------------------------------------------------+-------------------------------------------------------+ | chunk| entity|icd_code| all_k_resolutions| all_k_codes| all_k_aux_labels | +-------------------------------------+-------+--------+-------------------------------------------------------------------+--------------------------------------------------+-------------------------------------------------------+ | gestational diabetes mellitus|PROBLEM| O24.41|gestational diabetes mellitus [Gestational diabetes mellitus in ...|O24.41:::E11.9:::O24.919:::O24.419:::O24.439:::...| 0||0||0:::1||1||19:::1||0||0:::1||0||0:::1||0||0:::...| |subsequent type two diabetes mellitus|PROBLEM| O24.11|pre-existing type 2 diabetes mellitus [Pre-existing type 2 diabe...|O24.11:::E11.8:::E11.9:::E11:::E13.9:::E11.3:::...| 0||0||0:::1||1||18:::1||1||19:::0||0||0:::1||1||19:...| | T2DM|PROBLEM| E11.9|t2dm [Type 2 diabetes mellitus without complications]:::gm>2 [GM...|E11.9:::E75.00:::H35.89:::F80.0:::R44.8:::M79.8...| 1||1||19:::1||1||52:::1||0||0:::1||0||0:::1||0||0::...| | HTG-induced pancreatitis|PROBLEM| K85.9|alcohol-induced pancreatitis [Acute pancreatitis, unspecified]:::..|K85.9:::F10.988:::K85.3:::K85:::K85.2:::K85.8::...| 0||0||0:::1||1||55:::0||0||0:::0||0||0:::0||0||0:::...| | acute hepatitis|PROBLEM| K72.0|acute hepatitis [Acute and subacute hepatic failure]:::acute hep...|K72.0:::B17.9:::B15.9:::B15:::B17.2:::Z03.89:::...| 0||0||0:::1||0||0:::1||0||0:::0||0||0:::1||0||0:::1...| | obesity|PROBLEM| E66.8|abdominal obesity [Other obesity]:::overweight and obesity [Over...|E66.8:::E66:::E66.01:::E66.9:::Z91.89:::E66.3::...| 1||0||0:::0||0||0:::1||1||22:::1||0||0:::1||0||0:::...| | polyuria|PROBLEM| R35|polyuria [Polyuria]:::nocturnal polyuria [Nocturnal polyuria]:::...|R35:::R35.81:::R35.89:::R31:::R30.0:::E72.01:::...| 
0||0||0:::1||0||0:::1||0||0:::0||0||0:::1||0||0:::1...| | polydipsia|PROBLEM| R63.1|polydipsia [Polydipsia]:::psychogenic polydipsia [Other impulse ...|R63.1:::F63.89:::O40:::O40.9XX0:::G47.50:::G47....| 1||0||0:::1||0||0:::0||0||0:::1||0||0:::1||0||0:::0...| | poor appetite|PROBLEM| R63.0|poor appetite [Anorexia]:::patient dissatisfied with nutrition r...|R63.0:::Z76.89:::R53.1:::R10.9:::R45.81:::R44.8...| 1||0||0:::1||0||0:::1||0||0:::1||0||0:::1||0||0:::1...| | vomiting|PROBLEM| R11.1|vomiting [Vomiting]:::vomiting [Vomiting, unspecified]:::intermi...|R11.1:::R11.10:::R11:::G43.A0:::G43.A:::R11.0::...| 0||0||0:::1||0||0:::0||0||0:::1||0||0:::0||0||0:::1...| | a respiratory tract infection|PROBLEM| J06.9|upper respiratory tract infection [Acute upper respiratory infec...|J06.9:::T17.9:::T17:::J04.10:::J22:::J98.8:::J9...| 1||0||0:::0||0||0:::0||0||0:::1||0||0:::1||0||0:::1...| +-------------------------------------+-------+--------+----------------------------------------------------------------------------------------------------+--------------------------------------------------+-------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icd10cm_slim_billable_hcc| |Compatibility:|Healthcare NLP 3.5.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Size:|846.6 MB| |Case sensitive:|false| --- layout: model title: Public Health Surveillance (PHS) BERT Embeddings author: John Snow Labs name: bert_embeddings_phs_bert date: 2022-07-02 tags: [bert, en, embeddings, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BERT Embeddings model, adapted from [Hugging 
Face](https://huggingface.co/publichealthsurveillance/PHS-BERT) and curated to provide scalability and production-readiness using Spark NLP. `PHS-BERT` is an English model trained to identify tasks related to public health surveillance (PHS) on social media. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_phs_bert_en_4.0.0_3.0_1656759538082.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_phs_bert_en_4.0.0_3.0_1656759538082.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_phs_bert","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["No place in my city has shelter space for us, and I won't put my baby on the literal street. What cities have good shelter programs for homeless mothers and children?"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_phs_bert","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("No place in my city has shelter space for us, and I won't put my baby on the literal street. What cities have good shelter programs for homeless mothers and children?").toDF("text") val result = pipeline.fit(data).transform(data) ```
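The `embeddings` column above holds one vector per token. When a single fixed-size vector per document is needed (e.g. for a downstream classifier), a common approach is to mean-pool the token vectors. A minimal sketch on dummy low-dimensional vectors (PHS-BERT itself emits 768-dimensional vectors; the helper below is illustrative, not a Spark NLP API):

```python
import numpy as np

def mean_pool(token_vectors):
    """Average token-level vectors into one document vector.
    Sketch only: real PHS-BERT vectors are 768-dimensional."""
    return np.asarray(token_vectors, dtype=float).mean(axis=0)

# Dummy 4-dim "token embeddings" for one document:
doc_vectors = [[1.0, 0.0, 2.0, 4.0],
               [3.0, 2.0, 0.0, 0.0]]
doc_embedding = mean_pool(doc_vectors)  # array([2., 1., 1., 2.])
```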
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_phs_bert| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| ## References https://arxiv.org/abs/2204.04521 --- layout: model title: Detect actions in general commands related to music, restaurant, movies. author: John Snow Labs name: nerdl_snips_100d date: 2021-02-15 task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 2.7.3 spark_version: 2.4 tags: [open_source, ner, en] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Understand user commands, find the relevant entities and actions, and tag them to get a structured representation for automation. ## Predicted Entities `playlist_owner`, `served_dish`, `track`, `poi`, `cuisine`, `spatial_relation`, `object_type`, `facility`, `album`, `country`, `geographic_poi`, `location_name`, `object_part_of_series_type`, `object_select`, `artist`, `rating_value`, `best_rating`, `sort`, `party_size_description`, `party_size_number`, `restaurant_name`, `object_location_type`, `playlist`, `service`, `city`, `O`, `genre`, `movie_name`, `current_location`, `rating_unit`, `restaurant_type`, `condition_temperature`, `condition_description`, `entity_name`, `movie_type`, `object_name`, `state`, `year`, `music_item`, `timeRange` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_CLS_SNIPS){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/nerdl_snips_100d_en_2.7.3_2.4_1613403676821.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/public/models/nerdl_snips_100d_en_2.7.3_2.4_1613403676821.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en")\ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") ner = NerDLModel.pretrained("nerdl_snips_100d") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(['document', 'token', 'ner']) \ .setOutputCol('ner_chunk') nlp_pipeline = Pipeline(stages=[document_assembler, sentencer, tokenizer, embeddings, ner, ner_converter]) l_model = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = l_model.fullAnnotate('book a spot for nona gray myrtle and alison at a top-rated brasserie that is distant from wilson av on nov the 4th 2030 that serves ouzeri') ... ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("nerdl_snips_100d") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentencer, tokenizer, embeddings, ner, ner_converter)) val data = Seq("book a spot for nona gray myrtle and alison at a top-rated brasserie that is distant from wilson av on nov the 4th 2030 that serves ouzeri").toDF("text") val result = pipeline.fit(data).transform(data) ... ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.snips").predict("""book a spot for nona gray myrtle and alison at a top-rated brasserie that is distant from wilson av on nov the 4th 2030 that serves ouzeri""") ```
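The per-tag scores in the benchmark further down use IOB prefixes (`B-`/`I-`). Merging those tags back into labeled chunks is what the `NerConverter` stage does inside the pipeline; as a plain-Python reference, a minimal sketch (the function itself is illustrative, not a Spark NLP API):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, label) pairs,
    mirroring what NerConverter does inside the pipeline."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            # "O" tags (and stray "I-" tags) close any open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["book", "a", "spot", "for", "nona", "gray", "myrtle"]
tags   = ["O", "O", "O", "O", "B-party_size_description",
          "I-party_size_description", "I-party_size_description"]
# iob_to_chunks(tokens, tags) → [("nona gray myrtle", "party_size_description")]
```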
## Results ```bash +------------------------------+------------------------+ | ner_chunk | label | +------------------------------+------------------------+ | nona gray myrtle and alison | PARTY_SIZE_DESCRIPTION | | top-rated | SORT | | brasserie | RESTAURANT_TYPE | | distant | SPATIAL_RELATION | | wilson av | POI | | nov the 4th 2030 | TIMERANGE | | ouzeri | CUISINE | +------------------------------+------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|nerdl_snips_100d| |Type:|ner| |Compatibility:|Spark NLP 2.7.3+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source This model is trained on the NLU Benchmark SNIPS dataset https://github.com/MiuLab/SlotGated-SLU ## Benchmarking ```bash B-facility 3 0 0 1.0 1.0 1.0 B-poi 7 0 1 1.0 0.875 0.93333334 B-object_location_type 22 1 0 0.95652175 1.0 0.9777778 B-service 24 2 0 0.9230769 1.0 0.96000004 I-entity_name 53 2 1 0.96363634 0.9814815 0.9724771 B-genre 5 0 0 1.0 1.0 1.0 I-service 5 0 0 1.0 1.0 1.0 I-object_type 66 0 0 1.0 1.0 1.0 I-sort 9 0 0 1.0 1.0 1.0 I-city 19 1 0 0.95 1.0 0.9743589 B-music_item 102 2 2 0.9807692 0.9807692 0.9807692 I-movie_name 100 5 21 0.95238096 0.8264463 0.8849558 B-party_size_description 10 0 0 1.0 1.0 1.0 B-served_dish 10 3 2 0.7692308 0.8333333 0.8 B-object_type 161 8 1 0.9526627 0.99382716 0.9728097 B-playlist 123 6 6 0.95348835 0.95348835 0.95348835 B-restaurant_name 14 1 1 0.93333334 0.93333334 0.93333334 B-geographic_poi 11 0 0 1.0 1.0 1.0 B-condition_description 28 0 0 1.0 1.0 1.0 I-object_location_type 16 0 0 1.0 1.0 1.0 B-spatial_relation 70 3 1 0.9589041 0.9859155 0.9722222 I-party_size_description 35 0 0 1.0 1.0 1.0 I-poi 10 0 1 1.0 0.90909094 0.95238096 I-artist 111 4 1 0.9652174 0.9910714 0.9779735 B-condition_temperature 23 0 0 1.0 1.0 1.0 I-movie_type 16 0 0 1.0 1.0 1.0 I-object_part_of_series_type 0 0 1 0.0 0.0 0.0 B-city 60 
1 0 0.9836066 1.0 0.9917355 I-location_name 29 0 1 1.0 0.96666664 0.9830508 B-album 0 2 10 0.0 0.0 0.0 I-genre 2 0 0 1.0 1.0 1.0 B-state 55 0 4 1.0 0.9322034 0.9649123 I-object_name 383 29 16 0.9296116 0.9598997 0.9445129 B-current_location 13 0 1 1.0 0.9285714 0.9629629 B-timeRange 102 8 5 0.92727274 0.95327103 0.9400922 B-sort 29 1 3 0.96666664 0.90625 0.9354838 I-timeRange 144 7 0 0.95364237 1.0 0.97627115 B-rating_unit 40 0 0 1.0 1.0 1.0 I-current_location 7 0 0 1.0 1.0 1.0 I-state 6 0 0 1.0 1.0 1.0 I-album 4 1 17 0.8 0.1904762 0.30769232 B-entity_name 31 4 2 0.8857143 0.93939394 0.9117647 B-object_name 134 22 13 0.85897434 0.91156465 0.88448846 B-playlist_owner 70 1 0 0.9859155 1.0 0.9929078 I-music_item 5 0 0 1.0 1.0 1.0 I-spatial_relation 41 2 1 0.95348835 0.97619045 0.9647058 I-country 25 1 0 0.96153843 1.0 0.98039216 B-rating_value 80 0 0 1.0 1.0 1.0 B-restaurant_type 64 0 1 1.0 0.9846154 0.9922481 I-playlist_owner 7 0 0 1.0 1.0 1.0 I-cuisine 1 0 0 1.0 1.0 1.0 B-track 7 10 2 0.4117647 0.7777778 0.5384615 B-movie_name 37 2 10 0.94871795 0.78723407 0.8604651 B-party_size_number 50 0 0 1.0 1.0 1.0 I-restaurant_type 7 0 0 1.0 1.0 1.0 B-year 24 1 0 0.96 1.0 0.9795918 B-location_name 23 0 1 1.0 0.9583333 0.9787234 B-object_part_of_series_type 11 1 0 0.9166667 1.0 0.95652175 B-country 43 4 1 0.9148936 0.97727275 0.94505495 I-playlist 218 4 13 0.981982 0.94372296 0.96247244 I-served_dish 2 1 2 0.6666667 0.5 0.57142854 I-track 19 29 2 0.39583334 0.9047619 0.5507246 B-artist 99 4 8 0.9611651 0.92523366 0.9428571 B-best_rating 43 0 0 1.0 1.0 1.0 I-restaurant_name 35 2 1 0.9459459 0.9722222 0.9589041 B-object_select 40 1 0 0.9756098 1.0 0.9876543 B-cuisine 12 1 2 0.9230769 0.85714287 0.8888889 B-movie_type 33 0 0 1.0 1.0 1.0 I-geographic_poi 33 0 0 1.0 1.0 1.0 tp: 3121 fp: 177 fn: 155 labels: 69 Macro-average prec: 0.91982585, rec: 0.9205297, f1: 0.9201776 Micro-average prec: 0.9463311, rec: 0.9526862, f1: 0.949498 ``` --- layout: model title: Multilingual 
T5ForConditionalGeneration Small Cased model (from north) author: John Snow Labs name: t5_small_ncc_lm date: 2023-01-31 tags: [is, nn, en, "no", sv, open_source, t5, xx, tensorflow] task: Text Generation language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5_small_NCC_lm` is a Multilingual model originally trained by `north`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_ncc_lm_xx_4.3.0_3.0_1675157076420.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_ncc_lm_xx_4.3.0_3.0_1675157076420.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_ncc_lm","xx") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_ncc_lm","xx") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_ncc_lm| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|xx| |Size:|1.6 GB| ## References - https://huggingface.co/north/t5_small_NCC_lm - https://github.com/google-research/text-to-text-transfer-transformer - https://github.com/google-research/t5x - https://arxiv.org/abs/2104.09617 - https://arxiv.org/pdf/1910.10683.pdf - https://sites.research.google/trc/about/ --- layout: model title: Pipeline to Extract entities in clinical trial abstracts (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_clinical_trials_abstracts_pipeline date: 2023-03-20 tags: [berttokenclassifier, bert, biobert, en, ner, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_clinical_trials_abstracts](https://nlp.johnsnowlabs.com/2022/06/29/bert_token_classifier_ner_clinical_trials_abstracts_en_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_trials_abstracts_pipeline_en_4.3.0_3.2_1679304059319.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_trials_abstracts_pipeline_en_4.3.0_3.2_1679304059319.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_clinical_trials_abstracts_pipeline", "en", "clinical/models") text = '''This open-label, parallel-group, two-arm, pilot study compared the beta-cell protective effect of adding insulin glargine (GLA) vs. NPH insulin to ongoing metformin. Overall, 28 insulin-naive type 2 diabetes subjects (mean +/- SD age, 61.5 +/- 6.7 years; BMI, 30.7 +/- 4.3 kg/m(2)) treated with metformin and sulfonylurea were randomized to add once-daily GLA or NPH at bedtime.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_clinical_trials_abstracts_pipeline", "en", "clinical/models") val text = "This open-label, parallel-group, two-arm, pilot study compared the beta-cell protective effect of adding insulin glargine (GLA) vs. NPH insulin to ongoing metformin. Overall, 28 insulin-naive type 2 diabetes subjects (mean +/- SD age, 61.5 +/- 6.7 years; BMI, 30.7 +/- 4.3 kg/m(2)) treated with metformin and sulfonylurea were randomized to add once-daily GLA or NPH at bedtime." val result = pipeline.fullAnnotate(text) ```
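In the results that follow, `begin` and `end` are character offsets into the input text, inclusive on both ends, so each chunk can be recovered with `text[begin:end + 1]`. A quick sanity check against the first rows of the table:

```python
# Sanity-check the inclusive character offsets reported by the pipeline.
text = ("This open-label, parallel-group, two-arm, pilot study compared "
        "the beta-cell protective effect of adding insulin glargine (GLA) "
        "vs. NPH insulin to ongoing metformin.")

def chunk_at(text, begin, end):
    # Spark NLP annotations use inclusive begin/end offsets.
    return text[begin:end + 1]

assert chunk_at(text, 5, 14) == "open-label"
assert chunk_at(text, 17, 30) == "parallel-group"
assert chunk_at(text, 105, 120) == "insulin glargine"
```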
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-----------------|--------:|------:|:-------------------|-------------:| | 0 | open-label | 5 | 14 | CTDesign | 0.742075 | | 1 | parallel-group | 17 | 30 | CTDesign | 0.725741 | | 2 | two-arm | 33 | 39 | CTDesign | 0.427547 | | 3 | insulin glargine | 105 | 120 | Drug | 0.985063 | | 4 | GLA | 123 | 125 | Drug | 0.96917 | | 5 | NPH insulin | 132 | 142 | Drug | 0.762519 | | 6 | metformin | 155 | 163 | Drug | 0.996344 | | 7 | 28 | 175 | 176 | NumberPatients | 0.968501 | | 8 | type 2 diabetes | 192 | 206 | DisorderOrSyndrome | 0.979685 | | 9 | 61.5 | 235 | 238 | Age | 0.610416 | | 10 | kg/m(2 | 273 | 278 | BioAndMedicalUnit | 0.974807 | | 11 | metformin | 295 | 303 | Drug | 0.99696 | | 12 | sulfonylurea | 309 | 320 | Drug | 0.996722 | | 13 | randomized | 327 | 336 | CTDesign | 0.990632 | | 14 | once-daily | 345 | 354 | DrugTime | 0.472084 | | 15 | GLA | 356 | 358 | Drug | 0.972978 | | 16 | NPH | 363 | 365 | Drug | 0.989424 | | 17 | bedtime | 370 | 376 | DrugTime | 0.936016 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_clinical_trials_abstracts_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.9 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from huxxx657) author: John Snow Labs name: roberta_qa_huxxx657_base_finetuned_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained 
RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad` is an English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_huxxx657_base_finetuned_squad_en_4.3.0_3.0_1674217238547.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_huxxx657_base_finetuned_squad_en_4.3.0_3.0_1674217238547.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_huxxx657_base_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_huxxx657_base_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_huxxx657_base_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-squad --- layout: model title: English RobertaForQuestionAnswering (from AlirezaBaneshi) author: John Snow Labs name: roberta_qa_autotrain_test2_756523214 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-test2-756523214` is an English model originally trained by `AlirezaBaneshi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_autotrain_test2_756523214_en_4.0.0_3.0_1655727699500.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_autotrain_test2_756523214_en_4.0.0_3.0_1655727699500.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_autotrain_test2_756523214","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_autotrain_test2_756523214","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.756523214.by_AlirezaBaneshi").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_autotrain_test2_756523214| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|415.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AlirezaBaneshi/autotrain-test2-756523214 --- layout: model title: Pipeline to Detect Radiology Concepts (WIP Greedy) author: John Snow Labs name: jsl_rd_ner_wip_greedy_clinical_pipeline date: 2022-03-21 tags: [licensed, ner, wip, clinical, radiology, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [jsl_rd_ner_wip_greedy_clinical](https://nlp.johnsnowlabs.com/2021/04/01/jsl_rd_ner_wip_greedy_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_rd_ner_wip_greedy_clinical_pipeline_en_3.4.1_3.0_1647867631313.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_rd_ner_wip_greedy_clinical_pipeline_en_3.4.1_3.0_1647867631313.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("jsl_rd_ner_wip_greedy_clinical_pipeline", "en", "clinical/models") pipeline.annotate("Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.") ``` ```scala val pipeline = new PretrainedPipeline("jsl_rd_ner_wip_greedy_clinical_pipeline", "en", "clinical/models") pipeline.annotate("Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl_rd_wip_greedy.pipeline").predict("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""") ```
## Results ```bash +---------------------+-------------------------+ |chunks |entities | +---------------------+-------------------------+ |Bilateral |Direction | |breast |BodyPart | |ultrasound |ImagingTest | |ovoid mass |Symptom | |anteromedial aspect |Direction | |left |Direction | |shoulder |BodyPart | |mass |Symptom | |isoechoic echotexture|ImagingFindings | |adjacent muscle |BodyPart | |internal color flow |Symptom | |benign fibrous tissue|Symptom | |lipoma |Disease_Syndrome_Disorder| +---------------------+-------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|jsl_rd_ner_wip_greedy_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Arabic Bert Embeddings (Base, Trained on a half of the full MSA dataset) author: John Snow Labs name: bert_embeddings_bert_base_arabic_camelbert_msa_half date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-msa-half` is an Arabic model originally trained by `CAMeL-Lab`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabic_camelbert_msa_half_ar_3.4.2_3.0_1649678623023.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabic_camelbert_msa_half_ar_3.4.2_3.0_1649678623023.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabic_camelbert_msa_half","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabic_camelbert_msa_half","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.bert_base_arabic_camelbert_msa_half").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_arabic_camelbert_msa_half| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|409.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa-half - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://catalog.ldc.upenn.edu/LDC2011T11 - http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus - https://vlo.clarin.eu/search;jsessionid=31066390B2C9E8C6304845BA79869AC1?1&q=osian - https://archive.org/details/arwiki-20190201 - https://oscar-corpus.com/ - https://github.com/google-research/bert - https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/tokenization.py#L286-L297 - https://github.com/CAMeL-Lab/camel_tools - https://github.com/CAMeL-Lab/CAMeLBERT --- layout: model title: Legal Administration Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_administration_agreement_bert date: 2022-11-24 tags: [en, legal, classification, agreement, administration, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_administration_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `administration-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. 
## Predicted Entities `administration-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_administration_agreement_bert_en_1.0.0_3.0_1669300445143.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_administration_agreement_bert_en_1.0.0_3.0_1669300445143.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_administration_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +--------------------------+ |result | +--------------------------+ |[administration-agreement]| |[other] | |[other] | |[administration-agreement]| +--------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_administration_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support administration-agreement 0.88 0.81 0.84 26 other 0.93 0.95 0.94 65 accuracy - - 0.91 91 macro-avg 0.90 0.88 0.89 91 weighted-avg 0.91 0.91 0.91 91 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from elasticdotventures) author: John Snow Labs name: distilbert_qa_elasticdotventures_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `elasticdotventures`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_elasticdotventures_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770617065.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_elasticdotventures_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770617065.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_elasticdotventures_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_elasticdotventures_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_elasticdotventures_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/elasticdotventures/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal Term of agreement Clause Binary Classifier (md) author: John Snow Labs name: legclf_term_of_agreement_md date: 2023-01-11 tags: [en, legal, classification, document, agreement, contract, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `term-of-agreement` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
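As a rough illustration of the paragraph-splitting approach suggested above, here is a minimal plain-Python sketch (no Spark NLP required). The blank-line regex and the whitespace word count used as a proxy for the 512-subword-token limit are simplifying assumptions, not part of the model:

```python
import re

def split_paragraphs(document: str, max_tokens: int = 512):
    """Split on blank lines, then greedily merge paragraphs while
    staying under a rough whitespace-token budget."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    chunks, current, count = [], [], 0
    for p in paragraphs:
        n = len(p.split())  # crude proxy for the real subword token count
        if current and count + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(p)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = ("TERM OF AGREEMENT\n\nThis Agreement shall remain in force...\n\n"
       "MISCELLANEOUS\n\nNotices shall be given in writing.")
print(split_paragraphs(doc, max_tokens=10))
```

Each returned chunk can then be fed to the classifier as one row of the `clause_text` column.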
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `term-of-agreement` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_term_of_agreement_md_en_1.0.0_3.0_1673460298428.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_term_of_agreement_md_en_1.0.0_3.0_1673460298428.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_term_of_agreement_md", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------------------+ |result | +-------------------+ |[term-of-agreement]| |[other] | |[other] | |[term-of-agreement]| +-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_term_of_agreement_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.90 0.97 0.94 39 term-of-agreement 0.97 0.89 0.93 35 accuracy - - 0.93 74 macro-avg 0.94 0.93 0.93 74 weighted-avg 0.94 0.93 0.93 74 ``` --- layout: model title: English BertForQuestionAnswering Cased model (from lewtun) author: John Snow Labs name: bert_qa_lewtun_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `lewtun`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_lewtun_finetuned_squad_en_4.0.0_3.0_1657186592485.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_lewtun_finetuned_squad_en_4.0.0_3.0_1657186592485.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_lewtun_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_lewtun_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_lewtun_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/lewtun/bert-finetuned-squad --- layout: model title: Legal Criticality Prediction Classifier (Italian) author: John Snow Labs name: legclf_critical_prediction_italian date: 2023-03-27 tags: [it, licensed, legal, classification, tensorflow] task: Text Classification language: it edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a binary classification model that identifies two criticality labels (critical, non-critical) in Italian court cases. ## Predicted Entities `critical`, `non-critical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_critical_prediction_italian_it_1.0.0_3.0_1679944691458.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_critical_prediction_italian_it_1.0.0_3.0_1679944691458.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") classifier = nlp.RoBertaForSequenceClassification.pretrained("legclf_critical_prediction_italian", "it", "legal/models") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") nlpPipeline = nlp.Pipeline( stages = [documentAssembler, tokenizer, classifier]) # Example text example = spark.createDataFrame([["Per questi motivi, il Tribunale federale pronuncia: 1. Nella misura in cui è ammissibile, il ricorso è respinto. 2. Le spese giudiziarie di fr. 2'000.-- sono poste a carico del ricorrente. 3. Comunicazione ai patrocinatori delle parti, al patrocinatore di C._ e al Presidente della Camera di protezione del Tribunale d'appello del Cantone Ticino."]]).toDF("text") empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) result = model.transform(example) # result is a DataFrame result.select("text", "class.result").show(truncate=100) ```
## Results ```bash +----------------------------------------------------------------------------------------------------+----------+ | text| result| +----------------------------------------------------------------------------------------------------+----------+ |Per questi motivi, il Tribunale federale pronuncia: 1. Nella misura in cui è ammissibile, il rico...|[critical]| +----------------------------------------------------------------------------------------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_critical_prediction_italian| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|it| |Size:|415.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References Train dataset available [here](https://huggingface.co/datasets/rcds/legal_criticality_prediction) ## Benchmarking ```bash label precision recall f1-score support critical 0.82 0.90 0.86 10 non_critical 0.95 0.91 0.93 23 accuracy - - 0.91 33 macro-avg 0.89 0.91 0.90 33 weighted-avg 0.91 0.91 0.91 33 ``` --- layout: model title: Arabic Bert Embeddings (Base, Trained on a quarter of the full MSA dataset) author: John Snow Labs name: bert_embeddings_bert_base_arabic_camelbert_msa_quarter date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-msa-quarter` is an Arabic model originally trained by `CAMeL-Lab`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabic_camelbert_msa_quarter_ar_3.4.2_3.0_1649678578942.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabic_camelbert_msa_quarter_ar_3.4.2_3.0_1649678578942.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabic_camelbert_msa_quarter","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabic_camelbert_msa_quarter","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.bert_base_arabic_camelbert_msa_quarter").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_arabic_camelbert_msa_quarter| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|409.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa-quarter - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://catalog.ldc.upenn.edu/LDC2011T11 - http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus - https://vlo.clarin.eu/search;jsessionid=31066390B2C9E8C6304845BA79869AC1?1&q=osian - https://archive.org/details/arwiki-20190201 - https://oscar-corpus.com/ - https://github.com/google-research/bert - https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/tokenization.py#L286-L297 - https://github.com/CAMeL-Lab/camel_tools - https://github.com/CAMeL-Lab/CAMeLBERT --- layout: model title: German Electra Embeddings (from stefan-it) author: John Snow Labs name: electra_embeddings_electra_base_gc4_64k_900000_cased_generator date: 2022-05-17 tags: [de, open_source, electra, embeddings] task: Embeddings language: de edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-gc4-64k-900000-cased-generator` is a German model originally trained by `stefan-it`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_900000_cased_generator_de_3.4.4_3.0_1652786530316.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_900000_cased_generator_de_3.4.4_3.0_1652786530316.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_900000_cased_generator","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_900000_cased_generator","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electra_base_gc4_64k_900000_cased_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|de| |Size:|222.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/stefan-it/electra-base-gc4-64k-900000-cased-generator - https://german-nlp-group.github.io/projects/gc4-corpus.html - https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf --- layout: model title: Tagalog Electra Embeddings (from jcblaise) author: John Snow Labs name: electra_embeddings_electra_tagalog_base_uncased_generator date: 2022-05-17 tags: [tl, open_source, electra, embeddings] task: Embeddings language: tl edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-tagalog-base-uncased-generator` is a Tagalog model originally trained by `jcblaise`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_tagalog_base_uncased_generator_tl_3.4.4_3.0_1652786747701.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_tagalog_base_uncased_generator_tl_3.4.4_3.0_1652786747701.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_tagalog_base_uncased_generator","tl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Mahilig ako sa Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_tagalog_base_uncased_generator","tl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Mahilig ako sa Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electra_tagalog_base_uncased_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|tl| |Size:|130.6 MB| |Case sensitive:|false| ## References - https://huggingface.co/jcblaise/electra-tagalog-base-uncased-generator - https://blaisecruz.com --- layout: model title: German RoBERTa Embeddings (Small, Hotel) author: John Snow Labs name: roberta_embeddings_HotelBERT_small date: 2022-04-14 tags: [roberta, embeddings, de, open_source] task: Embeddings language: de edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `HotelBERT-small` is a German model originally trained by `FabianGroeger`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_HotelBERT_small_de_3.4.2_3.0_1649948046264.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_HotelBERT_small_de_3.4.2_3.0_1649948046264.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_HotelBERT_small","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Funken NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_HotelBERT_small","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Funken NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_HotelBERT_small| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|313.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/FabianGroeger/HotelBERT-small --- layout: model title: Legal Successors Clause Binary Classifier author: John Snow Labs name: legclf_successors_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `successors` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
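The fan-out described above (one True/False flag per clause classifier) can be sketched in plain Python. The keyword rules below are hypothetical stand-ins for the pretrained Spark NLP models, chosen only to make the collection pattern concrete; they do not reflect the real models' behavior:

```python
# Hypothetical stand-ins for binary clause classifiers such as
# legclf_successors_clause: each returns its positive label or "other".
def make_keyword_classifier(positive_label, keywords):
    def classify(text):
        lowered = text.lower()
        return positive_label if any(k in lowered for k in keywords) else "other"
    return classify

# One entry per clause model you would add to the pipeline.
classifiers = {
    "successors": make_keyword_classifier("successors", ["successors and assigns"]),
    "term-of-agreement": make_keyword_classifier("term-of-agreement", ["term of this agreement"]),
}

clause = "This Agreement shall be binding upon the successors and assigns of the parties."

# Collect a series of True/False values, one per clause classifier.
flags = {name: clf(clause) != "other" for name, clf in classifiers.items()}
print(flags)  # {'successors': True, 'term-of-agreement': False}
```

In a real pipeline the same pattern applies: run each pretrained clause model over the clause text and record whether it emitted its positive label or `other`.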
## Predicted Entities `other`, `successors` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_successors_clause_en_1.0.0_3.2_1660124037330.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_successors_clause_en_1.0.0_3.2_1660124037330.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_successors_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[successors]| |[other]| |[other]| |[successors]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_successors_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 0.98 0.99 159 successors 0.96 1.00 0.98 72 accuracy - - 0.99 231 macro-avg 0.98 0.99 0.99 231 weighted-avg 0.99 0.99 0.99 231 ``` --- layout: model title: Sentence Entity Resolver for UMLS CUI Codes author: John Snow Labs name: sbiobertresolve_umls_findings date: 2022-07-05 tags: [entity_resolution, licensed, clinical, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts to 4 major categories of UMLS CUI codes using sbiobert_base_cased_mli Sentence Bert Embeddings. It has a faster load time, with a speedup of about 6X when compared to previous versions. Also, the load process is now more memory-friendly, meaning that the maximum memory required during load time is smaller, reducing the chances of OOM exceptions and thus relaxing hardware requirements.
This model returns CUI (concept unique identifier) codes for 200K concepts from clinical findings. https://www.nlm.nih.gov/research/umls/index.html ## Predicted Entities `umls cui codes` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_UMLS_CUI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_findings_en_4.0.0_3.0_1657041603500.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_findings_en_4.0.0_3.0_1657041603500.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("clinical_ner") ner_model_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "clinical_ner"])\ .setOutputCol("ner_chunk") chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli",'en','clinical/models')\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_umls_findings","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") pipeline = Pipeline(stages = [document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_model_converter, chunk2doc, sbert_embedder, resolver]) data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting."""]]).toDF("text") results = pipeline.fit(data).transform(data) ``` ```scala ... 
val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel .pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("clinical_ner") val ner_model_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "clinical_ner")) .setOutputCol("ner_chunk") val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_umls_findings", "en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val p_model = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_model_converter, chunk2doc, sbert_embedder, resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.").toDF("text") val res = p_model.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu 
nlu.load("en.resolve.umls.findings").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.""") ```
## Results ```bash | | ner_chunk | cui_code | |---:|:--------------------------------------|:-----------| | 0 | gestational diabetes mellitus | C2183115 | | 1 | subsequent type two diabetes mellitus | C3532488 | | 2 | T2DM | C3280267 | | 3 | HTG-induced pancreatitis | C4554179 | | 4 | an acute hepatitis | C4750596 | | 5 | obesity | C1963185 | | 6 | a body mass index | C0578022 | | 7 | polyuria | C3278312 | | 8 | polydipsia | C3278316 | | 9 | poor appetite | C0541799 | | 10 | vomiting | C0042963 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_umls_findings| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[name]| |Language:|en| |Size:|2.2 GB| |Case sensitive:|false| ## References Trained on 200K concepts from clinical findings. https://www.nlm.nih.gov/research/umls/index.html --- layout: model title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_4 TFWav2Vec2ForCTC from nimrah author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_turkish_colab_4 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_4` is an English model originally trained by nimrah.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_300m_turkish_colab_4_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_4_en_4.2.0_3.0_1664116671111.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_4_en_4.2.0_3.0_1664116671111.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_4", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_4", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_turkish_colab_4| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Lemmatizer (English, SpacyLookup) author: John Snow Labs name: lemma_spacylookup date: 2022-03-08 tags: [open_source, lemmatizer, en] task: Lemmatization language: en nav_key: models edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This English Lemmatizer is a scalable, production-ready version of the Rule-based Lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_en_3.4.1_3.0_1646753615538.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_en_3.4.1_3.0_1646753615538.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","en") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["You are not better than me"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","en") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("You are not better than me").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.lemma.spacylookup").predict("""You are not better than me""") ```
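Conceptually, a SpacyLookup-style lemmatizer is a plain token-to-lemma dictionary lookup that falls back to the surface form when a token is unknown. The sketch below illustrates the idea only; the two-entry table is hypothetical and is not the model's actual lookup data (the real model ships a ~427 KB table).

```python
# Tiny illustrative lookup table (hypothetical entries, not the model's data).
LOOKUP = {"are": "be", "better": "well"}

def lemmatize(tokens, table):
    """Return each token's lemma via dictionary lookup,
    keeping the original token when it is not in the table."""
    return [table.get(token.lower(), token) for token in tokens]

print(lemmatize("You are not better than me".split(), LOOKUP))
# ['You', 'be', 'not', 'well', 'than', 'me']
```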
## Results ```bash +------------------------------+ |result | +------------------------------+ |[You, be, not, well, than, me]| +------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma_spacylookup| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|en| |Size:|427.0 KB| --- layout: model title: Legal Fiscal year Clause Binary Classifier author: John Snow Labs name: legclf_fiscal_year_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `fiscal-year` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have large legal documents and want to look for clauses, we recommend splitting them using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `fiscal-year` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_fiscal_year_clause_en_1.0.0_3.2_1660123548661.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_fiscal_year_clause_en_1.0.0_3.2_1660123548661.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_fiscal_year_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[fiscal-year]| |[other]| |[other]| |[fiscal-year]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_fiscal_year_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support fiscal-year 1.00 0.96 0.98 45 other 0.98 1.00 0.99 91 accuracy - - 0.99 136 macro-avg 0.99 0.98 0.98 136 weighted-avg 0.99 0.99 0.99 136 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Bislama author: John Snow Labs name: opus_mt_en_bi date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, bi, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
- source languages: `en` - target languages: `bi` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_bi_xx_2.7.0_2.4_1609170461176.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_bi_xx_2.7.0_2.4_1609170461176.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_bi", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_bi", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.bi').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_bi| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_roberta_twostage_quadruplet_epochs_1_shard_1_squad2.0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_twostage_quadruplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_twostage_quadruplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739695554.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_twostage_quadruplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739695554.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_roberta_twostage_quadruplet_epochs_1_shard_1_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_rule_based_roberta_twostage_quadruplet_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base_rule_based_twostage_quadruplet_epochs_1_shard_1.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_roberta_twostage_quadruplet_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|306.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_twostage_quadruplet_epochs_1_shard_1_squad2.0 --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_512_finetuned_squad_seed_6 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-512-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_512_finetuned_squad_seed_6_en_4.3.0_3.0_1674215648700.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_512_finetuned_squad_seed_6_en_4.3.0_3.0_1674215648700.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_512_finetuned_squad_seed_6","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_512_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_512_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|432.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-512-finetuned-squad-seed-6 --- layout: model title: English DistilBertForMaskedLM Base Uncased model (from Intel) author: John Snow Labs name: distilbert_embeddings_base_uncased_sparse_90_unstructured_pruneofa date: 2022-12-12 tags: [en, open_source, distilbert_embeddings, distilbertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-sparse-90-unstructured-pruneofa` is an English model originally trained by `Intel`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_base_uncased_sparse_90_unstructured_pruneofa_en_4.2.4_3.0_1670864808665.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_base_uncased_sparse_90_unstructured_pruneofa_en_4.2.4_3.0_1670864808665.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_base_uncased_sparse_90_unstructured_pruneofa","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(False) pipeline = Pipeline(stages=[documentAssembler, tokenizer, distilbert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_base_uncased_sparse_90_unstructured_pruneofa","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, distilbert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_base_uncased_sparse_90_unstructured_pruneofa| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|123.6 MB| |Case sensitive:|false| ## References - https://huggingface.co/Intel/distilbert-base-uncased-sparse-90-unstructured-pruneofa - https://arxiv.org/abs/2111.05754 - https://github.com/IntelLabs/Model-Compression-Research-Package/tree/main/research/prune-once-for-all --- layout: model title: English RobertaForQuestionAnswering (from prk) author: John Snow Labs name: roberta_qa_prk_roberta_base_squad2_finetuned_squad date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-finetuned-squad` is an English model originally trained by `prk`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_prk_roberta_base_squad2_finetuned_squad_en_4.0.0_3.0_1655735480993.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_prk_roberta_base_squad2_finetuned_squad_en_4.0.0_3.0_1655735480993.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_prk_roberta_base_squad2_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_prk_roberta_base_squad2_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base.by_prk").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_prk_roberta_base_squad2_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/prk/roberta-base-squad2-finetuned-squad --- layout: model title: Lemmatizer (Hungarian, SpacyLookup) author: John Snow Labs name: lemma_spacylookup date: 2022-03-03 tags: [open_source, lemmatizer, hu] task: Lemmatization language: hu edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Hungarian Lemmatizer is a scalable, production-ready version of the Rule-based Lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_hu_3.4.1_3.0_1646316501230.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_hu_3.4.1_3.0_1646316501230.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","hu") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Nem vagy jobb, mint én"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","hu") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("Nem vagy jobb, mint én").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("hu.lemma").predict("""Nem vagy jobb, mint én""") ```
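The SpacyLookup lemmatizer is dictionary-driven: each token is mapped to its lemma through a lookup table, and tokens not in the table pass through unchanged. A minimal pure-Python sketch of the idea — the two-entry table below is illustrative, not the model's actual dictionary:

```python
# Toy lookup-table lemmatizer: tokens found in the table are mapped to
# their lemma; everything else passes through unchanged.
def lookup_lemmatize(tokens, table):
    return [table.get(token, token) for token in tokens]

# Illustrative entries mirroring the Hungarian example above.
lemma_table = {"vagy": "van", "jobb": "jó"}

tokens = ["Nem", "vagy", "jobb", ",", "mint", "én"]
print(lookup_lemmatize(tokens, lemma_table))
# ['Nem', 'van', 'jó', ',', 'mint', 'én']
```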
## Results ```bash +---------------------------+ |result | +---------------------------+ |[Nem, van, jó, ,, mint, én]| +---------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma_spacylookup| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|hu| |Size:|440.7 KB| --- layout: model title: English T5ForConditionalGeneration Small Cased model (from ml6team) author: John Snow Labs name: t5_keyphrase_generation_small_inspec date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyphrase-generation-t5-small-inspec` is an English model originally trained by `ml6team`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_keyphrase_generation_small_inspec_en_4.3.0_3.0_1675104684365.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_keyphrase_generation_small_inspec_en_4.3.0_3.0_1675104684365.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_keyphrase_generation_small_inspec","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_keyphrase_generation_small_inspec","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_keyphrase_generation_small_inspec| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|280.5 MB| ## References - https://huggingface.co/ml6team/keyphrase-generation-t5-small-inspec - https://dl.acm.org/doi/10.3115/1119355.1119383 - https://paperswithcode.com/sota?task=Keyphrase+Generation&dataset=inspec --- layout: model title: Danish asr_wav2vec2_base_nst TFWav2Vec2ForCTC from Alvenir author: John Snow Labs name: asr_wav2vec2_base_nst date: 2022-09-25 tags: [wav2vec2, da, audio, open_source, asr] task: Automatic Speech Recognition language: da edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_nst` is a Danish model originally trained by Alvenir. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_nst_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_nst_da_4.2.0_3.0_1664103793330.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_nst_da_4.2.0_3.0_1664103793330.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_nst", "da")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_nst", "da") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
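Wav2Vec2ForCTC models emit a per-frame distribution over characters plus a CTC blank symbol, and the transcript in the `text` column comes from CTC decoding: collapse repeated frame predictions, then drop blanks. A minimal greedy-decoding sketch in plain Python — the toy vocabulary and frame ids below are illustrative, not actual model output:

```python
def ctc_greedy_decode(frame_ids, blank=0, id_to_char=None):
    """Collapse repeated frame predictions, then remove blank symbols."""
    out, prev = [], None
    for i in frame_ids:
        # Keep a symbol only when it differs from the previous frame
        # (collapse) and is not the blank.
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return "".join(id_to_char[i] for i in out) if id_to_char else out

# Frame-level argmax ids for the string "hej", with blank id 0 (toy example).
vocab = {1: "h", 2: "e", 3: "j"}
print(ctc_greedy_decode([1, 1, 0, 2, 2, 2, 0, 0, 3], id_to_char=vocab))
# hej
```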
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_nst| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|da| |Size:|354.3 MB| --- layout: model title: Spanish RoBERTa Embeddings (Base) author: John Snow Labs name: roberta_embeddings_roberta_base_bne date: 2022-04-14 tags: [roberta, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-base-bne` is a Spanish model originally trained by `PlanTL-GOB-ES`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_base_bne_es_3.4.2_3.0_1649944766661.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_base_bne_es_3.4.2_3.0_1649944766661.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_base_bne","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_base_bne","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.roberta_base_bne").predict("""Me encanta chispa nlp""") ```
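The `embeddings` column holds one dense vector per token (768 dimensions for a base-size RoBERTa). A common downstream step is comparing tokens by cosine similarity; a minimal sketch with toy low-dimensional vectors standing in for real embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d vectors standing in for 768-d RoBERTa token embeddings.
v_perro = [0.9, 0.1, 0.2]
v_gato = [0.8, 0.2, 0.3]
v_coche = [0.1, 0.9, 0.1]

# Semantically closer tokens should score higher.
print(cosine(v_perro, v_gato) > cosine(v_perro, v_coche))
# True
```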
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_roberta_base_bne| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|297.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne - https://arxiv.org/abs/1907.11692 - http://www.bne.es/en/Inicio/index.html - https://github.com/PlanTL-GOB-ES/lm-spanish - https://arxiv.org/abs/2107.07253 --- layout: model title: Stop Words Cleaner for Bulgarian author: John Snow Labs name: stopwords_bg date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: bg edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, bg] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_bg_bg_2.5.4_2.4_1594742440962.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_bg_bg_2.5.4_2.4_1594742440962.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_bg", "bg") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Освен че е крал на север, Джон Сноу е английски лекар и лидер в развитието на анестезия и медицинска хигиена.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_bg", "bg") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Освен че е крал на север, Джон Сноу е английски лекар и лидер в развитието на анестезия и медицинска хигиена.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Освен че е крал на север, Джон Сноу е английски лекар и лидер в развитието на анестезия и медицинска хигиена."""] stopword_df = nlu.load('bg.stopwords').predict(text) stopword_df[["cleanTokens"]] ```
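Conceptually, the cleaner drops every token that appears in a fixed stop-word list, matching case-insensitively (the model card lists `Case sensitive: false`). A minimal sketch — the tiny Bulgarian list below is illustrative, not the model's full list:

```python
def remove_stopwords(tokens, stopwords):
    # Case-insensitive membership test, mirroring the model's
    # "Case sensitive: false" setting.
    return [t for t in tokens if t.lower() not in stopwords]

stopwords_bg = {"е", "и", "в", "на", "че"}  # illustrative subset
tokens = ["Освен", "че", "е", "крал", "на", "север"]
print(remove_stopwords(tokens, stopwords_bg))
# ['Освен', 'крал', 'север']
```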
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=11, end=14, result='крал', metadata={'sentence': '0'}), Row(annotatorType='token', begin=19, end=23, result='север', metadata={'sentence': '0'}), Row(annotatorType='token', begin=24, end=24, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=26, end=29, result='Джон', metadata={'sentence': '0'}), Row(annotatorType='token', begin=31, end=34, result='Сноу', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_bg| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|bg| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: English XlmRoBertaForQuestionAnswering (from alon-albalak) author: John Snow Labs name: xlm_roberta_qa_xlm_roberta_base_xquad date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-xquad` is an English model originally trained by `alon-albalak`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_xquad_en_4.0.0_3.0_1655991956893.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_xquad_en_4.0.0_3.0_1655991956893.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_base_xquad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_base_xquad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.xquad.xlm_roberta.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
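Internally, extractive question-answering models score every context token as a candidate answer start and end, and the predicted answer is the highest-scoring valid span. A minimal span-selection sketch over toy logits (not actual model scores):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick (start, end) maximizing start+end score, with end >= start."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider spans of bounded length starting at s.
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Toy logits over the context from the example above.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley"]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.0, 0.1, 0.0, 0.3]
end   = [0.0, 0.1, 0.2, 4.8, 0.1, 0.0, 0.0, 0.1, 0.4]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))
# Clara
```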
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_roberta_base_xquad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|847.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/alon-albalak/xlm-roberta-base-xquad - https://github.com/deepmind/xquad --- layout: model title: Arabic BertForTokenClassification Cased model (from abdusah) author: John Snow Labs name: bert_token_classifier_ara_ner date: 2022-11-30 tags: [ar, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: ar edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `arabert-ner` is an Arabic model originally trained by `abdusah`. ## Predicted Entities `PER`, `ORG`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_ara_ner_ar_4.2.4_3.0_1669814157826.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_ara_ner_ar_4.2.4_3.0_1669814157826.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ara_ner","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ara_ner","ar") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
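The `ner` output assigns one tag per token; named entities are recovered by grouping consecutive tags. A minimal grouping sketch, assuming IOB2-style tags over the card's `PER`/`ORG`/`LOC` labels (the example tokens are illustrative, not model output):

```python
def group_entities(tokens, tags):
    """Merge IOB2 token tags into (text, label) entity spans."""
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; flush any span in progress.
            if current:
                spans.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)  # continuation of the current entity
        else:
            # "O" tag or inconsistent continuation: flush and reset.
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

tokens = ["John", "Snow", "visited", "Cairo"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(group_entities(tokens, tags))
# [('John Snow', 'PER'), ('Cairo', 'LOC')]
```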
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ara_ner| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|ar| |Size:|505.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/abdusah/arabert-ner --- layout: model title: Pipeline to Summarize Clinical Notes (PubMed) author: John Snow Labs name: summarizer_biomedical_pubmed_pipeline date: 2023-05-29 tags: [licensed, en, clinical, text_summarization, biomedical] task: Summarization language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [summarizer_biomedical_pubmed](https://nlp.johnsnowlabs.com/2023/04/03/summarizer_biomedical_pubmed_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_biomedical_pubmed_pipeline_en_4.4.2_3.0_1685399522856.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_biomedical_pubmed_pipeline_en_4.4.2_3.0_1685399522856.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("summarizer_biomedical_pubmed_pipeline", "en", "clinical/models") text = """Residual disease after initial surgery for ovarian cancer is the strongest prognostic factor for survival. However, the extent of surgical resection required to achieve optimal cytoreduction is controversial. Our goal was to estimate the effect of aggressive surgical resection on ovarian cancer patient survival.\n A retrospective cohort study of consecutive patients with International Federation of Gynecology and Obstetrics stage IIIC ovarian cancer undergoing primary surgery was conducted between January 1, 1994, and December 31, 1998. The main outcome measures were residual disease after cytoreduction, frequency of radical surgical resection, and 5-year disease-specific survival.\n The study comprised 194 patients, including 144 with carcinomatosis. The mean patient age and follow-up time were 64.4 and 3.5 years, respectively. After surgery, 131 (67.5%) of the 194 patients had less than 1 cm of residual disease (definition of optimal cytoreduction). Considering all patients, residual disease was the only independent predictor of survival; the need to perform radical procedures to achieve optimal cytoreduction was not associated with a decrease in survival. For the subgroup of patients with carcinomatosis, residual disease and the performance of radical surgical procedures were the only independent predictors. Disease-specific survival was markedly improved for patients with carcinomatosis operated on by surgeons who most frequently used radical procedures compared with those least likely to use radical procedures (44% versus 17%, P < .001).\n Overall, residual disease was the only independent predictor of survival. Minimizing residual disease through aggressive surgical resection was beneficial, especially in patients with carcinomatosis. 
""" result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("summarizer_biomedical_pubmed_pipeline", "en", "clinical/models") val text = """Residual disease after initial surgery for ovarian cancer is the strongest prognostic factor for survival. However, the extent of surgical resection required to achieve optimal cytoreduction is controversial. Our goal was to estimate the effect of aggressive surgical resection on ovarian cancer patient survival.\n A retrospective cohort study of consecutive patients with International Federation of Gynecology and Obstetrics stage IIIC ovarian cancer undergoing primary surgery was conducted between January 1, 1994, and December 31, 1998. The main outcome measures were residual disease after cytoreduction, frequency of radical surgical resection, and 5-year disease-specific survival.\n The study comprised 194 patients, including 144 with carcinomatosis. The mean patient age and follow-up time were 64.4 and 3.5 years, respectively. After surgery, 131 (67.5%) of the 194 patients had less than 1 cm of residual disease (definition of optimal cytoreduction). Considering all patients, residual disease was the only independent predictor of survival; the need to perform radical procedures to achieve optimal cytoreduction was not associated with a decrease in survival. For the subgroup of patients with carcinomatosis, residual disease and the performance of radical surgical procedures were the only independent predictors. Disease-specific survival was markedly improved for patients with carcinomatosis operated on by surgeons who most frequently used radical procedures compared with those least likely to use radical procedures (44% versus 17%, P < .001).\n Overall, residual disease was the only independent predictor of survival. 
Minimizing residual disease through aggressive surgical resection was beneficial, especially in patients with carcinomatosis. """ val result = pipeline.fullAnnotate(text) ```
## Results ```bash The results of this review suggest that aggressive ovarian cancer surgery is associated with a significant reduction in the risk of recurrence and a reduction in the number of radical versus conservative surgical resections. However, the results of this review are based on only one small trial. Further research is needed to determine the role of aggressive ovarian cancer surgery in women with stage IIIC ovarian cancer. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_biomedical_pubmed_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|936.7 MB| ## Included Models - DocumentAssembler - MedicalSummarizer --- layout: model title: Legal Anti Corruption Laws Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_anti_corruption_laws_bert date: 2023-03-05 tags: [en, legal, classification, clauses, anti_corruption_laws, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Anti_Corruption_Laws` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that this model's embeddings allow up to 512 tokens; if your text is longer, consider splitting it into smaller pieces (see the same tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Anti_Corruption_Laws`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_anti_corruption_laws_bert_en_1.0.0_3.0_1678049923672.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_anti_corruption_laws_bert_en_1.0.0_3.0_1678049923672.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_anti_corruption_laws_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
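The paragraph-splitting-by-multiline technique recommended above can be sketched with a simple regex: split on blank lines, keep the non-empty pieces, and feed each paragraph to the classifier separately (the sample text is illustrative):

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines into candidate clause paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Clause one text.\n\nClause two text.\n\n\nClause three."
print(split_paragraphs(doc))
# ['Clause one text.', 'Clause two text.', 'Clause three.']
```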
## Results ```bash +-------+ |result| +-------+ |[Anti_Corruption_Laws]| |[Other]| |[Other]| |[Anti_Corruption_Laws]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_anti_corruption_laws_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.2 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Anti_Corruption_Laws 1.00 1.00 1.0 11 Other 1.00 1.00 1.0 19 accuracy - - 1.0 30 macro-avg 1.00 1.00 1.0 30 weighted-avg 1.00 1.00 1.0 30 ``` --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_12_h_128 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-12_H-128` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_128_zh_4.2.4_3.0_1670325790515.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_128_zh_4.2.4_3.0_1670325790515.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_128","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_128","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_12_h_128| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|20.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-12_H-128 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_2_h_768 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-2_H-768` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_2_h_768_zh_4.2.4_3.0_1670021642796.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_2_h_768_zh_4.2.4_3.0_1670021642796.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_2_h_768","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_2_h_768","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_2_h_768| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|117.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-2_H-768 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://github.com/dbiir/UER-py/ - https://cloud.tencent.com/ --- layout: model title: Detect Clinical Entities, Assign Assertion and Find Relations author: John Snow Labs name: explain_clinical_doc_era date: 2020-09-30 task: [Named Entity Recognition, Assertion Status, Relation Extraction, Pipeline Healthcare] language: en nav_key: models edition: Healthcare NLP 2.6.0 spark_version: 2.4 tags: [pipeline, en, licensed, clinical] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pretrained pipeline with ``ner_clinical_events``, ``assertion_dl`` and ``re_temporal_events_clinical`` trained with ``embeddings_healthcare_100d``. It will extract clinical entities, assign assertion status and find temporal relationships between clinical entities. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CLINICAL/){:.button.button-orange} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.Pretrained_Clinical_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_era_en_2.5.5_2.4_1597845753750.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_era_en_2.5.5_2.4_1597845753750.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python era_pipeline = PretrainedPipeline('explain_clinical_doc_era', 'en', 'clinical/models') annotations = era_pipeline.fullAnnotate("""She is admitted to The John Hopkins Hospital 2 days ago with a history of gestational diabetes mellitus diagnosed. She denied pain and any headache. She was seen by the endocrinology service and she was discharged on 03/02/2018 on 40 units of insulin glargine, 12 units of insulin lispro, and metformin 1000 mg two times a day. She had close follow-up with endocrinology post discharge. """)[0] annotations.keys() ``` ```scala val era_pipeline = new PretrainedPipeline("explain_clinical_doc_era", "en", "clinical/models") val result = era_pipeline.fullAnnotate("""She is admitted to The John Hopkins Hospital 2 days ago with a history of gestational diabetes mellitus diagnosed. She denied pain and any headache. She was seen by the endocrinology service and she was discharged on 03/02/2018 on 40 units of insulin glargine, 12 units of insulin lispro, and metformin 1000 mg two times a day. She had close follow-up with endocrinology post discharge. """)(0) ``` {:.nlu-block} ```python import nlu nlu.load("en.explain_doc.era").predict("""She is admitted to The John Hopkins Hospital 2 days ago with a history of gestational diabetes mellitus diagnosed. She denied pain and any headache. She was seen by the endocrinology service and she was discharged on 03/02/2018 on 40 units of insulin glargine, 12 units of insulin lispro, and metformin 1000 mg two times a day. She had close follow-up with endocrinology post discharge. """) ```
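The `fullAnnotate` call above returns, per input text, a dict of annotation lists. As a minimal, Spark-free sketch of how that output can be tabulated, the snippet below uses a toy stand-in dict — the real values are Spark NLP `Annotation` objects rather than plain dicts, and the field names shown are assumptions:

```python
# Toy stand-in for a fullAnnotate output; the real entries are Annotation
# objects. Values here mimic two rows of the sample result below.
annotations = {
    "clinical_ner_chunks": [
        {"result": "admitted", "begin": 7, "end": 14, "entity": "OCCURRENCE"},
        {"result": "pain", "begin": 126, "end": 129, "entity": "PROBLEM"},
    ]
}

def to_rows(anns, key="clinical_ner_chunks"):
    # Flatten the chunk annotations into (text, begin, end, label) tuples.
    return [(a["result"], a["begin"], a["end"], a["entity"]) for a in anns[key]]

rows = to_rows(annotations)
print(rows)
```

With real pipeline output, the same flattening gives you a list of tuples that is easy to load into pandas or Spark for inspection.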
{:.h2_title} ## Results The output is a dictionary with the following keys: ``'sentences'``, ``'clinical_ner_tags'``, ``'clinical_ner_chunks_re'``, ``'document'``, ``'clinical_ner_chunks'``, ``'assertion'``, ``'clinical_relations'``, ``'tokens'``, ``'embeddings'``, ``'pos_tags'``, ``'dependencies'``. Here is the result of `clinical_ner_chunks` : ```bash | # | chunks | begin | end | entities | |----|-------------------------------|-------|-----|---------------| | 0 | admitted | 7 | 14 | OCCURRENCE | | 1 | The John Hopkins Hospital | 19 | 43 | CLINICAL_DEPT | | 2 | 2 days ago | 45 | 54 | DATE | | 3 | gestational diabetes mellitus | 74 | 102 | PROBLEM | | 4 | diagnosed | 104 | 112 | OCCURRENCE | | 5 | denied | 119 | 124 | EVIDENTIAL | | 6 | pain | 126 | 129 | PROBLEM | | 7 | any headache | 135 | 146 | PROBLEM | | 8 | seen | 157 | 160 | OCCURRENCE | | 9 | the endocrinology service | 165 | 189 | CLINICAL_DEPT | | 10 | discharged | 203 | 212 | OCCURRENCE | | 11 | 03/02/2018 | 217 | 226 | DATE | | 12 | insulin glargine | 243 | 258 | TREATMENT | | 13 | insulin lispro | 274 | 287 | TREATMENT | | 14 | metformin | 294 | 302 | TREATMENT | | 15 | two times a day | 312 | 326 | FREQUENCY | | 16 | close follow-up | 337 | 351 | OCCURRENCE | | 17 | endocrinology | 358 | 370 | CLINICAL_DEPT | | 18 | discharge | 377 | 385 | OCCURRENCE | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_clinical_doc_era| |Type:|pipeline| |Compatibility:|Healthcare NLP 2.6.0 +| |License:|Licensed| |Edition:|Official| |Language:|[en]| {:.h2_title} ## Included Models - ``ner_clinical_events`` - ``assertion_dl`` - ``re_temporal_events_clinical`` --- layout: model title: Detect Restaurant-related Terminology author: John Snow Labs name: nerdl_restaurant_100d date: 2021-12-31 tags: [ner, restaurant, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.3.4 spark_version: 3.0 supported: true annotator: NerDLModel article_header: 
type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained with `glove_100d` embeddings to detect restaurant-related terminology. ## Predicted Entities `Location`, `Cuisine`, `Amenity`, `Restaurant_Name`, `Dish`, `Rating`, `Hours`, `Price` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_RESTAURANT/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_RESTAURANT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/nerdl_restaurant_100d_en_3.3.4_3.0_1640949258750.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/nerdl_restaurant_100d_en_3.3.4_3.0_1640949258750.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_100d") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nerdl = NerDLModel.pretrained("nerdl_restaurant_100d")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, nerdl, ner_converter]) text = """Hong Kong’s favourite pasta bar also offers one of the most reasonably priced lunch sets in town! With locations spread out all over the territory Sha Tin – Pici’s formidable lunch menu reads like a highlight reel of the restaurant. Choose from starters like the burrata and arugula salad or freshly tossed tuna tartare, and reliable handmade pasta dishes like pappardelle. Finally, round out your effortless Italian meal with a tidy one-pot tiramisu, of course, an espresso to power you through the rest of the day.""" data = spark.createDataFrame([[text]]).toDF("text") result = nlp_pipeline.fit(data).transform(data) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_100d") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val nerdl = NerDLModel.pretrained("nerdl_restaurant_100d") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, nerdl, ner_converter)) val data = Seq("Hong Kong’s favourite pasta bar also offers one of the most reasonably priced lunch sets in town! With locations spread out all over the territory Sha Tin – Pici’s formidable lunch menu reads like a highlight reel of the restaurant. 
Choose from starters like the burrata and arugula salad or freshly tossed tuna tartare, and reliable handmade pasta dishes like pappardelle. Finally, round out your effortless Italian meal with a tidy one-pot tiramisu, of course, an espresso to power you through the rest of the day.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.ner.restaurant").predict("""Hong Kong’s favourite pasta bar also offers one of the most reasonably priced lunch sets in town! With locations spread out all over the territory Sha Tin – Pici’s formidable lunch menu reads like a highlight reel of the restaurant. Choose from starters like the burrata and arugula salad or freshly tossed tuna tartare, and reliable handmade pasta dishes like pappardelle. Finally, round out your effortless Italian meal with a tidy one-pot tiramisu, of course, an espresso to power you through the rest of the day.""") ```
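The `ner_chunk` column produced above pairs each detected chunk with its predicted label. As a plain-Python illustration (independent of Spark NLP), the sketch below groups such (chunk, label) pairs by label, using a few values taken from this model's sample output:

```python
from collections import defaultdict

# A handful of (chunk, label) pairs of the kind NerConverter produces,
# copied from this model's sample results.
pairs = [
    ("favourite", "Rating"),
    ("pasta bar", "Dish"),
    ("burrata", "Dish"),
    ("Italian", "Cuisine"),
]

def group_by_label(chunk_pairs):
    # Collect chunks under their predicted NER label.
    grouped = defaultdict(list)
    for chunk, label in chunk_pairs:
        grouped[label].append(chunk)
    return dict(grouped)

print(group_by_label(pairs))
```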
## Results ```bash +---------------------------+---------------+ |chunk |ner_label | +---------------------------+---------------+ |favourite |Rating | |pasta bar |Dish | |most reasonably |Price | |lunch |Hours | |in town! |Location | |Sha Tin – Pici’s |Restaurant_Name| |burrata |Dish | |arugula salad |Dish | |freshly tossed tuna tartare|Dish | |reliable |Price | |handmade pasta |Dish | |pappardelle |Dish | |effortless |Amenity | |Italian |Cuisine | |tidy one-pot |Amenity | |espresso |Dish | +---------------------------+---------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|nerdl_restaurant_100d| |Type:|ner| |Compatibility:|Spark NLP 3.3.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|14.2 MB| ## Data Source [https://groups.csail.mit.edu/sls/downloads/restaurant/](https://groups.csail.mit.edu/sls/downloads/restaurant/) ## Benchmarking ```bash label precision recall f1-score support B-Amenity 0.77 0.75 0.76 545 B-Cuisine 0.86 0.88 0.87 524 B-Dish 0.84 0.80 0.82 303 B-Hours 0.67 0.72 0.69 197 B-Location 0.89 0.89 0.89 807 B-Price 0.86 0.87 0.86 169 B-Rating 0.87 0.79 0.83 221 B-Restaurant_Name 0.91 0.94 0.92 388 I-Amenity 0.80 0.75 0.77 561 I-Cuisine 0.71 0.71 0.71 135 I-Dish 0.66 0.77 0.71 104 I-Hours 0.87 0.84 0.86 306 I-Location 0.92 0.87 0.89 834 I-Price 0.64 0.81 0.71 52 I-Rating 0.80 0.85 0.82 118 I-Restaurant_Name 0.82 0.89 0.85 359 O 0.95 0.96 0.96 8634 accuracy - - 0.91 14257 macro-avg 0.81 0.83 0.82 14257 weighted-avg 0.91 0.91 0.91 14257 ``` --- layout: model title: Chinese XLMRoBerta Embeddings (from coastalcph) author: John Snow Labs name: xlmroberta_embeddings_fairlex_cail_minilm date: 2022-05-13 tags: [zh, open_source, xlm_roberta, embeddings, fairlex] task: Embeddings language: zh edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: XlmRoBertaEmbeddings article_header: type: cover use_language_switcher: 
"Python-Scala-Java" --- ## Description Pretrained XLMRoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `fairlex-cail-minilm` is a Chinese model originally trained by `coastalcph`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_fairlex_cail_minilm_zh_3.4.4_3.0_1652439739577.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_fairlex_cail_minilm_zh_3.4.4_3.0_1652439739577.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_fairlex_cail_minilm","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_fairlex_cail_minilm","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_embeddings_fairlex_cail_minilm| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|zh| |Size:|403.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/coastalcph/fairlex-cail-minilm - https://coastalcph.github.io - https://github.com/iliaschalkidis - https://twitter.com/KiddoThe2B --- layout: model title: XLNet Embeddings (Large) author: John Snow Labs name: xlnet_large_cased date: 2020-04-28 task: Embeddings language: en nav_key: models edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: XlnetEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Additionally, XLNet employs Transformer-XL as the backbone model, exhibiting excellent performance for language tasks involving long context. Overall, XLNet achieves state-of-the-art (SOTA) results on various downstream language tasks including question answering, natural language inference, sentiment analysis, and document ranking. The details are described in the paper "[​XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237)" {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlnet_large_cased_en_2.5.0_2.4_1588074397954.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlnet_large_cased_en_2.5.0_2.4_1588074397954.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = XlnetEmbeddings.pretrained("xlnet_large_cased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = XlnetEmbeddings.pretrained("xlnet_large_cased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.xlnet_large_cased').predict(text, output_level='token') embeddings_df ```
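Token embeddings like the 1024-dimensional vectors this model produces are typically compared with cosine similarity. A minimal, Spark-free sketch — the 3-dimensional vectors below are hypothetical stand-ins for real XLNet outputs:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Tiny stand-ins for the 1024-d XLNet token vectors (values hypothetical).
vec_love = [0.97, -0.62, 0.45]
vec_nlp = [1.69, -0.86, 0.46]
print(cosine(vec_love, vec_nlp))
```

With the real pipeline, the per-token vectors live in the `embeddings` annotations of the result DataFrame and can be compared the same way.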
{:.h2_title} ## Results ```bash token en_embed_xlnet_large_cased_embeddings I [0.9742076396942139, -0.6181889772415161, 0.45... love [-0.7322277426719666, -1.7254987955093384, -0.... NLP [1.6873085498809814, -0.8617655038833618, 0.46... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlnet_large_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.5.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|1024| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/zihangdai/xlnet](https://github.com/zihangdai/xlnet) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from smonah) author: John Snow Labs name: distilbert_qa_smonah_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `smonah`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_smonah_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772837756.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_smonah_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772837756.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_smonah_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_smonah_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_smonah_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/smonah/distilbert-base-uncased-finetuned-squad --- layout: model title: English BertForQuestionAnswering Cased model (from KFlash) author: John Snow Labs name: bert_qa_kflash_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `KFlash`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_kflash_finetuned_squad_en_4.0.0_3.0_1657186070833.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_kflash_finetuned_squad_en_4.0.0_3.0_1657186070833.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_kflash_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_kflash_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_kflash_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/KFlash/bert-finetuned-squad --- layout: model title: Korean T5ForConditionalGeneration Base Cased model (from KETI-AIR) author: John Snow Labs name: t5_ke_base date: 2023-01-30 tags: [ko, open_source, t5, tensorflow] task: Text Generation language: ko edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ke-t5-base-ko` is a Korean model originally trained by `KETI-AIR`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_ke_base_ko_4.3.0_3.0_1675104551769.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_ke_base_ko_4.3.0_3.0_1675104551769.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_ke_base","ko") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_ke_base","ko") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_ke_base| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ko| |Size:|569.1 MB| ## References - https://huggingface.co/KETI-AIR/ke-t5-base-ko - https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints - https://github.com/AIRC-KETI/ke-t5 - https://aclanthology.org/2021.findings-emnlp.33/ - https://jmlr.org/papers/volume21/20-074/20-074.pdf - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://aclanthology.org/2021.acl-long.330.pdf - https://dl.acm.org/doi/pdf/10.1145/3442188.3445922 - https://www.tensorflow.org/datasets/catalog/c4 - https://mlco2.github.io/impact#compute - https://arxiv.org/abs/1910.09700 --- layout: model title: French CamemBert Embeddings (from lijingxin) author: John Snow Labs name: camembert_embeddings_lijingxin_generic_model_2 date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model-2` is a French model originally trained by `lijingxin`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_lijingxin_generic_model_2_fr_3.4.4_3.0_1653990930875.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_lijingxin_generic_model_2_fr_3.4.4_3.0_1653990930875.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_lijingxin_generic_model_2","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_lijingxin_generic_model_2","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_lijingxin_generic_model_2| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/lijingxin/dummy-model-2 --- layout: model title: Legal Effectiveness Clause Binary Classifier author: John Snow Labs name: legclf_effectiveness_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `effectiveness` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `effectiveness` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_effectiveness_clause_en_1.0.0_3.2_1660123466311.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_effectiveness_clause_en_1.0.0_3.2_1660123466311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_effectiveness_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
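The paragraph splitting recommended in the Description can be sketched in plain Python before feeding texts to the pipeline; `max_chars` is an assumed, crude proxy for the 512-token embedding limit and should be tuned to your data:

```python
def split_paragraphs(text: str, max_chars: int = 2000):
    # Split a long legal document on blank lines so each candidate clause
    # stays comfortably inside the model's 512-token embedding limit.
    # max_chars is a rough character-based proxy for token length.
    parts = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [p[:max_chars] for p in parts]

doc = ("1. Effectiveness. This Agreement shall become effective on the date "
       "first written above.\n\n2. Term. The term of this Agreement is one year.")
for clause in split_paragraphs(doc):
    print(clause)
```

Each resulting piece can then be loaded as one row of the `clause_text` column used above.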
## Results ```bash +-------+ | result| +-------+ |[effectiveness]| |[other]| |[other]| |[effectiveness]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_effectiveness_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support effectiveness 0.94 0.86 0.90 37 other 0.95 0.98 0.96 93 accuracy - - 0.95 130 macro-avg 0.94 0.92 0.93 130 weighted-avg 0.95 0.95 0.95 130 ``` --- layout: model title: Mapping Companies IRS to Edgar Database author: John Snow Labs name: finmapper_edgar_irs date: 2022-08-18 tags: [en, finance, companies, edgar, data, augmentation, irs, licensed] task: Chunk Mapping language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Given an IRS number detected with any NER model, this Chunk Mapper model augments it with information available in the SEC Edgar database. Some of the fields included in this Chunk Mapper are: - Company Name - Sector - Former names - Address, Phone, State - Dates when the company submitted filings - etc. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FIN_LEG_COMPANY_AUGMENTATION/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finmapper_edgar_irs_en_1.0.0_3.2_1660817662889.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finmapper_edgar_irs_en_1.0.0_3.2_1660817662889.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') tokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") embeddings = nlp.WordEmbeddingsModel.pretrained('glove_100d') \ .setInputCols(['document', 'token']) \ .setOutputCol('embeddings') ner_model = nlp.NerDLModel.pretrained("onto_100", "en") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(["CARDINAL"]) CM = finance.ChunkMapperModel()\ .pretrained("finmapper_edgar_irs", "en", "finance/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings") pipeline = nlp.Pipeline().setStages([document_assembler, tokenizer, embeddings, ner_model, ner_converter, CM]) text = ["""873474341 is an American multinational corporation that is engaged in the design, development, manufacturing, and worldwide marketing and sales of footwear, apparel, equipment, accessories, and services"""] test_data = spark.createDataFrame([text]).toDF("text") model = pipeline.fit(test_data) res= model.transform(test_data) ```
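Each annotation in the `mappings` output column carries its value in `result` and the Edgar field name under `metadata['relation']`. A plain-Python sketch (with hypothetical rows, not the actual Spark NLP API) of collapsing those rows into a single relation-to-value dictionary for downstream use:

```python
# Hypothetical, simplified mapping rows mimicking the structure of the output.
rows = [
    {"result": "Masterworks 096, LLC", "metadata": {"relation": "name"}},
    {"result": "NY", "metadata": {"relation": "state_location"}},
]

def collect_relations(rows):
    """Collapse mapping rows into {relation: value}."""
    return {r["metadata"]["relation"]: r["result"] for r in rows}

info = collect_relations(rows)
# info["name"] -> "Masterworks 096, LLC"
```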
## Results ```bash [Row(mappings=[Row(annotatorType='labeled_dependency', begin=0, end=8, result='Masterworks 096, LLC', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'name', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='RETAIL-RETAIL STORES, NEC [5990]', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'sic', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='5990', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'sic_code', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='873474341', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'irs_number', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='1231', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'fiscal_year_end', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='NY', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'state_location', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='DE', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'state_incorporation', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='225 LIBERTY STREET', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'business_street', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='NEW YORK', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'business_city', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='NY', metadata={'sentence': '0', 
'chunk': '0', 'entity': '873474341', 'relation': 'business_state', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='10281', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'business_zip', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='2035185172', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'business_phone', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'former_name', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'former_name_date', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='2022-01-10', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'date', 'all_relations': '2022-04-26:::2021-11-17'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='1894064', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'company_id', 'all_relations': ''}, embeddings=[])])] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finmapper_edgar_irs| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|5.7 MB| ## References Manually scraped Edgar Database --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18) author: John Snow Labs name: distilbert_qa_base_uncased_becas_6 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0
spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-becas-6` is an English model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_6_en_4.3.0_3.0_1672767591313.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_becas_6_en_4.3.0_3.0_1672767591313.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_6","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_becas_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
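For context on what the `answer` column contains: extractive QA models like this one score every context token as a possible answer start and end, then return the span that maximizes the combined score. A toy illustration of that span selection with made-up scores (not the model's actual internals or API):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick (start, end) indices maximizing start + end score, with end >= start."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best[2]:
                best = (i, j, s + end_scores[j])
    return best[:2]

tokens = ["My", "name", "is", "Clara"]
start = [0.1, 0.2, 0.1, 2.5]  # made-up start logits
end = [0.0, 0.1, 0.2, 3.0]    # made-up end logits
i, j = best_span(start, end)
answer = " ".join(tokens[i:j + 1])  # "Clara"
```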
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_becas_6| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Evelyn18/distilbert-base-uncased-becas-6 --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from tli8hf) author: John Snow Labs name: roberta_qa_unqover_base_news date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `unqover-roberta-base-newsqa` is an English model originally trained by `tli8hf`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_unqover_base_news_en_4.3.0_3.0_1674224491216.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_unqover_base_news_en_4.3.0_3.0_1674224491216.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_unqover_base_news","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_unqover_base_news","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_unqover_base_news| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|463.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/tli8hf/unqover-roberta-base-newsqa --- layout: model title: Spanish BertForQuestionAnswering model (from mrm8488) author: John Snow Labs name: bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488 date: 2022-06-03 tags: [es, open_source, question_answering, bert] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-cased-finetuned-spa-squad2-es` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488_es_4.0.0_3.0_1654249712848.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488_es_4.0.0_3.0_1654249712848.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.squadv2.bert.base_cased.by_mrm8488").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|410.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es - https://github.com/dccuchile/beto - https://github.com/google-research/bert/blob/master/multilingual.md - https://github.com/ccasimiro88/TranslateAlignRetrieve - https://twitter.com/mrm8488 - https://github.com/google-research/bert - https://github.com/josecannete/spanish-corpora - https://github.com/dccuchile/beto/blob/master/README.md --- layout: model title: Finnish RobertaForQuestionAnswering (from Gantenbein) author: John Snow Labs name: roberta_qa_ADDI_FI_RoBERTa date: 2022-06-20 tags: [open_source, question_answering, roberta] task: Question Answering language: fi edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ADDI-FI-RoBERTa` is a Finnish model originally trained by `Gantenbein`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ADDI_FI_RoBERTa_fi_4.0.0_3.0_1655726349369.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ADDI_FI_RoBERTa_fi_4.0.0_3.0_1655726349369.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_ADDI_FI_RoBERTa","fi") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_ADDI_FI_RoBERTa","fi") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("fi.answer_question.roberta.fi_tuned.by_Gantenbein").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_ADDI_FI_RoBERTa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|fi| |Size:|421.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Gantenbein/ADDI-FI-RoBERTa --- layout: model title: Chinese Part of Speech Tagger (Large, UPOS, Chinese Wikipedia Texts) author: John Snow Labs name: bert_pos_chinese_roberta_large_upos date: 2022-04-26 tags: [bert, pos, part_of_speech, zh, open_source] task: Part of Speech Tagging language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `chinese-roberta-large-upos` is a Chinese model originally trained by `KoichiYasuoka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_chinese_roberta_large_upos_zh_3.4.2_3.0_1650993118594.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_chinese_roberta_large_upos_zh_3.4.2_3.0_1650993118594.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_chinese_roberta_large_upos","zh") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_chinese_roberta_large_upos","zh") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.pos.chinese_roberta_large_upos").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_chinese_roberta_large_upos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|zh| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/KoichiYasuoka/chinese-roberta-large-upos - https://universaldependencies.org/u/pos/ - https://github.com/KoichiYasuoka/esupar --- layout: model title: Arabic BertForMaskedLM Base Cased model (from aubmindlab) author: John Snow Labs name: bert_embeddings_base_arabertv01 date: 2022-12-02 tags: [ar, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ar edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-arabertv01` is an Arabic model originally trained by `aubmindlab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabertv01_ar_4.2.4_3.0_1670015781720.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabertv01_ar_4.2.4_3.0_1670015781720.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabertv01","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabertv01","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_arabertv01| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|507.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/aubmindlab/bert-base-arabertv01 - https://github.com/google-research/bert - https://arxiv.org/abs/2003.00104 - https://github.com/WissamAntoun/pydata_khobar_meetup - http://alt.qcri.org/farasa/segmenter.html - https://github.com/google-research/bert/blob/master/multilingual.md - https://github.com/elnagara/HARD-Arabic-Dataset - https://www.aclweb.org/anthology/D15-1299 - https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf - https://github.com/mohamedadaly/LABR - http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp - https://github.com/husseinmozannar/SOQAL - https://github.com/aub-mind/arabert/blob/master/AraBERT/README.md - https://arxiv.org/abs/2003.00104v2 - https://archive.org/details/arwiki-20190201 - https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4 - https://www.aclweb.org/anthology/W19-4619 - https://sites.aub.edu.lb/mindlab/ - https://www.yakshof.com/#/ - https://www.behance.net/rahalhabib - https://www.linkedin.com/in/wissam-antoun-622142b4/ - https://twitter.com/wissam_antoun - https://github.com/WissamAntoun - https://www.linkedin.com/in/fadybaly/ - https://twitter.com/fadybaly - https://github.com/fadybaly --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from jamesmarcel) author: John Snow Labs name: xlmroberta_ner_jamesmarcel_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification
article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `jamesmarcel`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jamesmarcel_base_finetuned_panx_de_4.1.0_3.0_1660434088567.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jamesmarcel_base_finetuned_panx_de_4.1.0_3.0_1660434088567.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jamesmarcel_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jamesmarcel_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
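The `NerConverter` stage groups the classifier's token-level B-/I- tags into entity chunks. A plain-Python sketch of that grouping logic (illustrative only, not the actual Spark NLP implementation):

```python
def bio_to_chunks(tokens, tags):
    """Group B-/I- tagged tokens into (entity_type, text) chunks."""
    chunks, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((ctype, " ".join(current)))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" tag ends any open chunk
            if current:
                chunks.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        chunks.append((ctype, " ".join(current)))
    return chunks

bio_to_chunks(["Angela", "Merkel", "besucht", "Berlin"],
              ["B-PER", "I-PER", "O", "B-LOC"])
# -> [("PER", "Angela Merkel"), ("LOC", "Berlin")]
```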
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_jamesmarcel_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/jamesmarcel/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Stop Words Cleaner for Hebrew author: John Snow Labs name: stopwords_he date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: he edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, he] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_he_he_2.5.4_2.4_1594742441877.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_he_he_2.5.4_2.4_1594742441877.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_he", "he") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("מלבד היותו מלך הצפון, ג'ון סנואו הוא רופא אנגלי ומוביל בפיתוח הרדמה והיגיינה רפואית.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_he", "he") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("מלבד היותו מלך הצפון, ג'ון סנואו הוא רופא אנגלי ומוביל בפיתוח הרדמה והיגיינה רפואית.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""מלבד היותו מלך הצפון, ג'ון סנואו הוא רופא אנגלי ומוביל בפיתוח הרדמה והיגיינה רפואית"""] stopword_df = nlu.load('he.stopwords').predict(text) stopword_df[['cleanTokens']] ```
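Conceptually, the cleaner simply filters tokens against a fixed stop-word list. A toy plain-Python illustration (using an English list for readability; the pretrained model ships its own Hebrew list):

```python
# Hypothetical minimal stop-word set for illustration only.
stopwords = {"is", "a", "the", "and", "of"}

def clean_tokens(tokens):
    """Drop tokens that appear in the stop-word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]

clean_tokens(["John", "Snow", "is", "a", "physician"])
# -> ["John", "Snow", "physician"]
```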
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=5, end=9, result='היותו', metadata={'sentence': '0'}), Row(annotatorType='token', begin=11, end=13, result='מלך', metadata={'sentence': '0'}), Row(annotatorType='token', begin=15, end=19, result='הצפון', metadata={'sentence': '0'}), Row(annotatorType='token', begin=20, end=20, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=22, end=25, result="ג'ון", metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_he| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|he| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: English LongformerForQuestionAnswering model (from valhalla) author: John Snow Labs name: longformer_base_base_qa_squad2 date: 2022-06-15 tags: [open_source, longformer, question_answering, en] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: LongformerForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `longformer-base-4096-finetuned-squadv1` is an English model originally trained by `valhalla`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/longformer_base_base_qa_squad2_en_4.0.0_3.0_1655292497074.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/longformer_base_base_qa_squad2_en_4.0.0_3.0_1655292497074.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_base_base_qa_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_base_base_qa_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.longformer.base").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|longformer_base_base_qa_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|550.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References https://huggingface.co/valhalla/longformer-base-4096-finetuned-squadv1 --- layout: model title: Extract Cancer Therapies and Posology Information author: John Snow Labs name: ner_oncology_unspecific_posology_wip date: 2022-09-30 tags: [licensed, clinical, oncology, en, ner, treatment, posology] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts mentions of treatments and posology information using general labels (low granularity). Definitions of Predicted Entities: - `Cancer_Therapy`: Mentions of cancer treatments, including chemotherapy, radiotherapy, surgery and others. - `Posology_Information`: Terms related to the posology of the treatment, including duration, frequency and dosage. 
## Predicted Entities `Cancer_Therapy`, `Posology_Information` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_unspecific_posology_wip_en_4.0.0_3.0_1664563752336.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_unspecific_posology_wip_en_4.0.0_3.0_1664563752336.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_unspecific_posology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. 
She is currently receiving her second cycle of chemotherapy and is in good overall condition."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_unspecific_posology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving her second cycle of chemotherapy and is in good overall condition.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_unspecific_posology_wip").predict("""The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving her second cycle of chemotherapy and is in good overall condition.""") ```
## Results ```bash | chunk | ner_label | |:-----------------|:---------------------| | adriamycin | Cancer_Therapy | | cyclophosphamide | Cancer_Therapy | | 600 mg/m2 | Posology_Information | | over six courses | Posology_Information | | second cycle | Posology_Information | | chemotherapy | Cancer_Therapy | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_unspecific_posology_wip| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|841.3 KB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Posology_Information 1908.0 86.0 439.0 2347.0 0.96 0.81 0.88 Cancer_Therapy 1685.0 94.0 440.0 2125.0 0.95 0.79 0.86 macro_avg 3593.0 180.0 879.0 4472.0 0.95 0.80 0.87 micro_avg NaN NaN NaN NaN 0.95 0.80 0.87 ``` --- layout: model title: English RobertaForQuestionAnswering (from eAsyle) author: John Snow Labs name: roberta_qa_testABSA date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `testABSA` is an English model originally trained by `eAsyle`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_testABSA_en_4.0.0_3.0_1655739870848.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_testABSA_en_4.0.0_3.0_1655739870848.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_testABSA","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_testABSA","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.testabsa.by_eAsyle").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_testABSA| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|426.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/eAsyle/testABSA --- layout: model title: Indonesian T5ForConditionalGeneration Small Cased model (from panggi) author: John Snow Labs name: t5_small_summarization_cased date: 2023-01-31 tags: [id, open_source, t5, tensorflow] task: Text Generation language: id edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-indonesian-summarization-cased` is an Indonesian model originally trained by `panggi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_summarization_cased_id_4.3.0_3.0_1675126407873.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_summarization_cased_id_4.3.0_3.0_1675126407873.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_summarization_cased","id") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_summarization_cased","id") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_summarization_cased| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|id| |Size:|287.9 MB| ## References - https://huggingface.co/panggi/t5-small-indonesian-summarization-cased - https://github.com/kata-ai/indosum --- layout: model title: Finnish asr_wav2vec2_large_uralic_voxpopuli_v2_finnish TFWav2Vec2ForCTC from Finnish-NLP author: John Snow Labs name: asr_wav2vec2_large_uralic_voxpopuli_v2_finnish date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_uralic_voxpopuli_v2_finnish` is a Finnish model originally trained by Finnish-NLP. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_uralic_voxpopuli_v2_finnish_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_uralic_voxpopuli_v2_finnish_fi_4.2.0_3.0_1664037968087.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_uralic_voxpopuli_v2_finnish_fi_4.2.0_3.0_1664037968087.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_uralic_voxpopuli_v2_finnish", "fi")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_uralic_voxpopuli_v2_finnish", "fi") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_uralic_voxpopuli_v2_finnish| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|fi| |Size:|1.2 GB| --- layout: model title: Legal Management contract Document Classifier (Longformer) author: John Snow Labs name: legclf_management_contract date: 2022-10-24 tags: [en, legal, classification, document, agreement, contract, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_management_contract` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `management-contract` or not (Binary Classification). Longformers are limited to 4096 tokens, so only the first 4096 tokens of each document are taken into account. We have found that for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document without any extra leading text, 4096 tokens are enough for Document Classification. If that is not the case for your documents, let us know and we can provide an alternative approach: splitting each document into 4096-token chunks, embedding each chunk, averaging the embeddings, and training on the averaged version, so that the whole document is taken into account. In practice, however, this should rarely be necessary. 
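The chunk-and-average fallback mentioned in the description can be sketched as follows. This is an illustrative sketch only, not Spark NLP code: `average_chunk_embeddings` and `embed_window` are hypothetical names, and `embed_window` stands in for a real Longformer encoder call over a single 4096-token window.

```python
# Hypothetical sketch of the chunk-and-average approach for documents
# longer than 4096 tokens: embed each 4096-token window separately,
# then average the window vectors into one document embedding.
# `embed_window` is a placeholder, not a real Spark NLP / Longformer API.

def average_chunk_embeddings(tokens, embed_window, window=4096):
    # Split the token sequence into consecutive windows of <= `window` tokens.
    chunks = [tokens[i:i + window] for i in range(0, len(tokens), window)]
    # Embed each window, then average element-wise so every part of the
    # document contributes to the final vector.
    vectors = [embed_window(chunk) for chunk in chunks]
    dim = len(vectors[0])
    return [sum(vec[j] for vec in vectors) / len(vectors) for j in range(dim)]

# Toy embedder for demonstration: a 2-dimensional "embedding".
toy_embedder = lambda chunk: [float(len(chunk)), 1.0]
doc_vector = average_chunk_embeddings(list(range(10000)), toy_embedder)
```

With a 10,000-token input, this produces three windows (4096, 4096, 1808 tokens) whose toy vectors are averaged into one document vector.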
## Predicted Entities `other`, `management-contract` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_management_contract_en_1.0.0_3.0_1666620999192.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_management_contract_en_1.0.0_3.0_1666620999192.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token") \ .setOutputCol("embeddings") sembeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_management_contract", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, sembeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +---------------------+ | result| +---------------------+ |[management-contract]| | [other]| | [other]| |[management-contract]| +---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_management_contract| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support management-contract 0.97 0.94 0.95 31 other 0.96 0.98 0.97 52 accuracy - - 0.96 83 macro-avg 0.96 0.96 0.96 83 weighted-avg 0.96 0.96 0.96 83 ``` --- layout: model title: Fast Neural Machine Translation Model from Argentine Sign Language to Spanish author: John Snow Labs name: opus_mt_aed_es date: 2021-06-01 tags: [open_source, seq2seq, translation, aed, es, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. 
source languages: aed target languages: es {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_aed_es_xx_3.1.0_2.4_1622560123437.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_aed_es_xx_3.1.0_2.4_1622560123437.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_aed_es", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) data = spark.createDataFrame([["text to translate"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_aed_es", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Argentine Sign Language.translate_to.Spanish').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_aed_es| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Chinese Bert Embeddings (Large) author: John Snow Labs name: bert_embeddings_uer_large date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `uer_large` is a Chinese model originally trained by `junnyu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_uer_large_zh_3.4.2_3.0_1649669433036.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_uer_large_zh_3.4.2_3.0_1649669433036.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_uer_large","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_uer_large","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.uer_large").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_uer_large| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|1.2 GB| |Case sensitive:|true| ## References - https://huggingface.co/junnyu/uer_large - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://share.weiyun.com/5G90sMJ --- layout: model title: Legal Severance Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_severance_agreement date: 2022-12-06 tags: [en, legal, classification, agreement, severance, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_severance_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `severance-agreement` or not (Binary Classification). Longformers are limited to 4096 tokens, so only the first 4096 tokens of each document are taken into account. We have found that for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document without any extra leading text, 4096 tokens are enough for Document Classification. If that is not the case for your documents, let us know and we can provide an alternative approach: splitting each document into 4096-token chunks, embedding each chunk, averaging the embeddings, and training on the averaged version, so that the whole document is taken into account. In practice, however, this should rarely be necessary. 
## Predicted Entities `severance-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_severance_agreement_en_1.0.0_3.0_1670358097286.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_severance_agreement_en_1.0.0_3.0_1670358097286.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_severance_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
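The `SentenceEmbeddings` stage above, configured with `setPoolingStrategy("AVERAGE")`, collapses the per-token Longformer embeddings into a single document vector. Conceptually, that pooling step works as in this sketch (illustrative only, not the Spark NLP implementation; plain Python lists stand in for annotation embeddings):

```python
# Illustrative average pooling, as performed conceptually by the
# SentenceEmbeddings stage with setPoolingStrategy("AVERAGE"):
# every token embedding contributes equally to one document vector.
def average_pool(token_embeddings):
    dim = len(token_embeddings[0])
    n = len(token_embeddings)
    return [sum(vec[j] for vec in token_embeddings) / n for j in range(dim)]

doc_vector = average_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(doc_vector)  # [3.0, 4.0]
```

The resulting fixed-size vector is what the `ClassifierDLModel` stage consumes, regardless of document length.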
## Results ```bash +---------------------+ | result| +---------------------+ |[severance-agreement]| | [other]| | [other]| |[severance-agreement]| +---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_severance_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 1.00 0.99 1.00 111 severance-agreement 0.98 1.00 0.99 57 accuracy - - 0.99 168 macro-avg 0.99 1.00 0.99 168 weighted-avg 0.99 0.99 0.99 168 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from manishiitg) author: John Snow Labs name: roberta_qa_distilrobert_base_squadv2_328seq_128stride_test date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilrobert-base-squadv2-328seq-128stride-test` is an English model originally trained by `manishiitg`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_distilrobert_base_squadv2_328seq_128stride_test_en_4.3.0_3.0_1674210316686.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_distilrobert_base_squadv2_328seq_128stride_test_en_4.3.0_3.0_1674210316686.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_distilrobert_base_squadv2_328seq_128stride_test","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_distilrobert_base_squadv2_328seq_128stride_test","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_distilrobert_base_squadv2_328seq_128stride_test| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|307.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/manishiitg/distilrobert-base-squadv2-328seq-128stride-test --- layout: model title: English RobertaForQuestionAnswering (from mrm8488) author: John Snow Labs name: roberta_qa_roberta_base_1B_1_finetuned_squadv2 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-1B-1-finetuned-squadv2` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_1B_1_finetuned_squadv2_en_4.0.0_3.0_1655729733962.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_1B_1_finetuned_squadv2_en_4.0.0_3.0_1655729733962.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_1B_1_finetuned_squadv2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_1B_1_finetuned_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base_v2").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_1B_1_finetuned_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|447.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/roberta-base-1B-1-finetuned-squadv2 - https://rajpurkar.github.io/SQuAD-explorer/explore/v2.0/dev/ - https://twitter.com/mrm8488 - https://www.linkedin.com/in/manuel-romero-cs/ --- layout: model title: RoBERTa large model author: John Snow Labs name: roberta_large date: 2021-05-20 tags: [en, english, embeddings, roberta, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in [this paper](https://arxiv.org/abs/1907.11692) and first released in [this repository](https://github.com/pytorch/fairseq/tree/master/examples/roberta). This model is case-sensitive: it makes a difference between english and English. RoBERTa is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with the Masked language modeling (MLM) objective. Taking a sentence, the model randomly masks 15% of the words in the input then runs the entire masked sentence through the model and has to predict the masked words. 
This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence. This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard classifier using the features produced by the RoBERTa model as inputs. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_large_en_3.1.0_2.4_1621523610703.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_large_en_3.1.0_2.4_1621523610703.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = RoBertaEmbeddings.pretrained("roberta_large", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") ``` ```scala val embeddings = RoBertaEmbeddings.pretrained("roberta_large", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.roberta.large").predict("""Put your text here.""") ```
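The masked-language-modeling objective described above can be sketched in a few lines of plain Python. This is an illustrative toy using whole-word masking; the real RoBERTa masking operates on subword token IDs and also includes random-token and keep-as-is replacements:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="<mask>", seed=1):
    # Replace roughly `mask_rate` of the tokens with a mask symbol and
    # remember the originals -- those originals are the prediction targets.
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i in range(len(masked)):
        if rng.random() < mask_rate:
            targets[i] = masked[i]
            masked[i] = mask_token
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens)
print(masked, targets)
```

During pretraining, the model sees only `masked` and is trained to recover every entry in `targets` from the surrounding (bidirectional) context.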
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_large| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[embeddings]| |Language:|en| |Case sensitive:|true| ## Data Source [https://huggingface.co/roberta-large](https://huggingface.co/roberta-large) ## Benchmarking ```bash When fine-tuned on downstream tasks, this model achieves the following results: Glue test results: | Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | |:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:| | | 90.2 | 92.2 | 94.7 | 96.4 | 68.0 | 96.4 | 90.9 | 86.6 | ``` --- layout: model title: Welsh Lemmatizer author: John Snow Labs name: lemma date: 2021-04-02 tags: [cy, open_source, lemmatizer] task: Lemmatization language: cy edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a dictionary-based lemmatizer that assigns all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/TEXT_PREPROCESSING/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TEXT_PREPROCESSING.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_cy_3.0.0_3.0_1617389338320.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_cy_3.0.0_3.0_1617389338320.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma", "cy") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Dywedir yn aml taw rygbi 'r undeb yw mabolgamp genedlaethol Cymru , er mae pêl-droed yn denu mwy o wylwyr i 'r maes ."]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma", "cy") .setInputCols("token") .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Dywedir yn aml taw rygbi 'r undeb yw mabolgamp genedlaethol Cymru , er mae pêl-droed yn denu mwy o wylwyr i 'r maes .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Dywedir yn aml taw rygbi 'r undeb yw mabolgamp genedlaethol Cymru , er mae pêl-droed yn denu mwy o wylwyr i 'r maes ."] lemma_df = nlu.load('cy.lemma').predict(text, output_level = "document") lemma_df.lemma.values[0] ```
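Conceptually, a dictionary-based lemmatizer like this one is a form-to-root lookup over the token stream. A minimal sketch (with a hypothetical handful of Welsh entries, not the model's actual dictionary) looks like:

```python
# Hypothetical miniature form -> lemma table; the shipped model uses a
# far larger dictionary built from Universal Dependencies 2.7 data.
lemma_dict = {
    "yw": "bod",
    "mae": "bod",
    "mwy": "mawr",
    "genedlaethol": "cenedlaethol",
}

def lemmatize(tokens):
    # Unknown forms fall back to the surface form unchanged.
    return [lemma_dict.get(t, t) for t in tokens]

print(lemmatize(["mae", "pêl-droed", "mwy"]))  # ['bod', 'pêl-droed', 'mawr']
```

Because the mapping is many-to-one, inflected forms such as "yw" and "mae" collapse onto the single root "bod", which is exactly the behavior visible in the results below.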
## Results ```bash +------------+ | lemma| +------------+ | Dywedir| | yn| | aml| | taw| | rygbi| | '| | r| | undeb| | bod| | mabolgamp| |cenedlaethol| | Cymru| | ,| | er| | bod| | pêl-droed| | yn| | denu| | mawr| | o| +------------+ only showing top 20 rows ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|cy| ## Data Source The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) version 2.7. ## Benchmarking ```bash Precision=0.74, Recall=0.71, F1-score=0.72 ``` --- layout: model title: Stopwords Remover for Hebrew language (226 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, he, open_source] task: Stop Words Removal language: he edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_he_3.4.1_3.0_1646673023597.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_he_3.4.1_3.0_1646673023597.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","he") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["אתה לא יותר טוב ממני"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","he") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("אתה לא יותר טוב ממני").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("he.stopwords").predict("""אתה לא יותר טוב ממני""")
```
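Under the hood, stop-word removal is set-membership filtering over the token stream. A minimal sketch (using a tiny hypothetical subset of the Hebrew list, not the packaged 226-entry resource) looks like:

```python
# Tiny illustrative subset -- the packaged model ships the full
# 226-entry stopwords-iso list for Hebrew.
stopwords = {"אתה", "לא", "יותר"}

def remove_stopwords(tokens):
    # Keep only tokens that are not in the stop-word set.
    return [t for t in tokens if t not in stopwords]

print(remove_stopwords("אתה לא יותר טוב ממני".split()))  # ['טוב', 'ממני']
```

This mirrors the pipeline above: the tokenizer splits the sentence, and `StopWordsCleaner` drops every token found in the list, leaving the content words shown in the results below.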
## Results ```bash +-----------+ |result | +-----------+ |[טוב, ממני]| +-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|he| |Size:|2.1 KB| --- layout: model title: Romanian T5ForConditionalGeneration Small Cased model (from BlackKakapo) author: John Snow Labs name: t5_small_grammar date: 2023-01-31 tags: [ro, open_source, t5, tensorflow] task: Text Generation language: ro edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-grammar-ro` is a Romanian model originally trained by `BlackKakapo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_grammar_ro_4.3.0_3.0_1675126317557.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_grammar_ro_4.3.0_3.0_1675126317557.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_grammar","ro") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_grammar","ro") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_grammar| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ro| |Size:|284.0 MB| ## References - https://huggingface.co/BlackKakapo/t5-small-grammar-ro --- layout: model title: Recognize Entities OntoNotes - BERT Medium author: John Snow Labs name: onto_recognize_entities_bert_medium date: 2020-12-09 task: [Named Entity Recognition, Sentence Detection, Embeddings, Pipeline Public] language: en nav_key: models edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, en, pipeline] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pre-trained pipeline containing an NerDL model trained on OntoNotes 5.0 with `small_bert_L8_512` embeddings. It can extract the following 18 entities: ## Predicted Entities `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_medium_en_2.7.0_2.4_1607510751761.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_medium_en_2.7.0_2.4_1607510751761.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('onto_recognize_entities_bert_medium') result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("onto_recognize_entities_bert_medium") val result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.") ``` {:.nlu-block} ```python import nlu text = ["""Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament."""] ner_df = nlu.load('en.ner.onto.bert.medium').predict(text, output_level='chunk') ner_df[["entities", "entities_class"]] ```
{:.h2_title} ## Results ```bash +------------+---------+ |chunk |ner_label| +------------+---------+ |Johnson |PERSON | |first |ORDINAL | |2001 |DATE | |eight years |DATE | |London |GPE | |2008 to 2016|DATE | +------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_recognize_entities_bert_medium| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - Tokenizer - BertEmbeddings - NerDLModel - NerConverter --- layout: model title: Tswana RobertaForMaskedLM Cased model (from MoseliMotsoehli) author: John Snow Labs name: roberta_embeddings_tswanabert date: 2022-12-12 tags: [tn, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: tn edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `TswanaBert` is a Tswana model originally trained by `MoseliMotsoehli`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_tswanabert_tn_4.2.4_3.0_1670858564493.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_tswanabert_tn_4.2.4_3.0_1670858564493.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_tswanabert","tn") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_tswanabert","tn") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_tswanabert| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|tn| |Size:|231.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/MoseliMotsoehli/TswanaBert - https://wortschatz.uni-leipzig.de/en/download - http://doi.org/10.5281/zenodo.3668495 - http://setswana.blogspot.com/ - https://omniglot.com/writing/tswana.php - http://www.dailynews.gov.bw/ - http://www.mmegi.bw/index.php - https://tsena.co.bw - http://www.botswana.co.za/Cultural_Issues-travel/botswana-country-guide-en-route.html - https://www.poemhunter.com/poem/2013-setswana/ - https://www.poemhunter.com/poem/ngwana-wa-mosetsana/ --- layout: model title: English image_classifier_vit_exper2_mesum5 ViTForImageClassification from sudo-s author: John Snow Labs name: image_classifier_vit_exper2_mesum5 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_exper2_mesum5` is an English model originally trained by sudo-s. 
## Predicted Entities `45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper2_mesum5_en_4.1.0_3.0_1660167892531.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper2_mesum5_en_4.1.0_3.0_1660167892531.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_exper2_mesum5", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_exper2_mesum5", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_exper2_mesum5| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.3 MB| --- layout: model title: Stopwords Remover for Latvian language (161 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, lv, open_source] task: Stop Words Removal language: lv edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_lv_3.4.1_3.0_1646673201505.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_lv_3.4.1_3.0_1646673201505.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","lv") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Jūs neesat labāks par mani"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","lv") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Jūs neesat labāks par mani").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("lv.stopwords").predict("""Jūs neesat labāks par mani""") ```
## Results ```bash +---------------------------+ |result | +---------------------------+ |[Jūs, neesat, labāks, mani]| +---------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|lv| |Size:|1.8 KB| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from zshang3) author: John Snow Labs name: distilbert_qa_zshang3_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `zshang3`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_zshang3_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773401928.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_zshang3_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773401928.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_zshang3_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_zshang3_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_zshang3_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/zshang3/distilbert-base-uncased-finetuned-squad --- layout: model title: Detect Biomedical Entities in English author: gokhanturer name: ner_protein_glove date: 2022-02-20 tags: [open_source, ner, glove_100d, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.1.2 spark_version: 3.0 supported: false annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition model that finds `Protein` entities in biomedical texts. ## Predicted Entities `Protein` {:.btn-box} [Open in Colab](https://colab.research.google.com/drive/1npHXVQbqZ5rFOTReG2DjOGuQFR3cX34Q#scrollTo=Lq8fqJfmFY9V){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/gokhanturer/ner_protein_glove_en_3.1.2_3.0_1645385210378.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/gokhanturer/ner_protein_glove_en_3.1.2_3.0_1645385210378.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence = SentenceDetector()\ .setInputCols(['document'])\ .setOutputCol('sentence') token = Tokenizer()\ .setInputCols(['sentence'])\ .setOutputCol('token') glove_embeddings = WordEmbeddingsModel.pretrained()\ .setInputCols(["document", "token"])\ .setOutputCol("embeddings") loaded_ner_model = NerDLModel.pretrained("ner_protein_glove", "en")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_span") ner_prediction_pipeline = Pipeline(stages = [ document, sentence, token, glove_embeddings, loaded_ner_model, converter ]) text = ''' MACROPHAGES ARE MONONUCLEAR phagocytes that reside within almost all tissues including adipose tissue, where they are identifiable as distinct populations with tissue-specific morphology, localization, and function (1). During the process of atherosclerosis, monocytes adhere to the endothelium and migrate into the intima, express scavenger receptors, and bind internalized lipoprotein particles resulting in the formation of foam cells (2). In obesity, adipose tissue contains an increased number of resident macrophages (3, 4). Macrophage accumulation in proportion to adipocyte size may increase the adipose tissue production of proinflammatory and acute-phase molecules and thereby contribute to the pathophysiological consequences of obesity (1, 3). These facts indicate that macrophages play an important role in a variety of diseases. When activated, macrophages release stereotypical profiles of cytokines and biological molecules such as nitric oxide TNF-α, IL-6, and IL-1 (5). TNF-α is a potent chemoattractant (6) and originates predominantly from residing mouse peritoneal macrophages (MPM) and mast cells (7). 
TNF-α induces leukocyte adhesion and degranulation, stimulates nicotinamide adenine dinucleotide phosphate (NADPH) oxidase, and enhances expression of IL-2 receptors and expression of E-selectin and intercellular adhesion molecules on the endothelium (8). TNF-α also stimulates expression of IL-1, IL-2, IL-6, and platelet-activating factor receptor (9). In addition, TNF-α decreases insulin sensitivity and increases lipolysis in adipocytes (10, 11). IL-6 also increase lipolysis and has been implicated in the hypertriglyceridemia and increased serum free fatty acid levels associated with obesity (12). Increased IL-6 signaling induces the expression of C-reactive protein and haptoglubin in liver (13). Recombinant IL-6 treatment increases atherosclerotic lesion size 5-fold (14). IL-6 also dose-dependently increases macrophage oxidative low-density lipoprotein (LDL) degradation and CD36 mRNA expression in vitro (15). These data clearly indicate that IL-6 and TNF-α are important pathogenetic factors associated with obesity, insulin resistance, and atherosclerosis. However, the factors regulating gene expression of these cytokines in macrophages have not been fully clarified. 
''' sample_data = spark.createDataFrame([[text]]).toDF("text") prediction_model = ner_prediction_pipeline.fit(sample_data) preds = prediction_model.transform(sample_data) ``` ```scala val document = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val token = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val glove_embeddings = WordEmbeddingsModel.pretrained() .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val loaded_ner_model = NerDLModel.pretrained("ner_protein_glove", "en") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_span") val ner_prediction_pipeline = new Pipeline().setStages(Array( document, sentence, token, glove_embeddings, loaded_ner_model, converter )) val text =Seq(" MACROPHAGES ARE MONONUCLEAR phagocytes that reside within almost all tissues including adipose tissue, where they are identifiable as distinct populations with tissue-specific morphology, localization, and function (1). During the process of atherosclerosis, monocytes adhere to the endothelium and migrate into the intima, express scavenger receptors, and bind internalized lipoprotein particles resulting in the formation of foam cells (2). In obesity, adipose tissue contains an increased number of resident macrophages (3, 4). Macrophage accumulation in proportion to adipocyte size may increase the adipose tissue production of proinflammatory and acute-phase molecules and thereby contribute to the pathophysiological consequences of obesity (1, 3). These facts indicate that macrophages play an important role in a variety of diseases. When activated, macrophages release stereotypical profiles of cytokines and biological molecules such as nitric oxide TNF-α, IL-6, and IL-1 (5). 
TNF-α is a potent chemoattractant (6) and originates predominantly from residing mouse peritoneal macrophages (MPM) and mast cells (7). TNF-α induces leukocyte adhesion and degranulation, stimulates nicotinamide adenine dinucleotide phosphate (NADPH) oxidase, and enhances expression of IL-2 receptors and expression of E-selectin and intercellular adhesion molecules on the endothelium (8). TNF-α also stimulates expression of IL-1, IL-2, IL-6, and platelet-activating factor receptor (9). In addition, TNF-α decreases insulin sensitivity and increases lipolysis in adipocytes (10, 11). IL-6 also increase lipolysis and has been implicated in the hypertriglyceridemia and increased serum free fatty acid levels associated with obesity (12). Increased IL-6 signaling induces the expression of C-reactive protein and haptoglubin in liver (13). Recombinant IL-6 treatment increases atherosclerotic lesion size 5-fold (14). IL-6 also dose-dependently increases macrophage oxidative low-density lipoprotein (LDL) degradation and CD36 mRNA expression in vitro (15). These data clearly indicate that IL-6 and TNF-α are important pathogenetic factors associated with obesity, insulin resistance, and atherosclerosis. However, the factors regulating gene expression of these cytokines in macrophages have not been fully clarified. ").toDF("text") val preds = ner_prediction_pipeline.fit(text).transform(text) ```
## Results ```bash +--------------+-------+ |chunk |entity | +--------------+-------+ |IL-6 |Protein| |IL-1 (5). |Protein| |TNF-α |Protein| |IL-2 receptors|Protein| |IL-2 |Protein| |IL-6 |Protein| |insulin |Protein| |haptoglubin |Protein| |CD36 |Protein| |IL-6 |Protein| |insulin |Protein| +--------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|Ner_glove_100d| |Type:|ner| |Compatibility:|Spark NLP 3.1.2+| |License:|Open Source| |Edition:|Community| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|14.3 MB| |Dependencies:|glove100d| ## References This model was trained on the following [dataset](https://github.com/gokhanturer/NER_Model_SparkNLP/blob/main/BioNLP09_IOB_train.conll) ## Benchmarking ```bash label precision recall f1-score support B-Protein 0.86 0.87 0.87 3589 I-Protein 0.84 0.86 0.85 4078 O 0.99 0.98 0.98 66957 accuracy - - 0.97 74624 macro-avg 0.90 0.91 0.90 74624 weighted-avg 0.97 0.97 0.97 74624 ``` --- layout: model title: Fast Neural Machine Translation Model from Lingala to English author: John Snow Labs name: opus_mt_ln_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ln, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `ln` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ln_en_xx_2.7.0_2.4_1609170591201.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ln_en_xx_2.7.0_2.4_1609170591201.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ln_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ln_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ln.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ln_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate English to Niuean Pipeline author: John Snow Labs name: translate_en_niu date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, niu, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as GPU is recommended. - source languages: `en` - target languages: `niu` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_niu_xx_2.7.0_2.4_1609686690157.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_niu_xx_2.7.0_2.4_1609686690157.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_niu", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_niu", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.niu').predict(text, output_level='sentence') translate_df ```
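Given the note above that Marian is computationally expensive on longer sequences, one practical mitigation is to annotate a list of sentences in small batches rather than one long string. Below is a minimal sketch; the batch size and the `batches` helper are illustrative, not part of Spark NLP, and `pipeline` is assumed to be the `PretrainedPipeline` loaded as shown.

```python
# Hedged sketch: feed sentences to the translation pipeline in small
# batches, since Marian's cost grows quickly with sequence length.
# The batch size of 2 below is an arbitrary starting point.

def batches(sentences, size=8):
    """Yield successive fixed-size batches from a list of sentences."""
    for i in range(0, len(sentences), size):
        yield sentences[i:i + size]

sentences = [
    "Your first sentence to translate!",
    "Your second sentence to translate!",
    "Your third sentence to translate!",
]
batched = list(batches(sentences, size=2))
# Each batch can then be passed to pipeline.annotate(batch) in turn,
# where `pipeline` is the PretrainedPipeline loaded above.
```

Keeping batches small bounds the memory and latency of each `annotate` call, which matters most when no GPU is available.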
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_niu| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Spanish BertForQuestionAnswering Base Uncased model (from stevemobs) author: John Snow Labs name: bert_qa_base_spanish_wwm_uncased_finetuned_squad date: 2022-07-07 tags: [es, open_source, bert, question_answering] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-uncased-finetuned-squad_es` is a Spanish model originally trained by `stevemobs`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_spanish_wwm_uncased_finetuned_squad_es_4.0.0_3.0_1657183450072.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_spanish_wwm_uncased_finetuned_squad_es_4.0.0_3.0_1657183450072.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_spanish_wwm_uncased_finetuned_squad","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["¿Cuál es mi nombre?", "Mi nombre es Clara y vivo en Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_spanish_wwm_uncased_finetuned_squad","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("¿Cuál es mi nombre?", "Mi nombre es Clara y vivo en Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_spanish_wwm_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|410.3 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/stevemobs/bert-base-spanish-wwm-uncased-finetuned-squad_es --- layout: model title: Spanish BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_es_cased date: 2022-12-02 tags: [es, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: es edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-es-cased` is a Spanish model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_es_cased_es_4.2.4_3.0_1670017490926.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_es_cased_es_4.2.4_3.0_1670017490926.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_es_cased","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_es_cased","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_es_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|399.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-es-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Translate English to Brazilian Sign Language Pipeline author: John Snow Labs name: translate_en_bzs date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, bzs, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as GPU is recommended. 
- source languages: `en` - target languages: `bzs` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_bzs_xx_2.7.0_2.4_1609689757673.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_bzs_xx_2.7.0_2.4_1609689757673.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_bzs", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_bzs", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.bzs').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_bzs| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Part of Speech for Hindi author: John Snow Labs name: pos_ud_hdtb date: 2021-03-09 tags: [part_of_speech, open_source, hindi, pos_ud_hdtb, hi] task: Part of Speech Tagging language: hi edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - DET - PROPN - ADP - ADV - ADJ - NOUN - NUM - AUX - PUNCT - PRON - VERB - CCONJ - PART - SCONJ - X - INTJ {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_hdtb_hi_3.0.0_3.0_1615292181587.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_hdtb_hi_3.0.0_3.0_1615292181587.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_hdtb", "hi") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['जॉन स्नो लैब्स से नमस्ते! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_hdtb", "hi") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("जॉन स्नो लैब्स से नमस्ते! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["जॉन स्नो लैब्स से नमस्ते! "] token_df = nlu.load('hi.pos').predict(text) token_df ```
## Results ```bash token pos 0 जॉन PROPN 1 स्नो PROPN 2 लैब्स PROPN 3 से ADP 4 नमस्ते NOUN 5 ! VERB ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_hdtb| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|hi| --- layout: model title: Pipeline to Summarize Clinical Notes (Augmented) author: John Snow Labs name: summarizer_clinical_jsl_augmented_pipeline date: 2023-05-29 tags: [licensed, en, clinical, text_summarization, augmented] task: Summarization language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [summarizer_clinical_jsl_augmented](https://nlp.johnsnowlabs.com/2023/03/30/summarizer_clinical_jsl_augmented_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_jsl_augmented_pipeline_en_4.4.2_3.0_1685392918075.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_jsl_augmented_pipeline_en_4.4.2_3.0_1685392918075.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("summarizer_clinical_jsl_augmented_pipeline", "en", "clinical/models") text = """Patient with hypertension, syncope, and spinal stenosis - for recheck. (Medical Transcription Sample Report) SUBJECTIVE: The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS: Reviewed and unchanged from the dictation on 12/03/2003. MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash. """ result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("summarizer_clinical_jsl_augmented_pipeline", "en", "clinical/models") val text = """Patient with hypertension, syncope, and spinal stenosis - for recheck. (Medical Transcription Sample Report) SUBJECTIVE: The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS: Reviewed and unchanged from the dictation on 12/03/2003. MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash. """ val result = pipeline.fullAnnotate(text) ```
## Results ```bash A 78-year-old female with hypertension, syncope, and spinal stenosis returns for a recheck. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. Her medications include Atenolol, Premarin, calcium with vitamin D, multivitamin, aspirin, and TriViFlor. She also has Elocon cream and Synalar cream for rash. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_clinical_jsl_augmented_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|936.7 MB| ## Included Models - DocumentAssembler - MedicalSummarizer --- layout: model title: English DistilBertForTokenClassification Base Uncased model (from sarahmiller137) author: John Snow Labs name: distilbert_token_classifier_base_uncased_ft_conll2003 date: 2023-03-14 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-ft-conll2003` is an English model originally trained by `sarahmiller137`. ## Predicted Entities `PER`, `ORG`, `MISC`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_uncased_ft_conll2003_en_4.3.1_3.0_1678783599401.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_uncased_ft_conll2003_en_4.3.1_3.0_1678783599401.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_uncased_ft_conll2003","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_uncased_ft_conll2003","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_base_uncased_ft_conll2003| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/sarahmiller137/distilbert-base-uncased-ft-conll2003 - https://aclanthology.org/W03-0419 - https://paperswithcode.com/sota?task=Token+Classification&dataset=conll2003 --- layout: model title: Bangla DistilBERT Embeddings (from neuralspace-reverie) author: John Snow Labs name: distilbert_embeddings_indic_transformers_bn_distilbert date: 2022-04-12 tags: [distilbert, embeddings, bn, open_source] task: Embeddings language: bn edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `indic-transformers-bn-distilbert` is a Bangla model originally trained by `neuralspace-reverie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_indic_transformers_bn_distilbert_bn_3.4.2_3.0_1649783483143.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_indic_transformers_bn_distilbert_bn_3.4.2_3.0_1649783483143.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_indic_transformers_bn_distilbert","bn") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["আমি স্পার্ক এনএলপি ভালোবাসি"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_indic_transformers_bn_distilbert","bn") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("আমি স্পার্ক এনএলপি ভালোবাসি").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("bn.embed.indic_transformers_bn_distilbert").predict("""আমি স্পার্ক এনএলপি ভালোবাসি""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_indic_transformers_bn_distilbert| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|bn| |Size:|248.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-bn-distilbert - https://oscar-corpus.com/ --- layout: model title: Legal Death Clause Binary Classifier author: John Snow Labs name: legclf_death_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `death` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
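As a minimal sketch of the paragraph-splitting approach mentioned above: the helper names below are illustrative (not part of Spark NLP), and a whitespace token count is only a rough proxy for the model's real tokenizer.

```python
# Hedged sketch: split a long legal document into paragraphs by blank
# lines, and use a rough whitespace-token count to drop pieces that may
# exceed the 512-token embedding limit noted above.

MAX_TOKENS = 512  # limit of the sentence embeddings behind this model

def split_paragraphs(document):
    """Split on blank lines, dropping empty fragments."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]

def within_limit(paragraph, max_tokens=MAX_TOKENS):
    """Rough check: whitespace tokens as a proxy for model tokens."""
    return len(paragraph.split()) <= max_tokens

doc = "First clause paragraph.\n\nSecond clause paragraph."
paragraphs = [p for p in split_paragraphs(doc) if within_limit(p)]
# Each surviving paragraph becomes one row of the "clause_text" column
# fed to the classifier pipeline.
```

A real tokenizer will usually produce more tokens than a whitespace split, so a safety margin below 512 is prudent.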
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `death` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_death_clause_en_1.0.0_3.2_1660122314869.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_death_clause_en_1.0.0_3.2_1660122314869.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_death_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[death]| |[other]| |[other]| |[death]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_death_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support death 0.98 0.95 0.96 43 other 0.98 0.99 0.99 112 accuracy - - 0.98 155 macro-avg 0.98 0.97 0.98 155 weighted-avg 0.98 0.98 0.98 155 ``` --- layout: model title: Translate English to Atlantic-Congo languages Pipeline author: John Snow Labs name: translate_en_alv date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, alv, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as GPU is recommended. 
- source languages: `en` - target languages: `alv` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_alv_xx_2.7.0_2.4_1609691115946.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_alv_xx_2.7.0_2.4_1609691115946.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_alv", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_alv", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.alv').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_alv| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Recognize Entities OntoNotes - BERT Mini author: John Snow Labs name: onto_recognize_entities_bert_mini date: 2020-12-09 task: [Named Entity Recognition, Sentence Detection, Embeddings, Pipeline Public] language: en nav_key: models edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [en, pipeline, open_source] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pre-trained pipeline containing an NerDLModel trained on OntoNotes 5.0 with `small_bert_L4_256` embeddings. It can extract the following 18 entities: ## Predicted Entities `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_mini_en_2.7.0_2.4_1607510918214.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_mini_en_2.7.0_2.4_1607510918214.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('onto_recognize_entities_bert_mini') result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("onto_recognize_entities_bert_mini") val result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.") ``` {:.nlu-block} ```python import nlu text = ["""Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament."""] ner_df = nlu.load('en.ner.onto.bert.mini').predict(text, output_level='chunk') ner_df[["entities", "entities_class"]] ```
{:.h2_title} ## Results ```bash +------------+---------+ |chunk |ner_label| +------------+---------+ |Johnson |PERSON | |first |ORDINAL | |2001 |DATE | |Parliament |ORG | |eight years |DATE | |London |GPE | |2008 to 2016|DATE | +------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_recognize_entities_bert_mini| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - Tokenizer - BertEmbeddings - NerDLModel - NerConverter --- layout: model title: Legal Leather And Textile Industries Document Classifier (EURLEX) author: John Snow Labs name: legclf_leather_and_textile_industries_bert date: 2023-03-06 tags: [en, legal, classification, clauses, leather_and_textile_industries, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_leather_and_textile_industries_bert model, a BERT Sentence Embeddings Document Classifier, determines whether the document belongs to the class Leather_and_Textile_Industries or not (Binary Classification) according to EuroVoc labels. 
## Predicted Entities `Leather_and_Textile_Industries`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_leather_and_textile_industries_bert_en_1.0.0_3.0_1678111851690.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_leather_and_textile_industries_bert_en_1.0.0_3.0_1678111851690.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_leather_and_textile_industries_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Leather_and_Textile_Industries]| |[Other]| |[Other]| |[Leather_and_Textile_Industries]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_leather_and_textile_industries_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Leather_and_Textile_Industries 0.88 0.95 0.91 135 Other 0.95 0.90 0.92 162 accuracy - - 0.92 297 macro-avg 0.92 0.92 0.92 297 weighted-avg 0.92 0.92 0.92 297 ``` --- layout: model title: Fast Neural Machine Translation Model from Welsh to English author: John Snow Labs name: opus_mt_cy_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, cy, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `cy` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_cy_en_xx_2.7.0_2.4_1609164073979.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_cy_en_xx_2.7.0_2.4_1609164073979.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_cy_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_cy_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.cy.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_cy_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Zulu RobertaForMaskedLM Cased model (from MoseliMotsoehli) author: John Snow Labs name: roberta_embeddings_zuberta date: 2022-12-12 tags: [zu, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: zu edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `zuBERTa` is a Zulu model originally trained by `MoseliMotsoehli`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_zuberta_zu_4.2.4_3.0_1670859948117.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_zuberta_zu_4.2.4_3.0_1670859948117.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_zuberta","zu") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_zuberta","zu") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_zuberta| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|zu| |Size:|256.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/MoseliMotsoehli/zuBERTa - https://wortschatz.uni-leipzig.de/en/download - https://zu.wikipedia.org/wiki/Special:AllPages --- layout: model title: Detect Disease Mentions in Spanish (Tweets) author: John Snow Labs name: disease_mentions_tweet date: 2022-08-14 tags: [es, clinical, licensed, public_health, ner, disease, tweet] task: Named Entity Recognition language: es edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true recommended: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Named Entity Recognition model is intended for detecting disease mentions in Spanish tweets and was trained using the MedicalNerApproach annotator, which allows training generic NER models based on Neural Networks. The model detects ENFERMEDAD. ## Predicted Entities `ENFERMEDAD` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_NER_DISEASE_ES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4TC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/disease_mentions_tweet_es_4.0.2_3.0_1660443994563.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/disease_mentions_tweet_es_4.0.2_3.0_1660443994563.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_scielo_300d","es","clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained('disease_mentions_tweet', "es", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["""El diagnóstico fueron varios. Principal: Neumonía en el pulmón derecho. Sinusitis de caballo, Faringitis aguda e infección de orina, también elevada. Gripe No. 
Estuvo hablando conmigo, sin exagerar, mas de media hora, dándome ánimo y fuerza y que sabe, porque ha visto"""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_scielo_300d","es","clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("disease_mentions_tweet", "es", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documenter, sentenceDetector, tokenizer, word_embeddings, ner_model, ner_converter)) val data = Seq(Array("""El diagnóstico fueron varios. Principal: Neumonía en el pulmón derecho. Sinusitis de caballo, Faringitis aguda e infección de orina, también elevada. Gripe No. Estuvo hablando conmigo, sin exagerar, mas de media hora, dándome ánimo y fuerza y que sabe, porque ha visto""")).toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.tweet_disease_mention").predict("""El diagnóstico fueron varios. Principal: Neumonía en el pulmón derecho. Sinusitis de caballo, Faringitis aguda e infección de orina, también elevada. Gripe No. Estuvo hablando conmigo, sin exagerar, mas de media hora, dándome ánimo y fuerza y que sabe, porque ha visto""") ```
## Results ```bash +---------------------+----------+ |chunk |ner_label | +---------------------+----------+ |Neumonía en el pulmón|ENFERMEDAD| |Sinusitis de caballo |ENFERMEDAD| |Faringitis aguda |ENFERMEDAD| |infección de orina |ENFERMEDAD| |Gripe |ENFERMEDAD| +---------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|disease_mentions_tweet| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|15.2 MB| ## References The dataset is Covid-19-specific and consists of tweets collected via a series of keywords associated with that disease. ## Benchmarking ```bash label precision recall f1-score support B-ENFERMEDAD 0.94 0.96 0.95 4243 I-ENFERMEDAD 0.83 0.77 0.80 1570 micro-avg 0.91 0.91 0.91 5813 macro-avg 0.88 0.87 0.87 5813 weighted-avg 0.91 0.91 0.91 5813 ``` --- layout: model title: Word2Vec Embeddings in Slovak (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, sk, open_source] task: Embeddings language: sk edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sk_3.4.1_3.0_1647457801748.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sk_3.4.1_3.0_1647457801748.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sk") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Milujem iskru nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sk") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Milujem iskru nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("sk.embed.w2v_cc_300d").predict("""Milujem iskru nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|sk| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: ALBERT Base CoNNL-03 NER Pipeline author: ahmedlone127 name: albert_base_token_classifier_conll03_pipeline date: 2022-06-14 tags: [open_source, ner, token_classifier, albert, conll03, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: false article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [albert_base_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/26/albert_base_token_classifier_conll03_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/albert_base_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655219065531.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/albert_base_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655219065531.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("albert_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("albert_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PER | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_base_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Community| |Language:|en| |Size:|43.1 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - AlbertForTokenClassification - NerConverter - Finisher --- layout: model title: Legal Expenses Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_expenses_bert date: 2023-03-05 tags: [en, legal, classification, clauses, expenses, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Expenses` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. 
If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Expenses`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_expenses_bert_en_1.0.0_3.0_1678050658714.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_expenses_bert_en_1.0.0_3.0_1678050658714.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_expenses_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
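The paragraph-level splitting recommended earlier in this card can happen before the pipeline above with a few lines of plain Python. This is a minimal sketch: the helper name `split_paragraphs` and the blank-line heuristic are illustrative stand-ins for the workshop's multiline technique, not part of the Legal NLP API.

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs on runs of blank lines,
    so each provision can be classified separately."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = "Clause 1. Expenses shall be borne by the Company.\n\nClause 2. This Agreement is governed by Delaware law."
paragraphs = split_paragraphs(doc)
print(len(paragraphs))  # 2
```

Each returned paragraph can then be placed in the `text` column of the DataFrame fed to the classifier pipeline above.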
## Results ```bash +-------+ |result| +-------+ |[Expenses]| |[Other]| |[Other]| |[Expenses]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_expenses_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Expenses 0.93 0.97 0.95 143 Other 0.98 0.94 0.96 173 accuracy - - 0.95 316 macro-avg 0.95 0.95 0.95 316 weighted-avg 0.95 0.95 0.95 316 ``` --- layout: model title: English image_classifier_vit_pasta_shapes ViTForImageClassification from nateraw author: John Snow Labs name: image_classifier_vit_pasta_shapes date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_pasta_shapes` is an English model originally trained by nateraw. ## Predicted Entities `corgi`, `husky`, `shibu inu` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pasta_shapes_en_4.1.0_3.0_1660170092691.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pasta_shapes_en_4.1.0_3.0_1660170092691.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_pasta_shapes", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_pasta_shapes", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_pasta_shapes| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English BertForTokenClassification Cased model (from yanekyuk) author: John Snow Labs name: bert_token_classifier_cased_keyword_discriminator date: 2022-11-30 tags: [en, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-cased-keyword-discriminator` is an English model originally trained by `yanekyuk`. ## Predicted Entities `CON`, `ENT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_cased_keyword_discriminator_en_4.2.4_3.0_1669815162301.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_cased_keyword_discriminator_en_4.2.4_3.0_1669815162301.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_cased_keyword_discriminator","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_cased_keyword_discriminator","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_cased_keyword_discriminator| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/yanekyuk/bert-cased-keyword-discriminator --- layout: model title: Embeddings Clinical (Large) author: John Snow Labs name: embeddings_clinical_large date: 2023-04-07 tags: [licensed, en, clinical, embeddings] task: Embeddings language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained on a list of clinical and biomedical datasets curated in-house, using the word2vec algorithm. The dataset curation cut-off date is March 2023 and the model is expected to have a better generalization on recent content. The size of the model is around 2 GB and has 200 dimensions. Our benchmark tests indicate that our legacy clinical embeddings (embeddings_clinical) can be replaced with this one while training a new model (existing/previous models will still need to use the legacy embeddings that they're trained with). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/embeddings_clinical_large_en_4.3.2_3.0_1680905541704.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/embeddings_clinical_large_en_4.3.2_3.0_1680905541704.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large","en","clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("word_embeddings") ``` ```scala val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large","en","clinical/models") .setInputCols(Array("document","token")) .setOutputCol("word_embeddings") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|embeddings_clinical_large|
|Type:|embeddings|
|Compatibility:|Healthcare NLP 4.3.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[word_embeddings]|
|Language:|en|
|Size:|2.0 GB|
|Case sensitive:|true|
|Dimension:|200|

---
layout: model
title: Translate English to Irish Pipeline
author: John Snow Labs
name: translate_en_ga
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, ga, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `en`
- target languages: `ga`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ga_xx_2.7.0_2.4_1609700791565.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ga_xx_2.7.0_2.4_1609700791565.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ga", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ga", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ga').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_en_ga|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Legal Indemnity Clause Binary Classifier
author: John Snow Labs
name: legclf_indemnity_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `indemnity` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
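The first splitting technique listed above (paragraph splitting by multiline) can be sketched in plain Python. This is a simplified, self-contained illustration of the idea, not the Spark NLP splitter from the tutorial, and the sample contract text is made up:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs on blank lines (one or more empty lines)."""
    # Normalize Windows line endings, then split on runs of blank lines
    chunks = re.split(r"\n\s*\n", text.replace("\r\n", "\n"))
    # Drop whitespace-only fragments so each item is a usable clause candidate
    return [c.strip() for c in chunks if c.strip()]

contract = """1. INDEMNITY. The Supplier shall indemnify the Buyer against all losses...

2. GOVERNING LAW. This Agreement shall be governed by the laws of..."""

paragraphs = split_paragraphs(contract)
# Each paragraph can then become one row in the "clause_text" column fed to the classifier
```

Each resulting paragraph would be classified independently, giving one True/False (`indemnity`/`other`) prediction per clause candidate.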
## Predicted Entities `other`, `indemnity` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_indemnity_clause_en_1.0.0_3.2_1660122519881.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_indemnity_clause_en_1.0.0_3.2_1660122519881.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_indemnity_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
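The macro-avg and weighted-avg rows reported in the Benchmarking section are plain averages of the per-class scores (unweighted, and support-weighted, respectively). A quick sanity check in plain Python, using the per-class F1 scores and supports reported below:

```python
# Per-class (f1, support) as reported in the Benchmarking section
scores = {"indemnity": (0.94, 25), "other": (0.98, 64)}

f1s = [f1 for f1, _ in scores.values()]
supports = [n for _, n in scores.values()]

macro_f1 = sum(f1s) / len(f1s)                                    # unweighted mean
weighted_f1 = sum(f1 * n for f1, n in scores.values()) / sum(supports)

print(round(macro_f1, 2))     # 0.96
print(round(weighted_f1, 2))  # 0.97
```

Both values match the macro-avg (0.96) and weighted-avg (0.97) F1 rows in the benchmark table.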
## Results

```bash
+-----------+
|     result|
+-----------+
|[indemnity]|
|    [other]|
|    [other]|
|[indemnity]|
+-----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_indemnity_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.8 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
       legal  precision  recall  f1-score  support
   indemnity       0.96    0.92      0.94       25
       other       0.97    0.98      0.98       64
    accuracy          -       -      0.97       89
   macro-avg       0.96    0.95      0.96       89
weighted-avg       0.97    0.97      0.97       89
```

---
layout: model
title: Pipeline to Detect biological concepts (biobert)
author: John Snow Labs
name: ner_bionlp_biobert_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [ner_bionlp_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_bionlp_biobert_en.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_bionlp_biobert_pipeline_en_3.4.1_3.0_1647871746678.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_bionlp_biobert_pipeline_en_3.4.1_3.0_1647871746678.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_bionlp_biobert_pipeline", "en", "clinical/models") pipeline.annotate("The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.") ``` ```scala val pipeline = new PretrainedPipeline("ner_bionlp_biobert_pipeline", "en", "clinical/models") pipeline.annotate("The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. 
Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.bionlp_biobert.pipeline").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.""") ```
## Results

```bash
+-----------------------------------+--------------------+
|chunk                              |ner_label           |
+-----------------------------------+--------------------+
|human                              |Organism            |
|KCNJ9                              |Gene_or_gene_product|
|Kir 3.3                            |Gene_or_gene_product|
|GIRK3                              |Gene_or_gene_product|
|rectifying potassium (GIRK) channel|Gene_or_gene_product|
|KCNJ9                              |Gene_or_gene_product|
|chromosome 1q21-23                 |Cellular_component  |
|humantissues                       |Organism            |
|pancreas                           |Organ               |
|tissues                            |Tissue              |
|fat andskeletal muscle             |Tissue              |
|KCNJ9                              |Gene_or_gene_product|
|KCNJ9                              |Gene_or_gene_product|
+-----------------------------------+--------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_bionlp_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|422.0 MB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverter

---
layout: model
title: BERT Embeddings trained on MEDLINE/PubMed
author: John Snow Labs
name: bert_pubmed
date: 2021-08-30
tags: [en, bert_embeddings, medline_pubmed_dataset, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.2.0
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model uses a BERT base architecture[1] pretrained from scratch on MEDLINE/PubMed. It is a BERT base architecture, but some changes have been made to the original training and export scheme, based on more recent learnings, that improve its accuracy over the original BERT base checkpoint.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pubmed_en_3.2.0_3.0_1630316760578.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pubmed_en_3.2.0_3.0_1630316760578.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_pubmed", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_pubmed", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.pubmed').predict(text, output_level='token') embeddings_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pubmed| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Case sensitive:|false| ## Data Source [1]: [MEDLINE/PubMed dataset](https://www.nlm.nih.gov/databases/download/pubmed_medline.html) This Model has been imported from: https://tfhub.dev/google/experts/bert/pubmed/2 --- layout: model title: Word Embeddings for Bengali (bengali_cc_300d) author: John Snow Labs name: bengali_cc_300d date: 2021-02-10 task: Embeddings language: bn edition: Spark NLP 2.7.3 spark_version: 2.4 tags: [open_source, bn, embeddings] supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained on Common Crawl and Wikipedia using fastText. It is trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives. The model gives 300 dimensional vector outputs per token. The output vectors map words into a meaningful space where the distance between the vectors is related to semantic similarity of words. These embeddings can be used in multiple tasks like semantic word similarity, named entity recognition, sentiment analysis, and classification. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bengali_cc_300d_bn_2.7.3_2.4_1612956925175.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bengali_cc_300d_bn_2.7.3_2.4_1612956925175.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = WordEmbeddingsModel.pretrained("bengali_cc_300d", "bn") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("bn.embed").predict("""Put your text here.""") ```
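The claim above — that the distance between output vectors tracks the semantic similarity of words — is usually measured with cosine similarity. A minimal sketch in plain Python; the 3-dimensional vectors here are made up for illustration (the real model produces 300-dimensional token vectors):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings"; with bengali_cc_300d these would be the 300-d token vectors
v_king = [0.9, 0.2, 0.1]
v_queen = [0.85, 0.25, 0.15]
v_banana = [0.1, 0.9, 0.4]

# Semantically related words should score higher than unrelated ones
print(cosine_similarity(v_king, v_queen) > cosine_similarity(v_king, v_banana))  # True
```

In a downstream task (semantic word similarity, NER, sentiment, classification), this geometry is what the model exploits: nearby vectors act as near-synonyms.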
## Results

```bash
The model gives 300 dimensional feature vector output per token.
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bengali_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.7.3+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[word_embeddings]|
|Language:|bn|
|Case sensitive:|false|
|Dimension:|300|

## Data Source

This model is imported from https://fasttext.cc/docs/en/crawl-vectors.html

---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from rowan1224)
author: John Snow Labs
name: distilbert_qa_slp
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-slp` is an English model originally trained by `rowan1224`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_slp_en_4.3.0_3.0_1672774282530.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_slp_en_4.3.0_3.0_1672774282530.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_slp","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_slp","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_slp|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/rowan1224/distilbert-slp

---
layout: model
title: Fast and Accurate Language Identification - 21 Languages (CNN)
author: John Snow Labs
name: ld_wiki_tatoeba_cnn_21
date: 2020-12-05
task: Language Detection
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, language_detection, xx]
supported: true
annotator: LanguageDetectorDL
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Language detection and identification is the task of automatically detecting the language(s) present in a document based on the content of the document. ``LanguageDetectorDL`` is an annotator that detects the language of documents or sentences depending on the ``inputCols``. In addition, ``LanguageDetectorDL`` can accurately detect language from documents with mixed languages by coalescing sentences and selecting the best candidate.

We have designed and developed Deep Learning models using CNNs in TensorFlow/Keras. The model is trained on large datasets such as Wikipedia and Tatoeba, with high accuracy evaluated on the Europarl dataset. The output is a language code in Wiki Code style: [https://en.wikipedia.org/wiki/List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias).

This model can detect the following languages: `Bulgarian`, `Czech`, `Danish`, `German`, `Greek`, `English`, `Estonian`, `Finnish`, `French`, `Hungarian`, `Italian`, `Lithuanian`, `Latvian`, `Dutch`, `Polish`, `Portuguese`, `Romanian`, `Slovak`, `Slovenian`, `Spanish`, `Swedish`.
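The document-level decision described above — coalescing per-sentence predictions and picking the best candidate — amounts to aggregating sentence-level scores and taking the argmax. A sketch of that general idea in plain Python; the scores and the simple averaging rule are illustrative assumptions, not the annotator's actual internals:

```python
def coalesce_language(sentence_scores):
    """Average per-sentence confidence for each language, return the best candidate."""
    totals = {}
    for scores in sentence_scores:
        for lang, p in scores.items():
            totals[lang] = totals.get(lang, 0.0) + p
    n = len(sentence_scores)
    avg = {lang: s / n for lang, s in totals.items()}
    return max(avg, key=avg.get)

# A mixed-language document: two mostly-French sentences, one English sentence
per_sentence = [
    {"fr": 0.95, "en": 0.03},
    {"en": 0.90, "fr": 0.05},
    {"fr": 0.88, "en": 0.07},
]
print(coalesce_language(per_sentence))  # fr
```

Aggregating before deciding is what lets a single stray sentence in another language not flip the document-level prediction.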
## Predicted Entities `bg`, `cs`, `da`, `de`, `el`, `en`, `et`, `fi`, `fr`, `hu`, `it`, `lt`, `lv`, `nl`, `pl`, `pt`, `ro`, `sk`, `sl`, `es`, `sv`. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ld_wiki_tatoeba_cnn_21_xx_2.7.0_2.4_1607177877570.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ld_wiki_tatoeba_cnn_21_xx_2.7.0_2.4_1607177877570.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... language_detector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_21", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("language") languagePipeline = Pipeline(stages=[documentAssembler, sentenceDetector, language_detector]) light_pipeline = LightPipeline(languagePipeline.fit(spark.createDataFrame([['']]).toDF("text"))) result = light_pipeline.fullAnnotate("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.") ``` ```scala ... val languageDetector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_21", "xx") .setInputCols("sentence") .setOutputCol("language") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, languageDetector)) val data = Seq("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala."] lang_df = nlu.load('xx.classify.wiki_21').predict(text, output_level='sentence') lang_df ```
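The precision column in the Benchmarking section below is simply correct / count per source language, and the summary mean is the unweighted average of those 21 values. A quick check in plain Python, using the counts reported in the benchmark:

```python
# (count, correct) per source language, as reported in the Benchmarking section
results = {
    "de": (1000, 1000), "nl": (1000, 1000), "pt": (1000, 1000), "fr": (1000, 1000),
    "es": (1000, 1000), "it": (1000, 1000), "fi": (1000, 1000), "da": (1000, 999),
    "en": (1000, 999),  "sv": (1000, 998),  "el": (1000, 996),  "bg": (1000, 989),
    "pl": (914, 903),   "hu": (880, 867),   "ro": (784, 771),   "lt": (1000, 982),
    "sk": (1000, 976),  "et": (928, 903),   "cs": (1000, 967),  "sl": (914, 875),
    "lv": (916, 869),
}

precision = {lang: correct / count for lang, (count, correct) in results.items()}
mean_precision = sum(precision.values()) / len(precision)

print(round(precision["pl"], 6))  # 0.987965
print(round(mean_precision, 4))   # 0.9877
```

The computed mean (≈0.9877) matches the summary row of the benchmark table.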
## Results ```bash 'fr' ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ld_wiki_tatoeba_cnn_21| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[language]| |Language:|xx| ## Data Source Wikipedia and Tatoeba. ## Benchmarking ```bash Evaluated on Europarl dataset which the model has never seen: +--------+-----+-------+------------------+ |src_lang|count|correct| precision| +--------+-----+-------+------------------+ | de| 1000| 1000| 1.0| | nl| 1000| 1000| 1.0| | pt| 1000| 1000| 1.0| | fr| 1000| 1000| 1.0| | es| 1000| 1000| 1.0| | it| 1000| 1000| 1.0| | fi| 1000| 1000| 1.0| | da| 1000| 999| 0.999| | en| 1000| 999| 0.999| | sv| 1000| 998| 0.998| | el| 1000| 996| 0.996| | bg| 1000| 989| 0.989| | pl| 914| 903|0.9879649890590809| | hu| 880| 867|0.9852272727272727| | ro| 784| 771|0.9834183673469388| | lt| 1000| 982| 0.982| | sk| 1000| 976| 0.976| | et| 928| 903|0.9730603448275862| | cs| 1000| 967| 0.967| | sl| 914| 875|0.9573304157549234| | lv| 916| 869|0.9486899563318777| +--------+-----+-------+------------------+ +-------+--------------------+ |summary| precision| +-------+--------------------+ | count| 21| | mean| 0.9876995879070323| | stddev|0.015446490915012105| | min| 0.9486899563318777| | max| 1.0| +-------+--------------------+ ``` --- layout: model title: German BertForSequenceClassification Base Cased model (from mrm8488) author: John Snow Labs name: bert_sequence_classifier_base_german_dbmdz_cased_finetuned_pawsx date: 2022-07-13 tags: [de, open_source, bert, sequence_classification] task: Text Classification language: de edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`bert-base-german-dbmdz-cased-finetuned-pawsx-de` is a German model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_base_german_dbmdz_cased_finetuned_pawsx_de_4.0.0_3.0_1657720863133.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_base_german_dbmdz_cased_finetuned_pawsx_de_4.0.0_3.0_1657720863133.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

classifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_base_german_dbmdz_cased_finetuned_pawsx","de") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, classifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val classifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_base_german_dbmdz_cased_finetuned_pawsx","de")
    .setInputCols(Array("document", "token"))
    .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, classifier))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_base_german_dbmdz_cased_finetuned_pawsx|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|de|
|Size:|412.8 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/mrm8488/bert-base-german-dbmdz-cased-finetuned-pawsx-de

---
layout: model
title: English XLMRobertaForTokenClassification Base Cased model (from asahi417)
author: John Snow Labs
name: xlmroberta_ner_tner_base_all_english
date: 2022-08-13
tags: [en, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tner-xlm-roberta-base-all-english` is an English model originally trained by `asahi417`.

## Predicted Entities

`time`, `corporation`, `ordinal number`, `cardinal number`, `rna`, `geopolitical area`, `protein`, `product`, `percent`, `dna`, `disease`, `cell line`, `law`, `other`, `date`, `chemical`, `event`, `work of art`, `cell type`, `location`, `language`, `quantity`, `facility`, `organization`, `group`, `money`, `person`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_base_all_english_en_4.1.0_3.0_1660423323063.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_base_all_english_en_4.1.0_3.0_1660423323063.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_base_all_english","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_base_all_english","en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("document", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_tner_base_all_english|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|815.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/asahi417/tner-xlm-roberta-base-all-english
- https://github.com/asahi417/tner

---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18 TFWav2Vec2ForCTC from jhonparra18
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18` is an English model originally trained by jhonparra18.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18_en_4.2.0_3.0_1664019836794.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18_en_4.2.0_3.0_1664019836794.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_guarani_small_by_jhonparra18| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011) author: John Snow Labs name: distilbert_token_classifier_autotrain_company_all_903429548 date: 2023-03-03 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-company_all-903429548` is an English model originally trained by `ismail-lucifer011`. ## Predicted Entities `Company`, `OOV` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_company_all_903429548_en_4.3.0_3.0_1677881042384.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_company_all_903429548_en_4.3.0_3.0_1677881042384.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_company_all_903429548","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_company_all_903429548","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_company_all_903429548| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ismail-lucifer011/autotrain-company_all-903429548 --- layout: model title: English asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken TFWav2Vec2ForCTC from cuzeverynameistaken author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken` is an English model originally trained by cuzeverynameistaken. NOTE: This model only works on a CPU. If you need to run this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken_en_4.2.0_3.0_1664023081884.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken_en_4.2.0_3.0_1664023081884.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
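The snippets above assume an `audioDf` whose `audio_content` column already holds raw audio samples as floats. As a rough, stdlib-only sketch (not part of Spark NLP — the helper name and the 16 kHz mono, normalized-float convention are assumptions based on typical Wav2vec2 usage), this is one way to decode 16-bit PCM WAV bytes into such a float array:

```python
# Sketch (assumption): Wav2Vec2 models typically expect raw 16 kHz mono PCM
# as floats in [-1.0, 1.0]. This stdlib-only helper decodes 16-bit PCM WAV
# bytes into that shape; "wav_to_floats" is an illustrative name, not an API.
import io
import math
import struct
import wave

def wav_to_floats(wav_bytes: bytes) -> list:
    """Decode 16-bit PCM WAV bytes into normalized floats."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        assert wf.getsampwidth() == 2, "expects 16-bit PCM"
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<" + "h" * (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a 0.1 s, 440 Hz test tone in memory to exercise the helper.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    tone = [int(32767 * 0.5 * math.sin(2 * math.pi * 440 * t / 16000))
            for t in range(1600)]
    wf.writeframes(struct.pack("<" + "h" * len(tone), *tone))

floats = wav_to_floats(buf.getvalue())
print(len(floats))
```

The resulting list could then be wrapped into a one-column DataFrame (for example, something like `spark.createDataFrame([(floats,)], ["audio_content"])`) before running the pipeline above.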
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab1_by_cuzeverynameistaken| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: English BertForQuestionAnswering model (from andi611) author: John Snow Labs name: bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_Pistherea_conll2003_with_neg_with_repeat date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-whole-word-masking-squad2-with-ner-Pistherea-conll2003-with-neg-with-repeat` is an English model originally trained by `andi611`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_Pistherea_conll2003_with_neg_with_repeat_en_4.0.0_3.0_1654537307579.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_Pistherea_conll2003_with_neg_with_repeat_en_4.0.0_3.0_1654537307579.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_Pistherea_conll2003_with_neg_with_repeat","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_Pistherea_conll2003_with_neg_with_repeat","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_conll.bert.large_uncased_pistherea.by_andi611").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_Pistherea_conll2003_with_neg_with_repeat| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/andi611/bert-large-uncased-whole-word-masking-squad2-with-ner-Pistherea-conll2003-with-neg-with-repeat --- layout: model title: Sentence Entity Resolver for Billable ICD10-CM HCC Codes (sbiobertresolve_icd10cm_slim_billable_hcc) author: John Snow Labs name: sbiobertresolve_icd10cm_slim_billable_hcc date: 2021-08-26 tags: [en, licensed] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.1.3 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts to ICD-10-CM codes using sentence BioBERT embeddings. In this model, synonyms having low cosine similarity to unnormalized terms are dropped. It also returns the official resolution text within the brackets inside the metadata. The model is augmented with synonyms, and previous augmentations are adjusted according to cosine distances to unnormalized terms (ground truths). ## Predicted Entities Outputs 7-digit billable ICD codes. In the result, look for the aux_label parameter in the metadata to get the HCC status. The HCC status can be split into three parts: billable status, HCC status, and HCC score. For example, in the example shared below, the billable status is 1, the HCC status is 1, and the HCC score is 11.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_slim_billable_hcc_en_3.1.3_2.4_1629971987395.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_slim_billable_hcc_en_3.1.3_2.4_1629971987395.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sbert_embeddings") icd10_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_icd10cm_slim_billable_hcc","en", "clinical/models") \ .setInputCols(["document", "sbert_embeddings"]) \ .setOutputCol("icd10cm_code")\ .setDistanceFunction("EUCLIDEAN")\ .setReturnCosineDistances(True) bert_pipeline_icd = Pipeline(stages = [document_assembler, sbert_embedder, icd10_resolver]) data = spark.createDataFrame([["bladder cancer"]]).toDF("text") results = bert_pipeline_icd.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sbert_embeddings") val icd10_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_icd10cm_slim_billable_hcc","en", "clinical/models") .setInputCols(Array("document", "sbert_embeddings")) .setOutputCol("icd10cm_code") .setDistanceFunction("EUCLIDEAN") .setReturnCosineDistances(true) val bert_pipeline_icd = new Pipeline().setStages(Array(document_assembler, sbert_embedder, icd10_resolver)) val data = Seq("bladder cancer").toDF("text") val result = bert_pipeline_icd.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd10cm.slim_billable_hcc").predict("""bladder cancer""") ```
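Conceptually, the resolver embeds the input chunk with sBioBERT and returns the ICD-10-CM codes whose synonym embeddings lie closest, which is why the pipeline sets `setReturnCosineDistances(True)`. A toy, stdlib-only sketch of that ranking step (the codes are real ICD-10-CM codes from this card's example, but the 3-dimensional vectors are made-up stand-ins, not actual 768-dimensional sBioBERT embeddings):

```python
# Toy sketch of nearest-neighbour entity resolution: embed the query,
# compute cosine distance to each candidate code's embedding, and return
# the codes sorted by distance. Vectors here are illustrative only.
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

candidates = {
    "C671": [0.9, 0.1, 0.0],   # malignant neoplasm of dome of bladder
    "C679": [0.8, 0.3, 0.1],   # malignant neoplasm of bladder, unspecified
    "D090": [0.2, 0.9, 0.3],   # carcinoma in situ of bladder
}
query = [0.85, 0.2, 0.05]      # pretend embedding of "bladder cancer"

ranked = sorted(candidates, key=lambda c: cosine_distance(query, candidates[c]))
print(ranked[0])  # closest code wins
```

The real model performs the same kind of search over a large bank of augmented synonym embeddings, returning all candidate codes together with their distances.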
## Results ```bash | | chunks | code | resolutions | all_codes | billable_hcc_status_score | all_distances | |---:|:---------------|:--------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|-------------------------------------------:|:----------------------------|:---------------------------------------------------------| | 0 | bladder cancer | C671 |[bladder cancer, dome [Malignant neoplasm of dome of bladder], cancer of the urinary bladder [Malignant neoplasm of bladder, unspecified], adenocarcinoma, bladder neck [Malignant neoplasm of bladder neck], cancer in situ of urinary bladder [Carcinoma in situ of bladder], cancer of the urinary bladder, ureteric orifice [Malignant neoplasm of ureteric orifice], tumor of bladder neck [Neoplasm of unspecified behavior of bladder], cancer of the urethra [Malignant neoplasm of urethra]]| [C671, C679, C675, D090, C676, D494, C680] | ['1', '1', '11'] | [0.0685, 0.0709, 0.0963, 0.0978, 0.1068, 0.1080, 0.1211] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icd10cm_slim_billable_hcc| |Compatibility:|Healthcare NLP 3.1.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Case sensitive:|false| --- layout: model title: ICD10CM Puerile Entity Resolver author: John Snow Labs name: chunkresolve_icd10cm_puerile_clinical class: ChunkEntityResolverModel language: en nav_key: models repository: clinical/models date: 2020-04-28 task: Entity Resolution edition: Healthcare NLP 2.4.5 
spark_version: 2.4 tags: [clinical,licensed,entity_resolution,en] deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Entity resolution model based on k-nearest neighbours over word embeddings, using Word Mover's Distance. ## Predicted Entities ICD10-CM Codes and their normalized definition with `clinical_embeddings` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_ICD10_CM.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_puerile_clinical_en_2.4.5_2.4_1588103916781.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_puerile_clinical_en_2.4.5_2.4_1588103916781.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python ... puerile_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_puerile_clinical","en","clinical/models")\ .setInputCols("token","chunk_embeddings")\ .setOutputCol("resolution") pipeline_puerile = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, puerile_resolver]) data = spark.createDataFrame([["""The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion."""]]).toDF("text") model = pipeline_puerile.fit(data) results = model.transform(data) ``` ```scala ... val puerile_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_puerile_clinical","en","clinical/models") .setInputCols(Array("token","chunk_embeddings")) .setOutputCol("resolution") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, puerile_resolver)) val data = Seq("The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. 
She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion.").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.h2_title} ## Results ```bash chunk entity icd10Puerile_description icd10Puerile_code 0 cold Symptom_Name Prolonged pregnancy O481 1 cough Symptom_Name Diseases of the respiratory system complicatin... O9953 2 runny nose Symptom_Name Other specified pregnancy related conditions, ... O26899 3 Mom Gender Classical hydatidiform mole O010 4 she Gender Other vomiting complicating pregnancy O218 5 no Negated Continuing pregnancy after spontaneous abortio... O3111X0 6 fever Symptom_Name Puerperal sepsis O85 7 Her Gender Obesity complicating the puerperium O99215 8 she Gender Other vomiting complicating pregnancy O218 9 spitting up a lot Symptom_Name Eclampsia, unspecified as to time period O159 10 She Gender Other vomiting complicating pregnancy O218 11 no Negated Continuing pregnancy after spontaneous abortio... O3111X0 12 difficulty breathing Symptom_Name Other disorders of lactation O9279 13 her Gender Diseases of the nervous system complicating th... O99355 14 cough Symptom_Name Diseases of the respiratory system complicatin... O9953 15 dry Modifier Cracked nipple associated with lactation O9213 16 hacky Modifier Severe pre-eclampsia, unspecified trimester O1410 17 She Gender Other vomiting complicating pregnancy O218 18 fairly Modifier Maternal care for high head at term, not appli... O324XX0 19 congested Symptom_Name Postpartum thyroiditis O905 20 She Gender Other vomiting complicating pregnancy O218 21 Amoxil Drug_Name Suppressed lactation O925 22 Aldex Drug_Name Severe pre-eclampsia, unspecified trimester O1410 23 her Gender Diseases of the nervous system complicating th... 
O99355 24 Mom Gender Classical hydatidiform mole O010 25 she Gender Other vomiting complicating pregnancy O218 26 She Gender Other vomiting complicating pregnancy O218 27 difficulty breathing Symptom_Name Other disorders of lactation O9279 28 She Gender Other vomiting complicating pregnancy O218 29 congested and her appetite had decreased Symptom_Name Decreased fetal movements, second trimester, n... O368120 30 She Gender Other vomiting complicating pregnancy O218 31 102.6 Temperature Postpartum coagulation defects O723 32 trouble sleeping Symptom_Name Other disorders of lactation O9279 33 secondary to Modifier Unspecified pre-existing hypertension complica... O10919 34 congestion Symptom_Name Viral hepatitis complicating childbirth O9842 ``` {:.model-param} ## Model Information {:.table-model} |----------------|---------------------------------------| | Name: | chunkresolve_icd10cm_puerile_clinical | | Type: | ChunkEntityResolverModel | | Compatibility: | Spark NLP 2.4.5+ | | License: | Licensed | |Edition:|Official| | |Input labels: | [token, chunk_embeddings] | |Output labels: | [entity] | | Language: | en | | Case sensitive: | True | | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained on ICD10CM Dataset Range: O0000-O9989 https://www.icd10data.com/ICD10CM/Codes/O00-O9A --- layout: model title: Finnish BertForMaskedLM Base Uncased model (from TurkuNLP) author: John Snow Labs name: bert_embeddings_base_finnish_uncased_v1 date: 2022-12-02 tags: [fi, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: fi edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-finnish-uncased-v1` is a Finnish model originally trained by `TurkuNLP`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_finnish_uncased_v1_fi_4.2.4_3.0_1670017573662.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_finnish_uncased_v1_fi_4.2.4_3.0_1670017573662.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_finnish_uncased_v1","fi") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_finnish_uncased_v1","fi") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_finnish_uncased_v1| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|fi| |Size:|467.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/TurkuNLP/bert-base-finnish-uncased-v1 - http://dl.turkunlp.org/finbert/bert-base-finnish-cased-v1.zip - http://dl.turkunlp.org/finbert/bert-base-finnish-uncased-v1.zip - https://arxiv.org/abs/1912.07076 - https://github.com/google-research/bert - https://github.com/google-research/bert/blob/master/multilingual.md - https://raw.githubusercontent.com/TurkuNLP/FinBERT/master/img/yle-ylilauta-curves.png - https://fasttext.cc/ - https://github.com/spyysalo/finbert-text-classification - https://github.com/spyysalo/yle-corpus - https://github.com/spyysalo/ylilauta-corpus - https://arxiv.org/abs/1908.04212 - https://github.com/Traubert/FiNer-rules - https://arxiv.org/pdf/1908.04212.pdf - https://github.com/jouniluoma/keras-bert-ner - https://github.com/mpsilfve/finer-data - https://universaldependencies.org/ - https://github.com/spyysalo/bert-pos - http://hdl.handle.net/11234/1-2837 - http://dl.turkunlp.org/finbert/bert-base-finnish-uncased.zip - http://dl.turkunlp.org/finbert/bert-base-finnish-cased.zip --- layout: model title: Legal Rights Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_rights_agreement_bert date: 2022-11-25 tags: [en, legal, classification, agreement, rights, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_rights_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `rights-agreement` 
or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `rights-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_rights_agreement_bert_en_1.0.0_3.0_1669369164449.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_rights_agreement_bert_en_1.0.0_3.0_1669369164449.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_rights_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
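As a sanity check, the macro and support-weighted averages reported in this card's Benchmarking section follow directly from the per-class F1 scores and supports; a quick sketch of the arithmetic (numbers copied from the table):

```python
# Reproduce the macro and support-weighted F1 from the per-class scores
# reported for this classifier (support: other=65, rights-agreement=30).
f1 = {"other": 0.94, "rights-agreement": 0.86}
support = {"other": 65, "rights-agreement": 30}

macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(round(macro_f1, 2), round(weighted_f1, 2))  # matches 0.90 and 0.91
```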
## Results ```bash +-------+ |result| +-------+ |[rights-agreement]| |[other]| |[other]| |[rights-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_rights_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Legal documents, scrapped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.91 0.97 0.94 65 rights-agreement 0.92 0.80 0.86 30 accuracy - - 0.92 95 macro-avg 0.92 0.88 0.90 95 weighted-avg 0.92 0.92 0.91 95 ``` --- layout: model title: Part of Speech for Armenian author: John Snow Labs name: pos_ud_armtdp date: 2021-03-09 tags: [part_of_speech, open_source, armenian, pos_ud_armtdp, hy] task: Part of Speech Tagging language: hy edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. 
## Predicted Entities - PROPN - NOUN - NUM - PUNCT - ADJ - VERB - ADP - ADV - CCONJ - X - AUX - DET - PRON - SCONJ - SYM - PART - INTJ {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_armtdp_hy_3.0.0_3.0_1615292139311.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_armtdp_hy_3.0.0_3.0_1615292139311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_armtdp", "hy") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Ողջույն, John ոն Ձյուն լաբորատորիաներից: ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_armtdp", "hy") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Ողջույն, John ոն Ձյուն լաբորատորիաներից: ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Ողջույն, John ոն Ձյուն լաբորատորիաներից: "] token_df = nlu.load('hy.pos').predict(text) token_df ```
## Results ```bash token pos 0 Ողջույն INTJ 1 , PUNCT 2 John ADJ 3 ոն NOUN 4 Ձյուն NOUN 5 լաբորատորիաներից NOUN 6 : PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_armtdp| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|hy| --- layout: model title: Medical Text Generation (T5-based) author: John Snow Labs name: text_generator_generic_flan_base date: 2023-04-03 tags: [licensed, en, clinical, text_generation, tensorflow] task: Text Generation language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true engine: tensorflow annotator: MedicalTextGenerator article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Flan-T5-based (LLM) text generation model, essentially the same as the official [Flan-T5-base model](https://huggingface.co/google/flan-t5-base) released by Google. Given a few tokens as a prompt, it can generate human-like, conceptually meaningful text of up to 512 tokens from an input text of up to 1,024 tokens. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/MEDICAL_TEXT_GENERATION/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/33.1.Medical_Text_Generation.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/text_generator_generic_flan_base_en_4.3.2_3.0_1680522827259.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/text_generator_generic_flan_base_en_4.3.2_3.0_1680522827259.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
    .setInputCol("prompt")\
    .setOutputCol("document_prompt")

med_text_generator = MedicalTextGenerator.pretrained("text_generator_generic_flan_base", "en", "clinical/models")\
    .setInputCols("document_prompt")\
    .setOutputCol("answer")\
    .setMaxNewTokens(256)\
    .setDoSample(True)\
    .setTopK(3)\
    .setRandomSeed(42)

pipeline = Pipeline(stages=[document_assembler, med_text_generator])

data = spark.createDataFrame([["the patient is admitted to the clinic with a severe back pain and "]]).toDF("prompt")

result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("prompt")
    .setOutputCol("document_prompt")

val med_text_generator = MedicalTextGenerator.pretrained("text_generator_generic_flan_base", "en", "clinical/models")
    .setInputCols("document_prompt")
    .setOutputCol("answer")
    .setMaxNewTokens(256)
    .setDoSample(true)
    .setTopK(3)
    .setRandomSeed(42)

val pipeline = new Pipeline().setStages(Array(document_assembler, med_text_generator))

val data = Seq("the patient is admitted to the clinic with a severe back pain and ").toDF("prompt")

val result = pipeline.fit(data).transform(data)
```
## Results ```bash ['the patient is admitted to the clinic with a severe back pain and a severe left - sided leg pain. The patient was diagnosed with a lumbar disc herniation and underwent a discectomy. The patient was discharged on the third postoperative day. The patient was followed up for a period of 6 months and was found to be asymptomatic. A rare case of a giant cell tumor of the sacrum. Giant cell tumors ( GCTs ) are benign, locally aggressive tumors that are most commonly found in the long bones of the extremities. They are rarely found in the spine. We report a case of a GCT of the sacrum in a young female patient. The patient presented with a history of progressive lower back pain and a palpable mass in the left buttock. The patient underwent a left hemilaminectomy and biopsy. The histopathological examination revealed a GCT. The patient was treated with a combination of surgery and radiation therapy. The patient was followed up for 2 years and no recurrence was observed. A rare case of a giant cell tumor of the sacrum. Giant cell tumors ( GCTs ) are benign, locally aggressive tumors that are most commonly found in the long bones of the extremities. They are rarely found in the spine. 
We report a case of a GCT'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|text_generator_generic_flan_base| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|920.9 MB| --- layout: model title: Extract relations between phenotypic abnormalities and diseases (ReDL) author: John Snow Labs name: redl_human_phenotype_gene_biobert date: 2021-02-04 task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 2.7.3 spark_version: 2.4 tags: [licensed, clinical, en, relation_extraction] supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract relations to fully understand the origin of some phenotypic abnormalities and their associated diseases. `1` : Entities are related, `0` : Entities are not related. ## Predicted Entities `0`, `1` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_human_phenotype_gene_biobert_en_2.7.3_2.4_1612440673031.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_human_phenotype_gene_biobert_en_2.7.3_2.4_1612440673031.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = sparknlp.annotators.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

ner_tagger = MedicalNerModel.pretrained("ner_human_phenotype_gene_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens", "embeddings"])\
    .setOutputCol("ner_tags")

ner_converter = NerConverter()\
    .setInputCols(["sentences", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

# Set a filter on pairs of named entities which will be treated as relation candidates.
# This model relates HP (phenotype) and GENE entities.
re_ner_chunk_filter = RENerChunksFilter()\
    .setInputCols(["ner_chunks", "dependencies"])\
    .setMaxSyntacticDistance(10)\
    .setOutputCol("re_ner_chunks")\
    .setRelationPairs(["hp-gene", "hp-hp"])

# The dataset this model is trained on is sentence-wise.
# This model can also be trained on document-level relations - in which case,
# while predicting, use "document" instead of "sentence" as input.
re_model = RelationExtractionDLModel.pretrained('redl_human_phenotype_gene_biobert', 'en', "clinical/models")\
    .setPredictionThreshold(0.5)\
    .setInputCols(["re_ner_chunks", "sentences"])\
    .setOutputCol("relations")

pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model])

text = "She has a retinal degeneration, hearing loss and renal failure, short stature, \
Mutations in the SH3PXD2B gene coding for the Tks4 protein are responsible for the autosomal recessive."

data = spark.createDataFrame([[text]]).toDF("text")

p_model = pipeline.fit(data)
result = p_model.transform(data)
```
```scala
...
val documenter = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentencer = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentences")

val tokenizer = new Tokenizer()
    .setInputCols("sentences")
    .setOutputCol("tokens")

val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens"))
    .setOutputCol("pos_tags")

val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens"))
    .setOutputCol("embeddings")

val ner_tagger = MedicalNerModel.pretrained("ner_human_phenotype_gene_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens", "embeddings"))
    .setOutputCol("ner_tags")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentences", "tokens", "ner_tags"))
    .setOutputCol("ner_chunks")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
    .setInputCols(Array("sentences", "pos_tags", "tokens"))
    .setOutputCol("dependencies")

// Set a filter on pairs of named entities which will be treated as relation candidates.
// This model relates HP (phenotype) and GENE entities.
val re_ner_chunk_filter = new RENerChunksFilter()
    .setInputCols(Array("ner_chunks", "dependencies"))
    .setMaxSyntacticDistance(10)
    .setOutputCol("re_ner_chunks")
    .setRelationPairs(Array("hp-gene", "hp-hp"))

// The dataset this model is trained on is sentence-wise.
// This model can also be trained on document-level relations - in which case,
// while predicting, use "document" instead of "sentence" as input.
val re_model = RelationExtractionDLModel.pretrained("redl_human_phenotype_gene_biobert", "en", "clinical/models")
    .setPredictionThreshold(0.5)
    .setInputCols(Array("re_ner_chunks", "sentences"))
    .setOutputCol("relations")

val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model))

val data = Seq("She has a retinal degeneration, hearing loss and renal failure, short stature, Mutations in the SH3PXD2B gene coding for the Tks4 protein are responsible for the autosomal recessive.").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.relation.humen_phenotype_gene").predict("""She has a retinal degeneration, hearing loss and renal failure, short stature, \
Mutations in the SH3PXD2B gene coding for the Tks4 protein are responsible for the autosomal recessive.""")
```
## Results

```bash
|    |   relation | entity1   |   entity1_begin |   entity1_end | chunk1               | entity2   |   entity2_begin |   entity2_end | chunk2              |   confidence |
|---:|-----------:|:----------|----------------:|--------------:|:---------------------|:----------|----------------:|--------------:|:--------------------|-------------:|
|  0 |          0 | HP        |              10 |            29 | retinal degeneration | HP        |              32 |            43 | hearing loss        |     0.893809 |
|  1 |          0 | HP        |              10 |            29 | retinal degeneration | HP        |              49 |            61 | renal failure       |     0.958486 |
|  2 |          1 | HP        |              10 |            29 | retinal degeneration | HP        |             162 |           180 | autosomal recessive |     0.65584  |
|  3 |          0 | HP        |              32 |            43 | hearing loss         | HP        |              64 |            76 | short stature       |     0.707055 |
|  4 |          1 | HP        |              32 |            43 | hearing loss         | GENE      |              96 |           103 | SH3PXD2B            |     0.640802 |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|redl_human_phenotype_gene_biobert|
|Compatibility:|Healthcare NLP 2.7.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|

## Data Source

Trained on a silver standard corpus of human phenotype and gene annotations and their relations.

## Benchmarking

```bash
Relation  Recall  Precision  F1     Support
0         0.922   0.908      0.915  129
1         0.831   0.855      0.843  71
Avg.      0.877   0.882      0.879
```

---
layout: model
title: English DistilBertForQuestionAnswering model (from Raphaelg9)
author: John Snow Labs
name: distilbert_qa_Raphaelg9_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Raphaelg9`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Raphaelg9_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724387431.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Raphaelg9_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724387431.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Raphaelg9_base_uncased_finetuned_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Raphaelg9_base_uncased_finetuned_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Raphaelg9").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_Raphaelg9_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/Raphaelg9/distilbert-base-uncased-finetuned-squad

---
layout: model
title: English BertForQuestionAnswering model (from fractalego)
author: John Snow Labs
name: bert_qa_fewrel_zero_shot
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fewrel-zero-shot` is an English model originally trained by `fractalego`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_fewrel_zero_shot_en_4.0.0_3.0_1654187638832.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_fewrel_zero_shot_en_4.0.0_3.0_1654187638832.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_fewrel_zero_shot","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_fewrel_zero_shot","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.zero_shot").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_fewrel_zero_shot|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.2 GB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/fractalego/fewrel-zero-shot
- https://www.aclweb.org/anthology/2020.coling-main.124

---
layout: model
title: Embeddings Scielo 300 dims
author: John Snow Labs
name: embeddings_scielo_300d
class: WordEmbeddingsModel
language: es
repository: clinical/models
date: 2020-05-26
task: Embeddings
edition: Healthcare NLP 2.5.0
spark_version: 2.4
tags: [clinical,embeddings,es]
supported: true
annotator: WordEmbeddingsModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

{:.h2_title}
## Description

Word Embeddings lookup annotator that maps tokens to vectors.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/embeddings_scielo_300d_es_2.5.0_2.4_1590467138742.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/embeddings_scielo_300d_es_2.5.0_2.4_1590467138742.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python model = WordEmbeddingsModel.pretrained("embeddings_scielo_300d","es","clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("word_embeddings") ``` ```scala val model = WordEmbeddingsModel.pretrained("embeddings_scielo_300d","es","clinical/models") .setInputCols("document","token") .setOutputCol("word_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.scielo.300d").predict("""Put your text here.""") ```
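The snippet above shows only the embeddings stage itself. Since the annotator maps tokens to vectors, it needs upstream stages that produce the `document` and `token` columns. The following is a minimal end-to-end sketch, not this card's official example: it assumes a running licensed Spark NLP for Healthcare session (`spark`), and the Spanish sample sentence is purely illustrative.

```python
# Minimal pipeline sketch around embeddings_scielo_300d.
# Assumes a licensed Spark NLP for Healthcare session is already started.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("embeddings_scielo_300d", "es", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("word_embeddings")

pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings])

# Illustrative sample text (any Spanish clinical sentence works).
data = spark.createDataFrame([["El paciente presenta fiebre y cefalea."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```

After `transform`, each token row in the `word_embeddings` column carries a 300-dimensional vector.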
{:.h2_title} ## Results Word2Vec feature vectors based on ``embeddings_scielo_300d``. {:.model-param} ## Model Information {:.table-model} |---------------|------------------------| | Name: | embeddings_scielo_300d | | Type: | WordEmbeddingsModel | | Compatibility: | Spark NLP 2.5.0+ | | License: | Licensed | | Edition: | Official | |Input labels: | [document, token] | |Output labels: | [word_embeddings] | | Language: | es | | Dimension: | 300.0 | {:.h2_title} ## Data Source Trained on Scielo Articles https://zenodo.org/record/3744326#.XtViinVKh_U --- layout: model title: Stopwords Remover for Hindi language (233 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, hi, open_source] task: Stop Words Removal language: hi edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_hi_3.4.1_3.0_1646672942184.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_hi_3.4.1_3.0_1646672942184.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

stop_words = StopWordsCleaner.pretrained("stopwords_iso","hi") \
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])

example = spark.createDataFrame([["तुम मुझसे बेहतर नहीं हो"]], ["text"])

results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val stop_words = StopWordsCleaner.pretrained("stopwords_iso","hi")
    .setInputCols(Array("token"))
    .setOutputCol("cleanTokens")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))

val data = Seq("तुम मुझसे बेहतर नहीं हो").toDF("text")

val results = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("hi.stopwords").predict("""तुम मुझसे बेहतर नहीं हो""")
```
## Results

```bash
+-------------------+
|result             |
+-------------------+
|[तुम, मुझसे, बेहतर]|
+-------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|hi|
|Size:|2.2 KB|

---
layout: model
title: English DistilBertForQuestionAnswering model (from jgammack) MTL
author: John Snow Labs
name: distilbert_qa_MTL_base_uncased_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `MTL-distilbert-base-uncased-squad` is an English model originally trained by `jgammack`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_MTL_base_uncased_squad_en_4.0.0_3.0_1654722966366.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_MTL_base_uncased_squad_en_4.0.0_3.0_1654722966366.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_MTL_base_uncased_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_MTL_base_uncased_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased_mtl.by_jgammack").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_MTL_base_uncased_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/jgammack/MTL-distilbert-base-uncased-squad

---
layout: model
title: Translate English to Lunda Pipeline
author: John Snow Labs
name: translate_en_lun
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, lun, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `en`
- target languages: `lun`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_lun_xx_2.7.0_2.4_1609687838269.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_lun_xx_2.7.0_2.4_1609687838269.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_lun", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_lun", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.lun').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_en_lun|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Fast Neural Machine Translation Model from Arabic to English
author: John Snow Labs
name: opus_mt_ar_en
date: 2021-06-01
tags: [open_source, seq2seq, translation, ar, en, xx, multilingual]
task: Translation
language: xx
edition: Spark NLP 3.1.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `ar`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ar_en_xx_3.1.0_2.4_1622552419647.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ar_en_xx_3.1.0_2.4_1622552419647.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_ar_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_ar_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Arabic.translate_to.English').predict(text, output_level='sentence')
translate_df
```
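The Python snippet above stops at the stage definitions. A minimal sketch of assembling and running them follows; it is an illustration rather than this card's official example, and assumes an active `spark` session with Spark NLP loaded (the Arabic sample sentence is illustrative).

```python
from pyspark.ml import Pipeline

# Assemble the stages defined above into a pipeline and translate one sentence.
pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])

data = spark.createDataFrame([["مرحبا، كيف حالك؟"]]).toDF("text")
result = pipeline.fit(data).transform(data)

# The English translation lands in the `translation` annotation column.
result.select("translation.result").show(truncate=False)
```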
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ar_en| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English image_classifier_vit_world_landmarks ViTForImageClassification from mmgyorke author: John Snow Labs name: image_classifier_vit_world_landmarks date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_world_landmarks` is a English model originally trained by mmgyorke. ## Predicted Entities `la sagrada familia`, `leaning tower of pisa`, `arc de triomphe`, `taj mahal`, `big ben` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_world_landmarks_en_4.1.0_3.0_1660171257315.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_world_landmarks_en_4.1.0_3.0_1660171257315.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_world_landmarks", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_world_landmarks", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_world_landmarks|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|

---
layout: model
title: Legal Option Agreement Document Classifier (Longformer)
author: John Snow Labs
name: legclf_option_agreement
date: 2022-12-06
tags: [en, legal, classification, agreement, option, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The `legclf_option_agreement` model is a Legal Longformer Document Classifier that classifies whether a document belongs to the class `option-agreement` or not (binary classification).

Longformers are limited to 4096 tokens, so only the first 4096 tokens of a document are taken into account. We have found that, for the vast majority of documents in legal corpora, 4096 tokens are enough for document classification, provided the documents are clean and contain only the legal text without extra leading material. If that is not the case, let us know and we can apply another approach: splitting the document into 4096-token chunks, averaging their embeddings, and training on the averaged version, so that the whole document is taken into account. In theory, however, this should not be required.

## Predicted Entities

`option-agreement`, `other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_option_agreement_en_1.0.0_3.0_1670357991770.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_option_agreement_en_1.0.0_3.0_1670357991770.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_option_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
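For documents that do exceed the 4096-token limit, the chunk-and-average strategy mentioned in the description can be sketched in plain Python. The `embed` function below is a hypothetical stand-in for a real Longformer forward pass, not part of Spark NLP:

```python
# Sketch of the chunk-and-average strategy for documents longer than 4096 tokens.

MAX_LEN = 4096

def chunk(tokens, size=MAX_LEN):
    """Split a token list into consecutive windows of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def average_embeddings(vectors):
    """Element-wise mean of equally sized embedding vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def embed(chunk_tokens):
    # Hypothetical stub: a real pipeline would run the Longformer here.
    return [float(len(chunk_tokens))] * 4

tokens = ["tok"] * 10000          # a 10,000-token document
chunks = chunk(tokens)            # -> 3 windows: 4096, 4096, 1808
doc_vector = average_embeddings([embed(c) for c in chunks])
```

Training the classifier on `doc_vector` instead of a single-window embedding lets the whole document contribute to the prediction.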
## Results ```bash +-------+ |result| +-------+ |[option-agreement]| |[other]| |[other]| |[option-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_option_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.5 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support option-agreement 0.96 0.93 0.95 59 other 0.96 0.98 0.97 111 accuracy - - 0.96 170 macro-avg 0.96 0.96 0.96 170 weighted-avg 0.96 0.96 0.96 170 ``` --- layout: model title: Detect Drugs - Generalized Single Entity (ner_drugs_greedy) author: John Snow Labs name: ner_drugs_greedy date: 2020-12-14 task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 2.6.5 spark_version: 2.4 tags: [ner, licensed, en, clinical] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a single-entity model that generalises all posology concepts into one and finds the longest available drug chunks. It is trained using `embeddings_clinical`, so please use the same embeddings in the pipeline. ## Predicted Entities `DRUG`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_drugs_greedy_en_2.6.4_2.4_1607417409084.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_drugs_greedy_en_2.6.4_2.4_1607417409084.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_drugs_greedy", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["DOSAGE AND ADMINISTRATION The initial dosage of hydrocortisone tablets may vary from 20 mg to 240 mg of hydrocortisone per day depending on the specific disease entity being treated."]]).toDF("text")) ``` ```scala ... val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = NerDLModel.pretrained("ner_drugs_greedy", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter)) val data = Seq("DOSAGE AND ADMINISTRATION The initial dosage of hydrocortisone tablets may vary from 20 mg to 240 mg of hydrocortisone per day depending on the specific disease entity being treated.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.drugsgreedy").predict("""DOSAGE AND ADMINISTRATION The initial dosage of hydrocortisone tablets may vary from 20 mg to 240 mg of hydrocortisone per day depending on the specific disease entity being treated.""") ```
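The NerConverter stage mentioned above merges IOB-tagged tokens back into full entity chunks. A minimal pure-Python sketch of that merging logic (illustrative only; the real annotator also tracks character offsets):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB tags (B-X begins an entity, I-X continues it) into (chunk, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # "O" (or a stray tag) closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["took", "hydrocortisone", "tablets", "daily"]
tags = ["O", "B-DRUG", "I-DRUG", "O"]
print(iob_to_chunks(tokens, tags))  # -> [('hydrocortisone tablets', 'DRUG')]
```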
## Results ```bash +-----------------------------------+------------+ | chunk | ner_label | +-----------------------------------+------------+ | hydrocortisone tablets | DRUG | | 20 mg to 240 mg of hydrocortisone | DRUG | +-----------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_drugs_greedy| |Type:|ner| |Compatibility:|Spark NLP 2.6.5+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Dependencies:|embeddings_clinical| ## Data Source Trained on augmented i2b2_med7 + FDA dataset with ``embeddings_clinical``, [https://www.i2b2.org/NLP/Medication](https://www.i2b2.org/NLP/Medication). ## Benchmarking ```bash label tp fp fn prec rec f1 I-DRUG 37858 4166 3338 0.90086615 0.91897273 0.9098294 B-DRUG 29926 2006 1756 0.937179 0.9445742 0.9408621 tp: 67784 fp: 6172 fn: 5094 labels: 2 Macro-average prec: 0.91902256, rec: 0.9317734, f1: 0.92535406 Micro-average prec: 0.916545, rec: 0.93010235, f1: 0.9232739 ``` --- layout: model title: English Bert Embeddings (from mlcorelib) author: John Snow Labs name: bert_embeddings_deberta_base_uncased date: 2022-04-11 tags: [bert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `deberta-base-uncased` is an English model originally trained by `mlcorelib`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_deberta_base_uncased_en_3.4.2_3.0_1649672445146.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_deberta_base_uncased_en_3.4.2_3.0_1649672445146.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_deberta_base_uncased","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_deberta_base_uncased","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.deberta_base_uncased").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_deberta_base_uncased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|410.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/mlcorelib/deberta-base-uncased - https://arxiv.org/abs/1810.04805 - https://github.com/google-research/bert - https://yknzhu.wixsite.com/mbweb - https://en.wikipedia.org/wiki/English_Wikipedia --- layout: model title: German asr_exp_w2v2t_vp_s184 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_exp_w2v2t_vp_s184 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_exp_w2v2t_vp_s184` is a German model originally trained by jonatasgrosman. NOTE: This pipeline only works on a CPU, if you need to use this pipeline on a GPU device please use pipeline_asr_exp_w2v2t_vp_s184_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2t_vp_s184_de_4.2.0_3.0_1664109862866.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2t_vp_s184_de_4.2.0_3.0_1664109862866.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2t_vp_s184', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2t_vp_s184", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_exp_w2v2t_vp_s184| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Explain Document Pipeline for German author: John Snow Labs name: explain_document_md date: 2021-03-22 tags: [open_source, german, explain_document_md, pipeline, de] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: de edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_md is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps. It performs most of the common text processing tasks on your dataframe {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_de_3.0.0_3.0_1616431227526.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_md_de_3.0.0_3.0_1616431227526.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_md', lang = 'de') annotations = pipeline.fullAnnotate("Hallo aus John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_md", lang = "de") val result = pipeline.fullAnnotate("Hallo aus John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hallo aus John Snow Labs! "] result_df = nlu.load('de.explain.document').predict(text) result_df ```
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:-------------------------------|:------------------------------|:------------------------------------------|:------------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Hallo aus John Snow Labs! '] | ['Hallo aus John Snow Labs!'] | ['Hallo', 'aus', 'John', 'Snow', 'Labs!'] | ['Hallo', 'aus', 'John', 'Snow', 'Labs!'] | ['NOUN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.5910000205039978,.,...]] | ['O', 'O', 'I-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|de| --- layout: model title: Translate Danish to English Pipeline author: John Snow Labs name: translate_da_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, da, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. 
The use of an accelerator such as GPU is recommended. - source languages: `da` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_da_en_xx_2.7.0_2.4_1609689269793.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_da_en_xx_2.7.0_2.4_1609689269793.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_da_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_da_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.da.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_da_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_10 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-512-finetuned-squad-seed-10` is a English model orginally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_10_en_4.0.0_3.0_1654191610651.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_10_en_4.0.0_3.0_1654191610651.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_10","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_512d_seed_10").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
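Under the hood, extractive QA models like this one score every token as a potential answer start and answer end, then pick the valid (start, end) pair with the highest combined score. A small plain-Python illustration of that decoding step (the logits below are made-up numbers, not real model output):

```python
def best_span(start_logits, end_logits, max_len=30):
    """Return (start, end) maximizing start_logits[s] + end_logits[e] with s <= e."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end_logits   = [0.1, 0.1, 0.1, 4.8, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]
s, e = best_span(start_logits, end_logits)
print(" ".join(context[s:e + 1]))  # -> Clara
```

The `max_len` cap mirrors the common practice of rejecting implausibly long answer spans.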
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|387.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-512-finetuned-squad-seed-10 --- layout: model title: Financial Relation Extraction (Work Experience, Sm, Bidirectional) author: John Snow Labs name: finre_work_experience date: 2022-09-28 tags: [work, experience, en, licensed] task: Relation Extraction language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description IMPORTANT: Don't run this model on the whole financial report. Instead: - Split by paragraphs. You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration; - Use the `finclf_work_experience_item` Text Classifier to select only these paragraphs. This model allows you to analyze present and past job positions of people, extracting relations between PERSON, ORG, ROLE and DATE. It requires an NER model with the mentioned entities, such as `finner_org_per_role`, and can also be combined with `finassertiondl_past_roles` to detect whether the entities are mentioned as having happened in the PAST or not (although you can also infer that from relations such as `had_role_until`). This is a `sm` model without meaningful directions in the relations (the model was not trained to understand whether the direction of a relation is from left to right or right to left). There are bigger models in Models Hub that were also trained with directed relationships. 
## Predicted Entities `has_role`, `had_role_until`, `has_role_from`, `works_for`, `has_role_in_company` {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finre_work_experience_en_1.0.0_3.0_1664360618647.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finre_work_experience_en_1.0.0_3.0_1664360618647.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = finance.NerModel.pretrained('finner_org_per_role_date', 'en', 'finance/models')\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") pos = nlp.PerceptronModel.pretrained()\ .setInputCols(["sentence", "token"])\ .setOutputCol("pos") dependency_parser = nlp.DependencyParserModel.pretrained("dependency_conllu", "en")\ .setInputCols(["sentence", "pos", "token"])\ .setOutputCol("dependencies") re_ner_chunk_filter = finance.RENerChunksFilter()\ .setInputCols(["ner_chunk", "dependencies"])\ .setOutputCol("re_ner_chunk")\ .setRelationPairs(["PERSON-ROLE, ORG-ROLE, DATE-ROLE, PERSON-ORG"])\ .setMaxSyntacticDistance(5) re_Model = finance.RelationExtractionDLModel.pretrained("finre_work_experience", "en", "finance/models")\ .setInputCols(["re_ner_chunk", "sentence"])\ .setOutputCol("relations")\ .setPredictionThreshold(0.5) pipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter, pos, dependency_parser, re_ner_chunk_filter, re_Model ]) empty_df = spark.createDataFrame([['']]).toDF("text") re_model = pipeline.fit(empty_df) light_model = nlp.LightPipeline(re_model) text_list = ["""We have experienced significant changes in our senior management team over the past several years, including the appointments of Mark Schmitz as our Executive Vice President 
and Chief Operating Officer in 2019.""", """In January 2019, Jose Cil was assigned the CEO of Restaurant Brands International, and Daniel Schwartz was assigned the Executive Chairman of the company.""", ] results = light_model.fullAnnotate(text_list) ```
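Each predicted relation carries a confidence score, and `setPredictionThreshold(0.5)` above already filters at extraction time; stricter post-hoc filtering and grouping can be sketched in plain Python. The `(relation, person, role, confidence)` tuple layout below is a simplified, hypothetical view of the annotator's output, using scores from the example sentences:

```python
from collections import defaultdict

def roles_by_person(relations, threshold=0.7):
    """Group high-confidence has_role relations by person."""
    grouped = defaultdict(list)
    for rel, person, role, conf in relations:
        if rel == "has_role" and conf >= threshold:
            grouped[person].append(role)
    return dict(grouped)

# Hypothetical simplified tuples: (relation, person, role, confidence)
relations = [
    ("has_role", "Mark Schmitz", "Executive Vice President", 0.8707728),
    ("has_role", "Mark Schmitz", "Chief Operating Officer", 0.97559035),
    ("has_role", "CEO", "Daniel Schwartz", 0.5765097),
]
print(roles_by_person(relations))
```

Raising `threshold` trades recall for precision, which is often the right call when the relations feed a downstream knowledge base.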
## Results ```bash has_role PERSON 129 140 Mark Schmitz ROLE 149 172 Executive Vice President 0.8707728 has_role PERSON 129 140 Mark Schmitz ROLE 178 200 Chief Operating Officer 0.97559035 has_role_from ROLE 149 172 Executive Vice President DATE 205 208 2019 0.9327241 has_role_from ROLE 178 200 Chief Operating Officer DATE 205 208 2019 0.90718126 has_role_from DATE 3 14 January 2019 ROLE 43 45 CEO 0.996639 has_role_from DATE 3 14 January 2019 ROLE 120 137 Executive Chairman 0.9964874 has_role PERSON 17 24 Jose Cil ROLE 43 45 CEO 0.8917691 has_role PERSON 17 24 Jose Cil ROLE 120 137 Executive Chairman 0.8527716 has_role ROLE 43 45 CEO PERSON 87 101 Daniel Schwartz 0.5765097 has_role PERSON 87 101 Daniel Schwartz ROLE 120 137 Executive Chairman 0.79235893 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finre_work_experience| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|409.9 MB| ## References Manual annotations on CUAD dataset, 10K filings and Wikidata ## Benchmarking ```bash label Recall Precision F1 Support had_role_until 0.972 0.972 0.972 36 has_role 0.986 0.980 0.983 146 has_role_from 0.983 0.983 0.983 58 has_role_in_company 0.954 0.969 0.961 65 works_for 0.933 0.933 0.933 15 Avg. 0.966 0.967 0.966 - Weighted-Avg. 0.975 0.975 0.975 - ``` --- layout: model title: Romanian Legal Roberta Embeddings author: John Snow Labs name: roberta_base_romanian_legal date: 2023-02-16 tags: [ro, romanian, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: ro edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`legal-romanian-roberta-base` is a Romanian model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_romanian_legal_ro_4.2.4_3.0_1676556391039.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_romanian_legal_ro_4.2.4_3.0_1676556391039.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_romanian_legal", "ro")\ .setInputCols(["sentence"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_romanian_legal", "ro") .setInputCols("sentence") .setOutputCol("embeddings") ```
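A common use of such embeddings is measuring semantic similarity between legal passages, which reduces to cosine similarity between the extracted vectors. A self-contained sketch with toy 3-dimensional vectors (real vectors from a base RoBERTa model are typically 768-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [0.2, 0.1, 0.4]    # toy stand-ins for embedding vectors
v2 = [0.4, 0.2, 0.8]    # parallel to v1 -> similarity 1.0
v3 = [-0.1, 0.4, -0.05] # orthogonal to v1 -> similarity 0.0
print(round(cosine_similarity(v1, v2), 3))  # -> 1.0
```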
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_romanian_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ro| |Size:|416.0 MB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-romanian-roberta-base --- layout: model title: Legal Investment Subadvisory Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_investment_subadvisory_agreement_bert date: 2022-11-24 tags: [en, legal, classification, agreement, investment_subadvisory, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_investment_subadvisory_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `investment-subadvisory-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `investment-subadvisory-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_investment_subadvisory_agreement_bert_en_1.0.0_3.0_1669315177671.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_investment_subadvisory_agreement_bert_en_1.0.0_3.0_1669315177671.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_investment_subadvisory_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[investment-subadvisory-agreement]| |[other]| |[other]| |[investment-subadvisory-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_investment_subadvisory_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support investment-subadvisory-agreement 1.00 1.00 1.00 32 other 1.00 1.00 1.00 82 accuracy - - 1.00 114 macro-avg 1.00 1.00 1.00 114 weighted-avg 1.00 1.00 1.00 114 ``` --- layout: model title: English ElectraForQuestionAnswering model (from Palak) author: John Snow Labs name: electra_qa_google_base_discriminator_squad date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `google_electra-base-discriminator_squad` is an English model originally trained by `Palak`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_google_base_discriminator_squad_en_4.0.0_3.0_1655921991596.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_google_base_discriminator_squad_en_4.0.0_3.0_1655921991596.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_google_base_discriminator_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_google_base_discriminator_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.electra.base.by_Palak").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_google_base_discriminator_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|408.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Palak/google_electra-base-discriminator_squad --- layout: model title: English XlmRoBertaForQuestionAnswering (from laifuchicago) author: John Snow Labs name: xlm_roberta_qa_farm2tran date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `farm2tran` is an English model originally trained by `laifuchicago`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_farm2tran_en_4.0.0_3.0_1655987409597.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_farm2tran_en_4.0.0_3.0_1655987409597.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_farm2tran","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = XlmRoBertaForQuestionAnswering
    .pretrained("xlm_roberta_qa_farm2tran","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.xlm_roberta.by_laifuchicago").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_farm2tran|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.9 GB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/laifuchicago/farm2tran
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from gkss)
author: John Snow Labs
name: distilbert_qa_gkss_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `gkss`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_gkss_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770818177.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_gkss_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770818177.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_gkss_base_uncased_finetuned_squad","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_gkss_base_uncased_finetuned_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_gkss_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/gkss/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Arabic T5ForConditionalGeneration Cased model (from malmarjeh)
author: John Snow Labs
name: t5_arabic_text_summarization
date: 2023-01-30
tags: [ar, open_source, t5, tensorflow]
task: Text Generation
language: ar
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-arabic-text-summarization` is an Arabic model originally trained by `malmarjeh`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_arabic_text_summarization_ar_4.3.0_3.0_1675107890347.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_arabic_text_summarization_ar_4.3.0_3.0_1675107890347.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_arabic_text_summarization","ar") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_arabic_text_summarization","ar")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_arabic_text_summarization|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|ar|
|Size:|1.4 GB|

## References

- https://huggingface.co/malmarjeh/t5-arabic-text-summarization
---
layout: model
title: English DistilBertForQuestionAnswering model (from huxxx657) Jumbling
author: John Snow Labs
name: distilbert_qa_base_uncased_finetuned_jumbling_squad_15
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-jumbling-squad-15` is an English model originally trained by `huxxx657`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_jumbling_squad_15_en_4.0.0_3.0_1654723967561.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_jumbling_squad_15_en_4.0.0_3.0_1654723967561.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_jumbling_squad_15","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_jumbling_squad_15","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased_v2.by_huxxx657").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_finetuned_jumbling_squad_15|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/huxxx657/distilbert-base-uncased-finetuned-jumbling-squad-15
---
layout: model
title: BERT Sentence Embeddings trained on MEDLINE/PubMed
author: John Snow Labs
name: sent_bert_pubmed
date: 2021-08-31
tags: [en, open_source, sentence_embeddings, medline_pubmed_dataset]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.2.0
spark_version: 3.0
supported: true
annotator: BertSentenceEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model uses a BERT base architecture pretrained from scratch on MEDLINE/PubMed. Some changes have been made to the original training and export scheme based on more recent learnings that improve its accuracy over the original BERT base checkpoint. This model is intended to be used for a variety of English NLP tasks in the medical domain. The pre-training data contains mostly medical text, so the model may not generalize to text outside of that domain.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_pubmed_en_3.2.0_3.0_1630412084893.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_pubmed_en_3.2.0_3.0_1630412084893.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_pubmed", "en") \
    .setInputCols("sentence") \
    .setOutputCol("bert_sentence")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings])
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_pubmed", "en")
    .setInputCols("sentence")
    .setOutputCol("bert_sentence")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings))
```

{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
sent_embeddings_df = nlu.load('en.embed_sentence.bert.pubmed').predict(text, output_level='sentence')
sent_embeddings_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sent_bert_pubmed|
|Compatibility:|Spark NLP 3.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[bert_sentence]|
|Language:|en|
|Case sensitive:|false|

## Data Source

[1]: [MEDLINE/PubMed dataset](https://www.nlm.nih.gov/databases/download/pubmed_medline.html)

This model has been imported from: https://tfhub.dev/google/experts/bert/pubmed/2
---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from hcy11)
author: John Snow Labs
name: distilbert_qa_hcy11_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `hcy11`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_hcy11_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771148653.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_hcy11_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771148653.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hcy11_base_uncased_finetuned_squad","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hcy11_base_uncased_finetuned_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_hcy11_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/hcy11/distilbert-base-uncased-finetuned-squad
---
layout: model
title: English DistilBertForTokenClassification Cased model (from TomUdale)
author: John Snow Labs
name: distilbert_token_classifier_sec_example
date: 2023-03-14
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `sec_example` is an English model originally trained by `TomUdale`.

## Predicted Entities

`PER`, `ORG`, `MISC`, `LOC`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_sec_example_en_4.3.1_3.0_1678783571147.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_sec_example_en_4.3.1_3.0_1678783571147.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_sec_example","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_sec_example","en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_sec_example|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/TomUdale/sec_example
---
layout: model
title: Multilingual BertForQuestionAnswering model (from mrm8488)
author: John Snow Labs
name: bert_qa_bert_multi_cased_finedtuned_xquad_tydiqa_goldp
date: 2022-06-03
tags: [te, en, open_source, question_answering, bert, xx]
task: Question Answering
language: xx
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-multi-cased-finedtuned-xquad-tydiqa-goldp` is a Multilingual model originally trained by `mrm8488`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_multi_cased_finedtuned_xquad_tydiqa_goldp_xx_4.0.0_3.0_1654253530216.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_multi_cased_finedtuned_xquad_tydiqa_goldp_xx_4.0.0_3.0_1654253530216.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_multi_cased_finedtuned_xquad_tydiqa_goldp","xx") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_bert_multi_cased_finedtuned_xquad_tydiqa_goldp","xx")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("xx.answer_question.xquad_tydiqa.bert.cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_multi_cased_finedtuned_xquad_tydiqa_goldp|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|xx|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/mrm8488/bert-multi-cased-finedtuned-xquad-tydiqa-goldp
---
layout: model
title: Embeddings Sciwiki 300 dims
author: John Snow Labs
name: embeddings_sciwiki_300d
class: WordEmbeddingsModel
language: es
repository: clinical/models
date: 2020-05-27
task: Embeddings
edition: Healthcare NLP 2.5.0
spark_version: 2.4
tags: [clinical,embeddings,es]
supported: true
annotator: WordEmbeddingsModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

{:.h2_title}
## Description

Word Embeddings lookup annotator that maps tokens to vectors.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/embeddings_sciwiki_300d_es_2.5.0_2.4_1590609454054.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/embeddings_sciwiki_300d_es_2.5.0_2.4_1590609454054.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
model = WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")
```
```scala
val model = WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models")
    .setInputCols("document","token")
    .setOutputCol("word_embeddings")
```

{:.nlu-block}
```python
import nlu
nlu.load("es.embed.sciwiki_300d").predict("""Put your text here.""")
```
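The annotator maps each token to a 300-dimensional vector. Outside of Spark, such vectors are typically compared with cosine similarity; the following is a minimal pure-Python sketch of that comparison, using made-up 3-dimensional vectors as stand-ins for the real 300-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d stand-ins for the 300-d vectors the model produces (invented values).
v_gripe = [0.9, 0.1, 0.3]
v_influenza = [0.8, 0.2, 0.35]
v_hospital = [0.1, 0.9, 0.2]

# Related terms should score higher than unrelated ones.
print(cosine_similarity(v_gripe, v_influenza) > cosine_similarity(v_gripe, v_hospital))  # True
```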
{:.h2_title}
## Results

Word2Vec feature vectors based on `embeddings_sciwiki_300d`.

{:.model-param}
## Model Information

{:.table-model}
|---------------|-------------------------|
| Name: | embeddings_sciwiki_300d |
| Type: | WordEmbeddingsModel |
| Compatibility: | Spark NLP 2.5.0+ |
| License: | Licensed |
| Edition: | Official |
| Input labels: | [document, token] |
| Output labels: | [word_embeddings] |
| Language: | es |
| Dimension: | 300.0 |

{:.h2_title}
## Data Source

Trained on Clinical Wikipedia Articles
https://zenodo.org/record/3744326#.XtViinVKh_U
---
layout: model
title: Legal Means Of Agricultural Production Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_means_of_agricultural_production_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, means_of_agricultural_production, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the `legclf_means_of_agricultural_production_bert` model, a Bert Sentence Embeddings Document Classifier, determines whether the document belongs to the class `Means_of_Agricultural_Production` or not (binary classification) according to EuroVoc labels.
## Predicted Entities

`Means_of_Agricultural_Production`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_means_of_agricultural_production_bert_en_1.0.0_3.0_1678111831055.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_means_of_agricultural_production_bert_en_1.0.0_3.0_1678111831055.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_means_of_agricultural_production_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    document_assembler,
    embeddings,
    doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")

model = nlpPipeline.fit(df)

result = model.transform(df)
```
## Results

```bash
+----------------------------------+
|result                            |
+----------------------------------+
|[Means_of_Agricultural_Production]|
|[Other]                           |
|[Other]                           |
|[Means_of_Agricultural_Production]|
+----------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_means_of_agricultural_production_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.9 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
                           label  precision  recall  f1-score  support
Means_of_Agricultural_Production       0.86    0.89      0.88      476
                           Other       0.87    0.83      0.85      416
                        accuracy          -       -      0.87      892
                       macro-avg       0.87    0.86      0.86      892
                    weighted-avg       0.87    0.87      0.87      892
```
---
layout: model
title: English asr_sp_proj TFWav2Vec2ForCTC from behroz
author: John Snow Labs
name: pipeline_asr_sp_proj
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_sp_proj` is an English pipeline originally trained by behroz.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_sp_proj_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_sp_proj_en_4.2.0_3.0_1664019439184.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_sp_proj_en_4.2.0_3.0_1664019439184.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
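The weighted averages in the benchmark table above can be sanity-checked by support-weighting the per-label scores. A quick illustrative computation, using only the numbers from the table:

```python
# Per-label F1 and support, copied from the benchmark table above.
labels = {
    "Means_of_Agricultural_Production": {"f1": 0.88, "support": 476},
    "Other": {"f1": 0.85, "support": 416},
}

# Weighted average = sum of (score * support) divided by total support.
total = sum(v["support"] for v in labels.values())
weighted_f1 = sum(v["f1"] * v["support"] for v in labels.values()) / total

print(total)                  # 892
print(round(weighted_f1, 2))  # 0.87
```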
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_sp_proj', lang = 'en')

annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_sp_proj", lang = "en")

val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_sp_proj|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|349.4 MB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC
---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_small_ssm_nq
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-ssm-nq` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_ssm_nq_en_4.3.0_3.0_1675155920334.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_ssm_nq_en_4.3.0_3.0_1675155920334.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_small_ssm_nq","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_small_ssm_nq","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_small_ssm_nq|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|179.4 MB|

## References

- https://huggingface.co/google/t5-small-ssm-nq
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/pdf/2002.08909.pdf
- https://arxiv.org/abs/1910.10683.pdf
- https://goo.gle/t5-cbqa
- https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/how_much_know_ledge_image.png
---
layout: model
title: NER Pipeline for Hindi+English
author: John Snow Labs
name: bert_token_classifier_hi_en_ner_pipeline
date: 2022-03-22
tags: [hindi, bert_token, hi, open_source]
task: Named Entity Recognition
language: hi
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on [bert_token_classifier_hi_en_ner](https://nlp.johnsnowlabs.com/2021/12/27/bert_token_classifier_hi_en_ner_hi.html).

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_HINDI_ENGLISH/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_HINDI_ENGLISH.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_hi_en_ner_pipeline_hi_3.4.1_3.0_1647954363761.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_hi_en_ner_pipeline_hi_3.4.1_3.0_1647954363761.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("bert_token_classifier_hi_en_ner_pipeline", lang = "hi") pipeline.annotate("रिलायंस इंडस्ट्रीज़ लिमिटेड (Reliance Industries Limited) एक भारतीय संगुटिका नियंत्रक कंपनी है, जिसका मुख्यालय मुंबई, महाराष्ट्र (Maharashtra) में स्थित है।रतन नवल टाटा (28 दिसंबर 1937, को मुम्बई (Mumbai), में जन्मे) टाटा समुह के वर्तमान अध्यक्ष, जो भारत की सबसे बड़ी व्यापारिक समूह है, जिसकी स्थापना जमशेदजी टाटा ने की और उनके परिवार की पीढियों ने इसका विस्तार किया और इसे दृढ़ बनाया।") ``` ```scala val pipeline = new PretrainedPipeline("bert_token_classifier_hi_en_ner_pipeline", lang = "hi") val result = pipeline.annotate("रिलायंस इंडस्ट्रीज़ लिमिटेड (Reliance Industries Limited) एक भारतीय संगुटिका नियंत्रक कंपनी है, जिसका मुख्यालय मुंबई, महाराष्ट्र (Maharashtra) में स्थित है।रतन नवल टाटा (28 दिसंबर 1937, को मुम्बई (Mumbai), में जन्मे) टाटा समुह के वर्तमान अध्यक्ष, जो भारत की सबसे बड़ी व्यापारिक समूह है, जिसकी स्थापना जमशेदजी टाटा ने की और उनके परिवार की पीढियों ने इसका विस्तार किया और इसे दृढ़ बनाया।") ```
## Results ```bash +---------------------------+------------+ |chunk |ner_label | +---------------------------+------------+ |रिलायंस इंडस्ट्रीज़ लिमिटेड |ORGANISATION| |Reliance Industries Limited|ORGANISATION| |भारतीय |PLACE | |मुंबई |PLACE | |महाराष्ट्र |PLACE | |Maharashtra) |PLACE | |नवल टाटा |PERSON | |मुम्बई |PLACE | |Mumbai |PLACE | |टाटा समुह |ORGANISATION| |भारत |PLACE | |जमशेदजी टाटा |PERSON | +---------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_hi_en_ner_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|hi| |Size:|665.8 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - BertForTokenClassification - NerConverter - Finisher --- layout: model title: English DistilBertForTokenClassification Base Uncased model (from sarahmiller137) author: John Snow Labs name: distilbert_token_classifier_base_uncased_ft_conll2003 date: 2023-03-03 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-ft-conll2003` is an English model originally trained by `sarahmiller137`.
## Predicted Entities `LOC`, `ORG`, `PER`, `MISC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_uncased_ft_conll2003_en_4.3.0_3.0_1677881411947.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_uncased_ft_conll2003_en_4.3.0_3.0_1677881411947.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_uncased_ft_conll2003","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_uncased_ft_conll2003","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_base_uncased_ft_conll2003| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/sarahmiller137/distilbert-base-uncased-ft-conll2003 - https://aclanthology.org/W03-0419 - https://paperswithcode.com/sota?task=Token+Classification&dataset=conll2003 --- layout: model title: English T5ForConditionalGeneration Cased model (from erickfm) author: John Snow Labs name: t5_neutrally date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `neutrally` is an English model originally trained by `erickfm`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_neutrally_en_4.3.0_3.0_1675106407099.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_neutrally_en_4.3.0_3.0_1675106407099.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_neutrally","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_neutrally","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_neutrally| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|923.9 MB| ## References - https://huggingface.co/erickfm/neutrally - https://github.com/rpryzant/neutralizing-bias - https://nlp.stanford.edu/pubs/pryzant2020bias.pdf - https://en.wikipedia.org/wiki/BLEU - https://apps-summer22.ischool.berkeley.edu/neutrally/ --- layout: model title: Spanish BERT Base Cased Embedding author: John Snow Labs name: bert_base_cased date: 2021-09-07 tags: [spanish, open_source, bert_embeddings, cased, es] task: Embeddings language: es edition: Spark NLP 3.2.2 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description BETO is a BERT model trained on a large Spanish corpus. It is similar in size to BERT-Base and was trained with the Whole Word Masking technique. The original release provides TensorFlow and PyTorch checkpoints for the uncased and cased versions, along with results on Spanish benchmarks comparing BETO with Multilingual BERT and other (non-BERT-based) models. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_cased_es_3.2.2_3.0_1630999631885.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_cased_es_3.2.2_3.0_1630999631885.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_base_cased", "es") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_base_cased", "es") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.bert.base_cased").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_cased| |Compatibility:|Spark NLP 3.2.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Case sensitive:|true| ## Data Source The model is imported from: https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased --- layout: model title: English Named Entity Recognition (from DeDeckerThomas) author: John Snow Labs name: distilbert_ner_keyphrase_extraction_distilbert_inspec date: 2022-05-16 tags: [distilbert, ner, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyphrase-extraction-distilbert-inspec` is an English model originally trained by `DeDeckerThomas`. ## Predicted Entities `KEY` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_keyphrase_extraction_distilbert_inspec_en_3.4.2_3.0_1652721896350.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_keyphrase_extraction_distilbert_inspec_en_3.4.2_3.0_1652721896350.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_keyphrase_extraction_distilbert_inspec","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_keyphrase_extraction_distilbert_inspec","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_ner_keyphrase_extraction_distilbert_inspec| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/DeDeckerThomas/keyphrase-extraction-distilbert-inspec - https://paperswithcode.com/sota?task=Keyphrase+Extraction&dataset=inspec --- layout: model title: Legal License Agreement Document Binary Classifier (Longformer) author: John Snow Labs name: legclf_license_agreement date: 2022-12-18 tags: [en, legal, classification, licensed, document, longformer, license, agreement, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_license_agreement` model is a Longformer Document Classifier used to classify whether a document belongs to the class `license-agreement` or not (Binary Classification). Longformers are limited to 4096 tokens, so only the first 4096 tokens are taken into account. We have found that for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document itself without any extra leading material, 4096 tokens are enough to perform Document Classification. If your document needs more than 4096 tokens, you can split it into chunks of 4096 tokens, average their embeddings, and train on the averaged version, which means the whole document is taken into account.
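The chunk-and-average strategy described above can be sketched in plain Python. This is a minimal illustration, independent of Spark NLP; the 2-dimensional embeddings and the small chunk size in the usage note are made up for the example (in practice the chunks would hold 4096 token vectors):

```python
def chunk_and_average(embeddings, max_len=4096):
    """Split a token-embedding sequence into chunks of at most
    max_len vectors, mean-pool each chunk, then average the pooled
    chunk vectors so the whole document contributes."""
    if not embeddings:
        raise ValueError("empty embedding sequence")
    dim = len(embeddings[0])
    # Consecutive, non-overlapping chunks of up to max_len token vectors.
    chunks = [embeddings[i:i + max_len]
              for i in range(0, len(embeddings), max_len)]
    # Mean-pool each chunk dimension-wise.
    pooled = [[sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)]
              for chunk in chunks]
    # Average the chunk vectors into one document vector.
    return [sum(p[d] for p in pooled) / len(pooled) for d in range(dim)]
```

For example, `chunk_and_average([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], max_len=2)` pools the first two vectors to `[2.0, 3.0]`, leaves `[5.0, 6.0]` as its own chunk, and returns their average `[3.5, 4.5]`.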
## Predicted Entities `license-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_license_agreement_en_1.0.0_3.0_1671393662516.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_license_agreement_en_1.0.0_3.0_1671393662516.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_license_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[license-agreement]| |[other]| |[other]| |[license-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_license_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support license-agreement 0.97 0.96 0.96 98 other 0.98 0.99 0.98 223 accuracy - - 0.98 321 macro-avg 0.98 0.97 0.97 321 weighted-avg 0.98 0.98 0.98 321 ``` --- layout: model title: English Named Entity Recognition (from kSaluja) author: John Snow Labs name: bert_ner_autonlp_tele_red_data_model_585716433 date: 2022-05-09 tags: [bert, ner, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `autonlp-tele_red_data_model-585716433` is an English model originally trained by `kSaluja`. ## Predicted Entities `TARGET`, `SUGGESTIONTYPE`, `CALLTYPE`, `INSTRUMENT`, `BUYPRICE`, `HOLDINGPERIOD`, `STOPLOSS` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_autonlp_tele_red_data_model_585716433_en_3.4.2_3.0_1652096670651.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_autonlp_tele_red_data_model_585716433_en_3.4.2_3.0_1652096670651.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_autonlp_tele_red_data_model_585716433","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_autonlp_tele_red_data_model_585716433","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_autonlp_tele_red_data_model_585716433| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/kSaluja/autonlp-tele_red_data_model-585716433 --- layout: model title: Russian Fact Extraction NER author: John Snow Labs name: finner_bert_rufacts date: 2022-09-27 tags: [ru, licensed] task: Named Entity Recognition language: ru edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: FinanceBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Russian Named Entity Recognition model created for the RuREBus (Russian Relation Extraction for Business) shared task. The overall goal of the shared task was to develop business-oriented models capable of relation and/or fact extraction from texts. Many business applications require relation extraction. Although a few corpora contain texts annotated with relations, all of them are more academic in nature and differ from typical business applications, for a few reasons. First, the annotations are quite dense, i.e. almost every sentence contains an annotated relation. In contrast, business-oriented documents often contain far fewer annotated examples; there might be only one or two in an entire document. Second, the existing corpora cover everyday topics (family relationships, birth and death relations, purchase and sale relations, etc.), whereas business applications require other, domain-specific relations. The goal of the task is to compare relation-extraction methods in a setting closer to practice. For these reasons, the task uses documents produced by the Ministry of Economic Development of the Russian Federation.
The corpus contains regional reports and strategic plans. A part of the corpus is annotated with named entities (8 classes) and semantic relations (11 classes). In total there are approximately 300 annotated documents. The annotation schema and the guidelines for annotators can be found [here](https://github.com/dialogue-evaluation/RuREBus/blob/master/markup_instruction.pdf) (in Russian). The dataset consists of: 1. A train set with manually annotated named entities and relations. The first and second parts of the train set are available [here](https://github.com/dialogue-evaluation/RuREBus/tree/master/train_data) 2. A large corpus (approx. 280 million tokens) of raw free-form documents, produced by the Ministry of Economic Development. These documents come from the same domain as the train and the test set. This data is available [here](https://yadi.sk/d/9uKbo3p0ghdNpQ). 3. A test set without any annotations The predicted entities are: MET - Metric (productivity, growth...) ECO - Economical Entity / Concept (inner market, energy source...) BIN - 1-time action, binary (happened or not - construction, development, presence, absence...) CMP - Quantitative Comparison entity (increase, decrease...) QUA - Qualitative Comparison entity (stable, limited...) ACT - Activity (Restauration of buildings, Festivities in Cities...) INT - Institutions (Centers, Departments, etc) SOC - Social - Social object (Children, Elder people, Workers of X sector, ...)
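The token classifier emits one tag per token using B-/I- prefixes on the entity classes above (the standard BIO scheme, as seen in the Benchmarking section). A minimal sketch of collapsing such a tag sequence into (chunk, label) pairs; the tokens in the usage note are made up for the example:

```python
def bio_to_chunks(tokens, tags):
    """Collapse parallel token/BIO-tag sequences into (chunk, label) pairs.
    'B-X' starts a new chunk of class X; 'I-X' continues a chunk of the
    same class; anything else (e.g. 'O', or a mismatched 'I-') closes
    the current chunk."""
    chunks, cur_toks, cur_label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur_toks:
                chunks.append((" ".join(cur_toks), cur_label))
            cur_toks, cur_label = [tok], tag[2:]
        elif tag.startswith("I-") and cur_toks and tag[2:] == cur_label:
            cur_toks.append(tok)
        else:
            if cur_toks:
                chunks.append((" ".join(cur_toks), cur_label))
            cur_toks, cur_label = [], None
    if cur_toks:
        chunks.append((" ".join(cur_toks), cur_label))
    return chunks
```

For instance, `bio_to_chunks(["New", "York", "is", "big"], ["B-LOC", "I-LOC", "O", "O"])` returns `[("New York", "LOC")]`.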
## Predicted Entities `MET`, `ECO`, `BIN`, `CMP`, `QUA`, `ACT`, `INT`, `SOC` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FIN_NER_RUSSIAN_GOV/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_bert_rufacts_ru_1.0.0_3.0_1664277912886.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_bert_rufacts_ru_1.0.0_3.0_1664277912886.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "ru") \ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") ner_model = finance.BertForTokenClassification.pretrained("finner_bert_rufacts", "ru", "finance/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[ document_assembler, sentencerDL, tokenizer, ner_model, ner_converter ]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) text_list = ["""В рамках обеспечения использования в деятельности ОМСУ муниципального образования Московской области региональных и муниципальных информационных систем предусматривается решение задач"""] import pandas as pd from pyspark.sql import functions as F df = spark.createDataFrame(pd.DataFrame({"text" : text_list})) result = model.transform(df) result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \ .select(F.expr("cols['0']").alias("ner_chunk"), F.expr("cols['1']['entity']").alias("label")).show(truncate = False) ```
## Results ```bash +--------------------------------------------------+-----+ |ner_chunk |label| +--------------------------------------------------+-----+ |обеспечения |BIN | |ОМСУ муниципального образования |INST | |региональных и муниципальных информационных систем|ECO | +--------------------------------------------------+-----+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_bert_rufacts| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|ru| |Size:|358.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References https://github.com/dialogue-evaluation/RuREBus ## Benchmarking ```bash label precision recall f1-score support B-MET 0.7440 0.7440 0.7440 250 I-MET 0.8301 0.7704 0.7991 945 B-BIN 0.7248 0.7850 0.7537 614 B-ACT 0.6052 0.5551 0.5791 254 I-ACT 0.7215 0.6244 0.6695 892 B-ECO 0.6892 0.6813 0.6852 524 I-ECO 0.6750 0.6899 0.6824 861 B-CMP 0.8405 0.8354 0.8379 164 I-CMP 0.2000 0.0714 0.1053 14 B-INST 0.7152 0.7019 0.7085 161 I-INST 0.7560 0.7114 0.7330 440 B-SOC 0.5547 0.6698 0.6068 212 I-SOC 0.6178 0.7087 0.6601 381 B-QUA 0.6167 0.7303 0.6687 152 I-QUA 0.7333 0.4400 0.5500 25 micro-avg 0.7107 0.7017 0.7062 5927 macro-avg 0.6610 0.6337 0.6413 5927 weighted-avg 0.7136 0.7017 0.7059 5927 ``` --- layout: model title: Multilingual DistilBertForQuestionAnswering Cased model (from SauravMaheshkar) author: John Snow Labs name: distilbert_qa_finetuned_for_xqua_on_chaii date: 2023-01-03 tags: [xx, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`distilbert-multi-finetuned-for-xqua-on-chaii` is a Multilingual model originally trained by `SauravMaheshkar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_finetuned_for_xqua_on_chaii_xx_4.3.0_3.0_1672774178039.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_finetuned_for_xqua_on_chaii_xx_4.3.0_3.0_1672774178039.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_finetuned_for_xqua_on_chaii","xx")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_finetuned_for_xqua_on_chaii","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_finetuned_for_xqua_on_chaii| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|505.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/SauravMaheshkar/distilbert-multi-finetuned-for-xqua-on-chaii --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from nbroad) author: John Snow Labs name: roberta_qa_base_super_2 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rob-base-superqa2` is an English model originally trained by `nbroad`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_super_2_en_4.3.0_3.0_1674212395438.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_super_2_en_4.3.0_3.0_1674212395438.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_super_2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_super_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_super_2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/nbroad/rob-base-superqa2 - https://paperswithcode.com/sota?task=Question+Answering&dataset=squad_v2 --- layout: model title: Pipeline to Detect Clinical Entities in Romanian (Bert, Base, Cased) author: John Snow Labs name: ner_clinical_bert_pipeline date: 2023-03-09 tags: [licensed, clinical, ro, ner, bert] task: Named Entity Recognition language: ro edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_clinical_bert](https://nlp.johnsnowlabs.com/2022/11/22/ner_clinical_bert_ro.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_bert_pipeline_ro_4.3.0_3.2_1678352766945.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_bert_pipeline_ro_4.3.0_3.2_1678352766945.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_clinical_bert_pipeline", "ro", "clinical/models") text = '''Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_clinical_bert_pipeline", "ro", "clinical/models") val text = "Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min." val result = pipeline.fullAnnotate(text) ```
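`fullAnnotate` returns, per input text, a dictionary of annotation lists keyed by output column. The chunk table shown in Results below can be produced by flattening the `ner_chunk` annotations with a small helper. This is a sketch only: the field names (`result`, `begin`, `end`, `metadata`) follow Spark NLP's `Annotation` class, but `SimpleAnnotation` here is a stand-in for illustration, not real pipeline output.

```python
# Sketch: flatten NER chunk annotations into rows like the Results table.
# Field names (result, begin, end, metadata) follow Spark NLP's Annotation
# class; SimpleAnnotation is a stand-in used only for illustration.
from dataclasses import dataclass, field

@dataclass
class SimpleAnnotation:
    result: str
    begin: int
    end: int
    metadata: dict = field(default_factory=dict)

def chunks_to_rows(chunks):
    """Turn chunk annotations into (text, begin, end, label, confidence) rows."""
    return [
        (c.result, c.begin, c.end,
         c.metadata.get("entity"), c.metadata.get("confidence"))
        for c in chunks
    ]

chunks = [
    SimpleAnnotation("Angio CT", 12, 19, {"entity": "Imaging_Test", "confidence": "0.96415"}),
    SimpleAnnotation("cardio-toracic", 21, 34, {"entity": "Body_Part", "confidence": "0.4776"}),
]
rows = chunks_to_rows(chunks)
print(rows[0])  # ('Angio CT', 12, 19, 'Imaging_Test', '0.96415')
```

The same helper works for any NER pipeline on this page, since all of them emit chunk annotations in the same shape.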
## Results

```bash
|    | ner_chunks                        |   begin |   end | ner_label                 |   confidence |
|---:|:----------------------------------|--------:|------:|:--------------------------|-------------:|
|  0 | Angio CT                          |      12 |    19 | Imaging_Test              |      0.96415 |
|  1 | cardio-toracic                    |      21 |    34 | Body_Part                 |      0.4776  |
|  2 | Atrezie                           |      53 |    59 | Disease_Syndrome_Disorder |      0.9602  |
|  3 | valva pulmonara                   |      64 |    78 | Body_Part                 |      0.73105 |
|  4 | Hipoplazie                        |      81 |    90 | Disease_Syndrome_Disorder |      0.628   |
|  5 | VS                                |      92 |    93 | Body_Part                 |      0.9543  |
|  6 | Atrezie                           |      96 |   102 | Disease_Syndrome_Disorder |      0.8763  |
|  7 | VAV stang                         |     104 |   112 | Body_Part                 |      0.9444  |
|  8 | Anastomoza Glenn                  |     115 |   130 | Disease_Syndrome_Disorder |      0.8648  |
|  9 | Tromboza                          |     137 |   144 | Disease_Syndrome_Disorder |      0.991   |
| 10 | GE Revolution HD                  |     238 |   253 | Medical_Device            |      0.668367 |
| 11 | Branula albastra                  |     256 |   271 | Medical_Device            |      0.9179  |
| 12 | membrului superior                |     292 |   309 | Body_Part                 |      0.98815 |
| 13 | drept                             |     311 |   315 | Direction                 |      0.5645  |
| 14 | Scout                             |     318 |   322 | Body_Part                 |      0.3484  |
| 15 | 30 ml                             |     342 |   346 | Dosage                    |      0.9996  |
| 16 | Iomeron 350                       |     348 |   358 | Drug_Ingredient           |      0.9822  |
| 17 | 2.2 ml/s                          |     368 |   375 | Dosage                    |      0.9327  |
| 18 | 20 ml                             |     388 |   392 | Dosage                    |      0.9977  |
| 19 | ser fiziologic                    |     394 |   407 | Drug_Ingredient           |      0.9609  |
| 20 | angio-CT                          |     452 |   459 | Imaging_Test              |      0.9965  |
| 21 | cardiotoracica                    |     461 |   474 | Body_Part                 |      0.9344  |
| 22 | achizitii secventiale prospective |     479 |   511 | Imaging_Technique         |      0.966833 |
| 23 | 100/min                           |     546 |   552 | Pulse                     |      0.9128  |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_clinical_bert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|ro|
|Size:|483.6 MB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- BertEmbeddings
- MedicalNerModel
- NerConverterInternalModel

--- layout: model title: Legal Subrogation Clause Binary 
Classifier author: John Snow Labs name: legclf_subrogation_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" ---

## Description

This model is a Binary Classifier (True, False) for the `subrogation` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level.

If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Note that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial linked above).

This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
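The paragraph-splitting approach mentioned above can be as simple as breaking the text on blank lines before feeding each piece to the classifier. A minimal stdlib sketch (not the workshop notebook's exact code):

```python
# Sketch: split a long document into paragraph-sized pieces on blank lines,
# so each piece fits within the classifier's 512-token context window.
import re

def split_paragraphs(text: str):
    """Split on runs of blank lines and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "SUBROGATION.\nThe insurer is subrogated to the rights of the insured.\n\nNOTICES.\nAll notices shall be in writing."
pieces = split_paragraphs(doc)
print(len(pieces))  # 2
```

Each resulting piece can then be passed through the classification pipeline shown below.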
## Predicted Entities `other`, `subrogation` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_subrogation_clause_en_1.0.0_3.2_1660123050748.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_subrogation_clause_en_1.0.0_3.2_1660123050748.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_subrogation_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
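As the description notes, several binary clause classifiers can be run over the same clause text and their outputs combined into per-clause flags. A stdlib sketch of that post-processing step — the model names and the convention that each classifier returns either its own positive label or `other` are illustrative, not an official API:

```python
# Sketch: combine outputs of several binary clause classifiers into a
# clause -> bool map. Each classifier is assumed to return its positive
# label (e.g. "subrogation") or "other"; names here are illustrative.
def combine_clause_flags(predictions: dict) -> dict:
    """predictions maps clause name -> predicted label for one text."""
    return {clause: (label == clause) for clause, label in predictions.items()}

preds = {"subrogation": "subrogation", "subsidiaries": "other"}
flags = combine_clause_flags(preds)
print(flags)  # {'subrogation': True, 'subsidiaries': False}
```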
## Results

```bash
+-------------+
|result       |
+-------------+
|[subrogation]|
|[other]      |
|[other]      |
|[subrogation]|
+-------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_subrogation_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
       other       0.92    0.97      0.95       89
 subrogation       0.90    0.79      0.84       33
    accuracy          -       -      0.92      122
   macro-avg       0.91    0.88      0.89      122
weighted-avg       0.92    0.92      0.92      122
```

---
layout: model
title: German Public Health Mention Sequence Classifier (GBERT-large)
author: John Snow Labs
name: bert_sequence_classifier_health_mentions_gbert_large
date: 2022-08-10
tags: [public_health, de, licensed, sequence_classification, health_mention]
task: Text Classification
language: de
edition: Healthcare NLP 4.0.2
spark_version: 3.0
supported: true
recommended: true
annotator: MedicalBertForSequenceClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a [GBERT-large](https://arxiv.org/pdf/2010.10906.pdf)-based sequence classification model that can classify public health mentions in German social media text.

## Predicted Entities

`non-health`, `health-related`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_health_mentions_gbert_large_de_4.0.2_3.0_1660133721898.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_health_mentions_gbert_large_de_4.0.2_3.0_1660133721898.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_health_mentions_gbert_large", "de", "clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame([ ["Durch jahrelanges Rauchen habe ich meine Lunge einfach zu sehr geschädigt - Punkt."], ["die Schatzsuche war das Highlight beim Kindergeburtstag, die kids haben noch lange davon gesprochen"] ]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_health_mentions_gbert_large", "de", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq(Array("Durch jahrelanges Rauchen habe ich meine Lunge einfach zu sehr geschädigt - Punkt.", "Das Gefühl kenne ich auch denke, dass es vorallem mit der Sorge um das Durchfallen zusammenhängt.")).toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.classify.bert_sequence.health_mentions_gbert_large").predict("""die Schatzsuche war das Highlight beim Kindergeburtstag, die kids haben noch lange davon gesprochen""") ```
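The Benchmarking numbers reported further down in these cards are standard precision/recall/F1 figures. For reference, they can be recomputed from raw true/false positive/negative counts — a stdlib sketch, not the evaluation code actually used:

```python
# Sketch: precision, recall and F1 from raw counts, as reported in the
# Benchmarking sections of these model cards. Counts are illustrative.
def prf(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f1 = prf(tp=81, fp=1, fn=1)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.99 0.99 0.99
```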
## Results ```bash +---------------------------------------------------------------------------------------------------+----------------+ |text |result | +---------------------------------------------------------------------------------------------------+----------------+ |Durch jahrelanges Rauchen habe ich meine Lunge einfach zu sehr geschädigt - Punkt. |[health-related]| |die Schatzsuche war das Highlight beim Kindergeburtstag, die kids haben noch lange davon gesprochen|[non-health] | +---------------------------------------------------------------------------------------------------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_health_mentions_gbert_large| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|de| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References Curated from several academic and in-house datasets. ## Benchmarking ```bash label precision recall f1-score support non-health 0.99 0.99 0.99 82 health-related 0.99 0.99 0.99 69 accuracy - - 0.99 151 macro-avg 0.99 0.99 0.99 151 weighted-avg 0.99 0.99 0.99 151 ``` --- layout: model title: Legal Free Movement Of Capital Document Classifier (EURLEX) author: John Snow Labs name: legclf_free_movement_of_capital_bert date: 2023-03-06 tags: [en, legal, classification, clauses, free_movement_of_capital, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. 
The legclf_free_movement_of_capital_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class Free_Movement_of_Capital or not (Binary Classification), according to EuroVoc labels.

## Predicted Entities

`Free_Movement_of_Capital`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_free_movement_of_capital_bert_en_1.0.0_3.0_1678111740710.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_free_movement_of_capital_bert_en_1.0.0_3.0_1678111740710.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_free_movement_of_capital_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+--------------------------+
|result                    |
+--------------------------+
|[Free_Movement_of_Capital]|
|[Other]                   |
|[Other]                   |
|[Free_Movement_of_Capital]|
+--------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_free_movement_of_capital_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.6 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
                   label  precision  recall  f1-score  support
Free_Movement_of_Capital       0.85    0.93      0.89       67
                   Other       0.91    0.82      0.86       61
                accuracy          -       -      0.88      128
               macro-avg       0.88    0.87      0.87      128
            weighted-avg       0.88    0.88      0.87      128
```

---
layout: model
title: Legal Subsidiaries Clause Binary Classifier
author: John Snow Labs
name: legclf_subsidiaries_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `subsidiaries` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level.

If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Note that the embeddings used by this model allow up to 512 tokens.
If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `subsidiaries`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_subsidiaries_clause_en_1.0.0_3.2_1660124029869.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_subsidiaries_clause_en_1.0.0_3.2_1660124029869.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_subsidiaries_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+--------------+
|result        |
+--------------+
|[subsidiaries]|
|[other]       |
|[other]       |
|[subsidiaries]|
+--------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_subsidiaries_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.8 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
       other       0.98    0.95      0.97       63
subsidiaries       0.88    0.96      0.92       23
    accuracy          -       -      0.95       86
   macro-avg       0.93    0.95      0.94       86
weighted-avg       0.96    0.95      0.95       86
```

---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from google)
author: John Snow Labs
name: t5_efficient_base_el6
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-el6` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_el6_en_4.3.0_3.0_1675111427354.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_el6_en_4.3.0_3.0_1675111427354.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_base_el6","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_base_el6","en")
  .setInputCols("document")
  .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_base_el6|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|393.2 MB|

## References

- https://huggingface.co/google/t5-efficient-base-el6
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145

---
layout: model
title: Legal Information And Information Processing Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_information_and_information_processing_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, information_and_information_processing, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.

The legclf_information_and_information_processing_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class Information_and_Information_Processing or not (Binary Classification), according to EuroVoc labels.
## Predicted Entities `Information_and_Information_Processing`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_information_and_information_processing_bert_en_1.0.0_3.0_1678111806432.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_information_and_information_processing_bert_en_1.0.0_3.0_1678111806432.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_information_and_information_processing_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Information_and_Information_Processing]| |[Other]| |[Other]| |[Information_and_Information_Processing]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_information_and_information_processing_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.7 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Information_and_Information_Processing 0.79 0.96 0.87 47 Other 0.93 0.70 0.80 40 accuracy - - 0.84 87 macro-avg 0.86 0.83 0.83 87 weighted-avg 0.86 0.84 0.84 87 ``` --- layout: model title: Detect Anatomical Structures in Medical Text author: John Snow Labs name: bert_token_classifier_ner_anatem date: 2022-07-25 tags: [en, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Anatomical entities ranging from subcellular structures to organ systems are central to biomedical science, and mentions of these entities are essential to understanding the scientific literature. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. The model detects anatomical structures from a medical text. 
## Predicted Entities `Anatomy` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_anatem_en_4.0.0_3.0_1658749376934.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_anatem_en_4.0.0_3.0_1658749376934.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_anatem", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(512)

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    ner_model,
    ner_converter
])

data = spark.createDataFrame([["""Malignant cells often display defects in autophagy, an evolutionarily conserved pathway for degrading long-lived proteins and cytoplasmic organelles. However, as yet, there is no genetic evidence for a role of autophagy genes in tumor suppression. 
The beclin 1 autophagy gene is monoallelically deleted in 40 - 75 % of cases of human sporadic breast, ovarian, and prostate cancer."""]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_anatem", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("ner")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, ner_model, ner_converter))

val data = Seq("""Malignant cells often display defects in autophagy, an evolutionarily conserved pathway for degrading long-lived proteins and cytoplasmic organelles. However, as yet, there is no genetic evidence for a role of autophagy genes in tumor suppression. The beclin 1 autophagy gene is monoallelically deleted in 40 - 75 % of cases of human sporadic breast, ovarian, and prostate cancer.""").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.anatem").predict("""Malignant cells often display defects in autophagy, an evolutionarily conserved pathway for degrading long-lived proteins and cytoplasmic organelles. However, as yet, there is no genetic evidence for a role of autophagy genes in tumor suppression. The beclin 1 autophagy gene is monoallelically deleted in 40 - 75 % of cases of human sporadic breast, ovarian, and prostate cancer.""")
```
## Results ```bash +----------------------+-------+ |ner_chunk |label | +----------------------+-------+ |Malignant cells |Anatomy| |cytoplasmic organelles|Anatomy| |tumor |Anatomy| |breast |Anatomy| |ovarian |Anatomy| |prostate cancer |Anatomy| +----------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_anatem| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References [https://github.com/cambridgeltl/MTL-Bioinformatics-2016](https://github.com/cambridgeltl/MTL-Bioinformatics-2016) ## Benchmarking ```bash label precision recall f1-score support B-Anatomy 0.8489 0.9380 0.8912 4616 I-Anatomy 0.9190 0.8839 0.9011 3247 micro-avg 0.8755 0.9157 0.8951 7863 macro-avg 0.8839 0.9110 0.8962 7863 weighted-avg 0.8778 0.9157 0.8953 7863 ``` --- layout: model title: Multilingual T5ForConditionalGeneration Base Cased model (from sonoisa) author: John Snow Labs name: t5_base_english_japanese date: 2023-01-30 tags: [en, ja, multilingual, open_source, t5, xx, tensorflow] task: Text Generation language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-english-japanese` is a Multilingual model originally trained by `sonoisa`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_english_japanese_xx_4.3.0_3.0_1675108659585.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_english_japanese_xx_4.3.0_3.0_1675108659585.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_base_english_japanese","xx") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_base_english_japanese","xx")
  .setInputCols("document")
  .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_base_english_japanese|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|xx|
|Size:|521.2 MB|

## References

- https://huggingface.co/sonoisa/t5-base-english-japanese
- https://en.wikipedia.org
- https://ja.wikipedia.org
- https://oscar-corpus.com
- http://data.statmt.org/cc-100/
- https://github.com/sonoisa/t5-japanese
- https://creativecommons.org/licenses/by-sa/4.0/deed.ja
- http://commoncrawl.org/terms-of-use/

---
layout: model
title: Translate Kirundi to English Pipeline
author: John Snow Labs
name: translate_rn_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, rn, en, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `rn` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_rn_en_xx_2.7.0_2.4_1609688989566.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_rn_en_xx_2.7.0_2.4_1609688989566.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_rn_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_rn_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.rn.translate_to.en').predict(text, output_level='sentence') translate_df ```
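`annotate` returns a plain Python dict mapping each output column to a list of strings. The sketch below shows one way to read it; the dict contents (including the `translation` key and its value) are hypothetical placeholders, not actual pipeline output:

```python
# Hypothetical annotate() output: output-column name -> list of strings.
annotations = {
    "sentence": ["Your sentence to translate!"],
    "translation": ["TRANSLATED SENTENCE HERE"],
}

# Join the translated sentences back into a single string.
translated = " ".join(annotations.get("translation", []))
print(translated)
```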
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_rn_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal NER for NDA (Applicable Law Clause) author: John Snow Labs name: legner_nda_applicable_law date: 2023-04-06 tags: [en, licensed, legal, ner, nda, law] task: Named Entity Recognition language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an NER model, intended to be run **only** after detecting the `APPLIC_LAW` clause with a proper classifier (use `legmulticlf_mnda_sections_paragraph_other` for that purpose). It will extract the following entities: `LAW` and `LAW_LOCATION`. ## Predicted Entities `LAW`, `LAW_LOCATION` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_nda_applicable_law_en_1.0.0_3.0_1680819415977.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_nda_applicable_law_en_1.0.0_3.0_1680819415977.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_nda_applicable_law", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""This Agreement will be governed and construed in accordance with the laws of the State of Utah without regard to the conflicts of laws or principles thereof."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
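For intuition, `ner_converter` merges the token-level IOB tags emitted in the `ner` column into entity chunks. A simplified, Spark-free sketch of that merge (the token/tag inputs below are illustrative, chosen to match the Results table):

```python
def iob_to_chunks(tokens, tags):
    """Greedy IOB merge: B- starts a chunk, I- extends it (simplified NerConverter)."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["the", "laws", "of", "the", "State", "of", "Utah"]
tags   = ["O", "B-LAW", "O", "O", "B-LAW_LOCATION", "I-LAW_LOCATION", "I-LAW_LOCATION"]
print(iob_to_chunks(tokens, tags))
# → [('laws', 'LAW'), ('State of Utah', 'LAW_LOCATION')]
```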
## Results ```bash +-------------+------------+ |chunk |ner_label | +-------------+------------+ |laws |LAW | |State of Utah|LAW_LOCATION| |laws |LAW | +-------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_nda_applicable_law| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References In-house annotations on the Non-disclosure Agreements ## Benchmarking ```bash label precision recall f1-score support LAW 0.97 0.97 0.97 70 LAW_LOCATION 0.92 0.94 0.93 51 micro-avg 0.95 0.96 0.95 121 macro-avg 0.95 0.96 0.95 121 weighted-avg 0.95 0.96 0.95 121 ``` --- layout: model title: Chinese T5ForConditionalGeneration Cased model (from hululuzhu) author: John Snow Labs name: t5_chinese_couplet_mengzi_finetune date: 2023-01-30 tags: [zh, open_source, t5, tensorflow] task: Text Generation language: zh edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese-couplet-t5-mengzi-finetune` is a Chinese model originally trained by `hululuzhu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_chinese_couplet_mengzi_finetune_zh_4.3.0_3.0_1675100437975.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_chinese_couplet_mengzi_finetune_zh_4.3.0_3.0_1675100437975.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_chinese_couplet_mengzi_finetune","zh") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_chinese_couplet_mengzi_finetune","zh") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_chinese_couplet_mengzi_finetune| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|zh| |Size:|1.0 GB| ## References - https://huggingface.co/hululuzhu/chinese-couplet-t5-mengzi-finetune - https://github.com/hululuzhu/chinese-ai-writing-share - https://github.com/hululuzhu/chinese-ai-writing-share/tree/main/slides - https://github.com/wb14123/couplet-dataset --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from BruceZJC) author: John Snow Labs name: distilbert_qa_brucezjc_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `BruceZJC`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_brucezjc_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768353044.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_brucezjc_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768353044.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_brucezjc_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_brucezjc_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
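Under the hood, extractive QA models score each token as a possible answer start or end and return the best-scoring span. A toy, Spark-free sketch of that span selection (the logits below are made up for illustration):

```python
def best_span(start_logits, end_logits, tokens):
    """Pick the (start, end) pair maximizing start+end score, with start <= end."""
    best, answer = float("-inf"), ""
    for i, s in enumerate(start_logits):
        for j in range(i, len(end_logits)):
            if s + end_logits[j] > best:
                best = s + end_logits[j]
                answer = " ".join(tokens[i:j + 1])
    return answer

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start  = [0.1, 0.0, 0.2, 5.0, 0.0, 0.0, 0.1, 0.0, 0.3, 0.0]
end    = [0.0, 0.1, 0.0, 6.0, 0.0, 0.0, 0.0, 0.1, 0.4, 0.0]
print(best_span(start, end, tokens))  # → Clara
```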
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_brucezjc_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/BruceZJC/distilbert-base-uncased-finetuned-squad --- layout: model title: Sentence Entity Resolver for RxCUI (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_rxcui date: 2021-05-16 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts (like drugs/ingredients) to RxCUI codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It loads about 6X faster than previous versions, and the load process is more memory-friendly: the peak memory required during loading is smaller, which reduces the chance of OOM exceptions and relaxes hardware requirements. ## Predicted Entities Predicts RxCUI Codes and their normalized definition for each chunk. 
{:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxcui_en_3.0.4_3.0_1621189488426.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxcui_en_3.0.4_3.0_1621189488426.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The ```sbiobertresolve_rxcui``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_posology``` as the NER model, with ```DRUG``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") rxcui_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxcui","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, rxcui_resolver]) data = spark.createDataFrame([["He was seen by the endocrinology service and she was discharged on 50 mg of eltrombopag oral at night, 5 mg amlodipine with meals, and metformin 1000 mg two times a day"]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala ... val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val rxcui_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_rxcui","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, rxcui_resolver)) val data = Seq("He was seen by the endocrinology service and she was discharged on 50 mg of eltrombopag oral at night, 5 mg amlodipine with meals, and metformin 1000 mg two times a day").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.rxcui").predict("""He was seen 
by the endocrinology service and she was discharged on 50 mg of eltrombopag oral at night, 5 mg amlodipine with meals, and metformin 1000 mg two times a day""") ```
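Conceptually, the resolver embeds each drug chunk with the sentence encoder and returns the RxCUI code whose reference embedding is closest under the configured `EUCLIDEAN` distance. A toy, Spark-free sketch (the 3-d vectors and the code table below are illustrative; real sbiobert sentence embeddings are 768-dimensional):

```python
import math

def resolve(chunk_emb, code_embs):
    """Return the code whose embedding is nearest (Euclidean) to the chunk embedding."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(code_embs, key=lambda code: dist(chunk_emb, code_embs[code]))

# Toy reference embeddings, keyed by RxCUI code (values are made up).
code_embs = {
    "825427": [0.9, 0.1, 0.0],   # eltrombopag 50 MG Oral Tablet
    "197361": [0.1, 0.9, 0.0],   # amlodipine 5 MG Oral Tablet
}
print(resolve([0.8, 0.2, 0.1], code_embs))  # → 825427
```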
## Results ```bash +---------------------------+--------+-----------------------------------------------------+ | chunk | code | term | +---------------------------+--------+-----------------------------------------------------+ | 50 mg of eltrombopag oral | 825427 | eltrombopag 50 MG Oral Tablet | | 5 mg amlodipine | 197361 | amlodipine 5 MG Oral Tablet | | metformin 1000 mg | 861004 | metformin hydrochloride 2000 MG Oral Tablet | +---------------------------+--------+-----------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_rxcui| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[rxcui_code]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on November 2020 RxNorm Clinical Drugs ontology graph with ``sbiobert_base_cased_mli`` embeddings. https://www.nlm.nih.gov/pubs/techbull/nd20/brief/nd20_rx_norm_november_release.html. [Sample Content](https://rxnav.nlm.nih.gov/REST/rxclass/class/byRxcui.json?rxcui=1000000). --- layout: model title: English image_classifier_vit_dog_races ViTForImageClassification from roschmid author: John Snow Labs name: image_classifier_vit_dog_races date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_dog_races` is an English model originally trained by roschmid. 
## Predicted Entities `Rottweiler`, `Shiba Inu`, `Chow Chow dog`, `Siberian Husky`, `German Shepherd`, `Pug`, `Border Collie`, `Golden Retriever`, `Tibetan Mastiff` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_dog_races_en_4.1.0_3.0_1660168522778.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_dog_races_en_4.1.0_3.0_1660168522778.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_dog_races", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_dog_races", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
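For intuition, the classifier emits one logit per breed label, and the predicted class is the softmax argmax. A toy, Spark-free sketch using this model's label set (the logits are made up for illustration):

```python
import math

LABELS = ["Rottweiler", "Shiba Inu", "Chow Chow dog", "Siberian Husky",
          "German Shepherd", "Pug", "Border Collie", "Golden Retriever",
          "Tibetan Mastiff"]

def classify(logits):
    """Softmax over class logits; return the top (label, probability) pair."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    i = probs.index(max(probs))
    return LABELS[i], probs[i]

# Toy logits; a real model emits one logit per label in the same order.
label, prob = classify([0.1, 0.2, 0.1, 0.0, 2.5, 0.1, 0.3, 0.2, 0.1])
print(label)  # → German Shepherd
```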
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_dog_races| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Part of Speech for Norwegian author: John Snow Labs name: pos_ud_bokmaal date: 2020-05-05 18:57:00 +0800 task: Part of Speech Tagging language: nb edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [pos, nb] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_bokmaal_nb_2.5.0_2.4_1588693881973.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_bokmaal_nb_2.5.0_2.4_1588693881973.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_bokmaal", "nb") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Annet enn å være kongen i nord, er John Snow en engelsk lege og en leder innen utvikling av anestesi og medisinsk hygiene.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_bokmaal", "nb") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Annet enn å være kongen i nord, er John Snow en engelsk lege og en leder innen utvikling av anestesi og medisinsk hygiene.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Annet enn å være kongen i nord, er John Snow en engelsk lege og en leder innen utvikling av anestesi og medisinsk hygiene."""] pos_df = nlu.load('nb.pos.ud_bokmaal').predict(text) pos_df ```
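The `fullAnnotate` output pairs each token with its POS tag through the annotation metadata. A small, Spark-free sketch of reading those rows (the two rows below mirror the Results section of this card):

```python
# Illustrative annotation rows, mirroring the Results section of this card.
rows = [
    {"annotatorType": "pos", "begin": 0, "end": 4, "result": "DET",
     "metadata": {"word": "Annet"}},
    {"annotatorType": "pos", "begin": 6, "end": 8, "result": "SCONJ",
     "metadata": {"word": "enn"}},
]

def to_pairs(rows):
    """Pair each token (from metadata) with its predicted POS tag."""
    return [(r["metadata"]["word"], r["result"]) for r in rows]

print(to_pairs(rows))  # → [('Annet', 'DET'), ('enn', 'SCONJ')]
```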
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=4, result='DET', metadata={'word': 'Annet'}), Row(annotatorType='pos', begin=6, end=8, result='SCONJ', metadata={'word': 'enn'}), Row(annotatorType='pos', begin=10, end=10, result='PART', metadata={'word': 'å'}), Row(annotatorType='pos', begin=12, end=15, result='AUX', metadata={'word': 'være'}), Row(annotatorType='pos', begin=17, end=22, result='NOUN', metadata={'word': 'kongen'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_bokmaal| |Type:|pos| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|nb| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Identify Competing Organizations author: John Snow Labs name: finassertion_competitors date: 2022-08-17 tags: [en, finance, assertion, status, competitors, licensed] task: Assertion Status language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model allows you to identify whether the `ORG` and `PRODUCT` entities mentioned in a text refer to competitors. By default, if no competitor is mentioned, it returns `NO_COMPETITOR`. ## Predicted Entities `NO_COMPETITOR`, `COMPETITOR` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/ASSERTIONDL_COMPETITORS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finassertion_competitors_en_1.0.0_3.2_1660735220316.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finassertion_competitors_en_1.0.0_3.2_1660735220316.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python # Annotator that transforms a text column from dataframe into an Annotation ready for NLP from johnsnowlabs import * import pyspark.sql.functions as F documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") # Sentence Detector annotator, processes various sentences per line sentenceDetector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") # nlp.Tokenizer splits words in a relevant format for NLP tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model_org = finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = finance.NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(['ORG', 'PRODUCT']) assertion = finance.AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")\ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model_org, ner_converter, assertion ]) text = "Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc." data = spark.createDataFrame([[text]]).toDF("text") model = nlpPipeline.fit(data) model.transform(spark.createDataFrame([[text]]).toDF("text")).select(F.explode(F.arrays_zip('ner_chunk.result', 'assertion.result')).alias('result')).show(truncate=False) ```
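The final `select` above pairs each `ner_chunk` with its assertion label. As a plain-Python sketch of that pairing (the values mirror the example output):

```python
# Illustrative (chunk, assertion) pairs, mirroring the example output.
chunks = ["McAfee LLC", "Broadcom Inc"]
assertions = ["COMPETITOR", "COMPETITOR"]

# Keep only the chunks asserted to be competitors.
competitors = [c for c, a in zip(chunks, assertions) if a == "COMPETITOR"]
print(competitors)  # → ['McAfee LLC', 'Broadcom Inc']
```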
## Results ```bash +--------------------------+ |result | +--------------------------+ |[McAfee LLC, COMPETITOR] | |[Broadcom Inc, COMPETITOR]| +--------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finassertion_competitors| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, doc_chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| |Size:|2.2 MB| ## References In-house annotations from 10K Filings ## Benchmarking ```bash label tp fp fn prec rec f1 NO_COMPETITOR 158 0 1 1.0 0.9937107 0.9968454 COMPETITOR 25 1 0 0.9615384 1.0 0.9803921 Macro-average 183 1 1 0.9807692 0.9968554 0.9887469 Micro-average 183 1 1 0.9945652 0.9945652 0.9945652 ``` --- layout: model title: Relation extraction between body parts and problem entities (ReDL) author: John Snow Labs name: redl_bodypart_problem_biobert date: 2021-02-04 task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 2.7.3 spark_version: 2.4 tags: [licensed, clinical, en, relation_extraction] supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Relation extraction between body parts and problem entities in clinical texts. `1` : indicates a relation between a body part entity and an entity labeled as a problem (diagnosis, symptom, etc.); `0` : indicates no relation between body part and problem entities. 
## Predicted Entities `0`, `1` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.1.Clinical_Relation_Extraction_BodyParts_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_problem_biobert_en_2.7.3_2.4_1612446369008.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_problem_biobert_en_2.7.3_2.4_1612446369008.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = sparknlp.annotators.Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") words_embedder = WordEmbeddingsModel() \ .pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverter() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel() \ .pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") # Set a filter on pairs of named entities which will be treated as relation candidates re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks")\ .setRelationPairs(['SYMPTOM-EXTERNAL_BODY_PART_OR_REGION',"EXTERNAL_BODY_PART_OR_REGION-SYMPTOM"]) # The dataset this model is trained on is sentence-wise. # This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. 
re_model = RelationExtractionDLModel()\ .pretrained('redl_bodypart_problem_biobert', 'en', "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) text = "No neurologic deficits other than some numbness in his left hand." data = spark.createDataFrame([[text]]).toDF("text") p_model = pipeline.fit(data) result = p_model.transform(data) ``` ```scala ... val documenter = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = sparknlp.annotators.Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") .setRelationPairs(Array("SYMPTOM-EXTERNAL_BODY_PART_OR_REGION","EXTERNAL_BODY_PART_OR_REGION-SYMPTOM")) // 
The dataset this model is trained on is sentence-wise. // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. val re_model = RelationExtractionDLModel() .pretrained("redl_bodypart_problem_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("No neurologic deficits other than some numbness in his left hand.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.bodypart.problem").predict("""No neurologic deficits other than some numbness in his left hand.""") ```
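The `setPredictionThreshold(0.5)` call above keeps only relation predictions whose confidence reaches the threshold. A toy, Spark-free sketch of that filtering (the prediction dicts below are illustrative, not actual model output):

```python
def filter_relations(preds, threshold=0.5):
    """Keep relation predictions whose confidence meets the threshold."""
    return [p for p in preds if p["confidence"] >= threshold]

# Toy predictions; field names mirror the relation output columns.
preds = [
    {"relation": "1", "chunk1": "numbness", "chunk2": "hand", "confidence": 1.0},
    {"relation": "0", "chunk1": "deficits", "chunk2": "hand", "confidence": 0.42},
]
kept = filter_relations(preds)
print([p["chunk1"] for p in kept])  # → ['numbness']
```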
## Results ```bash | | relation | entity1 | chunk1 | entity2 | chunk2 | confidence | |---:|-----------:|:----------|:---------|:-----------------------------|:---------|-------------:| | 0 | 1 | Symptom | numbness | External_body_part_or_region | hand | 1 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_bodypart_problem_biobert| |Compatibility:|Healthcare NLP 2.7.3+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Data Source Trained on internal dataset. ## Benchmarking ```bash Relation Recall Precision F1 Support 0 0.762 0.814 0.787 315 1 0.938 0.917 0.927 885 Avg. 0.850 0.865 0.857 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from habib1030) author: John Snow Labs name: distilbert_qa_habib1030_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `habib1030`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_habib1030_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770984625.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_habib1030_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770984625.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_habib1030_base_uncased_finetuned_squad","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_habib1030_base_uncased_finetuned_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_habib1030_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/habib1030/distilbert-base-uncased-finetuned-squad

---
layout: model
title: Fast Neural Machine Translation Model from West Slavic languages to English
author: John Snow Labs
name: opus_mt_zlw_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, zlw, en, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.

- source languages: `zlw`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_zlw_en_xx_2.7.0_2.4_1609166809467.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_zlw_en_xx_2.7.0_2.4_1609166809467.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_zlw_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_zlw_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu

text = ["text to translate"]
opus_df = nlu.load('xx.zlw.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
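Because `MarianTransformer` translates one sentence at a time, the sentence-detector stage ahead of it matters: whole documents are split before translation. A toy stand-in for that splitting step (a naive regex, not `SentenceDetectorDLModel`):

```python
import re

def naive_sentence_split(text: str) -> list:
    # Crude stand-in for SentenceDetectorDLModel: break after sentence-final punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(naive_sentence_split("Witaj świecie. To jest test."))  # ['Witaj świecie.', 'To jest test.']
```

The DL detector handles abbreviations and other edge cases that this regex would get wrong, which is why the pretrained model is used in the pipeline.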
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_zlw_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English DistilBertForQuestionAnswering model (from aszidon) Custom Version-5
author: John Snow Labs
name: distilbert_qa_custom5
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbertcustom5` is an English model originally trained by `aszidon`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom5_en_4.0.0_3.0_1654728059999.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_custom5_en_4.0.0_3.0_1654728059999.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom5","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_custom5","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.distil_bert.custom5.by_aszidon").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
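In the NLU one-liner above, the question and context travel as a single string joined by `|||`, which NLU splits back into the two inputs. A minimal sketch of that packing convention (illustrative, not NLU's actual implementation):

```python
def split_qa_input(packed: str) -> tuple:
    """Split an NLU-style 'question|||context' string into its two parts."""
    question, _, context = packed.partition("|||")
    return question.strip(), context.strip()

q, c = split_qa_input("What is my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What is my name?
print(c)  # My name is Clara and I live in Berkeley.
```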
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_custom5|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/aszidon/distilbertcustom5

---
layout: model
title: English DistilBertForQuestionAnswering model (from deepakvk)
author: John Snow Labs
name: distilbert_qa_base_uncased_distilled_squad_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-distilled-squad-finetuned-squad` is an English model originally trained by `deepakvk`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_distilled_squad_finetuned_squad_en_4.0.0_3.0_1654723814436.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_distilled_squad_finetuned_squad_en_4.0.0_3.0_1654723814436.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_distilled_squad_finetuned_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_distilled_squad_finetuned_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_deepakvk").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_distilled_squad_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/deepakvk/distilbert-base-uncased-distilled-squad-finetuned-squad --- layout: model title: Danish XLMRobertaForTokenClassification Cased model (from saattrupdan) author: John Snow Labs name: xlmroberta_ner_employment_contract_ner date: 2022-08-01 tags: [da, open_source, xlm_roberta, ner] task: Named Entity Recognition language: da edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `employment-contract-ner-da` is a Danish model originally trained by `saattrupdan`. ## Predicted Entities `SALARY`, `STARTDATE`, `WORKHOURS`, `WORKPLACE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_employment_contract_ner_da_4.1.0_3.0_1659353422793.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_employment_contract_ner_da_4.1.0_3.0_1659353422793.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_employment_contract_ner","da") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_employment_contract_ner","da")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("document", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
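`NerConverter` is the stage that merges the token-level BIO tags in the `ner` column (e.g. `B-SALARY`, `I-SALARY`, `O`) into whole entity chunks. A self-contained sketch of that merging logic (illustrative only, with a made-up Danish example; not Spark NLP's implementation):

```python
def bio_to_chunks(tokens, tags):
    """Merge BIO-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" tag ends any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Lønnen", "er", "30.000", "kr.", "om", "måneden"]
tags = ["O", "O", "B-SALARY", "I-SALARY", "O", "O"]
print(bio_to_chunks(tokens, tags))  # [('30.000 kr.', 'SALARY')]
```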
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_employment_contract_ner|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|da|
|Size:|798.8 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/saattrupdan/employment-contract-ner-da

---
layout: model
title: French DistilBERT Embeddings (from Geotrend)
author: John Snow Labs
name: distilbert_embeddings_distilbert_base_fr_cased
date: 2022-04-12
tags: [distilbert, embeddings, fr, open_source]
task: Embeddings
language: fr
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: DistilBertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-fr-cased` is a French model originally trained by `Geotrend`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_fr_cased_fr_3.4.2_3.0_1649783659773.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_fr_cased_fr_3.4.2_3.0_1649783659773.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_fr_cased","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark Nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_fr_cased","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark Nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fr.embed.distilbert").predict("""J'adore Spark Nlp""") ```
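The `embeddings` column carries one vector per token (768 dimensions for this DistilBERT model). A typical downstream use is comparing tokens by cosine similarity; a toy example with made-up 3-d stand-in vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Made-up 3-d stand-ins for token vectors (real ones from this model are 768-d).
v_roi, v_reine, v_pomme = [1.0, 0.2, 0.0], [0.9, 0.3, 0.1], [0.0, 0.1, 1.0]
print(cosine(v_roi, v_reine) > cosine(v_roi, v_pomme))  # True: closer vectors score higher
```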
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_embeddings_distilbert_base_fr_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|fr|
|Size:|231.7 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/Geotrend/distilbert-base-fr-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers

---
layout: model
title: Clinical Deidentification (glove)
author: John Snow Labs
name: clinical_deidentification_glove
date: 2022-03-04
tags: [deidentification, en, licensed, pipeline]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pipeline is trained with lightweight `glove_100d` embeddings and can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`, `CITY`, `COUNTRY`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `STREET`, `USERNAME`, `ZIP`, `ACCOUNT`, `LICENSE`, `VIN`, `SSN`, `DLN`, `PLATE`, `IPADDR`, `EMAIL` entities.
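The pipeline produces several masked views of the text (entity labels, same-length character masks, fixed-length masks) plus an obfuscated view with realistic fake values. A simplified sketch of the three masking strategies, assuming a detected chunk's character offsets and label are already known (the pipeline's exact placeholder formats may differ):

```python
def mask_entity_label(text, start, end, label):
    """Replace the chunk with a label placeholder like <PHONE>."""
    return text[:start] + "<" + label + ">" + text[end:]

def mask_same_length(text, start, end):
    """Replace the chunk with a bracketed run of '*' of the same length."""
    return text[:start] + "[" + "*" * max(end - start - 2, 0) + "]" + text[end:]

def mask_fixed_length(text, start, end, n=4):
    """Replace the chunk with a fixed number of '*' characters."""
    return text[:start] + "*" * n + text[end:]

s = "Phone (302) 786-5227."
# "(302) 786-5227" occupies characters 6..20 of s.
print(mask_entity_label(s, 6, 20, "PHONE"))  # Phone <PHONE>.
print(mask_same_length(s, 6, 20))            # Phone [************].
print(mask_fixed_length(s, 6, 20))           # Phone ****.
```

Obfuscation goes one step further and substitutes plausible fake values of the same entity type, as shown in the Results section.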
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_glove_en_3.4.1_3.0_1646402880942.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_glove_en_3.4.1_3.0_1646402880942.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification_glove", "en", "clinical/models") sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""" result = deid_pipeline.annotate(sample) print("\n".join(result['masked'])) print("\n".join(result['masked_with_chars'])) print("\n".join(result['masked_fixed_length_chars'])) print("\n".join(result['obfuscated'])) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = new PretrainedPipeline("clinical_deidentification_glove","en","clinical/models") val result = deid_pipeline.annotate("Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.") ``` {:.nlu-block} ```python import nlu nlu.load("en.deid.glove_pipeline").predict("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""") ```
## Results ```bash Masked with entity labels ------------------------------ Name : , Record date: , # . Dr. , ID: , IP . He is a male was admitted to the for cystectomy on . Patient's VIN : , SSN , Driver's license . Phone , , , E-MAIL: . Masked with chars ------------------------------ Name : [**************], Record date: [********], # [****]. Dr. [********], ID: [********], IP [************]. He is a [*********] male was admitted to the [**********] for cystectomy on [******]. Patient's VIN : [***************], SSN [**********], Driver's license [*********]. Phone [************], [***************], [***********], E-MAIL: [*************]. Masked with fixed length chars ------------------------------ Name : ****, Record date: ****, # ****. Dr. ****, ID: ****, IP ****. He is a **** male was admitted to the **** for cystectomy on ****. Patient's VIN : ****, SSN ****, Driver's license ****. Phone ****, ****, ****, E-MAIL: ****. Obfuscated ------------------------------ Name : Berneta Anis, Record date: 2093-02-19, # U4660137. Dr. Dr Worley Colonel, ID: ZJ:9570208, IP 005.005.005.005. He is a 67 male was admitted to the ST. LUKE'S HOSPITAL AT THE VINTAGE for cystectomy on 06-02-1981. Patient's VIN : 3CCCC22DDDD333888, SSN SSN-618-77-1042, Driver's license W693817528998. Phone 0496 46 46 70, 3100 weston rd, Shattuck, E-MAIL: Freddi@hotmail.com. 
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|clinical_deidentification_glove|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|181.3 MB|

## Included Models

- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- ChunkMergeModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ChunkMergeModel
- ChunkMergeModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel
- Finisher

---
layout: model
title: Extract Biomarkers and their Results
author: John Snow Labs
name: ner_oncology_biomarker_wip
date: 2022-10-01
tags: [licensed, clinical, oncology, en, ner, biomarker]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model extracts mentions of biomarkers and biomarker results from oncology texts.

Definitions of Predicted Entities:

- `Biomarker`: Biological molecules that indicate the presence or absence of cancer, or the type of cancer (including oncogenes).
- `Biomarker_Result`: Terms or values that are identified as the result of a biomarker.
## Predicted Entities `Biomarker`, `Biomarker_Result` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_biomarker_wip_en_4.0.0_3.0_1664584581032.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_biomarker_wip_en_4.0.0_3.0_1664584581032.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_oncology_biomarker_wip", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter])

data = spark.createDataFrame([["The results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA), Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87%."]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_oncology_biomarker_wip", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA), Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87%.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_biomarker_wip").predict("""The results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA), Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87%.""") ```
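A common downstream step is linking each `Biomarker` to the `Biomarker_Result` (e.g. `negative`, `positive`) that governs it. A naive sketch that attaches the most recent preceding result to each biomarker (an illustration only; this pairing is not produced by the NER model itself and would miss results that follow their biomarker, such as `Ki-67 index ... 87%`):

```python
def pair_results(chunks):
    """chunks: list of (text, label) pairs in document order."""
    pairs, current = [], None
    for text, label in chunks:
        if label == "Biomarker_Result":
            current = text  # remember the latest result status
        elif label == "Biomarker":
            pairs.append((text, current))
    return pairs

chunks = [("negative", "Biomarker_Result"), ("CK7", "Biomarker"),
          ("positive", "Biomarker_Result"), ("CK20", "Biomarker"), ("Muc1", "Biomarker")]
print(pair_results(chunks))
# [('CK7', 'negative'), ('CK20', 'positive'), ('Muc1', 'positive')]
```

For production use, Healthcare NLP's relation extraction models are the more robust way to make such links.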
## Results

```bash
| chunk                                     | ner_label        |
|:------------------------------------------|:-----------------|
| negative                                  | Biomarker_Result |
| CK7                                       | Biomarker        |
| synaptophysin                             | Biomarker        |
| Syn                                       | Biomarker        |
| chromogranin A                            | Biomarker        |
| CgA                                       | Biomarker        |
| Muc5AC                                    | Biomarker        |
| human epidermal growth factor receptor-2  | Biomarker        |
| HER2                                      | Biomarker        |
| Muc6                                      | Biomarker        |
| positive                                  | Biomarker_Result |
| CK20                                      | Biomarker        |
| Muc1                                      | Biomarker        |
| Muc2                                      | Biomarker        |
| E-cadherin                                | Biomarker        |
| p53                                       | Biomarker        |
| Ki-67 index                               | Biomarker        |
| 87%                                       | Biomarker_Result |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_oncology_biomarker_wip|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|841.8 KB|

## References

In-house annotated oncology case reports.

## Benchmarking

```bash
label tp fp fn total precision recall f1
Biomarker_Result 877.0 205.0 229.0 1106.0 0.81 0.79 0.80
Biomarker 1305.0 166.0 163.0 1468.0 0.89 0.89 0.89
macro_avg 2182.0 371.0 392.0 2574.0 0.85 0.84 0.84
micro_avg NaN NaN NaN NaN 0.85 0.85 0.85
```

---
layout: model
title: English BertForTokenClassification Cased model (from ghadeermobasher)
author: John Snow Labs
name: bert_ner_BC4CHEMD_Original_BioBERT_384
date: 2022-07-06
tags: [bert, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Original-BioBERT-384` is an English model originally trained by `ghadeermobasher`.
## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Original_BioBERT_384_en_4.0.0_3.0_1657108919488.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Original_BioBERT_384_en_4.0.0_3.0_1657108919488.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token")

tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Original_BioBERT_384","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Original_BioBERT_384","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier))

val data = Seq("PUT YOUR STRING HERE").toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_ner_BC4CHEMD_Original_BioBERT_384|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|403.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/ghadeermobasher/BC4CHEMD-Original-BioBERT-384

---
layout: model
title: English RobertaForQuestionAnswering Cased model (from billfrench)
author: John Snow Labs
name: roberta_qa_cyberlandr_door
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cyberlandr-door` is an English model originally trained by `billfrench`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_cyberlandr_door_en_4.3.0_3.0_1674209416902.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_cyberlandr_door_en_4.3.0_3.0_1674209416902.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_cyberlandr_door","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_cyberlandr_door","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_cyberlandr_door| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|414.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/billfrench/cyberlandr-door --- layout: model title: Detect Entities in General Scope (Few-NERD dataset) author: John Snow Labs name: nerdl_fewnerd_100d date: 2021-07-02 tags: [ner, en, fewnerd, public, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.1.1 spark_version: 2.4 supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained on the Few-NERD/inter public dataset and extracts 8 entity types of general scope. ## Predicted Entities `PERSON`, `ORGANIZATION`, `LOCATION`, `ART`, `BUILDING`, `PRODUCT`, `EVENT`, `OTHER` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_FEW_NERD/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_FewNERD.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/nerdl_fewnerd_100d_en_3.1.1_2.4_1625227974733.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/nerdl_fewnerd_100d_en_3.1.1_2.4_1625227974733.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The model is trained using `glove_100d` word embeddings, so you should use the same embeddings in your NLP pipeline.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en")\ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") ner = NerDLModel.pretrained("nerdl_fewnerd_100d") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(['sentence', 'token', 'ner']) \ .setOutputCol('ner_chunk') nlp_pipeline = Pipeline(stages=[document_assembler, sentencer, tokenizer, embeddings, ner, ner_converter]) l_model = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = l_model.fullAnnotate("""The Double Down is a sandwich offered by Kentucky Fried Chicken (KFC) restaurants. He did not see active service again until 1882, when he took part in the Anglo-Egyptian War, and was present at the battle of Tell El Kebir (September 1882), for which he was mentioned in dispatches, received the Egypt Medal with clasp and the 3rd class of the Order of Medjidie, and was appointed a Companion of the Order of the Bath (CB).""") ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("nerdl_fewnerd_100d") .setInputCols(Array("sentence", "token", "embeddings")).setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner, ner_converter)) val data = Seq("The Double Down is a sandwich offered by Kentucky Fried Chicken (KFC) restaurants. 
He did not see active service again until 1882, when he took part in the Anglo-Egyptian War, and was present at the battle of Tell El Kebir (September 1882), for which he was mentioned in dispatches, received the Egypt Medal with clasp and the 3rd class of the Order of Medjidie, and was appointed a Companion of the Order of the Bath (CB).").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.ner.fewnerd").predict("""The Double Down is a sandwich offered by Kentucky Fried Chicken (KFC) restaurants. He did not see active service again until 1882, when he took part in the Anglo-Egyptian War, and was present at the battle of Tell El Kebir (September 1882), for which he was mentioned in dispatches, received the Egypt Medal with clasp and the 3rd class of the Order of Medjidie, and was appointed a Companion of the Order of the Bath (CB).""") ```
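`fullAnnotate` returns one dictionary per input text, keyed by output column. As a rough sketch of how the `ner_chunk` annotations can be tabulated into `(chunk, label)` pairs — the lightweight `Annotation` stand-in below only mimics the two fields read here (real Spark NLP annotations also carry begin/end offsets and more):

```python
from collections import namedtuple

# Stand-in for the Spark NLP Annotation object: only the fields we read.
Annotation = namedtuple("Annotation", ["result", "metadata"])

def extract_chunks(annotated):
    """Turn one fullAnnotate() row into (chunk text, entity label) pairs."""
    return [(a.result, a.metadata["entity"]) for a in annotated["ner_chunk"]]

# Toy row shaped like a fullAnnotate() result for the pipeline above.
row = {"ner_chunk": [
    Annotation("Double Down", {"entity": "PRODUCT"}),
    Annotation("Anglo-Egyptian War", {"entity": "EVENT"}),
]}
print(extract_chunks(row))  # [('Double Down', 'PRODUCT'), ('Anglo-Egyptian War', 'EVENT')]
```

The same pattern works for any chunk-producing pipeline; only the output column name changes.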
## Results ```bash +----------------------------------+---------+ |chunk |ner_label| +----------------------------------+---------+ |Double Down |PRODUCT | |Kentucky Fried Chicken |BUILDING | |KFC |BUILDING | |Anglo-Egyptian War |EVENT | |Tell El Kebir |EVENT | |Egypt Medal |OTHER | |Order of Medjidie |OTHER | |Companion of the Order of the Bath|OTHER | +----------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|nerdl_fewnerd_100d| |Type:|ner| |Compatibility:|Spark NLP 3.1.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Few-NERD:A Few-shot Named Entity Recognition Dataset, author: Ding, Ning and Xu, Guangwei and Chen, Yulin, and Wang, Xiaobin and Han, Xu and Xie, Pengjun and Zheng, Hai-Tao and Liu, Zhiyuan, book title: ACL-IJCNL, 2021. ## Benchmarking ```bash +------------+-------+------+-------+-------+---------+------+------+ | entity| tp| fp| fn| total|precision|recall| f1| +------------+-------+------+-------+-------+---------+------+------+ | PERSON|21555.0|6194.0| 5643.0|27198.0| 0.7768|0.7925|0.7846| |ORGANIZATION|36744.0|9059.0|13156.0|49900.0| 0.8022|0.7364|0.7679| | LOCATION|36367.0|7521.0| 7006.0|43373.0| 0.8286|0.8385|0.8335| | ART| 6170.0|1649.0| 2998.0| 9168.0| 0.7891| 0.673|0.7264| | BUILDING| 5112.0|2435.0| 3014.0| 8126.0| 0.6774|0.6291|0.6523| | PRODUCT| 8317.0|3253.0| 4325.0|12642.0| 0.7188|0.6579| 0.687| | OTHER|14461.0|4414.0| 5161.0|19622.0| 0.7661| 0.737|0.7513| | EVENT| 6024.0|1880.0| 2275.0| 8299.0| 0.7621|0.7259|0.7436| +------------+-------+------+-------+-------+---------+------+------+ +------------------+ | macro| +------------------+ |0.7433252741184967| +------------------+ +------------------+ | micro| +------------------+ |0.7703038245945377| +------------------+ ``` --- layout: model title: Stopwords Remover for Chinese language (1892 entries) author: John Snow Labs 
name: stopwords_iso date: 2022-03-07 tags: [stopwords, zh, open_source] task: Stop Words Removal language: zh edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_zh_3.4.1_3.0_1646673208417.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_zh_3.4.1_3.0_1646673208417.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","zh") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["除了作为北方之王之外,约翰·斯诺还是一位英国医生,也是麻醉和医疗卫生发展的领导者。"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","zh") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("除了作为北方之王之外,约翰·斯诺还是一位英国医生,也是麻醉和医疗卫生发展的领导者。").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.stopwords").predict("""Put your text here.""") ```
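Conceptually, the cleaner drops any token that appears in the language's stopword list. A minimal plain-Python sketch of that step — the stopword entries here are illustrative, not the model's actual 1892-entry list:

```python
# What StopWordsCleaner does per row, reduced to its essence: keep only
# tokens that are not in the stopword set (exact string match).
def clean_tokens(tokens, stopwords):
    return [t for t in tokens if t not in stopwords]

# Illustrative entries only; the real model ships 1892 Chinese stopwords.
stopwords = {"的", "和", "也", "是"}
tokens = ["约翰", "是", "医生", "和", "领导者"]
print(clean_tokens(tokens, stopwords))  # ['约翰', '医生', '领导者']
```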
## Results ```bash +-----------------------------------------------------------------------------------+ |result | +-----------------------------------------------------------------------------------+ |[除了作为北方之王之外,约翰·斯诺还是一位英国医生,也是麻醉和医疗卫生发展的领导者。]| +-----------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|zh| |Size:|8.1 KB| --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_10_h_768 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-10_H-768` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_10_h_768_zh_4.2.4_3.0_1670325770644.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_10_h_768_zh_4.2.4_3.0_1670325770644.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_10_h_768","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_10_h_768","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
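The `embeddings` column holds one vector per token (768-dimensional for this model). A common follow-up is to mean-pool the token vectors into a single sentence vector; a minimal sketch with toy 2-dimensional vectors:

```python
# Mean pooling: average the per-token vectors into one sentence vector.
def mean_pool(token_vectors):
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Toy values; the real model emits 768 dimensions per token.
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```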
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_10_h_768| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|330.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-10_H-768 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Detect PHI for Deidentification (Generic - Augmented) author: John Snow Labs name: ner_deid_generic_augmented date: 2021-06-30 tags: [clinical, deid, ner, en, licensed] task: [Named Entity Recognition, De-identification] language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 2.4 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The Named Entity Recognition annotator allows a generic model to be trained by utilizing a deep learning algorithm (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Generic - Augmented) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 7 entities. This NER model is trained with a combination of the i2b2 train set and an augmented version of the i2b2 train set. We adhered to the official annotation guidelines (AG) of the 2014 i2b2 Deid challenge while annotating new datasets for this model. 
All the details regarding the nuances and explanations for AG can be found here [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/) ## Predicted Entities `DATE`, `NAME`, `LOCATION`, `PROFESSION`, `CONTACT`, `AGE`, `ID` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_augmented_en_3.1.0_2.4_1625050266197.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_augmented_en_3.1.0_2.4_1625050266197.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pandas as pd document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") deid_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk_generic") nlpPipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, word_embeddings, deid_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame(pd.DataFrame({"text": ["""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. 
Phone +1 (302) 786-5227."""]}))) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val deid_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk_generic") val nlpPipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, deid_ner, ner_converter)) val data = Seq("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227.""").toDS.toDF("text") val result = nlpPipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.deid.generic_augmented").predict("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227.""") ```
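The `ner_chunk_generic` annotations carry inclusive `begin`/`end` character offsets, which is all that is needed to mask detected PHI in the source text. A hedged sketch — `mask_phi` and the hand-made spans below are illustrative only (Healthcare NLP also ships dedicated de-identification annotators for production use):

```python
# Replace each detected chunk with its label, working right-to-left so
# earlier offsets stay valid. begin/end are inclusive character offsets,
# matching the Spark NLP annotation convention.
def mask_phi(text, spans):
    for begin, end, label in sorted(spans, reverse=True):
        text = text[:begin] + "<" + label + ">" + text[end + 1:]
    return text

text = "Record date : 2093-01-13, David Hale, M.D."
spans = [(14, 23, "DATE"), (26, 35, "NAME")]  # hand-made offsets
print(mask_phi(text, spans))  # Record date : <DATE>, <NAME>, M.D.
```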
## Results ```bash +-----------------------------+---------+ |chunk |ner_label| +-----------------------------+---------+ |2093-01-13 |DATE | |David Hale |NAME | |Hendrickson, Ora |NAME | |7194334 |ID | |01/13/93 |DATE | |Oliveira |NAME | |25 |AGE | |1-11-2000 |DATE | |Cocke County Baptist Hospital|LOCATION | |0295 Keats Street |LOCATION | |(302) 786-5227 |CONTACT | +-----------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic_augmented| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source A custom data set which is created from the i2b2-PHI train and the augmented version of the i2b2-PHI train set is used. ## Benchmarking ```bash entity tp fp fn total precision recall f1 CONTACT 341.0 15.0 14.0 355.0 0.9579 0.9606 0.9592 NAME 5065.0 165.0 205.0 5270.0 0.9685 0.9611 0.9648 DATE 5532.0 53.0 110.0 5642.0 0.9905 0.9805 0.9855 ID 614.0 23.0 71.0 685.0 0.9639 0.8964 0.9289 LOCATION 2686.0 164.0 284.0 2970.0 0.9425 0.9044 0.923 PROFESSION 228.0 28.0 145.0 373.0 0.8906 0.6113 0.725 AGE 713.0 34.0 49.0 762.0 0.9545 0.9357 0.945 macro - - - - - - 0.91876 micro - - - - - - 0.95616 ``` --- layout: model title: Ukrainian XlmRoBertaForQuestionAnswering model (from robinhad) author: John Snow Labs name: xlmroberta_qa_ukrainian date: 2022-06-28 tags: [uk, open_source, xlmroberta, question_answering] task: Question Answering language: uk edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ukrainian-qa` is a Ukrainian model originally trained by `robinhad`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_qa_ukrainian_uk_4.0.0_3.0_1656418914986.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_qa_ukrainian_uk_4.0.0_3.0_1656418914986.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlmroberta_qa_ukrainian","uk") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Як мене звати?", "Мене звуть Клара, і я живу в Берклі."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlmroberta_qa_ukrainian","uk") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Як мене звати?", "Мене звуть Клара, і я живу в Берклі.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("uk.answer_question.xlmr_roberta").predict("""Як мене звати?|||Мене звуть Клара, і я живу в Берклі.""") ```
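Each question yields annotations in the `answer` column: the `result` field is the extracted span and the metadata typically carries a confidence score. A sketch of picking the highest-scoring answer when several candidates are present — the `Annotation` stand-in and the `score` metadata key are assumptions about the returned structure, not part of the API:

```python
from collections import namedtuple

# Stand-in for the Spark NLP Annotation object (result + metadata only).
Annotation = namedtuple("Annotation", ["result", "metadata"])

def best_answer(row):
    """Pick the highest-scoring answer annotation from one result row."""
    return max(row["answer"],
               key=lambda a: float(a.metadata.get("score", 0.0))).result

row = {"answer": [Annotation("Клара", {"score": "0.93"}),
                  Annotation("Берклі", {"score": "0.04"})]}
print(best_answer(row))  # Клара
```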
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_qa_ukrainian| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|uk| |Size:|402.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/robinhad/ukrainian-qa - https://github.com/fido-ai/ua-datasets/tree/main/ua_datasets/src/question_answering - https://github.com/robinhad/ukrainian-qa --- layout: model title: Fast Neural Machine Translation Model from English to Romance Languages author: John Snow Labs name: opus_mt_en_roa date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, roa, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `roa` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_roa_xx_2.7.0_2.4_1609168782781.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_roa_xx_2.7.0_2.4_1609168782781.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentences") marian = MarianTransformer.pretrained("opus_mt_en_roa", "xx")\ .setInputCols(["sentences"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Put your text to translate here."] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_roa", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Put your text to translate here.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.roa').predict(text, output_level='sentence') opus_df ```
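Because the sentence detector and the translator emit parallel annotation lists, pairing each source sentence with its translation is a simple zip over one `fullAnnotate` row. A minimal sketch, assuming the column names used in the Python snippet (`sentences`, `translation`) and a stand-in `Annotation` type:

```python
from collections import namedtuple

# Stand-in for the Spark NLP Annotation object (result field only).
Annotation = namedtuple("Annotation", ["result"])

def align_translations(row, src_col="sentences", tgt_col="translation"):
    """Pair each detected source sentence with its translated counterpart."""
    return list(zip((a.result for a in row[src_col]),
                    (a.result for a in row[tgt_col])))

row = {"sentences": [Annotation("Hello."), Annotation("How are you?")],
       "translation": [Annotation("Hola."), Annotation("¿Cómo estás?")]}
print(align_translations(row))
```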
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_roa| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Sentence Detection in Yiddish Text author: John Snow Labs name: sentence_detector_dl date: 2021-08-30 tags: [open_source, sentence_detection, yi] task: Sentence Detection language: yi edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_yi_3.2.0_3.0_1630323089681.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_yi_3.2.0_3.0_1630323089681.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "yi") \ .setInputCols(["document"]) \ .setOutputCol("sentences") sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL])) sd_model.fullAnnotate("""איר זוכט פֿאַר אַ גרויס מקור פון לייענען פּאַראַגראַפס אין ענגליש? איר'ווע קומען צו די רעכט אָרט. לויט צו אַ פריש לערנען, די מידע פון לייענען אין הייַנט ס יוגנט איז ראַפּאַדלי דיקריסינג. זיי קענען נישט פאָקוס אויף אַ געגעבן פּאַראַגראַף פֿאַר ענגליש לייענען פֿאַר מער ווי אַ ביסל סעקונדעס! לייענען איז געווען און איז אַ ינטאַגראַל טייל פון אַלע קאַמפּעטיטיוו יגזאַמז. אַזוי ווי טאָן איר פֿאַרבעסערן דיין לייענען סקילז? דער ענטפער צו דעם קשיא איז אַקשלי אן אנדער קשיא: וואָס איז די נוצן פון לייענען סקילז? דער הויפּט ציל פון לייענען איז 'צו מאַכן זינען'.""") ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "yi") .setInputCols(Array("document")) .setOutputCol("sentence") val pipeline = new Pipeline().setStages(Array(documenter, model)) val data = Seq("איר זוכט פֿאַר אַ גרויס מקור פון לייענען פּאַראַגראַפס אין ענגליש? איר'ווע קומען צו די רעכט אָרט. לויט צו אַ פריש לערנען, די מידע פון לייענען אין הייַנט ס יוגנט איז ראַפּאַדלי דיקריסינג. זיי קענען נישט פאָקוס אויף אַ געגעבן פּאַראַגראַף פֿאַר ענגליש לייענען פֿאַר מער ווי אַ ביסל סעקונדעס! לייענען איז געווען און איז אַ ינטאַגראַל טייל פון אַלע קאַמפּעטיטיוו יגזאַמז. אַזוי ווי טאָן איר פֿאַרבעסערן דיין לייענען סקילז? דער ענטפער צו דעם קשיא איז אַקשלי אן אנדער קשיא: וואָס איז די נוצן פון לייענען סקילז? 
דער הויפּט ציל פון לייענען איז 'צו מאַכן זינען'.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python nlu.load('yi.sentence_detector').predict("איר זוכט פֿאַר אַ גרויס מקור פון לייענען פּאַראַגראַפס אין ענגליש? איר'ווע קומען צו די רעכט אָרט. לויט צו אַ פריש לערנען, די מידע פון לייענען אין הייַנט ס יוגנט איז ראַפּאַדלי דיקריסינג. זיי קענען נישט פאָקוס אויף אַ געגעבן פּאַראַגראַף פֿאַר ענגליש לייענען פֿאַר מער ווי אַ ביסל סעקונדעס! לייענען איז געווען און איז אַ ינטאַגראַל טייל פון אַלע קאַמפּעטיטיוו יגזאַמז. אַזוי ווי טאָן איר פֿאַרבעסערן דיין לייענען סקילז? דער ענטפער צו דעם קשיא איז אַקשלי אן אנדער קשיא: וואָס איז די נוצן פון לייענען סקילז? דער הויפּט ציל פון לייענען איז 'צו מאַכן זינען'.", output_level ='sentence') ```
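Sentence-boundary quality is usually scored at the boundary level, as in this model's benchmark. As a sketch of how such numbers can be computed from predicted versus gold sentence-end character offsets (the offsets below are made up):

```python
# Score predicted sentence-end offsets against gold ones.
def boundary_scores(pred, gold):
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up offsets: two of three predicted boundaries match the gold set.
print(boundary_scores({10, 25, 40}, {10, 25, 44}))
```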
## Results ```bash +--------------------------------------------------------------------------------------------------------+ |result | +--------------------------------------------------------------------------------------------------------+ |[איר זוכט פֿאַר אַ גרויס מקור פון לייענען פּאַראַגראַפס אין ענגליש?] | |[איר'ווע קומען צו די רעכט אָרט.] | |[לויט צו אַ פריש לערנען, די מידע פון לייענען אין הייַנט ס יוגנט איז ראַפּאַדלי דיקריסינג.] | |[זיי קענען נישט פאָקוס אויף אַ געגעבן פּאַראַגראַף פֿאַר ענגליש לייענען פֿאַר מער ווי אַ ביסל סעקונדעס!] | |[לייענען איז געווען און איז אַ ינטאַגראַל טייל פון אַלע קאַמפּעטיטיוו יגזאַמז.] | |[אַזוי ווי טאָן איר פֿאַרבעסערן דיין לייענען סקילז?] | |[דער ענטפער צו דעם קשיא איז אַקשלי אן אנדער קשיא:] | |[וואָס איז די נוצן פון לייענען סקילז?] | |[דער הויפּט ציל פון לייענען איז 'צו מאַכן זינען'.] | +--------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentence_detector_dl| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[sentences]| |Language:|yi| ## Benchmarking ```bash label Accuracy Recall Prec F1 0 0.98 1.00 0.96 0.98 ``` --- layout: model title: English Named Entity Recognition (from kleinay) author: John Snow Labs name: bert_ner_nominalization_candidate_classifier date: 2022-05-09 tags: [bert, ner, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `nominalization-candidate-classifier` is an English model originally trained by `kleinay`. 
## Predicted Entities `False`, `True` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_nominalization_candidate_classifier_en_3.4.2_3.0_1652096794094.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_nominalization_candidate_classifier_en_3.4.2_3.0_1652096794094.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token")

tokenClassifier = BertForTokenClassification.pretrained("bert_ner_nominalization_candidate_classifier", "en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_nominalization_candidate_classifier", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier))

val data = Seq("I love Spark NLP").toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_ner_nominalization_candidate_classifier|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/kleinay/nominalization-candidate-classifier
- https://www.aclweb.org/anthology/2020.coling-main.274/
- https://github.com/kleinay/QANom

---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_efficient_small_el32
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-el32` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el32_en_4.3.0_3.0_1675120001777.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el32_en_4.3.0_3.0_1675120001777.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_small_el32", "en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_small_el32", "en")
  .setInputCols("document")
  .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_small_el32|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|305.0 MB|

## References

- https://huggingface.co/google/t5-efficient-small-el32
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145

---
layout: model
title: English BertForQuestionAnswering model (from amoux)
author: John Snow Labs
name: bert_qa_scibert_nli_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `scibert_nli_squad` is an English model originally trained by `amoux`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_scibert_nli_squad_en_4.0.0_3.0_1654189419720.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_scibert_nli_squad_en_4.0.0_3.0_1654189419720.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_scibert_nli_squad", "en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```

```scala
val document = new MultiDocumentAssembler()
  .setInputCols("question", "context")
  .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
  .pretrained("bert_qa_scibert_nli_squad", "en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
  ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
  ("What's my name?", "My name is Clara and I live in Berkeley."))
  .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.scibert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_scibert_nli_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|410.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/amoux/scibert_nli_squad

---
layout: model
title: Financial Executives Compensation Item Binary Classifier
author: John Snow Labs
name: finclf_executives_compensation_item
date: 2022-08-10
tags: [en, finance, classification, 10k, annual, reports, sec, filings, licensed]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `executives_compensation` item type of 10K Annual Reports. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.

If you have big financial documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Finance NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
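The paragraph-splitting approach described above can be sketched in plain Python. This is a minimal illustration only, not the Finance NLP splitting utilities from the tutorial, and the sample filing text is invented:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split on runs of blank lines ("multiline" splitting) and drop empty pieces."""
    pieces = re.split(r"\n\s*\n", text)
    return [p.strip() for p in pieces if p.strip()]

# Invented sample filing text, for illustration only.
doc = """Item 11. Executive Compensation.

The following table sets forth the compensation of our named executive officers.

Other unrelated boilerplate text."""

paragraphs = split_paragraphs(doc)
# Each paragraph would then become one row of the DataFrame fed to the pipeline,
# e.g. spark.createDataFrame([[p] for p in paragraphs]).toDF("text")
print(len(paragraphs))  # 3
```

Classifying each paragraph separately keeps every input within the model's context window while still covering the whole document.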
## Predicted Entities `other`, `executives_compensation` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_executives_compensation_item_en_1.0.0_3.2_1660154405348.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_executives_compensation_item_en_1.0.0_3.2_1660154405348.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") useEmbeddings = nlp.UniversalSentenceEncoder.pretrained() \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("finclf_executives_compensation_item", "en", "finance/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, useEmbeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-------------------------+
|result                   |
+-------------------------+
|[executives_compensation]|
|[other]                  |
|[other]                  |
|[executives_compensation]|
+-------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|finclf_executives_compensation_item|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.5 MB|

## References

Weak labelling on documents from Edgar database

## Benchmarking

```bash
label                    precision  recall  f1-score  support
executives_compensation  0.93       0.94    0.94      104
other                    0.93       0.92    0.92      83
accuracy                 -          -       0.93      187
macro-avg                0.93       0.93    0.93      187
weighted-avg             0.93       0.93    0.93      187
```

---
layout: model
title: Extract clinical problems (Voice of the Patients) - Granular version
author: John Snow Labs
name: ner_vop_problem_wip
date: 2023-04-20
tags: [licensed, clinical, en, ner, vop, patient, problem]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model extracts clinical problems, using a granular taxonomy, from documents written in the patient's own words.

Note: the 'wip' suffix indicates that model development is work-in-progress; the model will be finalised and its performance improved in upcoming releases.

## Predicted Entities

`InjuryOrPoisoning`, `Modifier`, `HealthStatus`, `Symptom`, `Disease`, `PsychologicalCondition`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_wip_en_4.4.0_3.0_1682012767167.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_wip_en_4.4.0_3.0_1682012767167.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_vop_problem_wip", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter])

data = spark.createDataFrame([["I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms."]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_vop_problem_wip", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))

val data = Seq("I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
## Results

```bash
| chunk                | ner_label |
|:---------------------|:----------|
| pain                 | Symptom   |
| fatigue              | Symptom   |
| rheumatoid arthritis | Disease   |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_vop_problem_wip|
|Compatibility:|Healthcare NLP 4.4.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|3.9 MB|
|Dependencies:|embeddings_clinical|

## References

In-house annotated health-related text in colloquial language.

## Sample text from the training dataset

Hello, I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed, poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal. Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems. Sorry for my poor english😐 Thank you.
## Benchmarking

```bash
label                   tp    fp   fn    total  precision  recall  f1
InjuryOrPoisoning       93    33   52    145    0.74       0.64    0.69
Modifier                695   210  292   987    0.77       0.70    0.73
HealthStatus            80    20   25    105    0.80       0.76    0.78
Symptom                 3669  440  1016  4685   0.89       0.78    0.83
Disease                 1708  261  262   1970   0.87       0.87    0.87
PsychologicalCondition  315   30   26    341    0.91       0.92    0.92
macro_avg               6560  994  1673  8233   0.83       0.78    0.80
micro_avg               6560  994  1673  8233   0.87       0.80    0.83
```

---
layout: model
title: US Financial News Articles Summarization(Headers)
author: John Snow Labs
name: finsum_us_news_headers
date: 2023-03-04
tags: [en, licensed, finance, summarization, t5, tensorflow]
task: Summarization
language: en
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
recommended: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is fine-tuned with US news articles from `bloomberg.com`, `cnbc.com`, `reuters.com`, `wsj.com`, and `fortune.com`. It is a Financial News Summarizer, aimed at extracting headers from US financial news.

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finsum_us_news_headers_en_1.0.0_3.0_1677962901205.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finsum_us_news_headers_en_1.0.0_3.0_1677962901205.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("documents") t5 = nlp.T5Transformer().pretrained("finsum_us_news_headers", "en", "finance/models") \ .setTask("summarization") \ .setInputCols(["documents"]) \ .setMaxOutputLength(512) \ .setOutputCol("summaries") data = spark.createDataFrame([["Kindred Systems Inc. was founded with the mission of creating human-like intelligence in machines and a vision to commercialize its research work in tandem. Over the past few years, Kindred’s Product and Artificial General Intelligence (AGI) divisions have accomplished a tremendous amount in their respective domains, working independently to allow each team to optimize for their objectives. The company has reached a point in its evolution where spinning off the AGI division maximizes the likelihood of success for both divisions, as well as returns to Kindred shareholders. Geordie Rose is stepping down as CEO and President of Kindred to lead this new entity named Sanctuary based in Vancouver, Canada. Kindred co-founder Suzanne Gildert will also be stepping down from her role as Chief Science Officer and will join Sanctuary as co-CEO. Sanctuary’s focus is on the implementation and testing of a specific framework for artificial general intelligence. The new entity will license some of Kindred’s patents and software, and Kindred will maintain a minority ownership in Sanctuary. Kindred’s Board of Directors has appointed Jim Liefer, previously the company’s COO, to serve as CEO and President. As COO, Liefer brought strong executive leadership alongside co-founder, George Babu, for the development and deployment of Kindred’s first commercial product Sort, and will continue to lead the company in its mission to research and develop human-like intelligence in machines. 
The Kindred team in Toronto will continue its applied research in machine and reinforcement learning, with the San Francisco office focused on robotics, product development, and commercialization. With Kindred Sort, the company aims to alleviate the massive pressures facing the retail and fulfillment industry, which includes significant online sales growth, labor shortages, and a lack of advancement in technology. Kindred Sort allows retailers to manage the exploding growth and demand of this sector more efficiently. During the 2017 holiday season, Kindred’s robots sorted thousands of items ordered at speeds averaging over 410 units per hour, and reaching peak speeds of over 531 units per hour, freeing human workers to perform other parts of the fulfillment process critical to meet growing customer demand. “Kindred will maintain its commitment to building human-like intelligence in machines and applying those learnings to create and teach a new intelligent class of robots that will enhance the quality of our day-to-day lives, and in particular, the way we work,” said Liefer. “We look forward to advancing Kindred Sort, achieving new AI and robotic milestones while also helping to drive retail and other industries forward.” Babu, Kindred co-founder, and Chief Product Officer will be joining Liefer on Kindred’s Board of Directors. Babu will continue to oversee product strategy and the expansion of Kindred’s partnerships and pilot programs with major global retailers. Kindred Systems Inc.’s mission is to build machines with human-like intelligence. The company’s central thesis is that intelligence requires a body. Since its founding in 2014, Kindred has been exploring and engineering systems that enable robots to understand and participate in our world, with the ultimate goal of a future where intelligent machines work together with people. 
Kindred is headquartered in San Francisco with an office in Toronto."]]).toDF("text") pipeline = nlp.Pipeline().setStages([document_assembler, t5]) results = pipeline.fit(data).transform(data) ```
## Results

```bash
Kindred Systems Inc. Announces Appointment of Jim Liefer as CEO and President
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|finsum_us_news_headers|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[summaries]|
|Language:|en|
|Size:|925.8 MB|

## References

Train dataset available [here](https://www.kaggle.com/datasets/jeet2016/us-financial-news-articles)

---
layout: model
title: Legal Benefits Clause Binary Classifier
author: John Snow Labs
name: legclf_benefits_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `benefits` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.

If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
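The 512-token limit mentioned above can be guarded with a rough pre-check before classification. A plain-Python sketch, with the caveat that whitespace word counts only approximate the model's real tokenization, and both helper names are ours:

```python
# Rough guard for the 512-token embedding limit. Whitespace word counts only
# approximate the model's internal tokenization, so treat the limit as a
# conservative heuristic rather than an exact bound.

MAX_TOKENS = 512

def needs_split(text: str, limit: int = MAX_TOKENS) -> bool:
    """True if the piece likely exceeds the embedding limit."""
    return len(text.split()) > limit

def chunk_words(text: str, limit: int = MAX_TOKENS) -> list:
    """Fall back to fixed-size word windows for oversized pieces."""
    words = text.split()
    return [" ".join(words[i:i + limit]) for i in range(0, len(words), limit)]

clause = "word " * 1200  # an oversized clause, for illustration
pieces = chunk_words(clause) if needs_split(clause) else [clause]
print(len(pieces))  # 3  (1200 words -> 512 + 512 + 176)
```

Each resulting piece can then be classified independently, which avoids silently truncating long clauses at the embedding stage.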
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `benefits`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_benefits_clause_en_1.0.0_3.2_1660123261684.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_benefits_clause_en_1.0.0_3.2_1660123261684.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_benefits_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+----------+
|result    |
+----------+
|[benefits]|
|[other]   |
|[other]   |
|[benefits]|
+----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_benefits_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
label         precision  recall  f1-score  support
benefits      0.92       0.92    0.92      25
other         0.98       0.98    0.98      89
accuracy      -          -       0.96      114
macro-avg     0.95       0.95    0.95      114
weighted-avg  0.96       0.96    0.96      114
```

---
layout: model
title: Bangla RobertaForMaskedLM Cased model (from neuralspace-reverie)
author: John Snow Labs
name: roberta_embeddings_indic_transformers
date: 2022-12-12
tags: [bn, open_source, roberta_embeddings, robertaformaskedlm]
task: Embeddings
language: bn
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `indic-transformers-bn-roberta` is a Bangla model originally trained by `neuralspace-reverie`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indic_transformers_bn_4.2.4_3.0_1670858623995.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indic_transformers_bn_4.2.4_3.0_1670858623995.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_indic_transformers", "bn") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_indic_transformers", "bn")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_indic_transformers| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|bn| |Size:|312.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-bn-roberta - https://oscar-corpus.com/ --- layout: model title: Extract Tickers on Financial Texts (RoBerta) author: John Snow Labs name: finner_roberta_ticker date: 2022-08-09 tags: [en, financial, ner, ticker, trading, licensed] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model aims to detect Trading Symbols / Tickers in texts. You can then use Chunk Mappers to get more information about the company that ticker belongs to. This is a RoBerta-based model, you can find other lighter versions of this model in Models Hub. ## Predicted Entities `TICKER` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_TICKER/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_roberta_ticker_en_1.0.0_3.2_1660036613729.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_roberta_ticker_en_1.0.0_3.2_1660036613729.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') tokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") tokenClassifier = nlp.RoBertaForTokenClassification.pretrained("finner_roberta_ticker", "en", "finance/models")\ .setInputCols(["document",'token'])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline().setStages([document_assembler, tokenizer, tokenClassifier, ner_converter]) text = ["""There are some serious purchases and sales of AMZN stock today."""] test_data = spark.createDataFrame([text]).toDF("text") model = pipeline.fit(test_data) res= model.transform(test_data) res.select('ner_chunk').collect() ```
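The description above mentions that Chunk Mappers can enrich each detected ticker with information about the company it belongs to. As a plain-Python stand-in for that step, the sketch below attaches metadata to extracted tickers; the lookup table is hypothetical illustration data, not the output of a real Chunk Mapper model:

```python
# Toy stand-in for a Chunk Mapper: enrich extracted tickers with company
# metadata. The lookup table below is hypothetical illustration data; in
# Finance NLP this mapping comes from a pretrained Chunk Mapper model.

TICKER_INFO = {
    "AMZN": {"name": "Amazon.com, Inc.", "exchange": "NASDAQ"},
    "MSFT": {"name": "Microsoft Corporation", "exchange": "NASDAQ"},
}

def enrich(tickers: list) -> list:
    """Attach metadata to each ticker the NER stage extracted."""
    return [{"ticker": t, **TICKER_INFO.get(t, {})} for t in tickers]

extracted = ["AMZN"]  # e.g. the ner_chunk result from the pipeline above
print(enrich(extracted))
```

Unknown tickers simply pass through with no extra fields, so the enrichment step never drops NER output.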
## Results

```bash
['AMZN']
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|finner_roberta_ticker|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|465.3 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

Original dataset (https://www.kaggle.com/omermetinn/tweets-about-the-top-companies-from-2015-to-2020) and weak labelling on in-house texts

## Benchmarking

```bash
label         precision  recall  f1-score  support
TICKER        0.98       0.97    0.98      9823
micro-avg     0.98       0.97    0.98      9823
macro-avg     0.98       0.97    0.98      9823
weighted-avg  0.98       0.97    0.98      9823
```

---
layout: model
title: Legal Business Classification Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_business_classification_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, business_classification, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.

Given a document, the `legclf_business_classification_bert` model, a Bert Sentence Embeddings Document Classifier, classifies whether the document belongs to the class Business_Classification or not (Binary Classification) according to EuroVoc labels.
## Predicted Entities `Business_Classification`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_business_classification_bert_en_1.0.0_3.0_1678111559659.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_business_classification_bert_en_1.0.0_3.0_1678111559659.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_business_classification_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Business_Classification]| |[Other]| |[Other]| |[Business_Classification]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_business_classification_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Business_Classification 0.81 0.94 0.87 32 Other 0.92 0.76 0.83 29 accuracy - - 0.85 61 macro-avg 0.86 0.85 0.85 61 weighted-avg 0.86 0.85 0.85 61 ``` --- layout: model title: Fast Neural Machine Translation Model from American Sign Language to English author: John Snow Labs name: opus_mt_ase_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ase, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `ase` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ase_en_xx_2.7.0_2.4_1609169701112.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ase_en_xx_2.7.0_2.4_1609169701112.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentences") marian = MarianTransformer.pretrained("opus_mt_ase_en", "xx")\ .setInputCols(["sentences"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = "Text to translate" result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ase_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ase.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ase_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Extract relations between NIHSS entities author: John Snow Labs name: redl_nihss_biobert date: 2023-01-15 tags: [en, licensed, clinical, relation_extraction, tensorflow] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Relate scale items and their measurements according to NIHSS guidelines. ## Predicted Entities `Has_Value`, `0` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_NIHSS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_nihss_biobert_en_4.2.4_3.0_1673762755276.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_nihss_biobert_en_4.2.4_3.0_1673762755276.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = sparknlp.annotators.Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel().pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") words_embedder = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel.pretrained("ner_nihss", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverterInternal() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel().pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") # Set a filter on pairs of named entities which will be treated as relation candidates re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks") re_model = RelationExtractionDLModel().pretrained('redl_nihss_biobert', 'en', "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) text= "There , her initial NIHSS score was 4 , as recorded by the ED physicians . This included 2 for weakness in her left leg and 2 for what they felt was subtle ataxia in her left arm and leg ." 
data = spark.createDataFrame([[text]]).toDF("text") p_model = pipeline.fit(data) result = p_model.transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val pos_tagger = PerceptronModel().pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_nihss", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel().pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") val re_model = RelationExtractionDLModel().pretrained("redl_nihss_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("""There , her initial NIHSS score was 4 , as recorded by the ED physicians . 
This included 2 for weakness in her left leg and 2 for what they felt was subtle ataxia in her left arm and leg .""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.extract_relation.nihss").predict("""There , her initial NIHSS score was 4 , as recorded by the ED physicians . This included 2 for weakness in her left leg and 2 for what they felt was subtle ataxia in her left arm and leg .""") ```
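The `RENerChunksFilter` stage above keeps only those entity pairs whose syntactic distance is within `setMaxSyntacticDistance`. The filtering idea can be sketched in plain Python, using a stand-in token-gap distance instead of the real dependency-tree distance, with made-up chunk data:

```python
def filter_pairs(chunks, max_distance):
    """Keep candidate entity pairs whose distance is within max_distance.
    'head' and the absolute-difference distance are illustrative stand-ins;
    the real annotator measures distance over the dependency parse."""
    pairs = []
    for i, a in enumerate(chunks):
        for b in chunks[i + 1:]:
            if abs(a["head"] - b["head"]) <= max_distance:
                pairs.append((a["text"], b["text"]))
    return pairs

chunks = [
    {"text": "NIHSS score", "head": 4},
    {"text": "4", "head": 8},
    {"text": "left leg", "head": 25},
]
print(filter_pairs(chunks, max_distance=10))  # [('NIHSS score', '4')]
```

Pairs that survive the filter become the `re_ner_chunks` candidates scored by the relation extraction model.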
## Results ```bash +---------+-----------+-------------+-----------+-----------+------------+-------------+-----------+--------------------+----------+ | relation| entity1|entity1_begin|entity1_end| chunk1| entity2|entity2_begin|entity2_end| chunk2|confidence| +---------+-----------+-------------+-----------+-----------+------------+-------------+-----------+--------------------+----------+ |Has_Value| NIHSS| 20| 30|NIHSS score| Measurement| 36| 36| 4| 0.9998851| |Has_Value|Measurement| 89| 89| 2| 6a_LeftLeg| 111| 118| left leg| 0.9987311| | 0|Measurement| 89| 89| 2| Measurement| 124| 124| 2|0.97510725| | 0|Measurement| 89| 89| 2|7_LimbAtaxia| 156| 185|ataxia in her lef...| 0.999889| | 0| 6a_LeftLeg| 111| 118| left leg| Measurement| 124| 124| 2|0.99989617| | 0| 6a_LeftLeg| 111| 118| left leg|7_LimbAtaxia| 156| 185|ataxia in her lef...| 0.9999521| |Has_Value|Measurement| 124| 124| 2|7_LimbAtaxia| 156| 185|ataxia in her lef...| 0.9896319| +---------+-----------+-------------+-----------+-----------+------------+-------------+-----------+--------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_nihss_biobert| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|401.7 MB| ## References @article{wangnational, title={National Institutes of Health Stroke Scale (NIHSS) Annotations for the MIMIC-III Database}, author={Wang, Jiayang and Huang, Xiaoshuo and Yang, Lin and Li, Jiao} } ## Benchmarking ```bash label Recall Precision F1 Support 0 0.989 0.976 0.982 611 Has_Value 0.983 0.992 0.988 889 Avg. 0.986 0.984 0.985 - Weighted-Avg. 0.985 0.985 0.985 -
``` --- layout: model title: Clinical Deidentification (English, Glove, Augmented) author: John Snow Labs name: clinical_deidentification_glove_augmented date: 2022-06-30 tags: [en, deid, deidentification, clinical, licensed] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline is trained with lightweight `glove_100d` embeddings and can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`, `CITY`, `COUNTRY`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `STREET`, `USERNAME`, `ZIP`, `ACCOUNT`, `LICENSE`, `VIN`, `SSN`, `DLN`, `PLATE`, `IPADDR` entities. It differs from `clinical_deidentification_glove` in how it handles PHONE and PATIENT entities: in addition to the NER model, it applies rules in Contextual Parser components. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_glove_augmented_en_4.0.0_3.0_1656579032191.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_glove_augmented_en_4.0.0_3.0_1656579032191.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification_glove_augmented", "en", "clinical/models") deid_pipeline.annotate("""Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN: 324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.""") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = PretrainedPipeline("clinical_deidentification_glove_augmented","en","clinical/models") val result = pipeline.annotate("""Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN: 324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.""") ``` {:.nlu-block} ```python import nlu nlu.load("en.deid.glove_augmented.pipeline").predict("""Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN: 324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.""") ```
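The pipeline produces several masking styles: entity-label masks, fixed-length `****` masks, and same-length character masks. The two character-based modes can be illustrated outside Spark with a small sketch (the PHI span offsets are hard-coded here for illustration, not produced by the pipeline):

```python
def mask_fixed(text, spans):
    """Replace each PHI span with a fixed-length '****' mask."""
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + "****" + text[end:]
    return text

def mask_same_length(text, spans):
    """Replace each PHI span with a bracketed mask of the same length,
    so the document layout is preserved."""
    for start, end in sorted(spans, reverse=True):
        n = end - start
        mask = "[" + "*" * max(n - 2, 0) + "]" if n >= 2 else "*" * n
        text = text[:start] + mask + text[end:]
    return text

text = "Record date : 2093-01-13, David Hale, M.D."
spans = [(14, 24), (26, 36)]  # the date and the name, as character offsets
print(mask_fixed(text, spans))        # Record date : ****, ****, M.D.
print(mask_same_length(text, spans))  # Record date : [********], [********], M.D.
```

Spans are replaced right-to-left so that earlier offsets stay valid while masking.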
## Results ```bash {'masked': ['Record date : , , M.D.', 'IP: .', "The driver's license no: .", 'The SSN: and e-mail: .', 'Name : MR. # Date : .', 'PCP : , years old.', 'Record date : , : .'], 'masked_fixed_length_chars': ['Record date : ****, ****, M.D.', 'IP: ****.', "The driver's license no: ****.", 'The SSN: **** and e-mail: ****.', 'Name : **** MR. # **** Date : ****.', 'PCP : ****, **** years old.', 'Record date : ****, **** : ****.'], 'masked_with_chars': ['Record date : [********], [********], M.D.', 'IP: [************].', "The driver's license no: [******].", 'The SSN: [*******] and e-mail: [************].', 'Name : [**************] MR. # [****] Date : [******].', 'PCP : [******], ** years old.', 'Record date : [********], [***********] : [***************].'], 'ner_chunk': ['2093-01-13', 'David Hale', 'A334455B', '324598674', 'hale@gmail.com', 'Hendrickson, Ora', '719435', '01/13/93', 'Oliveira', '25', '2079-11-09', "Patient's VIN", '1HGBH41JXMN109286'], 'obfuscated': ['Record date : 2093-01-23, Dr Marshia Curling, M.D.', 'IP: 004.004.004.004.', "The driver's license no: 123XX123.", 'The SSN: SSN-089-89-9294 and e-mail: Mikey@hotmail.com.', 'Name : Stephania Chang MR. # E5881795 Date : 02-14-1983.', 'PCP : Dr Lovella Israel, 52 years old.', 'Record date : 2079-11-14, Dr Colie Carne : 3CCCC22DDDD333888.'], 'sentence': ['Record date : 2093-01-13, David Hale, M.D.', 'IP: 203.120.223.13.', "The driver's license no: A334455B.", 'The SSN: 324598674 and e-mail: hale@gmail.com.', 'Name : Hendrickson, Ora MR. 
# 719435 Date : 01/13/93.', 'PCP : Oliveira, 25 years old.', "Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286."]} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification_glove_augmented| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|182.7 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel - MedicalNerModel - NerConverterInternalModel - ChunkMergeModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ChunkMergeModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: Legal Loan and Security Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_loan_and_security_agreement date: 2022-11-10 tags: [en, legal, classification, agreement, loan, security, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_loan_and_security_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `loan-and-security-agreement` or not (Binary Classification). Longformers have a restriction on 4096 tokens, so only the first 4096 tokens will be taken into account. We have realised that for the big majority of the documents in legal corpora, if they are clean and only contain the legal document without any extra information before, 4096 is enough to perform Document Classification. 
If not, let us know and we can take another approach for you: split the document into 4096-token chunks, average their embeddings, and train on the averaged version, which means the whole document is taken into account. In theory, though, this should not be required. ## Predicted Entities `loan-and-security-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_loan_and_security_agreement_en_1.0.0_3.0_1668109815627.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_loan_and_security_agreement_en_1.0.0_3.0_1668109815627.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = nlp.ClassifierDLModel.pretrained("legclf_loan_and_security_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
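For documents longer than the 4096-token Longformer window, the description above mentions a fallback of chunking the document and averaging the chunk embeddings. That averaging step can be sketched in plain Python (the chunk size is real, but the token list and toy embeddings are illustrative):

```python
def chunk(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def average_embeddings(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

tokens = list(range(10000))          # stand-in for a 10k-token document
chunks = chunk(tokens, 4096)
print([len(c) for c in chunks])      # [4096, 4096, 1808]
# One toy embedding per chunk; a real encoder would produce these.
embs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(average_embeddings(embs))      # [3.0, 4.0]
```

The averaged vector then plays the role of the single document embedding fed to the classifier head.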
## Results ```bash +-------+ | result| +-------+ |[loan-and-security-agreement]| |[other]| |[other]| |[loan-and-security-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_loan_and_security_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.5 MB| ## References Legal documents, scrapped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support loan-and-security-agreement 0.94 1.00 0.97 33 other 1.00 0.98 0.99 85 accuracy - - 0.98 118 macro-avg 0.97 0.99 0.98 118 weighted-avg 0.98 0.98 0.98 118 ``` --- layout: model title: English BertForQuestionAnswering Cased model (from Aiyshwariya) author: John Snow Labs name: bert_qa_aiyshwariya_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is a English model originally trained by `Aiyshwariya`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_aiyshwariya_finetuned_squad_en_4.0.0_3.0_1657185952130.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_aiyshwariya_finetuned_squad_en_4.0.0_3.0_1657185952130.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_aiyshwariya_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_aiyshwariya_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
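Span-classification QA models such as this one score each context token as a possible answer start and end, then return the best-scoring span. The selection step can be sketched as follows (the scores below are made up for illustration; a real model produces them from the question and context):

```python
def best_span(tokens, start_scores, end_scores, max_len=30):
    """Pick the (start, end) pair maximising start + end score, with end >= start."""
    best, best_score = (0, 0), float("-inf")
    for s in range(len(tokens)):
        for e in range(s, min(s + max_len, len(tokens))):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    s, e = best
    return " ".join(tokens[s:e + 1])

context = "My name is Clara and I live in Berkeley .".split()
start = [0.1, 0.0, 0.2, 6.0, 0.1, 0.0, 0.0, 0.0, 0.3, 0.0]
end   = [0.0, 0.1, 0.0, 5.5, 0.2, 0.0, 0.0, 0.1, 0.4, 0.0]
print(best_span(context, start, end))  # Clara
```

The `max_len` cap mirrors the common practice of bounding answer length during span search.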
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_aiyshwariya_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Aiyshwariya/bert-finetuned-squad --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4CHEMD_Imbalancedscibert_scivocab_cased date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD_Imbalancedscibert_scivocab_cased` is a English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Imbalancedscibert_scivocab_cased_en_4.0.0_3.0_1657108990893.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Imbalancedscibert_scivocab_cased_en_4.0.0_3.0_1657108990893.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Imbalancedscibert_scivocab_cased","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Imbalancedscibert_scivocab_cased","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4CHEMD_Imbalancedscibert_scivocab_cased| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|410.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4CHEMD_Imbalancedscibert_scivocab_cased --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_3 TFWav2Vec2ForCTC from doddle124578 author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab_3 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_base_timit_demo_colab_3` is a English model originally trained by doddle124578. NOTE: This model only works on a CPU, if you need to use this model on a GPU device please use asr_wav2vec2_base_timit_demo_colab_3_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_3_en_4.2.0_3.0_1664038287215.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_3_en_4.2.0_3.0_1664038287215.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_3", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_3", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
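`Wav2Vec2ForCTC` emits one character prediction per audio frame, and greedy CTC decoding then collapses consecutive repeats and drops the blank symbol to produce the transcript. A minimal sketch of that decoding step (the frame predictions are illustrative, not real model output):

```python
BLANK = "_"  # stand-in for the CTC blank token

def ctc_greedy_decode(frames):
    """Collapse consecutive repeated predictions, then drop CTC blanks."""
    out, prev = [], None
    for ch in frames:
        if ch != prev and ch != BLANK:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_greedy_decode(list("hh_eee_l_ll_oo")))  # hello
```

Note that a blank between two identical characters (as in `l_ll`) is what lets genuine doubled letters survive the repeat-collapsing step.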
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab_3| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: Detect concepts in drug development trials (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_drug_development_trials date: 2022-03-22 tags: [en, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description It is a `BertForTokenClassification` NER model to identify concepts related to drug development including `Trial Groups` , `End Points` , `Hazard Ratio` and other entities in free text. ## Predicted Entities `Patient_Count`, `Duration`, `End_Point`, `Value`, `Trial_Group`, `Hazard_Ratio`, `Total_Patients` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DRUGS_DEVELOPMENT_TRIALS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_drug_development_trials_en_3.3.4_3.0_1647948437359.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_drug_development_trials_en_3.3.4_3.0_1647948437359.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_drug_development_trials", "en", "clinical/models")\ .setInputCols("token", "sentence")\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) test_sentence = """In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan.""" data = spark.createDataFrame([[test_sentence]]).toDF('text') result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_drug_development_trials", "en", "clinical/models") .setInputCols(Array("token", "sentence")) .setOutputCol("ner") val ner_converter = NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val data = Seq("In June 2003, 
the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.ner.drug_development_trials").predict("""In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan.""") ```
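The chunk/label pairs shown under Results are produced by the NerConverter stage, which merges consecutive B-/I- token tags into labeled spans. A minimal plain-Python sketch of that grouping logic (illustrative only — `merge_bio` is a hypothetical helper, not the Spark NLP implementation):

```python
def merge_bio(tokens, tags):
    """Greedily merge B-/I- tagged tokens into (chunk, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new chunk; flush any open one.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # I- with a matching label continues the open chunk.
            current.append(token)
        else:
            # O tag (or a dangling I-) closes the open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(merge_bio(
    ["overall", "survival", "with", "and", "without", "topotecan"],
    ["B-End_Point", "I-End_Point", "B-Trial_Group", "O", "B-Trial_Group", "I-Trial_Group"],
))
# → [('overall survival', 'End_Point'), ('with', 'Trial_Group'), ('without topotecan', 'Trial_Group')]
```

This reproduces chunks such as `overall survival` (End_Point) and `without topotecan` (Trial_Group) from the Results table.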
## Results ```bash +-----------------+-------------+ |chunk |ner_label | +-----------------+-------------+ |median |Duration | |overall survival |End_Point | |with |Trial_Group | |without topotecan|Trial_Group | |4.0 |Value | |3.6 months |Value | |23 |Patient_Count| |63 |Patient_Count| |55 |Patient_Count| |33 patients |Patient_Count| |topotecan |Trial_Group | |11 |Patient_Count| |61 |Patient_Count| |66 |Patient_Count| |32 patients |Patient_Count| |without topotecan|Trial_Group | +-----------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_drug_development_trials| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|400.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References Trained on data obtained from `clinicaltrials.gov` and annotated in-house. ## Benchmarking ```bash label prec rec f1 support B-Duration 0.93 0.94 0.93 1820 B-End_Point 0.99 0.98 0.98 5022 B-Hazard_Ratio 0.97 0.95 0.96 778 B-Patient_Count 0.81 0.88 0.85 300 B-Trial_Group 0.86 0.88 0.87 6751 B-Value 0.94 0.96 0.95 7675 I-Duration 0.71 0.82 0.76 185 I-End_Point 0.94 0.98 0.96 1491 I-Patient_Count 0.48 0.64 0.55 44 I-Trial_Group 0.78 0.75 0.77 4561 I-Value 0.93 0.95 0.94 1511 O 0.96 0.95 0.95 47423 accuracy 0.94 0.94 0.94 77608 macro-avg 0.79 0.82 0.80 77608 weighted-avg 0.94 0.94 0.94 77608 ``` --- layout: model title: Extract Test Entities from Voice of the Patient Documents (embeddings_clinical_large) author: John Snow Labs name: ner_vop_test_emb_clinical_large date: 2023-06-06 tags: [licensed, clinical, ner, en, vop, test] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts test mentions from the documents 
written in the patient’s own words. ## Predicted Entities `VitalTest`, `Test`, `Measurements`, `TestResult` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_test_emb_clinical_large_en_4.4.3_3.0_1686076758374.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_test_emb_clinical_large_en_4.4.3_3.0_1686076758374.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_test_emb_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["I went to the endocrinology department to get my thyroid levels checked. 
They ordered a blood test and found out that I have hypothyroidism, so now I'm on medication to manage it."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_test_emb_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("I went to the endocrinology department to get my thyroid levels checked. They ordered a blood test and found out that I have hypothyroidism, so now I'm on medication to manage it.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:---------------|:------------| | thyroid levels | Test | | blood test | Test | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_test_emb_clinical_large| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical_large| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
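The per-label precision/recall/F1 scores in the benchmarking table below follow directly from the raw tp/fp/fn counts. As a plain-Python sanity check (the `prf` helper is illustrative, not part of Spark NLP), reproducing the VitalTest row:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw true/false positive and
    false negative counts, rounded to 2 decimals as in the table."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

# VitalTest row: tp=162, fp=29, fn=10
print(prf(162, 29, 10))  # → (0.85, 0.94, 0.89)
```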
## Benchmarking ```bash label tp fp fn total precision recall f1 VitalTest 162 29 10 172 0.85 0.94 0.89 Test 1060 122 148 1208 0.90 0.88 0.89 Measurements 146 18 40 186 0.89 0.78 0.83 TestResult 339 80 185 524 0.81 0.65 0.72 macro_avg 1707 249 383 2090 0.86 0.81 0.83 micro_avg 1707 249 383 2090 0.87 0.82 0.84 ``` --- layout: model title: Yoruba Named Entity Recognition (from mbeukman) author: John Snow Labs name: xlmroberta_ner_xlm_roberta_base_finetuned_ner_yoruba date: 2022-05-17 tags: [xlm_roberta, ner, token_classification, yo, open_source] task: Named Entity Recognition language: yo edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-ner-yoruba` is a Yoruba model orginally trained by `mbeukman`. ## Predicted Entities `PER`, `ORG`, `LOC`, `DATE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_ner_yoruba_yo_3.4.2_3.0_1652808655534.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_ner_yoruba_yo_3.4.2_3.0_1652808655534.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_ner_yoruba","yo") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Mo nifẹ Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_ner_yoruba","yo") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Mo nifẹ Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_xlm_roberta_base_finetuned_ner_yoruba| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|yo| |Size:|773.4 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-ner-yoruba - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://www.apache.org/licenses/LICENSE-2.0 --- layout: model title: Russian T5ForConditionalGeneration Cased model (from Kateryna) author: John Snow Labs name: t5_eva_forum_headlines date: 2023-01-30 tags: [ru, open_source, t5, tensorflow] task: Text Generation language: ru edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `eva_ru_forum_headlines` is a Russian model originally trained by `Kateryna`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_eva_forum_headlines_ru_4.3.0_3.0_1675101795155.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_eva_forum_headlines_ru_4.3.0_3.0_1675101795155.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_eva_forum_headlines","ru") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_eva_forum_headlines","ru") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_eva_forum_headlines| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ru| |Size:|981.6 MB| ## References - https://huggingface.co/Kateryna/eva_ru_forum_headlines - https://github.com/KaterynaD/eva.ru/tree/main/Code/Notebooks/9.%20Headlines --- layout: model title: ALBERT Base CoNNL-03 NER Pipeline author: John Snow Labs name: albert_base_token_classifier_conll03_pipeline date: 2022-06-19 tags: [open_source, ner, token_classifier, albert, conll03, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [albert_base_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/26/albert_base_token_classifier_conll03_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_base_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655654368506.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_base_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655654368506.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("albert_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("albert_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PER | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_base_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|43.1 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - AlbertForTokenClassification - NerConverter - Finisher --- layout: model title: Part of Speech for Traditional Chinese author: John Snow Labs name: pos_ud_gsd_trad date: 2021-01-25 task: Part of Speech Tagging language: zh edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [pos, zh, open_source] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 13 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. ## Predicted Entities `ADJ`, `ADP`, `ADV`, `AUX`, `CONJ`, `DET`, `NOUN`, `NUM`, `PART`, `PRON`, `PROPN`, `PUNCT`, `SYM`, `VERB`, and `X`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_trad_zh_2.7.0_2.4_1611578220288.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_trad_zh_2.7.0_2.4_1611578220288.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an nlp pipeline after tokenization.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud_trad", "zh")\ .setInputCols(["sentence"])\ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_gsd_trad", "zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, word_segmenter, pos ]) example = spark.createDataFrame([['然而,這樣的處理也衍生了一些問題。']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud_trad", "zh") .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_gsd_trad", "zh") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter, pos)) val data = Seq("然而,這樣的處理也衍生了一些問題。").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""然而,這樣的處理也衍生了一些問題。"""] pos_df = nlu.load('zh.pos.ud_gsd_trad').predict(text, output_level = "token") pos_df ```
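The pipeline above needs a WordSegmenterModel in place of a plain Tokenizer because Chinese text is not whitespace-delimited. A toy plain-Python sketch of the idea, using greedy forward maximum matching against a small dictionary (the real model is a learned tagger, not dictionary-based; `max_match` and its vocabulary are purely illustrative):

```python
def max_match(text, vocab, max_len=4):
    """Greedy forward maximum-matching segmentation: at each position,
    take the longest dictionary word, falling back to one character."""
    out, i = [], 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            if j == 1 or text[i:i + j] in vocab:
                out.append(text[i:i + j])
                i += j
                break
    return out

vocab = {"然而", "這樣", "處理", "衍生", "一些", "問題"}
print(max_match("然而,這樣的處理也衍生了一些問題。", vocab))
# → ['然而', ',', '這樣', '的', '處理', '也', '衍生', '了', '一些', '問題', '。']
```

With this toy dictionary the output matches the token column shown in the Results section, one POS tag per segmented token.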
## Results ```bash +------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+ |text |result | +------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+ |然而 , 這樣 的 處理 也 衍生 了 一些 問題 。 |[ADV, PUNCT, PRON, PART, NOUN, ADV, VERB, PART, ADJ, NOUN, PUNCT] | +------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_gsd_trad| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[pos]| |Language:|zh| ## Data Source The model was trained on the [Universal Dependencies](https://universaldependencies.org/) for Traditional Chinese annotated and converted by Google. 
## Benchmarking ```bash | | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | ADJ | 0.70 | 0.68 | 0.69 | 272 | | ADP | 0.85 | 0.86 | 0.85 | 535 | | ADV | 0.90 | 0.90 | 0.90 | 549 | | AUX | 0.88 | 0.88 | 0.88 | 281 | | CCONJ | 0.92 | 0.87 | 0.89 | 191 | | DET | 0.93 | 0.93 | 0.93 | 138 | | NOUN | 0.88 | 0.92 | 0.90 | 3312 | | NUM | 0.98 | 0.99 | 0.98 | 653 | | PART | 0.97 | 0.94 | 0.95 | 1359 | | PRON | 0.97 | 0.97 | 0.97 | 168 | | PROPN | 0.89 | 0.84 | 0.86 | 1006 | | PUNCT | 1.00 | 1.00 | 1.00 | 1688 | | SYM | 1.00 | 1.00 | 1.00 | 3 | | VERB | 0.86 | 0.83 | 0.85 | 1769 | | X | 1.00 | 0.88 | 0.93 | 88 | | accuracy | | | 0.91 | 12012 | | macro avg | 0.91 | 0.90 | 0.91 | 12012 | | weighted avg | 0.91 | 0.91 | 0.91 | 12012 | ``` --- layout: model title: English image_classifier_vit_finetuned_chest_xray_pneumonia ViTForImageClassification from nickmuchi author: John Snow Labs name: image_classifier_vit_finetuned_chest_xray_pneumonia date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_finetuned_chest_xray_pneumonia` is a English model originally trained by nickmuchi. 
## Predicted Entities `NORMAL`, `PNEUMONIA` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_finetuned_chest_xray_pneumonia_en_4.1.0_3.0_1660165538802.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_finetuned_chest_xray_pneumonia_en_4.1.0_3.0_1660165538802.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_finetuned_chest_xray_pneumonia", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) imageDF = spark.read.format("image") \ .option("dropInvalid", True) \ .load("path/to/images") pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_finetuned_chest_xray_pneumonia", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val imageDF = spark.read.format("image") .option("dropInvalid", true) .load("path/to/images") val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_finetuned_chest_xray_pneumonia| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Recognize Entities DL Pipeline for Russian - Small author: John Snow Labs name: entity_recognizer_sm date: 2021-03-22 tags: [open_source, russian, entity_recognizer_sm, pipeline, ru] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: ru edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_sm is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps. It performs most of the common text processing tasks on your dataframe {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_ru_3.0.0_3.0_1616441765899.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_ru_3.0.0_3.0_1616441765899.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('entity_recognizer_sm', lang = 'ru') annotations = pipeline.fullAnnotate("Здравствуйте из Джона Снежных Лабораторий! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("entity_recognizer_sm", lang = "ru") val result = pipeline.fullAnnotate("Здравствуйте из Джона Снежных Лабораторий! ")(0) ``` {:.nlu-block} ```python import nlu text = ["""Здравствуйте из Джона Снежных Лабораторий! """] result_df = nlu.load('ru.ner').predict(text) result_df ```
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:------------------------------------------------|:-----------------------------------------------|:-----------------------------------------------------------|:-----------------------------|:--------------------------------------|:-------------------------------| | 0 | ['Здравствуйте из Джона Снежных Лабораторий! '] | ['Здравствуйте из Джона Снежных Лабораторий!'] | ['Здравствуйте', 'из', 'Джона', 'Снежных', 'Лабораторий!'] | [[0.0, 0.0, 0.0, 0.0,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['Джона Снежных Лабораторий!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_sm| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|ru| --- layout: model title: Arabic Part of Speech Tagger (DA-Dialectal Arabic dataset, Egyptian Arabic POS) author: John Snow Labs name: bert_pos_bert_base_arabic_camelbert_da_pos_egy date: 2022-04-26 tags: [bert, pos, part_of_speech, ar, open_source] task: Part of Speech Tagging language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-da-pos-egy` is a Arabic model orginally trained by `CAMeL-Lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_da_pos_egy_ar_3.4.2_3.0_1650993413426.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_da_pos_egy_ar_3.4.2_3.0_1650993413426.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_da_pos_egy","ar") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_da_pos_egy","ar") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("أنا أحب الشرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.pos.arabic_camelbert_da_pos_egy").predict("""أنا أحب الشرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_arabic_camelbert_da_pos_egy| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ar| |Size:|407.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-da-pos-egy - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://github.com/CAMeL-Lab/camel_tools --- layout: model title: Bulgarian Legal Roberta Embeddings author: John Snow Labs name: roberta_base_bulgarian_legal date: 2023-02-16 tags: [bul, bulgarian, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: bul edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-bulgarian-roberta-base` is a Bulgarian model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_bulgarian_legal_bul_4.2.4_3.0_1676558917204.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_bulgarian_legal_bul_4.2.4_3.0_1676558917204.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_bulgarian_legal", "bul")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_bulgarian_legal", "bul") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") ```
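Embeddings like these are usually consumed downstream, for example to compare texts by semantic similarity. A plain-Python sketch of cosine similarity over two toy vectors (hypothetical 4-dimensional stand-ins; real vectors from a RoBERTa base model are 768-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real sentence embeddings.
v1 = [0.2, 0.1, -0.4, 0.3]
v2 = [0.1, 0.2, -0.3, 0.4]
print(round(cosine_similarity(v1, v2), 3))  # → 0.933
```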
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_bulgarian_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|bul| |Size:|416.1 MB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-bulgarian-roberta-base --- layout: model title: Danish BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_da_cased date: 2022-12-02 tags: [da, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: da edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-da-cased` is a Danish model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_da_cased_da_4.2.4_3.0_1670016470135.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_da_cased_da_4.2.4_3.0_1670016470135.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_da_cased","da") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_da_cased","da") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
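The `embeddings` column produced above holds one vector per token. A common downstream step (an assumption about usage, not part of this model card) is mean pooling the token vectors into a single sentence vector; a minimal pure-Python sketch:

```python
def mean_pool(token_vectors):
    """Average per-token embedding vectors into one fixed-size sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Toy 4-token sentence with 3-dimensional embeddings (a real BERT base model uses 768).
vectors = [[1.0, 0.0, 2.0],
           [3.0, 0.0, 0.0],
           [0.0, 4.0, 2.0],
           [0.0, 0.0, 0.0]]
print(mean_pool(vectors))  # [1.0, 1.0, 1.0]
```

In a real pipeline the per-token vectors would come from the `embeddings` annotations; here they are hard-coded for illustration.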
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_da_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|da| |Size:|389.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-da-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: English BertForQuestionAnswering Cased model (from WounKai) author: John Snow Labs name: bert_qa_wounkai_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `WounKai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_wounkai_finetuned_squad_en_4.0.0_3.0_1657186291604.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_wounkai_finetuned_squad_en_4.0.0_3.0_1657186291604.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_wounkai_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_wounkai_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_wounkai_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/WounKai/bert-finetuned-squad --- layout: model title: Pipeline to Detect PHI for Deidentification author: John Snow Labs name: ner_deidentify_dl_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, deidentification, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deidentify_dl](https://nlp.johnsnowlabs.com/2021/03/31/ner_deidentify_dl_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deidentify_dl_pipeline_en_3.4.1_3.0_1647871253846.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deidentify_dl_pipeline_en_3.4.1_3.0_1647871253846.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deidentify_dl_pipeline", "en", "clinical/models") pipeline.annotate("A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 7194334 Date : 01/13/93 PCP : Oliveira , 25 month years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street") ``` ```scala val pipeline = new PretrainedPipeline("ner_deidentify_dl_pipeline", "en", "clinical/models") pipeline.annotate("A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 7194334 Date : 01/13/93 PCP : Oliveira , 25 month years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.deidentify.pipeline").predict("""A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 7194334 Date : 01/13/93 PCP : Oliveira , 25 month years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street""") ```
## Results ```bash +---------------+-----+ |ner_label |count| +---------------+-----+ |O |28 | |I-HOSPITAL |4 | |B-DATE |3 | |I-STREET |3 | |I-PATIENT |2 | |B-DOCTOR |2 | |B-AGE |1 | |B-PATIENT |1 | |I-DOCTOR |1 | |B-MEDICALRECORD|1 | +---------------+-----+ +-----------------------------+-------------+ |chunk |ner_label | +-----------------------------+-------------+ |2093-01-13 |DATE | |David Hale |DOCTOR | |Hendrickson , Ora |PATIENT | |7194334 |MEDICALRECORD| |01/13/93 |DATE | |Oliveira |DOCTOR | |25 |AGE | |2079-11-09 |DATE | |Cocke County Baptist Hospital|HOSPITAL | |0295 Keats Street |STREET | +-----------------------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deidentify_dl_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Earning Calls Financial NER (Generic, md) author: John Snow Labs name: finner_earning_calls_generic_md date: 2022-12-15 tags: [en, finance, earning, calls, licensed, ner] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a `md` (medium) version of a financial model trained on Earning Calls transcripts to detect financial entities (NER model). This model is called `Generic` as it has fewer labels in comparison with the `Specific` version. Please note this model requires some tokenization configuration to extract the currency (see python snippet below).
The currently available entities are: - AMOUNT: Numeric amounts, not percentages - ASSET: Current or Fixed Asset - ASSET_DECREASE: Decrease in the asset possession/exposure - ASSET_INCREASE: Increase in the asset possession/exposure - CF: Total cash flow - CF_DECREASE: Relative decrease in cash flow - CF_INCREASE: Relative increase in cash flow - COUNT: Number of items (not monetary, not percentages) - CURRENCY: The currency of the amount - DATE: Generic dates in context where either it's not a fiscal year or it can't be asserted as such given the context - EXPENSE: An expense or loss - EXPENSE_DECREASE: A piece of information saying there was an expense decrease in that fiscal year - EXPENSE_INCREASE: A piece of information saying there was an expense increase in that fiscal year - FCF: Free Cash Flow - FISCAL_YEAR: A date which expresses which month the fiscal exercise was closed for a specific year - KPI: Key Performance Indicator, a quantifiable measure of performance over time for a specific objective - KPI_DECREASE: Relative decrease in a KPI - KPI_INCREASE: Relative increase in a KPI - LIABILITY: Current or Long-Term Liability (not from stockholders) - LIABILITY_DECREASE: Relative decrease in liability - LIABILITY_INCREASE: Relative increase in liability - ORG: Mention of a company/organization name - PERCENTAGE: Numeric amounts which are percentages - PROFIT: Profit or Revenue - PROFIT_DECLINE: A piece of information saying there was a profit / revenue decrease in that fiscal year - PROFIT_INCREASE: A piece of information saying there was a profit / revenue increase in that fiscal year - TICKER: Trading symbol of the company ## Predicted Entities `AMOUNT`, `ASSET`, `ASSET_DECREASE`, `ASSET_INCREASE`, `CF`, `CF_INCREASE`, `COUNT`, `CURRENCY`, `DATE`, `EXPENSE`, `EXPENSE_DECREASE`, `EXPENSE_INCREASE`, `FCF`, `FISCAL_YEAR`, `KPI`, `KPI_DECREASE`, `KPI_INCREASE`, `LIABILITY`, `LIABILITY_DECREASE`, `LIABILITY_INCREASE`, `ORG`, `PERCENTAGE`, `PROFIT`,
`PROFIT_DECLINE`, `PROFIT_INCREASE`, `TICKER`, `CF_DECREASE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_earning_calls_generic_md_en_1.0.0_3.0_1671135709181.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_earning_calls_generic_md_en_1.0.0_3.0_1671135709181.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from pyspark.sql import functions as F document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token")\ .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$','€']) embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512) ner_model = finance.NerModel.pretrained("finner_earning_calls_generic_md", "en", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""Adjusted EPS was ahead of our expectations at $ 1.21 , and free cash flow is also ahead of our expectations despite a $ 1.5 billion additional tax payment we made related to the R&D amortization."""]]).toDF("text") model = pipeline.fit(data) result = model.transform(data) result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \ .select(F.expr("cols['0']").alias("text"), F.expr("cols['1']['entity']").alias("label")).show(200, truncate = False) ```
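The `setContextChars` list above matters because the model expects currency symbols such as `$` and `€` as standalone tokens (note how `$` is tagged `B-CURRENCY` separately from the `B-AMOUNT` number in the results). A minimal, Spark-free sketch of that edge-splitting behavior (illustrative only, not the Spark NLP tokenizer):

```python
# Context characters mirroring the setContextChars list in the pipeline above
# (an illustrative subset; the real tokenization is done by Spark NLP).
CONTEXT_CHARS = set(".,;:!?*-()”’$€")

def tokenize(text):
    """Split on whitespace, then peel context chars off each token's edges."""
    tokens = []
    for piece in text.split():
        prefix, suffix = [], []
        while piece and piece[0] in CONTEXT_CHARS:
            prefix.append(piece[0])
            piece = piece[1:]
        while piece and piece[-1] in CONTEXT_CHARS:
            suffix.insert(0, piece[-1])
            piece = piece[:-1]
        tokens.extend(prefix + ([piece] if piece else []) + suffix)
    return tokens

print(tokenize("despite a $1.5 billion payment."))
# ['despite', 'a', '$', '1.5', 'billion', 'payment', '.']
```

Interior punctuation is preserved, so `1.5` stays one token while the leading `$` becomes its own token that the NER model can label.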
## Results ```bash +------------+----------+----------+ | token| ner_label|confidence| +------------+----------+----------+ | Adjusted| B-PROFIT| 0.9641| | EPS| I-PROFIT| 0.9838| | was| O| 1.0| | ahead| O| 1.0| | of| O| 1.0| | our| O| 1.0| |expectations| O| 1.0| | at| O| 1.0| | $|B-CURRENCY| 1.0| | 1.21| B-AMOUNT| 1.0| | ,| O| 0.9984| | and| O| 1.0| | free| B-FCF| 0.9981| | cash| I-FCF| 0.9994| | flow| I-FCF| 0.9996| | is| O| 1.0| | also| O| 1.0| | ahead| O| 1.0| | of| O| 1.0| | our| O| 1.0| |expectations| O| 1.0| | despite| O| 1.0| | a| O| 1.0| | $|B-CURRENCY| 1.0| | 1.5| B-AMOUNT| 1.0| | billion| I-AMOUNT| 0.9999| | additional| O| 0.9786| | tax| O| 0.9603| | payment| O| 0.9043| | we| O| 1.0| | made| O| 1.0| | related| O| 1.0| | to| O| 1.0| | the| O| 0.9999| | R&D| O| 0.9993| |amortization| O| 0.9976| | .| O| 1.0| +------------+----------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_earning_calls_generic_md| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.2 MB| ## References In-house annotations on Earning Calls. 
## Benchmarking ```bash label precision recall f1 support AMOUNT 99.476440 99.650350 99.563319 573 ASSET 54.838710 50.000000 52.307692 31 ASSET_INCREASE 100.000000 33.333333 50.000000 1 CF 44.827586 54.166667 49.056604 29 COUNT 61.290323 86.363636 71.698113 31 CURRENCY 99.095841 99.636364 99.365367 553 DATE 88.304094 96.178344 92.073171 171 EXPENSE 66.666667 50.602410 57.534247 63 EXPENSE_DECREASE 100.000000 60.000000 75.000000 3 EXPENSE_INCREASE 55.555556 55.555556 55.555556 9 FCF 78.947368 75.000000 76.923077 19 KPI 31.578947 28.571429 30.000000 19 KPI_DECREASE 33.333333 20.000000 25.000000 6 KPI_INCREASE 53.333333 38.095238 44.444444 15 LIABILITY 41.666667 47.619048 44.444444 24 LIABILITY_DECREASE 100.000000 33.333333 50.000000 1 ORG 95.000000 95.000000 95.000000 20 PERCENTAGE 98.951049 99.472759 99.211218 572 PROFIT 77.973568 78.666667 78.318584 227 PROFIT_DECLINE 48.648649 48.648649 48.648649 37 PROFIT_INCREASE 69.285714 66.896552 68.070175 140 TICKER 94.736842 100.000000 97.297297 19 accuracy - - 0.9585 19083 macro-avg 0.6513 0.5875 0.6067 19083 weighted-avg 0.9577 0.9585 0.9577 19083 ``` --- layout: model title: English DistilBertForTokenClassification Cased model (from Lucifermorningstar011) author: John Snow Labs name: distilbert_token_classifier_autotrain_final_784824211 date: 2023-03-14 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-final-784824211` is an English model originally trained by `Lucifermorningstar011`.
## Predicted Entities `9`, `0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824211_en_4.3.1_3.0_1678783263483.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824211_en_4.3.1_3.0_1678783263483.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824211","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824211","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_final_784824211| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Lucifermorningstar011/autotrain-final-784824211 --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from hadxu) author: John Snow Labs name: xlmroberta_ner_hadxu_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `hadxu`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_hadxu_base_finetuned_panx_de_4.1.0_3.0_1660433336876.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_hadxu_base_finetuned_panx_de_4.1.0_3.0_1660433336876.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_hadxu_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_hadxu_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
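The `NerConverter` stage above merges token-level IOB tags (`B-PER`, `I-PER`, ...) into entity chunks. A minimal pure-Python sketch of that merging logic (illustrative, not the Spark NLP implementation):

```python
def bio_to_chunks(tokens, tags):
    """Merge B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the previous chunk before starting a new one
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)  # continue the open chunk
        else:  # O tag (or a dangling I-) closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Angela", "Merkel", "besuchte", "Berlin"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(bio_to_chunks(tokens, tags))  # [('Angela Merkel', 'PER'), ('Berlin', 'LOC')]
```

The `ner_chunk` column produced by the pipeline carries the same information, plus character offsets and confidence metadata.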
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_hadxu_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/hadxu/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from annt5396) author: John Snow Labs name: distilbert_qa_annt5396_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `annt5396`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_annt5396_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769820445.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_annt5396_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769820445.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_annt5396_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_annt5396_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_annt5396_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/annt5396/distilbert-base-uncased-finetuned-squad --- layout: model title: German asr_exp_w2v2r_xls_r_gender_male_0_female_10_s922 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_xls_r_gender_male_0_female_10_s922 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_gender_male_0_female_10_s922` is a German model originally trained by jonatasgrosman. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_exp_w2v2r_xls_r_gender_male_0_female_10_s922_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_gender_male_0_female_10_s922_de_4.2.0_3.0_1664119967685.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_gender_male_0_female_10_s922_de_4.2.0_3.0_1664119967685.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_xls_r_gender_male_0_female_10_s922", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_xls_r_gender_male_0_female_10_s922", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
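Internally, a Wav2Vec2 CTC model predicts one character per audio frame, and CTC decoding collapses those frame labels into text: repeated symbols are merged and the blank symbol is dropped. A minimal greedy-decoding sketch over already-argmaxed frame labels (illustrative; Spark NLP performs this step inside the annotator):

```python
BLANK = "_"  # the CTC blank symbol (an assumption for this sketch)

def ctc_greedy_collapse(frame_labels):
    """Collapse repeated frame labels, then drop blanks, per the CTC rule."""
    out, prev = [], None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

# "hallo": the doubled 'l' survives because a blank separates the two runs.
frames = ["h", "h", "a", "_", "l", "l", "_", "l", "o", "o"]
print(ctc_greedy_collapse(frames))  # hallo
```

This is why CTC models need a blank symbol at all: without it, genuinely repeated characters would be collapsed into one.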
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_xls_r_gender_male_0_female_10_s922| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: English asr_xlsr_wav2vec2_final TFWav2Vec2ForCTC from chrisvinsen author: John Snow Labs name: asr_xlsr_wav2vec2_final date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xlsr_wav2vec2_final` is an English model originally trained by chrisvinsen. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_xlsr_wav2vec2_final_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xlsr_wav2vec2_final_en_4.2.0_3.0_1664110483097.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xlsr_wav2vec2_final_en_4.2.0_3.0_1664110483097.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_xlsr_wav2vec2_final", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_xlsr_wav2vec2_final", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_xlsr_wav2vec2_final| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: English BertForQuestionAnswering model (from tiennvcs) author: John Snow Labs name: bert_qa_bert_large_uncased_finetuned_docvqa date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-finetuned-docvqa` is an English model originally trained by `tiennvcs`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_finetuned_docvqa_en_4.0.0_3.0_1654536476096.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_finetuned_docvqa_en_4.0.0_3.0_1654536476096.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_finetuned_docvqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_large_uncased_finetuned_docvqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.large_uncased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
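Under the hood, extractive QA models like the one above score every context token as a possible answer start and end, and the answer is the highest-scoring valid span. A minimal sketch with toy scores (illustrative only, not the annotator's actual decoding code):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick (start, end) token indices maximizing start+end score, start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Toy per-token scores for the example context (hand-picked for illustration).
context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]
end   = [0.0, 0.1, 0.0, 4.0, 0.2, 0.0, 0.1, 0.0, 0.5, 0.0]
s, e = best_span(start, end)
print(" ".join(context[s:e + 1]))  # Clara
```

The `max_len` bound is a common heuristic to keep the search from returning implausibly long spans.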
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_large_uncased_finetuned_docvqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/tiennvcs/bert-large-uncased-finetuned-docvqa --- layout: model title: Language Detection & Identification Pipeline - 231 Languages author: John Snow Labs name: detect_language_231 date: 2020-12-05 task: [Pipeline Public, Language Detection, Sentence Detection] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [language_detection, open_source, pipeline, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Language detection and identification is the task of automatically detecting the language(s) present in a document based on the content of the document. ``LanguageDetectorDL`` is an annotator that detects the language of documents or sentences depending on the ``inputCols``. In addition, ``LanguageDetectorDL`` can accurately detect language from documents with mixed languages by coalescing sentences and selecting the best candidate. We have designed and developed Deep Learning models using CNN architectures in TensorFlow/Keras. The models are trained on large datasets such as Wikipedia and Tatoeba with high accuracy evaluated on the Europarl dataset. The output is a language code in Wiki Code style: [https://en.wikipedia.org/wiki/List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias).
This pipeline can detect the following languages: ## Predicted Entities `Achinese`, `Afrikaans`, `Tosk Albanian`, `Amharic`, `Aragonese`, `Old English`, `Arabic`, `Egyptian Arabic`, `Assamese`, `Asturian`, `Avaric`, `Aymara`, `Azerbaijani`, `South Azerbaijani`, `Bashkir`, `Bavarian`, `bat-smg`, `Central Bikol`, `Belarusian`, `Bulgarian`, `bh`, `Banjar`, `Bengali`, `Tibetan`, `Bishnupriya`, `Breton`, `Bosnian`, `Russia Buriat`, `Catalan`, `cbk-zam`, `Min Dong Chinese`, `Chechen`, `Cebuano`, `Central Kurdish (Soranî)`, `Corsican`, `Crimean Tatar`, `Czech`, `Kashubian`, `Chuvash`, `Welsh`, `Danish`, `German`, `Dimli (individual language)`, `Lower Sorbian`, `dty`, `Dhivehi`, `Greek`, `eml`, `English`, `Esperanto`, `Spanish`, `Estonian`, `Extremaduran`, `Persian`, `Finnish`, `fiu-vro`, `Faroese`, `French`, `Arpitan`, `Friulian`, `Frisian`, `Irish`, `Gagauz`, `Scottish Gaelic`, `Galician`, `Gilaki`, `Guarani`, `Konkani (Goan)`, `Gujarati`, `Manx`, `Hausa`, `Hakka Chinese`, `Hebrew`, `Hindi`, `Fiji Hindi`, `Croatian`, `Upper Sorbian`, `Haitian Creole`, `Hungarian`, `Armenian`, `Interlingua`, `Indonesian`, `Interlingue`, `Igbo`, `Ilocano`, `Ido`, `Icelandic`, `Italian`, `Japanese`, `Jamaican Patois`, `Lojban`, `Javanese`, `Georgian`, `Karakalpak`, `Kabyle`, `Kabardian`, `Kazakh`, `Khmer`, `Kannada`, `Korean`, `Komi-Permyak`, `Karachay-Balkar`, `Kölsch`, `Kurdish`, `Komi`, `Cornish`, `Kyrgyz`, `Latin`, `Ladino`, `Luxembourgish`, `Lezghian`, `Luganda`, `Limburgan`, `Ligurian`, `Lombard`, `Lingala`, `Lao`, `Northern Luri`, `Lithuanian`, `Latgalian`, `Latvian`, `Maithili`, `map-bms`, `Moksha`, `Malagasy`, `Meadow Mari`, `Maori`, `Minangkabau`, `Macedonian`, `Malayalam`, `Mongolian`, `Marathi`, `Hill Mari`, `Malay`, `Maltese`, `Mirandese`, `Burmese`, `Erzya`, `Mazanderani`, `Nahuatl`, `Neapolitan`, `Low German (Low Saxon)`, `nds-nl`, `Nepali`, `Newari`, `Dutch`, `Norwegian Nynorsk`, `Norwegian`, `Narom`, `Pedi`, `Navajo`, `Occitan`, `Livvi`, `Oromo`, `Odia (Oriya)`, `Ossetian`, 
`Punjabi (Eastern)`, `Pangasinan`, `Kapampangan`, `Papiamento`, `Picard`, `Pennsylvania German`, `Palatine German`, `Polish`, `Punjabi (Western)`, `Pashto`, `Portuguese`, `Quechua`, `Romansh`, `Romanian`, `roa-rup`, `roa-tara`, `Russian`, `Rusyn`, `Kinyarwanda`, `Sanskrit`, `Yakut`, `Sardinian`, `Sicilian`, `Sindhi`, `Northern Sami`, `Serbo-Croatian`, `Sinhala`, `Slovak`, `Slovenian`, `Shona`, `Somali`, `Albanian`, `Serbian`, `Sranan Tongo`, `Saterland Frisian`, `Sundanese`, `Swedish`, `Swahili`, `Silesian`, `Tamil`, `Tulu`, `Telugu`, `Tetun`, `Tajik`, `Thai`, `Turkmen`, `Tagalog`, `Setswana`, `Tongan`, `Turkish`, `Tatar`, `Tuvinian`, `Udmurt`, `Uyghur`, `Ukrainian`, `Urdu`, `Uzbek`, `Venetian`, `Veps`, `Vietnamese`, `Vlaams`, `Volapük`, `Walloon`, `Waray`, `Wolof`, `Shanghainese`, `Xhosa`, `Mingrelian`, `Yiddish`, `Yoruba`, `Zeeuws`, `Chinese`, `zh-classical`, `zh-min-nan`, `zh-yue`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/LANGUAGE_DETECTOR/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/detect_language_231_xx_2.7.0_2.4_1607185843755.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/detect_language_231_xx_2.7.0_2.4_1607185843755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("detect_language_231", lang = "xx") pipeline.annotate("French author who helped pioneer the science-fiction genre.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("detect_language_231", lang = "xx") pipeline.annotate("French author who helped pioneer the science-fiction genre.") ``` {:.nlu-block} ```python import nlu text = ["French author who helped pioneer the science-fiction genre."] lang_df = nlu.load("xx.classify.lang.231").predict(text) lang_df ```
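The `annotate` call above returns a plain dictionary keyed by output column. A minimal plain-Python sketch of consuming that output, using the dict shape shown in the Results section of this card (the `annotations` value here is a hand-written reconstruction, not a live pipeline call):

```python
# Hand-written reconstruction of PretrainedPipeline.annotate output,
# mirroring the dict shown in the Results section of this card.
annotations = {
    "document": ["French author who helped pioneer the science-fiction genre."],
    "language": ["en"],
}

def detected_language(ann):
    # LanguageDetectorDL emits one Wiki-style language code per document
    return ann["language"][0]

print(detected_language(annotations))  # en
```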
## Results ```bash {'document': ['French author who helped pioneer the science-fiction genre.'], 'language': ['en']} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|detect_language_231| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - LanguageDetectorDL --- layout: model title: Detect Entities in Urdu (urduvec_140M_300d embeddings) author: John Snow Labs name: uner_mk_140M_300d date: 2022-08-09 tags: [ner, ur, open_source] task: Named Entity Recognition language: ur edition: Spark NLP 4.0.2 spark_version: 3.0 supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses Urdu word embeddings to find 7 different types of entities in Urdu text. It is trained using `urduvec_140M_300d` word embeddings, so please use the same embeddings in the pipeline. Predicted entities: Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Dates-`DATE`, Designations-`DESIGNATION`, Times-`TIME`, Numbers-`NUMBER`. ## Predicted Entities `PER`, `LOC`, `ORG`, `DATE`, `TIME`, `DESIGNATION`, `NUMBER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/uner_mk_140M_300d_ur_4.0.2_3.0_1660035998466.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/uner_mk_140M_300d_ur_4.0.2_3.0_1660035998466.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = NerDLModel.pretrained("uner_mk_140M_300d", "ur") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate("""بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔""") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("uner_mk_140M_300d", "ur") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val nlp_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, ner, ner_converter)) val data = Seq("""بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔""").toDS.toDF("text") val result = 
nlp_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ur.ner").predict("""بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔""") ```
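The Benchmarking section of this card reports raw `tp`/`fp`/`fn` counts per label alongside precision, recall, and F1. As a reminder, the metric columns derive from the counts as follows (a small plain-Python helper, checked here against the `I-TIME` row of the table):

```python
def ner_metrics(tp, fp, fn):
    # Precision/recall/F1 as used in the per-label Benchmarking rows.
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return prec, rec, f1

# I-TIME row from the Benchmarking table: tp=12, fp=10, fn=1
prec, rec, f1 = ner_metrics(12, 10, 1)
print(round(prec, 6), round(rec, 6), round(f1, 6))  # 0.545455 0.923077 0.685714
```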
## Results ```bash | | ner_chunk | entity | |---:|---------------:|-------------:| | 0 |بریگیڈیئر | DESIGNATION | | 1 |ایڈ بٹلر | PERSON | | 2 |سنہ دوہزارچھ | DATE | | 3 |ہلمند | LOCATION | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|uner_mk_140M_300d| |Type:|ner| |Compatibility:|Spark NLP 4.0.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token, word_embeddings]| |Output Labels:|[ner]| |Language:|ur| |Size:|14.8 MB| ## References This model is trained using the following datasets: https://www.researchgate.net/publication/312218764_Named_Entity_Dataset_for_Urdu_Named_Entity_Recognition_Task https://www.researchgate.net/publication/332653135_Urdu_Named_Entity_Recognition_Corpus_Generation_and_Deep_Learning_Applications ## Benchmarking ```bash label tp fp fn prec rec f1 I-TIME 12 10 1 0.545455 0.923077 0.685714 B-PERSON 2808 846 535 0.768473 0.839964 0.80263 B-DATE 34 6 6 0.85 0.85 0.85 I-DATE 45 1 2 0.978261 0.957447 0.967742 B-DESIGNATION 49 30 16 0.620253 0.753846 0.680556 I-LOCATION 2110 750 701 0.737762 0.750623 0.744137 B-TIME 11 9 3 0.55 0.785714 0.647059 I-ORGANIZATION 2006 772 760 0.722102 0.725235 0.723665 I-NUMBER 18 6 2 0.75 0.9 0.818182 B-LOCATION 5428 1255 582 0.81221 0.903161 0.855275 B-NUMBER 194 36 27 0.843478 0.877828 0.86031 I-DESIGNATION 25 15 6 0.625 0.806452 0.704225 I-PERSON 3562 759 433 0.824346 0.891614 0.856662 B-ORGANIZATION 1114 466 641 0.705063 0.634758 0.668066 Macro-average 17416 4961 3715 0.738029 0.828551 0.780675 Micro-average 17416 4961 3715 0.778299 0.824192 0.800588 ``` --- layout: model title: Chinese Financial NER (sm, bert_embeddings_mengzi_bert_base_fin) author: John Snow Labs name: finner_finance_chinese_sm date: 2023-02-04 tags: [zh, cn, finance, ner, licensed] task: Named Entity Recognition language: zh edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: FinanceNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" 
--- ## Description This is the small version of the NER model for Financial Chinese texts, trained on a subset of **ChFinAnn** (see "References"). To use this model, use the `BertEmbeddings` model named `bert_embeddings_mengzi_bert_base_fin` as: ```python bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_mengzi_bert_base_fin","zh") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") ``` Also, please note that Chinese texts are not separated by white space. The embedding model we use is based on character-level embeddings, so you need to split the text on every character (for example, by setting `.setSplitPattern("")` in the `Tokenizer` annotator). ## Predicted Entities `AveragePrice`, `ClosingDate`, `CompanyName`, `EndDate`, `EquityHolder`, `FrozeShares`, `HighestTradingPrice`, `LaterHoldingShares`, `LegalInstitution`, `LowestTradingPrice`, `OtherType`, `PledgedShares`, `Pledgee`, `ReleasedDate`, `RepurchaseAmount`, `RepurchasedShares`, `StartDate`, `StockAbbr`, `StockCode`, `TotalHoldingRatio`, `TotalHoldingShares`, `TotalPledgedShares`, `TradedShares`, `UnfrozeDate` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FINNER_FINANCE_CHINESE){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_finance_chinese_sm_zh_1.0.0_3.0_1675554138686.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_finance_chinese_sm_zh_1.0.0_3.0_1675554138686.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pyspark.sql.functions as F documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token")\ .setSplitPattern("") # Split on char level embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_mengzi_bert_base_fin","zh") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = finance.NerModel.pretrained("finner_finance_chinese_sm", "zh", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""近日,渤海水业股份有限公司(以下简称“公司”)收到公司持股5%以上股东李华青女士的《告知函》,获悉李华青女士将其所持有的部分公司股票进行补充质押,具体事项如下:"""] res = model.transform(spark.createDataFrame([text]).toDF("text")) res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.metadata)).alias("cols")) \ .select(F.expr("cols['0']").alias("chunk"), F.expr("cols['1']['entity']").alias("ner_label"), F.expr("cols['1']['confidence']").alias("confidence")).show(truncate=False) ```
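Because the embeddings are character-level, the `Tokenizer` in the pipeline above must emit one token per Chinese character (that is what splitting on an empty pattern achieves). A quick plain-Python illustration of what char-level splitting of unsegmented Chinese text looks like:

```python
# Chinese has no whitespace between words, so tokenization for a
# character-level embedding model means one token per character.
text = "渤海水业股份有限公司"
chars = list(text)  # plain-Python analogue of char-level tokenization

print(chars[:4])  # ['渤', '海', '水', '业']
print(len(chars))  # 10
```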
## Results ```bash +------------------------------------+------------+----------+ |chunk |ner_label |confidence| +------------------------------------+------------+----------+ |业股份有限公司(以下简称“公司”)收到|CompanyName |0.7933 | |质押,具体 |EquityHolder|0.9378667 | +------------------------------------+------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_finance_chinese_sm| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|zh| |Size:|16.8 MB| ## References The dataset used for training was a subset of the **chFinAnn** dataset, consisting of financial statements of Chinese listed companies from 2008 to 2018. Reference: - [Doc2EDAG: An End-to-End Document-level Framework for Chinese Financial Event Extraction](https://aclanthology.org/D19-1032) (Zheng et al., EMNLP-IJCNLP 2019) ## Sample text from the training dataset 近日,渤海水业股份有限公司(以下简称“公司”)收到公司持股5%以上股东李华青女士的《告知函》,获悉李华青女士将其所持有的部分公司股票进行补充质押,具体事项如下: ## Benchmarking ```bash entity precision recall f1 support AveragePrice 78.0731 85.1449 81.4558 301 ClosingDate 76.0148 57.7031 65.6051 271 CompanyName 94.0767 95.9251 94.9919 5605 EndDate 75.487 44.5402 56.0241 616 EquityHolder 83.8303 91.319 87.4146 7780 FrozeShares 47.4227 31.7241 38.0165 97 HighestTradingPrice 74.4186 70.5085 72.4108 559 LaterHoldingShares 31.4961 11.9048 17.2786 127 LegalInstitution 92.3767 87.6596 89.9563 223 LowestTradingPrice 78.5047 52.1739 62.6866 107 OtherType 78.2961 36.7619 50.0324 493 PledgedShares 78.0776 65.5189 71.249 1779 Pledgee 90.1003 88.656 89.3723 1596 ReleasedDate 54.5016 46.1853 50 622 RepurchaseAmount 68.323 72.8477 70.5128 322 RepurchasedShares 79.5918 70.4819 74.7604 588 StartDate 65.4217 77.5493 70.9711 4150 StockAbbr 83.7656 82.0521 82.9 2969 StockCode 99.8355 99.5626 99.6989 1824 TotalHoldingRatio 74.2574 85.1628 79.3371 1515 TotalHoldingShares 67.0582 88.7306 76.387 2043 
TotalPledgedShares 74.5989 86.0226 79.9045 1122 TradedShares 76.9231 68.6948 72.5765 910 UnfrozeDate 5.88235 5.55556 5.71429 17 ``` --- layout: model title: Pipeline to Resolve ICD-9-CM Codes author: John Snow Labs name: icd9_resolver_pipeline date: 2023-03-29 tags: [en, licensed, clinical, resolver, pipeline, chunk_mapping, icd9] task: Chunk Mapping language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps entities with their corresponding ICD-9-CM codes. You’ll just feed your text and it will return the corresponding ICD-9-CM codes. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd9_resolver_pipeline_en_4.3.2_3.2_1680119796842.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd9_resolver_pipeline_en_4.3.2_3.2_1680119796842.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("icd9_resolver_pipeline", "en", "clinical/models") text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage""" result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("icd9_resolver_pipeline", "en", "clinical/models") val result = pipeline.fullAnnotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd9.pipeline").predict("""Put your text here.""") ```
## Results ```bash +-----------------------------+---------+---------+ |chunk |ner_chunk|icd9_code| +-----------------------------+---------+---------+ |gestational diabetes mellitus|PROBLEM |V12.21 | |anisakiasis |PROBLEM |127.1 | |fetal and neonatal hemorrhage|PROBLEM |772 | +-----------------------------+---------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd9_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|2.2 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger --- layout: model title: Voice of the Patients (embeddings_clinical_large) author: John Snow Labs name: ner_vop_emb_clinical_large_wip date: 2023-04-12 tags: [licensed, clinical, en, ner, vop, patient] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.0 spark_version: [3.2, 3.0] supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts healthcare-related terms from documents written in the patient’s own words. Note: the ‘wip’ suffix indicates that model development is work in progress; the model will be finalised and its performance improved in upcoming releases. 
## Predicted Entities `Allergen`, `SubstanceQuantity`, `RaceEthnicity`, `Measurements`, `InjuryOrPoisoning`, `Treatment`, `TestResult`, `Modifier`, `Route`, `MedicalDevice`, `Vaccine`, `RelationshipStatus`, `Frequency`, `HealthStatus`, `Procedure`, `Duration`, `DateTime`, `Disease`, `Test`, `Substance`, `Symptom`, `Laterality`, `Dosage`, `ClinicalDept`, `PsychologicalCondition`, `VitalTest`, `Age`, `Drug`, `BodyPart`, `AdmissionDischarge`, `Form`, `Employment`, `Gender` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/VOICE_OF_THE_PATIENTS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_emb_clinical_large_wip_en_4.4.0_3.0_1681315187438.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_emb_clinical_large_wip_en_4.4.0_3.0_1681315187438.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_emb_clinical_large_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["""Hello,I"m 20 year old girl. I"m diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I"m taking weekly supplement of vitamin D and 1000 mcg b12 daily. I"m taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I"m facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. 
I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_emb_clinical_large_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("""Hello,I"m 20 year old girl. I"m diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . 
Also i have b12 deficiency and vitamin D deficiency so I"m taking weekly supplement of vitamin D and 1000 mcg b12 daily. I"m taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I"m facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:---------------------|:-----------------------| | 20 year old | Age | | girl | Gender | | hyperthyroid | Disease | | 1 month ago | DateTime | | weak | Symptom | | light | Symptom | | panic attacks | PsychologicalCondition | | depression | PsychologicalCondition | | left | Laterality | | chest | BodyPart | | pain | Symptom | | increased | TestResult | | heart rate | VitalTest | | rapidly | Modifier | | weight loss | Symptom | | 4 months | Duration | | hospital | ClinicalDept | | discharged | AdmissionDischarge | | hospital | ClinicalDept | | blood tests | Test | | brain | BodyPart | | mri | Test | | ultrasound scan | Test | | endoscopy | Procedure | | doctors | Employment | | homeopathy doctor | Employment | | he | Gender | | hyperthyroid | Disease | | TSH | Test | | 0.15 | TestResult | | T3 | Test | | T4 | Test | | normal | TestResult | | b12 deficiency | Disease | | vitamin D deficiency | Disease | | weekly | Frequency | | supplement | Drug | | vitamin D | Drug | | 1000 mcg | Dosage | | b12 | Drug | | daily | Frequency | | homeopathy medicine | Treatment | | 40 days | Duration | | after 30 days | DateTime | | TSH | Test | | 0.5 | TestResult | | now | DateTime | | weakness | Symptom | | depression | PsychologicalCondition | | last week | DateTime | | heartrate | VitalTest | | homeopathy | Treatment | | thyroid | BodyPart | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_emb_clinical_large_wip| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.9 MB| |Dependencies:|embeddings_clinical_large| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I"m 20 year old girl. I"m diagnosed with hyperthyroid 1 month ago. 
I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I"m taking weekly supplement of vitamin D and 1000 mcg b12 daily. I"m taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I"m facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
## Benchmarking ```bash label tp fp fn total precision recall f1 Allergen 0 0 8 8 0.00 0.00 0.00 SubstanceQuantity 7 10 20 27 0.41 0.26 0.32 RaceEthnicity 2 0 6 8 1.00 0.25 0.40 Measurements 36 20 38 74 0.64 0.49 0.55 InjuryOrPoisoning 65 28 52 117 0.70 0.56 0.62 Treatment 86 27 56 142 0.76 0.61 0.67 TestResult 379 150 169 548 0.72 0.69 0.70 Modifier 644 229 269 913 0.74 0.71 0.72 Route 23 5 13 36 0.82 0.64 0.72 MedicalDevice 171 54 73 244 0.76 0.70 0.73 Vaccine 21 4 11 32 0.84 0.66 0.74 RelationshipStatus 18 2 10 28 0.90 0.64 0.75 Frequency 478 161 165 643 0.75 0.74 0.75 HealthStatus 75 19 23 98 0.80 0.77 0.78 Procedure 275 68 91 366 0.80 0.75 0.78 Duration 884 275 231 1115 0.76 0.79 0.78 DateTime 1796 397 408 2204 0.82 0.81 0.82 Disease 1258 323 245 1503 0.80 0.84 0.82 Test 752 155 157 909 0.83 0.83 0.83 Substance 152 37 26 178 0.80 0.85 0.83 Symptom 3078 547 621 3699 0.85 0.83 0.84 Laterality 425 62 93 518 0.87 0.82 0.85 Dosage 266 37 56 322 0.88 0.83 0.85 ClinicalDept 201 28 35 236 0.88 0.85 0.86 PsychologicalCondition 282 41 32 314 0.87 0.90 0.89 VitalTest 146 25 11 157 0.85 0.93 0.89 Age 295 38 28 323 0.89 0.91 0.90 Drug 1040 136 95 1135 0.88 0.92 0.90 BodyPart 2528 245 217 2745 0.91 0.92 0.92 AdmissionDischarge 24 1 3 27 0.96 0.89 0.92 Form 233 24 18 251 0.91 0.93 0.92 Employment 988 51 54 1042 0.95 0.95 0.95 Gender 1173 26 21 1194 0.98 0.98 0.98 macro_avg 17801 3225 3355 21156 0.80 0.73 0.76 micro_avg 17801 3225 3355 21156 0.85 0.84 0.84 ``` --- layout: model title: English BertForMaskedLM Base Uncased model (from ayansinha) author: John Snow Labs name: bert_embeddings_false_positives_scancode_base_uncased_l8_1 date: 2022-12-02 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to 
provide scalability and production-readiness using Spark NLP. `false-positives-scancode-bert-base-uncased-L8-1` is an English model originally trained by `ayansinha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_false_positives_scancode_base_uncased_l8_1_en_4.2.4_3.0_1670022053884.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_false_positives_scancode_base_uncased_l8_1_en_4.2.4_3.0_1670022053884.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_false_positives_scancode_base_uncased_l8_1","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_false_positives_scancode_base_uncased_l8_1","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_false_positives_scancode_base_uncased_l8_1| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|409.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/ayansinha/false-positives-scancode-bert-base-uncased-L8-1 - https://github.com/nexB/scancode-results-analyzer - https://github.com/nexB/scancode-results-analyzer#quickstart---local-machine - https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py --- layout: model title: English RobertaForQuestionAnswering (from eAsyle) author: John Snow Labs name: roberta_qa_testABSA3 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `testABSA3` is an English model originally trained by `eAsyle`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_testABSA3_en_4.0.0_3.0_1655739938744.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_testABSA3_en_4.0.0_3.0_1655739938744.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_testABSA3","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_testABSA3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.testabsa3.by_eAsyle").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_testABSA3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|426.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/eAsyle/testABSA3 --- layout: model title: Diseases and Syndromes to UMLS Code Pipeline author: John Snow Labs name: umls_disease_syndrome_resolver_pipeline date: 2023-03-30 tags: [en, licensed, umls, pipeline] task: Pipeline Healthcare language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps entities (Diseases and Syndromes) with their corresponding UMLS CUI codes. You’ll just feed your text and it will return the corresponding UMLS codes. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/umls_disease_syndrome_resolver_pipeline_en_4.3.2_3.2_1680190919515.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/umls_disease_syndrome_resolver_pipeline_en_4.3.2_3.2_1680190919515.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("umls_disease_syndrome_resolver_pipeline", "en", "clinical/models")

pipeline.annotate("A 34-year-old female with a history of poor appetite, gestational diabetes mellitus, acyclovir allergy and polyuria")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = PretrainedPipeline("umls_disease_syndrome_resolver_pipeline", "en", "clinical/models")

val result = pipeline.annotate("A 34-year-old female with a history of poor appetite, gestational diabetes mellitus, acyclovir allergy and polyuria")
```

{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.umls_disease_syndrome_resolver").predict("""A 34-year-old female with a history of poor appetite, gestational diabetes mellitus, acyclovir allergy and polyuria""")
```
## Results

```bash
+-----------------------------+---------+---------+
|chunk                        |ner_label|umls_code|
+-----------------------------+---------+---------+
|poor appetite                |PROBLEM  |C0003123 |
|gestational diabetes mellitus|PROBLEM  |C0085207 |
|acyclovir allergy            |PROBLEM  |C0571297 |
|polyuria                     |PROBLEM  |C0018965 |
+-----------------------------+---------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|umls_disease_syndrome_resolver_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|3.4 GB|

## Included Models

- DocumentAssembler
- SentenceDetector
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- ChunkMergeModel
- ChunkMapperModel
- ChunkMapperFilterer
- Chunk2Doc
- BertSentenceEmbeddings
- SentenceEntityResolverModel
- ResolverMerger

---
layout: model
title: English BertForQuestionAnswering Uncased model (from husnu)
author: John Snow Labs
name: bert_qa_xtremedistil_l6_h256_uncased_tquad_finetuned_lr_2e_05_epochs_3
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xtremedistil-l6-h256-uncased-TQUAD-finetuned_lr-2e-05_epochs-3` is an English model originally trained by `husnu`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_xtremedistil_l6_h256_uncased_tquad_finetuned_lr_2e_05_epochs_3_en_4.0.0_3.0_1657193231850.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_xtremedistil_l6_h256_uncased_tquad_finetuned_lr_2e_05_epochs_3_en_4.0.0_3.0_1657193231850.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_xtremedistil_l6_h256_uncased_tquad_finetuned_lr_2e_05_epochs_3","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_xtremedistil_l6_h256_uncased_tquad_finetuned_lr_2e_05_epochs_3","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_xtremedistil_l6_h256_uncased_tquad_finetuned_lr_2e_05_epochs_3|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|47.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/husnu/xtremedistil-l6-h256-uncased-TQUAD-finetuned_lr-2e-05_epochs-3

---
layout: model
title: Italian Bert Embeddings (from Geotrend)
author: John Snow Labs
name: bert_embeddings_bert_base_it_cased
date: 2022-04-11
tags: [bert, embeddings, it, open_source]
task: Embeddings
language: it
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-it-cased` is an Italian model originally trained by `Geotrend`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_it_cased_it_3.4.2_3.0_1649676920355.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_it_cased_it_3.4.2_3.0_1649676920355.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_it_cased","it") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Adoro Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_it_cased","it") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Adoro Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("it.embed.bert_it_cased").predict("""Adoro Spark NLP""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_it_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|it|
|Size:|396.6 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/Geotrend/bert-base-it-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers

---
layout: model
title: Legal Standard terms and conditions of trust Clause Binary Classifier
author: John Snow Labs
name: legclf_standard_terms_and_conditions_of_trust_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `standard-terms-and-conditions-of-trust` clause type. To use this model, make sure you provide enough context as an input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.

If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into account that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
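As a minimal illustration of the paragraph-splitting technique mentioned above, a plain-Python sketch (independent of Spark NLP; the helper name `split_paragraphs` is ours) that chunks a document on blank lines before classification could look like this:

```python
import re

def split_paragraphs(text):
    # Split a long legal document into paragraph-sized chunks on blank lines,
    # so each chunk stays within the 512-token limit of the embeddings.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "First clause text.\n\nSecond clause text.\n\n\nThird clause."
print(split_paragraphs(doc))
# → ['First clause text.', 'Second clause text.', 'Third clause.']
```

Each resulting chunk can then be sent through the classification pipeline as a separate row.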
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `standard-terms-and-conditions-of-trust`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_standard_terms_and_conditions_of_trust_clause_en_1.0.0_3.2_1660123035837.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_standard_terms_and_conditions_of_trust_clause_en_1.0.0_3.2_1660123035837.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_standard_terms_and_conditions_of_trust_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
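The idea of running several clause classifiers and collecting one True/False verdict per clause can be sketched in plain Python (a toy illustration, not the Spark NLP API; `combine_clause_verdicts` and the keyword-based stand-ins are hypothetical):

```python
def combine_clause_verdicts(text, classifiers):
    # classifiers: mapping of clause name -> callable(text) -> bool.
    # Each callable stands in for one binary clause model from Models Hub.
    return {name: clf(text) for name, clf in classifiers.items()}

# Toy stand-ins for two clause models (simple keyword checks, not real classifiers):
demo = {
    "standard-terms-and-conditions-of-trust": lambda t: "trust" in t.lower(),
    "arbitration": lambda t: "arbitration" in t.lower(),
}
print(combine_clause_verdicts("The standard terms and conditions of the trust ...", demo))
# → {'standard-terms-and-conditions-of-trust': True, 'arbitration': False}
```

In a real pipeline, each entry would be a separate `legclf_*` model's `category` output for the same chunk of text.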
## Results

```bash
+----------------------------------------+
|result                                  |
+----------------------------------------+
|[standard-terms-and-conditions-of-trust]|
|[other]                                 |
|[other]                                 |
|[standard-terms-and-conditions-of-trust]|
+----------------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_standard_terms_and_conditions_of_trust_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.0 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
                                 label  precision  recall  f1-score  support
                                 other       0.75    1.00      0.86        6
standard-terms-and-conditions-of-trust       0.00    0.00      0.00        2
                              accuracy          -       -      0.75        8
                             macro-avg       0.38    0.50      0.43        8
                          weighted-avg       0.56    0.75      0.64        8
```

---
layout: model
title: English image_classifier_vit_CarViT ViTForImageClassification from abdusahmbzuai
author: John Snow Labs
name: image_classifier_vit_CarViT
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_CarViT` is an English model originally trained by abdusahmbzuai.
## Predicted Entities `Toyota`, `Audi`, `Dodge`, `Aston Martin`, `Chevrolet`, `Mitsubishi`, `Kia`, `Honda`, `Chrysler`, `Lexus`, `Land Rover`, `Rolls-Royce`, `Porsche`, `FIAT`, `Cadillac`, `Jaguar`, `smart`, `Tesla`, `Maserati`, `Buick`, `GMC`, `Genesis`, `McLaren`, `Bentley`, `BMW`, `Lincoln`, `Subaru`, `Volvo`, `Lamborghini`, `Nissan`, `Alfa Romeo`, `Jeep`, `INFINITI`, `Mazda`, `Hyundai`, `Volkswagen`, `Ram`, `Ferrari`, `Acura`, `Mercedes-Benz`, `MINI`, `Ford` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_CarViT_en_4.1.0_3.0_1660165745338.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_CarViT_en_4.1.0_3.0_1660165745338.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_CarViT", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_CarViT", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_CarViT|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|322.0 MB|

---
layout: model
title: English BertForQuestionAnswering model (from SreyanG-NVIDIA)
author: John Snow Labs
name: bert_qa_SreyanG_NVIDIA_bert_base_uncased_finetuned_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-finetuned-squad` is an English model originally trained by `SreyanG-NVIDIA`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_SreyanG_NVIDIA_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654181017940.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_SreyanG_NVIDIA_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654181017940.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_SreyanG_NVIDIA_bert_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_SreyanG_NVIDIA_bert_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_SreyanG_NVIDIA_bert_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/SreyanG-NVIDIA/bert-base-uncased-finetuned-squad

---
layout: model
title: Kanuri asr_wav2vec2_xlsr_korean_senior TFWav2Vec2ForCTC from hyyoka
author: John Snow Labs
name: pipeline_asr_wav2vec2_xlsr_korean_senior
date: 2022-09-24
tags: [wav2vec2, kr, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: kr
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_korean_senior` is a Kanuri model originally trained by hyyoka.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xlsr_korean_senior_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_korean_senior_kr_4.2.0_3.0_1664024430198.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_korean_senior_kr_4.2.0_3.0_1664024430198.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xlsr_korean_senior', lang = 'kr') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xlsr_korean_senior", lang = "kr") val annotations = pipeline.transform(audioDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_xlsr_korean_senior|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|kr|
|Size:|1.2 GB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from razent)
author: John Snow Labs
name: t5_scifive_base_pubmed
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `SciFive-base-Pubmed` is an English model originally trained by `razent`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_scifive_base_pubmed_en_4.3.0_3.0_1675098922792.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_scifive_base_pubmed_en_4.3.0_3.0_1675098922792.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_scifive_base_pubmed","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_scifive_base_pubmed","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_scifive_base_pubmed| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|474.3 MB| ## References - https://huggingface.co/razent/SciFive-base-Pubmed - https://arxiv.org/abs/2106.03598 - https://github.com/justinphan3110/SciFive --- layout: model title: Word2Vec Embeddings in Nahuatl languages (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, nah, open_source] task: Embeddings language: nah edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_nah_3.4.1_3.0_1647447890372.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_nah_3.4.1_3.0_1647447890372.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","nah") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","nah") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("nah.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|nah| |Size:|67.6 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Detect Living Species(roberta_base_biomedical) author: John Snow Labs name: ner_living_species_roberta date: 2022-06-22 tags: [es, ner, clinical, licensed, roberta] task: Named Entity Recognition language: es edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract living species from clinical texts in Spanish which is critical to scientific disciplines like medicine, biology, ecology/biodiversity, nutrition and agriculture. This model is trained using `roberta_base_biomedical` embeddings. It is trained on the [LivingNER](https://temu.bsc.es/livingner/) corpus that is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. ## Predicted Entities `HUMAN`, `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_roberta_es_3.5.3_3.0_1655906938288.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_roberta_es_3.5.3_3.0_1655906938288.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_living_species_roberta", "es", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. 
Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_living_species_roberta", "es", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. 
Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.living_species.roberta").predict("""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.""") ```
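Once the pipeline has run, the `ner_chunk` annotations can be flattened and summarized in plain Python. A minimal sketch, assuming the chunk/label pairs have already been collected from `result` (the literals below mirror the example output shown in the Results section):

```python
from collections import Counter

# chunk/label pairs as collected from the `ner_chunk` column
# (illustrative literals mirroring the documented example output)
chunks = [
    ("Lactante varón", "HUMAN"), ("familiares", "HUMAN"),
    ("personales", "HUMAN"), ("neonatal", "HUMAN"),
    ("legumbres", "SPECIES"), ("lentejas", "SPECIES"),
    ("garbanzos", "SPECIES"), ("legumbres", "SPECIES"),
    ("madre", "HUMAN"), ("Cacahuete", "SPECIES"), ("padres", "HUMAN"),
]

# Count how many chunks each label received
label_counts = Counter(label for _, label in chunks)
```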
## Results ```bash +--------------+-------+ |ner_chunk |label | +--------------+-------+ |Lactante varón|HUMAN | |familiares |HUMAN | |personales |HUMAN | |neonatal |HUMAN | |legumbres |SPECIES| |lentejas |SPECIES| |garbanzos |SPECIES| |legumbres |SPECIES| |madre |HUMAN | |Cacahuete |SPECIES| |padres |HUMAN | +--------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_roberta| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|16.4 MB| ## References [https://temu.bsc.es/livingner/](https://temu.bsc.es/livingner/) ## Benchmarking ```bash label precision recall f1-score support B-HUMAN 0.99 0.99 0.99 3268 B-SPECIES 0.98 0.98 0.98 3688 I-HUMAN 0.94 0.97 0.95 297 I-SPECIES 0.98 0.90 0.94 1720 micro-avg 0.98 0.97 0.97 8973 macro-avg 0.97 0.96 0.96 8973 weighted-avg 0.98 0.97 0.97 8973 ``` --- layout: model title: Spanish Part of Speech Tagger (from mrm8488) author: John Snow Labs name: roberta_pos_RuPERTa_base_finetuned_pos date: 2022-05-03 tags: [roberta, pos, part_of_speech, es, open_source] task: Part of Speech Tagging language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `RuPERTa-base-finetuned-pos` is a Spanish model originally trained by `mrm8488`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_pos_RuPERTa_base_finetuned_pos_es_3.4.2_3.0_1651596033465.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_pos_RuPERTa_base_finetuned_pos_es_3.4.2_3.0_1651596033465.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_pos_RuPERTa_base_finetuned_pos","es") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_pos_RuPERTa_base_finetuned_pos","es") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.pos.RuPERTa_base_finetuned_pos").predict("""Amo Spark NLP""") ```
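When the fitted pipeline's output is collected to the driver, each annotation arrives as a struct with `result`, `begin`, and `end` fields, and tokens and tags are emitted one-to-one, so they can simply be zipped together. A minimal sketch with illustrative dicts (the tag values are hypothetical placeholders, not actual model output):

```python
# Collected annotations are structs with result/begin/end fields;
# these literals only illustrate the shape, not live model output.
token_anns = [
    {"result": "Amo", "begin": 0, "end": 2},
    {"result": "Spark", "begin": 4, "end": 8},
    {"result": "NLP", "begin": 10, "end": 12},
]
pos_anns = [{"result": "VERB"}, {"result": "PROPN"}, {"result": "PROPN"}]  # hypothetical tags

# Tokens and tags line up one-to-one, so zipping aligns them
tagged = [(t["result"], p["result"]) for t, p in zip(token_anns, pos_anns)]
```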
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_pos_RuPERTa_base_finetuned_pos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|es| |Size:|470.8 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/mrm8488/RuPERTa-base-finetuned-pos - https://www.kaggle.com/nltkdata/conll-corpora - https://twitter.com/mrm8488 - https://www.linkedin.com/in/manuel-romero-cs/ --- layout: model title: English BertForQuestionAnswering model (from SauravMaheshkar) author: John Snow Labs name: bert_qa_bert_large_uncased_whole_word_masking_finetuned_chaii date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-whole-word-masking-finetuned-chaii` is an English model originally trained by `SauravMaheshkar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_finetuned_chaii_en_4.0.0_3.0_1654536941289.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_finetuned_chaii_en_4.0.0_3.0_1654536941289.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_whole_word_masking_finetuned_chaii","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_large_uncased_whole_word_masking_finetuned_chaii","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.chaii.bert.large_uncased_uncased_whole_word_masking_finetuned.by_SauravMaheshkar").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
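Span-extraction models like this one predict character offsets into the context; the answer string is simply the corresponding slice. A minimal sketch with hypothetical offsets such as the model might emit for the example above:

```python
context = "My name is Clara and I live in Berkeley."
# Hypothetical begin/end character offsets, as a span-extraction model
# would predict for the question "What's my name?"
begin, end = 11, 15
answer = context[begin:end + 1]  # the end offset is inclusive
```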
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_large_uncased_whole_word_masking_finetuned_chaii| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/SauravMaheshkar/bert-large-uncased-whole-word-masking-finetuned-chaii --- layout: model title: English BertForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_8 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-16-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_8_en_4.0.0_3.0_1657192224643.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_8_en_4.0.0_3.0_1657192224643.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_8","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_8","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
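Models fine-tuned on SQuAD-style data such as this one are usually scored with exact match after light answer normalization (lowercasing, stripping punctuation and articles). A minimal, self-contained sketch of that metric (plain Python, not a Spark NLP API; the helper names are ours):

```python
import re
import string

def normalize(s: str) -> str:
    """SQuAD-style answer normalization."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)  # drop articles
    return " ".join(s.split())             # collapse whitespace

def exact_match(prediction: str, gold: str) -> int:
    return int(normalize(prediction) == normalize(gold))
```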
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_8| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|375.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-16-finetuned-squad-seed-8 --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from marioarteaga) author: John Snow Labs name: distilbert_qa_marioarteaga_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `marioarteaga`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_marioarteaga_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772174929.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_marioarteaga_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772174929.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_marioarteaga_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_marioarteaga_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
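Alongside exact match, SQuAD-style evaluation of models like this one reports a token-overlap F1 between the predicted and gold answers. A minimal sketch (plain Python, not a Spark NLP API; the helper name is ours):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between two answer strings."""
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```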
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_marioarteaga_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/marioarteaga/distilbert-base-uncased-finetuned-squad --- layout: model title: Extract Community Condition Entities from Social Determinants of Health Texts author: John Snow Labs name: ner_sdoh_community_condition_wip date: 2023-02-24 tags: [licensed, en, clinical, sdoh, social_determinants, ner, public_health, community, condition, community_condition] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.3.1 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts community condition information related to Social Determinants of Health from various kinds of biomedical documents. ## Predicted Entities `Transportation`, `Community_Living_Conditions`, `Housing`, `Food_Insecurity` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_community_condition_wip_en_4.3.1_3.0_1677201525944.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_community_condition_wip_en_4.3.1_3.0_1677201525944.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_sdoh_community_condition_wip", "en", "clinical/models")\ .setInputCols(["sentence", "token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter ]) sample_texts = ["He is currently experiencing financial stress due to job insecurity, and he lives in a small apartment in a densely populated area with limited access to green spaces and outdoor recreational activities.", "Patient reports difficulty affording healthy food, and relies on cheaper, processed options.", "She reports her husband and sons provide transportation to medical appts and do her grocery shopping."] data = spark.createDataFrame(sample_texts, StringType()).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_sdoh_community_condition_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter )) val data = Seq("He is currently experiencing financial stress due to job insecurity, and he lives in a small apartment in a densely populated area with limited access to green spaces and outdoor recreational activities.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
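The `begin`/`end` values Spark NLP attaches to each chunk are zero-based, inclusive character offsets into the input text, so they can be sanity-checked in plain Python against the first sample sentence:

```python
text = ("He is currently experiencing financial stress due to job insecurity, "
        "and he lives in a small apartment in a densely populated area with "
        "limited access to green spaces and outdoor recreational activities.")

def char_span(text: str, chunk: str) -> tuple:
    """Zero-based, inclusive (begin, end) offsets of chunk's first occurrence."""
    begin = text.find(chunk)
    return begin, begin + len(chunk) - 1

spans = {c: char_span(text, c)
         for c in ("small apartment", "green spaces",
                   "outdoor recreational activities")}
```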
## Results ```bash +-------------------------------+-----+---+---------------------------+ |chunk |begin|end|ner_label | +-------------------------------+-----+---+---------------------------+ |small apartment |87 |101|Housing | |green spaces |154 |165|Community_Living_Conditions| |outdoor recreational activities|171 |201|Community_Living_Conditions| |healthy food |37 |48 |Food_Insecurity | |transportation |41 |54 |Transportation | +-------------------------------+-----+---+---------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_sdoh_community_condition_wip| |Compatibility:|Healthcare NLP 4.3.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.0 MB| ## Benchmarking ```bash label tp fp fn total precision recall f1 Food_Insecurity 40.0 0.0 5.0 45.0 1.000000 0.888889 0.941176 Housing 376.0 20.0 28.0 404.0 0.949495 0.930693 0.940000 Community_Living_Conditions 97.0 8.0 8.0 105.0 0.923810 0.923810 0.923810 Transportation 31.0 2.0 0.0 31.0 0.939394 1.000000 0.968750 ``` --- layout: model title: Sentence Entity Resolver for Hierarchical Condition Categories (HCC) codes (Augmented) author: John Snow Labs name: sbiobertresolve_hcc_augmented date: 2023-05-31 tags: [en, entity_resolution, clinical, hcc, licensed] task: Entity Resolution language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to Hierarchical Condition Categories (HCC) codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. 
## Predicted Entities `HCC Codes` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_hcc_augmented_en_4.4.2_3.0_1685528056939.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_hcc_augmented_en_4.4.2_3.0_1685528056939.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(["PROBLEM"]) chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_hcc_augmented","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter, chunk2doc, sbert_embedder, resolver]) data = spark.createDataFrame([["""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree, transferred to St . 
Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA complicated by hypotension requiring Atropine , IV fluids and transient dopamine, subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."""]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("PROBLEM")) val chunk2doc = new Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_hcc_augmented","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver)) val data = Seq("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal 
insufficiency , COPD , gastritis , and TIA who initially presented to Braintree, transferred to St . Margaret's Center for Women & Infants for cardiac catheterization with PTCA complicated by hypotension requiring Atropine , IV fluids and transient dopamine, subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
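Entries in the `resolutions` column follow the pattern `term [description]`, so they can be split back apart with a small regex. A sketch in plain Python (the helper name is ours), using values from the example output:

```python
import re

def parse_resolution(entry: str):
    """Split a 'term [description]' resolution entry into its two parts."""
    m = re.match(r"^(.*?)\s*\[(.*)\]$", entry)
    return (m.group(1), m.group(2)) if m else (entry, None)

term, desc = parse_resolution(
    "hypertension [essential (primary) hypertension]")
```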
## Results ```bash +---------------------------+-------+----------+--------------------+---------------------------------------------------------------------------+ | ner_chunk| entity|icd10_code| all_codes| resolutions| +---------------------------+-------+----------+--------------------+---------------------------------------------------------------------------+ | hypertension|PROBLEM| 0| [0]| [hypertension [essential (primary) hypertension]]| |chronic renal insufficiency|PROBLEM| 0| [0, 136]|[chronic renal insufficiency [chronic kidney disease, unspecified], end ...| | COPD|PROBLEM| 111|[111, 80, 0, 23, 47]|[copd [chronic obstructive pulmonary disease, unspecified], coning [comp...| | gastritis|PROBLEM| 0| [0]| [gastritis [gastritis, unspecified, without bleeding]]| | TIA|PROBLEM| 0| [0, 12, 48]|[tia [transient cerebral ischemic attack, unspecified], tsh-oma [benign ...| | hypotension|PROBLEM| 0| [0]| [hypotension [hypotension]]| +---------------------------+-------+----------+--------------------+---------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_hcc_augmented| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[hcc]| |Language:|en| |Size:|1.4 GB| |Case sensitive:|false| --- layout: model title: English DistilBertForQuestionAnswering model (from alinemati) author: John Snow Labs name: distilbert_qa_BERT date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`BERT` is an English model originally trained by `alinemati`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_BERT_en_4.0.0_3.0_1654722812832.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_BERT_en_4.0.0_3.0_1654722812832.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_BERT","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_BERT","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.by_alinemati").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_BERT| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/alinemati/BERT --- layout: model title: NER Model for Hindi+English author: John Snow Labs name: bert_token_classifier_hi_en_ner date: 2021-12-27 tags: [ner, hi, en, open_source] task: Named Entity Recognition language: hi edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model was imported from Hugging Face to carry out Named Entity Recognition with mixed Hindi-English texts, provided by the LinCE repository. ## Predicted Entities {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_HINDI_ENGLISH.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_hi_en_ner_hi_3.2.0_3.0_1640612846736.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_hi_en_ner_hi_3.2.0_3.0_1640612846736.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentence_detector = SentenceDetector() \ .setInputCols(['document'])\ .setOutputCol('sentence') tokenizer = Tokenizer()\ .setInputCols(['sentence']) \ .setOutputCol('token') tokenClassifier_loaded = BertForTokenClassification.pretrained("bert_token_classifier_hi_en_ner","hi")\ .setInputCols(["sentence",'token'])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, tokenClassifier_loaded, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["""रिलायंस इंडस्ट्रीज़ लिमिटेड (Reliance Industries Limited) एक भारतीय संगुटिका नियंत्रक कंपनी है, जिसका मुख्यालय मुंबई, महाराष्ट्र (Maharashtra) में स्थित है।रतन नवल टाटा (28 दिसंबर 1937, को मुम्बई (Mumbai), में जन्मे) टाटा समुह के वर्तमान अध्यक्ष, जो भारत की सबसे बड़ी व्यापारिक समूह है, जिसकी स्थापना जमशेदजी टाटा ने की और उनके परिवार की पीढियों ने इसका विस्तार किया और इसे दृढ़ बनाया।"""]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val tokenClassifier_loaded = BertForTokenClassification.pretrained("bert_token_classifier_hi_en_ner","hi") .setInputCols("sentence","token") .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols("sentence","token","ner") .setOutputCol("ner_chunk") val nlp_pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, tokenClassifier_loaded, ner_converter)) val data = Seq("""रिलायंस इंडस्ट्रीज़ लिमिटेड (Reliance 
Industries Limited) एक भारतीय संगुटिका नियंत्रक कंपनी है, जिसका मुख्यालय मुंबई, महाराष्ट्र (Maharashtra) में स्थित है।रतन नवल टाटा (28 दिसंबर 1937, को मुम्बई (Mumbai), में जन्मे) टाटा समुह के वर्तमान अध्यक्ष, जो भारत की सबसे बड़ी व्यापारिक समूह है, जिसकी स्थापना जमशेदजी टाटा ने की और उनके परिवार की पीढियों ने इसका विस्तार किया और इसे दृढ़ बनाया।""").toDF("text") val result = nlp_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("hi.ner.bert").predict("""रिलायंस इंडस्ट्रीज़ लिमिटेड (Reliance Industries Limited) एक भारतीय संगुटिका नियंत्रक कंपनी है, जिसका मुख्यालय मुंबई, महाराष्ट्र (Maharashtra) में स्थित है।रतन नवल टाटा (28 दिसंबर 1937, को मुम्बई (Mumbai), में जन्मे) टाटा समुह के वर्तमान अध्यक्ष, जो भारत की सबसे बड़ी व्यापारिक समूह है, जिसकी स्थापना जमशेदजी टाटा ने की और उनके परिवार की पीढियों ने इसका विस्तार किया और इसे दृढ़ बनाया।""") ```
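The `NerConverter` stage above merges token-level BIO tags (`B-PERSON`, `I-PERSON`, `O`, …) into the entity chunks shown in the Results table. A minimal pure-Python sketch of that merging logic (illustrative only, not Spark NLP's actual implementation):

```python
def merge_bio(tokens, tags):
    """Merge token-level BIO tags into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # a new chunk begins; flush any open chunk first
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # continuation of the open chunk
            current.append(tok)
        else:
            # "O" tag (or inconsistent "I-") closes the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(merge_bio(
    ["Ratan", "Naval", "Tata", "was", "born", "in", "Mumbai"],
    ["B-PERSON", "I-PERSON", "I-PERSON", "O", "O", "O", "B-PLACE"],
))
# -> [('Ratan Naval Tata', 'PERSON'), ('Mumbai', 'PLACE')]
```

In the real pipeline this merging happens on the `ner` column and is emitted as `ner_chunk` annotations.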
## Results ```bash | chunk | ner_label | |----------------------- |-------------- | | रिलायंस इंडस्ट्रीज़ लिमिटेड | ORGANISATION | | Reliance Industries Limited | ORGANISATION | | मुंबई | PLACE | | महाराष्ट्र | PLACE | | Maharashtra | PLACE | | नवल टाटा | PERSON | | मुम्बई | PLACE | | Mumbai | PLACE | | टाटा समुह | ORGANISATION | | भारत | PLACE | | जमशेदजी टाटा | PERSON | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_hi_en_ner| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|hi| |Size:|665.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## Data Source https://ritual.uh.edu/lince/home --- layout: model title: English DistilBertForTokenClassification Cased model (from ankleBowl) author: John Snow Labs name: distilbert_token_classifier_autotrain_lucy_light_control_3122788375 date: 2023-03-14 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-lucy-light-control-3122788375` is an English model originally trained by `ankleBowl`.
## Predicted Entities `PER`, `OFF`, `BRI`, `EMP`, `ONN`, `DIM`, `COL` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_lucy_light_control_3122788375_en_4.3.1_3.0_1678783543204.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_lucy_light_control_3122788375_en_4.3.1_3.0_1678783543204.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_lucy_light_control_3122788375","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_lucy_light_control_3122788375","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_lucy_light_control_3122788375| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ankleBowl/autotrain-lucy-light-control-3122788375 --- layout: model title: English BertForQuestionAnswering Uncased model (from aodiniz) author: John Snow Labs name: bert_qa_uncased_l_6_h_128_a_2_cord19_200616_squad2 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-6_H-128_A-2_cord19-200616_squad2` is an English model originally trained by `aodiniz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_uncased_l_6_h_128_a_2_cord19_200616_squad2_en_4.0.0_3.0_1657188925469.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_uncased_l_6_h_128_a_2_cord19_200616_squad2_en_4.0.0_3.0_1657188925469.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_uncased_l_6_h_128_a_2_cord19_200616_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_uncased_l_6_h_128_a_2_cord19_200616_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
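Under the hood, extractive QA models like this one score every context token as a possible answer start and answer end; the predicted answer is the span maximizing the combined score with start ≤ end. A small pure-Python sketch of that decoding step, with made-up scores (not the Spark NLP internals):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick (start, end) maximizing start_score + end_score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # only consider spans of bounded length starting at s
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# hypothetical per-token logits; "Clara" dominates both start and end
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 1.0, 0.0]
end   = [0.0, 0.1, 0.0, 4.0, 0.2, 0.0, 0.0, 0.0, 2.0, 0.1]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```

This mirrors why the pipeline answers "Clara" for "What is my name?" against the example context.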
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_uncased_l_6_h_128_a_2_cord19_200616_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|20.0 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-6_H-128_A-2_cord19-200616_squad2 --- layout: model title: Swiss Legal Roberta Embeddings author: John Snow Labs name: roberta_large_swiss_legal date: 2023-02-16 tags: [gsw, swiss, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: gsw edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-swiss-roberta-large` is a Swiss model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_large_swiss_legal_gsw_4.2.4_3.0_1676577298804.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_large_swiss_legal_gsw_4.2.4_3.0_1676577298804.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_large_swiss_legal", "gsw")\ .setInputCols(["sentence"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_large_swiss_legal", "gsw") .setInputCols("sentence") .setOutputCol("embeddings") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_large_swiss_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|gsw| |Size:|1.6 GB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-swiss-roberta-large --- layout: model title: German asr_wav2vec2_large_xlsr_53_german_cv9 TFWav2Vec2ForCTC from oliverguhr author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_german_cv9 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_german_cv9` is a German model originally trained by oliverguhr. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_german_cv9_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_german_cv9_de_4.2.0_3.0_1664096909834.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_german_cv9_de_4.2.0_3.0_1664096909834.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_german_cv9", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_german_cv9", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
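Wav2Vec2ForCTC emits a per-frame character distribution that is decoded with CTC: take the argmax character per frame, collapse consecutive repeats, then drop the blank symbol. A minimal pure-Python sketch of greedy CTC decoding (illustrative only; Spark NLP performs this internally when producing the `text` column):

```python
def ctc_greedy_decode(frame_ids, vocab, blank=0):
    """Collapse repeated frame predictions, then remove CTC blanks."""
    out, prev = [], None
    for i in frame_ids:
        # emit a character only when it differs from the previous frame
        # and is not the blank token
        if i != prev and i != blank:
            out.append(vocab[i])
        prev = i
    return "".join(out)

# toy vocabulary: index 0 is the CTC blank
vocab = ["<pad>", "h", "a", "l", "o", " "]
# per-frame argmax ids: "h h a a <pad> l l l o" collapses to "halo"
print(ctc_greedy_decode([1, 1, 2, 2, 0, 3, 3, 3, 4], vocab))  # -> halo
```

Note the role of the blank: a blank between two identical characters (e.g. `[3, 0, 3]`) is what lets CTC output doubled letters like "ll".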
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_german_cv9| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: Yoruba Lemmatizer author: John Snow Labs name: lemma date: 2020-07-29 23:37:00 +0800 task: Lemmatization language: yo edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [lemmatizer, yo] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb#scrollTo=bbzEH9u7tdxR){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_yo_2.5.5_2.4_1596055008864.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_yo_2.5.5_2.4_1596055008864.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "yo") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Yato si jijẹ ọba ariwa, John Snow jẹ oṣoogun ara Gẹẹsi kan ati adari ninu idagbasoke anaesthesia ati imototo ilera.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "yo") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Yato si jijẹ ọba ariwa, John Snow jẹ oṣoogun ara Gẹẹsi kan ati adari ninu idagbasoke anaesthesia ati imototo ilera.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Yato si jijẹ ọba ariwa, John Snow jẹ oṣoogun ara Gẹẹsi kan ati adari ninu idagbasoke anaesthesia ati imototo ilera."""] lemma_df = nlu.load('yo.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
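Conceptually, the lemmatizer maps each token to its root form via a learned lookup, falling back to the surface form for unknown words. A toy pure-Python sketch of that lookup, using a hypothetical English mini-dictionary rather than the actual Yoruba model data:

```python
# hypothetical lemma table for illustration only
LEMMA_DICT = {"was": "be", "were": "be", "running": "run", "ran": "run", "mice": "mouse"}

def lemmatize(tokens, table=LEMMA_DICT):
    """Return the dictionary lemma for each token, or the token itself if unknown."""
    return [table.get(tok.lower(), tok) for tok in tokens]

print(lemmatize(["The", "mice", "were", "running"]))
# -> ['The', 'mouse', 'be', 'run']
```

The real `LemmatizerModel` additionally uses the surrounding context to disambiguate word forms that map to more than one possible root.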
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=3, result='Yato', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=5, end=6, result='si', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=8, end=11, result='jijẹ', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=13, end=15, result='ọba', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=17, end=21, result='ariwa', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|yo| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: English RoBERTa Embeddings (Mixed sampling strategy) author: John Snow Labs name: roberta_embeddings_distilroberta_base_climate_d_s date: 2022-04-14 tags: [roberta, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilroberta-base-climate-d-s` is an English model originally trained by `climatebert`. Sampling strategy d-s: as described in the authors' paper [here](https://arxiv.org/pdf/2110.12010.pdf), d-s stands for "div select + sim select", meaning the 70% of samples with the highest composite scaled diversity + similarity score were kept and the rest discarded.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distilroberta_base_climate_d_s_en_3.4.2_3.0_1649946889886.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distilroberta_base_climate_d_s_en_3.4.2_3.0_1649946889886.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_distilroberta_base_climate_d_s","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_distilroberta_base_climate_d_s","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.distilroberta_base_climate_d_s").predict("""I love Spark NLP""") ```
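The pipeline above produces one embedding vector per token; for sentence-level tasks these are often mean-pooled into a single vector downstream (Spark NLP's `SentenceEmbeddings` annotator offers this). A minimal pure-Python sketch of mean pooling, for illustration:

```python
def mean_pool(token_vectors):
    """Average equally sized token vectors into one sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    # component-wise mean across all token vectors
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# two toy 3-dimensional token embeddings
vectors = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
print(mean_pool(vectors))  # -> [2.0, 2.0, 2.0]
```

Real embeddings from this model are much wider (hundreds of dimensions), but the pooling arithmetic is the same.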
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_distilroberta_base_climate_d_s| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|310.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/climatebert/distilroberta-base-climate-d-s - https://arxiv.org/abs/2110.12010 --- layout: model title: Fast Neural Machine Translation Model from English to Brazilian Sign Language author: John Snow Labs name: opus_mt_en_bzs date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, bzs, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `bzs` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_bzs_xx_2.7.0_2.4_1609168236857.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_bzs_xx_2.7.0_2.4_1609168236857.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_bzs", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_bzs", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.bzs').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_bzs| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Resolve Medication Codes (Transform) author: John Snow Labs name: medication_resolver_transform_pipeline date: 2022-09-01 tags: [resolver, rxnorm, ndc, snomed, umls, ade, pipeline, en, licensed] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pretrained resolver pipeline to extract medications and resolve their adverse reactions (ADE), RxNorm, UMLS, NDC, SNOMED CT codes, and action/treatments in clinical text. Action/treatments are available for branded medication, and SNOMED codes are available for non-branded medication. This pipeline is designed to be used with Spark `transform`. If you need a LightPipeline (with `annotate`/`fullAnnotate`), use `medication_resolver_pipeline` instead. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/medication_resolver_transform_pipeline_en_4.0.2_3.0_1662045429139.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/medication_resolver_transform_pipeline_en_4.0.2_3.0_1662045429139.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline medication_resolver_pipeline = PretrainedPipeline("medication_resolver_transform_pipeline", "en", "clinical/models") text = """The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""" data = spark.createDataFrame([[text]]).toDF("text") result = medication_resolver_pipeline.transform(data) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val medication_resolver_pipeline = new PretrainedPipeline("medication_resolver_transform_pipeline", "en", "clinical/models") val data = Seq("""The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""").toDS.toDF("text") val result = medication_resolver_pipeline.transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.medication_transform.pipeline").predict("""The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""") ```
## Results ```bash | chunk | ner_label | ADE | RxNorm | Action | Treatment | UMLS | SNOMED_CT | NDC_Product | NDC_Package | |:-----------------------------|:------------|:----------------------------|---------:|:---------------------------|:-------------------------------------------|:---------|:------------|:--------------|:--------------| | Amlodopine Vallarta 10-320mg | DRUG | Gynaecomastia | 722131 | NONE | NONE | C1949334 | 425838008 | 00093-7693 | 00093-7693-56 | | Eviplera | DRUG | Anxiety | 217010 | Inhibitory Bone Resorption | Osteoporosis | C0720318 | NONE | NONE | NONE | | Lescol 40 MG | DRUG | NONE | 103919 | Hypocholesterolemic | Heterozygous Familial Hypercholesterolemia | C0353573 | NONE | 00078-0234 | 00078-0234-05 | | Everolimus 1.5 mg tablet | DRUG | Acute myocardial infarction | 2056895 | NONE | NONE | C4723581 | NONE | 00054-0604 | 00054-0604-21 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|medication_resolver_transform_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|3.1 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel - TextMatcherModel - ChunkMergeModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger - Doc2Chunk - ResolverMerger - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - Doc2Chunk - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - Finisher --- layout: model title: English image_classifier_vit_rust_image_classification_1 ViTForImageClassification from SummerChiam author: John Snow Labs name: image_classifier_vit_rust_image_classification_1 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true 
annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rust_image_classification_1` is an English model originally trained by SummerChiam. ## Predicted Entities `nonrust`, `rust` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_1_en_4.1.0_3.0_1660166731268.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_1_en_4.1.0_3.0_1660166731268.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_rust_image_classification_1", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_rust_image_classification_1", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_rust_image_classification_1| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Fast Neural Machine Translation Model from English to Niuean author: John Snow Labs name: opus_mt_en_niu date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, niu, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `niu` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_niu_xx_2.7.0_2.4_1609167483288.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_niu_xx_2.7.0_2.4_1609167483288.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_niu", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_niu", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.niu').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_niu| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English asr_wav2vec2_large_xlsr_nahuatl TFWav2Vec2ForCTC from tyoc213 author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_nahuatl date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_nahuatl` is an English model originally trained by tyoc213. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_nahuatl_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_nahuatl_en_4.2.0_3.0_1664110207282.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_nahuatl_en_4.2.0_3.0_1664110207282.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_nahuatl', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_nahuatl", lang = "en") val annotations = pipeline.transform(audioDF) ```
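Both snippets above assume an `audioDF` of raw audio samples already exists. Spark NLP's audio annotators consume each recording as an array of floats; as a minimal sketch (the `wav_to_floats` helper below is hypothetical, not part of Spark NLP), a mono 16-bit PCM WAV file can be decoded with the Python standard library before building the DataFrame:

```python
import struct
import wave

def wav_to_floats(path):
    """Decode a mono 16-bit PCM WAV file into floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "expected 16-bit PCM samples"
        frames = wav.readframes(wav.getnframes())
    # Little-endian signed 16-bit integers, normalized to [-1.0, 1.0].
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# The resulting list can back a one-column Spark DataFrame, e.g.:
# audioDF = spark.createDataFrame([(wav_to_floats("sample.wav"),)], ["audio_content"])
```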
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_nahuatl| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Fast and Accurate Language Identification - 95 Languages (CNN) author: John Snow Labs name: ld_wiki_tatoeba_cnn_95 date: 2020-12-05 task: Language Detection language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [language_detection, open_source, xx] supported: true annotator: LanguageDetectorDL article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Language detection and identification is the task of automatically detecting the language(s) present in a document based on its content. ``LanguageDetectorDL`` is an annotator that detects the language of documents or sentences depending on the ``inputCols``. In addition, ``LanguageDetectorDL`` can accurately detect language in documents with mixed languages by coalescing sentences and selecting the best candidate. We have designed and developed Deep Learning models using CNNs in TensorFlow/Keras. The models are trained on large datasets such as Wikipedia and Tatoeba, and evaluated with high accuracy on the Europarl dataset. The output is a language code in Wiki Code style: [https://en.wikipedia.org/wiki/List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias). 
This model can detect the following languages: `Afrikaans`, `Amharic`, `Aragonese`, `Arabic`, `Assamese`, `Azerbaijani`, `Belarusian`, `Bulgarian`, `Bengali`, `Breton`, `Bosnian`, `Catalan`, `Czech`, `Welsh`, `Danish`, `German`, `Greek`, `English`, `Esperanto`, `Spanish`, `Estonian`, `Basque`, `Persian`, `Finnish`, `Faroese`, `French`, `Irish`, `Galician`, `Gujarati`, `Hebrew`, `Hindi`, `Croatian`, `Haitian Creole`, `Hungarian`, `Armenian`, `Interlingua`, `Indonesian`, `Icelandic`, `Italian`, `Japanese`, `Javanese`, `Georgian`, `Kazakh`, `Khmer`, `Kannada`, `Korean`, `Kurdish`, `Kyrgyz`, `Latin`, `Luxembourgish`, `Lao`, `Lithuanian`, `Latvian`, `Malagasy`, `Macedonian`, `Malayalam`, `Mongolian`, `Marathi`, `Malay`, `Maltese`, `Nepali`, `Dutch`, `Norwegian Nynorsk`, `Norwegian`, `Occitan`, `Odia (Oriya)`, `Punjabi (Eastern)`, `Polish`, `Pashto`, `Portuguese`, `Quechua`, `Romanian`, `Russian`, `Northern Sami`, `Sinhala`, `Slovak`, `Slovenian`, `Albanian`, `Serbian`, `Swedish`, `Swahili`, `Tamil`, `Telugu`, `Thai`, `Tagalog`, `Turkish`, `Tatar`, `Uyghur`, `Ukrainian`, `Urdu`, `Vietnamese`, `Volapük`, `Walloon`, `Xhosa`, `Chinese`. ## Predicted Entities `af`, `am`, `an`, `ar`, `as`, `az`, `be`, `bg`, `bn`, `br`, `bs`, `ca`, `cs`, `cy`, `da`, `de`, `el`, `en`, `eo`, `es`, `et`, `eu`, `fa`, `fi`, `fo`, `fr`, `ga`, `gl`, `gu`, `he`, `hi`, `hr`, `ht`, `hu`, `hy`, `ia`, `id`, `is`, `it`, `ja`, `jv`, `ka`, `kk`, `km`, `kn`, `ko`, `ku`, `ky`, `la`, `lb`, `lo`, `lt`, `lv`, `mg`, `mk`, `ml`, `mn`, `mr`, `ms`, `mt`, `ne`, `nl`, `nn`, `no`, `oc`, `or`, `pa`, `pl`, `ps`, `pt`, `qu`, `ro`, `ru`, `se`, `si`, `sk`, `sl`, `sq`, `sr`, `sv`, `sw`, `ta`, `te`, `th`, `tl`, `tr`, `tt`, `ug`, `uk`, `ur`, `vi`, `vo`, `wa`, `xh`, `zh`. 
{:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ld_wiki_tatoeba_cnn_95_xx_2.7.0_2.4_1607184332861.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ld_wiki_tatoeba_cnn_95_xx_2.7.0_2.4_1607184332861.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... language_detector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_95", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("language") languagePipeline = Pipeline(stages=[documentAssembler, sentenceDetector, language_detector]) light_pipeline = LightPipeline(languagePipeline.fit(spark.createDataFrame([['']]).toDF("text"))) result = light_pipeline.fullAnnotate("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.") ``` ```scala ... val languageDetector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_95", "xx") .setInputCols("sentence") .setOutputCol("language") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, languageDetector)) val data = Seq("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala."] lang_df = nlu.load('xx.classify.wiki_95').predict(text, output_level='sentence') lang_df ```
## Results ```bash 'fr' ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ld_wiki_tatoeba_cnn_95| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[language]| |Language:|xx| ## Data Source Wikipedia and Tatoeba ## Benchmarking ```bash Evaluated on Europarl dataset which the model has never seen: +--------+-----+-------+------------------+ |src_lang|count|correct| precision| +--------+-----+-------+------------------+ | fr| 1000| 999| 0.999| | de| 1000| 998| 0.998| | fi| 1000| 998| 0.998| | pt| 1000| 996| 0.996| | sv| 1000| 995| 0.995| | el| 1000| 994| 0.994| | nl| 1000| 994| 0.994| | it| 1000| 994| 0.994| | en| 1000| 993| 0.993| | es| 1000| 984| 0.984| | hu| 880| 865|0.9829545454545454| | ro| 784| 769|0.9808673469387755| | lt| 1000| 978| 0.978| | et| 928| 906|0.9762931034482759| | cs| 1000| 975| 0.975| | pl| 914| 890| 0.973741794310722| | da| 1000| 958| 0.958| | sk| 1000| 947| 0.947| | bg| 1000| 939| 0.939| | lv| 916| 849|0.9268558951965066| | sl| 914| 844|0.9234135667396062| +--------+-----+-------+------------------+ +-------+-------------------+ |summary| precision| +-------+-------------------+ | count| 21| | mean| 0.9764822024804014| | stddev|0.02384830734809143| | min| 0.9234135667396062| | max| 0.999| +-------+-------------------+ ``` --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from jamarju) author: John Snow Labs name: roberta_qa_base_bne_squad_2.0 date: 2022-12-02 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`roberta-base-bne-squad-2.0-es` is a Spanish model originally trained by `jamarju`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_bne_squad_2.0_es_4.2.4_3.0_1669985918426.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_bne_squad_2.0_es_4.2.4_3.0_1669985918426.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_bne_squad_2.0","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_bne_squad_2.0","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
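After `transform`, the predicted spans live in the `answer` output column. Sketching the extraction over plain dictionaries that mirror collected rows (the `first_answers` helper and the row layout are illustrative assumptions, not Spark NLP APIs):

```python
def first_answers(rows):
    """From transformed rows (mocked here as dicts), take each row's top answer.

    Each row carries a list of answer annotations under 'answer'; rows
    with no prediction yield None.
    """
    return [row["answer"][0]["result"] if row["answer"] else None for row in rows]

rows = [{"answer": [{"result": "Clara"}]}, {"answer": []}]
print(first_answers(rows))  # ['Clara', None]
```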
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_bne_squad_2.0| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|456.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/jamarju/roberta-base-bne-squad-2.0-es - https://github.com/PlanTL-SANIDAD/lm-spanish - https://github.com/ccasimiro88/TranslateAlignRetrieve --- layout: model title: English BertForQuestionAnswering Cased model (from CherylTSW) author: John Snow Labs name: bert_qa_cheryltsw_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `CherylTSW`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_cheryltsw_finetuned_squad_en_4.0.0_3.0_1657186031141.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_cheryltsw_finetuned_squad_en_4.0.0_3.0_1657186031141.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_cheryltsw_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_cheryltsw_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_cheryltsw_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/CherylTSW/bert-finetuned-squad --- layout: model title: English asr_bach_arb TFWav2Vec2ForCTC from bkh6722 author: John Snow Labs name: asr_bach_arb date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_bach_arb` is an English model originally trained by bkh6722. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_bach_arb_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_bach_arb_en_4.2.0_3.0_1664118857579.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_bach_arb_en_4.2.0_3.0_1664118857579.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_bach_arb", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_bach_arb", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
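Wav2vec2 inference cost grows with clip length, so long recordings are often split into fixed windows before being fed to the pipeline. A minimal sketch of that windowing (the `chunk_samples` helper is hypothetical, not a Spark NLP API; the overlap keeps words that straddle a boundary inside at least one chunk):

```python
def chunk_samples(samples, window, overlap=0):
    """Split a sample list into windows of `window` samples.

    Consecutive windows share `overlap` samples so audio cut at a
    boundary appears in both chunks.
    """
    if window <= overlap:
        raise ValueError("window must exceed overlap")
    step = window - overlap
    return [samples[i:i + window] for i in range(0, max(len(samples) - overlap, 1), step)]

print(chunk_samples(list(range(10)), window=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each chunk can then become one row of the `audio_content` column and be transcribed independently.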
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_bach_arb| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|349.2 MB| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from sabah17) author: John Snow Labs name: distilbert_qa_sabah17_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `sabah17`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_sabah17_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772373637.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_sabah17_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772373637.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sabah17_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sabah17_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_sabah17_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/sabah17/distilbert-base-uncased-finetuned-squad --- layout: model title: Persian BertForQuestionAnswering Base Uncased model (from aminnaghavi) author: John Snow Labs name: bert_qa_base_parsbert_uncased_finetuned_perqa date: 2022-07-07 tags: [fa, open_source, bert, question_answering] task: Question Answering language: fa edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-parsbert-uncased-finetuned-perQA` is a Persian model originally trained by `aminnaghavi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_parsbert_uncased_finetuned_perqa_fa_4.0.0_3.0_1657183345692.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_parsbert_uncased_finetuned_perqa_fa_4.0.0_3.0_1657183345692.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_parsbert_uncased_finetuned_perqa","fa") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["اسم من چیست؟", "نام من کلارا است و من در برکلی زندگی می کنم."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_parsbert_uncased_finetuned_perqa","fa") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("اسم من چیست؟", "نام من کلارا است و من در برکلی زندگی می کنم.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_parsbert_uncased_finetuned_perqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|fa| |Size:|607.1 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aminnaghavi/bert-base-parsbert-uncased-finetuned-perQA --- layout: model title: Detect Family History Status from Oncology Entities author: John Snow Labs name: assertion_oncology_family_history_wip date: 2022-10-11 tags: [licensed, clinical, oncology, en, assertion, family_history] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects whether oncology entities refer to family history. ## Predicted Entities `Family_History`, `Other` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_family_history_wip_en_4.0.0_3.0_1665522020132.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_family_history_wip_en_4.0.0_3.0_1665522020132.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(["Cancer_Dx"]) assertion = AssertionDLModel.pretrained("assertion_oncology_family_history_wip", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion]) data = spark.createDataFrame([["Her family history is positive for breast cancer in her maternal aunt."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") 
.setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Cancer_Dx")) val assertion = AssertionDLModel.pretrained("assertion_oncology_family_history_wip","en","clinical/models") .setInputCols(Array("sentence","ner_chunk","embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion)) val data = Seq("""Her family history is positive for breast cancer in her maternal aunt.""").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.oncology_family_history").predict("""Her family history is positive for breast cancer in her maternal aunt.""") ```
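The assertion model emits one status per `ner_chunk`, in order, so chunks and assertion labels can be paired positionally to build rows like those in the Results section. A small illustrative helper over mocked `(text, label)` tuples — not a Spark NLP API:

```python
def pair_assertions(chunks, assertions):
    """Zip NER chunks with their assertion statuses.

    One assertion annotation is assumed per ner_chunk, in order, so a
    positional zip suffices; inputs are mocked as plain tuples here.
    """
    return [
        {"chunk": text, "ner_label": label, "assertion": status}
        for (text, label), status in zip(chunks, assertions)
    ]

rows = pair_assertions([("breast cancer", "Cancer_Dx")], ["Family_History"])
print(rows)
# [{'chunk': 'breast cancer', 'ner_label': 'Cancer_Dx', 'assertion': 'Family_History'}]
```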
## Results ```bash | chunk | ner_label | assertion | |:--------------|:------------|:---------------| | breast cancer | Cancer_Dx | Family_History | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_oncology_family_history_wip| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion_pred]| |Language:|en| |Size:|1.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label precision recall f1-score support Family_History 0.88 0.96 0.92 24.0 Other 0.96 0.90 0.93 29.0 macro-avg 0.92 0.93 0.92 53.0 weighted-avg 0.93 0.92 0.92 53.0 ``` --- layout: model title: Text Detection author: John Snow Labs name: image_text_detector_v2 date: 2023-01-10 tags: [en, licensed] task: OCR Text Detection & Recognition language: en nav_key: models edition: Visual NLP 4.1.0 spark_version: 3.2.1 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description CRAFT (Character-Region Awareness For Text detection) is designed with a convolutional neural network that produces a character region score and an affinity score. The region score is used to localize individual characters in the image, and the affinity score is used to group each character into a single instance. To compensate for the lack of character-level annotations, the authors propose a weakly-supervised learning framework that estimates character-level ground truths in existing real word-level datasets. 
## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/ocr/TEXT_DETECTION/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/Cards/SparkOcrImageTextDetection.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/image_text_detector_v2_en_3.3.0_2.4_1643618928538.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/image_text_detector_v2_en_3.3.0_2.4_1643618928538.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python binary_to_image = BinaryToImage() \ .setInputCol("content") \ .setOutputCol("image") \ .setImageType(ImageType.TYPE_3BYTE_BGR) text_detector = ImageTextDetectorV2 \ .pretrained("image_text_detector_v2", "en", "clinical/ocr") \ .setInputCol("image") \ .setOutputCol("text_regions") \ .setScoreThreshold(0.5) \ .setTextThreshold(0.2) \ .setSizeThreshold(10) \ .setWithRefiner(True) draw_regions = ImageDrawRegions() \ .setInputCol("image") \ .setInputRegionsCol("text_regions") \ .setOutputCol("image_with_regions") \ .setRectColor(Color.green) \ .setRotated(True) pipeline = PipelineModel(stages=[ binary_to_image, text_detector, draw_regions ]) imagePath = pkg_resources.resource_filename('sparkocr', 'resources/ocr/text_detection/020_Yas_patella1.jpg') image_df = spark.read.format("binaryFile").load(imagePath).sort("path") result = pipeline.transform(image_df) ``` ```scala val binary_to_image = new BinaryToImage() .setInputCol("content") .setOutputCol("image") .setImageType(ImageType.TYPE_3BYTE_BGR) val text_detector = ImageTextDetectorV2 .pretrained("image_text_detector_v2", "en", "clinical/ocr") .setInputCol("image") .setOutputCol("text_regions") .setScoreThreshold(0.5) .setTextThreshold(0.2) .setSizeThreshold(10) .setWithRefiner(true) val draw_regions = new ImageDrawRegions() .setInputCol("image") .setInputRegionsCol("text_regions") .setOutputCol("image_with_regions") .setRectColor(Color.green) .setRotated(true) val pipeline = new Pipeline().setStages(Array( binary_to_image, text_detector, draw_regions)) val imagePath = "resources/ocr/text_detection/020_Yas_patella1.jpg" val image_df = spark.read.format("binaryFile").load(imagePath).sort("path") val result = pipeline.fit(image_df).transform(image_df) ```
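`setScoreThreshold(0.5)` discards candidate regions whose region score falls below the cutoff. The effect of that filter, sketched over plain `(score, box)` tuples for illustration (this mirrors the behavior, not the `ImageTextDetectorV2` implementation):

```python
def filter_regions(regions, score_threshold=0.5):
    """Keep detected regions whose score meets the threshold.

    `regions` is a list of (score, bounding_box) tuples; only the boxes
    of regions scoring at or above the threshold are returned.
    """
    return [box for score, box in regions if score >= score_threshold]

# One confident region and one low-score candidate (x, y, width, height):
regions = [(0.9, (0, 0, 40, 12)), (0.3, (5, 50, 20, 10))]
print(filter_regions(regions))  # [(0, 0, 40, 12)]
```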
## Example {%- capture input_image -%} ![Screenshot](/assets/images/examples_ocr/image6.png) {%- endcapture -%} {%- capture output_image -%} ![Screenshot](/assets/images/examples_ocr/image6_out.png) {%- endcapture -%} {% include templates/input_output_image.md input_image=input_image output_image=output_image %} {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_text_detector_v2| |Type:|ocr| |Compatibility:|Visual NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Output Labels:|[text_regions]| |Language:|en| --- layout: model title: English BertForQuestionAnswering model (from andi611) author: John Snow Labs name: bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_restaurant_with_neg_with_repeat date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-whole-word-masking-squad2-with-ner-mit-restaurant-with-neg-with-repeat` is an English model originally trained by `andi611`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_restaurant_with_neg_with_repeat_en_4.0.0_3.0_1654537564717.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_restaurant_with_neg_with_repeat_en_4.0.0_3.0_1654537564717.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_restaurant_with_neg_with_repeat","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_restaurant_with_neg_with_repeat","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.large_uncased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_restaurant_with_neg_with_repeat| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/andi611/bert-large-uncased-whole-word-masking-squad2-with-ner-mit-restaurant-with-neg-with-repeat --- layout: model title: BioBERT Embeddings (Pubmed Large) author: John Snow Labs name: biobert_pubmed_large_cased date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains the pre-trained weights of BioBERT, a language representation model for the biomedical domain, especially designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/biobert_pubmed_large_cased_en_2.6.0_2.4_1598342382907.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/biobert_pubmed_large_cased_en_2.6.0_2.4_1598342382907.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("biobert_pubmed_large_cased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("biobert_pubmed_large_cased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I hate cancer").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer"] embeddings_df = nlu.load('en.embed.biobert.pubmed_large_cased').predict(text, output_level='token') embeddings_df ```
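A common downstream step with these token embeddings is comparing vectors, e.g. with cosine similarity. A minimal plain-Python sketch, independent of Spark NLP (the short vectors here merely stand in for rows of the 1024-dimensional `embeddings` column):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# identical directions give 1.0, orthogonal vectors give 0.0
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```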
{:.h2_title} ## Results ```bash token en_embed_biobert_pubmed_large_cased_embeddings I [-0.041047871112823486, 0.24242812395095825, 0... hate [-0.6859451532363892, -0.45743268728256226, -0... cancer [-0.12403186410665512, 0.6688604354858398, -0.... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|biobert_pubmed_large_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|en| |Dimension:|1024| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert) --- layout: model title: Legal Assigns Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_assigns_bert date: 2023-03-05 tags: [en, legal, classification, clauses, assigns, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset is aimed at contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Assigns` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. 
If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that this model's embeddings allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Assigns`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_assigns_bert_en_1.0.0_3.0_1678050622215.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_assigns_bert_en_1.0.0_3.0_1678050622215.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_assigns_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
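As noted in the description, long documents should be split into paragraphs before classification. A minimal sketch of paragraph splitting by multiline breaks, in plain Python (the regex and helper name are illustrative, not part of the Legal NLP API):

```python
import re

def split_paragraphs(text):
    # split on runs of 2+ newlines and drop empty chunks
    return [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]

doc = "First clause text.\n\nSecond clause text.\n\n\nThird clause text."
print(split_paragraphs(doc))
# ['First clause text.', 'Second clause text.', 'Third clause text.']
```

Each resulting paragraph can then be fed as a separate row of the `text` column above.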
## Results ```bash +-------+ |result| +-------+ |[Assigns]| |[Other]| |[Other]| |[Assigns]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_assigns_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.9 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Assigns 0.83 1.00 0.91 5 Other 1.00 0.67 0.80 3 accuracy - - 0.88 8 macro-avg 0.92 0.83 0.85 8 weighted-avg 0.90 0.88 0.87 8 ``` --- layout: model title: Chinese Bert Embeddings (3-layer) author: John Snow Labs name: bert_embeddings_rbt3 date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `rbt3` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt3_zh_3.4.2_3.0_1649669074305.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt3_zh_3.4.2_3.0_1649669074305.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_rbt3","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_rbt3","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.rbt3").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_rbt3| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|144.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/rbt3 - https://arxiv.org/abs/1906.08101 - https://github.com/google-research/bert - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://arxiv.org/abs/2004.13922 --- layout: model title: Clinical Deidentification author: John Snow Labs name: clinical_deidentification date: 2022-03-03 tags: [deidentification, pipeline, de, licensed, clinical] task: Pipeline Healthcare language: de edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to deidentify PHI information from **German** medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `STREET`, `USERNAME`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE`, `CONTACT`, `ID`, `ZIP`, `ACCOUNT`, `SSN`, `DLN`, `PLATE` entities. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_DE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_de_3.4.1_3.0_1646330183939.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_de_3.4.1_3.0_1646330183939.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "de", "clinical/models") sample = """Zusammenfassung : Michael Berger wird am Morgen des 12 Dezember 2018 ins St.Elisabeth Krankenhaus eingeliefert. Herr Michael Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: T0110053F Platte A-BC124 Kontonummer: DE89370400440532013000 SSN : 13110587M565 Lizenznummer: B072RRE2I55 Adresse : St.Johann-Straße 13 19300 """ result = deid_pipeline.annotate(sample) print("\n".join(result['masked'])) print("\n".join(result['masked_with_chars'])) print("\n".join(result['masked_fixed_length_chars'])) print("\n".join(result['obfuscated'])) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = PretrainedPipeline("clinical_deidentification","de","clinical/models") val sample = "Zusammenfassung : Michael Berger wird am Morgen des 12 Dezember 2018 ins St.Elisabeth Krankenhaus eingeliefert. Herr Michael Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: T0110053F Platte A-BC124 Kontonummer: DE89370400440532013000 SSN : 13110587M565 Lizenznummer: B072RRE2I55 Adresse : St.Johann-Straße 13 19300" val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("de.deid.clinical").predict("""Zusammenfassung : Michael Berger wird am Morgen des 12 Dezember 2018 ins St.Elisabeth Krankenhaus eingeliefert. Herr Michael Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: T0110053F Platte A-BC124 Kontonummer: DE89370400440532013000 SSN : 13110587M565 Lizenznummer: B072RRE2I55 Adresse : St.Johann-Straße 13 19300 """) ```
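The pipeline emits four outputs per text: entity-label masking, char masking, fixed-length masking, and obfuscation. As a plain-Python sketch of the char-masking idea only (illustrative; not the actual `DeIdentificationModel` logic):

```python
def mask_with_chars(text, span):
    # replace a detected PHI span with a [***] block of the same total length
    start = text.index(span)
    masked = "[" + "*" * (len(span) - 2) + "]"
    return text[:start] + masked + text[start + len(span):]

print(mask_with_chars("ID-Nummer: T0110053F", "T0110053F"))
# ID-Nummer: [*******]
```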
## Results ```bash Masked with entity labels ------------------------------ Zusammenfassung : <PATIENT> wird am Morgen des <DATE> ins <HOSPITAL> eingeliefert. Herr <PATIENT> ist <AGE> Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: <ID> Platte <PLATE> Kontonummer: <ACCOUNT> SSN : <SSN> Lizenznummer: <DLN> Adresse : <STREET> <ZIP> Masked with chars ------------------------------ Zusammenfassung : [************] wird am Morgen des [**************] ins [**********************] eingeliefert. Herr [************] ist ** Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: [*******] Platte [*****] Kontonummer: [********************] SSN : [**********] Lizenznummer: [*********] Adresse : [*****************] [***] Masked with fixed length chars ------------------------------ Zusammenfassung : **** wird am Morgen des **** ins **** eingeliefert. Herr **** ist **** Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: **** Platte **** Kontonummer: **** SSN : **** Lizenznummer: **** Adresse : **** **** Obfuscated ------------------------------ Zusammenfassung : Herrmann Kallert wird am Morgen des 11-26-1977 ins International Neuroscience eingeliefert. Herr Herrmann Kallert ist 79 Jahre alt und hat zu viel Wasser in den Beinen. 
Persönliche Daten : ID-Nummer: 136704D357 Platte QA348G Kontonummer: 192837465738 SSN : 1310011981M454 Lizenznummer: XX123456 Adresse : Klingelhöferring 31206 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|de| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ContextualParserModel - ChunkMergeModel - DeIdentificationModel - Finisher --- layout: model title: Detect former names of companies in texts (small) author: John Snow Labs name: finner_wiki_formername date: 2023-01-15 tags: [former, name, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an NER model aimed at detecting former names of companies. It was trained on Wikipedia texts about companies. ## Predicted Entities `FORMER_NAME`, `O` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_wiki_formername_en_1.0.0_3.0_1673797854205.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_wiki_formername_en_1.0.0_3.0_1673797854205.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencizer = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512) chunks = finance.NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") ner = finance.NerModel().pretrained("finner_wiki_formername", "en", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") pipe = nlp.Pipeline(stages=[documenter, sentencizer, tokenizer, embeddings, ner, chunks]) model = pipe.fit(df) res = model.transform(df) res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.begin, res.ner_chunk.end, res.ner_chunk.metadata)).alias("cols")) \ .select(F.expr("cols['3']['sentence']").alias("sentence_id"), F.expr("cols['0']").alias("chunk"), F.expr("cols['2']").alias("end"), F.expr("cols['3']['entity']").alias("ner_label"))\ .filter("ner_label!='O'")\ .show(truncate=False) ```
## Results ```bash +-----------+------------------+---+-----------+ |sentence_id|chunk |end|ner_label | +-----------+------------------+---+-----------+ |0 |Toro Motor Company|57 |FORMER_NAME| +-----------+------------------+---+-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_wiki_formername| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|1.2 MB| ## References Wikipedia ## Benchmarking ```bash label tp fp fn prec rec f1 I-FORMER_NAME 29 20 13 0.59183675 0.6904762 0.63736266 B-FORMER_NAME 19 5 8 0.7916667 0.7037037 0.7450981 Macro-average 48 25 21 0.6917517 0.6970899 0.69441056 Micro-average 48 25 21 0.65753424 0.6956522 0.6760564 ``` --- layout: model title: NER Pipeline for 9 African Languages author: John Snow Labs name: distilbert_base_token_classifier_masakhaner_pipeline date: 2022-06-25 tags: [hausa, igbo, kinyarwanda, luganda, nigerian, pidgin, swahilu, wolof, yoruba, xx, open_source] task: Named Entity Recognition language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [distilbert_base_token_classifier_masakhaner](https://nlp.johnsnowlabs.com/2022/01/18/distilbert_base_token_classifier_masakhaner_xx.html) model. 
## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/Ner_masakhaner/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/Ner_masakhaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_base_token_classifier_masakhaner_pipeline_xx_4.0.0_3.0_1656118926686.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_base_token_classifier_masakhaner_pipeline_xx_4.0.0_3.0_1656118926686.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python masakhaner_pipeline = PretrainedPipeline("distilbert_base_token_classifier_masakhaner_pipeline", lang = "xx") masakhaner_pipeline.annotate("Ilé-iṣẹ́ẹ Mohammed Sani Musa, Activate Technologies Limited, ni ó kó ẹ̀rọ Ìwé-pélébé Ìdìbò Alálòpẹ́ (PVCs) tí a lò fún ìbò ọdún-un 2019, ígbà tí ó jẹ́ òǹdíjedupò lábẹ́ ẹgbẹ́ olóṣèlúu tí ó ń tukọ̀ ètò ìṣèlú lọ́wọ́ All rogressives Congress (APC) fún Aṣojú Ìlà-Oòrùn Niger, ìyẹn gẹ́gẹ́ bí ilé iṣẹ́ aṣèwádìí, Premium Times ṣe tẹ̀ ẹ́ jáde.") ``` ```scala val masakhaner_pipeline = new PretrainedPipeline("distilbert_base_token_classifier_masakhaner_pipeline", lang = "xx") masakhaner_pipeline.annotate("Ilé-iṣẹ́ẹ Mohammed Sani Musa, Activate Technologies Limited, ni ó kó ẹ̀rọ Ìwé-pélébé Ìdìbò Alálòpẹ́ (PVCs) tí a lò fún ìbò ọdún-un 2019, ígbà tí ó jẹ́ òǹdíjedupò lábẹ́ ẹgbẹ́ olóṣèlúu tí ó ń tukọ̀ ètò ìṣèlú lọ́wọ́ All rogressives Congress (APC) fún Aṣojú Ìlà-Oòrùn Niger, ìyẹn gẹ́gẹ́ bí ilé iṣẹ́ aṣèwádìí, Premium Times ṣe tẹ̀ ẹ́ jáde.") ```
## Results ```bash +-----------------------------+---------+ |chunk |ner_label| +-----------------------------+---------+ |Mohammed Sani Musa |PER | |Activate Technologies Limited|ORG | |ọdún-un 2019 |DATE | |All rogressives Congress |ORG | |APC |ORG | |Aṣojú Ìlà-Oòrùn Niger |LOC | |Premium Times |ORG | +-----------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_base_token_classifier_masakhaner_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|xx| |Size:|505.8 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - DistilBertForTokenClassification - NerConverter - Finisher --- layout: model title: Modern Greek (1453-) asr_wav2vec2_large_xlsr_53_greek_by_PereLluis13 TFWav2Vec2ForCTC from PereLluis13 author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_greek_by_PereLluis13 date: 2022-09-25 tags: [wav2vec2, el, audio, open_source, asr] task: Automatic Speech Recognition language: el edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_greek_by_PereLluis13` is a Modern Greek (1453-) model originally trained by PereLluis13. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_greek_by_PereLluis13_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_greek_by_PereLluis13_el_4.2.0_3.0_1664106724119.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_greek_by_PereLluis13_el_4.2.0_3.0_1664106724119.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_greek_by_PereLluis13", "el")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_greek_by_PereLluis13", "el") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
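The snippets above assume an `audioDf` whose `audio_content` column holds raw audio as float arrays. Wav2Vec2 models expect 16 kHz mono floats; a minimal plain-Python sketch of normalizing signed 16-bit PCM samples into that range (illustrative only; how you read the audio file is up to you):

```python
def pcm16_to_float(samples):
    # scale signed 16-bit integers into [-1.0, 1.0)
    return [s / 32768.0 for s in samples]

print(pcm16_to_float([0, 16384, -32768]))  # [0.0, 0.5, -1.0]
```

The resulting float list is the kind of value that goes into each `audio_content` row.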
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_greek_by_PereLluis13| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|el| |Size:|1.2 GB| --- layout: model title: Recognize Entities DL Pipeline for Norwegian (Bokmal) - Small author: John Snow Labs name: entity_recognizer_sm date: 2021-03-22 tags: [open_source, norwegian_bokmal, entity_recognizer_sm, pipeline, "no"] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: "no" edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_sm is a pretrained pipeline that processes text with a simple set of basic steps, performing most of the common text processing tasks on your DataFrame. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_no_3.0.0_3.0_1616442836782.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_no_3.0.0_3.0_1616442836782.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('entity_recognizer_sm', lang = 'no') annotations = pipeline.fullAnnotate("Hei fra John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("entity_recognizer_sm", lang = "no") val result = pipeline.fullAnnotate("Hei fra John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hei fra John Snow Labs! "] result_df = nlu.load('no.ner').predict(text) result_df ```
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:-----------------------------|:----------------------------|:----------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Hei fra John Snow Labs! '] | ['Hei fra John Snow Labs!'] | ['Hei', 'fra', 'John', 'Snow', 'Labs!'] | [[-0.394499987363815,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_sm| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|no| --- layout: model title: English asr_wav2vec2_indian_english TFWav2Vec2ForCTC from anjulRajendraSharma author: John Snow Labs name: asr_wav2vec2_indian_english date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_indian_english` is an English model originally trained by anjulRajendraSharma. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_indian_english_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_indian_english_en_4.2.0_3.0_1664123154208.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_indian_english_en_4.2.0_3.0_1664123154208.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_indian_english", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_indian_english", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_indian_english| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|227.5 MB| --- layout: model title: Chunk Entity Resolver for billable ICD10-CM HCC codes author: John Snow Labs name: chunkresolve_icd10cm_hcc_clinical date: 2021-04-02 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD10-CM codes using chunk embeddings (augmented with synonyms, four times richer than the previous resolver). It also adds support for 7-digit codes with HCC status. For reference: http://www.formativhealth.com/wp-content/uploads/2018/06/HCC-White-Paper.pdf ## Predicted Entities Outputs 7-digit billable ICD codes. In the result, look for the `aux_label` parameter in the metadata to get the HCC status. The HCC status can be split to get further information: `billable status`, `hcc status`, and `hcc score`. For example, in the example shared below the `billable status is 1`, `hcc status is 1`, and `hcc score is 8`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_hcc_clinical_en_3.0.0_3.0_1617356679231.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_hcc_clinical_en_3.0.0_3.0_1617356679231.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_hcc_clinical","en","clinical/models") \ .setInputCols("token","chunk_embeddings") \ .setOutputCol("entity") pipeline = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, resolver]) data = spark.createDataFrame([["metastatic lung cancer"]]).toDF("text") model = pipeline.fit(data) results = model.transform(data) ... ``` ```scala ... val resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_hcc_clinical","en","clinical/models") .setInputCols("token","chunk_embeddings") .setOutputCol("entity") val pipeline = new Pipeline().setStages(Array(document_assembler, sbert_embedder, resolver)) val data = Seq("metastatic lung cancer").toDF("text") val result = pipeline.fit(data).transform(data) ```
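As described above, the `aux_label` metadata packs the billable status, HCC status, and HCC score into one value. A plain-Python sketch of unpacking it (the `||` separator is an assumption based on other John Snow Labs resolvers; check the metadata of your own results):

```python
def parse_hcc_status(aux_label, sep="||"):
    # e.g. "1||1||8" -> billable=1, hcc_status=1, hcc_score=8
    # NOTE: the "||" separator is an assumed format, not confirmed by this card
    billable, hcc_status, hcc_score = aux_label.split(sep)
    return {"billable": billable, "hcc_status": hcc_status, "hcc_score": hcc_score}

print(parse_hcc_status("1||1||8"))
# {'billable': '1', 'hcc_status': '1', 'hcc_score': '8'}
```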
## Results ```bash | | chunks | code | resolutions | all_codes | billable_hcc_status_score | all_distances | |---:|:-----------------------|:-------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------|:----------------------------|:-------------------------------------------------------------------------------------------------------------------------| | 0 | metastatic lung cancer | C7800 | ['cancer metastatic to lung', 'metastasis from malignant tumor of lung', 'cancer metastatic to left lung', 'history of cancer metastatic to lung', 'metastatic cancer', 'history of cancer metastatic to lung (situation)', 'metastatic adenocarcinoma to bilateral lungs', 'cancer metastatic to chest wall', 'metastatic malignant neoplasm to left lower lobe of lung', 'metastatic carcinoid tumour', 'cancer metastatic to respiratory tract', 'metastatic carcinoid tumor'] | ['C7800', 'C349', 'C7801', 'Z858', 'C800', 'Z8511', 'C780', 'C798', 'C7802', 'C799', 'C7830', 'C7B00'] | ['1', '1', '8'] | ['0.0464', '0.0829', '0.0852', '0.0860', '0.0914', '0.0989', '0.1133', '0.1220', '0.1220', '0.1253', '0.1249', '0.1260'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|chunkresolve_icd10cm_hcc_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token, chunk_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| --- layout: model title: English asr_wav2vec2_base_timit_demo_colab4 TFWav2Vec2ForCTC 
from sameearif88 author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab4 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab4` is an English model originally trained by sameearif88. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab4_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab4_en_4.2.0_3.0_1664019107930.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab4_en_4.2.0_3.0_1664019107930.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab4", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab4", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
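Both snippets above assume an `audioDf` that already exists; `AudioAssembler` reads its `audio_content` input column as an array of float samples. As a rough, Spark-free sketch of how raw audio gets into that shape, the helper below decodes 16-bit PCM WAV bytes using only the Python standard library (the mono layout and the normalization to [-1.0, 1.0) are illustrative conventions, not a documented contract):

```python
import io
import struct
import wave

def wav_to_float_samples(wav_bytes):
    """Decode mono 16-bit PCM WAV bytes into floats in [-1.0, 1.0)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        assert wf.getsampwidth() == 2, "expects 16-bit PCM"
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a tiny synthetic WAV in memory to exercise the helper.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))

floats = wav_to_float_samples(buf.getvalue())
print(floats[:2])  # -> [0.0, 0.5]
```

The resulting list could then be placed in a one-column DataFrame, e.g. `spark.createDataFrame([(floats,)], ["audio_content"])`, before fitting the pipeline.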
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab4| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: English RobertaForQuestionAnswering Cased model (from nsridhar) author: John Snow Labs name: roberta_qa_finetuned_country date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-finetuned-country` is an English model originally trained by `nsridhar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_country_en_4.3.0_3.0_1674220262882.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_country_en_4.3.0_3.0_1674220262882.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_country","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_country","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
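Under the hood, an extractive QA annotator like `RoBertaForQuestionAnswering` scores every context token as a candidate answer start and as a candidate answer end, then returns the span (i, j) with j ≥ i that maximizes the combined score. A toy, framework-free sketch of that span-selection step (the tokens and logit values below are invented for illustration):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick (i, j) with j >= i maximizing start_logits[i] + end_logits[j]."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 3.0, 0.0, 0.1, 0.0, 0.0, 0.2, 0.0]  # peak at "Clara"
end   = [0.0, 0.1, 0.2, 2.5, 0.1, 0.0, 0.0, 0.1, 0.3, 0.0]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # -> Clara
```

Capping j - i with `max_len` mirrors the usual limit on answer length in extractive QA heads.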
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_finetuned_country| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/nsridhar/roberta-finetuned-country --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Anogmeld) author: John Snow Labs name: distilbert_qa_anogmeld_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Anogmeld`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_anogmeld_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768254089.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_anogmeld_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768254089.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_anogmeld_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_anogmeld_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_anogmeld_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Anogmeld/distilbert-base-uncased-finetuned-squad --- layout: model title: Kinyarwanda XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_swahili_finetuned_ner_kinyarwand date: 2022-08-01 tags: [rw, open_source, xlm_roberta, ner] task: Named Entity Recognition language: rw edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-swahili-finetuned-ner-kinyarwanda` is a Kinyarwanda model originally trained by `mbeukman`. ## Predicted Entities `DATE`, `PER`, `ORG`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_swahili_finetuned_ner_kinyarwand_rw_4.1.0_3.0_1659355664108.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_swahili_finetuned_ner_kinyarwand_rw_4.1.0_3.0_1659355664108.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_swahili_finetuned_ner_kinyarwand","rw") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_swahili_finetuned_ner_kinyarwand","rw") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
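The `NerConverter` stage above merges the token-level IOB tags in the `ner` column into entity chunks in `ner_chunk`. A minimal pure-Python sketch of that grouping logic (the tokens and tags are invented; the real annotator also tracks character offsets and metadata):

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (text, label) chunks."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close any open chunk before starting a new one
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)  # continue the open chunk
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Paul", "Kagame", "visited", "Kigali", "yesterday"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(iob_to_chunks(tokens, tags))  # -> [('Paul Kagame', 'PER'), ('Kigali', 'LOC')]
```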
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_swahili_finetuned_ner_kinyarwand| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|rw| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-swahili-finetuned-ner-kinyarwanda - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://github.com/masakhane-io/masakhane-ner --- layout: model title: Clinical Deidentification Pipeline (English) author: John Snow Labs name: clinical_deidentification date: 2022-07-24 tags: [deidentification, deid, clinical, pipeline, en, licensed] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`, `CITY`, `COUNTRY`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `STREET`, `USERNAME`, `ZIP`, `ACCOUNT`, `LICENSE`, `VIN`, `SSN`, `DLN`, `PLATE`, `IPADDR`, `EMAIL` entities.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_en_4.0.0_3.0_1658693448275.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_en_4.0.0_3.0_1658693448275.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "en", "clinical/models") sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""" result = deid_pipeline.annotate(sample) print("\n".join(result['masked'])) print("\n".join(result['masked_with_chars'])) print("\n".join(result['masked_fixed_length_chars'])) print("\n".join(result['obfuscated'])) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = new PretrainedPipeline("clinical_deidentification","en","clinical/models") val sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""" val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("en.de_identify.clinical_pipeline").predict("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""") ```
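The pipeline returns four parallel outputs: text masked with entity labels, masked with length-preserving characters, masked with fixed-length characters, and obfuscated with surrogate values. A toy sketch of the first three masking policies applied to one hypothetical entity span (the function, policy names, and span handling here are illustrative, not the pipeline's internals):

```python
def mask(text, start, end, label, policy):
    """Replace text[start:end] according to one of three masking policies."""
    if policy == "entity_label":
        repl = "<%s>" % label
    elif policy == "same_length_chars":
        # Bracketed stars with the same total length as the original span.
        repl = "[" + "*" * max(end - start - 2, 0) + "]"
    elif policy == "fixed_length_chars":
        repl = "****"
    else:
        raise ValueError("unknown policy: %s" % policy)
    return text[:start] + repl + text[end:]

text = "Dr. John Green saw the patient."
start = text.index("John")
end = start + len("John Green")
for policy in ("entity_label", "same_length_chars", "fixed_length_chars"):
    print(mask(text, start, end, "DOCTOR", policy))
```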
## Results ```bash Masked with entity labels ------------------------------ Name : , Record date: , # . Dr. , ID, IP . He is a male was admitted to the for cystectomy on . Patient's VIN : , SSN , Driver's license . Phone , , , E-MAIL: . Masked with chars ------------------------------ Name : [**************], Record date: [********], # [****]. Dr. [********], ID[**********], IP [************]. He is a [*********] male was admitted to the [**********] for cystectomy on [******]. Patient's VIN : [***************], SSN [**********], Driver's license [*********]. Phone [************], [***************], [***********], E-MAIL: [*************]. Masked with fixed length chars ------------------------------ Name : ****, Record date: ****, # ****. Dr. ****, ID****, IP ****. He is a **** male was admitted to the **** for cystectomy on ****. Patient's VIN : ****, SSN ****, Driver's license ****. Phone ****, ****, ****, E-MAIL: ****. Obfuscated ------------------------------ Name : Craige Perks, Record date: 2093-02-06, # R2593192. Dr. Dr Felice Lacer, IDXO:4884578, IP 444.444.444.444. He is a 75 male was admitted to the MADISON VALLEY MEDICAL CENTER for cystectomy on 07-01-1972. Patient's VIN : 2BBBB11BBBB222999, SSN SSN-814-86-1962, Driver's license P055567317431. Phone 0381-6762484, Budaörsi út 14., New brunswick, E-MAIL: Reba@google.com. 
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - TextMatcherModel - ContextualParserModel - RegexMatcherModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ChunkMergeModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: English DistilBertForQuestionAnswering model (from holtin) Full Squad author: John Snow Labs name: distilbert_qa_base_uncased_holtin_finetuned_full_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-holtin-finetuned-full-squad` is an English model originally trained by `holtin`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_holtin_finetuned_full_squad_en_4.0.0_3.0_1654727083911.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_holtin_finetuned_full_squad_en_4.0.0_3.0_1654727083911.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_holtin_finetuned_full_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_holtin_finetuned_full_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased_full.by_holtin").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_holtin_finetuned_full_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/holtin/distilbert-base-uncased-holtin-finetuned-full-squad --- layout: model title: English asr_wav2vec2_base_100h_with_lm_by_patrickvonplaten TFWav2Vec2ForCTC from patrickvonplaten author: John Snow Labs name: asr_wav2vec2_base_100h_with_lm_by_patrickvonplaten date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_with_lm_by_patrickvonplaten` is an English model originally trained by patrickvonplaten. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_100h_with_lm_by_patrickvonplaten_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_100h_with_lm_by_patrickvonplaten_en_4.2.0_3.0_1664096345574.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_100h_with_lm_by_patrickvonplaten_en_4.2.0_3.0_1664096345574.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_100h_with_lm_by_patrickvonplaten", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_100h_with_lm_by_patrickvonplaten", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_100h_with_lm_by_patrickvonplaten| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|227.9 MB| --- layout: model title: Korean T5ForConditionalGeneration Small Cased model (from KETI-AIR) author: John Snow Labs name: t5_ke_small date: 2023-01-30 tags: [ko, open_source, t5, tensorflow] task: Text Generation language: ko edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ke-t5-small-ko` is a Korean model originally trained by `KETI-AIR`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_ke_small_ko_4.3.0_3.0_1675104654891.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_ke_small_ko_4.3.0_3.0_1675104654891.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_ke_small","ko") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_ke_small","ko") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_ke_small| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ko| |Size:|211.1 MB| ## References - https://huggingface.co/KETI-AIR/ke-t5-small-ko - https://github.com/AIRC-KETI/ke-t5 - https://aclanthology.org/2021.findings-emnlp.33/ - https://koreascience.kr/article/CFKO202130060717834.pdf --- layout: model title: German asr_exp_w2v2r_xls_r_gender_male_5_female_5_s896 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_exp_w2v2r_xls_r_gender_male_5_female_5_s896 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_gender_male_5_female_5_s896` is a German model originally trained by jonatasgrosman. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2r_xls_r_gender_male_5_female_5_s896_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_gender_male_5_female_5_s896_de_4.2.0_3.0_1664119494278.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_gender_male_5_female_5_s896_de_4.2.0_3.0_1664119494278.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2r_xls_r_gender_male_5_female_5_s896', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2r_xls_r_gender_male_5_female_5_s896", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_exp_w2v2r_xls_r_gender_male_5_female_5_s896| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Chinese DistilBERT Embeddings (from Geotrend) author: John Snow Labs name: distilbert_embeddings_distilbert_base_zh_cased date: 2022-04-12 tags: [distilbert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-zh-cased` is a Chinese model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_zh_cased_zh_3.4.2_3.0_1649783150730.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_zh_cased_zh_3.4.2_3.0_1649783150730.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_zh_cased","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_zh_cased","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.distilbert_base_cased").predict("""I love Spark NLP""") ```
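The `embeddings` column produced above holds one dense vector per token; a common downstream use is comparing tokens or pooled sentences by cosine similarity. A dependency-free sketch with made-up 4-dimensional vectors (real DistilBERT vectors are 768-dimensional, and the vector values here are purely illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v_bank_river = [0.8, 0.1, 0.3, 0.0]  # toy token vectors
v_bank_money = [0.1, 0.9, 0.2, 0.1]
v_finance    = [0.0, 0.8, 0.3, 0.2]

print(round(cosine(v_bank_money, v_finance), 3))  # higher: related senses
print(round(cosine(v_bank_river, v_finance), 3))  # lower: unrelated senses
```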
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_base_zh_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|198.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/distilbert-base-zh-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: English RobertaForQuestionAnswering (from navteca) author: John Snow Labs name: roberta_qa_navteca_roberta_base_squad2 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2` is an English model originally trained by `navteca`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_navteca_roberta_base_squad2_en_4.0.0_3.0_1655734986431.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_navteca_roberta_base_squad2_en_4.0.0_3.0_1655734986431.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_navteca_roberta_base_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_navteca_roberta_base_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_navteca_roberta_base_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/navteca/roberta-base-squad2 - https://rajpurkar.github.io/SQuAD-explorer/ --- layout: model title: English BertForQuestionAnswering model (from rsvp-ai) author: John Snow Labs name: bert_qa_bertserini_bert_large_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bertserini-bert-large-squad` is an English model originally trained by `rsvp-ai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bertserini_bert_large_squad_en_4.0.0_3.0_1654185472710.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bertserini_bert_large_squad_en_4.0.0_3.0_1654185472710.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bertserini_bert_large_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bertserini_bert_large_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.large.by_rsvp-ai").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bertserini_bert_large_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/rsvp-ai/bertserini-bert-large-squad --- layout: model title: English DistilBertForQuestionAnswering Cased model (from ksabeh) author: John Snow Labs name: distilbert_qa_attribute_correction_mlm date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-attribute-correction-mlm` is an English model originally trained by `ksabeh`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_attribute_correction_mlm_en_4.3.0_3.0_1672766362107.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_attribute_correction_mlm_en_4.3.0_3.0_1672766362107.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_attribute_correction_mlm","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_attribute_correction_mlm","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_attribute_correction_mlm| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ksabeh/distilbert-attribute-correction-mlm --- layout: model title: English DistilBertForTokenClassification Cased model (from Lucifermorningstar011) author: John Snow Labs name: distilbert_token_classifier_autotrain_final_784824211 date: 2023-03-03 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-final-784824211` is an English model originally trained by `Lucifermorningstar011`. ## Predicted Entities `9`, `0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824211_en_4.3.1_3.0_1677881778499.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824211_en_4.3.1_3.0_1677881778499.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824211","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824211","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_final_784824211| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Lucifermorningstar011/autotrain-final-784824211 --- layout: model title: Legal Arrangement Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_arrangement_agreement_bert date: 2023-01-29 tags: [en, legal, classification, arrangement, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_arrangement_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `arrangement-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `arrangement-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_arrangement_agreement_bert_en_1.0.0_3.0_1674990701764.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_arrangement_agreement_bert_en_1.0.0_3.0_1674990701764.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_arrangement_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
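The Benchmarking section reports macro and weighted averages over the two classes. A quick sketch of how those averages follow from the per-class F1 scores and supports (values copied from the benchmarking table; this is plain arithmetic, not a Spark NLP call):

```python
# Per-class (f1, support) from the benchmarking table:
# arrangement-agreement: f1 0.96 over 39 docs; other: f1 0.97 over 57 docs.
scores = {"arrangement-agreement": (0.96, 39), "other": (0.97, 57)}

total_support = sum(n for _, n in scores.values())            # 96 documents
macro_f1 = sum(f1 for f1, _ in scores.values()) / len(scores)  # unweighted mean
weighted_f1 = sum(f1 * n for f1, n in scores.values()) / total_support
```

Both averages round to 0.97, matching the reported table.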
## Results ```bash +-------+ |result| +-------+ |[arrangement-agreement]| |[other]| |[other]| |[arrangement-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_arrangement_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support arrangement-agreement 0.97 0.95 0.96 39 other 0.97 0.98 0.97 57 accuracy - - 0.97 96 macro-avg 0.97 0.97 0.97 96 weighted-avg 0.97 0.97 0.97 96 ``` --- layout: model title: Mapping RxNorm Codes with Corresponding Actions and Treatments author: John Snow Labs name: rxnorm_action_treatment_mapper date: 2022-06-27 tags: [chunk_mapper, action, treatment, clinical, licensed, en] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps RxNorm and RxNorm Extension codes with their corresponding action and treatment. Action refers to the function of the drug in various body systems; treatment refers to which disease the drug is used to treat. 
## Predicted Entities `action`, `treatment` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_action_treatment_mapper_en_3.5.3_3.0_1656315389520.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_action_treatment_mapper_en_3.5.3_3.0_1656315389520.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli", "en","clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") rxnorm_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models")\ .setInputCols(["ner_chunk", "sbert_embeddings"])\ .setOutputCol("rxnorm_code")\ .setDistanceFunction("EUCLIDEAN") chunkerMapper_1 = ChunkMapperModel\ .pretrained("rxnorm_action_treatment_mapper", "en", "clinical/models")\ .setInputCols(["rxnorm_code"])\ .setOutputCol("action_mappings")\ .setRels(["action"]) chunkerMapper_2 = ChunkMapperModel\ .pretrained("rxnorm_action_treatment_mapper", "en", "clinical/models")\ .setInputCols(["rxnorm_code"])\ .setOutputCol("treatment_mappings")\ .setRels(["treatment"]) pipeline = Pipeline(stages = [ documentAssembler, sbert_embedder, rxnorm_resolver, chunkerMapper_1, chunkerMapper_2 ]) model = pipeline.fit(spark.createDataFrame([['']]).toDF('text')) pipeline = LightPipeline(model) result = pipeline.fullAnnotate(['Sinequan 150 MG', 'Zonalon 50 mg']) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols("ner_chunk") .setOutputCol("sbert_embeddings") val rxnorm_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("rxnorm_code") .setDistanceFunction("EUCLIDEAN") val chunkerMapper_1 = ChunkMapperModel .pretrained("rxnorm_action_treatment_mapper", "en", "clinical/models") .setInputCols("rxnorm_code") .setOutputCol("action_mappings") .setRels(Array("action")) val chunkerMapper_2 = ChunkMapperModel 
.pretrained("rxnorm_action_treatment_mapper", "en", "clinical/models") .setInputCols("rxnorm_code") .setOutputCol("treatment_mappings") .setRels(Array("treatment")) val pipeline = new Pipeline().setStages(Array( documentAssembler, sbert_embedder, rxnorm_resolver, chunkerMapper_1, chunkerMapper_2 )) val data = Seq("Sinequan 150 MG", "Zonalon 50 mg").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.rxnorm_to_action_treatment").predict("""Sinequan 150 MG""") ```
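The two `ChunkMapperModel` stages above emit actions and treatments as separate output columns keyed by the same RxNorm code. A small post-processing sketch of combining them into one lookup per code (sample values mirror the Results section; the `(code, mappings)` row shape is illustrative, not the raw annotation schema):

```python
# (code, mappings) pairs as they might be collected from the two mapper outputs.
action_rows = [("1000067", ["Anxiolytic", "Sedative"]), ("103971", ["Analgetic"])]
treatment_rows = [("1000067", ["Depression", "Anxiety"]), ("103971", ["Pain"])]

# Merge both relation types under each RxNorm code.
merged = {}
for code, actions in action_rows:
    merged.setdefault(code, {"action": [], "treatment": []})["action"] = actions
for code, treatments in treatment_rows:
    merged.setdefault(code, {"action": [], "treatment": []})["treatment"] = treatments
```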
## Results ```bash | | ner_chunk | rxnorm_code | Action | Treatment | |---:|:--------------------|:--------------|:----------------------------------------------------------|:-----------------------------------------------------------------------| | 0 | ['Sinequan 150 MG'] | ['1000067'] | ['Anxiolytic', 'Psychoanaleptics', 'Sedative'] | ['Depression', 'Neurosis', 'Anxiety&Panic Attacks', 'Psychosis'] | | 1 | ['Zonalon 50 mg'] | ['103971'] | ['Analgesic (Opioid)', 'Analgetic', 'Opioid', 'Vitamins'] | ['Pain'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|rxnorm_action_treatment_mapper| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[rxnorm_code]| |Output Labels:|[mappings]| |Language:|en| |Size:|21.1 MB| --- layout: model title: Mapping Drugs With Their Corresponding Actions And Treatments author: John Snow Labs name: drug_action_treatment_mapper date: 2022-06-28 tags: [drug, action, treatment, chunk_mapper, clinical, licensed, en] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps drugs with their corresponding action and treatment. Action refers to the function of the drug in various body systems; treatment refers to which disease the drug is used to treat. 
## Predicted Entities `action`, `treatment` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/drug_action_treatment_mapper_en_3.5.3_3.0_1656398556500.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/drug_action_treatment_mapper_en_3.5.3_3.0_1656398556500.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_posology_small", "en", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner")\ .setLabelCasing("upper") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(["DRUG"]) chunkerMapper_action = ChunkMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("action_mappings")\ .setRels(["action"])\ .setLowerCase(True) chunkerMapper_treatment = ChunkMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("treatment_mappings")\ .setRels(["treatment"])\ .setLowerCase(True) pipeline = Pipeline().setStages([document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunkerMapper_action, chunkerMapper_treatment]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) text = """The patient is a 71-year-old female patient of Dr. X. 
and she was given Aklis, Dermovate, Aacidexam and Paracetamol.""" pipeline = LightPipeline(model) result = pipeline.fullAnnotate(text) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_posology_small","en","clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") .setLabelCasing("upper") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("DRUG")) val chunkerMapper_action = ChunkMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("action_mappings") .setRels(Array("action")) .setLowerCase(true) val chunkerMapper_treatment = ChunkMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("treatment_mappings") .setRels(Array("treatment")) .setLowerCase(true) val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunkerMapper_action, chunkerMapper_treatment)) val data = Seq("""The patient is a 71-year-old female patient of Dr. X. and she was given Aklis, Dermovate, Aacidexam and Paracetamol.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.drug_to_action_treatment").predict("""The patient is a 71-year-old female patient of Dr. X. 
and she was given Aklis, Dermovate, Aacidexam and Paracetamol.""") ```
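Besides the single best mapping, the mapper exposes alternative candidates in annotation metadata as one `:::`-joined string (the `action_meta`/`treatment_meta` columns in the Results section). A sketch of splitting such a string into a clean list (sample values mirror the table; `split_meta` is an illustrative helper, not a Spark NLP function):

```python
def split_meta(meta: str, sep: str = ":::") -> list:
    # Split a ':::'-joined metadata string into a list of candidates,
    # trimming stray whitespace around each item.
    return [item.strip() for item in meta.split(sep) if item.strip()]

aklis_actions = split_meta("hypotensive:::natriuretic")
dermovate_actions = split_meta("corticosteroids::: dermatological preparations:::very strong")
```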
## Results ```bash +-----------+--------------------+-------------------+------------------------------------------------------------+-----------------------------------------------------------------------------+ |Drug |action_mappings |treatment_mappings |action_meta |treatment_meta | +-----------+--------------------+-------------------+------------------------------------------------------------+-----------------------------------------------------------------------------+ |Aklis |cardioprotective |hyperlipidemia |hypotensive:::natriuretic |hypertension:::diabetic kidney disease:::cerebrovascular accident:::smoking | |Dermovate |anti-inflammatory |lupus |corticosteroids::: dermatological preparations:::very strong|discoid lupus erythematosus:::empeines:::psoriasis:::eczema | |Aacidexam |antiallergic |abscess |antiexudative:::anti-inflammatory:::anti-shock |brain abscess:::agranulocytosis:::adrenogenital syndrome | |Paracetamol|analgesic |arthralgia |anti-inflammatory:::antipyretic:::pain reliever |period pain:::pain:::sore throat:::headache:::influenza:::toothache | +-----------+--------------------+-------------------+------------------------------------------------------------+-----------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|drug_action_treatment_mapper| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|8.4 MB| --- layout: model title: Sentence Entity Resolver for MeSH (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_mesh date: 2021-11-14 tags: [mesh, entity_resolution, licensed, en, clinical] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.2 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps 
clinical entities to Medical Subject Heading (MeSH) codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. ## Predicted Entities `MeSH Codes` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_mesh_en_3.3.2_2.4_1636888521477.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_mesh_en_3.3.2_2.4_1636888521477.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The ```sbiobertresolve_mesh``` resolver model must be used with ```sbiobert_base_cased_mli``` as embeddings and ```ner_clinical``` as the NER model. There is no need to set ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... c2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli", "en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sentence_embeddings") mesh_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_mesh", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sentence_embeddings"]) \ .setOutputCol("mesh_code")\ .setDistanceFunction("EUCLIDEAN") resolver_pipeline = Pipeline( stages = [ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, clinical_ner, ner_converter, c2doc, sbert_embedder, mesh_resolver ]) data = spark.createDataFrame([["""She was admitted to the hospital with chest pain and found to have bilateral pleural effusion, the right greater than the left. We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma. At this time, chest tube placement for drainage of the fluid occurred and thoracoscopy with fluid biopsies, which were performed, which revealed malignant mesothelioma."""]]).toDF("text") result = resolver_pipeline.fit(data).transform(data) ``` ```scala ... 
val c2doc = Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sentence_embeddings") val mesh_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_mesh", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sentence_embeddings")) .setOutputCol("mesh_code") .setDistanceFunction("EUCLIDEAN") val resolver_pipeline = new Pipeline( stages = Array( document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, clinical_ner, ner_converter, c2doc, sbert_embedder, mesh_resolver)) val data = Seq("She was admitted to the hospital with chest pain and found to have bilateral pleural effusion, the right greater than the left. We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma. At this time, chest tube placement for drainage of the fluid occurred and thoracoscopy with fluid biopsies, which were performed, which revealed malignant mesothelioma.").toDS.toDF("text") val result = resolver_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.mesh").predict("""She was admitted to the hospital with chest pain and found to have bilateral pleural effusion, the right greater than the left. We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma. At this time, chest tube placement for drainage of the fluid occurred and thoracoscopy with fluid biopsies, which were performed, which revealed malignant mesothelioma.""") ```
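The resolver reports every candidate code and its embedding distance as parallel `:::`-joined strings (the `all_codes` and `distances` columns in the Results section). A sketch of picking the closest candidate from such a pair (sample values mirror the chest-pain row; `best_candidate` is an illustrative helper, not a Spark NLP function):

```python
def best_candidate(codes: str, distances: str, sep: str = ":::"):
    # Pair each candidate code with its distance and return the closest one.
    pairs = zip(codes.split(sep), (float(d) for d in distances.split(sep)))
    return min(pairs, key=lambda p: p[1])

code, dist = best_candidate("D002637:::D059350:::D019547", "0.0000:::0.0577:::0.0587")
```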
## Results ```bash +--------------------------+---------+----------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+ | ner_chunk| entity| mesh_code| all_codes| resolutions| distances| +--------------------------+---------+----------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+ | chest pain| PROBLEM| D002637|D002637:::D059350:::D019547:::D020069:::D015746:::D000072716:::D005157:::D059265:::D001416:::D048...|Chest Pain:::Chronic Pain:::Neck Pain:::Shoulder Pain:::Abdominal Pain:::Cancer Pain:::Facial Pai...|0.0000:::0.0577:::0.0587:::0.0601:::0.0658:::0.0704:::0.0712:::0.0741:::0.0766:::0.0778:::0.0794:...| |bilateral pleural effusion| PROBLEM| D010996|D010996:::D010490:::D011654:::D016724:::D010995:::D016066:::D011001:::D007819:::D035422:::D004653...|Pleural Effusion:::Pericardial Effusion:::Pulmonary Edema:::Empyema, Pleural:::Pleural Diseases::...|0.0309:::0.1010:::0.1115:::0.1213:::0.1218:::0.1398:::0.1425:::0.1401:::0.1451:::0.1464:::0.1464:...| | the pathology| TEST| D010336|D010336:::D010335:::D001004:::D020969:::C001675:::C536472:::D004194:::D003951:::D013631:::C535329...|Pathology:::Pathologic Processes:::Anus Diseases:::Disease Attributes:::malformins:::Upington dis...|0.0788:::0.0977:::0.1364:::0.1396:::0.1419:::0.1459:::0.1418:::0.1393:::0.1514:::0.1541:::0.1491:...| | the pericardectomy|TREATMENT| 
D010492|D010492:::D011670:::D018700:::D020884:::D011672:::D005927:::D064727:::D002431:::C000678968:::D011...|Pericardiectomy:::Pulpectomy:::Pleurodesis:::Colpotomy:::Pulpotomy:::Glossectomy:::Posterior Caps...|0.1098:::0.1448:::0.1801:::0.1852:::0.1871:::0.1923:::0.1901:::0.2023:::0.2075:::0.2010:::0.1996:...| | mesothelioma| PROBLEM|D000086002|D000086002:::C535700:::D009208:::D032902:::D018301:::D018199:::C562740:::C000686536:::D018276:::D...|Mesothelioma, Malignant:::Malignant mesenchymal tumor:::Myoepithelioma:::Ganoderma:::Neoplasms, M...|0.0813:::0.1515:::0.1599:::0.1810:::0.1864:::0.1881:::0.1907:::0.1938:::0.1924:::0.1876:::0.2040:...| | chest tube placement|TREATMENT| D015505|D015505:::D019616:::D013896:::D012124:::D013906:::D013510:::D020708:::D035423:::D013903:::D000066...|Chest Tubes:::Thoracic Surgical Procedures:::Thoracic Diseases:::Respiratory Care Units:::Thoraco...|0.0557:::0.1473:::0.1598:::0.1604:::0.1725:::0.1651:::0.1795:::0.1760:::0.1804:::0.1846:::0.1883:...| | drainage of the fluid|TREATMENT| D004322|D004322:::D018495:::C045413:::D021061:::D045268:::D018508:::D005441:::D015633:::D014906:::D001834...|Drainage:::Fluid Shifts:::Bonain's liquid:::Liquid Ventilation:::Flowmeters:::Water Purification:...|0.1141:::0.1403:::0.1582:::0.1549:::0.1586:::0.1626:::0.1599:::0.1655:::0.1667:::0.1656:::0.1741:...| | thoracoscopy|TREATMENT| D013906|D013906:::D020708:::D035423:::D013905:::D035441:::D013897:::D001468:::D000069258:::D013909:::D013...|Thoracoscopy:::Thoracoscopes:::Thoracic Cavity:::Thoracoplasty:::Thoracic Wall:::Thoracic Duct:::...|0.0000:::0.0359:::0.0744:::0.1007:::0.1070:::0.1143:::0.1186:::0.1257:::0.1228:::0.1356:::0.1354:...| | fluid biopsies| TEST|D000073890|D000073890:::D010533:::D020420:::D011677:::D017817:::D001706:::D005441:::D005751:::D013582:::D000...|Liquid Biopsy:::Peritoneal Lavage:::Cyst Fluid:::Punctures:::Nasal Lavage 
Fluid:::Biopsy:::Fluids...|0.1408:::0.1612:::0.1763:::0.1744:::0.1744:::0.1810:::0.1744:::0.1828:::0.1896:::0.1909:::0.1950:...| | malignant mesothelioma| PROBLEM|D000086002|D000086002:::C535700:::C562740:::D009236:::D007890:::D012515:::D009208:::C009823:::C000683999:::C...|Mesothelioma, Malignant:::Malignant mesenchymal tumor:::Hemangiopericytoma, Malignant:::Myxosarco...|0.0737:::0.1106:::0.1658:::0.1627:::0.1660:::0.1639:::0.1728:::0.1676:::0.1791:::0.1843:::0.1849:...| +--------------------------+---------+----------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_mesh| |Compatibility:|Healthcare NLP 3.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[mesh_code]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on the 01 December 2021 MeSH dataset. --- layout: model title: English BertForQuestionAnswering model (from deepset) author: John Snow Labs name: bert_qa_bert_large_uncased_whole_word_masking_squad2 date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-whole-word-masking-squad2` is an English model originally trained by `deepset`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_squad2_en_4.0.0_3.0_1654537257894.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_squad2_en_4.0.0_3.0_1654537257894.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_whole_word_masking_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_large_uncased_whole_word_masking_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.large_uncased.by_deepset").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
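The `whole_word_masking` part of the model name refers to the BERT pretraining variant in which all WordPiece sub-tokens of a word are masked together rather than independently. The sketch below is a minimal, hypothetical illustration of that grouping idea; the function name and token list are illustrative only and are not part of the Spark NLP API.

```python
def whole_word_mask(tokens, word_index, mask_token="[MASK]"):
    """Mask every WordPiece of one word. Pieces starting with '##'
    continue the previous word, so they are masked together with it."""
    # Group piece indices into whole words.
    words, current = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and current:
            current.append(i)
        else:
            if current:
                words.append(current)
            current = [i]
    if current:
        words.append(current)
    # Replace every piece of the chosen word with the mask token.
    masked = list(tokens)
    for i in words[word_index]:
        masked[i] = mask_token
    return masked

tokens = ["the", "phil", "##har", "##mon", "##ic", "played"]
print(whole_word_mask(tokens, 1))
# ['the', '[MASK]', '[MASK]', '[MASK]', '[MASK]', 'played']
```

With per-piece masking, only one of the four pieces of the middle word might be hidden, which leaks most of the word; masking the whole word makes the prediction task harder and tends to improve downstream quality.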
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_large_uncased_whole_word_masking_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/deepset/bert-large-uncased-whole-word-masking-squad2 --- layout: model title: English DistilBertForMaskedLM Base Uncased model (from Intel) author: John Snow Labs name: distilbert_embeddings_base_uncased_sparse_85_unstructured_pruneofa date: 2022-12-12 tags: [en, open_source, distilbert_embeddings, distilbertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-sparse-85-unstructured-pruneofa` is an English model originally trained by `Intel`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_base_uncased_sparse_85_unstructured_pruneofa_en_4.2.4_3.0_1670864786882.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_base_uncased_sparse_85_unstructured_pruneofa_en_4.2.4_3.0_1670864786882.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCols(["text"]) \ .setOutputCols("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_base_uncased_sparse_85_unstructured_pruneofa","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(False) pipeline = Pipeline(stages=[documentAssembler, tokenizer, distilbert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCols(Array("text")) .setOutputCols(Array("document")) val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_base_uncased_sparse_85_unstructured_pruneofa","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, distilbert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
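The `sparse_85_unstructured` part of the model name means that roughly 85% of the individual weight values were zeroed out during the Prune Once for All procedure referenced below. As a toy illustration of unstructured magnitude pruning (a simplified stand-in, not the actual training recipe described in the linked paper), one can zero the smallest-magnitude entries until a target fraction is reached:

```python
def magnitude_prune(weights, target_sparsity):
    """Zero the smallest-magnitude weights so that the requested
    fraction of entries becomes zero (unstructured pruning)."""
    k = int(round(target_sparsity * len(weights)))
    if k == 0:
        return list(weights)
    # k-th smallest absolute value is the pruning threshold.
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

weights = [0.05, -0.9, 0.3, 0.02, -0.4, 0.6, -0.07, 0.8, 0.01, -0.2]
pruned = magnitude_prune(weights, 0.8)
print(pruned)             # only the two largest-magnitude weights survive
print(pruned.count(0.0))  # 8 of 10 entries are zero -> 80% sparsity
```

"Unstructured" here means individual weights are pruned independently, as opposed to removing whole rows, heads, or blocks.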
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_base_uncased_sparse_85_unstructured_pruneofa| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|132.8 MB| |Case sensitive:|false| ## References - https://huggingface.co/Intel/distilbert-base-uncased-sparse-85-unstructured-pruneofa - https://arxiv.org/abs/2111.05754 - https://github.com/IntelLabs/Model-Compression-Research-Package/tree/main/research/prune-once-for-all --- layout: model title: English Bert Embeddings (from antoiloui) author: John Snow Labs name: bert_embeddings_netbert date: 2022-04-11 tags: [bert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `netbert` is an English model originally trained by `antoiloui`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_netbert_en_3.4.2_3.0_1649672939940.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_netbert_en_3.4.2_3.0_1649672939940.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_netbert","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_netbert","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.netbert").predict("""I love Spark NLP""") ```
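Once the pipeline above has produced token embeddings, a common downstream step is comparing them with cosine similarity. The sketch below is generic and self-contained; the 3-dimensional vectors stand in for the 768-dimensional vectors this model emits.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Identical vectors score 1, orthogonal vectors score 0.
print(round(cosine_similarity([1.0, 2.0, 0.0], [1.0, 2.0, 0.0]), 6))  # 1.0
print(round(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 6))  # 0.0
```

Cosine similarity ignores vector magnitude, which is usually what you want when ranking semantically similar tokens or sentences.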
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_netbert| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|406.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/antoiloui/netbert - https://github.com/antoiloui/netbert/blob/master/docs/index.md --- layout: model title: Malay T5ForConditionalGeneration Small Cased model (from mesolitica) author: John Snow Labs name: t5_finetune_translation_small_standard_bahasa_cased date: 2023-01-30 tags: [ms, open_source, t5, tensorflow] task: Text Generation language: ms edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `finetune-translation-t5-small-standard-bahasa-cased` is a Malay model originally trained by `mesolitica`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_finetune_translation_small_standard_bahasa_cased_ms_4.3.0_3.0_1675102196206.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_finetune_translation_small_standard_bahasa_cased_ms_4.3.0_3.0_1675102196206.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCols("text") \ .setOutputCols("document") t5 = T5Transformer.pretrained("t5_finetune_translation_small_standard_bahasa_cased","ms") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCols("text") .setOutputCols("document") val t5 = T5Transformer.pretrained("t5_finetune_translation_small_standard_bahasa_cased","ms") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_finetune_translation_small_standard_bahasa_cased| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ms| |Size:|288.7 MB| ## References - https://huggingface.co/mesolitica/finetune-translation-t5-small-standard-bahasa-cased - https://github.com/huseinzol05/malay-dataset/tree/master/translation/laser - https://github.com/huseinzol05/malaya/tree/master/session/translation/hf-t5 --- layout: model title: Fast Neural Machine Translation Model from Cushitic languages to English author: John Snow Labs name: opus_mt_cus_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, cus, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `cus` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_cus_en_xx_2.7.0_2.4_1609163467456.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_cus_en_xx_2.7.0_2.4_1609163467456.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_cus_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_cus_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.cus.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_cus_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_fpdm_triplet_roberta_FT_newsqa date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_triplet_roberta_FT_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_triplet_roberta_FT_newsqa_en_4.0.0_3.0_1655728818966.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_triplet_roberta_FT_newsqa_en_4.0.0_3.0_1655728818966.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_triplet_roberta_FT_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_fpdm_triplet_roberta_FT_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.roberta.qa_fpdm_triplet_roberta_ft_newsqa.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_fpdm_triplet_roberta_FT_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|458.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/fpdm_triplet_roberta_FT_newsqa --- layout: model title: Sentence Detection in Marathi Text author: John Snow Labs name: sentence_detector_dl date: 2021-08-30 tags: [open_source, sentence_detection, mr] task: Sentence Detection language: mr edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_mr_3.2.0_3.0_1630319297311.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_mr_3.2.0_3.0_1630319297311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "mr") \ .setInputCols(["document"]) \ .setOutputCol("sentences") sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL])) sd_model.fullAnnotate("""इंग्रजी वाचन परिच्छेद एक उत्तम स्रोत शोधत आहात? आपण योग्य ठिकाणी आला आहात. नुकत्याच झालेल्या एका अभ्यासानुसार, आजच्या तरुणांमध्ये वाचनाची सवय झपाट्याने कमी होत आहे. ते दिलेल्या इंग्रजी वाचनाच्या परिच्छेदावर काही सेकंदांपेक्षा जास्त काळ लक्ष केंद्रित करू शकत नाहीत! तसेच, वाचन हा सर्व स्पर्धा परीक्षांचा अविभाज्य भाग होता आणि आहे. तर, तुम्ही तुमचे वाचन कौशल्य कसे सुधारता? या प्रश्नाचे उत्तर प्रत्यक्षात दुसरा प्रश्न आहे: वाचन कौशल्याचा उपयोग काय आहे? वाचनाचा मुख्य हेतू म्हणजे 'अर्थ काढणे'.""") ``` ```scala val documenter = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "mr") .setInputCols(Array("document")) .setOutputCol("sentence") val pipeline = new Pipeline().setStages(Array(documenter, model)) val data = Seq("इंग्रजी वाचन परिच्छेद एक उत्तम स्रोत शोधत आहात? आपण योग्य ठिकाणी आला आहात. नुकत्याच झालेल्या एका अभ्यासानुसार, आजच्या तरुणांमध्ये वाचनाची सवय झपाट्याने कमी होत आहे. ते दिलेल्या इंग्रजी वाचनाच्या परिच्छेदावर काही सेकंदांपेक्षा जास्त काळ लक्ष केंद्रित करू शकत नाहीत! तसेच, वाचन हा सर्व स्पर्धा परीक्षांचा अविभाज्य भाग होता आणि आहे. तर, तुम्ही तुमचे वाचन कौशल्य कसे सुधारता? या प्रश्नाचे उत्तर प्रत्यक्षात दुसरा प्रश्न आहे: वाचन कौशल्याचा उपयोग काय आहे? वाचनाचा मुख्य हेतू म्हणजे 'अर्थ काढणे'.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python nlu.load('mr.sentence_detector').predict("इंग्रजी वाचन परिच्छेद एक उत्तम स्रोत शोधत आहात? आपण योग्य ठिकाणी आला आहात. नुकत्याच झालेल्या एका अभ्यासानुसार, आजच्या तरुणांमध्ये वाचनाची सवय झपाट्याने कमी होत आहे. 
ते दिलेल्या इंग्रजी वाचनाच्या परिच्छेदावर काही सेकंदांपेक्षा जास्त काळ लक्ष केंद्रित करू शकत नाहीत! तसेच, वाचन हा सर्व स्पर्धा परीक्षांचा अविभाज्य भाग होता आणि आहे. तर, तुम्ही तुमचे वाचन कौशल्य कसे सुधारता? या प्रश्नाचे उत्तर प्रत्यक्षात दुसरा प्रश्न आहे: वाचन कौशल्याचा उपयोग काय आहे? वाचनाचा मुख्य हेतू म्हणजे 'अर्थ काढणे'.", output_level ='sentence') ```
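The Benchmarking section further below reports precision 0.96 and recall 1.00 for this sentence detector. As a quick sanity check on those figures, F1 is simply the harmonic mean of precision and recall:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# 2 * 0.96 * 1.00 / (0.96 + 1.00) = 0.9796..., which rounds to the
# reported F1 of 0.98.
print(round(f1_score(0.96, 1.00), 2))
```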
## Results ```bash +-----------------------------------------------------------------------------------------------------+ |result | +-----------------------------------------------------------------------------------------------------+ |[इंग्रजी वाचन परिच्छेद एक उत्तम स्रोत शोधत आहात?] | |[आपण योग्य ठिकाणी आला आहात.] | |[नुकत्याच झालेल्या एका अभ्यासानुसार, आजच्या तरुणांमध्ये वाचनाची सवय झपाट्याने कमी होत आहे.] | |[ते दिलेल्या इंग्रजी वाचनाच्या परिच्छेदावर काही सेकंदांपेक्षा जास्त काळ लक्ष केंद्रित करू शकत नाहीत!] | |[तसेच, वाचन हा सर्व स्पर्धा परीक्षांचा अविभाज्य भाग होता आणि आहे.] | |[तर, तुम्ही तुमचे वाचन कौशल्य कसे सुधारता?] | |[या प्रश्नाचे उत्तर प्रत्यक्षात दुसरा प्रश्न आहे:] | |[वाचन कौशल्याचा उपयोग काय आहे?] | |[वाचनाचा मुख्य हेतू म्हणजे 'अर्थ काढणे'.] | +-----------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentence_detector_dl| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[sentences]| |Language:|mr| ## Benchmarking ```bash label Accuracy Recall Prec F1 0 0.98 1.00 0.96 0.98 ``` --- layout: model title: Detect Genes/Proteins (BC2GM) in Medical Texts author: John Snow Labs name: ner_biomedical_bc2gm date: 2022-05-10 tags: [bc2gm, ner, biomedical, gene_protein, gene, protein, en, licensed, clinical] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.5.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity recognition annotator allows for a generic model to be trained by utilizing a deep learning algorithm (Char CNNs - BiLSTM - CRF - word embeddings) inspired on a former state of the art model for NER: Chiu & Nicols, Named Entity Recognition with Bidirectional LSTM,CNN. 
This model has been trained to extract genes/proteins from a medical text. ## Predicted Entities `GENE_PROTEIN` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_biomedical_bc2gm_en_3.5.1_3.0_1652184014650.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_biomedical_bc2gm_en_3.5.1_3.0_1652184014650.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical" ,"en", "clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_biomedical_bc2gm", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["Immunohistochemical staining was positive for S-100 in all 9 cases stained, positive for HMB-45 in 9 (90%) of 10, and negative for cytokeratin in all 9 cases in which myxoid melanoma remained in the block after previous sections."]]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical" ,"en", "clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("word_embeddings") val ner = MedicalNerModel.pretrained("ner_biomedical_bc2gm", "en", "clinical/models") .setInputCols(Array("sentence", "token", "word_embeddings")) .setOutputCol("ner") val 
ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Immunohistochemical staining was positive for S-100 in all 9 cases stained, positive for HMB-45 in 9 (90%) of 10, and negative for cytokeratin in all 9 cases in which myxoid melanoma remained in the block after previous sections.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.biomedical_bc2gm").predict("""Immunohistochemical staining was positive for S-100 in all 9 cases stained, positive for HMB-45 in 9 (90%) of 10, and negative for cytokeratin in all 9 cases in which myxoid melanoma remained in the block after previous sections.""") ```
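The `NerConverter` stage in the pipeline above collapses token-level IOB tags into the entity chunks shown in the Results table. A minimal, self-contained sketch of that collapsing logic (simplified; the real annotator also tracks character offsets and metadata):

```python
def iob_to_chunks(tokens, tags):
    """Collapse IOB tags into (chunk_text, label) pairs, the way a
    NER-converter stage turns token-level tags into entity chunks."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the previous chunk before starting a new one
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)  # continuation of the open chunk
        else:  # "O" tag (or a stray "I-" with no open chunk)
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["positive", "for", "S-100", "and", "HMB-45"]
tags = ["O", "O", "B-GENE_PROTEIN", "O", "B-GENE_PROTEIN"]
print(iob_to_chunks(tokens, tags))
# [('S-100', 'GENE_PROTEIN'), ('HMB-45', 'GENE_PROTEIN')]
```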
## Results ```bash +-----------+------------+ |chunk |ner_label | +-----------+------------+ |S-100 |GENE_PROTEIN| |HMB-45 |GENE_PROTEIN| |cytokeratin|GENE_PROTEIN| +-----------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_biomedical_bc2gm| |Compatibility:|Healthcare NLP 3.5.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|14.6 MB| ## References Created by Smith et al. at 2008, the BioCreative II Gene Mention Recognition ([BC2GM](https://metatext.io/datasets/biocreative-ii-gene-mention-recognition-(bc2gm))) Dataset contains data where participants are asked to identify a gene mention in a sentence by giving its start and end characters. The training set consists of a set of sentences, and for each sentence a set of gene mentions (GENE annotations). ## Benchmarking ```bash label precision recall f1-score support GENE_PROTEIN 0.83 0.82 0.82 6325 micro-avg 0.83 0.82 0.82 6325 macro-avg 0.83 0.82 0.82 6325 weighted-avg 0.83 0.82 0.82 6325 ``` --- layout: model title: Fast Neural Machine Translation Model from Pohnpeian to English author: John Snow Labs name: opus_mt_pon_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pon, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `pon` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_pon_en_xx_2.7.0_2.4_1609163996704.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_pon_en_xx_2.7.0_2.4_1609163996704.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_pon_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_pon_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.pon.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_pon_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Part of Speech for Urdu author: John Snow Labs name: pos_ud_udtb date: 2020-11-30 task: Part of Speech Tagging language: ur edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [pos, ur] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_udtb_ur_2.7.0_2.4_1606733090479.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_udtb_ur_2.7.0_2.4_1606733090479.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_udtb", "ur") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate(["شمال کا بادشاہ ہونے کے علاوہ ، جان سن ایک انگریزی معالج ہے۔"]) ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_udtb", "ur") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("شمال کا بادشاہ ہونے کے علاوہ ، جان سن ایک انگریزی معالج ہے۔").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""شمال کا بادشاہ ہونے کے علاوہ ، جان سن ایک انگریزی معالج ہے۔"""] pos_df = nlu.load('ur.pos.ud_udtb').predict(text) pos_df ```
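In the Results below, each `Annotation(pos, begin, end, …)` carries character offsets into the input text that are inclusive on both ends, so the Python slice covering a token is `text[begin:end + 1]`. A quick check against the example sentence:

```python
text = "شمال کا بادشاہ ہونے کے علاوہ ، جان سن ایک انگریزی معالج ہے۔"

# Spark NLP annotation offsets are inclusive, hence the `end + 1`.
assert text[0:3 + 1] == "شمال"     # Annotation(pos, 0, 3, NOUN, ...)
assert text[5:6 + 1] == "کا"       # Annotation(pos, 5, 6, ADP, ...)
assert text[8:13 + 1] == "بادشاہ"  # Annotation(pos, 8, 13, NOUN, ...)
print("offsets check out")
```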
## Results ```bash {'pos': [Annotation(pos, 0, 3, NOUN, {'word': 'شمال'}), Annotation(pos, 5, 6, ADP, {'word': 'کا'}), Annotation(pos, 8, 13, NOUN, {'word': 'بادشاہ'}), Annotation(pos, 15, 18, VERB, {'word': 'ہونے'}), Annotation(pos, 20, 21, ADP, {'word': 'کے'}), Annotation(pos, 23, 27, ADP, {'word': 'علاوہ'}), Annotation(pos, 29, 29, PUNCT, {'word': '،'}), Annotation(pos, 31, 33, PROPN, {'word': 'جان'}), Annotation(pos, 35, 36, PROPN, {'word': 'سن'}), Annotation(pos, 38, 40, NUM, {'word': 'ایک'}), Annotation(pos, 42, 48, PROPN, {'word': 'انگریزی'}), Annotation(pos, 50, 54, ADJ, {'word': 'معالج'}), Annotation(pos, 56, 58, PUNCT, {'word': 'ہے۔'})]} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_udtb| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[tags, document]| |Output Labels:|[pos]| |Language:|ur| ## Data Source The model is trained on data obtained from [https://universaldependencies.org](https://universaldependencies.org) ## Benchmarking ```bash | | | precision | recall | f1-score | support | |---:|:-------------|:------------|:---------|-----------:|----------:| | 0 | ADJ | 0.84 | 0.85 | 0.84 | 1117 | | 1 | ADP | 0.98 | 0.99 | 0.98 | 3122 | | 2 | ADV | 0.83 | 0.65 | 0.73 | 125 | | 3 | AUX | 0.97 | 0.96 | 0.96 | 937 | | 4 | CCONJ | 0.96 | 1.00 | 0.98 | 338 | | 5 | DET | 0.87 | 0.82 | 0.84 | 237 | | 6 | NOUN | 0.89 | 0.92 | 0.9 | 3690 | | 7 | NUM | 0.97 | 0.95 | 0.96 | 267 | | 8 | PART | 0.96 | 0.88 | 0.91 | 337 | | 9 | PRON | 0.96 | 0.94 | 0.95 | 499 | | 10 | PROPN | 0.88 | 0.85 | 0.86 | 1975 | | 11 | PUNCT | 1.00 | 1.00 | 1 | 682 | | 12 | SCONJ | 0.97 | 0.99 | 0.98 | 248 | | 13 | VERB | 0.95 | 0.95 | 0.95 | 1232 | | 14 | accuracy | | | 0.93 | 14806 | | 15 | macro avg | 0.93 | 0.91 | 0.92 | 14806 | | 16 | weighted avg | 0.93 | 0.93 | 0.93 | 14806 | ​ ``` --- layout: model title: Pipeline to Detect problem, test, treatment in medical text (biobert) author: John Snow Labs name: 
ner_clinical_biobert_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_clinical_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_clinical_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_biobert_pipeline_en_3.4.1_3.0_1647875214361.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_biobert_pipeline_en_3.4.1_3.0_1647875214361.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_clinical_biobert_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` ```scala val pipeline = new PretrainedPipeline("ner_clinical_biobert_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. 
His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical_biobert.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
## Results ```bash +---------------------------------------------------+---------+ |chunks |entities | +---------------------------------------------------+---------+ |congestion |PROBLEM | |some mild problems with his breathing while feeding|PROBLEM | |any perioral cyanosis |PROBLEM | |retractions |PROBLEM | |a tactile temperature |PROBLEM | |Tylenol |TREATMENT| |some decreased p.o |PROBLEM | |His normal breast-feeding |TEST | |his respiratory congestion |PROBLEM | |more tired |PROBLEM | |fussy |PROBLEM | |albuterol treatments |TREATMENT| |His urine output |TEST | |any diarrhea |PROBLEM | |yellow colored |PROBLEM | +---------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.0 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from Palak) author: John Snow Labs name: roberta_qa_base_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-base_squad` is an English model originally trained by `Palak`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad_en_4.3.0_3.0_1674210601142.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad_en_4.3.0_3.0_1674210601142.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
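Under the hood, extractive QA models such as this one score every token of the context as a potential answer start and answer end, then pick the highest-scoring valid span. A toy illustration of that span-selection step in plain Python (this is a conceptual sketch, not the Spark NLP implementation; scores and `max_len` are made up):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) token indices maximizing combined score, with start <= end."""
    best_s, best_e, best_score = 0, 0, float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider spans of bounded length that end at or after the start
        for e in range(s, min(s + max_len, len(end_scores))):
            if s_score + end_scores[e] > best_score:
                best_s, best_e, best_score = s, e, s_score + end_scores[e]
    return best_s, best_e

# Toy scores for tokens ["My", "name", "is", "Clara", "."]: peak around "Clara"
span = best_span([0.1, 0.2, 0.3, 5.0, 0.1], [0.0, 0.1, 0.2, 4.8, 0.3])
print(span)  # (3, 3) -> the single token "Clara"
```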
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|307.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Palak/distilroberta-base_squad --- layout: model title: English asr_wav2vec_finetuned_on_cryptocurrency TFWav2Vec2ForCTC from distractedm1nd author: John Snow Labs name: pipeline_asr_wav2vec_finetuned_on_cryptocurrency date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec_finetuned_on_cryptocurrency` is an English model originally trained by distractedm1nd. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec_finetuned_on_cryptocurrency_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec_finetuned_on_cryptocurrency_en_4.2.0_3.0_1664025018794.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec_finetuned_on_cryptocurrency_en_4.2.0_3.0_1664025018794.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec_finetuned_on_cryptocurrency', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec_finetuned_on_cryptocurrency", lang = "en") val annotations = pipeline.transform(audioDF) ```
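ASR pipelines like this one consume the raw waveform as an array of floats in the `audioDF` column. A hedged sketch, using only the standard library, of how 16-bit little-endian PCM samples would be converted to the normalized floats such a row contains (the helper name is illustrative, not part of Spark NLP; 32768 is the 16-bit full-scale value):

```python
import struct

def pcm16_to_floats(raw: bytes) -> list:
    """Convert little-endian 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    n = len(raw) // 2
    samples = struct.unpack("<%dh" % n, raw[: n * 2])
    return [s / 32768.0 for s in samples]

# Two samples: full-scale negative (-32768) and half-scale positive (16384)
raw = struct.pack("<2h", -32768, 16384)
floats = pcm16_to_floats(raw)
print(floats)  # [-1.0, 0.5]
```

In practice an audio-loading library handles this conversion (and resampling to the model's expected sample rate) before the float array is placed in the DataFrame.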
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec_finetuned_on_cryptocurrency| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Part of Speech for Estonian author: John Snow Labs name: pos_ud_edt date: 2020-11-30 task: Part of Speech Tagging language: et edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [et, pos] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_edt_et_2.7.0_2.4_1606724297129.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_edt_et_2.7.0_2.4_1606724297129.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_edt", "et") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate(["Lisaks sellele, et ta on põhjamaa kuningas, on John Snow inglise arst ning narkoosi ja meditsiinilise hügieeni arendamise juht."]) ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_edt", "et") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Lisaks sellele, et ta on põhjamaa kuningas, on John Snow inglise arst ning narkoosi ja meditsiinilise hügieeni arendamise juht.").toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash {'pos': [Annotation(pos, 0, 5, NOUN, {'word': 'Lisaks'}), Annotation(pos, 7, 13, PRON, {'word': 'sellele'}), Annotation(pos, 14, 14, PUNCT, {'word': ','}), Annotation(pos, 16, 17, SCONJ, {'word': 'et'}), Annotation(pos, 19, 20, PRON, {'word': 'ta'}), Annotation(pos, 22, 23, AUX, {'word': 'on'}), Annotation(pos, 25, 32, NOUN, {'word': 'põhjamaa'}), Annotation(pos, 34, 41, NOUN, {'word': 'kuningas'}), Annotation(pos, 42, 42, PUNCT, {'word': ','}), Annotation(pos, 44, 45, AUX, {'word': 'on'}), Annotation(pos, 47, 50, PROPN, {'word': 'John'}), Annotation(pos, 52, 55, PROPN, {'word': 'Snow'}), Annotation(pos, 57, 63, ADJ, {'word': 'inglise'}), Annotation(pos, 65, 68, NOUN, {'word': 'arst'}), Annotation(pos, 70, 73, CCONJ, {'word': 'ning'}), Annotation(pos, 75, 82, NOUN, {'word': 'narkoosi'}), Annotation(pos, 84, 85, CCONJ, {'word': 'ja'}), Annotation(pos, 87, 100, NOUN, {'word': 'meditsiinilise'}), Annotation(pos, 102, 109, NOUN, {'word': 'hügieeni'}), Annotation(pos, 111, 120, NOUN, {'word': 'arendamise'}), Annotation(pos, 122, 125, NOUN, {'word': 'juht'}), Annotation(pos, 126, 126, PUNCT, {'word': '.'})]} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_edt| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[tags, document]| |Output Labels:|[pos]| |Language:|et| ## Data Source The model is trained on data obtained from [https://universaldependencies.org](https://universaldependencies.org) ## Benchmarking ```bash | | | precision | recall | f1-score | support | |---:|:-------------|:------------|:---------|-----------:|----------:| | 0 | ADJ | 0.86 | 0.82 | 0.84 | 3655 | | 1 | ADP | 0.91 | 0.92 | 0.91 | 838 | | 2 | ADV | 0.95 | 0.95 | 0.95 | 4553 | | 3 | AUX | 0.94 | 0.98 | 0.96 | 2426 | | 4 | CCONJ | 0.99 | 0.98 | 0.98 | 1820 | | 5 | DET | 0.82 | 0.74 | 0.78 | 752 | | 6 | INTJ | 0.92 | 0.68 | 0.78 | 50 | | 7 | NOUN | 0.92 | 0.95 | 0.94 | 11352 | | 8 | NUM | 0.96 | 0.90 | 0.93 | 756 | | 9 | PRON 
| 0.93 | 0.94 | 0.93 | 2350 | | 10 | PROPN | 0.96 | 0.92 | 0.94 | 2619 | | 11 | PUNCT | 1.00 | 1.00 | 1 | 6989 | | 12 | SCONJ | 0.96 | 0.99 | 0.98 | 1048 | | 13 | SYM | 1.00 | 0.72 | 0.84 | 18 | | 14 | VERB | 0.93 | 0.91 | 0.92 | 4846 | | 15 | X | 0.56 | 0.15 | 0.23 | 68 | | 16 | accuracy | | | 0.94 | 44140 | | 17 | macro avg | 0.91 | 0.85 | 0.87 | 44140 | | 18 | weighted avg | 0.94 | 0.94 | 0.94 | 44140 | ``` --- layout: model title: Smaller BERT Sentence Embeddings (L-12_H-256_A-4) author: John Snow Labs name: sent_small_bert_L12_256 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L12_256_en_2.6.0_2.4_1598350492180.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L12_256_en_2.6.0_2.4_1598350492180.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L12_256", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer'], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L12_256", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L12_256').predict(text, output_level='sentence') embeddings_df ```
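Sentence vectors like the 256-dimensional ones this model produces are typically compared with cosine similarity. A minimal, dependency-free sketch (the example vectors are made up, not real model output):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine([1.0, 0.0, 2.0], [1.0, 0.0, 2.0]))  # ~1.0 (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))            # 0.0 (orthogonal)
```

Applied to the `sentence_embeddings` column above, this gives a semantic-similarity score between any two input sentences.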
{:.h2_title} ## Results ```bash sentence en_embed_sentence_small_bert_L12_256_embeddings I hate cancer [0.9404042959213257, -0.5918057560920715, 0.07... Antibiotics aren't painkiller [0.1526544690132141, 0.050494179129600525, -0.... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L12_256| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|256| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-256_A-4/1 --- layout: model title: English RobertaForQuestionAnswering Cased model (from veronica320) author: John Snow Labs name: roberta_qa_for_event_extraction date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `QA-for-Event-Extraction` is an English model originally trained by `veronica320`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_for_event_extraction_en_4.3.0_3.0_1674208285851.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_for_event_extraction_en_4.3.0_3.0_1674208285851.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_for_event_extraction","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_for_event_extraction","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_for_event_extraction| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/veronica320/QA-for-Event-Extraction - https://aclanthology.org/2021.acl-short.42/ - https://github.com/uwnlp/qamr - https://github.com/veronica320/Zeroshot-Event-Extraction --- layout: model title: Pipeline to Detect Drug Information (Small) author: John Snow Labs name: ner_posology_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_posology](https://nlp.johnsnowlabs.com/2021/03/31/ner_posology_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_pipeline_en_4.3.0_3.2_1678870040871.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_pipeline_en_4.3.0_3.2_1678870040871.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_posology_pipeline", "en", "clinical/models") text = '''The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. 
daily, and Bactrim DS b.i.d.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_posology_pipeline", "en", "clinical/models") val text = "The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d." 
val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology_pipeline").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""") ```
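Downstream, the flat chunk list this pipeline produces (see the Results section) is often regrouped into per-drug records by attaching each DOSAGE/ROUTE/FREQUENCY/STRENGTH chunk to the most recently seen DRUG. A simple sketch of that grouping over `(label, text)` pairs (`group_by_drug` is an illustrative helper, not part of the pipeline; the pairs are taken from the results table):

```python
def group_by_drug(chunks):
    """Attach attribute chunks to the most recently seen DRUG chunk."""
    records = []
    for label, text in chunks:
        if label == "DRUG":
            records.append({"drug": text})
        elif records:  # ignore attributes seen before any drug
            records[-1].setdefault(label.lower(), text)
    return records

chunks = [("DRUG", "Fragmin"), ("DOSAGE", "5000 units"),
          ("ROUTE", "subcutaneously"), ("FREQUENCY", "daily"),
          ("DRUG", "OxyContin"), ("STRENGTH", "30 mg"),
          ("ROUTE", "p.o"), ("FREQUENCY", "q.12 h")]
records = group_by_drug(chunks)
print(records)
```

Note this proximity heuristic can mis-assign attributes in complex sentences; relation-extraction models are the more robust option for production use.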
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:---------------|--------:|------:|:------------|-------------:| | 0 | insulin | 59 | 65 | DRUG | 0.9759 | | 1 | Bactrim | 346 | 352 | DRUG | 0.9986 | | 2 | for 14 days | 354 | 364 | DURATION | 0.7948 | | 3 | Fragmin | 925 | 931 | DRUG | 0.9951 | | 4 | 5000 units | 933 | 942 | DOSAGE | 0.84355 | | 5 | subcutaneously | 944 | 957 | ROUTE | 0.9958 | | 6 | daily | 959 | 963 | FREQUENCY | 0.9792 | | 7 | Xenaderm | 966 | 973 | DRUG | 0.6773 | | 8 | topically | 985 | 993 | ROUTE | 0.5207 | | 9 | b.i.d | 995 | 999 | FREQUENCY | 0.9964 | | 10 | Lantus | 1003 | 1008 | DRUG | 0.997 | | 11 | 40 units | 1010 | 1017 | DOSAGE | 0.7432 | | 12 | subcutaneously | 1019 | 1032 | ROUTE | 0.9844 | | 13 | at bedtime | 1034 | 1043 | FREQUENCY | 0.95525 | | 14 | OxyContin | 1046 | 1054 | DRUG | 0.9965 | | 15 | 30 mg | 1056 | 1060 | STRENGTH | 0.72475 | | 16 | p.o | 1062 | 1064 | ROUTE | 0.9981 | | 17 | q.12 h | 1067 | 1072 | FREQUENCY | 0.9672 | | 18 | folic acid | 1076 | 1085 | DRUG | 0.8448 | | 19 | 1 mg | 1087 | 1090 | STRENGTH | 0.9154 | | 20 | daily | 1092 | 1096 | FREQUENCY | 0.9999 | | 21 | levothyroxine | 1099 | 1111 | DRUG | 0.9952 | | 22 | 0.1 mg | 1113 | 1118 | STRENGTH | 0.7658 | | 23 | p.o | 1120 | 1122 | ROUTE | 0.9982 | | 24 | daily | 1125 | 1129 | FREQUENCY | 0.9983 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Arabic BertForMaskedLM Cased model (from UBC-NLP) author: John Snow Labs name: bert_embeddings_marbert date: 2022-12-02 tags: [ar, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ar edition: Spark NLP 4.2.4 
spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `MARBERT` is an Arabic model originally trained by `UBC-NLP`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_marbert_ar_4.2.4_3.0_1670015350392.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_marbert_ar_4.2.4_3.0_1670015350392.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_marbert","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_marbert","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_marbert| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|611.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/UBC-NLP/MARBERT - https://aclanthology.org/2021.acl-long.551.pdf - https://github.com/UBC-NLP/LMBERT - https://github.com/UBC-NLP/marbert - https://doi.org/10.14288/SOCKEYE - https://www.tensorflow.org/tfrc --- layout: model title: Match Datetime in Texts author: John Snow Labs name: match_datetime date: 2022-07-07 tags: [en, open_source] task: Text Classification language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description DateMatcher based on yyyy/MM/dd ## Predicted Entities {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/match_datetime_en_4.0.0_3.0_1657188140219.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/match_datetime_en_4.0.0_3.0_1657188140219.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
input_list = ["David visited the restaurant yesterday with his family. He also visited and the day before, but at that time he was alone. David again visited today with his colleagues. He and his friends really liked the food and hoped to visit again tomorrow."]

pipeline_local = PretrainedPipeline('match_datetime')

tres = pipeline_local.fullAnnotate(input_list)[0]
for dte in tres['date']:
    sent = tres['sentence'][int(dte.metadata['sentence'])]
    print(f'text/chunk {sent.result[dte.begin:dte.end+1]} | mapped_date: {dte.result}')
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP

SparkNLP.version()

val testData = spark.createDataFrame(Seq(
  (1, "David visited the restaurant yesterday with his family. He also visited and the day before, but at that time he was alone. David again visited today with his colleagues. He and his friends really liked the food and hoped to visit again tomorrow."))).toDF("id", "text")

val pipeline = PretrainedPipeline("match_datetime", lang="en")

val annotation = pipeline.transform(testData)

annotation.show()
```
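Under the hood, the pipeline's `MultiDateMatcher` resolves relative expressions such as "yesterday" and "tomorrow" against an anchor date and renders them in the configured `yyyy/MM/dd` format. A plain-Python sketch of that arithmetic (the anchor date and offset table here are illustrative assumptions, not the annotator's internals):

```python
from datetime import date, timedelta

def resolve_relative(expression: str, anchor: date) -> str:
    """Map a relative date expression to a yyyy/MM/dd string against an anchor date."""
    offsets = {"day before": -1, "yesterday": -1, "today": 0, "tomorrow": 1}
    resolved = anchor + timedelta(days=offsets[expression])
    return resolved.strftime("%Y/%m/%d")

anchor = date(2022, 1, 3)  # hypothetical run date matching the Results section below
for expr in ["yesterday", "day before", "today", "tomorrow"]:
    print(f"text/chunk {expr} | mapped_date: {resolve_relative(expr, anchor)}")
```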
## Results ```bash text/chunk yesterday | mapped_date: 2022/01/02 text/chunk day before | mapped_date: 2022/01/02 text/chunk today | mapped_date: 2022/01/03 text/chunk tomorrow | mapped_date: 2022/01/04 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|match_datetime| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|9.9 KB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - MultiDateMatcher --- layout: model title: Mapping RxNorm and RxNorm Extension Codes with Corresponding Drug Brand Names author: John Snow Labs name: rxnorm_drug_brandname_mapper date: 2023-02-09 tags: [chunk_mappig, rxnorm, drug_brand_name, rxnorm_extension, en, clinical, licensed] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 4.3.0 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps RxNorm and RxNorm Extension codes with their corresponding drug brand names. It returns 2 types of brand names for the corresponding RxNorm or RxNorm Extension code. ## Predicted Entities `rxnorm_brandname`, `rxnorm_extension_brandname` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_drug_brandname_mapper_en_4.3.0_3.0_1675966478332.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_drug_brandname_mapper_en_4.3.0_3.0_1675966478332.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("chunk")

sbert_embedder = BertSentenceEmbeddings\
    .pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["chunk"])\
    .setOutputCol("sbert_embeddings")

rxnorm_resolver = SentenceEntityResolverModel\
    .pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models")\
    .setInputCols(["chunk", "sbert_embeddings"])\
    .setOutputCol("rxnorm_code")\
    .setDistanceFunction("EUCLIDEAN")

resolver2chunk = Resolution2Chunk()\
    .setInputCols(["rxnorm_code"])\
    .setOutputCol("rxnorm_chunk")

chunkerMapper = ChunkMapperModel.pretrained("rxnorm_drug_brandname_mapper", "en", "clinical/models")\
    .setInputCols(["rxnorm_chunk"])\
    .setOutputCol("mappings")\
    .setRels(["rxnorm_brandname", "rxnorm_extension_brandname"])

pipeline = Pipeline(stages=[
    documentAssembler,
    sbert_embedder,
    rxnorm_resolver,
    resolver2chunk,
    chunkerMapper
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

light_pipeline = LightPipeline(model)

result = light_pipeline.fullAnnotate(["metformin", "advil"])
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("chunk")

val sbert_embedder = BertSentenceEmbeddings
    .pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
    .setInputCols(Array("chunk"))
    .setOutputCol("sbert_embeddings")

val rxnorm_resolver = SentenceEntityResolverModel
    .pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models")
    .setInputCols(Array("chunk", "sbert_embeddings"))
    .setOutputCol("rxnorm_code")
    .setDistanceFunction("EUCLIDEAN")

val resolver2chunk = new Resolution2Chunk()
    .setInputCols(Array("rxnorm_code"))
    .setOutputCol("rxnorm_chunk")

val chunkerMapper = ChunkMapperModel.pretrained("rxnorm_drug_brandname_mapper", "en", "clinical/models")
    .setInputCols(Array("rxnorm_chunk"))
    .setOutputCol("mappings")
    .setRels(Array("rxnorm_brandname", "rxnorm_extension_brandname"))

val
pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sbert_embedder,
    rxnorm_resolver,
    resolver2chunk,
    chunkerMapper
))

val data = Seq("metformin", "advil").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
## Results

```bash
+--------------+-------------+--------------------------------------------------+--------------------------+
|     drug_name|rxnorm_result|                                    mapping_result|                 relation |
+--------------+-------------+--------------------------------------------------+--------------------------+
|     metformin|         6809|Actoplus Met (metformin):::Avandamet (metformin...|          rxnorm_brandname|
|     metformin|         6809|A FORMIN (metformin):::ABERIN MAX (metformin)::...|rxnorm_extension_brandname|
|         advil|       153010|                                     Advil (Advil)|          rxnorm_brandname|
|         advil|       153010|                                              NONE|rxnorm_extension_brandname|
+--------------+-------------+--------------------------------------------------+--------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|rxnorm_drug_brandname_mapper|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[rxnorm_chunk]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|4.0 MB|

---
layout: model
title: English BertForQuestionAnswering Base Cased model (from nntadotzip)
author: John Snow Labs
name: bert_qa_base_cased_iuchatbot_ontologydts_berttokenizer_12april2022
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased-IUChatbot-ontologyDts-bertBaseCased-bertTokenizer-12April2022` is an English model originally trained by `nntadotzip`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_cased_iuchatbot_ontologydts_berttokenizer_12april2022_en_4.0.0_3.0_1657182722207.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_cased_iuchatbot_ontologydts_berttokenizer_12april2022_en_4.0.0_3.0_1657182722207.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_cased_iuchatbot_ontologydts_berttokenizer_12april2022","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_cased_iuchatbot_ontologydts_berttokenizer_12april2022","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
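Extractive QA models like this one answer by selecting the context span whose start and end scores are jointly highest, subject to start ≤ end. A minimal, self-contained sketch of that span-selection step over hypothetical token scores (the tokens and logits below are illustrative, not real model output):

```python
def best_span(start_scores, end_scores, max_len=30):
    """Pick (start, end) maximizing start_scores[s] + end_scores[e] with s <= e."""
    best, best_score = (0, 0), float("-inf")
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            if ss + end_scores[e] > best_score:
                best_score = ss + end_scores[e]
                best = (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 3.1, 0.0, 0.0, 0.1, 0.0, 0.4, 0.0]  # hypothetical start logits
end   = [0.0, 0.1, 0.0, 2.8, 0.1, 0.0, 0.0, 0.1, 0.5, 0.0]  # hypothetical end logits
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # the selected answer span
```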
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_base_cased_iuchatbot_ontologydts_berttokenizer_12april2022|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/nntadotzip/bert-base-cased-IUChatbot-ontologyDts-bertBaseCased-bertTokenizer-12April2022

---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from google)
author: John Snow Labs
name: t5_efficient_base_nh8
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-nh8` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nh8_en_4.3.0_3.0_1675113470480.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nh8_en_4.3.0_3.0_1675113470480.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_base_nh8","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_base_nh8","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_base_nh8|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|420.8 MB|

## References

- https://huggingface.co/google/t5-efficient-base-nh8
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145

---
layout: model
title: English image_classifier_vit_cifar_10_2 ViTForImageClassification from alfredcs
author: John Snow Labs
name: image_classifier_vit_cifar_10_2
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_cifar_10_2` is an English model originally trained by alfredcs.

## Predicted Entities

`deer`, `bird`, `dog`, `horse`, `automobile`, `truck`, `frog`, `ship`, `airplane`, `cat`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_cifar_10_2_en_4.1.0_3.0_1660169116177.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_cifar_10_2_en_4.1.0_3.0_1660169116177.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_cifar_10_2", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_cifar_10_2", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
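The classifier head produces one logit per CIFAR-10 label, and the predicted class in the `class` column is the argmax of the softmax over those logits. A self-contained sketch of that final step (the logit values are hypothetical):

```python
import math

LABELS = ["airplane", "automobile", "bird", "cat", "deer",
          "dog", "frog", "horse", "ship", "truck"]

def softmax(logits):
    exps = [math.exp(x - max(logits)) for x in logits]  # shift by max for stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.2, 0.1, 0.3, 4.0, 0.5, 1.2, 0.1, 0.4, 0.2, 0.3]  # hypothetical model output
probs = softmax(logits)
predicted = LABELS[probs.index(max(probs))]
print(predicted)  # the highest-probability label
```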
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_cifar_10_2|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|

---
layout: model
title: English asr_final_wav2vec2_urdu_asr_project TFWav2Vec2ForCTC from Raffay
author: John Snow Labs
name: asr_final_wav2vec2_urdu_asr_project
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_final_wav2vec2_urdu_asr_project` is an English model originally trained by Raffay.

NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_final_wav2vec2_urdu_asr_project_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_final_wav2vec2_urdu_asr_project_en_4.2.0_3.0_1664020167074.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_final_wav2vec2_urdu_asr_project_en_4.2.0_3.0_1664020167074.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_final_wav2vec2_urdu_asr_project", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_final_wav2vec2_urdu_asr_project", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
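A CTC model like Wav2Vec2ForCTC emits one label distribution per audio frame; greedy CTC decoding takes the argmax label per frame, collapses consecutive repeats, and drops the blank token to produce the `text` output. A minimal sketch of that decode step (the blank symbol and frame labels below are illustrative):

```python
BLANK = "_"  # stand-in for the CTC blank token

def ctc_greedy_decode(frame_labels):
    """Collapse consecutive repeated labels, then remove CTC blanks."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

# hypothetical per-frame argmax labels for a short utterance
frames = ["h", "h", "_", "e", "l", "l", "_", "l", "o", "o"]
print(ctc_greedy_decode(frames))  # "hello"
```

Note how the blank between the two `l` runs is what lets a doubled letter survive the repeat-collapsing step.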
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_final_wav2vec2_urdu_asr_project|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|349.3 MB|

---
layout: model
title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_10
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-128-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_10_en_4.0.0_3.0_1657184162845.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_10_en_4.0.0_3.0_1657184162845.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_10","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_10","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_10|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-128-finetuned-squad-seed-10

---
layout: model
title: Legal Disability Clause Binary Classifier
author: John Snow Labs
name: legclf_disability_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `disability` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at sentence level.

If you have long legal documents and you want to look for clauses, we recommend you split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial linked above).
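The paragraph-splitting technique mentioned above can be as simple as breaking the document on blank lines before feeding each chunk to the classifier; a plain-Python sketch (the sample clauses are illustrative):

```python
def split_paragraphs(text: str):
    """Split a document on blank lines, dropping empty chunks."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

doc = ("Clause 1. Disability provisions...\n\n"
       "Clause 2. Governing law...\n\n"
       "Clause 3. Severability...")
for clause in split_paragraphs(doc):
    print(clause)  # each chunk would become one row of the clause_text column
```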
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `disability`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_disability_clause_en_1.0.0_3.2_1660122344513.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_disability_clause_en_1.0.0_3.2_1660122344513.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_disability_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+------------+
|      result|
+------------+
|[disability]|
|     [other]|
|     [other]|
|[disability]|
+------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_disability_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
  disability       0.93    0.96      0.94       26
       other       0.99    0.98      0.98       96
    accuracy          -       -      0.98      122
   macro-avg       0.96    0.97      0.96      122
weighted-avg       0.98    0.98      0.98      122
```

---
layout: model
title: Marathi XLMRoBerta Embeddings (from l3cube-pune)
author: John Snow Labs
name: xlmroberta_embeddings_marathi_roberta
date: 2022-05-13
tags: [mr, open_source, xlm_roberta, embeddings]
task: Embeddings
language: mr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: XlmRoBertaEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained XLMRoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `marathi-roberta` is a Marathi model originally trained by `l3cube-pune`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_marathi_roberta_mr_3.4.4_3.0_1652440151547.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_marathi_roberta_mr_3.4.4_3.0_1652440151547.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_marathi_roberta","mr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["मला स्पार्क एनएलपी आवडते"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_marathi_roberta","mr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("मला स्पार्क एनएलपी आवडते").toDF("text") val result = pipeline.fit(data).transform(data) ```
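A common way to compare the token vectors this model writes to the `embeddings` column is cosine similarity; a dependency-free sketch over two short stand-in vectors (real MARBERT-style token embeddings are much higher-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

v1 = [0.2, 0.1, 0.9]  # illustrative 3-d stand-ins for real token embeddings
v2 = [0.1, 0.0, 1.0]
print(round(cosine_similarity(v1, v2), 4))
```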
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlmroberta_embeddings_marathi_roberta|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|mr|
|Size:|1.0 GB|
|Case sensitive:|true|

## References

- https://huggingface.co/l3cube-pune/marathi-roberta
- https://github.com/l3cube-pune/MarathiNLP
- https://arxiv.org/abs/2202.01159

---
layout: model
title: Legal Miscellaneous Industries Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_miscellaneous_industries_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, miscellaneous_industries, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_miscellaneous_industries_bert model, a Bert Sentence Embeddings Document Classifier, determines whether the document belongs to the class Miscellaneous_Industries or not (binary classification) according to EuroVoc labels.

## Predicted Entities

`Miscellaneous_Industries`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_miscellaneous_industries_bert_en_1.0.0_3.0_1678111876363.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_miscellaneous_industries_bert_en_1.0.0_3.0_1678111876363.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_miscellaneous_industries_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+--------------------------+
|                    result|
+--------------------------+
|[Miscellaneous_Industries]|
|                   [Other]|
|                   [Other]|
|[Miscellaneous_Industries]|
+--------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_miscellaneous_industries_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.7 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
                   label  precision  recall  f1-score  support
Miscellaneous_Industries       0.82    0.79      0.81       39
                   Other       0.79    0.81      0.80       37
                accuracy          -       -      0.80       76
               macro-avg       0.80    0.80      0.80       76
            weighted-avg       0.80    0.80      0.80       76
```

---
layout: model
title: Pipeline to Detect PHI for Deidentification (Augmented)
author: John Snow Labs
name: ner_deid_augmented_pipeline
date: 2023-03-13
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [ner_deid_augmented](https://nlp.johnsnowlabs.com/2021/03/31/ner_deid_augmented_en.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_augmented_pipeline_en_4.3.0_3.2_1678737755264.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_augmented_pipeline_en_4.3.0_3.2_1678737755264.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_augmented_pipeline", "en", "clinical/models") text = '''HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_augmented_pipeline", "en", "clinical/models") val text = "HISTORY OF PRESENT ILLNESS: Mr. 
Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.deid.ner_augmented.pipeline").predict("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. 
On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.""") ```
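The `fullAnnotate` call returns one result per input text, and each `ner_chunk` annotation carries `result`, `begin`, `end`, and `metadata` fields. The sketch below flattens such annotations into rows like the results table that follows; note that the `Annotation` namedtuple and the sample values are hand-made stand-ins for illustration, not the pipeline's actual objects.

```python
from collections import namedtuple

# Stand-in for a Spark NLP annotation: only the fields used below.
Annotation = namedtuple("Annotation", ["result", "begin", "end", "metadata"])

def chunks_to_rows(chunk_annotations):
    """Flatten NER chunk annotations into (chunk, begin, end, label, confidence) rows."""
    return [
        (a.result, a.begin, a.end,
         a.metadata.get("entity"), float(a.metadata.get("confidence", 0.0)))
        for a in chunk_annotations
    ]

# Hand-made sample mimicking two chunks from the de-identification output.
sample = [
    Annotation("Smith", 32, 36, {"entity": "NAME", "confidence": "0.9998"}),
    Annotation("VA Hospital", 184, 194, {"entity": "LOCATION", "confidence": "0.98755"}),
]
rows = chunks_to_rows(sample)
print(rows[0])  # ('Smith', 32, 36, 'NAME', 0.9998)
```

The same rows can be fed into a pandas DataFrame to reproduce the tabular view shown under Results.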
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:----------------|--------:|------:|:------------|-------------:| | 0 | Smith | 32 | 36 | NAME | 0.9998 | | 1 | VA Hospital | 184 | 194 | LOCATION | 0.98755 | | 2 | Day Hospital | 258 | 269 | LOCATION | 0.9907 | | 3 | 02/04/2003 | 341 | 350 | DATE | 0.8956 | | 4 | Smith | 374 | 378 | NAME | 0.9997 | | 5 | Day Hospital | 397 | 408 | LOCATION | 0.9539 | | 6 | Smith | 782 | 786 | NAME | 0.9999 | | 7 | Smith | 1131 | 1135 | NAME | 0.9996 | | 8 | 7 Ardmore Tower | 1153 | 1167 | LOCATION | 0.784167 | | 9 | Hart | 1221 | 1224 | NAME | 0.9996 | | 10 | Smith | 1231 | 1235 | NAME | 0.9998 | | 11 | 02/07/2003 | 1329 | 1338 | DATE | 0.9997 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_augmented_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_6_h_512 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-6_H-512` is a Chinese model originally trained by `uer`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_512_zh_4.2.4_3.0_1670325991425.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_512_zh_4.2.4_3.0_1670325991425.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_512","zh") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_512","zh")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_6_h_512|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|113.8 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/uer/chinese_roberta_L-6_H-512
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://cloud.tencent.com/

---
layout: model
title: English RobertaForQuestionAnswering (from stevemobs)
author: John Snow Labs
name: roberta_qa_quales_iberlef
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `quales-iberlef` is an English model originally trained by `stevemobs`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_quales_iberlef_en_4.0.0_3.0_1655729537297.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_quales_iberlef_en_4.0.0_3.0_1655729537297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_quales_iberlef","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_quales_iberlef","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.by_stevemobs").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_quales_iberlef| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/stevemobs/quales-iberlef --- layout: model title: Norwegian Lemmatizer author: John Snow Labs name: lemma date: 2020-05-05 18:56:00 +0800 task: Lemmatization language: nb edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [lemmatizer, nb] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_nb_2.5.0_2.4_1588693886432.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_nb_2.5.0_2.4_1588693886432.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "nb") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Annet enn å være kongen i nord, er John Snow en engelsk lege og en leder innen utvikling av anestesi og medisinsk hygiene.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "nb") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Annet enn å være kongen i nord, er John Snow en engelsk lege og en leder innen utvikling av anestesi og medisinsk hygiene.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Annet enn å være kongen i nord, er John Snow en engelsk lege og en leder innen utvikling av anestesi og medisinsk hygiene."""] lemma_df = nlu.load('nb.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
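At its core, the lemmatizer behaves like a context-aware lookup from surface form to root. The toy sketch below illustrates only the dictionary part with a hypothetical hand-made form table (the real model's lookup is far larger and uses surrounding context to disambiguate); it is a conceptual illustration, not the model's actual mechanism.

```python
# Hypothetical miniature form -> lemma table with Norwegian examples.
FORM_TO_LEMMA = {
    "kongen": "konge",  # "the king" -> "king"
    "konger": "konge",  # "kings"    -> "king"
    "er": "være",       # present tense of "to be" -> infinitive
    "være": "være",
}

def lemmatize(tokens):
    """Map each token to its root form; unknown forms pass through unchanged."""
    return [FORM_TO_LEMMA.get(token.lower(), token) for token in tokens]

print(lemmatize(["kongen", "er"]))  # ['konge', 'være']
```

Different inflections ("kongen", "konger") collapse to the same key, which is exactly what lets downstream steps treat them as one word.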
{:.h2_title}
## Results

```bash
[Row(annotatorType='token', begin=0, end=4, result='Annet', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=6, end=8, result='enn', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=10, end=10, result='i', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=12, end=15, result='være', metadata={'sentence': '0'}, embeddings=[]),
Row(annotatorType='token', begin=17, end=22, result='konge', metadata={'sentence': '0'}, embeddings=[]),
...]
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|lemma|
|Type:|lemmatizer|
|Compatibility:|Spark NLP 2.5.0+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[lemma]|
|Language:|nb|
|Case sensitive:|false|
|License:|Open Source|

{:.h2_title}
## Data Source

The model is imported from [https://universaldependencies.org](https://universaldependencies.org)

---
layout: model
title: Translate English to Manx Pipeline
author: John Snow Labs
name: translate_en_gv
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, gv, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences.
The use of an accelerator such as GPU is recommended. - source languages: `en` - target languages: `gv` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_gv_xx_2.7.0_2.4_1609691549866.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_gv_xx_2.7.0_2.4_1609691549866.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_gv", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_gv", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.gv').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_en_gv|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Pipeline to Map ICD10-CM Codes to Their Corresponding ICD-9-CM Codes
author: John Snow Labs
name: icd10_icd9_mapping
date: 2022-09-30
tags: [icd10cm, icd9cm, licensed, en, clinical, pipeline, chunk_mapping]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Healthcare NLP 4.1.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the icd10_icd9_mapper model.

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10_icd9_mapping_en_4.1.0_3.0_1664539288823.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10_icd9_mapping_en_4.1.0_3.0_1664539288823.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("icd10_icd9_mapping", "en", "clinical/models")

result = pipeline.fullAnnotate("Z833 A0100 A000")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = new PretrainedPipeline("icd10_icd9_mapping", "en", "clinical/models")

val result = pipeline.fullAnnotate("Z833 A0100 A000")
```

{:.nlu-block}
```python
import nlu

nlu.load("en.icd10_icd9.mapping").predict("""Z833 A0100 A000""")
```
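The mapper returns the ICD-9 codes positionally aligned with the input ICD-10 codes (see the Results below), so the two code lists can be paired directly. A small pure-Python sketch using the codes from this example; `zip_code_mapping` is a hypothetical post-processing helper, not part of the pipeline API:

```python
def zip_code_mapping(icd10_str: str, icd9_str: str) -> dict:
    """Pair each ICD-10 code with its positionally matching ICD-9 code."""
    icd10_codes = icd10_str.split()
    icd9_codes = icd9_str.split()
    if len(icd10_codes) != len(icd9_codes):
        raise ValueError("code lists must align one-to-one")
    return dict(zip(icd10_codes, icd9_codes))

# Codes from the example above, paired with the mapped output shown in Results.
mapping = zip_code_mapping("Z833 A0100 A000", "V180 0020 0010")
print(mapping)  # {'Z833': 'V180', 'A0100': '0020', 'A000': '0010'}
```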
## Results

```bash
|    | icd10_code          | icd9_code          |
|---:|:--------------------|:-------------------|
|  0 | Z833 | A0100 | A000 | V180 | 0020 | 0010 |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|icd10_icd9_mapping|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.1.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|589.5 KB|

## Included Models

- DocumentAssembler
- TokenizerModel
- ChunkMapperModel

---
layout: model
title: German NER for Laws (Bert, Large)
author: John Snow Labs
name: legner_bert_large_courts
date: 2022-10-02
tags: [de, legal, ner, laws, court, licensed]
task: Named Entity Recognition
language: de
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model can be used to detect legal entities in German text, predicting up to 19 different labels:

```
| tag | meaning
-----------------
| AN  | Anwalt
| EUN | Europäische Norm
| GS  | Gesetz
| GRT | Gericht
| INN | Institution
| LD  | Land
| LDS | Landschaft
| LIT | Literatur
| MRK | Marke
| ORG | Organisation
| PER | Person
| RR  | Richter
| RS  | Rechtssprechung
| ST  | Stadt
| STR | Straße
| UN  | Unternehmen
| VO  | Verordnung
| VS  | Vorschrift
| VT  | Vertrag
```

This German Named Entity Recognition model was trained using a large German BERT model, fine-tuned on a Court Decisions (2017-2018) dataset (see the `References` section). You can also find a lighter Deep Learning (non-transformer based) version in our Models Hub (`legner_courts`) and a BERT base version (`legner_bert_base_courts`).
## Predicted Entities `STR`, `LIT`, `PER`, `EUN`, `VT`, `MRK`, `INN`, `UN`, `RS`, `ORG`, `GS`, `VS`, `LDS`, `GRT`, `VO`, `RR`, `LD`, `AN`, `ST` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_LEGAL_DE/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_bert_large_courts_de_1.0.0_3.0_1664706406959.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_bert_large_courts_de_1.0.0_3.0_1664706406959.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
import pandas as pd
import pyspark.sql.functions as F

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

ner_model = nlp.BertForTokenClassification.pretrained("legner_bert_large_courts", "de", "legal/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("ner")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(512)

ner_converter = nlp.NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    ner_model,
    ner_converter
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

text_list = ["""Der Europäische Gerichtshof für Menschenrechte (EGMR) gibt dabei allerdings ebenso wenig wie das Bundesverfassungsgericht feste Fristen vor, sondern stellt auf die jeweiligen Umstände des Einzelfalls ab.""",
             """Formelle Rechtskraft ( § 705 ZPO ) trat mit Verkündung des Revisionsurteils am 15. Dezember 2016 ein (vgl. Zöller / Seibel ZPO 32. Aufl. § 705 Rn. 8) ."""]

df = spark.createDataFrame(pd.DataFrame({"text": text_list}))

result = model.transform(df)

result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
      .select(F.expr("cols['0']").alias("ner_chunk"),
              F.expr("cols['1']['entity']").alias("label")).show(truncate = False)
```
## Results ```bash +---+---------------------------------------------------+----------+ | # | Chunks | Entities | +---+---------------------------------------------------+----------+ | 0 | Deutschlands | LD | +---+---------------------------------------------------+----------+ | 1 | Rheins | LDS | +---+---------------------------------------------------+----------+ | 2 | Oberfranken | LDS | +---+---------------------------------------------------+----------+ | 3 | Unterfranken | LDS | +---+---------------------------------------------------+----------+ | 4 | Südhessen | LDS | +---+---------------------------------------------------+----------+ | 5 | Mainz | ST | +---+---------------------------------------------------+----------+ | 6 | Rhein | LDS | +---+---------------------------------------------------+----------+ | 7 | Klein , in : Maunz / Schmidt-Bleibtreu / Klein... | LIT | +---+---------------------------------------------------+----------+ | 8 | Richtlinien zur Bewertung des Grundvermögens –... | VS | +---+---------------------------------------------------+----------+ | 9 | I September 1966 (BStBl I, S.890) | VS | +---+---------------------------------------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_bert_large_courts| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References The dataset used to train this model is taken from Leitner, et.al (2019) Leitner, E., Rehm, G., and Moreno-Schneider, J. (2019). Fine-grained Named Entity Recognition in Legal Documents. In Maribel Acosta, et al., editors, Semantic Systems. The Power of AI and Knowledge Graphs. Proceedings of the 15th International Conference (SEMANTiCS2019), number 11702 in Lecture Notes in Computer Science, pages 272–287, Karlsruhe, Germany, 9. Springer. 
10/11 September 2019.

Source of the annotated text: Court decisions from 2017 and 2018 were selected for the dataset, published online by the Federal Ministry of Justice and Consumer Protection. The documents originate from seven federal courts: Federal Labour Court (BAG), Federal Fiscal Court (BFH), Federal Court of Justice (BGH), Federal Patent Court (BPatG), Federal Social Court (BSG), Federal Constitutional Court (BVerfG) and Federal Administrative Court (BVerwG).

## Benchmarking

```bash
label          prec       rec        f1
Macro-average  0.9361195  0.9294152  0.9368145
Micro-average  0.9856711  0.9857456  0.9851656
```

---
layout: model
title: Translate English to Czech Pipeline
author: John Snow Labs
name: translate_en_cs
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, cs, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as GPU is recommended.
- source languages: `en` - target languages: `cs` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_cs_xx_2.7.0_2.4_1609691394506.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_cs_xx_2.7.0_2.4_1609691394506.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_cs", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_cs", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.cs').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_en_cs|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English BertForQuestionAnswering model (from krinal214)
author: John Snow Labs
name: bert_qa_zero_shot
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `zero_shot` is an English model originally trained by `krinal214`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_zero_shot_en_4.0.0_3.0_1654192643172.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_zero_shot_en_4.0.0_3.0_1654192643172.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_zero_shot","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_zero_shot","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.zero_shot.by_krinal214").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_zero_shot| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/krinal214/zero_shot --- layout: model title: English T5ForConditionalGeneration Tiny Cased model (from google) author: John Snow Labs name: t5_efficient_tiny_nh8 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-nh8` is a English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nh8_en_4.3.0_3.0_1675123720663.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nh8_en_4.3.0_3.0_1675123720663.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_tiny_nh8","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_tiny_nh8","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_tiny_nh8| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|46.1 MB| ## References - https://huggingface.co/google/t5-efficient-tiny-nh8 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from Moussab) author: John Snow Labs name: roberta_qa_deepset_base_squad2_orkg_which_1e_4 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deepset-roberta-base-squad2-orkg-which-1e-4` is a English model originally trained by `Moussab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_which_1e_4_en_4.3.0_3.0_1674209779299.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_which_1e_4_en_4.3.0_3.0_1674209779299.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_which_1e_4","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_which_1e_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_deepset_base_squad2_orkg_which_1e_4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Moussab/deepset-roberta-base-squad2-orkg-which-1e-4 --- layout: model title: Recognize Entities DL Pipeline for English author: John Snow Labs name: recognize_entities_dl date: 2021-03-23 tags: [open_source, english, recognize_entities_dl, pipeline, en] supported: true task: [Named Entity Recognition, Lemmatization] language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `recognize_entities_dl` is a pretrained pipeline that performs basic text processing steps and recognizes entities. It covers most of the common text processing tasks on your dataframe. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/recognize_entities_dl_en_3.0.0_3.0_1616473647254.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/recognize_entities_dl_en_3.0.0_3.0_1616473647254.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('recognize_entities_dl', lang = 'en') annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("recognize_entities_dl", lang = "en") val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs ! "] result_df = nlu.load('en.ner').predict(text) result_df ```
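`fullAnnotate` returns one dictionary per input text, keyed by the pipeline's output columns (`document`, `sentence`, `token`, `embeddings`, `ner`, `entities`), with each value a list of annotation objects. A small post-processing sketch — it assumes only that each annotation exposes a `.result` attribute, as Spark NLP `Annotation` objects do:

```python
def entity_strings(annotation_dict, key="entities"):
    """Collect the plain-text results from one fullAnnotate() dictionary.

    annotation_dict: one element of the list returned by fullAnnotate();
    key: the output column to read (defaults to the entity chunks).
    """
    return [ann.result for ann in annotation_dict.get(key, [])]
```

Calling `entity_strings(annotations)` on the dictionary above yields the recognized entity strings as a plain Python list.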
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:-----------------------------|:---------------------------------------------------|:------------------------------| | 0 | ['Hello from John Snow Labs ! '] | ['Hello from John Snow Labs !'] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | [[0.2668800055980682,.,...]] | ['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O'] | ['Hello from John Snow Labs'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|recognize_entities_dl| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| --- layout: model title: Basque Lemmatizer author: John Snow Labs name: lemma date: 2020-07-29 23:34:00 +0800 task: Lemmatization language: eu edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [lemmatizer, eu] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb#scrollTo=bbzEH9u7tdxR){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_eu_2.5.5_2.4_1596054393659.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_eu_2.5.5_2.4_1596054393659.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "eu") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Iparraldeko erregea izateaz gain, mediku ingelesa eta anestesia eta higiene medikoa garatzen duen liderra da John Snow.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "eu") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Iparraldeko erregea izateaz gain, mediku ingelesa eta anestesia eta higiene medikoa garatzen duen liderra da John Snow.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Iparraldeko erregea izateaz gain, mediku ingelesa eta anestesia eta higiene medikoa garatzen duen liderra da John Snow."""] lemma_df = nlu.load('eu.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
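Each annotation in the `lemma` output column stores the root form in its `result` field, so joining those fields reconstructs a lemmatized version of the input. A minimal post-processing sketch, assuming only that the annotations expose `.result`:

```python
def lemmatized_text(lemma_annotations):
    """Join lemma annotation results into one whitespace-separated string."""
    return " ".join(ann.result for ann in lemma_annotations)
```

With the `fullAnnotate` output above, `lemmatized_text(results[0]['lemma'])` would return the sentence with each token replaced by its root form.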
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=10, result='iparralde', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=12, end=18, result='errege', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=20, end=26, result='izan', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=28, end=31, result='gain', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=32, end=32, result=',', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|eu| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Named Entity Recognition - BERT Large (OntoNotes) author: John Snow Labs name: onto_bert_large_cased date: 2020-12-05 task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [ner, open_source, en] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Onto is a Named Entity Recognition (or NER) model trained on OntoNotes 5.0. It can extract up to 18 entities such as people, places, organizations, money, time, date, etc. This model uses the pretrained `bert_large_cased` embeddings model from the `BertEmbeddings` annotator as an input. ## Predicted Entities `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_bert_large_cased_en_2.7.0_2.4_1607198127113.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_bert_large_cased_en_2.7.0_2.4_1607198127113.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("bert_large_cased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") ner_onto = NerDLModel.pretrained("onto_bert_large_cased", "en") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."]], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("bert_large_cased", "en") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_onto = NerDLModel.pretrained("onto_bert_large_cased", "en") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter)) val data = Seq("William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."""] ner_df = nlu.load('en.ner.onto.bert.cased_large').predict(text, output_level='chunk') ner_df[["entities", "entities_class"]] ```
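After `NerConverter`, each `ner_chunk` annotation carries the chunk text in `result` and its label under `metadata['entity']`; pairing the two gives the chunk/label view shown in model-card result tables. A sketch assuming the standard `Annotation` attributes `.result` and `.metadata`:

```python
def chunks_with_labels(chunk_annotations):
    """Return (chunk_text, entity_label) pairs from ner_chunk annotations."""
    return [(ann.result, ann.metadata.get("entity")) for ann in chunk_annotations]
```

Applied to the `fullAnnotate` output of the pipeline above, this produces pairs such as `("William Henry Gates III", "PERSON")`.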
{:.h2_title} ## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |William Henry Gates III|PERSON | |October 28, 1955 |DATE | |American |NORP | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PERSON | |May 2014 |DATE | |one |CARDINAL | |the 1970s |DATE | |1980s |DATE | |Seattle |GPE | |Washington |GPE | |Gates |PERSON | |Microsoft |PERSON | |Paul Allen |PERSON | |1975 |DATE | |Albuquerque |GPE | |New Mexico |GPE | |Gates |ORG | |January 2000 |DATE | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_bert_large_cased| |Type:|ner| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source The model is trained based on data from [OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) ## Benchmarking ```bash Micro-average: prec: 0.8947816, rec: 0.9059915, f1: 0.90035164 CoNLL Eval (test dataset): processed 152728 tokens with 11257 phrases; found: 11351 phrases; correct: 10044. 
accuracy: 98.02%; 10044 11257 11351 precision: 88.49%; recall: 89.22%; FB1: 88.85 CARDINAL: 793 935 953 precision: 83.21%; recall: 84.81%; FB1: 84.00 953 DATE: 1420 1602 1697 precision: 83.68%; recall: 88.64%; FB1: 86.09 1697 EVENT: 37 63 63 precision: 58.73%; recall: 58.73%; FB1: 58.73 63 FAC: 98 135 152 precision: 64.47%; recall: 72.59%; FB1: 68.29 152 GPE: 2128 2240 2218 precision: 95.94%; recall: 95.00%; FB1: 95.47 2218 LANGUAGE: 10 22 13 precision: 76.92%; recall: 45.45%; FB1: 57.14 13 LAW: 21 40 30 precision: 70.00%; recall: 52.50%; FB1: 60.00 30 LOC: 133 179 166 precision: 80.12%; recall: 74.30%; FB1: 77.10 166 MONEY: 279 314 317 precision: 88.01%; recall: 88.85%; FB1: 88.43 317 NORP: 796 841 840 precision: 94.76%; recall: 94.65%; FB1: 94.71 840 ORDINAL: 180 195 219 precision: 82.19%; recall: 92.31%; FB1: 86.96 219 ORG: 1620 1795 1873 precision: 86.49%; recall: 90.25%; FB1: 88.33 1873 PERCENT: 309 349 342 precision: 90.35%; recall: 88.54%; FB1: 89.44 342 PERSON: 1862 1988 1970 precision: 94.52%; recall: 93.66%; FB1: 94.09 1970 PRODUCT: 51 76 68 precision: 75.00%; recall: 67.11%; FB1: 70.83 68 QUANTITY: 81 105 99 precision: 81.82%; recall: 77.14%; FB1: 79.41 99 TIME: 116 212 179 precision: 64.80%; recall: 54.72%; FB1: 59.34 179 WORK_OF_ART: 110 166 152 precision: 72.37%; recall: 66.27%; FB1: 69.18 152 CoNLL Eval (development dataset): processed 147724 tokens with 11066 phrases; found: 11259 phrases; correct: 10326. 
accuracy: 98.73%; 10326 11066 11259 precision: 91.71%; recall: 93.31%; FB1: 92.51 CARDINAL: 869 938 1001 precision: 86.81%; recall: 92.64%; FB1: 89.63 1001 DATE: 1374 1507 1571 precision: 87.46%; recall: 91.17%; FB1: 89.28 1571 EVENT: 112 143 138 precision: 81.16%; recall: 78.32%; FB1: 79.72 138 FAC: 105 115 140 precision: 75.00%; recall: 91.30%; FB1: 82.35 140 GPE: 2213 2268 2293 precision: 96.51%; recall: 97.57%; FB1: 97.04 2293 LANGUAGE: 27 33 27 precision: 100.00%; recall: 81.82%; FB1: 90.00 27 LAW: 33 40 39 precision: 84.62%; recall: 82.50%; FB1: 83.54 39 LOC: 163 204 178 precision: 91.57%; recall: 79.90%; FB1: 85.34 178 MONEY: 257 274 275 precision: 93.45%; recall: 93.80%; FB1: 93.62 275 NORP: 806 847 841 precision: 95.84%; recall: 95.16%; FB1: 95.50 841 ORDINAL: 212 232 241 precision: 87.97%; recall: 91.38%; FB1: 89.64 241 ORG: 1618 1740 1771 precision: 91.36%; recall: 92.99%; FB1: 92.17 1771 PERCENT: 164 177 174 precision: 94.25%; recall: 92.66%; FB1: 93.45 174 PERSON: 1961 2020 2064 precision: 95.01%; recall: 97.08%; FB1: 96.03 2064 PRODUCT: 64 72 73 precision: 87.67%; recall: 88.89%; FB1: 88.28 73 QUANTITY: 83 100 96 precision: 86.46%; recall: 83.00%; FB1: 84.69 96 TIME: 165 214 203 precision: 81.28%; recall: 77.10%; FB1: 79.14 203 WORK_OF_ART: 100 142 134 precision: 74.63%; recall: 70.42%; FB1: 72.46 134 ``` --- layout: model title: English RobertaForSequenceClassification Cased model (from mrm8488) author: John Snow Labs name: roberta_sequence_classifier_codebert2codebert_finetuned_code_defect_detection date: 2022-07-13 tags: [en, open_source, roberta, sequence_classification] task: Text Classification language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and 
production-readiness using Spark NLP. `codebert2codebert-finetuned-code-defect-detection` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_codebert2codebert_finetuned_code_defect_detection_en_4.0.0_3.0_1657716178266.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_codebert2codebert_finetuned_code_defect_detection_en_4.0.0_3.0_1657716178266.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_codebert2codebert_finetuned_code_defect_detection","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, classifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_codebert2codebert_finetuned_code_defect_detection","en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, classifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_sequence_classifier_codebert2codebert_finetuned_code_defect_detection| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|469.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mrm8488/codebert2codebert-finetuned-code-defect-detection --- layout: model title: French CamemBert Embeddings (from MYX4567) author: John Snow Labs name: camembert_embeddings_MYX4567_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `MYX4567`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_MYX4567_generic_model_fr_3.4.4_3.0_1653986744952.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_MYX4567_generic_model_fr_3.4.4_3.0_1653986744952.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_MYX4567_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_MYX4567_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_MYX4567_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/MYX4567/dummy-model --- layout: model title: Pipeline to Detect Bacterial Species author: John Snow Labs name: bert_token_classifier_ner_bacteria_pipeline date: 2022-03-21 tags: [licensed, ner, bacteria, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [bert_token_classifier_ner_bacteria](https://nlp.johnsnowlabs.com/2022/01/07/bert_token_classifier_ner_bacteria_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bacteria_pipeline_en_3.4.1_3.0_1647862897728.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bacteria_pipeline_en_3.4.1_3.0_1647862897728.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("bert_token_classifier_ner_bacteria_pipeline", "en", "clinical/models") pipeline.annotate("Based on these genetic and phenotypic properties, we propose that strain SMSP (T) represents a novel species of the genus Methanoregula, for which we propose the name Methanoregula formicica sp. nov., with the type strain SMSP (T) (= NBRC 105244 (T) = DSM 22288 (T)).") ``` ```scala val pipeline = new PretrainedPipeline("bert_token_classifier_ner_bacteria_pipeline", "en", "clinical/models") pipeline.annotate("Based on these genetic and phenotypic properties, we propose that strain SMSP (T) represents a novel species of the genus Methanoregula, for which we propose the name Methanoregula formicica sp. nov., with the type strain SMSP (T) (= NBRC 105244 (T) = DSM 22288 (T)).") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.bacteria_ner.pipeline").predict("""Based on these genetic and phenotypic properties, we propose that strain SMSP (T) represents a novel species of the genus Methanoregula, for which we propose the name Methanoregula formicica sp. nov., with the type strain SMSP (T) (= NBRC 105244 (T) = DSM 22288 (T)).""") ```
## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |SMSP (T) |SPECIES | |Methanoregula formicica|SPECIES | |SMSP (T) |SPECIES | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_bacteria_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverter --- layout: model title: Detect Drug Chemicals author: John Snow Labs name: ner_drugs_large date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for drugs. The model combines dosage, strength, form, and route into a single entity: Drug. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. 
## Predicted Entities `DRUG` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_drugs_large_en_3.0.0_3.0_1617209701231.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_drugs_large_en_3.0.0_3.0_1617209701231.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") # Clinical word embeddings trained on PubMED dataset word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_drugs_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["The patient is a 40-year-old white male who presents with a chief complaint of 'chest pain'. The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. He has been advised Aspirin 81 milligrams QDay. Humulin N. insulin 50 units in a.m. HCTZ 50 mg QDay. 
Nitroglycerin 1/150 sublingually PRN chest pain."]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") // Clinical word embeddings trained on PubMED dataset val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_drugs_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("""The patient is a 40-year-old white male who presents with a chief complaint of "chest pain". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. He has been advised Aspirin 81 milligrams QDay. Humulin N. insulin 50 units in a.m. HCTZ 50 mg QDay. Nitroglycerin 1/150 sublingually PRN chest pain.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.drugs.large").predict(""". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. He has been advised Aspirin 81 milligrams QDay. Humulin N. insulin 50 units in a.m. HCTZ 50 mg QDay. Nitroglycerin 1/150 sublingually PRN chest pain.""") ```
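Extracted drug chunks often repeat across sentences (the same prescription can be mentioned more than once), and downstream code frequently wants each mention only once, in order of first appearance. A small order-preserving de-duplication helper for the extracted chunk strings (a generic utility sketch, not part of Spark NLP):

```python
def unique_in_order(chunks):
    """Drop repeated chunk strings while preserving first-seen order."""
    seen = set()
    ordered = []
    for chunk in chunks:
        if chunk not in seen:
            seen.add(chunk)
            ordered.append(chunk)
    return ordered
```

For example, `unique_in_order(["Aspirin 81 milligrams", "HCTZ 50 mg", "Aspirin 81 milligrams"])` returns `["Aspirin 81 milligrams", "HCTZ 50 mg"]`.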
## Results ```bash +--------------------------------+---------+ |chunk |ner_label| +--------------------------------+---------+ |Aspirin 81 milligrams |DRUG | |Humulin N |DRUG | |insulin 50 units |DRUG | |HCTZ 50 mg |DRUG | |Nitroglycerin 1/150 sublingually|DRUG | +--------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_drugs_large| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on i2b2_med7 + FDA with 'embeddings_clinical'. https://www.i2b2.org/NLP/Medication --- layout: model title: Translate Marathi to English Pipeline author: John Snow Labs name: translate_mr_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, mr, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `mr` - target languages: `en` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/INDIAN_TRANSLATION_MARATHI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TRANSLATION_PIPELINES_MODELS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_mr_en_xx_2.7.0_2.4_1609686054618.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_mr_en_xx_2.7.0_2.4_1609686054618.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_mr_en", lang = "xx") result = pipeline.annotate("मला वाचायला आवडते.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_mr_en", lang = "xx") val result = pipeline.annotate("मला वाचायला आवडते.") ``` {:.nlu-block} ```python import nlu text = ["मला वाचायला आवडते."] translate_df = nlu.load('xx.mr.translate_to.en').predict(text, output_level='sentence') translate_df ```
## Results ```bash +------------------------------+---------------------------+ |sentence |translation | +------------------------------+---------------------------+ |मला वाचायला आवडते. |I like reading. | +------------------------------+---------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_mr_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Icelandic XlmRoBertaForQuestionAnswering (from saattrupdan) author: John Snow Labs name: xlm_roberta_qa_xlmr_base_texas_squad_is_is_saattrupdan date: 2022-06-24 tags: [open_source, question_answering, xlmroberta, is] task: Question Answering language: is edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlmr-base-texas-squad-is` is an Icelandic model originally trained by `saattrupdan`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlmr_base_texas_squad_is_is_saattrupdan_is_4.0.0_3.0_1656066430850.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlmr_base_texas_squad_is_is_saattrupdan_is_4.0.0_3.0_1656066430850.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlmr_base_texas_squad_is_is_saattrupdan","is") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlmr_base_texas_squad_is_is_saattrupdan","is") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("is.answer_question.squad.xlmr_roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
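In the NLU one-liner, the question and context are packed into a single string separated by `|||`. A minimal sketch of splitting such a packed string (plain Python; nlu's internal handling of the separator may differ in detail):

```python
def split_question_context(packed, sep="|||"):
    """Split a 'question|||context' string into its two parts.

    Illustration of the separator convention used in the nlu
    one-liner above; not nlu's own parsing code.
    """
    question, _, context = packed.partition(sep)
    return question.strip(), context.strip()

q, c = split_question_context(
    "What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
print(c)  # My name is Clara and I live in Berkeley.
```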
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlmr_base_texas_squad_is_is_saattrupdan| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|is| |Size:|874.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/saattrupdan/xlmr-base-texas-squad-is --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_2 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-64-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_2_en_4.0.0_3.0_1655733408907.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_2_en_4.0.0_3.0_1655733408907.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_64d_seed_2").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
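Extractive QA models like this one score every token as a potential answer start and answer end; the predicted answer is the highest-scoring valid span. A toy sketch of that decoding step with made-up scores (plain Python; not the annotator's actual implementation, which works on model logits):

```python
def best_span(start_scores, end_scores, max_len=30):
    """Pick (start, end) maximizing start_scores[s] + end_scores[e]
    subject to s <= e < s + max_len.

    Toy version of extractive-QA span decoding; illustration only.
    """
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start  = [0.1, 0.0, 0.2, 5.0, 0.0, 0.0, 0.1, 0.0, 0.3, 0.0]  # made-up scores
end    = [0.0, 0.1, 0.0, 4.0, 0.2, 0.0, 0.0, 0.1, 0.5, 0.0]  # made-up scores
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```

The `start <= end` constraint is what prevents degenerate answers; `max_len` caps the answer length.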
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|419.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-64-finetuned-squad-seed-2 --- layout: model title: Legal Fringe Benefits Clause Binary Classifier author: John Snow Labs name: legclf_fringe_benefits_clause date: 2023-01-29 tags: [en, legal, classification, fringe, benefits, clauses, fringe_benefits, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `fringe-benefits` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `fringe-benefits`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_fringe_benefits_clause_en_1.0.0_3.0_1674994206659.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_fringe_benefits_clause_en_1.0.0_3.0_1674994206659.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_fringe_benefits_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
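Since the sentence embeddings in this pipeline accept at most 512 tokens, long documents should be pre-split, e.g. by paragraph, before classification. A rough whitespace-token sketch of greedy paragraph packing (plain Python; a real setup would use the splitters from the tutorial linked above, and real tokenizers count tokens differently):

```python
def split_by_token_budget(text, max_tokens=512):
    """Greedily pack paragraphs into pieces of at most max_tokens
    whitespace-separated tokens.

    Rough illustration only: a single paragraph longer than the
    budget is kept whole here and would need further splitting.
    """
    pieces, current, count = [], [], 0
    for para in text.split("\n\n"):
        n = len(para.split())
        if current and count + n > max_tokens:
            pieces.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        pieces.append("\n\n".join(current))
    return pieces

doc = "\n\n".join(["word " * 300, "word " * 300, "word " * 100])
print([len(p.split()) for p in split_by_token_budget(doc)])  # [300, 400]
```

Each resulting piece can then be fed through the classification pipeline above as a separate row.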
## Results ```bash +-------+ |result| +-------+ |[fringe-benefits]| |[other]| |[other]| |[fringe-benefits]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_fringe_benefits_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support fringe-benefits 1.00 1.00 1.00 66 other 1.00 1.00 1.00 104 accuracy - - 1.00 170 macro-avg 1.00 1.00 1.00 170 weighted-avg 1.00 1.00 1.00 170 ``` --- layout: model title: Legal Sublease Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_sublease_agreement_bert date: 2022-11-25 tags: [en, legal, classification, agreement, sublease, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_sublease_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `sublease-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `sublease-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_sublease_agreement_bert_en_1.0.0_3.0_1669371981511.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_sublease_agreement_bert_en_1.0.0_3.0_1669371981511.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_sublease_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[sublease-agreement]| |[other]| |[other]| |[sublease-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_sublease_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Legal documents, scraped from the Internet and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 1.00 1.00 1.00 65 sublease-agreement 1.00 1.00 1.00 33 accuracy - - 1.00 98 macro-avg 1.00 1.00 1.00 98 weighted-avg 1.00 1.00 1.00 98 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from vkmr) author: John Snow Labs name: distilbert_qa_vkmr_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `vkmr`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_vkmr_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773133530.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_vkmr_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773133530.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vkmr_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vkmr_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_vkmr_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/vkmr/distilbert-base-uncased-finetuned-squad --- layout: model title: Clinical Deidentification (Spanish, augmented) author: John Snow Labs name: clinical_deidentification_augmented date: 2022-02-16 tags: [deid, es, licensed] task: De-identification language: es edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline is trained with sciwiki_300d embeddings and can be used to deidentify PHI information from medical texts in Spanish. It differs from the previous `clinical_deidentification` pipeline in that it includes the `ner_deid_subentity_augmented` NER model and some improvements in ContextualParsers and RegexMatchers. The PHI information will be masked and obfuscated in the resulting text. 
The pipeline can mask, fake or obfuscate the following entities: `AGE`, `DATE`, `PROFESSION`, `EMAIL`, `USERNAME`, `STREET`, `CITY`, `COUNTRY`, `DOCTOR`, `HOSPITAL`, `PATIENT`, `URL`, `MEDICALRECORD`, `IDNUM`, `ORGANIZATION`, `PHONE`, `ZIP`, `ACCOUNT`, `SSN`, `PLATE`, `SEX` and `IPADDR` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_augmented_es_3.4.1_3.0_1645005904505.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_augmented_es_3.4.1_3.0_1645005904505.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification_augmented", "es", "clinical/models") sample = """Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. 
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com""" result = deid_pipeline.annotate(sample) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = new PretrainedPipeline("clinical_deidentification_augmented", "es", "clinical/models") val sample = "Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. 
En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com " val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("es.deid.clinical_augmented").predict("""Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. 
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com""") ```
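Conceptually, each de-identification mode (entity labels, character masks of matching or fixed length) replaces detected PHI spans in the text. A minimal pure-Python sketch of that replacement step — the entity spans and placeholder formats below are hypothetical illustrations, not the pipeline's actual NER output or configuration:

```python
# Sketch of three masking policies over character-span entities.
# In the real pipeline, spans come from the NER + chunk-merge stages.
def mask(text, entities, policy="entity_labels", fixed_len=4):
    # entities: non-overlapping (start, end, label) character spans
    out, last = [], 0
    for start, end, label in sorted(entities):
        out.append(text[last:start])
        chunk = text[start:end]
        if policy == "entity_labels":
            out.append(f"<{label}>")
        elif policy == "same_length_chars":
            # bracketed mask whose length matches the original chunk
            out.append("[" + "*" * max(len(chunk) - 2, 0) + "]")
        else:  # fixed_length_chars
            out.append("*" * fixed_len)
        last = end
    out.append(text[last:])
    return "".join(out)

text = "Nombre: Jose. Localidad: Madrid."
ents = [(8, 12, "PATIENT"), (25, 31, "CITY")]
print(mask(text, ents))                        # Nombre: <PATIENT>. Localidad: <CITY>.
print(mask(text, ents, "same_length_chars"))   # Nombre: [**]. Localidad: [****].
print(mask(text, ents, "fixed_length_chars"))  # Nombre: ****. Localidad: ****.
```

Obfuscation (the fourth mode shown in the results) goes one step further and substitutes realistic surrogate values instead of masks.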
## Results ```bash Masked with entity labels ------------------------------ Datos . Nombre: . Apellidos: . NHC: . NASS: . Domicilio: , B.. Localidad/ Provincia: . CP: . Datos asistenciales. Fecha de nacimiento: . País: . Edad: años Sexo: . Fecha de Ingreso: . Médico: NºCol: . Informe clínico : de años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; 25,8 seg. 
Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: +++; Test de Coombs > ; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. Servicio de Endocrinología y Nutrición - () Correo electrónico: Masked with chars ------------------------------ Datos [**********]. Nombre: [**] . Apellidos: [*************]. NHC: [*****]. NASS: [************]. Domicilio: [*******************], * B.. Localidad/ Provincia: [****]. CP: [***]. Datos asistenciales. Fecha de nacimiento: [********]. País: [****]. Edad: ** años Sexo: *. Fecha de Ingreso: [********]. Médico: [******************] NºCol: [*********]. 
Informe clínico [**********]: [***] de ** años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en [*********] en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; [**] 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. 
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: [*************] +++; Test de Coombs > [****]; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). [*********] [****] significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. [******************] [******************************] Servicio de Endocrinología y Nutrición [***************************] [***] [****] - [****] ([****]) Correo electrónico: [********************] Masked with fixed length chars ------------------------------ Datos ****. Nombre: **** . Apellidos: ****. NHC: ****. NASS: ****. Domicilio: ****, **** B.. Localidad/ Provincia: ****. CP: ****. Datos asistenciales. Fecha de nacimiento: ****. País: ****. Edad: **** años Sexo: ****. Fecha de Ingreso: ****. Médico: **** NºCol: ****. Informe clínico ****: **** de **** años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. 
Antes de comenzar el cuadro estuvo en **** en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; **** 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: **** +++; Test de Coombs > ****; Brucellacapt > 1/5120. 
Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). **** **** significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. **** **** Servicio de Endocrinología y Nutrición **** **** **** - **** (****) Correo electrónico: **** Obfuscated ------------------------------ Datos Hombre. Nombre: Aurora Garrido Paez . Apellidos: Aurora Garrido Paez. NHC: BBBBBBBBQR648597. NASS: 48127833R. Domicilio: C/ Rambla, 246, 5 B.. Localidad/ Provincia: Alicante. CP: 24202. Datos asistenciales. Fecha de nacimiento: 21/04/1977. País: Portugal. Edad: 56 años Sexo: Hombre. Fecha de Ingreso: 10/07/2018. Médico: Francisco José Roca Bermúdez NºCol: 21344083-P. Informe clínico Hombre: 041010000011 de 56 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Zaragoza en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. 
Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; Tecnogroup SL 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: María Miguélez Sanz +++; Test de Coombs > 07-25-1974; Brucellacapt > 1/5120. 
Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). F. 041010000011 significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. Francisco José Roca Bermúdez Hospital 12 de Octubre Servicio de Endocrinología y Nutrición Calle Ramón y Cajal s/n 03129 Zaragoza - Alicante (Portugal) Correo electrónico: richard@company.it ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification_augmented| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|es| |Size:|281.3 MB| ## Included Models - nlp.DocumentAssembler - nlp.SentenceDetectorDLModel - nlp.TokenizerModel - nlp.WordEmbeddingsModel - medical.NerModel - nlp.NerConverter - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - nlp.RegexMatcherModel - ChunkMergeModel - medical.DeIdentificationModel - medical.DeIdentificationModel - medical.DeIdentificationModel - medical.DeIdentificationModel - Finisher --- layout: model title: Korean Bert Embeddings (from emeraldgoose) author: John Snow Labs name: 
bert_embeddings_bert_base_v1_sports date: 2022-04-11 tags: [bert, embeddings, ko, open_source] task: Embeddings language: ko edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-v1-sports` is a Korean model originally trained by `emeraldgoose`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_v1_sports_ko_3.4.2_3.0_1649675633299.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_v1_sports_ko_3.4.2_3.0_1649675633299.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use

{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_v1_sports","ko") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["나는 Spark NLP를 좋아합니다"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_v1_sports","ko") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("나는 Spark NLP를 좋아합니다").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ko.embed.bert_base_v1_sports").predict("""나는 Spark NLP를 좋아합니다""") ```
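The `embeddings` column holds one vector per token (768 dimensions for a BERT-base model such as this one). Token vectors are typically compared with cosine similarity; a minimal pure-Python sketch using toy 3-d vectors in place of the model's actual output:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

v1 = [0.2, 0.1, 0.7]  # toy stand-ins for 768-d token embeddings
v2 = [0.2, 0.1, 0.7]
v3 = [-0.7, 0.1, 0.2]
print(round(cosine(v1, v2), 3))  # 1.0 — identical vectors
print(round(cosine(v1, v3), 3))  # near 0 — dissimilar vectors
```

In Spark, the per-token vectors can be pulled out of `result` by exploding the `embeddings` annotation column.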
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_v1_sports| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ko| |Size:|343.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/emeraldgoose/bert-base-v1-sports --- layout: model title: Detect Cellular/Molecular Biology Entities (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_cellular date: 2022-01-06 tags: [bertfortokenclassification, ner, cellular, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects molecular biology-related terms in medical texts. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. ## Predicted Entities `DNA`, `Cell_type`, `Cell_line`, `RNA`, `Protein` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_cellular_en_3.3.4_2.4_1641455594142.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_cellular_en_3.3.4_2.4_1641455594142.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_cellular", "en", "clinical/models")\ .setInputCols("token", "document")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter]) data = spark.createDataFrame([["""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. 
Tax alone was also inactive."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_cellular", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("ner") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("document","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter)) val data = Seq("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.cellular").predict("""Detection of various other intracellular signaling proteins is also described. 
Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""") ```
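The `NerConverter` stage in the pipelines above groups token-level `B-`/`I-` tags into entity chunks. Conceptually (a pure-Python sketch of the BIO-merging idea, not the Spark NLP implementation):

```python
def bio_to_chunks(tokens, tags):
    # Merge B-/I- tagged tokens into (chunk_text, label) pairs
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" tag, or an I- tag that does not continue the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Binding", "of", "Tax", "to", "Tax-responsive", "element", "1"]
tags   = ["O", "O", "B-protein", "O", "B-DNA", "I-DNA", "I-DNA"]
print(bio_to_chunks(tokens, tags))
# [('Tax', 'protein'), ('Tax-responsive element 1', 'DNA')]
```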
## Results ```bash +-----------------------------------------------------------+---------+ |chunk |ner_label| +-----------------------------------------------------------+---------+ |intracellular signaling proteins |protein | |human T-cell leukemia virus type 1 promoter |DNA | |Tax |protein | |Tax-responsive element 1 |DNA | |cyclic AMP-responsive members |protein | |CREB/ATF family |protein | |transcription factors |protein | |Tax |protein | |human T-cell leukemia virus type 1 Tax-responsive element 1|DNA | |TRE-1 |DNA | |lacZ gene |DNA | |CYC1 promoter |DNA | |TRE-1 |DNA | |cyclic AMP response element-binding protein |protein | |CREB |protein | |CREB |protein | |GAL4 activation domain |protein | |GAD |protein | |reporter gene |DNA | |Tax |protein | +-----------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_cellular| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## Data Source Trained on Cancer Genetics (CG) task of the BioNLP Shared Task 2013. 
https://aclanthology.org/W13-2008/ ## Benchmarking ```bash label precision recall f1-score support B-DNA 0.87 0.77 0.82 1056 B-RNA 0.85 0.79 0.82 118 B-cell_line 0.66 0.70 0.68 500 B-cell_type 0.87 0.75 0.81 1921 B-protein 0.90 0.85 0.88 5067 I-DNA 0.93 0.86 0.90 1789 I-RNA 0.92 0.84 0.88 187 I-cell_line 0.67 0.76 0.71 989 I-cell_type 0.92 0.76 0.84 2991 I-protein 0.94 0.80 0.87 4774 accuracy - - 0.80 19392 macro-avg 0.76 0.81 0.78 19392 weighted-avg 0.89 0.80 0.85 19392 ``` --- layout: model title: Recognize Entities DL Pipeline for Spanish - Medium author: John Snow Labs name: entity_recognizer_md date: 2021-03-22 tags: [open_source, spanish, entity_recognizer_md, pipeline, es] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: es edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_md is a pretrained pipeline that performs basic processing steps. It covers most of the common text processing tasks on your dataframe. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_es_3.0.0_3.0_1616447940054.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_es_3.0.0_3.0_1616447940054.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('entity_recognizer_md', lang = 'es') annotations = pipeline.fullAnnotate("Hola de John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("entity_recognizer_md", lang = "es") val result = pipeline.fullAnnotate("Hola de John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hola de John Snow Labs! "] result_df = nlu.load('es.ner.md').predict(text) result_df ```
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:-----------------------------|:----------------------------|:----------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Hola de John Snow Labs! '] | ['Hola de John Snow Labs!'] | ['Hola', 'de', 'John', 'Snow', 'Labs!'] | [[0.5123000144958496,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|es| --- layout: model title: Legal Lease Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_lease_agreement_bert date: 2022-11-24 tags: [en, legal, classification, agreement, lease, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_lease_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `lease-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `lease-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_lease_agreement_bert_en_1.0.0_3.0_1669315492194.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_lease_agreement_bert_en_1.0.0_3.0_1669315492194.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_lease_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[lease-agreement]| |[other]| |[other]| |[lease-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_lease_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support lease-agreement 0.95 0.88 0.91 40 other 0.94 0.98 0.96 82 accuracy - - 0.94 122 macro-avg 0.94 0.93 0.93 122 weighted-avg 0.94 0.94 0.94 122 ``` --- layout: model title: Clean Slang in Texts author: John Snow Labs name: clean_slang date: 2022-01-03 tags: [en, open_source] task: Pipeline Public language: en nav_key: models edition: Spark NLP 3.3.4 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The clean_slang is a pretrained pipeline that performs basic processing steps and normalizes slang terms. It covers most of the common text processing tasks on your dataframe. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/clean_slang_en_3.3.4_3.0_1641218003693.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/clean_slang_en_3.3.4_3.0_1641218003693.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('clean_slang', lang='en') testDoc = ''' yo, what is wrong with ya? ''' result = pipeline.annotate(testDoc) ``` ```scala val pipeline = new PretrainedPipeline("clean_slang", lang = "en") val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs ! "] result_df = nlu.load('en.clean.slang').predict(text) result_df ```
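Conceptually, the NormalizerModel stage in this pipeline maps slang tokens to standard forms through a lookup table. A minimal pure-Python sketch with a hypothetical slang dictionary (not the pipeline's actual lookup):

```python
# Hypothetical slang dictionary; the real pipeline ships its own lookup table.
SLANG = {"yo": "hey", "ya": "you", "u": "you", "gr8": "great"}

def normalize_slang(text):
    # Tokenize on whitespace, strip surrounding punctuation,
    # then map slang tokens to their standard forms
    tokens = [t.strip(",.?!") for t in text.split()]
    return [SLANG.get(t.lower(), t.lower()) for t in tokens if t]

print(normalize_slang("yo, what is wrong with ya?"))
# ['hey', 'what', 'is', 'wrong', 'with', 'you']
```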
## Results ```bash ['hey', 'what', 'is', 'wrong', 'with', 'you'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clean_slang| |Type:|pipeline| |Compatibility:|Spark NLP 3.3.4+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|19.1 KB| ## Included Models - DocumentAssembler - TokenizerModel - NormalizerModel --- layout: model title: Legal Applicable Laws Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_applicable_laws_bert date: 2023-03-05 tags: [en, legal, classification, clauses, applicable_laws, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from the US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Applicable_Laws` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`Applicable_Laws`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_applicable_laws_bert_en_1.0.0_3.0_1678050521046.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_applicable_laws_bert_en_1.0.0_3.0_1678050521046.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_applicable_laws_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
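The "paragraph splitting (by multiline)" recommended in the description can be done without any extra libraries before feeding each piece to the classifier; a minimal sketch:

```python
import re

def split_paragraphs(text):
    # Split a long document on runs of blank lines and drop empty pieces,
    # so each paragraph can be classified independently.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Governing Law. This Agreement shall be governed by...\n\nSeverability. If any provision..."
print(split_paragraphs(doc))
```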
## Results

```bash
+-------+
|result|
+-------+
|[Applicable_Laws]|
|[Other]|
|[Other]|
|[Applicable_Laws]|
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_applicable_laws_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
label            precision  recall  f1-score  support
Applicable_Laws       0.80    1.00      0.89       40
Other                 1.00    0.84      0.91       61
accuracy                 -       -      0.90      101
macro-avg             0.90    0.92      0.90      101
weighted-avg          0.92    0.90      0.90      101
```

---
layout: model
title: Sentence Entity Resolver for SNOMED Concepts
author: John Snow Labs
name: sbiobertresolve_snomed_findings_aux_concepts
date: 2022-02-26
tags: [snomed, licensed, en, clinical, aux, ct]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.1.2
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model maps clinical entities and concepts to SNOMED codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It is also capable of extracting `Morph Abnormality`, `Procedure`, `Substance`, `Physical Object`, and `Body Structure` concepts of SNOMED codes.

In the metadata, the `all_k_aux_labels` field can be split to get further information: `ground truth`, `concept`, and `aux`.
For example, in the output shared below, the ground truth is `Atherosclerosis`, the concept is `Observation`, and the aux is `Morph Abnormality`.

## Predicted Entities

`SNOMED Codes`

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_findings_aux_concepts_en_3.1.2_3.0_1645879611162.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_findings_aux_concepts_en_3.1.2_3.0_1645879611162.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_clinical = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("clinical_ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "clinical_ner"])\
    .setOutputCol("ner_chunk")

chunk2doc = Chunk2Doc()\
    .setInputCols("ner_chunk")\
    .setOutputCol("ner_chunk_doc")

sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["ner_chunk_doc"])\
    .setOutputCol("sbert_embeddings")

snomed_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_findings_aux_concepts", "en", "clinical/models")\
    .setInputCols(["sbert_embeddings"])\
    .setOutputCol("snomed_code")\
    .setDistanceFunction("EUCLIDEAN")

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    ner_clinical,
    ner_converter,
    chunk2doc,
    sbert_embedder,
    snomed_resolver])

text = """FINDINGS: The patient was found upon excision of the cyst that it contained a large Prolene suture; beneath this was a very small incisional hernia, the hernia cavity, which contained omentum; the hernia was easily repaired"""

df = spark.createDataFrame([[text]]).toDF("text")

result = nlpPipeline.fit(df).transform(df)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained()
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_clinical = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("clinical_ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "clinical_ner"))
    .setOutputCol("ner_chunk")

val chunk2doc = new Chunk2Doc()
    .setInputCols("ner_chunk")
    .setOutputCol("ner_chunk_doc")

val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
    .setInputCols("ner_chunk_doc")
    .setOutputCol("sbert_embeddings")

val snomed_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_findings_aux_concepts", "en", "clinical/models")
    .setInputCols(Array("sbert_embeddings"))
    .setOutputCol("snomed_code")
    .setDistanceFunction("EUCLIDEAN")

val nlpPipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, ner_clinical, ner_converter, chunk2doc, sbert_embedder, snomed_resolver))

val text = """FINDINGS: The patient was found upon excision of the cyst that it contained a large Prolene suture; beneath this was a very small incisional hernia, the hernia cavity, which contained omentum; the hernia was easily repaired"""

val df = Seq(text).toDF("text")

val result = nlpPipeline.fit(df).transform(df)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.snomed.findings_aux_concepts").predict("""FINDINGS: The patient was found upon excision of the cyst that it contained a large Prolene suture; beneath this was a very small incisional hernia, the hernia cavity, which contained omentum; the hernia was easily repaired""")
```
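Each entry in the `all_k_aux_labels` metadata field packs the three pieces of information described above, separated by `|`. A minimal sketch of unpacking one entry (the sample value is illustrative):

```python
def parse_aux_label(label):
    # all_k_aux_labels entries are pipe-delimited: ground truth | concept | aux
    ground_truth, concept, aux = label.split("|")
    return {"ground_truth": ground_truth, "concept": concept, "aux": aux}

print(parse_aux_label("Atherosclerosis|Observation|Morph Abnormality"))
# -> {'ground_truth': 'Atherosclerosis', 'concept': 'Observation', 'aux': 'Morph Abnormality'}
```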
## Results ```bash | sent_id | ner_chunk | entity | snomed_code | all_codes | resolutions | all_k_aux_labels | |--------- |-------------------------------- |----------- |------------------- |------------------------------------------------------------------------ |--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | 1 | excision | TREATMENT | 180397004 | [180397004, 65801008, 129304002, 257819000, 82868003,... | [excision from organ noc: [sinus tract] or [fistula], excision, excision - action, surgical excision, margins of excision,... | [Excision from organ NOC: [sinus tract] or [fistula]\|Procedure\|Procedure, Excision\|Procedure\|Procedure, Excision - action\|Observation\|Qualifier Value, Surgical excision\|Observation\|Qualifier Value, Surgical margins\|Spec Anatomic Site\|Body Structure,... | | 1 | the cyst | PROBLEM | 246178003 | [246178003, 103552005, 441457006, 264515009, 367643001,... | [form of cyst, cyst, cyst, cyst, cyst,... | [Form of cyst\|Observation\|Attribute, Kingdom Protozoa cyst\|Observation\|Organism, Cyst\|Condition\|Clinical Finding, Cyst - morphology\|Observation\|Qualifier Value, Cyst\|Observation\|Morph Abnormality,... | | 1 | a large Prolene suture | TREATMENT | 20594411000001105 | [20594411000001105, 7267511000001100, 20125511000001105, 463182000,... | [finger stalls plastic medium, portia disposable gloves polythene medium (bray group ltd), silk mittens 8-14 years, polybutester suture, skinnies silk gloves large child blue (dermacea ltd),... 
| [-\|-\|-, Portia disposable gloves polythene medium (Bray Group Ltd)\|Device\|Physical Object, -\|-\|-, Polybutester suture\|Device\|Physical Object, -\|-\|-,... | | 1 | a very small incisional hernia | PROBLEM | 155752004 | [155752004, 196894007, 266513000, 415772007, 266514006, | [simple incisional hernia, simple incisional hernia, simple incisional hernia, uncomplicated ventral incisional hernia, umbilical hernia - simple,... | [Hernia - incisional\|Condition\|Clinical Finding, Uncomplicated incisional hernia\|Condition\|Clinical Finding, Incisional hernia - simple\|Condition\|Clinical Finding, Uncomplicated ventral incisional hernia\|Condition\|Clinical Finding,... | | 1 | the hernia cavity | PROBLEM | 112639008 | [112639008, 52515009, 359801000, 414403008, 147780008, | [protrusion, hernia, hernia, hernia, notification of whooping cough,... | [Protrusion\|Observation\|Morph Abnormality, Hernia of abdominal cavity\|Condition\|Clinical Finding, Hernia of abdominal cavity\|Condition\|Clinical Finding, Hernia\|Observation\|Morph Abnormality,... | | 1 | the hernia | PROBLEM | 52515009 | [52515009, 359801000, 414403008, 147780008, 112639008, | [hernia, hernia, hernia, notification of whooping cough, protrusion,... | [Hernia of abdominal cavity\|Condition\|Clinical Finding, Hernia of abdominal cavity\|Condition\|Clinical Finding, Hernia\|Observation\|Morph Abnormality, Notification of whooping cough\|Procedure\|Procedure,... | | 1 | repaired | TREATMENT | 50826004 | [50826004, 4365001, 257903006, 33714007, 260938008, | [repaired, repair, repair, corrected, restoration,... | [Repaired\|Observation\|Qualifier Value, Surgical repair\|Procedure\|Procedure, Repair - action\|Observation\|Qualifier Value, Corrected\|Observation\|Qualifier Value, Type of restoration\|Observation\|Attribute,... 
|
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_snomed_findings_aux_concepts|
|Compatibility:|Healthcare NLP 3.1.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk, sbert_embeddings]|
|Output Labels:|[snomed_code]|
|Language:|en|
|Size:|4.7 GB|
|Case sensitive:|false|

---
layout: model
title: Detect Adverse Drug Events (MedicalBertForTokenClassification)
author: John Snow Labs
name: bert_token_classifier_ade_tweet_binary
date: 2022-07-29
tags: [clinical, licensed, ade, en, medicalbertfortokenclassification, ner]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalBertForTokenClassifier
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Detect adverse reactions of drugs in texts exchanged over Twitter. This model is trained with the `BertForTokenClassification` method from the transformers library and imported into Spark NLP.

## Predicted Entities

`ADE`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ade_tweet_binary_en_4.0.0_3.0_1659092904667.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ade_tweet_binary_en_4.0.0_3.0_1659092904667.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ade_tweet_binary", "en", "clinical/models")\
    .setInputCols("token", "sentence")\
    .setOutputCol("ner")\
    .setCaseSensitive(True)

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    tokenClassifier,
    ner_converter])

data = spark.createDataFrame(["I used to be on paxil but that made me more depressed and prozac made me angry",
                              "Maybe cos of the insulin blocking effect of seroquel but i do feel sugar crashes when eat fast carbs."], StringType()).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ade_tweet_binary", "en", "clinical/models")
    .setInputCols(Array("token", "sentence"))
    .setOutputCol("ner")
    .setCaseSensitive(true)

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    tokenClassifier,
    ner_converter))

val data = Seq("I used to be on paxil but that made me more depressed and prozac made me angry",
               "Maybe cos of the insulin blocking effect of seroquel but i do feel sugar crashes when eat fast carbs.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.classify.bert_token.binary_ade_tweet").predict("""Maybe cos of the insulin blocking effect of seroquel but i do feel sugar crashes when eat fast carbs.""")
```
## Results ```bash +----------------+---------+ |chunk |ner_label| +----------------+---------+ |depressed |ADE | |angry |ADE | |insulin blocking|ADE | |sugar crashes |ADE | +----------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ade_tweet_binary| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## Benchmarking ```bash label precision recall f1-score support B-ADE 0.83 0.81 0.82 525 I-ADE 0.72 0.63 0.67 439 O 0.96 0.97 0.97 5439 accuracy - - 0.94 6403 macro-avg 0.84 0.80 0.82 6403 weighted-avg 0.93 0.94 0.94 6403 ``` --- layout: model title: Detect Adverse Drug Events author: John Snow Labs name: ner_ade_clinical date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect adverse reactions of drugs in reviews, tweets, and medical text using pretrained NER model. 
## Predicted Entities `DRUG`, `ADE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_ade_clinical_en_3.0.0_3.0_1617260622471.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_ade_clinical_en_3.0.0_3.0_1617260622471.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])

model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text"))
```

```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))

val data = Seq("EXAMPLE_TEXT").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.ade.clinical").predict("""Put your text here.""")
```
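The `NerConverter` stage merges the model's BIO tags (e.g. `B-ADE`, `I-ADE`, `O`) into entity chunks; a minimal pure-Python sketch of that merging logic:

```python
def bio_to_chunks(tokens, tags):
    # Collect each maximal B-X (I-X)* run into one (chunk_text, label) pair.
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["i", "do", "feel", "sugar", "crashes"]
tags = ["O", "O", "O", "B-ADE", "I-ADE"]
print(bio_to_chunks(tokens, tags))
# -> [('sugar crashes', 'ADE')]
```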
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_ade_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Benchmarking ```bash +------+-------+------+------+-------+---------+------+------+ |entity| tp| fp| fn| total|precision|recall| f1| +------+-------+------+------+-------+---------+------+------+ | DRUG|17470.0|1436.0|1951.0|19421.0| 0.924|0.8995|0.9116| | ADE| 6010.0|1244.0|1886.0| 7896.0| 0.8285|0.7611|0.7934| +------+-------+------+------+-------+---------+------+------+ +------------------+ | macro| +------------------+ |0.8525141088742945| +------------------+ +------------------+ | micro| +------------------+ |0.8774545383517981| +------------------+ ``` --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_news_pretrain_ft_news date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `news_pretrain_roberta_FT_newsqa` is a English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_news_pretrain_ft_news_en_4.3.0_3.0_1674211715703.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_news_pretrain_ft_news_en_4.3.0_3.0_1674211715703.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_news_pretrain_ft_news","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_news_pretrain_ft_news","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
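Extractive question answering models of this kind score candidate start and end positions in the context and return the best-scoring span. A toy sketch of that decoding step (the scores below are made up for illustration):

```python
def best_span(tokens, start_scores, end_scores, max_len=10):
    # Pick the (start, end) pair with the highest combined score,
    # subject to start <= end and a maximum span length.
    best, best_score = (0, 0), float("-inf")
    for s in range(len(tokens)):
        for e in range(s, min(s + max_len, len(tokens))):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    s, e = best
    return " ".join(tokens[s:e + 1])

tokens = "My name is Clara and I live in Berkeley .".split()
start = [0, 0, 0, 5, 0, 0, 0, 0, 1, 0]  # toy start-position scores
end = [0, 0, 0, 6, 0, 0, 0, 0, 2, 0]    # toy end-position scores
print(best_span(tokens, start, end))
# -> 'Clara'
```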
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_news_pretrain_ft_news|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|467.1 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/AnonymousSub/news_pretrain_roberta_FT_newsqa

---
layout: model
title: Financial Controls procedures Item Binary Classifier
author: John Snow Labs
name: finclf_controls_procedures_item
date: 2022-08-10
tags: [en, finance, classification, 10k, annual, reports, sec, filings, licensed]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `controls_procedures` item type of 10K Annual Reports. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level.

If you have big financial documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Finance NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into account that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
## Predicted Entities `other`, `controls_procedures` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_controls_procedures_item_en_1.0.0_3.2_1660154383195.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_controls_procedures_item_en_1.0.0_3.2_1660154383195.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") useEmbeddings = nlp.UniversalSentenceEncoder.pretrained() \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("finclf_controls_procedures_item", "en", "finance/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, useEmbeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
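Since the underlying embeddings accept at most 512 tokens, long 10-K items can be cut into fixed-size token windows before classification. A minimal sketch using naive whitespace tokenization (a real pipeline should count tokens with the model's own tokenizer, so the 512 limit here is approximate):

```python
def token_windows(text, max_tokens=512, stride=512):
    # Split a long text into consecutive windows of at most max_tokens
    # whitespace tokens; each window can then be classified independently.
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), stride)]

chunks = token_windows("word " * 1200)
print(len(chunks))  # -> 3
```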
## Results

```bash
+-------+
| result|
+-------+
|[controls_procedures]|
|[other]|
|[other]|
|[controls_procedures]|
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|finclf_controls_procedures_item|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.4 MB|

## References

Weak labelling on documents from the EDGAR database

## Benchmarking

```bash
label                precision  recall  f1-score  support
controls_procedures       0.94    0.93      0.94       86
other                     0.94    0.95      0.94       92
accuracy                     -       -      0.94      178
macro-avg                 0.94    0.94      0.94      178
weighted-avg              0.94    0.94      0.94      178
```

---
layout: model
title: Arabic BertForMaskedLM Cased model (from asafaya)
author: John Snow Labs
name: bert_embeddings_medium_arabic
date: 2022-12-02
tags: [ar, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: ar
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-medium-arabic` is an Arabic model originally trained by `asafaya`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_medium_arabic_ar_4.2.4_3.0_1670020613211.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_medium_arabic_ar_4.2.4_3.0_1670020613211.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_medium_arabic", "ar") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_medium_arabic", "ar")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_medium_arabic|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|158.5 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/asafaya/bert-medium-arabic
- https://traces1.inria.fr/oscar/
- http://commoncrawl.org/
- https://dumps.wikimedia.org/backup-index.html
- https://github.com/google-research/bert
- https://www.tensorflow.org/tfrc
- https://github.com/alisafaya/Arabic-BERT

---
layout: model
title: Financial Relation Extraction on German Financial Statements
author: John Snow Labs
name: finre_has_value
date: 2023-03-25
tags: [re, licensed, finance, de, tensorflow]
task: Relation Extraction
language: de
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RelationExtractionDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is a relation extraction model that links financial entities to their values in text; it is meant to be used together with the `finner_financial_entity_value` NER model.

## Predicted Entities

`has_value`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finre_has_value_de_1.0.0_3.0_1679702750286.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finre_has_value_de_1.0.0_3.0_1679702750286.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sen = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_sentence_embeddings_financial", "de")\ .setInputCols("document", "token")\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) pos_tagger = nlp.PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"])\ .setOutputCol("pos_tags") dependency_parser = nlp.DependencyParserModel()\ .pretrained("dependency_conllu", "en")\ .setInputCols(["sentence", "pos_tags", "token"])\ .setOutputCol("dependencies") ner_model = finance.NerModel().pretrained('finner_financial_entity_value', 'de', 'finance/models)\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner1") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner1"])\ .setOutputCol("ner_chunks") re_ner_chunk_filter = finance.RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setOutputCol("re_ner_chunks")\ .setRelationPairs(["FINANCIAL_ENTITY-FINANCIAL_VALUE", "FINANCIAL_VALUE-FINANCIAL_ENTITY"]) reDL = finance.RelationExtractionDLModel().pretrained('finre_has_value', 'de', 'finance/models)\ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentence"])\ .setOutputCol("relations") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sen, tokenizer, embeddings, pos_tagger, dependency_parser, ner_model, ner_converter, re_ner_chunk_filter, reDL ]) empty_df = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_df) text= """Die Darlehensverbindlichkeit gegenüber der Vaillant GmbH in Höhe von TEUR 3.433 hat eine Laufzeit bis zum 31.12.2019 , die 
restlichen Verbindlichkeiten haben eine Restlaufzeit bis zu einem Jahr .""" sdf = spark.createDataFrame([[text]]).toDF("text") res = model.transform(sdf) res.show(20,truncate=False) result_df = res.select(F.explode(F.arrays_zip(res.relations.result, res.relations.metadata)).alias("cols")) \ .select( F.expr("cols['0']").alias("relations"),\ F.expr("cols['1']['entity1']").alias("relations_entity1"),\ F.expr("cols['1']['chunk1']" ).alias("relations_chunk1" ),\ F.expr("cols['1']['entity2']").alias("relations_entity2"),\ F.expr("cols['1']['chunk2']" ).alias("relations_chunk2" ),\ F.expr("cols['1']['confidence']" ).alias("confidence" ),\ F.expr("cols['1']['syntactic_distance']" ).alias("syntactic_distance" ),\ ).filter("relations!='other'") result_df.show() ```
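The `RENerChunksFilter` stage above forwards only those chunk pairs whose entity labels match one of the `setRelationPairs` patterns; the relation DL model then scores just these candidates. As a rough illustration of that filtering step (the chunk list below is hypothetical, and this sketch is not Spark NLP's internal code):

```python
from itertools import permutations

# Hypothetical chunks as (text, label) pairs from the NER stage.
chunks = [
    ("Darlehensverbindlichkeit", "FINANCIAL_ENTITY"),
    ("3.433", "FINANCIAL_VALUE"),
    ("Verbindlichkeiten", "FINANCIAL_ENTITY"),
]

# Label pairs allowed by setRelationPairs, in both directions.
allowed = {("FINANCIAL_ENTITY", "FINANCIAL_VALUE"),
           ("FINANCIAL_VALUE", "FINANCIAL_ENTITY")}

# Keep only ordered chunk pairs whose label pair is whitelisted.
candidates = [(a, b) for a, b in permutations(chunks, 2)
              if (a[1], b[1]) in allowed]
# 4 of the 6 ordered pairs survive; entity-entity pairs are dropped.
```

Restricting the candidate pairs this way keeps the relation model from scoring label combinations that can never form a `has_value` relation.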
## Results ```bash relations relations_entity1 relations_chunk1 relations_entity2 relations_chunk2 confidence syntactic_distance has_value FINANCIAL_ENTITY Darlehensverbindl... FINANCIAL_VALUE 3.433 1.0 8 has_value FINANCIAL_VALUE 3.433 FINANCIAL_ENTITY Verbindlichkeiten 1.0 undefined ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finre_has_value| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|de| |Size:|661.2 MB| ## Benchmarking ```bash Relation Recall Precision F1 Support has_value 1.000 1.000 1.000 100 Avg. 1.000 1.000 1.000 - Weighted Avg. 1.000 1.000 1.000 - ``` --- layout: model title: Pipeline to Detect Problems, Tests and Treatments (ner_clinical_large) author: John Snow Labs name: ner_clinical_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_clinical](https://nlp.johnsnowlabs.com/2021/03/31/ner_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_pipeline_en_4.3.0_3.2_1678866176867.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_pipeline_en_4.3.0_3.2_1678866176867.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_clinical_pipeline", "en", "clinical/models") text = '''The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_clinical_pipeline", "en", "clinical/models") val text = "The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. 
The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical.pipeline").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. 
BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:------------------------------------------------------------|--------:|------:|:------------|-------------:| | 0 | the G-protein-activated inwardly rectifying potassium (GIRK | 48 | 106 | TREATMENT | 0.6926 | | 1 | the genomicorganization | 142 | 164 | TREATMENT | 0.80715 | | 2 | a candidate gene forType II diabetes mellitus | 210 | 254 | PROBLEM | 0.754343 | | 3 | byapproximately | 380 | 394 | TREATMENT | 0.7924 | | 4 | single nucleotide polymorphisms | 464 | 494 | TREATMENT | 0.636967 | | 5 | aVal366Ala substitution | 532 | 554 | PROBLEM | 0.53615 | | 6 | an 8 base-pair | 561 | 574 | PROBLEM | 0.607733 | | 7 | insertion/deletion | 581 | 598 | PROBLEM | 0.8692 | | 8 | Ourexpression studies | 601 | 621 | TEST | 0.89975 | | 9 | the transcript in various humantissues | 648 | 685 | PROBLEM | 0.83306 | | 10 | fat andskeletal muscle | 749 | 770 | PROBLEM | 0.778133 | | 11 | furtherstudies | 830 | 843 | PROBLEM | 0.8789 | | 12 | the KCNJ9 protein | 864 | 880 | TREATMENT | 0.561033 | | 13 | evaluation | 892 | 901 | TEST | 0.9981 | | 14 | Type II diabetes | 940 | 955 | PROBLEM | 0.698967 | | 15 | the treatment | 1025 | 1037 | TREATMENT | 0.81195 | | 16 | breast cancer | 1042 | 1054 | PROBLEM | 0.9604 | | 17 | the standard therapy | 1067 | 1086 | TREATMENT | 0.757767 | | 18 | anthracyclines | 1125 | 1138 | TREATMENT | 0.9999 | | 19 | taxanes | 1144 | 1150 | TREATMENT | 0.9999 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Fake News Classifier author: John Snow Labs name: classifierdl_use_fakenews class: ClassifierDLModel language: en nav_key: 
models repository: public/models date: 03/07/2020 task: Text Classification edition: Spark NLP 2.5.3 spark_version: 2.4 tags: [classifier] supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Determine if news articles are Real or Fake. {:.h2_title} ## Predicted Entities ``REAL``, ``FAKE``. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_EN_FAKENEWS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_EN_FAKENEWS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_fakenews_en_2.5.3_2.4_1593783319296.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_fakenews_en_2.5.3_2.4_1593783319296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") use = UniversalSentenceEncoder.pretrained(lang="en") \ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") document_classifier = ClassifierDLModel.pretrained('classifierdl_use_fakenews', 'en') \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") nlpPipeline = Pipeline(stages=[documentAssembler, use, document_classifier]) light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate('Donald Trump a KGB Spy? 11/02/2016 In today’s video, Christopher Greene of AMTV reports Hillary Clinton') ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val use = UniversalSentenceEncoder.pretrained(lang="en") .setInputCols(Array("document")) .setOutputCol("sentence_embeddings") val document_classifier = ClassifierDLModel.pretrained("classifierdl_use_fakenews", "en") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, use, document_classifier)) val data = Seq("Donald Trump a KGB Spy? 11/02/2016 In today’s video, Christopher Greene of AMTV reports Hillary Clinton").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Donald Trump a KGB Spy? 11/02/2016 In today’s video, Christopher Greene of AMTV reports Hillary Clinton"""] fake_df = nlu.load('classify.fakenews.use').predict(text, output_level='document') fake_df[["document", "fakenews"]] ```
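The per-class F1 scores in the Benchmarking section below are the harmonic mean of precision and recall, which is easy to sanity-check in plain Python:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# FAKE class: precision 0.85, recall 0.90 as reported below.
score = f1(0.85, 0.90)
# score is ~0.874; the table reports 0.88 because it is computed
# from the unrounded underlying counts rather than the rounded
# precision/recall shown here.
```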
{:.h2_title} ## Results ```bash +--------------------------------------------------------------------------------------------------------+------------+ |document |class | +--------------------------------------------------------------------------------------------------------+------------+ |Donald Trump a KGB Spy? 11/02/2016 In today’s video, Christopher Greene of AMTV reports Hillary Clinton | FAKE | +--------------------------------------------------------------------------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| | Model Name | classifierdl_use_fakenews | | Model Class | ClassifierDLModel | | Spark Compatibility | 2.5.3 | | Spark NLP Compatibility | 2.4 | | License | open source | | Edition | public | | Input Labels | [document, sentence_embeddings] | | Output Labels | [class] | | Language | en | | Upstream Dependencies | tfhub_use | {:.h2_title} ## Data Source This model is trained on the fake news classification challenge. https://raw.githubusercontent.com/joolsa/fake_real_news_dataset/master/fake_or_real_news.csv.zip {:.h2_title} ## Benchmarking ```bash precision recall f1-score support FAKE 0.85 0.90 0.88 931 REAL 0.90 0.85 0.88 962 accuracy 0.88 1893 macro avg 0.88 0.88 0.88 1893 weighted avg 0.88 0.88 0.88 1893 ``` --- layout: model title: English BertForTokenClassification Cased model (from alexbrandsen) author: John Snow Labs name: bert_ner_ArcheoBERTje_NER date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ArcheoBERTje-NER` is an English model originally trained by `alexbrandsen`. 
## Predicted Entities `MAT`, `ART`, `CON`, `SPE`, `LOC`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_ArcheoBERTje_NER_en_4.0.0_3.0_1657107891355.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_ArcheoBERTje_NER_en_4.0.0_3.0_1657107891355.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_ArcheoBERTje_NER","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_ArcheoBERTje_NER","en") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
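The `BertForTokenClassification` output is one B-/I-/O tag per token; downstream, a converter stage typically merges these into entity chunks. A simplified sketch of that merging logic (an illustration only, not Spark NLP's actual implementation; the tokens and tags below are made up, using labels from the Predicted Entities list):

```python
def bio_to_chunks(tokens, tags):
    """Merge B-/I- token tags into (chunk_text, label) spans."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new chunk, closing any open one.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # An I- tag extends the open chunk of the same label.
            current.append(tok)
        else:
            # O tags (and stray I- tags) close the open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

bio_to_chunks(
    ["Roman", "pottery", "from", "Nijmegen"],
    ["B-ART", "I-ART", "O", "B-LOC"],
)
# → [("Roman pottery", "ART"), ("Nijmegen", "LOC")]
```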
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_ArcheoBERTje_NER| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|407.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/alexbrandsen/ArcheoBERTje-NER --- layout: model title: Voice of the Patients author: John Snow Labs name: ner_vop_wip date: 2023-03-30 tags: [en, licensed, public_health, voice_of_patients] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.0 spark_version: [3.0, 3.2] supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts healthcare-related terms from documents written in the patient’s own words. Note: the ‘wip’ suffix indicates that model development is work-in-progress; the model will be finalised and its performance improved in upcoming releases. 
## Predicted Entities `SubstanceQuantity`, `Measurements`, `Treatment`, `Modifier`, `RaceEthnicity`, `Allergen`, `TestResult`, `InjuryOrPoisoning`, `Frequency`, `MedicalDevice`, `Procedure`, `Duration`, `DateTime`, `HealthStatus`, `Route`, `Vaccine`, `Disease`, `Symptom`, `RelationshipStatus`, `Dosage`, `Substance`, `VitalTest`, `AdmissionDischarge`, `Test`, `Laterality`, `ClinicalDept`, `PsychologicalCondition`, `Age`, `BodyPart`, `Drug`, `Employment`, `Form` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/VOICE_OF_THE_PATIENTS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_wip_en_4.4.0_3.0_1680207981665.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_wip_en_4.4.0_3.0_1680207981665.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. 
I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . 
Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:----------------------------|:-----------------------| | 20 year old | Age | | girl | Gender | | hyperthyroid | Disease | | 1 month ago | DateTime | | weak | Symptom | | light headed,poor digestion | Symptom | | panic attacks | PsychologicalCondition | | depression | PsychologicalCondition | | left | Laterality | | chest | BodyPart | | pain | Symptom | | increased | TestResult | | heart rate | VitalTest | | rapidly | Modifier | | weight loss | Symptom | | 4 months | Duration | | hospital | ClinicalDept | | discharged | AdmissionDischarge | | hospital | ClinicalDept | | blood tests | Test | | brain | BodyPart | | mri | Test | | ultrasound scan | Test | | endoscopy | Procedure | | doctors | Employment | | homeopathy doctor | Employment | | he | Gender | | hyperthyroid | Disease | | TSH | Test | | 0.15 | TestResult | | T3 | Test | | T4 | Test | | normal | TestResult | | b12 deficiency | Disease | | vitamin D deficiency | Disease | | weekly | Frequency | | supplement | Drug | | vitamin D | Drug | | 1000 mcg | Dosage | | b12 | Drug | | daily | Frequency | | homeopathy medicine | Drug | | 40 days | Duration | | 2nd test | Test | | after 30 days | DateTime | | TSH | Test | | 0.5 | TestResult | | now | DateTime | | weakness | Symptom | | depression | PsychologicalCondition | | last week | DateTime | | rapid | TestResult | | heartrate | VitalTest | | allopathy medicine | Treatment | | homeopathy | Treatment | | thyroid | BodyPart | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_wip| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.9 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. 
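Each precision/recall/F1 figure in the benchmarking table below can be reproduced from its raw `tp`/`fp`/`fn` counts. For example, the `Symptom` row (tp=2877, fp=646, fn=318):

```python
def prf(tp: int, fp: int, fn: int):
    """Precision, recall and F1 from raw confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = prf(2877, 646, 318)
# Rounds to (0.82, 0.90, 0.86), matching the Symptom row.
```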
## Benchmarking ```bash label tp fp fn total precision recall f1 SubstanceQuantity 9 6 8 17 0.60 0.53 0.56 Measurements 30 15 31 61 0.67 0.49 0.57 Treatment 67 32 26 93 0.68 0.72 0.70 Modifier 638 253 184 822 0.72 0.78 0.74 RaceEthnicity 5 0 3 8 1.00 0.63 0.77 Allergen 7 0 4 11 1.00 0.64 0.78 TestResult 371 110 95 466 0.77 0.80 0.78 InjuryOrPoisoning 113 27 30 143 0.81 0.79 0.80 Frequency 424 81 112 536 0.84 0.79 0.81 MedicalDevice 186 42 39 225 0.82 0.83 0.82 Procedure 288 67 63 351 0.81 0.82 0.82 Duration 786 165 181 967 0.83 0.81 0.82 DateTime 1738 409 288 2026 0.81 0.86 0.83 HealthStatus 76 24 5 81 0.76 0.94 0.84 Route 38 2 13 51 0.95 0.75 0.84 Vaccine 22 3 5 27 0.88 0.81 0.85 Disease 1168 228 149 1317 0.84 0.89 0.86 Symptom 2877 646 318 3195 0.82 0.90 0.86 RelationshipStatus 18 0 5 23 1.00 0.78 0.88 Dosage 268 36 39 307 0.88 0.87 0.88 Substance 167 28 19 186 0.86 0.90 0.88 VitalTest 122 13 16 138 0.90 0.88 0.89 AdmissionDischarge 25 2 4 29 0.93 0.86 0.89 Test 699 84 93 792 0.89 0.88 0.89 Laterality 456 41 59 515 0.92 0.89 0.90 ClinicalDept 169 14 24 193 0.92 0.88 0.90 PsychologicalCondition 264 30 14 278 0.90 0.95 0.92 Age 250 24 17 267 0.91 0.94 0.92 BodyPart 2344 189 135 2479 0.93 0.95 0.94 Drug 962 53 46 1008 0.95 0.95 0.95 Employment 863 50 47 910 0.95 0.95 0.95 Form 201 15 7 208 0.93 0.97 0.95 Gender 1196 28 12 1208 0.98 0.99 0.98 macro_avg 16847 2717 2091 18938 0.86 0.83 0.84 micro_avg 16847 2717 2091 18938 0.86 0.89 0.88 ``` --- layout: model title: Spanish XlmRoBertaForQuestionAnswering (from saattrupdan) author: John Snow Labs name: xlm_roberta_qa_xlmr_base_texas_squad_es_es_saattrupdan date: 2022-06-24 tags: [es, open_source, question_answering, xlmroberta] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and 
curated to provide scalability and production-readiness using Spark NLP. `xlmr-base-texas-squad-es` is a Spanish model originally trained by `saattrupdan`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlmr_base_texas_squad_es_es_saattrupdan_es_4.0.0_3.0_1656065698550.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlmr_base_texas_squad_es_es_saattrupdan_es_4.0.0_3.0_1656065698550.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlmr_base_texas_squad_es_es_saattrupdan","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlmr_base_texas_squad_es_es_saattrupdan","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.squad_spanish_tuned.xlmr_roberta.base.by_saattrupdan").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
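Extractive QA models like this one score candidate start and end positions over the context and return the highest-scoring valid span. A toy illustration of that span selection with made-up scores (the real annotator operates on subword tokens and logits, not whole words):

```python
# Hypothetical per-token scores for the example context above.
context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end_scores   = [0.1, 0.1, 0.1, 4.0, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]

# Choose the (start, end) pair with the highest combined score,
# subject to start <= end.
best = max(
    ((s, e) for s in range(len(context)) for e in range(s, len(context))),
    key=lambda span: start_scores[span[0]] + end_scores[span[1]],
)
answer = " ".join(context[best[0]:best[1] + 1])
# answer == "Clara"
```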
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlmr_base_texas_squad_es_es_saattrupdan| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|es| |Size:|876.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/saattrupdan/xlmr-base-texas-squad-es --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_bert_triplet_epochs_1_shard_1_squad2.0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_bert_triplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_bert_triplet_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223322432.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_bert_triplet_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223322432.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_bert_triplet_epochs_1_shard_1_squad2.0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_bert_triplet_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_bert_triplet_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|460.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_bert_triplet_epochs_1_shard_1_squad2.0 --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from autoevaluate) author: John Snow Labs name: roberta_qa_autoevaluate_base_squad2 date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2` is an English model originally trained by `autoevaluate`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_autoevaluate_base_squad2_en_4.2.4_3.0_1669986662532.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_autoevaluate_base_squad2_en_4.2.4_3.0_1669986662532.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_autoevaluate_base_squad2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_autoevaluate_base_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_autoevaluate_base_squad2| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/autoevaluate/roberta-base-squad2 - https://haystack.deepset.ai/tutorials/first-qa-system - https://github.com/deepset-ai/haystack/ - https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/ - http://deepset.ai/ - https://haystack.deepset.ai/ - https://deepset.ai/german-bert - https://deepset.ai/germanquad - https://haystack.deepset.ai/community/join - https://twitter.com/deepset_ai - https://www.linkedin.com/company/deepset-ai/ - https://github.com/deepset-ai/haystack/discussions - https://deepset.ai - http://www.deepset.ai/jobs --- layout: model title: Spanish RoBERTa Embeddings (from mrm8488) author: John Snow Labs name: roberta_embeddings_RuPERTa_base date: 2022-04-14 tags: [roberta, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `RuPERTa-base` is a Spanish model originally trained by `mrm8488`.
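An embeddings model like this one emits one vector per token. When a single sentence-level vector is needed downstream (for a classifier, for instance), a common approach is mean pooling over the token vectors. The snippet below is a framework-free sketch of that idea — the three-dimensional vectors are invented for illustration; real RoBERTa-base vectors have 768 dimensions:

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

# Three fake 3-dimensional token embeddings (illustrative values only).
token_vecs = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [2.0, 4.0, 1.0]]
print(mean_pool(token_vecs))  # → [2.0, 2.0, 1.0]
```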
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_RuPERTa_base_es_3.4.2_3.0_1649945390775.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_RuPERTa_base_es_3.4.2_3.0_1649945390775.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_RuPERTa_base","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_RuPERTa_base","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.RuPERTa_base").predict("""Me encanta chispa nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_RuPERTa_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|472.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/mrm8488/RuPERTa-base - https://github.com/pytorch/fairseq/tree/master/examples/roberta - https://github.com/josecannete/spanish-corpora - https://colab.research.google.com/github/mrm8488/shared_colab_notebooks/blob/master/RuPERTa_base_finetuned_POS.ipynb - https://www.tensorflow.org/tfrc - https://twitter.com/mrm8488 --- layout: model title: Dutch Named Entity Recognition (from Davlan) author: John Snow Labs name: distilbert_ner_distilbert_base_multilingual_cased_ner_hrl date: 2022-05-16 tags: [distilbert, ner, token_classification, nl, open_source] task: Named Entity Recognition language: nl edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-multilingual-cased-ner-hrl` is a Dutch model originally trained by `Davlan`. ## Predicted Entities `DATE`, `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_distilbert_base_multilingual_cased_ner_hrl_nl_3.4.2_3.0_1652721841985.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_distilbert_base_multilingual_cased_ner_hrl_nl_3.4.2_3.0_1652721841985.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_distilbert_base_multilingual_cased_ner_hrl","nl") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ik hou van Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_distilbert_base_multilingual_cased_ner_hrl","nl") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ik hou van Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_ner_distilbert_base_multilingual_cased_ner_hrl| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|nl| |Size:|505.8 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Davlan/distilbert-base-multilingual-cased-ner-hrl - https://camel.abudhabi.nyu.edu/anercorp/ - https://www.clips.uantwerpen.be/conll2003/ner/ - https://www.clips.uantwerpen.be/conll --- layout: model title: Japanese BertForTokenClassification Cased model (from jurabi) author: John Snow Labs name: bert_token_classifier_ner_japanese date: 2022-11-30 tags: [ja, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: ja edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-ner-japanese` is a Japanese model originally trained by `jurabi`. ## Predicted Entities `イベント名`, `その他の組織名`, `施設名`, `製品名`, `法人名`, `人名`, `地名`, `政治的組織名` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_ner_japanese_ja_4.2.4_3.0_1669815393314.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_ner_japanese_ja_4.2.4_3.0_1669815393314.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_japanese","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_japanese","ja") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_japanese| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|ja| |Size:|415.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/jurabi/bert-ner-japanese - https://github.com/stockmarkteam/ner-wikipedia-dataset - https://github.com/jurabiinc/bert-ner-japanese - https://creativecommons.org/licenses/by-sa/3.0/ --- layout: model title: Russian BertForQuestionAnswering Large Cased model (from Timur1984) author: John Snow Labs name: bert_qa_timur1984_sbert_large_nlu_ru_finetuned_squad_full date: 2022-07-07 tags: [ru, open_source, bert, question_answering] task: Question Answering language: ru edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `sbert_large_nlu_ru-finetuned-squad-full` is a Russian model originally trained by `Timur1984`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_timur1984_sbert_large_nlu_ru_finetuned_squad_full_ru_4.0.0_3.0_1657191520660.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_timur1984_sbert_large_nlu_ru_finetuned_squad_full_ru_4.0.0_3.0_1657191520660.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_timur1984_sbert_large_nlu_ru_finetuned_squad_full","ru") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Как меня зовут?", "Меня зовут Клара, и я живу в Беркли."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_timur1984_sbert_large_nlu_ru_finetuned_squad_full","ru") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Как меня зовут?", "Меня зовут Клара, и я живу в Беркли.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_timur1984_sbert_large_nlu_ru_finetuned_squad_full| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ru| |Size:|1.6 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Timur1984/sbert_large_nlu_ru-finetuned-squad-full --- layout: model title: SDOH Under Treatment For Classification author: John Snow Labs name: genericclassifier_sdoh_under_treatment_sbiobert_cased_mli date: 2023-04-10 tags: [en, licenced, clinical, sdoh, generic_classifier, under_treatment, biobert, licensed] task: Text Classification language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: GenericClassifierModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Generic Classifier model is intended for detecting if the patient is under treatment or not. If under treatment is not mentioned in the text, it is regarded as "not under treatment". The model was trained using the GenericClassifierApproach annotator. `Under_Treatment`: The patient is under treatment. `Not_Under_Treatment_Or_Not_Mentioned`: The patient is not under treatment or it is not mentioned in the clinical notes. ## Predicted Entities `Under_Treatment`, `Not_Under_Treatment_Or_Not_Mentioned` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_under_treatment_sbiobert_cased_mli_en_4.3.2_3.2_1681116067393.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_under_treatment_sbiobert_cased_mli_en_4.3.2_3.2_1681116067393.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") features_asm = FeaturesAssembler()\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("features") generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_under_treatment_sbiobert_cased_mli", 'en', 'clinical/models')\ .setInputCols(["features"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, sentence_embeddings, features_asm, generic_classifier ]) text_list = ["""Sarah, a 55-year-old woman with a history of high cholesterol and a family history of heart disease, presented to her primary care physician with complaints of chest pain and shortness of breath. After a thorough evaluation, Sarah was diagnosed with coronary artery disease (CAD), a condition that can lead to heart attacks and other serious complications. To manage her CAD, Sarah was started on a treatment plan that included medication to lower her cholesterol and blood pressure, as well as aspirin to prevent blood clots. In addition to medication, Sarah was advised to make lifestyle modifications such as improving her diet, quitting smoking, and increasing physical activity. Over the course of several months, Sarah's symptoms improved, and follow-up tests showed that her cholesterol and blood pressure were within the target range. However, Sarah continued to experience occasional chest pain, and her medication regimen was adjusted accordingly. With regular follow-up appointments and adherence to her treatment plan, Sarah's CAD remained under control, and she was able to resume her normal activities with improved quality of life. 
""", """John, a 60-year-old man with a history of smoking and high blood pressure, presented to his primary care physician with complaints of chest pain and shortness of breath. Further tests revealed that John had a blockage in one of his coronary arteries, which required urgent intervention. However, John was hesitant to undergo treatment, citing concerns about potential complications and side effects of medications and procedures. Despite the physician's recommendations and attempts to educate John about the risks of leaving the blockage untreated, John ultimately chose not to pursue any treatment. Over the next several months, John continued to experience symptoms, which progressively worsened, and he ultimately required hospitalization for a heart attack. The medical team attempted to intervene at that point, but the damage to John's heart was severe, and his prognosis was poor. """] df = spark.createDataFrame(text_list, StringType()).toDF("text") result = pipeline.fit(df).transform(df) result.select("text", "class.result").show(truncate=100) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence_embeddings") val features_asm = new FeaturesAssembler() .setInputCols("sentence_embeddings") .setOutputCol("features") val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_under_treatment_sbiobert_cased_mli", "en", "clinical/models") .setInputCols("features") .setOutputCol("class") val pipeline = new PipelineModel().setStages(Array( document_assembler, sentence_embeddings, features_asm, generic_classifier)) val data = Seq(Array("""Sarah, a 55-year-old woman with a history of high cholesterol and a family history of heart disease, presented to her primary care physician with complaints of chest pain and shortness of breath. 
After a thorough evaluation, Sarah was diagnosed with coronary artery disease (CAD), a condition that can lead to heart attacks and other serious complications. To manage her CAD, Sarah was started on a treatment plan that included medication to lower her cholesterol and blood pressure, as well as aspirin to prevent blood clots. In addition to medication, Sarah was advised to make lifestyle modifications such as improving her diet, quitting smoking, and increasing physical activity. Over the course of several months, Sarah's symptoms improved, and follow-up tests showed that her cholesterol and blood pressure were within the target range. However, Sarah continued to experience occasional chest pain, and her medication regimen was adjusted accordingly. With regular follow-up appointments and adherence to her treatment plan, Sarah's CAD remained under control, and she was able to resume her normal activities with improved quality of life. """, """John, a 60-year-old man with a history of smoking and high blood pressure, presented to his primary care physician with complaints of chest pain and shortness of breath. Further tests revealed that John had a blockage in one of his coronary arteries, which required urgent intervention. However, John was hesitant to undergo treatment, citing concerns about potential complications and side effects of medications and procedures. Despite the physician's recommendations and attempts to educate John about the risks of leaving the blockage untreated, John ultimately chose not to pursue any treatment. Over the next several months, John continued to experience symptoms, which progressively worsened, and he ultimately required hospitalization for a heart attack. The medical team attempted to intervene at that point, but the damage to John's heart was severe, and his prognosis was poor. """)).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash +----------------------------------------------------------------------------------------------------+--------------------------------------+ | text| result| +----------------------------------------------------------------------------------------------------+--------------------------------------+ |Sarah, a 55-year-old woman with a history of high cholesterol and a family history of heart disea...| [Under_Treatment]| |John, a 60-year-old man with a history of smoking and high blood pressure, presented to his prima...|[Not_Under_Treatment_Or_Not_Mentioned]| +----------------------------------------------------------------------------------------------------+--------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|genericclassifier_sdoh_under_treatment_sbiobert_cased_mli| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[features]| |Output Labels:|[prediction]| |Language:|en| |Size:|3.4 MB| |Dependencies:|sbiobert_base_cased_mli| ## References Internal SDOH Project ## Benchmarking ```bash label precision recall f1-score support Not_Under_Treatment_Or_Not_Mentioned 0.86 0.68 0.76 222 Under_Treatment 0.86 0.94 0.90 450 accuracy - - 0.86 672 macro-avg 0.86 0.81 0.83 672 weighted-avg 0.86 0.86 0.85 672 ``` --- layout: model title: Part of Speech for Greek author: John Snow Labs name: pos_ud_gdt date: 2020-05-05 16:56:00 +0800 task: Part of Speech Tagging language: el edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [pos, el] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. 
The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_gdt_el_2.5.0_2.4_1588686949851.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_gdt_el_2.5.0_2.4_1588686949851.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_gdt", "el") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Εκτός από το ότι είναι ο βασιλιάς του Βορρά, ο John Snow είναι Άγγλος γιατρός και ηγέτης στην ανάπτυξη της αναισθησίας και της ιατρικής υγιεινής.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_gdt", "el") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Εκτός από το ότι είναι ο βασιλιάς του Βορρά, ο John Snow είναι Άγγλος γιατρός και ηγέτης στην ανάπτυξη της αναισθησίας και της ιατρικής υγιεινής.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Εκτός από το ότι είναι ο βασιλιάς του Βορρά, ο John Snow είναι Άγγλος γιατρός και ηγέτης στην ανάπτυξη της αναισθησίας και της ιατρικής υγιεινής."""] pos_df = nlu.load('el.pos.ud_gdt').predict(text) pos_df ```
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=4, result='ADV', metadata={'word': 'Εκτός'}), Row(annotatorType='pos', begin=6, end=8, result='ADP', metadata={'word': 'από'}), Row(annotatorType='pos', begin=10, end=11, result='DET', metadata={'word': 'το'}), Row(annotatorType='pos', begin=13, end=15, result='SCONJ', metadata={'word': 'ότι'}), Row(annotatorType='pos', begin=17, end=21, result='AUX', metadata={'word': 'είναι'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_gdt| |Type:|pos| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|el| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Part of Speech for Chinese author: John Snow Labs name: pos_ud_gsd date: 2021-03-08 tags: [part_of_speech, open_source, chinese, pos_ud_gsd, zh] task: Part of Speech Tagging language: zh edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. 
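To give a flavour of what an averaged perceptron tagger does (a simplified sketch, not the annotator's actual code): each token's features score candidate tags against a weight table, and on a wrong guess weight is shifted toward the gold tag. The sketch below shows a single-example training loop; the weight-averaging step that gives the architecture its name is omitted for brevity, and the `word=` feature template is a hypothetical example:

```python
from collections import defaultdict

def score(weights, features, tag):
    """Sum the weights of all (feature, tag) pairs."""
    return sum(weights[(f, tag)] for f in features)

def train_step(weights, features, gold, tags):
    """One perceptron update: reward the gold tag, penalise a wrong guess."""
    guess = max(tags, key=lambda t: score(weights, features, t))
    if guess != gold:
        for f in features:
            weights[(f, gold)] += 1.0
            weights[(f, guess)] -= 1.0

tags = ["NOUN", "ADV"]
weights = defaultdict(float)
for _ in range(2):  # a couple of passes over one toy example
    train_step(weights, ["word=好"], "ADV", tags)

best = max(tags, key=lambda t: score(weights, ["word=好"], t))
print(best)  # → ADV
```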
## Predicted Entities - VERB - ADJ - PUNCT - ADV - NUM - NOUN - PRON - PART - X - ADP - DET - CONJ - PROPN - AUX - SYM {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_zh_3.0.0_3.0_1615230264489.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_zh_3.0.0_3.0_1615230264489.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_gsd", "zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['从John Snow Labs你好! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_gsd", "zh") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("从John Snow Labs你好! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["从John Snow Labs你好! "] token_df = nlu.load('zh.pos.ud_gsd').predict(text) token_df ```
## Results ```bash token pos 0 从 ADP 1 JohnSnowLabs X 2 你 PRON 3 好 ADV 4 ! VERB ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_gsd| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|zh| --- layout: model title: Legal Conditions Of Underwriters Obligations Clause Binary Classifier author: John Snow Labs name: legclf_conditions_of_underwriters_obligations_clause date: 2023-01-29 tags: [en, legal, classification, conditions, underwriters, obligations, clauses, conditions_of_underwriters_obligations, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `conditions-of-underwriters-obligations` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
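The paragraph-splitting strategy mentioned above can be as simple as breaking the document on blank lines and keeping only chunks that fit the encoder's budget. The sketch below is a minimal, hypothetical pre-processing step — the 512 figure comes from the embeddings limit noted above, and the whitespace word count is only a rough proxy for the model's actual tokenizer:

```python
def split_paragraphs(text, max_tokens=512):
    """Split on blank lines; flag paragraphs that fit a rough token budget."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

# Hypothetical two-paragraph excerpt from a legal document.
doc = ("CONDITIONS OF UNDERWRITERS' OBLIGATIONS.\n\n"
       "The obligations of the Underwriters hereunder are subject to the accuracy "
       "of the representations and warranties of the Company.")
for para, fits in split_paragraphs(doc):
    print(fits, para[:40])
```

Each resulting paragraph can then be fed to the classifier as a separate row of the input DataFrame.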
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `conditions-of-underwriters-obligations`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_conditions_of_underwriters_obligations_clause_en_1.0.0_3.0_1675005111493.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_conditions_of_underwriters_obligations_clause_en_1.0.0_3.0_1675005111493.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_conditions_of_underwriters_obligations_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[conditions-of-underwriters-obligations]| |[other]| |[other]| |[conditions-of-underwriters-obligations]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_conditions_of_underwriters_obligations_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support conditions-of-underwriters-obligations 0.98 0.98 0.98 55 other 0.99 0.99 0.99 104 accuracy - - 0.99 159 macro-avg 0.99 0.99 0.99 159 weighted-avg 0.99 0.99 0.99 159 ``` --- layout: model title: English BertForQuestionAnswering model (from juliusco) author: John Snow Labs name: bert_qa_biobert_base_cased_v1.1_squad_finetuned_covbiobert date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobert-base-cased-v1.1-squad-finetuned-covbiobert` is an English model originally trained by `juliusco`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_base_cased_v1.1_squad_finetuned_covbiobert_en_4.0.0_3.0_1654185623703.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_base_cased_v1.1_squad_finetuned_covbiobert_en_4.0.0_3.0_1654185623703.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biobert_base_cased_v1.1_squad_finetuned_covbiobert","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_biobert_base_cased_v1.1_squad_finetuned_covbiobert","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.covid_biobert.base_cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
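Under the hood, extractive QA models of this kind score every token as a potential answer start and end, and the answer is the span maximizing the combined score. A toy sketch of that span selection, with illustrative logits that are not actual model output:

```python
def best_span(start_logits, end_logits, max_len=10):
    """Pick (start, end) with start <= end maximizing the summed logits."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara"]
start_logits = [0.1, 0.2, 0.3, 2.0]  # made-up scores for illustration
end_logits = [0.0, 0.1, 0.2, 2.5]
s, e = best_span(start_logits, end_logits)
print(tokens[s:e + 1])  # ['Clara']
```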
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_biobert_base_cased_v1.1_squad_finetuned_covbiobert| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|403.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/juliusco/biobert-base-cased-v1.1-squad-finetuned-covbiobert --- layout: model title: Pipeline to Detect Clinical Entities author: John Snow Labs name: ner_jsl_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_jsl](https://nlp.johnsnowlabs.com/2021/06/24/ner_jsl_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_pipeline_en_3.4.1_3.0_1647868891528.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_pipeline_en_3.4.1_3.0_1647868891528.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_jsl_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` ```scala val pipeline = new PretrainedPipeline("ner_jsl_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. 
His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
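The NerConverter stage inside this pipeline merges the token-level BIO tags emitted by the NER model into the chunks shown in the results. A minimal sketch of that merging logic, in pure Python and independent of Spark NLP:

```python
def bio_to_chunks(tokens, tags):
    """Merge token-level BIO tags (B-X / I-X / O) into (chunk, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" or an inconsistent I- tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["21-day-old", "Caucasian", "male", "with", "congestion"]
tags = ["B-Age", "B-Race_Ethnicity", "B-Gender", "O", "B-Symptom"]
print(bio_to_chunks(tokens, tags))
# [('21-day-old', 'Age'), ('Caucasian', 'Race_Ethnicity'), ('male', 'Gender'), ('congestion', 'Symptom')]
```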
## Results ```bash +-----------------------------------------+----------------------------+ |chunk |ner_label | +-----------------------------------------+----------------------------+ |21-day-old |Age | |Caucasian |Race_Ethnicity | |male |Gender | |for 2 days |Duration | |congestion |Symptom | |mom |Gender | |yellow |Modifier | |discharge |Symptom | |nares |External_body_part_or_region| |she |Gender | |mild |Modifier | |problems with his breathing while feeding|Symptom | |perioral cyanosis |Symptom | |retractions |Symptom | |One day ago |RelativeDate | |mom |Gender | |Tylenol |Drug_BrandName | |Baby |Age | |decreased p.o. intake |Symptom | |His |Gender | +-----------------------------------------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Word2Vec Embeddings in Bavarian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, bar, open_source] task: Embeddings language: bar edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_bar_3.4.1_3.0_1647286112336.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_bar_3.4.1_3.0_1647286112336.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","bar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","bar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("bar.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
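A common downstream use of these 300-dimensional vectors is comparing tokens by cosine similarity. A self-contained sketch, with toy low-dimensional vectors standing in for real embedding output:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product over the product of L2 norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```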
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|bar| |Size:|407.4 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Self Report Classifier (BioBERT) author: John Snow Labs name: bert_sequence_classifier_vop_self_report date: 2023-06-13 tags: [licensed, classification, vop, en, self_report, tensorflow] task: Text Classification language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true engine: tensorflow annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [BioBERT-based](https://github.com/dmis-lab/biobert) classifier that can classify texts depending on whether they are self-reported or refer to another person. ## Predicted Entities `1st_Person`, `3rd_Person` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_self_report_en_4.4.3_3.0_1686671270069.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_self_report_en_4.4.3_3.0_1686671270069.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_vop_self_report", "en", "clinical/models")\ .setInputCols(["document", "token"])\ .setOutputCol("prediction") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame(["My friend was treated for her skin cancer two years ago.", "I started with dysphagia in 2021, then, a few weeks later, felt weakness in my legs, followed by a severe diarrhea."], StringType()).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_vop_self_report", "en", "clinical/models") .setInputCols(Array("document", "token")) .setOutputCol("prediction") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq(Array("My friend was treated for her skin cancer two years ago.", "I started with dysphagia in 2021, then, a few weeks later, felt weakness in my legs, followed by a severe diarrhea.")).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
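The Benchmarking section of this card reports both macro and weighted averages. Macro averaging treats each class equally, while weighted averaging weights each class by its support; with the per-class F1 scores and supports from this model's benchmark table, both averages can be reproduced directly:

```python
# Per-class (f1, support) pairs taken from this model's benchmark table.
per_class = {"1st_Person": (0.958333, 70), "3rd_Person": (0.928571, 44)}

macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)
total = sum(support for _, support in per_class.values())
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total

print(round(macro_f1, 6))     # 0.943452
print(round(weighted_f1, 6))  # 0.946846
```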
## Results ```bash +-------------------------------------------------------------------------------------------------------------------+------------+ |text |result | +-------------------------------------------------------------------------------------------------------------------+------------+ |My friend was treated for her skin cancer two years ago. |[3rd_Person]| |I started with dysphagia in 2021, then, a few weeks later, felt weakness in my legs, followed by a severe diarrhea.|[1st_Person]| +-------------------------------------------------------------------------------------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_vop_self_report| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset “Hello,I’m 20 year old girl. I’m diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I’m taking weekly supplement of vitamin D and 1000 mcg b12 daily. I’m taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. 
I feel a little bit relief from weakness and depression but I’m facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.” ## Benchmarking ```bash label precision recall f1-score support 1st_Person 0.932432 0.985714 0.958333 70 3rd_Person 0.975000 0.886364 0.928571 44 accuracy - - 0.947368 114 macro_avg 0.953716 0.936039 0.943452 114 weighted_avg 0.948862 0.947368 0.946846 114 ``` --- layout: model title: English BertForQuestionAnswering model (from mrm8488) author: John Snow Labs name: bert_qa_bert_small_2_finetuned_squadv2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-small-2-finetuned-squadv2` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_small_2_finetuned_squadv2_en_4.0.0_3.0_1654145692046.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_small_2_finetuned_squadv2_en_4.0.0_3.0_1654145692046.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_small_2_finetuned_squadv2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_small_2_finetuned_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.small_v2.by_mrm8488").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_small_2_finetuned_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|130.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/bert-small-2-finetuned-squadv2 --- layout: model title: Fast Neural Machine Translation Model from English to Marshallese author: John Snow Labs name: opus_mt_en_mh date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, mh, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `mh` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_mh_xx_2.7.0_2.4_1609167210626.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_mh_xx_2.7.0_2.4_1609167210626.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_mh", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_mh", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.mh').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_mh| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from Irish to English author: John Snow Labs name: opus_mt_ga_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ga, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `ga` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ga_en_xx_2.7.0_2.4_1609167339997.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ga_en_xx_2.7.0_2.4_1609167339997.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ga_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ga_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ga.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ga_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_declutr_model_squad2.0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `declutr-model_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_declutr_model_squad2.0_en_4.0.0_3.0_1655728196031.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_declutr_model_squad2.0_en_4.0.0_3.0_1655728196031.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_declutr_model_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_declutr_model_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.declutr.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_declutr_model_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|307.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/declutr-model_squad2.0 --- layout: model title: English asr_wav2vec2_base_timit_demo_colab6_by_hassnain TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab6_by_hassnain date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab6_by_hassnain` is an English model originally trained by hassnain. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab6_by_hassnain_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab6_by_hassnain_en_4.2.0_3.0_1664040541841.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab6_by_hassnain_en_4.2.0_3.0_1664040541841.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab6_by_hassnain', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab6_by_hassnain", lang = "en") val annotations = pipeline.transform(audioDF) ```
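The Wav2Vec2ForCTC stage of this pipeline turns per-frame token ids into text via greedy CTC decoding: collapse consecutive repeats, then drop the blank token. A minimal sketch of that decoding step with toy ids (not Spark NLP's actual implementation):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    # Collapse consecutive repeats, then drop the CTC blank id.
    decoded, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            decoded.append(i)
        prev = i
    return decoded

# 0 is the blank; a blank between two identical ids keeps them distinct.
print(ctc_greedy_decode([0, 7, 7, 0, 7, 3, 3, 0]))  # [7, 7, 3]
```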
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab6_by_hassnain| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Named Entity Recognition - ELECTRA Small (OntoNotes) author: John Snow Labs name: onto_electra_small_uncased date: 2020-12-05 task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [ner, open_source, en] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Onto is a Named Entity Recognition (or NER) model trained on OntoNotes 5.0. It can extract up to 18 entities such as people, places, organizations, money, time, date, etc. This model uses the pretrained `electra_small_uncased` embeddings model from the `BertEmbeddings` annotator as an input. ## Predicted Entities `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_electra_small_uncased_en_2.7.0_2.4_1607202932422.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_electra_small_uncased_en_2.7.0_2.4_1607202932422.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("electra_small_uncased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") ner_onto = NerDLModel.pretrained("onto_electra_small_uncased", "en") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."]], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("electra_small_uncased", "en") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_onto = NerDLModel.pretrained("onto_electra_small_uncased", "en") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter)) val data = Seq("William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."""] ner_df = nlu.load('en.ner.onto.electra.uncased_small').predict(text, output_level='chunk') ner_df[["entities", "entities_class"]] ```
{:.h2_title} ## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |William Henry Gates III|PERSON | |October 28, 1955 |DATE | |American |NORP | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PERSON | |May 2014 |DATE | |1970s |DATE | |1980s |DATE | |Seattle |GPE | |Washington |GPE | |Gates |PERSON | |Microsoft |ORG | |Paul Allen |PERSON | |1975 |DATE | |Albuquerque |GPE | |New Mexico |GPE | |Gates |PERSON | |January 2000 |DATE | |the late 1990s |DATE | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_electra_small_uncased| |Type:|ner| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source The model is trained based on data from [OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) ## Benchmarking ```bash Micro-average: prec: 0.87234557, rec: 0.8584134, f1: 0.8653234 CoNLL Eval: processed 152728 tokens with 11257 phrases; found: 11149 phrases; correct: 9598. 
accuracy:  97.22%; precision:  86.09%; recall:  85.26%; FB1:  85.67

label         correct   gold   found   precision   recall     FB1
CARDINAL          789    935     948      83.23%   84.39%   83.80
DATE             1400   1602    1659      84.39%   87.39%   85.86
EVENT              31     63      50      62.00%   49.21%   54.87
FAC                72    135     111      64.86%   53.33%   58.54
GPE              2086   2240    2197      94.95%   93.12%   94.03
LANGUAGE            8     22      10      80.00%   36.36%   50.00
LAW                21     40      34      61.76%   52.50%   56.76
LOC               114    179     201      56.72%   63.69%   60.00
MONEY             282    314     321      87.85%   89.81%   88.82
NORP              786    841     848      92.69%   93.46%   93.07
ORDINAL           180    195     227      79.30%   92.31%   85.31
ORG              1359   1795    1616      84.10%   75.71%   79.68
PERCENT           312    349     349      89.40%   89.40%   89.40
PERSON           1852   1988    2059      89.95%   93.16%   91.52
PRODUCT            32     76      69      46.38%   42.11%   44.14
QUANTITY           86    105     105      81.90%   81.90%   81.90
TIME              124    212     207      59.90%   58.49%   59.19
WORK_OF_ART        64    166     138      46.38%   38.55%   42.11
```

---
layout: model
title: Google's Tapas Table Understanding (Tiny, WTQ)
author: John Snow Labs
name: table_qa_tapas_tiny_finetuned_wtq
date: 2022-09-30
tags: [en, table, qa, question, answering, open_source]
task: Table Question Answering
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: TapasForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is a Zero-shot Table Understanding Model which allows you to carry out Question Answering on Spark DataFrames. If you have a file stored in a table format such as CSV, load it into a Spark DataFrame before use.
Size of this model: Tiny

Has aggregation operations?: True

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_tiny_finetuned_wtq_en_4.2.0_3.0_1664530892606.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_tiny_finetuned_wtq_en_4.2.0_3.0_1664530892606.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
json_data = """
{
  "header": ["name", "money", "age"],
  "rows": [
    ["Donald Trump", "$100,000,000", "75"],
    ["Elon Musk", "$20,000,000,000,000", "55"]
  ]
}
"""

queries = [
    "Who earns less than 200,000,000?",
    "Who earns 100,000,000?",
    "How much money has Donald Trump?",
    "How old are they?",
]

data = spark.createDataFrame([
    [json_data, " ".join(queries)]
]).toDF("table_json", "questions")

document_assembler = MultiDocumentAssembler() \
    .setInputCols("table_json", "questions") \
    .setOutputCols("document_table", "document_questions")

sentence_detector = SentenceDetector() \
    .setInputCols(["document_questions"]) \
    .setOutputCol("questions")

table_assembler = TableAssembler()\
    .setInputCols(["document_table"])\
    .setOutputCol("table")

tapas = TapasForQuestionAnswering\
    .pretrained("table_qa_tapas_tiny_finetuned_wtq","en")\
    .setInputCols(["questions", "table"])\
    .setOutputCol("answers")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    table_assembler,
    tapas
])

model = pipeline.fit(data)
model\
    .transform(data)\
    .selectExpr("explode(answers) AS answer")\
    .select("answer")\
    .show(truncate=False)
```
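The snippet above hard-codes `json_data`. If the table starts life as a CSV file, the same `{"header": ..., "rows": ...}` payload can be built with the standard library before it reaches `TableAssembler`. This is an illustrative pre-processing sketch, not part of the Spark NLP API:

```python
import csv
import io
import json

def csv_to_table_json(csv_text: str) -> str:
    """Convert CSV text into the {"header": ..., "rows": ...} JSON
    structure used for the table_json column above."""
    reader = csv.reader(io.StringIO(csv_text))
    rows = [row for row in reader if row]
    # First row becomes the header, the rest become data rows.
    return json.dumps({"header": rows[0], "rows": rows[1:]})

csv_text = (
    "name,money,age\n"
    'Donald Trump,"$100,000,000",75\n'
    'Elon Musk,"$20,000,000,000,000",55\n'
)
json_data = csv_to_table_json(csv_text)
```

The resulting string can be placed into the `table_json` column exactly as in the pipeline above.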
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |answer | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} | |{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|table_qa_tapas_tiny_finetuned_wtq| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|17.4 MB| |Case sensitive:|false| ## References https://www.microsoft.com/en-us/download/details.aspx?id=54253 https://github.com/ppasupat/WikiTableQuestions --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_4_h_256 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and 
curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-4_H-256` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_256_zh_4.2.4_3.0_1670021663275.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_256_zh_4.2.4_3.0_1670021663275.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_256","zh") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_256","zh")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_4_h_256|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|33.3 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/uer/chinese_roberta_L-4_H-256
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://cloud.tencent.com/

---
layout: model
title: English asr_english_filipino_wav2vec2_l_xls_r_test_06 TFWav2Vec2ForCTC from Khalsuu
author: John Snow Labs
name: asr_english_filipino_wav2vec2_l_xls_r_test_06
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_english_filipino_wav2vec2_l_xls_r_test_06` is an English model originally trained by Khalsuu.

NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_english_filipino_wav2vec2_l_xls_r_test_06_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_english_filipino_wav2vec2_l_xls_r_test_06_en_4.2.0_3.0_1664115006095.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_english_filipino_wav2vec2_l_xls_r_test_06_en_4.2.0_3.0_1664115006095.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
    .setInputCol("audio_content") \
    .setOutputCol("audio_assembler")

speech_to_text = Wav2Vec2ForCTC \
    .pretrained("asr_english_filipino_wav2vec2_l_xls_r_test_06", "en")\
    .setInputCols("audio_assembler") \
    .setOutputCol("text")

pipeline = Pipeline(stages=[
    audio_assembler,
    speech_to_text,
])

pipelineModel = pipeline.fit(audioDf)

pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
    .setInputCol("audio_content")
    .setOutputCol("audio_assembler")

val speechToText = Wav2Vec2ForCTC
    .pretrained("asr_english_filipino_wav2vec2_l_xls_r_test_06", "en")
    .setInputCols("audio_assembler")
    .setOutputCol("text")

val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))

val pipelineModel = pipeline.fit(audioDf)

val pipelineDF = pipelineModel.transform(audioDf)
```
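`audioDf` above is assumed to already hold the raw audio samples as an array of floats in its `audio_content` column. As a hedged illustration of producing such floats, 16-bit PCM WAV data can be decoded with only the Python standard library (in practice a library such as librosa is more convenient); the file name, duration, and sample rate here are made up for the sketch:

```python
import struct
import wave

def wav_to_floats(path: str) -> list:
    """Decode a mono 16-bit PCM WAV file to floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expects 16-bit samples"
        assert wf.getnchannels() == 1, "expects mono audio"
        raw = wf.readframes(wf.getnframes())
    # Each sample is a little-endian signed short; normalize by 2**15.
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [s / 32768.0 for s in samples]

# Write a tiny silent clip so the sketch is self-contained.
with wave.open("tiny.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 160)  # 10 ms of silence at 16 kHz

floats = wav_to_floats("tiny.wav")
```

A list like `floats` is what would populate the `audio_content` column of `audioDf` before `AudioAssembler` runs.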
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_english_filipino_wav2vec2_l_xls_r_test_06|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|

---
layout: model
title: Translate West Slavic languages to English Pipeline
author: John Snow Labs
name: translate_zlw_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, zlw, en, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `zlw`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_zlw_en_xx_2.7.0_2.4_1609686075998.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_zlw_en_xx_2.7.0_2.4_1609686075998.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_zlw_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_zlw_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.zlw.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_zlw_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Arabic Bert Embeddings (Base, Trained on an eighth of the full MSA dataset)
author: John Snow Labs
name: bert_embeddings_bert_base_arabic_camelbert_msa_eighth
date: 2022-04-11
tags: [bert, embeddings, ar, open_source]
task: Embeddings
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-msa-eighth` is an Arabic model originally trained by `CAMeL-Lab`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabic_camelbert_msa_eighth_ar_3.4.2_3.0_1649678829456.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabic_camelbert_msa_eighth_ar_3.4.2_3.0_1649678829456.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabic_camelbert_msa_eighth","ar") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])

data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabic_camelbert_msa_eighth","ar")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))

val data = Seq("أنا أحب شرارة NLP").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("ar.embed.bert_base_arabic_camelbert_msa_eighth").predict("""أنا أحب شرارة NLP""")
```
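Each annotation in the `embeddings` output column carries one dense vector per token. A common downstream step is comparing those vectors with cosine similarity; the sketch below uses tiny toy 3-dimensional vectors rather than real model output (real CAMeLBERT vectors are much wider), so the numbers are purely illustrative:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for token embedding vectors.
v1 = [0.2, 0.1, 0.7]
v2 = [0.2, 0.1, 0.7]
v3 = [-0.7, 0.1, 0.2]

same = cosine_similarity(v1, v2)      # identical vectors: similarity near 1.0
different = cosine_similarity(v1, v3)  # dissimilar vectors: much lower score
```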
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_arabic_camelbert_msa_eighth|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|409.3 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa-eighth
- https://arxiv.org/abs/2103.06678
- https://github.com/CAMeL-Lab/CAMeLBERT
- https://catalog.ldc.upenn.edu/LDC2011T11
- http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus
- https://vlo.clarin.eu/search;jsessionid=31066390B2C9E8C6304845BA79869AC1?1&q=osian
- https://archive.org/details/arwiki-20190201
- https://oscar-corpus.com/
- https://github.com/google-research/bert
- https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/tokenization.py#L286-L297
- https://github.com/CAMeL-Lab/camel_tools

---
layout: model
title: Legal Consent To Jurisdiction Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_consent_to_jurisdiction_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, consent_to_jurisdiction, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

This model is a Binary Classifier (True, False) for the `Consent_To_Jurisdiction` clause type. To use this model, make sure you provide enough context as an input.
Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`Consent_To_Jurisdiction`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_consent_to_jurisdiction_bert_en_1.0.0_3.0_1678049885944.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_consent_to_jurisdiction_bert_en_1.0.0_3.0_1678049885944.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_consent_to_jurisdiction_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    document_assembler,
    embeddings,
    doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")

model = nlpPipeline.fit(df)

result = model.transform(df)
```
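The description above recommends paragraph splitting (by multiline) and respecting the 512-token embedding budget before classification. A minimal plain-Python sketch of that pre-processing step — the regex and the whitespace-token trimming heuristic are illustrative assumptions, not the Legal NLP implementation:

```python
import re

def split_paragraphs(text, max_tokens=512):
    """Split a document on blank lines and trim each paragraph to a
    whitespace-token budget (this model's embeddings accept up to 512)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [" ".join(p.split()[:max_tokens]) for p in paragraphs]

document = """CONSENT TO JURISDICTION. Each party submits to the courts of New York.

GOVERNING LAW. This agreement is governed by the laws of Delaware."""

clauses = split_paragraphs(document)
```

Each element of `clauses` could then be one row of the `text` column fed to the pipeline above.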
## Results

```bash
+-------------------------+
|result                   |
+-------------------------+
|[Consent_To_Jurisdiction]|
|[Other]                  |
|[Other]                  |
|[Consent_To_Jurisdiction]|
+-------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_consent_to_jurisdiction_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.4 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
                  label  precision  recall  f1-score  support
Consent_To_Jurisdiction        1.00    0.83      0.91       18
                  Other        0.91    1.00      0.95       31
               accuracy           -       -      0.94       49
              macro-avg        0.96    0.92      0.93       49
           weighted-avg        0.94    0.94      0.94       49
```

---
layout: model
title: Portuguese BertForQuestionAnswering model (from pucpr)
author: John Snow Labs
name: bert_qa_bioBERTpt_squad_v1.1_portuguese
date: 2022-06-02
tags: [pt, open_source, question_answering, bert]
task: Question Answering
language: pt
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bioBERTpt-squad-v1.1-portuguese` is a Portuguese model originally trained by `pucpr`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bioBERTpt_squad_v1.1_portuguese_pt_4.0.0_3.0_1654185542826.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bioBERTpt_squad_v1.1_portuguese_pt_4.0.0_3.0_1654185542826.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bioBERTpt_squad_v1.1_portuguese","pt") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_bioBERTpt_squad_v1.1_portuguese","pt")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("pt.answer_question.squad.biobert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bioBERTpt_squad_v1.1_portuguese|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|pt|
|Size:|665.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/pucpr/bioBERTpt-squad-v1.1-portuguese
- https://github.com/HAILab-PUCPR/

---
layout: model
title: Google's Tapas Table Understanding (Medium, SQA)
author: John Snow Labs
name: table_qa_tapas_medium_finetuned_sqa
date: 2022-09-30
tags: [en, table, qa, question, answering, open_source]
task: Table Question Answering
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: TapasForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is a Zero-shot Table Understanding Model which allows you to carry out Question Answering on Spark DataFrames. If you have a file stored in a table format such as CSV, load it into a Spark DataFrame before use.

Size of this model: Medium

Has aggregation operations?: False

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_medium_finetuned_sqa_en_4.2.0_3.0_1664530610375.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_medium_finetuned_sqa_en_4.2.0_3.0_1664530610375.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python json_data = """ { "header": ["name", "money", "age"], "rows": [ ["Donald Trump", "$100,000,000", "75"], ["Elon Musk", "$20,000,000,000,000", "55"] ] } """ queries = [ "Who earns less than 200,000,000?", "Who earns 100,000,000?", "How much money has Donald Trump?", "How old are they?", ] data = spark.createDataFrame([ [json_data, " ".join(queries)] ]).toDF("table_json", "questions") document_assembler = MultiDocumentAssembler() \ .setInputCols("table_json", "questions") \ .setOutputCols("document_table", "document_questions") sentence_detector = SentenceDetector() \ .setInputCols(["document_questions"]) \ .setOutputCol("questions") table_assembler = TableAssembler()\ .setInputCols(["document_table"])\ .setOutputCol("table") tapas = TapasForQuestionAnswering\ .pretrained("table_qa_tapas_medium_finetuned_sqa","en")\ .setInputCols(["questions", "table"])\ .setOutputCol("answers") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, table_assembler, tapas ]) model = pipeline.fit(data) model\ .transform(data)\ .selectExpr("explode(answers) AS answer")\ .select("answer")\ .show(truncate=False) ```
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |answer | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} | |{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|table_qa_tapas_medium_finetuned_sqa| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|157.5 MB| |Case sensitive:|false| ## References https://www.microsoft.com/en-us/download/details.aspx?id=54253 --- layout: model title: NER Pipeline for Demographic Information - Voice of the Patient author: John Snow Labs name: ner_vop_demographic_pipeline date: 2023-06-10 tags: [licensed, ner, pipeline, vop, en, demographic] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline extracts mentions of demographic information from 
health-related text in colloquial language. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_demographic_pipeline_en_4.4.3_3.0_1686410353004.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_demographic_pipeline_en_4.4.3_3.0_1686410353004.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_vop_demographic_pipeline", "en", "clinical/models") pipeline.annotate(" My grandma, who's 85 and Black, just had a pacemaker implanted in the cardiology department. The doctors say it'll help regulate her heartbeat and prevent future complications. ") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_vop_demographic_pipeline", "en", "clinical/models") val result = pipeline.annotate(" My grandma, who's 85 and Black, just had a pacemaker implanted in the cardiology department. The doctors say it'll help regulate her heartbeat and prevent future complications. ") ```
## Results

```bash
| chunk   | ner_label     |
|:--------|:--------------|
| grandma | Gender        |
| 85      | Age           |
| Black   | RaceEthnicity |
| doctors | Employment    |
| her     | Gender        |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_vop_demographic_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|791.6 MB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter

---
layout: model
title: Translate English to Romance languages Pipeline
author: John Snow Labs
name: translate_en_roa
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, roa, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `roa` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_roa_xx_2.7.0_2.4_1609686717997.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_roa_xx_2.7.0_2.4_1609686717997.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_roa", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_roa", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.roa').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_en_roa|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Legal Military Leave Clause Binary Classifier
author: John Snow Labs
name: legclf_military_leave_clause
date: 2023-01-29
tags: [en, legal, classification, military, leave, clauses, military_leave, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a binary classifier (True, False) for the `military-leave` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
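As a rough illustration of the first of those techniques, here is a sketch in plain Python (independent of the Spark NLP splitters; `split_paragraphs` is a hypothetical helper) of paragraph splitting by multiline, i.e. breaking a document on blank lines:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraph candidates on runs of blank lines,
    discarding empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

contract = (
    "MILITARY LEAVE. Employees called to active duty are entitled to unpaid leave.\n"
    "\n"
    "SEVERABILITY. If any provision is held invalid, the remainder survives."
)
clauses = split_paragraphs(contract)  # each candidate clause is classified separately
```

Each resulting piece can then be sent through the classifier as its own row.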
This model can be combined with any of the other hundreds of legal clause classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`military-leave`, `other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_military_leave_clause_en_1.0.0_3.0_1674994024589.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_military_leave_clause_en_1.0.0_3.0_1674994024589.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_military_leave_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+----------------+
|result          |
+----------------+
|[military-leave]|
|[other]         |
|[other]         |
|[military-leave]|
+----------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_military_leave_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
         label  precision  recall  f1-score  support
         other       1.00    0.97      0.99       39
military-leave       0.95    1.00      0.98       21
      accuracy          -       -      0.98       60
     macro-avg       0.98    0.99      0.98       60
  weighted-avg       0.98    0.98      0.98       60
```

---
layout: model
title: Legal Agreements Clause Binary Classifier
author: John Snow Labs
name: legclf_agreements_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a binary classifier (True, False) for the `agreements` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other hundreds of legal clause classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `agreements`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_agreements_clause_en_1.0.0_3.2_1660122097948.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_agreements_clause_en_1.0.0_3.2_1660122097948.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_agreements_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+------------+
|result      |
+------------+
|[agreements]|
|[other]     |
|[other]     |
|[agreements]|
+------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_agreements_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.3 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
  agreements       0.84    0.83      0.83      212
       other       0.94    0.94      0.94      618
    accuracy          -       -      0.92      830
   macro-avg       0.89    0.89      0.89      830
weighted-avg       0.92    0.92      0.92      830
```

---
layout: model
title: Generic Text Generation - Large
author: John Snow Labs
name: text_generator_generic_flan_t5_large
date: 2023-04-04
tags: [licensed, en, text_generation, tensorflow]
task: Text Generation
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.0
supported: true
engine: tensorflow
annotator: MedicalTextGenerator
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is based on Google's Flan-T5 Large and can generate conditional text. The sequence length is 512 tokens.

## Predicted Entities

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/MEDICAL_TEXT_GENERATION/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/33.1.Medical_Text_Generation.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/text_generator_generic_flan_t5_large_en_4.3.2_3.0_1680648636099.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/text_generator_generic_flan_t5_large_en_4.3.2_3.0_1680648636099.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("prompt")\ .setOutputCol("document_prompt") med_text_generator = MedicalTextGenerator.pretrained("text_generator_generic_flan_t5_large", "en", "clinical/models")\ .setInputCols("document_prompt")\ .setOutputCol("answer")\ .setMaxNewTokens(256)\ .setDoSample(True)\ .setTopK(3)\ .setRandomSeed(42) pipeline = Pipeline(stages=[document_assembler, med_text_generator]) data = spark.createDataFrame([["""Classify the following review as negative or positive: Not a huge fan of her acting, but the movie was actually quite good!"""]]).toDF("prompt") pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("prompt") .setOutputCol("document_prompt") val med_text_generator = MedicalTextGenerator.pretrained("text_generator_generic_flan_t5_large", "en", "clinical/models") .setInputCols("document_prompt") .setOutputCol("answer") .setMaxNewTokens(256) .setDoSample(true) .setTopK(3) .setRandomSeed(42) val pipeline = new Pipeline().setStages(Array(document_assembler, med_text_generator)) val data = Seq(Array("""Classify the following review as negative or positive: Not a huge fan of her acting, but the movie was actually quite good!""")).toDS.toDF("prompt") val result = pipeline.fit(data).transform(data) ```
## Results

```bash
positive
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|text_generator_generic_flan_t5_large|
|Compatibility:|Healthcare NLP 4.3.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|2.9 GB|

---
layout: model
title: Detect Living Species
author: John Snow Labs
name: bert_token_classifier_ner_living_species
date: 2022-06-27
tags: [pt, ner, clinical, licensed, bertfortokenclassification]
task: Named Entity Recognition
language: pt
edition: Healthcare NLP 3.5.3
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Extract living species from clinical texts in Portuguese, which is critical to scientific disciplines like medicine, biology, ecology/biodiversity, nutrition and agriculture. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. It is trained on the [LivingNER](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) corpus, which is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others.

**NOTE:**

1. The text files were translated from Spanish with a neural machine translation system.
2. The annotations were translated with the same neural machine translation system.
3. The translated annotations were transferred to the translated text files using an annotation transfer technology.
## Predicted Entities `HUMAN`, `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_pt_3.5.3_3.0_1656319516041.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_pt_3.5.3_3.0_1656319516041.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_living_species", "pt", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(512)

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    ner_model,
    ner_converter
])

data = spark.createDataFrame([["""Uma rapariga de 16 anos com um historial pessoal de asma apresentou ao departamento de dermatologia com lesões cutâneas assintomáticas que tinham estado presentes durante 2 meses. A paciente tinha sido tratada com creme corticosteróide devido a uma suspeita inicial de eczema atópico, apesar do qual apresentava um crescimento progressivo marcado das lesões. Tinha um gato doméstico que ela nunca tinha levado ao veterinário. O exame físico revelou placas em forma de anel com uma borda periférica activa na parte superior das costas e nos aspectos laterais do pescoço e da face. Cultura local obtida por raspagem de tapete isolado Trichophyton rubrum.
Com base em dados clínicos e cultura, foi estabelecido o diagnóstico de tinea incognito."""]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_living_species", "pt", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("ner")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, ner_model, ner_converter))

val data = Seq("""Uma rapariga de 16 anos com um historial pessoal de asma apresentou ao departamento de dermatologia com lesões cutâneas assintomáticas que tinham estado presentes durante 2 meses. A paciente tinha sido tratada com creme corticosteróide devido a uma suspeita inicial de eczema atópico, apesar do qual apresentava um crescimento progressivo marcado das lesões. Tinha um gato doméstico que ela nunca tinha levado ao veterinário. O exame físico revelou placas em forma de anel com uma borda periférica activa na parte superior das costas e nos aspectos laterais do pescoço e da face. Cultura local obtida por raspagem de tapete isolado Trichophyton rubrum.
Com base em dados clínicos e cultura, foi estabelecido o diagnóstico de tinea incognito.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.med_ner.living_species.token_bert").predict("""Uma rapariga de 16 anos com um historial pessoal de asma apresentou ao departamento de dermatologia com lesões cutâneas assintomáticas que tinham estado presentes durante 2 meses. A paciente tinha sido tratada com creme corticosteróide devido a uma suspeita inicial de eczema atópico, apesar do qual apresentava um crescimento progressivo marcado das lesões. Tinha um gato doméstico que ela nunca tinha levado ao veterinário. O exame físico revelou placas em forma de anel com uma borda periférica activa na parte superior das costas e nos aspectos laterais do pescoço e da face. Cultura local obtida por raspagem de tapete isolado Trichophyton rubrum. Com base em dados clínicos e cultura, foi estabelecido o diagnóstico de tinea incognito.""") ```
## Results

```bash
+-------------------+-------+
|ner_chunk          |label  |
+-------------------+-------+
|rapariga           |HUMAN  |
|pessoal            |HUMAN  |
|paciente           |HUMAN  |
|gato               |SPECIES|
|veterinário        |HUMAN  |
|Trichophyton rubrum|SPECIES|
+-------------------+-------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_living_species|
|Compatibility:|Healthcare NLP 3.5.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|pt|
|Size:|665.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

[https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/)

## Benchmarking

```bash
       label  precision  recall  f1-score  support
     B-HUMAN       0.75    0.95      0.84     2832
   B-SPECIES       0.64    0.85      0.73     2810
     I-HUMAN       0.79    0.37      0.51      180
   I-SPECIES       0.63    0.71      0.67     1107
   micro-avg       0.69    0.86      0.76     6929
   macro-avg       0.70    0.72      0.69     6929
weighted-avg       0.69    0.86      0.76     6929
```

---
layout: model
title: English image_classifier_vit_tiny_patch16_224 ViTForImageClassification from WinKawaks
author: John Snow Labs
name: image_classifier_vit_tiny_patch16_224
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_tiny_patch16_224` is an English model originally trained by WinKawaks.
## Predicted Entities `turnstile`, `damselfly`, `mixing bowl`, `sea snake`, `cockroach, roach`, `buckle`, `beer glass`, `bulbul`, `lumbermill, sawmill`, `whippet`, `Australian terrier`, `television, television system`, `hoopskirt, crinoline`, `horse cart, horse-cart`, `guillotine`, `malamute, malemute, Alaskan malamute`, `coyote, prairie wolf, brush wolf, Canis latrans`, `colobus, colobus monkey`, `hognose snake, puff adder, sand viper`, `sock`, `burrito`, `printer`, `bathing cap, swimming cap`, `chiton, coat-of-mail shell, sea cradle, polyplacophore`, `Rottweiler`, `cello, violoncello`, `pitcher, ewer`, `computer keyboard, keypad`, `bow`, `peacock`, `ballplayer, baseball player`, `refrigerator, icebox`, `solar dish, solar collector, solar furnace`, `passenger car, coach, carriage`, `African chameleon, Chamaeleo chamaeleon`, `oboe, hautboy, hautbois`, `toyshop`, `Leonberg`, `howler monkey, howler`, `bluetick`, `African elephant, Loxodonta africana`, `American lobster, Northern lobster, Maine lobster, Homarus americanus`, `combination lock`, `black-and-tan coonhound`, `bonnet, poke bonnet`, `harvester, reaper`, `Appenzeller`, `iron, smoothing iron`, `electric locomotive`, `lycaenid, lycaenid butterfly`, `sandbar, sand bar`, `Cardigan, Cardigan Welsh corgi`, `pencil sharpener`, `jean, blue jean, denim`, `backpack, back pack, knapsack, packsack, rucksack, haversack`, `monitor`, `ice cream, icecream`, `apiary, bee house`, `water jug`, `American coot, marsh hen, mud hen, water hen, Fulica americana`, `ground beetle, carabid beetle`, `jigsaw puzzle`, `ant, emmet, pismire`, `wreck`, `kuvasz`, `gyromitra`, `Ibizan hound, Ibizan Podenco`, `brown bear, bruin, Ursus arctos`, `bolo tie, bolo, bola tie, bola`, `Pembroke, Pembroke Welsh corgi`, `French bulldog`, `prison, prison house`, `ballpoint, ballpoint pen, ballpen, Biro`, `stage`, `airliner`, `dogsled, dog sled, dog sleigh`, `redshank, Tringa totanus`, `menu`, `Indian cobra, Naja naja`, `swab, swob, mop`, `window screen`, 
`brain coral`, `artichoke, globe artichoke`, `loupe, jeweler's loupe`, `loudspeaker, speaker, speaker unit, loudspeaker system, speaker system`, `panpipe, pandean pipe, syrinx`, `wok`, `croquet ball`, `plate`, `scoreboard`, `Samoyed, Samoyede`, `ocarina, sweet potato`, `beaver`, `borzoi, Russian wolfhound`, `horizontal bar, high bar`, `stretcher`, `seat belt, seatbelt`, `obelisk`, `forklift`, `feather boa, boa`, `frying pan, frypan, skillet`, `barbershop`, `hamper`, `face powder`, `Siamese cat, Siamese`, `ladle`, `dingo, warrigal, warragal, Canis dingo`, `mountain tent`, `head cabbage`, `echidna, spiny anteater, anteater`, `Polaroid camera, Polaroid Land camera`, `dumbbell`, `espresso`, `notebook, notebook computer`, `Norfolk terrier`, `binoculars, field glasses, opera glasses`, `carpenter's kit, tool kit`, `moving van`, `catamaran`, `tiger beetle`, `bikini, two-piece`, `Siberian husky`, `studio couch, day bed`, `bulletproof vest`, `lawn mower, mower`, `promontory, headland, head, foreland`, `soap dispenser`, `vulture`, `dam, dike, dyke`, `brambling, Fringilla montifringilla`, `toilet tissue, toilet paper, bathroom tissue`, `ringlet, ringlet butterfly`, `tiger cat`, `mobile home, manufactured home`, `Norwich terrier`, `little blue heron, Egretta caerulea`, `English setter`, `Tibetan mastiff`, `rocking chair, rocker`, `mask`, `maze, labyrinth`, `bookcase`, `viaduct`, `sweatshirt`, `plow, plough`, `basenji`, `typewriter keyboard`, `Windsor tie`, `coral fungus`, `desktop computer`, `Kerry blue terrier`, `Angora, Angora rabbit`, `can opener, tin opener`, `shield, buckler`, `triumphal arch`, `horned viper, cerastes, sand viper, horned asp, Cerastes cornutus`, `miniature schnauzer`, `tape player`, `jaguar, panther, Panthera onca, Felis onca`, `hook, claw`, `file, file cabinet, filing cabinet`, `chime, bell, gong`, `shower curtain`, `window shade`, `acoustic guitar`, `gas pump, gasoline pump, petrol pump, island dispenser`, `cicada, cicala`, `Petri dish`, `paintbrush`, 
`banana`, `chickadee`, `mountain bike, all-terrain bike, off-roader`, `lighter, light, igniter, ignitor`, `oil filter`, `cab, hack, taxi, taxicab`, `Christmas stocking`, `rugby ball`, `black widow, Latrodectus mactans`, `bustard`, `fiddler crab`, `web site, website, internet site, site`, `chocolate sauce, chocolate syrup`, `chainlink fence`, `fireboat`, `cocktail shaker`, `airship, dirigible`, `projectile, missile`, `bagel, beigel`, `screwdriver`, `oystercatcher, oyster catcher`, `pot, flowerpot`, `water bottle`, `Loafer`, `drumstick`, `soccer ball`, `cairn, cairn terrier`, `padlock`, `tow truck, tow car, wrecker`, `bloodhound, sleuthhound`, `punching bag, punch bag, punching ball, punchball`, `great grey owl, great gray owl, Strix nebulosa`, `scale, weighing machine`, `trench coat`, `briard`, `cheetah, chetah, Acinonyx jubatus`, `entertainment center`, `Boston bull, Boston terrier`, `Arabian camel, dromedary, Camelus dromedarius`, `steam locomotive`, `coil, spiral, volute, whorl, helix`, `plane, carpenter's plane, woodworking plane`, `gondola`, `spider web, spider's web`, `bathtub, bathing tub, bath, tub`, `pelican`, `miniature poodle`, `cowboy boot`, `perfume, essence`, `lakeside, lakeshore`, `timber wolf, grey wolf, gray wolf, Canis lupus`, `moped`, `sunscreen, sunblock, sun blocker`, `Brabancon griffon`, `puffer, pufferfish, blowfish, globefish`, `lifeboat`, `pool table, billiard table, snooker table`, `Bouvier des Flandres, Bouviers des Flandres`, `Pomeranian`, `theater curtain, theatre curtain`, `marimba, xylophone`, `baboon`, `vacuum, vacuum cleaner`, `pill bottle`, `pick, plectrum, plectron`, `hen`, `American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier`, `digital watch`, `pier`, `oxygen mask`, `Tibetan terrier, chrysanthemum dog`, `ostrich, Struthio camelus`, `water ouzel, dipper`, `drilling platform, offshore rig`, `magnetic compass`, `throne`, `butternut squash`, `minibus`, `EntleBucher`, `carousel, carrousel, 
merry-go-round, roundabout, whirligig`, `hot pot, hotpot`, `rain barrel`, `wood rabbit, cottontail, cottontail rabbit`, `miniature pinscher`, `partridge`, `three-toed sloth, ai, Bradypus tridactylus`, `English springer, English springer spaniel`, `corkscrew, bottle screw`, `fur coat`, `robin, American robin, Turdus migratorius`, `dowitcher`, `ruddy turnstone, Arenaria interpres`, `water snake`, `stove`, `Great Pyrenees`, `soft-coated wheaten terrier`, `carbonara`, `snail`, `breastplate, aegis, egis`, `wolf spider, hunting spider`, `hatchet`, `CD player`, `axolotl, mud puppy, Ambystoma mexicanum`, `pomegranate`, `poncho`, `leatherback turtle, leatherback, leathery turtle, Dermochelys coriacea`, `lorikeet`, `spatula`, `jay`, `platypus, duckbill, duckbilled platypus, duck-billed platypus, Ornithorhynchus anatinus`, `stethoscope`, `flagpole, flagstaff`, `coho, cohoe, coho salmon, blue jack, silver salmon, Oncorhynchus kisutch`, `agama`, `red wolf, maned wolf, Canis rufus, Canis niger`, `beaker`, `eft`, `pretzel`, `brassiere, bra, bandeau`, `frilled lizard, Chlamydosaurus kingi`, `joystick`, `goldfish, Carassius auratus`, `fig`, `maypole`, `caldron, cauldron`, `admiral`, `impala, Aepyceros melampus`, `spotted salamander, Ambystoma maculatum`, `syringe`, `hog, pig, grunter, squealer, Sus scrofa`, `handkerchief, hankie, hanky, hankey`, `tarantula`, `cheeseburger`, `pinwheel`, `sax, saxophone`, `dung beetle`, `broccoli`, `cassette player`, `milk can`, `traffic light, traffic signal, stoplight`, `shovel`, `sarong`, `tabby, tabby cat`, `parallel bars, bars`, `ladybug, ladybeetle, lady beetle, ladybird, ladybird beetle`, `quill, quill pen`, `giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca`, `steel drum`, `quail`, `Blenheim spaniel`, `wig`, `hamster`, `ice lolly, lolly, lollipop, popsicle`, `seashore, coast, seacoast, sea-coast`, `chest`, `worm fence, snake fence, snake-rail fence, Virginia fence`, `missile`, `beer bottle`, `yellow lady's slipper, yellow 
lady-slipper, Cypripedium calceolus, Cypripedium parviflorum`, `breakwater, groin, groyne, mole, bulwark, seawall, jetty`, `white wolf, Arctic wolf, Canis lupus tundrarum`, `guacamole`, `porcupine, hedgehog`, `trolleybus, trolley coach, trackless trolley`, `greenhouse, nursery, glasshouse`, `trimaran`, `Italian greyhound`, `potter's wheel`, `jacamar`, `wallet, billfold, notecase, pocketbook`, `Lakeland terrier`, `green lizard, Lacerta viridis`, `indigo bunting, indigo finch, indigo bird, Passerina cyanea`, `green mamba`, `walking stick, walkingstick, stick insect`, `crossword puzzle, crossword`, `eggnog`, `barrow, garden cart, lawn cart, wheelbarrow`, `remote control, remote`, `bicycle-built-for-two, tandem bicycle, tandem`, `wool, woolen, woollen`, `black grouse`, `abaya`, `marmoset`, `golf ball`, `jeep, landrover`, `Mexican hairless`, `dishwasher, dish washer, dishwashing machine`, `jersey, T-shirt, tee shirt`, `planetarium`, `goose`, `mailbox, letter box`, `capuchin, ringtail, Cebus capucinus`, `marmot`, `orangutan, orang, orangutang, Pongo pygmaeus`, `coffeepot`, `ambulance`, `shopping basket`, `pop bottle, soda bottle`, `red fox, Vulpes vulpes`, `crash helmet`, `street sign`, `affenpinscher, monkey pinscher, monkey dog`, `Arctic fox, white fox, Alopex lagopus`, `sidewinder, horned rattlesnake, Crotalus cerastes`, `ruffed grouse, partridge, Bonasa umbellus`, `muzzle`, `measuring cup`, `canoe`, `reflex camera`, `fox squirrel, eastern fox squirrel, Sciurus niger`, `French loaf`, `killer whale, killer, orca, grampus, sea wolf, Orcinus orca`, `dial telephone, dial phone`, `thimble`, `bubble`, `vizsla, Hungarian pointer`, `running shoe`, `mailbag, postbag`, `radio telescope, radio reflector`, `piggy bank, penny bank`, `Chihuahua`, `chambered nautilus, pearly nautilus, nautilus`, `Airedale, Airedale terrier`, `kimono`, `green snake, grass snake`, `rubber eraser, rubber, pencil eraser`, `upright, upright piano`, `orange`, `revolver, six-gun, six-shooter`, `ashcan, 
trash can, garbage can, wastebin, ash bin, ash-bin, ashbin, dustbin, trash barrel, trash bin`, `drum, membranophone, tympan`, `Dungeness crab, Cancer magister`, `lipstick, lip rouge`, `gong, tam-tam`, `fountain`, `tub, vat`, `malinois`, `sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita`, `German short-haired pointer`, `apron`, `Irish setter, red setter`, `dishrag, dishcloth`, `school bus`, `candle, taper, wax light`, `bib`, `cash machine, cash dispenser, automated teller machine, automatic teller machine, automated teller, automatic teller, ATM`, `power drill`, `English foxhound`, `miniskirt, mini`, `swing`, `slug`, `hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa`, `rifle`, `Saluki, gazelle hound`, `Sealyham terrier, Sealyham`, `bullet train, bullet`, `hyena, hyaena`, `ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus`, `toy terrier`, `goblet`, `safe`, `cup`, `electric guitar`, `red wine`, `restaurant, eating house, eating place, eatery`, `wall clock`, `washbasin, handbasin, washbowl, lavabo, wash-hand basin`, `red-breasted merganser, Mergus serrator`, `crate`, `banded gecko`, `hippopotamus, hippo, river horse, Hippopotamus amphibius`, `tick`, `tripod`, `sombrero`, `desk`, `sea slug, nudibranch`, `racer, race car, racing car`, `pizza, pizza pie`, `dining table, board`, `Saint Bernard, St Bernard`, `komondor`, `electric ray, crampfish, numbfish, torpedo`, `prairie chicken, prairie grouse, prairie fowl`, `coffee mug`, `hammer`, `golfcart, golf cart`, `unicycle, monocycle`, `bison`, `soup bowl`, `rapeseed`, `golden retriever`, `plastic bag`, `grey fox, gray fox, Urocyon cinereoargenteus`, `water tower`, `house finch, linnet, Carpodacus mexicanus`, `barbell`, `hair slide`, `tiger, Panthera tigris`, `black-footed ferret, ferret, Mustela nigripes`, `meat loaf, meatloaf`, `hand blower, blow dryer, blow drier, hair dryer, hair drier`, `overskirt`, `gibbon, Hylobates lar`, `Gila monster, Heloderma suspectum`, `toucan`, 
`snowmobile`, `pencil box, pencil case`, `scuba diver`, `cloak`, `Sussex spaniel`, `otter`, `Greater Swiss Mountain dog`, `great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias`, `torch`, `magpie`, `tiger shark, Galeocerdo cuvieri`, `wing`, `Border collie`, `bell cote, bell cot`, `sea anemone, anemone`, `teapot`, `sea urchin`, `screen, CRT screen`, `bookshop, bookstore, bookstall`, `oscilloscope, scope, cathode-ray oscilloscope, CRO`, `crib, cot`, `police van, police wagon, paddy wagon, patrol wagon, wagon, black Maria`, `hartebeest`, `manhole cover`, `iPod`, `rock python, rock snake, Python sebae`, `nipple`, `suspension bridge`, `safety pin`, `sea lion`, `cougar, puma, catamount, mountain lion, painter, panther, Felis concolor`, `mantis, mantid`, `wardrobe, closet, press`, `projector`, `Granny Smith`, `diamondback, diamondback rattlesnake, Crotalus adamanteus`, `pirate, pirate ship`, `espresso maker`, `African hunting dog, hyena dog, Cape hunting dog, Lycaon pictus`, `cradle`, `common newt, Triturus vulgaris`, `tricycle, trike, velocipede`, `bobsled, bobsleigh, bob`, `thunder snake, worm snake, Carphophis amoenus`, `thresher, thrasher, threshing machine`, `banjo`, `armadillo`, `pajama, pyjama, pj's, jammies`, `ski`, `Maltese dog, Maltese terrier, Maltese`, `leafhopper`, `book jacket, dust cover, dust jacket, dust wrapper`, `silky terrier, Sydney silky`, `Shih-Tzu`, `wallaby, brush kangaroo`, `cardigan`, `sturgeon`, `freight car`, `home theater, home theatre`, `sundial`, `African crocodile, Nile crocodile, Crocodylus niloticus`, `odometer, hodometer, mileometer, milometer`, `sliding door`, `vine snake`, `West Highland white terrier`, `mongoose`, `hornbill`, `beagle`, `European gallinule, Porphyrio porphyrio`, `submarine, pigboat, sub, U-boat`, `Komodo dragon, Komodo lizard, dragon lizard, giant lizard, Varanus komodoensis`, `cock`, `pedestal, plinth, footstall`, `accordion, piano accordion, squeeze box`, `gown`, `lynx, catamount`, 
`guenon, guenon monkey`, `Walker hound, Walker foxhound`, `standard schnauzer`, `reel`, `hip, rose hip, rosehip`, `grasshopper, hopper`, `Dutch oven`, `stone wall`, `hard disc, hard disk, fixed disk`, `snow leopard, ounce, Panthera uncia`, `shopping cart`, `digital clock`, `hourglass`, `Border terrier`, `Old English sheepdog, bobtail`, `academic gown, academic robe, judge's robe`, `spiny lobster, langouste, rock lobster, crawfish, crayfish, sea crawfish`, `spotlight, spot`, `dome`, `barn spider, Araneus cavaticus`, `bee eater`, `basketball`, `cliff dwelling`, `folding chair`, `isopod`, `Doberman, Doberman pinscher`, `bittern`, `sunglasses, dark glasses, shades`, `picket fence, paling`, `Crock Pot`, `ibex, Capra ibex`, `neck brace`, `cardoon`, `cassette`, `amphibian, amphibious vehicle`, `minivan`, `analog clock`, `trailer truck, tractor trailer, trucking rig, rig, articulated lorry, semi`, `yurt`, `cliff, drop, drop-off`, `Bernese mountain dog`, `teddy, teddy bear`, `sloth bear, Melursus ursinus, Ursus ursinus`, `bassoon`, `toaster`, `ptarmigan`, `Gordon setter`, `night snake, Hypsiglena torquata`, `grand piano, grand`, `purse`, `clumber, clumber spaniel`, `shoji`, `hair spray`, `maillot`, `knee pad`, `space heater`, `bottlecap`, `chiffonier, commode`, `chain saw, chainsaw`, `sulphur butterfly, sulfur butterfly`, `pay-phone, pay-station`, `kelpie`, `mouse, computer mouse`, `car wheel`, `cornet, horn, trumpet, trump`, `container ship, containership, container vessel`, `matchstick`, `scabbard`, `American black bear, black bear, Ursus americanus, Euarctos americanus`, `langur`, `rock crab, Cancer irroratus`, `lionfish`, `speedboat`, `black stork, Ciconia nigra`, `knot`, `disk brake, disc brake`, `mosquito net`, `white stork, Ciconia ciconia`, `abacus`, `titi, titi monkey`, `grocery store, grocery, food market, market`, `waffle iron`, `pickelhaube`, `wooden spoon`, `Norwegian elkhound, elkhound`, `earthstar`, `sewing machine`, `balance beam, beam`, `potpie`, `chain 
mail, ring mail, mail, chain armor, chain armour, ring armor, ring armour`, `Staffordshire bullterrier, Staffordshire bull terrier`, `switch, electric switch, electrical switch`, `dhole, Cuon alpinus`, `paddle, boat paddle`, `limousine, limo`, `Shetland sheepdog, Shetland sheep dog, Shetland`, `space bar`, `library`, `paddlewheel, paddle wheel`, `alligator lizard`, `Band Aid`, `Persian cat`, `bull mastiff`, `tailed frog, bell toad, ribbed toad, tailed toad, Ascaphus trui`, `sports car, sport car`, `football helmet`, `laptop, laptop computer`, `lens cap, lens cover`, `tennis ball`, `violin, fiddle`, `lab coat, laboratory coat`, `cinema, movie theater, movie theatre, movie house, picture palace`, `weasel`, `bow tie, bow-tie, bowtie`, `macaw`, `dough`, `whiskey jug`, `microphone, mike`, `spoonbill`, `bassinet`, `mud turtle`, `velvet`, `warthog`, `plunger, plumber's helper`, `dugong, Dugong dugon`, `honeycomb`, `badger`, `dragonfly, darning needle, devil's darning needle, sewing needle, snake feeder, snake doctor, mosquito hawk, skeeter hawk`, `bee`, `doormat, welcome mat`, `fountain pen`, `giant schnauzer`, `assault rifle, assault gun`, `limpkin, Aramus pictus`, `siamang, Hylobates syndactylus, Symphalangus syndactylus`, `albatross, mollymawk`, `confectionery, confectionary, candy store`, `harp`, `parachute, chute`, `barrel, cask`, `tank, army tank, armored combat vehicle, armoured combat vehicle`, `collie`, `kite`, `puck, hockey puck`, `stupa, tope`, `buckeye, horse chestnut, conker`, `patio, terrace`, `broom`, `Dandie Dinmont, Dandie Dinmont terrier`, `scorpion`, `agaric`, `balloon`, `bucket, pail`, `squirrel monkey, Saimiri sciureus`, `Eskimo dog, husky`, `zebra`, `garter snake, grass snake`, `indri, indris, Indri indri, Indri brevicaudatus`, `tractor`, `guinea pig, Cavia cobaya`, `maraca`, `red-backed sandpiper, dunlin, Erolia alpina`, `bullfrog, Rana catesbeiana`, `trilobite`, `Japanese spaniel`, `gorilla, Gorilla gorilla`, `monastery`, `centipede`, `terrapin`, 
`llama`, `long-horned beetle, longicorn, longicorn beetle`, `boxer`, `curly-coated retriever`, `mortar`, `hammerhead, hammerhead shark`, `goldfinch, Carduelis carduelis`, `garden spider, Aranea diademata`, `stopwatch, stop watch`, `grey whale, gray whale, devilfish, Eschrichtius gibbosus, Eschrichtius robustus`, `leaf beetle, chrysomelid`, `birdhouse`, `king crab, Alaska crab, Alaskan king crab, Alaska king crab, Paralithodes camtschatica`, `stole`, `bell pepper`, `radiator`, `flatworm, platyhelminth`, `mushroom`, `Scotch terrier, Scottish terrier, Scottie`, `liner, ocean liner`, `toilet seat`, `lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens`, `zucchini, courgette`, `harvestman, daddy longlegs, Phalangium opilio`, `Newfoundland, Newfoundland dog`, `flamingo`, `whiptail, whiptail lizard`, `geyser`, `cleaver, meat cleaver, chopper`, `sea cucumber, holothurian`, `American egret, great white heron, Egretta albus`, `parking meter`, `beacon, lighthouse, beacon light, pharos`, `coucal`, `motor scooter, scooter`, `mitten`, `cannon`, `weevil`, `megalith, megalithic structure`, `stinkhorn, carrion fungus`, `ear, spike, capitulum`, `box turtle, box tortoise`, `snowplow, snowplough`, `tench, Tinca tinca`, `modem`, `tobacco shop, tobacconist shop, tobacconist`, `barn`, `skunk, polecat, wood pussy`, `African grey, African gray, Psittacus erithacus`, `Madagascar cat, ring-tailed lemur, Lemur catta`, `holster`, `barometer`, `sleeping bag`, `washer, automatic washer, washing machine`, `recreational vehicle, RV, R.V.`, `drake`, `tray`, `butcher shop, meat market`, `china cabinet, china closet`, `medicine chest, medicine cabinet`, `photocopier`, `Yorkshire terrier`, `starfish, sea star`, `racket, racquet`, `park bench`, `Labrador retriever`, `whistle`, `clog, geta, patten, sabot`, `volcano`, `quilt, comforter, comfort, puff`, `leopard, Panthera pardus`, `cauliflower`, `swimming trunks, bathing trunks`, `American chameleon, anole, Anolis carolinensis`, `alp`, 
`mortarboard`, `barracouta, snoek`, `cocker spaniel, English cocker spaniel, cocker`, `space shuttle`, `beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon`, `harmonica, mouth organ, harp, mouth harp`, `gasmask, respirator, gas helmet`, `wombat`, `Model T`, `wild boar, boar, Sus scrofa`, `hermit crab`, `flat-coated retriever`, `rotisserie`, `jinrikisha, ricksha, rickshaw`, `trifle`, `bannister, banister, balustrade, balusters, handrail`, `go-kart`, `bakery, bakeshop, bakehouse`, `ski mask`, `dock, dockage, docking facility`, `Egyptian cat`, `oxcart`, `redbone`, `shoe shop, shoe-shop, shoe store`, `convertible`, `ox`, `crayfish, crawfish, crawdad, crawdaddy`, `cowboy hat, ten-gallon hat`, `conch`, `spaghetti squash`, `toy poodle`, `saltshaker, salt shaker`, `microwave, microwave oven`, `triceratops`, `necklace`, `castle`, `streetcar, tram, tramcar, trolley, trolley car`, `eel`, `diaper, nappy, napkin`, `standard poodle`, `prayer rug, prayer mat`, `radio, wireless`, `crane`, `envelope`, `rule, ruler`, `gar, garfish, garpike, billfish, Lepisosteus osseus`, `spider monkey, Ateles geoffroyi`, `Irish wolfhound`, `German shepherd, German shepherd dog, German police dog, alsatian`, `umbrella`, `sunglass`, `aircraft carrier, carrier, flattop, attack aircraft carrier`, `water buffalo, water ox, Asiatic buffalo, Bubalus bubalis`, `jellyfish`, `groom, bridegroom`, `tree frog, tree-frog`, `steel arch bridge`, `lemon`, `pickup, pickup truck`, `vault`, `groenendael`, `baseball`, `junco, snowbird`, `maillot, tank suit`, `gazelle`, `jack-o'-lantern`, `military uniform`, `slide rule, slipstick`, `wire-haired fox terrier`, `acorn squash`, `electric fan, blower`, `Brittany spaniel`, `chimpanzee, chimp, Pan troglodytes`, `pillow`, `binder, ring-binder`, `schipperke`, `Afghan hound, Afghan`, `plate rack`, `car mirror`, `hand-held computer, hand-held microcomputer`, `papillon`, `schooner`, `Bedlington terrier`, `cellular telephone, cellular phone, 
cellphone, cell, mobile phone`, `altar`, `Chesapeake Bay retriever`, `cabbage butterfly`, `polecat, fitch, foulmart, foumart, Mustela putorius`, `comic book`, `French horn, horn`, `daisy`, `organ, pipe organ`, `mashed potato`, `acorn`, `fly`, `chain`, `American alligator, Alligator mississipiensis`, `mink`, `garbage truck, dustcart`, `totem pole`, `wine bottle`, `strawberry`, `cricket`, `European fire salamander, Salamandra salamandra`, `coral reef`, `Welsh springer spaniel`, `bighorn, bighorn sheep, cimarron, Rocky Mountain bighorn, Rocky Mountain sheep, Ovis canadensis`, `snorkel`, `bald eagle, American eagle, Haliaeetus leucocephalus`, `meerkat, mierkat`, `grille, radiator grille`, `nematode, nematode worm, roundworm`, `anemone fish`, `corn`, `loggerhead, loggerhead turtle, Caretta caretta`, `palace`, `suit, suit of clothes`, `pineapple, ananas`, `macaque`, `ping-pong ball`, `ram, tup`, `church, church building`, `koala, koala bear, kangaroo bear, native bear, Phascolarctos cinereus`, `hare`, `bath towel`, `strainer`, `yawl`, `otterhound, otter hound`, `table lamp`, `king snake, kingsnake`, `lotion`, `lion, king of beasts, Panthera leo`, `thatch, thatched roof`, `basset, basset hound`, `black and gold garden spider, Argiope aurantia`, `barber chair`, `proboscis monkey, Nasalis larvatus`, `consomme`, `Irish terrier`, `Irish water spaniel`, `common iguana, iguana, Iguana iguana`, `Weimaraner`, `Great Dane`, `pug, pug-dog`, `rhinoceros beetle`, `vase`, `brass, memorial tablet, plaque`, `kit fox, Vulpes macrotis`, `king penguin, Aptenodytes patagonica`, `vending machine`, `dalmatian, coach dog, carriage dog`, `rock beauty, Holocanthus tricolor`, `pole`, `cuirass`, `bolete`, `jackfruit, jak, jack`, `monarch, monarch butterfly, milkweed butterfly, Danaus plexippus`, `chow, chow chow`, `nail`, `packet`, `half track`, `Lhasa, Lhasa apso`, `boathouse`, `hay`, `valley, vale`, `slot, one-armed bandit`, `volleyball`, `carton`, `shower cap`, `tile roof`, `lacewing, lacewing 
fly`, `patas, hussar monkey, Erythrocebus patas`, `boa constrictor, Constrictor constrictor`, `black swan, Cygnus atratus`, `lampshade, lamp shade`, `mousetrap`, `crutch`, `vestment`, `Pekinese, Pekingese, Peke`, `tusker`, `warplane, military plane`, `sandal`, `screw`, `custard apple`, `Scottish deerhound, deerhound`, `spindle`, `keeshond`, `hummingbird`, `letter opener, paper knife, paperknife`, `cucumber, cuke`, `bearskin, busby, shako`, `fire engine, fire truck`, `trombone`, `ringneck snake, ring-necked snake, ring snake`, `sorrel`, `fire screen, fireguard`, `paper towel`, `flute, transverse flute`, `hotdog, hot dog, red hot`, `Indian elephant, Elephas maximus`, `mosque`, `stingray`, `Rhodesian ridgeback`, `four-poster` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_tiny_patch16_224_en_4.1.0_3.0_1660166492014.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_tiny_patch16_224_en_4.1.0_3.0_1660166492014.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_tiny_patch16_224", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_tiny_patch16_224", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
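The model name encodes the input geometry: `patch16_224` means 224×224 input images are cut into non-overlapping 16×16 patches before the transformer sees them. A quick sanity check of the implied patch count (plain arithmetic, not Spark NLP API):

```python
# Geometry implied by the name image_classifier_vit_tiny_patch16_224:
# a 224x224 image tiled into 16x16 patches.
image_size = 224
patch_size = 16

patches_per_side = image_size // patch_size  # 14 patches along each axis
num_patches = patches_per_side ** 2          # 196 patch tokens fed to the transformer

print(patches_per_side, num_patches)
```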
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_tiny_patch16_224| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|22.1 MB| --- layout: model title: Universal Sentence Encoder Multilingual Large author: John Snow Labs name: tfhub_use_multi_lg date: 2020-12-08 task: Embeddings language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [embeddings, open_source, xx] deprecated: true annotator: UniversalSentenceEncoder article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks. The model is trained and optimized for greater-than-word-length text, such as sentences, phrases, or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable-length text and the output is a 512-dimensional vector. The universal-sentence-encoder model was trained with a deep averaging network (DAN) encoder. This model is a text encoder supporting 16 languages (Arabic, Chinese-simplified, Chinese-traditional, English, French, German, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish, Thai, Turkish, Russian). The details are described in the paper "[Multilingual Universal Sentence Encoder for Semantic Retrieval](https://arxiv.org/abs/1907.04307)". Note: This model only works on Linux and macOS operating systems and is not compatible with Windows due to an incompatibility in the SentencePiece library. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/tfhub_use_multi_lg_xx_2.7.0_2.4_1607439900967.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/tfhub_use_multi_lg_xx_2.7.0_2.4_1607439900967.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_multi_lg", "xx") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP'], ['Me encanta usar SparkNLP']], ["text"])) ``` ```scala ... val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_multi_lg", "xx") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I love NLP", "Me encanta usar SparkNLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP", "Me encanta usar SparkNLP"] embeddings_df = nlu.load('xx.use.multi_lg').predict(text, output_level='sentence') embeddings_df ```
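Downstream, these 512-dimensional sentence vectors are typically compared with cosine similarity, which is also what the STS benchmark below measures against human judgments. A minimal pure-Python sketch of that comparison, using short toy vectors rather than real model output:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional stand-ins for the model's 512-dimensional outputs;
# semantically similar sentences should yield vectors that point the same way.
v_en = [0.05, 0.13, -0.02, 0.40]
v_es = [0.04, 0.12, -0.01, 0.38]

print(round(cosine_similarity(v_en, v_es), 4))
```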
## Results It produces a 512-dimensional vector for each sentence. ```bash sentence xx_use_multi_lg_embeddings 0 I love NLP [-0.055500030517578125, 0.13188503682613373, -... 1 Me encanta usar SparkNLP [0.008856392465531826, 0.06127211079001427, 0.... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|tfhub_use_multi_lg| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|xx| ## Data Source This embeddings model is imported from [https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3](https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3) ## Benchmarking - We apply this model to the STS benchmark for semantic similarity. The evaluation can be seen in the [example notebook](https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb) made available. Results are shown below: ```bash STSBenchmark | dev | test | -----------------------------------|--------|-------| Correlation coefficient of Pearson | 0.837 | 0.825 | ``` - For semantic similarity retrieval, we evaluate the model on the [Quora and AskUbuntu retrieval tasks](https://arxiv.org/abs/1811.08008). Results are shown below: ```bash Dataset | Quora | AskUbuntu | Average | -----------------------|-------|-----------|---------| Mean Average Precision | 89.1 | 42.3 | 65.7 | ``` - For the translation pair retrieval, we evaluate the model on the United Nations Parallel Corpus. 
Results are shown below: ```bash Language Pair | en-es | en-fr | en-ru | en-zh | ---------------|--------|-------|-------|-------| Precision@1 | 86.1 | 83.3 | 88.9 | 78.8 | ``` --- layout: model title: Sentiment Analysis of Vietnamese texts author: John Snow Labs name: classifierdl_distilbert_sentiment date: 2022-02-09 tags: [vietnamese, sentiment, vi, open_source] task: Text Classification language: vi edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model identifies Positive or Negative sentiments in Vietnamese texts. ## Predicted Entities `POSITIVE`, `NEGATIVE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_distilbert_sentiment_vi_3.4.0_3.0_1644408533716.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_distilbert_sentiment_vi_3.4.0_3.0_1644408533716.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ .setCleanupMode("shrink") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") normalizer = Normalizer() \ .setInputCols(["token"]) \ .setOutputCol("normalized") lemmatizer = LemmatizerModel.pretrained("lemma", "vi") \ .setInputCols(["normalized"]) \ .setOutputCol("lemma") distilbert = DistilBertEmbeddings.pretrained("distilbert_base_cased", "vi")\ .setInputCols(["document", "token"])\ .setOutputCol("embeddings")\ .setCaseSensitive(False) embeddingsSentence = SentenceEmbeddings() \ .setInputCols(["document", "embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") sentimentClassifier = ClassifierDLModel.pretrained('classifierdl_distilbert_sentiment', 'vi') \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") vi_sentiment_pipeline = Pipeline(stages=[document, tokenizer, normalizer, lemmatizer, distilbert, embeddingsSentence, sentimentClassifier]) light_pipeline = LightPipeline(vi_sentiment_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) result = light_pipeline.annotate("Chất cotton siêu đẹp mịn mát.") result["class"] ``` ```scala val document = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") .setCleanupMode("shrink") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val normalizer = new Normalizer() .setInputCols(Array("token")) .setOutputCol("normalized") val lemmatizer = LemmatizerModel.pretrained("lemma", "vi") .setInputCols(Array("normalized")) .setOutputCol("lemma") val distilbert = DistilBertEmbeddings.pretrained("distilbert_base_cased", "vi") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(false) val embeddingsSentence = new SentenceEmbeddings() .setInputCols(Array("document", "embeddings")) .setOutputCol("sentence_embeddings") .setPoolingStrategy("AVERAGE") val sentimentClassifier = ClassifierDLModel.pretrained("classifierdl_distilbert_sentiment", "vi") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document, tokenizer, normalizer, lemmatizer, distilbert, embeddingsSentence, sentimentClassifier)) val light_pipeline = new LightPipeline(pipeline.fit(Seq("").toDF("text"))) val result = light_pipeline.annotate("Chất cotton siêu đẹp mịn mát.") ```
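The `SentenceEmbeddings` stage above, configured with `setPoolingStrategy("AVERAGE")`, collapses the per-token DistilBERT vectors into a single sentence vector by an element-wise mean. A toy sketch of that pooling step, using made-up 3-dimensional vectors in place of real token embeddings:

```python
def average_pooling(token_vectors):
    """Element-wise mean of token embedding vectors -> one sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Two hypothetical token embeddings for a two-token sentence.
tokens = [
    [0.2, 0.4, -0.6],
    [0.0, 0.2,  0.6],
]

print(average_pooling(tokens))
```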
## Results ```bash ['POSITIVE'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_distilbert_sentiment| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|vi| |Size:|23.6 MB| ## References [https://www.kaggle.com/datvuthanh/vietnamese-sentiment](https://www.kaggle.com/datvuthanh/vietnamese-sentiment) ## Benchmarking ```bash label precision recall f1-score support NEGATIVE 0.88 0.79 0.83 956 POSITIVE 0.80 0.89 0.84 931 accuracy - - 0.84 1887 macro-avg 0.84 0.84 0.84 1887 weighted-avg 0.84 0.84 0.84 1887 ``` --- layout: model title: Temporality / Certainty Assertion Status (md) author: John Snow Labs name: legassertion_time_md date: 2023-01-02 tags: [time, possibility, en, licensed] task: Assertion Status language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true recommended: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a medium (md) Assertion Status model aimed at detecting Temporality (PRESENT, PAST, FUTURE) or Certainty (POSSIBLE) in legal documents; it may improve on the results of the `legassertion_time` (small) model. ## Predicted Entities `PRESENT`, `PAST`, `FUTURE`, `POSSIBLE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/LEGASSERTION_TEMPORALITY){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legassertion_time_md_en_1.0.0_3.0_1672687111582.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legassertion_time_md_en_1.0.0_3.0_1672687111582.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings_ner") ner_model = legal.NerModel.pretrained('legner_contract_doc_parties', 'en', 'legal/models')\ .setInputCols(["sentence", "token", "embeddings_ner"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(["DOC", "EFFDATE", "PARTY"]) # We will check time only on these assertion = legal.AssertionDLModel.pretrained("legassertion_time_md", "en", "legal/models")\ .setInputCols(["sentence", "ner_chunk", "embeddings_ner"]) \ .setOutputCol("assertion") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter, assertion ]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) lp = nlp.LightPipeline(model) texts = ["The subsidiaries of Atlantic Inc will participate in a merging operation", "The Conditions and Warranties of this agreement might be modified"] lp.annotate(texts) ```
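`LightPipeline.annotate` returns one dict per input text, mapping each output column name to a list of strings; pairing each recognized chunk with its assertion label is then a simple `zip`. The dict below is a hand-written stand-in shaped after this card's Results section, not actual model output:

```python
# Hand-written stand-in for one element of lp.annotate(texts); the real keys
# follow the .setOutputCol(...) names used in the pipeline above.
annotation = {
    "ner_chunk": ["Atlantic Inc"],
    "assertion": ["FUTURE"],
}

# Pair each detected chunk with its predicted temporality/certainty label.
pairs = list(zip(annotation["ner_chunk"], annotation["assertion"]))
print(pairs)  # [('Atlantic Inc', 'FUTURE')]
```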
## Results ```bash chunk,begin,end,entity_type,assertion Atlantic Inc,20,31,ORG,FUTURE Conditions and Warranties,4,28,DOC,POSSIBLE ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legassertion_time_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, doc_chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| |Size:|2.2 MB| ## References In-house annotations augmented with synthetic data. ## Benchmarking ```bash label tp fp fn prec rec f1 PRESENT 115 11 5 0.9126984 0.9583333 0.9349593 POSSIBLE 79 5 4 0.9404762 0.9518072 0.9461077 PAST 54 5 11 0.91525424 0.83076924 0.8709678 FUTURE 77 3 4 0.9625 0.9506173 0.95652175 Macro-average 325 24 24 0.9327322 0.9228818 0.92778087 Micro-average 325 24 24 0.9312321 0.9312321 0.9312321 ``` --- layout: model title: English asr_wav2vec2_base_timit_demo_colab3_by_sherry7144 TFWav2Vec2ForCTC from sherry7144 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab3_by_sherry7144 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab3_by_sherry7144` is an English pipeline originally trained by sherry7144. 
NOTE: This pipeline only works on a CPU; if you need to run this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab3_by_sherry7144_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab3_by_sherry7144_en_4.2.0_3.0_1664041726477.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab3_by_sherry7144_en_4.2.0_3.0_1664041726477.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab3_by_sherry7144', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab3_by_sherry7144", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab3_by_sherry7144| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering Base Cased model (from AnonymousSub) author: John Snow Labs name: bert_qa_unsup_consert_base_squad2.0 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `unsup-consert-base_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_unsup_consert_base_squad2.0_en_4.0.0_3.0_1657193161816.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_unsup_consert_base_squad2.0_en_4.0.0_3.0_1657193161816.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_unsup_consert_base_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_unsup_consert_base_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_unsup_consert_base_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/unsup-consert-base_squad2.0 --- layout: model title: Translate American Sign Language to English Pipeline author: John Snow Labs name: translate_ase_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ase, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `ase` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ase_en_xx_2.7.0_2.4_1609691414900.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ase_en_xx_2.7.0_2.4_1609691414900.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ase_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ase_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ase.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ase_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: German asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman` is a German model originally trained by jonatasgrosman. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman_de_4.2.0_3.0_1664096002135.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman_de_4.2.0_3.0_1664096002135.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_german_by_jonatasgrosman| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from leabum) author: John Snow Labs name: distilbert_qa_leabum_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `leabum`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_leabum_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771943076.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_leabum_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771943076.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_leabum_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_leabum_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_leabum_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/leabum/distilbert-base-uncased-finetuned-squad --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from Neha2608) author: John Snow Labs name: xlmroberta_ner_neha2608_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `Neha2608`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_neha2608_base_finetuned_panx_de_4.1.0_3.0_1660429918530.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_neha2608_base_finetuned_panx_de_4.1.0_3.0_1660429918530.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_neha2608_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_neha2608_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_neha2608_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Neha2608/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Pipeline for Detect Generic PHI for Deidentification (Arabic) author: John Snow Labs name: ner_deid_generic_pipeline date: 2023-05-31 tags: [licensed, deidentification, clinical, ar, generic] task: Pipeline Healthcare language: ar edition: Healthcare NLP 4.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deid_generic](https://nlp.johnsnowlabs.com/2023/05/29/ner_deid_generic_ar.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_pipeline_ar_4.4.1_3.0_1685568113997.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_pipeline_ar_4.4.1_3.0_1685568113997.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_generic_pipeline", "ar", "clinical/models") text = '''ملاحظات سريرية - مريض الربو. التاريخ: 16 أبريل 2000. اسم المريضة: ليلى حسن. العنوان: شارع المعرفة، مبنى رقم 789، حي الأمانة، جدة. الرمز البريدي: 54321. البلد: المملكة العربية السعودية. اسم المستشفى: مستشفى النور. اسم الطبيب: د. أميرة أحمد. تفاصيل الحالة: المريضة ليلى حسن، البالغة من العمر 35 عامًا، تعاني من مرض الربو المزمن. تشكو من ضيق التنفس والسعال المتكرر والشهيق الشديد. تم تشخيصها بمرض الربو بناءً على تاريخها الطبي واختبارات وظائف الرئة. الخطة: تم وصف مضادات الالتهاب غير الستيرويدية والموسعات القصبية لتحسين التنفس وتقليل التهيج. يجب على المريضة حمل معها جهاز الاستنشاق في حالة حدوث نوبة ربو حادة. يتعين على المريضة تجنب التحسس من العوامل المسببة للربو، مثل الدخان والغبار والحيوانات الأليفة. يجب مراقبة وظائف الرئة بانتظام ومتابعة التعليمات الطبية المتعلقة بمرض الربو. تعليم المريضة بشأن كيفية استخدام جهاز الاستنشاق بشكل صحيح وتقنيات التنفس الصحيح.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_generic_pipeline", "ar", "clinical/models") val text = "ملاحظات سريرية - مريض الربو. التاريخ: 16 أبريل 2000. اسم المريضة: ليلى حسن. العنوان: شارع المعرفة، مبنى رقم 789، حي الأمانة، جدة. الرمز البريدي: 54321. البلد: المملكة العربية السعودية. اسم المستشفى: مستشفى النور. اسم الطبيب: د. أميرة أحمد. تفاصيل الحالة: المريضة ليلى حسن، البالغة من العمر 35 عامًا، تعاني من مرض الربو المزمن. تشكو من ضيق التنفس والسعال المتكرر والشهيق الشديد. تم تشخيصها بمرض الربو بناءً على تاريخها الطبي واختبارات وظائف الرئة. الخطة: تم وصف مضادات الالتهاب غير الستيرويدية والموسعات القصبية لتحسين التنفس وتقليل التهيج. يجب على المريضة حمل معها جهاز الاستنشاق في حالة حدوث نوبة ربو حادة. يتعين على المريضة تجنب التحسس من العوامل المسببة للربو، مثل الدخان والغبار والحيوانات الأليفة. 
يجب مراقبة وظائف الرئة بانتظام ومتابعة التعليمات الطبية المتعلقة بمرض الربو. تعليم المريضة بشأن كيفية استخدام جهاز الاستنشاق بشكل صحيح وتقنيات التنفس الصحيح. " val result = pipeline.fullAnnotate(text) ```
## Results ```bash +---------------+----------------------+ |chunks |entities | +---------------+----------------------+ |16 أبريل 2000 |DATE | |ليلى حسن |NAME | |789، |LOCATION | |الأمانة، جدة |LOCATION | |54321 |LOCATION | |المملكة العربية |LOCATION | |السعودية |LOCATION | |مستشفى النور |LOCATION | |أميرة أحمد |NAME | |ليلى |NAME | +---------------+---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.1+| |License:|Licensed| |Edition:|Official| |Language:|ar| |Size:|1.2 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Clinical Context Spell Checker author: John Snow Labs name: spellcheck_clinical date: 2021-03-30 tags: [en, licensed] supported: true task: Spell Check language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 annotator: SpellCheckModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Spell Checker is a sequence-to-sequence model that detects and corrects spelling errors in your input text. It’s based on Levenshtein Automaton for generating candidate corrections and a Neural Language Model for ranking corrections. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_en_3.0.0_3.0_1617128886628.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_en_3.0.0_3.0_1617128886628.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The model works at the token level, so you must put it after tokenization. 
The model can change the length of the tokens when correcting words, so keep this in mind when using it before other annotators, such as NerConverter, that rely on absolute character references to the original document.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = RecursiveTokenizer()\ .setInputCols(["document"])\ .setOutputCol("token")\ .setPrefixes(["\"", "“", "(", "[", "\n", "."]) \ .setSuffixes(["\"", "”", ".", ",", "?", ")", "]", "!", ";", ":", "'s", "’s"]) spellModel = ContextSpellCheckerModel\ .pretrained()\ .setInputCols("token")\ .setOutputCol("checked") ``` ```scala val assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new RecursiveTokenizer() .setInputCols(Array("document")) .setOutputCol("token") .setPrefixes(Array("\"", "“", "(", "[", "\n", ".")) .setSuffixes(Array("\"", "”", ".", ",", "?", ")", "]", "!", ";", ":", "'s", "’s")) val spellChecker = ContextSpellCheckerModel .pretrained() .setInputCols("token") .setOutputCol("checked") ``` {:.nlu-block} ```python import nlu nlu.load("en.spell.clinical").predict("""PUT YOUR STRING HERE""") ```
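The candidate-generation idea described above (a Levenshtein automaton proposing nearby strings, which a neural language model then ranks) can be illustrated with a plain-Python sketch. This is a conceptual illustration only, not the model's actual implementation; the toy vocabulary and the misspelled token below are invented for the example, and the real model replaces the vocabulary lookup with an automaton and the set intersection with language-model scoring.

```python
# Conceptual sketch: enumerate all strings within edit distance 1 of a
# token, then keep only those present in a (here, toy) vocabulary.
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one deletion, transposition, substitution, or
    insertion away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def candidates(word, vocab):
    """Vocabulary words within edit distance 1, or the word itself."""
    if word in vocab:
        return {word}
    return (edits1(word) & vocab) or {word}

vocab = {"patient", "suffered", "severe", "fever"}  # toy vocabulary
print(candidates("pateint", vocab))  # prints {'patient'}
```

In the pretrained model, each surviving candidate is then scored in context, so that, for instance, a misspelling is corrected differently depending on the surrounding clinical text.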
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|spellcheck_clinical| |Compatibility:|Spark NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token]| |Language:|en| ## Data Source The dataset used contains data drawn from MT Samples clinical notes, i2b2 clinical notes, and PubMed abstracts. --- layout: model title: English DistilBertForQuestionAnswering model (from superspray) author: John Snow Labs name: distilbert_qa_base_squad2_custom_dataset date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert_base_squad2_custom_dataset` is an English model originally trained by `superspray`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_squad2_custom_dataset_en_4.0.0_3.0_1654727906233.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_squad2_custom_dataset_en_4.0.0_3.0_1654727906233.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_squad2_custom_dataset","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_squad2_custom_dataset","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.distil_bert.base").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_squad2_custom_dataset| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/superspray/distilbert_base_squad2_custom_dataset --- layout: model title: Japanese Bert Embeddings (Small) author: John Snow Labs name: bert_embeddings_bert_small_japanese date: 2022-04-11 tags: [bert, embeddings, ja, open_source] task: Embeddings language: ja edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-small-japanese` is a Japanese model originally trained by `izumi-lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_small_japanese_ja_3.4.2_3.0_1649674508136.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_small_japanese_ja_3.4.2_3.0_1649674508136.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_small_japanese","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["私はSpark NLPを愛しています"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_small_japanese","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("私はSpark NLPを愛しています").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ja.embed.bert_small_japanese").predict("""私はSpark NLPを愛しています""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_small_japanese| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Size:|68.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/izumi-lab/bert-small-japanese - https://github.com/google-research/bert - https://github.com/retarfi/language-pretraining/tree/v1.0 - https://arxiv.org/abs/2003.10555 - https://creativecommons.org/licenses/by-sa/4.0/ --- layout: model title: Legal Provisions Clause Binary Classifier author: John Snow Labs name: legclf_provisions_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `provisions` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `provisions` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_provisions_clause_en_1.0.0_3.2_1660123856314.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_provisions_clause_en_1.0.0_3.2_1660123856314.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_provisions_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
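The paragraph-level splitting recommended in the description (needed because the sentence embeddings accept up to 512 tokens) can be sketched in plain Python before building the Spark DataFrame. This is only a minimal sketch of the "paragraph splitting (by multiline)" idea, not the exact logic of the workshop notebook; the regular expression and the sample contract text are illustrative assumptions.

```python
import re

def split_paragraphs(document):
    """Split a long legal document into paragraph-sized chunks on
    blank-line (multiline) boundaries, dropping empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

# Hypothetical contract excerpt, for illustration only.
doc = """Section 1. Provisions.

The provisions of this Agreement shall survive termination.

Section 2. Governing Law.

This Agreement is governed by the laws of the State of Delaware."""

chunks = split_paragraphs(doc)
print(len(chunks))  # prints 4
```

Each resulting chunk can then be loaded into the `clause_text` column of the DataFrame shown above, so the classifier scores one candidate clause at a time instead of the whole document.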
## Results ```bash +------------+ |      result| +------------+ |[provisions]| |     [other]| |     [other]| |[provisions]| +------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_provisions_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.5 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.88 0.86 0.87 183 provisions 0.71 0.73 0.72 82 accuracy - - 0.82 265 macro-avg 0.79 0.80 0.79 265 weighted-avg 0.82 0.82 0.82 265 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Atlantic-Congo Languages author: John Snow Labs name: opus_mt_en_alv date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, alv, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. 
- source languages: `en` - target languages: `alv` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_alv_xx_2.7.0_2.4_1609163699220.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_alv_xx_2.7.0_2.4_1609163699220.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_alv", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentences to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_alv", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentences to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.alv').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_alv| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Detect Drug Information author: John Snow Labs name: ner_posology_greedy_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, drug, en] task: [Named Entity Recognition, Pipeline Healthcare] language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_posology_greedy](https://nlp.johnsnowlabs.com/2021/03/31/ner_posology_greedy_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_POSOLOGY.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_greedy_pipeline_en_3.4.1_3.0_1647872438982.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_greedy_pipeline_en_3.4.1_3.0_1647872438982.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_posology_greedy_pipeline", "en", "clinical/models") pipeline.fullAnnotate("The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_posology_greedy_pipeline", "en", "clinical/models") pipeline.fullAnnotate("The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology_greedy.pipeline").predict("""The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.""") ```
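The chunk table shown in the Results section is just a flattened view of the pipeline's chunk annotations. As a rough, self-contained sketch of how such a table can be built — using a hypothetical `Ann` namedtuple as a stand-in for Spark NLP's `Annotation` objects, which expose the same `result`, `begin`, `end`, and `metadata` fields — one could write:

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark NLP Annotation; the real objects
# returned by fullAnnotate expose the same four fields used here.
Ann = namedtuple("Ann", ["result", "begin", "end", "metadata"])

def tabulate_chunks(annotations):
    # Turn chunk annotations into (chunk, begin, end, entity) rows.
    return [(a.result, a.begin, a.end, a.metadata.get("entity"))
            for a in annotations]

chunks = [
    Ann("1 capsule of Advil 10 mg", 27, 50, {"entity": "DRUG"}),
    Ann("for 5 days", 52, 61, {"entity": "DURATION"}),
]
for row in tabulate_chunks(chunks):
    print(row)
```

This is only an illustration of the output shape; in practice the annotations come from the `ner_chunk`-style output column of `fullAnnotate`.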
## Results ```bash +----+----------------------------------+---------+-------+------------+ | | chunks | begin | end | entities | |---:|---------------------------------:|--------:|------:|-----------:| | 0 | 1 capsule of Advil 10 mg | 27 | 50 | DRUG | | 1 | magnesium hydroxide 100mg/1ml PO | 67 | 98 | DRUG | | 2 | for 5 days | 52 | 61 | DURATION | | 3 | 40 units of insulin glargine | 168 | 195 | DRUG | | 4 | at night | 197 | 204 | FREQUENCY | | 5 | 12 units of insulin lispro | 207 | 232 | DRUG | | 6 | with meals | 234 | 243 | FREQUENCY | | 7 | metformin 1000 mg | 250 | 266 | DRUG | | 8 | two times a day | 268 | 282 | FREQUENCY | +----+----------------------------------+---------+-------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_greedy_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_declutr_techqa date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `declutr-techqa` is an English model originally trained by `AnonymousSub`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_declutr_techqa_en_4.0.0_3.0_1655728229593.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_declutr_techqa_en_4.0.0_3.0_1655728229593.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_declutr_techqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_declutr_techqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.techqa_declutr.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_declutr_techqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|307.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/declutr-techqa --- layout: model title: English RobertaForQuestionAnswering (from huxxx657) author: John Snow Labs name: roberta_qa_roberta_base_finetuned_scrambled_squad_15 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-scrambled-squad-15` is an English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_scrambled_squad_15_en_4.0.0_3.0_1655734059356.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_scrambled_squad_15_en_4.0.0_3.0_1655734059356.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_finetuned_scrambled_squad_15","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_finetuned_scrambled_squad_15","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_scrambled_15.by_huxxx657").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_finetuned_scrambled_squad_15| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-scrambled-squad-15 --- layout: model title: Polish T5ForConditionalGeneration Base Cased model (from allegro) author: John Snow Labs name: t5_plt5_base date: 2023-01-30 tags: [pl, open_source, t5, tensorflow] task: Text Generation language: pl edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `plt5-base` is a Polish model originally trained by `allegro`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_plt5_base_pl_4.3.0_3.0_1675106653699.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_plt5_base_pl_4.3.0_3.0_1675106653699.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_plt5_base","pl") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_plt5_base","pl") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_plt5_base| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|pl| |Size:|601.0 MB| ## References - https://huggingface.co/allegro/plt5-base - https://github.com/facebookresearch/cc_net - http://nkjp.pl/index.php?page=14&lang=1 - http://opus.nlpl.eu/OpenSubtitles-v2018.php - https://dumps.wikimedia.org/ - https://wolnelektury.pl/ - https://ml.allegro.tech/ - http://zil.ipipan.waw.pl/ --- layout: model title: Stop Words Cleaner for Persian author: John Snow Labs name: stopwords_fa date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: fa edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, fa] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_fa_fa_2.5.4_2.4_1594742438615.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_fa_fa_2.5.4_2.4_1594742438615.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_fa", "fa") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("جان اسنو غیر از سلطان شمال ، یک پزشک انگلیسی و رهبر توسعه بیهوشی و بهداشت پزشکی است.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_fa", "fa") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("جان اسنو غیر از سلطان شمال ، یک پزشک انگلیسی و رهبر توسعه بیهوشی و بهداشت پزشکی است.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""جان اسنو غیر از سلطان شمال ، یک پزشک انگلیسی و رهبر توسعه بیهوشی و بهداشت پزشکی است."""] stopword_df = nlu.load('fa.stopwords').predict(text) stopword_df[["cleanTokens"]] ```
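For intuition, the filtering this annotator performs can be sketched independently of Spark NLP. The stop list below is a tiny illustrative sample, not the model's actual Persian list:

```python
# Tiny illustrative stop list; the pretrained model ships its own,
# much larger Persian stop-word list.
STOP_WORDS = {"از", "و", "که", "است", "یک"}

def clean_tokens(tokens):
    # Keep only the tokens that are not in the stop list.
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["جان", "اسنو", "یک", "پزشک", "انگلیسی", "است"]
print(clean_tokens(tokens))
```

The pretrained model does the same membership filtering over its curated list, operating on the `token` annotations produced upstream.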
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=2, result='جان', metadata={'sentence': '0'}), Row(annotatorType='token', begin=4, end=7, result='اسنو', metadata={'sentence': '0'}), Row(annotatorType='token', begin=9, end=11, result='غیر', metadata={'sentence': '0'}), Row(annotatorType='token', begin=16, end=20, result='سلطان', metadata={'sentence': '0'}), Row(annotatorType='token', begin=22, end=25, result='شمال', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_fa| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|fa| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: English BertForQuestionAnswering model (from kamilali) author: John Snow Labs name: bert_qa_kamilali_distilbert_base_uncased_finetuned_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `kamilali`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_kamilali_distilbert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654187527690.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_kamilali_distilbert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654187527690.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_kamilali_distilbert_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_kamilali_distilbert_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.distilled_base_uncased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_kamilali_distilbert_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/kamilali/distilbert-base-uncased-finetuned-squad --- layout: model title: Korean Electra Embeddings (from krevas) author: John Snow Labs name: electra_embeddings_finance_koelectra_small_generator date: 2022-05-17 tags: [ko, open_source, electra, embeddings] task: Embeddings language: ko edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `finance-koelectra-small-generator` is a Korean model originally trained by `krevas`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_finance_koelectra_small_generator_ko_3.4.4_3.0_1652786816801.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_finance_koelectra_small_generator_ko_3.4.4_3.0_1652786816801.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_finance_koelectra_small_generator","ko") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["나는 Spark NLP를 좋아합니다"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_finance_koelectra_small_generator","ko") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("나는 Spark NLP를 좋아합니다").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_finance_koelectra_small_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ko| |Size:|52.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/krevas/finance-koelectra-small-generator - https://openreview.net/forum?id=r1xMH1BtvB - https://github.com/google-research/electra --- layout: model title: Nigerian Pidgin XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_naija date: 2022-08-13 tags: [pcm, open_source, xlm_roberta, ner] task: Named Entity Recognition language: pcm edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-ner-naija` is a Nigerian Pidgin model originally trained by `mbeukman`. ## Predicted Entities `ORG`, `LOC`, `PER`, `DATE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_naija_pcm_4.1.0_3.0_1660427533599.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_naija_pcm_4.1.0_3.0_1660427533599.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_naija","pcm") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_naija","pcm") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_naija| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|pcm| |Size:|778.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-ner-naija - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://github.com/masakhane-io/masakhane-ner - https://www.apache.org/licenses/LICENSE-2.0 --- layout: model title: Pipeline to Identify Adverse Drug Events author: John Snow Labs name: explain_clinical_doc_ade date: 2021-02-11 task: Pipeline Healthcare language: en nav_key: models edition: Spark NLP 2.7.3 spark_version: 2.4 tags: [en, licensed] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pipeline containing multiple models to identify Adverse Drug Events in clinical and free text.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb#scrollTo=8i805kxSnnwA){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_ade_en_2.7.3_2.4_1613049375392.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_ade_en_2.7.3_2.4_1613049375392.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_clinical_doc_ade', 'en', 'clinical/models') res = pipeline.fullAnnotate('The clinical course suggests that the interstitial pneumonitis was induced by hydroxyurea.') ``` ```scala val ade_pipeline = new PretrainedPipeline("explain_clinical_doc_ade", "en", "clinical/models") val result = ade_pipeline.fullAnnotate("""The clinical course suggests that the interstitial pneumonitis was induced by hydroxyurea.""")(0) ```
## Results ```bash | # | chunks | entities | assertion | |----|-------------------------------|------------|------------| | 0 | interstitial pneumonitis | ADE | Present | | 1 | hydroxyurea | DRUG | Present | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_clinical_doc_ade| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.3+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Included Models `biobert_pubmed_base_cased`, `classifierdl_ade_conversational_biobert`, `ner_ade_biobert`, `assertion_dl_biobert` --- layout: model title: XLNet Large CoNLL-03 NER Pipeline author: John Snow Labs name: xlnet_large_token_classifier_conll03_pipeline date: 2022-04-21 tags: [open_source, ner, token_classifier, xlnet, conll03, large, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [xlnet_large_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/28/xlnet_large_token_classifier_conll03_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlnet_large_token_classifier_conll03_pipeline_en_3.4.1_3.0_1650546509566.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlnet_large_token_classifier_conll03_pipeline_en_3.4.1_3.0_1650546509566.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("xlnet_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("xlnet_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PERSON | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlnet_large_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.4 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - XlnetForTokenClassification - NerConverter - Finisher --- layout: model title: Legal Grant of security interest Clause Binary Classifier author: John Snow Labs name: legclf_grant_of_security_interest_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `grant-of-security-interest` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
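Outside the pipeline itself, the paragraph splitting by multiline mentioned above can be sketched in plain Python, with a whitespace token count as a crude proxy for the 512-token embedding limit (the real tokenizer counts differently, so treat this only as a rough pre-check):

```python
import re

MAX_TOKENS = 512  # the sentence embeddings accept up to 512 tokens

def split_paragraphs(text):
    # Split on blank lines ("multiline" splitting) and drop empty pieces.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def needs_further_split(paragraph):
    # Whitespace tokenization is only a rough proxy for the model's tokenizer.
    return len(paragraph.split()) > MAX_TOKENS

doc = ("1. Grant of Security Interest. The Debtor hereby grants a security "
       "interest in the Collateral.\n\n2. Miscellaneous. Headings are for "
       "convenience only.")
paragraphs = split_paragraphs(doc)
print(len(paragraphs), [needs_further_split(p) for p in paragraphs])
```

Each resulting paragraph can then be fed to the classifier as its own `clause_text` row.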
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `grant-of-security-interest` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_grant_of_security_interest_clause_en_1.0.0_3.2_1660122489934.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_grant_of_security_interest_clause_en_1.0.0_3.2_1660122489934.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_grant_of_security_interest_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[grant-of-security-interest]| |[other]| |[other]| |[grant-of-security-interest]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_grant_of_security_interest_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support grant-of-security-interest 1.00 0.91 0.95 34 other 0.97 1.00 0.98 85 accuracy - - 0.97 119 macro-avg 0.98 0.96 0.97 119 weighted-avg 0.98 0.97 0.97 119 ``` --- layout: model title: English DistilBertForQuestionAnswering model (from andi611) Squad2 with Restaurant, Neg, Repeat author: John Snow Labs name: distilbert_qa_base_uncased_squad2_with_ner_mit_restaurant_with_neg_with_repeat date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-squad2-with-ner-mit-restaurant-with-neg-with-repeat` is an English model originally trained by `andi611`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_with_ner_mit_restaurant_with_neg_with_repeat_en_4.0.0_3.0_1654727349522.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_with_ner_mit_restaurant_with_neg_with_repeat_en_4.0.0_3.0_1654727349522.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2_with_ner_mit_restaurant_with_neg_with_repeat","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2_with_ner_mit_restaurant_with_neg_with_repeat","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.distil_bert.base_uncased.by_andi611").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")  ```
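The nlu one-liner above packs the question and its context into a single string separated by `|||`; a trivial helper makes that convention explicit:

```python
def to_nlu_qa_input(question, context):
    """Join a question and its context with the '|||' separator
    used in nlu question-answering predict() calls."""
    return f"{question}|||{context}"

print(to_nlu_qa_input("What is my name?",
                      "My name is Clara and I live in Berkeley."))
```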
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_squad2_with_ner_mit_restaurant_with_neg_with_repeat| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/andi611/distilbert-base-uncased-squad2-with-ner-mit-restaurant-with-neg-with-repeat --- layout: model title: Part of Speech for Chinese author: John Snow Labs name: pos_ud_gsd date: 2021-01-03 task: Part of Speech Tagging language: zh edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [pos, zh, cn, open_source] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 13 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. ## Predicted Entities `ADJ`, `ADP`, `ADV`, `AUX`, `CONJ`, `DET`, `NOUN`, `NUM`, `PART`, `PRON`, `PROPN`, `PUNCT`, `SYM`, `VERB`, and `X`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_zh_2.7.0_2.4_1609699328856.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_zh_2.7.0_2.4_1609699328856.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an nlp pipeline after tokenization.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh")\ .setInputCols(["sentence"])\ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_gsd", "zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, word_segmenter, pos ]) example = spark.createDataFrame([['然而,这样的处理也衍生了一些问题。']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh") .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_gsd", "zh") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter, pos)) val data = Seq("然而,这样的处理也衍生了一些问题。").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""然而,这样的处理也衍生了一些问题。"""] pos_df = nlu.load('zh.pos.ud_gsd').predict(text, output_level='token') pos_df ```
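Downstream of tagging, the (token, tag) pairs are easy to aggregate; a plain-Python sketch tallying tags for the example sentence (pairs copied from the Results table that follows):

```python
from collections import Counter

# (token, POS) pairs for 然而,这样的处理也衍生了一些问题。
pairs = [("然而", "ADV"), (",", "PUNCT"), ("这样", "PRON"), ("的", "PART"),
         ("处理", "NOUN"), ("也", "ADV"), ("衍生", "VERB"), ("了", "PART"),
         ("一些", "ADJ"), ("问题", "NOUN"), ("。", "PUNCT")]

# Count how often each POS tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in pairs)
print(tag_counts["NOUN"], tag_counts["PUNCT"])  # 2 2
```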
## Results ```bash +-----+-----+ |token|pos | +-----+-----+ |然而 |ADV | |, |PUNCT| |这样 |PRON | |的 |PART | |处理 |NOUN | |也 |ADV | |衍生 |VERB | |了 |PART | |一些 |ADJ | |问题 |NOUN | |。 |PUNCT| +-----+-----+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_gsd| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[pos]| |Language:|zh| ## Data Source The model was trained on the [Universal Dependencies (UD)](https://universaldependencies.org/) for Chinese (GNU license) curated by Google (Simplified Chinese). Reference: > Zeman, Daniel; Nivre, Joakim; Abrams, Mitchell; et al., 2020, Universal Dependencies 2.7, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-3424. ## Benchmarking ```bash | pos_tag | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | ADJ | 0.68 | 0.67 | 0.67 | 271 | | ADP | 0.85 | 0.85 | 0.85 | 513 | | ADV | 0.90 | 0.91 | 0.91 | 549 | | AUX | 0.97 | 0.96 | 0.96 | 91 | | CONJ | 0.93 | 0.86 | 0.90 | 191 | | DET | 0.93 | 0.93 | 0.93 | 138 | | NOUN | 0.89 | 0.92 | 0.90 | 3310 | | NUM | 0.98 | 0.98 | 0.98 | 653 | | PART | 0.95 | 0.94 | 0.94 | 1346 | | PRON | 0.99 | 0.97 | 0.98 | 168 | | PROPN | 0.88 | 0.85 | 0.86 | 1006 | | PUNCT | 1.00 | 1.00 | 1.00 | 1688 | | SYM | 1.00 | 1.00 | 1.00 | 3 | | VERB | 0.88 | 0.86 | 0.87 | 1981 | | X | 0.99 | 0.79 | 0.88 | 104 | | accuracy | | | 0.91 | 12012 | | macro avg | 0.92 | 0.90 | 0.91 | 12012 | | weighted avg | 0.91 | 0.91 | 0.91 | 12012 | ``` --- layout: model title: English BertForQuestionAnswering Small Cased model (from motiondew) author: John Snow Labs name: bert_qa_sd1_small date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true
annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-sd1-small` is an English model originally trained by `motiondew`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_sd1_small_en_4.0.0_3.0_1657188031070.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_sd1_small_en_4.0.0_3.0_1657188031070.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sd1_small","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sd1_small","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_sd1_small| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/motiondew/bert-sd1-small --- layout: model title: Pipeline to Resolve ICD-9-CM Codes author: John Snow Labs name: icd9_resolver_pipeline date: 2022-09-30 tags: [en, licensed, clinical, resolver, chunk_mapping, pipeline, icd9cm] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps entities with their corresponding ICD-9-CM codes. You’ll just feed your text and it will return the corresponding ICD-9-CM codes. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd9_resolver_pipeline_en_4.1.0_3.0_1664543263329.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd9_resolver_pipeline_en_4.1.0_3.0_1664543263329.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline resolver_pipeline = PretrainedPipeline("icd9_resolver_pipeline", "en", "clinical/models") text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage""" result = resolver_pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val med_resolver_pipeline = new PretrainedPipeline("icd9_resolver_pipeline", "en", "clinical/models") val result = med_resolver_pipeline.fullAnnotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage""") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd9.pipeline").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage""") ```
## Results ```bash +-----------------------------+---------+---------+ |chunk |ner_chunk|icd9_code| +-----------------------------+---------+---------+ |gestational diabetes mellitus|PROBLEM |V12.21 | |anisakiasis |PROBLEM |127.1 | |fetal and neonatal hemorrhage|PROBLEM |772 | +-----------------------------+---------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd9_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|2.2 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger --- layout: model title: English BertForQuestionAnswering model (from SauravMaheshkar) author: John Snow Labs name: bert_qa_bert_multi_cased_finetuned_chaii date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-multi-cased-finetuned-chaii` is an English model originally trained by `SauravMaheshkar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_multi_cased_finetuned_chaii_en_4.0.0_3.0_1654183828565.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_multi_cased_finetuned_chaii_en_4.0.0_3.0_1654183828565.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_multi_cased_finetuned_chaii","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_multi_cased_finetuned_chaii","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.chaii.bert.cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_multi_cased_finetuned_chaii| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/SauravMaheshkar/bert-multi-cased-finetuned-chaii --- layout: model title: Pipeline to Detect PHI for Deidentification in Romanian (BERT) author: John Snow Labs name: ner_deid_subentity_bert_pipeline date: 2023-03-09 tags: [deidentification, bert, phi, ner, ro, licensed] task: Named Entity Recognition language: ro edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deid_subentity_bert](https://nlp.johnsnowlabs.com/2022/06/27/ner_deid_subentity_bert_ro_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_bert_pipeline_ro_4.3.0_3.2_1678385703815.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_bert_pipeline_ro_4.3.0_3.2_1678385703815.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_subentity_bert_pipeline", "ro", "clinical/models") text = '''Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_subentity_bert_pipeline", "ro", "clinical/models") val text = "Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401" val result = pipeline.fullAnnotate(text) ```
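Deidentification ultimately consumes the `begin`/`end` character offsets that `fullAnnotate` attaches to each detected chunk. A minimal plain-Python sketch of offset-based masking (the offsets below are hard-coded for a fragment of the example text and treat `end` as inclusive; names and exact offsets in real pipeline output may differ):

```python
def mask_entities(text, entities):
    """Replace each detected span with its label placeholder, working
    right-to-left so earlier character offsets stay valid.
    `entities` holds (begin, end, label) triples with `end` inclusive."""
    for begin, end, label in sorted(entities, key=lambda e: e[0], reverse=True):
        text = text[:begin] + f"<{label}>" + text[end + 1:]
    return text

sample = "Nume si Prenume : BUREAN MARIA, Varsta: 77"
print(mask_entities(sample, [(18, 29, "PATIENT"), (40, 41, "AGE")]))
# Nume si Prenume : <PATIENT>, Varsta: <AGE>
```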
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-----------------------------|--------:|------:|:------------|-------------:| | 0 | Spitalul Pentru Ochi de Deal | 0 | 27 | HOSPITAL | 0.84306 | | 1 | Drumul Oprea Nr. 972 | 30 | 49 | STREET | 0.99784 | | 2 | Vaslui | 51 | 56 | CITY | 0.9896 | | 3 | 737405 | 59 | 64 | ZIP | 1 | | 4 | +40(235)413773 | 79 | 92 | PHONE | 1 | | 5 | 25 May 2022 | 119 | 129 | DATE | 1 | | 6 | BUREAN MARIA | 158 | 169 | PATIENT | 0.7259 | | 7 | 77 | 180 | 181 | AGE | 1 | | 8 | Agota Evelyn Tımar | 191 | 208 | DOCTOR | 0.803667 | | 9 | 2450502264401 | 218 | 230 | IDNUM | 0.9995 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity_bert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|ro| |Size:|484.0 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Legal Brokers Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_brokers_bert date: 2023-03-05 tags: [en, legal, classification, clauses, brokers, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Brokers` clause type. To use this model, make sure you provide enough context as input.
Adding a Sentence Splitter to the pipeline would make the model see only individual sentences rather than the whole text, so it is better to skip it unless you want to do Binary Classification at sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that this model's embeddings support up to 512 tokens; if your input is longer, consider splitting it into smaller pieces (see the same tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Brokers`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_brokers_bert_en_1.0.0_3.0_1678049935750.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_brokers_bert_en_1.0.0_3.0_1678049935750.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_brokers_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Brokers]| |[Other]| |[Other]| |[Brokers]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_brokers_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Brokers 0.98 0.98 0.98 47 Other 0.98 0.98 0.98 66 accuracy - - 0.98 113 macro-avg 0.98 0.98 0.98 113 weighted-avg 0.98 0.98 0.98 113 ``` --- layout: model title: Catalan RobertaForQuestionAnswering Base Cased model (from crodri) author: John Snow Labs name: roberta_qa_base_ca_v2_catalan date: 2023-01-20 tags: [ca, open_source, roberta, question_answering, tensorflow] task: Question Answering language: ca edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-ca-v2-qa-catalanqa` is a Catalan model originally trained by `crodri`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_ca_v2_catalan_ca_4.3.0_3.0_1674213071415.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_ca_v2_catalan_ca_4.3.0_3.0_1674213071415.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_ca_v2_catalan","ca")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_ca_v2_catalan","ca") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_ca_v2_catalan| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ca| |Size:|456.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/crodri/roberta-base-ca-v2-qa-catalanqa --- layout: model title: Detect clinical entities (ner_jsl_biobert) author: John Snow Labs name: ner_jsl_biobert date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect symptoms, modifiers, age, drugs, treatments, tests and much more using a single pretrained NER model. Definitions of Predicted Entities: - `Symptom`: All the symptoms mentioned in the document, of a patient or someone else. - `Pulse`: Peripheral heart rate, without advanced information like measurement location. - `Death_Entity`: Mentions that indicate the death of a patient. - `Age`: All mentions of ages, past or present, related to the patient or to anybody else. - `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately. - `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs). - `Drug_Ingredient`: Active ingredient/s found in drug products. - `Weight`: All mentions related to a patient's weight. - `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments.
- `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted. - `Gender`: Gender-specific nouns and pronouns. - `Temperature`: All mentions that refer to body temperature. - `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital signs Headers are extracted separately with their specific labels). - `Route`: Drug and medication administration routes, as described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements. - `Respiration`: Number of breaths per minute. - `Frequency`: Frequency of administration for a dose prescribed. - `Dosage`: Quantity prescribed by the physician for an active ingredient; measurement units as described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Allergen`: Allergen-related extractions mentioned in the document. 
## Predicted Entities `Symptom_Name`, `Negated`, `Pulse_Rate`, `Negation`, `Date_of_death`, `Age`, `Modifier`, `Substance_Name`, `Causative_Agents_(Virus_and_Bacteria)`, `Drug_incident_description`, `Diagnosis`, `Weight`, `Drug_Name`, `Procedure_Name`, `Lab_Name`, `Blood_Pressure`, `Cause_of_death`, `Lab_Result`, `Gender`, `Name`, `Temperature`, `Procedure_Findings`, `Section_Name`, `Route`, `Maybe`, `O2_Saturation`, `Respiratory_Rate`, `Procedure`, `Procedure_incident_description`, `Frequency`, `Dosage`, `Allergenic_substance` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_biobert_en_3.0.0_3.0_1617260821875.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_biobert_en_3.0.0_3.0_1617260821875.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") jsl_ner = MedicalNerModel.pretrained("ner_jsl_biobert", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("jsl_ner") jsl_ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "jsl_ner"]) \ .setOutputCol("ner_chunk") jsl_ner_pipeline = Pipeline().setStages([ documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter]) jsl_ner_model = jsl_ner_pipeline.fit(spark.createDataFrame([['']]).toDF("text")) data = spark.createDataFrame([["""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature."""]]).toDF("text") result = jsl_ner_model.transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val jsl_ner = MedicalNerModel.pretrained("ner_jsl_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("jsl_ner") val jsl_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "jsl_ner")) .setOutputCol("ner_chunk") val jsl_ner_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter)) val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text") val result = jsl_ner_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl.biobert").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
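To produce a chunk/label table like the one in the Results section below, the `ner_chunk` annotations can be flattened after collecting them to the driver. This is a minimal pure-Python sketch under an assumed shape: each annotation is a dict with the matched text under `result` and its label under `metadata["entity"]` (the keys `NerConverter` attaches); the sample values are hypothetical.

```python
# Hypothetical collected output: each item mimics one ner_chunk annotation,
# with the matched text ("result") and its label in the metadata.
chunks = [
    {"result": "21-day-old", "metadata": {"entity": "Age"}},
    {"result": "male", "metadata": {"entity": "Gender"}},
    {"result": "Tylenol", "metadata": {"entity": "Drug_Name"}},
]

def to_rows(annotations):
    """Flatten chunk annotations into (chunk, ner_label) rows."""
    return [(a["result"], a["metadata"]["entity"]) for a in annotations]

# Print a small two-column table of chunk and label.
for chunk, label in to_rows(chunks):
    print(f"{chunk:<12}| {label}")
```

On the Spark side, the same flattening is typically done with `explode` over the `ner_chunk` column before collecting results.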
## Results ```bash +-------------------------------+------------+ |chunk |ner_label | +-------------------------------+------------+ |21-day-old |Age | |male |Gender | |mom |Gender | |she |Gender | |mild |Modifier | |problems with his breathing |Symptom_Name| |negative |Negated | |perioral cyanosis |Symptom_Name| |retractions |Symptom_Name| |mom |Gender | |Tylenol |Drug_Name | |His |Gender | |his |Gender | |respiratory congestion |Symptom_Name| |He |Gender | |more |Modifier | |tired |Symptom_Name| |fussy |Symptom_Name| |albuterol |Drug_Name | |His |Gender | |urine output has also decreased|Symptom_Name| |he |Gender | |he |Gender | |Mom |Gender | |denies |Negated | |diarrhea |Symptom_Name| |His |Gender | +-------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_biobert| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| --- layout: model title: Fast Neural Machine Translation Model from Swazi to English author: John Snow Labs name: opus_mt_ss_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ss, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. 
- source languages: `ss` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ss_en_xx_2.7.0_2.4_1609166521015.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ss_en_xx_2.7.0_2.4_1609166521015.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentences") marian = MarianTransformer.pretrained("opus_mt_ss_en", "xx")\ .setInputCols(["sentences"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ss_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ss.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ss_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Urdu RoBERTa Embeddings author: John Snow Labs name: roberta_embeddings_roberta_urdu_small date: 2022-04-14 tags: [roberta, embeddings, ur, open_source] task: Embeddings language: ur edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-urdu-small` is an Urdu model originally trained by `urduhack`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_urdu_small_ur_3.4.2_3.0_1649948085721.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_urdu_small_ur_3.4.2_3.0_1649948085721.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_urdu_small","ur") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["مجھے سپارک این ایل پی سے محبت ہے"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_urdu_small","ur") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("مجھے سپارک این ایل پی سے محبت ہے").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ur.embed.roberta_urdu_small").predict("""مجھے سپارک این ایل پی سے محبت ہے""") ```
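A common downstream use of the `embeddings` column is comparing two token or sentence vectors with cosine similarity. This is a self-contained sketch of that computation; the two short vectors are hypothetical stand-ins for the model's real embedding vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors, e.g. two token
    embeddings pulled from the 'embeddings' annotation column."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-dimensional vectors; real embeddings are much higher-dimensional.
print(round(cosine([1.0, 0.0], [1.0, 1.0]), 3))  # 0.707
```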
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_roberta_urdu_small| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ur| |Size:|473.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/urduhack/roberta-urdu-small - https://github.com/urduhack/urduhack/blob/master/LICENSE - https://github.com/urduhack/urduhack --- layout: model title: English asr_wav2vec2_base_timit_asr TFWav2Vec2ForCTC from elgeish author: John Snow Labs name: asr_wav2vec2_base_timit_asr date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_asr` is an English model originally trained by elgeish. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_asr_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_asr_en_4.2.0_3.0_1664025374893.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_asr_en_4.2.0_3.0_1664025374893.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_asr", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_asr", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
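The `audio_content` column that `AudioAssembler` reads is expected to hold arrays of normalized float samples. Libraries such as librosa load audio to floats directly; if your audio instead arrives as raw little-endian 16-bit PCM, a minimal stdlib-only conversion sketch (an illustrative helper, not part of Spark NLP) looks like this:

```python
import struct

def pcm16_to_floats(raw: bytes):
    """Convert little-endian 16-bit PCM bytes into floats in [-1.0, 1.0),
    the kind of sample array an audio DataFrame column would hold."""
    n = len(raw) // 2
    samples = struct.unpack("<%dh" % n, raw[: n * 2])
    return [s / 32768.0 for s in samples]

# Two samples: maximum negative amplitude, then silence.
print(pcm16_to_floats(b"\x00\x80\x00\x00"))  # [-1.0, 0.0]
```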
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_asr| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|349.3 MB| --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from coolzhao) author: John Snow Labs name: xlmroberta_ner_coolzhao_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `coolzhao`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_coolzhao_base_finetuned_panx_de_4.1.0_3.0_1660431915864.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_coolzhao_base_finetuned_panx_de_4.1.0_3.0_1660431915864.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_coolzhao_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_coolzhao_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_coolzhao_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/coolzhao/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_6 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-32-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_6_en_4.0.0_3.0_1654191567392.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_6_en_4.0.0_3.0_1654191567392.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_6","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_32d_seed_6").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
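In the NLU snippet above, the question and context are passed as a single string joined by `|||`. A tiny helper (hypothetical, not part of the nlu package) makes that convention explicit when building inputs programmatically:

```python
def to_nlu_qa_input(question: str, context: str) -> str:
    """Join a question and its context with the '|||' separator used by
    the nlu question-answering example above."""
    return f"{question}|||{context}"

print(to_nlu_qa_input("What's my name?",
                      "My name is Clara and I live in Berkeley."))
```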
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|376.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-32-finetuned-squad-seed-6 --- layout: model title: Sentiment Analysis of Spanish texts author: John Snow Labs name: classifierdl_bert_sentiment date: 2021-09-28 tags: [es, spanish, sentiment, open_source] task: Sentiment Analysis language: es edition: Spark NLP 3.3.0 spark_version: 2.4 supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model identifies the sentiments (neutral, positive or negative) in Spanish texts. ## Predicted Entities `NEUTRAL`, `POSITIVE`, `NEGATIVE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_bert_sentiment_es_3.3.0_2.4_1632820716491.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_bert_sentiment_es_3.3.0_2.4_1632820716491.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = BertSentenceEmbeddings\ .pretrained('labse', 'xx') \ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") sentimentClassifier = ClassifierDLModel.pretrained("classifierdl_bert_sentiment", "es") \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") fr_sentiment_pipeline = Pipeline(stages=[document, embeddings, sentimentClassifier]) light_pipeline = LightPipeline(fr_sentiment_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) result1 = light_pipeline.annotate("Estoy seguro de que esta vez pasará la entrevista.") result2 = light_pipeline.annotate("Soy una persona que intenta desayunar todas las mañanas sin falta.") result3 = light_pipeline.annotate("No estoy seguro de si mi salario mensual es suficiente para vivir.") print(result1["class"], result2["class"], result3["class"], sep = "\n") ``` ```scala val document = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val embeddings = BertSentenceEmbeddings .pretrained("labse", "xx") .setInputCols(Array("document")) .setOutputCol("sentence_embeddings") val sentimentClassifier = ClassifierDLModel.pretrained("classifierdl_bert_sentiment", "es") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val fr_sentiment_pipeline = new Pipeline().setStages(Array(document, embeddings, sentimentClassifier)) val light_pipeline = new LightPipeline(fr_sentiment_pipeline.fit(Seq("").toDS.toDF("text"))) val result1 = light_pipeline.annotate("Estoy seguro de que esta vez pasará la entrevista.") val result2 = light_pipeline.annotate("Soy una persona que intenta desayunar todas las mañanas sin falta.") val result3 = light_pipeline.annotate("No estoy seguro de si mi salario mensual es suficiente para vivir.") ```
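Each `LightPipeline.annotate` call returns a dict keyed by output column. Extracting the single document-level label from the `class` column can be sketched in pure Python; the sample dicts below are hypothetical stand-ins shaped like the outputs for the three example sentences above:

```python
# Hypothetical annotate() outputs (output column -> list of results).
results = [
    {"class": ["POSITIVE"]},
    {"class": ["NEUTRAL"]},
    {"class": ["NEGATIVE"]},
]

def predicted_label(annotated: dict) -> str:
    """Pull the document-level prediction out of the 'class' column,
    falling back to 'UNKNOWN' when no prediction is present."""
    labels = annotated.get("class", [])
    return labels[0] if labels else "UNKNOWN"

print([predicted_label(r) for r in results])  # ['POSITIVE', 'NEUTRAL', 'NEGATIVE']
```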
## Results ```bash ['POSITIVE'] ['NEUTRAL'] ['NEGATIVE'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_bert_sentiment| |Compatibility:|Spark NLP 3.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|es| ## Data Source https://github.com/charlesmalafosse/open-dataset-for-sentiment-analysis ## Benchmarking ```bash precision recall f1-score support NEGATIVE 0.73 0.81 0.77 1837 NEUTRAL 0.80 0.71 0.75 2222 POSITIVE 0.80 0.81 0.80 2187 accuracy 0.78 6246 macro avg 0.77 0.78 0.77 6246 weighted avg 0.78 0.78 0.77 6246 ``` --- layout: model title: Translate Kongo to English Pipeline author: John Snow Labs name: translate_kg_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, kg, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `kg` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_kg_en_xx_2.7.0_2.4_1609690173661.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_kg_en_xx_2.7.0_2.4_1609690173661.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_kg_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_kg_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.kg.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_kg_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: News Classifier of Turkish text author: John Snow Labs name: classifierdl_bert_news date: 2021-05-03 tags: [tr, news, classifier, open_source] task: Text Classification language: tr edition: Spark NLP 3.0.2 spark_version: 3.0 supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Classify Turkish news texts ## Predicted Entities `kultur`, `saglik`, `ekonomi`, `teknoloji`, `siyaset`, `spor` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_bert_news_tr_3.0.2_3.0_1620040285456.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_bert_news_tr_3.0.2_3.0_1620040285456.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = BertSentenceEmbeddings\ .pretrained('labse', 'xx') \ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") document_classifier = ClassifierDLModel.pretrained("classifierdl_bert_news", "tr") \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") nlpPipeline = Pipeline(stages=[document, embeddings, document_classifier]) light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text"))) result = light_pipeline.annotate('Bonservisi elinde olan Milli oyuncu, yeni takımıyla el sıkıştı.') ``` ```scala val document = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val embeddings = BertSentenceEmbeddings .pretrained("labse", "xx") .setInputCols("document") .setOutputCol("sentence_embeddings") val document_classifier = ClassifierDLModel.pretrained("classifierdl_bert_news", "tr") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val nlpPipeline = new Pipeline().setStages(Array(document, embeddings, document_classifier)) val light_pipeline = new LightPipeline(nlpPipeline.fit(Seq("").toDS.toDF("text"))) val result = light_pipeline.annotate("Bonservisi elinde olan Milli oyuncu, yeni takımıyla el sıkıştı.") ``` {:.nlu-block} ```python import nlu nlu.load("tr.classify.news").predict("""Bonservisi elinde olan Milli oyuncu, yeni takımıyla el sıkıştı.""") ```
## Results ```bash ["spor"] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_bert_news| |Compatibility:|Spark NLP 3.0.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|tr| |Dependencies:|labse_BERT| ## Data Source Trained on a custom dataset with multi-lingual Bert Embeddings `labse`. ## Benchmarking ```bash precision recall f1-score support ekonomi 0.88 0.86 0.87 263 kultur 0.93 0.96 0.94 277 saglik 0.95 0.96 0.95 273 siyaset 0.89 0.91 0.90 257 spor 0.97 0.97 0.97 279 teknoloji 0.94 0.88 0.91 250 accuracy 0.93 1599 macro avg 0.93 0.92 0.93 1599 weighted avg 0.93 0.93 0.93 1599 ``` --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_4 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-1024-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_4_en_4.0.0_3.0_1657184034505.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_4_en_4.0.0_3.0_1657184034505.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-1024-finetuned-squad-seed-4 --- layout: model title: English BertForQuestionAnswering model (from krinal214) author: John Snow Labs name: bert_qa_bert_all_squad_que_translated date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-all-squad_que_translated` is an English model originally trained by `krinal214`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_all_squad_que_translated_en_4.0.0_3.0_1654179547914.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_all_squad_que_translated_en_4.0.0_3.0_1654179547914.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_all_squad_que_translated","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_all_squad_que_translated","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad_translated.bert.que.by_krinal214").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_all_squad_que_translated| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/krinal214/bert-all-squad_que_translated --- layout: model title: Kanuri asr_wav2vec2_xlsr_korean_senior TFWav2Vec2ForCTC from hyyoka author: John Snow Labs name: asr_wav2vec2_xlsr_korean_senior date: 2022-09-24 tags: [wav2vec2, kr, audio, open_source, asr] task: Automatic Speech Recognition language: kr edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_korean_senior` is a Kanuri model originally trained by hyyoka. NOTE: This model only works on a CPU; if you need to use this model on a GPU device please use asr_wav2vec2_xlsr_korean_senior_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_korean_senior_kr_4.2.0_3.0_1664024368402.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_korean_senior_kr_4.2.0_3.0_1664024368402.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xlsr_korean_senior", "kr")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xlsr_korean_senior", "kr") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
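The ASR snippets above read from an `audioDf` that is never constructed. As a rough sketch (not part of the original card), the expected `audio_content` column of raw floats could be built from a 16 kHz, 16-bit mono WAV file using only the Python standard library; the `spark` session and the file path are assumptions:

```python
import wave
import struct

def wav_to_floats(path):
    """Read a 16-bit PCM mono WAV file and return its samples as floats in [-1, 1]."""
    with wave.open(path, "rb") as wf:
        n = wf.getnframes()
        raw = wf.readframes(n)
    # "<%dh" unpacks n little-endian signed 16-bit samples
    samples = struct.unpack("<%dh" % n, raw)
    return [s / 32768.0 for s in samples]

# Hypothetical usage with the pipelines above (requires a running Spark session):
# floats = wav_to_floats("speech.wav")
# audioDf = spark.createDataFrame([[floats]], ["audio_content"])
```

The model expects 16 kHz audio, so files at other sample rates would need resampling first.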
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xlsr_korean_senior| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|kr| |Size:|1.2 GB| --- layout: model title: English DistilBertForTokenClassification Base Uncased model (from elastic) author: John Snow Labs name: distilbert_token_classifier_base_uncased_finetuned_conll03_english date: 2023-03-14 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-conll03-english` is an English model originally trained by `elastic`. ## Predicted Entities `PER`, `ORG`, `MISC`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_uncased_finetuned_conll03_english_en_4.3.1_3.0_1678783123169.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_uncased_finetuned_conll03_english_en_4.3.1_3.0_1678783123169.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_uncased_finetuned_conll03_english","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_uncased_finetuned_conll03_english","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_base_uncased_finetuned_conll03_english| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/elastic/distilbert-base-uncased-finetuned-conll03-english - https://paperswithcode.com/sota?task=Token+Classification&dataset=conll2003 --- layout: model title: Recognize Entities DL Pipeline for Italian - Medium author: John Snow Labs name: entity_recognizer_md date: 2021-03-22 tags: [open_source, italian, entity_recognizer_md, pipeline, it] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: it edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_md is a pretrained pipeline that performs basic text processing steps, covering most of the common text processing tasks on your dataframe. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_it_3.0.0_3.0_1616446397778.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_it_3.0.0_3.0_1616446397778.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('entity_recognizer_md', lang = 'it') annotations = pipeline.fullAnnotate("Ciao da John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("entity_recognizer_md", lang = "it") val result = pipeline.fullAnnotate("Ciao da John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Ciao da John Snow Labs! "] result_df = nlu.load('it.ner').predict(text) result_df ```
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:-----------------------------|:----------------------------|:----------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Ciao da John Snow Labs! '] | ['Ciao da John Snow Labs!'] | ['Ciao', 'da', 'John', 'Snow', 'Labs!'] | [[-0.146050006151199,.,...]] | ['O', 'O', 'I-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|it| --- layout: model title: Legal Eu Institutions And European Civil Service Document Classifier (EURLEX) author: John Snow Labs name: legclf_eu_institutions_and_european_civil_service_bert date: 2023-03-06 tags: [en, legal, classification, clauses, eu_institutions_and_european_civil_service, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_eu_institutions_and_european_civil_service_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class Eu_Institutions_and_European_Civil_Service or not (Binary Classification) according to EuroVoc labels.
## Predicted Entities `Eu_Institutions_and_European_Civil_Service`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_eu_institutions_and_european_civil_service_bert_en_1.0.0_3.0_1678111712438.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_eu_institutions_and_european_civil_service_bert_en_1.0.0_3.0_1678111712438.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_eu_institutions_and_european_civil_service_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Eu_Institutions_and_European_Civil_Service]| |[Other]| |[Other]| |[Eu_Institutions_and_European_Civil_Service]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_eu_institutions_and_european_civil_service_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Eu_Institutions_and_European_Civil_Service 0.85 1.00 0.92 45 Other 1.00 0.83 0.91 47 accuracy - - 0.91 92 macro-avg 0.92 0.91 0.91 92 weighted-avg 0.93 0.91 0.91 92 ``` --- layout: model title: English asr_wav2vec2_large_10min_lv60_self TFWav2Vec2ForCTC from Splend1dchan author: John Snow Labs name: asr_wav2vec2_large_10min_lv60_self date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_10min_lv60_self` is an English model originally trained by Splend1dchan.
NOTE: This model only works on a CPU, if you need to use this model on a GPU device please use asr_wav2vec2_large_10min_lv60_self_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_10min_lv60_self_en_4.2.0_3.0_1664024523802.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_10min_lv60_self_en_4.2.0_3.0_1664024523802.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_10min_lv60_self", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_10min_lv60_self", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_10min_lv60_self| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|757.6 MB| --- layout: model title: Legal Defaults Clause Binary Classifier author: John Snow Labs name: legclf_defaults_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `defaults` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
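The paragraph-splitting advice above (splitting by multiline before classification) can be sketched in plain Python. This is a minimal illustration, not the Legal NLP Workshop code, and it approximates the 512-token budget with whitespace tokens rather than the embedder's real subword count:

```python
import re

def split_paragraphs(text, max_tokens=512):
    """Split a document on blank lines and flag whether each paragraph
    likely fits the embedding model's 512-token window.

    Returns a list of (paragraph, fits) tuples; `fits` uses whitespace
    tokens as a rough proxy for the true subword token count."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]
```

Each resulting paragraph can then be fed to the classifier pipeline as a separate row, yielding one True/False prediction per paragraph.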
## Predicted Entities `other`, `defaults` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_defaults_clause_en_1.0.0_3.2_1660123396601.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_defaults_clause_en_1.0.0_3.2_1660123396601.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_defaults_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[defaults]| |[other]| |[other]| |[defaults]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_defaults_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.2 MB| ## References Legal documents, scrapped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support defaults 0.93 0.90 0.91 173 other 0.95 0.96 0.95 316 accuracy - - 0.94 489 macro-avg 0.94 0.93 0.93 489 weighted-avg 0.94 0.94 0.94 489 ``` --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673` is a German model originally trained by jonatasgrosman. 
NOTE: This pipeline only works on a CPU, if you need to use this pipeline on a GPU device please use pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673_de_4.2.0_3.0_1664114234970.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673_de_4.2.0_3.0_1664114234970.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s673| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Fast Neural Machine Translation Model from English to Lunda author: John Snow Labs name: opus_mt_en_lun date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, lun, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `lun` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_lun_xx_2.7.0_2.4_1609168688562.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_lun_xx_2.7.0_2.4_1609168688562.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_lun", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_lun", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.lun').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_lun| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Hausa BertForMaskedLM Base Cased model (from Davlan) author: John Snow Labs name: bert_embeddings_base_multilingual_cased_finetuned_hausa date: 2022-12-02 tags: [ha, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ha edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased-finetuned-hausa` is a Hausa model originally trained by `Davlan`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_multilingual_cased_finetuned_hausa_ha_4.2.4_3.0_1670018401233.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_multilingual_cased_finetuned_hausa_ha_4.2.4_3.0_1670018401233.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_multilingual_cased_finetuned_hausa","ha") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_multilingual_cased_finetuned_hausa","ha") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_multilingual_cased_finetuned_hausa| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ha| |Size:|667.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/Davlan/bert-base-multilingual-cased-finetuned-hausa - http://data.statmt.org/cc-100/ - https://github.com/masakhane-io/masakhane-ner --- layout: model title: English BertForTokenClassification Uncased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4_Modified_scibert_scivocab_uncased date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4_Modified-scibert_scivocab_uncased` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_Modified_scibert_scivocab_uncased_en_4.0.0_3.0_1657109139547.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_Modified_scibert_scivocab_uncased_en_4.0.0_3.0_1657109139547.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_Modified_scibert_scivocab_uncased","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_Modified_scibert_scivocab_uncased","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4_Modified_scibert_scivocab_uncased| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|410.4 MB| |Case sensitive:|false| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4_Modified-scibert_scivocab_uncased --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from SelamatPagi) author: John Snow Labs name: xlmroberta_ner_selamatpagi_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `SelamatPagi`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_selamatpagi_base_finetuned_panx_de_4.1.0_3.0_1660430414691.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_selamatpagi_base_finetuned_panx_de_4.1.0_3.0_1660430414691.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_selamatpagi_base_finetuned_panx","de") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_selamatpagi_base_finetuned_panx","de")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("document", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
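The `NerConverter` stage above groups the per-token IOB tags emitted by the classifier into entity chunks. A minimal pure-Python sketch of that grouping logic (a toy illustration of the IOB convention, not Spark NLP's actual implementation):

```python
def iob_to_chunks(tokens, tags):
    # group B-/I- tagged tokens into (chunk_text, label) pairs
    chunks, cur, cur_label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur:
                chunks.append((" ".join(cur), cur_label))
            cur, cur_label = [tok], tag[2:]
        elif tag.startswith("I-") and cur and tag[2:] == cur_label:
            cur.append(tok)
        else:
            if cur:
                chunks.append((" ".join(cur), cur_label))
            cur, cur_label = [], None
    if cur:
        chunks.append((" ".join(cur), cur_label))
    return chunks

tokens = ["Angela", "Merkel", "besucht", "Berlin"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(iob_to_chunks(tokens, tags))  # [('Angela Merkel', 'PER'), ('Berlin', 'LOC')]
```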
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_selamatpagi_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/SelamatPagi/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Word2Vec Embeddings in Newari (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, new, open_source] task: Embeddings language: new edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_new_3.4.1_3.0_1647448157849.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_new_3.4.1_3.0_1647448157849.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","new") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","new") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("new.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
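A common downstream use of the `embeddings` column is comparing tokens by cosine similarity. A self-contained sketch with toy 3-dimensional vectors standing in for the model's 300-dimensional output (the vectors here are made up for illustration):

```python
import math

def cosine(u, v):
    # cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# toy stand-ins for word vectors
v_cat = [0.9, 0.1, 0.0]
v_dog = [0.8, 0.2, 0.1]
v_car = [0.0, 0.1, 0.95]

print(cosine(v_cat, v_dog) > cosine(v_cat, v_car))  # True
```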
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|new| |Size:|174.9 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: BERT Sentence Embeddings trained on Wikipedia and BooksCorpus author: John Snow Labs name: sent_bert_wiki_books date: 2021-08-31 tags: [en, open_source, sentence_embeddings, wikipedia_dataset, books_corpus_dataset] task: Embeddings language: en nav_key: models edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses a BERT base architecture pretrained from scratch on Wikipedia and BooksCorpus. This is a BERT base architecture but some changes have been made to the original training and export scheme based on more recent learning that improve its accuracy over the original BERT base checkpoint. This model is intended to be used for a variety of English NLP tasks. The pre-training data contains more formal text and the model may not generalize to more colloquial text such as social media or messages. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_wiki_books_en_3.2.0_3.0_1630412117264.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_wiki_books_en_3.2.0_3.0_1630412117264.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols("document") \
    .setOutputCol("sentence")

sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_wiki_books", "en") \
    .setInputCols("sentence") \
    .setOutputCol("bert_sentence")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings])
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_wiki_books", "en")
    .setInputCols("sentence")
    .setOutputCol("bert_sentence")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings))
```
{:.nlu-block}
```python
import nlu
text = ["I love NLP"]
sent_embeddings_df = nlu.load('en.embed_sentence.bert.wiki_books').predict(text, output_level='sentence')
sent_embeddings_df
```
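Sentence embedders condense a sequence of token vectors into one fixed-size vector per sentence. One simple pooling scheme is mean pooling, sketched below with toy 2-dimensional vectors (an illustration of the idea only; this particular BERT model may use a different pooling strategy, such as the pooled CLS output):

```python
def mean_pool(token_vectors):
    # average token vectors elementwise into a single sentence vector
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(tokens))  # [3.0, 4.0]
```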
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sent_bert_wiki_books|
|Compatibility:|Spark NLP 3.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[bert_sentence]|
|Language:|en|
|Case sensitive:|false|

## Data Source

[1]: [Wikipedia dataset](https://dumps.wikimedia.org/)
[2]: [BooksCorpus dataset](http://yknzhu.wixsite.com/mbweb)

This model has been imported from: https://tfhub.dev/google/experts/bert/wiki_books/2

---
layout: model
title: English XlmRoBertaForQuestionAnswering (from teacookies)
author: John Snow Labs
name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265901
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-more_fine_tune_24465520-26265901` is an English model originally trained by `teacookies`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265901_en_4.0.0_3.0_1655984774779.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265901_en_4.0.0_3.0_1655984774779.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265901","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = XlmRoBertaForQuestionAnswering
    .pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265901","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265901").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
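Under the hood, an extractive QA head scores every context token as a possible answer start and end, and the answer is the highest-scoring valid span. A toy sketch of that span selection with made-up logits (not the model's actual scores):

```python
def best_span(start_logits, end_logits, max_len=15):
    # pick (i, j) maximizing start_logits[i] + end_logits[j],
    # subject to i <= j < i + max_len
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best_score:
                best_score, best = score, (i, j)
    return best

context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end   = [0.1, 0.1, 0.1, 4.0, 0.1, 0.1, 0.1, 0.1, 0.2, 0.1]
i, j = best_span(start, end)
print(" ".join(context[i:j + 1]))  # Clara
```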
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265901|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|888.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265901

---
layout: model
title: English BertForQuestionAnswering Tiny model (from mrm8488)
author: John Snow Labs
name: bert_qa_tiny_wrslb_finetuned_squadv1
date: 2022-07-19
tags: [open_source, bert, question_answering, tiny, en]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BERT Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-tiny-wrslb-finetuned-squadv1` is an English model originally trained by `mrm8488`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_tiny_wrslb_finetuned_squadv1_en_4.0.0_3.0_1658255196030.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_tiny_wrslb_finetuned_squadv1_en_4.0.0_3.0_1658255196030.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tiny_wrslb_finetuned_squadv1","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["PUT YOUR 'QUESTION' STRING HERE?", "PUT YOUR 'CONTEXT' STRING HERE"]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tiny_wrslb_finetuned_squadv1","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("PUT YOUR 'QUESTION' STRING HERE?", "PUT YOUR 'CONTEXT' STRING HERE")).toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_tiny_wrslb_finetuned_squadv1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|16.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References https://huggingface.co/mrm8488/bert-tiny-wrslb-finetuned-squadv1 --- layout: model title: Word2Vec Embeddings in Swedish (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, sv, open_source] task: Embeddings language: sv edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sv_3.4.1_3.0_1647461203585.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sv_3.4.1_3.0_1647461203585.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sv") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Jag älskar gnista nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sv") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Jag älskar gnista nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("sv.embed.w2v_cc_300d").predict("""Jag älskar gnista nlp""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|sv|
|Size:|1.2 GB|
|Case sensitive:|false|
|Dimension:|300|

---
layout: model
title: English image_classifier_vit_resnet_50_euroSat ViTForImageClassification from YKXBCi
author: John Snow Labs
name: image_classifier_vit_resnet_50_euroSat
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_resnet_50_euroSat` is an English model originally trained by YKXBCi.

## Predicted Entities

`Residential`, `AnnualCrop`, `Highway`, `Pasture`, `SeaLake`, `Industrial`, `HerbaceousVegetation`, `River`, `PermanentCrop`, `Forest`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_resnet_50_euroSat_en_4.1.0_3.0_1660173630646.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_resnet_50_euroSat_en_4.1.0_3.0_1660173630646.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_resnet_50_euroSat", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_resnet_50_euroSat", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
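The `class` column holds the label whose logit is highest after a softmax over all predicted entities. A self-contained sketch of that final step with made-up logits (the label subset and scores here are illustrative only):

```python
import math

def softmax(logits):
    # numerically stable softmax
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

labels = ["Residential", "AnnualCrop", "Highway", "Forest"]
logits = [0.2, 3.1, -1.0, 0.5]  # made-up model scores
probs = softmax(logits)
print(labels[probs.index(max(probs))])  # AnnualCrop
```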
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_resnet_50_euroSat|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.0 MB|

---
layout: model
title: Legal Business Organisation Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_business_organisation_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, business_organisation, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.

The `legclf_business_organisation_bert` model is a BERT Sentence Embeddings Document Classifier that, given a document, predicts whether it belongs to the Business_Organisation class or not (binary classification) according to EuroVoc labels.

## Predicted Entities

`Business_Organisation`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_business_organisation_bert_en_1.0.0_3.0_1678111708222.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_business_organisation_bert_en_1.0.0_3.0_1678111708222.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_business_organisation_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-----------------------+
|result                 |
+-----------------------+
|[Business_Organisation]|
|[Other]                |
|[Other]                |
|[Business_Organisation]|
+-----------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_business_organisation_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.6 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
                label  precision  recall  f1-score  support
Business_Organisation       0.88    0.92      0.90       49
                Other       0.93    0.90      0.92       60
             accuracy          -       -      0.91      109
            macro-avg       0.91    0.91      0.91      109
         weighted-avg       0.91    0.91      0.91      109
```

---
layout: model
title: Kabyle asr_Kabyle_xlsr TFWav2Vec2ForCTC from Akashpb13
author: John Snow Labs
name: asr_Kabyle_xlsr
date: 2022-09-24
tags: [wav2vec2, kab, audio, open_source, asr]
task: Automatic Speech Recognition
language: kab
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Kabyle_xlsr` is a Kabyle model originally trained by Akashpb13.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_Kabyle_xlsr_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Kabyle_xlsr_kab_4.2.0_3.0_1664018846367.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Kabyle_xlsr_kab_4.2.0_3.0_1664018846367.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_Kabyle_xlsr", "kab")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_Kabyle_xlsr", "kab") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
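Wav2Vec2ForCTC emits one character prediction per audio frame, including a special blank symbol; greedy CTC decoding collapses consecutive repeats and then drops the blanks to produce the transcript. A toy pure-Python sketch of that collapse step (the `_` blank symbol and frame string are illustrative, not the model's real vocabulary):

```python
BLANK = "_"  # stand-in for the CTC blank token

def ctc_collapse(frames):
    # merge consecutive duplicate frames, then drop blanks
    out, prev = [], None
    for ch in frames:
        if ch != prev:
            out.append(ch)
        prev = ch
    return "".join(c for c in out if c != BLANK)

print(ctc_collapse(list("hh_e_lll_lo__")))  # hello
```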
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_Kabyle_xlsr|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|kab|
|Size:|1.2 GB|

---
layout: model
title: English BertForQuestionAnswering model (from jgammack)
author: John Snow Labs
name: bert_qa_MTL_bert_base_uncased_ww_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `MTL-bert-base-uncased-ww-squad` is an English model originally trained by `jgammack`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_MTL_bert_base_uncased_ww_squad_en_4.0.0_3.0_1654178797558.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_MTL_bert_base_uncased_ww_squad_en_4.0.0_3.0_1654178797558.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_MTL_bert_base_uncased_ww_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_MTL_bert_base_uncased_ww_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.base_uncased.by_jgammack").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_MTL_bert_base_uncased_ww_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/jgammack/MTL-bert-base-uncased-ww-squad

---
layout: model
title: English BertForQuestionAnswering model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_bert_base_uncased_few_shot_k_256_finetuned_squad_seed_0
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-256-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_few_shot_k_256_finetuned_squad_seed_0_en_4.0.0_3.0_1654180821898.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_few_shot_k_256_finetuned_squad_seed_0_en_4.0.0_3.0_1654180821898.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_few_shot_k_256_finetuned_squad_seed_0","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_bert_base_uncased_few_shot_k_256_finetuned_squad_seed_0","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```
{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.base_uncased_256d_seed_0").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_uncased_few_shot_k_256_finetuned_squad_seed_0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-256-finetuned-squad-seed-0

---
layout: model
title: English RobertaForSequenceClassification Cased model (from mrm8488)
author: John Snow Labs
name: roberta_sequence_classifier_distilroberta_finetuned_age_news_classification
date: 2022-07-13
tags: [en, open_source, roberta, sequence_classification]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForSequenceClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-finetuned-age_news-classification` is an English model originally trained by `mrm8488`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_distilroberta_finetuned_age_news_classification_en_4.0.0_3.0_1657716247106.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_distilroberta_finetuned_age_news_classification_en_4.0.0_3.0_1657716247106.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_distilroberta_finetuned_age_news_classification","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, classifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_distilroberta_finetuned_age_news_classification","en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, classifier))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_sequence_classifier_distilroberta_finetuned_age_news_classification|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|309.3 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/mrm8488/distilroberta-finetuned-age_news-classification

---
layout: model
title: Word Segmenter for Chinese
author: John Snow Labs
name: wordseg_msra
date: 2021-03-09
tags: [word_segmentation, open_source, chinese, wordseg_msra, zh]
task: Word Segmentation
language: zh
edition: Spark NLP 3.0.0
spark_version: 3.0
supported: true
annotator: WordSegmenterModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

[WordSegmenterModel-WSM](https://en.wikipedia.org/wiki/Text_segmentation) is based on a maximum entropy probability model to detect word boundaries in Chinese text. Chinese text is written without white space between words, so a computer-based application cannot know a priori which sequence of ideograms forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step.

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/chinese/word_segmentation/words_segmenter_demo.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_msra_zh_3.0.0_3.0_1615292321709.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_msra_zh_3.0.0_3.0_1615292321709.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("sentence") word_segmenter = WordSegmenterModel.pretrained("wordseg_msra", "zh") \ .setInputCols(["sentence"]) \ .setOutputCol("token") pipeline = Pipeline(stages=[document_assembler, word_segmenter]) ws_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) example = spark.createDataFrame([['从John Snow Labs你好! ']], ["text"]) result = ws_model.transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("sentence") val word_segmenter = WordSegmenterModel.pretrained("wordseg_msra", "zh") .setInputCols(Array("sentence")) .setOutputCol("token") val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter)) val data = Seq("从John Snow Labs你好! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["从John Snow Labs你好! "] token_df = nlu.load('zh.segment_words.msra').predict(text) token_df ```
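The character-by-character output visible in the Results section for the Latin-script span "John Snow Labs" is typical of segmenters that meet out-of-vocabulary material. As a rough illustration only — the pretrained model is a maximum-entropy tagger, not a dictionary matcher — a toy greedy longest-match segmenter shows the same fallback behaviour:

```python
# Toy dictionary-based longest-match segmenter. An illustration only:
# wordseg_msra uses a maximum-entropy model, not a dictionary lookup.
# Out-of-vocabulary spans fall back to single characters, which is why
# unknown Latin-script words can come out character by character.
def segment(text, vocab, max_len=4):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate word first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # OOV fallback: emit one character
            i += 1
    return tokens

vocab = {"你好", "从"}
print(segment("从你好", vocab))  # → ['从', '你好']
```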
## Results ```bash 0 从 1 J 2 o 3 h 4 n 5 S 6 n 7 o 8 w 9 L 10 a 11 b 12 s 13 你 14 好 15 ! Name: token, dtype: object ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wordseg_msra| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[words_segmented]| |Language:|zh| --- layout: model title: English asr_wav2vec2_base_100h_with_lm_turkish TFWav2Vec2ForCTC from gorkemgoknar author: John Snow Labs name: pipeline_asr_wav2vec2_base_100h_with_lm_turkish date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_with_lm_turkish` is an English model originally trained by gorkemgoknar. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_base_100h_with_lm_turkish_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_with_lm_turkish_en_4.2.0_3.0_1664038564858.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_with_lm_turkish_en_4.2.0_3.0_1664038564858.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_100h_with_lm_turkish', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_100h_with_lm_turkish", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_100h_with_lm_turkish| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|354.3 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Pipeline to Extract Entities in Clinical Trial Abstracts author: John Snow Labs name: ner_clinical_trials_abstracts_pipeline date: 2022-06-27 tags: [licensed, clinical, en, ner] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_clinical_trials_abstracts](https://nlp.johnsnowlabs.com/2022/06/22/ner_clinical_trials_abstracts_en_3_0.html) model. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_trials_abstracts_pipeline_en_3.5.3_3.0_1656313637828.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_trials_abstracts_pipeline_en_3.5.3_3.0_1656313637828.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_clinical_trials_abstracts_pipeline", "en", "clinical/models") result = pipeline.fullAnnotate("""A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes. In a multicentre, open, randomised study, 570 patients with Type 2 diabetes, aged 34 - 80 years, were treated for 52 weeks with insulin glargine or NPH insulin given once daily at bedtime.""") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_clinical_trials_abstracts_pipeline", "en", "clinical/models") val result = pipeline.fullAnnotate("""A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes. In a multicentre, open, randomised study, 570 patients with Type 2 diabetes, aged 34 - 80 years, were treated for 52 weeks with insulin glargine or NPH insulin given once daily at bedtime.""") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical_trials_abstracts.pipe").predict("""A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes. In a multicentre, open, randomised study, 570 patients with Type 2 diabetes, aged 34 - 80 years, were treated for 52 weeks with insulin glargine or NPH insulin given once daily at bedtime.""") ```
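The chunk/label pairs shown in the Results section are produced by the pipeline's final NerConverter stage, which merges the token-level B-/I- tags emitted by the NER model into entity chunks. A minimal pure-Python sketch of that merging logic (an illustration of the IOB2 convention, not Spark NLP's actual implementation):

```python
# Merge IOB2 token tags into (chunk, label) pairs -- a sketch of what
# the NerConverter stage does, not its actual implementation.
def bio_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity starts here
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)           # continue the open entity
        else:                             # "O" or inconsistent tag: flush
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["insulin", "glargine", "with", "NPH", "insulin"]
tags   = ["B-Drug", "I-Drug", "O", "B-Drug", "I-Drug"]
print(bio_to_chunks(tokens, tags))
# → [('insulin glargine', 'Drug'), ('NPH insulin', 'Drug')]
```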
## Results ```bash +----------------+------------------+ | chunk| label| +----------------+------------------+ | randomised| CTDesign| | multicentre| CTDesign| |insulin glargine| Drug| | NPH insulin| Drug| | type 2 diabetes|DisorderOrSyndrome| | multicentre| CTDesign| | open| CTDesign| | randomised| CTDesign| | 570| NumberPatients| | Type 2 diabetes|DisorderOrSyndrome| | 34| Age| | 80| Age| | 52 weeks| Duration| |insulin glargine| Drug| | NPH insulin| Drug| | once daily| DrugTime| | bedtime| DrugTime| +----------------+------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical_trials_abstracts_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Word2Vec Embeddings in Armenian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, hy, open_source] task: Embeddings language: hy edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_hy_3.4.1_3.0_1647282980364.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_hy_3.4.1_3.0_1647282980364.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","hy") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ես սիրում եմ Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","hy") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ես սիրում եմ Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("hy.embed.w2v_cc_300d").predict("""Ես սիրում եմ Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|hy| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Smaller BERT Sentence Embeddings (L-4_H-128_A-2) author: John Snow Labs name: sent_small_bert_L4_128 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L4_128_en_2.6.0_2.4_1598350314094.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L4_128_en_2.6.0_2.4_1598350314094.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L4_128", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer'], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L4_128", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L4_128').predict(text, output_level='sentence') embeddings_df ```
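A common downstream use of these sentence vectors is semantic similarity, computed as the cosine of the angle between two embeddings. A self-contained sketch — the short vectors here are stand-ins for illustration; the real model emits 128-dimensional vectors:

```python
import math

# Cosine similarity between two sentence-embedding vectors.
# The model emits 128-dimensional vectors; these are short stand-ins.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [-0.917, 0.926, 0.85]   # e.g. first values for "I hate cancer"
v2 = [0.575, 0.593, -0.20]   # e.g. first values for the second sentence
print(round(cosine(v1, v2), 3))
```

Identical vectors score 1.0, orthogonal ones 0.0, so the value is a convenient ranking signal for search or clustering over the `sentence_embeddings` column.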
{:.h2_title} ## Results ```bash en_embed_sentence_small_bert_L4_128_embeddings sentence [-0.9171956777572632, 0.9261569976806641, 0.85... I hate cancer [0.5746028423309326, 0.5926093459129333, -0.19... Antibiotics aren't painkiller ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L4_128| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|128| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1 --- layout: model title: Persian Named Entity Recognition (from HooshvareLab) author: John Snow Labs name: bert_ner_bert_fa_base_uncased_ner_peyma date: 2022-05-09 tags: [bert, ner, token_classification, fa, open_source] task: Named Entity Recognition language: fa edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-fa-base-uncased-ner-peyma` is a Persian model originally trained by `HooshvareLab`. ## Predicted Entities `LOC`, `PER`, `TIM`, `MON`, `DAT`, `PCT`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_fa_base_uncased_ner_peyma_fa_3.4.2_3.0_1652099750115.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_fa_base_uncased_ner_peyma_fa_3.4.2_3.0_1652099750115.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_fa_base_uncased_ner_peyma","fa") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["من عاشق جرقه nlp هستم"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_fa_base_uncased_ner_peyma","fa") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("من عاشق جرقه nlp هستم").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_bert_fa_base_uncased_ner_peyma| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|fa| |Size:|607.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/HooshvareLab/bert-fa-base-uncased-ner-peyma - https://github.com/hooshvare/parsbert - http://nsurl.org/tasks/task-7-named-entity-recognition-ner-for-farsi/ - https://github.com/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb - https://colab.research.google.com/github/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb - https://github.com/hooshvare/parsbert/issues --- layout: model title: Dutch Legal Roberta Embeddings author: John Snow Labs name: roberta_base_dutch_legal date: 2023-02-16 tags: [nl, dutch, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: nl edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-dutch-roberta-base` is a Dutch model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_dutch_legal_nl_4.2.4_3.0_1676561397040.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_dutch_legal_nl_4.2.4_3.0_1676561397040.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_dutch_legal", "nl")\ .setInputCols(["sentence"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_dutch_legal", "nl") .setInputCols("sentence") .setOutputCol("embeddings") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_dutch_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|nl| |Size:|416.0 MB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-dutch-roberta-base --- layout: model title: BERT Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on MNLI author: John Snow Labs name: bert_wiki_books_mnli date: 2021-08-30 tags: [en, open_source, books_corpus_dataset, wikipedia_dataset, bert_embeddings, mnli_dataset] task: Embeddings language: en nav_key: models edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses a BERT base architecture initialized from https://tfhub.dev/google/experts/bert/wiki_books/1 and fine-tuned on MNLI. This is a BERT base architecture but some changes have been made to the original training and export scheme based on more recent learnings. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_wiki_books_mnli_en_3.2.0_3.0_1630316951802.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_wiki_books_mnli_en_3.2.0_3.0_1630316951802.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_wiki_books_mnli", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_wiki_books_mnli", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.wiki_books_mnli').predict(text, output_level='token') embeddings_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_wiki_books_mnli| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Case sensitive:|false| ## Data Source [1]: [Wikipedia dataset](https://dumps.wikimedia.org/) [2]: [BooksCorpus dataset](http://yknzhu.wixsite.com/mbweb) [3]: [MNLI dataset](https://cims.nyu.edu/~sbowman/multinli/) This Model has been imported from: https://tfhub.dev/google/experts/bert/wiki_books/mnli/2 --- layout: model title: English DistilBertForQuestionAnswering model (from Firat) author: John Snow Labs name: distilbert_qa_Firat_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Firat`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Firat_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724151617.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Firat_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724151617.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Firat_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Firat_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Firat").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
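Under the hood, extractive QA models of this kind score every context token as a potential answer start and end, and the annotator returns the highest-scoring span. A pure-Python sketch of that span-selection step (the tokenization and scores below are made up for illustration, not this model's actual output):

```python
# Pick the best answer span from per-token start/end scores -- a sketch
# of the span selection extractive QA models perform internally.
def best_span(start_scores, end_scores, max_len=15):
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within max_len tokens.
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 6.0, 0.1, 0.0, 0.0, 0.1, 0.3, 0.0]  # illustrative
end   = [0.0, 0.1, 0.0, 5.5, 0.2, 0.1, 0.0, 0.0, 0.4, 0.1]  # illustrative
s, e = best_span(start, end)
print(" ".join(context[s:e + 1]))  # → Clara
```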
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_Firat_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Firat/distilbert-base-uncased-finetuned-squad --- layout: model title: Pipeline to Detect PHI for Deidentification (Generic) author: John Snow Labs name: ner_deid_generic_pipeline date: 2023-03-13 tags: [deid, ner, de, licensed] task: Named Entity Recognition language: de edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/01/06/ner_deid_generic_de.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_pipeline_de_4.3.0_3.2_1678743247498.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_pipeline_de_4.3.0_3.2_1678743247498.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_generic_pipeline", "de", "clinical/models") text = '''Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_generic_pipeline", "de", "clinical/models") val text = "Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("de.med_ner.deid_generic.pipeline").predict("""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""") ```
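A typical next step after running this pipeline is to mask the detected PHI chunks in the original text. The sketch below is a pure-Python illustration of that step, not part of the pipeline itself (Spark NLP for Healthcare provides a dedicated DeIdentification annotator for production use); note that Spark NLP's begin/end offsets are inclusive:

```python
# Replace detected PHI chunks with their labels. Chunks are
# (begin, end, label) with INCLUSIVE end offsets, as in Spark NLP
# annotations; applying them right-to-left keeps earlier offsets valid.
def mask(text, chunks):
    for begin, end, label in sorted(chunks, reverse=True):
        text = text[:begin] + f"<{label}>" + text[end + 1:]
    return text

text = "Michael Berger wird am Morgen des 12 Dezember 2018 eingeliefert."
chunks = [(0, 13, "NAME"), (34, 49, "DATE")]
print(mask(text, chunks))
# → <NAME> wird am Morgen des <DATE> eingeliefert.
```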
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:--------------------------|--------:|------:|:------------|-------------:| | 0 | Michael Berger | 0 | 13 | NAME | 0.99555 | | 1 | 12 Dezember 2018 | 34 | 49 | DATE | 0.999967 | | 2 | St. Elisabeth-Krankenhaus | 55 | 79 | LOCATION | 0.797267 | | 3 | Bad Kissingen | 84 | 96 | LOCATION | 0.90785 | | 4 | Berger | 117 | 122 | NAME | 0.935 | | 5 | 76 | 128 | 129 | AGE | 1 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|de| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Arabic BertForMaskedLM Cased model (from UBC-NLP) author: John Snow Labs name: bert_embeddings_arbert date: 2022-12-02 tags: [ar, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ar edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ARBERT` is an Arabic model originally trained by `UBC-NLP`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_arbert_ar_4.2.4_3.0_1670014289698.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_arbert_ar_4.2.4_3.0_1670014289698.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_arbert","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_arbert","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_arbert| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|608.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/UBC-NLP/ARBERT - https://mageed.arts.ubc.ca/files/2020/12/marbert_arxiv_2020.pdf - https://github.com/UBC-NLP/marbert - https://doi.org/10.14288/SOCKEYE - https://www.tensorflow.org/tfrc --- layout: model title: Longformer Large NER Pipeline author: John Snow Labs name: longformer_large_token_classifier_conll03_pipeline date: 2022-04-20 tags: [open_source, ner, token_classifier, longformer, conll, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [longformer_large_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/10/09/longformer_large_token_classifier_conll03_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/longformer_large_token_classifier_conll03_pipeline_en_3.4.1_3.0_1650461801705.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/longformer_large_token_classifier_conll03_pipeline_en_3.4.1_3.0_1650461801705.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("longformer_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I am working at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("longformer_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I am working at John Snow Labs.") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PER | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|longformer_large_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.5 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - LongformerForTokenClassification - NerConverter - Finisher --- layout: model title: Translate English to Swedish Pipeline author: John Snow Labs name: translate_en_sv date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, sv, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `sv` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_sv_xx_2.7.0_2.4_1609687910567.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_sv_xx_2.7.0_2.4_1609687910567.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_sv", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_sv", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.sv').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_en_sv|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Detect PHI for Deidentification purposes (Italian, reduced entities)
author: John Snow Labs
name: ner_deid_generic
date: 2022-03-25
tags: [deid, it, licensed]
task: Named Entity Recognition
language: it
edition: Healthcare NLP 3.4.2
spark_version: 2.4
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Named Entity Recognition annotators allow a generic model to be trained using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.

Deidentification NER (Italian) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 8 entities. This NER model was trained on an internally annotated custom dataset, a COVID-19 Italian de-identification research dataset making up 15% of the total data [(Catelli et al.)](https://ieeexplore.ieee.org/document/9335570), and several data augmentation mechanisms.
## Predicted Entities `CONTACT`, `NAME`, `DATE`, `ID`, `LOCATION`, `PROFESSION`, `AGE`, `SEX` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_it_3.4.2_2.4_1648224320582.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_it_3.4.2_2.4_1648224320582.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "it")\ .setInputCols(["sentence", "token"])\ .setOutputCol("word_embeddings") clinical_ner = MedicalNerModel.pretrained("ner_deid_generic", "it", "clinical/models")\ .setInputCols(["sentence","token", "word_embeddings"])\ .setOutputCol("ner") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner]) text = ["Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015."] data = spark.createDataFrame([text]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "it") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic", "it", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner)) val text = "Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015." 
val data = Seq(text).toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("it.med_ner.deid_generic").predict("""Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015.""") ```
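The model emits token-level IOB tags, which Spark NLP's `NerConverter` stage would normally merge into entity chunks. As a standalone illustration of that merge, here is a minimal pure-Python sketch (the helper name `iob_to_chunks` is our own, not part of the library):

```python
def iob_to_chunks(tokens, labels):
    """Merge token-level IOB tags into (chunk_text, entity_label) pairs."""
    chunks, current, current_label = [], [], None

    def flush():
        # Close the chunk under construction, if any.
        if current:
            chunks.append((" ".join(current), current_label))

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current, current_label = [token], label[2:]
        elif label.startswith("I-") and current_label == label[2:]:
            current.append(token)
        else:  # "O" or an inconsistent I- tag ends the current chunk
            flush()
            current, current_label = [], None
    flush()
    return chunks

print(iob_to_chunks(
    ["Gastone", "Montanariello", "(", "49"],
    ["B-NAME", "I-NAME", "O", "B-AGE"]))
# → [('Gastone Montanariello', 'NAME'), ('49', 'AGE')]
```

In a real pipeline you would simply add `NerConverter` after the `MedicalNerModel` stage instead of post-processing by hand.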
## Results ```bash +-------------+----------+ | token| ner_label| +-------------+----------+ | Ho| O| | visto| O| | Gastone| B-NAME| |Montanariello| I-NAME| | (| O| | 49| B-AGE| | anni| O| | )| O| | riferito| O| | all| O| | '| O| | Ospedale|B-LOCATION| | San|I-LOCATION| | Camillo|I-LOCATION| | per| O| | diabete| O| | mal| O| | controllato| O| | con| O| | sintomi| O| | risalenti| O| | a| O| | marzo| B-DATE| | 2015| I-DATE| | .| O| +-------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic| |Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|it| |Size:|15.0 MB| ## References - Internally annotated corpus - [COVID-19 Italian de-identification dataset making up 15% of total data - R. Catelli, F. Gargiulo, V. Casola, G. De Pietro, H. Fujita and M. Esposito, "A Novel COVID-19 Data Set and an Effective Deep Learning Approach for the De-Identification of Italian Medical Records," in IEEE Access, vol. 9, pp. 
19097-19110, 2021, doi: 10.1109/ACCESS.2021.3054479.](https://ieeexplore.ieee.org/document/9335570)

## Benchmarking

```bash
label        tp      fp    fn     total   precision  recall  f1
CONTACT      244.0   1.0   0.0    244.0   0.9959     1.0     0.998
NAME         1082.0  69.0  59.0   1141.0  0.9401     0.9483  0.9442
DATE         1173.0  26.0  17.0   1190.0  0.9783     0.9857  0.982
ID           138.0   2.0   21.0   159.0   0.9857     0.8679  0.9231
SEX          742.0   21.0  32.0   774.0   0.9725     0.9587  0.9655
LOCATION     1039.0  64.0  108.0  1147.0  0.942      0.9058  0.9236
PROFESSION   300.0   15.0  69.0   369.0   0.9524     0.813   0.8772
AGE          746.0   5.0   35.0   781.0   0.9933     0.9552  0.9739
macro        -       -     -      -       -          -       0.9484
micro        -       -     -      -       -          -       0.9521
```

---
layout: model
title: English image_classifier_vit_exper_batch_32_e8 ViTForImageClassification from sudo-s
author: John Snow Labs
name: image_classifier_vit_exper_batch_32_e8
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_exper_batch_32_e8` is an English model originally trained by sudo-s.
## Predicted Entities `45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper_batch_32_e8_en_4.1.0_3.0_1660167668828.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper_batch_32_e8_en_4.1.0_3.0_1660167668828.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_exper_batch_32_e8", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_exper_batch_32_e8", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_exper_batch_32_e8| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.3 MB| --- layout: model title: Fast Neural Machine Translation Model from Afrikaans to Dutch author: John Snow Labs name: opus_mt_af_nl date: 2021-06-01 tags: [open_source, seq2seq, translation, af, nl, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). source languages: af target languages: nl {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_af_nl_xx_3.1.0_2.4_1622561158825.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_af_nl_xx_3.1.0_2.4_1622561158825.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_af_nl", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")
```

```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_af_nl", "xx")
.setInputCols("sentence")
.setOutputCol("translation")
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Afrikaans.translate_to.Dutch').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_af_nl|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Legal Covenants Clause Binary Classifier
author: John Snow Labs
name: legclf_covenants_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `covenants` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at sentence level.

If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities `other`, `covenants` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_covenants_clause_en_1.0.0_3.2_1660123381690.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_covenants_clause_en_1.0.0_3.2_1660123381690.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_covenants_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
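The description above recommends paragraph splitting (by multiline) before classification, so that each piece stays within the 512-token embedding limit. A minimal sketch of that splitting step for plain-text input (stdlib only; the helper name `split_paragraphs` is our own, not a library API):

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines so each paragraph can be fed
    to the clause classifier as a separate row."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "COVENANTS. The Borrower shall...\n\nMISCELLANEOUS. This Agreement..."
print(split_paragraphs(doc))
# → ['COVENANTS. The Borrower shall...', 'MISCELLANEOUS. This Agreement...']
```

Each resulting paragraph can then be loaded into the `clause_text` column of the DataFrame used in the pipeline above.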
## Results

```bash
+-----------+
|result     |
+-----------+
|[covenants]|
|[other]    |
|[other]    |
|[covenants]|
+-----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_covenants_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.3 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
label         precision  recall  f1-score  support
covenants     0.88       0.83    0.85      216
other         0.93       0.95    0.94      488
accuracy      -          -       0.91      704
macro-avg     0.90       0.89    0.90      704
weighted-avg  0.91       0.91    0.91      704
```

---
layout: model
title: English BertForQuestionAnswering model (from mrm8488)
author: John Snow Labs
name: bert_qa_bert_tiny_5_finetuned_squadv2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-tiny-5-finetuned-squadv2` is an English model originally trained by `mrm8488`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_tiny_5_finetuned_squadv2_en_4.0.0_3.0_1654185000117.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_tiny_5_finetuned_squadv2_en_4.0.0_3.0_1654185000117.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_tiny_5_finetuned_squadv2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_tiny_5_finetuned_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.tiny_v5.by_mrm8488").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_tiny_5_finetuned_squadv2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|24.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/mrm8488/bert-tiny-5-finetuned-squadv2
- https://twitter.com/mrm8488
- https://github.com/google-research
- https://arxiv.org/abs/1908.08962
- https://rajpurkar.github.io/SQuAD-explorer/
- https://www.linkedin.com/in/manuel-romero-cs/

---
layout: model
title: Recognize Entities OntoNotes - BERT Large
author: John Snow Labs
name: onto_recognize_entities_bert_large
date: 2020-12-09
task: [Named Entity Recognition, Sentence Detection, Embeddings, Pipeline Public]
language: en
nav_key: models
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [en, open_source, pipeline]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

A pre-trained pipeline containing an NerDLModel. The NER model was trained on OntoNotes 5.0 with `bert_large_cased` embeddings. It can extract the following 18 entities:

## Predicted Entities

`CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`.

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_large_en_2.7.0_2.4_1607510312335.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_large_en_2.7.0_2.4_1607510312335.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('onto_recognize_entities_bert_large') result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("onto_recognize_entities_bert_large") val result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.") ``` {:.nlu-block} ```python import nlu text = ["""Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament."""] ner_df = nlu.load('en.ner.onto.bert.large').predict(text, output_level='chunk') ner_df[["entities", "entities_class"]] ```
{:.h2_title}
## Results

```bash
+------------+---------+
|chunk       |ner_label|
+------------+---------+
|Johnson     |PERSON   |
|first       |ORDINAL  |
|2001        |DATE     |
|Parliament  |ORG      |
|eight years |DATE     |
|London      |GPE      |
|2008 to 2016|DATE     |
|Parliament  |ORG      |
+------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|onto_recognize_entities_bert_large|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|en|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- Tokenizer
- BertEmbeddings
- NerDLModel
- NerConverter

---
layout: model
title: English asr_URDU_ASR TFWav2Vec2ForCTC from Talha
author: John Snow Labs
name: asr_URDU_ASR
date: 2022-09-26
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_URDU_ASR` is an English model originally trained by Talha.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_URDU_ASR_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_URDU_ASR_en_4.2.0_3.0_1664195846441.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_URDU_ASR_en_4.2.0_3.0_1664195846441.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_URDU_ASR", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_URDU_ASR", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
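The `AudioAssembler` stage above expects the `audio_content` column to hold the waveform as an array of floats (Wav2Vec2 models work on 16 kHz mono audio). A minimal, stdlib-only sketch of converting raw little-endian 16-bit PCM bytes into that representation (the helper name `pcm16_to_floats` is our own, not a Spark NLP API):

```python
import struct

def pcm16_to_floats(raw):
    """Convert little-endian 16-bit PCM samples to floats in [-1.0, 1.0)."""
    count = len(raw) // 2
    samples = struct.unpack("<%dh" % count, raw[:count * 2])
    return [s / 32768.0 for s in samples]

# Three synthetic samples: silence, half amplitude, full negative amplitude.
raw = struct.pack("<3h", 0, 16384, -32768)
print(pcm16_to_floats(raw))  # → [0.0, 0.5, -1.0]
```

The resulting list of floats can be placed in the `audio_content` column of `audioDf` before running the pipeline.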
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_URDU_ASR|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|

---
layout: model
title: English BertForQuestionAnswering Cased model (from motiondew)
author: John Snow Labs
name: bert_qa_set_date_3_lr_2e_5_bs_32_ep_3
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-set_date_3-lr-2e-5-bs-32-ep-3` is an English model originally trained by `motiondew`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_3_lr_2e_5_bs_32_ep_3_en_4.0.0_3.0_1657188560008.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_3_lr_2e_5_bs_32_ep_3_en_4.0.0_3.0_1657188560008.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_3_lr_2e_5_bs_32_ep_3","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_3_lr_2e_5_bs_32_ep_3","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_set_date_3_lr_2e_5_bs_32_ep_3|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/motiondew/bert-set_date_3-lr-2e-5-bs-32-ep-3

---
layout: model
title: English XlmRoBertaForQuestionAnswering (from ncduy)
author: John Snow Labs
name: xlm_roberta_qa_xlm_roberta_base_squad2_distilled_finetuned_chaii_small
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-squad2-distilled-finetuned-chaii-small` is an English model originally trained by `ncduy`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_squad2_distilled_finetuned_chaii_small_en_4.0.0_3.0_1655991682956.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_squad2_distilled_finetuned_chaii_small_en_4.0.0_3.0_1655991682956.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_base_squad2_distilled_finetuned_chaii_small","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_base_squad2_distilled_finetuned_chaii_small","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_chaii.xlm_roberta.distilled_base_small").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_roberta_base_squad2_distilled_finetuned_chaii_small| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|881.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ncduy/xlm-roberta-base-squad2-distilled-finetuned-chaii-small --- layout: model title: Fast Neural Machine Translation Model from English to Ruund author: John Snow Labs name: opus_mt_en_rnd date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, rnd, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `rnd` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_rnd_xx_2.7.0_2.4_1609164840800.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_rnd_xx_2.7.0_2.4_1609164840800.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_rnd", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

data = ["Your text to translate."]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols("document")
  .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_rnd", "xx")
  .setInputCols("sentence")
  .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("Your text to translate.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.rnd').predict(text, output_level='sentence')
opus_df
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_rnd| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Investment Advisory Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_investment_advisory_agreement_bert date: 2022-12-06 tags: [en, legal, classification, agreement, investment, advisory, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_investment_advisory_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `investment-advisory-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `investment-advisory-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_investment_advisory_agreement_bert_en_1.0.0_3.0_1670349562105.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_investment_advisory_agreement_bert_en_1.0.0_3.0_1670349562105.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_investment_advisory_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-------------------------------+
|result                         |
+-------------------------------+
|[investment-advisory-agreement]|
|[other]                        |
|[other]                        |
|[investment-advisory-agreement]|
+-------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_investment_advisory_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|

## References

Legal documents, scraped from the Internet and classified in-house, plus SEC documents

## Benchmarking

```bash
                        label  precision  recall  f1-score  support
investment-advisory-agreement       0.98    0.98      0.98       42
                        other       0.97    0.97      0.97       35
                     accuracy          -       -      0.97       77
                    macro-avg       0.97    0.97      0.97       77
                 weighted-avg       0.97    0.97      0.97       77
```

---
layout: model
title: English BertForQuestionAnswering Cased model (from Shushant)
author: John Snow Labs
name: bert_qa_biobert_v1.1_biomedicalquestionanswering
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobert-v1.1-biomedicalQuestionAnswering` is an English model originally trained by `Shushant`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_v1.1_biomedicalquestionanswering_en_4.0.0_3.0_1657189157999.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_v1.1_biomedicalquestionanswering_en_4.0.0_3.0_1657189157999.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biobert_v1.1_biomedicalquestionanswering","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biobert_v1.1_biomedicalquestionanswering","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_biobert_v1.1_biomedicalquestionanswering|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|403.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/Shushant/biobert-v1.1-biomedicalQuestionAnswering

---
layout: model
title: English BertForMaskedLM Large Uncased model
author: John Snow Labs
name: bert_embeddings_large_uncased
date: 2022-12-02
tags: [en, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased` is an English model originally trained by HuggingFace.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_uncased_en_4.2.4_3.0_1670020442331.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_uncased_en_4.2.4_3.0_1670020442331.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_uncased","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_uncased","en")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
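The `embeddings` column produced above holds one vector per token. A common downstream step is comparing such vectors with cosine similarity; a framework-independent sketch using the standard formula (the toy vectors below are made up for illustration, real BERT-large vectors are 1024-dimensional):

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors standing in for real token embeddings.
v1 = [0.2, 0.1, -0.4, 0.3]
v2 = [0.2, 0.1, -0.4, 0.3]
sim = cosine_similarity(v1, v2)  # identical vectors, so ~1.0
```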
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_large_uncased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|

## References

- https://huggingface.co/bert-large-uncased
- https://arxiv.org/abs/1810.04805
- https://github.com/google-research/bert
- https://yknzhu.wixsite.com/mbweb
- https://en.wikipedia.org/wiki/English_Wikipedia

---
layout: model
title: Legal Employees Clause Binary Classifier
author: John Snow Labs
name: legclf_employees_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `employees` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text; it is better to skip them unless you want to do Binary Classification at sentence level.

If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into account that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
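A minimal, Spark-independent sketch of the paragraph-splitting-by-multiline idea mentioned above (the helper name and sample text are illustrative only):

```python
import re

def split_paragraphs(text: str):
    # Split on one or more blank lines and drop empty fragments,
    # so each paragraph can be classified separately.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "FIRST CLAUSE\nThe employee shall...\n\nSECOND CLAUSE\nThe employer shall..."
paragraphs = split_paragraphs(doc)
# paragraphs[0] starts with "FIRST CLAUSE", paragraphs[1] with "SECOND CLAUSE"
```

Each resulting paragraph can then be fed to the classifier pipeline as a separate row.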
This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `employees`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_employees_clause_en_1.0.0_3.2_1660122396915.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_employees_clause_en_1.0.0_3.2_1660122396915.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_employees_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-----------+
|result     |
+-----------+
|[employees]|
|[other]    |
|[other]    |
|[employees]|
+-----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_employees_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.1 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
   employees       0.90    0.94      0.92       80
       other       0.98    0.97      0.98      276
    accuracy          -       -      0.96      356
   macro-avg       0.94    0.95      0.95      356
weighted-avg       0.96    0.96      0.96      356
```

---
layout: model
title: English asr_finetuned_audio_transcriber TFWav2Vec2ForCTC from mfreihaut
author: John Snow Labs
name: asr_finetuned_audio_transcriber
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_finetuned_audio_transcriber` is an English model originally trained by mfreihaut.

NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_finetuned_audio_transcriber_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_finetuned_audio_transcriber_en_4.2.0_3.0_1664121733972.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_finetuned_audio_transcriber_en_4.2.0_3.0_1664121733972.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_finetuned_audio_transcriber", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_finetuned_audio_transcriber", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
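The `audioDf` used above is assumed to hold a column of floating-point sample arrays. A hedged sketch of decoding a 16-bit mono WAV file into such an array with only the Python standard library (names are illustrative; in practice a library such as librosa is commonly used for resampling to 16 kHz):

```python
import io
import struct
import wave

def wav_to_floats(wav_bytes: bytes):
    # Read 16-bit PCM samples and normalise them to [-1.0, 1.0],
    # the representation Wav2Vec2 models expect.
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a tiny in-memory WAV (16 kHz, mono, 16-bit) just to exercise the decoder.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))

floats = wav_to_floats(buf.getvalue())
# floats == [0.0, 0.5, -0.5, 32767/32768]
```

The resulting list of floats is what would populate the `audio_content` column before calling `AudioAssembler`.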
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_finetuned_audio_transcriber|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|

---
layout: model
title: English BertForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_spanbert_base_finetuned_squad_r3f
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-finetuned-squad-r3f` is an English model originally trained by `anas-awadalla`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_finetuned_squad_r3f_en_4.0.0_3.0_1657192772160.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_finetuned_squad_r3f_en_4.0.0_3.0_1657192772160.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_finetuned_squad_r3f","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_finetuned_squad_r3f","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_finetuned_squad_r3f| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|400.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-finetuned-squad-r3f --- layout: model title: Extract entities in covid trials author: John Snow Labs name: ner_covid_trials date: 2021-11-05 tags: [ner, en, clinical, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for extracting covid-related clinical terminology from covid trials. ## Predicted Entities `Stage`, `Severity`, `Virus`, `Trial_Design`, `Trial_Phase`, `N_Patients`, `Institution`, `Statistical_Indicator`, `Section_Header`, `Cell_Type`, `Cellular_component`, `Viral_components`, `Physiological_reaction`, `Biological_molecules`, `Admission_Discharge`, `Age`, `BMI`, `Cerebrovascular_Disease`, `Date`, `Death_Entity`, `Diabetes`, `Disease_Syndrome_Disorder`, `Dosage`, `Drug_Ingredient`, `Employment`, `Frequency`, `Gender`, `Heart_Disease`, `Hypertension`, `Obesity`, `Pulse`, `Race_Ethnicity`, `Respiration`, `Route`, `Smoking`, `Time`, `Total_Cholesterol`, `Treatment`, `VS_Finding`, `Vaccine` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_COVID/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} 
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_covid_trials_en_3.2.3_3.0_1636083991325.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_covid_trials_en_3.2.3_3.0_1636083991325.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = WordEmbeddingsModel.pretrained('embeddings_clinical', 'en', 'clinical/models') \ .setInputCols(['sentence', 'token']) \ .setOutputCol('embeddings') clinical_ner = MedicalNerModel.pretrained("ner_covid_trials", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["In December 2019 , a group of patients with the acute respiratory disease was detected in Wuhan , Hubei Province of China . A month later , a new beta-coronavirus was identified as the cause of the 2019 coronavirus infection . SARS-CoV-2 is a coronavirus that belongs to the group of β-coronaviruses of the subgenus Coronaviridae . The SARS-CoV-2 is the third known zoonotic coronavirus disease after severe acute respiratory syndrome ( SARS ) and Middle Eastern respiratory syndrome ( MERS ). 
The diagnosis of SARS-CoV-2 recommended by the WHO , CDC is the collection of a sample from the upper respiratory tract ( nasal and oropharyngeal exudate ) or from the lower respiratory tract such as expectoration of endotracheal aspirate and bronchioloalveolar lavage and its analysis using the test of real-time polymerase chain reaction ( qRT-PCR )."]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_covid_trials", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("""In December 2019 , a group of patients with the acute respiratory disease was detected in Wuhan , Hubei Province of China . A month later , a new beta-coronavirus was identified as the cause of the 2019 coronavirus infection . SARS-CoV-2 is a coronavirus that belongs to the group of β-coronaviruses of the subgenus Coronaviridae . The SARS-CoV-2 is the third known zoonotic coronavirus disease after severe acute respiratory syndrome ( SARS ) and Middle Eastern respiratory syndrome ( MERS ). 
The diagnosis of SARS-CoV-2 recommended by the WHO , CDC is the collection of a sample from the upper respiratory tract ( nasal and oropharyngeal exudate ) or from the lower respiratory tract such as expectoration of endotracheal aspirate and bronchioloalveolar lavage and its analysis using the test of real-time polymerase chain reaction ( qRT-PCR ).""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.covid_trials").predict("""In December 2019 , a group of patients with the acute respiratory disease was detected in Wuhan , Hubei Province of China . A month later , a new beta-coronavirus was identified as the cause of the 2019 coronavirus infection . SARS-CoV-2 is a coronavirus that belongs to the group of β-coronaviruses of the subgenus Coronaviridae . The SARS-CoV-2 is the third known zoonotic coronavirus disease after severe acute respiratory syndrome ( SARS ) and Middle Eastern respiratory syndrome ( MERS ). The diagnosis of SARS-CoV-2 recommended by the WHO , CDC is the collection of a sample from the upper respiratory tract ( nasal and oropharyngeal exudate ) or from the lower respiratory tract such as expectoration of endotracheal aspirate and bronchioloalveolar lavage and its analysis using the test of real-time polymerase chain reaction ( qRT-PCR ).""") ```
## Results ```bash | | chunk | begin | end | entity | |---:|:------------------------------------|--------:|------:|:--------------------------| | 0 | December 2019 | 3 | 15 | Date | | 1 | acute respiratory disease | 48 | 72 | Disease_Syndrome_Disorder | | 2 | beta-coronavirus | 146 | 161 | Virus | | 3 | 2019 coronavirus infection | 198 | 223 | Disease_Syndrome_Disorder | | 4 | SARS-CoV-2 | 227 | 236 | Virus | | 5 | coronavirus | 243 | 253 | Virus | | 6 | β-coronaviruses | 284 | 298 | Virus | | 7 | subgenus Coronaviridae | 307 | 328 | Virus | | 8 | SARS-CoV-2 | 336 | 345 | Virus | | 9 | zoonotic coronavirus disease | 366 | 393 | Disease_Syndrome_Disorder | | 10 | severe acute respiratory syndrome | 401 | 433 | Disease_Syndrome_Disorder | | 11 | SARS | 437 | 440 | Disease_Syndrome_Disorder | | 12 | Middle Eastern respiratory syndrome | 448 | 482 | Disease_Syndrome_Disorder | | 13 | MERS | 486 | 489 | Disease_Syndrome_Disorder | | 14 | SARS-CoV-2 | 511 | 520 | Virus | | 15 | WHO | 541 | 543 | Institution | | 16 | CDC | 547 | 549 | Institution | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_covid_trials| |Compatibility:|Healthcare NLP 3.3.2| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source This model is trained on data sampled from clinicaltrials.gov - covid trials, and annotated in-house. 
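The benchmarking section below reports per-label precision, recall, and F1 derived from true-positive/false-positive/false-negative counts. For reference, a quick sketch of those standard formulas, checked against one row of the table:

```python
def prf(tp: int, fp: int, fn: int):
    # precision = tp / (tp + fp), recall = tp / (tp + fn),
    # F1 = harmonic mean of precision and recall.
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# B-Cerebrovascular_Disease row: tp=11, fp=1, fn=1
prec, rec, f1 = prf(11, 1, 1)
# all three round to 0.9166667, matching the table
```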
## Benchmarking ```bash label tp fp fn prec rec f1 B-Cerebrovascular_Disease 11 1 1 0.9166667 0.9166667 0.9166667 I-Cerebrovascular_Disease 2 0 0 1.0 1.0 1.0 I-Vaccine 20 2 7 0.90909094 0.7407407 0.81632656 I-N_Patients 2 2 5 0.5 0.2857143 0.36363637 B-Heart_Disease 32 8 11 0.8 0.74418604 0.7710843 I-Institution 35 15 56 0.7 0.3846154 0.49645394 B-Obesity 8 0 0 1.0 1.0 1.0 I-Trial_Phase 16 9 4 0.64 0.8 0.7111111 B-Dosage 50 28 43 0.64102566 0.53763443 0.5847953 B-Hypertension 10 0 0 1.0 1.0 1.0 I-Stage 0 1 2 0.0 0.0 0.0 I-Cell_Type 82 31 17 0.7256637 0.82828283 0.7735849 B-Admission_Discharge 95 1 4 0.9895833 0.959596 0.974359 B-Date 88 10 9 0.8979592 0.9072165 0.9025641 I-Admission_Discharge 0 0 2 0.0 0.0 0.0 I-Drug_Ingredient 104 71 57 0.5942857 0.6459627 0.61904764 B-Stage 0 2 6 0.0 0.0 0.0 B-Cellular_component 18 22 28 0.45 0.39130434 0.41860464 B-Total_Cholesterol 5 0 0 1.0 1.0 1.0 I-Biological_molecules 52 39 87 0.5714286 0.37410071 0.45217392 I-Virus 56 26 24 0.68292683 0.7 0.69135803 B-BMI 9 2 2 0.8181818 0.8181818 0.8181818 B-Drug_Ingredient 330 82 84 0.80097085 0.79710144 0.7990315 B-Severity 43 22 16 0.6615385 0.7288136 0.69354844 B-Section_Header 86 27 28 0.76106197 0.75438595 0.7577093 I-Treatment 35 7 32 0.8333333 0.52238804 0.64220184 I-Pulse 1 1 0 0.5 1.0 0.6666667 I-Respiration 1 2 0 0.33333334 1.0 0.5 I-Section_Header 95 30 67 0.76 0.58641976 0.66202086 I-VS_Finding 2 2 1 0.5 0.6666667 0.57142854 B-Death_Entity 14 3 6 0.8235294 0.7 0.7567568 B-Statistical_Indicator 79 53 52 0.5984849 0.60305345 0.60076046 B-Frequency 31 8 8 0.7948718 0.7948718 0.79487187 I-Diabetes 3 0 0 1.0 1.0 1.0 B-Race_Ethnicity 3 0 0 1.0 1.0 1.0 B-Cell_Type 90 43 35 0.6766917 0.72 0.6976744 B-N_Patients 37 11 9 0.7708333 0.8043478 0.787234 B-Trial_Phase 13 4 2 0.7647059 0.8666667 0.8125 B-Biological_molecules 171 68 122 0.71548116 0.58361775 0.6428571 I-Hypertension 1 0 0 1.0 1.0 1.0 I-Age 15 0 3 1.0 0.8333333 0.90909094 B-Employment 75 28 21 0.7281553 0.78125 0.7537688 
B-Time 8 0 2 1.0 0.8 0.88888896 I-Physiological_reaction 11 11 36 0.5 0.23404256 0.3188406 I-Viral_components 22 2 21 0.9166667 0.5116279 0.6567164 B-Treatment 65 8 19 0.89041096 0.77380955 0.8280255 B-Trial_Design 26 22 19 0.5416667 0.5777778 0.5591398 I-Severity 6 13 4 0.31578946 0.6 0.41379312 I-Route 1 1 1 0.5 0.5 0.5 B-Smoking 3 1 0 0.75 1.0 0.85714287 B-Diabetes 10 0 0 1.0 1.0 1.0 B-Gender 21 2 6 0.9130435 0.7777778 0.84 I-Trial_Design 41 25 13 0.6212121 0.7592593 0.6833333 B-Virus 105 28 34 0.7894737 0.7553957 0.77205884 B-Vaccine 28 9 2 0.7567568 0.93333334 0.8358209 I-Heart_Disease 32 6 15 0.84210527 0.68085104 0.75294113 I-Dosage 41 32 35 0.56164384 0.5394737 0.5503355 I-Cellular_component 12 7 19 0.6315789 0.38709676 0.48 I-Frequency 27 9 8 0.75 0.7714286 0.7605634 B-Age 14 10 11 0.5833333 0.56 0.57142854 B-Pulse 1 1 0 0.5 1.0 0.6666667 I-Statistical_Indicator 40 27 44 0.5970149 0.47619048 0.52980137 I-Date 56 0 7 1.0 0.8888889 0.94117653 B-Route 38 7 12 0.84444445 0.76 0.79999995 B-Institution 24 14 44 0.6315789 0.3529412 0.4528302 B-Viral_components 19 7 19 0.7307692 0.5 0.59375 B-Respiration 1 2 0 0.33333334 1.0 0.5 I-BMI 10 1 4 0.90909094 0.71428573 0.8000001 B-Disease_Syndrome_Disorder 656 76 85 0.89617485 0.88529015 0.8906992 B-VS_Finding 23 5 1 0.8214286 0.9583333 0.8846154 B-Physiological_reaction 9 4 26 0.6923077 0.25714287 0.37500003 I-Disease_Syndrome_Disorder 336 50 132 0.8704663 0.71794873 0.78688526 I-Employment 23 9 12 0.71875 0.6571429 0.6865672 Macro-average 3529 1050 1482 0.7160117 0.700098 0.70796543 Micro-average 3529 1050 1482 0.7706923 0.7042506 0.73597497 ``` --- layout: model title: Pipeline to Detect Time-related Terminology author: John Snow Labs name: roberta_token_classifier_timex_semeval_pipeline date: 2022-06-19 tags: [timex, semeval, ner, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: 
cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [roberta_token_classifier_timex_semeval](https://nlp.johnsnowlabs.com/2021/12/28/roberta_token_classifier_timex_semeval_en.html) model. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_TIMEX_SEMEVAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_timex_semeval_pipeline_en_4.0.0_3.0_1655653812815.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_timex_semeval_pipeline_en_4.0.0_3.0_1655653812815.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python timex_pipeline = PretrainedPipeline("roberta_token_classifier_timex_semeval_pipeline", lang = "en") timex_pipeline.annotate("Model training was started at 22:12C and it took 3 days from Tuesday to Friday.") ``` ```scala val timex_pipeline = new PretrainedPipeline("roberta_token_classifier_timex_semeval_pipeline", lang = "en") timex_pipeline.annotate("Model training was started at 22:12C and it took 3 days from Tuesday to Friday.") ```
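`annotate` on a pretrained pipeline returns a plain Python dict mapping output column names to lists of strings. A hedged sketch of pairing chunks with their labels from such a result (the dict below is hard-coded to mirror the Results section, not produced by actually running the pipeline, and the key names are illustrative):

```python
# Hard-coded stand-in for the dict returned by timex_pipeline.annotate(...).
annotations = {
    "chunk": ["22:12C", "3", "days", "Tuesday", "to", "Friday"],
    "ner_label": ["Period", "Number", "Calendar-Interval", "Day-Of-Week", "Between", "Day-Of-Week"],
}

# Zip the parallel lists into (chunk, label) pairs for easy inspection.
pairs = list(zip(annotations["chunk"], annotations["ner_label"]))
for chunk, label in pairs:
    print(f"{chunk}\t{label}")
```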
## Results ```bash +-------+-----------------+ |chunk |ner_label | +-------+-----------------+ |22:12C |Period | |3 |Number | |days |Calendar-Interval| |Tuesday|Day-Of-Week | |to |Between | |Friday |Day-Of-Week | +-------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_token_classifier_timex_semeval_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|439.5 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - RoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: Fast Neural Machine Translation Model from English to Nyaneka author: John Snow Labs name: opus_mt_en_nyk date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, nyk, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `en` - target languages: `nyk` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_nyk_xx_2.7.0_2.4_1609167091339.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_nyk_xx_2.7.0_2.4_1609167091339.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_nyk", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_nyk", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.nyk').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_nyk| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from husnu) author: John Snow Labs name: bert_qa_xtremedistil_l6_h256_uncased_finetuned_lr_2e_05_epochs_6 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xtremedistil-l6-h256-uncased-finetuned_lr-2e-05_epochs-6` is an English model originally trained by `husnu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_xtremedistil_l6_h256_uncased_finetuned_lr_2e_05_epochs_6_en_4.0.0_3.0_1654192632520.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_xtremedistil_l6_h256_uncased_finetuned_lr_2e_05_epochs_6_en_4.0.0_3.0_1654192632520.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_xtremedistil_l6_h256_uncased_finetuned_lr_2e_05_epochs_6","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_xtremedistil_l6_h256_uncased_finetuned_lr_2e_05_epochs_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.xtremedistiled_uncased_lr_2e_05_epochs_6.by_husnu").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_xtremedistil_l6_h256_uncased_finetuned_lr_2e_05_epochs_6| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|47.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/husnu/xtremedistil-l6-h256-uncased-finetuned_lr-2e-05_epochs-6 --- layout: model title: Translate English to Isoko Pipeline author: John Snow Labs name: translate_en_iso date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, iso, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `iso` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_iso_xx_2.7.0_2.4_1609687955401.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_iso_xx_2.7.0_2.4_1609687955401.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_iso", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_iso", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.iso').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_iso| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Bemba (Zambia) asr_wav2vec2_large_xls_r_1b_bemba_fds TFWav2Vec2ForCTC from csikasote author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_1b_bemba_fds date: 2022-09-24 tags: [wav2vec2, bem, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: bem edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_1b_bemba_fds` is a Bemba (Zambia) model originally trained by csikasote. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_1b_bemba_fds_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_1b_bemba_fds_bem_4.2.0_3.0_1664043681866.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_1b_bemba_fds_bem_4.2.0_3.0_1664043681866.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_1b_bemba_fds', lang = 'bem') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_1b_bemba_fds", lang = "bem") val annotations = pipeline.transform(audioDF) ```
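Under the hood, a Wav2Vec2ForCTC model emits one character distribution per audio frame, and the transcript is obtained by CTC decoding: collapse consecutive repeated labels, then drop the blank token. A toy, pure-Python sketch of that greedy decoding step (illustrative only — not Spark NLP's actual implementation; the `"_"` blank symbol is a stand-in):

```python
BLANK = "_"  # hypothetical blank symbol for illustration

def ctc_greedy_decode(frame_labels):
    # Greedy CTC decoding: collapse repeated frame labels, then remove blanks.
    decoded = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:  # new, non-blank label
            decoded.append(label)
        prev = label
    return "".join(decoded)

# Per-frame argmax labels for a short utterance:
frames = ["_", "h", "h", "e", "_", "l", "l", "_", "l", "o", "o", "_"]
print(ctc_greedy_decode(frames))  # -> hello
```

Note how the blank between the two `l` runs is what keeps the doubled letter from being collapsed away.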
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_1b_bemba_fds| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|bem| |Size:|3.6 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Stop Words Cleaner for Irish author: John Snow Labs name: stopwords_ga date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: ga edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, ga] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_ga_ga_2.5.4_2.4_1594742439377.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_ga_ga_2.5.4_2.4_1594742439377.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_ga", "ga") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Seachas a bheith ina rí ar an tuaisceart, is dochtúir Sasanach é John Snow agus ceannaire i bhforbairt ainéistéise agus sláinteachas míochaine.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_ga", "ga") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Seachas a bheith ina rí ar an tuaisceart, is dochtúir Sasanach é John Snow agus ceannaire i bhforbairt ainéistéise agus sláinteachas míochaine.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Seachas a bheith ina rí ar an tuaisceart, is dochtúir Sasanach é John Snow agus ceannaire i bhforbairt ainéistéise agus sláinteachas míochaine."""] stopword_df = nlu.load('ga.stopwords').predict(text) stopword_df[["cleanTokens"]] ```
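Conceptually, the cleaner simply keeps the tokens that are not in the language's stop-word list. A minimal pure-Python sketch with a tiny toy Irish list (illustrative only — the model ships its own, much larger list):

```python
# Toy stop-word list for illustration; NOT the list bundled with stopwords_ga.
toy_stopwords = {"a", "an", "ar", "is", "agus", "ina", "i"}

def clean_tokens(tokens):
    # keep only tokens whose lowercase form is not a stop word
    return [t for t in tokens if t.lower() not in toy_stopwords]

tokens = ["Seachas", "a", "bheith", "ina", "rí", "ar", "an", "tuaisceart"]
print(clean_tokens(tokens))  # -> ['Seachas', 'bheith', 'rí', 'tuaisceart']
```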
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=6, result='Seachas', metadata={'sentence': '0'}), Row(annotatorType='token', begin=10, end=15, result='bheith', metadata={'sentence': '0'}), Row(annotatorType='token', begin=21, end=22, result='rí', metadata={'sentence': '0'}), Row(annotatorType='token', begin=30, end=39, result='tuaisceart', metadata={'sentence': '0'}), Row(annotatorType='token', begin=40, end=40, result=',', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_ga| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|ga| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: RxNorm Cd ChunkResolver author: John Snow Labs name: chunkresolve_rxnorm_cd_clinical class: ChunkEntityResolverModel language: en nav_key: models repository: clinical/models date: 2020-07-27 task: Entity Resolution edition: Healthcare NLP 2.5.1 spark_version: 2.4 tags: [clinical,licensed,entity_resolution,en] deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Entity Resolution model Based on KNN using Word Embeddings + Word Movers Distance. ## Predicted Entities RxNorm Codes and their normalized definition with `clinical_embeddings`. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_cd_clinical_en_2.5.1_2.4_1595813950836.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_cd_clinical_en_2.5.1_2.4_1595813950836.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python ... rxnorm_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_rxnorm_cd_clinical", "en", "clinical/models")\ .setEnableLevenshtein(True)\ .setNeighbours(200).setAlternatives(5).setDistanceWeights([3,11,0,0,0,9])\ .setInputCols(["token", "chunk_embeddings"])\ .setOutputCol("rxnorm_resolution")\ .setPoolingStrategy("MAX") pipeline_rxnorm = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, rxnorm_resolver]) data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation."""]]).toDF("text") model = pipeline_rxnorm.fit(data) results = model.transform(data) ``` ```scala ... 
val rxnorm_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_rxnorm_cd_clinical", "en", "clinical/models") .setEnableLevenshtein(true) .setNeighbours(200).setAlternatives(5).setDistanceWeights(Array(3,11,0,0,0,9)) .setInputCols(Array("token", "chunk_embeddings")) .setOutputCol("rxnorm_resolution") .setPoolingStrategy("MAX") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, rxnorm_resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.h2_title} ## Results ```bash +---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ | chunk| entity| target_text| code|confidence| +---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ | metformin|TREATMENT|metFORMIN compounding powder:::Metformin Hydrochloride Powder:::metFORMIN 500 mg oral tablet:::me...| 601021| 0.2364| | glipizide|TREATMENT|Glipizide Powder:::Glipizide Crystal:::Glipizide Tablets:::glipiZIDE 5 mg oral tablet:::glipiZIDE...| 241604| 0.3647| |dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG|TREATMENT|Ezetimibe and Atorvastatin Tablets:::Amlodipine and Atorvastatin Tablets:::Atorvastatin Calcium T...|1422084| 0.3407| | dapagliflozin|TREATMENT|Dapagliflozin Tablets:::dapagliflozin 5 mg oral tablet:::dapagliflozin 10 mg oral tablet:::Dapagl...|1488568| 0.7070| +---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |----------------|---------------------------------| | Name: | chunkresolve_rxnorm_cd_clinical | | Type: | ChunkEntityResolverModel | | Compatibility: | Spark NLP 2.5.1+ | | License: | Licensed | |Edition:|Official| | |Input labels: | [token, chunk_embeddings] | |Output labels: | [entity] | | Language: | en | | Case sensitive: | True | | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained on December 2019 RxNorm Clinical Drugs (TTY=CD) ontology graph with `embeddings_clinical` https://www.nlm.nih.gov/pubs/techbull/nd19/brief/nd19_rxnorm_december_2019_release.html --- layout: model title: Legal Definitions Clause Binary 
Classifier author: John Snow Labs name: legclf_definitions_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `definitions` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text; it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings of this model allow up to 512 tokens; if you have more than that, consider splitting the text into smaller pieces (the tutorial linked above also covers this). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, producing a series of True/False values for each of the legal clause models you have added. 
## Predicted Entities `other`, `definitions` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_definitions_clause_en_1.0.0_3.2_1660123404094.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_definitions_clause_en_1.0.0_3.2_1660123404094.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_definitions_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
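The description above recommends splitting long documents into paragraphs before classification. A minimal sketch of multiline (blank-line) paragraph splitting, where each resulting piece would become one `clause_text` row:

```python
import re

def split_paragraphs(text):
    # split on one or more blank lines and drop empty pieces
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "1. Definitions.\nTerms used herein...\n\n2. Payment.\nFees are due..."
for clause_text in split_paragraphs(doc):
    print(repr(clause_text))
```

Splitting by headers or subheaders works the same way, just with a different pattern.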
## Results ```bash +-------+ | result| +-------+ |[definitions]| |[other]| |[other]| |[definitions]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_definitions_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support definitions 0.90 0.86 0.88 81 other 0.93 0.95 0.94 148 accuracy - - 0.92 229 macro-avg 0.91 0.91 0.91 229 weighted-avg 0.92 0.92 0.92 229 ``` --- layout: model title: English asr_wav2vec2_large_960h_lv60_self_with_wikipedia_lm TFWav2Vec2ForCTC from gxbag author: John Snow Labs name: asr_wav2vec2_large_960h_lv60_self_with_wikipedia_lm date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_960h_lv60_self_with_wikipedia_lm` is an English model originally trained by gxbag. 
NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_large_960h_lv60_self_with_wikipedia_lm_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_960h_lv60_self_with_wikipedia_lm_en_4.2.0_3.0_1664026344512.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_960h_lv60_self_with_wikipedia_lm_en_4.2.0_3.0_1664026344512.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_960h_lv60_self_with_wikipedia_lm", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_960h_lv60_self_with_wikipedia_lm", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_960h_lv60_self_with_wikipedia_lm| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|333.9 MB| --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_10 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-1024-finetuned-squad-seed-10` is a English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_10_en_4.0.0_3.0_1657183937367.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_10_en_4.0.0_3.0_1657183937367.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_10","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
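Extractive QA models like this one score every context token as a possible answer start and end; the predicted answer is the highest-scoring valid span. A toy sketch of that selection step (illustrative only, not Spark NLP's internal decoding; the scores below are made up):

```python
def best_span(start_scores, end_scores, max_len=15):
    # pick (i, j) with i <= j < i + max_len maximizing start + end score
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley"]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3]
end   = [0.1, 0.1, 0.1, 4.8, 0.1, 0.1, 0.1, 0.1, 0.5]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # -> Clara
```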
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-1024-finetuned-squad-seed-10 --- layout: model title: Legal Procedures Clause Binary Classifier author: John Snow Labs name: legclf_procedures_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `procedures` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text; it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings of this model allow up to 512 tokens; if you have more than that, consider splitting the text into smaller pieces (the tutorial linked above also covers this). 
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, producing a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `procedures` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_procedures_clause_en_1.0.0_3.2_1660123841459.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_procedures_clause_en_1.0.0_3.2_1660123841459.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_procedures_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
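When several binary clause classifiers are run over the same clause text, their outputs can be collapsed into one True/False flag per clause type. A minimal sketch in plain Python (the classifier names and predicted labels here are hypothetical examples, not model outputs):

```python
def clause_flags(predictions):
    # predictions: {clause_type: predicted_label}, where each binary
    # classifier returns either its own clause type or "other"
    return {clause: label == clause for clause, label in predictions.items()}

preds = {"procedures": "procedures", "warranties": "other", "indemnification": "other"}
print(clause_flags(preds))
# {'procedures': True, 'warranties': False, 'indemnification': False}
```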
## Results ```bash +-------+ | result| +-------+ |[procedures]| |[other]| |[other]| |[procedures]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_procedures_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.96 0.96 0.96 243 procedures 0.92 0.92 0.92 116 accuracy - - 0.95 359 macro-avg 0.94 0.94 0.94 359 weighted-avg 0.95 0.95 0.95 359 ``` --- layout: model title: English T5ForConditionalGeneration Cased model (from yirmibesogluz) author: John Snow Labs name: t5_t2t_assert_ade_balanced date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t2t-assert-ade-balanced` is an English model originally trained by `yirmibesogluz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_t2t_assert_ade_balanced_en_4.3.0_3.0_1675107688482.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_t2t_assert_ade_balanced_en_4.3.0_3.0_1675107688482.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_t2t_assert_ade_balanced","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_t2t_assert_ade_balanced","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_t2t_assert_ade_balanced| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|917.2 MB| ## References - https://huggingface.co/yirmibesogluz/t2t-assert-ade-balanced - https://github.com/gokceuludogan/boun-tabi-smm4h22 --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_6_h_512 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-6_H-512` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_512_zh_4.2.4_3.0_1670021731519.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_512_zh_4.2.4_3.0_1670021731519.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_512","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_512","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
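Once token embeddings have been extracted, a common downstream step is comparing them with cosine similarity. A minimal sketch in plain Python (independent of Spark NLP; the 2-dimensional vectors are toy stand-ins for real 512-dimensional embeddings):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(round(cosine([1.0, 0.0], [1.0, 1.0]), 4))  # 0.7071
```

In a real pipeline the vectors would come from the `embeddings` output column of `BertEmbeddings`.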
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_6_h_512| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|113.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-6_H-512 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: English T5ForConditionalGeneration Small Cased model (from deep-learning-analytics) author: John Snow Labs name: t5_wikihow_small date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `wikihow-t5-small` is an English model originally trained by `deep-learning-analytics`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_wikihow_small_en_4.3.0_3.0_1675158602814.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_wikihow_small_en_4.3.0_3.0_1675158602814.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_wikihow_small","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_wikihow_small","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_wikihow_small| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|286.9 MB| ## References - https://huggingface.co/deep-learning-analytics/wikihow-t5-small - https://medium.com/@priya.dwivedi/fine-tuning-a-t5-transformer-for-any-summarization-task-82334c64c81 --- layout: model title: Extract Anatomical Entities from Oncology Texts author: John Snow Labs name: ner_oncology_anatomy_general date: 2022-11-24 tags: [licensed, clinical, en, oncology, anatomy, ner] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts anatomical entities using a general (unspecific) label. Definitions of Predicted Entities: - `Anatomical_Site`: Relevant anatomical terms mentioned in text. - `Direction`: Directional and laterality terms, such as "left", "right", "bilateral", "upper" and "lower". ## Predicted Entities `Anatomical_Site`, `Direction` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_general_en_4.2.2_3.0_1669298930681.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_general_en_4.2.2_3.0_1669298930681.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_anatomy_general", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_anatomy_general", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new 
Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.anatomy_general").predict("""The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.""") ```
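A hypothetical post-processing step (plain Python, not part of the pipeline above) can attach each `Direction` chunk to the `Anatomical_Site` chunk that follows it, turning the flat chunk list into qualified sites:

```python
def attach_directions(chunks):
    # chunks: list of (text, label) pairs in document order
    out, pending = [], None
    for text, label in chunks:
        if label == "Direction":
            pending = text          # hold the direction until a site appears
        elif label == "Anatomical_Site":
            out.append((f"{pending} {text}" if pending else text, label))
            pending = None
    return out

chunks = [("left", "Direction"), ("breast", "Anatomical_Site"),
          ("lungs", "Anatomical_Site"), ("liver", "Anatomical_Site")]
print(attach_directions(chunks))
# [('left breast', 'Anatomical_Site'), ('lungs', 'Anatomical_Site'), ('liver', 'Anatomical_Site')]
```

This simple adjacency heuristic is only a sketch; real clinical text may need sentence-boundary and distance checks.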
## Results ```bash | chunk | ner_label | |:--------|:----------------| | left | Direction | | breast | Anatomical_Site | | lungs | Anatomical_Site | | liver | Anatomical_Site | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_anatomy_general| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.3 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Anatomical_Site 2946 549 638 3584 0.84 0.82 0.83 Direction 864 209 120 984 0.81 0.88 0.84 macro_avg 3810 758 758 4568 0.82 0.85 0.84 micro_avg 3810 758 758 4568 0.83 0.83 0.83 ``` --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from atlantis) author: John Snow Labs name: xlmroberta_ner_atlantis_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `atlantis`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_atlantis_base_finetuned_panx_de_4.1.0_3.0_1660431159909.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_atlantis_base_finetuned_panx_de_4.1.0_3.0_1660431159909.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_atlantis_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_atlantis_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_atlantis_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|852.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/atlantis/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Pipeline to Detect Clinical Conditions (ner_eu_clinical_case - fr) author: John Snow Labs name: ner_eu_clinical_condition_pipeline date: 2023-03-08 tags: [fr, clinical, licensed, ner, clinical_condition] task: Named Entity Recognition language: fr edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_eu_clinical_condition](https://nlp.johnsnowlabs.com/2023/02/06/ner_eu_clinical_condition_fr.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_pipeline_fr_4.3.0_3.2_1678260057351.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_pipeline_fr_4.3.0_3.2_1678260057351.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_eu_clinical_condition_pipeline", "fr", "clinical/models") text = " Il aurait présenté il y’ a environ 30 ans des ulcérations génitales non traitées spontanément guéries. L’interrogatoire retrouvait une toux sèche depuis trois mois, des douleurs rétro-sternales constrictives, une dyspnée stade III de la NYHA et un contexte d’ apyrexie. Sur ce tableau s’ est greffé des œdèmes des membres inférieurs puis un tableau d’ anasarque d’ où son hospitalisation en cardiologie pour décompensation cardiaque globale. " result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_eu_clinical_condition_pipeline", "fr", "clinical/models") val text = " Il aurait présenté il y’ a environ 30 ans des ulcérations génitales non traitées spontanément guéries. L’interrogatoire retrouvait une toux sèche depuis trois mois, des douleurs rétro-sternales constrictives, une dyspnée stade III de la NYHA et un contexte d’ apyrexie. Sur ce tableau s’ est greffé des œdèmes des membres inférieurs puis un tableau d’ anasarque d’ où son hospitalisation en cardiologie pour décompensation cardiaque globale. " val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | chunks | begin | end | entities | confidence | |---:|:-------------------------|--------:|------:|:-------------------|-------------:| | 0 | ulcérations | 47 | 57 | clinical_condition | 0.9995 | | 1 | toux sèche | 136 | 145 | clinical_condition | 0.917 | | 2 | douleurs | 170 | 177 | clinical_condition | 0.9999 | | 3 | dyspnée | 214 | 220 | clinical_condition | 0.9999 | | 4 | apyrexie | 261 | 268 | clinical_condition | 0.9963 | | 5 | anasarque | 353 | 361 | clinical_condition | 0.9973 | | 6 | décompensation cardiaque | 409 | 432 | clinical_condition | 0.8948 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_eu_clinical_condition_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|fr| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English BertForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: bert_qa_consert_techqa date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `consert-techqa` is an English model originally trained by `AnonymousSub`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_consert_techqa_en_4.0.0_3.0_1657189198783.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_consert_techqa_en_4.0.0_3.0_1657189198783.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_consert_techqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_consert_techqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_consert_techqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/consert-techqa --- layout: model title: German Bert Embeddings (from redewiedergabe) author: John Snow Labs name: bert_embeddings_bert_base_historical_german_rw_cased date: 2022-04-11 tags: [bert, embeddings, de, open_source] task: Embeddings language: de edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-historical-german-rw-cased` is a German model originally trained by `redewiedergabe`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_historical_german_rw_cased_de_3.4.2_3.0_1649676133672.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_historical_german_rw_cased_de_3.4.2_3.0_1649676133672.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_historical_german_rw_cased","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Funken NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_historical_german_rw_cased","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Funken NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.embed.bert_base_historical_german_rw_cased").predict("""Ich liebe Funken NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_historical_german_rw_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|409.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/redewiedergabe/bert-base-historical-german-rw-cased - https://textgrid.de/digitale-bibliothek - https://www1.ids-mannheim.de/kl/projekte/korpora/archiv/gri.html - https://repos.ids-mannheim.de/mkhz-beschreibung.html - http://www.deutschestextarchiv.de/doku/textquellen#grenzboten - https://www.projekt-gutenberg.org - https://github.com/redewiedergabe/tagger - https://www.aclweb.org/anthology/N19-4010 - https://arxiv.org/abs/1508.01991 --- layout: model title: Legal Stock Option Agreement Document Binary Classifier (Longformer) author: John Snow Labs name: legclf_stock_option_agreement date: 2022-12-18 tags: [en, legal, classification, licensed, document, longformer, stock, option, agreement, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_stock_option_agreement` model is a Longformer Document Classifier used to classify if the document belongs to the class `stock-option-agreement` or not (Binary Classification). Longformers have a limit of 4096 tokens, so only the first 4096 tokens will be taken into account. We have found that for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document itself without extra material before it, 4096 tokens are enough to perform Document Classification.
If your documents are longer than 4096 tokens, you can split them into 4096-token chunks, average the embeddings of the chunks, and train on the averaged version, so that the whole document is taken into account. ## Predicted Entities `stock-option-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_stock_option_agreement_en_1.0.0_3.0_1671393683177.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_stock_option_agreement_en_1.0.0_3.0_1671393683177.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_stock_option_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
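The chunk-and-average workaround for documents longer than 4096 tokens can be sketched in plain Python (a toy illustration: 3-dimensional lists stand in for real Longformer embedding vectors):

```python
def chunk_tokens(tokens, size=4096):
    # Break a token list into consecutive chunks of at most `size` tokens
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def average_vectors(vectors):
    # Element-wise mean of equally sized embedding vectors
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

print(chunk_tokens(list(range(10)), size=4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
chunk_embeddings = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(average_vectors(chunk_embeddings))      # [2.0, 3.0, 4.0]
```

In practice each chunk would be embedded with the same Longformer model used above, and the averaged vector would feed the classifier in place of the single-chunk sentence embedding.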
## Results ```bash +-------+ |result| +-------+ |[stock-option-agreement]| |[other]| |[other]| |[stock-option-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_stock_option_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.4 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.99 0.99 0.99 202 stock-option-agreement 0.98 0.98 0.98 100 accuracy - - 0.99 302 macro-avg 0.99 0.99 0.99 302 weighted-avg 0.99 0.99 0.99 302 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from kpeyton) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_atuscol date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-atuscol` is an English model originally trained by `kpeyton`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_atuscol_en_4.3.0_3.0_1672767823596.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_atuscol_en_4.3.0_3.0_1672767823596.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_atuscol","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_atuscol","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_atuscol| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/kpeyton/distilbert-base-uncased-finetuned-atuscol --- layout: model title: Extract neurologic deficits related to Stroke Scale (NIHSS) author: John Snow Labs name: ner_nihss date: 2021-11-15 tags: [ner, en, licensed, clinical] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The National Institutes of Health Stroke Scale (NIHSS) is a 15-item neurologic examination stroke scale. It quantifies the physical manifestations of neurological deficits and provides crucial support for clinical decision making and early-stage emergency triage. 
## Predicted Entities `11_ExtinctionInattention`, `6b_RightLeg`, `1c_LOCCommands`, `10_Dysarthria`, `NIHSS`, `5_Motor`, `8_Sensory`, `4_FacialPalsy`, `6_Motor`, `2_BestGaze`, `Measurement`, `6a_LeftLeg`, `5b_RightArm`, `5a_LeftArm`, `1b_LOCQuestions`, `3_Visual`, `9_BestLanguage`, `7_LimbAtaxia`, `1a_LOC` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_NIHSS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_nihss_en_3.3.2_3.0_1636997459858.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_nihss_en_3.3.2_3.0_1636997459858.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = WordEmbeddingsModel.pretrained('embeddings_clinical', 'en', 'clinical/models') \ .setInputCols(['sentence', 'token']) \ .setOutputCol('embeddings') clinical_ner = MedicalNerModel.pretrained("ner_nihss", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("entities") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["Abdomen , soft , nontender . NIH stroke scale on presentation was 23 to 24 for , one for consciousness , two for month and year and two for eye / grip , one to two for gaze , two for face , eight for motor , one for limited ataxia , one to two for sensory , three for best language and two for attention . 
On the neurologic examination the patient was intermittently"]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_nihss", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("entities") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("""Abdomen , soft , nontender . NIH stroke scale on presentation was 23 to 24 for , one for consciousness , two for month and year and two for eye / grip , one to two for gaze , two for face , eight for motor , one for limited ataxia , one to two for sensory , three for best language and two for attention . On the neurologic examination the patient was intermittently""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.nihss").predict("""Abdomen , soft , nontender . NIH stroke scale on presentation was 23 to 24 for , one for consciousness , two for month and year and two for eye / grip , one to two for gaze , two for face , eight for motor , one for limited ataxia , one to two for sensory , three for best language and two for attention . On the neurologic examination the patient was intermittently""") ```
## Results ```bash | | chunk | entity | |---:|:-------------------|:-------------------------| | 0 | NIH stroke scale | NIHSS | | 1 | 23 to 24 | Measurement | | 2 | one | Measurement | | 3 | consciousness | 1a_LOC | | 4 | two | Measurement | | 5 | month and year and | 1b_LOCQuestions | | 6 | two | Measurement | | 7 | eye / grip | 1c_LOCCommands | | 8 | one to | Measurement | | 9 | two | Measurement | | 10 | gaze | 2_BestGaze | | 11 | two | Measurement | | 12 | face | 4_FacialPalsy | | 13 | eight | Measurement | | 14 | one | Measurement | | 15 | limited | 7_LimbAtaxia | | 16 | ataxia | 7_LimbAtaxia | | 17 | one to two | Measurement | | 18 | sensory | 8_Sensory | | 19 | three | Measurement | | 20 | best language | 9_BestLanguage | | 21 | two | Measurement | | 22 | attention | 11_ExtinctionInattention | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_nihss| |Compatibility:|Healthcare NLP 3.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source @article{wangnational, title={National Institutes of Health Stroke Scale (NIHSS) Annotations for the MIMIC-III Database}, author={Wang, Jiayang and Huang, Xiaoshuo and Yang, Lin and Li, Jiao} } ## Benchmarking ```bash label tp fp fn prec rec f1 I-NIHSS 126 5 14 0.96183205 0.9 0.92988926 B-NIHSS 152 9 14 0.94409937 0.91566265 0.9296636 B-5b_RightArm 33 0 7 1.0 0.825 0.90410954 I-Measurement 17 1 69 0.9444444 0.19767442 0.3269231 I-5_Motor 12 3 2 0.8 0.85714287 0.82758623 I-1a_LOC 134 1 3 0.9925926 0.9781022 0.9852941 I-9_BestLanguage 85 3 0 0.96590906 1.0 0.982659 B-7_LimbAtaxia 39 0 3 1.0 0.9285714 0.9629629 B-4_FacialPalsy 53 4 4 0.9298246 0.9298246 0.9298246 B-1a_LOC 39 0 5 1.0 0.8863636 0.939759 B-6a_LeftLeg 35 0 4 1.0 0.8974359 0.945946 B-10_Dysarthria 51 2 4 0.9622642 0.92727274 0.94444454 B-8_Sensory 43 3 7 0.9347826 0.86 0.8958333 I-6b_RightLeg 149 5 0 0.96753246 1.0 0.9834984 B-3_Visual 40 0 6 1.0 
0.8695652 0.9302325 B-5_Motor 4 1 4 0.8 0.5 0.61538464 I-6a_LeftLeg 157 1 6 0.9936709 0.9631902 0.9781932 B-5a_LeftArm 37 1 4 0.9736842 0.902439 0.93670887 I-4_FacialPalsy 129 5 4 0.96268654 0.9699248 0.96629214 B-2_BestGaze 41 1 2 0.97619045 0.95348835 0.9647058 I-8_Sensory 78 0 10 1.0 0.8863636 0.939759 I-5b_RightArm 153 1 12 0.9935065 0.92727274 0.95924765 B-9_BestLanguage 45 1 5 0.9782609 0.9 0.9375 I-5a_LeftArm 159 3 10 0.9814815 0.9408284 0.96072507 I-1c_LOCCommands 109 1 0 0.9909091 1.0 0.9954338 I-6_Motor 12 4 4 0.75 0.75 0.75 B-1b_LOCQuestions 43 0 2 1.0 0.95555556 0.97727275 B-6b_RightLeg 32 1 2 0.969697 0.9411765 0.9552239 I-2_BestGaze 112 0 2 1.0 0.98245615 0.99115044 I-7_LimbAtaxia 113 1 0 0.99122804 1.0 0.9955947 I-11_ExtinctionInattention 142 0 7 1.0 0.95302016 0.97594506 B-1c_LOCCommands 40 0 1 1.0 0.9756098 0.9876543 B-6_Motor 4 1 5 0.8 0.44444445 0.57142854 I-3_Visual 103 1 5 0.99038464 0.9537037 0.9716981 B-11_ExtinctionInattention 44 1 3 0.9777778 0.9361702 0.9565217 B-Measurement 787 23 13 0.97160494 0.98375 0.97763973 I-10_Dysarthria 76 0 1 1.0 0.987013 0.99346405 I-1b_LOCQuestions 114 0 6 1.0 0.95 0.9743589 Macro-average 3542 83 250 0.9606412 0.8876058 0.9226804 Micro-average 3542 83 250 0.9771035 0.9340717 0.9551032 ``` --- layout: model title: Detect Diseases in Medical Text author: John Snow Labs name: bert_token_classifier_ner_ncbi_disease date: 2022-07-25 tags: [en, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. 
The model detects disease entities in medical text. ## Predicted Entities `Disease` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ncbi_disease_en_4.0.0_3.0_1658756090634.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ncbi_disease_en_4.0.0_3.0_1658756090634.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_ncbi_disease", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, ner_model, ner_converter ]) data = spark.createDataFrame([["""Kniest dysplasia is a moderately severe type II collagenopathy, characterized by short trunk and limbs, kyphoscoliosis, midface hypoplasia, severe myopia, and hearing loss."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_ncbi_disease", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") .setCaseSensitive(true) .setMaxSentenceLength(512) val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, ner_model, ner_converter)) val data = Seq("""Kniest
dysplasia is a moderately severe type II collagenopathy, characterized by short trunk and limbs, kyphoscoliosis, midface hypoplasia, severe myopia, and hearing loss.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ncbi_disease").predict("""Kniest dysplasia is a moderately severe type II collagenopathy, characterized by short trunk and limbs, kyphoscoliosis, midface hypoplasia, severe myopia, and hearing loss.""") ```
## Results ```bash +----------------------+-------+ |ner_chunk |label | +----------------------+-------+ |Kniest dysplasia |Disease| |type II collagenopathy|Disease| |kyphoscoliosis |Disease| |midface hypoplasia |Disease| |myopia |Disease| |hearing loss |Disease| +----------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_ncbi_disease| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References [https://github.com/cambridgeltl/MTL-Bioinformatics-2016](https://github.com/cambridgeltl/MTL-Bioinformatics-2016) ## Benchmarking ```bash label precision recall f1-score support B-Disease 0.8392 0.9406 0.8870 960 I-Disease 0.8752 0.9356 0.9044 1087 micro-avg 0.8579 0.9380 0.8961 2047 macro-avg 0.8572 0.9381 0.8957 2047 weighted-avg 0.8583 0.9380 0.8963 2047 ``` --- layout: model title: Legal Relation Extraction (Parties, Alias, Dates, Document Type) (Md, Unidirectional) author: John Snow Labs name: legre_contract_doc_parties_md date: 2022-11-02 tags: [en, legal, licensed, re, agreements] task: Relation Extraction language: en nav_key: models edition: Spark NLP for Legal 1.0.0 spark_version: 3.0 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description IMPORTANT: Don't run this model on the whole legal agreement. Instead: - Split by paragraphs. 
You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration; - Use the `legclf_introduction_clause` Text Classifier to select only these paragraphs; - This is a Legal Relation Extraction model, which can be used after the NER Model for extracting Parties, Document Types, Effective Dates and Aliases, called legner_contract_doc_parties. As output, you will get the relations linking the different concepts together, if such a relation exists. The list of relations is: - dated_as: A Document has an Effective Date - has_alias: The Alias of a Party used throughout the document - has_collective_alias: An Alias held by several parties at the same time - signed_by: Between a Party and the document they signed This is an `md` model with Unidirectional Relations, meaning that the model retrieves in chunk1 the left side of the relation (source), and in chunk2 the right side (target). ## Predicted Entities `dated_as`, `has_alias`, `has_collective_alias`, `signed_by` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legre_contract_doc_parties_md_en_1.0.0_3.0_1667404651340.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legre_contract_doc_parties_md_en_1.0.0_3.0_1667404651340.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \ .setInputCols("document", "token")\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512) ner_model = legal.NerModel.pretrained('legner_contract_doc_parties', 'en', 'legal/models')\ .setInputCols(["document", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") re_model = legal.RelationExtractionDLModel.pretrained('legre_contract_doc_parties_md', 'en', 'legal/models')\ .setPredictionThreshold(0.5)\ .setInputCols(["ner_chunk", "document"])\ .setOutputCol("relations") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, ner_model, ner_converter, re_model ]) empty_df = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_df) text="""This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties").""" data = spark.createDataFrame([[text]]).toDF("text") result = model.transform(data) ```
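The description above recommends splitting agreements into paragraphs before running the pipeline. A minimal sketch of such splitting in plain Python (the blank-line heuristic here is an illustrative assumption, not the exact logic of the linked workshop notebook):

```python
import re

def split_paragraphs(agreement_text: str) -> list:
    """Split a legal agreement into paragraphs on blank lines.

    Illustrative heuristic only; the workshop notebook referenced in
    the description may use different splitting logic.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", agreement_text)]
    return [p for p in paragraphs if p]

sample = "This AGREEMENT is dated January 1, 2020.\n\nSection 1. Definitions.\n\n"
print(split_paragraphs(sample))
```

Each resulting paragraph can then be placed in a one-column DataFrame and passed through the pipeline shown above.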
## Results ```bash | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |----------------------|---------|---------------|-------------|---------------------------------|---------|---------------|-------------|-------------------------------------|------------| | dated_as | DOC | 6 | 36 | INTELLECTUAL PROPERTY AGREEMENT | EFFDATE | 70 | 86 | December 31, 2018 | 0.99994016 | | signed_by | DOC | 6 | 36 | INTELLECTUAL PROPERTY AGREEMENT | PARTY | 142 | 164 | Armstrong Flooring, Inc | 0.9995191 | | signed_by | DOC | 6 | 36 | INTELLECTUAL PROPERTY AGREEMENT | ALIAS | 193 | 198 | Seller | 0.9823355 | | signed_by | DOC | 6 | 36 | INTELLECTUAL PROPERTY AGREEMENT | PARTY | 206 | 222 | AFI Licensing LLC | 0.9989542 | | signed_by | DOC | 6 | 36 | INTELLECTUAL PROPERTY AGREEMENT | ALIAS | 264 | 272 | Licensing | 0.92109 | | signed_by | DOC | 6 | 36 | INTELLECTUAL PROPERTY AGREEMENT | ALIAS | 293 | 298 | Seller | 0.9938019 | | signed_by | DOC | 6 | 36 | INTELLECTUAL PROPERTY AGREEMENT | PARTY | 316 | 331 | AHF Holding, Inc | 0.9989403 | | signed_by | DOC | 6 | 36 | INTELLECTUAL PROPERTY AGREEMENT | ALIAS | 400 | 404 | Buyer | 0.89959186 | | signed_by | DOC | 6 | 36 | INTELLECTUAL PROPERTY AGREEMENT | PARTY | 412 | 446 | Armstrong Hardwood Flooring Company | 0.9974464 | | signed_by | DOC | 6 | 36 | INTELLECTUAL PROPERTY AGREEMENT | ALIAS | 479 | 485 | Company | 0.95839113 | | signed_by | DOC | 6 | 36 | INTELLECTUAL PROPERTY AGREEMENT | ALIAS | 506 | 510 | Buyer | 0.95839113 | | signed_by | DOC | 6 | 36 | INTELLECTUAL PROPERTY AGREEMENT | ALIAS | 517 | 530 | Buyer Entities | 0.95839113 | | signed_by | DOC | 6 | 36 | INTELLECTUAL PROPERTY AGREEMENT | ALIAS | 612 | 616 | Party | 0.95839113 | | signed_by | DOC | 6 | 36 | INTELLECTUAL PROPERTY AGREEMENT | ALIAS | 642 | 648 | Parties | 0.95839113 | | dated_as | EFFDATE | 70 | 86 | December 31, 2018 | PARTY | 142 | 164 | Armstrong Flooring, Inc | 0.69556713 | | 
dated_as | ALIAS | 193 | 198 | Seller | EFFDATE | 70 | 86 | December 31, 2018 | 0.7183331 | | dated_as | EFFDATE | 70 | 86 | December 31, 2018 | ALIAS | 264 | 272 | Licensing | 0.7500907 | | dated_as | EFFDATE | 70 | 86 | December 31, 2018 | ALIAS | 293 | 298 | Seller | 0.6601122 | | dated_as | EFFDATE | 70 | 86 | December 31, 2018 | PARTY | 316 | 331 | AHF Holding, Inc | 0.52062315 | | dated_as | EFFDATE | 70 | 86 | December 31, 2018 | ALIAS | 400 | 404 | Buyer | 0.7104727 | | dated_as | EFFDATE | 70 | 86 | December 31, 2018 | PARTY | 412 | 446 | Armstrong Hardwood Flooring Company | 0.70473474 | | dated_as | ALIAS | 479 | 485 | Company | EFFDATE | 70 | 86 | December 31, 2018 | 0.98484945 | | dated_as | ALIAS | 506 | 510 | Buyer | EFFDATE | 70 | 86 | December 31, 2018 | 0.98484945 | | dated_as | ALIAS | 517 | 530 | Buyer Entities | EFFDATE | 70 | 86 | December 31, 2018 | 0.98484945 | | dated_as | ALIAS | 612 | 616 | Party | EFFDATE | 70 | 86 | December 31, 2018 | 0.98484945 | | dated_as | ALIAS | 642 | 648 | Parties | EFFDATE | 70 | 86 | December 31, 2018 | 0.98484945 | | has_alias | PARTY | 142 | 164 | Armstrong Flooring, Inc | ALIAS | 264 | 272 | Licensing | 0.686296 | | has_alias | PARTY | 206 | 222 | AFI Licensing LLC | ALIAS | 264 | 272 | Licensing | 0.8194909 | | has_collective_alias | ALIAS | 264 | 272 | Licensing | PARTY | 316 | 331 | AHF Holding, Inc | 0.5534526 | | has_alias | PARTY | 316 | 331 | AHF Holding, Inc | ALIAS | 479 | 485 | Company | 0.52909577 | | has_alias | PARTY | 316 | 331 | AHF Holding, Inc | ALIAS | 506 | 510 | Buyer | 0.52909607 | | has_alias | PARTY | 316 | 331 | AHF Holding, Inc | ALIAS | 517 | 530 | Buyer Entities | 0.52909607 | | has_alias | PARTY | 316 | 331 | AHF Holding, Inc | ALIAS | 612 | 616 | Party | 0.52909607 | | has_alias | PARTY | 316 | 331 | AHF Holding, Inc | ALIAS | 642 | 648 | Parties | 0.52909607 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legre_contract_doc_parties_md| 
|Compatibility:|Spark NLP for Legal 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|402.3 MB| ## References Manual annotations on CUAD dataset ## Benchmarking ```bash label Recall Precision F1 Support dated_as 1.000 0.957 0.978 44 has_alias 0.950 0.974 0.962 40 has_collective_alias 0.667 1.000 0.800 3 signed_by 0.957 0.989 0.972 92 Avg. 0.913 0.977 0.938 - Weighted-Avg. 0.973 0.974 0.973 - ``` --- layout: model title: English RoBertaForQuestionAnswering model (from deepset) author: John Snow Labs name: roberta_base_qa_squad2 date: 2022-06-15 tags: [open_source, roberta, question_answering, en] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2` is an English model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_qa_squad2_en_4.0.0_3.0_1655293095511.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_qa_squad2_en_4.0.0_3.0_1655293095511.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_base_qa_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_base_qa_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base.by_deepset").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_qa_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References https://huggingface.co/deepset/roberta-base-squad2 --- layout: model title: Extraction of Clinical Abbreviations and Acronyms author: John Snow Labs name: ner_abbreviation_clinical date: 2021-12-30 tags: [ner, abbreviation, acronym, en, clinical, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained to extract clinical abbreviations and acronyms in text. ## Predicted Entities `ABBR` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ABBREVIATION/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_abbreviation_clinical_en_3.3.4_2.4_1640852436967.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_abbreviation_clinical_en_3.3.4_2.4_1640852436967.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical', 'en', 'clinical/models') \ .setInputCols(['sentence', 'token']) \ .setOutputCol('embeddings') abbr_ner = MedicalNerModel.pretrained('ner_abbreviation_clinical', 'en', 'clinical/models') \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("abbr_ner") abbr_converter = NerConverter() \ .setInputCols(["sentence", "token", "abbr_ner"]) \ .setOutputCol("abbr_ner_chunk") ner_pipeline = Pipeline( stages = [ documentAssembler, sentenceDetector, tokenizer, embeddings, abbr_ner, abbr_converter ]) sample_df = spark.createDataFrame([["Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. 
Group B strep has not been done as yet."]]).toDF("text") result = ner_pipeline.fit(sample_df).transform(sample_df) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val abbr_ner = MedicalNerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("abbr_ner") val abbr_converter = new NerConverter() .setInputCols(Array("sentence", "token", "abbr_ner")) .setOutputCol("abbr_ner_chunk") val ner_pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, embeddings, abbr_ner, abbr_converter)) val sample_df = Seq("Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.").toDF("text") val result = ner_pipeline.fit(sample_df).transform(sample_df) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.abbreviation_clinical").predict("""Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.""") ```
## Results ```bash +-----+---------+ |chunk|ner_label| +-----+---------+ |CBC |ABBR | |AB |ABBR | |VDRL |ABBR | |HIV |ABBR | +-----+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_abbreviation_clinical| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|14.6 MB| ## Data Source Trained on the in-house dataset. ## Benchmarking ```bash Quality on validation dataset (20.0%), validation examples = 454 time to finish evaluation: 5.34s +-------+------+------+------+----------+------+------+ | Label | tp| fp| fn| precision|recall| f1| +-------+------+------+------+----------+------+------+ | B-ABBR| 672.0| 42.0| 40.0| 0.9411|0.9438|0.9424| +-------+------+------+------+----------+------+------+ +------------+----------+--------+--------+ | | precision| recall| f1| +------------+----------+--------+--------+ | macro| 0.9411| 0.9438| 0.9424| +------------+----------+--------+--------+ | micro| 0.9411| 0.9438| 0.9424| +------------+----------+--------+--------+ ``` --- layout: model title: Legal Other benefits Clause Binary Classifier author: John Snow Labs name: legclf_other_benefits_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `other-benefits` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. 
If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `other-benefits` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_other_benefits_clause_en_1.0.0_3.2_1660122796233.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_other_benefits_clause_en_1.0.0_3.2_1660122796233.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_other_benefits_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
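Since the description notes that this model's embeddings accept up to 512 tokens, a rough pre-check on clause length can flag texts that need further splitting before classification. This sketch uses whitespace tokens as a cheap proxy (an assumption: real subword tokenization usually yields more pieces than whitespace splitting):

```python
MAX_TOKENS = 512  # limit stated in the model card

def needs_splitting(paragraph: str, max_tokens: int = MAX_TOKENS) -> bool:
    """Rough check: whitespace-token count as a proxy for subword tokens."""
    return len(paragraph.split()) > max_tokens

short_clause = "This clause describes other benefits granted to the employee."
long_clause = " ".join(["benefit"] * 600)
print(needs_splitting(short_clause), needs_splitting(long_clause))
```

Clauses flagged by this check can be split further with any of the techniques from the tutorial linked in the description before being fed to the classifier.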
## Results

```bash
+----------------+
|          result|
+----------------+
|[other-benefits]|
|         [other]|
|         [other]|
|[other-benefits]|
+----------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_other_benefits_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
label           precision  recall  f1-score  support
other                0.95    0.97      0.96      109
other-benefits       0.91    0.86      0.88       35
accuracy                -       -      0.94      144
macro-avg            0.93    0.91      0.92      144
weighted-avg         0.94    0.94      0.94      144
```

---
layout: model
title: Legal Budget Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_budget_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, budget, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_budget_bert model is a Bert Sentence Embeddings document classifier that determines whether a given document belongs to the Budget class or not (binary classification) according to EuroVoc labels.
## Predicted Entities

`Budget`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_budget_bert_en_1.0.0_3.0_1678111716576.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_budget_bert_en_1.0.0_3.0_1678111716576.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_budget_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    document_assembler,
    embeddings,
    doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")

model = nlpPipeline.fit(df)

result = model.transform(df)
```
## Results

```bash
+--------+
|  result|
+--------+
|[Budget]|
| [Other]|
| [Other]|
|[Budget]|
+--------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_budget_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.8 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
label         precision  recall  f1-score  support
Budget             0.82    0.88      0.85       41
Other              0.90    0.85      0.87       52
accuracy              -       -      0.86       93
macro-avg          0.86    0.86      0.86       93
weighted-avg       0.86    0.86      0.86       93
```

---
layout: model
title: German Named Entity Recognition (from HuggingAlex1247)
author: John Snow Labs
name: distilbert_ner_distilbert_base_german_europeana_cased_germeval_14
date: 2022-05-16
tags: [distilbert, ner, token_classification, de, open_source]
task: Named Entity Recognition
language: de
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: DistilBertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-german-europeana-cased-germeval_14` is a German model originally trained by `HuggingAlex1247`.
## Predicted Entities

`LOCderiv`, `PERderiv`, `OTHderiv`, `OTH`, `OTHpart`, `ORGpart`, `PER`, `LOC`, `ORG`, `PERpart`, `ORGderiv`, `LOCpart`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_distilbert_base_german_europeana_cased_germeval_14_de_3.4.2_3.0_1652721706374.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_distilbert_base_german_europeana_cased_germeval_14_de_3.4.2_3.0_1652721706374.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token")

tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_distilbert_base_german_europeana_cased_germeval_14", "de") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])

data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_distilbert_base_german_europeana_cased_germeval_14", "de")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier))

val data = Seq("Ich liebe Spark NLP").toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_ner_distilbert_base_german_europeana_cased_germeval_14|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|253.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/HuggingAlex1247/distilbert-base-german-europeana-cased-germeval_14
- https://paperswithcode.com/sota?task=Token+Classification&dataset=germeval_14

---
layout: model
title: Clinical Deidentification (Spanish, augmented)
author: John Snow Labs
name: clinical_deidentification_augmented
date: 2022-03-03
tags: [deid, es, licensed]
task: De-identification
language: es
edition: Healthcare NLP 3.4.1
spark_version: 2.4
supported: true
recommended: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pipeline is trained with sciwiki_300d embeddings and can be used to deidentify PHI information from medical texts in Spanish. It differs from the previous `clinical_deidentification` pipeline in that it includes the `ner_deid_subentity_augmented` NER model and some improvements in ContextualParsers and RegexMatchers. The PHI information will be masked and obfuscated in the resulting text.
The pipeline can mask, fake or obfuscate the following entities: `AGE`, `DATE`, `PROFESSION`, `EMAIL`, `USERNAME`, `STREET`, `COUNTRY`, `CITY`, `DOCTOR`, `HOSPITAL`, `PATIENT`, `URL`, `MEDICALRECORD`, `IDNUM`, `ORGANIZATION`, `PHONE`, `ZIP`, `ACCOUNT`, `SSN`, `PLATE`, `SEX` and `IPADDR`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_augmented_es_3.4.1_2.4_1646331074905.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_augmented_es_3.4.1_2.4_1646331074905.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from johnsnowlabs import * deid_pipeline = PretrainedPipeline("clinical_deidentification_augmented", "es", "clinical/models") sample = """Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. 
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com""" result = deid_pipeline .annotate(sample) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = new PretrainedPipeline("clinical_deidentification_augmented", "es", "clinical/models") sample = "Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. 
En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com " val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("es.deid.clinical_augmented").predict("""Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. 
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com""") ```
## Results ```bash Masked with entity labels ------------------------------ Datos . Nombre: . Apellidos: . NHC: . NASS: . Domicilio: , B.. Localidad/ Provincia: . CP: . Datos asistenciales. Fecha de nacimiento: . País: . Edad: años Sexo: . Fecha de Ingreso: . Médico: NºCol: . Informe clínico : de años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; 25,8 seg. 
Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: +++; Test de Coombs > ; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. Servicio de Endocrinología y Nutrición - () Correo electrónico: Masked with chars ------------------------------ Datos [**********]. Nombre: [**] . Apellidos: [*************]. NHC: [*****]. NASS: [************]. Domicilio: [*******************], * B.. Localidad/ Provincia: [****]. CP: [***]. Datos asistenciales. Fecha de nacimiento: [********]. País: [****]. Edad: ** años Sexo: *. Fecha de Ingreso: [********]. Médico: [******************] NºCol: [*********]. 
Informe clínico [**********]: [***] de ** años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en [*********] en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; [**] 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. 
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: [*************] +++; Test de Coombs > [****]; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). [*********] [****] significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. [******************] [******************************] Servicio de Endocrinología y Nutrición [***************************] [***] [****] - [****] ([****]) Correo electrónico: [********************] Masked with fixed length chars ------------------------------ Datos ****. Nombre: **** . Apellidos: ****. NHC: ****. NASS: ****. Domicilio: ****, **** B.. Localidad/ Provincia: ****. CP: ****. Datos asistenciales. Fecha de nacimiento: ****. País: ****. Edad: **** años Sexo: ****. Fecha de Ingreso: ****. Médico: **** NºCol: ****. Informe clínico ****: **** de **** años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. 
Antes de comenzar el cuadro estuvo en **** en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; **** 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: **** +++; Test de Coombs > ****; Brucellacapt > 1/5120. 
Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). **** **** significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. **** **** Servicio de Endocrinología y Nutrición **** **** **** - **** (****) Correo electrónico: **** Obfuscated ------------------------------ Datos Hombre. Nombre: Aurora Garrido Paez . Apellidos: Aurora Garrido Paez. NHC: BBBBBBBBQR648597. NASS: 48127833R. Domicilio: C/ Rambla, 246, 5 B.. Localidad/ Provincia: Alicante. CP: 24202. Datos asistenciales. Fecha de nacimiento: 21/04/1977. País: Portugal. Edad: 56 años Sexo: Hombre. Fecha de Ingreso: 10/07/2018. Médico: Francisco José Roca Bermúdez NºCol: 21344083-P. Informe clínico Hombre: 041010000011 de 56 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Zaragoza en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. 
Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; Tecnogroup SL 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: María Miguélez Sanz +++; Test de Coombs > 07-25-1974; Brucellacapt > 1/5120. 
Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). F. 041010000011 significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. Francisco José Roca Bermúdez Hospital 12 de Octubre Servicio de Endocrinología y Nutrición Calle Ramón y Cajal s/n 03129 Zaragoza - Alicante (Portugal) Correo electrónico: richard@company.it ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification_augmented| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|es| |Size:|281.2 MB| ## Included Models - nlp.DocumentAssembler - nlp.SentenceDetectorDLModel - nlp.TokenizerModel - nlp.WordEmbeddingsModel - medical.NerModel - nlp.NerConverter - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - nlp.RegexMatcherModel - ChunkMergeModel - medical.DeIdentificationModel - medical.DeIdentificationModel - medical.DeIdentificationModel - medical.DeIdentificationModel - Finisher --- layout: model title: Malay ALBERT Embeddings (Tiny) author: John Snow Labs name: albert_embeddings_albert_tiny_bahasa_cased 
date: 2022-04-14 tags: [albert, embeddings, ms, open_source] task: Embeddings language: ms edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: AlBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `albert-tiny-bahasa-cased` is a Malay model originally trained by `malay-huggingface`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_tiny_bahasa_cased_ms_3.4.2_3.0_1649954332763.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_tiny_bahasa_cased_ms_3.4.2_3.0_1649954332763.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_tiny_bahasa_cased","ms") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Saya suka Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_tiny_bahasa_cased","ms") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Saya suka Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ms.embed.albert_tiny_bahasa_cased").predict("""Saya suka Spark NLP""") ```
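Downstream components compare or aggregate the per-token vectors emitted in the `embeddings` output column. As a minimal illustration of one such comparison, the snippet below computes cosine similarity in plain Python; the three-dimensional vectors are made-up stand-ins for real ALBERT outputs, used only to keep the sketch self-contained:

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical token vectors standing in for rows of the "embeddings" output.
v_suka = [0.2, 0.7, 0.1]
v_spark = [0.1, 0.8, 0.05]

print(round(cosine_similarity(v_suka, v_spark), 4))
```

The real vectors are higher-dimensional, but the comparison logic is identical.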
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_embeddings_albert_tiny_bahasa_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ms| |Size:|21.9 MB| |Case sensitive:|false| ## References - https://huggingface.co/malay-huggingface/albert-tiny-bahasa-cased - https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean - https://github.com/huseinzol05/malay-dataset/tree/master/corpus/pile - https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/albert --- layout: model title: Detect PHI for Deidentification purposes (Spanish, augmented) author: John Snow Labs name: ner_deid_subentity_augmented date: 2022-02-15 tags: [deid, es, licensed] task: De-identification language: es edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Spanish) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 17 entities, which is more than the previously released `ner_deid_subentity` model. This NER model is trained with a combination of custom datasets, the Spanish CoNLL 2002, MeddoProf and MeddoCan datasets, and includes several data augmentation mechanisms. 
## Predicted Entities `PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `ID`, `STREET`, `USERNAME`, `SEX`, `EMAIL`, `ZIP`, `MEDICALRECORD`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_ES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_augmented_es_3.3.4_2.4_1644927080275.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_augmented_es_3.3.4_2.4_1644927080275.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("word_embeddings") clinical_ner = medical.NerModel.pretrained("ner_deid_subentity_augmented", "es", "clinical/models")\ .setInputCols(["sentence","token","word_embeddings"])\ .setOutputCol("ner") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner]) text = [''' Antonio Miguel Martínez, varón de de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. 
'''] df = spark.createDataFrame([text]).toDF("text") results = nlpPipeline.fit(df).transform(df) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_augmented", "es", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner)) val text = "Antonio Miguel Martínez, varón de de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos." val df = Seq(text).toDF("text") val results = pipeline.fit(df).transform(df) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.deid.subentity_augmented").predict(""" Antonio Miguel Martínez, varón de de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. """) ```
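The `ner` output column holds one IOB-style tag per token, as shown in the Results section below; grouping those tags into entity chunks is normally the job of a `NerConverter` stage. The grouping logic can be sketched in plain Python (a simplified illustration, not the annotator's actual implementation):

```python
def iob_to_chunks(tokens, tags):
    """Group (token, IOB tag) pairs into (entity_text, label) chunks."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag closes any open chunk and starts a new one.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # An I- tag with a matching label extends the open chunk.
            current.append(token)
        else:
            # "O" (or a dangling I- tag) closes any open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Antonio", "Miguel", "Martínez", ",", "varón"]
tags = ["B-PATIENT", "I-PATIENT", "I-PATIENT", "O", "B-SEX"]
print(iob_to_chunks(tokens, tags))
# [('Antonio Miguel Martínez', 'PATIENT'), ('varón', 'SEX')]
```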
## Results ```bash +------------+------------+ | token| ner_label| +------------+------------+ | Antonio| B-PATIENT| | Miguel| I-PATIENT| | Martínez| I-PATIENT| | ,| O| | varón| B-SEX| | de| O| | de| O| | 35| B-AGE| | años| O| | de| O| | edad| O| | ,| O| | de| O| | profesión| O| | auxiliar|B-PROFESSION| | de|I-PROFESSION| | enfermería|I-PROFESSION| | y| O| | nacido| O| | en| O| | Cadiz| B-CITY| | ,| O| | España| B-COUNTRY| | .| O| | Aún| O| | no| O| | estaba| O| | vacunado| O| | ,| O| | se| O| | infectó| O| | con| O| | Covid-19| O| | el| O| | dia| O| | 14| B-DATE| | de| I-DATE| | Marzo| I-DATE| | y| O| | tuvo| O| | que| O| | ir| O| | al| O| | Hospital| O| | Fue| O| | tratado| O| | con| O| | anticuerpos| O| |monoclonales| O| | en| O| | la| O| | Clinica| B-HOSPITAL| | San| I-HOSPITAL| | Carlos| I-HOSPITAL| | .| O| +------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity_augmented| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, word_embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|15.0 MB| ## References - Internal JSL annotated corpus - [Spanish conLL](https://www.clips.uantwerpen.be/conll2002/ner/data/) - [MeddoProf](https://temu.bsc.es/meddoprof/data/) - [MeddoCan](https://temu.bsc.es/meddocan/) ## Benchmarking ```bash label tp fp fn total precision recall f1 PATIENT 2022.0 224.0 140.0 2162.0 0.9003 0.9352 0.9174 HOSPITAL 259.0 35.0 50.0 309.0 0.881 0.8382 0.859 DATE 1023.0 12.0 12.0 1035.0 0.9884 0.9884 0.9884 ORGANIZATION 2624.0 516.0 544.0 3168.0 0.8357 0.8283 0.832 CITY 1561.0 339.0 266.0 1827.0 0.8216 0.8544 0.8377 ID 36.0 1.0 3.0 39.0 0.973 0.9231 0.9474 STREET 197.0 14.0 9.0 206.0 0.9336 0.9563 0.9448 USERNAME 10.0 6.0 1.0 11.0 0.625 0.9091 0.7407 SEX 682.0 13.0 11.0 693.0 0.9813 0.9841 0.9827 EMAIL 134.0 0.0 1.0 135.0 1.0 0.9926 0.9963 ZIP 141.0 2.0 1.0 142.0 0.986 0.993 0.9895 MEDICALRECORD 29.0 5.0 0.0 29.0 0.8529 
1.0 0.9206 PROFESSION 252.0 27.0 25.0 277.0 0.9032 0.9097 0.9065 PHONE 51.0 11.0 0.0 51.0 0.8226 1.0 0.9027 COUNTRY 505.0 74.0 82.0 587.0 0.8722 0.8603 0.8662 DOCTOR 444.0 26.0 48.0 492.0 0.9447 0.9024 0.9231 AGE 549.0 15.0 7.0 556.0 0.9734 0.9874 0.9804 macro - - - - - - 0.9138 micro - - - - - - 0.8930 ``` --- layout: model title: Understanding Non-compete Items in Non-Compete Clauses author: John Snow Labs name: legclf_nda_non_compete_items date: 2023-04-07 tags: [en, licensed, legal, classification, nda, non_compete, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Given a clause classified as `NON_COMP` using the `legmulticlf_mnda_sections_paragraph_other` classifier, you can subclassify its sentences as `NON_COMPETE_ITEMS` or `OTHER` using the `legclf_nda_non_compete_items` model. ## Predicted Entities `NON_COMPETE_ITEMS`, `OTHER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_nda_non_compete_items_en_1.0.0_3.0_1680900015288.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_nda_non_compete_items_en_1.0.0_3.0_1680900015288.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pandas as pd document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_embeddings = nlp.UniversalSentenceEncoder.pretrained()\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") classifier = legal.ClassifierDLModel.pretrained("legclf_nda_non_compete_items", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_embeddings, classifier ]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text_list = ["""This Agreement will be binding upon and inure to the benefit of each Party and its respective heirs, successors and assigns""", """Activity that is in direct competition with the Company's business, including but not limited to developing, marketing, or selling products or services that are similar to those of the Company."""] df = spark.createDataFrame(pd.DataFrame({"text" : text_list})) result = model.transform(df) ```
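The two-stage flow described above (section classification first, then this subclassifier applied only to clauses routed to `NON_COMP`) can be sketched with plain-Python stand-ins; the keyword heuristics below are hypothetical placeholders, not the real models:

```python
def classify_section(clause):
    # Hypothetical stand-in for the legmulticlf_mnda_sections_paragraph_other stage.
    return "NON_COMP" if "competition" in clause.lower() else "OTHER"

def classify_non_compete(clause):
    # Hypothetical stand-in for the legclf_nda_non_compete_items stage.
    return "NON_COMPETE_ITEMS" if "competition" in clause.lower() else "OTHER"

def cascade(clause):
    # Only clauses routed to NON_COMP by the first stage reach the subclassifier.
    if classify_section(clause) == "NON_COMP":
        return classify_non_compete(clause)
    return "OTHER"

print(cascade("Activity that is in direct competition with the Company's business."))
# NON_COMPETE_ITEMS
```

The point of the cascade is cost and precision: the specialized subclassifier only ever sees clauses the coarse section classifier has already flagged.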
## Results ```bash +--------------------------------------------------------------------------------+-----------------+ | text| class| +--------------------------------------------------------------------------------+-----------------+ |This Agreement will be binding upon and inure to the benefit of each Party an...| OTHER| |Activity that is in direct competition with the Company's business, including...|NON_COMPETE_ITEMS| +--------------------------------------------------------------------------------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_nda_non_compete_items| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References In-house annotations on Non-disclosure Agreements ## Benchmarking ```bash label precision recall f1-score support NON_COMPETE_ITEMS 0.95 1.00 0.97 18 OTHER 1.00 0.95 0.97 20 accuracy - - 0.97 38 macro-avg 0.97 0.97 0.97 38 weighted-avg 0.98 0.97 0.97 38 ``` --- layout: model title: English DistilBertForQuestionAnswering model (from abhinavkulkarni) author: John Snow Labs name: distilbert_qa_abhinavkulkarni_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `abhinavkulkarni`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_abhinavkulkarni_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724925543.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_abhinavkulkarni_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724925543.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_abhinavkulkarni_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_abhinavkulkarni_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_abhinavkulkarni").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_abhinavkulkarni_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/abhinavkulkarni/distilbert-base-uncased-finetuned-squad --- layout: model title: SNOMED ChunkResolver author: John Snow Labs name: chunkresolve_snomed_findings_clinical date: 2021-04-16 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Entity Resolution model based on KNN using Word Embeddings + Word Movers Distance. ## Predicted Entities SNOMED codes and their normalized definitions with `clinical_embeddings`. {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_snomed_findings_clinical_en_3.0.0_3.0_1618603404974.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_snomed_findings_clinical_en_3.0.0_3.0_1618603404974.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... snomed_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_snomed_findings_clinical","en","clinical/models")\ .setInputCols("token","chunk_embeddings")\ .setOutputCol("snomed_resolution") pipeline_snomed = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, snomed_ner_converter, chunk_embeddings, snomed_resolver]) data = ["""Pentamidine 300 mg IV q . 36 hours , Pentamidine nasal wash 60 mg per 6 ml of sterile water q.d . , voriconazole 200 mg p.o . b.i.d . , acyclovir 400 mg p.o . b.i.d . , cyclosporine 50 mg p.o . b.i.d . , prednisone 60 mg p.o . q.d . , GCSF 480 mcg IV q.d . , Epogen 40,000 units subcu q . week , Protonix 40 mg q.d . , Simethicone 80 mg p.o . q . 8 , nitroglycerin paste 1 " ; q . 4 h . p.r.n . , flunisolide nasal inhaler , 2 puffs q . 8 , OxyCodone 10-15 mg p.o . q . 6 p.r.n . , Sudafed 30 mg q . 6 p.o . p.r.n . , Fluconazole 2% cream b.i.d . to erythematous skin lesions , Ditropan 5 mg p.o . b.i.d . , Tylenol 650 mg p.o . q . 4 h . p.r.n . , Ambien 5-10 mg p.o . q . h.s . p.r.n . , Neurontin 100 mg q . a.m . , 200 mg q . p.m . , Aquaphor cream b.i.d . p.r.n . , Lotrimin 1% cream b.i.d . to feet , Dulcolax 5-10 mg p.o . q.d . p.r.n . , Phoslo 667 mg p.o . t.i.d . , Peridex 0.12% , 15 ml p.o . b.i.d . mouthwash , Benadryl 25-50 mg q . 4-6 h . p.r.n . pruritus , Sarna cream q.d . p.r.n . pruritus , Nystatin 5 ml p.o . q.i.d . swish and !""", """Albuterol nebulizers 2.5 mg q.4h . and Atrovent nebulizers 0.5 mg q.4h . , please alternate albuterol and Atrovent ; Rocaltrol 0.25 mcg per NG tube q.d .; calcium carbonate 1250 mg per NG tube q.i.d .; vitamin B12 1000 mcg IM q . month , next dose is due Nov 18 ; diltiazem 60 mg per NG tube t.i.d .; ferrous sulfate 300 mg per NG t.i.d .; Haldol 5 mg IV q.h.s .; hydralazine 10 mg IV q.6h . p.r.n . 
hypertension ; lisinopril 10 mg per NG tube q.d .; Ativan 1 mg per NG tube q.h.s .; Lopressor 25 mg per NG tube t.i.d .; Zantac 150 mg per NG tube b.i.d .; multivitamin 10 ml per NG tube q.d .; Macrodantin 100 mg per NG tube q.i.d . x 10 days beginning on 11/3/00 .""", """Tylenol 650 mg p.o . q . 4-6h p.r.n . headache or pain ; acyclovir 400 mg p.o . t.i.d .; acyclovir topical t.i.d . to be applied to lesion on corner of mouth ; Peridex 15 ml p.o . b.i.d .; Mycelex 1 troche p.o . t.i.d .; g-csf 404 mcg subcu q.d .; folic acid 1 mg p.o . q.d .; lorazepam 1-2 mg p.o . q . 4-6h p.r.n . nausea and vomiting ; Miracle Cream topical q.d . p.r.n . perianal irritation ; Eucerin Cream topical b.i.d .; Zantac 150 mg p.o . b.i.d .; Restoril 15-30 mg p.o . q . h.s . p.r.n . insomnia ; multivitamin 1 tablet p.o . q.d .; viscous lidocaine 15 ml p.o . q . 3h can be applied to corner of mouth or lips p.r.n . pain control ."""] model = pipeline_snomed.fit(spark.createDataFrame([['']]).toDF("text")) results = model.transform(spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."]], ["text"])) ``` ```scala ... 
val snomed_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_snomed_findings_clinical","en","clinical/models") .setInputCols("token","chunk_embeddings") .setOutputCol("snomed_resolution") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, snomed_ner_converter, chunk_embeddings, snomed_resolver)) val data = Array("""Pentamidine 300 mg IV q . 36 hours , Pentamidine nasal wash 60 mg per 6 ml of sterile water q.d . , voriconazole 200 mg p.o . b.i.d . , acyclovir 400 mg p.o . b.i.d . , cyclosporine 50 mg p.o . b.i.d . , prednisone 60 mg p.o . q.d . , GCSF 480 mcg IV q.d . , Epogen 40,000 units subcu q . week , Protonix 40 mg q.d . , Simethicone 80 mg p.o . q . 8 , nitroglycerin paste 1 " ; q . 4 h . p.r.n . , flunisolide nasal inhaler , 2 puffs q . 8 , OxyCodone 10-15 mg p.o . q . 6 p.r.n . , Sudafed 30 mg q . 6 p.o . p.r.n . , Fluconazole 2% cream b.i.d . to erythematous skin lesions , Ditropan 5 mg p.o . b.i.d . , Tylenol 650 mg p.o . q . 4 h . p.r.n . , Ambien 5-10 mg p.o . q . h.s . p.r.n . , Neurontin 100 mg q . a.m . , 200 mg q . p.m . , Aquaphor cream b.i.d . p.r.n . , Lotrimin 1% cream b.i.d . to feet , Dulcolax 5-10 mg p.o . q.d . p.r.n . , Phoslo 667 mg p.o . t.i.d . , Peridex 0.12% , 15 ml p.o . b.i.d . mouthwash , Benadryl 25-50 mg q . 4-6 h . p.r.n . pruritus , Sarna cream q.d . p.r.n . pruritus , Nystatin 5 ml p.o . q.i.d . swish and !""", """Albuterol nebulizers 2.5 mg q.4h . and Atrovent nebulizers 0.5 mg q.4h . , please alternate albuterol and Atrovent ; Rocaltrol 0.25 mcg per NG tube q.d .; calcium carbonate 1250 mg per NG tube q.i.d .; vitamin B12 1000 mcg IM q . month , next dose is due Nov 18 ; diltiazem 60 mg per NG tube t.i.d .; ferrous sulfate 300 mg per NG t.i.d .; Haldol 5 mg IV q.h.s .; hydralazine 10 mg IV q.6h . p.r.n . 
hypertension ; lisinopril 10 mg per NG tube q.d .; Ativan 1 mg per NG tube q.h.s .; Lopressor 25 mg per NG tube t.i.d .; Zantac 150 mg per NG tube b.i.d .; multivitamin 10 ml per NG tube q.d .; Macrodantin 100 mg per NG tube q.i.d . x 10 days beginning on 11/3/00 .""", """Tylenol 650 mg p.o . q . 4-6h p.r.n . headache or pain ; acyclovir 400 mg p.o . t.i.d .; acyclovir topical t.i.d . to be applied to lesion on corner of mouth ; Peridex 15 ml p.o . b.i.d .; Mycelex 1 troche p.o . t.i.d .; g-csf 404 mcg subcu q.d .; folic acid 1 mg p.o . q.d .; lorazepam 1-2 mg p.o . q . 4-6h p.r.n . nausea and vomiting ; Miracle Cream topical q.d . p.r.n . perianal irritation ; Eucerin Cream topical b.i.d .; Zantac 150 mg p.o . b.i.d .; Restoril 15-30 mg p.o . q . h.s . p.r.n . insomnia ; multivitamin 1 tablet p.o . q.d .; viscous lidocaine 15 ml p.o . q . 3h can be applied to corner of mouth or lips p.r.n . pain control .""") val df = data.toSeq.toDF("text") val result = pipeline.fit(Seq.empty[String].toDF("text")).transform(df) ```
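The resolution strategy named in the Description, nearest-neighbour search over chunk embeddings, can be illustrated with a toy 1-NN lookup. The two-dimensional vectors and the code-to-vector table below are invented for illustration only; the real model measures Word Movers Distance over word embeddings, not Euclidean distance over toy vectors:

```python
import math

# Hypothetical (SNOMED code -> reference embedding) pairs.
reference = {
    "38341003": [0.9, 0.1],   # Hypertension
    "418363000": [0.1, 0.9],  # Pruritus
}

def resolve(chunk_vector):
    # Return the code whose reference embedding is closest (1-NN).
    return min(
        reference,
        key=lambda code: math.dist(chunk_vector, reference[code]),
    )

print(resolve([0.8, 0.2]))  # closest to the "Hypertension" vector
```

The `confidence` column in the Results below reflects how close the chunk embedding lands to the chosen reference, which a sketch like this could derive from the same distance.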
## Results ```bash +-----------------------------------------------------------------------------+-------+----------------------------------------------------------------------------------------------------+-----------------+----------+ | chunk| entity| target_text| code|confidence| +-----------------------------------------------------------------------------+-------+----------------------------------------------------------------------------------------------------+-----------------+----------+ | erythematous skin lesions|PROBLEM|Skin lesion:::Achromic skin lesions of pinta:::Scaly skin:::Skin constricture:::Cratered skin les...| 95324001| 0.0937| | pruritus|PROBLEM|Pruritus:::Genital pruritus:::Postmenopausal pruritus:::Pruritus hiemalis:::Pruritus ani:::Anogen...| 418363000| 0.1394| | pruritus|PROBLEM|Pruritus:::Genital pruritus:::Postmenopausal pruritus:::Pruritus hiemalis:::Pruritus ani:::Anogen...| 418363000| 0.1394| | hypertension|PROBLEM|Hypertension:::Renovascular hypertension:::Idiopathic hypertension:::Venous hypertension:::Resist...| 38341003| 0.1019| | headache or pain|PROBLEM|Pain:::Headache:::Postchordotomy pain:::Throbbing pain:::Aching headache:::Postspinal headache:::...| 22253000| 0.0953| | applied to lesion on corner of mouth|PROBLEM|Lesion of tongue:::Erythroleukoplakia of mouth:::Lesion of nose:::Lesion of oropharynx:::Erythrop...| 300246005| 0.0547| | nausea and vomiting|PROBLEM|Nausea and vomiting:::Vomiting without nausea:::Nausea:::Intractable nausea and vomiting:::Vomiti...| 16932000| 0.0995| | perianal irritation|PROBLEM|Perineal irritation:::Vulval irritation:::Skin irritation:::Perianal pain:::Perianal itch:::Vagin...| 281639001| 0.0764| | insomnia|PROBLEM|Insomnia:::Mood insomnia:::Nonorganic insomnia:::Persistent insomnia:::Psychophysiologic insomnia...| 193462001| 0.1198| 
+-----------------------------------------------------------------------------+-------+----------------------------------------------------------------------------------------------------+-----------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|chunkresolve_snomed_findings_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token, chunk_embeddings]| |Output Labels:|[icd10cm]| |Language:|en| --- layout: model title: English RobertaForQuestionAnswering Large Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_large_data_seed_4 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-data-seed-4` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_large_data_seed_4_en_4.3.0_3.0_1674221236388.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_large_data_seed_4_en_4.3.0_3.0_1674221236388.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_data_seed_4","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_data_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_large_data_seed_4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-large-data-seed-4 --- layout: model title: English RobertaForQuestionAnswering (from huxxx657) author: John Snow Labs name: roberta_qa_roberta_base_finetuned_scrambled_squad_10_new date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-scrambled-squad-10-new` is an English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_scrambled_squad_10_new_en_4.0.0_3.0_1655734004039.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_scrambled_squad_10_new_en_4.0.0_3.0_1655734004039.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_finetuned_scrambled_squad_10_new","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_finetuned_scrambled_squad_10_new","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_scrambled_10_new.by_huxxx657").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_finetuned_scrambled_squad_10_new| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-scrambled-squad-10-new --- layout: model title: Chinese BertForMaskedLM Large Cased model (from hfl) author: John Snow Labs name: bert_embeddings_chinese_lert_large date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese-lert-large` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_lert_large_zh_4.2.4_3.0_1670021088288.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_lert_large_zh_4.2.4_3.0_1670021088288.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_lert_large","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_lert_large","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_lert_large| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|1.2 GB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/chinese-lert-large - https://github.com/ymcui/LERT/blob/main/README_EN.md - https://arxiv.org/abs/2211.05344 --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from emr-se-miniproject) author: John Snow Labs name: roberta_qa_base_emr date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-emr` is an English model originally trained by `emr-se-miniproject`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_emr_en_4.3.0_3.0_1674213227451.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_emr_en_4.3.0_3.0_1674213227451.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_emr","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_emr","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_emr| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/emr-se-miniproject/roberta-base-emr --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_en_4.3.0_3.0_1675118464936.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_en_4.3.0_3.0_1675118464936.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
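T5 checkpoints are usually steered with a task prefix rather than raw input text. As a hedged configuration sketch (the `summarize:` prefix and the output length are illustrative assumptions, not part of the original card), the transformer can be given a task with `setTask` before it is added to the pipeline:

```python
# Illustrative config fragment only: the task prefix and output length
# below are assumptions; pick values appropriate for your use case.
t5 = T5Transformer.pretrained("t5_efficient_small", "en") \
    .setInputCols("document") \
    .setOutputCol("answers") \
    .setTask("summarize:") \
    .setMaxOutputLength(128)
```

With a task set, the text in the `text` column is prefixed automatically, so the DataFrame construction in the example above stays unchanged.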
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|147.8 MB| ## References - https://huggingface.co/google/t5-efficient-small - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Detect Persons, Locations, Organizations and Misc Entities - DE (Wiki NER 6B 100) author: John Snow Labs name: wikiner_6B_100 date: 2019-07-13 task: Named Entity Recognition language: de edition: Spark NLP 2.1.0 spark_version: 2.4 tags: [open_source, ner, de] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Wiki NER is a Named Entity Recognition (NER) model that can be used to find features such as names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. Wiki NER 6B 100 is trained with GloVe 6B 100 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities ``Persons``, ``Locations``, ``Organizations``, ``Misc``.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_DE){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_DE.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_6B_300_de_2.1.0_2.4_1564861417829.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_6B_300_de_2.1.0_2.4_1564861417829.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python ner = NerDLModel.pretrained("wikiner_6B_100", "de") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ``` ```scala val ner = NerDLModel.pretrained("wikiner_6B_100", "de") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ```
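The snippet above assumes that `document`, `token`, and `embeddings` columns already exist. A minimal end-to-end sketch follows; it assumes the `glove_100d` pretrained embeddings correspond to the GloVe 6B 100d vectors the model was trained with, as the description requires matching embeddings:

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# The NER model was trained on GloVe 6B 100d vectors, so the same
# embeddings must be used at inference time (glove_100d is assumed
# here to be that checkpoint).
embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

ner = NerDLModel.pretrained("wikiner_6B_100", "de") \
    .setInputCols(["document", "token", "embeddings"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings, ner])
data = spark.createDataFrame([["Angela Merkel besuchte gestern Berlin."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```

Mismatched embeddings would silently degrade predictions rather than raise an error, which is why the embeddings stage is spelled out here.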
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wikiner_6B_100| |Type:|ner| |Compatibility:| Spark NLP 2.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|de| |Case sensitive:|false| {:.h2_title} ## Data Source The model is trained based on data from [https://de.wikipedia.org](https://de.wikipedia.org) --- layout: model title: Smaller BERT Embeddings (L-6_H-256_A-4) author: John Snow Labs name: small_bert_L6_256 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L6_256_en_2.6.0_2.4_1598344429629.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L6_256_en_2.6.0_2.4_1598344429629.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L6_256", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L6_256", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L6_256').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash token en_embed_bert_small_L6_256_embeddings I [0.782717764377594, 0.3001103401184082, 1.5869... love [1.3923038244247437, -0.4370909631252289, 2.15... NLP [1.5447113513946533, 0.4231811761856079, 0.277... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|small_bert_L6_256| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|256| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-256_A-4/1 --- layout: model title: Financial Finbert Sentiment Analysis author: John Snow Labs name: finclf_bert_sentiment_phrasebank date: 2022-09-07 tags: [en, finance, sentiment, classification, sentiment_analysis, licensed] task: Sentiment Analysis language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a pre-trained NLP model to analyze sentiment of financial text. It is built by further training the BERT language model in the finance domain, using a large financial corpus and thereby fine-tuning it for financial sentiment classification. Financial PhraseBank by Malo et al. (2014) and in-house JSL documents and annotations have been used for fine-tuning. 
## Predicted Entities `positive`, `negative`, `neutral` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN_FINANCE/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_bert_sentiment_phrasebank_en_1.0.0_3.2_1662539499618.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_bert_sentiment_phrasebank_en_1.0.0_3.2_1662539499618.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = nlp.Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier_loaded = finance.BertForSequenceClassification.pretrained("finclf_bert_sentiment_phrasebank", "en", "finance/models")\ .setInputCols(["document",'token'])\ .setOutputCol("class") pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier_loaded ]) # couple of simple examples example = spark.createDataFrame([["Stocks rallied and the British pound gained."]]).toDF("text") result = pipeline.fit(example).transform(example) # result is a DataFrame result.select("text", "class.result").show() ```
## Results ```bash +--------------------+----------+ | text| result| +--------------------+----------+ |Stocks rallied an...|[positive]| +--------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_bert_sentiment_phrasebank| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|409.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References In-house financial documents and Financial PhraseBank by Malo et al. (2014) ## Benchmarking ```bash label precision recall f1-score support positive 0.76 0.89 0.82 253 negative 0.87 0.86 0.87 133 neutral 0.94 0.87 0.90 584 accuracy - - 0.87 970 macro-avg 0.86 0.87 0.86 970 weighted-avg 0.88 0.87 0.88 970 ``` --- layout: model title: Pipeline to Detect Normalized Genes and Human Phenotypes author: John Snow Labs name: ner_human_phenotype_go_clinical_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, gene, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_human_phenotype_go_clinical](https://nlp.johnsnowlabs.com/2021/03/31/ner_human_phenotype_go_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_go_clinical_pipeline_en_3.4.1_3.0_1647868376700.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_go_clinical_pipeline_en_3.4.1_3.0_1647868376700.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_human_phenotype_go_clinical_pipeline", "en", "clinical/models") pipeline.annotate("Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.") ``` ```scala val pipeline = new PretrainedPipeline("ner_human_phenotype_go_clinical_pipeline", "en", "clinical/models") pipeline.annotate("Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.human_phenotype_clinical.pipeline").predict("""Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.""") ```
## Results ```bash +----+--------------------------+---------+-------+----------+ | | chunk | begin | end | entity | +====+==========================+=========+=======+==========+ | 0 | tumor | 39 | 43 | HP | +----+--------------------------+---------+-------+----------+ | 1 | tricarboxylic acid cycle | 79 | 102 | GO | +----+--------------------------+---------+-------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_human_phenotype_go_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Word2Vec Embeddings in Erzya (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-15 tags: [cc, embeddings, fastText, word2vec, myv, open_source] task: Embeddings language: myv edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_myv_3.4.1_3.0_1647368418007.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_myv_3.4.1_3.0_1647368418007.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","myv") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","myv") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("myv.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|myv| |Size:|102.1 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Sentence Entity Resolver for ICD10-CM (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_icd10cm date: 2021-05-16 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD10-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings, and has faster load time, with a speedup of about 6X when compared to previous versions. Also the load process now is more memory friendly meaning that the maximum memory required during load time is smaller, reducing the chances of OOM exceptions, and thus relaxing hardware requirements. ## Predicted Entities Predicts ICD10-CM Codes and their normalized definitions. 
{:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_en_3.0.4_3.0_1621189196513.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_en_3.0.4_3.0_1621189196513.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("entities") chunk2doc = Chunk2Doc()\ .setInputCols("entities")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . 
Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("entities") val chunk2doc = new Chunk2Doc() .setInputCols("entities") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("sbert_embeddings") val icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm","en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver)) val data = Seq("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic 
renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd10cm").predict("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""") ```
## Results ```bash +--------------------+-----+---+---------+------+----------+--------------------+--------------------+ | chunk|begin|end| entity| code|confidence| all_k_resolutions| all_k_codes| +--------------------+-----+---+---------+------+----------+--------------------+--------------------+ | hypertension| 68| 79| PROBLEM| I150| 0.2606|Renovascular hype...|I150:::K766:::I10...| |chronic renal ins...| 83|109| PROBLEM| N186| 0.2059|End stage renal d...|N186:::D631:::P96...| | COPD| 113|116| PROBLEM| I2781| 0.2132|Cor pulmonale (ch...|I2781:::J449:::J4...| | gastritis| 120|128| PROBLEM| K5281| 0.1425|Eosinophilic gast...|K5281:::K140:::K9...| | TIA| 136|138| PROBLEM| G459| 0.1152|Transient cerebra...|G459:::I639:::T79...| |a non-ST elevatio...| 182|202| PROBLEM| I214| 0.0889|Non-ST elevation ...|I214:::I256:::M62...| |Guaiac positive s...| 208|229| PROBLEM| K626| 0.0631|Ulcer of anus and...|K626:::K380:::R15...| |cardiac catheteri...| 295|317| TEST| Z950| 0.2549|Presence of cardi...|Z950:::Z95811:::I...| | PTCA| 324|327|TREATMENT| Z9861| 0.1268|Coronary angiopla...|Z9861:::Z9862:::I...| | mid LAD lesion| 332|345| PROBLEM|L02424| 0.1117|Furuncle of left ...|L02424:::Q202:::L...| +--------------------+-----+---+---------+------+----------+--------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icd10cm| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk, sbert_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on ICD10 Clinical Modification dataset with ``sbiobert_base_cased_mli`` sentence embeddings. 
https://www.icd10data.com/ICD10CM/Codes/ --- layout: model title: Maltese Legal Roberta Embeddings author: John Snow Labs name: roberta_base_maltese_legal date: 2023-02-16 tags: [mt, maltese, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: mt edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-maltese-roberta-base` is a Maltese model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_maltese_legal_mt_4.2.4_3.0_1676554739478.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_maltese_legal_mt_4.2.4_3.0_1676554739478.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_maltese_legal", "mt")\ .setInputCols(["sentence"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_maltese_legal", "mt") .setInputCols("sentence") .setOutputCol("embeddings") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_maltese_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|mt| |Size:|415.9 MB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-maltese-roberta-base --- layout: model title: Fast Neural Machine Translation Model from Kaonde to English author: John Snow Labs name: opus_mt_kqn_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, kqn, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `kqn` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_kqn_en_xx_2.7.0_2.4_1609169620248.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_kqn_en_xx_2.7.0_2.4_1609169620248.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_kqn_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_kqn_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.kqn.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_kqn_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate Tonga (Zambezi) to English Pipeline author: John Snow Labs name: translate_toi_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, toi, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module especially on larger sequence. The use of an accelerator such as GPU is recommended. - source languages: `toi` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_toi_en_xx_2.7.0_2.4_1609686778128.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_toi_en_xx_2.7.0_2.4_1609686778128.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_toi_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_toi_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.toi.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_toi_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Duplicate Question Detection author: John Snow Labs name: distilbert_base_sequence_classifier_qqp date: 2022-02-12 tags: [en, open_source] task: Text Classification language: en nav_key: models edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: DistilBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model was imported from `Hugging Face` ([link](https://huggingface.co/assemblyai/distilbert-base-uncased-qqp)) and it has been trained on the Quora Question Pairs dataset, leveraging `Distil-BERT` embeddings and `DistilBertForSequenceClassification` for text classification purposes. As an input, it requires two questions separated by a space. ## Predicted Entities `non_duplicated`, `duplicated` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_base_sequence_classifier_qqp_en_3.4.0_3.0_1644663826044.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_base_sequence_classifier_qqp_en_3.4.0_3.0_1644663826044.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = DistilBertForSequenceClassification.pretrained("distilbert_base_sequence_classifier_qqp", "en")\ .setInputCols(["document", "token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier]) light_pipeline = LightPipeline(pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) result1 = light_pipeline.annotate("Do we have to go there? Are you a doctor?") result2 = light_pipeline.annotate("Do you want to eat something? Are you hungry?") ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val sequenceClassifier = DistilBertForSequenceClassification.pretrained("distilbert_base_sequence_classifier_qqp", "en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) val example1 = Seq("Do we have to go there? Are you a doctor?").toDS.toDF("text") val example2 = Seq("Do you want to eat something? Are you hungry?").toDS.toDF("text") val result1 = pipeline.fit(example1).transform(example1) val result2 = pipeline.fit(example2).transform(example2) ```
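As the description notes, the classifier expects both questions concatenated into a single string separated by a space. A minimal plain-Python helper (a sketch only, independent of Spark NLP; the function name is hypothetical) for building such inputs before calling `annotate`:

```python
def make_pair_input(question1: str, question2: str) -> str:
    """Concatenate two questions into the single space-separated
    string this classifier expects as input."""
    return f"{question1.strip()} {question2.strip()}"

inputs = [
    make_pair_input("Do we have to go there?", "Are you a doctor?"),
    make_pair_input("Do you want to eat something?", "Are you hungry?"),
]
# Each element can then be passed to light_pipeline.annotate(...)
```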
## Results ```bash ['non_duplicated'] ['duplicated'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_base_sequence_classifier_qqp| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[class]| |Language:|en| |Size:|249.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References [https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs) --- layout: model title: Translate English to Samoan Pipeline author: John Snow Labs name: translate_en_sm date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, sm, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module especially on larger sequence. The use of an accelerator such as GPU is recommended. 
- source languages: `en` - target languages: `sm` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_sm_xx_2.7.0_2.4_1609689371340.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_sm_xx_2.7.0_2.4_1609689371340.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_sm", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_sm", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.sm').predict(text, output_level='sentence') translate_df ```
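Because translation cost grows with input length (see the note on computational expense above), long documents are usually split into sentence-sized pieces before being fed to the pipeline. Inside Spark NLP this is the job of a sentence detector; purely as an illustration of the idea, a naive plain-Python splitter might look like:

```python
import re

def naive_sentences(text: str) -> list:
    """Split text on sentence-ending punctuation followed by whitespace.
    Crude by design: abbreviations like 'e.g.' will be over-split."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

chunks = naive_sentences("This is long. Translate it! Piece by piece?")
# chunks -> ['This is long.', 'Translate it!', 'Piece by piece?']
```

Each chunk can then be translated independently, keeping every call to the model short.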
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_sm| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Sentence Entity Resolver for SNOMED (sbiobertresolve_snomed_drug) author: John Snow Labs name: sbiobertresolve_snomed_drug date: 2022-01-01 tags: [snomed, licensed, en, clinical, drug] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps detected drug entities to SNOMED codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. ## Predicted Entities `SNOMED Codes` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_drug_en_3.3.4_2.4_1641035704765.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_drug_en_3.3.4_2.4_1641035704765.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", 'clinical/models') \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("word_embeddings") clinical_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") \ .setInputCols(["sentence", "token", "word_embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(['DRUG']) c2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sentence_chunk_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sentence_embeddings") snomed_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_drug", "en", "clinical/models") \ .setInputCols(["sentence_embeddings"]) \ .setOutputCol("snomed_code")\ .setDistanceFunction("EUCLIDEAN")\ resolver_pipeline = Pipeline( stages = [ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, clinical_ner, ner_converter, c2doc, sentence_chunk_embeddings, snomed_resolver ]) model = resolver_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = model.transform(spark.createDataFrame([["She is given Fragmin 5000 units subcutaneously daily, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Avandia 4 mg daily, aspirin 81 mg daily, Neurontin 400 mg p.o. t.i.d., magnesium citrate 1 bottle p.o. 
p.r.n., sliding scale coverage insulin."]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("word_embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") .setInputCols(Array("sentence", "token", "word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("DRUG")) val c2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sentence_chunk_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sentence_embeddings") val snomed_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_drug", "en", "clinical/models") .setInputCols(Array("sentence_embeddings")) .setOutputCol("snomed_code") .setDistanceFunction("EUCLIDEAN") val resolver_pipeline = new Pipeline().setStages(Array(document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, clinical_ner, ner_converter, c2doc, sentence_chunk_embeddings, snomed_resolver)) val data = Seq("She is given Fragmin 5000 units subcutaneously daily, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Avandia 4 mg daily, aspirin 81 mg daily, Neurontin 400 mg p.o. t.i.d., magnesium citrate 1 bottle p.o. 
p.r.n., sliding scale coverage insulin.").toDS.toDF("text") val result = resolver_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.snomed_drug").predict("""She is given Fragmin 5000 units subcutaneously daily, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Avandia 4 mg daily, aspirin 81 mg daily, Neurontin 400 mg p.o. t.i.d., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin.""") ```
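The resolver packs its top-k candidates into single `:::`-separated strings (the `all_k_results` and `all_k_resolutions` columns shown in the Results table below). A minimal plain-Python helper to unpack them into ranked `(code, resolution)` pairs — a sketch only, with a hypothetical function name, no Spark required:

```python
def unpack_candidates(codes: str, resolutions: str, sep: str = ":::"):
    """Zip ':::'-separated code and resolution strings into ranked pairs,
    best candidate first."""
    return list(zip(codes.split(sep), resolutions.split(sep)))

pairs = unpack_candidates(
    "63718003:::6247001",
    "Folic acid:::Folic acid-containing product",
)
# pairs -> [('63718003', 'Folic acid'),
#           ('6247001', 'Folic acid-containing product')]
```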
## Results ```bash +-----------------+------+-----------------+-----------------+------------------------------------------------------------+------------------------------------------------------------+ | ner_chunk|entity| snomed_code| resolved_text| all_k_results| all_k_resolutions| +-----------------+------+-----------------+-----------------+------------------------------------------------------------+------------------------------------------------------------+ | Fragmin| DRUG| 9487801000001106| Fragmin|9487801000001106:::130752006:::28999000:::953500100000110...|Fragmin:::Fragilysin:::Fusarin:::Femulen:::Fumonisin:::Fr...| | OxyContin| DRUG| 9296001000001100| OxyCONTIN|9296001000001100:::373470001:::230091000001108:::55452001...|OxyCONTIN:::Oxychlorosene:::Oxyargin:::oxyCODONE:::Oxymor...| | folic acid| DRUG| 63718003| Folic acid|63718003:::6247001:::226316008:::432165000:::438451000124...|Folic acid:::Folic acid-containing product:::Folic acid s...| | levothyroxine| DRUG|10071011000001106| Levothyroxine|10071011000001106:::710809001:::768532006:::126202002:::7...|Levothyroxine:::Levothyroxine (substance):::Levothyroxine...| | Avandia| DRUG| 9217601000001109| avandia|9217601000001109:::9217501000001105:::12226401000001108::...|avandia:::avandamet:::Anatera:::Intanza:::Avamys:::Aragam...| | aspirin| DRUG| 387458008| Aspirin|387458008:::7947003:::5145711000001107:::426365001:::4125...|Aspirin:::Aspirin-containing product:::Aspirin powder:::A...| | Neurontin| DRUG| 9461401000001102| neurontin|9461401000001102:::130694004:::86822004:::952840100000110...|neurontin:::Neurolysin:::Neurine (substance):::Nebilet:::...| |magnesium citrate| DRUG| 12495006|Magnesium citrate|12495006:::387401007:::21691008:::15531411000001106:::408...|Magnesium citrate:::Magnesium carbonate:::Magnesium trisi...| | insulin| DRUG| 67866001| Insulin|67866001:::325072002:::414515005:::39487003:::411530000::...|Insulin:::Insulin aspart:::Insulin detemir:::Insulin-cont...| 
+-----------------+------+-----------------+-----------------+------------------------------------------------------------+------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_snomed_drug| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[snomed_code]| |Language:|en| |Size:|1.2 GB| |Case sensitive:|false| ## Data Source Trained on `SNOMED` code dataset with `sbiobert_base_cased_mli` sentence embeddings. --- layout: model title: Translate English to Malayalam Pipeline author: John Snow Labs name: translate_en_ml date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, ml, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module especially on larger sequence. The use of an accelerator such as GPU is recommended. 
- source languages: `en` - target languages: `ml` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ml_xx_2.7.0_2.4_1609691611770.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ml_xx_2.7.0_2.4_1609691611770.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ml", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ml", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ml').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_ml| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr4 TFWav2Vec2ForCTC from lilitket author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr4 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr4` is an English model originally trained by lilitket. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr4_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr4_en_4.2.0_3.0_1664119928268.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr4_en_4.2.0_3.0_1664119928268.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr4", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr4", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr4| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Translate Samoan to English Pipeline author: John Snow Labs name: translate_sm_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, sm, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module especially on larger sequence. The use of an accelerator such as GPU is recommended. - source languages: `sm` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_sm_en_xx_2.7.0_2.4_1609688483418.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_sm_en_xx_2.7.0_2.4_1609688483418.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_sm_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_sm_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.sm.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_sm_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: French CamemBert Embeddings (from fjluque) author: John Snow Labs name: camembert_embeddings_fjluque_generic_model2 date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model2` is a French model originally trained by `fjluque`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_fjluque_generic_model2_fr_3.4.4_3.0_1653991250171.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_fjluque_generic_model2_fr_3.4.4_3.0_1653991250171.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_fjluque_generic_model2","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_fjluque_generic_model2","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
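Downstream, the vectors in the `embeddings` column are often compared with cosine similarity (e.g., to find semantically close tokens or sentences). A toy plain-Python sketch with made-up 3-dimensional vectors, purely for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Parallel vectors score ~1.0; orthogonal vectors score 0.0.
sim_parallel = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
sim_orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```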
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_fjluque_generic_model2| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/fjluque/dummy-model2 --- layout: model title: Stop Words Cleaner for German author: John Snow Labs name: stopwords_de date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: de edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, de] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_de_de_2.5.4_2.4_1594742442247.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_de_de_2.5.4_2.4_1594742442247.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_de", "de") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("John Snow ist nicht nur der König des Nordens, sondern auch ein englischer Arzt und führend in der Entwicklung von Anästhesie und medizinischer Hygiene.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_de", "de") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("John Snow ist nicht nur der König des Nordens, sondern auch ein englischer Arzt und führend in der Entwicklung von Anästhesie und medizinischer Hygiene.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""John Snow ist nicht nur der König des Nordens, sondern auch ein englischer Arzt und führend in der Entwicklung von Anästhesie und medizinischer Hygiene."""] stopword_df = nlu.load('de.stopwords').predict(text) stopword_df[["cleanTokens"]] ```
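Conceptually, the cleaner keeps only tokens that fall outside a stop-word list, matching case-insensitively (the model's `Case sensitive` flag is `false`). A plain-Python sketch with a tiny, hypothetical German stop-word list — the pretrained model's actual list is far larger:

```python
# A deliberately tiny stand-in for the model's German stop-word list.
STOPWORDS_DE = {"ist", "nicht", "nur", "der", "des", "auch",
                "ein", "und", "in", "von", "sondern"}

def clean_tokens(tokens):
    """Drop tokens whose lowercase form is in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOPWORDS_DE]

tokens = ["John", "Snow", "ist", "nicht", "nur", "der",
          "König", "des", "Nordens"]
cleaned = clean_tokens(tokens)
# cleaned -> ['John', 'Snow', 'König', 'Nordens']
```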
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=3, result='John', metadata={'sentence': '0'}), Row(annotatorType='token', begin=5, end=8, result='Snow', metadata={'sentence': '0'}), Row(annotatorType='token', begin=28, end=32, result='König', metadata={'sentence': '0'}), Row(annotatorType='token', begin=38, end=44, result='Nordens', metadata={'sentence': '0'}), Row(annotatorType='token', begin=45, end=45, result=',', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_de| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|de| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_el4 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-el4` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el4_en_4.3.0_3.0_1675120069644.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el4_en_4.3.0_3.0_1675120069644.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_el4","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_el4","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_el4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|135.8 MB| ## References - https://huggingface.co/google/t5-efficient-small-el4 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English image_classifier_vit_baked_goods ViTForImageClassification from nateraw author: John Snow Labs name: image_classifier_vit_baked_goods date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_baked_goods` is an English model originally trained by nateraw. ## Predicted Entities `cake`, `cookie`, `pie` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_baked_goods_en_4.1.0_3.0_1660171354219.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_baked_goods_en_4.1.0_3.0_1660171354219.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_baked_goods", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_baked_goods", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_baked_goods| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English RobertaForQuestionAnswering (from csarron) author: John Snow Labs name: roberta_qa_roberta_base_squad_v1 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad-v1` is an English model originally trained by `csarron`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad_v1_en_4.0.0_3.0_1655734876022.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad_v1_en_4.0.0_3.0_1655734876022.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_squad_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_squad_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base.by_csarron").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_squad_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|457.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/csarron/roberta-base-squad-v1 - https://github.com/csarron - https://arxiv.org/abs/1907.11692 - https://awk.ai/ - https://twitter.com/sysnlp - https://rajpurkar.github.io/SQuAD-explorer --- layout: model title: German asr_exp_w2v2r_xls_r_gender_male_10_female_0_s204 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_exp_w2v2r_xls_r_gender_male_10_female_0_s204 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_gender_male_10_female_0_s204` is a German model originally trained by jonatasgrosman. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2r_xls_r_gender_male_10_female_0_s204_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_gender_male_10_female_0_s204_de_4.2.0_3.0_1664121138596.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_gender_male_10_female_0_s204_de_4.2.0_3.0_1664121138596.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2r_xls_r_gender_male_10_female_0_s204', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2r_xls_r_gender_male_10_female_0_s204", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_exp_w2v2r_xls_r_gender_male_10_female_0_s204| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Abkhazian asr_xls_r_demo_test TFWav2Vec2ForCTC from chmanoj author: John Snow Labs name: pipeline_asr_xls_r_demo_test date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_demo_test` is an Abkhazian model originally trained by chmanoj. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_xls_r_demo_test_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_demo_test_ab_4.2.0_3.0_1664020281747.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_demo_test_ab_4.2.0_3.0_1664020281747.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_xls_r_demo_test', lang = 'ab') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_xls_r_demo_test", lang = "ab") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_xls_r_demo_test| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|ab| |Size:|452.4 KB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English RobertaForSequenceClassification Cased model (from mrm8488) author: John Snow Labs name: roberta_sequence_classifier_distilroberta_finetuned_rotten_tomatoes_sentiment_analysis date: 2022-07-13 tags: [en, open_source, roberta, sequence_classification] task: Text Classification language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-finetuned-rotten_tomatoes-sentiment-analysis` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_distilroberta_finetuned_rotten_tomatoes_sentiment_analysis_en_4.0.0_3.0_1657716107308.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_distilroberta_finetuned_rotten_tomatoes_sentiment_analysis_en_4.0.0_3.0_1657716107308.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_distilroberta_finetuned_rotten_tomatoes_sentiment_analysis","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, classifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_distilroberta_finetuned_rotten_tomatoes_sentiment_analysis","en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, classifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_sequence_classifier_distilroberta_finetuned_rotten_tomatoes_sentiment_analysis| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|309.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mrm8488/distilroberta-finetuned-rotten_tomatoes-sentiment-analysis --- layout: model title: English BertForQuestionAnswering model (from ricardo-filho) author: John Snow Labs name: bert_qa_bert_base_faquad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_base_faquad` is an English model originally trained by `ricardo-filho`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_faquad_en_4.0.0_3.0_1654185088848.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_faquad_en_4.0.0_3.0_1654185088848.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_faquad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_faquad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_faquad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|406.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ricardo-filho/bert_base_faquad --- layout: model title: Korean Electra Embeddings (from deeq) author: John Snow Labs name: electra_embeddings_delectra_generator date: 2022-05-17 tags: [ko, open_source, electra, embeddings] task: Embeddings language: ko edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `delectra-generator` is a Korean model originally trained by `deeq`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_delectra_generator_ko_3.4.4_3.0_1652786213718.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_delectra_generator_ko_3.4.4_3.0_1652786213718.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_delectra_generator","ko") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["나는 Spark NLP를 좋아합니다"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_delectra_generator","ko") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("나는 Spark NLP를 좋아합니다").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_delectra_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ko| |Size:|139.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/deeq/delectra-generator --- layout: model title: English Electra Embeddings (from google) author: John Snow Labs name: electra_embeddings_electra_base_generator date: 2022-05-17 tags: [en, open_source, electra, embeddings] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-generator` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_generator_en_3.4.4_3.0_1652786554017.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_generator_en_3.4.4_3.0_1652786554017.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_generator","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_generator","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electra_base_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|126.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/google/electra-base-generator - https://arxiv.org/pdf/1406.2661.pdf - https://rajpurkar.github.io/SQuAD-explorer/ - https://openreview.net/pdf?id=r1xMH1BtvB - https://gluebenchmark.com/ - https://www.clips.uantwerpen.be/conll2000/chunking/ --- layout: model title: Legal Obligations absolute Clause Binary Classifier author: John Snow Labs name: legclf_obligations_absolute_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `obligations-absolute` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend that you split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `obligations-absolute` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_obligations_absolute_clause_en_1.0.0_3.2_1660122773792.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_obligations_absolute_clause_en_1.0.0_3.2_1660122773792.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_obligations_absolute_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
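The paragraph splitting (by multiline) technique mentioned in the Description can be sketched in plain Python; this is a simplified illustration, not the workshop notebook's exact implementation. Each resulting chunk would then be fed to the pipeline above as a `clause_text` row:

```python
import re

def split_paragraphs(text):
    """Split a document into paragraphs on blank lines (two or more newlines)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "First clause text.\n\nSecond clause text.\n\n\nThird clause text."
print(split_paragraphs(doc))
# ['First clause text.', 'Second clause text.', 'Third clause text.']
```

Splitting on blank lines keeps each clause's full context together, which is what this classifier expects, unlike sentence-level splitting.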
## Results ```bash +-------+ | result| +-------+ |[obligations-absolute]| |[other]| |[other]| |[obligations-absolute]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_obligations_absolute_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support obligations-absolute 0.96 0.75 0.84 36 other 0.90 0.99 0.94 83 accuracy - - 0.92 119 macro-avg 0.93 0.87 0.89 119 weighted-avg 0.92 0.92 0.91 119 ``` --- layout: model title: English T5ForConditionalGeneration Base Cased model (from doc2query) author: John Snow Labs name: t5_s2orc_base_v1 date: 2023-01-30 tags: [en, open_source, t5] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `S2ORC-t5-base-v1` is an English model originally trained by `doc2query`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_s2orc_base_v1_en_4.3.0_3.0_1675098438086.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_s2orc_base_v1_en_4.3.0_3.0_1675098438086.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_s2orc_base_v1","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_s2orc_base_v1","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_s2orc_base_v1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|1.0 GB| ## References - https://huggingface.co/doc2query/S2ORC-t5-base-v1 - https://arxiv.org/abs/1904.08375 - https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf - https://arxiv.org/abs/2104.08663 - https://github.com/UKPLab/beir - https://www.sbert.net/examples/unsupervised_learning/query_generation/README.html - https://github.com/allenai/s2orc --- layout: model title: Fast Neural Machine Translation Model from English to Gilbertese author: John Snow Labs name: opus_mt_en_gil date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, gil, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial ones. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. 
- source languages: `en` - target languages: `gil` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_gil_xx_2.7.0_2.4_1609165162139.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_gil_xx_2.7.0_2.4_1609165162139.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_gil", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_gil", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.gil').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_gil| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Erisa Clause Binary Classifier author: John Snow Labs name: legclf_erisa_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `erisa` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. 
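As an illustration of the first splitting technique listed above (paragraph splitting by multiline), here is a minimal, hypothetical sketch in plain Python; the regex and the minimum-length threshold are assumptions for illustration, not part of the workshop tutorial:

```python
import re

def split_into_paragraphs(text: str, min_chars: int = 40):
    """Split a document on blank lines and drop fragments that are
    too short to give the classifier enough context."""
    paragraphs = re.split(r"\n\s*\n", text)
    return [p.strip() for p in paragraphs if len(p.strip()) >= min_chars]

doc = ("SECTION 1. ERISA.\n\n"
       "The Company maintains no employee benefit plan subject to ERISA "
       "that could give rise to liability.\n\nOK")
print(split_into_paragraphs(doc))
```

Each surviving paragraph can then be fed to the classifier as one row of the `clause_text` column.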
## Predicted Entities `other`, `erisa` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_erisa_clause_en_1.0.0_3.2_1660123496220.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_erisa_clause_en_1.0.0_3.2_1660123496220.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_erisa_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
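Since the sentence embeddings behind this classifier accept at most 512 tokens, one rough, hypothetical pre-processing step is to truncate each clause before building the DataFrame. Whitespace splitting is only an approximation of the model's own sub-word tokenizer, so treat the count as an upper-bound heuristic:

```python
def truncate_tokens(text: str, max_tokens: int = 512) -> str:
    """Approximate truncation by whitespace tokens; the real BERT
    tokenizer may produce more sub-word tokens than this count."""
    tokens = text.split()
    return " ".join(tokens[:max_tokens])

clause = "word " * 600
print(len(truncate_tokens(clause).split()))  # 512
```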
## Results ```bash +-------+ | result| +-------+ |[erisa]| |[other]| |[other]| |[erisa]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_erisa_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support erisa 0.95 1.00 0.98 21 other 1.00 0.98 0.99 64 accuracy - - 0.99 85 macro-avg 0.98 0.99 0.98 85 weighted-avg 0.99 0.99 0.99 85 ``` --- layout: model title: Legal Real Property Clause Binary Classifier author: John Snow Labs name: legclf_real_property_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `real-property` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `real-property` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_real_property_clause_en_1.0.0_3.2_1660123886366.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_real_property_clause_en_1.0.0_3.2_1660123886366.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_real_property_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +---------------+ |         result| +---------------+ |[real-property]| |        [other]| |        [other]| |[real-property]| +---------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_real_property_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.95 0.97 0.96 60 real-property 0.92 0.88 0.90 25 accuracy - - 0.94 85 macro-avg 0.93 0.92 0.93 85 weighted-avg 0.94 0.94 0.94 85 ``` --- layout: model title: Fast Neural Machine Translation Model from Vietnamese to English author: John Snow Labs name: opus_mt_vi_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, vi, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. 
- source languages: `vi` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_vi_en_xx_2.7.0_2.4_1609170656837.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_vi_en_xx_2.7.0_2.4_1609170656837.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_vi_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_vi_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.vi.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_vi_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from Indonesian to English author: John Snow Labs name: opus_mt_id_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, id, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `id` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_id_en_xx_2.7.0_2.4_1609166611872.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_id_en_xx_2.7.0_2.4_1609166611872.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_id_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_id_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.id.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_id_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from msms) author: John Snow Labs name: roberta_qa_msms_base_squad2_finetuned_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-finetuned-squad` is an English model originally trained by `msms`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_msms_base_squad2_finetuned_squad_en_4.3.0_3.0_1674219434301.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_msms_base_squad2_finetuned_squad_en_4.3.0_3.0_1674219434301.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_msms_base_squad2_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_msms_base_squad2_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_msms_base_squad2_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/msms/roberta-base-squad2-finetuned-squad --- layout: model title: English BertForTokenClassification Cased model (from SkolkovoInstitute) author: John Snow Labs name: bert_token_classifier_lewip_informal_tagger date: 2022-11-30 tags: [en, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `LEWIP-informal-tagger` is an English model originally trained by `SkolkovoInstitute`. ## Predicted Entities `equal`, `replace`, `delete`, `insert` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_lewip_informal_tagger_en_4.2.4_3.0_1669813857750.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_lewip_informal_tagger_en_4.2.4_3.0_1669813857750.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_lewip_informal_tagger","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_lewip_informal_tagger","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_lewip_informal_tagger| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/SkolkovoInstitute/LEWIP-informal-tagger --- layout: model title: English asr_wav2vec2_large_xlsr_hindi_demo_colab_by_rafiulrumy TFWav2Vec2ForCTC from rafiulrumy author: John Snow Labs name: asr_wav2vec2_large_xlsr_hindi_demo_colab_by_rafiulrumy date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_hindi_demo_colab_by_rafiulrumy` is an English model originally trained by rafiulrumy. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_hindi_demo_colab_by_rafiulrumy_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_hindi_demo_colab_by_rafiulrumy_en_4.2.0_3.0_1664102238417.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_hindi_demo_colab_by_rafiulrumy_en_4.2.0_3.0_1664102238417.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_hindi_demo_colab_by_rafiulrumy", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_hindi_demo_colab_by_rafiulrumy", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_hindi_demo_colab_by_rafiulrumy| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_news_pretrain_roberta_FT_newsqa date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `news_pretrain_roberta_FT_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_news_pretrain_roberta_FT_newsqa_en_4.0.0_3.0_1655729227412.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_news_pretrain_roberta_FT_newsqa_en_4.0.0_3.0_1655729227412.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_news_pretrain_roberta_FT_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_news_pretrain_roberta_FT_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.roberta.qa_ft.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_news_pretrain_roberta_FT_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|466.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/news_pretrain_roberta_FT_newsqa --- layout: model title: Self Report Age Classifier (BioBERT - Reddit) author: John Snow Labs name: bert_sequence_classifier_exact_age_reddit date: 2022-07-26 tags: [licensed, clinical, en, classifier, sequence_classification, age, public_health] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [BioBERT-based](https://github.com/dmis-lab/biobert) classifier that can identify whether social media forum (Reddit) posts contain a self-reported exact age. ## Predicted Entities `self_report_age`, `no_report` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_AGE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4SC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_exact_age_reddit_en_4.0.0_3.0_1658852929276.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_exact_age_reddit_en_4.0.0_3.0_1658852929276.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_exact_age_reddit", "en", "clinical/models")\ .setInputCols(["document",'token'])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame(["Is it bad for a 19 year old it's been getting worser.", "I was about 10. So not quite as young as you but young."], StringType()).toDF("text") result = pipeline.fit(data).transform(data) result.select("text", "class.result").show(truncate=False) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_exact_age_reddit", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq(Array("Is it bad for a 19 year old it's been getting worser.", "I was about 10. So not quite as young as you but young.")).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.exact_age").predict("""I was about 10. So not quite as young as you but young.""") ```
## Results ```bash +-------------------------------------------------------+-----------------+ |text |result | +-------------------------------------------------------+-----------------+ |Is it bad for a 19 year old it's been getting worser. |[self_report_age]| |I was about 10. So not quite as young as you but young.|[no_report] | +-------------------------------------------------------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_exact_age_reddit| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References The dataset is disease-specific and consists of posts collected via a series of keywords associated with dry eye disease. ## Benchmarking ```bash label precision recall f1-score support no_report 0.9324 0.9577 0.9449 1325 self_report_age 0.9124 0.8637 0.8874 675 accuracy - - 0.9260 2000 macro-avg 0.9224 0.9107 0.9161 2000 weighted-avg 0.9256 0.9260 0.9255 2000 ``` --- layout: model title: Extract Demographic Entities from Oncology Texts author: John Snow Labs name: ner_oncology_demographics date: 2022-10-25 tags: [licensed, clinical, oncology, en, ner, demographics] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Healthcare 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts demographic information from oncology texts, including age, gender and smoking status. Definitions of Predicted Entities: - `Age`: All mentions of ages, past or present, related to the patient or to anybody else. - `Gender`: Gender-specific nouns and pronouns (including words such as "him" or "she", and family members such as "father"). 
- `Race_Ethnicity`: The race and ethnicity categories include racial and national origin or sociocultural groups. - `Smoking_Status`: All mentions of smoking related to the patient or to someone else. ## Predicted Entities `Age`, `Gender`, `Race_Ethnicity`, `Smoking_Status` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_demographics_en_4.0.0_3.0_1666720851983.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_demographics_en_4.0.0_3.0_1666720851983.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_demographics", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The patient is a 40-year-old man with history of heavy smoking."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_demographics", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, 
sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The patient is a 40-year-old man with history of heavy smoking.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_demographics").predict("""The patient is a 40-year-old man with history of heavy smoking.""") ```
## Results ```bash | chunk | ner_label | |:------------|:---------------| | 40-year-old | Age | | man | Gender | | smoking | Smoking_Status | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_demographics| |Compatibility:|Spark NLP for Healthcare 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.6 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Smoking_Status 60 19 8 68 0.76 0.88 0.82 Age 934 33 15 949 0.97 0.98 0.97 Race_Ethnicity 57 5 5 62 0.92 0.92 0.92 Gender 1248 18 6 1254 0.99 1.00 0.99 macro_avg 2299 75 34 2333 0.91 0.95 0.93 micro_avg 2299 75 34 2333 0.97 0.99 0.98 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_128_finetuned_squad_seed_2 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-128-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_128_finetuned_squad_seed_2_en_4.3.0_3.0_1674213755024.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_128_finetuned_squad_seed_2_en_4.3.0_3.0_1674213755024.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_128_finetuned_squad_seed_2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_128_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_128_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|423.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-128-finetuned-squad-seed-2 --- layout: model title: French asr_exp_w2v2r_vp_100k_accent_france_10_belgium_0_s271 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_vp_100k_accent_france_10_belgium_0_s271 date: 2022-09-26 tags: [wav2vec2, fr, audio, open_source, asr] task: Automatic Speech Recognition language: fr edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_vp_100k_accent_france_10_belgium_0_s271` is a French model originally trained by jonatasgrosman. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_exp_w2v2r_vp_100k_accent_france_10_belgium_0_s271_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_france_10_belgium_0_s271_fr_4.2.0_3.0_1664202398743.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_france_10_belgium_0_s271_fr_4.2.0_3.0_1664202398743.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_vp_100k_accent_france_10_belgium_0_s271", "fr")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_vp_100k_accent_france_10_belgium_0_s271", "fr") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_vp_100k_accent_france_10_belgium_0_s271| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|fr| |Size:|1.2 GB| --- layout: model title: English RobertaForQuestionAnswering (from sumba) author: John Snow Labs name: roberta_qa_sumba_roberta_base_squad2_finetuned_squad date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-finetuned-squad` is an English model originally trained by `sumba`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_sumba_roberta_base_squad2_finetuned_squad_en_4.0.0_3.0_1655735541055.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_sumba_roberta_base_squad2_finetuned_squad_en_4.0.0_3.0_1655735541055.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_sumba_roberta_base_squad2_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_sumba_roberta_base_squad2_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base.by_sumba").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_sumba_roberta_base_squad2_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/sumba/roberta-base-squad2-finetuned-squad --- layout: model title: English RobertaForQuestionAnswering Large Cased model (from tli8hf) author: John Snow Labs name: roberta_qa_unqover_large_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `unqover-roberta-large-squad` is an English model originally trained by `tli8hf`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_unqover_large_squad_en_4.3.0_3.0_1674224834331.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_unqover_large_squad_en_4.3.0_3.0_1674224834331.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_unqover_large_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_unqover_large_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_unqover_large_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/tli8hf/unqover-roberta-large-squad --- layout: model title: English T5ForConditionalGeneration Mini Cased model (from google) author: John Snow Labs name: t5_efficient_mini_nl6 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-mini-nl6` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_mini_nl6_en_4.3.0_3.0_1675118348456.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_mini_nl6_en_4.3.0_3.0_1675118348456.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_mini_nl6","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_mini_nl6","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_mini_nl6| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|102.0 MB| ## References - https://huggingface.co/google/t5-efficient-mini-nl6 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English BertForQuestionAnswering Cased model (from motiondew) author: John Snow Labs name: bert_qa_sd3 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-sd3` is an English model originally trained by `motiondew`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_sd3_en_4.0.0_3.0_1657188222101.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_sd3_en_4.0.0_3.0_1657188222101.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sd3","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sd3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_sd3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/motiondew/bert-sd3 --- layout: model title: Spanish Part of Speech Tagger (from mrm8488) author: John Snow Labs name: bert_pos_bert_spanish_cased_finetuned_pos date: 2022-05-09 tags: [bert, pos, part_of_speech, es, open_source] task: Part of Speech Tagging language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-spanish-cased-finetuned-pos` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_spanish_cased_finetuned_pos_es_3.4.2_3.0_1652091544165.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_spanish_cased_finetuned_pos_es_3.4.2_3.0_1652091544165.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_spanish_cased_finetuned_pos","es") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_spanish_cased_finetuned_pos","es") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_spanish_cased_finetuned_pos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|es| |Size:|410.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/mrm8488/bert-spanish-cased-finetuned-pos - https://www.kaggle.com/nltkdata/conll-corpora - https://github.com/dccuchile/beto - https://twitter.com/mrm8488 --- layout: model title: English BertForQuestionAnswering model (from AnonymousSub) author: John Snow Labs name: bert_qa_fpdm_triplet_bert_FT_new_newsqa date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_triplet_bert_FT_new_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_fpdm_triplet_bert_FT_new_newsqa_en_4.0.0_3.0_1654187889882.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_fpdm_triplet_bert_FT_new_newsqa_en_4.0.0_3.0_1654187889882.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_fpdm_triplet_bert_FT_new_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_fpdm_triplet_bert_FT_new_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.bert.qa_fpdm_triplet_ft_new.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_fpdm_triplet_bert_FT_new_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/fpdm_triplet_bert_FT_new_newsqa --- layout: model title: Bert for Sequence Classification (Clinical Question vs Statement) author: John Snow Labs name: bert_sequence_classifier_question_statement_clinical date: 2021-11-05 tags: [question, statement, clinical, en, licensed] task: Text Classification language: en nav_key: models edition: Healthcare NLP 3.3.2 spark_version: 3.0 supported: true annotator: BertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Trained to add sentence classifying capabilities to distinguish between Question vs Statements in clinical domain. This model was imported from Hugging Face (https://huggingface.co/shahrukhx01/question-vs-statement-classifier), trained based on Haystack (https://github.com/deepset-ai/haystack/issues/611) and finetuned by John Snow Labs with in-house clinical annotations. ## Predicted Entities `question`, `statement` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_question_statement_clinical_en_3.3.2_3.0_1636106577489.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_question_statement_clinical_en_3.3.2_3.0_1636106577489.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") seq = nlp.BertForSequenceClassification.pretrained('bert_sequence_classifier_question_statement_clinical', 'en', 'clinical/models')\ .setInputCols(["token", "sentence"])\ .setOutputCol("label")\ .setCaseSensitive(True) pipeline = Pipeline(stages = [ documentAssembler, sentenceDetector, tokenizer, seq]) test_sentences = [["""Hello I am going to be having a baby throughand have just received my medical results before I have my tubes tested. I had the tests on day 23 of my cycle. My progresterone level is 10. What does this mean? What does progesterone level of 10 indicate? Your progesterone report is perfectly normal. We expect this result on day 23rd of the cycle.So there's nothing to worry as it's perfectly alright"""]] data = spark.createDataFrame(test_sentences).toDF("text") res = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val seq = BertForSequenceClassification.pretrained("bert_sequence_classifier_question_statement_clinical", "en", "clinical/models") .setInputCols(Array("token", "sentence")) .setOutputCol("label") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, seq)) val test_sentences = """Hello I am going to be having a baby throughand have just received my medical results before I have my tubes tested. I had the tests on day 23 of my cycle. 
My progresterone level is 10. What does this mean? What does progesterone level of 10 indicate? Your progesterone report is perfectly normal. We expect this result on day 23rd of the cycle.So there's nothing to worry as it's perfectly alright""" val example = Seq(test_sentences).toDS.toDF("text") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.bert_sequence.question_statement_clinical").predict("""Hello I am going to be having a baby throughand have just received my medical results before I have my tubes tested. I had the tests on day 23 of my cycle. My progresterone level is 10. What does this mean? What does progesterone level of 10 indicate? Your progesterone report is perfectly normal. We expect this result on day 23rd of the cycle.So there's nothing to worry as it's perfectly alright""") ```
## Results ```bash +--------------------------------------------------------------------------------------------------------------------+---------+ |sentence |label | +--------------------------------------------------------------------------------------------------------------------+---------+ |Hello I am going to be having a baby throughand have just received my medical results before I have my tubes tested.|statement| |I had the tests on day 23 of my cycle. |statement| |My progresterone level is 10. |statement| |What does this mean? |question | |What does progesterone level of 10 indicate? |question | |Your progesterone report is perfectly normal. We expect this result on day 23rd of the cycle. |statement| |So there's nothing to worry as it's perfectly alright |statement| +--------------------------------------------------------------------------------------------------------------------+--------- ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_question_statement_clinical| |Compatibility:|Healthcare NLP 3.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[label]| |Language:|en| |Case sensitive:|true| ## Data Source For generic domain training: https://github.com/deepset-ai/haystack/issues/611 For finetuning in clinical domain, in house JSL annotations based on clinical Q&A. 
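The Benchmarking section below reports per-label precision, recall, and F1 on the clinical question/statement test set. For reference, the F1 column is the harmonic mean of the precision and recall columns; a minimal, framework-agnostic sketch (the `f1_score` helper here is illustrative, not a Spark NLP API):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A label scored with precision 0.50 and recall 1.00:
print(round(f1_score(0.5, 1.0), 2))  # → 0.67
```

Because it is a harmonic mean, F1 penalizes a gap between precision and recall more strongly than an arithmetic mean would, which is why it is the usual single-number summary in per-label benchmarks like the one below.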
## Benchmarking ```bash label precision recall f1-score support question 0.97 0.94 0.96 243 statement 0.98 0.99 0.99 729 accuracy - - 0.98 972 macro-avg 0.98 0.97 0.97 972 weighted-avg 0.98 0.98 0.98 972 ``` --- layout: model title: Fast Neural Machine Translation Model from French to English author: John Snow Labs name: opus_mt_fr_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, fr, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `fr` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_fr_en_xx_2.7.0_2.4_1609166458139.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_fr_en_xx_2.7.0_2.4_1609166458139.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_fr_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["PUT YOUR FRENCH TEXT HERE"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_fr_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("PUT YOUR FRENCH TEXT HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.fr.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_fr_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Rent Clause Binary Classifier author: John Snow Labs name: legclf_rent_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `rent` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, yielding as output a series of True/False values for each of the legal clause models you have added.
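Outside Spark NLP, the first technique above, paragraph splitting by multiline, amounts to cutting the document on blank lines before feeding each chunk to the classifier. A minimal sketch (the function name and sample text are illustrative, not part of any library):

```python
import re

def split_paragraphs(text):
    # Cut on one or more blank lines and drop empty fragments, so each
    # chunk sent to the clause classifier is a whole paragraph.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Clause 1. The Tenant shall pay rent monthly.\n\nClause 2. This Lease is governed by state law."
paragraphs = split_paragraphs(doc)
```

Each element of `paragraphs` can then be loaded into the `clause_text` column expected by the pipeline.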
## Predicted Entities `other`, `rent` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_rent_clause_en_1.0.0_3.2_1660122938773.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_rent_clause_en_1.0.0_3.2_1660122938773.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_rent_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[rent]| |[other]| |[other]| |[rent]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_rent_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.97 0.96 0.96 95 rent 0.88 0.91 0.90 33 accuracy - - 0.95 128 macro-avg 0.93 0.93 0.93 128 weighted-avg 0.95 0.95 0.95 128 ``` --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_MoHai TFWav2Vec2ForCTC from MoHai author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab_by_MoHai date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_MoHai` is an English model originally trained by MoHai. NOTE: This model only works on a CPU. If you need to use it on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_by_MoHai_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_MoHai_en_4.2.0_3.0_1664118941814.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_MoHai_en_4.2.0_3.0_1664118941814.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_by_MoHai", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_by_MoHai", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab_by_MoHai| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|354.9 MB| --- layout: model title: GPT2 text-to-text model (Base) author: John Snow Labs name: gpt2 date: 2021-12-03 tags: [en, gpt2, open_source] task: Text Generation language: en nav_key: models edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true recommended: true annotator: GPT2Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where the model is primed with an input and it generates a lengthy continuation. This is the Base model. Other models (medium, large) are also available in [Models Hub](https://nlp.johnsnowlabs.com/models?task=Text+Generation). ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/gpt2_en_3.4.0_3.0_1638510926608.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/gpt2_en_3.4.0_3.0_1638510926608.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") gpt2 = GPT2Transformer.pretrained("gpt2") \ .setInputCols(["documents"]) \ .setMaxOutputLength(50) \ .setOutputCol("generation") pipeline = Pipeline().setStages([documentAssembler, gpt2]) data = spark.createDataFrame([["My name is Leonardo."]]).toDF("text") result = pipeline.fit(data).transform(data) result.select("generation.result").show(truncate=False) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("documents") val gpt2 = GPT2Transformer.pretrained("gpt2") .setInputCols(Array("documents")) .setMinOutputLength(10) .setMaxOutputLength(50) .setDoSample(false) .setTopK(50) .setNoRepeatNgramSize(3) .setOutputCol("generation") val pipeline = new Pipeline().setStages(Array(documentAssembler, gpt2)) val data = Seq("My name is Leonardo.").toDF("text") val result = pipeline.fit(data).transform(data) result.select("generation.result").show(truncate = false) ``` {:.nlu-block} ```python import nlu nlu.load("en.gpt2").predict("""My name is Leonardo.""")  ```
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |result | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |[ My name is Leonardo. I am a man of letters. I have been a man for many years. I was born in the year 1776. I came to the | | United States in 1776, and I have lived in the United Kingdom since 1776] | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|gpt2| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[generation]| |Language:|en| ## Data Source OpenAI WebText - a corpus created by scraping web pages with emphasis on document quality. It consists of over 8 million documents for a total of 40 GB of text. All Wikipedia documents were removed from WebText. --- layout: model title: Clinical Longformer author: John Snow Labs name: clinical_longformer date: 2022-02-08 tags: [longformer, clinical, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: LongformerEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This embeddings model was imported from `Hugging Face`([link](https://huggingface.co/yikuan8/Clinical-Longformer)). Clinical-Longformer is a clinical knowledge enriched version of `Longformer` that was further pretrained using MIMIC-III clinical notes. It allows up to 4,096 tokens as the model input. 
`Clinical-Longformer` consistently outperforms `ClinicalBERT` by at least 2 percent across 10 baseline datasets. These downstream experiments broadly cover named entity recognition (NER), question answering (QA), natural language inference (NLI) and text classification tasks. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/clinical_longformer_en_3.4.0_3.0_1644309598171.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/clinical_longformer_en_3.4.0_3.0_1644309598171.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = LongformerEmbeddings.pretrained("clinical_longformer", "en")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings")\ .setCaseSensitive(True)\ .setMaxSentenceLength(4096) ``` ```scala val embeddings = LongformerEmbeddings.pretrained("clinical_longformer", "en") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) .setMaxSentenceLength(4096) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.longformer.clinical").predict("""Put your text here.""") ```
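Because the model accepts at most 4,096 tokens, it can help to flag overlong notes before embedding. The sketch below uses a rough whitespace word count as a proxy; this is an assumption, since the model's own tokenizer produces subword tokens and will generally count more:

```python
MAX_TOKENS = 4096  # Clinical-Longformer input limit

def fits_token_budget(text, max_tokens=MAX_TOKENS):
    # Whitespace word count as a cheap stand-in for the real token count.
    # Subword tokenization can only increase the count, so a text that
    # fails this check certainly exceeds the model's limit.
    return len(text.split()) <= max_tokens
```

Notes that fail the check are candidates for splitting into smaller pieces before running the pipeline.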
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_longformer| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|534.9 MB| |Case sensitive:|true| |Max sentence length:|4096| ## References [https://arxiv.org/pdf/2201.11838.pdf](https://arxiv.org/pdf/2201.11838.pdf) --- layout: model title: Legal Integration Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_integration_bert date: 2023-03-05 tags: [en, legal, classification, clauses, integration, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Integration` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, yielding as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Integration`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_integration_bert_en_1.0.0_3.0_1678050549115.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_integration_bert_en_1.0.0_3.0_1678050549115.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_integration_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
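As the description notes, several of these binary clause classifiers can be stacked over the same provision, yielding one True/False per clause type. Once each model's `category` prediction is collected, reducing the labels to flags is straightforward; the helper below is an illustrative sketch, not a library API (any label other than the rejection class, `Other`/`other`, counts as a hit):

```python
def clause_flags(predictions):
    # predictions: {model_name: predicted_label} gathered from several
    # binary clause classifiers run over the same provision.
    return {name: label.lower() != "other" for name, label in predictions.items()}

flags = clause_flags({
    "legclf_integration_bert": "Integration",
    "legclf_rent_clause": "other",
})
```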
## Results ```bash +-------+ |result| +-------+ |[Integration]| |[Other]| |[Other]| |[Integration]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_integration_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Integration 0.86 0.89 0.88 36 Other 0.92 0.91 0.91 53 accuracy - - 0.90 89 macro-avg 0.89 0.90 0.90 89 weighted-avg 0.90 0.90 0.90 89 ``` --- layout: model title: Multilingual BertForQuestionAnswering Cased model (from krinal214) author: John Snow Labs name: bert_qa_3lang date: 2022-07-07 tags: [xx, open_source, bert, question_answering] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-3lang` is a Multilingual model originally trained by `krinal214`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_3lang_xx_4.0.0_3.0_1657182629539.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_3lang_xx_4.0.0_3.0_1657182629539.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_3lang","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_3lang","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_3lang| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|665.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/krinal214/bert-3lang --- layout: model title: Word2Vec Embeddings in Neapolitan (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, nap, open_source] task: Embeddings language: nap edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_nap_3.4.1_3.0_1647447926085.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_nap_3.4.1_3.0_1647447926085.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","nap") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","nap") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("nap.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|nap| |Size:|74.4 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Word2Vec Embeddings in Bishnupriya (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, bpy, open_source] task: Embeddings language: bpy edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_bpy_3.4.1_3.0_1647286979430.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_bpy_3.4.1_3.0_1647286979430.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","bpy") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","bpy") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("bpy.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|bpy| |Size:|80.1 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from V3RX2000) author: John Snow Labs name: distilbert_qa_v3rx2000_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is a English model originally trained by `V3RX2000`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_v3rx2000_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769492752.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_v3rx2000_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769492752.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_v3rx2000_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_v3rx2000_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_v3rx2000_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/V3RX2000/distilbert-base-uncased-finetuned-squad --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from venkateshdas) author: John Snow Labs name: roberta_qa_base_squad2_ta_qna_3e date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-ta-qna-roberta3e` is a English model originally trained by `venkateshdas`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_ta_qna_3e_en_4.3.0_3.0_1674219788427.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_ta_qna_3e_en_4.3.0_3.0_1674219788427.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2_ta_qna_3e","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2_ta_qna_3e","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_squad2_ta_qna_3e| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/venkateshdas/roberta-base-squad2-ta-qna-roberta3e --- layout: model title: Snomed to UMLS Code Mapping author: John Snow Labs name: snomed_umls_mapping date: 2021-07-01 tags: [snomed, umls, en, licensed, pipeline] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps SNOMED codes to UMLS codes without using any text data. You’ll just feed whitespace-delimited SNOMED codes and it will return the corresponding UMLS codes as a list. If there is no mapping, the original code is returned unchanged. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_CODE_MAPPING/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/snomed_umls_mapping_en_3.1.0_2.4_1625126296775.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/snomed_umls_mapping_en_3.1.0_2.4_1625126296775.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('snomed_umls_mapping', 'en', 'clinical/models') pipeline.annotate('733187009 449433008 51264003') ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("snomed_umls_mapping", "en", "clinical/models") val result = pipeline.annotate("733187009 449433008 51264003") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.snomed.umls").predict("""733187009 449433008 51264003""") ```
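The pipeline's contract (whitespace-delimited SNOMED codes in, UMLS codes out, unmapped codes passed through unchanged) behaves like a plain dictionary lookup. A toy sketch, using the three code pairs shown in the Results section; the dictionary and function names are illustrative, not part of the library:

```python
# Tiny excerpt of the SNOMED -> UMLS table, for illustration only.
SNOMED_TO_UMLS = {
    "733187009": "C4546029",
    "449433008": "C3164619",
    "51264003": "C0271267",
}

def map_codes(codes):
    # Unknown codes fall back to themselves, mirroring the pipeline's
    # "if there is no mapping, the original code is returned" behavior.
    return [SNOMED_TO_UMLS.get(c, c) for c in codes.split()]

map_codes("733187009 449433008 51264003")
```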
## Results ```bash {'snomed': ['733187009', '449433008', '51264003'], 'umls': ['C4546029', 'C3164619', 'C0271267']} Note: |SNOMED | Details | | ---------- | ----------------------------------------------------------:| | 733187009 | osteolysis following surgical procedure on skeletal system | | 449433008 | Diffuse stenosis of left pulmonary artery | | 51264003 | Limbal AND/OR corneal involvement in vernal conjunctivitis | | UMLS | Details | | ---------- | ----------------------------------------------------------:| | C4546029 | osteolysis following surgical procedure on skeletal system | | C3164619 | diffuse stenosis of left pulmonary artery | | C0271267 | limbal and/or corneal involvement in vernal conjunctivitis | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|snomed_umls_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - TokenizerModel - LemmatizerModel - Finisher --- layout: model title: Norwegian BertForMaskedLM Cased model (from ltgoslo) author: John Snow Labs name: bert_embeddings_norbert2 date: 2022-12-06 tags: ["no", open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: "no" edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `norbert2` is a Norwegian model originally trained by `ltgoslo`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_norbert2_no_4.2.4_3.0_1670327036179.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_norbert2_no_4.2.4_3.0_1670327036179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_norbert2","no") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_norbert2","no")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_norbert2|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|no|
|Size:|467.9 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/ltgoslo/norbert2
- http://vectors.nlpl.eu/repository/20/221.zip
- http://norlm.nlpl.eu/
- https://github.com/ltgoslo/NorBERT
- https://aclanthology.org/2021.nodalida-main.4/
- https://www.eosc-nordic.eu/
- https://www.mn.uio.no/ifi/english/research/groups/ltg/

---
layout: model
title: Detect PHI for Generic Deidentification in Romanian
author: John Snow Labs
name: ner_deid_generic
date: 2022-07-08
tags: [deidentification, word2vec, phi, generic, ner, ro, licensed]
task: Named Entity Recognition
language: ro
edition: Healthcare NLP 3.5.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Named Entity Recognition annotators allow a generic model to be trained using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.

Deidentification NER (Romanian) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It is trained with word2vec embeddings and can detect 7 generic entities.

This NER model is trained on a combination of custom datasets with several data augmentation mechanisms.
## Predicted Entities `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_ro_3.5.0_3.0_1657267952051.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_ro_3.5.0_3.0_1657267952051.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ro")\ .setInputCols(["sentence","token"])\ .setOutputCol("word_embeddings") clinical_ner = MedicalNerModel.pretrained("ner_deid_generic", "ro", "clinical/models")\ .setInputCols(["sentence","token","word_embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter]) text = """ Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui,737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""" data = spark.createDataFrame([[text]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ro") .setInputCols(Array("sentence","token")) .setOutputCol("word_embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic", "ro", "clinical/models") .setInputCols(Array("sentence","token","word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( documentAssembler, 
sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter)) val text = """Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""" val data = Seq(text).toDS.toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ro.med_ner.deid_generic").predict(""" Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui,737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""") ```
## Results

```bash
+----------------------------+---------+
|chunk                       |ner_label|
+----------------------------+---------+
|Spitalul Pentru Ochi de Deal|LOCATION |
|Drumul Oprea Nr.            |LOCATION |
|Vaslui                      |LOCATION |
|737405 România              |LOCATION |
|+40(235)413773              |CONTACT  |
|25 May 2022                 |DATE     |
|BUREAN MARIA                |NAME     |
|77                          |AGE      |
|Agota Evelyn Tımar          |NAME     |
|2450502264401               |ID       |
+----------------------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_deid_generic|
|Compatibility:|Healthcare NLP 3.5.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|ro|
|Size:|15.0 MB|

## References

- Custom John Snow Labs datasets
- Data augmentation techniques

## Benchmarking

```bash
label precision recall f1-score support
AGE 0.96 0.98 0.97 1235
CONTACT 0.98 0.99 0.99 379
DATE 0.90 0.85 0.87 5006
ID 1.00 1.00 1.00 694
LOCATION 0.89 0.90 0.90 1746
NAME 0.95 0.98 0.96 3018
PROFESSION 0.78 0.72 0.75 173
micro-avg 0.93 0.91 0.92 12251
macro-avg 0.92 0.92 0.92 12251
weighted-avg 0.92 0.91 0.92 12251
```

---
layout: model
title: English DistilBertForTokenClassification Cased model (from whispAI)
author: John Snow Labs
name: distilbert_token_classifier_directquote_sentlevel_distilbert
date: 2023-03-14
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `DirectQuote-SentLevel-DistilBERT` is an English model originally trained by `whispAI`.
## Predicted Entities `Out`, `RightSpeaker`, `LeftSpeaker`, `Unknown`, `Speaker` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_directquote_sentlevel_distilbert_en_4.3.1_3.0_1678783180815.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_directquote_sentlevel_distilbert_en_4.3.1_3.0_1678783180815.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_directquote_sentlevel_distilbert","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_directquote_sentlevel_distilbert","en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_directquote_sentlevel_distilbert|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|244.1 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/whispAI/DirectQuote-SentLevel-DistilBERT
- https://arxiv.org/abs/2110.07827
- https://www.theguardian.com/info/2021/nov/25/talking-sense-using-machine-learning-to-understand-quotes
- https://stanfordnlp.github.io/CoreNLP/quote.html
- https://textacy.readthedocs.io/en/latest/api_reference/extract.html#textacy.extract.triples.direct_quotations

---
layout: model
title: Legal Defence Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_defence_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, defence, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.
The legclf_defence_bert model is a BERT Sentence Embeddings document classifier that determines whether a given document belongs to the class Defence or not (binary classification) according to EuroVoc labels.

## Predicted Entities

`Defence`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_defence_bert_en_1.0.0_3.0_1678111563924.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_defence_bert_en_1.0.0_3.0_1678111563924.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_defence_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+---------+
|result   |
+---------+
|[Defence]|
|[Other]  |
|[Other]  |
|[Defence]|
+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_defence_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.6 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
label precision recall f1-score support
Defence 0.97 0.82 0.89 34
Other 0.85 0.97 0.91 36
accuracy - - 0.90 70
macro-avg 0.91 0.90 0.90 70
weighted-avg 0.91 0.90 0.90 70
```

---
layout: model
title: Translate English to Kaonde Pipeline
author: John Snow Labs
name: translate_en_kqn
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, kqn, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `kqn` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_kqn_xx_2.7.0_2.4_1609688641010.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_kqn_xx_2.7.0_2.4_1609688641010.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_kqn", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_kqn", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.kqn').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_en_kqn|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Legal Expenses Clause Binary Classifier
author: John Snow Labs
name: legclf_expenses_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `expenses` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
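The paragraph-splitting option mentioned above (splitting by multiline breaks) can be sketched in plain Python before handing each chunk to the classifier. `split_paragraphs` is a hypothetical helper shown only for illustration; it is not a Spark NLP API:

```python
import re

def split_paragraphs(text: str) -> list:
    # Split wherever one or more blank lines separate paragraphs
    chunks = re.split(r"\n\s*\n", text)
    # Drop empty chunks and surrounding whitespace
    return [c.strip() for c in chunks if c.strip()]

contract = "1. Expenses. Each party shall bear its own costs.\n\n2. Term. This agreement remains in force for one year."
for clause in split_paragraphs(contract):
    print(clause)
```

Each resulting chunk can then be fed to the pipeline as a separate row of the `clause_text` column, keeping every input under the 512-token embedding limit.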
## Predicted Entities `other`, `expenses` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_expenses_clause_en_1.0.0_3.2_1660123533827.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_expenses_clause_en_1.0.0_3.2_1660123533827.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_expenses_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+----------+
|result    |
+----------+
|[expenses]|
|[other]   |
|[other]   |
|[expenses]|
+----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_expenses_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.1 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
label precision recall f1-score support
expenses 0.97 0.97 0.97 61
other 0.98 0.98 0.98 108
accuracy - - 0.98 169
macro-avg 0.97 0.97 0.97 169
weighted-avg 0.98 0.98 0.98 169
```

---
layout: model
title: Legal Authorization Clause Binary Classifier
author: John Snow Labs
name: legclf_authorization_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `authorization` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `authorization`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_authorization_clause_en_1.0.0_3.2_1660123246689.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_authorization_clause_en_1.0.0_3.2_1660123246689.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_authorization_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+---------------+
|result         |
+---------------+
|[authorization]|
|[other]        |
|[other]        |
|[authorization]|
+---------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_authorization_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
label precision recall f1-score support
authorization 1.00 0.83 0.91 35
other 0.90 1.00 0.95 56
accuracy - - 0.93 91
macro-avg 0.95 0.91 0.93 91
weighted-avg 0.94 0.93 0.93 91
```

---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from Zamachi)
author: John Snow Labs
name: distilbert_qa_for_question_answering
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distillbert-for-question-answering` is an English model originally trained by `Zamachi`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_for_question_answering_en_4.3.0_3.0_1672774880071.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_for_question_answering_en_4.3.0_3.0_1672774880071.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_for_question_answering","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_for_question_answering","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_for_question_answering|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|249.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/Zamachi/distillbert-for-question-answering

---
layout: model
title: Legal Financial information Clause Binary Classifier
author: John Snow Labs
name: legclf_financial_information_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `financial-information` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `financial-information`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_financial_information_clause_en_1.0.0_3.2_1660122441519.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_financial_information_clause_en_1.0.0_3.2_1660122441519.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_financial_information_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-----------------------+
|result                 |
+-----------------------+
|[financial-information]|
|[other]                |
|[other]                |
|[financial-information]|
+-----------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_financial_information_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.2 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
label precision recall f1-score support
financial-information 0.95 0.78 0.85 68
other 0.92 0.98 0.95 167
accuracy - - 0.92 235
macro-avg 0.93 0.88 0.90 235
weighted-avg 0.92 0.92 0.92 235
```

---
layout: model
title: SDOH Economics Status For Binary Classification
author: John Snow Labs
name: genericclassifier_sdoh_economics_binary_sbiobert_cased_mli
date: 2023-01-14
tags: [en, licensed, generic_classifier, sdoh, economics, clinical]
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 4.2.4
spark_version: 3.0
supported: true
recommended: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model classifies clinical documents with respect to socioeconomic status. It was trained using the GenericClassifierApproach annotator.

`True:` if the patient is currently employed or unemployed.

`False:` if there is no related passage.

## Predicted Entities

`True`, `False`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_economics_binary_sbiobert_cased_mli_en_4.2.4_3.0_1673699299086.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_economics_binary_sbiobert_cased_mli_en_4.2.4_3.0_1673699299086.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") features_asm = FeaturesAssembler()\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("features") generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_economics_binary_sbiobert_cased_mli", 'en', 'clinical/models')\ .setInputCols(["features"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, sentence_embeddings, features_asm, generic_classifier ]) text_list = ["Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. He uses alcohol and cigarettes", "The patient quit smoking approximately two years ago with an approximately a 40 pack year history, mostly cigar use. 
He also reports 'heavy alcohol use', quit 15 months ago."] df = spark.createDataFrame(text_list, StringType()).toDF("text") result = pipeline.fit(df).transform(df) result.select("text", "class.result").show(truncate=100) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence_embeddings") val features_asm = new FeaturesAssembler() .setInputCols("sentence_embeddings") .setOutputCol("features") val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_economics_binary_sbiobert_cased_mli", "en", "clinical/models") .setInputCols("features") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_embeddings, features_asm, generic_classifier)) val data = Seq("Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. He uses alcohol and cigarettes.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.generic.sdoh_ecnomics_sbiobert_cased").predict("""The patient quit smoking approximately two years ago with an approximately a 40 pack year history, mostly cigar use. He also reports 'heavy alcohol use', quit 15 months ago.""") ```
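Conceptually, the GenericClassifier stage above applies a small feed-forward layer to the assembled sentence-embedding features and picks the `True`/`False` label via a softmax. A minimal, self-contained sketch of that final decision step — the 3-dimensional "embedding" and weight matrix below are toy values for illustration, not the trained model's parameters:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, weights, labels):
    # One linear layer over the feature vector, then argmax over softmax scores.
    logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]

# Toy 3-dim "sentence embedding" and a 2x3 weight matrix (illustrative only).
label, prob = classify(
    [0.2, -0.1, 0.4],
    [[1.0, 0.0, 2.0],    # logits for "True"
     [-1.0, 0.5, -2.0]], # logits for "False"
    ["True", "False"])
```

The real model computes the features with `sbiobert_base_cased_mli` embeddings (768 dimensions) assembled by `FeaturesAssembler`, but the selection logic is the same.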
## Results ```bash +----------------------------------------------------------------------------------------------------+-------+ | text| result| +----------------------------------------------------------------------------------------------------+-------+ |Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 2...| [True]| |The patient quit smoking approximately two years ago with an approximately a 40 pack year history...|[False]| +----------------------------------------------------------------------------------------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|genericclassifier_sdoh_economics_binary_sbiobert_cased_mli| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[features]| |Output Labels:|[prediction]| |Language:|en| |Size:|3.5 MB| ## Benchmarking ```bash label precision recall f1-score support False 0.93 0.85 0.89 894 True 0.79 0.90 0.84 562 accuracy - - 0.87 1456 macro-avg 0.86 0.87 0.86 1456 weighted-avg 0.87 0.87 0.87 1456 ``` --- layout: model title: English asr_wav2vec2_large_xlsr_moroccan TFWav2Vec2ForCTC from othrif author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_moroccan date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_moroccan` is an English model originally trained by othrif.
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_moroccan_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_moroccan_en_4.2.0_3.0_1664098087454.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_moroccan_en_4.2.0_3.0_1664098087454.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_moroccan', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_moroccan", lang = "en") val annotations = pipeline.transform(audioDF) ```
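Internally, the Wav2Vec2ForCTC stage of this pipeline emits per-frame character predictions that are collapsed by CTC decoding: consecutive repeats are merged and the blank token is dropped. A toy greedy decoder sketching that collapse step — the `"-"` blank symbol and the frame sequence are illustrative, not the model's actual vocabulary:

```python
def ctc_greedy_decode(frames, blank="-"):
    # Collapse consecutive repeated symbols, then strip the CTC blank symbol.
    out = []
    prev = None
    for sym in frames:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# Per-frame argmax symbols for a short utterance.
print(ctc_greedy_decode(["h", "h", "-", "e", "l", "-", "l", "o", "o"]))  # hello
```

Note how the blank between the two `l` frames prevents them from being merged, which is how CTC can emit doubled letters.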
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_moroccan| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English asr_wav2vec2_large_xls_r_300m_hyAM_batch4 TFWav2Vec2ForCTC from lilitket author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_hyAM_batch4 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_hyAM_batch4` is an English model originally trained by lilitket. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_large_xls_r_300m_hyAM_batch4_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_hyAM_batch4_en_4.2.0_3.0_1664119373568.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_hyAM_batch4_en_4.2.0_3.0_1664119373568.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_hyAM_batch4", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_hyAM_batch4", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
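The snippets above assume an `audioDf` whose `audio_content` column holds raw audio as normalized floats. As a hedged sketch of how such floats can be produced from 16-bit PCM WAV audio with only the Python standard library (the demo file written below is synthetic, purely for illustration):

```python
import struct
import tempfile
import wave

def wav_to_floats(path):
    # Read a mono 16-bit PCM WAV file and scale samples to [-1.0, 1.0].
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expects 16-bit PCM"
        n = wf.getnframes()
        raw = wf.readframes(n)
    return [s / 32768.0 for s in struct.unpack("<%dh" % n, raw)]

# Self-contained demo: write a 4-sample mono 16-bit WAV, then read it back.
_demo = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
_demo.close()
with wave.open(_demo.name, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))
samples = wav_to_floats(_demo.name)
```

The resulting list of floats is what you would place into the DataFrame column consumed by `AudioAssembler`; real pipelines typically use a dedicated audio library instead, and Wav2vec2 models expect 16 kHz mono input.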
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_hyAM_batch4| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Multilingual BertForQuestionAnswering model (from mrm8488) author: John Snow Labs name: bert_qa_bert_multi_uncased_finetuned_xquadv1 date: 2022-06-02 tags: [en, es, de, el, ru, tr, ar, vi, th, zh, hi, open_source, question_answering, bert, xx] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-multi-uncased-finetuned-xquadv1` is a Multilingual model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_multi_uncased_finetuned_xquadv1_xx_4.0.0_3.0_1654184636643.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_multi_uncased_finetuned_xquadv1_xx_4.0.0_3.0_1654184636643.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_multi_uncased_finetuned_xquadv1","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(False) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_multi_uncased_finetuned_xquadv1","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("xx.answer_question.xquad.bert.uncased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
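Span-based QA annotators like BertForQuestionAnswering score every context token as a potential answer start and end, then return the highest-scoring valid span. A toy version of that selection step — the tokens and scores below are made up for illustration, not model outputs:

```python
def best_span(start_scores, end_scores, max_len=15):
    # Pick (i, j) maximizing start_scores[i] + end_scores[j]
    # subject to i <= j < i + max_len.
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 1.0, 0.0]
end   = [0.0, 0.1, 0.0, 4.0, 0.2, 0.0, 0.0, 0.0, 2.0, 0.1]
i, j = best_span(start, end)
answer = " ".join(tokens[i:j + 1])  # "Clara"
```

The `max_len` cap mirrors the usual constraint that extracted answers stay short; without it, a high start score early in the context could pair with a high end score far away.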
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_multi_uncased_finetuned_xquadv1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|626.1 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/bert-multi-uncased-finetuned-xquadv1 - https://colab.research.google.com/github/mrm8488/shared_colab_notebooks/blob/master/Try_mrm8488_xquad_finetuned_uncased_model.ipynb - https://github.com/google-research/bert/blob/master/multilingual.md - https://twitter.com/mrm8488 - https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl - https://github.com/fxsjy/jieba - https://github.com/deepmind/xquad --- layout: model title: Finnish asr_wav2vec2_xlsr_300m_finnish TFWav2Vec2ForCTC from aapot author: John Snow Labs name: pipeline_asr_wav2vec2_xlsr_300m_finnish date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_300m_finnish` is a Finnish model originally trained by aapot.
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_xlsr_300m_finnish_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_300m_finnish_fi_4.2.0_3.0_1664023070934.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_300m_finnish_fi_4.2.0_3.0_1664023070934.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xlsr_300m_finnish', lang = 'fi') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xlsr_300m_finnish", lang = "fi") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xlsr_300m_finnish| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Urdu Bert Embeddings (from monsoon-nlp) author: John Snow Labs name: bert_embeddings_muril_adapted_local date: 2022-04-11 tags: [bert, embeddings, ur, open_source] task: Embeddings language: ur edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `muril-adapted-local` is an Urdu model originally trained by `monsoon-nlp`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_ur_3.4.2_3.0_1649676577872.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_ur_3.4.2_3.0_1649676577872.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","ur") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["مجھے سپارک این ایل پی سے محبت ہے"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","ur") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("مجھے سپارک این ایل پی سے محبت ہے").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ur.embed.muril_adapted_local").predict("""مجھے سپارک این ایل پی سے محبت ہے""") ```
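The `embeddings` column produced above holds one dense vector per token, and such vectors are typically compared with cosine similarity. A minimal sketch with toy 3-dimensional vectors (real BERT embeddings from this model have many more dimensions):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot(u, v) / (|u| * |v|).
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

sim_same = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors -> 1.0
sim_orth = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # orthogonal vectors -> 0.0
```

Because cosine similarity ignores vector magnitude, it is the usual choice for comparing contextual embeddings of words or sentences.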
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_muril_adapted_local| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ur| |Size:|888.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/monsoon-nlp/muril-adapted-local - https://tfhub.dev/google/MuRIL/1 --- layout: model title: English asr_wav2vec2_tcrs_runtest TFWav2Vec2ForCTC from neelan-elucidate-ai author: John Snow Labs name: pipeline_asr_wav2vec2_tcrs_runtest date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_tcrs_runtest` is an English model originally trained by neelan-elucidate-ai. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_tcrs_runtest_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_tcrs_runtest_en_4.2.0_3.0_1664104297819.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_tcrs_runtest_en_4.2.0_3.0_1664104297819.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_tcrs_runtest', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_tcrs_runtest", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_tcrs_runtest| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: DistilRoBERTa Base Ontonotes NER Pipeline author: John Snow Labs name: distilroberta_base_token_classifier_ontonotes_pipeline date: 2022-06-19 tags: [open_source, ner, token_classifier, distilroberta, ontonotes, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [distilroberta_base_token_classifier_ontonotes](https://nlp.johnsnowlabs.com/2021/09/26/distilroberta_base_token_classifier_ontonotes_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilroberta_base_token_classifier_ontonotes_pipeline_en_4.0.0_3.0_1655654420816.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilroberta_base_token_classifier_ontonotes_pipeline_en_4.0.0_3.0_1655654420816.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("distilroberta_base_token_classifier_ontonotes_pipeline", lang = "en") pipeline.annotate("My name is John and I have been working at John Snow Labs since November 2020.") ``` ```scala val pipeline = new PretrainedPipeline("distilroberta_base_token_classifier_ontonotes_pipeline", lang = "en") pipeline.annotate("My name is John and I have been working at John Snow Labs since November 2020.") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PERSON | |John Snow Labs|ORG | |November 2020 |DATE | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilroberta_base_token_classifier_ontonotes_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|307.5 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - RoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: Portuguese Named Entity Recognition (from monilouise) author: John Snow Labs name: bert_ner_ner_news_portuguese date: 2022-05-09 tags: [bert, ner, token_classification, pt, open_source] task: Named Entity Recognition language: pt edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `ner_news_portuguese` is a Portuguese model originally trained by `monilouise`. ## Predicted Entities `PUB`, `PESSOA`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_ner_news_portuguese_pt_3.4.2_3.0_1652097921222.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_ner_news_portuguese_pt_3.4.2_3.0_1652097921222.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_ner_news_portuguese","pt") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Eu amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_ner_news_portuguese","pt") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Eu amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
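The `ner` column produced above holds per-token BIO tags (`B-ORG`, `I-ORG`, `O`, …); converting them into entity chunks is the job of annotators like NerConverter. A simplified pure-Python equivalent of that grouping logic, for illustration only:

```python
def bio_to_chunks(tokens, tags):
    # Group B-/I- tagged tokens into (chunk_text, label) pairs.
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" tag or an I- tag that does not continue the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(bio_to_chunks(["Eu", "amo", "Spark", "NLP"],
                    ["O", "O", "B-ORG", "I-ORG"]))  # [('Spark NLP', 'ORG')]
```

Production converters also carry character offsets and confidence scores along with each chunk; this sketch keeps only the text and label.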
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_ner_news_portuguese| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|pt| |Size:|406.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/monilouise/ner_news_portuguese - https://github.com/neuralmind-ai/portuguese-bert/blob/master/README.md --- layout: model title: Extract Treatment Entities from Voice of the Patient Documents (embeddings_clinical_large) author: John Snow Labs name: ner_vop_treatment_emb_clinical_large date: 2023-06-06 tags: [license, clinical, ner, en, vop, treatment, drug, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts treatment-related entities from documents written in the patient's own words. ## Predicted Entities `Drug`, `Form`, `Route`, `Dosage`, `Duration`, `Procedure`, `Frequency`, `Treatment` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_treatment_emb_clinical_large_en_4.4.3_3.0_1686077860546.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_treatment_emb_clinical_large_en_4.4.3_3.0_1686077860546.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_treatment_emb_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["My grandpa was diagnosed with type 2 diabetes and had to make some changes to his lifestyle. He also takes metformin and glipizide to help regulate his blood sugar levels. 
It's been a bit of an adjustment, but he's doing well."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_treatment_emb_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("My grandpa was diagnosed with type 2 diabetes and had to make some changes to his lifestyle. He also takes metformin and glipizide to help regulate his blood sugar levels. It's been a bit of an adjustment, but he's doing well.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:----------|:------------| | metformin | Drug | | glipizide | Drug | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_treatment_emb_clinical_large| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical_large| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
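The per-label benchmarking figures below are derived from raw counts in the usual way: precision = tp/(tp+fp), recall = tp/(tp+fn), and F1 is their harmonic mean. A short sketch reproducing the `Drug` row from its counts:

```python
def prf1(tp, fp, fn):
    # precision, recall and F1 from raw true-positive / false-positive /
    # false-negative counts, rounded to two decimals as in the table.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

# Counts for the Drug label from the benchmarking table below.
print(prf1(1265, 67, 175))  # (0.95, 0.88, 0.91)
```

The same function reproduces every other row of the table from its tp/fp/fn columns.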
## Benchmarking ```bash label tp fp fn total precision recall f1 Drug 1265 67 175 1440 0.95 0.88 0.91 Form 252 35 14 266 0.88 0.95 0.91 Route 42 5 6 48 0.89 0.88 0.88 Dosage 331 28 81 412 0.92 0.80 0.86 Duration 2005 405 305 2310 0.83 0.87 0.85 Procedure 559 104 146 705 0.84 0.79 0.82 Frequency 895 197 184 1079 0.82 0.83 0.82 Treatment 167 55 61 228 0.75 0.73 0.74 macro_avg 5516 896 972 6488 0.86 0.84 0.85 micro_avg 5516 896 972 6488 0.86 0.85 0.85 ``` --- layout: model title: English ElectraForQuestionAnswering model (from ptran74) Version-1 author: John Snow Labs name: electra_qa_DSPFirst_Finetuning_1 date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `DSPFirst-Finetuning-1` is an English model originally trained by `ptran74`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_DSPFirst_Finetuning_1_en_4.0.0_3.0_1655919309936.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_DSPFirst_Finetuning_1_en_4.0.0_3.0_1655919309936.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_DSPFirst_Finetuning_1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_DSPFirst_Finetuning_1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.electra.finetuning_1").predict("""PUT YOUR QUESTION HERE|||"PUT YOUR CONTEXT HERE""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_DSPFirst_Finetuning_1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ptran74/DSPFirst-Finetuning-1 - https://github.gatech.edu/pages/VIP-ITS/textbook_SQuAD_explore/explore/textbookv1.0/textbook/ --- layout: model title: Translate English to Eastern Malayo-Polynesian languages Pipeline author: John Snow Labs name: translate_en_pqe date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, pqe, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `pqe` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_pqe_xx_2.7.0_2.4_1609689714319.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_pqe_xx_2.7.0_2.4_1609689714319.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_pqe", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_pqe", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.pqe').predict(text, output_level='sentence') translate_df ```
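Since the description notes that translation is computationally expensive on longer sequences, it can help to feed the pipeline shorter pieces of text. Below is a minimal, library-free sketch that splits input into sentence chunks under a character budget before each chunk is passed to `pipeline.annotate(...)`; the budget value is an arbitrary illustration, not a Spark NLP requirement:

```python
import re

def chunk_sentences(text: str, max_chars: int = 200) -> list:
    """Split text into sentence-aligned chunks no longer than max_chars each."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        # Start a new chunk when appending this sentence would exceed the budget.
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

# Each chunk could then be translated separately with pipeline.annotate(chunk).
```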
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_pqe| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Monetary Economics Document Classifier (EURLEX) author: John Snow Labs name: legclf_monetary_economics_bert date: 2023-03-06 tags: [en, legal, classification, clauses, monetary_economics, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The `legclf_monetary_economics_bert` model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class Monetary_Economics or not (Binary Classification) according to EuroVoc labels. ## Predicted Entities `Monetary_Economics`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_monetary_economics_bert_en_1.0.0_3.0_1678111728678.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_monetary_economics_bert_en_1.0.0_3.0_1678111728678.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_monetary_economics_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Monetary_Economics]| |[Other]| |[Other]| |[Monetary_Economics]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_monetary_economics_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.7 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Monetary_Economics 0.89 0.96 0.92 67 Other 0.94 0.86 0.90 58 accuracy - - 0.91 125 macro-avg 0.92 0.91 0.91 125 weighted-avg 0.91 0.91 0.91 125 ``` --- layout: model title: Detect Oncology-Specific Entities (clinical_large) author: John Snow Labs name: ner_oncology_emb_clinical_large date: 2023-04-12 tags: [licensed, clinical, en, oncology, biomarker, treatment, ner, clinical_large] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts more than 40 oncology-related entities, including therapies, tests and staging. Definitions of Predicted Entities: `Adenopathy:` Mentions of pathological findings of the lymph nodes. `Age:` All mentions of ages, past or present, related to the patient or to anybody else. `Biomarker:` Biological molecules that indicate the presence or absence of cancer, or the type of cancer. Oncogenes are excluded from this category. `Biomarker_Result:` Terms or values that are identified as the result of a biomarker. `Cancer_Dx:` Mentions of cancer diagnoses (such as “breast cancer”) or pathological types that are usually used as synonyms for “cancer” (e.g. “carcinoma”). When anatomical references are present, they are included in the Cancer_Dx extraction. 
`Cancer_Score:` Clinical or imaging scores that are specific for cancer settings (e.g. “BI-RADS” or “Allred score”). `Cancer_Surgery:` Terms that indicate surgery as a form of cancer treatment. `Chemotherapy:` Mentions of chemotherapy drugs, or unspecific words such as “chemotherapy”. `Cycle_Count:` The total number of cycles of an oncological therapy being administered (e.g. “5 cycles”). `Cycle_Day:` References to the day of the cycle of an oncological therapy (e.g. “day 5”). `Cycle_Number:` The number of the cycle of an oncological therapy that is being applied (e.g. “third cycle”). `Date:` Mentions of exact dates, in any format, including day number, month and/or year. `Death_Entity:` Words that indicate the death of the patient or someone else (including family members), such as “died” or “passed away”. `Direction:` Directional and laterality terms, such as “left”, “right”, “bilateral”, “upper” and “lower”. `Dosage:` The quantity prescribed by the physician for an active ingredient. `Duration:` Words indicating the duration of a treatment (e.g. “for 2 weeks”). `Frequency:` Words indicating the frequency of treatment administration (e.g. “daily” or “bid”). `Gender:` Gender-specific nouns and pronouns (including words such as “him” or “she”, and family members such as “father”). `Grade:` All pathological grading of tumors (e.g. “grade 1”) or degrees of cellular differentiation (e.g. “well-differentiated”). `Histological_Type:` Histological variants or cancer subtypes, such as “papillary”, “clear cell” or “medullary”. `Hormonal_Therapy:` Mentions of hormonal drugs used to treat cancer, or unspecific words such as “hormonal therapy”. `Imaging_Test:` Imaging tests mentioned in texts, such as “chest CT scan”. `Immunotherapy:` Mentions of immunotherapy drugs, or unspecific words such as “immunotherapy”. `Invasion:` Mentions that refer to tumor invasion, such as “invasion” or “involvement”. Metastases or lymph node involvement are excluded from this category. 
`Line_Of_Therapy:` Explicit references to the line of therapy of an oncological therapy (e.g. “first-line treatment”). `Metastasis:` Terms that indicate a metastatic disease. Anatomical references are not included in these extractions. `Oncogene:` Mentions of genes that are implicated in the etiology of cancer. `Pathology_Result:` The findings of a biopsy from the pathology report that are not covered by another entity (e.g. “malignant ductal cells”). `Pathology_Test:` Mentions of biopsies or tests that use tissue samples. `Performance_Status:` Mentions of performance status scores, such as ECOG and Karnofsky. The name of the score is extracted together with the result (e.g. “ECOG performance status of 4”). `Race_Ethnicity:` The race and ethnicity categories include racial and national origin or sociocultural groups. `Radiotherapy:` Terms that indicate the use of radiotherapy. `Response_To_Treatment:` Terms related to the clinical progress of the patient under cancer treatment, including “recurrence”, “bad response” or “improvement”. `Relative_Date:` Temporal references that are relative to the date of the text or to any other specific date (e.g. “yesterday” or “three years later”). `Route:` Words indicating the type of administration route (such as “PO” or “transdermal”). `Site_Bone:` Anatomical terms that refer to the human skeleton. `Site_Brain:` Anatomical terms that refer to the central nervous system (including the brain stem and the cerebellum). `Site_Breast:` Anatomical terms that refer to the breasts. `Site_Liver:` Anatomical terms that refer to the liver. `Site_Lung:` Anatomical terms that refer to the lungs. `Site_Lymph_Node:` Anatomical terms that refer to lymph nodes, excluding adenopathies. `Site_Other_Body_Part:` Relevant anatomical terms that are not included in the rest of the anatomical entities. `Smoking_Status:` All mentions of smoking related to the patient or to someone else. `Staging:` Mentions of cancer stage such as “stage 2b” or “T2N1M0”. 
It also includes words such as “in situ”, “early-stage” or “advanced”. `Targeted_Therapy:` Mentions of targeted therapy drugs, or unspecific words such as “targeted therapy”. `Tumor_Finding:` All nonspecific terms that may be related to tumors, either malignant or benign (for example: “mass”, “tumor”, “lesion”, or “neoplasm”). `Tumor_Size:` Size of the tumor, including numerical value and unit of measurement (e.g. “3 cm”). `Unspecific_Therapy:` Terms that indicate a known cancer therapy but are not specific to any other therapy entity (e.g. “chemoradiotherapy” or “adjuvant therapy”). ## Predicted Entities `Histological_Type`, `Direction`, `Staging`, `Cancer_Score`, `Imaging_Test`, `Cycle_Number`, `Tumor_Finding`, `Site_Lymph_Node`, `Invasion`, `Response_To_Treatment`, `Smoking_Status`, `Tumor_Size`, `Cycle_Count`, `Adenopathy`, `Age`, `Biomarker_Result`, `Unspecific_Therapy`, `Site_Breast`, `Chemotherapy`, `Targeted_Therapy`, `Radiotherapy`, `Performance_Status`, `Pathology_Test`, `Site_Other_Body_Part`, `Cancer_Surgery`, `Line_Of_Therapy`, `Pathology_Result`, `Hormonal_Therapy`, `Site_Bone`, `Biomarker`, `Immunotherapy`, `Cycle_Day`, `Frequency`, `Route`, `Duration`, `Death_Entity`, `Metastasis`, `Site_Liver`, `Cancer_Dx`, `Grade`, `Date`, `Site_Lung`, `Site_Brain`, `Relative_Date`, `Race_Ethnicity`, `Gender`, `Oncogene`, `Dosage`, `Radiation_Dose` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_emb_clinical_large_en_4.3.2_3.0_1681316109615.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_emb_clinical_large_en_4.3.2_3.0_1681316109615.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large","en","clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("word_embeddings") ner = MedicalNerModel.pretrained("ner_oncology_emb_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token", "word_embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner, ner_converter]) data = spark.createDataFrame([["""The had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to the residual breast. The cancer recurred as a right lung metastasis 13 years later. 
The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large","en","clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("word_embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_emb_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner, ner_converter)) val data = Seq("The had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to the residual breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
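The per-label precision, recall and F1 figures reported in the Benchmarking section below are the standard functions of the raw tp/fp/fn counts; a quick sketch to re-derive any row:

```python
def ner_metrics(tp: float, fp: float, fn: float) -> tuple:
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 4), round(recall, 4), round(f1, 4)

# Using the Histological_Type row of the table (tp=141, fp=40, fn=70):
print(ner_metrics(141, 40, 70))  # (0.779, 0.6682, 0.7194)
```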
## Results ```bash +------------------------------+-----+---+---------------------+ |chunk |begin|end|ner_label | +------------------------------+-----+---+---------------------+ |left |31 |34 |Direction | |mastectomy |36 |45 |Cancer_Surgery | |axillary lymph node dissection|54 |83 |Cancer_Surgery | |left |91 |94 |Direction | |breast cancer |96 |108|Cancer_Dx | |twenty years ago |110 |125|Relative_Date | |tumor |132 |136|Tumor_Finding | |positive |142 |149|Biomarker_Result | |ER |155 |156|Biomarker | |PR |162 |163|Biomarker | |radiotherapy |183 |194|Radiotherapy | |breast |229 |234|Site_Breast | |cancer |241 |246|Cancer_Dx | |recurred |248 |255|Response_To_Treatment| |right |262 |266|Direction | |lung |268 |271|Site_Lung | |metastasis |273 |282|Metastasis | |13 years later |284 |297|Relative_Date | |adriamycin |346 |355|Chemotherapy | |60 mg/m2 |358 |365|Dosage | |cyclophosphamide |372 |387|Chemotherapy | |600 mg/m2 |390 |398|Dosage | |six courses |406 |416|Cycle_Count | |first line |422 |431|Line_Of_Therapy | +------------------------------+-----+---+---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_emb_clinical_large| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, word_embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|15.3 MB| ## Benchmarking ```bash label tp fp fn total precision recall f1 Histological_Type 141.0 40.0 70.0 211.0 0.779 0.6682 0.7194 Direction 672.0 150.0 159.0 831.0 0.8175 0.8087 0.8131 Staging 102.0 27.0 36.0 138.0 0.7907 0.7391 0.764 Cancer_Score 10.0 2.0 11.0 21.0 0.8333 0.4762 0.6061 Imaging_Test 754.0 147.0 146.0 900.0 0.8368 0.8378 0.8373 Cycle_Number 48.0 43.0 12.0 60.0 0.5275 0.8 0.6358 Tumor_Finding 970.0 89.0 109.0 1079.0 0.916 0.899 0.9074 Site_Lymph_Node 210.0 68.0 61.0 271.0 0.7554 0.7749 0.765 Invasion 146.0 39.0 21.0 167.0 0.7892 0.8743 0.8295 Response_To_Treat... 
280.0 146.0 90.0 370.0 0.6573 0.7568 0.7035 Smoking_Status 42.0 11.0 6.0 48.0 0.7925 0.875 0.8317 Cycle_Count 104.0 23.0 40.0 144.0 0.8189 0.7222 0.7675 Tumor_Size 197.0 37.0 41.0 238.0 0.8419 0.8277 0.8347 Adenopathy 30.0 13.0 13.0 43.0 0.6977 0.6977 0.6977 Age 205.0 15.0 23.0 228.0 0.9318 0.8991 0.9152 Biomarker_Result 564.0 160.0 121.0 685.0 0.779 0.8234 0.8006 Unspecific_Therapy 108.0 30.0 66.0 174.0 0.7826 0.6207 0.6923 Site_Breast 92.0 18.0 18.0 110.0 0.8364 0.8364 0.8364 Chemotherapy 687.0 59.0 55.0 742.0 0.9209 0.9259 0.9234 Targeted_Therapy 178.0 29.0 28.0 206.0 0.8599 0.8641 0.862 Radiotherapy 143.0 22.0 18.0 161.0 0.8667 0.8882 0.8773 Performance_Status 17.0 15.0 15.0 32.0 0.5313 0.5313 0.5313 Pathology_Test 387.0 197.0 99.0 486.0 0.6627 0.7963 0.7234 Site_Other_Body_Part 678.0 287.0 460.0 1138.0 0.7026 0.5958 0.6448 Cancer_Surgery 398.0 82.0 95.0 493.0 0.8292 0.8073 0.8181 Line_Of_Therapy 38.0 9.0 10.0 48.0 0.8085 0.7917 0.8 Pathology_Result 180.0 206.0 161.0 341.0 0.4663 0.5279 0.4952 Hormonal_Therapy 98.0 12.0 25.0 123.0 0.8909 0.7967 0.8412 Site_Bone 172.0 43.0 51.0 223.0 0.8 0.7713 0.7854 Biomarker 693.0 144.0 138.0 831.0 0.828 0.8339 0.8309 Immunotherapy 66.0 17.0 16.0 82.0 0.7952 0.8049 0.8 Cycle_Day 85.0 44.0 43.0 128.0 0.6589 0.6641 0.6615 Frequency 199.0 37.0 36.0 235.0 0.8432 0.8468 0.845 Route 91.0 10.0 25.0 116.0 0.901 0.7845 0.8387 Duration 179.0 57.0 117.0 296.0 0.7585 0.6047 0.6729 Death_Entity 40.0 10.0 4.0 44.0 0.8 0.9091 0.8511 Metastasis 337.0 27.0 25.0 362.0 0.9258 0.9309 0.9284 Site_Liver 149.0 56.0 25.0 174.0 0.7268 0.8563 0.7863 Cancer_Dx 723.0 114.0 107.0 830.0 0.8638 0.8711 0.8674 Grade 47.0 21.0 19.0 66.0 0.6912 0.7121 0.7015 Date 403.0 15.0 14.0 417.0 0.9641 0.9664 0.9653 Site_Lung 338.0 134.0 64.0 402.0 0.7161 0.8408 0.7735 Site_Brain 165.0 53.0 41.0 206.0 0.7569 0.801 0.7783 Relative_Date 376.0 271.0 84.0 460.0 0.5811 0.8174 0.6793 Race_Ethnicity 42.0 0.0 13.0 55.0 1.0 0.7636 0.866 Gender 1255.0 17.0 7.0 1262.0 0.9866 0.9945 
0.9905 Dosage 417.0 53.0 68.0 485.0 0.8872 0.8598 0.8733 Oncogene 178.0 83.0 57.0 235.0 0.682 0.7574 0.7177 Radiation_Dose 41.0 4.0 11.0 52.0 0.9111 0.7885 0.8454 macro - - - - - - 0.7863 micro - - - - - - 0.8145 ``` --- layout: model title: Legal Records Clause Binary Classifier author: John Snow Labs name: legclf_records_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `records` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting them using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. 
## Predicted Entities `other`, `records` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_records_clause_en_1.0.0_3.2_1660122878980.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_records_clause_en_1.0.0_3.2_1660122878980.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_records_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
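As noted in the description, several binary clause classifiers can be run side by side to obtain one True/False flag per clause type. A plain-Python sketch of collecting those flags from each classifier's predicted label (the clause names here are purely illustrative):

```python
def clause_flags(predictions: dict) -> dict:
    """Map {clause_name: predicted_label} to {clause_name: bool}.

    Each binary classifier predicts either its own clause name or "other",
    so a clause is flagged True when the label matches the clause name.
    """
    return {clause: label == clause for clause, label in predictions.items()}

print(clause_flags({"records": "records", "whereas": "other"}))
# {'records': True, 'whereas': False}
```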
## Results ```bash +-------+ | result| +-------+ |[records]| |[other]| |[other]| |[records]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_records_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support books-and-records 0.91 0.94 0.92 31 other 0.98 0.97 0.98 104 accuracy - - 0.96 135 macro-avg 0.94 0.95 0.95 135 weighted-avg 0.96 0.96 0.96 135 ``` --- layout: model title: Chinese BertForQuestionAnswering model (from TingChenChang) author: John Snow Labs name: bert_qa_bert_base_chinese_finetuned_squad_colab date: 2022-06-02 tags: [zh, open_source, question_answering, bert] task: Question Answering language: zh edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-chinese-finetuned-squad-colab` is a Chinese model originally trained by `TingChenChang`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_chinese_finetuned_squad_colab_zh_4.0.0_3.0_1654179899019.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_chinese_finetuned_squad_colab_zh_4.0.0_3.0_1654179899019.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_chinese_finetuned_squad_colab","zh") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_chinese_finetuned_squad_colab","zh") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("zh.answer_question.squad.bert.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_chinese_finetuned_squad_colab| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|zh| |Size:|381.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/TingChenChang/bert-base-chinese-finetuned-squad-colab --- layout: model title: Pipeline to Detect Drugs and Posology Entities (ner_posology_greedy) author: John Snow Labs name: ner_posology_greedy_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_posology_greedy](https://nlp.johnsnowlabs.com/2021/03/31/ner_posology_greedy_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_greedy_pipeline_en_4.3.0_3.2_1678869761403.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_greedy_pipeline_en_4.3.0_3.2_1678869761403.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_posology_greedy_pipeline", "en", "clinical/models") text = '''The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_posology_greedy_pipeline", "en", "clinical/models") val text = "The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology_greedy.pipeline").predict("""The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.""") ```
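`fullAnnotate` returns annotations that carry the chunk text, character offsets and NER label. The sketch below flattens a hand-written sample with that shape into rows like the table in the Results section; the dict layout is a simplified stand-in for the real annotation objects, used here so the idea can be shown without a Spark session:

```python
# Hand-written sample mimicking two of the chunks from the Results section.
sample_chunks = [
    {"result": "1 capsule of Advil 10 mg", "begin": 27, "end": 50, "entity": "DRUG"},
    {"result": "for 5 days", "begin": 52, "end": 61, "entity": "DURATION"},
]

def to_rows(chunks: list) -> list:
    """Flatten chunk annotations into (text, begin, end, label) tuples."""
    return [(c["result"], c["begin"], c["end"], c["entity"]) for c in chunks]

for row in to_rows(sample_chunks):
    print(row)
```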
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:--------------------------------------------|--------:|------:|:------------|-------------:| | 0 | 1 capsule of Advil 10 mg | 27 | 50 | DRUG | 0.638183 | | 1 | for 5 days | 52 | 61 | DURATION | 0.573533 | | 2 | magnesium hydroxide 100mg/1ml suspension PO | 67 | 109 | DRUG | 0.68788 | | 3 | 40 units of insulin glargine | 179 | 206 | DRUG | 0.61964 | | 4 | at night | 208 | 215 | FREQUENCY | 0.7431 | | 5 | 12 units of insulin lispro | 218 | 243 | DRUG | 0.66034 | | 6 | with meals | 245 | 254 | FREQUENCY | 0.79235 | | 7 | metformin 1000 mg | 261 | 277 | DRUG | 0.707133 | | 8 | two times a day | 279 | 293 | FREQUENCY | 0.700825 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_greedy_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English DistilBertForQuestionAnswering Cased model (from botika) author: John Snow Labs name: distilbert_qa_checkpoint_124500_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `checkpoint-124500-finetuned-squad` is an English model originally trained by `botika`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_checkpoint_124500_finetuned_squad_en_4.3.0_3.0_1672765895389.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_checkpoint_124500_finetuned_squad_en_4.3.0_3.0_1672765895389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_checkpoint_124500_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_checkpoint_124500_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_checkpoint_124500_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/botika/checkpoint-124500-finetuned-squad --- layout: model title: Portuguese BertForTokenClassification Cased model (from pucpr) author: John Snow Labs name: bert_token_classifier_clinicalnerpt_chemical date: 2022-11-30 tags: [pt, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: pt edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `clinicalnerpt-chemical` is a Portuguese model originally trained by `pucpr`. ## Predicted Entities `ChemicalDrugs` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_clinicalnerpt_chemical_pt_4.2.4_3.0_1669822296330.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_clinicalnerpt_chemical_pt_4.2.4_3.0_1669822296330.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_clinicalnerpt_chemical","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_clinicalnerpt_chemical","pt") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_clinicalnerpt_chemical| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|pt| |Size:|665.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/pucpr/clinicalnerpt-chemical - https://www.aclweb.org/anthology/2020.clinicalnlp-1.7/ - https://github.com/HAILab-PUCPR/SemClinBr - https://github.com/HAILab-PUCPR/BioBERTpt --- layout: model title: English asr_xlsr_wav2vec2_final TFWav2Vec2ForCTC from chrisvinsen author: John Snow Labs name: pipeline_asr_xlsr_wav2vec2_final date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xlsr_wav2vec2_final` is an English model originally trained by chrisvinsen. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_xlsr_wav2vec2_final_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xlsr_wav2vec2_final_en_4.2.0_3.0_1664111417708.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xlsr_wav2vec2_final_en_4.2.0_3.0_1664111417708.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_xlsr_wav2vec2_final', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_xlsr_wav2vec2_final", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_xlsr_wav2vec2_final| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Multilingual BertForQuestionAnswering model (from Martin97Bozic) author: John Snow Labs name: bert_qa_bert_base_multilingual_uncased_finetuned_squad date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-uncased-finetuned-squad` is a Multilingual model originally trained by `Martin97Bozic`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_uncased_finetuned_squad_xx_4.0.0_3.0_1654180305032.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_uncased_finetuned_squad_xx_4.0.0_3.0_1654180305032.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_multilingual_uncased_finetuned_squad","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_multilingual_uncased_finetuned_squad","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("xx.answer_question.squad.bert.multilingual_base_uncased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
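The `nlu` one-liner above passes the question and context as a single string joined by `|||`. A tiny helper to build that input shape — the separator comes from the snippet above; the helper name itself is illustrative, not part of the nlu API:

```python
def to_nlu_qa_input(question: str, context: str) -> str:
    # nlu question-answering loads take "question|||context" in one string,
    # as in the predict() call shown above
    return f"{question}|||{context}"

sample = to_nlu_qa_input("What's my name?", "My name is Clara and I live in Berkeley.")
# sample == "What's my name?|||My name is Clara and I live in Berkeley."
```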
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_multilingual_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|xx| |Size:|626.1 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Martin97Bozic/bert-base-multilingual-uncased-finetuned-squad --- layout: model title: German Bert Embeddings (from Geotrend) author: John Snow Labs name: bert_embeddings_bert_base_de_cased date: 2022-04-11 tags: [bert, embeddings, de, open_source] task: Embeddings language: de edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-de-cased` is a German model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_de_cased_de_3.4.2_3.0_1649676315880.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_de_cased_de_3.4.2_3.0_1649676315880.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_de_cased","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Funken NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_de_cased","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Funken NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.embed.bert_base_de_cased").predict("""Ich liebe Funken NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_de_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|398.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-de-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Bangla Bert Embeddings (from monsoon-nlp) author: John Snow Labs name: bert_embeddings_muril_adapted_local date: 2022-04-11 tags: [bert, embeddings, bn, open_source] task: Embeddings language: bn edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `muril-adapted-local` is a Bangla model originally trained by `monsoon-nlp`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_bn_3.4.2_3.0_1649673508278.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_bn_3.4.2_3.0_1649673508278.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","bn") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["আমি স্পার্ক এনএলপি ভালোবাসি"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","bn") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("আমি স্পার্ক এনএলপি ভালোবাসি").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("bn.embed.muril_adapted_local").predict("""আমি স্পার্ক এনএলপি ভালোবাসি""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_muril_adapted_local| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|bn| |Size:|888.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/monsoon-nlp/muril-adapted-local - https://tfhub.dev/google/MuRIL/1 --- layout: model title: Fast Neural Machine Translation Model from Afrikaans to French author: John Snow Labs name: opus_mt_af_fr date: 2021-06-01 tags: [open_source, seq2seq, translation, af, fr, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. source languages: af target languages: fr {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_af_fr_xx_3.1.0_2.4_1622556757335.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_af_fr_xx_3.1.0_2.4_1622556757335.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_af_fr", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_af_fr", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Afrikaans.translate_to.French').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_af_fr| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Termination Clause Binary Classifier author: John Snow Labs name: legclf_termination_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `termination` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `termination` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_termination_clause_en_1.0.0_3.2_1660123088167.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_termination_clause_en_1.0.0_3.2_1660123088167.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_termination_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
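A minimal, Spark-free sketch of the "paragraph splitting (by multiline)" technique recommended in the description above; the function name and the regex are illustrative, not taken from the tutorial:

```python
import re

def split_into_paragraphs(text: str) -> list:
    """Split a legal document into clause-sized chunks on blank lines,
    so each chunk can become one row of the `clause_text` column."""
    # Split on one or more blank lines and drop empty chunks
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "1. TERMINATION. Either party may terminate this Agreement.\n\n2. NOTICES. All notices shall be in writing."
chunks = split_into_paragraphs(doc)
# chunks[0] starts with "1. TERMINATION.", chunks[1] with "2. NOTICES."
```

Each chunk can then be wrapped in a DataFrame (e.g. `spark.createDataFrame([[c] for c in chunks]).toDF("clause_text")`) and passed through the pipeline above.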
## Results ```bash +-------------+ |       result| +-------------+ |[termination]| |      [other]| |      [other]| |[termination]| +-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_termination_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support duration-and-termination 0.93 0.89 0.91 28 other 0.97 0.98 0.98 107 accuracy - - 0.96 135 macro-avg 0.95 0.94 0.94 135 weighted-avg 0.96 0.96 0.96 135 ``` --- layout: model title: French CamemBert Embeddings (from pgperrone) author: John Snow Labs name: camembert_embeddings_pgperrone_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `pgperrone`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_pgperrone_generic_model_fr_3.4.4_3.0_1653990085205.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_pgperrone_generic_model_fr_3.4.4_3.0_1653990085205.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_pgperrone_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_pgperrone_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_pgperrone_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/pgperrone/dummy-model --- layout: model title: Legal Certificates Clause Binary Classifier author: John Snow Labs name: legclf_certificates_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `certificates` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities `other`, `certificates` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_certificates_clause_en_1.0.0_3.2_1660122225478.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_certificates_clause_en_1.0.0_3.2_1660122225478.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_certificates_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
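The multi-classifier usage described above can be sketched with a small post-processing helper that turns each model's `category` output into one True/False flag per clause type. The model names below match the two cards on this page; the helper itself and the sample predictions are illustrative, not part of the Legal NLP API:

```python
def aggregate_clause_flags(clause_type_by_model: dict, predictions: dict) -> dict:
    """Collapse the outputs of several binary clause classifiers into
    one boolean per clause type (True when the model predicted its clause
    rather than `other`)."""
    return {
        clause: predictions.get(model) == clause
        for model, clause in clause_type_by_model.items()
    }

models = {
    "legclf_certificates_clause": "certificates",
    "legclf_termination_clause": "termination",
}
preds = {
    "legclf_certificates_clause": "certificates",  # matched its clause
    "legclf_termination_clause": "other",          # did not match
}
flags = aggregate_clause_flags(models, preds)
# flags -> {"certificates": True, "termination": False}
```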
## Results ```bash +--------------+ |        result| +--------------+ |[certificates]| |       [other]| |       [other]| |[certificates]| +--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_certificates_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support certificates 0.88 0.80 0.84 35 other 0.94 0.96 0.95 105 accuracy - - 0.92 140 macro-avg 0.91 0.88 0.89 140 weighted-avg 0.92 0.92 0.92 140 ``` --- layout: model title: Arabic Named Entity Recognition (from CAMeL-Lab) author: John Snow Labs name: bert_ner_bert_base_arabic_camelbert_ca_ner date: 2022-05-09 tags: [bert, ner, token_classification, ar, open_source] task: Named Entity Recognition language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-ca-ner` is an Arabic model originally trained by `CAMeL-Lab`. ## Predicted Entities `LOC`, `PERS`, `ORG`, `MISC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_arabic_camelbert_ca_ner_ar_3.4.2_3.0_1652099496014.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_arabic_camelbert_ca_ner_ar_3.4.2_3.0_1652099496014.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_arabic_camelbert_ca_ner","ar") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_arabic_camelbert_ca_ner","ar") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("أنا أحب الشرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_bert_base_arabic_camelbert_ca_ner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ar| |Size:|407.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-ca-ner - https://camel.abudhabi.nyu.edu/anercorp/ - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://github.com/CAMeL-Lab/camel_tools --- layout: model title: French CamemBert Embeddings (from eduardopds) author: John Snow Labs name: camembert_embeddings_eduardopds_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `eduardopds`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_eduardopds_generic_model_fr_3.4.4_3.0_1653988153125.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_eduardopds_generic_model_fr_3.4.4_3.0_1653988153125.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_eduardopds_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_eduardopds_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_eduardopds_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/eduardopds/dummy-model --- layout: model title: Electra MeDAL Acronym BERT Embeddings author: John Snow Labs name: electra_medal_acronym date: 2022-01-04 tags: [acronym, abbreviation, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.3.3 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Electra model fine-tuned on MeDAL, a large dataset on abbreviation disambiguation, designed for pretraining natural language understanding models in the medical domain. Check the reference [here](https://aclanthology.org/2020.clinicalnlp-1.15.pdf). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_medal_acronym_en_3.3.3_3.0_1641310227830.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_medal_acronym_en_3.3.3_3.0_1641310227830.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler= DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer= Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_medal_acronym", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlpPipeline= Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, embeddings]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_medal_acronym", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.electra.medical").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_medal_acronym| |Compatibility:|Spark NLP 3.3.3+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[electra]| |Language:|en| |Size:|66.0 MB| |Case sensitive:|true| ## Data Source https://github.com/BruceWen120/medal --- layout: model title: Dutch RobertaForQuestionAnswering Base Cased model (from Nadav) author: John Snow Labs name: roberta_qa_base_squad date: 2023-01-20 tags: [nl, open_source, roberta, question_answering, tensorflow] task: Question Answering language: nl edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad-nl` is a Dutch model originally trained by `Nadav`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad_nl_4.3.0_3.0_1674218840970.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad_nl_4.3.0_3.0_1674218840970.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad","nl")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad","nl") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|nl| |Size:|436.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Nadav/roberta-base-squad-nl --- layout: model title: Part of Speech Tagger Pretrained with Clinical Data author: John Snow Labs name: pos_clinical date: 2021-03-29 tags: [pos, parts_of_speech, en, licensed] task: Part of Speech Tagging language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. This model was trained on additional medical data. ## Predicted Entities - PROPN - PUNCT - ADJ - NOUN - VERB - DET - ADP - AUX - PRON - PART - SCONJ - NUM - ADV - CCONJ - X - INTJ - SYM {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/pos_clinical_en_3.0.0_3.0_1617052315327.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/pos_clinical_en_3.0.0_3.0_1617052315327.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document") tokenizer = Tokenizer().setInputCols("document").setOutputCol("token") pos = PerceptronModel.pretrained("pos_clinical","en","clinical/models").setInputCols("token","document").setOutputCol("pos") pipeline = Pipeline(stages=[document_assembler, tokenizer, pos]) df = spark.createDataFrame([['POS assigns each token in a sentence a grammatical label']], ["text"]) result = pipeline.fit(df).transform(df) result.select("pos.result").show(truncate=False) ``` ```scala val document_assembler = new DocumentAssembler().setInputCol("text").setOutputCol("document") val tokenizer = new Tokenizer().setInputCols(Array("document")).setOutputCol("token") val pos = PerceptronModel.pretrained("pos_clinical","en","clinical/models").setInputCols("token","document").setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, pos)) val df = Seq("POS assigns each token in a sentence a grammatical label").toDF("text") val result = pipeline.fit(df).transform(df) result.select("pos.result").show(false) ``` {:.nlu-block} ```python nlu.load('pos.clinical').predict("POS assigns each token in a sentence a grammatical label") ```
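The `averaged perceptron architecture` behind this tagger keeps a weight per (feature, tag) pair and nudges weights whenever its guess differs from the gold tag. A toy sketch of that update rule (the feature names and the two-tag set are invented, and the real model additionally averages weights over all updates, which this sketch omits):

```python
from collections import defaultdict

# weights[feature][tag] -> score contribution of that feature for that tag
weights = defaultdict(lambda: defaultdict(float))
tags = ["NN", "VB"]

def predict(features):
    # Score each tag by summing its feature weights; ties go to the first tag.
    scores = {t: sum(weights[f][t] for f in features) for t in tags}
    return max(tags, key=lambda t: scores[t])

def update(features, gold):
    # Perceptron rule: reward the gold tag, penalize the wrong guess.
    guess = predict(features)
    if guess != gold:
        for f in features:
            weights[f][gold] += 1.0
            weights[f][guess] -= 1.0
    return guess

# One pass over two toy training examples.
update(["word=scan", "prev=the"], "NN")
update(["word=scan", "prev=to"], "VB")
```

After these updates, `predict(["word=scan", "prev=to"])` prefers `VB`, since the second example shifted weight away from the wrong guess.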
## Results ```bash +------------------------------------------+ |result | +------------------------------------------+ |[NN, NNS, PND, NN, II, DD, NN, DD, JJ, NN]| +------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_clinical| |Compatibility:|Spark NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|en| --- layout: model title: Greek BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_el_cased date: 2022-12-02 tags: [el, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: el edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-el-cased` is a Greek model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_el_cased_el_4.2.4_3.0_1670016616225.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_el_cased_el_4.2.4_3.0_1670016616225.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_el_cased","el") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_el_cased","el") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
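This checkpoint was trained as a masked language model: for a masked position the network emits one logit per vocabulary entry, and candidates are ranked by softmax probability. A self-contained sketch of that ranking step over a toy four-word vocabulary (both the vocabulary and the logits are invented):

```python
import math

def softmax(logits):
    # Numerically stable softmax: shift by the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["καλημέρα", "κόσμος", "γειά", "Αθήνα"]  # toy Greek vocabulary
logits = [2.0, 0.5, 3.1, -1.0]                   # invented scores for one masked position

probs = softmax(logits)
best = vocab[probs.index(max(probs))]
```

The highest-logit entry always wins under argmax; the softmax step only matters when you need calibrated probabilities or top-k sampling.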
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_el_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|el| |Size:|356.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-el-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: DistilBERT base multilingual model (cased) author: John Snow Labs name: distilbert_base_multilingual_cased date: 2021-05-20 tags: [distilbert, embeddings, xx, multilingual, open_source] task: Embeddings language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a distilled version of the [BERT base multilingual model](bert-base-multilingual-cased). The code for the distillation process can be found [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation). This model is cased: it does make a difference between english and English. The model is trained on the concatenation of Wikipedia in 104 different languages listed [here](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages). The model has 6 layers, 768 dimensions and 12 heads, totaling 134M parameters (compared to 177M parameters for mBERT-base). On average, DistilmBERT is twice as fast as mBERT-base. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_base_multilingual_cased_xx_3.1.0_2.4_1621522568093.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_base_multilingual_cased_xx_3.1.0_2.4_1621522568093.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = DistilBertEmbeddings.pretrained("distilbert_base_multilingual_cased", "xx") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = DistilBertEmbeddings.pretrained("distilbert_base_multilingual_cased", "xx") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("xx.embed.distilbert").predict("""Put your text here.""") ```
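The 134M-parameter figure in the description can be roughly reproduced from the architecture itself: most parameters sit in the large multilingual embedding matrix, with the rest in the six transformer layers. A back-of-the-envelope sketch, assuming mBERT's ~119.5K WordPiece vocabulary and a 512-position table, and ignoring layer-norm parameters:

```python
vocab_size, hidden, n_layers, ffn, max_pos = 119_547, 768, 6, 3072, 512

embed_params = (vocab_size + max_pos) * hidden            # token + position embeddings
attn_params = 4 * (hidden * hidden + hidden)              # Q, K, V, output projections (+ biases)
ffn_params = (hidden * ffn + ffn) + (ffn * hidden + hidden)
per_layer = attn_params + ffn_params
total = embed_params + n_layers * per_layer

millions = total / 1e6  # lands close to the 134M quoted for DistilmBERT
```

The embedding matrix alone accounts for roughly two thirds of the total, which is why multilingual distilled models are not much smaller than their monolingual teachers.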
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_base_multilingual_cased| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[embeddings]| |Language:|xx| |Case sensitive:|true| ## Data Source [https://huggingface.co/distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased) ## Benchmarking ```bash | Model | English | Spanish | Chinese | German | Arabic | Urdu | | :---: | :---: | :---: | :---: | :---: | :---: | :---:| | mBERT base cased (computed) | 82.1 | 74.6 | 69.1 | 72.3 | 66.4 | 58.5 | | mBERT base uncased (reported)| 81.4 | 74.3 | 63.8 | 70.5 | 62.1 | 58.3 | | DistilmBERT | 78.2 | 69.1 | 64.0 | 66.3 | 59.1 | 54.7 | ``` --- layout: model title: English T5ForConditionalGeneration Small Cased model (from ml6team) author: John Snow Labs name: t5_keyphrase_generation_small_openkp date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyphrase-generation-t5-small-openkp` is an English model originally trained by `ml6team`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_keyphrase_generation_small_openkp_en_4.3.0_3.0_1675104714518.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_keyphrase_generation_small_openkp_en_4.3.0_3.0_1675104714518.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_keyphrase_generation_small_openkp","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_keyphrase_generation_small_openkp","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
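The `answers` column carries the generated text as a single string, and keyphrase-generation models typically emit several phrases joined by a separator. A hedged post-processing sketch (the semicolon separator and the sample string are assumptions for illustration, not this model's verified output format):

```python
def parse_keyphrases(generated, sep=";"):
    # Split the raw generated string into a deduplicated, order-preserving list.
    seen, phrases = set(), []
    for raw in generated.split(sep):
        phrase = raw.strip().lower()
        if phrase and phrase not in seen:
            seen.add(phrase)
            phrases.append(phrase)
    return phrases

raw_output = "keyphrase generation; OpenKP ; keyphrase generation;t5"
keyphrases = parse_keyphrases(raw_output)
```

Lowercasing before deduplication is a design choice; keep the original casing if the keyphrases will be shown to users.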
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_keyphrase_generation_small_openkp| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|285.9 MB| ## References - https://huggingface.co/ml6team/keyphrase-generation-t5-small-openkp - https://github.com/microsoft/OpenKP - https://arxiv.org/abs/1911.02671 - https://paperswithcode.com/sota?task=Keyphrase+Generation&dataset=openkp --- layout: model title: Fast Neural Machine Translation Model from American Sign Language to German author: John Snow Labs name: opus_mt_ase_de date: 2021-06-01 tags: [open_source, seq2seq, translation, ase, de, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). source languages: ase target languages: de {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ase_de_xx_3.1.0_2.4_1622556174364.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ase_de_xx_3.1.0_2.4_1622556174364.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ase_de", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ase_de", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.American Sign Language.translate_to.German').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ase_de| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from MaggieXM) author: John Snow Labs name: distilbert_qa_maggiexm_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `MaggieXM`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_maggiexm_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768751753.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_maggiexm_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768751753.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_maggiexm_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_maggiexm_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
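Extractive QA heads such as this one score every context token twice, once as a candidate answer start and once as an end; the `answer` annotation corresponds to the best-scoring valid span. A library-free sketch of that selection step with invented logits:

```python
def best_span(start_logits, end_logits, max_len=15):
    # Pick the (start, end) pair maximizing start+end score, with end >= start
    # and span length capped at max_len tokens.
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Toy logits over 6 context tokens: ["My", "name", "is", "Clara", "and", "I"]
start = [0.1, 0.2, 0.1, 3.0, 0.1, 0.0]
end   = [0.0, 0.1, 0.2, 2.5, 0.3, 0.1]
span = best_span(start, end)  # the single token "Clara"
```

Real implementations also consider a "no answer" score and decode over sub-word tokens, but the argmax-over-spans core is the same.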
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_maggiexm_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/MaggieXM/distilbert-base-uncased-finetuned-squad --- layout: model title: Recognize Entities DL Pipeline for Finnish - Small author: John Snow Labs name: entity_recognizer_sm date: 2021-03-22 tags: [open_source, finnish, entity_recognizer_sm, pipeline, fi] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: fi edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_sm is a pretrained pipeline that performs basic processing steps and covers most of the common text processing tasks on your dataframe. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_fi_3.0.0_3.0_1616443699887.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_fi_3.0.0_3.0_1616443699887.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('entity_recognizer_sm', lang = 'fi') annotations = pipeline.fullAnnotate("Hei John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("entity_recognizer_sm", lang = "fi") val result = pipeline.fullAnnotate("Hei John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hei John Snow Labs! "] result_df = nlu.load('fi.ner').predict(text) result_df ```
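Internally the pipeline's NER stage emits IOB tags (`B-XXX` opens an entity, `I-XXX` continues it), and the `entities` output is produced by merging consecutive tags into spans. A minimal sketch of that merge (the tokenization here is simplified relative to the pipeline's actual output):

```python
def extract_entities(tokens, tags):
    # Group IOB tags into (text, label) spans.
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # "O" tag (or a stray "I-" with no open entity) closes any open span
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

tokens = ["Hei", "John", "Snow", "Labs", "!"]
tags = ["O", "B-PER", "I-PER", "I-PER", "O"]
entities = extract_entities(tokens, tags)
```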
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:-------------------------|:------------------------|:---------------------------------|:-----------------------------|:---------------------------------|:--------------------| | 0 | ['Hei John Snow Labs! '] | ['Hei John Snow Labs!'] | ['Hei', 'John', 'Snow', 'Labs!'] | [[-0.394499987363815,.,...]] | ['O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_sm| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| --- layout: model title: Fast Neural Machine Translation Model from Afrikaans to Finnish author: John Snow Labs name: opus_mt_af_fi date: 2021-06-01 tags: [open_source, seq2seq, translation, af, fi, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
source languages: af target languages: fi {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_af_fi_xx_3.1.0_2.4_1622562348480.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_af_fi_xx_3.1.0_2.4_1622562348480.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_af_fi", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_af_fi", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Afrikaans.translate_to.Finnish').predict(text, output_level='sentence') translate_df ```
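Marian translates sentence by sentence, which is why `SentenceDetectorDLModel` precedes it in the pipeline above. A naive, punctuation-based stand-in for that splitting step (the real detector is a trained model and handles many cases this regex cannot, such as abbreviations):

```python
import re

def naive_split_sentences(text):
    # Split on sentence-final punctuation followed by whitespace; keep the punctuation.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = "Goeie more. Hoe gaan dit? Ek is goed."
sentences = naive_split_sentences(text)
```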
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_af_fi| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from peggyhuang) author: John Snow Labs name: bert_qa_finetune_bert_base_v2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `finetune-bert-base-v2` is an English model originally trained by `peggyhuang`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_finetune_bert_base_v2_en_4.0.0_3.0_1654187750899.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_finetune_bert_base_v2_en_4.0.0_3.0_1654187750899.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_finetune_bert_base_v2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_finetune_bert_base_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.base_v2").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
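The Scala example sets `setMaxSentenceLength(512)`: inputs longer than the transformer's window must be truncated or chunked. A sketch of overlapping chunking over a token list (the window and stride values are illustrative only; Spark NLP applies its own truncation internally):

```python
def chunk_tokens(tokens, max_len, stride):
    # Slide a window of max_len tokens, overlapping adjacent chunks by `stride` tokens.
    if len(tokens) <= max_len:
        return [tokens]
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += max_len - stride
    return chunks

tokens = [f"tok{i}" for i in range(10)]
chunks = chunk_tokens(tokens, max_len=4, stride=1)
```

Overlap matters for QA: without it, an answer span straddling a chunk boundary would be unrecoverable.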
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_finetune_bert_base_v2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/peggyhuang/finetune-bert-base-v2 --- layout: model title: French CamemBert Embeddings (from mohammadrea76) author: John Snow Labs name: camembert_embeddings_mohammadrea76_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `mohammadrea76`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_mohammadrea76_generic_model_fr_3.4.4_3.0_1653989762430.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_mohammadrea76_generic_model_fr_3.4.4_3.0_1653989762430.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_mohammadrea76_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_mohammadrea76_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_mohammadrea76_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/mohammadrea76/dummy-model --- layout: model title: English asr_wav2vec2_base_timit_demo_colab10 TFWav2Vec2ForCTC from sameearif88 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab10 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab10` is an English model originally trained by sameearif88. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab10_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab10_en_4.2.0_3.0_1664019670817.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab10_en_4.2.0_3.0_1664019670817.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab10', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab10", lang = "en") val annotations = pipeline.transform(audioDF) ```
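The `audioDF` above is expected to hold the raw audio as an array of floats. A minimal sketch of preparing such input from a 16-bit PCM mono WAV file; the helper name and the DataFrame schema in the trailing comment are illustrative assumptions, not part of the Spark NLP API:

```python
import struct
import wave

def read_wav_as_floats(wav_source):
    """Decode a 16-bit PCM mono WAV file (path or file object) into floats in [-1.0, 1.0]."""
    with wave.open(wav_source, "rb") as wav:
        frames = wav.readframes(wav.getnframes())
    # Each sample is a signed 16-bit little-endian integer; normalize by 2**15.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# With a SparkSession available, the input DataFrame could then be built as:
# audioDF = spark.createDataFrame([[read_wav_as_floats("sample.wav")]], ["audio_content"])
```

The same preprocessing applies to the other ASR pipelines on this page.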
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab10| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English asr_wav2vec2_large_xlsr_upper_sorbian_mixed TFWav2Vec2ForCTC from jimregan author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_upper_sorbian_mixed date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_upper_sorbian_mixed` is an English model originally trained by jimregan. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_upper_sorbian_mixed_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_upper_sorbian_mixed_en_4.2.0_3.0_1664020298753.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_upper_sorbian_mixed_en_4.2.0_3.0_1664020298753.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_upper_sorbian_mixed', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_upper_sorbian_mixed", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_upper_sorbian_mixed| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English DistilBertForQuestionAnswering model (from gokulkarthik) author: John Snow Labs name: distilbert_qa_gokulkarthik_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `gokulkarthik`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_gokulkarthik_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725276170.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_gokulkarthik_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725276170.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_gokulkarthik_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_gokulkarthik_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_gokulkarthik").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_gokulkarthik_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/gokulkarthik/distilbert-base-uncased-finetuned-squad --- layout: model title: Fast Neural Machine Translation Model from Malayalam to English author: John Snow Labs name: opus_mt_ml_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ml, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `ml` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ml_en_xx_2.7.0_2.4_1609168877964.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ml_en_xx_2.7.0_2.4_1609168877964.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ml_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ml_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ml.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ml_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from vuiseng9) author: John Snow Labs name: bert_qa_vuiseng9_bert_base_uncased_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squad` is an English model originally trained by `vuiseng9`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_vuiseng9_bert_base_uncased_squad_en_4.0.0_3.0_1654181321679.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_vuiseng9_bert_base_uncased_squad_en_4.0.0_3.0_1654181321679.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_vuiseng9_bert_base_uncased_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_vuiseng9_bert_base_uncased_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased.by_vuiseng9").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
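The nlu one-liner above packs the question and its context into a single string separated by `|||`. A small pure-Python helper makes that input convention explicit; the helper name is illustrative, not part of the nlu API:

```python
def to_qa_input(question, context):
    """Join a question and its context in the 'question|||context' form used by nlu QA loaders."""
    return f"{question}|||{context}"

text = to_qa_input("What's my name?", "My name is Clara and I live in Berkeley.")
# text == "What's my name?|||My name is Clara and I live in Berkeley."
```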
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_vuiseng9_bert_base_uncased_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/vuiseng9/bert-base-uncased-squad --- layout: model title: Sentence Entity Resolver for HCC codes (Augmented) author: John Snow Labs name: sbiobertresolve_hcc_augmented date: 2021-05-30 tags: [entity_resolution, en, licensed] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to HCC codes using Sentence Bert Embeddings. ## Predicted Entities HCC codes and their descriptions. {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_hcc_augmented_en_3.0.4_3.0_1622370690651.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_hcc_augmented_en_3.0.4_3.0_1622370690651.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The ```sbiobertresolve_hcc_augmented``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_clinical``` as the NER model, with ```PROBLEM``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_hcc_augmented","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala ... 
val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_hcc_augmented","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver)) val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret's Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.hcc").predict("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""") ```
## Results ```bash +--------------------+-----+---+-------+----+----------+--------------------+--------------------+ | chunk|begin|end| entity|code|confidence| all_k_resolutions| all_k_codes| +--------------------+-----+---+-------+----+----------+--------------------+--------------------+ | hypertension| 68| 79|PROBLEM| 139| 0.4357|renal hypertensio...|139:::85:::108:::...| |chronic renal ins...| 83|109|PROBLEM| 139| 0.9748|chronic renal ins...|139:::140:::136::...| | COPD| 113|116|PROBLEM| 111| 0.5609|copd - chronic ob...| 111:::112:::84:::85| | gastritis| 120|128|PROBLEM| 188| 0.1991|functional disord...|188:::6:::75/18::...| | TIA| 136|138|PROBLEM| 167| 0.3094|cerebral concussi...|167:::100:::167/1...| |a non-ST elevatio...| 182|202|PROBLEM| 86| 0.4165|silent myocardial...|86:::87:::100:::9...| |Guaiac positive s...| 208|229|PROBLEM| 188| 0.1492|appendicovesicost...|188:::33:::48:::1...| | mid LAD lesion| 332|345|PROBLEM| 86| 0.8090|stemi involving l...| 86:::108:::107| | hypotension| 362|372|PROBLEM| 59| 0.8107|drug-induced hypo...|59:::78:::2:::23:...| | bradycardia| 378|388|PROBLEM| 96| 0.5205|tachycardia-brady...|96:::59:::78:::23...| | vagal reaction| 466|479|PROBLEM| 108| 0.4985|vasomotor reactio...|108:::96:::23:::7...| +--------------------+-----+---+-------+----+----------+--------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_hcc_augmented| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[hcc_code]| |Language:|en| |Case sensitive:|false| --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC2GM_Gene_Modified_PubMedBERT date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: 
BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC2GM-Gene-Modified_PubMedBERT` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `GENE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC2GM_Gene_Modified_PubMedBERT_en_4.0.0_3.0_1657107928157.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC2GM_Gene_Modified_PubMedBERT_en_4.0.0_3.0_1657107928157.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC2GM_Gene_Modified_PubMedBERT","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC2GM_Gene_Modified_PubMedBERT","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC2GM_Gene_Modified_PubMedBERT| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|408.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC2GM-Gene-Modified_PubMedBERT --- layout: model title: Translate English to Indic languages Pipeline author: John Snow Labs name: translate_en_inc date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, inc, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `inc` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_inc_xx_2.7.0_2.4_1609690946201.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_inc_xx_2.7.0_2.4_1609690946201.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_inc", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_inc", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.inc').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_inc| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering model (from SauravMaheshkar) Xqua author: John Snow Labs name: distilbert_qa_multi_finetuned_for_xqua_on_chaii date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-multi-finetuned-for-xqua-on-chaii` is an English model originally trained by `SauravMaheshkar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_multi_finetuned_for_xqua_on_chaii_en_4.0.0_3.0_1654727570723.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_multi_finetuned_for_xqua_on_chaii_en_4.0.0_3.0_1654727570723.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_multi_finetuned_for_xqua_on_chaii","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_multi_finetuned_for_xqua_on_chaii","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.chaii.distil_bert").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_multi_finetuned_for_xqua_on_chaii| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|505.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/SauravMaheshkar/distilbert-multi-finetuned-for-xqua-on-chaii --- layout: model title: Smaller BERT Sentence Embeddings (L-8_H-128_A-2) author: John Snow Labs name: sent_small_bert_L8_128 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L8_128_en_2.6.0_2.4_1598350334113.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L8_128_en_2.6.0_2.4_1598350334113.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L8_128", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L8_128", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L8_128').predict(text, output_level='sentence') embeddings_df ```
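The 128-dimensional sentence vectors this model produces are typically compared with cosine similarity. A minimal pure-Python sketch, with short placeholder vectors standing in for the `sentence_embeddings` arrays found in the result DataFrame:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score 1.0 regardless of magnitude:
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
```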
{:.h2_title} ## Results ```bash en_embed_sentence_small_bert_L8_128_embeddings sentence [1.3234734535217285, -0.39469125866889954, 0.0... I hate cancer [1.8613495826721191, -0.7580474019050598, -0.6... Antibiotics aren't painkiller ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L8_128| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|128| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/1](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/1) --- layout: model title: Question Pair Classifier Pipeline author: John Snow Labs name: classifierdl_electra_questionpair_pipeline date: 2021-08-25 tags: [quora, question_pair, public, en, open_source, pipeline] task: Text Classification language: en nav_key: models edition: Spark NLP 3.2.0 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pre-trained pipeline identifies whether the two question sentences are semantically repetitive or different. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_QUESTIONPAIR/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_QUESTIONPAIRS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_electra_questionpair_pipeline_en_3.2.0_2.4_1629892687975.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_electra_questionpair_pipeline_en_3.2.0_2.4_1629892687975.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use - The question pairs should be identified with "q1" and "q2" in the text. The input text format should be as follows: `text = "q1: What is your name? q2: Who are you?"`.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("classifierdl_electra_questionpair_pipeline", "en") result1 = pipeline.fullAnnotate("q1: What is your favorite movie? q2: Which movie do you like most?") result2 = pipeline.fullAnnotate("q1: What is your favorite movie? q2: Which movie genre would you like to watch?") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("classifierdl_electra_questionpair_pipeline", "en") val result1 = pipeline.fullAnnotate("q1: What is your favorite movie? q2: Which movie do you like most?")(0) val result2 = pipeline.fullAnnotate("q1: What is your favorite movie? q2: Which movie genre would you like to watch?")(0) ```
## Results ```bash result1 --> ['almost_same'] result2 --> ['not_same'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_electra_questionpair_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - BertSentenceEmbeddings - ClassifierDLModel --- layout: model title: Translate Hindi to English Pipeline author: John Snow Labs name: translate_hi_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, hi, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `hi` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_hi_en_xx_2.7.0_2.4_1609688323314.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_hi_en_xx_2.7.0_2.4_1609688323314.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_hi_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_hi_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.hi.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_hi_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Detect posology entities (large-biobert) author: John Snow Labs name: ner_posology_large_biobert date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detects drug, dosage, and administration-instruction entities in text using a pretrained NER model. ## Predicted Entities `DOSAGE`, `DRUG`, `STRENGTH`, `FORM`, `DURATION`, `FREQUENCY`, `ROUTE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_large_biobert_en_3.0.0_3.0_1617260818924.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_large_biobert_en_3.0.0_3.0_1617260818924.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_posology_large_biobert", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_posology_large_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology.large_biobert").predict("""Put your text here.""") ```
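Once the chunks and their labels have been collected from the `ner_chunk` column, downstream grouping is plain Python. A sketch with an illustrative, made-up extraction result:

```python
from collections import defaultdict

# Illustrative (chunk, entity) pairs, as they might be collected
# from the ner_chunk column of the transformed DataFrame.
pairs = [
    ("insulin glargine", "DRUG"),
    ("40 units", "DOSAGE"),
    ("subcutaneously", "ROUTE"),
    ("at bedtime", "FREQUENCY"),
]

# Group chunks by entity type for easy downstream use.
by_entity = defaultdict(list)
for chunk, entity in pairs:
    by_entity[entity].append(chunk)

print(dict(by_entity))
```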
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_large_biobert| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| --- layout: model title: English AlbertForQuestionAnswering XXLarge model (from rahulchakwate) author: John Snow Labs name: albert_qa_xxlarge_finetuned_squad date: 2022-06-24 tags: [en, open_source, albert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: AlBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `albert-xxlarge-finetuned-squad` is an English model originally trained by `rahulchakwate`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_xxlarge_finetuned_squad_en_4.0.0_3.0_1656063867194.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_xxlarge_finetuned_squad_en_4.0.0_3.0_1656063867194.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_xxlarge_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_xxlarge_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.albert.xxl").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
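In the NLU one-liner above, the question and context are packed into a single string separated by `|||`. A sketch of building that input (the helper name is ours, not an nlu API):

```python
# Hypothetical helper producing the "question|||context" string that
# nlu question-answering loaders expect.
def to_nlu_qa_input(question: str, context: str) -> str:
    return f"{question}|||{context}"

print(to_nlu_qa_input("What is my name?", "My name is Clara and I live in Berkeley."))
```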
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_qa_xxlarge_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|772.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/rahulchakwate/albert-xxlarge-finetuned-squad --- layout: model title: Sentence Entity Resolver for RxNorm (sbertresolve_rxnorm_disposition) author: John Snow Labs name: sbertresolve_rxnorm_disposition date: 2021-08-28 tags: [rxnorm, en, licensed, clinical] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.1.3 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps medication entities (like drugs/ingredients) to RxNorm codes and their dispositions using `sbert_jsl_medium_uncased` Sentence BERT embeddings. If you are looking for faster inference with just drug names (excluding dosage and strength), this version of the RxNorm model is a better alternative. ## Predicted Entities Predicts RxNorm codes and their normalized definitions for each chunk, plus dispositions if any. In the result, look for the aux_label field in the metadata to get dispositions separated by `|`.
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_rxnorm_disposition_en_3.1.3_2.4_1630175088499.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_rxnorm_disposition_en_3.1.3_2.4_1630175088499.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings.pretrained('sbert_jsl_medium_uncased', 'en','clinical/models')\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_rxnorm_disposition", "en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("rxnorm_code")\ .setDistanceFunction("EUCLIDEAN") rxnorm_pipelineModel = PipelineModel( stages = [ documentAssembler, sbert_embedder, rxnorm_resolver]) rxnorm_lp = LightPipeline(rxnorm_pipelineModel) result = rxnorm_lp.fullAnnotate("alizapride 25 mg/ml") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased", "en","clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sbert_embeddings") val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_rxnorm_disposition", "en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("rxnorm_code") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, rxnorm_resolver)) val pipelineModel = pipeline.fit(Seq("").toDS.toDF("text")) val rxnorm_lp = new LightPipeline(pipelineModel) val result = rxnorm_lp.fullAnnotate("alizapride 25 mg/ml") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.rxnorm.disposition").predict("""alizapride 25 mg/ml""") ```
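The dispositions arrive in the `aux_label` metadata field, `|`-separated, with `-` marking candidate codes that have no recorded disposition. A post-processing sketch (the metadata string below is illustrative):

```python
# aux_label packs one disposition per candidate code, separated by "|";
# "-" means no disposition is recorded for that code.
aux_label = "Dopamine receptor antagonist|Dopamine receptor antagonist|-|-"
dispositions = [d for d in aux_label.split("|") if d != "-"]
print(dispositions)
```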
## Results ```bash | | chunks | code | resolutions | all_codes | all_k_aux_labels | all_distances | |---:|:-------------------|:-------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------| | 0 |alizapride 25 mg/ml | 330948 | [alizapride 25 mg/ml, alizapride 50 mg, alizapride 25 mg/ml oral solution, adalimumab 50 mg/ml, adalimumab 100 mg/ml [humira], adalimumab 50 mg/ml [humira], alirocumab 150 mg/ml, ...]| [330948, 330949, 249531, 358817, 1726845, 576023, 1659153, ...] | [Dopamine receptor antagonist, Dopamine receptor antagonist, Dopamine receptor antagonist, -, -, -, -, ...] | [0.0000, 0.0936, 0.1166, 0.1525, 0.1584, 0.1567, 0.1631, ...] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbertresolve_rxnorm_disposition| |Compatibility:|Healthcare NLP 3.1.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk, sbert_embeddings]| |Output Labels:|[new_code]| |Language:|en| |Case sensitive:|false| --- layout: model title: Financial NER on Responsibility and ESG Reports(Medium) author: John Snow Labs name: finner_responsibility_reports_md date: 2023-05-14 tags: [en, ner, finance, licensed, responsibility, reports] task: Named Entity Recognition language: en edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: FinanceNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Financial NER model can extract up to 20 quantifiable entities, including KPI, from the Responsibility and ESG Reports of companies. This `medium` model has been trained with more data. 
If you are looking for the `small` version of this model, you can find it [here](https://nlp.johnsnowlabs.com/2023/03/09/finner_responsibility_reports_en.html). ## Predicted Entities `AGE`, `AMOUNT`, `COUNTABLE_ITEM`, `DATE_PERIOD`, `ECONOMIC_ACTION`, `ECONOMIC_KPI`, `ENVIRONMENTAL_ACTION`, `ENVIRONMENTAL_KPI`, `ENVIRONMENTAL_UNIT`, `ESG_ROLE`, `FACILITY_PLACE`, `ISO`, `PERCENTAGE`, `PROFESSIONAL_GROUP`, `RELATIVE_METRIC`, `SOCIAL_ACTION`, `SOCIAL_KPI`, `TARGET_GROUP`, `TARGET_GROUP_BUSINESS`, `WASTE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_responsibility_reports_md_en_1.0.0_3.0_1684066884328.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_responsibility_reports_md_en_1.0.0_3.0_1684066884328.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ sentence_detector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence")\ tokenizer = nlp.Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token")\ .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'", '%', '&']) embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = finance.NerModel.pretrained("finner_responsibility_reports_md", "en", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = """The company has reduced its direct GHG emissions from 12,135 million tonnes of CO2e in 2017 to 4 million tonnes of CO2e in 2021. The indirect GHG emissions (scope 2) are mainly from imported energy, including electricity, heat, steam, and cooling, and the company has reduced its scope 2 emissions from 3 million tonnes of CO2e in 2017-2018 to 4 million tonnes of CO2e in 2020-2021. 
The scope 3 emissions are mainly from the use of sold products, and the emissions have increased from 377 million tonnes of CO2e in 2017 to 408 million tonnes of CO2e in 2021.""" data = spark.createDataFrame([[text]]).toDF("text") result = model.transform(data) from pyspark.sql import functions as F result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \ .select(F.expr("cols['0']").alias("chunk"), F.expr("cols['1']['entity']").alias("label")).show(50, truncate = False) ```
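Counting how often each label fires is then a one-liner over the collected (chunk, label) rows; the pairs below are illustrative, echoing the Results table:

```python
from collections import Counter

# Illustrative (chunk, label) rows as collected from the ner_chunk column.
rows = [
    ("direct GHG emissions", "ENVIRONMENTAL_KPI"),
    ("12,135 million", "AMOUNT"),
    ("tonnes of CO2e", "ENVIRONMENTAL_UNIT"),
    ("2017", "DATE_PERIOD"),
    ("4 million", "AMOUNT"),
]

# Tally entity labels to see which KPI families dominate a report.
label_counts = Counter(label for _, label in rows)
print(label_counts.most_common())
```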
## Results ```bash +----------------------+------------------+ |chunk |label | +----------------------+------------------+ |direct GHG emissions |ENVIRONMENTAL_KPI | |12,135 million |AMOUNT | |tonnes of CO2e |ENVIRONMENTAL_UNIT| |2017 |DATE_PERIOD | |4 million |AMOUNT | |tonnes of CO2e |ENVIRONMENTAL_UNIT| |2021 |DATE_PERIOD | |indirect GHG emissions|ENVIRONMENTAL_KPI | |scope 2 |ENVIRONMENTAL_KPI | |imported energy |ENVIRONMENTAL_KPI | |electricity |ENVIRONMENTAL_KPI | |heat |ENVIRONMENTAL_KPI | |steam |ENVIRONMENTAL_KPI | |cooling |ENVIRONMENTAL_KPI | |scope 2 emissions |ENVIRONMENTAL_KPI | |3 million |AMOUNT | |tonnes of CO2e |ENVIRONMENTAL_UNIT| |2017-2018 |DATE_PERIOD | |4 million |AMOUNT | |tonnes of CO2e |ENVIRONMENTAL_UNIT| |2020-2021 |DATE_PERIOD | |scope 3 emissions |ENVIRONMENTAL_KPI | |sold |ECONOMIC_ACTION | |products |SOCIAL_KPI | |emissions |ENVIRONMENTAL_KPI | |377 million |AMOUNT | |tonnes of CO2e |ENVIRONMENTAL_UNIT| |2017 |DATE_PERIOD | |408 million |AMOUNT | |tonnes of CO2e |ENVIRONMENTAL_UNIT| |2021 |DATE_PERIOD | +----------------------+------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_responsibility_reports_md| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.4 MB| ## References In-house annotations on Responsibility and ESG Reports ## Benchmarking ```bash label precision recall f1-score support B-AMOUNT 0.97 0.97 0.97 1207 I-AMOUNT 0.97 0.94 0.96 361 B-ENVIRONMENTAL_KPI 0.79 0.81 0.80 1051 I-ENVIRONMENTAL_KPI 0.74 0.88 0.81 716 B-DATE_PERIOD 0.94 0.95 0.94 980 I-DATE_PERIOD 0.90 0.95 0.92 498 B-PERCENTAGE 0.99 0.99 0.99 695 I-PERCENTAGE 0.99 1.00 1.00 692 B-SOCIAL_KPI 0.66 0.74 0.70 481 I-SOCIAL_KPI 0.56 0.33 0.41 43 B-ENVIRONMENTAL_UNIT 0.94 0.96 0.95 459 I-ENVIRONMENTAL_UNIT 0.91 0.86 0.88 268 B-PROFESSIONAL_GROUP 0.85 0.92 0.88 358 I-PROFESSIONAL_GROUP 0.94 0.94 
0.94 32 B-TARGET_GROUP 0.89 0.85 0.87 337 I-TARGET_GROUP 0.76 0.95 0.84 59 B-ENVIRONMENTAL_ACTION 0.72 0.68 0.70 341 I-ENVIRONMENTAL_ACTION 1.00 0.56 0.71 18 B-SOCIAL_ACTION 0.59 0.72 0.65 241 B-ESG_ROLE 0.76 0.72 0.74 109 I-ESG_ROLE 0.81 0.84 0.83 305 B-ECONOMIC_KPI 0.77 0.67 0.71 219 I-ECONOMIC_KPI 0.47 0.70 0.56 50 B-RELATIVE_METRIC 0.92 0.98 0.95 147 I-RELATIVE_METRIC 0.89 0.99 0.94 178 B-FACILITY_PLACE 0.74 0.89 0.81 139 I-FACILITY_PLACE 0.77 0.93 0.84 89 B-COUNTABLE_ITEM 0.64 0.69 0.67 154 I-COUNTABLE_ITEM 0.25 1.00 0.40 1 B-WASTE 0.84 0.64 0.73 126 I-WASTE 0.91 0.51 0.65 57 B-ECONOMIC_ACTION 0.73 0.74 0.73 91 I-ECONOMIC_ACTION 0.00 0.00 0.00 1 B-TARGET_GROUP_BUSINESS 0.93 0.85 0.89 74 I-TARGET_GROUP_BUSINESS 0.00 0.00 0.00 1 B-AGE 0.74 0.70 0.72 37 I-AGE 0.90 0.65 0.75 40 B-ISO 0.84 0.72 0.78 36 I-ISO 0.91 0.80 0.85 25 micro avg 0.86 0.88 0.87 10716 macro avg 0.75 0.75 0.74 10716 weighted avg 0.86 0.88 0.87 10716 ``` --- layout: model title: Fast Neural Machine Translation Model from Philippine Languages to English author: John Snow Labs name: opus_mt_phi_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, phi, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `phi` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_phi_en_xx_2.7.0_2.4_1609164667695.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_phi_en_xx_2.7.0_2.4_1609164667695.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_phi_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_phi_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.phi.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
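Because Marian is costly on long inputs, the pipeline splits text into sentences before translating. Outside Spark, a rough regex splitter produces the same shape of input; this is only a sketch, and `SentenceDetectorDLModel` remains the production route:

```python
import re

def rough_sentences(text: str):
    # Naive split on sentence-ending punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(rough_sentences("First sentence. Second one! A third?"))
```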
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_phi_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Dutch RobertaForQuestionAnswering Base Cased model (from Nadav) author: John Snow Labs name: roberta_qa_robbert_base_squad_finetuned_on_runaways date: 2023-01-20 tags: [nl, open_source, roberta, question_answering, tensorflow] task: Question Answering language: nl edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `robbert-base-squad-finetuned-on-runaways-nl` is a Dutch model originally trained by `Nadav`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_robbert_base_squad_finetuned_on_runaways_nl_4.3.0_3.0_1674212455477.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_robbert_base_squad_finetuned_on_runaways_nl_4.3.0_3.0_1674212455477.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_robbert_base_squad_finetuned_on_runaways","nl")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_robbert_base_squad_finetuned_on_runaways","nl") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_robbert_base_squad_finetuned_on_runaways| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|nl| |Size:|436.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Nadav/robbert-base-squad-finetuned-on-runaways-nl --- layout: model title: English asr_wav2vec2_base_checkpoint_14 TFWav2Vec2ForCTC from jiobiala24 author: John Snow Labs name: asr_wav2vec2_base_checkpoint_14 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_checkpoint_14` is an English model originally trained by jiobiala24. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_checkpoint_14_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_checkpoint_14_en_4.2.0_3.0_1664114893660.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_checkpoint_14_en_4.2.0_3.0_1664114893660.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_checkpoint_14", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_checkpoint_14", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
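The snippets above assume an `audioDf` whose `audio_content` column holds the raw waveform as floats. A sketch of converting 16-bit PCM bytes into that shape (the Spark step is commented out; the sample values are synthetic):

```python
import struct

def pcm16_to_floats(raw: bytes) -> list:
    """Decode little-endian signed 16-bit PCM into floats in [-1.0, 1.0)."""
    count = len(raw) // 2
    samples = struct.unpack("<%dh" % count, raw)
    return [s / 32768.0 for s in samples]

raw = struct.pack("<4h", 0, 16384, -16384, 32767)  # synthetic samples
floats = pcm16_to_floats(raw)
print(floats[:2])
# audioDf = spark.createDataFrame([(floats,)], ["audio_content"])  # Spark side
```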
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_checkpoint_14| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|349.1 MB| --- layout: model title: English asr_wav2vec2_base_timit_demo_colab2_by_sherry7144 TFWav2Vec2ForCTC from sherry7144 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab2_by_sherry7144 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab2_by_sherry7144` is an English model originally trained by sherry7144. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab2_by_sherry7144_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab2_by_sherry7144_en_4.2.0_3.0_1664025098987.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab2_by_sherry7144_en_4.2.0_3.0_1664025098987.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab2_by_sherry7144', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab2_by_sherry7144", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab2_by_sherry7144| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English DistilBertForQuestionAnswering Cased model (from harikp20) author: John Snow Labs name: distilbert_qa_hkp24 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `hkp24` is an English model originally trained by `harikp20`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_hkp24_en_4.3.0_3.0_1672775193434.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_hkp24_en_4.3.0_3.0_1672775193434.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hkp24","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hkp24","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_hkp24| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/harikp20/hkp24 --- layout: model title: English RobertaForQuestionAnswering (from shmuelamar) author: John Snow Labs name: roberta_qa_REQA_RoBERTa date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `REQA-RoBERTa` is an English model originally trained by `shmuelamar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_REQA_RoBERTa_en_4.0.0_3.0_1655726985824.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_REQA_RoBERTa_en_4.0.0_3.0_1655726985824.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_REQA_RoBERTa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_REQA_RoBERTa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.by_shmuelamar").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_REQA_RoBERTa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/shmuelamar/REQA-RoBERTa --- layout: model title: Sentence Detection in Multi-lingual Texts author: John Snow Labs name: sentence_detector_dl date: 2021-01-02 task: Sentence Detection language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, xx, sentence_detection] supported: true annotator: SentenceDetectorDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_xx_2.7.0_2.4_1609610616998.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_xx_2.7.0_2.4_1609610616998.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "xx") \ .setInputCols(["document"]) \ .setOutputCol("sentences") sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL])) sd_model.fullAnnotate("""Όπως ίσως θα γνωρίζει, όταν εγκαθιστάς μια νέα εφαρμογή, θα έχεις διαπιστώσει λίγο μετά, ότι το PC αρχίζει να επιβραδύνεται. Στη συνέχεια, όταν επισκέπτεσαι την οθόνη ή από την διαχείριση εργασιών, θα διαπιστώσεις ότι η εν λόγω εφαρμογή έχει προστεθεί στη λίστα των προγραμμάτων που εκκινούν αυτόματα, όταν ξεκινάς το PC. Προφανώς, κάτι τέτοιο δεν αποτελεί μια ιδανική κατάσταση, ιδίως για τους λιγότερο γνώστες, οι οποίοι ίσως δεν θα συνειδητοποιήσουν ότι κάτι τέτοιο συνέβη. Όσο περισσότερες εφαρμογές στη λίστα αυτή, τόσο πιο αργή γίνεται η εκκίνηση, ιδίως αν πρόκειται για απαιτητικές εφαρμογές. Τα ευχάριστα νέα είναι ότι η τελευταία και πιο πρόσφατη preview build της έκδοσης των Windows 10 που θα καταφθάσει στο πρώτο μισό του 2021, οι εφαρμογές θα ενημερώνουν το χρήστη ότι έχουν προστεθεί στη λίστα των εφαρμογών που εκκινούν μόλις ανοίγεις το PC.""") ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentences") val pipeline = new Pipeline().setStages(Array(documenter, model)) val data = Seq("Όπως ίσως θα γνωρίζει, όταν εγκαθιστάς μια νέα εφαρμογή, θα έχεις διαπιστώσει λίγο μετά, ότι το PC αρχίζει να επιβραδύνεται. Στη συνέχεια, όταν επισκέπτεσαι την οθόνη ή από την διαχείριση εργασιών, θα διαπιστώσεις ότι η εν λόγω εφαρμογή έχει προστεθεί στη λίστα των προγραμμάτων που εκκινούν αυτόματα, όταν ξεκινάς το PC. Προφανώς, κάτι τέτοιο δεν αποτελεί μια ιδανική κατάσταση, ιδίως για τους λιγότερο γνώστες, οι οποίοι ίσως δεν θα συνειδητοποιήσουν ότι κάτι τέτοιο συνέβη. Όσο περισσότερες εφαρμογές στη λίστα αυτή, τόσο πιο αργή γίνεται η εκκίνηση, ιδίως αν πρόκειται για απαιτητικές εφαρμογές. Τα ευχάριστα νέα είναι ότι η τελευταία και πιο πρόσφατη preview build της έκδοσης των Windows 10 που θα καταφθάσει στο πρώτο μισό του 2021, οι εφαρμογές θα ενημερώνουν το χρήστη ότι έχουν προστεθεί στη λίστα των εφαρμογών που εκκινούν μόλις ανοίγεις το PC.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("xx.sentence_detector").predict("""Όπως ίσως θα γνωρίζει, όταν εγκαθιστάς μια νέα εφαρμογή, θα έχεις διαπιστώσει λίγο μετά, ότι το PC αρχίζει να επιβραδύνεται. Στη συνέχεια, όταν επισκέπτεσαι την οθόνη ή από την διαχείριση εργασιών, θα διαπιστώσεις ότι η εν λόγω εφαρμογή έχει προστεθεί στη λίστα των προγραμμάτων που εκκινούν αυτόματα, όταν ξεκινάς το PC. Προφανώς, κάτι τέτοιο δεν αποτελεί μια ιδανική κατάσταση, ιδίως για τους λιγότερο γνώστες, οι οποίοι ίσως δεν θα συνειδητοποιήσουν ότι κάτι τέτοιο συνέβη. Όσο περισσότερες εφαρμογές στη λίστα αυτή, τόσο πιο αργή γίνεται η εκκίνηση, ιδίως αν πρόκειται για απαιτητικές εφαρμογές. Τα ευχάριστα νέα είναι ότι η τελευταία και πιο πρόσφατη preview build της έκδοσης των Windows 10 που θα καταφθάσει στο πρώτο μισό του 2021, οι εφαρμογές θα ενημερώνουν το χρήστη ότι έχουν προστεθεί στη λίστα των εφαρμογών που εκκινούν μόλις ανοίγεις το PC.""") ```
## Results ```bash +---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | 0 | Όπως ίσως θα γνωρίζει, όταν εγκαθιστάς μια νέα εφαρμογή, θα έχεις διαπιστώσει λίγο μετά, ότι το PC αρχίζει να επιβραδύνεται. | | 1 | Στη συνέχεια, όταν επισκέπτεσαι την οθόνη ή από την διαχείριση εργασιών, θα διαπιστώσεις ότι η εν λόγω εφαρμογή έχει προστεθεί στη λίστα των προγραμμάτων που εκκινούν αυτόματα, όταν ξεκινάς το PC. | | 2 | Προφανώς, κάτι τέτοιο δεν αποτελεί μια ιδανική κατάσταση, ιδίως για τους λιγότερο γνώστες, οι οποίοι ίσως δεν θα συνειδητοποιήσουν ότι κάτι τέτοιο συνέβη. | | 3 | Όσο περισσότερες εφαρμογές στη λίστα αυτή, τόσο πιο αργή γίνεται η εκκίνηση, ιδίως αν πρόκειται για απαιτητικές εφαρμογές. | | 4 | Τα ευχάριστα νέα είναι ότι η τελευταία και πιο πρόσφατη preview build της έκδοσης των Windows 10 που θα καταφθάσει στο πρώτο μισό του 2021, οι εφαρμογές θα ενημερώνουν το χρήστη ότι έχουν προστεθεί στη λίστα των εφαρμογών που εκκινούν μόλις ανοίγεις το PC. 
| +---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentence_detector_dl| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[sentences]| |Language:|xx| ## Data Source --- layout: model title: Catalan RobertaForQuestionAnswering Base Cased model (from projecte-aina) author: John Snow Labs name: roberta_qa_base_ca_v2_cased date: 2022-12-02 tags: [ca, open_source, roberta, question_answering, tensorflow] task: Question Answering language: ca edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-ca-v2-cased-qa` is a Catalan model originally trained by `projecte-aina`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_ca_v2_cased_ca_4.2.4_3.0_1669986112036.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_ca_v2_cased_ca_4.2.4_3.0_1669986112036.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_ca_v2_cased","ca")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_ca_v2_cased","ca") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_ca_v2_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ca| |Size:|455.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/projecte-aina/roberta-base-ca-v2-cased-qa - https://arxiv.org/abs/1907.11692 - https://github.com/projecte-aina/club - https://www.apache.org/licenses/LICENSE-2.0 - https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca%7Cen - https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina - https://paperswithcode.com/sota?task=question-answering&dataset=CatalanQA --- layout: model title: Language Detection & Identification Pipeline - 99 Languages author: John Snow Labs name: detect_language_99 date: 2020-12-05 task: [Pipeline Public, Language Detection, Sentence Detection] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [language_detection, open_source, pipeline, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Language detection and identification is the task of automatically detecting the language(s) present in a document based on the content of the document. ``LanguageDetectorDL`` is an annotator that detects the language of documents or sentences depending on the ``inputCols``. In addition, ``LanguageDetectorDL`` can accurately detect language from documents with mixed languages by coalescing sentences and selecting the best candidate. We have designed and developed Deep Learning models using CNN architectures in TensorFlow/Keras. The models are trained on large datasets such as Wikipedia and Tatoeba, and achieve high accuracy when evaluated on the Europarl dataset. 
The output is a language code in Wiki Code style: [https://en.wikipedia.org/wiki/List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias). This pipeline can detect the following languages: ## Predicted Entities `Afrikaans`, `Arabic`, `Algerian Arabic`, `Assamese`, `Kotava`, `Azerbaijani`, `Belarusian`, `Bengali`, `Berber`, `Breton`, `Bulgarian`, `Catalan`, `Chavacano`, `Cebuano`, `Czech`, `Chuvash`, `Mandarin Chinese`, `Cornish`, `Danish`, `German`, `Central Dusun`, `Modern Greek (1453-)`, `English`, `Esperanto`, `Estonian`, `Basque`, `Finnish`, `French`, `Guadeloupean Creole French`, `Irish`, `Galician`, `Gronings`, `Guarani`, `Hebrew`, `Hindi`, `Croatian`, `Hungarian`, `Armenian`, `Ido`, `Interlingue`, `Ilocano`, `Interlingua`, `Indonesian`, `Icelandic`, `Italian`, `Lojban`, `Japanese`, `Kabyle`, `Georgian`, `Kazakh`, `Khasi`, `Khmer`, `Korean`, `Coastal Kadazan`, `Latin`, `Lingua Franca Nova`, `Lithuanian`, `Latvian`, `Literary Chinese`, `Marathi`, `Meadow Mari`, `Macedonian`, `Low German (Low Saxon)`, `Dutch`, `Norwegian Nynorsk`, `Norwegian Bokmål`, `Occitan`, `Ottoman Turkish`, `Kapampangan`, `Picard`, `Persian`, `Polish`, `Portuguese`, `Romanian`, `Kirundi`, `Russian`, `Slovak`, `Spanish`, `Albanian`, `Serbian`, `Swedish`, `Swabian`, `Tatar`, `Tagalog`, `Thai`, `Klingon`, `Toki Pona`, `Turkmen`, `Turkish`, `Uyghur`, `Ukrainian`, `Urdu`, `Vietnamese`, `Volapük`, `Waray`, `Shanghainese`, `Yiddish`, `Cantonese`, `Malay`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/LANGUAGE_DETECTOR/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/detect_language_99_xx_2.7.0_2.4_1607185604600.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/detect_language_99_xx_2.7.0_2.4_1607185604600.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("detect_language_99", lang = "xx") pipeline.annotate("French author who helped pioneer the science-fiction genre.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("detect_language_99", lang = "xx") pipeline.annotate("French author who helped pioneer the science-fiction genre.") ``` {:.nlu-block} ```python import nlu text = ["French author who helped pioneer the science-fiction genre."] lang_df = nlu.load("xx.classify.lang.99").predict(text) lang_df ```
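The description above notes that `LanguageDetectorDL` handles mixed-language documents by coalescing sentence-level predictions into a single best candidate. One simple coalescing strategy (an illustration only, not the annotator's actual implementation) is to average each language's confidence across sentences and take the maximum:

```python
from collections import defaultdict

def coalesce_language(sentence_scores):
    """Pick an overall language for a document.

    sentence_scores: list of dicts mapping language code -> confidence,
    one dict per detected sentence. Returns the language with the
    highest mean confidence across all sentences.
    """
    totals = defaultdict(float)
    for scores in sentence_scores:
        for lang, conf in scores.items():
            totals[lang] += conf
    n = len(sentence_scores)
    return max(totals, key=lambda lang: totals[lang] / n)

# Two English-leaning sentences outweigh one French-leaning sentence:
print(coalesce_language([
    {"en": 0.9, "fr": 0.1},
    {"en": 0.8, "fr": 0.2},
    {"en": 0.2, "fr": 0.8},
]))  # en
```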
## Results ```bash {'document': ['French author who helped pioneer the science-fiction genre.'], 'language': ['en']} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|detect_language_99| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - LanguageDetectorDL --- layout: model title: Pipeline to Detect Chemicals in Medical Text author: John Snow Labs name: bert_token_classifier_ner_bc5cdr_chemicals_pipeline date: 2023-03-20 tags: [en, ner, clinical, licensed, bertfortokenclasification] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [bert_token_classifier_ner_bc5cdr_chemicals](https://nlp.johnsnowlabs.com/2022/07/25/bert_token_classifier_ner_bc5cdr_chemicals_en_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc5cdr_chemicals_pipeline_en_4.3.0_3.2_1679301940550.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc5cdr_chemicals_pipeline_en_4.3.0_3.2_1679301940550.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_bc5cdr_chemicals_pipeline", "en", "clinical/models") text = '''The possibilities that these cardiovascular findings might be the result of non-selective inhibition of monoamine oxidase or of amphetamine and metamphetamine are discussed. The results have shown that the degradation product p-choloroaniline is not a significant factor in chlorhexidine-digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone-iodine irrigations were associated with erosive cystitis and suggested a possible complication with human usage.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_bc5cdr_chemicals_pipeline", "en", "clinical/models") val text = "The possibilities that these cardiovascular findings might be the result of non-selective inhibition of monoamine oxidase or of amphetamine and metamphetamine are discussed. The results have shown that the degradation product p-choloroaniline is not a significant factor in chlorhexidine-digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone-iodine irrigations were associated with erosive cystitis and suggested a possible complication with human usage." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:--------------------------|--------:|------:|:------------|-------------:| | 0 | amphetamine | 128 | 138 | CHEM | 0.999973 | | 1 | metamphetamine | 144 | 157 | CHEM | 0.999972 | | 2 | p-choloroaniline | 226 | 241 | CHEM | 0.588953 | | 3 | chlorhexidine-digluconate | 274 | 298 | CHEM | 0.999979 | | 4 | kanamycin | 350 | 358 | CHEM | 0.999978 | | 5 | colistin | 362 | 369 | CHEM | 0.999942 | | 6 | povidone-iodine | 375 | 389 | CHEM | 0.999977 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_bc5cdr_chemicals_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: German BertForMaskedLM Large Cased model (from deepset) author: John Snow Labs name: bert_embeddings_g_large date: 2022-12-06 tags: [de, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: de edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `gbert-large` is a German model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_g_large_de_4.2.4_3.0_1670326448640.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_g_large_de_4.2.4_3.0_1670326448640.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_g_large","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_g_large","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_g_large| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|1.3 GB| |Case sensitive:|true| ## References - https://huggingface.co/deepset/gbert-large - https://arxiv.org/pdf/2010.10906.pdf - http://deepset.ai/ - https://haystack.deepset.ai/ - https://deepset.ai/german-bert - https://deepset.ai/germanquad - https://github.com/deepset-ai/haystack - https://docs.haystack.deepset.ai - https://haystack.deepset.ai/community - https://twitter.com/deepset_ai - https://www.linkedin.com/company/deepset-ai/ - https://github.com/deepset-ai/haystack/discussions - http://www.deepset.ai/jobs --- layout: model title: Language Detection & Identification Pipeline - 21 Languages author: John Snow Labs name: detect_language_21 date: 2020-12-05 task: [Pipeline Public, Language Detection, Sentence Detection] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [language_detection, open_source, pipeline, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Language detection and identification is the task of automatically detecting the language(s) present in a document based on the content of the document. ``LanguageDetectorDL`` is an annotator that detects the language of documents or sentences depending on the ``inputCols``. In addition, ``LanguageDetectorDL`` can accurately detect language from documents with mixed languages by coalescing sentences and selecting the best candidate. We have designed and developed Deep Learning models using CNN architectures in TensorFlow/Keras. The models are trained on large datasets such as Wikipedia and Tatoeba, and achieve high accuracy when evaluated on the Europarl dataset. 
The output is a language code in Wiki Code style: [https://en.wikipedia.org/wiki/List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias). This pipeline can detect the following languages: ## Predicted Entities `Bulgarian`, `Czech`, `Danish`, `German`, `Greek`, `English`, `Estonian`, `Finnish`, `French`, `Hungarian`, `Italian`, `Lithuanian`, `Latvian`, `Dutch`, `Polish`, `Portuguese`, `Romanian`, `Slovak`, `Slovenian`, `Spanish`, `Swedish`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/LANGUAGE_DETECTOR/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/detect_language_21_xx_2.7.0_2.4_1607181080664.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/detect_language_21_xx_2.7.0_2.4_1607181080664.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("detect_language_21", lang = "xx") pipeline.annotate("French author who helped pioneer the science-fiction genre.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("detect_language_21", lang = "xx") pipeline.annotate("French author who helped pioneer the science-fiction genre.") ``` {:.nlu-block} ```python import nlu text = ["French author who helped pioneer the science-fiction genre."] lang_df = nlu.load("xx.classify.lang.21").predict(text) lang_df ```
## Results ```bash {'document': ['French author who helped pioneer the science-fiction genre.'], 'language': ['en']} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|detect_language_21| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - LanguageDetectorDL --- layout: model title: Legal Equity Distribution Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_equity_distribution_agreement_bert date: 2022-11-24 tags: [en, legal, classification, agreement, equity_distribution, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_equity_distribution_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `equity-distribution-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `equity-distribution-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_equity_distribution_agreement_bert_en_1.0.0_3.0_1669311256795.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_equity_distribution_agreement_bert_en_1.0.0_3.0_1669311256795.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_equity_distribution_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[equity-distribution-agreement]| |[other]| |[other]| |[equity-distribution-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_equity_distribution_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support equity-distribution-agreement 0.90 0.97 0.94 38 other 0.98 0.94 0.96 65 accuracy - - 0.95 103 macro-avg 0.94 0.96 0.95 103 weighted-avg 0.95 0.95 0.95 103 ``` --- layout: model title: English image_classifier_vit_llama_or_what ViTForImageClassification from firebolt author: John Snow Labs name: image_classifier_vit_llama_or_what date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_llama_or_what` is an English model originally trained by firebolt. ## Predicted Entities `alpaca`, `guanaco`, `llama`, `vicuna` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_llama_or_what_en_4.1.0_3.0_1660170360179.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_llama_or_what_en_4.1.0_3.0_1660170360179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_llama_or_what", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_llama_or_what", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_llama_or_what| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_chaitanya97 TFWav2Vec2ForCTC from chaitanya97 author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_chaitanya97 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_chaitanya97` is an English model originally trained by chaitanya97. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_chaitanya97_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_chaitanya97_en_4.2.0_3.0_1664036108476.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_chaitanya97_en_4.2.0_3.0_1664036108476.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_chaitanya97', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_chaitanya97", lang = "en") val annotations = pipeline.transform(audioDF) ```
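The `audioDF` passed to `transform` above must hold the raw waveform as an array of floats (the `audio_content` column consumed by `AudioAssembler` in the non-pipeline examples elsewhere on this page). As a minimal sketch of just the decoding step — standard library only, with an in-memory WAV standing in for a real recording — a 16-bit mono PCM file can be converted like this:

```python
import io
import struct
import wave

def wav_to_floats(wav_bytes: bytes) -> list:
    """Decode a 16-bit PCM mono WAV into floats in [-1.0, 1.0]."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        assert wf.getsampwidth() == 2 and wf.getnchannels() == 1
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Demo: write a tiny 4-sample WAV in memory, then decode it back to floats.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)          # 16-bit samples
    wf.setframerate(16000)      # Wav2Vec2 models typically expect 16 kHz audio
    wf.writeframes(struct.pack("<4h", 0, 1000, -1000, 0))

floats = wav_to_floats(buf.getvalue())
print(len(floats))  # 4
```

The resulting list of floats is what you would put into each row of the audio DataFrame; the 16 kHz rate is the usual assumption for Wav2Vec2-family models, so resample beforehand if your source differs.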
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_chaitanya97| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal Public announcements Clause Binary Classifier author: John Snow Labs name: legclf_public_announcements_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `public-announcements` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
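The first technique above — paragraph splitting by multiline — can be sketched in a few lines of plain Python (an illustration of the idea, not the workshop's own helper):

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document on blank lines, dropping empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# A toy contract with three clauses separated by blank lines.
doc = (
    "PUBLIC ANNOUNCEMENTS. Neither party shall issue any press release...\n"
    "\n"
    "GOVERNING LAW. This Agreement shall be governed by...\n"
    "\n"
    "SEVERABILITY. If any provision is held invalid..."
)

clauses = split_paragraphs(doc)
print(len(clauses))  # 3
```

Each resulting paragraph can then become one row of the `clause_text` column consumed by the classifier pipeline, so the model sees clause-sized inputs rather than the whole contract.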
## Predicted Entities `other`, `public-announcements` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_public_announcements_clause_en_1.0.0_3.2_1660122864114.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_public_announcements_clause_en_1.0.0_3.2_1660122864114.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_public_announcements_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[public-announcements]| |[other]| |[other]| |[public-announcements]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_public_announcements_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 0.99 0.99 86 public-announcements 0.97 1.00 0.99 36 accuracy - - 0.99 122 macro-avg 0.99 0.99 0.99 122 weighted-avg 0.99 0.99 0.99 122 ``` --- layout: model title: Legal Management Agreement Document Binary Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_management_agreement_bert date: 2022-12-18 tags: [en, legal, classification, licensed, document, bert, management, agreement, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_management_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `management-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `management-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_management_agreement_bert_en_1.0.0_3.0_1671393846956.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_management_agreement_bert_en_1.0.0_3.0_1671393846956.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_management_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[management-agreement]| |[other]| |[other]| |[management-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_management_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support management-agreement 0.94 0.94 0.94 110 other 0.97 0.97 0.97 204 accuracy - - 0.96 314 macro-avg 0.96 0.95 0.95 314 weighted-avg 0.96 0.96 0.96 314 ``` --- layout: model title: French CamemBert Embeddings (from katrin-kc) author: John Snow Labs name: camembert_embeddings_katrin_kc_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `katrin-kc`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_katrin_kc_generic_model_fr_3.4.4_3.0_1653989117966.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_katrin_kc_generic_model_fr_3.4.4_3.0_1653989117966.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_katrin_kc_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_katrin_kc_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_katrin_kc_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/katrin-kc/dummy-model --- layout: model title: English asr_maialong_model TFWav2Vec2ForCTC from buidung2004 author: John Snow Labs name: asr_maialong_model date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_maialong_model` is an English model originally trained by buidung2004. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_maialong_model_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_maialong_model_en_4.2.0_3.0_1664023411018.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_maialong_model_en_4.2.0_3.0_1664023411018.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_maialong_model", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_maialong_model", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_maialong_model| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|227.6 MB| --- layout: model title: English RobertaForQuestionAnswering (from emr-se-miniproject) author: John Snow Labs name: roberta_qa_roberta_base_emr date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-emr` is an English model originally trained by `emr-se-miniproject`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_emr_en_4.0.0_3.0_1655730544403.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_emr_en_4.0.0_3.0_1655730544403.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_emr","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_emr","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.base.by_emr-se-miniproject").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
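The `predict` call in the NLU block above packs the question and its context into a single string separated by `|||`. A tiny helper makes that convention explicit (the helper name is illustrative, not part of the NLU API):

```python
def join_qa(question: str, context: str, sep: str = "|||") -> str:
    """Join a question and its context the way the NLU one-liner above expects."""
    return f"{question}{sep}{context}"

print(join_qa("What's my name?", "My name is Clara and I live in Berkeley."))
```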
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_emr| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/emr-se-miniproject/roberta-base-emr --- layout: model title: English RobertaForQuestionAnswering (from PremalMatalia) author: John Snow Labs name: roberta_qa_roberta_base_best_squad2 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-best-squad2` is an English model originally trained by `PremalMatalia`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_best_squad2_en_4.0.0_3.0_1655729882726.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_best_squad2_en_4.0.0_3.0_1655729882726.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_best_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_best_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base.by_PremalMatalia").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_best_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/PremalMatalia/roberta-base-best-squad2 --- layout: model title: English BertForQuestionAnswering model (from howey) author: John Snow Labs name: bert_qa_bert_base_uncased_squad_L6 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squad-L6` is an English model originally trained by `howey`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squad_L6_en_4.0.0_3.0_1654181343179.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squad_L6_en_4.0.0_3.0_1654181343179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_squad_L6","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_squad_L6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased_l6.by_howey").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_squad_L6| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|249.0 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/howey/bert-base-uncased-squad-L6 --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_4_h_256 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-4_H-256` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_256_zh_4.2.4_3.0_1670325925602.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_256_zh_4.2.4_3.0_1670325925602.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_256","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_256","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
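`BertEmbeddings` above emits one vector per token. When a downstream step needs a single vector per sentence, one common approach is to mean-pool the token vectors — sketched here on toy lists rather than Spark rows (a hedged illustration, not Spark NLP's own pooling):

```python
def mean_pool(token_vectors):
    """Average a list of equally sized token vectors into one sentence vector."""
    if not token_vectors:
        return []
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

# Toy 3-dimensional "embeddings" for a 2-token sentence.
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(mean_pool(tokens))  # [2.0, 3.0, 4.0]
```

In practice you would do this per row over the `embeddings` column, or use a sentence-level annotator directly when one is available.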
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_4_h_256| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|33.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-4_H-256 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Legal Enforcements Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_enforcements_bert date: 2023-03-05 tags: [en, legal, classification, clauses, enforcements, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Enforcements` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Enforcements`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_enforcements_bert_en_1.0.0_3.0_1678050541165.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_enforcements_bert_en_1.0.0_3.0_1678050541165.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_enforcements_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Enforcements]| |[Other]| |[Other]| |[Enforcements]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_enforcements_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Enforcements 0.77 0.95 0.85 21 Other 0.97 0.82 0.89 34 accuracy - - 0.87 55 macro-avg 0.87 0.89 0.87 55 weighted-avg 0.89 0.87 0.87 55 ``` --- layout: model title: English BertForQuestionAnswering model (from aodiniz) author: John Snow Labs name: bert_qa_bert_uncased_L_4_H_768_A_12_cord19_200616_squad2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-4_H-768_A-12_cord19-200616_squad2` is an English model originally trained by `aodiniz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_768_A_12_cord19_200616_squad2_en_4.0.0_3.0_1654185324833.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_768_A_12_cord19_200616_squad2_en_4.0.0_3.0_1654185324833.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_4_H_768_A_12_cord19_200616_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_uncased_L_4_H_768_A_12_cord19_200616_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_cord19.bert.uncased_4l_768d_a12a_768d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_uncased_L_4_H_768_A_12_cord19_200616_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|194.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-4_H-768_A-12_cord19-200616_squad2 --- layout: model title: Pipeline to Detect Generic PHI for Deidentification (Arabic) author: John Snow Labs name: ner_deid_generic_pipeline date: 2023-06-06 tags: [licensed, deidentification, clinical, ar, generic] task: Pipeline Healthcare language: ar edition: Healthcare NLP 4.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deid_generic](https://nlp.johnsnowlabs.com/2023/05/30/ner_deid_generic_ar.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_pipeline_ar_4.4.1_3.0_1686078043466.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_pipeline_ar_4.4.1_3.0_1686078043466.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_generic_pipeline", "ar", "clinical/models") text = '''ملاحظات سريرية - مريض الربو. التاريخ: 16 أبريل 2000. اسم المريضة: ليلى حسن. العنوان: شارع المعرفة، مبنى رقم 789، حي الأمانة، جدة. الرمز البريدي: 54321. البلد: المملكة العربية السعودية. اسم المستشفى: مستشفى النور. اسم الطبيب: د. أميرة أحمد. تفاصيل الحالة: المريضة ليلى حسن، البالغة من العمر 35 عامًا، تعاني من مرض الربو المزمن. تشكو من ضيق التنفس والسعال المتكرر والشهيق الشديد. تم تشخيصها بمرض الربو بناءً على تاريخها الطبي واختبارات وظائف الرئة. الخطة: تم وصف مضادات الالتهاب غير الستيرويدية والموسعات القصبية لتحسين التنفس وتقليل التهيج. يجب على المريضة حمل معها جهاز الاستنشاق في حالة حدوث نوبة ربو حادة. يتعين على المريضة تجنب التحسس من العوامل المسببة للربو، مثل الدخان والغبار والحيوانات الأليفة. يجب مراقبة وظائف الرئة بانتظام ومتابعة التعليمات الطبية المتعلقة بمرض الربو. تعليم المريضة بشأن كيفية استخدام جهاز الاستنشاق بشكل صحيح وتقنيات التنفس الصحيح. ''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_generic_pipeline", "ar", "clinical/models") val text = "ملاحظات سريرية - مريض الربو. التاريخ: 16 أبريل 2000. اسم المريضة: ليلى حسن. العنوان: شارع المعرفة، مبنى رقم 789، حي الأمانة، جدة. الرمز البريدي: 54321. البلد: المملكة العربية السعودية. اسم المستشفى: مستشفى النور. اسم الطبيب: د. أميرة أحمد. تفاصيل الحالة: المريضة ليلى حسن، البالغة من العمر 35 عامًا، تعاني من مرض الربو المزمن. تشكو من ضيق التنفس والسعال المتكرر والشهيق الشديد. تم تشخيصها بمرض الربو بناءً على تاريخها الطبي واختبارات وظائف الرئة. الخطة: تم وصف مضادات الالتهاب غير الستيرويدية والموسعات القصبية لتحسين التنفس وتقليل التهيج. يجب على المريضة حمل معها جهاز الاستنشاق في حالة حدوث نوبة ربو حادة. يتعين على المريضة تجنب التحسس من العوامل المسببة للربو، مثل الدخان والغبار والحيوانات الأليفة. 
يجب مراقبة وظائف الرئة بانتظام ومتابعة التعليمات الطبية المتعلقة بمرض الربو. تعليم المريضة بشأن كيفية استخدام جهاز الاستنشاق بشكل صحيح وتقنيات التنفس الصحيح. " val result = pipeline.fullAnnotate(text) ```
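The `fullAnnotate` output can be flattened into the (chunk, entity) pairs shown in the results table. Below is a minimal sketch of that post-processing, using a small stand-in `Annotation` class instead of a live Spark session; the real Spark NLP annotation objects expose analogous `result` and `metadata` fields, so the field names here are an assumption for illustration:

```python
# Sketch of flattening NER pipeline output into (chunk, entity) pairs.
# `Annotation` is a stand-in mimicking the fields a Spark NLP annotation
# exposes (the chunk text plus a metadata dict); it is not the real class.
class Annotation:
    def __init__(self, result, metadata):
        self.result = result
        self.metadata = metadata

def chunks_and_entities(annotated_row):
    # One (chunk_text, entity_label) tuple per detected chunk.
    return [(a.result, a.metadata["entity"]) for a in annotated_row["ner_chunk"]]

# Tiny mocked row in place of `pipeline.fullAnnotate(text)[0]`.
row = {"ner_chunk": [
    Annotation("ليلى حسن", {"entity": "NAME"}),
    Annotation("مستشفى النور", {"entity": "LOCATION"}),
]}
print(chunks_and_entities(row))
```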
## Results ```bash +---------------+----------------------+ |chunks |entities | +---------------+----------------------+ |16 أبريل 2000 |DATE | |ليلى حسن |NAME | |789، |LOCATION | |الأمانة، جدة |LOCATION | |54321 |LOCATION | |المملكة العربية |LOCATION | |السعودية |LOCATION | |مستشفى النور |LOCATION | |أميرة أحمد |NAME | |ليلى |NAME | +---------------+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.1+| |License:|Licensed| |Edition:|Official| |Language:|ar| |Size:|1.2 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English Named Entity Recognition (from elastic) author: John Snow Labs name: distilbert_ner_distilbert_base_uncased_finetuned_conll03_english date: 2022-05-16 tags: [distilbert, ner, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-conll03-english` is an English model originally trained by `elastic`. 
## Predicted Entities `ORG`, `MISC`, `PER`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_distilbert_base_uncased_finetuned_conll03_english_en_3.4.2_3.0_1652721873911.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_distilbert_base_uncased_finetuned_conll03_english_en_3.4.2_3.0_1652721873911.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_distilbert_base_uncased_finetuned_conll03_english","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_distilbert_base_uncased_finetuned_conll03_english","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
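The `ner` column produced above contains one IOB-style tag per token (`B-ORG`, `I-ORG`, `O`, ...); a converter stage such as `NerConverter` merges those tags into entity chunks. A self-contained sketch of that grouping logic, not the actual Spark NLP implementation:

```python
# Group IOB-tagged tokens into (chunk_text, label) pairs, roughly what a
# NER converter stage does downstream of this token classifier.
def iob_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # flush the previous chunk
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)  # continue the open chunk
        else:
            if current:  # "O" tag or label mismatch closes the chunk
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["I", "love", "Spark", "NLP"]
tags = ["O", "O", "B-ORG", "I-ORG"]
print(iob_to_chunks(tokens, tags))  # [('Spark NLP', 'ORG')]
```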
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_ner_distilbert_base_uncased_finetuned_conll03_english| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.7 MB| |Case sensitive:|false| |Max sentence length:|128| ## References - https://huggingface.co/elastic/distilbert-base-uncased-finetuned-conll03-english --- layout: model title: Temporality / Certainty Assertion Status (md) author: John Snow Labs name: finassertiondl_time_md date: 2023-01-04 tags: [en, licensed] task: Assertion Status language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true recommended: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a medium (md) Assertion Status Model aimed at detecting temporality (PRESENT, PAST, FUTURE) or certainty (POSSIBLE) in your financial documents; it may improve on the results of the `finassertion_time` (small) model. ## Predicted Entities `PRESENT`, `PAST`, `FUTURE`, `POSSIBLE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FINASSERTION_TEMPORALITY){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finassertiondl_time_md_en_1.0.0_3.0_1672844660896.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finassertiondl_time_md_en_1.0.0_3.0_1672844660896.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner = finance.BertForTokenClassification.pretrained("finner_bert_roles","en","finance/models")\ .setInputCols("token", "document")\ .setOutputCol("ner")\ .setCaseSensitive(True) chunk_converter = nlp.NerConverter() \ .setInputCols(["document", "token", "ner"]) \ .setOutputCol("ner_chunk") assertion = finance.AssertionDLModel.pretrained("finassertiondl_time_md", "en", "finance/models")\ .setInputCols(["document", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, ner, chunk_converter, assertion ]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) lp = nlp.LightPipeline(model) texts = ["John Crawford will be hired by Atlantic Inc as CTO"] lp.annotate(texts) ```
## Results ```bash chunk,begin,end,entity_type,assertion CTO,47,49,ROLE,FUTURE ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finassertiondl_time_md| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, doc_chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| |Size:|2.2 MB| ## References In-house annotations on financial and legal corpora ## Benchmarking ```bash label tp fp fn prec rec f1 PRESENT 115 11 5 0.9126984 0.9583333 0.9349593 POSSIBLE 79 5 4 0.9404762 0.9518072 0.9461077 PAST 54 5 11 0.91525424 0.83076924 0.8709678 FUTURE 77 3 4 0.9625 0.9506173 0.95652175 Macro-average 325 24 24 0.9327322 0.9228818 0.92778087 Micro-average 325 24 24 0.9312321 0.9312321 0.9312321 ``` --- layout: model title: Bangla XLMRoBerta Embeddings (from neuralspace-reverie) author: John Snow Labs name: xlmroberta_embeddings_indic_transformers_bn_xlmroberta date: 2022-05-13 tags: [bn, open_source, xlm_roberta, embeddings] task: Embeddings language: bn edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: XlmRoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `indic-transformers-bn-xlmroberta` is a Bangla model originally trained by `neuralspace-reverie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_indic_transformers_bn_xlmroberta_bn_3.4.4_3.0_1652439835846.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_indic_transformers_bn_xlmroberta_bn_3.4.4_3.0_1652439835846.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_indic_transformers_bn_xlmroberta","bn") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["আমি স্পার্ক এনএলপি পছন্দ করি"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_indic_transformers_bn_xlmroberta","bn") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("আমি স্পার্ক এনএলপি পছন্দ করি").toDF("text") val result = pipeline.fit(data).transform(data) ```
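Each row of the `embeddings` column carries one dense vector per token. A common downstream use is comparing tokens or sentences by cosine similarity; the sketch below uses tiny 3-dimensional toy vectors in place of the much larger vectors this model actually produces:

```python
import math

# Cosine similarity between two embedding vectors: the dot product
# normalized by both vector lengths. Toy 3-d vectors stand in for
# real transformer embeddings here.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]), 6))  # 1.0 (same direction)
print(round(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 6))  # 0.0 (orthogonal)
```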
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_embeddings_indic_transformers_bn_xlmroberta| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|bn| |Size:|505.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-bn-xlmroberta - https://oscar-corpus.com/ --- layout: model title: BioBERT Embeddings (Pubmed) author: John Snow Labs name: biobert_pubmed_base_cased date: 2020-09-19 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.2 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains the pre-trained weights of BioBERT, a language representation model for the biomedical domain, designed especially for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/biobert_pubmed_base_cased_en_2.6.2_2.4_1600449177996.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/biobert_pubmed_base_cased_en_2.6.2_2.4_1600449177996.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer']], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I hate cancer").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer"] embeddings_df = nlu.load('en.embed.biobert.pubmed_base_cased').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash token en_embed_biobert_pubmed_base_cased_embeddings I [0.4227580428123474, -0.01985771767795086, -0.... hate [0.04862901195883751, 0.2535072863101959, -0.5... cancer [0.05491625890135765, 0.09395376592874527, -0.... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|biobert_pubmed_base_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.2| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert) --- layout: model title: Legal Maritime And Inland Waterway Transport Document Classifier (EURLEX) author: John Snow Labs name: legclf_maritime_and_inland_waterway_transport_bert date: 2023-03-06 tags: [en, legal, classification, clauses, maritime_and_inland_waterway_transport, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_maritime_and_inland_waterway_transport_bert model is a BERT Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class Maritime_and_Inland_Waterway_Transport or not (Binary Classification), according to EuroVoc labels. 
## Predicted Entities `Maritime_and_Inland_Waterway_Transport`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_maritime_and_inland_waterway_transport_bert_en_1.0.0_3.0_1678111720706.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_maritime_and_inland_waterway_transport_bert_en_1.0.0_3.0_1678111720706.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_maritime_and_inland_waterway_transport_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Maritime_and_Inland_Waterway_Transport]| |[Other]| |[Other]| |[Maritime_and_Inland_Waterway_Transport]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_maritime_and_inland_waterway_transport_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Maritime_and_Inland_Waterway_Transport 0.98 0.91 0.94 190 Other 0.90 0.98 0.94 157 accuracy - - 0.94 347 macro-avg 0.94 0.94 0.94 347 weighted-avg 0.94 0.94 0.94 347 ``` --- layout: model title: Pipeline to Summarize Clinical Notes (Augmented) author: John Snow Labs name: summarizer_clinical_jsl_augmented_pipeline date: 2023-05-31 tags: [licensed, en, clinical, summarization, augmented] task: Summarization language: en edition: Healthcare NLP 4.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [summarizer_clinical_jsl_augmented](https://nlp.johnsnowlabs.com/2023/03/30/summarizer_clinical_jsl_augmented_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_jsl_augmented_pipeline_en_4.4.1_3.0_1685533778119.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_jsl_augmented_pipeline_en_4.4.1_3.0_1685533778119.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("summarizer_clinical_jsl_augmented_pipeline", "en", "clinical/models") text = """Patient with hypertension, syncope, and spinal stenosis - for recheck. (Medical Transcription Sample Report) SUBJECTIVE: The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS: Reviewed and unchanged from the dictation on 12/03/2003. MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash. """ result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("summarizer_clinical_jsl_augmented_pipeline", "en", "clinical/models") val text = """Patient with hypertension, syncope, and spinal stenosis - for recheck. (Medical Transcription Sample Report) SUBJECTIVE: The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS: Reviewed and unchanged from the dictation on 12/03/2003. MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash. """ val result = pipeline.fullAnnotate(text) ```
## Results ```bash A 78-year-old female with hypertension, syncope, and spinal stenosis returns for a recheck. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. Her medications include Atenolol, Premarin, calcium with vitamin D, multivitamin, aspirin, and TriViFlor. She also has Elocon cream and Synalar cream for rash. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_clinical_jsl_augmented_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|936.6 MB| ## Included Models - DocumentAssembler - MedicalSummarizer --- layout: model title: English asr_wynehills_mimi_ASR TFWav2Vec2ForCTC from mimi author: John Snow Labs name: pipeline_asr_wynehills_mimi_ASR date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wynehills_mimi_ASR` is an English model originally trained by mimi. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wynehills_mimi_ASR_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wynehills_mimi_ASR_en_4.2.0_3.0_1664097235761.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wynehills_mimi_ASR_en_4.2.0_3.0_1664097235761.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wynehills_mimi_ASR', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wynehills_mimi_ASR", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wynehills_mimi_ASR| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|354.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Translate English to Tiv Pipeline author: John Snow Labs name: translate_en_tiv date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, tiv, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `tiv` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_tiv_xx_2.7.0_2.4_1609699301504.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_tiv_xx_2.7.0_2.4_1609699301504.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_tiv", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_tiv", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.tiv').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_tiv| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Securities and Purchase Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_securities_purchase_agreement date: 2022-11-10 tags: [en, legal, classification, agreement, purchase, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_securities_purchase_agreement` model is a Legal Longformer Document Classifier that classifies whether a document belongs to the class `securities-purchase-agreement` or not (Binary Classification). Longformers are limited to 4096 tokens, so only the first 4096 tokens of each document are taken into account. In practice, for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document itself without extra leading material, 4096 tokens are enough for Document Classification. If that is not the case, let us know and we can apply an alternative approach for you: splitting the document into 4096-token chunks, averaging their embeddings, and training on the averaged representation, so that the whole document is taken into account. This should not normally be required. 
## Predicted Entities `securities-purchase-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_securities_purchase_agreement_en_1.0.0_3.0_1668110959997.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_securities_purchase_agreement_en_1.0.0_3.0_1668110959997.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = nlp.ClassifierDLModel.pretrained("legclf_securities_purchase_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
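The `SentenceEmbeddings` stage above pools the per-token Longformer vectors into a single document vector using its `AVERAGE` strategy; the same element-wise averaging underlies the chunk-averaging fallback described for documents longer than 4096 tokens. A minimal sketch of that pooling, with toy 3-dimensional vectors standing in for real embeddings:

```python
# Element-wise average of token vectors -> one pooled document vector,
# i.e. the AVERAGE pooling strategy of a sentence-embeddings stage.
def average_pool(vectors):
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

token_vectors = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
]
print(average_pool(token_vectors))  # [2.0, 2.0, 2.0]
```

The chunk-averaging workaround applies the same function a second time: pool each 4096-token chunk, then average the chunk vectors.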
## Results ```bash +-------+ |result| +-------+ |[securities-purchase-agreement]| |[other]| |[other]| |[securities-purchase-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_securities_purchase_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.3 MB| ## References Legal documents, scraped from the Internet and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.92 0.92 0.92 85 securities-purchase-agreement 0.77 0.77 0.77 30 accuracy - - 0.88 115 macro-avg 0.84 0.84 0.84 115 weighted-avg 0.88 0.88 0.88 115 ``` --- layout: model title: Detect Diagnoses and Procedures (Spanish using Roberta embeddings) author: John Snow Labs name: roberta_ner_diag_proc date: 2021-11-04 tags: [ner, clinical, es, licensed] task: Named Entity Recognition language: es edition: Healthcare NLP 3.3.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The Named Entity Recognition annotator allows a generic model to be trained using a deep learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. This is a pretrained named entity recognition deep learning model for diagnoses and procedures in Spanish. It is similar to `https://nlp.johnsnowlabs.com/2021/03/31/ner_diag_proc_es.html` but uses `roberta_base_biomedical_es` embeddings instead of `embeddings_scielowiki_300d`. 
Trained on data from the Clinical Case Coding in Spanish Shared Task (CodiEsp, eHealth CLEF 2020) (link: https://temu.bsc.es/codiesp/) ## Predicted Entities `DIAGNOSTICO`, `PROCEDIMIENTO` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/roberta_ner_diag_proc_es_3.3.0_3.0_1636024910686.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/roberta_ner_diag_proc_es_3.3.0_3.0_1636024910686.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner = medical.NerModel.pretrained("roberta_ner_diag_proc", "es", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter() \ .setInputCols(['sentence', 'token', 'ner']) \ .setOutputCol('ner_chunk') pipeline = Pipeline(stages = [ documentAssembler, sentenceDetector, tokenizer, embeddings, ner, ner_converter]) empty = spark.createDataFrame([['']]).toDF("text") p_model = pipeline.fit(empty) test_sentence = """Mujer de 28 años con antecedentes de diabetes mellitus gestacional diagnosticada ocho años antes de la presentación y posterior diabetes mellitus tipo dos (DM2), un episodio previo de pancreatitis inducida por HTG tres años antes de la presentación, asociado con una hepatitis aguda, y obesidad con un índice de masa corporal (IMC) de 33,5 kg / m2, que se presentó con antecedentes de una semana de poliuria, polidipsia, falta de apetito y vómitos. Dos semanas antes de la presentación, fue tratada con un ciclo de cinco días de amoxicilina por una infección del tracto respiratorio. Estaba tomando metformina, glipizida y dapagliflozina para la DM2 y atorvastatina y gemfibrozil para la HTG. Había estado tomando dapagliflozina durante seis meses en el momento de la presentación. El examen físico al momento de la presentación fue significativo para la mucosa oral seca; significativamente, su examen abdominal fue benigno sin dolor a la palpación, protección o rigidez. 
Los hallazgos de laboratorio pertinentes al ingreso fueron: glucosa sérica 111 mg / dl, bicarbonato 18 mmol / l, anión gap 20, creatinina 0,4 mg / dl, triglicéridos 508 mg / dl, colesterol total 122 mg / dl, hemoglobina glucosilada (HbA1c) 10%. y pH venoso 7,27. La lipasa sérica fue normal a 43 U / L. Los niveles séricos de acetona no pudieron evaluarse ya que las muestras de sangre se mantuvieron hemolizadas debido a una lipemia significativa. La paciente ingresó inicialmente por cetosis por inanición, ya que refirió una ingesta oral deficiente durante los tres días previos a la admisión. Sin embargo, la química sérica obtenida seis horas después de la presentación reveló que su glucosa era de 186 mg / dL, la brecha aniónica todavía estaba elevada a 21, el bicarbonato sérico era de 16 mmol / L, el nivel de triglicéridos alcanzó un máximo de 2050 mg / dL y la lipasa fue de 52 U / L. Se obtuvo el nivel de β-hidroxibutirato y se encontró que estaba elevado a 5,29 mmol / L; la muestra original se centrifugó y la capa de quilomicrones se eliminó antes del análisis debido a la interferencia de la turbidez causada por la lipemia nuevamente. El paciente fue tratado con un goteo de insulina para euDKA y HTG con una reducción de la brecha aniónica a 13 y triglicéridos a 1400 mg / dL, dentro de las 24 horas. Se pensó que su euDKA fue precipitada por su infección del tracto respiratorio en el contexto del uso del inhibidor de SGLT2. La paciente fue atendida por el servicio de endocrinología y fue dada de alta con 40 unidades de insulina glargina por la noche, 12 unidades de insulina lispro con las comidas y metformina 1000 mg dos veces al día. Se determinó que todos los inhibidores de SGLT2 deben suspenderse indefinidamente. 
Tuvo un seguimiento estrecho con endocrinología post alta.""" import pandas as pd res = p_model.transform(spark.createDataFrame(pd.DataFrame({'text': [test_sentence]}))) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("roberta_ner_diag_proc", "es", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, ner, ner_converter)) val test_sentence = """Mujer de 28 años con antecedentes de diabetes mellitus gestacional diagnosticada ocho años antes de la presentación y posterior diabetes mellitus tipo dos (DM2), un episodio previo de pancreatitis inducida por HTG tres años antes de la presentación, asociado con una hepatitis aguda, y obesidad con un índice de masa corporal (IMC) de 33,5 kg / m2, que se presentó con antecedentes de una semana de poliuria, polidipsia, falta de apetito y vómitos. Dos semanas antes de la presentación, fue tratada con un ciclo de cinco días de amoxicilina por una infección del tracto respiratorio. Estaba tomando metformina, glipizida y dapagliflozina para la DM2 y atorvastatina y gemfibrozil para la HTG. Había estado tomando dapagliflozina durante seis meses en el momento de la presentación. 
El examen físico al momento de la presentación fue significativo para la mucosa oral seca; significativamente, su examen abdominal fue benigno sin dolor a la palpación, protección o rigidez. Los hallazgos de laboratorio pertinentes al ingreso fueron: glucosa sérica 111 mg / dl, bicarbonato 18 mmol / l, anión gap 20, creatinina 0,4 mg / dl, triglicéridos 508 mg / dl, colesterol total 122 mg / dl, hemoglobina glucosilada (HbA1c) 10%. y pH venoso 7,27. La lipasa sérica fue normal a 43 U / L. Los niveles séricos de acetona no pudieron evaluarse ya que las muestras de sangre se mantuvieron hemolizadas debido a una lipemia significativa. La paciente ingresó inicialmente por cetosis por inanición, ya que refirió una ingesta oral deficiente durante los tres días previos a la admisión. Sin embargo, la química sérica obtenida seis horas después de la presentación reveló que su glucosa era de 186 mg / dL, la brecha aniónica todavía estaba elevada a 21, el bicarbonato sérico era de 16 mmol / L, el nivel de triglicéridos alcanzó un máximo de 2050 mg / dL y la lipasa fue de 52 U / L. Se obtuvo el nivel de β-hidroxibutirato y se encontró que estaba elevado a 5,29 mmol / L; la muestra original se centrifugó y la capa de quilomicrones se eliminó antes del análisis debido a la interferencia de la turbidez causada por la lipemia nuevamente. El paciente fue tratado con un goteo de insulina para euDKA y HTG con una reducción de la brecha aniónica a 13 y triglicéridos a 1400 mg / dL, dentro de las 24 horas. Se pensó que su euDKA fue precipitada por su infección del tracto respiratorio en el contexto del uso del inhibidor de SGLT2. La paciente fue atendida por el servicio de endocrinología y fue dada de alta con 40 unidades de insulina glargina por la noche, 12 unidades de insulina lispro con las comidas y metformina 1000 mg dos veces al día. Se determinó que todos los inhibidores de SGLT2 deben suspenderse indefinidamente. 
Tuvo un seguimiento estrecho con endocrinología post alta.""" val data = Seq(test_sentence).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.roberta_ner_diag_proc").predict("""Mujer de 28 años con antecedentes de diabetes mellitus gestacional diagnosticada ocho años antes de la presentación y posterior diabetes mellitus tipo dos (DM2), un episodio previo de pancreatitis inducida por HTG tres años antes de la presentación, asociado con una hepatitis aguda, y obesidad con un índice de masa corporal (IMC) de 33,5 kg / m2, que se presentó con antecedentes de una semana de poliuria, polidipsia, falta de apetito y vómitos. Dos semanas antes de la presentación, fue tratada con un ciclo de cinco días de amoxicilina por una infección del tracto respiratorio. Estaba tomando metformina, glipizida y dapagliflozina para la DM2 y atorvastatina y gemfibrozil para la HTG. Había estado tomando dapagliflozina durante seis meses en el momento de la presentación. El examen físico al momento de la presentación fue significativo para la mucosa oral seca; significativamente, su examen abdominal fue benigno sin dolor a la palpación, protección o rigidez. Los hallazgos de laboratorio pertinentes al ingreso fueron: glucosa sérica 111 mg / dl, bicarbonato 18 mmol / l, anión gap 20, creatinina 0,4 mg / dl, triglicéridos 508 mg / dl, colesterol total 122 mg / dl, hemoglobina glucosilada (HbA1c) 10%. y pH venoso 7,27. La lipasa sérica fue normal a 43 U / L. Los niveles séricos de acetona no pudieron evaluarse ya que las muestras de sangre se mantuvieron hemolizadas debido a una lipemia significativa. La paciente ingresó inicialmente por cetosis por inanición, ya que refirió una ingesta oral deficiente durante los tres días previos a la admisión. 
Sin embargo, la química sérica obtenida seis horas después de la presentación reveló que su glucosa era de 186 mg / dL, la brecha aniónica todavía estaba elevada a 21, el bicarbonato sérico era de 16 mmol / L, el nivel de triglicéridos alcanzó un máximo de 2050 mg / dL y la lipasa fue de 52 U / L. Se obtuvo el nivel de β-hidroxibutirato y se encontró que estaba elevado a 5,29 mmol / L; la muestra original se centrifugó y la capa de quilomicrones se eliminó antes del análisis debido a la interferencia de la turbidez causada por la lipemia nuevamente. El paciente fue tratado con un goteo de insulina para euDKA y HTG con una reducción de la brecha aniónica a 13 y triglicéridos a 1400 mg / dL, dentro de las 24 horas. Se pensó que su euDKA fue precipitada por su infección del tracto respiratorio en el contexto del uso del inhibidor de SGLT2. La paciente fue atendida por el servicio de endocrinología y fue dada de alta con 40 unidades de insulina glargina por la noche, 12 unidades de insulina lispro con las comidas y metformina 1000 mg dos veces al día. Se determinó que todos los inhibidores de SGLT2 deben suspenderse indefinidamente. Tuvo un seguimiento estrecho con endocrinología post alta.""") ```
## Results ```bash +---------------------------------+------------+ | text|ner_label | +---------------------------------+------------+ | diabetes mellitus gestacional|DIAGNOSTICO| | diabetes mellitus tipo dos|DIAGNOSTICO| | DM2|DIAGNOSTICO| | pancreatitis inducida por HTG|DIAGNOSTICO| | hepatitis aguda|DIAGNOSTICO| | obesidad|DIAGNOSTICO| | índice de masa corporal|DIAGNOSTICO| | IMC|DIAGNOSTICO| | poliuria|DIAGNOSTICO| | polidipsia|DIAGNOSTICO| | vómitos|DIAGNOSTICO| |infección del tracto respiratorio|DIAGNOSTICO| | DM2|DIAGNOSTICO| | HTG|DIAGNOSTICO| | dolor|DIAGNOSTICO| | rigidez|DIAGNOSTICO| | cetosis|DIAGNOSTICO| |infección del tracto respiratorio|DIAGNOSTICO| +---------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_ner_diag_proc| |Compatibility:|Healthcare NLP 3.3.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| ## Data Source https://temu.bsc.es/codiesp/ ## Benchmarking ```bash label tp fp fn prec rec f1 DIAGNOSTICO 1705 543 610 0.75845194 0.7365011 0.74731535 PROCEDIMIENTO 500 195 288 0.7194245 0.6345178 0.67430884 Macro-average 2205 738 898 0.7389382 0.68550944 0.7112218 Micro-average 2205 738 898 0.74923545 0.71060264 0.72940785 ``` --- layout: model title: Pipeline to Detect Diseases author: John Snow Labs name: ner_diseases_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, disease, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_diseases](https://nlp.johnsnowlabs.com/2021/03/31/ner_diseases_en.html) model.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_diseases_pipeline_en_3.4.1_3.0_1647871685950.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_diseases_pipeline_en_3.4.1_3.0_1647871685950.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_diseases_pipeline", "en", "clinical/models") pipeline.annotate("Detection of various other intracellular signaling proteins is also described. Multiple autoimmune syndrome has been detected. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. She has Chikungunya virus disease story also. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.") ``` ```scala val pipeline = new PretrainedPipeline("ner_diseases_pipeline", "en", "clinical/models") pipeline.annotate("Detection of various other intracellular signaling proteins is also described. Multiple autoimmune syndrome has been detected. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. She has Chikungunya virus disease story also. 
We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.diseases.pipeline").predict("""Detection of various other intracellular signaling proteins is also described. Multiple autoimmune syndrome has been detected. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. She has Chikungunya virus disease story also. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""") ```
## Results ```bash +-------------------------+---------+ |chunk |ner_label| +-------------------------+---------+ |autoimmune syndrome |Disease | |T-cell leukemia |Disease | |T-cell leukemia |Disease | |Chikungunya virus disease|Disease | +-------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_diseases_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Arabic Named Entity Recognition (Modern Standard Arabic-MSA, Dialectal Arabic-DA and Classical Arabic-CA) author: John Snow Labs name: bert_ner_bert_base_arabic_camelbert_mix_ner date: 2022-05-04 tags: [bert, ner, token_classification, ar, open_source] task: Named Entity Recognition language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-mix-ner` is an Arabic model originally trained by `CAMeL-Lab`. ## Predicted Entities `ORG`, `LOC`, `PERS`, `MISC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_arabic_camelbert_mix_ner_ar_3.4.2_3.0_1651630195148.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_arabic_camelbert_mix_ner_ar_3.4.2_3.0_1651630195148.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_arabic_camelbert_mix_ner","ar") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_arabic_camelbert_mix_ner","ar") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("أنا أحب الشرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.ner.arabic_camelbert_mix_ner").predict("""أنا أحب الشرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_bert_base_arabic_camelbert_mix_ner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ar| |Size:|407.3 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-mix-ner - https://camel.abudhabi.nyu.edu/anercorp/ - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://github.com/CAMeL-Lab/camel_tools - https://github.com/CAMeL-Lab/camel_tools --- layout: model title: German ElectraForQuestionAnswering Large model (from deepset) author: John Snow Labs name: electra_qa_g_large_germanquad date: 2022-06-22 tags: [de, open_source, electra, question_answering] task: Question Answering language: de edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `gelectra-large-germanquad` is a German model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_g_large_germanquad_de_4.0.0_3.0_1655921897609.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_g_large_germanquad_de_4.0.0_3.0_1655921897609.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_g_large_germanquad","de") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Was ist mein Name?", "Mein Name ist Clara und ich lebe in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_g_large_germanquad","de") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Was ist mein Name?", "Mein Name ist Clara und ich lebe in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.answer_question.electra.large").predict("""Was ist mein Name?|||Mein Name ist Clara und ich lebe in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_g_large_germanquad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|de| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/deepset/gelectra-large-germanquad - https://deepset.ai/germanquad - https://deepset.ai/german-bert - https://github.com/deepset-ai/FARM - https://github.com/deepset-ai/haystack/ - https://haystack.deepset.ai/community/join --- layout: model title: Legal Increased costs Clause Binary Classifier author: John Snow Labs name: legclf_increased_costs_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `increased-costs` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `increased-costs` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_increased_costs_clause_en_1.0.0_3.2_1660122505120.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_increased_costs_clause_en_1.0.0_3.2_1660122505120.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_increased_costs_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
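The classifier above expects clause-sized texts in the `clause_text` column; the pre-splitting recommended in the description can be sketched without Spark NLP at all. A minimal, dependency-free sketch (the regular expression and the whitespace tokenization are illustrative assumptions, not the workshop's exact implementation) that breaks a long document into paragraph chunks of at most 512 tokens:

```python
import re

def split_into_clauses(text, max_tokens=512):
    """Split a long legal document into paragraph-sized chunks.

    Paragraphs are separated by blank lines (multiline splitting); any
    paragraph longer than max_tokens whitespace tokens is further cut
    into max_tokens-sized slices so it fits the 512-token embedding limit.
    """
    chunks = []
    for para in re.split(r"\n\s*\n", text):
        tokens = para.split()
        if not tokens:
            continue  # skip empty paragraphs
        for i in range(0, len(tokens), max_tokens):
            chunks.append(" ".join(tokens[i:i + max_tokens]))
    return chunks

doc = "First clause paragraph.\n\nSecond clause paragraph."
print(split_into_clauses(doc))
# → ['First clause paragraph.', 'Second clause paragraph.']
```

Each resulting chunk can then be fed to the pipeline above as one row of the `clause_text` column.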
## Results ```bash +-----------------+ | result| +-----------------+ |[increased-costs]| |[other]| |[other]| |[increased-costs]| +-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_increased_costs_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support increased-costs 1.00 0.94 0.97 36 other 0.98 1.00 0.99 100 accuracy - - 0.99 136 macro-avg 0.99 0.97 0.98 136 weighted-avg 0.99 0.99 0.99 136 ``` --- layout: model title: Legal Multilabel Classifier on Covid-19 Exceptions (French) author: John Snow Labs name: legmulticlf_covid19_exceptions_french date: 2023-04-13 tags: [fr, legal, multilabel, classification, licensed, tensorflow] task: Text Classification language: fr edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: MultiClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Multi-Label Text Classification model that can be used to identify up to 7 classes to facilitate analysis, discovery, and comparison of legal texts in French related to COVID-19 exception measures.
The classes are as follows: - Army_mobilization - Closures/lockdown - Government_oversight - Restrictions_of_daily_liberties - Restrictions_of_fundamental_rights_and_civil_liberties - State_of_Emergency - Suspension_of_international_cooperation_and_commitments ## Predicted Entities `Army_mobilization`, `Closures/lockdown`, `Government_oversight`, `Restrictions_of_daily_liberties`, `Restrictions_of_fundamental_rights_and_civil_liberties`, `State_of_Emergency`, `Suspension_of_international_cooperation_and_commitments` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legmulticlf_covid19_exceptions_french_fr_1.0.0_3.0_1681406840199.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legmulticlf_covid19_exceptions_french_fr_1.0.0_3.0_1681406840199.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_use_cmlm_multi_base_br", "xx") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") classifierdl = nlp.MultiClassifierDLModel.pretrained("legmulticlf_covid19_exceptions_french", "fr", "legal/models") \ .setInputCols(["sentence_embeddings"])\ .setOutputCol("class") clf_pipeline = nlp.Pipeline( stages=[document_assembler, embeddings, classifierdl]) df = spark.createDataFrame([["Par dérogation à l'alinéa 1er, sont autorisés :- les cérémonies funéraires, mais uniquement en présence de 15 personnes maximum, et sans possibilité d'exposition du corps ;"]]).toDF("text") model = clf_pipeline.fit(df) result = model.transform(df) result.select("text", "class.result").show(truncate=False) ```
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------+ |text |result | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------+ |Par dérogation à l'alinéa 1er, sont autorisés :- les cérémonies funéraires, mais uniquement en présence de 15 personnes maximum, et sans possibilité d'exposition du corps ;|[Restrictions_of_fundamental_rights_and_civil_liberties, Restrictions_of_daily_liberties]| +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legmulticlf_covid19_exceptions_french| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|fr| |Size:|14.0 MB| ## References Train dataset available [here](https://huggingface.co/datasets/joelito/covid19_emergency_event) ## Benchmarking ```bash label precision recall f1-score support Army_mobilization 1.00 1.00 1.00 11 Closures/lockdown 0.71 0.86 0.77 84 Government_oversight 1.00 0.67 0.80 3 Restrictions_of_daily_liberties 0.72 0.73 0.73 75 Restrictions_of_fundamental_rights_and_civil_liberties 0.65 0.66 0.65 47 State_of_Emergency 0.81 0.74 0.77 53 Suspension_of_international_cooperation_and_commitments 1.00 0.33 0.50 6 micro-avg 0.73 0.76 0.75 279 macro-avg 0.84 0.71 0.75 279 weighted-avg 0.74 0.76 0.74 279 
samples-avg 0.75 0.80 0.74 279 ``` --- layout: model title: Sentiment Analysis of Danish texts author: John Snow Labs name: bert_sequence_classifier_sentiment date: 2022-01-18 tags: [danish, sentiment, da, open_source] task: Sentiment Analysis language: da edition: Spark NLP 3.3.4 spark_version: 3.0 supported: true annotator: BertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model was imported from `Hugging Face` ([link](https://huggingface.co/DaNLP/da-bert-tone-sentiment-polarity)) and it's been fine-tuned for the Danish language, leveraging Danish BERT embeddings and `BertForSequenceClassification` for text classification purposes. ## Predicted Entities `negative`, `positive`, `neutral` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_sentiment_da_3.3.4_3.0_1642495726064.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_sentiment_da_3.3.4_3.0_1642495726064.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = BertForSequenceClassification \ .pretrained('bert_sequence_classifier_sentiment', 'da') \ .setInputCols(['token', 'document']) \ .setOutputCol('class') pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier]) example = spark.createDataFrame([['Protester over hele landet ledet af utilfredse civilsamfund på grund af den danske regerings COVID-19 lockdown-politik er kommet ud af kontrol.']]).toDF("text") result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_sentiment", "da") .setInputCols("document", "token") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) val example = Seq("Protester over hele landet ledet af utilfredse civilsamfund på grund af den danske regerings COVID-19 lockdown-politik er kommet ud af kontrol.").toDS.toDF("text") val result = pipeline.fit(example).transform(example) ```
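At prediction time, `BertForSequenceClassification` reduces the pooled sentence representation to one logit per label, then applies a softmax and takes the argmax. A minimal, framework-agnostic sketch of that final step (the logit values and label order here are illustrative assumptions, not taken from the model):

```python
import math

# Illustrative logits for the three sentiment labels (assumed order).
labels = ["negative", "neutral", "positive"]
logits = [2.1, 0.3, -1.2]

# Softmax: exponentiate and normalize so the scores sum to 1.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

# The predicted class is the label with the highest probability.
prediction = labels[max(range(len(probs)), key=probs.__getitem__)]
print(prediction)  # -> negative
```

The `class` column produced by the pipeline above holds the winning label for each input row.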
## Results ```bash ['negative'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_sentiment| |Compatibility:|Spark NLP 3.3.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|da| |Size:|415.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## Data Source [https://huggingface.co/DaNLP/da-bert-tone-sentiment-polarity#training-data](https://huggingface.co/DaNLP/da-bert-tone-sentiment-polarity#training-data) --- layout: model title: Pipeline to Detect Clinical Entities in Romanian (w2v_cc_300d) author: John Snow Labs name: ner_clinical_pipeline date: 2023-03-09 tags: [licenced, clinical, ro, ner, w2v, licensed] task: Named Entity Recognition language: ro edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_clinical](https://nlp.johnsnowlabs.com/2022/07/01/ner_clinical_ro_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_pipeline_ro_4.3.0_3.2_1678384416326.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_pipeline_ro_4.3.0_3.2_1678384416326.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_clinical_pipeline", "ro", "clinical/models") text = ''' Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_clinical_pipeline", "ro", "clinical/models") val text = " Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:----------------------------------|--------:|------:|:--------------------------|-------------:| | 0 | Angio CT | 13 | 20 | Imaging_Test | 0.92675 | | 1 | cardio-toracic | 22 | 35 | Body_Part | 0.9854 | | 2 | Atrezie | 54 | 60 | Disease_Syndrome_Disorder | 0.9985 | | 3 | valva pulmonara | 65 | 79 | Body_Part | 0.9271 | | 4 | Hipoplazie | 82 | 91 | Disease_Syndrome_Disorder | 0.9926 | | 5 | VS | 93 | 94 | Body_Part | 0.9984 | | 6 | Atrezie | 97 | 103 | Disease_Syndrome_Disorder | 0.9607 | | 7 | VAV stang | 105 | 113 | Body_Part | 0.94825 | | 8 | Anastomoza Glenn | 116 | 131 | Disease_Syndrome_Disorder | 0.9787 | | 9 | Sp | 134 | 135 | Body_Part | 0.8138 | | 10 | Tromboza | 138 | 145 | Disease_Syndrome_Disorder | 0.9986 | | 11 | Sectia Clinica Cardiologie | 182 | 207 | Clinical_Dept | 0.8721 | | 12 | GE Revolution HD | 239 | 254 | Medical_Device | 0.999133 | | 13 | Branula albastra | 257 | 272 | Medical_Device | 0.98465 | | 14 | membrului superior | 293 | 310 | Body_Part | 0.9793 | | 15 | drept | 312 | 316 | Direction | 0.7679 | | 16 | 30 ml | 336 | 340 | Dosage | 0.99775 | | 17 | Iomeron 350 | 342 | 352 | Drug_Ingredient | 0.9878 | | 18 | 2.2 ml/s | 362 | 369 | Dosage | 0.9599 | | 19 | 20 ml | 382 | 386 | Dosage | 0.99515 | | 20 | ser fiziologic | 388 | 401 | Drug_Ingredient | 0.9802 | | 21 | angio-CT | 446 | 453 | Imaging_Test | 0.9843 | | 22 | cardiotoracica | 455 | 468 | Body_Part | 0.9995 | | 23 | achizitii secventiale prospective | 473 | 505 | Imaging_Technique | 0.8514 | | 24 | 100/min | 540 | 546 | Pulse | 0.8501 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|ro| |Size:|1.2 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - 
NerConverterInternalModel --- layout: model title: Sentence Entity Resolver for ICD10-CM (Augmented) author: John Snow Labs name: sbiobertresolve_icd10cm_augmented date: 2021-05-16 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD10-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It loads about 6X faster than previous versions, and the load process is more memory-friendly: the peak memory required during loading is smaller, reducing the chance of OOM exceptions and relaxing hardware requirements. The model has also been augmented with synonyms, making it four times richer than the previous resolver and more accurate. ## Predicted Entities Predicts ICD10-CM Codes and their normalized definitions. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_3.0.4_3.0_1621191389631.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_3.0.4_3.0_1621191389631.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") chunk2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . 
Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("sbert_embeddings") val icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented","en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver)) val data = Seq("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , 
chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd10cm.augmented").predict("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""") ```
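In the resolver's output, the top-k candidates are packed into single strings joined by `:::` (visible in the `all_k_codes` and `all_k_resolutions` columns of this card's results). A small helper to unpack them into (code, resolution) pairs — a sketch that assumes only the `:::` separator shown here; the second and third resolution strings are shortened for illustration:

```python
def unpack_candidates(codes: str, resolutions: str):
    """Split ':::'-joined resolver metadata into (code, resolution) pairs."""
    return list(zip(codes.split(":::"), resolutions.split(":::")))

# Example using the 'chronic renal insufficiency' row from the results table.
pairs = unpack_candidates(
    "N189:::P2930:::N19",
    "chronic renal insufficiency:::chronic renal disease:::renal failure",
)
print(pairs[0])  # -> ('N189', 'chronic renal insufficiency')
```

This makes it easy to keep the second- and third-best ICD10-CM candidates around for downstream review instead of only the top code.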
## Results ```bash +--------------------+---------+------+------------------------------------------+---------------------+ | chunk| entity| code| all_k_resolutions| all_k_codes| +--------------------+---------+------+------------------------------------------+---------------------+ | hypertension| PROBLEM| I10|hypertension:::hypertension monitored::...|I10:::Z8679:::I159...| |chronic renal ins...| PROBLEM| N189|chronic renal insufficiency:::chronic r...|N189:::P2930:::N19...| | COPD| PROBLEM| J449|copd - chronic obstructive pulmonary di...|J449:::J984:::J628...| | gastritis| PROBLEM| K2970|gastritis:::chemical gastritis:::gastri...|K2970:::K2960:::K2...| | TIA| PROBLEM| S0690|cerebral trauma (disorder):::cerebral c...|S0690:::S060X:::G4...| |a non-ST elevatio...| PROBLEM| I219|silent myocardial infarction (disorder)...|I219:::I248:::I256...| |Guaiac positive s...| PROBLEM| K921|guaiac-positive stools:::acholic stool ...|K921:::R195:::R15:...| | mid LAD lesion| PROBLEM| I2102|stemi involving left anterior descendin...|I2102:::I2101:::Q2...| +--------------------+---------+------+------------------------------------------+---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icd10cm_augmented| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[icd10cm_code_aug]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on ICD10 Clinical Modification dataset with ``sbiobert_base_cased_mli`` sentence embeddings. 
https://www.icd10data.com/ICD10CM/Codes/ --- layout: model title: Legal Swiss Judgements Classification (French) author: John Snow Labs name: legclf_bert_swiss_judgements date: 2022-10-27 tags: [fr, legal, licensed, sequence_classification] task: Text Classification language: fr edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a BERT-based model that classifies Swiss judgement documents written in French into the following six classes according to their case area. It was trained using a state-of-the-art approach. ## Predicted Entities `public law`, `civil law`, `insurance law`, `social law`, `penal law`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_bert_swiss_judgements_fr_1.0.0_3.0_1666866243544.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_bert_swiss_judgements_fr_1.0.0_3.0_1666866243544.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") clf_model = legal.BertForSequenceClassification.pretrained("legclf_bert_swiss_judgements", "fr", "legal/models")\ .setInputCols(["document", "token"])\ .setOutputCol("class")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) clf_pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, clf_model ]) data = spark.createDataFrame([["""Résumé : A. X. 1948) et Z. Ils se sont mariés à la xxxx 1992. Le mariage est resté sans enfants. T._ est, cependant, le père des enfants divorcés S._ et T._ (geb. 2004 et 2006). Après la suppression du budget commun, la vie séparée a dû être réglée. Disponible du 17. En décembre 2010, le président de la Cour de justice, Dorneck-Thierstein, a autorisé les époux à se séparer. Dans la mesure où cela est encore important, le juge a obligé le mari, pour l'année 2010 encore Fr. 3'000.-- à payer l'entretien de sa femme (Ziff. 3 ) De même, Z._ a été condamné, X._ à partir de janvier 2011 pour la durée ultérieure de la séparation une contribution de subsistance mensuelle de Fr. 7'085.-- de vous dépenser et de vous payer, en outre, la moitié du bonus net versé à chacun immédiatement après sa destination (Ziff. 4 ) En outre, le président de la Cour a ordonné la séparation des marchandises (Ziff. 5), dispose de la compétition du parti ou Les frais d’avocat (Ziff. 9) et impose les frais judiciaires à la moitié des deux parties (Ziff. 10 ) B. À l’encontre de cette décision, X._ a fait appel à la Cour suprême du canton de Solothurn. Elle a demandé de supprimer les paragraphes 3, 4, 5, 9 et 10 de la décision de première instance, et a présenté les demandes juridiques suivantes: Le mari est tenu de l'engager pour la période à partir de 21. 
Septembre 2009 à la fin du mois de décembre 2010 une contribution supplémentaire de Fr. 34'400.-- pour rembourser; pour la vie séparée à partir de janvier 2011, elle est dotée d'une contribution de subsistance de Fr. 10'000.-- pour recevoir par mois. La distribution des marchandises est de 21. Déposer en septembre 2010. En conclusion, le conjoint doit payer une contribution de parti raisonnable d'au moins Fr. 6'000.-- et pour payer tous les frais de justice. La Cour suprême du canton de Solothurn a déposé le recours à l'arrêt du 18. en mai 2011. C. À ce titre, X._ (ci-après dénommée « plaignante ») procède à la Cour fédérale. Dans sa plainte du 20. En juin 2011, elle présente la demande, la décision de la Cour suprême du canton Solothurn du 18. annuler en mai 2011 et répéter les demandes légales qu’elle a présentées devant la Cour suprême (cf. Bst. B ) En outre, il demande que la séparation des marchandises soit plus égalitaire par 7. Décembre 2010 à ordonner. Aucune consultation n’a été faite, mais les actes préjudiciels ont été reçus."""]]).toDF("text") result = clf_pipeline.fit(data).transform(data) ```
## Results ```bash +----------------------------------------------------------------------------------------------------+---------+ | document| class| +----------------------------------------------------------------------------------------------------+---------+ |Résumé : A. X. 1948) et Z. Ils se sont mariés à la xxxx 1992. Le mariage est resté sans enfants. ...|civil law| +----------------------------------------------------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_bert_swiss_judgements| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|fr| |Size:|390.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References Training data is available [here](https://zenodo.org/record/7109926#.Y1gJwexBw8E). ## Benchmarking ```bash | label | precision | recall | f1-score | support | |---------------|-----------|--------|----------|---------| | civil-law | 0.81 | 0.97 | 0.88 | 869 | | insurance-law | 0.95 | 0.94 | 0.95 | 790 | | other | 1.00 | 0.40 | 0.57 | 15 | | penal-law | 0.94 | 0.91 | 0.93 | 1077 | | public-law | 0.93 | 0.85 | 0.89 | 1259 | | social-law | 0.94 | 0.95 | 0.95 | 834 | | accuracy | - | - | 0.91 | 4844 | | macro-avg | 0.93 | 0.84 | 0.86 | 4844 | | weighted-avg | 0.92 | 0.91 | 0.91 | 4844 | ``` --- layout: model title: Chinese T5ForConditionalGeneration Base Cased model (from shibing624) author: John Snow Labs name: t5_mengzi_base_chinese_correction date: 2023-01-30 tags: [zh, open_source, t5, tensorflow] task: Text Generation language: zh edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness 
using Spark NLP. `mengzi-t5-base-chinese-correction` is a Chinese model originally trained by `shibing624`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_mengzi_base_chinese_correction_zh_4.3.0_3.0_1675105223361.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_mengzi_base_chinese_correction_zh_4.3.0_3.0_1675105223361.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_mengzi_base_chinese_correction","zh") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_mengzi_base_chinese_correction","zh") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_mengzi_base_chinese_correction| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|zh| |Size:|1.0 GB| ## References - https://huggingface.co/shibing624/mengzi-t5-base-chinese-correction - https://github.com/shibing624/pycorrector - https://github.com/shibing624/pycorrector/tree/master/pycorrector/t5 - https://pan.baidu.com/s/1BV5tr9eONZCI0wERFvr0gQ - http://nlp.ee.ncu.edu.tw/resource/csc.html - https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml --- layout: model title: Translate Greenlandic to English Pipeline author: John Snow Labs name: translate_kl_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, kl, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended. 
- source languages: `kl` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_kl_en_xx_2.7.0_2.4_1609689134804.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_kl_en_xx_2.7.0_2.4_1609689134804.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_kl_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_kl_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.kl.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_kl_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: SDOH Alcohol Usage For Classification author: John Snow Labs name: genericclassifier_sdoh_alcohol_usage_sbiobert_cased_mli date: 2023-01-14 tags: [en, licensed, generic_classifier, sdoh, alcohol, clinical] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true recommended: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Generic Classifier model detects alcohol usage in clinical notes and was trained using the GenericClassifierApproach annotator. `Present:` if the patient was a current consumer of alcohol. `Past:` if the patient was a consumer in the past and had quit. `Never:` if the patient had never consumed alcohol. `None:` if there was no related text. 
## Predicted Entities `Present`, `Past`, `Never`, `None` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/SOCIAL_DETERMINANT_ALCOHOL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/SOCIAL_DETERMINANT_CLASSIFICATION.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_alcohol_usage_sbiobert_cased_mli_en_4.2.4_3.0_1673698550774.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_alcohol_usage_sbiobert_cased_mli_en_4.2.4_3.0_1673698550774.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") features_asm = FeaturesAssembler()\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("features") generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_alcohol_usage_sbiobert_cased_mli", 'en', 'clinical/models')\ .setInputCols(["features"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, sentence_embeddings, features_asm, generic_classifier ]) text_list = ["Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. He uses alcohol and cigarettes", "The patient quit smoking approximately two years ago with an approximately a 40 pack year history, mostly cigar use. He also reports 'heavy alcohol use', quit 15 months ago.", "Employee in neuro departmentin at the Center Hospital 18. Widower since 2001. Current smoker since 20 years. No EtOH or illicits.", "Patient smoked 4 ppd x 37 years, quitting 22 years ago. 
He is widowed, lives alone, has three children."] df = spark.createDataFrame(text_list, StringType()).toDF("text") result = pipeline.fit(df).transform(df) result.select("text", "class.result").show(truncate=100) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence_embeddings") val features_asm = new FeaturesAssembler() .setInputCols("sentence_embeddings") .setOutputCol("features") val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_alcohol_usage_sbiobert_cased_mli", "en", "clinical/models") .setInputCols("features") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_embeddings, features_asm, generic_classifier)) val data = Seq("Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. He uses alcohol and cigarettes.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.generic.sdoh_alchol_usage_sbiobert_cased").predict("""The patient quit smoking approximately two years ago with an approximately a 40 pack year history, mostly cigar use. He also reports 'heavy alcohol use', quit 15 months ago.""") ```
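Conceptually, the `GenericClassifierModel` stage is a small feed-forward classifier over the assembled sentence-embedding feature vector. A toy sketch of that idea — the feature size, weights, biases, and scores below are invented for illustration; the real model's parameters are learned and its features are full-size sentence embeddings:

```python
labels = ["Present", "Past", "Never", "None"]

# Toy feature vector standing in for the sentence embedding.
features = [0.2, -0.1, 0.4, 0.05]

# Invented weight matrix (one row per label) and biases.
weights = [
    [0.9, 0.1, 0.3, 0.0],   # Present
    [0.1, 0.8, -0.2, 0.1],  # Past
    [-0.3, 0.2, 0.7, 0.4],  # Never
    [0.0, -0.1, 0.1, 0.9],  # None
]
biases = [0.05, 0.0, -0.1, 0.0]

# Linear scores: one dot product per label, plus a bias term.
scores = [sum(w * f for w, f in zip(row, features)) + b
          for row, b in zip(weights, biases)]

# The predicted class is the label with the highest score.
prediction = labels[max(range(len(scores)), key=scores.__getitem__)]
print(prediction)  # -> Present
```

The `FeaturesAssembler` stage in the pipeline above exists precisely to build the `features` vector this kind of classifier consumes.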
## Results ```bash +----------------------------------------------------------------------------------------------------+---------+ | text| result| +----------------------------------------------------------------------------------------------------+---------+ |Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 2...|[Present]| |The patient quit smoking approximately two years ago with an approximately a 40 pack year history...| [Past]| |Employee in neuro departmentin at the Center Hospital 18. Widower since 2001. Current smoker sinc...| [Never]| |Patient smoked 4 ppd x 37 years, quitting 22 years ago. He is widowed, lives alone, has three chi...| [None]| +----------------------------------------------------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|genericclassifier_sdoh_alcohol_usage_sbiobert_cased_mli| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[features]| |Output Labels:|[prediction]| |Language:|en| |Size:|3.5 MB| ## Benchmarking ```bash label precision recall f1-score support Never 0.84 0.87 0.85 523 None 0.83 0.74 0.81 341 Past 0.51 0.35 0.50 98 Present 0.74 0.83 0.79 418 accuracy - - 0.79 1380 macro-avg 0.73 0.70 0.71 1380 weighted-avg 0.78 0.79 0.78 1380 ``` --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_hier_triplet_epochs_1_shard_1_squad2.0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide 
scalability and production-readiness using Spark NLP. `rule_based_roberta_hier_triplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_hier_triplet_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223579788.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_hier_triplet_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223579788.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_hier_triplet_epochs_1_shard_1_squad2.0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_hier_triplet_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_rule_based_hier_triplet_epochs_1_shard_1_squad2.0|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|460.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/AnonymousSub/rule_based_roberta_hier_triplet_epochs_1_shard_1_squad2.0

---
layout: model
title: Chunk Resolver (Cpt Clinical)
author: John Snow Labs
name: chunkresolve_cpt_clinical
date: 2021-04-02
tags: [entity_resolution, clinical, licensed, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
deprecated: true
annotator: ChunkEntityResolverModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Entity resolution model based on KNN using word embeddings and Word Mover's Distance.

## Predicted Entities

CPT codes and their normalized definitions with `clinical_embeddings`.

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_cpt_clinical_en_3.0.0_3.0_1617355184583.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_cpt_clinical_en_3.0.0_3.0_1617355184583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... cpt_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_cpt_clinical","en","clinical/models")\ .setInputCols("token","chunk_embeddings")\ .setOutputCol("entity") pipeline_puerile = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, cpt_resolver]) model = pipeline_puerile.fit(spark.createDataFrame([["""The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion."""]]).toDF("text")) results = model.transform(data) ``` ```scala ... val cpt_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_cpt_clinical","en","clinical/models") .setInputCols(Array("token","chunk_embeddings")) .setOutputCol("resolution") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, cpt_resolver)) val data = Seq("The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. 
At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion.").toDF("text") val result = pipeline.fit(data).transform(data) ```
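The resolver above is described as KNN over chunk embeddings with Word Mover's Distance. As a rough illustration only (toy vectors, cosine similarity instead of WMD, and a hand-picked code table — not the model's actual index or metric), nearest-neighbor resolution works like this:

```python
import numpy as np

# Toy "code -> embedding" table standing in for the trained resolver's index.
# Vectors and code pairings are illustrative only.
index = {
    "59514": np.array([0.9, 0.1, 0.0]),  # Cesarean delivery only;
    "31599": np.array([0.1, 0.9, 0.0]),  # Unlisted procedure, larynx
    "39501": np.array([0.0, 0.2, 0.9]),  # Repair, laceration of diaphragm
}

def resolve(chunk_vec: np.ndarray, k: int = 1):
    """Return the k codes whose embeddings are nearest to the chunk embedding."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(index.items(), key=lambda kv: cos(chunk_vec, kv[1]), reverse=True)
    return [code for code, _ in ranked[:k]]

print(resolve(np.array([0.1, 0.3, 0.8])))  # ['39501']
```

The real model additionally normalizes chunks and ranks candidates with Word Mover's Distance over the clinical embeddings, but the lookup shape is the same: embed the chunk, find the closest stored codes.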
## Results ```bash chunk entity cpt_description cpt_code 0 a cold, cough PROBLEM Thoracoscopy, surgical; with removal of a sing... 32669 1 runny nose PROBLEM Unlisted procedure, larynx 31599 2 fever PROBLEM Cesarean delivery only; 59514 3 difficulty breathing PROBLEM Repair, laceration of diaphragm, any approach 39501 4 her cough PROBLEM Exploration for postoperative hemorrhage, thro... 35840 5 physical exam TEST Cesarean delivery only; including postpartum care 59515 6 fairly congested PROBLEM Pyelotomy; with drainage, pyelostomy 50125 7 Amoxil TREATMENT Cholecystoenterostomy; with gastroenterostomy 47721 8 Aldex TREATMENT Laparoscopy, surgical; with omentopexy (omenta... 49326 9 difficulty breathing PROBLEM Repair, laceration of diaphragm, any approach 39501 10 more congested PROBLEM for section of 1 or more cranial nerves 61460 11 trouble sleeping PROBLEM Repair, laceration of diaphragm, any approach 39501 12 congestion PROBLEM Repair, laceration of diaphragm, any approach 39501 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|chunkresolve_cpt_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token, chunk_embeddings]| |Output Labels:|[icd10pcs]| |Language:|en| --- layout: model title: Fast Neural Machine Translation Model from Luba-Lulua to English author: John Snow Labs name: opus_mt_lua_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, lua, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. 
Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.

- source languages: `lua`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_lua_en_xx_2.7.0_2.4_1609170193384.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_lua_en_xx_2.7.0_2.4_1609170193384.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_lua_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["PUT YOUR STRING HERE"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_lua_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.lua.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_lua_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English T5ForConditionalGeneration Base Cased model (from doc2query) author: John Snow Labs name: t5_stackexchange_title_body_base_v1 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `stackexchange-title-body-t5-base-v1` is a English model originally trained by `doc2query`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_stackexchange_title_body_base_v1_en_4.3.0_3.0_1675107445767.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_stackexchange_title_body_base_v1_en_4.3.0_3.0_1675107445767.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_stackexchange_title_body_base_v1","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_stackexchange_title_body_base_v1","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_stackexchange_title_body_base_v1|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|1.0 GB|

## References

- https://huggingface.co/doc2query/stackexchange-title-body-t5-base-v1
- https://arxiv.org/abs/1904.08375
- https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf
- https://arxiv.org/abs/2104.08663
- https://github.com/UKPLab/beir
- https://www.sbert.net/examples/unsupervised_learning/query_generation/README.html

---
layout: model
title: Legal Delivery Clause Binary Classifier
author: John Snow Labs
name: legclf_delivery_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `delivery` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Note that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial link above).

This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `delivery`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_delivery_clause_en_1.0.0_3.2_1660123411565.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_delivery_clause_en_1.0.0_3.2_1660123411565.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("clause_text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = legal.ClassifierDLModel.pretrained("legclf_delivery_clause", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    embeddings,
    docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text")
model = nlpPipeline.fit(df)
result = model.transform(df)
```
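The paragraph-splitting advice from the description (split big documents "by multiline" before classification) can be sketched in plain Python; the regex and the sample clauses below are illustrative, not taken from a real contract:

```python
import re

# A toy contract with clauses separated by blank lines.
document = (
    "1. TERM. This Agreement shall commence on the Effective Date.\n\n"
    "2. DELIVERY. Seller shall deliver the Goods to the address specified.\n\n"
    "3. GOVERNING LAW. This Agreement is governed by the laws of Delaware."
)

# Split on one or more blank lines, i.e. "paragraph splitting (by multiline)".
paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

for p in paragraphs:
    print(p.split(".")[0])
```

Each resulting paragraph would then become one row of `clause_text` fed to the classifier pipeline, so the model sees one candidate clause at a time instead of the whole document.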
## Results

```bash
+----------+
|result    |
+----------+
|[delivery]|
|[other]   |
|[other]   |
|[delivery]|
+----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_delivery_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.1 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
label precision recall f1-score support
delivery 0.86 0.86 0.86 35
other 0.94 0.94 0.94 80
accuracy - - 0.91 115
macro-avg 0.90 0.90 0.90 115
weighted-avg 0.91 0.91 0.91 115
```

---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from QWEasd1122)
author: John Snow Labs
name: distilbert_qa_qweasd1122_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `QWEasd1122`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_qweasd1122_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768957349.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_qweasd1122_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768957349.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_qweasd1122_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_qweasd1122_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
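After `transform` runs, each row's `answer` column holds a list of annotations whose `result` field carries the predicted span. A toy stand-in for that flattening step over plain dictionaries (the dict shape is illustrative, not Spark's exact annotation schema):

```python
# Rows mimicking the shape of question-answering output:
# one annotation list per input row (field names are illustrative).
rows = [
    {"question": "What's my name?",
     "answer": [{"annotatorType": "chunk", "result": "Clara"}]},
]

def answers(rows):
    """Flatten each row's annotation list down to the answer strings."""
    return [[a["result"] for a in row["answer"]] for row in rows]

print(answers(rows))  # [['Clara']]
```

In Spark itself the equivalent would be selecting the nested result field from the output DataFrame rather than iterating Python dicts.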
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_qweasd1122_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/QWEasd1122/distilbert-base-uncased-finetuned-squad

---
layout: model
title: Legal Effective Dates Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_effective_dates_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, effective_dates, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

This model is a Binary Classifier (True, False) for the `Effective_Dates` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Note that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial link above).

This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`Effective_Dates`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_effective_dates_bert_en_1.0.0_3.0_1678049960025.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_effective_dates_bert_en_1.0.0_3.0_1678049960025.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_effective_dates_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
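The 512-token limit mentioned in the description can be guarded with a simple whitespace-token count before sending text to the pipeline; real subword counts differ from whitespace counts, so treat this only as a conservative sketch:

```python
def split_to_limit(text: str, max_tokens: int = 512):
    """Greedily pack whitespace tokens into pieces of at most max_tokens each.
    Whitespace tokenization only approximates the model's subword count."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]

# 1100 toy tokens at a 512 cap -> pieces of 512, 512, and 76 tokens.
pieces = split_to_limit("word " * 1100)
print(len(pieces))             # 3
print(len(pieces[0].split()))  # 512
```

Each piece can then be classified independently, which matches the splitting advice above for documents longer than the embedding limit.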
## Results

```bash
+-----------------+
|result           |
+-----------------+
|[Effective_Dates]|
|[Other]          |
|[Other]          |
|[Effective_Dates]|
+-----------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_effective_dates_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.7 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
label precision recall f1-score support
Effective_Dates 0.94 0.94 0.94 52
Other 0.96 0.96 0.96 73
accuracy - - 0.95 125
macro-avg 0.95 0.95 0.95 125
weighted-avg 0.95 0.95 0.95 125
```

---
layout: model
title: Chinese Word Segmentation
author: John Snow Labs
name: wordseg_weibo
date: 2021-01-03
task: Word Segmentation
language: zh
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [word_segmentation, cn, zh, open_source]
supported: true
annotator: WordSegmenterModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

WordSegmenterModel (WSM) is based on a maximum entropy probability model to detect word boundaries in Chinese text. Chinese text is written without white space between words, and a computer-based application cannot know _a priori_ which sequence of ideograms forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step.

References:

- Xue, Nianwen. "Chinese word segmentation as character tagging." International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing. 2003.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_weibo_zh_2.7.0_2.4_1609694553500.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_weibo_zh_2.7.0_2.4_1609694553500.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use

Use as part of an NLP pipeline as a substitute for the Tokenizer stage.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

word_segmenter = WordSegmenterModel.pretrained('wordseg_weibo', 'zh')\
    .setInputCols("document")\
    .setOutputCol("token")

pipeline = Pipeline(stages=[document_assembler, word_segmenter])
model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
example = spark.createDataFrame([['然而,这样的处理也衍生了一些问题。']], ["text"])
result = model.transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val word_segmenter = WordSegmenterModel.pretrained("wordseg_weibo", "zh")
    .setInputCols("document")
    .setOutputCol("token")

val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter))
val data = Seq("然而,这样的处理也衍生了一些问题。").toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["""然而,这样的处理也衍生了一些问题。"""]
token_df = nlu.load('zh.segment_words.weibo').predict(text, output_level='token')
token_df
```
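The Xue (2003) character-tagging formulation cited in the description labels every character as B (begin), M (middle), E (end), or S (single-character word), then reads words off the tags. A tiny decoder sketch (the tags here are hand-written, not model output):

```python
def decode_bmes(chars, tags):
    """Rebuild words from per-character B/M/E/S tags:
    B = begin, M = middle, E = end of a multi-char word, S = single-char word."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            words.append(ch)
        elif tag == "B":
            current = ch
        elif tag == "M":
            current += ch
        else:  # "E" closes the current multi-character word
            words.append(current + ch)
            current = ""
    return words

print(decode_bmes("然而这样的处理", ["B", "E", "B", "E", "S", "B", "E"]))
# ['然而', '这样', '的', '处理']
```

The trained model's job is precisely the part elided here: predicting that tag sequence from the characters via the maximum entropy model.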
## Results

```bash
+----------------------------------+--------------------------------------------------------+
|text                              |result                                                  |
+----------------------------------+--------------------------------------------------------+
|然而,这样的处理也衍生了一些问题。|[然而, ,, 这样, 的, 处理, 也, 衍生, 了, 一些, 问题, 。]|
+----------------------------------+--------------------------------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|wordseg_weibo|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[token]|
|Language:|zh|

## Data Source

The model is trained on the data set from He, Hangfeng, and Xu Sun. "F-score driven max margin neural network for named entity recognition in Chinese social media." arXiv preprint arXiv:1611.04234 (2016).

## Benchmarking

```bash
| Model         | precision    | recall       | f1-score     |
|---------------|--------------|--------------|--------------|
| WORDSEG_CTB   | 0.6453       | 0.6341       | 0.6397       |
| WORDSEG_WEIBO | 0.5454       | 0.5655       | 0.5553       |
| WORDSEG_MSR   | 0.5984       | 0.6088       | 0.6035       |
| WORDSEG_PKU   | 0.6094       | 0.6321       | 0.6206       |
| WORDSEG_LARGE | 0.6326       | 0.6269       | 0.6297       |
```

---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab0_by_cuzeverynameistaken TFWav2Vec2ForCTC from cuzeverynameistaken
author: John Snow Labs
name: asr_wav2vec2_base_timit_demo_colab0_by_cuzeverynameistaken
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab0_by_cuzeverynameistaken` is an English model originally trained by cuzeverynameistaken.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab0_by_cuzeverynameistaken_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab0_by_cuzeverynameistaken_en_4.2.0_3.0_1664039230746.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab0_by_cuzeverynameistaken_en_4.2.0_3.0_1664039230746.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab0_by_cuzeverynameistaken", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab0_by_cuzeverynameistaken", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_demo_colab0_by_cuzeverynameistaken|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|355.0 MB|

---
layout: model
title: English BertForTokenClassification Cased model (from ghadeermobasher)
author: John Snow Labs
name: bert_ner_BC5CDR_Chem_Modified_BlueBERT_512
date: 2022-07-06
tags: [bert, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC5CDR-Chem-Modified-BlueBERT-512` is an English model originally trained by `ghadeermobasher`.

## Predicted Entities

`Chemical`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC5CDR_Chem_Modified_BlueBERT_512_en_4.0.0_3.0_1657109356626.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC5CDR_Chem_Modified_BlueBERT_512_en_4.0.0_3.0_1657109356626.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token")

tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC5CDR_Chem_Modified_BlueBERT_512","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC5CDR_Chem_Modified_BlueBERT_512","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC5CDR_Chem_Modified_BlueBERT_512| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|408.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC5CDR-Chem-Modified-BlueBERT-512 --- layout: model title: Legal Term and termination Clause Binary Classifier author: John Snow Labs name: legclf_term_and_termination_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `term-and-termination` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
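As an illustration of the paragraph-splitting option mentioned above, here is a minimal plain-Python sketch (outside Spark NLP; the function name is ours, not part of the library) that splits a contract into candidate clause paragraphs on blank lines:

```python
import re

def split_paragraphs(text):
    """Split a document into candidate clause paragraphs on blank lines."""
    # Two or more consecutive newlines mark a paragraph boundary.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = "TERM.\nThis Agreement lasts two years.\n\nTERMINATION.\nEither party may terminate."
for clause in split_paragraphs(doc):
    print(clause)
```

Each resulting paragraph can then be fed to the classifier as its own `clause_text` row.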
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `term-and-termination` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_term_and_termination_clause_en_1.0.0_3.2_1660123073091.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_term_and_termination_clause_en_1.0.0_3.2_1660123073091.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_term_and_termination_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
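Conceptually, combining several binary clause classifiers as described above just collects one True/False flag per clause type. A plain-Python sketch with stand-in classifiers (the lambdas below are illustrative keyword checks, not real Spark NLP calls):

```python
# Stand-in binary classifiers; each mimics one legclf_* model's True/False output.
classifiers = {
    "term-and-termination": lambda text: "terminate" in text.lower(),
    "governing-law": lambda text: "governed by the laws" in text.lower(),
}

def classify_clause(text):
    """Run every clause classifier on the same text and collect its flag."""
    return {name: fn(text) for name, fn in classifiers.items()}

flags = classify_clause("Either party may terminate this Agreement upon 30 days' notice.")
```

In a real pipeline each flag would come from a separate `ClassifierDLModel` stage rather than a keyword check.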
## Results ```bash +-------+ | result| +-------+ |[term-and-termination]| |[other]| |[other]| |[term-and-termination]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_term_and_termination_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.97 0.97 0.97 95 term-and-termination 0.93 0.93 0.93 40 accuracy - - 0.96 135 macro-avg 0.95 0.95 0.95 135 weighted-avg 0.96 0.96 0.96 135 ``` --- layout: model title: Legal Transition Services Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_transition_services_agreement date: 2022-11-10 tags: [en, legal, classification, agreement, transition_services, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_transition_services_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `transition-services-agreement` or not (Binary Classification). Longformers have a restriction of 4096 tokens, so only the first 4096 tokens will be taken into account. We have found that, for the vast majority of documents in legal corpora, if they are clean and contain only the legal document without any extra information before it, 4096 tokens are enough to perform Document Classification. If not, let us know and we can carry out another approach for you: splitting the document into chunks of 4096 tokens, averaging their embeddings, and training with the averaged version, which means the whole document is taken into account. But this should not normally be required.
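The chunk-and-average fallback described above can be sketched in plain Python with toy vectors (`embed` here is a one-dimensional stand-in for the Longformer encoder, not a real API):

```python
def chunk(tokens, size=4096):
    """Split a token list into consecutive windows of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def average(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

tokens = ["tok"] * 10000                  # a 10,000-token document
windows = chunk(tokens)                   # three windows: 4096 + 4096 + 1808
embed = lambda w: [float(len(w))]         # toy "embedding": just the window length
doc_vector = average([embed(w) for w in windows])
```

A classifier trained on `doc_vector` then sees information from the whole document, not only the first 4096-token window.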
## Predicted Entities `transition-services-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_transition_services_agreement_en_1.0.0_3.0_1668118234770.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_transition_services_agreement_en_1.0.0_3.0_1668118234770.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = nlp.ClassifierDLModel.pretrained("legclf_transition_services_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[transition-services-agreement]| |[other]| |[other]| |[transition-services-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_transition_services_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.1 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.99 1.00 0.99 66 transition-services-agreement 1.00 0.97 0.99 35 accuracy - - 0.99 101 macro-avg 0.99 0.99 0.99 101 weighted-avg 0.99 0.99 0.99 101 ``` --- layout: model title: Igbo Named Entity Recognition (from mbeukman) author: John Snow Labs name: xlmroberta_ner_xlm_roberta_base_finetuned_ner_igbo date: 2022-05-17 tags: [xlm_roberta, ner, token_classification, ig, open_source] task: Named Entity Recognition language: ig edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-ner-igbo` is an Igbo model originally trained by `mbeukman`. ## Predicted Entities `PER`, `ORG`, `LOC`, `DATE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_ner_igbo_ig_3.4.2_3.0_1652809359756.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_ner_igbo_ig_3.4.2_3.0_1652809359756.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_ner_igbo","ig") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ahụrụ m n'anya na-atọ m ụtọ"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_ner_igbo","ig") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ahụrụ m n'anya na-atọ m ụtọ").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_xlm_roberta_base_finetuned_ner_igbo| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ig| |Size:|773.8 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-ner-igbo - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://www.apache.org/licenses/LICENSE-2.0 --- layout: model title: Social Determinants of Health (clinical_medium) author: John Snow Labs name: ner_sdoh_emb_clinical_medium_wip date: 2023-04-12 tags: [en, clinical_medium, social_determinants, ner, public_health, sdoh, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts terminology related to Social Determinants of Health from various kinds of biomedical documents.
## Predicted Entities `Access_To_Care`, `Age`, `Alcohol`, `Chidhood_Event`, `Communicable_Disease`, `Community_Safety`, `Diet`, `Disability`, `Eating_Disorder`, `Education`, `Employment`, `Environmental_Condition`, `Exercise`, `Family_Member`, `Financial_Status`, `Food_Insecurity`, `Gender`, `Geographic_Entity`, `Healthcare_Institution`, `Housing`, `Hyperlipidemia`, `Hypertension`, `Income`, `Insurance_Status`, `Language`, `Legal_Issues`, `Marital_Status`, `Mental_Health`, `Obesity`, `Other_Disease`, `Other_SDoH_Keywords`, `Population_Group`, `Quality_Of_Life`, `Race_Ethnicity`, `Sexual_Activity`, `Sexual_Orientation`, `Smoking`, `Social_Exclusion`, `Social_Support`, `Spiritual_Beliefs`, `Substance_Duration`, `Substance_Frequency`, `Substance_Quantity`, `Substance_Use`, `Transportation`, `Violence_Or_Abuse` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/SOCIAL_DETERMINANT_NER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/SOCIAL_DETERMINANT_NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_emb_clinical_medium_wip_en_4.3.2_3.2_1681303578006.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_emb_clinical_medium_wip_en_4.3.2_3.2_1681303578006.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_sdoh_emb_clinical_medium_wip", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter ]) sample_texts = [["Smith is a 55 years old, divorced Mexcian American woman with financial problems. She speaks spanish. She lives in an apartment. She has been struggling with diabetes for the past 10 years and has recently been experiencing frequent hospitalizations due to uncontrolled blood sugar levels. Smith works as a cleaning assistant and does not have access to health insurance or paid sick leave. She has a son student at college. Pt with likely long-standing depression. She is aware she needs rehab. Pt reprots having her catholic faith as a means of support as well. She has long history of etoh abuse, beginning in her teens. She reports she has been a daily drinker for 30 years, most recently drinking beer daily. She smokes a pack of cigarettes a day. 
She had DUI back in April and was due to be in court this week."]] data = spark.createDataFrame(sample_texts).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_sdoh_emb_clinical_medium_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter )) val data = Seq("Smith is a 55 years old, divorced Mexcian American woman with financial problems. She speaks spanish. She lives in an apartment. She has been struggling with diabetes for the past 10 years and has recently been experiencing frequent hospitalizations due to uncontrolled blood sugar levels. Smith works as a cleaning assistant and does not have access to health insurance or paid sick leave. She has a son student at college. Pt with likely long-standing depression. She is aware she needs rehab. Pt reprots having her catholic faith as a means of support as well. She has long history of etoh abuse, beginning in her teens. She reports she has been a daily drinker for 30 years, most recently drinking beer daily. She smokes a pack of cigarettes a day. 
She had DUI back in April and was due to be in court this week.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash +------------------+-----+---+-------------------+ |chunk |begin|end|ner_label | +------------------+-----+---+-------------------+ |55 years old |11 |22 |Age | |divorced |25 |32 |Marital_Status | |Mexcian |34 |40 |Gender | |American |42 |49 |Race_Ethnicity | |woman |51 |55 |Gender | |financial problems|62 |79 |Financial_Status | |She |82 |84 |Gender | |spanish |93 |99 |Language | |She |102 |104|Gender | |apartment |118 |126|Housing | |She |129 |131|Gender | |diabetes |158 |165|Other_Disease | |hospitalizations |233 |248|Other_SDoH_Keywords| |cleaning assistant|307 |324|Employment | |health insurance |354 |369|Insurance_Status | |She |391 |393|Gender | |son |401 |403|Family_Member | |student |405 |411|Education | |college |416 |422|Education | |depression |454 |463|Mental_Health | |She |466 |468|Gender | |she |479 |481|Gender | |rehab |489 |493|Access_To_Care | |her |514 |516|Gender | |catholic faith |518 |531|Spiritual_Beliefs | |support |547 |553|Social_Support | |She |565 |567|Gender | |etoh abuse |589 |598|Alcohol | |her |614 |616|Gender | |teens |618 |622|Age | |She |625 |627|Gender | |she |637 |639|Gender | |daily |652 |656|Substance_Frequency| |drinker |658 |664|Alcohol | |30 years |670 |677|Substance_Duration | |drinking |694 |701|Alcohol | |beer |703 |706|Alcohol | |daily |708 |712|Substance_Frequency| |She |715 |717|Gender | |smokes |719 |724|Smoking | |a pack |726 |731|Substance_Quantity | |cigarettes |736 |745|Smoking | |a day |747 |751|Substance_Frequency| |She |754 |756|Gender | |DUI |762 |764|Legal_Issues | +------------------+-----+---+-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_sdoh_emb_clinical_medium_wip| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.0 MB| |Dependencies:|embeddings_clinical_medium| ## References Internal SHOP Project ## 
Benchmarking ```bash label precision recall f1-score support Geographic_Entity 0.89 0.88 0.88 106 Gender 0.99 0.99 0.99 4957 Healthcare_Institution 0.98 0.96 0.97 776 Employment 0.95 0.95 0.95 2120 Access_To_Care 0.90 0.81 0.85 459 Income 0.79 0.79 0.79 29 Social_Support 0.90 0.92 0.91 629 Family_Member 0.97 0.99 0.98 2101 Age 0.94 0.93 0.94 436 Mental_Health 0.89 0.86 0.87 479 Alcohol 0.96 0.96 0.96 254 Substance_Use 0.88 0.95 0.91 208 Hypertension 0.96 1.00 0.98 24 Other_Disease 0.90 0.94 0.92 583 Disability 0.93 0.97 0.95 40 Insurance_Status 0.87 0.85 0.86 85 Transportation 0.82 0.96 0.89 53 Sexual_Orientation 0.78 0.95 0.86 19 Marital_Status 0.98 0.96 0.97 90 Race_Ethnicity 0.92 0.96 0.94 25 Spiritual_Beliefs 0.80 0.80 0.80 51 Housing 0.89 0.85 0.87 366 Education 0.87 0.86 0.86 70 Other_SDoH_Keywords 0.78 0.88 0.83 237 Language 0.87 0.77 0.82 26 Substance_Frequency 0.92 0.83 0.87 65 Legal_Issues 0.77 0.85 0.81 55 Social_Exclusion 0.97 0.97 0.97 30 Financial_Status 0.88 0.66 0.75 123 Violence_Or_Abuse 0.82 0.65 0.73 57 Substance_Quantity 0.88 0.93 0.90 56 Smoking 0.99 0.99 0.99 71 Population_Group 0.91 0.71 0.80 14 Hyperlipidemia 0.78 1.00 0.88 7 Community_Safety 0.98 1.00 0.99 47 Exercise 0.91 0.88 0.90 60 Food_Insecurity 1.00 1.00 1.00 29 Eating_Disorder 0.67 0.92 0.77 13 Quality_Of_Life 0.79 0.82 0.81 61 Sexual_Activity 0.89 0.83 0.86 29 Chidhood_Event 0.90 0.72 0.80 25 Diet 0.97 0.92 0.94 62 Substance_Duration 0.66 0.95 0.78 39 Environmental_Condition 1.00 1.00 1.00 20 Obesity 1.00 1.00 1.00 14 Communicable_Disease 1.00 0.94 0.97 32 micro-avg 0.95 0.95 0.95 15132 macro-avg 0.89 0.90 0.89 15132 weighted-avg 0.95 0.95 0.95 15132 ``` --- layout: model title: Kandawo RobertaForQuestionAnswering (from Gam) author: John Snow Labs name: roberta_qa_roberta_base_finetuned_cuad_gam date: 2022-06-20 tags: [open_source, question_answering, roberta] task: Question Answering language: gam edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: 
RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-cuad-gam` is a Kandawo model originally trained by `Gam`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_cuad_gam_gam_4.0.0_3.0_1655733793939.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_cuad_gam_gam_4.0.0_3.0_1655733793939.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_finetuned_cuad_gam","gam") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_finetuned_cuad_gam","gam") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("gam.answer_question.cuad_gam.roberta.base.by_Gam").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_finetuned_cuad_gam| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|gam| |Size:|450.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Gam/roberta-base-finetuned-cuad-gam --- layout: model title: Thai BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_th_cased date: 2022-12-02 tags: [th, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: th edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-th-cased` is a Thai model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_th_cased_th_4.2.4_3.0_1670019070316.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_th_cased_th_4.2.4_3.0_1670019070316.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_th_cased","th") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_th_cased","th") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_th_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|th| |Size:|347.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-th-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: English DistilBertForQuestionAnswering Cased model (from saraks) author: John Snow Labs name: distilbert_qa_cuad_multi_fields_08_29_v1 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cuad-distil-multi_fields-08-29-v1` is a English model originally trained by `saraks`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_multi_fields_08_29_v1_en_4.3.0_3.0_1672766196812.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_multi_fields_08_29_v1_en_4.3.0_3.0_1672766196812.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_multi_fields_08_29_v1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_multi_fields_08_29_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_cuad_multi_fields_08_29_v1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/saraks/cuad-distil-multi_fields-08-29-v1 --- layout: model title: Multilingual T5ForConditionalGeneration Cased model (from unicamp-dl) author: John Snow Labs name: t5_translation_en2pt date: 2023-01-31 tags: [pt, en, open_source, t5, xx, tensorflow] task: Text Generation language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `translation-en-pt-t5` is a Multilingual model originally trained by `unicamp-dl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_translation_en2pt_xx_4.3.0_3.0_1675157481338.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_translation_en2pt_xx_4.3.0_3.0_1675157481338.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_translation_en2pt","xx") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_translation_en2pt","xx") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_translation_en2pt| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|xx| |Size:|926.3 MB| ## References - https://huggingface.co/unicamp-dl/translation-en-pt-t5 - https://github.com/unicamp-dl/Lite-T5-Translation - https://aclanthology.org/2020.wmt-1.90.pdf --- layout: model title: Pipeline to Detect Adverse Drug Events (healthcare) author: John Snow Labs name: ner_ade_healthcare_pipeline date: 2023-03-14 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_ade_healthcare](https://nlp.johnsnowlabs.com/2021/04/01/ner_ade_healthcare_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_ade_healthcare_pipeline_en_4.3.0_3.2_1678824548052.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_ade_healthcare_pipeline_en_4.3.0_3.2_1678824548052.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_ade_healthcare_pipeline", "en", "clinical/models") text = '''Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_ade_healthcare_pipeline", "en", "clinical/models") val text = "Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.healthcare_ade.pipeline").predict("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps.""") ```
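Spark NLP annotations carry inclusive `begin`/`end` character offsets, so each chunk in the results table below can be recovered from the input text with `text[begin:end + 1]`. A quick pure-Python sketch of the offset convention (the offset values are copied from the results table, not produced by running the pipeline here):

```python
# Spark NLP begin/end offsets are inclusive on both ends,
# so slicing the original string needs end + 1.
text = ("Been taking Lipitor for 15 years , have experienced severe fatigue "
        "a lot!!! . Doctor moved me to voltaren 2 months ago , so far , "
        "have only experienced cramps.")

# (begin, end, chunk) triples copied from the pipeline's results table
chunks = [
    (12, 18, "Lipitor"),
    (52, 65, "severe fatigue"),
    (97, 104, "voltaren"),
    (152, 157, "cramps"),
]

for begin, end, expected in chunks:
    assert text[begin:end + 1] == expected
```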
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:---------------|--------:|------:|:------------|-------------:| | 0 | Lipitor | 12 | 18 | DRUG | 0.998 | | 1 | severe fatigue | 52 | 65 | ADE | 0.67055 | | 2 | voltaren | 97 | 104 | DRUG | 0.9255 | | 3 | cramps | 152 | 157 | ADE | 0.9392 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_ade_healthcare_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|513.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: German RoBERTa Embeddings (Base, Hotel) author: John Snow Labs name: roberta_embeddings_HotelBERT date: 2022-04-14 tags: [roberta, embeddings, de, open_source] task: Embeddings language: de edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `HotelBERT` is a German model originally trained by `FabianGroeger`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_HotelBERT_de_3.4.2_3.0_1649948014702.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_HotelBERT_de_3.4.2_3.0_1649948014702.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_HotelBERT","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Funken NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_HotelBERT","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Funken NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_HotelBERT| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|411.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/FabianGroeger/HotelBERT --- layout: model title: English image_classifier_vit_planes_airlines ViTForImageClassification from suhnylla author: John Snow Labs name: image_classifier_vit_planes_airlines date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_planes_airlines` is an English model originally trained by suhnylla. ## Predicted Entities `planes cathay pacific`, `planes malaysia airlines`, `planes delta airlines`, `planes singapore airlines`, `planes virgin airlines` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_planes_airlines_en_4.1.0_3.0_1660168651829.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_planes_airlines_en_4.1.0_3.0_1660168651829.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_planes_airlines", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_planes_airlines", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_planes_airlines| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_10_h_128 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-10_H-128` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_10_h_128_zh_4.2.4_3.0_1670021457823.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_10_h_128_zh_4.2.4_3.0_1670021457823.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_10_h_128","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_10_h_128","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_10_h_128| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|18.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-10_H-128 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Part of Speech for Norwegian author: John Snow Labs name: pos_ud_bokmaal date: 2022-06-25 tags: [pos, norwegian, nb, open_source] task: Part of Speech Tagging language: nb edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. This model was trained using the dataset available at https://universaldependencies.org ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_bokmaal_nb_4.0.0_3.0_1656123012700.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_bokmaal_nb_4.0.0_3.0_1656123012700.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_bokmaal", "nb") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Annet enn å være kongen i nord, er John Snow en engelsk lege og en leder innen utvikling av anestesi og medisinsk hygiene.") ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_bokmaal", "nb") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Annet enn å være kongen i nord, er John Snow en engelsk lege og en leder innen utvikling av anestesi og medisinsk hygiene.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Annet enn å være kongen i nord, er John Snow en engelsk lege og en leder innen utvikling av anestesi og medisinsk hygiene."""] pos_df = nlu.load('nb.pos.ud_bokmaal').predict(text) pos_df ```
## Results ```bash [Row(annotatorType='pos', begin=0, end=4, result='DET', metadata={'word': 'Annet'}), Row(annotatorType='pos', begin=6, end=8, result='SCONJ', metadata={'word': 'enn'}), Row(annotatorType='pos', begin=10, end=10, result='PART', metadata={'word': 'å'}), Row(annotatorType='pos', begin=12, end=15, result='AUX', metadata={'word': 'være'}), Row(annotatorType='pos', begin=17, end=22, result='NOUN', metadata={'word': 'kongen'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_bokmaal| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|nb| |Size:|12.0 KB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - PerceptronModel --- layout: model title: German Bert Embeddings (Large, Cased) author: John Snow Labs name: bert_embeddings_gbert_large date: 2022-04-11 tags: [bert, embeddings, de, open_source] task: Embeddings language: de edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `gbert-large` is a German model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_gbert_large_de_3.4.2_3.0_1649676005002.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_gbert_large_de_3.4.2_3.0_1649676005002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_gbert_large","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Funken NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_gbert_large","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Funken NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.embed.gbert_large").predict("""Ich liebe Funken NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_gbert_large| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|1.3 GB| |Case sensitive:|true| ## References - https://huggingface.co/deepset/gbert-large - https://arxiv.org/pdf/2010.10906.pdf - https://deepset.ai/german-bert - https://deepset.ai/germanquad - https://github.com/deepset-ai/FARM - https://github.com/deepset-ai/haystack/ - https://twitter.com/deepset_ai - https://www.linkedin.com/company/deepset-ai/ - https://haystack.deepset.ai/community/join - https://github.com/deepset-ai/haystack/discussions - https://deepset.ai - http://www.deepset.ai/jobs --- layout: model title: Arabic Bert Embeddings (Base, 860k Iterations) author: John Snow Labs name: bert_embeddings_bert_base_qarib60_860k date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-qarib60_860k` is an Arabic model originally trained by `qarib`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_qarib60_860k_ar_3.4.2_3.0_1649678987131.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_qarib60_860k_ar_3.4.2_3.0_1649678987131.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_qarib60_860k","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_qarib60_860k","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.bert_base_qarib60_860k").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_qarib60_860k| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|507.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/qarib/bert-base-qarib60_860k - http://opus.nlpl.eu/ - https://github.com/qcri/QARiB/blob/main/Training_QARiB.md - https://github.com/qcri/QARiB/blob/main/Using_QARiB.md --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_gullenasatish TFWav2Vec2ForCTC from gullenasatish author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab_by_gullenasatish date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_gullenasatish` is an English model originally trained by gullenasatish. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_by_gullenasatish_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_gullenasatish_en_4.2.0_3.0_1664038782488.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_gullenasatish_en_4.2.0_3.0_1664038782488.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_by_gullenasatish", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_by_gullenasatish", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab_by_gullenasatish| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|354.9 MB| --- layout: model title: English BertForQuestionAnswering model (from Harsit) author: John Snow Labs name: bert_qa_Harsit_bert_finetuned_squad date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `Harsit`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Harsit_bert_finetuned_squad_en_4.0.0_3.0_1654535470994.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Harsit_bert_finetuned_squad_en_4.0.0_3.0_1654535470994.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Harsit_bert_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_Harsit_bert_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_Harsit").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_Harsit_bert_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Harsit/bert-finetuned-squad --- layout: model title: English T5ForConditionalGeneration Small Cased model (from mrm8488) author: John Snow Labs name: t5_small_finetuned_squadv2 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-finetuned-squadv2` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_finetuned_squadv2_en_4.3.0_3.0_1675126161506.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_finetuned_squadv2_en_4.3.0_3.0_1675126161506.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_finetuned_squadv2","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_finetuned_squadv2","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_finetuned_squadv2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|275.3 MB| ## References - https://huggingface.co/mrm8488/t5-small-finetuned-squadv2 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://rajpurkar.github.io/SQuAD-explorer/ - https://arxiv.org/pdf/1910.10683.pdf - https://i.imgur.com/jVFMMWR.png - https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb - https://twitter.com/psuraj28 - https://twitter.com/mrm8488 - https://www.linkedin.com/in/manuel-romero-cs/ --- layout: model title: English T5ForConditionalGeneration Base Cased model (from google) author: John Snow Labs name: t5_v1_1_base date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-v1_1-base` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_v1_1_base_en_4.3.0_3.0_1675156307761.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_v1_1_base_en_4.3.0_3.0_1675156307761.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_v1_1_base","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_v1_1_base","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_v1_1_base| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|521.2 MB| ## References - https://huggingface.co/google/t5-v1_1-base - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://github.com/google-research/text-to-text-transfer-transformer/blob/master/released_checkpoints.md#t511 - https://arxiv.org/abs/2002.05202 - https://arxiv.org/pdf/1910.10683.pdf - https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67 --- layout: model title: Extract Anatomical Entities from Voice of the Patient Documents (embeddings_clinical) author: John Snow Labs name: ner_vop_anatomy date: 2023-06-06 tags: [licensed, clinical, en, ner, vop, anatomy] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts anatomical terms from documents written in the patient's own words. ## Predicted Entities `BodyPart`, `Laterality` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_anatomy_en_4.4.3_3.0_1686073832453.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_anatomy_en_4.4.3_3.0_1686073832453.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_anatomy", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["Ugh, I pulled a muscle in my neck from sleeping weird last night. It's like a knot in my trapezius and it hurts to turn my head."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_anatomy", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Ugh, I pulled a muscle in my neck from sleeping weird last night. It's like a knot in my trapezius and it hurts to turn my head.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results

```bash
| chunk     | ner_label |
|:----------|:----------|
| muscle    | BodyPart  |
| neck      | BodyPart  |
| trapezius | BodyPart  |
| head      | BodyPart  |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_vop_anatomy|
|Compatibility:|Healthcare NLP 4.4.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|3.8 MB|
|Dependencies:|embeddings_clinical|

## References

In-house annotated health-related text in colloquial language.

## Sample text from the training dataset

Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
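The benchmarking figures reported below follow the standard per-label precision/recall/F1 definitions computed from true positives, false positives, and false negatives. A minimal sketch reproducing the BodyPart row (the tp/fp/fn counts are taken directly from the table):

```python
# Precision/recall/F1 from tp/fp/fn counts, as in the benchmarking table.
tp, fp, fn = 2682, 193, 218  # BodyPart row

precision = tp / (tp + fp)  # fraction of predicted chunks that were correct
recall = tp / (tp + fn)     # fraction of gold chunks that were found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.93 0.92 0.93
```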
## Benchmarking

```bash
label         tp   fp   fn  total  precision  recall    f1
BodyPart    2682  193  218   2900       0.93    0.92  0.93
Laterality   539   66   89    628       0.89    0.86  0.87
macro_avg   3221  259  307   3528       0.91    0.89  0.90
micro_avg   3221  259  307   3528       0.92    0.91  0.92
```

---
layout: model
title: English asr_wav2vec2_bilal_20epoch TFWav2Vec2ForCTC from Roshana
author: John Snow Labs
name: asr_wav2vec2_bilal_20epoch
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_bilal_20epoch` is an English model originally trained by Roshana.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_bilal_20epoch_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_bilal_20epoch_en_4.2.0_3.0_1664119636401.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_bilal_20epoch_en_4.2.0_3.0_1664119636401.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
audio_assembler = AudioAssembler() \
    .setInputCol("audio_content") \
    .setOutputCol("audio_assembler")

speech_to_text = Wav2Vec2ForCTC \
    .pretrained("asr_wav2vec2_bilal_20epoch", "en")\
    .setInputCols("audio_assembler") \
    .setOutputCol("text")

pipeline = Pipeline(stages=[
  audio_assembler,
  speech_to_text,
])

pipelineModel = pipeline.fit(audioDf)

pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
    .setInputCol("audio_content")
    .setOutputCol("audio_assembler")

val speechToText = Wav2Vec2ForCTC
    .pretrained("asr_wav2vec2_bilal_20epoch", "en")
    .setInputCols("audio_assembler")
    .setOutputCol("text")

val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))

val pipelineModel = pipeline.fit(audioDf)

val pipelineDF = pipelineModel.transform(audioDf)
```
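The snippets above assume an existing `audioDf` whose `audio_content` column holds the raw waveform as an array of floats at the 16 kHz sample rate Wav2Vec2 models expect. A minimal, hypothetical sketch of preparing such samples — the 440 Hz tone is an invented stand-in for audio loaded from a real WAV file:

```python
import math

# Hypothetical example: one second of a 440 Hz tone at 16 kHz.
# In practice the float samples would come from decoding a WAV file.
sample_rate = 16000
samples = [0.1 * math.sin(2 * math.pi * 440 * t / sample_rate)
           for t in range(sample_rate)]

# With a running SparkSession `spark`, the DataFrame used above would be
# built roughly like this (not executed here):
# audioDf = spark.createDataFrame([(samples,)], ["audio_content"])
```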
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_bilal_20epoch|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|

---
layout: model
title: English RobertaForQuestionAnswering Cased model (from AnonymousSub)
author: John Snow Labs
name: roberta_qa_rule_based_hier_triplet_shuffled_paras_epochs_1_shard_1_squad2.0
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_hier_triplet_shuffled_paras_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_hier_triplet_shuffled_paras_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223704448.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_hier_triplet_shuffled_paras_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223704448.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
     .setInputCols(["question", "context"])\
     .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_hier_triplet_shuffled_paras_epochs_1_shard_1_squad2.0","en")\
     .setInputCols(["document_question", "document_context"])\
     .setOutputCol("answer")\
     .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
     .setInputCols(Array("question", "context"))
     .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_hier_triplet_shuffled_paras_epochs_1_shard_1_squad2.0","en")
     .setInputCols(Array("document_question", "document_context"))
     .setOutputCol("answer")
     .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_rule_based_hier_triplet_shuffled_paras_epochs_1_shard_1_squad2.0|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|460.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/AnonymousSub/rule_based_roberta_hier_triplet_shuffled_paras_epochs_1_shard_1_squad2.0

---
layout: model
title: English BertForQuestionAnswering model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_42
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-1024-finetuned-squad-seed-42` is an English model originally trained by `anas-awadalla`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_42_en_4.0.0_3.0_1654189581310.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_42_en_4.0.0_3.0_1654189581310.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_42","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
  document_assembler,
  spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
  .setInputCols("question", "context")
  .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
  .pretrained("bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_42","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
  ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
  ("What's my name?", "My name is Clara and I live in Berkeley."))
  .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.span_bert.base_cased_1024d_seed_42").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
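Extractive QA annotators like this one score every context token as a candidate answer start and end, then return the highest-scoring span. A toy sketch of that span selection — the logits are made-up numbers for illustration, not outputs of this model:

```python
# Toy extractive-QA span selection over pre-tokenized context.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Hypothetical start/end logits; a real model computes these per token.
start_logits = [0.1, 0.2, 0.1, 2.5, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end_logits   = [0.1, 0.1, 0.1, 2.7, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]

# Pick the (start, end) pair with start <= end maximizing the summed logits.
best = max(((s, e) for s in range(len(tokens)) for e in range(s, len(tokens))),
           key=lambda p: start_logits[p[0]] + end_logits[p[1]])
answer = " ".join(tokens[best[0]:best[1] + 1])
print(answer)  # Clara
```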
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_42|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|394.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-1024-finetuned-squad-seed-42

---
layout: model
title: NER Pipeline for Hindi+English
author: John Snow Labs
name: bert_token_classifier_hi_en_ner_pipeline
date: 2022-06-25
tags: [hindi, bert_token, hi, open_source]
task: Named Entity Recognition
language: hi
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on [bert_token_classifier_hi_en_ner](https://nlp.johnsnowlabs.com/2021/12/27/bert_token_classifier_hi_en_ner_hi.html).

## Predicted Entities

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_HINDI_ENGLISH/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_HINDI_ENGLISH.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_hi_en_ner_pipeline_hi_4.0.0_3.0_1656118650191.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_hi_en_ner_pipeline_hi_4.0.0_3.0_1656118650191.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
pipeline = PretrainedPipeline("bert_token_classifier_hi_en_ner_pipeline", lang = "hi")

pipeline.annotate("रिलायंस इंडस्ट्रीज़ लिमिटेड (Reliance Industries Limited) एक भारतीय संगुटिका नियंत्रक कंपनी है, जिसका मुख्यालय मुंबई, महाराष्ट्र (Maharashtra) में स्थित है।रतन नवल टाटा (28 दिसंबर 1937, को मुम्बई (Mumbai), में जन्मे) टाटा समुह के वर्तमान अध्यक्ष, जो भारत की सबसे बड़ी व्यापारिक समूह है, जिसकी स्थापना जमशेदजी टाटा ने की और उनके परिवार की पीढियों ने इसका विस्तार किया और इसे दृढ़ बनाया।")
```
```scala
val pipeline = new PretrainedPipeline("bert_token_classifier_hi_en_ner_pipeline", lang = "hi")

pipeline.annotate("रिलायंस इंडस्ट्रीज़ लिमिटेड (Reliance Industries Limited) एक भारतीय संगुटिका नियंत्रक कंपनी है, जिसका मुख्यालय मुंबई, महाराष्ट्र (Maharashtra) में स्थित है।रतन नवल टाटा (28 दिसंबर 1937, को मुम्बई (Mumbai), में जन्मे) टाटा समुह के वर्तमान अध्यक्ष, जो भारत की सबसे बड़ी व्यापारिक समूह है, जिसकी स्थापना जमशेदजी टाटा ने की और उनके परिवार की पीढियों ने इसका विस्तार किया और इसे दृढ़ बनाया।")
```
## Results

```bash
+---------------------------+------------+
|chunk                      |ner_label   |
+---------------------------+------------+
|रिलायंस इंडस्ट्रीज़ लिमिटेड          |ORGANISATION|
|Reliance Industries Limited|ORGANISATION|
|भारतीय                      |PLACE       |
|मुंबई                        |PLACE       |
|महाराष्ट्र                      |PLACE       |
|Maharashtra)               |PLACE       |
|नवल टाटा                    |PERSON      |
|मुम्बई                        |PLACE       |
|Mumbai                     |PLACE       |
|टाटा समुह                    |ORGANISATION|
|भारत                        |PLACE       |
|जमशेदजी टाटा                 |PERSON      |
+---------------------------+------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_hi_en_ner_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|hi|
|Size:|665.8 MB|

## Included Models

- DocumentAssembler
- SentenceDetector
- TokenizerModel
- BertForTokenClassification
- NerConverter
- Finisher

---
layout: model
title: English XlmRoBertaForQuestionAnswering (from teacookies)
author: John Snow Labs
name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265897
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-more_fine_tune_24465520-26265897` is an English model originally trained by `teacookies`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265897_en_4.0.0_3.0_1655984345899.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265897_en_4.0.0_3.0_1655984345899.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265897","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
  document_assembler,
  spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = XlmRoBertaForQuestionAnswering
  .pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265897","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
  ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
  ("What's my name?", "My name is Clara and I live in Berkeley."))
  .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265897").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265897|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|888.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265897

---
layout: model
title: English BertForTokenClassification Cased model (from ghadeermobasher)
author: John Snow Labs
name: bert_ner_BC4CHEMD_Chem_Modified_PubMedBERT_384
date: 2022-07-06
tags: [bert, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Chem-Modified-PubMedBERT-384` is an English model originally trained by `ghadeermobasher`.

## Predicted Entities

`Chemical`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Modified_PubMedBERT_384_en_4.0.0_3.0_1657108433882.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Modified_PubMedBERT_384_en_4.0.0_3.0_1657108433882.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token")

tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Modified_PubMedBERT_384","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Modified_PubMedBERT_384","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier))

val data = Seq("PUT YOUR STRING HERE").toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_ner_BC4CHEMD_Chem_Modified_PubMedBERT_384|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/ghadeermobasher/BC4CHEMD-Chem-Modified-PubMedBERT-384

---
layout: model
title: German asr_wav2vec2_base_german TFWav2Vec2ForCTC from aware-ai
author: John Snow Labs
name: asr_wav2vec2_base_german
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_german` is a German model originally trained by aware-ai.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_german_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_german_de_4.2.0_3.0_1664099267567.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_german_de_4.2.0_3.0_1664099267567.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
audio_assembler = AudioAssembler() \
    .setInputCol("audio_content") \
    .setOutputCol("audio_assembler")

speech_to_text = Wav2Vec2ForCTC \
    .pretrained("asr_wav2vec2_base_german", "de")\
    .setInputCols("audio_assembler") \
    .setOutputCol("text")

pipeline = Pipeline(stages=[
  audio_assembler,
  speech_to_text,
])

pipelineModel = pipeline.fit(audioDf)

pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
    .setInputCol("audio_content")
    .setOutputCol("audio_assembler")

val speechToText = Wav2Vec2ForCTC
    .pretrained("asr_wav2vec2_base_german", "de")
    .setInputCols("audio_assembler")
    .setOutputCol("text")

val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))

val pipelineModel = pipeline.fit(audioDf)

val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_german|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|de|
|Size:|355.0 MB|

---
layout: model
title: Korean asr_wav2vec2_large_xlsr_korean TFWav2Vec2ForCTC from kresnik
author: John Snow Labs
name: asr_wav2vec2_large_xlsr_korean
date: 2022-09-25
tags: [wav2vec2, ko, audio, open_source, asr]
task: Automatic Speech Recognition
language: ko
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_korean` is a Korean model originally trained by kresnik.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_korean_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_korean_ko_4.2.0_3.0_1664112457943.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_korean_ko_4.2.0_3.0_1664112457943.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
audio_assembler = AudioAssembler() \
    .setInputCol("audio_content") \
    .setOutputCol("audio_assembler")

speech_to_text = Wav2Vec2ForCTC \
    .pretrained("asr_wav2vec2_large_xlsr_korean", "ko")\
    .setInputCols("audio_assembler") \
    .setOutputCol("text")

pipeline = Pipeline(stages=[
  audio_assembler,
  speech_to_text,
])

pipelineModel = pipeline.fit(audioDf)

pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
    .setInputCol("audio_content")
    .setOutputCol("audio_assembler")

val speechToText = Wav2Vec2ForCTC
    .pretrained("asr_wav2vec2_large_xlsr_korean", "ko")
    .setInputCols("audio_assembler")
    .setOutputCol("text")

val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))

val pipelineModel = pipeline.fit(audioDf)

val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xlsr_korean|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|ko|
|Size:|1.2 GB|

---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from leemii18)
author: John Snow Labs
name: distilbert_qa_robust_baseline_02
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `robustqa-baseline-02` is an English model originally trained by `leemii18`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_robust_baseline_02_en_4.3.0_3.0_1672775352182.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_robust_baseline_02_en_4.3.0_3.0_1672775352182.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
     .setInputCols(["question", "context"])\
     .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_robust_baseline_02","en")\
     .setInputCols(["document_question", "document_context"])\
     .setOutputCol("answer")\
     .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
     .setInputCols(Array("question", "context"))
     .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_robust_baseline_02","en")
     .setInputCols(Array("document_question", "document_context"))
     .setOutputCol("answer")
     .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_robust_baseline_02|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/leemii18/robustqa-baseline-02

---
layout: model
title: Abkhazian asr_baseline TFWav2Vec2ForCTC from neelan-elucidate-ai
author: John Snow Labs
name: pipeline_asr_baseline
date: 2022-09-24
tags: [wav2vec2, ab, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: ab
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_baseline` is an Abkhazian model originally trained by neelan-elucidate-ai.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_baseline_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_baseline_ab_4.2.0_3.0_1664021881717.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_baseline_ab_4.2.0_3.0_1664021881717.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
pipeline = PretrainedPipeline('pipeline_asr_baseline', lang = 'ab')

annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_baseline", lang = "ab")

val annotations = pipeline.transform(audioDF)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_baseline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|ab|
|Size:|452.2 KB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: Legal Interpretations Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_interpretations_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, interpretations, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

This model is a Binary Classifier (True, False) for the `Interpretations` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Note that the embeddings used by this model accept up to 512 tokens.
If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, yielding a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`Interpretations`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_interpretations_bert_en_1.0.0_3.0_1678050650625.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_interpretations_bert_en_1.0.0_3.0_1678050650625.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_interpretations_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
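Since the embeddings accept at most 512 tokens, long filings should be split into provisions before classification, as the description above recommends. A minimal pre-processing sketch in plain Python (the blank-line splitting heuristic and all names here are illustrative assumptions, not part of the model):

```python
import re

def split_paragraphs(text):
    # Split on one or more blank lines (the "multiline" heuristic),
    # dropping empty fragments and surrounding whitespace.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Section 1. Definitions.\n\nSection 2. Interpretations.\nHeadings are for convenience only."
paragraphs = split_paragraphs(doc)
# Each paragraph can then be classified independently, e.g.:
# spark.createDataFrame([[p] for p in paragraphs]).toDF("text")
```

Each resulting paragraph can be fed to the classifier as its own row, keeping every input under the token limit.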
## Results ```bash +-------+ |result| +-------+ |[Interpretations]| |[Other]| |[Other]| |[Interpretations]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_interpretations_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Interpretations 0.79 0.93 0.86 45 Other 0.95 0.84 0.89 68 accuracy - - 0.88 113 macro-avg 0.87 0.89 0.87 113 weighted-avg 0.89 0.88 0.88 113 ``` --- layout: model title: Relation extraction between body parts and procedures author: John Snow Labs name: re_bodypart_proceduretest date: 2021-01-18 task: Relation Extraction language: en nav_key: models edition: Spark NLP for Healthcare 2.7.1 spark_version: 2.4 tags: [en, relation_extraction, clinical, licensed] supported: true annotator: RelationExtractionModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Relation extraction between body part entities ['Internal_organ_or_component','External_body_part_or_region'] and procedure and test entities. ## Predicted Entities `0`, `1` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_BODYPART_ENT/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb#scrollTo=D8TtVuN-Ee8s){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_bodypart_proceduretest_en_2.7.1_2.4_1610989267602.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_bodypart_proceduretest_en_2.7.1_2.4_1610989267602.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The table below shows the `re_bodypart_proceduretest` RE model, its labels, the optimal NER model, and the meaningful relation pairs. | RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS | |:---:|:---:|:---:|:---| | re_bodypart_proceduretest | 0,1 | ner_jsl | ["external_body_part_or_region-test", "test-external_body_part_or_region", "internal_organ_or_component-test", "test-internal_organ_or_component", "external_body_part_or_region-procedure", "procedure-external_body_part_or_region", "procedure-internal_organ_or_component", "internal_organ_or_component-procedure"] | Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, PerceptronModel, DependencyParserModel, WordEmbeddingsModel, NerDLModel, NerConverter, RelationExtractionModel.
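The eight relation pairs in the table are simply both orderings of each body-part label with each target (test/procedure) label, so if you want to enable all of them in `setRelationPairs` they can be generated rather than typed out. A plain-Python sketch (variable names are illustrative):

```python
body_parts = ["external_body_part_or_region", "internal_organ_or_component"]
targets = ["test", "procedure"]

# Both orderings of every body-part/target combination.
pairs = [f"{a}-{b}" for a in body_parts for b in targets]
pairs += [f"{b}-{a}" for a in body_parts for b in targets]

print(len(pairs))  # 8 pairs, matching the table above
```

The resulting list can be passed directly to `.setRelationPairs(pairs)` in the pipeline below.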
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentences", "tokens"])\ .setOutputCol("embeddings") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") ner_tagger = MedicalNerModel()\ .pretrained("jsl_ner_wip_greedy_clinical","en","clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_chunker = NerConverterInternal()\ .setInputCols(["sentences", "tokens", "ner_tags"])\ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel()\ .pretrained("dependency_conllu", "en")\ .setInputCols(["sentences", "pos_tags", "tokens"])\ .setOutputCol("dependencies") re_model = RelationExtractionModel()\ .pretrained("re_bodypart_proceduretest", "en", "clinical/models")\ .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\ .setOutputCol("relations")\ .setMaxSyntacticDistance(4)\ .setPredictionThreshold(0.9)\ .setRelationPairs(["external_body_part_or_region-test"]) # Possible relation pairs. Default: All Relations. 
nlp_pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, word_embeddings, pos_tagger, ner_tagger, ner_chunker, dependency_parser, re_model]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate('''TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.''') ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val ner_tagger = MedicalNerModel.pretrained("jsl_ner_wip_greedy_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_chunker = new NerConverterInternal() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") val re_model = RelationExtractionModel.pretrained("re_bodypart_proceduretest", "en", "clinical/models") .setInputCols(Array("embeddings", "pos_tags", "ner_chunks", "dependencies")) .setOutputCol("relations") .setMaxSyntacticDistance(4) // default: 0 .setPredictionThreshold(0.9) // default: 0.5 .setRelationPairs(Array("external_body_part_or_region-test")) // Possible relation pairs. Default: All Relations. 
val nlpPipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, word_embeddings, pos_tagger, ner_tagger, ner_chunker, dependency_parser, re_model)) val data = Seq("""TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.""").toDF("text") val result = nlpPipeline.fit(data).transform(data) ```
## Results ```bash | index | relations | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |-------|-----------|------------------------------|---------------|-------------|--------|---------|---------------|-------------|---------------------|------------| | 0 | 1 | External_body_part_or_region | 94 | 98 | chest | Test | 117 | 135 | portable ultrasound | 1.0 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_bodypart_proceduretest| |Type:|re| |Compatibility:|Spark NLP 2.7.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]| |Output Labels:|[relations]| |Language:|en| |Dependencies:|embeddings_clinical| ## Data Source Trained on data gathered and manually annotated by John Snow Labs ## Benchmarking ```bash label recall precision f1 0 0.55 0.35 0.43 1 0.73 0.86 0.79 ``` --- layout: model title: Turkish BertForQuestionAnswering Cased model (from Aybars) author: John Snow Labs name: bert_qa_modelontquad date: 2022-07-07 tags: [tr, open_source, bert, question_answering] task: Question Answering language: tr edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ModelOnTquad` is a Turkish model originally trained by `Aybars`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_modelontquad_tr_4.0.0_3.0_1657182127988.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_modelontquad_tr_4.0.0_3.0_1657182127988.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_modelontquad","tr") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_modelontquad","tr") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_modelontquad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|tr| |Size:|689.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Aybars/ModelOnTquad --- layout: model title: Named Entity Recognition for Chinese (BERT-Weibo Dataset) author: John Snow Labs name: ner_weibo_bert_768d date: 2021-01-04 task: Named Entity Recognition language: zh edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [zh, ner, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates named entities in a text, which can be used to find features such as names of people, places, and organizations. The model does not read words directly; instead, it reads word embeddings, which represent words as points such that more semantically similar words are closer together. This model uses the pre-trained `bert_base_chinese` embeddings model from the `BertEmbeddings` annotator as an input, so be sure to use the same embeddings in the pipeline. 
## Predicted Entities | Tag | Meaning | Example | |:-------:|------------------------------------|--------------| | PER.NAM | Person name | 张三 | | PER.NOM | Code, category | 穷人 | | LOC.NAM | Specific location | 紫玉山庄 | | LOC.NOM | Generic location | 大峡谷、宾馆 | | GPE.NAM | Administrative regions and areas | 北京 | | ORG.NAM | Specific organization | 通惠医院 | | ORG.NOM | Generic or collective organization | 文艺公司 | {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_ZH/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_weibo_bert_768d_zh_2.7.0_2.4_1609719542498.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ner_weibo_bert_768d_zh_2.7.0_2.4_1609719542498.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh")\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained(name='bert_base_chinese', lang='zh')\ .setInputCols("document", "token") \ .setOutputCol("embeddings") ner = NerDLModel.pretrained("ner_weibo_bert_768d", "zh") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... pipeline = Pipeline(stages=[document_assembler, sentence_detector, word_segmenter, embeddings, ner, ner_converter]) example = spark.createDataFrame([['张三去中国山东省泰安市爬中国五岳的泰山了']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh") .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_base_chinese", "zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_weibo_bert_768d", "zh") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter, embeddings, ner)) val data = Seq("张三去中国山东省泰安市爬中国五岳的泰山了").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["张三去中国山东省泰安市爬中国五岳的泰山了"] ner_df = nlu.load('zh.ner.weibo.bert_768d').predict(text) ner_df ```
## Results ```bash +--------+-------+ |token |ner | +--------+-------+ |张三 |PER.NAM| |中国 |GPE.NAM| |山东省 |GPE.NAM| |中国五岳|GPE.NAM| |泰山 |GPE.NAM| +--------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_weibo_bert_768d| |Type:|ner| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|zh| ## Data Source The model was trained on the [Weibo NER (He and Sun, 2017)](https://www.aclweb.org/anthology/E17-2113/) data set. ## Benchmarking ```bash | ner_tag | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | GPE.NAM | 0.73 | 0.66 | 0.69 | 50 | | GPE.NOM | 0.00 | 0.00 | 0.00 | 2 | | LOC.NAM | 0.60 | 0.10 | 0.18 | 29 | | LOC.NOM | 0.20 | 0.10 | 0.13 | 10 | | ORG.NAM | 0.53 | 0.15 | 0.23 | 60 | | ORG.NOM | 0.50 | 0.28 | 0.36 | 18 | | PER.NAM | 0.66 | 0.61 | 0.63 | 139 | | PER.NOM | 0.68 | 0.63 | 0.66 | 197 | | accuracy | | | 0.96 | 9110 | | macro avg | 0.54 | 0.39 | 0.43 | 9110 | | weighted avg | 0.96 | 0.96 | 0.96 | 9110 | ``` --- layout: model title: Danish asr_alvenir_wav2vec2_base_nst_cv9 TFWav2Vec2ForCTC from chcaa author: John Snow Labs name: asr_alvenir_wav2vec2_base_nst_cv9 date: 2022-09-25 tags: [wav2vec2, da, audio, open_source, asr] task: Automatic Speech Recognition language: da edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_alvenir_wav2vec2_base_nst_cv9` is a Danish model originally trained by chcaa. 
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_alvenir_wav2vec2_base_nst_cv9_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_alvenir_wav2vec2_base_nst_cv9_da_4.2.0_3.0_1664104693003.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_alvenir_wav2vec2_base_nst_cv9_da_4.2.0_3.0_1664104693003.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_alvenir_wav2vec2_base_nst_cv9", "da")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_alvenir_wav2vec2_base_nst_cv9", "da") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_alvenir_wav2vec2_base_nst_cv9| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|da| |Size:|226.0 MB| --- layout: model title: English AlbertForQuestionAnswering model (from ahotrod) author: John Snow Labs name: albert_qa_xxlargev1_squad2_512 date: 2022-06-24 tags: [en, open_source, albert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: AlBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `albert_xxlargev1_squad2_512` is an English model originally trained by `ahotrod`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_xxlargev1_squad2_512_en_4.0.0_3.0_1656064248993.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_xxlargev1_squad2_512_en_4.0.0_3.0_1656064248993.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_xxlargev1_squad2_512","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_xxlargev1_squad2_512","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.albert.xxl_512d").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")  ```
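In the NLU one-liner above, the question and the context travel in a single string separated by `|||`. A tiny helper for building such inputs (a sketch; the helper name is illustrative, only the separator convention comes from the snippet above):

```python
def qa_input(question, context):
    # NLU question-answering inputs use "question|||context" in one string.
    return f"{question}|||{context}"

print(qa_input("What is my name?", "My name is Clara and I live in Berkeley."))
```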
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_qa_xxlargev1_squad2_512| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|772.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ahotrod/albert_xxlargev1_squad2_512 --- layout: model title: BERT Sentence Embeddings German (Base Cased) author: John Snow Labs name: sent_bert_base_cased date: 2021-09-15 tags: [open_source, bert_sentence_embeddings, de] task: Embeddings language: de edition: Spark NLP 3.2.2 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description BERT model trained on German-language text in an MLM fashion, using a 16GB dataset comprising a Wikipedia dump, the EU Bookshop corpus, Open Subtitles, CommonCrawl, ParaCrawl and News Crawl. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_base_cased_de_3.2.2_3.0_1631706255661.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_base_cased_de_3.2.2_3.0_1631706255661.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "de") \ .setInputCols("sentence") \ .setOutputCol("bert_sentence") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings]) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "de") .setInputCols("sentence") .setOutputCol("bert_sentence") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("de.embed_sentence.bert.base_cased").predict("""Put your text here.""") ```
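A common use of the resulting 768-dimensional sentence vectors is comparing sentences by cosine similarity. A dependency-free sketch (the toy 3-dimensional vectors stand in for real 768-dimensional embeddings from the `bert_sentence` column):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors; in practice each comes from one sentence's embedding.
a = [0.2, 0.1, 0.7]
b = [0.2, 0.1, 0.7]
c = [-0.7, 0.1, 0.2]
print(round(cosine(a, b), 3))  # identical vectors give similarity 1.0
```

Higher values mean more semantically similar sentences under these embeddings.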
## Results ```bash 768 dimensional embedding vector per sentence. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_bert_base_cased| |Compatibility:|Spark NLP 3.2.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[bert_sentence]| |Language:|de| |Case sensitive:|true| ## Data Source This model is imported from https://huggingface.co/dbmdz/bert-base-german-cased --- layout: model title: Fast Neural Machine Translation Model from Efik to English author: John Snow Labs name: opus_mt_efi_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, efi, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `efi` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_efi_en_xx_2.7.0_2.4_1609170868137.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_efi_en_xx_2.7.0_2.4_1609170868137.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_efi_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_efi_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.efi.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_efi_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from Luba-Katanga to English author: John Snow Labs name: opus_mt_lu_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, lu, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `lu` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_lu_en_xx_2.7.0_2.4_1609164607218.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_lu_en_xx_2.7.0_2.4_1609164607218.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_lu_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_lu_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.lu.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_lu_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English T5ForConditionalGeneration Small Cased model (from hetpandya) author: John Snow Labs name: t5_small_tapaco date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-tapaco` is an English model originally trained by `hetpandya`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_tapaco_en_4.3.0_3.0_1675155980692.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_tapaco_en_4.3.0_3.0_1675155980692.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_tapaco","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_tapaco","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_tapaco| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|288.8 MB| ## References - https://huggingface.co/hetpandya/t5-small-tapaco - https://towardsdatascience.com/training-t5-for-paraphrase-generation-ab3b5be151a2 - https://github.com/hetpandya - https://www.linkedin.com/in/het-pandya --- layout: model title: English Medical BertForQuestionAnswering model (PubMed, Squad) author: John Snow Labs name: bert_qa_sapbert_from_pubmedbert_squad2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `sapbert-from-pubmedbert-squad2` is an English model originally trained by `bigwiz83`. This model is a fine-tuned version of `SapBERT-from-PubMedBERT-fulltext` on the squad_v2 dataset. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_sapbert_from_pubmedbert_squad2_en_4.0.0_3.0_1654189327808.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_sapbert_from_pubmedbert_squad2_en_4.0.0_3.0_1654189327808.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sapbert_from_pubmedbert_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_sapbert_from_pubmedbert_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_pubmed.sapbert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_sapbert_from_pubmedbert_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|408.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/bigwiz83/sapbert-from-pubmedbert-squad2 --- layout: model title: Pipeline to Resolve ICD-10-CM Codes author: John Snow Labs name: icd10cm_resolver_pipeline date: 2023-03-29 tags: [en, licensed, clinical, resolver, chunk_mapping, pipeline, icd10cm] task: Pipeline Healthcare language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps entities to their corresponding ICD-10-CM codes. Simply feed in your text and it will return the corresponding ICD-10-CM codes. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10cm_resolver_pipeline_en_4.3.2_3.2_1680105059178.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10cm_resolver_pipeline_en_4.3.2_3.2_1680105059178.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline resolver_pipeline = PretrainedPipeline("icd10cm_resolver_pipeline", "en", "clinical/models") text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage""" result = resolver_pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val resolver_pipeline = new PretrainedPipeline("icd10cm_resolver_pipeline", "en", "clinical/models") val result = resolver_pipeline.fullAnnotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage""") ``` {:.nlu-block} ```python import nlu nlu.load("en.icd10cm_resolver.pipeline").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage""") ```
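Each entry returned by `fullAnnotate` is a dictionary of annotation lists keyed by output column. Below is a minimal, Spark-free sketch of flattening such a structure into `(chunk, entity, code)` rows. The `ner_chunk`/`icd10cm_code` keys and the `result`/`metadata` fields mirror the typical Spark NLP annotation layout, but treat them as assumptions and verify the actual keys in your own output:

```python
# Sketch: flatten a fullAnnotate-style result into (chunk, entity, code) rows.
# The dict layout below mimics Spark NLP Annotation objects (result + metadata);
# the exact key names are assumptions, not a documented API.
def flatten_resolution(annotations):
    rows = []
    chunks = annotations.get("ner_chunk", [])
    codes = annotations.get("icd10cm_code", [])
    for chunk, code in zip(chunks, codes):
        rows.append((chunk["result"], chunk["metadata"].get("entity"), code["result"]))
    return rows

example = {
    "ner_chunk": [{"result": "anisakiasis", "metadata": {"entity": "PROBLEM"}}],
    "icd10cm_code": [{"result": "B81.0", "metadata": {}}],
}
print(flatten_resolution(example))  # [('anisakiasis', 'PROBLEM', 'B81.0')]
```

With the real pipeline, pass `result[0]` from `fullAnnotate` in place of `example`.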
## Results ```bash +-----------------------------+---------+------------+ |chunk |ner_chunk|icd10cm_code| +-----------------------------+---------+------------+ |gestational diabetes mellitus|PROBLEM |O24.919 | |anisakiasis |PROBLEM |B81.0 | |fetal and neonatal hemorrhage|PROBLEM |P545 | +-----------------------------+---------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd10cm_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|3.5 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger --- layout: model title: German asr_exp_w2v2t_vp_s184 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2t_vp_s184 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_exp_w2v2t_vp_s184` is a German model originally trained by jonatasgrosman. 
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use `asr_exp_w2v2t_vp_s184_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2t_vp_s184_de_4.2.0_3.0_1664109637831.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2t_vp_s184_de_4.2.0_3.0_1664109637831.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2t_vp_s184", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2t_vp_s184", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2t_vp_s184| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: BERT Sentence Embeddings (Base Uncased) author: John Snow Labs name: sent_bert_base_uncased date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains a deep bidirectional transformer trained on Wikipedia and the BookCorpus. The details are described in the paper "[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_base_uncased_en_2.6.0_2.4_1598346203624.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_base_uncased_en_2.6.0_2.4_1598346203624.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_uncased", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer'], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_uncased", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.bert_base_uncased').predict(text, output_level='sentence') embeddings_df ```
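The resulting 768-dimensional sentence embeddings are typically compared with cosine similarity. A minimal sketch of that computation, using toy 3-dimensional vectors rather than real model output:

```python
import math

# Cosine similarity: dot product divided by the product of the vector norms.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

a = [1.0, 0.0, 1.0]
b = [1.0, 0.0, 1.0]
c = [0.0, 1.0, 0.0]
print(round(cosine(a, b), 3))  # 1.0  (identical direction)
print(round(cosine(a, c), 3))  # 0.0  (orthogonal)
```

With real embeddings you would pull the `sentence_embeddings` arrays out of `result` and compare them the same way.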
{:.h2_title} ## Results ```bash en_embed_sentence_bert_base_uncased_embeddings sentence [0.48797768354415894, 0.250575453042984, -0.24... I hate cancer [0.06506041437387466, 0.06032593920826912, -0.... Antibiotics aren't painkiller ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_bert_base_uncased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1](https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1) --- layout: model title: Multilingual DistilBertForQuestionAnswering Cased model (from mrm8488) author: John Snow Labs name: distilbert_qa_finetuned_for_xqua_on_tydi date: 2023-01-03 tags: [xx, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-multi-finetuned-for-xqua-on-tydiqa` is a Multilingual model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_finetuned_for_xqua_on_tydi_xx_4.3.0_3.0_1672774240682.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_finetuned_for_xqua_on_tydi_xx_4.3.0_3.0_1672774240682.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_finetuned_for_xqua_on_tydi","xx")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_finetuned_for_xqua_on_tydi","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
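Extractive QA models like this one score every token as a candidate answer start and end; the answer is the start/end pair with the highest combined score, subject to end ≥ start. A sketch of that span selection with made-up logits (not the model's real scores):

```python
# Sketch: pick the best answer span from per-token start/end logits.
def best_span(start_logits, end_logits, max_len=30):
    best = (0, 0)
    best_score = float("-inf")
    for s, sl in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = sl + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best

tokens = ["My", "name", "is", "Clara"]
start = [0.1, 0.2, 0.1, 2.5]  # illustrative logits
end = [0.0, 0.1, 0.2, 3.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```

The `max_len` cap is a common heuristic to keep spans short; production implementations also exclude spans that cross into the question tokens.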
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_finetuned_for_xqua_on_tydi| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|505.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/distilbert-multi-finetuned-for-xqua-on-tydiqa - https://ai.google.com/research/tydiqa - https://github.com/google-research-datasets/tydiqa/blob/master/README.md#the-tasks - https://twitter.com/mrm8488 --- layout: model title: Legal NER Obligations on Agreements author: John Snow Labs name: legner_obligations date: 2022-08-22 tags: [en, legal, ner, obligations, agreements, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description IMPORTANT: Don't run this model on a whole legal agreement. Instead: - Split by paragraphs. You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration; - Use the `legclf_cuad_obligations_clause` Text Classifier to select only those paragraphs; This Named Entity Recognition model extracts what the different parties of an agreement commit to do. We call these "obligations", but they could also be called "commitments" or "agreements". This model extracts the subject (who commits to doing what), the action (the verb - will provide, shall sign...) and the object (what the subject will provide, what the subject shall sign, etc). Also, if the recipient of the obligation is a third party (a subject will provide to the Company X ...), then that third party (Company X) will be extracted as an indirect object. 
This model also has a Relation Extraction model which can be used to connect the entities together. The object is usually very diverse (will the subject provide technology? documents? people? items?) and often consists of very long clauses. For that, we include a more advanced way to extract objects, using Automatic Question Generation (What will [subject] [action]? Example: What will the Company provide?) and Question Answering (using that question and a context, we retrieve the answer from the text). Please check the Question Answering notebook in the Spark NLP Workshop for more information about this approach. ## Predicted Entities `OBLIGATION_SUBJECT`, `OBLIGATION_ACTION`, `OBLIGATION`, `OBLIGATION_INDIRECT_OBJECT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_obligations_en_1.0.0_3.2_1661182145726.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_obligations_en_1.0.0_3.2_1661182145726.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from johnsnowlabs import * documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sparktokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") tokenClassifier = legal.BertForTokenClassification.pretrained("legner_obligations", "en", "legal/models")\ .setInputCols("token", "document")\ .setOutputCol("label")\ .setCaseSensitive(True) pipeline = nlp.Pipeline(stages=[ documentAssembler, sparktokenizer, tokenClassifier ] ) import pandas as pd p_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']}))) text = """The Buyer shall use such materials and supplies only in accordance with the present agreement""" res = p_model.transform(spark.createDataFrame([[text]]).toDF("text")) ```
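The token-level B-/I-/O labels produced by the model are usually merged into entity chunks; in a full pipeline this grouping is typically done by a `NerConverter` stage. A minimal sketch of that grouping logic, using labels from this model's tag set:

```python
# Sketch: merge BIO token labels into (chunk_text, entity_type) pairs.
def bio_to_chunks(tokens, labels):
    chunks = []
    cur_tokens, cur_label = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if cur_tokens:
                chunks.append((" ".join(cur_tokens), cur_label))
            cur_tokens, cur_label = [tok], lab[2:]
        elif lab.startswith("I-") and cur_label == lab[2:]:
            cur_tokens.append(tok)
        else:  # "O" or a label mismatch closes the current chunk
            if cur_tokens:
                chunks.append((" ".join(cur_tokens), cur_label))
            cur_tokens, cur_label = [], None
    if cur_tokens:
        chunks.append((" ".join(cur_tokens), cur_label))
    return chunks

tokens = ["The", "Buyer", "shall", "use"]
labels = ["O", "B-OBLIGATION_SUBJECT", "B-OBLIGATION_ACTION", "I-OBLIGATION_ACTION"]
print(bio_to_chunks(tokens, labels))
# [('Buyer', 'OBLIGATION_SUBJECT'), ('shall use', 'OBLIGATION_ACTION')]
```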
## Results ```bash +----------+--------------------+ | token| ner_label| +----------+--------------------+ | The| O| | Buyer|B-OBLIGATION_SUBJECT| | shall| B-OBLIGATION_ACTION| | use| I-OBLIGATION_ACTION| | such| B-OBLIGATION| | materials| I-OBLIGATION| | and| I-OBLIGATION| | supplies| I-OBLIGATION| | only| I-OBLIGATION| | in| I-OBLIGATION| |accordance| I-OBLIGATION| | with| I-OBLIGATION| | the| I-OBLIGATION| | present| I-OBLIGATION| | agreement| I-OBLIGATION| +----------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_obligations| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|412.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## References In-house annotated documents on CUAD dataset ## Benchmarking ```bash label precision recall f1-score support B-OBLIGATION 0.61 0.44 0.51 93 B-OBLIGATION_ACTION 0.88 0.89 0.89 85 B-OBLIGATION_INDIRECT_OBJECT 0.69 0.71 0.70 34 B-OBLIGATION_SUBJECT 0.80 0.87 0.84 87 I-OBLIGATION 0.72 0.77 0.75 1251 I-OBLIGATION_ACTION 0.80 0.79 0.79 167 I-OBLIGATION_SUBJECT 0.75 0.43 0.55 14 O 0.87 0.84 0.85 2395 accuracy - - 0.81 4126 macro-avg 0.76 0.72 0.73 4126 weighted-avg 0.81 0.81 0.81 4126 ``` --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from furyhawk) author: John Snow Labs name: xlmroberta_ner_furyhawk_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `furyhawk`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_furyhawk_base_finetuned_panx_de_4.1.0_3.0_1660432979216.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_furyhawk_base_finetuned_panx_de_4.1.0_3.0_1660432979216.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_furyhawk_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_furyhawk_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_furyhawk_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/furyhawk/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Sentence Entity Resolver for ICD-9-CM author: John Snow Labs name: sbiobertresolve_icd9 date: 2022-09-30 tags: [entity_resolution, en, licensed, icd9, clinical] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD-9-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. ## Predicted Entities `ICD-9-CM Codes` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd9_en_4.1.0_3.0_1664533186655.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd9_en_4.1.0_3.0_1664533186655.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(['PROBLEM']) chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") icd9_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_icd9","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd9_resolver]) clinical_note = [ 'A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years ' 'prior to presentation and subsequent type two diabetes mellitus, associated ' 'with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, '] data= spark.createDataFrame([clinical_note]).toDF('text') results = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val 
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("PROBLEM")) val chunk2doc = new Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val icd9_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_icd9","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd9_resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, associated with an acute hepatitis and obesity with a body mass index (BMI) of 33.5 kg/m2").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.ic9").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years """) ```
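Conceptually, the resolver embeds each detected chunk with Sentence BERT and ranks the ICD-9 code entries by distance in that embedding space (`EUCLIDEAN` here). A toy 2-dimensional sketch of the ranking step — the vectors and index entries are invented for illustration and bear no relation to the real 768-dimensional embeddings:

```python
import math

# Sketch: rank a code index by Euclidean distance to a chunk embedding.
def resolve(chunk_vec, code_index):
    # code_index: list of (code, description, embedding) tuples.
    def dist(v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(chunk_vec, v)))
    ranked = sorted(code_index, key=lambda item: dist(item[2]))
    return [(code, desc) for code, desc, _ in ranked]

index = [
    ("278.0", "Overweight and obesity", [0.9, 0.1]),
    ("571.1", "Acute alcoholic hepatitis", [0.1, 0.9]),
]
print(resolve([0.85, 0.2], index)[0])  # ('278.0', 'Overweight and obesity')
```

This is why the `all_codes` column in the results below is ordered: it is the full ranked candidate list, with the top-ranked code reported as `icd9_code`.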
## Results ```bash +-------------------------------------+-------+---------+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+ | ner_chunk| entity|icd9_code| resolution| all_codes| +-------------------------------------+-------+---------+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+ | gestational diabetes mellitus|PROBLEM| V12.21|[Personal history of gestational diabetes, Neonatal diabetes mellitus, Second...|[V12.21, 775.1, 249, 250, 249.7, 249.71, 249.9, 249.61, 648.0, 249.51, 249.11...| |subsequent type two diabetes mellitus|PROBLEM| 249|[Secondary diabetes mellitus, Diabetes mellitus, Secondary diabetes mellitus ...|[249, 250, 249.9, 249.7, 775.1, 249.6, 249.8, V12.21, 249.71, V77.1, 249.5, 2...| | an acute hepatitis|PROBLEM| 571.1|[Acute alcoholic hepatitis, Viral hepatitis, Autoimmune hepatitis, Injury to ...|[571.1, 070, 571.42, 902.22, 279.51, 571.4, 091.62, 572.2, 864, 070.0, 572.0,...| | obesity|PROBLEM| 278.0|[Overweight and obesity, Morbid obesity, Overweight, Screening for obesity, O...|[278.0, 278.01, 278.02, V77.8, 278, 278.00, 272.2, 783.1, 277.7, 728.5, 521.5...| | a body mass index|PROBLEM| V85|[Body mass index [BMI], Human bite, Localized adiposity, Effects of air press...|[V85, E928.3, 278.1, 993, E008.4, V61.5, 747.63, V85.5, 278.02, 780.97, 782.8...| +-------------------------------------+-------+---------+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+ only showing top 5 rows ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icd9| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output 
Labels:|[icd9_code]| |Language:|en| |Size:|50.1 MB| |Case sensitive:|false| --- layout: model title: Legal Grant of option Clause Binary Classifier author: John Snow Labs name: legclf_grant_of_option_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `grant-of-option` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. 
## Predicted Entities `other`, `grant-of-option` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_grant_of_option_clause_en_1.0.0_3.2_1660122482403.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_grant_of_option_clause_en_1.0.0_3.2_1660122482403.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_grant_of_option_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
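Because each clause classifier is binary, a common pattern is to run several of them over the same paragraph and collect one True/False flag per clause type. A sketch of that aggregation step — the keyword predicates below are simple stand-ins for real `ClassifierDLModel` calls:

```python
# Sketch: run several binary clause "classifiers" over one paragraph
# and collect a True/False flag per clause type. The lambdas are
# placeholder predicates, not real models.
def tag_clauses(clause_text, classifiers):
    return {name: pred(clause_text) for name, pred in classifiers.items()}

classifiers = {
    "grant-of-option": lambda t: "option" in t.lower(),
    "confidentiality": lambda t: "confidential" in t.lower(),
}
print(tag_clauses("The Company hereby grants the Option to the Employee.", classifiers))
# {'grant-of-option': True, 'confidentiality': False}
```

In a real Spark NLP pipeline, each flag would instead come from a separate `legclf_*` model's `category` output.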
## Results ```bash +-----------------+ | result| +-----------------+ |[grant-of-option]| |[other]| |[other]| |[grant-of-option]| +-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_grant_of_option_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support grant-of-option 0.94 1.00 0.97 34 other 1.00 0.98 0.99 109 accuracy - - 0.99 143 macro-avg 0.97 0.99 0.98 143 weighted-avg 0.99 0.99 0.99 143 ``` --- layout: model title: English BertForQuestionAnswering model (from xraychen) author: John Snow Labs name: bert_qa_mqa_cls date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mqa-cls` is an English model originally trained by `xraychen`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mqa_cls_en_4.0.0_3.0_1654188344311.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mqa_cls_en_4.0.0_3.0_1654188344311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mqa_cls","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_mqa_cls","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.mqa_cls.bert.by_xraychen").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
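Once the transformed DataFrame is collected, each row pairs a question with the span the model extracted from its context. As a minimal sketch (plain Python, no Spark required), a hypothetical helper can map questions to their predicted answers; the dict shape here is an assumption about how you collect the `answer` column's result strings, not the exact Spark NLP row type:

```python
def extract_answers(rows):
    """Pair each question with its first predicted answer string.

    Assumes each row is a plain dict with a "question" string and an
    "answer" list of result strings (an assumed collection format).
    Returns None for questions with no extracted span.
    """
    return {
        row["question"]: (row["answer"][0] if row["answer"] else None)
        for row in rows
    }

rows = [
    {"question": "What's my name?", "answer": ["Clara"]},
    {"question": "Where do I live?", "answer": []},
]
print(extract_answers(rows))
# {"What's my name?": 'Clara', 'Where do I live?': None}
```

In practice you would typically build `rows` from something like `result.select("question", "answer.result").collect()`.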
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_mqa_cls| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/xraychen/mqa-cls --- layout: model title: Pipeline to Detect Pathogen, Medical Condition and Medicine author: John Snow Labs name: ner_pathogen_pipeline date: 2023-03-09 tags: [licensed, clinical, en, ner, pathogen, medical_condition, medicine] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_pathogen](https://nlp.johnsnowlabs.com/2022/06/28/ner_pathogen_en_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_pathogen_pipeline_en_4.3.0_3.2_1678385356869.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_pathogen_pipeline_en_4.3.0_3.2_1678385356869.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_pathogen_pipeline", "en", "clinical/models") text = '''Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. This can progress to loss of skin color, a fast heart rate as it becomes more severe. While it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_pathogen_pipeline", "en", "clinical/models") val text = "Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. This can progress to loss of skin color, a fast heart rate as it becomes more severe. While it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.pathogen.pipeline").predict("""Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. This can progress to loss of skin color, a fast heart rate as it becomes more severe. 
While it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions.""") ```
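`fullAnnotate` returns recognized chunks with their labels; a common next step is grouping them by entity type. As a small illustration (plain Python, independent of Spark), a hypothetical helper can do this over dicts mirroring the `ner_chunks`/`ner_label` columns shown in the Results section; the dict shape is an assumption about how you collect the output, not the exact annotation type:

```python
from collections import defaultdict

def group_chunks_by_label(chunks):
    """Group recognized chunks by their entity label.

    Assumes each item is a dict with "ner_chunk" and "ner_label" keys,
    mirroring the columns shown in the Results section (an assumed
    collection format, not the raw Annotation objects).
    """
    grouped = defaultdict(list)
    for chunk in chunks:
        grouped[chunk["ner_label"]].append(chunk["ner_chunk"])
    return dict(grouped)

chunks = [
    {"ner_chunk": "Racecadotril", "ner_label": "Medicine"},
    {"ner_chunk": "Diarrhea", "ner_label": "MedicalCondition"},
    {"ner_chunk": "rabies virus", "ner_label": "Pathogen"},
]
print(group_chunks_by_label(chunks))
# {'Medicine': ['Racecadotril'], 'MedicalCondition': ['Diarrhea'], 'Pathogen': ['rabies virus']}
```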
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:----------------|--------:|------:|:-----------------|-------------:| | 0 | Racecadotril | 0 | 11 | Medicine | 0.9468 | | 1 | loperamide | 80 | 89 | Medicine | 0.9986 | | 2 | Diarrhea | 92 | 99 | MedicalCondition | 0.9848 | | 3 | dehydration | 187 | 197 | MedicalCondition | 0.6305 | | 4 | skin color | 291 | 300 | MedicalCondition | 0.6586 | | 5 | fast heart rate | 305 | 319 | MedicalCondition | 0.757233 | | 6 | rabies virus | 383 | 394 | Pathogen | 0.95685 | | 7 | Lyssavirus | 397 | 406 | Pathogen | 0.9694 | | 8 | Ephemerovirus | 412 | 424 | Pathogen | 0.6919 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_pathogen_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Legal Auto renewal Clause Binary Classifier author: John Snow Labs name: legclf_auto_renewal_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `auto_renewal` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. 
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `auto_renewal` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_auto_renewal_clause_en_1.0.0_3.2_1660122142731.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_auto_renewal_clause_en_1.0.0_3.2_1660122142731.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_auto_renewal_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[auto_renewal]| |[other]| |[other]| |[auto_renewal]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_auto_renewal_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support auto_renewal 0.88 0.91 0.89 32 other 0.97 0.96 0.97 104 accuracy - - 0.95 136 macro-avg 0.92 0.93 0.93 136 weighted-avg 0.95 0.95 0.95 136 ``` --- layout: model title: Fast Neural Machine Translation Model from Bantu Languages to English author: John Snow Labs name: opus_mt_bnt_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, bnt, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `bnt` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bnt_en_xx_2.7.0_2.4_1609170308591.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bnt_en_xx_2.7.0_2.4_1609170308591.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_bnt_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your text to translate here.") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_bnt_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your text to translate here.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.bnt.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
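Because the sentence detector splits the input first, the `translation` column holds one result per sentence. As a small sketch (plain Python, no Spark required), a hypothetical helper can stitch those per-sentence strings back into one translated text; the list-of-strings shape is an assumption about how you collect the output column:

```python
def join_translations(results):
    """Join per-sentence translation strings back into one text.

    Assumes `results` is the list of result strings collected from the
    "translation" output column (an assumed collection format). Empty or
    whitespace-only entries are dropped and spacing is normalized.
    """
    return " ".join(part.strip() for part in results if part.strip())

print(join_translations(["Hello ", " world.", ""]))  # Hello world.
```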
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_bnt_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from English to Slovak author: John Snow Labs name: opus_mt_en_sk date: 2020-12-29 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, sk, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `sk` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_sk_xx_2.7.0_2.4_1609254373086.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_sk_xx_2.7.0_2.4_1609254373086.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_sk", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your text to translate here.") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_sk", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your text to translate here.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.sk').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_sk| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Financial Exhibits Item Binary Classifier author: John Snow Labs name: finclf_exhibits_item date: 2022-08-10 tags: [en, finance, classification, 10k, annual, reports, sec, filings, licensed] task: Text Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `exhibits` item type of 10K Annual Reports. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have large financial documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Finance NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
## Predicted Entities `other`, `exhibits` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_exhibits_item_en_1.0.0_3.2_1660154412713.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_exhibits_item_en_1.0.0_3.2_1660154412713.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") useEmbeddings = nlp.UniversalSentenceEncoder.pretrained() \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("finclf_exhibits_item", "en", "finance/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, useEmbeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
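This classifier can run alongside other binary item classifiers, each emitting its own label. As a minimal sketch (plain Python, independent of Spark), a hypothetical helper merges the per-classifier labels into a single presence map, treating any label other than `other` as a positive hit. The classifier name `finclf_business_item` below is a hypothetical example, and the label-to-boolean convention is an assumption, not a documented API:

```python
def combine_binary_classifiers(predictions):
    """Merge per-classifier labels into item-type presence flags.

    Assumes `predictions` maps a classifier name to its predicted label
    string, and that any label other than "other" marks the item type as
    present (an assumed convention for these binary classifiers).
    """
    return {name: label != "other" for name, label in predictions.items()}

print(combine_binary_classifiers({
    "finclf_exhibits_item": "exhibits",   # positive hit
    "finclf_business_item": "other",      # hypothetical second classifier
}))
# {'finclf_exhibits_item': True, 'finclf_business_item': False}
```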
## Results ```bash +-------+ | result| +-------+ |[exhibits]| |[other]| |[other]| |[exhibits]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_exhibits_item| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.6 MB| ## References Weak labelling on documents from Edgar database ## Benchmarking ```bash label precision recall f1-score support exhibits 0.94 0.83 0.88 18 other 0.88 0.95 0.91 22 accuracy - - 0.90 40 macro-avg 0.91 0.89 0.90 40 weighted-avg 0.90 0.90 0.90 40 ``` --- layout: model title: English BertForTokenClassification Cased model (from datummd) author: John Snow Labs name: bert_token_classifier_ncbi_bc5cdr_disease date: 2022-11-30 tags: [en, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `NCBI_BC5CDR_disease` is an English model originally trained by `datummd`. ## Predicted Entities `bio` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_ncbi_bc5cdr_disease_en_4.2.4_3.0_1669813894935.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_ncbi_bc5cdr_disease_en_4.2.4_3.0_1669813894935.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ncbi_bc5cdr_disease","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ncbi_bc5cdr_disease","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
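The `ner` column holds one tag per token; to obtain entity chunks, those tags must be merged. This is what appending a `NerConverter` stage does, but as a plain-Python sketch of the underlying idea, a hypothetical helper can merge BIO-style tags (e.g. `B-Disease` / `I-Disease` / `O` — an assumption about this model's tag scheme) into spans:

```python
def bio_to_chunks(tokens, tags):
    """Merge token-level BIO tags into (chunk_text, label) spans.

    A sketch of what a NerConverter-style stage does; assumes `tags`
    aligns one-to-one with `tokens` and follows the B-/I-/O scheme.
    """
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)  # continue the open chunk
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:  # flush a chunk that runs to the end of the sentence
        chunks.append((" ".join(current), label))
    return chunks

print(bio_to_chunks(
    ["Patients", "with", "lung", "cancer"],
    ["O", "O", "B-Disease", "I-Disease"],
))  # [('lung cancer', 'Disease')]
```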
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ncbi_bc5cdr_disease| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|403.6 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/datummd/NCBI_BC5CDR_disease - https://github.com/datummd/bionlp --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from huxxx657) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_jumbling_squad_15 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-jumbling-squad-15` is an English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_jumbling_squad_15_en_4.3.0_3.0_1672768090466.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_jumbling_squad_15_en_4.3.0_3.0_1672768090466.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_jumbling_squad_15","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_jumbling_squad_15","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_jumbling_squad_15| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/huxxx657/distilbert-base-uncased-finetuned-jumbling-squad-15 --- layout: model title: Sentiment Analysis Pipeline for German texts author: John Snow Labs name: classifierdl_bert_sentiment_pipeline date: 2021-09-28 tags: [sentiment, de, pipeline, open_source] task: Sentiment Analysis language: de edition: Spark NLP 3.3.0 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline identifies the sentiments (positive or negative) in German texts. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_DE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_De_SENTIMENT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_bert_sentiment_pipeline_de_3.3.0_2.4_1632832830977.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_bert_sentiment_pipeline_de_3.3.0_2.4_1632832830977.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("classifierdl_bert_sentiment_pipeline", lang = "de") result = pipeline.annotate("Spiel und Meisterschaft nicht spannend genug? Muss man jetzt den Videoschiedsrichter kontrollieren? Ich bin entsetzt...dachte der darf nur bei krassen Fehlentscheidungen ran. So macht der Fussball keinen Spass mehr.") ``` ```scala val pipeline = new PretrainedPipeline("classifierdl_bert_sentiment_pipeline", lang = "de") val result = pipeline.fullAnnotate("Spiel und Meisterschaft nicht spannend genug? Muss man jetzt den Videoschiedsrichter kontrollieren? Ich bin entsetzt...dachte der darf nur bei krassen Fehlentscheidungen ran. So macht der Fussball keinen Spass mehr.")(0) ```
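When the pipeline is run over many texts, the per-text labels are often summarized into counts and a majority sentiment. As a small sketch (plain Python, no Spark required), a hypothetical helper does this over a collected list of label strings, which is an assumption about how you gather the pipeline's output:

```python
from collections import Counter

def sentiment_breakdown(labels):
    """Summarize predicted sentiment labels (e.g. ['NEGATIVE', ...]).

    Assumes `labels` is a flat list of label strings collected from the
    pipeline's classification output (an assumed collection format).
    Returns per-label counts and the majority label (None if empty).
    """
    counts = Counter(labels)
    majority = counts.most_common(1)[0][0] if counts else None
    return {"counts": dict(counts), "majority": majority}

print(sentiment_breakdown(["NEGATIVE", "NEGATIVE", "POSITIVE"]))
# {'counts': {'NEGATIVE': 2, 'POSITIVE': 1}, 'majority': 'NEGATIVE'}
```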
## Results ```bash ['NEGATIVE'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_bert_sentiment_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.3.0+| |License:|Open Source| |Edition:|Official| |Language:|de| ## Included Models - DocumentAssembler - BertSentenceEmbeddings - ClassifierDLModel --- layout: model title: Korean Electra Embeddings (from monologg) author: John Snow Labs name: electra_embeddings_koelectra_base_v3_generator date: 2022-05-17 tags: [ko, open_source, electra, embeddings] task: Embeddings language: ko edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `koelectra-base-v3-generator` is a Korean model originally trained by `monologg`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_koelectra_base_v3_generator_ko_3.4.4_3.0_1652786913023.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_koelectra_base_v3_generator_ko_3.4.4_3.0_1652786913023.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_koelectra_base_v3_generator","ko") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["나는 Spark NLP를 좋아합니다"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_koelectra_base_v3_generator","ko") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("나는 Spark NLP를 좋아합니다").toDF("text") val result = pipeline.fit(data).transform(data) ```
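A common downstream use of the `embeddings` column is comparing texts by vector similarity. As a minimal sketch (plain Python, independent of Spark), a hypothetical cosine-similarity helper works on any two embedding vectors once they are collected as lists of floats, e.g. after mean-pooling the per-token vectors — the pooling step and the collection format are assumptions, not part of this model's API:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors.

    Assumes `a` and `b` are equal-length lists of floats, e.g. pooled
    token embeddings collected from the "embeddings" output column.
    Returns 0.0 when either vector has zero norm.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(round(cosine_similarity([1.0, 2.0], [1.0, 2.0]), 3))  # 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 3))  # 0.0
```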
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_koelectra_base_v3_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ko| |Size:|137.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/monologg/koelectra-base-v3-generator - https://github.com/monologg/KoELECTRA/blob/master/README_EN.md --- layout: model title: English T5ForConditionalGeneration Base Cased model (from razent) author: John Snow Labs name: t5_scifive_base_pmc date: 2023-01-30 tags: [en, open_source, t5] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `SciFive-base-PMC` is an English model originally trained by `razent`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_scifive_base_pmc_en_4.3.0_3.0_1675098735326.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_scifive_base_pmc_en_4.3.0_3.0_1675098735326.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_scifive_base_pmc","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_scifive_base_pmc","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_scifive_base_pmc| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|474.3 MB| ## References - https://huggingface.co/razent/SciFive-base-PMC - https://arxiv.org/abs/2106.03598 - https://github.com/justinphan3110/SciFive --- layout: model title: Fast Neural Machine Translation Model from Sranan Tongo to English author: John Snow Labs name: opus_mt_srn_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, srn, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `srn` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_srn_en_xx_2.7.0_2.4_1609164885070.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_srn_en_xx_2.7.0_2.4_1609164885070.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_srn_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_srn_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.srn.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_srn_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Arabic T5ForConditionalGeneration Base Cased model (from UBC-NLP) author: John Snow Labs name: t5_arat5_base_title_generation date: 2023-01-30 tags: [ar, open_source, t5] task: Text Generation language: ar edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `AraT5-base-title-generation` is an Arabic model originally trained by `UBC-NLP`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_arat5_base_title_generation_ar_4.3.0_3.0_1675096301207.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_arat5_base_title_generation_ar_4.3.0_3.0_1675096301207.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_arat5_base_title_generation","ar") \ .setInputCols(["document"]) \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_arat5_base_title_generation","ar") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_arat5_base_title_generation| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ar| |Size:|1.4 GB| ## References - https://huggingface.co/UBC-NLP/AraT5-base-title-generation - https://aclanthology.org/2022.acl-long.47/ - https://doi.org/10.14288/SOCKEYE - https://www.tensorflow.org/tfrc --- layout: model title: Japanese Bert Embeddings (Small, Financial) author: John Snow Labs name: bert_embeddings_bert_small_japanese_fin date: 2022-04-11 tags: [bert, embeddings, ja, open_source] task: Embeddings language: ja edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-small-japanese-fin` is a Japanese model originally trained by `izumi-lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_small_japanese_fin_ja_3.4.2_3.0_1649674517664.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_small_japanese_fin_ja_3.4.2_3.0_1649674517664.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_small_japanese_fin","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["私はSpark NLPを愛しています"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_small_japanese_fin","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("私はSpark NLPを愛しています").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ja.embed.bert_small_japanese_fin").predict("""私はSpark NLPを愛しています""") ```
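The `embeddings` column produced above carries one dense vector per token. As a hedged, self-contained sketch of what downstream code typically does with such vectors (the 4-dimensional values below are made-up stand-ins, not real model output), cosine similarity between two token vectors can be computed like this:

```python
import math

def cosine(u, v):
    # cosine similarity between two equal-length embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy 4-dimensional stand-ins for two token embeddings
vec_a = [0.2, 0.7, 0.1, 0.4]
vec_b = [0.6, 0.1, 0.8, 0.2]
print(round(cosine(vec_a, vec_a), 6))  # 1.0
print(round(cosine(vec_a, vec_b), 6))  # 0.408248
```

Real vectors from this model are much higher-dimensional, but the computation is identical.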
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_small_japanese_fin| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Size:|68.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/izumi-lab/bert-small-japanese-fin - https://github.com/google-research/bert - https://github.com/retarfi/language-pretraining/tree/v1.0 - https://arxiv.org/abs/2003.10555 - https://creativecommons.org/licenses/by-sa/4.0/ --- layout: model title: Fast Neural Machine Translation Model from Cebuano to English author: John Snow Labs name: opus_mt_ceb_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ceb, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `ceb` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ceb_en_xx_2.7.0_2.4_1609169055400.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ceb_en_xx_2.7.0_2.4_1609169055400.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ceb_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ceb_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ceb.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ceb_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English image_classifier_vit_Insectodoptera ViTForImageClassification from Erolgo author: John Snow Labs name: image_classifier_vit_Insectodoptera date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_Insectodoptera` is an English model originally trained by Erolgo. ## Predicted Entities `crab`, `honeybee`, `spider`, `mosquito`, `wasp` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Insectodoptera_en_4.1.0_3.0_1660168032113.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Insectodoptera_en_4.1.0_3.0_1660168032113.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_Insectodoptera", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) imageDF = spark.read.format("image") \ .option("dropInvalid", True) \ .load("path/to/images") pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_Insectodoptera", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val imageDF = spark.read.format("image").option("dropInvalid", true).load("path/to/images") val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_Insectodoptera| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English T5ForConditionalGeneration Cased model (from allenai) author: John Snow Labs name: t5_tailor date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tailor` is an English model originally trained by `allenai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_tailor_en_4.3.0_3.0_1675157251689.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_tailor_en_4.3.0_3.0_1675157251689.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_tailor","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_tailor","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_tailor| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|926.0 MB| ## References - https://huggingface.co/allenai/tailor - https://homes.cs.washington.edu/~wtshuang/static/papers/2021-arxiv-tailor.pdf - https://github.com/allenai/tailor --- layout: model title: Legal Sentiment Analysis using Assertion Status (Sigma, ABSA dataset) author: John Snow Labs name: legassertion_sigma_absa_sentiment date: 2022-12-16 tags: [en, licensed] task: Assertion Status language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model was trained to be benchmarked against SigmaLaw's official Aspect-based Sentiment Analysis model, based on the ABSA dataset, where several parties were tagged with their sentiments in legal texts. For more information about the annotation guidelines, please check their official paper https://arxiv.org/pdf/2011.06326.pdf Macro-F1 Reported by SigmaLaw: - TD-LSTM 0.564682 - TC-LSTM 0.543762 - AE-LSTM 0.558778 - AT-LSTM 0.559181 - ATAE-LSTM 0.580193 - IAN 0.564990 - MemNet 0.436025 - Cabasc 0.564300 - RAM 0.602201 Obtained with Legal NLP: - Assertion Status 0.637 (+0.035 compared to RAM, +0.08 on average) More details here: https://arxiv.org/pdf/2011.06326.pdf ## Predicted Entities `neutral`, `positive`, `negative` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legassertion_sigma_absa_sentiment_en_1.0.0_3.0_1671205882337.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legassertion_sigma_absa_sentiment_en_1.0.0_3.0_1671205882337.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = legal.NerModel.pretrained("legner_sigma_absa_people", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") assertion = legal.AssertionDLModel.pretrained("legassertion_sigma_absa_sentiment", "en", "legal/models")\ .setInputCols(["sentence", "ner_chunk", "embeddings"])\ .setOutputCol("label") pipe = nlp.Pipeline(stages = [ document_assembler, sentence_detector, tokenizer, embeddings, ner, ner_converter, assertion]) text = "Petitioner Jae Lee moved to the United States from South Korea with his parents when he was 13. He feared that a criminal conviction may affect his status." data = spark.createDataFrame([[text]]).toDF("text") result = pipe.fit(data).transform(data) ```
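The macro-F1 gains quoted in the description (+0.035 over the best SigmaLaw baseline, RAM, and roughly +0.08 over the baseline average) can be re-derived from the reported scores. A minimal stand-alone check, with the figures copied from the description above:

```python
# Macro-F1 scores reported by SigmaLaw for their baselines
baselines = {
    "TD-LSTM": 0.564682, "TC-LSTM": 0.543762, "AE-LSTM": 0.558778,
    "AT-LSTM": 0.559181, "ATAE-LSTM": 0.580193, "IAN": 0.564990,
    "MemNet": 0.436025, "Cabasc": 0.564300, "RAM": 0.602201,
}
assertion_f1 = 0.637  # macro-F1 obtained with the Legal NLP assertion model

best_name = max(baselines, key=baselines.get)            # strongest baseline
delta_best = assertion_f1 - baselines[best_name]          # gain over best baseline
delta_mean = assertion_f1 - sum(baselines.values()) / len(baselines)  # gain over average
print(best_name, round(delta_best, 3), round(delta_mean, 3))  # RAM 0.035 0.084
```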
## Results ```bash +------------------+---------+ | ner_chunk|assertion| +------------------+---------+ |Petitioner Jae Lee| neutral| | his| neutral| | he| neutral| | He| negative| | his| negative| +------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legassertion_sigma_absa_sentiment| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| |Size:|2.2 MB| ## References https://metatext.io/datasets/sigmalaw-absa ## Benchmarking ```bash label tp fp fn prec rec f1 neutral 36 25 32 0.59016395 0.5294118 0.5581395 positive 166 111 84 0.599278 0.664 0.629981 negative 236 82 102 0.7421384 0.69822484 0.7195123 Macro-average 438 218 218 0.6438601 0.63054556 0.63713324 Micro-average 438 218 218 0.66768295 0.66768295 0.66768295 ``` --- layout: model title: Fast Neural Machine Translation Model from Wolaytta to English author: John Snow Labs name: opus_mt_wal_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, wal, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `wal` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_wal_en_xx_2.7.0_2.4_1609169640356.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_wal_en_xx_2.7.0_2.4_1609169640356.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_wal_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_wal_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.wal.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_wal_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from ckadam15) author: John Snow Labs name: distilbert_qa_ckadam15_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `ckadam15`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_ckadam15_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770481619.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_ckadam15_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770481619.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ckadam15_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ckadam15_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_ckadam15_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/ckadam15/distilbert-base-uncased-finetuned-squad --- layout: model title: Multilingual XLMRoBerta Embeddings (from coastalcph) author: John Snow Labs name: xlmroberta_embeddings_fairlex_fscs_minilm date: 2022-05-13 tags: [fr, de, it, open_source, xlm_roberta, embeddings, xx, fairlex] task: Embeddings language: xx edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: XlmRoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `fairlex-fscs-minilm` is a Multilingual model originally trained by `coastalcph`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_fairlex_fscs_minilm_xx_3.4.4_3.0_1652439790067.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_fairlex_fscs_minilm_xx_3.4.4_3.0_1652439790067.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_fairlex_fscs_minilm","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_fairlex_fscs_minilm","xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_embeddings_fairlex_fscs_minilm| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|xx| |Size:|403.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/coastalcph/fairlex-fscs-minilm - https://coastalcph.github.io - https://github.com/iliaschalkidis - https://twitter.com/KiddoThe2B --- layout: model title: Detect Living Species (w2v_cc_300d) author: John Snow Labs name: ner_living_species date: 2022-06-23 tags: [fr, ner, clinical, licensed] task: Named Entity Recognition language: fr edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract living species from clinical texts in French, which is critical to scientific disciplines like medicine, biology, ecology/biodiversity, nutrition and agriculture. This model is trained using `w2v_cc_300d` embeddings. It is trained on the [LivingNER](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) corpus that is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. **NOTE:** 1. The text files were translated from Spanish with a neural machine translation system. 2. The annotations were translated with the same neural machine translation system. 3. The translated annotations were transferred to the translated text files using an annotation transfer technology. 
## Predicted Entities `HUMAN`, `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_fr_3.5.3_3.0_1655973573119.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_fr_3.5.3_3.0_1655973573119.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "fr")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_living_species", "fr", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""Femme de 47 ans allergique à l'iode, fumeuse sociale, opérée pour des varices, deux césariennes et un abcès fessier. Vit avec son mari et ses trois enfants, travaille comme enseignante. Initialement, le patient a eu une bonne évolution, mais au 2ème jour postopératoire, il a commencé à montrer une instabilité hémodynamique. Les sérologies pour Coxiella burnetii, Bartonella henselae, Borrelia burgdorferi, Entamoeba histolytica, Toxoplasma gondii, herpès simplex virus 1 et 2, cytomégalovirus, virus d'Epstein Barr, virus de la varicelle et du zona et parvovirus B19 étaient négatives. 
Cependant, un test au rose Bengale positif pour Brucella, le test de Coombs et les agglutinations étaient également positifs avec un titre de 1/40."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "fr") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_living_species", "fr", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""Femme de 47 ans allergique à l'iode, fumeuse sociale, opérée pour des varices, deux césariennes et un abcès fessier. Vit avec son mari et ses trois enfants, travaille comme enseignante. Initialement, le patient a eu une bonne évolution, mais au 2ème jour postopératoire, il a commencé à montrer une instabilité hémodynamique. Les sérologies pour Coxiella burnetii, Bartonella henselae, Borrelia burgdorferi, Entamoeba histolytica, Toxoplasma gondii, herpès simplex virus 1 et 2, cytomégalovirus, virus d'Epstein Barr, virus de la varicelle et du zona et parvovirus B19 étaient négatives. 
Cependant, un test au rose Bengale positif pour Brucella, le test de Coombs et les agglutinations étaient également positifs avec un titre de 1/40.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fr.med_ner.living_species").predict("""Femme de 47 ans allergique à l'iode, fumeuse sociale, opérée pour des varices, deux césariennes et un abcès fessier. Vit avec son mari et ses trois enfants, travaille comme enseignante. Initialement, le patient a eu une bonne évolution, mais au 2ème jour postopératoire, il a commencé à montrer une instabilité hémodynamique. Les sérologies pour Coxiella burnetii, Bartonella henselae, Borrelia burgdorferi, Entamoeba histolytica, Toxoplasma gondii, herpès simplex virus 1 et 2, cytomégalovirus, virus d'Epstein Barr, virus de la varicelle et du zona et parvovirus B19 étaient négatives. Cependant, un test au rose Bengale positif pour Brucella, le test de Coombs et les agglutinations étaient également positifs avec un titre de 1/40.""") ```
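The `ner_converter` stage merges the per-token BIO tags emitted by the NER model into labelled chunks like those in the results table. The merging logic can be sketched in plain Python (illustrative tokens and tags, not actual model output):

```python
def bio_to_chunks(tokens, tags):
    """Merge BIO-tagged tokens into (chunk_text, label) pairs."""
    chunks, text, label = [], None, None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if text is not None:
                chunks.append((text, label))
            text, label = token, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            text += " " + token
        else:  # "O", or an I- tag that does not continue the open chunk
            if text is not None:
                chunks.append((text, label))
            text, label = None, None
    if text is not None:
        chunks.append((text, label))
    return chunks

tokens = ["Les", "sérologies", "pour", "Coxiella", "burnetii", "étaient", "négatives"]
tags = ["O", "O", "O", "B-SPECIES", "I-SPECIES", "O", "O"]
print(bio_to_chunks(tokens, tags))  # [('Coxiella burnetii', 'SPECIES')]
```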
## Results ```bash +--------------------------------+-------+ |ner_chunk |label | +--------------------------------+-------+ |Femme |HUMAN | |mari |HUMAN | |enfants |HUMAN | |patient |HUMAN | |Coxiella burnetii |SPECIES| |Bartonella henselae |SPECIES| |Borrelia burgdorferi |SPECIES| |Entamoeba histolytica |SPECIES| |Toxoplasma gondii |SPECIES| |cytomégalovirus |SPECIES| |virus d'Epstein Barr |SPECIES| |virus de la varicelle et du zona|SPECIES| |parvovirus B19 |SPECIES| |Brucella |SPECIES| +--------------------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|fr| |Size:|15.1 MB| ## References [https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) ## Benchmarking ```bash label precision recall f1-score support B-HUMAN 0.78 0.97 0.87 2552 B-SPECIES 0.66 0.91 0.77 2836 I-HUMAN 0.81 0.67 0.73 114 I-SPECIES 0.69 0.86 0.76 1118 micro-avg 0.71 0.92 0.80 6620 macro-avg 0.74 0.85 0.78 6620 weighted-avg 0.72 0.92 0.80 6620 ``` --- layout: model title: Oncology Pipeline for Diagnosis Entities author: John Snow Labs name: oncology_diagnosis_pipeline date: 2022-11-04 tags: [licensed, en, oncology, clinical] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Healthcare 4.2.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline includes Named-Entity Recognition, Assertion Status, Relation Extraction and Entity Resolution models to extract information from oncology texts. This pipeline focuses on entities related to oncological diagnosis. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ONCOLOGY/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/oncology_diagnosis_pipeline_en_4.2.2_3.0_1667569522240.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/oncology_diagnosis_pipeline_en_4.2.2_3.0_1667569522240.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("oncology_diagnosis_pipeline", "en", "clinical/models") pipeline.fullAnnotate("Two years ago, the patient presented with a 4-cm tumor in her left breast. She was diagnosed with ductal carcinoma. According to her last CT, she has no lung metastases.")[0] ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("oncology_diagnosis_pipeline", "en", "clinical/models") val result = pipeline.fullAnnotate("""Two years ago, the patient presented with a 4-cm tumor in her left breast. She was diagnosed with ductal carcinoma. According to her last CT, she has no lung metastases.""")(0) ``` {:.nlu-block} ```python import nlu nlu.load("en.oncology_diagnosis.pipeline").predict("""Two years ago, the patient presented with a 4-cm tumor in her left breast. She was diagnosed with ductal carcinoma. According to her last CT, she has no lung metastases.""") ```
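The relation extraction stages score every candidate entity pair; the label `O` means the model found no relation between the pair, and downstream code typically filters those rows out. A minimal post-processing sketch over hypothetical rows shaped like the pipeline's relation output:

```python
# Hypothetical relation rows: (chunk1, entity1, chunk2, entity2, relation).
rows = [
    ("4-cm", "Tumor_Size", "tumor", "Tumor_Finding", "is_size_of"),
    ("4-cm", "Tumor_Size", "carcinoma", "Cancer_Dx", "O"),
    ("lung", "Site_Lung", "metastases", "Metastasis", "is_location_of"),
]

# Keep only pairs with an actual relation label.
related = [r for r in rows if r[-1] != "O"]
print(related)
```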
## Results ```bash ******************** ner_oncology_wip results ******************** | chunk | ner_label | |:-----------|:------------------| | 4-cm | Tumor_Size | | tumor | Tumor_Finding | | left | Direction | | breast | Site_Breast | | ductal | Histological_Type | | carcinoma | Cancer_Dx | | lung | Site_Lung | | metastases | Metastasis | ******************** ner_oncology_diagnosis_wip results ******************** | chunk | ner_label | |:-----------|:------------------| | 4-cm | Tumor_Size | | tumor | Tumor_Finding | | ductal | Histological_Type | | carcinoma | Cancer_Dx | | metastases | Metastasis | ******************** ner_oncology_tnm_wip results ******************** | chunk | ner_label | |:-----------|:------------------| | 4-cm | Tumor_Description | | tumor | Tumor | | ductal | Tumor_Description | | carcinoma | Cancer_Dx | | metastases | Metastasis | ******************** assertion_oncology_wip results ******************** | chunk | ner_label | assertion | |:-----------|:------------------|:------------| | tumor | Tumor_Finding | Present | | ductal | Histological_Type | Present | | carcinoma | Cancer_Dx | Present | | metastases | Metastasis | Absent | ******************** assertion_oncology_problem_wip results ******************** | chunk | ner_label | assertion | |:-----------|:------------------|:-----------------------| | tumor | Tumor_Finding | Medical_History | | ductal | Histological_Type | Medical_History | | carcinoma | Cancer_Dx | Medical_History | | metastases | Metastasis | Hypothetical_Or_Absent | ******************** re_oncology_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:---------|:--------------|:-----------|:--------------|:--------------| | 4-cm | Tumor_Size | tumor | Tumor_Finding | is_related_to | | 4-cm | Tumor_Size | carcinoma | Cancer_Dx | O | | tumor | Tumor_Finding | breast | Site_Breast | is_related_to | | breast | Site_Breast | carcinoma | Cancer_Dx | O | | lung | Site_Lung | metastases | 
Metastasis | is_related_to | ******************** re_oncology_granular_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:---------|:--------------|:-----------|:--------------|:---------------| | 4-cm | Tumor_Size | tumor | Tumor_Finding | is_size_of | | 4-cm | Tumor_Size | carcinoma | Cancer_Dx | O | | tumor | Tumor_Finding | breast | Site_Breast | is_location_of | | breast | Site_Breast | carcinoma | Cancer_Dx | O | | lung | Site_Lung | metastases | Metastasis | is_location_of | ******************** re_oncology_size_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:---------|:-----------|:----------|:--------------|:-----------| | 4-cm | Tumor_Size | tumor | Tumor_Finding | is_size_of | | 4-cm | Tumor_Size | carcinoma | Cancer_Dx | O | ******************** ICD-O resolver results ******************** | chunk | ner_label | code | normalized_term | |:-----------|:------------------|:-------|:------------------| | tumor | Tumor_Finding | 8000/1 | tumor | | breast | Site_Breast | C50 | breast | | ductal | Histological_Type | 8500/2 | dcis | | carcinoma | Cancer_Dx | 8010/3 | carcinoma | | lung | Site_Lung | C34.9 | lung | | metastases | Metastasis | 8000/6 | tumor, metastatic | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|oncology_diagnosis_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP for Healthcare 4.2.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|2.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ChunkMergeModel - AssertionDLModel - AssertionDLModel - PerceptronModel - DependencyParserModel - RelationExtractionModel - RelationExtractionModel - RelationExtractionModel - ChunkMergeModel - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel --- 
layout: model title: Detect PHI (Deidentification)(clinical_large) author: John Snow Labs name: ner_deid_large_emb_clinical_large date: 2023-04-12 tags: [ner, licensed, clinical, phi, deidentification, english, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Deidentification NER (Large) is a Named Entity Recognition model that annotates text to find protected health information that may need to be deidentified. The entities it annotates are Age, Contact, Date, Id, Location, Name, and Profession. This model is trained with the 'embeddings_clinical_large' word embeddings model, so be sure to use the same embeddings in the pipeline. We adhered to the official annotation guideline (AG) for the 2014 i2b2 Deid challenge while annotating new datasets for this model. All the details regarding the nuances and explanations for the AG can be found here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/ ## Predicted Entities `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_large_emb_clinical_large_en_4.3.2_3.0_1681321107196.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_large_emb_clinical_large_en_4.3.2_3.0_1681321107196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") deid_ner = MedicalNerModel.pretrained("ner_deid_large_emb_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("deid_ner") deid_ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "deid_ner"]) \ .setOutputCol("deid_ner_chunk") deid_ner_pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, word_embeddings, deid_ner, deid_ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") deid_ner_model = deid_ner_pipeline.fit(empty_data) results = deid_ner_model.transform(spark.createDataFrame([["""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. 
Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same."""]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val deid_ner_model = MedicalNerModel.pretrained("ner_deid_large_emb_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("deid_ner") val deid_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "deid_ner")) .setOutputCol("deid_ner_chunk") val deid_pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, deid_ner_model, deid_ner_converter)) val data = Seq("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.""").toDS.toDF("text") val result = deid_pipeline.fit(data).transform(data) ```
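The NER output alone does not redact anything; a typical next step replaces each detected span with its entity label, using the chunk text and `begin`/`end` character offsets (inclusive) that the pipeline returns. A plain-Python sketch with a hypothetical note and offsets:

```python
def mask_phi(text, chunks):
    """Replace each (begin, end, label) span with <label>.

    End offsets are inclusive; spans are applied right-to-left so earlier
    offsets remain valid as the text shrinks or grows.
    """
    for begin, end, label in sorted(chunks, reverse=True):
        text = text[:begin] + f"<{label}>" + text[end + 1:]
    return text

note = "Mr. Smith was seen on 02/04/2003."
chunks = [(4, 8, "NAME"), (22, 31, "DATE")]
print(mask_phi(note, chunks))  # Mr. <NAME> was seen on <DATE>.
```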
## Results ```bash | | chunks | begin | end | entities | |---:|:----------------|--------:|------:|:-----------| | 0 | Smith | 32 | 36 | NAME | | 1 | VA Hospital | 184 | 194 | LOCATION | | 2 | Day Hospital | 258 | 269 | LOCATION | | 3 | 02/04/2003 | 341 | 350 | DATE | | 4 | Smith | 374 | 378 | NAME | | 5 | Day Hospital | 397 | 408 | LOCATION | | 6 | Smith | 782 | 786 | NAME | | 7 | Smith | 1131 | 1135 | NAME | | 8 | 7 Ardmore Tower | 1153 | 1167 | LOCATION | | 9 | Hart | 1221 | 1224 | NAME | | 10 | Smith | 1231 | 1235 | NAME | | 11 | 02/07/2003 | 1329 | 1338 | DATE | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_large_emb_clinical_large| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|2.8 MB| ## Benchmarking ```bash precision recall f1-score support CONTACT 0.92 0.98 0.95 126 DATE 0.99 0.99 0.99 2631 NAME 0.98 0.98 0.98 2594 AGE 0.99 0.94 0.97 284 LOCATION 0.97 0.94 0.95 1511 ID 0.91 0.96 0.94 213 PROFESSION 0.81 0.84 0.83 160 micro avg 0.97 0.97 0.97 7519 macro avg 0.94 0.95 0.94 7519 weighted avg 0.97 0.97 0.97 7519 ``` --- layout: model title: Fast Neural Machine Translation Model from Gilbertese to English author: John Snow Labs name: opus_mt_gil_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, gil, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `gil` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_gil_en_xx_2.7.0_2.4_1609167188635.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_gil_en_xx_2.7.0_2.4_1609167188635.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_gil_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_gil_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.gil.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_gil_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Greek BertForQuestionAnswering Cased model (from Danastos) author: John Snow Labs name: bert_qa_newsqa_el_4 date: 2022-07-07 tags: [el, open_source, bert, question_answering] task: Question Answering language: el edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `newsqa_bert_el_4` is a Greek model originally trained by `Danastos`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_newsqa_el_4_el_4.0.0_3.0_1657190554892.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_newsqa_el_4_el_4.0.0_3.0_1657190554892.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_newsqa_el_4","el") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Ποιο είναι το όνομά μου?", "Το όνομά μου είναι Κλάρα και μένω στο Μπέρκλεϊ."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_newsqa_el_4","el") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Ποιο είναι το όνομά μου?", "Το όνομά μου είναι Κλάρα και μένω στο Μπέρκλεϊ.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
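This model's maximum sentence length is 512 tokens, so longer contexts must be split before querying. A common approach (an illustrative sketch, not part of this model card's pipeline) is a sliding window with overlap, so an answer span cut at one window boundary is fully contained in the next:

```python
def split_context(tokens, max_len=512, stride=128):
    """Split a token list into overlapping windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens.
    """
    windows, start = [], 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += max_len - stride
    return windows

wins = split_context(list(range(1000)))
print([len(w) for w in wins])  # [512, 512, 232]
```

The best answer is then selected across windows, e.g. by the highest span score.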
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_newsqa_el_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|el| |Size:|421.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Danastos/newsqa_bert_el_4 --- layout: model title: Part of Speech for Korean author: John Snow Labs name: pos_ud_kaist date: 2021-01-03 task: Part of Speech Tagging language: ko edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [pos, ko, open_source] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 13 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. ## Predicted Entities `ADJ`, `ADP`, `ADV`, `AUX`, `CONJ`, `DET`, `NOUN`, `NUM`, `PART`, `PRON`, `PROPN`, `PUNCT`, `SYM`, `VERB`, and `X`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_kaist_ko_2.7.0_2.4_1609701893746.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_kaist_ko_2.7.0_2.4_1609701893746.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an nlp pipeline after tokenization.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") word_segmenter = WordSegmenterModel.pretrained("wordseg_kaist_ud", "ko")\ .setInputCols(["sentence"])\ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_kaist", "ko") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, word_segmenter, pos ]) example = spark.createDataFrame([['비파를탄주하는그늙은명인의시는아름다운화음이었고완벽한음악으로순간적인조화를이룬세계의울림이었다.']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val word_segmenter = WordSegmenterModel.pretrained("wordseg_kaist_ud", "ko") .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_kaist", "ko") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter, pos)) val data = Seq("비파를탄주하는그늙은명인의시는아름다운화음이었고완벽한음악으로순간적인조화를이룬세계의울림이었다.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""비파를탄주하는그늙은명인의시는아름다운화음이었고완벽한음악으로순간적인조화를이룬세계의울림이었다."""] pos_df = nlu.load('ko.pos.ud_kaist').predict(text, output_level='token') pos_df ```
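The Benchmarking section below reports macro and weighted averages over the per-tag rows; the weighted average is simply a support-weighted mean of the per-tag scores. A quick sketch, combining just the ADJ and ADP f1 rows from the table for illustration:

```python
def weighted_avg(scores, supports):
    """Support-weighted mean, as used for the 'weighted avg' row."""
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total

f1 = [0.85, 0.96]        # ADJ, ADP f1 scores
support = [1180, 160]    # ADJ, ADP token counts
print(round(weighted_avg(f1, support), 3))  # 0.863
```

The macro average, by contrast, is the unweighted mean, so rare tags like INTJ (support 3) pull it down much more strongly.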
## Results ```bash +----------+-----+ |token |pos | +----------+-----+ |비파를 |NOUN | |탄주하는 |VERB | |그 |DET | |늙은 |VERB | |명인의 |NOUN | |시는 |NOUN | |아름다운 |ADJ | |화음이었고|CCONJ| |완벽한 |VERB | |음악으로 |NOUN | |순간적인 |VERB | |조화를 |NOUN | |이룬 |VERB | |세계의 |NOUN | |울림이었다|VERB | |. |PUNCT| +----------+-----+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_kaist| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[pos]| |Language:|ko| ## Data Source The model was trained on the Universal Dependencies treebank curated by the Korea Advanced Institute of Science and Technology (KAIST). Reference: > Building Universal Dependency Treebanks in Korean, Jayeol Chun, Na-Rae Han, Jena D. Hwang, and Jinho D. Choi. In Proceedings of the 11th International Conference on Language Resources and Evaluation, LREC'18, Miyazaki, Japan, 2018. ## Benchmarking ```bash | | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | ADJ | 0.90 | 0.81 | 0.85 | 1180 | | ADP | 0.95 | 0.97 | 0.96 | 160 | | ADV | 0.89 | 0.82 | 0.85 | 4156 | | AUX | 0.85 | 0.84 | 0.84 | 1074 | | CCONJ | 0.82 | 0.71 | 0.76 | 1471 | | DET | 0.92 | 0.91 | 0.91 | 272 | | INTJ | 0.00 | 0.00 | 0.00 | 3 | | NOUN | 0.80 | 0.93 | 0.86 | 8338 | | NUM | 0.86 | 0.91 | 0.88 | 631 | | PART | 0.00 | 0.00 | 0.00 | 18 | | PRON | 0.96 | 0.91 | 0.93 | 405 | | PROPN | 0.85 | 0.58 | 0.69 | 1377 | | PUNCT | 1.00 | 1.00 | 1.00 | 3109 | | SCONJ | 0.84 | 0.72 | 0.78 | 1547 | | SYM | 1.00 | 0.98 | 0.99 | 115 | | VERB | 0.87 | 0.87 | 0.87 | 4378 | | X | 0.94 | 0.71 | 0.81 | 132 | | accuracy | | | 0.86 | 28366 | | macro avg | 0.79 | 0.74 | 0.76 | 28366 | | weighted avg | 0.86 | 0.86 | 0.86 | 28366 | ``` --- layout: model title: Dutch Named Entity Recognition (from Davlan) author: John Snow Labs name: xlmroberta_ner_xlm_roberta_large_ner_hrl date: 2022-05-17 tags: [xlm_roberta, ner, token_classification, nl, open_source] task: 
Named Entity Recognition language: nl edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-large-ner-hrl` is a Dutch model originally trained by `Davlan`. ## Predicted Entities `PER`, `ORG`, `LOC`, `DATE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_large_ner_hrl_nl_3.4.2_3.0_1652809843691.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_large_ner_hrl_nl_3.4.2_3.0_1652809843691.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_large_ner_hrl","nl") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ik hou van Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_large_ner_hrl","nl") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ik hou van Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_xlm_roberta_large_ner_hrl| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|nl| |Size:|1.8 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Davlan/xlm-roberta-large-ner-hrl - https://camel.abudhabi.nyu.edu/anercorp/ - https://www.clips.uantwerpen.be/conll2003/ner/ - https://www.clips.uantwerpen.be/conll2002/ner/ --- layout: model title: Detect clinical events (biobert) author: John Snow Labs name: ner_events_biobert date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect clinical events like Admission, Department, Date, Discharge, etc. in reports and medical text using a pretrained NER model. 
## Predicted Entities `OCCURRENCE`, `TREATMENT`, `ADMISSION`, `TIME`, `PROBLEM`, `DATE`, `FREQUENCY`, `CLINICAL_DEPT`, `DURATION`, `EVIDENTIAL`, `DISCHARGE`, `TEST` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_EVENTS_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_events_biobert_en_3.0.0_3.0_1617260774039.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_events_biobert_en_3.0.0_3.0_1617260774039.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_events_biobert", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_events_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("EXAMPLE_TEXT").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu 
nlu.load("en.med_ner.events_biobert").predict("""Put your text here.""") ```
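For quick checks outside a batch Spark job, the fitted pipeline's annotations are often post-processed in plain Python. The sketch below pairs each extracted chunk with its entity label; the annotation layout (`result`/`metadata` keys) and the `example` data are illustrative assumptions, not output captured from this model:

```python
# Sketch: pair extracted NER chunks with their entity labels.
# The annotation layout below is an assumption mimicking fullAnnotate-style
# output for the "ner_chunk" column; it is not captured from this model.

def chunks_with_labels(ner_chunk_annotations):
    """Return (chunk_text, entity_label) pairs from annotation-like dicts."""
    return [
        (ann["result"], ann["metadata"].get("entity"))
        for ann in ner_chunk_annotations
    ]

# Hypothetical annotations for a sentence mentioning an admission date:
example = [
    {"result": "admitted", "metadata": {"entity": "ADMISSION"}},
    {"result": "2021-04-01", "metadata": {"entity": "DATE"}},
]

print(chunks_with_labels(example))
```

The same pattern extends to any of the label set above (`PROBLEM`, `TEST`, `CLINICAL_DEPT`, and so on), since the entity name is carried in the annotation metadata rather than in the schema.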
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_events_biobert| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| --- layout: model title: English image_classifier_vit_WEC_types ViTForImageClassification from lazyturtl author: John Snow Labs name: image_classifier_vit_WEC_types date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_WEC_types` is an English model originally trained by lazyturtl. ## Predicted Entities `Attenuators`, `Oscillating water column`, `Overtopping Devices`, `Point Absorber` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_WEC_types_en_4.1.0_3.0_1660168241563.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_WEC_types_en_4.1.0_3.0_1660168241563.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_WEC_types", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_WEC_types", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_WEC_types| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: German RobertaForMaskedLM Small Cased model (from FabianGroeger) author: John Snow Labs name: roberta_embeddings_hotelbert_small date: 2022-12-12 tags: [de, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: de edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `HotelBERT-small` is a German model originally trained by `FabianGroeger`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_hotelbert_small_de_4.2.4_3.0_1670858504100.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_hotelbert_small_de_4.2.4_3.0_1670858504100.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_hotelbert_small","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_hotelbert_small","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_hotelbert_small| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|de| |Size:|313.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/FabianGroeger/HotelBERT-small --- layout: model title: English T5ForConditionalGeneration Cased model (from SkolkovoInstitute) author: John Snow Labs name: t5_informal date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-informal` is an English model originally trained by `SkolkovoInstitute`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_informal_en_4.3.0_3.0_1675124817401.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_informal_en_4.3.0_3.0_1675124817401.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_informal","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_informal","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_informal| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|927.0 MB| ## References - https://huggingface.co/SkolkovoInstitute/t5-informal - https://aclanthology.org/N18-1012/ --- layout: model title: English BertForQuestionAnswering Cased model (from Nausheen) author: John Snow Labs name: bert_qa_nausheen_finetuned_squad_accelera date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad-accelerate` is an English model originally trained by `Nausheen`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_nausheen_finetuned_squad_accelera_en_4.0.0_3.0_1657186978905.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_nausheen_finetuned_squad_accelera_en_4.0.0_3.0_1657186978905.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_nausheen_finetuned_squad_accelera","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_nausheen_finetuned_squad_accelera","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_nausheen_finetuned_squad_accelera| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Nausheen/bert-finetuned-squad-accelerate --- layout: model title: Translate Bengali to English Pipeline author: John Snow Labs name: translate_bn_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, bn, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `bn` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_bn_en_xx_2.7.0_2.4_1609685962101.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_bn_en_xx_2.7.0_2.4_1609685962101.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_bn_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_bn_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.bn.translate_to.en').predict(text, output_level='sentence') translate_df ```
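`annotate` returns a dictionary keyed by output column, with one string per detected sentence. A minimal plain-Python sketch of joining the translated pieces back into a single string follows; the `"translation"` key name and the sample dictionary are assumptions for illustration, not captured output of this pipeline:

```python
# Sketch: flatten annotate-style output into a single translated string.
# The "translation" key and the sample dict are illustrative assumptions.

def join_translation(annotations, key="translation"):
    """Join the translated sentence pieces stored under `key`."""
    return " ".join(annotations.get(key, []))

example = {"translation": ["Hello,", "how are you?"]}
print(join_translation(example))
```

Missing keys fall back to an empty string, which keeps the helper safe to call on documents the pipeline produced no output for.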
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_bn_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: SDOH Community Absent Binary Classification author: John Snow Labs name: bert_sequence_classifier_sdoh_community_absent_status date: 2022-12-18 tags: [en, licensed, clinical, sequence_classification, classifier, community_absent, sdoh] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model classifies clinical documents according to whether they mention the loss of social support, such as the loss of a family member or friend. A discharge summary is classified True for Community-Absent if it contains passages related to the loss of social support, and False if no such passages are found. ## Predicted Entities `True`, `False` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_sdoh_community_absent_status_en_4.2.2_3.0_1671370818272.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_sdoh_community_absent_status_en_4.2.2_3.0_1671370818272.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_sdoh_community_absent_status", "en", "clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) sample_texts =["She has two adult sons. She is a widow. She was employed with housework. She quit smoking 20 to 30 years ago, but smoked two packs per day for 20 to 30 years. She drinks one glass of wine occasionally. She avoids salt in her diet. ", "65 year old male presented with several days of vice like chest pain. He states that he felt like his chest was being crushed from back to the front. Lives with spouse and two sons moved to US 1 month ago."] data = spark.createDataFrame(sample_texts, StringType()).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_sdoh_community_absent_status", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) val data = Seq("She has two adult sons. She is a widow. She was employed with housework. She quit smoking 20 to 30 years ago, but smoked two packs per day for 20 to 30 years. She drinks one glass of wine occasionally. 
She avoids salt in her diet.") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.bert_sequence.sdoh_community_absent_status").predict("""She has two adult sons. She is a widow. She was employed with housework. She quit smoking 20 to 30 years ago, but smoked two packs per day for 20 to 30 years. She drinks one glass of wine occasionally. She avoids salt in her diet. """) ```
## Results ```bash +----------------------------------------------------------------------------------------------------+-------+ | text| result| +----------------------------------------------------------------------------------------------------+-------+ |She has two adult sons. She is a widow. She was employed with housework. She quit smoking 20 to 3...| [True]| |65 year old male presented with several days of vice like chest pain. He states that he felt like...|[False]| +----------------------------------------------------------------------------------------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_sdoh_community_absent_status| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|410.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## Benchmarking ```bash label precision recall f1-score support False 0.89 0.77 0.83 155 True 0.63 0.80 0.70 74 accuracy - - 0.78 229 macro-avg 0.76 0.79 0.76 229 weighted-avg 0.80 0.78 0.79 229 ``` --- layout: model title: Extract conditions and benefits from drug reviews author: John Snow Labs name: bert_token_classifier_ner_supplement date: 2022-02-09 tags: [bertfortokenclassification, licensed, ner, en, clinical] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained to extract benefits of using drugs for certain conditions. 
## Predicted Entities `CONDITION`, `BENEFIT` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_supplement_en_3.0.2_3.0_1644368324280.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_supplement_en_3.0.2_3.0_1644368324280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_supplement","en", "clinical/models")\ .setInputCols(["token", "document"])\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline( stages=[ documentAssembler, tokenizer, tokenClassifier, ner_converter ] ) sample_df = spark.createDataFrame([["Excellent!. The state of health improves, nervousness disappears, and night sleep improves. It also promotes hair and nail growth. I recommend :)"],["Eager to have my ferritin grow and less hair loss."]]).toDF("text") result = pipeline.fit(sample_df).transform(sample_df) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_supplement", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("ner") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("document","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter)) val test_sentence = "Excellent!. The state of health improves, nervousness disappears, and night sleep improves. It also promotes hair and nail growth. I recommend :)" val data = Seq(test_sentence).toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.supplement").predict("""Excellent!. 
The state of health improves, nervousness disappears, and night sleep improves. It also promotes hair and nail growth. I recommend :)""") ```
## Results ```bash +-----------+---------+ |chunk |ner_label| +-----------+---------+ |nervousness|CONDITION| |night sleep|BENEFIT | |hair |BENEFIT | |nail growth|BENEFIT | |ferritin |BENEFIT | |hair loss |CONDITION| +-----------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_supplement| |Compatibility:|Healthcare NLP 3.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References Trained on the healthsea dataset: https://github.com/explosion/healthsea/tree/main/project/assets/ner ## Benchmarking ```bash label precision recall f1 support B-BENEFIT 0.85 0.89 0.87 184 B-CONDITION 0.82 0.90 0.86 202 I-BENEFIT 0.83 0.70 0.76 64 I-CONDITION 0.81 0.76 0.78 100 O 1.00 0.99 1.00 12700 accuracy 0.99 0.99 0.99 13250 macro-avg 0.86 0.85 0.85 13250 weighted-avg 0.99 0.99 0.99 13250 ``` --- layout: model title: English AlbertForQuestionAnswering model (from saburbutt) TweetQA author: John Snow Labs name: albert_qa_xxlarge_tweetqa date: 2022-06-24 tags: [en, open_source, albert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: AlBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `albert_xxlarge_tweetqa` is an English model originally trained by `saburbutt`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_xxlarge_tweetqa_en_4.0.0_3.0_1656064171081.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_xxlarge_tweetqa_en_4.0.0_3.0_1656064171081.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_xxlarge_tweetqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_xxlarge_tweetqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.trivia.albert.xxl").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_qa_xxlarge_tweetqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|771.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/saburbutt/albert_xxlarge_tweetqa --- layout: model title: English asr_wav2vec2_base_timit_demo_colab1_by_sherry7144 TFWav2Vec2ForCTC from sherry7144 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab1_by_sherry7144 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab1_by_sherry7144` is an English model originally trained by sherry7144. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab1_by_sherry7144_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab1_by_sherry7144_en_4.2.0_3.0_1664018670756.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab1_by_sherry7144_en_4.2.0_3.0_1664018670756.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab1_by_sherry7144', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab1_by_sherry7144", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab1_by_sherry7144| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Pipeline to Detect Drugs and posology entities including experimental drugs and cycles (ner_posology_experimental) author: John Snow Labs name: ner_posology_experimental_pipeline date: 2023-03-15 tags: [licensed, clinical, en, ner] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_posology_experimental](https://nlp.johnsnowlabs.com/2021/09/01/ner_posology_experimental_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_experimental_pipeline_en_4.3.0_3.2_1678870276632.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_experimental_pipeline_en_4.3.0_3.2_1678870276632.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_posology_experimental_pipeline", "en", "clinical/models") text = '''Y-90 Humanized Anti-Tac: 10 mCi (if a bone marrow transplant was part of the patient's previous therapy) or 15 mCi of yttrium labeled anti-TAC; followed by calcium trisodium Inj (Ca DTPA).. Calcium-DTPA: Ca-DTPA will be administered intravenously on Days 1-3 to clear the radioactive agent from the body.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_posology_experimental_pipeline", "en", "clinical/models") val text = "Y-90 Humanized Anti-Tac: 10 mCi (if a bone marrow transplant was part of the patient's previous therapy) or 15 mCi of yttrium labeled anti-TAC; followed by calcium trisodium Inj (Ca DTPA).. Calcium-DTPA: Ca-DTPA will be administered intravenously on Days 1-3 to clear the radioactive agent from the body." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology_experimental.pipeline").predict("""Y-90 Humanized Anti-Tac: 10 mCi (if a bone marrow transplant was part of the patient's previous therapy) or 15 mCi of yttrium labeled anti-TAC; followed by calcium trisodium Inj (Ca DTPA).. Calcium-DTPA: Ca-DTPA will be administered intravenously on Days 1-3 to clear the radioactive agent from the body.""") ```
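`fullAnnotate` results are often reshaped into flat rows like those shown under Results. A plain-Python sketch follows; the annotation layout (`result`, `begin`, `end`, `metadata` keys) and the example values are illustrative assumptions, not output captured from this pipeline:

```python
# Sketch: turn annotation-like dicts into (chunk, begin, end, label, confidence)
# rows. The dict layout and example values are illustrative assumptions.

def to_rows(chunk_annotations):
    """Reshape annotation-like dicts into flat result rows."""
    return [
        (
            ann["result"],
            ann["begin"],
            ann["end"],
            ann["metadata"].get("entity"),
            float(ann["metadata"].get("confidence", 0.0)),
        )
        for ann in chunk_annotations
    ]

# Hypothetical annotation for a single drug chunk:
example = [
    {"result": "Anti-Tac", "begin": 15, "end": 22,
     "metadata": {"entity": "Drug", "confidence": "0.8797"}},
]
print(to_rows(example))
```

Casting the confidence string to `float` makes the rows easy to sort or filter by score before presenting them.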
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-------------------------|--------:|------:|:------------|-------------:| | 0 | Anti-Tac | 15 | 22 | Drug | 0.8797 | | 1 | 10 mCi | 25 | 30 | Dosage | 0.5403 | | 2 | 15 mCi | 108 | 113 | Dosage | 0.6266 | | 3 | yttrium labeled anti-TAC | 118 | 141 | Drug | 0.9122 | | 4 | calcium trisodium Inj | 156 | 176 | Drug | 0.397533 | | 5 | Calcium-DTPA | 191 | 202 | Drug | 0.9794 | | 6 | Ca-DTPA | 205 | 211 | Drug | 0.9544 | | 7 | intravenously | 234 | 246 | Route | 0.9518 | | 8 | Days 1-3 | 251 | 258 | Cycleday | 0.83325 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_experimental_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Spanish RobertaForQuestionAnswering (from hackathon-pln-es) author: John Snow Labs name: roberta_qa_roberta_base_biomedical_clinical_es_squad2_hackathon_pln date: 2022-06-21 tags: [es, open_source, question_answering, roberta] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-biomedical-clinical-es-squad2-es` is a Spanish model originally trained by `hackathon-pln-es`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_biomedical_clinical_es_squad2_hackathon_pln_es_4.0.0_3.0_1655790238957.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_biomedical_clinical_es_squad2_hackathon_pln_es_4.0.0_3.0_1655790238957.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_biomedical_clinical_es_squad2_hackathon_pln","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_biomedical_clinical_es_squad2_hackathon_pln","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.squadv2_clinical_bio_medical.roberta.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
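The NLU one-liner above packs the question and context into a single string separated by `|||`. As a plain-Python illustration of that input convention (the helper below is not part of nlu, just a sketch of the format):

```python
# Sketch of the "question|||context" convention used by nlu's QA predict().
# This helper is illustrative only; nlu performs its own parsing internally.
def split_qa_input(combined: str):
    question, _, context = combined.partition("|||")
    return question.strip(), context.strip()

q, c = split_qa_input("What's my name?|||My name is Clara and I live in Berkeley.")
```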
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_biomedical_clinical_es_squad2_hackathon_pln| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|es| |Size:|464.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/hackathon-pln-es/roberta-base-biomedical-clinical-es-squad2-es - https://somosnlp.org/hackathon --- layout: model title: English T5ForConditionalGeneration Cased model (from paulowoicho) author: John Snow Labs name: t5_podcast_summarisation date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-podcast-summarisation` is an English model originally trained by `paulowoicho`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_podcast_summarisation_en_4.3.0_3.0_1675125262437.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_podcast_summarisation_en_4.3.0_3.0_1675125262437.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_podcast_summarisation","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_podcast_summarisation","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_podcast_summarisation| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|927.4 MB| ## References - https://huggingface.co/paulowoicho/t5-podcast-summarisation - https://arxiv.org/abs/2004.04270 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/pdf/1910.10683.pdf - https://github.com/paulowoicho/msc_project/blob/master/reformat.py - https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb - https://github.com/abhimishra91 --- layout: model title: Chinese BertForQuestionAnswering model (from hfl) author: John Snow Labs name: bert_qa_chinese_pert_base_mrc date: 2022-06-06 tags: [zh, open_source, question_answering, bert] task: Question Answering language: zh edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese-pert-base-mrc` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_chinese_pert_base_mrc_zh_4.0.0_3.0_1654537805799.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_chinese_pert_base_mrc_zh_4.0.0_3.0_1654537805799.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_chinese_pert_base_mrc","zh") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_chinese_pert_base_mrc","zh") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("zh.answer_question.bert.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_chinese_pert_base_mrc| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|zh| |Size:|381.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/hfl/chinese-pert-base-mrc - https://github.com/ymcui/PERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-Minority-PLM - https://github.com/ymcui/HFL-Anthology - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/MacBERT --- layout: model title: Extract Entities in Covid Trials author: John Snow Labs name: ner_covid_trials date: 2022-10-19 tags: [ner, en, clinical, licensed, covid] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Healthcare 4.2.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for extracting covid-related clinical terminology from covid trials.
## Predicted Entities `Stage`, `Severity`, `Virus`, `Trial_Design`, `Trial_Phase`, `N_Patients`, `Institution`, `Statistical_Indicator`, `Section_Header`, `Cell_Type`, `Cellular_component`, `Viral_components`, `Physiological_reaction`, `Biological_molecules`, `Admission_Discharge`, `Age`, `BMI`, `Cerebrovascular_Disease`, `Date`, `Death_Entity`, `Diabetes`, `Disease_Syndrome_Disorder`, `Dosage`, `Drug_Ingredient`, `Employment`, `Frequency`, `Gender`, `Heart_Disease`, `Hypertension`, `Obesity`, `Pulse`, `Race_Ethnicity`, `Respiration`, `Route`, `Smoking`, `Time`, `Total_Cholesterol`, `Treatment`, `VS_Finding`, `Vaccine`, `Vaccine_Name` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_COVID/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_covid_trials_en_4.2.0_3.0_1666177383134.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_covid_trials_en_4.2.0_3.0_1666177383134.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_covid_trials","en","clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner")\ .setLabelCasing("upper") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") ner_pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, word_embeddings, ner, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") ner_model = ner_pipeline.fit(empty_data) text= """In December 2019 , a group of patients with the acute respiratory disease was detected in Wuhan , Hubei Province of China . A month later , a new beta-coronavirus was identified as the cause of the 2019 coronavirus infection . SARS-CoV-2 is a coronavirus that belongs to the group of β-coronaviruses of the subgenus Coronaviridae . The SARS-CoV-2 is the third known zoonotic coronavirus disease after severe acute respiratory syndrome ( SARS ) and Middle Eastern respiratory syndrome ( MERS ). 
The diagnosis of SARS-CoV-2 recommended by the WHO , CDC is the collection of a sample from the upper respiratory tract ( nasal and oropharyngeal exudate ) or from the lower respiratory tractsuch as expectoration of endotracheal aspirate and bronchioloalveolar lavage and its analysis using the test of real-time polymerase chain reaction ( qRT-PCR ).In 2020, the first COVID‑19 vaccine was developed and made available to the public through emergency authorizations and conditional approvals.""" results = ner_model.transform(spark.createDataFrame([[text]]).toDF('text')) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical" ,"en", "clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_covid_trials", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter)) val data = Seq("""In December 2019 , a group of patients with the acute respiratory disease was detected in Wuhan , Hubei Province of China . A month later , a new beta-coronavirus was identified as the cause of the 2019 coronavirus infection . SARS-CoV-2 is a coronavirus that belongs to the group of β-coronaviruses of the subgenus Coronaviridae . 
The SARS-CoV-2 is the third known zoonotic coronavirus disease after severe acute respiratory syndrome ( SARS ) and Middle Eastern respiratory syndrome ( MERS ). The diagnosis of SARS-CoV-2 recommended by the WHO , CDC is the collection of a sample from the upper respiratory tract ( nasal and oropharyngeal exudate ) or from the lower respiratory tractsuch as expectoration of endotracheal aspirate and bronchioloalveolar lavage and its analysis using the test of real-time polymerase chain reaction ( qRT-PCR ).In 2020, the first COVID‑19 vaccine was developed and made available to the public through emergency authorizations and conditional approvals.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.covid_trials").predict("""In December 2019 , a group of patients with the acute respiratory disease was detected in Wuhan , Hubei Province of China . A month later , a new beta-coronavirus was identified as the cause of the 2019 coronavirus infection . SARS-CoV-2 is a coronavirus that belongs to the group of β-coronaviruses of the subgenus Coronaviridae . The SARS-CoV-2 is the third known zoonotic coronavirus disease after severe acute respiratory syndrome ( SARS ) and Middle Eastern respiratory syndrome ( MERS ). The diagnosis of SARS-CoV-2 recommended by the WHO , CDC is the collection of a sample from the upper respiratory tract ( nasal and oropharyngeal exudate ) or from the lower respiratory tractsuch as expectoration of endotracheal aspirate and bronchioloalveolar lavage and its analysis using the test of real-time polymerase chain reaction ( qRT-PCR ).In 2020, the first COVID‑19 vaccine was developed and made available to the public through emergency authorizations and conditional approvals.""") ```
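A note on reading the NER output that follows: `begin` and `end` are character offsets into the input text, inclusive on both ends, so each chunk can be recovered by slicing `text[begin:end + 1]`. For the first sentence of the example text above:

```python
# begin/end offsets reported by the NER pipeline are inclusive character
# indices into the input string, so chunks are recovered with end + 1 slicing.
text = "In December 2019 , a group of patients with the acute respiratory disease was detected in Wuhan , Hubei Province of China ."
date_chunk = text[3:15 + 1]      # Date chunk reported at begin=3, end=15
disease_chunk = text[48:72 + 1]  # Disease_Syndrome_Disorder chunk at begin=48, end=72
```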
## Results ```bash | | chunks | begin | end | sentence_id | entities | |---:|:------------------------------------|--------:|------:|--------------:|:--------------------------| | 0 | December 2019 | 3 | 15 | 0 | Date | | 1 | acute respiratory disease | 48 | 72 | 0 | Disease_Syndrome_Disorder | | 2 | beta-coronavirus | 146 | 161 | 1 | Virus | | 3 | 2019 | 198 | 201 | 1 | Date | | 4 | coronavirus infection | 203 | 223 | 1 | Disease_Syndrome_Disorder | | 5 | SARS-CoV-2 | 228 | 237 | 2 | Virus | | 6 | coronavirus | 244 | 254 | 2 | Virus | | 7 | β-coronaviruses | 285 | 299 | 2 | Virus | | 8 | subgenus Coronaviridae | 308 | 329 | 2 | Virus | | 9 | SARS-CoV-2 | 337 | 346 | 3 | Virus | | 10 | zoonotic coronavirus disease | 367 | 394 | 3 | Disease_Syndrome_Disorder | | 11 | severe acute respiratory syndrome | 402 | 434 | 3 | Disease_Syndrome_Disorder | | 12 | SARS | 438 | 441 | 3 | Disease_Syndrome_Disorder | | 13 | Middle Eastern respiratory syndrome | 449 | 483 | 3 | Disease_Syndrome_Disorder | | 14 | MERS | 487 | 490 | 3 | Disease_Syndrome_Disorder | | 15 | SARS-CoV-2 | 513 | 522 | 4 | Virus | | 16 | WHO | 543 | 545 | 4 | Institution | | 17 | CDC | 549 | 551 | 4 | Institution | | 18 | 2020 | 852 | 855 | 5 | Date | | 19 | COVID‑19 vaccine | 868 | 883 | 5 | Vaccine_Name | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_covid_trials| |Compatibility:|Spark NLP for Healthcare 4.2.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|14.8 MB| ## References This model is trained on data sampled from clinicaltrials.gov - covid trials, and annotated in-house. ## Benchmarking ```bash label tp fp fn total precision recall f1 Institution 34 8 20 55.0 0.7958 0.6343 0.706 VS_Finding 19 2 1 20.0 0.9048 0.95 0.9268 Respiration 5 0 0 5.0 1.0 1.0 1.0 Cerebrovascular_D... 
5 2 2 7.0 0.7143 0.7143 0.7143 Cell_Type 152 27 14 167.0 0.8479 0.9123 0.8789 Heart_Disease 36 3 5 41.0 0.9231 0.878 0.9 Severity 57 25 3 60.0 0.6881 0.95 0.7981 N_Patients 27 3 1 29.0 0.8871 0.9483 0.9167 Pulse 12 2 0 12.0 0.8571 1.0 0.9231 Obesity 3 0 0 3.0 1.0 1.0 1.0 Admission_Discharge 85 3 0 85.0 0.9659 1.0 0.9827 Diabetes 8 0 0 8.0 1.0 1.0 1.0 Section_Header 94 8 13 108.0 0.9154 0.8711 0.8927 Age 22 1 0 22.0 0.9429 1.0 0.9706 Cellular_component 40 21 10 50.0 0.6534 0.8 0.7193 Hypertension 10 0 0 10.0 1.0 1.0 1.0 BMI 5 1 1 6.0 0.8333 0.8333 0.8333 Trial_Phase 13 0 1 14.0 0.9398 0.9286 0.9341 Employment 98 12 8 107.0 0.8874 0.9206 0.9037 Statistical_Indic... 76 29 11 88.0 0.7206 0.8689 0.7879 Time 2 0 1 3.0 1.0 0.6667 0.8 Total_Cholesterol 14 1 2 17.0 0.9355 0.8529 0.8923 Drug_Ingredient 327 33 67 395.0 0.9084 0.8281 0.8664 Physiological_rea... 27 7 14 41.0 0.7864 0.6585 0.7168 Treatment 66 4 25 92.0 0.9433 0.7228 0.8185 Vaccine 20 1 2 23.0 0.9531 0.8841 0.9173 Disease_Syndrome_... 
774 70 41 816.0 0.9171 0.9495 0.933 Virus 121 8 23 144.0 0.9365 0.8403 0.8858 Frequency 57 1 2 59.9 0.9787 0.9556 0.967 Route 37 4 10 47.0 0.9024 0.7872 0.8409 Death_Entity 20 9 3 23.0 0.6897 0.8696 0.7692 Stage 4 0 7 12.0 1.0 0.3889 0.56 Vaccine_Name 10 1 0 10.0 0.9091 1.0 0.9524 Trial_Design 32 13 8 41.0 0.7149 0.7951 0.7529 Biological_molecules 251 91 53 305.0 0.7335 0.8233 0.7758 Date 98 5 2 100.0 0.9492 0.98 0.9643 Race_Ethnicity 0 0 2 2.0 0.0 0.0 0.0 Gender 46 1 0 46.0 0.9787 1.0 0.9892 Dosage 49 9 24 73.0 0.8376 0.6712 0.7452 Viral_components 18 10 15 34.0 0.6512 0.549 0.5957 macro - - - - - - 0.8382 micro - - - - - - 0.8704 ``` --- layout: model title: English BertForQuestionAnswering model (from mrm8488) author: John Snow Labs name: bert_qa_bert_mini_wrslb_finetuned_squadv1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-mini-wrslb-finetuned-squadv1` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_mini_wrslb_finetuned_squadv1_en_4.0.0_3.0_1654183783125.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_mini_wrslb_finetuned_squadv1_en_4.0.0_3.0_1654183783125.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_mini_wrslb_finetuned_squadv1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_mini_wrslb_finetuned_squadv1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_mini_wrslb_finetuned_squadv1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|42.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/bert-mini-wrslb-finetuned-squadv1 --- layout: model title: Slovak RobertaForMaskedLM Cased model (from gerulata) author: John Snow Labs name: roberta_embeddings_slovakbert date: 2022-12-12 tags: [sk, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: sk edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `slovakbert` is a Slovak model originally trained by `gerulata`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_slovakbert_sk_4.2.4_3.0_1670859900714.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_slovakbert_sk_4.2.4_3.0_1670859900714.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_slovakbert","sk") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_slovakbert","sk") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
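The `embeddings` column produced above holds one vector per token. A typical downstream use of such vectors is comparing them by cosine similarity; a minimal pure-Python sketch with toy vectors (not actual model output, which is higher-dimensional):

```python
import math

# Cosine similarity between two dense vectors, as commonly applied to
# token or sentence embeddings. Toy 3-dim vectors stand in for real output.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same = cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])      # identical direction
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # unrelated direction
```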
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_slovakbert| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|sk| |Size:|298.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/gerulata/slovakbert - https://www.gerulata.com/ - https://arxiv.org/abs/2109.15254 --- layout: model title: English asr_wav2vec2_base_timit_demo_colab57 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab57 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab57` is an English model originally trained by hassnain. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab57_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab57_en_4.2.0_3.0_1664023726902.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab57_en_4.2.0_3.0_1664023726902.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab57', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab57", lang = "en") val annotations = pipeline.transform(audioDF) ```
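The pipeline's `transform` expects `audioDF` to carry the raw audio as arrays of floats. If your source is 16-bit PCM integer samples, one common preparation step is normalizing them into [-1.0, 1.0] before assembling the DataFrame; a plain-Python sketch, independent of Spark NLP:

```python
# Normalize signed 16-bit PCM samples (range -32768..32767) to floats in
# [-1.0, 1.0], the form typically fed to Wav2Vec2-style ASR pipelines.
# Illustrative helper only; adapt to however you load your audio.
def pcm16_to_float(samples):
    return [s / 32768.0 for s in samples]

floats = pcm16_to_float([0, 16384, -32768])
```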
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab57| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|228.5 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English image_classifier_vit_exper5_mesum5 ViTForImageClassification from sudo-s author: John Snow Labs name: image_classifier_vit_exper5_mesum5 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_exper5_mesum5` is a English model originally trained by sudo-s. ## Predicted Entities `45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146` {:.btn-box} 
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper5_mesum5_en_4.1.0_3.0_1660165661435.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper5_mesum5_en_4.1.0_3.0_1660165661435.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_exper5_mesum5", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_exper5_mesum5", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_exper5_mesum5| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.3 MB| --- layout: model title: Mapping Abbreviations and Acronyms of Medical Regulatory Activities with Their Definitions (Augmented) author: John Snow Labs name: abbreviation_mapper_augmented date: 2022-10-30 tags: [abbreviation, definition, chunk_mapper, clinical, en, licensed] task: Chunk Mapping language: en nav_key: models edition: Spark NLP for Healthcare 4.2.1 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps abbreviations and acronyms of medical regulatory activities with their `definition`. This is an augmented version of the `abbreviation_mapper` model with new abbreviations. ## Predicted Entities `definition` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/abbreviation_mapper_augmented_en_4.2.1_3.0_1667127908106.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/abbreviation_mapper_augmented_en_4.2.1_3.0_1667127908106.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") abbr_ner = MedicalNerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("abbr_ner") abbr_converter = NerConverter() \ .setInputCols(["sentence", "token", "abbr_ner"]) \ .setOutputCol("abbr_ner_chunk") chunkerMapper = ChunkMapperModel.pretrained("abbreviation_mapper_augmented", "en", "clinical/models")\ .setInputCols(["abbr_ner_chunk"])\ .setOutputCol("mappings")\ .setRels(["definition"]) pipeline = Pipeline().setStages([ document_assembler, sentence_detector, tokenizer, word_embeddings, abbr_ner, abbr_converter, chunkerMapper]) text = ["""Gravid with estimated fetal weight of 6-6/12 pounds. LABORATORY DATA: Laboratory tests include a CBC which is normal. VDRL: Nonreactive HIV: Negative. One-Hour Glucose: 117. 
Group B strep has not been done as yet."""] data = spark.createDataFrame([text]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val abbr_ner = MedicalNerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("abbr_ner") val abbr_converter = new NerConverter() .setInputCols(Array("sentence", "token", "abbr_ner")) .setOutputCol("abbr_ner_chunk") val chunkerMapper = ChunkMapperModel.pretrained("abbreviation_mapper_augmented", "en", "clinical/models") .setInputCols("abbr_ner_chunk") .setOutputCol("mappings") .setRels(Array("definition")) val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, abbr_ner, abbr_converter, chunkerMapper)) val sample_text = """Gravid with estimated fetal weight of 6-6/12 pounds. LABORATORY DATA: Laboratory tests include a CBC which is normal. VDRL: Nonreactive HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.""" val data = Seq(sample_text).toDS.toDF("text") val result= pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.abbreviation_augmented").predict("""Gravid with estimated fetal weight of 6-6/12 pounds. LABORATORY DATA: Laboratory tests include a CBC which is normal. VDRL: Nonreactive HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.""") ```
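Under the hood, a `ChunkMapperModel` behaves like a key-to-relations lookup over the detected NER chunks. The following plain-Python sketch (not Spark NLP; the dictionary holds only the three mappings shown in the Results section below, while the real model ships a much larger curated lookup) illustrates the idea:

```python
# Toy stand-in for ChunkMapperModel: map NER chunks to their "definition" relation.
# The dictionary holds only the mappings shown in the Results section.
abbreviation_map = {
    "CBC": "complete blood count",
    "VDRL": "Venereal Disease Research Laboratories",
    "HIV": "human immunodeficiency virus",
}

def map_chunks(chunks, mapping, default="NONE"):
    """Return (chunk, definition) pairs, mimicking the mapper's 'mappings' output."""
    return [(c, mapping.get(c, default)) for c in chunks]

for chunk, definition in map_chunks(["CBC", "VDRL", "HIV"], abbreviation_map):
    print(f"{chunk} -> {definition}")
```

Chunks absent from the lookup fall back to a `NONE` value rather than raising, which mirrors how the annotator leaves unmapped chunks resolvable downstream.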
## Results ```bash +---------+--------------------------------------+ |ner_chunk|abbreviation | +---------+--------------------------------------+ |CBC |complete blood count | |VDRL |Venereal Disease Research Laboratories| |HIV |human immunodeficiency virus | +---------+--------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|abbreviation_mapper_augmented| |Compatibility:|Spark NLP for Healthcare 4.2.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[abbr_ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|319.6 KB| --- layout: model title: Mapping RxNorm Codes with Corresponding National Drug Codes (NDC) author: John Snow Labs name: rxnorm_ndc_mapper date: 2022-06-27 tags: [rxnorm, ndc, chunk_mapper, licensed, clinical, en] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps RxNorm and RxNorm Extension codes to their corresponding National Drug Codes (NDC). ## Predicted Entities `Product NDC`, `Package NDC` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_ndc_mapper_en_3.5.3_3.0_1656314699115.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_ndc_mapper_en_3.5.3_3.0_1656314699115.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") rxnorm_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models")\ .setInputCols(["ner_chunk", "sbert_embeddings"])\ .setOutputCol("rxnorm_code")\ .setDistanceFunction("EUCLIDEAN") chunkerMapper = ChunkMapperModel\ .pretrained("rxnorm_ndc_mapper", "en", "clinical/models")\ .setInputCols(["rxnorm_code"])\ .setOutputCol("ndc_mappings")\ .setRels(["Product NDC", "Package NDC"]) pipeline = Pipeline(stages = [ documentAssembler, sbert_embedder, rxnorm_resolver, chunkerMapper ]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) light_pipeline = LightPipeline(model) result = light_pipeline.annotate(["doxycycline hyclate 50 MG Oral Tablet", "macadamia nut 100 MG/ML"]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols("ner_chunk") .setOutputCol("sbert_embeddings") val rxnorm_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("rxnorm_code") .setDistanceFunction("EUCLIDEAN") val chunkerMapper = ChunkMapperModel .pretrained("rxnorm_ndc_mapper", "en", "clinical/models") .setInputCols("rxnorm_code") .setOutputCol("ndc_mappings") .setRels(Array("Product NDC", "Package NDC")) val pipeline = new Pipeline().setStages(Array( documentAssembler, sbert_embedder, rxnorm_resolver, chunkerMapper )) val data = Seq(Array("doxycycline hyclate 50 MG Oral Tablet", "macadamia nut 100 MG/ML")).toDS.toDF("text") 
val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.rxnorm_to_ndc").predict("""doxycycline hyclate 50 MG Oral Tablet""") ```
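Conceptually, `rxnorm_ndc_mapper` is a lookup from a resolved RxNorm code to one value per requested relation (`setRels`). A plain-Python stand-in restricted to the two codes shown in the Results section below (the real model covers the full RxNorm vocabulary):

```python
# Toy stand-in for rxnorm_ndc_mapper: one RxNorm code carries several
# relations ("Product NDC", "Package NDC"). Values come from the Results table.
ndc_map = {
    "1652674": {"Product NDC": "46708-0499", "Package NDC": "62135-0625-60"},
    "259934":  {"Product NDC": "13349-0010", "Package NDC": "13349-0010-39"},
}

def map_rxnorm(code, rels=("Product NDC", "Package NDC")):
    """Return the requested relations for a resolved RxNorm code, like setRels()."""
    entry = ndc_map.get(code, {})
    return {rel: entry.get(rel, "NONE") for rel in rels}

print(map_rxnorm("1652674"))
```

Restricting `rels` narrows the output the same way `setRels(["Product NDC"])` would keep only product-level codes.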
## Results ```bash | | ner_chunk | rxnorm_code | Package NDC | Product NDC | |---:|:--------------------------------------|--------------:|:--------------|:--------------| | 0 | doxycycline hyclate 50 MG Oral Tablet | 1652674 | 62135-0625-60 | 46708-0499 | | 1 | macadamia nut 100 MG/ML | 259934 | 13349-0010-39 | 13349-0010 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|rxnorm_ndc_mapper| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|2.0 MB| --- layout: model title: German Part of Speech Tagger (from KoichiYasuoka) author: John Snow Labs name: bert_pos_bert_base_german_upos date: 2022-05-09 tags: [bert, pos, part_of_speech, de, open_source] task: Part of Speech Tagging language: de edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-german-upos` is a German model originally trained by `KoichiYasuoka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_german_upos_de_3.4.2_3.0_1652092297527.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_german_upos_de_3.4.2_3.0_1652092297527.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_german_upos","de") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_german_upos","de") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
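The `pos` output column contains Universal POS tags. A quick sanity check that predictions stay within the 17-tag UPOS inventory defined at universaldependencies.org/u/pos/ (the predicted tags below are illustrative, not the model's actual output):

```python
# The 17 Universal POS tags from the Universal Dependencies guidelines.
UPOS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

def invalid_tags(tags):
    """Return any tags outside the UPOS inventory (empty list = all valid)."""
    return [t for t in tags if t not in UPOS_TAGS]

# Hypothetical tagging for "Ich liebe Spark NLP" (illustrative only):
predicted = ["PRON", "VERB", "PROPN", "PROPN"]
print(invalid_tags(predicted))
```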
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_german_upos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|410.4 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/KoichiYasuoka/bert-base-german-upos - https://github.com/UniversalDependencies/UD_German-HDT - https://universaldependencies.org/u/pos/ - https://github.com/KoichiYasuoka/esupar --- layout: model title: English BertForQuestionAnswering model (from aodiniz) author: John Snow Labs name: bert_qa_bert_uncased_L_6_H_128_A_2_squad2_covid_qna date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-6_H-128_A-2_squad2_covid-qna` is an English model originally trained by `aodiniz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_6_H_128_A_2_squad2_covid_qna_en_4.0.0_3.0_1654185384846.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_6_H_128_A_2_squad2_covid_qna_en_4.0.0_3.0_1654185384846.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_6_H_128_A_2_squad2_covid_qna","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_uncased_L_6_H_128_A_2_squad2_covid_qna","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_covid.bert.uncased_6l_128d_a2a_128d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
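Extractive question answering returns a span of the supplied context rather than generated text. A minimal illustration of how predicted span offsets map back to the answer string, using the example context above:

```python
# BertForQuestionAnswering predicts start/end positions over the context;
# the answer is recovered by slicing, never by generating new text.
context = "My name is Clara and I live in Berkeley."

def span_to_text(ctx, start, end):
    """Recover the answer string from predicted character offsets."""
    return ctx[start:end]

# Offsets computed here for illustration; the model would predict them.
start = context.index("Clara")
answer = span_to_text(context, start, start + len("Clara"))
print(answer)  # Clara
```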
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_uncased_L_6_H_128_A_2_squad2_covid_qna| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|19.9 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-6_H-128_A-2_squad2_covid-qna --- layout: model title: Legal Stock Pledge Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_stock_pledge_agreement_bert date: 2023-02-02 tags: [en, legal, classification, stock, pledge, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_stock_pledge_agreement_bert` model is a Bert Sentence Embeddings document classifier that determines whether a document belongs to the `stock-pledge-agreement` class (binary classification). Compared with the Longformer-based alternative, this model is lighter and offers faster inference. ## Predicted Entities `stock-pledge-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_stock_pledge_agreement_bert_en_1.0.0_3.0_1675360395046.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_stock_pledge_agreement_bert_en_1.0.0_3.0_1675360395046.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_stock_pledge_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
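The precision, recall and F1 figures reported in the Benchmarking section follow the standard confusion-matrix definitions. A small helper makes that relationship explicit (the counts are illustrative, not the model's actual confusion matrix):

```python
# Precision/recall/F1 from true-positive, false-positive and false-negative counts.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts for a class with support 36.
p, r, f = prf(tp=34, fp=4, fn=2)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.89 0.94 0.92
```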
## Results ```bash +-------+ |result| +-------+ |[stock-pledge-agreement]| |[other]| |[other]| |[stock-pledge-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_stock_pledge_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.97 0.95 0.96 73 stock-pledge-agreement 0.89 0.94 0.92 36 accuracy - - 0.94 109 macro-avg 0.93 0.94 0.94 109 weighted-avg 0.95 0.94 0.95 109 ``` --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_12_h_768 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-12_H-768` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_768_zh_4.2.4_3.0_1670021588203.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_768_zh_4.2.4_3.0_1670021588203.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_768","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_768","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
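Token embeddings produced by this model (768-dimensional, per the `L-12_H-768` configuration) are typically compared with cosine similarity downstream. A framework-free sketch with toy 3-d vectors:

```python
import math

# Cosine similarity between two embedding vectors: dot product over
# the product of their Euclidean norms. Identical directions score 1.0,
# orthogonal directions score 0.0.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(round(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]), 6))  # 1.0
print(round(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 6))  # 0.0
```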
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_12_h_768| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|383.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-12_H-768 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Word2Vec Embeddings in Zeelandic (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, zea, open_source] task: Embeddings language: zea edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_zea_3.4.1_3.0_1647467775728.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_zea_3.4.1_3.0_1647467775728.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","zea") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","zea") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zea.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
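Note that `WordEmbeddingsModel` is a static lookup: each token resolves to one fixed 300-d vector, out-of-vocabulary tokens resolve to the zero vector, and this model matches case-insensitively (`Case sensitive: false`). A toy 4-d illustration of that behavior (the vectors are made up):

```python
# Toy stand-in for a static word-embedding lookup table.
DIM = 4
table = {
    "spark": [0.1, 0.2, 0.3, 0.4],  # made-up vectors, for illustration only
    "nlp":   [0.5, 0.1, 0.0, 0.2],
}

def embed(token):
    """Case-insensitive lookup; out-of-vocabulary tokens map to the zero vector."""
    return table.get(token.lower(), [0.0] * DIM)

print(embed("Spark"))    # lookup succeeds despite the capital S
print(embed("unknown"))  # OOV -> zero vector
```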
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|zea| |Size:|52.1 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Extract Anatomical Entities from Voice of the Patient Documents (embeddings_clinical_medium) author: John Snow Labs name: ner_vop_anatomy_emb_clinical_medium date: 2023-06-07 tags: [licensed, clinical, ner, en, vop, anatomy] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts anatomical terms from documents written in the patient’s own words. ## Predicted Entities `BodyPart`, `Laterality` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_anatomy_emb_clinical_medium_en_4.4.3_3.0_1686148509918.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_anatomy_emb_clinical_medium_en_4.4.3_3.0_1686148509918.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_anatomy_emb_clinical_medium", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["Ugh, I pulled a muscle in my neck from sleeping weird last night. 
It's like a knot in my trapezius and it hurts to turn my head."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_anatomy_emb_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Ugh, I pulled a muscle in my neck from sleeping weird last night. It's like a knot in my trapezius and it hurts to turn my head.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
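`NerConverterInternal` turns the model's token-level BIO tags into entity chunks. A minimal BIO decoder showing that step (the tags below are illustrative, chosen to be consistent with the Results section):

```python
# Merge token-level BIO tags (B-X begins a chunk, I-X continues it, O is outside)
# into (chunk_text, label) pairs, as a NER converter does.
def bio_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" or a dangling I- tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["knot", "in", "my", "trapezius"]
tags = ["O", "O", "O", "B-BodyPart"]
print(bio_to_chunks(tokens, tags))  # [('trapezius', 'BodyPart')]
```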
## Results ```bash | chunk | ner_label | |:----------|:------------| | muscle | BodyPart | | neck | BodyPart | | trapezius | BodyPart | | head | BodyPart | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_anatomy_emb_clinical_medium| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical_medium| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
## Benchmarking ```bash label tp fp fn total precision recall f1 BodyPart 2729 269 171 2900 0.91 0.94 0.93 Laterality 547 50 81 628 0.92 0.87 0.89 macro_avg 3276 319 252 3528 0.92 0.90 0.91 micro_avg 3276 319 252 3528 0.91 0.93 0.92 ``` --- layout: model title: Corsican RobertaForMaskedLM Small Cased model (from huggingface) author: John Snow Labs name: roberta_embeddings_codeberta_small_v1 date: 2022-12-12 tags: [co, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: co edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `CodeBERTa-small-v1` is a Corsican model originally trained by `huggingface`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_codeberta_small_v1_co_4.2.4_3.0_1670858344002.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_codeberta_small_v1_co_4.2.4_3.0_1670858344002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_codeberta_small_v1","co") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_codeberta_small_v1","co") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_codeberta_small_v1| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|co| |Size:|314.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/huggingface/CodeBERTa-small-v1 - https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/ - https://tensorboard.dev/experiment/irRI7jXGQlqmlxXS0I07ew/#scalars --- layout: model title: English RoBERTa Embeddings (SMILES Strings, v1) author: John Snow Labs name: roberta_embeddings_chEMBL_smiles_v1 date: 2022-04-14 tags: [roberta, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `chEMBL_smiles_v1` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_chEMBL_smiles_v1_en_3.4.2_3.0_1649947021342.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_chEMBL_smiles_v1_en_3.4.2_3.0_1649947021342.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_chEMBL_smiles_v1","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_chEMBL_smiles_v1","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.chEMBL_smiles_v1").predict("""I love Spark NLP""") ```
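Because the training data consists of SMILES strings rather than natural language, preprocessing benefits from a chemistry-aware tokenizer; for example, `Cl` is a single chlorine atom, not `C` followed by `l`. A minimal regex-based sketch (an illustration, not the preprocessing the authors actually used):

```python
import re

# Greedy alternation: bracketed atoms first, then two-letter halogens,
# then any single character (atoms, bonds, ring digits, parentheses).
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def tokenize_smiles(smiles):
    return SMILES_TOKEN.findall(smiles)

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
print(tokenize_smiles("CCl"))  # ['C', 'Cl'], not ['C', 'C', 'l']
```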
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_chEMBL_smiles_v1| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|88.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/mrm8488/chEMBL_smiles_v1 - https://onlinelibrary.wiley.com/doi/full/10.1002/minf.201700111 - https://github.com/topazape/LSTM_Chem/blob/master/cleanup_smiles.py - https://github.com/topazape/LSTM_Chem - https://www.ncbi.nlm.nih.gov/pubmed/29095571 - https://twitter.com/mrm8488 - https://www.linkedin.com/in/manuel-romero-cs/ --- layout: model title: English BertForQuestionAnswering model (from batterydata) author: John Snow Labs name: bert_qa_batteryscibert_cased_squad_v1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `batteryscibert-cased-squad-v1` is an English model originally trained by `batterydata`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_batteryscibert_cased_squad_v1_en_4.0.0_3.0_1654179380922.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_batteryscibert_cased_squad_v1_en_4.0.0_3.0_1654179380922.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_batteryscibert_cased_squad_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_batteryscibert_cased_squad_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad_battery.scibert.cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
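The NLU one-liner above packs the question and the context into a single string separated by `|||`. A minimal plain-Python sketch of building and splitting such strings; the helper names are illustrative and not part of the nlu API:

```python
def to_nlu_qa_string(question: str, context: str) -> str:
    """Join a question/context pair with the '|||' separator used in the
    nlu QA example above (illustrative helper, not part of the nlu API)."""
    return f"{question}|||{context}"

def from_nlu_qa_string(s: str) -> tuple:
    """Split an nlu-style QA string back into (question, context)."""
    question, _, context = s.partition("|||")
    return question, context

pair = to_nlu_qa_string("What's my name?", "My name is Clara and I live in Berkeley.")
# round-trips back to the original question/context pair
assert from_nlu_qa_string(pair) == ("What's my name?", "My name is Clara and I live in Berkeley.")
```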
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_batteryscibert_cased_squad_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|410.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/batterydata/batteryscibert-cased-squad-v1 - https://github.com/ShuHuang/batterybert --- layout: model title: English DistilBertForQuestionAnswering Cased model (from mcurmei) author: John Snow Labs name: distilbert_qa_single_label_n_max_long_training date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `single_label_N_max_long_training` is an English model originally trained by `mcurmei`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_single_label_n_max_long_training_en_4.3.0_3.0_1672775531709.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_single_label_n_max_long_training_en_4.3.0_3.0_1672775531709.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_single_label_n_max_long_training","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_single_label_n_max_long_training","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_single_label_n_max_long_training| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mcurmei/single_label_N_max_long_training --- layout: model title: Fast Neural Machine Translation Model from English to Uralic Languages author: John Snow Labs name: opus_mt_en_urj date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, urj, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `urj` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_urj_xx_2.7.0_2.4_1609164932364.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_urj_xx_2.7.0_2.4_1609164932364.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_urj", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_urj", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.urj').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_urj| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English image_classifier_vit_VIT_Basic ViTForImageClassification from Zayn author: John Snow Labs name: image_classifier_vit_VIT_Basic date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_VIT_Basic` is an English model originally trained by Zayn. ## Predicted Entities `chairs`, `ice cream`, `hot dog`, `ladders`, `tables` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_VIT_Basic_en_4.1.0_3.0_1660169773358.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_VIT_Basic_en_4.1.0_3.0_1660169773358.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_VIT_Basic", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_VIT_Basic", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_VIT_Basic| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Wolof XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_ner_wolof date: 2022-08-01 tags: [wo, open_source, xlm_roberta, ner] task: Named Entity Recognition language: wo edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-ner-wolof` is a Wolof model originally trained by `mbeukman`. ## Predicted Entities `DATE`, `PER`, `ORG`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_ner_wolof_wo_4.1.0_3.0_1659355584152.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_ner_wolof_wo_4.1.0_3.0_1659355584152.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_ner_wolof","wo") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_ner_wolof","wo") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
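The `NerConverter` stage above merges the model's IOB-style token tags into entity chunks. A minimal plain-Python sketch of that grouping logic, independent of Spark NLP; the example tokens and tags are invented, using the entity labels this model predicts:

```python
def group_bio_chunks(tokens, tags):
    """Group (token, IOB tag) pairs into (chunk_text, entity_label) spans,
    roughly what Spark NLP's NerConverter does with NER output."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):  # a new chunk begins; flush any open one
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)  # continuation of the open chunk
        else:  # "O" or a stray I- tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Moussa", "Faye", "dem", "Dakar"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
assert group_bio_chunks(tokens, tags) == [("Moussa Faye", "PER"), ("Dakar", "LOC")]
```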
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_ner_wolof| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|wo| |Size:|772.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-ner-wolof - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://github.com/masakhane-io/masakhane-ner --- layout: model title: Sentence Entity Resolver for ICD10-CM (Augmented) author: John Snow Labs name: sbertresolve_icd10cm_augmented date: 2023-05-24 tags: [en, clinical, licensed, icd10cm, entity_resolution] task: Entity Resolution language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD-10-CM codes using `sbert_jsl_medium_uncased` sentence embeddings. It has also been augmented with synonyms to make it more accurate. ## Predicted Entities `ICD10CM Codes` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10cm_augmented_en_4.4.2_3.0_1684961214137.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10cm_augmented_en_4.4.2_3.0_1684961214137.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(['PROBLEM']) chunk2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") bert_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased", "en", "clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("bert_embeddings")\ .setCaseSensitive(False) icd10_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_icd10cm_augmented", "en", "clinical/models") \ .setInputCols(["bert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, bert_embeddings, icd10_resolver]) data_ner = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. 
Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection."]]).toDF("text") results = nlpPipeline.fit(data_ner).transform(data_ner) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") .setWhiteList("PROBLEM") val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val bert_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased", "en", "clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("bert_embeddings") .setCaseSensitive(false) val icd10_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_icd10cm_augmented", "en", "clinical/models") .setInputCols("bert_embeddings") .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, bert_embeddings, icd10_resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis, 
and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.").toDF("text") val result = pipeline.fit(data).transform(data) ```
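In the resolver output shown below, the `all_codes` and `resolutions` columns are single strings whose ranked candidates are joined with `:::`. A minimal sketch of pairing them up after collecting the results to the driver; the function name is illustrative and the sample strings are taken from the polydipsia row of the Results table:

```python
def parse_resolver_columns(all_codes: str, resolutions: str):
    """Pair up the ':::'-separated candidate codes and resolution strings
    emitted by the sentence entity resolver, preserving their ranking."""
    codes = all_codes.split(":::")
    names = resolutions.split(":::")
    return list(zip(codes, names))

ranked = parse_resolver_columns(
    "R63.1:::Q17.0:::Q89.4",
    "polydipsia [polydipsia]:::polyotia [polyotia]:::polysomia [polysomia]",
)
# the first pair is the resolver's top candidate
assert ranked[0] == ("R63.1", "polydipsia [polydipsia]")
```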
## Results ```bash | ner_chunk | entity | icd10_code | all_codes | resolutions | |:--------------------------------------|:---------|:-------------|:------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | gestational diabetes mellitus | PROBLEM | O24.4 | 
O24.4:::O24.41:::O24.43:::Z86.32:::K86.8:::P70.2:::O24.434:::E10.9 | gestational diabetes mellitus [gestational diabetes mellitus]:::gestational diabetes mellitus [gestational diabetes mellitus]:::gestational diabetes mellitus in the puerperium [gestational diabetes mellitus in the puerperium]:::history of gestational diabetes mellitus [history of gestational diabetes mellitus]:::secondary pancreatic diabetes mellitus [secondary pancreatic diabetes mellitus]:::neonatal diabetes mellitus [neonatal diabetes mellitus]:::gestational diabetes mellitus in the puerperium, insulin controlled [gestational diabetes mellitus in the puerperium, insulin controlled]:::juvenile onset diabetes mellitus [juvenile onset diabetes mellitus] | | subsequent type two diabetes mellitus | PROBLEM | E11 | E11:::E11.9:::E10.9:::E10:::E13.9:::Z83.3:::L83:::E11.8:::E11.32:::E10.8:::Z86.39 | type 2 diabetes mellitus [type 2 diabetes mellitus]:::type ii diabetes mellitus [type ii diabetes mellitus]:::type i diabetes mellitus [type i diabetes mellitus]:::type 1 diabetes mellitus [type 1 diabetes mellitus]:::secondary diabetes mellitus [secondary diabetes mellitus]:::fh: diabetes mellitus [fh: diabetes mellitus]:::type 2 diabetes mellitus with acanthosis nigricans [type 2 diabetes mellitus with acanthosis nigricans]:::complication of type ii diabetes mellitus [complication of type ii diabetes mellitus]:::secondary endocrine diabetes mellitus [secondary endocrine diabetes mellitus]:::complication of type i diabetes mellitus [complication of type i diabetes mellitus]:::history of diabetes mellitus type ii [history of diabetes mellitus type ii]:::pregnancy and type 2 diabetes mellitus [pregnancy and type 2 diabetes mellitus]:::type 2 diabetes mellitus with ophthalmic complications [type 2 diabetes mellitus with ophthalmic complications]:::proteinuria due to type 2 diabetes mellitus [proteinuria due to type 2 diabetes mellitus]:::type 2 diabetes mellitus with hyperglycemia [type 2 diabetes mellitus 
with hyperglycemia] | | drug-induced pancreatitis | PROBLEM | F10.2 | F10.2:::K85.3:::M79.3:::T46.5X:::K29.6:::T50.90:::J98.5:::K85.20:::T36.95:::K85.32 | drug-induced chronic pancreatitis [drug-induced chronic pancreatitis]:::drug-induced acute pancreatitis [drug-induced acute pancreatitis]:::drug-induced panniculitis [drug-induced panniculitis]:::drug-induced pericarditis [drug-induced pericarditis]:::drug-induced gastritis [drug-induced gastritis]:::drug-induced dermatomyositis [drug-induced dermatomyositis]:::drug-induced granulomatous mediastinitis [drug-induced granulomatous mediastinitis]:::alcohol-induced pancreatitis [alcohol-induced pancreatitis]:::drug-induced colitis [drug-induced colitis]:::drug induced acute pancreatitis with infected necrosis [drug induced acute pancreatitis with infected necrosis]', 'drug-induced pneumonitis [drug-induced pneumonitis]:::rbn - retrobulbar neuritis [rbn - retrobulbar neuritis]:::drug induced pneumonitis [drug induced pneumonitis]:::infective endocarditis at site of patch of interatrial communication (disorder) [infective endocarditis at site of patch of interatrial communication (disorder)]:::infective endocarditis at site of interatrial communication (disorder) [infective endocarditis at site of interatrial communication (disorder)]:::post-ercp acute pancreatitis (disorder) [post-ercp acute pancreatitis (disorder)]:::dm - dermatomyositis [dm - dermatomyositis]:::drug-induced pericarditis (disorder) [drug-induced pericarditis (disorder)]:::alcohol-induced chronic pancreatitis [alcohol-induced chronic pancreatitis] | | acute hepatitis | PROBLEM | K72.0 | K72.0:::B15:::B17.2:::B17.10:::B17.1:::B16:::B17.9:::B18.8 | acute hepatitis [acute hepatitis]:::acute hepatitis a [acute hepatitis a]:::acute hepatitis e [acute hepatitis e]:::acute hepatitis c [acute hepatitis c]:::acute hepatitis c [acute hepatitis c]:::acute hepatitis b [acute hepatitis b]:::acute viral hepatitis [acute viral hepatitis]:::chronic hepatitis e 
[chronic hepatitis e]:::acute type a viral hepatitis [acute type a viral hepatitis]:::acute focal hepatitis [acute focal hepatitis]:::chronic hepatitis [chronic hepatitis]:::acute type b viral hepatitis [acute type b viral hepatitis]:::hepatitis a [hepatitis a]:::chronic hepatitis c [chronic hepatitis c]:::hepatitis [hepatitis]:::ischemic hepatitis [ischemic hepatitis]:::chronic viral hepatitis [chronic viral hepatitis] | | obesity | PROBLEM | E66.9 | E66.9:::E66.8:::P90:::Q13.0:::M79.4:::Z86.39 | obesity [obesity]:::upper body obesity [upper body obesity]:::childhood obesity [childhood obesity]:::central obesity [central obesity]:::localised obesity [localised obesity]:::history of obesity [history of obesity] | | a body mass index | PROBLEM | E66.9 | E66.9:::Z68.41:::Z68:::E66.8:::Z68.45:::Z68.4:::Z68.1:::Z68.2 | observation of body mass index [observation of body mass index]:::finding of body mass index [finding of body mass index]:::body mass index [bmi] [body mass index [bmi]]:::body mass index equal to or greater than 40 [body mass index equal to or greater than 40]:::body mass index [bmi] 70 or greater, adult [body mass index [bmi] 70 or greater, adult]:::body mass index [bmi] 40 or greater, adult [body mass index [bmi] 40 or greater, adult]:::body mass index [bmi] 19.9 or less, adult [body mass index [bmi] 19.9 or less, adult]:::body mass index [bmi] 20-29, adult [body mass index [bmi] 20-29, adult]:::mass of body region [mass of body region]:::body mass index [bmi] 22.0-22.9, adult [body mass index [bmi] 22.0-22.9, adult]:::body mass index [bmi] 21.0-21.9, adult [body mass index [bmi] 21.0-21.9, adult]:::body mass index [bmi] 25.0-25.9, adult [body mass index [bmi] 25.0-25.9, adult]:::body mass index [bmi] 50.0-59.9, adult [body mass index [bmi] 50.0-59.9, adult]:::body mass index [bmi] 20.0-20.9, adult [body mass index [bmi] 20.0-20.9, adult]:::body mass index [bmi] 26.0-26.9, adult [body mass index [bmi] 26.0-26.9, adult]:::body mass index [bmi] 
31.0-31.9, adult [body mass index [bmi] 31.0-31.9, adult]:::body mass index [bmi] 45.0-49.9, adult [body mass index [bmi] 45.0-49.9, adult]:::body mass index [bmi] 29.0-29.9, adult [body mass index [bmi] 29.0-29.9, adult]:::defined border of mass of abdomen [defined border of mass of abdomen]:::body mass index [bmi] 24.0-24.9, adult [body mass index [bmi] 24.0-24.9, adult]:::subcutaneous mass of head (finding) [subcutaneous mass of head (finding)]:::mass of thoracic structure (finding) [mass of thoracic structure (finding)] | | polyuria | PROBLEM | R35 | R35:::E88.8:::R30.0:::N28.89:::O04.8:::R82.4:::E74.8:::R82.2 | polyuria [polyuria]:::sialuria [sialuria]:::stranguria [stranguria]:::isosthenuria [isosthenuria]:::oliguria [oliguria]:::ketonuria [ketonuria]:::xylosuria [xylosuria]:::biliuria [biliuria]:::sucrosuria [sucrosuria]:::chyluria [chyluria]:::anuria [anuria]:::pollakiuria [pollakiuria]:::podocyturia [podocyturia]:::galactosuria [galactosuria]:::proteinuria [proteinuria]:::other polyuria [other polyuria]:::other polyuria [other polyuria]:::alcaptonuria [alcaptonuria]:::xanthinuria [xanthinuria]:::oliguria and anuria [oliguria and anuria]:::hawkinsinuria [hawkinsinuria] | | polydipsia | PROBLEM | R63.1 | R63.1:::Q17.0:::Q89.4:::Q89.09:::Q74.8:::H53.8:::H53.2:::Q13.2 | polydipsia [polydipsia]:::polyotia [polyotia]:::polysomia [polysomia]:::polysplenia [polysplenia]:::polymelia [polymelia]:::palinopsia [palinopsia]:::polyopia [polyopia]:::polycoria [polycoria]:::adipsia [adipsia]:::primary polydipsia [primary polydipsia]:::chromatopsia [chromatopsia]:::polyphasic units [polyphasic units]:::duodenal polyp [duodenal polyp]:::polymicrogyria [polymicrogyria]:::polygalactia [polygalactia]:::otic polyp [otic polyp]:::pseudochondroplasia [pseudochondroplasia]:::polythelia [polythelia]:::polymyositis [polymyositis]:::juvenile polyps of large bowel [juvenile polyps of large bowel]:::syndactylia [syndactylia] | | poor appetite | PROBLEM | R63.0 | 
R63.0:::R63.2:::P92.9:::R45.81:::Z55.8:::R41.84:::R41.3:::Z74.8 | poor appetite [poor appetite]:::excessive appetite [excessive appetite]:::poor feeding [poor feeding]:::poor self-esteem [poor self-esteem]:::poor education [poor education]:::poor concentration [poor concentration]', 'poor memory [poor memory]:::poor informal care arrangements (finding) [poor informal care arrangements (finding)]:::poor attention control [poor attention control]:::poor self-esteem (finding) [poor self-esteem (finding)]:::trouble eating [trouble eating]:::poor daily routine (finding) [poor daily routine (finding)]:::poor manual dexterity [poor manual dexterity]:::poor family relationship (finding) [poor family relationship (finding)]:::inadequate psyllium intake (finding) [inadequate psyllium intake (finding)]:::poor work record [poor work record]:::diet poor [diet poor]:::poor historian [poor historian]:::inadequate copper intake [inadequate copper intake] | | vomiting | PROBLEM | R11.1 | R11.1:::K91.0:::K92.0:::A08.39:::R11:::P92.0:::P92.09:::R11.12 | vomiting [vomiting]:::vomiting bile [vomiting bile]:::vomiting blood [vomiting blood]:::viral vomiting [viral vomiting]:::vomiting (disorder) [vomiting (disorder)]:::vomiting of newborn [vomiting of newborn]:::vomiting in newborn (disorder) [vomiting in newborn (disorder)]:::projectile vomiting [projectile vomiting]:::morning vomiting (disorder) [morning vomiting (disorder)]:::vomiting of pregnancy [vomiting of pregnancy]:::habit vomiting (disorder) [habit vomiting (disorder)]:::periodic vomiting [periodic vomiting] | | a respiratory tract infection | PROBLEM | J98.8 | J98.8:::J06.9:::P39.3:::J22:::N39.0:::A49.9:::Z59.3:::T83.51 | respiratory tract infection [respiratory tract infection]:::upper respiratory tract infection [upper respiratory tract infection]:::urinary tract infection [urinary tract infection]:::lrti - lower respiratory tract infection [lrti - lower respiratory tract infection]', 'uti - urinary tract infection [uti - 
urinary tract infection]:::bacterial respiratory infection [bacterial respiratory infection]:::institution-acquired respiratory infection [institution-acquired respiratory infection]:::catheter-associated urinary tract infection [catheter-associated urinary tract infection] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbertresolve_icd10cm_augmented| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[bert_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Size:|1.0 GB| |Case sensitive:|false| --- layout: model title: English image_classifier_vit_finetuned_cats_dogs ViTForImageClassification from nickmuchi author: John Snow Labs name: image_classifier_vit_finetuned_cats_dogs date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_finetuned_cats_dogs` is a English model originally trained by nickmuchi. ## Predicted Entities `cat`, `dog` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_finetuned_cats_dogs_en_4.1.0_3.0_1660168924334.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_finetuned_cats_dogs_en_4.1.0_3.0_1660168924334.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_finetuned_cats_dogs", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_finetuned_cats_dogs", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_finetuned_cats_dogs| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Pipeline to Detect Posology concepts (ner_posology_healthcare) author: John Snow Labs name: ner_posology_healthcare_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_posology_healthcare](https://nlp.johnsnowlabs.com/2021/04/01/ner_posology_healthcare_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_healthcare_pipeline_en_4.3.0_3.2_1678870448348.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_healthcare_pipeline_en_4.3.0_3.2_1678870448348.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_posology_healthcare_pipeline", "en", "clinical/models") text = '''The patient is a 40-year-old white male who presents with a chief complaint of 'chest pain'. The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that chest pain started yesterday evening. He has been advised Aspirin 81 milligrams QDay. insulin 50 units in a.m. HCTZ 50 mg QDay. Nitroglycerin 1/150 sublingually.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_posology_healthcare_pipeline", "en", "clinical/models") val text = "The patient is a 40-year-old white male who presents with a chief complaint of 'chest pain'. The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that chest pain started yesterday evening. He has been advised Aspirin 81 milligrams QDay. insulin 50 units in a.m. HCTZ 50 mg QDay. Nitroglycerin 1/150 sublingually." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology.healthcare_pipeline").predict("""The patient is a 40-year-old white male who presents with a chief complaint of 'chest pain'. The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that chest pain started yesterday evening. He has been advised Aspirin 81 milligrams QDay. insulin 50 units in a.m. HCTZ 50 mg QDay. Nitroglycerin 1/150 sublingually.""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:--------------|--------:|------:|:------------|-------------:| | 0 | Aspirin | 267 | 273 | Drug | 0.9983 | | 1 | 81 milligrams | 275 | 287 | Strength | 0.9921 | | 2 | QDay | 289 | 292 | Frequency | 0.995 | | 3 | insulin | 295 | 301 | Drug | 0.9469 | | 4 | 50 units | 303 | 310 | Dosage | 0.6343 | | 5 | in a.m | 312 | 317 | Frequency | 0.71315 | | 6 | HCTZ | 320 | 323 | Drug | 0.9789 | | 7 | 50 mg | 325 | 329 | Strength | 0.96705 | | 8 | QDay | 331 | 334 | Frequency | 0.9877 | | 9 | Nitroglycerin | 337 | 349 | Drug | 0.9927 | | 10 | 1/150 | 351 | 355 | Strength | 0.9565 | | 11 | sublingually. | 357 | 369 | Route | 0.72065 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_healthcare_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|513.8 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Extract Clinical Department Entities from Voice of the Patient Documents (embeddings_clinical_medium) author: John Snow Labs name: ner_vop_clinical_dept_emb_clinical_medium date: 2023-06-06 tags: [licensed, clinical, ner, en, vop] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts mentions of medical devices and clinical departments from documents written in the patient’s own words. 
## Predicted Entities `ClinicalDept`, `AdmissionDischarge`, `MedicalDevice` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_clinical_dept_emb_clinical_medium_en_4.4.3_3.0_1686074863054.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_clinical_dept_emb_clinical_medium_en_4.4.3_3.0_1686074863054.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_clinical_dept_emb_clinical_medium", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["My little brother is having surgery tomorrow in the orthopedic department. He is getting a titanium plate put in his leg to help it heal faster. 
Wishing him a speedy recovery!"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_clinical_dept_emb_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("My little brother is having surgery tomorrow in the orthopedic department. He is getting a titanium plate put in his leg to help it heal faster. Wishing him a speedy recovery!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:----------------------|:--------------| | orthopedic department | ClinicalDept | | titanium plate | MedicalDevice | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_clinical_dept_emb_clinical_medium| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical_medium| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
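The benchmarking table below reports raw `tp`, `fp`, and `fn` counts alongside precision, recall, and F1. The derived metrics follow directly from the counts; here is a minimal sketch (the `prf` helper is illustrative, not part of the Spark NLP API):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from per-label true positive,
    false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# ClinicalDept row of the benchmark: tp=284, fp=28, fn=42
p, r, f = prf(284, 28, 42)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.91 0.87 0.89
```

The same arithmetic reproduces the other per-label rows; the micro average pools the counts across labels before applying the formulas.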
## Benchmarking ```bash label tp fp fn total precision recall f1 ClinicalDept 284 28 42 326 0.91 0.87 0.89 AdmissionDischarge 29 3 5 34 0.91 0.85 0.88 MedicalDevice 235 40 97 332 0.85 0.71 0.77 macro_avg 548 71 144 692 0.89 0.81 0.85 micro_avg 548 71 144 692 0.88 0.79 0.83 ``` --- layout: model title: Fast Neural Machine Translation Model from Tai to English author: John Snow Labs name: opus_mt_taw_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, taw, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `taw` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_taw_en_xx_2.7.0_2.4_1609170891318.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_taw_en_xx_2.7.0_2.4_1609170891318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_taw_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_taw_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.taw.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_taw_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering Large Uncased model (from tiennvcs) author: John Snow Labs name: bert_qa_large_uncased_finetuned_infovqa date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-finetuned-infovqa` is an English model originally trained by `tiennvcs`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_large_uncased_finetuned_infovqa_en_4.0.0_3.0_1657187273878.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_large_uncased_finetuned_infovqa_en_4.0.0_3.0_1657187273878.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_large_uncased_finetuned_infovqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_large_uncased_finetuned_infovqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_large_uncased_finetuned_infovqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/tiennvcs/bert-large-uncased-finetuned-infovqa --- layout: model title: Japanese Part of Speech Tagger (from KoichiYasuoka) author: John Snow Labs name: bert_pos_bert_base_japanese_luw_upos date: 2022-05-09 tags: [bert, pos, part_of_speech, ja, open_source] task: Part of Speech Tagging language: ja edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-japanese-luw-upos` is a Japanese model originally trained by `KoichiYasuoka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_japanese_luw_upos_ja_3.4.2_3.0_1652091883607.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_japanese_luw_upos_ja_3.4.2_3.0_1652091883607.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_japanese_luw_upos","ja") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Spark NLPが大好きです"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_japanese_luw_upos","ja") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Spark NLPが大好きです").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_japanese_luw_upos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ja| |Size:|338.8 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/KoichiYasuoka/bert-base-japanese-luw-upos - https://universaldependencies.org/u/pos/ - http://id.nii.ac.jp/1001/00216223/ - https://github.com/KoichiYasuoka/esupar --- layout: model title: English BertForTokenClassification Base Uncased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4_Modified_BiomedNLP_PubMedBERT_base_uncased_abstract date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4_Modified_BiomedNLP-PubMedBERT-base-uncased-abstract` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_Modified_BiomedNLP_PubMedBERT_base_uncased_abstract_en_4.0.0_3.0_1657109176981.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_Modified_BiomedNLP_PubMedBERT_base_uncased_abstract_en_4.0.0_3.0_1657109176981.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_Modified_BiomedNLP_PubMedBERT_base_uncased_abstract","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_Modified_BiomedNLP_PubMedBERT_base_uncased_abstract","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4_Modified_BiomedNLP_PubMedBERT_base_uncased_abstract| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|408.5 MB| |Case sensitive:|false| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4_Modified_BiomedNLP-PubMedBERT-base-uncased-abstract --- layout: model title: Pipeline to Detect Clinical Entities author: John Snow Labs name: bert_token_classifier_ner_clinical_pipeline date: 2022-03-15 tags: [bert, bert_token_classifier, ner, clinical, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [bert_token_classifier_ner_clinical](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_clinical_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CLINICAL/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_pipeline_en_3.4.1_2.4_1647349002874.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_pipeline_en_3.4.1_2.4_1647349002874.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python clinical_pipeline = PretrainedPipeline("bert_token_classifier_ner_clinical_pipeline", "en", "clinical/models") clinical_pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . 
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge .") ``` ```scala val clinical_pipeline = new PretrainedPipeline("bert_token_classifier_ner_clinical_pipeline", "en", "clinical/models") clinical_pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . 
Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . 
She had close follow-up with endocrinology post discharge .") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.clinical_pipeline").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . 
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge .""") ```
## Results ```bash +-----------------------------+---------+ |chunk |ner_label| +-----------------------------+---------+ |gestational diabetes mellitus|PROBLEM | |type two diabetes mellitus |PROBLEM | |T2DM |PROBLEM | |HTG-induced pancreatitis |PROBLEM | |an acute hepatitis |PROBLEM | |obesity |PROBLEM | |a body mass index |TEST | |BMI |TEST | |polyuria |PROBLEM | |polydipsia |PROBLEM | |poor appetite |PROBLEM | |vomiting |PROBLEM | |amoxicillin |TREATMENT| |a respiratory tract infection|PROBLEM | |metformin |TREATMENT| |glipizide |TREATMENT| |dapagliflozin |TREATMENT| |T2DM |PROBLEM | |atorvastatin |TREATMENT| |gemfibrozil |TREATMENT| +-----------------------------+---------+ only showing top 20 rows ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverter - Finisher --- layout: model title: Legal Equity Distribution Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_equity_distribution_agreement date: 2022-11-10 tags: [en, legal, classification, agreement, equity_distribution, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_equity_distribution_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `equity-distribution-agreement` or not (Binary Classification). Longformers have a restriction on 4096 tokens, so only the first 4096 tokens will be taken into account. 
We have found that for the large majority of documents in legal corpora, provided they are clean and contain only the legal document itself without extra leading material, 4096 tokens are enough to perform document classification. If that is not the case, let us know and we can take another approach for you: splitting the document into chunks of 4096 tokens, averaging their embeddings, and training on the averaged representation, which means the entire document is taken into account. In practice, this should rarely be required. ## Predicted Entities `equity-distribution-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_equity_distribution_agreement_en_1.0.0_3.0_1668112774409.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_equity_distribution_agreement_en_1.0.0_3.0_1668112774409.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = nlp.ClassifierDLModel.pretrained("legclf_equity_distribution_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
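For documents longer than the 4096-token Longformer window, the chunk-and-average fallback mentioned in the description can be sketched as below. This is an illustrative sketch only: the `embed_window` stub stands in for a real Longformer forward pass and is not part of the pretrained pipeline.

```python
# Hypothetical sketch of the chunk-and-average fallback: split a long token
# sequence into windows of at most 4096 tokens, embed each window (stubbed
# here), then average the window embeddings so the whole document
# contributes to the final document vector.

MAX_LEN = 4096

def chunk(tokens, size=MAX_LEN):
    """Split a token list into consecutive windows of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def embed_window(window):
    # Stand-in for a real embedding model: returns a tiny fixed-size vector
    # derived from the window, just to make the averaging step concrete.
    return [len(window) / MAX_LEN, len(set(window)) / max(len(window), 1)]

def document_embedding(tokens):
    """Average the per-window embeddings into one document embedding."""
    vectors = [embed_window(w) for w in chunk(tokens)]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

doc = ["tok"] * 10000          # a document longer than one 4096-token window
emb = document_embedding(doc)  # averaged over 3 windows (4096 + 4096 + 1808)
```

The averaged vector can then be fed to the classifier in place of the first-window embedding, at the cost of one extra forward pass per additional window.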
## Results ```bash +-------------------------------+ |result | +-------------------------------+ |[equity-distribution-agreement]| |[other] | |[other] | |[equity-distribution-agreement]| +-------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_equity_distribution_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.0 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support equity-distribution-agreement 1.00 1.00 1.0 41 other 1.00 1.00 1.0 66 accuracy - - 1.0 107 macro-avg 1.00 1.00 1.0 107 weighted-avg 1.00 1.00 1.0 107 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from lqdisme) author: John Snow Labs name: distilbert_qa_lqdisme_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `lqdisme`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_lqdisme_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772075404.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_lqdisme_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772075404.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_lqdisme_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_lqdisme_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_lqdisme_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/lqdisme/distilbert-base-uncased-finetuned-squad --- layout: model title: English RobertaForQuestionAnswering Cased model (from comacrae) author: John Snow Labs name: roberta_qa_unaugmentedv3 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-unaugmentedv3` is an English model originally trained by `comacrae`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_unaugmentedv3_en_4.3.0_3.0_1674222641663.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_unaugmentedv3_en_4.3.0_3.0_1674222641663.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_unaugmentedv3","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_unaugmentedv3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_unaugmentedv3| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/comacrae/roberta-unaugmentedv3 --- layout: model title: Detect Assertion Status (assertion_dl_scope_L10R10) author: John Snow Labs name: assertion_dl_scope_L10R10 date: 2022-03-17 tags: [clinical, licensed, en, assertion] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 3.4.2 spark_version: 3.0 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model considers 10 tokens on the left and 10 tokens on the right side of the clinical entities extracted by NER models and assigns their assertion status based on the context within this scope. ## Predicted Entities `hypothetical`, `associated_with_someone_else`, `conditional`, `possible`, `absent`, `present` {:.btn-box} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_dl_scope_L10R10_en_3.4.2_3.0_1647494736416.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_dl_scope_L10R10_en_3.4.2_3.0_1647494736416.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") token = Tokenizer()\ .setInputCols(['sentence'])\ .setOutputCol('token') word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") clinical_assertion = AssertionDLModel.pretrained("assertion_dl_scope_L10R10", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") nlpPipeline = Pipeline(stages=[document,sentenceDetector, token, word_embeddings,clinical_ner,ner_converter, clinical_assertion]) text = "Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer." 
data = spark.createDataFrame([[text]]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val clinical_assertion = AssertionDLModel.pretrained("assertion_dl_scope_L10R10", "en", "clinical/models") .setInputCols(Array("sentence", "ner_chunk", "embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion)) val data = Seq("Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.l10r10").predict("""Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer.""") ```
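The ±10-token scope in the model name (L10R10) can be made concrete with a small, hypothetical window-extraction helper. This is not the model's internal implementation, only a sketch of what "scope" means here; the function name and indices are illustrative.

```python
# Hypothetical sketch: build the context window that a scope-L10R10
# assertion model conditions on for a detected entity chunk.

def scope_window(tokens, start, end, left=10, right=10):
    """Return up to `left` tokens before and `right` tokens after the
    chunk tokens[start:end], together with the chunk itself."""
    lo = max(0, start - left)
    hi = min(len(tokens), end + right)
    return tokens[lo:hi]

sent = "He shows no stomach pain and he maintained on an epidural".split()
# the chunk "stomach pain" occupies token indices 3..5 (end exclusive)
context = scope_window(sent, 3, 5)  # the whole short sentence, "no" included
```

Because the negation cue "no" falls inside the window, the model has the context it needs to assign `absent` to "stomach pain"; cues outside the ±10-token scope would not be visible to it.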
## Results ```bash +---------------+---------+----------------------------+ |chunk |entity |assertion | +---------------+---------+----------------------------+ |severe fever |PROBLEM |present | |sore throat |PROBLEM |present | |stomach pain |PROBLEM |absent | |an epidural |TREATMENT|present | |PCA |TREATMENT|present | |pain control |PROBLEM |present | |short of breath|PROBLEM |conditional | |CT |TEST |present | |lung tumor |PROBLEM |present | |Alzheimer |PROBLEM |associated_with_someone_else| +---------------+---------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_dl_scope_L10R10| |Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| |Size:|1.4 MB| ## References Trained on 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text with ‘embeddings_clinical’. https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ ## Benchmarking ```bash label tp fp fn prec rec f1 absent 812 48 71 0.94418603 0.9195923 0.93172693 present 2463 127 141 0.9509652 0.9458525 0.948402 conditional 25 19 28 0.5681818 0.4716981 0.5154639 associated_with_someone_else 36 7 9 0.8372093 0.8 0.8181818 hypothetical 147 31 28 0.8258427 0.84 0.8328612 possible 159 87 42 0.64634144 0.7910448 0.71140933 Macro-average - - - 0.79545444 0.7946979 0.795076 Micro-average - - - 0.91946477 0.9194648 0.91946477 ``` --- layout: model title: Clinical Deidentification Pipeline (English, slim) author: John Snow Labs name: clinical_deidentification_slim date: 2022-07-24 tags: [deidentification, deid, glove, slim, pipeline, clinical, en, licensed] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline is trained with 
lightweight `glove_100d` embeddings and can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`, `CITY`, `COUNTRY`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `PROFESSION`, `STREET`, `USERNAME`, `ZIP`, `ACCOUNT`, `LICENSE`, `VIN`, `SSN`, `DLN`, `PLATE`, `IPADDR`, `EMAIL` entities. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.Pretrained_Clinical_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_slim_en_4.0.0_3.0_1658699267236.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_slim_en_4.0.0_3.0_1658699267236.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification_slim", "en", "clinical/models") sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""" result = deid_pipeline.annotate(sample) print("\n".join(result['masked'])) print("\n".join(result['masked_with_chars'])) print("\n".join(result['masked_fixed_length_chars'])) print("\n".join(result['obfuscated'])) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = new PretrainedPipeline("clinical_deidentification_slim","en","clinical/models") val sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""" val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("en.de_identify.clinical_slim").predict("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""") ```
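The masking policies this pipeline exposes (entity labels, same-length characters, fixed-length characters) can be illustrated with a minimal sketch. The helper names and character offsets below are hypothetical; the real logic lives inside `DeIdentificationModel`.

```python
# Illustrative sketch of three masking policies applied to one PHI span,
# given as (start, end) character offsets into the text.

def mask_with_entity(text, span, label):
    """Replace the span with its entity label."""
    s, e = span
    return text[:s] + "<" + label + ">" + text[e:]

def mask_with_chars(text, span):
    """Replace the span with a same-length run of '*' wrapped in brackets."""
    s, e = span
    return text[:s] + "[" + "*" * (e - s - 2) + "]" + text[e:]

def mask_fixed_length(text, span):
    """Replace the span with a fixed four-character mask."""
    s, e = span
    return text[:s] + "****" + text[e:]

text = "Dr. John Green saw the patient."
span = (4, 14)  # the characters covering "John Green"
```

Same-length masking preserves document layout, while fixed-length masking also hides how long the original value was; obfuscation goes one step further by substituting realistic fake values.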
## Results ```bash Masked with entity labels ------------------------------ Name : , Record date: , # . Dr. , ID: , IP . He is a male was admitted to the for cystectomy on . Patient's VIN : , SSN , Driver's license . Phone , , , E-MAIL: . Masked with chars ------------------------------ Name : [**************], Record date: [********], # [****]. Dr. [********], ID: [********], IP [************]. He is a [*********] male was admitted to the [**********] for cystectomy on [******]. Patient's VIN : [***************], SSN [**********], Driver's license [*********]. Phone [************], [***************], [***********], E-MAIL: [*************]. Masked with fixed length chars ------------------------------ Name : ****, Record date: ****, # ****. Dr. ****, ID: ****, IP ****. He is a **** male was admitted to the **** for cystectomy on ****. Patient's VIN : ****, SSN ****, Driver's license ****. Phone ****, ****, ****, E-MAIL: ****. Obfuscated ------------------------------ Name : Layne Nation, Record date: 2093-03-13, # C6240488. Dr. Dr Rosalba Hill, ID: JY:3489547, IP 005.005.005.005. He is a 79 male was admitted to the JOHN MUIR MEDICAL CENTER-CONCORD CAMPUS for cystectomy on 01-25-1997. Patient's VIN : 3CCCC22DDDD333888, SSN SSN-289-37-4495, Driver's license S99983662. Phone 04.32.52.27.90, North Adrienne, Colorado Springs, E-MAIL: Rawland@google.com. 
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification_slim| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|181.8 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - TextMatcherModel - ContextualParserModel - RegexMatcherModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ChunkMergeModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: Hindi RoBERTa Embeddings Cased model (from mrm8488) author: John Snow Labs name: roberta_embeddings_hindi date: 2022-07-14 tags: [hi, open_source, roberta, embeddings] task: Embeddings language: hi edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `HindiBERTa` is a Hindi model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_hindi_hi_4.0.0_3.0_1657810059786.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_hindi_hi_4.0.0_3.0_1657810059786.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_hindi","hi") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_hindi","hi") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_hindi| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|hi| |Size:|314.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/mrm8488/HindiBERTa --- layout: model title: Translate Russian to English Pipeline author: John Snow Labs name: translate_ru_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ru, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `ru` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ru_en_xx_2.7.0_2.4_1609689034986.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ru_en_xx_2.7.0_2.4_1609689034986.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ru_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ru_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ru.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ru_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from wiselinjayajos) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_squad_v2 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad_v2` is an English model originally trained by `wiselinjayajos`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad_v2_en_4.3.0_3.0_1672773795694.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad_v2_en_4.3.0_3.0_1672773795694.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_v2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_squad_v2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/wiselinjayajos/distilbert-base-uncased-finetuned-squad_v2 --- layout: model title: Modern Greek (1453-) asr_wav2vec2_large_xlsr_bahasa_indonesia TFWav2Vec2ForCTC from Bagus author: John Snow Labs name: asr_wav2vec2_large_xlsr_bahasa_indonesia date: 2022-09-25 tags: [wav2vec2, el, audio, open_source, asr] task: Automatic Speech Recognition language: el edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_bahasa_indonesia` is a Modern Greek (1453-) model originally trained by Bagus. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_bahasa_indonesia_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_bahasa_indonesia_el_4.2.0_3.0_1664108032291.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_bahasa_indonesia_el_4.2.0_3.0_1664108032291.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_bahasa_indonesia", "el")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_bahasa_indonesia", "el") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_bahasa_indonesia| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|el| |Size:|1.2 GB| --- layout: model title: English asr_wav2vec2_base_common_voice_second_colab TFWav2Vec2ForCTC from zoha author: John Snow Labs name: pipeline_asr_wav2vec2_base_common_voice_second_colab date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_common_voice_second_colab` is an English model originally trained by zoha. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_base_common_voice_second_colab_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_common_voice_second_colab_en_4.2.0_3.0_1664024632309.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_common_voice_second_colab_en_4.2.0_3.0_1664024632309.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_common_voice_second_colab', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_common_voice_second_colab", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_common_voice_second_colab| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|349.4 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Pipeline to Detect Clinical Events (Admissions) author: John Snow Labs name: ner_events_admission_clinical_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_events_admission_clinical](https://nlp.johnsnowlabs.com/2021/03/31/ner_events_admission_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_events_admission_clinical_pipeline_en_4.3.0_3.2_1678866813967.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_events_admission_clinical_pipeline_en_4.3.0_3.2_1678866813967.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_events_admission_clinical_pipeline", "en", "clinical/models") text = '''The patient presented to the emergency room last evening.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_events_admission_clinical_pipeline", "en", "clinical/models") val text = "The patient presented to the emergency room last evening." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.events_admission_clinical.pipeline").predict("""The patient presented to the emergency room last evening.""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-------------------|--------:|------:|:--------------|-------------:| | 0 | presented | 12 | 20 | OCCURRENCE | 0.6219 | | 1 | the emergency room | 25 | 42 | CLINICAL_DEPT | 0.812 | | 2 | last evening | 44 | 55 | TIME | 0.9534 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_events_admission_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Fast Neural Machine Translation Model from Japanese to English author: John Snow Labs name: opus_mt_ja_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ja, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `ja` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ja_en_xx_2.7.0_2.4_1609170731591.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ja_en_xx_2.7.0_2.4_1609170731591.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ja_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ja_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ja.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
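Translation transformers are memory-hungry, so it often helps to feed `fullAnnotate` a few texts at a time rather than the whole corpus. A minimal batching sketch; the batch size of 8 is an assumption to tune for your cluster, and `light_pipeline` refers to the pipeline built above:

```python
# Hypothetical batching helper: chunk a list of texts so each
# fullAnnotate/transform call stays small.
def batched(texts, batch_size=8):
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

docs = ["sentence %d" % i for i in range(20)]
batches = list(batched(docs))

# Each batch would then go through the light pipeline, e.g.:
# for batch in batches:
#     annotations = light_pipeline.fullAnnotate(batch)
```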
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ja_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Word2Vec Embeddings in South Azerbaijani (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, azb, open_source] task: Embeddings language: azb edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_azb_3.4.1_3.0_1647458923395.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_azb_3.4.1_3.0_1647458923395.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","azb") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","azb") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("azb.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
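A common downstream use of the `embeddings` column is comparing token vectors. A plain-Python cosine similarity sketch; the 3-d vectors are toy stand-ins for the model's 300-d output, not real embeddings:

```python
import math

# Cosine similarity: dot product over the product of L2 norms.
# Returns 0.0 for zero vectors to avoid division by zero.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy 3-d stand-ins for the model's 300-d vectors.
v1 = [1.0, 0.0, 0.0]
v2 = [0.0, 1.0, 0.0]
```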
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|azb| |Size:|228.9 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Detect Risk Factors author: John Snow Labs name: ner_risk_factors_en date: 2020-04-22 task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 2.4.2 spark_version: 2.4 tags: [ner, en, clinical, licensed] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for Heart Disease Risk Factors and Personal Health Information. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. ## Predicted Entities `CAD`, `DIABETES`, `FAMILY_HIST`, `HYPERLIPIDEMIA`, `HYPERTENSION`, `MEDICATION`, `OBESE`, `PHI`, `SMOKER`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_RISK_FACTORS/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_risk_factors_en_2.4.2_2.4_1587513300751.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_risk_factors_en_2.4.2_2.4_1587513300751.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPython.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_risk_factors", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['HISTORY OF PRESENT ILLNESS: The patient is a 40-year-old white male who presents with a chief complaint of "chest pain". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. The severity of the pain has progressively increased. He describes the pain as a sharp and heavy pain which radiates to his neck & left arm. He ranks the pain a 7 on a scale of 1-10. He admits some shortness of breath & diaphoresis. He states that he has had nausea & 3 episodes of vomiting tonight. He denies any fever or chills. He admits prior episodes of similar pain prior to his PTCA in 1995. He states the pain is somewhat worse with walking and seems to be relieved with rest. There is no change in pain with positioning. He states that he took 3 nitroglycerin tablets sublingually over the past 1 hour, which he states has partially relieved his pain. The patient ranks his present pain a 4 on a scale of 1-10. The most recent episode of pain has lasted one-hour. 
The patient denies any history of recent surgery, head trauma, recent stroke, abnormal bleeding such as blood in urine or stool or nosebleed.\n\nREVIEW OF SYSTEMS: All other systems reviewed & are negative.\n\nPAST MEDICAL HISTORY: Diabetes mellitus type II, hypertension, coronary artery disease, atrial fibrillation, status post PTCA in 1995 by Dr. ABC.\n\nSOCIAL HISTORY: Denies alcohol or drugs. Smokes 2 packs of cigarettes per day. Works as a banker.\n\nFAMILY HISTORY: Positive for coronary artery disease (father & brother).']], ["text"])) ``` ```scala ... val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_risk_factors", "en", "clinical/models") .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("HISTORY OF PRESENT ILLNESS: The patient is a 40-year-old white male who presents with a chief complaint of \"chest pain\". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. The severity of the pain has progressively increased. He describes the pain as a sharp and heavy pain which radiates to his neck & left arm. He ranks the pain a 7 on a scale of 1-10. He admits some shortness of breath & diaphoresis. He states that he has had nausea & 3 episodes of vomiting tonight. He denies any fever or chills. He admits prior episodes of similar pain prior to his PTCA in 1995. He states the pain is somewhat worse with walking and seems to be relieved with rest. There is no change in pain with positioning. 
He states that he took 3 nitroglycerin tablets sublingually over the past 1 hour, which he states has partially relieved his pain. The patient ranks his present pain a 4 on a scale of 1-10. The most recent episode of pain has lasted one-hour. The patient denies any history of recent surgery, head trauma, recent stroke, abnormal bleeding such as blood in urine or stool or nosebleed. REVIEW OF SYSTEMS: All other systems reviewed & are negative. PAST MEDICAL HISTORY: Diabetes mellitus type II, hypertension, coronary artery disease, atrial fibrillation, status post PTCA in 1995 by Dr. ABC. SOCIAL HISTORY: Denies alcohol or drugs. Smokes 2 packs of cigarettes per day. Works as a banker. FAMILY HISTORY: Positive for coronary artery disease (father & brother).").toDF("text") val result = pipeline.fit(data).transform(data) ```
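The NER output comes back as parallel token and label arrays (e.g. via `token.result` and `ner.result`). A hedged sketch of pairing them client-side and keeping only tagged tokens; the sample tokens and labels below are illustrative, not real pipeline output:

```python
# Hypothetical post-processing: zip the parallel token/IOB-label lists
# and drop the "O" (outside) tags, leaving only entity tokens.
def entity_tokens(tokens, labels):
    return [(tok, lab) for tok, lab in zip(tokens, labels) if lab != "O"]

tokens = ["The", "patient", "is", "diabetic", "."]
labels = ["O", "O", "O", "B-DIABETES", "O"]
pairs = entity_tokens(tokens, labels)
```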
{:.h2_title} ## Results The output is a dataframe with a sentence per row and a ``"ner"`` column containing all of the entity labels in the sentence, entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select ``"token.result"`` and ``"ner.result"`` from your output dataframe or add the ``"Finisher"`` to the end of your pipeline. ```bash +------------------------------------+------------+ |chunk |ner | +------------------------------------+------------+ |diabetic |DIABETES | |coronary artery disease |CAD | |Diabetes mellitus type II |DIABETES | |hypertension |HYPERTENSION| |coronary artery disease |CAD | |1995 |PHI | |ABC |PHI | |Smokes 2 packs of cigarettes per day|SMOKER | |banker |PHI | +------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_risk_factors_en_2.4.2_2.4| |Type:|ner| |Compatibility:|Spark NLP 2.4.2| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Case sensitive:|false| | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained on the n2c2 2014 De-identification and Heart Disease Risk Factors Challenge datasets with ``embeddings_clinical``. 
https://portal.dbmi.hms.harvard.edu/projects/n2c2-2014/ {:.h2_title} ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|-----------------:|------:|------:|------:|---------:|---------:|---------:| | 1 | I-HYPERLIPIDEMIA | 7 | 7 | 7 | 0.5 | 0.5 | 0.5 | | 2 | B-CAD | 104 | 52 | 101 | 0.666667 | 0.507317 | 0.576177 | | 3 | I-DIABETES | 127 | 67 | 92 | 0.654639 | 0.579909 | 0.615012 | | 4 | B-HYPERTENSION | 173 | 52 | 64 | 0.768889 | 0.729958 | 0.748918 | | 5 | B-OBESE | 46 | 20 | 3 | 0.69697 | 0.938776 | 0.8 | | 6 | B-PHI | 1968 | 599 | 252 | 0.766654 | 0.886486 | 0.822227 | | 7 | B-HYPERLIPIDEMIA | 71 | 17 | 14 | 0.806818 | 0.835294 | 0.820809 | | 8 | I-SMOKER | 116 | 73 | 94 | 0.613757 | 0.552381 | 0.581454 | | 9 | I-OBESE | 9 | 8 | 4 | 0.529412 | 0.692308 | 0.6 | | 10 | I-FAMILY_HIST | 5 | 0 | 10 | 1 | 0.333333 | 0.5 | | 11 | B-DIABETES | 190 | 59 | 58 | 0.763052 | 0.766129 | 0.764587 | | 12 | B-MEDICATION | 838 | 224 | 81 | 0.789077 | 0.911861 | 0.846037 | | 13 | I-PHI | 597 | 202 | 136 | 0.747184 | 0.814461 | 0.779373 | | 14 | Macro-average | 4533 | 1784 | 1600 | 0.620602 | 0.567477 | 0.592852 | | 15 | Micro-average | 4533 | 1784 | 1600 | 0.717588 | 0.739116 | 0.728193 | ``` --- layout: model title: English image_classifier_vit_rare_puppers_demo ViTForImageClassification from nateraw author: John Snow Labs name: image_classifier_vit_rare_puppers_demo date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rare_puppers_demo` is an English model originally trained by nateraw. 
## Predicted Entities `corgi`, `husky`, `samoyed`, `shiba inu` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rare_puppers_demo_en_4.1.0_3.0_1660170746799.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rare_puppers_demo_en_4.1.0_3.0_1660170746799.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_rare_puppers_demo", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_rare_puppers_demo", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_rare_puppers_demo| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Detect Chemicals and Proteins in text (biobert) author: John Snow Labs name: ner_chemprot_biobert date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect mentions of chemicals and proteins in text using pretrained NER model. ## Predicted Entities `GENE-N`, `GENE-Y`, `CHEMICAL` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CHEMPROT_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chemprot_biobert_en_3.0.0_3.0_1617260815838.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chemprot_biobert_en_3.0.0_3.0_1617260815838.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased", "en")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_chemprot_biobert", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased", "en") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_chemprot_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner, ner_converter)) val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.chemprot").predict("""Put your 
text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_chemprot_biobert| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| --- layout: model title: Detect SDOH of Social Environment author: John Snow Labs name: ner_sdoh_social_environment_wip date: 2023-02-10 tags: [licensed, clinical, social_determinants, en, ner, social, environment, sdoh, public_health] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.8 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts social environment terminologies related to Social Determinants of Health from various kinds of documents. ## Predicted Entities `Social_Support`, `Chidhood_Event`, `Social_Exclusion`, `Violence_Abuse_Legal` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/SOCIAL_DETERMINANT_NER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/SOCIAL_DETERMINANT_NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_social_environment_wip_en_4.2.8_3.0_1675998295035.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_social_environment_wip_en_4.2.8_3.0_1675998295035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_sdoh_social_environment_wip", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter ]) sample_texts = ["He is the primary caregiver.", "There is some evidence of abuse.", "She stated that she was in a safe environment in prison, but that her siblings lived in an unsafe neighborhood, she was very afraid for them and witnessed their ostracism by other people.", "Medical history: Jane was born in a low - income household and experienced significant trauma during her childhood, including physical and emotional abuse."] data = spark.createDataFrame(sample_texts, StringType()).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) 
.setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_sdoh_social_environment_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter )) val data = Seq("Medical history: Jane was born in a low - income household and experienced significant trauma during her childhood, including physical and emotional abuse.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash +--------------------+-----+---+---------------------------+ |ner_label |begin|end|chunk | +--------------------+-----+---+---------------------------+ |Social_Support |10 |26 |primary caregiver | |Violence_Abuse_Legal|26 |30 |abuse | |Violence_Abuse_Legal|49 |54 |prison | |Social_Exclusion |161 |169|ostracism | |Chidhood_Event |87 |113|trauma during her childhood| |Violence_Abuse_Legal|139 |153|emotional abuse | +--------------------+-----+---+---------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_sdoh_social_environment_wip| |Compatibility:|Healthcare NLP 4.2.8+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|858.7 KB| ## Benchmarking ```bash label tp fp fn total precision recall f1 Chidhood_Event 34.0 6.0 5.0 39.0 0.850000 0.871795 0.860759 Social_Exclusion 45.0 6.0 12.0 57.0 0.882353 0.789474 0.833333 Social_Support 1139.0 57.0 103.0 1242.0 0.952341 0.917069 0.934372 Violence_Abuse_Legal 235.0 38.0 44.0 279.0 0.860806 0.842294 0.851449 ``` --- layout: model title: Translate English to Southern Sotho Pipeline author: John Snow Labs name: translate_en_st date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, st, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `st` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_st_xx_2.7.0_2.4_1609686477697.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_st_xx_2.7.0_2.4_1609686477697.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_st", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_st", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.st').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_st| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Tax Withholdings Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_tax_withholdings_bert date: 2023-03-05 tags: [en, legal, classification, clauses, tax_withholdings, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Tax_Withholdings` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Tax_Withholdings`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_tax_withholdings_bert_en_1.0.0_3.0_1678049984586.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_tax_withholdings_bert_en_1.0.0_3.0_1678049984586.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_tax_withholdings_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
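The description above recommends paragraph splitting (by multiline) before classification. A rough standalone sketch of that idea in plain Python; the `split_paragraphs` helper is illustrative, and the workshop tutorial linked above covers the annotator-based approach:

```python
import re

def split_paragraphs(document: str):
    """Split a legal document into provisions on blank lines --
    the "paragraph splitting (by multiline)" strategy -- so each
    piece can be classified independently."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

provisions = split_paragraphs("Clause 1 text.\n\nClause 2 text.\n\n\nClause 3 text.")
# the pieces can then be loaded into the classification DataFrame, e.g.:
# df = spark.createDataFrame([[p] for p in provisions]).toDF("text")
```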
## Results ```bash +-------+ |result| +-------+ |[Tax_Withholdings]| |[Other]| |[Other]| |[Tax_Withholdings]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_tax_withholdings_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.95 0.95 0.95 65 Tax_Withholdings 0.93 0.93 0.93 44 accuracy - - 0.94 109 macro-avg 0.94 0.94 0.94 109 weighted-avg 0.94 0.94 0.94 109 ``` --- layout: model title: Pipeline to Detect Genomic Variants author: John Snow Labs name: ner_genetic_variants_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, genomic_variants, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_genetic_variants](https://nlp.johnsnowlabs.com/2021/06/25/ner_genetic_variants_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_genetic_variants_pipeline_en_3.4.1_3.0_1647872138512.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_genetic_variants_pipeline_en_3.4.1_3.0_1647872138512.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_genetic_variants_pipeline", "en", "clinical/models") pipeline.annotate("The mutation pattern of mitochondrial DNA (mtDNA) in mainland Chinese patients with mitochondrial myopathy, encephalopathy, lactic acidosis and stroke-like episodes (MELAS) has been rarely reported, though previous data suggested that the mutation pattern of MELAS could be different among geographically localized populations. We presented the results of comprehensive mtDNA mutation analysis in 92 unrelated Chinese patients with MELAS (85 with classic MELAS and 7 with MELAS/Leigh syndrome (LS) overlap syndrome). The mtDNA A3243G mutation was the most common causal genotype in this patient group (79/92 and 85.9%). The second common gene mutation was G13513A (7/92 and 7.6%). Additionally, we identified T10191C (p.S45P) in ND3, A11470C (p. K237N) in ND4, T13046C (p.M237T) in ND5 and a large-scale deletion (13025-13033:14417-14425) involving partial ND5 and ND6 subunits of complex I in one patient each. Among them, A11470C, T13046C and the single deletion were novel mutations. In summary, patients with mutations affecting mitochondrially encoded complex I (MTND) reached 12.0% (11/92) in this group. It is noteworthy that all seven patients with MELAS/LS overlap syndrome were associated with MTND mutations. Our data emphasize the important role of MTND mutations in the pathogenicity of MELAS, especially MELAS/LS overlap syndrome.PURPOSE: Genes in the complement pathway, including complement factor H (CFH), C2/BF, and C3, have been reported to be associated with age-related macular degeneration (AMD). Genetic variants, single-nucleotide polymorphisms (SNPs), in these genes were geno-typed for a case-control association study in a mainland Han Chinese population. 
METHODS: One hundred and fifty-eight patients with wet AMD, 80 patients with soft drusen, and 220 matched control subjects were recruited among Han Chinese in mainland China. Seven SNPs in CFH and two SNPs in C2, CFB', and C3 were genotyped using the ABI SNaPshot method. A deletion of 84,682 base pairs covering the CFHR1 and CFHR3 genes was detected by direct polymerase chain reaction and gel electrophoresis. RESULTS: Four SNPs, including rs3753394 (P = 0.0276), rs800292 (P = 0.0266), rs1061170 (P = 0.00514), and rs1329428 (P = 0.0089), in CFH showed a significant association with wet AMD in the cohort of this study. A haplotype containing these four SNPs (CATA) significantly increased protection of wet AMD with a P value of 0.0005 and an odds ratio of 0.29 (95% confidence interval: 0.15-0.60). Unlike in other populations, rs2274700 and rs1410996 did not show a significant association with AMD in the Chinese population of this study. None of the SNPs in CFH showed a significant association with drusen, and none of the SNPs in CFH, C2, CFB, and C3 showed a significant association with either wet AMD or drusen in the cohort of this study. The CFHR1 and CFHR3 deletion was not polymorphic in the Chinese population and was not associated with wet AMD or drusen. CONCLUSION: This study showed that SNPs rs3753394 (P = 0.0276), rs800292 (P = 0.0266), rs1061170 (P = 0.00514), and rs1329428 (P = 0.0089), but not rs7535263, rs1410996, or rs2274700, in CFH were significantly associated with wet AMD in a mainland Han Chinese population. 
This study showed that CFH was more likely to be AMD susceptibility gene at Chr.1q31 based on the finding that the CFHR1 and CFHR3 deletion was not polymorphic in the cohort of this study, and none of the SNPs that were significantly associated with AMD in a white population in C2, CFB, and C3 genes showed a significant association with AMD.") ``` ```scala val pipeline = new PretrainedPipeline("ner_genetic_variants_pipeline", "en", "clinical/models") pipeline.annotate("The mutation pattern of mitochondrial DNA (mtDNA) in mainland Chinese patients with mitochondrial myopathy, encephalopathy, lactic acidosis and stroke-like episodes (MELAS) has been rarely reported, though previous data suggested that the mutation pattern of MELAS could be different among geographically localized populations. We presented the results of comprehensive mtDNA mutation analysis in 92 unrelated Chinese patients with MELAS (85 with classic MELAS and 7 with MELAS/Leigh syndrome (LS) overlap syndrome). The mtDNA A3243G mutation was the most common causal genotype in this patient group (79/92 and 85.9%). The second common gene mutation was G13513A (7/92 and 7.6%). Additionally, we identified T10191C (p.S45P) in ND3, A11470C (p. K237N) in ND4, T13046C (p.M237T) in ND5 and a large-scale deletion (13025-13033:14417-14425) involving partial ND5 and ND6 subunits of complex I in one patient each. Among them, A11470C, T13046C and the single deletion were novel mutations. In summary, patients with mutations affecting mitochondrially encoded complex I (MTND) reached 12.0% (11/92) in this group. It is noteworthy that all seven patients with MELAS/LS overlap syndrome were associated with MTND mutations. Our data emphasize the important role of MTND mutations in the pathogenicity of MELAS, especially MELAS/LS overlap syndrome.PURPOSE: Genes in the complement pathway, including complement factor H (CFH), C2/BF, and C3, have been reported to be associated with age-related macular degeneration (AMD). 
Genetic variants, single-nucleotide polymorphisms (SNPs), in these genes were geno-typed for a case-control association study in a mainland Han Chinese population. METHODS: One hundred and fifty-eight patients with wet AMD, 80 patients with soft drusen, and 220 matched control subjects were recruited among Han Chinese in mainland China. Seven SNPs in CFH and two SNPs in C2, CFB', and C3 were genotyped using the ABI SNaPshot method. A deletion of 84,682 base pairs covering the CFHR1 and CFHR3 genes was detected by direct polymerase chain reaction and gel electrophoresis. RESULTS: Four SNPs, including rs3753394 (P = 0.0276), rs800292 (P = 0.0266), rs1061170 (P = 0.00514), and rs1329428 (P = 0.0089), in CFH showed a significant association with wet AMD in the cohort of this study. A haplotype containing these four SNPs (CATA) significantly increased protection of wet AMD with a P value of 0.0005 and an odds ratio of 0.29 (95% confidence interval: 0.15-0.60). Unlike in other populations, rs2274700 and rs1410996 did not show a significant association with AMD in the Chinese population of this study. None of the SNPs in CFH showed a significant association with drusen, and none of the SNPs in CFH, C2, CFB, and C3 showed a significant association with either wet AMD or drusen in the cohort of this study. The CFHR1 and CFHR3 deletion was not polymorphic in the Chinese population and was not associated with wet AMD or drusen. CONCLUSION: This study showed that SNPs rs3753394 (P = 0.0276), rs800292 (P = 0.0266), rs1061170 (P = 0.00514), and rs1329428 (P = 0.0089), but not rs7535263, rs1410996, or rs2274700, in CFH were significantly associated with wet AMD in a mainland Han Chinese population. 
This study showed that CFH was more likely to be AMD susceptibility gene at Chr.1q31 based on the finding that the CFHR1 and CFHR3 deletion was not polymorphic in the cohort of this study, and none of the SNPs that were significantly associated with AMD in a white population in C2, CFB, and C3 genes showed a significant association with AMD.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.genetic_variants.pipeline").predict("""The mutation pattern of mitochondrial DNA (mtDNA) in mainland Chinese patients with mitochondrial myopathy, encephalopathy, lactic acidosis and stroke-like episodes (MELAS) has been rarely reported, though previous data suggested that the mutation pattern of MELAS could be different among geographically localized populations. We presented the results of comprehensive mtDNA mutation analysis in 92 unrelated Chinese patients with MELAS (85 with classic MELAS and 7 with MELAS/Leigh syndrome (LS) overlap syndrome). The mtDNA A3243G mutation was the most common causal genotype in this patient group (79/92 and 85.9%). The second common gene mutation was G13513A (7/92 and 7.6%). Additionally, we identified T10191C (p.S45P) in ND3, A11470C (p. K237N) in ND4, T13046C (p.M237T) in ND5 and a large-scale deletion (13025-13033:14417-14425) involving partial ND5 and ND6 subunits of complex I in one patient each. Among them, A11470C, T13046C and the single deletion were novel mutations. In summary, patients with mutations affecting mitochondrially encoded complex I (MTND) reached 12.0% (11/92) in this group. It is noteworthy that all seven patients with MELAS/LS overlap syndrome were associated with MTND mutations. Our data emphasize the important role of MTND mutations in the pathogenicity of MELAS, especially MELAS/LS overlap syndrome.PURPOSE: Genes in the complement pathway, including complement factor H (CFH), C2/BF, and C3, have been reported to be associated with age-related macular degeneration (AMD). 
Genetic variants, single-nucleotide polymorphisms (SNPs), in these genes were geno-typed for a case-control association study in a mainland Han Chinese population. METHODS: One hundred and fifty-eight patients with wet AMD, 80 patients with soft drusen, and 220 matched control subjects were recruited among Han Chinese in mainland China. Seven SNPs in CFH and two SNPs in C2, CFB', and C3 were genotyped using the ABI SNaPshot method. A deletion of 84,682 base pairs covering the CFHR1 and CFHR3 genes was detected by direct polymerase chain reaction and gel electrophoresis. RESULTS: Four SNPs, including rs3753394 (P = 0.0276), rs800292 (P = 0.0266), rs1061170 (P = 0.00514), and rs1329428 (P = 0.0089), in CFH showed a significant association with wet AMD in the cohort of this study. A haplotype containing these four SNPs (CATA) significantly increased protection of wet AMD with a P value of 0.0005 and an odds ratio of 0.29 (95% confidence interval: 0.15-0.60). Unlike in other populations, rs2274700 and rs1410996 did not show a significant association with AMD in the Chinese population of this study. None of the SNPs in CFH showed a significant association with drusen, and none of the SNPs in CFH, C2, CFB, and C3 showed a significant association with either wet AMD or drusen in the cohort of this study. The CFHR1 and CFHR3 deletion was not polymorphic in the Chinese population and was not associated with wet AMD or drusen. CONCLUSION: This study showed that SNPs rs3753394 (P = 0.0276), rs800292 (P = 0.0266), rs1061170 (P = 0.00514), and rs1329428 (P = 0.0089), but not rs7535263, rs1410996, or rs2274700, in CFH were significantly associated with wet AMD in a mainland Han Chinese population. 
This study showed that CFH was more likely to be AMD susceptibility gene at Chr.1q31 based on the finding that the CFHR1 and CFHR3 deletion was not polymorphic in the cohort of this study, and none of the SNPs that were significantly associated with AMD in a white population in C2, CFB, and C3 genes showed a significant association with AMD.""") ```
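The pipeline ultimately yields (chunk, label) pairs like those shown in the Results section. A small post-processing helper for grouping them by label; the pair format here is an assumption for illustration, and how you extract the pairs from the pipeline output depends on your setup:

```python
from collections import defaultdict

def group_chunks_by_label(pairs):
    """Group extracted (chunk, ner_label) pairs, e.g. to collect all
    SNP mentions separately from DNA and protein mutations. The pair
    format mirrors the Results table; purely illustrative."""
    grouped = defaultdict(list)
    for chunk, label in pairs:
        grouped[label].append(chunk)
    return dict(grouped)

entities = group_chunks_by_label([
    ("A3243G", "DNAMutation"),
    ("p.S45P", "ProteinMutation"),
    ("rs3753394", "SNP"),
    ("rs800292", "SNP"),
])
# entities["SNP"] == ["rs3753394", "rs800292"]
```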
## Results ```bash +---------+---------------+ |chunk |ner_label | +---------+---------------+ |A3243G |DNAMutation | |G13513A |DNAMutation | |T10191C |DNAMutation | |p.S45P |ProteinMutation| |A11470C |DNAMutation | |p. K237N |ProteinMutation| |T13046C |DNAMutation | |p. |ProteinMutation| |M237T |ProteinMutation| |A11470C |DNAMutation | |T13046C |DNAMutation | |rs3753394|SNP | |rs800292 |SNP | |rs1061170|SNP | |rs1329428|SNP | |rs2274700|SNP | |rs1410996|SNP | |rs3753394|SNP | |rs800292 |SNP | |rs1061170|SNP | |rs1329428|SNP | |rs7535263|SNP | |rs1410996|SNP | |rs2274700|SNP | +---------+---------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_genetic_variants_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: French T5ForConditionalGeneration Base Cased model (from plguillou) author: John Snow Labs name: t5_base_sum_cnndm date: 2023-01-30 tags: [fr, open_source, t5, tensorflow] task: Text Generation language: fr edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-fr-sum-cnndm` is a French model originally trained by `plguillou`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_sum_cnndm_fr_4.3.0_3.0_1675109501910.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_sum_cnndm_fr_4.3.0_3.0_1675109501910.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_base_sum_cnndm","fr") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_base_sum_cnndm","fr") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_base_sum_cnndm| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|fr| |Size:|923.2 MB| ## References - https://huggingface.co/plguillou/t5-base-fr-sum-cnndm --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from huxxx657) author: John Snow Labs name: distilbert_qa_huxxx657_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_huxxx657_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771380474.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_huxxx657_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771380474.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_huxxx657_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_huxxx657_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
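For context on what the annotator computes: extractive QA models score every token as a possible answer start and end, and the answer is the highest-scoring valid span. A hedged sketch of that standard decoding step (illustrative only, not Spark NLP's internal implementation):

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Pick the (start, end) token span maximizing
    start_scores[s] + end_scores[e], subject to s <= e and a length
    cap -- the usual decoding step for extractive QA."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# toy scores over the tokens of the example context
tokens = ["My", "name", "is", "Clara"]
start = [0.1, 0.2, 0.3, 2.0]
end = [0.0, 0.1, 0.2, 2.5]
s, e = best_span(start, end)
# tokens[s:e + 1] -> ["Clara"]
```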
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_huxxx657_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/huxxx657/distilbert-base-uncased-finetuned-squad --- layout: model title: Clean patterns pipeline for English author: John Snow Labs name: clean_pattern date: 2022-07-07 tags: [open_source, english, clean_pattern, pipeline, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description `clean_pattern` is a pretrained pipeline that performs basic text processing steps and recognizes entities. It handles most of the common text processing tasks on your DataFrame. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/clean_pattern_en_4.0.0_3.0_1657188243499.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/clean_pattern_en_4.0.0_3.0_1657188243499.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('clean_pattern', lang = 'en') annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("clean_pattern", lang = "en") val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs ! "] result_df = nlu.load('en.clean.pattern').predict(text) result_df ```
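As a rough picture of what the pipeline's normalizer stage does, the sketch below strips non-letter characters from whitespace tokens; the exact cleanup pattern configured in the pretrained pipeline is an assumption here:

```python
import re

def normalize_tokens(text):
    """Tokenize on whitespace and strip anything that is not a
    letter -- roughly what a NormalizerModel does with a default
    cleanup pattern like [^A-Za-z]. The pipeline's actual
    configured pattern may differ."""
    tokens = text.split()
    cleaned = [re.sub(r"[^A-Za-z]", "", tok) for tok in tokens]
    return [tok for tok in cleaned if tok]

normalize_tokens("Hello from John Snow Labs ! ")
# -> ["Hello", "from", "John", "Snow", "Labs"]
```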
## Results ```bash | | document | sentence | token | normal | |---:|:-----------|:-----------|:----------|:----------| | 0 | ['Hello'] | ['Hello'] | ['Hello'] | ['Hello'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clean_pattern| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|11.3 KB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - NormalizerModel --- layout: model title: Legal Language Clause Binary Classifier author: John Snow Labs name: legclf_language_clause date: 2022-12-07 tags: [en, legal, classification, clauses, language, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `language` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `language`, `other` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_language_clause_en_1.0.0_3.0_1670445331581.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_language_clause_en_1.0.0_3.0_1670445331581.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_language_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
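The Benchmarking section below reports per-class precision, recall, and f1 plus a macro average. As a reminder of how those figures relate (plain Python, illustrative; f1 is the harmonic mean of precision and recall, and macro-avg is the unweighted mean across classes):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_avg(values):
    """Unweighted mean across classes, as in the benchmarking tables."""
    return sum(values) / len(values)

# e.g. two classes with symmetric precision/recall of 0.95 and 0.93:
per_class = [f1(0.95, 0.95), f1(0.93, 0.93)]
round(macro_avg(per_class), 2)
# -> 0.94
```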
## Results ```bash +-------+ |result| +-------+ |[language]| |[other]| |[other]| |[language]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_language_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support language 1.00 1.00 1.00 28 other 1.00 1.00 1.00 73 accuracy - - 1.00 101 macro-avg 1.00 1.00 1.00 101 weighted-avg 1.00 1.00 1.00 101 ``` --- layout: model title: Legal NER for MAPA (Multilingual Anonymisation for Public Administrations) author: John Snow Labs name: legner_mapa date: 2023-04-26 tags: [da, legal, ner, licensed, mapa] task: Named Entity Recognition language: da edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union. This model extracts `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, and `PERSON` entities from `Danish` documents. ## Predicted Entities `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, `PERSON` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_mapa_da_1.0.0_3.0_1682551046131.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_mapa_da_1.0.0_3.0_1682551046131.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_da_cased", "da")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_mapa", "da", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""Fra den 1. februar 2012 til den 31. januar 2014, og således også under den omtvistede periode, blev arbejdstagere hos Martimpex udsendt til Østrig for at udføre det samme arbejde."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
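Inside the pipeline, the `NerConverter` stage merges token-level IOB tags into the entity chunks shown under Results. A minimal standalone sketch of that conversion (illustrative only, not the annotator's actual code):

```python
def iob_to_chunks(tokens, tags):
    """Merge token-level IOB tags (B-X / I-X / O) into entity
    chunks -- the job the NerConverter stage performs in the
    pipeline above."""
    chunks, current_tokens, current_label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks

iob_to_chunks(
    ["1.", "februar", "2012", "blev", "Martimpex", "udsendt"],
    ["B-DATE", "I-DATE", "I-DATE", "O", "B-ORGANISATION", "O"],
)
# -> [("1. februar 2012", "DATE"), ("Martimpex", "ORGANISATION")]
```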
## Results ```bash +---------------+------------+ |chunk |ner_label | +---------------+------------+ |1. februar 2012|DATE | |31. januar 2014|DATE | |Martimpex |ORGANISATION| |Østrig |ADDRESS | +---------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_mapa| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|da| |Size:|1.4 MB| ## References The dataset is available [here](https://huggingface.co/datasets/joelito/mapa). ## Benchmarking ```bash label precision recall f1-score support ADDRESS 0.95 0.90 0.93 21 AMOUNT 1.00 1.00 1.00 4 DATE 0.98 0.98 0.98 54 ORGANISATION 0.74 0.74 0.74 31 PERSON 0.79 0.86 0.82 43 macro-avg 0.89 0.90 0.89 153 weighted-avg 0.87 0.89 0.88 153 ``` --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from Rgl73) author: John Snow Labs name: xlmroberta_ner_rgl73_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `Rgl73`. 
## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_rgl73_base_finetuned_panx_de_4.1.0_3.0_1660430161684.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_rgl73_base_finetuned_panx_de_4.1.0_3.0_1660430161684.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_rgl73_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_rgl73_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_rgl73_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Rgl73/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_2 TFWav2Vec2ForCTC from doddle124578 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab_2 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_2` is an English model originally trained by doddle124578. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_2_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_2_en_4.2.0_3.0_1664038115289.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_2_en_4.2.0_3.0_1664038115289.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_2', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_2", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_2| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|354.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering Base Uncased model (from negfir) author: John Snow Labs name: bert_qa_negfir_distilbert_base_uncased_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `negfir`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_negfir_distilbert_base_uncased_finetuned_squad_en_4.0.0_3.0_1657189360322.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_negfir_distilbert_base_uncased_finetuned_squad_en_4.0.0_3.0_1657189360322.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_negfir_distilbert_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_negfir_distilbert_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_negfir_distilbert_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|201.3 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/negfir/distilbert-base-uncased-finetuned-squad --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_roberta_hier_triplet_0.1_epochs_1_shard_1_squad2.0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_hier_triplet_0.1_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_hier_triplet_0.1_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739488564.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_hier_triplet_0.1_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739488564.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_roberta_hier_triplet_0.1_epochs_1_shard_1_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_rule_based_roberta_hier_triplet_0.1_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base_rule_based_hier_triplet_0.1_epochs_1_shard_1.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_roberta_hier_triplet_0.1_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|460.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_hier_triplet_0.1_epochs_1_shard_1_squad2.0 --- layout: model title: Legal Venues Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_venues_bert date: 2023-03-05 tags: [en, legal, classification, clauses, venues, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Venues` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Venues`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_venues_bert_en_1.0.0_3.0_1678050565118.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_venues_bert_en_1.0.0_3.0_1678050565118.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_venues_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
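As the description above notes, long documents should be split by paragraph before classification so each input stays under the 512-token embedding limit. The following is a minimal plain-Python sketch of that idea; the blank-line splitting rule and whitespace token counting are illustrative assumptions, not the exact logic of the workshop notebook.

```python
import re

def split_into_paragraphs(text, max_tokens=512):
    """Split a document on blank lines, then re-chunk any paragraph
    whose whitespace-token count exceeds the embedding limit."""
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    for p in paragraphs:
        tokens = p.split()
        if len(tokens) <= max_tokens:
            chunks.append(p)
        else:
            # Greedily break an oversized paragraph into fixed-size windows.
            for i in range(0, len(tokens), max_tokens):
                chunks.append(" ".join(tokens[i:i + max_tokens]))
    return chunks

doc = ("Venue. Any action shall be brought exclusively in the courts of Delaware.\n\n"
       "Severability. If any provision is held invalid, the remainder survives.")
print(split_into_paragraphs(doc))
```

Each resulting chunk can then be loaded as a separate row of the `text` column of the DataFrame fed to the classification pipeline shown above.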
## Results ```bash +--------+ |result | +--------+ |[Venues]| |[Other] | |[Other] | |[Venues]| +--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_venues_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References The train dataset is available [here](https://huggingface.co/datasets/lex_glue). ## Benchmarking ```bash label precision recall f1-score support Other 0.95 0.95 0.95 20 Venues 0.91 0.91 0.91 11 accuracy - - 0.94 31 macro-avg 0.93 0.93 0.93 31 weighted-avg 0.94 0.94 0.94 31 ``` --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_large_initialization_seed_0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-initialization-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_initialization_seed_0_en_4.0.0_3.0_1655737205059.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_initialization_seed_0_en_4.0.0_3.0_1655737205059.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_initialization_seed_0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_large_initialization_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.large_init_large_seed_0.by_anas-awadalla").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_large_initialization_seed_0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-large-initialization-seed-0 --- layout: model title: English asr_english_filipino_wav2vec2_l_xls_r_test_03 TFWav2Vec2ForCTC from Khalsuu author: John Snow Labs name: pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_03 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_english_filipino_wav2vec2_l_xls_r_test_03` is an English model originally trained by Khalsuu. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_03_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_03_en_4.2.0_3.0_1664108163355.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_03_en_4.2.0_3.0_1664108163355.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_03', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_03", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_03| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Translate Tetela to English Pipeline author: John Snow Labs name: translate_tll_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, tll, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `tll` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_tll_en_xx_2.7.0_2.4_1609690035934.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_tll_en_xx_2.7.0_2.4_1609690035934.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_tll_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_tll_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.tll.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_tll_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Hindi BertForQuestionAnswering model (from Yuchen) author: John Snow Labs name: bert_qa_muril_large_cased_hita_qa date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: hi edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `muril-large-cased-hita-qa` is a Hindi model originally trained by `Yuchen`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_muril_large_cased_hita_qa_hi_4.0.0_3.0_1654188711440.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_muril_large_cased_hita_qa_hi_4.0.0_3.0_1654188711440.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_muril_large_cased_hita_qa","hi") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_muril_large_cased_hita_qa","hi") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("hi.answer_question.bert.large_cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_muril_large_cased_hita_qa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|hi| |Size:|1.9 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Yuchen/muril-large-cased-hita-qa --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from imyday) author: John Snow Labs name: xlmroberta_ner_imyday_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `imyday`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_imyday_base_finetuned_panx_de_4.1.0_3.0_1660433961095.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_imyday_base_finetuned_panx_de_4.1.0_3.0_1660433961095.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_imyday_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_imyday_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_imyday_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/imyday/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: English image_classifier_vit_platzi__base_beans_omar_espejel ViTForImageClassification from platzi author: John Snow Labs name: image_classifier_vit_platzi__base_beans_omar_espejel date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_platzi__base_beans_omar_espejel` is an English model originally trained by platzi. ## Predicted Entities `angular_leaf_spot`, `bean_rust`, `healthy` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_platzi__base_beans_omar_espejel_en_4.1.0_3.0_1660168482283.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_platzi__base_beans_omar_espejel_en_4.1.0_3.0_1660168482283.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_platzi__base_beans_omar_espejel", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_platzi__base_beans_omar_espejel", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_platzi__base_beans_omar_espejel| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English BertForQuestionAnswering model (from phiyodr) author: John Snow Labs name: bert_qa_bert_large_finetuned_squad2 date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-finetuned-squad2` is an English model originally trained by `phiyodr`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_finetuned_squad2_en_4.0.0_3.0_1654536363807.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_finetuned_squad2_en_4.0.0_3.0_1654536363807.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_finetuned_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_large_finetuned_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.large.by_phiyodr").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_large_finetuned_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/phiyodr/bert-large-finetuned-squad2 - https://arxiv.org/abs/1810.04805 - https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/ - https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json - https://rajpurkar.github.io/SQuAD-explorer/ - https://arxiv.org/abs/1806.03822 --- layout: model title: English image_classifier_vit_airplanes ViTForImageClassification from johnnydevriese author: John Snow Labs name: image_classifier_vit_airplanes date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_airplanes` is an English model originally trained by johnnydevriese. ## Predicted Entities `drone`, `fighter`, `uav` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_airplanes_en_4.1.0_3.0_1660166809508.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_airplanes_en_4.1.0_3.0_1660166809508.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_airplanes", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_airplanes", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_airplanes| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English BertForQuestionAnswering model (from Tianle) author: John Snow Labs name: bert_qa_Tianle_bert_base_uncased_finetuned_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-finetuned-squad` is an English model originally trained by `Tianle`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Tianle_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654181063650.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Tianle_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654181063650.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Tianle_bert_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_Tianle_bert_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased.by_Tianle").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
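The NLU one-liner above packs the question and its context into a single string separated by `|||`. A minimal plain-Python sketch of that packing convention (the `pack`/`unpack` helpers are illustrative only, not part of the nlu API, which handles the splitting internally):

```python
# Pack a (question, context) pair into the "question|||context" form used by
# the nlu question-answering loaders, and split it back. Helper names are ours.
SEP = "|||"

def pack(question: str, context: str) -> str:
    """Join a question and its context with the '|||' separator."""
    return f"{question}{SEP}{context}"

def unpack(packed: str):
    """Split a packed string back into (question, context)."""
    question, _, context = packed.partition(SEP)
    return question, context

packed = pack("What's my name?", "My name is Clara and I live in Berkeley.")
print(packed)
```

`str.partition` splits only on the first separator, so any later text in the context is preserved intact.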
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_Tianle_bert_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Tianle/bert-base-uncased-finetuned-squad --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from gokulkarthik) author: John Snow Labs name: distilbert_qa_gokulkarthik_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `gokulkarthik`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_gokulkarthik_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770851366.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_gokulkarthik_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770851366.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_gokulkarthik_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_gokulkarthik_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_gokulkarthik_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/gokulkarthik/distilbert-base-uncased-finetuned-squad --- layout: model title: Sentence Entity Resolver for Snomed Concepts, CT version (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_snomed_findings date: 2021-06-15 tags: [en, licensed, entity_resolution, clinical] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to Snomed codes (CT version) using sentence embeddings. ## Predicted Entities Snomed Codes and their normalized definition with `sbiobert_base_cased_mli` embeddings. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_findings_en_3.1.0_3.0_1623770110268.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_findings_en_3.1.0_3.0_1623770110268.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The ```sbiobertresolve_snomed_findings``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_clinical``` as the NER model. There is no need to set ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") snomed_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_snomed_findings","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala ... 
val chunk2doc = new Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") .setCaseSensitive(false) val snomed_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_snomed_findings","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_resolver)) val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.snomed.findings").predict("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . 
Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""") ```
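In the results shown below, the resolver returns its candidate resolutions and codes as single strings joined with `:::` (see the `resolutions` and `codes` columns). A minimal plain-Python sketch of unpacking that format into ranked (code, resolution) pairs; apart from the top hypertension code taken from the table, the values are illustrative, not real resolver output:

```python
# Candidate codes and resolution texts arrive as ":::"-joined strings.
# "38341003"/"hypertension" comes from the results table; the rest of the
# values here are illustrative placeholders.
codes = "38341003:::155295004:::59621000"
resolutions = "hypertension:::hypertensive disorder:::essential hypertension"

# Pair each code with its resolution text, ranked best-first.
candidates = list(zip(codes.split(":::"), resolutions.split(":::")))
top_code, top_text = candidates[0]
print(top_code, top_text)
```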
## Results ```bash +--------------------+-----+---+---------+---------+----------+--------------------+--------------------+ | chunk|begin|end| entity| code|confidence| resolutions| codes| +--------------------+-----+---+---------+---------+----------+--------------------+--------------------+ | hypertension| 68| 79| PROBLEM| 38341003| 0.3234|hypertension:::hy...|38341003:::155295...| |chronic renal ins...| 83|109| PROBLEM|723190009| 0.7522|chronic renal ins...|723190009:::70904...| | COPD| 113|116| PROBLEM| 13645005| 0.1226|copd - chronic ob...|13645005:::155565...| | gastritis| 120|128| PROBLEM|235653009| 0.2444|gastritis:::gastr...|235653009:::45560...| | TIA| 136|138| PROBLEM|275382005| 0.0766|cerebral trauma (...|275382005:::44739...| |a non-ST elevatio...| 182|202| PROBLEM|233843008| 0.2224|silent myocardial...|233843008:::19479...| |Guaiac positive s...| 208|229| PROBLEM| 59614000| 0.9678|guaiac-positive s...|59614000:::703960...| |cardiac catheteri...| 295|317| TEST|301095005| 0.2584|cardiac finding::...|301095005:::25090...| | PTCA| 324|327|TREATMENT|373108000| 0.0809|post percutaneous...|373108000:::25103...| | mid LAD lesion| 332|345| PROBLEM|449567000| 0.0900|overriding left v...|449567000:::46140...| +--------------------+-----+---+---------+---------+----------+--------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_snomed_findings| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk, sbert_embeddings]| |Output Labels:|[snomed_code]| |Language:|en| |Case sensitive:|false| ## Data Source http://www.snomed.org/ --- layout: model title: English image_classifier_vit_pond_image_classification_3 ViTForImageClassification from SummerChiam author: John Snow Labs name: image_classifier_vit_pond_image_classification_3 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models 
edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_pond_image_classification_3` is an English model originally trained by SummerChiam. ## Predicted Entities `Normal`, `Boiling`, `Algae`, `NormalCement`, `NormalRain`, `BoilingNight`, `NormalNight` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_3_en_4.1.0_3.0_1660165996085.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_3_en_4.1.0_3.0_1660165996085.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_pond_image_classification_3", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_pond_image_classification_3", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_pond_image_classification_3| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: OCR base v2 optimized for handwritten text author: John Snow Labs name: ocr_base_handwritten_v2_opt date: 2023-01-17 tags: [en, licensed] task: OCR Text Detection & Recognition language: en nav_key: models edition: Visual NLP 4.2.4 spark_version: 3.2.1 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description OCR base handwritten model v2, optimized with ONNX, for recognizing handwritten text. It is based on a TrOCR model pretrained on handwritten datasets. The TrOCR model was proposed in TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to perform optical character recognition (OCR). The abstract from the paper is the following: Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. 
The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/ocr/RECOGNIZE_HANDWRITTEN/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/Cards/SparkOcrImageToTextHandwritten_V2_opt.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/ocr_base_handwritten_v2_opt_en_4.2.2_3.0_1670608549000.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/ocr_base_handwritten_v2_opt_en_4.2.2_3.0_1670608549000.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python binary_to_image = BinaryToImage() \ .setInputCol("content") \ .setOutputCol("image") \ .setImageType(ImageType.TYPE_3BYTE_BGR) text_detector = ImageTextDetectorV2 \ .pretrained("image_text_detector_v2", "en", "clinical/ocr") \ .setInputCol("image") \ .setOutputCol("text_regions") \ .setWithRefiner(True) \ .setSizeThreshold(-1) \ .setLinkThreshold(0.3) \ .setWidth(500) ocr = ImageToTextV2Opt.pretrained("ocr_base_handwritten_v2_opt", "en", "clinical/ocr") \ .setInputCols(["image", "text_regions"]) \ .setGroupImages(True) \ .setOutputCol("text") \ .setRegionsColumn("text_regions") draw_regions = ImageDrawRegions() \ .setInputCol("image") \ .setInputRegionsCol("text_regions") \ .setOutputCol("image_with_regions") \ .setRectColor(Color.green) \ .setRotated(True) pipeline = PipelineModel(stages=[ binary_to_image, text_detector, ocr, draw_regions ]) # Download image: # !wget -q https://github.com/JohnSnowLabs/spark-ocr-workshop/raw/4.0.0-release-candidate/jupyter/data/handwritten/handwritten_example.jpg imagePath = 'handwritten_example.jpg' image_df = spark.read.format("binaryFile").load(imagePath) result = pipeline.transform(image_df).cache() ``` ```scala val binary_to_image = new BinaryToImage() .setInputCol("content") .setOutputCol("image") .setImageType(ImageType.TYPE_3BYTE_BGR) val text_detector = ImageTextDetectorV2 .pretrained("image_text_detector_v2", "en", "clinical/ocr") .setInputCol("image") .setOutputCol("text_regions") .setWithRefiner(true) .setSizeThreshold(-1) .setLinkThreshold(0.3) .setWidth(500) val ocr = ImageToTextV2Opt .pretrained("ocr_base_handwritten_v2_opt", "en", "clinical/ocr") .setInputCols(Array("image", "text_regions")) .setGroupImages(true) .setOutputCol("text") .setRegionsColumn("text_regions") val draw_regions = new ImageDrawRegions() .setInputCol("image") .setInputRegionsCol("text_regions") .setOutputCol("image_with_regions") .setRectColor(Color.green) .setRotated(true) 
val pipeline = new Pipeline().setStages(Array( binary_to_image, text_detector, ocr, draw_regions)) // Download image: // wget -q https://github.com/JohnSnowLabs/spark-ocr-workshop/raw/4.0.0-release-candidate/jupyter/data/handwritten/handwritten_example.jpg val imagePath = "handwritten_example.jpg" val image_df = spark.read.format("binaryFile").load(imagePath) val result = pipeline.fit(image_df).transform(image_df).cache() ```
## Example {%- capture input_image -%} ![Screenshot](/assets/images/examples_ocr/image3.png) {%- endcapture -%} {%- capture output_image -%} ![Screenshot](/assets/images/examples_ocr/image3_out2.png) {%- endcapture -%} {% include templates/input_output_image.md input_image=input_image output_image=output_image %} ## Output text ```bash This is an example of handwritten beerxt Let's # check the performance ! I hope it will be awesome ``` ## Model Information {:.table-model} |---|---| |Model Name:|ocr_base_handwritten_v2_opt| |Type:|ocr| |Compatibility:|Visual NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| --- layout: model title: Extract clinical problems (Voice of the Patients) author: John Snow Labs name: ner_vop_problem_reduced_wip date: 2023-04-20 tags: [licensed, clinical, en, ner, vop, patient, problem] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts clinical problems from documents written in the patient’s own words. The taxonomy is reduced (one label for all clinical problems). Note: the ‘wip’ suffix indicates that model development is work-in-progress; the model will be finalised and its performance improved in upcoming releases. ## Predicted Entities `Modifier`, `HealthStatus`, `Problem` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_reduced_wip_en_4.4.0_3.0_1682012638074.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_reduced_wip_en_4.4.0_3.0_1682012638074.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_problem_reduced_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. 
After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_problem_reduced_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:---------------------|:------------| | pain | Problem | | fatigue | Problem | | rheumatoid arthritis | Problem | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_problem_reduced_wip| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.9 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I”m 20 year old girl. I”m diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I”m taking weekly supplement of vitamin D and 1000 mcg b12 daily. I”m taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I”m facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
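The precision, recall, and F1 figures in the benchmarking table that follows are derived from the tp/fp/fn counts in the usual way; a quick plain-Python sanity check against the `Problem` row (tp=6126, fp=983, fn=1015):

```python
# Recompute precision, recall, and F1 from raw true-positive, false-positive,
# and false-negative counts, as reported in the benchmarking table.
def prf(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Problem row of the table: tp=6126, fp=983, fn=1015.
p, r, f1 = prf(6126, 983, 1015)
print(round(p, 2), round(r, 2), round(f1, 2))  # → 0.86 0.86 0.86
```

The same function applied to the micro_avg counts (tp=6967, fp=1326, fn=1266) reproduces the 0.84 micro F1 in the table.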
## Benchmarking ```bash label tp fp fn total precision recall f1 Modifier 755 306 232 987 0.71 0.76 0.74 HealthStatus 86 37 19 105 0.70 0.82 0.75 Problem 6126 983 1015 7141 0.86 0.86 0.86 macro_avg 6967 1326 1266 8233 0.76 0.81 0.78 micro_avg 6967 1326 1266 8233 0.84 0.85 0.84 ``` --- layout: model title: Multilingual Representations for Indian Languages (MuRIL) author: John Snow Labs name: bert_muril date: 2021-08-29 tags: [embeddings, xx, open_source] task: Embeddings language: xx edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A BERT model pre-trained on 17 Indian languages, and their transliterated counterparts. This model uses a BERT base architecture [1] pretrained from scratch using the Wikipedia [2], Common Crawl [3], PMINDIA [4] and Dakshina [5] corpora for the following 17 Indian languages: `Assamese`, `Bengali` , `English` , `Gujarati` , `Hindi` , `Kannada` , `Kashmiri` , `Malayalam` , `Marathi` , `Nepali` , `Oriya` , `Punjabi` , `Sanskrit` , `Sindhi` , `Tamil` , `Telugu` , `Urdu` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_muril_xx_3.2.0_3.0_1630224119168.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_muril_xx_3.2.0_3.0_1630224119168.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_muril", "xx") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_muril", "xx") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("xx.embed.bert.muril").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_muril| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|xx| |Case sensitive:|false| ## Data Source [1]: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018. [2]: [Wikipedia](https://www.tensorflow.org/datasets/catalog/wikipedia) [3]: [Common Crawl](http://commoncrawl.org/the-data/) [4]: [PMINDIA](http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual/index.html) [5]: [Dakshina](https://github.com/google-research-datasets/dakshina) The model is imported from: https://tfhub.dev/google/MuRIL/1 --- layout: model title: English asr_wav2vec2_base_timit_demo_colab70 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab70 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab70` is an English model originally trained by hassnain. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab70_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab70_en_4.2.0_3.0_1664021706799.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab70_en_4.2.0_3.0_1664021706799.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab70", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab70", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab70| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: English asr_wav2vec2_base_timit_demo_colab3_by_hassnain TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab3_by_hassnain date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab3_by_hassnain` is an English model originally trained by hassnain. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab3_by_hassnain_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab3_by_hassnain_en_4.2.0_3.0_1664040313478.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab3_by_hassnain_en_4.2.0_3.0_1664040313478.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab3_by_hassnain', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab3_by_hassnain", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab3_by_hassnain| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal International Law Document Classifier (EURLEX) author: John Snow Labs name: legclf_international_law_bert date: 2023-03-06 tags: [en, legal, classification, clauses, international_law, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_international_law_bert model is a BERT Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class International_Law or not (binary classification) according to EuroVoc labels. ## Predicted Entities `International_Law`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_international_law_bert_en_1.0.0_3.0_1678111863969.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_international_law_bert_en_1.0.0_3.0_1678111863969.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_international_law_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[International_Law]| |[Other]| |[Other]| |[International_Law]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_international_law_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support International_Law 0.88 0.91 0.9 136 Other 0.91 0.88 0.9 141 accuracy - - 0.9 277 macro-avg 0.90 0.90 0.9 277 weighted-avg 0.90 0.90 0.9 277 ``` --- layout: model title: Recognize Entities OntoNotes pipeline - ELECTRA Large author: John Snow Labs name: onto_recognize_entities_electra_large date: 2021-03-23 tags: [open_source, english, onto_recognize_entities_electra_large, pipeline, en] supported: true task: [Named Entity Recognition, Lemmatization] language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The onto_recognize_entities_electra_large is a pretrained pipeline that can be used to process text: it performs basic processing steps and recognizes entities. 
It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_electra_large_en_3.0.0_3.0_1616478230579.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_electra_large_en_3.0.0_3.0_1616478230579.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('onto_recognize_entities_electra_large', lang = 'en') annotations = pipeline.fullAnnotate("Hello from John Snow Labs!")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("onto_recognize_entities_electra_large", lang = "en") val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs ! "] result_df = nlu.load('en.ner.onto.large').predict(text) result_df ```
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:-----------------------------|:-------------------------------------------|:-------------------| | 0 | ['Hello from John Snow Labs ! '] | ['Hello from John Snow Labs !'] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | [[-0.264069110155105,.,...]] | ['O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O'] | ['John Snow Labs'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_recognize_entities_electra_large| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| --- layout: model title: Spanish DistilBertForQuestionAnswering Base Uncased model (from CenIA) author: John Snow Labs name: distilbert_qa_distillbert_base_uncased_finetuned_s_c date: 2023-01-03 tags: [es, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distillbert-base-spanish-uncased-finetuned-qa-sqac` is a Spanish model originally trained by `CenIA`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_distillbert_base_uncased_finetuned_s_c_es_4.3.0_3.0_1672774780047.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_distillbert_base_uncased_finetuned_s_c_es_4.3.0_3.0_1672774780047.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distillbert_base_uncased_finetuned_s_c","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distillbert_base_uncased_finetuned_s_c","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_distillbert_base_uncased_finetuned_s_c| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|250.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/CenIA/distillbert-base-spanish-uncased-finetuned-qa-sqac --- layout: model title: Legal Letter Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_letter_agreement date: 2022-11-24 tags: [en, legal, classification, agreement, letter, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_letter_agreement` model is a Legal Longformer Document Classifier that classifies whether a document belongs to the class `letter-agreement` or not (binary classification). Longformers are limited to 4096 tokens, so only the first 4096 tokens of a document are taken into account. We have found that for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document itself without extra leading material, 4096 tokens are enough for document classification. If not, let us know and we can take another approach for you: splitting the document into 4096-token chunks, averaging their embeddings, and training on the averaged version, so that the whole document is taken into account. In theory, however, this should not be required. 
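The chunk-and-average strategy mentioned in the description can be sketched in plain Python/NumPy. Everything below is illustrative: `embed_chunk` is a placeholder standing in for a real Longformer encoder, and the names and dimensions are our assumptions, not part of the model.

```python
import numpy as np

CHUNK_SIZE = 4096  # Longformer's token limit, as described above

def embed_chunk(tokens, dim=768):
    # Placeholder encoder: a real implementation would run the Longformer
    # over this chunk and return its pooled embedding.
    rng = np.random.default_rng(len(tokens))
    return rng.standard_normal(dim)

def embed_long_document(tokens, dim=768):
    """Split a long token sequence into 4096-token chunks, embed each chunk,
    and average the chunk embeddings into one document-level vector."""
    chunks = [tokens[i:i + CHUNK_SIZE] for i in range(0, len(tokens), CHUNK_SIZE)]
    vectors = [embed_chunk(c, dim) for c in chunks]
    return np.mean(vectors, axis=0)

doc = ["tok"] * 10000          # 10,000 tokens -> 3 chunks (4096, 4096, 1808)
vec = embed_long_document(doc)
print(vec.shape)               # (768,)
```

The averaged vector can then feed the same downstream classifier as a single-chunk embedding, which is why the whole document is taken into account under this scheme.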
## Predicted Entities `letter-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_letter_agreement_en_1.0.0_3.0_1669292126317.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_letter_agreement_en_1.0.0_3.0_1669292126317.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_letter_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[letter-agreement]| |[other]| |[other]| |[letter-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_letter_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support letter-agreement 0.91 0.78 0.84 40 other 0.91 0.97 0.94 90 accuracy - - 0.91 130 macro-avg 0.91 0.87 0.89 130 weighted-avg 0.91 0.91 0.91 130 ``` --- layout: model title: Detect Anatomical Structures (Single Entity - biobert) author: John Snow Labs name: ner_anatomy_coarse_biobert_en date: 2020-11-04 task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 2.6.1 spark_version: 2.4 tags: [ner, en, licensed, clinical] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description An NER model to extract all types of anatomical references in text using "biobert_pubmed_base_cased" embeddings. It is a single entity model and generalizes all anatomical references to a single entity. 
## Predicted Entities `Anatomy` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ANATOMY/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_biobert_en_2.6.1_2.4_1604435983087.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_biobert_en_2.6.1_2.4_1604435983087.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPython.html %} ```python ... embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_anatomy_coarse_biobert", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, clinical_ner, ner_converter]) data = spark.createDataFrame([["content in the lung tissue"]]).toDF("text") model = nlp_pipeline.fit(data) results = model.transform(data) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased", "en") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_anatomy_coarse_biobert", "en", "clinical/models") .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner, ner_converter)) val data = Seq("content in the lung tissue").toDF("text") val result = pipeline.fit(data).transform(data) ```
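The NerConverter stage merges token-level IOB tags into full entity chunks. As a minimal, illustrative re-implementation of that merging logic (this is our own sketch, not the actual Spark NLP code):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, label) pairs,
    e.g. B-Anatomy/I-Anatomy over 'lung'/'tissue' -> ('lung tissue', 'Anatomy')."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                       # close any open chunk
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)             # continue the open chunk
        else:                                 # "O" tag or inconsistent I- tag
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["content", "in", "the", "lung", "tissue"]
tags = ["O", "O", "O", "B-Anatomy", "I-Anatomy"]
print(iob_to_chunks(tokens, tags))  # [('lung tissue', 'Anatomy')]
```

This reproduces the example in the results table below: the B-/I-Anatomy tags over "lung tissue" collapse into a single `Anatomy` chunk.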
{:.h2_title} ## Results The output is a dataframe with a sentence per row and a "ner" column containing all of the entity labels in the sentence, entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select "token.result" and "ner.result" from your output dataframe or add the "Finisher" to the end of your pipeline: ```bash | | ner_chunk | entity | |---:|:------------------|:----------| | 0 | lung tissue | Anatomy | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_anatomy_coarse_biobert| |Type:|NerDLModel| |Compatibility:|Spark NLP 2.6.1 +| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|[en]| |Case sensitive:|True| {:.h2_title} ## Data Source Trained on a custom dataset using 'biobert_pubmed_base_cased'. {:.h2_title} ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|--------------:|------:|------:|------:|---------:|---------:|---------:| | 0 | B-Anatomy | 2499 | 155 | 162 | 0.941598 | 0.939121 | 0.940357 | | 1 | I-Anatomy | 1695 | 116 | 158 | 0.935947 | 0.914733 | 0.925218 | | 2 | Macro-average | 4194 | 271 | 320 | 0.938772 | 0.926927 | 0.932812 | | 3 | Micro-average | 4194 | 271 | 320 | 0.939306 | 0.929109 | 0.93418 | ``` --- layout: model title: Word2Vec Embeddings in Bosnian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, bs, open_source] task: Embeddings language: bs edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_bs_3.4.1_3.0_1647287247610.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_bs_3.4.1_3.0_1647287247610.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","bs") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Volim iskre nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","bs") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Volim iskre nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("bs.embed.w2v_cc_300d").predict("""Volim iskre nlp""") ```
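Conceptually, a word-embeddings lookup annotator of this kind is a table mapping each known token to a fixed 300-dimensional vector. The toy sketch below illustrates the idea; the vocabulary, vectors, and zero-vector fallback for out-of-vocabulary tokens are assumptions for the example, not the exact Spark NLP internals.

```python
import numpy as np

DIM = 300  # dimension of the w2v_cc_300d vectors

# Invented two-word vocabulary standing in for the real lookup table.
rng = np.random.default_rng(0)
table = {w: rng.standard_normal(DIM) for w in ["volim", "nlp"]}

def lookup(token):
    # Unknown tokens fall back to a zero vector in this sketch.
    return table.get(token.lower(), np.zeros(DIM))

vectors = [lookup(t) for t in "Volim iskre nlp".split()]
print(len(vectors), vectors[0].shape)  # 3 (300,)
print(bool(vectors[1].any()))          # False -> "iskre" is out of vocabulary
```

In the real pipeline above, this per-token lookup is what populates the `embeddings` column that downstream annotators consume.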
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|bs| |Size:|657.9 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Detect medical risk factors (biobert) author: John Snow Labs name: ner_risk_factors_biobert date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract risk factors of patients using pretrained NER model. ## Predicted Entities `CAD`, `HYPERTENSION`, `SMOKER`, `OBESE`, `FAMILY_HIST`, `MEDICATION`, `PHI`, `HYPERLIPIDEMIA`, `DIABETES` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_RISK_FACTORS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_risk_factors_biobert_en_3.0.0_3.0_1617260853390.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_risk_factors_biobert_en_3.0.0_3.0_1617260853390.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_risk_factors_biobert", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_risk_factors_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("EXAMPLE_TEXT").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu 
nlu.load("en.med_ner.risk_factors.biobert").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_risk_factors_biobert| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| --- layout: model title: English DistilBertForQuestionAnswering Cased model (from NAOKITY) author: John Snow Labs name: distilbert_qa_naokity_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `NAOKITY`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_naokity_finetuned_squad_en_4.3.0_3.0_1672765777735.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_naokity_finetuned_squad_en_4.3.0_3.0_1672765777735.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_naokity_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_naokity_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_naokity_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/NAOKITY/bert-finetuned-squad --- layout: model title: Dutch NER Model author: John Snow Labs name: bert_token_classifier_dutch_udlassy_ner date: 2021-12-08 tags: [dutch, token_classifier, bert, ner, nl, open_source] task: Named Entity Recognition language: nl edition: Spark NLP 3.3.2 spark_version: 2.4 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model was imported from `Hugging Face` and has been fine-tuned on the Universal Dependencies Lassy dataset for the Dutch language, leveraging `Bert` embeddings and `BertForTokenClassification` for NER purposes. ## Predicted Entities `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_dutch_udlassy_ner_nl_3.3.2_2.4_1638958339134.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_dutch_udlassy_ner_nl_3.3.2_2.4_1638958339134.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_dutch_udlassy_ner", "nl")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = """Mijn naam is Peter Fergusson. Ik woon sinds oktober 2011 in New York en werk 5 jaar bij Tesla Motor.""" result = model.transform(spark.createDataFrame([[text]]).toDF("text")) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_dutch_udlassy_ner", "nl") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val example = Seq("Mijn naam is Peter Fergusson. Ik woon sinds oktober 2011 in New York en werk 5 jaar bij Tesla Motor.").toDS.toDF("text") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("nl.ner.bert").predict("""Mijn naam is Peter Fergusson. Ik woon sinds oktober 2011 in New York en werk 5 jaar bij Tesla Motor.""") ```
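The `NerConverter` stage above merges IOB-tagged tokens into the labeled chunks shown in the results. Conceptually it works like this simplified pure-Python sketch (for illustration only, not Spark NLP's actual implementation):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB tags (B-X begins a chunk, I-X continues it, O is outside)
    into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # "O" or a dangling I- tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Mijn", "naam", "is", "Peter", "Fergusson", "."]
tags = ["O", "O", "O", "B-PERSON", "I-PERSON", "O"]
print(iob_to_chunks(tokens, tags))  # [('Peter Fergusson', 'PERSON')]
```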
## Results ```bash +------------------------+---------+ |chunk |ner_label| +------------------------+---------+ |Peter Fergusson |PERSON | |oktober 2011 |DATE | |New York |GPE | |5 jaar |DATE | |Tesla Motor |ORG | +------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_dutch_udlassy_ner| |Compatibility:|Spark NLP 3.3.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|nl| |Case sensitive:|true| |Max sentence length:|256| ## Data Source [https://huggingface.co/wietsedv/bert-base-dutch-cased-finetuned-udlassy-ner](https://huggingface.co/wietsedv/bert-base-dutch-cased-finetuned-udlassy-ner) --- layout: model title: Danish asr_wav2vec2_base_nst TFWav2Vec2ForCTC from Alvenir author: John Snow Labs name: pipeline_asr_wav2vec2_base_nst date: 2022-09-25 tags: [wav2vec2, da, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: da edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_nst` is a Danish model originally trained by Alvenir. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_nst_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_nst_da_4.2.0_3.0_1664103870645.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_nst_da_4.2.0_3.0_1664103870645.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_nst', lang = 'da') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_nst", lang = "da") val annotations = pipeline.transform(audioDF) ```
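The snippets above assume an `audioDF` already exists. Spark NLP's `AudioAssembler` consumes a column of raw float samples, so one way to build that DataFrame is to decode a 16-bit PCM WAV file into floats in [-1, 1] first. A minimal stdlib sketch (the file path and the `spark.createDataFrame` call at the end are illustrative assumptions):

```python
import struct
import wave

def wav_to_floats(path):
    """Read a mono 16-bit PCM WAV file and normalize its samples to [-1.0, 1.0]."""
    with wave.open(path, "rb") as wav:
        frames = wav.readframes(wav.getnframes())
    # each sample is a little-endian signed 16-bit integer
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# floats = wav_to_floats("sample.wav")                          # hypothetical file
# audioDF = spark.createDataFrame([[floats]], ["audio_content"])  # column the pipeline reads
```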
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_nst| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|da| |Size:|354.3 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: MS-BERT base model (uncased) author: John Snow Labs name: ms_bluebert_base_uncased date: 2021-07-25 tags: [embeddings, bert, open_source, en, clinical] task: Embeddings language: en nav_key: models edition: Spark NLP 3.1.3 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model was trained by taking BlueBert as the base model and training on a dataset containing approximately 75,000 clinical notes for about 5,000 patients, totaling over 35.7 million words. These notes were collected from patients who visited St. Michael's Hospital MS Clinic between 2015 and 2019. The notes contained a variety of information pertaining to a neurological exam. For example, a note can contain information on the patient's condition, their progress over time, and diagnosis. BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with two objectives: Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. 
It allows the model to learn a bidirectional representation of the sentence. Next sentence prediction (NSP): the models concatenate two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict if the two sentences were following each other or not. This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard classifier using the features produced by the BERT model as inputs. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ms_bluebert_base_uncased_en_3.1.3_3.0_1627225948184.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ms_bluebert_base_uncased_en_3.1.3_3.0_1627225948184.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("ms_bluebert_base_uncased", "en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("ms_bluebert_base_uncased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ```
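The masked-language-modeling objective described above replaces a fraction of the input tokens with a `[MASK]` symbol before prediction. A toy illustration of just the masking step (simplified: real BERT masking also sometimes keeps or randomizes the chosen tokens):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace roughly mask_rate of the tokens with [MASK]; return the
    masked sequence and the positions the model must predict."""
    rng = random.Random(seed)
    n = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n))
    masked = ["[MASK]" if i in positions else t for i, t in enumerate(tokens)]
    return masked, positions

masked, positions = mask_tokens("the patient shows signs of progressive ms".split())
```

During pretraining the model is scored only on how well it recovers the tokens at the masked positions.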
## Results ```bash Generates 768 dimensional embeddings per token ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ms_bluebert_base_uncased| |Compatibility:|Spark NLP 3.1.3+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Case sensitive:|false| ## Data Source https://huggingface.co/NLP4H/ms_bert --- layout: model title: Stopwords Remover for German language (544 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, de, open_source] task: Stop Words Removal language: de edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_de_3.4.1_3.0_1646672969394.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_de_3.4.1_3.0_1646672969394.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","de") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Du bist nicht besser als ich"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","de") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Du bist nicht besser als ich").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.stopwords").predict("""Du bist nicht besser als ich""") ```
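Conceptually, `StopWordsCleaner` simply drops every token that appears in its language-specific list. A toy sketch with a handful of the German ISO entries (an assumed excerpt, not the full 544-word list); every token in the example sentence is a stopword, so the cleaned output is empty:

```python
# assumed excerpt of the German stopwords-iso list, for illustration only
GERMAN_STOPWORDS = {"du", "bist", "nicht", "besser", "als", "ich"}

def clean_tokens(tokens, stopwords=GERMAN_STOPWORDS, case_sensitive=False):
    """Keep only tokens that are not in the stopword list."""
    return [t for t in tokens
            if (t if case_sensitive else t.lower()) not in stopwords]

print(clean_tokens("Du bist nicht besser als ich".split()))  # []
```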
## Results ```bash +------+ |result| +------+ |[] | +------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|de| |Size:|2.8 KB| --- layout: model title: English DistilBertForQuestionAnswering model (from nlpunibo) Config3 author: John Snow Labs name: distilbert_qa_base_config3 date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert_base_config3` is an English model originally trained by `nlpunibo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_config3_en_4.0.0_3.0_1654727868959.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_config3_en_4.0.0_3.0_1654727868959.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_config3","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_config3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq("What is my name?", "My name is Clara and I live in Berkeley.").toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.base_config3.by_nlpunibo").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
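Extractive QA models like this one score every context token as a candidate answer start and answer end, then decode the best valid (start ≤ end) pair. A simplified decoding sketch with made-up scores (toy numbers, not the model's real logits):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return the (start, end) token pair maximizing start+end score with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            if s_score + end_scores[e] > best_score:
                best_score = s_score + end_scores[e]
                best = (s, e)
    return best

context = "My name is Clara and I live in Berkeley .".split()
start = [0.1, 0.2, 0.1, 5.0, 0.3, 0.1, 0.2, 0.1, 0.4, 0.1]  # hypothetical scores
end   = [0.1, 0.1, 0.2, 4.0, 0.2, 0.1, 0.1, 0.1, 0.3, 0.1]
s, e = best_span(start, end)
print(" ".join(context[s:e + 1]))  # Clara
```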
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_config3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/nlpunibo/distilbert_base_config3 --- layout: model title: English DistilBertForQuestionAnswering Cased model (from tabo) author: John Snow Labs name: distilbert_qa_checkpoint_500_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `checkpoint-500-finetuned-squad` is an English model originally trained by `tabo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_checkpoint_500_finetuned_squad_en_4.3.0_3.0_1672765928587.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_checkpoint_500_finetuned_squad_en_4.3.0_3.0_1672765928587.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_checkpoint_500_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_checkpoint_500_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_checkpoint_500_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/tabo/checkpoint-500-finetuned-squad --- layout: model title: Word Segmenter for Chinese author: John Snow Labs name: wordseg_weibo date: 2021-03-09 tags: [word_segmentation, open_source, chinese, wordseg_weibo, zh] task: Word Segmentation language: zh edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: WordSegmenterModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description [WordSegmenterModel-WSM](https://en.wikipedia.org/wiki/Text_segmentation) is based on a maximum entropy probability model that detects word boundaries in Chinese text. Chinese text is written without white space between words, so a computer-based application cannot know a priori which sequence of ideograms forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/chinese/word_segmentation/words_segmenter_demo.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_weibo_zh_3.0.0_3.0_1615292340879.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_weibo_zh_3.0.0_3.0_1615292340879.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python word_segmenter = WordSegmenterModel.pretrained("wordseg_weibo", "zh") \ .setInputCols(["document"]) \ .setOutputCol("token") pipeline = Pipeline(stages=[document_assembler, word_segmenter]) ws_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) example = spark.createDataFrame([['从John Snow Labs你好! ']], ["text"]) result = ws_model.transform(example) ``` ```scala val word_segmenter = WordSegmenterModel.pretrained("wordseg_weibo", "zh") .setInputCols(Array("document")) .setOutputCol("token") val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter)) val data = Seq("从John Snow Labs你好! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["从John Snow Labs你好! "] token_df = nlu.load('zh.segment_words.weibo').predict(text) token_df ```
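The model learns word boundaries from data; a much simpler baseline that illustrates the segmentation problem is greedy longest-match against a lexicon. A toy sketch (tiny made-up dictionary, for illustration only, not this model's maximum entropy approach):

```python
def greedy_segment(text, lexicon, max_word_len=4):
    """Left-to-right longest-match segmentation; characters not starting
    any known word fall back to single-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens

lexicon = {"你好", "谢谢"}  # toy lexicon
print(greedy_segment("你好谢谢你", lexicon))  # ['你好', '谢谢', '你']
```

Unlike this baseline, a statistical segmenter can resolve ambiguous boundaries from context rather than always preferring the longest match.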
## Results ```bash 0 从 1 JohnSnowLabs 2 你 3 好 4 ！ Name: token, dtype: object ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wordseg_weibo| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[words_segmented]| |Language:|zh| --- layout: model title: English asr_Fine_Tunning_on_CV_dataset TFWav2Vec2ForCTC from Sania67 author: John Snow Labs name: asr_Fine_Tunning_on_CV_dataset date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Fine_Tunning_on_CV_dataset` is an English model originally trained by Sania67. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_Fine_Tunning_on_CV_dataset_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Fine_Tunning_on_CV_dataset_en_4.2.0_3.0_1664118279387.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Fine_Tunning_on_CV_dataset_en_4.2.0_3.0_1664118279387.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_Fine_Tunning_on_CV_dataset", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_Fine_Tunning_on_CV_dataset", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
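Under the hood, a Wav2Vec2 CTC head emits one label per audio frame, and greedy CTC decoding turns that frame sequence into text by collapsing consecutive duplicates and dropping blank symbols. A minimal sketch of that collapse step (greedy decoding only, not beam search):

```python
from itertools import groupby

def ctc_collapse(frame_labels, blank="-"):
    """Greedy CTC decoding: merge runs of identical labels, then drop blanks."""
    deduped = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in deduped if label != blank)

print(ctc_collapse(list("hh-ell-llo")))  # hello
```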
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_Fine_Tunning_on_CV_dataset| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: DistilRoBERTa base model author: John Snow Labs name: distilroberta_base date: 2021-05-20 tags: [roberta, embeddings, en, english, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a distilled version of the [RoBERTa-base model](https://huggingface.co/roberta-base). It follows the same training procedure as [DistilBERT](https://huggingface.co/distilbert-base-uncased). The code for the distillation process can be found [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation). This model is case-sensitive: it makes a difference between english and English. The model has 6 layers, 768 dimensions, and 12 heads, totaling 82M parameters (compared to 125M parameters for RoBERTa-base). On average, DistilRoBERTa is twice as fast as RoBERTa-base. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilroberta_base_en_3.1.0_2.4_1621523016677.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilroberta_base_en_3.1.0_2.4_1621523016677.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = RoBertaEmbeddings.pretrained("distilroberta_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") ``` ```scala val embeddings = RoBertaEmbeddings.pretrained("distilroberta_base", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.distilroberta").predict("""Put your text here.""") ```
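The 768-dimensional token embeddings this model produces are typically compared downstream with cosine similarity. A minimal sketch (toy 3-d vectors stand in for real embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]), 3))  # 1.0
```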
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilroberta_base| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[embeddings]| |Language:|en| |Case sensitive:|true| ## Data Source [https://huggingface.co/distilroberta-base](https://huggingface.co/distilroberta-base) ## Benchmarking ```bash When fine-tuned on downstream tasks, this model achieves the following results: Glue test results: | Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | |:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:| | | 84.0 | 89.4 | 90.8 | 92.5 | 59.3 | 88.3 | 86.6 | 67.9 | ``` --- layout: model title: Extract Biomarkers and their Results author: John Snow Labs name: ner_oncology_biomarker date: 2022-11-24 tags: [licensed, clinical, en, ner, oncology, biomarker] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts mentions of biomarkers and biomarker results from oncology texts. Definitions of Predicted Entities: - `Biomarker`: Biological molecules that indicate the presence or absence of cancer, or the type of cancer (including oncogenes). - `Biomarker_Result`: Terms or values that are identified as the result of a biomarker. 
## Predicted Entities `Biomarker`, `Biomarker_Result` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_biomarker_en_4.2.2_3.0_1669299787628.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_biomarker_en_4.2.2_3.0_1669299787628.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_biomarker", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA), Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87%."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_biomarker", "en", "clinical/models") .setInputCols(Array("sentence", "token", 
"embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA), Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87%.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_biomarker").predict("""The results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA), Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87%.""") ```
## Results ```bash | chunk | ner_label | |:-----------------------------------------|:-----------------| | negative | Biomarker_Result | | CK7 | Biomarker | | synaptophysin | Biomarker | | Syn | Biomarker | | chromogranin A | Biomarker | | CgA | Biomarker | | Muc5AC | Biomarker | | human epidermal growth factor receptor-2 | Biomarker | | HER2 | Biomarker | | Muc6 | Biomarker | | positive | Biomarker_Result | | CK20 | Biomarker | | Muc1 | Biomarker | | Muc2 | Biomarker | | E-cadherin | Biomarker | | p53 | Biomarker | | Ki-67 index | Biomarker | | 87% | Biomarker_Result | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_biomarker| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.3 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Biomarker_Result 1030 148 415 1445 0.87 0.71 0.79 Biomarker 1685 272 279 1964 0.86 0.86 0.86 macro_avg 2715 420 694 3409 0.87 0.79 0.82 micro_avg 2715 420 694 3409 0.87 0.80 0.83 ``` --- layout: model title: English LongformerForQuestionAnswering model (from allenai) author: John Snow Labs name: longformer_qa_large_4096_finetuned_triviaqa date: 2022-06-26 tags: [en, open_source, longformer, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: LongformerForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `longformer-large-4096-finetuned-triviaqa` is an English model originally trained by `allenai`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/longformer_qa_large_4096_finetuned_triviaqa_en_4.0.0_3.0_1656255478666.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/longformer_qa_large_4096_finetuned_triviaqa_en_4.0.0_3.0_1656255478666.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_large_4096_finetuned_triviaqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_large_4096_finetuned_triviaqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq("What is my name?", "My name is Clara and I live in Berkeley.").toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.trivia.longformer.large").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|longformer_qa_large_4096_finetuned_triviaqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.6 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/allenai/longformer-large-4096-finetuned-triviaqa --- layout: model title: Legal Criticality Prediction Classifier (German) author: John Snow Labs name: legclf_critical_prediction_legal date: 2023-03-27 tags: [classification, de, licensed, legal, tensorflow] task: Text Classification language: de edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a binary classification model that identifies two criticality labels (`critical`, `non-critical`) in German court cases. ## Predicted Entities `critical`, `non-critical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_critical_prediction_legal_de_1.0.0_3.0_1679923757236.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_critical_prediction_legal_de_1.0.0_3.0_1679923757236.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") classifier = nlp.RoBertaForSequenceClassification.pretrained("legclf_critical_prediction_legal", "de", "legal/models") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") nlpPipeline = nlp.Pipeline( stages = [documentAssembler, tokenizer, classifier]) # Example text example = spark.createDataFrame([["erkennt der Präsident: 1. Auf die Beschwerde wird nicht eingetreten. 2. Es werden keine Gerichtskosten erhoben. 3. Dieses Urteil wird den Parteien, dem Sozialversicherungsgericht des Kantons Zürich und dem Staatssekretariat für Wirtschaft (SECO) schriftlich mitgeteilt. Luzern, 5. Dezember 2016 Im Namen der I. sozialrechtlichen Abteilung des Schweizerischen Bundesgerichts Der Präsident: Maillard Der Gerichtsschreiber: Grünvogel"]]).toDF("text") empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) result = model.transform(example) # result is a DataFrame result.select("text", "class.result").show() ```
## Results ```bash +----------------------------------------------------------------------------------------------------+--------------+ | text| result| +----------------------------------------------------------------------------------------------------+--------------+ |erkennt der Präsident: 1. Auf die Beschwerde wird nicht eingetreten. 2. Es werden keine Gerichtsk...|[non_critical]| +----------------------------------------------------------------------------------------------------+--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_critical_prediction_legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|de| |Size:|468.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References Train dataset available [here](https://huggingface.co/datasets/rcds/legal_criticality_prediction) ## Benchmarking ```bash label precision recall f1-score support critical 0.76 0.87 0.81 249 non_critical 0.87 0.76 0.81 293 accuracy - - 0.81 542 macro-avg 0.81 0.82 0.81 542 weighted-avg 0.82 0.81 0.81 542 ``` --- layout: model title: Japanese T5ForConditionalGeneration Cased model (from astremo) author: John Snow Labs name: t5_jainu date: 2023-01-30 tags: [ja, open_source, t5] task: Text Generation language: ja edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `JAINU` is a Japanese model originally trained by `astremo`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_jainu_ja_4.3.0_3.0_1675097938002.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_jainu_ja_4.3.0_3.0_1675097938002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_jainu","ja") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_jainu","ja") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_jainu| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ja| |Size:|923.5 MB| ## References - https://huggingface.co/astremo/JAINU - http://creativecommons.org/licenses/by/4.0/ --- layout: model title: Pipeline to Detect Temporal Relations for Clinical Events author: John Snow Labs name: re_temporal_events_clinical_pipeline date: 2022-03-31 tags: [licensed, clinical, relation_extraction, events, en] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [re_temporal_events_clinical](https://nlp.johnsnowlabs.com/2020/09/28/re_temporal_events_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_temporal_events_clinical_pipeline_en_3.4.1_3.0_1648733872152.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_temporal_events_clinical_pipeline_en_3.4.1_3.0_1648733872152.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("re_temporal_events_clinical_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 56-year-old right-handed female with longstanding intermittent right low back pain, who was involved in a motor vehicle accident in September of 2005. At that time, she did not notice any specific injury, but five days later, she started getting abnormal right low back pain.") ``` ```scala val pipeline = new PretrainedPipeline("re_temporal_events_clinical_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 56-year-old right-handed female with longstanding intermittent right low back pain, who was involved in a motor vehicle accident in September of 2005. At that time, she did not notice any specific injury, but five days later, she started getting abnormal right low back pain.") ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.temporal_event_clinical.pipeline").predict("""The patient is a 56-year-old right-handed female with longstanding intermittent right low back pain, who was involved in a motor vehicle accident in September of 2005. At that time, she did not notice any specific injury, but five days later, she started getting abnormal right low back pain.""") ```
## Results ```bash +----+------------+------------+-----------------+---------------+--------------------------+-----------+-----------------+---------------+---------------------+--------------+ | | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | +====+============+============+=================+===============+==========================+===========+=================+===============+=====================+==============+ | 0 | OVERLAP | OCCURRENCE | 121 | 144 | a motor vehicle accident | DATE | 149 | 165 | September of 2005 | 0.999975 | +----+------------+------------+-----------------+---------------+--------------------------+-----------+-----------------+---------------+---------------------+--------------+ | 1 | OVERLAP | DATE | 171 | 179 | that time | PROBLEM | 201 | 219 | any specific injury | 0.956654 | +----+------------+------------+-----------------+---------------+--------------------------+-----------+-----------------+---------------+---------------------+--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_temporal_events_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - PerceptronModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - DependencyParserModel - RelationExtractionModel --- layout: model title: Legal Access Clause Binary Classifier author: John Snow Labs name: legclf_access_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `access` clause type. 
To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip them unless you want to do binary classification at the sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings of this model allow up to 512 tokens; if your text is longer than that, consider splitting it into smaller pieces (see the same tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `access` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_access_clause_en_1.0.0_3.2_1660122082866.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_access_clause_en_1.0.0_3.2_1660122082866.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_access_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
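The paragraph-level pre-splitting recommended above can be sketched with plain Python. This is a minimal illustration only, not the workshop implementation; the blank-line heuristic and the function name are assumptions:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraph-level chunks on blank lines,
    dropping empty pieces. Each chunk can then be classified separately,
    which keeps inputs within the 512-token embedding limit more often
    than feeding the model the whole document."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
```

Each resulting chunk could then be placed in the `clause_text` column of the DataFrame fed to the pipeline above.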
## Results ```bash +-------+ | result| +-------+ |[access]| |[other]| |[other]| |[access]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_access_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scrapped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support access 0.93 0.91 0.92 58 other 0.97 0.97 0.97 149 accuracy - - 0.96 207 macro-avg 0.95 0.94 0.95 207 weighted-avg 0.96 0.96 0.96 207 ``` --- layout: model title: Pipeline for Adverse Drug Events author: John Snow Labs name: explain_clinical_doc_ade date: 2021-07-15 tags: [licensed, clinical, en, pipeline] task: [Named Entity Recognition, Text Classification, Relation Extraction, Pipeline Healthcare] language: en nav_key: models edition: Healthcare NLP 3.1.2 spark_version: 3.0 supported: true recommended: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pipeline for Adverse Drug Events (ADE) with `ner_ade_biobert`, `assertion_dl_biobert`, `classifierdl_ade_conversational_biobert`, and `re_ade_biobert` . It will classify the document, extract ADE and DRUG clinical entities, assign assertion status to ADE entities, and relate Drugs with their ADEs. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_ade_en_3.1.2_3.0_1626380200755.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_ade_en_3.1.2_3.0_1626380200755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('explain_clinical_doc_ade', 'en', 'clinical/models') res = pipeline.fullAnnotate("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""") ``` ```scala val era_pipeline = new PretrainedPipeline("explain_clinical_doc_ade", "en", "clinical/models") val result = era_pipeline.fullAnnotate("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""")(0) ``` {:.nlu-block} ```python import nlu nlu.load("en.explain_doc.clinical_ade").predict("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""") ```
## Results ```bash Class: True NER_Assertion: | | chunk | entity | assertion | |----|-------------------------|------------|-------------| | 0 | Lipitor | DRUG | - | | 1 | severe fatigue | ADE | Conditional | | 2 | voltaren | DRUG | - | | 3 | cramps | ADE | Conditional | Relations: | | chunk1 | entity1 | chunk2 | entity2 | relation | |----|-------------------------------|------------|-------------|---------|----------| | 0 | severe fatigue | ADE | Lipitor | DRUG | 1 | | 1 | cramps | ADE | Lipitor | DRUG | 0 | | 2 | severe fatigue | ADE | voltaren | DRUG | 0 | | 3 | cramps | ADE | voltaren | DRUG | 1 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_clinical_doc_ade| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.1.2+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - TokenizerModel - BertEmbeddings - SentenceEmbeddings - ClassifierDLModel - MedicalNerModel - NerConverterInternal - PerceptronModel - DependencyParserModel - RelationExtractionModel - NerConverterInternal - AssertionDLModel --- layout: model title: English asr_iwslt_asr_wav2vec_large_4500h TFWav2Vec2ForCTC from nguyenvulebinh author: John Snow Labs name: asr_iwslt_asr_wav2vec_large_4500h date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_iwslt_asr_wav2vec_large_4500h` is an English model originally trained by nguyenvulebinh. 
NOTE: This model only works on a CPU; if you need to run it on a GPU device, please use asr_iwslt_asr_wav2vec_large_4500h_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_iwslt_asr_wav2vec_large_4500h_en_4.2.0_3.0_1664042789844.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_iwslt_asr_wav2vec_large_4500h_en_4.2.0_3.0_1664042789844.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_iwslt_asr_wav2vec_large_4500h", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_iwslt_asr_wav2vec_large_4500h", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
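The snippets above assume an `audioDf` whose `audio_content` column holds float audio samples; how that DataFrame is built is not shown. As one hedged, standard-library-only sketch (assuming a 16-bit mono PCM WAV file; real projects typically use librosa or soundfile instead), the raw floats could be produced like this:

```python
import struct
import wave

def read_wav_floats(path: str) -> list:
    """Decode a 16-bit mono PCM WAV file into a list of floats in [-1, 1),
    the shape of data AudioAssembler expects in the 'audio_content' column.
    (Assumption: the file really is 16-bit mono PCM; no resampling is done.)"""
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]
```

A DataFrame for the pipeline could then be created along the lines of `audioDf = spark.createDataFrame([[read_wav_floats("file.wav")]]).toDF("audio_content")`.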
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_iwslt_asr_wav2vec_large_4500h| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: English RobertaForQuestionAnswering (from Modfiededition) author: John Snow Labs name: roberta_qa_roberta_fine_tuned_tweet_sentiment_extractor date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-fine-tuned-tweet-sentiment-extractor` is an English model originally trained by `Modfiededition`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_fine_tuned_tweet_sentiment_extractor_en_4.0.0_3.0_1655735888818.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_fine_tuned_tweet_sentiment_extractor_en_4.0.0_3.0_1655735888818.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_fine_tuned_tweet_sentiment_extractor","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_fine_tuned_tweet_sentiment_extractor","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.trivia.roberta").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_fine_tuned_tweet_sentiment_extractor| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|441.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Modfiededition/roberta-fine-tuned-tweet-sentiment-extractor --- layout: model title: Clinical Deidentification author: John Snow Labs name: clinical_deidentification date: 2022-01-08 tags: [deidentification, licensed, pipeline, de] task: Pipeline Healthcare language: de edition: Healthcare NLP 3.4.0 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to deidentify PHI information from **German** medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `STREET`, `USERNAME`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE`, `CONTACT`, `ID`, `ZIP`, `ACCOUNT`, `SSN`, `DLN`, `PLATE` entities. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_de_3.4.0_2.4_1641636618956.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_de_3.4.0_2.4_1641636618956.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "de", "clinical/models") sample = """Zusammenfassung : Michael Berger wird am Morgen des 12 Dezember 2018 ins St.Elisabeth Krankenhaus eingeliefert. Herr Michael Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: T0110053F Platte A-BC124 Kontonummer: DE89370400440532013000 SSN : 13110587M565 Lizenznummer: B072RRE2I55 Adresse : St.Johann-Straße 13 19300 """ result = deid_pipeline.annotate(sample) print("\n".join(result['masked'])) print("\n".join(result['masked_with_chars'])) print("\n".join(result['masked_fixed_length_chars'])) print("\n".join(result['obfuscated'])) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = PretrainedPipeline("clinical_deidentification","de","clinical/models") val sample = "Zusammenfassung : Michael Berger wird am Morgen des 12 Dezember 2018 ins St.Elisabeth Krankenhaus eingeliefert. Herr Michael Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: T0110053F Platte A-BC124 Kontonummer: DE89370400440532013000 SSN : 13110587M565 Lizenznummer: B072RRE2I55 Adresse : St.Johann-Straße 13 19300" val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("de.deid.clinical").predict("""Zusammenfassung : Michael Berger wird am Morgen des 12 Dezember 2018 ins St.Elisabeth Krankenhaus eingeliefert. Herr Michael Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: T0110053F Platte A-BC124 Kontonummer: DE89370400440532013000 SSN : 13110587M565 Lizenznummer: B072RRE2I55 Adresse : St.Johann-Straße 13 19300 """) ```
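For intuition about the masking policies this pipeline produces (entity labels, same-length characters, fixed-length characters), here is a toy, self-contained sketch. It is not the Spark NLP `DeIdentificationModel`; the function name and the `(begin, end, label)` span format are assumptions for illustration only:

```python
def mask_phi(text: str, spans, policy: str = "label") -> str:
    """Replace each PHI span with (a) its entity label, (b) a same-length
    bracketed run of asterisks, or (c) a fixed-length '****' placeholder.
    spans: iterable of (begin, end, label) character offsets (assumed format)."""
    out, last = [], 0
    for begin, end, label in sorted(spans):
        out.append(text[last:begin])
        chunk = text[begin:end]
        if policy == "label":
            out.append("<%s>" % label)
        elif policy == "same_length_chars":
            # brackets plus asterisks keep the original chunk length
            out.append("[" + "*" * max(len(chunk) - 2, 0) + "]")
        else:  # "fixed_length_chars"
            out.append("****")
        last = end
    out.append(text[last:])
    return "".join(out)
```

Obfuscation (replacing each span with a realistic fake value of the same entity type, as in the pipeline's fourth output) would follow the same span-walking structure with a lookup of surrogate values.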
## Results ```bash Masked with entity labels ------------------------------ Zusammenfassung : wird am Morgen des ins eingeliefert. Herr ist Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: Platte Kontonummer: SSN : Lizenznummer: Adresse : Masked with chars ------------------------------ Zusammenfassung : [************] wird am Morgen des [**************] ins [**********************] eingeliefert. Herr [************] ist ** Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: [*******] Platte [*****] Kontonummer: [********************] SSN : [**********] Lizenznummer: [*********] Adresse : [*****************] [***] Masked with fixed length chars ------------------------------ Zusammenfassung : **** wird am Morgen des **** ins **** eingeliefert. Herr **** ist **** Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: **** Platte **** Kontonummer: **** SSN : **** Lizenznummer: **** Adresse : **** **** Obfuscated ------------------------------ Zusammenfassung : Herrmann Kallert wird am Morgen des 11-26-1977 ins International Neuroscience eingeliefert. Herr Herrmann Kallert ist 79 Jahre alt und hat zu viel Wasser in den Beinen. 
Persönliche Daten : ID-Nummer: 136704D357 Platte QA348G Kontonummer: 192837465738 SSN : 1310011981M454 Lizenznummer: XX123456 Adresse : Klingelhöferring 31206 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.0+| |License:|Licensed| |Edition:|Official| |Language:|de| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ContextualParserModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: English T5ForConditionalGeneration Large Cased model (from google) author: John Snow Labs name: t5_efficient_large_dm512 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-large-dm512` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_large_dm512_en_4.3.0_3.0_1675115410843.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_large_dm512_en_4.3.0_3.0_1675115410843.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_large_dm512","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_large_dm512","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_large_dm512| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|739.8 MB| ## References - https://huggingface.co/google/t5-efficient-large-dm512 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Relation Extraction between anatomical entities and other clinical entities (ReDL) author: John Snow Labs name: redl_oncology_location_biobert_wip date: 2022-09-29 tags: [licensed, clinical, oncology, en, relation_extraction, anatomy] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This relation extraction model links extractions from anatomical entities (such as Site_Breast or Site_Lung) to other clinical entities (such as Tumor_Finding or Cancer_Surgery). 
## Predicted Entities `is_location_of`, `O` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_oncology_location_biobert_wip_en_4.1.0_3.0_1664454650547.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_oncology_location_biobert_wip_en_4.1.0_3.0_1664454650547.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use relation pairs to include only the combinations of entities that are relevant in your case.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") re_ner_chunk_filter = RENerChunksFilter()\ .setInputCols(["ner_chunk", "dependencies"])\ .setOutputCol("re_ner_chunk")\ .setMaxSyntacticDistance(10)\ .setRelationPairs(["Tumor_Finding-Site_Breast", "Site_Breast-Tumor_Finding", "Tumor_Finding-Anatomical_Site", "Anatomical_Site-Tumor_Finding"]) re_model = RelationExtractionDLModel.pretrained("redl_oncology_location_biobert_wip", "en", "clinical/models")\ .setInputCols(["re_ner_chunk", "sentence"])\ .setOutputCol("relation_extraction") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model]) data = spark.createDataFrame([["In April 2011, she first noticed a lump in her right breast."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val 
document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentence", "pos_tags", "token")) .setOutputCol("dependencies") val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunk", "dependencies")) .setOutputCol("re_ner_chunk") .setMaxSyntacticDistance(10) .setRelationPairs(Array("Tumor_Finding-Site_Breast", "Site_Breast-Tumor_Finding", "Tumor_Finding-Anatomical_Site", "Anatomical_Site-Tumor_Finding")) val re_model = RelationExtractionDLModel.pretrained("redl_oncology_location_biobert_wip", "en", "clinical/models") .setPredictionThreshold(0.5f) .setInputCols(Array("re_ner_chunk", "sentence")) .setOutputCol("relation_extraction") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("""In April 2011, she first noticed a lump in her right breast.""").toDS.toDF("text") val result = 
pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.oncology_location_biobert_wip").predict("""In April 2011, she first noticed a lump in her right breast.""") ```
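Each extracted relation carries a confidence score (see the Results section). If you only want high-confidence relations downstream, you can filter on that score; a minimal sketch in plain Python, outside Spark, where the dict shape merely mirrors the metadata the relation extraction output carries (field names here are illustrative, not the annotator's exact schema):

```python
# Keep only relations whose confidence clears a threshold.
# The dicts below imitate rows collected from the `relation_extraction`
# column; their field names are illustrative.
def filter_relations(relations, min_confidence=0.9):
    return [r for r in relations if float(r["confidence"]) >= min_confidence]

relations = [
    {"chunk1": "lump", "chunk2": "breast", "relation": "is_location_of", "confidence": "0.9628376"},
    {"chunk1": "lump", "chunk2": "breast", "relation": "O", "confidence": "0.41"},
]
kept = filter_relations(relations)
```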
## Results ```bash |chunk1 | entity1 | chunk2 | entity2 | relation | confidence| |-------|--------------- |--------|-------------|----------------|-----------| | lump | Tumor_Finding | breast | Site_Breast | is_location_of | 0.9628376| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_oncology_location_biobert_wip| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|405.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label recall precision f1 O 0.90 0.94 0.92 is_location_of 0.94 0.90 0.92 macro-avg 0.92 0.92 0.92 ``` --- layout: model title: Legal Financial statements Clause Binary Classifier author: John Snow Labs name: legclf_financial_statements_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `financial-statements` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `financial-statements` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_financial_statements_clause_en_1.0.0_3.2_1660123541235.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_financial_statements_clause_en_1.0.0_3.2_1660123541235.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_financial_statements_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
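As the description recommends, long legal documents are best pre-split into paragraph-sized candidates before classification. A minimal sketch of the "paragraph splitting (by multiline)" technique in plain Python (outside Spark; the resulting pieces would populate the `clause_text` column above):

```python
import re

# Split a document on blank lines (one or more empty/whitespace-only lines),
# dropping empty pieces. Each piece is a candidate clause for the classifier.
def split_paragraphs(text):
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Clause 1. The financial statements...\n\nClause 2. Termination...\n\nClause 3. Governing law..."
paragraphs = split_paragraphs(doc)
```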
## Results ```bash +-------+ | result| +-------+ |[financial-statements]| |[other]| |[other]| |[financial-statements]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_financial_statements_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support financial-statements 1.00 1.00 1.00 23 other 1.00 1.00 1.00 40 accuracy - - 1.00 63 macro-avg 1.00 1.00 1.00 63 weighted-avg 1.00 1.00 1.00 63 ``` --- layout: model title: Legal Good reason Clause Binary Classifier author: John Snow Labs name: legclf_good_reason_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `good-reason` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `good-reason` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_good_reason_clause_en_1.0.0_3.2_1660123563997.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_good_reason_clause_en_1.0.0_3.2_1660123563997.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_good_reason_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
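Since the sentence embeddings used here accept at most 512 tokens, inputs longer than that can be chunked into fixed-size windows before classification. A hedged sketch in plain Python, using whitespace tokens as a rough proxy for the model's own tokenization:

```python
# Chunk a long text into windows of at most `max_tokens` whitespace tokens.
# Each chunk would then be classified independently through the pipeline.
def chunk_by_tokens(text, max_tokens=512):
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]
```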
## Results ```bash +-------+ | result| +-------+ |[good-reason]| |[other]| |[other]| |[good-reason]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_good_reason_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support good-reason 0.97 0.92 0.94 36 other 0.96 0.99 0.98 82 accuracy - - 0.97 118 macro-avg 0.97 0.95 0.96 118 weighted-avg 0.97 0.97 0.97 118 ``` --- layout: model title: OCR small for handwritten text author: John Snow Labs name: ocr_small_handwritten date: 2023-01-10 tags: [en, licensed] task: OCR Text Detection & Recognition language: en nav_key: models edition: Visual NLP 3.3.3 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description OCR small model for recognizing handwritten text, based on the TrOCR architecture. The TrOCR model was proposed in TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to perform optical character recognition (OCR). The abstract from the paper is the following: Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. 
In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/ocr/RECOGNIZE_HANDWRITTEN/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/Cards/SparkOcrImageToTextHandwritten.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/ocr_small_handwritten_en_3.3.3_2.4_1645080334390.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/ocr_small_handwritten_en_3.3.3_2.4_1645080334390.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python binary_to_image = BinaryToImage() \ .setInputCol("content") \ .setOutputCol("image") \ .setImageType(ImageType.TYPE_3BYTE_BGR) text_detector = ImageTextDetectorV2 \ .pretrained("image_text_detector_v2", "en", "clinical/ocr") \ .setInputCol("image") \ .setOutputCol("text_regions") \ .setWithRefiner(True) \ .setSizeThreshold(-1) \ .setLinkThreshold(0.3) \ .setWidth(500) # Try "ocr_base_handwritten" for better quality ocr = ImageToTextV2.pretrained("ocr_small_handwritten", "en", "clinical/ocr") \ .setInputCols(["image", "text_regions"]) \ .setGroupImages(True) \ .setOutputCol("text") draw_regions = ImageDrawRegions() \ .setInputCol("image") \ .setInputRegionsCol("text_regions") \ .setOutputCol("image_with_regions") \ .setRectColor(Color.green) \ .setRotated(True) pipeline = PipelineModel(stages=[ binary_to_image, text_detector, ocr, draw_regions ]) # Download image: # !wget -q https://github.com/JohnSnowLabs/spark-ocr-workshop/raw/4.0.0-release-candidate/jupyter/data/handwritten/handwritten_example.jpg imagePath = 'handwritten_example.jpg' image_df = spark.read.format("binaryFile").load(imagePath) result = pipeline.transform(image_df).cache() ``` ```scala val binary_to_image = new BinaryToImage() .setInputCol("content") .setOutputCol("image") .setImageType(ImageType.TYPE_3BYTE_BGR) val text_detector = ImageTextDetectorV2 .pretrained("image_text_detector_v2", "en", "clinical/ocr") .setInputCol("image") .setOutputCol("text_regions") .setWithRefiner(true) .setSizeThreshold(-1) .setLinkThreshold(0.3) .setWidth(500) // Try "ocr_base_handwritten" for better quality val ocr = ImageToTextV2 .pretrained("ocr_small_handwritten", "en", "clinical/ocr") .setInputCols(Array("image", "text_regions")) .setGroupImages(true) .setOutputCol("text") val draw_regions = new ImageDrawRegions() .setInputCol("image") .setInputRegionsCol("text_regions") .setOutputCol("image_with_regions") .setRectColor(Color.green) 
.setRotated(true) val pipeline = new Pipeline().setStages(Array( binary_to_image, text_detector, ocr, draw_regions)) // Download image first: // wget -q https://github.com/JohnSnowLabs/spark-ocr-workshop/raw/4.0.0-release-candidate/jupyter/data/handwritten/handwritten_example.jpg val imagePath = "handwritten_example.jpg" val image_df = spark.read.format("binaryFile").load(imagePath) val result = pipeline.fit(image_df).transform(image_df).cache() ```
## Example {%- capture input_image -%} ![Screenshot](/assets/images/examples_ocr/image3.png) {%- endcapture -%} {%- capture output_image -%} ![Screenshot](/assets/images/examples_ocr/image3_out2.png) {%- endcapture -%} {% include templates/input_output_image.md input_image=input_image output_image=output_image %} ## Output text ```bash This is an example of handwritten beerxt Let's # check the performance? I hope it will be awesome ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ocr_small_handwritten| |Type:|ocr| |Compatibility:|Visual NLP 3.3.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|146.7 MB| --- layout: model title: English asr_wav2vec2_base_100h_with_lm_by_patrickvonplaten TFWav2Vec2ForCTC from patrickvonplaten author: John Snow Labs name: pipeline_asr_wav2vec2_base_100h_with_lm_by_patrickvonplaten date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_with_lm_by_patrickvonplaten` is an English pipeline originally trained by patrickvonplaten. 
NOTE: This pipeline only works on a CPU. If you need to use it on a GPU device, please use pipeline_asr_wav2vec2_base_100h_with_lm_by_patrickvonplaten_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_with_lm_by_patrickvonplaten_en_4.2.0_3.0_1664096379955.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_100h_with_lm_by_patrickvonplaten_en_4.2.0_3.0_1664096379955.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_100h_with_lm_by_patrickvonplaten', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_100h_with_lm_by_patrickvonplaten", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_100h_with_lm_by_patrickvonplaten| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|227.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Stopwords Remover for Tatar language (1006 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, tt, open_source] task: Stop Words Removal language: tt edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_tt_3.4.1_3.0_1646673084805.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_tt_3.4.1_3.0_1646673084805.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","tt") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Tanışuıbızğa şatmım"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","tt") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Tanışuıbızğa şatmım").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("tt.stopwords").predict("""Tanışuıbızğa şatmım""") ```
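Conceptually, the StopWordsCleaner drops any token found in its stopword list. A simplified, case-insensitive sketch in plain Python (the stopword entries below are illustrative placeholders, not taken from the actual Tatar list); in the example sentence above, neither token is a stopword, so both survive, matching the Results section:

```python
# Drop tokens that appear in the stopword set (case-insensitive variant;
# the real annotator's case handling is configurable).
def remove_stopwords(tokens, stopwords):
    sw = {s.lower() for s in stopwords}
    return [t for t in tokens if t.lower() not in sw]

# Placeholder stopword list for illustration only.
cleaned = remove_stopwords(["Tanışuıbızğa", "şatmım"], ["bu", "bez"])
```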
## Results ```bash +----------------------+ |result | +----------------------+ |[Tanışuıbızğa, şatmım]| +----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|tt| |Size:|5.1 KB| --- layout: model title: French CamemBert Embeddings (from dianeshan) author: John Snow Labs name: camembert_embeddings_dianeshan_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `dianeshan`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_dianeshan_generic_model_fr_3.4.4_3.0_1653987940563.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_dianeshan_generic_model_fr_3.4.4_3.0_1653987940563.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_dianeshan_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_dianeshan_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
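The `embeddings` column produced above holds one float vector per token. A common downstream use is comparing two such vectors with cosine similarity; a minimal sketch in plain Python (the 4-dimensional vectors are toy values, real CamemBERT embeddings are 768-dimensional):

```python
import math

# Cosine similarity between two embedding vectors of equal length.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy vectors standing in for token embeddings.
sim = cosine([0.1, 0.3, 0.2, 0.4], [0.1, 0.3, 0.2, 0.4])
```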
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_dianeshan_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/dianeshan/dummy-model --- layout: model title: English XLMRobertaForTokenClassification Base Cased model (from asahi417) author: John Snow Labs name: xlmroberta_ner_tner_base_ontonotes5 date: 2022-08-13 tags: [en, open_source, xlm_roberta, ner] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tner-xlm-roberta-base-ontonotes5` is a English model originally trained by `asahi417`. ## Predicted Entities `language`, `product`, `percent`, `time`, `quantity`, `ordinal number`, `law`, `cardinal number`, `facility`, `event`, `geopolitical area`, `organization`, `group`, `money`, `work of art`, `person`, `location`, `date` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_base_ontonotes5_en_4.1.0_3.0_1660423534602.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_base_ontonotes5_en_4.1.0_3.0_1660423534602.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_base_ontonotes5","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_base_ontonotes5","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
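Conceptually, the NerConverter stage merges token-level B-/I- tags into whole entity chunks. A simplified sketch of that merge in plain Python (the real annotator also tracks character offsets and metadata; the example tokens and tags below are illustrative):

```python
# Merge BIO-tagged tokens into (chunk_text, label) pairs.
def bio_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            current.append(token)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

chunks = bio_to_chunks(
    ["John", "Smith", "visited", "Paris"],
    ["B-person", "I-person", "O", "B-location"],
)
```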
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_tner_base_ontonotes5| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|798.6 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/asahi417/tner-xlm-roberta-base-ontonotes5 - https://github.com/asahi417/tner --- layout: model title: English image_classifier_vit_computer_stuff ViTForImageClassification from iyzg author: John Snow Labs name: image_classifier_vit_computer_stuff date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_computer_stuff` is a English model originally trained by iyzg. ## Predicted Entities `desktop`, `monitor`, `keyboard`, `computer mouse`, `laptop` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_computer_stuff_en_4.1.0_3.0_1660169901739.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_computer_stuff_en_4.1.0_3.0_1660169901739.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_computer_stuff", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_computer_stuff", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
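Under the hood, the classification head produces one logit per label; applying softmax yields class probabilities and argmax selects the predicted class written to the `class` column. A minimal sketch in plain Python (the logit values are toy numbers):

```python
import math

LABELS = ["desktop", "monitor", "keyboard", "computer mouse", "laptop"]

# Numerically stable softmax followed by argmax over the label set.
def predict(logits, labels=LABELS):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return labels[probs.index(max(probs))], probs

label, probs = predict([0.1, 2.5, 0.3, 0.2, 0.4])
```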
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_computer_stuff| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Fast Neural Machine Translation Model from English to Estonian author: John Snow Labs name: opus_mt_en_et date: 2020-12-29 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, et, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `et` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_et_xx_2.7.0_2.4_1609254350294.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_et_xx_2.7.0_2.4_1609254350294.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_et", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_et", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.et').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_et| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal NER for Indian Court Documents author: John Snow Labs name: legner_indian_court_preamble date: 2022-10-25 tags: [en, licensed, legal, ner] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an NER model trained on Indian court dataset, aimed to extract the following entities from preamble documents. ## Predicted Entities `COURT`, `PETITIONER`, `RESPONDENT`, `JUDGE`, `LAWYER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_indian_court_preamble_en_1.0.0_3.0_1666702718567.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_indian_court_preamble_en_1.0.0_3.0_1666702718567.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ .setCleanupMode("shrink") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_base_cased", "en")\ .setInputCols("sentence", "token")\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_indian_court_preamble", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""In The High Court Of Judicature At Madras Dated: 31/05/2006 The Hon'Ble Mr. Justice V. Dhanapalan C.M.A.No.535 of 1998 1. Sahabudeen ... Claimant/Appellant Vs 1. R. Selvaraj, 2. The New India Assurance Co.Ltd., ... Respondents Appeal filed under Section 173 of the Motor Vehicles Act to set aside the judgment and decree dated 25.03.97 passed in Mcop No.5/95 on the file of the I Additional District Judge-cum-Chief Judicial Magistrate, Coimbatore and pass the award of Rs.3,50,000/- instead of Rs.1,00 ,000/- towards the compensation to the petitioner. For Petitioner : Mr. K.Sudarsanam for M/s. Surithi Associates For Respondents: Mr.
Mohd.Fiary Hussain for R1"""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") .setCleanupMode("shrink") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_base_cased", "en") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") .setMaxSentenceLength(512) .setCaseSensitive(true) val ner_model = NerModel.pretrained("legner_indian_court_preamble", "en", "legal/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""In The High Court Of Judicature At Madras Dated: 31/05/2006 The Hon'Ble Mr. Justice V. Dhanapalan C.M.A.No.535 of 1998 1. Sahabudeen ... Claimant/Appellant Vs 1. R. Selvaraj, 2. The New India Assurance Co.Ltd., ... Respondents Appeal filed under Section 173 of the Motor Vehicles Act to set aside the judgment and decree dated 25.03.97 passed in Mcop No.5/95 on the file of the I Additional District Judge-cum-Chief Judicial Magistrate, Coimbatore and pass the award of Rs.3,50,000/- instead of Rs.1,00 ,000/- towards the compensation to the petitioner. For Petitioner : Mr. K.Sudarsanam for M/s. Surithi Associates For Respondents: Mr. Mohd.Fiary Hussain for R1""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
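The `NerConverter` stage at the end of the pipeline merges the token-level IOB tags produced by the NER model into full entity chunks. The grouping logic can be sketched in plain Python (a simplified illustration of the idea, not Spark NLP's actual implementation):

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (chunk_text, label) pairs (simplified)."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            # "O" tag (or stray "I-") closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Justice", "V.", "Dhanapalan", "presided", "."]
tags = ["O", "B-JUDGE", "I-JUDGE", "O", "O"]
print(iob_to_chunks(tokens, tags))  # [('V. Dhanapalan', 'JUDGE')]
```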
## Results ```bash +----------------------------------+----------+ |chunk |label | +----------------------------------+----------+ |High Court Of Judicature At Madras|COURT | |V. Dhanapalan |JUDGE | |Sahabudeen |PETITIONER| |Selvaraj |RESPONDENT| |New India Assurance |RESPONDENT| |K.Sudarsanam |LAWYER | |Mohd.Fiary Hussain |LAWYER | +----------------------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_indian_court_preamble| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.4 MB| ## References Training data is available [here](https://github.com/Legal-NLP-EkStep/legal_NER#3-data). ## Benchmarking ```bash label precision recall f1-score support COURT 0.92 0.91 0.91 109 JUDGE 0.96 0.92 0.94 168 LAWYER 0.94 0.93 0.94 377 PETITIONER 0.76 0.77 0.76 269 RESPONDENT 0.78 0.80 0.79 356 micro-avg 0.86 0.86 0.86 1279 macro-avg 0.87 0.86 0.87 1279 weighted-avg 0.86 0.86 0.86 1279 ``` --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_8 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-512-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_8_en_4.0.0_3.0_1655733176675.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_8_en_4.0.0_3.0_1655733176675.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_8","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_8","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_512d_seed_8").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
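In the NLU one-liner, the question and its context are packed into a single string separated by `|||`. A tiny helper (illustrative only, not part of the NLU API) shows how such a string splits back into its two parts:

```python
def split_qa(packed: str):
    """Split an NLU-style 'question|||context' string into its two parts."""
    question, sep, context = packed.partition("|||")
    if not sep:
        raise ValueError("expected a string of the form 'question|||context'")
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
```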
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_8| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|432.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-512-finetuned-squad-seed-8 --- layout: model title: Word2Vec Embeddings in Amharic (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, am, open_source] task: Embeddings language: am edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_am_3.4.1_3.0_1647282427046.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_am_3.4.1_3.0_1647282427046.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","am") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["ስካርቻ nlp እወዳለሁ"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","am") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("ስካርቻ nlp እወዳለሁ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("am.embed.w2v_cc_300d").predict("""ስካርቻ nlp እወዳለሁ""") ```
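The `embeddings` column holds one 300-dimensional vector per token, and relatedness between tokens is typically scored with cosine similarity over those vectors. A minimal sketch of that computation (toy 3-d vectors stand in for the real 300-d embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Identical vectors score 1.0; orthogonal vectors score 0.0.
print(cosine([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```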
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|am| |Size:|172.1 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English BertForQuestionAnswering model (from mrm8488) author: John Snow Labs name: bert_qa_bert_tiny_finetuned_squadv2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-tiny-finetuned-squadv2` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_tiny_finetuned_squadv2_en_4.0.0_3.0_1654185015067.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_tiny_finetuned_squadv2_en_4.0.0_3.0_1654185015067.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_tiny_finetuned_squadv2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_tiny_finetuned_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.tiny_.by_mrm8488").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_tiny_finetuned_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|16.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/bert-tiny-finetuned-squadv2 - https://twitter.com/mrm8488 - https://github.com/google-research - https://arxiv.org/abs/1908.08962 - https://rajpurkar.github.io/SQuAD-explorer/ - https://github.com/google-research/bert/ - https://www.linkedin.com/in/manuel-romero-cs/ --- layout: model title: ICD10PCS Entity Resolver author: John Snow Labs name: chunkresolve_icd10pcs_clinical date: 2021-04-02 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Entity Resolution model based on KNN using Word Embeddings + Word Movers Distance. ## Predicted Entities ICD10-PCS Codes and their normalized definition with `clinical_embeddings`. {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10pcs_clinical_en_3.0.0_3.0_1617355415038.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10pcs_clinical_en_3.0.0_3.0_1617355415038.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... model = ChunkEntityResolverModel.pretrained("chunkresolve_icd10pcs_clinical","en","clinical/models")\ .setInputCols("token","chunk_embeddings")\ .setOutputCol("entity") pipeline_icd10pcs = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, ner, chunk_embeddings, model]) data = ["""He has a starvation ketosis but nothing found for significant for dry oral mucosa"""] pipeline_model = pipeline_icd10pcs.fit(spark.createDataFrame([[""]]).toDF("text")) light_pipeline = LightPipeline(pipeline_model) result = light_pipeline.annotate(data) ``` ```scala ... val model = ChunkEntityResolverModel.pretrained("chunkresolve_icd10pcs_clinical","en","clinical/models") .setInputCols("token","chunk_embeddings") .setOutputCol("entity") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, ner, chunk_embeddings, model)) val data = Seq("He has a starvation ketosis but nothing found for significant for dry oral mucosa").toDF("text") val result = pipeline.fit(data).transform(data) ```
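As the Results below show, the resolver returns the top candidate code in `code` and the ranked candidate definitions joined with `:::` in `resolutions`. Splitting that string back into a ranked list is straightforward (an illustrative helper, not part of the Spark NLP API):

```python
def parse_resolutions(resolutions: str, top_k: int = None):
    """Split a ':::'-joined resolution string into a ranked candidate list."""
    candidates = [c.strip() for c in resolutions.split(":::") if c.strip()]
    return candidates[:top_k] if top_k else candidates

ranked = parse_resolutions("Hyperthermia, Multiple:::Narcosynthesis:::Hyperalimentation")
print(ranked[0])  # Hyperthermia, Multiple
```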
## Results ```bash | | chunks | begin | end | code | resolutions | |---|----------------------|-------|-----|---------|--------------------------------------------------| | 0 | a starvation ketosis | 7 | 26 | 6A3Z1ZZ | Hyperthermia, Multiple:::Narcosynthesis:::Hype...| | 1 | dry oral mucosa | 66 | 80 | 8E0ZXY4 | Yoga Therapy:::Release Cecum, Open Approach:::...| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|chunkresolve_icd10pcs_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token, chunk_embeddings]| |Output Labels:|[icd10pcs]| |Language:|en| --- layout: model title: Pipeline to Extract Entities in Covid Trials author: John Snow Labs name: ner_covid_trials_pipeline date: 2023-03-09 tags: [ner, en, clinical, licensed, covid] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_covid_trials](https://nlp.johnsnowlabs.com/2022/10/19/ner_covid_trials_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_covid_trials_pipeline_en_4.3.0_3.2_1678355313181.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_covid_trials_pipeline_en_4.3.0_3.2_1678355313181.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_covid_trials_pipeline", "en", "clinical/models") text = '''In December 2019 , a group of patients with the acute respiratory disease was detected in Wuhan , Hubei Province of China . A month later , a new beta-coronavirus was identified as the cause of the 2019 coronavirus infection . SARS-CoV-2 is a coronavirus that belongs to the group of β-coronaviruses of the subgenus Coronaviridae . The SARS-CoV-2 is the third known zoonotic coronavirus disease after severe acute respiratory syndrome ( SARS ) and Middle Eastern respiratory syndrome ( MERS ). The diagnosis of SARS-CoV-2 recommended by the WHO , CDC is the collection of a sample from the upper respiratory tract ( nasal and oropharyngeal exudate ) or from the lower respiratory tractsuch as expectoration of endotracheal aspirate and bronchioloalveolar lavage and its analysis using the test of real-time polymerase chain reaction ( qRT-PCR ).In 2020, the first COVID‑19 vaccine was developed and made available to the public through emergency authorizations and conditional approvals.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_covid_trials_pipeline", "en", "clinical/models") val text = "In December 2019 , a group of patients with the acute respiratory disease was detected in Wuhan , Hubei Province of China . A month later , a new beta-coronavirus was identified as the cause of the 2019 coronavirus infection . SARS-CoV-2 is a coronavirus that belongs to the group of β-coronaviruses of the subgenus Coronaviridae . The SARS-CoV-2 is the third known zoonotic coronavirus disease after severe acute respiratory syndrome ( SARS ) and Middle Eastern respiratory syndrome ( MERS ). 
The diagnosis of SARS-CoV-2 recommended by the WHO , CDC is the collection of a sample from the upper respiratory tract ( nasal and oropharyngeal exudate ) or from the lower respiratory tractsuch as expectoration of endotracheal aspirate and bronchioloalveolar lavage and its analysis using the test of real-time polymerase chain reaction ( qRT-PCR ).In 2020, the first COVID‑19 vaccine was developed and made available to the public through emergency authorizations and conditional approvals." val result = pipeline.fullAnnotate(text) ```
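In the Results below, `begin` and `end` are character offsets into the input text, with `end` inclusive, so each chunk can be recovered as `text[begin:end + 1]`. A quick check against the first sentence of the example text:

```python
text = ("In December 2019 , a group of patients with the acute respiratory "
        "disease was detected in Wuhan , Hubei Province of China .")

def chunk_at(text, begin, end):
    """Recover a chunk from inclusive character offsets (Spark NLP convention)."""
    return text[begin:end + 1]

print(chunk_at(text, 3, 15))   # December 2019
print(chunk_at(text, 48, 72))  # acute respiratory disease
```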
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:------------------------------------|--------:|------:|:--------------------------|-------------:| | 0 | December 2019 | 3 | 15 | Date | 0.99655 | | 1 | acute respiratory disease | 48 | 72 | Disease_Syndrome_Disorder | 0.8597 | | 2 | beta-coronavirus | 146 | 161 | Virus | 0.6381 | | 3 | 2019 | 198 | 201 | Date | 0.8117 | | 4 | coronavirus infection | 203 | 223 | Disease_Syndrome_Disorder | 0.68335 | | 5 | SARS-CoV-2 | 227 | 236 | Virus | 0.9605 | | 6 | coronavirus | 243 | 253 | Virus | 0.9814 | | 7 | β-coronaviruses | 284 | 298 | Virus | 0.9564 | | 8 | subgenus Coronaviridae | 307 | 328 | Virus | 0.71465 | | 9 | SARS-CoV-2 | 336 | 345 | Virus | 0.9442 | | 10 | zoonotic coronavirus disease | 366 | 393 | Disease_Syndrome_Disorder | 0.922833 | | 11 | severe acute respiratory syndrome | 401 | 433 | Disease_Syndrome_Disorder | 0.959725 | | 12 | SARS | 437 | 440 | Disease_Syndrome_Disorder | 0.9959 | | 13 | Middle Eastern respiratory syndrome | 448 | 482 | Disease_Syndrome_Disorder | 0.9673 | | 14 | MERS | 486 | 489 | Disease_Syndrome_Disorder | 0.9759 | | 15 | SARS-CoV-2 | 511 | 520 | Virus | 0.9027 | | 16 | WHO | 541 | 543 | Institution | 0.9917 | | 17 | CDC | 547 | 549 | Institution | 0.8296 | | 18 | 2020 | 848 | 851 | Date | 0.9997 | | 19 | COVID‑19 vaccine | 864 | 879 | Vaccine_Name | 0.87505 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_covid_trials_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Detect Normalized Genes and Human Phenotypes author: John Snow Labs name: ner_human_phenotype_go_clinical date: 2020-09-21 task: Named Entity Recognition language: en nav_key: models edition: 
Healthcare NLP 2.6.0 spark_version: 2.4 tags: [ner, en, licensed, clinical] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model can be used to detect normalized mentions of genes (GO) and human phenotypes (HP) in medical text. ## Predicted Entities `GO`, `HP` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_HUMAN_PHENOTYPE_GO_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_HUMAN_PHENOTYPE_GO_CLINICAL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_clinical_en_2.5.5_2.4_1598558253840.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_clinical_en_2.5.5_2.4_1598558253840.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_human_phenotype_go_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate("Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.") ``` ```scala ... val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_human_phenotype_go_clinical", "en", "clinical/models") .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.human_phenotype.go_clinical").predict("""Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.""") ```
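The Benchmarking table further below reports raw tp/fp/fn counts alongside precision, recall, and F1; the latter follow directly from the counts. A quick sanity check for the B-GO row (tp=1530, fp=129, fn=57):

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    return precision, recall, f1

p, r, f = prf1(tp=1530, fp=129, fn=57)
print(round(p, 6), round(r, 6), round(f, 6))  # 0.922242 0.964083 0.942699
```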
{:.h2_title} ## Results ```bash +----+--------------------------+---------+-------+----------+ | | chunk | begin | end | entity | +====+==========================+=========+=======+==========+ | 0 | tumor | 39 | 43 | HP | +----+--------------------------+---------+-------+----------+ | 1 | tricarboxylic acid cycle | 79 | 102 | GO | +----+--------------------------+---------+-------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_human_phenotype_go_clinical| |Type:|ner| |Compatibility:|Healthcare NLP 2.6.0 +| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|[en]| |Case sensitive:|false| | Dependencies: | embeddings_clinical | ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|--------------:|------:|-----:|-----:|---------:|---------:|---------:| | 0 | B-GO | 1530 | 129 | 57 | 0.922242 | 0.964083 | 0.942699 | | 1 | B-HP | 950 | 133 | 130 | 0.877193 | 0.87963 | 0.87841 | | 2 | I-HP | 253 | 46 | 68 | 0.846154 | 0.788162 | 0.816129 | | 3 | I-GO | 4550 | 344 | 154 | 0.92971 | 0.967262 | 0.948114 | | 4 | Macro-average | 7283 | 652 | 409 | 0.893825 | 0.899784 | 0.896795 | | 5 | Micro-average | 7283 | 652 | 409 | 0.917832 | 0.946828 | 0.932105 | ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from Teepika) author: John Snow Labs name: roberta_qa_base_squad2_finetuned_sel date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`roberta-base-squad2-finetuned-selqa` is an English model originally trained by `Teepika`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_finetuned_sel_en_4.3.0_3.0_1674219133811.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_finetuned_sel_en_4.3.0_3.0_1674219133811.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2_finetuned_sel","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2_finetuned_sel","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_squad2_finetuned_sel| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Teepika/roberta-base-squad2-finetuned-selqa --- layout: model title: Hindi XLMRoBerta Embeddings (from neuralspace-reverie) author: John Snow Labs name: xlmroberta_embeddings_indic_transformers_hi_xlmroberta date: 2022-05-13 tags: [hi, open_source, xlm_roberta, embeddings] task: Embeddings language: hi edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: XlmRoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `indic-transformers-hi-xlmroberta` is a Hindi model originally trained by `neuralspace-reverie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_indic_transformers_hi_xlmroberta_hi_3.4.4_3.0_1652439884277.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_indic_transformers_hi_xlmroberta_hi_3.4.4_3.0_1652439884277.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_indic_transformers_hi_xlmroberta","hi") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["मुझे स्पार्क एनएलपी बहुत पसंद है"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_indic_transformers_hi_xlmroberta","hi") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("मुझे स्पार्क एनएलपी बहुत पसंद है").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_embeddings_indic_transformers_hi_xlmroberta| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|hi| |Size:|505.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-hi-xlmroberta - https://oscar-corpus.com/ --- layout: model title: English BertForQuestionAnswering Base Uncased model (from ksabeh) author: John Snow Labs name: bert_qa_base_uncased_attribute_correction date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-attribute-correction` is an English model originally trained by `ksabeh`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_attribute_correction_en_4.0.0_3.0_1657183728517.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_attribute_correction_en_4.0.0_3.0_1657183728517.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_attribute_correction","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_attribute_correction","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_attribute_correction| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/ksabeh/bert-base-uncased-attribute-correction --- layout: model title: English BertForQuestionAnswering model (from sunitha) author: John Snow Labs name: bert_qa_Trial_3_Results date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Trial_3_Results` is an English model originally trained by `sunitha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Trial_3_Results_en_4.0.0_3.0_1654179084301.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Trial_3_Results_en_4.0.0_3.0_1654179084301.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Trial_3_Results","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_Trial_3_Results","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.trial.bert.by_sunitha").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
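The `nlu.load(...).predict(...)` snippets in these cards pack the question and context into a single string separated by `|||`. A tiny helper (hypothetical, for illustration only) makes that convention explicit:

```python
def to_nlu_qa_input(question: str, context: str) -> str:
    """Join a question and its context with the '|||' separator used by
    the nlu question-answering predict() snippets in this card."""
    return f"{question}|||{context}"

print(to_nlu_qa_input("What's my name?", "My name is Clara and I live in Berkeley."))
# -> What's my name?|||My name is Clara and I live in Berkeley.
```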
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_Trial_3_Results| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/sunitha/Trial_3_Results --- layout: model title: Tagalog RobertaForMaskedLM Large Cased model (from jcblaise) author: John Snow Labs name: roberta_embeddings_tagalog_large date: 2022-12-12 tags: [tl, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: tl edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-tagalog-large` is a Tagalog model originally trained by `jcblaise`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_tagalog_large_tl_4.2.4_3.0_1670859755115.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_tagalog_large_tl_4.2.4_3.0_1670859755115.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_tagalog_large","tl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_tagalog_large","tl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_tagalog_large| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|tl| |Size:|1.3 GB| |Case sensitive:|true| ## References - https://huggingface.co/jcblaise/roberta-tagalog-large - https://blaisecruz.com --- layout: model title: English ElectraForQuestionAnswering model (from usami) author: John Snow Labs name: electra_qa_base_discriminator_finetuned_squad date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-discriminator-finetuned-squad` is an English model originally trained by `usami`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_base_discriminator_finetuned_squad_en_4.0.0_3.0_1655920469404.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_base_discriminator_finetuned_squad_en_4.0.0_3.0_1655920469404.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_discriminator_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_discriminator_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.electra.base").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_base_discriminator_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|408.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/usami/electra-base-discriminator-finetuned-squad --- layout: model title: Translate Finno-Ugrian languages to English Pipeline author: John Snow Labs name: translate_fiu_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, fiu, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `fiu` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_fiu_en_xx_2.7.0_2.4_1609691048288.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_fiu_en_xx_2.7.0_2.4_1609691048288.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_fiu_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_fiu_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.fiu.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_fiu_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_42 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-16-finetuned-squad-seed-42` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_42_en_4.0.0_3.0_1655731778875.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_42_en_4.0.0_3.0_1655731778875.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_42","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_42","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_seed_42").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_42| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|424.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-16-finetuned-squad-seed-42 --- layout: model title: Self Report Age Classifier (BioBERT) author: John Snow Labs name: bert_sequence_classifier_self_reported_age_tweet date: 2022-07-26 tags: [licensed, clinical, en, classifier, sequence_classification, age, public_health] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [BioBERT-based](https://github.com/dmis-lab/biobert) classifier that can detect whether an exact age is self-reported in social media data. ## Predicted Entities `self_report_age`, `no_report` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_AGE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4SC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_self_reported_age_tweet_en_4.0.0_3.0_1658852070357.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_self_reported_age_tweet_en_4.0.0_3.0_1658852070357.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_self_reported_age_tweet", "en", "clinical/models")\ .setInputCols(["document", "token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame(["Who knew I would spend my Saturday mornings at 21 still watching Disney channel", "My girl, Fancy, just turned 17. She’s getting up there, but she still has the energy of a puppy"], StringType()).toDF("text") result = pipeline.fit(data).transform(data) # Checking results result.select("text", "class.result").show(truncate=False) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_self_reported_age_tweet", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq("Who knew I would spend my Saturday mornings at 21 still watching Disney channel", "My girl, Fancy, just turned 17. She’s getting up there, but she still has the energy of a puppy").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.self_reported_age").predict("""My girl, Fancy, just turned 17. She’s getting up there, but she still has the energy of a puppy""") ```
## Results ```bash +-----------------------------------------------------------------------------------------------+-----------------+ |text |result | +-----------------------------------------------------------------------------------------------+-----------------+ |Who knew I would spend my Saturday mornings at 21 still watching Disney channel |[self_report_age]| |My girl, Fancy, just turned 17. She’s getting up there, but she still has the energy of a puppy|[no_report] | +-----------------------------------------------------------------------------------------------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_self_reported_age_tweet| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## Benchmarking ```bash label precision recall f1-score support no_report 0.939016 0.900332 0.919267 1505 self_report_age 0.801849 0.873381 0.836088 695 accuracy - - 0.891818 2200 macro-avg 0.870433 0.886857 0.877678 2200 weighted-avg 0.895684 0.891818 0.892990 2200 ``` --- layout: model title: Translate English to Indo-Iranian languages Pipeline author: John Snow Labs name: translate_en_iir date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, iir, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `iir` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_iir_xx_2.7.0_2.4_1609690996492.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_iir_xx_2.7.0_2.4_1609690996492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_iir", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_iir", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.iir').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_iir| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Control Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_control_agreement_bert date: 2022-11-24 tags: [en, legal, classification, agreement, control, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_control_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `control-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `control-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_control_agreement_bert_en_1.0.0_3.0_1669310200226.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_control_agreement_bert_en_1.0.0_3.0_1669310200226.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_control_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
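The F1 scores reported in the Benchmarking section below are the harmonic mean of precision and recall. A quick pure-Python check against the published per-class numbers:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Per-class precision/recall reported for legclf_control_agreement_bert
print(round(f1(0.96, 0.76), 2))  # control-agreement -> 0.85
print(round(f1(0.92, 0.99), 2))  # other -> 0.95
```

Both values match the f1-score column of the Benchmarking table.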
## Results ```bash +-------------------+ |result | +-------------------+ |[control-agreement]| |[other] | |[other] | |[control-agreement]| +-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_control_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support control-agreement 0.96 0.76 0.85 29 other 0.92 0.99 0.95 82 accuracy - - 0.93 111 macro-avg 0.94 0.87 0.90 111 weighted-avg 0.93 0.93 0.93 111 ``` --- layout: model title: Legal NER (Parties, Dates, Alias, Former names, Document Type - lg) author: John Snow Labs name: legner_contract_doc_parties_lg date: 2023-01-21 tags: [document, contract, agreement, type, parties, aliases, former, names, effective, dates, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true recommended: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description IMPORTANT: Don't run this model on the whole legal agreement. Instead: - Split by paragraphs.
You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration; - Use the `legclf_introduction_clause` Text Classifier to select only these paragraphs; This is a Legal NER Model, aimed at processing the first page of agreements, where information can be found about: - Parties of the contract/agreement; - Their former names; - Aliases of those parties, or how those parties will be called further on in the document; - Document Type; - Effective Date of the agreement; - Other organizations; This model can be used together with its Relation Extraction model, `legre_contract_doc_parties`, to retrieve the relations between these entities. ## Predicted Entities `PARTY`, `EFFDATE`, `DOC`, `ALIAS`, `ORG`, `FORMER_NAME` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/LEGALNER_PARTIES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_contract_doc_parties_lg_en_1.0.0_3.0_1674321394808.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_contract_doc_parties_lg_en_1.0.0_3.0_1674321394808.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained('legner_contract_doc_parties_lg', 'en', 'legal/models')\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = [""" INTELLECTUAL PROPERTY AGREEMENT This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties"). """] res = model.transform(spark.createDataFrame([text]).toDF("text")) ```
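The transformed DataFrame's `ner_chunk` column pairs each extracted chunk with its NER label, as shown in the Results table below. A framework-free sketch of that pairing (the values are copied from the example output above, not produced by running the model here):

```python
# Illustrative chunk/label pairs; real values come from the `ner_chunk`
# annotations (result text plus `entity` metadata) of the transformed DataFrame.
chunks = ["INTELLECTUAL PROPERTY AGREEMENT", "December 31, 2018", "Armstrong Flooring, Inc"]
labels = ["DOC", "EFFDATE", "PARTY"]

rows = list(zip(chunks, labels))
for chunk, label in rows:
    print(f"{chunk} -> {label}")
```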
## Results ```bash +-----------------------------------+-----------+ |chunk |ner_label | +-----------------------------------+-----------+ |INTELLECTUAL PROPERTY AGREEMENT |DOC | |December 31, 2018 |EFFDATE | |Armstrong Flooring, Inc |PARTY | |Seller |ALIAS | |AFI Licensing LLC |PARTY | |Licensing |ALIAS | |Seller |PARTY | |Arizona |ALIAS | |AHF Holding, Inc. |ORG | |Tarzan HoldCo, Inc |FORMER_NAME| |Buyer |ALIAS | |Armstrong Hardwood Flooring Company|PARTY | |Company |ALIAS | |Buyer |PARTY | |Buyer Entities |ALIAS | |Arizona |PARTY | |Buyer Entities |PARTY | |Party |ALIAS | |Parties |ALIAS | +-----------------------------------+-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_contract_doc_parties_lg| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References Manual annotations on CUAD dataset ## Benchmarking ```bash label precision recall f1-score support B-ALIAS 0.95 0.95 0.95 193 B-DOC 0.87 0.85 0.86 118 I-DOC 0.92 0.83 0.87 245 B-PARTY 0.83 0.79 0.81 246 I-PARTY 0.90 0.88 0.89 630 B-ORG 0.91 0.84 0.87 207 I-ORG 0.93 0.87 0.90 355 I-ALIAS 0.77 0.83 0.80 29 B-EFFDATE 0.91 0.91 0.91 81 I-EFFDATE 0.95 0.97 0.96 261 B-FORMER_NAME 0.97 1.00 0.99 39 I-FORMER_NAME 0.99 1.00 0.99 93 micro-avg 0.91 0.88 0.90 2497 macro-avg 0.91 0.89 0.90 2497 weighted-avg 0.91 0.88 0.90 2497 ``` --- layout: model title: Sentence Entity Resolver for Billable ICD10-CM HCC Codes author: John Snow Labs name: sbiobertresolve_icd10cm_augmented_billable_hcc date: 2023-05-31 tags: [licensed, en, clinical, hcc, icd10cm, entity_resolution] task: Entity Resolution language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD-10-CM 
codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings, and it supports 7-digit codes with Hierarchical Condition Categories (HCC) status. It has been updated by dropping the invalid codes that existed in previous versions. In the result, look for the `all_k_aux_labels` parameter in the metadata to get the HCC status. The HCC status string can be split into three fields: `billable status`, `hcc status`, and `hcc score`. For example, if the result is `1||1||8`, the billable status is 1, the HCC status is 1, and the HCC score is 8. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_billable_hcc_en_4.4.2_3.0_1685507415461.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_billable_hcc_en_4.4.2_3.0_1685507415461.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The `sbiobertresolve_icd10cm_augmented_billable_hcc` resolver model must be used with `sbiobert_base_cased_mli` embeddings.
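The `1||1||8` example above can be unpacked with a small helper. This is a minimal sketch, assuming each `all_k_aux_labels` entry is exactly the three-field `||`-separated string described in this card:

```python
def parse_hcc_status(aux_label: str) -> dict:
    """Split an `all_k_aux_labels` entry such as '1||1||8' into
    billable status, HCC status, and HCC score (an assumed format
    per this card's description)."""
    billable, hcc_status, hcc_score = aux_label.split("||")
    return {"billable": billable, "hcc_status": hcc_status, "hcc_score": hcc_score}

print(parse_hcc_status("1||1||8"))
# -> {'billable': '1', 'hcc_status': '1', 'hcc_score': '8'}
```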
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("word_embeddings") ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token", "word_embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(["PROBLEM"]) c2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sentence_embeddings")\ .setCaseSensitive(False) icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc", "en", "clinical/models") \ .setInputCols(["sentence_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") resolver_pipeline = Pipeline(stages = [document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter, c2doc, sbert_embedder, icd_resolver]) data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, associated with obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. 
Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection."""]]).toDF("text") result = resolver_pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") .setWhiteList("PROBLEM") val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("sbert_embeddings") val icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc", "en", "clinical/models") .setInputCols("sbert_embeddings") .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, associated with obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, 
poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ```
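The `all_k_aux_labels` values returned by the resolver encode the billable status, HCC status, and HCC score as a `||`-separated string (e.g. `1||1||8`, as described above). A minimal sketch of unpacking it outside of Spark (the `parse_hcc` helper is our own, not part of Spark NLP):

```python
def parse_hcc(aux_label: str) -> dict:
    """Split an `all_k_aux_labels` entry such as '1||1||8' into its parts."""
    billable, hcc_status, hcc_score = aux_label.split("||")
    return {
        "billable": billable == "1",   # is the code billable?
        "hcc": hcc_status == "1",      # does the code map to an HCC?
        "hcc_score": int(hcc_score),   # 0 when there is no HCC
    }

print(parse_hcc("1||1||8"))
# {'billable': True, 'hcc': True, 'hcc_score': 8}
```

The same helper can be mapped over the `hcc_list` column shown in the results below, one entry per candidate code.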
## Results ```bash +-------------------------------------+-------+----------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+ | ner_chunk| entity|icd10_code| resolutions| all_codes| hcc_list| +-------------------------------------+-------+----------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+ | gestational diabetes mellitus|PROBLEM| O24.4|[gestational diabetes mellitus [gestational diabetes mellitus], gestatio...| [O24.4, O24.41, O24.43, Z86.32, Z87.5, O24.31, O24.11, O24.1, O24.81]|[0||0||0, 0||0||0, 0||0||0, 1||0||0, 0||0||0, 0||0||0, 0||0||0, 0||0||0,...| |subsequent type two diabetes mellitus|PROBLEM| O24.11|[pre-existing type 2 diabetes mellitus [pre-existing type 2 diabetes mel...|[O24.11, E11.8, E11, E13.9, E11.9, E11.3, E11.44, Z86.3, Z86.39, E11.32,...|[0||0||0, 1||1||18, 0||0||0, 1||1||19, 1||1||19, 0||0||0, 1||1||18, 0||0...| | obesity|PROBLEM| E66.9|[obesity [obesity, unspecified], abdominal obesity [other obesity], obes...|[E66.9, E66.8, Z68.41, Q13.0, E66, E66.01, Z86.39, E34.9, H35.50, Z83.49...|[1||0||0, 1||0||0, 1||1||22, 1||0||0, 0||0||0, 1||1||22, 1||0||0, 1||0||...| | a body mass index|PROBLEM| Z68.41|[finding of body mass index [body mass index [bmi] 40.0-44.9, adult], ob...|[Z68.41, E66.9, R22.9, Z68.1, R22.3, R22.1, Z68, R22.2, R22.0, R41.89, M...|[1||1||22, 1||0||0, 1||0||0, 1||0||0, 0||0||0, 1||0||0, 0||0||0, 1||0||0...| | polyuria|PROBLEM| R35|[polyuria [polyuria], nocturnal polyuria [nocturnal polyuria], polyuric ...|[R35, R35.81, R35.8, E23.2, R31, R35.0, R82.99, N40.1, E72.3, O04.8, R30...|[0||0||0, 1||0||0, 0||0||0, 1||1||23, 0||0||0, 1||0||0, 0||0||0, 1||0||0...| | 
polydipsia|PROBLEM| R63.1|[polydipsia [polydipsia], psychogenic polydipsia [other impulse disorder...|[R63.1, F63.89, E23.2, F63.9, O40, G47.5, M79.89, R63.2, R06.1, H53.8, I...|[1||0||0, 1||0||0, 1||1||23, 1||0||0, 0||0||0, 0||0||0, 1||0||0, 1||0||0...| | poor appetite|PROBLEM| R63.0|[poor appetite [anorexia], poor feeding [feeding problem of newborn, uns...|[R63.0, P92.9, R43.8, R43.2, E86, R19.6, F52.0, Z72.4, R06.89, Z76.89, R...|[1||0||0, 1||0||0, 1||0||0, 1||0||0, 0||0||0, 1||0||0, 1||0||0, 1||0||0,...| | vomiting|PROBLEM| R11.1|[vomiting [vomiting], intermittent vomiting [nausea and vomiting], vomit...| [R11.1, R11, R11.10, G43.A1, P92.1, P92.09, G43.A, R11.13, R11.0]|[0||0||0, 0||0||0, 1||0||0, 1||0||0, 1||0||0, 1||0||0, 0||0||0, 1||0||0,...| | a respiratory tract infection|PROBLEM| J98.8|[respiratory tract infection [other specified respiratory disorders], up...|[J98.8, J06.9, A49.9, J22, J20.9, Z59.3, T17, J04.10, Z13.83, J18.9, P28...|[1||0||0, 1||0||0, 1||0||0, 1||0||0, 1||0||0, 1||0||0, 0||0||0, 1||0||0,...| +-------------------------------------+-------+----------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icd10cm_augmented_billable_hcc| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Size:|1.4 GB| |Case sensitive:|false| --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from mrm8488) author: John Snow Labs name: roberta_qa_mrm8488_base_bne_finetuned_s_c date: 2022-12-02 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true 
engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bne-finetuned-sqac` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_mrm8488_base_bne_finetuned_s_c_es_4.2.4_3.0_1669985676842.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_mrm8488_base_bne_finetuned_s_c_es_4.2.4_3.0_1669985676842.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_mrm8488_base_bne_finetuned_s_c","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_mrm8488_base_bne_finetuned_s_c","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_mrm8488_base_bne_finetuned_s_c| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|es| |Size:|460.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mrm8488/roberta-base-bne-finetuned-sqac --- layout: model title: Detect Symptoms, Treatments and Other Entities in German author: John Snow Labs name: ner_healthcare date: 2021-09-15 tags: [ner, healthcare, licensed, de] task: Named Entity Recognition language: de edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model can be used to detect clinical entities in German-language medical text. ## Predicted Entities `DIAGLAB_PROCEDURE`, `MEDICAL_SPECIFICATION`, `MEDICAL_DEVICE`, `MEASUREMENT`, `BIOLOGICAL_CHEMISTRY`, `BODY_FLUID`, `TIME_INFORMATION`, `LOCAL_SPECIFICATION`, `BIOLOGICAL_PARAMETER`, `PROCESS`, `MEDICATION`, `DOSING`, `DEGREE`, `MEDICAL_CONDITION`, `PERSON`, `TISSUE`, `STATE_OF_HEALTH`, `BODY_PART`, `TREATMENT` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_HEALTHCARE_DE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/14.German_Healthcare_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_de_3.0.0_3.0_1631687601139.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_de_3.0.0_3.0_1631687601139.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","de","clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_healthcare", "de", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") clinical_ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, clinical_ner_converter]) data = spark.createDataFrame([["Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist Hernia femoralis, Akne, einseitig, ein hochmalignes bronchogenes Karzinom, das überwiegend im Zentrum der Lunge, in einem Hauptbronchus entsteht. Die mittlere Prävalenz wird auf 1/20.000 geschätzt."]]).toDF("text") result = nlp_pipeline.fit(data).transform(data) ``` ```scala ... 
val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","de", "clinical/models") .setInputCols("sentence", "token") .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_healthcare", "de", "clinical/models") .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") val clinical_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, clinical_ner_converter)) val data = Seq("""Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist Hernia femoralis, Akne, einseitig, ein hochmalignes bronchogenes Karzinom, das überwiegend im Zentrum der Lunge, in einem Hauptbronchus entsteht. Die mittlere Prävalenz wird auf 1/20.000 geschätzt.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.med_ner.healthcare").predict("""Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist Hernia femoralis, Akne, einseitig, ein hochmalignes bronchogenes Karzinom, das überwiegend im Zentrum der Lunge, in einem Hauptbronchus entsteht. Die mittlere Prävalenz wird auf 1/20.000 geschätzt.""") ```
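The `NerConverter` stage in the pipeline above merges the model's token-level IOB tags (`B-X`, `I-X`, `O`) into the chunks shown in the results. The merge logic can be sketched in plain Python; the tag sequence here is illustrative, hand-written to match two rows of the results table, not actual model output:

```python
def iob_to_chunks(tokens, tags):
    """Merge token-level IOB tags (B-X / I-X / O) into (chunk, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the previous chunk before opening a new one
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)  # continuation of the open chunk
        else:
            if current:  # O tag (or label mismatch) ends the open chunk
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Das", "Kleinzellige", "Bronchialkarzinom"]
tags = ["O", "B-MEASUREMENT", "B-MEDICAL_CONDITION"]
print(iob_to_chunks(tokens, tags))
# [('Kleinzellige', 'MEASUREMENT'), ('Bronchialkarzinom', 'MEDICAL_CONDITION')]
```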
## Results ```bash +-----------------+---------------------+ |chunk |label | +-----------------+---------------------+ |Kleinzellige |MEASUREMENT | |Bronchialkarzinom|MEDICAL_CONDITION | |Kleinzelliger |MEDICAL_SPECIFICATION| |Lungenkrebs |MEDICAL_CONDITION | |SCLC |MEDICAL_CONDITION | |Hernia |MEDICAL_CONDITION | |femoralis |LOCAL_SPECIFICATION | |Akne |MEDICAL_CONDITION | |einseitig |MEASUREMENT | |hochmalignes |MEDICAL_CONDITION | |bronchogenes |BODY_PART | |Karzinom |MEDICAL_CONDITION | |Lunge |BODY_PART | |Hauptbronchus |BODY_PART | |mittlere |MEASUREMENT | |Prävalenz |MEDICAL_CONDITION | +-----------------+---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_healthcare| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|de| ## Benchmarking ```bash label tp fp fn total precision recall f1 BIOLOGICAL_PARAMETER 1186.0 651.0 429.0 1615.0 0.6456 0.7344 0.6871 BODY_FLUID 32.0 8.0 23.0 55.0 0.8 0.5818 0.6737 PERSON 3927.0 293.0 641.0 4568.0 0.9306 0.8597 0.8937 DOSING 203.0 96.0 155.0 358.0 0.6789 0.567 0.618 DIAGLAB_PROCEDURE 2373.0 812.0 873.0 3246.0 0.7451 0.7311 0.738 TISSUE 4.0 3.0 2.0 6.0 0.5714 0.6667 0.6154 BODY_PART 1859.0 513.0 384.0 2243.0 0.7837 0.8288 0.8056 MEDICATION 3307.0 925.0 1075.0 4382.0 0.7814 0.7547 0.7678 STATE_OF_HEALTH 602.0 131.0 162.0 764.0 0.8213 0.788 0.8043 LOCAL_SPECIFICATION 231.0 86.0 97.0 328.0 0.7287 0.7043 0.7163 MEASUREMENT 6472.0 1612.0 1691.0 8163.0 0.8006 0.7928 0.7967 TREATMENT 9262.0 2054.0 2380.0 11642.0 0.8185 0.7956 0.8069 MEDICAL_SPECIFICATION 1455.0 782.0 493.0 1948.0 0.6504 0.7469 0.6953 MEDICAL_CONDITION 10464.0 2243.0 2364.0 12828.0 0.8235 0.8157 0.8196 TIME_INFORMATION 1496.0 534.0 603.0 2099.0 0.7369 0.7127 0.7246 PROCESS 526.0 232.0 251.0 777.0 0.6939 0.677 0.6853 BIOLOGICAL_CHEMISTRY 524.0 261.0 392.0 916.0 0.6675 0.5721 0.6161 ``` --- layout: model 
title: Pipeline to Detect Diseases author: John Snow Labs name: ner_diseases_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_diseases](https://nlp.johnsnowlabs.com/2021/03/31/ner_diseases_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_diseases_pipeline_en_4.3.0_3.2_1678878597802.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_diseases_pipeline_en_4.3.0_3.2_1678878597802.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_diseases_pipeline", "en", "clinical/models") text = '''Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_diseases_pipeline", "en", "clinical/models") val text = "Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. 
Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.diseases.pipeline").predict("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""") ```
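The `begin`/`end` columns in the results are character offsets into the input text, with `end` inclusive. They can be sanity-checked outside of Spark with `re` on the same example text (truncated here to the first three sentences):

```python
import re

text = ("Detection of various other intracellular signaling proteins is also "
        "described. Genetic characterization of transactivation of the human "
        "T-cell leukemia virus type 1 promoter: Binding of Tax to "
        "Tax-responsive element 1 is mediated by the cyclic AMP-responsive "
        "members of the CREB/ATF family of transcription factors. To achieve "
        "a better understanding of the mechanism of transactivation by Tax "
        "of human T-cell leukemia virus type 1 Tax-responsive element 1 "
        "(TRE-1), we developed a genetic approach with Saccharomyces cerevisiae.")

# Inclusive begin/end offsets for every occurrence of a chunk, mirroring
# the ner_chunk/begin/end columns of the pipeline output.
spans = [(m.start(), m.end() - 1) for m in re.finditer(r"T-cell leukemia", text)]
for begin, end in spans:
    assert text[begin:end + 1] == "T-cell leukemia"
print(spans)  # the first span is (136, 150), as in the results table
```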
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:----------------|--------:|------:|:------------|-------------:| | 0 | T-cell leukemia | 136 | 150 | Disease | 0.92015 | | 1 | T-cell leukemia | 402 | 416 | Disease | 0.94145 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_diseases_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: ALBERT Base CoNLL-03 NER Pipeline author: John Snow Labs name: albert_base_token_classifier_conll03_pipeline date: 2022-04-22 tags: [open_source, ner, token_classifier, albert, conll03, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [albert_base_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/26/albert_base_token_classifier_conll03_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_base_token_classifier_conll03_pipeline_en_3.4.1_3.0_1650636447138.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_base_token_classifier_conll03_pipeline_en_3.4.1_3.0_1650636447138.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("albert_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("albert_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PER | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_base_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|43.1 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - AlbertForTokenClassification - NerConverter - Finisher --- layout: model title: Legal Pledge Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_pledge_agreement date: 2022-12-06 tags: [en, legal, classification, agreement, pledge, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_pledge_agreement` model is a Legal Longformer Document Classifier that predicts whether a document belongs to the class `pledge-agreement` or not (binary classification). Longformers are limited to 4096 tokens, so only the first 4096 tokens are taken into account. In our experience, for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document itself without extra leading material, 4096 tokens are enough for document classification. If that is not the case, let us know and we can carry out an alternative approach for you: splitting the document into 4096-token chunks, averaging their embeddings, and training on the averaged version, so that the whole document is taken into account. In practice this should rarely be required. 
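The chunk-and-average fallback mentioned in the description can be sketched in a few lines of plain Python. This is illustrative only: `embed` stands in for the real Longformer encoder, and the toy embedder below is a made-up stand-in so the sketch is runnable:

```python
def average_chunk_embeddings(tokens, embed, chunk_size=4096):
    """Embed a long document in chunk_size-token windows and average the vectors."""
    vectors = []
    for start in range(0, len(tokens), chunk_size):
        vectors.append(embed(tokens[start:start + chunk_size]))
    dim = len(vectors[0])
    # element-wise mean over all chunk vectors
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# toy embedder: a 2-dim "embedding" = (token count, mean token length)
toy_embed = lambda chunk: [len(chunk), sum(len(t) for t in chunk) / len(chunk)]
doc = ["token"] * 10000  # longer than the 4096-token Longformer window
avg = average_chunk_embeddings(doc, toy_embed, chunk_size=4096)
print(avg)
```

With a real encoder, the averaged vector would then replace the single-window sentence embedding fed to the classifier.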
## Predicted Entities `pledge-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_pledge_agreement_en_1.0.0_3.0_1670358026248.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_pledge_agreement_en_1.0.0_3.0_1670358026248.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_pledge_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +------------------+ |result            | +------------------+ |[pledge-agreement]| |[other]           | |[other]           | |[pledge-agreement]| +------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_pledge_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.4 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.95 1.00 0.97 111 pledge-agreement 1.00 0.88 0.94 50 accuracy - - 0.96 161 macro-avg 0.97 0.94 0.95 161 weighted-avg 0.96 0.96 0.96 161 ``` --- layout: model title: English asr_wav2vec2_base_timit_demo_colab0_by_sherry7144 TFWav2Vec2ForCTC from sherry7144 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab0_by_sherry7144 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab0_by_sherry7144` is an English model originally trained by sherry7144. 
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab0_by_sherry7144_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab0_by_sherry7144_en_4.2.0_3.0_1664037871788.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab0_by_sherry7144_en_4.2.0_3.0_1664037871788.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab0_by_sherry7144', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab0_by_sherry7144", lang = "en") val annotations = pipeline.transform(audioDF) ```
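The `audioDF` passed to the pipeline above must carry the raw audio as an array of floats (Wav2Vec2 models expect 16 kHz mono input). A stdlib-only sketch of loading a 16-bit PCM WAV file into that form; the file written here is a tiny synthetic one so the sketch runs end to end:

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit PCM mono WAV file and normalize samples to [-1.0, 1.0]."""
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2 and wav.getnchannels() == 1
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# write a tiny synthetic 16 kHz mono file so the sketch is runnable end to end
with wave.open("demo.wav", "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))

print(wav_to_floats("demo.wav"))
# [0.0, 0.5, -0.5, 0.999969482421875]
```

The resulting float list is what would go into the `audio_content` column of a Spark DataFrame before calling `pipeline.transform`.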
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab0_by_sherry7144| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Detect Oncology-Specific Entities (clinical_medium) author: John Snow Labs name: ner_oncology_emb_clinical_medium date: 2023-04-12 tags: [licensed, en, clinical, clinical_medium, ner, oncology, biomarker, treatment] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts more than 40 oncology-related entities, including therapies, tests and staging. Definitions of Predicted Entities: `Adenopathy:` Mentions of pathological findings of the lymph nodes. `Age:` All mentions of ages, past or present, related to the patient or to anybody else. `Biomarker:` Biological molecules that indicate the presence or absence of cancer, or the type of cancer. Oncogenes are excluded from this category. `Biomarker_Result:` Terms or values that are identified as the result of a biomarker. `Cancer_Dx:` Mentions of cancer diagnoses (such as “breast cancer”) or pathological types that are usually used as synonyms for “cancer” (e.g. “carcinoma”). When anatomical references are present, they are included in the Cancer_Dx extraction. `Cancer_Score:` Clinical or imaging scores that are specific for cancer settings (e.g. “BI-RADS” or “Allred score”). `Cancer_Surgery:` Terms that indicate surgery as a form of cancer treatment. `Chemotherapy:` Mentions of chemotherapy drugs, or unspecific words such as “chemotherapy”. `Cycle_Count:` The total number of cycles being administered of an oncological therapy (e.g. “5 cycles”). 
`Cycle_Day:` References to the day of the cycle of oncological therapy (e.g. “day 5”). `Cycle_Number:` The number of the cycle of an oncological therapy that is being applied (e.g. “third cycle”). `Date:` Mentions of exact dates, in any format, including day number, month and/or year. `Death_Entity:` Words that indicate the death of the patient or someone else (including family members), such as “died” or “passed away”. `Direction:` Directional and laterality terms, such as “left”, “right”, “bilateral”, “upper” and “lower”. `Dosage:` The quantity prescribed by the physician for an active ingredient. `Duration:` Words indicating the duration of a treatment (e.g. “for 2 weeks”). `Frequency:` Words indicating the frequency of treatment administration (e.g. “daily” or “bid”). `Gender:` Gender-specific nouns and pronouns (including words such as “him” or “she”, and family members such as “father”). `Grade:` All pathological grading of tumors (e.g. “grade 1”) or degrees of cellular differentiation (e.g. “well-differentiated”) `Histological_Type:` Histological variants or cancer subtypes, such as “papillary”, “clear cell” or “medullary”. `Hormonal_Therapy:` Mentions of hormonal drugs used to treat cancer, or unspecific words such as “hormonal therapy”. `Imaging_Test:` Imaging tests mentioned in texts, such as “chest CT scan”. `Immunotherapy:` Mentions of immunotherapy drugs, or unspecific words such as “immunotherapy”. `Invasion:` Mentions that refer to tumor invasion, such as “invasion” or “involvement”. Metastases or lymph node involvement are excluded from this category. `Line_Of_Therapy:` Explicit references to the line of therapy of an oncological therapy (e.g. “first-line treatment”). `Metastasis:` Terms that indicate a metastatic disease. Anatomical references are not included in these extractions. `Oncogene:` Mentions of genes that are implicated in the etiology of cancer. 
`Pathology_Result:` The findings of a biopsy from the pathology report that are not covered by another entity (e.g. “malignant ductal cells”). `Pathology_Test:` Mentions of biopsies or tests that use tissue samples. `Performance_Status:` Mentions of performance status scores, such as ECOG and Karnofsky. The name of the score is extracted together with the result (e.g. “ECOG performance status of 4”). `Race_Ethnicity:` The race and ethnicity categories include racial and national origin or sociocultural groups. `Radiotherapy:` Terms that indicate the use of radiotherapy. `Response_To_Treatment:` Terms related to the clinical progress of the patient under cancer treatment, including “recurrence”, “bad response” or “improvement”. `Relative_Date:` Temporal references that are relative to the date of the text or to any other specific date (e.g. “yesterday” or “three years later”). `Route:` Words indicating the type of administration route (such as “PO” or “transdermal”). `Site_Bone:` Anatomical terms that refer to the human skeleton. `Site_Brain:` Anatomical terms that refer to the central nervous system (including the brain stem and the cerebellum). `Site_Breast:` Anatomical terms that refer to the breasts. `Site_Liver:` Anatomical terms that refer to the liver. `Site_Lung:` Anatomical terms that refer to the lungs. `Site_Lymph_Node:` Anatomical terms that refer to lymph nodes, excluding adenopathies. `Site_Other_Body_Part:` Relevant anatomical terms that are not included in the rest of the anatomical entities. `Smoking_Status:` All mentions of smoking related to the patient or to someone else. `Staging:` Mentions of cancer stage such as “stage 2b” or “T2N1M0”. It also includes words such as “in situ”, “early-stage” or “advanced”. `Targeted_Therapy:` Mentions of targeted therapy drugs, or unspecific words such as “targeted therapy”. 
`Tumor_Finding:` All nonspecific terms that may be related to tumors, either malignant or benign (for example: “mass”, “tumor”, “lesion”, or “neoplasm”). `Tumor_Size:` Size of the tumor, including numerical value and unit of measurement (e.g. “3 cm”). `Unspecific_Therapy:` Terms that indicate a known cancer therapy but are not specific to any other therapy entity (e.g. “chemoradiotherapy” or “adjuvant therapy”). ## Predicted Entities `Histological_Type`, `Direction`, `Staging`, `Cancer_Score`, `Imaging_Test`, `Cycle_Number`, `Tumor_Finding`, `Site_Lymph_Node`, `Invasion`, `Response_To_Treatment`, `Smoking_Status`, `Tumor_Size`, `Cycle_Count`, `Adenopathy`, `Age`, `Biomarker_Result`, `Unspecific_Therapy`, `Site_Breast`, `Chemotherapy`, `Targeted_Therapy`, `Radiotherapy`, `Performance_Status`, `Pathology_Test`, `Site_Other_Body_Part`, `Cancer_Surgery`, `Line_Of_Therapy`, `Pathology_Result`, `Hormonal_Therapy`, `Site_Bone`, `Biomarker`, `Immunotherapy`, `Cycle_Day`, `Frequency`, `Route`, `Duration`, `Death_Entity`, `Metastasis`, `Site_Liver`, `Cancer_Dx`, `Grade`, `Date`, `Site_Lung`, `Site_Brain`, `Relative_Date`, `Race_Ethnicity`, `Gender`, `Oncogene`, `Dosage`, `Radiation_Dose` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_emb_clinical_medium_en_4.3.2_3.0_1681316892301.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_emb_clinical_medium_en_4.3.2_3.0_1681316892301.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_emb_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["""The had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to the residual breast. The cancer recurred as a right lung metastasis 13 years later. 
The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_emb_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to the residual breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash +-------------------+-----+---+---------------------+ |chunk |begin|end|ner_label | +-------------------+-----+---+---------------------+ |left |31 |34 |Direction | |mastectomy |36 |45 |Cancer_Surgery | |axillary lymph node|54 |72 |Site_Lymph_Node | |dissection |74 |83 |Cancer_Surgery | |left |91 |94 |Direction | |breast cancer |96 |108|Cancer_Dx | |twenty years ago |110 |125|Relative_Date | |tumor |132 |136|Tumor_Finding | |positive |142 |149|Biomarker_Result | |ER |155 |156|Biomarker | |PR |162 |163|Response_To_Treatment| |radiotherapy |183 |194|Radiotherapy | |breast |229 |234|Site_Breast | |cancer |241 |246|Cancer_Dx | |recurred |248 |255|Response_To_Treatment| |right |262 |266|Direction | |lung |268 |271|Site_Lung | |metastasis |273 |282|Metastasis | |13 years later |284 |297|Relative_Date | |adriamycin |346 |355|Chemotherapy | |60 mg/m2 |358 |365|Chemotherapy | |cyclophosphamide |372 |387|Chemotherapy | |600 mg/m2 |390 |398|Dosage | |six courses |406 |416|Cycle_Count | |first line |422 |431|Line_Of_Therapy | +-------------------+-----+---+---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_emb_clinical_medium| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|15.4 MB| ## Benchmarking ```bash label tp fp fn total precision recall f1 Histological_Type 138.0 27.0 73.0 211.0 0.8364 0.654 0.734 Direction 679.0 163.0 152.0 831.0 0.8064 0.8171 0.8117 Staging 112.0 24.0 26.0 138.0 0.8235 0.8116 0.8175 Cancer_Score 9.0 2.0 12.0 21.0 0.8182 0.4286 0.5625 Imaging_Test 759.0 132.0 141.0 900.0 0.8519 0.8433 0.8476 Cycle_Number 43.0 29.0 17.0 60.0 0.5972 0.7167 0.6515 Tumor_Finding 971.0 98.0 108.0 1079.0 0.9083 0.8999 0.9041 Site_Lymph_Node 210.0 80.0 61.0 271.0 0.7241 0.7749 0.7487 Invasion 146.0 33.0 21.0 167.0 0.8156 0.8743 0.8439 Response_To_Treat... 
224.0 98.0 146.0 370.0 0.6957 0.6054 0.6474 Smoking_Status 39.0 14.0 9.0 48.0 0.7358 0.8125 0.7723 Cycle_Count 113.0 34.0 31.0 144.0 0.7687 0.7847 0.7766 Tumor_Size 203.0 44.0 35.0 238.0 0.8219 0.8529 0.8371 Adenopathy 32.0 12.0 11.0 43.0 0.7273 0.7442 0.7356 Age 203.0 20.0 25.0 228.0 0.9103 0.8904 0.9002 Biomarker_Result 537.0 117.0 148.0 685.0 0.8211 0.7839 0.8021 Unspecific_Therapy 107.0 32.0 67.0 174.0 0.7698 0.6149 0.6837 Site_Breast 95.0 17.0 15.0 110.0 0.8482 0.8636 0.8559 Chemotherapy 684.0 72.0 58.0 742.0 0.9048 0.9218 0.9132 Targeted_Therapy 170.0 31.0 36.0 206.0 0.8458 0.8252 0.8354 Radiotherapy 141.0 43.0 20.0 161.0 0.7663 0.8758 0.8174 Performance_Status 20.0 12.0 12.0 32.0 0.625 0.625 0.625 Pathology_Test 359.0 159.0 127.0 486.0 0.6931 0.7387 0.7151 Site_Other_Body_Part 744.0 338.0 394.0 1138.0 0.6876 0.6538 0.6703 Cancer_Surgery 380.0 83.0 113.0 493.0 0.8207 0.7708 0.795 Line_Of_Therapy 38.0 7.0 10.0 48.0 0.8444 0.7917 0.8172 Pathology_Result 124.0 144.0 217.0 341.0 0.4627 0.3636 0.4072 Hormonal_Therapy 96.0 13.0 27.0 123.0 0.8807 0.7805 0.8276 Site_Bone 167.0 50.0 56.0 223.0 0.7696 0.7489 0.7591 Immunotherapy 61.0 13.0 21.0 82.0 0.8243 0.7439 0.7821 Biomarker 681.0 88.0 150.0 831.0 0.8856 0.8195 0.8513 Cycle_Day 85.0 43.0 43.0 128.0 0.6641 0.6641 0.6641 Frequency 200.0 40.0 35.0 235.0 0.8333 0.8511 0.8421 Route 98.0 13.0 18.0 116.0 0.8829 0.8448 0.8634 Duration 195.0 57.0 101.0 296.0 0.7738 0.6588 0.7117 Death_Entity 40.0 9.0 4.0 44.0 0.8163 0.9091 0.8602 Metastasis 335.0 34.0 27.0 362.0 0.9079 0.9254 0.9166 Site_Liver 146.0 64.0 28.0 174.0 0.6952 0.8391 0.7604 Cancer_Dx 722.0 96.0 108.0 830.0 0.8826 0.8699 0.8762 Grade 55.0 19.0 11.0 66.0 0.7432 0.8333 0.7857 Date 403.0 16.0 14.0 417.0 0.9618 0.9664 0.9641 Site_Lung 341.0 151.0 61.0 402.0 0.6931 0.8483 0.7629 Site_Brain 184.0 82.0 22.0 206.0 0.6917 0.8932 0.7797 Relative_Date 365.0 249.0 95.0 460.0 0.5945 0.7935 0.6797 Race_Ethnicity 47.0 2.0 8.0 55.0 0.9592 0.8545 0.9038 Gender 1260.0 15.0 2.0 
1262.0 0.9882 0.9984 0.9933 Dosage 425.0 76.0 60.0 485.0 0.8483 0.8763 0.8621 Oncogene 178.0 89.0 57.0 235.0 0.6667 0.7574 0.7092 Radiation_Dose 41.0 6.0 11.0 52.0 0.8723 0.7885 0.8283 macro - - - - - - 0.7859 micro - - - - - - 0.8130 ``` --- layout: model title: Legal Indemnification NER (Bert, sm) author: John Snow Labs name: legner_bert_indemnifications date: 2022-09-27 tags: [indemnifications, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description IMPORTANT: Don't run this model on the whole legal agreement. Instead: - Split by paragraphs. You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration; - Use the `legclf_indemnification_clause` Text Classifier to select only those paragraphs. This is a Legal Named Entity Recognition Model to identify the Subject (who), Action (verb), Object (the indemnification) and Indirect Object (to whom) from Indemnification clauses. There is a lighter (non-transformer based) model available in Models Hub as `legner_indemnifications_md`. ## Predicted Entities `INDEMNIFICATION`, `INDEMNIFICATION_SUBJECT`, `INDEMNIFICATION_ACTION`, `INDEMNIFICATION_INDIRECT_OBJECT` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/LEGALRE_INDEMNIFICATION/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_bert_indemnifications_en_1.0.0_3.0_1664273651991.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_bert_indemnifications_en_1.0.0_3.0_1664273651991.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencizer = nlp.SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "en") \ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") tokenClassifier = legal.BertForTokenClassification.pretrained("legner_bert_indemnifications", "en", "legal/models")\ .setInputCols("token", "sentence")\ .setOutputCol("label")\ .setCaseSensitive(True) ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","label"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentencizer, tokenizer, tokenClassifier, ner_converter ]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = '''The Company shall protect and indemnify the Supplier against any damages, losses or costs whatsoever''' lmodel = LightPipeline(model) res = lmodel.annotate(text) ```
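The `NerConverter` stage is what turns the token-level B-/I- labels (shown in the results below) into entity chunks. As an illustration of that grouping rule, here is a minimal pure-Python sketch (a hypothetical helper for explanation only, not Spark NLP's actual implementation):

```python
# Hypothetical helper illustrating BIO-to-chunk grouping, the idea behind
# NerConverter. Not the Spark NLP implementation.

def bio_to_chunks(tokens, labels):
    """Group (token, BIO-label) pairs into (chunk_text, entity) tuples."""
    chunks = []
    current_tokens, current_label = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- tag always opens a new chunk, closing any open one.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], label[2:]
        elif label.startswith("I-") and current_tokens and label[2:] == current_label:
            # An I- tag extends the open chunk of the same entity type.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open chunk.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks

tokens = ["shall", "protect", "and", "indemnify", "the", "Supplier"]
labels = ["B-INDEMNIFICATION_ACTION", "I-INDEMNIFICATION_ACTION", "O",
          "B-INDEMNIFICATION_ACTION", "O", "B-INDEMNIFICATION_INDIRECT_OBJECT"]
print(bio_to_chunks(tokens, labels))
# [('shall protect', 'INDEMNIFICATION_ACTION'), ('indemnify', 'INDEMNIFICATION_ACTION'), ('Supplier', 'INDEMNIFICATION_INDIRECT_OBJECT')]
```

This is why the token-level output below can contain several separate `INDEMNIFICATION` chunks from one sentence: each B- tag starts a fresh chunk.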
## Results ```bash +----------+---------------------------------+ | token| ner_label| +----------+---------------------------------+ | The| O| | Company| O| | shall| B-INDEMNIFICATION_ACTION| | protect| I-INDEMNIFICATION_ACTION| | and| O| | indemnify| B-INDEMNIFICATION_ACTION| | the| O| | Supplier|B-INDEMNIFICATION_INDIRECT_OBJECT| | against| O| | any| O| | damages| B-INDEMNIFICATION| | ,| O| | losses| B-INDEMNIFICATION| | or| O| | costs| B-INDEMNIFICATION| |whatsoever| O| +----------+---------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_bert_indemnifications| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|412.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## References In-house annotated examples from CUAD legal dataset ## Benchmarking ```bash label precision recall f1-score support B-INDEMNIFICATION 0.91 0.89 0.90 36 B-INDEMNIFICATION_ACTION 0.92 0.71 0.80 17 B-INDEMNIFICATION_INDIRECT_OBJECT 0.88 0.88 0.88 40 B-INDEMNIFICATION_SUBJECT 0.71 0.56 0.63 9 I-INDEMNIFICATION 0.88 0.78 0.82 9 I-INDEMNIFICATION_ACTION 0.81 0.87 0.84 15 I-INDEMNIFICATION_INDIRECT_OBJECT 1.00 0.53 0.69 17 O 0.97 0.91 0.94 510 accuracy - - 0.88 654 macro-avg 0.71 0.61 0.81 654 weighted-avg 0.95 0.88 0.91 654 ``` --- layout: model title: English DistilBertForQuestionAnswering Cased model (from arshiya20) author: John Snow Labs name: distilbert_qa_epochs_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using 
Spark NLP. `epochs-finetuned-squad` is an English model originally trained by `arshiya20`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_epochs_finetuned_squad_en_4.3.0_3.0_1672775091962.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_epochs_finetuned_squad_en_4.3.0_3.0_1672775091962.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_epochs_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_epochs_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
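Extractive QA annotators such as `DistilBertForQuestionAnswering` score every context token as a potential answer start and answer end, and return the span that maximizes the combined score. A toy sketch of that selection step, with made-up logits rather than real model output:

```python
# Toy sketch of extractive-QA span selection: pick (start, end) maximizing
# start_logit + end_logit with start <= end. Logits are illustrative only.

def best_span(start_logits, end_logits, max_len=15):
    """Return the (start, end) token indices of the highest-scoring span."""
    best, best_score = (0, 0), float("-inf")
    for s, sl in enumerate(start_logits):
        # Only consider spans of bounded length that end at or after the start.
        for e in range(s, min(s + max_len, len(end_logits))):
            score = sl + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley"]
start = [0.1, 0.2, 0.1, 3.0, 0.0, 0.1, 0.0, 0.0, 0.5]
end   = [0.0, 0.1, 0.2, 2.5, 0.1, 0.0, 0.0, 0.1, 0.4]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```

The real annotator applies the same idea over subword tokens and maps the winning span back to the original `context` string.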
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_epochs_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/arshiya20/epochs-finetuned-squad --- layout: model title: End-to-End (E2E) and data-driven NLG Challenge author: John Snow Labs name: multiclassifierdl_use_e2e date: 2021-01-21 task: Text Classification language: en nav_key: models edition: Spark NLP 2.7.1 spark_version: 2.4 tags: [en, open_source, text_classification] supported: true annotator: MultiClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Natural language generation plays a critical role for Conversational Agents as it has a significant impact on a user’s impression of the system. This shared task focuses on recent end-to-end (E2E), data-driven NLG methods, which jointly learn sentence planning and surface realization from non-aligned data, e.g. (Wen et al., 2015; Mei et al., 2016; Dusek and Jurcicek, 2016; Lampouras and Vlachos, 2016), etc. So far, E2E NLG approaches were limited to small, de-lexicalized data sets, e.g. BAGEL, SF Hotels/ Restaurants, or RoboCup. In this shared challenge, we will provide a new crowd-sourced data set of 50k instances in the restaurant domain, as described in (Novikova, Lemon, and Rieser, 2016). Each instance consists of a dialogue act-based meaning representation (MR) and up to 5 references in natural language. In contrast to previously used data, our data set includes additional challenges, such as open vocabulary, complex syntactic structures, and diverse discourse phenomena. 
## Predicted Entities `name[Bibimbap House]`,`name[Wildwood]`,`name[Clowns]`,`name[Cotto]`,`near[Burger King]`,`name[The Dumpling Tree]`,`name[The Vaults]`,`name[The Golden Palace]`,`near[Crowne Plaza Hotel]`,`name[The Rice Boat]`,`customer rating[high]`,`near[Avalon]`,`name[Alimentum]`,`near[The Bakers]`,`name[The Waterman]`,`near[Ranch]`,`name[The Olive Grove]`,`name[The Wrestlers]`,`name[The Eagle]`,`eatType[restaurant]`,`near[All Bar One]`,`customer rating[low]`,`near[Café Sicilia]`,`near[Yippee Noodle Bar]`,`food[Indian]`,`eatType[pub]`,`name[Green Man]`,`name[Strada]`,`near[Café Adriatic]`,`name[Loch Fyne]`,`eatType[coffee shop]`,`customer rating[5 out of 5]`,`near[Express by Holiday Inn]`,`food[French]`,`name[The Mill]`,`food[Japanese]`,`name[Travellers Rest Beefeater]`,`name[The Plough]`,`name[Cocum]`,`near[The Six Bells]`,`name[The Phoenix]`,`priceRange[cheap]`,`name[Midsummer House]`,`near[Rainbow Vegetarian Café]`,`near[The Rice Boat]`,`customer rating[3 out of 5]`,`customer rating[1 out of 5]`,`name[The Cricketers]`,`area[riverside]`,`priceRange[£20-25]`,`name[Blue Spice]`,`priceRange[moderate]`,`priceRange[less than £20]`,`priceRange[high]`,`name[Giraffe]`,`name[The Golden Curry]`,`customer rating[average]`,`name[The Twenty Two]`,`name[Aromi]`,`food[Fast food]`,`name[Browns Cambridge]`,`near[Café Rouge]`,`area[city centre]`,`familyFriendly[no]`,`food[Chinese]`,`name[Taste of Cambridge]`,`food[Italian]`,`name[Zizzi]`,`near[Raja Indian Cuisine]`,`priceRange[more than £30]`,`name[The Punter]`,`food[English]`,`near[Clare Hall]`,`near[The Portland Arms]`,`name[The Cambridge Blue]`,`near[The Sorrento]`,`near[Café Brazil]`,`familyFriendly[yes]`,`name[Fitzbillies]` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/multiclassifierdl_use_e2e_en_2.7.1_2.4_1611233305602.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/public/models/multiclassifierdl_use_e2e_en_2.7.1_2.4_1611233305602.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") use = UniversalSentenceEncoder.pretrained() \ .setInputCols(["document"])\ .setOutputCol("use_embeddings") docClassifier = MultiClassifierDLModel.pretrained("multiclassifierdl_use_e2e") \ .setInputCols(["use_embeddings"])\ .setOutputCol("category")\ .setThreshold(0.5) pipeline = Pipeline( stages = [ document, use, docClassifier ]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") .setCleanupMode("shrink") val use = UniversalSentenceEncoder.pretrained() .setInputCols("document") .setOutputCol("use_embeddings") val docClassifier = MultiClassifierDLModel.pretrained("multiclassifierdl_use_e2e") .setInputCols("use_embeddings") .setOutputCol("category") .setThreshold(0.5f) val pipeline = new Pipeline() .setStages( Array( documentAssembler, use, docClassifier ) ) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.e2e").predict("""Put your text here.""") ```
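`MultiClassifierDLModel` is multi-label: with `setThreshold(0.5)`, every label whose score reaches the threshold is emitted, so a single restaurant description can receive several MR labels at once. A small sketch of that decision rule, using illustrative scores rather than real model output:

```python
# Sketch of the multi-label decision behind setThreshold(0.5): keep every
# label whose score clears the threshold. Scores below are illustrative.

def labels_above_threshold(scores, threshold=0.5):
    """Return the sorted labels whose score is at or above the threshold."""
    return [label for label, p in sorted(scores.items()) if p >= threshold]

scores = {
    "name[The Eagle]": 0.91,
    "eatType[coffee shop]": 0.84,
    "area[riverside]": 0.62,
    "food[French]": 0.18,
}
print(labels_above_threshold(scores))
# ['area[riverside]', 'eatType[coffee shop]', 'name[The Eagle]']
```

Raising the threshold trades recall for precision, which is visible in the micro precision/recall figures reported in the benchmark below.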
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|multiclassifierdl_use_e2e| |Compatibility:|Spark NLP 2.7.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[use_embeddings]| |Output Labels:|[category]| |Language:|en| ## Data Source http://www.macs.hw.ac.uk/InteractionLab/E2E/ ## Benchmarking ```bash Summary Statistics Accuracy = 0.6366936009433872 F1 measure = 0.7561380632067716 Precision = 0.8678456763698633 Recall = 0.6911700403620353 Micro F1 measure = 0.7750978356361313 Micro precision = 0.8694288913773797 Micro recall = 0.6992326812925538 ``` --- layout: model title: English BertForQuestionAnswering model (from AnonymousSub) author: John Snow Labs name: bert_qa_fpdm_bert_FT_newsqa date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_bert_FT_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_fpdm_bert_FT_newsqa_en_4.0.0_3.0_1654187822264.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_fpdm_bert_FT_newsqa_en_4.0.0_3.0_1654187822264.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_fpdm_bert_FT_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_fpdm_bert_FT_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.bert.fpdm_ft.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_fpdm_bert_FT_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/fpdm_bert_FT_newsqa --- layout: model title: Legal Warranty Clause Binary Classifier author: John Snow Labs name: legclf_warranty_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `warranty` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above). 
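Paragraph splitting (by multiline), the first technique listed above, can be as simple as splitting the document on blank lines and classifying each fragment separately. A minimal sketch, assuming plain-text input; the linked workshop notebook covers more robust strategies:

```python
import re

# Minimal paragraph splitter: break on blank lines and drop empty fragments.
# Each resulting paragraph would then be fed to the clause classifier.
def split_paragraphs(text):
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "WARRANTY.\nSeller warrants the goods...\n\nINDEMNIFICATION.\nBuyer shall indemnify..."
for paragraph in split_paragraphs(doc):
    print(repr(paragraph))
```

Splitting this way keeps each classifier input within the 512-token embedding limit for typical contract paragraphs.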
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `warranty` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_warranty_clause_en_1.0.0_3.2_1660123201384.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_warranty_clause_en_1.0.0_3.2_1660123201384.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_warranty_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[warranty]| |[other]| |[other]| |[warranty]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_warranty_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.94 0.96 0.95 99 warranty 0.86 0.81 0.83 31 accuracy - - 0.92 130 macro-avg 0.90 0.88 0.89 130 weighted-avg 0.92 0.92 0.92 130 ``` --- layout: model title: Legal Regions And Regional Policy Document Classifier (EURLEX) author: John Snow Labs name: legclf_regions_and_regional_policy_bert date: 2023-03-06 tags: [en, legal, classification, clauses, regions_and_regional_policy, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the `legclf_regions_and_regional_policy_bert` model, a Bert Sentence Embeddings Document Classifier, predicts whether or not the document belongs to the Regions_and_Regional_Policy class (Binary Classification) according to EuroVoc labels. 
## Predicted Entities `Regions_and_Regional_Policy`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_regions_and_regional_policy_bert_en_1.0.0_3.0_1678111888788.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_regions_and_regional_policy_bert_en_1.0.0_3.0_1678111888788.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_regions_and_regional_policy_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
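The benchmarking figures reported below (precision, recall, F1) all derive from true-positive, false-positive and false-negative counts per class. A quick sketch of the relationship, using illustrative counts rather than the table's actual numbers:

```python
# Illustrative computation of the precision/recall/F1 triple reported in
# classifier benchmarks, from raw tp/fp/fn counts (counts are made up).

def prf(tp, fp, fn):
    """Return (precision, recall, F1) rounded to two decimals."""
    precision = tp / (tp + fp)          # of predicted positives, how many are right
    recall = tp / (tp + fn)             # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return round(precision, 2), round(recall, 2), round(f1, 2)

print(prf(tp=90, fp=10, fn=10))  # (0.9, 0.9, 0.9)
```

Macro averages take the unweighted mean of these per-class values, while weighted averages weight each class by its support.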
## Results ```bash +-------+ |result| +-------+ |[Regions_and_Regional_Policy]| |[Other]| |[Other]| |[Regions_and_Regional_Policy]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_regions_and_regional_policy_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.7 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.86 0.92 0.89 110 Regions_and_Regional_Policy 0.91 0.84 0.87 108 accuracy - - 0.88 218 macro-avg 0.88 0.88 0.88 218 weighted-avg 0.88 0.88 0.88 218 ``` --- layout: model title: Fast Neural Machine Translation Model from Uralic Languages to English author: John Snow Labs name: opus_mt_urj_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, urj, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `urj` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_urj_en_xx_2.7.0_2.4_1609170632141.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_urj_en_xx_2.7.0_2.4_1609170632141.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_urj_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_urj_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.urj.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
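Because the pipeline translates one sentence at a time, `fullAnnotate` returns one translation per detected sentence, and a typical post-processing step joins them back into a single string. The structure below is a simplified stand-in for the annotation output, not the real Spark NLP objects.

```python
# Simplified stand-in (an assumption) for light_pipeline.fullAnnotate(...)
# output: one dict per input text, with a list of per-sentence translations.
annotated = [
    {"translation": ["The first sentence in English.", "And the second one."]}
]

full_translation = " ".join(annotated[0]["translation"])
print(full_translation)
```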
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_urj_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Detect Clinical Entities (bert_token_classifier_ner_clinical) author: John Snow Labs name: bert_token_classifier_ner_clinical_pipeline date: 2023-03-20 tags: [berfortokenclassification, ner, en, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_clinical](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_pipeline_en_4.3.0_3.2_1679308200770.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_pipeline_en_4.3.0_3.2_1679308200770.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_clinical_pipeline", "en", "clinical/models") text = '''A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . 
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge .''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_clinical_pipeline", "en", "clinical/models") val text = "A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . 
Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge ." 
val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.clinical_pipeline").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . 
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge .""") ```
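The Results table below lists one row per extracted chunk. As a framework-free illustration of reshaping annotation-like objects into such rows, the `Annotation` class here is a simplified stand-in for Spark NLP's annotation type (an assumption), and the sample values are copied from the table.

```python
from dataclasses import dataclass

# Simplified stand-in for Spark NLP annotation objects (an assumption,
# not the real class); sample values mirror the Results table.
@dataclass
class Annotation:
    result: str
    begin: int
    end: int
    metadata: dict

def to_rows(ner_chunks):
    """Flatten annotations into (chunk, begin, end, label, confidence) rows."""
    return [
        (a.result, a.begin, a.end,
         a.metadata.get("entity"), float(a.metadata.get("confidence", 0.0)))
        for a in ner_chunks
    ]

chunks = [
    Annotation("gestational diabetes mellitus", 39, 67,
               {"entity": "PROBLEM", "confidence": "0.999895"}),
    Annotation("a body mass index", 301, 317,
               {"entity": "TEST", "confidence": "0.974921"}),
]
for row in to_rows(chunks):
    print(row)
```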
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:------------------------------|--------:|------:|:------------|-------------:| | 0 | gestational diabetes mellitus | 39 | 67 | PROBLEM | 0.999895 | | 1 | type two diabetes mellitus | 128 | 153 | PROBLEM | 0.999649 | | 2 | T2DM | 157 | 160 | PROBLEM | 0.991057 | | 3 | HTG-induced pancreatitis | 186 | 209 | PROBLEM | 0.999874 | | 4 | an acute hepatitis | 263 | 280 | PROBLEM | 0.999839 | | 5 | obesity | 288 | 294 | PROBLEM | 0.999873 | | 6 | a body mass index | 301 | 317 | TEST | 0.974921 | | 7 | BMI | 321 | 323 | TEST | 0.972609 | | 8 | polyuria | 380 | 387 | PROBLEM | 0.999895 | | 9 | polydipsia | 391 | 400 | PROBLEM | 0.999886 | | 10 | poor appetite | 404 | 416 | PROBLEM | 0.969424 | | 11 | vomiting | 424 | 431 | PROBLEM | 0.999771 | | 12 | amoxicillin | 511 | 521 | TREATMENT | 0.995783 | | 13 | a respiratory tract infection | 527 | 555 | PROBLEM | 0.999406 | | 14 | metformin | 570 | 578 | TREATMENT | 0.999728 | | 15 | glipizide | 582 | 590 | TREATMENT | 0.999702 | | 16 | dapagliflozin | 598 | 610 | TREATMENT | 0.999726 | | 17 | T2DM | 616 | 619 | PROBLEM | 0.999663 | | 18 | atorvastatin | 625 | 636 | TREATMENT | 0.999727 | | 19 | gemfibrozil | 642 | 652 | TREATMENT | 0.999675 | | 20 | HTG | 658 | 660 | PROBLEM | 0.999122 | | 21 | dapagliflozin | 680 | 692 | TREATMENT | 0.999708 | | 22 | Physical examination | 739 | 758 | TEST | 0.985332 | | 23 | dry oral mucosa | 796 | 810 | PROBLEM | 0.991374 | | 24 | her abdominal examination | 830 | 854 | TEST | 0.999292 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.9 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Detect 
Clinical Entities (clinical_large) author: John Snow Labs name: ner_jsl_emb_clinical_large date: 2023-04-12 tags: [ner, clinical_large, en, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terminology. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. This model is the official version of the jsl_ner_wip_clinical model. Definitions of Predicted Entities: - `Injury_or_Poisoning`: Physical harm or injury caused to the body, including those caused by accidents, falls, or poisoning of a patient or someone else. - `Direction`: All the information relating to the laterality of the internal and external organs. - `Test`: Mentions of laboratory, pathology, and radiological tests. - `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient. - `Death_Entity`: Mentions that indicate the death of a patient. - `Relationship_Status`: State of the patient's romantic or social relationships (e.g. single, married, divorced). - `Duration`: The duration of a medical treatment or medication use. - `Respiration`: Number of breaths per minute. - `Hyperlipidemia`: Terms that indicate hyperlipidemia with relevant subtypes and synonyms. - `Birth_Entity`: Mentions that indicate giving birth. - `Age`: All mentions of ages, past or present, related to the patient or anybody else. - `Labour_Delivery`: Extractions include stages of labor and delivery. - `Family_History_Header`: Identifies section headers that correspond to Family History of the patient. - `BMI`: Numeric values and other text information related to Body Mass Index. - `Temperature`: All mentions that refer to body temperature. 
- `Alcohol`: Terms that indicate alcohol use, abuse or drinking issues of a patient or someone else. - `Kidney_Disease`: Terms that refer to any kidney diseases (includes mentions of modifiers such as "Acute" or "Chronic"). - `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else. - `Medical_History_Header`: Identifies section headers that correspond to Past Medical History of a patient. - `Cerebrovascular_Disease`: All terms that refer to cerebrovascular diseases and events. - `Oxygen_Therapy`: Breathing support triggered by the patient, or provided entirely or partially by a machine (e.g. ventilator, BPAP, CPAP). - `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements. - `Psychological_Condition`: All mental health diagnoses, disorders, conditions or syndromes of a patient or someone else. - `Heart_Disease`: All mentions of acquired, congenital or degenerative heart diseases. - `Employment`: All mentions of patient or provider occupational titles and employment status. - `Obesity`: Terms related to a patient being obese (overweight and BMI are extracted as different labels). - `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.). - `Pregnancy`: All terms related to Pregnancy (excluding terms that are extracted with their specific labels, such as "Labour_Delivery" etc.). - `ImagingFindings`: All mentions of radiographic and imaging findings. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments. - `Medical_Device`: All mentions related to medical devices and supplies. - `Race_Ethnicity`: All terms that refer to racial and national origin of sociocultural groups. 
- `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital signs Headers are extracted separately with their specific labels). - `Symptom`: All the symptoms mentioned in the document, of a patient or someone else. - `Treatment`: Includes therapeutic and minimally invasive treatment and procedures (invasive treatments or procedures are extracted as "Procedure"). - `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs). - `Route`: Drug and medication administration routes available described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_Ingredient`: Active ingredient/s found in drug products. - `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted. - `Diet`: All mentions and information regarding the patient's dietary habits. - `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by the naked eye. - `LDL`: All mentions related to the lab test and results for LDL (Low Density Lipoprotein). - `VS_Finding`: Qualitative data (e.g. Fever, Cyanosis, Tachycardia) and any other symptoms that refer to vital signs. - `Allergen`: Allergen related extractions mentioned in the document. - `EKG_Findings`: All mentions of EKG readings. - `Imaging_Technique`: All mentions of special radiographic views or special imaging techniques used in radiology. - `Triglycerides`: All terms related to the specific lab test for triglycerides. - `RelativeTime`: Time references that are relative to different times or events (e.g. words such as "approximately", "in the morning"). - `Gender`: Gender-specific nouns and pronouns. 
- `Pulse`: Peripheral heart rate, without advanced information like measurement location. - `Social_History_Header`: Identifies section headers that correspond to Social History of a patient. - `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs). - `Diabetes`: All terms related to diabetes mellitus. - `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately. - `Internal_organ_or_component`: All mentions related to internal body parts or organs that cannot be examined by the naked eye. - `Clinical_Dept`: Terms that indicate the medical and/or surgical departments. - `Form`: Drug and medication forms available described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients. - `Strength`: Potency of one unit of drug (or a combination of drugs); the measurement units available are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Fetus_NewBorn`: All terms related to fetus, infant, new born (excluding terms that are extracted with their specific labels, such as "Labour_Delivery", "Pregnancy" etc.). - `RelativeDate`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "approximately two years ago", "about two days ago"). - `Height`: All mentions related to a patient's height. 
- `Test_Result`: Terms related to all the test results present in the document (clinical test results are included). - `Sexually_Active_or_Sexual_Orientation`: All terms that are related to sexuality, sexual orientations and sexual activity. - `Frequency`: Frequency of administration for a dose prescribed. - `Time`: Specific time references (hour and/or minutes). - `Weight`: All mentions related to a patient's weight. - `Vaccine`: Generic and brand name of vaccines or vaccination procedure. - `Vital_Signs_Header`: Identifies section headers that correspond to Vital Signs of a patient. - `Communicable_Disease`: Includes all mentions of communicable diseases. - `Dosage`: Quantity prescribed by the physician for an active ingredient; the measurement units available are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Overweight`: Terms related to the patient being overweight (BMI and Obesity are extracted separately). - `Hypertension`: All terms related to Hypertension (quantitative data such as 150/100 is extracted as Blood_Pressure). - `HDL`: Terms related to the lab test for HDL (High Density Lipoprotein). - `Total_Cholesterol`: Terms related to the lab test and results for cholesterol. - `Smoking`: All mentions of smoking status of a patient. - `Date`: Mentions of an exact date, in any format, including day number, month and/or year. 
## Predicted Entities `Injury_or_Poisoning`, `Direction`, `Test`, `Admission_Discharge`, `Death_Entity`, `Relationship_Status`, `Duration`, `Respiration`, `Hyperlipidemia`, `Birth_Entity`, `Age`, `Labour_Delivery`, `Family_History_Header`, `BMI`, `Temperature`, `Alcohol`, `Kidney_Disease`, `Oncological`, `Medical_History_Header`, `Cerebrovascular_Disease`, `Oxygen_Therapy`, `O2_Saturation`, `Psychological_Condition`, `Heart_Disease`, `Employment`, `Obesity`, `Disease_Syndrome_Disorder`, `Pregnancy`, `ImagingFindings`, `Procedure`, `Medical_Device`, `Race_Ethnicity`, `Section_Header`, `Symptom`, `Treatment`, `Substance`, `Route`, `Drug_Ingredient`, `Blood_Pressure`, `Diet`, `External_body_part_or_region`, `LDL`, `VS_Finding`, `Allergen`, `EKG_Findings`, `Imaging_Technique`, `Triglycerides`, `RelativeTime`, `Gender`, `Pulse`, `Social_History_Header`, `Substance_Quantity`, `Diabetes`, `Modifier`, `Internal_organ_or_component`, `Clinical_Dept`, `Form`, `Drug_BrandName`, `Strength`, `Fetus_NewBorn`, `RelativeDate`, `Height`, `Test_Result`, `Sexually_Active_or_Sexual_Orientation`, `Frequency`, `Time`, `Weight`, `Vaccine`, `Vaccine_Name`, `Vital_Signs_Header`, `Communicable_Disease`, `Dosage`, `Overweight`, `Hypertension`, `HDL`, `Total_Cholesterol`, `Smoking`, `Date` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_emb_clinical_large_en_4.3.2_3.0_1681313273872.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_emb_clinical_large_en_4.3.2_3.0_1681313273872.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_jsl_emb_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") ner_pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). Additionally, there is no side effect observed after Influenza vaccine. One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature. 
"""]]).toDF("text") result = ner_pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val jsl_ner = MedicalNerModel.pretrained("ner_jsl_emb_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val jsl_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val jsl_ner_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter)) val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). Additionally, there is no side effect observed after Influenza vaccine. One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text") val result = jsl_ner_pipeline.fit(data).transform(data) ```
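A quick way to summarize output like the chunk table in the Results section is to tally extracted chunks per NER label. The sketch below is framework-free; the (chunk, label) pairs are a small sample copied from that table, not live model output.

```python
from collections import Counter

# (chunk, ner_label) pairs sampled from the Results table; not live output.
chunks = [
    ("21-day-old", "Age"), ("Caucasian", "Race_Ethnicity"),
    ("male", "Gender"), ("mom", "Gender"), ("she", "Gender"),
    ("congestion", "Symptom"), ("discharge", "Symptom"),
    ("perioral cyanosis", "Symptom"),
]

label_counts = Counter(label for _, label in chunks)
print(label_counts["Gender"])  # 3
```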
## Results ```bash +-----------------------------------------+-----+---+----------------------------+ |chunk |begin|end|ner_label | +-----------------------------------------+-----+---+----------------------------+ |21-day-old |17 |26 |Age | |Caucasian |28 |36 |Race_Ethnicity | |male |38 |41 |Gender | |for 2 days |48 |57 |Duration | |congestion |62 |71 |Symptom | |mom |75 |77 |Gender | |suctioning |88 |97 |Modifier | |yellow |99 |104|Modifier | |discharge |106 |114|Symptom | |nares |135 |139|External_body_part_or_region| |she |147 |149|Gender | |mild |168 |171|Modifier | |problems with his breathing while feeding|173 |213|Symptom | |perioral cyanosis |237 |253|Symptom | |retractions |258 |268|Symptom | |Influenza vaccine |325 |341|Vaccine_Name | |One day ago |344 |354|RelativeDate | |mom |357 |359|Gender | |tactile temperature |376 |394|Symptom | |Tylenol |417 |423|Drug_BrandName | +-----------------------------------------+-----+---+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_emb_clinical_large| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, word_embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|1.2 MB| ## Benchmarking ```bash label precision recall f1-score support Internal_organ_or_component 0.88 0.90 0.89 10419 Injury_or_Poisoning 0.90 0.78 0.83 945 Diabetes 0.99 0.97 0.98 146 Drug_Ingredient 0.92 0.92 0.92 1988 Frequency 0.91 0.91 0.91 1110 Height 0.97 0.88 0.92 81 Disease_Syndrome_Disorder 0.83 0.90 0.86 4909 Strength 0.96 0.90 0.93 848 Form 0.85 0.79 0.82 261 Symptom 0.87 0.80 0.83 11966 Route 0.91 0.92 0.91 976 Procedure 0.88 0.88 0.88 6395 Gender 0.99 0.99 0.99 5686 RelativeTime 0.82 0.68 0.75 367 Vaccine 0.50 0.14 0.22 14 Psychological_Condition 0.89 0.70 0.78 186 Direction 0.89 0.92 0.90 4447 External_body_part_or_region 0.88 0.84 0.86 3246 Section_Header 0.98 0.96 0.97 9564 Age 0.90 0.92 0.91 750 Modifier 
0.85 0.76 0.81 3027 Heart_Disease 0.97 0.82 0.89 849 Drug_BrandName 0.92 0.93 0.92 1011 Hyperlipidemia 0.93 0.84 0.88 31 Test 0.89 0.83 0.86 4337 Oncological 0.93 0.94 0.94 781 Labour_Delivery 0.78 0.68 0.73 158 Clinical_Dept 0.94 0.91 0.93 1714 Treatment 0.84 0.78 0.81 347 Oxygen_Therapy 0.81 0.80 0.81 120 Duration 0.81 0.87 0.84 927 Admission_Discharge 0.94 0.94 0.94 343 RelativeDate 0.91 0.86 0.89 1403 Hypertension 0.87 0.97 0.91 122 Employment 0.90 0.76 0.83 369 Dosage 0.87 0.84 0.85 461 Medical_Device 0.88 0.92 0.90 5499 Test_Result 0.85 0.78 0.81 1321 Time 0.73 0.65 0.69 34 Date 0.97 0.94 0.95 591 Obesity 0.88 1.00 0.94 45 Race_Ethnicity 0.99 0.99 0.99 120 Imaging_Technique 0.75 0.42 0.54 91 ImagingFindings 0.69 0.40 0.50 291 Cerebrovascular_Disease 0.85 0.65 0.74 133 Diet 0.78 0.52 0.62 114 Fetus_NewBorn 0.75 0.53 0.62 180 Kidney_Disease 0.95 0.94 0.94 168 Weight 0.90 0.91 0.91 243 Blood_Pressure 0.84 0.84 0.84 336 Pulse 0.83 0.96 0.89 311 Temperature 0.88 0.96 0.92 182 O2_Saturation 0.90 0.64 0.75 95 VS_Finding 0.72 0.74 0.73 311 Death_Entity 0.79 0.66 0.72 50 Total_Cholesterol 0.72 0.87 0.79 30 Substance 0.94 0.89 0.92 103 Relationship_Status 0.93 0.81 0.87 48 Alcohol 0.92 0.87 0.90 84 Vital_Signs_Header 0.93 0.99 0.96 656 Respiration 0.94 0.95 0.95 156 Family_History_Header 0.97 0.99 0.98 224 Pregnancy 0.82 0.69 0.75 203 Smoking 0.98 0.98 0.98 109 Vaccine_Name 0.89 0.55 0.68 31 EKG_Findings 0.64 0.27 0.38 154 Allergen 0.60 0.75 0.67 12 Medical_History_Header 0.95 0.96 0.95 411 Social_History_Header 0.91 0.97 0.94 213 Overweight 0.83 0.83 0.83 6 Communicable_Disease 0.73 0.51 0.60 47 Birth_Entity 0.00 0.00 0.00 6 Triglycerides 1.00 1.00 1.00 4 HDL 0.62 1.00 0.77 5 LDL 1.00 1.00 1.00 5 BMI 1.00 1.00 1.00 17 Sexually_Active_or_Sexual_Orientation 1.00 0.57 0.73 7 micro-avg 0.90 0.88 0.89 92950 macro-avg 0.86 0.81 0.83 92950 weighted-avg 0.90 0.88 0.89 92950 ``` --- layout: model title: Clinical Deidentification (Italian) author: John Snow Labs name: 
clinical_deidentification date: 2022-03-28 tags: [deidentification, pipeline, it, licensed] task: De-identification language: it edition: Healthcare NLP 3.4.2 spark_version: 2.4 supported: true recommended: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to de-identify Protected Health Information (PHI) in Italian medical texts. The pipeline can mask and obfuscate the following entities: `DATE`, `AGE`, `SEX`, `PROFESSION`, `ORGANIZATION`, `PHONE`, `E-MAIL`, `ZIP`, `STREET`, `CITY`, `COUNTRY`, `PATIENT`, `DOCTOR`, `HOSPITAL`, `MEDICALRECORD`, `SSN`, `IDNUM`, `ACCOUNT`, `PLATE`, `USERNAME`, `URL`, and `IPADDR`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_it_3.4.2_2.4_1648498695375.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_it_3.4.2_2.4_1648498695375.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "it", "clinical/models") sample = """RAPPORTO DI RICOVERO NOME: Lodovico Fibonacci CODICE FISCALE: MVANSK92F09W408A INDIRIZZO: Viale Burcardo 7 CITTÀ : Napoli CODICE POSTALE: 80139 DATA DI NASCITA: 03/03/1946 ETÀ: 70 anni SESSO: M EMAIL: lpizzo@tim.it DATA DI AMMISSIONE: 12/12/2016 DOTTORE: Eva Viviani RAPPORTO CLINICO: 70 anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno. È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale. L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV. L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale. L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml. INDIRIZZATO A: Dott. 
Bruno Ferrabosco - ASL Napoli 1 Centro, Dipartimento di Endocrinologia e Nutrizione - Stretto Scamarcio 320, 80138 Napoli EMAIL: bferrabosco@poste.it""" result = deid_pipeline.annotate(sample) ``` ```scala val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "it", "clinical/models") val sample = "RAPPORTO DI RICOVERO NOME: Lodovico Fibonacci CODICE FISCALE: MVANSK92F09W408A INDIRIZZO: Viale Burcardo 7 CITTÀ : Napoli CODICE POSTALE: 80139 DATA DI NASCITA: 03/03/1946 ETÀ: 70 anni SESSO: M EMAIL: lpizzo@tim.it DATA DI AMMISSIONE: 12/12/2016 DOTTORE: Eva Viviani RAPPORTO CLINICO: 70 anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno. È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale. L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV. L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale. L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml. INDIRIZZATO A: Dott.
Bruno Ferrabosco - ASL Napoli 1 Centro, Dipartimento di Endocrinologia e Nutrizione - Stretto Scamarcio 320, 80138 Napoli EMAIL: bferrabosco@poste.it" val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("it.deid.clinical").predict("""RAPPORTO DI RICOVERO NOME: Lodovico Fibonacci CODICE FISCALE: MVANSK92F09W408A INDIRIZZO: Viale Burcardo 7 CITTÀ : Napoli CODICE POSTALE: 80139 DATA DI NASCITA: 03/03/1946 ETÀ: 70 anni SESSO: M EMAIL: lpizzo@tim.it DATA DI AMMISSIONE: 12/12/2016 DOTTORE: Eva Viviani RAPPORTO CLINICO: 70 anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno. È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale. L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV. L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale. L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml. INDIRIZZATO A: Dott. Bruno Ferrabosco - ASL Napoli 1 Centro, Dipartimento di Endocrinologia e Nutrizione - Stretto Scamarcio 320, 80138 Napoli EMAIL: bferrabosco@poste.it""") ```
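The Results section below shows four replacement policies for each detected PHI span: entity-label masking, same-length character masking, fixed-length character masking, and obfuscation with realistic surrogates. As a rough illustration of the first three (plain Python, not the Spark NLP implementation; the angle-bracket label format is an assumption), the policies differ only in what string replaces the span:

```python
# Conceptual sketch of the masking policies shown in the Results section.
def mask_with_label(text, start, end, label):
    # the span is replaced by its entity label, e.g. "Eva Viviani" -> "<DOCTOR>"
    return text[:start] + f"<{label}>" + text[end:]

def mask_with_chars(text, start, end):
    # same-length mask: brackets plus asterisks preserve the span's width
    n = end - start
    return text[:start] + "[" + "*" * (n - 2) + "]" + text[end:]

def mask_fixed_length(text, start, end):
    # every span collapses to the same four-character mask
    return text[:start] + "****" + text[end:]

text = "DOTTORE: Eva Viviani"
start, end = 9, len(text)
assert mask_with_label(text, start, end, "DOCTOR") == "DOTTORE: <DOCTOR>"
assert mask_with_chars(text, start, end) == "DOTTORE: [*********]"
assert mask_fixed_length(text, start, end) == "DOTTORE: ****"
```

Obfuscation, the fourth policy, instead substitutes a plausible fake value of the same entity type (a different doctor name, a shifted date), which is why the obfuscated output below remains readable.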
## Results ```bash Masked with entity labels ------------------------------ RAPPORTO DI RICOVERO NOME: CODICE FISCALE: INDIRIZZO: CITTÀ : CODICE POSTALE: DATA DI NASCITA: ETÀ: anni SESSO: EMAIL: DATA DI AMMISSIONE: DOTTORE: RAPPORTO CLINICO: anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno. È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale. L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV. L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale. L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml. INDIRIZZATO A: Dott. - , Dipartimento di Endocrinologia e Nutrizione - , EMAIL: Masked with chars ------------------------------ RAPPORTO DI RICOVERO NOME: [****************] CODICE FISCALE: [**************] INDIRIZZO: [**************] CITTÀ : [****] CODICE POSTALE: [***]DATA DI NASCITA: [********] ETÀ: **anni SESSO: * EMAIL: [***********] DATA DI AMMISSIONE: [********] DOTTORE: [*********] RAPPORTO CLINICO: **anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno. 
È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale. L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV. L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale. L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml. INDIRIZZATO A: Dott. [**************] - [*****************], Dipartimento di Endocrinologia e Nutrizione - [*******************], [***] [****] EMAIL: [******************] Masked with fixed length chars ------------------------------ RAPPORTO DI RICOVERO NOME: **** CODICE FISCALE: **** INDIRIZZO: **** CITTÀ : **** CODICE POSTALE: ****DATA DI NASCITA: **** ETÀ: **** anni SESSO: **** EMAIL: **** DATA DI AMMISSIONE: **** DOTTORE: **** RAPPORTO CLINICO: **** anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno. È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale. L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV. L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale. 
L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml. INDIRIZZATO A: Dott. **** - ****, Dipartimento di Endocrinologia e Nutrizione - ****, **** **** EMAIL: **** Obfuscated ------------------------------ RAPPORTO DI RICOVERO NOME: Scotto-Polani CODICE FISCALE: ECI-QLN77G15L455Y INDIRIZZO: Viale Orlando 808 CITTÀ : Sesto Raimondo CODICE POSTALE: 53581DATA DI NASCITA: 09/03/1946 ETÀ: 5 anni SESSO: U EMAIL: HenryWatson@world.com DATA DI AMMISSIONE: 10/01/2017 DOTTORE: Sig. Fredo Marangoni RAPPORTO CLINICO: 5 anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno. È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale. L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV. L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale. L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml. INDIRIZZATO A: Dott. Antonio Rusticucci - ASL 7 DI CARBONIA AZIENDA U.S.L. N. 
7, Dipartimento di Endocrinologia e Nutrizione - Via Giorgio 0 Appartamento 26, 03461 Sesto Raimondo EMAIL: murat.g@jsl.com ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Language:|it| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ContextualParserModel - ContextualParserModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: English BertForQuestionAnswering model (from ruselkomp) author: John Snow Labs name: bert_qa_tests_finetuned_squad_test_bert date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tests-finetuned-squad-test-bert` is an English model originally trained by `ruselkomp`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_tests_finetuned_squad_test_bert_en_4.0.0_3.0_1654192233493.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_tests_finetuned_squad_test_bert_en_4.0.0_3.0_1654192233493.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tests_finetuned_squad_test_bert","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_tests_finetuned_squad_test_bert","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_ruselkomp").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
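In the NLU one-liner above, the question and its context travel in a single string separated by `|||`. A sketch of how such a packed pair can be split apart again (plain Python, illustrating the convention rather than NLU internals):

```python
# Split a "question|||context" pair as used by the NLU predict() convention.
def split_qa_pair(packed: str, sep: str = "|||"):
    question, _, context = packed.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa_pair("What's my name?|||My name is Clara and I live in Berkeley.")
assert q == "What's my name?"
assert c == "My name is Clara and I live in Berkeley."
```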
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_tests_finetuned_squad_test_bert| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|1.6 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ruselkomp/tests-finetuned-squad-test-bert --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_512_finetuned_squad_seed_8 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-512-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_512_finetuned_squad_seed_8_en_4.3.0_3.0_1674215724845.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_512_finetuned_squad_seed_8_en_4.3.0_3.0_1674215724845.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_512_finetuned_squad_seed_8","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_512_finetuned_squad_seed_8","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_512_finetuned_squad_seed_8| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|432.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-512-finetuned-squad-seed-8 --- layout: model title: Legal NER for NDA (Assignment Clause) author: John Snow Labs name: legner_nda_assigment date: 2023-04-07 tags: [en, licensed, legal, ner, nda, assigment] task: Named Entity Recognition language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a NER model, aimed to be run **only** after detecting the `ASSIGNMENT` clause with a proper classifier (use legmulticlf_mnda_sections_paragraph_other for that purpose). It will extract the following entity: `ASSIGN_EXCEPTION` ## Predicted Entities `ASSIGN_EXCEPTION` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_nda_assigment_en_1.0.0_3.0_1680829603099.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_nda_assigment_en_1.0.0_3.0_1680829603099.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_nda_assigment", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""Any attempted or purported assignment of this Agreement by either party without the prior written consent of the other party shall be null and void."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
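The `NerConverter` in the pipeline above merges token-level IOB tags (`B-ASSIGN_EXCEPTION`, `I-ASSIGN_EXCEPTION`, as listed in the Benchmarking section) into the labelled chunks shown in the Results table. A minimal pure-Python sketch of that merge, not the Spark NLP implementation:

```python
# Group B-/I- prefixed IOB tags into (chunk_text, label) pairs.
def iob_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                      # close any open chunk
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]  # start a new chunk
        elif tag.startswith("I-") and current:
            current.append(tok)              # continue the open chunk
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["without", "the", "prior", "written", "consent"]
tags   = ["O", "O", "O", "B-ASSIGN_EXCEPTION", "I-ASSIGN_EXCEPTION"]
assert iob_to_chunks(tokens, tags) == [("written consent", "ASSIGN_EXCEPTION")]
```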
## Results ```bash +---------------+----------------+ |chunk |ner_label | +---------------+----------------+ |written consent|ASSIGN_EXCEPTION| +---------------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_nda_assigment| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References In-house annotations on Non-disclosure Agreements ## Benchmarking ```bash label precision recall f1-score support B-ASSIGN_EXCEPTION 0.96 0.96 0.96 24 I-ASSIGN_EXCEPTION 0.94 0.94 0.94 17 micro-avg 0.95 0.95 0.95 41 macro-avg 0.95 0.95 0.95 41 weighted-avg 0.95 0.95 0.95 41 ``` --- layout: model title: Arabic ALBERT Embeddings (Base) author: John Snow Labs name: albert_embeddings_albert_base_arabic date: 2022-04-14 tags: [albert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: AlBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `albert-base-arabic` is an Arabic model originally trained by `asafaya`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_base_arabic_ar_3.4.2_3.0_1649954284852.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_base_arabic_ar_3.4.2_3.0_1649954284852.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_base_arabic","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_base_arabic","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.albert").predict("""أنا أحب شرارة NLP""") ```
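Each token in the `embeddings` output column carries a dense vector (typically 768-dimensional for a base ALBERT model). A common downstream use is comparing tokens by cosine similarity; a self-contained sketch with toy 3-d vectors standing in for the real ones:

```python
import math

# Cosine similarity between two equal-length vectors.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Identical vectors score 1.0; orthogonal vectors score 0.0.
assert abs(cosine([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]) - 1.0) < 1e-9
assert abs(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])) < 1e-9
```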
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_embeddings_albert_base_arabic| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|44.9 MB| |Case sensitive:|false| ## References - https://huggingface.co/asafaya/albert-base-arabic - https://oscar-corpus.com/ - http://commoncrawl.org/ - https://dumps.wikimedia.org/backup-index.html - https://github.com/google-research/albert - https://www.tensorflow.org/tfrc - https://github.com/KUIS-AI-Lab/Arabic-ALBERT/ --- layout: model title: English asr_wav2vec2_base_timit_demo_colab647 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab647 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab647` is an English model originally trained by hassnain. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab647_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab647_en_4.2.0_3.0_1664022843025.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab647_en_4.2.0_3.0_1664022843025.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab647", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab647", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
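The snippets above assume an existing `audioDf` whose `audio_content` column holds one float array per row. Wav2Vec2 models consume 16 kHz mono audio as normalised floats; one way to produce such an array from a 16-bit mono WAV file with only the Python standard library (a sketch, not the only option):

```python
import struct
import wave

# Decode a 16-bit mono WAV into a list of floats in [-1.0, 1.0).
def wav_to_floats(path):
    with wave.open(path, "rb") as w:
        assert w.getnchannels() == 1 and w.getsampwidth() == 2  # 16-bit mono
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]  # normalise int16 to floats
```

A row for `audioDf` could then be built as, for example, `spark.createDataFrame([[wav_to_floats("sample.wav")]]).toDF("audio_content")` (the file name is hypothetical; resample to 16 kHz beforehand if the source rate differs).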
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab647| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: Telugu RoBERTa Embeddings (from neuralspace-reverie) author: John Snow Labs name: roberta_embeddings_indic_transformers_te_roberta date: 2022-04-14 tags: [roberta, embeddings, te, open_source] task: Embeddings language: te edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `indic-transformers-te-roberta` is a Telugu model originally trained by `neuralspace-reverie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indic_transformers_te_roberta_te_3.4.2_3.0_1649947854411.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_indic_transformers_te_roberta_te_3.4.2_3.0_1649947854411.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indic_transformers_te_roberta","te") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["నేను స్పార్క్ nlp ను ప్రేమిస్తున్నాను"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_indic_transformers_te_roberta","te") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("నేను స్పార్క్ nlp ను ప్రేమిస్తున్నాను").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("te.embed.indic_transformers_te_roberta").predict("""నేను స్పార్క్ nlp ను ప్రేమిస్తున్నాను""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_indic_transformers_te_roberta| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|te| |Size:|314.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-te-roberta - https://oscar-corpus.com/ --- layout: model title: Fast Neural Machine Translation Model from English to Bulgarian author: John Snow Labs name: opus_mt_en_bg date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, bg, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. - source languages: `en` - target languages: `bg` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_bg_xx_2.7.0_2.4_1609166567271.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_bg_xx_2.7.0_2.4_1609166567271.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_bg", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_bg", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.bg').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_bg| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from madlag) author: John Snow Labs name: bert_qa_bert_base_uncased_squad1.1_block_sparse_0.13_v1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squad1.1-block-sparse-0.13-v1` is an English model originally trained by `madlag`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squad1.1_block_sparse_0.13_v1_en_4.0.0_3.0_1654181428714.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squad1.1_block_sparse_0.13_v1_en_4.0.0_3.0_1654181428714.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_squad1.1_block_sparse_0.13_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_squad1.1_block_sparse_0.13_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased_1_block_sparse_0.20_v1.by_madlag").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")  ```
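In the nlu snippet above, the question and context travel as a single string joined by `|||`. A toy sketch of splitting such an input back into its two parts (an illustration of the convention, not nlu's internal code):

```python
def split_qa_input(joined: str, sep: str = "|||"):
    # Split a "question|||context" string into its two parts
    question, _, context = joined.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa_input("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
```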
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_squad1.1_block_sparse_0.13_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|155.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/madlag/bert-base-uncased-squad1.1-block-sparse-0.13-v1 - https://rajpurkar.github.io/SQuAD-explorer - https://www.aclweb.org/anthology/N19-1423.pdf - https://www.aclweb.org/anthology/N19-1423/ - https://arxiv.org/abs/2005.07683 --- layout: model title: English asr_wav2vec2_base_100h_with_lm_by_saahith TFWav2Vec2ForCTC from saahith author: John Snow Labs name: asr_wav2vec2_base_100h_with_lm_by_saahith date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_with_lm_by_saahith` is an English model originally trained by saahith. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_100h_with_lm_by_saahith_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_100h_with_lm_by_saahith_en_4.2.0_3.0_1664117796201.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_100h_with_lm_by_saahith_en_4.2.0_3.0_1664117796201.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_100h_with_lm_by_saahith", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_100h_with_lm_by_saahith", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
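Under the hood, a Wav2Vec2ForCTC model emits one label per audio frame, and CTC decoding merges repeated labels and drops the blank symbol to produce the final transcript. A framework-free toy sketch of that greedy collapse step (the `_` blank symbol and the frame labels are illustrative assumptions, not real model output):

```python
def ctc_greedy_collapse(frame_labels, blank="_"):
    # Merge consecutive duplicate labels, then remove blanks
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return "".join(out)

# Toy per-frame labels; blanks separate repeated characters ("ll" survives)
frames = list("hh_e_ll_llo__")
print(ctc_greedy_collapse(frames))  # hello
```

The pretrained annotator performs this decoding internally; the sketch only illustrates why repeated frames do not produce repeated letters.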
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_100h_with_lm_by_saahith| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|227.9 MB| --- layout: model title: English asr_english_filipino_wav2vec2_l_xls_r_test_04 TFWav2Vec2ForCTC from Khalsuu author: John Snow Labs name: asr_english_filipino_wav2vec2_l_xls_r_test_04 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_english_filipino_wav2vec2_l_xls_r_test_04` is an English model originally trained by Khalsuu. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_english_filipino_wav2vec2_l_xls_r_test_04_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_english_filipino_wav2vec2_l_xls_r_test_04_en_4.2.0_3.0_1664108841524.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_english_filipino_wav2vec2_l_xls_r_test_04_en_4.2.0_3.0_1664108841524.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_english_filipino_wav2vec2_l_xls_r_test_04", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_english_filipino_wav2vec2_l_xls_r_test_04", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_english_filipino_wav2vec2_l_xls_r_test_04| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Deidentify RB author: John Snow Labs name: deidentify_rb class: DeIdentificationModel language: en nav_key: models repository: clinical/models date: 2019-06-04 task: De-identification edition: Healthcare NLP 2.0.2 spark_version: 2.4 tags: [clinical,licensed,en] supported: true annotator: DeIdentificationModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Anonymization and de-identification model based on the outputs of de-identification NER models and replacement dictionaries. ## Predicted Entities Personal information entities to be de-identified. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/deidentify_rb_en_2.0.2_2.4_1559672122511.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/deidentify_rb_en_2.0.2_2.4_1559672122511.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) masker = DeIdentificationModel.pretrained("deidentify_rb","en","clinical/models")\ .setInputCols("sentence","token","chunk")\ .setOutputCol("deidentified")\ .setMode("mask") text = '''A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street''' result = model.transform(spark.createDataFrame([[text]]).toDF("text")) deid_text = masker.transform(result) ``` ```scala ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street").toDF("text") val result = pipeline.fit(data).transform(data) val masker = DeIdentificationModel.pretrained("deidentify_rb","en","clinical/models") .setInputCols(Array("sentence", "token", "chunk")) .setOutputCol("deidentified") .setMode("mask") val deid_text = masker.transform(result) ``` {:.nlu-block} ```python import nlu nlu.load("en.de_identify").predict("""A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street""") ```
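The masker stage replaces detected PHI chunks in place. A toy regex sketch of the same masking idea, limited to dates only (purely illustrative; the model's actual rules and replacement dictionaries are far richer and driven by NER output, not regexes):

```python
import re

# ISO-style YYYY-MM-DD dates or MM/DD/YY dates
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{2}/\d{2}/\d{2}\b")

def mask_dates(text: str, token: str = "<DATE>") -> str:
    # Replace every matched date with a mask token
    return DATE_PATTERN.sub(token, text)

print(mask_dates("Record date : 2093-01-13 , Date : 01/13/93"))
# Record date : <DATE> , Date : <DATE>
```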
{:.h2_title} ## Results ```bash | | sentence | deidentified | |---|---------------------------------------------------------------------------------------|-----------------------------------------------------------------------------| | 0 | A . | A . | | 1 | Record date : 2093-01-13 , David Hale , M.D . | Record date : , David Hale , M.D . | | 2 | , Name : Hendrickson , Ora MR . | , Name : Hendrickson , Ora MR . | | 3 | # 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . | # Date : PCP : Oliveira , 25 years-old , Record date : . | | 4 | Cocke County Baptist Hospital . | Cocke County Baptist Hospital . | | 5 | 0295 Keats Street | Keats Street | ``` {:.model-param} ## Model Information {:.table-model} |---------------|------------------------| | Name: | deidentify_rb | | Type: | DeIdentificationModel | | Compatibility: | Spark NLP 2.0.2+ | | License: | Licensed | | Edition: | Official | |Input labels: | [document, token, chunk] | |Output labels: | [document] | | Language: | en | | Dependencies: | ner_deid | {:.h2_title} ## Data Source Rule based DeIdentifier based on `ner_deid`. --- layout: model title: Detect Clinical Relations author: John Snow Labs name: re_clinical class: RelationExtractionModel language: en nav_key: models repository: clinical/models date: 2020-09-24 task: Relation Extraction edition: Healthcare NLP 2.5.5 spark_version: 2.4 tags: [clinical,licensed,relation extraction,en] supported: true annotator: RelationExtractionModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Relation Extraction model based on syntactic features using deep learning. Models the set of clinical relations defined in the 2010 ``i2b2`` relation challenge. 
`TrIP`: A certain treatment has improved or cured a medical problem (eg, ‘infection resolved with antibiotic course’) `TrWP`: A patient's medical problem has deteriorated or worsened because of or in spite of a treatment being administered (eg, ‘the tumor was growing despite the drain’) `TrCP`: A treatment caused a medical problem (eg, ‘penicillin causes a rash’) `TrAP`: A treatment administered for a medical problem (eg, ‘Dexamphetamine for narcolepsy’) `TrNAP`: The administration of a treatment was avoided because of a medical problem (eg, ‘Ralafen which is contra-indicated because of ulcers’) `TeRP`: A test has revealed some medical problem (eg, ‘an echocardiogram revealed a pericardial effusion’) `TeCP`: A test was performed to investigate a medical problem (eg, ‘chest x-ray done to rule out pneumonia’) `PIP`: Two problems are related to each other (eg, ‘Azotemia presumed secondary to sepsis’) {:.h2_title} ## Predicted Entities `TrIP`, `TrWP`, `TrCP`, `TrAP`, `TrNAP`, `TeRP`, `TeCP`, `PIP` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CLINICAL/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_clinical_en_2.5.5_2.4_1600987935304.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_clinical_en_2.5.5_2.4_1600987935304.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use The table below lists the `re_clinical` RE model, its labels, the optimal NER model, and the meaningful relation pairs. 
| RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS | |:-----------:|----------------------------------------|:------------:|---------------------------| | re_clinical | TrIP,TrWP,TrCP,TrAP,TrNAP,TeRP,TeCP,PIP | ner_clinical | [“No need to set pairs.”] |
{% include programmingLanguageSelectScalaPython.html %} ```python ... reModel = RelationExtractionModel.pretrained("re_clinical","en","clinical/models")\ .setInputCols(["word_embeddings","chunk","pos","dependency"])\ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, words_embedder, pos_tagger, ner_tagger, ner_chunker, dependency_parser, reModel]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = LightPipeline(model).fullAnnotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation. Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . 
However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge .""") ``` ```scala ... val reModel = RelationExtractionModel.pretrained("re_clinical","en","clinical/models") .setInputCols("word_embeddings","chunk","pos","dependency") .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, words_embedder, pos_tagger, ner_tagger, ner_chunker, dependency_parser, reModel)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . 
Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation. Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . 
It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge .").toDF("text") val result = pipeline.fit(data).transform(data) ```
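Relation extraction classifies candidate pairs of NER chunks; since no pair restriction is set for `re_clinical`, every pair of chunks within a sentence is a candidate. A toy sketch of forming those candidate pairs (the chunk data is illustrative, not real NER output):

```python
from itertools import combinations

def candidate_pairs(chunks):
    # chunks: list of (text, entity_label) tuples from one sentence;
    # every unordered pair becomes a classification candidate
    return list(combinations(chunks, 2))

chunks = [("amoxicillin", "TREATMENT"),
          ("a respiratory tract infection", "PROBLEM"),
          ("metformin", "TREATMENT")]
pairs = candidate_pairs(chunks)
print(len(pairs))  # 3 candidate pairs from 3 chunks
```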
{:.h2_title} ## Results ```bash | relation | entity1 | chunk1 | entity2 | chunk2 | confidence | |----------|-----------|---------------------------------|------------|-------------------------------|------------| | TrAP | PROBLEM | T2DM | TREATMENT | atorvastatin | 0.99955326 | | TrWP | TEST | blood samples | PROBLEM | significant lipemia | 0.99998724 | | TeRP | TEST | the anion gap | PROBLEM | still elevated | 0.9965193 | | TrAP | TEST | analysis | PROBLEM | interference from turbidity | 0.9676019 | | TrWP | TREATMENT | an insulin drip | PROBLEM | a reduction in the anion gap | 0.94099987 | | TeRP | PROBLEM | a reduction in the anion gap | TEST | triglycerides | 0.9956793 | | TeRP | PROBLEM | her respiratory tract infection | TREATMENT | SGLT2 inhibitor | 0.997498 | ``` {:.model-param} ## Model Information {:.table-model} |----------------|-----------------------------------------| | Name: | re_clinical | | Type: | RelationExtractionModel | | Compatibility: | Spark NLP 2.5.5+ | | License: | Licensed | |Edition:|Official| | |Input labels: | [word_embeddings, chunk, pos, dependency] | |Output labels: | [category] | | Language: | en | | Case sensitive: | False | | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained on data gathered and manually annotated by John Snow Labs https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ ## Benchmarking ```bash label precision recall f1-score O 0.96 0.93 0.94 TeRP 0.91 0.94 0.92 PIP 0.86 0.92 0.89 TrAP 0.81 0.92 0.86 TrCP 0.56 0.55 0.55 TeCP 0.57 0.49 0.53 accuracy - - 0.88 macro-avg 0.65 0.59 0.60 weighted-avg 0.87 0.88 0.87 ``` --- layout: model title: Translate Urdu to English Pipeline author: John Snow Labs name: translate_ur_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ur, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: 
"Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `ur` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ur_en_xx_2.7.0_2.4_1609687561243.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ur_en_xx_2.7.0_2.4_1609687561243.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ur_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ur_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ur.translate_to.en').predict(text, output_level='sentence') translate_df ```
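Because Marian is computationally expensive on long inputs, translation pipelines split text into sentences before translating and then translate each sentence separately. A toy regex-based splitter illustrating that preprocessing step (the pretrained pipeline uses a proper trained sentence detector, not this regex):

```python
import re

def naive_sentence_split(text: str):
    # Split after sentence-final punctuation followed by whitespace,
    # keeping the punctuation attached to each sentence
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sents = naive_sentence_split("This is long. Translate each piece! Then join.")
print(len(sents))  # 3
```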
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ur_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from pinecone) author: John Snow Labs name: bert_qa_bert_reader_squad2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-reader-squad2` is an English model originally trained by `pinecone`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_reader_squad2_en_4.0.0_3.0_1654184697176.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_reader_squad2_en_4.0.0_3.0_1654184697176.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_reader_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_reader_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.by_pinecone").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_reader_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/pinecone/bert-reader-squad2 --- layout: model title: English DistilBertForTokenClassification Cased model (from m3hrdadfi) author: John Snow Labs name: distilbert_tok_classifier_typo_detector date: 2023-03-03 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `typo-detector-distilbert-en` is an English model originally trained by `m3hrdadfi`. ## Predicted Entities `TYPO` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_tok_classifier_typo_detector_en_4.3.1_3.0_1677881945749.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_tok_classifier_typo_detector_en_4.3.1_3.0_1677881945749.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_tok_classifier_typo_detector","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_tok_classifier_typo_detector","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
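The classifier tags each token with `TYPO` or `O`; a typical post-processing step collects the flagged tokens for correction. A toy sketch with made-up predictions (illustrative data, not real model output):

```python
def flagged_tokens(tokens, labels, target="TYPO"):
    # Keep tokens whose predicted label matches the target entity
    return [tok for tok, lab in zip(tokens, labels) if lab == target]

tokens = ["He", "had", "also", "hurt", "his", "shoulde"]
labels = ["O", "O", "O", "O", "O", "TYPO"]
print(flagged_tokens(tokens, labels))  # ['shoulde']
```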
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_tok_classifier_typo_detector| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/m3hrdadfi/typo-detector-distilbert-en - https://github.com/neuspell/neuspell - https://github.com/m3hrdadfi/typo-detector/issues --- layout: model title: Pipeline to Detect Anatomical Regions author: John Snow Labs name: ner_anatomy_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_anatomy](https://nlp.johnsnowlabs.com/2021/03/31/ner_anatomy_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_pipeline_en_3.4.1_3.0_1647872855454.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_pipeline_en_3.4.1_3.0_1647872855454.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_anatomy_pipeline", "en", "clinical/models") pipeline.annotate("This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now. General: Well-developed female, in no acute distress, afebrile. HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist. Chest: Clear. Abdomen: Positive bowel sounds and soft. Dermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.") ``` ```scala val pipeline = new PretrainedPipeline("ner_anatomy_pipeline", "en", "clinical/models") val result = pipeline.annotate("This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now. General: Well-developed female, in no acute distress, afebrile. HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. 
A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist. Chest: Clear. Abdomen: Positive bowel sounds and soft. Dermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.anatom.pipeline").predict("""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now. General: Well-developed female, in no acute distress, afebrile. HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist. Chest: Clear. Abdomen: Positive bowel sounds and soft. Dermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.""") ```
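`annotate()` returns a plain Python dict keyed by the pipeline's output columns. As a hedged illustration of consuming it, the dict below is a hand-made stand-in echoing the Results table; the real keys and where labels live (often in annotation metadata) depend on the pipeline:

```python
# Illustrative stand-in for a PretrainedPipeline.annotate() result;
# real output columns depend on the pipeline's configuration.
annotations = {
    "ner_chunk": ["skin", "Extraocular", "muscles"],
    "ner_label": ["Organ", "Multi-tissue_structure", "Organ"],
}

# Pair each recognized chunk with its entity label.
pairs = list(zip(annotations["ner_chunk"], annotations["ner_label"]))
print(pairs)  # [('skin', 'Organ'), ('Extraocular', 'Multi-tissue_structure'), ('muscles', 'Organ')]
```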
## Results ```bash +----------------+----------------------+ |chunk |ner_label | +----------------+----------------------+ |skin |Organ | |Extraocular |Multi-tissue_structure| |muscles |Organ | |turbinates |Multi-tissue_structure| |Mucous membranes|Tissue | |bowel |Organ | |skin |Organ | +----------------+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_anatomy_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English image_classifier_vit_Check_GoodBad_Teeth ViTForImageClassification from steven123 author: John Snow Labs name: image_classifier_vit_Check_GoodBad_Teeth date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_Check_GoodBad_Teeth` is an English model originally trained by steven123. ## Predicted Entities `Bad Teeth`, `Good Teeth` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Check_GoodBad_Teeth_en_4.1.0_3.0_1660168301503.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Check_GoodBad_Teeth_en_4.1.0_3.0_1660168301503.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_Check_GoodBad_Teeth", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_Check_GoodBad_Teeth", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
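Conceptually, the classifier scores each image against the two classes and predicts the label with the highest softmax probability. A small self-contained sketch of that final step; the logit values below are invented for illustration:

```python
import math

def predict_label(logits, labels):
    """Softmax over raw class scores, then pick the highest-probability label."""
    exps = [math.exp(x - max(logits)) for x in logits]  # subtract max for numerical stability
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], round(probs[best], 3)

labels = ["Bad Teeth", "Good Teeth"]
print(predict_label([0.4, 2.1], labels))  # ('Good Teeth', 0.846)
```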
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_Check_GoodBad_Teeth| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Translate Tiv to English Pipeline author: John Snow Labs name: translate_tiv_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, tiv, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `tiv` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_tiv_en_xx_2.7.0_2.4_1609690646006.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_tiv_en_xx_2.7.0_2.4_1609690646006.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_tiv_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_tiv_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.tiv.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_tiv_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: French ALBERT Embeddings (from qwant) author: John Snow Labs name: albert_embeddings_fralbert_base date: 2022-04-14 tags: [albert, embeddings, fr, open_source] task: Embeddings language: fr edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: AlBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `fralbert-base` is a French model originally trained by `qwant`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_fralbert_base_fr_3.4.2_3.0_1649954264678.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_fralbert_base_fr_3.4.2_3.0_1649954264678.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = AlbertEmbeddings.pretrained("albert_embeddings_fralbert_base","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark Nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_fralbert_base","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark Nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fr.embed.albert").predict("""J'adore Spark Nlp""") ```
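The embeddings produced above are fixed-length vectors per token; a typical downstream use is comparing words by cosine similarity. A toy pure-Python sketch with invented 3-dimensional vectors (real `fralbert-base` embeddings are much larger):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented vectors: "chat" and "chien" should be closer than "chat" and "voiture".
v_chat, v_chien, v_voiture = [0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]
assert cosine(v_chat, v_chien) > cosine(v_chat, v_voiture)
```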
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_embeddings_fralbert_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|fr| |Size:|45.9 MB| |Case sensitive:|false| ## References - https://huggingface.co/qwant/fralbert-base - https://arxiv.org/abs/1909.11942 - https://github.com/google-research/albert - https://fr.wikipedia.org/wiki/French_Wikipedia - https://hal.archives-ouvertes.fr/hal-03336060 --- layout: model title: Arabic Part of Speech Tagger (Modern Standard Arabic (MSA), Gulf Arabic POS) author: John Snow Labs name: bert_pos_bert_base_arabic_camelbert_msa_pos_glf date: 2022-04-26 tags: [bert, pos, part_of_speech, ar, open_source] task: Part of Speech Tagging language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-msa-pos-glf` is an Arabic model originally trained by `CAMeL-Lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_msa_pos_glf_ar_3.4.2_3.0_1650993626607.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_msa_pos_glf_ar_3.4.2_3.0_1650993626607.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_msa_pos_glf","ar") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_msa_pos_glf","ar") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("أنا أحب الشرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.pos.arabic_camelbert_msa_pos_glf").predict("""أنا أحب الشرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_arabic_camelbert_msa_pos_glf| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ar| |Size:|407.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa-pos-glf - https://camel.abudhabi.nyu.edu/annotated-gumar-corpus/ - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://github.com/CAMeL-Lab/camel_tools --- layout: model title: Korean BertForQuestionAnswering Cased model (from obokkkk) author: John Snow Labs name: bert_qa_kobert_finetuned_klue_v2 date: 2022-07-07 tags: [ko, open_source, bert, question_answering] task: Question Answering language: ko edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `kobert-finetuned-klue-v2` is a Korean model originally trained by `obokkkk`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_kobert_finetuned_klue_v2_ko_4.0.0_3.0_1657189603346.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_kobert_finetuned_klue_v2_ko_4.0.0_3.0_1657189603346.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_kobert_finetuned_klue_v2","ko") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_kobert_finetuned_klue_v2","ko") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_kobert_finetuned_klue_v2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ko| |Size:|343.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/obokkkk/kobert-finetuned-klue-v2 --- layout: model title: Legal Trust Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_trust_agreement_bert date: 2022-11-25 tags: [en, legal, classification, agreement, trust, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_trust_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `trust-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `trust-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_trust_agreement_bert_en_1.0.0_3.0_1669372381953.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_trust_agreement_bert_en_1.0.0_3.0_1669372381953.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_trust_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-----------------+ |result           | +-----------------+ |[trust-agreement]| |[other]          | |[other]          | |[trust-agreement]| +-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_trust_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.97 0.90 0.94 82 trust-agreement 0.82 0.95 0.88 39 accuracy - - 0.92 121 macro-avg 0.90 0.93 0.91 121 weighted-avg 0.92 0.92 0.92 121 ``` --- layout: model title: English image_classifier_vit_SDO_VT1 ViTForImageClassification from kenobi author: John Snow Labs name: image_classifier_vit_SDO_VT1 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_SDO_VT1` is an English model originally trained by kenobi. ## Predicted Entities `NASA_SDO_Coronal_Hole`, `NASA_SDO_Coronal_Loop`, `NASA_SDO_Solar_Flare` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_SDO_VT1_en_4.1.0_3.0_1660171968483.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_SDO_VT1_en_4.1.0_3.0_1660171968483.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_SDO_VT1", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_SDO_VT1", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_SDO_VT1| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Clinical Major Concepts to UMLS Code Pipeline author: John Snow Labs name: umls_major_concepts_resolver_pipeline date: 2022-07-25 tags: [en, umls, licensed, pipeline] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps entities (Clinical Major Concepts) with their corresponding UMLS CUI codes. You’ll just feed your text and it will return the corresponding UMLS codes. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/umls_major_concepts_resolver_pipeline_en_4.0.0_3.0_1658736979238.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/umls_major_concepts_resolver_pipeline_en_4.0.0_3.0_1658736979238.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("umls_major_concepts_resolver_pipeline", "en", "clinical/models") pipeline.annotate("The patient complains of pustules after falling from stairs. She has been advised Arthroscopy by her primary care physician") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("umls_major_concepts_resolver_pipeline", "en", "clinical/models") val result = pipeline.annotate("The patient complains of pustules after falling from stairs. She has been advised Arthroscopy by her primary care physician") ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.umls_major_concepts_resolver").predict("""The patient complains of pustules after falling from stairs. She has been advised Arthroscopy by her primary care physician""") ```
## Results ```bash +-----------+-----------------------------------+---------+ |chunk |ner_label |umls_code| +-----------+-----------------------------------+---------+ |pustules |Sign_or_Symptom |C0241157 | |stairs |Daily_or_Recreational_Activity |C4300351 | |Arthroscopy|Therapeutic_or_Preventive_Procedure|C0179144 | +-----------+-----------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|umls_major_concepts_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|3.0 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger --- layout: model title: Stopwords Remover for Romanian language (494 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, ro, open_source] task: Stop Words Removal language: ro edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_ro_3.4.1_3.0_1646672935352.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_ro_3.4.1_3.0_1646672935352.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","ro") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Nu ești mai bun decât mine"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","ro") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Nu ești mai bun decât mine").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ro.stopwords").predict("""Nu ești mai bun decât mine""") ```
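Conceptually, `StopWordsCleaner` is a token filter against a fixed word list. A minimal pure-Python sketch using a tiny hand-picked subset of the Romanian list, chosen to reproduce the Results shown for the example sentence (the pretrained model ships the full 494-entry stopwords-iso list):

```python
# Tiny illustrative subset; the pretrained model carries the full
# 494-entry Romanian list from stopwords-iso.
stopwords = {"nu", "ești", "mai", "mine"}

def clean_tokens(tokens, stops):
    """Drop tokens whose lowercase form appears in the stop list."""
    return [t for t in tokens if t.lower() not in stops]

tokens = ["Nu", "ești", "mai", "bun", "decât", "mine"]
print(clean_tokens(tokens, stopwords))  # ['bun', 'decât']
```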
## Results ```bash +------------+ |result | +------------+ |[bun, decât]| +------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|ro| |Size:|2.7 KB| --- layout: model title: Mapping ICDO Codes with Their Corresponding SNOMED Codes author: John Snow Labs name: icdo_snomed_mapper date: 2022-06-26 tags: [icdo, snomed, chunk_mapper, clinical, licensed, en] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps ICDO codes to corresponding SNOMED codes. ## Predicted Entities {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icdo_snomed_mapper_en_3.5.3_3.0_1656274513770.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icdo_snomed_mapper_en_3.5.3_3.0_1656274513770.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") icdo_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_icdo_augmented", "en", "clinical/models")\ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("icdo_code")\ .setDistanceFunction("EUCLIDEAN") chunkerMapper = ChunkMapperModel\ .pretrained("icdo_snomed_mapper", "en", "clinical/models")\ .setInputCols(["icdo_code"])\ .setOutputCol("snomed_mappings")\ .setRels(["snomed_code"]) pipeline = Pipeline(stages = [ documentAssembler, sbert_embedder, icdo_resolver, chunkerMapper ]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) light_pipeline = LightPipeline(model) result = light_pipeline.fullAnnotate("Hepatocellular Carcinoma") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sbert_embeddings") val icdo_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_icdo_augmented", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("icdo_code") .setDistanceFunction("EUCLIDEAN") val chunkerMapper = ChunkMapperModel .pretrained("icdo_snomed_mapper", "en", "clinical/models") .setInputCols(Array("icdo_code")) .setOutputCol("snomed_mappings") .setRels(Array("snomed_code")) val pipeline = new Pipeline().setStages(Array( documentAssembler, sbert_embedder, icdo_resolver, chunkerMapper )) val data = Seq("Hepatocellular Carcinoma").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu 
nlu.load("en.icdo_to_snomed").predict("""Hepatocellular Carcinoma""") ```
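Conceptually, the `ChunkMapperModel` stage acts as a pretrained key-value lookup from a resolved ICD-O code to its SNOMED counterpart. A minimal pure-Python sketch of that behaviour (the single dictionary entry mirrors the documented example output; a real mapper model ships thousands of such pairs):

```python
# Illustrative sketch only: the real model's mapping table is learned
# and loaded from the pretrained artifact, not hand-written like this.
ICDO_TO_SNOMED = {
    "8170/3": "25370001",  # Hepatocellular carcinoma
}

def map_code(icdo_code, mapping=ICDO_TO_SNOMED):
    """Return the SNOMED code for an ICD-O code, or None if unmapped."""
    return mapping.get(icdo_code)

print(map_code("8170/3"))  # 25370001
```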
## Results ```bash | | ner_chunk | icdo_code | snomed_mappings | |---:|:-------------------------|:------------|------------------:| | 0 | Hepatocellular Carcinoma | 8170/3 | 25370001 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icdo_snomed_mapper| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[icdo_code]| |Output Labels:|[mappings]| |Language:|en| |Size:|127.9 KB| ## References This pretrained model maps ICDO codes to corresponding SNOMED codes under the Unified Medical Language System (UMLS). --- layout: model title: Extract demographic entities (Voice of the Patients) author: John Snow Labs name: ner_vop_demographic_wip date: 2023-04-20 tags: [licensed, clinical, en, ner, vop, patient, demographic] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts demographic terms from health-related texts written in patients' own words. Note: the 'wip' suffix indicates that model development is a work in progress; the model will be finalised and its performance improved in upcoming releases. ## Predicted Entities `SubstanceQuantity`, `RaceEthnicity`, `RelationshipStatus`, `Substance`, `Age`, `Employment`, `Gender` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_demographic_wip_en_4.4.0_3.0_1682012495522.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_demographic_wip_en_4.4.0_3.0_1682012495522.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_demographic_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["My grandma, who's 85 and Black, just had a pacemaker implanted in the cardiology department. The doctors say it'll help regulate her heartbeat and prevent future complications."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_demographic_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("My grandma, who's 85 and Black, just had a pacemaker implanted in the cardiology department. The doctors say it'll help regulate her heartbeat and prevent future complications.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:--------|:--------------| | grandma | Gender | | 85 | Age | | Black | RaceEthnicity | | doctors | Employment | | her | Gender | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_demographic_wip| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|4.0 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I”m 20 year old girl. I”m diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I”m taking weekly supplement of vitamin D and 1000 mcg b12 daily. I”m taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I”m facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
## Benchmarking ```bash label tp fp fn total precision recall f1 SubstanceQuantity 18 5 25 43 0.78 0.42 0.55 RaceEthnicity 4 1 4 8 0.80 0.50 0.62 RelationshipStatus 21 4 7 28 0.84 0.75 0.79 Substance 160 33 30 190 0.83 0.84 0.84 Age 554 45 29 583 0.92 0.95 0.94 Employment 1135 83 57 1192 0.93 0.95 0.94 Gender 1199 34 9 1208 0.97 0.99 0.98 macro_avg 3091 205 161 3252 0.87 0.77 0.81 micro_avg 3091 205 161 3252 0.93 0.95 0.94 ``` --- layout: model title: Korean Electra Embeddings (from monologg) author: John Snow Labs name: electra_embeddings_koelectra_base_v2_generator date: 2022-05-17 tags: [ko, open_source, electra, embeddings] task: Embeddings language: ko edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `koelectra-base-v2-generator` is a Korean model originally trained by `monologg`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_koelectra_base_v2_generator_ko_3.4.4_3.0_1652786893501.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_koelectra_base_v2_generator_ko_3.4.4_3.0_1652786893501.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_koelectra_base_v2_generator","ko") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["나는 Spark NLP를 좋아합니다"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_koelectra_base_v2_generator","ko") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("나는 Spark NLP를 좋아합니다").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_koelectra_base_v2_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ko| |Size:|130.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/monologg/koelectra-base-v2-generator - https://github.com/monologg/KoELECTRA/blob/master/README_EN.md --- layout: model title: Fast Neural Machine Translation Model from American Sign Language to English author: John Snow Labs name: opus_mt_ase_en date: 2021-06-01 tags: [open_source, seq2seq, translation, ase, en, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). source languages: ase target languages: en {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ase_en_xx_3.1.0_2.4_1622560965192.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ase_en_xx_3.1.0_2.4_1622560965192.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ase_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ase_en", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.American Sign Language.translate_to.English').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ase_en| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Portuguese BertForTokenClassification Cased model (from dominguesm) author: John Snow Labs name: bert_token_classifier_restore_punctuation_ptbr date: 2022-11-30 tags: [pt, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: pt edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-restore-punctuation-ptbr` is a Portuguese model originally trained by `dominguesm`. ## Predicted Entities `,O`, `,U`, `OO`, `:O`, `;O`, `.O`, `?O`, `?U`, `OU`, `!U`, `!O`, `-O`, `:U`, `.U`, `'O` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_restore_punctuation_ptbr_pt_4.2.4_3.0_1669815430287.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_restore_punctuation_ptbr_pt_4.2.4_3.0_1669815430287.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_restore_punctuation_ptbr","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_restore_punctuation_ptbr","pt") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
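The predicted labels pair a punctuation mark with a casing flag (for example `,U` or `?O`). Assuming the first character is the punctuation to append after the token (`O` meaning none) and the second is the casing (`U` = uppercase the first letter, `O` = leave unchanged) — a reading inferred from the entity list above, not stated by the model card — restoring text from token-level labels can be sketched as:

```python
def apply_punct_labels(tokens, labels):
    """Rebuild punctuated, cased text from token-level labels.

    Assumed label format: label[0] is the punctuation mark to append
    ('O' = none); label[1] is the casing flag ('U' = uppercase the
    first letter, 'O' = keep as-is). This format is an assumption.
    """
    out = []
    for tok, lab in zip(tokens, labels):
        punct, case = lab[0], lab[1]
        if case == "U":
            tok = tok[:1].upper() + tok[1:]
        if punct != "O":
            tok += punct
        out.append(tok)
    return " ".join(out)

print(apply_punct_labels(
    ["ola", "tudo", "bem", "com", "voce"],
    [",U", "OO", "OO", "OO", "?O"]))  # Ola, tudo bem com voce?
```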
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_restore_punctuation_ptbr| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|pt| |Size:|406.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/dominguesm/bert-restore-punctuation-ptbr - https://wandb.ai/dominguesm/RestorePunctuationPTBR - https://github.com/DominguesM/respunct - https://github.com/esdurmus/Wikilingua - https://paperswithcode.com/sota?task=named-entity-recognition&dataset=wiki_lingua --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886` is a German model originally trained by jonatasgrosman. 
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886_de_4.2.0_3.0_1664116485500.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886_de_4.2.0_3.0_1664116485500.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886", lang = "de") val annotations = pipeline.transform(audioDF) ```
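Under the hood, the Wav2Vec2ForCTC stage inside this pipeline produces per-frame character predictions that are greedily decoded by collapsing repeats and dropping the CTC blank token. A minimal pure-Python sketch of that decoding step (the frame labels below are invented for illustration):

```python
def ctc_greedy_decode(frame_labels, blank="_"):
    """Greedy CTC decoding: collapse repeated frame labels, drop blanks."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return "".join(out)

# Repeats collapse; a blank between identical labels preserves
# genuine double letters ("ll" survives because of the blank).
print(ctc_greedy_decode(list("hh_e_ll_l_o")))  # hello
```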
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s886| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Word2Vec Embeddings in Chechen (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, ce, open_source] task: Embeddings language: ce edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ce_3.4.1_3.0_1647290509083.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ce_3.4.1_3.0_1647290509083.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ce") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ce") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ce.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
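The lookup above maps each token to a 300-dimensional vector; similarity between tokens is then typically scored with cosine similarity. A self-contained sketch with toy 3-dimensional vectors (illustrative values, not actual model weights):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy vectors: same direction -> 1.0, orthogonal -> 0.0.
print(round(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 6))  # 1.0
print(round(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 6))  # 0.0
```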
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|ce| |Size:|365.7 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Financial Relation Extraction on Earning Calls (Small) author: John Snow Labs name: finre_earning_calls_sm date: 2022-11-28 tags: [earning, calls, en, licensed] task: Relation Extraction language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts relations between amounts, counts, percentages, dates and the financial entities extracted with any earning calls NER model, such as `finner_earning_calls_sm` (shown in the example below). ## Predicted Entities `has_amount`, `has_amount_date`, `has_percentage_date`, `has_percentage`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finre_earning_calls_sm_en_1.0.0_3.0_1669649131686.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finre_earning_calls_sm_en_1.0.0_3.0_1669649131686.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencizer = nlp.SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "en") \ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token")\ .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$','€']) bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\ .setInputCols(["sentence", "token"])\ .setOutputCol("bert_embeddings") ner_model = finance.NerModel.pretrained("finner_earning_calls_sm", "en", "finance/models")\ .setInputCols(["sentence", "token", "bert_embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") # =========== # This is needed only to filter relation pairs using finance.RENerChunksFilter (see below) # =========== pos = nlp.PerceptronModel.pretrained("pos_anc", 'en')\ .setInputCols("sentence", "token")\ .setOutputCol("pos") dependency_parser = nlp.DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos", "token"]) \ .setOutputCol("dependencies") ENTITIES = ['PROFIT', 'PROFIT_INCREASE', 'PROFIT_DECLINE', 'CF', 'CF_INCREASE', 'CF_DECREASE', 'LIABILITY', 'EXPENSE', 'EXPENSE_INCREASE', 'EXPENSE_DECREASE'] ENTITY_PAIRS = [f"{x}-AMOUNT" for x in ENTITIES] ENTITY_PAIRS.extend([f"{x}-COUNT" for x in ENTITIES]) ENTITY_PAIRS.extend([f"{x}-PERCENTAGE" for x in ENTITIES]) ENTITY_PAIRS.append("AMOUNT-FISCAL_YEAR") ENTITY_PAIRS.append("AMOUNT-DATE") ENTITY_PAIRS.append("AMOUNT-CURRENCY") re_ner_chunk_filter = finance.RENerChunksFilter() \ .setInputCols(["ner_chunk", "dependencies"])\ .setOutputCol("re_ner_chunk")\ .setRelationPairs(ENTITY_PAIRS)\ .setMaxSyntacticDistance(5) # =========== reDL = finance.RelationExtractionDLModel.pretrained('finre_earning_calls_sm', 'en', 'finance/models')\ .setInputCols(["re_ner_chunk", "sentence"])\ .setOutputCol("relations") pipeline = nlp.Pipeline(stages=[ documentAssembler, sentencizer, tokenizer, bert_embeddings, ner_model, ner_converter, pos, dependency_parser, re_ner_chunk_filter, reDL]) text = "In the third quarter of fiscal 2021, we received net proceeds of $342.7 million, after deducting underwriters discounts and commissions and offering costs of $31.8 million, including the exercise of the underwriters option to purchase additional shares. " data = spark.createDataFrame([[text]]).toDF("text") model = pipeline.fit(data) results = model.transform(data) ```
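The benchmark scores reported further down follow the standard precision/recall/F1 definitions over true positives, false positives and false negatives. A quick reference sketch (the counts below are illustrative, not taken from this model's evaluation):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw true/false positive/negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example counts (illustrative only).
p, r, f = prf(tp=90, fp=10, fn=10)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.9 0.9 0.9
```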
## Results ```bash relation entity1 entity1_begin entity1_end chunk1 entity2 entity2_begin entity2_end chunk2 confidence has_amount CF 49 60 net proceeds AMOUNT 66 78 342.7 million 0.9999101 has_amount CURRENCY 65 65 $ AMOUNT 66 78 342.7 million 0.9925425 has_amount EXPENSE 125 154 commissions and offering costs AMOUNT 160 171 31.8 million 0.9997677 has_amount CURRENCY 159 159 $ AMOUNT 160 171 31.8 million 0.998896 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finre_earning_calls_sm| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|402.6 MB| ## References In-house annotations of scraped Earning Call Transcripts. ## Benchmarking ```bash Relation Recall Precision F1 Support has_amount 0.973 0.973 0.973 183 has_amount_date 0.700 1.000 0.824 10 has_percentage 0.987 0.931 0.958 150 has_percentage_date 0.667 0.857 0.750 9 other 0.993 0.995 0.994 2048 Avg. 0.864 0.951 0.900 2048 Weighted-Avg. 0.988 0.988 0.988 2048 ``` --- layout: model title: Detect Living Species author: John Snow Labs name: ner_living_species_biobert date: 2022-06-22 tags: [ner, en, clinical, licensed, biobert] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts living species from clinical texts, a task critical to scientific disciplines like medicine, biology, ecology/biodiversity, nutrition and agriculture. It is trained using BERT token embeddings `biobert_pubmed_base_cased` on the [LivingNER](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) corpus, which is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. **NOTE:** 1. 
The text files were translated from Spanish with a neural machine translation system. 2. The annotations were translated with the same neural machine translation system. 3. The translated annotations were transferred to the translated text files using an annotation transfer technology. ## Predicted Entities `HUMAN`, `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_biobert_en_3.5.3_3.0_1655887968803.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_biobert_en_3.5.3_3.0_1655887968803.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased", "en")\ .setInputCols("sentence", "token")\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_living_species_biobert", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. 
In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_living_species_biobert", "en","clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. 
In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.living_species.biobert").predict("""42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL.""") ```
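The `NerConverter` stage in the pipeline above groups token-level BIO tags (like those listed in the benchmarking section: `B-SPECIES`, `I-SPECIES`, …) into the chunks shown in the results. A minimal pure-Python sketch of that grouping step:

```python
def bio_to_chunks(tokens, tags):
    """Group BIO-tagged tokens into (chunk_text, label) pairs,
    mirroring what a NerConverter stage does conceptually."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(bio_to_chunks(
    ["Three", "samples", "tested", "positive", "for", "Fusarium", "spp"],
    ["O", "O", "O", "O", "O", "B-SPECIES", "I-SPECIES"]))
```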
## Results ```bash +-----------------------+-------+ |ner_chunk |label | +-----------------------+-------+ |woman |HUMAN | |bacterial |SPECIES| |Fusarium spp |SPECIES| |patient |HUMAN | |species |SPECIES| |Fusarium solani complex|SPECIES| |antifungals |SPECIES| +-----------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_biobert| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.4 MB| ## References [https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) ## Benchmarking ```bash label precision recall f1-score support B-HUMAN 0.87 0.95 0.91 2950 B-SPECIES 0.80 0.89 0.84 3122 I-HUMAN 0.84 0.62 0.71 145 I-SPECIES 0.76 0.90 0.82 1162 micro-avg 0.82 0.91 0.86 7379 macro-avg 0.82 0.84 0.82 7379 weighted-avg 0.82 0.91 0.86 7379 ``` --- layout: model title: Sentence Entity Resolver for Billable ICD10-CM HCC Codes (sbertresolve_icd10cm_slim_billable_hcc_med) author: John Snow Labs name: sbertresolve_icd10cm_slim_billable_hcc_med date: 2021-08-26 tags: [icd10cm, entity_resolution, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.1.3 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts to ICD10 CM codes using sentence bert embeddings. In this model, synonyms having low cosine similarity to unnormalized terms are dropped. It also returns the official resolution text within the brackets inside the metadata. The model is augmented with synonyms, and previous augmentations are flexed according to cosine distances to unnormalized terms (ground truths). ## Predicted Entities Outputs 7-digit billable ICD codes. 
In the result, look for the aux_label parameter in the metadata to get the HCC status. The HCC status breaks down into three pieces of information: billable status, HCC status, and HCC score. For example, in the output shown below the billable status is 1, the HCC status is 1, and the HCC score is 11.

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10cm_slim_billable_hcc_med_en_3.1.3_2.4_1629989198744.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_icd10cm_slim_billable_hcc_med_en_3.1.3_2.4_1629989198744.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("ner_chunk")

sbert_embedder = BertSentenceEmbeddings\
    .pretrained('sbert_jsl_medium_uncased', 'en', 'clinical/models')\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("sbert_embeddings")

icd10_resolver = SentenceEntityResolverModel\
    .pretrained("sbertresolve_icd10cm_slim_billable_hcc_med", "en", "clinical/models")\
    .setInputCols(["ner_chunk", "sbert_embeddings"])\
    .setOutputCol("icd10cm_code")\
    .setDistanceFunction("EUCLIDEAN")\
    .setReturnCosineDistances(True)

bert_pipeline_icd = Pipeline(stages=[document_assembler, sbert_embedder, icd10_resolver])

data = spark.createDataFrame([["bladder cancer"]]).toDF("text")

results = bert_pipeline_icd.fit(data).transform(data)
```

```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("ner_chunk")

val sbert_embedder = BertSentenceEmbeddings
    .pretrained("sbert_jsl_medium_uncased", "en", "clinical/models")
    .setInputCols(Array("ner_chunk"))
    .setOutputCol("sbert_embeddings")

val icd10_resolver = SentenceEntityResolverModel
    .pretrained("sbertresolve_icd10cm_slim_billable_hcc_med", "en", "clinical/models")
    .setInputCols(Array("ner_chunk", "sbert_embeddings"))
    .setOutputCol("icd10cm_code")
    .setDistanceFunction("EUCLIDEAN")
    .setReturnCosineDistances(true)

val bert_pipeline_icd = new Pipeline().setStages(Array(document_assembler, sbert_embedder, icd10_resolver))

val data = Seq("bladder cancer").toDF("text")

val result = bert_pipeline_icd.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.icd10cm.slim_billable_hcc_med").predict("""bladder cancer""")
```
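The aux_label value described above (surfaced as `billable_hcc_status_score` in the results) packs three values into one list. A small sketch of unpacking it — the function and field names here are illustrative, not part of the library API; only the three-field layout comes from the model card:

```python
# Unpack the three-element HCC status list returned in the resolver
# metadata: [billable status, HCC status, HCC score]. Names are
# illustrative; only the three-field layout comes from the model card.
def parse_hcc_status(status):
    billable, hcc, score = status
    return {"billable": billable == "1",
            "hcc": hcc == "1",
            "hcc_score": int(score)}

print(parse_hcc_status(["1", "1", "11"]))
# {'billable': True, 'hcc': True, 'hcc_score': 11}
```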
## Results ```bash | | chunks | code | resolutions | all_codes | billable_hcc_status_score | all_distances | |---:|:---------------|:--------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|--------------------------------------------------------------------------------------:|:----------------------------|:-----------------------------------------------------------------------------------------------------------------| | 0 | bladder cancer | C671 |[bladder cancer, dome [Malignant neoplasm of dome of bladder], cancer of the urinary bladder [Malignant neoplasm of bladder, unspecified], prostate cancer [Malignant neoplasm of prostate], cancer of the urinary bladder, lateral wall [Malignant neoplasm of lateral wall of bladder], cancer of the urinary bladder, anterior wall [Malignant neoplasm of anterior wall of bladder], cancer of the urinary bladder, posterior wall [Malignant neoplasm of posterior wall of bladder], cancer of the urinary bladder, neck [Malignant neoplasm of bladder neck], cancer of the urinary bladder, ureteric orifice [Malignant neoplasm of ureteric orifice]]| [C671, C679, C61, C672, C673, C674, C675, C676, D090, Z126, D494, C670, Z8551, C7911] | ['1', '1', '11'] | [0.0894, 0.1051, 0.1184, 0.1180, 0.1200, 0.1204, 0.1255, 0.1375, 0.1357, 0.1452, 0.1469, 0.1513, 0.1500, 0.1575] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model 
Name:|sbertresolve_icd10cm_slim_billable_hcc_med| |Compatibility:|Healthcare NLP 3.1.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk, sbert_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Case sensitive:|false| --- layout: model title: Fast Neural Machine Translation Model from Malagasy to English author: John Snow Labs name: opus_mt_mg_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, mg, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `mg` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_mg_en_xx_2.7.0_2.4_1609170371416.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_mg_en_xx_2.7.0_2.4_1609170371416.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_mg_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])

light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

result = light_pipeline.fullAnnotate("text to translate")
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_mg_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("text to translate").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.mg.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_mg_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from Ruund to English author: John Snow Labs name: opus_mt_rnd_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, rnd, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `rnd` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_rnd_en_xx_2.7.0_2.4_1609166434564.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_rnd_en_xx_2.7.0_2.4_1609166434564.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_rnd_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])

light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

result = light_pipeline.fullAnnotate("text to translate")
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_rnd_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("text to translate").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.rnd.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_rnd_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English ElectraForQuestionAnswering Large model (from sultan) BioASQ
author: John Snow Labs
name: electra_qa_BioM_Large_SQuAD2_BioASQ8B
date: 2022-06-22
tags: [en, open_source, electra, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BioM-ELECTRA-Large-SQuAD2-BioASQ8B` is an English model originally trained by `sultan`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_BioM_Large_SQuAD2_BioASQ8B_en_4.0.0_3.0_1655919178306.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_BioM_Large_SQuAD2_BioASQ8B_en_4.0.0_3.0_1655919178306.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_BioM_Large_SQuAD2_BioASQ8B","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_BioM_Large_SQuAD2_BioASQ8B","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2_bioasq8b.electra.large").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_BioM_Large_SQuAD2_BioASQ8B| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/sultan/BioM-ELECTRA-Large-SQuAD2-BioASQ8B --- layout: model title: Pipeline to Detect Drug Chemicals author: John Snow Labs name: ner_drugs_large_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, drug, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_drugs_large](https://nlp.johnsnowlabs.com/2021/03/31/ner_drugs_large_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_drugs_large_pipeline_en_3.4.1_3.0_1647872928647.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_drugs_large_pipeline_en_3.4.1_3.0_1647872928647.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_drugs_large_pipeline", "en", "clinical/models") pipeline.annotate("""The patient is a 40-year-old white male who presents with a chief complaint of "chest pain". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. He has been advised Aspirin 81 milligrams QDay. Humulin N. insulin 50 units in a.m. HCTZ 50 mg QDay. Nitroglycerin 1/150 sublingually PRN chest pain.""") ``` ```scala val pipeline = new PretrainedPipeline("ner_drugs_large_pipeline", "en", "clinical/models") pipeline.annotate("""The patient is a 40-year-old white male who presents with a chief complaint of "chest pain". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. He has been advised Aspirin 81 milligrams QDay. Humulin N. insulin 50 units in a.m. HCTZ 50 mg QDay. Nitroglycerin 1/150 sublingually PRN chest pain.""") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.drugs_large.pipeline").predict("""The patient is a 40-year-old white male who presents with a chief complaint of "chest pain". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. He has been advised Aspirin 81 milligrams QDay. Humulin N. insulin 50 units in a.m. HCTZ 50 mg QDay. Nitroglycerin 1/150 sublingually PRN chest pain.""") ```
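In Spark NLP, `fullAnnotate` on a LightPipeline returns chunk annotations whose text sits in `.result` and whose entity label sits in `.metadata["entity"]`. The sketch below pairs them up; a stand-in class replaces the real Annotation type so the snippet runs without a Spark session:

```python
from dataclasses import dataclass, field

# Stand-in for Spark NLP's Annotation, so this runs without Spark.
@dataclass
class Annotation:
    result: str
    metadata: dict = field(default_factory=dict)

def chunks_to_pairs(chunks):
    """Turn chunk annotations into (text, entity label) pairs."""
    return [(c.result, c.metadata.get("entity")) for c in chunks]

chunks = [Annotation("Aspirin 81 milligrams", {"entity": "DRUG"}),
          Annotation("Humulin N", {"entity": "DRUG"})]
print(chunks_to_pairs(chunks))
# [('Aspirin 81 milligrams', 'DRUG'), ('Humulin N', 'DRUG')]
```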
## Results ```bash +--------------------------------+---------+ |chunk |ner_label| +--------------------------------+---------+ |Aspirin 81 milligrams |DRUG | |Humulin N |DRUG | |insulin 50 units |DRUG | |HCTZ 50 mg |DRUG | |Nitroglycerin 1/150 sublingually|DRUG | +--------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_drugs_large_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Lemmatizer (Urdu, SpacyLookup) author: John Snow Labs name: lemma_spacylookup date: 2022-03-03 tags: [open_source, lemmatizer, ur] task: Lemmatization language: ur edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Urdu Lemmatizer is an scalable, production-ready version of the Rule-based Lemmatizer available in [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_ur_3.4.1_3.0_1646316529784.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_ur_3.4.1_3.0_1646316529784.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","ur") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["تم مجھ سے بہتر نہیں ہو"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","ur") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("تم مجھ سے بہتر نہیں ہو").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ur.lemma.spacylookup").predict("""تم مجھ سے بہتر نہیں ہو""") ```
## Results ```bash +------------------------------+ |result | +------------------------------+ |[میں, میں, سے, بہ, نہیں, ہونا]| +------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma_spacylookup| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|ur| |Size:|272.3 KB| --- layout: model title: Chamorro XlmRoBertaForQuestionAnswering (from Gantenbein) author: John Snow Labs name: xlm_roberta_qa_ADDI_CH_XLM_R date: 2022-06-23 tags: [ch, open_source, question_answering, xlmroberta] task: Question Answering language: ch edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ADDI-CH-XLM-R` is a Chamorro model originally trained by `Gantenbein`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_ADDI_CH_XLM_R_ch_4.0.0_3.0_1655980909446.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_ADDI_CH_XLM_R_ch_4.0.0_3.0_1655980909446.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_ADDI_CH_XLM_R","ch") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_ADDI_CH_XLM_R","ch") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("ch.answer_question.xlm_roberta").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_ADDI_CH_XLM_R| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|ch| |Size:|776.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Gantenbein/ADDI-CH-XLM-R --- layout: model title: Stopwords Remover for Spanish language (551 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, es, open_source] task: Stop Words Removal language: es edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_es_3.4.1_3.0_1646673091651.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_es_3.4.1_3.0_1646673091651.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

stop_words = StopWordsCleaner.pretrained("stopwords_iso","es") \
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])

example = spark.createDataFrame([["No eres mejor que yo"]], ["text"])

results = pipeline.fit(example).transform(example)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val stop_words = StopWordsCleaner.pretrained("stopwords_iso","es")
    .setInputCols(Array("token"))
    .setOutputCol("cleanTokens")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))

val data = Seq("No eres mejor que yo").toDF("text")

val results = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("es.stopwords").predict("""No eres mejor que yo""")
```
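The empty result reported below is expected: every token of "No eres mejor que yo" appears in the 551-entry Spanish stopword list, so all of them are removed. A plain-Python illustration of that filtering, using a tiny hand-picked subset of the list rather than the model itself:

```python
# Tiny subset of the stopwords-iso Spanish list, for illustration only.
stopwords_es = {"no", "eres", "mejor", "que", "yo"}

def clean_tokens(text, stopwords):
    """Keep only tokens that are not stopwords (case-insensitive)."""
    return [t for t in text.split() if t.lower() not in stopwords]

print(clean_tokens("No eres mejor que yo", stopwords_es))  # []
```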
## Results

```bash
+------+
|result|
+------+
|[]    |
+------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|es|
|Size:|2.9 KB|

---
layout: model
title: Russian XlmRoBertaForQuestionAnswering (from AlexKay)
author: John Snow Labs
name: xlm_roberta_qa_xlm_roberta_large_qa_multilingual_finedtuned_ru_ru_AlexKay
date: 2022-06-24
tags: [ru, open_source, question_answering, xlmroberta]
task: Question Answering
language: ru
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-large-qa-multilingual-finedtuned-ru` is a Russian model originally trained by `AlexKay`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_large_qa_multilingual_finedtuned_ru_ru_AlexKay_ru_4.0.0_3.0_1656060003348.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_large_qa_multilingual_finedtuned_ru_ru_AlexKay_ru_4.0.0_3.0_1656060003348.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_large_qa_multilingual_finedtuned_ru_ru_AlexKay","ru") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_large_qa_multilingual_finedtuned_ru_ru_AlexKay","ru") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("ru.answer_question.xlm_roberta.multilingual_large").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_roberta_large_qa_multilingual_finedtuned_ru_ru_AlexKay|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|ru|
|Size:|1.9 GB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/AlexKay/xlm-roberta-large-qa-multilingual-finedtuned-ru

---
layout: model
title: English BertForQuestionAnswering Uncased model (from Andaf)
author: John Snow Labs
name: bert_qa_uncased_finetuned_squad_indonesian
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-uncased-finetuned-squad-indonesian` is an English model originally trained by `Andaf`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_uncased_finetuned_squad_indonesian_en_4.0.0_3.0_1657188726851.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_uncased_finetuned_squad_indonesian_en_4.0.0_3.0_1657188726851.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_uncased_finetuned_squad_indonesian","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(False)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_uncased_finetuned_squad_indonesian","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(false)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_uncased_finetuned_squad_indonesian|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|413.2 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/Andaf/bert-uncased-finetuned-squad-indonesian

---
layout: model
title: Legal Vesting Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_vesting_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, vesting, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

This model is a binary classifier (True, False) for the `Vesting` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at sentence level.

If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Note that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a True/False value for each of the legal clause models you have added. ## Predicted Entities `Vesting`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_vesting_bert_en_1.0.0_3.0_1678050618012.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_vesting_bert_en_1.0.0_3.0_1678050618012.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_vesting_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
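The paragraph splitting (by multiline) suggested above can be approximated in plain Python before building the Spark DataFrame; a minimal sketch with a hypothetical helper, not the workshop's exact code:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a long legal document on blank lines so that each piece
    stays well under the 512-token limit of the sentence embeddings."""
    # Two or more consecutive newlines mark a paragraph boundary.
    parts = re.split(r"\n{2,}", text)
    return [p.strip() for p in parts if p.strip()]

doc = "Clause one text.\n\nClause two text.\n\n\nClause three text."
print(split_paragraphs(doc))  # ['Clause one text.', 'Clause two text.', 'Clause three text.']
```

Each resulting paragraph can then become one row of the `text` column fed to the pipeline.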
## Results ```bash +-------+ |result| +-------+ |[Vesting]| |[Other]| |[Other]| |[Vesting]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_vesting_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.99 0.92 0.95 74 Vesting 0.90 0.98 0.94 53 accuracy - - 0.94 127 macro-avg 0.94 0.95 0.94 127 weighted-avg 0.95 0.94 0.95 127 ``` --- layout: model title: Sentiment Analysis on Auditors' Reports author: John Snow Labs name: finclf_auditor_sentiment_analysis date: 2022-11-04 tags: [auditor, sentiment, analysis, en, licensed] task: Sentiment Analysis language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: FinanceClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Sentiment Analysis model which retrieves 3 sentiments (`positive`, `negative` or `neutral`) from Auditors' comments. ## Predicted Entities `positive`, `negative`, `neutral` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_auditor_sentiment_analysis_en_1.0.0_3.0_1667605773882.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_auditor_sentiment_analysis_en_1.0.0_3.0_1667605773882.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("sentence") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") sentiment = nlp.ClassifierDLModel.pretrained("finclf_auditor_sentiment_analysis", "en", "finance/models") \ .setInputCols("sentence_embeddings") \ .setOutputCol("category") pipeline = nlp.Pipeline() \ .setStages( [ documentAssembler, embeddings, sentiment ] ) pipelineModel = pipeline.fit(sdf_test) res = pipelineModel.transform(sdf_test) res.select('sentence', 'category.result').show(truncate=100) ```
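Since the classifier predicts one label per row, a report made of several sentences can be collapsed to a single document-level sentiment afterwards. A minimal sketch of one possible post-processing step (majority vote over the `category` results; a hypothetical helper, not part of the pipeline):

```python
from collections import Counter

def majority_sentiment(labels: list) -> str:
    """Collapse per-sentence sentiment predictions into one
    document-level label by simple majority vote."""
    counts = Counter(labels)
    # most_common(1) returns [(label, count)] for the top label.
    return counts.most_common(1)[0][0]

print(majority_sentiment(["positive", "neutral", "positive"]))  # positive
```

Other aggregation strategies (e.g. weighting `negative` more heavily for risk screening) are equally easy to swap in.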
## Results ```bash +----------------------------------------------------------------------------------------------------+----------+ | sentence| result| +----------------------------------------------------------------------------------------------------+----------+ |In our opinion, the consolidated financial statements referred to above present fairly..............|[positive]| +----------------------------------------------------------------------------------------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_auditor_sentiment_analysis| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Proprietary auditors' reports ## Benchmarking ```bash label precision recall f1-score support negative 0.66 0.78 0.72 124 neutral 0.88 0.77 0.82 559 positive 0.65 0.76 0.70 286 accuracy - - 0.77 969 macro-avg 0.73 0.77 0.74 969 weighted-avg 0.78 0.77 0.77 969 ``` --- layout: model title: English image_classifier_vit_PanJuOffset_TwoClass ViTForImageClassification from ShihTing author: John Snow Labs name: image_classifier_vit_PanJuOffset_TwoClass date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_PanJuOffset_TwoClass` is an English model originally trained by ShihTing. 
## Predicted Entities `Break`, `Normal` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_PanJuOffset_TwoClass_en_4.1.0_3.0_1660168931893.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_PanJuOffset_TwoClass_en_4.1.0_3.0_1660168931893.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_PanJuOffset_TwoClass", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_PanJuOffset_TwoClass", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_PanJuOffset_TwoClass| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Pipeline to Map RxNorm Codes to Corresponding National Drug Codes (NDC) author: John Snow Labs name: rxnorm_ndc_mapping date: 2023-03-29 tags: [en, licensed, clinical, pipeline, chunk_mapping, rxnorm, ndc] task: Chunk Mapping language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps RxNorm codes to NDC codes without using any text data. Just feed it whitespace-delimited RxNorm codes and it will return the two corresponding types of NDC codes, the `package ndc` and the `product ndc`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_ndc_mapping_en_4.3.2_3.2_1680121244814.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_ndc_mapping_en_4.3.2_3.2_1680121244814.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("rxnorm_ndc_mapping", "en", "clinical/models") result = pipeline.fullAnnotate("1652674 259934") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("rxnorm_ndc_mapping", "en", "clinical/models") val result = pipeline.fullAnnotate("1652674 259934") ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.rxnorm_to_ndc.pipe").predict("""Put your text here.""") ```
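Conceptually, the ChunkMapper stages behave like a dictionary lookup from each RxNorm code to its NDC codes. A toy sketch of that behavior (the table below mirrors the example codes shown in the Results section and is illustrative toy data, not the real mapping tables):

```python
# Illustrative toy lookup table -- NOT real RxNorm-to-NDC reference data.
TOY_RXNORM_TO_NDC = {
    "1652674": {"package_ndc": "62135-0625-60", "product_ndc": "46708-0499"},
    "259934": {"package_ndc": "13349-0010-39", "product_ndc": "13349-0010"},
}

def map_codes(codes: str) -> list:
    """Accept whitespace-delimited RxNorm codes, as the pipeline does,
    and return one mapping dict per code (empty dict if unknown)."""
    return [TOY_RXNORM_TO_NDC.get(c, {}) for c in codes.split()]

print(map_codes("1652674 259934"))
```

The real pipeline tokenizes the input and runs two ChunkMapperModels (one per NDC type), but the input/output shape is the same as this lookup.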
## Results ```bash {'document': ['1652674 259934'], 'package_ndc': ['62135-0625-60', '13349-0010-39'], 'product_ndc': ['46708-0499', '13349-0010'], 'rxnorm_code': ['1652674', '259934']} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|rxnorm_ndc_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|4.0 MB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel - ChunkMapperModel --- layout: model title: Recognize Entities OntoNotes - BERT Base author: John Snow Labs name: onto_recognize_entities_bert_base date: 2020-12-09 task: [Named Entity Recognition, Sentence Detection, Embeddings, Pipeline Public] language: en nav_key: models edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [pipeline, open_source, en] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pre-trained pipeline containing an NerDL model trained on OntoNotes 5.0 with `bert_base_cased` embeddings. It can extract the following 18 entities: ## Predicted Entities `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_base_en_2.7.0_2.4_1607509662244.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_base_en_2.7.0_2.4_1607509662244.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('onto_recognize_entities_bert_base') result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("onto_recognize_entities_bert_base") val result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.") ``` {:.nlu-block} ```python import nlu text = ["""Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament."""] ner_df = nlu.load('en.ner.onto.bert.base').predict(text, output_level='chunk') ner_df[["entities", "entities_class"]] ```
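Under the hood, the NerConverter at the end of this pipeline merges the token-level IOB tags produced by the NerDLModel into entity chunks. A simplified sketch of that merge logic (not the actual implementation):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk, label) pairs, roughly what
    NerConverter does with NerDLModel output."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any entity in progress.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            # Continuation of the current entity.
            current.append(tok)
        else:
            # "O" tag (or stray I-) ends the current entity.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Johnson", "served", "as", "mayor", "of", "London"]
tags = ["B-PERSON", "O", "O", "O", "O", "B-GPE"]
print(iob_to_chunks(tokens, tags))  # [('Johnson', 'PERSON'), ('London', 'GPE')]
```

This is why the Results table below shows multi-token chunks such as `2008 to 2016` as a single `DATE` row.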
{:.h2_title} ## Results ```bash +------------+---------+ |chunk |ner_label| +------------+---------+ |Johnson |PERSON | |first |ORDINAL | |2001 |DATE | |Parliament |ORG | |eight years |DATE | |London |GPE | |2008 to 2016|DATE | |Parliament |ORG | +------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_recognize_entities_bert_base| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - Tokenizer - BertEmbeddings - NerDLModel - NerConverter --- layout: model title: Telugu DistilBertForMaskedLM Cased model (from neuralspace-reverie) author: John Snow Labs name: distilbert_embeddings_indic_transformers date: 2022-12-12 tags: [te, open_source, distilbert_embeddings, distilbertformaskedlm] task: Embeddings language: te edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `indic-transformers-te-distilbert` is a Telugu model originally trained by `neuralspace-reverie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_indic_transformers_te_4.2.4_3.0_1670864912289.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_indic_transformers_te_4.2.4_3.0_1670864912289.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_indic_transformers","te") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(False) pipeline = Pipeline(stages=[documentAssembler, tokenizer, distilbert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_indic_transformers","te") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, distilbert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
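The `embeddings` column produced above contains one vector per token; a common downstream use is comparing vectors with cosine similarity. A minimal sketch, using toy 3-dimensional vectors (the real model outputs 768-dimensional ones):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors:
    dot(a, b) / (|a| * |b|), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(round(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]), 3))  # identical vectors -> 1.0
print(round(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 3))  # orthogonal vectors -> 0.0
```

In practice you would pull the vectors out of the `result` DataFrame's `embeddings.embeddings` field before comparing them.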
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_indic_transformers| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|te| |Size:|249.4 MB| |Case sensitive:|false| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-te-distilbert - https://oscar-corpus.com/ --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_32_epochs50_earlystop TFWav2Vec2ForCTC from ying-tina author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab_32_epochs50_earlystop date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_32_epochs50_earlystop` is an English model originally trained by ying-tina. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_32_epochs50_earlystop_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_32_epochs50_earlystop_en_4.2.0_3.0_1664111733338.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_32_epochs50_earlystop_en_4.2.0_3.0_1664111733338.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_32_epochs50_earlystop", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_32_epochs50_earlystop", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab_32_epochs50_earlystop| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|354.9 MB| --- layout: model title: English BertForQuestionAnswering model (from SanayCo) author: John Snow Labs name: bert_qa_model_output date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `model_output` is an English model originally trained by `SanayCo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_model_output_en_4.0.0_3.0_1654188292716.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_model_output_en_4.0.0_3.0_1654188292716.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_model_output","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_model_output","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.by_SanayCo").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_model_output| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/SanayCo/model_output --- layout: model title: Legal Benefits Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_benefits_bert date: 2023-03-05 tags: [en, legal, classification, clauses, benefits, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Benefits` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a True/False value for each of the legal clause models you have added. ## Predicted Entities `Benefits`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_benefits_bert_en_1.0.0_3.0_1678050723092.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_benefits_bert_en_1.0.0_3.0_1678050723092.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_benefits_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Benefits]| |[Other]| |[Other]| |[Benefits]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_benefits_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Benefits 0.86 0.94 0.90 47 Other 0.95 0.90 0.93 70 accuracy - - 0.91 117 macro-avg 0.91 0.92 0.91 117 weighted-avg 0.92 0.91 0.92 117 ``` --- layout: model title: Judgements Classification (agent) author: John Snow Labs name: legclf_bert_judgements_agent date: 2022-09-07 tags: [en, legal, judgements, agent, echr, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Text Classification model, aimed at identifying the different argument types in Court Decision texts about Human Rights. This model was inspired by [this](https://arxiv.org/pdf/2208.06178.pdf) paper, which uses a different approach (Named Entity Recognition). The model classifies the claims by the type of Agent (Party) involved (whether it is the Court speaking, the applicant, ...). The classes are listed below. Please check the [original paper](https://arxiv.org/pdf/2208.06178.pdf) for more information about them. 
## Predicted Entities `APPLICANT`, `COMMISSION/CHAMBER`, `ECHR`, `OTHER`, `STATE`, `THIRD_PARTIES` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/LEG_JUDGEMENTS_CLF/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_bert_judgements_agent_en_1.0.0_3.2_1662560852536.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_bert_judgements_agent_en_1.0.0_3.2_1662560852536.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python text_list = ["""The applicant further noted that his placement in the home had already lasted more than eight years and that his hopes of leaving one day were futile , as the decision had to be approved by his guardian.""".lower(), """The Court observes that the situation was subsequently presented differently before the Riga Regional Court , the applicant having submitted , in the context of her appeal , a certificate prepared at her request by a psychologist on 16 December 2008 , that is , after the first - instance judgment . This document indicated that , while the child 's young age prevented her from expressing a preference as to her place of residence , an immediate separation from her mother was to be ruled out on account of the likelihood of psychological trauma ( see paragraph 22 above ).""".lower() ] # Test classifier in Spark NLP pipeline document_assembler = nlp.DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = nlp.Tokenizer()\ .setInputCols(['document'])\ .setOutputCol("token") clf_model = legal.BertForSequenceClassification.pretrained("legclf_bert_judgements_agent", "en", "legal/models")\ .setInputCols(['document','token'])\ .setOutputCol("class")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) clf_pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, clf_model ]) # Generating example empty_df = spark.createDataFrame([['']]).toDF("text") model = clf_pipeline.fit(empty_df) light_model = nlp.LightPipeline(model) import pandas as pd import pyspark.sql.functions as F df = spark.createDataFrame(pd.DataFrame({"text" : text_list})) result = model.transform(df) result.select(F.explode(F.arrays_zip('document.result', 'class.result')).alias("cols"))\ .select(F.expr("cols['0']").alias("document"), F.expr("cols['1']").alias("class")).show(truncate = 60) ```
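Note that the example texts above are lowercased before classification and the classifier truncates at 512 tokens. A minimal sketch of that preprocessing, using whitespace tokens as a rough stand-in for the real tokenizer:

```python
def preprocess(text: str, max_tokens: int = 512) -> str:
    """Lowercase the input and cap its length, approximating what the
    example pipeline does before BertForSequenceClassification."""
    tokens = text.lower().split()
    return " ".join(tokens[:max_tokens])

print(preprocess("The Court OBSERVES that THE situation", max_tokens=4))  # the court observes that
```

Keeping training-time and inference-time casing consistent matters here because the model card's examples are all lowercased.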
## Results ```bash +------------------------------------------------------------+---------+ | document| class| +------------------------------------------------------------+---------+ |the applicant further noted that his placement in the hom...|APPLICANT| |the court observes that the situation was subsequently pr...| ECHR| +------------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_bert_judgements_agent| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|409.9 MB| |Case sensitive:|false| |Max sentence length:|512| ## References Based on https://arxiv.org/pdf/2208.06178.pdf with in-house postprocessing ## Benchmarking ```bash label precision recall f1-score support APPLICANT 0.91 0.89 0.90 238 COMMISSION/CHAMBER 0.80 1.00 0.89 20 ECHR 0.92 0.96 0.94 870 OTHER 0.95 0.90 0.93 940 STATE 0.91 0.94 0.92 205 THIRD_PARTIES 0.96 0.92 0.94 26 accuracy - - 0.93 2299 macro-avg 0.91 0.94 0.92 2299 weighted-avg 0.93 0.93 0.93 2299 ``` --- layout: model title: English XlmRoBertaForQuestionAnswering (from SauravMaheshkar) author: John Snow Labs name: xlm_roberta_qa_xlm_roberta_large_chaii date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-large-chaii` is an English model originally trained by `SauravMaheshkar`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_large_chaii_en_4.0.0_3.0_1655992708140.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_large_chaii_en_4.0.0_3.0_1655992708140.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_large_chaii","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_large_chaii","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.chaii.xlm_roberta.large.by_SauravMaheshkar").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_roberta_large_chaii| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.9 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/SauravMaheshkar/xlm-roberta-large-chaii --- layout: model title: NER Pipeline for 10 African Languages author: John Snow Labs name: xlm_roberta_large_token_classifier_masakhaner_pipeline date: 2022-02-10 tags: [masakhaner, african, xlm_roberta, multilingual, pipeline, amharic, hausa, igbo, kinyarwanda, luganda, swahilu, wolof, yoruba, nigerian, pidgin, xx, open_source] task: Named Entity Recognition language: xx edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on [xlm_roberta_large_token_classifier_masakhaner](https://nlp.johnsnowlabs.com/2021/12/06/xlm_roberta_large_token_classifier_masakhaner_xx.html) ner model which is imported from `HuggingFace`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/Ner_masakhaner/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/Ner_masakhaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_masakhaner_pipeline_xx_3.4.0_3.0_1644501579458.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_masakhaner_pipeline_xx_3.4.0_3.0_1644501579458.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python masakhaner_pipeline = PretrainedPipeline("xlm_roberta_large_token_classifier_masakhaner_pipeline", lang = "xx") masakhaner_pipeline.annotate("አህመድ ቫንዳ ከ3-10-2000 ጀምሮ በአዲስ አበባ ኖሯል።") ``` ```scala val masakhaner_pipeline = new PretrainedPipeline("xlm_roberta_large_token_classifier_masakhaner_pipeline", lang = "xx") masakhaner_pipeline.annotate("አህመድ ቫንዳ ከ3-10-2000 ጀምሮ በአዲስ አበባ ኖሯል።") ```
## Results ```bash +----------------+---------+ |chunk |ner_label| +----------------+---------+ |አህመድ ቫንዳ |PER | |ከ3-10-2000 ጀምሮ|DATE | |በአዲስ አበባ |LOC | +----------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_large_token_classifier_masakhaner_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Language:|xx| |Size:|1.8 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - XlmRoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: English T5ForConditionalGeneration Large Cased model (from google) author: John Snow Labs name: t5_efficient_large_nl4 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-large-nl4` is a English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_large_nl4_en_4.3.0_3.0_1675117814381.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_large_nl4_en_4.3.0_3.0_1675117814381.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_large_nl4","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_large_nl4","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_large_nl4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|351.1 MB| ## References - https://huggingface.co/google/t5-efficient-large-nl4 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_8 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-16-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_8_en_4.0.0_3.0_1655731919801.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_8_en_4.0.0_3.0_1655731919801.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_8","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_8","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_seed_8").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_8| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|415.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-16-finetuned-squad-seed-8 --- layout: model title: Romanian T5ForConditionalGeneration Small Cased model (from BlackKakapo) author: John Snow Labs name: t5_small_summarization date: 2023-01-31 tags: [ro, open_source, t5, tensorflow] task: Text Generation language: ro edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-summarization-ro` is a Romanian model originally trained by `BlackKakapo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_summarization_ro_4.3.0_3.0_1675155949366.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_summarization_ro_4.3.0_3.0_1675155949366.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_summarization","ro") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_summarization","ro") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_summarization| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ro| |Size:|287.1 MB| ## References - https://huggingface.co/BlackKakapo/t5-small-summarization-ro - https://img.shields.io/badge/V.1-18.10.2022-brightgreen --- layout: model title: Chunk Entity Resolver RxNorm-scdc author: John Snow Labs name: chunkresolve_rxnorm_scdc_clinical date: 2021-04-16 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to RxNorm codes using chunk embeddings (augmented with synonyms, four times richer than previous resolver). ## Predicted Entities RxNorm codes {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_RXNORM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_scdc_clinical_en_3.0.0_3.0_1618605079581.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_scdc_clinical_en_3.0.0_3.0_1618605079581.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... resolver = ChunkEntityResolverModel.pretrained("chunkresolve_rxnorm_scdc_clinical","en","clinical/models") \ .setInputCols("token","chunk_embeddings") \ .setOutputCol("entity") pipeline = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, resolver]) data = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation."]]).toDF("text") model = pipeline.fit(data) results = model.transform(data) ... ``` ```scala ... val resolver = ChunkEntityResolverModel.pretrained("chunkresolve_rxnorm_scdc_clinical","en","clinical/models") .setInputCols("token","chunk_embeddings") .setOutputCol("entity") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, wordEmbeddings, clinicalNer, nerConverter, chunkEmbeddings, resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. 
Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash +---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ | chunk| entity| target_text| code|confidence| +---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ | metformin|TREATMENT|metFORMIN compounding powder:::Metformin Hydrochloride Powder:::metFORMIN 500 mg oral tablet:::me...| 601021| 0.2364| | glipizide|TREATMENT|Glipizide Powder:::Glipizide Crystal:::Glipizide Tablets:::glipiZIDE 5 mg oral tablet:::glipiZIDE...| 241604| 0.3647| |dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG|TREATMENT|Ezetimibe and Atorvastatin Tablets:::Amlodipine and Atorvastatin Tablets:::Atorvastatin Calcium T...|1422084| 0.3407| | dapagliflozin|TREATMENT|Dapagliflozin Tablets:::dapagliflozin 5 mg oral tablet:::dapagliflozin 10 mg oral tablet:::Dapagl...|1488568| 0.7070| +---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|chunkresolve_rxnorm_scdc_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token, chunk_embeddings]| |Output Labels:|[rxnorm_code]| |Language:|en| --- layout: model title: English BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_en_cased date: 2022-12-02 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, 
adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-en-cased` is an English model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_en_cased_en_4.2.4_3.0_1670016692729.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_en_cased_en_4.2.4_3.0_1670016692729.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_en_cased","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_en_cased","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_en_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|405.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-en-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Legal Applicable Law Clause Binary Classifier author: John Snow Labs name: legclf_applicable_law_cuad date: 2023-01-12 tags: [en, legal, classification, agreement, applicable_law, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the applicable-law clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it’s better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
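The paragraph-splitting (by multiline) technique mentioned above can be sketched in a few lines of plain Python before the chunks are handed to the classifier pipeline. This is only an illustrative sketch, not part of the workshop tutorial; the function name and the blank-line regex are assumptions:

```python
import re

def split_into_paragraphs(text: str) -> list:
    """Split a document into paragraphs on blank lines (multiline splitting).

    Each resulting chunk can then be classified individually, which helps
    keep every piece within the model's 512-token limit.
    """
    # A blank line (a newline, optional whitespace, another newline) marks a break
    paragraphs = re.split(r"\n\s*\n", text)
    # Drop empty chunks and strip surrounding whitespace
    return [p.strip() for p in paragraphs if p.strip()]

sample = ("GOVERNING LAW.\nThis Agreement shall be governed by...\n\n"
          "SEVERABILITY.\nIf any provision is held invalid...")
print(split_into_paragraphs(sample))
```

Each returned chunk would then be placed in its own row of the `text` column of the DataFrame fed to the pipeline.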
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `applicable-law`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_applicable_law_cuad_en_1.0.0_3.0_1673557470944.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_applicable_law_cuad_en_1.0.0_3.0_1673557470944.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_applicable_law_cuad", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +----------------+ |result | +----------------+ |[applicable-law]| |[other] | |[other] | |[applicable-law]| +----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_applicable_law_cuad| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet and classified in-house + SEC documents + Lawinsider categorization ## Benchmarking ```bash label precision recall f1-score support applicable_law 0.87 0.93 0.90 72 other 0.95 0.90 0.92 101 accuracy - - 0.91 173 macro-avg 0.91 0.92 0.91 173 weighted-avg 0.92 0.91 0.91 173 ``` --- layout: model title: French Electra Embeddings (from dbmdz) author: John Snow Labs name: electra_embeddings_electra_base_french_europeana_cased_generator date: 2022-05-17 tags: [fr, open_source, electra, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-french-europeana-cased-generator` is a French model originally trained by `dbmdz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_french_europeana_cased_generator_fr_3.4.4_3.0_1652786233658.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_french_europeana_cased_generator_fr_3.4.4_3.0_1652786233658.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_french_europeana_cased_generator","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_french_europeana_cased_generator","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electra_base_french_europeana_cased_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|130.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/dbmdz/electra-base-french-europeana-cased-generator - https://github.com/stefan-it/europeana-bert - https://github.com/dbmdz/berts/issues/new --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_bert_base_uncased_few_shot_k_1024_finetuned_squad_seed_42 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-1024-finetuned-squad-seed-42` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_few_shot_k_1024_finetuned_squad_seed_42_en_4.0.0_3.0_1654180754770.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_few_shot_k_1024_finetuned_squad_seed_42_en_4.0.0_3.0_1654180754770.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_few_shot_k_1024_finetuned_squad_seed_42","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_few_shot_k_1024_finetuned_squad_seed_42","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased_1024d_seed_42").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_few_shot_k_1024_finetuned_squad_seed_42| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-1024-finetuned-squad-seed-42 --- layout: model title: Fast Neural Machine Translation Model from German to English author: John Snow Labs name: opus_mt_de_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, de, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `de` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_de_en_xx_2.7.0_2.4_1609166923201.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_de_en_xx_2.7.0_2.4_1609166923201.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_de_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Ich liebe Spark NLP."] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_de_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Ich liebe Spark NLP.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.de.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_de_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English Financial Bert Embeddings author: John Snow Labs name: bert_embeddings_FinancialBERT date: 2022-04-11 tags: [bert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `FinancialBERT` is a English model orginally trained by `ahmedrachid`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_FinancialBERT_en_3.4.2_3.0_1649672218476.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_FinancialBERT_en_3.4.2_3.0_1649672218476.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_FinancialBERT","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_FinancialBERT","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.FinancialBERT").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_FinancialBERT| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|412.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/ahmedrachid/FinancialBERT - https://www.researchgate.net/publication/358284785_FinancialBERT_-_A_Pretrained_Language_Model_for_Financial_Text_Mining - https://www.linkedin.com/in/ahmed-rachid/ --- layout: model title: French CamemBert Embeddings (from lijingxin) author: John Snow Labs name: camembert_embeddings_lijingxin_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `lijingxin`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_lijingxin_generic_model_fr_3.4.4_3.0_1653989437492.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_lijingxin_generic_model_fr_3.4.4_3.0_1653989437492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_lijingxin_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_lijingxin_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_lijingxin_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/lijingxin/dummy-model --- layout: model title: English image_classifier_vit_teeth_test ViTForImageClassification from steven123 author: John Snow Labs name: image_classifier_vit_teeth_test date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_teeth_test` is an English model originally trained by steven123. ## Predicted Entities `Good Teeth`, `Missing Teeth`, `Rotten Teeth` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_teeth_test_en_4.1.0_3.0_1660172606275.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_teeth_test_en_4.1.0_3.0_1660172606275.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_teeth_test", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_teeth_test", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
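The snippets above assume an `imageDF` already exists. A minimal sketch of producing it: the path-gathering helper below is plain Python, and the Spark call (shown in a comment, since it needs a live `SparkSession`) uses Spark's built-in `image` data source.

```python
# Sketch of assembling the inputs for `imageDF`; the folder path is hypothetical.
from pathlib import Path

def list_images(folder, exts=(".jpg", ".jpeg", ".png")):
    """Collect image file paths under `folder` (non-recursive), sorted."""
    return sorted(str(p) for p in Path(folder).iterdir() if p.suffix.lower() in exts)

# With a SparkSession in scope, the DataFrame itself would be created roughly as:
# imageDF = spark.read.format("image").option("dropInvalid", True).load("path/to/images")
```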
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_teeth_test| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Google's Tapas Table Understanding (Mini, SQA) author: John Snow Labs name: table_qa_tapas_mini_finetuned_sqa date: 2022-09-30 tags: [en, table, qa, question, answering, open_source] task: Table Question Answering language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: TapasForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a zero-shot table understanding model that lets you carry out question answering over Spark DataFrames. If your data is stored in a tabular file format such as CSV, load it into a Spark DataFrame before using the model. Size of this model: Mini. Has aggregation operations?: False. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_mini_finetuned_sqa_en_4.2.0_3.0_1664530600015.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_mini_finetuned_sqa_en_4.2.0_3.0_1664530600015.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python json_data = """ { "header": ["name", "money", "age"], "rows": [ ["Donald Trump", "$100,000,000", "75"], ["Elon Musk", "$20,000,000,000,000", "55"] ] } """ queries = [ "Who earns less than 200,000,000?", "Who earns 100,000,000?", "How much money has Donald Trump?", "How old are they?", ] data = spark.createDataFrame([ [json_data, " ".join(queries)] ]).toDF("table_json", "questions") document_assembler = MultiDocumentAssembler() \ .setInputCols("table_json", "questions") \ .setOutputCols("document_table", "document_questions") sentence_detector = SentenceDetector() \ .setInputCols(["document_questions"]) \ .setOutputCol("questions") table_assembler = TableAssembler()\ .setInputCols(["document_table"])\ .setOutputCol("table") tapas = TapasForQuestionAnswering\ .pretrained("table_qa_tapas_mini_finetuned_sqa","en")\ .setInputCols(["questions", "table"])\ .setOutputCol("answers") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, table_assembler, tapas ]) model = pipeline.fit(data) model\ .transform(data)\ .selectExpr("explode(answers) AS answer")\ .select("answer")\ .show(truncate=False) ```
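The `table_json` payload above is hand-written, but the description notes that tables often come from files such as CSV. A small illustrative helper (not part of the card's pipeline) for serializing a header plus rows into that `{"header": ..., "rows": ...}` format:

```python
import json

def to_table_json(header, rows):
    """Serialize a header and rows into the JSON layout the Tapas example uses.

    Cells are stringified, matching the all-string rows in the example above.
    """
    str_rows = [[str(cell) for cell in row] for row in rows]
    return json.dumps({"header": list(header), "rows": str_rows})

payload = to_table_json(["name", "age"], [["Donald Trump", 75], ["Elon Musk", 55]])
```

Rows read via `csv.reader` or collected from a Spark DataFrame can be fed straight into this helper before building the `(table_json, questions)` DataFrame.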
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |answer | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} | |{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|table_qa_tapas_mini_finetuned_sqa| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|43.4 MB| |Case sensitive:|false| ## References https://www.microsoft.com/en-us/download/details.aspx?id=54253 --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from BillZou) author: John Snow Labs name: distilbert_qa_billzou_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained 
DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `BillZou`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_billzou_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768320324.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_billzou_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768320324.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_billzou_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_billzou_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_billzou_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/BillZou/distilbert-base-uncased-finetuned-squad --- layout: model title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011) author: John Snow Labs name: distilbert_token_classifier_autotrain_name_all_904029569 date: 2023-03-03 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-name_all-904029569` is an English model originally trained by `ismail-lucifer011`. ## Predicted Entities `OOV`, `Name` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_all_904029569_en_4.3.1_3.0_1677881644846.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_all_904029569_en_4.3.1_3.0_1677881644846.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_all_904029569","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_all_904029569","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
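Downstream of a token classifier, the per-token tags are usually merged into entity chunks. An illustrative BIO-style grouping helper, separate from the card's pipeline (assumption: tags follow the common `B-`/`I-`/`O` scheme; this card's raw labels are `OOV` and `Name`):

```python
def group_entities(tokens, tags):
    """Merge BIO-tagged tokens into (label, text) chunks."""
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # start a new chunk
            if current:
                chunks.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)      # continue the open chunk
        else:                             # "O" tag or inconsistent I- tag
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(words)) for label, words in chunks]
```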
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_name_all_904029569| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ismail-lucifer011/autotrain-name_all-904029569 --- layout: model title: Stopwords Remover for Tagalog language (147 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, tl, open_source] task: Stop Words Removal language: tl edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_tl_3.4.1_3.0_1646673050697.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_tl_3.4.1_3.0_1646673050697.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","tl") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Hindi ka mas mahusay kaysa sa akin"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stopWords = StopWordsCleaner.pretrained("stopwords_iso","tl") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stopWords)) val data = Seq("Hindi ka mas mahusay kaysa sa akin").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("tl.stopwords").predict("""Hindi ka mas mahusay kaysa sa akin""") ```
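What `StopWordsCleaner` does, in miniature: drop every token found in a stopword list. The sketch below uses a tiny illustrative subset of Tagalog stopwords, not the model's actual 147-entry list, so its output will not exactly match the card's results.

```python
# Illustrative subset only -- the real stopwords_iso list is larger.
TAGALOG_STOPWORDS = {"hindi", "ka", "kaysa", "sa", "akin"}

def remove_stopwords(tokens, stopwords=TAGALOG_STOPWORDS):
    """Keep only tokens whose lowercase form is not in the stopword list."""
    return [t for t in tokens if t.lower() not in stopwords]

print(remove_stopwords("Hindi ka mas mahusay kaysa sa akin".split()))
# ['mas', 'mahusay']
```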
## Results ```bash +------+ |result| +------+ |[mas] | +------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|tl| |Size:|1.8 KB| --- layout: model title: Vietnamese BertForQuestionAnswering model (from bhavikardeshna) author: John Snow Labs name: bert_qa_multilingual_bert_base_cased_vietnamese date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: vi edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `multilingual-bert-base-cased-vietnamese` is a Vietnamese model originally trained by `bhavikardeshna`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_bert_base_cased_vietnamese_vi_4.0.0_3.0_1654188597434.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_bert_base_cased_vietnamese_vi_4.0.0_3.0_1654188597434.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_multilingual_bert_base_cased_vietnamese","vi") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_multilingual_bert_base_cased_vietnamese","vi") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("vi.answer_question.bert.multilingual_vietnamese_tuned_base_cased.by_bhavikardeshna").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
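The NLU one-liner above packs the question and context into a single string separated by `|||`. A trivial helper for that convention (illustrative; the separator is taken from the snippet above):

```python
def qa_input(question, context, sep="|||"):
    """Join a question and its context with the NLU QA separator."""
    return f"{question}{sep}{context}"

print(qa_input("What's my name?", "My name is Clara and I live in Berkeley."))
# What's my name?|||My name is Clara and I live in Berkeley.
```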
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_multilingual_bert_base_cased_vietnamese| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|vi| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/bhavikardeshna/multilingual-bert-base-cased-vietnamese --- layout: model title: Fast Neural Machine Translation Model from English to Italian author: John Snow Labs name: opus_mt_en_it date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, it, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial ones. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `en` - target languages: `it` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_it_xx_2.7.0_2.4_1609166542377.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_it_xx_2.7.0_2.4_1609166542377.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_it", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_it", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.it').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_it| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011) author: John Snow Labs name: distilbert_token_classifier_autotrain_name_all_904029569 date: 2023-03-06 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-name_all-904029569` is an English model originally trained by `ismail-lucifer011`. ## Predicted Entities `OOV`, `Name` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_all_904029569_en_4.3.1_3.0_1678134032555.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_all_904029569_en_4.3.1_3.0_1678134032555.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_all_904029569","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_all_904029569","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_name_all_904029569| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ismail-lucifer011/autotrain-name_all-904029569 --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223386005.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223386005.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|460.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0 --- layout: model title: Spanish RobertaForQuestionAnswering (from IIC) author: John Snow Labs name: roberta_qa_roberta_base_spanish_sqac date: 2022-06-20 tags: [es, open_source, question_answering, roberta] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-sqac` is a Spanish model originally trained by `IIC`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_spanish_sqac_es_4.0.0_3.0_1655734667499.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_spanish_sqac_es_4.0.0_3.0_1655734667499.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_spanish_sqac","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_spanish_sqac","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.sqac.roberta.base.by_IIC").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
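In the NLU one-liner above, the question and its context travel in a single string separated by `|||`. A minimal sketch of that packing convention (the separator handling shown here is inferred from the example, not taken from the NLU source):

```python
def split_question_context(packed: str):
    """Split an NLU-style 'question|||context' string into its two parts."""
    question, _, context = packed.partition("|||")
    return question.strip(), context.strip()

question, context = split_question_context(
    "What's my name?|||My name is Clara and I live in Berkeley."
)
```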
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_spanish_sqac| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|es| |Size:|460.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/IIC/roberta-base-spanish-sqac - https://www.bsc.es/ - https://arxiv.org/abs/2107.07253 - https://paperswithcode.com/sota?task=question-answering&dataset=PlanTL-GOB-ES%2FSQAC --- layout: model title: Legal Stock Purchase Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_stock_purchase_agreement_bert date: 2022-11-25 tags: [en, legal, classification, agreement, stock_purchase, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_stock_purchase_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `stock-purchase-agreement` or not (Binary Classification). Unlike the Longformer-based version, this model is lighter and faster at inference. ## Predicted Entities `stock-purchase-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_stock_purchase_agreement_bert_en_1.0.0_3.0_1669371824961.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_stock_purchase_agreement_bert_en_1.0.0_3.0_1669371824961.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_stock_purchase_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
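The classifier head turns the document's sentence embedding into scores for the two classes and keeps the most probable one. As a rough, framework-free illustration (the logits below are invented; the real model's weights are learned during training):

```python
import math

def softmax(scores):
    """Normalize raw class scores into probabilities."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["stock-purchase-agreement", "other"]
logits = [2.1, 0.3]  # hypothetical class scores for one document
probs = softmax(logits)
prediction = labels[probs.index(max(probs))]
```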
## Results ```bash +--------------------------+ |result | +--------------------------+ |[stock-purchase-agreement]| |[other] | |[other] | |[stock-purchase-agreement]| +--------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_stock_purchase_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.92 0.95 0.93 82 stock-purchase-agreement 0.91 0.86 0.88 49 accuracy - - 0.92 131 macro-avg 0.92 0.90 0.91 131 weighted-avg 0.92 0.92 0.92 131 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from huggingfaceepita) author: John Snow Labs name: distilbert_qa_huggingfaceepita_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `huggingfaceepita`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_huggingfaceepita_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771347237.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_huggingfaceepita_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771347237.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_huggingfaceepita_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(False) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_huggingfaceepita_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
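Extractive QA models of this family score every context token as a possible answer start and end, and the annotator returns the highest-scoring valid span. A toy sketch of that selection step with fabricated logits (the real scores come from the transformer):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick the (start, end) pair maximizing start+end score with start <= end."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best[2]:
                best = (i, j, score)
    return best[0], best[1]

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 3.0, 0.0, 0.1, 0.0, 0.0, 0.2, 0.0]  # hypothetical
end   = [0.0, 0.1, 0.1, 2.5, 0.1, 0.0, 0.0, 0.0, 0.3, 0.0]  # hypothetical
i, j = best_span(start, end)
answer = " ".join(tokens[i:j + 1])
```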
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_huggingfaceepita_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/huggingfaceepita/distilbert-base-uncased-finetuned-squad --- layout: model title: Part of Speech for Hungarian author: John Snow Labs name: pos_ud_szeged date: 2021-03-08 tags: [part_of_speech, open_source, hungarian, pos_ud_szeged, hu] task: Part of Speech Tagging language: hu edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - DET - NOUN - ADV - NUM - ADJ - PUNCT - VERB - ADP - CCONJ - PRON - PROPN - SCONJ - AUX - PART - X - INTJ {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_szeged_hu_3.0.0_3.0_1615230358382.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_szeged_hu_3.0.0_3.0_1615230358382.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_szeged", "hu") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Helló John Snow Labs!']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_szeged", "hu") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Helló John Snow Labs!").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Helló John Snow Labs!"] token_df = nlu.load('hu.pos.ud_szeged').predict(text) token_df ```
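The averaged perceptron behind this tagger scores each candidate tag from sparse features of the token and its neighbours and keeps the argmax. A highly simplified sketch (the feature names and weights here are invented for illustration; the real model learns them from the UD Szeged treebank):

```python
# Invented weights: feature -> {tag: weight}; a trained model learns these.
weights = {
    "suffix=s":   {"NOUN": 1.0, "VERB": 0.5},
    "is_title":   {"PROPN": 2.0, "NOUN": 0.3},
    "prev=PROPN": {"PROPN": 1.5},
}

def predict_tag(features, tags=("NOUN", "VERB", "PROPN", "PUNCT")):
    """Sum feature weights per tag and return the best-scoring tag."""
    scores = {t: 0.0 for t in tags}
    for f in features:
        for tag, w in weights.get(f, {}).items():
            scores[tag] += w
    return max(scores, key=scores.get)

tag = predict_tag(["is_title", "prev=PROPN"])
```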
## Results ```bash token pos 0 Helló ADJ 1 John PROPN 2 Snow PROPN 3 Labs PROPN 4 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_szeged| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|hu| --- layout: model title: Fast Neural Machine Translation Model from Nyaneka to English author: John Snow Labs name: opus_mt_nyk_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, nyk, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `nyk` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_nyk_en_xx_2.7.0_2.4_1609166201897.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_nyk_en_xx_2.7.0_2.4_1609166201897.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_nyk_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("PUT YOUR STRING HERE") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_nyk_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.nyk.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
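Marian produces the translation one token at a time: at each step the decoder scores the target vocabulary and, under greedy decoding, emits the best token until an end-of-sequence marker appears. A toy sketch of that loop (the score table is fabricated; a real decoder conditions on the source sentence and the tokens generated so far):

```python
# Fabricated per-step vocabulary scores standing in for decoder outputs.
step_scores = [
    {"hello": 0.9, "world": 0.05, "</s>": 0.05},
    {"hello": 0.1, "world": 0.8, "</s>": 0.1},
    {"hello": 0.05, "world": 0.05, "</s>": 0.9},
]

def greedy_decode(steps):
    """Emit the argmax token per step, stopping at the end marker."""
    out = []
    for scores in steps:
        token = max(scores, key=scores.get)
        if token == "</s>":
            break
        out.append(token)
    return out

translation = " ".join(greedy_decode(step_scores))
```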
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_nyk_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English T5ForConditionalGeneration Base Cased model (from doc2query) author: John Snow Labs name: t5_stackexchange_base_v1 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `stackexchange-t5-base-v1` is an English model originally trained by `doc2query`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_stackexchange_base_v1_en_4.3.0_3.0_1675107351420.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_stackexchange_base_v1_en_4.3.0_3.0_1675107351420.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_stackexchange_base_v1","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_stackexchange_base_v1","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_stackexchange_base_v1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|1.0 GB| ## References - https://huggingface.co/doc2query/stackexchange-t5-base-v1 - https://arxiv.org/abs/1904.08375 - https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf - https://arxiv.org/abs/2104.08663 - https://github.com/UKPLab/beir - https://www.sbert.net/examples/unsupervised_learning/query_generation/README.html --- layout: model title: English T5ForConditionalGeneration Small Cased model (from allenai) author: John Snow Labs name: t5_small_squad2_question_generation date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-squad2-question-generation` is an English model originally trained by `allenai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_squad2_question_generation_en_4.3.0_3.0_1675155768128.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_squad2_question_generation_en_4.3.0_3.0_1675155768128.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_squad2_question_generation","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_squad2_question_generation","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_squad2_question_generation| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|148.2 MB| ## References - https://huggingface.co/allenai/t5-small-squad2-question-generation --- layout: model title: English XLMRobertaForTokenClassification Large Cased model (from asahi417) author: John Snow Labs name: xlmroberta_ner_tner_large_multiconer_multi date: 2022-08-13 tags: [en, open_source, xlm_roberta, ner] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tner-xlm-roberta-large-multiconer-multi` is an English model originally trained by `asahi417`. ## Predicted Entities `product`, `corporation`, `group`, `work of art`, `person`, `location` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_large_multiconer_multi_en_4.1.0_3.0_1660424763219.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_large_multiconer_multi_en_4.1.0_3.0_1660424763219.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_large_multiconer_multi","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_large_multiconer_multi","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
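The `NerConverter` stage at the end of the pipeline groups token-level IOB tags into entity chunks: a `B-` tag opens a chunk and subsequent `I-` tags of the same type extend it. A minimal re-implementation of that grouping logic, for illustration only:

```python
def iob_to_chunks(tokens, tags):
    """Group (token, IOB-tag) pairs into (chunk_text, entity_type) pairs."""
    chunks, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), ctype))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(tok)
        else:  # "O" tag or inconsistent I- tag closes any open chunk
            if current:
                chunks.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:
        chunks.append((" ".join(current), ctype))
    return chunks

chunks = iob_to_chunks(
    ["John", "Snow", "Labs", "is", "in", "Delaware"],
    ["B-corporation", "I-corporation", "I-corporation", "O", "O", "B-location"],
)
```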
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_tner_large_multiconer_multi| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|1.9 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/asahi417/tner-xlm-roberta-large-multiconer-multi --- layout: model title: Spanish XlmRoBertaForQuestionAnswering (from bhavikardeshna) author: John Snow Labs name: xlm_roberta_qa_xlm_roberta_base_spanish date: 2022-06-23 tags: [es, open_source, question_answering, xlmroberta] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-spanish` is a Spanish model originally trained by `bhavikardeshna`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_spanish_es_4.0.0_3.0_1655991207140.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_spanish_es_4.0.0_3.0_1655991207140.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_base_spanish","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_base_spanish","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.xlm_roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_roberta_base_spanish| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|es| |Size:|883.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/bhavikardeshna/xlm-roberta-base-spanish --- layout: model title: Financial NER (lg, Large) author: John Snow Labs name: finner_financial_large date: 2022-10-20 tags: [en, finance, ner, annual, reports, 10k, filings, licensed] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: FinanceNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a `lg` (large) version of a financial model, trained with more generic labels than the other versions of the model (`sm`, `md`, ...) you can find in Models Hub. Please note this model requires some tokenization configuration to extract the currency (see python snippet below). The aim of this model is to detect the main pieces of financial information in annual reports of companies; more specifically, it was trained on 10-K filings.
The currently available entities are: - AMOUNT: Numeric amounts, not percentages - PERCENTAGE: Numeric amounts which are percentages - CURRENCY: The currency of the amount - FISCAL_YEAR: A date which expresses which month the fiscal exercise was closed for a specific year - DATE: Generic dates in context where either it's not a fiscal year or it can't be asserted as such given the context - PROFIT: Profit or revenue - PROFIT_INCREASE: A piece of information saying there was a profit / revenue increase in that fiscal year - PROFIT_DECLINE: A piece of information saying there was a profit / revenue decrease in that fiscal year - EXPENSE: An expense or loss - EXPENSE_INCREASE: A piece of information saying there was an expense increase in that fiscal year - EXPENSE_DECREASE: A piece of information saying there was an expense decrease in that fiscal year - CF: Cash flow operations - CF_INCREASE: A piece of information saying there was a cash flow increase - CF_DECREASE: A piece of information saying there was a cash flow decrease - LIABILITY: A mentioned liability in the text You can also check out the Relation Extraction model, which connects these entities together. ## Predicted Entities `AMOUNT`, `CURRENCY`, `DATE`, `FISCAL_YEAR`, `CF`, `PERCENTAGE`, `LIABILITY`, `EXPENSE`, `EXPENSE_INCREASE`, `EXPENSE_DECREASE`, `PROFIT`, `PROFIT_INCREASE`, `PROFIT_DECLINE`, `CF_INCREASE`, `CF_DECREASE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FINNER_FINANCIAL_10K/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_financial_large_en_1.0.0_3.0_1666272385549.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_financial_large_en_1.0.0_3.0_1666272385549.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token")\ .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$','€']) embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512) ner_model = finance.NerModel.pretrained("finner_financial_large", "en", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""License fees revenue decreased 40 %, or $ 0.5 million to $ 0.7 million for the year ended December 31, 2020 compared to $ 1.2 million for the year ended December 31, 2019. Services revenue increased 4 %, or $ 1.1 million, to $ 25.6 million for the year ended December 31, 2020 from $ 24.5 million for the year ended December 31, 2019. Costs of revenue, excluding depreciation and amortization increased by $ 0.1 million, or 2 %, to $ 8.8 million for the year ended December 31, 2020 from $ 8.7 million for the year ended December 31, 2019. The increase was primarily related to increase in internal staff costs of $ 1.1 million as we increased delivery staff and work performed on internal projects, partially offset by a decrease in third party consultant costs of $ 0.6 million as these were converted to internal staff or terminated. 
Also, a decrease in travel costs of $ 0.4 million due to travel restrictions caused by the global pandemic. As a percentage of revenue, cost of revenue, excluding depreciation and amortization was 34 % for each of the years ended December 31, 2020 and 2019. Sales and marketing expenses decreased 20 %, or $ 1.5 million, to $ 6.0 million for the year ended December 31, 2020 from $ 7.5 million for the year ended December 31, 2019."""]]).toDF("text") from pyspark.sql import functions as F model = pipeline.fit(data) result = model.transform(data) result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \ .select(F.expr("cols['0']").alias("text"), F.expr("cols['1']['entity']").alias("label")).show(200, truncate = False) ```
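The `setContextChars` list in the tokenizer above adds `$` and `€` to the characters that are split off as standalone tokens; this is what lets the model tag the currency symbol (`CURRENCY`) separately from the number (`AMOUNT`). A rough sketch of just the currency-splitting part of that behaviour (a simplification of the real `Tokenizer`):

```python
import re

def split_currency(text):
    """Split '$' and '€' off as standalone tokens (simplified sketch)."""
    tokens = []
    for piece in text.split():
        tokens.extend(t for t in re.split(r"([$€])", piece) if t)
    return tokens

tokens = split_currency("decreased to $0.5 million")
```

Without `$` among the context characters, `$0.5` would remain a single token and the separate `CURRENCY` and `AMOUNT` chunks shown in the Results section below could not be produced.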
## Results ```bash +---------------------------------------------------------+----------------+ |text |label | +---------------------------------------------------------+----------------+ |License fees revenue |PROFIT_DECLINE | |40 |PERCENTAGE | |$ |CURRENCY | |0.5 million |AMOUNT | |$ |CURRENCY | |0.7 million |AMOUNT | |December 31, 2020 |FISCAL_YEAR | |$ |CURRENCY | |1.2 million |AMOUNT | |December 31, 2019 |FISCAL_YEAR | |Services revenue |PROFIT_INCREASE | |4 |PERCENTAGE | |$ |CURRENCY | |1.1 million |AMOUNT | |$ |CURRENCY | |25.6 million |AMOUNT | |December 31, 2020 |FISCAL_YEAR | |$ |CURRENCY | |24.5 million |AMOUNT | |December 31, 2019 |FISCAL_YEAR | |Costs of revenue, excluding depreciation and amortization|EXPENSE_INCREASE| |$ |CURRENCY | |0.1 million |AMOUNT | |2 |PERCENTAGE | |$ |CURRENCY | |8.8 million |AMOUNT | |December 31, 2020 |FISCAL_YEAR | |$ |CURRENCY | |8.7 million |AMOUNT | |December 31, 2019 |FISCAL_YEAR | |internal staff costs |EXPENSE_INCREASE| |$ |CURRENCY | |1.1 million |AMOUNT | |third party consultant costs |EXPENSE_DECREASE| |$ |CURRENCY | |0.6 million |AMOUNT | |travel costs |EXPENSE_DECREASE| |$ |CURRENCY | |0.4 million |AMOUNT | |cost of revenue, excluding depreciation and amortization |EXPENSE | |34 |PERCENTAGE | |December 31, 2020 |FISCAL_YEAR | |2019 |DATE | |Sales and marketing expenses |EXPENSE_DECREASE| |20 |PERCENTAGE | |$ |CURRENCY | |1.5 million |AMOUNT | |$ |CURRENCY | |6.0 million |AMOUNT | |December 31, 2020 |FISCAL_YEAR | |$ |CURRENCY | |7.5 million |AMOUNT | |December 31, 2019 |FISCAL_YEAR | +---------------------------------------------------------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_financial_large| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.5 MB| ## References Manual annotations on 10-K Filings ## Benchmarking ```bash label 
tp fp fn prec rec f1 I-AMOUNT 849 16 10 0.9815029 0.98835856 0.9849188 B-AMOUNT 1056 23 64 0.97868395 0.94285715 0.9604366 B-DATE 574 52 25 0.9169329 0.95826375 0.9371428 I-LIABILITY 127 43 59 0.7470588 0.6827957 0.71348315 I-DATE 317 17 34 0.9491018 0.9031339 0.9255474 B-CF_DECREASE 16 0 10 1.0 0.61538464 0.76190484 I-EXPENSE 157 52 65 0.75119615 0.7072072 0.7285383 B-LIABILITY 71 22 44 0.76344085 0.6173913 0.6826923 I-CF 640 81 153 0.88765603 0.8070618 0.84544253 I-CF_DECREASE 37 3 17 0.925 0.6851852 0.7872341 B-PROFIT_INCREASE 46 10 7 0.8214286 0.8679245 0.8440367 B-EXPENSE 69 29 38 0.70408165 0.6448598 0.67317075 I-CF_INCREASE 54 43 3 0.556701 0.94736844 0.7012987 I-PERCENTAGE 6 0 2 1.0 0.75 0.85714287 I-PROFIT_DECLINE 36 10 5 0.7826087 0.8780488 0.82758623 B-CF_INCREASE 28 13 2 0.68292683 0.93333334 0.78873235 I-PROFIT 91 30 12 0.75206614 0.88349515 0.8125 B-CURRENCY 918 16 30 0.9828694 0.9683544 0.97555786 I-PROFIT_INCREASE 70 8 11 0.8974359 0.86419755 0.88050324 B-CF 183 49 53 0.7887931 0.7754237 0.7820512 B-PROFIT 47 22 21 0.68115944 0.6911765 0.6861314 B-PERCENTAGE 136 2 10 0.98550725 0.9315069 0.9577465 I-FISCAL_YEAR 729 39 23 0.94921875 0.9694149 0.9592105 B-PROFIT_DECLINE 22 5 4 0.8148148 0.84615386 0.83018863 B-EXPENSE_INCREASE 53 36 9 0.5955056 0.8548387 0.70198673 B-EXPENSE_DECREASE 35 6 10 0.85365856 0.7777778 0.81395346 B-FISCAL_YEAR 243 13 11 0.94921875 0.95669293 0.9529412 I-EXPENSE_DECREASE 69 22 11 0.7582418 0.8625 0.8070175 I-EXPENSE_INCREASE 114 70 5 0.6195652 0.9579832 0.7524752 Macro-average 6793 732 748 0.83021986 0.8368515 0.83352244 Micro-average 6793 732 748 0.90272427 0.90080893 0.9017655 ``` --- layout: model title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011) author: John Snow Labs name: distilbert_token_classifier_autotrain_job_all_903929564 date: 2023-03-14 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 
spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-job_all-903929564` is an English model originally trained by `ismail-lucifer011`. ## Predicted Entities `OOV`, `Job` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_job_all_903929564_en_4.3.1_3.0_1678783457474.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_job_all_903929564_en_4.3.1_3.0_1678783457474.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_job_all_903929564","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_job_all_903929564","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
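The classifier assigns one IOB-style tag per token; a downstream converter then groups contiguous `B-`/`I-` tags into entity chunks. As a rough, framework-free sketch of that grouping step (the tokens and tags below are illustrative, not actual model output):

```python
def bio_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new chunk, closing any open one.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)  # continue the open chunk
        else:
            # "O" (or an inconsistent I- tag) closes any open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(bio_to_chunks(
    ["She", "works", "as", "a", "data", "engineer"],
    ["O", "O", "O", "O", "B-Job", "I-Job"],
))  # -> [('data engineer', 'Job')]
```

In the Spark NLP pipeline above this logic lives inside the annotators, so no manual post-processing is needed.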
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_job_all_903929564| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ismail-lucifer011/autotrain-job_all-903929564 --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_v1_1_small date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-v1_1-small` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_v1_1_small_en_4.3.0_3.0_1675156397272.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_v1_1_small_en_4.3.0_3.0_1675156397272.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_v1_1_small","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_v1_1_small","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_v1_1_small| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|179.4 MB| ## References - https://huggingface.co/google/t5-v1_1-small - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://github.com/google-research/text-to-text-transfer-transformer/blob/master/released_checkpoints.md#t511 - https://arxiv.org/abs/2002.05202 - https://arxiv.org/pdf/1910.10683.pdf - https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67 --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_bert_tiny_finetuned_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-tiny-finetuned-squad` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_tiny_finetuned_squad_en_4.0.0_3.0_1654185007888.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_tiny_finetuned_squad_en_4.0.0_3.0_1654185007888.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_tiny_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_tiny_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.tiny").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
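Under the hood, extractive QA models like this one score every context token as a possible answer start and answer end, and the returned answer is the highest-scoring valid span (start before end). A simplified, framework-free sketch with made-up scores — the annotator performs this selection internally:

```python
def best_span(start_scores, end_scores, max_len=30):
    """Pick the (start, end) token pair maximizing start+end score."""
    best, pair = float("-inf"), (0, 0)
    for s, s_score in enumerate(start_scores):
        # Only consider spans of bounded length with end >= start.
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best:
                best, pair = score, (s, e)
    return pair

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]  # illustrative scores
end   = [0.1, 0.1, 0.1, 4.8, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```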
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_tiny_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|16.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-tiny-finetuned-squad --- layout: model title: Pipeline to Detect Radiology Concepts (WIP) author: John Snow Labs name: jsl_rd_ner_wip_greedy_clinical_pipeline date: 2023-03-14 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [jsl_rd_ner_wip_greedy_clinical](https://nlp.johnsnowlabs.com/2021/04/01/jsl_rd_ner_wip_greedy_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_rd_ner_wip_greedy_clinical_pipeline_en_4.3.0_3.2_1678783884894.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_rd_ner_wip_greedy_clinical_pipeline_en_4.3.0_3.2_1678783884894.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("jsl_rd_ner_wip_greedy_clinical_pipeline", "en", "clinical/models") text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature..''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("jsl_rd_ner_wip_greedy_clinical_pipeline", "en", "clinical/models") val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. 
The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl_rd_wip_greedy.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature..""") ```
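The `confidence` values in the results below are chunk-level scores; a common convention — and an assumption here, not a statement about this pipeline's exact internals — is to average the per-token NER confidences of a chunk. A minimal sketch with illustrative token scores:

```python
def chunk_confidence(token_scores):
    # Assumption: chunk confidence = arithmetic mean of its tokens' confidences.
    return round(sum(token_scores) / len(token_scores), 4)

# e.g. a three-token chunk such as "suctioning yellow discharge"
print(chunk_confidence([0.41, 0.30, 0.33]))  # -> 0.3467
```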
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:----------------------------|--------:|------:|:---------------|-------------:| | 0 | 21-day-old | 17 | 26 | Age | 0.9913 | | 1 | Caucasian | 28 | 36 | Race_Ethnicity | 0.9988 | | 2 | male | 38 | 41 | Gender | 0.9996 | | 3 | for 2 days | 48 | 57 | Duration | 0.5107 | | 4 | congestion | 62 | 71 | Symptom | 0.8608 | | 5 | mom | 75 | 77 | Gender | 0.9711 | | 6 | suctioning yellow discharge | 88 | 114 | Symptom | 0.345967 | | 7 | nares | 135 | 139 | BodyPart | 0.3583 | | 8 | she | 147 | 149 | Gender | 0.997 | | 9 | his | 187 | 189 | Gender | 0.9866 | | 10 | breathing while feeding | 191 | 213 | Symptom | 0.2221 | | 11 | perioral cyanosis | 237 | 253 | Symptom | 0.82215 | | 12 | retractions | 258 | 268 | Symptom | 0.9902 | | 13 | One day ago | 272 | 282 | RelativeDate | 0.6992 | | 14 | mom | 285 | 287 | Gender | 0.9588 | | 15 | tactile temperature | 304 | 322 | Symptom | 0.18075 | | 16 | Tylenol | 345 | 351 | Drug | 0.9919 | | 17 | Baby | 354 | 357 | Age | 0.9988 | | 18 | decreased p.o. intake | 377 | 397 | Symptom | 0.477125 | | 19 | His | 400 | 402 | Gender | 0.9993 | | 20 | q.2h. 
to 5 to 10 minutes | 450 | 473 | Frequency | 0.3258 | | 21 | his | 488 | 490 | Gender | 0.9909 | | 22 | respiratory congestion | 492 | 513 | Symptom | 0.25015 | | 23 | He | 516 | 517 | Gender | 0.9998 | | 24 | tired | 550 | 554 | Symptom | 0.8179 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|jsl_rd_ner_wip_greedy_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Extract demographic entities (Voice of the Patients) author: John Snow Labs name: ner_vop_demographic_wip date: 2023-05-19 tags: [licensed, clinical, en, ner, vop, patient, demographic] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts demographic terms from texts written in the patient’s own words. Note: the ‘wip’ suffix indicates that model development is a work in progress; the model will be finalised and its performance improved in upcoming releases. ## Predicted Entities `Gender`, `Employment`, `RaceEthnicity`, `Age`, `Substance`, `RelationshipStatus`, `SubstanceQuantity` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_demographic_wip_en_4.4.2_3.0_1684512408110.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_demographic_wip_en_4.4.2_3.0_1684512408110.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_demographic_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["My grandma, who's 85 and Black, just had a pacemaker implanted in the cardiology department. 
The doctors say it'll help regulate her heartbeat and prevent future complications."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_demographic_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("My grandma, who's 85 and Black, just had a pacemaker implanted in the cardiology department. The doctors say it'll help regulate her heartbeat and prevent future complications.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:--------|:--------------| | grandma | Gender | | 85 | Age | | Black | RaceEthnicity | | doctors | Employment | | her | Gender | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_demographic_wip| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
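The precision, recall, and F1 figures reported for this model follow the standard definitions over true-positive (tp), false-positive (fp), and false-negative (fn) chunk counts:

```python
def prf(tp, fp, fn):
    precision = tp / (tp + fp)  # of predicted chunks, how many were right
    recall = tp / (tp + fn)     # of gold chunks, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return round(precision, 2), round(recall, 2), round(f1, 2)

# Reproduces the Gender row of the benchmarking table: tp=1299, fp=26, fn=18
print(prf(1299, 26, 18))  # -> (0.98, 0.99, 0.98)
```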
## Benchmarking ```bash label tp fp fn total precision recall f1 Gender 1299 26 18 1317 0.98 0.99 0.98 Employment 1192 80 51 1243 0.94 0.96 0.95 RaceEthnicity 28 0 5 33 1.00 0.85 0.92 Age 553 66 29 582 0.89 0.95 0.92 Substance 387 52 34 421 0.88 0.92 0.90 RelationshipStatus 20 2 4 24 0.91 0.83 0.87 SubstanceQuantity 60 12 25 85 0.83 0.71 0.76 macro_avg 3539 238 166 3705 0.92 0.89 0.90 micro_avg 3539 238 166 3705 0.94 0.96 0.95 ``` --- layout: model title: Romanian BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_ro_cased date: 2022-12-02 tags: [ro, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ro edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-ro-cased` is a Romanian model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_ro_cased_ro_4.2.4_3.0_1670018791853.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_ro_cased_ro_4.2.4_3.0_1670018791853.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_ro_cased","ro") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_ro_cased","ro") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
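Once the `embeddings` column is produced, a typical downstream step is comparing token vectors with cosine similarity. A dependency-free sketch on toy vectors (real BERT-base vectors have 768 dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 4))  # parallel vectors -> 1.0
```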
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_ro_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ro| |Size:|385.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-ro-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: English image_classifier_vit_modeversion2_m7_e8 ViTForImageClassification from sudo-s author: John Snow Labs name: image_classifier_vit_modeversion2_m7_e8 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_modeversion2_m7_e8` is an English model originally trained by sudo-s.
## Predicted Entities `45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_modeversion2_m7_e8_en_4.1.0_3.0_1660165696411.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_modeversion2_m7_e8_en_4.1.0_3.0_1660165696411.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_modeversion2_m7_e8", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_modeversion2_m7_e8", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
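The `class` output is the label whose logit is highest; converting raw logits into probabilities is a softmax. A small, framework-free sketch with made-up logits for three labels (the real model scores all of its labels at once):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.5, 2.0, 0.1]        # illustrative scores for three labels
probs = softmax(logits)
print(probs.index(max(probs)))  # predicted class index -> 1
```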
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_modeversion2_m7_e8| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.3 MB| --- layout: model title: Afrikaans Lemmatizer author: John Snow Labs name: lemma date: 2021-04-02 tags: [lemmatizer, af, open_source] task: Lemmatization language: af edition: Spark NLP 2.7.5 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a dictionary-based lemmatizer that assigns all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/TEXT_PREPROCESSING/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TEXT_PREPROCESSING.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_af_2.7.5_2.4_1617374543805.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_af_2.7.5_2.4_1617374543805.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma", "af") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")) results = model.transform(spark.createDataFrame([["Ons het besliste teen-resessiebesteding deur die regering geïmplementeer , veral op infrastruktuur ."]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma", "af") .setInputCols("token") .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Ons het besliste teen-resessiebesteding deur die regering geïmplementeer , veral op infrastruktuur .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Ons het besliste teen-resessiebesteding deur die regering geïmplementeer , veral op infrastruktuur ."] lemma_df = nlu.load('af.lemma').predict(text, output_level = "document") lemma_df.lemma.values[0] ```
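At its core, a dictionary-based lemmatizer like this one is a lookup from inflected forms to a single root, falling back to the surface form for unknown tokens. A toy sketch with two entries taken from the example above — the pretrained model bundles a far larger Afrikaans dictionary:

```python
# Hypothetical mini-dictionary; the real model ships the full lookup table.
lemma_dict = {
    "besliste": "beslis",
    "geïmplementeer": "implementeer",
}

def lemmatize(tokens):
    # Unknown tokens fall back to their surface form.
    return [lemma_dict.get(tok.lower(), tok) for tok in tokens]

print(lemmatize(["Ons", "het", "besliste", "geïmplementeer"]))
# -> ['Ons', 'het', 'beslis', 'implementeer']
```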
## Results ```bash +--------------------+ | lemma| +--------------------+ | ons| | het| | beslis| |teen-resessiebest...| | deur| | die| | regering| | implementeer| | ,| | veral| | op| | infrastruktuur| | .| +--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Compatibility:|Spark NLP 2.7.5+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|af| ## Data Source The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) version 2.7. ## Benchmarking ```bash Precision=0.81, Recall=0.78, F1-score=0.79 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from mbyanfei) author: John Snow Labs name: distilbert_qa_mbyanfei_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `mbyanfei`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_mbyanfei_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772207677.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_mbyanfei_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772207677.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_mbyanfei_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_mbyanfei_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
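The `answer` column holds the predicted span together with inclusive `begin`/`end` character offsets into the context. A minimal, Spark-free sketch of recovering the answer text from such offsets — the offsets here are worked out by hand for the example sentence, not taken from a live run:

```python
def answer_text(context, begin, end):
    # Spark NLP annotation offsets are inclusive character indices,
    # so the slice must extend one past `end`.
    return context[begin:end + 1]

context = "My name is Clara and I live in Berkeley."
print(answer_text(context, 11, 15))  # Clara
```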
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_mbyanfei_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/mbyanfei/distilbert-base-uncased-finetuned-squad --- layout: model title: English T5ForConditionalGeneration Cased model (from Neo87z1) author: John Snow Labs name: t5_stekgramarchecker date: 2023-01-30 tags: [en, open_source, t5] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `STEKGramarChecker` is an English model originally trained by `Neo87z1`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_stekgramarchecker_en_4.3.0_3.0_1675098528002.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_stekgramarchecker_en_4.3.0_3.0_1675098528002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_stekgramarchecker","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_stekgramarchecker","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
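Many T5 fine-tunes expect a task prefix prepended to every input row; whether this particular model requires one is not documented on this card, so treat the `"grammar: "` value below as purely hypothetical. A minimal sketch of the input preparation:

```python
def with_task_prefix(texts, prefix="grammar: "):
    """Prepend a task prefix to each input row.
    The prefix value is a hypothetical example, not documented for this model."""
    return [prefix + t for t in texts]

print(with_task_prefix(["She go to school yesterday."]))
# ['grammar: She go to school yesterday.']
```

In Spark NLP the equivalent is usually configured on the annotator itself via `T5Transformer.setTask(...)`.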
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_stekgramarchecker| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|904.6 MB| ## References - https://huggingface.co/Neo87z1/STEKGramarChecker - https://github.com/EricFillion/happy-transformer - https://arxiv.org/abs/1702.04066 - https://www.vennify.ai/fine-tune-grammar-correction/ --- layout: model title: Legal Social Protection Document Classifier (EURLEX) author: John Snow Labs name: legclf_social_protection_bert date: 2023-03-06 tags: [en, legal, classification, clauses, social_protection, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The `legclf_social_protection_bert` model is a Bert Sentence Embeddings Document Classifier that determines whether a given document belongs to the Social_Protection class or not (Binary Classification), according to EuroVoc labels. ## Predicted Entities `Social_Protection`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_social_protection_bert_en_1.0.0_3.0_1678111810543.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_social_protection_bert_en_1.0.0_3.0_1678111810543.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_social_protection_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
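Since this is a binary classifier, a common follow-up step is turning the `category` column's label arrays into boolean flags. A Spark-free sketch — the nested lists below mimic the shape of the column's `result` arrays, with values copied from the expected output table rather than a live run:

```python
def is_social_protection(row_labels):
    """True when the classifier emitted the Social_Protection label for a row."""
    return "Social_Protection" in row_labels

# Each inner list mimics one row of the category column's `result` array.
predictions = [["Social_Protection"], ["Other"], ["Other"], ["Social_Protection"]]
print([is_social_protection(p) for p in predictions])  # [True, False, False, True]
```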
## Results ```bash +-------------------+ |result| +-------------------+ |[Social_Protection]| |[Other]| |[Other]| |[Social_Protection]| +-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_social_protection_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.93 0.95 0.94 41 Social_Protection 0.95 0.92 0.94 39 accuracy - - 0.94 80 macro-avg 0.94 0.94 0.94 80 weighted-avg 0.94 0.94 0.94 80 ``` --- layout: model title: Detect Drugs - Generalized Single Entity (ner_drugs_greedy) author: John Snow Labs name: ner_drugs_greedy date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a single-entity model that generalises all posology concepts into one and finds the longest available chunks of drugs. It is trained using `embeddings_clinical`, so please use the same embeddings in the pipeline. ## Predicted Entities `DRUG`. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_drugs_greedy_en_3.0.0_3.0_1617208410026.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_drugs_greedy_en_3.0.0_3.0_1617208410026.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_drugs_greedy", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("entities") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["DOSAGE AND ADMINISTRATION The initial dosage of hydrocortisone tablets may vary from 20 mg to 240 mg of hydrocortisone per day depending on the specific disease entity being treated."]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_drugs_greedy", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val 
ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("entities") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter)) val text = """DOSAGE AND ADMINISTRATION The initial dosage of hydrocortisone tablets may vary from 20 mg to 240 mg of hydrocortisone per day depending on the specific disease entity being treated.""" val data = Seq(text).toDS.toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.drugsgreedy").predict("""DOSAGE AND ADMINISTRATION The initial dosage of hydrocortisone tablets may vary from 20 mg to 240 mg of hydrocortisone per day depending on the specific disease entity being treated.""") ```
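The `entities` column produced by the NerConverter pairs each chunk with an entity label stored in its metadata. A Spark-free sketch of extracting those pairs — the dicts below only mimic the chunk-annotation shape, with values taken from the expected output table rather than a live run:

```python
def chunks_with_labels(entities):
    """Pair each chunk's text with the entity label kept in its metadata."""
    return [(e["result"], e["metadata"]["entity"]) for e in entities]

# Hypothetical annotations shaped like the `entities` output column.
sample = [
    {"result": "hydrocortisone tablets", "metadata": {"entity": "DRUG"}},
    {"result": "20 mg to 240 mg of hydrocortisone", "metadata": {"entity": "DRUG"}},
]
print(chunks_with_labels(sample))
```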
## Results ```bash +-----------------------------------+------------+ | chunk | ner_label | +-----------------------------------+------------+ | hydrocortisone tablets | DRUG | | 20 mg to 240 mg of hydrocortisone | DRUG | +-----------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_drugs_greedy| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on augmented i2b2_med7 + FDA dataset with ``embeddings_clinical``, [https://www.i2b2.org/NLP/Medication](https://www.i2b2.org/NLP/Medication). ## Benchmarking ```bash label tp fp fn prec rec f1 I-DRUG 37858 4166 3338 0.90086615 0.91897273 0.9098294 B-DRUG 29926 2006 1756 0.937179 0.9445742 0.9408621 tp: 67784 fp: 6172 fn: 5094 labels: 2 Macro-average prec: 0.91902256, rec: 0.9317734, f1: 0.92535406 Micro-average prec: 0.916545, rec: 0.93010235, f1: 0.9232739 ``` --- layout: model title: Legal Relation Extraction (Obligations, md, Unidirectional) author: John Snow Labs name: legre_obligations_md date: 2022-11-03 tags: [en, legal, licensed, obligation, re] task: Relation Extraction language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description IMPORTANT: Don't run this model on the whole legal agreement. Instead: - Split by paragraphs. 
You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration; - Use the `legclf_cuad_obligations_clause` Text Classifier to select only these paragraphs; We call "obligation" any sentence in the text stating that a Party (OBLIGATION_SUBJECT) must do (OBLIGATION_ACTION) something (OBLIGATION_OBJECT) to another Party (OBLIGATION_INDIRECT_OBJECT). This model extracts relationships, connecting all of those parts of the sentence (subject with action, action with object, etc). This model requires `legner_obligations` as an NER in the pipeline. It's an `md` model with Unidirectional Relations, meaning that the model retrieves in chunk1 the left side of the relation (source), and in chunk2 the right side (target). This is a Deep Learning model, meaning only semantics are taken into account, not grammatical structures. If you want to parse the relations using a grammatical dependency tree, please feel free to use [this other model](https://nlp.johnsnowlabs.com/2022/08/24/legpipe_obligations_en.html) ## Predicted Entities `is_obliged_to`, `is_obliged_with`, `is_obliged_object` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legre_obligations_md_en_1.0.0_3.0_1667474780413.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legre_obligations_md_en_1.0.0_3.0_1667474780413.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") ner_model = legal.BertForTokenClassification.pretrained("legner_obligations", "en", "legal/models")\ .setInputCols("token", "document")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = nlp.NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") re_model = legal.RelationExtractionDLModel()\ .pretrained("legre_obligations_md", "en", "legal/models")\ .setPredictionThreshold(0.5)\ .setInputCols(["ner_chunk", "document"])\ .setOutputCol("relations") pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, ner_model, ner_converter, re_model ]) empty_df = spark.createDataFrame([[""]]).toDF("text") model = pipeline.fit(empty_df) text="""Licensee agrees to reasonably cooperate with Licensor in achieving registration of the Licensed Mark.""" data = spark.createDataFrame([[text]]).toDF("text") result = model.transform(data) ```
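The `setPredictionThreshold(0.5)` call above drops low-confidence relations inside the annotator. The same filtering can be sketched Spark-free — the dicts below only mimic the relation rows' shape, with hypothetical confidence values:

```python
def filter_relations(relations, threshold=0.5):
    """Keep relations at or above the confidence threshold,
    mirroring what setPredictionThreshold does inside the annotator."""
    return [r for r in relations if r["confidence"] >= threshold]

sample = [
    {"relation": "is_obliged_to", "confidence": 0.92},
    {"relation": "is_obliged_with", "confidence": 0.31},
]
print(filter_relations(sample))  # keeps only the is_obliged_to relation
```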
## Results ```bash | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |-------------------|----------------------------|---------------|-------------|--------------------------------|----------------------------|---------------|-------------|------------------------------------------------|------------| | is_obliged_to | OBLIGATION_ACTION | 9 | 38 | agrees to reasonably cooperate | OBLIGATION_SUBJECT | 0 | 7 | Licensee | 0.91654503 | | is_obliged_with | OBLIGATION_SUBJECT | 0 | 7 | Licensee | OBLIGATION_INDIRECT_OBJECT | 45 | 52 | Licensor | 0.803172 | | is_obliged_to | OBLIGATION_SUBJECT | 0 | 7 | Licensee | OBLIGATION | 54 | 99 | in achieving registration of the Licensed Mark | 0.7439706 | | is_obliged_object | OBLIGATION_ACTION | 9 | 38 | agrees to reasonably cooperate | OBLIGATION_INDIRECT_OBJECT | 45 | 52 | Licensor | 0.96132916 | | is_obliged_object | OBLIGATION_ACTION | 9 | 38 | agrees to reasonably cooperate | OBLIGATION | 54 | 99 | in achieving registration of the Licensed Mark | 0.9174475 | | is_obliged_to | OBLIGATION_INDIRECT_OBJECT | 45 | 52 | Licensor | OBLIGATION | 54 | 99 | in achieving registration of the Licensed Mark | 0.9091029 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legre_obligations_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|402.3 MB| ## References Manual annotations on CUAD dataset ## Benchmarking ```bash label Recall Precision F1 Support is_obliged_object 0.989 0.994 0.992 177 is_obliged_to 0.995 1.000 0.998 202 is_obliged_with 1.000 0.961 0.980 49 Avg. 0.996 0.989 0.992 - Weighted-Avg. 
0.996 0.996 0.996 - ``` --- layout: model title: English DistilBertForQuestionAnswering model (from manudotc) author: John Snow Labs name: distilbert_qa_transformers_base_uncased_finetuneQA_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `transformers_distilbert-base-uncased_finetuneQA_squad` is an English model originally trained by `manudotc`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_transformers_base_uncased_finetuneQA_squad_en_4.0.0_3.0_1654728991354.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_transformers_base_uncased_finetuneQA_squad_en_4.0.0_3.0_1654728991354.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_transformers_base_uncased_finetuneQA_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_transformers_base_uncased_finetuneQA_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_manudotc").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_transformers_base_uncased_finetuneQA_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/manudotc/transformers_distilbert-base-uncased_finetuneQA_squad --- layout: model title: Pipeline to Detect Problem, Test and Treatment author: John Snow Labs name: ner_clinical_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, problem, test, treatment, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_clinical](https://nlp.johnsnowlabs.com/2020/01/30/ner_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_pipeline_en_3.4.1_3.0_1647873179531.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_pipeline_en_3.4.1_3.0_1647873179531.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_clinical_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` ```scala val pipeline = new PretrainedPipeline("ner_clinical_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. 
His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
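Once chunks and labels have been read out of the pipeline output, a quick summary of the predicted entity types is often useful. A Spark-free sketch — the (chunk, label) pairs below are copied from the expected output table, not produced by a live run:

```python
from collections import Counter

def label_counts(pairs):
    """Tally the entity labels from (chunk, label) pairs."""
    return Counter(label for _, label in pairs)

pairs = [
    ("congestion", "PROBLEM"),
    ("Tylenol", "TREATMENT"),
    ("His urine output", "TEST"),
    ("any diarrhea", "PROBLEM"),
]
print(label_counts(pairs))  # PROBLEM twice, TREATMENT and TEST once each
```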
## Results ```bash +-------------------------------------+---------+ |chunk |ner_label| +-------------------------------------+---------+ |congestion |PROBLEM | |suctioning yellow discharge |TREATMENT| |some mild problems with his breathing|PROBLEM | |any perioral cyanosis |PROBLEM | |retractions |PROBLEM | |a tactile temperature |PROBLEM | |Tylenol |TREATMENT| |his respiratory congestion |PROBLEM | |more tired |PROBLEM | |albuterol treatments |TREATMENT| |His urine output |TEST | |5 dirty diapers |TREATMENT| |4 wet diapers |TREATMENT| |any diarrhea |PROBLEM | |yellow colored |PROBLEM | +-------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Legal Solvency Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_solvency_bert date: 2023-03-05 tags: [en, legal, classification, clauses, solvency, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Solvency` clause type. To use this model, make sure you provide enough context as an input. 
Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Solvency`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_solvency_bert_en_1.0.0_3.0_1678050703009.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_solvency_bert_en_1.0.0_3.0_1678050703009.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_solvency_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
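The paragraph-splitting preprocessing recommended above (split by multiline, respect the 512-token budget) can be sketched Spark-free. Note the whitespace token count is only a rough stand-in for the model's subword tokenizer, which usually yields more tokens per paragraph:

```python
def split_paragraphs(text, max_tokens=512):
    """Split on blank lines and keep paragraphs within the token budget.
    Whitespace-based counting approximates (and undercounts) subword tokens."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [p for p in paragraphs if len(p.split()) <= max_tokens]

doc = "First solvency clause text.\n\nSecond unrelated clause."
print(split_paragraphs(doc))
```

Each surviving paragraph would then become one row of the `text` DataFrame fed to the classifier pipeline.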
## Results ```bash +----------+ |result| +----------+ |[Solvency]| |[Other]| |[Other]| |[Solvency]| +----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_solvency_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 1.00 0.99 0.99 70 Solvency 0.98 1.00 0.99 48 accuracy - - 0.99 118 macro-avg 0.99 0.99 0.99 118 weighted-avg 0.99 0.99 0.99 118 ``` --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_8 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-128-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_8_en_4.0.0_3.0_1657184366228.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_8_en_4.0.0_3.0_1657184366228.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_8","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_8","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_8| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-128-finetuned-squad-seed-8 --- layout: model title: Indonesian T5ForConditionalGeneration Small Cased model (from Wikidepia) author: John Snow Labs name: t5_indot5_small date: 2023-01-30 tags: [id, open_source, t5, tensorflow] task: Text Generation language: id edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `IndoT5-small` is an Indonesian model originally trained by `Wikidepia`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_indot5_small_id_4.3.0_3.0_1675097879795.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_indot5_small_id_4.3.0_3.0_1675097879795.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_indot5_small","id") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_indot5_small","id") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_indot5_small| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|id| |Size:|179.1 MB| ## References - https://huggingface.co/Wikidepia/IndoT5-small - https://github.com/Wikidepia/indonesian_datasets/tree/master/dump/mc4 --- layout: model title: German asr_wav2vec2_large_xlsr_53_german_cv9 TFWav2Vec2ForCTC from oliverguhr author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_german_cv9 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_german_cv9` is a German model originally trained by oliverguhr. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_german_cv9_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_german_cv9_de_4.2.0_3.0_1664096972246.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_german_cv9_de_4.2.0_3.0_1664096972246.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_german_cv9', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_german_cv9", lang = "de") val annotations = pipeline.transform(audioDF) ```
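Both snippets assume an `audioDF` that already holds the raw audio as an array of floats. As a hedged illustration (the `wav_to_floats` helper and the file name are made up for this sketch, not part of Spark NLP), a 16-bit PCM WAV file can be decoded into that representation with only the Python standard library:

```python
import struct
import wave

def wav_to_floats(path):
    # Read a 16-bit PCM WAV file and return its samples as floats in
    # [-1.0, 1.0], the representation Wav2Vec2 annotators consume.
    # Wav2Vec2 models expect 16 kHz mono input; resampling is out of scope here.
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "expected 16-bit PCM"
        assert wav.getnchannels() == 1, "expected mono audio"
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# The float list would then be wrapped into the DataFrame the pipeline reads, e.g.:
# audioDF = spark.createDataFrame([[wav_to_floats("sample.wav")]], ["audio_content"])
```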
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_german_cv9| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English RobertaForQuestionAnswering Cased model (from phanimvsk) author: John Snow Labs name: roberta_qa_edtech_model1 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Edtech_model1` is an English model originally trained by `phanimvsk`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_edtech_model1_en_4.3.0_3.0_1674208077877.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_edtech_model1_en_4.3.0_3.0_1674208077877.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_edtech_model1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_edtech_model1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_edtech_model1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|463.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/phanimvsk/Edtech_model1 --- layout: model title: Pipeline to Extract Demographic Entities from Oncology Texts author: John Snow Labs name: ner_oncology_demographics_pipeline date: 2023-03-09 tags: [licensed, clinical, en, ner, oncology, demographics] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_oncology_demographics](https://nlp.johnsnowlabs.com/2022/11/24/ner_oncology_demographics_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_demographics_pipeline_en_4.3.0_3.2_1678345339056.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_demographics_pipeline_en_4.3.0_3.2_1678345339056.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_oncology_demographics_pipeline", "en", "clinical/models") text = '''The patient is a 40-year-old man with history of heavy smoking.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_oncology_demographics_pipeline", "en", "clinical/models") val text = "The patient is a 40-year-old man with history of heavy smoking." val result = pipeline.fullAnnotate(text) ```
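`fullAnnotate` returns a list with one dictionary of annotations per input text. As a hedged sketch (the helper name is illustrative; it assumes the usual Spark NLP `Annotation` fields — `result`, `begin`, `end`, and a `metadata` dict with `entity` and `confidence` keys), the `ner_chunk` annotations can be flattened into plain rows:

```python
def chunks_to_rows(annotations, col="ner_chunk"):
    # Flatten the annotations for the first document into
    # (chunk, begin, end, label, confidence) tuples.
    rows = []
    for ann in annotations[0][col]:
        rows.append((ann.result, ann.begin, ann.end,
                     ann.metadata["entity"], float(ann.metadata["confidence"])))
    return rows
```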
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:--------------|--------:|------:|:---------------|-------------:| | 0 | 40-year-old | 17 | 27 | Age | 0.6743 | | 1 | man | 29 | 31 | Gender | 0.9365 | | 2 | heavy smoking | 49 | 61 | Smoking_Status | 0.7294 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_demographics_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_jessiejohnson TFWav2Vec2ForCTC from jessiejohnson author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_jessiejohnson date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_jessiejohnson` is an English model originally trained by jessiejohnson. 
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_by_jessiejohnson_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_jessiejohnson_en_4.2.0_3.0_1664019402500.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_jessiejohnson_en_4.2.0_3.0_1664019402500.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_jessiejohnson', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_jessiejohnson", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_jessiejohnson| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|228.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: French CamemBert Embeddings (from tpanza) author: John Snow Labs name: camembert_embeddings_tpanza_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `tpanza`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_tpanza_generic_model_fr_3.4.4_3.0_1653990506967.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_tpanza_generic_model_fr_3.4.4_3.0_1653990506967.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_tpanza_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_tpanza_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_tpanza_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/tpanza/dummy-model --- layout: model title: Japanese Bert Embeddings (from Geotrend) author: John Snow Labs name: bert_embeddings_bert_base_ja_cased date: 2022-04-11 tags: [bert, embeddings, ja, open_source] task: Embeddings language: ja edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-ja-cased` is a Japanese model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_ja_cased_ja_3.4.2_3.0_1649674707069.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_ja_cased_ja_3.4.2_3.0_1649674707069.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_ja_cased","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["私はSpark NLPを愛しています"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_ja_cased","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("私はSpark NLPを愛しています").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ja.embed.bert_base_ja_cased").predict("""私はSpark NLPを愛しています""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_ja_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Size:|350.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-ja-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011) author: John Snow Labs name: distilbert_token_classifier_autotrain_job_all_903929564 date: 2023-03-03 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-job_all-903929564` is an English model originally trained by `ismail-lucifer011`. ## Predicted Entities `Job`, `OOV` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_job_all_903929564_en_4.3.0_3.0_1677880986885.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_job_all_903929564_en_4.3.0_3.0_1677880986885.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_job_all_903929564","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_job_all_903929564","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_job_all_903929564| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ismail-lucifer011/autotrain-job_all-903929564 --- layout: model title: Stopwords Remover for Sanskrit language (562 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, sa, open_source] task: Stop Words Removal language: sa edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_sa_3.4.1_3.0_1646673126601.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_sa_3.4.1_3.0_1646673126601.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","sa") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["यावद्द्रॄष्टिभ्रुवोर्मध्ये तावत्कालभयं कुत: ॥"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","sa") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("यावद्द्रॄष्टिभ्रुवोर्मध्ये तावत्कालभयं कुत: ॥").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("sa.stopwords").predict("""यावद्द्रॄष्टिभ्रुवोर्मध्ये तावत्कालभयं कुत: ॥""") ```
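Conceptually, `StopWordsCleaner` just filters the token stream against the model's stopword list, case-insensitively by default. A hedged pure-Python sketch of that behavior (the helper name is illustrative, not a Spark NLP API):

```python
def clean_tokens(tokens, stopwords, case_sensitive=False):
    # Mirror what a stopwords cleaner does: drop any token found in the
    # stopword list, optionally ignoring case.
    if case_sensitive:
        banned = set(stopwords)
        return [t for t in tokens if t not in banned]
    banned = {s.lower() for s in stopwords}
    return [t for t in tokens if t.lower() not in banned]
```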
## Results ```bash +----------------------------------------------------+ |result | +----------------------------------------------------+ |[यावद्द्रॄष्टिभ्रुवोर्मध्ये, तावत्कालभयं, कुत, :, ॥]| +----------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|sa| |Size:|2.9 KB| --- layout: model title: Detect Cellular/Molecular Biology Entities (clinical_medium) author: John Snow Labs name: ner_cellular_emb_clinical_medium date: 2023-05-24 tags: [ner, clinical, en, licensed, dna, rna, cell_type, cell_line, protein] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for molecular biology related terms. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNN. 
## Predicted Entities `DNA`, `RNA`, `Cell_type`, `Cell_line`, `Protein` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CELLULAR/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_CELLULAR.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_cellular_emb_clinical_medium_en_4.4.2_3.0_1684919993889.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_cellular_emb_clinical_medium_en_4.4.2_3.0_1684919993889.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_cellular_emb_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(['sentence', 'token', 'ner'])\ .setOutputCol('ner_chunk') pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter ]) sample_df = spark.createDataFrame([["""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. 
Tax alone was also inactive."""]]).toDF("text") result = pipeline.fit(sample_df).transform(sample_df) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_cellular_emb_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter)) val sample_data = Seq("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. 
Tax alone was also inactive.""").toDS.toDF("text") val result = pipeline.fit(sample_data).transform(sample_data) ```
## Results ```bash +-----------------------------------------------------------+-----+---+---------+ |chunk |begin|end|ner_label| +-----------------------------------------------------------+-----+---+---------+ |intracellular signaling proteins |27 |58 |protein | |human T-cell leukemia virus type 1 promoter |130 |172|DNA | |Tax |186 |188|protein | |Tax-responsive element 1 |193 |216|DNA | |cyclic AMP-responsive members |237 |265|protein | |CREB/ATF family |274 |288|protein | |transcription factors |293 |313|protein | |Tax |389 |391|protein | |human T-cell leukemia virus type 1 Tax-responsive element 1|396 |454|DNA | |TRE-1), |457 |463|DNA | |lacZ gene |582 |590|DNA | |CYC1 promoter |617 |629|DNA | |TRE-1 |663 |667|DNA | |cyclic AMP response element-binding protein |695 |737|protein | |CREB |740 |743|protein | |CREB |749 |752|protein | |GAL4 activation domain |767 |788|protein | |GAD |791 |793|protein | |reporter gene |848 |860|DNA | |Tax |863 |865|protein | +-----------------------------------------------------------+-----+---+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_cellular_emb_clinical_medium| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|2.8 MB| ## References Trained on the JNLPBA corpus containing more than 2,404 publication abstracts. 
(http://www.geniaproject.org/) ## Benchmarking ```bash label precision recall f1-score support cell_line 0.59 0.80 0.68 1489 cell_type 0.89 0.76 0.82 4912 protein 0.80 0.90 0.85 9841 RNA 0.79 0.83 0.81 305 DNA 0.78 0.86 0.82 2845 micro-avg 0.80 0.85 0.82 19392 macro-avg 0.77 0.83 0.79 19392 weighted-avg 0.80 0.85 0.82 19392 ``` --- layout: model title: English image_classifier_vit_exper_batch_16_e8 ViTForImageClassification from sudo-s author: John Snow Labs name: image_classifier_vit_exper_batch_16_e8 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_exper_batch_16_e8` is an English model originally trained by sudo-s. 
## Predicted Entities `45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper_batch_16_e8_en_4.1.0_3.0_1660170063247.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper_batch_16_e8_en_4.1.0_3.0_1660170063247.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_exper_batch_16_e8", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_exper_batch_16_e8", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_exper_batch_16_e8| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.3 MB| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from kiana) author: John Snow Labs name: distilbert_qa_kiana_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `kiana`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_kiana_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771808591.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_kiana_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771808591.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kiana_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kiana_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_kiana_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/kiana/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal Insurances Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_insurances_bert date: 2023-03-05 tags: [en, legal, classification, clauses, insurances, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Insurances` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc.
Note that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Insurances`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_insurances_bert_en_1.0.0_3.0_1678049903507.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_insurances_bert_en_1.0.0_3.0_1678049903507.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_insurances_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
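The description above recommends paragraph splitting (by multiline) before running clause classifiers. A minimal, library-free sketch of that idea — the blank-line regex and the rough whitespace token count are illustrative assumptions, not the workshop's actual implementation:

```python
import re

def split_paragraphs(text, max_tokens=512):
    """Split a document on blank lines and flag paragraphs that may
    exceed the embedding model's 512-token window (rough estimate via
    whitespace tokens, not the model's actual subword tokenizer)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

doc = "Insurance. The Company shall maintain insurance.\n\nWHEREAS, the parties agree."
for para, fits in split_paragraphs(doc):
    print(fits, para[:40])
```

Each paragraph that fits the token window could then be fed as one row of the `text` column in the DataFrame shown above.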
## Results ```bash +-------+ |result| +-------+ |[Insurances]| |[Other]| |[Other]| |[Insurances]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_insurances_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Insurances 0.97 0.97 0.97 129 Other 0.98 0.98 0.98 162 accuracy - - 0.97 291 macro-avg 0.97 0.97 0.97 291 weighted-avg 0.97 0.97 0.97 291 ``` --- layout: model title: Understanding Perpetuity in "Return of Confidential Information" Clauses author: John Snow Labs name: legclf_nda_perpetuity date: 2023-04-07 tags: [en, legal, licensed, classification, nda, perpetuity, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Given a clause classified as `RETURN_OF_CONF_INFO` using the `legmulticlf_mnda_sections_paragraph_other` classifier, you can subclassify the sentences as `PERPETUITY` or `OTHER` from it using the `legclf_nda_perpetuity` model. ## Predicted Entities `PERPETUITY`, `OTHER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_nda_perpetuity_en_1.0.0_3.0_1680880224296.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_nda_perpetuity_en_1.0.0_3.0_1680880224296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pandas as pd document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_embeddings = nlp.UniversalSentenceEncoder.pretrained()\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") classifier = legal.ClassifierDLModel.pretrained("legclf_nda_perpetuity", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_embeddings, classifier ]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text_list = ["""Notwithstanding the return or destruction of all Evaluation Material, you or your Representatives shall continue to be bound by your obligations of confidentiality and other obligations hereunder.""", """There are no intended third party beneficiaries to this Agreement."""] df = spark.createDataFrame(pd.DataFrame({"text" : text_list})) result = model.transform(df) ```
## Results ```bash +--------------------------------------------------------------------------------+----------+ | text| class| +--------------------------------------------------------------------------------+----------+ |Notwithstanding the return or destruction of all Evaluation Material, you or ...|PERPETUITY| | There are no intended third-party beneficiaries to this Agreement.| OTHER| +--------------------------------------------------------------------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_nda_perpetuity| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References In-house annotations on the Non-disclosure Agreements ## Benchmarking ```bash label precision recall f1-score support OTHER 0.87 1.00 0.93 13 PERPETUITY 1.00 0.86 0.92 14 accuracy - - 0.93 27 macro-avg 0.93 0.93 0.93 27 weighted-avg 0.94 0.93 0.93 27 ``` --- layout: model title: Legal Stock Option Agreement Document Binary Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_stock_option_agreement_bert date: 2022-12-18 tags: [en, legal, classification, licensed, document, bert, stock, option, agreement, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_stock_option_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `stock-option-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. 
## Predicted Entities `stock-option-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_stock_option_agreement_bert_en_1.0.0_3.0_1671393862273.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_stock_option_agreement_bert_en_1.0.0_3.0_1671393862273.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_stock_option_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
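Document-level binary classifiers like this one are often run side by side, each contributing a True/False verdict. A small sketch of merging their predictions into one map — the classifier names and predicted labels below are hypothetical placeholders, not real model output:

```python
def clause_flags(predictions):
    """predictions maps a clause/document type to the label its binary
    classifier predicted; it counts as present unless the label is 'other'."""
    return {clause: label.lower() != "other" for clause, label in predictions.items()}

# Hypothetical outputs from two binary classifiers run on one document.
preds = {
    "stock-option-agreement": "stock-option-agreement",
    "whereas": "other",
}
print(clause_flags(preds))  # {'stock-option-agreement': True, 'whereas': False}
```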
## Results ```bash +-------+ |result| +-------+ |[stock-option-agreement]| |[other]| |[other]| |[stock-option-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_stock_option_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.99 0.99 0.99 204 stock-option-agreement 0.97 0.97 0.97 107 accuracy - - 0.98 311 macro-avg 0.98 0.98 0.98 311 weighted-avg 0.98 0.98 0.98 311 ``` --- layout: model title: English asr_20220507_122935 TFWav2Vec2ForCTC from lilitket author: John Snow Labs name: asr_20220507_122935 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_20220507_122935` is an English model originally trained by lilitket. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_20220507_122935_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_20220507_122935_en_4.2.0_3.0_1664117384162.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_20220507_122935_en_4.2.0_3.0_1664117384162.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_20220507_122935", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_20220507_122935", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_20220507_122935| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Spanish Named Entity Recognition (Large, CAPITEL competition at IberLEF 2020 dataset) author: John Snow Labs name: roberta_ner_roberta_large_bne_capitel_ner date: 2022-05-03 tags: [roberta, ner, open_source, es] task: Named Entity Recognition language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-large-bne-capitel-ner` is a Spanish model originally trained by `PlanTL-GOB-ES`. ## Predicted Entities `ORG`, `LOC`, `PER`, `OTH` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_ner_roberta_large_bne_capitel_ner_es_3.4.2_3.0_1651593572235.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_ner_roberta_large_bne_capitel_ner_es_3.4.2_3.0_1651593572235.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_roberta_large_bne_capitel_ner","es") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_roberta_large_bne_capitel_ner","es") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.ner.roberta_large_bne_capitel_ner").predict("""Amo Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_ner_roberta_large_bne_capitel_ner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|es| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne-capitel-ner - https://arxiv.org/abs/1907.11692 - http://www.bne.es/en/Inicio/index.html - https://sites.google.com/view/capitel2020 - https://github.com/PlanTL-GOB-ES/lm-spanish - https://arxiv.org/abs/2107.07253 --- layout: model title: English RobertaForQuestionAnswering (from Shappey) author: John Snow Labs name: roberta_qa_roberta_base_QnA_squad2_trained date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-QnA-squad2-trained` is an English model originally trained by `Shappey`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_QnA_squad2_trained_en_4.0.0_3.0_1655729837097.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_QnA_squad2_trained_en_4.0.0_3.0_1655729837097.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_QnA_squad2_trained","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_QnA_squad2_trained","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base.by_Shappey").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_QnA_squad2_trained| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|456.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Shappey/roberta-base-QnA-squad2-trained --- layout: model title: Abkhazian asr_xls_r_ab_spanish TFWav2Vec2ForCTC from joheras author: John Snow Labs name: asr_xls_r_ab_spanish date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_ab_spanish` is an Abkhazian model originally trained by joheras. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_xls_r_ab_spanish_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xls_r_ab_spanish_ab_4.2.0_3.0_1664020931579.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xls_r_ab_spanish_ab_4.2.0_3.0_1664020931579.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_xls_r_ab_spanish", "ab")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_xls_r_ab_spanish", "ab") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
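The `audioDf` referenced above is assumed to hold the raw audio as an array of floats (Wav2Vec2-style models typically expect 16 kHz mono samples in [-1, 1]). As a standard-library-only sketch under those assumptions, here is a synthetic one-second 440 Hz tone in that representation; with a real recording you would instead decode the file (e.g. with a library such as librosa or soundfile) into the same kind of float list:

```python
import math

SAMPLE_RATE = 16_000  # assumed model sample rate

# One second of a 440 Hz sine tone, floats in [-1, 1].
samples = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]

# Hypothetical wiring into the pipeline above:
# audioDf = spark.createDataFrame([(samples,)], ["audio_content"])
```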
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_xls_r_ab_spanish| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ab| |Size:|446.2 KB| --- layout: model title: German BertForMaskedLM Base Cased model (from dbmdz) author: John Snow Labs name: bert_embeddings_dbmdz_base_german_cased date: 2022-12-02 tags: [de, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: de edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-german-cased` is a German model originally trained by `dbmdz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_dbmdz_base_german_cased_de_4.2.4_3.0_1670017689214.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_dbmdz_base_german_cased_de_4.2.4_3.0_1670017689214.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_dbmdz_base_german_cased","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_dbmdz_base_german_cased","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_dbmdz_base_german_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|412.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/dbmdz/bert-base-german-cased - https://deepset.ai/german-bert - https://deepset.ai/ - https://spacy.io/ - https://github.com/allenai/scibert - https://github.com/stefan-it/fine-tuned-berts-seq - https://github.com/dbmdz/berts/issues/new --- layout: model title: Part of Speech for Ukrainian author: John Snow Labs name: pos_ud_iu date: 2020-05-05 11:59:00 +0800 task: Part of Speech Tagging language: uk edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [pos, uk] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_iu_uk_2.5.0_2.4_1588668890963.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_iu_uk_2.5.0_2.4_1588668890963.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_iu", "uk") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("За винятком того, що є королем півночі, Джон Сноу є англійським лікарем та лідером у розвитку анестезії та медичної гігієни.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_iu", "uk") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("За винятком того, що є королем півночі, Джон Сноу є англійським лікарем та лідером у розвитку анестезії та медичної гігієни.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""За винятком того, що є королем півночі, Джон Сноу є англійським лікарем та лідером у розвитку анестезії та медичної гігієни."""] pos_df = nlu.load('uk.pos.ud_iu').predict(text) pos_df ```
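The `fullAnnotate` call above returns annotation rows that carry the tag in `result` and the surface form in `metadata['word']` (as shown in the Results section). A small helper to flatten them into (word, tag) pairs — the sample rows here are mocked dict stand-ins for real pipeline output:

```python
def word_tag_pairs(annotations):
    """Pair each token's surface form with its predicted POS tag."""
    return [(a["metadata"]["word"], a["result"]) for a in annotations]

# Mocked annotations mirroring the structure of the pipeline's output.
rows = [
    {"result": "ADP", "metadata": {"word": "За"}},
    {"result": "NOUN", "metadata": {"word": "винятком"}},
]
print(word_tag_pairs(rows))  # [('За', 'ADP'), ('винятком', 'NOUN')]
```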
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=1, result='ADP', metadata={'word': 'За'}), Row(annotatorType='pos', begin=3, end=10, result='NOUN', metadata={'word': 'винятком'}), Row(annotatorType='pos', begin=12, end=15, result='PRON', metadata={'word': 'того'}), Row(annotatorType='pos', begin=16, end=16, result='PUNCT', metadata={'word': ','}), Row(annotatorType='pos', begin=18, end=19, result='SCONJ', metadata={'word': 'що'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_iu| |Type:|pos| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|uk| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: English asr_filipino_wav2vec2_l_xls_r_300m_official TFWav2Vec2ForCTC from Khalsuu author: John Snow Labs name: pipeline_asr_filipino_wav2vec2_l_xls_r_300m_official date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_filipino_wav2vec2_l_xls_r_300m_official` is an English model originally trained by Khalsuu.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_filipino_wav2vec2_l_xls_r_300m_official_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_filipino_wav2vec2_l_xls_r_300m_official_en_4.2.0_3.0_1664114625849.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_filipino_wav2vec2_l_xls_r_300m_official_en_4.2.0_3.0_1664114625849.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_filipino_wav2vec2_l_xls_r_300m_official', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_filipino_wav2vec2_l_xls_r_300m_official", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_filipino_wav2vec2_l_xls_r_300m_official| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering Base Uncased model (from alistvt) author: John Snow Labs name: bert_qa_base_uncased_pretrain_finetuned_coqa_fal date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-pretrain-finetuned-coqa-falt` is an English model originally trained by `alistvt`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_pretrain_finetuned_coqa_fal_en_4.0.0_3.0_1657185628420.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_pretrain_finetuned_coqa_fal_en_4.0.0_3.0_1657185628420.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_pretrain_finetuned_coqa_fal","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_pretrain_finetuned_coqa_fal","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_pretrain_finetuned_coqa_fal| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/alistvt/bert-base-uncased-pretrain-finetuned-coqa-falt --- layout: model title: Legal Whereas Clause Binary Classifier author: John Snow Labs name: legclf_whereas_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `whereas` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `whereas` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_whereas_clause_en_1.0.0_3.2_1660123208827.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_whereas_clause_en_1.0.0_3.2_1660123208827.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_whereas_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
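Since this is a binary classifier, the `category` output maps naturally onto the True/False values described above. A minimal pure-Python sketch, assuming each prediction arrives as the single-label list that collecting `category.result` typically yields (the helper name is illustrative, not a Spark NLP API):

```python
# Sketch: turn a per-document label list into the binary whereas/other flag.
# Assumption: `labels` looks like ["whereas"] or ["other"], one per document.
def is_whereas(labels):
    """True when the document was classified as a `whereas` clause."""
    return bool(labels) and labels[0] == "whereas"

print([is_whereas(l) for l in (["whereas"], ["other"], [])])  # [True, False, False]
```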
## Results ```bash +-------+ | result| +-------+ |[whereas]| |[other]| |[other]| |[whereas]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_whereas_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.98 0.98 0.98 257 whereas 0.97 0.96 0.96 126 accuracy - - 0.98 383 macro-avg 0.97 0.97 0.97 383 weighted-avg 0.98 0.98 0.98 383 ``` --- layout: model title: English AlbertForQuestionAnswering Base model (from elgeish) author: John Snow Labs name: albert_qa_cs224n_squad2.0_base_v2 date: 2022-06-24 tags: [en, open_source, albert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: AlBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cs224n-squad2.0-albert-base-v2` is an English model originally trained by `elgeish`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_cs224n_squad2.0_base_v2_en_4.0.0_3.0_1656064283263.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_cs224n_squad2.0_base_v2_en_4.0.0_3.0_1656064283263.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_cs224n_squad2.0_base_v2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_cs224n_squad2.0_base_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.albert.base_v2.by_elgeish").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_qa_cs224n_squad2.0_base_v2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|42.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/elgeish/cs224n-squad2.0-albert-base-v2 - http://web.stanford.edu/class/cs224n/project/default-final-project-handout.pdf - https://rajpurkar.github.io/SQuAD-explorer/ - https://github.com/elgeish/squad/tree/master/data --- layout: model title: English image_classifier_vit_resnet_50_ucSat ViTForImageClassification from YKXBCi author: John Snow Labs name: image_classifier_vit_resnet_50_ucSat date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_resnet_50_ucSat` is an English model originally trained by YKXBCi.
## Predicted Entities `buildings`, `denseresidential`, `storagetanks`, `tenniscourt`, `parkinglot`, `golfcourse`, `intersection`, `harbor`, `river`, `runway`, `mediumresidential`, `chaparral`, `freeway`, `overpass`, `mobilehomepark`, `baseballdiamond`, `agricultural`, `airplane`, `sparseresidential`, `forest`, `beach` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_resnet_50_ucSat_en_4.1.0_3.0_1660173507645.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_resnet_50_ucSat_en_4.1.0_3.0_1660173507645.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_resnet_50_ucSat", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_resnet_50_ucSat", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
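For a batch of images, the per-image labels in the `class` column can be tallied to see how predictions distribute over the land-use classes above. A pure-Python sketch, assuming each prediction is a single-label list such as collecting `class.result` typically yields (the helper name is illustrative):

```python
from collections import Counter

# Sketch: count predicted classes across a batch of images.
# Assumption: `predictions` is a list of single-label lists, one per image.
def class_distribution(predictions):
    return Counter(p[0] for p in predictions if p)

print(class_distribution([["beach"], ["beach"], ["harbor"]]))
```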
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_resnet_50_ucSat| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.1 MB| --- layout: model title: English BertForMaskedLM Large Cased model (from VMware) author: John Snow Labs name: bert_embeddings_v_2021_large date: 2022-12-06 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `vbert-2021-large` is an English model originally trained by `VMware`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_v_2021_large_en_4.2.4_3.0_1670327246381.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_v_2021_large_en_4.2.4_3.0_1670327246381.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_v_2021_large","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_v_2021_large","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
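The vectors produced in the `embeddings` column can be compared directly, for example with cosine similarity. A minimal pure-Python sketch (the 2-dimensional toy vectors are illustrative; this large model emits much higher-dimensional vectors):

```python
import math

# Sketch: cosine similarity between two token-embedding vectors.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```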
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_v_2021_large| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| ## References - https://huggingface.co/VMware/vbert-2021-large - https://medium.com/@rickbattle/weaknesses-of-wordpiece-tokenization-eb20e37fec99 --- layout: model title: Extract temporal entities (Voice of the Patients) author: John Snow Labs name: ner_vop_temporal_wip date: 2023-05-19 tags: [licensed, clinical, en, ner, vop, patient, temporal] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts temporal references from documents written in the patient’s own words. Note: the ‘wip’ suffix indicates that model development is a work in progress; the model will be finalised and its performance improved in upcoming releases. ## Predicted Entities `DateTime`, `Duration`, `Frequency` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_temporal_wip_en_4.4.2_3.0_1684513083221.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_temporal_wip_en_4.4.2_3.0_1684513083221.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_temporal_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["I broke my arm playing football last month and had to get surgery in the orthopedic department. 
The cast just came off yesterday and I'm excited to start physical therapy and get back to the game."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_temporal_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("I broke my arm playing football last month and had to get surgery in the orthopedic department. The cast just came off yesterday and I'm excited to start physical therapy and get back to the game.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
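The `ner_chunk` output pairs each recognized span with its label. A minimal post-processing sketch, assuming the chunks and their labels have already been collected into parallel Python lists (the helper and its optional label filter are illustrative, not a Spark NLP API):

```python
# Sketch: pair chunks with labels, optionally keeping only some entity types.
# Assumption: `chunks` and `labels` are parallel lists collected from the
# pipeline output, e.g. ["last month", "yesterday"] / ["DateTime", "DateTime"].
def chunks_with_labels(chunks, labels, keep=None):
    pairs = list(zip(chunks, labels))
    return pairs if keep is None else [p for p in pairs if p[1] in keep]

print(chunks_with_labels(["last month", "yesterday"], ["DateTime", "DateTime"]))
```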
## Results ```bash | chunk | ner_label | |:-----------|:------------| | last month | DateTime | | yesterday | DateTime | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_temporal_wip| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.9 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
## Benchmarking ```bash label tp fp fn total precision recall f1 DateTime 3886 597 437 4323 0.87 0.90 0.88 Duration 1888 323 383 2271 0.85 0.83 0.84 Frequency 886 213 186 1072 0.81 0.83 0.82 macro_avg 6660 1133 1006 7666 0.84 0.85 0.85 micro_avg 6660 1133 1006 7666 0.86 0.87 0.86 ``` --- layout: model title: English BertForQuestionAnswering model (from AnonymousSub) author: John Snow Labs name: bert_qa_fpdm_hier_bert_FT_new_newsqa date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_hier_bert_FT_new_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_fpdm_hier_bert_FT_new_newsqa_en_4.0.0_3.0_1654187845291.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_fpdm_hier_bert_FT_new_newsqa_en_4.0.0_3.0_1654187845291.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_fpdm_hier_bert_FT_new_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_fpdm_hier_bert_FT_new_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.bert.fpdm_hier_ft_by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_fpdm_hier_bert_FT_new_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/fpdm_hier_bert_FT_new_newsqa --- layout: model title: Turkish Named Entity Recognition (from savasy) author: John Snow Labs name: bert_ner_bert_base_turkish_ner_cased date: 2022-05-09 tags: [bert, ner, token_classification, tr, open_source] task: Named Entity Recognition language: tr edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-turkish-ner-cased` is a Turkish model originally trained by `savasy`. ## Predicted Entities `LOC`, `PER`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_turkish_ner_cased_tr_3.4.2_3.0_1652099145722.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_turkish_ner_cased_tr_3.4.2_3.0_1652099145722.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_turkish_ner_cased","tr") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Spark NLP'yi seviyorum"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_turkish_ner_cased","tr") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Spark NLP'yi seviyorum").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_bert_base_turkish_ner_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|tr| |Size:|412.9 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/savasy/bert-base-turkish-ner-cased - https://schweter.eu/storage/turkish-bert-wikiann/$file - https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt --- layout: model title: Translate English to South Slavic languages Pipeline author: John Snow Labs name: translate_en_zls date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, zls, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `zls` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_zls_xx_2.7.0_2.4_1609698691093.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_zls_xx_2.7.0_2.4_1609698691093.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_zls", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_zls", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.zls').predict(text, output_level='sentence') translate_df ```
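`annotate` returns a plain Python dict that maps output column names to lists of strings. The sketch below joins sentence-level translations back into a single string; the `"translation"` key used here is an assumption — inspect the keys of your own output first:

```python
# Sketch: join sentence-level pipeline output into one string.
# Assumption: the pipeline exposes its output under a "translation" key;
# check annotation.keys() on a real result before relying on this.
def join_translation(annotation, key="translation"):
    return " ".join(annotation.get(key, []))

print(join_translation({"translation": ["Prva rečenica.", "Druga rečenica."]}))
```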
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_zls| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate English to Luganda Pipeline author: John Snow Labs name: translate_en_lg date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, lg, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `lg` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_lg_xx_2.7.0_2.4_1609699110006.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_lg_xx_2.7.0_2.4_1609699110006.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_lg", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_lg", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.lg').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_lg| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate Sango to English Pipeline author: John Snow Labs name: translate_sg_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, sg, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended. - source languages: `sg` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_sg_en_xx_2.7.0_2.4_1609688282875.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_sg_en_xx_2.7.0_2.4_1609688282875.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_sg_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_sg_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.sg.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_sg_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Animal Product Document Classifier (EURLEX) author: John Snow Labs name: legclf_animal_product_bert date: 2023-03-06 tags: [en, legal, classification, clauses, animal_product, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_animal_product_bert model is a Bert Sentence Embeddings Document Classifier that determines whether a given document belongs to the class Animal_Product or not (Binary Classification), according to EuroVoc labels. ## Predicted Entities `Animal_Product`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_animal_product_bert_en_1.0.0_3.0_1678111601713.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_animal_product_bert_en_1.0.0_3.0_1678111601713.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_animal_product_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
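The Benchmarking section of this card reports per-class precision, recall, and F1 together with macro and weighted averages. As a plain-Python sanity check (the numbers are copied from that table), the averages follow directly from the per-class rows:

```python
# Per-class rows from this model's benchmarking table.
per_class = {
    "Animal_Product": {"precision": 0.91, "recall": 0.91, "f1": 0.91, "support": 591},
    "Other":          {"precision": 0.89, "recall": 0.90, "f1": 0.89, "support": 486},
}

def macro_avg(metric):
    # Unweighted mean over classes.
    return sum(row[metric] for row in per_class.values()) / len(per_class)

def weighted_avg(metric):
    # Mean weighted by each class's support.
    total = sum(row["support"] for row in per_class.values())
    return sum(row[metric] * row["support"] for row in per_class.values()) / total

print(round(macro_avg("f1"), 2))     # 0.9
print(round(weighted_avg("f1"), 2))  # 0.9
```

Macro averages weight every class equally, while weighted averages scale by class support; with a roughly balanced test set like this one, the two coincide after rounding.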
## Results

```bash
+----------------+
|result          |
+----------------+
|[Animal_Product]|
|[Other]         |
|[Other]         |
|[Animal_Product]|
+----------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_animal_product_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.8 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
         label  precision  recall  f1-score  support
Animal_Product       0.91    0.91      0.91      591
         Other       0.89    0.90      0.89      486
      accuracy          -       -      0.90     1077
     macro-avg       0.90    0.90      0.90     1077
  weighted-avg       0.90    0.90      0.90     1077
```

---
layout: model
title: Named Entity Recognition (NER) Model in Finnish (GloVe 6B 100)
author: John Snow Labs
name: finnish_ner_6B_100
date: 2020-09-01
task: Named Entity Recognition
language: fi
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [ner, fi, open_source]
supported: true
annotator: NerDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Finnish NER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. The model is trained with GloVe 6B 100 word embeddings, so be sure to use the same embeddings in the pipeline.

{:.h2_title}
## Predicted Entities

Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Product-`PRO`, Date-`DATE`, Event-`EVENT`.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/PP_EXPLAIN_DOCUMENT_FI/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/PP_EXPLAIN_DOCUMENT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/finnish_ner_6B_100_fi_2.6.0_2.4_1598965807300.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/finnish_ner_6B_100_fi_2.6.0_2.4_1598965807300.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_100d") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("finnish_ner_6B_100", "fi") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (28. lokakuuta 1955) on amerikkalaisia \u200b\u200bulkoministeriöitä, ohjelmistoja, sijoittajia ja filantroppeja. Microsoft on toiminut Microsoft Corporationin välittäjänä. I løbet af sin karriere hos Microsoft havde Gates stillinger som formand, administrerende direktør (administratorerendeøøør), præsident og chefsoftwarearkitekt, samtidig med at han var den største individualelle aktionær indtil maj 2014. mikrotietokonevoluutioille i 1970'erne 1980 1980erne. Født and opvokset i Seattle, Washington, var Gates grundlægger af Microsoft sammen med barndomsvennen Paul Allen i 1975 i Albuquerque, New Mexico; Det fortsatte med at blive verdens største virksomhed inden for personlig tietokoneohjelmistot. Gates førte virksomheden som formand and administratorer direktør, indtil han trådte tilbage som administrerende direktør tammikuu 2000, miehet han forblev formand blev chefsoftwarearkitekt. Olen slutningen 1990'erne var Gates blevet kritiseret for syn forretningstaktik, der er blevet betragtet som konkurrencebegrænsende. Denne udtalelse er blevet opretholdt ved adskillige retsafgørelser. Kesäkuun 2006 Meddelte Gates, at han ville overgå til en deltidsrolle i Microsoft og fuldtidsarbejde i Bill & Melinda Gates Foundation, det private velgørende fundament, som han og hans kone, Melinda Gates, oprettede i 2000. 
Han overførte gradvist sine pligter Tilaaja Ray Ozzie ja Craig Mundie. Han trådte tilbage som formand for Microsoft helmikuussa 2014 ja tiltrådte en ny stilling som teknologiatietojen antaja at støtte den nyudnævnte adminerende direktør Satya Nadella."]], ["text"]))
```

```scala
...
val embeddings = WordEmbeddingsModel.pretrained("glove_100d")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("finnish_ner_6B_100", "fi")
  .setInputCols(Array("document", "token", "embeddings"))
  .setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))
val data = Seq("William Henry Gates III (28. lokakuuta 1955) on amerikkalaisia ulkoministeriöitä, ohjelmistoja, sijoittajia ja filantroppeja. Microsoft on toiminut Microsoft Corporationin välittäjänä. I løbet af sin karriere hos Microsoft havde Gates stillinger som formand, administrerende direktør (administratorerendeøøør), præsident og chefsoftwarearkitekt, samtidig med at han var den største individualelle aktionær indtil maj 2014. mikrotietokonevoluutioille i 1970'erne 1980 1980erne. Født and opvokset i Seattle, Washington, var Gates grundlægger af Microsoft sammen med barndomsvennen Paul Allen i 1975 i Albuquerque, New Mexico; Det fortsatte med at blive verdens største virksomhed inden for personlig tietokoneohjelmistot. Gates førte virksomheden som formand and administratorer direktør, indtil han trådte tilbage som administrerende direktør tammikuu 2000, miehet han forblev formand blev chefsoftwarearkitekt. Olen slutningen 1990'erne var Gates blevet kritiseret for syn forretningstaktik, der er blevet betragtet som konkurrencebegrænsende. Denne udtalelse er blevet opretholdt ved adskillige retsafgørelser. 
Kesäkuun 2006 Meddelte Gates, at han ville overgå til en deltidsrolle i Microsoft og fuldtidsarbejde i Bill & Melinda Gates Foundation, det private velgørende fundament, som han og hans kone, Melinda Gates, oprettede i 2000. [9] Han overførte gradvist sine pligter Tilaaja Ray Ozzie ja Craig Mundie. Han trådte tilbage som formand for Microsoft helmikuussa 2014 ja tiltrådte en ny stilling som teknologiatietojen antaja at støtte den nyudnævnte adminerende direktør Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (28. lokakuuta 1955) on amerikkalaisia ​​ulkoministeriöitä, ohjelmistoja, sijoittajia ja filantroppeja. Microsoft on toiminut Microsoft Corporationin välittäjänä. I løbet af sin karriere hos Microsoft havde Gates stillinger som formand, administrerende direktør (administratorerendeøøør), præsident og chefsoftwarearkitekt, samtidig med at han var den største individualelle aktionær indtil maj 2014. mikrotietokonevoluutioille i 1970'erne 1980 1980erne. Født and opvokset i Seattle, Washington, var Gates grundlægger af Microsoft sammen med barndomsvennen Paul Allen i 1975 i Albuquerque, New Mexico; Det fortsatte med at blive verdens største virksomhed inden for personlig tietokoneohjelmistot. Gates førte virksomheden som formand and administratorer direktør, indtil han trådte tilbage som administrerende direktør tammikuu 2000, miehet han forblev formand blev chefsoftwarearkitekt. Olen slutningen 1990'erne var Gates blevet kritiseret for syn forretningstaktik, der er blevet betragtet som konkurrencebegrænsende. Denne udtalelse er blevet opretholdt ved adskillige retsafgørelser. Kesäkuun 2006 Meddelte Gates, at han ville overgå til en deltidsrolle i Microsoft og fuldtidsarbejde i Bill & Melinda Gates Foundation, det private velgørende fundament, som han og hans kone, Melinda Gates, oprettede i 2000. 
[9] Han overførte gradvist sine pligter Tilaaja Ray Ozzie ja Craig Mundie. Han trådte tilbage som formand for Microsoft helmikuussa 2014 ja tiltrådte en ny stilling som teknologiatietojen antaja at støtte den nyudnævnte adminerende direktør Satya Nadella."""] ner_df = nlu.load('fi.ner.6B_100d').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title}
## Results

```bash
+-----------------------+---------+
|chunk                  |ner_label|
+-----------------------+---------+
|William Henry Gates III|PER      |
|lokakuuta 1955         |DATE     |
|ulkoministeriöitä      |ORG      |
|Microsoft              |ORG      |
|Microsoft Corporationin|ORG      |
|Microsoft              |ORG      |
|Gates                  |PER      |
|2014                   |DATE     |
|Seattle                |LOC      |
|Washington             |LOC      |
|Gates                  |PER      |
|Microsoft              |ORG      |
|Paul Allen             |PER      |
|Albuquerque            |ORG      |
|New Mexico             |ORG      |
|Det                    |ORG      |
|verdens                |ORG      |
|Gates                  |PER      |
|tammikuu 2000          |DATE     |
|Gates                  |PER      |
+-----------------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|finnish_ner_6B_100|
|Type:|ner|
|Compatibility:|Spark NLP 2.6.0+|
|Edition:|Official|
|License:|Open Source|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|fi|
|Case sensitive:|false|

{:.h2_title}
## Data Source

Detailed information can be found at [https://www.aclweb.org/anthology/2020.lrec-1.567.pdf](https://www.aclweb.org/anthology/2020.lrec-1.567.pdf)

---
layout: model
title: Legal Question Answering (RoBerta, CUAD, Small)
author: John Snow Labs
name: legqa_roberta_cuad_small
date: 2023-01-30
tags: [en, licensed, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Legal RoBERTa-based Question Answering model, trained on SQuAD v2 and fine-tuned on the CUAD dataset (small). Using it requires a category-specific prompt; this is an example for extracting PARTIES:

```
"Highlight the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. 
Details: The two or more parties who signed the contract" ``` ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legqa_roberta_cuad_small_en_1.0.0_3.0_1675082898321.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legqa_roberta_cuad_small_en_1.0.0_3.0_1675082898321.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

spanClassifier = nlp.RoBertaForQuestionAnswering.pretrained("legqa_roberta_cuad_small","en", "legal/models") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    spanClassifier
])

text = """THIS CREDIT AGREEMENT is dated as of April 29, 2010, and is made by and among P.H. GLATFELTER COMPANY, a Pennsylvania corporation ( the "COMPANY") and certain of its subsidiaries identified on the signature pages hereto (each a "BORROWER" and collectively, the "BORROWERS"), each of the GUARANTORS (as hereinafter defined), the LENDERS (as hereinafter defined), PNC BANK, NATIONAL ASSOCIATION, in its capacity as agent for the Lenders under this Agreement (hereinafter referred to in such capacity as the "ADMINISTRATIVE AGENT"), and, for the limited purpose of public identification in trade tables, PNC CAPITAL MARKETS LLC and CITIZENS BANK OF PENNSYLVANIA, as joint arrangers and joint bookrunners, and CITIZENS BANK OF PENNSYLVANIA, as syndication agent.""".replace('\n',' ')

questions = ['"Highlight the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. Details: The two or more parties who signed the contract"']

qt = [[q, text] for q in questions]

example = spark.createDataFrame(qt).toDF("question", "context")

result = pipeline.fit(example).transform(example)

result.select('document_question.result', 'answer.result').show(truncate=False)
```
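CUAD prompts follow one fixed template per category: the category name and its description slot into the same sentence frame. A small helper (hypothetical, not part of Spark NLP or the CUAD tooling) can assemble the prompt for other categories:

```python
def cuad_prompt(category: str, details: str) -> str:
    """Assemble a CUAD-style extraction prompt for a given category."""
    return (
        f'"Highlight the parts (if any) of this contract related to "{category}" '
        f'that should be reviewed by a lawyer. Details: {details}"'
    )

# Reproduces the PARTIES prompt used in the example above.
prompt = cuad_prompt("Parties", "The two or more parties who signed the contract")
print(prompt)
```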
## Results

```bash
["Highlight the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. Details: The two or more parties who signed the contract"]|[P . H . GLATFELTER COMPANY , a Pennsylvania corporation ( the " COMPANY ")]
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legqa_roberta_cuad_small|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|447.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

SQuAD, fine-tuned with CUAD-based Question Answering

---
layout: model
title: Medical Text Summarization
author: John Snow Labs
name: summarizer_generic_jsl
date: 2023-03-30
tags: [licensed, en, clinical, text_summarization, tensorflow]
task: Summarization
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.0
supported: true
engine: tensorflow
annotator: MedicalSummarizer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a modified Flan-T5 (LLM) based summarization model, fine-tuned with additional data curated by John Snow Labs and further optimized by augmenting the training methodology and dataset.
It can generate summaries of up to 512 tokens from clinical notes, given input text of up to 1,024 tokens.

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/MEDICAL_TEXT_SUMMARIZATION/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/32.Medical_Text_Summarization.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_generic_jsl_en_4.3.2_3.0_1680192338463.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_generic_jsl_en_4.3.2_3.0_1680192338463.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("documents")

med_summarizer = MedicalSummarizer.pretrained("summarizer_generic_jsl", "en", "clinical/models")\
    .setInputCols("documents")\
    .setOutputCol("summary")\
    .setMaxNewTokens(100)\
    .setMaxTextLength(1024)

pipeline = Pipeline(stages=[document_assembler, med_summarizer])

text = """Patient with hypertension, syncope, and spinal stenosis - for recheck. (Medical Transcription Sample Report) SUBJECTIVE: The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS: Reviewed and unchanged from the dictation on 12/03/2003. MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash. ALLERGIES:..."""

data = spark.createDataFrame([[text]]).toDF("text")

pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

val med_summarizer = MedicalSummarizer.pretrained("summarizer_generic_jsl", "en", "clinical/models")
  .setInputCols("documents")
  .setOutputCol("summary")
  .setMaxNewTokens(100)

val pipeline = new Pipeline().setStages(Array(document_assembler, med_summarizer))

val text = """Patient with hypertension, syncope, and spinal stenosis - for recheck. (Medical Transcription Sample Report) SUBJECTIVE: The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. 
PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS: Reviewed and unchanged from the dictation on 12/03/2003. MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash. ALLERGIES:...""" val data = Seq(text).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
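The model reads at most 1,024 tokens (`setMaxTextLength`), so longer notes should be split or truncated beforehand. A rough whitespace-based guard can be sketched in plain Python; note this is only an approximation, since the model counts subword tokens with its own tokenizer and real counts run higher:

```python
def truncate_to_tokens(text: str, max_tokens: int = 1024) -> str:
    # Whitespace splitting only approximates the model's subword tokenizer;
    # use a safety margin below the real limit in practice.
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text
    return " ".join(tokens[:max_tokens])

long_note = "word " * 2000
print(len(truncate_to_tokens(long_note).split()))  # 1024
```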
## Results

```bash
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                  |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[recheck. A 78-year-old female patient returns for recheck due to hypertension, syncope, and spinal stenosis. She has a history of heart failure, myocardial infarction, lymphoma, and asthma. She has been prescribed Atenolol, Premarin, calcium with vitamin D, multivitamin, aspirin, and TriViFlor. She has also been prescribed El]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|summarizer_generic_jsl|
|Compatibility:|Healthcare NLP 4.3.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|920.0 MB|

## Benchmarking

### Benchmark on Samsum Dataset

| model_name | model_size | rouge | bleu | bertscore_precision | bertscore_recall | bertscore_f1 |
|--|--|--|--|--|--|--|
philschmid/flan-t5-base-samsum | 240M | 0.2734 | 0.1813 | 0.8938 | 0.9133 | 0.9034 |
linydub/bart-large-samsum | 500M | 0.3060 | 0.2168 | 0.8961 | 0.9065 | 0.9013 |
philschmid/bart-large-cnn-samsum | 500M | 0.3794 | 0.1262 | 0.8599 | 0.9153 | 0.8867 |
transformersbook/pegasus-samsum | 570M | 0.3049 | 0.1543 | 0.8942 | 0.9183 | 0.9061 |
summarizer_generic_jsl | 240M | 0.2703 | 0.1932 | 0.8944 | 0.9161 | 0.9051 |

---
layout: model
title: Chinese Named Entity Recognition (from uer)
author: John Snow Labs
name: bert_ner_roberta_base_finetuned_cluener2020_chinese
date: 2022-05-09
tags: [bert, ner, token_classification, zh, open_source]
task: Named Entity Recognition
language: zh
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-base-finetuned-cluener2020-chinese` is a Chinese model originally trained by `uer`.

## Predicted Entities

`position`, `company`, `address`, `movie`, `organization`, `game`, `name`, `book`, `government`, `scene`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_roberta_base_finetuned_cluener2020_chinese_zh_3.4.2_3.0_1652096419329.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_roberta_base_finetuned_cluener2020_chinese_zh_3.4.2_3.0_1652096419329.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_roberta_base_finetuned_cluener2020_chinese","zh") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_roberta_base_finetuned_cluener2020_chinese","zh") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
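The pipeline above stops at token-level tags in the `ner` column; turning those tags into entity chunks is normally done by appending Spark NLP's `NerConverter` stage. The grouping logic it applies to BIO-style tags can be sketched in plain Python (the tokens and tags below are invented illustrations, not model output):

```python
def bio_to_chunks(tokens, tags):
    """Group token-level BIO tags into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O" tag or a dangling I- tag closes the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["John", "Snow", "Labs", "is", "in", "Delaware"]
tags = ["B-organization", "I-organization", "I-organization", "O", "O", "B-address"]
print(bio_to_chunks(tokens, tags))
# [('John Snow Labs', 'organization'), ('Delaware', 'address')]
```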
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_roberta_base_finetuned_cluener2020_chinese| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|zh| |Size:|381.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/uer/roberta-base-finetuned-cluener2020-chinese - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUENER2020 - https://github.com/dbiir/UER-py/ - https://cloud.tencent.com/ --- layout: model title: Legal Indemnification Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_indemnification_agreement_bert date: 2022-11-24 tags: [en, legal, classification, agreement, indemnification, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_indemnification_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `indemnification-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `indemnification-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_indemnification_agreement_bert_en_1.0.0_3.0_1669314700160.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_indemnification_agreement_bert_en_1.0.0_3.0_1669314700160.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_indemnification_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+---------------------------+
|result                     |
+---------------------------+
|[indemnification-agreement]|
|[other]                    |
|[other]                    |
|[indemnification-agreement]|
+---------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_indemnification_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.3 MB|

## References

Legal documents, scraped from the Internet and classified in-house, plus SEC documents

## Benchmarking

```bash
                    label  precision  recall  f1-score  support
indemnification-agreement       0.90    0.93      0.92       30
                    other       0.97    0.95      0.96       65
                 accuracy          -       -      0.95       95
                macro-avg       0.94    0.94      0.94       95
             weighted-avg       0.95    0.95      0.95       95
```

---
layout: model
title: Portuguese BERT Embeddings (Base Cased)
author: John Snow Labs
name: bert_portuguese_base_cased
date: 2020-11-04
task: Embeddings
language: pt
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, pt]
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is a pre-trained BERT model for the Portuguese language. The `BERT-Base` and `BERT-Large` Cased variants were trained on `BrWaC` (Brazilian Web as Corpus), a large Portuguese corpus, for 1,000,000 steps using whole-word masking.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_portuguese_base_cased_pt_2.6.0_2.4_1604487641612.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_portuguese_base_cased_pt_2.6.0_2.4_1604487641612.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("bert_portuguese_base_cased", "pt") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['Eu amo PNL']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("bert_portuguese_base_cased", "pt") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("Eu amo PNL").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Eu amo PNL"] embeddings_df = nlu.load('pt.bert.cased').predict(text, output_level='token') embeddings_df ```
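Each token in the `embeddings` output carries one 768-dimensional vector (shown truncated in the Results section of this card). Comparing tokens then reduces to vector arithmetic; a minimal cosine-similarity helper, using toy 3-d vectors in place of real 768-d BERT embeddings:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product over the product of vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for real token embeddings.
print(round(cosine([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]), 2))  # 1.0
print(round(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 2))  # 0.0
```

Identical directions score 1.0 and orthogonal directions 0.0; semantically similar tokens land closer to 1.0.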
{:.h2_title}
## Results

```bash
                                    pt_bert_cased_embeddings  token
[0.476963073015213, -0.31151092052459717, 0.91...            Eu
[0.5710286498069763, 0.039474084973335266, 0.3...            amo
[0.3184247314929962, 0.11230389773845673, 0.36...            PNL
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_portuguese_base_cased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|[pt]|
|Dimension:|768|
|Case sensitive:|true|

{:.h2_title}
## Data Source

The model is imported from [https://github.com/neuralmind-ai/portuguese-bert](https://github.com/neuralmind-ai/portuguese-bert)

---
layout: model
title: TREC(6) Question Classifier
author: John Snow Labs
name: classifierdl_use_trec6
class: ClassifierDLModel
language: en
nav_key: models
repository: public/models
date: 03/05/2020
task: Text Classification
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [classifier]
supported: true
annotator: ClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

{:.h2_title}
## Description

Classify open-domain, fact-based questions into one of the following broad semantic categories: Abbreviation, Description, Entities, Human Beings, Locations or Numeric Values.

{:.h2_title}
## Predicted Entities

``ABBR``, ``DESC``, ``NUM``, ``ENTY``, ``LOC``, ``HUM``.

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_EN_TREC/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_EN_TREC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_trec6_en_2.5.0_2.4_1588492648979.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_trec6_en_2.5.0_2.4_1588492648979.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained(lang="en") \
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

document_classifier = ClassifierDLModel.pretrained('classifierdl_use_trec6', 'en') \
    .setInputCols(["document", "sentence_embeddings"]) \
    .setOutputCol("class")

nlpPipeline = Pipeline(stages=[documentAssembler, use, document_classifier])

light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text")))

annotations = light_pipeline.fullAnnotate('When did the construction of stone circles begin in the UK?')
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val use = UniversalSentenceEncoder.pretrained(lang="en")
  .setInputCols(Array("document"))
  .setOutputCol("sentence_embeddings")

val document_classifier = ClassifierDLModel.pretrained("classifierdl_use_trec6", "en")
  .setInputCols(Array("document", "sentence_embeddings"))
  .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(documentAssembler, use, document_classifier))

val data = Seq("When did the construction of stone circles begin in the UK?").toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["""When did the construction of stone circles begin in the UK?"""]
trec6_df = nlu.load('en.classify.trec6.use').predict(text, output_level='document')
trec6_df[["document", "trec6"]]
```
{:.h2_title} ## Results {:.table-model} ```bash +------------------------------------------------------------------------------------------------+------------+ |document |class | +------------------------------------------------------------------------------------------------+------------+ |When did the construction of stone circles begin in the UK? | NUM | +------------------------------------------------------------------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |-------------------------|--------------------------------------| | Model Name | classifierdl_use_trec6 | | Model Class | ClassifierDLModel | | Spark Compatibility | 2.5.0 | | Spark NLP Compatibility | 2.4 | | License | open source | | Edition | public | | Input Labels | [document, sentence_embeddings] | | Output Labels | [class] | | Language | en | | Upstream Dependencies | tfhub_use | {:.h2_title} ## Data Source This model is trained on the 6 class version of TREC dataset. http://search.r-project.org/library/textdata/html/dataset_trec.html {:.h2_title} ## Benchmarking ```bash precision recall f1-score support ABBR 0.00 0.00 0.00 26 DESC 0.89 0.96 0.92 343 ENTY 0.86 0.86 0.86 391 HUM 0.91 0.90 0.91 366 LOC 0.88 0.91 0.89 233 NUM 0.94 0.94 0.94 274 accuracy 0.89 1633 macro avg 0.75 0.76 0.75 1633 weighted avg 0.88 0.89 0.89 1633 ``` --- layout: model title: Pipeline to Detect Adverse Drug Events author: John Snow Labs name: ner_ade_clinical_pipeline date: 2023-03-14 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_ade_clinical](https://nlp.johnsnowlabs.com/2021/04/01/ner_ade_clinical_en.html) model. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_ade_clinical_pipeline_en_4.3.0_3.2_1678808562312.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_ade_clinical_pipeline_en_4.3.0_3.2_1678808562312.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_ade_clinical_pipeline", "en", "clinical/models") text = '''Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_ade_clinical_pipeline", "en", "clinical/models") val text = "Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.ade_clinical.pipeline").predict("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps.""") ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:---------------|--------:|------:|:------------|-------------:| | 0 | Lipitor | 12 | 18 | DRUG | 0.9969 | | 1 | severe fatigue | 52 | 65 | ADE | 0.48995 | | 2 | voltaren | 97 | 104 | DRUG | 0.9889 | | 3 | cramps | 152 | 157 | ADE | 0.7472 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_ade_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Lemmatizer (Danish, SpacyLookup) author: John Snow Labs name: lemma_spacylookup date: 2022-03-03 tags: [open_source, lemmatizer, da] task: Lemmatization language: da edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Danish Lemmatizer is a scalable, production-ready version of the rule-based Lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_da_3.4.1_3.0_1646316498508.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_da_3.4.1_3.0_1646316498508.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","da") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Du er ikke bedre end mig"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","da") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("Du er ikke bedre end mig").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("da.lemma").predict("""Du er ikke bedre end mig""") ```
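Conceptually, a lookup lemmatizer like this one is a dictionary mapping inflected word forms to their lemmas, with out-of-vocabulary tokens left unchanged. A minimal illustrative Python sketch — the toy Danish entries below are assumptions for demonstration, not the model's actual lookup table:

```python
# Toy lookup table: inflected form -> lemma (illustrative entries only).
lookup = {
    "er": "være",    # "is"     -> "to be"
    "bedre": "god",  # "better" -> "good"
    "mig": "jeg",    # "me"     -> "I"
}

def lemmatize(tokens):
    # Unknown tokens fall back to themselves, as in a real lookup lemmatizer.
    return [lookup.get(token.lower(), token) for token in tokens]

print(lemmatize("Du er ikke bedre end mig".split()))
```

The real model's dictionary is far larger and handles many more forms; the sketch only shows the lookup-with-fallback mechanism.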
## Results ```bash +--------------------------------+ |result | +--------------------------------+ |[Du, være, ikke, god, ende, jeg]| +--------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma_spacylookup| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|da| |Size:|6.9 MB| --- layout: model title: English image_classifier_vit_exper_batch_8_e4 ViTForImageClassification from sudo-s author: John Snow Labs name: image_classifier_vit_exper_batch_8_e4 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_exper_batch_8_e4` is an English model originally trained by sudo-s.
## Predicted Entities `45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper_batch_8_e4_en_4.1.0_3.0_1660166565213.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper_batch_8_e4_en_4.1.0_3.0_1660166565213.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_exper_batch_8_e4", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_exper_batch_8_e4", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_exper_batch_8_e4| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.3 MB| --- layout: model title: Yoruba BertForMaskedLM Base Cased model (from Davlan) author: John Snow Labs name: bert_embeddings_base_multilingual_cased_finetuned_yoruba date: 2022-12-02 tags: [yo, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: yo edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased-finetuned-yoruba` is a Yoruba model originally trained by `Davlan`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_multilingual_cased_finetuned_yoruba_yo_4.2.4_3.0_1670018518987.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_multilingual_cased_finetuned_yoruba_yo_4.2.4_3.0_1670018518987.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_multilingual_cased_finetuned_yoruba","yo") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_multilingual_cased_finetuned_yoruba","yo") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_multilingual_cased_finetuned_yoruba| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|yo| |Size:|667.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/Davlan/bert-base-multilingual-cased-finetuned-yoruba - https://opus.nlpl.eu/ - https://github.com/masakhane-io/masakhane-ner --- layout: model title: English DistilBertForQuestionAnswering model (from jsunster) author: John Snow Labs name: distilbert_qa_jsunster_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `jsunster`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_jsunster_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725613002.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_jsunster_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725613002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_jsunster_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_jsunster_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_jsunster").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_jsunster_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/jsunster/distilbert-base-uncased-finetuned-squad --- layout: model title: Sentence Embeddings - sbert mini (tuned) author: John Snow Labs name: sbert_jsl_mini_uncased date: 2021-05-14 tags: [embeddings, clinical, licensed, en] task: Embeddings language: en nav_key: models edition: Healthcare NLP 3.0.3 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained to generate contextual sentence embeddings of input sentences. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_mini_uncased_en_3.0.3_2.4_1621017120992.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_mini_uncased_en_3.0.3_2.4_1621017120992.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sbiobert_embeddings = BertSentenceEmbeddings\ .pretrained("sbert_jsl_mini_uncased","en","clinical/models")\ .setInputCols(["sentence"])\ .setOutputCol("sbert_embeddings") ``` ```scala val sbiobert_embeddings = BertSentenceEmbeddings .pretrained("sbert_jsl_mini_uncased","en","clinical/models") .setInputCols(Array("sentence")) .setOutputCol("sbert_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed_sentence.bert.jsl_mini_uncased").predict("""Put your text here.""") ```
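The sentence vectors produced above are typically consumed by comparing them, e.g. with cosine similarity for semantic textual similarity tasks (as in the STS(cos) benchmark for this model). A minimal pure-Python sketch with made-up low-dimensional vectors standing in for the model's real 768-dimensional embeddings:

```python
import math

def cosine(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings"; real sentence embeddings are 768-dimensional.
emb_a = [0.2, 0.1, 0.9]
emb_b = [0.3, 0.0, 0.8]
print(round(cosine(emb_a, emb_b), 3))
```

Values close to 1.0 indicate semantically similar sentences; in practice you would read the vectors out of the `sbert_embeddings` column instead of hard-coding them.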
## Results ```bash Gives a 768 dimensional vector representation of the sentence. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbert_jsl_mini_uncased| |Compatibility:|Healthcare NLP 3.0.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Case sensitive:|false| ## Data Source Tuned on MedNLI dataset ## Benchmarking ```bash MedNLI Score Acc 0.663 STS(cos) 0.701 ``` --- layout: model title: Malay T5ForConditionalGeneration Small Cased model (from mesolitica) author: John Snow Labs name: t5_small_bahasa_cased date: 2023-01-31 tags: [ms, open_source, t5, tensorflow] task: Text Generation language: ms edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-bahasa-cased` is a Malay model originally trained by `mesolitica`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_bahasa_cased_ms_4.3.0_3.0_1675125885354.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_bahasa_cased_ms_4.3.0_3.0_1675125885354.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_bahasa_cased","ms") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_bahasa_cased","ms") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_bahasa_cased| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ms| |Size:|147.7 MB| ## References - https://huggingface.co/mesolitica/t5-small-bahasa-cased - https://github.com/huseinzol05/malaya/tree/master/pretrained-model/t5/prepare - https://github.com/google-research/text-to-text-transfer-transformer - https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5 --- layout: model title: English T5ForConditionalGeneration Cased model (from gagan3012) author: John Snow Labs name: t5_k2t_test3 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `k2t-test3` is an English model originally trained by `gagan3012`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_k2t_test3_en_4.3.0_3.0_1675103993995.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_k2t_test3_en_4.3.0_3.0_1675103993995.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_k2t_test3","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_k2t_test3","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_k2t_test3| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|246.4 MB| ## References - https://huggingface.co/gagan3012/k2t-test3 - https://pypi.org/project/keytotext/ - https://pepy.tech/project/keytotext - https://colab.research.google.com/github/gagan3012/keytotext/blob/master/notebooks/K2T.ipynb - https://share.streamlit.io/gagan3012/keytotext/UI/app.py - https://github.com/gagan3012/keytotext#api - https://hub.docker.com/r/gagan30/keytotext - https://keytotext.readthedocs.io/en/latest/?badge=latest - https://github.com/psf/black - https://socialify.git.ci/gagan3012/keytotext/image?description=1&forks=1&language=1&owner=1&stargazers=1&theme=Light --- layout: model title: English BertForTokenClassification Cased model (from Shenzy2) author: John Snow Labs name: bert_token_classifier_ner4designtutor date: 2022-11-30 tags: [en, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `NER4DesignTutor` is an English model originally trained by `Shenzy2`.
## Predicted Entities `design_concept`, `ui_element` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_ner4designtutor_en_4.2.4_3.0_1669813970639.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_ner4designtutor_en_4.2.4_3.0_1669813970639.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner4designtutor","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner4designtutor","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner4designtutor| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Shenzy2/NER4DesignTutor --- layout: model title: Financial NER on Responsibility and ESG Reports author: John Snow Labs name: finner_responsibility_reports date: 2023-03-09 tags: [en, finance, licensed, ner, responsibility, reports, tensorflow] task: Named Entity Recognition language: en edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true recommended: true engine: tensorflow annotator: FinanceBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Financial NER model can extract up to 20 quantifiable entities, including KPI, from the Responsibility and ESG Reports of companies. It has been trained with the SOTA approach. ## Predicted Entities `AGE`, `AMOUNT`, `COUNTABLE_ITEM`, `DATE_PERIOD`, `ECONOMIC_ACTION`, `ECONOMIC_KPI`, `ENVIRONMENTAL_ACTION`, `ENVIRONMENTAL_KPI`, `ENVIRONMENTAL_UNIT`, `ESG_ROLE`, `FACILITY_PLACE`, `ISO`, `PERCENTAGE`, `PROFESSIONAL_GROUP`, `RELATIVE_METRIC`, `SOCIAL_ACTION`, `SOCIAL_KPI`, `TARGET_GROUP`, `TARGET_GROUP_BUSINESS`, `WASTE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_responsibility_reports_en_1.0.0_3.0_1678368253780.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_responsibility_reports_en_1.0.0_3.0_1678368253780.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ sentence_detector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence")\ tokenizer = nlp.Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token")\ .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'", '%', '&']) ner_model = finance.BertForTokenClassification.pretrained("finner_responsibility_reports", "en", "finance/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, ner_model, ner_converter ]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = pipeline.fit(empty_data) text = """The company has reduced its direct GHG emissions from 12,135 million tonnes of CO2e in 2017 to 4 million tonnes of CO2e in 2021. The indirect GHG emissions (scope 2) are mainly from imported energy, including electricity, heat, steam, and cooling, and the company has reduced its scope 2 emissions from 3 million tonnes of CO2e in 2017-2018 to 4 million tonnes of CO2e in 2020-2021. The scope 3 emissions are mainly from the use of sold products, and the emissions have increased from 377 million tonnes of CO2e in 2017 to 408 million tonnes of CO2e in 2021.""" data = spark.createDataFrame([[text]]).toDF("text") result = model.transform(data) result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \ .select(F.expr("cols['0']").alias("chunk"), F.expr("cols['1']['entity']").alias("label")).show(50, truncate = False) ```
## Results ```bash +----------------------+------------------+ |chunk |label | +----------------------+------------------+ |direct GHG emissions |ENVIRONMENTAL_KPI | |12,135 million |AMOUNT | |tonnes of CO2e |ENVIRONMENTAL_UNIT| |2017 |DATE_PERIOD | |4 million |AMOUNT | |tonnes of CO2e |ENVIRONMENTAL_UNIT| |2021 |DATE_PERIOD | |indirect GHG emissions|ENVIRONMENTAL_KPI | |scope 2 |ENVIRONMENTAL_KPI | |imported energy |ENVIRONMENTAL_KPI | |electricity |ENVIRONMENTAL_KPI | |heat |ENVIRONMENTAL_KPI | |steam |ENVIRONMENTAL_KPI | |cooling |ENVIRONMENTAL_KPI | |scope 2 emissions |ENVIRONMENTAL_KPI | |3 million |AMOUNT | |tonnes of CO2e |ENVIRONMENTAL_UNIT| |2017-2018 |DATE_PERIOD | |4 million |AMOUNT | |tonnes of CO2e |ENVIRONMENTAL_UNIT| |2020-2021 |DATE_PERIOD | |scope 3 emissions |ENVIRONMENTAL_KPI | |sold |ECONOMIC_ACTION | |products |SOCIAL_KPI | |emissions |ENVIRONMENTAL_KPI | |377 million |AMOUNT | |tonnes of CO2e |ENVIRONMENTAL_UNIT| |2017 |DATE_PERIOD | |408 million |AMOUNT | |tonnes of CO2e |ENVIRONMENTAL_UNIT| |2021 |DATE_PERIOD | +----------------------+------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_responsibility_reports| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|406.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References In-house annotations on Responsibility and ESG Reports ## Benchmarking ```bash label precision recall f1-score support AGE 0.86 0.84 0.85 37 AMOUNT 0.93 0.96 0.95 1254 COUNTABLE_ITEM 0.87 0.86 0.87 212 DATE_PERIOD 0.90 0.93 0.92 925 ECONOMIC_ACTION 0.83 0.85 0.84 61 ECONOMIC_KPI 0.78 0.83 0.80 223 ENVIRONMENTAL_ACTION 0.84 0.84 0.84 332 ENVIRONMENTAL_KPI 0.79 0.86 0.82 948 ENVIRONMENTAL_UNIT 0.91 0.90 0.91 484 ESG_ROLE 0.76 0.81 0.79 139 FACILITY_PLACE 0.70 0.88 0.78 154 ISO 0.68 0.81 0.74 32 PERCENTAGE 0.98 1.00 0.99 706 PROFESSIONAL_GROUP 0.88 0.95 0.91 
419 RELATIVE_METRIC 0.92 0.94 0.93 141 SOCIAL_ACTION 0.83 0.81 0.82 262 SOCIAL_KPI 0.82 0.84 0.83 480 TARGET_GROUP 0.84 0.92 0.88 257 TARGET_GROUP_BUSINESS 0.93 0.98 0.96 44 WASTE 0.80 0.77 0.79 106 micro-avg 0.87 0.91 0.89 7216 macro-avg 0.84 0.88 0.86 7216 weighted-avg 0.87 0.91 0.89 7216 ``` --- layout: model title: Romanian DistilBERT Embeddings author: John Snow Labs name: distilbert_embeddings_distilbert_base_ro_cased date: 2022-04-12 tags: [distilbert, embeddings, ro, open_source] task: Embeddings language: ro edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-ro-cased` is a Romanian model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_ro_cased_ro_3.4.2_3.0_1649783969611.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_ro_cased_ro_3.4.2_3.0_1649783969611.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_ro_cased","ro") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Îmi place Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_ro_cased","ro") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Îmi place Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ro.embed.distilbert_base_cased").predict("""Îmi place Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_base_ro_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ro| |Size:|223.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/distilbert-base-ro-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Legal Absence of certain changes Clause Binary Classifier author: John Snow Labs name: legclf_absence_of_certain_changes_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `absence-of-certain-changes` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
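Paragraph splitting by multiline breaks, the first technique recommended above, can be done with a simple regular expression before feeding each chunk to the classifier. A minimal sketch independent of Spark NLP — the sample text and function name are illustrative:

```python
import re

def split_paragraphs(text):
    # Split on one or more blank lines and drop empty fragments.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Clause 1. No material adverse change has occurred.\n\nClause 2. Governing law."
print(split_paragraphs(doc))
```

Each resulting paragraph can then be passed as a separate row in the `clause_text` column, keeping every input comfortably under the 512-token embedding limit.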
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `absence-of-certain-changes`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_absence_of_certain_changes_clause_en_1.0.0_3.2_1660122075233.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_absence_of_certain_changes_clause_en_1.0.0_3.2_1660122075233.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_absence_of_certain_changes_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+----------------------------+
|result                      |
+----------------------------+
|[absence-of-certain-changes]|
|[other]                     |
|[other]                     |
|[absence-of-certain-changes]|
+----------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_absence_of_certain_changes_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
label                       precision  recall  f1-score  support
absence-of-certain-changes  0.99       0.99    0.99      67
other                       0.99       0.99    0.99      175
accuracy                    -          -       0.99      242
macro-avg                   0.99       0.99    0.99      242
weighted-avg                0.99       0.99    0.99      242
```

---
layout: model
title: English RobertaForQuestionAnswering Cased model (from choosistant)
author: John Snow Labs
name: roberta_qa_model
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `qa-model` is an English model originally trained by `choosistant`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_model_en_4.3.0_3.0_1674211832681.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_model_en_4.3.0_3.0_1674211832681.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_model","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_model","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_model| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/choosistant/qa-model --- layout: model title: Greek BertForQuestionAnswering Cased model (from Danastos) author: John Snow Labs name: bert_qa_nq_squad_el_4 date: 2022-07-07 tags: [el, open_source, bert, question_answering] task: Question Answering language: el edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `nq_squad_bert_el_4` is a Greek model originally trained by `Danastos`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_nq_squad_el_4_el_4.0.0_3.0_1657190695605.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_nq_squad_el_4_el_4.0.0_3.0_1657190695605.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_nq_squad_el_4","el") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["Ποιο είναι το όνομά μου?", "Το όνομά μου είναι Κλάρα και μένω στο Μπέρκλεϊ."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_nq_squad_el_4","el")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("Ποιο είναι το όνομά μου?", "Το όνομά μου είναι Κλάρα και μένω στο Μπέρκλεϊ.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_nq_squad_el_4|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|el|
|Size:|421.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/Danastos/nq_squad_bert_el_4

---
layout: model
title: English BertForQuestionAnswering model (from armageddon)
author: John Snow Labs
name: bert_qa_bert_large_uncased_squad2_covid_qa_deepset
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-squad2-covid-qa-deepset` is an English model originally trained by `armageddon`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_squad2_covid_qa_deepset_en_4.0.0_3.0_1654536702649.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_squad2_covid_qa_deepset_en_4.0.0_3.0_1654536702649.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_squad2_covid_qa_deepset","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```

```scala
val document = new MultiDocumentAssembler()
  .setInputCols("question", "context")
  .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
  .pretrained("bert_qa_bert_large_uncased_squad2_covid_qa_deepset","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
  ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
  ("What's my name?", "My name is Clara and I live in Berkeley."))
  .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2_covid.bert.large_uncased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
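In the nlu snippet, a single string carries both inputs, joined by the `|||` separator between question and context. A small sketch of that convention (the helper name is illustrative, not an nlu API):

```python
def split_qa(payload, sep="|||"):
    """Split an nlu-style combined QA payload into (question, context)."""
    question, context = payload.split(sep, 1)
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
print(c)  # My name is Clara and I live in Berkeley.
```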
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_large_uncased_squad2_covid_qa_deepset|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/armageddon/bert-large-uncased-squad2-covid-qa-deepset

---
layout: model
title: Arabic BertForMaskedLM Base Cased model (from CAMeL-Lab)
author: John Snow Labs
name: bert_embeddings_base_arabic_camel_msa_half
date: 2022-12-02
tags: [ar, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: ar
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-arabic-camelbert-msa-half` is an Arabic model originally trained by `CAMeL-Lab`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabic_camel_msa_half_ar_4.2.4_3.0_1670016117805.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabic_camel_msa_half_ar_4.2.4_3.0_1670016117805.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabic_camel_msa_half","ar") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabic_camel_msa_half","ar")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_arabic_camel_msa_half| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|409.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa-half - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://catalog.ldc.upenn.edu/LDC2011T11 - http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus - https://vlo.clarin.eu/search;jsessionid=31066390B2C9E8C6304845BA79869AC1?1&q=osian - https://archive.org/details/arwiki-20190201 - https://oscar-corpus.com/ - https://github.com/google-research/bert - https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/tokenization.py#L286-L297 - https://github.com/CAMeL-Lab/camel_tools - https://github.com/CAMeL-Lab/CAMeLBERT --- layout: model title: Legal Overseas Countries And Territories Document Classifier (EURLEX) author: John Snow Labs name: legclf_overseas_countries_and_territories_bert date: 2023-03-06 tags: [en, legal, classification, clauses, overseas_countries_and_territories, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. 
The legclf_overseas_countries_and_territories_bert model is a Bert Sentence Embeddings Document Classifier: given a document, it classifies whether the document belongs to the class Overseas_Countries_and_Territories or not (Binary Classification), according to EuroVoc labels.

## Predicted Entities

`Overseas_Countries_and_Territories`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_overseas_countries_and_territories_bert_en_1.0.0_3.0_1678111581250.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_overseas_countries_and_territories_bert_en_1.0.0_3.0_1678111581250.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_overseas_countries_and_territories_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
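Since the model is a binary one-vs-rest classifier over EuroVoc annotations, the training target can be thought of as a collapse of a document's multi-label EuroVoc set into two classes. A minimal sketch of that mapping (the function name is illustrative, not part of Spark NLP):

```python
def to_binary_label(eurovoc_labels, target="Overseas_Countries_and_Territories"):
    """Collapse a document's multi-label EuroVoc annotations into the
    binary target used by this one-vs-rest classifier."""
    return target if target in eurovoc_labels else "Other"

print(to_binary_label(["Overseas_Countries_and_Territories", "Trade"]))
print(to_binary_label(["Agriculture"]))
```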
## Results

```bash
+------------------------------------+
|result                              |
+------------------------------------+
|[Overseas_Countries_and_Territories]|
|[Other]                             |
|[Other]                             |
|[Overseas_Countries_and_Territories]|
+------------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_overseas_countries_and_territories_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.6 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
label                               precision  recall  f1-score  support
Other                               1.00       0.89    0.94      28
Overseas_Countries_and_Territories  0.84       1.00    0.91      16
accuracy                            -          -       0.93      44
macro-avg                           0.92       0.95    0.93      44
weighted-avg                        0.94       0.93    0.93      44
```

---
layout: model
title: BERT Base Biolink Embeddings
author: John Snow Labs
name: bert_biolink_base
date: 2022-04-08
tags: [bert, medical, biolink, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This embeddings component was trained on PubMed abstracts along with their citation link information. The embeddings were introduced in this [paper](https://arxiv.org/abs/2203.15827). This model achieves state-of-the-art performance on several biomedical NLP benchmarks such as [BLURB](https://microsoft.github.io/BLURB/) and [MedQA-USMLE](https://github.com/jind11/MedQA).

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_biolink_base_en_3.4.2_3.0_1649419433513.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_biolink_base_en_3.4.2_3.0_1649419433513.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_biolink_base", "en")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_biolink_base", "en") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.e").predict("""Put your text here.""") ```
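The annotator above emits one embedding vector per token. When a single vector per sentence is needed downstream, the token vectors are often mean-pooled; a minimal plain-Python sketch of that step, independent of Spark NLP (the helper name is illustrative):

```python
def mean_pool(token_vectors):
    """Average per-token embedding vectors into one sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

# Tiny 2-dimensional example in place of real 768-dim BERT vectors.
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```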
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_biolink_base|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|406.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

[https://pubmed.ncbi.nlm.nih.gov/](https://pubmed.ncbi.nlm.nih.gov/)

```
@InProceedings{yasunaga2022linkbert,
  author    = {Michihiro Yasunaga and Jure Leskovec and Percy Liang},
  title     = {LinkBERT: Pretraining Language Models with Document Links},
  year      = {2022},
  booktitle = {Association for Computational Linguistics (ACL)},
}
```

## Benchmarking

```bash
Scores for several benchmark datasets:
- BLURB : 83.39
- PubMedQA : 70.2
- BioASQ : 91.4
- MedQA-USMLE : 40.0
```

---
layout: model
title: Chinese BertForQuestionAnswering model (from bhavikardeshna)
author: John Snow Labs
name: bert_qa_multilingual_bert_base_cased_chinese
date: 2022-06-03
tags: [zh, open_source, question_answering, bert]
task: Question Answering
language: zh
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `multilingual-bert-base-cased-chinese` is a Chinese model originally trained by `bhavikardeshna`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_bert_base_cased_chinese_zh_4.0.0_3.0_1654256118616.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_bert_base_cased_chinese_zh_4.0.0_3.0_1654256118616.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_multilingual_bert_base_cased_chinese","zh") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```

```scala
val document = new MultiDocumentAssembler()
  .setInputCols("question", "context")
  .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
  .pretrained("bert_qa_multilingual_bert_base_cased_chinese","zh")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
  ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
  ("What's my name?", "My name is Clara and I live in Berkeley."))
  .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("zh.answer_question.bert.multilingual_base_cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_multilingual_bert_base_cased_chinese|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|zh|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/bhavikardeshna/multilingual-bert-base-cased-chinese

---
layout: model
title: Legal Affirmative covenants Clause Binary Classifier (md)
author: John Snow Labs
name: legclf_affirmative_covenants_md
date: 2023-01-11
tags: [en, legal, classification, document, agreement, contract, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `affirmative-covenants` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial linked above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `affirmative-covenants`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_affirmative_covenants_md_en_1.0.0_3.0_1673460313752.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_affirmative_covenants_md_en_1.0.0_3.0_1673460313752.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_affirmative_covenants_md", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-----------------------+
|result                 |
+-----------------------+
|[affirmative-covenants]|
|[other]                |
|[other]                |
|[affirmative-covenants]|
+-----------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_affirmative_covenants_md|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.8 MB|

## References

Legal documents, scraped from the Internet, and classified in-house + SEC documents

## Benchmarking

```bash
label                  precision  recall  f1-score  support
other                  0.95       0.92    0.94      39
termination-for-cause  0.92       0.95    0.93      37
accuracy                                  0.93      76
macro avg              0.93       0.93    0.93      76
weighted avg           0.93       0.93    0.93      76
```

---
layout: model
title: Indonesian RoBERTa Embeddings (from cahya)
author: John Snow Labs
name: roberta_embeddings_roberta_base_indonesian_522M
date: 2022-04-14
tags: [roberta, embeddings, id, open_source]
task: Embeddings
language: id
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-base-indonesian-522M` is an Indonesian model originally trained by `cahya`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_base_indonesian_522M_id_3.4.2_3.0_1649948345202.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_base_indonesian_522M_id_3.4.2_3.0_1649948345202.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_base_indonesian_522M","id") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Saya suka percikan NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_base_indonesian_522M","id") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Saya suka percikan NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("id.embed.roberta_base_indonesian_522M").predict("""Saya suka percikan NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_roberta_base_indonesian_522M| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|id| |Size:|473.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/cahya/roberta-base-indonesian-522M - https://github.com/cahya-wirawan/indonesian-language-models/tree/master/Transformers --- layout: model title: Multilingual DistilBertForQuestionAnswering Cased model (from ZYW) author: John Snow Labs name: distilbert_qa_model date: 2023-01-03 tags: [xx, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `en-de-es-model` is a Multilingual model originally trained by `ZYW`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_model_xx_4.3.0_3.0_1672774931074.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_model_xx_4.3.0_3.0_1672774931074.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_model","xx")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_model","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_model|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|xx|
|Size:|505.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/ZYW/en-de-es-model

---
layout: model
title: Legal Non disparagement Clause Binary Classifier
author: John Snow Labs
name: legclf_non_disparagement_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `non-disparagement` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `non-disparagement` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_non_disparagement_clause_en_1.0.0_3.2_1660122744173.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_non_disparagement_clause_en_1.0.0_3.2_1660122744173.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_non_disparagement_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-------------------+
|             result|
+-------------------+
|[non-disparagement]|
|            [other]|
|            [other]|
|[non-disparagement]|
+-------------------+
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_non_disparagement_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking

```bash
label              precision  recall  f1-score  support
non-disparagement       0.97    1.00      0.98       31
other                   1.00    0.99      0.99       99
accuracy                   -       -      0.99      130
macro-avg               0.98    0.99      0.99      130
weighted-avg            0.99    0.99      0.99      130
```

--- layout: model title: English asr_wav2vec2_base_timit_demo_colab1_by_tahazakir TFWav2Vec2ForCTC from tahazakir author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab1_by_tahazakir date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab1_by_tahazakir` is an English model originally trained by tahazakir.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab1_by_tahazakir_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab1_by_tahazakir_en_4.2.0_3.0_1664038769387.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab1_by_tahazakir_en_4.2.0_3.0_1664038769387.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab1_by_tahazakir", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab1_by_tahazakir", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
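Both snippets above assume an `audioDf` with an `audio_content` column of float samples, which the card does not construct. A minimal sketch of building it from a 16-bit mono WAV file using only the Python standard library (the file name and helper names are illustrative, not part of the model card):

```python
import struct
import wave

def pcm16_to_floats(raw_bytes):
    """Convert little-endian 16-bit PCM bytes to floats in [-1.0, 1.0]."""
    n = len(raw_bytes) // 2
    samples = struct.unpack("<%dh" % n, raw_bytes[: n * 2])
    return [s / 32768.0 for s in samples]

def load_wav_as_floats(path):
    """Read a mono, 16-bit WAV file into a list of float samples."""
    with wave.open(path, "rb") as wav:
        return pcm16_to_floats(wav.readframes(wav.getnframes()))

# With a SparkSession available, the floats become the `audio_content`
# column that AudioAssembler expects (hypothetical file name):
# floats = load_wav_as_floats("sample.wav")
# audioDf = spark.createDataFrame([[floats]], ["audio_content"])
```

Note that Wav2Vec2 models expect 16 kHz audio, so resample the recording first if needed.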
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab1_by_tahazakir| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: Javanese BertForMaskedLM Small Cased model (from w11wo) author: John Snow Labs name: bert_embeddings_javanese_small_imdb date: 2022-12-06 tags: [jv, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: jv edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `javanese-bert-small-imdb` is a Javanese model originally trained by `w11wo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_javanese_small_imdb_jv_4.2.4_3.0_1670326764200.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_javanese_small_imdb_jv_4.2.4_3.0_1670326764200.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_javanese_small_imdb","jv") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_javanese_small_imdb","jv")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_javanese_small_imdb| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|jv| |Size:|410.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/w11wo/javanese-bert-small-imdb - https://arxiv.org/abs/1810.04805 - https://github.com/sgugger - https://w11wo.github.io/ --- layout: model title: Word2Vec Embeddings in Bulgarian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, bg, open_source] task: Embeddings language: bg edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_bg_3.4.1_3.0_1647288405077.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_bg_3.4.1_3.0_1647288405077.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","bg") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Обичам искра nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","bg") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Обичам искра nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("bg.embed.w2v_cc_300d").predict("""Обичам искра nlp""") ```
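The `embeddings` column holds one 300-dimensional vector per token; downstream tasks typically compare such vectors with cosine similarity. A small, Spark-free illustration of that comparison (toy 3-d vectors stand in for the real 300-d ones):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Identical directions score 1.0, orthogonal directions score 0.0:
print(cosine_similarity([1.0, 0.0, 0.0], [2.0, 0.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 3.0, 0.0]))  # 0.0
```

The same arithmetic applies unchanged to the 300-d arrays extracted from the pipeline's `embeddings` annotations.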
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|bg| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Spanish Legal Electra Word Embeddings Base model author: John Snow Labs name: legalectra_base date: 2022-07-08 tags: [open_source, legalectra, embeddings, electra, legal, es] task: Embeddings language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Spanish Legal Word Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legalectra-base-spanish` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/legalectra_base_es_4.0.0_3.0_1657294426704.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/legalectra_base_es_4.0.0_3.0_1657294426704.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

electra = BertEmbeddings.pretrained("legalectra_base","es") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, electra])

data = spark.createDataFrame([["Amo a Spark NLP."]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val electra = BertEmbeddings.pretrained("legalectra_base","es")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, electra))

val data = Seq("Amo a Spark NLP.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legalectra_base| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|409.1 MB| |Case sensitive:|true| ## References https://huggingface.co/mrm8488/legalectra-base-spanish --- layout: model title: Clinical Portuguese Bert Embeddings (Biomedical and Clinical) author: John Snow Labs name: biobert_embeddings_all date: 2022-04-11 tags: [biobert, embeddings, pt, open_source] task: Embeddings language: pt edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BioBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `biobertpt-all` is a Portuguese model originally trained by `pucpr`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/biobert_embeddings_all_pt_3.4.2_3.0_1649685875871.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/biobert_embeddings_all_pt_3.4.2_3.0_1649685875871.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("biobert_embeddings_all","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Odeio o cancro"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("biobert_embeddings_all","pt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Odeio o cancro").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.embed.gs_all").predict("""Odeio o cancro""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|biobert_embeddings_all| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|pt| |Size:|667.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/pucpr/biobertpt-all - https://aclanthology.org/2020.clinicalnlp-1.7/ - https://github.com/HAILab-PUCPR/BioBERTpt --- layout: model title: Pipeline to Resolve Medication Codes(Transform) author: John Snow Labs name: medication_resolver_transform_pipeline date: 2023-03-31 tags: [resolver, rxnorm, ndc, snomed, umls, ade, pipeline, en, licensed] task: Pipeline Healthcare language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pretrained resolver pipeline to extract medications and resolve their adverse reactions (ADE), RxNorm, UMLS, NDC, SNOMED CT codes, and action/treatments in clinical text. Action/treatments are available for branded medication, and SNOMED codes are available for non-branded medication. This pipeline can be used with Spark transform. You can use `medication_resolver_pipeline` as Lightpipeline (with `annotate/fullAnnotate`). {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/medication_resolver_transform_pipeline_en_4.3.2_3.2_1680263905398.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/medication_resolver_transform_pipeline_en_4.3.2_3.2_1680263905398.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline medication_resolver_pipeline = PretrainedPipeline("medication_resolver_transform_pipeline", "en", "clinical/models") text = """The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""" data = spark.createDataFrame([[text]]).toDF("text") result = medication_resolver_pipeline.transform(data) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val medication_resolver_pipeline = new PretrainedPipeline("medication_resolver_transform_pipeline", "en", "clinical/models") val data = Seq("""The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""").toDS.toDF("text") val result = medication_resolver_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.medication_transform.pipeline").predict("""The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""") ```
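Once `result` is collected to the driver, the per-chunk outputs shown in the Results table are parallel lists that can be zipped into one record per medication. A pure-Python sketch of that post-processing (the function name and the choice of columns are illustrative, not part of the pipeline's API):

```python
def zip_medication_codes(chunks, rxnorm, ndc_package):
    """Pair each detected medication chunk with its resolved codes.

    `chunks`, `rxnorm`, and `ndc_package` are parallel lists, as obtained
    by collecting the pipeline's output columns (hypothetical names).
    """
    return [
        {"chunk": c, "RxNorm": r, "NDC_Package": n}
        for c, r, n in zip(chunks, rxnorm, ndc_package)
    ]

records = zip_medication_codes(
    ["Lescol 40 MG"], ["103919"], ["00078-0234-05"]
)
print(records)
# [{'chunk': 'Lescol 40 MG', 'RxNorm': '103919', 'NDC_Package': '00078-0234-05'}]
```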
## Results ```bash | chunk | ner_label | ADE | RxNorm | Action | Treatment | UMLS | SNOMED_CT | NDC_Product | NDC_Package | |:-----------------------------|:------------|:----------------------------|---------:|:---------------------------|:-------------------------------------------|:---------|:------------|:--------------|:--------------| | Amlodopine Vallarta 10-320mg | DRUG | Gynaecomastia | 722131 | NONE | NONE | C1949334 | 425838008 | 00093-7693 | 00093-7693-56 | | Eviplera | DRUG | Anxiety | 217010 | Inhibitory Bone Resorption | Osteoporosis | C0720318 | NONE | NONE | NONE | | Lescol 40 MG | DRUG | NONE | 103919 | Hypocholesterolemic | Heterozygous Familial Hypercholesterolemia | C0353573 | NONE | 00078-0234 | 00078-0234-05 | | Everolimus 1.5 mg tablet | DRUG | Acute myocardial infarction | 2056895 | NONE | NONE | C4723581 | NONE | 00054-0604 | 00054-0604-21 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|medication_resolver_transform_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|3.2 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel - TextMatcherModel - ChunkMergeModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger - Doc2Chunk - ResolverMerger - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - Doc2Chunk - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - Finisher --- layout: model title: Legal Termination of agreement Clause Binary Classifier author: John Snow Labs name: legclf_termination_of_agreement_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover 
use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `termination-of-agreement` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `termination-of-agreement` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_termination_of_agreement_clause_en_1.0.0_3.2_1660124059611.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_termination_of_agreement_clause_en_1.0.0_3.2_1660124059611.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_termination_of_agreement_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
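As the description notes, several binary clause classifiers can be stacked in one pipeline, each emitting either its clause name or `other`. One way to collapse those predictions into the advertised True/False map, as a minimal sketch (the function name is ours):

```python
def clause_flags(predictions):
    """Map {clause_type: predicted_label} to {clause_type: bool}.

    A clause is flagged True when its classifier predicted the clause
    name itself rather than the fallback label `other`.
    """
    return {clause: label != "other" for clause, label in predictions.items()}

flags = clause_flags({
    "termination-of-agreement": "termination-of-agreement",
    "loans": "other",
})
print(flags)  # {'termination-of-agreement': True, 'loans': False}
```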
## Results

```bash
+--------------------------+
|                    result|
+--------------------------+
|[termination-of-agreement]|
|                   [other]|
|                   [other]|
|[termination-of-agreement]|
+--------------------------+
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_termination_of_agreement_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking

```bash
label                     precision  recall  f1-score  support
other                          0.96    0.99      0.97       74
termination-of-agreement       0.97    0.93      0.95       42
accuracy                          -       -      0.97      116
macro-avg                      0.97    0.96      0.96      116
weighted-avg                   0.97    0.97      0.97      116
```

--- layout: model title: Legal Loans Clause Binary Classifier author: John Snow Labs name: legclf_loans_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `loans` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `loans` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_loans_clause_en_1.0.0_3.2_1660123705884.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_loans_clause_en_1.0.0_3.2_1660123705884.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_loans_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-------+
| result|
+-------+
|[loans]|
|[other]|
|[other]|
|[loans]|
+-------+
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_loans_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.2 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking

```bash
label         precision  recall  f1-score  support
loans              0.86    0.88      0.87       41
other              0.92    0.91      0.92       66
accuracy              -       -      0.90      107
macro-avg          0.89    0.89      0.89      107
weighted-avg       0.90    0.90      0.90      107
```

--- layout: model title: Social Determinants of Health (clinical_medium) author: John Snow Labs name: ner_sdoh_emb_clinical_medium_wip date: 2023-04-27 tags: [clinical_medium, social_determinants, sdoh, ner, public_health, en, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts terminology related to Social Determinants of Health from various kinds of biomedical documents.
## Predicted Entities `Access_To_Care`, `Age`, `Alcohol`, `Chidhood_Event`, `Communicable_Disease`, `Community_Safety`, `Diet`, `Disability`, `Eating_Disorder`, `Education`, `Employment`, `Environmental_Condition`, `Exercise`, `Family_Member`, `Financial_Status`, `Food_Insecurity`, `Gender`, `Geographic_Entity`, `Healthcare_Institution`, `Housing`, `Hyperlipidemia`, `Hypertension`, `Income`, `Insurance_Status`, `Language`, `Legal_Issues`, `Marital_Status`, `Mental_Health`, `Obesity`, `Other_Disease`, `Other_SDoH_Keywords`, `Population_Group`, `Quality_Of_Life`, `Race_Ethnicity`, `Sexual_Activity`, `Sexual_Orientation`, `Smoking`, `Social_Exclusion`, `Social_Support`, `Spiritual_Beliefs`, `Substance_Duration`, `Substance_Frequency`, `Substance_Quantity`, `Substance_Use`, `Transportation`, `Violence_Or_Abuse` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/SOCIAL_DETERMINANT_NER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/SOCIAL_DETERMINANT_NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_emb_clinical_medium_wip_en_4.3.2_3.0_1682608963279.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_emb_clinical_medium_wip_en_4.3.2_3.0_1682608963279.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_sdoh_emb_clinical_medium_wip", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter ]) sample_texts = [["Smith is a 55 years old, divorced Mexcian American woman with financial problems. She speaks spanish. She lives in an apartment. She has been struggling with diabetes for the past 10 years and has recently been experiencing frequent hospitalizations due to uncontrolled blood sugar levels. Smith works as a cleaning assistant and does not have access to health insurance or paid sick leave. She has a son student at college. Pt with likely long-standing depression. She is aware she needs rehab. Pt reprots having her catholic faith as a means of support as well. She has long history of etoh abuse, beginning in her teens. She reports she has been a daily drinker for 30 years, most recently drinking beer daily. She smokes a pack of cigarettes a day. 
She had DUI back in April and was due to be in court this week."]] data = spark.createDataFrame(sample_texts).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_sdoh_emb_clinical_medium_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter )) val data = Seq("Smith is a 55 years old, divorced Mexcian American woman with financial problems. She speaks spanish. She lives in an apartment. She has been struggling with diabetes for the past 10 years and has recently been experiencing frequent hospitalizations due to uncontrolled blood sugar levels. Smith works as a cleaning assistant and does not have access to health insurance or paid sick leave. She has a son student at college. Pt with likely long-standing depression. She is aware she needs rehab. Pt reprots having her catholic faith as a means of support as well. She has long history of etoh abuse, beginning in her teens. She reports she has been a daily drinker for 30 years, most recently drinking beer daily. She smokes a pack of cigarettes a day. 
She had DUI back in April and was due to be in court this week.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
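The `begin` and `end` values reported for each chunk are inclusive character offsets into the input text. A minimal plain-Python sketch using the opening sentence of the sample text above (copied verbatim, including its spelling):

```python
# Illustration only: Spark NLP chunk annotations carry inclusive character
# offsets; this sketch reproduces them with plain string search.
text = "Smith is a 55 years old, divorced Mexcian American woman with financial problems."

def offsets(chunk: str, source: str) -> tuple:
    """Return the inclusive (begin, end) offsets of `chunk` in `source`."""
    begin = source.find(chunk)
    return begin, begin + len(chunk) - 1

print(offsets("55 years old", text))  # matches the Results row: (11, 22)
print(offsets("divorced", text))      # (25, 32)
print(offsets("Mexcian", text))       # (34, 40)
```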
## Results ```bash +------------------+-----+---+-------------------+ |chunk |begin|end|ner_label | +------------------+-----+---+-------------------+ |55 years old |11 |22 |Age | |divorced |25 |32 |Marital_Status | |Mexcian |34 |40 |Gender | |American |42 |49 |Race_Ethnicity | |woman |51 |55 |Gender | |financial problems|62 |79 |Financial_Status | |She |82 |84 |Gender | |spanish |93 |99 |Language | |She |102 |104|Gender | |apartment |118 |126|Housing | |She |129 |131|Gender | |diabetes |158 |165|Other_Disease | |hospitalizations |233 |248|Other_SDoH_Keywords| |cleaning assistant|307 |324|Employment | |health insurance |354 |369|Insurance_Status | |She |391 |393|Gender | |son |401 |403|Family_Member | |student |405 |411|Education | |college |416 |422|Education | |depression |454 |463|Mental_Health | |She |466 |468|Gender | |she |479 |481|Gender | |rehab |489 |493|Access_To_Care | |her |514 |516|Gender | |catholic faith |518 |531|Spiritual_Beliefs | |support |547 |553|Social_Support | |She |565 |567|Gender | |etoh abuse |589 |598|Alcohol | |her |614 |616|Gender | |teens |618 |622|Age | |She |625 |627|Gender | |she |637 |639|Gender | |daily |652 |656|Substance_Frequency| |drinker |658 |664|Alcohol | |30 years |670 |677|Substance_Duration | |drinking |694 |701|Alcohol | |beer |703 |706|Alcohol | |daily |708 |712|Substance_Frequency| |She |715 |717|Gender | |smokes |719 |724|Smoking | |a pack |726 |731|Substance_Quantity | |cigarettes |736 |745|Smoking | |a day |747 |751|Substance_Frequency| |She |754 |756|Gender | |DUI |762 |764|Legal_Issues | +------------------+-----+---+-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_sdoh_emb_clinical_medium_wip| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.0 MB| |Dependencies:|embeddings_clinical_medium| ## References Internal SHOP Project ## 
Benchmarking ```bash label precision recall f1-score support Geographic_Entity 0.89 0.88 0.88 106 Gender 0.99 0.99 0.99 4957 Healthcare_Institution 0.98 0.96 0.97 776 Employment 0.95 0.95 0.95 2120 Access_To_Care 0.90 0.81 0.85 459 Income 0.79 0.79 0.79 29 Social_Support 0.90 0.92 0.91 629 Family_Member 0.97 0.99 0.98 2101 Age 0.94 0.93 0.94 436 Mental_Health 0.89 0.86 0.87 479 Alcohol 0.96 0.96 0.96 254 Substance_Use 0.88 0.95 0.91 208 Hypertension 0.96 1.00 0.98 24 Other_Disease 0.90 0.94 0.92 583 Disability 0.93 0.97 0.95 40 Insurance_Status 0.87 0.85 0.86 85 Transportation 0.82 0.96 0.89 53 Sexual_Orientation 0.78 0.95 0.86 19 Marital_Status 0.98 0.96 0.97 90 Race_Ethnicity 0.92 0.96 0.94 25 Spiritual_Beliefs 0.80 0.80 0.80 51 Housing 0.89 0.85 0.87 366 Education 0.87 0.86 0.86 70 Other_SDoH_Keywords 0.78 0.88 0.83 237 Language 0.87 0.77 0.82 26 Substance_Frequency 0.92 0.83 0.87 65 Legal_Issues 0.77 0.85 0.81 55 Social_Exclusion 0.97 0.97 0.97 30 Financial_Status 0.88 0.66 0.75 123 Violence_Or_Abuse 0.82 0.65 0.73 57 Substance_Quantity 0.88 0.93 0.90 56 Smoking 0.99 0.99 0.99 71 Population_Group 0.91 0.71 0.80 14 Hyperlipidemia 0.78 1.00 0.88 7 Community_Safety 0.98 1.00 0.99 47 Exercise 0.91 0.88 0.90 60 Food_Insecurity 1.00 1.00 1.00 29 Eating_Disorder 0.67 0.92 0.77 13 Quality_Of_Life 0.79 0.82 0.81 61 Sexual_Activity 0.89 0.83 0.86 29 Chidhood_Event 0.90 0.72 0.80 25 Diet 0.97 0.92 0.94 62 Substance_Duration 0.66 0.95 0.78 39 Environmental_Condition 1.00 1.00 1.00 20 Obesity 1.00 1.00 1.00 14 Communicable_Disease 1.00 0.94 0.97 32 micro-avg 0.95 0.95 0.95 15132 macro-avg 0.89 0.90 0.89 15132 weighted-avg 0.95 0.95 0.95 15132 ``` --- layout: model title: Named Entity Recognition Profiling (Clinical) author: John Snow Labs name: ner_profiling_clinical date: 2021-09-24 tags: [ner, ner_profiling, clinical, licensed, en] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.2.3 spark_version: 2.4 supported: true annotator: 
PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to explore all the available pretrained NER models at once. When you run this pipeline over your text, you will end up with the predictions coming out of each pretrained clinical NER model trained with `embeddings_clinical`. Here are the NER models that this pretrained pipeline includes: `ner_ade_clinical_chunks`, `ner_posology_greedy_chunks`, `ner_risk_factors_chunks`, `jsl_ner_wip_clinical_chunks`, `ner_human_phenotype_gene_clinical_chunks`, `jsl_ner_wip_greedy_clinical_chunks`, `ner_cellular_chunks`, `ner_cancer_genetics_chunks`, `jsl_ner_wip_modifier_clinical_chunks`, `ner_drugs_greedy_chunks`, `ner_deid_sd_large_chunks`, `ner_diseases_chunks`, `nerdl_tumour_demo_chunks`, `ner_deid_subentity_augmented_chunks`, `ner_jsl_enriched_chunks`, `ner_genetic_variants_chunks`, `ner_bionlp_chunks`, `ner_measurements_clinical_chunks`, `ner_diseases_large_chunks`, `ner_radiology_chunks`, `ner_deid_augmented_chunks`, `ner_anatomy_chunks`, `ner_chemprot_clinical_chunks`, `ner_posology_experimental_chunks`, `ner_drugs_chunks`, `ner_deid_sd_chunks`, `ner_posology_large_chunks`, `ner_deid_large_chunks`, `ner_posology_chunks`, `ner_deidentify_dl_chunks`, `ner_deid_enriched_chunks`, `ner_bacterial_species_chunks`, `ner_drugs_large_chunks`, `ner_clinical_large_chunks`, `jsl_rd_ner_wip_greedy_clinical_chunks`, `ner_medmentions_coarse_chunks`, `ner_radiology_wip_clinical_chunks`, `ner_clinical_chunks`, `ner_chemicals_chunks`, `ner_deid_synthetic_chunks`, `ner_events_clinical_chunks`, `ner_posology_small_chunks`, `ner_anatomy_coarse_chunks`, `ner_human_phenotype_go_clinical_chunks`, `ner_jsl_slim_chunks`, `ner_jsl_chunks`, `ner_jsl_greedy_chunks`, `ner_events_admission_clinical_chunks` . 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.2.Pretrained_NER_Profiling_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_profiling_clinical_en_3.2.3_2.4_1632491778580.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_profiling_clinical_en_3.2.3_2.4_1632491778580.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline ner_profiling_pipeline = PretrainedPipeline('ner_profiling_clinical', 'en', 'clinical/models') result = ner_profiling_pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val ner_profiling_pipeline = PretrainedPipeline("ner_profiling_clinical", "en", "clinical/models") val result = ner_profiling_pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.profiling_clinical").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .""") ```
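The `annotate` call returns a dictionary with one `<model>_chunks` entry per NER model, plus `token` and `sentence`. A plain-Python sketch of narrowing that dictionary down to the models that predicted at least one chunk, using a small hypothetical result of the same shape:

```python
# Hypothetical annotate() output; the real dictionary has 48 *_chunks keys
# (see the Results section for the full structure).
result = {
    "ner_clinical_chunks": ["gestational diabetes mellitus", "T2DM"],
    "ner_posology_chunks": [],
    "ner_risk_factors_chunks": ["obesity"],
    "token": ["A", "28-year-old", "female"],
}

# Keep only the NER outputs that are non-empty.
non_empty = {
    name: chunks
    for name, chunks in result.items()
    if name.endswith("_chunks") and chunks
}
print(non_empty)
```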
## Results ```bash ner_ade_clinical_chunks : ['polydipsia', 'poor appetite', 'vomiting'] ner_posology_greedy_chunks : [] ner_risk_factors_chunks : ['diabetes mellitus', 'type two diabetes mellitus', 'obesity'] jsl_ner_wip_clinical_chunks : ['28-year-old', 'female', 'gestational diabetes mellitus', 'eight years prior', 'subsequent', 'type two diabetes mellitus', 'T2DM', 'HTG-induced pancreatitis', 'three years prior', 'acute', 'hepatitis', 'obesity', 'body mass index', '33.5 kg/m2', 'one-week', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ner_human_phenotype_gene_clinical_chunks : ['type', 'obesity', 'mass', 'polyuria', 'polydipsia'] jsl_ner_wip_greedy_clinical_chunks : ['28-year-old', 'female', 'gestational diabetes mellitus', 'eight years prior', 'type two diabetes mellitus', 'T2DM', 'HTG-induced pancreatitis', 'three years prior', 'acute hepatitis', 'obesity', 'body mass', 'BMI ) of 33.5 kg/m2', 'one-week', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ner_cellular_chunks : [] ner_cancer_genetics_chunks : [] jsl_ner_wip_modifier_clinical_chunks : ['28-year-old', 'female', 'gestational diabetes mellitus', 'eight years prior', 'type two diabetes mellitus', 'T2DM', 'HTG-induced pancreatitis', 'three years prior', 'acute hepatitis', 'obesity', 'body mass', 'BMI ) of 33.5 kg/m2', 'one-week', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ner_drugs_greedy_chunks : [] ner_deid_sd_large_chunks : [] ner_diseases_chunks : ['gestational diabetes mellitus', 'diabetes mellitus', 'T2DM', 'HTG-induced pancreatitis', 'hepatitis', 'obesity', 'BMI', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] nerdl_tumour_demo_chunks : [] ner_deid_subentity_augmented_chunks : ['28-year-old'] ner_jsl_enriched_chunks : ['28-year-old', 'female', 'gestational diabetes mellitus', 'diabetes mellitus', 'T2DM', 'HTG-induced pancreatitis', 'acute', 'hepatitis', 'obesity', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ner_genetic_variants_chunks : [] 
ner_bionlp_chunks : ['female', 'hepatitis'] ner_measurements_clinical_chunks : ['33.5', 'kg/m2'] ner_diseases_large_chunks : ['gestational diabetes mellitus', 'diabetes mellitus', 'T2DM', 'pancreatitis', 'hepatitis', 'obesity', 'polyuria', 'polydipsia', 'vomiting'] ner_radiology_chunks : ['gestational diabetes mellitus', 'type two diabetes mellitus', 'T2DM', 'HTG-induced pancreatitis', 'acute hepatitis', 'obesity', 'body', 'mass index', 'BMI', '33.5', 'kg/m2', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ner_deid_augmented_chunks : [] ner_anatomy_chunks : ['body'] ner_chemprot_clinical_chunks : [] ner_posology_experimental_chunks : [] ner_drugs_chunks : [] ner_deid_sd_chunks : [] ner_posology_large_chunks : [] ner_deid_large_chunks : [] ner_posology_chunks : [] ner_deidentify_dl_chunks : [] ner_deid_enriched_chunks : [] ner_bacterial_species_chunks : [] ner_drugs_large_chunks : [] ner_clinical_large_chunks : ['gestational diabetes mellitus', 'subsequent type two diabetes mellitus', 'T2DM', 'HTG-induced pancreatitis', 'an acute hepatitis', 'obesity', 'a body mass index', 'BMI', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] token : ['A', '28-year-old', 'female', 'with', 'a', 'history', 'of', 'gestational', 'diabetes', 'mellitus', 'diagnosed', 'eight', 'years', 'prior', 'to', 'presentation', 'and', 'subsequent', 'type', 'two', 'diabetes', 'mellitus', '(', 'T2DM', '),', 'one', 'prior', 'episode', 'of', 'HTG-induced', 'pancreatitis', 'three', 'years', 'prior', 'to', 'presentation', ',', 'associated', 'with', 'an', 'acute', 'hepatitis', ',', 'and', 'obesity', 'with', 'a', 'body', 'mass', 'index', '(', 'BMI', ')', 'of', '33.5', 'kg/m2', ',', 'presented', 'with', 'a', 'one-week', 'history', 'of', 'polyuria', ',', 'polydipsia', ',', 'poor', 'appetite', ',', 'and', 'vomiting', '.'] jsl_rd_ner_wip_greedy_clinical_chunks : ['28-year-old', 'female', 'gestational diabetes mellitus', 'eight years prior', 'subsequent type two diabetes mellitus', 'T2DM', 
'HTG-induced pancreatitis', 'three years prior', 'acute hepatitis', 'obesity', 'body mass index ( BMI', '33.5 kg/m2', 'one-week', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ner_medmentions_coarse_chunks : ['female', 'diabetes mellitus', 'diabetes mellitus', 'T2DM', 'HTG-induced pancreatitis', 'associated with', 'acute hepatitis', 'obesity', 'body mass index', 'BMI', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ner_radiology_wip_clinical_chunks : ['gestational diabetes mellitus', 'type two diabetes mellitus', 'T2DM', 'HTG-induced pancreatitis', 'acute hepatitis', 'obesity', 'body', 'mass index', '33.5', 'kg/m2', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ner_clinical_chunks : ['gestational diabetes mellitus', 'subsequent type two diabetes mellitus', 'T2DM', 'HTG-induced pancreatitis', 'an acute hepatitis', 'obesity', 'a body mass index', 'BMI', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ner_chemicals_chunks : [] ner_deid_synthetic_chunks : [] ner_events_clinical_chunks : ['gestational diabetes mellitus', 'eight years', 'presentation', 'subsequent type two diabetes mellitus', 'T2DM', 'HTG-induced pancreatitis', 'three years', 'presentation', 'an acute hepatitis', 'obesity', 'a body mass index ( BMI', 'presented', 'a one-week', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ner_posology_small_chunks : [] ner_anatomy_coarse_chunks : ['body'] ner_human_phenotype_go_clinical_chunks : ['obesity', 'polydipsia', 'vomiting'] ner_jsl_slim_chunks : ['28-year-old', 'female', 'gestational diabetes mellitus', 'eight years prior', 'type two diabetes mellitus', 'T2DM', 'HTG-induced pancreatitis', 'three years prior', 'acute hepatitis', 'obesity', 'body mass index', 'BMI ) of 33.5 kg/m2', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ner_jsl_chunks : ['28-year-old', 'female', 'gestational diabetes mellitus', 'eight years prior', 'subsequent', 'type two diabetes mellitus', 'T2DM', 'HTG-induced pancreatitis', 'three years 
prior', 'acute', 'hepatitis', 'obesity', 'body mass index', '33.5 kg/m2', 'one-week', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ner_jsl_greedy_chunks : ['28-year-old', 'female', 'gestational diabetes mellitus', 'eight years prior', 'type two diabetes mellitus', 'T2DM', 'HTG-induced pancreatitis', 'three years prior', 'acute hepatitis', 'obesity', 'body mass', 'BMI ) of 33.5 kg/m2', 'one-week', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] sentence : ['A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .'] ner_events_admission_clinical_chunks : ['gestational diabetes mellitus', 'eight years', 'presentation', 'subsequent type two diabetes mellitus', 'T2DM', 'HTG-induced pancreatitis', 'three years', 'presentation', 'an acute hepatitis', 'obesity', 'a body mass index', 'BMI', 'kg/m2', 'presented', 'a one-week', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_profiling_clinical| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.2.3+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel (x48) - NerConverter (x48) - Finisher --- layout: model title: Nigerian Pidgin XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_naija_finetuned_naija date: 2022-08-13 tags: [pcm, open_source, xlm_roberta, ner] task: Named Entity Recognition language: pcm edition: Spark NLP 4.1.0 spark_version: 3.0 
supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-naija-finetuned-ner-naija` is a Nigerian Pidgin model originally trained by `mbeukman`. ## Predicted Entities `ORG`, `LOC`, `PER`, `DATE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_naija_finetuned_naija_pcm_4.1.0_3.0_1660427083951.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_naija_finetuned_naija_pcm_4.1.0_3.0_1660427083951.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_naija_finetuned_naija","pcm") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_naija_finetuned_naija","pcm") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
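`NerConverter` assembles the token-level IOB tags emitted by the classifier into entity chunks. A minimal plain-Python sketch of that grouping logic, with hypothetical tokens and tags (an illustration of the idea, not the annotator's actual implementation):

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens (B-X / I-X / O) into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

# Hypothetical Nigerian Pidgin tokens with made-up tags.
tokens = ["Sani", "Abacha", "bin", "travel", "go", "Abuja"]
tags   = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(iob_to_chunks(tokens, tags))  # → [('Sani Abacha', 'PER'), ('Abuja', 'LOC')]
```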
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_naija_finetuned_naija| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|pcm| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-naija-finetuned-ner-naija - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://www.apache.org/licenses/LICENSE-2.0 - https://github.com/masakhane-io/masakhane-ner - https://arxiv.org/pdf/2103.11811.pdf --- layout: model title: Russian T5ForConditionalGeneration Cased model (from anzorq) author: John Snow Labs name: t5_kbd_lat_char_tokenizer date: 2023-01-30 tags: [ru, open_source, t5, tensorflow] task: Text Generation language: ru edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `kbd_lat-ru_char_tokenizer` is a Russian model originally trained by `anzorq`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_kbd_lat_char_tokenizer_ru_4.3.0_3.0_1675104071281.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_kbd_lat_char_tokenizer_ru_4.3.0_3.0_1675104071281.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_kbd_lat_char_tokenizer","ru") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_kbd_lat_char_tokenizer","ru") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
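The model name indicates a character-level tokenizer: each character maps to its own vocabulary id instead of being merged into subwords. A toy plain-Python sketch of the idea (the vocabulary and ids below are made up for illustration and are not the model's actual vocabulary):

```python
# Toy character-level tokenizer: every character is its own token.
text = "salam"
vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}

ids = [vocab[ch] for ch in text]
inverse = sorted(vocab, key=vocab.get)
decoded = "".join(inverse[i] for i in ids)

print(ids)                # character ids
print(decoded == text)    # character tokenization round-trips losslessly
```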
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_kbd_lat_char_tokenizer| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ru| |Size:|777.2 MB| ## References - https://huggingface.co/anzorq/kbd_lat-ru_char_tokenizer --- layout: model title: Portuguese Legal Bert Embeddings (Cased) author: John Snow Labs name: bert_embeddings_bert_base_cased_pt_lenerbr date: 2022-04-11 tags: [bert, embeddings, pt, open_source] task: Embeddings language: pt edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-cased-pt-lenerbr` is a Portuguese model originally trained by `pierreguillou`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_cased_pt_lenerbr_pt_3.4.2_3.0_1649673986730.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_cased_pt_lenerbr_pt_3.4.2_3.0_1649673986730.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_cased_pt_lenerbr","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Eu amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_cased_pt_lenerbr","pt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Eu amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.embed.bert_base_cased_pt_lenerbr").predict("""Eu amo Spark NLP""") ```
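The `embeddings` column holds one dense vector per token; downstream tasks typically compare them with cosine similarity. A minimal sketch with hypothetical 3-dimensional vectors (real BERT-base vectors have 768 dimensions and come from the pipeline's output column):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical token embeddings, purely for illustration.
v_lei   = [0.2, 0.7, 0.1]    # "lei"
v_norma = [0.25, 0.65, 0.05] # "norma" -- semantically close
v_gato  = [0.9, 0.05, 0.4]   # "gato"  -- unrelated

print(round(cosine(v_lei, v_norma), 3))  # close to 1.0
print(round(cosine(v_lei, v_gato), 3))   # noticeably lower
```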
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_cased_pt_lenerbr| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|pt| |Size:|408.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/pierreguillou/bert-base-cased-pt-lenerbr - https://medium.com/@pierre_guillou/nlp-modelos-e-web-app-para-reconhecimento-de-entidade-nomeada-ner-no-dom%C3%ADnio-jur%C3%ADdico-b658db55edfb - https://github.com/piegu/language-models/blob/master/Finetuning_language_model_BERtimbau_LeNER_Br.ipynb - https://paperswithcode.com/sota?task=Fill+Mask&dataset=pierreguillou%2Flener_br_finetuning_language_model --- layout: model title: Translate Slavic languages to English Pipeline author: John Snow Labs name: translate_sla_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, sla, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `sla` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_sla_en_xx_2.7.0_2.4_1609686342089.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_sla_en_xx_2.7.0_2.4_1609686342089.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_sla_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_sla_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.sla.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_sla_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate Papiamento to English Pipeline author: John Snow Labs name: translate_pap_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, pap, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `pap` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_pap_en_xx_2.7.0_2.4_1609689337907.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_pap_en_xx_2.7.0_2.4_1609689337907.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_pap_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_pap_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.pap.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_pap_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Cased model (from mcurmei) author: John Snow Labs name: distilbert_qa_flat_n_max date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `flat_N_max` is an English model originally trained by `mcurmei`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_flat_n_max_en_4.3.0_3.0_1672775159097.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_flat_n_max_en_4.3.0_3.0_1672775159097.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_flat_n_max","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_flat_n_max","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
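Under the hood, extractive question-answering models score every context token as a candidate answer start and answer end, and the returned answer is the highest-scoring span. A toy plain-Python sketch with made-up scores (a real model computes these internally from the question and context):

```python
context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Made-up per-token scores for illustration only.
start_scores = [0.1, 0.2, 0.1, 3.0, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]
end_scores   = [0.1, 0.1, 0.1, 2.5, 0.1, 0.1, 0.1, 0.1, 0.9, 0.1]

# Pick the (start, end) pair with the highest combined score, start <= end.
best = max(
    ((s, e) for s in range(len(context)) for e in range(s, len(context))),
    key=lambda span: start_scores[span[0]] + end_scores[span[1]],
)
answer = " ".join(context[best[0]: best[1] + 1])
print(answer)  # → Clara
```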
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_flat_n_max| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mcurmei/flat_N_max --- layout: model title: English BertForQuestionAnswering Uncased model (from aodiniz) author: John Snow Labs name: bert_qa_uncased_l_2_h_128_a_2_cord19_200616_squad2 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-2_H-128_A-2_cord19-200616_squad2` is an English model originally trained by `aodiniz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_uncased_l_2_h_128_a_2_cord19_200616_squad2_en_4.0.0_3.0_1657188875648.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_uncased_l_2_h_128_a_2_cord19_200616_squad2_en_4.0.0_3.0_1657188875648.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_uncased_l_2_h_128_a_2_cord19_200616_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_uncased_l_2_h_128_a_2_cord19_200616_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_uncased_l_2_h_128_a_2_cord19_200616_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|16.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-2_H-128_A-2_cord19-200616_squad2 --- layout: model title: Finance-related Tweets Classifier author: John Snow Labs name: finclf_twitter_news date: 2023-03-10 tags: [en, licensed, classifier, twitter, finance, tensorflow] task: Text Classification language: en edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: FinanceBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a multiclass classification model which classifies financial tweets into one of the following topics: `Company_or_Product_News`, `Stock_Movement`, `General_News_or_Opinion`, `Earnings`, `Macro`, `Fed_or_Central_Banks`, `Politics`, `Stock_Commentary`, `Financials`, `M&A_or_Investments`, `Legal_or_Regulation`, `Personnel_Change`, `Markets`, `Energy_or_Oil`, `Dividend`, `Analyst_Update`, `Treasuries_or_Corporate_Debt`, `Currencies`.
## Predicted Entities `Company_or_Product_News`, `Stock_Movement`, `General_News_or_Opinion`, `Earnings`, `Macro`, `Fed_or_Central_Banks`, `Politics`, `Stock_Commentary`, `Financials`, `M&A_or_Investments`, `Legal_or_Regulation`, `Personnel_Change`, `Markets`, `Energy_or_Oil`, `Dividend`, `Analyst_Update`, `Treasuries_or_Corporate_Debt`, `Currencies` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_twitter_news_en_1.0.0_3.0_1678444505428.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_twitter_news_en_1.0.0_3.0_1678444505428.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") seq_classifier = finance.BertForSequenceClassification.pretrained("finclf_twitter_news", "en", "finance/models") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = nlp.Pipeline(stages=[documentAssembler, tokenizer, seq_classifier]) data = spark.createDataFrame([["Barclays believes earnings for these underperforming stocks may surprise Wall Street"]]).toDF("text") result = pipeline.fit(data).transform(data) ```
## Results ```bash +----------------+ | result| +----------------+ |[Analyst_Update]| +----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_twitter_news| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|408.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References Train dataset available [here](https://huggingface.co/datasets/zeroshot/twitter-financial-news-topic) ## Benchmarking ```bash label precision recall f1-score support Analyst_Update 0.79 0.79 0.79 38 Company_or_Product_News 0.71 0.78 0.74 112 Currencies 0.80 1.00 0.89 12 Dividend 1.00 0.94 0.97 31 Earnings 0.95 0.97 0.96 100 Energy_or_Oil 0.78 0.89 0.83 55 Fed_or_Central_Banks 0.82 0.78 0.80 95 Financials 0.90 0.93 0.92 60 General_News_or_Opinion 0.71 0.74 0.72 80 Legal_or_Regulation 0.85 0.75 0.80 52 M&A_or_Investments 0.85 0.90 0.87 49 Macro 0.81 0.70 0.75 84 Markets 0.91 0.84 0.87 49 Personnel_Change 0.96 0.94 0.95 50 Politics 0.83 0.82 0.82 83 Stock_Commentary 0.87 0.94 0.90 63 Stock_Movement 0.94 0.90 0.92 89 Treasuries_or_Corporate_Debt 0.80 0.73 0.76 33 accuracy - - 0.84 1135 macro-avg 0.85 0.85 0.85 1135 weighted-avg 0.84 0.84 0.84 1135 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from Firat) author: John Snow Labs name: roberta_qa_firat_base_finetuned_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`roberta-base-finetuned-squad` is an English model originally trained by `Firat`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_firat_base_finetuned_squad_en_4.3.0_3.0_1674217059391.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_firat_base_finetuned_squad_en_4.3.0_3.0_1674217059391.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_firat_base_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_firat_base_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_firat_base_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Firat/roberta-base-finetuned-squad --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_fpdm_hier_ft_news date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_hier_roberta_FT_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_hier_ft_news_en_4.3.0_3.0_1674210937245.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_hier_ft_news_en_4.3.0_3.0_1674210937245.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_hier_ft_news","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_hier_ft_news","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_fpdm_hier_ft_news| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|458.6 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/fpdm_hier_roberta_FT_newsqa --- layout: model title: RoBERTa base model author: John Snow Labs name: roberta_base date: 2021-05-20 tags: [en, english, roberta, embeddings, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in [this paper](https://arxiv.org/abs/1907.11692) and first released in [this repository](https://github.com/pytorch/fairseq/tree/master/examples/roberta). This model is case-sensitive: it makes a difference between english and English. RoBERTa is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with the Masked language modeling (MLM) objective. Taking a sentence, the model randomly masks 15% of the words in the input then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
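The 15% random-masking step described above can be sketched in plain Python. This is only a minimal illustration of the MLM objective, not RoBERTa's actual preprocessing (which operates on subword tokens and uses additional replacement strategies); the `mask_tokens` helper, its arguments, and the example sentence are illustrative assumptions:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_symbol="<mask>", seed=1):
    """Randomly replace ~15% of tokens with a mask symbol, as in the
    masked language modeling (MLM) objective described above."""
    rng = random.Random(seed)  # seeded for reproducibility
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_symbol)
            targets[i] = tok  # the model must predict this token
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens)
```

The model sees `masked` (the sentence with holes) and is trained to predict the original tokens recorded in `targets`, attending to context on both sides of each hole at once.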
This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard classifier using the features produced by the RoBERTa model as inputs. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_en_3.1.0_2.4_1621523388696.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_en_3.1.0_2.4_1621523388696.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = RoBertaEmbeddings.pretrained("roberta_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") ``` ```scala val embeddings = RoBertaEmbeddings.pretrained("roberta_base", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.roberta").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[embeddings]| |Language:|en| |Case sensitive:|true| ## Data Source [https://huggingface.co/roberta-base](https://huggingface.co/roberta-base) ## Benchmarking ```bash When fine-tuned on downstream tasks, this model achieves the following results: Glue test results: | Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | |:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:| | | 87.6 | 91.9 | 92.8 | 94.8 | 63.6 | 91.2 | 90.2 | 78.7 | ``` --- layout: model title: English asr_wav2vec2_xls_r_300m_Turkish_Tr_med TFWav2Vec2ForCTC from emre author: John Snow Labs name: asr_wav2vec2_xls_r_300m_Turkish_Tr_med date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_Turkish_Tr_med` is an English model originally trained by emre. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_xls_r_300m_Turkish_Tr_med_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_Turkish_Tr_med_en_4.2.0_3.0_1664037743982.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_Turkish_Tr_med_en_4.2.0_3.0_1664037743982.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xls_r_300m_Turkish_Tr_med", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xls_r_300m_Turkish_Tr_med", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xls_r_300m_Turkish_Tr_med| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from SimonLi123) author: John Snow Labs name: distilbert_qa_simonli123_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `SimonLi123`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_simonli123_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769256129.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_simonli123_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769256129.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_simonli123_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_simonli123_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_simonli123_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/SimonLi123/distilbert-base-uncased-finetuned-squad --- layout: model title: Korean ElectraForQuestionAnswering model (from seongju) author: John Snow Labs name: electra_qa_klue_mrc_base date: 2022-06-22 tags: [ko, open_source, electra, question_answering] task: Question Answering language: ko edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `klue-mrc-koelectra-base` is a Korean model originally trained by `seongju`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_klue_mrc_base_ko_4.0.0_3.0_1655922056568.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_klue_mrc_base_ko_4.0.0_3.0_1655922056568.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_klue_mrc_base","ko") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_klue_mrc_base","ko") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ko.answer_question.klue.electra.base").predict("""내 이름은 무엇입니까?|||제 이름은 클라라이고 저는 버클리에 살고 있습니다.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_klue_mrc_base| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ko| |Size:|419.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/seongju/klue-mrc-koelectra-base --- layout: model title: English asr_exp_w2v2r_xls_r_gender_male_10_female_0_s287 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_xls_r_gender_male_10_female_0_s287 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_gender_male_10_female_0_s287` is an English model originally trained by jonatasgrosman. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_exp_w2v2r_xls_r_gender_male_10_female_0_s287_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_gender_male_10_female_0_s287_en_4.2.0_3.0_1664093815678.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_gender_male_10_female_0_s287_en_4.2.0_3.0_1664093815678.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_xls_r_gender_male_10_female_0_s287", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_xls_r_gender_male_10_female_0_s287", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_xls_r_gender_male_10_female_0_s287| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Legal Compliance with laws Clause Binary Classifier author: John Snow Labs name: legclf_compliance_with_laws_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `compliance-with-laws` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend you split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
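The paragraph-splitting strategy recommended above (splitting by multiline before classification, so each chunk stays under the 512-token limit of the embeddings) can be sketched as follows. This is a minimal illustration, not the workshop's actual implementation; the `split_paragraphs` helper and the rough whitespace token count are assumptions:

```python
import re

def split_paragraphs(text, max_tokens=512):
    """Split a long legal document into paragraphs (by blank lines) and
    flag any chunk that would exceed the embedding model's token limit."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    for p in paragraphs:
        n_tokens = len(p.split())  # rough whitespace-based token count
        chunks.append({"text": p, "n_tokens": n_tokens,
                       "needs_further_split": n_tokens > max_tokens})
    return chunks

doc = """Section 1. Compliance with Laws.

Each party shall comply with all applicable laws and regulations.

Section 2. Miscellaneous."""
chunks = split_paragraphs(doc)
```

Each resulting chunk can then be fed as one row of the `clause_text` column consumed by the classifier pipeline shown in this card.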
## Predicted Entities `other`, `compliance-with-laws` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_compliance_with_laws_clause_en_1.0.0_3.2_1660123329440.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_compliance_with_laws_clause_en_1.0.0_3.2_1660123329440.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_compliance_with_laws_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +----------------------+ |                result| +----------------------+ |[compliance-with-laws]| |               [other]| |               [other]| |[compliance-with-laws]| +----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_compliance_with_laws_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scrapped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support compliance-with-laws 0.92 0.94 0.93 64 other 0.97 0.96 0.97 141 accuracy - - 0.96 205 macro-avg 0.95 0.95 0.95 205 weighted-avg 0.96 0.96 0.96 205 ``` --- layout: model title: Norwegian BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_no_cased date: 2022-12-02 tags: ["no", open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: "no" edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-no-cased` is a Norwegian model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_no_cased_no_4.2.4_3.0_1670018633680.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_no_cased_no_4.2.4_3.0_1670018633680.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_no_cased","no") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_no_cased","no") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_no_cased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|no|
|Size:|390.4 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/Geotrend/bert-base-no-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers

---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from caiosantillo)
author: John Snow Labs
name: distilbert_qa_caiosantillo_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `caiosantillo`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_caiosantillo_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770315257.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_caiosantillo_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770315257.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_caiosantillo_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_caiosantillo_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_caiosantillo_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/caiosantillo/distilbert-base-uncased-finetuned-squad --- layout: model title: Sentence Entity Resolver for HCPCS Codes author: John Snow Labs name: sbiobertresolve_hcpcs date: 2022-02-28 tags: [hcpcs, resolver, en, licensed] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.4.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to [Healthcare Common Procedure Coding System (HCPCS)](https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/HCPCS/index.html#:~:text=The%20Healthcare%20Common%20Procedure%20Coding,%2C%20supplies%2C%20products%20and%20services.) codes using 'sbiobert_base_cased_mli' sentence embeddings. 
## Predicted Entities `HCPCS Codes` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_HCPCS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_hcpcs_en_3.4.0_3.0_1646035118020.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_hcpcs_en_3.4.0_3.0_1646035118020.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings\ .pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\ .setInputCols(["ner_chunk"])\ .setOutputCol("sentence_embeddings") hcpcs_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_hcpcs", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sentence_embeddings"]) \ .setOutputCol("hcpcs_code")\ .setDistanceFunction("EUCLIDEAN") hcpcs_pipeline = Pipeline( stages = [ documentAssembler, sbert_embedder, hcpcs_resolver]) data = spark.createDataFrame([["Breast prosthesis, mastectomy bra, with integrated breast prosthesis form, unilateral, any size, any type"]]).toDF("text") results = hcpcs_pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en","clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sentence_embeddings") val hcpcs_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_hcpcs", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sentence_embeddings")) .setOutputCol("hcpcs_code") .setDistanceFunction("EUCLIDEAN") val hcpcs_pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, hcpcs_resolver)) val data = Seq("Breast prosthesis, mastectomy bra, with integrated breast prosthesis form, unilateral, any size, any type").toDF("text") val results = hcpcs_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.hcpcs").predict("""Breast prosthesis, mastectomy bra, with integrated breast prosthesis form, unilateral, any size, any type""") ```
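Conceptually, the resolver above ranks candidate HCPCS codes by the distance (here `EUCLIDEAN`) between the chunk's sentence embedding and each code's reference embedding, which is why the output includes both a best code and an `all_codes` ranking. A minimal plain-Python sketch of that nearest-neighbor step, using invented 3-dimensional toy vectors rather than the real 768-dimensional sbiobert embeddings:

```python
import math

# Toy reference embeddings for three HCPCS codes (made-up values for illustration)
code_embeddings = {
    "L8001": [0.9, 0.1, 0.0],
    "L8002": [0.8, 0.2, 0.1],
    "L8000": [0.1, 0.9, 0.3],
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def resolve(chunk_embedding, index):
    # Rank every candidate code by distance to the chunk embedding, closest first
    ranked = sorted(index, key=lambda code: euclidean(chunk_embedding, index[code]))
    return ranked[0], ranked

best, all_codes = resolve([0.88, 0.12, 0.02], code_embeddings)
print(best)       # best-matching code, analogous to the hcpcs_code column
print(all_codes)  # full ranking, analogous to the all_codes column
```

The real model performs this search over thousands of code descriptions embedded with `sbiobert_base_cased_mli`; the sketch only shows the ranking logic.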
## Results

```bash
+---+---------------------------------------------------------------------------------------------------------+----------+----------------------------------------+------------------------------------------------------------+
|   |ner_chunk                                                                                                |hcpcs_code|all_codes                               |resolutions                                                 |
+---+---------------------------------------------------------------------------------------------------------+----------+----------------------------------------+------------------------------------------------------------+
|0  |Breast prosthesis, mastectomy bra, with integrated breast prosthesis form, unilateral, any size, any type|L8001     |[L8001, L8002, L8000, L8033, L8032, ...]|'Breast prosthesis, mastectomy bra, with integrated breast prosthesis form, unilateral, any size, any type', 'Breast prosthesis, mastectomy bra, with integrated breast prosthesis form, bilateral, any size, any type', 'Breast prosthesis, mastectomy bra, without integrated breast prosthesis form, any size, any type', 'Nipple prosthesis, custom fabricated, reusable, any material, any type, each', ...|
+---+---------------------------------------------------------------------------------------------------------+----------+----------------------------------------+------------------------------------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_hcpcs|
|Compatibility:|Healthcare NLP 3.4.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[hcpcs_code]|
|Language:|en|
|Size:|21.5 MB|
|Case sensitive:|false|

---
layout: model
title: English BertForQuestionAnswering Cased model (from AnonymousSub)
author: John Snow Labs
name: bert_qa_rule_based_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1657191037278.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1657191037278.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

// Input columns must match the assembler's outputs
val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
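Internally, a ForQuestionAnswering head scores every context token as a possible answer start and end, then returns the highest-scoring valid span. A toy sketch of that span-selection step (the scores below are invented for illustration, not produced by this model):

```python
# Tokenized context and per-token start/end scores (made-up values)
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.2, 0.1, 3.5, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]
end_scores   = [0.1, 0.1, 0.1, 3.2, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]

def best_span(starts, ends, max_len=10):
    # Search all (start, end) pairs with start <= end and a bounded span length,
    # keeping the pair whose combined score is highest
    best = (0, 0, float("-inf"))
    for i, s in enumerate(starts):
        for j in range(i, min(i + max_len, len(ends))):
            score = s + ends[j]
            if score > best[2]:
                best = (i, j, score)
    return best[0], best[1]

i, j = best_span(start_scores, end_scores)
answer = " ".join(tokens[i:j + 1])
print(answer)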
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_rule_based_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/AnonymousSub/rule_based_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0

---
layout: model
title: Moldavian, Moldovan, Romanian asr_romanian_wav2vec2 TFWav2Vec2ForCTC from gigant
author: John Snow Labs
name: pipeline_asr_romanian_wav2vec2
date: 2022-09-25
tags: [wav2vec2, ro, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: ro
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_romanian_wav2vec2` is a Moldavian, Moldovan, Romanian model originally trained by gigant.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_romanian_wav2vec2_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_romanian_wav2vec2_ro_4.2.0_3.0_1664098225815.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_romanian_wav2vec2_ro_4.2.0_3.0_1664098225815.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_romanian_wav2vec2', lang = 'ro') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_romanian_wav2vec2", lang = "ro") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_romanian_wav2vec2| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|ro| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Tagalog RobertaForMaskedLM Base Cased model (from jcblaise) author: John Snow Labs name: roberta_embeddings_tagalog_base date: 2022-12-12 tags: [tl, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: tl edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-tagalog-base` is a Tagalog model originally trained by `jcblaise`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_tagalog_base_tl_4.2.4_3.0_1670859679601.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_tagalog_base_tl_4.2.4_3.0_1670859679601.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
# DocumentAssembler takes a single input/output column (setInputCol/setOutputCol)
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_tagalog_base","tl") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_tagalog_base","tl")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_tagalog_base|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|tl|
|Size:|409.9 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/jcblaise/roberta-tagalog-base
- https://blaisecruz.com

---
layout: model
title: English XlmRoBertaForQuestionAnswering (from krinal214)
author: John Snow Labs
name: xlm_roberta_qa_xlm_3lang
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-3lang` is an English model originally trained by `krinal214`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_3lang_en_4.0.0_3.0_1655988266844.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_3lang_en_4.0.0_3.0_1655988266844.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_3lang","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_3lang","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.tydiqa.xlm_roberta.3lang").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_3lang|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|864.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/krinal214/xlm-3lang

---
layout: model
title: Roberta Clinical Word Embeddings (Spanish)
author: John Snow Labs
name: roberta_base_biomedical
date: 2021-11-01
tags: [embeddings, spanish, biomedical, clinical, roberta, es]
task: Embeddings
language: es
edition: Spark NLP 3.3.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Biomedical pretrained language model for Spanish with a 768-dimensional embedding, imported from Hugging Face (https://huggingface.co/PlanTL-GOB-ES/roberta-base-biomedical-es) to be used in Spark NLP via the RoBertaEmbeddings() transformer class.

This model is a RoBERTa-based model trained on a biomedical corpus in Spanish collected from several sources (see the dataset section). The training corpus has been tokenized using a byte version of Byte-Pair Encoding (BPE) used in the original RoBERTa model, with a vocabulary size of 52,000 tokens. The pretraining consists of masked language model training at the subword level, following the approach employed for the RoBERTa base model with the same hyperparameters as in the original work. The training lasted a total of 48 hours with 16 NVIDIA V100 GPUs of 16GB DDRAM, using the Adam optimizer with a peak learning rate of 0.0005 and an effective batch size of 2,048 sentences.
To see more details, please check the official page in Hugging Face: https://huggingface.co/PlanTL-GOB-ES/roberta-base-biomedical-es ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_biomedical_es_3.3.0_3.0_1635781845226.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_biomedical_es_3.3.0_3.0_1635781845226.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("term")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")\ .setInputCols(["document", "token"])\ .setOutputCol("roberta_embeddings") pipeline = Pipeline(stages = [ documentAssembler, tokenizer, roberta_embeddings]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("term") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es") .setInputCols(Array("document", "token")) .setOutputCol("roberta_embeddings") val pipeline = new Pipeline().setStages(Array( documentAssembler, tokenizer, roberta_embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.roberta_base_biomedical").predict("""Put your text here.""") ```
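The byte-level BPE tokenization mentioned in the description learns its subword vocabulary by repeatedly merging the most frequent adjacent symbol pair. A minimal plain-Python sketch of that merge loop on a toy character-level corpus (the real tokenizer operates on bytes and learns 52,000 merges from the biomedical corpus):

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # Rewrite every word, fusing each occurrence of the chosen pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy word frequencies, each word pre-split into characters
vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}
for _ in range(2):
    vocab = merge_pair(vocab, most_frequent_pair(vocab))
print(vocab)  # after two merges, "low" has become a single symbol
```

Each merge adds one symbol to the vocabulary; running the loop until the vocabulary reaches the target size (52,000 here) yields the tokenizer's merge table.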
## Results

The model has been evaluated on Named Entity Recognition (NER) using the following datasets (taken from https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es):

* PharmaCoNER: a track on chemical and drug mention recognition from Spanish medical texts (for more info see: https://temu.bsc.es/pharmaconer/).
* CANTEMIST: a shared task specifically focusing on named entity recognition of tumor morphology in Spanish (for more info see: https://zenodo.org/record/3978041#.YTt5qH2xXbQ).
* ICTUSnet: consists of 1,006 hospital discharge reports of patients admitted for stroke from 18 different Spanish hospitals. It contains more than 79,000 annotations for 51 different kinds of variables.

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_base_biomedical|
|Compatibility:|Spark NLP 3.3.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|es|

## Data Source

Datasets are available in the official author(s) GitHub project, available here: https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es, and include:

* Medical crawler (745,705,946 tokens): crawler of more than 3,000 URLs belonging to Spanish biomedical and health domains.
* Clinical cases misc. (102,855,267 tokens): a miscellany of medical content, essentially clinical cases. Note that a clinical case report is a scientific publication where medical practitioners share patient cases, and it is different from a clinical note or document.
* Clinical notes/documents (91,250,080 tokens): collection of more than 278K clinical documents, including discharge reports, clinical course notes and X-ray reports, for a total of 91M tokens.
* Scielo (60,007,289 tokens): publications written in Spanish crawled from the Spanish SciELO server in 2017.
* BARR2_background (24,516,442 tokens): Biomedical Abbreviation Recognition and Resolution (BARR2) containing Spanish clinical case study sections from a variety of clinical disciplines.
* Wikipedia_life_sciences (13,890,501 tokens): Wikipedia articles crawled 04/01/2021 with the Wikipedia API python library, starting from the "Ciencias_de_la_vida" category up to a maximum of 5 subcategories. Multiple links to the same articles are then discarded to avoid repeating content.
* Patents (13,463,387 tokens): Google Patent in Medical Domain for Spain (Spanish). The accepted codes (Medical Domain) for JSON files of patents are: "A61B", "A61C", "A61F", "A61H", "A61K", "A61L", "A61M", "A61B", "A61P".
* EMEA (5,377,448 tokens): Spanish-side documents extracted from parallel corpora made out of PDF documents from the European Medicines Agency.
* mespen_Medline (4,166,077 tokens): Spanish-side articles extracted from a collection of Spanish-English parallel corpora consisting of biomedical scientific literature. The collection of parallel resources is aggregated from the MedlinePlus source.
* PubMed (1,858,966 tokens): open-access articles from the PubMed repository crawled in 2017.

## Benchmarking

```bash
Taken from https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es:

Task/models   F1    | Precision | Recall
PharmaCoNER   90.04 | 88.92     | 91.18
CANTEMIST     83.34 | 81.48     | 85.30
ICTUSnet      88.08 | 84.92     | 91.50
```

---
layout: model
title: Fast Neural Machine Translation Model from English to Ndonga
author: John Snow Labs
name: opus_mt_en_ng
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, ng, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list).

- source languages: `en`
- target languages: `ng`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ng_xx_2.7.0_2.4_1609167701698.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ng_xx_2.7.0_2.4_1609167701698.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Output column must match the MarianTransformer input column below
sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_ng", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```

```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols("document")
  .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_ng", "xx")
  .setInputCols("sentence")
  .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.ng').predict(text, output_level='sentence')
opus_df
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_ng| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Stopwords Remover for Persian (Farsi) language (430 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, fa, open_source] task: Stop Words Removal language: fa edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_fa_3.4.1_3.0_1646672314599.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_fa_3.4.1_3.0_1646672314599.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

stop_words = StopWordsCleaner.pretrained("stopwords_iso","fa") \
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])

example = spark.createDataFrame([["شما بهتر از من نیستید"]], ["text"])

results = pipeline.fit(example).transform(example)
```

```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val stop_words = StopWordsCleaner.pretrained("stopwords_iso","fa")
  .setInputCols(Array("token"))
  .setOutputCol("cleanTokens")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))

val data = Seq("شما بهتر از من نیستید").toDF("text")

val results = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("fa.stopwords").predict("""شما بهتر از من نیستید""")
```
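Conceptually, the StopWordsCleaner stage simply drops every token that appears in a fixed stopword list. A plain-Python sketch of the same filtering on the example sentence above (the four stopwords are inferred from this card's output, in which only نیستید survives; the pretrained model ships the full 430-entry stopwords-iso list for Farsi):

```python
# Subset of Farsi stopwords, taken from the example sentence in this card
stopwords = {"شما", "بهتر", "از", "من"}

def clean_tokens(tokens, stopwords):
    # Keep only tokens that are not in the stopword list, preserving order
    return [t for t in tokens if t not in stopwords]

tokens = ["شما", "بهتر", "از", "من", "نیستید"]
print(clean_tokens(tokens, stopwords))  # only the non-stopword remains
```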
## Results

```bash
+--------+
|result  |
+--------+
|[نیستید]|
+--------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|fa|
|Size:|3.0 KB|

---
layout: model
title: English RobertaForQuestionAnswering (from manishiitg)
author: John Snow Labs
name: roberta_qa_distilrobert_base_squadv2_328seq_128stride_test
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilrobert-base-squadv2-328seq-128stride-test` is an English model originally trained by `manishiitg`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_distilrobert_base_squadv2_328seq_128stride_test_en_4.0.0_3.0_1655728265593.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_distilrobert_base_squadv2_328seq_128stride_test_en_4.0.0_3.0_1655728265593.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_distilrobert_base_squadv2_328seq_128stride_test","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_distilrobert_base_squadv2_328seq_128stride_test","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.distilled_base_128d_32d_v2").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_distilrobert_base_squadv2_328seq_128stride_test| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|307.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/manishiitg/distilrobert-base-squadv2-328seq-128stride-test --- layout: model title: Detect Persons, Locations, Organizations and Misc Entities in French (WikiNER 6B 300) author: John Snow Labs name: wikiner_6B_300 date: 2020-02-03 task: Named Entity Recognition language: fr edition: Spark NLP 2.4.0 spark_version: 2.4 tags: [ner, fr, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 6B 300 is trained with GloVe 6B 300 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_FR){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_FR.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_6B_300_fr_2.4.0_2.4_1579717534654.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_6B_300_fr_2.4.0_2.4_1579717534654.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang="xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("wikiner_6B_300", "fr") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (né le 28 octobre 1955) est un magnat des affaires, développeur de logiciels, investisseur et philanthrope américain. Il est surtout connu comme le co-fondateur de Microsoft Corporation. Au cours de sa carrière chez Microsoft, Gates a occupé les postes de président, chef de la direction (PDG), président et architecte logiciel en chef, tout en étant le plus grand actionnaire individuel jusqu'en mai 2014. Il est l'un des entrepreneurs et pionniers les plus connus du révolution des micro-ordinateurs des années 1970 et 1980. Né et élevé à Seattle, Washington, Gates a cofondé Microsoft avec son ami d'enfance Paul Allen en 1975, à Albuquerque, au Nouveau-Mexique; il est devenu la plus grande société de logiciels informatiques au monde. Gates a dirigé l'entreprise en tant que président-directeur général jusqu'à sa démission en tant que PDG en janvier 2000, mais il est resté président et est devenu architecte logiciel en chef. À la fin des années 1990, Gates avait été critiqué pour ses tactiques commerciales, considérées comme anticoncurrentielles. Cette opinion a été confirmée par de nombreuses décisions de justice. 
En juin 2006, Gates a annoncé qu'il passerait à un poste à temps partiel chez Microsoft et à un emploi à temps plein à la Bill & Melinda Gates Foundation, la fondation caritative privée que lui et sa femme, Melinda Gates, ont créée en 2000. Il a progressivement transféré ses fonctions à Ray Ozzie et Craig Mundie. Il a démissionné de son poste de président de Microsoft en février 2014 et a assumé un nouveau poste de conseiller technologique pour soutenir le nouveau PDG Satya Nadella."]], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang="xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("wikiner_6B_300", "fr") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (né le 28 octobre 1955) est un magnat des affaires, développeur de logiciels, investisseur et philanthrope américain. Il est surtout connu comme le co-fondateur de Microsoft Corporation. Au cours de sa carrière chez Microsoft, Gates a occupé les postes de président, chef de la direction (PDG), président et architecte logiciel en chef, tout en étant le plus grand actionnaire individuel jusqu'en mai 2014. Il est l'un des entrepreneurs et pionniers les plus connus du révolution des micro-ordinateurs des années 1970 et 1980. Né et élevé à Seattle, Washington, Gates a cofondé Microsoft avec son ami d'enfance Paul Allen en 1975, à Albuquerque, au Nouveau-Mexique; il est devenu la plus grande société de logiciels informatiques au monde. Gates a dirigé l'entreprise en tant que président-directeur général jusqu'à sa démission en tant que PDG en janvier 2000, mais il est resté président et est devenu architecte logiciel en chef. 
À la fin des années 1990, Gates avait été critiqué pour ses tactiques commerciales, considérées comme anticoncurrentielles. Cette opinion a été confirmée par de nombreuses décisions de justice. En juin 2006, Gates a annoncé qu'il passerait à un poste à temps partiel chez Microsoft et à un emploi à temps plein à la Bill & Melinda Gates Foundation, la fondation caritative privée que lui et sa femme, Melinda Gates, ont créée en 2000. Il a progressivement transféré ses fonctions à Ray Ozzie et Craig Mundie. Il a démissionné de son poste de président de Microsoft en février 2014 et a assumé un nouveau poste de conseiller technologique pour soutenir le nouveau PDG Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (né le 28 octobre 1955) est un magnat des affaires, développeur de logiciels, investisseur et philanthrope américain. Il est surtout connu comme le co-fondateur de Microsoft Corporation. Au cours de sa carrière chez Microsoft, Gates a occupé les postes de président, chef de la direction (PDG), président et architecte logiciel en chef, tout en étant le plus grand actionnaire individuel jusqu'en mai 2014. Il est l'un des entrepreneurs et pionniers les plus connus du révolution des micro-ordinateurs des années 1970 et 1980. Né et élevé à Seattle, Washington, Gates a cofondé Microsoft avec son ami d'enfance Paul Allen en 1975, à Albuquerque, au Nouveau-Mexique; il est devenu la plus grande société de logiciels informatiques au monde. Gates a dirigé l'entreprise en tant que président-directeur général jusqu'à sa démission en tant que PDG en janvier 2000, mais il est resté président et est devenu architecte logiciel en chef. À la fin des années 1990, Gates avait été critiqué pour ses tactiques commerciales, considérées comme anticoncurrentielles. Cette opinion a été confirmée par de nombreuses décisions de justice. 
En juin 2006, Gates a annoncé qu'il passerait à un poste à temps partiel chez Microsoft et à un emploi à temps plein à la Bill & Melinda Gates Foundation, la fondation caritative privée que lui et sa femme, Melinda Gates, ont créée en 2000. Il a progressivement transféré ses fonctions à Ray Ozzie et Craig Mundie. Il a démissionné de son poste de président de Microsoft en février 2014 et a assumé un nouveau poste de conseiller technologique pour soutenir le nouveau PDG Satya Nadella."""] ner_df = nlu.load('fr.ner.wikiner.glove.6B_300').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
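The chunks shown in the results below are produced by the `ner_converter` stage, which merges consecutive IOB tags (`B-PER`, `I-PER`, …) emitted by the NER model into entity spans. A minimal sketch of that merging logic, independent of Spark NLP (the token/tag data below is illustrative, not actual model output):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; flush any entity in progress.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # Continuation of the current entity.
            current.append(token)
        else:
            # "O" tag (or inconsistent I- tag) ends the current entity.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["William", "Henry", "Gates", "III", "est", "né", "à", "Seattle"]
tags   = ["B-PER", "I-PER", "I-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(iob_to_chunks(tokens, tags))
# → [('William Henry Gates III', 'PER'), ('Seattle', 'LOC')]
```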
{:.h2_title} ## Results ```bash +-------------------------------+---------+ |chunk |ner_label| +-------------------------------+---------+ |William Henry Gates III |PER | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |ORG | |Né |LOC | |Seattle |LOC | |Washington |LOC | |Gates |PER | |Microsoft |ORG | |Paul Allen |PER | |Albuquerque |LOC | |Nouveau-Mexique |LOC | |Gates |PER | |PDG |ORG | |À la fin des années |MISC | |Gates |PER | |Cette |LOC | |Gates |PER | |Microsoft |ORG | |Bill & Melinda Gates Foundation|ORG | +-------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wikiner_6B_300| |Type:|ner| |Compatibility:| Spark NLP 2.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|fr| |Case sensitive:|false| {:.h2_title} ## Data Source The model is trained based on data from [https://fr.wikipedia.org](https://fr.wikipedia.org) --- layout: model title: Pipeline to Detect Living Species (w2v_cc_300d) author: John Snow Labs name: ner_living_species_pipeline date: 2023-03-13 tags: [ca, ner, clinical, licensed] task: Named Entity Recognition language: ca edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_living_species](https://nlp.johnsnowlabs.com/2022/06/23/ner_living_species_ca_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_pipeline_ca_4.3.0_3.2_1678703245172.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_pipeline_ca_4.3.0_3.2_1678703245172.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_living_species_pipeline", "ca", "clinical/models") text = '''Dona de 47 anys al·lèrgica al iode, fumadora social, intervinguda de varices, dues cesàries i un abscés gluti. Sense altres antecedents mèdics d'interès ni tractament habitual. Viu amb el seu marit i tres fills, treballa com a professora. En el moment de la nostra valoració en la planta de Cirurgia General, la pacient presenta TA 69/40 mm Hg, freqüència cardíaca 120 lpm, taquipnea en repòs, pal·lidesa mucocutánea, mala perfusió distal i afligeix nàusees. L'abdomen és tou, no presenta peritonismo i el dèbit del drenatge abdominal roman sense canvis. Les serologies de Coxiella burnetii, Bartonella henselae, Borrelia burgdorferi, Entamoeba histolytica, Toxoplasma gondii, citomegalovirus, virus de Epstein Barr, virus varicel·la zoster i parvovirus B19 van ser negatives. No obstant això, es va detectar test de rosa de Bengala positiu per a Brucella, el test de Coombs i les aglutinacions també van ser positives amb un títol 1/40.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_living_species_pipeline", "ca", "clinical/models") val text = "Dona de 47 anys al·lèrgica al iode, fumadora social, intervinguda de varices, dues cesàries i un abscés gluti. Sense altres antecedents mèdics d'interès ni tractament habitual. Viu amb el seu marit i tres fills, treballa com a professora. En el moment de la nostra valoració en la planta de Cirurgia General, la pacient presenta TA 69/40 mm Hg, freqüència cardíaca 120 lpm, taquipnea en repòs, pal·lidesa mucocutánea, mala perfusió distal i afligeix nàusees. L'abdomen és tou, no presenta peritonismo i el dèbit del drenatge abdominal roman sense canvis. 
Les serologies de Coxiella burnetii, Bartonella henselae, Borrelia burgdorferi, Entamoeba histolytica, Toxoplasma gondii, citomegalovirus, virus de Epstein Barr, virus varicel·la zoster i parvovirus B19 van ser negatives. No obstant això, es va detectar test de rosa de Bengala positiu per a Brucella, el test de Coombs i les aglutinacions també van ser positives amb un títol 1/40." val result = pipeline.fullAnnotate(text) ```
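In the results below, `begin` and `end` are zero-based character offsets into the input text, with `end` inclusive. For reference, an offset pair in that convention can be recovered with plain string search — a sketch unrelated to the pipeline's internals:

```python
def char_span(text, chunk, start=0):
    """Return (begin, end) of chunk in text, end inclusive, matching
    the offset convention used in the results table."""
    begin = text.find(chunk, start)
    if begin == -1:
        raise ValueError(f"{chunk!r} not found in text")
    return begin, begin + len(chunk) - 1

text = "Dona de 47 anys al·lèrgica al iode"
print(char_span(text, "Dona"))  # → (0, 3)
```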
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:------------------------|--------:|------:|:------------|-------------:| | 0 | Dona | 0 | 3 | HUMAN | 1 | | 1 | marit | 192 | 196 | HUMAN | 0.9867 | | 2 | fills | 205 | 209 | HUMAN | 0.9822 | | 3 | professora | 227 | 236 | HUMAN | 0.9987 | | 4 | pacient | 312 | 318 | HUMAN | 0.9986 | | 5 | Coxiella burnetii | 573 | 589 | SPECIES | 0.96365 | | 6 | Bartonella henselae | 592 | 610 | SPECIES | 0.92445 | | 7 | Borrelia burgdorferi | 613 | 632 | SPECIES | 0.91515 | | 8 | Entamoeba histolytica | 635 | 655 | SPECIES | 0.87195 | | 9 | Toxoplasma gondii | 658 | 674 | SPECIES | 0.8935 | | 10 | citomegalovirus | 677 | 691 | SPECIES | 0.9227 | | 11 | virus de Epstein Barr | 694 | 714 | SPECIES | 0.730375 | | 12 | virus varicel·la zoster | 717 | 739 | SPECIES | 0.778333 | | 13 | parvovirus B19 | 743 | 756 | SPECIES | 0.9138 | | 14 | Brucella | 847 | 854 | SPECIES | 0.9483 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|ca| |Size:|1.2 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Danish asr_wav2vec2_xls_r_300m_ftspeech TFWav2Vec2ForCTC from saattrupdan author: John Snow Labs name: pipeline_asr_wav2vec2_xls_r_300m_ftspeech date: 2022-09-25 tags: [wav2vec2, da, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: da edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_xls_r_300m_ftspeech` is a Danish model 
originally trained by saattrupdan. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xls_r_300m_ftspeech_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_ftspeech_da_4.2.0_3.0_1664101655296.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_ftspeech_da_4.2.0_3.0_1664101655296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_300m_ftspeech', lang = 'da') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_300m_ftspeech", lang = "da") val annotations = pipeline.transform(audioDF) ```
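Both snippets assume an `audioDF` DataFrame whose `audio_content` column holds the raw waveform as an array of floats (16 kHz mono is the usual expectation for Wav2Vec2 models). One way to produce those floats from a 16-bit PCM WAV file using only the Python standard library — a sketch; the file name and the commented Spark lines are placeholders:

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit PCM mono WAV file and normalize samples to [-1.0, 1.0]."""
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    # Each sample is a little-endian signed 16-bit integer.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# floats = wav_to_floats("sample.wav")  # hypothetical input file
# audioDF = spark.createDataFrame([[floats]], ["audio_content"])
```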
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xls_r_300m_ftspeech| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|da| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal Persons Clause Binary Classifier author: John Snow Labs name: legclf_persons_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `persons` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. 
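Paragraph splitting by multiline, the first technique mentioned above, can be as simple as breaking the document on blank lines before feeding each piece to the classifier — a minimal sketch (the full set of techniques is covered in the linked tutorial; the clause text below is invented):

```python
import re

def split_paragraphs(document):
    """Split a document on runs of blank lines; drop empty pieces."""
    parts = re.split(r"\n\s*\n", document)
    return [p.strip() for p in parts if p.strip()]

doc = "Clause 1. The parties agree...\n\nClause 2. Persons bound by..."
print(split_paragraphs(doc))
# → ['Clause 1. The parties agree...', 'Clause 2. Persons bound by...']
```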
## Predicted Entities `other`, `persons` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_persons_clause_en_1.0.0_3.2_1660123826195.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_persons_clause_en_1.0.0_3.2_1660123826195.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_persons_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[persons]| |[other]| |[other]| |[persons]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_persons_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 0.96 0.98 159 persons 0.92 1.00 0.96 73 accuracy - - 0.97 232 macro-avg 0.96 0.98 0.97 232 weighted-avg 0.98 0.97 0.97 232 ``` --- layout: model title: Legal Arbitration Clause Binary Classifier author: John Snow Labs name: legclf_arbitration_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `arbitration` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `arbitration` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_arbitration_clause_en_1.0.0_3.2_1660122127772.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_arbitration_clause_en_1.0.0_3.2_1660122127772.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_arbitration_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[arbitration]| |[other]| |[other]| |[arbitration]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_arbitration_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support arbitration 0.95 0.95 0.95 21 other 0.99 0.99 0.99 68 accuracy - - 0.98 89 macro-avg 0.97 0.97 0.97 89 weighted-avg 0.98 0.98 0.98 89 ``` --- layout: model title: Legal Agreement And Declaration Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_agreement_and_declaration_bert date: 2022-11-24 tags: [en, legal, classification, agreement, declaration, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_agreement_and_declaration_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `agreement-and-declaration` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `agreement-and-declaration`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_agreement_and_declaration_bert_en_1.0.0_3.0_1669301061858.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_agreement_and_declaration_bert_en_1.0.0_3.0_1669301061858.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_agreement_and_declaration_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[agreement-and-declaration]| |[other]| |[other]| |[agreement-and-declaration]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_agreement_and_declaration_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support agreement-and-declaration 1.00 0.98 0.99 43 other 0.99 1.00 0.99 82 accuracy - - 0.99 125 macro-avg 0.99 0.99 0.99 125 weighted-avg 0.99 0.99 0.99 125 ``` --- layout: model title: English RobertaForQuestionAnswering (from Nakul24) author: John Snow Labs name: roberta_qa_RoBERTa_emotion_extraction date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `RoBERTa-emotion-extraction` is an English model originally trained by `Nakul24`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_RoBERTa_emotion_extraction_en_4.0.0_3.0_1655727123419.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_RoBERTa_emotion_extraction_en_4.0.0_3.0_1655727123419.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_RoBERTa_emotion_extraction","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_RoBERTa_emotion_extraction","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.by_Nakul24").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
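In the NLU one-liner, the `|||` separator is how a single string carries both the question and the context. A sketch of how such a payload splits back into its two parts — illustrative only, not NLU's internal code:

```python
def split_qa(payload, sep="|||"):
    """Split a 'question|||context' payload into (question, context)."""
    question, _, context = payload.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # → What's my name?
print(c)  # → My name is Clara and I live in Berkeley.
```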
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_RoBERTa_emotion_extraction| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|426.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Nakul24/RoBERTa-emotion-extraction --- layout: model title: Dutch Lemmatizer author: John Snow Labs name: lemma date: 2020-05-03 22:46:00 +0800 task: Lemmatization language: nl edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [lemmatizer, nl] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_nl_2.5.0_2.4_1588532720582.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_nl_2.5.0_2.4_1588532720582.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "nl") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Behalve dat hij de koning van het noorden is, is John Snow een Engelse arts en een leider in de ontwikkeling van anesthesie en medische hygiëne.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "nl") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Behalve dat hij de koning van het noorden is, is John Snow een Engelse arts en een leider in de ontwikkeling van anesthesie en medische hygiëne.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Behalve dat hij de koning van het noorden is, is John Snow een Engelse arts en een leider in de ontwikkeling van anesthesie en medische hygiëne."""] lemma_df = nlu.load('nl.lemma').predict(text, output_level='token') lemma_df.lemma.values[0] ```
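Conceptually, this kind of lemmatizer is a context-aware mapping from surface forms to roots. Stripped of the context handling, the core lookup reduces to a dictionary — a toy sketch with a few invented Dutch entries, not the model's actual data:

```python
# Hypothetical form → lemma entries, for illustration only.
lemma_dict = {
    "is": "zijn",
    "koning": "koning",
    "medische": "medisch",
}

def lemmatize(tokens):
    """Look each token up; fall back to the lowercased surface form
    when a token is out of vocabulary."""
    return [lemma_dict.get(t.lower(), t.lower()) for t in tokens]

print(lemmatize(["Is", "koning"]))  # → ['zijn', 'koning']
```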
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=6, result='behalve', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=8, end=10, result='dat', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=12, end=14, result='hij', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=16, end=17, result='de', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=19, end=24, result='koning', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|nl| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_radhakri119 TFWav2Vec2ForCTC from radhakri119 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_radhakri119 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_radhakri119` is an English model originally trained by radhakri119.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_by_radhakri119_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_radhakri119_en_4.2.0_3.0_1664101814398.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_radhakri119_en_4.2.0_3.0_1664101814398.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_radhakri119', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_radhakri119", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_radhakri119| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|354.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Modern Greek (1453-) asr_wav2vec2_large_xlsr_53_greek_by_vasilis TFWav2Vec2ForCTC from vasilis author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_greek_by_vasilis date: 2022-09-25 tags: [wav2vec2, el, audio, open_source, asr] task: Automatic Speech Recognition language: el edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_greek_by_vasilis` is a Modern Greek (1453-) model originally trained by vasilis. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_greek_by_vasilis_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_greek_by_vasilis_el_4.2.0_3.0_1664105705977.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_greek_by_vasilis_el_4.2.0_3.0_1664105705977.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_greek_by_vasilis", "el")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_greek_by_vasilis", "el") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_greek_by_vasilis| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|el| |Size:|1.2 GB| --- layout: model title: Legal Replacement of Lenders Clause Binary Classifier author: John Snow Labs name: legclf_replacement_of_lenders_clause date: 2022-12-07 tags: [en, legal, classification, clauses, replacement_of_lenders, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `replacement-of-lenders` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
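The paragraph-splitting option mentioned above (splitting by multiline) can be sketched in plain Python. This is a minimal illustration only; the workshop tutorial linked above covers the Spark-native approaches:

```python
import re

def split_paragraphs(document: str):
    """Split a document into candidate clause texts on blank lines,
    dropping empty fragments. Each piece can then be classified as its
    own row instead of classifying the whole document at once."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

# Hypothetical two-clause document used only for illustration.
doc = "Clause 1. The Lender may be replaced...\n\nClause 2. Governing law..."
print(split_paragraphs(doc))
# ['Clause 1. The Lender may be replaced...', 'Clause 2. Governing law...']
```

Each returned paragraph stays well under the 512-token embedding limit far more often than the full document does, which is the point of splitting before classification.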
## Predicted Entities `replacement-of-lenders`, `other` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_replacement_of_lenders_clause_en_1.0.0_3.0_1670445171577.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_replacement_of_lenders_clause_en_1.0.0_3.0_1670445171577.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_replacement_of_lenders_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +------------------------+ |result | +------------------------+ |[replacement-of-lenders]| |[other] | |[other] | |[replacement-of-lenders]| +------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_replacement_of_lenders_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 0.95 0.97 73 replacement-of-lenders 0.87 1.00 0.93 26 accuracy - - 0.96 99 macro-avg 0.93 0.97 0.95 99 weighted-avg 0.96 0.96 0.96 99 ``` --- layout: model title: English image_classifier_vit_base_movie_scenes_v1 ViTForImageClassification from dingusagar author: John Snow Labs name: image_classifier_vit_base_movie_scenes_v1 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_movie_scenes_v1` is an English model originally trained by dingusagar. ## Predicted Entities `batman movie scenes`, `harry potter movie scenes` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_movie_scenes_v1_en_4.1.0_3.0_1660172228231.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_movie_scenes_v1_en_4.1.0_3.0_1660172228231.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_movie_scenes_v1", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_movie_scenes_v1", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_movie_scenes_v1| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English image_classifier_vit_PANDA_ViT ViTForImageClassification from smc author: John Snow Labs name: image_classifier_vit_PANDA_ViT date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_PANDA_ViT` is an English model originally trained by smc. ## Predicted Entities `Beningn`, `ISUP 5`, `ISUP 1`, `ISUP 2`, `ISUP 4`, `ISUP 3` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_PANDA_ViT_en_4.1.0_3.0_1660167599872.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_PANDA_ViT_en_4.1.0_3.0_1660167599872.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_PANDA_ViT", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_PANDA_ViT", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_PANDA_ViT| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Fast Neural Machine Translation Model from English to Umbundu author: John Snow Labs name: opus_mt_en_umb date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, umb, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. - source languages: `en` - target languages: `umb` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_umb_xx_2.7.0_2.4_1609163118124.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_umb_xx_2.7.0_2.4_1609163118124.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_umb", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_umb", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.umb').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_umb| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Financial NER (md, Medium) author: John Snow Labs name: finner_financial_medium date: 2022-10-19 tags: [en, finance, ner, annual, reports, 10k, filings, licensed] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: FinanceNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a `md` (medium) version of a financial model, trained with more generic labels than the other versions of the model (`lg`, ...) you can find in Models Hub. Please note that this model requires some tokenization configuration to extract the currency (see the Python snippet below). The aim of this model is to detect the main pieces of financial information in annual reports of companies; more specifically, this model was trained on 10-K filings.
The currently available entities are: - AMOUNT: Numeric amounts, not percentages - PERCENTAGE: Numeric amounts which are percentages - CURRENCY: The currency of the amount - FISCAL_YEAR: A date which expresses which month the fiscal year was closed for a specific year - DATE: Generic dates in contexts where either it's not a fiscal year or it can't be asserted as such given the context - PROFIT: Profit, or also revenue - PROFIT_INCREASE: A piece of information saying there was a profit / revenue increase in that fiscal year - PROFIT_DECLINE: A piece of information saying there was a profit / revenue decrease in that fiscal year - EXPENSE: An expense or loss - EXPENSE_INCREASE: A piece of information saying there was an expense increase in that fiscal year - EXPENSE_DECREASE: A piece of information saying there was an expense decrease in that fiscal year - CF: Cash flow operations - CF_INCREASE: A piece of information saying there was a cash flow increase - CF_DECREASE: A piece of information saying there was a cash flow decrease - LIABILITY: A liability mentioned in the text You can also check the Relation Extraction model, which connects these entities together. ## Predicted Entities `AMOUNT`, `CURRENCY`, `DATE`, `FISCAL_YEAR`, `CF`, `PERCENTAGE`, `LIABILITY`, `EXPENSE`, `EXPENSE_INCREASE`, `EXPENSE_DECREASE`, `PROFIT`, `PROFIT_INCREASE`, `PROFIT_DECLINE`, `CF_INCREASE`, `CF_DECREASE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FINNER_FINANCIAL_10K/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_financial_medium_en_1.0.0_3.0_1666185075692.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_financial_medium_en_1.0.0_3.0_1666185075692.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token")\ .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$','€']) embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512) ner_model = finance.NerModel.pretrained("finner_financial_medium", "en", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""License fees revenue decreased 40 %, or $ 0.5 million to $ 0.7 million for the year ended December 31, 2020 compared to $ 1.2 million for the year ended December 31, 2019. Services revenue increased 4 %, or $ 1.1 million, to $ 25.6 million for the year ended December 31, 2020 from $ 24.5 million for the year ended December 31, 2019. Costs of revenue, excluding depreciation and amortization increased by $ 0.1 million, or 2 %, to $ 8.8 million for the year ended December 31, 2020 from $ 8.7 million for the year ended December 31, 2019. The increase was primarily related to increase in internal staff costs of $ 1.1 million as we increased delivery staff and work performed on internal projects, partially offset by a decrease in third party consultant costs of $ 0.6 million as these were converted to internal staff or terminated. 
Also, a decrease in travel costs of $ 0.4 million due to travel restrictions caused by the global pandemic. As a percentage of revenue, cost of revenue, excluding depreciation and amortization was 34 % for each of the years ended December 31, 2020 and 2019. Sales and marketing expenses decreased 20 %, or $ 1.5 million, to $ 6.0 million for the year ended December 31, 2020 from $ 7.5 million for the year ended December 31, 2019."""]]).toDF("text") from pyspark.sql import functions as F model = pipeline.fit(data) result = model.transform(data) result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \ .select(F.expr("cols['0']").alias("text"), F.expr("cols['1']['entity']").alias("label")).show(200, truncate = False) ```
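Downstream of the NER output, adjacent CURRENCY and AMOUNT chunks often need to be stitched back together into full monetary expressions. A minimal post-processing sketch in plain Python; the chunk list below is a hypothetical excerpt in the same (text, label) shape as the Results section shows:

```python
def pair_currency_amounts(chunks):
    """Merge each CURRENCY chunk with the AMOUNT chunk that follows it,
    e.g. ('$', 'CURRENCY') + ('0.5 million', 'AMOUNT') -> '$ 0.5 million'."""
    paired, i = [], 0
    while i < len(chunks):
        text, label = chunks[i]
        if label == "CURRENCY" and i + 1 < len(chunks) and chunks[i + 1][1] == "AMOUNT":
            paired.append(f"{text} {chunks[i + 1][0]}")
            i += 2  # consume both chunks of the pair
        else:
            i += 1
    return paired

# Hypothetical excerpt of (text, label) pairs from the model's output.
chunks = [("40", "PERCENTAGE"), ("$", "CURRENCY"), ("0.5 million", "AMOUNT"),
          ("$", "CURRENCY"), ("0.7 million", "AMOUNT")]
print(pair_currency_amounts(chunks))  # ['$ 0.5 million', '$ 0.7 million']
```

This works because the tokenizer's `setContextChars` configuration splits `$` and `€` off the numbers, so currency and amount surface as separate, adjacent chunks.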
## Results ```bash +---------------------------------------------------------+----------------+ |text |label | +---------------------------------------------------------+----------------+ |License fees revenue |PROFIT_DECLINE | |40 |PERCENTAGE | |$ |CURRENCY | |0.5 million |AMOUNT | |$ |CURRENCY | |0.7 million |AMOUNT | |December 31, 2020 |FISCAL_YEAR | |$ |CURRENCY | |1.2 million |AMOUNT | |December 31, 2019 |FISCAL_YEAR | |Services revenue |PROFIT_INCREASE | |4 |PERCENTAGE | |$ |CURRENCY | |1.1 million |AMOUNT | |$ |CURRENCY | |25.6 million |AMOUNT | |December 31, 2020 |FISCAL_YEAR | |$ |CURRENCY | |24.5 million |AMOUNT | |December 31, 2019 |FISCAL_YEAR | |Costs of revenue, excluding depreciation and amortization|EXPENSE_INCREASE| |$ |CURRENCY | |0.1 million |AMOUNT | |2 |PERCENTAGE | |$ |CURRENCY | |8.8 million |AMOUNT | |December 31, 2020 |FISCAL_YEAR | |$ |CURRENCY | |8.7 million |AMOUNT | |December 31, 2019 |FISCAL_YEAR | |internal staff costs |EXPENSE_INCREASE| |$ |CURRENCY | |1.1 million |AMOUNT | |third party consultant costs |EXPENSE_DECREASE| |$ |CURRENCY | |0.6 million |AMOUNT | |travel costs |EXPENSE_DECREASE| |$ |CURRENCY | |0.4 million |AMOUNT | |cost of revenue, excluding depreciation and amortization |EXPENSE | |34 |PERCENTAGE | |December 31, 2020 |FISCAL_YEAR | |2019 |DATE | |Sales and marketing expenses |EXPENSE_DECREASE| |20 |PERCENTAGE | |$ |CURRENCY | |1.5 million |AMOUNT | |$ |CURRENCY | |6.0 million |AMOUNT | |December 31, 2020 |FISCAL_YEAR | |$ |CURRENCY | |7.5 million |AMOUNT | |December 31, 2019 |FISCAL_YEAR | +---------------------------------------------------------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_financial_medium| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.5 MB| ## References Manual annotations on 10-K Filings ## Benchmarking ```bash label 
tp fp fn prec rec f1 I-AMOUNT 293 1 1 0.99659866 0.99659866 0.99659866 B-AMOUNT 412 1 2 0.9975787 0.9951691 0.9963724 B-DATE 350 20 19 0.9459459 0.9485095 0.947226 I-LIABILITY 83 16 33 0.83838385 0.7155172 0.772093 I-DATE 281 21 43 0.93046355 0.86728394 0.89776355 B-CF_DECREASE 2 0 4 1.0 0.33333334 0.5 I-EXPENSE 54 22 24 0.7105263 0.6923077 0.7012987 B-LIABILITY 41 14 17 0.74545455 0.70689654 0.7256637 I-CF 219 74 34 0.7474403 0.8656126 0.8021978 I-CF_DECREASE 3 0 11 1.0 0.21428572 0.3529412 B-PROFIT_INCREASE 18 2 0 0.9 1.0 0.9473684 B-EXPENSE 27 13 15 0.675 0.64285713 0.6585366 I-CF_INCREASE 14 11 6 0.56 0.7 0.6222222 I-PROFIT_DECLINE 9 2 5 0.8181818 0.64285713 0.72 B-CF_INCREASE 6 4 2 0.6 0.75 0.6666667 I-PROFIT 36 9 11 0.8 0.7659575 0.7826087 B-CURRENCY 411 0 0 1.0 1.0 1.0 I-PROFIT_INCREASE 41 2 0 0.95348835 1.0 0.97619045 B-CF 68 26 22 0.7234042 0.75555557 0.73913044 B-PROFIT 22 6 8 0.78571427 0.73333335 0.7586207 B-PERCENTAGE 83 1 0 0.9880952 1.0 0.99401194 I-FISCAL_YEAR 402 34 19 0.92201835 0.95486933 0.93815637 B-PROFIT_DECLINE 8 3 2 0.72727275 0.8 0.76190484 B-EXPENSE_INCREASE 39 9 8 0.8125 0.82978725 0.8210527 B-EXPENSE_DECREASE 25 2 4 0.9259259 0.86206895 0.89285713 B-FISCAL_YEAR 134 13 6 0.91156465 0.95714283 0.93379796 I-EXPENSE_DECREASE 43 6 6 0.877551 0.877551 0.877551 I-EXPENSE_INCREASE 83 9 11 0.90217394 0.88297874 0.89247316 Macro-average 3207 321 313 0.84983146 0.8032311 0.8258744 Micro-average 3207 321 313 0.9090136 0.9110795 0.9100454 ``` --- layout: model title: Legal Swiss Judgements Classification (English) author: John Snow Labs name: legclf_bert_swiss_judgements date: 2022-10-27 tags: [en, legal, licensed, sequence_classification] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Bert-based model that can be used to classify Swiss Judgement documents into the following 6 
classes according to their case area. It has been trained with a SOTA approach. ## Predicted Entities `public law`, `civil law`, `insurance law`, `social law`, `penal law`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_bert_swiss_judgements_en_1.0.0_3.0_1666864758313.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_bert_swiss_judgements_en_1.0.0_3.0_1666864758313.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') tokenizer = nlp.Tokenizer()\ .setInputCols(['document'])\ .setOutputCol('token') clf_model = legal.BertForSequenceClassification.pretrained('legclf_bert_swiss_judgements', 'en', 'legal/models')\ .setInputCols(['document', 'token'])\ .setOutputCol('class')\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) clf_pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, clf_model ]) data = spark.createDataFrame([["""Facts of fact: A. The Canton Police arrested X._ on 2. January 2007 due to suspicion of having committed an intrusive bull. In the trial of the trial 3. In January 2007, he agreed to have, together with a complicient, carried out a rubbish steel in a Jeans store in the fountain. After that, the investigative judge opened to him orally, he took him into investigative detention for the risk of collusion and continuation. X._ renounced a written and justified order, but desired a review of the investigation by the president of the Canton Court. by 4. In January 2007, the investigative judge submitted the documents to the president of the Canton Court with the request to withdraw the complaint and maintain the investigative detention. X._ requested to withdraw the investigative detention and immediately release him into freedom. He may be released under conditions or conditions. At its disposal of 5. In January 2007, the president of the Canton Court stated that the urgent offence was suspected in relation to the authorized invasion of the Jeans business and other invasions already occurred during a previous imprisonment. The risk of collusion is not accepted, but the recurrence forecast is extremely disadvantaged, therefore there is a risk of continuation. This is the request of the investigative judge - this is according to the instructions of 23. 
May 2006 (GG 2006 2; www.kgsz.ch) was not authorized to order investigative detention - to carry out and to confirm the investigative detention. At its disposal of 5. In January 2007, the president of the Canton Court stated that the urgent offence was suspected in relation to the authorized invasion of the Jeans business and other invasions already occurred during a previous imprisonment. The risk of collusion is not accepted, but the recurrence forecast is extremely disadvantaged, therefore there is a risk of continuation. This is the request of the investigative judge - this is according to the instructions of 23. May 2006 (GG 2006 2; www.kgsz.ch) was not authorized to order investigative detention - to carry out and to confirm the investigative detention. B. With complaint in criminal cases of 5. February 2007 requested X._: 1. It should be noted that the order GP 2007 3 of the Canton Court President of the Canton of Schwyz of 5. January 2007 is invalid and the complainant must be immediately released from prison. 2nd Eventually the order GP 2007 3 of the Canton Court President of the Canton of Schwyz of 5. January 2007 shall be repealed and the complainant shall be immediately released from investigative detention. and 3. Subeventual is the complainant due to the violation of the cantonal Swiss law by the instructions of the Canton Court of Schwyz of 23. May 2006 immediately released from the detention. Fourth All under cost and compensation consequences at the expense of the complainant.” Fourth All under cost and compensation consequences at the expense of the complainant.” C. 
The investigative judge requires in his judgment that “there must be established that the investigative detention was ordered by the investigative authority in accordance with the law and that the appeal submitted by the Court of Appeal with the approval of the request for responsibility and the confirmation of the investigative detention (Decree of the President of the Canton Court of 5 January 2007) has been legally rejected.” Insofar as X._ requires his immediate release, the complaint must be rejected. The President of the Canton Court asks to reject the complaint insofar as it is necessary. X._ requires unpaid legal assistance and defence and completes in its response to the complaint."""]]).toDF("text") result = clf_pipeline.fit(data).transform(data) ```
## Results ```bash +----------------------------------------------------------------------------------------------------+----------+ | document| class| +----------------------------------------------------------------------------------------------------+----------+ |Facts of fact: A. The Canton Police arrested X._ on 2. January 2007 due to suspicion of having co...|public law| +----------------------------------------------------------------------------------------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_bert_swiss_judgements| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|405.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References Training data is available [here](https://zenodo.org/record/7109926#.Y1gJwexBw8E). ## Benchmarking ```bash | label | precision | recall | f1-score | support | |---------------|-----------|--------|----------|---------| | civil-law | 0.97 | 0.96 | 0.96 | 1189 | | insurance-law | 0.95 | 0.98 | 0.96 | 1081 | | other | 0.92 | 0.90 | 0.91 | 40 | | penal-law | 0.97 | 0.94 | 0.96 | 1140 | | public-law | 0.94 | 0.97 | 0.95 | 1551 | | social-law | 0.98 | 0.94 | 0.96 | 970 | | accuracy | - | - | 0.96 | 5971 | | macro-avg | 0.95 | 0.95 | 0.95 | 5971 | | weighted-avg | 0.96 | 0.96 | 0.96 | 5971 | ``` --- layout: model title: English RobertaForQuestionAnswering Cased model (from sunitha) author: John Snow Labs name: roberta_qa_custom_squad_ds date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to
provide scalability and production-readiness using Spark NLP. `Roberta_Custom_Squad_DS` is an English model originally trained by `sunitha`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_custom_squad_ds_en_4.3.0_3.0_1674208785165.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_custom_squad_ds_en_4.3.0_3.0_1674208785165.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_custom_squad_ds","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_custom_squad_ds","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_custom_squad_ds|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|464.4 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/sunitha/Roberta_Custom_Squad_DS

---
layout: model
title: Legal Executive Power And Public Service Document Classifier (EURLEX)
author: John Snow Labs
name: legclf_executive_power_and_public_service_bert
date: 2023-03-06
tags: [en, legal, classification, clauses, executive_power_and_public_service, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office.

Given a document, the legclf_executive_power_and_public_service_bert model, a BERT Sentence Embeddings Document Classifier, determines whether the document belongs to the Executive_Power_and_Public_Service class or not (binary classification) according to EuroVoc labels.

## Predicted Entities

`Executive_Power_and_Public_Service`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_executive_power_and_public_service_bert_en_1.0.0_3.0_1678111651127.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_executive_power_and_public_service_bert_en_1.0.0_3.0_1678111651127.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_executive_power_and_public_service_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-------+
|result|
+-------+
|[Executive_Power_and_Public_Service]|
|[Other]|
|[Other]|
|[Executive_Power_and_Public_Service]|
+-------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_executive_power_and_public_service_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.7 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
label precision recall f1-score support
Executive_Power_and_Public_Service 0.84 0.83 0.84 159
Other 0.84 0.85 0.84 166
accuracy - - 0.84 325
macro-avg 0.84 0.84 0.84 325
weighted-avg 0.84 0.84 0.84 325
```

---
layout: model
title: Detect Posology concepts (ner_posology_healthcare)
author: John Snow Labs
name: ner_posology_healthcare
date: 2021-04-01
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Detect Drug, Dosage and administration instructions in text using a pretrained NER model.
## Predicted Entities `Drug`, `Duration`, `Strength`, `Form`, `Frequency`, `Dosage`, `Route` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_healthcare_en_3.0.0_3.0_1617260847574.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_healthcare_en_3.0.0_3.0_1617260847574.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_posology_healthcare", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("entities") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["The patient is a 40-year-old white male who presents with a chief complaint of 'chest pain'. The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that chest pain started yesterday evening. He has been advised Aspirin 81 milligrams QDay. insulin 50 units in a.m. HCTZ 50 mg QDay. 
Nitroglycerin 1/150 sublingually."]]).toDF("text"))
```
```scala
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_posology_healthcare", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("entities")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter))

val text = """The patient is a 40-year-old white male who presents with a chief complaint of 'chest pain'. The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that chest pain started yesterday evening. He has been advised Aspirin 81 milligrams QDay. insulin 50 units in a.m. HCTZ 50 mg QDay. Nitroglycerin 1/150 sublingually."""

val data = Seq(text).toDS.toDF("text")

val results = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.posology.healthcare").predict("""Put your text here.""")
```
## Results

```bash
+-------------+---------+
|chunk        |ner_label|
+-------------+---------+
|Aspirin      |Drug     |
|81 milligrams|Strength |
|QDay         |Frequency|
|insulin      |Drug     |
|50 units     |Dosage   |
|in a.m.      |Frequency|
|HCTZ         |Drug     |
|50 mg        |Strength |
|QDay         |Frequency|
|Nitroglycerin|Drug     |
|1/150        |Strength |
|sublingually.|Route    |
+-------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_posology_healthcare|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|

## Benchmarking

```bash
label tp fp fn total precision recall f1
DURATION 995.0 463.0 132.0 1127.0 0.6824 0.8829 0.7698
DRUG 4957.0 632.0 476.0 5433.0 0.8869 0.9124 0.8995
DOSAGE 539.0 183.0 380.0 919.0 0.7465 0.5865 0.6569
ROUTE 676.0 47.0 129.0 805.0 0.935 0.8398 0.8848
FREQUENCY 3688.0 675.0 313.0 4001.0 0.8453 0.9218 0.8819
FORM 1328.0 261.0 294.0 1622.0 0.8357 0.8187 0.8272
STRENGTH 5008.0 687.0 557.0 5565.0 0.8794 0.8999 0.8895
macro-avg - - - - - - 0.82994
micro-avg - - - - - - 0.86743
```

---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from NeuML)
author: John Snow Labs
name: t5_small_txtsql
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-txtsql` is an English model originally trained by `NeuML`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_txtsql_en_4.3.0_3.0_1675156028568.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_txtsql_en_4.3.0_3.0_1675156028568.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_small_txtsql","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_small_txtsql","en")
  .setInputCols("document")
  .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_small_txtsql|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|264.0 MB|

## References

- https://huggingface.co/NeuML/t5-small-txtsql
- https://github.com/neuml/txtai
- https://github.com/neuml/txtai/tree/master/models/txtsql

---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from prk)
author: John Snow Labs
name: roberta_qa_prk_base_squad2_finetuned_squad
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-finetuned-squad` is an English model originally trained by `prk`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_prk_base_squad2_finetuned_squad_en_4.3.0_3.0_1674219493892.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_prk_base_squad2_finetuned_squad_en_4.3.0_3.0_1674219493892.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_prk_base_squad2_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_prk_base_squad2_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
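For intuition, extractive QA models of this kind score every token as a candidate answer start and end, and the predicted answer is the highest-scoring valid (start, end) span within the context. The following toy sketch with made-up scores illustrates the idea; it is not the library's internal implementation.

```python
def best_span(start_scores, end_scores):
    """Pick the (start, end) token span maximizing start + end score,
    subject to start <= end (the standard extractive-QA decoding rule)."""
    best, span = float("-inf"), (0, 0)
    for s, s_score in enumerate(start_scores):
        for e in range(s, len(end_scores)):
            if s_score + end_scores[e] > best:
                best, span = s_score + end_scores[e], (s, e)
    return span

# Made-up scores for the context tokens below; "Clara" gets the highest
# start and end scores, so the decoded answer is the single token "Clara".
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 3.0, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]
end   = [0.1, 0.1, 0.1, 2.5, 0.1, 0.1, 0.1, 0.1, 0.9, 0.1]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))
```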
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_prk_base_squad2_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|464.3 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/prk/roberta-base-squad2-finetuned-squad

---
layout: model
title: English BertForQuestionAnswering model (from horsbug98)
author: John Snow Labs
name: bert_qa_Part_1_mBERT_Model_E2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Part_1_mBERT_Model_E2` is an English model originally trained by `horsbug98`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Part_1_mBERT_Model_E2_en_4.0.0_3.0_1654178923923.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Part_1_mBERT_Model_E2_en_4.0.0_3.0_1654178923923.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Part_1_mBERT_Model_E2","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
  .setInputCols("question", "context")
  .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
  .pretrained("bert_qa_Part_1_mBERT_Model_E2","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
  ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
  ("What's my name?", "My name is Clara and I live in Berkeley."))
  .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.multi_lingual_bert.by_horsbug98").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_Part_1_mBERT_Model_E2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/horsbug98/Part_1_mBERT_Model_E2 --- layout: model title: Extract Anatomical Entities from Oncology Texts author: John Snow Labs name: ner_oncology_anatomy_general_healthcare date: 2023-01-11 tags: [licensed, clinical, oncology, en, ner, anatomy] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts anatomical entities using an unspecific label. ## Predicted Entities `Anatomical_Site`, `Direction` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_general_healthcare_en_4.2.4_3.0_1673477824696.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_general_healthcare_en_4.2.4_3.0_1673477824696.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel()\ .pretrained("embeddings_healthcare_100d", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel\ .pretrained("ner_oncology_anatomy_general_healthcare", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel .pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel() .pretrained("embeddings_healthcare_100d", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_anatomy_general_healthcare", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", 
"token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_anatom_general_healthcare").predict("""The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.""") ```
## Results ```bash | chunk | ner_label | |:--------|:----------------| | left | Direction | | breast | Anatomical_Site | | lungs | Anatomical_Site | | liver | Anatomical_Site | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_anatomy_general_healthcare| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.0 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Anatomical_Site 1439 235 333 1772 0.86 0.81 0.84 Direction 434 92 65 499 0.83 0.87 0.85 macro-avg 1873 327 398 2271 0.84 0.84 0.84 micro-avg 1873 327 398 2271 0.85 0.82 0.84 ``` --- layout: model title: NER on Force Majeure Clauses author: John Snow Labs name: legner_force_majeure date: 2022-11-30 tags: [force, majeure, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model should be run on Force Majeure clauses. Use a Text Classifier to identify those clauses in your document, then run this NER on them - it will extract keywords related to Force Majeure exemptions. 
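The recommended two-stage workflow (classify clauses first, then run this NER only on the force majeure ones) can be sketched in plain Python. `classify_clause` and `extract_keywords` below are hypothetical placeholders standing in for the clause classifier and this NER model; they are not real library calls.

```python
def route_clauses(clauses, classify_clause, extract_keywords):
    """Run keyword extraction only on clauses the classifier flags."""
    return {clause: extract_keywords(clause)
            for clause in clauses
            if classify_clause(clause) == "force-majeure"}

# Toy stand-ins for the two models, for illustration only.
clauses = [
    "Force Majeure. Neither party shall be liable for delays caused by strikes.",
    "Governing Law. This Agreement is governed by the laws of Delaware.",
]
labels = {clauses[0]: "force-majeure", clauses[1]: "other"}
result = route_clauses(clauses, lambda c: labels[c], lambda c: ["strikes"])
print(result)
```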
## Predicted Entities

`O`, `FORCE_MAJEURE`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_force_majeure_en_1.0.0_3.0_1669802449878.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_force_majeure_en_1.0.0_3.0_1669802449878.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use

{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained('legner_force_majeure','en','legal/models')\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

text = ["""Force Majeure. In no event shall the Trustee be responsible or liable for any failure or delay in the performance of its obligations hereunder arising out of or caused by, directly or indirectly, forces beyond its control, including, without limitation, strikes, work stoppages, accidents, acts of war or terrorism, civil or military disturbances, nuclear or natural catastrophes or acts of God, and interruptions, loss or malfunctions of utilities, communications or computer (software and hardware) services; it being understood that the Trustee shall use reasonable efforts which are consistent with accepted practices in the banking industry to resume performance as soon as practicable under the circumstances."""]

res = model.transform(spark.createDataFrame([text]).toDF("text"))
```

## Results

```bash
+--------------+---------------+
|         token|      ner_label|
+--------------+---------------+
...
|             ,|              O|
|      directly|              O|
|            or|              O|
|    indirectly|              O|
|             ,|              O|
|        forces|              O|
|        beyond|              O|
|           its|              O|
|       control|              O|
|             ,|              O|
|     including|              O|
|             ,|              O|
|       without|              O|
|    limitation|              O|
|             ,|              O|
|       strikes|B-FORCE_MAJEURE|
|             ,|              O|
|          work|B-FORCE_MAJEURE|
|     stoppages|I-FORCE_MAJEURE|
|             ,|              O|
|     accidents|B-FORCE_MAJEURE|
|             ,|              O|
|          acts|B-FORCE_MAJEURE|
|            of|I-FORCE_MAJEURE|
|           war|I-FORCE_MAJEURE|
|            or|              O|
|     terrorism|B-FORCE_MAJEURE|
|             ,|              O|
|         civil|B-FORCE_MAJEURE|
|            or|              O|
|      military|B-FORCE_MAJEURE|
|  disturbances|I-FORCE_MAJEURE|
|             ,|              O|
|       nuclear|B-FORCE_MAJEURE|
|            or|              O|
|       natural|B-FORCE_MAJEURE|
|  catastrophes|I-FORCE_MAJEURE|
|            or|              O|
|          acts|B-FORCE_MAJEURE|
|            of|I-FORCE_MAJEURE|
|           God|I-FORCE_MAJEURE|
|             ,|              O|
|           and|              O|
| interruptions|B-FORCE_MAJEURE|
|             ,|              O|
|          loss|B-FORCE_MAJEURE|
|            or|              O|
|  malfunctions|B-FORCE_MAJEURE|
|            of|I-FORCE_MAJEURE|
|     utilities|I-FORCE_MAJEURE|
|             ,|              O|
|communications|B-FORCE_MAJEURE|
...
+--------------+---------------+
```
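The token-level BIO tags shown in the output above are what the pipeline's `NerConverter` stage merges into entity chunks. A minimal plain-Python sketch of that merging logic:

```python
def bio_to_chunks(tokens, tags):
    """Merge token-level BIO tags into (chunk_text, label) pairs:
    B- starts a new chunk, a matching I- extends it, O closes it."""
    chunks, cur = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur:
                chunks.append(tuple(cur))
            cur = [tok, tag[2:]]
        elif tag.startswith("I-") and cur and tag[2:] == cur[1]:
            cur[0] += " " + tok
        else:
            if cur:
                chunks.append(tuple(cur))
            cur = None
    if cur:
        chunks.append(tuple(cur))
    return chunks

toks = ["strikes", ",", "work", "stoppages", ",", "acts", "of", "war"]
tags = ["B-FORCE_MAJEURE", "O", "B-FORCE_MAJEURE", "I-FORCE_MAJEURE", "O",
        "B-FORCE_MAJEURE", "I-FORCE_MAJEURE", "I-FORCE_MAJEURE"]
print(bio_to_chunks(toks, tags))
```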
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legner_force_majeure|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|16.5 MB|

## References

In-house annotations on CUAD dataset

## Benchmarking

```bash
label tp fp fn prec rec f1
I-FORCE_MAJEURE 91 36 37 0.71653545 0.7109375 0.7137255
B-FORCE_MAJEURE 140 32 17 0.81395346 0.89171976 0.85106385
Macro-average 231 68 54 0.7652445 0.80132866 0.782871
Micro-average 231 68 54 0.77257526 0.8105263 0.7910959
```

---
layout: model
title: English T5ForConditionalGeneration Tiny Cased model (from google)
author: John Snow Labs
name: t5_efficient_tiny_el8
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-el8` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_el8_en_4.3.0_3.0_1675123395876.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_el8_en_4.3.0_3.0_1675123395876.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_tiny_el8","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_tiny_el8","en")
  .setInputCols("document")
  .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_tiny_el8|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|68.3 MB|

## References

- https://huggingface.co/google/t5-efficient-tiny-el8
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145

---
layout: model
title: Legal Defined Terms Clause Binary Classifier
author: John Snow Labs
name: legclf_defined_terms_clause
date: 2022-12-18
tags: [en, legal, classification, licensed, clause, bert, defined, terms, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the defined-terms clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.

If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`defined-terms`, `other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_defined_terms_clause_en_1.0.0_3.0_1671393651890.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_defined_terms_clause_en_1.0.0_3.0_1671393651890.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_defined_terms_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
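As noted in the description, long documents should be split into paragraph-sized pieces before classification. A minimal pure-Python sketch of the "paragraph splitting (by multiline)" technique — the helper name and the minimum-length filter are illustrative choices, not part of Spark NLP:

```python
import re

def split_paragraphs(text: str, min_chars: int = 30):
    """Split a document into paragraph candidates on blank lines.

    Illustrative helper: the card only names the technique; this is one
    simple way to do it. Very short fragments (headers, numbers) are
    dropped so each remaining piece carries enough context.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text)]
    return [p for p in paragraphs if len(p) >= min_chars]

doc = """1. DEFINED TERMS

"Agreement" means this Intellectual Property Agreement, dated as of
December 31, 2018, between the Parties identified below.

2. TERM"""

for paragraph in split_paragraphs(doc):
    print(paragraph)
```

Each surviving paragraph would then become one row of the `text` column fed to the pipeline above.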
## Results ```bash +-------+ |result| +-------+ |[defined-terms]| |[other]| |[other]| |[defined-terms]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_defined_terms_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support defined-terms 1.00 0.96 0.98 28 other 0.97 1.00 0.99 39 accuracy - - 0.99 67 macro-avg 0.99 0.98 0.98 67 weighted-avg 0.99 0.99 0.99 67 ```
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_dl2","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_dl2","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_dl2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|115.7 MB| ## References - https://huggingface.co/google/t5-efficient-small-dl2 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Portuguese BertForQuestionAnswering model (from pierreguillou) author: John Snow Labs name: bert_qa_bert_base_cased_squad_v1.1_portuguese date: 2022-06-02 tags: [pt, open_source, question_answering, bert] task: Question Answering language: pt edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased-squad-v1.1-portuguese` is a Portuguese model originally trained by `pierreguillou`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_cased_squad_v1.1_portuguese_pt_4.0.0_3.0_1654179838573.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_cased_squad_v1.1_portuguese_pt_4.0.0_3.0_1654179838573.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_cased_squad_v1.1_portuguese","pt") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_cased_squad_v1.1_portuguese","pt") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("pt.answer_question.squad.bert.base_cased.by_pierreguillou").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
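Under the hood, extractive QA models of this kind score every context token as a possible answer start and end, and the returned answer is the span maximizing the combined score. A toy pure-Python sketch of that span selection with made-up logits (illustrative values, not real model output):

```python
def best_span(start_logits, end_logits, max_len=10):
    """Pick (start, end) maximizing start_logits[s] + end_logits[e]
    subject to s <= e < s + max_len. Toy illustration of extractive
    QA span selection, not the library's implementation."""
    best, best_score = (0, 0), float("-inf")
    for s, sl in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = sl + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 3.0, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]
end   = [0.1, 0.1, 0.1, 2.5, 0.1, 0.1, 0.1, 0.1, 0.8, 0.1]
s, e = best_span(start, end)
print(" ".join(context[s:e + 1]))  # Clara
```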
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_cased_squad_v1.1_portuguese| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|pt| |Size:|406.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/pierreguillou/bert-base-cased-squad-v1.1-portuguese - https://colab.research.google.com/ - https://ailab.unb.br/ - https://medium.com/@pierre_guillou/nlp-modelo-de-question-answering-em-qualquer-idioma-baseado-no-bert-base-estudo-de-caso-em-12093d385e78 - https://www.linkedin.com/in/pierreguillou/ - https://medium.com/@pierre_guillou/nlp-modelo-de-question-answering-em-qualquer-idioma-baseado-no-bert-base-estudo-de-caso-em-12093d385e78#c572 - http://www.deeplearningbrasil.com.br/ - https://neuralmind.ai/ - https://colab.research.google.com/drive/18ueLdi_V321Gz37x4gHq8mb4XZSGWfZx?usp=sharing - https://github.com/piegu/language-models/blob/master/colab_question_answering_BERT_base_cased_squad_v11_pt.ipynb --- layout: model title: English DistilBertForQuestionAnswering Cased model (from Sounak) author: John Snow Labs name: distilbert_qa_sounak_finetuned date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-finetuned` is an English model originally trained by `Sounak`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_sounak_finetuned_en_4.3.0_3.0_1672774032617.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_sounak_finetuned_en_4.3.0_3.0_1672774032617.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sounak_finetuned","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sounak_finetuned","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_sounak_finetuned| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Sounak/distilbert-finetuned --- layout: model title: English asr_processor_with_lm TFWav2Vec2ForCTC from hf-internal-testing author: John Snow Labs name: asr_processor_with_lm date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_processor_with_lm` is an English model originally trained by hf-internal-testing. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_processor_with_lm_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_processor_with_lm_en_4.2.0_3.0_1664025168389.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_processor_with_lm_en_4.2.0_3.0_1664025168389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_processor_with_lm", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_processor_with_lm", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_processor_with_lm| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|453.4 KB| --- layout: model title: Legal NER (Parties, Dates, Document Type - md) author: John Snow Labs name: legner_contract_doc_parties_md date: 2022-12-01 tags: [en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description IMPORTANT: Don't run this model on the whole legal agreement. Instead: - Split by paragraphs. You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration; - Use the `legclf_introduction_clause` Text Classifier to select only these paragraphs; This is a Legal NER Model, aimed at processing the first page of the agreements, where information can be found about: - Parties of the contract/agreement; - Aliases of those parties, or how those parties will be called further on in the document; - Document Type; - Effective Date of the agreement; This model can be used together with its Relation Extraction model, called `legre_contract_doc_parties`, to retrieve the relations between these entities. Other models can be found to detect other parts of the document, such as Headers/Subheaders, Signers, "Will-do", etc. This is a `md` (medium) version of the model, trained with more data and more resistant to false positives outside the specific section, which may help when running it at whole-document level (although this is not recommended). 
## Predicted Entities `PARTY`, `EFFDATE`, `DOC`, `ALIAS` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/LEGALNER_PARTIES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_contract_doc_parties_md_en_1.0.0_3.0_1669892999925.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_contract_doc_parties_md_en_1.0.0_3.0_1669892999925.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained('legner_contract_doc_parties_md', 'en', 'legal/models')\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = [""" INTELLECTUAL PROPERTY AGREEMENT This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties"). """] res = model.transform(spark.createDataFrame([text]).toDF("text")) ```
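The `NerConverter` stage merges the model's token-level B-/I-/O tags into entity chunks. Conceptually it behaves like this pure-Python sketch (illustrative only; the real annotator also tracks character offsets and metadata):

```python
def bio_to_chunks(tokens, tags):
    """Group (token, BIO-tag) pairs into (chunk_text, label) spans.

    A B- tag opens a new chunk, I- tags extend it, and O closes it.
    Conceptual sketch of what NerConverter does, not its implementation.
    """
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" (or a stray I- tag) closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

# Token/tag pairs shaped like the model output shown in the results.
tokens = ["Armstrong", "Flooring", ",", "Inc", ".,", "a", "Delaware"]
tags   = ["B-PARTY", "I-PARTY", "I-PARTY", "I-PARTY", "O", "O", "O"]
print(bio_to_chunks(tokens, tags))  # [('Armstrong Flooring , Inc', 'PARTY')]
```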
## Results ```bash +------------+---------+ | token|ner_label| +------------+---------+ |INTELLECTUAL| B-DOC| | PROPERTY| I-DOC| | AGREEMENT| I-DOC| | This| O| |INTELLECTUAL| B-DOC| | PROPERTY| I-DOC| | AGREEMENT| I-DOC| | (| O| | this| O| | "| O| | Agreement| O| | "),| O| | dated| O| | as| O| | of| O| | December|B-EFFDATE| | 31|I-EFFDATE| | ,|I-EFFDATE| | 2018|I-EFFDATE| | (| O| | the| O| | "| O| | Effective| O| | Date| O| | ")| O| | is| O| | entered| O| | into| O| | by| O| | and| O| | between| O| | Armstrong| B-PARTY| | Flooring| I-PARTY| | ,| I-PARTY| | Inc| I-PARTY| | .,| O| | a| O| | Delaware| O| | corporation| O| | ("| O| | Seller| B-ALIAS| | ")| O| | and| O| | AFI| B-PARTY| | Licensing| I-PARTY| | LLC| I-PARTY| | ,| O| | a| O| | Delaware| O| | limited| O| | liability| O| | company| O| | ("| O| | Licensing| B-ALIAS| | "| O| | and| O| | together| O| | with| O| | Seller| B-ALIAS| | ,| O| | "| O| | Arizona| B-ALIAS| | ")| O| | and| O| | AHF| B-PARTY| | Holding| I-PARTY| | ,| I-PARTY| | Inc| I-PARTY| | .| O| | (| O| | formerly| O| | known| O| | as| O| | Tarzan| O| | HoldCo| O| | ,| O| | Inc| O| | .),| O| | a| O| | Delaware| O| | corporation| O| | ("| O| | Buyer| B-ALIAS| | ")| O| | and| O| | Armstrong| B-PARTY| | Hardwood| I-PARTY| | Flooring| I-PARTY| | Company| I-PARTY| ------------------------ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_contract_doc_parties_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.2 MB| ## References Manual annotations on CUAD dataset ## Benchmarking ```bash label tp fp fn prec rec f1 I-PARTY 413 57 64 0.8787234 0.8658281 0.8722281 B-EFFDATE 43 2 5 0.95555556 0.8958333 0.92473114 B-DOC 75 7 16 0.91463417 0.82417583 0.867052 I-EFFDATE 138 6 8 0.9583333 0.94520545 0.9517241 I-ALIAS 5 0 4 1.0 0.5555556 0.71428573 I-DOC 176 20 40 0.8979592 0.8148148 0.8543689 
I-FORMER_PARTY_NAME 2 0 0 1.0 1.0 1.0 B-PARTY 141 21 34 0.8703704 0.8057143 0.8367952 B-FORMER_PARTY_NAME 1 0 0 1.0 1.0 1.0 B-ALIAS 66 4 11 0.94285715 0.85714287 0.89795923 Macro-average 66 4 11 0.94184333 0.856427 0.8971066 Micro-average 66 4 11 0.9005947 0.85346216 0.87639517 ``` --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728` is a German model originally trained by jonatasgrosman. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728_de_4.2.0_3.0_1664115294282.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728_de_4.2.0_3.0_1664115294282.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728", lang = "de") val annotations = pipeline.transform(audioDF) ```
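The pipeline expects `audioDF` to contain raw audio as arrays of floats (Wav2Vec2 models are typically trained on 16 kHz mono audio). How you produce those floats is up to you; the sketch below uses only the standard library to decode 16-bit PCM WAV bytes, and the Spark DataFrame line at the end is a commented, hypothetical illustration:

```python
import io
import struct
import wave

def wav_to_floats(wav_bytes: bytes):
    """Decode mono 16-bit PCM WAV bytes into floats in [-1.0, 1.0].

    Illustrative helper, not part of Spark NLP: one stdlib-only way to
    turn a WAV file into the float array the pipeline consumes.
    """
    with wave.open(io.BytesIO(wav_bytes)) as wav:
        assert wav.getsampwidth() == 2 and wav.getnchannels() == 1
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a tiny in-memory WAV (100 samples of silence at 16 kHz) to demo.
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(struct.pack("<100h", *([0] * 100)))

floats = wav_to_floats(buf.getvalue())
print(len(floats))  # 100
# audioDF = spark.createDataFrame([(floats,)], ["audio_content"])  # hypothetical; then pipeline.transform(audioDF)
```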
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s728| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from huxxx657) author: John Snow Labs name: roberta_qa_base_finetuned_squad_2 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad-2` is an English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_squad_2_en_4.3.0_3.0_1674217595105.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_squad_2_en_4.3.0_3.0_1674217595105.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_squad_2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_squad_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_finetuned_squad_2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|438.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-squad-2 --- layout: model title: English DistilBertForTokenClassification Base Cased model (from elastic) author: John Snow Labs name: distilbert_token_classifier_base_cased_finetuned_conll03_english date: 2023-03-14 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-finetuned-conll03-english` is an English model originally trained by `elastic`. ## Predicted Entities `PER`, `ORG`, `MISC`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_cased_finetuned_conll03_english_en_4.3.1_3.0_1678783152051.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_cased_finetuned_conll03_english_en_4.3.1_3.0_1678783152051.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_cased_finetuned_conll03_english","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_cased_finetuned_conll03_english","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_base_cased_finetuned_conll03_english| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/elastic/distilbert-base-cased-finetuned-conll03-english - https://paperswithcode.com/sota?task=Token+Classification&dataset=conll2003 --- layout: model title: Sentence Entity Resolver for CPT (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_cpt date: 2021-05-16 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to CPT codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings, and loads about 6X faster than previous versions. The load process is also more memory-friendly: the maximum memory required during loading is smaller, which reduces the chance of OOM exceptions and relaxes hardware requirements. ## Predicted Entities Predicts CPT codes and their descriptions. 
{:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cpt_en_3.0.4_3.0_1621189492240.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cpt_en_3.0.4_3.0_1621189492240.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") chunk2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") cpt_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_cpt","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, cpt_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . 
Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("sbert_embeddings") val cpt_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_cpt","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, cpt_resolver)) val data = Seq("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented 
to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve").predict("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""") ```
## Results ```bash +--------------------+-----+---+---------+-----+----------+--------------------+--------------------+ | chunk|begin|end| entity| code|confidence| all_k_resolutions| all_k_codes| +--------------------+-----+---+---------+-----+----------+--------------------+--------------------+ | hypertension| 68| 79| PROBLEM|49425| 0.0967|Insertion of peri...|49425:::36818:::3...| |chronic renal ins...| 83|109| PROBLEM|50070| 0.2569|Nephrolithotomy; ...|50070:::49425:::5...| | COPD| 113|116| PROBLEM|49425| 0.0779|Insertion of peri...|49425:::31592:::4...| | gastritis| 120|128| PROBLEM|43810| 0.5289|Gastroduodenostom...|43810:::43880:::4...| | TIA| 136|138| PROBLEM|25927| 0.2060|Transmetacarpal a...|25927:::25931:::6...| |a non-ST elevatio...| 182|202| PROBLEM|33300| 0.3046|Repair of cardiac...|33300:::33813:::3...| |Guaiac positive s...| 208|229| PROBLEM|47765| 0.0974|Anastomosis, of i...|47765:::49425:::1...| |cardiac catheteri...| 295|317| TEST|62225| 0.1996|Replacement or ir...|62225:::33722:::4...| | PTCA| 324|327|TREATMENT|60500| 0.1481|Parathyroidectomy...|60500:::43800:::2...| | mid LAD lesion| 332|345| PROBLEM|33722| 0.3097|Closure of aortic...|33722:::33732:::3...| +--------------------+-----+---+---------+-----+----------+--------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_cpt| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk, sbert_embeddings]| |Output Labels:|[cpt_code]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on Current Procedural Terminology dataset with ``sbiobert_base_cased_mli`` sentence embeddings. 
---
layout: model
title: Fast Neural Machine Translation Model from English to Esperanto
author: John Snow Labs
name: opus_mt_en_eo
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, eo, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

- source languages: `en`
- target languages: `eo`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_eo_xx_2.7.0_2.4_1609163285697.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_eo_xx_2.7.0_2.4_1609163285697.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_eo", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_eo", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.eo').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_en_eo|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: German Named Entity Recognition (from severinsimmler)
author: John Snow Labs
name: bert_ner_literary_german_bert
date: 2022-05-09
tags: [bert, ner, token_classification, de, open_source]
task: Named Entity Recognition
language: de
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `literary-german-bert` is a German model originally trained by `severinsimmler`.

## Predicted Entities

`PER`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_literary_german_bert_de_3.4.2_3.0_1652099383832.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_literary_german_bert_de_3.4.2_3.0_1652099383832.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_literary_german_bert","de") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_literary_german_bert","de") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_ner_literary_german_bert|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|410.4 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/severinsimmler/literary-german-bert
- https://figshare.com/articles/Corpus_of_German-Language_Fiction_txt_/4524680/1
- https://gitlab2.informatik.uni-wuerzburg.de/kallimachos/DROC-Release
- https://opus.bibliothek.uni-wuerzburg.de/opus4-wuerzburg/frontdoor/deliver/index/docId/14333/file/Jannidis_Figurenerkennung_Roman.pdf
- http://webdoc.sub.gwdg.de/pub/mon/dariah-de/dwp-2018-27.pdf

---
layout: model
title: English RobertaForQuestionAnswering (from vesteinn)
author: John Snow Labs
name: roberta_qa_IceBERT_QA
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `IceBERT-QA` is an English model originally trained by `vesteinn`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_IceBERT_QA_en_4.0.0_3.0_1655726743383.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_IceBERT_QA_en_4.0.0_3.0_1655726743383.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_IceBERT_QA","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = RoBertaForQuestionAnswering
    .pretrained("roberta_qa_IceBERT_QA","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.roberta.by_vesteinn").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_IceBERT_QA|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|463.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/vesteinn/IceBERT-QA

---
layout: model
title: Legal Headings Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_headings_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, headings, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

This model is a Binary Classifier (True, False) for the `Headings` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip them unless you want to do binary classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`Headings`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_headings_bert_en_1.0.0_3.0_1678050671024.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_headings_bert_en_1.0.0_3.0_1678050671024.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_headings_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
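Before building `df` above, long documents usually need to be split into provisions. A minimal plain-Python sketch of the paragraph splitting (by multiline) recommended in the description — the regex and the sample contract text here are illustrative assumptions, not part of the model:

```python
import re

def split_provisions(text):
    """Split a contract into provisions on blank lines (multiline split)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Hypothetical two-provision contract for illustration.
contract = (
    "1. HEADINGS. The headings in this Agreement are for convenience only.\n\n"
    "2. NOTICES. All notices under this Agreement shall be in writing."
)

provisions = split_provisions(contract)
# Each provision then becomes one row for the classifier, e.g.:
# df = spark.createDataFrame([[p] for p in provisions]).toDF("text")
```

Each resulting provision carries enough context for the binary classifier while staying under the 512-token embedding limit.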
## Results

```bash
+----------+
|result    |
+----------+
|[Headings]|
|[Other]   |
|[Other]   |
|[Headings]|
+----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_headings_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
       label  precision  recall  f1-score  support
    Headings       1.00    0.98      0.99       81
       Other       0.98    1.00      0.99      117
    accuracy          -       -      0.99      198
   macro-avg       0.99    0.99      0.99      198
weighted-avg       0.99    0.99      0.99      198
```

---
layout: model
title: Match Pattern
author: ahmedlone127
name: match_pattern
date: 2022-06-14
tags: [en, open_source]
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: false
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The match_pattern pipeline is a pretrained pipeline that we can use to process text: it performs basic processing steps and matches patterns. It covers most of the common text processing tasks on your dataframe.

{:.btn-box}
[Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/match_pattern_en_4.0.0_3.0_1655211298298.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/match_pattern_en_4.0.0_3.0_1655211298298.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("match_pattern", "en")

result = pipeline.annotate("""I love johnsnowlabs!""")
```
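For intuition, the pattern-matching stage can be pictured as named regex rules applied to the input text. This is only a toy sketch — the rules below are invented for illustration and are not the pipeline's actual rule set:

```python
import re

# Hypothetical named rules; RegexMatcherModel ships with its own rules.
rules = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "year": r"\b(?:19|20)\d{2}\b",
}

text = "Contact support@johnsnowlabs.com before 2023."

# Apply every rule to the text and collect the matched spans per rule name.
matches = {name: re.findall(pattern, text) for name, pattern in rules.items()}
# matches == {"email": ["support@johnsnowlabs.com"], "year": ["2023"]}
```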
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|match_pattern|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Community|
|Language:|en|
|Size:|29.0 KB|

## Included Models

- DocumentAssembler
- SentenceDetector
- TokenizerModel
- RegexMatcherModel

---
layout: model
title: Sentence Entity Resolver for LOINC (sbluebert_base_uncased_mli embeddings)
author: John Snow Labs
name: sbluebertresolve_loinc_uncased
date: 2022-01-18
tags: [loinc, licensed, clinical, entity_resolution, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.3.4
spark_version: 2.4
supported: true
annotator: SentenceEntityResolverModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model maps extracted clinical NER entities to LOINC codes using `sbluebert_base_uncased_mli` Sentence Bert Embeddings. It is trained on the augmented version of the uncased (lowercased) dataset used in previous LOINC resolver models.

## Predicted Entities

`LOINC Code`

{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbluebertresolve_loinc_uncased_en_3.3.4_2.4_1642535076764.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbluebertresolve_loinc_uncased_en_3.3.4_2.4_1642535076764.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols("document")\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical','en', 'clinical/models')\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(['Test']) chunk2doc = Chunk2Doc() \ .setInputCols("ner_chunk") \ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbluebert_base_uncased_mli", "en", "clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings")\ .setCaseSensitive(True) resolver = SentenceEntityResolverModel.pretrained("sbluebertresolve_loinc_uncased", "en", "clinical/models") \ .setInputCols(["sbert_embeddings"])\ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") pipeline_loinc = Pipeline(stages = [ documentAssembler, sentenceDetector, tokenizer, word_embeddings, ner, ner_converter, chunk2doc, sbert_embedder, resolver ]) test = """The patient is a 22-year-old female with a history of obesity. She has a BMI of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126. 
Her hgba1c is 8.2%."""

model = pipeline_loinc.fit(spark.createDataFrame([['']]).toDF("text"))

sparkDF = spark.createDataFrame([[test]]).toDF("text")

result = model.transform(sparkDF)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    .setWhiteList(Array("Test"))

val chunk2doc = new Chunk2Doc()
    .setInputCols("ner_chunk")
    .setOutputCol("ner_chunk_doc")

val sbert_embedder = BertSentenceEmbeddings.pretrained("sbluebert_base_uncased_mli", "en", "clinical/models")
    .setInputCols("ner_chunk_doc")
    .setOutputCol("sbert_embeddings")
    .setCaseSensitive(true)

val resolver = SentenceEntityResolverModel.pretrained("sbluebertresolve_loinc_uncased", "en", "clinical/models")
    .setInputCols(Array("sbert_embeddings"))
    .setOutputCol("resolution")
    .setDistanceFunction("EUCLIDEAN")

val pipeline_loinc = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, ner, ner_converter, chunk2doc, sbert_embedder, resolver))

val data = Seq("The patient is a 22-year-old female with a history of obesity. She has a BMI of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126.
Her hgba1c is 8.2%.").toDF("text")

val result = pipeline_loinc.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.loinc_uncased").predict("""The patient is a 22-year-old female with a history of obesity. She has a BMI of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126. Her hgba1c is 8.2%.""")
```
## Results ```bash +-------------------------------------+------+-----------+----------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ner_chunk|entity| resolution| all_codes| resolutions| +-------------------------------------+------+-----------+----------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | BMI| Test| 39156-5|[39156-5, LP35925-4, BDYCRC, 73964-9, 59574-4,...] |[Body mass index, Body mass index (BMI), Body circumference, Body muscle mass, Body mass index (BMI) [Percentile], ...] | | aspartate aminotransferase| Test| 14409-7|['14409-7', '16325-3', '1916-6', '16324-6',...] |['Aspartate aminotransferase', 'Alanine aminotransferase/Aspartate aminotransferase', 'Aspartate aminotransferase/Alanine aminotransferase', 'Alanine aminotransferase', ...] | | alanine aminotransferase| Test| 16324-6|['16324-6', '1916-6', '16325-3', '59245-1',...] |['Alanine aminotransferase', 'Aspartate aminotransferase/Alanine aminotransferase', 'Alanine aminotransferase/Aspartate aminotransferase', 'Alanine glyoxylate aminotransferase',...] | | hgba1c| Test| 41995-2|['41995-2', 'LP35944-5', 'LP19717-5', '43150-2',...]|['Hemoglobin A1c', 'HbA1c measurement device', 'HBA1 gene', 'HbA1c measurement device panel', ...] 
|
+-------------------------------------+------+-----------+----------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sbluebertresolve_loinc_uncased|
|Compatibility:|Healthcare NLP 3.3.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[loinc_code]|
|Language:|en|
|Size:|653.4 MB|
|Case sensitive:|false|
|Dependencies:|sbluebert_base_uncased_mli|

## Data Source

Trained on standard LOINC coding system.

---
layout: model
title: Smaller BERT Embeddings (L-12_H-768_A-12)
author: John Snow Labs
name: small_bert_L12_768
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [open_source, embeddings, en]
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L12_768_en_2.6.0_2.4_1598345548247.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L12_768_en_2.6.0_2.4_1598345548247.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L12_768", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L12_768", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L12_768').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash token en_embed_bert_small_L12_768_embeddings I [0.2770369052886963, 0.460345059633255, -0.236... love [-0.19471393525600433, 0.05537150800228119, -0... NLP [0.06311562657356262, 0.18206995725631714, -0.... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|small_bert_L12_768| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-768_A-12/1 --- layout: model title: ICD10CM Sentence Entity Resolver (Slim, normalized) author: John Snow Labs name: sbiobertresolve_icd10cm_slim_normalized date: 2021-05-17 tags: [licensed, clinical, en, entity_resolution] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts to ICD10 CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. In this model, synonyms having low cosine similarity to unnormalized terms are dropped, making the model slim. It also returns the official resolution text within the brackets inside the metadata ## Predicted Entities ICD10 CM Codes. In this model, synonyms having low cosine similarity to unnormalized terms are dropped . 
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_slim_normalized_en_3.0.4_3.0_1621286250442.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_slim_normalized_en_3.0.4_3.0_1621286250442.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sbert_embeddings")

icd10_resolver = SentenceEntityResolverModel\
    .pretrained("sbiobertresolve_icd10cm_slim_normalized","en", "clinical/models")\
    .setInputCols(["document", "sbert_embeddings"])\
    .setOutputCol("icd10cm_code")\
    .setDistanceFunction("EUCLIDEAN")\
    .setReturnCosineDistances(True)

bert_pipeline_icd = Pipeline(stages = [document_assembler, sbert_embedder, icd10_resolver])

data = spark.createDataFrame([["metastatic lung cancer"]]).toDF("text")

results = bert_pipeline_icd.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sbert_embedder = BertSentenceEmbeddings
    .pretrained("sbiobert_base_cased_mli","en","clinical/models")
    .setInputCols(Array("document"))
    .setOutputCol("sbert_embeddings")

val icd10_resolver = SentenceEntityResolverModel
    .pretrained("sbiobertresolve_icd10cm_slim_normalized","en", "clinical/models")
    .setInputCols(Array("document", "sbert_embeddings"))
    .setOutputCol("icd10cm_code")
    .setDistanceFunction("EUCLIDEAN")
    .setReturnCosineDistances(true)

val bert_pipeline_icd = new Pipeline().setStages(Array(document_assembler, sbert_embedder, icd10_resolver))

val data = Seq("metastatic lung cancer").toDF("text")

val result = bert_pipeline_icd.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.icd10cm.slim_normalized").predict("""metastatic lung cancer""")
```
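To give some intuition for the `EUCLIDEAN` distance function set above: the resolver ranks candidate codes by the distance between the chunk's sentence embedding and each candidate's embedding. A toy sketch follows — the 3-dimensional vectors and the candidate codes are invented for illustration; real sentence embeddings have hundreds of dimensions:

```python
import math

def euclidean(u, v):
    # Straight-line distance between two embedding vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy 3-d "embeddings" for an input chunk and two candidate ICD-10-CM codes.
chunk_embedding = [0.10, 0.90, 0.20]
candidates = {
    "C7800": [0.12, 0.88, 0.19],  # near the chunk embedding
    "C349":  [0.50, 0.10, 0.70],  # far from the chunk embedding
}

# Rank candidates by distance; the first entry plays the role of the top
# resolution (the `code` column in the results below).
ranked = sorted(candidates, key=lambda code: euclidean(chunk_embedding, candidates[code]))
```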
## Results ```bash | | chunks | code | resolutions | all_codes | billable_hcc_status_score | all_distances | |---:|:-----------------------|:-------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------|:----------------------------|:-------------------------------------------------------------------------------------------------------------------------| | 0 | metastatic lung cancer | C7800 | ['cancer metastatic to lung', 'metastasis from malignant tumor of lung', 'cancer metastatic to left lung', 'history of cancer metastatic to lung', 'metastatic cancer', 'history of cancer metastatic to lung (situation)', 'metastatic adenocarcinoma to bilateral lungs', 'cancer metastatic to chest wall', 'metastatic malignant neoplasm to left lower lobe of lung', 'metastatic carcinoid tumour', 'cancer metastatic to respiratory tract', 'metastatic carcinoid tumor'] | ['C7800', 'C349', 'C7801', 'Z858', 'C800', 'Z8511', 'C780', 'C798', 'C7802', 'C799', 'C7830', 'C7B00'] | ['1', '1', '8'] | ['0.0464', '0.0829', '0.0852', '0.0860', '0.0914', '0.0989', '0.1133', '0.1220', '0.1220', '0.1253', '0.1249', '0.1260'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icd10cm_slim_normalized| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Case sensitive:|false| --- layout: model title: Finnish 
asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_Finnish_NLP TFWav2Vec2ForCTC from Finnish-NLP author: John Snow Labs name: pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_Finnish_NLP date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_Finnish_NLP` is a Finnish model originally trained by Finnish-NLP. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_Finnish_NLP_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_Finnish_NLP_fi_4.2.0_3.0_1664039394189.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_Finnish_NLP_fi_4.2.0_3.0_1664039394189.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_Finnish_NLP', lang = 'fi') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_Finnish_NLP", lang = "fi") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_Finnish_NLP| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| |Size:|3.6 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_hady TFWav2Vec2ForCTC from hady author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab_by_hady date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_hady` is an English model originally trained by hady. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_by_hady_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_hady_en_4.2.0_3.0_1664039154195.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_hady_en_4.2.0_3.0_1664039154195.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_by_hady", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_by_hady", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
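The `audio_content` column consumed by `AudioAssembler` holds floating-point audio samples. Assuming your source is 16-bit little-endian PCM (an assumption — check your actual audio format), a minimal standard-library sketch of the conversion; the `pcm16_to_floats` helper name is ours, not part of the Spark NLP API:

```python
import struct

def pcm16_to_floats(raw_bytes):
    """Convert little-endian 16-bit PCM samples to floats in [-1.0, 1.0),
    the representation Wav2Vec2-style models expect for audio input."""
    n = len(raw_bytes) // 2
    samples = struct.unpack("<%dh" % n, raw_bytes[: 2 * n])
    return [s / 32768.0 for s in samples]

# Round-trip three known samples: silence, half amplitude, full negative.
raw = struct.pack("<3h", 0, 16384, -32768)
floats = pcm16_to_floats(raw)  # → [0.0, 0.5, -1.0]
```

The resulting list of floats is what you would place in the DataFrame column passed to `setInputCol("audio_content")`.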
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab_by_hady| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: English T5ForConditionalGeneration Base Cased model (from google) author: John Snow Labs name: t5_efficient_base_el8 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-el8` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_el8_en_4.3.0_3.0_1675111594266.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_el8_en_4.3.0_3.0_1675111594266.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_base_el8","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_base_el8","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_base_el8| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|420.4 MB| ## References - https://huggingface.co/google/t5-efficient-base-el8 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Translate English to Armenian Pipeline author: John Snow Labs name: translate_en_hy date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, hy, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module especially on larger sequence. The use of an accelerator such as GPU is recommended. 
- source languages: `en` - target languages: `hy` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_hy_xx_2.7.0_2.4_1609689354058.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_hy_xx_2.7.0_2.4_1609689354058.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_hy", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_hy", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.hy').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_hy| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering model (from HomayounSadri) author: John Snow Labs name: distilbert_qa_HomayounSadri_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `HomayounSadri`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_HomayounSadri_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724239567.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_HomayounSadri_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724239567.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_HomayounSadri_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_HomayounSadri_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_HomayounSadri").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
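In the nlu one-liner above, `|||` separates the question from its context inside a single string. A minimal sketch of that convention — the `split_qa` helper is our own illustration, not part of the nlu library:

```python
def split_qa(payload, sep="|||"):
    """Split an nlu-style 'question|||context' payload into its two parts."""
    question, _, context = payload.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa("What is my name?|||My name is Clara and I live in Berkeley.")
```

Keeping the separator out of your question and context text avoids ambiguous splits.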
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_HomayounSadri_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/HomayounSadri/distilbert-base-uncased-finetuned-squad --- layout: model title: Detect Living Species (w2v_cc_300d) author: John Snow Labs name: ner_living_species date: 2022-06-23 tags: [ca, ner, clinical, licensed] task: Named Entity Recognition language: ca edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract living species from clinical texts in Catalan which is critical to scientific disciplines like medicine, biology, ecology/biodiversity, nutrition and agriculture. This model is trained using `w2v_cc_300d` embeddings. It is trained on the [LivingNER](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) corpus that is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. **NOTE :** 1. The text files were translated from Spanish with a neural machine translation system. 2. The annotations were translated with the same neural machine translation system. 3. The translated annotations were transferred to the translated text files using an annotation transfer technology. 
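A simplified sketch of what such an annotation transfer can look like: after translating both the text and the entity mentions, character offsets are recovered by searching for each translated mention in the translated text. This is our own illustration of the general idea, not the actual system used to build the corpus:

```python
def transfer_annotations(translated_text, entity_mentions):
    """Recover character offsets for translated entity mentions by scanning
    left-to-right for each mention in the translated text."""
    spans = []
    cursor = 0
    for mention in entity_mentions:
        start = translated_text.find(mention, cursor)
        if start == -1:          # mention not found verbatim; skip it
            continue
        spans.append((mention, start, start + len(mention)))
        cursor = start + len(mention)
    return spans

spans = transfer_annotations("La pacient presenta febre.", ["pacient", "febre"])
```

Real annotation-transfer systems also handle reordering and non-verbatim matches; this sketch only covers the exact-match case.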
## Predicted Entities `HUMAN`, `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_ca_3.5.3_3.0_1655975861739.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_ca_3.5.3_3.0_1655975861739.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "ca")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_living_species", "ca", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""Dona de 47 anys al·lèrgica al iode, fumadora social, intervinguda de varices, dues cesàries i un abscés gluti. Sense altres antecedents mèdics d'interès ni tractament habitual. Viu amb el seu marit i tres fills, treballa com a professora. En el moment de la nostra valoració en la planta de Cirurgia General, la pacient presenta TA 69/40 mm Hg, freqüència cardíaca 120 lpm, taquipnea en repòs, pal·lidesa mucocutánea, mala perfusió distal i afligeix nàusees. L'abdomen és tou, no presenta peritonismo i el dèbit del drenatge abdominal roman sense canvis. Les serologies de Coxiella burnetii, Bartonella henselae, Borrelia burgdorferi, Entamoeba histolytica, Toxoplasma gondii, citomegalovirus, virus de Epstein Barr, virus varicel·la zoster i parvovirus B19 van ser negatives. 
No obstant això, es va detectar test de rosa de Bengala positiu per a Brucella, el test de Coombs i les aglutinacions també van ser positives amb un títol 1/40."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "ca") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_living_species", "ca", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""Dona de 47 anys al·lèrgica al iode, fumadora social, intervinguda de varices, dues cesàries i un abscés gluti. Sense altres antecedents mèdics d'interès ni tractament habitual. Viu amb el seu marit i tres fills, treballa com a professora. En el moment de la nostra valoració en la planta de Cirurgia General, la pacient presenta TA 69/40 mm Hg, freqüència cardíaca 120 lpm, taquipnea en repòs, pal·lidesa mucocutánea, mala perfusió distal i afligeix nàusees. L'abdomen és tou, no presenta peritonismo i el dèbit del drenatge abdominal roman sense canvis. Les serologies de Coxiella burnetii, Bartonella henselae, Borrelia burgdorferi, Entamoeba histolytica, Toxoplasma gondii, citomegalovirus, virus de Epstein Barr, virus varicel·la zoster i parvovirus B19 van ser negatives. 
No obstant això, es va detectar test de rosa de Bengala positiu per a Brucella, el test de Coombs i les aglutinacions també van ser positives amb un títol 1/40.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ca.med_ner.living_species").predict("""Dona de 47 anys al·lèrgica al iode, fumadora social, intervinguda de varices, dues cesàries i un abscés gluti. Sense altres antecedents mèdics d'interès ni tractament habitual. Viu amb el seu marit i tres fills, treballa com a professora. En el moment de la nostra valoració en la planta de Cirurgia General, la pacient presenta TA 69/40 mm Hg, freqüència cardíaca 120 lpm, taquipnea en repòs, pal·lidesa mucocutánea, mala perfusió distal i afligeix nàusees. L'abdomen és tou, no presenta peritonismo i el dèbit del drenatge abdominal roman sense canvis. Les serologies de Coxiella burnetii, Bartonella henselae, Borrelia burgdorferi, Entamoeba histolytica, Toxoplasma gondii, citomegalovirus, virus de Epstein Barr, virus varicel·la zoster i parvovirus B19 van ser negatives. No obstant això, es va detectar test de rosa de Bengala positiu per a Brucella, el test de Coombs i les aglutinacions també van ser positives amb un títol 1/40.""") ```
## Results ```bash +-----------------------+-------+ |ner_chunk |label | +-----------------------+-------+ |Dona |HUMAN | |marit |HUMAN | |fills |HUMAN | |professora |HUMAN | |pacient |HUMAN | |Coxiella burnetii |SPECIES| |Bartonella henselae |SPECIES| |Borrelia burgdorferi |SPECIES| |Entamoeba histolytica |SPECIES| |Toxoplasma gondii |SPECIES| |citomegalovirus |SPECIES| |virus de Epstein Barr |SPECIES| |virus varicel·la zoster|SPECIES| |parvovirus B19 |SPECIES| |Brucella |SPECIES| +-----------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ca| |Size:|15.1 MB| ## References [https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) ## Benchmarking ```bash label precision recall f1-score support B-HUMAN 0.88 0.97 0.92 3036 B-SPECIES 0.68 0.94 0.78 3354 I-HUMAN 0.87 0.64 0.74 195 I-SPECIES 0.75 0.83 0.79 1329 micro-avg 0.76 0.92 0.83 7914 macro-avg 0.79 0.84 0.81 7914 weighted-avg 0.77 0.92 0.84 7914 ``` --- layout: model title: Legal Licenses Clause Binary Classifier author: John Snow Labs name: legclf_licenses_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `licenses` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification as sentence level. 
If you have long legal documents and want to look for clauses, we recommend splitting them first using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If your texts are longer than that, consider splitting them into smaller pieces (you can also check the same tutorial linked above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `licenses` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_licenses_clause_en_1.0.0_3.2_1660123683498.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_licenses_clause_en_1.0.0_3.2_1660123683498.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_licenses_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
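The paragraph-splitting advice above can be sketched in plain Python before any Spark NLP pipeline is involved; the `max_chars` guard is our own rough stand-in for the 512-token embedding limit, not an exact token count:

```python
import re

def split_paragraphs(document, max_chars=2000):
    """Split a long legal document on blank lines (multiline paragraph
    splitting) and drop empty fragments. max_chars is a rough character
    guard for the ~512-token embedding limit; tune it for your data."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    return [p[:max_chars] for p in paragraphs]

doc = "Clause 1. Licenses...\n\nClause 2. Term...\n\n\nClause 3. Fees..."
clauses = split_paragraphs(doc)  # → three separate clause texts
```

Each resulting fragment can then be fed to the classifier as a separate row in the `clause_text` column.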
## Results ```bash +----------+ | result| +----------+ |[licenses]| |[other]| |[other]| |[licenses]| +----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_licenses_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support licenses 0.91 0.84 0.88 38 other 0.92 0.96 0.94 71 accuracy - - 0.92 109 macro-avg 0.92 0.90 0.91 109 weighted-avg 0.92 0.92 0.92 109 ``` --- layout: model title: Modern Greek (1453-) asr_wav2vec2_large_xlsr_53_greek_by_vasilis TFWav2Vec2ForCTC from vasilis author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_greek_by_vasilis date: 2022-09-25 tags: [wav2vec2, el, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: el edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_greek_by_vasilis` is a Modern Greek (1453-) model originally trained by vasilis.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_greek_by_vasilis_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_greek_by_vasilis_el_4.2.0_3.0_1664105776395.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_greek_by_vasilis_el_4.2.0_3.0_1664105776395.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_greek_by_vasilis', lang = 'el') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_greek_by_vasilis", lang = "el") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_greek_by_vasilis| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|el| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Fast Neural Machine Translation Model from Atlantic-Congo languages to English author: John Snow Labs name: opus_mt_alv_en date: 2021-06-01 tags: [open_source, seq2seq, translation, alv, en, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). source languages: alv target languages: en {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_alv_en_xx_3.1.0_2.4_1622561621126.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_alv_en_xx_3.1.0_2.4_1622561621126.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_alv_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_alv_en", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Atlantic-Congo languages.translate_to.English').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_alv_en| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from Russian to English author: John Snow Labs name: opus_mt_ru_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ru, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `ru` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ru_en_xx_2.7.0_2.4_1609164904542.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ru_en_xx_2.7.0_2.4_1609164904542.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ru_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ru_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ru.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ru_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from maroo93) author: John Snow Labs name: bert_qa_squad1.1_1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squad1.1_1` is an English model originally trained by `maroo93`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_squad1.1_1_en_4.0.0_3.0_1654192154350.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_squad1.1_1_en_4.0.0_3.0_1654192154350.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_squad1.1_1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_squad1.1_1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.v1.1.by_maroo93").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_squad1.1_1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/maroo93/squad1.1_1 --- layout: model title: Part of Speech for Galician author: John Snow Labs name: pos_ud_treegal date: 2021-03-09 tags: [part_of_speech, open_source, galician, pos_ud_treegal, gl] task: Part of Speech Tagging language: gl edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - DET - PROPN - VERB - NOUN - ADP - ADJ - PUNCT - CCONJ - PRON - AUX - ADV - NUM - SCONJ - INTJ - X - SYM {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_treegal_gl_3.0.0_3.0_1615292173539.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_treegal_gl_3.0.0_3.0_1615292173539.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_treegal", "gl") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Ola de John Snow Labs! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_treegal", "gl") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Ola de John Snow Labs! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Ola de John Snow Labs! "] token_df = nlu.load('gl.pos').predict(text) token_df ```
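The `averaged perceptron architecture` named in the description can be illustrated with a tiny, generic tagger sketch. Everything here is hypothetical and for intuition only: `TinyPerceptronTagger`, its two-feature template, and the toy training loop are not the trained `pos_ud_treegal` model, and weight averaging is omitted for brevity.

```python
from collections import defaultdict

# Generic perceptron-style tagger sketch (illustrative only; weight
# averaging, which gives the "averaged" perceptron its name, is omitted).
class TinyPerceptronTagger:
    def __init__(self):
        # weights[feature][tag] -> score contribution
        self.weights = defaultdict(lambda: defaultdict(float))

    def features(self, tokens, i):
        # Minimal feature set: the word itself and its last two characters.
        return [f"word={tokens[i].lower()}", f"suffix={tokens[i][-2:].lower()}"]

    def predict(self, tokens, i, tags):
        # Score each candidate tag by summing its feature weights.
        scores = {t: sum(self.weights[f][t] for f in self.features(tokens, i))
                  for t in tags}
        return max(tags, key=lambda t: scores[t])

    def update(self, tokens, i, gold, guess):
        # Standard perceptron update: reward the gold tag, penalize the guess.
        if gold != guess:
            for f in self.features(tokens, i):
                self.weights[f][gold] += 1.0
                self.weights[f][guess] -= 1.0

# Toy training loop over a single annotated sentence.
tagger = TinyPerceptronTagger()
tokens, gold = ["Ola", "!"], ["NOUN", "PUNCT"]
tags = ["NOUN", "PUNCT"]
for _ in range(3):
    for i in range(len(tokens)):
        tagger.update(tokens, i, gold[i], tagger.predict(tokens, i, tags))

print([tagger.predict(tokens, i, tags) for i in range(len(tokens))])
# ['NOUN', 'PUNCT']
```

The pretrained model uses a much richer feature template and averaged weights, but the score-and-update loop is the core of the architecture.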
## Results ```bash token pos 0 Ola NOUN 1 de ADP 2 John PROPN 3 Snow PROPN 4 Labs PROPN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_treegal| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|gl| --- layout: model title: Multilingual XLMRobertaForTokenClassification Base Cased model (from flood) author: John Snow Labs name: xlmroberta_ner_flood_base_finetuned_panx_all date: 2022-08-13 tags: [xx, open_source, xlm_roberta, ner] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-all` is a Multilingual model originally trained by `flood`. ## Predicted Entities `ORG`, `LOC`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_flood_base_finetuned_panx_all_xx_4.1.0_3.0_1660428231128.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_flood_base_finetuned_panx_all_xx_4.1.0_3.0_1660428231128.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_flood_base_finetuned_panx_all","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_flood_base_finetuned_panx_all","xx") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_flood_base_finetuned_panx_all| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|861.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/flood/xlm-roberta-base-finetuned-panx-all --- layout: model title: Galician Lemmatizer author: John Snow Labs name: lemma date: 2020-07-29 23:34:00 +0800 task: Lemmatization language: gl edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [lemmatizer, gl] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb#scrollTo=bbzEH9u7tdxR){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_gl_2.5.5_2.4_1596054395787.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_gl_2.5.5_2.4_1596054395787.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "gl") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Ademais de ser o rei do norte, John Snow é un médico inglés e un líder no desenvolvemento da anestesia e a hixiene médica.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "gl") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Ademais de ser o rei do norte, John Snow é un médico inglés e un líder no desenvolvemento da anestesia e a hixiene médica.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Ademais de ser o rei do norte, John Snow é un médico inglés e un líder no desenvolvemento da anestesia e a hixiene médica."""] lemma_df = nlu.load('gl.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
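Conceptually, the lemmatizer maps every inflected surface form back to a single root. A minimal dictionary-lookup sketch of that idea (illustrative only: the table entries are hypothetical, and the pretrained `lemma` model additionally uses context to resolve forms the dictionary alone cannot disambiguate):

```python
# Dictionary-based lemmatization sketch (hypothetical entries; not the
# model's actual lemma table).
LEMMA_TABLE = {
    "reis": "rei",        # plural noun    -> singular root
    "era": "ser",         # inflected verb -> infinitive
    "médicos": "médico",  # plural noun    -> singular root
}

def lemmatize(tokens):
    """Look up each token's lemma, falling back to the lowercased token."""
    return [LEMMA_TABLE.get(t.lower(), t.lower()) for t in tokens]

print(lemmatize(["Os", "reis", "era", "médicos"]))
# ['os', 'rei', 'ser', 'médico']
```

This is why, in the results below, past and present forms of the same verb collapse to one root instead of being treated as unrelated words.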
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=6, result='ademais', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=8, end=9, result='de', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=11, end=13, result='ser', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=15, end=15, result='o', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=17, end=19, result='rei', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|gl| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Fast Neural Machine Translation Model from Kwangali to English author: John Snow Labs name: opus_mt_kwn_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, kwn, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `kwn` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_kwn_en_xx_2.7.0_2.4_1609170022966.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_kwn_en_xx_2.7.0_2.4_1609170022966.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_kwn_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_kwn_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.kwn.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_kwn_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from Indo-European Languages to English author: John Snow Labs name: opus_mt_ine_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ine, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `ine` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ine_en_xx_2.7.0_2.4_1609169824118.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ine_en_xx_2.7.0_2.4_1609169824118.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ine_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ine_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ine.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ine_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Clinical Deidentification Pipeline (Portuguese) author: John Snow Labs name: clinical_deidentification date: 2023-06-13 tags: [deid, deidentification, pt, licensed] task: [De-identification, Pipeline Healthcare] language: pt edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline is trained with `w2v_cc_300d` Portuguese embeddings and can be used to deidentify PHI information from medical texts in Portuguese. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask, fake, or obfuscate the following entities: `AGE`, `DATE`, `PROFESSION`, `EMAIL`, `ID`, `COUNTRY`, `STREET`, `DOCTOR`, `HOSPITAL`, `PATIENT`, `URL`, `IP`, `ORGANIZATION`, `PHONE`, `ZIP`, `ACCOUNT`, `SSN`, `PLATE`, `SEX` and `IPADDR` ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_pt_4.4.4_3.2_1686665454016.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_pt_4.4.4_3.2_1686665454016.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "pt", "clinical/models") sample = """Dados do paciente. Nome: Mauro. Apelido: Gonçalves. NIF: 368503. NISS: 26 63514095. Endereço: Calle Miguel Benitez 90. CÓDIGO POSTAL: 28016. Dados de cuidados. Data de nascimento: 03/03/1946. País: Portugal. Idade: 70 anos Sexo: M. Data de admissão: 12/12/2016. Doutor: Ignacio Navarro Cuéllar NºCol: 28 28 70973. Relatório clínico do paciente: Paciente de 70 anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia. Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal. O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV. A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal. Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicéridos de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml. A citologia da urina era repetidamente desconfiada por malignidade. A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis. A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso. 
O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal. A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga. Referido por: Miguel Santos - Avenida dos Aliados, 22 Portugal E-mail: nnavcu@hotmail.com. """ result = deid_pipeline.annotate(sample) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "pt", "clinical/models") val sample = "Dados do paciente. Nome: Mauro. Apelido: Gonçalves. NIF: 368503. NISS: 26 63514095. Endereço: Calle Miguel Benitez 90. CÓDIGO POSTAL: 28016. Dados de cuidados. Data de nascimento: 03/03/1946. País: Portugal. Idade: 70 anos Sexo: M. Data de admissão: 12/12/2016. Doutor: Ignacio Navarro Cuéllar NºCol: 28 28 70973. Relatório clínico do paciente: Paciente de 70 anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia. Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal. O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV. 
A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal. Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicéridos de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml. A citologia da urina era repetidamente desconfiada por malignidade. A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis. A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso. O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal. A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga. Referido por: Miguel Santos - Avenida dos Aliados, 22 Portugal E-mail: nnavcu@hotmail.com" val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("pt.deid.clinical").predict("""Dados do paciente. Nome: Mauro. Apelido: Gonçalves. NIF: 368503. NISS: 26 63514095. Endereço: Calle Miguel Benitez 90. CÓDIGO POSTAL: 28016. Dados de cuidados. Data de nascimento: 03/03/1946. País: Portugal. Idade: 70 anos Sexo: M. Data de admissão: 12/12/2016. Doutor: Ignacio Navarro Cuéllar NºCol: 28 28 70973. 
Relatório clínico do paciente: Paciente de 70 anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia. Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal. O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV. A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal. Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicéridos de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml. A citologia da urina era repetidamente desconfiada por malignidade. A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis. A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso. O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal. A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga. Referido por: Miguel Santos - Avenida dos Aliados, 22 Portugal E-mail: nnavcu@hotmail.com. """) ```
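The results below show two masking policies: replacing each detected entity with its label, and replacing it with a fixed-character run of matching width. A minimal sketch of that post-processing step (illustrative only: the `mask` helper and the hard-coded spans are hypothetical; the real pipeline derives spans from its NER stage and also supports obfuscation with fake values):

```python
# Illustrative masking sketch (hypothetical helper; spans are hard-coded
# here, whereas the pipeline produces them with NER).
def mask(text, spans, policy="label"):
    """Replace (start, end, label) spans. Policy 'label' inserts the tag;
    'chars' inserts a bracketed asterisk run of matching width."""
    out, prev = [], 0
    for start, end, label in sorted(spans):
        out.append(text[prev:start])
        chunk = text[start:end]
        if policy == "label":
            out.append(f"<{label}>")
        else:  # 'chars': same width as the chunk, bracketed
            out.append("[" + "*" * max(len(chunk) - 2, 1) + "]")
        prev = end
    out.append(text[prev:])
    return "".join(out)

note = "Nome: Mauro. Idade: 70 anos."
spans = [(6, 11, "PATIENT"), (20, 22, "AGE")]
print(mask(note, spans))            # Nome: <PATIENT>. Idade: <AGE> anos.
print(mask(note, spans, "chars"))   # Nome: [***]. Idade: [*] anos.
```

The two printed lines correspond to the "Masked with entity labels" and "Masked with chars" sections of the pipeline output.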
## Results ```bash Results Masked with entity labels ------------------------------ Dados do . Nome: . Apelido: . NIF: . NISS: . Endereço: . CÓDIGO POSTAL: . Dados de cuidados. Data de nascimento: . País: . Idade: anos Sexo: . Data de admissão: . Doutor: Cuéllar NºCol: . Relatório clínico do : de anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia. Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal. O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV. A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal. Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicér de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml. A citologia da urina era repetidamente desconfiada por malignidade. A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis. A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso. O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal. 
A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga. Referido por: - , 22 E-mail: . Masked with chars ------------------------------ Dados do [******]. Nome: [***]. Apelido: [*******]. NIF: [****]. NISS: [*********]. Endereço: [*********************]. CÓDIGO POSTAL: [***]. Dados de cuidados. Data de nascimento: [********]. País: [******]. Idade: ** anos Sexo: *. Data de admissão: [********]. Doutor: [*************] Cuéllar NºCol: ** ** [***]. Relatório clínico do [******]: [******] de ** anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia. Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal. O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV. A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal. Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicér[**] de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml. A citologia da urina era repetidamente desconfiada por malignidade. A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis. A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso. 
O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal. A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga. Referido por: [***********] - [*****************], 22 [******] E-mail: [****************]. Masked with fixed length chars ------------------------------ Dados do ****. Nome: ****. Apelido: ****. NIF: ****. NISS: ****. Endereço: ****. CÓDIGO POSTAL: ****. Dados de cuidados. Data de nascimento: ****. País: ****. Idade: **** anos Sexo: ****. Data de admissão: ****. Doutor: **** Cuéllar NºCol: **** **** ****. Relatório clínico do ****: **** de **** anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia. Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal. O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV. A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal. Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicér**** de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml. A citologia da urina era repetidamente desconfiada por malignidade. 
A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis. A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso. O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal. A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga. Referido por: **** - ****, 22 **** E-mail: ****. Obfuscated ------------------------------ Dados do H.. Nome: Marcos Alves. Apelido: Tiago Santos. NIF: 566-445. NISS: 134544332. Endereço: Rua de Santa María, 100. CÓDIGO POSTAL: 4099. Dados de cuidados. Data de nascimento: 31/03/1946. País: Espanha. Idade: 46 anos Sexo: Mulher. Data de admissão: 06/01/2017. Doutor: Carlos Melo Cuéllar NºCol: 134544332 134544332 124 445 311. Relatório clínico do H.: M. de 46 anos, mineiro reformado, sem alergias medicamentosas conhecidas, que apresenta como história pessoal: acidente de trabalho antigo com fracturas vertebrais e das costelas; operado por doença de Dupuytren na mão direita e iliofemoral esquerda; Diabetes Mellitus tipo II, hipercolesterolemia e hiperuricemia; alcoolismo activo, fumador de 20 cigarros / dia. Foi encaminhado dos cuidados primários porque apresentou uma vez hematúria macroscópica pós-morte e depois microhaematúria persistente, com micturição normal. O exame físico mostrou um bom estado geral, com abdómen e genitália normais; o exame rectal foi compatível com adenoma de próstata de grau I/IV. 
A urinálise mostrou 4 glóbulos vermelhos/campo e 0-5 leucócitos/campo; o resto do sedimento estava normal. Hemograma normal; a bioquímica mostrou glicemia de 169 mg/dl e triglicérHomen de 456 mg/dl; função hepática e renal normal. PSA de 1,16 ng/ml. A citologia da urina era repetidamente desconfiada por malignidade. A radiografia simples abdominal mostra alterações degenerativas na coluna lombar e calcificações vasculares tanto no hipocôndrio como na pélvis. A ecografia urológica revelou cistos corticais simples no rim direito, uma bexiga inalterada com boa capacidade e uma próstata com 30g de peso. O IVUS mostrou normofuncionalismo renal bilateral, calcificações na silhueta renal direita e ureteres artrosados com imagens de adição no terço superior de ambos os ureteres, relacionadas com pseudodiverticulose ureteral. O cistograma mostra uma bexiga com boa capacidade, mas com paredes trabeculadas em relação à bexiga de stress. A tomografia computorizada abdominal é normal. A cistoscopia revelou a existência de pequenos tumores na bexiga, e a ressecção transuretral foi realizada com o resultado anatomopatológico do carcinoma urotelial superficial da bexiga. Referido por: Carlos Melo - Avenida Dos Aliados, 56, 22 Espanha E-mail: maria.prado@jacob.com. 
{:.model-param} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|pt| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - TextMatcherModel - ContextualParserModel - ContextualParserModel - RegexMatcherModel - RegexMatcherModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: Fast Neural Machine Translation Model from Semitic Languages to English author: John Snow Labs name: opus_mt_sem_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, sem, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. 
- source languages: `sem` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_sem_en_xx_2.7.0_2.4_1609163743242.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_sem_en_xx_2.7.0_2.4_1609163743242.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_sem_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Text to translate."] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_sem_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate.").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.sem.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_sem_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: German Electra Embeddings (from stefan-it) author: John Snow Labs name: electra_embeddings_electra_base_gc4_64k_0_cased_generator date: 2022-05-17 tags: [de, open_source, electra, embeddings] task: Embeddings language: de edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-gc4-64k-0-cased-generator` is a German model originally trained by `stefan-it`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_0_cased_generator_de_3.4.4_3.0_1652786258159.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_0_cased_generator_de_3.4.4_3.0_1652786258159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_0_cased_generator","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_0_cased_generator","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electra_base_gc4_64k_0_cased_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|de| |Size:|222.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/stefan-it/electra-base-gc4-64k-0-cased-generator - https://german-nlp-group.github.io/projects/gc4-corpus.html - https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf --- layout: model title: Medical Spell Checker author: John Snow Labs name: spellcheck_clinical date: 2022-04-18 tags: [spellcheck, medical, medical_spellchecker, spell_checker, spelling_corrector, en, licensed, clinical] task: Spell Check language: en nav_key: models edition: Healthcare NLP 3.4.2 spark_version: 2.4 supported: true annotator: SpellCheckModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Contextual Spell Checker is a sequence-to-sequence model that detects and corrects spelling errors in your medical input text. It’s based on a Levenshtein Automaton for generating candidate corrections and a Neural Language Model for ranking them. This model has been trained on a dataset containing data from different sources: MTSamples, i2b2 clinical notes, and several specific medical corpora. The model comes fully pretrained and ready to use. However, you can still customize it further without re-training a new model from scratch, by providing custom definitions for the word classes the model has been trained on, namely Dates, Numbers, Ages, Units, and Medications. This model is trained for PySpark 2.4.x users with Spark NLP 3.4.2 and above. 
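For intuition, the candidate-generation stage described above can be sketched in plain Python with a brute-force Levenshtein distance. This is a toy stand-in for the model's Levenshtein Automaton, and the vocabulary and distance cutoff below are illustrative assumptions, not the model's actual settings:

```python
# Toy sketch of Levenshtein-based candidate generation, the idea behind the
# spell checker's first stage. Vocabulary and cutoff are illustrative only.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def candidates(word, vocab, max_dist=2):
    """Vocabulary words within max_dist edits of `word`, nearest first."""
    scored = sorted((levenshtein(word, v), v) for v in vocab)
    return [v for d, v in scored if d <= max_dist]

vocab = ["ambulated", "amputated", "physical", "surgical", "patient"]
print(candidates("imbulated", vocab))  # "ambulated" is one edit away
```

The real model then ranks the surviving candidates with a neural language model over the sentence context, which is what lets it prefer, for example, "patient" over "impatient" in a clinical sentence.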
## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CONTEXTUAL_SPELL_CHECKER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_en_3.4.2_2.4_1650288379214.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_en_3.4.2_2.4_1650288379214.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token")\ .setContextChars(["*", "-", "“", "(", "[", "\n", ".","\"", "”", ",", "?", ")", "]", "!", ";", ":", "'s", "’s"]) spellModel = ContextSpellCheckerModel\ .pretrained('spellcheck_clinical', 'en', 'clinical/models')\ .setInputCols("token")\ .setOutputCol("checked") pipeline = Pipeline(stages = [ documentAssembler, tokenizer, spellModel]) light_pipeline = LightPipeline(pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) example = ["Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.", "With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.", "Abdomen is sort, nontender, and nonintended.", "Patient not showing pain or any wealth problems.", "No cute distress"] result = light_pipeline.annotate(example) ``` ```scala val assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") .setContextChars(Array("*", "-", "“", "(", "[", "\n", ".","\"", "”", ",", "?", ")", "]", "!", ";", ":", "'s", "’s")) val spellChecker = ContextSpellCheckerModel. pretrained("spellcheck_clinical", "en", "clinical/models"). setInputCols("token"). 
setOutputCol("checked") val pipeline = new Pipeline().setStages(Array( assembler, tokenizer, spellChecker)) val light_pipeline = new LightPipeline(pipeline.fit(Seq("").toDS.toDF("text"))) val text = Array("Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.", "With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.", "Abdomen is sort, nontender, and nonintended.", "Patient not showing pain or any wealth problems.", "No cute distress") val result = light_pipeline.annotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.spell.clinical").predict("Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.") ```
## Results ```bash [{'checked': ['With','the','cell','of','physical','therapy','the','patient','was','ambulated','and','on','postoperative',',','the','patient','tolerating','a','post','surgical','soft','diet','.'], 'document': ['Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.'], 'token': ['Witth','the','hell','of','phisical','terapy','the','patient','was','imbulated','and','on','postoperative',',','the','impatient','tolerating','a','post','curgical','soft','diet','.']}, {'checked': ['With','pain','well','controlled','on','oral','pain','medications',',','she','was','discharged','to','rehabilitation','facility','.'], 'document': ['With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.'], 'token': ['With','paint','wel','controlled','on','orall','pain','medications',',','she','was','discharged','too','reihabilitation','facilitay','.']}, {'checked': ['Abdomen','is','soft',',','nontender',',','and','nondistended','.'], 'document': ['Abdomen is sort, nontender, and nonintended.'], 'token': ['Abdomen','is','sort',',','nontender',',','and','nonintended','.']}, {'checked': ['Patient','not','showing','pain','or','any','health','problems','.'], 'document': ['Patient not showing pain or any wealth problems.'], 'token': ['Patient','not','showing','pain','or','any','wealth','problems','.']}, {'checked': ['No', 'acute', 'distress'], 'document': ['No cute distress'], 'token': ['No', 'cute', 'distress']}] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|spellcheck_clinical| |Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[corrected]| |Language:|en| |Size:|141.2 MB| ## References MTSamples, i2b2 clinical notes, and several specific medical corpora. 
--- layout: model title: Legal Tax Withholding Clause Binary Classifier author: John Snow Labs name: legclf_tax_withholding_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `tax-withholding` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that this model's embeddings support up to 512 tokens; if your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above). This model can be combined with any of the other hundreds of legal clause classifiers in Models Hub, producing as output a series of True/False values for each of the clause models you have added. 
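As a rough illustration of the first splitting technique listed above (paragraph splitting by multiline), outside of Spark NLP this amounts to splitting the document on blank lines and classifying each paragraph separately. The regex and sample clause text below are invented for illustration:

```python
import re

# Toy sketch of "paragraph splitting (by multiline)": break a legal document
# on runs of blank lines, then feed each paragraph to the clause classifier
# on its own. The sample clause text is made up for illustration.

def split_paragraphs(text):
    """Split on one or more blank lines and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = """1. TAX WITHHOLDING. The Company may withhold from any amounts
payable under this Agreement such taxes as are required by law.

2. GOVERNING LAW. This Agreement shall be governed by the laws of Delaware."""

paras = split_paragraphs(doc)
print(len(paras))  # → 2
```

Each element of `paras` would then be passed as the `clause_text` input of the classification pipeline shown below.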
## Predicted Entities `other`, `tax-withholding` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_tax_withholding_clause_en_1.0.0_3.2_1660124044782.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_tax_withholding_clause_en_1.0.0_3.2_1660124044782.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_tax_withholding_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[tax-withholding]| |[other]| |[other]| |[tax-withholding]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_tax_withholding_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.99 0.99 0.99 200 tax-withholding 0.98 0.98 0.98 82 accuracy - - 0.99 282 macro-avg 0.98 0.98 0.98 282 weighted-avg 0.99 0.99 0.99 282 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_recipe_triplet_recipes_base_squadv2_epochs_3 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `recipe_triplet_recipes-roberta-base_squadv2_epochs_3` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_recipe_triplet_recipes_base_squadv2_epochs_3_en_4.3.0_3.0_1674212279557.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_recipe_triplet_recipes_base_squadv2_epochs_3_en_4.3.0_3.0_1674212279557.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_recipe_triplet_recipes_base_squadv2_epochs_3","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_recipe_triplet_recipes_base_squadv2_epochs_3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_recipe_triplet_recipes_base_squadv2_epochs_3| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|467.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/recipe_triplet_recipes-roberta-base_squadv2_epochs_3 --- layout: model title: Turkish BertForQuestionAnswering model (from savasy) author: John Snow Labs name: bert_qa_bert_base_turkish_squad date: 2022-06-02 tags: [tr, open_source, question_answering, bert] task: Question Answering language: tr edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-turkish-squad` is a Turkish model originally trained by `savasy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_turkish_squad_tr_4.0.0_3.0_1654180711714.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_turkish_squad_tr_4.0.0_3.0_1654180711714.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_turkish_squad","tr") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_turkish_squad","tr") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("tr.answer_question.squad.bert.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_turkish_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|tr| |Size:|412.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/savasy/bert-base-turkish-squad - https://github.com/TQuad/turkish-nlp-qa-dataset --- layout: model title: Legal Executive Employment Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_executive_employment_agreement date: 2022-11-10 tags: [en, legal, licensed, classification] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_executive_employment_agreement` model is a Legal Longformer Document Classifier that determines whether a document belongs to the class `executive-employment-agreement` or not (Binary Classification). Longformers have a limit of 4096 tokens, so only the first 4096 tokens will be taken into account. We have found that for the large majority of documents in legal corpora, if they are clean and contain only the legal document itself without extra material before it, 4096 tokens are enough for document classification. If not, let us know and we can take another approach for you: split the document into 4096-token chunks, average their embeddings, and train on the averaged version, so that the whole document is taken into account. In practice this should not be required. 
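The chunk-and-average fallback described above can be sketched with NumPy. The window size, vector dimension, and toy encoder below are illustrative stand-ins for the Longformer's real 4096-token window and its embedding layer:

```python
import numpy as np

# Toy sketch of the chunk-and-average fallback: embed a long document in
# fixed-size chunks and average the chunk vectors so every part of the
# document contributes. Sizes and the encoder are illustrative stand-ins.
MAX_TOKENS = 8  # stands in for the Longformer's 4096-token window

def toy_encoder(tokens):
    """Stand-in for the real encoder: one small vector per chunk."""
    return np.array([len(tokens), sum(len(t) for t in tokens), 0.0, 1.0])

def embed_long_document(tokens):
    chunks = [tokens[i:i + MAX_TOKENS]
              for i in range(0, len(tokens), MAX_TOKENS)]
    vecs = np.stack([toy_encoder(c) for c in chunks])
    return vecs.mean(axis=0)  # document embedding = mean of chunk embeddings

doc = [f"tok{i}" for i in range(20)]  # 20 tokens -> chunks of 8, 8, 4
emb = embed_long_document(doc)
print(emb.shape)  # → (4,)
```

In the real pipeline this would mean running the Longformer once per 4096-token chunk and averaging before the SentenceEmbeddings and classifier stages.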
## Predicted Entities `executive-employment-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_executive_employment_agreement_en_1.0.0_3.0_1668107314838.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_executive_employment_agreement_en_1.0.0_3.0_1668107314838.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = nlp.ClassifierDLModel.pretrained("legclf_executive_employment_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[executive-employment-agreement]| |[other]| |[other]| |[executive-employment-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_executive_employment_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support executive-employment-agreement 1.00 1.00 1.00 45 other 1.00 1.00 1.00 85 accuracy - - 1.00 130 macro-avg 1.00 1.00 1.00 130 weighted-avg 1.00 1.00 1.00 130 ``` --- layout: model title: English asr_wav2vec2_base_timit_demo_colab50 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab50 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab50` is an English model originally trained by hassnain.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab50_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab50_en_4.2.0_3.0_1664021065212.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab50_en_4.2.0_3.0_1664021065212.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab50", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab50", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
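The snippets above assume an `audioDf` whose `audio_content` column already holds the raw audio. Wav2vec2-style models typically consume mono 16 kHz audio normalized to floats in [-1, 1]; as a self-contained, illustrative sketch of that normalization using only the Python standard library (how you then load the floats into a Spark DataFrame depends on your setup — check the Spark NLP `AudioAssembler` documentation for the exact expected schema):

```python
import wave, struct

def wav_to_floats(path):
    """Read a mono 16-bit PCM WAV and normalize samples to [-1.0, 1.0],
    the float representation wav2vec2-style models typically expect."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2 and w.getnchannels() == 1
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Write a tiny synthetic 16 kHz WAV so the sketch is self-contained.
with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1); w.setsampwidth(2); w.setframerate(16000)
    w.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))

print(wav_to_floats("tone.wav"))
```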
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab50| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: English image_classifier_vit_dogs ViTForImageClassification from jasmeen author: John Snow Labs name: image_classifier_vit_dogs date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_dogs` is an English model originally trained by jasmeen. ## Predicted Entities `golden retriever`, `great dane`, `husky` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_dogs_en_4.1.0_3.0_1660166187363.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_dogs_en_4.1.0_3.0_1660166187363.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_dogs", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_dogs", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_dogs| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Legal Arbitration Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_arbitration_bert date: 2023-03-05 tags: [en, legal, classification, clauses, arbitration, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Arbitration` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Arbitration`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_arbitration_bert_en_1.0.0_3.0_1678050699076.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_arbitration_bert_en_1.0.0_3.0_1678050699076.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_arbitration_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
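The description above recommends paragraph splitting (by multiline) before classification. As a minimal, illustrative sketch of that preprocessing step in plain Python (not part of the Spark NLP pipeline itself):

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines (one or more empty lines) and drop
    whitespace-only fragments, so each piece is one provision/paragraph."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = """ARBITRATION.

Any dispute shall be settled by binding arbitration.


GOVERNING LAW.
This Agreement is governed by the laws of Delaware."""

for paragraph in split_paragraphs(doc):
    print(repr(paragraph))
```

Each resulting paragraph can then be fed to the classifier as a separate row.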
## Results ```bash +-------+ |result| +-------+ |[Arbitration]| |[Other]| |[Other]| |[Arbitration]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_arbitration_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Arbitration 0.98 0.98 0.98 45 Other 0.98 0.98 0.98 61 accuracy - - 0.98 106 macro-avg 0.98 0.98 0.98 106 weighted-avg 0.98 0.98 0.98 106 ``` --- layout: model title: French CamemBert Embeddings (from lewtun) author: John Snow Labs name: camembert_embeddings_lewtun_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `lewtun`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_lewtun_generic_model_fr_3.4.4_3.0_1653989332783.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_lewtun_generic_model_fr_3.4.4_3.0_1653989332783.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_lewtun_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_lewtun_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
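The pipeline above outputs one vector per token in the `embeddings` column. A common downstream use of such vectors is comparing them with cosine similarity; as a minimal pure-Python sketch on toy vectors (illustrative only, not Spark NLP API):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```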
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_lewtun_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/lewtun/dummy-model --- layout: model title: Translate English to Greek languages Pipeline author: John Snow Labs name: translate_en_grk date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, grk, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `grk` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_grk_xx_2.7.0_2.4_1609686967896.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_grk_xx_2.7.0_2.4_1609686967896.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_grk", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_grk", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.grk').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_grk| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Detect Oncology-Specific Entities author: John Snow Labs name: ner_oncology date: 2022-10-25 tags: [licensed, clinical, oncology, en, ner, biomarker, treatment] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Healthcare 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts more than 40 oncology-related entities, including therapies, tests and staging. Definitions of Predicted Entities: - `Adenopathy`: Mentions of pathological findings of the lymph nodes. - `Age`: All mentions of ages, past or present, related to the patient or to anybody else. - `Biomarker`: Biological molecules that indicate the presence or absence of cancer, or the type of cancer. Oncogenes are excluded from this category. - `Biomarker_Result`: Terms or values that are identified as the result of a biomarker. - `Cancer_Dx`: Mentions of cancer diagnoses (such as "breast cancer") or pathological types that are usually used as synonyms for "cancer" (e.g. "carcinoma"). When anatomical references are present, they are included in the Cancer_Dx extraction. - `Cancer_Score`: Clinical or imaging scores that are specific for cancer settings (e.g. "BI-RADS" or "Allred score"). - `Cancer_Surgery`: Terms that indicate surgery as a form of cancer treatment. - `Chemotherapy`: Mentions of chemotherapy drugs, or unspecific words such as "chemotherapy". - `Cycle_Count`: The total number of cycles of an oncological therapy being administered (e.g. "5 cycles").
- `Cycle_Day`: References to the day of the cycle of oncological therapy (e.g. "day 5"). - `Cycle_Number`: The number of the cycle of an oncological therapy that is being applied (e.g. "third cycle"). - `Date`: Mentions of exact dates, in any format, including day number, month and/or year. - `Death_Entity`: Words that indicate the death of the patient or someone else (including family members), such as "died" or "passed away". - `Direction`: Directional and laterality terms, such as "left", "right", "bilateral", "upper" and "lower". - `Dosage`: The quantity prescribed by the physician for an active ingredient. - `Duration`: Words indicating the duration of a treatment (e.g. "for 2 weeks"). - `Frequency`: Words indicating the frequency of treatment administration (e.g. "daily" or "bid"). - `Gender`: Gender-specific nouns and pronouns (including words such as "him" or "she", and family members such as "father"). - `Grade`: All pathological grading of tumors (e.g. "grade 1") or degrees of cellular differentiation (e.g. "well-differentiated") - `Histological_Type`: Histological variants or cancer subtypes, such as "papillary", "clear cell" or "medullary". - `Hormonal_Therapy`: Mentions of hormonal drugs used to treat cancer, or unspecific words such as "hormonal therapy". - `Imaging_Test`: Imaging tests mentioned in texts, such as "chest CT scan". - `Immunotherapy`: Mentions of immunotherapy drugs, or unspecific words such as "immunotherapy". - `Invasion`: Mentions that refer to tumor invasion, such as "invasion" or "involvement". Metastases or lymph node involvement are excluded from this category. - `Line_Of_Therapy`: Explicit references to the line of therapy of an oncological therapy (e.g. "first-line treatment"). - `Metastasis`: Terms that indicate a metastatic disease. Anatomical references are not included in these extractions. - `Oncogene`: Mentions of genes that are implicated in the etiology of cancer. 
- `Pathology_Result`: The findings of a biopsy from the pathology report that is not covered by another entity (e.g. "malignant ductal cells"). - `Pathology_Test`: Mentions of biopsies or tests that use tissue samples. - `Performance_Status`: Mentions of performance status scores, such as ECOG and Karnofsky. The name of the score is extracted together with the result (e.g. "ECOG performance status of 4"). - `Race_Ethnicity`: The race and ethnicity categories include racial and national origin or sociocultural groups. - `Radiotherapy`: Terms that indicate the use of Radiotherapy. - `Response_To_Treatment`: Terms related to clinical progress of the patient related to cancer treatment, including "recurrence", "bad response" or "improvement". - `Relative_Date`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "yesterday" or "three years later"). - `Route`: Words indicating the type of administration route (such as "PO" or "transdermal"). - `Site_Bone`: Anatomical terms that refer to the human skeleton. - `Site_Brain`: Anatomical terms that refer to the central nervous system (including the brain stem and the cerebellum). - `Site_Breast`: Anatomical terms that refer to the breasts. - `Site_Liver`: Anatomical terms that refer to the liver. - `Site_Lung`: Anatomical terms that refer to the lungs. - `Site_Lymph_Node`: Anatomical terms that refer to lymph nodes, excluding adenopathies. - `Site_Other_Body_Part`: Relevant anatomical terms that are not included in the rest of the anatomical entities. - `Smoking_Status`: All mentions of smoking related to the patient or to someone else. - `Staging`: Mentions of cancer stage such as "stage 2b" or "T2N1M0". It also includes words such as "in situ", "early-stage" or "advanced". - `Targeted_Therapy`: Mentions of targeted therapy drugs, or unspecific words such as "targeted therapy". 
- `Tumor_Finding`: All nonspecific terms that may be related to tumors, either malignant or benign (for example: "mass", "tumor", "lesion", or "neoplasm"). - `Tumor_Size`: Size of the tumor, including numerical value and unit of measurement (e.g. "3 cm"). - `Unspecific_Therapy`: Terms that indicate a known cancer therapy but that is not specific to any other therapy entity (e.g. "chemoradiotherapy" or "adjuvant therapy"). ## Predicted Entities `Histological_Type`, `Direction`, `Staging`, `Cancer_Score`, `Imaging_Test`, `Cycle_Number`, `Tumor_Finding`, `Site_Lymph_Node`, `Invasion`, `Response_To_Treatment`, `Smoking_Status`, `Tumor_Size`, `Cycle_Count`, `Adenopathy`, `Age`, `Biomarker_Result`, `Unspecific_Therapy`, `Site_Breast`, `Chemotherapy`, `Targeted_Therapy`, `Radiotherapy`, `Performance_Status`, `Pathology_Test`, `Site_Other_Body_Part`, `Cancer_Surgery`, `Line_Of_Therapy`, `Pathology_Result`, `Hormonal_Therapy`, `Site_Bone`, `Biomarker`, `Immunotherapy`, `Cycle_Day`, `Frequency`, `Route`, `Duration`, `Death_Entity`, `Metastasis`, `Site_Liver`, `Cancer_Dx`, `Grade`, `Date`, `Site_Lung`, `Site_Brain`, `Relative_Date`, `Race_Ethnicity`, `Gender`, `Oncogene`, `Dosage`, `Radiation_Dose` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_en_4.0.0_3.0_1666718178718.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_en_4.0.0_3.0_1666718178718.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to the residual breast. The cancer recurred as a right lung metastasis 13 years later. 
The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to the residual breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology").predict("""The had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to the residual breast. 
The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.""") ```
## Results ```bash | chunk | ner_label | |:-------------------------------|:----------------------| | left | Direction | | mastectomy | Cancer_Surgery | | axillary lymph node dissection | Cancer_Surgery | | left | Direction | | breast cancer | Cancer_Dx | | twenty years ago | Relative_Date | | tumor | Tumor_Finding | | positive | Biomarker_Result | | ER | Biomarker | | PR | Biomarker | | radiotherapy | Radiotherapy | | breast | Site_Breast | | cancer | Cancer_Dx | | recurred | Response_To_Treatment | | right | Direction | | lung | Site_Lung | | metastasis | Metastasis | | 13 years later | Relative_Date | | adriamycin | Chemotherapy | | 60 mg/m2 | Dosage | | cyclophosphamide | Chemotherapy | | 600 mg/m2 | Dosage | | six courses | Cycle_Count | | first line | Line_Of_Therapy | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology| |Compatibility:|Spark NLP for Healthcare 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.6 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated oncology case reports. 
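The benchmarking table below reports per-label precision, recall and F1 derived from the tp/fp/fn counts in the same row. As a quick sketch of that relationship in plain Python, using the `Histological_Type` row as an example:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and
    false-negative counts, rounded to two decimals as in the table."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

# Histological_Type row: tp=339, fp=75, fn=114
print(prf(339, 75, 114))  # matches the 0.82 / 0.75 / 0.78 reported below
```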
## Benchmarking ```bash label tp fp fn total precision recall f1 Histological_Type 339 75 114 453 0.82 0.75 0.78 Direction 832 163 152 984 0.84 0.85 0.84 Staging 229 31 29 258 0.88 0.89 0.88 Cancer_Score 37 8 25 62 0.82 0.60 0.69 Imaging_Test 2027 214 177 2204 0.90 0.92 0.91 Cycle_Number 73 29 24 97 0.72 0.75 0.73 Tumor_Finding 1114 64 143 1257 0.95 0.89 0.91 Site_Lymph_Node 491 53 60 551 0.90 0.89 0.90 Invasion 158 36 23 181 0.81 0.87 0.84 Response_To_Treatment 431 149 165 596 0.74 0.72 0.73 Smoking_Status 66 18 2 68 0.79 0.97 0.87 Tumor_Size 1050 112 79 1129 0.90 0.93 0.92 Cycle_Count 177 62 53 230 0.74 0.77 0.75 Adenopathy 67 12 29 96 0.85 0.70 0.77 Age 930 33 19 949 0.97 0.98 0.97 Biomarker_Result 1160 169 285 1445 0.87 0.80 0.84 Unspecific_Therapy 198 86 80 278 0.70 0.71 0.70 Site_Breast 125 15 22 147 0.89 0.85 0.87 Chemotherapy 814 55 65 879 0.94 0.93 0.93 Targeted_Therapy 195 27 33 228 0.88 0.86 0.87 Radiotherapy 276 29 34 310 0.90 0.89 0.90 Performance_Status 121 17 14 135 0.88 0.90 0.89 Pathology_Test 888 296 162 1050 0.75 0.85 0.79 Site_Other_Body_Part 909 275 592 1501 0.77 0.61 0.68 Cancer_Surgery 693 119 126 819 0.85 0.85 0.85 Line_Of_Therapy 101 11 5 106 0.90 0.95 0.93 Pathology_Result 655 279 487 1142 0.70 0.57 0.63 Hormonal_Therapy 169 4 16 185 0.98 0.91 0.94 Site_Bone 264 81 49 313 0.77 0.84 0.80 Biomarker 1259 238 256 1515 0.84 0.83 0.84 Immunotherapy 103 47 25 128 0.69 0.80 0.74 Cycle_Day 200 36 48 248 0.85 0.81 0.83 Frequency 354 27 73 427 0.93 0.83 0.88 Route 91 15 22 113 0.86 0.81 0.83 Duration 625 161 136 761 0.80 0.82 0.81 Death_Entity 34 2 4 38 0.94 0.89 0.92 Metastasis 353 18 17 370 0.95 0.95 0.95 Site_Liver 189 64 45 234 0.75 0.81 0.78 Cancer_Dx 1301 103 93 1394 0.93 0.93 0.93 Grade 190 27 46 236 0.88 0.81 0.84 Date 807 21 24 831 0.97 0.97 0.97 Site_Lung 469 110 90 559 0.81 0.84 0.82 Site_Brain 221 64 58 279 0.78 0.79 0.78 Relative_Date 1211 401 111 1322 0.75 0.92 0.83 Race_Ethnicity 57 8 5 62 0.88 0.92 0.90 Gender 1247 17 7 1254 0.99 0.99 
0.99 Oncogene 345 83 104 449 0.81 0.77 0.79 Dosage 900 30 160 1060 0.97 0.85 0.90 Radiation_Dose 108 5 18 126 0.96 0.86 0.90 macro_avg 24653 3999 4406 29059 0.85 0.84 0.84 micro_avg 24653 3999 4406 29059 0.86 0.85 0.85 ``` --- layout: model title: Legal Indemnification Clause Binary Classifier author: John Snow Labs name: legclf_indemnification_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `indemnification` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
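Combining several binary clause classifiers, as described above, amounts to running each one on the same text and collecting a per-clause True/False map. A minimal, illustrative sketch in plain Python (the classifier callables here are hypothetical keyword-rule stand-ins, not Spark NLP API):

```python
def detect_clauses(text, classifiers):
    """Run each binary clause classifier on the text and collect a
    {clause_name: True/False} map: True when the classifier's predicted
    category equals the clause name (as opposed to 'other')."""
    return {name: clf(text) == name for name, clf in classifiers.items()}

# Hypothetical stand-in classifiers: keyword rules instead of real models.
classifiers = {
    "indemnification": lambda t: "indemnification" if "indemnif" in t.lower() else "other",
    "arbitration": lambda t: "arbitration" if "arbitrat" in t.lower() else "other",
}

clause = "Each party shall indemnify and hold harmless the other party."
print(detect_clauses(clause, classifiers))
```

With real models, each stand-in would be one fitted binary-classifier pipeline applied to the same DataFrame.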
## Predicted Entities `other`, `indemnification` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_indemnification_clause_en_1.0.0_3.2_1660123608974.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_indemnification_clause_en_1.0.0_3.2_1660123608974.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_indemnification_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
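As the description notes, several binary clause classifiers can be stacked, each contributing a True/False signal per clause. A small illustration of collecting those outputs into a clause-to-boolean map (the classifier names and predicted labels here are illustrative, not actual pipeline output):

```python
def clause_flags(predictions: dict) -> dict:
    """Each binary clause classifier predicts either its clause name or 'other';
    map that to a boolean flag per clause."""
    return {clause: label == clause for clause, label in predictions.items()}

# e.g. labels gathered from an indemnification and an obligations classifier
preds = {"indemnification": "indemnification", "obligations": "other"}
print(clause_flags(preds))  # {'indemnification': True, 'obligations': False}
```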
## Results ```bash +-------+ | result| +-------+ |[indemnification]| |[other]| |[other]| |[indemnification]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_indemnification_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support indemnification 0.94 0.81 0.87 37 other 0.91 0.97 0.94 71 accuracy - - 0.92 108 macro-avg 0.92 0.89 0.90 108 weighted-avg 0.92 0.92 0.92 108 ``` --- layout: model title: English image_classifier_vit_Teeth_B ViTForImageClassification from steven123 author: John Snow Labs name: image_classifier_vit_Teeth_B date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_Teeth_B` is an English model originally trained by steven123. ## Predicted Entities `Good Teeth`, `Missing Teeth`, `Rotten Teeth` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Teeth_B_en_4.1.0_3.0_1660172735230.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Teeth_B_en_4.1.0_3.0_1660172735230.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_Teeth_B", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_Teeth_B", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_Teeth_B| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Medical Text Generation (BioGPT-based) author: John Snow Labs name: text_generator_biomedical_biogpt_base date: 2023-04-03 tags: [licensed, en, clinical, text_generation, biogpt, tensorflow] task: Text Generation language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true engine: tensorflow annotator: MedicalTextGenerator article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a BioGPT-based (LLM) text generation model, finetuned on biomedical datasets (PubMed abstracts) by John Snow Labs. Given a few tokens as a prompt, it can generate human-like, conceptually meaningful text of up to 1024 tokens from an input of up to 1024 tokens. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/MEDICAL_TEXT_GENERATION/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/33.1.Medical_Text_Generation.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/text_generator_biomedical_biogpt_base_en_4.3.2_3.0_1680514919715.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/text_generator_biomedical_biogpt_base_en_4.3.2_3.0_1680514919715.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("prompt")\ .setOutputCol("document_prompt") med_text_generator = MedicalTextGenerator.pretrained("text_generator_biomedical_biogpt_base", "en", "clinical/models")\ .setInputCols("document_prompt")\ .setOutputCol("answer")\ .setMaxNewTokens(256)\ .setDoSample(True)\ .setTopK(3)\ .setRandomSeed(42) pipeline = Pipeline(stages=[document_assembler, med_text_generator]) data = spark.createDataFrame([["Covid 19 is"], ["The most common cause of stomach pain is"]]).toDF("prompt") pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("prompt") .setOutputCol("document_prompt") val med_text_generator = MedicalTextGenerator.pretrained("text_generator_biomedical_biogpt_base", "en", "clinical/models") .setInputCols("document_prompt") .setOutputCol("answer") .setMaxNewTokens(256) .setDoSample(true) .setTopK(3) .setRandomSeed(42) val pipeline = new Pipeline().setStages(Array(document_assembler, med_text_generator)) val data = Seq("Covid 19 is", "The most common cause of stomach pain is").toDS.toDF("prompt") val result = pipeline.fit(data).transform(data) ```
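The `setDoSample(True)` and `setTopK(3)` parameters above make generation stochastic but restricted to the three highest-scoring candidate tokens at each step. A toy pure-Python illustration of top-k filtering (a simplification for intuition, not the annotator's actual decoding code):

```python
import random

def top_k_sample(scores: dict, k: int = 3) -> str:
    """Keep only the k highest-scoring tokens, then sample among them by score."""
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens, weights = zip(*top)
    return random.choices(tokens, weights=weights, k=1)[0]

random.seed(42)  # analogous to setRandomSeed(42) above
scores = {"pandemic": 5.0, "virus": 4.2, "disease": 3.9, "banana": 0.1}
print(top_k_sample(scores, k=3))  # "banana" is filtered out before sampling
```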
## Results ```bash [Covid 19 is a pandemic with high rates in both mortality ( around 5 to 8 percent in the United States ) and economic loss, which are likely related to the disruption of social life. The COVID - 19 crisis has caused a significant reduction in healthcare capacity and has led to an increased risk of infection in healthcare facilities and patients with underlying conditions, which has increased morbidity, increased mortality rates in patients, and increased healthcare costs. The COVID - 19 pandemic has also led to a significant increase in the number of patients with chronic diseases, which has led to an increase in the number of patients with chronic conditions who are at high cardiovascular ( cardiovascular diseases [ CDs ] ) risk and therefore require intensive care. " This review will discuss the impact of this COVID pandemic in the healthcare system, the potential impact in healthcare providers caring and treating patients with CDs, and the potential impact on the healthcare system. The COVID Pandemias- A Review of the Current Literature. The COVID - 19 pandemic has resulted in a significant increase in the number of patients with cardiovascular disease ( CVD ). The number of patients with CVD is expected to increase by approximately 20 percent by the end of 2020. The number of patients with CVD will also increase by approximately 20 percent by the end of 2020] [The most common cause of stomach pain is peptic ulcer disease. The diagnosis of gastric ulcer is based on the presence and severity ( as determined by endoscopy ) of the ulcer, as confirmed on the basis ofendoscopic biopsy and gastric mucosal biopsy with urease tests, and by the presence of Helicobacter pylori. The treatment of gastric ulcer is based on the eradication of H. pylori. The aim of this study, conducted on the population aged over 40 in the city of Szczecin, was to determine the prevalence of H. 
pylori infection in patients with gastric ulcer and to assess the effectiveness of the eradication therapy. MATERIAL AND METHODS: The study involved patients aged over 40 who were admitted to the Gastroenterology Clinic of the Medical University of Szczecin with a diagnosis of gastric ulcer. The study was conducted on the population of patients with gastric ulcer, who were admitted to the Gastroenterology Clinic of the Medical University of Szczecin between January and December 2014. The study included patients with gastric ulcer who were admitted to the Gastroenterology Clinic of the Medical University of Szczecin between January and December 2014. The study was conducted on the population of patients aged over 40 who were admitted to the Gastroenterology Clinic of the] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|text_generator_biomedical_biogpt_base| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|917.9 MB| |Case sensitive:|true| --- layout: model title: Russian T5ForConditionalGeneration Base Cased model (from BlackSamorez) author: John Snow Labs name: t5_ebanko_base date: 2023-01-30 tags: [ru, open_source, t5, tensorflow] task: Text Generation language: ru edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ebanko-base` is a Russian model originally trained by `BlackSamorez`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_ebanko_base_ru_4.3.0_3.0_1675101255926.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_ebanko_base_ru_4.3.0_3.0_1675101255926.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_ebanko_base","ru") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_ebanko_base","ru") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_ebanko_base| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ru| |Size:|927.4 MB| ## References - https://huggingface.co/BlackSamorez/ebanko-base - https://github.com/BlackSamorez - https://github.com/skoltech-nlp/russe_detox_2022 --- layout: model title: Sentence Detection in Gujarati Text author: John Snow Labs name: sentence_detector_dl date: 2021-08-30 tags: [gu, sentence_detection, open_source] task: Sentence Detection language: gu edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_gu_3.2.0_3.0_1630336149356.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_gu_3.2.0_3.0_1630336149356.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "gu") \ .setInputCols(["document"]) \ .setOutputCol("sentences") sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL])) sd_model.fullAnnotate("""ઇંગલિશ વાંચન ફકરા એક મહાન સ્ત્રોત માટે શોધી રહ્યાં છો? તમે યોગ્ય જગ્યાએ આવ્યા છો. તાજેતરના એક અભ્યાસ મુજબ આજના યુવાનોમાં વાંચવાની ટેવ ઝડપથી ઘટી રહી છે. તેઓ આપેલ અંગ્રેજી વાંચન ફકરા પર થોડી સેકંડથી વધુ સમય સુધી ધ્યાન કેન્દ્રિત કરી શકતા નથી! ઉપરાંત, વાંચન તમામ સ્પર્ધાત્મક પરીક્ષાઓનો અભિન્ન ભાગ હતો અને છે. તો, તમે તમારી વાંચન કુશળતા કેવી રીતે સુધારી શકો છો? આ પ્રશ્નનો જવાબ વાસ્તવમાં બીજો પ્રશ્ન છે: વાંચન કુશળતાનો ઉપયોગ શું છે? વાંચનનો મુખ્ય હેતુ 'અર્થપૂર્ણ બનાવવાનો' છે.""") ``` ```scala val documenter = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "gu") .setInputCols(Array("document")) .setOutputCol("sentence") val pipeline = new Pipeline().setStages(Array(documenter, model)) val data = Seq("ઇંગલિશ વાંચન ફકરા એક મહાન સ્ત્રોત માટે શોધી રહ્યાં છો? તમે યોગ્ય જગ્યાએ આવ્યા છો. તાજેતરના એક અભ્યાસ મુજબ આજના યુવાનોમાં વાંચવાની ટેવ ઝડપથી ઘટી રહી છે. તેઓ આપેલ અંગ્રેજી વાંચન ફકરા પર થોડી સેકંડથી વધુ સમય સુધી ધ્યાન કેન્દ્રિત કરી શકતા નથી! ઉપરાંત, વાંચન તમામ સ્પર્ધાત્મક પરીક્ષાઓનો અભિન્ન ભાગ હતો અને છે. તો, તમે તમારી વાંચન કુશળતા કેવી રીતે સુધારી શકો છો? આ પ્રશ્નનો જવાબ વાસ્તવમાં બીજો પ્રશ્ન છે: વાંચન કુશળતાનો ઉપયોગ શું છે? વાંચનનો મુખ્ય હેતુ 'અર્થપૂર્ણ બનાવવાનો' છે.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load('gu.sentence_detector').predict("अंग्रेजी पढ्ने अनुच्छेद को एक महान स्रोत को लागी हेर्दै हुनुहुन्छ? तपाइँ सही ठाउँमा आउनुभएको छ. हालै गरिएको एक अध्ययन अनुसार आजको युवाहरुमा पढ्ने बानी छिटोछिटो घट्दै गएको छ. 
उनीहरु केहि सेकेन्ड भन्दा बढी को लागी एक दिईएको अंग्रेजी पढ्ने अनुच्छेद मा ध्यान केन्द्रित गर्न सक्दैनन्! साथै, पठन थियो र सबै प्रतियोगी परीक्षा को एक अभिन्न हिस्सा हो। त्यसोभए, तपाइँ तपाइँको पठन कौशल कसरी सुधार गर्नुहुन्छ? यो प्रश्न को जवाफ वास्तव मा अर्को प्रश्न हो: पढ्ने कौशल को उपयोग के हो? पढ्न को मुख्य उद्देश्य 'भावना बनाउन' हो.", output_level ='sentence') ```
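To see why a trained model helps here, compare a naive punctuation-based splitter — the kind of heuristic SentenceDetectorDL is meant to improve on (an illustrative baseline, not part of the model):

```python
import re

def naive_split(text: str) -> list:
    """Split after ., ! or ? followed by whitespace — no abbreviation handling."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# The abbreviation "Dr." triggers a spurious boundary that a trained model would avoid:
print(naive_split("Dr. Smith arrived. He was late!"))
```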
## Results ```bash +-----------------------------------------------------------------------------------------+ |result | +-----------------------------------------------------------------------------------------+ |[ઇંગલિશ વાંચન ફકરા એક મહાન સ્ત્રોત માટે શોધી રહ્યાં છો?] | |[તમે યોગ્ય જગ્યાએ આવ્યા છો.] | |[તાજેતરના એક અભ્યાસ મુજબ આજના યુવાનોમાં વાંચવાની ટેવ ઝડપથી ઘટી રહી છે.] | |[તેઓ આપેલ અંગ્રેજી વાંચન ફકરા પર થોડી સેકંડથી વધુ સમય સુધી ધ્યાન કેન્દ્રિત કરી શકતા નથી!] | |[ઉપરાંત, વાંચન તમામ સ્પર્ધાત્મક પરીક્ષાઓનો અભિન્ન ભાગ હતો અને છે.] | |[તો, તમે તમારી વાંચન કુશળતા કેવી રીતે સુધારી શકો છો?] | |[આ પ્રશ્નનો જવાબ વાસ્તવમાં બીજો પ્રશ્ન છે:] | |[વાંચન કુશળતાનો ઉપયોગ શું છે?] | |[વાંચનનો મુખ્ય હેતુ 'અર્થપૂર્ણ બનાવવાનો' છે.] | +-----------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentence_detector_dl| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[sentences]| |Language:|gu| ## Benchmarking ```bash label Accuracy Recall Prec F1 0 0.98 1.00 0.96 0.98 ``` --- layout: model title: English RobertaForSequenceClassification Cased model (from mrm8488) author: John Snow Labs name: roberta_sequence_classifier_distilroberta_finetuned_tweets_hate_speech date: 2022-07-13 tags: [en, open_source, roberta, sequence_classification] task: Text Classification language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-finetuned-tweets-hate-speech` is an English model originally trained by `mrm8488`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_distilroberta_finetuned_tweets_hate_speech_en_4.0.0_3.0_1657716214810.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_distilroberta_finetuned_tweets_hate_speech_en_4.0.0_3.0_1657716214810.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_distilroberta_finetuned_tweets_hate_speech","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, classifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_distilroberta_finetuned_tweets_hate_speech","en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, classifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_sequence_classifier_distilroberta_finetuned_tweets_hate_speech| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|309.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mrm8488/distilroberta-finetuned-tweets-hate-speech --- layout: model title: Legal Obligations Clause Binary Classifier author: John Snow Labs name: legclf_cuad_obligations_clause date: 2022-11-28 tags: [en, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `obligations` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial linked above). 
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `obligations` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cuad_obligations_clause_en_1.0.0_3.0_1669636805969.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cuad_obligations_clause_en_1.0.0_3.0_1669636805969.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_cuad_obligations_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
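The 512-token embedding limit mentioned in the description can be handled with a rough pre-check; the sketch below approximates tokens by whitespace-separated words (the model's subword tokenizer will count somewhat differently):

```python
def chunk_by_tokens(text: str, max_tokens: int = 512) -> list:
    """Greedily pack whitespace-separated words into chunks of at most max_tokens."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

clause = "The Supplier shall perform its obligations " * 200  # 1200 words
chunks = chunk_by_tokens(clause)
print(len(chunks), [len(c.split()) for c in chunks])  # 3 [512, 512, 176]
```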
## Results ```bash +-------+ | result| +-------+ |[obligations]| |[other]| |[other]| |[obligations]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_cuad_obligations_clause| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.4 MB| ## References In-house annotations on CUAD dataset ## Benchmarking ```bash label precision recall f1-score support obligations 0.88 0.79 0.83 95 other 0.88 0.93 0.90 152 accuracy - - 0.88 247 macro-avg 0.88 0.86 0.87 247 weighted-avg 0.88 0.88 0.88 247 ``` --- layout: model title: English ElectraForQuestionAnswering model (from bdickson) Version-2 author: John Snow Labs name: electra_qa_small_discriminator_finetuned_squad_2 date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-small-discriminator-finetuned-squad-finetuned-squad` is an English model originally trained by `bdickson`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_small_discriminator_finetuned_squad_2_en_4.0.0_3.0_1655921266307.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_small_discriminator_finetuned_squad_2_en_4.0.0_3.0_1655921266307.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_small_discriminator_finetuned_squad_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_small_discriminator_finetuned_squad_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.electra.small_v2.by_bdickson").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
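In the NLU one-liner above, a single string carries both fields, with `|||` separating question from context. Splitting such a string back into its parts is straightforward (an illustration of the string format only, not NLU's internal parsing):

```python
def split_qa(combined: str) -> tuple:
    """Split an NLU-style 'question|||context' string into (question, context)."""
    question, _, context = combined.partition("|||")
    return question.strip(), context.strip()

q, c = split_qa("What is my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What is my name?
print(c)  # My name is Clara and I live in Berkeley.
```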
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_small_discriminator_finetuned_squad_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|51.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/bdickson/electra-small-discriminator-finetuned-squad-finetuned-squad --- layout: model title: Korean BertForQuestionAnswering Cased model (from Taekyoon) author: John Snow Labs name: bert_qa_neg_komrc_train date: 2022-07-07 tags: [ko, open_source, bert, question_answering] task: Question Answering language: ko edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `neg_komrc_train` is a Korean model originally trained by `Taekyoon`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_neg_komrc_train_ko_4.0.0_3.0_1657190508421.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_neg_komrc_train_ko_4.0.0_3.0_1657190508421.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_neg_komrc_train","ko") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_neg_komrc_train","ko") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_neg_komrc_train| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ko| |Size:|406.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Taekyoon/neg_komrc_train --- layout: model title: English asr_Central_kurdish_xlsr TFWav2Vec2ForCTC from Akashpb13 author: John Snow Labs name: pipeline_asr_Central_kurdish_xlsr date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Central_kurdish_xlsr` is an English model originally trained by Akashpb13. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_Central_kurdish_xlsr_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_Central_kurdish_xlsr_en_4.2.0_3.0_1664103869492.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_Central_kurdish_xlsr_en_4.2.0_3.0_1664103869492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_Central_kurdish_xlsr', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_Central_kurdish_xlsr", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_Central_kurdish_xlsr| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Summarize Clinical Notes (PubMed) author: John Snow Labs name: summarizer_biomedical_pubmed date: 2023-04-03 tags: [licensed, en, clinical, summarization, tensorflow] task: Summarization language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true engine: tensorflow annotator: MedicalSummarizer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a modified Flan-T5-based (LLM) summarization model, finetuned on biomedical datasets (PubMed abstracts) by John Snow Labs. It can generate summaries of up to 512 tokens from an input text of up to 1024 tokens. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/BIOMEDICAL_TEXT_SUMMARIZATION/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/32.Medical_Text_Summarization.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_biomedical_pubmed_en_4.3.2_3.0_1680523127672.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_biomedical_pubmed_en_4.3.2_3.0_1680523127672.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") summarizer = MedicalSummarizer.pretrained("summarizer_biomedical_pubmed", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("summary")\ .setMaxTextLength(512)\ .setMaxNewTokens(512) pipeline = sparknlp.base.Pipeline(stages=[ document_assembler, summarizer ]) text = """Residual disease after initial surgery for ovarian cancer is the strongest prognostic factor for survival. However, the extent of surgical resection required to achieve optimal cytoreduction is controversial. Our goal was to estimate the effect of aggressive surgical resection on ovarian cancer patient survival.\\n A retrospective cohort study of consecutive patients with International Federation of Gynecology and Obstetrics stage IIIC ovarian cancer undergoing primary surgery was conducted between January 1, 1994, and December 31, 1998. The main outcome measures were residual disease after cytoreduction, frequency of radical surgical resection, and 5-year disease-specific survival.\\n The study comprised 194 patients, including 144 with carcinomatosis. The mean patient age and follow-up time were 64.4 and 3.5 years, respectively. After surgery, 131 (67.5%) of the 194 patients had less than 1 cm of residual disease (definition of optimal cytoreduction). Considering all patients, residual disease was the only independent predictor of survival; the need to perform radical procedures to achieve optimal cytoreduction was not associated with a decrease in survival. For the subgroup of patients with carcinomatosis, residual disease and the performance of radical surgical procedures were the only independent predictors. 
Disease-specific survival was markedly improved for patients with carcinomatosis operated on by surgeons who most frequently used radical procedures compared with those least likely to use radical procedures (44% versus 17%, P < .001).\\n Overall, residual disease was the only independent predictor of survival. Minimizing residual disease through aggressive surgical resection was beneficial, especially in patients with carcinomatosis.""" data = spark.createDataFrame([[text]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val summarizer = MedicalSummarizer.pretrained("summarizer_biomedical_pubmed", "en", "clinical/models") .setInputCols("document") .setOutputCol("summary") .setMaxTextLength(512) .setMaxNewTokens(512) val pipeline = new Pipeline().setStages(Array(document_assembler, summarizer)) val text = """Residual disease after initial surgery for ovarian cancer is the strongest prognostic factor for survival. However, the extent of surgical resection required to achieve optimal cytoreduction is controversial. Our goal was to estimate the effect of aggressive surgical resection on ovarian cancer patient survival.\\n A retrospective cohort study of consecutive patients with International Federation of Gynecology and Obstetrics stage IIIC ovarian cancer undergoing primary surgery was conducted between January 1, 1994, and December 31, 1998. The main outcome measures were residual disease after cytoreduction, frequency of radical surgical resection, and 5-year disease-specific survival.\\n The study comprised 194 patients, including 144 with carcinomatosis. The mean patient age and follow-up time were 64.4 and 3.5 years, respectively. After surgery, 131 (67.5%) of the 194 patients had less than 1 cm of residual disease (definition of optimal cytoreduction). 
Considering all patients, residual disease was the only independent predictor of survival; the need to perform radical procedures to achieve optimal cytoreduction was not associated with a decrease in survival. For the subgroup of patients with carcinomatosis, residual disease and the performance of radical surgical procedures were the only independent predictors. Disease-specific survival was markedly improved for patients with carcinomatosis operated on by surgeons who most frequently used radical procedures compared with those least likely to use radical procedures (44% versus 17%, P < .001).\\n Overall, residual disease was the only independent predictor of survival. Minimizing residual disease through aggressive surgical resection was beneficial, especially in patients with carcinomatosis.""" val data = Seq(text).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
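The card caps the summarizer's input at 1024 tokens, so longer clinical notes need to be split before they reach the pipeline. The helper below is a hypothetical pre-chunking sketch that uses a whitespace tokenizer as a rough proxy for the model's real subword tokenizer, so the 1024 figure should be treated as approximate; overlapping windows keep sentences cut at a boundary intact in at least one window.

```python
def chunk_text(text, max_tokens=1024, overlap=64):
    """Split `text` into whitespace-token windows of at most `max_tokens`,
    sharing `overlap` tokens between consecutive windows.
    Assumes overlap < max_tokens (otherwise the loop would not advance)."""
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return [text]
    chunks, start, step = [], 0, max_tokens - overlap
    while start < len(tokens):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        start += step
    return chunks
```

Each chunk can then be loaded as a separate row of the `text` column before calling `pipeline.fit(data).transform(data)`, and the per-window summaries concatenated or re-summarized afterwards.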
## Results ```bash ['The results of this review suggest that aggressive ovarian cancer surgery is associated with a significant reduction in the risk of recurrence and a reduction in the number of radical versus conservative surgical resections. However, the results of this review are based on only one small trial. Further research is needed to determine the role of aggressive ovarian cancer surgery in women with stage IIIC ovarian cancer.'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_biomedical_pubmed| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|920.0 MB| --- layout: model title: Arabic BertForMaskedLM Base Cased model (from CAMeL-Lab) author: John Snow Labs name: bert_embeddings_base_arabic_camel_msa_sixteenth date: 2022-12-02 tags: [ar, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ar edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-arabic-camelbert-msa-sixteenth` is an Arabic model originally trained by `CAMeL-Lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabic_camel_msa_sixteenth_ar_4.2.4_3.0_1670016205522.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabic_camel_msa_sixteenth_ar_4.2.4_3.0_1670016205522.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabic_camel_msa_sixteenth","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabic_camel_msa_sixteenth","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
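After running the pipeline above, each row of the `embeddings` column carries one float vector per token. A common downstream use is comparing two such vectors with cosine similarity; the helper below is a plain-Python sketch of that comparison, applied after you have collected the vectors out of the result DataFrame.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length float vectors;
    returns a value in [-1, 1], where 1 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```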
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_arabic_camel_msa_sixteenth| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|409.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa-sixteenth - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://catalog.ldc.upenn.edu/LDC2011T11 - http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus - https://vlo.clarin.eu/search;jsessionid=31066390B2C9E8C6304845BA79869AC1?1&q=osian - https://archive.org/details/arwiki-20190201 - https://oscar-corpus.com/ - https://github.com/google-research/bert - https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/tokenization.py#L286-L297 - https://github.com/CAMeL-Lab/camel_tools - https://github.com/CAMeL-Lab/CAMeLBERT --- layout: model title: Greek BertForQuestionAnswering Cased model (from Danastos) author: John Snow Labs name: bert_qa_squad_el_4 date: 2022-07-07 tags: [el, open_source, bert, question_answering] task: Question Answering language: el edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squad_bert_el_4` is a Greek model originally trained by `Danastos`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_squad_el_4_el_4.0.0_3.0_1657192859114.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_squad_el_4_el_4.0.0_3.0_1657192859114.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_squad_el_4","el") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Ποιο είναι το όνομά μου?", "Το όνομά μου είναι Κλάρα και μένω στο Μπέρκλεϊ."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_squad_el_4","el") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Ποιο είναι το όνομά μου?", "Το όνομά μου είναι Κλάρα και μένω στο Μπέρκλεϊ.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_squad_el_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|el| |Size:|421.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Danastos/squad_bert_el_4 --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186 TFWav2Vec2ForCTC from Sarahliu186 author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186` is an English model originally trained by Sarahliu186. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186_en_4.2.0_3.0_1664114380208.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186_en_4.2.0_3.0_1664114380208.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
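Wav2Vec2 models are trained on 16 kHz mono audio, so the float array fed into the `audio_content` column should be at that rate. If your source audio uses a different sample rate, resample it first; the naive linear-interpolation sketch below illustrates the idea (for real workloads, a proper DSP library such as librosa or torchaudio is the better choice).

```python
def resample_linear(samples, src_rate, dst_rate=16000):
    """Naive linear-interpolation resampler for a mono float waveform.
    Good enough to illustrate the idea; not production-quality DSP."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate   # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```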
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab_by_Sarahliu186| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: Fast Neural Machine Translation Model from English to Southern Sotho author: John Snow Labs name: opus_mt_en_st date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, st, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `st` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_st_xx_2.7.0_2.4_1609168171650.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_st_xx_2.7.0_2.4_1609168171650.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_st", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_st", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.st').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_st| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Slovenian CamemBert Embeddings (from EMBEDDIA) author: John Snow Labs name: camembert_embeddings_sloberta date: 2022-05-31 tags: [sl, open_source, camembert, embeddings] task: Embeddings language: sl edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `sloberta` is a Slovenian model originally trained by `EMBEDDIA`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_sloberta_sl_3.4.4_3.0_1653991888839.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_sloberta_sl_3.4.4_3.0_1653991888839.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_sloberta","sl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Obožujem Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_sloberta","sl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Obožujem Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_sloberta| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|sl| |Size:|266.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/EMBEDDIA/sloberta - https://camembert-model.fr/ - https://github.com/clarinsi/Slovene-BERT-Tool --- layout: model title: Detect Clinical Entities (ner_jsl_biobert) author: John Snow Labs name: ner_jsl_biobert date: 2021-09-05 tags: [clinical, licensed, en, ner] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.2.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terminology. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. This model is trained using BERT token embeddings `biobert_pubmed_base_cased`. Definitions of Predicted Entities: - `Injury_or_Poisoning`: Physical harm or injury caused to the body, including those caused by accidents, falls, or poisoning of a patient or someone else. - `Direction`: All the information relating to the laterality of the internal and external organs. - `Test`: Mentions of laboratory, pathology, and radiological tests. - `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient. - `Death_Entity`: Mentions that indicate the death of a patient. - `Relationship_Status`: State of the patient's romantic or social relationships (e.g. single, married, divorced). - `Duration`: The duration of a medical treatment or medication use. - `Respiration`: Number of breaths per minute. 
- `Hyperlipidemia`: Terms that indicate hyperlipidemia with relevant subtypes and synonyms. - `Birth_Entity`: Mentions that indicate giving birth. - `Age`: All mentions of ages, past or present, related to the patient or anybody else. - `Labour_Delivery`: Extractions include stages of labor and delivery. - `Family_History_Header`: Identifies section headers that correspond to Family History of the patient. - `BMI`: Numeric values and other text information related to Body Mass Index. - `Temperature`: All mentions that refer to body temperature. - `Alcohol`: Terms that indicate alcohol use, abuse or drinking issues of a patient or someone else. - `Kidney_Disease`: Terms that refer to any kidney diseases (includes mentions of modifiers such as "Acute" or "Chronic"). - `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else. - `Medical_History_Header`: Identifies section headers that correspond to Past Medical History of a patient. - `Cerebrovascular_Disease`: All terms that refer to cerebrovascular diseases and events. - `Oxygen_Therapy`: Breathing support triggered by patient or entirely or partially by machine (e.g. ventilator, BPAP, CPAP). - `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements. - `Psychological_Condition`: All mental health diagnoses, disorders, conditions or syndromes of a patient or someone else. - `Heart_Disease`: All mentions of acquired, congenital or degenerative heart diseases. - `Employment`: All mentions of patient or provider occupational titles and employment status. - `Obesity`: Terms related to a patient being obese (overweight and BMI are extracted as different labels). - `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.). 
- `Pregnancy`: All terms related to Pregnancy (excluding terms that are extracted with their specific labels, such as "Labour_Delivery" etc.). - `ImagingFindings`: All mentions of radiographic and imaging findings. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments. - `Medical_Device`: All mentions related to medical devices and supplies. - `Race_Ethnicity`: All terms that refer to racial and national origin of sociocultural groups. - `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital signs Headers are extracted separately with their specific labels). - `Symptom`: All the symptoms mentioned in the document, of a patient or someone else. - `Treatment`: Includes therapeutic and minimally invasive treatment and procedures (invasive treatments or procedures are extracted as "Procedure"). - `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs). - `Route`: Drug and medication administration routes available, as described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_Ingredient`: Active ingredient/s found in drug products. - `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted. - `Diet`: All mentions and information regarding the patient's dietary habits. - `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by naked eye. - `LDL`: All mentions related to the lab test and results for LDL (Low Density Lipoprotein). - `VS_Finding`: Qualitative data (e.g. Fever, Cyanosis, Tachycardia) and any other symptoms that refer to vital signs. - `Allergen`: Allergen related extractions mentioned in the document. 
- `EKG_Findings`: All mentions of EKG readings. - `Imaging_Technique`: All mentions of special radiographic views or special imaging techniques used in radiology. - `Triglycerides`: All mentions of terms related to the specific lab test for Triglycerides. - `RelativeTime`: Time references that are relative to different times or events (e.g. words such as "approximately", "in the morning"). - `Gender`: Gender-specific nouns and pronouns. - `Pulse`: Peripheral heart rate, without advanced information like measurement location. - `Social_History_Header`: Identifies section headers that correspond to Social History of a patient. - `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs). - `Diabetes`: All terms related to diabetes mellitus. - `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately. - `Internal_organ_or_component`: All mentions related to internal body parts or organs that cannot be examined by naked eye. - `Clinical_Dept`: Terms that indicate the medical and/or surgical departments. - `Form`: Drug and medication forms available, as described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients. - `Strength`: Potency of one unit of drug (or a combination of drugs); the measurement units available are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). 
- `Fetus_NewBorn`: All terms related to fetus, infant, new born (excluding terms that are extracted with their specific labels, such as "Labour_Delivery", "Pregnancy" etc.). - `RelativeDate`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "approximately two years ago", "about two days ago"). - `Height`: All mentions related to a patient's height. - `Test_Result`: Terms related to all the test results present in the document (clinical tests results are included). - `Sexually_Active_or_Sexual_Orientation`: All terms that are related to sexuality, sexual orientations and sexual activity. - `Frequency`: Frequency of administration for a dose prescribed. - `Time`: Specific time references (hour and/or minutes). - `Weight`: All mentions related to a patient's weight. - `Vaccine`: Generic and brand name of vaccines or vaccination procedure. - `Vital_Signs_Header`: Identifies section headers that correspond to Vital Signs of a patient. - `Communicable_Disease`: Includes all mentions of communicable diseases. - `Dosage`: Quantity prescribed by the physician for an active ingredient; the measurement units available are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Overweight`: Terms related to the patient being overweight (BMI and Obesity are extracted separately). - `Hypertension`: All terms related to Hypertension (quantitative data such as 150/100 is extracted as Blood_Pressure). - `HDL`: Terms related to the lab test for HDL (High Density Lipoprotein). - `Total_Cholesterol`: Terms related to the lab test and results for cholesterol. - `Smoking`: All mentions of smoking status of a patient. - `Date`: Mentions of an exact date, in any format, including day number, month and/or year. 
## Predicted Entities `Strength`, `Pregnancy_Delivery_Puerperium`, `Female_Reproductive_Status`, `Fetus_NewBorn`, `Age`, `Alcohol`, `Treatment`, `Internal_organ_or_component`, `Vital_Signs_Header`, `Dosage`, `Employment`, `Gender`, `Disease_Syndrome_Disorder`, `Pregnancy`, `Symptom`, `Clinical_Dept`, `Medical_Device`, `Temperature`, `Hypertension`, `Cerebrovascular_Disease`, `Psychological_Condition`, `Respiration`, `Direction`, `Metastasis`, `Injury_or_Poisoning`, `Birth_Entity`, `Allergen`, `Labour_Delivery`, `Overweight`, `Family_History_Header`, `Section_Header`, `Diabetes`, `Hyperlipidemia`, `Death_Entity`, `Route`, `Duration`, `Admission_Discharge`, `Total_Cholesterol`, `Performance_Status`, `LDL`, `RelativeDate`, `Test_Result`, `Height`, `Procedure`, `Date`, `Cancer_Modifier`, `BMI`, `External_body_part_or_region`, `Kidney_Disease`, `Modifier`, `Oncology_Therapy`, `Drug_BrandName`, `Form`, `Substance`, `Social_History_Header`, `Obesity`, `Oncological`, `Sexually_Active_or_Sexual_Orientation`, `EKG_Findings`, `Oxygen_Therapy`, `Frequency`, `Relationship_Status`, `Communicable_Disease`, `Imaging_Technique`, `Vaccine`, `Pulse`, `Tumor_Finding`, `Heart_Disease`, `Time`, `ImagingFindings`, `HDL`, `O2_Saturation`, `Weight`, `Medical_History_Header`, `Blood_Pressure`, `Puerperium`, `Smoking`, `Substance_Quantity`, `RelativeTime`, `Test`, `Race_Ethnicity`, `Diet`, `Staging`, `Triglycerides`, `Drug_Ingredient`, `VS_Finding` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_biobert_en_3.2.0_3.0_1630831908173.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_biobert_en_3.2.0_3.0_1630831908173.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") jsl_ner = MedicalNerModel.pretrained("ner_jsl_biobert", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("jsl_ner") jsl_ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "jsl_ner"]) \ .setOutputCol("ner_chunk") jsl_ner_pipeline = Pipeline().setStages([ documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter]) jsl_ner_model = jsl_ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature."]]).toDF("text") result = jsl_ner_model.transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val jsl_ner = MedicalNerModel.pretrained("ner_jsl_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("jsl_ner") val jsl_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "jsl_ner")) .setOutputCol("ner_chunk") val jsl_ner_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter)) val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text") val result = jsl_ner_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl.biobert").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
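The transform returns the recognized chunks with their labels in the `ner_chunk` column. As a minimal, Spark-free sketch of how such output might be post-processed, the snippet below groups (chunk, entity) pairs by entity label; the sample pairs are copied from the Results table of this card.

```python
# Hedged sketch (pure Python, no Spark required): group the (chunk, entity)
# pairs produced by the NER pipeline under their predicted entity label.
from collections import defaultdict

rows = [
    ("21-day-old", "Age"),
    ("Caucasian", "Race_Ethnicity"),
    ("male", "Gender"),
    ("congestion", "Symptom"),
    ("mom", "Gender"),
    ("Tylenol", "Drug_BrandName"),
]

def group_by_entity(pairs):
    """Collect chunks under their predicted entity label, keeping order."""
    grouped = defaultdict(list)
    for chunk, entity in pairs:
        grouped[entity].append(chunk)
    return dict(grouped)

grouped = group_by_entity(rows)
print(grouped["Gender"])  # ['male', 'mom']
```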
## Results ```bash | | chunk | entity | |---:|:------------------------------------------|:-----------------------------| | 0 | 21-day-old | Age | | 1 | Caucasian | Race_Ethnicity | | 2 | male | Gender | | 3 | for 2 days | Duration | | 4 | congestion | Symptom | | 5 | mom | Gender | | 6 | suctioning yellow discharge | Symptom | | 7 | nares | External_body_part_or_region | | 8 | she | Gender | | 9 | mild | Modifier | | 10 | problems with his breathing while feeding | Symptom | | 11 | perioral cyanosis | Symptom | | 12 | retractions | Symptom | | 13 | One day ago | RelativeDate | | 14 | mom | Gender | | 15 | tactile temperature | Symptom | | 16 | Tylenol | Drug_BrandName | | 17 | decreased p.o | Symptom | | 18 | His | Gender | | 19 | from 20 minutes q.2h. to 5 to 10 minutes | Frequency | | 20 | his | Gender | | 21 | respiratory congestion | Symptom | | 22 | He | Gender | | 23 | tired | Symptom | | 24 | fussy | Symptom | | 25 | over the past | RelativeDate | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_biobert| |Compatibility:|Healthcare NLP 3.2.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on data gathered and manually annotated by John Snow Labs. 
https://www.johnsnowlabs.com/data/ ## Benchmarking ```bash label tp fp fn prec rec f1 B-Oxygen_Therapy 114 41 38 0.7354839 0.75 0.742671 B-Cerebrovascular_Disease 42 16 19 0.7241379 0.6885246 0.7058824 B-Triglycerides 2 0 2 1 0.5 0.6666667 I-Cerebrovascular_Disease 17 11 17 0.60714287 0.5 0.54838705 B-Medical_Device 2568 334 400 0.88490695 0.8652291 0.8749574 B-Labour_Delivery 31 8 17 0.7948718 0.6458333 0.71264374 I-Vaccine 12 3 4 0.8 0.75 0.7741936 I-Obesity 4 2 2 0.6666667 0.6666667 0.6666667 B-RelativeTime 126 71 94 0.6395939 0.57272726 0.60431653 B-Heart_Disease 254 80 43 0.76047903 0.8552188 0.8050713 B-Procedure 2019 270 302 0.88204455 0.86988366 0.8759219 I-RelativeTime 183 93 44 0.6630435 0.8061674 0.72763425 B-Obesity 46 5 5 0.9019608 0.9019608 0.9019608 I-RelativeDate 629 125 76 0.8342175 0.89219856 0.8622344 B-O2_Saturation 51 28 28 0.6455696 0.6455696 0.6455696 B-Direction 3016 219 360 0.93230295 0.8933649 0.9124187 I-Alcohol 3 2 3 0.6 0.5 0.54545456 I-Oxygen_Therapy 91 67 28 0.5759494 0.7647059 0.6570397 B-Dosage 277 82 86 0.7715877 0.7630854 0.767313 B-Injury_or_Poisoning 336 56 86 0.85714287 0.79620856 0.8255528 B-Hypertension 104 9 2 0.920354 0.9811321 0.9497717 I-Test_Result 1173 101 119 0.9207221 0.90789473 0.9142634 B-Substance_Quantity 4 8 0 0.33333334 1 0.5 B-Alcohol 68 9 6 0.8831169 0.9189189 0.90066224 B-Height 19 10 11 0.6551724 0.6333333 0.64406776 I-Substance 10 2 6 0.8333333 0.625 0.71428573 B-RelativeDate 416 91 58 0.82051283 0.87763715 0.84811425 B-Admission_Discharge 245 12 7 0.9533074 0.9722222 0.9626719 B-Date 316 17 14 0.9489489 0.95757574 0.9532428 B-Kidney_Disease 68 13 23 0.83950615 0.74725276 0.7906977 I-Strength 505 50 46 0.9099099 0.9165154 0.91320074 I-Injury_or_Poisoning 255 73 132 0.777439 0.65891474 0.71328676 I-Drug_Ingredient 279 102 38 0.7322835 0.8801262 0.799427 I-Time 323 31 17 0.9124294 0.95 0.9308358 B-Substance 46 6 12 0.88461536 0.79310346 0.8363636 B-Total_Cholesterol 8 4 7 0.6666667 0.53333336 0.59259266 
I-Vital_Signs_Header 152 18 2 0.89411765 0.987013 0.9382716 I-Internal_organ_or_component 2755 490 350 0.8489985 0.88727856 0.8677165 B-Hyperlipidemia 37 7 3 0.84090906 0.925 0.8809524 I-Sexually_Active_or_Sexual_Orientation 5 0 0 1 1 1 B-Sexually_Active_or_Sexual_Orientation 5 0 2 1 0.71428573 0.8333334 I-Fetus_NewBorn 44 60 28 0.42307693 0.6111111 0.5 B-BMI 4 1 2 0.8 0.6666667 0.72727275 B-ImagingFindings 71 40 83 0.6396396 0.46103895 0.53584903 B-Drug_Ingredient 1636 235 222 0.8743987 0.8805167 0.877447 B-Test_Result 1369 180 188 0.883796 0.879255 0.8815196 B-Section_Header 2735 115 116 0.95964915 0.9593125 0.95948076 I-Treatment 84 28 35 0.75 0.7058824 0.7272728 B-Clinical_Dept 721 101 89 0.87712896 0.8901235 0.8835784 I-Kidney_Disease 106 9 7 0.9217391 0.9380531 0.9298245 I-Pulse 140 49 35 0.7407407 0.8 0.7692308 B-Test 2267 375 390 0.8580621 0.8532179 0.85563314 B-Weight 70 16 16 0.81395346 0.81395346 0.81395346 I-Respiration 61 5 28 0.92424244 0.6853933 0.78709674 I-EKG_Findings 50 38 44 0.5681818 0.5319149 0.5494506 I-Section_Header 1998 108 65 0.94871795 0.9684925 0.95850325 I-VS_Finding 36 31 29 0.53731346 0.5538462 0.5454546 B-Strength 541 51 54 0.9138514 0.9092437 0.9115417 I-Social_History_Header 43 3 5 0.9347826 0.8958333 0.9148936 B-Vital_Signs_Header 228 26 3 0.8976378 0.987013 0.94020617 B-Death_Entity 30 5 4 0.85714287 0.88235295 0.86956525 B-Modifier 2023 367 375 0.84644353 0.8436197 0.8450293 B-Blood_Pressure 110 23 32 0.8270677 0.7746479 0.8 I-O2_Saturation 93 56 29 0.62416106 0.76229507 0.6863469 B-Frequency 564 53 61 0.91410047 0.9024 0.9082126 I-Triglycerides 2 0 1 1 0.6666667 0.8 I-Duration 510 71 88 0.8777969 0.8528428 0.86513996 I-Diabetes 35 2 5 0.9459459 0.875 0.9090909 B-Race_Ethnicity 67 2 4 0.9710145 0.943662 0.9571429 I-Height 72 23 9 0.75789475 0.8888889 0.8181819 B-Communicable_Disease 12 5 8 0.7058824 0.6 0.6486487 I-Family_History_Header 57 3 1 0.95 0.98275864 0.9661017 B-LDL 1 0 2 1 0.33333334 0.5 B-Form 180 38 31 0.82568806 
0.8530806 0.8391608 I-Race_Ethnicity 2 1 0 0.6666667 1 0.8 B-Psychological_Condition 87 15 20 0.85294116 0.8130841 0.83253586 I-Drug_BrandName 25 8 18 0.75757575 0.5813953 0.6578947 I-Age 182 18 33 0.91 0.8465116 0.87710845 B-EKG_Findings 41 19 24 0.68333334 0.63076925 0.65599996 B-Employment 161 16 45 0.90960455 0.7815534 0.8407311 I-Oncological 338 32 62 0.91351354 0.845 0.8779221 B-Time 335 42 19 0.88859415 0.9463277 0.91655266 B-Treatment 98 43 63 0.69503546 0.6086956 0.6490066 B-Temperature 97 13 20 0.8818182 0.82905984 0.8546256 I-Procedure 2657 326 438 0.89071405 0.8584814 0.8743007 B-Relationship_Status 34 4 3 0.8947368 0.9189189 0.90666664 B-Pregnancy 51 25 21 0.67105263 0.7083333 0.68918914 B-Fetus_NewBorn 30 31 27 0.4918033 0.5263158 0.5084746 I-Total_Cholesterol 10 2 8 0.8333333 0.5555556 0.66666675 I-Route 205 16 21 0.9276018 0.90707964 0.91722596 I-Communicable_Disease 6 4 2 0.6 0.75 0.6666667 I-Medical_History_Header 116 5 10 0.9586777 0.9206349 0.9392713 B-Smoking 85 4 3 0.9550562 0.96590906 0.960452 I-Labour_Delivery 30 5 22 0.85714287 0.5769231 0.6896552 I-Death_Entity 4 1 1 0.8 0.8 0.8000001 B-Diabetes 87 5 5 0.9456522 0.9456522 0.9456522 B-HDL 1 1 0 0.5 1 0.6666667 B-Drug_BrandName 828 112 96 0.8808511 0.8961039 0.88841206 B-Gender 4420 61 62 0.98638695 0.9861669 0.98627687 B-Vaccine 13 0 8 1 0.61904764 0.7647059 I-Heart_Disease 315 145 27 0.6847826 0.92105263 0.7855362 I-Dosage 214 75 64 0.7404844 0.76978415 0.7548501 B-Social_History_Header 72 3 6 0.96 0.9230769 0.9411765 B-External_body_part_or_region 1759 194 376 0.90066564 0.8238876 0.8605675 I-Clinical_Dept 531 43 52 0.9250871 0.9108062 0.9178911 I-Test 1692 404 352 0.80725193 0.82778865 0.81739134 I-Frequency 445 66 61 0.8708415 0.8794466 0.87512296 B-Age 492 28 39 0.9461538 0.92655367 0.9362512 B-Pulse 86 31 30 0.73504275 0.7413793 0.7381974 I-Symptom 3072 1404 1050 0.6863271 0.7452693 0.71458477 I-Pregnancy 43 25 26 0.63235295 0.6231884 0.6277372 I-LDL 3 0 1 1 0.75 0.85714287 I-Diet 29 
15 26 0.65909094 0.5272727 0.5858585 I-Blood_Pressure 171 52 35 0.76681614 0.8300971 0.79720277 I-ImagingFindings 153 86 88 0.64016736 0.6348548 0.6375 I-Date 184 10 9 0.9484536 0.9533679 0.9509044 B-Route 726 77 80 0.9041096 0.90074444 0.9024239 B-Duration 212 29 50 0.87966806 0.8091603 0.84294236 B-Medical_History_Header 89 8 5 0.91752577 0.9468085 0.9319371 I-Metastasis 5 0 1 1 0.8333333 0.90909094 B-Respiration 49 10 18 0.8305085 0.73134327 0.77777773 I-External_body_part_or_region 431 49 133 0.8979167 0.7641844 0.82567054 I-BMI 13 2 3 0.8666667 0.8125 0.83870965 B-Internal_organ_or_component 4260 612 634 0.8743842 0.8704536 0.8724145 I-Weight 177 42 16 0.8082192 0.91709846 0.8592233 B-Disease_Syndrome_Disorder 2091 367 318 0.8506916 0.867995 0.85925627 B-Symptom 4752 913 803 0.83883494 0.85544556 0.84705883 B-VS_Finding 180 46 45 0.79646015 0.8 0.7982262 I-Disease_Syndrome_Disorder 1592 331 309 0.8278731 0.83745396 0.832636 I-Modifier 148 96 128 0.60655737 0.5362319 0.56923074 I-Medical_Device 1677 235 266 0.87709206 0.8630983 0.870039 B-Oncological 381 33 44 0.9202899 0.8964706 0.90822405 I-Temperature 154 12 34 0.92771083 0.81914896 0.8700565 I-Employment 82 19 30 0.8118812 0.73214287 0.76995313 I-Psychological_Condition 25 2 7 0.9259259 0.78125 0.8474576 B-Family_History_Header 58 2 2 0.96666664 0.96666664 0.96666664 I-Direction 189 29 49 0.8669725 0.7941176 0.8289474 I-HDL 1 2 0 0.33333334 1 0.5 Macro-average 69137 11083 11027 0.7179756 0.7057431 0.7118068 Micro-average 69137 11083 11027 0.8618424 0.8624444 0.86214334 ``` --- layout: model title: English asr_wav2vec2_base_100h_13K_steps TFWav2Vec2ForCTC from patrickvonplaten author: John Snow Labs name: asr_wav2vec2_base_100h_13K_steps date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: 
"Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_100h_13K_steps` is an English model originally trained by patrickvonplaten. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use `asr_wav2vec2_base_100h_13K_steps_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_100h_13K_steps_en_4.2.0_3.0_1664101202655.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_100h_13K_steps_en_4.2.0_3.0_1664101202655.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_100h_13K_steps", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_100h_13K_steps", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
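Both snippets assume an `audioDf` DataFrame whose `audio_content` column holds the raw audio as an array of floats. As a hedged, standard-library-only sketch of how such floats might be produced (the file layout is illustrative; Wav2Vec2 models generally expect 16 kHz mono audio), one could decode a 16-bit PCM WAV like this:

```python
# Hedged sketch: decode a 16-bit PCM WAV into floats normalized to [-1.0, 1.0),
# the kind of array an "audio_content" column would carry. The tiny synthetic
# clip written below is only for demonstration.
import struct
import tempfile
import wave

def wav_to_floats(path):
    """Read a 16-bit PCM WAV and return its samples scaled to [-1.0, 1.0)."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "expected 16-bit PCM"
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Round-trip check on a 4-sample synthetic mono clip at 16 kHz.
tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
tmp.close()
with wave.open(tmp.name, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))
floats = wav_to_floats(tmp.name)
print(floats[:2])  # [0.0, 0.5]
```

The resulting list could then be parallelized into a one-column Spark DataFrame to serve as `audioDf` (the exact schema expected by AudioAssembler is an assumption here).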
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_100h_13K_steps| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|349.2 MB| --- layout: model title: Greek BERT Base Uncased Embedding author: John Snow Labs name: bert_base_uncased date: 2021-09-07 tags: [greek, open_source, bert_embeddings, uncased, el] task: Embeddings language: el edition: Spark NLP 3.2.2 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A Greek version of the BERT pre-trained language model. The pre-training corpora of `bert-base-greek-uncased-v1` include the Greek part of Wikipedia, the Greek part of the European Parliament Proceedings Parallel Corpus, and the Greek part of OSCAR, a cleansed version of Common Crawl. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_uncased_el_3.2.2_3.0_1630999695036.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_uncased_el_3.2.2_3.0_1630999695036.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_base_uncased", "el") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_base_uncased", "el") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ```
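The pipeline emits one embedding vector per token. A common downstream step is comparing two embeddings by cosine similarity; the sketch below uses toy 3-dimensional vectors in place of real BERT output.

```python
# Hedged sketch: cosine similarity between two embedding vectors, a typical
# way to compare token or sentence embeddings. Toy vectors only.
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))  # ~1.0 for identical directions
```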
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_uncased| |Compatibility:|Spark NLP 3.2.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|el| |Case sensitive:|true| ## Data Source The model is imported from: https://huggingface.co/nlpaueb/bert-base-greek-uncased-v1 --- layout: model title: Legal Agreement Document Binary Classifier (Longformer) author: John Snow Labs name: legclf_agreement date: 2022-12-18 tags: [en, legal, classification, licensed, document, longformer, agreement, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_agreement` model is a Longformer Document Classifier used to classify whether a document belongs to the class `agreement` or not (Binary Classification). Longformers are limited to 4096 tokens, so only the first 4096 tokens of a document are taken into account. In our experience, for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document itself without extra leading material, 4096 tokens are enough for Document Classification. If your document exceeds 4096 tokens, you can split it into 4096-token chunks, average the chunk embeddings, and train on the averaged version, so that the whole document is taken into account. ## Predicted Entities `agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_agreement_en_1.0.0_3.0_1671393659850.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_agreement_en_1.0.0_3.0_1671393659850.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
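The chunk-and-average strategy mentioned in the description can be sketched in plain Python. `MAX_LEN` mirrors the Longformer limit; the embedding call itself is deliberately left out, so only the windowing and averaging steps are shown.

```python
# Hedged sketch of the chunk-and-average strategy: split a token sequence into
# windows of at most 4096 tokens, embed each window (not shown), and average
# the per-window embedding vectors element-wise.

MAX_LEN = 4096  # Longformer input limit

def chunks(tokens, size=MAX_LEN):
    """Split a token list into consecutive windows of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def average_vectors(vectors):
    """Element-wise mean of equal-length vectors (one per window)."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

tokens = ["tok"] * 10000          # a document longer than one window
windows = chunks(tokens)
print(len(windows))               # 3 windows: 4096 + 4096 + 1808 tokens
```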
## Results ```bash +-------+ |result| +-------+ |[agreement]| |[other]| |[other]| |[agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.5 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support agreement 0.88 0.85 0.86 99 other 0.93 0.94 0.94 207 accuracy - - 0.91 306 macro-avg 0.90 0.90 0.90 306 weighted-avg 0.91 0.91 0.91 306 ``` --- layout: model title: Sentiment Analysis for Urdu (IMDB Review dataset) author: John Snow Labs name: sentimentdl_urduvec_imdb date: 2020-12-01 task: Sentiment Analysis language: ur edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [sentiment, ur, open_source] supported: true annotator: SentimentDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Analyse sentiment in reviews by classifying them as ``positive``, ``negative`` or ``neutral``. This model is trained using ``urduvec_140M_300d`` word embeddings. The word embeddings are then converted to sentence embeddings before being fed to the sentiment classifier, which uses a DL architecture to classify sentences. {:.h2_title} ## Predicted Entities ``positive``, ``negative``, ``neutral``.
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentimentdl_urduvec_imdb_ur_2.7.0_2.4_1606817135630.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentimentdl_urduvec_imdb_ur_2.7.0_2.4_1606817135630.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, SentenceEmbeddings.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur")\ .setInputCols(["document", "token"])\ .setOutputCol("word_embeddings") embeddings = SentenceEmbeddings() \ .setInputCols(["document", "word_embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") classifier = SentimentDLModel.pretrained("sentimentdl_urduvec_imdb", "ur")\ .setInputCols(["document", "token", "sentence_embeddings"]).setOutputCol("sentiment") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, word_embeddings, embeddings, classifier]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) annotations = light_pipeline.fullAnnotate(["مجھے واقعی یہ شو سند ہے۔ یہی وجہ ہے کہ مجھے حال ہی میں یہ جان کر مایوسی ہوئی ہے کہ جارج لوپیز ایک ", "بالکل بھی اچھ ،ی کام نہیں کیا گیا ، پوری فلم صرف گرڈج تھی اور کہیں بھی بے ترتیب لوگوں کو ہلاک نہیں"]) ``` ```scala ... 
val word_embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur") .setInputCols(Array("document", "token")) .setOutputCol("word_embeddings") val embeddings = new SentenceEmbeddings() .setInputCols(Array("document", "word_embeddings")) .setOutputCol("sentence_embeddings") .setPoolingStrategy("AVERAGE") val classifier = SentimentDLModel.pretrained("sentimentdl_urduvec_imdb", "ur") .setInputCols(Array("document", "token", "sentence_embeddings")) .setOutputCol("sentiment") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, word_embeddings, embeddings, classifier)) val data = Seq("مجھے واقعی یہ شو سند ہے۔ یہی وجہ ہے کہ مجھے حال ہی میں یہ جان کر مایوسی ہوئی ہے کہ جارج لوپیز ایک ", "بالکل بھی اچھ ،ی کام نہیں کیا گیا ، پوری فلم صرف گرڈج تھی اور کہیں بھی بے ترتیب لوگوں کو ہلاک نہیں").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["مجھے واقعی یہ شو سند ہے۔ یہی وجہ ہے کہ مجھے حال ہی میں یہ جان کر مایوسی ہوئی ہے کہ جارج لوپیز ایک ", "بالکل بھی اچھ ،ی کام نہیں کیا گیا ، پوری فلم صرف گرڈج تھی اور کہیں بھی بے ترتیب لوگوں کو ہلاک نہیں"] urdusent_df = nlu.load('ur.sentiment').predict(text, output_level='sentence') urdusent_df ```
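`setPoolingStrategy("AVERAGE")` makes SentenceEmbeddings compute the element-wise mean of the word vectors. A minimal sketch, using toy 3-dimensional vectors in place of the real 300-dimensional `urduvec_140M_300d` embeddings:

```python
# Hedged sketch of AVERAGE pooling: the sentence vector is the element-wise
# mean of the word vectors. Toy 3-dim vectors stand in for real embeddings.
def average_pool(word_vectors):
    """Element-wise mean of a list of equal-length word vectors."""
    n = len(word_vectors)
    dim = len(word_vectors[0])
    return [sum(v[i] for v in word_vectors) / n for i in range(dim)]

sentence_vec = average_pool([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
print(sentence_vec)  # [2.0, 3.0, 4.0]
```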
## Results ```bash | | document | sentiment | |---:|---------------------------------------------------------------------------------------------------------:|--------------:| | 0 |مجھے واقعی یہ شو سند ہے۔ یہی وجہ ہے کہ مجھے حال ہی میں یہ جان کر مایوسی ہوئی ہے کہ جارج لوپیز ایک | positive | | 1 |بالکل بھی اچھ ،ی کام نہیں کیا گیا ، پوری فلم صرف گرڈج تھی اور کہیں بھی بے ترتیب لوگوں کو ہلاک نہیں | negative | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentimentdl_urduvec_imdb| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[sentiment]| |Language:|ur| |Dependencies:|urduvec_140M_300d| ## Data Source This model is trained using data from https://www.kaggle.com/akkefa/imdb-dataset-of-50k-movie-translated-urdu-reviews ## Benchmarking ```bash loss: 2428.622 - acc: 0.8181 - val_acc: 80.0 ``` --- layout: model title: English BertForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_8 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-32-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_8_en_4.0.0_3.0_1657192576326.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_8_en_4.0.0_3.0_1657192576326.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_8","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_8","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_8| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|377.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-32-finetuned-squad-seed-8 --- layout: model title: English DistilBertForQuestionAnswering model (from usami) author: John Snow Labs name: distilbert_qa_usami_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `usami`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_usami_base_uncased_finetuned_squad_en_4.0.0_3.0_1654726481958.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_usami_base_uncased_finetuned_squad_en_4.0.0_3.0_1654726481958.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_usami_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_usami_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_usami").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
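The NLU one-liner above packs the question and the context into a single string joined by `|||`. Splitting that format back apart is straightforward:

```python
# Hedged sketch: the NLU example joins question and context with the "|||"
# separator; splitting it back out recovers the two fields.
packed = "What is my name?|||My name is Clara and I live in Berkeley."
question, context = packed.split("|||", 1)
print(question)  # What is my name?
```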
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_usami_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/usami/distilbert-base-uncased-finetuned-squad --- layout: model title: German asr_exp_w2v2t_vp_s962 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2t_vp_s962 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2t_vp_s962` is a German model originally trained by jonatasgrosman. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use `asr_exp_w2v2t_vp_s962_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2t_vp_s962_de_4.2.0_3.0_1664111795635.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2t_vp_s962_de_4.2.0_3.0_1664111795635.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2t_vp_s962", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2t_vp_s962", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2t_vp_s962| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_8_h_256 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-8_H-256` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_8_h_256_zh_4.2.4_3.0_1670021776621.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_8_h_256_zh_4.2.4_3.0_1670021776621.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_8_h_256","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_8_h_256","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
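Downstream, the `embeddings` column is typically consumed as dense vectors, e.g. compared with cosine similarity. A minimal sketch in plain Python with hypothetical 4-dimensional vectors (real outputs of this model would be 256-dimensional, per the `H-256` in the model name):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [0.1, 0.3, -0.2, 0.5]   # hypothetical token embedding
b = [0.2, 0.28, -0.1, 0.4]  # hypothetical token embedding
print(cosine(a, a))  # identical vectors score ~1.0
print(cosine(a, b))
```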
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_8_h_256| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|45.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-8_H-256 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Translate Japanese to English Pipeline author: John Snow Labs name: translate_ja_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ja, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `ja` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ja_en_xx_2.7.0_2.4_1609698712644.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ja_en_xx_2.7.0_2.4_1609698712644.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ja_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ja_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ja.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ja_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForTokenClassification Cased model (from ml6team) author: John Snow Labs name: distilbert_token_classifier_keyphrase_extraction_kptimes date: 2023-03-14 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyphrase-extraction-distilbert-kptimes` is an English model originally trained by `ml6team`. ## Predicted Entities `KEY` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_kptimes_en_4.3.1_3.0_1678782856878.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_kptimes_en_4.3.1_3.0_1678782856878.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_kptimes","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_kptimes","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
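The `ner` output holds one tag per token; assuming the usual B-/I- tag scheme over the `KEY` entity, consecutive `B-KEY`/`I-KEY` tags can be grouped back into keyphrases. A minimal sketch in plain Python (the example tokens and tags are illustrative, not actual model output):

```python
def group_keyphrases(tokens, tags):
    """Collapse B-KEY/I-KEY tag runs into keyphrases (standard BIO grouping)."""
    phrases, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B-KEY":
            if current:
                phrases.append(" ".join(current))
            current = [tok]
        elif tag == "I-KEY" and current:
            current.append(tok)
        else:
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

tokens = ["New", "York", "mayor", "signs", "climate", "bill"]
tags   = ["B-KEY", "I-KEY", "O", "O", "B-KEY", "I-KEY"]
print(group_keyphrases(tokens, tags))  # -> ['New York', 'climate bill']
```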
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_keyphrase_extraction_kptimes| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ml6team/keyphrase-extraction-distilbert-kptimes - https://arxiv.org/abs/1911.12559 - https://paperswithcode.com/sota?task=Keyphrase+Extraction&dataset=kptimes --- layout: model title: English DistilBertForQuestionAnswering model (from rahulchakwate) author: John Snow Labs name: distilbert_qa_base_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-finetuned-squad` is an English model originally trained by `rahulchakwate`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_finetuned_squad_en_4.0.0_3.0_1654723698397.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_finetuned_squad_en_4.0.0_3.0_1654723698397.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
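In the NLU one-liner above, the question and context travel as a single string joined by a `|||` separator. A tiny sketch of that packing convention in plain Python (the helper names are hypothetical, for illustration only):

```python
SEP = "|||"  # separator used in the nlu QA example above

def join_qa(question, context):
    """Pack question and context into the single-string format."""
    return f"{question}{SEP}{context}"

def split_qa(packed):
    """Recover (question, context); split only on the first separator."""
    question, context = packed.split(SEP, 1)
    return question, context

packed = join_qa("What is my name?", "My name is Clara and I live in Berkeley.")
print(split_qa(packed))
```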
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/rahulchakwate/distilbert-base-finetuned-squad --- layout: model title: Legal Constitutional Judgment Court Decisions Classifier (Turkish) author: John Snow Labs name: legclf_constitutional_court_judgment_decisions date: 2023-05-12 tags: [tr, classification, licensed, legal, tensorflow] task: Text Classification language: tr edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a binary classification model that identifies two constitutional judgment labels (violation, no_violation) in Turkish court decisions. ## Predicted Entities `violation`, `no_violation` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_constitutional_court_judgment_decisions_tr_1.0.0_3.0_1683922167296.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_constitutional_court_judgment_decisions_tr_1.0.0_3.0_1683922167296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = nlp.Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = legal.BertForSequenceClassification.pretrained("legclf_constitutional_court_judgment_decisions", "tr", "legal/models")\ .setInputCols(["document", "token"])\ .setOutputCol("class") pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) # couple of simple examples example = spark.createDataFrame([["başvuru formu ve eklerinde ifade edildiği şekliyle ilgili olaylar özetle şöyledir başvurucu tarihli dilekçe ile piraziz bolu çorum ve şişli asliye ceza mahkemelerinden almış olduğu cezalara ilişkin kararların kesinleşip yerine getirildiğini infaz tarihlerinin üzerinden yıla yakın bir sürenin geçtiğini ileri sürerek memnu hakların iadesi talebinde bulunmuştur bulancak asliye ceza mahkemesi tarihli kararı ile talebi reddetmiştir kararın gerekçesinin ilgili kısmı şöyledir başvuru numarası karar tarihi hükümlünün uyap kayıtlarının incelenmesinden hakkında bulancak asliye ceza mahkemesinde tarihli esas karar sayılı ilamıyla petrol kaçakçılığı suçundan yapılan yargılamada atılı suçu işlediğinin sabit olmaması nedeniyle karar verildiği dosyanın temyiz edilerek yargıtaya gönderildiği ve henüz dönmediği anlaşılmıştır yasaklanmış hakların geri verilmesini düzenleyen sayılı adli sicil maddesi ile sayılı türk ceza kanunu dışındaki kanunların belli bir suçtan dolayı veya belli bir cezaya mahkumiyete bağladığı hak yoksunluklarının giderilebilmesi için yasaklanmış hakların geri verilmesi yoluna gidilebilir bunun için türk ceza kanununun üncü maddesinin beşinci ve altıncı fıkraları saklı kalmak kaydıyla a mahkum olunan cezanın infazının tamamlandığı tarihten itibaren yıllık bir sürenin geçmiş olması b kişinin bu süre zarfında yeni bir suç işlememiş olması ve hayatını iyi halli olarak 
sürdürdüğü hususunda mahkemede bir kanaat oluşması gerektiği hükmü getirilmiştir yukarıda açıklandığı üzere talep eden hükümlünün yukarıda tarih ve sayıları belirtilen cezaların infaz tarihlerinden sonra yıllık süre içerisinde suç işlenmemiş ise de yıllarda hakkında soruşturma ve kovuşturma yapıldığı bu kapsamda hayatını iyi halli olarak sürdürdüğü hususunda mahkememizde yeterli kanaat oluşmadığından talep yerinde görülmeyerek aşağıdaki şekilde hüküm kurulmuştur başvurucunun anılan karara itirazı giresun ağır ceza mahkemesinin tarihli kararıyla reddedilmiştir kararın gerekçesi şöyledir dosya ve eklerinin incelenmesinden ilgilinin adli sicil kaydının bulunmaması sabıka kaydında geçen kayıtların arşiv kaydı olması sayılı kanunun maddesine göre anayasanın maddesinde belirtilen suçlar için arşiv kaydının silinmesinin mümkün olmama talep sahibinin sabıka kaydında geçen suçların anayasa maddede sayılan suçlardan olması karşısında netice olarak vardığı sonuca göre usul ve yasaya uygun ola bulancak asliye ceza mahkemesinin tarih ve diş sayılı kararına yapılan itirazın reddine karar verilmiştir ret kararı tarihinde başvurucuya tebliğ edilmiştir başvurucu tarihinde bireysel başvuruda bulunmuştur iv hukuk tarihli ve sayılı adli sicil kanununun maddesinin ilgili kısmı şöyledir sayılı türk ceza kanunu dışındaki kanunların belli bir suçtan dolayı veya belli bir cezaya mahkumiyete bağladığı hak yoksunluklarının giderilebilmesi için yasaklanmış hakların geri verilmesi yoluna gidilebilir bunun için türk ceza kanununun üncü maddesinin beşinci ve altıncı fıkraları saklı kalmak kaydıyla a mahkum olunan cezanın infazının tamamlandığı tarihten itibaren üç yıllık bü sürenin geçmiş olması b kişinin bu süre zarfında yeni bir suç işlememiş olması ve hayatını iyi halli olarak sürdürdüğü hususunda mahkemede bir kanaat oluşması gerekir il v "]]).toDF("text") result = pipeline.fit(example).transform(example) # result is a DataFrame result.select("text", "class.result").show(truncate=100) ```
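As a sanity check on the Benchmarking section below: the macro and weighted averages follow directly from the per-label F1 scores and supports. A quick arithmetic check in plain Python, using the figures from that table:

```python
# (precision, recall, f1, support) per label, copied from the Benchmarking table
scores = {
    "no_violation": (0.71, 0.71, 0.71, 14),
    "violation":    (0.88, 0.88, 0.88, 34),
}

total_support = sum(s[3] for s in scores.values())                       # 48
macro_f1 = sum(s[2] for s in scores.values()) / len(scores)              # ~0.80 (unweighted mean)
weighted_f1 = sum(s[2] * s[3] for s in scores.values()) / total_support  # ~0.83 (support-weighted)

print(round(macro_f1, 2), round(weighted_f1, 2))
```

The weighted average sits closer to the majority `violation` class, which is why it matches the reported accuracy of 0.83.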
## Results ```bash +----------------------------------------------------------------------------------------------------+-----------+ | text| result| +----------------------------------------------------------------------------------------------------+-----------+ |başvuru formu ve eklerinde ifade edildiği şekliyle ilgili olaylar özetle şöyledir başvurucu tarih...|[violation]| +----------------------------------------------------------------------------------------------------+-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_constitutional_court_judgment_decisions| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|tr| |Size:|628.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## Benchmarking ```bash label precision recall f1-score support no_violation 0.71 0.71 0.71 14 violation 0.88 0.88 0.88 34 accuracy - - 0.83 48 macro-avg 0.80 0.80 0.80 48 weighted-avg 0.83 0.83 0.83 48 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Lushai author: John Snow Labs name: opus_mt_en_lus date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, lus, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `lus` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_lus_xx_2.7.0_2.4_1609163547010.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_lus_xx_2.7.0_2.4_1609163547010.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_lus", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_lus", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.lus').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_lus| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering model (from arvalinno) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_indosquad_v2 date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-indosquad-v2` is an English model originally trained by `arvalinno`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_indosquad_v2_en_4.0.0_3.0_1654723913360.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_indosquad_v2_en_4.0.0_3.0_1654723913360.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_indosquad_v2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_indosquad_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased_v2").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_indosquad_v2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/arvalinno/distilbert-base-uncased-finetuned-indosquad-v2 --- layout: model title: English BertForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_10 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-16-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_10_en_4.0.0_3.0_1657191985061.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_10_en_4.0.0_3.0_1657191985061.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_10","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|375.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-16-finetuned-squad-seed-10 --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from machine2049) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_squad_876 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad_distilbert` is an English model originally trained by `machine2049`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad_876_en_4.3.0_3.0_1672773762864.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad_876_en_4.3.0_3.0_1672773762864.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_876","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_876","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_squad_876| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/machine2049/distilbert-base-uncased-finetuned-squad_distilbert --- layout: model title: Part of Speech for Farsi author: John Snow Labs name: pos_ud_perdt date: 2021-03-09 tags: [part_of_speech, open_source, farsi, pos_ud_perdt, fa] task: Part of Speech Tagging language: fa edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - ADP - NOUN - DET - ADJ - CCONJ - PUNCT - VERB - PROPN - PRON - INTJ - NUM - ADV - None - SCONJ - AUX - PART - X {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_perdt_fa_3.0.0_3.0_1615292265373.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_perdt_fa_3.0.0_3.0_1615292265373.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_perdt", "fa") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['سلام از John Ben Labs! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_perdt", "fa") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("سلام از John Ben Labs! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["سلام از John Ben Labs! "] token_df = nlu.load('fa.pos').predict(text) token_df ```
## Results ```bash token pos 0 سلام NOUN 1 از ADP 2 John NOUN 3 Ben PROPN 4 Labs VERB 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_perdt| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|fa| --- layout: model title: Detect Persons, Locations, Organizations and Misc Entities in Portuguese (WikiNER 840B 300) author: John Snow Labs name: wikiner_840B_300 date: 2020-05-10 task: Named Entity Recognition language: pt edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [ner, pt, open_source] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 840B 300 is trained with GloVe 840B 300 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_PT){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_PT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_840B_300_pt_2.5.0_2.4_1588495233642.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_840B_300_pt_2.5.0_2.4_1588495233642.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("wikiner_840B_300", "pt") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (nascido em 28 de outubro de 1955) é um magnata americano de negócios, desenvolvedor de software, investidor e filantropo. Ele é mais conhecido como co-fundador da Microsoft Corporation. Durante sua carreira na Microsoft, Gates ocupou os cargos de presidente, diretor executivo (CEO), presidente e diretor de arquitetura de software, além de ser o maior acionista individual até maio de 2014. Ele é um dos empreendedores e pioneiros mais conhecidos da revolução dos microcomputadores nas décadas de 1970 e 1980. Nascido e criado em Seattle, Washington, Gates co-fundou a Microsoft com o amigo de infância Paul Allen em 1975, em Albuquerque, Novo México; tornou-se a maior empresa de software de computador pessoal do mundo. Gates liderou a empresa como presidente e CEO até deixar o cargo em janeiro de 2000, mas ele permaneceu como presidente e tornou-se arquiteto-chefe de software. No final dos anos 90, Gates foi criticado por suas táticas de negócios, que foram consideradas anticompetitivas. Esta opinião foi confirmada por várias decisões judiciais. Em junho de 2006, Gates anunciou que iria passar para um cargo de meio período na Microsoft e trabalhar em período integral na Fundação Bill & Melinda Gates, a fundação de caridade privada que ele e sua esposa, Melinda Gates, estabeleceram em 2000. 
Ele gradualmente transferiu seus deveres para Ray Ozzie e Craig Mundie. Ele deixou o cargo de presidente da Microsoft em fevereiro de 2014 e assumiu um novo cargo como consultor de tecnologia para apoiar a recém-nomeada CEO Satya Nadella.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("wikiner_840B_300", "pt") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (nascido em 28 de outubro de 1955) é um magnata americano de negócios, desenvolvedor de software, investidor e filantropo. Ele é mais conhecido como co-fundador da Microsoft Corporation. Durante sua carreira na Microsoft, Gates ocupou os cargos de presidente, diretor executivo (CEO), presidente e diretor de arquitetura de software, além de ser o maior acionista individual até maio de 2014. Ele é um dos empreendedores e pioneiros mais conhecidos da revolução dos microcomputadores nas décadas de 1970 e 1980. Nascido e criado em Seattle, Washington, Gates co-fundou a Microsoft com o amigo de infância Paul Allen em 1975, em Albuquerque, Novo México; tornou-se a maior empresa de software de computador pessoal do mundo. Gates liderou a empresa como presidente e CEO até deixar o cargo em janeiro de 2000, mas ele permaneceu como presidente e tornou-se arquiteto-chefe de software. No final dos anos 90, Gates foi criticado por suas táticas de negócios, que foram consideradas anticompetitivas. Esta opinião foi confirmada por várias decisões judiciais. 
Em junho de 2006, Gates anunciou que iria passar para um cargo de meio período na Microsoft e trabalhar em período integral na Fundação Bill & Melinda Gates, a fundação de caridade privada que ele e sua esposa, Melinda Gates, estabeleceram em 2000. Ele gradualmente transferiu seus deveres para Ray Ozzie e Craig Mundie. Ele deixou o cargo de presidente da Microsoft em fevereiro de 2014 e assumiu um novo cargo como consultor de tecnologia para apoiar a recém-nomeada CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (nascido em 28 de outubro de 1955) é um magnata americano de negócios, desenvolvedor de software, investidor e filantropo. Ele é mais conhecido como co-fundador da Microsoft Corporation. Durante sua carreira na Microsoft, Gates ocupou os cargos de presidente, diretor executivo (CEO), presidente e diretor de arquitetura de software, além de ser o maior acionista individual até maio de 2014. Ele é um dos empreendedores e pioneiros mais conhecidos da revolução dos microcomputadores nas décadas de 1970 e 1980. Nascido e criado em Seattle, Washington, Gates co-fundou a Microsoft com o amigo de infância Paul Allen em 1975, em Albuquerque, Novo México; tornou-se a maior empresa de software de computador pessoal do mundo. Gates liderou a empresa como presidente e CEO até deixar o cargo em janeiro de 2000, mas ele permaneceu como presidente e tornou-se arquiteto-chefe de software. No final dos anos 90, Gates foi criticado por suas táticas de negócios, que foram consideradas anticompetitivas. Esta opinião foi confirmada por várias decisões judiciais. Em junho de 2006, Gates anunciou que iria passar para um cargo de meio período na Microsoft e trabalhar em período integral na Fundação Bill & Melinda Gates, a fundação de caridade privada que ele e sua esposa, Melinda Gates, estabeleceram em 2000. 
Ele gradualmente transferiu seus deveres para Ray Ozzie e Craig Mundie. Ele deixou o cargo de presidente da Microsoft em fevereiro de 2014 e assumiu um novo cargo como consultor de tecnologia para apoiar a recém-nomeada CEO Satya Nadella."""] ner_df = nlu.load('pt.ner.wikiner.glove.840B_300').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
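The `ner_converter` stage in the pipeline above groups the model's token-level IOB tags into entity chunks. A minimal sketch of that conversion (the helper name is illustrative, not Spark NLP's API):

```python
# Sketch of IOB-to-chunk conversion: B- starts a chunk, matching I- extends it.

def iob_to_chunks(tokens, tags):
    """Group B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["William", "Henry", "Gates", "III", "nasceu", "em", "Seattle"]
tags   = ["B-PER", "I-PER", "I-PER", "I-PER", "O", "O", "B-LOC"]
print(iob_to_chunks(tokens, tags))
# [('William Henry Gates III', 'PER'), ('Seattle', 'LOC')]
```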
{:.h2_title} ## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |William Henry Gates III|PER | |Ele |PER | |Microsoft Corporation |ORG | |Durante |PER | |Microsoft |ORG | |Gates |PER | |CEO |ORG | |Nascido |PER | |Seattle |LOC | |Washington |LOC | |Gates |PER | |Microsoft |ORG | |Paul Allen |PER | |Albuquerque |LOC | |Novo México |LOC | |Gates |PER | |CEO |ORG | |Gates |PER | |Esta opinião |MISC | |Gates |PER | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wikiner_840B_300| |Type:|ner| |Compatibility:| Spark NLP 2.5.0+| |Edition:|Official| |License:|Open Source| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|pt| |Case sensitive:|false| {:.h2_title} ## Data Source The model was trained based on data from [https://pt.wikipedia.org](https://pt.wikipedia.org) --- layout: model title: BERT Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on SST-2 author: John Snow Labs name: bert_wiki_books_sst2 date: 2021-08-30 tags: [en, bert_embeddings, sst_2_dataset, books_corpus_dataset, wikipedia_dataset, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses a BERT base architecture initialized from https://tfhub.dev/google/experts/bert/wiki_books/1 and fine-tuned on SST-2. This is a BERT base architecture but some changes have been made to the original training and export scheme based on more recent learnings. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_wiki_books_sst2_en_3.2.0_3.0_1630322384133.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_wiki_books_sst2_en_3.2.0_3.0_1630322384133.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_wiki_books_sst2", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_wiki_books_sst2", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.wiki_books_sst2').predict(text, output_level='token') embeddings_df ```
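This model emits one vector per token; a common downstream step is to mean-pool those into a single sentence vector. A minimal sketch with toy 3-dimensional vectors (the real model produces 768-dimensional vectors per token):

```python
# Mean-pool per-token embedding vectors into one sentence vector.

def mean_pool(token_vectors):
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

vectors = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]  # toy token embeddings
print(mean_pool(vectors))  # [2.0, 2.0, 2.0]
```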
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_wiki_books_sst2|
|Compatibility:|Spark NLP 3.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Case sensitive:|false|

## Data Source

[1]: [Wikipedia dataset](https://dumps.wikimedia.org/)
[2]: [BooksCorpus dataset](http://yknzhu.wixsite.com/mbweb)
[3]: [Stanford Sentiment Treebank (SST-2) dataset](https://nlp.stanford.edu/sentiment/index.html)

This model has been imported from: https://tfhub.dev/google/experts/bert/wiki_books/sst2/2

---
layout: model
title: Translate Rundi to English Pipeline
author: John Snow Labs
name: translate_run_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, run, en, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `run` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_run_en_xx_2.7.0_2.4_1609687390754.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_run_en_xx_2.7.0_2.4_1609687390754.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_run_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_run_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.run.translate_to.en').predict(text, output_level='sentence') translate_df ```
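Because translation cost grows quickly with sequence length, long inputs are usually split into sentences and translated one at a time (pretrained translation pipelines typically include a trained sentence detector for this). A naive regex sketch of that splitting step, for illustration only:

```python
import re

# Naive sentence splitter: a real pipeline uses a trained sentence detector,
# not a regex, but the idea is the same -- translate sentence by sentence.

def split_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(split_sentences("Hello there. How are you? Fine!"))
# ['Hello there.', 'How are you?', 'Fine!']
```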
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_run_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from Evelyn18) author: John Snow Labs name: roberta_qa_base_spanish_squades_modelo2 date: 2023-01-20 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-squades-modelo2` is a Spanish model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_modelo2_es_4.3.0_3.0_1674218501375.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_modelo2_es_4.3.0_3.0_1674218501375.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_modelo2","es")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_modelo2","es")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
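Under the hood, extractive QA models such as this one score every token as a candidate answer start and end, then return the best-scoring valid span from the context. A minimal sketch with toy scores (the numbers are invented for illustration):

```python
# Pick the (start, end) token span maximizing start_score + end_score,
# subject to start <= end and a maximum span length.

def best_span(start_scores, end_scores, max_len=15):
    best, span = float("-inf"), (0, 0)
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best:
                best, span = s + end_scores[j], (i, j)
    return span

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 2.0, 0.0, 0.0, 0.0, 0.0, 0.3, 0.0]  # toy scores
end   = [0.0, 0.1, 0.1, 1.8, 0.1, 0.0, 0.0, 0.0, 0.4, 0.0]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```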
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_spanish_squades_modelo2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|459.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/Evelyn18/roberta-base-spanish-squades-modelo2

---
layout: model
title: Stopwords Remover for Russian language (768 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, ru, open_source]
task: Stop Words Removal
language: ru
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_ru_3.4.1_3.0_1646672321516.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_ru_3.4.1_3.0_1646672321516.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

stop_words = StopWordsCleaner.pretrained("stopwords_iso","ru") \
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])

example = spark.createDataFrame([["Ты не лучше меня"]], ["text"])

results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val stop_words = StopWordsCleaner.pretrained("stopwords_iso","ru")
    .setInputCols(Array("token"))
    .setOutputCol("cleanTokens")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))

val data = Seq("Ты не лучше меня").toDF("text")
val results = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("ru.stopwords").predict("""Ты не лучше меня""")
```
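What the `StopWordsCleaner` stage does can be sketched in plain Python, using a tiny assumed subset of the 768-entry Russian list (the full list ships with the model; case-insensitive matching is assumed here, mirroring the annotator's default):

```python
# Drop tokens that appear in the stopword list; keep everything else.
# STOPWORDS is a tiny assumed subset of the real 768-entry ISO list.

STOPWORDS = {"ты", "не", "меня"}

def clean_tokens(tokens, stopwords=STOPWORDS):
    # Lowercase before matching, i.e. case-insensitive comparison.
    return [t for t in tokens if t.lower() not in stopwords]

print(clean_tokens(["Ты", "не", "лучше", "меня"]))  # ['лучше']
```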
## Results

```bash
+-------+
|result |
+-------+
|[лучше]|
+-------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|ru|
|Size:|4.1 KB|

---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab_by_nadaAlnada TFWav2Vec2ForCTC from nadaAlnada
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_nadaAlnada
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_nadaAlnada` is an English model originally trained by nadaAlnada.

NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_by_nadaAlnada_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_nadaAlnada_en_4.2.0_3.0_1664114129711.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_nadaAlnada_en_4.2.0_3.0_1664114129711.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_nadaAlnada', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_nadaAlnada", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_nadaAlnada|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: English RoBERTa Embeddings Base Cased model (from mrm8488)
author: John Snow Labs
name: roberta_embeddings_ruperta_base_finetuned_spa_constitution
date: 2022-07-14
tags: [en, open_source, roberta, embeddings]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RoBERTa Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `RuPERTa-base-finetuned-spa-constitution` is an English model originally trained by `mrm8488`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_ruperta_base_finetuned_spa_constitution_en_4.0.0_3.0_1657809796803.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_ruperta_base_finetuned_spa_constitution_en_4.0.0_3.0_1657809796803.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_ruperta_base_finetuned_spa_constitution","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_ruperta_base_finetuned_spa_constitution","en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_ruperta_base_finetuned_spa_constitution| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|472.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/mrm8488/RuPERTa-base-finetuned-spa-constitution --- layout: model title: Persian BertForQuestionAnswering Cased model (from newsha) author: John Snow Labs name: bert_qa_pquad date: 2022-07-07 tags: [fa, open_source, bert, question_answering] task: Question Answering language: fa edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `PQuAD` is a Persian model originally trained by `newsha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_pquad_fa_4.0.0_3.0_1657182255080.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_pquad_fa_4.0.0_3.0_1657182255080.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_pquad","fa") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["اسم من چیست؟", "نام من کلارا است و من در برکلی زندگی می کنم."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_pquad","fa")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("اسم من چیست؟", "نام من کلارا است و من در برکلی زندگی می کنم.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_pquad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|fa| |Size:|607.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/newsha/PQuAD --- layout: model title: Pipeline to Extract conditions and benefits from drug reviews author: John Snow Labs name: ner_supplement_clinical_pipeline date: 2023-03-14 tags: [licensed, ner, en, clinical] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_supplement_clinical](https://nlp.johnsnowlabs.com/2022/02/01/ner_supplement_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_supplement_clinical_pipeline_en_4.3.0_3.2_1678777179236.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_supplement_clinical_pipeline_en_4.3.0_3.2_1678777179236.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_supplement_clinical_pipeline", "en", "clinical/models") text = '''Excellent!. The state of health improves, nervousness disappears, and night sleep improves. It also promotes hair and nail growth. I recommend :''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_supplement_clinical_pipeline", "en", "clinical/models") val text = "Excellent!. The state of health improves, nervousness disappears, and night sleep improves. It also promotes hair and nail growth. I recommend :" val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-------------|--------:|------:|:------------|-------------:| | 0 | nervousness | 42 | 52 | CONDITION | 0.9999 | | 1 | night sleep | 70 | 80 | BENEFIT | 0.80775 | | 2 | hair | 109 | 112 | BENEFIT | 0.9997 | | 3 | nail growth | 118 | 128 | BENEFIT | 0.9997 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_supplement_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Cyberbullying Classifier Pipeline in Turkish texts author: John Snow Labs name: classifierdl_berturk_cyberbullying_pipeline date: 2022-06-27 tags: [tr, cyberbullying, pipeline, open_source] task: Sentiment Analysis language: tr edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pre-trained pipeline Identifies whether a Turkish text contains cyberbullying or not. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_berturk_cyberbullying_pipeline_tr_4.0.0_3.0_1656361913070.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_berturk_cyberbullying_pipeline_tr_4.0.0_3.0_1656361913070.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
pipeline = PretrainedPipeline("classifierdl_berturk_cyberbullying_pipeline", "tr")

result = pipeline.fullAnnotate("""Gidişin olsun, dönüşün olmasın inşallah senin..""")
```
```scala
val pipeline = new PretrainedPipeline("classifierdl_berturk_cyberbullying_pipeline", "tr")

val result = pipeline.fullAnnotate("Gidişin olsun, dönüşün olmasın inşallah senin..")(0)
```
## Results ```bash ["Negative"] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_berturk_cyberbullying_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|tr| |Size:|454.6 MB| ## Included Models - DocumentAssembler - TokenizerModel - NormalizerModel - StopWordsCleaner - LemmatizerModel - BertEmbeddings - SentenceEmbeddings - ClassifierDLModel --- layout: model title: Stopwords Remover for Azerbaijani language (140 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, az, open_source] task: Stop Words Removal language: az edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_az_3.4.1_3.0_1646672362326.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_az_3.4.1_3.0_1646672362326.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

stop_words = StopWordsCleaner.pretrained("stopwords_iso","az") \
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])

example = spark.createDataFrame([["Sən məndən yaxşı deyilsən"]], ["text"])

results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val stop_words = StopWordsCleaner.pretrained("stopwords_iso","az")
    .setInputCols(Array("token"))
    .setOutputCol("cleanTokens")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words))

val data = Seq("Sən məndən yaxşı deyilsən").toDF("text")
val results = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("az.stopwords").predict("""Sən məndən yaxşı deyilsən""")
```
## Results ```bash +------------------+ |result | +------------------+ |[məndən, deyilsən]| +------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|az| |Size:|1.8 KB| --- layout: model title: German XlmRoBertaForQuestionAnswering (from Gantenbein) author: John Snow Labs name: xlm_roberta_qa_ADDI_DE_XLM_R date: 2022-06-23 tags: [de, open_source, question_answering, xlmroberta] task: Question Answering language: de edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ADDI-DE-XLM-R` is a German model originally trained by `Gantenbein`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_ADDI_DE_XLM_R_de_4.0.0_3.0_1655982953437.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_ADDI_DE_XLM_R_de_4.0.0_3.0_1655982953437.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_ADDI_DE_XLM_R","de") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_ADDI_DE_XLM_R","de") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("de.answer_question.xlm_roberta").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_ADDI_DE_XLM_R| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|de| |Size:|778.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Gantenbein/ADDI-DE-XLM-R --- layout: model title: English ElectraForQuestionAnswering model (from rowan1224) author: John Snow Labs name: electra_qa_slp date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-slp` is an English model originally trained by `rowan1224`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_slp_en_4.0.0_3.0_1655921215706.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_slp_en_4.0.0_3.0_1655921215706.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_slp","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_slp","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.electra.by_rowan1224").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_slp| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|408.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/rowan1224/electra-slp --- layout: model title: Explain Document DL Pipeline for Farsi/Persian author: John Snow Labs name: recognize_entities_dl date: 2021-04-26 tags: [pipeline, ner, fa, open_source] task: Pipeline Public language: fa edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The recognize_entities_dl is a pretrained pipeline that performs basic text processing steps and recognizes entities. It covers most of the common text processing tasks on your dataframe. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_FA/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/3.SparkNLP_Pretrained_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/recognize_entities_dl_fa_3.0.0_3.0_1619451815476.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/recognize_entities_dl_fa_3.0.0_3.0_1619451815476.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('recognize_entities_dl', lang = 'fa') annotations = pipeline.fullAnnotate("""به گزارش خبرنگار ایرنا ، بر اساس تصمیم این مجمع ، محمد قمی نماینده مردم پاکدشت به عنوان رئیس و علی‌اکبر موسوی خوئینی و شمس‌الدین وهابی نمایندگان مردم تهران به عنوان نواب رئیس انتخاب شدند""")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("recognize_entities_dl", lang = "fa") val result = pipeline.fullAnnotate("""به گزارش خبرنگار ایرنا ، بر اساس تصمیم این مجمع ، محمد قمی نماینده مردم پاکدشت به عنوان رئیس و علی‌اکبر موسوی خوئینی و شمس‌الدین وهابی نمایندگان مردم تهران به عنوان نواب رئیس انتخاب شدند""")(0) ``` {:.nlu-block} ```python import nlu text = ["""به گزارش خبرنگار ایرنا ، بر اساس تصمیم این مجمع ، محمد قمی نماینده مردم پاکدشت به عنوان رئیس و علی‌اکبر موسوی خوئینی و شمس‌الدین وهابی نمایندگان مردم تهران به عنوان نواب رئیس انتخاب شدند"""] result_df = nlu.load('fa.recognize_entities_dl').predict(text) result_df ```
## Results ```bash | | document | sentence | token | clean_tokens | lemma | pos | embeddings | ner | entities | |---:|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------|:---------------|:---------|:------|:-------------|:------|:---------------------| | 0 | "به گزارش خبرنگار ایرنا ، بر اساس تصمیم این مجمع ، محمد قمی نماینده مردم پاکدشت به عنوان رئیس و علی‌اکبر موسوی خوئینی و شمس‌الدین وهابی نمایندگان مردم تهران به عنوان نواب رئیس انتخاب شدند | "به گزارش خبرنگار ایرنا ، بر اساس تصمیم این مجمع ، محمد قمی نماینده مردم پاکدشت به عنوان رئیس و علی‌اکبر موسوی خوئینی و شمس‌الدین وهابی نمایندگان مردم تهران به عنوان نواب رئیس انتخاب شدند | " | " | " | PUNCT | " | O | خبرنگار ایرنا | | 1 | | | به | گزارش | به | ADP | به | O | محمد قمی | | 2 | | | گزارش | خبرنگار | گزارش | NOUN | گزارش | O | پاکدشت | | 3 | | | خبرنگار | ایرنا | خبرنگار | NOUN | خبرنگار | B-ORG | علی‌اکبر موسوی خوئینی | | 4 | | | ایرنا | ، | ایرنا | PROPN | ایرنا | I-ORG | شمس‌الدین وهابی | | 5 | | | ، | اساس | ؛ | PUNCT | ، | O | تهران | | 6 | | | بر | تصمیم | بر | ADP | بر | O | | | 7 | | | اساس | این | اساس | NOUN | اساس | O | | | 8 | | | تصمیم | مجمع | تصمیم | NOUN | تصمیم | O | | | 9 | | | این | ، | این | DET | این | O | | | 10 | | | مجمع | محمد | مجمع | NOUN | مجمع | O | | | 11 | | | ، | قمی | ؛ | PUNCT | ، | O | | | 12 | | | محمد | نماینده | محمد | PROPN | محمد | B-PER | | | 13 | | | قمی | پاکدشت | قمی | PROPN | قمی | I-PER | | | 14 | | | نماینده | عنوان | نماینده | NOUN | نماینده | O | | | 15 | | | مردم | رئیس | مردم | NOUN | مردم | O | | | 16 | | | پاکدشت | علی‌اکبر | پاکدشت | PROPN | پاکدشت | B-LOC | | | 17 | | | به | موسوی | به | ADP | به | O | | | 18 | | | عنوان | 
خوئینی | عنوان | NOUN | عنوان | O | | | 19 | | | رئیس | شمس‌الدین | رئیس | NOUN | رئیس | O | | | 20 | | | و | وهابی | او | CCONJ | و | O | | | 21 | | | علی‌اکبر | نمایندگان | علی‌اکبر | PROPN | علی‌اکبر | B-PER | | | 22 | | | موسوی | تهران | موسوی | PROPN | موسوی | I-PER | | | 23 | | | خوئینی | عنوان | خوئینی | PROPN | خوئینی | I-PER | | | 24 | | | و | نواب | او | CCONJ | و | O | | | 25 | | | شمس‌الدین | رئیس | شمس‌الدین | PROPN | شمس‌الدین | B-PER | | | 26 | | | وهابی | انتخاب | وهابی | PROPN | وهابی | I-PER | | | 27 | | | نمایندگان | | نماینده | NOUN | نمایندگان | O | | | 28 | | | مردم | | مردم | NOUN | مردم | O | | | 29 | | | تهران | | تهران | PROPN | تهران | B-LOC | | | 30 | | | به | | به | ADP | به | O | | | 31 | | | عنوان | | عنوان | NOUN | عنوان | O | | | 32 | | | نواب | | نواب | NOUN | نواب | O | | | 33 | | | رئیس | | رئیس | NOUN | رئیس | O | | | 34 | | | انتخاب | | انتخاب | NOUN | انتخاب | O | | | 35 | | | شدند | | کرد#کن | VERB | شدند | O | | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|recognize_entities_dl| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|fa| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - StopWordsCleaner - LemmatizerModel - PerceptronModel - WordEmbeddingsModel - NerDLModel - NerConverter --- layout: model title: Latvian Lemmatizer author: John Snow Labs name: lemma date: 2020-07-29 23:37:00 +0800 task: Lemmatization language: lv edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [lemmatizer, lv] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. 
The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb#scrollTo=bbzEH9u7tdxR){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_lv_2.5.5_2.4_1596055006860.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_lv_2.5.5_2.4_1596055006860.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "lv") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Džons Snovs ir ne tikai ziemeļu karalis, bet arī angļu ārsts un anestēzijas un medicīniskās higiēnas attīstības līderis.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "lv") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Džons Snovs ir ne tikai ziemeļu karalis, bet arī angļu ārsts un anestēzijas un medicīniskās higiēnas attīstības līderis.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Džons Snovs ir ne tikai ziemeļu karalis, bet arī angļu ārsts un anestēzijas un medicīniskās higiēnas attīstības līderis."""] lemma_df = nlu.load('lv.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
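At inference time a pretrained lemmatizer of this kind behaves like a large form-to-lemma lookup learned from the treebank, falling back to the surface form for unseen words. A toy plain-Python sketch (the table entries are hypothetical illustrations, not the model's real dictionary):

```python
# Toy Latvian form → lemma table (hypothetical entries for illustration; the
# pretrained model covers the full lexicon learned from Universal Dependencies).
LEMMA_TABLE = {"ir": "būt", "snovs": "Snovs"}

def lemmatize(token):
    # Unknown forms fall back to the surface form itself.
    return LEMMA_TABLE.get(token.lower(), token)

print([lemmatize(t) for t in ["Džons", "Snovs", "ir"]])
# ['Džons', 'Snovs', 'būt']
```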
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=4, result='Džons', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=6, end=10, result='Snovs', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=12, end=13, result='būt', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=15, end=16, result='ne', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=18, end=22, result='tikai', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|lv| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: English BertForQuestionAnswering model (from peterhsu) author: John Snow Labs name: bert_qa_tf_bert_finetuned_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tf-bert-finetuned-squad` is an English model originally trained by `peterhsu`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_tf_bert_finetuned_squad_en_4.0.0_3.0_1654192400068.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_tf_bert_finetuned_squad_en_4.0.0_3.0_1654192400068.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tf_bert_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_tf_bert_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_peterhsu").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_tf_bert_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/peterhsu/tf-bert-finetuned-squad --- layout: model title: Text Detection author: John Snow Labs name: text_detection_v1 date: 2021-12-14 tags: [en, licensed] task: OCR Text Detection & Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description CRAFT (Character-Region Awareness For Text detection) is designed with a convolutional neural network that produces a character region score and an affinity score. The region score is used to localize individual characters in the image, and the affinity score is used to group the characters into a single instance. To compensate for the lack of character-level annotations, the model uses a weakly-supervised learning framework that estimates character-level ground truths in existing real word-level datasets. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/text_detection_v1_en_3.0.0_2.4_1639490832988.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/text_detection_v1_en_3.0.0_2.4_1639490832988.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python text_detector = ImageTextDetector.pretrained("text_detection_v1", "en", "clinical/ocr") text_detector.setInputCol("image") text_detector.setOutputCol("text_regions") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|text_detection_v1| |Type:|ocr| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Output Labels:|[text_regions]| |Language:|en| |Size:|77.1 MB| --- layout: model title: Mapping Entities (Disease or Syndrome) with Corresponding UMLS CUI Codes author: John Snow Labs name: umls_disease_syndrome_mapper date: 2022-07-11 tags: [umls, chunk_mapper, licensed, en] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps entities (Disease or Syndrome) with corresponding UMLS CUI codes. ## Predicted Entities `umls_code` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/umls_disease_syndrome_mapper_en_4.0.0_3.0_1657579514857.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/umls_disease_syndrome_mapper_en_4.0.0_3.0_1657579514857.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("clinical_ner") ner_model_converter = NerConverterInternal()\ .setInputCols("sentence", "token", "clinical_ner")\ .setOutputCol("ner_chunk") chunkerMapper = ChunkMapperModel.pretrained("umls_disease_syndrome_mapper", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings")\ .setRels(["umls_code"])\ .setLowerCase(True) mapper_pipeline = Pipeline().setStages([ document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_model_converter, chunkerMapper]) test_data = spark.createDataFrame([["A 35-year-old male with a history of obesity and gestational diabetes mellitus and acyclovir allergy"]]).toDF("text") result = mapper_pipeline.fit(test_data).transform(test_data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel .pretrained("ner_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) 
.setOutputCol("clinical_ner") val ner_model_converter = new NerConverterInternal() .setInputCols("sentence", "token", "clinical_ner") .setOutputCol("ner_chunk") val chunkerMapper = ChunkMapperModel .pretrained("umls_disease_syndrome_mapper", "en", "clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("mappings") .setRels(Array("umls_code")) .setLowerCase(true) val mapper_pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_model_converter, chunkerMapper)) val test_data = Seq("A 35-year-old male with a history of obesity and gestational diabetes mellitus and acyclovir allergy").toDF("text") val result = mapper_pipeline.fit(test_data).transform(test_data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.umls_disease_syndrome_mapper").predict("""A 35-year-old male with a history of obesity and gestational diabetes mellitus and acyclovir allergy""")
```
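Conceptually, the ChunkMapperModel is a dictionary lookup from (lower-cased) entity chunks to their target codes. A minimal plain-Python sketch using the three mappings from this example text:

```python
# Excerpt of the chunk → UMLS CUI mapping for the chunks in the example text;
# the full model covers the whole Disease-or-Syndrome category.
UMLS_MAP = {
    "obesity": "C0028754",
    "gestational diabetes mellitus": "C0085207",
    "acyclovir allergy": "C0571297",
}

def map_chunk(chunk):
    # setLowerCase(True) in the pipeline makes the lookup case-insensitive.
    return UMLS_MAP.get(chunk.lower())

print(map_chunk("Obesity"))
# C0028754
```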
## Results ```bash +-----------------------------+---------+ |ner_chunk |umls_code| +-----------------------------+---------+ |obesity |C0028754 | |gestational diabetes mellitus|C0085207 | |acyclovir allergy |C0571297 | +-----------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|umls_disease_syndrome_mapper| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|9.0 MB| ## References The `2022AA` UMLS dataset's `Disease or Syndrome` category. https://www.nlm.nih.gov/research/umls/index.html --- layout: model title: Spanish RobertaForSequenceClassification Base Cased model (from mrm8488) author: John Snow Labs name: roberta_sequence_classifier_bsc_base_spanish_diagnostics date: 2022-07-13 tags: [es, open_source, roberta, sequence_classification] task: Text Classification language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bsc-roberta-base-spanish-diagnostics` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_bsc_base_spanish_diagnostics_es_4.0.0_3.0_1657716038012.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_bsc_base_spanish_diagnostics_es_4.0.0_3.0_1657716038012.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_bsc_base_spanish_diagnostics","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, classifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_bsc_base_spanish_diagnostics","es") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, classifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_sequence_classifier_bsc_base_spanish_diagnostics| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|es| |Size:|442.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mrm8488/bsc-roberta-base-spanish-diagnostics --- layout: model title: English AlbertForQuestionAnswering XLarge model (from rahulchakwate) author: John Snow Labs name: albert_qa_xlarge_finetuned_squad date: 2022-06-24 tags: [en, open_source, albert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: AlBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `albert-xlarge-finetuned-squad` is a English model originally trained by `rahulchakwate`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_xlarge_finetuned_squad_en_4.0.0_3.0_1656063790707.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_xlarge_finetuned_squad_en_4.0.0_3.0_1656063790707.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_xlarge_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_xlarge_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.albert.xl").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_qa_xlarge_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|205.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/rahulchakwate/albert-xlarge-finetuned-squad --- layout: model title: English RobertaForQuestionAnswering (from deepset) author: John Snow Labs name: roberta_qa_tinyroberta_squad2_step1 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tinyroberta-squad2-step1` is a English model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_tinyroberta_squad2_step1_en_4.0.0_3.0_1655740055680.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_tinyroberta_squad2_step1_en_4.0.0_3.0_1655740055680.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_tinyroberta_squad2_step1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_tinyroberta_squad2_step1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.tiny.v2.by_deepset").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")  ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_tinyroberta_squad2_step1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|307.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/deepset/tinyroberta-squad2-step1 --- layout: model title: English BertForQuestionAnswering Base Cased model (from victorlee071200) author: John Snow Labs name: bert_qa_base_cased_finetuned_squad_v2 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased-finetuned-squad_v2` is an English model originally trained by `victorlee071200`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_cased_finetuned_squad_v2_en_4.0.0_3.0_1657182961027.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_cased_finetuned_squad_v2_en_4.0.0_3.0_1657182961027.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_cased_finetuned_squad_v2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_cased_finetuned_squad_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_cased_finetuned_squad_v2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/victorlee071200/bert-base-cased-finetuned-squad_v2 --- layout: model title: Part of Speech for Thai (pos_lst20) author: John Snow Labs name: pos_lst20 date: 2021-01-13 task: Part of Speech Tagging language: th edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [th, pos, open_source] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates the part of speech of tokens in a text. The parts of speech annotated include PR (pronoun), CC (connector), and 14 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.
## Predicted Entities | Tags | Name | Description | |------|---------------|-------------------------------------------------------------------------------------------------------| | AJ | Adjective | Attribute, modifier, or description of a noun | | AV | Adverb | Word that modifies or qualifies an adjective, verb, or another adverb | | AX | Auxiliary | Tense, aspect, mood, and voice | | CC | Connector | Conjunction and relative pronoun | | CL | Classifier | Class or measurement unit to which a noun or an action belongs | | FX | Prefix | Inflectional (nominalizer, adjectivizer, adverbializer, and courteous verbalizer), and derivational | | IJ | Interjection | Exclamation word | | NG | Negator | Word of negation | | NN | Noun | Person, place, thing, abstract concept, and proper name | | NU | Number | Quantity for counting and calculation | | PA | Particle | Politeness, intention, belief, question | | PR | Pronoun | Word used to refer to an element in the discourse | | PS | Preposition | Location, comparison, instrument, exemplification | | PU | Punctuation | Punctuation mark | | VV | Verb | Action, state, occurrence, and word that forms the predicate part | | XX | Others | Unknown category | {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_lst20_th_2.7.0_2.4_1610545897750.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_lst20_th_2.7.0_2.4_1610545897750.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an nlp pipeline after tokenization.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") word_segmenter = WordSegmenterModel.pretrained("wordseg_best", "th")\ .setInputCols(["sentence"])\ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_lst20", "th") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, word_segmenter, pos ]) example = spark.createDataFrame([['ส่วนผลกระทบจากโครงการดังกล่าวจะดำเนินการนอกเขตอุทยานแห่งชาตินอกพื้นที่ป่าอนุรักษ์']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val word_segmenter = WordSegmenterModel.pretrained("wordseg_best", "th") .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_lst20", "th") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter, pos)) val data = Seq("ส่วนผลกระทบจากโครงการดังกล่าวจะดำเนินการนอกเขตอุทยานแห่งชาตินอกพื้นที่ป่าอนุรักษ์").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["ส่วนผลกระทบจากโครงการดังกล่าวจะดำเนินการนอกเขตอุทยานแห่งชาตินอกพื้นที่ป่าอนุรักษ์"] pos_df = nlu.load('th.pos').predict(text, output_level = "token") pos_df ```
## Results ```bash +-------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+ |text |result | +-------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+ |ส่วน ผล กระทบ จาก โครงการ ดังกล่าว จะ ดำเนินการ นอก เขต อุทยาน แห่ง ชาติ นอก พื้นที่ ป่า อนุรักษ์|[CC, NN, VV, PS, NN, AJ, AX, VV, PS, NN, NN, PS, NN, PS, NN, NN, VV]| +-------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_lst20| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[pos]| |Language:|th| ## Data Source The model was trained on the [LST20 Corpus](https://aiat.or.th/lst20-corpus/) from National Electronics and Computer Technology Center (NECTEC). 
## Benchmarking ```bash | pos_tag | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | AJ | 0.73 | 0.66 | 0.69 | 4403 | | AV | 0.71 | 0.61 | 0.66 | 6722 | | AX | 0.76 | 0.75 | 0.76 | 7556 | | CC | 0.77 | 0.77 | 0.77 | 17613 | | CL | 0.68 | 0.63 | 0.65 | 3739 | | FX | 0.78 | 0.76 | 0.77 | 6918 | | IJ | 0.00 | 0.00 | 0.00 | 4 | | NG | 0.82 | 0.80 | 0.81 | 1694 | | NN | 0.82 | 0.81 | 0.81 | 58540 | | NU | 0.75 | 0.71 | 0.73 | 6256 | | PA | 0.74 | 0.84 | 0.79 | 194 | | PR | 0.76 | 0.75 | 0.76 | 2139 | | PS | 0.75 | 0.72 | 0.73 | 10886 | | PU | 0.42 | 0.80 | 0.55 | 4769 | | VV | 0.79 | 0.78 | 0.78 | 42586 | | XX | 0.00 | 0.00 | 0.00 | 26 | | accuracy | 0.77 | 174045 | | | | macro avg | 0.64 | 0.65 | 0.64 | 174045 | | weighted avg | 0.77 | 0.77 | 0.77 | 174045 | ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from bhan) author: John Snow Labs name: distilbert_qa_bhan_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `bhan`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_bhan_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770214884.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_bhan_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770214884.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bhan_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_bhan_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_bhan_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/bhan/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal Application of proceeds Clause Binary Classifier author: John Snow Labs name: legclf_application_of_proceeds_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `application-of-proceeds` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the tutorial linked above).
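As a rough illustration of the splitting advice above (not the workshop tutorial's actual code), the sketch below splits a document on blank lines and further chunks any paragraph exceeding the 512-token budget. The token count here is a simple whitespace approximation, not the model's real subword tokenizer:

```python
def split_for_classification(text, max_tokens=512):
    """Split a long legal document into clause-sized pieces.

    Paragraphs are separated on blank lines (multiline splitting);
    any paragraph whose whitespace-token count exceeds max_tokens is
    chunked further. This only approximates the model's tokenizer.
    """
    pieces = []
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        tokens = para.split()
        if len(tokens) <= max_tokens:
            pieces.append(para)
        else:
            # Chunk oversized paragraphs into max_tokens-sized windows.
            for i in range(0, len(tokens), max_tokens):
                pieces.append(" ".join(tokens[i:i + max_tokens]))
    return pieces
```

Each returned piece can then be passed as a row of the `clause_text` column that the classifier pipeline expects.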
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in the Models Hub, yielding a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `application-of-proceeds` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_application_of_proceeds_clause_en_1.0.0_3.2_1660122112997.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_application_of_proceeds_clause_en_1.0.0_3.2_1660122112997.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_application_of_proceeds_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
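The pipeline above handles a single clause type. When several binary clause classifiers are run over the same text, their labels can be collapsed into per-clause boolean flags; the sketch below is a plain-Python illustration with hypothetical classifier outputs, not Spark NLP API code:

```python
def clause_flags(predictions):
    """Collapse per-classifier labels into True/False flags.

    `predictions` maps a clause type to the label its binary
    classifier returned; "other" means the clause was not found.
    """
    return {clause: label != "other" for clause, label in predictions.items()}

# Hypothetical outputs from two clause classifiers:
flags = clause_flags({
    "application-of-proceeds": "application-of-proceeds",
    "indemnification": "other",
})
# flags["application-of-proceeds"] is True; flags["indemnification"] is False
```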
## Results ```bash +-------------------------+ |                   result| +-------------------------+ |[application-of-proceeds]| |[other]| |[other]| |[application-of-proceeds]| +-------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_application_of_proceeds_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support application-of-proceeds 0.95 0.95 0.95 37 other 0.98 0.98 0.98 111 accuracy - - 0.97 148 macro-avg 0.96 0.96 0.96 148 weighted-avg 0.97 0.97 0.97 148 ``` --- layout: model title: Detect Neoplasms author: John Snow Labs name: ner_neoplasms class: NerDLModel language: es repository: clinical/models date: 2020-07-08 task: Named Entity Recognition edition: Healthcare NLP 2.5.3 spark_version: 2.4 tags: [clinical,licensed,ner,es] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description The Named Entity Recognition annotator allows a generic model to be trained using a deep learning algorithm (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Neoplasms NER is a Named Entity Recognition model that annotates text to find references to tumors. The only entity it annotates is `MORFOLOGIA_NEOPLASIA` (malignant neoplasm). Neoplasms NER is trained with the `embeddings_scielowiki_300d` word embeddings model, so be sure to use the same embeddings in the pipeline.
## Predicted Entities ``MORFOLOGIA_NEOPLASIA`` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_TUMOR_ES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_neoplasms_es_2.5.3_2.4_1594168624415.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_neoplasms_es_2.5.3_2.4_1594168624415.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embed = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d","es","clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("word_embeddings") model = NerDLModel.pretrained("ner_neoplasms","es","clinical/models")\ .setInputCols("sentence","token","word_embeddings")\ .setOutputCol("ner") ... nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embed, model, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['HISTORIA DE ENFERMEDAD ACTUAL: El Sr. Smith es un hombre blanco veterano de 60 años con múltiples comorbilidades, que tiene antecedentes de cáncer de vejiga diagnosticado hace aproximadamente dos años por el Hospital VA. Allí se sometió a una resección. Debía ser ingresado en el Hospital de Día para una cistectomía. Fue visto en la Clínica de Urología y Clínica de Radiología el 02/04/2003. CURSO DE HOSPITAL: El Sr. Smith se presentó en el Hospital de Día antes de la cirugía de Urología. En evaluación, EKG, ecocardiograma fue anormal, se obtuvo una consulta de Cardiología. Luego se procedió a una resonancia magnética de estrés con adenosina cardíaca, la misma fue positiva para isquemia inducible, infarto subendocárdico inferolateral leve a moderado con isquemia peri-infarto. Además, se observa isquemia inducible en el tabique lateral inferior. El Sr. Smith se sometió a un cateterismo del corazón izquierdo, que reveló una enfermedad de las arterias coronarias de dos vasos. La RCA, proximal estaba estenosada en un 95% y la distal en un 80% estenosada. La LAD media estaba estenosada en un 85% y la LAD distal estaba estenosada en un 85%. Se colocaron cuatro stents de metal desnudo Multi-Link Vision para disminuir las cuatro lesiones al 0%. Después de la intervención, el Sr. 
Smith fue admitido en 7 Ardmore Tower bajo el Servicio de Cardiología bajo la dirección del Dr. Hart. El Sr. Smith tuvo un curso hospitalario post-intervención sin complicaciones. Se mantuvo estable para el alta hospitalaria el 07/02/2003 con instrucciones de tomar Plavix diariamente durante un mes y Urología está al tanto de lo mismo.']], ["text"])) ``` ```scala ... val embed = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d","es","clinical/models") .setInputCols(Array("document","token")) .setOutputCol("word_embeddings") val model = NerDLModel.pretrained("ner_neoplasms","es","clinical/models") .setInputCols("sentence","token","word_embeddings") .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embed, model, ner_converter)) val data = Seq("HISTORIA DE ENFERMEDAD ACTUAL: El Sr. Smith es un hombre blanco veterano de 60 años con múltiples comorbilidades, que tiene antecedentes de cáncer de vejiga diagnosticado hace aproximadamente dos años por el Hospital VA. Allí se sometió a una resección. Debía ser ingresado en el Hospital de Día para una cistectomía. Fue visto en la Clínica de Urología y Clínica de Radiología el 02/04/2003. CURSO DE HOSPITAL: El Sr. Smith se presentó en el Hospital de Día antes de la cirugía de Urología. En evaluación, EKG, ecocardiograma fue anormal, se obtuvo una consulta de Cardiología. Luego se procedió a una resonancia magnética de estrés con adenosina cardíaca, la misma fue positiva para isquemia inducible, infarto subendocárdico inferolateral leve a moderado con isquemia peri-infarto. Además, se observa isquemia inducible en el tabique lateral inferior. El Sr. Smith se sometió a un cateterismo del corazón izquierdo, que reveló una enfermedad de las arterias coronarias de dos vasos. La RCA, proximal estaba estenosada en un 95% y la distal en un 80% estenosada. La LAD media estaba estenosada en un 85% y la LAD distal estaba estenosada en un 85%. 
Se colocaron cuatro stents de metal desnudo Multi-Link Vision para disminuir las cuatro lesiones al 0%. Después de la intervención, el Sr. Smith fue admitido en 7 Ardmore Tower bajo el Servicio de Cardiología bajo la dirección del Dr. Hart. El Sr. Smith tuvo un curso hospitalario post-intervención sin complicaciones. Se mantuvo estable para el alta hospitalaria el 07/02/2003 con instrucciones de tomar Plavix diariamente durante un mes y Urología está al tanto de lo mismo.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.neoplasm").predict("""HISTORIA DE ENFERMEDAD ACTUAL: El Sr. Smith es un hombre blanco veterano de 60 años con múltiples comorbilidades, que tiene antecedentes de cáncer de vejiga diagnosticado hace aproximadamente dos años por el Hospital VA. Allí se sometió a una resección. Debía ser ingresado en el Hospital de Día para una cistectomía. Fue visto en la Clínica de Urología y Clínica de Radiología el 02/04/2003. CURSO DE HOSPITAL: El Sr. Smith se presentó en el Hospital de Día antes de la cirugía de Urología. En evaluación, EKG, ecocardiograma fue anormal, se obtuvo una consulta de Cardiología. Luego se procedió a una resonancia magnética de estrés con adenosina cardíaca, la misma fue positiva para isquemia inducible, infarto subendocárdico inferolateral leve a moderado con isquemia peri-infarto. Además, se observa isquemia inducible en el tabique lateral inferior. El Sr. Smith se sometió a un cateterismo del corazón izquierdo, que reveló una enfermedad de las arterias coronarias de dos vasos. La RCA, proximal estaba estenosada en un 95% y la distal en un 80% estenosada. La LAD media estaba estenosada en un 85% y la LAD distal estaba estenosada en un 85%. Se colocaron cuatro stents de metal desnudo Multi-Link Vision para disminuir las cuatro lesiones al 0%. Después de la intervención, el Sr. 
Smith fue admitido en 7 Ardmore Tower bajo el Servicio de Cardiología bajo la dirección del Dr. Hart. El Sr. Smith tuvo un curso hospitalario post-intervención sin complicaciones. Se mantuvo estable para el alta hospitalaria el 07/02/2003 con instrucciones de tomar Plavix diariamente durante un mes y Urología está al tanto de lo mismo.""") ```
## Results ```bash +------+--------------------+ |chunk |ner_label | +------+--------------------+ |cáncer|MORFOLOGIA_NEOPLASIA| +------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |----------------|----------------------------------| | Name: | ner_neoplasms | | Type: | NerDLModel | | Compatibility: | Spark NLP 2.5.3+ | | License: | Licensed | |Edition:|Official| |Input labels: | [sentence, token, word_embeddings] | |Output labels: | [ner] | | Language: | es | | Case sensitive: | False | | Dependencies: | embeddings_scielowiki_300d | {:.h2_title} ## Data Source Named Entity Recognition model for Neoplastic Morphology https://temu.bsc.es/cantemist/ --- layout: model title: Detect PHI for Deidentification purposes (Italian) author: John Snow Labs name: ner_deid_subentity date: 2022-03-25 tags: [deid, it, licensed] task: Named Entity Recognition language: it edition: Healthcare NLP 3.4.2 spark_version: 2.4 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow a generic model to be trained using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Italian) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 19 entities. This NER model is trained on an internally annotated custom dataset, a COVID-19 Italian de-identification research dataset making up 15% of the total data [(Catelli et al.)](https://ieeexplore.ieee.org/document/9335570), and several data augmentation mechanisms.
## Predicted Entities `DATE`, `AGE`, `SEX`, `PROFESSION`, `ORGANIZATION`, `PHONE`, `EMAIL`, `ZIP`, `STREET`, `CITY`, `COUNTRY`, `PATIENT`, `DOCTOR`, `HOSPITAL`, `MEDICALRECORD`, `SSN`, `IDNUM`, `USERNAME`, `URL` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_it_3.4.2_2.4_1648218077881.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_it_3.4.2_2.4_1648218077881.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "it")\ .setInputCols(["sentence", "token"])\ .setOutputCol("word_embeddings") clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "it", "clinical/models")\ .setInputCols(["sentence","token", "word_embeddings"])\ .setOutputCol("ner") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner]) text = ["Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015."] data = spark.createDataFrame([text]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "it") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "it", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner)) val text = "Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015." 
val data = Seq(text).toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("it.med_ner.deid_subentity").predict("""Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015.""") ```
## Results ```bash +-------------+----------+ | token| ner_label| +-------------+----------+ | Ho| O| | visto| O| | Gastone| B-PATIENT| |Montanariello| I-PATIENT| | (| O| | 49| B-AGE| | anni| O| | )| O| | riferito| O| | all| O| | '| O| | Ospedale|B-HOSPITAL| | San|I-HOSPITAL| | Camillo|I-HOSPITAL| | per| O| | diabete| O| | mal| O| | controllato| O| | con| O| | sintomi| O| | risalenti| O| | a| O| | marzo| B-DATE| | 2015| I-DATE| | .| O| +-------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity| |Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|it| |Size:|15.0 MB| ## References - Internally annotated corpus - [COVID-19 Italian de-identification dataset making up 15% of total data: R. Catelli, F. Gargiulo, V. Casola, G. De Pietro, H. Fujita and M. Esposito, "A Novel COVID-19 Data Set and an Effective Deep Learning Approach for the De-Identification of Italian Medical Records," in IEEE Access, vol. 9, pp. 
19097-19110, 2021, doi: 10.1109/ACCESS.2021.3054479.](https://ieeexplore.ieee.org/document/9335570) ## Benchmarking ```bash label tp fp fn total precision recall f1 PATIENT 263.0 29.0 25.0 288.0 0.9007 0.9132 0.9069 HOSPITAL 365.0 36.0 48.0 413.0 0.9102 0.8838 0.8968 DATE 1164.0 13.0 26.0 1190.0 0.989 0.9782 0.9835 ORGANIZATION 72.0 25.0 26.0 98.0 0.7423 0.7347 0.7385 URL 41.0 0.0 0.0 41.0 1.0 1.0 1.0 CITY 421.0 9.0 19.0 440.0 0.9791 0.9568 0.9678 STREET 198.0 4.0 6.0 204.0 0.9802 0.9706 0.9754 USERNAME 20.0 2.0 2.0 22.0 0.9091 0.9091 0.9091 SEX 753.0 26.0 21.0 774.0 0.9666 0.9729 0.9697 IDNUM 113.0 3.0 7.0 120.0 0.9741 0.9417 0.9576 EMAIL 148.0 0.0 0.0 148.0 1.0 1.0 1.0 ZIP 148.0 3.0 1.0 149.0 0.9801 0.9933 0.9867 MEDICALRECORD 19.0 3.0 6.0 25.0 0.8636 0.76 0.8085 SSN 13.0 1.0 1.0 14.0 0.9286 0.9286 0.9286 PROFESSION 316.0 28.0 53.0 369.0 0.9186 0.8564 0.8864 PHONE 53.0 0.0 2.0 55.0 1.0 0.9636 0.9815 COUNTRY 182.0 14.0 15.0 197.0 0.9286 0.9239 0.9262 DOCTOR 769.0 77.0 62.0 831.0 0.909 0.9254 0.9171 AGE 763.0 8.0 18.0 781.0 0.9896 0.977 0.9832 macro - - - - - - 0.9328 micro - - - - - - 0.9494 ``` --- layout: model title: English image_classifier_vit_pasta_pizza_ravioli ViTForImageClassification from nateraw author: John Snow Labs name: image_classifier_vit_pasta_pizza_ravioli date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_pasta_pizza_ravioli` is an English model originally trained by nateraw.
## Predicted Entities `pasta`, `pizza`, `ravioli` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pasta_pizza_ravioli_en_4.1.0_3.0_1660165878844.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pasta_pizza_ravioli_en_4.1.0_3.0_1660165878844.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_pasta_pizza_ravioli", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_pasta_pizza_ravioli", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
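The annotator writes the winning label to the `class` output column. Conceptually, its final step is a softmax over the model's per-image logits followed by an argmax over the three labels. A minimal, self-contained sketch of that step (the logit values below are illustrative only, not real model outputs):

```python
import math

labels = ["pasta", "pizza", "ravioli"]

def classify(logits):
    """Turn raw logits into (label, probability) via softmax + argmax."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

label, prob = classify([0.2, 3.1, -0.5])         # illustrative logits
```

In the pipeline above this collapse happens inside `ViTForImageClassification`; you never handle raw logits yourself.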
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_pasta_pizza_ravioli|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|

---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_efficient_small_dm1000
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-dm1000` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dm1000_en_4.3.0_3.0_1675118935697.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dm1000_en_4.3.0_3.0_1675118935697.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_small_dm1000","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_small_dm1000","en")
  .setInputCols("document")
  .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
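T5 produces the `answers` text autoregressively, one token at a time; under the default greedy strategy, the highest-scoring token is taken at each step until an end-of-sequence token appears. A toy sketch of that loop, where `toy_next_token_scores` is a hypothetical stand-in for the real decoder (which scores a full vocabulary):

```python
EOS = "</s>"  # T5's end-of-sequence marker

def toy_next_token_scores(prefix):
    # Hypothetical stand-in for the decoder: scores a few candidate
    # tokens given what has been generated so far.
    vocab = ["hello", "world", EOS]
    scores = {t: -float(len(prefix)) for t in vocab}
    scores[vocab[min(len(prefix), 2)]] = 1.0     # favour a different token each step
    return scores

def greedy_decode(max_len=10):
    """Pick the best-scoring token at every step until EOS or max_len."""
    out = []
    while len(out) < max_len:
        scores = toy_next_token_scores(out)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        out.append(best)
    return out

tokens = greedy_decode()
```

The real annotator hides all of this; the sketch only illustrates why generation length, not input length, dominates seq2seq inference cost.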
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_small_dm1000|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|295.2 MB|

## References

- https://huggingface.co/google/t5-efficient-small-dm1000
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145

---
layout: model
title: NER on Legal Texts (CUAD, Silver corpus)
author: John Snow Labs
name: legner_cuad_silver
date: 2022-08-09
tags: [en, legal, ner, cuad, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is a Legal Named Entity Recognition model, trained on a Silver version of the CUAD dataset. We call a corpus "Silver" when it is labelled using automatic labelling algorithms, rules, vocabularies, patterns and some predefined annotations. The entities included are:

- "PERSON": Person
- "LAW": Mentioned law
- "PARTY": A party signing the agreement
- "EFFDATE": Date of the agreement
- "LOC": A mentioned location
- "DATE": Another date, not EFFDATE
- "DOC": Type of the document
- "ORDINAL": An ordinal number
- "ROLE": A role of a person or party
- "PERCENT": A percentage
- "ORG": A generic tag for detecting organizations

You can find several models trained on Golden versions of this dataset (annotated by our JSL in-house domain experts) in Models Hub, in the Legal library.
## Predicted Entities `PERSON`, `LAW`, `PARTY`, `EFFDATE`, `LOC`, `DATE`, `DOC`, `ORDINAL`, `ROLE`, `PERCENT`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_cuad_silver_en_1.0.0_3.2_1660041713538.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_cuad_silver_en_1.0.0_3.2_1660041713538.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencizer = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences")\ .setExplodeSentences(True) tokenizer = nlp.Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("token") embeddings = nlp.WordEmbeddingsModel.pretrained("w2v_cc_300d", "en")\ .setInputCols(["sentences", "token"])\ .setOutputCol("embeddings") jsl_ner = legal.NerModel.pretrained("legner_cuad_silver", "en", "legal/models")\ .setInputCols(["sentences", "token", "embeddings"]) \ .setOutputCol("jsl_ner") jsl_ner_converter = nlp.NerConverter() \ .setInputCols(["sentences", "token", "jsl_ner"]) \ .setOutputCol("ner_chunk") jsl_ner_pipeline = nlp.Pipeline().setStages([ documentAssembler, sentencizer, tokenizer, embeddings, jsl_ner, jsl_ner_converter]) text = """December 2007 SUBORDINATED LOAN AGREEMENT. THIS LOAN AGREEMENT is made on 7th December, 2007 BETWEEN: (1) SILICIUM DE PROVENCE S.A.S., a private company with limited liability, incorporated under the laws of France, whose registered office is situated at Usine de Saint Auban, France, represented by Mr.Frank Wouters, hereinafter referred to as the "Borrower", and ( 2 ) EVERGREEN SOLAR INC., a company incorporated in Delaware, U.S.A., with registered number 2426798, whose registered office is situated at Bartlett Street, Marlboro, Massachusetts, U.S.A. represented by Richard Chleboski, hereinafter referred to as "Lender".""" df = spark.createDataFrame([[text]]).toDF("text") model = jsl_ner_pipeline.fit(df) res = model.transform(df) ```
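In the pipeline above, `NerConverter` assembles the token-level BIO tags into the `ner_chunk` entities. A minimal sketch of that merging logic (illustrative only; the actual annotator also tracks character offsets and confidence metadata):

```python
def bio_to_chunks(tokens, tags):
    """Merge BIO-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):                 # a new entity begins
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)                  # continue the open entity
        else:                                    # O tag (or inconsistent I-) closes it
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

chunks = bio_to_chunks(
    ["THIS", "LOAN", "AGREEMENT", "is"],
    ["O", "B-DOC", "I-DOC", "O"],
)
```

Run against the token-level output shown in the Results section below, this is how `LOAN AGREEMENT` emerges as a single `DOC` chunk.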
## Results ```bash +------------+---------+----------+ | token|ner_label|confidence| +------------+---------+----------+ | December| B-DATE| 0.4111| | 2007| B-DATE| 0.7867| |SUBORDINATED| O| 0.5373| | LOAN| B-DOC| 0.9998| | AGREEMENT| I-DOC| 0.8615| | .| O| 0.9695| | THIS| O| 0.9977| | LOAN| B-DOC| 0.9995| | AGREEMENT| I-DOC| 0.9982| | is| O| 0.8592| | made| O| 0.9975| | on| O| 0.9906| | 7th| B-DATE| 0.7804| | December| B-DATE| 0.6701| | ,| B-DATE| 0.5395| | 2007| B-DATE| 0.5327| | BETWEEN| O| 0.9771| | :| O| 0.9497| | (| O| 0.7493| | 1| O| 0.9081| | )| O| 0.4178| | SILICIUM| B-ORG| 0.6731| | DE| B-ORG| 0.3681| | PROVENCE| B-ORG| 0.5065| | S.A.S| B-ORG| 0.8924| | .,| O| 0.7006| | a| O| 0.9722| | private| O| 0.9938| | company| O| 0.9982| | with| O| 0.9958| | limited| O| 0.981| | liability| O| 0.9994| | ,| O| 0.9933| |incorporated| O| 0.9997| | under| O| 0.9597| | the| O| 0.9833| | laws| O| 0.9969| | of| O| 0.7129| | France| B-LOC| 0.8789| +------------+---------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_cuad_silver| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|15.0 MB| ## References Manual rules, patterns, weak-labelling, preannotations from in-house models and from CUAD dataset ## Benchmarking ```bash label tp fp fn prec rec f1 B-PERSON 89 11 11 0.89 0.89 0.89 B-LAW 759 111 148 0.8724138 0.8368247 0.8542487 I-PARTY 8632 47 23 0.9945846 0.9973426 0.9959617 B-EFFDATE 9 1 4 0.9 0.6923077 0.7826087 B-LOC 372 76 61 0.83035713 0.8591224 0.8444949 B-DATE 1020 104 102 0.9074733 0.90909094 0.9082814 B-DOC 1370 36 12 0.97439545 0.9913169 0.9827834 I-EFFDATE 14 0 0 1.0 1.0 1.0 I-DOC 2227 49 0 0.978471 1.0 0.98911834 B-ORDINAL 99 11 15 0.9 0.8684211 0.8839286 B-ROLE 228 6 0 0.974359 1.0 0.987013 B-PERCENT 34 4 0 0.8947368 1.0 0.9444445 B-ORG 1992 478 624 0.8064777 0.7614679 0.7833268 
B-PARTY       2275  39  82  0.9831461  0.96521  0.97409546
Macro-average 19120 973 1082 0.92188674 0.9122217 0.9170287
Micro-average 19120 973 1082 0.95157516 0.94644094 0.9490011
```

---
layout: model
title: English BertForMaskedLM Cased model (from beatrice-portelli)
author: John Snow Labs
name: bert_embeddings_dilbert
date: 2022-12-02
tags: [en, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `DiLBERT` is an English model originally trained by `beatrice-portelli`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_dilbert_en_4.2.4_3.0_1670015148603.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_dilbert_en_4.2.4_3.0_1670015148603.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_dilbert","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_dilbert","en")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
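Each row of the `embeddings` output column holds one dense vector per token. A common downstream use is comparing those vectors with cosine similarity; a minimal sketch with toy 4-dimensional vectors (real BERT base embeddings have 768 dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy vectors standing in for two token embeddings.
sim = cosine([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.5, 0.0])
```

Values near 1.0 indicate semantically similar tokens under the model; because this model is case sensitive, `DiLBERT` and `dilbert` may embed differently.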
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_dilbert|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|409.9 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/beatrice-portelli/DiLBERT
- https://github.com/KevinRoitero/dilbert

---
layout: model
title: English image_classifier_vit_test_model_a ViTForImageClassification from nateraw
author: John Snow Labs
name: image_classifier_vit_test_model_a
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_test_model_a` is an English model originally trained by nateraw.

## Predicted Entities

`cat`, `dog`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_test_model_a_en_4.1.0_3.0_1660168422748.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_test_model_a_en_4.1.0_3.0_1660168422748.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_test_model_a", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_test_model_a", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_test_model_a|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|273.1 KB|

---
layout: model
title: German asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350 TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350` is a German model originally trained by jonatasgrosman.

NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350_de_4.2.0_3.0_1664103230354.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350_de_4.2.0_3.0_1664103230354.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
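`Wav2Vec2ForCTC` predicts one label per audio frame and then collapses the frame sequence into text: consecutive repeats are merged and CTC blank symbols are dropped. A toy sketch of that collapse step (the `_` blank symbol is a placeholder, not the model's actual vocabulary):

```python
BLANK = "_"  # stand-in for the CTC blank symbol

def ctc_collapse(frame_labels):
    """Collapse a per-frame CTC label sequence: merge repeats, drop blanks."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return "".join(out)

# A blank between two identical labels keeps a real double letter.
decoded = ctc_collapse(list("hh_aa_ll__ll_oo"))
```

This is why CTC models need no frame-level alignment during training: many frame sequences collapse to the same transcript.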
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s350|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|de|
|Size:|1.2 GB|

---
layout: model
title: English BertForMaskedLM Cased model (from philschmid)
author: John Snow Labs
name: bert_embeddings_fin_pretrain_yiyanghkust
date: 2022-12-06
tags: [en, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `finbert-pretrain-yiyanghkust` is an English model originally trained by `philschmid`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_fin_pretrain_yiyanghkust_en_4.2.4_3.0_1670326335691.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_fin_pretrain_yiyanghkust_en_4.2.4_3.0_1670326335691.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_fin_pretrain_yiyanghkust","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_fin_pretrain_yiyanghkust","en")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
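The `embeddings` column holds one vector per token; for document-level tasks such as classification or retrieval, token vectors are often averaged into a single sentence vector. A minimal mean-pooling sketch with toy 2-dimensional vectors:

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token embeddings into one vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Three toy token embeddings -> one pooled sentence embedding.
sentence_vec = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Spark NLP also offers dedicated sentence-embedding annotators for this; the sketch only shows the underlying idea.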
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_fin_pretrain_yiyanghkust|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|412.2 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/philschmid/finbert-pretrain-yiyanghkust
- https://arxiv.org/abs/2006.08097

---
layout: model
title: English asr_wav2vec2_base_cynthia_tedlium_2500_v2 TFWav2Vec2ForCTC from huyue012
author: John Snow Labs
name: asr_wav2vec2_base_cynthia_tedlium_2500_v2
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_cynthia_tedlium_2500_v2` is an English model originally trained by huyue012.

NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_cynthia_tedlium_2500_v2_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_cynthia_tedlium_2500_v2_en_4.2.0_3.0_1664040506688.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_cynthia_tedlium_2500_v2_en_4.2.0_3.0_1664040506688.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_cynthia_tedlium_2500_v2", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_cynthia_tedlium_2500_v2", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
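Wav2Vec2 models of this family are typically trained on 16 kHz mono audio with float samples in [-1.0, 1.0]. If your source audio is signed 16-bit PCM, the samples need rescaling before they go into `audio_content`; a minimal sketch (this assumes the audio is already mono and at 16 kHz — resampling is out of scope here):

```python
def pcm16_to_float(samples):
    """Scale signed 16-bit PCM samples into the [-1.0, 1.0] float range."""
    return [s / 32768.0 for s in samples]

floats = pcm16_to_float([0, 16384, -32768, 32767])
```

Libraries such as librosa or torchaudio handle this (plus resampling) automatically when loading files.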
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_cynthia_tedlium_2500_v2| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|349.2 MB| --- layout: model title: Malay T5ForConditionalGeneration Tiny Cased model (from mesolitica) author: John Snow Labs name: t5_finetune_paraphrase_tiny_standard_bahasa_cased date: 2023-01-30 tags: [ms, open_source, t5, tensorflow] task: Text Generation language: ms edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `finetune-paraphrase-t5-tiny-standard-bahasa-cased` is a Malay model originally trained by `mesolitica`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_finetune_paraphrase_tiny_standard_bahasa_cased_ms_4.3.0_3.0_1675102090742.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_finetune_paraphrase_tiny_standard_bahasa_cased_ms_4.3.0_3.0_1675102090742.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_finetune_paraphrase_tiny_standard_bahasa_cased","ms") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_finetune_paraphrase_tiny_standard_bahasa_cased","ms")
  .setInputCols("document")
  .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_finetune_paraphrase_tiny_standard_bahasa_cased| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ms| |Size:|176.9 MB| ## References - https://huggingface.co/mesolitica/finetune-paraphrase-t5-tiny-standard-bahasa-cased - https://github.com/huseinzol05/malaya/tree/master/session/paraphrase/hf-t5 --- layout: model title: Named Entity Recognition (NER) Model in Swedish (GloVe 6B 300) author: John Snow Labs name: swedish_ner_6B_300 date: 2020-08-30 task: Named Entity Recognition language: sv edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [ner, sv, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Swedish NER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. The model is trained with GloVe 6B 300 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Product-`PRO`, Date-`DATE`, Event-`EVENT`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_SV/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/swedish_ner_6B_300_sv_2.6.0_2.4_1598810268071.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/swedish_ner_6B_300_sv_2.6.0_2.4_1598810268071.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang = "xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("swedish_ner_6B_300", "sv") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (född 28 oktober 1955) är en amerikansk affärsmagnat, mjukvaruutvecklare, investerare och filantrop. Han är mest känd som medgrundare av Microsoft Corporation. Under sin karriär på Microsoft innehade Gates befattningar som styrelseordförande, verkställande direktör (VD), VD och programvaruarkitekt samtidigt som han var den största enskilda aktieägaren fram till maj 2014. Han är en av de mest kända företagarna och pionjärerna inom mikrodatorrevolutionen på 1970- och 1980-talet. Född och uppvuxen i Seattle, Washington, grundade Gates Microsoft tillsammans med barndomsvän Paul Allen 1975 i Albuquerque, New Mexico; det blev vidare världens största datorprogramföretag. Gates ledde företaget som styrelseordförande och VD tills han avgick som VD i januari 2000, men han förblev ordförande och blev chef för programvaruarkitekt. Under slutet av 1990-talet hade Gates kritiserats för sin affärstaktik, som har ansetts konkurrensbegränsande. Detta yttrande har upprätthållits genom många domstolsbeslut. I juni 2006 meddelade Gates att han skulle gå över till en deltidsroll på Microsoft och heltid på Bill & Melinda Gates Foundation, den privata välgörenhetsstiftelsen som han och hans fru, Melinda Gates, grundade 2000. Han överförde gradvis sina uppgifter till Ray Ozzie och Craig Mundie. 
Han avgick som styrelseordförande i Microsoft i februari 2014 och tillträdde en ny tjänst som teknologrådgivare för att stödja den nyutnämnda VD Satya Nadella.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang = "xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("swedish_ner_6B_300", "sv") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (född 28 oktober 1955) är en amerikansk affärsmagnat, mjukvaruutvecklare, investerare och filantrop. Han är mest känd som medgrundare av Microsoft Corporation. Under sin karriär på Microsoft innehade Gates befattningar som styrelseordförande, verkställande direktör (VD), VD och programvaruarkitekt samtidigt som han var den största enskilda aktieägaren fram till maj 2014. Han är en av de mest kända företagarna och pionjärerna inom mikrodatorrevolutionen på 1970- och 1980-talet. Född och uppvuxen i Seattle, Washington, grundade Gates Microsoft tillsammans med barndomsvän Paul Allen 1975 i Albuquerque, New Mexico; det blev vidare världens största datorprogramföretag. Gates ledde företaget som styrelseordförande och VD tills han avgick som VD i januari 2000, men han förblev ordförande och blev chef för programvaruarkitekt. Under slutet av 1990-talet hade Gates kritiserats för sin affärstaktik, som har ansetts konkurrensbegränsande. Detta yttrande har upprätthållits genom många domstolsbeslut. I juni 2006 meddelade Gates att han skulle gå över till en deltidsroll på Microsoft och heltid på Bill & Melinda Gates Foundation, den privata välgörenhetsstiftelsen som han och hans fru, Melinda Gates, grundade 2000. Han överförde gradvis sina uppgifter till Ray Ozzie och Craig Mundie. 
Han avgick som styrelseordförande i Microsoft i februari 2014 och tillträdde en ny tjänst som teknologrådgivare för att stödja den nyutnämnda VD Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (född 28 oktober 1955) är en amerikansk affärsmagnat, mjukvaruutvecklare, investerare och filantrop. Han är mest känd som medgrundare av Microsoft Corporation. Under sin karriär på Microsoft innehade Gates befattningar som styrelseordförande, verkställande direktör (VD), VD och programvaruarkitekt samtidigt som han var den största enskilda aktieägaren fram till maj 2014. Han är en av de mest kända företagarna och pionjärerna inom mikrodatorrevolutionen på 1970- och 1980-talet. Född och uppvuxen i Seattle, Washington, grundade Gates Microsoft tillsammans med barndomsvän Paul Allen 1975 i Albuquerque, New Mexico; det blev vidare världens största datorprogramföretag. Gates ledde företaget som styrelseordförande och VD tills han avgick som VD i januari 2000, men han förblev ordförande och blev chef för programvaruarkitekt. Under slutet av 1990-talet hade Gates kritiserats för sin affärstaktik, som har ansetts konkurrensbegränsande. Detta yttrande har upprätthållits genom många domstolsbeslut. I juni 2006 meddelade Gates att han skulle gå över till en deltidsroll på Microsoft och heltid på Bill & Melinda Gates Foundation, den privata välgörenhetsstiftelsen som han och hans fru, Melinda Gates, grundade 2000. Han överförde gradvis sina uppgifter till Ray Ozzie och Craig Mundie. Han avgick som styrelseordförande i Microsoft i februari 2014 och tillträdde en ny tjänst som teknologrådgivare för att stödja den nyutnämnda VD Satya Nadella."""] ner_df = nlu.load('sv.ner.6B_300').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title} ## Results ```bash +------------------------+---------+ |chunk |ner_label| +------------------------+---------+ |William Henry Gates |PER | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |MISC | |Seattle |LOC | |Washington |LOC | |Gates Microsoft |ORG | |Paul Allen |PER | |Albuquerque |LOC | |New Mexico |MISC | |Gates |MISC | |Gates |MISC | |Gates |MISC | |Microsoft |ORG | |Bill |MISC | |Melinda Gates Foundation|MISC | |Melinda Gates |MISC | |Ray Ozzie |PER | |Craig Mundie |PER | |Microsoft |ORG | +------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|swedish_ner_6B_300| |Type:|ner| |Compatibility:| Spark NLP 2.6.0+| |Edition:|Official| |License:|Open Source| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|sv| |Case sensitive:|false| {:.h2_title} ## Data Source Trained on a custom dataset with multi-lingual GloVe Embeddings ``glove_6B_300``. --- layout: model title: Pipeline to Detect Temporal Relations for Clinical Events author: John Snow Labs name: re_temporal_events_clinical_pipeline date: 2023-06-13 tags: [licensed, clinical, relation_extraction, events, en] task: Relation Extraction language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [re_temporal_events_clinical](https://nlp.johnsnowlabs.com/2020/09/28/re_temporal_events_clinical_en.html) model.
## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_temporal_events_clinical_pipeline_en_4.4.4_3.2_1686664822742.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_temporal_events_clinical_pipeline_en_4.4.4_3.2_1686664822742.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("re_temporal_events_clinical_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 56-year-old right-handed female with longstanding intermittent right low back pain, who was involved in a motor vehicle accident in September of 2005. At that time, she did not notice any specific injury, but five days later, she started getting abnormal right low back pain.") ``` ```scala val pipeline = new PretrainedPipeline("re_temporal_events_clinical_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 56-year-old right-handed female with longstanding intermittent right low back pain, who was involved in a motor vehicle accident in September of 2005. At that time, she did not notice any specific injury, but five days later, she started getting abnormal right low back pain.") ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.temporal_event_clinical.pipeline").predict("""The patient is a 56-year-old right-handed female with longstanding intermittent right low back pain, who was involved in a motor vehicle accident in September of 2005. At that time, she did not notice any specific injury, but five days later, she started getting abnormal right low back pain.""") ```
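In the relation-extraction output shown in the Results section, `entity1_begin`/`entity1_end` (and the `entity2` counterparts) are inclusive character offsets into the input text. A quick pure-Python check using the example sentence and the offsets from the results table (the `chunk_at` helper is illustrative, not part of Spark NLP):

```python
text = ("The patient is a 56-year-old right-handed female with longstanding "
        "intermittent right low back pain, who was involved in a motor vehicle "
        "accident in September of 2005. At that time, she did not notice any "
        "specific injury, but five days later, she started getting abnormal "
        "right low back pain.")

def chunk_at(text, begin, end):
    # Annotation offsets are inclusive on both ends, so slice one past `end`.
    return text[begin:end + 1]

print(chunk_at(text, 121, 144))  # a motor vehicle accident
print(chunk_at(text, 149, 165))  # September of 2005
```

Slicing with `end + 1` is needed because Python ranges are half-open while the annotation offsets include the last character of the chunk.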
## Results ```bash Results +----+------------+------------+-----------------+---------------+--------------------------+-----------+-----------------+---------------+---------------------+--------------+ | | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | +====+============+============+=================+===============+==========================+===========+=================+===============+=====================+==============+ | 0 | OVERLAP | OCCURRENCE | 121 | 144 | a motor vehicle accident | DATE | 149 | 165 | September of 2005 | 0.999975 | +----+------------+------------+-----------------+---------------+--------------------------+-----------+-----------------+---------------+---------------------+--------------+ | 1 | OVERLAP | DATE | 171 | 179 | that time | PROBLEM | 201 | 219 | any specific injury | 0.956654 | +----+------------+------------+-----------------+---------------+--------------------------+-----------+-----------------+---------------+---------------------+--------------+ {:.model-param} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_temporal_events_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - PerceptronModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - DependencyParserModel - RelationExtractionModel --- layout: model title: English BertForQuestionAnswering model (from aodiniz) author: John Snow Labs name: bert_qa_bert_uncased_L_4_H_512_A_8_squad2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description 
Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-4_H-512_A-8_squad2` is an English model originally trained by `aodiniz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_512_A_8_squad2_en_4.0.0_3.0_1654185304498.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_512_A_8_squad2_en_4.0.0_3.0_1654185304498.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_4_H_512_A_8_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_uncased_L_4_H_512_A_8_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.uncased_4l_512d_a8a_512d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_uncased_L_4_H_512_A_8_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|107.2 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-4_H-512_A-8_squad2 --- layout: model title: Modern Greek (1453-) asr_wav2vec2_large_xlsr_53_greek_by_lighteternal TFWav2Vec2ForCTC from lighteternal author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_greek_by_lighteternal date: 2022-09-25 tags: [wav2vec2, el, audio, open_source, asr] task: Automatic Speech Recognition language: el edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_greek_by_lighteternal` is a Modern Greek (1453-) model originally trained by lighteternal. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_greek_by_lighteternal_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_greek_by_lighteternal_el_4.2.0_3.0_1664105127777.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_greek_by_lighteternal_el_4.2.0_3.0_1664105127777.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_greek_by_lighteternal", "el")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_greek_by_lighteternal", "el") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
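The ASR examples above assume an `audioDf` whose `audio_content` column holds arrays of floats. As a minimal sketch of how such floats can be produced with only the Python standard library (the `wav_to_floats` helper, the 16-bit mono PCM input, and the omission of resampling to the 16 kHz rate wav2vec2 models expect are all assumptions for illustration):

```python
import os
import struct
import tempfile
import wave

def wav_to_floats(path):
    """Read a 16-bit PCM mono WAV file and normalize samples to [-1.0, 1.0)."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2 and w.getnchannels() == 1, "expects 16-bit mono PCM"
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Tiny self-check: write three known samples and read them back.
demo = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<3h", 0, 16384, -32768))
print(wav_to_floats(demo))  # [0.0, 0.5, -1.0]

# A one-row input DataFrame for the pipelines above would then look like:
# audioDf = spark.createDataFrame([[wav_to_floats("sample.wav")]]).toDF("audio_content")
```

For real audio files, a library that handles resampling and other encodings (e.g. librosa or soundfile) is the more practical choice.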
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_greek_by_lighteternal| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|el| |Size:|1.2 GB| --- layout: model title: Japanese BertForMaskedLM Base Cased model (from cl-tohoku) author: John Snow Labs name: bert_embeddings_base_japanese_whole_word_masking date: 2022-12-02 tags: [ja, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ja edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-japanese-whole-word-masking` is a Japanese model originally trained by `cl-tohoku`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_japanese_whole_word_masking_ja_4.2.4_3.0_1670018286563.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_japanese_whole_word_masking_ja_4.2.4_3.0_1670018286563.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_japanese_whole_word_masking","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_japanese_whole_word_masking","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_japanese_whole_word_masking| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Size:|415.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking - https://github.com/google-research/bert - https://github.com/cl-tohoku/bert-japanese/tree/v1.0 - https://github.com/attardi/wikiextractor - https://taku910.github.io/mecab/ - https://creativecommons.org/licenses/by-sa/3.0/ - https://www.tensorflow.org/tfrc/ --- layout: model title: English image_classifier_vit_rare_puppers3 ViTForImageClassification from Samlit author: John Snow Labs name: image_classifier_vit_rare_puppers3 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rare_puppers3` is an English model originally trained by Samlit. ## Predicted Entities `Marcelle Lender doing the Bolero in Chilperic`, `Moulin Rouge_ La Goulue - Henri Toulouse-Lautrec`, `Salon at the Rue des Moulins - Henri de Toulouse-Lautrec`, `aristide bruant - Henri de Toulouse-Lautrec` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rare_puppers3_en_4.1.0_3.0_1660171627502.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rare_puppers3_en_4.1.0_3.0_1660171627502.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_rare_puppers3", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_rare_puppers3", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_rare_puppers3| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Legal Defaults and remedies Clause Binary Classifier author: John Snow Labs name: legclf_defaults_and_remedies_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `defaults-and-remedies` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens; if your text is longer, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
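The paragraph-splitting advice above can be sketched in plain Python (no Spark NLP involved; the `split_paragraphs` helper and the whitespace-token approximation of the 512-token budget are illustrative assumptions, not part of the library):

```python
def split_paragraphs(text, max_tokens=512):
    """Split on blank lines, then greedily pack paragraphs so each
    piece stays within an (approximate) whitespace-token budget."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    pieces, current, count = [], [], 0
    for p in paragraphs:
        n = len(p.split())
        if current and count + n > max_tokens:
            pieces.append("\n\n".join(current))
            current, count = [], 0
        current.append(p)
        count += n
    if current:
        pieces.append("\n\n".join(current))
    return pieces

doc = "First clause paragraph.\n\nSecond clause paragraph.\n\nThird one."
# With a budget of 6 tokens: paragraphs 1+2 packed together, paragraph 3 alone.
print(split_paragraphs(doc, max_tokens=6))
```

Each resulting piece can then go into the classifier's input column, so every prediction sees a clause-sized chunk rather than a whole contract.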
## Predicted Entities `other`, `defaults-and-remedies` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_defaults_and_remedies_clause_en_1.0.0_3.2_1660122322269.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_defaults_and_remedies_clause_en_1.0.0_3.2_1660122322269.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_defaults_and_remedies_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[defaults-and-remedies]| |[other]| |[other]| |[defaults-and-remedies]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_defaults_and_remedies_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support defaults-and-remedies 0.92 0.88 0.90 26 other 0.97 0.98 0.97 90 accuracy - - 0.96 116 macro-avg 0.94 0.93 0.94 116 weighted-avg 0.96 0.96 0.96 116 ``` --- layout: model title: Spanish Bert Embeddings (Base, Passage, Squades) author: John Snow Labs name: bert_embeddings_dpr_spanish_passage_encoder_squades_base date: 2022-04-11 tags: [bert, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `dpr-spanish-passage_encoder-squades-base` is a Spanish model originally trained by `IIC`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_dpr_spanish_passage_encoder_squades_base_es_3.4.2_3.0_1649671351877.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_dpr_spanish_passage_encoder_squades_base_es_3.4.2_3.0_1649671351877.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_dpr_spanish_passage_encoder_squades_base","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_dpr_spanish_passage_encoder_squades_base","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.dpr_spanish_passage_encoder_squades_base").predict("""Me encanta chispa nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_dpr_spanish_passage_encoder_squades_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|412.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/IIC/dpr-spanish-passage_encoder-squades-base - https://arxiv.org/abs/2004.04906 - https://github.com/facebookresearch/DPR - https://paperswithcode.com/sota?task=text+similarity&dataset=squad_es --- layout: model title: German ElectraForQuestionAnswering model (from Sahajtomar) author: John Snow Labs name: electra_qa_German_question_answer date: 2022-06-22 tags: [de, open_source, electra, question_answering] task: Question Answering language: de edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `German-question-answer-Electra` is a German model originally trained by `Sahajtomar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_German_question_answer_de_4.0.0_3.0_1655919934127.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_German_question_answer_de_4.0.0_3.0_1655919934127.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_German_question_answer","de") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Was ist mein Name?", "Mein Name ist Clara und ich lebe in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_German_question_answer","de") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Was ist mein Name?", "Mein Name ist Clara und ich lebe in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.answer_question.electra").predict("""Was ist mein Name?|||Mein Name ist Clara und ich lebe in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_German_question_answer| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|de| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Sahajtomar/German-question-answer-Electra --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from MYX4567) author: John Snow Labs name: distilbert_qa_myx4567_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `MYX4567`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_myx4567_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768718891.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_myx4567_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768718891.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_myx4567_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_myx4567_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_myx4567_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/MYX4567/distilbert-base-uncased-finetuned-squad --- layout: model title: Fast Neural Machine Translation Model from Mon-Khmer Languages to English author: John Snow Labs name: opus_mt_mkh_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, mkh, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `mkh` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_mkh_en_xx_2.7.0_2.4_1609170529009.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_mkh_en_xx_2.7.0_2.4_1609170529009.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_mkh_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_mkh_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.mkh.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_mkh_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Recognize Entities DL Pipeline for Polish - Small author: John Snow Labs name: entity_recognizer_sm date: 2021-03-22 tags: [open_source, polish, entity_recognizer_sm, pipeline, pl] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: pl edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_sm is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps. It performs most of the common text processing tasks on your dataframe {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_pl_3.0.0_3.0_1616442303185.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_pl_3.0.0_3.0_1616442303185.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('entity_recognizer_sm', lang = 'pl')
annotations = pipeline.fullAnnotate("Witaj z John Snow Labs! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("entity_recognizer_sm", lang = "pl")
val result = pipeline.fullAnnotate("Witaj z John Snow Labs! ")(0)
```

{:.nlu-block}
```python
import nlu

text = ["Witaj z John Snow Labs! "]
result_df = nlu.load('pl.ner').predict(text)
result_df
```
## Results

```bash
|    | document                     | sentence                    | token                                   | embeddings                   | ner                                   | entities            |
|---:|:-----------------------------|:----------------------------|:----------------------------------------|:-----------------------------|:--------------------------------------|:--------------------|
|  0 | ['Witaj z John Snow Labs! '] | ['Witaj z John Snow Labs!'] | ['Witaj', 'z', 'John', 'Snow', 'Labs!'] | [[0.0, 0.0, 0.0, 0.0,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|entity_recognizer_sm|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|pl|

---
layout: model
title: English Bert Embeddings (from lordtt13)
author: John Snow Labs
name: bert_embeddings_COVID_SciBERT
date: 2022-04-11
tags: [bert, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `COVID-SciBERT` is an English model originally trained by `lordtt13`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_COVID_SciBERT_en_3.4.2_3.0_1649672354701.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_COVID_SciBERT_en_3.4.2_3.0_1649672354701.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_COVID_SciBERT","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_COVID_SciBERT","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.COVID_SciBERT").predict("""I love Spark NLP""") ```
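The `embeddings` column holds one vector per token, and a common downstream step is comparing tokens by cosine similarity. A self-contained sketch with toy vectors (the helper is illustrative; this model's real vectors are 768-dimensional):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 4-dimensional stand-ins for token embeddings:
print(cosine_similarity([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0]))  # close to 1.0
print(cosine_similarity([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]))  # 0.0 (orthogonal)
```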
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_COVID_SciBERT| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|415.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/lordtt13/COVID-SciBERT - https://arxiv.org/abs/1903.10676 - https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge - https://github.com/lordtt13/word-embeddings/blob/master/COVID-19%20Research%20Data/COVID-SciBERT.ipynb - https://github.com/lordtt13 - https://www.linkedin.com/in/tanmay-thakur-6bb5a9154/ --- layout: model title: Fast Neural Machine Translation Model from Arabic to French author: John Snow Labs name: opus_mt_ar_fr date: 2021-06-01 tags: [open_source, seq2seq, translation, ar, fr, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `ar`
- target languages: `fr`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ar_fr_xx_3.1.0_2.4_1622552181438.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ar_fr_xx_3.1.0_2.4_1622552181438.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_ar_fr", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_ar_fr", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu

text = ["text to translate"]
translate_df = nlu.load('xx.Arabic.translate_to.French').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_ar_fr|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: ALBERT XLarge CoNLL-03 NER Pipeline
author: John Snow Labs
name: albert_xlarge_token_classifier_conll03_pipeline
date: 2022-06-19
tags: [open_source, ner, token_classifier, albert, conll03, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [albert_xlarge_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/26/albert_xlarge_token_classifier_conll03_en.html) model.

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_xlarge_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655654378483.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_xlarge_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655654378483.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
pipeline = PretrainedPipeline("albert_xlarge_token_classifier_conll03_pipeline", lang = "en")

pipeline.annotate("My name is John and I work at John Snow Labs.")
```
```scala
val pipeline = new PretrainedPipeline("albert_xlarge_token_classifier_conll03_pipeline", lang = "en")

pipeline.annotate("My name is John and I work at John Snow Labs.")
```
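`annotate()` returns a plain dict mapping each output column to a list of strings, so the chunk/label pairs shown in the results below can be rebuilt with ordinary Python. The column names in this sketch (`entities`, `entities_class`) are assumptions — they vary from pipeline to pipeline — and the dict is mocked:

```python
def pair_entities(annotated, chunk_col="entities", label_col="entities_class"):
    """Zip recognized chunks with their predicted labels."""
    return list(zip(annotated[chunk_col], annotated[label_col]))

# Mocked annotate() output for the example sentence above:
mock_annotated = {
    "entities": ["John", "John Snow Labs"],
    "entities_class": ["PER", "ORG"],
}
print(pair_entities(mock_annotated))  # [('John', 'PER'), ('John Snow Labs', 'ORG')]
```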
## Results

```bash
+--------------+---------+
|chunk         |ner_label|
+--------------+---------+
|John          |PER      |
|John Snow Labs|ORG      |
+--------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|albert_xlarge_token_classifier_conll03_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|206.5 MB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- AlbertForTokenClassification
- NerConverter
- Finisher

---
layout: model
title: English BertForQuestionAnswering model (from aodiniz)
author: John Snow Labs
name: bert_qa_bert_uncased_L_4_H_256_A_4_cord19_200616_squad2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-4_H-256_A-4_cord19-200616_squad2` is an English model originally trained by `aodiniz`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_256_A_4_cord19_200616_squad2_en_4.0.0_3.0_1654185251893.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_256_A_4_cord19_200616_squad2_en_4.0.0_3.0_1654185251893.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_4_H_256_A_4_cord19_200616_squad2","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_bert_uncased_L_4_H_256_A_4_cord19_200616_squad2","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2_cord19.bert.uncased_4l_256d_a4a_256d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
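After `transform`, each row's `answer` column carries the predicted span. A sketch of collecting (question, answer) pairs from collected rows — the row shape below is mocked; real collected rows are Spark `Row` objects whose `answer` entries expose `.result`:

```python
class FakeAnnotation:
    def __init__(self, result):
        self.result = result

def qa_pairs(rows):
    """Collect (question, first predicted answer) pairs from result rows."""
    return [(r["question"], r["answer"][0].result if r["answer"] else None) for r in rows]

mock_rows = [
    {"question": "What's my name?", "answer": [FakeAnnotation("Clara")]},
    {"question": "Where do I live?", "answer": []},  # no span found
]
print(qa_pairs(mock_rows))  # [("What's my name?", 'Clara'), ('Where do I live?', None)]
```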
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_uncased_L_4_H_256_A_4_cord19_200616_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|42.1 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/aodiniz/bert_uncased_L-4_H-256_A-4_cord19-200616_squad2

---
layout: model
title: Legal Investment Company Clause Binary Classifier
author: John Snow Labs
name: legclf_investment_company_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `investment-company` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level.

If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, yielding a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `investment-company`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_investment_company_clause_en_1.0.0_3.2_1660123638708.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_investment_company_clause_en_1.0.0_3.2_1660123638708.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_investment_company_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
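The paragraph-splitting technique recommended in the description — splitting by multiline before the text ever reaches the pipeline — can be sketched in plain Python. The regex and helper name are illustrative:

```python
import re

def split_paragraphs(text):
    """Split a document into clause-sized paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "1. Investment Company Act. ...\n\n2. Contributions. ...\n\n3. Miscellaneous. ..."
print(split_paragraphs(doc))
```

Each resulting paragraph would then become one row of the `clause_text` column fed to the classifier.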
## Results

```bash
+--------------------+
|              result|
+--------------------+
|[investment-company]|
|             [other]|
|             [other]|
|[investment-company]|
+--------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_investment_company_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.2 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
             label  precision  recall  f1-score  support
investment-company       0.99    0.96      0.97       77
             other       0.98    0.99      0.99      185
          accuracy          -       -      0.98      262
         macro-avg       0.99    0.98      0.98      262
      weighted-avg       0.98    0.98      0.98      262
```

---
layout: model
title: Legal Contribution Clause Binary Classifier
author: John Snow Labs
name: legclf_contribution_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `contribution` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level.

If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that this model's embeddings allow up to 512 tokens.
If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, yielding a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `contribution`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_contribution_clause_en_1.0.0_3.2_1660122292565.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_contribution_clause_en_1.0.0_3.2_1660122292565.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_contribution_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+--------------+
|        result|
+--------------+
|[contribution]|
|       [other]|
|       [other]|
|[contribution]|
+--------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_contribution_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.7 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
contribution       1.00    0.88      0.93       16
       other       0.96    1.00      0.98       43
    accuracy          -       -      0.97       59
   macro-avg       0.98    0.94      0.96       59
weighted-avg       0.97    0.97      0.97       59
```

---
layout: model
title: Russian Named Entity Recognition (from surdan)
author: John Snow Labs
name: bert_ner_LaBSE_ner_nerel
date: 2022-05-09
tags: [bert, ner, token_classification, ru, open_source]
task: Named Entity Recognition
language: ru
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `LaBSE_ner_nerel` is a Russian model originally trained by `surdan`.
## Predicted Entities `DATE`, `NATIONALITY`, `LAW`, `PERSON`, `PERCENT`, `FACILITY`, `PROFESSION`, `NUMBER`, `RELIGION`, `DISTRICT`, `WORK_OF_ART`, `LANGUAGE`, `LOCATION`, `AGE`, `AWARD`, `IDEOLOGY`, `COUNTRY`, `TIME`, `FAMILY`, `MONEY`, `CRIME`, `ORDINAL`, `EVENT`, `PRODUCT`, `CITY`, `ORGANIZATION`, `STATE_OR_PROVINCE`, `DISEASE`, `PENALTY` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_LaBSE_ner_nerel_ru_3.4.2_3.0_1652098988587.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_LaBSE_ner_nerel_ru_3.4.2_3.0_1652098988587.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_LaBSE_ner_nerel","ru") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Я люблю Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_LaBSE_ner_nerel","ru") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Я люблю Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
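The `ner` column carries token-level BIO tags (`B-PERSON`, `I-PERSON`, `O`, …). Collapsing them into entity chunks — which is what Spark NLP's `NerConverter` does inside a pipeline — can be sketched in plain Python; the helper and the sample tags below are illustrative:

```python
def bio_to_chunks(tokens, tags):
    """Merge token-level BIO tags into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" or an orphan "I-" tag closes the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Witaj", "z", "John", "Snow", "Labs"]
tags = ["O", "O", "B-PER", "I-PER", "I-PER"]
print(bio_to_chunks(tokens, tags))  # [('John Snow Labs', 'PER')]
```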
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_ner_LaBSE_ner_nerel|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|ru|
|Size:|481.1 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/surdan/LaBSE_ner_nerel

---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab7_by_hassnain TFWav2Vec2ForCTC from hassnain
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_colab7_by_hassnain
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab7_by_hassnain` is an English model originally trained by hassnain.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab7_by_hassnain_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab7_by_hassnain_en_4.2.0_3.0_1664018905309.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab7_by_hassnain_en_4.2.0_3.0_1664018905309.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab7_by_hassnain', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab7_by_hassnain", lang = "en") val annotations = pipeline.transform(audioDF) ```
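The `audioDF` fed to `transform` needs a column of raw waveform floats. A stdlib-only sketch of reading a 16-bit mono WAV file into such a float list — the file name and helper are illustrative, and real ASR input for Wav2vec2 models should be sampled at 16 kHz:

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit mono WAV file into floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Write a tiny synthetic clip so the sketch is self-contained:
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))

print(wav_to_floats("demo.wav"))
```

The resulting list would then go into the DataFrame column the pipeline reads (named `audio_content` in these cards) before calling `pipeline.transform(audioDF)`.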
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab7_by_hassnain| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from jgriffi) author: John Snow Labs name: xlmroberta_ner_jgriffi_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `jgriffi`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jgriffi_base_finetuned_panx_de_4.1.0_3.0_1660434676655.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jgriffi_base_finetuned_panx_de_4.1.0_3.0_1660434676655.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jgriffi_base_finetuned_panx","de") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jgriffi_base_finetuned_panx","de")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("document", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_jgriffi_base_finetuned_panx|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|854.3 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/jgriffi/xlm-roberta-base-finetuned-panx-de
- https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme

---
layout: model
title: Marathi Named Entity Recognition (from l3cube-pune)
author: John Snow Labs
name: bert_ner_marathi_ner
date: 2022-05-09
tags: [bert, ner, token_classification, mr, open_source]
task: Named Entity Recognition
language: mr
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `marathi-ner` is a Marathi model originally trained by `l3cube-pune`.

## Predicted Entities

`Location`, `Time`, `Organization`, `Designation`, `Person`, `Other`, `Measure`, `Date`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_marathi_ner_mr_3.4.2_3.0_1652099098842.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_marathi_ner_mr_3.4.2_3.0_1652099098842.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_marathi_ner","mr") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["मला स्पार्क एनएलपी आवडते"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_marathi_ner","mr") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("मला स्पार्क एनएलपी आवडते").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_ner_marathi_ner|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|mr|
|Size:|665.7 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/l3cube-pune/marathi-ner
- https://github.com/l3cube-pune/MarathiNLP
- https://arxiv.org/abs/2204.06029

---
layout: model
title: Pipeline to Detect mentions of general medical terms (coarse)
author: John Snow Labs
name: ner_medmentions_coarse_pipeline
date: 2023-03-14
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [ner_medmentions_coarse](https://nlp.johnsnowlabs.com/2021/04/01/ner_medmentions_coarse_en.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_medmentions_coarse_pipeline_en_4.3.0_3.2_1678827534546.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_medmentions_coarse_pipeline_en_4.3.0_3.2_1678827534546.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_medmentions_coarse_pipeline", "en", "clinical/models") text = '''he patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). Additionally, there is no side effect observed after Influenza vaccine. One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_medmentions_coarse_pipeline", "en", "clinical/models") val text = "he patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). Additionally, there is no side effect observed after Influenza vaccine. One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. 
to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.medmentions_coarse.pipeline").predict("""he patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). Additionally, there is no side effect observed after Influenza vaccine. One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
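Downstream code often keeps only the higher-confidence chunks. A minimal sketch of confidence filtering over fullAnnotate-style output — the dicts below mirror a few rows of the Results table and are a simplified stand-in for the actual Spark NLP Annotation objects:

```python
# Simplified stand-ins for annotated NER chunks (values taken from the Results table).
chunks = [
    {"chunk": "Caucasian", "label": "Population_Group", "confidence": 0.8439},
    {"chunk": "congestion", "label": "Pathologic_Function", "confidence": 0.4102},
    {"chunk": "ER", "label": "Cell_Component", "confidence": 0.3185},
    {"chunk": "urine", "label": "Body_Substance", "confidence": 0.9088},
]

def filter_by_confidence(chunks, threshold=0.5):
    """Keep only chunks at or above the confidence threshold."""
    return [c for c in chunks if c["confidence"] >= threshold]

for c in filter_by_confidence(chunks):
    print(c["chunk"], c["label"])
```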
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:----------------------------|--------:|------:|:-------------------------------------|-------------:| | 0 | Caucasian | 27 | 35 | Population_Group | 0.8439 | | 1 | male | 37 | 40 | Organism_Attribute | 0.6712 | | 2 | congestion | 61 | 70 | Pathologic_Function | 0.4102 | | 3 | suctioning yellow discharge | 87 | 113 | Therapeutic_or_Preventive_Procedure | 0.278767 | | 4 | patient's nares | 124 | 138 | Body_Part,_Organ,_or_Organ_Component | 0.4463 | | 5 | breathing | 190 | 198 | Biologic_Function | 0.7258 | | 6 | perioral cyanosis | 236 | 252 | Sign_or_Symptom | 0.43535 | | 7 | side effect | 297 | 307 | Pathologic_Function | 0.35505 | | 8 | Influenza vaccine | 324 | 340 | Pharmacologic_Substance | 0.7951 | | 9 | temperature | 383 | 393 | Quantitative_Concept | 0.2589 | | 10 | Tylenol | 416 | 422 | Organic_Chemical | 0.5546 | | 11 | decreased | 448 | 456 | Quantitative_Concept | 0.9368 | | 12 | respiratory congestion | 563 | 584 | Pathologic_Function | 0.38635 | | 13 | albuterol | 708 | 716 | Organic_Chemical | 0.4335 | | 14 | treatments | 718 | 727 | Therapeutic_or_Preventive_Procedure | 0.4567 | | 15 | ER | 742 | 743 | Cell_Component | 0.3185 | | 16 | urine | 750 | 754 | Body_Substance | 0.9088 | | 17 | decreased | 772 | 780 | Quantitative_Concept | 0.9341 | | 18 | diapers | 823 | 829 | Manufactured_Object | 0.296 | | 19 | diapers | 870 | 876 | Manufactured_Object | 0.175 | | 20 | Mom | 892 | 894 | Professional_or_Occupational_Group | 0.8055 | | 21 | diarrhea | 907 | 914 | Sign_or_Symptom | 0.8549 | | 22 | bowel movements | 921 | 935 | Biologic_Function | 0.29385 | | 23 | yellow | 941 | 946 | Qualitative_Concept | 0.742 | | 24 | colored | 948 | 954 | Qualitative_Concept | 0.275 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_medmentions_coarse_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| 
|Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Fast Neural Machine Translation Model from Eastern Malayo-Polynesian Languages to English author: John Snow Labs name: opus_mt_pqe_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pqe, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. - source languages: `pqe` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_pqe_en_xx_2.7.0_2.4_1609170771310.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_pqe_en_xx_2.7.0_2.4_1609170771310.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_pqe_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = "Text to translate." result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_pqe_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate.").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.pqe.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
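The sentence detector runs before `MarianTransformer` because Marian models translate sentence-sized inputs best. A rough, library-free sketch of that split-translate-join flow, with a stand-in `fake_translate` function in place of the real model (the naive `". "` split is for illustration only):

```python
def translate_document(text, translate_sentence):
    """Split a document into sentences, translate each, and rejoin."""
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    return " ".join(translate_sentence(s) for s in sentences)

# Stand-in for the pretrained Marian model: just wraps each sentence.
fake_translate = lambda s: f"<en>{s}</en>"

print(translate_document("First sentence. Second sentence", fake_translate))
# -> <en>First sentence</en> <en>Second sentence</en>
```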
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_pqe_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Russian BertForQuestionAnswering Large Cased model (from ruselkomp) author: John Snow Labs name: bert_qa_ruselkomp_sbert_large_nlu_ru_finetuned_squad_full date: 2022-07-07 tags: [ru, open_source, bert, question_answering] task: Question Answering language: ru edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `sbert_large_nlu_ru-finetuned-squad-full` is a Russian model originally trained by `ruselkomp`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_ruselkomp_sbert_large_nlu_ru_finetuned_squad_full_ru_4.0.0_3.0_1657191680560.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_ruselkomp_sbert_large_nlu_ru_finetuned_squad_full_ru_4.0.0_3.0_1657191680560.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_ruselkomp_sbert_large_nlu_ru_finetuned_squad_full","ru") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Как меня зовут?", "Меня зовут Клара, и я живу в Беркли."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_ruselkomp_sbert_large_nlu_ru_finetuned_squad_full","ru") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Как меня зовут?", "Меня зовут Клара, и я живу в Беркли.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
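Under the hood, an extractive QA model scores every token as a potential answer start and end, and the best valid pair (start not after end, within a length cap) is returned as the span. A toy sketch with invented scores over the tokenized context:

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) token indices maximizing start+end score."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        # only consider ends at or after the start, within max_len tokens
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return best

tokens = ["Меня", "зовут", "Клара", ",", "и", "я", "живу", "в", "Беркли", "."]
# Invented logits: the model strongly favors "Клара" as both start and end.
start = [0.1, 0.2, 6.0, 0.0, 0.1, 0.0, 0.2, 0.1, 0.3, 0.0]
end   = [0.0, 0.1, 5.5, 0.2, 0.0, 0.1, 0.0, 0.2, 0.4, 0.1]

i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))
# -> Клара
```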
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_ruselkomp_sbert_large_nlu_ru_finetuned_squad_full| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ru| |Size:|1.6 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ruselkomp/sbert_large_nlu_ru-finetuned-squad-full --- layout: model title: ChunkResolver Loinc Clinical author: John Snow Labs name: chunkresolve_loinc_clinical class: ChunkEntityResolverModel language: en nav_key: models repository: clinical/models date: 2020-05-16 task: Entity Resolution edition: Healthcare NLP 2.5.0 spark_version: 2.4 tags: [clinical,licensed,entity_resolution,en] deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Entity Resolution model based on KNN using Word Embeddings + Word Movers Distance. ## Predicted Entities LOINC Codes with ``clinical_embeddings``. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_LOINC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_loinc_clinical_en_2.5.0_2.4_1589599195201.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_loinc_clinical_en_2.5.0_2.4_1589599195201.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python ... loinc_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_loinc_clinical", "en", "clinical/models") \ .setInputCols(["token", "chunk_embeddings"]) \ .setOutputCol("loinc_code") \ .setDistanceFunction("COSINE") \ .setNeighbours(5) pipeline_loinc = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, loinc_resolver]) data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting."""]]).toDF("text") model = pipeline_loinc.fit(data) results = model.transform(data) ``` ```scala ... val loinc_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_loinc_clinical", "en", "clinical/models") .setInputCols(Array("token", "chunk_embeddings")) .setOutputCol("loinc_code") .setDistanceFunction("COSINE") .setNeighbours(5) val pipeline_loinc = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, loinc_resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.").toDF("text") val result = pipeline_loinc.fit(data).transform(data) ```
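Conceptually, the resolver embeds each chunk and returns the nearest LOINC code(s) in embedding space. A toy nearest-neighbor sketch with cosine similarity — the vectors and the index are invented for illustration, and the real model also uses Word Movers Distance over a trained LOINC index:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy index: LOINC code -> invented embedding (NOT real trained vectors).
index = {
    "44877-9": [0.9, 0.1, 0.0],
    "34175-0": [0.0, 0.2, 0.9],
}

def resolve(chunk_embedding, index, k=1):
    """Return the k codes whose embeddings are most similar to the chunk."""
    ranked = sorted(index.items(), key=lambda kv: -cosine(chunk_embedding, kv[1]))
    return [code for code, _ in ranked[:k]]

print(resolve([0.8, 0.2, 0.1], index))
# -> ['44877-9']
```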
{:.h2_title} ## Results ```bash Chunk loinc-Code 0 gestational diabetes mellitus 44877-9 1 type two diabetes mellitus 44877-9 2 T2DM 93692-2 3 prior episode of HTG-induced pancreatitis 85695-5 4 associated with an acute hepatitis 24363-4 5 obesity with a body mass index 47278-7 6 BMI) of 33.5 kg/m2 47214-2 7 polyuria 35234-4 8 polydipsia 25541-4 9 poor appetite 50056-1 10 vomiting 34175-0 ``` {:.model-param} ## Model Information {:.table-model} |----------------|-----------------------------| | Name: | chunkresolve_loinc_clinical | | Type: | ChunkEntityResolverModel | | Compatibility: | Spark NLP 2.5.0+ | | License: | Licensed | |Edition:|Official| |Input labels: | [token, chunk_embeddings] | |Output labels: | [entity] | | Language: | en | | Case sensitive: | True | | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained on LOINC dataset with ``embeddings_clinical``. https://loinc.org/ --- layout: model title: English XlmRoBertaForQuestionAnswering (from teacookies) author: John Snow Labs name: xlm_roberta_qa_autonlp_roberta_base_squad2_24465520 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-roberta-base-squad2-24465520` is an English model originally trained by `teacookies`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465520_en_4.0.0_3.0_1655986650048.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465520_en_4.0.0_3.0_1655986650048.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465520","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465520","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.xlm_roberta.base_24465520.by_teacookies").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_roberta_base_squad2_24465520| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|887.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-roberta-base-squad2-24465520 --- layout: model title: Pipeline for Mapping RxNORM Codes with Their Corresponding UMLS Codes author: John Snow Labs name: rxnorm_umls_mapping date: 2023-03-29 tags: [en, licensed, clinical, pipeline, chunk_mapping, rxnorm, umls] task: Chunk Mapping language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the `rxnorm_umls_mapper` model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_umls_mapping_en_4.3.2_3.2_1680121570730.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_umls_mapping_en_4.3.2_3.2_1680121570730.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("rxnorm_umls_mapping", "en", "clinical/models") result = pipeline.fullAnnotate("1161611 315677") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("rxnorm_umls_mapping", "en", "clinical/models") val result = pipeline.fullAnnotate("1161611 315677") ``` {:.nlu-block} ```python import nlu nlu.load("en.rxnorm.umls.mapping").predict("""Put your text here.""") ```
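The mapper is conceptually a lookup from RxNorm codes to UMLS codes. A minimal sketch using the two pairs shown in the Results section (the real model ships the full mapping inside the pipeline):

```python
# Tiny excerpt of the RxNorm -> UMLS mapping (pairs from the Results table).
rxnorm_to_umls = {
    "1161611": "C3215948",
    "315677": "C0984912",
}

def map_codes(text):
    """Map each whitespace-separated RxNorm code to its UMLS code."""
    return [(code, rxnorm_to_umls.get(code)) for code in text.split()]

print(map_codes("1161611 315677"))
# -> [('1161611', 'C3215948'), ('315677', 'C0984912')]
```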
## Results ```bash | | rxnorm_code | umls_code | |---:|:------------|:----------| | 0 | 1161611 | C3215948 | | 1 | 315677 | C0984912 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|rxnorm_umls_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.9 MB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel --- layout: model title: Recognize Entities DL Pipeline for Dutch - Medium author: John Snow Labs name: entity_recognizer_md date: 2021-03-22 tags: [open_source, dutch, entity_recognizer_md, pipeline, nl] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: nl edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_md is a pretrained pipeline that can be used to process text with a simple pipeline that performs basic processing steps. It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_nl_3.0.0_3.0_1616450887268.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_nl_3.0.0_3.0_1616450887268.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('entity_recognizer_md', lang = 'nl') annotations = pipeline.fullAnnotate("Hallo van John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("entity_recognizer_md", lang = "nl") val result = pipeline.fullAnnotate("Hallo van John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hallo van John Snow Labs! "] result_df = nlu.load('nl.ner.md').predict(text) result_df ```
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:-------------------------------|:------------------------------|:------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Hallo van John Snow Labs! '] | ['Hallo van John Snow Labs!'] | ['Hallo', 'van', 'John', 'Snow', 'Labs!'] | [[0.5910000205039978,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|nl| --- layout: model title: Mapping Company Names to Edgar Database author: John Snow Labs name: finmapper_edgar_companyname date: 2022-08-18 tags: [en, finance, companies, edgar, data, augmentation, licensed] task: Chunk Mapping language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Chunk Mapper model allows you to, given a detected Organization with any NER model, augment it with information available in the SEC Edgar database. Some of the fields included in this Chunk Mapper are: - IRS number - Sector - Former names - Address, Phone, State - Dates where the company submitted filings - etc. IMPORTANT: Chunk Mappers work with exact matches, so before using Chunk Mapping, you need to carry out Company Name Normalization to get how the company name is stored in Edgar. To do this, use Entity Linking, more specifically the `finel_edgar_companynames` model, with the Organization Name extracted by any NER model. You will get the normalized version (by Edgar standards) of the name, which you can send to this model for data augmentation.
## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FIN_LEG_COMPANY_AUGMENTATION/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finmapper_edgar_companyname_en_1.0.0_3.2_1660817326595.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finmapper_edgar_companyname_en_1.0.0_3.2_1660817326595.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\ .setInputCols(["document", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") # Optional: To normalize the ORG name using Edgar data before the mapping ########################################################################## chunkToDoc = nlp.Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") chunk_embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \ .setInputCols("ner_chunk_doc") \ .setOutputCol("sentence_embeddings") use_er_model = finance.SentenceEntityResolverModel.pretrained("finel_edgar_company_name", "en", "finance/models") \ .setInputCols(["ner_chunk_doc", "sentence_embeddings"]) \ .setOutputCol("normalized")\ .setDistanceFunction("EUCLIDEAN") ########################################################################## cm = finance.ChunkMapperModel()\ .pretrained("finmapper_edgar_companyname", "en", "finance/models")\ .setInputCols(["normalized"])\ .setOutputCol("mappings") # or ner_chunk for non normalized versions nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, ner_model, ner_converter, chunkToDoc, chunk_embeddings, use_er_model, cm ]) text = """NIKE Inc is an American multinational corporation that is engaged in the design, development, manufacturing, and worldwide marketing and sales of footwear, apparel, equipment, accessories, and services""" test_data = spark.createDataFrame([[text]]).toDF("text") model = 
nlpPipeline.fit(test_data) lp = nlp.LightPipeline(model) lp.annotate(text) ```
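Each mapping row carries an `entity`, a `relation`, and a value; collecting them into a per-company dict makes the augmented data easier to consume. A sketch over a trimmed, hand-written version of the rows shown in the Results section (in the real output these values sit in annotation metadata, not plain dicts):

```python
# Trimmed stand-in for the mapper output rows (values from the Results section).
rows = [
    {"entity": "Jamestown Invest 1, LLC", "relation": "name",
     "value": "Jamestown Invest 1, LLC"},
    {"entity": "Jamestown Invest 1, LLC", "relation": "sic_code",
     "value": "6798"},
    {"entity": "Jamestown Invest 1, LLC", "relation": "business_city",
     "value": "ATLANTA"},
]

def company_profile(rows):
    """Group (relation, value) pairs into one dict per company entity."""
    profile = {}
    for row in rows:
        profile.setdefault(row["entity"], {})[row["relation"]] = row["value"]
    return profile

print(company_profile(rows)["Jamestown Invest 1, LLC"]["business_city"])
# -> ATLANTA
```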
## Results ```bash {"mappings": [["labeled_dependency", 0, 22, "Jamestown Invest 1, LLC", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "name", "all_relations": ""}], ["labeled_dependency", 0, 22, "REAL ESTATE INVESTMENT TRUSTS [6798]", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "sic", "all_relations": ""}], ["labeled_dependency", 0, 22, "6798", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "sic_code", "all_relations": ""}], ["labeled_dependency", 0, 22, "831529368", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "irs_number", "all_relations": ""}], ["labeled_dependency", 0, 22, "1231", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "fiscal_year_end", "all_relations": ""}], ["labeled_dependency", 0, 22, "GA", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "state_location", "all_relations": ""}], ["labeled_dependency", 0, 22, "DE", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "state_incorporation", "all_relations": ""}], ["labeled_dependency", 0, 22, "PONCE CITY MARKET", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "business_street", "all_relations": ""}], ["labeled_dependency", 0, 22, "ATLANTA", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "business_city", "all_relations": ""}], ["labeled_dependency", 0, 22, "GA", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "business_state", "all_relations": ""}], ["labeled_dependency", 0, 22, "30308", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "business_zip", "all_relations": ""}], ["labeled_dependency", 0, 22, "7708051000", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "business_phone", "all_relations": ""}], 
["labeled_dependency", 0, 22, "Jamestown Atlanta Invest 1, LLC", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "former_name", "all_relations": ""}], ["labeled_dependency", 0, 22, "20180824", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "former_name_date", "all_relations": ""}], ["labeled_dependency", 0, 22, "2019-11-21", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "date", "all_relations": "2019-10-24:::2019-11-25:::2019-11-12:::2022-01-13:::2022-03-31:::2022-04-11:::2022-07-12:::2022-06-30:::2021-01-14:::2021-04-06:::2021-03-31:::2021-04-28:::2021-06-30:::2021-09-10:::2021-09-22:::2021-09-30:::2021-10-08:::2020-03-16:::2021-12-30:::2020-04-06:::2020-04-29:::2020-06-12:::2020-07-20:::2020-07-07:::2020-07-28:::2020-07-31:::2020-09-09:::2020-09-25:::2020-10-08:::2020-11-12"}], ["labeled_dependency", 0, 22, "1751158", {"sentence": "0", "chunk": "0", "entity": "Jamestown Invest 1, LLC", "relation": "company_id", "all_relations": ""}]]} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finmapper_edgar_companyname| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|11.0 MB| ## References Manually scrapped Edgar Database --- layout: model title: Detect Clinical Entities (BertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_jsl date: 2022-01-06 tags: [ner_jsl, ner, berfortokenclassification, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terminology. 
This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. It detects 77 entities. Definitions of Predicted Entities: - `Injury_or_Poisoning`: Physical harm or injury caused to the body, including those caused by accidents, falls, or poisoning of a patient or someone else. - `Direction`: All the information relating to the laterality of the internal and external organs. - `Test`: Mentions of laboratory, pathology, and radiological tests. - `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient. - `Death_Entity`: Mentions that indicate the death of a patient. - `Relationship_Status`: State of the patient's romantic or social relationships (e.g. single, married, divorced). - `Duration`: The duration of a medical treatment or medication use. - `Respiration`: Number of breaths per minute. - `Hyperlipidemia`: Terms that indicate hyperlipidemia with relevant subtypes and synonyms. - `Birth_Entity`: Mentions that indicate giving birth. - `Age`: All mention of ages, past or present, related to the patient or with anybody else. - `Labour_Delivery`: Extractions include stages of labor and delivery. - `Family_History_Header`: Identifies section headers that correspond to Family History of the patient. - `BMI`: Numeric values and other text information related to Body Mass Index. - `Temperature`: All mentions that refer to body temperature. - `Alcohol`: Terms that indicate alcohol use, abuse or drinking issues of a patient or someone else. - `Kidney_Disease`: Terms that refer to any kidney diseases (includes mentions of modifiers such as "Acute" or "Chronic"). - `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else. - `Medical_History_Header`: Identifies section headers that correspond to Past Medical History of a patient. - `Cerebrovascular_Disease`: All terms that refer to cerebrovascular diseases and events. 
- `Oxygen_Therapy`: Breathing support triggered by patient or entirely or partially by machine (e.g. ventilator, BPAP, CPAP). - `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements. - `Psychological_Condition`: All mental health diagnoses, disorders, conditions or syndromes of a patient or someone else. - `Heart_Disease`: All mentions of acquired, congenital or degenerative heart diseases. - `Employment`: All mentions of patient or provider occupational titles and employment status. - `Obesity`: Terms related to a patient being obese (overweight and BMI are extracted as different labels). - `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.). - `Pregnancy`: All terms related to Pregnancy (excluding terms that are extracted with their specific labels, such as "Labour_Delivery" etc.). - `ImagingFindings`: All mentions of radiographic and imaging findings. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments. - `Medical_Device`: All mentions related to medical devices and supplies. - `Race_Ethnicity`: All terms that refer to racial and national origin of sociocultural groups. - `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital signs Headers are extracted separately with their specific labels). - `Symptom`: All the symptoms mentioned in the document, of a patient or someone else. - `Treatment`: Includes therapeutic and minimally invasive treatment and procedures (invasive treatments or procedures are extracted as "Procedure"). - `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs). 
- `Route`: Drug and medication administration routes, as described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_Ingredient`: Active ingredient/s found in drug products. - `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted. - `Diet`: All mentions and information regarding a patient's dietary habits. - `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by the naked eye. - `LDL`: All mentions related to the lab test and results for LDL (Low Density Lipoprotein). - `VS_Finding`: Qualitative data (e.g. Fever, Cyanosis, Tachycardia) and any other symptoms that refer to vital signs. - `Allergen`: Allergen related extractions mentioned in the document. - `EKG_Findings`: All mentions of EKG readings. - `Imaging_Technique`: All mentions of special radiographic views or special imaging techniques used in radiology. - `Triglycerides`: All mentions of terms related to the lab test for Triglycerides. - `RelativeTime`: Time references that are relative to different times or events (e.g. words such as "approximately", "in the morning"). - `Gender`: Gender-specific nouns and pronouns. - `Pulse`: Peripheral heart rate, without advanced information like measurement location. - `Social_History_Header`: Identifies section headers that correspond to Social History of a patient. - `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs). - `Diabetes`: All terms related to diabetes mellitus. - `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately. 
- `Internal_organ_or_component`: All mentions related to internal body parts or organs that cannot be examined by the naked eye. - `Clinical_Dept`: Terms that indicate the medical and/or surgical departments. - `Form`: Drug and medication forms, as described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients. - `Strength`: Potency of one unit of drug (or a combination of drugs); the measurement units are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Fetus_NewBorn`: All terms related to fetus, infant, new born (excluding terms that are extracted with their specific labels, such as "Labour_Delivery", "Pregnancy" etc.). - `RelativeDate`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "approximately two years ago", "about two days ago"). - `Height`: All mentions related to a patient's height. - `Test_Result`: Terms related to all the test results present in the document (clinical tests results are included). - `Sexually_Active_or_Sexual_Orientation`: All terms that are related to sexuality, sexual orientations and sexual activity. - `Frequency`: Frequency of administration for a dose prescribed. - `Time`: Specific time references (hour and/or minutes). - `Weight`: All mentions related to a patient's weight. - `Vaccine`: Generic and brand name of vaccines or vaccination procedure. - `Vital_Signs_Header`: Identifies section headers that correspond to Vital Signs of a patient. 
- `Communicable_Disease`: Includes all mentions of communicable diseases. - `Dosage`: Quantity prescribed by the physician for an active ingredient; measurement units are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Overweight`: Terms related to the patient being overweight (BMI and Obesity are extracted separately). - `Hypertension`: All terms related to Hypertension (quantitative data such as 150/100 is extracted as Blood_Pressure). - `HDL`: Terms related to the lab test for HDL (High Density Lipoprotein). - `Total_Cholesterol`: Terms related to the lab test and results for cholesterol. - `Smoking`: All mentions of smoking status of a patient. - `Date`: Mentions of an exact date, in any format, including day number, month and/or year. ## Predicted Entities `Injury_or_Poisoning`, `Direction`, `Test`, `Admission_Discharge`, `Death_Entity`, `Relationship_Status`, `Duration`, `Respiration`, `Hyperlipidemia`, `Birth_Entity`, `Age`, `Labour_Delivery`, `Family_History_Header`, `BMI`, `Temperature`, `Alcohol`, `Kidney_Disease`, `Oncological`, `Medical_History_Header`, `Cerebrovascular_Disease`, `Oxygen_Therapy`, `O2_Saturation`, `Psychological_Condition`, `Heart_Disease`, `Employment`, `Obesity`, `Disease_Syndrome_Disorder`, `Pregnancy`, `ImagingFindings`, `Procedure`, `Medical_Device`, `Race_Ethnicity`, `Section_Header`, `Symptom`, `Treatment`, `Substance`, `Route`, `Drug_Ingredient`, `Blood_Pressure`, `Diet`, `External_body_part_or_region`, `LDL`, `VS_Finding`, `Allergen`, `EKG_Findings`, `Imaging_Technique`, `Triglycerides`, `RelativeTime`, `Gender`, `Pulse`, `Social_History_Header`, `Substance_Quantity`, `Diabetes`, `Modifier`, `Internal_organ_or_component`, `Clinical_Dept`, `Form`, `Drug_BrandName`, `Strength`, `Fetus_NewBorn`, `RelativeDate`, `Height`, `Test_Result`, 
`Sexually_Active_or_Sexual_Orientation`, `Frequency`, `Time`, `Weight`, `Vaccine`, `Vital_Signs_Header`, `Communicable_Disease`, `Dosage`, `Overweight`, `Hypertension`, `HDL`, `Total_Cholesterol`, `Smoking`, `Date` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_BERT_TOKEN_CLASSIFIER/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_en_3.3.4_2.4_1641474169014.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_en_3.3.4_2.4_1641474169014.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_jsl", "en", "clinical/models")\ .setInputCols("token", "sentence")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) sample_text = """The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""" data = spark.createDataFrame([[sample_text]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_jsl", "en", "clinical/models") .setInputCols(Array("token", "sentence")) .setOutputCol("ner") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_jsl").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
## Results ```bash +-----------------------------------------+----------------------------+ |chunk                                    |ner_label                   | +-----------------------------------------+----------------------------+ |21-day-old                               |Age                         | |Caucasian                                |Race_Ethnicity              | |male                                     |Gender                      | |2 days                                   |Duration                    | |congestion                               |Symptom                     | |mom                                      |Gender                      | |yellow discharge                         |Symptom                     | |nares                                    |External_body_part_or_region| |she                                      |Gender                      | |mild                                     |Modifier                    | |problems with his breathing while feeding|Symptom                     | |perioral cyanosis                        |Symptom                     | |retractions                              |Symptom                     | |One day ago                              |RelativeDate                | |mom                                      |Gender                      | |tactile temperature                      |Symptom                     | |Tylenol                                  |Drug_BrandName              | |Baby-girl                                |Gender                      | |decreased p.o. intake                    |Symptom                     | |His                                      |Gender                      | |breast-feeding                           |External_body_part_or_region| |20 minutes                               |Duration                    | |q.2h. to 5 to 10 minutes                 |Frequency                   | |his                                      |Gender                      | |respiratory congestion                   |Symptom                     | |He                                       |Gender                      | |tired                                    |Symptom                     | |fussy                                    |Symptom                     | |over the past 2 days                     |RelativeDate                | |albuterol                                |Drug_Ingredient             | |ER                                       |Clinical_Dept               | |His                                      |Gender                      | |urine output has                         |Symptom                     | |decreased                                |Symptom                     | |he                                       |Gender                      | |per 24 hours                             |Frequency                   | |he                                       |Gender                      | |per 24 hours                             |Frequency                   | |Mom                                      |Gender                      | |diarrhea                                 |Symptom                     | |His                                      |Gender                      | |bowel                                    |Internal_organ_or_component | +-----------------------------------------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_jsl| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## Data Source Trained on data gathered and manually annotated by John Snow Labs. 
https://www.johnsnowlabs.com/data/ ## Benchmarking ```bash label precision recall f1-score support B-Admission_Discharge 0.96 0.97 0.96 298 B-Age 0.96 0.97 0.97 1545 B-Alcohol 0.89 0.86 0.87 117 B-BMI 1.00 0.67 0.80 9 B-Birth_Entity 0.50 0.60 0.55 5 B-Blood_Pressure 0.78 0.73 0.76 232 B-Cancer_Modifier 0.82 0.90 0.86 10 B-Cerebrovascular_Disease 0.68 0.75 0.71 163 B-Clinical_Dept 0.88 0.88 0.88 1510 B-Communicable_Disease 0.88 0.94 0.91 47 B-Date 0.96 0.97 0.96 960 B-Death_Entity 0.84 0.79 0.82 39 B-Diabetes 0.94 0.93 0.93 129 B-Diet 0.64 0.55 0.59 166 B-Direction 0.92 0.94 0.93 4605 B-Disease_Syndrome_Disorder 0.90 0.87 0.88 6729 B-Dosage 0.73 0.74 0.73 611 B-Drug_BrandName 0.89 0.90 0.89 2919 B-Drug_Ingredient 0.90 0.92 0.91 6243 B-Duration 0.76 0.80 0.78 296 B-EKG_Findings 0.75 0.76 0.76 114 B-Employment 0.86 0.84 0.85 315 B-External_body_part_or_region 0.86 0.87 0.86 3050 B-Family_History_Header 0.99 0.99 0.99 268 B-Fetus_NewBorn 0.71 0.67 0.69 97 B-Form 0.75 0.75 0.75 351 B-Frequency 0.88 0.93 0.90 1238 B-Gender 0.98 0.98 0.98 4409 B-HDL 1.00 0.50 0.67 8 B-Heart_Disease 0.83 0.86 0.85 728 B-Height 0.93 0.87 0.90 31 B-Hyperlipidemia 0.97 1.00 0.99 229 B-Hypertension 0.98 0.99 0.98 423 B-ImagingFindings 0.53 0.59 0.56 218 B-Imaging_Technique 0.49 0.57 0.53 60 B-Injury_or_Poisoning 0.84 0.80 0.82 721 B-Internal_organ_or_component 0.89 0.91 0.90 11514 B-Kidney_Disease 0.78 0.79 0.78 164 B-LDL 0.62 0.67 0.64 12 B-Labour_Delivery 0.75 0.64 0.69 84 B-Medical_Device 0.87 0.90 0.88 6762 B-Medical_History_Header 0.92 0.94 0.93 163 B-Modifier 0.82 0.86 0.84 3491 B-O2_Saturation 0.51 0.73 0.60 154 B-Obesity 0.92 0.97 0.95 120 B-Oncological 0.87 0.89 0.88 925 B-Oncology_Therapy 0.91 0.45 0.61 22 B-Overweight 0.75 0.60 0.67 10 B-Oxygen_Therapy 0.68 0.71 0.70 313 B-Pregnancy 0.75 0.85 0.79 237 B-Procedure 0.89 0.89 0.89 6219 B-Psychological_Condition 0.74 0.76 0.75 209 B-Pulse 0.80 0.74 0.77 178 B-Race_Ethnicity 0.93 0.99 0.96 111 B-Relationship_Status 1.00 0.76 0.87 34 
B-RelativeDate 0.86 0.84 0.85 569 B-RelativeTime 0.65 0.68 0.67 250 B-Respiration 0.87 0.72 0.79 161 B-Route 0.85 0.86 0.85 1361 B-Section_Header 0.96 0.97 0.97 12925 B-Sexually_Active_or_Sexual_Orientation 1.00 1.00 1.00 1 B-Smoking 0.95 0.84 0.89 145 B-Social_History_Header 0.93 0.95 0.94 338 B-Staging 1.00 1.00 1.00 4 B-Strength 0.88 0.88 0.88 794 B-Substance 0.73 0.94 0.82 87 B-Symptom 0.87 0.86 0.87 11526 B-Temperature 0.81 0.84 0.83 198 B-Test 0.87 0.88 0.88 5850 B-Test_Result 0.84 0.85 0.84 2096 B-Time 0.92 0.98 0.95 1119 B-Total_Cholesterol 0.95 0.75 0.84 28 B-Treatment 0.62 0.65 0.64 354 B-Triglycerides 0.59 0.94 0.73 17 B-VS_Finding 0.77 0.86 0.81 592 B-Vaccine 0.90 0.84 0.87 77 B-Vital_Signs_Header 0.94 0.99 0.96 958 B-Weight 0.75 0.86 0.80 109 I-Age 0.92 0.96 0.94 283 I-Alcohol 0.83 0.62 0.71 8 I-Allergen 0.00 0.00 0.00 15 I-BMI 1.00 0.88 0.93 24 I-Blood_Pressure 0.82 0.84 0.83 456 I-Cerebrovascular_Disease 0.48 0.82 0.61 66 I-Clinical_Dept 0.92 0.91 0.91 717 I-Communicable_Disease 0.81 1.00 0.89 25 I-Date 0.94 0.99 0.96 152 I-Death_Entity 0.67 0.40 0.50 5 I-Diabetes 0.99 0.92 0.95 85 I-Diet 0.70 0.72 0.71 83 I-Direction 0.86 0.84 0.85 235 I-Disease_Syndrome_Disorder 0.88 0.86 0.87 3878 I-Dosage 0.78 0.69 0.73 540 I-Drug_BrandName 0.68 0.55 0.61 89 I-Drug_Ingredient 0.83 0.87 0.85 720 I-Duration 0.83 0.86 0.85 567 I-EKG_Findings 0.81 0.69 0.75 175 I-Employment 0.66 0.71 0.69 118 I-External_body_part_or_region 0.85 0.89 0.87 873 I-Family_History_Header 0.97 1.00 0.98 280 I-Fetus_NewBorn 0.55 0.58 0.56 88 I-Frequency 0.86 0.89 0.87 638 I-HDL 1.00 0.75 0.86 4 I-Heart_Disease 0.86 0.86 0.86 732 I-Height 0.97 0.95 0.96 82 I-Hypertension 0.82 0.75 0.78 12 I-ImagingFindings 0.64 0.60 0.62 257 I-Injury_or_Poisoning 0.78 0.79 0.79 589 I-Internal_organ_or_component 0.90 0.91 0.90 5278 I-Kidney_Disease 0.94 0.91 0.93 186 I-LDL 0.43 0.60 0.50 5 I-Labour_Delivery 0.84 0.77 0.80 69 I-Medical_Device 0.88 0.91 0.89 3590 I-Medical_History_Header 1.00 0.96 0.98 433 
I-Metastasis 0.92 0.65 0.76 17 I-Modifier 0.62 0.58 0.60 322 I-O2_Saturation 0.65 0.87 0.75 217 I-Obesity 0.78 1.00 0.88 7 I-Oncological 0.85 0.91 0.88 741 I-Oxygen_Therapy 0.67 0.72 0.70 224 I-Pregnancy 0.57 0.62 0.59 115 I-Procedure 0.91 0.88 0.89 6231 I-Psychological_Condition 0.76 0.88 0.82 69 I-Pulse 0.82 0.82 0.82 278 I-RelativeDate 0.89 0.88 0.89 757 I-RelativeTime 0.70 0.80 0.74 227 I-Respiration 0.90 0.68 0.77 173 I-Route 0.98 0.92 0.95 359 I-Section_Header 0.96 0.98 0.97 7817 I-Sexually_Active_or_Sexual_Orientation 1.00 1.00 1.00 1 I-Smoking 0.67 0.50 0.57 4 I-Social_History_Header 0.91 1.00 0.95 200 I-Strength 0.85 0.89 0.87 761 I-Substance 0.70 0.90 0.79 21 I-Symptom 0.76 0.72 0.74 6496 I-Temperature 0.92 0.87 0.89 301 I-Test 0.84 0.86 0.85 3126 I-Test_Result 0.89 0.82 0.85 1676 I-Time 0.91 0.95 0.93 424 I-Total_Cholesterol 0.67 1.00 0.81 29 I-Treatment 0.68 0.62 0.65 196 I-Vaccine 0.89 0.96 0.92 25 I-Vital_Signs_Header 0.96 0.99 0.97 633 I-Weight 0.82 0.93 0.87 200 O 0.96 0.96 0.96 175897 accuracy - - 0.92 338378 macro-avg 0.76 0.74 0.74 338378 weighted-avg 0.92 0.92 0.92 338378 ``` --- layout: model title: Extract Clinical Problem Entities from Voice of the Patient Documents (embeddings_clinical_medium) author: John Snow Labs name: ner_vop_problem_emb_clinical_medium date: 2023-06-06 tags: [licensed, clinical, ner, en, vop, problem] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts clinical problems from the documents transferred from the patient’s own sentences using a granular taxonomy. 
## Predicted Entities `PsychologicalCondition`, `Disease`, `Symptom`, `HealthStatus`, `InjuryOrPoisoning`, `Modifier` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_emb_clinical_medium_en_4.4.3_3.0_1686075959944.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_emb_clinical_medium_en_4.4.3_3.0_1686075959944.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_problem_emb_clinical_medium", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. 
After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_problem_emb_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:---------------------|:------------| | pain | Symptom | | fatigue | Symptom | | rheumatoid arthritis | Disease | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_problem_emb_clinical_medium| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical_medium| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
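The table in the Benchmarking section below reports raw true-positive (tp), false-positive (fp) and false-negative (fn) counts alongside the derived scores. As a quick sanity check, the derived columns can be recomputed from the raw counts; this sketch uses the `Disease` row and the pooled micro-average counts from that table.

```python
# Recompute precision, recall and F1 from raw error counts, as reported in
# the benchmarking table (tp=1715, fp=230, fn=300 for `Disease`;
# tp=6723, fp=1059, fn=1733 pooled over all labels for the micro average).
def prf1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

disease = prf1(1715, 230, 300)   # rounds to the table's 0.88 / 0.85 / 0.87
micro = prf1(6723, 1059, 1733)   # rounds to the table's 0.86 / 0.80 / 0.83
```

Note that the micro average pools the counts over all labels before computing the metrics, whereas the macro average is the unweighted mean of the per-label scores.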
## Benchmarking ```bash label tp fp fn total precision recall f1 PsychologicalCondition 409 41 35 444 0.91 0.92 0.91 Disease 1715 230 300 2015 0.88 0.85 0.87 Symptom 3603 515 972 4575 0.87 0.79 0.83 HealthStatus 93 41 14 107 0.69 0.87 0.77 InjuryOrPoisoning 126 36 50 176 0.78 0.72 0.75 Modifier 777 196 362 1139 0.80 0.68 0.74 macro_avg 6723 1059 1733 8456 0.82 0.80 0.81 micro_avg 6723 1059 1733 8456 0.86 0.80 0.83 ``` --- layout: model title: Chinese BertForMaskedLM Cased model (from hfl) author: John Snow Labs name: bert_embeddings_rbt4 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rbt4` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt4_zh_4.2.4_3.0_1670022847527.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt4_zh_4.2.4_3.0_1670022847527.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt4","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt4","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_rbt4| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|171.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/rbt4 - https://arxiv.org/abs/1906.08101 - https://github.com/google-research/bert - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://arxiv.org/abs/2004.13922 --- layout: model title: Detect Genes/Proteins (BC2GM) in Medical Text author: John Snow Labs name: bert_token_classifier_ner_bc2gm_gene date: 2022-07-25 tags: [en, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The BioCreative II Gene Mention Recognition (BC2GM) Dataset contains data where participants are asked to identify a gene mentioned in a sentence by giving its start and end characters. The training set consists of a set of sentences and a set of gene mentions (GENE annotations) in the English language for each sentence. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. The model detects genes/proteins from a medical text. 
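As described above, BC2GM annotations identify each gene mention by its character offsets within a sentence. The sketch below illustrates the idea using plain 0-based Python string indices on a sentence from the example further down; treat it as an illustration only, since the real BC2GM data has its own offset conventions.

```python
# Offset-based gene-mention annotation, illustrated with plain Python
# string indices (NOT the official BC2GM offset convention).
sentence = ("ROCK-I, Kinectin, and mDia2 can bind the wild type "
            "forms of both RhoA and Cdc42.")
mention = "Cdc42"
start = sentence.index(mention)   # index of the mention's first character
end = start + len(mention)        # one past the mention's last character
span = sentence[start:end]        # recovers the annotated mention text
```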
## Predicted Entities `GENE/PROTEIN` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc2gm_gene_en_4.0.0_3.0_1658751400656.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc2gm_gene_en_4.0.0_3.0_1658751400656.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_bc2gm_gene", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, ner_model, ner_converter ]) data = spark.createDataFrame([["""ROCK-I, Kinectin, and mDia2 can bind the wild type forms of both RhoA and Cdc42 in a GTP-dependent manner in vitro. These results support the hypothesis that in the presence of tryptophan the ribosome translating tnaC blocks Rho ' s access to the boxA and rut sites, thereby preventing transcription termination."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_bc2gm_gene", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") .setCaseSensitive(true) .setMaxSentenceLength(512) val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline 
= new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, ner_model, ner_converter)) val data = Seq("""ROCK-I, Kinectin, and mDia2 can bind the wild type forms of both RhoA and Cdc42 in a GTP-dependent manner in vitro. These results support the hypothesis that in the presence of tryptophan the ribosome translating tnaC blocks Rho ' s access to the boxA and rut sites, thereby preventing transcription termination.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.bc2gm_gene").predict("""ROCK-I, Kinectin, and mDia2 can bind the wild type forms of both RhoA and Cdc42 in a GTP-dependent manner in vitro. These results support the hypothesis that in the presence of tryptophan the ribosome translating tnaC blocks Rho ' s access to the boxA and rut sites, thereby preventing transcription termination.""") ```
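Outside Spark, the chunk annotations returned by `fullAnnotate()` can be tabulated with plain Python. A minimal sketch, assuming chunk annotations are represented as dicts with the chunk text in `result` and the entity label under `metadata["entity"]` (the helper name is hypothetical):

```python
def chunks_to_rows(annotations):
    # Pair each chunk's text with its entity label; `annotations` is a list
    # of dicts mirroring Spark NLP chunk annotations (assumed structure).
    return [(ann["result"], ann["metadata"].get("entity", "O")) for ann in annotations]

sample = [
    {"result": "ROCK-I", "metadata": {"entity": "GENE/PROTEIN"}},
    {"result": "Kinectin", "metadata": {"entity": "GENE/PROTEIN"}},
]
print(chunks_to_rows(sample))  # [('ROCK-I', 'GENE/PROTEIN'), ('Kinectin', 'GENE/PROTEIN')]
```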
## Results

```bash
+---------+------------+
|ner_chunk|label       |
+---------+------------+
|ROCK-I   |GENE/PROTEIN|
|Kinectin |GENE/PROTEIN|
|mDia2    |GENE/PROTEIN|
|RhoA     |GENE/PROTEIN|
|Cdc42    |GENE/PROTEIN|
|tnaC     |GENE/PROTEIN|
|Rho      |GENE/PROTEIN|
|boxA     |GENE/PROTEIN|
|rut sites|GENE/PROTEIN|
+---------+------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_bc2gm_gene|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

[https://github.com/cambridgeltl/MTL-Bioinformatics-2016](https://github.com/cambridgeltl/MTL-Bioinformatics-2016)

## Benchmarking

```bash
         label  precision  recall  f1-score  support
B-GENE/PROTEIN     0.7907  0.9320    0.8556     6325
I-GENE/PROTEIN     0.8350  0.8651    0.8498     8776
     micro-avg     0.8151  0.8931    0.8523    15101
     macro-avg     0.8129  0.8986    0.8527    15101
  weighted-avg     0.8165  0.8931    0.8522    15101
```

---
layout: model
title: Recognize Entities DL Pipeline for French - Medium
author: John Snow Labs
name: entity_recognizer_md
date: 2021-03-22
tags: [open_source, french, entity_recognizer_md, pipeline, fr]
supported: true
task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging]
language: fr
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The entity_recognizer_md is a pretrained pipeline that can be used to process text, performing the basic processing steps.
It performs most of the common text processing tasks on your DataFrame.

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_fr_3.0.0_3.0_1616445658156.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_fr_3.0.0_3.0_1616445658156.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('entity_recognizer_md', lang = 'fr')
annotations = pipeline.fullAnnotate("Bonjour de John Snow Labs! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("entity_recognizer_md", lang = "fr")
val result = pipeline.fullAnnotate("Bonjour de John Snow Labs! ")(0)
```

{:.nlu-block}
```python
import nlu
text = ["Bonjour de John Snow Labs! "]
result_df = nlu.load('fr.ner.md').predict(text)
result_df
```
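The `ner` column of this pipeline carries IOB-style tags that get grouped into entity chunks. A minimal, framework-free sketch of that grouping step (the helper is illustrative, not a Spark NLP API):

```python
def iob_to_chunks(tokens, tags):
    # Group IOB tags (e.g. 'I-PER', 'B-PER', 'O') into (chunk_text, label)
    # pairs; adjacent same-label I- tags continue the current chunk.
    chunks, cur_toks, cur_label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag == "O":
            if cur_toks:
                chunks.append((" ".join(cur_toks), cur_label))
                cur_toks, cur_label = [], None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "B" or label != cur_label:
            if cur_toks:
                chunks.append((" ".join(cur_toks), cur_label))
            cur_toks, cur_label = [tok], label
        else:
            cur_toks.append(tok)
    if cur_toks:
        chunks.append((" ".join(cur_toks), cur_label))
    return chunks

tokens = ['Bonjour', 'de', 'John', 'Snow', 'Labs!']
tags = ['I-MISC', 'O', 'I-PER', 'I-PER', 'I-PER']
print(iob_to_chunks(tokens, tags))  # [('Bonjour', 'MISC'), ('John Snow Labs!', 'PER')]
```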
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:--------------------------------|:-------------------------------|:-------------------------------------------|:-----------------------------|:-------------------------------------------|:-------------------------------| | 0 | ['Bonjour de John Snow Labs! '] | ['Bonjour de John Snow Labs!'] | ['Bonjour', 'de', 'John', 'Snow', 'Labs!'] | [[0.0783179998397827,.,...]] | ['I-MISC', 'O', 'I-PER', 'I-PER', 'I-PER'] | ['Bonjour', 'John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|fr| --- layout: model title: Sentence Embeddings - Bluebert uncased (MedNLI) author: John Snow Labs name: sbluebert_base_uncased_mli date: 2020-11-27 task: Embeddings language: en nav_key: models edition: Healthcare NLP 2.6.4 spark_version: 2.4 tags: [embeddings, en, licensed] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model is trained to generate contextual sentence embeddings of input sentences. It has been fine-tuned on MedNLI dataset to provide sota performance on STS and SentEval Benchmarks. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbluebert_base_uncased_mli_en_2.6.4_2.4_1606228596089.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbluebert_base_uncased_mli_en_2.6.4_2.4_1606228596089.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, BertSentenceEmbeddings. 
The output of this model can be used in downstream tasks such as NER, classification, and entity resolution.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sbiobert_embeddings = BertSentenceEmbeddings\ .pretrained("sbluebert_base_uncased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") ``` ```scala val sbiobert_embeddings = BertSentenceEmbeddings.pretrained("sbluebert_base_uncased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed_sentence.bluebert.mli").predict("""Put your text here.""") ```
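Once sentence embeddings are produced, a common downstream step is comparing two sentences with cosine similarity. A minimal, framework-free sketch of that comparison (the helper is illustrative, not part of Spark NLP):

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two embedding vectors, e.g. the
    # 768-dimensional sentence embeddings produced above.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```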
{:.h2_title}
## Results

Gives a 768-dimensional vector representation of the sentence.

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sbluebert_base_uncased_mli|
|Type:|BertSentenceEmbeddings|
|Compatibility:|Spark NLP 2.6.4+|
|Edition:|Official|
|License:|Licensed|
|Input Labels:|[ner_chunk]|
|Output Labels:|[sentence_embeddings]|
|Language:|[en]|
|Case sensitive:|false|

{:.h2_title}
## Data Source
Tuned on the MedNLI dataset using Bluebert weights.

---
layout: model
title: Legal Revolving Credit Agreement Document Classifier (Longformer)
author: John Snow Labs
name: legclf_revolving_credit_agreement
date: 2022-12-06
tags: [en, legal, classification, agreement, revolving, credit, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The `legclf_revolving_credit_agreement` model is a Legal Longformer Document Classifier that determines whether a document belongs to the class `revolving-credit-agreement` or not (binary classification).

Longformer models are limited to 4096 tokens, so only the first 4096 tokens of each document are taken into account. In our experience, for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document itself without extra leading material, 4096 tokens are enough for document classification. If that is not the case for you, let us know and we can take another approach: splitting the document into 4096-token chunks, averaging their embeddings, and training on the averaged version, so that the whole document is taken into account. In practice, this should rarely be necessary.
## Predicted Entities `revolving-credit-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_revolving_credit_agreement_en_1.0.0_3.0_1670358374165.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_revolving_credit_agreement_en_1.0.0_3.0_1670358374165.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\
    .setInputCols("document", "token")\
    .setOutputCol("embeddings")

sentence_embeddings = nlp.SentenceEmbeddings()\
    .setInputCols(["document", "embeddings"])\
    .setOutputCol("sentence_embeddings")\
    .setPoolingStrategy("AVERAGE")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_revolving_credit_agreement", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    embeddings,
    sentence_embeddings,
    doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")

model = nlpPipeline.fit(df)
result = model.transform(df)
```
## Results ```bash +-------+ |result| +-------+ |[revolving-credit-agreement]| |[other]| |[other]| |[revolving-credit-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_revolving_credit_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.4 MB| ## References Legal documents, scrapped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.98 0.97 0.98 111 revolving-credit-agreement 0.93 0.96 0.95 45 accuracy - - 0.97 156 macro-avg 0.96 0.96 0.96 156 weighted-avg 0.97 0.97 0.97 156 ``` --- layout: model title: Word2Vec Embeddings in Czech (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, cs, open_source] task: Embeddings language: cs edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_cs_3.4.1_3.0_1647292556306.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_cs_3.4.1_3.0_1647292556306.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","cs") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Miluji jiskru nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","cs") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Miluji jiskru nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("cs.embed.w2v_cc_300d").predict("""Miluji jiskru nlp""") ```
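Per-token word vectors like the 300-dimensional embeddings above are often pooled into a single sentence-level vector by averaging. A minimal sketch of that pooling step in plain Python (illustrative helper, not a Spark NLP API):

```python
def average_vectors(vectors):
    # Average a non-empty list of equal-length per-token vectors into
    # one sentence-level vector (simple mean pooling).
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

print(average_vectors([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```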
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|cs| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Fast Neural Machine Translation Model from Greenlandic to English author: John Snow Labs name: opus_mt_kl_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, kl, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `kl` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_kl_en_xx_2.7.0_2.4_1609165131019.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_kl_en_xx_2.7.0_2.4_1609165131019.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_kl_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_kl_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.kl.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_kl_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Sentence Embeddings - sbert medium (tuned) author: John Snow Labs name: jsl_sbert_medium_rxnorm date: 2021-12-21 tags: [embeddings, clinical, licensed, en] task: Embeddings language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps sentences & documents to a 512-dimensional dense vector space by using average pooling on top of BERT model. It's also fine-tuned on the RxNorm dataset to help generalization over medication-related datasets. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_sbert_medium_rxnorm_en_3.3.4_2.4_1640118356633.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_sbert_medium_rxnorm_en_3.3.4_2.4_1640118356633.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
sentence_embeddings = BertSentenceEmbeddings.pretrained("jsl_sbert_medium_rxnorm", "en", "clinical/models")\
    .setInputCols(["sentence"])\
    .setOutputCol("sbert_embeddings")
```
```scala
val sentence_embeddings = BertSentenceEmbeddings.pretrained("jsl_sbert_medium_rxnorm", "en", "clinical/models")
    .setInputCols("sentence")
    .setOutputCol("sbert_embeddings")
```

{:.nlu-block}
```python
import nlu
nlu.load("en.embed_sentence.bert_medium.rxnorm").predict("""Put your text here.""")
```
## Results

Gives a 512-dimensional vector representation of the sentence.

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|jsl_sbert_medium_rxnorm|
|Compatibility:|Healthcare NLP 3.3.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|153.9 MB|
|Case sensitive:|false|

## Data Source

Tuned on the RxNorm dataset.

---
layout: model
title: English asr_iwslt_asr_wav2vec_large_4500h TFWav2Vec2ForCTC from nguyenvulebinh
author: John Snow Labs
name: pipeline_asr_iwslt_asr_wav2vec_large_4500h
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_iwslt_asr_wav2vec_large_4500h` is an English model originally trained by nguyenvulebinh.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_iwslt_asr_wav2vec_large_4500h_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_iwslt_asr_wav2vec_large_4500h_en_4.2.0_3.0_1664042851317.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_iwslt_asr_wav2vec_large_4500h_en_4.2.0_3.0_1664042851317.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_iwslt_asr_wav2vec_large_4500h', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_iwslt_asr_wav2vec_large_4500h", lang = "en") val annotations = pipeline.transform(audioDF) ```
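The pipeline consumes raw audio as an array of floats (the `audio_content` column assembled by AudioAssembler). If your source is 16-bit PCM, the samples can be normalized to [-1.0, 1.0] first; a minimal sketch, assuming mono PCM input (the helper name is hypothetical):

```python
def pcm16_to_float(samples):
    # Convert signed 16-bit PCM integer samples to floats in [-1.0, 1.0],
    # the raw-audio representation expected for the "audio_content" column
    # (assumption: mono audio, one sample per element).
    return [s / 32768.0 for s in samples]

print(pcm16_to_float([0, 16384, -32768]))  # [0.0, 0.5, -1.0]
```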
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_iwslt_asr_wav2vec_large_4500h| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Detect PHI for Deidentification (ner_deidentification_dl) author: John Snow Labs name: ner_deidentify_dl date: 2021-02-01 tags: [ner, deidentify, en, clinical, licensed] supported: true edition: Spark NLP 2.7.2 spark_version: 2.4 annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotator (NERDLModel) allows for a generic model to be trained by utilizing a deep learning algorithm (Char CNNs - BiLSTM - CRF - word embeddings) inspired on a former state of the art model for NER: Chiu & Nicols, Named Entity Recognition with Bidirectional LSTM,CNN. Deidentification NER (DL) is a Named Entity Recognition model that annotates text to find protected health information that may need to be deidentified. 
## Predicted Entities

`Age`,`BIOID`,`City`,`Country`,`Date`,`Device`,`Doctor`,`EMail`,`Hospital`,`Fax`,`Healthplan`,`Idnum`,`Location-Other`,`Medicalrecord`,`Organization`,`Patient`,`Phone`,`Profession`,`State`,`Street`,`URL`,`Username`,`Zip`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deidentify_dl_en_2.7.2_2.4_1612178436389.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deidentify_dl_en_2.7.2_2.4_1612178436389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use

The model is trained with the 'embeddings_clinical' word embeddings model, so be sure to use the same embeddings in the pipeline.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
model = NerDLModel.pretrained("ner_deidentify_dl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "word_embeddings"]) \
    .setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, model, ner_converter])

pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))

input_text = ['''A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 7194334 Date : 01/13/93 PCP : Oliveira , 25 month years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street''']

result = pipeline_model.transform(spark.createDataFrame([input_text], ["text"]))
```
```scala
val model = NerDLModel.pretrained("ner_deidentify_dl", "en", "clinical/models")
    .setInputCols("sentence", "token", "word_embeddings")
    .setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, model, ner_converter))

val data = Seq("""A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 7194334 Date : 01/13/93 PCP : Oliveira , 25 month years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street""").toDS.toDF("text")

val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
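Once the PHI chunks have been extracted, a simple post-processing step is to mask them in the original text. A minimal sketch over (chunk, label) pairs like those in the Results section (hypothetical helper, not the Spark NLP DeIdentification annotator):

```python
def mask_phi(text, chunks):
    # Replace each detected PHI chunk with its entity label placeholder.
    # `chunks` is a list of (chunk_text, label) pairs; naive string
    # replacement, sufficient as an illustration only.
    for chunk, label in chunks:
        text = text.replace(chunk, f"<{label}>")
    return text

masked = mask_phi(
    "Record date : 2093-01-13 , David Hale",
    [("2093-01-13", "DATE"), ("David Hale", "DOCTOR")],
)
print(masked)  # Record date : <DATE> , <DOCTOR>
```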
## Results ```bash +---------------+-----+ |ner_label |count| +---------------+-----+ |O |28 | |I-HOSPITAL |4 | |B-DATE |3 | |I-STREET |3 | |I-PATIENT |2 | |B-DOCTOR |2 | |B-AGE |1 | |B-PATIENT |1 | |I-DOCTOR |1 | |B-MEDICALRECORD|1 | +---------------+-----+. +-----------------------------+-------------+ |chunk |ner_label | +-----------------------------+-------------+ |2093-01-13 |DATE | |David Hale |DOCTOR | |Hendrickson , Ora |PATIENT | |7194334 |MEDICALRECORD| |01/13/93 |DATE | |Oliveira |DOCTOR | |25 |AGE | |2079-11-09 |DATE | |Cocke County Baptist Hospital|HOSPITAL | |0295 Keats Street |STREET | +-----------------------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deidentify_dl| |Type:|ner| |Compatibility:|Spark NLP 2.7.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Dependencies:|embeddings_clinical| ## Data Source Trained on JSL enriched n2c2 2014: De-identification and Heart Disease Risk Factors Challenge datasets with embeddings_clinical https://portal.dbmi.hms.harvard.edu/projects/n2c2-2014/ ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|:-----------------|------:|-----:|-----:|---------:|---------:|---------:| | 1 | I-AGE | 7 | 3 | 6 | 0.7 | 0.538462 | 0.608696 | | 2 | I-DOCTOR | 800 | 27 | 94 | 0.967352 | 0.894855 | 0.929692 | | 3 | I-IDNUM | 6 | 0 | 2 | 1 | 0.75 | 0.857143 | | 4 | B-DATE | 1883 | 34 | 56 | 0.982264 | 0.971119 | 0.97666 | | 5 | I-DATE | 425 | 28 | 25 | 0.93819 | 0.944444 | 0.941307 | | 6 | B-PHONE | 29 | 7 | 9 | 0.805556 | 0.763158 | 0.783784 | | 7 | B-STATE | 87 | 4 | 11 | 0.956044 | 0.887755 | 0.920635 | | 8 | B-CITY | 35 | 11 | 26 | 0.76087 | 0.57377 | 0.654206 | | 9 | I-ORGANIZATION | 12 | 4 | 15 | 0.75 | 0.444444 | 0.55814 | | 10 | B-DOCTOR | 728 | 75 | 53 | 0.9066 | 0.932138 | 0.919192 | | 11 | I-PROFESSION | 43 | 11 | 13 | 0.796296 | 0.767857 | 0.781818 | | 12 | 
I-PHONE | 62 | 4 | 4 | 0.939394 | 0.939394 | 0.939394 | | 13 | B-AGE | 234 | 13 | 16 | 0.947368 | 0.936 | 0.94165 | | 14 | B-STREET | 20 | 7 | 16 | 0.740741 | 0.555556 | 0.634921 | | 15 | I-ZIP | 60 | 3 | 2 | 0.952381 | 0.967742 | 0.96 | | 16 | I-MEDICALRECORD | 54 | 5 | 2 | 0.915254 | 0.964286 | 0.93913 | | 17 | B-ZIP | 2 | 1 | 0 | 0.666667 | 1 | 0.8 | | 18 | B-HOSPITAL | 256 | 23 | 66 | 0.917563 | 0.795031 | 0.851913 | | 19 | I-STREET | 150 | 17 | 20 | 0.898204 | 0.882353 | 0.890208 | | 20 | B-COUNTRY | 22 | 2 | 8 | 0.916667 | 0.733333 | 0.814815 | | 21 | I-COUNTRY | 1 | 0 | 0 | 1 | 1 | 1 | | 22 | I-STATE | 6 | 0 | 1 | 1 | 0.857143 | 0.923077 | | 23 | B-USERNAME | 30 | 0 | 4 | 1 | 0.882353 | 0.9375 | | 24 | I-HOSPITAL | 295 | 37 | 64 | 0.888554 | 0.821727 | 0.853835 | | 25 | I-PATIENT | 243 | 26 | 41 | 0.903346 | 0.855634 | 0.878843 | | 26 | B-PROFESSION | 52 | 8 | 17 | 0.866667 | 0.753623 | 0.806202 | | 27 | B-IDNUM | 32 | 3 | 12 | 0.914286 | 0.727273 | 0.810127 | | 28 | I-CITY | 76 | 15 | 13 | 0.835165 | 0.853933 | 0.844444 | | 29 | B-PATIENT | 337 | 29 | 40 | 0.920765 | 0.893899 | 0.907133 | | 30 | B-MEDICALRECORD | 74 | 6 | 4 | 0.925 | 0.948718 | 0.936709 | | 31 | B-ORGANIZATION | 20 | 5 | 13 | 0.8 | 0.606061 | 0.689655 | | 32 | Macro-average | 6083 | 408 | 673 | 0.7976 | 0.697533 | 0.744218 | | 33 | Micro-average | 6083 | 408 | 673 | 0.937144 | 0.900385 | 0.918397 | ``` --- layout: model title: English BertForQuestionAnswering Base Cased model (from Khanh) author: John Snow Labs name: bert_qa_khanh_base_multilingual_cased_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and 
production-readiness using Spark NLP. `bert-base-multilingual-cased-finetuned-squad` is an English model originally trained by `Khanh`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_khanh_base_multilingual_cased_finetuned_squad_en_4.0.0_3.0_1657183221323.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_khanh_base_multilingual_cased_finetuned_squad_en_4.0.0_3.0_1657183221323.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_khanh_base_multilingual_cased_finetuned_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_khanh_base_multilingual_cased_finetuned_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
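When several candidate answers come back, the highest-scoring one can be picked from a `fullAnnotate()`-style result. A minimal sketch, assuming each answer annotation is a dict with its text in `result` and a stringified `score` in its metadata (an assumption about the annotation layout, not a documented API):

```python
def best_answer(annotations):
    # Return the answer text with the highest score; `annotations` is a
    # list of dicts mirroring answer annotations (assumed structure).
    return max(annotations, key=lambda a: float(a["metadata"].get("score", 0)))["result"]

candidates = [
    {"result": "Clara", "metadata": {"score": "0.98"}},
    {"result": "Berkeley", "metadata": {"score": "0.12"}},
]
print(best_answer(candidates))  # Clara
```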
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_khanh_base_multilingual_cased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|665.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Khanh/bert-base-multilingual-cased-finetuned-squad --- layout: model title: English image_classifier_vit_deit_tiny_patch16_224 ViTForImageClassification from facebook author: John Snow Labs name: image_classifier_vit_deit_tiny_patch16_224 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_deit_tiny_patch16_224` is a English model originally trained by facebook. 
## Predicted Entities `turnstile`, `damselfly`, `mixing bowl`, `sea snake`, `cockroach, roach`, `buckle`, `beer glass`, `bulbul`, `lumbermill, sawmill`, `whippet`, `Australian terrier`, `television, television system`, `hoopskirt, crinoline`, `horse cart, horse-cart`, `guillotine`, `malamute, malemute, Alaskan malamute`, `coyote, prairie wolf, brush wolf, Canis latrans`, `colobus, colobus monkey`, `hognose snake, puff adder, sand viper`, `sock`, `burrito`, `printer`, `bathing cap, swimming cap`, `chiton, coat-of-mail shell, sea cradle, polyplacophore`, `Rottweiler`, `cello, violoncello`, `pitcher, ewer`, `computer keyboard, keypad`, `bow`, `peacock`, `ballplayer, baseball player`, `refrigerator, icebox`, `solar dish, solar collector, solar furnace`, `passenger car, coach, carriage`, `African chameleon, Chamaeleo chamaeleon`, `oboe, hautboy, hautbois`, `toyshop`, `Leonberg`, `howler monkey, howler`, `bluetick`, `African elephant, Loxodonta africana`, `American lobster, Northern lobster, Maine lobster, Homarus americanus`, `combination lock`, `black-and-tan coonhound`, `bonnet, poke bonnet`, `harvester, reaper`, `Appenzeller`, `iron, smoothing iron`, `electric locomotive`, `lycaenid, lycaenid butterfly`, `sandbar, sand bar`, `Cardigan, Cardigan Welsh corgi`, `pencil sharpener`, `jean, blue jean, denim`, `backpack, back pack, knapsack, packsack, rucksack, haversack`, `monitor`, `ice cream, icecream`, `apiary, bee house`, `water jug`, `American coot, marsh hen, mud hen, water hen, Fulica americana`, `ground beetle, carabid beetle`, `jigsaw puzzle`, `ant, emmet, pismire`, `wreck`, `kuvasz`, `gyromitra`, `Ibizan hound, Ibizan Podenco`, `brown bear, bruin, Ursus arctos`, `bolo tie, bolo, bola tie, bola`, `Pembroke, Pembroke Welsh corgi`, `French bulldog`, `prison, prison house`, `ballpoint, ballpoint pen, ballpen, Biro`, `stage`, `airliner`, `dogsled, dog sled, dog sleigh`, `redshank, Tringa totanus`, `menu`, `Indian cobra, Naja naja`, `swab, swob, mop`, `window screen`, 
`brain coral`, `artichoke, globe artichoke`, `loupe, jeweler's loupe`, `loudspeaker, speaker, speaker unit, loudspeaker system, speaker system`, `panpipe, pandean pipe, syrinx`, `wok`, `croquet ball`, `plate`, `scoreboard`, `Samoyed, Samoyede`, `ocarina, sweet potato`, `beaver`, `borzoi, Russian wolfhound`, `horizontal bar, high bar`, `stretcher`, `seat belt, seatbelt`, `obelisk`, `forklift`, `feather boa, boa`, `frying pan, frypan, skillet`, `barbershop`, `hamper`, `face powder`, `Siamese cat, Siamese`, `ladle`, `dingo, warrigal, warragal, Canis dingo`, `mountain tent`, `head cabbage`, `echidna, spiny anteater, anteater`, `Polaroid camera, Polaroid Land camera`, `dumbbell`, `espresso`, `notebook, notebook computer`, `Norfolk terrier`, `binoculars, field glasses, opera glasses`, `carpenter's kit, tool kit`, `moving van`, `catamaran`, `tiger beetle`, `bikini, two-piece`, `Siberian husky`, `studio couch, day bed`, `bulletproof vest`, `lawn mower, mower`, `promontory, headland, head, foreland`, `soap dispenser`, `vulture`, `dam, dike, dyke`, `brambling, Fringilla montifringilla`, `toilet tissue, toilet paper, bathroom tissue`, `ringlet, ringlet butterfly`, `tiger cat`, `mobile home, manufactured home`, `Norwich terrier`, `little blue heron, Egretta caerulea`, `English setter`, `Tibetan mastiff`, `rocking chair, rocker`, `mask`, `maze, labyrinth`, `bookcase`, `viaduct`, `sweatshirt`, `plow, plough`, `basenji`, `typewriter keyboard`, `Windsor tie`, `coral fungus`, `desktop computer`, `Kerry blue terrier`, `Angora, Angora rabbit`, `can opener, tin opener`, `shield, buckler`, `triumphal arch`, `horned viper, cerastes, sand viper, horned asp, Cerastes cornutus`, `miniature schnauzer`, `tape player`, `jaguar, panther, Panthera onca, Felis onca`, `hook, claw`, `file, file cabinet, filing cabinet`, `chime, bell, gong`, `shower curtain`, `window shade`, `acoustic guitar`, `gas pump, gasoline pump, petrol pump, island dispenser`, `cicada, cicala`, `Petri dish`, `paintbrush`, 
`banana`, `chickadee`, `mountain bike, all-terrain bike, off-roader`, `lighter, light, igniter, ignitor`, `oil filter`, `cab, hack, taxi, taxicab`, `Christmas stocking`, `rugby ball`, `black widow, Latrodectus mactans`, `bustard`, `fiddler crab`, `web site, website, internet site, site`, `chocolate sauce, chocolate syrup`, `chainlink fence`, `fireboat`, `cocktail shaker`, `airship, dirigible`, `projectile, missile`, `bagel, beigel`, `screwdriver`, `oystercatcher, oyster catcher`, `pot, flowerpot`, `water bottle`, `Loafer`, `drumstick`, `soccer ball`, `cairn, cairn terrier`, `padlock`, `tow truck, tow car, wrecker`, `bloodhound, sleuthhound`, `punching bag, punch bag, punching ball, punchball`, `great grey owl, great gray owl, Strix nebulosa`, `scale, weighing machine`, `trench coat`, `briard`, `cheetah, chetah, Acinonyx jubatus`, `entertainment center`, `Boston bull, Boston terrier`, `Arabian camel, dromedary, Camelus dromedarius`, `steam locomotive`, `coil, spiral, volute, whorl, helix`, `plane, carpenter's plane, woodworking plane`, `gondola`, `spider web, spider's web`, `bathtub, bathing tub, bath, tub`, `pelican`, `miniature poodle`, `cowboy boot`, `perfume, essence`, `lakeside, lakeshore`, `timber wolf, grey wolf, gray wolf, Canis lupus`, `moped`, `sunscreen, sunblock, sun blocker`, `Brabancon griffon`, `puffer, pufferfish, blowfish, globefish`, `lifeboat`, `pool table, billiard table, snooker table`, `Bouvier des Flandres, Bouviers des Flandres`, `Pomeranian`, `theater curtain, theatre curtain`, `marimba, xylophone`, `baboon`, `vacuum, vacuum cleaner`, `pill bottle`, `pick, plectrum, plectron`, `hen`, `American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier`, `digital watch`, `pier`, `oxygen mask`, `Tibetan terrier, chrysanthemum dog`, `ostrich, Struthio camelus`, `water ouzel, dipper`, `drilling platform, offshore rig`, `magnetic compass`, `throne`, `butternut squash`, `minibus`, `EntleBucher`, `carousel, carrousel, 
merry-go-round, roundabout, whirligig`, `hot pot, hotpot`, `rain barrel`, `wood rabbit, cottontail, cottontail rabbit`, `miniature pinscher`, `partridge`, `three-toed sloth, ai, Bradypus tridactylus`, `English springer, English springer spaniel`, `corkscrew, bottle screw`, `fur coat`, `robin, American robin, Turdus migratorius`, `dowitcher`, `ruddy turnstone, Arenaria interpres`, `water snake`, `stove`, `Great Pyrenees`, `soft-coated wheaten terrier`, `carbonara`, `snail`, `breastplate, aegis, egis`, `wolf spider, hunting spider`, `hatchet`, `CD player`, `axolotl, mud puppy, Ambystoma mexicanum`, `pomegranate`, `poncho`, `leatherback turtle, leatherback, leathery turtle, Dermochelys coriacea`, `lorikeet`, `spatula`, `jay`, `platypus, duckbill, duckbilled platypus, duck-billed platypus, Ornithorhynchus anatinus`, `stethoscope`, `flagpole, flagstaff`, `coho, cohoe, coho salmon, blue jack, silver salmon, Oncorhynchus kisutch`, `agama`, `red wolf, maned wolf, Canis rufus, Canis niger`, `beaker`, `eft`, `pretzel`, `brassiere, bra, bandeau`, `frilled lizard, Chlamydosaurus kingi`, `joystick`, `goldfish, Carassius auratus`, `fig`, `maypole`, `caldron, cauldron`, `admiral`, `impala, Aepyceros melampus`, `spotted salamander, Ambystoma maculatum`, `syringe`, `hog, pig, grunter, squealer, Sus scrofa`, `handkerchief, hankie, hanky, hankey`, `tarantula`, `cheeseburger`, `pinwheel`, `sax, saxophone`, `dung beetle`, `broccoli`, `cassette player`, `milk can`, `traffic light, traffic signal, stoplight`, `shovel`, `sarong`, `tabby, tabby cat`, `parallel bars, bars`, `ladybug, ladybeetle, lady beetle, ladybird, ladybird beetle`, `quill, quill pen`, `giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca`, `steel drum`, `quail`, `Blenheim spaniel`, `wig`, `hamster`, `ice lolly, lolly, lollipop, popsicle`, `seashore, coast, seacoast, sea-coast`, `chest`, `worm fence, snake fence, snake-rail fence, Virginia fence`, `missile`, `beer bottle`, `yellow lady's slipper, yellow 
lady-slipper, Cypripedium calceolus, Cypripedium parviflorum`, `breakwater, groin, groyne, mole, bulwark, seawall, jetty`, `white wolf, Arctic wolf, Canis lupus tundrarum`, `guacamole`, `porcupine, hedgehog`, `trolleybus, trolley coach, trackless trolley`, `greenhouse, nursery, glasshouse`, `trimaran`, `Italian greyhound`, `potter's wheel`, `jacamar`, `wallet, billfold, notecase, pocketbook`, `Lakeland terrier`, `green lizard, Lacerta viridis`, `indigo bunting, indigo finch, indigo bird, Passerina cyanea`, `green mamba`, `walking stick, walkingstick, stick insect`, `crossword puzzle, crossword`, `eggnog`, `barrow, garden cart, lawn cart, wheelbarrow`, `remote control, remote`, `bicycle-built-for-two, tandem bicycle, tandem`, `wool, woolen, woollen`, `black grouse`, `abaya`, `marmoset`, `golf ball`, `jeep, landrover`, `Mexican hairless`, `dishwasher, dish washer, dishwashing machine`, `jersey, T-shirt, tee shirt`, `planetarium`, `goose`, `mailbox, letter box`, `capuchin, ringtail, Cebus capucinus`, `marmot`, `orangutan, orang, orangutang, Pongo pygmaeus`, `coffeepot`, `ambulance`, `shopping basket`, `pop bottle, soda bottle`, `red fox, Vulpes vulpes`, `crash helmet`, `street sign`, `affenpinscher, monkey pinscher, monkey dog`, `Arctic fox, white fox, Alopex lagopus`, `sidewinder, horned rattlesnake, Crotalus cerastes`, `ruffed grouse, partridge, Bonasa umbellus`, `muzzle`, `measuring cup`, `canoe`, `reflex camera`, `fox squirrel, eastern fox squirrel, Sciurus niger`, `French loaf`, `killer whale, killer, orca, grampus, sea wolf, Orcinus orca`, `dial telephone, dial phone`, `thimble`, `bubble`, `vizsla, Hungarian pointer`, `running shoe`, `mailbag, postbag`, `radio telescope, radio reflector`, `piggy bank, penny bank`, `Chihuahua`, `chambered nautilus, pearly nautilus, nautilus`, `Airedale, Airedale terrier`, `kimono`, `green snake, grass snake`, `rubber eraser, rubber, pencil eraser`, `upright, upright piano`, `orange`, `revolver, six-gun, six-shooter`, `ashcan, 
trash can, garbage can, wastebin, ash bin, ash-bin, ashbin, dustbin, trash barrel, trash bin`, `drum, membranophone, tympan`, `Dungeness crab, Cancer magister`, `lipstick, lip rouge`, `gong, tam-tam`, `fountain`, `tub, vat`, `malinois`, `sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita`, `German short-haired pointer`, `apron`, `Irish setter, red setter`, `dishrag, dishcloth`, `school bus`, `candle, taper, wax light`, `bib`, `cash machine, cash dispenser, automated teller machine, automatic teller machine, automated teller, automatic teller, ATM`, `power drill`, `English foxhound`, `miniskirt, mini`, `swing`, `slug`, `hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa`, `rifle`, `Saluki, gazelle hound`, `Sealyham terrier, Sealyham`, `bullet train, bullet`, `hyena, hyaena`, `ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus`, `toy terrier`, `goblet`, `safe`, `cup`, `electric guitar`, `red wine`, `restaurant, eating house, eating place, eatery`, `wall clock`, `washbasin, handbasin, washbowl, lavabo, wash-hand basin`, `red-breasted merganser, Mergus serrator`, `crate`, `banded gecko`, `hippopotamus, hippo, river horse, Hippopotamus amphibius`, `tick`, `tripod`, `sombrero`, `desk`, `sea slug, nudibranch`, `racer, race car, racing car`, `pizza, pizza pie`, `dining table, board`, `Saint Bernard, St Bernard`, `komondor`, `electric ray, crampfish, numbfish, torpedo`, `prairie chicken, prairie grouse, prairie fowl`, `coffee mug`, `hammer`, `golfcart, golf cart`, `unicycle, monocycle`, `bison`, `soup bowl`, `rapeseed`, `golden retriever`, `plastic bag`, `grey fox, gray fox, Urocyon cinereoargenteus`, `water tower`, `house finch, linnet, Carpodacus mexicanus`, `barbell`, `hair slide`, `tiger, Panthera tigris`, `black-footed ferret, ferret, Mustela nigripes`, `meat loaf, meatloaf`, `hand blower, blow dryer, blow drier, hair dryer, hair drier`, `overskirt`, `gibbon, Hylobates lar`, `Gila monster, Heloderma suspectum`, `toucan`, 
`snowmobile`, `pencil box, pencil case`, `scuba diver`, `cloak`, `Sussex spaniel`, `otter`, `Greater Swiss Mountain dog`, `great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias`, `torch`, `magpie`, `tiger shark, Galeocerdo cuvieri`, `wing`, `Border collie`, `bell cote, bell cot`, `sea anemone, anemone`, `teapot`, `sea urchin`, `screen, CRT screen`, `bookshop, bookstore, bookstall`, `oscilloscope, scope, cathode-ray oscilloscope, CRO`, `crib, cot`, `police van, police wagon, paddy wagon, patrol wagon, wagon, black Maria`, `hartebeest`, `manhole cover`, `iPod`, `rock python, rock snake, Python sebae`, `nipple`, `suspension bridge`, `safety pin`, `sea lion`, `cougar, puma, catamount, mountain lion, painter, panther, Felis concolor`, `mantis, mantid`, `wardrobe, closet, press`, `projector`, `Granny Smith`, `diamondback, diamondback rattlesnake, Crotalus adamanteus`, `pirate, pirate ship`, `espresso maker`, `African hunting dog, hyena dog, Cape hunting dog, Lycaon pictus`, `cradle`, `common newt, Triturus vulgaris`, `tricycle, trike, velocipede`, `bobsled, bobsleigh, bob`, `thunder snake, worm snake, Carphophis amoenus`, `thresher, thrasher, threshing machine`, `banjo`, `armadillo`, `pajama, pyjama, pj's, jammies`, `ski`, `Maltese dog, Maltese terrier, Maltese`, `leafhopper`, `book jacket, dust cover, dust jacket, dust wrapper`, `silky terrier, Sydney silky`, `Shih-Tzu`, `wallaby, brush kangaroo`, `cardigan`, `sturgeon`, `freight car`, `home theater, home theatre`, `sundial`, `African crocodile, Nile crocodile, Crocodylus niloticus`, `odometer, hodometer, mileometer, milometer`, `sliding door`, `vine snake`, `West Highland white terrier`, `mongoose`, `hornbill`, `beagle`, `European gallinule, Porphyrio porphyrio`, `submarine, pigboat, sub, U-boat`, `Komodo dragon, Komodo lizard, dragon lizard, giant lizard, Varanus komodoensis`, `cock`, `pedestal, plinth, footstall`, `accordion, piano accordion, squeeze box`, `gown`, `lynx, catamount`, 
`guenon, guenon monkey`, `Walker hound, Walker foxhound`, `standard schnauzer`, `reel`, `hip, rose hip, rosehip`, `grasshopper, hopper`, `Dutch oven`, `stone wall`, `hard disc, hard disk, fixed disk`, `snow leopard, ounce, Panthera uncia`, `shopping cart`, `digital clock`, `hourglass`, `Border terrier`, `Old English sheepdog, bobtail`, `academic gown, academic robe, judge's robe`, `spiny lobster, langouste, rock lobster, crawfish, crayfish, sea crawfish`, `spotlight, spot`, `dome`, `barn spider, Araneus cavaticus`, `bee eater`, `basketball`, `cliff dwelling`, `folding chair`, `isopod`, `Doberman, Doberman pinscher`, `bittern`, `sunglasses, dark glasses, shades`, `picket fence, paling`, `Crock Pot`, `ibex, Capra ibex`, `neck brace`, `cardoon`, `cassette`, `amphibian, amphibious vehicle`, `minivan`, `analog clock`, `trailer truck, tractor trailer, trucking rig, rig, articulated lorry, semi`, `yurt`, `cliff, drop, drop-off`, `Bernese mountain dog`, `teddy, teddy bear`, `sloth bear, Melursus ursinus, Ursus ursinus`, `bassoon`, `toaster`, `ptarmigan`, `Gordon setter`, `night snake, Hypsiglena torquata`, `grand piano, grand`, `purse`, `clumber, clumber spaniel`, `shoji`, `hair spray`, `maillot`, `knee pad`, `space heater`, `bottlecap`, `chiffonier, commode`, `chain saw, chainsaw`, `sulphur butterfly, sulfur butterfly`, `pay-phone, pay-station`, `kelpie`, `mouse, computer mouse`, `car wheel`, `cornet, horn, trumpet, trump`, `container ship, containership, container vessel`, `matchstick`, `scabbard`, `American black bear, black bear, Ursus americanus, Euarctos americanus`, `langur`, `rock crab, Cancer irroratus`, `lionfish`, `speedboat`, `black stork, Ciconia nigra`, `knot`, `disk brake, disc brake`, `mosquito net`, `white stork, Ciconia ciconia`, `abacus`, `titi, titi monkey`, `grocery store, grocery, food market, market`, `waffle iron`, `pickelhaube`, `wooden spoon`, `Norwegian elkhound, elkhound`, `earthstar`, `sewing machine`, `balance beam, beam`, `potpie`, `chain 
mail, ring mail, mail, chain armor, chain armour, ring armor, ring armour`, `Staffordshire bullterrier, Staffordshire bull terrier`, `switch, electric switch, electrical switch`, `dhole, Cuon alpinus`, `paddle, boat paddle`, `limousine, limo`, `Shetland sheepdog, Shetland sheep dog, Shetland`, `space bar`, `library`, `paddlewheel, paddle wheel`, `alligator lizard`, `Band Aid`, `Persian cat`, `bull mastiff`, `tailed frog, bell toad, ribbed toad, tailed toad, Ascaphus trui`, `sports car, sport car`, `football helmet`, `laptop, laptop computer`, `lens cap, lens cover`, `tennis ball`, `violin, fiddle`, `lab coat, laboratory coat`, `cinema, movie theater, movie theatre, movie house, picture palace`, `weasel`, `bow tie, bow-tie, bowtie`, `macaw`, `dough`, `whiskey jug`, `microphone, mike`, `spoonbill`, `bassinet`, `mud turtle`, `velvet`, `warthog`, `plunger, plumber's helper`, `dugong, Dugong dugon`, `honeycomb`, `badger`, `dragonfly, darning needle, devil's darning needle, sewing needle, snake feeder, snake doctor, mosquito hawk, skeeter hawk`, `bee`, `doormat, welcome mat`, `fountain pen`, `giant schnauzer`, `assault rifle, assault gun`, `limpkin, Aramus pictus`, `siamang, Hylobates syndactylus, Symphalangus syndactylus`, `albatross, mollymawk`, `confectionery, confectionary, candy store`, `harp`, `parachute, chute`, `barrel, cask`, `tank, army tank, armored combat vehicle, armoured combat vehicle`, `collie`, `kite`, `puck, hockey puck`, `stupa, tope`, `buckeye, horse chestnut, conker`, `patio, terrace`, `broom`, `Dandie Dinmont, Dandie Dinmont terrier`, `scorpion`, `agaric`, `balloon`, `bucket, pail`, `squirrel monkey, Saimiri sciureus`, `Eskimo dog, husky`, `zebra`, `garter snake, grass snake`, `indri, indris, Indri indri, Indri brevicaudatus`, `tractor`, `guinea pig, Cavia cobaya`, `maraca`, `red-backed sandpiper, dunlin, Erolia alpina`, `bullfrog, Rana catesbeiana`, `trilobite`, `Japanese spaniel`, `gorilla, Gorilla gorilla`, `monastery`, `centipede`, `terrapin`, 
`llama`, `long-horned beetle, longicorn, longicorn beetle`, `boxer`, `curly-coated retriever`, `mortar`, `hammerhead, hammerhead shark`, `goldfinch, Carduelis carduelis`, `garden spider, Aranea diademata`, `stopwatch, stop watch`, `grey whale, gray whale, devilfish, Eschrichtius gibbosus, Eschrichtius robustus`, `leaf beetle, chrysomelid`, `birdhouse`, `king crab, Alaska crab, Alaskan king crab, Alaska king crab, Paralithodes camtschatica`, `stole`, `bell pepper`, `radiator`, `flatworm, platyhelminth`, `mushroom`, `Scotch terrier, Scottish terrier, Scottie`, `liner, ocean liner`, `toilet seat`, `lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens`, `zucchini, courgette`, `harvestman, daddy longlegs, Phalangium opilio`, `Newfoundland, Newfoundland dog`, `flamingo`, `whiptail, whiptail lizard`, `geyser`, `cleaver, meat cleaver, chopper`, `sea cucumber, holothurian`, `American egret, great white heron, Egretta albus`, `parking meter`, `beacon, lighthouse, beacon light, pharos`, `coucal`, `motor scooter, scooter`, `mitten`, `cannon`, `weevil`, `megalith, megalithic structure`, `stinkhorn, carrion fungus`, `ear, spike, capitulum`, `box turtle, box tortoise`, `snowplow, snowplough`, `tench, Tinca tinca`, `modem`, `tobacco shop, tobacconist shop, tobacconist`, `barn`, `skunk, polecat, wood pussy`, `African grey, African gray, Psittacus erithacus`, `Madagascar cat, ring-tailed lemur, Lemur catta`, `holster`, `barometer`, `sleeping bag`, `washer, automatic washer, washing machine`, `recreational vehicle, RV, R.V.`, `drake`, `tray`, `butcher shop, meat market`, `china cabinet, china closet`, `medicine chest, medicine cabinet`, `photocopier`, `Yorkshire terrier`, `starfish, sea star`, `racket, racquet`, `park bench`, `Labrador retriever`, `whistle`, `clog, geta, patten, sabot`, `volcano`, `quilt, comforter, comfort, puff`, `leopard, Panthera pardus`, `cauliflower`, `swimming trunks, bathing trunks`, `American chameleon, anole, Anolis carolinensis`, `alp`, 
`mortarboard`, `barracouta, snoek`, `cocker spaniel, English cocker spaniel, cocker`, `space shuttle`, `beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon`, `harmonica, mouth organ, harp, mouth harp`, `gasmask, respirator, gas helmet`, `wombat`, `Model T`, `wild boar, boar, Sus scrofa`, `hermit crab`, `flat-coated retriever`, `rotisserie`, `jinrikisha, ricksha, rickshaw`, `trifle`, `bannister, banister, balustrade, balusters, handrail`, `go-kart`, `bakery, bakeshop, bakehouse`, `ski mask`, `dock, dockage, docking facility`, `Egyptian cat`, `oxcart`, `redbone`, `shoe shop, shoe-shop, shoe store`, `convertible`, `ox`, `crayfish, crawfish, crawdad, crawdaddy`, `cowboy hat, ten-gallon hat`, `conch`, `spaghetti squash`, `toy poodle`, `saltshaker, salt shaker`, `microwave, microwave oven`, `triceratops`, `necklace`, `castle`, `streetcar, tram, tramcar, trolley, trolley car`, `eel`, `diaper, nappy, napkin`, `standard poodle`, `prayer rug, prayer mat`, `radio, wireless`, `crane`, `envelope`, `rule, ruler`, `gar, garfish, garpike, billfish, Lepisosteus osseus`, `spider monkey, Ateles geoffroyi`, `Irish wolfhound`, `German shepherd, German shepherd dog, German police dog, alsatian`, `umbrella`, `sunglass`, `aircraft carrier, carrier, flattop, attack aircraft carrier`, `water buffalo, water ox, Asiatic buffalo, Bubalus bubalis`, `jellyfish`, `groom, bridegroom`, `tree frog, tree-frog`, `steel arch bridge`, `lemon`, `pickup, pickup truck`, `vault`, `groenendael`, `baseball`, `junco, snowbird`, `maillot, tank suit`, `gazelle`, `jack-o'-lantern`, `military uniform`, `slide rule, slipstick`, `wire-haired fox terrier`, `acorn squash`, `electric fan, blower`, `Brittany spaniel`, `chimpanzee, chimp, Pan troglodytes`, `pillow`, `binder, ring-binder`, `schipperke`, `Afghan hound, Afghan`, `plate rack`, `car mirror`, `hand-held computer, hand-held microcomputer`, `papillon`, `schooner`, `Bedlington terrier`, `cellular telephone, cellular phone, 
cellphone, cell, mobile phone`, `altar`, `Chesapeake Bay retriever`, `cabbage butterfly`, `polecat, fitch, foulmart, foumart, Mustela putorius`, `comic book`, `French horn, horn`, `daisy`, `organ, pipe organ`, `mashed potato`, `acorn`, `fly`, `chain`, `American alligator, Alligator mississipiensis`, `mink`, `garbage truck, dustcart`, `totem pole`, `wine bottle`, `strawberry`, `cricket`, `European fire salamander, Salamandra salamandra`, `coral reef`, `Welsh springer spaniel`, `bighorn, bighorn sheep, cimarron, Rocky Mountain bighorn, Rocky Mountain sheep, Ovis canadensis`, `snorkel`, `bald eagle, American eagle, Haliaeetus leucocephalus`, `meerkat, mierkat`, `grille, radiator grille`, `nematode, nematode worm, roundworm`, `anemone fish`, `corn`, `loggerhead, loggerhead turtle, Caretta caretta`, `palace`, `suit, suit of clothes`, `pineapple, ananas`, `macaque`, `ping-pong ball`, `ram, tup`, `church, church building`, `koala, koala bear, kangaroo bear, native bear, Phascolarctos cinereus`, `hare`, `bath towel`, `strainer`, `yawl`, `otterhound, otter hound`, `table lamp`, `king snake, kingsnake`, `lotion`, `lion, king of beasts, Panthera leo`, `thatch, thatched roof`, `basset, basset hound`, `black and gold garden spider, Argiope aurantia`, `barber chair`, `proboscis monkey, Nasalis larvatus`, `consomme`, `Irish terrier`, `Irish water spaniel`, `common iguana, iguana, Iguana iguana`, `Weimaraner`, `Great Dane`, `pug, pug-dog`, `rhinoceros beetle`, `vase`, `brass, memorial tablet, plaque`, `kit fox, Vulpes macrotis`, `king penguin, Aptenodytes patagonica`, `vending machine`, `dalmatian, coach dog, carriage dog`, `rock beauty, Holocanthus tricolor`, `pole`, `cuirass`, `bolete`, `jackfruit, jak, jack`, `monarch, monarch butterfly, milkweed butterfly, Danaus plexippus`, `chow, chow chow`, `nail`, `packet`, `half track`, `Lhasa, Lhasa apso`, `boathouse`, `hay`, `valley, vale`, `slot, one-armed bandit`, `volleyball`, `carton`, `shower cap`, `tile roof`, `lacewing, lacewing 
fly`, `patas, hussar monkey, Erythrocebus patas`, `boa constrictor, Constrictor constrictor`, `black swan, Cygnus atratus`, `lampshade, lamp shade`, `mousetrap`, `crutch`, `vestment`, `Pekinese, Pekingese, Peke`, `tusker`, `warplane, military plane`, `sandal`, `screw`, `custard apple`, `Scottish deerhound, deerhound`, `spindle`, `keeshond`, `hummingbird`, `letter opener, paper knife, paperknife`, `cucumber, cuke`, `bearskin, busby, shako`, `fire engine, fire truck`, `trombone`, `ringneck snake, ring-necked snake, ring snake`, `sorrel`, `fire screen, fireguard`, `paper towel`, `flute, transverse flute`, `hotdog, hot dog, red hot`, `Indian elephant, Elephas maximus`, `mosque`, `stingray`, `Rhodesian ridgeback`, `four-poster` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_deit_tiny_patch16_224_en_4.1.0_3.0_1660166298335.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_deit_tiny_patch16_224_en_4.1.0_3.0_1660166298335.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_deit_tiny_patch16_224", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_deit_tiny_patch16_224", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
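The `patch16_224` suffix in the model name encodes the ViT input geometry: 224x224-pixel images are split into non-overlapping 16x16 patches, and each patch becomes one transformer token. A quick standard-library sketch of that arithmetic (the helper name is illustrative, not part of Spark NLP):

```python
def vit_token_count(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT sees for a square input image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    per_side = image_size // patch_size
    return per_side * per_side

# deit_tiny_patch16_224: 14 x 14 = 196 patch tokens (plus one [CLS] token)
print(vit_token_count(224, 16))  # -> 196
```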
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_deit_tiny_patch16_224| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|22.1 MB| --- layout: model title: English BertForMaskedLM Cased model (from antoinelouis) author: John Snow Labs name: bert_embeddings_netbert date: 2022-12-06 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `netbert` is an English model originally trained by `antoinelouis`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_netbert_en_4.2.4_3.0_1670326953096.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_netbert_en_4.2.4_3.0_1670326953096.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_netbert","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_netbert","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
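A common downstream use of the `embeddings` column is comparing token vectors with cosine similarity. The sketch below uses only the standard library and made-up 3-dimensional vectors; vectors from a BERT-base model such as this one are typically 768-dimensional and would be extracted from the result rows.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 0.0, 1.0]
print(round(cosine_similarity(v1, v2), 3))  # identical vectors -> 1.0
```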
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_netbert| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|405.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/antoinelouis/netbert - https://github.com/antoiloui/netbert/blob/master/docs/index.md --- layout: model title: German Named Entity Recognition (from fhswf) author: John Snow Labs name: bert_ner_bert_de_ner date: 2022-05-09 tags: [bert, ner, token_classification, de, open_source] task: Named Entity Recognition language: de edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert_de_ner` is a German model originally trained by `fhswf`. ## Predicted Entities `LOC`, `PERpart`, `OTHpart`, `PER`, `PERderiv`, `ORGderiv`, `OTHderiv`, `ORG`, `OTH`, `LOCderiv`, `LOCpart`, `ORGpart` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_de_ner_de_3.4.2_3.0_1652099253286.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_de_ner_de_3.4.2_3.0_1652099253286.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_de_ner","de") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_de_ner","de") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
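Token classifiers emit one tag per token, so a common post-processing step is grouping consecutive tags back into entity spans. A minimal sketch, assuming BIO-style tags (`B-ORG`, `I-ORG`, `O`, ...) as typically produced by NER heads; the helper is illustrative and not part of Spark NLP:

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, label) spans."""
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" or a dangling I- tag ends any open span
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

tokens = ["Ich", "liebe", "Spark", "NLP"]
tags = ["O", "O", "B-ORG", "I-ORG"]
print(bio_to_spans(tokens, tags))  # -> [('Spark NLP', 'ORG')]
```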
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_bert_de_ner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|410.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/fhswf/bert_de_ner - https://sites.google.com/site/germeval2014ner --- layout: model title: BERT Biolink Embeddings author: John Snow Labs name: bert_biolink_large date: 2022-04-08 tags: [biolink, bert, medical, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This embeddings component was trained on PubMed abstracts along with citation link information. The model was introduced in [this paper](https://arxiv.org/abs/2203.15827), achieving state-of-the-art performance on several biomedical NLP benchmarks such as [BLURB](https://microsoft.github.io/BLURB/) and [MedQA-USMLE](https://github.com/jind11/MedQA). ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_biolink_large_en_3.4.2_3.0_1649411676807.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_biolink_large_en_3.4.2_3.0_1649411676807.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_biolink_large", "en")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_biolink_large", "en") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.ge").predict("""Put your text here.""") ```
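BertEmbeddings outputs one vector per token. When a single sentence-level vector is needed (for example, for similarity search over biomedical abstracts), a simple option is mean pooling across tokens. A minimal standard-library sketch with toy 2-dimensional vectors; real vectors from a BERT-large model such as this one are typically 1024-dimensional.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("no vectors to pool")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / count for i in range(dim)]

vectors = [[1.0, 2.0], [3.0, 4.0]]
print(mean_pool(vectors))  # -> [2.0, 3.0]
```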
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_biolink_large| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References [https://pubmed.ncbi.nlm.nih.gov/](https://pubmed.ncbi.nlm.nih.gov/) ``` @InProceedings{yasunaga2022linkbert, author = {Michihiro Yasunaga and Jure Leskovec and Percy Liang}, title = {LinkBERT: Pretraining Language Models with Document Links}, year = {2022}, booktitle = {Association for Computational Linguistics (ACL)}, } ``` ## Benchmarking ```bash Scores for several benchmark datasets: - BLURB : 84.30 - PubMedQA : 72.2 - BioASQ : 94.8 - MedQA-USMLE : 44.6 ``` --- layout: model title: English asr_wav2vec2_xls_r_tf_left_right_shuru TFWav2Vec2ForCTC from hrdipto author: John Snow Labs name: asr_wav2vec2_xls_r_tf_left_right_shuru date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_tf_left_right_shuru` is an English model originally trained by hrdipto.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_xls_r_tf_left_right_shuru_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_tf_left_right_shuru_en_4.2.0_3.0_1664039997553.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_tf_left_right_shuru_en_4.2.0_3.0_1664039997553.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xls_r_tf_left_right_shuru", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xls_r_tf_left_right_shuru", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
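Both snippets above assume an `audioDf` whose `audio_content` column already holds the raw audio samples as an array of floats (Wav2vec2 models are typically trained on 16 kHz mono audio). As a rough, library-free sketch of how such a float array could be obtained from a WAV file — the file name, tone, and 16-bit mono format are all invented for illustration, and in practice you would build the Spark DataFrame from these values:

```python
import math
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit mono WAV file and normalize its samples to [-1.0, 1.0]."""
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2 and wav.getnchannels() == 1
        frames = wav.readframes(wav.getnframes())
    n = len(frames) // 2  # two bytes per 16-bit sample
    samples = struct.unpack("<%dh" % n, frames)
    return [s / 32768.0 for s in samples]

# Write a one-second 440 Hz tone so the example is self-contained.
with wave.open("tone.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    for i in range(16000):
        value = int(32767 * math.sin(2 * math.pi * 440 * i / 16000))
        wav.writeframes(struct.pack("<h", value))

floats = wav_to_floats("tone.wav")
print(len(floats), min(floats) >= -1.0, max(floats) <= 1.0)
```

Each such list of floats would become one row of the `audio_content` column consumed by the AudioAssembler.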
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xls_r_tf_left_right_shuru| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Embeddings Sciwiki 50 dims author: John Snow Labs name: embeddings_sciwiki_50d class: WordEmbeddingsModel language: es repository: clinical/models date: 2020-05-27 task: Embeddings edition: Healthcare NLP 2.5.0 spark_version: 2.4 tags: [clinical,embeddings,es] supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/embeddings_sciwiki_50d_es_2.5.0_2.4_1590609287349.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/embeddings_sciwiki_50d_es_2.5.0_2.4_1590609287349.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python model = WordEmbeddingsModel.pretrained("embeddings_sciwiki_50d","es","clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("word_embeddings") ``` ```scala val model = WordEmbeddingsModel.pretrained("embeddings_sciwiki_50d","es","clinical/models") .setInputCols("document","token") .setOutputCol("word_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.sciwiki.50d").predict("""Put your text here.""") ```
{:.h2_title} ## Results Word2Vec feature vectors based on ``embeddings_sciwiki_50d``. {:.model-param} ## Model Information {:.table-model} |---------------|------------------------| | Name: | embeddings_sciwiki_50d | | Type: | WordEmbeddingsModel | | Compatibility: | Spark NLP 2.5.0+ | | License: | Licensed | | Edition: | Official | |Input labels: | [document, token] | |Output labels: | [word_embeddings] | | Language: | es | | Dimension: | 50.0 | {:.h2_title} ## Data Source Trained on Clinical Wikipedia Articles https://zenodo.org/record/3744326#.XtViinVKh_U --- layout: model title: Financial News Summarization (Headers, Medium) author: John Snow Labs name: finsum_news_headers_md date: 2022-11-23 tags: [financial, summarization, en, licensed] task: Summarization language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a financial news summarizer, aimed at extracting headers from financial news. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finsum_news_headers_md_en_1.0.0_3.0_1669216808643.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finsum_news_headers_md_en_1.0.0_3.0_1669216808643.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") t5 = nlp.T5Transformer.pretrained("finsum_news_headers_md" ,"en", "finance/models") \ .setTask("summarization") \ .setInputCols(["documents"]) \ .setMaxOutputLength(512) \ .setOutputCol("summaries") data_df = spark.createDataFrame([["FTX is expected to make its debut appearance Tuesday in Delaware bankruptcy court, where its new management is expected to recount events leading up to the cryptocurrency platform’s sudden collapse and explain the steps it has since taken to secure customer funds and other assets."]]).toDF("text") pipeline = nlp.Pipeline().setStages([document_assembler, t5]) results = pipeline.fit(data_df).transform(data_df) results.select("summaries.result").show(truncate=False) ```
## Results ```bash FTX to Make Debut in Delaware Bankruptcy Court Tuesday. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finsum_news_headers_md| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[summaries]| |Language:|en| |Size:|925.6 MB| ## References In-house JSL financial summarized news. --- layout: model title: Pipeline to Detect Restaurant-related Terminology author: John Snow Labs name: nerdl_restaurant_100d_pipeline date: 2022-06-28 tags: [restaurant, ner, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [nerdl_restaurant_100d](https://nlp.johnsnowlabs.com/2021/12/31/nerdl_restaurant_100d_en.html) model. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_RESTAURANT/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_RESTAURANT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/nerdl_restaurant_100d_pipeline_en_4.0.0_3.0_1656389148575.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/nerdl_restaurant_100d_pipeline_en_4.0.0_3.0_1656389148575.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python restaurant_pipeline = PretrainedPipeline("nerdl_restaurant_100d_pipeline", lang = "en") restaurant_pipeline.annotate("Hong Kong’s favourite pasta bar also offers one of the most reasonably priced lunch sets in town! With locations spread out all over the territory Sha Tin – Pici’s formidable lunch menu reads like a highlight reel of the restaurant. Choose from starters like the burrata and arugula salad or freshly tossed tuna tartare, and reliable handmade pasta dishes like pappardelle. Finally, round out your effortless Italian meal with a tidy one-pot tiramisu, of course, an espresso to power you through the rest of the day.") ``` ```scala val restaurant_pipeline = new PretrainedPipeline("nerdl_restaurant_100d_pipeline", lang = "en") restaurant_pipeline.annotate("Hong Kong’s favourite pasta bar also offers one of the most reasonably priced lunch sets in town! With locations spread out all over the territory Sha Tin – Pici’s formidable lunch menu reads like a highlight reel of the restaurant. Choose from starters like the burrata and arugula salad or freshly tossed tuna tartare, and reliable handmade pasta dishes like pappardelle. Finally, round out your effortless Italian meal with a tidy one-pot tiramisu, of course, an espresso to power you through the rest of the day.") ```
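The chunk/label pairs this pipeline returns come from its NerConverter stage, which merges consecutive IOB-tagged tokens into chunks. A plain-Python sketch of that merging logic — the tokens and tags below are hypothetical, not the model's actual output:

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag starts a new chunk, closing any open one.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # An I- tag with a matching label continues the open chunk.
            current.append(token)
        else:
            # An O tag (or a mismatched I-) closes the open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["tidy", "one-pot", "tiramisu", ",", "an", "espresso"]
tags = ["B-Amenity", "I-Amenity", "B-Dish", "O", "O", "B-Dish"]
print(iob_to_chunks(tokens, tags))
# → [('tidy one-pot', 'Amenity'), ('tiramisu', 'Dish'), ('espresso', 'Dish')]
```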
## Results ```bash +---------------------------+---------------+ |chunk |ner_label | +---------------------------+---------------+ |Hong Kong’s |Restaurant_Name| |favourite |Rating | |pasta bar |Dish | |most reasonably |Price | |lunch |Hours | |in town! |Location | |Sha Tin – Pici’s |Restaurant_Name| |burrata |Dish | |arugula salad |Dish | |freshly tossed tuna tartare|Dish | |reliable |Price | |handmade pasta |Dish | |pappardelle |Dish | |effortless |Amenity | |Italian |Cuisine | |tidy one-pot |Amenity | |espresso |Dish | +---------------------------+---------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|nerdl_restaurant_100d_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|166.7 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - NerDLModel - NerConverter - Finisher --- layout: model title: BioBERT Sentence Embeddings (Discharge) author: John Snow Labs name: sent_biobert_discharge_base_cased date: 2020-09-19 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.2 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains pre-trained weights of ClinicalBERT for discharge summaries. This domain-specific model improves performance on 3 of 5 clinical NLP tasks and establishes a new state of the art on the MedNLI dataset. The details are described in the paper "[Publicly Available Clinical BERT Embeddings](https://www.aclweb.org/anthology/W19-1909/)".
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_biobert_discharge_base_cased_en_2.6.2_2.4_1600533806048.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_biobert_discharge_base_cased_en_2.6.2_2.4_1600533806048.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_discharge_base_cased", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer'], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_discharge_base_cased", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.biobert.discharge_base_cased').predict(text, output_level='sentence') embeddings_df ```
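Each row of the `sentence_embeddings` output carries a 768-dimensional vector like the truncated ones shown in the Results; a common follow-up step is comparing two sentences by cosine similarity of their vectors. A dependency-free sketch — the short vectors here are made-up stand-ins, not actual model output:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional stand-ins for the 768-dimensional BioBERT vectors.
v1 = [0.31, 0.37, -0.44, 0.12]
v2 = [0.35, 0.07, -0.08, 0.20]
print(cosine_similarity(v1, v2))
```

In practice the vectors would be pulled from the `embeddings` field of the transformed DataFrame (or from the NLU result shown above).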
{:.h2_title} ## Results ```bash sentence en_embed_sentence_biobert_discharge_base_cased_embeddings 0 I hate cancer [0.3155321180820465, 0.37484583258628845, -0.4... 1 Antibiotics aren't painkiller [0.3543206453323364, 0.0787968561053276, -0.08... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_biobert_discharge_base_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.2| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/EmilyAlsentzer/clinicalBERT](https://github.com/EmilyAlsentzer/clinicalBERT) --- layout: model title: Slovenian T5ForConditionalGeneration Small Cased model (from cjvt) author: John Snow Labs name: t5_sl_small date: 2023-01-31 tags: [sl, open_source, t5, tensorflow] task: Text Generation language: sl edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-sl-small` is a Slovenian model originally trained by `cjvt`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_sl_small_sl_4.3.0_3.0_1675125783776.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_sl_small_sl_4.3.0_3.0_1675125783776.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_sl_small","sl") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_sl_small","sl") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_sl_small| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|sl| |Size:|178.8 MB| ## References - https://huggingface.co/cjvt/t5-sl-small - https://arxiv.org/abs/2207.13988 --- layout: model title: English RoBERTa Embeddings author: John Snow Labs name: roberta_embeddings_distilroberta_base date: 2022-04-14 tags: [roberta, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilroberta-base` is an English model originally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distilroberta_base_en_3.4.2_3.0_1649946226278.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distilroberta_base_en_3.4.2_3.0_1649946226278.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_distilroberta_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_distilroberta_base","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.distilroberta_base").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_distilroberta_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|308.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/distilroberta-base - https://skylion007.github.io/OpenWebTextCorpus/ --- layout: model title: Multilingual BertForQuestionAnswering Base Uncased model (from roshnir) author: John Snow Labs name: bert_qa_base_multi_uncased date: 2022-07-07 tags: [xx, open_source, bert, question_answering] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multi-uncased-en-hi` is a Multilingual model originally trained by `roshnir`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_multi_uncased_xx_4.0.0_3.0_1657183160452.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_multi_uncased_xx_4.0.0_3.0_1657183160452.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_multi_uncased","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_multi_uncased","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_multi_uncased| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|626.2 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/roshnir/bert-base-multi-uncased-en-hi --- layout: model title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011) author: John Snow Labs name: distilbert_token_classifier_autotrain_company_all_903429548 date: 2023-03-06 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-company_all-903429548` is an English model originally trained by `ismail-lucifer011`. ## Predicted Entities `Company`, `OOV` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_company_all_903429548_en_4.3.1_3.0_1678134060472.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_company_all_903429548_en_4.3.1_3.0_1678134060472.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_company_all_903429548","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_company_all_903429548","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_company_all_903429548| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ismail-lucifer011/autotrain-company_all-903429548 --- layout: model title: Swedish asr_wav2vec2_swedish_common_voice TFWav2Vec2ForCTC from birgermoell author: John Snow Labs name: pipeline_asr_wav2vec2_swedish_common_voice date: 2022-09-25 tags: [wav2vec2, sv, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: sv edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_swedish_common_voice` is a Swedish model originally trained by birgermoell. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_swedish_common_voice_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_swedish_common_voice_sv_4.2.0_3.0_1664114455926.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_swedish_common_voice_sv_4.2.0_3.0_1664114455926.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_swedish_common_voice', lang = 'sv') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_swedish_common_voice", lang = "sv") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_swedish_common_voice| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|sv| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal Partial invalidity Clause Binary Classifier author: John Snow Labs name: legclf_partial_invalidity_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `partial-invalidity` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
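The paragraph-splitting option mentioned above (splitting by multiline) needs no Spark NLP machinery; as a minimal sketch, using an invented sample document:

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Invented sample contract text for illustration only.
document = (
    "1. Severability. If any provision is held invalid, the remainder survives.\n\n"
    "2. Notices. All notices must be in writing.\n\n\n"
    "3. Governing Law. This Agreement is governed by the laws of Delaware."
)
paragraphs = split_paragraphs(document)
print(len(paragraphs))  # each piece can then be fed to the clause classifier
```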
## Predicted Entities `other`, `partial-invalidity` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_partial_invalidity_clause_en_1.0.0_3.2_1660122811266.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_partial_invalidity_clause_en_1.0.0_3.2_1660122811266.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_partial_invalidity_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +--------------------+ | result| +--------------------+ |[partial-invalidity]| | [other]| | [other]| |[partial-invalidity]| +--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_partial_invalidity_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.99 1.00 1.00 126 partial-invalidity 1.00 0.98 0.99 44 accuracy - - 0.99 170 macro-avg 1.00 0.99 0.99 170 weighted-avg 0.99 0.99 0.99 170 ``` --- layout: model title: Smaller BERT Embeddings (L-2_H-512_A-8) author: John Snow Labs name: small_bert_L2_512 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L2_512_en_2.6.0_2.4_1598344551843.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L2_512_en_2.6.0_2.4_1598344551843.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L2_512", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L2_512", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L2_512').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash token en_embed_bert_small_L2_512_embeddings I [0.03890656679868698, 0.820040762424469, 0.380... love [-0.22476400434970856, 1.5793002843856812, 0.1... NLP [0.027155205607414246, 0.4952375590801239, 0.3... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|small_bert_L2_512| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|512| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-512_A-8/1 --- layout: model title: Javanese DistilBertForMaskedLM Small Cased model (from w11wo) author: John Snow Labs name: distilbert_embeddings_javanese_small date: 2022-12-12 tags: [jv, open_source, distilbert_embeddings, distilbertformaskedlm] task: Embeddings language: jv edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `javanese-distilbert-small` is a Javanese model originally trained by `w11wo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_javanese_small_jv_4.2.4_3.0_1670864937392.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_javanese_small_jv_4.2.4_3.0_1670864937392.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_javanese_small","jv") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(False) pipeline = Pipeline(stages=[documentAssembler, tokenizer, distilbert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_javanese_small","jv") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, distilbert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_javanese_small| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|jv| |Size:|247.9 MB| |Case sensitive:|false| ## References - https://huggingface.co/w11wo/javanese-distilbert-small - https://arxiv.org/abs/1910.01108 - https://github.com/sgugger - https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb - https://w11wo.github.io/ --- layout: model title: English image_classifier_vit_greens ViTForImageClassification from ajanco author: John Snow Labs name: image_classifier_vit_greens date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_greens` is an English model originally trained by ajanco. ## Predicted Entities `pickle`, `green beans`, `zucinni`, `okra`, `cucumber` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_greens_en_4.1.0_3.0_1660165619629.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_greens_en_4.1.0_3.0_1660165619629.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_greens", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_greens", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_greens| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English asr_filipino_wav2vec2_l_xls_r_300m_official TFWav2Vec2ForCTC from Khalsuu author: John Snow Labs name: asr_filipino_wav2vec2_l_xls_r_300m_official date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_filipino_wav2vec2_l_xls_r_300m_official` is an English model originally trained by Khalsuu. NOTE: This model only works on a CPU. If you need to run it on a GPU device, please use asr_filipino_wav2vec2_l_xls_r_300m_official_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_filipino_wav2vec2_l_xls_r_300m_official_en_4.2.0_3.0_1664114554615.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_filipino_wav2vec2_l_xls_r_300m_official_en_4.2.0_3.0_1664114554615.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_filipino_wav2vec2_l_xls_r_300m_official", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_filipino_wav2vec2_l_xls_r_300m_official", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_filipino_wav2vec2_l_xls_r_300m_official| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Translate Luba-Katanga to English Pipeline author: John Snow Labs name: translate_lu_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, lu, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this module is computationally expensive, especially on longer sequences; using an accelerator such as a GPU is recommended. - source languages: `lu` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_lu_en_xx_2.7.0_2.4_1609688924831.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_lu_en_xx_2.7.0_2.4_1609688924831.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_lu_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_lu_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.lu.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_lu_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForTokenClassification Cased model (from Lucifermorningstar011) author: John Snow Labs name: bert_token_classifier_autotrain_final_784824213 date: 2022-11-30 tags: [en, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-final-784824213` is an English model originally trained by `Lucifermorningstar011`. ## Predicted Entities `9`, `0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autotrain_final_784824213_en_4.2.4_3.0_1669814559612.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autotrain_final_784824213_en_4.2.4_3.0_1669814559612.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autotrain_final_784824213","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autotrain_final_784824213","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_autotrain_final_784824213| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Lucifermorningstar011/autotrain-final-784824213 --- layout: model title: Detect Entities (Onto 100) author: John Snow Labs name: onto_100 date: 2020-02-03 task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 2.4.0 spark_version: 2.4 tags: [ner, en, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Onto is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. Onto was trained on the OntoNotes text corpus. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. Onto 100 is trained with GloVe 100d word embeddings, so be sure to use the same embeddings in the pipeline. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_100_en_2.4.0_2.4_1579729071672.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_100_en_2.4.0_2.4_1579729071672.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## Predicted Entities `CARDINAL`, `EVENT`, `WORK_OF_ART`, `ORG`, `DATE`, `GPE`, `PERSON`, `PRODUCT`, `NORP`, `ORDINAL`, `MONEY`, `LOC`, `FAC`, `LAW`, `TIME`, `PERCENT`, `QUANTITY`, `LANGUAGE`. ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained('glove_100d') \ .setInputCols(['document', 'token']) \ .setOutputCol('embeddings') ner_model = NerDLModel.pretrained("onto_100", "en") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."]], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_100d") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("onto_100", "en") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."""] ner_df = nlu.load('en.ner.onto.glove.6B_100d').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
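The `ner_converter` stage referenced in the pipelines above merges token-level IOB tags into the entity chunks shown in the results. A minimal pure-Python sketch of that merging logic (illustrative only, not Spark NLP's implementation):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB tags into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new chunk, closing any open one.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # An I- tag continues the open chunk of the same label.
            current.append(tok)
        else:
            # O tag (or a dangling I-) closes any open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["William", "Henry", "Gates", "III", "was", "born", "in", "Seattle"]
tags = ["B-PERSON", "I-PERSON", "I-PERSON", "I-PERSON", "O", "O", "O", "B-GPE"]
print(iob_to_chunks(tokens, tags))
# [('William Henry Gates III', 'PERSON'), ('Seattle', 'GPE')]
```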
{:.h2_title} ## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |William Henry Gates III|PERSON | |October 28, 1955 |DATE | |American |NORP | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PERSON | |May 2014 |DATE | |one |CARDINAL | |the 1970s and 1980s |DATE | |Born |PERSON | |Seattle |GPE | |Washington |GPE | |Gates |PERSON | |Microsoft |ORG | |Paul Allen |PERSON | |1975 |DATE | |Albuquerque |GPE | |New Mexico |GPE | |Gates |PERSON | |January 2000 |DATE | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_100| |Type:|ner| |Compatibility:|Spark NLP 2.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Case sensitive:|false| {:.h2_title} ## Data Source The model is trained on data from [https://catalog.ldc.upenn.edu/LDC2013T19](https://catalog.ldc.upenn.edu/LDC2013T19) --- layout: model title: English DistilBertForQuestionAnswering model (from rahulkuruvilla) B Version author: John Snow Labs name: distilbert_qa_COVID_DistilBERTb date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `COVID-DistilBERTb` is an English model originally trained by `rahulkuruvilla`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_COVID_DistilBERTb_en_4.0.0_3.0_1654722911790.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_COVID_DistilBERTb_en_4.0.0_3.0_1654722911790.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_COVID_DistilBERTb","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_COVID_DistilBERTb","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.covid.distil_bert.b.by_rahulkuruvilla").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_COVID_DistilBERTb| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/rahulkuruvilla/COVID-DistilBERTb --- layout: model title: English BertForQuestionAnswering model (from rahulkuruvilla) author: John Snow Labs name: bert_qa_COVID_BERTc date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `COVID-BERTc` is an English model originally trained by `rahulkuruvilla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_COVID_BERTc_en_4.0.0_3.0_1654176624579.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_COVID_BERTc_en_4.0.0_3.0_1654176624579.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_COVID_BERTc","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_COVID_BERTc","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.covid_bert.c.by_rahulkuruvilla").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
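Span-based question answering models like the ones above score every context position as a possible answer start and end, then pick the highest-scoring valid span. A toy sketch of that selection step with made-up logits (not real model outputs):

```python
def best_span(start_logits, end_logits, max_len=15):
    # Pick (start, end) maximizing start_logit + end_logit, with start <= end
    # and a cap on span length, as extractive QA heads typically do.
    best, best_score = (0, 0), float("-inf")
    for s, sl in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = sl + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

context = "My name is Clara and I live in Berkeley .".split()
# Illustrative logits, one per context token (hypothetical values).
start = [0.1, 0.0, 0.2, 5.0, 0.1, 0.0, 0.1, 0.0, 0.3, 0.1]
end   = [0.0, 0.1, 0.0, 4.5, 0.2, 0.0, 0.0, 0.1, 0.2, 0.0]
s, e = best_span(start, end)
print(" ".join(context[s:e + 1]))  # Clara
```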
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_COVID_BERTc| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/rahulkuruvilla/COVID-BERTc --- layout: model title: Detect PHI for Deidentification purposes (Italian, reduced entities) author: John Snow Labs name: ner_deid_generic date: 2022-03-25 tags: [deid, it, licensed] task: Named Entity Recognition language: it edition: Healthcare NLP 3.4.2 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow a generic model to be trained using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Italian) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 8 entities. This NER model is trained on a custom internally annotated dataset, with a COVID-19 Italian de-identification research dataset [(Catelli et al.)](https://ieeexplore.ieee.org/document/9335570) making up 15% of the total data, plus several data augmentation mechanisms.
## Predicted Entities `CONTACT`, `NAME`, `DATE`, `ID`, `LOCATION`, `PROFESSION`, `AGE`, `SEX` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_it_3.4.2_3.0_1648224048081.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_it_3.4.2_3.0_1648224048081.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "it")\ .setInputCols(["sentence", "token"])\ .setOutputCol("word_embeddings") clinical_ner = MedicalNerModel.pretrained("ner_deid_generic", "it", "clinical/models")\ .setInputCols(["sentence","token", "word_embeddings"])\ .setOutputCol("ner") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner]) text = ["Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015."] data = spark.createDataFrame([text]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "it") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic", "it", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner)) val text = "Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015." 
val data = Seq(text).toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("it.med_ner.deid_generic").predict("""Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015.""") ```
## Results ```bash +-------------+----------+ | token| ner_label| +-------------+----------+ | Ho| O| | visto| O| | Gastone| B-NAME| |Montanariello| I-NAME| | (| O| | 49| B-AGE| | anni| O| | )| O| | riferito| O| | all| O| | '| O| | Ospedale|B-LOCATION| | San|I-LOCATION| | Camillo|I-LOCATION| | per| O| | diabete| O| | mal| O| | controllato| O| | con| O| | sintomi| O| | risalenti| O| | a| O| | marzo| B-DATE| | 2015| I-DATE| | .| O| +-------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic| |Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|it| |Size:|15.0 MB| ## References - Internally annotated corpus - [COVID-19 Italian de-identification dataset making up 15% of total data - R. Catelli, F. Gargiulo, V. Casola, G. De Pietro, H. Fujita and M. Esposito, "A Novel COVID-19 Data Set and an Effective Deep Learning Approach for the De-Identification of Italian Medical Records," in IEEE Access, vol. 9, pp. 
19097-19110, 2021, doi: 10.1109/ACCESS.2021.3054479.](https://ieeexplore.ieee.org/document/9335570) ## Benchmarking ```bash label tp fp fn total precision recall f1 CONTACT 244.0 1.0 0.0 244.0 0.9959 1.0 0.998 NAME 1082.0 69.0 59.0 1141.0 0.9401 0.9483 0.9442 DATE 1173.0 26.0 17.0 1190.0 0.9783 0.9857 0.982 ID 138.0 2.0 21.0 159.0 0.9857 0.8679 0.9231 SEX 742.0 21.0 32.0 774.0 0.9725 0.9587 0.9655 LOCATION 1039.0 64.0 108.0 1147.0 0.942 0.9058 0.9236 PROFESSION 300.0 15.0 69.0 369.0 0.9524 0.813 0.8772 AGE 746.0 5.0 35.0 781.0 0.9933 0.9552 0.9739 macro - - - - - - 0.9484 micro - - - - - - 0.9521 ``` --- layout: model title: Arabic Bert Embeddings (Base, Arabert Model, v02, Twitter) author: John Snow Labs name: bert_embeddings_bert_base_arabertv02_twitter date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabertv02-twitter` is an Arabic model originally trained by `aubmindlab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabertv02_twitter_ar_3.4.2_3.0_1649677246211.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabertv02_twitter_ar_3.4.2_3.0_1649677246211.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabertv02_twitter","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabertv02_twitter","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.bert_base_arabertv02_twitter").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_arabertv02_twitter| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|507.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/aubmindlab/bert-base-arabertv02-twitter - https://github.com/google-research/bert - https://arxiv.org/abs/2003.00104 - https://github.com/WissamAntoun/pydata_khobar_meetup - https://sites.aub.edu.lb/mindlab/ - https://www.yakshof.com/#/ - https://www.behance.net/rahalhabib - https://www.linkedin.com/in/wissam-antoun-622142b4/ - https://twitter.com/wissam_antoun - https://github.com/WissamAntoun - https://www.linkedin.com/in/fadybaly/ - https://twitter.com/fadybaly - https://github.com/fadybaly --- layout: model title: Japanese T5ForConditionalGeneration Cased model (from sonoisa) author: John Snow Labs name: t5_qiita_title_generation date: 2023-01-31 tags: [ja, open_source, t5, tensorflow] task: Text Generation language: ja edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-qiita-title-generation` is a Japanese model originally trained by `sonoisa`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_qiita_title_generation_ja_4.3.0_3.0_1675125661321.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_qiita_title_generation_ja_4.3.0_3.0_1675125661321.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_qiita_title_generation","ja") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_qiita_title_generation","ja")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_qiita_title_generation|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|ja|
|Size:|925.3 MB|

## References

- https://huggingface.co/sonoisa/t5-qiita-title-generation
- https://qiita.com/sonoisa/items/30876467ad5a8a81821f

---
layout: model
title: Chinese Bert Embeddings (Base, COCO-ir dataset)
author: John Snow Labs
name: bert_embeddings_mengzi_oscar_base_retrieval
date: 2022-04-11
tags: [bert, embeddings, zh, open_source]
task: Embeddings
language: zh
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `mengzi-oscar-base-retrieval` is a Chinese model originally trained by `Langboat`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_mengzi_oscar_base_retrieval_zh_3.4.2_3.0_1649670875498.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_mengzi_oscar_base_retrieval_zh_3.4.2_3.0_1649670875498.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_mengzi_oscar_base_retrieval","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_mengzi_oscar_base_retrieval","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.mengzi_oscar_base_retrieval").predict("""I love Spark NLP""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_mengzi_oscar_base_retrieval|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|383.7 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/Langboat/mengzi-oscar-base-retrieval
- https://arxiv.org/abs/2110.06696
- https://github.com/Langboat/Mengzi/blob/main/Mengzi-Oscar.md
- https://github.com/microsoft/Oscar/blob/master/INSTALL.md

---
layout: model
title: English Named Entity Recognition (from lucifermorninstar011)
author: John Snow Labs
name: distilbert_ner_autotrain_lucifer_job_title_853727204
date: 2022-05-16
tags: [distilbert, ner, token_classification, en, open_source]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: DistilBertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-lucifer_job_title-853727204` is an English model originally trained by `lucifermorninstar011`.

## Predicted Entities

`Job`, `OOV`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_autotrain_lucifer_job_title_853727204_en_3.4.2_3.0_1652721613207.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_autotrain_lucifer_job_title_853727204_en_3.4.2_3.0_1652721613207.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_autotrain_lucifer_job_title_853727204","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_autotrain_lucifer_job_title_853727204","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_ner_autotrain_lucifer_job_title_853727204|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/lucifermorninstar011/autotrain-lucifer_job_title-853727204

---
layout: model
title: Abkhazian asr_xls_r_ab_test_by_cahya TFWav2Vec2ForCTC from cahya
author: John Snow Labs
name: pipeline_asr_xls_r_ab_test_by_cahya
date: 2022-09-24
tags: [wav2vec2, ab, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: ab
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_ab_test_by_cahya` is an Abkhazian model originally trained by cahya.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_xls_r_ab_test_by_cahya_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_ab_test_by_cahya_ab_4.2.0_3.0_1664019520046.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_ab_test_by_cahya_ab_4.2.0_3.0_1664019520046.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_xls_r_ab_test_by_cahya', lang = 'ab') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_xls_r_ab_test_by_cahya", lang = "ab") val annotations = pipeline.transform(audioDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_xls_r_ab_test_by_cahya|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|ab|
|Size:|452.2 KB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: Mapping ICD-10-CM Codes with Corresponding Billable and Hierarchical Condition Category (HCC) Scores
author: John Snow Labs
name: icd10cm_billable_hcc_mapper
date: 2023-05-26
tags: [icd10cm, billable, hcc_score, clinical, en, chunk_mapping, licensed]
task: Chunk Mapping
language: en
edition: Healthcare NLP 4.4.2
spark_version: 3.0
supported: true
annotator: DocMapperModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained model maps ICD-10-CM codes with their corresponding billable and HCC scores. If there is no HCC score for the corresponding ICD-10-CM code, the result will be returned as 0.

## Predicted Entities

`billable`, `hcc_score`

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10cm_billable_hcc_mapper_en_4.4.2_3.0_1685107034729.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10cm_billable_hcc_mapper_en_4.4.2_3.0_1685107034729.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

docMapper = DocMapperModel.pretrained("icd10cm_billable_hcc_mapper", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("mappings")\
    .setRels(["billable", "hcc_score"]) \
    .setLowerCase(True) \
    .setMultivaluesRelations(True)

pipeline = Pipeline().setStages([
    document_assembler,
    docMapper])

data = spark.createDataFrame([["D66"], ["S22.00"], ["Z3A.10"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val docMapper = DocMapperModel.pretrained("icd10cm_billable_hcc_mapper", "en", "clinical/models")
    .setInputCols("document")
    .setOutputCol("mappings")
    .setRels(Array("billable", "hcc_score"))

val pipeline = new Pipeline().setStages(Array(
    document_assembler,
    docMapper))

val data = Seq("D66", "S22.00", "Z3A.10").toDF("text")

val result = pipeline.fit(data).transform(data)
```
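Conceptually, the default-to-zero behavior described above amounts to a lookup that falls back to 0 whenever a code carries no HCC score. The sketch below is only an illustration of that behavior with a hand-written dictionary (the sample codes and values match the Results section; the real mappings come from the pretrained model, and `map_icd10cm` is a hypothetical helper, not part of the Spark NLP API):

```python
# Simplified, illustrative view of the mapper's behavior: each ICD-10-CM
# code maps to a billable flag and (optionally) an HCC score; codes with
# no HCC score fall back to 0, mirroring the model's documented default.
SAMPLE_MAPPINGS = {
    "D66": {"billable": 1, "hcc_score": 46},
    "S22.00": {"billable": 0},   # non-billable, no HCC score
    "Z3A.10": {"billable": 1},   # billable, but carries no HCC score
}

def map_icd10cm(code: str) -> dict:
    entry = SAMPLE_MAPPINGS.get(code, {})
    return {
        "billable": entry.get("billable", 0),
        "hcc_score": entry.get("hcc_score", 0),  # default to 0 when absent
    }

print(map_icd10cm("D66"))     # {'billable': 1, 'hcc_score': 46}
print(map_icd10cm("Z3A.10"))  # {'billable': 1, 'hcc_score': 0}
```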
## Results ```bash +--------------+----------+-----------+ | icd10cm_code | billable | hcc_score | +--------------+----------+-----------+ | D66 | 1 | 46 | | S22.00 | 0 | 0 | | Z3A.10 | 1 | 0 | +--------------+----------+-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd10cm_billable_hcc_mapper| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[mappings]| |Language:|en| |Size:|1.1 MB| --- layout: model title: Latin BertForMaskedLM Cased model (from cook) author: John Snow Labs name: bert_embeddings_cicero_similis date: 2022-12-06 tags: [la, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: la edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cicero-similis` is a Latin model originally trained by `cook`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_cicero_similis_la_4.2.4_3.0_1670326205733.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_cicero_similis_la_4.2.4_3.0_1670326205733.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_cicero_similis","la") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_cicero_similis","la")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_cicero_similis|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|la|
|Size:|233.5 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/cook/cicero-similis

---
layout: model
title: Financial Zero-shot Relation Extraction
author: John Snow Labs
name: finre_zero_shot
date: 2022-08-22
tags: [en, finance, re, zero, shot, zero_shot, licensed]
task: Relation Extraction
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
recommended: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is a Zero-shot Relation Extraction model, meaning that it does not require any training data, just a few examples of the relation types you are looking for, to output a proper result.

Make sure you keep the proper syntax of the relations you want to extract. For example:

```
re_model.setRelationalCategories({
    "DECREASE": ["{PROFIT_DECLINE} decrease {AMOUNT}", "{PROFIT_DECLINE} decrease {PERCENTAGE}"],
    "INCREASE": ["{PROFIT_INCREASE} increase {AMOUNT}", "{PROFIT_INCREASE} increase {PERCENTAGE}"]
})
```

- The keys of the dictionary are the names of the relations (`DECREASE`, `INCREASE`)
- The values are lists of sentences with similar examples of the relation
- The placeholders in curly brackets are the NER labels extracted by an upstream NER component

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finre_zero_shot_en_1.0.0_3.2_1661179057628.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finre_zero_shot_en_1.0.0_3.2_1661179057628.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)

ner_model = finance.NerModel.pretrained("finner_10k", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

re_model = finance.ZeroShotRelationExtractionModel.pretrained("finre_zero_shot", "en", "finance/models")\
    .setInputCols(["ner_chunk", "sentence"]) \
    .setOutputCol("relations")

# Remember it's 2 curly brackets instead of one if you are using Spark NLP < 4.0
re_model.setRelationalCategories({
    "DECREASE": ["{PROFIT_DECLINE} decrease {AMOUNT}", "{PROFIT_DECLINE} decrease {PERCENTAGE}"],
    "INCREASE": ["{PROFIT_INCREASE} increase {AMOUNT}", "{PROFIT_INCREASE} increase {PERCENTAGE}"]
})

pipeline = sparknlp.base.Pipeline() \
    .setStages([document_assembler,
                sentence_detector,
                tokenizer,
                embeddings,
                ner_model,
                ner_converter,
                re_model
                ])

sample_text = """License fees revenue decreased 40 %, or $ 0.5 million to $ 0.7 million for the year ended December 31, 2020 compared to $ 1.2 million for the year ended December 31, 2019. Services revenue increased 4 %, or $ 1.1 million, to $ 25.6 million for the year ended December 31, 2020 from $ 24.5 million for the year ended December 31, 2019.
Costs of revenue, excluding depreciation and amortization increased by $ 0.1 million, or 2 %, to $ 8.8 million for the year ended December 31, 2020 from $ 8.7 million for the year ended December 31, 2019. The increase was primarily related to increase in internal staff costs of $ 1.1 million as we increased delivery staff and work performed on internal projects, partially offset by a decrease in third party consultant costs of $ 0.6 million as these were converted to internal staff or terminated. Also, a decrease in travel costs of $ 0.4 million due to travel restrictions caused by the global pandemic. As a percentage of revenue, cost of revenue, excluding depreciation and amortization was 34 % for each of the years ended December 31, 2020 and 2019. Sales and marketing expenses decreased 20 %, or $ 1.5 million, to $ 6.0 million for the year ended December 31, 2020 from $ 7.5 million for the year ended December 31, 2019" """ data = spark.createDataFrame([[sample_text]]).toDF("text") model = pipeline.fit(data) results = model.transform(data) # ner output results.selectExpr("explode(ner_chunk) as ner").show(truncate=False) # relations output results.selectExpr("explode(relations) as relation").show(truncate=False) ```
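Under the hood, the zero-shot approach turns each relational template into a natural-language hypothesis by substituting the extracted entity chunks for the bracketed NER labels, and then scores that hypothesis with an NLI model. A minimal sketch of just the substitution step follows; `fill_template` is a hypothetical helper for illustration, not part of the Spark NLP API:

```python
import re

def fill_template(template: str, chunks: dict) -> str:
    """Replace {NER_LABEL} placeholders with the extracted chunk texts."""
    return re.sub(r"\{(\w+)\}", lambda m: chunks[m.group(1)], template)

# Chunks as an upstream NER stage might extract them from the sample text.
chunks = {"PROFIT_DECLINE": "License fees revenue", "PERCENTAGE": "40"}

hypothesis = fill_template("{PROFIT_DECLINE} decrease {PERCENTAGE}", chunks)
print(hypothesis)  # License fees revenue decrease 40
```

The resulting string matches the `hypothesis` field visible in the relations output below; the NLI model then decides whether the sentence entails it.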
## Results ```bash +--------------------------------------------------------------------------------------------------------------------------+ |ner | +--------------------------------------------------------------------------------------------------------------------------+ |[chunk, 0, 19, License fees revenue, [entity -> PROFIT_DECLINE, sentence -> 0, chunk -> 0, confidence -> 0.41060004], []] | |[chunk, 31, 32, 40, [entity -> PERCENTAGE, sentence -> 0, chunk -> 1, confidence -> 0.9995], []] | |[chunk, 40, 40, $, [entity -> CURRENCY, sentence -> 0, chunk -> 2, confidence -> 0.9995], []] | |[chunk, 42, 52, 0.5 million, [entity -> AMOUNT, sentence -> 0, chunk -> 3, confidence -> 0.99995], []] | |[chunk, 57, 57, $, [entity -> CURRENCY, sentence -> 0, chunk -> 4, confidence -> 0.9998], []] | |[chunk, 59, 69, 0.7 million, [entity -> AMOUNT, sentence -> 0, chunk -> 5, confidence -> 0.99985003], []] | |[chunk, 90, 106, December 31, 2020, [entity -> FISCAL_YEAR, sentence -> 0, chunk -> 6, confidence -> 0.977525], []] | |[chunk, 121, 121, $, [entity -> CURRENCY, sentence -> 0, chunk -> 7, confidence -> 0.9996], []] | |[chunk, 123, 133, 1.2 million, [entity -> AMOUNT, sentence -> 0, chunk -> 8, confidence -> 0.99975], []] | |[chunk, 154, 170, December 31, 2019, [entity -> FISCAL_YEAR, sentence -> 0, chunk -> 9, confidence -> 0.96227497], []] | |[chunk, 173, 188, Services revenue, [entity -> PROFIT_INCREASE, sentence -> 1, chunk -> 10, confidence -> 0.57490003], []]| |[chunk, 200, 200, 4, [entity -> PERCENTAGE, sentence -> 1, chunk -> 11, confidence -> 0.9997], []] | |[chunk, 208, 208, $, [entity -> CURRENCY, sentence -> 1, chunk -> 12, confidence -> 0.999], []] | |[chunk, 210, 220, 1.1 million, [entity -> AMOUNT, sentence -> 1, chunk -> 13, confidence -> 0.99995], []] | |[chunk, 226, 226, $, [entity -> CURRENCY, sentence -> 1, chunk -> 14, confidence -> 0.9982], []] | |[chunk, 228, 239, 25.6 million, [entity -> AMOUNT, sentence -> 1, chunk -> 15, confidence -> 0.99975], 
[]] | |[chunk, 261, 277, December 31, 2020, [entity -> FISCAL_YEAR, sentence -> 1, chunk -> 16, confidence -> 0.97915], []] | |[chunk, 284, 284, $, [entity -> CURRENCY, sentence -> 1, chunk -> 17, confidence -> 0.9991], []] | |[chunk, 286, 297, 24.5 million, [entity -> AMOUNT, sentence -> 1, chunk -> 18, confidence -> 0.99965], []] | |[chunk, 318, 334, December 31, 2019, [entity -> FISCAL_YEAR, sentence -> 1, chunk -> 19, confidence -> 0.9588], []] | +--------------------------------------------------------------------------------------------------------------------------+ +--------+ |relation +--------+ |[category, 0, 217, DECREASE, [entity1_begin -> 0, relation -> DECREASE, hypothesis -> License fees revenue decrease 40, confidence -> 0.9931541, nli_prediction -> entail, entity1 -> PROFIT_DECLINE, syntactic_distance -> undefined, chunk2 -> 40, entity2_end -> 32, entity1_end -> 19, entity2_begin -> 31, entity2 -> PERCENTAGE, chunk1 -> License fees revenue, sentence -> 0], []] | |[category, 672, 898, DECREASE, [entity1_begin -> 0, relation -> DECREASE, hypothesis -> License fees revenue decrease 1.2 million, confidence -> 0.7394818, nli_prediction -> entail, entity1 -> PROFIT_DECLINE, syntactic_distance -> undefined, chunk2 -> 1.2 million, entity2_end -> 133, entity1_end -> 19, entity2_begin -> 123, entity2 -> AMOUNT, chunk1 -> License fees revenue, sentence -> 0], []]| |[category, 445, 671, DECREASE, [entity1_begin -> 0, relation -> DECREASE, hypothesis -> License fees revenue decrease 0.7 million, confidence -> 0.99002415, nli_prediction -> entail, entity1 -> PROFIT_DECLINE, syntactic_distance -> undefined, chunk2 -> 0.7 million, entity2_end -> 69, entity1_end -> 19, entity2_begin -> 59, entity2 -> AMOUNT, chunk1 -> License fees revenue, sentence -> 0], []] | |[category, 218, 444, DECREASE, [entity1_begin -> 0, relation -> DECREASE, hypothesis -> License fees revenue decrease 0.5 million, confidence -> 0.99084955, nli_prediction -> entail, entity1 -> 
PROFIT_DECLINE, syntactic_distance -> undefined, chunk2 -> 0.5 million, entity2_end -> 52, entity1_end -> 19, entity2_begin -> 42, entity2 -> AMOUNT, chunk1 -> License fees revenue, sentence -> 0], []] | +--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finre_zero_shot| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|406.4 MB| |Case sensitive:|true| ## References Bert Base (cased) trained on the GLUE MNLI dataset. --- layout: model title: Fast Neural Machine Translation Model from English to Portuguese-Based Creoles And Pidgins author: John Snow Labs name: opus_mt_en_cpp date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, cpp, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `cpp` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_cpp_xx_2.7.0_2.4_1609170791563.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_cpp_xx_2.7.0_2.4_1609170791563.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_cpp", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_cpp", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.cpp').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_en_cpp|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Legal Use Of Proceeds Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_use_of_proceeds_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, use_of_proceeds, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

This model is a Binary Classifier (True, False) for the `Use_Of_Proceeds` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.

If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that this model's embeddings allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`Use_Of_Proceeds`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_use_of_proceeds_bert_en_1.0.0_3.0_1678049873085.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_use_of_proceeds_bert_en_1.0.0_3.0_1678049873085.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_use_of_proceeds_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
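The description above recommends paragraph splitting (by multiline) over sentence splitting before running the classifier. Below is a minimal, Spark-free sketch of that preprocessing step; the helper name and sample text are illustrative, not part of Spark NLP:

```python
import re

def split_paragraphs(text: str):
    """Split a document into paragraphs on blank lines ("multiline" splitting).

    Each returned paragraph keeps enough context to be fed to the
    classifier as one row of the input DataFrame.
    """
    # Two or more consecutive newlines (possibly padded with spaces) mark a break
    paragraphs = re.split(r"\n\s*\n", text)
    return [p.strip() for p in paragraphs if p.strip()]

doc = ("Use of Proceeds. The Borrower shall apply the proceeds of the Loans...\n\n"
       "Governing Law. This Agreement shall be governed by...")
rows = split_paragraphs(doc)
# e.g. df = spark.createDataFrame([[p] for p in rows]).toDF("text")
```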
## Results ```bash +-----------------+ |result| +-----------------+ |[Use_Of_Proceeds]| |[Other]| |[Other]| |[Use_Of_Proceeds]| +-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_use_of_proceeds_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Training dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.98 0.98 0.98 132 Use_Of_Proceeds 0.98 0.98 0.98 94 accuracy - - 0.98 226 macro-avg 0.98 0.98 0.98 226 weighted-avg 0.98 0.98 0.98 226 ``` --- layout: model title: Sentence Entity Resolver for LOINC (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_loinc_augmented date: 2022-01-18 tags: [loinc, entity_resolution, clinical, en, licensed] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.2 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted clinical NER entities to LOINC codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It is trained on an augmented version of the dataset used in previous LOINC resolver models. 
## Predicted Entities {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_loinc_augmented_en_3.3.2_2.4_1642533239691.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_loinc_augmented_en_3.3.2_2.4_1642533239691.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings\ .pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings")\ .setCaseSensitive(False) loinc_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_loinc_augmented", "en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("loinc_code")\ .setDistanceFunction("EUCLIDEAN") loinc_pipelineModel = Pipeline( stages = [ documentAssembler, sbert_embedder, loinc_resolver]) data = spark.createDataFrame([["""The patient is a 22-year-old female with a history of obesity. She has a Body mass index (BMI) of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126. Her hgba1c is 8.2%."""]]).toDF("text") results = loinc_pipelineModel.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en","clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sbert_embeddings") .setCaseSensitive(false) val loinc_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_loinc_augmented", "en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("loinc_code") .setDistanceFunction("EUCLIDEAN") val loinc_pipelineModel = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, loinc_resolver)) val data = Seq("The patient is a 22-year-old female with a history of obesity. She has a Body mass index (BMI) of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126. 
Her hgba1c is 8.2%.").toDF("text") val result = loinc_pipelineModel.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.loinc.augmented").predict("""The patient is a 22-year-old female with a history of obesity. She has a Body mass index (BMI) of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126. Her hgba1c is 8.2%.""") ```
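The resolver emits `:::`-delimited candidate lists in its metadata (shown as the `all_codes` and `resolutions` columns in the results). A minimal sketch for unpacking the top candidates from those strings; the function name and the sample values are illustrative, not part of the Spark NLP API:

```python
def top_candidates(all_codes: str, resolutions: str, k: int = 3):
    """Pair ':::'-delimited candidate LOINC codes with their resolution texts.

    Truncated entries ending in '...' (as printed in the results table)
    are kept as-is; only the first k pairs are returned.
    """
    codes = all_codes.split(":::")
    texts = resolutions.split(":::")
    return list(zip((c.strip() for c in codes[:k]),
                    (t.strip() for t in texts[:k])))

# Illustrative values in the shape of the results table above
print(top_candidates("LP15333-5:::LP307326-1:::16324-6",
                     "alanine aminotransferase:::alt panel:::alt", k=2))
```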
## Results ```bash +--------------------------+-----+---+------+----------+----------+--------------------------------------------------+--------------------------------------------------+ | chunk|begin|end|entity|confidence|Loinc_Code| all_codes| resolutions| +--------------------------+-----+---+------+----------+----------+--------------------------------------------------+--------------------------------------------------+ | Body mass index| 74| 88| Test|0.39306664| LP35925-4|LP35925-4:::BDYCRC:::LP172732-2:::39156-5:::LP7...|body mass index:::body circumference:::body mus...| |aspartate aminotransferase| 111|136| Test| 0.74925| LP15426-7|LP15426-7:::14409-7:::LP307348-5:::LP15333-5:::...|aspartate aminotransferase::: aspartate transam...| | alanine aminotransferase| 146|169| Test| 0.9579| LP15333-5|LP15333-5:::LP307326-1:::16324-6:::LP307348-5::...|alanine aminotransferase:::alanine aminotransfe...| | hgba1c| 180|185| Test| 0.1118| 17855-8|17855-8:::4547-6:::55139-0:::72518-4:::45190-6:...| hba1c::: hgb a1::: hb1::: hcds1::: hhc1::: htr...| +--------------------------+-----+---+------+----------+----------+--------------------------------------------------+--------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_loinc_augmented| |Compatibility:|Healthcare NLP 3.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[loinc_code]| |Language:|en| |Size:|1.5 GB| |Case sensitive:|false| |Dependencies:|sbiobert_base_cased_mli| ## Data Source Trained on the standard LOINC coding system. 
--- layout: model title: Word2Vec Embeddings in West Flemish (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, vls, open_source] task: Embeddings language: vls edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_vls_3.4.1_3.0_1647467395632.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_vls_3.4.1_3.0_1647467395632.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","vls") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","vls") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("vls.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
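The annotator maps each token to a 300-dimensional vector; a common downstream use of the `embeddings` column is comparing tokens by cosine similarity. A minimal, pure-Python sketch, using toy 3-d vectors in place of the model's 300-d output (the function and the vectors are illustrative, not part of Spark NLP):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for the model's 300-d word embeddings
v_cat = [0.2, 0.9, 0.1]
v_dog = [0.25, 0.85, 0.05]
print(round(cosine_similarity(v_cat, v_dog), 3))
```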
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|vls| |Size:|94.6 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Detect PHI in text (base) author: John Snow Labs name: ner_deid_sd date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect sensitive information in medical text for de-identification using a pretrained NER model. We adhered to the official annotation guideline (AG) of the 2014 i2b2 Deid challenge while annotating new datasets for this model. All the details regarding the nuances and explanations of the AG can be found here [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/) ## Predicted Entities `PROFESSION`, `CONTACT`, `DATE`, `NAME`, `AGE`, `ID`, `LOCATION` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_sd_en_3.0.0_3.0_1617260827858.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_sd_en_3.0.0_3.0_1617260827858.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_deid_sd", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_deid_sd", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("EXAMPLE_TEXT").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.deid.sd").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_sd| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| --- layout: model title: English AlbertForQuestionAnswering XXLarge model (from elgeish) author: John Snow Labs name: albert_qa_cs224n_squad2.0_xxlarge_v1 date: 2022-06-24 tags: [en, open_source, albert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: AlBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cs224n-squad2.0-albert-xxlarge-v1` is an English model originally trained by `elgeish`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_cs224n_squad2.0_xxlarge_v1_en_4.0.0_3.0_1656064342097.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_cs224n_squad2.0_xxlarge_v1_en_4.0.0_3.0_1656064342097.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_cs224n_squad2.0_xxlarge_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_cs224n_squad2.0_xxlarge_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.albert.xxl.by_elgeish").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_qa_cs224n_squad2.0_xxlarge_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|772.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/elgeish/cs224n-squad2.0-albert-xxlarge-v1 - http://web.stanford.edu/class/cs224n/project/default-final-project-handout.pdf - https://rajpurkar.github.io/SQuAD-explorer/ - https://github.com/elgeish/squad/tree/master/data --- layout: model title: Ganda XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_luganda_finetuned_luganda date: 2022-08-13 tags: [lg, open_source, xlm_roberta, ner] task: Named Entity Recognition language: lg edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-luganda-finetuned-ner-luganda` is a Ganda model originally trained by `mbeukman`. ## Predicted Entities `ORG`, `LOC`, `PER`, `DATE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_luganda_finetuned_luganda_lg_4.1.0_3.0_1660426976430.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_luganda_finetuned_luganda_lg_4.1.0_3.0_1660426976430.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_luganda_finetuned_luganda","lg") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_luganda_finetuned_luganda","lg") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_luganda_finetuned_luganda| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|lg| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-luganda-finetuned-ner-luganda - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://github.com/masakhane-io/masakhane-ner - https://www.apache.org/licenses/LICENSE-2.0 --- layout: model title: Legal Names Of Parties Clause Binary Classifier author: John Snow Labs name: legclf_names_of_parties_clause date: 2023-02-13 tags: [en, legal, classification, parties, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `names_of_parties` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. 
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `names_of_parties`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_names_of_parties_clause_en_1.0.0_3.0_1676301523874.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_names_of_parties_clause_en_1.0.0_3.0_1676301523874.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_names_of_parties_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +------------------+ |result| +------------------+ |[names_of_parties]| |[other]| |[other]| |[names_of_parties]| +------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_names_of_parties_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support names_of_parties 0.96 0.96 0.96 24 other 0.95 0.95 0.95 21 accuracy - - 0.96 45 macro-avg 0.96 0.96 0.96 45 weighted-avg 0.96 0.96 0.96 45 ``` --- layout: model title: Extract Biomarkers and their Results author: John Snow Labs name: ner_oncology_biomarker date: 2022-10-25 tags: [licensed, clinical, oncology, en, ner, biomarker] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Healthcare 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts mentions of biomarkers and biomarker results from oncology texts. Definitions of Predicted Entities: - `Biomarker`: Biological molecules that indicate the presence or absence of cancer, or the type of cancer (including oncogenes). - `Biomarker_Result`: Terms or values that are identified as the result of a biomarker. 
## Predicted Entities `Biomarker`, `Biomarker_Result` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_biomarker_en_4.0.0_3.0_1666723339627.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_biomarker_en_4.0.0_3.0_1666723339627.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_biomarker", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA), Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87%."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_biomarker", "en", "clinical/models") .setInputCols(Array("sentence", "token", 
"embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA), Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87%.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_biomarker").predict("""The results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA), Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87%.""") ```
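The NER output pairs each `Biomarker_Result` (e.g. "negative", "positive") with the biomarkers that follow it in the text. A rough post-processing sketch over (chunk, label) pairs such as those shown in the results; the function, the grouping heuristic, and the sample entities are illustrative only, not part of Spark NLP:

```python
def pair_results(entities):
    """Group Biomarker entities under the preceding Biomarker_Result.

    `entities` is a list of (chunk, label) pairs in document order, as
    collected from the NER converter output; adjacency is a heuristic
    and will mis-group results in more complex sentences.
    """
    groups, current = [], None
    for chunk, label in entities:
        if label == "Biomarker_Result":
            current = (chunk, [])
            groups.append(current)
        elif label == "Biomarker" and current is not None:
            current[1].append(chunk)
    return groups

ents = [("negative", "Biomarker_Result"), ("CK7", "Biomarker"),
        ("positive", "Biomarker_Result"), ("CK20", "Biomarker"), ("Muc1", "Biomarker")]
print(pair_results(ents))
```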
## Results ```bash | chunk | ner_label | |:-----------------------------------------|:-----------------| | negative | Biomarker_Result | | CK7 | Biomarker | | synaptophysin | Biomarker | | Syn | Biomarker | | chromogranin A | Biomarker | | CgA | Biomarker | | Muc5AC | Biomarker | | human epidermal growth factor receptor-2 | Biomarker | | HER2 | Biomarker | | Muc6 | Biomarker | | positive | Biomarker_Result | | CK20 | Biomarker | | Muc1 | Biomarker | | Muc2 | Biomarker | | E-cadherin | Biomarker | | p53 | Biomarker | | Ki-67 index | Biomarker | | 87% | Biomarker_Result | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_biomarker| |Compatibility:|Spark NLP for Healthcare 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.3 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Biomarker_Result 1030 148 415 1445 0.87 0.71 0.79 Biomarker 1685 272 279 1964 0.86 0.86 0.86 macro_avg 2715 420 694 3409 0.87 0.79 0.82 micro_avg 2715 420 694 3409 0.87 0.80 0.83 ``` --- layout: model title: Clinical Deidentification (French) author: John Snow Labs name: clinical_deidentification date: 2022-03-04 tags: [deid, fr, licensed] task: De-identification language: fr edition: Healthcare NLP 3.4.1 spark_version: 2.4 supported: true recommended: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to deidentify PHI information from medical texts in French. The pipeline can mask and obfuscate the following entities: `DATE`, `AGE`, `SEX`, `PROFESSION`, `ORGANIZATION`, `PHONE`, `E-MAIL`, `ZIP`, `STREET`, `CITY`, `COUNTRY`, `PATIENT`, `DOCTOR`, `HOSPITAL`, `MEDICALRECORD`, `SSN`, `IDNUM`, `ACCOUNT`, `PLATE`, `USERNAME`, `URL`, and `IPADDR`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_fr_3.4.1_2.4_1646396751450.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_fr_3.4.1_2.4_1646396751450.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "fr", "clinical/models") sample = """COMPTE-RENDU D'HOSPITALISATION PRENOM : Jean NOM : Dubois NUMÉRO DE SÉCURITÉ SOCIALE : 1780160471058 ADRESSE : 18 Avenue Matabiau VILLE : Grenoble CODE POSTAL : 38000 DATE DE NAISSANCE : 03/03/1946 Âge : 70 ans Sexe : H COURRIEL : jdubois@hotmail.fr DATE D'ADMISSION : 12/12/2016 MÉDÉCIN : Dr Michel Renaud RAPPORT CLINIQUE : 70 ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour. Il nous a été adressé car il présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale. L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV. L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal. Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml. 
ADDRESSÉ À : Dre Marie Breton - Centre Hospitalier de Bellevue Service D'Endocrinologie et de Nutrition - Rue Paulin Bussières, 38000 Grenoble COURRIEL : mariebreton@chb.fr """ result = deid_pipeline.annotate(sample) ``` ```scala val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "fr", "clinical/models") val sample = "COMPTE-RENDU D'HOSPITALISATION PRENOM : Jean NOM : Dubois NUMÉRO DE SÉCURITÉ SOCIALE : 1780160471058 ADRESSE : 18 Avenue Matabiau VILLE : Grenoble CODE POSTAL : 38000 DATE DE NAISSANCE : 03/03/1946 Âge : 70 ans Sexe : H COURRIEL : jdubois@hotmail.fr DATE D'ADMISSION : 12/12/2016 MÉDÉCIN : Dr Michel Renaud RAPPORT CLINIQUE : 70 ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour. Il nous a été adressé car il présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale. L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV. L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal. Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml. 
ADDRESSÉ À : Dre Marie Breton - Centre Hospitalier de Bellevue Service D'Endocrinologie et de Nutrition - Rue Paulin Bussières, 38000 Grenoble COURRIEL : mariebreton@chb.fr " val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("fr.deid_obfuscated").predict("""COMPTE-RENDU D'HOSPITALISATION PRENOM : Jean NOM : Dubois NUMÉRO DE SÉCURITÉ SOCIALE : 1780160471058 ADRESSE : 18 Avenue Matabiau VILLE : Grenoble CODE POSTAL : 38000 DATE DE NAISSANCE : 03/03/1946 Âge : 70 ans Sexe : H COURRIEL : jdubois@hotmail.fr DATE D'ADMISSION : 12/12/2016 MÉDÉCIN : Dr Michel Renaud RAPPORT CLINIQUE : 70 ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour. Il nous a été adressé car il présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale. L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV. L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal. Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml. ADDRESSÉ À : Dre Marie Breton - Centre Hospitalier de Bellevue Service D'Endocrinologie et de Nutrition - Rue Paulin Bussières, 38000 Grenoble COURRIEL : mariebreton@chb.fr """) ```
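The pipeline produces several de-identification variants of the same text (entity labels, character masks, fixed-length masks, obfuscation), as shown in the Results section. To illustrate the fixed-length masking strategy on its own, here is a minimal pure-Python sketch; the `(start, end)` span input format is a hypothetical stand-in for what the pipeline's NER stages detect internally:

```python
# Minimal sketch of fixed-length masking: replace each detected PHI span
# with a constant "****" placeholder, mirroring the "Masked with fixed
# length chars" output below. Spans are (start, end) character offsets —
# a hypothetical input format for illustration; the real pipeline
# detects and replaces these spans internally.
def mask_fixed_length(text, spans, placeholder="****"):
    out = []
    last = 0
    for start, end in sorted(spans):
        out.append(text[last:start])
        out.append(placeholder)
        last = end
    out.append(text[last:])
    return "".join(out)

sample = "PRENOM : Jean NOM : Dubois"
# Spans covering "Jean" and "Dubois":
print(mask_fixed_length(sample, [(9, 13), (20, 26)]))
# -> PRENOM : **** NOM : ****
```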
## Results ```bash Masked with entity labels ------------------------------ COMPTE-RENDU D'HOSPITALISATION PRENOM : NOM : NUMÉRO DE SÉCURITÉ SOCIALE : ADRESSE : VILLE : CODE POSTAL : DATE DE NAISSANCE : Âge : Sexe : COURRIEL : DATE D'ADMISSION : MÉDÉCIN : RAPPORT CLINIQUE : ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour. nous a été adressé car présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale. L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV. L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal. Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml. 
ADDRESSÉ À : - Service D'Endocrinologie et de Nutrition - , COURRIEL : Masked with chars ------------------------------ COMPTE-RENDU D'HOSPITALISATION PRENOM : [**] NOM : [****] NUMÉRO DE SÉCURITÉ SOCIALE : [***********] ADRESSE : [****************] VILLE : [******] CODE POSTAL : [***] DATE DE NAISSANCE : [********] Âge : [****] Sexe : * COURRIEL : [****************] DATE D'ADMISSION : [********] MÉDÉCIN : [**************] RAPPORT CLINIQUE : ** ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour. ** nous a été adressé car ** présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale. L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV. L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal. Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml. 
ADDRESSÉ À : [**************] - [****************************] Service D'Endocrinologie et de Nutrition - [******************], [***] [******] COURRIEL : [****************] Masked with fixed length chars ------------------------------ COMPTE-RENDU D'HOSPITALISATION PRENOM : **** NOM : **** NUMÉRO DE SÉCURITÉ SOCIALE : **** ADRESSE : **** VILLE : **** CODE POSTAL : **** DATE DE NAISSANCE : **** Âge : **** Sexe : **** COURRIEL : **** DATE D'ADMISSION : **** MÉDÉCIN : **** RAPPORT CLINIQUE : **** ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour. **** nous a été adressé car **** présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale. L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV. L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal. Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml. ADDRESSÉ À : **** - **** Service D'Endocrinologie et de Nutrition - ****, **** **** COURRIEL : **** Obfuscated ------------------------------ COMPTE-RENDU D'HOSPITALISATION PRENOM : Mme Ollivier NOM : Mme Traore NUMÉRO DE SÉCURITÉ SOCIALE : 164033818514436 ADRESSE : 731, boulevard de Legrand VILLE : Sainte Antoine CODE POSTAL : 37443 DATE DE NAISSANCE : 18/03/1946 Âge : 46 Sexe : Femme COURRIEL : georgeslemonnier@live.com DATE D'ADMISSION : 10/01/2017 MÉDÉCIN : Pr. 
Manon Dupuy RAPPORT CLINIQUE : 26 ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour. Homme nous a été adressé car Homme présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale. L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV. L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal. Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml. 
ADDRESSÉ À : Dr Tristan-Gilbert Poulain - CENTRE HOSPITALIER D'ORTHEZ Service D'Endocrinologie et de Nutrition - 6, avenue Pages, 37443 Sainte Antoine COURRIEL : massecatherine@bouygtel.fr ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|fr| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - RegexMatcherModel - RegexMatcherModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: Fast Neural Machine Translation Model from Afro-Asiatic Languages to English author: John Snow Labs name: opus_mt_afa_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, afa, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `afa` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_afa_en_xx_2.7.0_2.4_1609170390869.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_afa_en_xx_2.7.0_2.4_1609170390869.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_afa_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

data = "text to translate"
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_afa_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.afa.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
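`fullAnnotate` returns, for each input document, the annotations keyed by output column. A small pure-Python sketch of flattening those results into plain translated strings; the `Annotation` class below is a mock standing in for Spark NLP's annotation objects, which expose the generated text in a `result` field:

```python
# Sketch: collect translated sentences from a fullAnnotate-style result.
# The structure below is a mock of the output shape (one dict per input
# document, mapping output columns to annotation objects with `.result`);
# it is an assumption for illustration, not the annotator's internals.
class Annotation:
    def __init__(self, result):
        self.result = result

def translations(full_annotate_output, col="translation"):
    return [ann.result for doc in full_annotate_output for ann in doc[col]]

mock = [{"translation": [Annotation("first sentence."), Annotation("second sentence.")]}]
print(translations(mock))
# -> ['first sentence.', 'second sentence.']
```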
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_afa_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English T5ForConditionalGeneration Cased model (from ThomasNLG) author: John Snow Labs name: t5_qa_squad2neg date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-qa_squad2neg-en` is a English model originally trained by `ThomasNLG`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_qa_squad2neg_en_4.3.0_3.0_1675125429554.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_qa_squad2neg_en_4.3.0_3.0_1675125429554.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_qa_squad2neg","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_qa_squad2neg","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
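This QA model consumes a single text column, so question and context must be joined into one string before entering the pipeline. A hypothetical prompt builder is sketched below; the `" </s> "` separator is an assumption for illustration only — check the model's own documentation for the exact input format it was trained with:

```python
# Hypothetical helper joining a question and its context into the single
# "text" column the pipeline above consumes. The separator is an
# assumption for illustration, not a documented format.
def build_qa_input(question, context, sep=" </s> "):
    return question.strip() + sep + context.strip()

print(build_qa_input("Who wrote Hamlet?", "Hamlet is a tragedy by William Shakespeare."))
# -> Who wrote Hamlet? </s> Hamlet is a tragedy by William Shakespeare.
```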
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_qa_squad2neg| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|922.0 MB| ## References - https://huggingface.co/ThomasNLG/t5-qa_squad2neg-en - https://github.com/ThomasScialom/QuestEval - https://arxiv.org/abs/2103.12693 --- layout: model title: English T5ForConditionalGeneration Small Cased model (from mrm8488) author: John Snow Labs name: t5_small_finetuned_text2log date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-finetuned-text2log` is a English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_finetuned_text2log_en_4.3.0_3.0_1675126191501.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_finetuned_text2log_en_4.3.0_3.0_1675126191501.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_small_finetuned_text2log","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_small_finetuned_text2log","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
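Since this model generates logic formulas as free text, a downstream consumer may want a cheap well-formedness check before parsing them. A generic post-processing sketch (not part of the model) that verifies parentheses are balanced:

```python
# Simple sanity check for generated logical-form strings: verify the
# parentheses are balanced before attempting downstream parsing.
# Generic post-processing sketch; not part of the model itself.
def balanced_parens(formula):
    depth = 0
    for ch in formula:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # closing paren with no matching open
                return False
    return depth == 0

print(balanced_parens("all x. (dog(x) -> animal(x))"))  # True
print(balanced_parens("(p(x)"))                          # False
```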
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_finetuned_text2log| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|286.3 MB| ## References - https://huggingface.co/mrm8488/t5-small-finetuned-text2log --- layout: model title: Mapping Companies IRS to Edgar Database author: John Snow Labs name: legmapper_edgar_irs date: 2022-08-18 tags: [en, legal, companies, edgar, data, augmentation, irs, licensed] task: Chunk Mapping language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Chunk Mapper model allows you to, given a detected IRS with any NER model, augment it with information available in the SEC Edgar database. Some of the fields included in this Chunk Mapper are: - Company Name - Sector - Former names - Address, Phone, State - Dates where the company submitted filings - etc. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FIN_LEG_COMPANY_AUGMENTATION/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legmapper_edgar_irs_en_1.0.0_3.2_1660817727715.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legmapper_edgar_irs_en_1.0.0_3.2_1660817727715.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') tokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") embeddings = nlp.WordEmbeddingsModel.pretrained('glove_100d') \ .setInputCols(['document', 'token']) \ .setOutputCol('embeddings') ner_model = nlp.NerDLModel.pretrained("onto_100", "en") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(["CARDINAL"]) CM = legal.ChunkMapperModel()\ .pretrained("legmapper_edgar_irs", "en", "legal/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings") pipeline = nlp.Pipeline().setStages([document_assembler, tokenizer, embeddings, ner_model, ner_converter, CM]) text = ["""873474341 is an American multinational corporation that is engaged in the design, development, manufacturing, and worldwide marketing and sales of footwear, apparel, equipment, accessories, and services"""] test_data = spark.createDataFrame([text]).toDF("text") model = pipeline.fit(test_data) res= model.transform(test_data) ```
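The `mappings` column comes back as annotation rows whose `metadata` carries the relation name (see the Results section). A pure-Python sketch of collapsing those rows into one `{relation: value}` dict per chunk; the rows are mocked here as plain dicts mirroring the fields visible in the output:

```python
# Sketch: collapse ChunkMapper output rows into a {relation: value} dict.
# Rows are mocked as plain dicts mirroring the `result` and
# `metadata.relation` fields visible in the Results section; the real
# rows are Spark Row objects with the same fields.
def mappings_to_dict(rows):
    return {r["metadata"]["relation"]: r["result"] for r in rows}

mock = [
    {"result": "Masterworks 096, LLC", "metadata": {"relation": "name"}},
    {"result": "NY", "metadata": {"relation": "state_location"}},
]
print(mappings_to_dict(mock))
# -> {'name': 'Masterworks 096, LLC', 'state_location': 'NY'}
```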
## Results ```bash [Row(mappings=[Row(annotatorType='labeled_dependency', begin=0, end=8, result='Masterworks 096, LLC', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'name', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='RETAIL-RETAIL STORES, NEC [5990]', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'sic', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='5990', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'sic_code', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='873474341', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'irs_number', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='1231', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'fiscal_year_end', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='NY', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'state_location', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='DE', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'state_incorporation', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='225 LIBERTY STREET', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'business_street', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='NEW YORK', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'business_city', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='NY', metadata={'sentence': '0', 
'chunk': '0', 'entity': '873474341', 'relation': 'business_state', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='10281', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'business_zip', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='2035185172', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'business_phone', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'former_name', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'former_name_date', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='2022-01-10', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'date', 'all_relations': '2022-04-26:::2021-11-17'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='1894064', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'company_id', 'all_relations': ''}, embeddings=[])])] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legmapper_edgar_irs| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|5.7 MB| ## References Manually scrapped Edgar Database --- layout: model title: Snomed to ICD10 Code Mapping author: John Snow Labs name: snomed_icd10cm_mapping date: 2021-05-02 tags: [snomed, icd10cm, en, licensed] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.0.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover 
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline maps SNOMED codes to ICD10CM codes without using any text data. Just feed it comma- or whitespace-delimited SNOMED codes and it will return the corresponding candidate ICD10CM codes as a list (multiple ICD10CM codes for each SNOMED code). For the time being, it supports 132K SNOMED codes and 30K ICD10CM codes; it will be augmented and enriched in upcoming releases.

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_CODE_MAPPING/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/snomed_icd10cm_mapping_en_3.0.2_3.0_1619955719388.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/snomed_icd10cm_mapping_en_3.0.2_3.0_1619955719388.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("snomed_icd10cm_mapping", "en", "clinical/models")
pipeline.annotate('721617001 733187009 109006')
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = new PretrainedPipeline("snomed_icd10cm_mapping", "en", "clinical/models")
val result = pipeline.annotate("721617001 733187009 109006")
```

{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.snomed_to_icd10cm.pipe").predict("""721617001 733187009 109006""")
```
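The pipeline returns two parallel lists, with each ICD10CM entry holding a comma-delimited string of candidate codes (see the Results section). A short post-processing sketch pairing each SNOMED code with its candidates:

```python
# Sketch: pair each SNOMED code with its candidate ICD10CM codes.
# The input dict mirrors the shape shown in the Results section:
# parallel lists, with candidates packed into comma-delimited strings.
def pair_codes(output):
    return {s: icd.split(", ") for s, icd in zip(output["snomed"], output["icd10cm"])}

out = {'snomed': ['721617001', '733187009'],
       'icd10cm': ['K22.70, C15.5', 'M89.59, M89.50, M96.89']}
print(pair_codes(out))
# -> {'721617001': ['K22.70', 'C15.5'], '733187009': ['M89.59', 'M89.50', 'M96.89']}
```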
## Results ```bash {'snomed': ['721617001', '733187009', '109006'], 'icd10cm': ['K22.70, C15.5', 'M89.59, M89.50, M96.89', 'F41.9, F40.10, F94.8, F93.0, F40.8, F93.8']} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|snomed_icd10cm_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.0.2+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - TokenizerModel - LemmatizerModel - Finisher --- layout: model title: Relation Extraction between dates and other entities (ReDL) author: John Snow Labs name: redl_oncology_temporal_biobert_wip date: 2023-01-15 tags: [licensed, clinical, oncology, en, relation_extraction, temporal, tensorflow] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This relation extraction model links Date and Relative_Date extractions to clinical entities such as Test or Cancer_Dx. 
## Predicted Entities `is_date_of`, `O` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_oncology_temporal_biobert_wip_en_4.2.4_3.0_1673774363542.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_oncology_temporal_biobert_wip_en_4.2.4_3.0_1673774363542.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Each relevant relation pair in the pipeline should include one date entity (Date or Relative_Date) and a clinical entity (such as Pathology_Test, Cancer_Dx or Chemotherapy).
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") re_ner_chunk_filter = RENerChunksFilter()\ .setInputCols(["ner_chunk", "dependencies"])\ .setOutputCol("re_ner_chunk")\ .setMaxSyntacticDistance(10)\ .setRelationPairs(["Cancer_Dx-Date", "Date-Cancer_Dx", "Relative_Date-Cancer_Dx", "Cancer_Dx-Relative_Date", "Cancer_Surgery-Date", "Date-Cancer_Surgery", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery"]) re_model = RelationExtractionDLModel.pretrained("redl_oncology_temporal_biobert_wip", "en", "clinical/models")\ .setInputCols(["re_ner_chunk", "sentence"])\ .setOutputCol("relation_extraction") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model]) data = spark.createDataFrame([["Her breast cancer was diagnosed last year."]]).toDF("text") result = 
pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentence", "pos_tags", "token")) .setOutputCol("dependencies") val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunk", "dependencies")) .setOutputCol("re_ner_chunk") .setMaxSyntacticDistance(10) .setRelationPairs(Array("Cancer_Dx-Date", "Date-Cancer_Dx", "Relative_Date-Cancer_Dx", "Cancer_Dx-Relative_Date", "Cancer_Surgery-Date", "Date-Cancer_Surgery", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery")) val re_model = RelationExtractionDLModel.pretrained("redl_oncology_temporal_biobert_wip", "en", "clinical/models") .setPredictionThreshold(0.5f) .setInputCols(Array("re_ner_chunk", "sentence")) .setOutputCol("relation_extraction") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model)) val data = 
Seq("Her breast cancer was diagnosed last year.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.oncology_temporal_biobert_wip").predict("""Her breast cancer was diagnosed last year.""") ```
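The `RENerChunksFilter` stage above only forwards candidate chunk pairs whose entity types appear in the whitelist; since the pipeline lists each pair in both directions, a direct membership test captures the idea. A pure-Python sketch of that pair check (an illustration of the filtering logic, not the annotator's internals):

```python
# Sketch of the pair-filtering idea used by RENerChunksFilter above:
# only chunk pairs whose "EntityA-EntityB" type string appears in the
# whitelist are passed on to the relation extraction model. The pipeline
# lists both directions explicitly, so a membership test suffices.
def is_allowed(e1, e2, pairs):
    return f"{e1}-{e2}" in pairs

pairs = ["Cancer_Dx-Date", "Date-Cancer_Dx",
         "Relative_Date-Cancer_Dx", "Cancer_Dx-Relative_Date"]
print(is_allowed("Cancer_Dx", "Relative_Date", pairs))  # True
print(is_allowed("Cancer_Dx", "Tumor_Size", pairs))     # False
```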
## Results ```bash +----------+---------+-------------+-----------+-------------+-------------+-------------+-----------+---------+----------+ | relation| entity1|entity1_begin|entity1_end| chunk1| entity2|entity2_begin|entity2_end| chunk2|confidence| +----------+---------+-------------+-----------+-------------+-------------+-------------+-----------+---------+----------+ |is_date_of|Cancer_Dx| 4| 16|breast cancer|Relative_Date| 32| 40|last year| 0.9999256| +----------+---------+-------------+-----------+-------------+-------------+-------------+-----------+---------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_oncology_temporal_biobert_wip| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|401.7 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label recall precision f1 support O 0.77 0.81 0.79 302.0 is_date_of 0.82 0.78 0.80 298.0 macro-avg 0.79 0.79 0.79 - ``` --- layout: model title: English DistilBertForQuestionAnswering model (from tli8hf) Squad author: John Snow Labs name: distilbert_qa_unqover_base_uncased_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `unqover-distilbert-base-uncased-squad` is a English model originally trained by `tli8hf`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_unqover_base_uncased_squad_en_4.0.0_3.0_1654729114130.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_unqover_base_uncased_squad_en_4.0.0_3.0_1654729114130.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_unqover_base_uncased_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_unqover_base_uncased_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_tli8hf").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_unqover_base_uncased_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/tli8hf/unqover-distilbert-base-uncased-squad --- layout: model title: English image_classifier_vit_doggos_lol ViTForImageClassification from nateraw author: John Snow Labs name: image_classifier_vit_doggos_lol date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_doggos_lol` is an English model originally trained by nateraw. ## Predicted Entities `bernese mountain dog`, `husky`, `saint bernard` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_doggos_lol_en_4.1.0_3.0_1660167986320.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_doggos_lol_en_4.1.0_3.0_1660167986320.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_doggos_lol", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_doggos_lol", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
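The snippets above assume an existing `imageDF`. As a minimal sketch (assuming a local folder of images; `"path/to/images"` is a placeholder), Spark's built-in `image` data source can produce the DataFrame the `ImageAssembler` reads; listing the files first is a quick sanity check on the input set:

```python
# Illustrative helper, not part of Spark NLP: enumerate image files so the
# input set can be verified before loading it with Spark.
from pathlib import Path

def list_images(folder, exts=(".jpg", ".jpeg", ".png")):
    """Return the sorted file names under `folder` with an image extension."""
    return sorted(p.name for p in Path(folder).rglob("*") if p.suffix.lower() in exts)

# Building imageDF with Spark's image data source (requires an active `spark` session):
# imageDF = spark.read.format("image").option("dropInvalid", True).load("path/to/images")
```

The `dropInvalid` option skips files Spark cannot decode, which avoids pipeline failures on mixed-content folders.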
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_doggos_lol| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Detect Clinical Entities (BertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_jsl date: 2021-09-16 tags: [ner, ner_jsl, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.2.0 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terminology. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. It detects 77 entities. Definitions of Predicted Entities: - `Injury_or_Poisoning`: Physical harm or injury caused to the body, including those caused by accidents, falls, or poisoning of a patient or someone else. - `Direction`: All the information relating to the laterality of the internal and external organs. - `Test`: Mentions of laboratory, pathology, and radiological tests. - `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient. - `Death_Entity`: Mentions that indicate the death of a patient. - `Relationship_Status`: State of the patient's romantic or social relationships (e.g. single, married, divorced). - `Duration`: The duration of a medical treatment or medication use. - `Respiration`: Number of breaths per minute. - `Hyperlipidemia`: Terms that indicate hyperlipidemia with relevant subtypes and synonyms. - `Birth_Entity`: Mentions that indicate giving birth. - `Age`: All mentions of ages, past or present, related to the patient or anybody else. - `Labour_Delivery`: Extractions include stages of labor and delivery. 
- `Family_History_Header`: Identifies section headers that correspond to the Family History of the patient. - `BMI`: Numeric values and other text information related to Body Mass Index. - `Temperature`: All mentions that refer to body temperature. - `Alcohol`: Terms that indicate alcohol use, abuse or drinking issues of a patient or someone else. - `Kidney_Disease`: Terms that refer to any kidney diseases (includes mentions of modifiers such as "Acute" or "Chronic"). - `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else. - `Medical_History_Header`: Identifies section headers that correspond to the Past Medical History of a patient. - `Cerebrovascular_Disease`: All terms that refer to cerebrovascular diseases and events. - `Oxygen_Therapy`: Breathing support triggered by the patient or entirely or partially by machine (e.g. ventilator, BPAP, CPAP). - `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements. - `Psychological_Condition`: All the mental health diagnoses, disorders, conditions or syndromes of a patient or someone else. - `Heart_Disease`: All mentions of acquired, congenital or degenerative heart diseases. - `Employment`: All mentions of patient or provider occupational titles and employment status. - `Obesity`: Terms related to a patient being obese (overweight and BMI are extracted as different labels). - `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.). - `Pregnancy`: All terms related to pregnancy (excluding terms that are extracted with their specific labels, such as "Labour_Delivery" etc.). - `ImagingFindings`: All mentions of radiographic and imagistic findings. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments. 
- `Medical_Device`: All mentions related to medical devices and supplies. - `Race_Ethnicity`: All terms that refer to racial and national origin of sociocultural groups. - `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital Signs headers are extracted separately with their specific labels). - `Symptom`: All the symptoms mentioned in the document, of a patient or someone else. - `Treatment`: Includes therapeutic and minimally invasive treatments and procedures (invasive treatments or procedures are extracted as "Procedure"). - `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs). - `Route`: Drug and medication administration routes, as described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_Ingredient`: Active ingredient/s found in drug products. - `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted. - `Diet`: All mentions and information regarding the patient's dietary habits. - `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by the naked eye. - `LDL`: All mentions related to the lab test and results for LDL (Low Density Lipoprotein). - `VS_Finding`: Qualitative data (e.g. fever, cyanosis, tachycardia) and any other symptoms that refer to vital signs. - `Allergen`: Allergen related extractions mentioned in the document. - `EKG_Findings`: All mentions of EKG readings. - `Imaging_Technique`: All mentions of special radiographic views or special imaging techniques used in radiology. - `Triglycerides`: All terms related to the specific lab test for Triglycerides. 
- `RelativeTime`: Time references that are relative to different times or events (e.g. words such as "approximately", "in the morning"). - `Gender`: Gender-specific nouns and pronouns. - `Pulse`: Peripheral heart rate, without advanced information like measurement location. - `Social_History_Header`: Identifies section headers that correspond to the Social History of a patient. - `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs). - `Diabetes`: All terms related to diabetes mellitus. - `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately. - `Internal_organ_or_component`: All mentions related to internal body parts or organs that cannot be examined by the naked eye. - `Clinical_Dept`: Terms that indicate the medical and/or surgical departments. - `Form`: Drug and medication forms, as described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients. - `Strength`: Potency of one unit of drug (or a combination of drugs); the available measurement units are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Fetus_NewBorn`: All terms related to fetus, infant, new born (excluding terms that are extracted with their specific labels, such as "Labour_Delivery", "Pregnancy" etc.). - `RelativeDate`: Temporal references that are relative to the date of the text or to any other specific date (e.g. 
"approximately two years ago", "about two days ago"). - `Height`: All mentions related to a patient's height. - `Test_Result`: Terms related to all the test results present in the document (clinical test results are included). - `Sexually_Active_or_Sexual_Orientation`: All terms that are related to sexuality, sexual orientations and sexual activity. - `Frequency`: Frequency of administration for a prescribed dose. - `Time`: Specific time references (hour and/or minutes). - `Weight`: All mentions related to a patient's weight. - `Vaccine`: Generic and brand names of vaccines or vaccination procedures. - `Vital_Signs_Header`: Identifies section headers that correspond to the Vital Signs of a patient. - `Communicable_Disease`: Includes all mentions of communicable diseases. - `Dosage`: Quantity prescribed by the physician for an active ingredient; the measurement units are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Overweight`: Terms related to the patient being overweight (BMI and Obesity are extracted separately). - `Hypertension`: All terms related to Hypertension (quantitative data such as 150/100 is extracted as Blood_Pressure). - `HDL`: Terms related to the lab test for HDL (High Density Lipoprotein). - `Total_Cholesterol`: Terms related to the lab test and results for cholesterol. - `Smoking`: All mentions of the smoking status of a patient. - `Date`: Mentions of an exact date, in any format, including day number, month and/or year. 
## Predicted Entities `Injury_or_Poisoning`, `Direction`, `Test`, `Admission_Discharge`, `Death_Entity`, `Relationship_Status`, `Duration`, `Respiration`, `Hyperlipidemia`, `Birth_Entity`, `Age`, `Labour_Delivery`, `Family_History_Header`, `BMI`, `Temperature`, `Alcohol`, `Kidney_Disease`, `Oncological`, `Medical_History_Header`, `Cerebrovascular_Disease`, `Oxygen_Therapy`, `O2_Saturation`, `Psychological_Condition`, `Heart_Disease`, `Employment`, `Obesity`, `Disease_Syndrome_Disorder`, `Pregnancy`, `ImagingFindings`, `Procedure`, `Medical_Device`, `Race_Ethnicity`, `Section_Header`, `Symptom`, `Treatment`, `Substance`, `Route`, `Drug_Ingredient`, `Blood_Pressure`, `Diet`, `External_body_part_or_region`, `LDL`, `VS_Finding`, `Allergen`, `EKG_Findings`, `Imaging_Technique`, `Triglycerides`, `RelativeTime`, `Gender`, `Pulse`, `Social_History_Header`, `Substance_Quantity`, `Diabetes`, `Modifier`, `Internal_organ_or_component`, `Clinical_Dept`, `Form`, `Drug_BrandName`, `Strength`, `Fetus_NewBorn`, `RelativeDate`, `Height`, `Test_Result`, `Sexually_Active_or_Sexual_Orientation`, `Frequency`, `Time`, `Weight`, `Vaccine`, `Vital_Signs_Header`, `Communicable_Disease`, `Dosage`, `Overweight`, `Hypertension`, `HDL`, `Total_Cholesterol`, `Smoking`, `Date` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_BERT_TOKEN_CLASSIFIER/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_en_3.2.0_2.4_1631824058676.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_en_3.2.0_2.4_1631824058676.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_jsl", "en", "clinical/models")\ .setInputCols("token", "document")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter]) p_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']}))) test_sentence = """The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""" result = p_model.transform(spark.createDataFrame(pd.DataFrame({'text': [test_sentence]}))) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_jsl", "en", "clinical/models") .setInputCols(Array("token", "document")) .setOutputCol("ner") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("document","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter)) val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_jsl").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
## Results ```bash +-----------------------------------------+----------------------------+ |chunk |ner_label | +-----------------------------------------+----------------------------+ |21-day-old |Age | |Caucasian |Race_Ethnicity | |male |Gender | |2 days |Duration | |congestion |Symptom | |mom |Gender | |yellow discharge |Symptom | |nares |External_body_part_or_region| |she |Gender | |mild |Modifier | |problems with his breathing while feeding|Symptom | |perioral cyanosis |Symptom | |retractions |Symptom | |One day ago |RelativeDate | |mom |Gender | |tactile temperature |Symptom | |Tylenol |Drug_BrandName | |Baby-girl |Gender | |decreased p.o. intake |Symptom | |His |Gender | |breast-feeding |External_body_part_or_region| |20 minutes |Duration | |q.2h. to 5 to 10 minutes |Frequency | |his |Gender | |respiratory congestion |Symptom | |He |Gender | |tired |Symptom | |fussy |Symptom | |over the past 2 days |RelativeDate | |albuterol |Drug_Ingredient | |ER |Clinical_Dept | |His |Gender | |urine output has |Symptom | |decreased |Symptom | |he |Gender | |per 24 hours |Frequency | |he |Gender | |per 24 hours |Frequency | |Mom |Gender | |diarrhea |Symptom | |His |Gender | |bowel |Internal_organ_or_component | +-----------------------------------------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_jsl| |Compatibility:|Healthcare NLP 3.2.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Case sensitive:|true| |Max sentence length:|256| ## Data Source Trained on data gathered and manually annotated by John Snow Labs. 
https://www.johnsnowlabs.com/data/ ## Benchmarking ```bash Label precision recall f1-score support B-Admission_Discharge 0.96 0.97 0.96 298 B-Age 0.96 0.97 0.97 1545 B-Alcohol 0.89 0.86 0.87 117 B-BMI 1.00 0.67 0.80 9 B-Birth_Entity 0.50 0.60 0.55 5 B-Blood_Pressure 0.78 0.73 0.76 232 B-Cancer_Modifier 0.82 0.90 0.86 10 B-Cerebrovascular_Disease 0.68 0.75 0.71 163 B-Clinical_Dept 0.88 0.88 0.88 1510 B-Communicable_Disease 0.88 0.94 0.91 47 B-Date 0.96 0.97 0.96 960 B-Death_Entity 0.84 0.79 0.82 39 B-Diabetes 0.94 0.93 0.93 129 B-Diet 0.64 0.55 0.59 166 B-Direction 0.92 0.94 0.93 4605 B-Disease_Syndrome_Disorder 0.90 0.87 0.88 6729 B-Dosage 0.73 0.74 0.73 611 B-Drug_BrandName 0.89 0.90 0.89 2919 B-Drug_Ingredient 0.90 0.92 0.91 6243 B-Duration 0.76 0.80 0.78 296 B-EKG_Findings 0.75 0.76 0.76 114 B-Employment 0.86 0.84 0.85 315 B-External_body_part_or_region 0.86 0.87 0.86 3050 B-Family_History_Header 0.99 0.99 0.99 268 B-Fetus_NewBorn 0.71 0.67 0.69 97 B-Form 0.75 0.75 0.75 351 B-Frequency 0.88 0.93 0.90 1238 B-Gender 0.98 0.98 0.98 4409 B-HDL 1.00 0.50 0.67 8 B-Heart_Disease 0.83 0.86 0.85 728 B-Height 0.93 0.87 0.90 31 B-Hyperlipidemia 0.97 1.00 0.99 229 B-Hypertension 0.98 0.99 0.98 423 B-ImagingFindings 0.53 0.59 0.56 218 B-Imaging_Technique 0.49 0.57 0.53 60 B-Injury_or_Poisoning 0.84 0.80 0.82 721 B-Internal_organ_or_component 0.89 0.91 0.90 11514 B-Kidney_Disease 0.78 0.79 0.78 164 B-LDL 0.62 0.67 0.64 12 B-Labour_Delivery 0.75 0.64 0.69 84 B-Medical_Device 0.87 0.90 0.88 6762 B-Medical_History_Header 0.92 0.94 0.93 163 B-Modifier 0.82 0.86 0.84 3491 B-O2_Saturation 0.51 0.73 0.60 154 B-Obesity 0.92 0.97 0.95 120 B-Oncological 0.87 0.89 0.88 925 B-Oncology_Therapy 0.91 0.45 0.61 22 B-Overweight 0.75 0.60 0.67 10 B-Oxygen_Therapy 0.68 0.71 0.70 313 B-Pregnancy 0.75 0.85 0.79 237 B-Procedure 0.89 0.89 0.89 6219 B-Psychological_Condition 0.74 0.76 0.75 209 B-Pulse 0.80 0.74 0.77 178 B-Race_Ethnicity 0.93 0.99 0.96 111 B-Relationship_Status 1.00 0.76 0.87 34 
B-RelativeDate 0.86 0.84 0.85 569 B-RelativeTime 0.65 0.68 0.67 250 B-Respiration 0.87 0.72 0.79 161 B-Route 0.85 0.86 0.85 1361 B-Section_Header 0.96 0.97 0.97 12925 B-Sexually_Active_or_Sexual_Orientation 1.00 1.00 1.00 1 B-Smoking 0.95 0.84 0.89 145 B-Social_History_Header 0.93 0.95 0.94 338 B-Staging 1.00 1.00 1.00 4 B-Strength 0.88 0.88 0.88 794 B-Substance 0.73 0.94 0.82 87 B-Symptom 0.87 0.86 0.87 11526 B-Temperature 0.81 0.84 0.83 198 B-Test 0.87 0.88 0.88 5850 B-Test_Result 0.84 0.85 0.84 2096 B-Time 0.92 0.98 0.95 1119 B-Total_Cholesterol 0.95 0.75 0.84 28 B-Treatment 0.62 0.65 0.64 354 B-Triglycerides 0.59 0.94 0.73 17 B-VS_Finding 0.77 0.86 0.81 592 B-Vaccine 0.90 0.84 0.87 77 B-Vital_Signs_Header 0.94 0.99 0.96 958 B-Weight 0.75 0.86 0.80 109 I-Age 0.92 0.96 0.94 283 I-Alcohol 0.83 0.62 0.71 8 I-Allergen 0.00 0.00 0.00 15 I-BMI 1.00 0.88 0.93 24 I-Blood_Pressure 0.82 0.84 0.83 456 I-Cerebrovascular_Disease 0.48 0.82 0.61 66 I-Clinical_Dept 0.92 0.91 0.91 717 I-Communicable_Disease 0.81 1.00 0.89 25 I-Date 0.94 0.99 0.96 152 I-Death_Entity 0.67 0.40 0.50 5 I-Diabetes 0.99 0.92 0.95 85 I-Diet 0.70 0.72 0.71 83 I-Direction 0.86 0.84 0.85 235 I-Disease_Syndrome_Disorder 0.88 0.86 0.87 3878 I-Dosage 0.78 0.69 0.73 540 I-Drug_BrandName 0.68 0.55 0.61 89 I-Drug_Ingredient 0.83 0.87 0.85 720 I-Duration 0.83 0.86 0.85 567 I-EKG_Findings 0.81 0.69 0.75 175 I-Employment 0.66 0.71 0.69 118 I-External_body_part_or_region 0.85 0.89 0.87 873 I-Family_History_Header 0.97 1.00 0.98 280 I-Fetus_NewBorn 0.55 0.58 0.56 88 I-Frequency 0.86 0.89 0.87 638 I-HDL 1.00 0.75 0.86 4 I-Heart_Disease 0.86 0.86 0.86 732 I-Height 0.97 0.95 0.96 82 I-Hypertension 0.82 0.75 0.78 12 I-ImagingFindings 0.64 0.60 0.62 257 I-Injury_or_Poisoning 0.78 0.79 0.79 589 I-Internal_organ_or_component 0.90 0.91 0.90 5278 I-Kidney_Disease 0.94 0.91 0.93 186 I-LDL 0.43 0.60 0.50 5 I-Labour_Delivery 0.84 0.77 0.80 69 I-Medical_Device 0.88 0.91 0.89 3590 I-Medical_History_Header 1.00 0.96 0.98 433 
I-Metastasis 0.92 0.65 0.76 17 I-Modifier 0.62 0.58 0.60 322 I-O2_Saturation 0.65 0.87 0.75 217 I-Obesity 0.78 1.00 0.88 7 I-Oncological 0.85 0.91 0.88 741 I-Oxygen_Therapy 0.67 0.72 0.70 224 I-Pregnancy 0.57 0.62 0.59 115 I-Procedure 0.91 0.88 0.89 6231 I-Psychological_Condition 0.76 0.88 0.82 69 I-Pulse 0.82 0.82 0.82 278 I-RelativeDate 0.89 0.88 0.89 757 I-RelativeTime 0.70 0.80 0.74 227 I-Respiration 0.90 0.68 0.77 173 I-Route 0.98 0.92 0.95 359 I-Section_Header 0.96 0.98 0.97 7817 I-Sexually_Active_or_Sexual_Orientation 1.00 1.00 1.00 1 I-Smoking 0.67 0.50 0.57 4 I-Social_History_Header 0.91 1.00 0.95 200 I-Strength 0.85 0.89 0.87 761 I-Substance 0.70 0.90 0.79 21 I-Symptom 0.76 0.72 0.74 6496 I-Temperature 0.92 0.87 0.89 301 I-Test 0.84 0.86 0.85 3126 I-Test_Result 0.89 0.82 0.85 1676 I-Time 0.91 0.95 0.93 424 I-Total_Cholesterol 0.67 1.00 0.81 29 I-Treatment 0.68 0.62 0.65 196 I-Vaccine 0.89 0.96 0.92 25 I-Vital_Signs_Header 0.96 0.99 0.97 633 I-Weight 0.82 0.93 0.87 200 O 0.96 0.96 0.96 175897 accuracy - - 0.92 338378 macro-avg 0.76 0.74 0.74 338378 weighted-avg 0.92 0.92 0.92 338378 ``` --- layout: model title: Spanish RobertaForSequenceClassification Cased model (from hackathon-pln-es) author: John Snow Labs name: roberta_classifier_detect_acoso_twitter date: 2022-12-09 tags: [es, open_source, roberta, sequence_classification, classification] task: Text Classification language: es edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Detect-Acoso-Twitter-Es` is a Spanish model originally trained by `hackathon-pln-es`. 
## Predicted Entities `acoso`, `No acoso` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_classifier_detect_acoso_twitter_es_4.2.4_3.0_1670620150630.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_classifier_detect_acoso_twitter_es_4.2.4_3.0_1670620150630.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_detect_acoso_twitter","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_classifier]) data = spark.createDataFrame([["I love you!"], ["I feel lucky to be here."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_detect_acoso_twitter","es") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_classifier)) val data = Seq("I love you!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_classifier_detect_acoso_twitter| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|es| |Size:|309.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/hackathon-pln-es/Detect-Acoso-Twitter-Es --- layout: model title: Part of Speech for Bulgarian author: John Snow Labs name: pos_ud_btb date: 2020-05-04 23:32:00 +0800 task: Part of Speech Tagging language: bg edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [pos, bg] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_btb_bg_2.5.0_2.4_1588621401140.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_btb_bg_2.5.0_2.4_1588621401140.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_btb", "bg") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Освен че е крал на север, Джон Сноу е английски лекар и лидер в развитието на анестезия и медицинска хигиена.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_btb", "bg") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Освен че е крал на север, Джон Сноу е английски лекар и лидер в развитието на анестезия и медицинска хигиена.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Освен че е крал на север, Джон Сноу е английски лекар и лидер в развитието на анестезия и медицинска хигиена."""] pos_df = nlu.load('bg.pos.ud_btb').predict(text, output_level='token') pos_df ```
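The `fullAnnotate` call above returns annotation objects like the sample rows shown below. As an illustrative helper (not part of Spark NLP; it assumes each annotation is dict-like with a `result` field holding the tag and `metadata["word"]` holding the token, mirroring that output shape), the annotations can be flattened into (word, tag) pairs:

```python
# Illustrative post-processing: pair each token with its POS tag.
# Assumes dict-like annotations with `result` and metadata["word"] fields.
def pos_pairs(annotations):
    return [(a["metadata"]["word"], a["result"]) for a in annotations]

sample = [
    {"result": "ADP", "metadata": {"word": "Освен"}},
    {"result": "SCONJ", "metadata": {"word": "че"}},
]
print(pos_pairs(sample))  # [('Освен', 'ADP'), ('че', 'SCONJ')]
```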
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=4, result='ADP', metadata={'word': 'Освен'}), Row(annotatorType='pos', begin=6, end=7, result='SCONJ', metadata={'word': 'че'}), Row(annotatorType='pos', begin=9, end=9, result='AUX', metadata={'word': 'е'}), Row(annotatorType='pos', begin=11, end=14, result='VERB', metadata={'word': 'крал'}), Row(annotatorType='pos', begin=16, end=17, result='ADP', metadata={'word': 'на'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_btb| |Type:|pos| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|bg| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: English asr_wav2vec2_base_timit_demo_colab90 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab90 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_base_timit_demo_colab90` is a English model originally trained by hassnain. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab90_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab90_en_4.2.0_3.0_1664022209096.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab90_en_4.2.0_3.0_1664022209096.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab90", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab90", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
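The `text` output above is produced by CTC decoding of the model's per-frame predictions. As an illustrative sketch only (not Spark NLP's internal implementation), greedy CTC decoding merges repeated labels and then drops the blank token:

```python
# Greedy CTC collapse: merge repeated frame labels, then drop blanks.
# Illustrative only -- Spark NLP handles decoding internally; the "<pad>"
# blank symbol and "|" word boundary are assumed conventions.
BLANK = "<pad>"

def ctc_greedy_decode(frame_labels):
    """Collapse per-frame label predictions into a transcript string."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return "".join(out).replace("|", " ")

frames = ["<pad>", "h", "h", "<pad>", "e", "l", "l", "<pad>", "l", "o", "o"]
print(ctc_greedy_decode(frames))  # -> "hello"
```

Note that a blank between two identical labels (as with the two "l" runs above) is what allows doubled letters to survive the collapse.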
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab90| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: English asr_wav2vec2_base_timit_asr TFWav2Vec2ForCTC from elgeish author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_asr date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_asr` is an English model originally trained by elgeish. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_asr_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_asr_en_4.2.0_3.0_1664025415427.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_asr_en_4.2.0_3.0_1664025415427.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_asr', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_asr", lang = "en") val annotations = pipeline.transform(audioDF) ```
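`transform` expects `audioDF` to hold the raw waveform as an array of floats (16 kHz mono is the usual assumption for Wav2Vec2 models). As a minimal, stdlib-only sketch (the helper name and the synthetic sine-wave input are for illustration), 16-bit PCM WAV bytes can be turned into such an array like this:

```python
import io
import math
import struct
import wave

def wav_to_floats(wav_bytes):
    """Decode 16-bit PCM mono WAV bytes into floats in [-1, 1]."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        assert wf.getsampwidth() == 2 and wf.getnchannels() == 1
        raw = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [s / 32768.0 for s in samples]

# Build a tiny 16 kHz sine-wave WAV in memory to demonstrate the round trip.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    pcm = [int(32767 * 0.5 * math.sin(2 * math.pi * 440 * t / 16000))
           for t in range(160)]
    wf.writeframes(struct.pack("<%dh" % len(pcm), *pcm))

floats = wav_to_floats(buf.getvalue())
print(len(floats))  # -> 160
```

A DataFrame could then be built along the lines of `spark.createDataFrame([(floats,)], ["audio_content"])`, assuming `audio_content` is the input column the pipeline's AudioAssembler expects.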
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_asr| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|349.3 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering model (from csarron) author: John Snow Labs name: bert_qa_csarron_bert_base_uncased_squad_v1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squad-v1` is an English model originally trained by `csarron`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_csarron_bert_base_uncased_squad_v1_en_4.0.0_3.0_1654181380672.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_csarron_bert_base_uncased_squad_v1_en_4.0.0_3.0_1654181380672.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_csarron_bert_base_uncased_squad_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_csarron_bert_base_uncased_squad_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased.by_csarron").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
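Under the hood, extractive QA models like this one score every context token as a possible answer start and end, and the answer is the span with the best combined score. A toy sketch of that span selection (the scores and `max_len` cap are made up for illustration, not the annotator's actual logic):

```python
def best_span(start_scores, end_scores, max_len=30):
    """Pick (i, j) maximizing start_scores[i] + end_scores[j] with i <= j."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        # Only consider spans up to max_len tokens long.
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best[2]:
                best = (i, j, s + end_scores[j])
    return best[:2]

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 3.0, 0.0, 0.1, 0.0, 0.0, 0.5, 0.0]
end   = [0.0, 0.1, 0.0, 2.5, 0.2, 0.0, 0.0, 0.0, 0.9, 0.1]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # -> "Clara"
```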
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_csarron_bert_base_uncased_squad_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/csarron/bert-base-uncased-squad-v1 - https://twitter.com/sysnlp - https://awk.ai/ - https://github.com/csarron - https://www.aclweb.org/anthology/N19-1423/ - https://rajpurkar.github.io/SQuAD-explorer - https://www.aclweb.org/anthology/N19-1423.pdf --- layout: model title: Legal NER - License Grant Clauses (Md, Lighter version) author: John Snow Labs name: legner_grants_md date: 2022-12-01 tags: [en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects license grants / permissions in agreements, provided by a Subject (PERMISSION_SUBJECT) to a Recipient (PERMISSION_INDIRECT_OBJECT). The permission itself is in the PERMISSION tag. This is the `md` (medium) version of the classifier, trained with more data and more resistant to false positives. It also differs from other permission models in that it is lighter and non-transformer based. ## Predicted Entities `PERMISSION`, `PERMISSION_SUBJECT`, `PERMISSION_OBJECT`, `PERMISSION_INDIRECT_OBJECT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_grants_md_en_1.0.0_3.0_1669893713541.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_grants_md_en_1.0.0_3.0_1669893713541.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained('legner_grants_md', 'en', 'legal/models')\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[documentAssembler,sentenceDetector,tokenizer,embeddings,ner_model,ner_converter]) import pandas as pd p_model = nlpPipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']}))) text = """Fox grants to Licensee a limited, exclusive (except as otherwise may be provided in this Agreement), non-transferable (except as permitted in Paragraph 17(d)) right and license""" res = p_model.transform(spark.createDataFrame([[text]]).toDF("text")) from pyspark.sql import functions as F res.select(F.explode(F.arrays_zip('token.result', 'ner.result')).alias("cols")) \ .select(F.expr("cols['0']").alias("token"), F.expr("cols['1']").alias("ner_label"))\ .show(20, truncate=100) ```
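The `ner` column carries token-level BIO tags, which `NerConverter` merges into `ner_chunk` entities. A minimal sketch of that merging, stated as an assumption about its behavior rather than the actual implementation:

```python
def bio_to_chunks(tokens, tags):
    """Group B-/I- tagged tokens into (text, label) chunks."""
    chunks, cur_toks, cur_label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a fresh chunk.
            if cur_toks:
                chunks.append((" ".join(cur_toks), cur_label))
            cur_toks, cur_label = [tok], tag[2:]
        elif tag.startswith("I-") and cur_label == tag[2:]:
            cur_toks.append(tok)  # continuation of the open chunk
        else:
            # O tag (or mismatched I-) closes any open chunk.
            if cur_toks:
                chunks.append((" ".join(cur_toks), cur_label))
            cur_toks, cur_label = [], None
    if cur_toks:
        chunks.append((" ".join(cur_toks), cur_label))
    return chunks

tokens = ["Fox", "grants", "to", "Licensee", "a", "limited", ",", "exclusive"]
tags = ["B-PERMISSION_SUBJECT", "O", "O", "B-PERMISSION_INDIRECT_OBJECT",
        "O", "B-PERMISSION", "I-PERMISSION", "I-PERMISSION"]
print(bio_to_chunks(tokens, tags))
```

On the example tags above this yields the chunks `Fox` (PERMISSION_SUBJECT), `Licensee` (PERMISSION_INDIRECT_OBJECT), and `limited , exclusive` (PERMISSION).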
## Results ```bash +----------------+----------------------------+ | token| ner_label| +----------------+----------------------------+ | Fox| B-PERMISSION_SUBJECT| | grants| O| | to| O| | Licensee|B-PERMISSION_INDIRECT_OBJECT| | a| O| | limited| B-PERMISSION| | ,| I-PERMISSION| | exclusive| I-PERMISSION| | (| I-PERMISSION| | except| I-PERMISSION| | as| I-PERMISSION| | otherwise| I-PERMISSION| | may| I-PERMISSION| | be| I-PERMISSION| | provided| I-PERMISSION| | in| I-PERMISSION| | this| I-PERMISSION| | Agreement| I-PERMISSION| | ),| I-PERMISSION| |non-transferable| I-PERMISSION| +----------------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_grants_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.1 MB| ## References Manual annotations on CUAD dataset ## Benchmarking ```bash label tp fp fn prec rec f1 I-PERMISSION 111 28 37 0.79856116 0.75 0.7735192 B-PERMISSION 12 3 2 0.8 0.85714287 0.82758623 B-PERMISSION_INDIRECT_OBJECT 10 1 5 0.90909094 0.6666667 0.7692308 B-PERMISSION_SUBJECT 9 1 5 0.9 0.64285713 0.74999994 Macro-average 142 33 52 0.68153036 0.5833334 0.72862015 Micro-average 142 33 52 0.81142855 0.73195875 0.76964766 ``` --- layout: model title: French CamemBert Embeddings (from Ghani-25) author: John Snow Labs name: camembert_embeddings_SummFinFR date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `SummFinFR` is a French model originally trained by `Ghani-25`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_SummFinFR_fr_3.4.4_3.0_1653985664557.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_SummFinFR_fr_3.4.4_3.0_1653985664557.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_SummFinFR","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_SummFinFR","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
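The resulting `embeddings` column holds one vector per token (768 dimensions for a CamemBERT base model). A common downstream use is comparing vectors by cosine similarity; a plain-Python sketch with short, made-up vectors standing in for real embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 4-dim token vectors; real CamemBERT vectors have 768 dims.
v_spark = [0.2, 0.9, 0.1, 0.4]
v_nlp = [0.25, 0.85, 0.05, 0.5]
v_banane = [0.9, 0.1, 0.8, 0.0]

# Related tokens should score higher than unrelated ones.
print(cosine(v_spark, v_nlp) > cosine(v_spark, v_banane))  # -> True
```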
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_SummFinFR| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|415.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/Ghani-25/SummFinFR --- layout: model title: Malay T5ForConditionalGeneration Tiny Cased model (from mesolitica) author: John Snow Labs name: t5_finetune_translation_super_tiny_standard_bahasa_cased date: 2023-01-30 tags: [ms, open_source, t5, tensorflow] task: Text Generation language: ms edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `finetune-translation-t5-super-super-tiny-standard-bahasa-cased` is a Malay model originally trained by `mesolitica`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_finetune_translation_super_tiny_standard_bahasa_cased_ms_4.3.0_3.0_1675102214685.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_finetune_translation_super_tiny_standard_bahasa_cased_ms_4.3.0_3.0_1675102214685.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_finetune_translation_super_tiny_standard_bahasa_cased","ms") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_finetune_translation_super_tiny_standard_bahasa_cased","ms") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_finetune_translation_super_tiny_standard_bahasa_cased| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ms| |Size:|37.7 MB| ## References - https://huggingface.co/mesolitica/finetune-translation-t5-super-super-tiny-standard-bahasa-cased - https://github.com/huseinzol05/malay-dataset/tree/master/translation/laser - https://github.com/huseinzol05/malaya/tree/master/session/translation/hf-t5 --- layout: model title: Part of Speech for Spanish author: John Snow Labs name: pos_ud_gsd date: 2021-03-08 tags: [part_of_speech, open_source, spanish, pos_ud_gsd, es] task: Part of Speech Tagging language: es edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - ADV - PRON - VERB - ADP - DET - NOUN - ADJ - SCONJ - CCONJ - PUNCT - NUM - PROPN - AUX - X - SYM - INTJ - PART {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_es_3.0.0_3.0_1615230187276.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_es_3.0.0_3.0_1615230187276.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_gsd", "es") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Hola de John Snow Labs! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_gsd", "es") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Hola de John Snow Labs! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Hola de John Snow Labs! """] token_df = nlu.load('es.pos.ud_gsd').predict(text) token_df ```
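The tagger is an averaged perceptron. Its core can be sketched as a plain perceptron update over (feature, tag) weights; the averaging step and the real feature templates are omitted for brevity, and the feature names here are made up for illustration:

```python
from collections import defaultdict

TAGS = ["NOUN", "PROPN", "ADP"]
weights = defaultdict(float)  # (feature, tag) -> weight

def predict(features):
    """Score each tag by summing its feature weights; take the best."""
    return max(TAGS, key=lambda t: sum(weights[(f, t)] for f in features))

def update(features, gold):
    """Perceptron rule: reward the gold tag, penalize the wrong prediction."""
    pred = predict(features)
    if pred != gold:
        for f in features:
            weights[(f, gold)] += 1.0
            weights[(f, pred)] -= 1.0

# Toy training: a "cap" (capitalized) feature should come to prefer PROPN.
for _ in range(3):
    update(["cap"], "PROPN")
    update(["lower"], "NOUN")

print(predict(["cap"]), predict(["lower"]))  # -> PROPN NOUN
```

The full algorithm additionally averages the weight vectors seen over training, which makes the tagger far less sensitive to the order of updates.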
## Results ```bash token pos 0 Hola PART 1 de ADP 2 John PROPN 3 Snow PROPN 4 Labs PROPN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_gsd| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|es| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from chiendvhust) author: John Snow Labs name: distilbert_qa_chiendvhust_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `chiendvhust`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_chiendvhust_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770415893.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_chiendvhust_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770415893.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_chiendvhust_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_chiendvhust_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_chiendvhust_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/chiendvhust/distilbert-base-uncased-finetuned-squad --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from KayKozaronek) author: John Snow Labs name: xlmroberta_ner_kaykozaronek_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `KayKozaronek`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_kaykozaronek_base_finetuned_panx_de_4.1.0_3.0_1660429548182.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_kaykozaronek_base_finetuned_panx_de_4.1.0_3.0_1660429548182.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_kaykozaronek_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_kaykozaronek_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_kaykozaronek_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/KayKozaronek/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: French CamemBert Embeddings (from Weipeng) author: John Snow Labs name: camembert_embeddings_Weipeng_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `Weipeng`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Weipeng_generic_model_fr_3.4.4_3.0_1653987069537.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Weipeng_generic_model_fr_3.4.4_3.0_1653987069537.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Weipeng_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Weipeng_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_Weipeng_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/Weipeng/dummy-model --- layout: model title: German asr_exp_w2v2t_r_wav2vec2_s466 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_exp_w2v2t_r_wav2vec2_s466 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2t_r_wav2vec2_s466` is a German model originally trained by jonatasgrosman. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2t_r_wav2vec2_s466_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2t_r_wav2vec2_s466_de_4.2.0_3.0_1664108974264.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2t_r_wav2vec2_s466_de_4.2.0_3.0_1664108974264.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2t_r_wav2vec2_s466', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2t_r_wav2vec2_s466", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_exp_w2v2t_r_wav2vec2_s466| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Pipeline to Detect Adverse Drug Events author: John Snow Labs name: ner_ade_clinical_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_ade_clinical](https://nlp.johnsnowlabs.com/2021/04/01/ner_ade_clinical_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_ade_clinical_pipeline_en_3.4.1_3.0_1647874530624.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_ade_clinical_pipeline_en_3.4.1_3.0_1647874530624.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_ade_clinical_pipeline", "en", "clinical/models") pipeline.annotate("Been taking Lipitor for 15 years, have experienced severe fatigue a lot!!!. Doctor moved me to voltaren 2 months ago, so far, have only experienced cramps") ``` ```scala val pipeline = new PretrainedPipeline("ner_ade_clinical_pipeline", "en", "clinical/models") pipeline.annotate("Been taking Lipitor for 15 years, have experienced severe fatigue a lot!!!. Doctor moved me to voltaren 2 months ago, so far, have only experienced cramps") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.ade_clinical.pipeline").predict("""Been taking Lipitor for 15 years, have experienced severe fatigue a lot!!!. Doctor moved me to voltaren 2 months ago, so far, have only experienced cramps""") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |Lipitor |DRUG | |severe fatigue|ADE | |voltaren |DRUG | |cramps |ADE | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_ade_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from arvalinno) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_indosquad_v2 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-indosquad-v2` is an English model originally trained by `arvalinno`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_indosquad_v2_en_4.3.0_3.0_1672768023392.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_indosquad_v2_en_4.3.0_3.0_1672768023392.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_indosquad_v2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_indosquad_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_indosquad_v2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/arvalinno/distilbert-base-uncased-finetuned-indosquad-v2 --- layout: model title: Legal Required Disclosure Clause Binary Classifier author: John Snow Labs name: legclf_req_discl_clause date: 2023-02-13 tags: [en, legal, classification, required, disclosure, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `req_discl` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Keep in mind that the embeddings used by this model allow up to 512 tokens. If your input is longer than that, consider splitting it into smaller pieces (see the same tutorial linked above).
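A minimal sketch of the first technique above, paragraph splitting by multiline breaks, in plain Python. The helper name and the sample clause text are illustrative, not part of Legal NLP:

```python
import re

def split_paragraphs(text: str) -> list[str]:
    # Split on one or more blank lines and drop empty fragments.
    return [p.strip() for p in re.split(r"\n\s*\n+", text) if p.strip()]

# Hypothetical two-clause contract snippet for illustration.
contract = (
    "1. TERM.\nThis Agreement remains in force for one year.\n\n"
    "2. REQUIRED DISCLOSURE.\nThe Receiving Party may disclose Confidential "
    "Information when required by law."
)

paragraphs = split_paragraphs(contract)
```

Each resulting paragraph can then be passed to the classifier as a separate row of the input DataFrame, keeping every piece under the 512-token limit.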
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing a True/False value for each legal clause model you add. ## Predicted Entities `req_discl`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_req_discl_clause_en_1.0.0_3.0_1676305535722.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_req_discl_clause_en_1.0.0_3.0_1676305535722.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_req_discl_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[req_discl]| |[other]| |[other]| |[req_discl]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_req_discl_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.2 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.80 0.80 0.80 5 req_discl 0.89 0.89 0.89 9 accuracy - - 0.86 14 macro-avg 0.84 0.84 0.84 14 weighted-avg 0.86 0.86 0.86 14 ``` --- layout: model title: Pipeline to Detect Drug Chemicals (BertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_drugs_pipeline date: 2023-03-20 tags: [drug, berfortokenclassification, ner, en, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_drugs](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_drugs_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_drugs_pipeline_en_4.3.0_3.2_1679307572006.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_drugs_pipeline_en_4.3.0_3.2_1679307572006.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_drugs_pipeline", "en", "clinical/models") text = '''The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. 
With the objective of determining the usefulnessof vinorelbine monotherapy in patients with advanced or recurrent breast cancerafter standard therapy, we evaluated the efficacy and safety of vinorelbine inpatients previously treated with anthracyclines and taxanes.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_drugs_pipeline", "en", "clinical/models") val text = "The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulnessof vinorelbine monotherapy in patients with advanced or recurrent breast cancerafter standard therapy, we evaluated the efficacy and safety of vinorelbine inpatients previously treated with anthracyclines and taxanes." 
val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_ade.pipeline").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulnessof vinorelbine monotherapy in patients with advanced or recurrent breast cancerafter standard therapy, we evaluated the efficacy and safety of vinorelbine inpatients previously treated with anthracyclines and taxanes.""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:---------------|--------:|------:|:------------|-------------:| | 0 | potassium | 92 | 100 | DrugChem | 0.990254 | | 1 | nucleotide | 471 | 480 | DrugChem | 0.500501 | | 2 | anthracyclines | 1124 | 1137 | DrugChem | 0.999987 | | 3 | taxanes | 1143 | 1149 | DrugChem | 0.999972 | | 4 | vinorelbine | 1203 | 1213 | DrugChem | 0.999991 | | 5 | vinorelbine | 1343 | 1353 | DrugChem | 0.999991 | | 6 | anthracyclines | 1390 | 1403 | DrugChem | 0.99999 | | 7 | taxanes | 1409 | 1415 | DrugChem | 0.999946 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_drugs_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.9 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: English BertForQuestionAnswering Base Cased model (from Khanh) author: John Snow Labs name: bert_qa_base_multilingual_cased_finetuned_viquad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased-finetuned-viquad` is an English model originally trained by `Khanh`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_multilingual_cased_finetuned_viquad_en_4.0.0_3.0_1657183284605.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_multilingual_cased_finetuned_viquad_en_4.0.0_3.0_1657183284605.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_multilingual_cased_finetuned_viquad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_multilingual_cased_finetuned_viquad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_multilingual_cased_finetuned_viquad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|665.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Khanh/bert-base-multilingual-cased-finetuned-viquad --- layout: model title: Modern Greek (1453-) asr_wav2vec2_large_xlsr_53_greek_by_jonatasgrosman TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_greek_by_jonatasgrosman date: 2022-09-25 tags: [wav2vec2, el, audio, open_source, asr] task: Automatic Speech Recognition language: el edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_greek_by_jonatasgrosman` is a Modern Greek (1453-) model originally trained by jonatasgrosman. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_greek_by_jonatasgrosman_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_greek_by_jonatasgrosman_el_4.2.0_3.0_1664108602273.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_greek_by_jonatasgrosman_el_4.2.0_3.0_1664108602273.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_greek_by_jonatasgrosman", "el")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_greek_by_jonatasgrosman", "el") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_greek_by_jonatasgrosman| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|el| |Size:|1.2 GB| --- layout: model title: English image_classifier_vit_violation_classification_bantai_ ViTForImageClassification from AykeeSalazar author: John Snow Labs name: image_classifier_vit_violation_classification_bantai_ date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_violation_classification_bantai_` is an English model originally trained by AykeeSalazar. ## Predicted Entities `Public Smoking`, `Public-Drinking`, `ambiguous`, `non-violation` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_violation_classification_bantai__en_4.1.0_3.0_1660167321181.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_violation_classification_bantai__en_4.1.0_3.0_1660167321181.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_violation_classification_bantai_", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_violation_classification_bantai_", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_violation_classification_bantai_| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from dfsj) author: John Snow Labs name: xlmroberta_ner_dfsj_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `dfsj`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_dfsj_base_finetuned_panx_de_4.1.0_3.0_1660432281814.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_dfsj_base_finetuned_panx_de_4.1.0_3.0_1660432281814.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_dfsj_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_dfsj_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_dfsj_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/dfsj/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: English asr_wav2vec2_indian_english TFWav2Vec2ForCTC from anjulRajendraSharma author: John Snow Labs name: pipeline_asr_wav2vec2_indian_english date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_indian_english` is an English model originally trained by anjulRajendraSharma. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_indian_english_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_indian_english_en_4.2.0_3.0_1664123351004.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_indian_english_en_4.2.0_3.0_1664123351004.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_indian_english', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_indian_english", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_indian_english| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|227.5 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Embeddings Scielowiki 50 dims author: John Snow Labs name: embeddings_scielowiki_50d class: WordEmbeddingsModel language: es repository: clinical/models date: 2020-05-26 task: Embeddings edition: Healthcare NLP 2.5.0 spark_version: 2.4 tags: [clinical,embeddings,es] supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/embeddings_scielowiki_50d_es_2.5.0_2.4_1590467602230.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/embeddings_scielowiki_50d_es_2.5.0_2.4_1590467602230.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python model = WordEmbeddingsModel.pretrained("embeddings_scielowiki_50d","es","clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("word_embeddings") ``` ```scala val model = WordEmbeddingsModel.pretrained("embeddings_scielowiki_50d","es","clinical/models") .setInputCols("document","token") .setOutputCol("word_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.scielowiki.50d").predict("""Put your text here.""") ```
{:.h2_title} ## Results Word2Vec feature vectors based on ``embeddings_scielowiki_50d``. {:.model-param} ## Model Information {:.table-model} |---------------|---------------------------| | Name: | embeddings_scielowiki_50d | | Type: | WordEmbeddingsModel | | Compatibility: | Spark NLP 2.5.0+ | | License: | Licensed | | Edition: | Official | |Input labels: | [document, token] | |Output labels: | [word_embeddings] | | Language: | es | | Dimension: | 50 | {:.h2_title} ## Data Source Trained on Scielo Articles + Clinical Wikipedia Articles https://zenodo.org/record/3744326#.XtViinVKh_U --- layout: model title: Extract Treatment Entities from Voice of the Patient Documents (embeddings_clinical_medium) author: John Snow Labs name: ner_vop_treatment_emb_clinical_medium date: 2023-06-06 tags: [licensed, clinical, ner, en, vop, treatment, drug] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts mentions of treatments from documents written in the patient's own words. ## Predicted Entities `Drug`, `Form`, `Route`, `Duration`, `Dosage`, `Frequency`, `Procedure`, `Treatment` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_treatment_emb_clinical_medium_en_4.4.3_3.0_1686078011720.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_treatment_emb_clinical_medium_en_4.4.3_3.0_1686078011720.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_treatment_emb_clinical_medium", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["My grandpa was diagnosed with type 2 diabetes and had to make some changes to his lifestyle. He also takes metformin and glipizide to help regulate his blood sugar levels. 
It's been a bit of an adjustment, but he's doing well."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_treatment_emb_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("My grandpa was diagnosed with type 2 diabetes and had to make some changes to his lifestyle. He also takes metformin and glipizide to help regulate his blood sugar levels. It's been a bit of an adjustment, but he's doing well.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:----------|:------------| | metformin | Drug | | glipizide | Drug | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_treatment_emb_clinical_medium| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical_medium| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
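The benchmarking tables in these cards report per-label precision, recall, and F1 derived from the raw tp/fp/fn counts. As a sanity check, the derivation can be sketched in plain Python (the function name is illustrative, not part of Spark NLP):

```python
def ner_metrics(tp, fp, fn):
    """Precision, recall and F1 from raw true-positive/false-positive/false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

# The Drug row (tp=1314, fp=115, fn=126) reproduces the reported 0.92/0.91/0.92:
print(ner_metrics(1314, 115, 126))
```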
## Benchmarking ```bash label tp fp fn total precision recall f1 Drug 1314 115 126 1440 0.92 0.91 0.92 Form 249 30 17 266 0.89 0.94 0.91 Route 41 6 7 48 0.87 0.85 0.86 Duration 1896 275 414 2310 0.87 0.82 0.85 Dosage 328 34 84 412 0.91 0.80 0.85 Frequency 920 181 159 1079 0.84 0.85 0.84 Procedure 575 107 130 705 0.84 0.82 0.83 Treatment 159 46 69 228 0.78 0.70 0.73 macro_avg 5482 794 1006 6488 0.86 0.84 0.85 micro_avg 5482 794 1006 6488 0.87 0.84 0.86 ``` --- layout: model title: English BertForQuestionAnswering model (from mrm8488) author: John Snow Labs name: bert_qa_bert_mini_5_finetuned_squadv2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-mini-5-finetuned-squadv2` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_mini_5_finetuned_squadv2_en_4.0.0_3.0_1654183766044.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_mini_5_finetuned_squadv2_en_4.0.0_3.0_1654183766044.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_mini_5_finetuned_squadv2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_mini_5_finetuned_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.base_v2_5.by_mrm8488").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_mini_5_finetuned_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|66.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/bert-mini-5-finetuned-squadv2 --- layout: model title: Legal Asset Distribution Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_distribution_agreement date: 2022-11-10 tags: [en, legal, classification, distribution, agreement, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_distribution_agreement` model is a Legal Longformer Document Classifier that determines whether a document belongs to the class `distribution-agreement` or not (binary classification). Longformers are limited to 4096 tokens, so only the first 4096 tokens of each document are taken into account. We have found that, for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document itself without any extra leading material, 4096 tokens are enough for document classification. If not, let us know and we can apply an alternative approach for you: splitting the document into 4096-token chunks, averaging their embeddings, and training on the averaged representation, which means the whole document is taken into account. In practice, however, this should rarely be necessary.
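The chunk-and-average fallback described above can be sketched independently of Spark NLP. This is an illustrative outline of the idea (not the library's implementation), assuming token embeddings are available as a NumPy matrix:

```python
import numpy as np

def chunked_document_embedding(token_embeddings, chunk_size=4096):
    """Mean-pool consecutive chunks of at most chunk_size token vectors,
    then average the chunk vectors so the whole document contributes."""
    chunks = [token_embeddings[i:i + chunk_size]
              for i in range(0, len(token_embeddings), chunk_size)]
    chunk_means = np.stack([chunk.mean(axis=0) for chunk in chunks])
    return chunk_means.mean(axis=0)

# A 10000-token document with 768-dimensional token embeddings collapses
# to a single 768-dimensional document vector:
doc = np.random.rand(10000, 768)
print(chunked_document_embedding(doc).shape)  # (768,)
```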
## Predicted Entities `distribution-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_distribution_agreement_en_1.0.0_3.0_1668106088121.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_distribution_agreement_en_1.0.0_3.0_1668106088121.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = nlp.ClassifierDLModel.pretrained("legclf_distribution_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[distribution-agreement]| |[other]| |[other]| |[distribution-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_distribution_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.4 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support distribution-agreement 1.00 0.89 0.94 28 other 0.97 1.00 0.98 85 accuracy - - 0.97 113 macro-avg 0.98 0.95 0.96 113 weighted-avg 0.97 0.97 0.97 113 ``` --- layout: model title: Italian CamemBert Embeddings (from Musixmatch) author: John Snow Labs name: camembert_embeddings_umberto_wikipedia_uncased_v1 date: 2022-05-23 tags: [it, open_source, camembert, embeddings] task: Embeddings language: it edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `umberto-wikipedia-uncased-v1` is an Italian model originally trained by `Musixmatch`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_umberto_wikipedia_uncased_v1_it_3.4.4_3.0_1653321429870.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_umberto_wikipedia_uncased_v1_it_3.4.4_3.0_1653321429870.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_umberto_wikipedia_uncased_v1","it") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Adoro Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_umberto_wikipedia_uncased_v1","it") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Adoro Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
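A common downstream use of the token embeddings produced by the pipeline above is vector comparison. A minimal cosine-similarity sketch with toy vectors (independent of Spark NLP; in practice the vectors would come from the `embeddings` output column):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two dense embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vectors pointing the same way score 1.0, orthogonal vectors 0.0:
print(cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0])))  # 1.0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0
```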
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_umberto_wikipedia_uncased_v1| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|it| |Size:|265.5 MB| |Case sensitive:|false| ## References - https://huggingface.co/Musixmatch/umberto-wikipedia-uncased-v1 - https://github.com/musixmatchresearch/umberto - https://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/ - http://bit.ly/35wbSj6 - https://github.com/google/sentencepiece - https://github.com/UniversalDependencies/UD_Italian-ISDT - https://github.com/UniversalDependencies/UD_Italian-ParTUT - http://www.evalita.it/ - https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500 - https://www.sciencedirect.com/science/article/pii/S0004370212000276?via%3Dihub - https://github.com/loretoparisi - https://github.com/simonefrancia - https://github.com/paulthemagno - https://twitter.com/Musixmatch - https://twitter.com/musixmatchai - https://github.com/musixmatchresearch --- layout: model title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury TFWav2Vec2ForCTC from Satyamatury author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury` is an English model originally trained by Satyamatury.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury_en_4.2.0_3.0_1664112292335.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury_en_4.2.0_3.0_1664112292335.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury", lang = "en") val annotations = pipeline.transform(audioDF) ```
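The `audioDF` above is assumed to be a Spark DataFrame holding raw audio samples as an array of floats, as consumed by the AudioAssembler examples elsewhere in this document. One way to produce such floats from a 16-bit mono WAV file using only the Python standard library; the helper name and the commented DataFrame lines are illustrative:

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit mono WAV file and scale its samples to [-1.0, 1.0]."""
    with wave.open(path, "rb") as wav:
        frames = wav.readframes(wav.getnframes())
    # "<h" = little-endian signed 16-bit; two bytes per sample
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# floats = wav_to_floats("speech.wav")
# audioDF = spark.createDataFrame([[floats]], ["audio_content"])
```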
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_Satyamatury| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Pipeline to Detect Anatomical Regions author: John Snow Labs name: ner_anatomy_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_anatomy](https://nlp.johnsnowlabs.com/2021/03/31/ner_anatomy_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_pipeline_en_4.3.0_3.2_1678861992438.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_pipeline_en_4.3.0_3.2_1678861992438.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_anatomy_pipeline", "en", "clinical/models") text = '''This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now. General: Well-developed female, in no acute distress, afebrile. HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist. Neck: No lymphadenopathy. Chest: Clear. Abdomen: Positive bowel sounds and soft. Dermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_anatomy_pipeline", "en", "clinical/models") val text = "This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now. 
General: Well-developed female, in no acute distress, afebrile. HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist. Neck: No lymphadenopathy. Chest: Clear. Abdomen: Positive bowel sounds and soft. Dermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.anatom.pipeline").predict("""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now. General: Well-developed female, in no acute distress, afebrile. HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist. Neck: No lymphadenopathy. Chest: Clear. Abdomen: Positive bowel sounds and soft. Dermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.""") ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:--------------------|--------:|------:|:-----------------------|-------------:| | 0 | skin | 374 | 377 | Organ | 1 | | 1 | Extraocular muscles | 574 | 592 | Organ | 0.68465 | | 2 | turbinates | 659 | 668 | Multi-tissue_structure | 0.9511 | | 3 | Mucous membranes | 716 | 731 | Tissue | 0.90445 | | 4 | bowel | 802 | 806 | Organ | 0.9648 | | 5 | skin | 956 | 959 | Organ | 0.9999 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_anatomy_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Translate English to West Slavic languages Pipeline author: John Snow Labs name: translate_en_zlw date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, zlw, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `zlw` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_zlw_xx_2.7.0_2.4_1609689828637.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_zlw_xx_2.7.0_2.4_1609689828637.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_zlw", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_zlw", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.zlw').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_zlw| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Korean Word Segmentation author: John Snow Labs name: wordseg_kaist_ud date: 2021-01-03 task: Word Segmentation language: ko edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, word_segmentation, ko] supported: true annotator: WordSegmenterModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WordSegmenterModel (WSM) is based on a maximum entropy probability model to detect word boundaries in Chinese text. Chinese text is written without white space between words, and a computer-based application cannot know _a priori_ which sequence of ideograms forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step. References: - Xue, Nianwen. "Chinese word segmentation as character tagging." International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_kaist_ud_ko_2.7.0_2.4_1609693294761.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_kaist_ud_ko_2.7.0_2.4_1609693294761.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline as a substitute for the Tokenizer stage.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") word_segmenter = WordSegmenterModel.pretrained('wordseg_kaist_ud', 'ko')\ .setInputCols("document")\ .setOutputCol("token") pipeline = Pipeline(stages=[ document_assembler, word_segmenter ]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) example = spark.createDataFrame([['비파를탄주하는그늙은명인의시는아름다운화음이었고완벽한음악으로순간적인조화를이룬세계의울림이었다.']], ["text"]) result = model.transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val word_segmenter = WordSegmenterModel.pretrained("wordseg_kaist_ud", "ko") .setInputCols("document") .setOutputCol("token") val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter)) val data = Seq("비파를탄주하는그늙은명인의시는아름다운화음이었고완벽한음악으로순간적인조화를이룬세계의울림이었다.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""비파를탄주하는그늙은명인의시는아름다운화음이었고완벽한음악으로순간적인조화를이룬세계의울림이었다."""] token_df = nlu.load('ko.segment_words').predict(text, output_level='token') token_df ```
## Results ```bash +-------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------+ |text |result | +-------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------+ |비파를탄주하는그늙은명인의시는아름다운화음이었고완벽한음악으로순간적인조화를이룬세계의울림이었다.|[비파를, 탄주하는, 그, 늙은, 명인의, 시는, 아름다운, 화음이었고, 완벽한, 음악으로, 순간적인, 조화를, 이룬, 세계의, 울림이었다, .]| +-------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wordseg_kaist_ud| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[token]| |Language:|ko| ## Data Source We trained the model using the [Universal Dependencies](https://universaldependencies.org) data set from the Korea Advanced Institute of Science and Technology (KAIST-UD). Reference: > Building Universal Dependency Treebanks in Korean, Jayeol Chun, Na-Rae Han, Jena D. Hwang, and Jinho D. Choi. In Proceedings of the 11th International Conference on Language Resources and Evaluation, LREC'18, Miyazaki, Japan, 2018.
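The character-tagging formulation cited in this card (Xue, 2003) assigns each character a position-in-word label and then reads words off the label sequence. A minimal sketch using the common B/I/E/S tag set (illustrative only; not the model's internal format):

```python
def tags_to_words(chars, tags):
    """Rebuild words from per-character position tags:
    B = begins a word, I = inside a word, E = ends a word, S = single-char word."""
    words, current = [], ""
    for char, tag in zip(chars, tags):
        current += char
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words

print(tags_to_words("위대한전통", ["B", "I", "E", "B", "E"]))  # ['위대한', '전통']
```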
## Benchmarking ```bash | Model | precision | recall | f1-score | |---------------|--------------|--------------|--------------| | KO_KAIST-UD | 0.6966 | 0.7779 | 0.7350 | ``` --- layout: model title: English asr_wolof_ASR TFWav2Vec2ForCTC from khady author: John Snow Labs name: asr_wolof_ASR date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wolof_ASR` is an English model originally trained by khady. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wolof_ASR_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wolof_ASR_en_4.2.0_3.0_1664024629366.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wolof_ASR_en_4.2.0_3.0_1664024629366.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wolof_ASR", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wolof_ASR", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wolof_ASR| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: French CamemBert Embeddings (from hackertec) author: John Snow Labs name: camembert_embeddings_generic2 date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy2` is a French model originally trained by `hackertec`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_generic2_fr_3.4.4_3.0_1653991459699.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_generic2_fr_3.4.4_3.0_1653991459699.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_generic2","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_generic2","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_generic2| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/hackertec/dummy2 --- layout: model title: Fast Neural Machine Translation Model from English to Finno-Ugrian Languages author: John Snow Labs name: opus_mt_en_fiu date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, fiu, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `fiu` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_fiu_xx_2.7.0_2.4_1609168517010.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_fiu_xx_2.7.0_2.4_1609168517010.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_fiu", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_fiu", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.fiu').predict(text, output_level='sentence')
opus_df
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_fiu| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Detect bacterial species author: John Snow Labs name: ner_bacterial_species_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_bacterial_species](https://nlp.johnsnowlabs.com/2021/04/01/ner_bacterial_species_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_bacterial_species_pipeline_en_3.4.1_3.0_1647872238493.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_bacterial_species_pipeline_en_3.4.1_3.0_1647872238493.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_bacterial_species_pipeline", "en", "clinical/models") pipeline.annotate("EXAMPLE_TEXT") ``` ```scala val pipeline = new PretrainedPipeline("ner_bacterial_species_pipeline", "en", "clinical/models") pipeline.annotate("EXAMPLE_TEXT") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.bacterial_species.pipeline").predict("""Put your text here.""") ```
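`annotate` returns a plain dict mapping output column names to lists of strings, typically with one IOB tag per token under the NER key. As an illustration of the post-processing the bundled NerConverter performs, here is a minimal BIO-tag collapser in plain Python; the tokens and tags below are invented for illustration, not real model output:

```python
def collapse_bio(tokens, tags):
    # Merge B-/I- tagged tokens into (chunk, label) pairs, skipping O tags.
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Isolated", "Escherichia", "coli", "in", "culture"]
tags = ["O", "B-SPECIES", "I-SPECIES", "O", "O"]
print(collapse_bio(tokens, tags))  # [('Escherichia coli', 'SPECIES')]
```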
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_bacterial_species_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter

---
layout: model
title: Detect Subentity PHI for Deidentification (Arabic)
author: John Snow Labs
name: ner_deid_subentity
date: 2023-05-29
tags: [licensed, ner, clinical, deidentification, arabic, ar]
task: Named Entity Recognition
language: ar
edition: Healthcare NLP 4.4.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.

Deidentification NER (Arabic) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 17 entities. This NER model is trained on a combination of custom datasets and several data augmentation mechanisms. This model uses Word2Vec Arabic Clinical Embeddings.
## Predicted Entities `PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `STREET`, `USERNAME`, `SEX`, `IDNUM`, `EMAIL`, `ZIP`, `MEDICALRECORD`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_ar_4.4.2_3.0_1685387641635.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_ar_4.4.2_3.0_1685387641635.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "ar", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    clinical_ner,
    ner_converter])

text = '''عالج الدكتور محمد المريض أحمد البالغ من العمر 55 سنة في 15/05/2000 في مستشفى مدينة الرباط. رقم هاتفه هو 0610948235 وبريده الإلكتروني mohamedmell@gmail.com.'''

data = spark.createDataFrame([[text]]).toDF("text")

results = nlpPipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("word_embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "ar", "clinical/models")
    .setInputCols(Array("sentence", "token", "word_embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    clinical_ner,
    ner_converter))

val text = """عالج الدكتور محمد المريض أحمد البالغ من العمر 55 سنة في 15/05/2000 في مستشفى مدينة الرباط. رقم هاتفه هو 0610948235 وبريده الإلكتروني mohamedmell@gmail.com."""

val data = Seq(text).toDS.toDF("text")

val results = pipeline.fit(data).transform(data)
```
## Results

```bash
+---------------------+---------+
|chunk                |ner_label|
+---------------------+---------+
|محمد                 |DOCTOR   |
|55 سنة               |AGE      |
|15/05/2000           |DATE     |
|الرباط               |CITY     |
|0610948235           |PHONE    |
|mohamedmell@gmail.com|EMAIL    |
+---------------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_deid_subentity|
|Compatibility:|Healthcare NLP 4.4.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|ar|
|Size:|15.0 MB|

## References

- Custom John Snow Labs datasets
- Data augmentation techniques

## Benchmarking

```bash
        label     tp    fp    fn  total  precision  recall      f1
      PATIENT  196.0  26.0  32.0  228.0     0.8829  0.8596  0.8711
     HOSPITAL  193.0  41.0  37.0  230.0     0.8248  0.8391  0.8319
         DATE  877.0  14.0   8.0  885.0     0.9843   0.991  0.9876
 ORGANIZATION   41.0  11.0   6.0   47.0     0.7885  0.8723  0.8283
         CITY  260.0   8.0   5.0  265.0     0.9701  0.9811  0.9756
       STREET  103.0   3.0   0.0  103.0     0.9717     1.0  0.9856
     USERNAME    8.0   0.0   0.0    8.0        1.0     1.0     1.0
          SEX  300.0   9.0  69.0  369.0     0.9709   0.813   0.885
        IDNUM   13.0   1.0   0.0   13.0     0.9286     1.0   0.963
        EMAIL  112.0   5.0   0.0  112.0     0.9573     1.0  0.9782
          ZIP   80.0   4.0   0.0   80.0     0.9524     1.0  0.9756
MEDICALRECORD   17.0   1.0   0.0   17.0     0.9444     1.0  0.9714
   PROFESSION  303.0  27.0  32.0  335.0     0.9182  0.9045  0.9113
        PHONE   38.0   4.0   2.0   40.0     0.9048    0.95  0.9268
      COUNTRY  158.0  10.0   8.0  166.0     0.9405  0.9518  0.9461
       DOCTOR  440.0  23.0  34.0  474.0     0.9503  0.9283  0.9392
          AGE  610.0  18.0   7.0  617.0     0.9713  0.9887  0.9799
        macro      -     -     -      -          -       -  0.9386
        micro      -     -     -      -          -       -  0.9434
```

---
layout: model
title: Greek BertForQuestionAnswering Cased model (from Danastos)
author: John Snow Labs
name: bert_qa_qacombined_el_3
date: 2022-07-07
tags: [el, open_source, bert, question_answering]
task: Question Answering
language: el
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `qacombined_bert_el_3` is a Greek model originally trained by `Danastos`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_qacombined_el_3_el_4.0.0_3.0_1657190831315.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_qacombined_el_3_el_4.0.0_3.0_1657190831315.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_qacombined_el_3","el") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["Ποιο είναι το όνομά μου?", "Το όνομά μου είναι Κλάρα και μένω στο Μπέρκλεϊ."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_qacombined_el_3","el")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("Ποιο είναι το όνομά μου?", "Το όνομά μου είναι Κλάρα και μένω στο Μπέρκλεϊ.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
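Internally, extractive QA annotators such as BertForQuestionAnswering score each context token as a candidate answer start and as a candidate answer end, then select the highest-scoring valid span. The sketch below illustrates that selection step in plain Python; the token list and score values are invented for illustration:

```python
def best_span(start_scores, end_scores, max_len=30):
    # Pick (start, end) maximizing start_scores[s] + end_scores[e] with s <= e.
    best, best_score = (0, 0), float("-inf")
    for s, ss in enumerate(start_scores):
        for e, es in enumerate(end_scores):
            if s <= e <= s + max_len and ss + es > best_score:
                best, best_score = (s, e), ss + es
    return best

tokens = ["Το", "όνομά", "μου", "είναι", "Κλάρα"]
start = [0.1, 0.0, 0.2, 0.3, 2.5]
end = [0.0, 0.1, 0.2, 0.1, 2.8]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Κλάρα
```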
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_qacombined_el_3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|el| |Size:|421.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Danastos/qacombined_bert_el_3 --- layout: model title: Fast Neural Machine Translation Model from Bengali to English author: John Snow Labs name: opus_mt_bn_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, bn, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `bn` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bn_en_xx_2.7.0_2.4_1609164818886.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bn_en_xx_2.7.0_2.4_1609164818886.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_bn_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_bn_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.bn.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_bn_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Mapping SNOMED Codes with Their Corresponding ICDO Codes author: John Snow Labs name: snomed_icdo_mapping date: 2023-03-29 tags: [en, licensed, clinical, pipeline, chunk_mapping, snomed, icdo] task: Chunk Mapping language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of `snomed_icdo_mapper` model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/snomed_icdo_mapping_en_4.3.2_3.2_1680122937361.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/snomed_icdo_mapping_en_4.3.2_3.2_1680122937361.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("snomed_icdo_mapping", "en", "clinical/models")

result = pipeline.fullAnnotate("10376009 2026006 26638004")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = new PretrainedPipeline("snomed_icdo_mapping", "en", "clinical/models")

val result = pipeline.fullAnnotate("10376009 2026006 26638004")
```

{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.snomed_to_icdo.pipe").predict("""Put your text here.""")
```
## Results

```bash
|    | snomed_code | icdo_code |
|---:|:------------|:----------|
|  0 | 10376009    | 8050/2    |
|  1 | 2026006     | 9014/0    |
|  2 | 26638004    | 8322/0    |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|snomed_icdo_mapping|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|208.8 KB|

## Included Models

- DocumentAssembler
- TokenizerModel
- ChunkMapperModel

---
layout: model
title: Legal Liabilities Clause Binary Classifier
author: John Snow Labs
name: legclf_liabilities_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `liabilities` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
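The first technique mentioned above, paragraph splitting by multiline, can be sketched in a few lines of plain Python; the regex and the sample clause text here are illustrative, not taken from the workshop notebook:

```python
import re

def split_paragraphs(text):
    # Split on one or more blank lines and drop empty fragments.
    return [p.strip() for p in re.split(r"\n\s*\n+", text) if p.strip()]

doc = ("Section 1. Liability.\nThe Company shall not be liable for...\n\n"
       "Section 2. Term.\nThis Agreement begins on the Effective Date...")
for clause in split_paragraphs(doc):
    print(clause.splitlines()[0])
```

Each resulting fragment can then be fed to the classifier as its own `clause_text` row.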
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `liabilities`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_liabilities_clause_en_1.0.0_3.2_1660123675886.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_liabilities_clause_en_1.0.0_3.2_1660123675886.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_liabilities_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-------------+
|       result|
+-------------+
|[liabilities]|
|      [other]|
|      [other]|
|[liabilities]|
+-------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_liabilities_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.8 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
 liabilities       0.96    0.84      0.90       32
       other       0.94    0.99      0.96       75
    accuracy          -       -      0.94      107
   macro-avg       0.95    0.92      0.93      107
weighted-avg       0.94    0.94      0.94      107
```

---
layout: model
title: English RobertaForQuestionAnswering Cased model (from comacrae)
author: John Snow Labs
name: roberta_qa_paraphrasev3
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-paraphrasev3` is an English model originally trained by `comacrae`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_paraphrasev3_en_4.3.0_3.0_1674222313416.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_paraphrasev3_en_4.3.0_3.0_1674222313416.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_paraphrasev3","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_paraphrasev3","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_paraphrasev3|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/comacrae/roberta-paraphrasev3

---
layout: model
title: Mapping ICD-9-CM Codes with Their Corresponding ICD-10-CM Codes
author: John Snow Labs
name: icd9_icd10_mapper
date: 2022-09-30
tags: [en, clinical, chunk_mapping, icd9, icd10, licensed]
task: Chunk Mapping
language: en
nav_key: models
edition: Healthcare NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ChunkMapperModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained model maps ICD-9-CM codes to their corresponding ICD-10-CM codes.

## Predicted Entities

`icd10_code`

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd9_icd10_mapper_en_4.1.0_3.0_1664537323845.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd9_icd10_mapper_en_4.1.0_3.0_1664537323845.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("doc")

chunk_assembler = Doc2Chunk()\
    .setInputCols(["doc"])\
    .setOutputCol("ner_chunk")

chunkerMapper = ChunkMapperModel\
    .pretrained("icd9_icd10_mapper", "en", "clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("mappings")\
    .setRels(["icd10_code"])

mapper_pipeline = Pipeline(stages=[
    document_assembler,
    chunk_assembler,
    chunkerMapper
])

model = mapper_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

light_pipeline = LightPipeline(model)

result = light_pipeline.fullAnnotate("00322")
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("doc")

val chunk_assembler = new Doc2Chunk()
    .setInputCols(Array("doc"))
    .setOutputCol("ner_chunk")

val chunkerMapper = ChunkMapperModel.pretrained("icd9_icd10_mapper", "en", "clinical/models")
    .setInputCols(Array("ner_chunk"))
    .setOutputCol("mappings")
    .setRels(Array("icd10_code"))

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    chunk_assembler,
    chunkerMapper))

val data = Seq("00322").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.icd9_icd10").predict("""Put your text here.""")
```
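Conceptually, the ChunkMapperModel performs a code-to-code dictionary lookup. The toy sketch below mirrors that behavior in plain Python using the single ICD-9/ICD-10 pair shown in the Results section; the real model ships the full mapping table, and the `NONE` fallback for unmatched codes is an assumption for illustration:

```python
# Toy ICD-9 -> ICD-10 lookup; the real ChunkMapperModel covers the full code set.
ICD9_TO_ICD10 = {"00322": "A0222"}

def map_icd9(code):
    # Return the mapped ICD-10 code, or "NONE" when the code is unknown.
    return ICD9_TO_ICD10.get(code, "NONE")

print(map_icd9("00322"))  # A0222
```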
## Results ```bash +---------+-------------+ |icd9_code|icd10_mapping| +---------+-------------+ |00322 |A0222 | +---------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd9_icd10_mapper| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|323.6 KB| --- layout: model title: Pipeline to Detect Living Species(roberta_base_biomedical) author: John Snow Labs name: ner_living_species_roberta_pipeline date: 2023-03-13 tags: [es, ner, clinical, licensed, roberta] task: Named Entity Recognition language: es edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_living_species_roberta](https://nlp.johnsnowlabs.com/2022/06/22/ner_living_species_roberta_es_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_roberta_pipeline_es_4.3.0_3.2_1678731883580.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_roberta_pipeline_es_4.3.0_3.2_1678731883580.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("ner_living_species_roberta_pipeline", "es", "clinical/models")

text = '''Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.'''

result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = new PretrainedPipeline("ner_living_species_roberta_pipeline", "es", "clinical/models")

val text = """Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual."""

val result = pipeline.fullAnnotate(text)
```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:---------------|--------:|------:|:------------|-------------:| | 0 | Lactante varón | 0 | 13 | HUMAN | 0.93805 | | 1 | familiares | 41 | 50 | HUMAN | 1 | | 2 | personales | 78 | 87 | HUMAN | 1 | | 3 | neonatal | 116 | 123 | HUMAN | 0.9997 | | 4 | legumbres | 162 | 170 | SPECIES | 0.9963 | | 5 | lentejas | 243 | 250 | SPECIES | 0.9988 | | 6 | garbanzos | 254 | 262 | SPECIES | 0.9905 | | 7 | legumbres | 290 | 298 | SPECIES | 0.9979 | | 8 | madre | 334 | 338 | HUMAN | 1 | | 9 | Cacahuete | 616 | 624 | SPECIES | 0.9985 | | 10 | padres | 728 | 733 | HUMAN | 1 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_roberta_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|es| |Size:|318.8 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - RoBertaEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Translate English to Congo Swahili Pipeline author: John Snow Labs name: translate_en_swc date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, swc, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list).

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `en`
- target languages: `swc`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_swc_xx_2.7.0_2.4_1609687714800.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_swc_xx_2.7.0_2.4_1609687714800.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_swc", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_swc", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.swc').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_swc| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from krinal214) author: John Snow Labs name: bert_qa_augmented date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `augmented` is an English model originally trained by `krinal214`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_augmented_en_4.0.0_3.0_1654179228784.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_augmented_en_4.0.0_3.0_1654179228784.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_augmented","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_augmented","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.augmented").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
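Under the hood, a span classifier such as this one scores every token position as a possible answer start and answer end, then returns the highest-scoring valid span. A minimal pure-Python sketch of that selection step (illustrative only, not Spark NLP internals; the token scores below are made up):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) pair maximizing start+end score, with start <= end
    and a bounded span length, as span-based QA heads typically do."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 3.1, 0.0, 0.0, 0.1, 0.0, 0.5, 0.0]  # toy start logits
end   = [0.0, 0.1, 0.0, 2.8, 0.1, 0.0, 0.0, 0.1, 0.4, 0.0]  # toy end logits
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # prints "Clara"
```

The real annotator additionally masks out spans that fall inside the question or cross the question/context boundary.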
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_augmented| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/krinal214/augmented --- layout: model title: Chinese BertForMaskedLM Large Cased model (from hfl) author: John Snow Labs name: bert_embeddings_chinese_roberta_wwm_ext_large date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese-roberta-wwm-ext-large` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_wwm_ext_large_zh_4.2.4_3.0_1670021397725.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_wwm_ext_large_zh_4.2.4_3.0_1670021397725.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_wwm_ext_large","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_wwm_ext_large","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
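The `embeddings` column produced above is typically consumed downstream, for example by comparing token or sentence vectors with cosine similarity. A small self-contained sketch of that comparison (toy 3-dimensional vectors, not real BERT output, which is 1024-dimensional for this large model):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

a = [0.2, 0.1, 0.9]  # toy embedding vectors
b = [0.2, 0.0, 0.8]
print(round(cosine(a, b), 3))  # close to 1.0, i.e. very similar
```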
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_wwm_ext_large| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|1.2 GB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/chinese-roberta-wwm-ext-large - https://arxiv.org/abs/1906.08101 - https://github.com/google-research/bert - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://arxiv.org/abs/2004.13922 --- layout: model title: Legal NER for MAPA (Multilingual Anonymisation for Public Administrations) author: John Snow Labs name: legner_mapa date: 2023-04-27 tags: [de, ner, legal, licensed, mapa] task: Named Entity Recognition language: de edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union. This model extracts `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, and `PERSON` entities from `German` documents. ## Predicted Entities `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, `PERSON` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_mapa_de_1.0.0_3.0_1682589773968.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_mapa_de_1.0.0_3.0_1682589773968.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_de_cased", "de")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_mapa", "de", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""Herr Liberato und Frau Grigorescu heirateten am 22 Oktober 2005 in Rom (Italien) und lebten in diesem Mitgliedstaat bis zur Geburt ihres Kindes am 20 Februar 2006 zusammen."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
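The `NerConverter` stage at the end of this pipeline merges token-level IOB tags (`B-PERSON`, `I-PERSON`, `O`, ...) emitted by the NER model into labeled chunks. A rough pure-Python sketch of that merging logic (illustrative only, not the actual Spark NLP implementation):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag closes any open chunk and starts a new one.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # An I- tag with a matching label continues the open chunk.
            current.append(tok)
        else:
            # O tags (and malformed I- tags) close the open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Herr", "Liberato", "und", "Frau", "Grigorescu", "heirateten"]
tags   = ["B-PERSON", "I-PERSON", "O", "B-PERSON", "I-PERSON", "O"]
print(iob_to_chunks(tokens, tags))
# [('Herr Liberato', 'PERSON'), ('Frau Grigorescu', 'PERSON')]
```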
## Results ```bash +----------------+---------+ |chunk |ner_label| +----------------+---------+ |Herr Liberato |PERSON | |Frau Grigorescu |PERSON | |22 Oktober 2005 |DATE | |Rom (Italien) |ADDRESS | |20 Februar 2006 |DATE | +----------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_mapa| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|de| |Size:|1.4 MB| ## References The dataset is available [here](https://huggingface.co/datasets/joelito/mapa). ## Benchmarking ```bash label precision recall f1-score support ADDRESS 0.69 0.85 0.76 13 AMOUNT 1.00 0.75 0.86 4 DATE 0.92 0.93 0.93 61 ORGANISATION 0.64 0.77 0.70 30 PERSON 0.85 0.87 0.86 46 micro-avg 0.82 0.87 0.84 154 macro-avg 0.82 0.83 0.82 154 weighted-avg 0.83 0.87 0.85 154 ``` --- layout: model title: English image_classifier_vit_base_patch16_384 ViTForImageClassification from google author: John Snow Labs name: image_classifier_vit_base_patch16_384 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_patch16_384` is an English model originally trained by google.
## Predicted Entities `turnstile`, `damselfly`, `mixing bowl`, `sea snake`, `cockroach, roach`, `buckle`, `beer glass`, `bulbul`, `lumbermill, sawmill`, `whippet`, `Australian terrier`, `television, television system`, `hoopskirt, crinoline`, `horse cart, horse-cart`, `guillotine`, `malamute, malemute, Alaskan malamute`, `coyote, prairie wolf, brush wolf, Canis latrans`, `colobus, colobus monkey`, `hognose snake, puff adder, sand viper`, `sock`, `burrito`, `printer`, `bathing cap, swimming cap`, `chiton, coat-of-mail shell, sea cradle, polyplacophore`, `Rottweiler`, `cello, violoncello`, `pitcher, ewer`, `computer keyboard, keypad`, `bow`, `peacock`, `ballplayer, baseball player`, `refrigerator, icebox`, `solar dish, solar collector, solar furnace`, `passenger car, coach, carriage`, `African chameleon, Chamaeleo chamaeleon`, `oboe, hautboy, hautbois`, `toyshop`, `Leonberg`, `howler monkey, howler`, `bluetick`, `African elephant, Loxodonta africana`, `American lobster, Northern lobster, Maine lobster, Homarus americanus`, `combination lock`, `black-and-tan coonhound`, `bonnet, poke bonnet`, `harvester, reaper`, `Appenzeller`, `iron, smoothing iron`, `electric locomotive`, `lycaenid, lycaenid butterfly`, `sandbar, sand bar`, `Cardigan, Cardigan Welsh corgi`, `pencil sharpener`, `jean, blue jean, denim`, `backpack, back pack, knapsack, packsack, rucksack, haversack`, `monitor`, `ice cream, icecream`, `apiary, bee house`, `water jug`, `American coot, marsh hen, mud hen, water hen, Fulica americana`, `ground beetle, carabid beetle`, `jigsaw puzzle`, `ant, emmet, pismire`, `wreck`, `kuvasz`, `gyromitra`, `Ibizan hound, Ibizan Podenco`, `brown bear, bruin, Ursus arctos`, `bolo tie, bolo, bola tie, bola`, `Pembroke, Pembroke Welsh corgi`, `French bulldog`, `prison, prison house`, `ballpoint, ballpoint pen, ballpen, Biro`, `stage`, `airliner`, `dogsled, dog sled, dog sleigh`, `redshank, Tringa totanus`, `menu`, `Indian cobra, Naja naja`, `swab, swob, mop`, `window screen`, 
`brain coral`, `artichoke, globe artichoke`, `loupe, jeweler's loupe`, `loudspeaker, speaker, speaker unit, loudspeaker system, speaker system`, `panpipe, pandean pipe, syrinx`, `wok`, `croquet ball`, `plate`, `scoreboard`, `Samoyed, Samoyede`, `ocarina, sweet potato`, `beaver`, `borzoi, Russian wolfhound`, `horizontal bar, high bar`, `stretcher`, `seat belt, seatbelt`, `obelisk`, `forklift`, `feather boa, boa`, `frying pan, frypan, skillet`, `barbershop`, `hamper`, `face powder`, `Siamese cat, Siamese`, `ladle`, `dingo, warrigal, warragal, Canis dingo`, `mountain tent`, `head cabbage`, `echidna, spiny anteater, anteater`, `Polaroid camera, Polaroid Land camera`, `dumbbell`, `espresso`, `notebook, notebook computer`, `Norfolk terrier`, `binoculars, field glasses, opera glasses`, `carpenter's kit, tool kit`, `moving van`, `catamaran`, `tiger beetle`, `bikini, two-piece`, `Siberian husky`, `studio couch, day bed`, `bulletproof vest`, `lawn mower, mower`, `promontory, headland, head, foreland`, `soap dispenser`, `vulture`, `dam, dike, dyke`, `brambling, Fringilla montifringilla`, `toilet tissue, toilet paper, bathroom tissue`, `ringlet, ringlet butterfly`, `tiger cat`, `mobile home, manufactured home`, `Norwich terrier`, `little blue heron, Egretta caerulea`, `English setter`, `Tibetan mastiff`, `rocking chair, rocker`, `mask`, `maze, labyrinth`, `bookcase`, `viaduct`, `sweatshirt`, `plow, plough`, `basenji`, `typewriter keyboard`, `Windsor tie`, `coral fungus`, `desktop computer`, `Kerry blue terrier`, `Angora, Angora rabbit`, `can opener, tin opener`, `shield, buckler`, `triumphal arch`, `horned viper, cerastes, sand viper, horned asp, Cerastes cornutus`, `miniature schnauzer`, `tape player`, `jaguar, panther, Panthera onca, Felis onca`, `hook, claw`, `file, file cabinet, filing cabinet`, `chime, bell, gong`, `shower curtain`, `window shade`, `acoustic guitar`, `gas pump, gasoline pump, petrol pump, island dispenser`, `cicada, cicala`, `Petri dish`, `paintbrush`, 
`banana`, `chickadee`, `mountain bike, all-terrain bike, off-roader`, `lighter, light, igniter, ignitor`, `oil filter`, `cab, hack, taxi, taxicab`, `Christmas stocking`, `rugby ball`, `black widow, Latrodectus mactans`, `bustard`, `fiddler crab`, `web site, website, internet site, site`, `chocolate sauce, chocolate syrup`, `chainlink fence`, `fireboat`, `cocktail shaker`, `airship, dirigible`, `projectile, missile`, `bagel, beigel`, `screwdriver`, `oystercatcher, oyster catcher`, `pot, flowerpot`, `water bottle`, `Loafer`, `drumstick`, `soccer ball`, `cairn, cairn terrier`, `padlock`, `tow truck, tow car, wrecker`, `bloodhound, sleuthhound`, `punching bag, punch bag, punching ball, punchball`, `great grey owl, great gray owl, Strix nebulosa`, `scale, weighing machine`, `trench coat`, `briard`, `cheetah, chetah, Acinonyx jubatus`, `entertainment center`, `Boston bull, Boston terrier`, `Arabian camel, dromedary, Camelus dromedarius`, `steam locomotive`, `coil, spiral, volute, whorl, helix`, `plane, carpenter's plane, woodworking plane`, `gondola`, `spider web, spider's web`, `bathtub, bathing tub, bath, tub`, `pelican`, `miniature poodle`, `cowboy boot`, `perfume, essence`, `lakeside, lakeshore`, `timber wolf, grey wolf, gray wolf, Canis lupus`, `moped`, `sunscreen, sunblock, sun blocker`, `Brabancon griffon`, `puffer, pufferfish, blowfish, globefish`, `lifeboat`, `pool table, billiard table, snooker table`, `Bouvier des Flandres, Bouviers des Flandres`, `Pomeranian`, `theater curtain, theatre curtain`, `marimba, xylophone`, `baboon`, `vacuum, vacuum cleaner`, `pill bottle`, `pick, plectrum, plectron`, `hen`, `American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier`, `digital watch`, `pier`, `oxygen mask`, `Tibetan terrier, chrysanthemum dog`, `ostrich, Struthio camelus`, `water ouzel, dipper`, `drilling platform, offshore rig`, `magnetic compass`, `throne`, `butternut squash`, `minibus`, `EntleBucher`, `carousel, carrousel, 
merry-go-round, roundabout, whirligig`, `hot pot, hotpot`, `rain barrel`, `wood rabbit, cottontail, cottontail rabbit`, `miniature pinscher`, `partridge`, `three-toed sloth, ai, Bradypus tridactylus`, `English springer, English springer spaniel`, `corkscrew, bottle screw`, `fur coat`, `robin, American robin, Turdus migratorius`, `dowitcher`, `ruddy turnstone, Arenaria interpres`, `water snake`, `stove`, `Great Pyrenees`, `soft-coated wheaten terrier`, `carbonara`, `snail`, `breastplate, aegis, egis`, `wolf spider, hunting spider`, `hatchet`, `CD player`, `axolotl, mud puppy, Ambystoma mexicanum`, `pomegranate`, `poncho`, `leatherback turtle, leatherback, leathery turtle, Dermochelys coriacea`, `lorikeet`, `spatula`, `jay`, `platypus, duckbill, duckbilled platypus, duck-billed platypus, Ornithorhynchus anatinus`, `stethoscope`, `flagpole, flagstaff`, `coho, cohoe, coho salmon, blue jack, silver salmon, Oncorhynchus kisutch`, `agama`, `red wolf, maned wolf, Canis rufus, Canis niger`, `beaker`, `eft`, `pretzel`, `brassiere, bra, bandeau`, `frilled lizard, Chlamydosaurus kingi`, `joystick`, `goldfish, Carassius auratus`, `fig`, `maypole`, `caldron, cauldron`, `admiral`, `impala, Aepyceros melampus`, `spotted salamander, Ambystoma maculatum`, `syringe`, `hog, pig, grunter, squealer, Sus scrofa`, `handkerchief, hankie, hanky, hankey`, `tarantula`, `cheeseburger`, `pinwheel`, `sax, saxophone`, `dung beetle`, `broccoli`, `cassette player`, `milk can`, `traffic light, traffic signal, stoplight`, `shovel`, `sarong`, `tabby, tabby cat`, `parallel bars, bars`, `ladybug, ladybeetle, lady beetle, ladybird, ladybird beetle`, `quill, quill pen`, `giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca`, `steel drum`, `quail`, `Blenheim spaniel`, `wig`, `hamster`, `ice lolly, lolly, lollipop, popsicle`, `seashore, coast, seacoast, sea-coast`, `chest`, `worm fence, snake fence, snake-rail fence, Virginia fence`, `missile`, `beer bottle`, `yellow lady's slipper, yellow 
lady-slipper, Cypripedium calceolus, Cypripedium parviflorum`, `breakwater, groin, groyne, mole, bulwark, seawall, jetty`, `white wolf, Arctic wolf, Canis lupus tundrarum`, `guacamole`, `porcupine, hedgehog`, `trolleybus, trolley coach, trackless trolley`, `greenhouse, nursery, glasshouse`, `trimaran`, `Italian greyhound`, `potter's wheel`, `jacamar`, `wallet, billfold, notecase, pocketbook`, `Lakeland terrier`, `green lizard, Lacerta viridis`, `indigo bunting, indigo finch, indigo bird, Passerina cyanea`, `green mamba`, `walking stick, walkingstick, stick insect`, `crossword puzzle, crossword`, `eggnog`, `barrow, garden cart, lawn cart, wheelbarrow`, `remote control, remote`, `bicycle-built-for-two, tandem bicycle, tandem`, `wool, woolen, woollen`, `black grouse`, `abaya`, `marmoset`, `golf ball`, `jeep, landrover`, `Mexican hairless`, `dishwasher, dish washer, dishwashing machine`, `jersey, T-shirt, tee shirt`, `planetarium`, `goose`, `mailbox, letter box`, `capuchin, ringtail, Cebus capucinus`, `marmot`, `orangutan, orang, orangutang, Pongo pygmaeus`, `coffeepot`, `ambulance`, `shopping basket`, `pop bottle, soda bottle`, `red fox, Vulpes vulpes`, `crash helmet`, `street sign`, `affenpinscher, monkey pinscher, monkey dog`, `Arctic fox, white fox, Alopex lagopus`, `sidewinder, horned rattlesnake, Crotalus cerastes`, `ruffed grouse, partridge, Bonasa umbellus`, `muzzle`, `measuring cup`, `canoe`, `reflex camera`, `fox squirrel, eastern fox squirrel, Sciurus niger`, `French loaf`, `killer whale, killer, orca, grampus, sea wolf, Orcinus orca`, `dial telephone, dial phone`, `thimble`, `bubble`, `vizsla, Hungarian pointer`, `running shoe`, `mailbag, postbag`, `radio telescope, radio reflector`, `piggy bank, penny bank`, `Chihuahua`, `chambered nautilus, pearly nautilus, nautilus`, `Airedale, Airedale terrier`, `kimono`, `green snake, grass snake`, `rubber eraser, rubber, pencil eraser`, `upright, upright piano`, `orange`, `revolver, six-gun, six-shooter`, `ashcan, 
trash can, garbage can, wastebin, ash bin, ash-bin, ashbin, dustbin, trash barrel, trash bin`, `drum, membranophone, tympan`, `Dungeness crab, Cancer magister`, `lipstick, lip rouge`, `gong, tam-tam`, `fountain`, `tub, vat`, `malinois`, `sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita`, `German short-haired pointer`, `apron`, `Irish setter, red setter`, `dishrag, dishcloth`, `school bus`, `candle, taper, wax light`, `bib`, `cash machine, cash dispenser, automated teller machine, automatic teller machine, automated teller, automatic teller, ATM`, `power drill`, `English foxhound`, `miniskirt, mini`, `swing`, `slug`, `hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa`, `rifle`, `Saluki, gazelle hound`, `Sealyham terrier, Sealyham`, `bullet train, bullet`, `hyena, hyaena`, `ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus`, `toy terrier`, `goblet`, `safe`, `cup`, `electric guitar`, `red wine`, `restaurant, eating house, eating place, eatery`, `wall clock`, `washbasin, handbasin, washbowl, lavabo, wash-hand basin`, `red-breasted merganser, Mergus serrator`, `crate`, `banded gecko`, `hippopotamus, hippo, river horse, Hippopotamus amphibius`, `tick`, `tripod`, `sombrero`, `desk`, `sea slug, nudibranch`, `racer, race car, racing car`, `pizza, pizza pie`, `dining table, board`, `Saint Bernard, St Bernard`, `komondor`, `electric ray, crampfish, numbfish, torpedo`, `prairie chicken, prairie grouse, prairie fowl`, `coffee mug`, `hammer`, `golfcart, golf cart`, `unicycle, monocycle`, `bison`, `soup bowl`, `rapeseed`, `golden retriever`, `plastic bag`, `grey fox, gray fox, Urocyon cinereoargenteus`, `water tower`, `house finch, linnet, Carpodacus mexicanus`, `barbell`, `hair slide`, `tiger, Panthera tigris`, `black-footed ferret, ferret, Mustela nigripes`, `meat loaf, meatloaf`, `hand blower, blow dryer, blow drier, hair dryer, hair drier`, `overskirt`, `gibbon, Hylobates lar`, `Gila monster, Heloderma suspectum`, `toucan`, 
`snowmobile`, `pencil box, pencil case`, `scuba diver`, `cloak`, `Sussex spaniel`, `otter`, `Greater Swiss Mountain dog`, `great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias`, `torch`, `magpie`, `tiger shark, Galeocerdo cuvieri`, `wing`, `Border collie`, `bell cote, bell cot`, `sea anemone, anemone`, `teapot`, `sea urchin`, `screen, CRT screen`, `bookshop, bookstore, bookstall`, `oscilloscope, scope, cathode-ray oscilloscope, CRO`, `crib, cot`, `police van, police wagon, paddy wagon, patrol wagon, wagon, black Maria`, `hartebeest`, `manhole cover`, `iPod`, `rock python, rock snake, Python sebae`, `nipple`, `suspension bridge`, `safety pin`, `sea lion`, `cougar, puma, catamount, mountain lion, painter, panther, Felis concolor`, `mantis, mantid`, `wardrobe, closet, press`, `projector`, `Granny Smith`, `diamondback, diamondback rattlesnake, Crotalus adamanteus`, `pirate, pirate ship`, `espresso maker`, `African hunting dog, hyena dog, Cape hunting dog, Lycaon pictus`, `cradle`, `common newt, Triturus vulgaris`, `tricycle, trike, velocipede`, `bobsled, bobsleigh, bob`, `thunder snake, worm snake, Carphophis amoenus`, `thresher, thrasher, threshing machine`, `banjo`, `armadillo`, `pajama, pyjama, pj's, jammies`, `ski`, `Maltese dog, Maltese terrier, Maltese`, `leafhopper`, `book jacket, dust cover, dust jacket, dust wrapper`, `silky terrier, Sydney silky`, `Shih-Tzu`, `wallaby, brush kangaroo`, `cardigan`, `sturgeon`, `freight car`, `home theater, home theatre`, `sundial`, `African crocodile, Nile crocodile, Crocodylus niloticus`, `odometer, hodometer, mileometer, milometer`, `sliding door`, `vine snake`, `West Highland white terrier`, `mongoose`, `hornbill`, `beagle`, `European gallinule, Porphyrio porphyrio`, `submarine, pigboat, sub, U-boat`, `Komodo dragon, Komodo lizard, dragon lizard, giant lizard, Varanus komodoensis`, `cock`, `pedestal, plinth, footstall`, `accordion, piano accordion, squeeze box`, `gown`, `lynx, catamount`, 
`guenon, guenon monkey`, `Walker hound, Walker foxhound`, `standard schnauzer`, `reel`, `hip, rose hip, rosehip`, `grasshopper, hopper`, `Dutch oven`, `stone wall`, `hard disc, hard disk, fixed disk`, `snow leopard, ounce, Panthera uncia`, `shopping cart`, `digital clock`, `hourglass`, `Border terrier`, `Old English sheepdog, bobtail`, `academic gown, academic robe, judge's robe`, `spiny lobster, langouste, rock lobster, crawfish, crayfish, sea crawfish`, `spotlight, spot`, `dome`, `barn spider, Araneus cavaticus`, `bee eater`, `basketball`, `cliff dwelling`, `folding chair`, `isopod`, `Doberman, Doberman pinscher`, `bittern`, `sunglasses, dark glasses, shades`, `picket fence, paling`, `Crock Pot`, `ibex, Capra ibex`, `neck brace`, `cardoon`, `cassette`, `amphibian, amphibious vehicle`, `minivan`, `analog clock`, `trailer truck, tractor trailer, trucking rig, rig, articulated lorry, semi`, `yurt`, `cliff, drop, drop-off`, `Bernese mountain dog`, `teddy, teddy bear`, `sloth bear, Melursus ursinus, Ursus ursinus`, `bassoon`, `toaster`, `ptarmigan`, `Gordon setter`, `night snake, Hypsiglena torquata`, `grand piano, grand`, `purse`, `clumber, clumber spaniel`, `shoji`, `hair spray`, `maillot`, `knee pad`, `space heater`, `bottlecap`, `chiffonier, commode`, `chain saw, chainsaw`, `sulphur butterfly, sulfur butterfly`, `pay-phone, pay-station`, `kelpie`, `mouse, computer mouse`, `car wheel`, `cornet, horn, trumpet, trump`, `container ship, containership, container vessel`, `matchstick`, `scabbard`, `American black bear, black bear, Ursus americanus, Euarctos americanus`, `langur`, `rock crab, Cancer irroratus`, `lionfish`, `speedboat`, `black stork, Ciconia nigra`, `knot`, `disk brake, disc brake`, `mosquito net`, `white stork, Ciconia ciconia`, `abacus`, `titi, titi monkey`, `grocery store, grocery, food market, market`, `waffle iron`, `pickelhaube`, `wooden spoon`, `Norwegian elkhound, elkhound`, `earthstar`, `sewing machine`, `balance beam, beam`, `potpie`, `chain 
mail, ring mail, mail, chain armor, chain armour, ring armor, ring armour`, `Staffordshire bullterrier, Staffordshire bull terrier`, `switch, electric switch, electrical switch`, `dhole, Cuon alpinus`, `paddle, boat paddle`, `limousine, limo`, `Shetland sheepdog, Shetland sheep dog, Shetland`, `space bar`, `library`, `paddlewheel, paddle wheel`, `alligator lizard`, `Band Aid`, `Persian cat`, `bull mastiff`, `tailed frog, bell toad, ribbed toad, tailed toad, Ascaphus trui`, `sports car, sport car`, `football helmet`, `laptop, laptop computer`, `lens cap, lens cover`, `tennis ball`, `violin, fiddle`, `lab coat, laboratory coat`, `cinema, movie theater, movie theatre, movie house, picture palace`, `weasel`, `bow tie, bow-tie, bowtie`, `macaw`, `dough`, `whiskey jug`, `microphone, mike`, `spoonbill`, `bassinet`, `mud turtle`, `velvet`, `warthog`, `plunger, plumber's helper`, `dugong, Dugong dugon`, `honeycomb`, `badger`, `dragonfly, darning needle, devil's darning needle, sewing needle, snake feeder, snake doctor, mosquito hawk, skeeter hawk`, `bee`, `doormat, welcome mat`, `fountain pen`, `giant schnauzer`, `assault rifle, assault gun`, `limpkin, Aramus pictus`, `siamang, Hylobates syndactylus, Symphalangus syndactylus`, `albatross, mollymawk`, `confectionery, confectionary, candy store`, `harp`, `parachute, chute`, `barrel, cask`, `tank, army tank, armored combat vehicle, armoured combat vehicle`, `collie`, `kite`, `puck, hockey puck`, `stupa, tope`, `buckeye, horse chestnut, conker`, `patio, terrace`, `broom`, `Dandie Dinmont, Dandie Dinmont terrier`, `scorpion`, `agaric`, `balloon`, `bucket, pail`, `squirrel monkey, Saimiri sciureus`, `Eskimo dog, husky`, `zebra`, `garter snake, grass snake`, `indri, indris, Indri indri, Indri brevicaudatus`, `tractor`, `guinea pig, Cavia cobaya`, `maraca`, `red-backed sandpiper, dunlin, Erolia alpina`, `bullfrog, Rana catesbeiana`, `trilobite`, `Japanese spaniel`, `gorilla, Gorilla gorilla`, `monastery`, `centipede`, `terrapin`, 
`llama`, `long-horned beetle, longicorn, longicorn beetle`, `boxer`, `curly-coated retriever`, `mortar`, `hammerhead, hammerhead shark`, `goldfinch, Carduelis carduelis`, `garden spider, Aranea diademata`, `stopwatch, stop watch`, `grey whale, gray whale, devilfish, Eschrichtius gibbosus, Eschrichtius robustus`, `leaf beetle, chrysomelid`, `birdhouse`, `king crab, Alaska crab, Alaskan king crab, Alaska king crab, Paralithodes camtschatica`, `stole`, `bell pepper`, `radiator`, `flatworm, platyhelminth`, `mushroom`, `Scotch terrier, Scottish terrier, Scottie`, `liner, ocean liner`, `toilet seat`, `lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens`, `zucchini, courgette`, `harvestman, daddy longlegs, Phalangium opilio`, `Newfoundland, Newfoundland dog`, `flamingo`, `whiptail, whiptail lizard`, `geyser`, `cleaver, meat cleaver, chopper`, `sea cucumber, holothurian`, `American egret, great white heron, Egretta albus`, `parking meter`, `beacon, lighthouse, beacon light, pharos`, `coucal`, `motor scooter, scooter`, `mitten`, `cannon`, `weevil`, `megalith, megalithic structure`, `stinkhorn, carrion fungus`, `ear, spike, capitulum`, `box turtle, box tortoise`, `snowplow, snowplough`, `tench, Tinca tinca`, `modem`, `tobacco shop, tobacconist shop, tobacconist`, `barn`, `skunk, polecat, wood pussy`, `African grey, African gray, Psittacus erithacus`, `Madagascar cat, ring-tailed lemur, Lemur catta`, `holster`, `barometer`, `sleeping bag`, `washer, automatic washer, washing machine`, `recreational vehicle, RV, R.V.`, `drake`, `tray`, `butcher shop, meat market`, `china cabinet, china closet`, `medicine chest, medicine cabinet`, `photocopier`, `Yorkshire terrier`, `starfish, sea star`, `racket, racquet`, `park bench`, `Labrador retriever`, `whistle`, `clog, geta, patten, sabot`, `volcano`, `quilt, comforter, comfort, puff`, `leopard, Panthera pardus`, `cauliflower`, `swimming trunks, bathing trunks`, `American chameleon, anole, Anolis carolinensis`, `alp`, 
`mortarboard`, `barracouta, snoek`, `cocker spaniel, English cocker spaniel, cocker`, `space shuttle`, `beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon`, `harmonica, mouth organ, harp, mouth harp`, `gasmask, respirator, gas helmet`, `wombat`, `Model T`, `wild boar, boar, Sus scrofa`, `hermit crab`, `flat-coated retriever`, `rotisserie`, `jinrikisha, ricksha, rickshaw`, `trifle`, `bannister, banister, balustrade, balusters, handrail`, `go-kart`, `bakery, bakeshop, bakehouse`, `ski mask`, `dock, dockage, docking facility`, `Egyptian cat`, `oxcart`, `redbone`, `shoe shop, shoe-shop, shoe store`, `convertible`, `ox`, `crayfish, crawfish, crawdad, crawdaddy`, `cowboy hat, ten-gallon hat`, `conch`, `spaghetti squash`, `toy poodle`, `saltshaker, salt shaker`, `microwave, microwave oven`, `triceratops`, `necklace`, `castle`, `streetcar, tram, tramcar, trolley, trolley car`, `eel`, `diaper, nappy, napkin`, `standard poodle`, `prayer rug, prayer mat`, `radio, wireless`, `crane`, `envelope`, `rule, ruler`, `gar, garfish, garpike, billfish, Lepisosteus osseus`, `spider monkey, Ateles geoffroyi`, `Irish wolfhound`, `German shepherd, German shepherd dog, German police dog, alsatian`, `umbrella`, `sunglass`, `aircraft carrier, carrier, flattop, attack aircraft carrier`, `water buffalo, water ox, Asiatic buffalo, Bubalus bubalis`, `jellyfish`, `groom, bridegroom`, `tree frog, tree-frog`, `steel arch bridge`, `lemon`, `pickup, pickup truck`, `vault`, `groenendael`, `baseball`, `junco, snowbird`, `maillot, tank suit`, `gazelle`, `jack-o'-lantern`, `military uniform`, `slide rule, slipstick`, `wire-haired fox terrier`, `acorn squash`, `electric fan, blower`, `Brittany spaniel`, `chimpanzee, chimp, Pan troglodytes`, `pillow`, `binder, ring-binder`, `schipperke`, `Afghan hound, Afghan`, `plate rack`, `car mirror`, `hand-held computer, hand-held microcomputer`, `papillon`, `schooner`, `Bedlington terrier`, `cellular telephone, cellular phone, 
cellphone, cell, mobile phone`, `altar`, `Chesapeake Bay retriever`, `cabbage butterfly`, `polecat, fitch, foulmart, foumart, Mustela putorius`, `comic book`, `French horn, horn`, `daisy`, `organ, pipe organ`, `mashed potato`, `acorn`, `fly`, `chain`, `American alligator, Alligator mississipiensis`, `mink`, `garbage truck, dustcart`, `totem pole`, `wine bottle`, `strawberry`, `cricket`, `European fire salamander, Salamandra salamandra`, `coral reef`, `Welsh springer spaniel`, `bighorn, bighorn sheep, cimarron, Rocky Mountain bighorn, Rocky Mountain sheep, Ovis canadensis`, `snorkel`, `bald eagle, American eagle, Haliaeetus leucocephalus`, `meerkat, mierkat`, `grille, radiator grille`, `nematode, nematode worm, roundworm`, `anemone fish`, `corn`, `loggerhead, loggerhead turtle, Caretta caretta`, `palace`, `suit, suit of clothes`, `pineapple, ananas`, `macaque`, `ping-pong ball`, `ram, tup`, `church, church building`, `koala, koala bear, kangaroo bear, native bear, Phascolarctos cinereus`, `hare`, `bath towel`, `strainer`, `yawl`, `otterhound, otter hound`, `table lamp`, `king snake, kingsnake`, `lotion`, `lion, king of beasts, Panthera leo`, `thatch, thatched roof`, `basset, basset hound`, `black and gold garden spider, Argiope aurantia`, `barber chair`, `proboscis monkey, Nasalis larvatus`, `consomme`, `Irish terrier`, `Irish water spaniel`, `common iguana, iguana, Iguana iguana`, `Weimaraner`, `Great Dane`, `pug, pug-dog`, `rhinoceros beetle`, `vase`, `brass, memorial tablet, plaque`, `kit fox, Vulpes macrotis`, `king penguin, Aptenodytes patagonica`, `vending machine`, `dalmatian, coach dog, carriage dog`, `rock beauty, Holocanthus tricolor`, `pole`, `cuirass`, `bolete`, `jackfruit, jak, jack`, `monarch, monarch butterfly, milkweed butterfly, Danaus plexippus`, `chow, chow chow`, `nail`, `packet`, `half track`, `Lhasa, Lhasa apso`, `boathouse`, `hay`, `valley, vale`, `slot, one-armed bandit`, `volleyball`, `carton`, `shower cap`, `tile roof`, `lacewing, lacewing 
fly`, `patas, hussar monkey, Erythrocebus patas`, `boa constrictor, Constrictor constrictor`, `black swan, Cygnus atratus`, `lampshade, lamp shade`, `mousetrap`, `crutch`, `vestment`, `Pekinese, Pekingese, Peke`, `tusker`, `warplane, military plane`, `sandal`, `screw`, `custard apple`, `Scottish deerhound, deerhound`, `spindle`, `keeshond`, `hummingbird`, `letter opener, paper knife, paperknife`, `cucumber, cuke`, `bearskin, busby, shako`, `fire engine, fire truck`, `trombone`, `ringneck snake, ring-necked snake, ring snake`, `sorrel`, `fire screen, fireguard`, `paper towel`, `flute, transverse flute`, `hotdog, hot dog, red hot`, `Indian elephant, Elephas maximus`, `mosque`, `stingray`, `Rhodesian ridgeback`, `four-poster` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_384_en_4.1.0_3.0_1660165957586.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_384_en_4.1.0_3.0_1660165957586.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_patch16_384", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_patch16_384", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_patch16_384| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|325.9 MB| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from gulteng) author: John Snow Labs name: distilbert_qa_gulteng_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `gulteng`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_gulteng_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770951880.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_gulteng_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770951880.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_gulteng_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_gulteng_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_gulteng_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/gulteng/distilbert-base-uncased-finetuned-squad --- layout: model title: English RobertaForQuestionAnswering (from ahmedattia143) author: John Snow Labs name: roberta_qa_roberta_squadv1_base date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta_squadv1_base` is a English model originally trained by `ahmedattia143`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_squadv1_base_en_4.0.0_3.0_1655739217489.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_squadv1_base_en_4.0.0_3.0_1655739217489.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_squadv1_base","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_squadv1_base","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base.by_ahmedattia143").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_squadv1_base| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ahmedattia143/roberta_squadv1_base --- layout: model title: Pipeline to Detect Drugs and Proteins author: John Snow Labs name: ner_drugprot_clinical_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, drugprot, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_drugprot_clinical](https://nlp.johnsnowlabs.com/2021/12/20/ner_drugprot_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_drugprot_clinical_pipeline_en_3.4.1_3.0_1647872275186.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_drugprot_clinical_pipeline_en_3.4.1_3.0_1647872275186.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_drugprot_clinical_pipeline", "en", "clinical/models") pipeline.annotate("Anabolic effects of clenbuterol on skeletal muscle are mediated by beta 2-adrenoceptor activation.") ``` ```scala val pipeline = new PretrainedPipeline("ner_drugprot_clinical_pipeline", "en", "clinical/models") pipeline.annotate("Anabolic effects of clenbuterol on skeletal muscle are mediated by beta 2-adrenoceptor activation.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical_drugprot.pipeline").predict("""Anabolic effects of clenbuterol on skeletal muscle are mediated by beta 2-adrenoceptor activation.""") ```
## Results ```bash +-------------------------------+---------+ |chunk |ner_label| +-------------------------------+---------+ |clenbuterol |CHEMICAL | |beta 2-adrenoceptor |GENE | +-------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_drugprot_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Stopwords Remover for Gujarati language (84 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, gu, open_source] task: Stop Words Removal language: gu edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_gu_3.4.1_3.0_1646673173952.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_gu_3.4.1_3.0_1646673173952.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","gu") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["તમે મારા કરતાં વધુ સારા નથી"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","gu") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("તમે મારા કરતાં વધુ સારા નથી").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("gu.stopwords").predict("""તમે મારા કરતાં વધુ સારા નથી""")
```
## Results ```bash +-----------------------------+ |result | +-----------------------------+ |[તમે, મારા, કરતાં, વધુ, સારા]| +-----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|gu| |Size:|1.7 KB| --- layout: model title: English BertForQuestionAnswering model (from aodiniz) author: John Snow Labs name: bert_qa_bert_uncased_L_10_H_512_A_8_cord19_200616_squad2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-10_H-512_A-8_cord19-200616_squad2` is an English model originally trained by `aodiniz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_10_H_512_A_8_cord19_200616_squad2_en_4.0.0_3.0_1654185170892.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_10_H_512_A_8_cord19_200616_squad2_en_4.0.0_3.0_1654185170892.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_10_H_512_A_8_cord19_200616_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_uncased_L_10_H_512_A_8_cord19_200616_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_cord19.bert.uncased_10l_512d_a8a_512d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_uncased_L_10_H_512_A_8_cord19_200616_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|177.9 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-10_H-512_A-8_cord19-200616_squad2 - https://arxiv.org/abs/1908.08962 --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from victorlee071200) author: John Snow Labs name: roberta_qa_base_finetuned_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-base-finetuned-squad` is an English model originally trained by `victorlee071200`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_squad_en_4.3.0_3.0_1674210358228.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_squad_en_4.3.0_3.0_1674210358228.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|307.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/victorlee071200/distilroberta-base-finetuned-squad --- layout: model title: Fast Neural Machine Translation Model from English to Cebuano author: John Snow Labs name: opus_mt_en_ceb date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, ceb, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `ceb` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ceb_xx_2.7.0_2.4_1609166226528.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ceb_xx_2.7.0_2.4_1609166226528.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_ceb", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_ceb", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.ceb').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_ceb| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Use Clause Binary Classifier author: John Snow Labs name: legclf_use_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `use` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities `other`, `use` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_use_clause_en_1.0.0_3.2_1660123163293.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_use_clause_en_1.0.0_3.2_1660123163293.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_use_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
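The description above recommends splitting long legal documents into paragraph-sized pieces before classification, since the sentence embeddings accept at most 512 tokens. A minimal sketch of that pre-processing step (plain Python, independent of Spark NLP; the helper name and sample text are illustrative, not part of the library):

```python
def split_into_paragraphs(text: str) -> list:
    """Split a document on blank lines; each non-empty paragraph
    becomes one candidate clause for the classifier."""
    paragraphs = [p.strip() for p in text.split("\n\n")]
    return [p for p in paragraphs if p]

doc = ("1. USE.\nLicensee may use the Software only as permitted.\n\n"
       "2. TERM.\nThis Agreement remains in force for one year.")
clauses = split_into_paragraphs(doc)
# Each clause can then populate the input column of the pipeline above, e.g.:
# df = spark.createDataFrame([[c] for c in clauses]).toDF("clause_text")
```

Each resulting paragraph is classified independently, so the model sees a coherent clause-sized span instead of the whole contract.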
## Results ```bash +-------+ | result| +-------+ |[use]  | |[other]| |[other]| |[use]  | +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_use_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support absence-of-certain-changes 0.99 0.99 0.99 67 other 0.99 0.99 0.99 175 accuracy - - 0.99 242 macro-avg 0.99 0.99 0.99 242 weighted-avg 0.99 0.99 0.99 242 ``` --- layout: model title: Translate English to Indo-European languages Pipeline author: John Snow Labs name: translate_en_ine date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, ine, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `ine` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ine_xx_2.7.0_2.4_1609686616836.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ine_xx_2.7.0_2.4_1609686616836.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ine", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ine", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ine').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_ine| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English XlmRoBertaForQuestionAnswering (from aicryptogroup) author: John Snow Labs name: xlm_roberta_qa_distill_xlm_mrc date: 2022-06-23 tags: [en, vi, open_source, question_answering, xlmroberta, xx] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distill-xlm-mrc` is a multilingual model originally trained by `aicryptogroup`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_distill_xlm_mrc_en_4.0.0_3.0_1655987250264.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_distill_xlm_mrc_en_4.0.0_3.0_1655987250264.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_distill_xlm_mrc","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_distill_xlm_mrc","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.xlm_roberta.distilled").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_distill_xlm_mrc|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|151.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/aicryptogroup/distill-xlm-mrc
---
layout: model
title: English T5ForConditionalGeneration Cased model (from SkolkovoInstitute)
author: John Snow Labs
name: t5_lewip_informal
date: 2023-01-30
tags: [en, open_source, t5]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `LEWIP-informal` is an English model originally trained by `SkolkovoInstitute`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_lewip_informal_en_4.3.0_3.0_1675098112375.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_lewip_informal_en_4.3.0_3.0_1675098112375.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_lewip_informal","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_lewip_informal","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_lewip_informal|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|914.1 MB|

## References

- https://huggingface.co/SkolkovoInstitute/LEWIP-informal
---
layout: model
title: Italian Legal Roberta Embeddings
author: John Snow Labs
name: roberta_base_italian_legal
date: 2023-02-16
tags: [it, italian, embeddings, transformer, open_source, legal, tensorflow]
task: Embeddings
language: it
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-italian-roberta-base` is an Italian model originally trained by `joelito`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_italian_legal_it_4.2.4_3.0_1676579541941.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_italian_legal_it_4.2.4_3.0_1676579541941.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_italian_legal", "it")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")
```

```scala
val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_italian_legal", "it")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")
```
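The `embeddings` column produced above holds one vector per token; downstream components typically compare such vectors with cosine similarity. A minimal self-contained sketch of that comparison (the 3-dimensional toy vectors are invented for illustration — real RoBERTa base vectors have 768 dimensions):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product over the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical embeddings: two related legal terms and one unrelated word.
v_contract = [0.8, 0.1, 0.3]
v_agreement = [0.7, 0.2, 0.35]
v_banana = [0.0, 0.9, 0.1]

print(cosine(v_contract, v_agreement) > cosine(v_contract, v_banana))  # True
```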
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_base_italian_legal|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|it|
|Size:|416.2 MB|
|Case sensitive:|true|

## References

https://huggingface.co/joelito/legal-italian-roberta-base
---
layout: model
title: RxNorm Sbd ChunkResolver
author: John Snow Labs
name: chunkresolve_rxnorm_sbd_clinical
date: 2021-04-16
tags: [entity_resolution, clinical, licensed, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
deprecated: true
annotator: ChunkEntityResolverModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Entity resolution model based on KNN using word embeddings and Word Mover's Distance.

## Predicted Entities

RxNorm Codes and their normalized definition with `clinical_embeddings`.

{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_sbd_clinical_en_3.0.0_3.0_1618603306546.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_sbd_clinical_en_3.0.0_3.0_1618603306546.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
...
rxnorm_resolver = ChunkEntityResolverModel\
    .pretrained('chunkresolve_rxnorm_sbd_clinical', 'en', "clinical/models")\
    .setEnableLevenshtein(True)\
    .setNeighbours(200).setAlternatives(5).setDistanceWeights([3,11,0,0,0,9])\
    .setInputCols(['token', 'chunk_embeddings'])\
    .setOutputCol('rxnorm_resolution')\
    .setPoolingStrategy("MAX")

pipeline_rxnorm = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, rxnorm_resolver])

model = pipeline_rxnorm.fit(spark.createDataFrame([['']]).toDF("text"))

results = model.transform(data)
```

```scala
...
val rxnorm_resolver = ChunkEntityResolverModel
    .pretrained("chunkresolve_rxnorm_sbd_clinical", "en", "clinical/models")
    .setEnableLevenshtein(true)
    .setNeighbours(200).setAlternatives(5).setDistanceWeights(Array(3,11,0,0,0,9))
    .setInputCols(Array("token", "chunk_embeddings"))
    .setOutputCol("rxnorm_resolution")
    .setPoolingStrategy("MAX")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, rxnorm_resolver))

val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```
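The resolver above enables Levenshtein re-ranking via `setEnableLevenshtein(True)`. For reference, here is a minimal stand-alone implementation of the Levenshtein edit distance this refers to (an illustrative sketch, not the library's code):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# A transposed drug-name typo counts as two single-character edits.
print(levenshtein("metformin", "metfromin"))  # 2
```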
## Results ```bash +-----------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ | chunk| entity| target_text(rxnorm)| code|confidence| +-----------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ | metformin|TREATMENT|Metformin hydrochloride 500 MG Oral Tablet [Glucamet]:::Metformin hydrochloride 850 MG Oral Table...| 105376| 0.2067| | glipizide|TREATMENT|Glipizide 5 MG Oral Tablet [Minidiab]:::Glipizide 5 MG Oral Tablet [Glucotrol]:::Glipizide 5 MG O...| 105373| 0.2224| | dapagliflozin for T2DM|TREATMENT|dapagliflozin 5 MG / saxagliptin 5 MG Oral Tablet [Qtern]:::dapagliflozin 10 MG / saxagliptin 5 M...|2169276| 0.2532| | atorvastatin and gemfibrozil for HTG|TREATMENT|atorvastatin 20 MG / ezetimibe 10 MG Oral Tablet [Liptruzet]:::atorvastatin 40 MG / ezetimibe 10 ...|1422095| 0.2183| | dapagliflozin|TREATMENT|dapagliflozin 5 MG Oral Tablet [Farxiga]:::dapagliflozin 10 MG Oral Tablet [Farxiga]:::dapagliflo...|1486981| 0.3523| | bicarbonate|TREATMENT|Sodium Bicarbonate 0.417 MEQ/ML Oral Solution [Desempacho]:::potassium bicarbonate 25 MEQ Efferve...|1305099| 0.2149| |insulin drip for euDKA and HTG with a reduction|TREATMENT|insulin aspart, human 30 UNT/ML / insulin degludec 70 UNT/ML Pen Injector [Ryzodeg]:::3 ML insuli...|1994318| 0.2124| | SGLT2 inhibitor|TREATMENT|C1 esterase inhibitor (human) 500 UNT Injection [Cinryze]:::alpha 1-proteinase inhibitor, human 1...| 809871| 0.2044| | insulin glargine|TREATMENT|Insulin Glargine 100 UNT/ML Pen Injector [Lantus]:::Insulin Glargine 300 UNT/ML Pen Injector [Tou...|1359856| 0.2265| | insulin lispro with meals|TREATMENT|Insulin Lispro 100 UNT/ML Cartridge [Humalog]:::Insulin Lispro 200 UNT/ML Pen Injector [Humalog]:...|1652648| 0.2469| | metformin|TREATMENT|Metformin hydrochloride 500 MG 
Oral Tablet [Glucamet]:::Metformin hydrochloride 850 MG Oral Table...| 105376| 0.2067| | SGLT2 inhibitors|TREATMENT|alpha 1-proteinase inhibitor, human 1 MG Injection [Prolastin]:::C1 esterase inhibitor (human) 50...|1661220| 0.2167| +-----------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|chunkresolve_rxnorm_sbd_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token, chunk_embeddings]| |Output Labels:|[rxnorm]| |Language:|en| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from changjin) author: John Snow Labs name: distilbert_qa_changjin_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is a English model originally trained by `changjin`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_changjin_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770382601.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_changjin_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770382601.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_changjin_base_uncased_finetuned_squad","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_changjin_base_uncased_finetuned_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_changjin_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/changjin/distilbert-base-uncased-finetuned-squad
---
layout: model
title: Detect Drug Chemicals
author: John Snow Labs
name: ner_drugs_en
date: 2020-03-25
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 2.4.4
spark_version: 2.4
tags: [ner, en, licensed, clinical]
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

{:.h2_title}
## Description

Pretrained named entity recognition deep learning model for drugs. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNN.

{:.h2_title}
## Predicted Entities

``DrugChem``.

{:.btn-box}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_drugs_en_2.4.4_2.4_1584452534235.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_drugs_en_2.4.4_2.4_1584452534235.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use

Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPython.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_drugs", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([["The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. 
With the objective of determining the usefulnessof vinorelbine monotherapy in patients with advanced or recurrent breast cancerafter standard therapy, we evaluated the efficacy and safety of vinorelbine inpatients previously treated with anthracyclines and taxanes."]]).toDF("text")) results = model.transform(data) ``` ```scala ... val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_drugs", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. 
With the objective of determining the usefulnessof vinorelbine monotherapy in patients with advanced or recurrent breast cancerafter standard therapy, we evaluated the efficacy and safety of vinorelbine inpatients previously treated with anthracyclines and taxanes.").toDF("text") val result = pipeline.fit(data).transform(data) ```
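The NerConverter stage used above merges token-level B-/I- tags into full entity chunks. A simplified pure-Python sketch of that merging logic (for intuition only — not the actual Spark NLP implementation):

```python
def bio_to_chunks(tokens, tags):
    # Merge BIO-tagged tokens into (chunk_text, label) pairs.
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" tag or inconsistent I- tag closes the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["treated", "with", "anthracyclines", "and", "taxanes"]
tags   = ["O", "O", "B-DrugChem", "O", "B-DrugChem"]
print(bio_to_chunks(tokens, tags))
# [('anthracyclines', 'DrugChem'), ('taxanes', 'DrugChem')]
```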
{:.h2_title}
## Results

The output is a dataframe with a sentence per row and a ``"ner"`` column containing all of the entity labels in the sentence, entity character indices, and other metadata.

```bash
+-----------------+---------+
|chunk            |ner      |
+-----------------+---------+
|potassium        |DrugChem |
|anthracyclines   |DrugChem |
|taxanes          |DrugChem |
|vinorelbine      |DrugChem |
+-----------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_drugs_en_2.4.4_2.4|
|Type:|ner|
|Compatibility:|Spark NLP 2.4.4+|
|Edition:|Official|
|License:|Licensed|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Case sensitive:|false|

{:.h2_title}
## Data Source

Trained on i2b2_med7 + FDA with 'embeddings_clinical'. https://www.i2b2.org/NLP/Medication

{:.h2_title}
## Benchmarking

```bash
|    | label         |    tp |   fp |   fn |     prec |      rec |       f1 |
|---:|:--------------|------:|-----:|-----:|---------:|---------:|---------:|
|  0 | B-DrugChem    | 32745 | 1738 |  979 | 0.949598 | 0.97097  | 0.960165 |
|  1 | I-DrugChem    | 35522 | 1551 |  764 | 0.958164 | 0.978945 | 0.968443 |
|  2 | Macro-average | 68267 | 3289 | 1743 | 0.953881 | 0.974958 | 0.964304 |
|  3 | Micro-average | 68267 | 3289 | 1743 | 0.954036 | 0.975104 | 0.964455 |
```
---
layout: model
title: Legal Governing Laws Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_governing_laws_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, governing_laws, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR.
Each label represents the single main topic (theme) of the corresponding contract provision.

This model is a binary classifier (True/False) for the `Governing_Laws` clause type.

To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level.

If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`Governing_Laws`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_governing_laws_bert_en_1.0.0_3.0_1678050561190.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_governing_laws_bert_en_1.0.0_3.0_1678050561190.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_governing_laws_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
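Under the hood, a classifier head like the one above turns the sentence embedding into per-class scores and picks the argmax after a softmax. A toy sketch of that final step (the raw scores below are invented for illustration; this is not the ClassifierDL implementation):

```python
import math

def classify(scores, labels):
    # Softmax over raw class scores, then return (best_label, confidence).
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]

labels = ["Governing_Laws", "Other"]
print(classify([2.3, -1.1], labels))  # high-confidence Governing_Laws
```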
## Results

```bash
+----------------+
|result          |
+----------------+
|[Governing_Laws]|
|[Other]         |
|[Other]         |
|[Governing_Laws]|
+----------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_governing_laws_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.9 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
         label  precision  recall  f1-score  support
Governing_Laws       0.95    0.98      0.96      393
         Other       0.98    0.95      0.97      414
      accuracy          -       -      0.97      807
     macro-avg       0.97    0.97      0.97      807
  weighted-avg       0.97    0.97      0.97      807
```
---
layout: model
title: Fast Neural Machine Translation Model from English to Multiple languages
author: John Snow Labs
name: opus_mt_en_mul
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, en, mul, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `en` - target languages: `mul` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_mul_xx_2.7.0_2.4_1609167318754.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_mul_xx_2.7.0_2.4_1609167318754.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_mul", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_mul", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.mul').predict(text, output_level='sentence')
opus_df
```
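SentenceDetectorDLModel learns sentence boundaries so the translator sees one sentence at a time. A crude rule-based stand-in for what it produces (for intuition only — the real model handles abbreviations, quotes and noisy text far better than this regex):

```python
import re

def naive_split(text):
    # Crude stand-in for SentenceDetectorDL: split after ., ! or ?
    # when followed by whitespace, then drop empty pieces.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(naive_split("Hello world. How are you? Fine!"))
# ['Hello world.', 'How are you?', 'Fine!']
```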
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_mul| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Typo Detector Pipeline for English author: John Snow Labs name: distilbert_token_classifier_typo_detector_pipeline date: 2022-03-08 tags: [ner, bert, bert_for_token, typo, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [distilbert_token_classifier_typo_detector](https://nlp.johnsnowlabs.com/2022/01/19/distilbert_token_classifier_typo_detector_en.html). {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/TYPO_DETECTOR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/DistilBertForTokenClassification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_typo_detector_pipeline_en_3.4.1_3.0_1646736450779.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_typo_detector_pipeline_en_3.4.1_3.0_1646736450779.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python typo_pipeline = PretrainedPipeline("distilbert_token_classifier_typo_detector_pipeline", lang = "en") typo_pipeline.annotate("He had also stgruggled with addiction during his tine in Congress.") ``` ```scala val typo_pipeline = new PretrainedPipeline("distilbert_token_classifier_typo_detector_pipeline", lang = "en") typo_pipeline.annotate("He had also stgruggled with addiction during his tine in Congress.") ```
## Results ```bash +----------+---------+ |chunk |ner_label| +----------+---------+ |stgruggled|PO | |tine |PO | +----------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_typo_detector_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|244.1 MB| ## Included Models - DocumentAssembler - TokenizerModel - DistilBertForTokenClassification - NerConverter - Finisher --- layout: model title: English RobertaForQuestionAnswering (from armageddon) author: John Snow Labs name: roberta_qa_roberta_base_squad2_covid_qa_deepset date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-covid-qa-deepset` is a English model originally trained by `armageddon`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2_covid_qa_deepset_en_4.0.0_3.0_1655735227767.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2_covid_qa_deepset_en_4.0.0_3.0_1655735227767.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_squad2_covid_qa_deepset","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_squad2_covid_qa_deepset","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_covid.roberta.base.by_armageddon").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_squad2_covid_qa_deepset|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/armageddon/roberta-base-squad2-covid-qa-deepset
---
layout: model
title: German asr_wav2vec2_large_xlsr_german_by_maxidl TFWav2Vec2ForCTC from maxidl
author: John Snow Labs
name: asr_wav2vec2_large_xlsr_german_by_maxidl
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_german_by_maxidl` is a German model originally trained by maxidl.

NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_german_by_maxidl_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_german_by_maxidl_de_4.2.0_3.0_1664097922955.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_german_by_maxidl_de_4.2.0_3.0_1664097922955.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_german_by_maxidl", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_german_by_maxidl", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_german_by_maxidl| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: Spell Checker for Drug Names (Norvig) author: John Snow Labs name: spellcheck_drug_norvig date: 2023-05-13 tags: [spellcheck, clinical, en, drug, norvig, licensed] task: Spell Check language: en edition: Healthcare NLP 4.4.0 spark_version: 3.0 supported: true annotator: NorvigSweetingModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model corrects spelling mistakes in drug names using Norvig's spelling correction algorithm, which generates edit candidates within a given edit distance and ranks them against a dictionary of known drug names. ## Predicted Entities {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/spellcheck_drug_norvig_en_4.4.0_3.0_1684020970665.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/spellcheck_drug_norvig_en_4.4.0_3.0_1684020970665.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") spell = NorvigSweetingModel.pretrained("spellcheck_drug_norvig", "en", "clinical/models")\ .setInputCols("token")\ .setOutputCol("corrected_token") pipeline = Pipeline( stages = [ documentAssembler, tokenizer, spell ]) model = pipeline.fit(spark.createDataFrame([['']]).toDF('text')) text = "You have to take Amrosia artemisiifoli, Oactra and a bit of Grastk and lastacaf" test_df = spark.createDataFrame([[text]]).toDF("text") result = model.transform(test_df) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val spell = NorvigSweetingModel.pretrained("spellcheck_drug_norvig", "en", "clinical/models") .setInputCols("token") .setOutputCol("corrected_token") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, spell)) val data = Seq("You have to take Amrosia artemisiifoli, Oactra and a bit of Grastk and lastacaf").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
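Under the hood, Norvig-style correction generates candidate strings within a small edit distance of the misspelled token and keeps the best-known candidate. The licensed model's internals are not public, so the following is only a minimal sketch of the general technique in plain Python; the `vocab` frequency dictionary is hypothetical, invented for illustration:

```python
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def correct(word, vocab):
    """Return the most frequent dictionary word among the edit candidates,
    falling back to the original word if nothing is known."""
    candidates = ({word} & vocab.keys()) or (edits1(word) & vocab.keys()) or {word}
    return max(candidates, key=lambda w: vocab.get(w, 0))

# Hypothetical frequency dictionary of known drug names
vocab = {"odactra": 10, "grastek": 7, "ambrosia": 12}
print(correct("oactra", vocab))  # -> odactra
```

The real annotator additionally uses the surrounding context and a curated drug-name dictionary; this sketch only shows the candidate-generation step.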
## Results ```bash Original Text: You have to take Amrosia artemisiifoli , Oactra and a bit of Grastk and lastacaf Corrected Text: You have to take Ambrosia artemisiifolia , Odactra and a bit of Grastek and lastacaft ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|spellcheck_drug_norvig| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[spell]| |Language:|en| |Size:|4.5 MB| |Case sensitive:|true| --- layout: model title: Urdu Lemmatizer author: John Snow Labs name: lemma date: 2020-11-28 task: Lemmatization language: ur edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [lemmatizer, ur, open_source] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_ur_2.7.0_2.4_1606583060260.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_ur_2.7.0_2.4_1606583060260.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of a pipeline after tokenisation.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "ur") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate(['کی جوکہ جیسے جنھیں جنھوں جسے انھوں جنہوں انہیں جو جس جنہیں جن']) ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "ur") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("کی جوکہ جیسے جنھیں جنھوں جسے انھوں جنہوں انہیں جو جس جنہیں جن").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""کی جوکہ جیسے جنھیں جنھوں جسے انھوں جنہوں انہیں جو جس جنہیں جن"""] lemma_df = nlu.load('ur.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
## Results ```bash {'lemma': [Annotation(token, 0, 1, كر, {'sentence': '0'}), Annotation(token, 3, 6, جوکہ, {'sentence': '0'}), Annotation(token, 8, 11, جیسا, {'sentence': '0'}), Annotation(token, 13, 17, جو, {'sentence': '0'}), Annotation(token, 19, 23, جو, {'sentence': '0'}), Annotation(token, 25, 27, جو, {'sentence': '0'}), Annotation(token, 29, 33, وہ, {'sentence': '0'}), ...]} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|ur| ## Data Source This model is trained on data obtained from [https://universaldependencies.org/](https://universaldependencies.org/) --- layout: model title: Korean Electra Embeddings (from monologg) author: John Snow Labs name: electra_embeddings_koelectra_small_generator date: 2022-05-17 tags: [ko, open_source, electra, embeddings] task: Embeddings language: ko edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `koelectra-small-generator` is a Korean model originally trained by `monologg`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_koelectra_small_generator_ko_3.4.4_3.0_1652786927191.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_koelectra_small_generator_ko_3.4.4_3.0_1652786927191.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_koelectra_small_generator","ko") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["나는 Spark NLP를 좋아합니다"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_koelectra_small_generator","ko") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("나는 Spark NLP를 좋아합니다").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_koelectra_small_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ko| |Size:|52.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/monologg/koelectra-small-generator - https://github.com/monologg/KoELECTRA/blob/master/README_EN.md --- layout: model title: BERT Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on QQP author: John Snow Labs name: bert_wiki_books_qqp date: 2021-08-30 tags: [en, open_source, wikipedia_dataset, books_corpus_dataset, bert_embeddings, qqp_dataset] task: Embeddings language: en nav_key: models edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses a BERT base architecture initialized from https://tfhub.dev/google/experts/bert/wiki_books/1 and fine-tuned on QQP. This is a BERT base architecture but some changes have been made to the original training and export scheme based on more recent learnings. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_wiki_books_qqp_en_3.2.0_3.0_1630322349023.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_wiki_books_qqp_en_3.2.0_3.0_1630322349023.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_wiki_books_qqp", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_wiki_books_qqp", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.wiki_books_qqp').predict(text, output_level='token') embeddings_df ```
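QQP-tuned embeddings like these are typically used for semantic similarity between question pairs, most often scored with cosine similarity. A minimal helper, assuming you have already collected two embedding vectors from the `embeddings` output column of the result DataFrame:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-d vectors; real BERT base embeddings are 768-dimensional
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # -> 1.0
```

Identical vectors score 1.0, orthogonal ones 0.0; near-duplicate questions should land close to 1.0.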
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_wiki_books_qqp| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Case sensitive:|false| ## Data Source [1]: [Wikipedia dataset](https://dumps.wikimedia.org/) [2]: [BooksCorpus dataset](http://yknzhu.wixsite.com/mbweb) [3]: [Quora Question Pairs (QQP) dataset](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) This Model has been imported from: https://tfhub.dev/google/experts/bert/wiki_books/qqp/2 --- layout: model title: XLM-RoBERTa Base NER Pipeline author: John Snow Labs name: xlm_roberta_base_token_classifier_ontonotes_pipeline date: 2022-04-21 tags: [open_source, ner, token_classifier, xlm_roberta, ontonotes, xlm, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [xlm_roberta_base_token_classifier_ontonotes](https://nlp.johnsnowlabs.com/2021/10/03/xlm_roberta_base_token_classifier_ontonotes_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_base_token_classifier_ontonotes_pipeline_en_3.4.1_3.0_1650542464092.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_base_token_classifier_ontonotes_pipeline_en_3.4.1_3.0_1650542464092.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("xlm_roberta_base_token_classifier_ontonotes_pipeline", lang = "en") pipeline.annotate("My name is John and I have been working at John Snow Labs since November 2020.") ``` ```scala val pipeline = new PretrainedPipeline("xlm_roberta_base_token_classifier_ontonotes_pipeline", lang = "en") pipeline.annotate("My name is John and I have been working at John Snow Labs since November 2020.") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PERSON | |John Snow Labs|ORG | |November 2020 |DATE | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_base_token_classifier_ontonotes_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|858.4 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - XlmRoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: Legal Compensation and benefits Clause Binary Classifier author: John Snow Labs name: legclf_compensation_and_benefits_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `compensation-and-benefits` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `compensation-and-benefits` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_compensation_and_benefits_clause_en_1.0.0_3.2_1660122232901.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_compensation_and_benefits_clause_en_1.0.0_3.2_1660122232901.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_compensation_and_benefits_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
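The description above recommends paragraph splitting (by multiline) before classification, since the model's embeddings only cover 512 tokens. That preprocessing step can be sketched independently of Spark NLP; the sample clause text below is invented for illustration:

```python
import re

def split_paragraphs(text):
    """Split a long document on blank lines (multiline splitting),
    dropping empty fragments, so each clause-sized chunk can be
    classified separately."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("1. COMPENSATION. The Executive shall receive a base salary...\n\n"
       "2. TERMINATION. Either party may terminate this Agreement...")
for clause in split_paragraphs(doc):
    print(clause)
```

Each returned chunk would then go into the `clause_text` column of the DataFrame fed to the classifier pipeline.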
## Results ```bash +---------------------------+ |result | +---------------------------+ |[compensation-and-benefits]| |[other] | |[other] | |[compensation-and-benefits]| +---------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_compensation_and_benefits_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support compensation-and-benefits 0.94 0.89 0.91 36 other 0.97 0.98 0.97 119 accuracy - - 0.96 155 macro-avg 0.95 0.94 0.94 155 weighted-avg 0.96 0.96 0.96 155 ``` --- layout: model title: Recognize Entities OntoNotes - BERT Tiny author: John Snow Labs name: onto_recognize_entities_bert_tiny date: 2020-12-09 task: [Named Entity Recognition, Sentence Detection, Embeddings, Pipeline Public] language: en nav_key: models edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [en, open_source, pipeline] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pre-trained pipeline containing an NerDL model trained on OntoNotes 5.0 with `small_bert_L2_128` embeddings. It can extract the following 18 entities: ## Predicted Entities `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_tiny_en_2.7.0_2.4_1607511353779.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_tiny_en_2.7.0_2.4_1607511353779.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('onto_recognize_entities_bert_tiny') result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("onto_recognize_entities_bert_tiny") val result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.") ``` {:.nlu-block} ```python import nlu text = ["""Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament."""] ner_df = nlu.load('en.ner.onto.bert.tiny').predict(text, output_level='chunk') ner_df[["entities", "entities_class"]] ```
{:.h2_title} ## Results ```bash +------------+---------+ |chunk |ner_label| +------------+---------+ |Johnson |PERSON | |first |ORDINAL | |2001 |DATE | |Parliament |ORG | |eight years |DATE | |London |GPE | |2008 to 2016|DATE | +------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_recognize_entities_bert_tiny| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - Tokenizer - BertEmbeddings - NerDLModel - NerConverter --- layout: model title: English T5ForConditionalGeneration Small Cased model (from hetpandya) author: John Snow Labs name: t5_small_quora date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-quora` is a English model originally trained by `hetpandya`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_quora_en_4.3.0_3.0_1675155570316.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_quora_en_4.3.0_3.0_1675155570316.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_quora","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_quora","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_quora| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|288.7 MB| ## References - https://huggingface.co/hetpandya/t5-small-quora - https://github.com/hetpandya - https://www.linkedin.com/in/het-pandya --- layout: model title: English image_classifier_vit_pond_image_classification_10 ViTForImageClassification from SummerChiam author: John Snow Labs name: image_classifier_vit_pond_image_classification_10 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_pond_image_classification_10` is a English model originally trained by SummerChiam. ## Predicted Entities `Normal`, `Boiling`, `Algae`, `NormalCement`, `NormalRain`, `BoilingNight`, `NormalNight` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_10_en_4.1.0_3.0_1660170931147.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_10_en_4.1.0_3.0_1660170931147.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_pond_image_classification_10", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_pond_image_classification_10", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_pond_image_classification_10| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Translate English to Estonian Pipeline author: John Snow Labs name: translate_en_et date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, et, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `et` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_et_xx_2.7.0_2.4_1609687288732.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_et_xx_2.7.0_2.4_1609687288732.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_et", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_et", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.et').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_et| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal ORG, PER, ROLE, DATE NER author: John Snow Labs name: legner_org_per_role_date date: 2023-01-01 tags: [en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true recommended: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an NER model trained on SEC 10K documents, aimed to extract the following entities: - ORG - PER - ROLE - DATE ## Predicted Entities `ORG`, `PER`, `ROLE`, `DATE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FINPIPE_ORG_PER_DATE_ROLES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_org_per_role_date_en_1.0.0_3.0_1672597265576.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_org_per_role_date_en_1.0.0_3.0_1672597265576.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained("legner_org_per_role_date", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter, ]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon"""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
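The `NerConverter` stage above merges the model's token-level B-/I- tags into the entity chunks shown in the results. Its behavior can be approximated in plain Python (a simplified sketch, ignoring character offsets, which the real annotator also tracks):

```python
def bio_to_chunks(tokens, tags):
    """Merge B-/I- tagged tokens into (chunk_text, label) pairs,
    roughly mimicking what NerConverter does after the NER model."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" tag or inconsistent I- tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Jeffrey", "Preston", "Bezos", "is", "an", "American", "entrepreneur"]
tags = ["B-PERSON", "I-PERSON", "I-PERSON", "O", "O", "O", "B-ROLE"]
print(bio_to_chunks(tokens, tags))
# -> [('Jeffrey Preston Bezos', 'PERSON'), ('entrepreneur', 'ROLE')]
```

This matches the chunk/entity pairs in the results table below the example.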
## Results ```bash +-----+---+---------------------+------+ |begin|end| chunk|entity| +-----+---+---------------------+------+ | 0| 20|Jeffrey Preston Bezos|PERSON| | 37| 48| entrepreneur| ROLE| | 51| 57| founder| ROLE| | 63| 65| CEO| ROLE| | 70| 75| Amazon| ORG| +-----+---+---------------------+------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_org_per_role_date| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.1 MB| ## References SEC 10-K filings with in-house annotations ## Benchmarking ```bash label tp fp fn prec rec f1 B-PERSON 254 20 56 0.9270073 0.81935483 0.86986303 I-ORG 1161 133 231 0.8972179 0.8340517 0.8644826 B-DATE 202 15 14 0.9308756 0.9351852 0.9330255 I-DATE 302 29 12 0.9123867 0.96178347 0.93643415 B-ROLE 219 21 47 0.9125 0.8233083 0.8656126 B-ORG 674 92 163 0.87989557 0.80525684 0.84092325 I-ROLE 260 26 68 0.90909094 0.79268295 0.8469055 I-PERSON 501 34 94 0.9364486 0.8420168 0.88672566 Macro-average 3573 370 685 0.91317785 0.851705 0.8813709 Micro-average 3573 370 685 0.9061628 0.83912635 0.8713572 ``` --- layout: model title: Explain Document Pipeline for Danish author: John Snow Labs name: explain_document_sm date: 2021-03-22 tags: [open_source, danish, explain_document_sm, pipeline, da] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: da edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_sm is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps. 
It performs most of the common text processing tasks on your DataFrame. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_da_3.0.0_3.0_1616428730097.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_sm_da_3.0.0_3.0_1616428730097.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_sm', lang = 'da') annotations = pipeline.fullAnnotate("Hej fra John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_sm", lang = "da") val result = pipeline.fullAnnotate("Hej fra John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hej fra John Snow Labs! "] result_df = nlu.load('da.explain').predict(text) result_df ```
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:-----------------------------|:----------------------------|:----------------------------------------|:----------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Hej fra John Snow Labs! '] | ['Hej fra John Snow Labs!'] | ['Hej', 'fra', 'John', 'Snow', 'Labs!'] | ['Hej', 'fra', 'John', 'Snow', 'Labs!'] | ['NOUN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.0306969992816448,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_sm| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|da| --- layout: model title: Fast Neural Machine Translation Model from English to Luganda author: John Snow Labs name: opus_mt_en_lg date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, lg, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `en` - target languages: `lg` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_lg_xx_2.7.0_2.4_1609170260045.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_lg_xx_2.7.0_2.4_1609170260045.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_lg", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_lg", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.lg').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_lg| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Part of Speech for Ukrainian author: John Snow Labs name: pos_ud_iu date: 2021-03-08 tags: [part_of_speech, open_source, ukrainian, pos_ud_iu, uk] task: Part of Speech Tagging language: uk edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - ADP - NOUN - ADJ - PROPN - VERB - PUNCT - CCONJ - ADV - PRON - PART - DET - SCONJ - NUM - X - AUX - INTJ - SYM {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_iu_uk_3.0.0_3.0_1615230349831.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_iu_uk_3.0.0_3.0_1615230349831.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_iu", "uk") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Привіт з Джона Снігової лабораторії! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_iu", "uk") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Привіт з Джона Снігової лабораторії! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Привіт з Джона Снігової лабораторії! "] token_df = nlu.load('uk.pos.ud_iu').predict(text) token_df ```
## Results ```bash token pos 0 Привіт NOUN 1 з ADP 2 Джона PROPN 3 Снігової ADJ 4 лабораторії NOUN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_iu| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|uk| --- layout: model title: Fast Neural Machine Translation Model from Sango to English author: John Snow Labs name: opus_mt_sg_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, sg, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `sg` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_sg_en_xx_2.7.0_2.4_1609167681098.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_sg_en_xx_2.7.0_2.4_1609167681098.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_sg_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_sg_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.sg.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_sg_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Multilingual XLMRobertaForTokenClassification Base Cased model (from edwardjross) author: John Snow Labs name: xlmroberta_ner_edwardjross_base_finetuned_panx_all date: 2022-08-13 tags: [xx, open_source, xlm_roberta, ner] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-all` is a Multilingual model originally trained by `edwardjross`. ## Predicted Entities `ORG`, `LOC`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_edwardjross_base_finetuned_panx_all_xx_4.1.0_3.0_1660428113477.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_edwardjross_base_finetuned_panx_all_xx_4.1.0_3.0_1660428113477.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_edwardjross_base_finetuned_panx_all","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_edwardjross_base_finetuned_panx_all","xx") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_edwardjross_base_finetuned_panx_all| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|862.6 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/edwardjross/xlm-roberta-base-finetuned-panx-all --- layout: model title: English image_classifier_vit_exper1_mesum5 ViTForImageClassification from sudo-s author: John Snow Labs name: image_classifier_vit_exper1_mesum5 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_exper1_mesum5` is an English model originally trained by sudo-s. 
## Predicted Entities `45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper1_mesum5_en_4.1.0_3.0_1660167810615.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper1_mesum5_en_4.1.0_3.0_1660167810615.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_exper1_mesum5", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_exper1_mesum5", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_exper1_mesum5| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.3 MB| --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from mrm8488) author: John Snow Labs name: roberta_qa_longformer_base_4096_spanish_finetuned_squad date: 2022-12-02 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `longformer-base-4096-spanish-finetuned-squad` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_longformer_base_4096_spanish_finetuned_squad_es_4.2.4_3.0_1669985184973.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_longformer_base_4096_spanish_finetuned_squad_es_4.2.4_3.0_1669985184973.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_longformer_base_4096_spanish_finetuned_squad","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_longformer_base_4096_spanish_finetuned_squad","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_longformer_base_4096_spanish_finetuned_squad| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|es| |Size:|473.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mrm8488/longformer-base-4096-spanish-finetuned-squad - https://es.wikipedia.org/ - https://creativecommons.org/licenses/by-sa/3.0/legalcode - https://es.wikinews.org/ - https://creativecommons.org/licenses/by/2.5/ - http://clic.ub.edu/corpus/en - https://creativecommons.org/licenses/by/4.0/legalcode - https://twitter.com/mrm8488 - https://www.narrativa.com/ --- layout: model title: Mapping Drugs With Their Corresponding Adverse Drug Events (ADE) author: John Snow Labs name: drug_ade_mapper date: 2022-08-23 tags: [en, chunkmapping, chunkmapper, drug, ade, licensed] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true recommended: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps drugs with their corresponding Adverse Drug Events. 
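Conceptually, a chunk mapper resolves each detected drug chunk against a lookup resource and returns the mapped relations. The sketch below is illustrative only (plain Python, not the actual Spark NLP implementation); the sample drug-to-ADE pairs are copied from the example output shown later in this card, and the `lookup_ade` helper is a hypothetical name, not part of the library API.

```python
# Illustrative sketch of what the mapper does conceptually: a lookup from a
# drug chunk to its mapped ADE relations. The real model resolves chunks
# against a FAERS-derived resource inside a Spark NLP pipeline.
drug_to_ade = {
    "zopiclone": ["Vomiting", "Malaise", "Drug interaction", "Asthenia", "Hyponatraemia"],
    "ambrisentan": ["Dyspnoea", "Therapy interrupted", "Death", "Dizziness", "Drug ineffective"],
}

def lookup_ade(chunk: str) -> list:
    """Return the mapped ADEs for a drug chunk (case-insensitive), or [] if unmapped."""
    return drug_to_ade.get(chunk.strip().lower(), [])

print(lookup_ade("Zopiclone"))
```

In the real pipeline, the first relation plays the role of the `mappings` result and the remainder appear under `all_relations`.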
## Predicted Entities `ADE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/drug_ade_mapper_en_4.0.2_3.0_1661250246683.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/drug_ade_mapper_en_4.0.2_3.0_1661250246683.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") # NER model to detect drug in the text ner = MedicalNerModel.pretrained('ner_posology_greedy', 'en', 'clinical/models') \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_chunk = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") chunkMapper = ChunkMapperModel.pretrained("drug_ade_mapper", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings")\ .setRels(["ADE"]) pipeline = Pipeline().setStages([document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_chunk, chunkMapper]) text = ["""The patient was prescribed 1000 mg fish oil and multivitamins. 
She was discharged on zopiclone and ambrisentan"""] data = spark.createDataFrame([text]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") // NER model to detect drug in the text val ner = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_chunk = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val chunkMapper = ChunkMapperModel.pretrained("drug_ade_mapper", "en", "clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("mappings") .setRels(Array("ADE")) val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_chunk, chunkMapper)) val data = Seq("The patient was prescribed 1000 mg fish oil and multivitamins. She was discharged on zopiclone and ambrisentan").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.drug_ade").predict("""The patient was prescribed 1000 mg fish oil and multivitamins. She was discharged on zopiclone and ambrisentan""") ```
## Results ```bash +----------------+------------+-------------------------------------------------------------------------------------------+ |ner_chunk |ade_mappings|all_relations | +----------------+------------+-------------------------------------------------------------------------------------------+ |1000 mg fish oil|Dizziness |Myocardial infarction:::Nausea | |multivitamins |Erythema |Acne:::Dry skin:::Skin burning sensation:::Inappropriate schedule of product administration| |zopiclone |Vomiting |Malaise:::Drug interaction:::Asthenia:::Hyponatraemia | |ambrisentan |Dyspnoea |Therapy interrupted:::Death:::Dizziness:::Drug ineffective | +----------------+------------+-------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|drug_ade_mapper| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_pos_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|7.9 MB| ## References Data from the FDA Adverse Event Reporting System (FAERS) for the years 2020, 2021 and 2022 were used as the source for this mapper model. https://fis.fda.gov/extensions/FPD-QDE-FAERS/FPD-QDE-FAERS.html --- layout: model title: English asr_Dansk_wav2vec21 TFWav2Vec2ForCTC from Siyam author: John Snow Labs name: asr_Dansk_wav2vec21 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Dansk_wav2vec21` is an English model originally trained by Siyam. 
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_Dansk_wav2vec21_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Dansk_wav2vec21_en_4.2.0_3.0_1664118556486.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Dansk_wav2vec21_en_4.2.0_3.0_1664118556486.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_Dansk_wav2vec21", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_Dansk_wav2vec21", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_Dansk_wav2vec21| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_el16_dl1 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-el16-dl1` is a English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el16_dl1_en_4.3.0_3.0_1675119566747.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el16_dl1_en_4.3.0_3.0_1675119566747.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_el16_dl1","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_el16_dl1","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_el16_dl1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|168.0 MB| ## References - https://huggingface.co/google/t5-efficient-small-el16-dl1 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: French CamemBert Embeddings (from HueyNemud) author: John Snow Labs name: camembert_embeddings_das22_10_camembert_pretrained date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `das22-10-camembert_pretrained` is a French model originally trained by `HueyNemud`. 
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_das22_10_camembert_pretrained","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_das22_10_camembert_pretrained","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_das22_10_camembert_pretrained| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|415.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/HueyNemud/das22-10-camembert_pretrained - https://doi.org/10.1007/978-3-031-06555-2_30 - https://github.com/soduco/paper-ner-bench-das22 --- layout: model title: Word2Vec Embeddings in Norwegian Nynorsk (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, nn, open_source] task: Embeddings language: nn edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_nn_3.4.1_3.0_1647448994639.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_nn_3.4.1_3.0_1647448994639.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","nn") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","nn") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("nn.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
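Once token vectors like the 300-dimensional ones produced above are pulled out of the `embeddings` column, comparing two words is a plain vector operation. A minimal, Spark-free sketch using only the standard library (the toy 3-d vectors stand in for the model's 300-d output):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy 3-d stand-ins for the 300-d vectors returned by w2v_cc_300d
v_love = [0.2, 0.8, 0.1]
v_like = [0.25, 0.75, 0.05]
print(round(cosine_similarity(v_love, v_like), 3))
```

In a real pipeline the vectors would come from the `result` DataFrame, e.g. by exploding `embeddings.embeddings` and collecting the float arrays.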
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|nn| |Size:|750.2 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Legal Exclusions Clause Binary Classifier author: John Snow Labs name: legclf_exclusions_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `exclusions` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. 
## Predicted Entities `other`, `exclusions` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_exclusions_clause_en_1.0.0_3.2_1660123511441.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_exclusions_clause_en_1.0.0_3.2_1660123511441.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_exclusions_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
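The paragraph splitting (by multiline) recommended in the description can be done before building the Spark DataFrame. A minimal sketch, assuming paragraphs are separated by blank lines; each resulting chunk would become one row of the `clause_text` column:

```python
import re

def split_paragraphs(document: str):
    """Split a legal document into paragraph-sized chunks on blank lines."""
    chunks = re.split(r"\n\s*\n", document)
    return [c.strip() for c in chunks if c.strip()]

contract = """Section 1. Exclusions.
The following are excluded from coverage.

Section 2. Notices.
All notices shall be in writing."""

clauses = split_paragraphs(contract)
# Each chunk can then feed spark.createDataFrame(...).toDF("clause_text")
for clause in clauses:
    print(clause.splitlines()[0])
```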
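After `model.transform(df)`, the predicted label for each provision sits inside the `category` annotation column. A minimal, Spark-free sketch of turning a flattened list of per-row labels into counts (the label values follow this card's Predicted Entities; the flattening itself would be done with a `select` on `category.result`):

```python
from collections import Counter

# Hypothetical flattened labels pulled from result.select("category.result")
labels = ["exclusions", "other", "other", "exclusions"]

counts = Counter(labels)
print(counts["exclusions"], counts["other"])  # 2 2
```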
## Results ```bash +-------+ | result| +-------+ |[exclusions]| |[other]| |[other]| |[exclusions]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_exclusions_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support exclusions 0.97 0.91 0.94 43 other 0.95 0.99 0.97 70 accuracy - - 0.96 113 macro-avg 0.96 0.95 0.95 113 weighted-avg 0.96 0.96 0.96 113 ``` --- layout: model title: Fon asr_fonxlsr TFWav2Vec2ForCTC from chrisjay author: John Snow Labs name: pipeline_asr_fonxlsr date: 2022-09-24 tags: [wav2vec2, fon, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: fon edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_fonxlsr` is a Fon model originally trained by chrisjay. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_fonxlsr_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_fonxlsr_fon_4.2.0_3.0_1664024872592.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_fonxlsr_fon_4.2.0_3.0_1664024872592.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_fonxlsr', lang = 'fon') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_fonxlsr", lang = "fon") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_fonxlsr| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|fon| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: OCR base for handwritten text author: John Snow Labs name: ocr_base_handwritten date: 2022-02-16 tags: [en, licensed] task: OCR Text Detection & Recognition language: en nav_key: models edition: Visual NLP 3.3.3 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description OCR base model for recognizing handwritten text, based on the TrOCR architecture. The TrOCR model was proposed in TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to perform optical character recognition (OCR). The abstract from the paper is the following: Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks. 
## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/ocr_base_handwritten_en_3.3.3_2.4_1645034046021.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/ocr_base_handwritten_en_3.3.3_2.4_1645034046021.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ocr = ImageToTextv2().pretrained("ocr_base_handwritten", "en", "clinical/ocr") ocr.setInputCols(["image"]) ocr.setOutputCol("text") result = ocr.transform(image_text_lines_df).collect() print(result[0].text) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ocr_base_handwritten| |Type:|ocr| |Compatibility:|Visual NLP 3.3.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|781.9 MB| --- layout: model title: Smaller BERT Embeddings (L-10_H-768_A-12) author: John Snow Labs name: small_bert_L10_768 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L10_768_en_2.6.0_2.4_1598345383155.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L10_768_en_2.6.0_2.4_1598345383155.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols("document") \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("small_bert_L10_768", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("small_bert_L10_768", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L10_768').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash token en_embed_bert_small_L10_768_embeddings I [-0.48520517349243164, -0.4840145409107208, 0.... love [-0.5073645114898682, -0.1760852038860321, -0.... NLP [0.20363765954971313, 0.5075660347938538, 0.25... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|small_bert_L10_768| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-768_A-12/1 --- layout: model title: English RobertaForQuestionAnswering (from AyushPJ) author: John Snow Labs name: roberta_qa_ai_club_inductions_21_nlp_roBERTa date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ai-club-inductions-21-nlp-roBERTa` is an English model originally trained by `AyushPJ`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ai_club_inductions_21_nlp_roBERTa_en_4.0.0_3.0_1655727513331.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ai_club_inductions_21_nlp_roBERTa_en_4.0.0_3.0_1655727513331.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_ai_club_inductions_21_nlp_roBERTa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_ai_club_inductions_21_nlp_roBERTa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.by_AyushPJ").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
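The nlu one-liner above packs the question and the context into a single string separated by `|||`. A small helper for building that input programmatically (the separator convention is taken from the example above; treat it as an assumption of this card's nlu snippets):

```python
def to_nlu_qa_input(question: str, context: str) -> str:
    """Join a question/context pair with the '|||' separator shown in the nlu examples."""
    return f"{question}|||{context}"

print(to_nlu_qa_input("What's my name?",
                      "My name is Clara and I live in Berkeley."))
```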
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_ai_club_inductions_21_nlp_roBERTa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|465.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AyushPJ/ai-club-inductions-21-nlp-roBERTa --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_4 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-64-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_4_en_4.0.0_3.0_1655733492517.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_4_en_4.0.0_3.0_1655733492517.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_64d_seed_4").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|419.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-64-finetuned-squad-seed-4 --- layout: model title: English T5ForConditionalGeneration Tiny Cased model (from google) author: John Snow Labs name: t5_efficient_tiny date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_en_4.3.0_3.0_1675123202788.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_en_4.3.0_3.0_1675123202788.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_tiny","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_tiny","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_tiny| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|46.1 MB| ## References - https://huggingface.co/google/t5-efficient-tiny - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Legal Publicity Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_publicity_bert date: 2023-03-05 tags: [en, legal, classification, clauses, publicity, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Publicity` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. 
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Publicity`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_publicity_bert_en_1.0.0_3.0_1678050025086.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_publicity_bert_en_1.0.0_3.0_1678050025086.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_publicity_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
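As the description notes, several binary clause classifiers can run over the same provision, each emitting either its clause name or `Other`/`other`. A minimal sketch of collapsing those per-model outputs into one clause-to-boolean map (the model outputs shown are hypothetical examples, not real predictions):

```python
def aggregate_clause_flags(predictions: dict) -> dict:
    """Map each clause classifier's label to True when the clause was detected.

    `predictions` maps a clause name (e.g. "Publicity") to the label that
    classifier produced for the provision; any label other than
    "other"/"Other" counts as a positive detection.
    """
    return {
        clause: label.lower() != "other"
        for clause, label in predictions.items()
    }

flags = aggregate_clause_flags({
    "Publicity": "Publicity",   # hypothetical output of legclf_publicity_bert
    "Exclusions": "other",      # hypothetical output of legclf_exclusions_clause
})
print(flags)  # {'Publicity': True, 'Exclusions': False}
```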
## Results ```bash +-------+ |result| +-------+ |[Publicity]| |[Other]| |[Other]| |[Publicity]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_publicity_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.3 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 1.00 0.96 0.98 56 Publicity 0.95 1.00 0.97 38 accuracy - - 0.98 94 macro-avg 0.97 0.98 0.98 94 weighted-avg 0.98 0.98 0.98 94 ``` --- layout: model title: French CamemBert Embeddings (from seyfullah) author: John Snow Labs name: camembert_embeddings_seyfullah_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `seyfullah`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_seyfullah_generic_model_fr_3.4.4_3.0_1653990295365.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_seyfullah_generic_model_fr_3.4.4_3.0_1653990295365.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_seyfullah_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_seyfullah_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_seyfullah_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/seyfullah/dummy-model --- layout: model title: German asr_wav2vec2_large_xlsr_53_German TFWav2Vec2ForCTC from MehdiHosseiniMoghadam author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_German date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_German` is a German model originally trained by MehdiHosseiniMoghadam. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_German_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_German_de_4.2.0_3.0_1664107209368.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_German_de_4.2.0_3.0_1664107209368.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_German", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_German", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
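Both snippets above assume an `audioDf` whose `audio_content` column holds raw audio samples as an array of floats. One way to produce such samples from a 16-bit PCM WAV file using only the standard library (a sketch for illustration; real pipelines typically use librosa or soundfile, and Wav2Vec2 models expect mono 16 kHz input):

```python
import math
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit PCM WAV file and return samples scaled to [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expects 16-bit PCM"
        frames = wf.readframes(wf.getnframes())
    pcm = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in pcm]

# Demo: write a short 16 kHz sine tone, then read it back
with wave.open("tone.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    pcm = [int(32767 * math.sin(2 * math.pi * 440 * t / 16000)) for t in range(160)]
    wf.writeframes(struct.pack("<160h", *pcm))

samples = wav_to_floats("tone.wav")
print(len(samples))  # 160
# The float list can then feed the DataFrame, e.g.:
# audioDf = spark.createDataFrame([(samples,)], ["audio_content"])
```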
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_German| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: General Oncology Pipeline author: John Snow Labs name: oncology_general_pipeline date: 2022-11-03 tags: [licensed, oncology, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Healthcare 4.1.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline includes Named-Entity Recognition, Assertion Status and Relation Extraction models to extract information from oncology texts. This pipeline extracts diagnoses, treatments, tests, anatomical references and demographic entities. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ONCOLOGY/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/oncology_general_pipeline_en_4.1.0_3.0_1667489644241.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/oncology_general_pipeline_en_4.1.0_3.0_1667489644241.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("oncology_general_pipeline", "en", "clinical/models") pipeline.fullAnnotate("The patient underwent a left mastectomy for a left breast cancer two months ago. The tumor is positive for ER and PR.")[0] ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("oncology_general_pipeline", "en", "clinical/models") val result = pipeline.fullAnnotate("""The patient underwent a left mastectomy for a left breast cancer two months ago. The tumor is positive for ER and PR.""")(0) ``` {:.nlu-block} ```python import nlu nlu.load("en.oncology_general.pipeline").predict("""The patient underwent a left mastectomy for a left breast cancer two months ago. The tumor is positive for ER and PR.""") ```
## Results ```bash ******************** ner_oncology_wip results ******************** | chunk | ner_label | |:---------------|:-----------------| | left | Direction | | mastectomy | Cancer_Surgery | | left | Direction | | breast cancer | Cancer_Dx | | two months ago | Relative_Date | | tumor | Tumor_Finding | | positive | Biomarker_Result | | ER | Biomarker | | PR | Biomarker | ******************** ner_oncology_diagnosis_wip results ******************** | chunk | ner_label | |:--------------|:--------------| | breast cancer | Cancer_Dx | | tumor | Tumor_Finding | ******************** ner_oncology_tnm_wip results ******************** | chunk | ner_label | |:--------------|:------------| | breast cancer | Cancer_Dx | | tumor | Tumor | ******************** ner_oncology_therapy_wip results ******************** | chunk | ner_label | |:-----------|:---------------| | mastectomy | Cancer_Surgery | ******************** ner_oncology_test_wip results ******************** | chunk | ner_label | |:---------|:-----------------| | positive | Biomarker_Result | | ER | Biomarker | | PR | Biomarker | ******************** assertion_oncology_wip results ******************** | chunk | ner_label | assertion | |:--------------|:---------------|:------------| | mastectomy | Cancer_Surgery | Past | | breast cancer | Cancer_Dx | Present | | tumor | Tumor_Finding | Present | | ER | Biomarker | Present | | PR | Biomarker | Present | ******************** re_oncology_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:--------------|:-----------------|:---------------|:--------------|:--------------| | mastectomy | Cancer_Surgery | two months ago | Relative_Date | is_related_to | | breast cancer | Cancer_Dx | two months ago | Relative_Date | is_related_to | | tumor | Tumor_Finding | ER | Biomarker | O | | tumor | Tumor_Finding | PR | Biomarker | O | | positive | Biomarker_Result | ER | Biomarker | is_related_to | | positive | Biomarker_Result | PR | Biomarker | 
is_related_to | ******************** re_oncology_granular_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:--------------|:-----------------|:---------------|:--------------|:--------------| | mastectomy | Cancer_Surgery | two months ago | Relative_Date | is_date_of | | breast cancer | Cancer_Dx | two months ago | Relative_Date | is_date_of | | tumor | Tumor_Finding | ER | Biomarker | O | | tumor | Tumor_Finding | PR | Biomarker | O | | positive | Biomarker_Result | ER | Biomarker | is_finding_of | | positive | Biomarker_Result | PR | Biomarker | is_finding_of | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|oncology_general_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP for Healthcare 4.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ChunkMergeModel - AssertionDLModel - PerceptronModel - DependencyParserModel - RelationExtractionModel - RelationExtractionModel --- layout: model title: Legal Settlement Agreement And Mutual Release Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_settlement_agreement_and_mutual_release_bert date: 2023-01-26 tags: [en, legal, classification, settlement, agreement, mutual, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_settlement_agreement_and_mutual_release_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the 
class `settlement-agreement-and-mutual-release` or not (Binary Classification). Compared with the Longformer model, this model is lighter and faster at inference time. ## Predicted Entities `settlement-agreement-and-mutual-release`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_settlement_agreement_and_mutual_release_bert_en_1.0.0_3.0_1674734834409.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_settlement_agreement_and_mutual_release_bert_en_1.0.0_3.0_1674734834409.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_settlement_agreement_and_mutual_release_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
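The benchmarking section of this card reports macro and weighted averages over the two classes. As a quick illustration of how a support-weighted F1 is aggregated from per-class scores (the values below are taken from this card's benchmarking table):

```python
def weighted_f1(per_class):
    """per_class: list of (f1, support) pairs -> support-weighted mean F1."""
    total = sum(s for _, s in per_class)
    return sum(f * s for f, s in per_class) / total

# other: F1 0.99 over 61 docs; settlement class: F1 0.99 over 39 docs
print(round(weighted_f1([(0.99, 61), (0.99, 39)]), 2))  # 0.99
```

The macro average, by contrast, would weight both classes equally regardless of support.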
## Results ```bash +-------+ |result| +-------+ |[settlement-agreement-and-mutual-release]| |[other]| |[other]| |[settlement-agreement-and-mutual-release]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_settlement_agreement_and_mutual_release_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.98 1.00 0.99 61 settlement-agreement-and-mutual-release 1.00 0.97 0.99 39 accuracy - - 0.99 100 macro-avg 0.99 0.99 0.99 100 weighted-avg 0.99 0.99 0.99 100 ``` --- layout: model title: English BertForQuestionAnswering model (from peterhsu) author: John Snow Labs name: bert_qa_peterhsu_bert_finetuned_squad date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `peterhsu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_peterhsu_bert_finetuned_squad_en_4.0.0_3.0_1654535755956.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_peterhsu_bert_finetuned_squad_en_4.0.0_3.0_1654535755956.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_peterhsu_bert_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_peterhsu_bert_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.v2.by_peterhsu").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
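The NLU one-liner above packs the question and the context into a single string separated by `|||`. A small sketch of that convention in plain Python (the helper name is hypothetical, not part of the NLU API):

```python
def split_qa(payload, sep="|||"):
    """Split an NLU-style 'question|||context' string into its two parts."""
    question, _, context = payload.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
print(c)  # My name is Clara and I live in Berkeley.
```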
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_peterhsu_bert_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/peterhsu/bert-finetuned-squad --- layout: model title: Malay T5ForConditionalGeneration Base Cased model (from mesolitica) author: John Snow Labs name: t5_finetune_translation_base_standard_bahasa_cased date: 2023-01-30 tags: [ms, open_source, t5, tensorflow] task: Text Generation language: ms edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `finetune-translation-t5-base-standard-bahasa-cased` is a Malay model originally trained by `mesolitica`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_finetune_translation_base_standard_bahasa_cased_ms_4.3.0_3.0_1675102146994.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_finetune_translation_base_standard_bahasa_cased_ms_4.3.0_3.0_1675102146994.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_finetune_translation_base_standard_bahasa_cased","ms") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_finetune_translation_base_standard_bahasa_cased","ms") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_finetune_translation_base_standard_bahasa_cased| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ms| |Size:|928.4 MB| ## References - https://huggingface.co/mesolitica/finetune-translation-t5-base-standard-bahasa-cased - https://github.com/huseinzol05/malay-dataset/tree/master/translation/laser - https://github.com/huseinzol05/malaya/tree/master/session/translation/hf-t5 --- layout: model title: Tamil BertForQuestionAnswering model (from SauravMaheshkar) author: John Snow Labs name: bert_qa_bert_base_multilingual_cased_finetuned_chaii date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: ta edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased-finetuned-chaii` is a Tamil model originally trained by `SauravMaheshkar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_finetuned_chaii_ta_4.0.0_3.0_1654180005534.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_finetuned_chaii_ta_4.0.0_3.0_1654180005534.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_multilingual_cased_finetuned_chaii","ta") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_multilingual_cased_finetuned_chaii","ta") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("ta.answer_question.chaii.bert.multilingual_base_cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_multilingual_cased_finetuned_chaii| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ta| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/SauravMaheshkar/bert-base-multilingual-cased-finetuned-chaii --- layout: model title: Arabic BertForMaskedLM Base Cased model (from asafaya) author: John Snow Labs name: bert_embeddings_base_arabic date: 2022-12-02 tags: [ar, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ar edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-arabic` is an Arabic model originally trained by `asafaya`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabic_ar_4.2.4_3.0_1670015917386.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabic_ar_4.2.4_3.0_1670015917386.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabic","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabic","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
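Downstream tasks typically compare the resulting token or sentence vectors with cosine similarity. A self-contained sketch of that comparison (illustrative only, not a Spark NLP API; the two toy vectors stand in for real embedding rows):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

print(cosine([1.0, 0.0], [1.0, 0.0]))            # 1.0 (identical direction)
print(round(cosine([1.0, 0.0], [0.0, 1.0]), 1))  # 0.0 (orthogonal)
```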
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_arabic| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|414.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/asafaya/bert-base-arabic - https://traces1.inria.fr/oscar/ - http://commoncrawl.org/ - https://dumps.wikimedia.org/backup-index.html - https://github.com/google-research/bert - https://www.tensorflow.org/tfrc - https://github.com/alisafaya/Arabic-BERT --- layout: model title: English BertForQuestionAnswering model (from AnonymousSub) author: John Snow Labs name: bert_qa_fpdm_bert_FT_new_newsqa date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_bert_FT_new_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_fpdm_bert_FT_new_newsqa_en_4.0.0_3.0_1654187801317.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_fpdm_bert_FT_new_newsqa_en_4.0.0_3.0_1654187801317.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_fpdm_bert_FT_new_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_fpdm_bert_FT_new_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.bert.fpdm_ft_new.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_fpdm_bert_FT_new_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/fpdm_bert_FT_new_newsqa --- layout: model title: English DistilBertForQuestionAnswering model (from mcurmei) Flat author: John Snow Labs name: distilbert_qa_flat_N_max date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `flat_N_max` is an English model originally trained by `mcurmei`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_flat_N_max_en_4.0.0_3.0_1654728355862.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_flat_N_max_en_4.0.0_3.0_1654728355862.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_flat_N_max","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_flat_N_max","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.flat_n_max.by_mcurmei").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_flat_N_max| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mcurmei/flat_N_max --- layout: model title: Part of Speech for Romanian author: John Snow Labs name: pos_ud_rrt date: 2021-03-08 tags: [part_of_speech, open_source, romanian, pos_ud_rrt, ro] task: Part of Speech Tagging language: ro edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - DET - PROPN - PRON - VERB - NOUN - ADP - NUM - ADV - PUNCT - CCONJ - ADJ - PART - AUX - SCONJ - INTJ - SYM - X {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_rrt_ro_3.0.0_3.0_1615230320555.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_rrt_ro_3.0.0_3.0_1615230320555.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_rrt", "ro") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Bună ziua de la John Snow Labs! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_rrt", "ro") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Bună ziua de la John Snow Labs! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Bună ziua de la John Snow Labs! "] token_df = nlu.load('ro.pos.ud_rrt').predict(text) token_df ```
## Results ```bash token pos 0 Bună ADJ 1 ziua NOUN 2 de ADP 3 la ADP 4 John PROPN 5 Snow PROPN 6 Labs PROPN 7 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_rrt| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|ro| --- layout: model title: Wolof asr_av2vec2_xls_r_300m_wolof_lm TFWav2Vec2ForCTC from abdouaziiz author: John Snow Labs name: pipeline_asr_av2vec2_xls_r_300m_wolof_lm date: 2022-09-24 tags: [wav2vec2, wo, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: wo edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_av2vec2_xls_r_300m_wolof_lm` is a Wolof model originally trained by abdouaziiz. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_av2vec2_xls_r_300m_wolof_lm_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_av2vec2_xls_r_300m_wolof_lm_wo_4.2.0_3.0_1664038255067.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_av2vec2_xls_r_300m_wolof_lm_wo_4.2.0_3.0_1664038255067.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_av2vec2_xls_r_300m_wolof_lm', lang = 'wo') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_av2vec2_xls_r_300m_wolof_lm", lang = "wo") val annotations = pipeline.transform(audioDF) ```
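The `audioDF` passed to the pipeline is expected to carry the raw waveform as an array of floats. As an illustration of what that preprocessing looks like outside Spark, here is a stdlib sketch that decodes 16-bit PCM bytes into normalized floats (assuming mono, little-endian samples; reading the bytes from a WAV file is left out):

```python
import struct

def pcm16_to_floats(raw_bytes):
    """Decode little-endian 16-bit PCM samples to floats in [-1, 1)."""
    n = len(raw_bytes) // 2
    samples = struct.unpack("<%dh" % n, raw_bytes[: n * 2])
    return [s / 32768.0 for s in samples]

# silence, half amplitude, most negative sample
print(pcm16_to_floats(struct.pack("<3h", 0, 16384, -32768)))  # [0.0, 0.5, -1.0]
```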
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_av2vec2_xls_r_300m_wolof_lm| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|wo| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Google's Tapas Table Understanding (Small, WTQ) author: John Snow Labs name: table_qa_tapas_small_finetuned_wtq date: 2022-09-30 tags: [en, table, qa, question, answering, open_source] task: Table Question Answering language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: TapasForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Zero-shot Table Understanding model that lets you carry out Question Answering on Spark DataFrames. If your data is stored in a table format such as CSV, load it into a DataFrame with Spark first. Size of this model: Small. Has aggregation operations: True. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_small_finetuned_wtq_en_4.2.0_3.0_1664530480673.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_small_finetuned_wtq_en_4.2.0_3.0_1664530480673.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python json_data = """ { "header": ["name", "money", "age"], "rows": [ ["Donald Trump", "$100,000,000", "75"], ["Elon Musk", "$20,000,000,000,000", "55"] ] } """ queries = [ "Who earns less than 200,000,000?", "Who earns 100,000,000?", "How much money has Donald Trump?", "How old are they?", ] data = spark.createDataFrame([ [json_data, " ".join(queries)] ]).toDF("table_json", "questions") document_assembler = MultiDocumentAssembler() \ .setInputCols("table_json", "questions") \ .setOutputCols("document_table", "document_questions") sentence_detector = SentenceDetector() \ .setInputCols(["document_questions"]) \ .setOutputCol("questions") table_assembler = TableAssembler()\ .setInputCols(["document_table"])\ .setOutputCol("table") tapas = TapasForQuestionAnswering\ .pretrained("table_qa_tapas_small_finetuned_wtq","en")\ .setInputCols(["questions", "table"])\ .setOutputCol("answers") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, table_assembler, tapas ]) model = pipeline.fit(data) model\ .transform(data)\ .selectExpr("explode(answers) AS answer")\ .select("answer")\ .show(truncate=False) ```
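The description notes that a table stored as CSV should be loaded before use. As a minimal sketch (the helper name and sample data are illustrative, not part of Spark NLP), this converts raw CSV text into the `{"header": ..., "rows": ...}` JSON layout that the `table_json` column in the example above uses:

```python
import csv
import io
import json

def csv_to_tapas_json(csv_text: str) -> str:
    """Convert CSV text into the {"header": [...], "rows": [...]} JSON
    structure used for the table_json column (all cells kept as strings)."""
    reader = csv.reader(io.StringIO(csv_text))
    rows = [[cell.strip() for cell in row] for row in reader if row]
    return json.dumps({"header": rows[0], "rows": rows[1:]})

# Quoted fields keep embedded commas intact.
csv_text = 'name,money,age\nDonald Trump,"$100,000,000",75\n'
json_data = csv_to_tapas_json(csv_text)
```

The resulting `json_data` string can then be placed into the DataFrame exactly as in the example above.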
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |answer | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} | |{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|table_qa_tapas_small_finetuned_wtq| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|110.1 MB| |Case sensitive:|false| ## References https://www.microsoft.com/en-us/download/details.aspx?id=54253 https://github.com/ppasupat/WikiTableQuestions --- layout: model title: Pipeline to Detect Clinical Entities (Slim version, BertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_jsl_slim_pipeline date: 2023-03-20 tags: [ner, bertfortokenclassification, en, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" ---
## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_jsl_slim](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_jsl_slim_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_slim_pipeline_en_4.3.0_3.2_1679308050229.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_slim_pipeline_en_4.3.0_3.2_1679308050229.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_jsl_slim_pipeline", "en", "clinical/models") text = '''HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_jsl_slim_pipeline", "en", "clinical/models") val text = "HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.jsl_slim.pipeline").predict("""HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer.""") ```
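`fullAnnotate()` returns annotation objects whose `result`, `begin`, `end`, and `metadata` fields carry the values shown in the Results table below. A minimal sketch of flattening them into plain rows — the `Annotation` namedtuple here is only a stand-in mirroring those field names, and the `entity`/`confidence` metadata keys follow the NER-converter convention:

```python
from collections import namedtuple

# Stand-in with the same field names as Spark NLP's Annotation objects.
Annotation = namedtuple("Annotation", ["result", "begin", "end", "metadata"])

def chunks_to_rows(chunks):
    """Turn ner_chunk annotations into (chunk, begin, end, label, confidence)
    rows, like the Results table for this pipeline."""
    return [(c.result, c.begin, c.end,
             c.metadata.get("entity"), c.metadata.get("confidence"))
            for c in chunks]

demo = [Annotation("30-year-old", 9, 19, {"entity": "Age", "confidence": "0.98"})]
rows = chunks_to_rows(demo)
```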
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-----------------|--------:|------:|:-------------|-------------:| | 0 | HISTORY: | 0 | 7 | Header | 0.994786 | | 1 | 30-year-old | 9 | 19 | Age | 0.982408 | | 2 | female | 21 | 26 | Demographics | 0.99981 | | 3 | mammography | 59 | 69 | Test | 0.993892 | | 4 | soft tissue lump | 86 | 101 | Symptom | 0.999448 | | 5 | shoulder | 146 | 153 | Body_Part | 0.99978 | | 6 | breast cancer | 192 | 204 | Oncological | 0.999466 | | 7 | her mother | 213 | 222 | Demographics | 0.997765 | | 8 | age 58 | 227 | 232 | Age | 0.997636 | | 9 | breast cancer | 270 | 282 | Oncological | 0.999452 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_jsl_slim_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|405.0 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Legal Contribution Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_contribution_agreement_bert date: 2022-11-24 tags: [en, legal, classification, agreement, contribution, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_contribution_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `contribution-agreement` or not (Binary Classification). Compared with the Longformer model, this model is lighter and faster at inference. 
## Predicted Entities `contribution-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_contribution_agreement_bert_en_1.0.0_3.0_1669308987185.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_contribution_agreement_bert_en_1.0.0_3.0_1669308987185.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_contribution_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[contribution-agreement]| |[other]| |[other]| |[contribution-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_contribution_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support contribution-agreement 0.91 0.85 0.88 34 other 0.94 0.96 0.95 82 accuracy - - 0.93 116 macro-avg 0.92 0.91 0.92 116 weighted-avg 0.93 0.93 0.93 116 ``` --- layout: model title: Legal Principal Underwriting Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_principal_underwriting_agreement_bert date: 2023-01-26 tags: [en, legal, classification, underwriting, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_principal_underwriting_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `principal-underwriting-agreement` or not (Binary Classification). Compared with the Longformer model, this model is lighter and faster at inference. 
## Predicted Entities `principal-underwriting-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_principal_underwriting_agreement_bert_en_1.0.0_3.0_1674734655700.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_principal_underwriting_agreement_bert_en_1.0.0_3.0_1674734655700.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_principal_underwriting_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[principal-underwriting-agreement]| |[other]| |[other]| |[principal-underwriting-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_principal_underwriting_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.3 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.99 0.98 0.99 105 principal-underwriting-agreement 0.96 0.98 0.97 50 accuracy - - 0.98 155 macro-avg 0.98 0.98 0.98 155 weighted-avg 0.98 0.98 0.98 155 ``` --- layout: model title: Pipeline to Detect Clinical Conditions (ner_eu_clinical_condition - es) author: John Snow Labs name: ner_eu_clinical_condition_pipeline date: 2023-03-08 tags: [es, clinical, licensed, ner, clinical_condition] task: Named Entity Recognition language: es edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_eu_clinical_condition](https://nlp.johnsnowlabs.com/2023/02/06/ner_eu_clinical_condition_es.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_pipeline_es_4.3.0_3.2_1678260469189.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_pipeline_es_4.3.0_3.2_1678260469189.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_eu_clinical_condition_pipeline", "es", "clinical/models") text = " La exploración abdominal revela una cicatriz de laparotomía media infraumbilical, la presencia de ruidos disminuidos, y dolor a la palpación de manera difusa sin claros signos de irritación peritoneal. No existen hernias inguinales o crurales. " result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_eu_clinical_condition_pipeline", "es", "clinical/models") val text = " La exploración abdominal revela una cicatriz de laparotomía media infraumbilical, la presencia de ruidos disminuidos, y dolor a la palpación de manera difusa sin claros signos de irritación peritoneal. No existen hernias inguinales o crurales. " val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | chunks | begin | end | entities | confidence | |---:|:---------------------|--------:|------:|:-------------------|-------------:| | 0 | cicatriz | 37 | 44 | clinical_condition | 0.9883 | | 1 | dolor a la palpación | 121 | 140 | clinical_condition | 0.87025 | | 2 | signos | 170 | 175 | clinical_condition | 0.9862 | | 3 | irritación | 180 | 189 | clinical_condition | 0.9975 | | 4 | hernias inguinales | 214 | 231 | clinical_condition | 0.7543 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_eu_clinical_condition_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|es| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English T5ForConditionalGeneration Base Cased model (from Tejas21) author: John Snow Labs name: t5_totto_base_bert_score_20k_steps date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Totto_t5_base_BERT_Score_20k_steps` is an English model originally trained by `Tejas21`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_totto_base_bert_score_20k_steps_en_4.3.0_3.0_1675099761482.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_totto_base_bert_score_20k_steps_en_4.3.0_3.0_1675099761482.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_totto_base_bert_score_20k_steps","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_totto_base_bert_score_20k_steps","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_totto_base_bert_score_20k_steps| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|925.2 MB| ## References - https://huggingface.co/Tejas21/Totto_t5_base_BERT_Score_20k_steps - https://github.com/google-research-datasets/ToTTo - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://github.com/google-research/language/tree/master/language/totto - https://github.com/Tiiiger/bert_score --- layout: model title: Hindi BertForQuestionAnswering model (from vanichandna) author: John Snow Labs name: bert_qa_muril_finetuned_squadv1 date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: hi edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `muril-finetuned-squadv1` is a Hindi model originally trained by `vanichandna`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_muril_finetuned_squadv1_hi_4.0.0_3.0_1654188665964.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_muril_finetuned_squadv1_hi_4.0.0_3.0_1654188665964.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_muril_finetuned_squadv1","hi") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_muril_finetuned_squadv1","hi") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("hi.answer_question.squad.bert.v1.by_vanichandna").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_muril_finetuned_squadv1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|hi| |Size:|891.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/vanichandna/muril-finetuned-squadv1 --- layout: model title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_izzy_lazerson TFWav2Vec2ForCTC from izzy-lazerson author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_izzy_lazerson date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_izzy_lazerson` is an English model originally trained by izzy-lazerson. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_izzy_lazerson_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_izzy_lazerson_en_4.2.0_3.0_1664018677659.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_izzy_lazerson_en_4.2.0_3.0_1664018677659.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_izzy_lazerson', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_izzy_lazerson", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_izzy_lazerson| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Recognize Entities DL Pipeline for Swedish - Medium author: John Snow Labs name: entity_recognizer_md date: 2021-03-22 tags: [open_source, swedish, entity_recognizer_md, pipeline, sv] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: sv edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_md is a pretrained pipeline that performs basic text processing steps, covering most of the common text processing tasks on your DataFrame. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_sv_3.0.0_3.0_1616452368340.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_sv_3.0.0_3.0_1616452368340.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('entity_recognizer_md', lang = 'sv') annotations = pipeline.fullAnnotate("Hej från John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("entity_recognizer_md", lang = "sv") val result = pipeline.fullAnnotate("Hej från John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hej från John Snow Labs! "] result_df = nlu.load('sv.ner.md').predict(text) result_df ```
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:------------------------------|:-----------------------------|:-----------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Hej från John Snow Labs! '] | ['Hej från John Snow Labs!'] | ['Hej', 'från', 'John', 'Snow', 'Labs!'] | [[0.4006600081920624,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|sv| --- layout: model title: English asr_wav2vec2_base_timit_demo_colab971 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab971 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab971` is an English model originally trained by hassnain. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_base_timit_demo_colab971_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab971_en_4.2.0_3.0_1664043058086.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab971_en_4.2.0_3.0_1664043058086.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab971", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab971", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
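The snippets above assume an existing `audioDf` whose `audio_content` column holds the raw audio samples as floats. A minimal stdlib sketch of producing that float array from a 16-bit mono WAV file (the file name is illustrative, and the final Spark step is left commented since it needs a running session; wav2vec2 models generally expect 16 kHz audio):

```python
import math
import struct
import wave

def wav_to_floats(path: str) -> list:
    """Read a 16-bit mono WAV file and normalise samples to [-1.0, 1.0],
    the raw-float layout expected in the audio_content column."""
    with wave.open(path, "rb") as wav:
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Self-contained demo: write a short 16 kHz sine tone, then decode it.
with wave.open("tone.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(struct.pack("<1600h", *[
        int(32767 * 0.5 * math.sin(2 * math.pi * 440 * t / 16000))
        for t in range(1600)]))

floats = wav_to_floats("tone.wav")
# audioDf = spark.createDataFrame([[floats]], ["audio_content"])  # Spark step
```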
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab971| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from MMVos) author: John Snow Labs name: distilbert_qa_mmvos_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `MMVos`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_mmvos_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768686035.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_mmvos_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768686035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_mmvos_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_mmvos_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_mmvos_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/MMVos/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal Modifications Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_modifications_bert date: 2023-03-05 tags: [en, legal, classification, clauses, modifications, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Modifications` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. 
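The first splitting technique listed above, paragraph splitting by multiline, can be sketched in a few lines of plain Python (the sample provisions are invented for illustration):

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a legal document into provisions on blank lines, i.e.
    'paragraph splitting by multiline'."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("1. MODIFICATIONS. This Agreement may be amended only in writing...\n"
       "\n"
       "2. NOTICES. All notices shall be delivered to the addresses below...")
paragraphs = split_paragraphs(doc)
```

Each resulting paragraph can then be fed to the classifier as a separate row of the `text` column.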
Take into consideration that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Modifications`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_modifications_bert_en_1.0.0_3.0_1678050613956.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_modifications_bert_en_1.0.0_3.0_1678050613956.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_modifications_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +---------------+ |result| +---------------+ |[Modifications]| |[Other]| |[Other]| |[Modifications]| +---------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_modifications_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Modifications 0.83 0.89 0.86 27 Other 0.93 0.88 0.90 43 accuracy - - 0.89 70 macro-avg 0.88 0.89 0.88 70 weighted-avg 0.89 0.89 0.89 70 ``` --- layout: model title: Translate Italic languages to English Pipeline author: John Snow Labs name: translate_itc_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, itc, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `itc` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_itc_en_xx_2.7.0_2.4_1609699157847.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_itc_en_xx_2.7.0_2.4_1609699157847.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_itc_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_itc_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.itc.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_itc_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering model (from Sourabh714) author: John Snow Labs name: distilbert_qa_Sourabh714_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Sourabh714`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Sourabh714_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724676238.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Sourabh714_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724676238.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Sourabh714_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Sourabh714_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Sourabh714").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
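The NLU one-liner above packs the question and its context into a single string separated by `|||`. A minimal plain-Python sketch of that packing convention (the helper name `pack_qa` is ours, purely illustrative, not part of the NLU API):

```python
def pack_qa(question: str, context: str) -> str:
    # Join question and context with the "|||" separator shown in the
    # nlu.load(...).predict(...) snippet above (helper name is illustrative).
    return f"{question}|||{context}"

packed = pack_qa("What is my name?", "My name is Clara and I live in Berkeley.")
# Split at the first separator to recover the two parts.
question, context = packed.split("|||", 1)
```

Splitting on the first `|||` only (`maxsplit=1`) keeps the context intact even if it happened to contain the separator itself.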
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_Sourabh714_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Sourabh714/distilbert-base-uncased-finetuned-squad --- layout: model title: Pipeline to Mapping ICD10-CM Codes with Their Corresponding SNOMED Codes author: John Snow Labs name: icd10cm_snomed_mapping date: 2023-06-13 tags: [en, licensed, icd10cm, snomed, pipeline, chunk_mapping] task: Chunk Mapping language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the `icd10cm_snomed_mapper` model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10cm_snomed_mapping_en_4.4.4_3.2_1686665531228.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10cm_snomed_mapping_en_4.4.4_3.2_1686665531228.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("icd10cm_snomed_mapping", "en", "clinical/models") result = pipeline.fullAnnotate("R079 N4289 M62830") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("icd10cm_snomed_mapping", "en", "clinical/models") val result = pipeline.fullAnnotate("R079 N4289 M62830") ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.icd10cm_to_snomed.pipe").predict("""Put your text here.""") ```
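The pipeline returns one SNOMED code per input ICD-10-CM code, in order (the values below are copied from this card's sample input and results table). A hypothetical post-processing sketch that pairs them up:

```python
# Codes taken from this card's sample input and results table.
icd10cm_codes = "R079 N4289 M62830".split()
snomed_codes = ["161972006", "22035000", "16410651000119105"]

# Pair each ICD-10-CM code with the SNOMED code at the same position.
icd_to_snomed = dict(zip(icd10cm_codes, snomed_codes))
```

In practice you would pull both code lists out of the `fullAnnotate` result rather than hard-coding them; this only illustrates the positional correspondence shown in the results.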
## Results ```bash Results | | icd10cm_code | snomed_code | |---:|:----------------------|:-----------------------------------------| | 0 | R079 | N4289 | M62830 | 161972006 | 22035000 | 16410651000119105 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd10cm_snomed_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.1 MB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel --- layout: model title: Javanese BertForMaskedLM Small Cased model (from w11wo) author: John Snow Labs name: bert_embeddings_javanese_small date: 2022-12-06 tags: [jv, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: jv edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `javanese-bert-small` is a Javanese model originally trained by `w11wo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_javanese_small_jv_4.2.4_3.0_1670326727249.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_javanese_small_jv_4.2.4_3.0_1670326727249.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_javanese_small","jv") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_javanese_small","jv") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_javanese_small| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|jv| |Size:|410.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/w11wo/javanese-bert-small - https://arxiv.org/abs/1810.04805 - https://github.com/sgugger - https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb - https://w11wo.github.io/ --- layout: model title: Latin RobertaForTokenClassification Large Cased model (from tner) author: John Snow Labs name: roberta_token_classifier_large_ontonotes5 date: 2023-03-01 tags: [la, open_source, roberta, token_classification, ner, tensorflow] task: Named Entity Recognition language: la edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-ontonotes5` is a Latin model originally trained by `tner`. ## Predicted Entities `NORP`, `FAC`, `QUANTITY`, `LOC`, `EVENT`, `CARDINAL`, `LANGUAGE`, `GPE`, `ORG`, `TIME`, `PERSON`, `WORK_OF_ART`, `DATE`, `PRODUCT`, `PERCENT`, `LAW`, `ORDINAL`, `MONEY` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_large_ontonotes5_la_4.3.0_3.0_1677703467254.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_large_ontonotes5_la_4.3.0_3.0_1677703467254.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = RobertaForTokenClassification.pretrained("roberta_token_classifier_large_ontonotes5","la") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = RobertaForTokenClassification.pretrained("roberta_token_classifier_large_ontonotes5","la") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_token_classifier_large_ontonotes5| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|la| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/tner/roberta-large-ontonotes5 - https://github.com/asahi417/tner - https://aclanthology.org/2021.eacl-demos.7/ - https://paperswithcode.com/sota?task=Token+Classification&dataset=tner%2Fontonotes5 --- layout: model title: Part of Speech for Yoruba author: John Snow Labs name: pos_ud_ytb date: 2020-07-29 23:34:00 +0800 task: Part of Speech Tagging language: yo edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [pos, yo] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_ytb_yo_2.5.5_2.4_1596054392981.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_ytb_yo_2.5.5_2.4_1596054392981.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_ytb", "yo") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Yato si jijẹ ọba ariwa, John Snow jẹ oṣoogun ara Gẹẹsi kan ati adari ninu idagbasoke anaesthesia ati imototo ilera.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_ytb", "yo") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Yato si jijẹ ọba ariwa, John Snow jẹ oṣoogun ara Gẹẹsi kan ati adari ninu idagbasoke anaesthesia ati imototo ilera.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Yato si jijẹ ọba ariwa, John Snow jẹ oṣoogun ara Gẹẹsi kan ati adari ninu idagbasoke anaesthesia ati imototo ilera."""] pos_df = nlu.load('yo.pos').predict(text) pos_df ```
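Each POS annotation carries character offsets, the tag, and the original word in its metadata (see the Results section). A plain-Python stand-in showing how to pull (word, tag) pairs out of such rows — the dict layout below mirrors this card's sample output, it is not a Spark NLP API:

```python
# First two annotations copied from this card's sample output
# (begin/end are character offsets into the input sentence).
annotations = [
    {"begin": 0, "end": 3, "result": "NOUN", "metadata": {"word": "Yato"}},
    {"begin": 5, "end": 6, "result": "VERB", "metadata": {"word": "si"}},
]

# Recover (word, part-of-speech) pairs from the annotation metadata.
tagged = [(a["metadata"]["word"], a["result"]) for a in annotations]
```

With a real pipeline you would read the same fields from the `Row` objects returned by `fullAnnotate`.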
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=3, result='NOUN', metadata={'word': 'Yato'}), Row(annotatorType='pos', begin=5, end=6, result='VERB', metadata={'word': 'si'}), Row(annotatorType='pos', begin=8, end=11, result='VERB', metadata={'word': 'jijẹ'}), Row(annotatorType='pos', begin=13, end=15, result='NOUN', metadata={'word': 'ọba'}), Row(annotatorType='pos', begin=17, end=21, result='NOUN', metadata={'word': 'ariwa'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_ytb| |Type:|pos| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|yo| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Extract clinical problems (Voice of the Patients) author: John Snow Labs name: ner_vop_problem_reduced_wip date: 2023-05-19 tags: [licensed, clinical, en, ner, vop, patient, problem] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts clinical problems from documents written in the patient’s own words. The taxonomy is reduced (one label for all clinical problems). Note: ‘wip’ suffix indicates that the model development is work-in-progress and will be finalised, and the model performance will be improved in the upcoming releases. 
## Predicted Entities `Problem`, `HealthStatus`, `Modifier` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_reduced_wip_en_4.4.2_3.0_1684512630461.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_reduced_wip_en_4.4.2_3.0_1684512630461.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_problem_reduced_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_problem_reduced_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:---------------------|:------------| | pain | Problem | | fatigue | Problem | | rheumatoid arthritis | Problem | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_problem_reduced_wip| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.9 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
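The per-label precision and recall figures in the Benchmarking section below follow directly from the tp/fp/fn counts in the same table; a quick sanity check for the `Problem` label:

```python
# tp/fp/fn counts for the Problem label, from the benchmarking table below.
tp, fp, fn = 6126, 983, 1015

precision = tp / (tp + fp)   # correct predictions / all predicted entities
recall = tp / (tp + fn)      # correct predictions / all gold entities
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.86 0.86 0.86
```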
## Benchmarking ```bash label tp fp fn total precision recall f1 Problem 6126 983 1015 7141 0.86 0.86 0.86 HealthStatus 86 37 19 105 0.70 0.82 0.75 Modifier 755 306 232 987 0.71 0.76 0.74 macro_avg 6967 1326 1266 8233 0.76 0.81 0.78 micro_avg 6967 1326 1266 8233 0.84 0.85 0.84 ``` --- layout: model title: Legal Remedies Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_remedies_bert date: 2023-03-05 tags: [en, legal, classification, clauses, remedies, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from the US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Remedies` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings used by this model allow up to 512 tokens. 
If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Remedies`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_remedies_bert_en_1.0.0_3.0_1678049890088.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_remedies_bert_en_1.0.0_3.0_1678049890088.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_remedies_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +----------+ |result| +----------+ |[Remedies]| |[Other]| |[Other]| |[Remedies]| +----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_remedies_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.92 0.92 0.92 114 Remedies 0.88 0.88 0.88 78 accuracy - - 0.91 192 macro-avg 0.90 0.90 0.90 192 weighted-avg 0.91 0.91 0.91 192 ``` --- layout: model title: Clinical Findings to UMLS Code Pipeline author: John Snow Labs name: umls_clinical_findings_resolver_pipeline date: 2023-04-11 tags: [licensed, clinical, en, umls, pipeline] task: Chunk Mapping language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps entities (Clinical Findings) to their corresponding UMLS CUI codes. You’ll just feed your text and it will return the corresponding UMLS codes. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/umls_clinical_findings_resolver_pipeline_en_4.3.2_3.0_1681216655167.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/umls_clinical_findings_resolver_pipeline_en_4.3.2_3.0_1681216655167.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("umls_clinical_findings_resolver_pipeline", "en", "clinical/models") text = 'HTG-induced pancreatitis associated with an acute hepatitis, and obesity' result = pipeline.annotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("umls_clinical_findings_resolver_pipeline", "en", "clinical/models") val text = "HTG-induced pancreatitis associated with an acute hepatitis, and obesity" val result = pipeline.annotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.umls_clinical_findings_resolver").predict("""HTG-induced pancreatitis associated with an acute hepatitis, and obesity""") ```
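For downstream use, the (chunk, NER label, UMLS CUI) triples the pipeline produces (shown in the Results section below) can be reshaped into simple lookups. The rows here are copied from this card's sample output; everything else is an illustrative sketch, not part of the pipeline API:

```python
from collections import defaultdict

# Rows copied from this card's results table: (chunk, NER label, UMLS CUI).
rows = [
    ("HTG-induced pancreatitis", "PROBLEM", "C1963198"),
    ("an acute hepatitis", "PROBLEM", "C4750596"),
    ("obesity", "PROBLEM", "C1963185"),
]

# Direct chunk -> CUI lookup.
cui_by_chunk = {chunk: cui for chunk, _, cui in rows}

# Group the extracted chunks by their NER label.
chunks_by_label = defaultdict(list)
for chunk, label, _ in rows:
    chunks_by_label[label].append(chunk)
```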
## Results ```bash +------------------------+---------+---------+ |chunk |ner_label|umls_code| +------------------------+---------+---------+ |HTG-induced pancreatitis|PROBLEM |C1963198 | |an acute hepatitis |PROBLEM |C4750596 | |obesity |PROBLEM |C1963185 | +------------------------+---------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|umls_clinical_findings_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|4.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger --- layout: model title: Pipeline to Detect details of cellular structures (biobert) author: John Snow Labs name: ner_cellular_biobert_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_cellular_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_cellular_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_cellular_biobert_pipeline_en_3.4.1_3.0_1647870475485.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_cellular_biobert_pipeline_en_3.4.1_3.0_1647870475485.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_cellular_biobert_pipeline", "en", "clinical/models") pipeline.annotate("Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.") ``` ```scala val pipeline = new PretrainedPipeline("ner_cellular_biobert_pipeline", "en", "clinical/models") pipeline.annotate("Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. 
Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.cellular_biobert.pipeline").predict("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""") ```
## Results ```bash +-------------------------------------------+--------+ |chunks |entities| +-------------------------------------------+--------+ |intracellular signaling proteins |protein | |human T-cell leukemia virus type 1 promoter|DNA | |Tax |protein | |Tax-responsive element 1 |DNA | |cyclic AMP-responsive members |protein | |CREB/ATF family |protein | |transcription factors |protein | |Tax |protein | |Tax-responsive element 1 |DNA | |TRE-1 |DNA | |lacZ gene |DNA | |CYC1 promoter |DNA | |TRE-1 |DNA | |cyclic AMP response element-binding protein|protein | |CREB |protein | |CREB |protein | |GAL4 activation domain |protein | |GAD |protein | |reporter gene |DNA | |Tax |protein | +-------------------------------------------+--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_cellular_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|421.9 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter --- layout: model title: English DistilBERT Embeddings Cased model (from mrm8488) author: John Snow Labs name: distilbert_embeddings_finetuned_sarcasm_classification date: 2022-07-15 tags: [open_source, distilbert, embeddings, sarcasm, en] task: Embeddings language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert_embeddings_finetuned_sarcasm_classification` is an English model originally trained by `mrm8488`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_finetuned_sarcasm_classification_en_4.0.0_3.0_1657884379182.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_finetuned_sarcasm_classification_en_4.0.0_3.0_1657884379182.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_finetuned_sarcasm_classification","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["PUT YOUR STRING HERE."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_finetuned_sarcasm_classification","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("PUT YOUR STRING HERE.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_finetuned_sarcasm_classification| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| ## References https://huggingface.co/mrm8488/distilbert-finetuned-sarcasm-classification --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_128_finetuned_squad_seed_4 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-128-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_128_finetuned_squad_seed_4_en_4.3.0_3.0_1674213839010.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_128_finetuned_squad_seed_4_en_4.3.0_3.0_1674213839010.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_128_finetuned_squad_seed_4","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_128_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_128_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|423.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-128-finetuned-squad-seed-4 --- layout: model title: English T5ForConditionalGeneration Base Cased model (from Salesforce) author: John Snow Labs name: t5_mixqg_base date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mixqg-base` is an English model originally trained by `Salesforce`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_mixqg_base_en_4.3.0_3.0_1675105642190.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_mixqg_base_en_4.3.0_3.0_1675105642190.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_mixqg_base","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_mixqg_base","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_mixqg_base| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|583.1 MB| ## References - https://huggingface.co/Salesforce/mixqg-base - https://arxiv.org/abs/2110.08175 - https://github.com/salesforce/QGen --- layout: model title: Translate English to Ndonga Pipeline author: John Snow Labs name: translate_en_ng date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, ng, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `ng` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ng_xx_2.7.0_2.4_1609687263099.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ng_xx_2.7.0_2.4_1609687263099.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ng", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ng", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ng').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_ng| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Moldavian, Moldovan, Romanian asr_wav2vec2_large_xlsr_53_romanian_by_gmihaila TFWav2Vec2ForCTC from gmihaila author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_romanian_by_gmihaila date: 2022-09-25 tags: [wav2vec2, ro, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: ro edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_romanian_by_gmihaila` is a Moldavian, Moldovan, Romanian model originally trained by gmihaila. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_romanian_by_gmihaila_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_romanian_by_gmihaila_ro_4.2.0_3.0_1664099488049.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_romanian_by_gmihaila_ro_4.2.0_3.0_1664099488049.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_romanian_by_gmihaila', lang = 'ro') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_romanian_by_gmihaila", lang = "ro") val annotations = pipeline.transform(audioDF) ```
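The snippets above assume an `audioDF` DataFrame of raw audio samples already exists. As a minimal sketch (not part of the official pipeline), the helper below reads a 16-bit PCM WAV file into the list of floats such a DataFrame would hold; the column name `audio_content` is an assumption based on the usual AudioAssembler input column.

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit PCM mono WAV file and return samples scaled to [-1.0, 1.0]."""
    with wave.open(path, "rb") as w:
        if w.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        raw = w.readframes(w.getnframes())
    # '<h' = little-endian signed 16-bit; divide by 32768 to normalize.
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [s / 32768.0 for s in samples]

# The float list could then be wrapped in a Spark DataFrame, e.g. (hypothetical):
# audioDF = spark.createDataFrame([(wav_to_floats("sample.wav"),)], ["audio_content"])
```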
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_romanian_by_gmihaila| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|ro| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English RobertaForQuestionAnswering Tiny Cased model (from janeel) author: John Snow Labs name: roberta_qa_janeel_tiny_squad2_finetuned_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tinyroberta-squad2-finetuned-squad` is an English model originally trained by `janeel`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_janeel_tiny_squad2_finetuned_squad_en_4.3.0_3.0_1674224399345.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_janeel_tiny_squad2_finetuned_squad_en_4.3.0_3.0_1674224399345.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_janeel_tiny_squad2_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_janeel_tiny_squad2_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_janeel_tiny_squad2_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|307.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/janeel/tinyroberta-squad2-finetuned-squad --- layout: model title: English T5ForConditionalGeneration Base Cased model (from doc2query) author: John Snow Labs name: t5_msmarco_base_v1 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `msmarco-t5-base-v1` is an English model originally trained by `doc2query`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_msmarco_base_v1_en_4.3.0_3.0_1675105723928.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_msmarco_base_v1_en_4.3.0_3.0_1675105723928.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_msmarco_base_v1","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_msmarco_base_v1","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_msmarco_base_v1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|1.0 GB| ## References - https://huggingface.co/doc2query/msmarco-t5-base-v1 - https://arxiv.org/abs/1904.08375 - https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf - https://arxiv.org/abs/2104.08663 - https://github.com/UKPLab/beir - https://www.sbert.net/examples/unsupervised_learning/query_generation/README.html - https://github.com/microsoft/MSMARCO-Passage-Ranking --- layout: model title: Typo Detector Pipeline for English author: John Snow Labs name: distilbert_token_classifier_typo_detector_pipeline date: 2022-06-19 tags: [ner, bert, bert_for_token, typo, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [distilbert_token_classifier_typo_detector](https://nlp.johnsnowlabs.com/2022/01/19/distilbert_token_classifier_typo_detector_en.html) model.
## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/TYPO_DETECTOR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/DistilBertForTokenClassification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_typo_detector_pipeline_en_4.0.0_3.0_1655653789927.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_typo_detector_pipeline_en_4.0.0_3.0_1655653789927.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python typo_pipeline = PretrainedPipeline("distilbert_token_classifier_typo_detector_pipeline", lang = "en") typo_pipeline.annotate("He had also stgruggled with addiction during his tine in Congress.") ``` ```scala val typo_pipeline = new PretrainedPipeline("distilbert_token_classifier_typo_detector_pipeline", lang = "en") typo_pipeline.annotate("He had also stgruggled with addiction during his tine in Congress.") ```
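The `annotate()` call returns token-level NER labels. As a small post-processing sketch (my own helper, not part of the pipeline), the function below pairs tokens with their labels and keeps the ones tagged as typos; the `PO` tag is the typo label this pipeline emits.

```python
def flagged_typos(tokens, labels, typo_tag="PO"):
    """Keep tokens whose NER label matches the typo tag.

    IOB-style prefixes (B-/I-) are tolerated by comparing only the tag suffix.
    """
    return [tok for tok, lab in zip(tokens, labels) if lab.split("-")[-1] == typo_tag]
```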
## Results ```bash +----------+---------+ |chunk |ner_label| +----------+---------+ |stgruggled|PO | |tine |PO | +----------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_typo_detector_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|244.2 MB| ## Included Models - DocumentAssembler - TokenizerModel - DistilBertForTokenClassification - NerConverter - Finisher --- layout: model title: Detect Genes and Human Phenotypes author: John Snow Labs name: ner_human_phenotype_gene_clinical date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects mentions of genes and human phenotypes (HP) in medical text. ## Predicted Entities `GENE`, `HP` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_HUMAN_PHENOTYPE_GENE_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_clinical_en_3.0.0_3.0_1617209698053.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_clinical_en_3.0.0_3.0_1617209698053.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_human_phenotype_gene_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")) results = model.transform(spark.createDataFrame([["Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3)."]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_human_phenotype_gene_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) 
.setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("""Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.human_phenotype.gene_clinical").predict("""Here we presented a case (BS type) of a 17 years old female presented with polyhydramnios, polyuria, nephrocalcinosis and hypokalemia, which was alleviated after treatment with celecoxib and vitamin D(3).""") ```
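The `begin`/`end` values in the results below are character offsets into the input text, with `end` inclusive (Spark NLP's annotation convention). A minimal sketch of recovering chunks from those offsets, using a hypothetical helper:

```python
def extract_chunks(text, spans):
    """Slice entity chunks out of the original text.

    Spark NLP `end` offsets are inclusive, hence the `+ 1` when slicing.
    """
    return [text[begin:end + 1] for begin, end in spans]
```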
## Results ```bash +----+------------------+---------+-------+----------+ | | chunk | begin | end | entity | +====+==================+=========+=======+==========+ | 0 | BS type | 29 | 32 | GENE | +----+------------------+---------+-------+----------+ | 1 | polyhydramnios | 75 | 88 | HP | +----+------------------+---------+-------+----------+ | 2 | polyuria | 91 | 98 | HP | +----+------------------+---------+-------+----------+ | 3 | nephrocalcinosis | 101 | 116 | HP | +----+------------------+---------+-------+----------+ | 4 | hypokalemia | 122 | 132 | HP | +----+------------------+---------+-------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_human_phenotype_gene_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|--------------:|------:|-----:|-----:|---------:|---------:|---------:| | 0 | I-HP | 303 | 56 | 64 | 0.844011 | 0.825613 | 0.834711 | | 1 | B-GENE | 1176 | 158 | 252 | 0.881559 | 0.823529 | 0.851557 | | 2 | B-HP | 1078 | 133 | 96 | 0.890173 | 0.918228 | 0.903983 | | 3 | Macro-average | 2557 | 347 | 412 | 0.871915 | 0.85579 | 0.863777 | | 4 | Micro-average | 2557 | 347 | 412 | 0.88051 | 0.861233 | 0.870765 | ``` --- layout: model title: Legal Licenses Clause Binary Classifier author: John Snow Labs name: legclf_cuad_licenses_clause date: 2022-09-27 tags: [en, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `licenses` clause type. To use this model, make sure you provide enough context as an input. 
Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens; if your text is longer, consider splitting it into smaller pieces (see the same tutorial linked above). This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `licenses` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cuad_licenses_clause_en_1.0.0_3.0_1664272270378.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cuad_licenses_clause_en_1.0.0_3.0_1664272270378.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_cuad_licenses_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
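As a complement to the pipeline above, the paragraph-splitting recommendation from the description can be sketched in plain Python before loading clauses into the `clause_text` column. This is a minimal sketch assuming paragraphs are separated by blank lines; `split_into_paragraphs` is a hypothetical helper, not a Spark NLP annotator.

```python
def split_into_paragraphs(text: str) -> list:
    # Paragraph splitting "by multiline": break the document on blank lines
    # and drop empty fragments, so each candidate clause stays well under
    # the 512-token limit of the sentence embeddings.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

document = (
    "GRANT OF LICENSE. Licensor hereby grants Licensee a non-exclusive license.\n\n"
    "GOVERNING LAW. This Agreement shall be governed by the laws of Delaware."
)

clauses = split_into_paragraphs(document)
# Each clause can then be placed in the "clause_text" column of the DataFrame above.
```

Each resulting paragraph is classified independently, so you obtain one True/False prediction per candidate clause.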
## Results

```bash
+----------+
|result    |
+----------+
|[licenses]|
|[other]   |
|[other]   |
|[licenses]|
+----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_cuad_licenses_clause|
|Type:|legal|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.0 MB|

## References

In-house annotations on CUAD dataset

## Benchmarking

```bash
label         precision  recall  f1-score  support
licenses      1.00       0.60    0.75      10
other         0.84       1.00    0.91      21
accuracy      -          -       0.87      31
macro avg     0.92       0.80    0.83      31
weighted avg  0.89       0.87    0.86      31
```

---
layout: model
title: English DistilBertForQuestionAnswering model (from SEISHIN)
author: John Snow Labs
name: distilbert_qa_SEISHIN_base_uncased_finetuned_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `SEISHIN`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_SEISHIN_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724444491.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_SEISHIN_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724444491.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_SEISHIN_base_uncased_finetuned_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_SEISHIN_base_uncased_finetuned_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_SEISHIN").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_SEISHIN_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/SEISHIN/distilbert-base-uncased-finetuned-squad --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295` is a German model originally trained by jonatasgrosman. NOTE: This model only works on a CPU, if you need to use this model on a GPU device please use asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295_de_4.2.0_3.0_1664114641841.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295_de_4.2.0_3.0_1664114641841.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_exp_w2v2r_xls_r_accent_germany_10_austria_0_s295|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|de|
|Size:|1.2 GB|

---
layout: model
title: Extract Clinical Problem Entities from Voice of the Patient Documents (embeddings_clinical_large)
author: John Snow Labs
name: ner_vop_problem_emb_clinical_large
date: 2023-06-06
tags: [licensed, clinical, ner, en, vop, problem]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.3
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model extracts clinical problems from documents written in the patient’s own words, using a granular taxonomy.

## Predicted Entities

`PsychologicalCondition`, `Disease`, `Symptom`, `HealthStatus`, `Modifier`, `InjuryOrPoisoning`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_emb_clinical_large_en_4.4.3_3.0_1686075814525.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_emb_clinical_large_en_4.4.3_3.0_1686075814525.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_problem_emb_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. 
After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_problem_emb_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:---------------------|:------------| | pain | Symptom | | fatigue | Symptom | | rheumatoid arthritis | Disease | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_problem_emb_clinical_large| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical_large| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
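The benchmarking table that follows reports per-label tp/fp/fn counts together with precision, recall and F1. As a quick reference, those figures can be reproduced from the counts with the standard formulas (a plain-Python sketch, not code from the model card; the example uses the `Symptom` row's counts):

```python
def prf(tp: int, fp: int, fn: int):
    # Standard precision / recall / F1 from true-positive, false-positive
    # and false-negative counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Symptom row of the benchmarking table: tp=3716, fp=679, fn=859
p, r, f = prf(3716, 679, 859)
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.85 0.81 0.83
```

Micro-averages pool the counts across all labels before applying the same formulas, while macro-averages average the per-label scores.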
## Benchmarking ```bash label tp fp fn total precision recall f1 PsychologicalCondition 415 42 29 444 0.91 0.93 0.92 Disease 1662 169 353 2015 0.91 0.82 0.86 Symptom 3716 679 859 4575 0.85 0.81 0.83 HealthStatus 83 26 24 107 0.76 0.78 0.77 Modifier 819 222 320 1139 0.79 0.72 0.75 InjuryOrPoisoning 125 37 51 176 0.77 0.71 0.74 macro_avg 6820 1175 1636 8456 0.83 0.80 0.81 micro_avg 6820 1175 1636 8456 0.86 0.80 0.83 ``` --- layout: model title: Pipeline to Detect Clinical Conditions (ner_eu_clinical_condition - it) author: John Snow Labs name: ner_eu_clinical_condition_pipeline date: 2023-03-08 tags: [it, clinical, licensed, ner, clinical_condition] task: Named Entity Recognition language: it edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_eu_clinical_condition](https://nlp.johnsnowlabs.com/2023/02/06/ner_eu_clinical_condition_it.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_pipeline_it_4.3.0_3.2_1678258845491.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_pipeline_it_4.3.0_3.2_1678258845491.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_eu_clinical_condition_pipeline", "it", "clinical/models") text = " Donna, 64 anni, ricovero per dolore epigastrico persistente, irradiato a barra e posteriormente, associato a dispesia e anoressia. Poche settimane dopo compaiono, però, iperemia, intenso edema vulvare ed una esione ulcerativa sul lato sinistro della parete rettale che la RM mostra essere una fistola transfinterica. Questi trattamenti determinano un miglioramento dell’ infiammazione ed una riduzione dell’ ulcera, ma i condilomi permangono inalterati. " result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_eu_clinical_condition_pipeline", "it", "clinical/models") val text = " Donna, 64 anni, ricovero per dolore epigastrico persistente, irradiato a barra e posteriormente, associato a dispesia e anoressia. Poche settimane dopo compaiono, però, iperemia, intenso edema vulvare ed una esione ulcerativa sul lato sinistro della parete rettale che la RM mostra essere una fistola transfinterica. Questi trattamenti determinano un miglioramento dell’ infiammazione ed una riduzione dell’ ulcera, ma i condilomi permangono inalterati. " val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | chunks | begin | end | entities | confidence | |---:|:-----------------------|--------:|------:|:-------------------|-------------:| | 0 | dolore epigastrico | 30 | 47 | clinical_condition | 0.90845 | | 1 | anoressia | 121 | 129 | clinical_condition | 0.9998 | | 2 | iperemia | 170 | 177 | clinical_condition | 0.9999 | | 3 | edema | 188 | 192 | clinical_condition | 1 | | 4 | fistola transfinterica | 294 | 315 | clinical_condition | 0.97785 | | 5 | infiammazione | 372 | 384 | clinical_condition | 0.9996 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_eu_clinical_condition_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|it| |Size:|1.2 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from fvector) author: John Snow Labs name: xlmroberta_ner_fvector_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `fvector`. 
## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_fvector_base_finetuned_panx_de_4.1.0_3.0_1660433089765.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_fvector_base_finetuned_panx_de_4.1.0_3.0_1660433089765.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_fvector_base_finetuned_panx","de") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_fvector_base_finetuned_panx","de")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("document", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
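The `NerConverter` stage above merges the token-level IOB tags produced by the classifier into entity chunks. A minimal plain-Python sketch of that merging logic (illustrative only; not the actual Spark NLP implementation):

```python
def iob_to_chunks(tokens, tags):
    # Merge B-/I- tagged tokens into (text, label) chunks, the way a
    # NER converter turns per-token predictions into entity spans.
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(iob_to_chunks(["Angela", "Merkel", "besucht", "Berlin"],
                    ["B-PER", "I-PER", "O", "B-LOC"]))
# → [('Angela Merkel', 'PER'), ('Berlin', 'LOC')]
```

The real annotator additionally carries character offsets and confidence metadata, but the chunking idea is the same.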
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_fvector_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/fvector/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Pipeline to Detect Pathogen, Medical Condition and Medicine (BertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_pathogen_pipeline date: 2023-03-20 tags: [licensed, clinical, en, ner, pathogen, medical_condition, medicine, berfortokenclassification] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [bert_token_classifier_ner_pathogen](https://nlp.johnsnowlabs.com/2022/07/28/bert_token_classifier_ner_pathogen_en_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_pathogen_pipeline_en_4.3.0_3.2_1679299357172.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_pathogen_pipeline_en_4.3.0_3.2_1679299357172.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_pathogen_pipeline", "en", "clinical/models") text = '''Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. This can progress to loss of skin color, a fast heart rate as it becomes more severe; while it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_pathogen_pipeline", "en", "clinical/models") val text = "Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. This can progress to loss of skin color, a fast heart rate as it becomes more severe; while it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions." val result = pipeline.fullAnnotate(text) ```
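Each chunk returned by `fullAnnotate` carries a confidence score, and some predictions in the results below sit near 0.5. A common post-processing step is to keep only predictions above a confidence threshold (a minimal sketch; the dictionary layout here is illustrative, not the exact annotation schema):

```python
def filter_by_confidence(chunks, threshold=0.6):
    # Keep only entity predictions whose confidence clears the threshold.
    return [c for c in chunks if c["confidence"] >= threshold]

# Two predictions, with confidences taken from the results table below.
preds = [
    {"chunk": "Racecadotril", "label": "Medicine", "confidence": 0.986453},
    {"chunk": "heart rate", "label": "MedicalCondition", "confidence": 0.486794},
]

kept = filter_by_confidence(preds)
print([c["chunk"] for c in kept])  # → ['Racecadotril']
```

The threshold is a trade-off: raising it trades recall for precision, so tune it on a held-out sample of your own documents.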
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:----------------|--------:|------:|:-----------------|-------------:| | 0 | Racecadotril | 0 | 11 | Medicine | 0.986453 | | 1 | loperamide | 80 | 89 | Medicine | 0.967653 | | 2 | Diarrhea | 92 | 99 | MedicalCondition | 0.92107 | | 3 | loose | 128 | 132 | MedicalCondition | 0.639717 | | 4 | liquid | 135 | 140 | MedicalCondition | 0.739769 | | 5 | watery | 145 | 150 | MedicalCondition | 0.911771 | | 6 | bowel movements | 152 | 166 | MedicalCondition | 0.637392 | | 7 | dehydration | 187 | 197 | MedicalCondition | 0.81079 | | 8 | loss | 282 | 285 | MedicalCondition | 0.526605 | | 9 | color | 295 | 299 | MedicalCondition | 0.612506 | | 10 | fast | 304 | 307 | MedicalCondition | 0.555894 | | 11 | heart rate | 309 | 318 | MedicalCondition | 0.486794 | | 12 | rabies virus | 381 | 392 | Pathogen | 0.738198 | | 13 | Lyssavirus | 395 | 404 | Pathogen | 0.979239 | | 14 | Ephemerovirus | 410 | 422 | Pathogen | 0.992292 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_pathogen_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.8 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Detect Drug Chemicals (BertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_drugs date: 2022-01-06 tags: [drug, berfortokenclassification, ner, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for Drugs. 
This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. It detects drug chemicals.

## Predicted Entities

`DrugChem`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_drugs_en_3.3.4_2.4_1641472225294.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_drugs_en_3.3.4_2.4_1641472225294.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_drugs", "en", "clinical/models")\ .setInputCols(["token", "sentence"])\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) data = spark.createDataFrame([["""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. 
With the objective of determining the usefulnessof vinorelbine monotherapy in patients with advanced or recurrent breast cancerafter standard therapy, we evaluated the efficacy and safety of vinorelbine inpatients previously treated with anthracyclines and taxanes."""]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_drugs", "en", "clinical/models")
    .setInputCols(Array("token", "sentence"))
    .setOutputCol("ner")
    .setCaseSensitive(true)

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence","token","ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))

val data = Seq("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle.
The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulnessof vinorelbine monotherapy in patients with advanced or recurrent breast cancerafter standard therapy, we evaluated the efficacy and safety of vinorelbine inpatients previously treated with anthracyclines and taxanes.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_drugs").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. 
With the objective of determining the usefulnessof vinorelbine monotherapy in patients with advanced or recurrent breast cancerafter standard therapy, we evaluated the efficacy and safety of vinorelbine inpatients previously treated with anthracyclines and taxanes.""") ```
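The results below list some chunks more than once (`vinorelbine`, `anthracyclines` and `taxanes` each appear twice because they occur twice in the input text). When you only need the vocabulary of detected drugs, an order-preserving de-duplication is a handy post-processing step (plain-Python sketch, not part of the pipeline):

```python
def unique_chunks(chunks):
    # Order-preserving de-duplication of detected entity strings.
    seen = set()
    out = []
    for c in chunks:
        if c not in seen:
            seen.add(c)
            out.append(c)
    return out

# Chunk strings as they appear in the results table below.
detected = ["potassium", "nucleotide", "anthracyclines", "taxanes",
            "vinorelbine", "vinorelbine", "anthracyclines", "taxanes"]

print(unique_chunks(detected))
# → ['potassium', 'nucleotide', 'anthracyclines', 'taxanes', 'vinorelbine']
```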
## Results

```bash
+--------------+---------+
|chunk         |ner_label|
+--------------+---------+
|potassium     |DrugChem |
|nucleotide    |DrugChem |
|anthracyclines|DrugChem |
|taxanes       |DrugChem |
|vinorelbine   |DrugChem |
|vinorelbine   |DrugChem |
|anthracyclines|DrugChem |
|taxanes       |DrugChem |
+--------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_drugs|
|Compatibility:|Healthcare NLP 3.3.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|404.3 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## Data Source

Trained on i2b2_med7 + FDA. https://www.i2b2.org/NLP/Medication

## Benchmarking

```bash
label         precision  recall  f1-score  support
B-DrugChem    0.99       0.99    0.99      97872
I-DrugChem    0.99       0.99    0.99      54909
O             1.00       1.00    1.00      1191109
accuracy      -          -       1.00      1343890
macro-avg     0.99       0.99    0.99      1343890
weighted-avg  1.00       1.00    1.00      1343890
```

---
layout: model
title: Legal Credit Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_credit_agreement_bert
date: 2022-11-24
tags: [en, legal, classification, agreement, credit, licensed, bert]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The `legclf_credit_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the `credit-agreement` class or not (Binary Classification). Compared to the Longformer model, this model is lighter in terms of inference time.
## Predicted Entities `credit-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_credit_agreement_bert_en_1.0.0_3.0_1669310408516.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_credit_agreement_bert_en_1.0.0_3.0_1669310408516.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_credit_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
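The classifier stage maps the sentence embedding (768 dimensions for `sent_bert_base_cased`) to one score per class and predicts the argmax of a softmax. A toy pure-Python sketch of that final step, with made-up scores that are purely illustrative (not the model's real weights):

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw class scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["credit-agreement", "other"]
logits = [2.1, 0.3]  # made-up scores from a hypothetical classification head
probs = softmax(logits)
print(labels[probs.index(max(probs))])  # -> credit-agreement
```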
## Results ```bash +-------+ |result| +-------+ |[credit-agreement]| |[other]| |[other]| |[credit-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_credit_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support credit-agreement 0.95 0.95 0.95 38 other 0.97 0.97 0.97 65 accuracy - - 0.96 103 macro-avg 0.96 0.96 0.96 103 weighted-avg 0.96 0.96 0.96 103 ``` --- layout: model title: Translate English to Baltic languages Pipeline author: John Snow Labs name: translate_en_bat date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, bat, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `bat` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_bat_xx_2.7.0_2.4_1609691023515.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_bat_xx_2.7.0_2.4_1609691023515.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_bat", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_bat", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.bat').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_bat| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from English to Armenian author: John Snow Labs name: opus_mt_en_hy date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, hy, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `hy` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_hy_xx_2.7.0_2.4_1609168705243.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_hy_xx_2.7.0_2.4_1609168705243.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_hy", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_hy", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.hy').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_hy| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering Uncased model (from aodiniz) author: John Snow Labs name: bert_qa_uncased_l_2_h_128_a_2_cord19_200616_squad2_covid_qna date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-2_H-128_A-2_cord19-200616_squad2_covid-qna` is an English model originally trained by `aodiniz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_uncased_l_2_h_128_a_2_cord19_200616_squad2_covid_qna_en_4.0.0_3.0_1657188884477.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_uncased_l_2_h_128_a_2_cord19_200616_squad2_covid_qna_en_4.0.0_3.0_1657188884477.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_uncased_l_2_h_128_a_2_cord19_200616_squad2_covid_qna","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_uncased_l_2_h_128_a_2_cord19_200616_squad2_covid_qna","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
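Internally, extractive QA models score every context token as a possible answer start and end; the predicted answer is the span that maximizes the combined score. A pure-Python sketch of that decoding step, using toy scores rather than the model's actual outputs:

```python
def best_span(start_scores, end_scores, max_len=30):
    # Pick (i, j) with i <= j maximizing start_scores[i] + end_scores[j],
    # subject to a maximum answer length.
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score, best = s + end_scores[j], (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Toy scores peaking on "Clara" for the question "What is my name?"
start = [0.1, 0.0, 0.2, 4.0, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]
end   = [0.0, 0.1, 0.0, 3.5, 0.2, 0.0, 0.0, 0.1, 0.4, 0.0]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # -> Clara
```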
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_uncased_l_2_h_128_a_2_cord19_200616_squad2_covid_qna| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|16.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-2_H-128_A-2_cord19-200616_squad2_covid-qna --- layout: model title: Relation Extraction between Tumors and Sizes (ReDL) author: John Snow Labs name: redl_oncology_size_biobert_wip date: 2022-09-28 tags: [licensed, clinical, oncology, en, relation_extraction] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This relation extraction model links Tumor_Size extractions to their corresponding Tumor_Finding extractions. ## Predicted Entities `is_size_of`, `O` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_oncology_size_biobert_wip_en_4.1.0_3.0_1664404453416.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_oncology_size_biobert_wip_en_4.1.0_3.0_1664404453416.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Tumor_Finding and Tumor_Size should be included in the relation pairs.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunk", "dependencies"]) \ .setOutputCol("re_ner_chunk") \ .setMaxSyntacticDistance(10) \ .setRelationPairs(["Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding"]) re_model = RelationExtractionDLModel.pretrained("redl_oncology_size_biobert_wip", "en", "clinical/models") \ .setInputCols(["re_ner_chunk", "sentence"]) \ .setOutputCol("relation_extraction") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model]) data = spark.createDataFrame([["The patient presented a 2 cm mass in her left breast, and the tumor in her other breast was 3 cm long."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler()
.setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentence", "pos_tags", "token")) .setOutputCol("dependencies") val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols("ner_chunk", "dependencies") .setOutputCol("re_ner_chunk") .setMaxSyntacticDistance(10) .setRelationPairs(Array("Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding")) val re_model = RelationExtractionDLModel.pretrained("redl_oncology_size_biobert_wip", "en", "clinical/models") .setPredictionThreshold(0.5f) .setInputCols("re_ner_chunk", "sentence") .setOutputCol("relation_extraction") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("The patient presented a 2 cm mass in her left breast, and the tumor in her other breast was 3 cm long.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu 
nlu.load("en.relation.oncology_size_biobert").predict("""The patient presented a 2 cm mass in her left breast, and the tumor in her other breast was 3 cm long.""") ```
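Conceptually, `RENerChunksFilter` enumerates candidate chunk pairs and keeps only those whose entity labels match one of the configured relation pairs and whose syntactic distance does not exceed the maximum. A simplified sketch of that filtering logic, using token distance as a stand-in for true dependency-tree distance (illustrative only; the chunk positions below are made up):

```python
from itertools import combinations

def candidate_pairs(chunks, allowed_pairs, max_distance):
    # chunks: list of (text, entity_label, token_index) triples.
    allowed = {tuple(p.split("-")) for p in allowed_pairs}
    out = []
    for a, b in combinations(chunks, 2):
        if (a[1], b[1]) in allowed or (b[1], a[1]) in allowed:
            if abs(a[2] - b[2]) <= max_distance:
                out.append((a[0], b[0]))
    return out

chunks = [("2 cm", "Tumor_Size", 5), ("mass", "Tumor_Finding", 6),
          ("breast", "Site_Breast", 9), ("3 cm", "Tumor_Size", 19)]
pairs = candidate_pairs(chunks,
                        ["Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding"],
                        max_distance=10)
print(pairs)  # -> [('2 cm', 'mass')]
```

Only label-compatible pairs within the distance threshold survive; the relation model then classifies each surviving pair.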
## Results ```bash chunk1 entity1 chunk2 entity2 relation confidence 2 cm Tumor_Size mass Tumor_Finding is_size_of 0.9604708 tumor Tumor_Finding 3 cm Tumor_Size is_size_of 0.99731797 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_oncology_size_biobert_wip| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|405.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label recall precision f1 support O 0.87 0.84 0.86 143.0 is_size_of 0.85 0.88 0.86 157.0 macro-avg 0.86 0.86 0.86 NaN ``` --- layout: model title: Lemmatizer (Turkish, SpacyLookup) author: John Snow Labs name: lemma_spacylookup date: 2022-03-03 tags: [open_source, lemmatizer, tr] task: Lemmatization language: tr edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Turkish Lemmatizer is a scalable, production-ready version of the rule-based lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_tr_3.4.1_3.0_1646316553135.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_tr_3.4.1_3.0_1646316553135.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","tr") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Sen benden daha iyi değilsin"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","tr") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("Sen benden daha iyi değilsin").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("tr.lemma.spacylookup").predict("""Sen benden daha iyi değilsin""") ```
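A lookup lemmatizer is essentially a dictionary: tokens found in the table are replaced by their lemma, and everything else passes through unchanged. A minimal sketch with a tiny illustrative subset of the Turkish table (the real model ships the full spaCy lookup data):

```python
# Tiny illustrative subset of a lookup table, matching the example below.
lookup = {"benden": "ben", "değilsin": "değ"}

def lemmatize(tokens, table):
    # A table hit wins; otherwise the surface form is kept unchanged.
    return [table.get(token, token) for token in tokens]

print(lemmatize("Sen benden daha iyi değilsin".split(), lookup))
# -> ['Sen', 'ben', 'daha', 'iyi', 'değ']
```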
## Results ```bash +--------------------------+ |result | +--------------------------+ |[Sen, ben, daha, iyi, değ]| +--------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma_spacylookup| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|tr| |Size:|15.5 MB| --- layout: model title: Detect Persons, Locations, Organizations, Dates, Time, Numbers, and Designation Entities in Urdu (urduvec_140M_300d) author: John Snow Labs name: uner_mk_140M_300d date: 2022-07-26 tags: [ner, ur, open_source] task: Named Entity Recognition language: ur edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses Urdu word embeddings to find 7 different types of entities in Urdu text. It is trained using urduvec_140M_300d word embeddings, so please use the same embeddings in the pipeline. ## Predicted Entities `PERSON`, `LOCATION`, `ORGANIZATION`, `DATE`, `TIME`, `NUMBER`, `DESIGNATION` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/uner_mk_140M_300d_ur_4.0.0_3.0_1658874449475.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/uner_mk_140M_300d_ur_4.0.0_3.0_1658874449475.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python word_embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner = NerDLModel.pretrained("uner_mk_140M_300d", "ur") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate("بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔") ``` ```scala val embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("uner_mk_140M_300d", "ur") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ur.ner").predict("""بریگیڈیئر ایڈ بٹلر سنہ دوہزارچھ میں ہلمند کے فوجی کمانڈر تھے۔""") ```
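NerConverter's job can be sketched in pure Python: collapse token-level IOB tags into (chunk, label) pairs by starting a chunk at each `B-` tag and extending it over the `I-` tags that follow. A minimal sketch (English tokens used for readability; the labels match this model's tag set):

```python
def iob_to_chunks(tokens, tags):
    # Merge IOB-tagged tokens into full entity chunks.
    chunks, cur, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur:
                chunks.append((" ".join(cur), label))
            cur, label = [tok], tag[2:]
        elif tag.startswith("I-") and cur and tag[2:] == label:
            cur.append(tok)
        else:  # "O" or a stray I- tag ends the current chunk
            if cur:
                chunks.append((" ".join(cur), label))
            cur, label = [], None
    if cur:
        chunks.append((" ".join(cur), label))
    return chunks

tokens = ["Ed", "Butler", "was", "commander", "in", "Helmand"]
tags = ["B-PERSON", "I-PERSON", "O", "O", "O", "B-LOCATION"]
print(iob_to_chunks(tokens, tags))
# -> [('Ed Butler', 'PERSON'), ('Helmand', 'LOCATION')]
```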
## Results ```bash | | ner_chunk | entity | |---:|---------------:|-------------:| | 0 |بریگیڈیئر | DESIGNATION | | 1 |ایڈ بٹلر | PERSON | | 2 |سنہ دوہزارچھ | DATE | | 3 |ہلمند | LOCATION | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|uner_mk_140M_300d| |Type:|ner| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token, word_embeddings]| |Output Labels:|[ner]| |Language:|ur| |Size:|14.9 MB| |Dependencies:|urduvec_140M_300d| ## References This model is trained using the following datasets: https://www.researchgate.net/publication/312218764_Named_Entity_Dataset_for_Urdu_Named_Entity_Recognition_Task https://www.researchgate.net/publication/332653135_Urdu_Named_Entity_Recognition_Corpus_Generation_and_Deep_Learning_Applications ## Benchmarking ```bash label tp fp fn prec rec f1 I-TIME 12 10 1 0.545455 0.923077 0.685714 B-PERSON 2808 846 535 0.768473 0.839964 0.80263 B-DATE 34 6 6 0.85 0.85 0.85 I-DATE 45 1 2 0.978261 0.957447 0.967742 B-DESIGNATION 49 30 16 0.620253 0.753846 0.680556 I-LOCATION 2110 750 701 0.737762 0.750623 0.744137 B-TIME 11 9 3 0.55 0.785714 0.647059 I-ORGANIZATION 2006 772 760 0.722102 0.725235 0.723665 I-NUMBER 18 6 2 0.75 0.9 0.818182 B-LOCATION 5428 1255 582 0.81221 0.903161 0.855275 B-NUMBER 194 36 27 0.843478 0.877828 0.86031 I-DESIGNATION 25 15 6 0.625 0.806452 0.704225 I-PERSON 3562 759 433 0.824346 0.891614 0.856662 B-ORGANIZATION 1114 466 641 0.705063 0.634758 0.668066 Macro-average 17416 4961 3715 0.738029 0.828551 0.780675 Micro-average 17416 4961 3715 0.778299 0.824192 0.800588 ``` --- layout: model title: Translate Indo-Iranian languages to English Pipeline author: John Snow Labs name: translate_iir_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, iir, en, xx] supported: true annotator: PipelineModel article_header: type: cover 
use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `iir` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_iir_en_xx_2.7.0_2.4_1609688049853.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_iir_en_xx_2.7.0_2.4_1609688049853.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_iir_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_iir_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.iir.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_iir_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Word2Vec Embeddings in Maithili (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, mai, open_source] task: Embeddings language: mai edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_mai_3.4.1_3.0_1647443950715.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_mai_3.4.1_3.0_1647443950715.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","mai") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","mai") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("mai.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
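Once tokens are mapped to vectors, downstream comparisons typically use cosine similarity. A minimal sketch with toy 3-dimensional vectors standing in for the model's 300-dimensional ones (the vocabulary and values below are invented for illustration):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy lookup: a WordEmbeddingsModel is conceptually a token -> vector map.
vectors = {
    "king":   [0.9, 0.1, 0.0],
    "queen":  [0.85, 0.15, 0.05],
    "banana": [0.0, 0.2, 0.95],
}
print(cosine(vectors["king"], vectors["queen"]) > cosine(vectors["king"], vectors["banana"]))  # -> True
```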
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|mai| |Size:|85.2 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from Evelyn18) author: John Snow Labs name: roberta_qa_base_spanish_squades_becasv3 date: 2023-01-20 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-squades-becasv3` is a Spanish model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becasv3_es_4.3.0_3.0_1674218266592.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_becasv3_es_4.3.0_3.0_1674218266592.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becasv3","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_becasv3","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_spanish_squades_becasv3| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|459.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Evelyn18/roberta-base-spanish-squades-becasv3 --- layout: model title: Translate English to Esperanto Pipeline author: John Snow Labs name: translate_en_eo date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, eo, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `eo` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_eo_xx_2.7.0_2.4_1609686406047.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_eo_xx_2.7.0_2.4_1609686406047.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_eo", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_eo", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.eo').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_eo| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011) author: John Snow Labs name: distilbert_token_classifier_autotrain_company_vs_all_902129475 date: 2023-03-03 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-company_vs_all-902129475` is an English model originally trained by `ismail-lucifer011`. ## Predicted Entities `Company`, `OOV` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_company_vs_all_902129475_en_4.3.1_3.0_1677881724840.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_company_vs_all_902129475_en_4.3.1_3.0_1677881724840.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_company_vs_all_902129475","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_company_vs_all_902129475","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_company_vs_all_902129475| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ismail-lucifer011/autotrain-company_vs_all-902129475 --- layout: model title: English image_classifier_vit_hotdog_not_hotdog ViTForImageClassification from julien-c author: John Snow Labs name: image_classifier_vit_hotdog_not_hotdog date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_hotdog_not_hotdog` is an English model originally trained by julien-c. ## Predicted Entities `hot dog`, `not hot dog` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_hotdog_not_hotdog_en_4.1.0_3.0_1660166087894.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_hotdog_not_hotdog_en_4.1.0_3.0_1660166087894.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_hotdog_not_hotdog", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_hotdog_not_hotdog", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_hotdog_not_hotdog| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Financial Business Item Binary Classifier author: John Snow Labs name: finclf_business_item date: 2022-08-10 tags: [en, finance, classification, 10k, annual, reports, sec, filings, licensed] task: Text Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `business` item type of 10K Annual Reports. To use this model, make sure you provide enough context as an input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it is better to skip it unless you want to do binary classification at the sentence level. If you have big financial documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Finance NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial linked above). 
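The paragraph-splitting (by multiline) strategy mentioned above can be sketched in plain Python before the text ever reaches the pipeline. This is a simplified sketch; the whitespace token count used here is only a rough proxy for the model's actual tokenizer:

```python
import re

def split_paragraphs(text, max_tokens=512):
    """Split a filing into paragraphs on blank lines and flag any
    paragraph that may exceed the 512-token embedding limit."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # A whitespace split is only a rough proxy for the real tokenizer,
    # but it is enough to decide whether further splitting is needed.
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

filing = (
    "Item 1. Business.\nWe design and sell consumer electronics.\n\n"
    "Item 1A. Risk Factors.\nOur business faces significant competition."
)
for paragraph, fits in split_paragraphs(filing):
    print(fits, paragraph[:30])
```

Each resulting paragraph can then be fed to the classifier pipeline as a separate row.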
## Predicted Entities `other`, `business` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_business_item_en_1.0.0_3.2_1660154375895.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_business_item_en_1.0.0_3.2_1660154375895.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") useEmbeddings = nlp.UniversalSentenceEncoder.pretrained() \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("finclf_business_item", "en", "finance/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, useEmbeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +----------+ |result    | +----------+ |[business]| |[other]   | |[other]   | |[business]| +----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_business_item| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.6 MB| ## References Weak labelling on documents from the Edgar database ## Benchmarking ```bash label precision recall f1-score support business 0.92 0.89 0.91 644 other 0.90 0.93 0.92 684 accuracy - - 0.91 1328 macro-avg 0.91 0.91 0.91 1328 weighted-avg 0.91 0.91 0.91 1328 ``` --- layout: model title: Pipeline to Detect Adverse Drug Events (bert-clinical) author: John Snow Labs name: ner_ade_clinicalbert_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_ade_clinicalbert](https://nlp.johnsnowlabs.com/2021/04/01/ner_ade_clinicalbert_en.html) model. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_ade_clinicalbert_pipeline_en_3.4.1_3.0_1647874360518.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_ade_clinicalbert_pipeline_en_3.4.1_3.0_1647874360518.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_ade_clinicalbert_pipeline", "en", "clinical/models") pipeline.annotate("Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.") ``` ```scala val pipeline = new PretrainedPipeline("ner_ade_clinicalbert_pipeline", "en", "clinical/models") pipeline.annotate("Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical_bert_ade.pipeline").predict("""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.""") ```
## Results ```bash +---------+--------+ |chunks |entities| +---------+--------+ |erbA IRES|DRUG | +---------+--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_ade_clinicalbert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Extract Entities Related to TNM Staging author: John Snow Labs name: ner_oncology_tnm_wip date: 2022-09-30 tags: [licensed, clinical, oncology, en, ner] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts staging information and mentions related to tumors, lymph nodes and metastases. Tumor_Description is used to extract characteristics from tumors such as size, histological type or presence of invasion. Lymph_Node_Modifier is used to extract modifiers that refer to an abnormal lymph node (such as "enlarged"). Definitions of Predicted Entities: - `Cancer_Dx`: Mentions of cancer diagnoses (such as "breast cancer") or pathological types that are usually used as synonyms for "cancer" (e.g. "carcinoma"). When anatomical references are present, they are included in the Cancer_Dx extraction. - `Lymph_Node`: Mentions of lymph nodes and pathological findings of the lymph nodes. - `Lymph_Node_Modifier`: Words that refer to a lymph node being abnormal (such as "enlargement"). - `Metastasis`: Terms that indicate a metastatic disease. Anatomical references are not included in these extractions. - `Staging`: Mentions of cancer stage such as "stage 2b" or "T2N1M0". It also includes words such as "in situ", "early-stage" or "advanced". 
- `Tumor`: All nonspecific terms that may be related to tumors, either malignant or benign (for example: "mass", "tumor", "lesion", or "neoplasm"). - `Tumor_Description`: Information related to tumor characteristics, such as size, presence of invasion, grade and histological type. ## Predicted Entities `Cancer_Dx`, `Lymph_Node`, `Lymph_Node_Modifier`, `Metastasis`, `Staging`, `Tumor`, `Tumor_Description` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_tnm_wip_en_4.0.0_3.0_1664561705395.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_tnm_wip_en_4.0.0_3.0_1664561705395.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_tnm_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The final diagnosis was metastatic breast carcinoma, and the TNM classification was T2N1M1 stage IV. 
The histological grade of this 4 cm tumor was grade 2."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_tnm_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The final diagnosis was metastatic breast carcinoma, and the TNM classification was T2N1M1 stage IV. The histological grade of this 4 cm tumor was grade 2.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_tnm_wip").predict("""The final diagnosis was metastatic breast carcinoma, and the TNM classification was T2N1M1 stage IV. The histological grade of this 4 cm tumor was grade 2.""") ```
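Downstream code often needs the extracted chunks grouped per entity type once they are flattened out of the `ner_chunk` column. A minimal pure-Python sketch, using illustrative (chunk, label) pairs rather than real model output; the labels follow the entity list defined above:

```python
from collections import defaultdict

# Hypothetical (chunk, label) pairs, shaped like flattened NerConverter
# output; these values are illustrative only, not real predictions.
extractions = [
    ("metastatic", "Metastasis"),
    ("breast carcinoma", "Cancer_Dx"),
    ("T2N1M1 stage IV", "Staging"),
    ("4 cm", "Tumor_Description"),
    ("tumor", "Tumor"),
    ("grade 2", "Tumor_Description"),
]

# Group the chunks by their predicted entity label.
chunks_by_label = defaultdict(list)
for chunk, label in extractions:
    chunks_by_label[label].append(chunk)

print(dict(chunks_by_label))
```

This makes it easy to, for example, collect every `Tumor_Description` mention for a patient in one place.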
## Results ```bash | chunk | ner_label | |:-------------------|:------------------| | metastatic | Metastasis | | breast carcinoma | Cancer_Dx | | T2N1M1 stage IV | Staging | | histological grade | Tumor_Description | | 4 cm | Tumor_Description | | tumor | Tumor | | grade 2 | Tumor_Description | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_tnm_wip| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|858.6 KB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Lymph_Node 410.0 31.0 100.0 510.0 0.93 0.80 0.86 Staging 166.0 15.0 50.0 216.0 0.92 0.77 0.84 Lymph_Node_Modifier 19.0 1.0 12.0 31.0 0.95 0.61 0.75 Tumor_Description 1996.0 537.0 385.0 2381.0 0.79 0.84 0.81 Tumor 834.0 48.0 108.0 942.0 0.95 0.89 0.91 Metastasis 273.0 16.0 16.0 289.0 0.94 0.94 0.94 Cancer_Dx 949.0 44.0 117.0 1066.0 0.96 0.89 0.92 macro_avg 4647.0 692.0 788.0 5435.0 0.92 0.82 0.86 micro_avg NaN NaN NaN NaN 0.88 0.86 0.86 ``` --- layout: model title: Detect Persons, Locations, Organizations and Misc Entities in Dutch (WikiNER 6B 300) author: John Snow Labs name: wikiner_6B_300 date: 2020-05-10 task: Named Entity Recognition language: nl edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [ner, nl, open_source] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 6B 300 is trained with GloVe 6B 300 word embeddings, so be sure to use the same embeddings in the pipeline. 
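The idea that semantically similar words sit closer together in embedding space is usually made concrete with cosine similarity. A toy sketch with made-up 3-dimensional vectors (real GloVe 6B 300 embeddings are 300-dimensional and have different values):

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1.0 means identical
    # direction, values near 0.0 mean semantically unrelated.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Made-up toy vectors for three Dutch words (illustrative only).
koning = [0.9, 0.1, 0.3]      # "king"
koningin = [0.85, 0.2, 0.35]  # "queen"
fiets = [0.1, 0.9, 0.2]       # "bicycle"

# Semantically similar words should score higher.
print(cosine_similarity(koning, koningin) > cosine_similarity(koning, fiets))
```

The same-embeddings requirement above follows from this picture: the NER model learned decision boundaries in one specific vector space, so feeding it vectors from a different embedding model would place words at unrelated points.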
{:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_NL){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_NL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_6B_300_nl_2.5.0_2.4_1588546201483.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_6B_300_nl_2.5.0_2.4_1588546201483.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang="xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("wikiner_6B_300", "nl") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (geboren 28 oktober 1955) is een Amerikaanse zakenmagnaat, softwareontwikkelaar, investeerder en filantroop. Hij is vooral bekend als medeoprichter van Microsoft Corporation. Tijdens zijn carrière bij Microsoft bekleedde Gates de functies van voorzitter, chief executive officer (CEO), president en chief software architect, terwijl hij ook de grootste individuele aandeelhouder was tot mei 2014. Hij is een van de bekendste ondernemers en pioniers van de microcomputerrevolutie van de jaren 70 en 80. Gates, geboren en getogen in Seattle, Washington, richtte in 1975 samen met jeugdvriend Paul Allen Microsoft op in Albuquerque, New Mexico; het werd "s werelds grootste personal computer softwarebedrijf. Gates leidde het bedrijf als voorzitter en CEO totdat hij in januari 2000 aftrad als CEO, maar hij bleef voorzitter en werd chief software architect. Eind jaren negentig kreeg Gates kritiek vanwege zijn zakelijke tactieken, die als concurrentiebeperkend werden beschouwd. Deze mening is bevestigd door tal van gerechtelijke uitspraken. In juni 2006 kondigde Gates aan dat hij zou overgaan naar een parttime functie bij Microsoft en fulltime gaan werken bij de Bill & Melinda Gates Foundation, de particuliere liefdadigheidsstichting die hij en zijn vrouw, Melinda Gates, in 2000 hebben opgericht. 
Hij droeg geleidelijk zijn taken over aan Ray Ozzie en Craig Mundie. Hij trad in februari 2014 af als voorzitter van Microsoft en nam een nieuwe functie aan als technologieadviseur ter ondersteuning van de nieuw aangestelde CEO Satya Nadella.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang="xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("wikiner_6B_300", "nl") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (geboren 28 oktober 1955) is een Amerikaanse zakenmagnaat, softwareontwikkelaar, investeerder en filantroop. Hij is vooral bekend als medeoprichter van Microsoft Corporation. Tijdens zijn carrière bij Microsoft bekleedde Gates de functies van voorzitter, chief executive officer (CEO), president en chief software architect, terwijl hij ook de grootste individuele aandeelhouder was tot mei 2014. Hij is een van de bekendste ondernemers en pioniers van de microcomputerrevolutie van de jaren 70 en 80. Gates, geboren en getogen in Seattle, Washington, richtte in 1975 samen met jeugdvriend Paul Allen Microsoft op in Albuquerque, New Mexico; het werd "s werelds grootste personal computer softwarebedrijf. Gates leidde het bedrijf als voorzitter en CEO totdat hij in januari 2000 aftrad als CEO, maar hij bleef voorzitter en werd chief software architect. Eind jaren negentig kreeg Gates kritiek vanwege zijn zakelijke tactieken, die als concurrentiebeperkend werden beschouwd. Deze mening is bevestigd door tal van gerechtelijke uitspraken. 
In juni 2006 kondigde Gates aan dat hij zou overgaan naar een parttime functie bij Microsoft en fulltime gaan werken bij de Bill & Melinda Gates Foundation, de particuliere liefdadigheidsstichting die hij en zijn vrouw, Melinda Gates, in 2000 hebben opgericht. Hij droeg geleidelijk zijn taken over aan Ray Ozzie en Craig Mundie. Hij trad in februari 2014 af als voorzitter van Microsoft en nam een nieuwe functie aan als technologieadviseur ter ondersteuning van de nieuw aangestelde CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (geboren 28 oktober 1955) is een Amerikaanse zakenmagnaat, softwareontwikkelaar, investeerder en filantroop. Hij is vooral bekend als medeoprichter van Microsoft Corporation. Tijdens zijn carrière bij Microsoft bekleedde Gates de functies van voorzitter, chief executive officer (CEO), president en chief software architect, terwijl hij ook de grootste individuele aandeelhouder was tot mei 2014. Hij is een van de bekendste ondernemers en pioniers van de microcomputerrevolutie van de jaren 70 en 80. Gates, geboren en getogen in Seattle, Washington, richtte in 1975 samen met jeugdvriend Paul Allen Microsoft op in Albuquerque, New Mexico; het werd 's werelds grootste personal computer softwarebedrijf. Gates leidde het bedrijf als voorzitter en CEO totdat hij in januari 2000 aftrad als CEO, maar hij bleef voorzitter en werd chief software architect. Eind jaren negentig kreeg Gates kritiek vanwege zijn zakelijke tactieken, die als concurrentiebeperkend werden beschouwd. Deze mening is bevestigd door tal van gerechtelijke uitspraken. In juni 2006 kondigde Gates aan dat hij zou overgaan naar een parttime functie bij Microsoft en fulltime gaan werken bij de Bill & Melinda Gates Foundation, de particuliere liefdadigheidsstichting die hij en zijn vrouw, Melinda Gates, in 2000 hebben opgericht. 
Hij droeg geleidelijk zijn taken over aan Ray Ozzie en Craig Mundie. Hij trad in februari 2014 af als voorzitter van Microsoft en nam een nieuwe functie aan als technologieadviseur ter ondersteuning van de nieuw aangestelde CEO Satya Nadella."""] ner_df = nlu.load('nl.ner.wikiner.glove.6B_300').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title} ## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |William Henry Gates III|PER | |Amerikaanse |MISC | |Microsoft Corporation |ORG | |Tijdens |ORG | |Microsoft |ORG | |Gates |PER | |CEO |ORG | |Gates |ORG | |Seattle |LOC | |Washington |LOC | |Paul Allen Microsoft |PER | |Albuquerque |LOC | |New Mexico |LOC | |Gates |PER | |CEO |ORG | |CEO |ORG | |Eind |MISC | |Gates |PER | |Deze mening |MISC | |In juni 2006 |MISC | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wikiner_6B_300| |Type:|ner| |Compatibility:|Spark NLP 2.5.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|nl| |Case sensitive:|false| {:.h2_title} ## Data Source The model is trained on data from [https://nl.wikipedia.org](https://nl.wikipedia.org) --- layout: model title: English image_classifier_vit_rust_image_classification_7 ViTForImageClassification from SummerChiam author: John Snow Labs name: image_classifier_vit_rust_image_classification_7 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rust_image_classification_7` is an English model originally trained by SummerChiam. 
## Predicted Entities `nonrust`, `rust` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_7_en_4.1.0_3.0_1660169299964.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_7_en_4.1.0_3.0_1660169299964.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_rust_image_classification_7", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_rust_image_classification_7", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_rust_image_classification_7| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Detect Mentions of Tumors in Text author: John Snow Labs name: nerdl_tumour_demo date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract entities related to tumors in medical text using a pretrained NER model. ## Predicted Entities `Localization`, `Size`, `Laterality`, `Staging`, `Grading` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/nerdl_tumour_demo_en_3.0.0_3.0_1617260634175.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/nerdl_tumour_demo_en_3.0.0_3.0_1617260634175.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("nerdl_tumour_demo", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("entities") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("nerdl_tumour_demo", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("entities") val pipeline = new 
Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.tumour").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|nerdl_tumour_demo| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| --- layout: model title: Abkhazian asr_xls_r_demo_test TFWav2Vec2ForCTC from chmanoj author: John Snow Labs name: asr_xls_r_demo_test date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_demo_test` is an Abkhazian model originally trained by chmanoj. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_xls_r_demo_test_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xls_r_demo_test_ab_4.2.0_3.0_1664020259177.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xls_r_demo_test_ab_4.2.0_3.0_1664020259177.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_xls_r_demo_test", "ab")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_xls_r_demo_test", "ab") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
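Under the hood, Wav2Vec2ForCTC emits a per-frame label distribution that CTC decoding collapses into text. The toy sketch below (plain Python; the frame labels and the `ctc_greedy_decode` helper are made up for illustration, not part of the Spark NLP API) shows the greedy collapse rule: merge consecutive repeats, then drop the blank symbol.

```python
# Toy greedy CTC decode: take the best label per frame, collapse
# consecutive repeats, then remove the CTC blank symbol.
def ctc_greedy_decode(frame_labels, blank="_"):
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)

# Hypothetical per-frame argmax labels for the audio "hello"
frames = ["h", "h", "_", "e", "l", "l", "_", "l", "o", "o"]
print(ctc_greedy_decode(frames))  # hello
```

Note how the blank between the two `l` runs is what lets the decoder keep a doubled letter instead of collapsing it away.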
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_xls_r_demo_test| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ab| |Size:|446.7 KB| --- layout: model title: Fast Neural Machine Translation Model from English to Chuukese author: John Snow Labs name: opus_mt_en_chk date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, chk, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `chk` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_chk_xx_2.7.0_2.4_1609168399642.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_chk_xx_2.7.0_2.4_1609168399642.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_chk", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_chk", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.chk').predict(text, output_level='sentence') opus_df ```
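MarianTransformer generates each translation by searching the decoder's token distribution, typically keeping the k best-scoring partial hypotheses at every step (beam search). The toy sketch below (plain Python; the probability table and the `beam_search` helper are invented for illustration and not part of any Marian API) shows that mechanism on a four-token vocabulary.

```python
import math

# Made-up next-token probabilities, keyed by the previous token.
PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(k=2, max_len=4):
    beams = [(["<s>"], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            last = tokens[-1]
            if last == "</s>":          # finished hypotheses carry over
                candidates.append((tokens, score))
                continue
            for tok, p in PROBS[last].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # keep only the k highest-scoring partial hypotheses
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
    return beams[0][0]

print(beam_search())  # ['<s>', 'a', 'cat', '</s>']
```

Note that greedy decoding would commit to "the" first, while the beam recovers the globally better "a cat" (0.4 × 0.9 > 0.6 × 0.5).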
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_chk| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from Waray to English author: John Snow Labs name: opus_mt_war_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, war, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `war` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_war_en_xx_2.7.0_2.4_1609170925446.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_war_en_xx_2.7.0_2.4_1609170925446.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_war_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_war_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.war.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_war_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_8_h_256 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-8_H-256` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_8_h_256_zh_4.2.4_3.0_1670326037254.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_8_h_256_zh_4.2.4_3.0_1670326037254.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_8_h_256","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_8_h_256","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
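Once the pipeline has produced token embeddings, a common next step is to compare them with cosine similarity. The sketch below (plain Python; the 4-dimensional vectors and the `cosine` helper are made up for illustration — real BERT embeddings are much higher-dimensional) shows the computation.

```python
import math

# Cosine similarity between two embedding vectors:
# dot product divided by the product of their norms.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [0.2, 0.1, 0.0, 0.4]  # hypothetical embedding of token A
b = [0.1, 0.0, 0.0, 0.5]  # hypothetical embedding of token B
print(round(cosine(a, b), 4))
```

Values near 1.0 indicate near-parallel vectors (semantically close tokens); values near 0 indicate unrelated ones.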
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_8_h_256| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|45.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-8_H-256 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: English BertForSequenceClassification Tiny Cased model (from mrm8488) author: John Snow Labs name: bert_sequence_classifier_tiny_finetuned_yahoo_answers_topics date: 2022-07-13 tags: [en, open_source, bert, sequence_classification] task: Text Classification language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-tiny-finetuned-yahoo_answers_topics` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_tiny_finetuned_yahoo_answers_topics_en_4.0.0_3.0_1657720817515.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_tiny_finetuned_yahoo_answers_topics_en_4.0.0_3.0_1657720817515.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") classifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_tiny_finetuned_yahoo_answers_topics","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, classifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val classifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_tiny_finetuned_yahoo_answers_topics","en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, classifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_tiny_finetuned_yahoo_answers_topics| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|16.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mrm8488/bert-tiny-finetuned-yahoo_answers_topics --- layout: model title: Word2Vec Embeddings in Belarusian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, be, open_source] task: Embeddings language: be edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_be_3.4.1_3.0_1647286503811.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_be_3.4.1_3.0_1647286503811.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","be") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Я люблю Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","be") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Я люблю Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("be.embed.w2v_cc_300d").predict("""Я люблю Spark NLP""") ```
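WordEmbeddingsModel is a lookup-style annotator: every token maps to a fixed vector, and tokens outside the vocabulary fall back to a zero vector. The sketch below (plain Python; the 3-dimensional table and `embed` helper are made up for illustration — the real model uses 300 dimensions) mirrors that behavior.

```python
# Hypothetical lookup table, 3-d instead of the model's 300-d.
EMBEDDINGS = {
    "люблю": [0.1, 0.3, -0.2],
    "Spark": [0.7, 0.0, 0.4],
}
DIM = 3

def embed(tokens):
    """Map each token to its vector; out-of-vocabulary tokens get zeros."""
    zero = [0.0] * DIM
    return [EMBEDDINGS.get(tok, zero) for tok in tokens]

print(embed(["Я", "люблю", "Spark"]))
```

This is also why static word embeddings cannot disambiguate context: the same token always yields the same vector, unlike the contextual BERT-style encoders elsewhere in this catalog.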
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|be| |Size:|998.4 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Fast Neural Machine Translation Model from English to Basque (Family) author: John Snow Labs name: opus_mt_en_euq date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, euq, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `euq` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_euq_xx_2.7.0_2.4_1609164213872.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_euq_xx_2.7.0_2.4_1609164213872.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_euq", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_euq", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.euq').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_euq| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from jgammack) author: John Snow Labs name: roberta_qa_sae_base_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `SAE-roberta-base-squad` is an English model originally trained by `jgammack`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_sae_base_squad_en_4.3.0_3.0_1674208844842.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_sae_base_squad_en_4.3.0_3.0_1674208844842.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_sae_base_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_sae_base_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
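Extractive QA heads of this kind score every token as a possible answer start and end, then pick the best-scoring span with the end no earlier than the start. The sketch below (plain Python; the logits and the `best_span` helper are invented for illustration, not part of the Spark NLP API) shows that selection step.

```python
# Pick the (start, end) pair maximizing start_logit + end_logit,
# subject to end >= start and a maximum span length.
def best_span(start_logits, end_logits, max_len=10):
    best, best_score = (0, 0), float("-inf")
    for s, sl in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = sl + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara"]
start = [0.1, 0.2, 0.1, 2.0]  # made-up start logits
end   = [0.0, 0.1, 0.2, 1.8]  # made-up end logits
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```

The end >= start constraint is what makes this more robust than taking the two argmaxes independently, which could yield an inverted span.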
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_sae_base_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|467.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/jgammack/SAE-roberta-base-squad --- layout: model title: Sentence Entity Resolver for ICD10-PCS (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_icd10pcs language: en nav_key: models repository: clinical/models date: 2020-11-27 task: Entity Resolution edition: Healthcare NLP 2.6.4 spark_version: 2.4 tags: [clinical,entity_resolution,en] supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model maps extracted medical entities to ICD10-PCS codes using chunk embeddings. {:.h2_title} ## Predicted Entities ICD10-PCS Codes and their normalized definition with ``sbiobert_base_cased_mli`` embeddings. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10pcs_en_2.6.4_2.4_1606235760312.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10pcs_en_2.6.4_2.4_1606235760312.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use The ```sbiobertresolve_icd10pcs``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_jsl``` as the NER model, with ```Procedure``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") icd10pcs_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10pcs","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10pcs_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala ... 
val chunk2doc = new Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val icd10pcs_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_icd10pcs","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10pcs_resolver)) val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.h2_title} ## Results ```bash +--------------------+-----+---+---------+-------+----------+--------------------+--------------------+ | chunk|begin|end| entity| code|confidence| resolutions| codes| +--------------------+-----+---+---------+-------+----------+--------------------+--------------------+ | hypertension| 68| 79| PROBLEM|DWY18ZZ| 0.0626|Hyperthermia of H...|DWY18ZZ:::6A3Z1ZZ...| |chronic renal ins...| 83|109| PROBLEM|DTY17ZZ| 0.0722|Contact Radiation...|DTY17ZZ:::04593ZZ...| | COPD| 113|116| PROBLEM|2W04X7Z| 0.0765|Change Intermitte...|2W04X7Z:::0J063ZZ...| | gastritis| 120|128| PROBLEM|04723Z6| 0.0826|Dilation of 
Gastr...|04723Z6:::04724Z6...| | TIA| 136|138| PROBLEM|00F5XZZ| 0.1074|Fragmentation in ...|00F5XZZ:::00F53ZZ...| |a non-ST elevatio...| 182|202| PROBLEM|B307ZZZ| 0.0750|Plain Radiography...|B307ZZZ:::2W59X3Z...| |Guaiac positive s...| 208|229| PROBLEM|3E1G38Z| 0.0886|Irrigation of Upp...|3E1G38Z:::3E1G38X...| |cardiac catheteri...| 295|317| TEST|4A0234Z| 0.0783|Measurement of Ca...|4A0234Z:::4A02X4A...| | PTCA| 324|327|TREATMENT|03SG3ZZ| 0.0507|Reposition Intrac...|03SG3ZZ:::0GCQ3ZZ...| | mid LAD lesion| 332|345| PROBLEM|02H73DZ| 0.0490|Insertion of Intr...|02H73DZ:::02163Z7...| +--------------------+-----+---+---------+-------+----------+--------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---------------|---------------------| | Name: | sbiobertresolve_icd10pcs | | Type: | SentenceEntityResolverModel | | Compatibility: | Spark NLP 2.6.4 + | | License: | Licensed | | Edition: | Official | |Input labels: | [ner_chunk, chunk_embeddings] | |Output labels: | [resolution] | | Language: | en | | Dependencies: | sbiobert_base_cased_mli | {:.h2_title} ## Data Source Trained on ICD10 Procedure Coding System dataset with ``sbiobert_base_cased_mli`` sentence embeddings. https://www.icd10data.com/ICD10PCS/Codes --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_bert_base_uncased_few_shot_k_32_finetuned_squad_seed_0 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`bert-base-uncased-few-shot-k-32-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_few_shot_k_32_finetuned_squad_seed_0_en_4.0.0_3.0_1654180845344.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_few_shot_k_32_finetuned_squad_seed_0_en_4.0.0_3.0_1654180845344.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_few_shot_k_32_finetuned_squad_seed_0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_few_shot_k_32_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased_32d_seed_0").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_few_shot_k_32_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-32-finetuned-squad-seed-0 --- layout: model title: Fast Neural Machine Translation Model from English to Mossi author: John Snow Labs name: opus_mt_en_mos date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, mos, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `mos` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_mos_xx_2.7.0_2.4_1609163677483.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_mos_xx_2.7.0_2.4_1609163677483.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_mos", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Text to translate.") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_mos", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.mos').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_mos| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Amharic Lemmatizer author: John Snow Labs name: lemma date: 2021-01-20 task: Lemmatization language: am edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [am, lemmatizer, open_source] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/TEXT_PREPROCESSING/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TEXT_PREPROCESSING.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_am_2.7.0_2.4_1611181790547.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_am_2.7.0_2.4_1611181790547.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma", "am") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) results = light_pipeline.fullAnnotate(["መጽሐፉን መጽሐፍ ኡ ን አስያዛት አስያዝ ኧ ኣት ።"]) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma", "am") .setInputCols("token") .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("መጽሐፉን መጽሐፍ ኡ ን አስያዛት አስያዝ ኧ ኣት ።").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["መጽሐፉን መጽሐፍ ኡ ን አስያዛት አስያዝ ኧ ኣት ።"] lemma_df = nlu.load('am.lemma').predict(text, output_level = "document") lemma_df.lemma.values[0] ```
## Results ```bash {'lemma': [Annotation(token, 0, 4, _, {'sentence': '0'}), Annotation(token, 6, 9, መጽሐፍ, {'sentence': '0'}), Annotation(token, 11, 11, ኡ, {'sentence': '0'}), Annotation(token, 13, 13, ን, {'sentence': '0'}), Annotation(token, 15, 19, _, {'sentence': '0'}), Annotation(token, 21, 24, አስያዝ, {'sentence': '0'}), Annotation(token, 26, 26, ኧ, {'sentence': '0'}), Annotation(token, 28, 29, ኣት, {'sentence': '0'}), Annotation(token, 31, 31, ።, {'sentence': '0'})]} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|am| ## Data Source The model was trained on the [Universal Dependencies](http://universaldependencies.org) version 2.7. Reference: > Binyam Ephrem Seyoum, Yusuke Miyao and Baye Yimam Mekonnen. 2018. Universal Dependencies for Amharic. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pp. 2216–2222, Miyazaki, Japan: European Language Resources Association (ELRA) --- layout: model title: English DistilBertForQuestionAnswering model (from poom-sci) author: John Snow Labs name: distilbert_qa_qa date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-qa` is an English model originally trained by `poom-sci`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_qa_en_4.0.0_3.0_1654727679141.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_qa_en_4.0.0_3.0_1654727679141.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_qa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_qa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.by_poom-sci").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_qa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/poom-sci/distilbert-qa --- layout: model title: Stopwords Remover for Tswana language (119 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, tn, open_source] task: Stop Words Removal language: tn edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_tn_3.4.1_3.0_1646673194666.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_tn_3.4.1_3.0_1646673194666.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","tn") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["A ga o batle go bina?"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","tn") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("A ga o batle go bina?").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("tn.stopwords").predict("""A ga o batle go bina?""") ```
## Results ```bash +-------------------+ |result | +-------------------+ |[A, batle, bina, ?]| +-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|tn| |Size:|1.7 KB| --- layout: model title: Word2Vec Embeddings in Scots (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, sco, open_source] task: Embeddings language: sco edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sco_3.4.1_3.0_1647455504159.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sco_3.4.1_3.0_1647455504159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sco") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sco") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("sco.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|sco| |Size:|360.9 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_fpdm_hier_ft_new_news date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_hier_roberta_FT_new_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_hier_ft_new_news_en_4.3.0_3.0_1674210874319.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_hier_ft_new_news_en_4.3.0_3.0_1674210874319.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_hier_ft_new_news","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_hier_ft_new_news","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_fpdm_hier_ft_new_news| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|461.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/fpdm_hier_roberta_FT_new_newsqa --- layout: model title: English RobertaForQuestionAnswering Large Cased model (from mbartolo) author: John Snow Labs name: roberta_qa_large_syn_ext date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-synqa-ext` is an English model originally trained by `mbartolo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_large_syn_ext_en_4.2.4_3.0_1669988389452.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_large_syn_ext_en_4.2.4_3.0_1669988389452.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_syn_ext","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_syn_ext","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_large_syn_ext| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbartolo/roberta-large-synqa-ext - https://arxiv.org/abs/2002.00293 - https://arxiv.org/abs/2104.08678 - https://paperswithcode.com/sota?task=Question+Answering&dataset=adversarial_qa --- layout: model title: Chinese BertForMaskedLM Mini Cased model (from hfl) author: John Snow Labs name: bert_embeddings_minirbt_h256 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `minirbt-h256` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_minirbt_h256_zh_4.2.4_3.0_1670326879123.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_minirbt_h256_zh_4.2.4_3.0_1670326879123.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_minirbt_h256","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_minirbt_h256","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_minirbt_h256| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|39.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/minirbt-h256 - https://github.com/iflytek/MiniRBT - https://github.com/ymcui/LERT - https://github.com/ymcui/PERT - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/iflytek/HFL-Anthology --- layout: model title: English T5ForConditionalGeneration Cased model (from philschmid) author: John Snow Labs name: t5_flan_base_samsum date: 2023-03-01 tags: [open_source, t5, flan, en, tensorflow] task: Text Generation language: en edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. flan-t5-base-samsum is an English model originally trained by philschmid. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_flan_base_samsum_en_4.3.0_3.0_1677705397088.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_flan_base_samsum_en_4.3.0_3.0_1677705397088.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_flan_base_samsum","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_flan_base_samsum","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_flan_base_samsum| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|1.0 GB| ## References https://huggingface.co/philschmid/flan-t5-base-samsum --- layout: model title: Legal Transfers Clause Binary Classifier author: John Snow Labs name: legclf_transfers_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `transfers` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level. If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Keep in mind that this model's embeddings support up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. 
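To make the first splitting technique concrete, here is a minimal plain-Python sketch of paragraph splitting (by multiline). The `split_paragraphs` helper and the sample contract text are illustrative assumptions for this card, not code from the Legal NLP workshop.

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs wherever a blank line occurs."""
    # Two or more consecutive newlines (possibly with stray whitespace)
    # are treated as a paragraph boundary.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

contract = (
    "TRANSFERS.\nNeither party may assign this Agreement without consent.\n\n"
    "REPORTS.\nThe Company shall deliver quarterly reports to the Agent."
)
for clause in split_paragraphs(contract):
    print(clause)
```

Each resulting paragraph can then be fed to the classifier as a separate row of the `clause_text` column.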
## Predicted Entities `other`, `transfers` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_transfers_clause_en_1.0.0_3.2_1660123132946.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_transfers_clause_en_1.0.0_3.2_1660123132946.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_transfers_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-----------+ |result     | +-----------+ |[transfers]| |[other]    | |[other]    | |[transfers]| +-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_transfers_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.83 0.93 0.88 99 transfers 0.72 0.49 0.58 37 accuracy - - 0.81 136 macro-avg 0.77 0.71 0.73 136 weighted-avg 0.80 0.81 0.80 136 ``` --- layout: model title: English RobertaForQuestionAnswering Cased model (from Salesforce) author: John Snow Labs name: roberta_qa_discord date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `discord_qa` is an English model originally trained by `Salesforce`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_discord_en_4.3.0_3.0_1674210253072.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_discord_en_4.3.0_3.0_1674210253072.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_discord","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_discord","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_discord| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|845.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Salesforce/discord_qa --- layout: model title: Fast Neural Machine Translation Model from Urdu to English author: John Snow Labs name: opus_mt_ur_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ur, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `ur` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ur_en_xx_2.7.0_2.4_1609166694262.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ur_en_xx_2.7.0_2.4_1609166694262.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ur_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ur_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ur.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ur_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Reports Clause Binary Classifier author: John Snow Labs name: legclf_reports_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `reports` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that this model's embeddings allow up to 512 tokens; if your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. 
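The paragraph splitting (by multiline) mentioned above can be approximated outside Spark NLP with a few lines of plain Python; this is a toy sketch of the idea, not the workshop implementation:

```python
import re

def split_paragraphs(text: str) -> list[str]:
    # Split on one or more blank lines and drop empty chunks.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Hypothetical document with three clause-like paragraphs.
doc = "Clause 1. The Tenant shall...\n\nClause 2. The Landlord shall...\n\n\nClause 3. Reports..."
print(split_paragraphs(doc))
```

Each resulting chunk can then be fed to the classifier as a separate `clause_text` row.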
## Predicted Entities `other`, `reports` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_reports_clause_en_1.0.0_3.2_1660123946698.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_reports_clause_en_1.0.0_3.2_1660123946698.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_reports_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[reports]| |[other]| |[other]| |[reports]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_reports_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.94 0.95 0.94 62 reports 0.91 0.88 0.90 34 accuracy - - 0.93 96 macro-avg 0.92 0.92 0.92 96 weighted-avg 0.93 0.93 0.93 96 ``` --- layout: model title: English XlmRoBertaForQuestionAnswering (from DHBaek) author: John Snow Labs name: xlm_roberta_qa_xlm_roberta_large_korquad_mask date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-large-korquad-mask` is an English model originally trained by `DHBaek`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_large_korquad_mask_en_4.0.0_3.0_1655995736555.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_large_korquad_mask_en_4.0.0_3.0_1655995736555.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_large_korquad_mask","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_large_korquad_mask","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.korquad.xlm_roberta.large").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
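The nlu one-liner above packs the question and the context into a single string separated by `|||`. A minimal pure-Python sketch of splitting such a payload back into its two parts (the separator convention is the only assumption here):

```python
def split_qa_payload(payload: str) -> tuple[str, str]:
    """Split an nlu-style 'question|||context' string into its two parts."""
    question, _, context = payload.partition("|||")
    return question.strip(), context.strip()

q, c = split_qa_payload("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
print(c)  # My name is Clara and I live in Berkeley.
```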
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_roberta_large_korquad_mask| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|2.1 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/DHBaek/xlm-roberta-large-korquad-mask --- layout: model title: Fast Neural Machine Translation Model from South Slavic to English author: John Snow Labs name: opus_mt_zls_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, zls, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `zls` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_zls_en_xx_2.7.0_2.4_1609168621397.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_zls_en_xx_2.7.0_2.4_1609168621397.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_zls_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_zls_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.zls.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
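The pipeline runs a sentence detector before the MarianTransformer because the model translates sentence by sentence. As a toy illustration of that pre-segmentation step (a naive regex stand-in, not the pretrained SentenceDetectorDLModel):

```python
import re

def naive_sentence_split(text: str) -> list[str]:
    # Toy stand-in for SentenceDetectorDLModel: split after ., ! or ?
    # followed by whitespace. Real sentence detection is far more robust.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(naive_sentence_split("Hello there. How are you? Fine!"))
```

Each resulting sentence is what the translation model actually sees as one input.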
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_zls_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: German asr_exp_w2v2t_vp_s946 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2t_vp_s946 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2t_vp_s946` is a German model originally trained by jonatasgrosman. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_exp_w2v2t_vp_s946_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2t_vp_s946_de_4.2.0_3.0_1664110287042.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2t_vp_s946_de_4.2.0_3.0_1664110287042.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2t_vp_s946", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2t_vp_s946", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
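Both snippets assume an existing `audioDf` whose `audio_content` column holds arrays of floats. As a stdlib-only sketch of producing such a float array from a 16-bit mono WAV file (the sample rate, duration, and temporary-file handling here are illustrative assumptions, not Spark NLP requirements):

```python
import math
import os
import struct
import tempfile
import wave

# Write a short 16 kHz mono 16-bit sine wave to a temporary WAV file...
rate, seconds, freq = 16000, 0.01, 440.0
samples = [int(32767 * math.sin(2 * math.pi * freq * i / rate))
           for i in range(int(rate * seconds))]

with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
    path = f.name
with wave.open(path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(rate)
    w.writeframes(struct.pack("<%dh" % len(samples), *samples))

# ...then read it back as the normalized [-1.0, 1.0] float array that an
# audio_content column would carry.
with wave.open(path, "rb") as w:
    raw = w.readframes(w.getnframes())
floats = [s / 32768.0 for s in struct.unpack("<%dh" % (len(raw) // 2), raw)]
os.remove(path)
print(len(floats))
```

Rows of such float arrays (one per recording) are what you would load into `audioDf` before calling `pipeline.fit` / `transform`.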
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2t_vp_s946| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: Named Entity Recognition Profiling (Biobert) author: John Snow Labs name: ner_profiling_biobert date: 2023-04-28 tags: [licensed, en, clinical, biobert, profiling, ner_profiling, ner] task: [Named Entity Recognition, Pipeline Healthcare] language: en edition: Healthcare NLP 4.4.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to explore all the available pretrained NER models at once. When you run this pipeline over your text, you will end up with the predictions coming out of each pretrained clinical NER model trained with `biobert_pubmed_base_cased`. It has been updated by adding NER model outputs to the previous version. 
Here are the NER models that this pretrained pipeline includes: `jsl_ner_wip_greedy_biobert`, `jsl_rd_ner_wip_greedy_biobert`, `ner_ade_biobert`, `ner_anatomy_biobert`, `ner_anatomy_coarse_biobert`, `ner_bionlp_biobert`, `ner_cellular_biobert`, `ner_chemprot_biobert`, `ner_clinical_biobert`, `ner_deid_biobert`, `ner_deid_enriched_biobert`, `ner_diseases_biobert`, `ner_events_biobert`, `ner_human_phenotype_gene_biobert`, `ner_human_phenotype_go_biobert`, `ner_jsl_biobert`, `ner_jsl_enriched_biobert`, `ner_jsl_greedy_biobert`, `ner_living_species_biobert`, `ner_posology_biobert`, `ner_posology_large_biobert`, `ner_risk_factors_biobert` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_profiling_biobert_en_4.4.0_3.0_1682667240497.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_profiling_biobert_en_4.4.0_3.0_1682667240497.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline ner_profiling_pipeline = PretrainedPipeline('ner_profiling_biobert', 'en', 'clinical/models') result = ner_profiling_pipeline.annotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.""") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val ner_profiling_pipeline = PretrainedPipeline("ner_profiling_biobert", "en", "clinical/models") val result = ner_profiling_pipeline.annotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.""") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.profiling_biobert").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.""") ```
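Each included NER model contributes its own list of (chunk, label) pairs to the profiling output. A minimal sketch of summarizing one model's pairs by label (the sample data below is hypothetical):

```python
from collections import Counter

# Hypothetical (chunk, label) pairs as one NER model might emit them.
chunks = [("gestational diabetes mellitus", "PROBLEM"),
          ("polyuria", "PROBLEM"),
          ("a body mass index", "TEST"),
          ("BMI", "TEST"),
          ("vomiting", "PROBLEM")]

label_counts = Counter(label for _, label in chunks)
print(label_counts)  # Counter({'PROBLEM': 3, 'TEST': 2})
```

A summary like this is handy when comparing which of the profiled models fires most on your text.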
## Results ```bash ******************** ner_diseases_biobert Model Results ******************** [('gestational diabetes mellitus', 'Disease'), ('type two diabetes mellitus', 'Disease'), ('T2DM', 'Disease'), ('HTG-induced pancreatitis', 'Disease'), ('hepatitis', 'Disease'), ('obesity', 'Disease'), ('polyuria', 'Disease'), ('polydipsia', 'Disease'), ('poor appetite', 'Disease'), ('vomiting', 'Disease')] ******************** ner_events_biobert Model Results ******************** [('gestational diabetes mellitus', 'PROBLEM'), ('eight years', 'DURATION'), ('presentation', 'OCCURRENCE'), ('type two diabetes mellitus ( T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('three years', 'DURATION'), ('presentation', 'OCCURRENCE'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index', 'TEST'), ('BMI', 'TEST'), ('presented', 'OCCURRENCE'), ('a one-week', 'DURATION'), ('polyuria', 'PROBLEM'), ('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')] ******************** ner_jsl_biobert Model Results ******************** [('28-year-old', 'Age'), ('female', 'Gender'), ('gestational diabetes mellitus', 'Diabetes'), ('eight years prior', 'RelativeDate'), ('type two diabetes mellitus', 'Diabetes'), ('T2DM', 'Disease_Syndrome_Disorder'), ('HTG-induced pancreatitis', 'Disease_Syndrome_Disorder'), ('three years prior', 'RelativeDate'), ('acute', 'Modifier'), ('hepatitis', 'Disease_Syndrome_Disorder'), ('obesity', 'Obesity'), ('body mass index', 'BMI'), ('BMI ) of 33.5 kg/m2', 'BMI'), ('one-week', 'Duration'), ('polyuria', 'Symptom'), ('polydipsia', 'Symptom'), ('poor appetite', 'Symptom'), ('vomiting', 'Symptom')] ******************** ner_clinical_biobert Model Results ******************** [('gestational diabetes mellitus', 'PROBLEM'), ('subsequent type two diabetes mellitus ( T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index ( BMI )', 
'TEST'), ('polyuria', 'PROBLEM'), ('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')] ******************** ner_risk_factors_biobert Model Results ******************** [('diabetes mellitus', 'DIABETES'), ('subsequent type two diabetes mellitus', 'DIABETES'), ('obesity', 'OBESE')] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_profiling_biobert| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|766.5 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - Finisher --- layout: model title: English DistilBertForQuestionAnswering model (from arvalinno) author: John Snow Labs name: distilbert_qa_arvalinno_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark 
NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `arvalinno`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_arvalinno_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725034094.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_arvalinno_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725034094.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_arvalinno_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_arvalinno_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_arvalinno").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_arvalinno_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/arvalinno/distilbert-base-uncased-finetuned-squad --- layout: model title: Japanese BertForMaskedLM Base Cased model (from cl-tohoku) author: John Snow Labs name: bert_embeddings_base_japanese date: 2022-12-02 tags: [ja, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ja edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-japanese` is a Japanese model originally trained by `cl-tohoku`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_japanese_ja_4.2.4_3.0_1670018109829.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_japanese_ja_4.2.4_3.0_1670018109829.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_japanese","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_japanese","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
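Token embeddings produced by a pipeline like this are typically compared with cosine similarity. A dependency-free sketch of the metric (toy 2-d vectors, not real BERT outputs):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product over the product of magnitudes.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine([1.0, 0.0], [1.0, 0.0]), 3))  # 1.0 (identical direction)
print(round(cosine([1.0, 0.0], [0.0, 1.0]), 3))  # 0.0 (orthogonal)
```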
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_japanese| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Size:|415.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/cl-tohoku/bert-base-japanese - https://github.com/google-research/bert - https://github.com/cl-tohoku/bert-japanese/tree/v1.0 - https://github.com/attardi/wikiextractor - https://taku910.github.io/mecab/ - https://creativecommons.org/licenses/by-sa/3.0/ - https://www.tensorflow.org/tfrc/ --- layout: model title: Translate English to Bulgarian Pipeline author: John Snow Labs name: translate_en_bg date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, bg, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as GPU is recommended. 
- source languages: `en` - target languages: `bg` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_bg_xx_2.7.0_2.4_1609689394205.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_bg_xx_2.7.0_2.4_1609689394205.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_bg", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_bg", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.bg').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_bg| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate Celtic languages to English Pipeline author: John Snow Labs name: translate_cel_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, cel, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as GPU is recommended. - source languages: `cel` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_cel_en_xx_2.7.0_2.4_1609690102873.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_cel_en_xx_2.7.0_2.4_1609690102873.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_cel_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_cel_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.cel.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_cel_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: German asr_wav2vec2_large_xlsr_53_german_by_facebook TFWav2Vec2ForCTC from facebook author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_german_by_facebook date: 2022-09-24 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_german_by_facebook` is a German model originally trained by facebook. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_german_by_facebook_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_german_by_facebook_de_4.2.0_3.0_1664026440041.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_german_by_facebook_de_4.2.0_3.0_1664026440041.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_german_by_facebook', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_german_by_facebook", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_german_by_facebook| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|756.3 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_tingtingyuli TFWav2Vec2ForCTC from tingtingyuli author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_tingtingyuli date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_tingtingyuli` is an English model originally trained by tingtingyuli. NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use `pipeline_asr_wav2vec2_base_timit_demo_colab_by_tingtingyuli_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_tingtingyuli_en_4.2.0_3.0_1664108385340.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_tingtingyuli_en_4.2.0_3.0_1664108385340.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_tingtingyuli', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_tingtingyuli", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_tingtingyuli| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|349.4 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Context Spell Checker for English author: John Snow Labs name: spellcheck_dl date: 2022-04-02 tags: [spellchecker, en, open_source] task: Spell Check language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 2.4 supported: true annotator: ContextSpellCheckerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Spell Checker is a sequence-to-sequence model that detects and corrects spelling errors in your input text. It is based on a Levenshtein automaton for generating candidate corrections and a neural language model for ranking them. The model is built for PySpark 2.4.x users running Spark NLP 3.4.2 and above. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CONTEXTUAL_SPELL_CHECKER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spellcheck_dl_en_3.4.2_2.4_1648913172863.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/spellcheck_dl_en_3.4.2_2.4_1648913172863.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = RecursiveTokenizer()\ .setInputCols(["document"])\ .setOutputCol("token")\ .setPrefixes(["\"", "“", "(", "[", "\n", "."]) \ .setSuffixes(["\"", "”", ".", ",", "?", ")", "]", "!", ";", ":", "'s", "’s"]) spellModel = ContextSpellCheckerModel\ .pretrained("spellcheck_dl", "en")\ .setInputCols("token")\ .setOutputCol("checked") pipeline = Pipeline(stages = [documentAssembler, tokenizer, spellModel]) empty_df = spark.createDataFrame([[""]]).toDF("text") lp = LightPipeline(pipeline.fit(empty_df)) text = ["During the summer we have the best ueather.", "I have a black ueather jacket, so nice."] lp.annotate(text) ``` ```scala val assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new RecursiveTokenizer() .setInputCols(Array("document")) .setOutputCol("token") .setPrefixes(Array("\"", "“", "(", "[", "\n", ".")) .setSuffixes(Array("\"", "”", ".", ",", "?", ")", "]", "!", ";", ":", "'s", "’s")) val spellChecker = ContextSpellCheckerModel. pretrained("spellcheck_dl", "en"). setInputCols("token"). setOutputCol("checked") val pipeline = new Pipeline().setStages(Array(assembler, tokenizer, spellChecker)) val empty_df = Seq("").toDF("text") val lp = new LightPipeline(pipeline.fit(empty_df)) val text = Array("During the summer we have the best ueather.", "I have a black ueather jacket, so nice.") lp.annotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("spell").predict("""During the summer we have the best ueather.""") ```
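The two-stage design described above (generate candidates within a small edit distance, then rank them) can be sketched in plain Python. This is an illustrative toy, not the Spark NLP implementation: the vocabulary and its frequency scores are invented, and a simple unigram score stands in for the neural language model.

```python
# Toy sketch of a context spell checker's two stages:
# (1) generate candidate corrections within a small edit distance,
# (2) rank candidates with a scoring model (here a made-up unigram
#     frequency table standing in for the neural language model).

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via a one-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

# Invented vocabulary and frequency scores (an assumption for this sketch).
VOCAB = {"weather": 0.9, "leather": 0.4, "whether": 0.7, "summer": 0.8}

def correct(word: str, max_dist: int = 2) -> str:
    candidates = [(v, s) for v, s in VOCAB.items() if edit_distance(word, v) <= max_dist]
    if not candidates:
        return word  # nothing close enough; keep the original token
    # Rank by frequency; the real model scores candidates in sentence context.
    return max(candidates, key=lambda vs: vs[1])[0]

print(correct("ueather"))  # "weather" and "leather" are both distance 1; frequency breaks the tie -> "weather"
```

Context is what separates the real model from this toy: in the example sentences above it picks "weather" after "summer" but "leather" before "jacket", which a context-free frequency table cannot do.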
## Results ```bash [{'checked': ['During', 'the', 'summer', 'we', 'have', 'the', 'best', 'weather', '.'], 'document': ['During the summer we have the best ueather.'], 'token': ['During', 'the', 'summer', 'we', 'have', 'the', 'best', 'ueather', '.']}, {'checked': ['I', 'have', 'a', 'black', 'leather', 'jacket', ',', 'so', 'nice', '.'], 'document': ['I have a black ueather jacket, so nice.'], 'token': ['I', 'have', 'a', 'black', 'ueather', 'jacket', ',', 'so', 'nice', '.']}] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|spellcheck_dl| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[corrected]| |Language:|en| |Size:|99.4 MB| ## References Combination of custom public data sets. --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_fpdm_soup_model_squad2.0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_roberta_soup_model_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_soup_model_squad2.0_en_4.3.0_3.0_1674211121560.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_soup_model_squad2.0_en_4.3.0_3.0_1674211121560.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_soup_model_squad2.0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_soup_model_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_fpdm_soup_model_squad2.0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|460.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/fpdm_roberta_soup_model_squad2.0 --- layout: model title: English BertForQuestionAnswering model author: John Snow Labs name: bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-whole-word-masking-finetuned-squad` is an English model originally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad_en_4.0.0_3.0_1654537083082.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad_en_4.0.0_3.0_1654537083082.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.large_uncased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
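For intuition, an extractive QA model like the one above scores every context token as a potential answer start and end, and the decoded answer is the highest-scoring valid span (end no earlier than start). A minimal sketch of that span decoding, with invented tokens and scores rather than actual model output:

```python
# Toy span decoding for extractive question answering: pick the (start, end)
# pair with the highest combined score, subject to end >= start and a length cap.

def best_span(start_scores, end_scores, max_len=15):
    best = (0, 0, float("-inf"))
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = ss + end_scores[e]
            if score > best[2]:
                best = (s, e, score)
    return best[:2]

# Invented per-token scores for the example context used in the snippet above.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.0]
end   = [0.1, 0.1, 0.1, 4.0, 0.1, 0.1, 0.1, 0.1, 0.5, 0.0]

s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```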
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad - https://github.com/google-research/bert - https://arxiv.org/abs/1810.04805 - https://en.wikipedia.org/wiki/English_Wikipedia - https://yknzhu.wixsite.com/mbweb --- layout: model title: Detect Persons, Locations, Organizations and Misc Entities in German (WikiNER 840B 300) author: John Snow Labs name: wikiner_840B_300 date: 2020-02-03 task: Named Entity Recognition language: de edition: Spark NLP 2.4.0 spark_version: 2.4 tags: [ner, de, open_source] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 840B 300 is trained with GloVe 840B 300 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_DE){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_DE.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_840B_300_de_2.4.0_2.4_1579699913555.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_840B_300_de_2.4.0_2.4_1579699913555.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("wikiner_840B_300", "de") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (* 28. Oktober 1955 in London) ist ein US-amerikanischer Geschäftsmann, Softwareentwickler, Investor und Philanthrop. Er ist bekannt als Mitbegründer der Microsoft Corporation. Während seiner Karriere bei Microsoft war Gates Vorsitzender, Chief Executive Officer (CEO), Präsident und Chief Software Architect und bis Mai 2014 der größte Einzelaktionär. Er ist einer der bekanntesten Unternehmer und Pioniere der Mikrocomputer-Revolution der 1970er und 1980er Jahre. Gates wurde in Seattle, Washington, geboren und wuchs dort auf. 1975 gründete er Microsoft zusammen mit seinem Freund aus Kindertagen, Paul Allen, in Albuquerque, New Mexico. Es entwickelte sich zum weltweit größten Unternehmen für Personal-Computer-Software. Gates leitete das Unternehmen als Chairman und CEO, bis er im Januar 2000 als CEO zurücktrat. Er blieb jedoch Chairman und wurde Chief Software Architect. In den späten neunziger Jahren wurde Gates für seine Geschäftstaktiken kritisiert, die als wettbewerbswidrig angesehen wurden. Diese Meinung wurde durch zahlreiche Gerichtsurteile bestätigt. Im Juni 2006 gab Gates bekannt, dass er eine Teilzeitstelle bei Microsoft und eine Vollzeitstelle bei der Bill & Melinda Gates Foundation, der privaten gemeinnützigen Stiftung, die er und seine Frau Melinda Gates im Jahr 2000 gegründet haben, übernehmen wird. 
Er übertrug seine Aufgaben nach und nach auf Ray Ozzie und Craig Mundie. Im Februar 2014 trat er als Vorsitzender von Microsoft zurück und übernahm eine neue Position als Technologieberater, um den neu ernannten CEO Satya Nadella zu unterstützen.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("wikiner_840B_300", "de") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (* 28. Oktober 1955 in London) ist ein US-amerikanischer Geschäftsmann, Softwareentwickler, Investor und Philanthrop. Er ist bekannt als Mitbegründer der Microsoft Corporation. Während seiner Karriere bei Microsoft war Gates Vorsitzender, Chief Executive Officer (CEO), Präsident und Chief Software Architect und bis Mai 2014 der größte Einzelaktionär. Er ist einer der bekanntesten Unternehmer und Pioniere der Mikrocomputer-Revolution der 1970er und 1980er Jahre. Gates wurde in Seattle, Washington, geboren und wuchs dort auf. 1975 gründete er Microsoft zusammen mit seinem Freund aus Kindertagen, Paul Allen, in Albuquerque, New Mexico. Es entwickelte sich zum weltweit größten Unternehmen für Personal-Computer-Software. Gates leitete das Unternehmen als Chairman und CEO, bis er im Januar 2000 als CEO zurücktrat. Er blieb jedoch Chairman und wurde Chief Software Architect. In den späten neunziger Jahren wurde Gates für seine Geschäftstaktiken kritisiert, die als wettbewerbswidrig angesehen wurden. Diese Meinung wurde durch zahlreiche Gerichtsurteile bestätigt. 
Im Juni 2006 gab Gates bekannt, dass er eine Teilzeitstelle bei Microsoft und eine Vollzeitstelle bei der Bill & Melinda Gates Foundation, der privaten gemeinnützigen Stiftung, die er und seine Frau Melinda Gates im Jahr 2000 gegründet haben, übernehmen wird. Er übertrug seine Aufgaben nach und nach auf Ray Ozzie und Craig Mundie. Im Februar 2014 trat er als Vorsitzender von Microsoft zurück und übernahm eine neue Position als Technologieberater, um den neu ernannten CEO Satya Nadella zu unterstützen.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (* 28. Oktober 1955 in London) ist ein US-amerikanischer Geschäftsmann, Softwareentwickler, Investor und Philanthrop. Er ist bekannt als Mitbegründer der Microsoft Corporation. Während seiner Karriere bei Microsoft war Gates Vorsitzender, Chief Executive Officer (CEO), Präsident und Chief Software Architect und bis Mai 2014 der größte Einzelaktionär. Er ist einer der bekanntesten Unternehmer und Pioniere der Mikrocomputer-Revolution der 1970er und 1980er Jahre. Gates wurde in Seattle, Washington, geboren und wuchs dort auf. 1975 gründete er Microsoft zusammen mit seinem Freund aus Kindertagen, Paul Allen, in Albuquerque, New Mexico. Es entwickelte sich zum weltweit größten Unternehmen für Personal-Computer-Software. Gates leitete das Unternehmen als Chairman und CEO, bis er im Januar 2000 als CEO zurücktrat. Er blieb jedoch Chairman und wurde Chief Software Architect. In den späten neunziger Jahren wurde Gates für seine Geschäftstaktiken kritisiert, die als wettbewerbswidrig angesehen wurden. Diese Meinung wurde durch zahlreiche Gerichtsurteile bestätigt. Im Juni 2006 gab Gates bekannt, dass er eine Teilzeitstelle bei Microsoft und eine Vollzeitstelle bei der Bill & Melinda Gates Foundation, der privaten gemeinnützigen Stiftung, die er und seine Frau Melinda Gates im Jahr 2000 gegründet haben, übernehmen wird. 
Er übertrug seine Aufgaben nach und nach auf Ray Ozzie und Craig Mundie. Im Februar 2014 trat er als Vorsitzender von Microsoft zurück und übernahm eine neue Position als Technologieberater, um den neu ernannten CEO Satya Nadella zu unterstützen."""] ner_df = nlu.load('de.ner.wikiner.glove.840B_300').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
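For intuition, the `ner_converter` stage elided with `...` in the pipeline above merges the token-level IOB tags emitted by the NER model into entity chunks. A minimal sketch of that grouping logic, using an invented tag sequence rather than actual model output:

```python
# Toy IOB-to-chunk conversion: group B-/I- tagged tokens into (chunk, label)
# pairs, closing the current chunk on "O" or on a new B- tag.

def iob_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-"):
            current.append(tok)
        else:  # "O" closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

# Invented example in the spirit of the German text above.
tokens = ["Gates", "wurde", "in", "Seattle", ",", "Washington", "geboren"]
tags   = ["B-PER", "O", "O", "B-LOC", "O", "B-LOC", "O"]
print(iob_to_chunks(tokens, tags))
# -> [('Gates', 'PER'), ('Seattle', 'LOC'), ('Washington', 'LOC')]
```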
{:.h2_title} ## Results ```bash +--------------------------+---------+ |chunk |ner_label| +--------------------------+---------+ |William Henry Gates III |PER | |London |LOC | |US-amerikanischer |MISC | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PER | |Mikrocomputer-Revolution |ORG | |Gates |PER | |Seattle |LOC | |Washington |LOC | |Microsoft |ORG | |Paul Allen |PER | |Albuquerque |LOC | |New Mexico |LOC | |Personal-Computer-Software|ORG | |Gates |ORG | |CEO |ORG | |Gates |PER | |Gates |PER | |Microsoft |ORG | +--------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wikiner_840B_300| |Type:|ner| |Compatibility:|Spark NLP 2.4.0| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|de| |Case sensitive:|false| {:.h2_title} ## Data Source The model is trained on data from [https://de.wikipedia.org](https://de.wikipedia.org) --- layout: model title: Dutch RoBERTa Embeddings (Bort) author: John Snow Labs name: roberta_embeddings_robbertje_1_gb_bort date: 2022-04-14 tags: [roberta, embeddings, nl, open_source] task: Embeddings language: nl edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `robbertje-1-gb-bort` is a Dutch model originally trained by `DTAI-KULeuven`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_robbertje_1_gb_bort_nl_3.4.2_3.0_1649949030427.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_robbertje_1_gb_bort_nl_3.4.2_3.0_1649949030427.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_robbertje_1_gb_bort","nl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ik hou van vonk nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_robbertje_1_gb_bort","nl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ik hou van vonk nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("nl.embed.robbertje_1_gb_bort").predict("""Ik hou van vonk nlp""") ```
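Downstream, token embeddings like those produced above are typically consumed by comparing vectors, most commonly with cosine similarity. A self-contained sketch using invented 4-dimensional vectors (real model embeddings are much higher-dimensional):

```python
# Toy cosine similarity over invented word vectors, illustrating how embedding
# output is usually consumed downstream.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

hond  = [0.8, 0.1, 0.3, 0.2]  # invented vector for "hond" (dog)
kat   = [0.7, 0.2, 0.4, 0.1]  # invented vector for "kat" (cat)
fiets = [0.0, 0.9, 0.1, 0.8]  # invented vector for "fiets" (bicycle)

# With these made-up vectors, the related pair scores higher than the unrelated one.
print(cosine(hond, kat) > cosine(hond, fiets))  # -> True
```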
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_robbertje_1_gb_bort| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|nl| |Size:|172.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/DTAI-KULeuven/robbertje-1-gb-bort - http://github.com/iPieter/robbert - http://github.com/iPieter/robbertje - https://www.clinjournal.org/clinj/article/view/131 - https://www.clin31.ugent.be - https://arxiv.org/abs/2101.05716 --- layout: model title: Finnish asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP TFWav2Vec2ForCTC from Finnish-NLP author: John Snow Labs name: asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP` is a Finnish model originally trained by Finnish-NLP. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use `asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP_fi_4.2.0_3.0_1664040123361.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP_fi_4.2.0_3.0_1664040123361.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP", "fi")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP", "fi") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xlsr_300m_finnish_lm_by_Finnish_NLP| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|fi| |Size:|1.2 GB| --- layout: model title: Danish asr_xls_r_300m_danish_nst_cv9 TFWav2Vec2ForCTC from chcaa author: John Snow Labs name: asr_xls_r_300m_danish_nst_cv9 date: 2022-09-25 tags: [wav2vec2, da, audio, open_source, asr] task: Automatic Speech Recognition language: da edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_300m_danish_nst_cv9` is a Danish model originally trained by chcaa. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use `asr_xls_r_300m_danish_nst_cv9_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xls_r_300m_danish_nst_cv9_da_4.2.0_3.0_1664102550304.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xls_r_300m_danish_nst_cv9_da_4.2.0_3.0_1664102550304.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_xls_r_300m_danish_nst_cv9", "da")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_xls_r_300m_danish_nst_cv9", "da") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_xls_r_300m_danish_nst_cv9| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|da| |Size:|757.6 MB| --- layout: model title: English RobertaForQuestionAnswering (from rahulchakwate) author: John Snow Labs name: roberta_qa_rahulchakwate_roberta_base_finetuned_squad date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad` is an English model originally trained by `rahulchakwate`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rahulchakwate_roberta_base_finetuned_squad_en_4.0.0_3.0_1655734360971.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rahulchakwate_roberta_base_finetuned_squad_en_4.0.0_3.0_1655734360971.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_rahulchakwate_roberta_base_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_rahulchakwate_roberta_base_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base.by_rahulchakwate").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rahulchakwate_roberta_base_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/rahulchakwate/roberta-base-finetuned-squad --- layout: model title: English BertForQuestionAnswering model (from Intel) author: John Snow Labs name: bert_qa_bert_large_uncased_squadv1.1_sparse_90_unstructured date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-squadv1.1-sparse-90-unstructured` is an English model originally trained by `Intel`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_squadv1.1_sparse_90_unstructured_en_4.0.0_3.0_1654536793950.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_squadv1.1_sparse_90_unstructured_en_4.0.0_3.0_1654536793950.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_squadv1.1_sparse_90_unstructured","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_large_uncased_squadv1.1_sparse_90_unstructured","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.large_uncased_sparse_90_unstructured.by_Intel").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
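Conceptually, span-extraction annotators like the one above score every token of the context as a potential answer start and answer end, then return the highest-scoring valid span. A minimal plain-Python sketch of that selection step (illustrative only — this is not the Spark NLP API, and the scores below are made-up numbers):

```python
# Illustrative sketch: how an extractive QA head turns per-token start/end
# scores into an answer span. Not the Spark NLP API; scores are invented.
def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) indices of the highest-scoring span."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best[0], best[1]

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end   = [0.1, 0.1, 0.1, 4.8, 0.2, 0.1, 0.1, 0.1, 0.4, 0.1]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # prints "Clara"
```

In the pipeline above, this selected span is what lands in the `answer` output column.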
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_large_uncased_squadv1.1_sparse_90_unstructured| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|363.2 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Intel/bert-large-uncased-squadv1.1-sparse-90-unstructured - https://arxiv.org/abs/2111.05754 - https://github.com/IntelLabs/Model-Compression-Research-Package/tree/main/research/prune-once-for-all --- layout: model title: Stopwords Remover for Irish language (107 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, ga, open_source] task: Stop Words Removal language: ga edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_ga_3.4.1_3.0_1646673113131.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_ga_3.4.1_3.0_1646673113131.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","ga") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Níl tú níos fearr ná mise"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","ga") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Níl tú níos fearr ná mise").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ga.stopwords").predict("""Níl tú níos fearr ná mise""") ```
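Conceptually, the `StopWordsCleaner` stage is a simple token filter. A plain-Python sketch of that behavior, assuming (for illustration only) a two-word subset `{"tú", "ná"}` of the full 107-entry Irish stopword list and case-insensitive matching:

```python
# Conceptual sketch of the StopWordsCleaner stage: drop tokens found in the
# stopword list. The two stopwords below are an assumed subset of the full
# 107-entry stopwords-iso Irish list; matching is assumed case-insensitive.
stopwords = {"tú", "ná"}

def clean_tokens(tokens, stopwords):
    return [t for t in tokens if t.lower() not in stopwords]

print(clean_tokens("Níl tú níos fearr ná mise".split(), stopwords))
# ['Níl', 'níos', 'fearr', 'mise']
```

The surviving tokens are what appear in the `cleanTokens` output column.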
## Results ```bash +------------------------+ |result                  | +------------------------+ |[Níl, níos, fearr, mise]| +------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|ga| |Size:|1.6 KB| --- layout: model title: Visual Document NER with SROIE author: John Snow Labs name: visual_document_NER_SROIE0526_en_3.0.0_3.0.1_1621990933091 date: 2021-05-26 tags: [en, licensed] supported: true task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This NER model is based on the LayoutLM pre-trained model and fine-tuned on the SROIE dataset. ## Predicted Entities - O - B-DATE - B-COMPANY - B-TOTAL {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/visual_document_NER_SROIE0526_en_3.0.0_3.0.1_1621990933091_en_3.0.0_3.0_1622015545288.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/visual_document_NER_SROIE0526_en_3.0.0_3.0.1_1621990933091_en_3.0.0_3.0_1622015545288.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use After feeding hOCR as the input, this model predicts the related entity per word/token.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ocr = ImageToHocr()\ .setInputCol("image")\ .setOutputCol("hocr")\ .setIgnoreResolution(False)\ .setOcrParams(["preserve_interword_spaces=0"]) doc_ner = VisualDocumentNer\ .pretrained("visual_document_NER_SROIE0526", "en", "public/ocr/models") \ .setInputCol("hocr")\ .setOutputCol("label") df = doc_ner.transform(ocr.transform(visual_document_df)) path_array = split(df['path'], '/') df.withColumn('filename', path_array.getItem(size(path_array) - 1)) \ .select("filename", "entities", "label") \ .show(truncate=False) ``` ```scala val ocr = new ImageToHocr() .setInputCol("corrected_image") .setOutputCol("hocr") .setIgnoreResolution(false) .setOcrParams(Array("preserve_interword_spaces=0")) val visualDocumentNER = VisualDocumentNER .pretrained("visual_document_NER_SROIE0526", "en", "public/ocr/models") .setInputCol("hocr") .setOutputCol("label") val results = visualDocumentNER.transform(ocr.transform(visualDocumentDf)) ```
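The `hocr` column produced by `ImageToHocr` is HTML in which each recognized word is an element carrying its bounding box; the NER model then predicts one label per such word/token. A minimal sketch of what that input looks like and how word tokens could be pulled out of it (the fragment below is an assumed, simplified hOCR snippet, not actual `ImageToHocr` output):

```python
import re

# Assumed, simplified hOCR fragment: each OCR'd word is a span of class
# "ocrx_word" whose title attribute carries the word's bounding box.
hocr = ('<span class="ocrx_word" title="bbox 10 10 60 30">LEMON</span> '
        '<span class="ocrx_word" title="bbox 70 10 130 30">TREE</span>')

# Extract the word texts; one NER label is predicted per word like these.
words = re.findall(r'class="ocrx_word"[^>]*>([^<]+)</span>', hocr)
print(words)  # ['LEMON', 'TREE']
```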
## Results ```bash +------------------------------------------------------------------------+---------+ |entities |label | +------------------------------------------------------------------------+---------+ |[entity, 0, 0, O, [word -> [1060, token -> [], []] |O | |[entity, 0, 0, O, [word -> [1060, token -> 1060], []] |O | |[entity, 0, 0, O, [word -> [1060, token -> 1060], []] |O | |[entity, 0, 0, O, [word -> 257, token -> 257], []] |O | |[entity, 0, 0, O, [word -> LEMON, token -> lemon], []] |O | |[entity, 0, 0, O, [word -> TREE, token -> tree], []] |O | |[entity, 0, 0, B-COMPANY, [word -> RESTAURANT, token -> restaurant], []]|B-COMPANY| |[entity, 0, 0, B-COMPANY, [word -> JTJ, token -> jtj], []] |B-COMPANY| |[entity, 0, 0, B-COMPANY, [word -> JTJ, token -> jtj], []] |B-COMPANY| |[entity, 0, 0, B-COMPANY, [word -> JTJ, token -> jtj], []] |B-COMPANY| |[entity, 0, 0, B-COMPANY, [word -> FOODS, token -> foods], []] |B-COMPANY| |[entity, 0, 0, B-COMPANY, [word -> SDN, token -> sdn], []] |B-COMPANY| |[entity, 0, 0, B-COMPANY, [word -> SDN, token -> sdn], []] |B-COMPANY| |[entity, 0, 0, B-COMPANY, [word -> BHD, token -> bhd], []] |B-COMPANY| |[entity, 0, 0, B-COMPANY, [word -> BHD, token -> bhd], []] |B-COMPANY| |[entity, 0, 0, O, [word -> (1179227A), token -> (], []] |O | |[entity, 0, 0, O, [word -> (1179227A), token -> 1179227a], []] |O | |[entity, 0, 0, O, [word -> (1179227A), token -> 1179227a], []] |O | |[entity, 0, 0, O, [word -> (1179227A), token -> 1179227a], []] |O | |[entity, 0, 0, O, [word -> (1179227A), token -> 1179227a], []] |O | |[entity, 0, 0, O, [word -> (1179227A), token -> 1179227a], []] |O | |[entity, 0, 0, O, [word -> (1179227A), token -> )], []] |O | |[entity, 0, 0, O, [word -> GST, token -> gst], []] |O | |[entity, 0, 0, O, [word -> GST, token -> gst], []] |O | |[entity, 0, 0, O, [word -> 001085747200, token -> 001085747200], []] |O | |[entity, 0, 0, O, [word -> 001085747200, token -> 001085747200], []] |O | |[entity, 0, 0, O, [word -> 
001085747200, token -> 001085747200], []] |O | |[entity, 0, 0, O, [word -> 001085747200, token -> 001085747200], []] |O | |[entity, 0, 0, O, [word -> 001085747200, token -> 001085747200], []] |O | |[entity, 0, 0, O, [word -> 001085747200, token -> 001085747200], []] |O | |[entity, 0, 0, O, [word -> No, token -> no], []] |O | |[entity, 0, 0, O, [word -> 3,, token -> 3], []] |O | |[entity, 0, 0, O, [word -> 3,, token -> ,], []] |O | |[entity, 0, 0, O, [word -> Jalan, token -> jalan], []] |O | |[entity, 0, 0, O, [word -> Permas, token -> permas], []] |O | |[entity, 0, 0, O, [word -> Permas, token -> permas], []] |O | |[entity, 0, 0, O, [word -> 10/8,, token -> 10], []] |O | |[entity, 0, 0, O, [word -> 10/8,, token -> /], []] |O | |[entity, 0, 0, O, [word -> 10/8,, token -> 8], []] |O | |[entity, 0, 0, O, [word -> 10/8,, token -> ,], []] |O | |[entity, 0, 0, O, [word -> Bandar, token -> bandar], []] |O | |[entity, 0, 0, O, [word -> Bandar, token -> bandar], []] |O | |[entity, 0, 0, O, [word -> Baru, token -> baru], []] |O | |[entity, 0, 0, O, [word -> Baru, token -> baru], []] |O | |[entity, 0, 0, O, [word -> Perrnas, token -> perrnas], []] |O | |[entity, 0, 0, O, [word -> Perrnas, token -> perrnas], []] |O | |[entity, 0, 0, O, [word -> Perrnas, token -> perrnas], []] |O | |[entity, 0, 0, O, [word -> Jaya,, token -> jaya], []] |O | |[entity, 0, 0, O, [word -> Jaya,, token -> ,], []] |O | |[entity, 0, 0, O, [word -> 81750, token -> 81750], []] |O | |[entity, 0, 0, O, [word -> 81750, token -> 81750], []] |O | |[entity, 0, 0, O, [word -> 81750, token -> 81750], []] |O | |[entity, 0, 0, O, [word -> Masai,, token -> masai], []] |O | |[entity, 0, 0, O, [word -> Masai,, token -> masai], []] |O | |[entity, 0, 0, O, [word -> Masai,, token -> ,], []] |O | |[entity, 0, 0, O, [word -> Johor, token -> johor], []] |O | |[entity, 0, 0, O, [word -> 07, token -> 07], []] |O | |[entity, 0, 0, O, [word -> 3823456, token -> 3823456], []] |O | |[entity, 0, 0, O, [word -> 3823456, token -> 
3823456], []] |O | |[entity, 0, 0, O, [word -> 3823456, token -> 3823456], []] |O | |[entity, 0, 0, O, [word -> 3823456, token -> 3823456], []] |O | |[entity, 0, 0, O, [word -> SIMPLIFIED, token -> simplified], []] |O | |[entity, 0, 0, O, [word -> TAX, token -> tax], []] |O | |[entity, 0, 0, O, [word -> INVOICE, token -> invoice], []] |O | |[entity, 0, 0, O, [word -> INVOICE, token -> invoice], []] |O | |[entity, 0, 0, O, [word -> INVOICE, token -> invoice], []] |O | |[entity, 0, 0, O, [word -> INVOICENO, token -> invoiceno], []] |O | |[entity, 0, 0, O, [word -> INVOICENO, token -> invoiceno], []] |O | |[entity, 0, 0, O, [word -> INVOICENO, token -> invoiceno], []] |O | |[entity, 0, 0, O, [word -> INVOICENO, token -> invoiceno], []] |O | |[entity, 0, 0, O, [word -> ©S00014/69, token -> ©s00014], []] |O | |[entity, 0, 0, O, [word -> ©S00014/69, token -> ©s00014], []] |O | |[entity, 0, 0, O, [word -> ©S00014/69, token -> ©s00014], []] |O | |[entity, 0, 0, O, [word -> ©S00014/69, token -> ©s00014], []] |O | |[entity, 0, 0, O, [word -> ©S00014/69, token -> ©s00014], []] |O | |[entity, 0, 0, O, [word -> ©S00014/69, token -> /], []] |O | |[entity, 0, 0, O, [word -> ©S00014/69, token -> 69], []] |O | |[entity, 0, 0, O, [word -> INVOICE, token -> invoice], []] |O | |[entity, 0, 0, O, [word -> INVOICE, token -> invoice], []] |O | |[entity, 0, 0, O, [word -> INVOICE, token -> invoice], []] |O | |[entity, 0, 0, O, [word -> DALE:, token -> dale], []] |O | |[entity, 0, 0, O, [word -> DALE:, token -> :], []] |O | |[entity, 0, 0, B-DATE, [word -> 6/1/2018, token -> 6], []] |B-DATE | |[entity, 0, 0, O, [word -> 6/1/2018, token -> /], []] |O | |[entity, 0, 0, B-DATE, [word -> 6/1/2018, token -> 1], []] |B-DATE | |[entity, 0, 0, O, [word -> 6/1/2018, token -> /], []] |O | |[entity, 0, 0, B-DATE, [word -> 6/1/2018, token -> 2018], []] |B-DATE | |[entity, 0, 0, O, [word -> 6:42:02, token -> 6], []] |O | |[entity, 0, 0, O, [word -> 6:42:02, token -> :], []] |O | |[entity, 0, 0, O, 
[word -> 6:42:02, token -> 42], []]                   |O        | |[entity, 0, 0, O, [word -> 6:42:02, token -> :], []]                    |O        | |[entity, 0, 0, O, [word -> 6:42:02, token -> 02], []]                   |O        | |[entity, 0, 0, O, [word -> PM, token -> pm], []]                        |O        | |[entity, 0, 0, O, [word -> WAITER:, token -> waiter], []]               |O        | |[entity, 0, 0, O, [word -> WAITER:, token -> :], []]                    |O        | |[entity, 0, 0, O, [word -> Vanessa, token -> vanessa], []]              |O        | |[entity, 0, 0, O, [word -> “Sane, token -> “sane], []]                  |O        | |[entity, 0, 0, O, [word -> “Sane, token -> “sane], []]                  |O        | |[entity, 0, 0, O, [word -> “Sane, token -> “sane], []]                  |O        | |[entity, 0, 0, O, [word -> Pax, token -> pax], []]                      |O        | +------------------------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|visual_document_NER_SROIE0526_en_3.0.0_3.0.1_1621990933091| |Type:|ocr| |Compatibility:|Spark NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Case sensitive:|false| |Max sentence length:|512| ## Data Source SROIE --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_twostagetriplet_hier_epochs_1_shard_1_squad2.0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_twostagetriplet_hier_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_twostagetriplet_hier_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674224008013.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_twostagetriplet_hier_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674224008013.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_twostagetriplet_hier_epochs_1_shard_1_squad2.0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_twostagetriplet_hier_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_twostagetriplet_hier_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|306.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_twostagetriplet_hier_epochs_1_shard_1_squad2.0 --- layout: model title: Pipeline to Classify Texts into 4 News Categories author: ahmedlone127 name: bert_sequence_classifier_age_news_pipeline date: 2022-06-14 tags: [ag_news, news, bert, bert_sequence, classification, en, open_source] task: Text Classification language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: false annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of [bert_sequence_classifier_age_news_en](https://nlp.johnsnowlabs.com/2021/11/07/bert_sequence_classifier_age_news_en.html), which is imported from `HuggingFace`. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/bert_sequence_classifier_age_news_pipeline_en_4.0.0_3.0_1655212293047.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/bert_sequence_classifier_age_news_pipeline_en_4.0.0_3.0_1655212293047.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python news_pipeline = PretrainedPipeline("bert_sequence_classifier_age_news_pipeline", lang = "en") news_pipeline.annotate("Microsoft has taken its first step into the metaverse.") ``` ```scala val news_pipeline = new PretrainedPipeline("bert_sequence_classifier_age_news_pipeline", lang = "en") news_pipeline.annotate("Microsoft has taken its first step into the metaverse.") ```
## Results ```bash ['Sci/Tech'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_age_news_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Community| |Language:|en| |Size:|42.4 MB| ## Included Models - DocumentAssembler - TokenizerModel - BertForSequenceClassification --- layout: model title: English DistilBertForQuestionAnswering model (from tli8hf) Newsqa author: John Snow Labs name: distilbert_qa_unqover_base_uncased_newsqa date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `unqover-distilbert-base-uncased-newsqa` is an English model originally trained by `tli8hf`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_unqover_base_uncased_newsqa_en_4.0.0_3.0_1654729078822.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_unqover_base_uncased_newsqa_en_4.0.0_3.0_1654729078822.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_unqover_base_uncased_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_unqover_base_uncased_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.distil_bert.base_uncased").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_unqover_base_uncased_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/tli8hf/unqover-distilbert-base-uncased-newsqa --- layout: model title: Pipeline to Detect Adverse Drug Events (biobert) author: John Snow Labs name: ner_ade_biobert_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_ade_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_ade_biobert_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_ade_biobert_pipeline_en_3.4.1_3.0_1647874721404.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_ade_biobert_pipeline_en_3.4.1_3.0_1647874721404.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_ade_biobert_pipeline", "en", "clinical/models") pipeline.annotate("Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.") ``` ```scala val pipeline = new PretrainedPipeline("ner_ade_biobert_pipeline", "en", "clinical/models") pipeline.annotate("Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.biobert_ade.pipeline").predict("""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.""") ```
## Results ```bash +--------------------------------------------+--------+ |chunks                                      |entities| +--------------------------------------------+--------+ |5-10-fold                                   |DRUG    | |higher transformed colony forming efficiency|ADE     | +--------------------------------------------+--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_ade_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|421.8 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter --- layout: model title: Pipeline to Detect Cancer Genetics author: John Snow Labs name: bert_token_classifier_ner_bionlp_pipeline date: 2022-03-21 tags: [licensed, ner, bionlp, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_bionlp](https://nlp.johnsnowlabs.com/2022/01/03/bert_token_classifier_ner_bionlp_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bionlp_pipeline_en_3.4.1_3.0_1647863601515.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bionlp_pipeline_en_3.4.1_3.0_1647863601515.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("bert_token_classifier_ner_bionlp_pipeline", "en", "clinical/models") pipeline.annotate("Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.") ``` ```scala val pipeline = new PretrainedPipeline("bert_token_classifier_ner_bionlp_pipeline", "en", "clinical/models") pipeline.annotate("Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.biolp.pipeline").predict("""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.""") ```
## Results ```bash +-------------------+----------------------+ |chunk              |ner_label             | +-------------------+----------------------+ |erbA IRES          |Organism              | |erbA/myb virus     |Organism              | |erythroid cells    |Cell                  | |bone marrow        |Multi-tissue_structure| |blastoderm cultures|Cell                  | |erbA/myb IRES virus|Organism              | |erbA IRES virus    |Organism              | |blastoderm         |Cell                  | +-------------------+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_bionlp_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.8 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverter --- layout: model title: Modern Greek (1453-) BertForQuestionAnswering model (from Danastos) author: John Snow Labs name: bert_qa_squad_bert_el_Danastos date: 2022-06-03 tags: [open_source, question_answering, bert] task: Question Answering language: el edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squad_bert_el` is a Modern Greek (1453-) model originally trained by `Danastos`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_squad_bert_el_Danastos_el_4.0.0_3.0_1654250032678.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_squad_bert_el_Danastos_el_4.0.0_3.0_1654250032678.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_squad_bert_el_Danastos","el") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_squad_bert_el_Danastos","el") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("el.answer_question.squad.bert.by_Danastos").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_squad_bert_el_Danastos|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|el|
|Size:|421.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/Danastos/squad_bert_el

---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from google)
author: John Snow Labs
name: t5_efficient_base_dl8
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-dl8` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_dl8_en_4.3.0_3.0_1675110336665.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_dl8_en_4.3.0_3.0_1675110336665.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_base_dl8","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_base_dl8","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_base_dl8|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|402.3 MB|

## References

- https://huggingface.co/google/t5-efficient-base-dl8
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145

---
layout: model
title: Translate Austro-Asiatic languages to English Pipeline
author: John Snow Labs
name: translate_aav_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, aav, en, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `aav` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_aav_en_xx_2.7.0_2.4_1609689179911.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_aav_en_xx_2.7.0_2.4_1609689179911.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_aav_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_aav_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.aav.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_aav_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Legal Pipeline (Headers / Subheaders)
author: John Snow Labs
name: legpipe_header_subheader
date: 2023-01-20
tags: [en, licensed, legal, ner, contextual_parser]
task: Named Entity Recognition
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is a Legal pretrained pipeline, aimed at section splitting based on the Header and Subheader entities detected in the document.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legpipe_header_subheader_en_1.0.0_3.0_1674244247295.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legpipe_header_subheader_en_1.0.0_3.0_1674244247295.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python legal_pipeline = nlp.PretrainedPipeline("legpipe_header_subheader", "en", "legal/models") text = ["""2. DEFINITION. For purposes of this Agreement, the following terms have the meanings ascribed thereto in this Section 1 and 2 Appointment as Reseller. 2.1 Appointment. The Company hereby [***]. Allscripts may also disclose Company's pricing information relating to its Merchant Processing Services and facilitate procurement of Merchant Processing Services on behalf of Sublicensed Customers, including, without limitation by references to such pricing information and Merchant Processing Services in Customer Agreements. 6 2.2 Customer Agreements."""] result = legal_pipeline.annotate(text) ```
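Since the pipeline's purpose is section splitting, the begin offsets of the detected HEADER/SUBHEADER chunks can be used to cut the document into sections. A minimal, Spark-free sketch of that post-processing step (the function name and the `(begin_offset, label)` span format are our assumptions, not part of the pipeline's API):

```python
def split_sections(text, spans):
    """Cut `text` into sections at each detected header/subheader offset.

    `spans` is a list of (begin_offset, label) pairs, e.g. built from the
    pipeline's chunk begin offsets and entity labels.
    """
    spans = sorted(spans)
    sections = []
    for i, (begin, label) in enumerate(spans):
        # Each section runs from its header to the start of the next one.
        end = spans[i + 1][0] if i + 1 < len(spans) else len(text)
        sections.append((label, text[begin:end].strip()))
    return sections

doc = "2. DEFINITION. Terms are defined here. 2.1 Appointment. The Company appoints the Reseller."
spans = [(doc.index("2. DEFINITION"), "HEADER"), (doc.index("2.1 Appointment"), "SUBHEADER")]
sections = split_sections(doc, spans)
```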
## Results ```bash | chunks | begin | end | entities | |------------------------:|------:|----:|----------:| | 2. DEFINITION | 0 | 12 | HEADER | | 2.1 Appointment | 154 | 168 | SUBHEADER | | 2.2 Customer Agreements | 530 | 552 | SUBHEADER | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legpipe_header_subheader| |Type:|pipeline| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|23.6 KB| ## Included Models - DocumentAssembler - TokenizerModel - ContextualParserModel - ContextualParserModel - ChunkMergeModel --- layout: model title: Depression Classifier (PHS-BERT) author: John Snow Labs name: bert_sequence_classifier_depression date: 2022-08-09 tags: [public_health, en, licensed, sequence_classification, mental_health, depression] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [PHS-BERT](https://arxiv.org/abs/2204.04521) based text classification model that can classify depression level of social media text into three levels: `no-depression`, `minimum`, `high-depression`. ## Predicted Entities `no-depression`, `minimum`, `high-depression` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_depression_en_4.0.2_3.0_1660043784879.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_depression_en_4.0.2_3.0_1660043784879.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_depression", "en", "clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("class")

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

data = spark.createDataFrame([
    ["None that I know of. Any mental health issue needs to be cared for like any other health issue. Doctors and medications can help."],
    ["I don’t know. Was this okay? Should I hate him? Or was it just something new? I really don’t know what to make of the situation."],
    ["It makes me so disappointed in myself because I hate what I've become and I hate feeling so helpless."]
]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select("text", "class.result").show(truncate=False)
```
```scala
val documenter = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_depression", "en", "clinical/models")
    .setInputCols(Array("document","token"))
    .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier))

val data = Seq(
    "None that I know of. Any mental health issue needs to be cared for like any other health issue. Doctors and medications can help.",
    "I don’t know. Was this okay? Should I hate him? Or was it just something new? I really don’t know what to make of the situation.",
    "It makes me so disappointed in myself because I hate what I've become and I hate feeling so helpless."
).toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.classify.bert_sequence.depression").predict("""None that I know of. Any mental health issue needs to be cared for like any other health issue. Doctors and medications can help.""")
```
## Results ```bash +---------------------------------------------------------------------------------------------------------------------------------+-----------------+ |text |result | +---------------------------------------------------------------------------------------------------------------------------------+-----------------+ |None that I know of. Any mental health issue needs to be cared for like any other health issue. Doctors and medications can help.|[no-depression] | |I don’t know. Was this okay? Should I hate him? Or was it just something new? I really don’t know what to make of the situation. |[minimum] | |It makes me so disappointed in myself because I hate what I've become and I hate feeling so helpless. |[high-depression]| +---------------------------------------------------------------------------------------------------------------------------------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_depression| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## Benchmarking ```bash label precision recall f1-score support no-depression 0.99 0.99 0.99 98 minimum 0.85 0.86 0.85 155 high-depression 0.81 0.80 0.81 119 accuracy - - 0.87 372 macro-avg 0.88 0.88 0.88 372 weighted-avg 0.87 0.87 0.87 372 ``` --- layout: model title: English BertForQuestionAnswering model (from manishiitg) author: John Snow Labs name: bert_qa_spanbert_recruit_qa date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and 
curated to provide scalability and production-readiness using Spark NLP. `spanbert-recruit-qa` is an English model originally trained by `manishiitg`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_recruit_qa_en_4.0.0_3.0_1654191876032.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_recruit_qa_en_4.0.0_3.0_1654191876032.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_recruit_qa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_recruit_qa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.span_bert.by_manishiitg").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_spanbert_recruit_qa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|1.2 GB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/manishiitg/spanbert-recruit-qa

---
layout: model
title: Extract test entities (Voice of the Patients)
author: John Snow Labs
name: ner_vop_test_wip
date: 2023-05-19
tags: [licensed, clinical, en, ner, vop, patient, test]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model extracts test mentions from documents written in the patient's own words.

Note: The 'wip' suffix indicates that the model is a work in progress; it will be finalized and its performance will be improved in upcoming releases.

## Predicted Entities

`VitalTest`, `Test`, `Measurements`, `TestResult`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_test_wip_en_4.4.2_3.0_1684513260011.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_test_wip_en_4.4.2_3.0_1684513260011.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_test_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["I went to the endocrinology department to get my thyroid levels checked. 
They ordered a blood test and found out that I have hypothyroidism, so now I'm on medication to manage it."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_test_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("I went to the endocrinology department to get my thyroid levels checked. They ordered a blood test and found out that I have hypothyroidism, so now I'm on medication to manage it.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:---------------|:------------| | thyroid levels | Test | | blood test | Test | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_test_wip| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
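The precision/recall/F1 figures in the Benchmarking section below follow directly from the raw tp/fp/fn counts; for instance, the micro-average row can be reproduced with the standard definitions (a quick sanity check, not part of the model card):

```python
def prf(tp: int, fp: int, fn: int):
    """Precision, recall and F1 from raw true-positive / false-positive / false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# micro_avg counts from the benchmarking table: tp=1743, fp=304, fn=347
p, r, f1 = prf(1743, 304, 347)
```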
## Benchmarking

```bash
        label    tp   fp   fn  total  precision  recall    f1
    VitalTest   157   29   15    172       0.84    0.91  0.88
         Test  1056  138  152   1208       0.88    0.87  0.88
 Measurements   160   28   26    186       0.85    0.86  0.86
   TestResult   370  109  154    524       0.77    0.71  0.74
    macro_avg  1743  304  347   2090       0.84    0.84  0.84
    micro_avg  1743  304  347   2090       0.85    0.83  0.84
```

---
layout: model
title: English BertForQuestionAnswering model (from ptnv-s)
author: John Snow Labs
name: bert_qa_biobert_squad2_cased_finetuned_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobert_squad2_cased-finetuned-squad` is an English model originally trained by `ptnv-s`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_squad2_cased_finetuned_squad_en_4.0.0_3.0_1654185715410.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_squad2_cased_finetuned_squad_en_4.0.0_3.0_1654185715410.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biobert_squad2_cased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_biobert_squad2_cased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.biobert.cased.by_ptnv-s").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_biobert_squad2_cased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|403.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/ptnv-s/biobert_squad2_cased-finetuned-squad

---
layout: model
title: English XlmRoBertaForQuestionAnswering (from teacookies)
author: John Snow Labs
name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265907
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-more_fine_tune_24465520-26265907` is an English model originally trained by `teacookies`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265907_en_4.0.0_3.0_1655985443748.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265907_en_4.0.0_3.0_1655985443748.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265907","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265907","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265907").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265907|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|888.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265907

---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_urdu_v2 TFWav2Vec2ForCTC from omar47
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xls_r_300m_urdu_v2
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_urdu_v2` is an English model originally trained by omar47.

NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_urdu_v2_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_urdu_v2_en_4.2.0_3.0_1664118164146.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_urdu_v2_en_4.2.0_3.0_1664118164146.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_urdu_v2', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_urdu_v2", lang = "en") val annotations = pipeline.transform(audioDF) ```
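The snippets above assume an `audioDF` whose rows already hold the raw audio as an array of floats; the card does not show how to build it. As a rough, stdlib-only sketch (the helper name, the 16 kHz mono assumption, and the exact DataFrame schema are our assumptions, not part of the pipeline's documented API), 16-bit PCM samples can be normalized into such floats like this:

```python
import struct

def pcm16le_to_floats(raw: bytes):
    """Normalize little-endian 16-bit PCM bytes to floats in [-1.0, 1.0]."""
    n = len(raw) // 2
    return [s / 32768.0 for s in struct.unpack("<%dh" % n, raw)]

# A float list like this is what the audio column is expected to hold,
# e.g. (assumed schema, column name taken from similar Wav2Vec2 cards):
# audioDF = spark.createDataFrame([(pcm16le_to_floats(raw_bytes),)], ["audio_content"])
```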
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_urdu_v2|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: Relation extraction between body parts and problem entities
author: John Snow Labs
name: re_bodypart_problem
date: 2021-01-18
task: Relation Extraction
language: en
nav_key: models
edition: Spark NLP for Healthcare 2.7.1
spark_version: 2.4
tags: [en, clinical, relation_extraction, licensed]
supported: true
annotator: RelationExtractionModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Relation extraction between body parts and problem entities in clinical texts. `1`: there is a relation between the body part entity and an entity labeled as a problem (diagnosis, symptom, etc.); `0`: there is no relation between the body part entity and an entity labeled as a problem (diagnosis, symptom, etc.).
## Predicted Entities

`0`, `1`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_BODYPART_ENT/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_bodypart_problem_en_2.7.1_2.4_1610959377894.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_bodypart_problem_en_2.7.1_2.4_1610959377894.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use

In the table below, the `re_bodypart_problem` RE model, its labels, the optimal NER model, and meaningful relation pairs are illustrated.

| RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS |
|:--------:|:---------------:|:---------:|:---------|
| re_bodypart_problem | 0,1 | ner_jsl | [“internal_organ_or_component-cerebrovascular_disease”,
“cerebrovascular_disease-internal_organ_or_component”,
“internal_organ_or_component-communicable_disease”,
“communicable_disease-internal_organ_or_component”,
“internal_organ_or_component-diabetes”,
“diabetes-internal_organ_or_component”,
“internal_organ_or_component-disease_syndrome_disorder”,
“disease_syndrome_disorder-internal_organ_or_component”,
“internal_organ_or_component-ekg_findings”,
“ekg_findings-internal_organ_or_component”,
“internal_organ_or_component-heart_disease”,
“heart_disease-internal_organ_or_component”,
“internal_organ_or_component-hyperlipidemia”,
“hyperlipidemia-internal_organ_or_component”,
“internal_organ_or_component-hypertension”,
“hypertension-internal_organ_or_component”,
“internal_organ_or_component-imagingfindings”,
“imagingfindings-internal_organ_or_component”,
“internal_organ_or_component-injury_or_poisoning”,
“injury_or_poisoning-internal_organ_or_component”,
“internal_organ_or_component-kidney_disease”,
“kidney_disease-internal_organ_or_component”,
“internal_organ_or_component-oncological”,
“oncological-internal_organ_or_component”,
“internal_organ_or_component-psychological_condition”,
“psychological_condition-internal_organ_or_component”,
“internal_organ_or_component-symptom”,
“symptom-internal_organ_or_component”,
“internal_organ_or_component-vs_finding”,
“vs_finding-internal_organ_or_component”,
“external_body_part_or_region-communicable_disease”,
“communicable_disease-external_body_part_or_region”,
“external_body_part_or_region-diabetes”,
“diabetes-external_body_part_or_region”,
“external_body_part_or_region-disease_syndrome_disorder”,
“disease_syndrome_disorder-external_body_part_or_region”,
“external_body_part_or_region-hypertension”,
“hypertension-external_body_part_or_region”,
“external_body_part_or_region-imagingfindings”,
“imagingfindings-external_body_part_or_region”,
“external_body_part_or_region-injury_or_poisoning”,
“injury_or_poisoning-external_body_part_or_region”,
“external_body_part_or_region-obesity”,
“obesity-external_body_part_or_region”,
“external_body_part_or_region-oncological”,
“oncological-external_body_part_or_region”,
“external_body_part_or_region-overweight”,
“overweight-external_body_part_or_region”,
“external_body_part_or_region-symptom”,
“symptom-external_body_part_or_region”,
“external_body_part_or_region-vs_finding”,
“vs_finding-external_body_part_or_region”] |
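The pair list above is systematic: each allowed body-part entity and problem entity appears in both orders ("part-problem" and "problem-part"). A minimal Python sketch shows how such a bidirectional list can be generated; the label lists here are abbreviated and illustrative, not the model's full set (which differs between the internal and external body-part entities, as the table shows).

```python
# Illustrative only: generate bidirectional relation-pair strings from two
# (abbreviated) entity lists, mirroring the structure of the RE PAIRS column.
body_parts = ["internal_organ_or_component", "external_body_part_or_region"]
problems = ["symptom", "hypertension", "oncological"]

def relation_pairs(parts, probs):
    """Return both directions ("part-problem" and "problem-part") for every combination."""
    pairs = []
    for part in parts:
        for prob in probs:
            pairs.append(f"{part}-{prob}")
            pairs.append(f"{prob}-{part}")
    return pairs

pairs = relation_pairs(body_parts, problems)
print(len(pairs))  # 2 parts * 3 problems * 2 directions = 12
```

A list built this way can be passed to `setRelationPairs` to restrict which entity combinations the RE model scores.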
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentences", "tokens"])\ .setOutputCol("embeddings") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") ner_tagger = MedicalNerModel()\ .pretrained("jsl_ner_wip_greedy_clinical","en","clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_chunker = NerConverterInternal()\ .setInputCols(["sentences", "tokens", "ner_tags"])\ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel()\ .pretrained("dependency_conllu", "en")\ .setInputCols(["sentences", "pos_tags", "tokens"])\ .setOutputCol("dependencies") reModel = RelationExtractionModel.pretrained("re_bodypart_problem","en","clinical/models")\ .setInputCols(["embeddings","ner_chunks","pos_tags","dependencies"])\ .setOutputCol("relations") \ .setRelationPairs(['symptom-external_body_part_or_region']) pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, word_embeddings, pos_tagger, ner_tagger, ner_chunker, dependency_parser, reModel]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = LightPipeline(model).fullAnnotate('''No neurologic deficits other than some numbness in his left hand.''') ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val word_embeddings = 
WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens"))
  .setOutputCol("embeddings")

val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens"))
  .setOutputCol("pos_tags")

val ner_tagger = MedicalNerModel.pretrained("jsl_ner_wip_greedy_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens", "embeddings"))
  .setOutputCol("ner_tags")

val ner_chunker = new NerConverterInternal()
  .setInputCols(Array("sentences", "tokens", "ner_tags"))
  .setOutputCol("ner_chunks")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentences", "pos_tags", "tokens"))
  .setOutputCol("dependencies")

val reModel = RelationExtractionModel.pretrained("re_bodypart_problem", "en", "clinical/models")
  .setInputCols(Array("embeddings", "ner_chunks", "pos_tags", "dependencies"))
  .setOutputCol("relations")
  .setRelationPairs(Array("symptom-external_body_part_or_region"))

val nlpPipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, word_embeddings, pos_tagger, ner_tagger, ner_chunker, dependency_parser, reModel))

val model = nlpPipeline.fit(Seq("").toDS.toDF("text"))
val results = new LightPipeline(model).fullAnnotate("""No neurologic deficits other than some numbness in his left hand.""")
```
## Results

```bash
| index | relations | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|-------|-----------|---------|---------------|-------------|---------------------|------------------------------|---------------|-------------|--------|------------|
| 0 | 0 | Symptom | 3 | 21 | neurologic deficits | external_body_part_or_region | 60 | 63 | hand | 0.999998 |
| 1 | 1 | Symptom | 39 | 46 | numbness | external_body_part_or_region | 60 | 63 | hand | 1 |
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_bodypart_problem| |Type:|re| |Compatibility:|Spark NLP 2.7.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]| |Output Labels:|[relations]| |Language:|en| |Dependencies:|embeddings_clinical| ## Data Source Trained on custom datasets annotated internally. ## Benchmarking

```bash
label  recall  precision
0      0.72    0.82
1      0.94    0.91
```

--- layout: model title: English image_classifier_vit_pond_image_classification_8 ViTForImageClassification from SummerChiam author: John Snow Labs name: image_classifier_vit_pond_image_classification_8 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_pond_image_classification_8` is an English model originally trained by SummerChiam.
## Predicted Entities `Normal`, `Boiling`, `Algae`, `NormalCement`, `NormalRain`, `BoilingNight`, `NormalNight` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_8_en_4.1.0_3.0_1660166897106.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_8_en_4.1.0_3.0_1660166897106.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_pond_image_classification_8", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_pond_image_classification_8", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_pond_image_classification_8| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English image_classifier_vit_base_avengers_v1 ViTForImageClassification from dingusagar author: John Snow Labs name: image_classifier_vit_base_avengers_v1 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_avengers_v1` is an English model originally trained by dingusagar. ## Predicted Entities `Thor`, `Captain America`, `Captain Marvel`, `Falcon Avengers`, `Vision Avengers`, `Bucky Barnes`, `Loki`, `Black Widow`, `Iron Man`, `Black Panther`, `Docter Strage`, `Scarlet Witch`, `Ant Man`, `Hulk`, `Spider Man`, `Hawkeye Avengers` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_avengers_v1_en_4.1.0_3.0_1660165592727.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_avengers_v1_en_4.1.0_3.0_1660165592727.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_avengers_v1", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_avengers_v1", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_avengers_v1| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English T5ForConditionalGeneration Small Cased model (from NeuML) author: John Snow Labs name: t5_small_bashsql date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-bashsql` is an English model originally trained by `NeuML`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_bashsql_en_4.3.0_3.0_1675125918159.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_bashsql_en_4.3.0_3.0_1675125918159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_small_bashsql", "en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_small_bashsql", "en")
  .setInputCols("document")
  .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_bashsql| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|260.9 MB| ## References - https://huggingface.co/NeuML/t5-small-bashsql - https://github.com/neuml/txtai - https://en.wikipedia.org/wiki/Bash_(Unix_shell) - https://github.com/neuml/txtai/tree/master/models/bashsql --- layout: model title: Legal Defined Terms Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_defined_terms_bert date: 2023-03-05 tags: [en, legal, classification, clauses, defined_terms, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Defined_Terms` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline would make the model see only individual sentences rather than the whole text, so it is better to skip it unless you want to do binary classification at the sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques shown in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that this model's embeddings allow up to 512 tokens; if your input is longer, consider splitting it into smaller pieces (see the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers in the Models Hub, producing a series of True/False values for each of the legal clause models you add. ## Predicted Entities `Defined_Terms`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_defined_terms_bert_en_1.0.0_3.0_1678050008850.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_defined_terms_bert_en_1.0.0_3.0_1678050008850.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_defined_terms_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
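The paragraph-splitting strategy recommended above (splitting "by multiline" before classification) can be sketched in plain Python. This is an illustrative helper only, not part of the model or the Spark NLP API; the tutorial linked above shows the production approach.

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines and drop empty fragments,
    so each provision/paragraph can be classified separately."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Clause 1. Definitions...\n\nClause 2. Term...\n\n\nClause 3. Payment..."
paragraphs = split_paragraphs(doc)
print(len(paragraphs))  # 3
```

Each resulting paragraph can then be loaded into the `text` column of the DataFrame fed to the pipeline above.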
## Results

```bash
+---------------+
|result         |
+---------------+
|[Defined_Terms]|
|[Other]        |
|[Other]        |
|[Defined_Terms]|
+---------------+
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_defined_terms_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking

```bash
label          precision  recall  f1-score  support
Defined_Terms  0.98       1.00    0.99      41
Other          1.00       0.98    0.99      61
accuracy       -          -       0.99      102
macro-avg      0.99       0.99    0.99      102
weighted-avg   0.99       0.99    0.99      102
```

--- layout: model title: Portuguese asr_bp_commonvoice100_xlsr TFWav2Vec2ForCTC from lgris author: John Snow Labs name: asr_bp_commonvoice100_xlsr date: 2022-09-26 tags: [wav2vec2, pt, audio, open_source, asr] task: Automatic Speech Recognition language: pt edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_bp_commonvoice100_xlsr` is a Portuguese model originally trained by lgris. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_bp_commonvoice100_xlsr_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_bp_commonvoice100_xlsr_pt_4.2.0_3.0_1664197387804.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_bp_commonvoice100_xlsr_pt_4.2.0_3.0_1664197387804.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_bp_commonvoice100_xlsr", "pt")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_bp_commonvoice100_xlsr", "pt") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_bp_commonvoice100_xlsr| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|pt| |Size:|756.0 MB| --- layout: model title: Normalizing Section Headers in Clinical Notes author: John Snow Labs name: normalized_section_header_mapper date: 2022-06-26 tags: [chunk_mapper, normalizer, licensed, clinical, sectionheader, chunk_mapping, en] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model normalizes the section headers in clinical notes. It returns two levels of normalization called `level_1` and `level_2`. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NORMALIZED_SECTION_HEADER_MAPPER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NORMALIZED_SECTION_HEADER_MAPPER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/normalized_section_header_mapper_en_3.5.3_3.0_1656263408017.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/normalized_section_header_mapper_en_3.5.3_3.0_1656263408017.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("word_embeddings") clinical_ner = MedicalNerModel.pretrained("ner_jsl_slim", "en", "clinical/models")\ .setInputCols(["sentence","token", "word_embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(["Header"]) chunkerMapper = ChunkMapperModel.pretrained("normalized_section_header_mapper", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings")\ .setRels(["level_1", "level_2"]) pipeline = Pipeline().setStages([ document_assembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter, chunkerMapper]) sample_text = """ADMISSION DIAGNOSIS Right pleural effusion and suspected malignant mesothelioma. PRINCIPAL DIAGNOSIS Right pleural effusion, suspected malignant mesothelioma. 
GENERAL REVIEW Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.""" test_data = spark.createDataFrame([[sample_text]]).toDF("text") result = pipeline.fit(test_data).transform(test_data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("word_embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_jsl_slim", "en", "clinical/models") .setInputCols(Array("sentence","token", "word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Header")) val chunkerMapper = ChunkMapperModel.pretrained("normalized_section_header_mapper", "en", "clinical/models") .setInputCols("ner_chunk") .setOutputCol("mappings") .setRels(Array("level_1", "level_2")) val pipeline = new Pipeline().setStages(Array( document_assembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter, chunkerMapper)) val sample_text= """ADMISSION DIAGNOSIS Right pleural effusion and suspected malignant mesothelioma. PRINCIPAL DIAGNOSIS Right pleural effusion, suspected malignant mesothelioma. 
GENERAL REVIEW Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.""" val test_data = Seq(sample_text).toDS.toDF("text") val result = pipeline.fit(test_data).transform(test_data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.section_headers_normalized").predict("""ADMISSION DIAGNOSIS Right pleural effusion and suspected malignant mesothelioma. PRINCIPAL DIAGNOSIS Right pleural effusion, suspected malignant mesothelioma. GENERAL REVIEW Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.""") ```
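Outside Spark NLP, the two-level normalization idea behind this mapper can be illustrated with a plain-Python sketch. The mapping entries below are examples taken from the Results section; they are not the model's actual internal mapping table, which is far larger.

```python
# Illustrative mini mapping table: each detected header maps to a broad
# category (level_1) and a specific normalized form (level_2).
SECTION_MAP = {
    "ADMISSION DIAGNOSIS": {"level_1": "DIAGNOSIS", "level_2": "ADMISSION DIAGNOSIS"},
    "PRINCIPAL DIAGNOSIS": {"level_1": "DIAGNOSIS", "level_2": "PRINCIPAL DIAGNOSIS"},
    "GENERAL REVIEW": {"level_1": "REVIEW TYPE", "level_2": "GENERAL REVIEW"},
}

def normalize_header(header, level="level_1"):
    """Look up a header (case-insensitively) and return the requested level, or None."""
    entry = SECTION_MAP.get(header.upper())
    return entry[level] if entry else None

print(normalize_header("admission diagnosis"))  # DIAGNOSIS
```

The real model performs this lookup on the `Header` chunks produced by the NER stage, with `setRels(["level_1", "level_2"])` selecting which levels appear in the `mappings` output.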
## Results

```bash
|    | section             | normalized_section(level_1 | level_2)  |
|---:|:--------------------|:---------------------------------------|
|  0 | ADMISSION DIAGNOSIS | DIAGNOSIS | ADMISSION DIAGNOSIS        |
|  1 | GENERAL REVIEW      | REVIEW TYPE | GENERAL REVIEW           |
|  2 | PRINCIPAL DIAGNOSIS | DIAGNOSIS | PRINCIPAL DIAGNOSIS        |
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|normalized_section_header_mapper| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|19.3 KB| --- layout: model title: English BertForQuestionAnswering Uncased model (from aodiniz) author: John Snow Labs name: bert_qa_uncased_l_2_h_512_a_8_cord19_200616_squad2_covid_qna date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-2_H-512_A-8_cord19-200616_squad2_covid-qna` is an English model originally trained by `aodiniz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_uncased_l_2_h_512_a_8_cord19_200616_squad2_covid_qna_en_4.0.0_3.0_1657188914020.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_uncased_l_2_h_512_a_8_cord19_200616_squad2_covid_qna_en_4.0.0_3.0_1657188914020.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_uncased_l_2_h_512_a_8_cord19_200616_squad2_covid_qna", "en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_uncased_l_2_h_512_a_8_cord19_200616_squad2_covid_qna", "en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
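Independently of the Spark NLP API, extractive QA models of this kind score every token of the context as a candidate answer start and answer end, then select the span with the highest combined score. A toy sketch of that selection step follows; the scores here are made up for illustration, not produced by this model.

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick (start, end) maximizing start_score + end_score with start <= end
    and a bounded span length, as extractive QA heads typically do."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best[0], best[1]

# Toy scores over the tokens of the example context used above.
tokens = "My name is Clara and I live in Berkeley".split()
start = [0.1, 0.0, 0.2, 3.0, 0.0, 0.0, 0.1, 0.0, 0.5]
end   = [0.0, 0.1, 0.0, 2.5, 0.2, 0.0, 0.0, 0.1, 0.4]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```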
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_uncased_l_2_h_512_a_8_cord19_200616_squad2_covid_qna| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|83.4 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-2_H-512_A-8_cord19-200616_squad2_covid-qna --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from nlpconnect) author: John Snow Labs name: roberta_qa_base_squad2_nq date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-nq` is an English model originally trained by `nlpconnect`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_nq_en_4.3.0_3.0_1674219670141.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_nq_en_4.3.0_3.0_1674219670141.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2_nq", "en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```

```scala
val Document_Assembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2_nq", "en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_squad2_nq| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/nlpconnect/roberta-base-squad2-nq - https://paperswithcode.com/sota?task=Question+Answering&dataset=squad_v2 --- layout: model title: Detect concepts in drug development trials (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_drug_development_trials date: 2022-03-22 tags: [en, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description It is a `BertForTokenClassification` NER model to identify concepts related to drug development, including `Trial Groups`, `End Points`, `Hazard Ratio`, and other entities in free text. 
## Predicted Entities `Patient_Count`, `Duration`, `End_Point`, `Value`, `Trial_Group`, `Hazard_Ratio`, `Total_Patients` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DRUGS_DEVELOPMENT_TRIALS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_drug_development_trials_en_3.3.4_2.4_1647957458547.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_drug_development_trials_en_3.3.4_2.4_1647957458547.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_drug_development_trials", "en", "clinical/models")\ .setInputCols("token", "sentence")\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) test_sentence = """In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan.""" data = spark.createDataFrame([[test_sentence]]).toDF('text') result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_drug_development_trials", "en", "clinical/models") .setInputCols(Array("token", "sentence")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val data = Seq("In June 2003, the median 
overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.ner.drug_development_trials").predict("""In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan.""") ```
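The chunk/label pairs shown in the Results section below come from the `ner_chunk` annotations produced by `NerConverter`. As a minimal, plain-Python sketch of that post-processing step (the dict layout here mirrors the typical Spark NLP annotation shape, but is an assumption for illustration, not the exact runtime type):

```python
def flatten_chunks(annotations):
    """Turn a list of chunk-annotation dicts into (chunk, ner_label) pairs."""
    return [(a["result"], a["metadata"]["entity"]) for a in annotations]

# Hypothetical annotations, shaped like the output of NerConverter
sample = [
    {"result": "overall survival", "metadata": {"entity": "End_Point"}},
    {"result": "without topotecan", "metadata": {"entity": "Trial_Group"}},
    {"result": "3.6 months", "metadata": {"entity": "Value"}},
]
pairs = flatten_chunks(sample)
print(pairs[0])  # ('overall survival', 'End_Point')
```

In a real pipeline the same pairing is usually done with Spark SQL by exploding the `ner_chunk` column, but the shape of the result is the same.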
## Results ```bash +-----------------+-------------+ |chunk |ner_label | +-----------------+-------------+ |median |Duration | |overall survival |End_Point | |with |Trial_Group | |without topotecan|Trial_Group | |4.0 |Value | |3.6 months |Value | |23 |Patient_Count| |63 |Patient_Count| |55 |Patient_Count| |33 patients |Patient_Count| |topotecan |Trial_Group | |11 |Patient_Count| |61 |Patient_Count| |66 |Patient_Count| |32 patients |Patient_Count| |without topotecan|Trial_Group | +-----------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_drug_development_trials| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References Trained on data obtained from `clinicaltrials.gov` and annotated in-house. ## Benchmarking ```bash label prec rec f1 support B-Duration 0.93 0.94 0.93 1820 B-End_Point 0.99 0.98 0.98 5022 B-Hazard_Ratio 0.97 0.95 0.96 778 B-Patient_Count 0.81 0.88 0.85 300 B-Trial_Group 0.86 0.88 0.87 6751 B-Value 0.94 0.96 0.95 7675 I-Duration 0.71 0.82 0.76 185 I-End_Point 0.94 0.98 0.96 1491 I-Patient_Count 0.48 0.64 0.55 44 I-Trial_Group 0.78 0.75 0.77 4561 I-Value 0.93 0.95 0.94 1511 O 0.96 0.95 0.95 47423 accuracy 0.94 0.94 0.94 77608 macro-avg 0.79 0.82 0.80 77608 weighted-avg 0.94 0.94 0.94 77608 ``` --- layout: model title: English BertForQuestionAnswering model (from Intel) author: John Snow Labs name: bert_qa_bert_base_uncased_squadv1.1_sparse_80_1x4_block_pruneofa date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted 
from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squadv1.1-sparse-80-1x4-block-pruneofa` is an English model originally trained by `Intel`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squadv1.1_sparse_80_1x4_block_pruneofa_en_4.0.0_3.0_1654181622912.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squadv1.1_sparse_80_1x4_block_pruneofa_en_4.0.0_3.0_1654181622912.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_squadv1.1_sparse_80_1x4_block_pruneofa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_squadv1.1_sparse_80_1x4_block_pruneofa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased.by_Intel").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_squadv1.1_sparse_80_1x4_block_pruneofa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|179.3 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Intel/bert-base-uncased-squadv1.1-sparse-80-1x4-block-pruneofa - https://arxiv.org/abs/2111.05754 - https://github.com/IntelLabs/Model-Compression-Research-Package/tree/main/research/prune-once-for-all --- layout: model title: Translate English to Slovak Pipeline author: John Snow Labs name: translate_en_sk date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, sk, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `sk` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_sk_xx_2.7.0_2.4_1609689898839.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_sk_xx_2.7.0_2.4_1609689898839.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_sk", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_sk", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.sk').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_sk| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate Malagasy to English Pipeline author: John Snow Labs name: translate_mg_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, mg, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `mg` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_mg_en_xx_2.7.0_2.4_1609698536372.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_mg_en_xx_2.7.0_2.4_1609698536372.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_mg_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_mg_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.mg.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_mg_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Absence Of Litigation Clause Binary Classifier author: John Snow Labs name: legclf_absence_of_litigation_clause date: 2023-01-27 tags: [en, legal, classification, litigation, clauses, absence_of_litigation, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `absence-of-litigation` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
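As a minimal illustration of the paragraph splitting (by multiline) mentioned above, a plain-Python sketch (the regex and helper name are assumptions for illustration, not part of the Legal NLP API):

```python
import re

def split_paragraphs(text):
    """Split a long document into paragraphs on blank lines, dropping
    empty fragments, so each piece fed to the classifier stays within
    the ~512-token limit of the underlying sentence embeddings."""
    return [p.strip() for p in re.split(r"\n\s*\n+", text) if p.strip()]

doc = ("1. Absence of Litigation. There is no action pending.\n\n"
       "2. Governing Law. This Agreement shall be governed by ...")
paragraphs = split_paragraphs(doc)
print(len(paragraphs))  # 2
```

Each resulting paragraph can then be sent through the classification pipeline as a separate row.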
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `absence-of-litigation`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_absence_of_litigation_clause_en_1.0.0_3.0_1674820144584.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_absence_of_litigation_clause_en_1.0.0_3.0_1674820144584.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_absence_of_litigation_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-----------------------+ |result | +-----------------------+ |[absence-of-litigation]| |[other] | |[other] | |[absence-of-litigation]| +-----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_absence_of_litigation_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support absence-of-litigation 1.00 1.00 1.00 49 other 1.00 1.00 1.00 81 accuracy - - 1.00 130 macro-avg 1.00 1.00 1.00 130 weighted-avg 1.00 1.00 1.00 130 ``` --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_2 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-1024-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_2_en_4.0.0_3.0_1657183991413.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_2_en_4.0.0_3.0_1657183991413.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-1024-finetuned-squad-seed-2 --- layout: model title: English BertForQuestionAnswering model (from graviraja) author: John Snow Labs name: bert_qa_covid_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `covid_squad` is an English model originally trained by `graviraja`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_covid_squad_en_4.0.0_3.0_1654187329919.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_covid_squad_en_4.0.0_3.0_1654187329919.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_covid_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_covid_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad_covid.bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_covid_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/graviraja/covid_squad --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_0 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-32-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_0_en_4.0.0_3.0_1654191498731.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_0_en_4.0.0_3.0_1654191498731.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_32d_seed_0").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|377.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-32-finetuned-squad-seed-0 --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_el2 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-el2` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el2_en_4.3.0_3.0_1675119873194.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el2_en_4.3.0_3.0_1675119873194.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_el2","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_el2","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_el2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|123.8 MB| ## References - https://huggingface.co/google/t5-efficient-small-el2 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English RobertaForQuestionAnswering Large Cased model (from csarron) author: John Snow Labs name: roberta_qa_large_squad_v1 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-squad-v1` is an English model originally trained by `csarron`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_large_squad_v1_en_4.3.0_3.0_1674222069233.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_large_squad_v1_en_4.3.0_3.0_1674222069233.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_squad_v1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_squad_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_large_squad_v1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/csarron/roberta-large-squad-v1 --- layout: model title: English T5ForConditionalGeneration Tiny Cased model (from google) author: John Snow Labs name: t5_efficient_tiny_dl8 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-dl8` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_dl8_en_4.3.0_3.0_1675123295064.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_dl8_en_4.3.0_3.0_1675123295064.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_tiny_dl8","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_tiny_dl8","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_tiny_dl8| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|66.3 MB| ## References - https://huggingface.co/google/t5-efficient-tiny-dl8 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Mapping CVX Codes with Their Corresponding Vaccine Names and CPT Codes author: John Snow Labs name: cvx_code_mapper date: 2022-10-12 tags: [cvx, cpt, chunk_mapping, en, licensed, clinical] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 4.2.1 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps CVX codes to their corresponding vaccine names and CPT codes. It returns three types of vaccine names: `short_name`, `full_name`, and `trade_name`. ## Predicted Entities `short_name`, `full_name`, `trade_name`, `cpt_code` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/cvx_code_mapper_en_4.2.1_3.0_1665598034618.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/cvx_code_mapper_en_4.2.1_3.0_1665598034618.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('doc') chunk_assembler = Doc2Chunk()\ .setInputCols(['doc'])\ .setOutputCol('ner_chunk') chunkerMapper = ChunkMapperModel\ .pretrained("cvx_code_mapper", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings")\ .setRels(["short_name", "full_name", "trade_name", "cpt_code"]) mapper_pipeline = Pipeline(stages=[ document_assembler, chunk_assembler, chunkerMapper ]) data = spark.createDataFrame([['75'], ['20'], ['48'], ['19']]).toDF('text') res = mapper_pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("doc") val chunk_assembler = new Doc2Chunk() .setInputCols(Array("doc")) .setOutputCol("ner_chunk") val chunkerMapper = ChunkMapperModel.pretrained("cvx_code_mapper", "en","clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("mappings") .setRels(Array("short_name", "full_name", "trade_name", "cpt_code")) val pipeline = new Pipeline().setStages(Array( documentAssembler, chunk_assembler, chunkerMapper)) val data = Seq("75", "20", "48", "19").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.cvx_code").predict("""Put your text here.""")
```
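Conceptually, the chunk mapper behaves like a lookup table keyed by CVX code, returning one value per requested relation. A toy, dictionary-based sketch of the relation it encodes (values taken from the sample output below; the real model ships its own complete mapping data):

```python
# Toy lookup mirroring what cvx_code_mapper returns for a few CVX codes.
# This dict is illustrative only; the pretrained model covers the full CVX set.
CVX_MAP = {
    "75": {"short_name": "vaccinia (smallpox)", "trade_name": "DRYVAX", "cpt_code": "90622"},
    "20": {"short_name": "DTaP", "trade_name": "ACEL-IMUNE", "cpt_code": "90700"},
    "48": {"short_name": "Hib (PRP-T)", "trade_name": "ACTHIB", "cpt_code": "90648"},
    "19": {"short_name": "BCG", "trade_name": "MYCOBAX", "cpt_code": "90585"},
}

def map_cvx(code: str, rel: str) -> str:
    """Look up one relation (e.g. 'cpt_code') for a CVX code."""
    return CVX_MAP[code][rel]

print(map_cvx("19", "trade_name"))  # MYCOBAX
print(map_cvx("20", "cpt_code"))   # 90700
```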
## Results ```bash +--------+---------------------+-------------------------------------------------------------+------------+--------+ |cvx_code|short_name |full_name |trade_name |cpt_code| +--------+---------------------+-------------------------------------------------------------+------------+--------+ |[75] |[vaccinia (smallpox)]|[vaccinia (smallpox) vaccine] |[DRYVAX] |[90622] | |[20] |[DTaP] |[diphtheria, tetanus toxoids and acellular pertussis vaccine]|[ACEL-IMUNE]|[90700] | |[48] |[Hib (PRP-T)] |[Haemophilus influenzae type b vaccine, PRP-T conjugate] |[ACTHIB] |[90648] | |[19] |[BCG] |[Bacillus Calmette-Guerin vaccine] |[MYCOBAX] |[90585] | +--------+---------------------+-------------------------------------------------------------+------------+--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|cvx_code_mapper| |Compatibility:|Healthcare NLP 4.2.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|12.3 KB| --- layout: model title: Fast Neural Machine Translation Model from English to Pohnpeian author: John Snow Labs name: opus_mt_en_pon date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, pon, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en` - target languages: `pon` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_pon_xx_2.7.0_2.4_1609166479699.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_pon_xx_2.7.0_2.4_1609166479699.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_pon", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_pon", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.pon').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_pon| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English T5ForConditionalGeneration Base Cased model (from google) author: John Snow Labs name: t5_efficient_base_ff9000 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-ff9000` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_ff9000_en_4.3.0_3.0_1675112485586.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_ff9000_en_4.3.0_3.0_1675112485586.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_base_ff9000","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_base_ff9000","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_base_ff9000| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|910.0 MB| ## References - https://huggingface.co/google/t5-efficient-base-ff9000 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English image_classifier_vit_rare_puppers_09_04_2021 ViTForImageClassification from nateraw author: John Snow Labs name: image_classifier_vit_rare_puppers_09_04_2021 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rare_puppers_09_04_2021` is an English model originally trained by nateraw. ## Predicted Entities `corgi`, `samoyed`, `shiba inu` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rare_puppers_09_04_2021_en_4.1.0_3.0_1660170617309.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rare_puppers_09_04_2021_en_4.1.0_3.0_1660170617309.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_rare_puppers_09_04_2021", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_rare_puppers_09_04_2021", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_rare_puppers_09_04_2021| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Pipeline for Adverse Drug Events author: John Snow Labs name: explain_clinical_doc_ade date: 2021-04-01 tags: [pipeline, en, clinical, licensed] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pipeline for Adverse Drug Events (ADE) with `ner_ade_biobert`, `assertiondl_biobert` and `classifierdl_ade_conversational_biobert`. It will extract ADE and DRUG clinical entities, assign assertion status to ADE entities, and then assign ADE status to a text (True means ADE, False means not related to ADE). {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.Pretrained_Clinical_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_ade_en_3.0.0_3.0_1617297946478.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_ade_en_3.0.0_3.0_1617297946478.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('explain_clinical_doc_ade', 'en', 'clinical/models') res = pipeline.fullAnnotate('The clinical course suggests that the interstitial pneumonitis was induced by hydroxyurea.') ``` ```scala val ade_pipeline = new PretrainedPipeline("explain_clinical_doc_ade", "en", "clinical/models") val result = ade_pipeline.fullAnnotate("""The clinical course suggests that the interstitial pneumonitis was induced by hydroxyurea.""")(0) ``` {:.nlu-block} ```python import nlu nlu.load("en.explain_doc.clinical_ade").predict("""The clinical course suggests that the interstitial pneumonitis was induced by hydroxyurea.""") ```
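The pipeline's annotators emit parallel annotation lists: one chunk per detected entity, each with an entity label, plus an assertion status for the ADE chunks. A schematic sketch of how those columns line up into the rows shown below (a toy data shape for illustration, not the actual `fullAnnotate` schema; the `DRUG` row for "hydroxyurea" is an assumed example output):

```python
# Toy alignment of NER chunks, entity labels, and assertion statuses into rows.
# Values mirror the sample sentence; this is not the real annotation schema.
chunks = ["interstitial pneumonitis", "hydroxyurea"]
entities = ["ADE", "DRUG"]
assertions = ["Present", None]  # assertion status is assigned to ADE chunks only

rows = list(zip(chunks, entities, assertions))
print(rows[0])  # ('interstitial pneumonitis', 'ADE', 'Present')
```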
## Results ```bash | # | chunks | entities | assertion | |----|-------------------------------|------------|------------| | 0 | interstitial pneumonitis | ADE | Present | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_clinical_doc_ade| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter - NerConverter - AssertionDLModel - SentenceEmbeddings - ClassifierDLModel --- layout: model title: Detect Persons, Locations, Organizations and Misc Entities in Polish (WikiNER 840B 300) author: John Snow Labs name: wikiner_840B_300 date: 2020-05-10 task: Named Entity Recognition language: pl edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [ner, pl, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 840B 300 is trained with GloVe 840B 300 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_PL){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_PL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_840B_300_pl_2.5.0_2.4_1588519719572.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_840B_300_pl_2.5.0_2.4_1588519719572.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("wikiner_840B_300", "pl") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (ur. 28 października 1955 r.) To amerykański magnat biznesowy, programista, inwestor i filantrop. Najbardziej znany jest jako współzałożyciel Microsoft Corporation. Podczas swojej kariery w Microsoft Gates zajmował stanowiska prezesa, dyrektora generalnego (CEO), prezesa i głównego architekta oprogramowania, będąc jednocześnie największym indywidualnym akcjonariuszem do maja 2014 r. Jest jednym z najbardziej znanych przedsiębiorców i pionierów rewolucja mikrokomputerowa lat 70. i 80. Urodzony i wychowany w Seattle w stanie Waszyngton, Gates był współzałożycielem Microsoftu z przyjacielem z dzieciństwa Paulem Allenem w 1975 r. W Albuquerque w Nowym Meksyku; stała się największą na świecie firmą produkującą oprogramowanie komputerowe. Gates prowadził firmę jako prezes i dyrektor generalny, aż do ustąpienia ze stanowiska dyrektora generalnego w styczniu 2000 r., Ale pozostał przewodniczącym i został głównym architektem oprogramowania. Pod koniec lat 90. Gates był krytykowany za taktykę biznesową, którą uważano za antykonkurencyjną. Opinię tę podtrzymują liczne orzeczenia sądowe. W czerwcu 2006 r. Gates ogłosił, że przejdzie do pracy w niepełnym wymiarze godzin w Microsoft i pracy w pełnym wymiarze godzin w Bill & Melinda Gates Foundation, prywatnej fundacji charytatywnej, którą on i jego żona Melinda Gates utworzyli w 2000 r. 
Stopniowo przeniósł obowiązki na Raya Ozziego i Craiga Mundie. Zrezygnował z funkcji prezesa Microsoftu w lutym 2014 r. I objął nowe stanowisko jako doradca ds. Technologii, aby wesprzeć nowo mianowaną CEO Satyę Nadellę.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("wikiner_840B_300", "pl") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (ur. 28 października 1955 r.) To amerykański magnat biznesowy, programista, inwestor i filantrop. Najbardziej znany jest jako współzałożyciel Microsoft Corporation. Podczas swojej kariery w Microsoft Gates zajmował stanowiska prezesa, dyrektora generalnego (CEO), prezesa i głównego architekta oprogramowania, będąc jednocześnie największym indywidualnym akcjonariuszem do maja 2014 r. Jest jednym z najbardziej znanych przedsiębiorców i pionierów rewolucja mikrokomputerowa lat 70. i 80. Urodzony i wychowany w Seattle w stanie Waszyngton, Gates był współzałożycielem Microsoftu z przyjacielem z dzieciństwa Paulem Allenem w 1975 r. W Albuquerque w Nowym Meksyku; stała się największą na świecie firmą produkującą oprogramowanie komputerowe. Gates prowadził firmę jako prezes i dyrektor generalny, aż do ustąpienia ze stanowiska dyrektora generalnego w styczniu 2000 r., Ale pozostał przewodniczącym i został głównym architektem oprogramowania. Pod koniec lat 90. Gates był krytykowany za taktykę biznesową, którą uważano za antykonkurencyjną. Opinię tę podtrzymują liczne orzeczenia sądowe. W czerwcu 2006 r. 
Gates ogłosił, że przejdzie do pracy w niepełnym wymiarze godzin w Microsoft i pracy w pełnym wymiarze godzin w Bill & Melinda Gates Foundation, prywatnej fundacji charytatywnej, którą on i jego żona Melinda Gates utworzyli w 2000 r. Stopniowo przeniósł obowiązki na Raya Ozziego i Craiga Mundie. Zrezygnował z funkcji prezesa Microsoftu w lutym 2014 r. I objął nowe stanowisko jako doradca ds. Technologii, aby wesprzeć nowo mianowaną CEO Satyę Nadellę.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (ur. 28 października 1955 r.) To amerykański magnat biznesowy, programista, inwestor i filantrop. Najbardziej znany jest jako współzałożyciel Microsoft Corporation. Podczas swojej kariery w Microsoft Gates zajmował stanowiska prezesa, dyrektora generalnego (CEO), prezesa i głównego architekta oprogramowania, będąc jednocześnie największym indywidualnym akcjonariuszem do maja 2014 r. Jest jednym z najbardziej znanych przedsiębiorców i pionierów rewolucja mikrokomputerowa lat 70. i 80. Urodzony i wychowany w Seattle w stanie Waszyngton, Gates był współzałożycielem Microsoftu z przyjacielem z dzieciństwa Paulem Allenem w 1975 r. W Albuquerque w Nowym Meksyku; stała się największą na świecie firmą produkującą oprogramowanie komputerowe. Gates prowadził firmę jako prezes i dyrektor generalny, aż do ustąpienia ze stanowiska dyrektora generalnego w styczniu 2000 r., Ale pozostał przewodniczącym i został głównym architektem oprogramowania. Pod koniec lat 90. Gates był krytykowany za taktykę biznesową, którą uważano za antykonkurencyjną. Opinię tę podtrzymują liczne orzeczenia sądowe. W czerwcu 2006 r. Gates ogłosił, że przejdzie do pracy w niepełnym wymiarze godzin w Microsoft i pracy w pełnym wymiarze godzin w Bill & Melinda Gates Foundation, prywatnej fundacji charytatywnej, którą on i jego żona Melinda Gates utworzyli w 2000 r. Stopniowo przeniósł obowiązki na Raya Ozziego i Craiga Mundie. 
Zrezygnował z funkcji prezesa Microsoftu w lutym 2014 r. I objął nowe stanowisko jako doradca ds. Technologii, aby wesprzeć nowo mianowaną CEO Satyę Nadellę."""] ner_df = nlu.load('pl.ner.wikiner.glove.840B_300').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title} ## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |William Henry Gates III|PER | |To amerykański |MISC | |Microsoft Corporation |ORG | |Microsoft Gates |ORG | |CEO |ORG | |Urodzony |MISC | |Seattle |LOC | |Waszyngton |LOC | |Gates |PER | |Microsoftu |ORG | |Paulem Allenem |PER | |Albuquerque |LOC | |Nowym Meksyku |LOC | |Gates |PER | |Ale |PER | |Pod koniec lat 90 |MISC | |Gates |PER | |Opinię |MISC | |W czerwcu 2006 |MISC | |Gates |PER | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wikiner_840B_300| |Type:|ner| |Compatibility:| Spark NLP 2.5.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|pl| |Case sensitive:|false| {:.h2_title} ## Data Source The model was trained based on data from [https://pl.wikipedia.org](https://pl.wikipedia.org) --- layout: model title: Extract Substance Usage Entities from Social Determinants of Health Texts author: John Snow Labs name: ner_sdoh_substance_usage_wip date: 2023-02-23 tags: [licensed, en, sdoh, ner, clinical, social_determinants, public_health, substance_usage] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.3.1 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts substance usage information related to Social Determinants of Health from various kinds of biomedical documents. 
## Predicted Entities `Smoking`, `Substance_Duration`, `Substance_Use`, `Substance_Quantity`, `Substance_Frequency`, `Alcohol` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_substance_usage_wip_en_4.3.1_3.0_1677186927181.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_substance_usage_wip_en_4.3.1_3.0_1677186927181.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from pyspark.sql.types import StringType document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_sdoh_substance_usage_wip", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter ]) sample_texts = ["He does drink occasional alcohol approximately 5 to 6 alcoholic drinks per month.", "He continues to smoke one pack of cigarettes daily, as he has for the past 28 years."] data = spark.createDataFrame(sample_texts, StringType()).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_sdoh_substance_usage_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() 
.setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter )) val data = Seq("He does drink occasional alcohol approximately 5 to 6 alcoholic drinks per month.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
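The `begin`/`end` values in the output are character offsets into the input text, with a 0-based start and an inclusive end index. A quick, framework-free sanity check of that convention against the first sample sentence (offsets taken from the sample output below):

```python
# Spark NLP chunk offsets are 0-based with an inclusive end index,
# so a chunk spanning begin..end is text[begin:end + 1] in Python.
text = "He does drink occasional alcohol approximately 5 to 6 alcoholic drinks per month."

def chunk_at(s: str, begin: int, end: int) -> str:
    """Recover a chunk's surface form from its begin/end offsets."""
    return s[begin:end + 1]

print(chunk_at(text, 8, 12))   # drink
print(chunk_at(text, 14, 23))  # occasional
print(chunk_at(text, 47, 52))  # 5 to 6
```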
## Results ```bash +----------------+-----+---+-------------------+ |chunk |begin|end|ner_label | +----------------+-----+---+-------------------+ |drink |8 |12 |Alcohol | |occasional |14 |23 |Substance_Frequency| |alcohol |25 |31 |Alcohol | |5 to 6 |47 |52 |Substance_Quantity | |alcoholic drinks|54 |69 |Alcohol | |per month |71 |79 |Substance_Frequency| |smoke |16 |20 |Smoking | |one pack |22 |29 |Substance_Quantity | |cigarettes |34 |43 |Smoking | |daily |45 |49 |Substance_Frequency| |past 28 years |70 |82 |Substance_Duration | +----------------+-----+---+-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_sdoh_substance_usage_wip| |Compatibility:|Healthcare NLP 4.3.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.0 MB| ## Benchmarking ```bash label tp fp fn total precision recall f1 Substance_Frequency 52.0 2.0 12.0 64.0 0.962963 0.812500 0.881356 Smoking 77.0 4.0 2.0 79.0 0.950617 0.974684 0.962500 Alcohol 327.0 8.0 15.0 342.0 0.976119 0.956140 0.966027 Substance_Quantity 74.0 7.0 12.0 86.0 0.913580 0.860465 0.886228 Substance_Duration 27.0 7.0 14.0 41.0 0.794118 0.658537 0.720000 Substance_Use 204.0 8.0 6.0 210.0 0.962264 0.971429 0.966825 ``` --- layout: model title: Spanish DistilBertForQuestionAnswering model (from CenIA) TAR author: John Snow Labs name: distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_tar date: 2022-06-08 tags: [es, open_source, distilbert, question_answering] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`distillbert-base-spanish-uncased-finetuned-qa-tar` is a Spanish model originally trained by `CenIA`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_tar_es_4.0.0_3.0_1654728163939.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_tar_es_4.0.0_3.0_1654728163939.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_tar","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["¿Cuál es mi nombre?", "Mi nombre es Clara y vivo en Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_tar","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("¿Cuál es mi nombre?", "Mi nombre es Clara y vivo en Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.distil_bert.base_uncased").predict("""¿Cuál es mi nombre?|||Mi nombre es Clara y vivo en Berkeley.""") ```
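The NLU one-liner above passes the question and the context as a single string separated by `|||`. As a rough illustration of that convention only (a minimal sketch with a hypothetical helper, not NLU's actual implementation), the payload can be modeled as:

```python
def split_qa(payload: str, sep: str = "|||"):
    """Split a 'question|||context' payload into its two parts (illustrative only)."""
    question, _, context = payload.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa("¿Cuál es mi nombre?|||Mi nombre es Clara y vivo en Berkeley.")
```

The separator simply lets a single `predict()` string carry both fields; passing a string without `|||` would leave the context empty.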
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_tar| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|250.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/CenIA/distillbert-base-spanish-uncased-finetuned-qa-tar --- layout: model title: Pipeline to Resolve Medication Codes(Transform) author: John Snow Labs name: medication_resolver_transform_pipeline date: 2023-04-11 tags: [resolver, rxnorm, ndc, snomed, umls, ade, pipeline, en, licensed] task: Entity Resolution language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pretrained resolver pipeline to extract medications and resolve their adverse reactions (ADE), RxNorm, UMLS, NDC, SNOMED CT codes, and action/treatments in clinical text. Action/treatments are available for branded medication, and SNOMED codes are available for non-branded medication. This pipeline can be used with Spark transform. You can use `medication_resolver_pipeline` as Lightpipeline (with `annotate/fullAnnotate`). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/medication_resolver_transform_pipeline_en_4.3.2_3.0_1681190723377.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/medication_resolver_transform_pipeline_en_4.3.2_3.0_1681190723377.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline medication_resolver_pipeline = PretrainedPipeline("medication_resolver_transform_pipeline", "en", "clinical/models") text = """The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""" data = spark.createDataFrame([[text]]).toDF("text") result = medication_resolver_pipeline.transform(data) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val medication_resolver_pipeline = new PretrainedPipeline("medication_resolver_transform_pipeline", "en", "clinical/models") val data = Seq("""The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""").toDS.toDF("text") val result = medication_resolver_pipeline.transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.medication_transform.pipeline").predict("""The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""") ```
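The pipeline resolves each medication to both an NDC product code and an NDC package code; as the results table shows, the package code extends the product code with a package-size segment (e.g. `00078-0234` → `00078-0234-05`). A minimal sketch of a consistency check one might run over the output (a hypothetical helper, not part of the pipeline):

```python
def package_matches_product(product: str, package: str) -> bool:
    """True if an NDC package code (e.g. '00078-0234-05') extends its product code ('00078-0234')."""
    if product == "NONE" or package == "NONE":
        return True  # nothing to check for unmapped entries
    return package.startswith(product + "-")

# Product/package pairs as they appear in this pipeline's example output
rows = [("00093-7693", "00093-7693-56"), ("00078-0234", "00078-0234-05"), ("NONE", "NONE")]
checks = [package_matches_product(p, k) for p, k in rows]
```

A check like this can catch mismatched code pairs when post-processing the transformed DataFrame.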
## Results ```bash | chunk | ner_label | ADE | RxNorm | Action | Treatment | UMLS | SNOMED_CT | NDC_Product | NDC_Package | |:-----------------------------|:------------|:----------------------------|---------:|:---------------------------|:-------------------------------------------|:---------|:------------|:--------------|:--------------| | Amlodopine Vallarta 10-320mg | DRUG | Gynaecomastia | 722131 | NONE | NONE | C1949334 | 425838008 | 00093-7693 | 00093-7693-56 | | Eviplera | DRUG | Anxiety | 217010 | Inhibitory Bone Resorption | Osteoporosis | C0720318 | NONE | NONE | NONE | | Lescol 40 MG | DRUG | NONE | 103919 | Hypocholesterolemic | Heterozygous Familial Hypercholesterolemia | C0353573 | NONE | 00078-0234 | 00078-0234-05 | | Everolimus 1.5 mg tablet | DRUG | Acute myocardial infarction | 2056895 | NONE | NONE | C4723581 | NONE | 00054-0604 | 00054-0604-21 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|medication_resolver_transform_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|3.2 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel - TextMatcherModel - ChunkMergeModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger - Doc2Chunk - ResolverMerger - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - Doc2Chunk - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - Finisher --- layout: model title: Translate Lunda to English Pipeline author: John Snow Labs name: translate_lun_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, lun, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: 
"Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended. - source languages: `lun` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_lun_en_xx_2.7.0_2.4_1609690219054.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_lun_en_xx_2.7.0_2.4_1609690219054.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_lun_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_lun_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.lun.translate_to.en').predict(text, output_level='sentence') translate_df ```
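The NLU reference `xx.lun.translate_to.en` follows the pattern `xx.<source>.translate_to.<target>`, built from the ISO 639 language codes listed above. A minimal sketch of that naming convention (an assumption drawn from this card's example, with a hypothetical helper, not an official NLU API):

```python
def translate_key(src: str, tgt: str) -> str:
    # Hypothetical helper: builds the nlu.load() reference for an OPUS-MT translation pipeline.
    return f"xx.{src}.translate_to.{tgt}"

key = translate_key("lun", "en")
```

The same pattern should recover the load key for the other translation cards on this page (e.g. `xx.en.translate_to.it`).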
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_lun_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Detect Genomic Variant Information (ner_genetic_variants) author: John Snow Labs name: ner_genetic_variants_pipeline date: 2023-03-14 tags: [ner, en, clinical, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_genetic_variants](https://nlp.johnsnowlabs.com/2021/06/25/ner_genetic_variants_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_genetic_variants_pipeline_en_4.3.0_3.2_1678784720653.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_genetic_variants_pipeline_en_4.3.0_3.2_1678784720653.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_genetic_variants_pipeline", "en", "clinical/models") text = '''The mutation pattern of mitochondrial DNA (mtDNA) in mainland Chinese patients with mitochondrial myopathy, encephalopathy, lactic acidosis and stroke-like episodes (MELAS) has been rarely reported, though previous data suggested that the mutation pattern of MELAS could be different among geographically localized populations. We presented the results of comprehensive mtDNA mutation analysis in 92 unrelated Chinese patients with MELAS (85 with classic MELAS and 7 with MELAS/Leigh syndrome (LS) overlap syndrome). The mtDNA A3243G mutation was the most common causal genotype in this patient group (79/92 and 85.9%). The second common gene mutation was G13513A (7/92 and 7.6%). Additionally, we identified T10191C (p.S45P) in ND3, A11470C (p. K237N) in ND4, T13046C (p.M237T) in ND5 and a large-scale deletion (13025-13033:14417-14425) involving partial ND5 and ND6 subunits of complex I in one patient each. Among them, A11470C, T13046C and the single deletion were novel mutations. In summary, patients with mutations affecting mitochondrially encoded complex I (MTND) reached 12.0% (11/92) in this group. It is noteworthy that all seven patients with MELAS/LS overlap syndrome were associated with MTND mutations. Our data emphasize the important role of MTND mutations in the pathogenicity of MELAS, especially MELAS/LS overlap syndrome.PURPOSE: Genes in the complement pathway, including complement factor H (CFH), C2/BF, and C3, have been reported to be associated with age-related macular degeneration (AMD). Genetic variants, single-nucleotide polymorphisms (SNPs), in these genes were geno-typed for a case-control association study in a mainland Han Chinese population. 
METHODS: One hundred and fifty-eight patients with wet AMD, 80 patients with soft drusen, and 220 matched control subjects were recruited among Han Chinese in mainland China. Seven SNPs in CFH and two SNPs in C2, CFB', and C3 were genotyped using the ABI SNaPshot method. A deletion of 84,682 base pairs covering the CFHR1 and CFHR3 genes was detected by direct polymerase chain reaction and gel electrophoresis. RESULTS: Four SNPs, including rs3753394 (P = 0.0276), rs800292 (P = 0.0266), rs1061170 (P = 0.00514), and rs1329428 (P = 0.0089), in CFH showed a significant association with wet AMD in the cohort of this study. A haplotype containing these four SNPs (CATA) significantly increased protection of wet AMD with a P value of 0.0005 and an odds ratio of 0.29 (95% confidence interval: 0.15-0.60). Unlike in other populations, rs2274700 and rs1410996 did not show a significant association with AMD in the Chinese population of this study. None of the SNPs in CFH showed a significant association with drusen, and none of the SNPs in CFH, C2, CFB, and C3 showed a significant association with either wet AMD or drusen in the cohort of this study. The CFHR1 and CFHR3 deletion was not polymorphic in the Chinese population and was not associated with wet AMD or drusen. CONCLUSION: This study showed that SNPs rs3753394 (P = 0.0276), rs800292 (P = 0.0266), rs1061170 (P = 0.00514), and rs1329428 (P = 0.0089), but not rs7535263, rs1410996, or rs2274700, in CFH were significantly associated with wet AMD in a mainland Han Chinese population. 
This study showed that CFH was more likely to be AMD susceptibility gene at Chr.1q31 based on the finding that the CFHR1 and CFHR3 deletion was not polymorphic in the cohort of this study, and none of the SNPs that were significantly associated with AMD in a white population in C2, CFB, and C3 genes showed a significant association with AMD.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_genetic_variants_pipeline", "en", "clinical/models") val text = "The mutation pattern of mitochondrial DNA (mtDNA) in mainland Chinese patients with mitochondrial myopathy, encephalopathy, lactic acidosis and stroke-like episodes (MELAS) has been rarely reported, though previous data suggested that the mutation pattern of MELAS could be different among geographically localized populations. We presented the results of comprehensive mtDNA mutation analysis in 92 unrelated Chinese patients with MELAS (85 with classic MELAS and 7 with MELAS/Leigh syndrome (LS) overlap syndrome). The mtDNA A3243G mutation was the most common causal genotype in this patient group (79/92 and 85.9%). The second common gene mutation was G13513A (7/92 and 7.6%). Additionally, we identified T10191C (p.S45P) in ND3, A11470C (p. K237N) in ND4, T13046C (p.M237T) in ND5 and a large-scale deletion (13025-13033:14417-14425) involving partial ND5 and ND6 subunits of complex I in one patient each. Among them, A11470C, T13046C and the single deletion were novel mutations. In summary, patients with mutations affecting mitochondrially encoded complex I (MTND) reached 12.0% (11/92) in this group. It is noteworthy that all seven patients with MELAS/LS overlap syndrome were associated with MTND mutations. 
Our data emphasize the important role of MTND mutations in the pathogenicity of MELAS, especially MELAS/LS overlap syndrome.PURPOSE: Genes in the complement pathway, including complement factor H (CFH), C2/BF, and C3, have been reported to be associated with age-related macular degeneration (AMD). Genetic variants, single-nucleotide polymorphisms (SNPs), in these genes were geno-typed for a case-control association study in a mainland Han Chinese population. METHODS: One hundred and fifty-eight patients with wet AMD, 80 patients with soft drusen, and 220 matched control subjects were recruited among Han Chinese in mainland China. Seven SNPs in CFH and two SNPs in C2, CFB', and C3 were genotyped using the ABI SNaPshot method. A deletion of 84,682 base pairs covering the CFHR1 and CFHR3 genes was detected by direct polymerase chain reaction and gel electrophoresis. RESULTS: Four SNPs, including rs3753394 (P = 0.0276), rs800292 (P = 0.0266), rs1061170 (P = 0.00514), and rs1329428 (P = 0.0089), in CFH showed a significant association with wet AMD in the cohort of this study. A haplotype containing these four SNPs (CATA) significantly increased protection of wet AMD with a P value of 0.0005 and an odds ratio of 0.29 (95% confidence interval: 0.15-0.60). Unlike in other populations, rs2274700 and rs1410996 did not show a significant association with AMD in the Chinese population of this study. None of the SNPs in CFH showed a significant association with drusen, and none of the SNPs in CFH, C2, CFB, and C3 showed a significant association with either wet AMD or drusen in the cohort of this study. The CFHR1 and CFHR3 deletion was not polymorphic in the Chinese population and was not associated with wet AMD or drusen. 
CONCLUSION: This study showed that SNPs rs3753394 (P = 0.0276), rs800292 (P = 0.0266), rs1061170 (P = 0.00514), and rs1329428 (P = 0.0089), but not rs7535263, rs1410996, or rs2274700, in CFH were significantly associated with wet AMD in a mainland Han Chinese population. This study showed that CFH was more likely to be AMD susceptibility gene at Chr.1q31 based on the finding that the CFHR1 and CFHR3 deletion was not polymorphic in the cohort of this study, and none of the SNPs that were significantly associated with AMD in a white population in C2, CFB, and C3 genes showed a significant association with AMD." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.genetic_variants.pipeline").predict("""The mutation pattern of mitochondrial DNA (mtDNA) in mainland Chinese patients with mitochondrial myopathy, encephalopathy, lactic acidosis and stroke-like episodes (MELAS) has been rarely reported, though previous data suggested that the mutation pattern of MELAS could be different among geographically localized populations. We presented the results of comprehensive mtDNA mutation analysis in 92 unrelated Chinese patients with MELAS (85 with classic MELAS and 7 with MELAS/Leigh syndrome (LS) overlap syndrome). The mtDNA A3243G mutation was the most common causal genotype in this patient group (79/92 and 85.9%). The second common gene mutation was G13513A (7/92 and 7.6%). Additionally, we identified T10191C (p.S45P) in ND3, A11470C (p. K237N) in ND4, T13046C (p.M237T) in ND5 and a large-scale deletion (13025-13033:14417-14425) involving partial ND5 and ND6 subunits of complex I in one patient each. Among them, A11470C, T13046C and the single deletion were novel mutations. In summary, patients with mutations affecting mitochondrially encoded complex I (MTND) reached 12.0% (11/92) in this group. It is noteworthy that all seven patients with MELAS/LS overlap syndrome were associated with MTND mutations. 
Our data emphasize the important role of MTND mutations in the pathogenicity of MELAS, especially MELAS/LS overlap syndrome.PURPOSE: Genes in the complement pathway, including complement factor H (CFH), C2/BF, and C3, have been reported to be associated with age-related macular degeneration (AMD). Genetic variants, single-nucleotide polymorphisms (SNPs), in these genes were geno-typed for a case-control association study in a mainland Han Chinese population. METHODS: One hundred and fifty-eight patients with wet AMD, 80 patients with soft drusen, and 220 matched control subjects were recruited among Han Chinese in mainland China. Seven SNPs in CFH and two SNPs in C2, CFB', and C3 were genotyped using the ABI SNaPshot method. A deletion of 84,682 base pairs covering the CFHR1 and CFHR3 genes was detected by direct polymerase chain reaction and gel electrophoresis. RESULTS: Four SNPs, including rs3753394 (P = 0.0276), rs800292 (P = 0.0266), rs1061170 (P = 0.00514), and rs1329428 (P = 0.0089), in CFH showed a significant association with wet AMD in the cohort of this study. A haplotype containing these four SNPs (CATA) significantly increased protection of wet AMD with a P value of 0.0005 and an odds ratio of 0.29 (95% confidence interval: 0.15-0.60). Unlike in other populations, rs2274700 and rs1410996 did not show a significant association with AMD in the Chinese population of this study. None of the SNPs in CFH showed a significant association with drusen, and none of the SNPs in CFH, C2, CFB, and C3 showed a significant association with either wet AMD or drusen in the cohort of this study. The CFHR1 and CFHR3 deletion was not polymorphic in the Chinese population and was not associated with wet AMD or drusen. 
CONCLUSION: This study showed that SNPs rs3753394 (P = 0.0276), rs800292 (P = 0.0266), rs1061170 (P = 0.00514), and rs1329428 (P = 0.0089), but not rs7535263, rs1410996, or rs2274700, in CFH were significantly associated with wet AMD in a mainland Han Chinese population. This study showed that CFH was more likely to be AMD susceptibility gene at Chr.1q31 based on the finding that the CFHR1 and CFHR3 deletion was not polymorphic in the cohort of this study, and none of the SNPs that were significantly associated with AMD in a white population in C2, CFB, and C3 genes showed a significant association with AMD.""") ```
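Among the pipeline's entity types, the SNP label covers dbSNP reference identifiers (`rs` followed by digits), as visible in the output. A plain-regex sketch of that surface pattern, for intuition only (the pipeline itself is a neural NER model, not a regex matcher):

```python
import re

# dbSNP reference identifiers: 'rs' followed by one or more digits
RS_ID = re.compile(r"\brs\d+\b")

text = ("SNPs rs3753394 (P = 0.0276), rs800292 (P = 0.0266), and rs1061170 "
        "in CFH showed a significant association with wet AMD.")
snps = RS_ID.findall(text)
```

A neural model additionally recognizes context-dependent mentions (e.g. `A3243G`, `p.S45P`) that a single surface pattern cannot capture reliably.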
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-------------|--------:|------:|:----------------|-------------:| | 0 | A3243G | 527 | 532 | DNAMutation | 1 | | 1 | G13513A | 656 | 662 | DNAMutation | 0.9994 | | 2 | T10191C | 709 | 715 | DNAMutation | 1 | | 3 | p.S45P | 718 | 723 | ProteinMutation | 1 | | 4 | A11470C | 734 | 740 | DNAMutation | 1 | | 5 | p. K237N | 743 | 750 | ProteinMutation | 1 | | 6 | T13046C | 761 | 767 | DNAMutation | 1 | | 7 | p.M237T | 770 | 776 | ProteinMutation | 1 | | 8 | A11470C | 924 | 930 | DNAMutation | 1 | | 9 | T13046C | 933 | 939 | DNAMutation | 0.9986 | | 10 | rs3753394 | 2126 | 2134 | SNP | 1 | | 11 | rs800292 | 2150 | 2157 | SNP | 1 | | 12 | rs1061170 | 2173 | 2181 | SNP | 1 | | 13 | rs1329428 | 2202 | 2210 | SNP | 1 | | 14 | rs2274700 | 2518 | 2526 | SNP | 1 | | 15 | rs1410996 | 2532 | 2540 | SNP | 1 | | 16 | rs3753394 | 3000 | 3008 | SNP | 1 | | 17 | rs800292 | 3024 | 3031 | SNP | 1 | | 18 | rs1061170 | 3047 | 3055 | SNP | 1 | | 19 | rs1329428 | 3076 | 3084 | SNP | 1 | | 20 | rs7535263 | 3108 | 3116 | SNP | 1 | | 21 | rs1410996 | 3119 | 3127 | SNP | 1 | | 22 | rs2274700 | 3133 | 3141 | SNP | 1 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_genetic_variants_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Italian XlmRoBertaForQuestionAnswering (from Gantenbein) author: John Snow Labs name: xlm_roberta_qa_ADDI_IT_XLM_R date: 2022-06-23 tags: [it, open_source, question_answering, xlmroberta] task: Question Answering language: it edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" 
--- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ADDI-IT-XLM-R` is an Italian model originally trained by `Gantenbein`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_ADDI_IT_XLM_R_it_4.0.0_3.0_1655983184346.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_ADDI_IT_XLM_R_it_4.0.0_3.0_1655983184346.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_ADDI_IT_XLM_R","it") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_ADDI_IT_XLM_R","it") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("it.answer_question.xlm_roberta").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_ADDI_IT_XLM_R| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|it| |Size:|778.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Gantenbein/ADDI-IT-XLM-R --- layout: model title: Translate English to Italian Pipeline author: John Snow Labs name: translate_en_it date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, it, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `it` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_it_xx_2.7.0_2.4_1609685985914.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_it_xx_2.7.0_2.4_1609685985914.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_it", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_it", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.it').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_it| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering model (from huggingfaceepita) author: John Snow Labs name: distilbert_qa_huggingfaceepita_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `huggingfaceepita`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_huggingfaceepita_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725513669.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_huggingfaceepita_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725513669.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_huggingfaceepita_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_huggingfaceepita_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_huggingfaceepita").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_huggingfaceepita_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/huggingfaceepita/distilbert-base-uncased-finetuned-squad --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_twostagequadruplet_hier_epochs_1_shard_1_squad2.0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_twostagequadruplet_hier_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_twostagequadruplet_hier_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223924983.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_twostagequadruplet_hier_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223924983.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_twostagequadruplet_hier_epochs_1_shard_1_squad2.0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_twostagequadruplet_hier_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_twostagequadruplet_hier_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|306.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_twostagequadruplet_hier_epochs_1_shard_1_squad2.0 --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC2GM_Gene_Modified_scibert_scivocab_cased date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC2GM-Gene-Modified_scibert_scivocab_cased` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `GENE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC2GM_Gene_Modified_scibert_scivocab_cased_en_4.0.0_3.0_1657107964426.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC2GM_Gene_Modified_scibert_scivocab_cased_en_4.0.0_3.0_1657107964426.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC2GM_Gene_Modified_scibert_scivocab_cased","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC2GM_Gene_Modified_scibert_scivocab_cased","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC2GM_Gene_Modified_scibert_scivocab_cased| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|410.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC2GM-Gene-Modified_scibert_scivocab_cased --- layout: model title: English DistilBertForQuestionAnswering model (from rowan1224) author: John Snow Labs name: distilbert_qa_squad_slp date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-squad-slp` is an English model originally trained by `rowan1224`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_squad_slp_en_4.0.0_3.0_1654727731033.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_squad_slp_en_4.0.0_3.0_1654727731033.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_slp","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_slp","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.by_rowan1224").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_squad_slp| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/rowan1224/distilbert-squad-slp --- layout: model title: Explain Document pipeline for Korean (explain_document_lg) author: John Snow Labs name: explain_document_lg date: 2021-04-30 tags: [korean, open_source, explain_document_lg, pipeline, ko, ner] task: Named Entity Recognition language: ko edition: Spark NLP 3.0.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_lg is a pre-trained pipeline that we can use to process text with a simple pipeline that performs basic processing steps and recognizes entities. It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_ko_3.0.2_3.0_1619772353571.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_lg_ko_3.0.2_3.0_1619772353571.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_lg', lang = 'ko') annotations = pipeline.fullAnnotate("안녕하세요, 환영합니다!")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_lg", lang = "ko") val result = pipeline.fullAnnotate("안녕하세요, 환영합니다!")(0) ``` {:.nlu-block} ```python import nlu nlu.load("ko.explain_document").predict("""안녕하세요, 환영합니다!""") ```
## Results ```bash +------------------------+--------------------------+--------------------------+--------------------------------+----------------------------+---------------------+ |text |document |sentence |token |ner |ner_chunk | +------------------------+--------------------------+--------------------------+--------------------------------+----------------------------+---------------------+ |안녕, 존 스노우!|[안녕, 존 스노우!]|[안녕, 존 스노우!]|[안녕, ,, 존, 스노우, !] |[B-DATE, O, O, O, O]| [안녕] | +------------------------+--------------------------+--------------------------+--------------------------------+----------------------------+---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_lg| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.2+| |License:|Open Source| |Edition:|Official| |Language:|ko| ## Included Models - DocumentAssembler - SentenceDetector - WordSegmenterModel - WordEmbeddingsModel - NerDLModel - NerConverter --- layout: model title: English asr_Dansk_wav2vec2_stt TFWav2Vec2ForCTC from Siyam author: John Snow Labs name: asr_Dansk_wav2vec2_stt date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Dansk_wav2vec2_stt` is an English model originally trained by Siyam.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_Dansk_wav2vec2_stt_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Dansk_wav2vec2_stt_en_4.2.0_3.0_1664120202511.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Dansk_wav2vec2_stt_en_4.2.0_3.0_1664120202511.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_Dansk_wav2vec2_stt", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_Dansk_wav2vec2_stt", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
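Both snippets above assume an existing `audioDf` whose `audio_content` column holds the raw audio samples as floats. As a minimal sketch (the file name `speech.wav` and the mono 16-bit PCM format are assumptions, not part of this model card), the standard-library `wave` module can decode a WAV file into normalized floats before building that DataFrame:

```python
import struct
import wave

def wav_to_floats(path):
    """Decode a mono 16-bit PCM WAV file into floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    # "<h" = little-endian signed 16-bit; one sample per 2 bytes
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Hypothetical usage with an active Spark session:
# floats = wav_to_floats("speech.wav")
# audioDf = spark.createDataFrame([[floats]], ["audio_content"])
```

Wav2Vec2 models are typically trained on 16 kHz audio, so resample beforehand if your recordings use another rate.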
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_Dansk_wav2vec2_stt| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: English T5ForConditionalGeneration Small Cased model (from mtreviso) author: John Snow Labs name: t5_ct5_small_wiki date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ct5-small-en-wiki` is an English model originally trained by `mtreviso`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_ct5_small_wiki_en_4.3.0_3.0_1675100765401.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_ct5_small_wiki_en_4.3.0_3.0_1675100765401.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_ct5_small_wiki","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_ct5_small_wiki","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_ct5_small_wiki| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|289.0 MB| ## References - https://huggingface.co/mtreviso/ct5-small-en-wiki - https://github.com/mtreviso/chunked-t5 --- layout: model title: Translate Luo (Kenya and Tanzania) to English Pipeline author: John Snow Labs name: translate_luo_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, luo, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as GPU is recommended. - source languages: `luo` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_luo_en_xx_2.7.0_2.4_1609690840125.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_luo_en_xx_2.7.0_2.4_1609690840125.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_luo_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_luo_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.luo.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_luo_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Part of Speech for Bengali author: John Snow Labs name: pos_msri date: 2021-03-09 tags: [part_of_speech, open_source, bengali, pos_msri, bn] task: Part of Speech Tagging language: bn edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - NN - SYM - NNP - VM - INTF - JJ - QF - CC - NST - PSP - QC - DEM - RDP - PRP - NEG - WQ - RB - VAUX - UT - XC - RP - QO - BM - NNC - PPR - INJ - CL - UNK {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_msri_bn_3.0.0_3.0_1615292420029.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_msri_bn_3.0.0_3.0_1615292420029.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"]) \ .setOutputCol("token") posTagger = PerceptronModel.pretrained("pos_msri", "bn") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[document_assembler, tokenizer, posTagger]) example = spark.createDataFrame([['জন স্নো ল্যাবস থেকে হ্যালো! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val posTagger = PerceptronModel.pretrained("pos_msri", "bn") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, posTagger)) val data = Seq("জন স্নো ল্যাবস থেকে হ্যালো! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["জন স্নো ল্যাবস থেকে হ্যালো! "] token_df = nlu.load('bn.pos').predict(text) token_df ```
## Results ```bash token pos 0 জন NN 1 স্নো NN 2 ল্যাবস NN 3 থেকে PSP 4 হ্যালো JJ 5 ! SYM ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_msri| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|bn| --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_256_finetuned_squad_seed_8 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-256-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_256_finetuned_squad_seed_8_en_4.3.0_3.0_1674215082127.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_256_finetuned_squad_seed_8_en_4.3.0_3.0_1674215082127.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_256_finetuned_squad_seed_8","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_256_finetuned_squad_seed_8","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_256_finetuned_squad_seed_8| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|427.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-256-finetuned-squad-seed-8 --- layout: model title: English asr_wav2vec2_base_timit_demo_colab53_by_hassnain TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab53_by_hassnain date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab53_by_hassnain` is an English model originally trained by hassnain. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab53_by_hassnain_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab53_by_hassnain_en_4.2.0_3.0_1664022647575.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab53_by_hassnain_en_4.2.0_3.0_1664022647575.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab53_by_hassnain', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab53_by_hassnain", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab53_by_hassnain| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English asr_wav2vec2_base_timit_demo_colab2_by_ahmad573 TFWav2Vec2ForCTC from ahmad573 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab2_by_ahmad573 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab2_by_ahmad573` is an English model originally trained by ahmad573. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab2_by_ahmad573_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab2_by_ahmad573_en_4.2.0_3.0_1664036864621.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab2_by_ahmad573_en_4.2.0_3.0_1664036864621.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab2_by_ahmad573', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab2_by_ahmad573", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab2_by_ahmad573| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_9 TFWav2Vec2ForCTC from nimrah author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_9 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_9` is an English model originally trained by nimrah. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_9_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_9_en_4.2.0_3.0_1664118102685.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_9_en_4.2.0_3.0_1664118102685.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_9', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_9", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_9| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal Disability Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_disability_bert date: 2023-03-05 tags: [en, legal, classification, clauses, disability, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Disability` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that this model's embeddings allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Disability`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_disability_bert_en_1.0.0_3.0_1678050630345.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_disability_bert_en_1.0.0_3.0_1678050630345.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_disability_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
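Before running the pipeline on long filings, the paragraph-splitting approach described above can be done with plain Python. This is a minimal illustrative sketch (not the workshop utilities), splitting on blank lines so each provision-sized paragraph can go into its own row of the `text` column:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into provision-sized paragraphs on blank lines."""
    # Two or more consecutive newlines mark a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)
    # Drop empty fragments and surrounding whitespace.
    return [p.strip() for p in paragraphs if p.strip()]

doc = ("Section 1. Disability.\nIf the Executive is unable to perform his duties...\n\n"
       "Section 2. Termination.\nThis Agreement may be terminated...")
paragraphs = split_paragraphs(doc)
```

Each resulting paragraph can then be classified independently, keeping every input well under the 512-token embedding limit.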
## Results ```bash +-------+ |result| +-------+ |[Disability]| |[Other]| |[Other]| |[Disability]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_disability_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Disability 0.94 0.94 0.94 31 Other 0.95 0.95 0.95 43 accuracy - - 0.95 74 macro-avg 0.94 0.94 0.94 74 weighted-avg 0.95 0.95 0.95 74 ``` --- layout: model title: Italian T5ForConditionalGeneration Base Cased model (from it5) author: John Snow Labs name: t5_it5_base_news_summarization date: 2023-01-30 tags: [it, open_source, t5, tensorflow] task: Text Generation language: it edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `it5-base-news-summarization` is an Italian model originally trained by `it5`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_it5_base_news_summarization_it_4.3.0_3.0_1675103162283.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_it5_base_news_summarization_it_4.3.0_3.0_1675103162283.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_it5_base_news_summarization","it") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_it5_base_news_summarization","it") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_it5_base_news_summarization| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|it| |Size:|1.0 GB| ## References - https://huggingface.co/it5/it5-base-news-summarization - https://arxiv.org/abs/2203.03759 - https://gsarti.com - https://malvinanissim.github.io - https://github.com/gsarti/it5 - https://paperswithcode.com/sota?task=News+Summarization&dataset=NewsSum-IT --- layout: model title: Translate German to English Pipeline author: John Snow Labs name: translate_de_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, de, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `de` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_de_en_xx_2.7.0_2.4_1609688947536.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_de_en_xx_2.7.0_2.4_1609688947536.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_de_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_de_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.de.translate_to.en').predict(text, output_level='sentence') translate_df ```
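Because translation cost grows quickly with sequence length, very long inputs are often pre-split into sentence-sized pieces before being passed to `annotate`. A minimal, illustrative pre-split in plain Python (this helper is not part of the Spark NLP API; a sentence detector annotator would do this more robustly):

```python
import re

def split_sentences(text: str) -> list:
    """Naive sentence splitter: break on '.', '!' or '?' followed by whitespace."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in pieces if p]

long_text = "Das ist der erste Satz. Hier kommt der zweite! Und noch ein dritter?"
batches = split_sentences(long_text)
# Each piece can then be translated separately:
# for s in batches: pipeline.annotate(s)
```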
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_de_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Detect Clinical Entities (ner_jsl) author: John Snow Labs name: ner_jsl date: 2021-06-21 tags: [ner, licensed, en, clinical] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terminology. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. This model is the official version of the jsl_ner_wip_clinical model. Definitions of Predicted Entities: - `Injury_or_Poisoning`: Physical harm or injury caused to the body, including those caused by accidents, falls, or poisoning of a patient or someone else. - `Direction`: All the information relating to the laterality of the internal and external organs. - `Test`: Mentions of laboratory, pathology, and radiological tests. - `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient. - `Death_Entity`: Mentions that indicate the death of a patient. - `Relationship_Status`: State of the patient's romantic or social relationships (e.g. single, married, divorced). - `Duration`: The duration of a medical treatment or medication use. - `Respiration`: Number of breaths per minute. - `Hyperlipidemia`: Terms that indicate hyperlipidemia with relevant subtypes and synonyms. - `Birth_Entity`: Mentions that indicate giving birth.
- `Age`: All mentions of age, past or present, related to the patient or anybody else. - `Labour_Delivery`: Extractions include stages of labor and delivery. - `Family_History_Header`: Identifies section headers that correspond to Family History of the patient. - `BMI`: Numeric values and other text information related to Body Mass Index. - `Temperature`: All mentions that refer to body temperature. - `Alcohol`: Terms that indicate alcohol use, abuse or drinking issues of a patient or someone else. - `Kidney_Disease`: Terms that refer to any kidney diseases (includes mentions of modifiers such as "Acute" or "Chronic"). - `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else. - `Medical_History_Header`: Identifies section headers that correspond to Past Medical History of a patient. - `Cerebrovascular_Disease`: All terms that refer to cerebrovascular diseases and events. - `Oxygen_Therapy`: Breathing support triggered by patient or entirely or partially by machine (e.g. ventilator, BPAP, CPAP). - `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements. - `Psychological_Condition`: All the mental health diagnoses, disorders, conditions or syndromes of a patient or someone else. - `Heart_Disease`: All mentions of acquired, congenital or degenerative heart diseases. - `Employment`: All mentions of patient or provider occupational titles and employment status. - `Obesity`: Terms related to a patient being obese (overweight and BMI are extracted as different labels). - `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.). - `Pregnancy`: All terms related to Pregnancy (excluding terms that are extracted with their specific labels, such as "Labour_Delivery" etc.).
- `ImagingFindings`: All mentions of radiographic and imaging findings. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments. - `Medical_Device`: All mentions related to medical devices and supplies. - `Race_Ethnicity`: All terms that refer to racial and national origin of sociocultural groups. - `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital signs Headers are extracted separately with their specific labels). - `Symptom`: All the symptoms mentioned in the document, of a patient or someone else. - `Treatment`: Includes therapeutic and minimally invasive treatment and procedures (invasive treatments or procedures are extracted as "Procedure"). - `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs). - `Route`: Drug and medication administration routes as described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_Ingredient`: Active ingredient(s) found in drug products. - `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted. - `Diet`: All mentions and information regarding the patient's dietary habits. - `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by the naked eye. - `LDL`: All mentions related to the lab test and results for LDL (Low Density Lipoprotein). - `VS_Finding`: Qualitative data (e.g. Fever, Cyanosis, Tachycardia) and any other symptoms that refer to vital signs. - `Allergen`: Allergen related extractions mentioned in the document. - `EKG_Findings`: All mentions of EKG readings. - `Imaging_Technique`: All mentions of special radiographic views or special imaging techniques used in radiology.
- `Triglycerides`: All terms related to the specific lab test for Triglycerides. - `RelativeTime`: Time references that are relative to different times or events (e.g. words such as "approximately", "in the morning"). - `Gender`: Gender-specific nouns and pronouns. - `Pulse`: Peripheral heart rate, without advanced information like measurement location. - `Social_History_Header`: Identifies section headers that correspond to Social History of a patient. - `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs). - `Diabetes`: All terms related to diabetes mellitus. - `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately. - `Internal_organ_or_component`: All mentions related to internal body parts or organs that cannot be examined by the naked eye. - `Clinical_Dept`: Terms that indicate the medical and/or surgical departments. - `Form`: Drug and medication forms as described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients. - `Strength`: Potency of one unit of drug (or a combination of drugs); the measurement units are described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Fetus_NewBorn`: All terms related to fetus, infant, new born (excluding terms that are extracted with their specific labels, such as "Labour_Delivery", "Pregnancy" etc.).
- `RelativeDate`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "approximately two years ago", "about two days ago"). - `Height`: All mentions related to a patient's height. - `Test_Result`: Terms related to all the test results present in the document (clinical tests results are included). - `Sexually_Active_or_Sexual_Orientation`: All terms that are related to sexuality, sexual orientations and sexual activity. - `Frequency`: Frequency of administration for a dose prescribed. - `Time`: Specific time references (hour and/or minutes). - `Weight`: All mentions related to a patient's weight. - `Vaccine`: Generic and brand name of vaccines or vaccination procedure. - `Vital_Signs_Header`: Identifies section headers that correspond to Vital Signs of a patient. - `Communicable_Disease`: Includes all mentions of communicable diseases. - `Dosage`: Quantity prescribed by the physician for an active ingredient; the measurement units are described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Overweight`: Terms related to the patient being overweight (BMI and Obesity are extracted separately). - `Hypertension`: All terms related to Hypertension (quantitative data such as 150/100 is extracted as Blood_Pressure). - `HDL`: Terms related to the lab test for HDL (High Density Lipoprotein). - `Total_Cholesterol`: Terms related to the lab test and results for cholesterol. - `Smoking`: All mentions of smoking status of a patient. - `Date`: Mentions of an exact date, in any format, including day number, month and/or year.
## Predicted Entities `Injury_or_Poisoning`, `Direction`, `Test`, `Admission_Discharge`, `Death_Entity`, `Relationship_Status`, `Duration`, `Respiration`, `Hyperlipidemia`, `Birth_Entity`, `Age`, `Labour_Delivery`, `Family_History_Header`, `BMI`, `Temperature`, `Alcohol`, `Kidney_Disease`, `Oncological`, `Medical_History_Header`, `Cerebrovascular_Disease`, `Oxygen_Therapy`, `O2_Saturation`, `Psychological_Condition`, `Heart_Disease`, `Employment`, `Obesity`, `Disease_Syndrome_Disorder`, `Pregnancy`, `ImagingFindings`, `Procedure`, `Medical_Device`, `Race_Ethnicity`, `Section_Header`, `Symptom`, `Treatment`, `Substance`, `Route`, `Drug_Ingredient`, `Blood_Pressure`, `Diet`, `External_body_part_or_region`, `LDL`, `VS_Finding`, `Allergen`, `EKG_Findings`, `Imaging_Technique`, `Triglycerides`, `RelativeTime`, `Gender`, `Pulse`, `Social_History_Header`, `Substance_Quantity`, `Diabetes`, `Modifier`, `Internal_organ_or_component`, `Clinical_Dept`, `Form`, `Drug_BrandName`, `Strength`, `Fetus_NewBorn`, `RelativeDate`, `Height`, `Test_Result`, `Sexually_Active_or_Sexual_Orientation`, `Frequency`, `Time`, `Weight`, `Vaccine`, `Vital_Signs_Header`, `Communicable_Disease`, `Dosage`, `Overweight`, `Hypertension`, `HDL`, `Total_Cholesterol`, `Smoking`, `Date` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_en_3.1.0_3.0_1624284761441.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_en_3.1.0_3.0_1624284761441.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") jsl_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("jsl_ner") jsl_ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "jsl_ner"]) \ .setOutputCol("ner_chunk") jsl_ner_pipeline = Pipeline().setStages([ documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter]) jsl_ner_model = jsl_ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame([["""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature."""]]).toDF("text") result = jsl_ner_model.transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val jsl_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("jsl_ner") val jsl_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "jsl_ner")) .setOutputCol("ner_chunk") val jsl_ner_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter)) val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text") val result = jsl_ner_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
## Results ```bash +-----------------------------------------+----------------------------+ |chunk |ner_label | +-----------------------------------------+----------------------------+ |21-day-old |Age | |Caucasian |Race_Ethnicity | |male |Gender | |for 2 days |Duration | |congestion |Symptom | |mom |Gender | |yellow |Modifier | |discharge |Symptom | |nares |External_body_part_or_region| |she |Gender | |mild |Modifier | |problems with his breathing while feeding|Symptom | |perioral cyanosis |Symptom | |retractions |Symptom | |One day ago |RelativeDate | |mom |Gender | |Tylenol |Drug_BrandName | |Baby |Age | |decreased p.o. intake |Symptom | |His |Gender | +-----------------------------------------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on data gathered and manually annotated by John Snow Labs. https://www.johnsnowlabs.com/data/ ## Benchmarking ```bash entity tp fp fn total precision recall f1 VS_Finding 235.0 46.0 43.0 278.0 0.8363 0.8453 0.8408 Direction 3972.0 465.0 458.0 4430.0 0.8952 0.8966 0.8959 Respiration 82.0 4.0 4.0 86.0 0.9535 0.9535 0.9535 Cerebrovascular_D... 93.0 20.0 24.0 117.0 0.823 0.7949 0.8087 Family_History_He... 88.0 6.0 3.0 91.0 0.9362 0.967 0.9514 Heart_Disease 447.0 82.0 119.0 566.0 0.845 0.7898 0.8164 RelativeTime 158.0 80.0 59.0 217.0 0.6639 0.7281 0.6945 Strength 624.0 58.0 53.0 677.0 0.915 0.9217 0.9183 Smoking 121.0 11.0 4.0 125.0 0.9167 0.968 0.9416 Medical_Device 3716.0 491.0 466.0 4182.0 0.8833 0.8886 0.8859 Pulse 136.0 22.0 14.0 150.0 0.8608 0.9067 0.8831 Psychological_Con... 
135.0 9.0 29.0 164.0 0.9375 0.8232 0.8766 Overweight 2.0 1.0 0.0 2.0 0.6667 1.0 0.8 Triglycerides 3.0 0.0 2.0 5.0 1.0 0.6 0.75 Obesity 42.0 5.0 6.0 48.0 0.8936 0.875 0.8842 Admission_Discharge 318.0 24.0 11.0 329.0 0.9298 0.9666 0.9478 HDL 3.0 0.0 0.0 3.0 1.0 1.0 1.0 Diabetes 110.0 14.0 8.0 118.0 0.8871 0.9322 0.9091 Section_Header 3740.0 148.0 157.0 3897.0 0.9619 0.9597 0.9608 Age 627.0 75.0 48.0 675.0 0.8932 0.9289 0.9107 O2_Saturation 34.0 14.0 17.0 51.0 0.7083 0.6667 0.6869 Kidney_Disease 96.0 12.0 34.0 130.0 0.8889 0.7385 0.8067 Test 2504.0 545.0 498.0 3002.0 0.8213 0.8341 0.8276 Communicable_Disease 21.0 10.0 6.0 27.0 0.6774 0.7778 0.7241 Hypertension 162.0 5.0 10.0 172.0 0.9701 0.9419 0.9558 External_body_par... 2626.0 356.0 413.0 3039.0 0.8806 0.8641 0.8723 Oxygen_Therapy 81.0 15.0 14.0 95.0 0.8438 0.8526 0.8482 Modifier 2341.0 404.0 539.0 2880.0 0.8528 0.8128 0.8324 Test_Result 1007.0 214.0 255.0 1262.0 0.8247 0.7979 0.8111 BMI 9.0 1.0 0.0 9.0 0.9 1.0 0.9474 Labour_Delivery 57.0 23.0 33.0 90.0 0.7125 0.6333 0.6706 Employment 271.0 59.0 55.0 326.0 0.8212 0.8313 0.8262 Fetus_NewBorn 66.0 33.0 51.0 117.0 0.6667 0.5641 0.6111 Clinical_Dept 923.0 110.0 83.0 1006.0 0.8935 0.9175 0.9053 Time 29.0 13.0 16.0 45.0 0.6905 0.6444 0.6667 Procedure 3185.0 462.0 501.0 3686.0 0.8733 0.8641 0.8687 Diet 36.0 20.0 45.0 81.0 0.6429 0.4444 0.5255 Oncological 459.0 61.0 55.0 514.0 0.8827 0.893 0.8878 LDL 3.0 0.0 3.0 6.0 1.0 0.5 0.6667 Symptom 7104.0 1302.0 1200.0 8304.0 0.8451 0.8555 0.8503 Temperature 116.0 6.0 8.0 124.0 0.9508 0.9355 0.9431 Vital_Signs_Header 215.0 29.0 24.0 239.0 0.8811 0.8996 0.8903 Relationship_Status 49.0 2.0 1.0 50.0 0.9608 0.98 0.9703 Total_Cholesterol 11.0 4.0 5.0 16.0 0.7333 0.6875 0.7097 Blood_Pressure 158.0 18.0 22.0 180.0 0.8977 0.8778 0.8876 Injury_or_Poisoning 579.0 130.0 127.0 706.0 0.8166 0.8201 0.8184 Drug_Ingredient 1716.0 153.0 132.0 1848.0 0.9181 0.9286 0.9233 Treatment 136.0 36.0 60.0 196.0 0.7907 0.6939 0.7391 Pregnancy 123.0 36.0 51.0 
174.0 0.7736 0.7069 0.7387 Vaccine 13.0 2.0 6.0 19.0 0.8667 0.6842 0.7647 Disease_Syndrome_... 2981.0 559.0 446.0 3427.0 0.8421 0.8699 0.8557 Height 30.0 10.0 15.0 45.0 0.75 0.6667 0.7059 Frequency 595.0 99.0 138.0 733.0 0.8573 0.8117 0.8339 Route 858.0 76.0 89.0 947.0 0.9186 0.906 0.9123 Duration 351.0 99.0 108.0 459.0 0.78 0.7647 0.7723 Death_Entity 43.0 14.0 5.0 48.0 0.7544 0.8958 0.819 Internal_organ_or... 6477.0 972.0 991.0 7468.0 0.8695 0.8673 0.8684 Alcohol 80.0 18.0 13.0 93.0 0.8163 0.8602 0.8377 Substance_Quantity 6.0 7.0 4.0 10.0 0.4615 0.6 0.5217 Date 498.0 38.0 19.0 517.0 0.9291 0.9632 0.9459 Hyperlipidemia 47.0 3.0 3.0 50.0 0.94 0.94 0.94 Social_History_He... 99.0 7.0 7.0 106.0 0.934 0.934 0.934 Race_Ethnicity 116.0 0.0 0.0 116.0 1.0 1.0 1.0 Imaging_Technique 40.0 18.0 47.0 87.0 0.6897 0.4598 0.5517 Drug_BrandName 859.0 62.0 61.0 920.0 0.9327 0.9337 0.9332 RelativeDate 566.0 124.0 143.0 709.0 0.8203 0.7983 0.8091 Gender 6096.0 80.0 101.0 6197.0 0.987 0.9837 0.9854 Dosage 244.0 31.0 57.0 301.0 0.8873 0.8106 0.8472 Form 234.0 32.0 55.0 289.0 0.8797 0.8097 0.8432 Medical_History_H... 114.0 9.0 10.0 124.0 0.9268 0.9194 0.9231 Birth_Entity 4.0 2.0 3.0 7.0 0.6667 0.5714 0.6154 Substance 59.0 8.0 11.0 70.0 0.8806 0.8429 0.8613 Sexually_Active_o... 5.0 3.0 4.0 9.0 0.625 0.5556 0.5882 Weight 90.0 10.0 21.0 111.0 0.9 0.8108 0.8531 macro - - - - - - 0.8148 micro - - - - - - 0.8788 ``` --- layout: model title: Turkish Lemmatizer author: John Snow Labs name: lemma date: 2020-05-03 12:43:00 +0800 task: Lemmatization language: tr edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [lemmatizer, tr] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. 
The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_tr_2.5.0_2.4_1587479962436.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_tr_2.5.0_2.4_1587479962436.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma", "tr") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("John Snow, kuzeyin kralı olmanın yanı sıra bir İngiliz doktordur ve anestezi ve tıbbi hijyenin geliştirilmesinde liderdir.") ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma", "tr") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("John Snow, kuzeyin kralı olmanın yanı sıra bir İngiliz doktordur ve anestezi ve tıbbi hijyenin geliştirilmesinde liderdir.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""John Snow, kuzeyin kralı olmanın yanı sıra bir İngiliz doktordur ve anestezi ve tıbbi hijyenin geliştirilmesinde liderdir."""] lemma_df = nlu.load('tr.lemma').predict(text, output_level='token') lemma_df.lemma.values[0] ```
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=3, result='John', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=5, end=8, result='Snow', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=9, end=9, result=',', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=11, end=17, result='kuzey', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=19, end=23, result='kral', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|tr| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: English asr_english_filipino_wav2vec2_l_xls_r_test_07 TFWav2Vec2ForCTC from Khalsuu author: John Snow Labs name: pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_07 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_english_filipino_wav2vec2_l_xls_r_test_07` is an English model originally trained by Khalsuu.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_07_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_07_en_4.2.0_3.0_1664116961217.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_07_en_4.2.0_3.0_1664116961217.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_07', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_07", lang = "en") val annotations = pipeline.transform(audioDF) ```
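Before calling `transform`, the `audioDF` above must hold the raw audio as an array of floats. As a minimal, dependency-free sketch (the helper name and the normalization scheme are illustrative, not part of the Spark NLP API), a 16-bit mono PCM WAV file can be decoded into that form with the Python standard library:

```python
import struct
import wave

def wav_to_floats(path):
    """Decode a mono 16-bit PCM WAV file into a list of floats in [-1, 1].

    This is the kind of raw sample array an audio DataFrame column would
    hold; the helper and normalization are illustrative, not Spark NLP API.
    """
    with wave.open(path, "rb") as wf:
        assert wf.getnchannels() == 1, "expects mono audio"
        assert wf.getsampwidth() == 2, "expects 16-bit PCM"
        n = wf.getnframes()
        raw = wf.readframes(n)
    # "<h" = little-endian signed 16-bit; divide to normalize to [-1, 1]
    samples = struct.unpack("<%dh" % n, raw)
    return [s / 32768.0 for s in samples]
```

Rows of decoded samples can then be assembled into a Spark DataFrame with an `audio_content` column for the pipeline above.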
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_07|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: Modern Greek (1453-) BertForQuestionAnswering model (from Danastos)
author: John Snow Labs
name: bert_qa_nq_bert_el_Danastos
date: 2022-06-03
tags: [open_source, question_answering, bert]
task: Question Answering
language: el
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `nq_bert_el` is a Modern Greek (1453-) model originally trained by `Danastos`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_nq_bert_el_Danastos_el_4.0.0_3.0_1654249964610.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_nq_bert_el_Danastos_el_4.0.0_3.0_1654249964610.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_nq_bert_el_Danastos","el") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_nq_bert_el_Danastos","el") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("el.answer_question.bert.by_Danastos").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
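The nlu one-liner above packs the question and the context into a single string separated by `|||`. A small illustrative helper for building and splitting that combined format (the function names are ours; only the `|||` separator comes from the snippet above):

```python
# Separator used in the nlu question-answering examples above.
SEP = "|||"

def pack_qa(question, context):
    """Build the "<question>|||<context>" string shown in the nlu example.

    Helper names are illustrative, not part of the nlu API.
    """
    assert SEP not in question and SEP not in context, "fields must not contain the separator"
    return question + SEP + context

def unpack_qa(packed):
    """Split a packed string back into (question, context)."""
    question, context = packed.split(SEP, 1)
    return question, context
```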
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_nq_bert_el_Danastos|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|el|
|Size:|421.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/Danastos/nq_bert_el

---
layout: model
title: Fast Neural Machine Translation Model from Caucasian Languages to English
author: John Snow Labs
name: opus_mt_cau_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, cau, en, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

- source languages: `cau`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_cau_en_xx_2.7.0_2.4_1609166650362.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_cau_en_xx_2.7.0_2.4_1609166650362.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_cau_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

data = ["Text to translate goes here."]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_cau_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("Text to translate goes here.").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.cau.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
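The pipeline above inserts SentenceDetectorDLModel because MarianTransformer translates sentence-sized chunks. For intuition only, a naive regex splitter sketches what sentence detection does; the pretrained detector is far more robust (abbreviations, quotes, many languages), and this toy version is not a substitute:

```python
import re

def naive_sentence_split(text):
    """Rough stand-in for SentenceDetectorDLModel, for intuition only:
    split after '.', '!' or '?' followed by whitespace. The pretrained
    detector handles abbreviations and edge cases; this does not."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```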
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_cau_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English RobertaForQuestionAnswering (from nlpconnect)
author: John Snow Labs
name: roberta_qa_dpr_nq_reader_roberta_base
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dpr-nq-reader-roberta-base` is an English model originally trained by `nlpconnect`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_dpr_nq_reader_roberta_base_en_4.0.0_3.0_1655728485078.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_dpr_nq_reader_roberta_base_en_4.0.0_3.0_1655728485078.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_dpr_nq_reader_roberta_base","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_dpr_nq_reader_roberta_base","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.base.by_nlpconnect").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_dpr_nq_reader_roberta_base|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|466.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/nlpconnect/dpr-nq-reader-roberta-base

---
layout: model
title: English BertForQuestionAnswering model (from mrm8488)
author: John Snow Labs
name: bert_qa_bert_small_finetuned_squadv2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-small-finetuned-squadv2` is an English model originally trained by `mrm8488`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_small_finetuned_squadv2_en_4.0.0_3.0_1654184772953.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_small_finetuned_squadv2_en_4.0.0_3.0_1654184772953.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_small_finetuned_squadv2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_small_finetuned_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.small.by_mrm8488").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_small_finetuned_squadv2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|107.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/mrm8488/bert-small-finetuned-squadv2
- https://twitter.com/mrm8488
- https://github.com/google-research
- https://arxiv.org/abs/1908.08962
- https://rajpurkar.github.io/SQuAD-explorer/
- https://github.com/google-research/bert/
- https://www.linkedin.com/in/manuel-romero-cs/

---
layout: model
title: English BertForQuestionAnswering model (from mrm8488)
author: John Snow Labs
name: bert_qa_spanbert_finetuned_squadv2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-finetuned-squadv2` is an English model originally trained by `mrm8488`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_finetuned_squadv2_en_4.0.0_3.0_1654191806090.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_finetuned_squadv2_en_4.0.0_3.0_1654191806090.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_finetuned_squadv2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_finetuned_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.span_bert.v2").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_spanbert_finetuned_squadv2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|402.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/mrm8488/spanbert-finetuned-squadv2
- https://arxiv.org/abs/1907.10529
- https://twitter.com/mrm8488
- https://github.com/facebookresearch
- https://github.com/facebookresearch/SpanBERT
- https://github.com/facebookresearch/SpanBERT#pre-trained-models
- https://rajpurkar.github.io/SQuAD-explorer/

---
layout: model
title: English RobertaForQuestionAnswering (from navteca)
author: John Snow Labs
name: roberta_qa_roberta_large_squad2
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-squad2` is an English model originally trained by `navteca`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_squad2_en_4.0.0_3.0_1655737495788.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_squad2_en_4.0.0_3.0_1655737495788.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_large_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.large").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_large_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/navteca/roberta-large-squad2
- https://rajpurkar.github.io/SQuAD-explorer/

---
layout: model
title: Clinical Drugs to UMLS Code Mapping
author: John Snow Labs
name: umls_drug_resolver_pipeline
date: 2022-07-26
tags: [en, licensed, umls, pipeline]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline maps clinical drug entities to their corresponding UMLS CUI codes. Simply feed in your text and it returns the corresponding UMLS codes.

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/umls_drug_resolver_pipeline_en_4.0.0_3.0_1658815710311.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/umls_drug_resolver_pipeline_en_4.0.0_3.0_1658815710311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("umls_drug_resolver_pipeline", "en", "clinical/models")

pipeline.annotate("The patient was given Adapin 10 MG, coumadn 5 mg")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = PretrainedPipeline("umls_drug_resolver_pipeline", "en", "clinical/models")

val result = pipeline.annotate("The patient was given Adapin 10 MG, coumadn 5 mg")
```

{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.umls_drug_resolver").predict("""The patient was given Adapin 10 MG, coumadn 5 mg""")
```
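`annotate` returns a plain dictionary of parallel lists, one entry per output column. A small sketch for zipping those lists into rows of (chunk, label, code) tuples; the default key names mirror this page's Results columns and are an assumption about the pipeline's output names:

```python
def rows_from_annotations(ann, keys=("chunk", "ner_label", "umls_code")):
    """Zip the parallel lists returned by PretrainedPipeline.annotate()
    into row tuples. The default key names are assumptions mirroring the
    Results columns on this page, not a documented contract."""
    return list(zip(*(ann[k] for k in keys)))
```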
## Results ```bash +------------+---------+---------+ |chunk |ner_label|umls_code| +------------+---------+---------+ |Adapin 10 MG|DRUG |C2930083 | |coumadn 5 mg|DRUG |C2723075 | +------------+---------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|umls_drug_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|4.6 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger --- layout: model title: Malay T5ForConditionalGeneration Small Cased model (from mesolitica) author: John Snow Labs name: t5_finetune_paraphrase_small_standard_bahasa_cased date: 2023-01-30 tags: [ms, open_source, t5, tensorflow] task: Text Generation language: ms edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `finetune-paraphrase-t5-small-standard-bahasa-cased` is a Malay model originally trained by `mesolitica`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_finetune_paraphrase_small_standard_bahasa_cased_ms_4.3.0_3.0_1675102064396.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_finetune_paraphrase_small_standard_bahasa_cased_ms_4.3.0_3.0_1675102064396.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_finetune_paraphrase_small_standard_bahasa_cased","ms") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_finetune_paraphrase_small_standard_bahasa_cased","ms")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_finetune_paraphrase_small_standard_bahasa_cased|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|ms|
|Size:|288.6 MB|

## References

- https://huggingface.co/mesolitica/finetune-paraphrase-t5-small-standard-bahasa-cased
- https://github.com/huseinzol05/malaya/tree/master/session/paraphrase/hf-t5

---
layout: model
title: English BertForQuestionAnswering model (from huggingface-course)
author: John Snow Labs
name: bert_qa_huggingface_course_bert_finetuned_squad
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `huggingface-course`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_huggingface_course_bert_finetuned_squad_en_4.0.0_3.0_1654535631902.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_huggingface_course_bert_finetuned_squad_en_4.0.0_3.0_1654535631902.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_huggingface_course_bert_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_huggingface_course_bert_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_huggingface-course").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_huggingface_course_bert_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/huggingface-course/bert-finetuned-squad

---
layout: model
title: Spanish RobertaForQuestionAnswering (from nlp-en-es)
author: John Snow Labs
name: roberta_qa_bertin_large_finetuned_sqac
date: 2022-06-20
tags: [es, open_source, question_answering, roberta]
task: Question Answering
language: es
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bertin-large-finetuned-sqac` is a Spanish model originally trained by `nlp-en-es`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_bertin_large_finetuned_sqac_es_4.0.0_3.0_1655727749941.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_bertin_large_finetuned_sqac_es_4.0.0_3.0_1655727749941.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_bertin_large_finetuned_sqac","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_bertin_large_finetuned_sqac","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.sqac.roberta.large.by_nlp-en-es").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_bertin_large_finetuned_sqac|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|450.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/nlp-en-es/bertin-large-finetuned-sqac

---
layout: model
title: Abkhazian asr_xls_r_ab_test_by_cahya TFWav2Vec2ForCTC from cahya
author: John Snow Labs
name: asr_xls_r_ab_test_by_cahya
date: 2022-09-24
tags: [wav2vec2, ab, audio, open_source, asr]
task: Automatic Speech Recognition
language: ab
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_ab_test_by_cahya` is an Abkhazian model originally trained by cahya.

NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_xls_r_ab_test_by_cahya_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xls_r_ab_test_by_cahya_ab_4.2.0_3.0_1664019494559.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xls_r_ab_test_by_cahya_ab_4.2.0_3.0_1664019494559.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_xls_r_ab_test_by_cahya", "ab")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_xls_r_ab_test_by_cahya", "ab") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
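The snippets above assume an `audioDf` whose `audio_content` column holds arrays of floats. As a rough sketch of how such arrays can be produced from 16-bit PCM WAV audio using only the Python standard library — the helper name and the in-memory sample are invented for illustration:

```python
# Sketch (assumption: AudioAssembler consumes normalized float samples).
# Decodes 16-bit PCM WAV bytes into floats in [-1.0, 1.0].
import io
import struct
import wave

def wav_to_floats(wav_bytes: bytes) -> list:
    """Decode 16-bit mono PCM WAV bytes into floats in [-1.0, 1.0]."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a tiny in-memory WAV (16 kHz, mono, 4 samples) to show the round trip.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))

floats = wav_to_floats(buf.getvalue())
print(floats[1])  # 0.5
```

A Spark DataFrame could then be built along the lines of `audioDf = spark.createDataFrame([(floats,)], ["audio_content"])`, though the exact schema expected by `AudioAssembler` should be checked against the Spark NLP documentation.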
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_xls_r_ab_test_by_cahya| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ab| |Size:|446.5 KB| --- layout: model title: Pipeline to Detect Clinical Entities author: John Snow Labs name: ner_jsl_slim_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_jsl_slim](https://nlp.johnsnowlabs.com/2021/08/13/ner_jsl_slim_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_slim_pipeline_en_3.4.1_3.0_1647870247934.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_slim_pipeline_en_3.4.1_3.0_1647870247934.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_jsl_slim_pipeline", "en", "clinical/models") pipeline.annotate("HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer.") ``` ```scala val pipeline = new PretrainedPipeline("ner_jsl_slim_pipeline", "en", "clinical/models") pipeline.annotate("HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl_slim.pipeline").predict("""HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer.""") ```
## Results ```bash | | chunk | entity | |---:|:-----------------|:-------------| | 0 | HISTORY: | Header | | 1 | 30-year-old | Age | | 2 | female | Demographics | | 3 | mammography | Test | | 4 | soft tissue lump | Symptom | | 5 | shoulder | Body_Part | | 6 | breast cancer | Oncological | | 7 | her mother | Demographics | | 8 | age 58 | Age | | 9 | breast cancer | Oncological | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_slim_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Indonesian DistilBERT Embeddings author: John Snow Labs name: distilbert_embeddings_distilbert_base_indonesian date: 2022-04-12 tags: [distilbert, embeddings, id, open_source] task: Embeddings language: id edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-indonesian` is an Indonesian model originally trained by `cahya`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_indonesian_id_3.4.2_3.0_1649783905452.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_indonesian_id_3.4.2_3.0_1649783905452.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_indonesian","id") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Saya suka percikan NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_indonesian","id") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Saya suka percikan NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("id.embed.distilbert").predict("""Saya suka percikan NLP""") ```
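The `embeddings` output column holds one float vector per token. To compare two such vectors, cosine similarity is a common choice; a minimal pure-Python sketch, with made-up 3-dimensional vectors standing in for the model's real outputs:

```python
# Sketch: cosine similarity between two token vectors.
# The vectors below are invented for illustration; real ones come from
# the "embeddings" column of the pipeline output.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

v_suka = [0.2, 0.7, -0.1]   # hypothetical vector for "suka"
v_cinta = [0.25, 0.6, 0.0]  # hypothetical vector for a related word
print(round(cosine(v_suka, v_cinta), 3))  # 0.984
```

The same function works unchanged on the full-size vectors the model emits.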
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_base_indonesian| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|id| |Size:|253.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/cahya/distilbert-base-indonesian - https://github.com/cahya-wirawan/indonesian-language-models/tree/master/Transformers --- layout: model title: Embeddings Scielowiki 150 dims author: John Snow Labs name: embeddings_scielowiki_150d class: WordEmbeddingsModel language: es repository: clinical/models date: 2020-05-26 task: Embeddings edition: Healthcare NLP 2.5.0 spark_version: 2.4 tags: [clinical,embeddings,es] supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/embeddings_scielowiki_150d_es_2.5.0_2.4_1590467545910.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/embeddings_scielowiki_150d_es_2.5.0_2.4_1590467545910.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python model = WordEmbeddingsModel.pretrained("embeddings_scielowiki_150d","es","clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("word_embeddings") ``` ```scala val model = WordEmbeddingsModel.pretrained("embeddings_scielowiki_150d","es","clinical/models") .setInputCols("document","token") .setOutputCol("word_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.scielowiki.150d").predict("""Put your text here.""") ```
{:.h2_title} ## Results Word2Vec feature vectors based on ``embeddings_scielowiki_150d``. {:.model-param} ## Model Information {:.table-model} |---------------|----------------------------| | Name: | embeddings_scielowiki_150d | | Type: | WordEmbeddingsModel | | Compatibility: | Spark NLP 2.5.0+ | | License: | Licensed | | Edition: | Official | |Input labels: | [document, token] | |Output labels: | [word_embeddings] | | Language: | es | | Dimension: | 150.0 | {:.h2_title} ## Data Source Trained on Scielo Articles + Clinical Wikipedia Articles https://zenodo.org/record/3744326#.XtViinVKh_U --- layout: model title: Greek Legal Roberta Embeddings author: John Snow Labs name: roberta_base_greek_legal date: 2023-02-16 tags: [el, greek, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: el edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-greek-roberta-base` is a Greek model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_greek_legal_el_4.2.4_3.0_1676558051730.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_greek_legal_el_4.2.4_3.0_1676558051730.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_greek_legal", "el")\ .setInputCols(["sentence"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_greek_legal", "el") .setInputCols("sentence") .setOutputCol("embeddings") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_greek_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|el| |Size:|416.0 MB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-greek-roberta-base --- layout: model title: English BertForQuestionAnswering Base Uncased model (from michaelrglass) author: John Snow Labs name: bert_qa_base_uncased_ssp date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-sspt` is an English model originally trained by `michaelrglass`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_ssp_en_4.0.0_3.0_1657185870319.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_ssp_en_4.0.0_3.0_1657185870319.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_ssp","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_ssp","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_ssp| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|259.3 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/michaelrglass/bert-base-uncased-sspt --- layout: model title: Arabic Bert Embeddings (Mbert model, Covid-19) author: John Snow Labs name: bert_embeddings_mbert_ar_c19 date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `mbert_ar_c19` is an Arabic model originally trained by `moha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_mbert_ar_c19_ar_3.4.2_3.0_1649678730860.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_mbert_ar_c19_ar_3.4.2_3.0_1649678730860.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_mbert_ar_c19","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_mbert_ar_c19","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.mbert_ar_c19").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_mbert_ar_c19| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|627.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/moha/mbert_ar_c19 - https://arxiv.org/pdf/2105.03143.pdf - https://arxiv.org/abs/2004.04315 - https://github.com/MohamedHadjAmeur --- layout: model title: Legal Officers certificate Clause Binary Classifier author: John Snow Labs name: legclf_officers_certificate_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `officers-certificate` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `officers-certificate` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_officers_certificate_clause_en_1.0.0_3.2_1660122781261.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_officers_certificate_clause_en_1.0.0_3.2_1660122781261.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_officers_certificate_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
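The paragraph splitting recommended in the description can be as simple as breaking on blank lines before building the `clause_text` DataFrame. A minimal sketch of that pre-processing step in plain Python — the sample clauses are invented:

```python
# Sketch: split a long legal document into paragraphs (by blank lines),
# so each piece stays within the 512-token limit of the embeddings.
def split_paragraphs(text):
    """Split a document on blank lines and drop empty pieces."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

doc = (
    "Officers' Certificate. The Company shall deliver a certificate...\n"
    "\n"
    "Governing Law. This Agreement shall be governed by..."
)
paragraphs = split_paragraphs(doc)
print(len(paragraphs))  # 2
# Each paragraph then becomes one row of the "clause_text" column, e.g.:
# df = spark.createDataFrame([[p] for p in paragraphs]).toDF("clause_text")
```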
## Results ```bash +-------+ | result| +-------+ |[officers-certificate]| |[other]| |[other]| |[officers-certificate]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_officers_certificate_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support officers-certificate 0.98 1.00 0.99 40 other 1.00 0.99 1.00 103 accuracy - - 0.99 143 macro-avg 0.99 1.00 0.99 143 weighted-avg 0.99 0.99 0.99 143 ``` --- layout: model title: English DistilBertForQuestionAnswering model (from ZYW) author: John Snow Labs name: distilbert_qa_test_squad_trained date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `test-squad-trained` is an English model originally trained by `ZYW`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_test_squad_trained_en_4.0.0_3.0_1654728858592.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_test_squad_trained_en_4.0.0_3.0_1654728858592.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_test_squad_trained","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_test_squad_trained","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.by_ZYW").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_test_squad_trained| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|505.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ZYW/test-squad-trained --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_32_epochs50_earlystop TFWav2Vec2ForCTC from ying-tina author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab_32_epochs50_earlystop date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_32_epochs50_earlystop` is an English model originally trained by ying-tina. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_32_epochs50_earlystop_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_32_epochs50_earlystop_en_4.2.0_3.0_1664111764164.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_32_epochs50_earlystop_en_4.2.0_3.0_1664111764164.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_32_epochs50_earlystop', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_32_epochs50_earlystop", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_32_epochs50_earlystop| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|354.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Arabic Bert Embeddings (Mini) author: John Snow Labs name: bert_embeddings_bert_mini_arabic date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-mini-arabic` is an Arabic model originally trained by `asafaya`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_mini_arabic_ar_3.4.2_3.0_1649677815613.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_mini_arabic_ar_3.4.2_3.0_1649677815613.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_mini_arabic","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_mini_arabic","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.bert_mini_arabic").predict("""أنا أحب شرارة NLP""") ```
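Token-level BERT vectors like those produced above are often averaged into a single sentence vector (mean pooling) for downstream similarity or classification. A pure-Python sketch, with toy 3-dimensional vectors in place of the model's full-size ones:

```python
# Sketch: mean pooling over per-token embedding vectors.
# The token vectors here are invented; real ones come from the
# "embeddings" column of the pipeline output.
def mean_pool(token_vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

tokens = [[0.1, 0.2, 0.3], [0.3, 0.0, 0.1], [0.2, 0.4, 0.2]]
pooled = mean_pool(tokens)
print(pooled)  # each component is approximately 0.2
```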
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_mini_arabic| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|43.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/asafaya/bert-mini-arabic - https://traces1.inria.fr/oscar/ - http://commoncrawl.org/ - https://dumps.wikimedia.org/backup-index.html - https://github.com/google-research/bert - https://www.tensorflow.org/tfrc - https://github.com/alisafaya/Arabic-BERT --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_el64 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-el64` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el64_en_4.3.0_3.0_1675120423638.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el64_en_4.3.0_3.0_1675120423638.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_el64","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_el64","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_el64| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|499.3 MB| ## References - https://huggingface.co/google/t5-efficient-small-el64 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Medical Spell Checker Pipeline author: John Snow Labs name: spellcheck_clinical_pipeline date: 2022-04-13 tags: [spellcheck, medical, medical_spell_checker, spell_corrector, spell_pipeline, en, licensed, clinical] task: Spell Check language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained medical spellchecker pipeline is built on top of the [spellcheck_clinical](https://nlp.johnsnowlabs.com/2022/04/11/spellcheck_clinical_en_3_0.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CONTEXTUAL_SPELL_CHECKER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_pipeline_en_3.4.1_3.0_1649854381549.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_pipeline_en_3.4.1_3.0_1649854381549.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("spellcheck_clinical_pipeline", "en", "clinical/models") example = ["Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.", "With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.", "Abdomen is sort, nontender, and nonintended.", "Patient not showing pain or any wealth problems.", "No cute distress"] pipeline.fullAnnotate(example) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("spellcheck_clinical_pipeline", "en", "clinical/models") val example = Array("Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.", "With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.", "Abdomen is sort, nontender, and nonintended.", "Patient not showing pain or any wealth problems.", "No cute distress") pipeline.fullAnnotate(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.spell.clinical.pipeline").predict("""Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.""") ```
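Each dictionary returned by `fullAnnotate` pairs the original `token` list with the corrected `checked` list, so the actual corrections can be recovered by aligning the two. A minimal sketch — it assumes, as in the output shown in the Results section, that both lists have equal length:

```python
# Sketch: list only the tokens the spellchecker changed, by zipping
# the "token" and "checked" lists from one fullAnnotate result.
def corrections(tokens, checked):
    """Return (original, corrected) pairs where the two lists differ."""
    return [(t, c) for t, c in zip(tokens, checked) if t != c]

# Example taken from the "No cute distress" result shown below the pipeline.
tokens = ["No", "cute", "distress"]
checked = ["No", "acute", "distress"]
print(corrections(tokens, checked))  # [('cute', 'acute')]
```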
## Results ```bash [{'checked': ['With','the','cell','of','physical','therapy','the','patient','was','ambulated','and','on','postoperative',',','the','patient','tolerating','a','post','surgical','soft','diet','.'], 'document': ['Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.'], 'token': ['Witth','the','hell','of','phisical','terapy','the','patient','was','imbulated','and','on','postoperative',',','the','impatient','tolerating','a','post','curgical','soft','diet','.']}, {'checked': ['With','pain','well','controlled','on','oral','pain','medications',',','she','was','discharged','to','rehabilitation','facility','.'], 'document': ['With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.'], 'token': ['With','paint','wel','controlled','on','orall','pain','medications',',','she','was','discharged','too','reihabilitation','facilitay','.']}, {'checked': ['Abdomen','is','soft',',','nontender',',','and','nondistended','.'], 'document': ['Abdomen is sort, nontender, and nonintended.'], 'token': ['Abdomen','is','sort',',','nontender',',','and','nonintended','.']}, {'checked': ['Patient','not','showing','pain','or','any','health','problems','.'], 'document': ['Patient not showing pain or any wealth problems.'], 'token': ['Patient','not','showing','pain','or','any','wealth','problems','.']}, {'checked': ['No', 'acute', 'distress'], 'document': ['No cute distress'], 'token': ['No', 'cute', 'distress']}] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|spellcheck_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|141.2 MB| ## Included Models - DocumentAssembler - TokenizerModel - ContextSpellCheckerModel --- layout: model title: English Named Entity Recognition (from sagorsarker) author: John Snow Labs name: 
bert_ner_codeswitch_spaeng_lid_lince date: 2022-05-09 tags: [bert, ner, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `codeswitch-spaeng-lid-lince` is an English model originally trained by `sagorsarker`. ## Predicted Entities `mixed`, `other`, `unk`, `en`, `ambiguous`, `spa`, `ne`, `fw` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_codeswitch_spaeng_lid_lince_en_3.4.2_3.0_1652096746269.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_codeswitch_spaeng_lid_lince_en_3.4.2_3.0_1652096746269.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_codeswitch_spaeng_lid_lince","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_codeswitch_spaeng_lid_lince","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_codeswitch_spaeng_lid_lince| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/sagorsarker/codeswitch-spaeng-lid-lince - https://ritual.uh.edu/lince/home - https://github.com/sagorbrur/codeswitch --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from harish3110) author: John Snow Labs name: xlmroberta_ner_harish3110_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `harish3110`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_harish3110_base_finetuned_panx_de_4.1.0_3.0_1660433578458.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_harish3110_base_finetuned_panx_de_4.1.0_3.0_1660433578458.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_harish3110_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_harish3110_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
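The NerConverter stage merges the classifier's token-level B-/I-/O tags into entity chunks. A rough sketch of that grouping logic (a simplified illustration, not the annotator's actual implementation):

```python
def bio_to_chunks(tokens, tags):
    """Group B-X / I-X token tags into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                      # close any open chunk
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)              # continue the open chunk
        else:                                # "O" or a stray I- tag
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Angela", "Merkel", "besucht", "Berlin"]
tags   = ["B-PER", "I-PER", "O", "B-LOC"]
print(bio_to_chunks(tokens, tags))  # -> [('Angela Merkel', 'PER'), ('Berlin', 'LOC')]
```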
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_harish3110_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/harish3110/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Text cleaner v1 author: John Snow Labs name: text_cleaner_v1 date: 2021-12-21 tags: [en, licensed] task: OCR Text Detection & Recognition language: en nav_key: models edition: Visual NLP 3.0.0 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Model for cleaning images that contain text. It is based on a text detection model with extra post-processing. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/text_cleaner_v1_en_3.0.0_2.4_1640088709401.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/text_cleaner_v1_en_3.0.0_2.4_1640088709401.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python cleaner = ImageTextCleaner \ .pretrained("text_cleaner_v1", "en", "clinical/ocr") \ .setInputCol("image") \ .setOutputCol("cleaned_image") \ .setMedianBlur(0) \ .setSizeThreshold(1) \ .setTextThreshold(0.3) \ .setPadding(2) \ .setBinarize(False) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|text_cleaner_v1| |Type:|ocr| |Compatibility:|Visual NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|77.1 MB| --- layout: model title: English RobertaForQuestionAnswering Cased model (from aravind-812) author: John Snow Labs name: roberta_qa_train_json date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-train-json` is an English model originally trained by `aravind-812`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_train_json_en_4.3.0_3.0_1674222495676.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_train_json_en_4.3.0_3.0_1674222495676.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_train_json","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_train_json","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
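Extractive QA models of this kind score each context token as a possible answer start and end, then return the span with the best combined score. A toy sketch of that span selection (made-up scores; the real model produces them from RoBERTa logits):

```python
def best_span(tokens, start_scores, end_scores, max_len=8):
    """Pick the (start, end) pair maximizing start+end score with end >= start."""
    best, best_score = (0, 0), float("-inf")
    for s in range(len(tokens)):
        for e in range(s, min(s + max_len, len(tokens))):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    s, e = best
    return " ".join(tokens[s:e + 1])

# Hypothetical per-token scores peaking on "Clara" (index 3).
context = "My name is Clara and I live in Berkeley .".split()
start = [0.1, 0.0, 0.2, 5.0, 0.1, 0.0, 0.0, 0.1, 0.3, 0.0]
end   = [0.0, 0.1, 0.0, 4.0, 0.2, 0.0, 0.0, 0.1, 0.5, 0.0]
print(best_span(context, start, end))  # -> Clara
```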
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_train_json| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/aravind-812/roberta-train-json --- layout: model title: English image_classifier_vit_fruits ViTForImageClassification from hafidber author: John Snow Labs name: image_classifier_vit_fruits date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_fruits` is an English model originally trained by hafidber. ## Predicted Entities `banana`, `grape`, `apple`, `kiwi`, `lemon` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_fruits_en_4.1.0_3.0_1660170220847.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_fruits_en_4.1.0_3.0_1660170220847.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_fruits", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_fruits", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
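The classifier head turns the ViT encoder's output into a probability per predicted entity via a softmax over class logits, and the `class` column carries the argmax label. A minimal sketch of that final step (made-up logits; the real values come from the transformer):

```python
import math

LABELS = ["banana", "grape", "apple", "kiwi", "lemon"]

def softmax(logits):
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

label, p = classify([0.2, -1.3, 3.1, 0.0, 0.4])
print(label)  # -> apple
```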
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_fruits| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Loinc Sentence Entity Resolver author: John Snow Labs name: sbiobertresolve_loinc date: 2021-04-29 tags: [en, licensed, clinical, entity_resolution] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Map clinical NER entities to LOINC codes. ## Predicted Entities LOINC codes - per input NER entity {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_LOINC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_loinc_en_3.0.2_3.0_1619677092954.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_loinc_en_3.0.2_3.0_1619677092954.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentenceDetector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") stopwords = StopWordsCleaner.pretrained()\ .setInputCols("token")\ .setOutputCol("cleanTokens")\ .setCaseSensitive(False) word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "cleanTokens"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "cleanTokens", "ner"]) \ .setOutputCol("ner_chunk") chunk2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_loinc","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") pipeline_loinc = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver]) data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and 
vomiting."""]]).toDF("text") results = pipeline_loinc.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val stopwords = StopWordsCleaner.pretrained() .setInputCols("token") .setOutputCol("cleanTokens") .setCaseSensitive(false) val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "cleanTokens")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "cleanTokens", "ner")) .setOutputCol("ner_chunk") val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("sbert_embeddings") val resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_loinc","en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline_loinc = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver)) val data = Seq("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week 
history of polyuria, polydipsia, poor appetite, and vomiting.""").toDS().toDF("text") val results = pipeline_loinc.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.loinc").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.""") ```
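The resolver embeds each NER chunk with Sentence-BERT and returns the LOINC code whose stored embedding is closest under the configured EUCLIDEAN distance. A toy sketch of that nearest-neighbor lookup (made-up 3-dimensional vectors; the real embeddings are 768-dimensional):

```python
import math

# Hypothetical tiny index: LOINC code -> embedding vector.
index = {
    "45636-8": [0.9, 0.1, 0.0],   # gestational diabetes mellitus
    "59574-4": [0.0, 0.8, 0.3],   # body mass index
    "28239-2": [0.1, 0.2, 0.9],   # polyuria
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def resolve(chunk_embedding):
    """Return the code of the nearest stored embedding."""
    return min(index, key=lambda code: euclidean(index[code], chunk_embedding))

print(resolve([0.85, 0.15, 0.05]))  # -> 45636-8
```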
## Results ```bash | | chunk | loinc_code | |---:|:--------------------------------------|:-------------| | 0 | gestational diabetes mellitus | 45636-8 | | 1 | subsequent type two diabetes mellitus | 44877-9 | | 2 | T2DM | 45636-8 | | 3 | HTG-induced pancreatitis | 66667-7 | | 4 | an acute hepatitis | 45690-5 | | 5 | obesity | 73708-0 | | 6 | a body mass index | 59574-4 | | 7 | BMI | 59574-4 | | 8 | polyuria | 28239-2 | | 9 | polydipsia | 90552-1 | | 10 | poor appetite | 28387-9 | | 11 | vomiting | 81224-8 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_loinc| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[loinc_code]| |Language:|en| ## Data Source Trained on standard LOINC coding system. --- layout: model title: Drug Spell Checker author: John Snow Labs name: spellcheck_drug_norvig date: 2021-09-15 tags: [spell, spell_checker, clinical, en, licensed, drug] task: Spell Check language: en nav_key: models edition: Healthcare NLP 3.2.2 spark_version: 3.0 supported: true annotator: NorvigSweetingModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects and corrects spelling errors of drugs in your input text based on Norvig's approach. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/spellcheck_drug_norvig_en_3.2.2_3.0_1631700986904.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/spellcheck_drug_norvig_en_3.2.2_3.0_1631700986904.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") spell = NorvigSweetingModel.pretrained("spellcheck_drug_norvig", "en", "clinical/models")\ .setInputCols("token")\ .setOutputCol("spell") pipeline = Pipeline( stages = [ documentAssembler, tokenizer, spell]) model = pipeline.fit(spark.createDataFrame([['']]).toDF('text')) lp = LightPipeline(model) result = lp.annotate("You have to take Neutrcare and colfosrinum and a bit of Fluorometholne & Ribotril") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val spell = NorvigSweetingModel.pretrained("spellcheck_drug_norvig", "en", "clinical/models") .setInputCols("token") .setOutputCol("spell") val pipeline = new Pipeline().setStages(Array(documentAssembler,tokenizer,spell)) val model = pipeline.fit(Seq("").toDF("text")) val lp = new LightPipeline(model) val result = lp.annotate("You have to take Neutrcare and colfosrinum and a bit of Fluorometholne & Ribotril") ``` {:.nlu-block} ```python import nlu nlu.load("en.spell.drug_norvig").predict("""You have to take Neutrcare and colfosrinum and a bit of Fluorometholne & Ribotril""") ```
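Norvig's approach generates every string within a small edit distance of the input and keeps the candidates found in a known vocabulary. A compact sketch over a toy drug vocabulary (the pretrained model ships its own, much larger word list and frequency-based ranking):

```python
import string

# Hypothetical tiny vocabulary for illustration.
VOCAB = {"neutracare", "colforsinum", "fluorometholone", "rivotril"}

def edits1(word):
    """All strings one edit (delete/transpose/replace/insert) away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    replaces = {L + c + R[1:] for L, R in splits if R for c in letters}
    inserts = {L + c + R for L, R in splits for c in letters}
    return deletes | transposes | replaces | inserts

def correct(word):
    w = word.lower()
    if w in VOCAB:
        return w
    candidates = edits1(w) & VOCAB
    if not candidates:  # fall back to two edits
        candidates = {e2 for e1 in edits1(w) for e2 in edits1(e1)} & VOCAB
    return min(candidates) if candidates else w

print(correct("Neutrcare"))  # -> neutracare
print(correct("Ribotril"))   # -> rivotril
```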
## Results ```bash Original text : You have to take Neutrcare and colfosrinum and a bit of fluorometholne & Ribotril Corrected text : You have to take Neutracare and colforsinum and a bit of fluorometholone & Rivotril ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|spellcheck_drug_norvig| |Compatibility:|Healthcare NLP 3.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[spell]| |Language:|en| |Case sensitive:|true| --- layout: model title: SDOH Mental Health For Classification author: John Snow Labs name: genericclassifier_sdoh_mental_health_clinical date: 2023-04-10 tags: [en, licenced, clinical, sdoh, generic_classifier, mental_health, embeddings_clinical, licensed] task: Text Classification language: en edition: Healthcare NLP 4.3.2 spark_version: [3.2, 3.0] supported: true annotator: GenericClassifierModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Generic Classifier model detects whether the patient has mental health problems in clinical notes. It is trained using the GenericClassifierApproach annotator. `Mental_Disorder`: The patient has mental health problems. `No_Or_Not_Mentioned`: The patient doesn't have mental health problems, or it is not mentioned in the clinical notes. ## Predicted Entities `Mental_Disorder`, `No_Or_Not_Mentioned` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_mental_health_clinical_en_4.3.2_3.0_1681132553520.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_mental_health_clinical_en_4.3.2_3.0_1681132553520.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("word_embeddings") sentence_embeddings = SentenceEmbeddings() \ .setInputCols(["document", "word_embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") features_asm = FeaturesAssembler()\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("features") generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_mental_health_clinical", 'en', 'clinical/models')\ .setInputCols(["features"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, word_embeddings, sentence_embeddings, features_asm, generic_classifier ]) sample_texts = ["""James is a 28-year-old man who has been struggling with schizophrenia for the past five years. He was diagnosed with the condition after experiencing a psychotic episode in his early 20s. The following is a case study that explores James' background, symptoms, diagnosis, treatment, and outcomes. Background: James grew up in a middle-class family with his parents and two younger siblings. He had a relatively normal childhood and was an above-average student in school. However, in his late teens, James started experiencing paranoid delusions and hallucinations. He became increasingly isolated and withdrawn, and his grades started to suffer. James was eventually hospitalized after experiencing a psychotic episode in college. Symptoms: James' symptoms of schizophrenia include delusions, hallucinations, disorganized speech and behavior, and negative symptoms. He experiences paranoid delusions, believing that people are out to get him or that he is being followed. 
James also experiences auditory hallucinations, hearing voices that are critical or commanding. He has difficulty organizing his thoughts and expressing himself coherently. James also experiences negative symptoms, such as reduced motivation, social withdrawal, and flattened affect. Diagnosis: James' diagnosis of schizophrenia was based on his symptoms, history, and a comprehensive evaluation by a mental health professional. He was diagnosed with paranoid schizophrenia, which is characterized by delusions and hallucinations. Treatment: James' treatment for schizophrenia consisted of medication and therapy. He was prescribed antipsychotic medication to help manage his symptoms. He also participated in cognitive-behavioral therapy (CBT), which focused on helping him identify and challenge his delusions and improve his communication skills. James also attended support groups for individuals with schizophrenia. Outcomes: With ongoing treatment, James' symptoms have become more manageable. He still experiences occasional psychotic episodes, but they are less frequent and less severe. James has also developed better coping skills and has learned to recognize the warning signs of an impending episode. He is able to maintain employment and has a stable home life. James' family has also been involved in his treatment, which has helped to improve his support system and overall quality of life. """, """Patient John is a 60-year-old man who presents to a primary care clinic for a routine check-up. He reports feeling generally healthy, with no significant medical concerns. However, he reveals that he is a smoker and drinks alcohol on a regular basis. The patient also mentions that he has a history of working long hours and has limited time for physical activity and social interactions. 
Based on this information, it appears that Patient John's overall health may be affected by several social determinants of health, including tobacco and alcohol use, lack of physical activity, and social isolation. To address these issues, the healthcare provider may recommend a comprehensive physical exam and develop a treatment plan that includes lifestyle modifications, such as smoking cessation and reduction of alcohol intake. Additionally, the patient may benefit from referrals to local organizations that provide resources for physical activity and social engagement. The healthcare provider may also recommend strategies to reduce work-related stress and promote work-life balance. By addressing these social determinants of health, healthcare providers can help promote Patient John's overall health and prevent future health problems."""] df = spark.createDataFrame(sample_texts, StringType()).toDF("text") result = pipeline.fit(df).transform(df) result.select("text", "class.result").show(truncate=100) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models") .setInputCols(Array("document","token")) .setOutputCol("word_embeddings") val sentence_embeddings = new SentenceEmbeddings() .setInputCols(Array("document", "word_embeddings")) .setOutputCol("sentence_embeddings") .setPoolingStrategy("AVERAGE") val features_asm = new FeaturesAssembler() .setInputCols("sentence_embeddings") .setOutputCol("features") val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_mental_health_clinical", "en", "clinical/models") .setInputCols("features") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array( document_assembler, tokenizer, word_embeddings, sentence_embeddings, features_asm, generic_classifier)) val data = 
Seq("""James is a 28-year-old man who has been struggling with schizophrenia for the past five years. He was diagnosed with the condition after experiencing a psychotic episode in his early 20s. The following is a case study that explores James' background, symptoms, diagnosis, treatment, and outcomes. Background: James grew up in a middle-class family with his parents and two younger siblings. He had a relatively normal childhood and was an above-average student in school. However, in his late teens, James started experiencing paranoid delusions and hallucinations. He became increasingly isolated and withdrawn, and his grades started to suffer. James was eventually hospitalized after experiencing a psychotic episode in college. Symptoms: James' symptoms of schizophrenia include delusions, hallucinations, disorganized speech and behavior, and negative symptoms. He experiences paranoid delusions, believing that people are out to get him or that he is being followed. James also experiences auditory hallucinations, hearing voices that are critical or commanding. He has difficulty organizing his thoughts and expressing himself coherently. James also experiences negative symptoms, such as reduced motivation, social withdrawal, and flattened affect. Diagnosis: James' diagnosis of schizophrenia was based on his symptoms, history, and a comprehensive evaluation by a mental health professional. He was diagnosed with paranoid schizophrenia, which is characterized by delusions and hallucinations. Treatment: James' treatment for schizophrenia consisted of medication and therapy. He was prescribed antipsychotic medication to help manage his symptoms. He also participated in cognitive-behavioral therapy (CBT), which focused on helping him identify and challenge his delusions and improve his communication skills. James also attended support groups for individuals with schizophrenia. Outcomes: With ongoing treatment, James' symptoms have become more manageable. He still experiences occasional psychotic episodes, but they are less frequent and less severe. James has also developed better coping skills and has learned to recognize the warning signs of an impending episode. He is able to maintain employment and has a stable home life. James' family has also been involved in his treatment, which has helped to improve his support system and overall quality of life. """, """Patient John is a 60-year-old man who presents to a primary care clinic for a routine check-up. He reports feeling generally healthy, with no significant medical concerns. However, he reveals that he is a smoker and drinks alcohol on a regular basis. The patient also mentions that he has a history of working long hours and has limited time for physical activity and social interactions. Based on this information, it appears that Patient John's overall health may be affected by several social determinants of health, including tobacco and alcohol use, lack of physical activity, and social isolation. To address these issues, the healthcare provider may recommend a comprehensive physical exam and develop a treatment plan that includes lifestyle modifications, such as smoking cessation and reduction of alcohol intake. Additionally, the patient may benefit from referrals to local organizations that provide resources for physical activity and social engagement. The healthcare provider may also recommend strategies to reduce work-related stress and promote work-life balance. By addressing these social determinants of health, healthcare providers can help promote Patient John's overall health and prevent future health problems.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results

```bash
+----------------------------------------------------------------------------------------------------+---------------------+
|                                                                                                text|               result|
+----------------------------------------------------------------------------------------------------+---------------------+
|James is a 28-year-old man who has been struggling with schizophrenia for the past five years. He...|    [Mental_Disorder]|
|Patient John is a 60-year-old man who presents to a primary care clinic for a routine check-up. H...|[No_Or_Not_Mentioned]|
+----------------------------------------------------------------------------------------------------+---------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|genericclassifier_sdoh_mental_health_clinical|
|Compatibility:|Healthcare NLP 4.3.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[features]|
|Output Labels:|[prediction]|
|Language:|en|
|Size:|1.5 MB|
|Dependencies:|embeddings_clinical|

## References

Internal SDOH Project

## Benchmarking

```bash
              label  precision  recall  f1-score  support
    Mental_Disorder       0.79    0.85      0.82      223
No_Or_Not_Mentioned       0.85    0.78      0.82      240
           accuracy          -       -      0.82      463
          macro-avg       0.82    0.82      0.82      463
       weighted-avg       0.82    0.82      0.82      463
```

---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from mrm8488)
author: John Snow Labs
name: t5_small_finetuned_wikisql
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-finetuned-wikiSQL` is an English model originally trained by `mrm8488`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_finetuned_wikisql_en_4.3.0_3.0_1675126227801.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_finetuned_wikisql_en_4.3.0_3.0_1675126227801.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_small_finetuned_wikisql", "en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_small_finetuned_wikisql", "en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_finetuned_wikisql| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|262.1 MB| ## References - https://huggingface.co/mrm8488/t5-small-finetuned-wikiSQL - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://github.com/salesforce/WikiSQL - https://arxiv.org/pdf/1910.10683.pdf - https://i.imgur.com/jVFMMWR.png - https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb - https://github.com/patil-suraj - https://twitter.com/mrm8488 - https://www.linkedin.com/in/manuel-romero-cs/ --- layout: model title: Glove Embeddings 6B 100 author: John Snow Labs name: glove_100d date: 2020-01-22 task: Embeddings language: en nav_key: models edition: Spark NLP 2.4.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description GloVe (Global Vectors) is a model for distributed word representation. This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity. It outperformed many common Word2vec models on the word analogy task. One benefit of GloVe is that it is the result of directly modeling relationships, instead of getting them as a side effect of training a language model. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/english/dl-ner/ner_dl.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/glove_100d_en_2.4.0_2.4_1579690104032.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/glove_100d_en_2.4.0_2.4_1579690104032.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love Spark NLP']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""I love Spark NLP"""] glove_df = nlu.load('en.embed.glove.100d').predict(text) glove_df ```
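The description above notes that distances in the GloVe space track semantic similarity. As a toy illustration of that idea — using made-up 4-dimensional vectors, not actual `glove_100d` weights (which are 100-dimensional) — cosine similarity can be computed directly:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical embeddings for illustration only.
king  = [0.8, 0.3, 0.1, 0.6]
queen = [0.7, 0.4, 0.1, 0.5]
apple = [-0.2, 0.9, -0.5, 0.1]

print(cosine(king, queen))  # close to 1.0: semantically similar words
print(cosine(king, apple))  # noticeably lower
```

In practice you would read the vectors out of the `embeddings` column produced by the pipeline above rather than hard-coding them.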
{:.h2_title} ## Results ```bash token | glove_embeddings | -------|----------------------------------------------------| I | [0.1941000074148178, 0.22603000700473785, -0.4...] | love | [0.13948999345302582, 0.534529983997345, -0.25...] | Spark | [0.20353999733924866, 0.6292600035667419, 0.27...] | NLP | [0.059436000883579254, 0.18411000072956085, -0...] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|glove_100d| |Type:|word_embeddings| |Compatibility:|Spark NLP 2.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|[en]| |Dimension:|100| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from [https://nlp.stanford.edu/projects/glove/](https://nlp.stanford.edu/projects/glove/) --- layout: model title: English asr_bach_arb TFWav2Vec2ForCTC from bkh6722 author: John Snow Labs name: pipeline_asr_bach_arb date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_bach_arb` is a English model originally trained by bkh6722. NOTE: This pipeline only works on a CPU, if you need to use this pipeline on a GPU device please use pipeline_asr_bach_arb_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_bach_arb_en_4.2.0_3.0_1664118896542.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_bach_arb_en_4.2.0_3.0_1664118896542.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_bach_arb', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_bach_arb", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_bach_arb|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|349.2 MB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: English XlmRoBertaForQuestionAnswering (from teacookies)
author: John Snow Labs
name: xlm_roberta_qa_autonlp_roberta_base_squad2_24465515
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-roberta-base-squad2-24465515` is an English model originally trained by `teacookies`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465515_en_4.0.0_3.0_1655986105907.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465515_en_4.0.0_3.0_1655986105907.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465515", "en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = XlmRoBertaForQuestionAnswering
    .pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465515", "en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.xlm_roberta.base_24465515.by_teacookies").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_autonlp_roberta_base_squad2_24465515|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|887.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/teacookies/autonlp-roberta-base-squad2-24465515

---
layout: model
title: Legal Submission To Jurisdiction Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_submission_to_jurisdiction_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, submission_to_jurisdiction, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

This model is a Binary Classifier (True, False) for the `Submission_To_Jurisdiction` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`Submission_To_Jurisdiction`, `other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_submission_to_jurisdiction_bert_en_1.0.0_3.0_1678047235558.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_submission_to_jurisdiction_bert_en_1.0.0_3.0_1678047235558.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_submission_to_jurisdiction_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
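Since this classifier works best on individual provisions rather than whole contracts, long documents can be pre-split into paragraphs before building the input DataFrame. A minimal sketch of the "paragraph splitting (by multiline)" approach the card recommends — plain Python, independent of the Spark pipeline above, with a hypothetical two-provision snippet:

```python
import re

def split_paragraphs(text):
    # Split on one or more blank lines and drop empty fragments.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

contract = """Submission to Jurisdiction. Each party hereby submits to the
exclusive jurisdiction of the courts of the State of New York.

Severability. If any provision of this Agreement is held invalid,
the remaining provisions shall remain in full force."""

paragraphs = split_paragraphs(contract)
print(len(paragraphs))  # 2
# Each paragraph would then become one row of the classifier's input:
# df = spark.createDataFrame([[p] for p in paragraphs]).toDF("text")
```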
## Results ```bash +-------+ |result| +-------+ |[Submission_To_Jurisdiction]| |[Other]| |[Other]| |[Submission_To_Jurisdiction]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_submission_to_jurisdiction_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 1.00 0.97 0.99 37 Submission_To_Jurisdiction 0.96 1.00 0.98 24 accuracy - - 0.98 61 macro-avg 0.98 0.99 0.98 61 weighted-avg 0.98 0.98 0.98 61 ``` --- layout: model title: Sentence Entity Resolver for MeSH (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_mesh date: 2022-01-18 tags: [mesh, entity_resolution, clinical, en, licensed] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.2 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities to Medical Subject Heading (MeSH) codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. ## Predicted Entities {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_mesh_en_3.3.2_2.4_1642534218495.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_mesh_en_3.3.2_2.4_1642534218495.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("ner_chunk")

sbert_embedder = BertSentenceEmbeddings\
    .pretrained('sbiobert_base_cased_mli', 'en', 'clinical/models')\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("sbert_embeddings")\
    .setCaseSensitive(False)

mesh_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_mesh", "en", "clinical/models") \
    .setInputCols(["sbert_embeddings"]) \
    .setOutputCol("mesh_code")\
    .setDistanceFunction("EUCLIDEAN")

mesh_pipeline = Pipeline(stages=[documentAssembler, sbert_embedder, mesh_resolver])

data = spark.createDataFrame([["""She was admitted to the hospital with chest pain and found to have bilateral pleural effusion, the right greater than the left. We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma. At this time, chest tube placement for drainage of the fluid occurred and thoracoscopy with fluid biopsies, which were performed, which revealed malignant mesothelioma."""]]).toDF("text")

result = mesh_pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("ner_chunk")

val sbert_embedder = BertSentenceEmbeddings
    .pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
    .setInputCols(Array("ner_chunk"))
    .setOutputCol("sbert_embeddings")
    .setCaseSensitive(false)

val mesh_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_mesh", "en", "clinical/models")
    .setInputCols(Array("sbert_embeddings"))
    .setOutputCol("mesh_code")
    .setDistanceFunction("EUCLIDEAN")

val mesh_pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, mesh_resolver))

val data = Seq("""She was admitted to the hospital with chest pain and found to have bilateral pleural effusion, the right greater than the left. We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma. At this time, chest tube placement for drainage of the fluid occurred and thoracoscopy with fluid biopsies, which were performed, which revealed malignant mesothelioma.""").toDF("text")

val result = mesh_pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.mesh").predict("""She was admitted to the hospital with chest pain and found to have bilateral pleural effusion, the right greater than the left. We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma. At this time, chest tube placement for drainage of the fluid occurred and thoracoscopy with fluid biopsies, which were performed, which revealed malignant mesothelioma.""")
```
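The resolver ranks candidate MeSH terms by the distance between sentence embeddings (`setDistanceFunction("EUCLIDEAN")` above). A toy sketch of that ranking step, with made-up low-dimensional vectors standing in for the real 768-dimensional BERT embeddings — the codes are real MeSH identifiers, but the vectors are invented for illustration:

```python
import math

def euclidean(u, v):
    # Straight-line distance between two embedding vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical candidate embeddings; the real index stores one per MeSH term.
candidates = {
    "D002637": ("Chest Pain",   [0.9, 0.1, 0.0]),
    "D059350": ("Chronic Pain", [0.7, 0.3, 0.1]),
    "D011014": ("Pneumonia",    [0.1, 0.8, 0.4]),
}

query = [0.88, 0.12, 0.02]  # embedding of the extracted chunk "chest pain"

ranked = sorted(candidates.items(),
                key=lambda kv: euclidean(query, kv[1][1]))
best_code, (best_term, _) = ranked[0]
print(best_code, best_term)  # the nearest candidate wins
```

The `all_codes`, `resolutions`, and `distances` columns in the results below are exactly this ranked list, truncated for display.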
## Results ```bash +--------------------------+---------+----------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+ | ner_chunk| entity| mesh_code| all_codes| resolutions| distances| +--------------------------+---------+----------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+ | chest pain| PROBLEM| D002637|D002637:::D059350:::D019547:::D020069:::D015746:::D000072716:::D005157:::D059265:::D001416:::D048...|Chest Pain:::Chronic Pain:::Neck Pain:::Shoulder Pain:::Abdominal Pain:::Cancer Pain:::Facial Pai...|0.0000:::0.0577:::0.0587:::0.0601:::0.0658:::0.0704:::0.0712:::0.0741:::0.0766:::0.0778:::0.0794:...| |bilateral pleural effusion| PROBLEM| D010996|D010996:::D010490:::D011654:::D016724:::D010995:::D016066:::D011001:::D007819:::D035422:::D004653...|Pleural Effusion:::Pericardial Effusion:::Pulmonary Edema:::Empyema, Pleural:::Pleural Diseases::...|0.0309:::0.1010:::0.1115:::0.1213:::0.1218:::0.1398:::0.1425:::0.1401:::0.1451:::0.1464:::0.1464:...| | the pathology| TEST| D010336|D010336:::D010335:::D001004:::D020969:::C001675:::C536472:::D004194:::D003951:::D013631:::C535329...|Pathology:::Pathologic Processes:::Anus Diseases:::Disease Attributes:::malformins:::Upington dis...|0.0788:::0.0977:::0.1364:::0.1396:::0.1419:::0.1459:::0.1418:::0.1393:::0.1514:::0.1541:::0.1491:...| | the pericardectomy|TREATMENT| 
D010492|D010492:::D011670:::D018700:::D020884:::D011672:::D005927:::D064727:::D002431:::C000678968:::D011...|Pericardiectomy:::Pulpectomy:::Pleurodesis:::Colpotomy:::Pulpotomy:::Glossectomy:::Posterior Caps...|0.1098:::0.1448:::0.1801:::0.1852:::0.1871:::0.1923:::0.1901:::0.2023:::0.2075:::0.2010:::0.1996:...| | mesothelioma| PROBLEM|D000086002|D000086002:::C535700:::D009208:::D032902:::D018301:::D018199:::C562740:::C000686536:::D018276:::D...|Mesothelioma, Malignant:::Malignant mesenchymal tumor:::Myoepithelioma:::Ganoderma:::Neoplasms, M...|0.0813:::0.1515:::0.1599:::0.1810:::0.1864:::0.1881:::0.1907:::0.1938:::0.1924:::0.1876:::0.2040:...| | chest tube placement|TREATMENT| D015505|D015505:::D019616:::D013896:::D012124:::D013906:::D013510:::D020708:::D035423:::D013903:::D000066...|Chest Tubes:::Thoracic Surgical Procedures:::Thoracic Diseases:::Respiratory Care Units:::Thoraco...|0.0557:::0.1473:::0.1598:::0.1604:::0.1725:::0.1651:::0.1795:::0.1760:::0.1804:::0.1846:::0.1883:...| | drainage of the fluid|TREATMENT| D004322|D004322:::D018495:::C045413:::D021061:::D045268:::D018508:::D005441:::D015633:::D014906:::D001834...|Drainage:::Fluid Shifts:::Bonain's liquid:::Liquid Ventilation:::Flowmeters:::Water Purification:...|0.1141:::0.1403:::0.1582:::0.1549:::0.1586:::0.1626:::0.1599:::0.1655:::0.1667:::0.1656:::0.1741:...| | thoracoscopy|TREATMENT| D013906|D013906:::D020708:::D035423:::D013905:::D035441:::D013897:::D001468:::D000069258:::D013909:::D013...|Thoracoscopy:::Thoracoscopes:::Thoracic Cavity:::Thoracoplasty:::Thoracic Wall:::Thoracic Duct:::...|0.0000:::0.0359:::0.0744:::0.1007:::0.1070:::0.1143:::0.1186:::0.1257:::0.1228:::0.1356:::0.1354:...| | fluid biopsies| TEST|D000073890|D000073890:::D010533:::D020420:::D011677:::D017817:::D001706:::D005441:::D005751:::D013582:::D000...|Liquid Biopsy:::Peritoneal Lavage:::Cyst Fluid:::Punctures:::Nasal Lavage 
Fluid:::Biopsy:::Fluids...|0.1408:::0.1612:::0.1763:::0.1744:::0.1744:::0.1810:::0.1744:::0.1828:::0.1896:::0.1909:::0.1950:...| | malignant mesothelioma| PROBLEM|D000086002|D000086002:::C535700:::C562740:::D009236:::D007890:::D012515:::D009208:::C009823:::C000683999:::C...|Mesothelioma, Malignant:::Malignant mesenchymal tumor:::Hemangiopericytoma, Malignant:::Myxosarco...|0.0737:::0.1106:::0.1658:::0.1627:::0.1660:::0.1639:::0.1728:::0.1676:::0.1791:::0.1843:::0.1849:...| +-------+--------------------------+---------+----------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_mesh| |Compatibility:|Healthcare NLP 3.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[mesh_code]| |Language:|en| |Size:|1.0 GB| |Case sensitive:|false| |Dependencies:|sbiobert_base_cased_mli| ## Data Source Trained on 01 December 2021 MeSH dataset. --- layout: model title: Sundanese asr_wav2vec2_indonesian_javanese_sundanese TFWav2Vec2ForCTC from indonesian-nlp author: John Snow Labs name: asr_wav2vec2_indonesian_javanese_sundanese date: 2022-09-24 tags: [wav2vec2, su, audio, open_source, asr] task: Automatic Speech Recognition language: su edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_indonesian_javanese_sundanese` is a Sundanese model originally trained by indonesian-nlp. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_indonesian_javanese_sundanese_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_indonesian_javanese_sundanese_su_4.2.0_3.0_1664038636731.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_indonesian_javanese_sundanese_su_4.2.0_3.0_1664038636731.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_indonesian_javanese_sundanese", "su")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_indonesian_javanese_sundanese", "su") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
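The snippets above assume an existing `audioDf` whose `audio_content` column holds raw audio samples as floats. As a hedged sketch of the preprocessing that typically produces those values — converting signed 16-bit PCM samples to floats in [-1, 1]; the scaling convention and column name are assumptions, not part of this model card:

```python
def pcm16_to_float(samples):
    # Scale signed 16-bit integers (-32768..32767) into [-1.0, 1.0).
    return [s / 32768.0 for s in samples]

raw = [0, 16384, -16384, 32767, -32768]   # toy PCM frame
floats = pcm16_to_float(raw)
print(floats[3])  # just under 1.0, the positive full-scale value

# These float lists are what the Spark DataFrame would carry, e.g.:
# audioDf = spark.createDataFrame([[floats]]).toDF("audio_content")
```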
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_indonesian_javanese_sundanese|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|su|
|Size:|1.2 GB|

---
layout: model
title: Recognize Entities DL Pipeline for Portuguese - Small
author: John Snow Labs
name: entity_recognizer_sm
date: 2021-03-22
tags: [open_source, portuguese, entity_recognizer_sm, pipeline, pt]
supported: true
task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging]
language: pt
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The entity_recognizer_sm is a pretrained pipeline that processes text through a simple sequence of basic steps, performing most of the common text processing tasks on your DataFrame.

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_pt_3.0.0_3.0_1616442035630.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_pt_3.0.0_3.0_1616442035630.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('entity_recognizer_sm', lang = 'pt')

annotations = pipeline.fullAnnotate("Olá de John Snow Labs! ")[0]

annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("entity_recognizer_sm", lang = "pt")

val result = pipeline.fullAnnotate("Olá de John Snow Labs! ")(0)
```

{:.nlu-block}
```python
import nlu

text = ["Olá de John Snow Labs! "]

result_df = nlu.load('pt.ner').predict(text)

result_df
```
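The pipeline's NER stage emits one BIO tag per token ('O', 'B-PER', 'I-PER' in the results below), and the `entities` column is those tags collapsed into chunks. A minimal sketch of that collapsing step — a hypothetical helper for illustration, not a Spark NLP API:

```python
def bio_to_chunks(tokens, tags):
    # Group a B-X tag and its following I-X tags into one entity chunk.
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Olá", "de", "John", "Snow", "Labs!"]
tags   = ["O", "O", "B-PER", "I-PER", "I-PER"]
print(bio_to_chunks(tokens, tags))  # [('John Snow Labs!', 'PER')]
```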
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:----------------------------|:---------------------------|:---------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Olá de John Snow Labs! '] | ['Olá de John Snow Labs!'] | ['Olá', 'de', 'John', 'Snow', 'Labs!'] | [[0.0, 0.0, 0.0, 0.0,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_sm| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|pt| --- layout: model title: English asr_wav2vec2_base_timit_demo_colab0_by_hassnain TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab0_by_hassnain date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_base_timit_demo_colab0_by_hassnain` is a English model originally trained by hassnain. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab0_by_hassnain_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab0_by_hassnain_en_4.2.0_3.0_1664039022806.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab0_by_hassnain_en_4.2.0_3.0_1664039022806.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab0_by_hassnain", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab0_by_hassnain", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab0_by_hassnain| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: Company Name Normalization using Nasdaq author: John Snow Labs name: finel_nasdaq_data_company_name date: 2022-10-22 tags: [en, finance, companies, nasdaq, ticker, licensed] task: Entity Resolution language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Financial Entity Resolver model, trained to obtain normalized versions of company names registered in NASDAQ. You can use this model after extracting a company name with any NER model, and you will obtain the official name of the company as per the NASDAQ database. After this, you can use `finmapper_nasdaq_data_company_name` to augment and obtain more information about a company using NASDAQ as a data source, including Ticker, Sector, Location, Currency, etc. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/ER_EDGAR_CRUNCHBASE/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finel_nasdaq_data_company_name_en_1.0.0_3.0_1666473632696.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finel_nasdaq_data_company_name_en_1.0.0_3.0_1666473632696.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pandas test = ["FIDUS INVESTMENT corp","ASPECT DEVELOPMENT Inc","CFSB BANCORP","DALEEN TECHNOLOGIES","GLEASON Corporation"] testdf = pandas.DataFrame(test, columns=['text']) testsdf = spark.createDataFrame(testdf).toDF('text') documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("sentence") use = nlp.UniversalSentenceEncoder.pretrained("tfhub_use_lg", "en")\ .setInputCols(["sentence"])\ .setOutputCol("embeddings") use_er_model = finance.SentenceEntityResolverModel.pretrained('finel_nasdaq_data_company_name', 'en', 'finance/models')\ .setInputCols("embeddings")\ .setOutputCol('normalized') prediction_Model = nlp.Pipeline(stages=[documentAssembler, use, use_er_model]) test_pred = prediction_Model.fit(testsdf).transform(testsdf) ```
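Under the hood, a sentence-embedding resolver returns the training entry whose embedding lies closest to the embedding of the input name. A minimal, self-contained sketch of that idea in plain Python — the vectors and the tiny "database" below are illustrative stand-ins, not the model's real embeddings:

```python
import math

# Toy "embedding database" of official NASDAQ names (illustrative vectors,
# not the real model's embeddings).
database = {
    "FIDUS INVESTMENT CORP": [0.9, 0.1, 0.0],
    "GLEASON CORP":          [0.1, 0.9, 0.0],
    "CFSB BANCORP INC":      [0.0, 0.1, 0.9],
}

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def resolve(query_vec):
    # Return the official name whose embedding is most similar to the query.
    return max(database, key=lambda name: cosine(query_vec, database[name]))

print(resolve([0.8, 0.2, 0.1]))  # closest to FIDUS INVESTMENT CORP
```

The real model does the same nearest-neighbor lookup, but over Universal Sentence Encoder embeddings of the full NASDAQ name list.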
## Results ```bash +----------------------+-------------------------+ |text |result | +----------------------+-------------------------+ |FIDUS INVESTMENT corp |[FIDUS INVESTMENT CORP] | |ASPECT DEVELOPMENT Inc|[ASPECT DEVELOPMENT INC] | |CFSB BANCORP |[CFSB BANCORP INC] | |DALEEN TECHNOLOGIES |[DALEEN TECHNOLOGIES INC]| |GLEASON Corporation |[GLEASON CORP] | +----------------------+-------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finel_nasdaq_data_company_name| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings]| |Output Labels:|[normalized]| |Language:|en| |Size:|69.7 MB| |Case sensitive:|false| ## References NASDAQ Database --- layout: model title: Wolof asr_av2vec2_xls_r_300m_wolof_lm TFWav2Vec2ForCTC from abdouaziiz author: John Snow Labs name: asr_av2vec2_xls_r_300m_wolof_lm date: 2022-09-24 tags: [wav2vec2, wo, audio, open_source, asr] task: Automatic Speech Recognition language: wo edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_av2vec2_xls_r_300m_wolof_lm` is a Wolof model originally trained by abdouaziiz. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_av2vec2_xls_r_300m_wolof_lm_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_av2vec2_xls_r_300m_wolof_lm_wo_4.2.0_3.0_1664038180127.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_av2vec2_xls_r_300m_wolof_lm_wo_4.2.0_3.0_1664038180127.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_av2vec2_xls_r_300m_wolof_lm", "wo")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_av2vec2_xls_r_300m_wolof_lm", "wo") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_av2vec2_xls_r_300m_wolof_lm| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|wo| |Size:|1.2 GB| --- layout: model title: Spanish DistilBertForQuestionAnswering model (from CenIA) SQAC author: John Snow Labs name: distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_sqac date: 2022-06-08 tags: [es, open_source, distilbert, question_answering] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distillbert-base-spanish-uncased-finetuned-qa-sqac` is a Spanish model originally trained by `CenIA`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_sqac_es_4.0.0_3.0_1654728123375.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_sqac_es_4.0.0_3.0_1654728123375.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_sqac","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["¿Cuál es mi nombre?", "Mi nombre es Clara y vivo en Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_sqac","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("¿Cuál es mi nombre?", "Mi nombre es Clara y vivo en Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.sqac.distil_bert.base_uncased").predict("""¿Cuál es mi nombre?|||"Mi nombre es Clara y vivo en Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_distillbert_base_spanish_uncased_finetuned_qa_sqac| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|250.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/CenIA/distillbert-base-spanish-uncased-finetuned-qa-sqac --- layout: model title: Legal General Clause Binary Classifier author: John Snow Labs name: legclf_general_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `general` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
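As a rough illustration of the first technique (paragraph splitting by multiline), a plain-Python sketch — the regex and the sample contract text are illustrative, not taken from the workshop notebook:

```python
import re

# A toy contract with clauses separated by blank lines (illustrative text).
contract = (
    "1. GOVERNING LAW. This Agreement shall be governed by the laws of Delaware.\n\n"
    "2. SEVERABILITY. If any provision is held invalid, the remainder survives.\n\n"
    "3. NOTICES. All notices shall be in writing."
)

# Split on one or more blank lines, so each candidate clause can be
# classified on its own with enough surrounding context.
paragraphs = [p.strip() for p in re.split(r"\n\s*\n", contract) if p.strip()]

for p in paragraphs:
    print(p.split(".")[0])
```

Each resulting paragraph can then be fed to the classifier as a separate row of the `clause_text` column.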
This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `general` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_general_clause_en_1.0.0_3.2_1660123556610.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_general_clause_en_1.0.0_3.2_1660123556610.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_general_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
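Since each of these binary classifiers emits a single `category` per clause, combining several of them amounts to collecting one True/False flag per clause type. A hypothetical post-processing sketch — the model names and their outputs below are made up for illustration, mimicking the shape of the `category` result:

```python
# Hypothetical outputs of several binary clause classifiers for one clause
# text: each model returns its own clause label or "other".
results_per_model = {
    "legclf_general_clause":  "general",
    "legclf_indemnification": "other",
    "legclf_confidentiality": "other",
}

# Collapse into one True/False flag per clause model.
flags = {model: label != "other" for model, label in results_per_model.items()}
print(flags)
```

In a real pipeline the dictionary values would come from each model's `category` annotation on the same input row.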
## Results ```bash +-------+ | result| +-------+ |[general]| |[other]| |[other]| |[general]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_general_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support general 0.59 0.43 0.50 51 other 0.80 0.88 0.84 129 accuracy - - 0.76 180 macro-avg 0.70 0.66 0.67 180 weighted-avg 0.74 0.76 0.74 180 ``` --- layout: model title: Translate Irish to English Pipeline author: John Snow Labs name: translate_ga_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ga, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `ga` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ga_en_xx_2.7.0_2.4_1609686142264.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ga_en_xx_2.7.0_2.4_1609686142264.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ga_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ga_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ga.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ga_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate English to Marshallese Pipeline author: John Snow Labs name: translate_en_mh date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, mh, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module especially on larger sequence. The use of an accelerator such as GPU is recommended. - source languages: `en` - target languages: `mh` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_mh_xx_2.7.0_2.4_1609687817974.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_mh_xx_2.7.0_2.4_1609687817974.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_mh", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_mh", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.mh').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_mh| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Detect Clinical Events (Admissions) author: John Snow Labs name: ner_events_admission_clinical_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_events_admission_clinical](https://nlp.johnsnowlabs.com/2021/03/31/ner_events_admission_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_events_admission_clinical_pipeline_en_3.4.1_3.0_1647873398796.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_events_admission_clinical_pipeline_en_3.4.1_3.0_1647873398796.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_events_admission_clinical_pipeline", "en", "clinical/models") pipeline.annotate("The patient presented to the emergency room last evening") ``` ```scala val pipeline = new PretrainedPipeline("ner_events_admission_clinical_pipeline", "en", "clinical/models") pipeline.annotate("The patient presented to the emergency room last evening") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.events_admission_clinical.pipeline").predict("""The patient presented to the emergency room last evening""") ```
## Results ```bash +------------------+-------------+ |chunk |ner_label | +------------------+-------------+ |presented |OCCURRENCE | |the emergency room|CLINICAL_DEPT| |last evening |TIME | +------------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_events_admission_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Spanish RoBERTa Embeddings (Base, Gaussian Function) author: John Snow Labs name: roberta_embeddings_bertin_base_gaussian date: 2022-04-14 tags: [roberta, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bertin-base-gaussian` is a Spanish model originally trained by `bertin-project`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_bertin_base_gaussian_es_3.4.2_3.0_1649945762002.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_bertin_base_gaussian_es_3.4.2_3.0_1649945762002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_bertin_base_gaussian","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_bertin_base_gaussian","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.bertin_base_gaussian").predict("""Me encanta chispa nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_bertin_base_gaussian| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|234.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/bertin-project/bertin-base-gaussian --- layout: model title: Japanese BertForMaskedLM Base Cased model (from cl-tohoku) author: John Snow Labs name: bert_embeddings_base_japanese_char_v2 date: 2022-12-02 tags: [ja, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ja edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-japanese-char-v2` is a Japanese model originally trained by `cl-tohoku`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_japanese_char_v2_ja_4.2.4_3.0_1670018182092.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_japanese_char_v2_ja_4.2.4_3.0_1670018182092.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_japanese_char_v2","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_japanese_char_v2","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_japanese_char_v2| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Size:|340.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/cl-tohoku/bert-base-japanese-char-v2 - https://github.com/google-research/bert - https://pypi.org/project/unidic-lite/ - https://github.com/cl-tohoku/bert-japanese/tree/v2.0 - https://taku910.github.io/mecab/ - https://github.com/neologd/mecab-ipadic-neologd - https://github.com/polm/fugashi - https://github.com/polm/unidic-lite - https://www.tensorflow.org/tfrc/ - https://creativecommons.org/licenses/by-sa/3.0/ --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from shaoyezh) author: John Snow Labs name: distilbert_qa_shaoyezh_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `shaoyezh`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_shaoyezh_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772572706.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_shaoyezh_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772572706.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_shaoyezh_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_shaoyezh_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_shaoyezh_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/shaoyezh/distilbert-base-uncased-finetuned-squad --- layout: model title: Financial Properties Item Binary Classifier author: John Snow Labs name: finclf_properties_item date: 2022-08-10 tags: [en, finance, classification, 10k, annual, reports, sec, filings, licensed] task: Text Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `properties` item type of 10K Annual Reports. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big financial documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Finance NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
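A naive way to respect the 512-token limit is to greedily pack whitespace-separated words into fixed-size chunks. This is only a rough sketch (the real embeddings count subword tokens, so in practice you would leave headroom below the hard limit); the helper name and dummy document are illustrative:

```python
def chunk_by_tokens(text, max_tokens=512):
    """Greedily pack whitespace-separated tokens into chunks of at most max_tokens."""
    tokens = text.split()
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunks.append(" ".join(tokens[i:i + max_tokens]))
    return chunks

# A 1200-token dummy document stands in for a long 10K report section.
report = " ".join(f"word{i}" for i in range(1200))
pieces = chunk_by_tokens(report, max_tokens=512)
print(len(pieces))  # 3 chunks: 512 + 512 + 176 tokens
```

Each chunk can then be classified independently, and the per-chunk results aggregated back to the document.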
## Predicted Entities `other`, `properties` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_properties_item_en_1.0.0_3.2_1660154457823.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_properties_item_en_1.0.0_3.2_1660154457823.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") useEmbeddings = nlp.UniversalSentenceEncoder.pretrained() \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("finclf_properties_item", "en", "finance/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, useEmbeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[properties]| |[other]| |[other]| |[properties]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_properties_item| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.5 MB| ## References Weak labelling on documents from Edgar database ## Benchmarking ```bash label precision recall f1-score support other 0.96 1.00 0.98 22 properties 1.00 0.96 0.98 25 accuracy - - 0.98 47 macro-avg 0.98 0.98 0.98 47 weighted-avg 0.98 0.98 0.98 47 ``` --- layout: model title: Pipeline to Detect Drug Information author: John Snow Labs name: ner_posology_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, drug, en] task: [Named Entity Recognition, Pipeline Healthcare] language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_posology](https://nlp.johnsnowlabs.com/2020/04/15/ner_posology_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_POSOLOGY.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_pipeline_en_3.4.1_3.0_1647871564965.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_pipeline_en_3.4.1_3.0_1647871564965.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_posology_pipeline", "en", "clinical/models") pipeline.fullAnnotate("The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. 
daily, and Bactrim DS b.i.d.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_posology_pipeline", "en", "clinical/models") pipeline.fullAnnotate("The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. 
daily, and Bactrim DS b.i.d.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology_pipeline").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""") ```
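The chunk/entity table shown in the Results section is a simple projection of the `fullAnnotate` output. A hedged sketch in plain Python: the annotation objects are mocked here with a namedtuple, on the assumption that the real Spark NLP annotations expose `result` and `metadata` the same way:

```python
from collections import namedtuple

# Stand-in for Spark NLP's Annotation; only the fields we read are modeled.
Annotation = namedtuple("Annotation", ["result", "metadata"])

def chunk_table(ner_chunks):
    """Turn ner_chunk annotations into (chunk text, entity label) rows."""
    return [(a.result, a.metadata.get("entity")) for a in ner_chunks]

rows = chunk_table([
    Annotation("Bactrim", {"entity": "DRUG"}),
    Annotation("for 14 days", {"entity": "DURATION"}),
])
# rows == [("Bactrim", "DRUG"), ("for 14 days", "DURATION")]
```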
## Results ```bash +--------------+---------+ |chunk |ner | +--------------+---------+ |insulin |DRUG | |Bactrim |DRUG | |for 14 days |DURATION | |Fragmin |DRUG | |5000 units |DOSAGE | |subcutaneously|ROUTE | |daily |FREQUENCY| |Xenaderm |DRUG | |topically |ROUTE | |b.i.d |FREQUENCY| |Lantus |DRUG | |40 units |DOSAGE | |subcutaneously|ROUTE | |at bedtime |FREQUENCY| |OxyContin |DRUG | |30 mg |STRENGTH | |p.o |ROUTE | |q.12 h |FREQUENCY| |folic acid |DRUG | |1 mg |STRENGTH | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English asr_wav2vec2_base_checkpoint_10 TFWav2Vec2ForCTC from jiobiala24 author: John Snow Labs name: asr_wav2vec2_base_checkpoint_10 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_checkpoint_10` is an English model originally trained by jiobiala24. 
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_checkpoint_10_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_checkpoint_10_en_4.2.0_3.0_1664020556256.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_checkpoint_10_en_4.2.0_3.0_1664020556256.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_checkpoint_10", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_checkpoint_10", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
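Both snippets assume an existing `audioDf` with an `audio_content` column holding the waveform as an array of floats. As one hedged way to produce such floats, here is a standard-library-only sketch for 16-bit PCM WAV files (in practice a resampling library such as librosa is more common, since Wav2vec2 models typically expect 16 kHz mono audio):

```python
import struct
import wave

def wav_to_floats(path):
    """Decode a 16-bit PCM WAV file into a list of floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        raw = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [s / 32768.0 for s in samples]

# The float list can then be wrapped into the DataFrame the pipeline expects:
# audioDf = spark.createDataFrame([[wav_to_floats("speech.wav")]]).toDF("audio_content")
```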
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_checkpoint_10| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|349.2 MB| --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from andyjennings) author: John Snow Labs name: xlmroberta_ner_andyjennings_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `andyjennings`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_andyjennings_base_finetuned_panx_de_4.1.0_3.0_1660430787800.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_andyjennings_base_finetuned_panx_de_4.1.0_3.0_1660430787800.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_andyjennings_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_andyjennings_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_andyjennings_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/andyjennings/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: English BertForQuestionAnswering Tiny Cased model (from MichelBartels) author: John Snow Labs name: bert_qa_tinybert_6l_768d_squad2_large_teacher_dummy date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tinybert-6l-768d-squad2-large-teacher-dummy` is an English model originally trained by `MichelBartels`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_tinybert_6l_768d_squad2_large_teacher_dummy_en_4.0.0_3.0_1657192981288.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_tinybert_6l_768d_squad2_large_teacher_dummy_en_4.0.0_3.0_1657192981288.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tinybert_6l_768d_squad2_large_teacher_dummy","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tinybert_6l_768d_squad2_large_teacher_dummy","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_tinybert_6l_768d_squad2_large_teacher_dummy| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|249.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/MichelBartels/tinybert-6l-768d-squad2-large-teacher-dummy --- layout: model title: Named Entity Recognition Profiling (Biobert) author: John Snow Labs name: ner_profiling_biobert date: 2021-09-23 tags: [ner, ner_profiling, clinical, licensed, en] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.2.3 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to explore all the available pretrained NER models at once. When you run this pipeline over your text, you will end up with the predictions coming out of each pretrained clinical NER model trained with `biobert_pubmed_base_cased`. Here are the NER models that this pretrained pipeline includes: `ner_jsl_enriched_biobert`, `ner_clinical_biobert`, `ner_chemprot_biobert`, `ner_jsl_greedy_biobert`, `ner_bionlp_biobert`, `ner_human_phenotype_go_biobert`, `jsl_rd_ner_wip_greedy_biobert`, `ner_posology_large_biobert`, `ner_risk_factors_biobert`, `ner_anatomy_coarse_biobert`, `ner_deid_enriched_biobert`, `ner_human_phenotype_gene_biobert`, `ner_jsl_biobert`, `ner_events_biobert`, `ner_deid_biobert`, `ner_posology_biobert`, `ner_diseases_biobert`, `jsl_ner_wip_greedy_biobert`, `ner_ade_biobert`, `ner_anatomy_biobert`, `ner_cellular_biobert`. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.2.Pretrained_NER_Profiling_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_profiling_biobert_en_3.2.3_2.4_1632427360617.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_profiling_biobert_en_3.2.3_2.4_1632427360617.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline ner_profiling_pipeline = PretrainedPipeline('ner_profiling_biobert', 'en', 'clinical/models') result = ner_profiling_pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val ner_profiling_pipeline = PretrainedPipeline("ner_profiling_biobert", "en", "clinical/models") val result = ner_profiling_pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.profiling_biobert").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .""") ```
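Because `annotate` returns a plain Python dict with one `*_chunks` entry per NER model, comparing the profiled models is ordinary dict work. A small sketch (the helper name is ours, not part of the library):

```python
def rank_profiling_models(result, suffix="_chunks"):
    """Map each profiled NER model to its chunk list, most productive first."""
    models = {k[: -len(suffix)]: v for k, v in result.items() if k.endswith(suffix)}
    return sorted(models.items(), key=lambda kv: len(kv[1]), reverse=True)

# Mocked annotate() output with the same key layout as the real pipeline:
mock = {
    "ner_posology_biobert_chunks": [],
    "ner_diseases_biobert_chunks": ["obesity", "polyuria"],
    "token": ["obesity"],  # non-chunk keys are ignored
}
ranking = rank_profiling_models(mock)
# -> [("ner_diseases_biobert", ["obesity", "polyuria"]), ("ner_posology_biobert", [])]
```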
## Results ```bash ner_cellular_biobert_chunks : [] ner_diseases_biobert_chunks : ['gestational diabetes mellitus', 'type two diabetes mellitus', 'T2DM', 'HTG-induced pancreatitis', 'hepatitis', 'obesity', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ner_events_biobert_chunks : ['gestational diabetes mellitus', 'eight years', 'presentation', 'type two diabetes mellitus ( T2DM', 'HTG-induced pancreatitis', 'three years', 'presentation', 'an acute hepatitis', 'obesity', 'a body mass index', 'BMI', 'presented', 'a one-week', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ner_bionlp_biobert_chunks : [] ner_jsl_greedy_biobert_chunks : ['28-year-old', 'female', 'gestational diabetes mellitus', 'eight years prior', 'type two diabetes mellitus', 'T2DM', 'HTG-induced pancreatitis', 'three years prior', 'acute hepatitis', 'obesity', 'body mass index', 'BMI ) of 33.5 kg/m2', 'one-week', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ner_jsl_biobert_chunks : ['28-year-old', 'female', 'gestational diabetes mellitus', 'eight years prior', 'type two diabetes mellitus', 'T2DM', 'HTG-induced pancreatitis', 'three years prior', 'acute', 'hepatitis', 'obesity', 'body mass index', 'BMI ) of 33.5 kg/m2', 'one-week', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ner_anatomy_biobert_chunks : ['body'] ner_jsl_enriched_biobert_chunks : ['28-year-old', 'female', 'gestational diabetes mellitus', 'type two diabetes mellitus', 'T2DM', 'HTG-induced pancreatitis', 'acute', 'hepatitis', 'obesity', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ner_human_phenotype_go_biobert_chunks : ['obesity', 'polyuria', 'polydipsia'] ner_deid_biobert_chunks : ['eight years', 'three years'] ner_deid_enriched_biobert_chunks : [] token : ['A', '28-year-old', 'female', 'with', 'a', 'history', 'of', 'gestational', 'diabetes', 'mellitus', 'diagnosed', 'eight', 'years', 'prior', 'to', 'presentation', 'and', 'subsequent', 'type', 'two', 'diabetes', 'mellitus', '(', 'T2DM', 
'),', 'one', 'prior', 'episode', 'of', 'HTG-induced', 'pancreatitis', 'three', 'years', 'prior', 'to', 'presentation', ',', 'associated', 'with', 'an', 'acute', 'hepatitis', ',', 'and', 'obesity', 'with', 'a', 'body', 'mass', 'index', '(', 'BMI', ')', 'of', '33.5', 'kg/m2', ',', 'presented', 'with', 'a', 'one-week', 'history', 'of', 'polyuria', ',', 'polydipsia', ',', 'poor', 'appetite', ',', 'and', 'vomiting', '.'] ner_clinical_biobert_chunks : ['gestational diabetes mellitus', 'subsequent type two diabetes mellitus ( T2DM', 'HTG-induced pancreatitis', 'an acute hepatitis', 'obesity', 'a body mass index ( BMI )', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ner_anatomy_coarse_biobert_chunks : ['body'] ner_human_phenotype_gene_biobert_chunks : ['obesity', 'mass', 'polyuria', 'polydipsia', 'vomiting'] ner_posology_large_biobert_chunks : [] jsl_rd_ner_wip_greedy_biobert_chunks : ['gestational diabetes mellitus', 'type two diabetes mellitus', 'T2DM', 'HTG-induced pancreatitis', 'acute hepatitis', 'obesity', 'body mass index', '33.5', 'kg/m2', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ner_posology_biobert_chunks : [] jsl_ner_wip_greedy_biobert_chunks : ['28-year-old', 'female', 'gestational diabetes mellitus', 'eight years prior', 'type two diabetes mellitus', 'T2DM', 'HTG-induced pancreatitis', 'three years prior', 'acute hepatitis', 'obesity', 'body mass index', 'BMI ) of 33.5 kg/m2', 'one-week', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ner_chemprot_biobert_chunks : [] ner_ade_biobert_chunks : ['pancreatitis', 'acute hepatitis', 'polyuria', 'polydipsia', 'poor appetite', 'vomiting'] ner_risk_factors_biobert_chunks : ['diabetes mellitus', 'subsequent type two diabetes mellitus', 'obesity'] sentence : ['A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to 
presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_profiling_biobert| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.2.3+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel (x21) - NerConverter (x21) - Finisher --- layout: model title: Legal Plant Product Document Classifier (EURLEX) author: John Snow Labs name: legclf_plant_product_bert date: 2023-03-06 tags: [en, legal, classification, clauses, plant_product, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the `legclf_plant_product_bert` model, a Bert Sentence Embeddings Document Classifier, classifies whether the document belongs to the class `Plant_Product` or not (Binary Classification) according to EuroVoc labels. ## Predicted Entities `Plant_Product`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_plant_product_bert_en_1.0.0_3.0_1678111757503.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_plant_product_bert_en_1.0.0_3.0_1678111757503.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_plant_product_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Plant_Product]| |[Other]| |[Other]| |[Plant_Product]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_plant_product_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.3 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.89 0.81 0.85 48 Plant_Product 0.83 0.90 0.86 48 accuracy - - 0.85 96 macro-avg 0.86 0.85 0.85 96 weighted-avg 0.86 0.85 0.85 96 ``` --- layout: model title: English BertForQuestionAnswering model (from mezes) author: John Snow Labs name: bert_qa_eauction_section_parsing_from_pretrained date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `eauction-section-parsing-from-pretrained` is an English model originally trained by `mezes`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_eauction_section_parsing_from_pretrained_en_4.0.0_3.0_1654187617325.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_eauction_section_parsing_from_pretrained_en_4.0.0_3.0_1654187617325.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_eauction_section_parsing_from_pretrained","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_eauction_section_parsing_from_pretrained","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.by_mezes").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
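The NLU one-liner above packs the question and the context into a single string joined by `|||`. A tiny sketch of that convention (the helper names are ours; the separator is taken from the example above):

```python
SEP = "|||"

def pack(question, context):
    """Join question and context the way the nlu predict example does."""
    return question + SEP + context

def unpack(packed):
    """Split a packed string back into (question, context)."""
    question, _, context = packed.partition(SEP)
    return question, context

q, c = unpack(pack("What's my name?", "My name is Clara and I live in Berkeley."))
```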
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_eauction_section_parsing_from_pretrained| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|421.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mezes/eauction-section-parsing-from-pretrained --- layout: model title: Arabic BertForQuestionAnswering model (from gfdgdfgdg) author: John Snow Labs name: bert_qa_arap_qa_bert_v2 date: 2022-06-02 tags: [ar, open_source, question_answering, bert] task: Question Answering language: ar edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `arap_qa_bert_v2` is an Arabic model originally trained by `gfdgdfgdg`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_arap_qa_bert_v2_ar_4.0.0_3.0_1654179201289.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_arap_qa_bert_v2_ar_4.0.0_3.0_1654179201289.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_arap_qa_bert_v2","ar") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_arap_qa_bert_v2","ar") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("ar.answer_question.bert.v2").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_arap_qa_bert_v2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ar| |Size:|505.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/gfdgdfgdg/arap_qa_bert_v2 --- layout: model title: Legal Performance Share Award Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_performance_share_award_agreement_bert date: 2023-01-29 tags: [en, legal, classification, performance, share, award, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_performance_share_award_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `performance-share-award-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `performance-share-award-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_performance_share_award_agreement_bert_en_1.0.0_3.0_1674990799030.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_performance_share_award_agreement_bert_en_1.0.0_3.0_1674990799030.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_performance_share_award_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[performance-share-award-agreement]| |[other]| |[other]| |[performance-share-award-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_performance_share_award_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.98 1.00 0.99 54 performance-share-award-agreement 1.00 0.97 0.99 34 accuracy - - 0.99 88 macro-avg 0.99 0.99 0.99 88 weighted-avg 0.99 0.99 0.99 88 ``` --- layout: model title: Spanish RobertaForQuestionAnswering Cased model (from Evelyn18) author: John Snow Labs name: roberta_qa_becasv4.1 date: 2023-01-20 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BECASV4.1` is a Spanish model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_becasv4.1_es_4.3.0_3.0_1674207847912.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_becasv4.1_es_4.3.0_3.0_1674207847912.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_becasv4.1","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_becasv4.1","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_becasv4.1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|459.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Evelyn18/BECASV4.1 --- layout: model title: Slovenian Lemmatizer author: John Snow Labs name: lemma date: 2020-07-29 23:37:00 +0800 task: Lemmatization language: sl edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [lemmatizer, sl] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb#scrollTo=bbzEH9u7tdxR){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_sl_2.5.5_2.4_1596055008133.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_sl_2.5.5_2.4_1596055008133.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "sl") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("John Snow je poleg tega, da je severni kralj, angleški zdravnik in vodilni v razvoju anestezije in zdravstvene higiene.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "sl") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("John Snow je poleg tega, da je severni kralj, angleški zdravnik in vodilni v razvoju anestezije in zdravstvene higiene.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""John Snow je poleg tega, da je severni kralj, angleški zdravnik in vodilni v razvoju anestezije in zdravstvene higiene."""] lemma_df = nlu.load('sl.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=3, result='John', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=5, end=8, result='Snow', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=10, end=11, result='jesti', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=13, end=17, result='poleg', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=19, end=22, result='ta', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|sl| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Pipeline to Detect Chemical Compounds and Genes author: John Snow Labs name: ner_chemprot_clinical_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_chemprot_clinical](https://nlp.johnsnowlabs.com/2021/03/31/ner_chemprot_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chemprot_clinical_pipeline_en_3.4.1_3.0_1647868192988.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chemprot_clinical_pipeline_en_3.4.1_3.0_1647868192988.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_chemprot_clinical_pipeline", "en", "clinical/models") pipeline.annotate("Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.") ``` ```scala val pipeline = new PretrainedPipeline("ner_chemprot_clinical_pipeline", "en", "clinical/models") pipeline.annotate("Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.chemprot_clinical.pipeline").predict("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""") ```
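`pipeline.annotate(...)` above returns a plain Python dict of lists, so post-processing needs no Spark at all. A minimal sketch of pairing recognized chunks with their entity labels; the dict here is a mock with the shape such NER pipelines typically produce, and the key names (`ner_chunk`, `entities`) are assumptions for illustration, not guaranteed output names:

```python
# Mock of an annotate() result: parallel lists of chunks and entity labels.
# Key names are hypothetical; inspect your pipeline's actual output keys.
output = {
    "ner_chunk": ["Keratinocyte growth factor", "acidic fibroblast growth factor"],
    "entities": ["GENE-Y", "GENE-Y"],
}

# Pair each chunk with its label for downstream use.
pairs = list(zip(output["ner_chunk"], output["entities"]))
for chunk, label in pairs:
    print(f"{chunk} -> {label}")
```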
## Results ```bash +----+---------------------------------+---------+-------+----------+ | | chunk | begin | end | entity | +====+=================================+=========+=======+==========+ | 0 | Keratinocyte growth factor | 0 | 25 | GENE-Y | +----+---------------------------------+---------+-------+----------+ | 1 | acidic fibroblast growth factor | 31 | 61 | GENE-Y | +----+---------------------------------+---------+-------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_chemprot_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English image_classifier_vit_violation_classification_bantai__withES ViTForImageClassification from AykeeSalazar author: John Snow Labs name: image_classifier_vit_violation_classification_bantai__withES date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_violation_classification_bantai__withES` is an English model originally trained by AykeeSalazar.
## Predicted Entities `Public Smoking`, `Public-Drinking`, `ambiguous`, `non-violation` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_violation_classification_bantai__withES_en_4.1.0_3.0_1660169254291.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_violation_classification_bantai__withES_en_4.1.0_3.0_1660169254291.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_violation_classification_bantai__withES", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_violation_classification_bantai__withES", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_violation_classification_bantai__withES| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Pipeline to Detect PHI for Deidentification (Sub Entity) author: John Snow Labs name: ner_deid_subentity_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, deid, de] task: Named Entity Recognition language: de edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/01/06/ner_deid_subentity_de.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_DE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_pipeline_de_3.4.1_3.0_1647887751010.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_pipeline_de_3.4.1_3.0_1647887751010.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_subentity_pipeline", "de", "clinical/models") pipeline.annotate("Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.") ``` ```scala val pipeline = new PretrainedPipeline("ner_deid_subentity_pipeline", "de", "clinical/models") pipeline.annotate("Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.") ``` {:.nlu-block} ```python import nlu nlu.load("de.deid.ner_subentity.pipeline").predict("""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""") ```
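The pipeline above detects PHI chunks and their labels; actually de-identifying text then means replacing each chunk with a placeholder. A minimal pure-Python sketch of that masking step, using the example sentence from the snippet above — note this is illustration only, not the Healthcare NLP `DeIdentification` annotator, and a naive sequential `str.replace` assumes chunks do not overlap (replace longer chunks first if they might):

```python
# Minimal PHI-masking sketch: replace each detected chunk with its label.
# Plain Python for illustration; production pipelines should use the
# dedicated de-identification annotators instead.
def mask_phi(text, chunks):
    for chunk, label in chunks:
        text = text.replace(chunk, f"<{label}>")
    return text

text = "Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert."
chunks = [
    ("Michael Berger", "PATIENT"),
    ("12 Dezember 2018", "DATE"),
    ("St. Elisabeth-Krankenhaus", "HOSPITAL"),
    ("Bad Kissingen", "CITY"),
]
masked = mask_phi(text, chunks)
print(masked)
# -> <PATIENT> wird am Morgen des <DATE> ins <HOSPITAL> in <CITY> eingeliefert.
```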
## Results ```bash +-------------------------+-------------------------+ |chunk |ner_deid_subentity_chunk | +-------------------------+-------------------------+ |Michael Berger |PATIENT | |12 Dezember 2018 |DATE | |St. Elisabeth-Krankenhaus|HOSPITAL | |Bad Kissingen |CITY | |Berger |PATIENT | |76 |AGE | +-------------------------+-------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|de| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Translate English to Ga Pipeline author: John Snow Labs name: translate_en_gaa date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, gaa, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `gaa` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_gaa_xx_2.7.0_2.4_1609699086612.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_gaa_xx_2.7.0_2.4_1609699086612.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_gaa", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_gaa", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.gaa').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_gaa| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Word2Vec Embeddings in Manx (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, gv, open_source] task: Embeddings language: gv edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_gv_3.4.1_3.0_1647445667657.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_gv_3.4.1_3.0_1647445667657.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","gv") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","gv") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("gv.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
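The pipeline above maps each token to a dense vector (this model produces 300-dimensional vectors), and such vectors are typically compared with cosine similarity. A self-contained sketch of that comparison — tiny 3-d vectors stand in for the real 300-d embeddings purely for illustration:

```python
import math

# Cosine similarity between two embedding vectors. The real w2v_cc_300d
# model emits 300-dimensional vectors; short vectors are used here only
# to keep the example readable.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(round(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]), 3))  # identical vectors -> 1.0
print(round(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 3))  # orthogonal vectors -> 0.0
```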
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|gv| |Size:|58.3 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from rahulchakwate) author: John Snow Labs name: roberta_qa_rahulchakwate_base_finetuned_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad` is an English model originally trained by `rahulchakwate`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rahulchakwate_base_finetuned_squad_en_4.3.0_3.0_1674217478490.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rahulchakwate_base_finetuned_squad_en_4.3.0_3.0_1674217478490.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rahulchakwate_base_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rahulchakwate_base_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rahulchakwate_base_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/rahulchakwate/roberta-base-finetuned-squad --- layout: model title: French CamemBert Embeddings (from zhenghuabin) author: John Snow Labs name: camembert_embeddings_zhenghuabin_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy_model` is a French model originally trained by `zhenghuabin`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_zhenghuabin_generic_model_fr_3.4.4_3.0_1653991564672.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_zhenghuabin_generic_model_fr_3.4.4_3.0_1653991564672.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_zhenghuabin_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_zhenghuabin_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_zhenghuabin_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/zhenghuabin/dummy_model --- layout: model title: Turkish ElectraForQuestionAnswering model (from husnu) author: John Snow Labs name: electra_qa_small_turkish_uncased_discriminator_finetuned date: 2022-06-22 tags: [tr, open_source, electra, question_answering] task: Question Answering language: tr edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-small-turkish-uncased-discriminator-finetuned_lr-2e-05_epochs-6` is a Turkish model originally trained by `husnu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_small_turkish_uncased_discriminator_finetuned_tr_4.0.0_3.0_1655921302558.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_small_turkish_uncased_discriminator_finetuned_tr_4.0.0_3.0_1655921302558.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_small_turkish_uncased_discriminator_finetuned","tr") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_small_turkish_uncased_discriminator_finetuned","tr") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("tr.answer_question.electra.small_uncased").predict("""Benim adım ne?|||"Benim adım Clara ve Berkeley'de yaşıyorum.""")  ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_small_turkish_uncased_discriminator_finetuned| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|tr| |Size:|52.2 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/husnu/electra-small-turkish-uncased-discriminator-finetuned_lr-2e-05_epochs-6 --- layout: model title: Pipeline to Summarize Radiology Reports author: John Snow Labs name: summarizer_radiology_pipeline date: 2023-05-31 tags: [licensed, en, clinical, summarization, radiology] task: Summarization language: en edition: Healthcare NLP 4.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [summarizer_radiology](https://nlp.johnsnowlabs.com/2023/04/23/summarizer_jsl_radiology_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_radiology_pipeline_en_4.4.1_3.0_1685530176511.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_radiology_pipeline_en_4.4.1_3.0_1685530176511.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("summarizer_radiology_pipeline", "en", "clinical/models") text = """INDICATIONS: Peripheral vascular disease with claudication. RIGHT: 1. Normal arterial imaging of right lower extremity. 2. Peak systolic velocity is normal. 3. Arterial waveform is triphasic. 4. Ankle brachial index is 0.96. LEFT: 1. Normal arterial imaging of left lower extremity. 2. Peak systolic velocity is normal. 3. Arterial waveform is triphasic throughout except in posterior tibial artery where it is biphasic. 4. Ankle brachial index is 1.06. IMPRESSION: Normal arterial imaging of both lower lobes. """ result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("summarizer_radiology_pipeline", "en", "clinical/models") val text = """INDICATIONS: Peripheral vascular disease with claudication. RIGHT: 1. Normal arterial imaging of right lower extremity. 2. Peak systolic velocity is normal. 3. Arterial waveform is triphasic. 4. Ankle brachial index is 0.96. LEFT: 1. Normal arterial imaging of left lower extremity. 2. Peak systolic velocity is normal. 3. Arterial waveform is triphasic throughout except in posterior tibial artery where it is biphasic. 4. Ankle brachial index is 1.06. IMPRESSION: Normal arterial imaging of both lower lobes. """ val result = pipeline.fullAnnotate(text) ```
## Results ```bash The patient has peripheral vascular disease with claudication. The right lower extremity shows normal arterial imaging, but the peak systolic velocity is normal. The arterial waveform is triphasic throughout, except for the posterior tibial artery, which is biphasic. The ankle brachial index is 0.96. The impression is normal arterial imaging of both lower lobes. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_radiology_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|937.2 MB| ## Included Models - DocumentAssembler - MedicalSummarizer --- layout: model title: English BertForQuestionAnswering model (from madlag) author: John Snow Labs name: bert_qa_bert_base_uncased_squad_v1_sparse0.25 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squad-v1-sparse0.25` is an English model originally trained by `madlag`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squad_v1_sparse0.25_en_4.0.0_3.0_1654181403156.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squad_v1_sparse0.25_en_4.0.0_3.0_1654181403156.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_squad_v1_sparse0.25","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_squad_v1_sparse0.25","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased.by_madlag").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_squad_v1_sparse0.25| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|194.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/madlag/bert-base-uncased-squad-v1-sparse0.25 - https://rajpurkar.github.io/SQuAD-explorer - https://www.aclweb.org/anthology/N19-1423.pdf - https://www.aclweb.org/anthology/N19-1423/ - https://arxiv.org/abs/2005.07683 --- layout: model title: Legal Resignation Clause Binary Classifier author: John Snow Labs name: legclf_resignation_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `resignation` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it is better to skip it unless you want to do binary classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
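As a rough illustration of the paragraph-by-multiline splitting mentioned above, plain Python is enough to pre-chunk a document and flag paragraphs that may exceed the 512-token budget (a minimal sketch; the helper name and the whitespace-based token count are our simplification, not the workshop code or the model's actual tokenizer):

```python
def split_paragraphs(text, max_tokens=512):
    """Split a document into paragraphs on blank lines and pair each
    paragraph with a flag saying whether its (whitespace) token count
    fits within the embedding model's 512-token limit."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

doc = ("Resignation. The Employee may resign upon 30 days notice.\n\n"
       "Governing Law. This Agreement is governed by the laws of Delaware.")
for clause, fits in split_paragraphs(doc):
    # feed each clause (that fits) to the classifier pipeline separately
    print(fits, clause[:30])
```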
This model can be combined with any of the hundreds of other Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `resignation` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_resignation_clause_en_1.0.0_3.2_1660122953916.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_resignation_clause_en_1.0.0_3.2_1660122953916.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_resignation_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------------+ |result       | +-------------+ |[resignation]| |[other]      | |[other]      | |[resignation]| +-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_resignation_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.98 1.00 0.99 115 resignation 1.00 0.94 0.97 33 accuracy - - 0.99 148 macro-avg 0.99 0.97 0.98 148 weighted-avg 0.99 0.99 0.99 148 ``` --- layout: model title: Sarcasm Classifier author: John Snow Labs name: classifierdl_use_sarcasm class: ClassifierDLModel language: en nav_key: models repository: public/models date: 03/07/2020 task: Text Classification edition: Spark NLP 2.5.3 spark_version: 2.4 tags: [classifier] supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Classify if a text contains sarcasm. {:.h2_title} ## Predicted Entities ``normal``, ``sarcasm`` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN_SARCASM/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN_SARCASM.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_sarcasm_en_2.5.3_2.4_1593783319298.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_sarcasm_en_2.5.3_2.4_1593783319298.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") use = UniversalSentenceEncoder.pretrained(lang="en") \ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") document_classifier = ClassifierDLModel.pretrained('classifierdl_use_sarcasm', 'en') \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") nlpPipeline = Pipeline(stages=[documentAssembler, use, document_classifier]) light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate('If I could put into words how much I love waking up at am on Tuesdays I would') ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val use = UniversalSentenceEncoder.pretrained(lang="en") .setInputCols(Array("document")) .setOutputCol("sentence_embeddings") val document_classifier = ClassifierDLModel.pretrained("classifierdl_use_sarcasm", "en") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, use, document_classifier)) val data = Seq("If I could put into words how much I love waking up at am on Tuesdays I would").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""If I could put into words how much I love waking up at am on Tuesdays I would"""] sarcasm_df = nlu.load('classify.sarcasm.use').predict(text, output_level='document') sarcasm_df[["document", "sarcasm"]] ```
{:.h2_title} ## Results ```bash +--------------------------------------------------------------------------------------------------------+------------+ |document |class | +--------------------------------------------------------------------------------------------------------+------------+ |If I could put into words how much I love waking up at am on Tuesdays I would | sarcasm | +--------------------------------------------------------------------------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| | Model Name | classifierdl_use_sarcasm | | Model Class | ClassifierDLModel | | Spark NLP Compatibility | 2.5.3 | | Spark Compatibility | 2.4 | | License | open source | | Edition | public | | Input Labels | [document, sentence_embeddings] | | Output Labels | [class] | | Language | en | | Upstream Dependencies | with tfhub_use | {:.h2_title} ## Data Source This model is trained on the sarcasm detection dataset. https://github.com/MirunaPislar/Sarcasm-Detection/tree/master/res/datasets/riloff {:.h2_title} ## Benchmarking Overall accuracy with USE Embeddings is `0.84` ```bash precision recall f1-score support 0 0.84 1.00 0.91 495 1 0.00 0.00 0.00 93 accuracy 0.84 588 macro avg 0.42 0.50 0.46 588 weighted avg 0.71 0.84 0.77 588 ``` --- layout: model title: Ukrainian Part of Speech Tagger (from ukr-models) author: John Snow Labs name: xlmroberta_pos_uk_morph date: 2022-05-18 tags: [xlm_roberta, pos, part_of_speech, uk, open_source] task: Part of Speech Tagging language: uk edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `uk-morph` is a Ukrainian model originally trained by `ukr-models`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_pos_uk_morph_uk_3.4.2_3.0_1652837627542.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_pos_uk_morph_uk_3.4.2_3.0_1652837627542.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_pos_uk_morph","uk") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Я люблю Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_pos_uk_morph","uk") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Я люблю Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_pos_uk_morph| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|uk| |Size:|409.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ukr-models/uk-morph - https://pypi.org/project/tokenize_uk/ --- layout: model title: English BertForQuestionAnswering model (from madlag) author: John Snow Labs name: bert_qa_bert_base_uncased_squad1.1_pruned_x3.2_v2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squad1.1-pruned-x3.2-v2` is an English model originally trained by `madlag`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squad1.1_pruned_x3.2_v2_en_4.0.0_3.0_1654181467619.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squad1.1_pruned_x3.2_v2_en_4.0.0_3.0_1654181467619.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_squad1.1_pruned_x3.2_v2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_squad1.1_pruned_x3.2_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased_v2").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_squad1.1_pruned_x3.2_v2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|172.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/madlag/bert-base-uncased-squad1.1-pruned-x3.2-v2 --- layout: model title: Translate Umbundu to English Pipeline author: John Snow Labs name: translate_umb_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, umb, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. - source languages: `umb` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_umb_en_xx_2.7.0_2.4_1609691853164.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_umb_en_xx_2.7.0_2.4_1609691853164.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_umb_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_umb_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.umb.translate_to.en').predict(text, output_level='sentence') translate_df ```
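Because translation cost grows quickly with sequence length (as the description above notes), long inputs are usually split into sentences before being passed to `annotate`. A naive pre-split can be sketched in plain Python (in a real Spark NLP pipeline a SentenceDetector annotator does this; the regex here is an illustrative simplification):

```python
import re

def naive_sentence_split(text):
    # split after ., ! or ? followed by whitespace; a SentenceDetector
    # would handle abbreviations and other edge cases properly
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

sentences = naive_sentence_split("Your sentence to translate! Another one. A third?")
# then translate each short sentence separately, e.g.:
# for s in sentences: pipeline.annotate(s)
```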
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_umb_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering Cased model (from nlpunibo) author: John Snow Labs name: roberta_qa_roberta date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta` is an English model originally trained by `nlpunibo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_en_4.3.0_3.0_1674212513043.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_en_4.3.0_3.0_1674212513043.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|463.6 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/nlpunibo/roberta --- layout: model title: Spanish BertForQuestionAnswering model (from MMG) author: John Snow Labs name: bert_qa_bert_base_spanish_wwm_cased_finetuned_squad2_es_finetuned_sqac date: 2022-06-02 tags: [es, open_source, question_answering, bert] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-cased-finetuned-squad2-es-finetuned-sqac` is a Spanish model originally trained by `MMG`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_squad2_es_finetuned_sqac_es_4.0.0_3.0_1654180544867.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_squad2_es_finetuned_sqac_es_4.0.0_3.0_1654180544867.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_squad2_es_finetuned_sqac","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_squad2_es_finetuned_sqac","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.squadv2_sqac.bert.base_cased_v2.by_MMG").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_spanish_wwm_cased_finetuned_squad2_es_finetuned_sqac| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|410.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/MMG/bert-base-spanish-wwm-cased-finetuned-squad2-es-finetuned-sqac --- layout: model title: English RobertaForQuestionAnswering Large Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_large_data_seed_0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-data-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_large_data_seed_0_en_4.3.0_3.0_1674221077269.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_large_data_seed_0_en_4.3.0_3.0_1674221077269.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_data_seed_0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_data_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_large_data_seed_0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-large-data-seed-0 --- layout: model title: Detect Cellular/Molecular Biology Entities author: John Snow Labs name: ner_cellular date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for molecular biology related terms. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. ## Predicted Entities `DNA`, `Cell_type`, `Cell_line`, `RNA`, `Protein`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CELLULAR/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_cellular_en_3.0.0_3.0_1617209730811.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_cellular_en_3.0.0_3.0_1617209730811.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") cellular_ner = MedicalNerModel.pretrained("ner_cellular", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, cellular_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive. 
']], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val celular_ner = MedicalNerModel.pretrained("ner_cellular", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, cellular_ner, ner_converter)) val data = Seq("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. 
Tax alone was also inactive.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.cellular").predict("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive. """) ```
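The `ner_converter` stage in the pipelines above merges token-level IOB tags into the entity chunks shown in the results. As a rough illustration of that merging, here is a simplified plain-Python sketch — not Spark NLP's actual implementation:

```python
# Simplified sketch of IOB chunk merging: consecutive B-/I- tags with the
# same label are joined into one chunk. Illustrative only; NerConverter's
# real logic lives inside Spark NLP.
def merge_iob_chunks(tokens, tags):
    """Group (token, IOB-tag) pairs into (chunk_text, label) tuples."""
    chunks, current_tokens, current_label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:  # "O" or a label mismatch closes any open chunk
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks

tokens = ["Binding", "of", "Tax", "to", "Tax-responsive", "element", "1"]
tags = ["O", "O", "B-protein", "O", "B-DNA", "I-DNA", "I-DNA"]
print(merge_iob_chunks(tokens, tags))
# [('Tax', 'protein'), ('Tax-responsive element 1', 'DNA')]
```

This is why the `ner` column holds one tag per token while `ner_chunk` holds multi-token spans such as "Tax-responsive element 1".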
## Results ```bash |chunk |ner | +-----------------------------------------------------------+---------+ |intracellular signaling proteins |protein | |human T-cell leukemia virus type 1 promoter |DNA | |Tax |protein | |Tax-responsive element 1 |DNA | |cyclic AMP-responsive members |protein | |CREB/ATF family |protein | |transcription factors |protein | |Tax |protein | |human T-cell leukemia virus type 1 Tax-responsive element 1|DNA | |TRE-1), |DNA | |lacZ gene |DNA | |CYC1 promoter |DNA | |TRE-1 |DNA | |cyclic AMP response element-binding protein |protein | |CREB |protein | |CREB |protein | |GAL4 activation domain |protein | |GAD |protein | |reporter gene |DNA | |Tax |protein | +-----------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_cellular| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on the JNLPBA corpus, containing 2,404 publication abstracts, with `embeddings_clinical`. 
http://www.geniaproject.org/ ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|:--------------|-------:|------:|-----:|---------:|---------:|---------:| | 0 | B-cell_line | 377 | 203 | 123 | 0.65 | 0.754 | 0.698148 | | 1 | I-DNA | 1519 | 277 | 266 | 0.845768 | 0.85098 | 0.848366 | | 2 | I-protein | 3981 | 911 | 786 | 0.813778 | 0.835116 | 0.824309 | | 3 | B-protein | 4483 | 1433 | 579 | 0.757776 | 0.885618 | 0.816724 | | 4 | I-cell_line | 786 | 340 | 203 | 0.698046 | 0.794742 | 0.743262 | | 5 | I-RNA | 178 | 42 | 9 | 0.809091 | 0.951872 | 0.874693 | | 6 | B-RNA | 99 | 28 | 19 | 0.779528 | 0.838983 | 0.808163 | | 7 | B-cell_type | 1440 | 294 | 480 | 0.83045 | 0.75 | 0.788177 | | 8 | I-cell_type | 2431 | 377 | 559 | 0.865741 | 0.813044 | 0.838565 | | 9 | B-DNA | 814 | 267 | 240 | 0.753006 | 0.772296 | 0.762529 | | 10 | Macro-average | 16108 | 4172 | 3264 | 0.780318 | 0.824665 | 0.801879 | | 11 | Micro-average | 16108 | 4172 | 3264 | 0.79428 | 0.831509 | 0.812469 | ``` --- layout: model title: Translate Isoko to English Pipeline author: John Snow Labs name: translate_iso_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, iso, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `iso` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_iso_en_xx_2.7.0_2.4_1609690055762.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_iso_en_xx_2.7.0_2.4_1609690055762.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_iso_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_iso_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.iso.translate_to.en').predict(text, output_level='sentence') translate_df ```
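`pipeline.annotate` returns plain Python dictionaries keyed by output column. A small sketch of collecting the translated text from such results — the helper name and the sample dictionary are ours, for illustration of the shape only:

```python
# annotate() on a PretrainedPipeline returns a dict mapping each output
# column to a list of strings (or a list of such dicts when given a list
# of texts). The sample below is hand-made; the actual keys depend on the
# pipeline's configured column names.
def extract_translations(result, column="translation"):
    """Join each document's translated sentences into one string."""
    rows = result if isinstance(result, list) else [result]
    return [" ".join(row.get(column, [])) for row in rows]

sample = {
    "document": ["Your sentence to translate!"],
    "translation": ["Your translated sentence!"],
}
print(extract_translations(sample))  # ['Your translated sentence!']
```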
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_iso_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Japanese Part of Speech Tagger (from KoichiYasuoka) author: John Snow Labs name: bert_pos_bert_large_japanese_upos date: 2022-05-09 tags: [bert, pos, part_of_speech, ja, open_source] task: Part of Speech Tagging language: ja edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-japanese-upos` is a Japanese model originally trained by `KoichiYasuoka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_large_japanese_upos_ja_3.4.2_3.0_1652091962796.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_large_japanese_upos_ja_3.4.2_3.0_1652091962796.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_large_japanese_upos","ja") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Spark NLPが大好きです"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_large_japanese_upos","ja") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Spark NLPが大好きです").toDF("text") val result = pipeline.fit(data).transform(data) ```
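Collecting `token.result` and `pos.result` from the transformed DataFrame yields two aligned lists per row; pairing and tallying them is plain Python. A sketch on hand-made lists — the tags below are illustrative UPOS values for the example sentence, not guaranteed model output:

```python
# After result.select("token.result", "pos.result").collect(), each row
# holds two aligned lists: one of tokens, one of UPOS tags.
from collections import Counter

def pair_and_count(tokens, tags):
    """Zip tokens with their tags and count tag frequencies."""
    return list(zip(tokens, tags)), Counter(tags)

tokens = ["Spark", "NLP", "が", "大好き", "です"]
tags = ["PROPN", "PROPN", "ADP", "ADJ", "AUX"]  # illustrative tags
pairs, counts = pair_and_count(tokens, tags)
print(pairs[0])  # ('Spark', 'PROPN')
print(counts["PROPN"])  # 2
```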
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_large_japanese_upos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|ja| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/KoichiYasuoka/bert-large-japanese-upos - https://universaldependencies.org/u/pos/ - https://github.com/KoichiYasuoka/esupar --- layout: model title: Legal Law Area Prediction Classifier (German) author: John Snow Labs name: legclf_law_area_prediction_german date: 2023-03-29 tags: [de, licensed, classification, legal, tensorflow] task: Text Classification language: de edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a multiclass classification model which identifies law area labels (civil_law, penal_law, public_law, social_law) in German court cases. ## Predicted Entities `civil_law`, `penal_law`, `public_law`, `social_law` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_law_area_prediction_german_de_1.0.0_3.0_1680091124408.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_law_area_prediction_german_de_1.0.0_3.0_1680091124408.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_multi_cased", "xx")\ .setInputCols(["document"]) \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_law_area_prediction_german", "de", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, docClassifier ]) df = spark.createDataFrame([["Demnach erkennt das Bundesgericht: 1. Auf die Beschwerde wird nicht eingetreten. 2. Das Gesuch um unentgeltliche Rechtspflege und Verbeiständung wird abgewiesen. 3. Es werden keine Kosten erhoben. 4. Dieses Urteil wird den Parteien und dem Kantonsgericht des Kantons Schwyz, 1. Rekurskammer, schriftlich mitgeteilt. Lausanne, 21. Oktober 2008 Im Namen der II. zivilrechtlichen Abteilung des Schweizerischen Bundesgerichts Der Präsident: Der Gerichtsschreiber: Raselli von Roten"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) result.select("text", "category.result").show(truncate=100) ```
## Results ```bash +----------------------------------------------------------------------------------------------------+-----------+ | text| result| +----------------------------------------------------------------------------------------------------+-----------+ |Demnach erkennt das Bundesgericht: 1. Auf die Beschwerde wird nicht eingetreten. 2. Das Gesuch um...|[civil_law]| +----------------------------------------------------------------------------------------------------+-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_law_area_prediction_german| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|de| |Size:|22.4 MB| ## References Train dataset available [here](https://huggingface.co/datasets/rcds/legal_criticality_prediction) ## Benchmarking ```bash label precision recall f1-score support civil_law 0.94 0.94 0.94 1058 penal_law 0.99 0.93 0.96 929 public_law 0.92 0.96 0.94 965 social_law 0.99 0.99 0.99 1003 accuracy - - 0.96 3955 macro-avg 0.96 0.96 0.96 3955 weighted-avg 0.96 0.96 0.96 3955 ``` --- layout: model title: English Named Entity Recognition (from sagorsarker) author: John Snow Labs name: bert_ner_codeswitch_spaeng_ner_lince date: 2022-05-09 tags: [bert, ner, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `codeswitch-spaeng-ner-lince` is an English model originally trained by `sagorsarker`. 
## Predicted Entities `LOC`, `TIME`, `PER`, `PROD`, `TITLE`, `OTHER`, `GROUP`, `ORG`, `EVENT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_codeswitch_spaeng_ner_lince_en_3.4.2_3.0_1652097117742.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_codeswitch_spaeng_ner_lince_en_3.4.2_3.0_1652097117742.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_codeswitch_spaeng_ner_lince","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_codeswitch_spaeng_ner_lince","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_codeswitch_spaeng_ner_lince| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|665.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/sagorsarker/codeswitch-spaeng-ner-lince - https://ritual.uh.edu/lince/home - https://github.com/sagorbrur/codeswitch --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_42 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-1024-finetuned-squad-seed-42` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_42_en_4.0.0_3.0_1655730867733.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_42_en_4.0.0_3.0_1655730867733.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_42","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_42","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_1024d_seed_42").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
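The NLU one-liner joins question and context with a `|||` separator. Splitting such a combined string back into its parts can be sketched as follows — the helper is a hypothetical convenience of ours; only the separator convention comes from the NLU call:

```python
# "|||" separates question from context in the nlu predict() string above.
def split_qa_input(combined, sep="|||"):
    """Split a 'question|||context' string into its two parts."""
    question, _, context = combined.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa_input("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
print(c)  # My name is Clara and I live in Berkeley.
```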
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_42| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|446.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-1024-finetuned-squad-seed-42 --- layout: model title: English T5ForConditionalGeneration Cased model (from KES) author: John Snow Labs name: t5_caribe_capitalise date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `caribe-capitalise` is an English model originally trained by `KES`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_caribe_capitalise_en_4.3.0_3.0_1675100357729.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_caribe_capitalise_en_4.3.0_3.0_1675100357729.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_caribe_capitalise","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_caribe_capitalise","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
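For intuition about the capitalisation task, a crude rule-based baseline is sketched below: it uppercases sentence-initial letters and the standalone pronoun 'i'. The trained T5 model handles far more (e.g. proper nouns); this sketch is purely illustrative and is not the model's logic:

```python
import re

def naive_capitalise(text):
    """Uppercase the first letter of each sentence and the pronoun 'i'."""
    # Capitalize the standalone pronoun "i" (word boundaries on both sides).
    text = re.sub(r"\bi\b", "I", text)
    # Capitalize the first letter at start-of-string or after . ! ? + space.
    return re.sub(r"(^|[.!?]\s+)([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(), text)

print(naive_capitalise("i live in port of spain. it rained yesterday."))
# I live in port of spain. It rained yesterday.
```

A learned model is needed precisely because rules like these miss proper nouns ("port of spain") and context-dependent casing.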
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_caribe_capitalise| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|864.8 MB| ## References - https://huggingface.co/KES/caribe-capitalise - https://pypi.org/project/Caribe/ --- layout: model title: Translate Creoles and pidgins, French‑based to English Pipeline author: John Snow Labs name: translate_cpf_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, cpf, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `cpf` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_cpf_en_xx_2.7.0_2.4_1609689966703.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_cpf_en_xx_2.7.0_2.4_1609689966703.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_cpf_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_cpf_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.cpf.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_cpf_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Miscellaneous Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_miscellaneous_bert date: 2023-03-05 tags: [en, legal, classification, clauses, miscellaneous, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Miscellaneous` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Miscellaneous`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_miscellaneous_bert_en_1.0.0_3.0_1678046022297.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_miscellaneous_bert_en_1.0.0_3.0_1678046022297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_miscellaneous_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
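Each annotation in the `category` column typically carries per-class confidence scores in its metadata. A sketch of keeping only predictions above a threshold, run on a hand-made row that mimics a collected annotation — the exact metadata keys depend on the model, so verify against your own output:

```python
# The sample row below is a hand-made stand-in for a collected annotation;
# ClassifierDL-style output usually stores per-class confidences as strings
# in the annotation metadata, keyed by class name (check your model's output).
def confident_labels(annotations, threshold=0.8):
    """Keep (label, confidence) pairs whose confidence clears the threshold."""
    kept = []
    for ann in annotations:
        label = ann["result"]
        conf = float(ann["metadata"].get(label, 0.0))
        if conf >= threshold:
            kept.append((label, conf))
    return kept

sample = [{"result": "Miscellaneous",
           "metadata": {"Miscellaneous": "0.91", "other": "0.09"}}]
print(confident_labels(sample))  # [('Miscellaneous', 0.91)]
```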
## Results ```bash +-------+ |result| +-------+ |[Miscellaneous]| |[Other]| |[Other]| |[Miscellaneous]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_miscellaneous_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Miscellaneous 0.78 0.83 0.80 52 Other 0.88 0.84 0.86 76 accuracy - - 0.84 128 macro-avg 0.83 0.83 0.83 128 weighted-avg 0.84 0.84 0.84 128 ``` --- layout: model title: Fast Neural Machine Translation Model from Austro-Asiatic languages to English author: John Snow Labs name: opus_mt_aav_en date: 2021-06-01 tags: [open_source, seq2seq, translation, aav, en, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `aav` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_aav_en_xx_3.1.0_2.4_1622560180309.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_aav_en_xx_3.1.0_2.4_1622560180309.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_aav_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_aav_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Austro-Asiatic languages.translate_to.English').predict(text, output_level='sentence')
translate_df
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_aav_en| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Mapping ICD-10-CM Codes with Their Corresponding ICD-9-CM Codes author: John Snow Labs name: icd10_icd9_mapper date: 2022-09-30 tags: [en, licensed, icd10, icd9, chunk_mapping] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps ICD-10-CM codes to corresponding ICD-9-CM codes ## Predicted Entities `icd9_code` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10_icd9_mapper_en_4.1.0_3.0_1664526779493.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10_icd9_mapper_en_4.1.0_3.0_1664526779493.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc", "en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("icd10cm_code")\ .setDistanceFunction("EUCLIDEAN") chunkerMapper = ChunkMapperModel.pretrained("icd10_icd9_mapper", "en", "clinical/models")\ .setInputCols(["icd10cm_code"])\ .setOutputCol("mappings")\ .setRels(["icd9_code"]) pipeline = Pipeline(stages = [ documentAssembler, sbert_embedder, icd_resolver, chunkerMapper]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) light_pipeline= LightPipeline(model) result = light_pipeline.fullAnnotate("Diabetes Mellitus") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sbert_embeddings") val icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc", "en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("icd10cm_code") .setDistanceFunction("EUCLIDEAN") val chunkerMapper = ChunkMapperModel.pretrained("icd10_icd9_mapper", "en", "clinical/models") .setInputCols(Array("icd10cm_code")) .setOutputCol("mappings") .setRels(Array("icd9_code")) val pipeline = new Pipeline(stages = Array( documentAssembler, sbert_embedder, icd_resolver, chunkerMapper )) val data = Seq("Diabetes Mellitus").toDS.toDF("text") val result= pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu 
nlu.load("en.map_entity.icd10_icd9").predict("""Diabetes Mellitus""")
```
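Conceptually, a chunk mapper like this one behaves as a lookup from one code vocabulary to another: each resolved ICD-10-CM code is matched against a mapping resource and the related ICD-9-CM code is emitted under the `icd9_code` relation. A minimal pure-Python sketch of that idea (the two-entry table and the `map_code` helper below are illustrative only, not the model's actual mapping resource):

```python
# Illustrative ICD-10-CM -> ICD-9-CM lookup table (toy data, not the model's resource)
icd10_to_icd9 = {
    "Z833": "V180",   # matches the example output shown in Results
    "E119": "25000",  # hypothetical second entry
}

def map_code(icd10_code, relation="icd9_code"):
    # Mimic the shape of a ChunkMapper result: chunk, relation name, mapped code (or NONE)
    return {
        "chunk": icd10_code,
        "relation": relation,
        "mapping": icd10_to_icd9.get(icd10_code, "NONE"),
    }

print(map_code("Z833"))
```

In the real pipeline this lookup happens inside `ChunkMapperModel`, keyed by the codes produced by the sentence entity resolver upstream.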
## Results

```bash
|    | chunk             | icd10cm_code   | icd9_mapping   |
|---:|:------------------|:---------------|:---------------|
|  0 | Diabetes Mellitus | Z833           | V180           |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|icd10_icd9_mapper|
|Compatibility:|Healthcare NLP 4.1.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|580.0 KB|

---
layout: model
title: Arabic Bert Embeddings (Base, DA-CA-MSA variants)
author: John Snow Labs
name: bert_embeddings_bert_base_arabic_camelbert_mix
date: 2022-04-11
tags: [bert, embeddings, ar, open_source]
task: Embeddings
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-mix` is an Arabic model originally trained by `CAMeL-Lab`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabic_camelbert_mix_ar_3.4.2_3.0_1649678024373.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabic_camelbert_mix_ar_3.4.2_3.0_1649678024373.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabic_camelbert_mix","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabic_camelbert_mix","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.bert_base_arabic_camelbert_mix").predict("""أنا أحب شرارة NLP""") ```
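Downstream, the `embeddings` column holds one dense vector per token. A common way to compare tokens, or sentence vectors pooled from them, is cosine similarity. A small self-contained sketch with toy 4-dimensional vectors (real BERT-base vectors such as CAMeLBERT's are 768-dimensional):

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy token vectors standing in for BertEmbeddings output
vec_a = [0.2, 0.8, 0.1, 0.4]
vec_b = [0.3, 0.7, 0.2, 0.5]
print(round(cosine_similarity(vec_a, vec_b), 3))  # -> 0.977
```

The same measure is what downstream annotators like sentence entity resolvers typically rely on when ranking embedding neighbours.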
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_arabic_camelbert_mix|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ar|
|Size:|409.5 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-mix
- https://arxiv.org/abs/2103.06678
- https://github.com/CAMeL-Lab/CAMeLBERT
- https://catalog.ldc.upenn.edu/LDC2011T11
- http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus
- https://vlo.clarin.eu/search;jsessionid=31066390B2C9E8C6304845BA79869AC1?1&q=osian
- https://archive.org/details/arwiki-20190201
- https://oscar-corpus.com/
- https://zenodo.org/record/3891466#.YEX4-F0zbzc
- https://github.com/google-research/bert
- https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/tokenization.py#L286-L297
- https://github.com/CAMeL-Lab/camel_tools

---
layout: model
title: English BertForQuestionAnswering model (from victoraavila)
author: John Snow Labs
name: bert_qa_victoraavila_bert_base_uncased_finetuned_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-finetuned-squad` is an English model originally trained by `victoraavila`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_victoraavila_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654181153101.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_victoraavila_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654181153101.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_victoraavila_bert_base_uncased_finetuned_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_victoraavila_bert_base_uncased_finetuned_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.base_uncased.by_victoraavila").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
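Extractive QA models of this kind score every token as a possible answer start and answer end, and the predicted answer is the highest-scoring valid span. A rough sketch of that selection step with made-up logits (the real model does this internally over WordPiece tokens; `best_span` is a hypothetical helper, not a Spark NLP API):

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    # Pick (start, end) maximizing start_logit + end_logit, with end >= start
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["my", "name", "is", "clara", "and", "i", "live", "in", "berkeley"]
start_logits = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3]
end_logits   = [0.1, 0.1, 0.2, 4.5, 0.2, 0.1, 0.1, 0.1, 0.4]
s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # -> clara
```

The `max_answer_len` cap mirrors the usual SQuAD-style constraint that keeps the search from returning implausibly long spans.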
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_victoraavila_bert_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/victoraavila/bert-base-uncased-finetuned-squad

---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab_by_MoHai TFWav2Vec2ForCTC from MoHai
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_MoHai
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_MoHai` is an English pipeline originally trained by MoHai.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_by_MoHai_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_MoHai_en_4.2.0_3.0_1664118973557.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_MoHai_en_4.2.0_3.0_1664118973557.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_MoHai', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_MoHai", lang = "en") val annotations = pipeline.transform(audioDF) ```
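Under the hood, the Wav2Vec2ForCTC stage of this pipeline emits one character distribution per audio frame, and the transcript comes from CTC decoding: collapse consecutive repeats, then drop the blank token. A toy sketch of the greedy variant (the per-frame labels below are made up for illustration):

```python
def ctc_greedy_decode(frame_labels, blank="_"):
    # Collapse runs of identical labels, then drop CTC blank tokens
    decoded = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return "".join(decoded)

# Made-up per-frame argmax labels for a short utterance
print(ctc_greedy_decode(list("hh_e_ll_llo__")))  # -> hello
```

The blank token is what lets CTC keep genuine double letters ("ll" in "hello") apart from frame-level repeats of the same character.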
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_MoHai|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|355.0 MB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: Pipeline to Detect PHI for Deidentification purposes (Italian)
author: John Snow Labs
name: ner_deid_subentity_pipeline
date: 2023-03-13
tags: [deid, it, licensed]
task: Named Entity Recognition
language: it
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/03/25/ner_deid_subentity_it_2_4.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_pipeline_it_4.3.0_3.2_1678744372196.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_pipeline_it_4.3.0_3.2_1678744372196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_subentity_pipeline", "it", "clinical/models") text = '''Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_subentity_pipeline", "it", "clinical/models") val text = "Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015." val result = pipeline.fullAnnotate(text) ```
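The pipeline returns each PHI chunk with begin/end character offsets, which is exactly what a downstream masking step needs. A minimal pure-Python sketch of that step (the offsets are taken from the example sentence; `mask_phi` is a hypothetical helper, not part of the pipeline — Healthcare NLP ships its own deidentification annotators for production use):

```python
def mask_phi(text, chunks):
    # Replace chunks right-to-left so earlier offsets stay valid
    for begin, end, label in sorted(chunks, reverse=True):
        text = text[:begin] + "<" + label + ">" + text[end + 1:]
    return text

text = ("Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo "
        "per diabete mal controllato con sintomi risalenti a marzo 2015.")
# (begin, end, label) triples as produced by the NER converter
chunks = [(9, 29, "PATIENT"), (32, 33, "AGE"), (55, 74, "HOSPITAL"), (128, 137, "DATE")]
print(mask_phi(text, chunks))
```

Note that the `end` offsets are inclusive, matching how Spark NLP reports chunk boundaries.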
## Results

```bash
|    | ner_chunks            |   begin |   end | ner_label   | confidence   |
|---:|:----------------------|--------:|------:|:------------|:-------------|
|  0 | Gastone Montanariello |       9 |    29 | PATIENT     |              |
|  1 | 49                    |      32 |    33 | AGE         |              |
|  2 | Ospedale San Camillo  |      55 |    74 | HOSPITAL    |              |
|  3 | marzo 2015            |     128 |   137 | DATE        |              |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_deid_subentity_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|it|
|Size:|1.3 GB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel

---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab_by_terri1102 TFWav2Vec2ForCTC from terri1102
author: John Snow Labs
name: asr_wav2vec2_base_timit_demo_colab_by_terri1102
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_terri1102` is an English model originally trained by terri1102.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_by_terri1102_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_terri1102_en_4.2.0_3.0_1664107264156.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_terri1102_en_4.2.0_3.0_1664107264156.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_by_terri1102", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_by_terri1102", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
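ASR models like this one are usually evaluated with Word Error Rate (WER): the word-level edit distance between hypothesis and reference transcripts, divided by the reference length. A self-contained sketch of the standard dynamic-programming computation (the example sentences are invented, not drawn from the TIMIT test set):

```python
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Classic edit-distance table over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution out of five reference words -> WER of 0.2
print(word_error_rate("she had your dark suit", "she had you dark suit"))  # -> 0.2
```

Libraries such as `jiwer` implement the same measure with normalization options, but the core computation is exactly this table.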
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_demo_colab_by_terri1102|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|354.9 MB|

---
layout: model
title: Detect PHI (Deidentification)(clinical_medium)
author: John Snow Labs
name: ner_deid_large_emb_clinical_medium
date: 2023-04-12
tags: [ner, clinical, english, licensed, phi, deidentification, en]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Deidentification NER (Large) is a Named Entity Recognition model that annotates text to find protected health information that may need to be deidentified. The entities it annotates are Age, Contact, Date, Id, Location, Name, and Profession. This model is trained with the `embeddings_clinical_medium` word embeddings model, so be sure to use the same embeddings in the pipeline. We stuck to the official annotation guidelines (AG) for the 2014 i2b2 Deid challenge while annotating new datasets for this model.
All the details regarding the nuances of and explanations for the AG can be found [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/).

## Predicted Entities

`AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_large_emb_clinical_medium_en_4.3.2_3.0_1681322146240.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_large_emb_clinical_medium_en_4.3.2_3.0_1681322146240.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

deid_ner = MedicalNerModel.pretrained("ner_deid_large_emb_clinical_medium", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("deid_ner")

deid_ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "deid_ner"]) \
    .setOutputCol("deid_ner_chunk")

deid_ner_pipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    deid_ner,
    deid_ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

deid_ner_model = deid_ner_pipeline.fit(empty_data)

results = deid_ner_model.transform(spark.createDataFrame([["""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same."""]]).toDF("text"))
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val deid_ner_model = MedicalNerModel.pretrained("ner_deid_large_emb_clinical_medium", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("deid_ner")

val deid_ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "deid_ner"))
    .setOutputCol("deid_ner_chunk")

val deid_pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, deid_ner_model, deid_ner_converter))

val data = Seq("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.""").toDS.toDF("text")

val result = deid_pipeline.fit(data).transform(data)
```
## Results

```bash
|    | chunks          |   begin |   end | entities   |
|---:|:----------------|--------:|------:|:-----------|
|  0 | Day Hospital    |     258 |   269 | NAME       |
|  1 | Radiology       |     321 |   329 | LOCATION   |
|  2 | 02/04/2003      |     341 |   350 | DATE       |
|  3 | COURSE          |     362 |   367 | NAME       |
|  4 | Hospital        |     401 |   408 | NAME       |
|  5 | Urology surgery |     430 |   444 | NAME       |
|  6 | On              |     447 |   448 | NAME       |
|  7 | Following       |    1103 |  1111 | NAME       |
|  8 | 02/07/2003      |    1329 |  1338 | DATE       |
|  9 | Plavix daily    |    1366 |  1377 | NAME       |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_deid_large_emb_clinical_medium|
|Compatibility:|Healthcare NLP 4.3.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|2.8 MB|

## Benchmarking

```bash
       label  precision  recall  f1-score  support
     CONTACT       0.92    0.96      0.94      126
        DATE       0.99    0.99      0.99     2631
        NAME       0.97    0.98      0.97     2594
         AGE       0.96    0.94      0.95      284
    LOCATION       0.95    0.92      0.94     1511
          ID       0.93    0.95      0.94      213
  PROFESSION       0.84    0.73      0.78      160
   micro avg       0.97    0.96      0.97     7519
   macro avg       0.94    0.92      0.93     7519
weighted avg       0.97    0.96      0.96     7519
```

---
layout: model
title: Legal Adjustments Clause Binary Classifier
author: John Snow Labs
name: legclf_adjustments_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `adjustments` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.
If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial linked above).

This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `adjustments`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_adjustments_clause_en_1.0.0_3.2_1660123224069.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_adjustments_clause_en_1.0.0_3.2_1660123224069.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_adjustments_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
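The "paragraph splitting (by multiline)" technique recommended above amounts to cutting the document on blank lines before feeding each chunk to the classifier. A minimal pre-processing sketch (the clause snippets and the `split_paragraphs` helper are invented for illustration; the workshop tutorial covers more robust splitters):

```python
import re

def split_paragraphs(document):
    # Split on one or more blank lines and drop empty fragments
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

document = (
    "ADJUSTMENTS. The Exchange Ratio shall be equitably adjusted upon any stock split.\n\n"
    "NOTICES. All notices under this Agreement shall be in writing."
)
for clause in split_paragraphs(document):
    print(clause)
```

Each resulting paragraph can then be passed through the pipeline above as its own `clause_text` row, keeping every input comfortably under the 512-token embedding limit.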
## Results

```bash
+-------------+
|result       |
+-------------+
|[adjustments]|
|[other]      |
|[other]      |
|[adjustments]|
+-------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_adjustments_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.8 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
 adjustments       0.92    0.89      0.91       27
       other       0.96    0.97      0.96       69
    accuracy          -       -      0.95       96
   macro-avg       0.94    0.93      0.93       96
weighted-avg       0.95    0.95      0.95       96
```

---
layout: model
title: Classify Earning Calls, Broker Reports and 10K
author: John Snow Labs
name: finclf_earning_broker_10k
date: 2022-11-24
tags: [10k, earning, calls, broker, reports, en, licensed]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: FinanceClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is a Text Classification model, which can help you identify whether a document is an `Earning Call`, a `Broker Report`, a `10K filing`, or something else.

## Predicted Entities

`earning_call`, `broker_report`, `10k`, `other`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/FINCLF_EARNING_BROKER_10K/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_earning_broker_10k_en_1.0.0_3.0_1669296495349.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_earning_broker_10k_en_1.0.0_3.0_1669296495349.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = finance.ClassifierDLModel.pretrained("finclf_earning_broker_10k", "en", "finance/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("label")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    embeddings,
    docClassifier])

text = """Varun Beverages Investors are advised to refer through important disclosures made at the last page of the Research Report. Motilal Oswal research is available on www.motilaloswal.com/Institutional -Equities, Bloomberg, Thomson Reuters, Factset and S&P Capital. Research Analyst: Sumant Kumar (Sumant.Kumar@MotilalOswal.com) Research Analyst: Meet Jain (Meet.Jain@ Motilal Oswal.com) / Omkar Shintre (Omkar.Shintre @Motilal Oswal.com)"""

sdf = spark.createDataFrame([[text]]).toDF("text")
fit = nlpPipeline.fit(sdf)
res = fit.transform(sdf)
res = res.select('label.result')
```
## Results ```bash [broker_report] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_earning_broker_10k| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[label]| |Language:|en| |Size:|22.8 MB| ## References - Scraped broker reports, earning calls, and 10K filings from the internet - Other financial documents ## Benchmarking ```bash label precision recall f1-score support 10k 1.00 1.00 1.00 17 broker_report 1.00 1.00 1.00 18 earning_call 1.00 1.00 1.00 19 other 1.00 1.00 1.00 98 accuracy - - 1.00 152 macro-avg 1.00 1.00 1.00 152 weighted-avg 1.00 1.00 1.00 152 ``` --- layout: model title: Legal Security interest Clause Binary Classifier author: John Snow Labs name: legclf_security_interest_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `security-interest` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `security-interest` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_security_interest_clause_en_1.0.0_3.2_1660122991220.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_security_interest_clause_en_1.0.0_3.2_1660122991220.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_security_interest_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
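The description above recommends feeding clause-sized chunks rather than whole documents, since the embeddings allow up to 512 tokens. A minimal, framework-free sketch of the paragraph-splitting idea (a hypothetical helper, not part of Spark NLP; splits on blank lines, then enforces a rough word budget):

```python
import re

def split_paragraphs(text: str, max_words: int = 400):
    """Split a document on blank lines, then break any paragraph that is
    still too long into chunks of at most `max_words` words, so each piece
    stays comfortably under the ~512-token embedding limit."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    for p in paragraphs:
        words = p.split()
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks

doc = "Clause one text.\n\nClause two text.\n\nClause three text."
print(split_paragraphs(doc))  # ['Clause one text.', 'Clause two text.', 'Clause three text.']
```

Each returned chunk can then be put in the `clause_text` column of the DataFrame shown above, one row per chunk.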
## Results ```bash +-------+ | result| +-------+ |[security-interest]| |[other]| |[other]| |[security-interest]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_security_interest_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support grant-of-security-interest 1.00 0.91 0.95 34 other 0.97 1.00 0.98 85 accuracy - - 0.97 119 macro-avg 0.98 0.96 0.97 119 weighted-avg 0.98 0.97 0.97 119 ``` --- layout: model title: Pipeline to Detect PHI for Deidentification author: John Snow Labs name: bert_token_classifier_ner_deid_pipeline date: 2022-03-21 tags: [licensed, ner, deid, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_deid](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_deid_en.html) model. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_deid_pipeline_en_3.4.1_3.0_1647863928244.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_deid_pipeline_en_3.4.1_3.0_1647863928244.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_deid_pipeline", "en", "clinical/models") pipeline.annotate("A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.") ``` ```scala val pipeline = new PretrainedPipeline("bert_token_classifier_ner_deid_pipeline", "en", "clinical/models") pipeline.annotate("A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_deid.pipeline").predict("""A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""") ```
## Results ```bash +-----------------------------+-------------+ |chunk |ner_label | +-----------------------------+-------------+ |2093-01-13 |DATE | |David Hale |DOCTOR | |Hendrickson, Ora |PATIENT | |7194334 |MEDICALRECORD| |Oliveira |PATIENT | |Cocke County Baptist Hospital|HOSPITAL | |0295 Keats Street |STREET | |302) 786-5227 |PHONE | |Brothers Coal-Mine |ORGANIZATION | +-----------------------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_deid_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.8 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverter --- layout: model title: Translate Swazi to English Pipeline author: John Snow Labs name: translate_ss_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ss, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `ss` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ss_en_xx_2.7.0_2.4_1609688619411.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ss_en_xx_2.7.0_2.4_1609688619411.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ss_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ss_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ss.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ss_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Hindi Bert Embeddings author: John Snow Labs name: bert_embeddings_indic_transformers_hi_bert date: 2022-04-11 tags: [bert, embeddings, hi, open_source] task: Embeddings language: hi edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `indic-transformers-hi-bert` is a Hindi model originally trained by `neuralspace-reverie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_hi_bert_hi_3.4.2_3.0_1649673091559.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_hi_bert_hi_3.4.2_3.0_1649673091559.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_indic_transformers_hi_bert","hi") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["मुझे स्पार्क एनएलपी पसंद है"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_indic_transformers_hi_bert","hi") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("मुझे स्पार्क एनएलपी पसंद है").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("hi.embed.indic_transformers_hi_bert").predict("""मुझे स्पार्क एनएलपी पसंद है""") ```
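Once the per-token vectors from the `embeddings` column are collected into plain Python lists (how you extract them from the result DataFrame is up to you), they can be compared with cosine similarity. A minimal sketch with toy low-dimensional vectors standing in for the real 768-dimensional BERT embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy stand-ins for rows of the `embeddings` column
print(round(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]), 3))  # identical vectors -> 1.0
print(round(cosine([1.0, 0.0], [0.0, 1.0]), 3))            # orthogonal vectors -> 0.0
```

Nearby scores for related tokens and near-zero scores for unrelated ones are the usual sanity check that the embeddings loaded correctly.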
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_indic_transformers_hi_bert| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|hi| |Size:|612.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-hi-bert - https://oscar-corpus.com/ --- layout: model title: Detect PHI for deidentification purposes author: John Snow Labs name: ner_deid_subentity_augmented_i2b2 date: 2021-11-29 tags: [deid, ner, phi, deidentification, licensed, i2b2, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.2 spark_version: 2.4 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition model that finds Protected Health Information (PHI) for deidentification purposes. This NER model is trained with a reviewed version of the re-augmented 2014 i2b2 Deid dataset, and detects up to 23 entity types. We stuck to the official annotation guidelines (AG) for the 2014 i2b2 Deid challenge while annotating new datasets for this model. 
All the details regarding the nuances and explanations for AG can be found here [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/) ## Predicted Entities `MEDICALRECORD`, `ORGANIZATION`, `DOCTOR`, `USERNAME`, `PROFESSION`, `HEALTHPLAN`, `URL`, `CITY`, `DATE`, `LOCATION-OTHER`, `STATE`, `PATIENT`, `DEVICE`, `COUNTRY`, `ZIP`, `PHONE`, `HOSPITAL`, `EMAIL`, `IDNUM`, `SREET`, `BIOID`, `FAX`, `AGE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_augmented_i2b2_en_3.3.2_2.4_1638185564971.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_augmented_i2b2_en_3.3.2_2.4_1638185564971.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") deid_ner = MedicalNerModel.pretrained("ner_deid_subentity_augmented_i2b2", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk_subentity") nlpPipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, word_embeddings, deid_ner, ner_converter]) data = spark.createDataFrame([["""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 years old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. 
Patient's complaints first surfaced when he started working for Brothers Coal-Mine."""]]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val deid_ner = MedicalNerModel.pretrained("ner_deid_subentity_augmented_i2b2", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk_subentity") val nlpPipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, deid_ner, ner_converter)) val data = Seq("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 years old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""").toDS.toDF("text") val result = nlpPipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.deid_subentity_augmented_i2b2").predict("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 years old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""") ```
## Results ```bash +-----------------------------+-------------+ |chunk |ner_label | +-----------------------------+-------------+ |2093-01-13 |DATE | |David Hale |DOCTOR | |Hendrickson, Ora |PATIENT | |7194334 |MEDICALRECORD| |01/13/93 |DATE | |Oliveira |DOCTOR | |25 |AGE | |1-11-2000 |DATE | |Cocke County Baptist Hospital|HOSPITAL | |0295 Keats Street |STREET | |(302) 786-5227 |PHONE | |Brothers Coal-Mine Corp |ORGANIZATION | +-----------------------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity_augmented_i2b2| |Compatibility:|Healthcare NLP 3.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source In-house annotations based on `2014 i2b2 Deid dataset`. ## Benchmarking (on official test set from 2014 i2b2 Deid Data-set) ```bash label precision recall f1-score support AGE 0.96 0.96 0.96 764 CITY 0.83 0.84 0.84 260 COUNTRY 0.79 0.85 0.82 117 DATE 0.97 0.97 0.97 4980 DEVICE 0.88 0.88 0.88 8 DOCTOR 0.94 0.88 0.91 1912 HOSPITAL 0.91 0.83 0.87 875 IDNUM 0.84 0.85 0.84 195 LOCATION-OTHER 0.86 0.46 0.60 13 MEDICALRECORD 0.98 0.95 0.96 422 ORGANIZATION 0.83 0.59 0.69 82 PATIENT 0.93 0.93 0.93 879 PHONE 0.93 0.91 0.92 215 PROFESSION 0.84 0.75 0.79 179 STATE 0.95 0.86 0.90 190 STREET 0.96 0.97 0.97 136 USERNAME 1.00 0.96 0.98 92 ZIP 0.98 0.99 0.98 140 micro-avg 0.95 0.92 0.94 11459 macro-avg 0.86 0.81 0.83 11459 weighted-avg 0.95 0.92 0.93 11459 ``` `FAX` and `EMAIL` have been removed from the official i2b2 test set since there is not enough data to train on in the official i2b2 train set. 
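The `weighted-avg` row in the benchmark above is the support-weighted mean of the per-label scores. It can be verified directly from the table:

```python
# (f1, support) pairs taken from the benchmark table above
rows = [(0.96, 764), (0.84, 260), (0.82, 117), (0.97, 4980), (0.88, 8),
        (0.91, 1912), (0.87, 875), (0.84, 195), (0.60, 13), (0.96, 422),
        (0.69, 82), (0.93, 879), (0.92, 215), (0.79, 179), (0.90, 190),
        (0.97, 136), (0.98, 92), (0.98, 140)]

total = sum(s for _, s in rows)
weighted_f1 = sum(f * s for f, s in rows) / total
print(total, round(weighted_f1, 2))  # 11459 0.93
```

This matches the reported support (11459) and weighted-avg f1 (0.93); the macro-avg, by contrast, averages the labels without weighting, which is why the rare `LOCATION-OTHER` and `ORGANIZATION` scores pull it down to 0.83.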
--- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_testimonial TFWav2Vec2ForCTC from testimonial author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_testimonial date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_testimonial` is an English model originally trained by testimonial. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_by_testimonial_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_testimonial_en_4.2.0_3.0_1664107745407.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_testimonial_en_4.2.0_3.0_1664107745407.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_testimonial', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_testimonial", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_testimonial| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|354.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Abkhazian asr_xls_r_eng TFWav2Vec2ForCTC from mattchurgin author: John Snow Labs name: pipeline_asr_xls_r_eng date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_eng` is an Abkhazian model originally trained by mattchurgin. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_xls_r_eng_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_eng_ab_4.2.0_3.0_1664019642325.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_eng_ab_4.2.0_3.0_1664019642325.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_xls_r_eng', lang = 'ab') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_xls_r_eng", lang = "ab") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_xls_r_eng| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|ab| |Size:|1.4 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Abkhazian asr_xls_r_ab_test_by_hf_test TFWav2Vec2ForCTC from hf-test author: John Snow Labs name: asr_xls_r_ab_test_by_hf_test date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_ab_test_by_hf_test` is an Abkhazian model originally trained by hf-test. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_xls_r_ab_test_by_hf_test_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xls_r_ab_test_by_hf_test_ab_4.2.0_3.0_1664021515457.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xls_r_ab_test_by_hf_test_ab_4.2.0_3.0_1664021515457.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_xls_r_ab_test_by_hf_test", "ab")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_xls_r_ab_test_by_hf_test", "ab") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
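The snippets above assume an existing `audioDf` with an `audio_content` column of audio samples as floats. A standard-library-only sketch of producing such floats from a mono 16-bit PCM WAV file (the Spark step is shown in a comment; file name and sample rate are illustrative — Wav2Vec2 models typically expect 16 kHz audio):

```python
import struct
import wave

def wav_to_floats(path: str):
    """Read a mono 16-bit PCM WAV file and scale samples to [-1.0, 1.0]."""
    with wave.open(path, "rb") as w:
        assert w.getnchannels() == 1 and w.getsampwidth() == 2
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# floats = wav_to_floats("speech.wav")  # hypothetical 16 kHz mono file
# audioDf = spark.createDataFrame([[floats]]).toDF("audio_content")
```

Stereo or differently encoded input would need resampling/conversion first (e.g. with ffmpeg or librosa) before being handed to the pipeline.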
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_xls_r_ab_test_by_hf_test| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ab| |Size:|446.6 KB| --- layout: model title: Legal License Agreement Document Binary Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_license_agreement_bert date: 2022-12-18 tags: [en, legal, classification, licensed, document, bert, license, agreement, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_license_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `license-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `license-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_license_agreement_bert_en_1.0.0_3.0_1671393841702.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_license_agreement_bert_en_1.0.0_3.0_1671393841702.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_license_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[license-agreement]| |[other]| |[other]| |[license-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_license_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support license-agreement 0.95 0.97 0.96 103 other 0.99 0.98 0.98 204 accuracy - - 0.97 307 macro-avg 0.97 0.97 0.97 307 weighted-avg 0.97 0.97 0.97 307 ``` --- layout: model title: Legal Operating Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_operating_agreement date: 2022-11-24 tags: [en, legal, classification, agreement, operating, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_operating_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `operating-agreement` or not (Binary Classification). Longformer has a limit of 4096 tokens, so only the first 4096 tokens will be taken into account. We have found that for the vast majority of documents in legal corpora, if they are clean and contain only the legal document without any extra leading information, 4096 tokens are enough to perform Document Classification. If not, let us know and we can take another approach for you: splitting the document into 4096-token chunks, averaging their embeddings, and training on the averaged version, which means the whole document is taken into account. In theory, though, this should not be required. 
## Predicted Entities `operating-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_operating_agreement_en_1.0.0_3.0_1669293670385.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_operating_agreement_en_1.0.0_3.0_1669293670385.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_operating_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
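The chunk-and-average fallback mentioned in the description (for documents longer than 4096 tokens) can be sketched in plain Python; this is illustrative only, with toy per-token vectors — in practice the token embeddings would come from the Longformer stage:

```python
def chunk_and_average(token_embeddings, chunk_size=4096):
    """Average per-token embeddings within fixed-size chunks, then average
    the chunk means, so tokens beyond the first chunk still contribute to
    the final document vector."""
    dim = len(token_embeddings[0])
    chunk_means = []
    for i in range(0, len(token_embeddings), chunk_size):
        chunk = token_embeddings[i:i + chunk_size]
        chunk_means.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return [sum(m[d] for m in chunk_means) / len(chunk_means) for d in range(dim)]

# two toy "tokens" with 2-dim embeddings; chunk_size=1 makes each token a chunk
print(chunk_and_average([[0.0, 2.0], [2.0, 0.0]], chunk_size=1))  # [1.0, 1.0]
```

The resulting document vector could then replace the `sentence_embeddings` input of the classifier; as the description notes, this is rarely needed for clean legal documents.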
## Results ```bash +-------+ |result| +-------+ |[operating-agreement]| |[other]| |[other]| |[operating-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_operating_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support operating-agreement 0.94 0.88 0.91 33 other 0.96 0.98 0.97 90 accuracy - - 0.95 123 macro-avg 0.95 0.93 0.94 123 weighted-avg 0.95 0.95 0.95 123 ``` --- layout: model title: Legal Deidentification Pipeline author: John Snow Labs name: legpipe_deid date: 2023-02-22 tags: [deid, deidentification, anonymization, en, licensed] task: [De-identification, Pipeline Legal] language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true recommended: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Pretrained Pipeline aimed to deidentify legal and financial documents to be compliant with data privacy regulations such as GDPR and CCPA. Since the models used in this pipeline are statistical, make sure you use this pipeline in a human-in-the-loop process to guarantee 100% accuracy. 
You can carry out both masking and obfuscation with this pipeline, on the following entities: `ALIAS`, `EMAIL`, `PHONE`, `PROFESSION`, `ORG`, `DATE`, `PERSON`, `ADDRESS`, `STREET`, `CITY`, `STATE`, `ZIP`, `COUNTRY` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/DEID_LEGAL/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legpipe_deid_en_1.0.0_3.0_1677077428423.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legpipe_deid_en_1.0.0_3.0_1677077428423.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("legpipe_deid", "en", "legal/models") sample_2 = """Pizza Fusion Holdings, Inc. Franchise Agreement This Franchise Agreement (the "Agreement") is entered into as of the Agreement Date shown on the cover page between Pizza Fusion Holding, Inc., a Florida corporation, and the individual or legal entity identified on the cover page. Source: PF HOSPITALITY GROUP INC., 9/23/2015 1. RIGHTS GRANTED 1.1. Grant of Franchise. 1.1.1 We grant you the right, and you accept the obligation, to use the Proprietary Marks and the System to operate one Restaurant (the "Franchised Business") at the Premises, in accordance with the terms of this Agreement. Source: PF HOSPITALITY GROUP INC., 9/23/2015 1.3. Our Limitations and Our Reserved Rights. The rights granted to you under this Agreement are not exclusive.sed Business. Source: PF HOSPITALITY GROUP INC., 9/23/2015 """ result = deid_pipeline.annotate(sample_2) print("\nMasked with entity labels") print("-"*30) print("\n".join(result['deidentified'])) print("\nMasked with chars") print("-"*30) print("\n".join(result['masked_with_chars'])) print("\nMasked with fixed length chars") print("-"*30) print("\n".join(result['masked_fixed_length_chars'])) print("\nObfuscated") print("-"*30) print("\n".join(result['obfuscated'])) ```
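The `annotate` call above returns a dictionary keyed by output column, which is what the `print` statements index into. A small helper for picking one de-identification policy out of such a result (the sample dictionary below is a toy stand-in for the pipeline's real output, not actual model output):

```python
# Select one de-identification policy from an annotate()-style result.
# The sample dict mimics the output columns printed above; the real
# pipeline fills these lists with the processed sentences.

def select_policy(result, policy):
    """Join the lines of one de-identification output into a single string."""
    if policy not in result:
        raise KeyError(f"unknown policy: {policy}")
    return "\n".join(result[policy])

sample_result = {
    "deidentified": ["This Agreement is entered into ..."],
    "masked_with_chars": ["This [*********] is entered into ..."],
    "masked_fixed_length_chars": ["This **** is entered into ..."],
    "obfuscated": ["This Estate Document is entered into ..."],
}
print(select_policy(sample_result, "masked_fixed_length_chars"))
```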
## Results ```bash Masked with entity labels ------------------------------ . This (the ) is entered into as of the Agreement Date shown on the cover page between a Florida corporation, and the individual or legal entity identified on the cover page. Source: ., 1. 1.1. . 1.1.1 We grant you the right, and you accept the obligation, to use the and the System to operate one Restaurant (the ) at the Premises, in accordance with the terms of this Agreement. Source: ., 1.3. Our and . The rights granted to you under this Agreement are not exclusive.sed Business. Source: ., Masked with chars ------------------------------ [************************]. [*****************] This [*****************] (the [*********]) is entered into as of the Agreement Date shown on the cover page between [*************************] a Florida corporation, and the individual or legal entity identified on the cover page. Source: [**********************]., [*******] 1. [************] 1.1. [****************]. 1.1.1 We grant you the right, and you accept the obligation, to use the [***************] and the System to operate one Restaurant (the [*******************]) at the Premises, in accordance with the terms of this Agreement. Source: [**********************]., [*******] 1.3. Our [*********] and [*****************]. The rights granted to you under this Agreement are not exclusive.sed Business. Source: [**********************]., [*******] Masked with fixed length chars ------------------------------ ****. **** This **** (the ****) is entered into as of the Agreement Date shown on the cover page between **** a Florida corporation, and the individual or legal entity identified on the cover page. Source: ****., **** 1. **** 1.1. ****. 1.1.1 We grant you the right, and you accept the obligation, to use the **** and the System to operate one Restaurant (the ****) at the Premises, in accordance with the terms of this Agreement. Source: ****., **** 1.3. Our **** and ****. 
The rights granted to you under this Agreement are not exclusive.sed Business. Source: ****., **** Obfuscated ------------------------------ SESA CO.. Estate Document This Estate Document (the (the "Contract")) is entered into as of the Agreement Date shown on the cover page between Clarus llc. a Florida corporation, and the individual or legal entity identified on the cover page. Source: SESA CO.., 11/7/2016 1. SESA CO. 1.1. Clarus llc.. 1.1.1 We grant you the right, and you accept the obligation, to use the John Snow Labs Inc and the System to operate one Restaurant (the (the" Agreement")) at the Premises, in accordance with the terms of this Agreement. Source: SESA CO.., 11/7/2016 1.3. Our MGT Trust Company, LLC. and John Snow Labs Inc. The rights granted to you under this Agreement are not exclusive.sed Business. Source: SESA CO.., 11/7/2016 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legpipe_deid| |Type:|pipeline| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|965.8 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - RoBertaEmbeddings - LegalNerModel - NerConverterInternalModel - LegalNerModel - NerConverter - ZeroShotNerModel - NerConverterInternalModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel --- layout: model title: French CamemBert Embeddings (from osanseviero) author: John Snow Labs name: camembert_embeddings_osanseviero_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide 
scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `osanseviero`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_osanseviero_generic_model_fr_3.4.4_3.0_1653989868525.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_osanseviero_generic_model_fr_3.4.4_3.0_1653989868525.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_osanseviero_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_osanseviero_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_osanseviero_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/osanseviero/dummy-model --- layout: model title: Pipeline to Detect Medical Risk Factors author: John Snow Labs name: ner_risk_factors_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, risk_factor, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_risk_factors](https://nlp.johnsnowlabs.com/2021/03/31/ner_risk_factors_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_RISK_FACTORS/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_RISK_FACTORS.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_risk_factors_pipeline_en_3.4.1_3.0_1647871784709.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_risk_factors_pipeline_en_3.4.1_3.0_1647871784709.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_risk_factors_pipeline", "en", "clinical/models") pipeline.fullAnnotate("HISTORY OF PRESENT ILLNESS: The patient is a 40-year-old white male who presents with a chief complaint of 'chest pain'. The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. The severity of the pain has progressively increased. He describes the pain as a sharp and heavy pain which radiates to his neck & left arm. He ranks the pain a 7 on a scale of 1-10. He admits some shortness of breath & diaphoresis. He states that he has had nausea & 3 episodes of vomiting tonight. He denies any fever or chills. He admits prior episodes of similar pain prior to his PTCA in 1995. He states the pain is somewhat worse with walking and seems to be relieved with rest. There is no change in pain with positioning. He states that he took 3 nitroglycerin tablets sublingually over the past 1 hour, which he states has partially relieved his pain. The patient ranks his present pain a 4 on a scale of 1-10. The most recent episode of pain has lasted one-hour. The patient denies any history of recent surgery, head trauma, recent stroke, abnormal bleeding such as blood in urine or stool or nosebleed.REVIEW OF SYSTEMS: All other systems reviewed & are negative.PAST MEDICAL HISTORY: Diabetes mellitus type II, hypertension, coronary artery disease, atrial fibrillation, status post PTCA in 1995 by Dr. ABC.SOCIAL HISTORY: Denies alcohol or drugs. Smokes 2 packs of cigarettes per day. Works as a Bank Manager. 
FAMILY HISTORY: Positive for coronary artery disease (father & brother).") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_risk_factors_pipeline", "en", "clinical/models") pipeline.fullAnnotate("HISTORY OF PRESENT ILLNESS: The patient is a 40-year-old white male who presents with a chief complaint of 'chest pain'. The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. The severity of the pain has progressively increased. He describes the pain as a sharp and heavy pain which radiates to his neck & left arm. He ranks the pain a 7 on a scale of 1-10. He admits some shortness of breath & diaphoresis. He states that he has had nausea & 3 episodes of vomiting tonight. He denies any fever or chills. He admits prior episodes of similar pain prior to his PTCA in 1995. He states the pain is somewhat worse with walking and seems to be relieved with rest. There is no change in pain with positioning. He states that he took 3 nitroglycerin tablets sublingually over the past 1 hour, which he states has partially relieved his pain. The patient ranks his present pain a 4 on a scale of 1-10. The most recent episode of pain has lasted one-hour. The patient denies any history of recent surgery, head trauma, recent stroke, abnormal bleeding such as blood in urine or stool or nosebleed.REVIEW OF SYSTEMS: All other systems reviewed & are negative.PAST MEDICAL HISTORY: Diabetes mellitus type II, hypertension, coronary artery disease, atrial fibrillation, status post PTCA in 1995 by Dr. ABC.SOCIAL HISTORY: Denies alcohol or drugs. Smokes 2 packs of cigarettes per day. Works as a Bank Manager. 
FAMILY HISTORY: Positive for coronary artery disease (father & brother).") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.risk_factors.pipeline").predict("""HISTORY OF PRESENT ILLNESS: The patient is a 40-year-old white male who presents with a chief complaint of 'chest pain'. The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. The severity of the pain has progressively increased. He describes the pain as a sharp and heavy pain which radiates to his neck & left arm. He ranks the pain a 7 on a scale of 1-10. He admits some shortness of breath & diaphoresis. He states that he has had nausea & 3 episodes of vomiting tonight. He denies any fever or chills. He admits prior episodes of similar pain prior to his PTCA in 1995. He states the pain is somewhat worse with walking and seems to be relieved with rest. There is no change in pain with positioning. He states that he took 3 nitroglycerin tablets sublingually over the past 1 hour, which he states has partially relieved his pain. The patient ranks his present pain a 4 on a scale of 1-10. The most recent episode of pain has lasted one-hour. The patient denies any history of recent surgery, head trauma, recent stroke, abnormal bleeding such as blood in urine or stool or nosebleed.REVIEW OF SYSTEMS: All other systems reviewed & are negative.PAST MEDICAL HISTORY: Diabetes mellitus type II, hypertension, coronary artery disease, atrial fibrillation, status post PTCA in 1995 by Dr. ABC.SOCIAL HISTORY: Denies alcohol or drugs. Smokes 2 packs of cigarettes per day. Works as a Bank Manager. FAMILY HISTORY: Positive for coronary artery disease (father & brother).""") ```
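Each annotation returned by `fullAnnotate` pairs a recognized chunk with its NER label. A plain-Python sketch of collecting those pairs into table rows (the two lists are hand-written stand-ins for the annotations the pipeline would return for the clinical note above, not actual model output):

```python
# Pair recognized chunks with their NER labels, as in a results table.
# The lists below are toy stand-ins for fullAnnotate() output.

chunks = ["diabetic", "coronary artery disease", "nitroglycerin"]
labels = ["DIABETES", "CAD", "MEDICATION"]

rows = list(zip(chunks, labels))
for chunk, label in rows:
    print(f"|{chunk:<24}|{label:<12}|")
```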
## Results ```bash +--------------------------------+------------+ |chunk |ner_label | +--------------------------------+------------+ |diabetic |DIABETES | |coronary artery disease |CAD | |nitroglycerin |MEDICATION | |Diabetes mellitus type II |DIABETES | |hypertension |HYPERTENSION| |coronary artery disease |CAD | |1995 |PHI | |ABC |PHI | |Smokes 2 packs of cigarettes per|SMOKER | |Bank Manager |PHI | +--------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_risk_factors_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English asr_hausa_4_wa2vec_data_aug_xls_r_300m TFWav2Vec2ForCTC from Tiamz author: John Snow Labs name: asr_hausa_4_wa2vec_data_aug_xls_r_300m date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_hausa_4_wa2vec_data_aug_xls_r_300m` is an English model originally trained by Tiamz. 
NOTE: This model only works on a CPU; if you need to run it on a GPU device, please use asr_hausa_4_wa2vec_data_aug_xls_r_300m_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_hausa_4_wa2vec_data_aug_xls_r_300m_en_4.2.0_3.0_1664108024097.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_hausa_4_wa2vec_data_aug_xls_r_300m_en_4.2.0_3.0_1664108024097.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_hausa_4_wa2vec_data_aug_xls_r_300m", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_hausa_4_wa2vec_data_aug_xls_r_300m", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_hausa_4_wa2vec_data_aug_xls_r_300m| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Sentence Embeddings - sbert tiny (tuned) author: John Snow Labs name: sbert_jsl_tiny_umls_uncased date: 2021-05-14 tags: [embeddings, clinical, licensed, en] task: Embeddings language: en nav_key: models edition: Healthcare NLP 3.0.3 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained to generate contextual sentence embeddings of input sentences. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_tiny_umls_uncased_en_3.0.3_2.4_1621017145970.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_tiny_umls_uncased_en_3.0.3_2.4_1621017145970.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sbiobert_embeddings = BertSentenceEmbeddings\ .pretrained("sbert_jsl_tiny_umls_uncased","en","clinical/models")\ .setInputCols(["sentence"])\ .setOutputCol("sbert_embeddings") ``` ```scala val sbiobert_embeddings = BertSentenceEmbeddings .pretrained("sbert_jsl_tiny_umls_uncased","en","clinical/models") .setInputCols(Array("sentence")) .setOutputCol("sbert_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed_sentence.bert.jsl_tiny_umls_uncased").predict("""Put your text here.""") ```
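Downstream, the sentence vectors this annotator produces are typically compared with cosine similarity (for example, to find the closest UMLS concept, matching the STS(cos) benchmark below). A minimal, framework-free sketch on toy 3-dimensional vectors (real vectors from this model are far larger):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for two sentence embeddings.
v1 = [0.2, 0.1, 0.9]
v2 = [0.2, 0.1, 0.9]
v3 = [-0.9, 0.1, 0.2]
print(cosine_similarity(v1, v2))  # ~1.0 (same direction)
print(cosine_similarity(v1, v3))  # much lower (different direction)
```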
## Results ```bash Gives a 768-dimensional vector representation of the sentence. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbert_jsl_tiny_umls_uncased| |Compatibility:|Healthcare NLP 3.0.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Case sensitive:|false| ## Data Source Tuned on the MedNLI and UMLS datasets ## Benchmarking ```bash MedNLI Score Acc 0.616 STS(cos) 0.632 ``` --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_ff6000 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-ff6000` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_ff6000_en_4.3.0_3.0_1675121052795.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_ff6000_en_4.3.0_3.0_1675121052795.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_ff6000","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_ff6000","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_ff6000| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|244.0 MB| ## References - https://huggingface.co/google/t5-efficient-small-ff6000 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Stopwords Remover for Macedonian language (811 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, mk, open_source] task: Stop Words Removal language: mk edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_mk_3.4.1_3.0_1646673098507.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_mk_3.4.1_3.0_1646673098507.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","mk") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Вие не сте подобри од мене"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stopWords = StopWordsCleaner.pretrained("stopwords_iso","mk") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stopWords)) val data = Seq("Вие не сте подобри од мене").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("mk.stopwords").predict("""Вие не сте подобри од мене""") ```
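Conceptually, `StopWordsCleaner` just filters tokens against the language's stopword list. A toy illustration in plain Python (the five-word stopword set below is a hand-picked assumption for the example, not the model's full 811-entry stopwords-iso list):

```python
# Toy stopword filtering -- what StopWordsCleaner does conceptually.
# This tiny Macedonian stopword set is an assumption for illustration;
# the pretrained model carries the full 811-entry stopwords-iso list.

stopwords = {"вие", "не", "сте", "од", "мене"}

def clean_tokens(tokens, stopwords):
    """Keep only the tokens that are not stopwords (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "Вие не сте подобри од мене".split()
print(clean_tokens(tokens, stopwords))  # ['подобри']
```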
## Results ```bash +---------+ |result | +---------+ |[подобри]| +---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|mk| |Size:|4.4 KB| --- layout: model title: ICDO Entity Resolver author: John Snow Labs name: chunkresolve_icdo_clinical date: 2021-04-02 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Entity Resolution model based on KNN using Word Embeddings + Word Movers Distance to map medical entities to ICD-O codes. Given an oncological entity found in the text (via NER models like ner_jsl), it returns top terms and resolutions along with the corresponding `Morphology` codes, comprising `Histology` and `Behavior` codes. ## Predicted Entities ICD-O codes and their normalized definitions with `clinical_embeddings`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICDO/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icdo_clinical_en_3.0.0_3.0_1617344918016.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icdo_clinical_en_3.0.0_3.0_1617344918016.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... model = ChunkEntityResolverModel.pretrained("chunkresolve_icdo_clinical","en","clinical/models") .setInputCols("token","chunk_embeddings") .setOutputCol("entity") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, clinical_ner_model, clinical_ner_chunker, chunk_embeddings, model]) data = ["""DIAGNOSIS: Left breast adenocarcinoma stage T3 N1b M0, stage IIIA. She has been found more recently to have stage IV disease with metastatic deposits and recurrence involving the chest wall and lower left neck lymph nodes. PHYSICAL EXAMINATION NECK: On physical examination palpable lymphadenopathy is present in the left lower neck and supraclavicular area. No other cervical lymphadenopathy or supraclavicular lymphadenopathy is present. RESPIRATORY: Good air entry bilaterally. Examination of the chest wall reveals a small lesion where the chest wall recurrence was resected. No lumps, bumps or evidence of disease involving the right breast is present. ABDOMEN: Normal bowel sounds, no hepatomegaly. No tenderness on deep palpation. She has just started her last cycle of chemotherapy today, and she wishes to visit her daughter in Brooklyn, New York. After this she will return in approximately 3 to 4 weeks and begin her radiotherapy treatment at that time."""] pipeline_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) light_pipeline = LightPipeline(pipeline_model) result = light_pipeline.annotate(data) ``` ```scala ... val model = ChunkEntityResolverModel.pretrained("chunkresolve_icdo_clinical","en","clinical/models") .setInputCols("token","chunk_embeddings") .setOutputCol("entity") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, clinical_ner_model, clinical_ner_chunker, chunk_embeddings, model)) val data = Seq("DIAGNOSIS: Left breast adenocarcinoma stage T3 N1b M0, stage IIIA. 
She has been found more recently to have stage IV disease with metastatic deposits and recurrence involving the chest wall and lower left neck lymph nodes. PHYSICAL EXAMINATION NECK: On physical examination palpable lymphadenopathy is present in the left lower neck and supraclavicular area. No other cervical lymphadenopathy or supraclavicular lymphadenopathy is present. RESPIRATORY: Good air entry bilaterally. Examination of the chest wall reveals a small lesion where the chest wall recurrence was resected. No lumps, bumps or evidence of disease involving the right breast is present. ABDOMEN: Normal bowel sounds, no hepatomegaly. No tenderness on deep palpation. She has just started her last cycle of chemotherapy today, and she wishes to visit her daughter in Brooklyn, New York. After this she will return in approximately 3 to 4 weeks and begin her radiotherapy treatment at that time.").toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | | chunk | begin | end | entity | icdo_description | icdo_code | |---|----------------------------|-------|-----|--------|---------------------------------------------|-----------| | 0 | Left breast adenocarcinoma | 11 | 36 | Cancer | Intraductal carcinoma, noninfiltrating, NOS | 8500/2 | | 1 | T3 N1b M0 | 44 | 52 | Cancer | Kaposi sarcoma | 9140/3 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|chunkresolve_icdo_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token, chunk_embeddings]| |Output Labels:|[icd10pcs]| |Language:|en| --- layout: model title: Catalan, Valencian asr_wav2vec2_large_100k_voxpopuli_catala_by_ccoreilly TFWav2Vec2ForCTC from ccoreilly author: John Snow Labs name: asr_wav2vec2_large_100k_voxpopuli_catala_by_ccoreilly date: 2022-09-24 tags: [wav2vec2, ca, audio, open_source, asr] task: Automatic Speech Recognition language: ca edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_100k_voxpopuli_catala_by_ccoreilly` is a Catalan, Valencian model originally trained by ccoreilly. 
NOTE: This model only works on a CPU; if you need to run it on a GPU device, please use asr_wav2vec2_large_100k_voxpopuli_catala_by_ccoreilly_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_100k_voxpopuli_catala_by_ccoreilly_ca_4.2.0_3.0_1664035790289.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_100k_voxpopuli_catala_by_ccoreilly_ca_4.2.0_3.0_1664035790289.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_100k_voxpopuli_catala_by_ccoreilly", "ca")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_100k_voxpopuli_catala_by_ccoreilly", "ca") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_100k_voxpopuli_catala_by_ccoreilly| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ca| |Size:|1.2 GB| --- layout: model title: Legal Payments Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_payments_bert date: 2023-03-05 tags: [en, legal, classification, clauses, payments, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from the US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Payments` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, yielding as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Payments`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_payments_bert_en_1.0.0_3.0_1678049947810.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_payments_bert_en_1.0.0_3.0_1678049947810.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_payments_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
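The splitting advice above (paragraph splitting by multiline, keeping inputs under the 512-token limit) can be sketched in plain Python before building the Spark DataFrame. `split_paragraphs` is a hypothetical helper, and its token count is a rough whitespace approximation rather than the model's actual tokenizer:

```python
import re

def split_paragraphs(text, max_tokens=512):
    """Split a document on blank lines ("multiline" paragraph splitting)
    and drop fragments that clearly exceed the embedding's token limit.
    Whitespace token counting is only an approximation of the real
    tokenizer, so keep a safety margin in practice."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [p for p in paragraphs if len(p.split()) <= max_tokens]

doc = "Payments shall be made within 30 days.\n\nGoverning law is New York."
print(split_paragraphs(doc))  # two paragraphs, each short enough to classify
```

Each returned paragraph can then be fed as one row of the `text` column, so the classifier sees whole provisions rather than isolated sentences.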
## Results ```bash +-------+ |result| +-------+ |[Payments]| |[Other]| |[Other]| |[Payments]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_payments_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.93 0.87 0.90 134 Payments 0.83 0.91 0.87 97 accuracy - - 0.88 231 macro-avg 0.88 0.89 0.88 231 weighted-avg 0.89 0.88 0.88 231 ``` --- layout: model title: German Electra Embeddings (from stefan-it) author: John Snow Labs name: electra_embeddings_electra_base_gc4_64k_600000_cased_generator date: 2022-05-17 tags: [de, open_source, electra, embeddings] task: Embeddings language: de edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-gc4-64k-600000-cased-generator` is a German model originally trained by `stefan-it`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_600000_cased_generator_de_3.4.4_3.0_1652786450936.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_600000_cased_generator_de_3.4.4_3.0_1652786450936.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_600000_cased_generator","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_600000_cased_generator","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electra_base_gc4_64k_600000_cased_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|de| |Size:|223.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/stefan-it/electra-base-gc4-64k-600000-cased-generator - https://german-nlp-group.github.io/projects/gc4-corpus.html - https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf --- layout: model title: Detect Chemicals in Medical Text author: John Snow Labs name: bert_token_classifier_ner_bc4chemd_chemicals date: 2022-07-25 tags: [en, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. The model detects chemical entities from a medical text. 
## Predicted Entities `CHEM` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc4chemd_chemicals_en_4.0.0_3.0_1658751849323.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc4chemd_chemicals_en_4.0.0_3.0_1658751849323.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_bc4chemd_chemicals", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, ner_model, ner_converter ]) data = spark.createDataFrame([["""The main isolated compounds were triterpenes (alpha - amyrin, beta - amyrin, lupeol, betulin, betulinic acid, uvaol, erythrodiol and oleanolic acid) and phenolic acid derivatives from 4 - hydroxybenzoic acid (gallic and protocatechuic acids and isocorilagin)."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_bc4chemd_chemicals", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") .setCaseSensitive(true) .setMaxSentenceLength(512) val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new
Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, ner_model, ner_converter)) val data = Seq("""The main isolated compounds were triterpenes (alpha - amyrin, beta - amyrin, lupeol, betulin, betulinic acid, uvaol, erythrodiol and oleanolic acid) and phenolic acid derivatives from 4 - hydroxybenzoic acid (gallic and protocatechuic acids and isocorilagin).""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.bc4chemd_chemicals").predict("""The main isolated compounds were triterpenes (alpha - amyrin, beta - amyrin, lupeol, betulin, betulinic acid, uvaol, erythrodiol and oleanolic acid) and phenolic acid derivatives from 4 - hydroxybenzoic acid (gallic and protocatechuic acids and isocorilagin).""") ```
## Results ```bash +-------------------------------+-----+ |ner_chunk |label| +-------------------------------+-----+ |triterpenes |CHEM | |alpha - amyrin |CHEM | |beta - amyrin |CHEM | |lupeol |CHEM | |betulin |CHEM | |betulinic acid |CHEM | |uvaol |CHEM | |erythrodiol |CHEM | |oleanolic acid |CHEM | |phenolic acid |CHEM | |4 - hydroxybenzoic acid |CHEM | |gallic and protocatechuic acids|CHEM | |isocorilagin |CHEM | +-------------------------------+-----+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_bc4chemd_chemicals| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References [https://github.com/cambridgeltl/MTL-Bioinformatics-2016](https://github.com/cambridgeltl/MTL-Bioinformatics-2016) ## Benchmarking ```bash label precision recall f1-score support B-CHEM 0.7642 0.9536 0.8485 25346 I-CHEM 0.9446 0.9502 0.9474 29642 micro-avg 0.8517 0.9518 0.8990 54988 macro-avg 0.8544 0.9519 0.8979 54988 weighted-avg 0.8614 0.9518 0.9018 54988 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from votr) author: John Snow Labs name: distilbert_qa_votr_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `votr`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_votr_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773232513.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_votr_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773232513.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_votr_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_votr_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_votr_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/votr/distilbert-base-uncased-finetuned-squad --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from anurag0077) author: John Snow Labs name: distilbert_qa_anurag0077_base_uncased_finetuned_squad2 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad2` is an English model originally trained by `anurag0077`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_anurag0077_base_uncased_finetuned_squad2_en_4.3.0_3.0_1672773499851.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_anurag0077_base_uncased_finetuned_squad2_en_4.3.0_3.0_1672773499851.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_anurag0077_base_uncased_finetuned_squad2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_anurag0077_base_uncased_finetuned_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_anurag0077_base_uncased_finetuned_squad2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anurag0077/distilbert-base-uncased-finetuned-squad2 --- layout: model title: Spanish BertForTokenClassification Cased model (from mrm8488) author: John Snow Labs name: bert_ner_spanish_cased_finedtuned date: 2022-07-19 tags: [open_source, bert, ner, es] task: Named Entity Recognition language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BERT NER model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-spanish-cased-finedtuned-ner` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_spanish_cased_finedtuned_es_4.0.0_3.0_1658250718179.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_spanish_cased_finedtuned_es_4.0.0_3.0_1658250718179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") ner = BertForTokenClassification.pretrained("bert_ner_spanish_cased_finedtuned","es") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, ner]) data = spark.createDataFrame([["PUT YOUR STRING HERE."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val ner = BertForTokenClassification.pretrained("bert_ner_spanish_cased_finedtuned","es") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, ner)) val data = Seq("PUT YOUR STRING HERE.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_spanish_cased_finedtuned| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|es| |Size:|410.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## References https://huggingface.co/mrm8488/bert-spanish-cased-finedtuned-ner --- layout: model title: Russian DistilBERT Embeddings (from Geotrend) author: John Snow Labs name: distilbert_embeddings_distilbert_base_ru_cased date: 2022-04-12 tags: [distilbert, embeddings, ru, open_source] task: Embeddings language: ru edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-ru-cased` is a Russian model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_ru_cased_ru_3.4.2_3.0_1649783536283.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_ru_cased_ru_3.4.2_3.0_1649783536283.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_ru_cased","ru") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Я люблю искра NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_ru_cased","ru") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Я люблю искра NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ru.embed.distilbert_base_cased").predict("""Я люблю искра NLP""") ```
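Once the `embeddings` column is extracted, a common downstream step is comparing token vectors by cosine similarity. A minimal plain-Python sketch, with illustrative low-dimensional stand-ins for the model's real vectors (this helper is not part of Spark NLP):

```python
import math

def cosine(u, v):
    # cosine similarity between two equal-length embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# 3-dimensional stand-ins for the model's real, much larger vectors
print(round(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]), 3))  # identical vectors -> 1.0
```

In practice you would collect the `embeddings` arrays from the result DataFrame and apply the same computation per token pair.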
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_base_ru_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ru| |Size:|202.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/distilbert-base-ru-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Extract Cancer Therapies and Granular Posology Information author: John Snow Labs name: ner_oncology_posology_wip date: 2022-10-01 tags: [licensed, clinical, oncology, en, ner, treatment, posology] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts cancer therapies (Cancer_Surgery, Radiotherapy and Cancer_Therapy) and posology information at a granular level. Definitions of Predicted Entities: - `Cancer_Surgery`: Terms that indicate surgery as a form of cancer treatment. - `Cancer_Therapy`: Any cancer treatment mentioned in text, excluding surgeries and radiotherapy. - `Cycle_Count`: The total number of cycles being administered of an oncological therapy (e.g. "5 cycles"). - `Cycle_Day`: References to the day of the cycle of oncological therapy (e.g. "day 5"). - `Cycle_Number`: The number of the cycle of an oncological therapy that is being applied (e.g. "third cycle"). - `Dosage`: The quantity prescribed by the physician for an active ingredient. - `Duration`: Words indicating the duration of a treatment (e.g. "for 2 weeks"). - `Frequency`: Words indicating the frequency of treatment administration (e.g. "daily" or "bid"). - `Radiotherapy`: Terms that indicate the use of Radiotherapy. - `Radiation_Dose`: Dose used in radiotherapy. 
- `Route`: Words indicating the type of administration route (such as "PO" or "transdermal"). ## Predicted Entities `Cancer_Surgery`, `Cancer_Therapy`, `Cycle_Count`, `Cycle_Day`, `Cycle_Number`, `Dosage`, `Duration`, `Frequency`, `Radiotherapy`, `Radiation_Dose`, `Route` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_posology_wip_en_4.0.0_3.0_1664599604423.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_posology_wip_en_4.0.0_3.0_1664599604423.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_posology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. 
She is currently receiving her second cycle of chemotherapy and is in good overall condition."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_posology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving her second cycle of chemotherapy and is in good overall condition.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_posology_wip").predict("""The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving her second cycle of chemotherapy and is in good overall condition.""") ```
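Downstream of the pipeline, a common post-processing step is attaching each `Dosage` chunk to the therapy mention that precedes it. A heuristic plain-Python sketch over (chunk, label) pairs already collected from the output DataFrame; `attach_dosages` is a hypothetical helper, not part of Spark NLP:

```python
def attach_dosages(pairs):
    """Pair each Dosage with the most recent Cancer_Therapy mention.
    `pairs` is a list of (chunk, ner_label) tuples in document order.
    A simple proximity heuristic; real posology linking may need more care."""
    out, last_therapy = [], None
    for chunk, label in pairs:
        if label == "Cancer_Therapy":
            last_therapy = chunk
        elif label == "Dosage" and last_therapy:
            out.append((last_therapy, chunk))
    return out

pairs = [("adriamycin", "Cancer_Therapy"), ("60 mg/m2", "Dosage"),
         ("cyclophosphamide", "Cancer_Therapy"), ("600 mg/m2", "Dosage"),
         ("six courses", "Cycle_Count")]
print(attach_dosages(pairs))  # [('adriamycin', '60 mg/m2'), ('cyclophosphamide', '600 mg/m2')]
```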
## Results ```bash | chunk | ner_label | |:-----------------|:---------------| | adriamycin | Cancer_Therapy | | 60 mg/m2 | Dosage | | cyclophosphamide | Cancer_Therapy | | 600 mg/m2 | Dosage | | six courses | Cycle_Count | | second cycle | Cycle_Number | | chemotherapy | Cancer_Therapy | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_posology_wip| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|856.4 KB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Cycle_Number 56.0 12.0 17.0 73.0 0.82 0.77 0.79 Cycle_Count 148.0 44.0 27.0 175.0 0.77 0.85 0.81 Radiotherapy 185.0 2.0 37.0 222.0 0.99 0.83 0.90 Cancer_Surgery 494.0 59.0 151.0 645.0 0.89 0.77 0.82 Cycle_Day 144.0 22.0 39.0 183.0 0.87 0.79 0.83 Frequency 270.0 17.0 79.0 349.0 0.94 0.77 0.85 Route 67.0 5.0 30.0 97.0 0.93 0.69 0.79 Cancer_Therapy 1093.0 74.0 165.0 1258.0 0.94 0.87 0.90 Duration 316.0 43.0 231.0 547.0 0.88 0.58 0.70 Dosage 703.0 16.0 124.0 827.0 0.98 0.85 0.91 Radiation_Dose 84.0 14.0 12.0 96.0 0.86 0.88 0.87 macro_avg 3560.0 308.0 912.0 4472.0 0.90 0.78 0.83 micro_avg NaN NaN NaN NaN 0.92 0.80 0.85 ``` --- layout: model title: Extract Financial, Legal and Generic Entities in Arabic author: John Snow Labs name: finner_arabert_arabic date: 2022-08-16 tags: [ar, financial, licensed] task: Named Entity Recognition language: ar edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description AraBert v2 model, trained in house, using the dataset available [here](https://ontology.birzeit.edu/Wojood/) and augmented with financial and legal information.
The entities you can find in this model are: `PERS` (person), `NORP` (group of people), `OCC` (occupation), `ORG` (organization), `GPE` (geopolitical entity), `LOC` (geographical location), `FAC` (facility: landmark places), `EVENT`, `DATE`, `TIME`, `LANGUAGE`, `WEBSITE`, `LAW`, `PRODUCT`, `CARDINAL`, `ORDINAL`, `PERCENT`, `QUANTITY`, `UNIT`, `MONEY`, `CURR` (currency). ## Predicted Entities `NORP`, `PERS`, `LOC`, `MONEY`, `TIME`, `ORG`, `WEBSITE`, `ORDINAL`, `PERCENT`, `EVENT`, `QUANTITY`, `OCC`, `LANGUAGE`, `CARDINAL`, `DATE`, `GPE`, `PRODUCT`, `CURR`, `FAC`, `UNIT`, `LAW` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_arabert_arabic_ar_1.0.0_3.2_1660655139446.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_arabert_arabic_ar_1.0.0_3.2_1660655139446.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") tokenClassifier = finance.BertForTokenClassification.pretrained("finner_arabert_arabic", "ar", "finance/models")\ .setInputCols("token", "document")\ .setOutputCol("label") pipeline = nlp.Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) import pandas as pd example = spark.createDataFrame(pd.DataFrame({'text': ["""أمثلة: جامعة بيرزيت وبالتعاون مع مؤسسة ادوارد سعيد تنظم مهرجان للفن الشعبي سيبدأ الساعة الرابعة عصرا، بتاريخ 16/5/2016. بورصة فلسطين تسجل ارتفاعا بنسبة 0.08% ، في جلسة بلغت قيمة تداولاتها أكثر من نصف مليون دولار . إنتخاب رئيس هيئة سوق رأس المال وتعديل مادة (4) في القانون الأساسي. مسيرة قرب باب العامود والذي 700 متر عن المسجد الأقصى."""]})) result = pipeline.fit(example).transform(example) ```
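The pipeline above emits one BIO tag per token, as shown in the Results section that follows. A minimal sketch of how token-level BIO tags can be merged into entity chunks, in plain Python and independent of Spark NLP (the `merge_bio` helper is illustrative only; inside a Spark NLP pipeline this step is performed by the `NerConverter` annotator):

```python
def merge_bio(tokens, labels):
    """Merge token-level BIO labels into (chunk_text, entity) pairs."""
    chunks, current, entity = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current:
                chunks.append((" ".join(current), entity))
            current, entity = [token], label[2:]
        elif label.startswith("I-") and entity == label[2:]:
            current.append(token)
        else:
            # "O" tags (or an inconsistent I- tag) close any open chunk
            if current:
                chunks.append((" ".join(current), entity))
            current, entity = [], None
    if current:
        chunks.append((" ".join(current), entity))
    return chunks

# Toy English example mirroring the tagged output below
print(merge_bio(["The", "stock", "exchange", "rose", "0.08%"],
                ["O", "B-ORG", "I-ORG", "O", "B-PERCENT"]))
# → [('stock exchange', 'ORG'), ('0.08%', 'PERCENT')]
```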
## Results ```bash ["أمثلة:","O"] ["جامعة","B-ORG"] ["بيرزيت","I-ORG"] ["وبالتعاون","O"] ["مع","O"] ["مؤسسة","B-ORG"] ["ادوارد","B-PERS"] ["سعيد","I-PERS"] ["تنظم","O"] ["مهرجان","B-EVENT"] ["للفن","I-EVENT"] ["الشعبي","I-EVENT"] ["سيبدأ","O"] ["الساعة","B-TIME"] ["الرابعة","I-TIME"] ["عصرا،","I-TIME"] ["بتاريخ","B-DATE"] ["16/5/2016.","I-DATE"] ["بورصة","B-ORG"] ["فلسطين","I-ORG"] ["تسجل","O"] ["ارتفاعا","O"] ["بنسبة","O"] ["0.08%","B-PERCENT"] ["،","O"] ["في","O"] ["جلسة","O"] ["بلغت","O"] ["قيمة","O"] ["تداولاتها","O"] ["أكثر","O"] ["من","O"] ["نصف","B-MONEY"] ["مليون","I-MONEY"] ["دولار","B-CURR"] [".","O"] ["إنتخاب","O"] ["رئيس","B-OCC"] ["هيئة","B-ORG"] ["سوق","I-ORG"] ["رأس","I-ORG"] ["المال","I-ORG"] ["وتعديل","O"] ["مادة","B-LAW"] ["(4)","I-LAW"] ["في","O"] ["القانون","B-LAW"] ["الأساسي.","O"] ["مسيرة","O"] ["قرب","O"] ["باب","B-FAC"] ["العامود","I-FAC"] ["والذي","O"] ["700","B-QUANTITY"] ["متر","B-UNIT"] ["عن","O"] ["المسجد","B-FAC"] ["الأقصى.","I-FAC"] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_arabert_arabic| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|ar| |Size:|505.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References https://ontology.birzeit.edu/Wojood/ ## Benchmarking ```bash label precision recall f1-score support B-CARDINAL 0.93 0.87 0.80 19 B-DATE 0.88 0.93 0.90 106 B-EVENT 1.00 0.86 0.92 14 B-FAC 1.00 0.67 0.80 3 B-GPE 0.88 0.85 0.87 89 B-LAW 1.00 0.50 0.67 6 B-NORP 0.72 0.81 0.76 32 B-OCC 0.88 0.83 0.85 52 B-ORDINAL 0.76 0.80 0.78 35 B-ORG 0.81 0.87 0.84 103 B-PERS 0.78 0.89 0.83 47 B-WEBSITE 0.62 1.00 0.77 5 I-CARDINAL 0.33 0.62 0.43 8 I-DATE 0.98 0.99 0.98 447 I-EVENT 0.91 0.91 0.91 23 I-FAC 0.75 0.43 0.55 7 I-GPE 0.80 0.92 0.86 53 I-LAW 1.00 0.85 0.92 13 I-NORP 0.65 0.48 0.55 23 I-OCC 0.97 0.86 0.91 96 I-ORG 0.87 0.91 0.89 139 I-PERS 0.94 1.00 0.97 60 I-WEBSITE 0.94 
1.00 0.97 15
O            0.98  0.97  0.98  3062
accuracy     -     -     0.95  4468
macro-avg    0.83  0.81  0.81  4468
weighted-avg 0.95  0.95  0.95  4468
```

---
layout: model
title: Legal Indenture Document Binary Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_indenture_agreement_bert
date: 2022-12-18
tags: [en, legal, classification, licensed, document, bert, indenture, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The `legclf_indenture_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `indenture` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time.

## Predicted Entities

`indenture`, `other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_indenture_agreement_bert_en_1.0.0_3.0_1671393852026.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_indenture_agreement_bert_en_1.0.0_3.0_1671393852026.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_indenture_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-----------+
|result     |
+-----------+
|[indenture]|
|[other]    |
|[other]    |
|[indenture]|
+-----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_indenture_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.9 MB|

## References

Legal documents, scraped from the Internet and classified in-house, plus SEC documents.

## Benchmarking

```bash
label         precision  recall  f1-score  support
indenture     0.96       0.93    0.94      97
other         0.97       0.98    0.97      204
accuracy      -          -       0.96      301
macro-avg     0.96       0.95    0.96      301
weighted-avg  0.96       0.96    0.96      301
```

---
layout: model
title: Legal Indemnification by the company Clause Binary Classifier
author: John Snow Labs
name: legclf_indemnification_by_the_company_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `indemnification-by-the-company` clause type. To use this model, make sure you provide enough context as input.

Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do Binary Classification at the sentence level.

If you have long legal documents and want to look for clauses, we recommend splitting them using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).

This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `indemnification-by-the-company`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_indemnification_by_the_company_clause_en_1.0.0_3.2_1660122512480.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_indemnification_by_the_company_clause_en_1.0.0_3.2_1660122512480.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_indemnification_by_the_company_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
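As the description above suggests, long contracts should be pre-split (e.g. by multiline) before clause classification. A minimal sketch in plain Python, assuming paragraphs are separated by blank lines (the `split_paragraphs` helper is illustrative, not part of Spark NLP):

```python
import re

def split_paragraphs(document):
    """Split a document into paragraphs on one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

contract = """INDEMNIFICATION. The Company shall indemnify each Indemnitee...

GOVERNING LAW. This Agreement shall be governed by...

MISCELLANEOUS. Headings are for convenience only..."""

for clause in split_paragraphs(contract):
    print(clause.split(".")[0])
# → INDEMNIFICATION
#   GOVERNING LAW
#   MISCELLANEOUS
```

Each resulting paragraph can then be loaded into the `clause_text` column of the DataFrame used in the snippet above.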
## Results

```bash
+--------------------------------+
|result                          |
+--------------------------------+
|[indemnification-by-the-company]|
|[other]                         |
|[other]                         |
|[indemnification-by-the-company]|
+--------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_indemnification_by_the_company_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.8 MB|

## References

Legal documents, scraped from the Internet and classified in-house.

## Benchmarking

```bash
label                           precision  recall  f1-score  support
indemnification-by-the-company  0.85       0.85    0.85      20
other                           0.95       0.95    0.95      61
accuracy                        -          -       0.93      81
macro-avg                       0.90       0.90    0.90      81
weighted-avg                    0.93       0.93    0.93      81
```

---
layout: model
title: BioBERT Sentence Embeddings (Pubmed Large)
author: John Snow Labs
name: sent_biobert_pubmed_large_cased
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [embeddings, en, open_source]
supported: true
annotator: BertSentenceEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model contains the pre-trained weights of BioBERT, a language representation model for the biomedical domain, especially designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)".

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_biobert_pubmed_large_cased_en_2.6.0_2.4_1598348255724.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_biobert_pubmed_large_cased_en_2.6.0_2.4_1598348255724.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_large_cased", "en") \
    .setInputCols("sentence") \
    .setOutputCol("sentence_embeddings")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])

pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"]))
```
```scala
...
val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_large_cased", "en")
    .setInputCols("sentence")
    .setOutputCol("sentence_embeddings")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))

val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('en.embed_sentence.biobert.pubmed_large_cased').predict(text, output_level='sentence')
embeddings_df
```
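Sentence embeddings such as the vectors shown in the Results below are typically compared with cosine similarity. A quick sketch on toy vectors (plain Python; real `sent_biobert_pubmed_large_cased` embeddings have 1024 dimensions, so these 3-d vectors are stand-ins for illustration only):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy 3-d stand-ins for 1024-d sentence embeddings
v1 = [-0.504, 0.157, -0.031]
v2 = [0.313, 0.195, 0.014]
print(round(cosine_similarity(v1, v2), 3))
```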
{:.h2_title}
## Results

```bash
en_embed_sentence_biobert_pubmed_large_cased_embeddings   sentence
[-0.50444495677948, 0.15660151839256287, -0.03...         I hate cancer
[0.3133466839790344, 0.1945181041955948, 0.014...         Antibiotics aren't painkiller
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sent_biobert_pubmed_large_cased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|en|
|Dimension:|1024|
|Case sensitive:|true|

{:.h2_title}
## Data Source

The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert)

---
layout: model
title: Legal Swiss Judgements Classification (Italian)
author: John Snow Labs
name: legclf_bert_swiss_judgements
date: 2022-10-27
tags: [it, legal, licensed, sequence_classification]
task: Text Classification
language: it
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This Bert-based model classifies Swiss Judgement documents in the Italian language into the following 6 classes according to their case area. It has been trained with a SOTA approach.

## Predicted Entities

`public law`, `civil law`, `insurance law`, `social law`, `penal law`, `other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_bert_swiss_judgements_it_1.0.0_3.0_1666866260601.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_bert_swiss_judgements_it_1.0.0_3.0_1666866260601.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') tokenizer = nlp.Tokenizer()\ .setInputCols(['document'])\ .setOutputCol('token') clf_model = legal.BertForSequenceClassification.pretrained('legclf_bert_swiss_judgements', 'it', 'legal/models')\ .setInputCols(['document', 'token'])\ .setOutputCol('class')\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) clf_pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, clf_model ]) data = spark.createDataFrame([["""Attualità: A. Disponibile dal 21. Nell'ottobre del 2004, l'Allianza di assicurazioni svizzere (in prosieguo: Allianz) ha messo in atto il R._ (geb. 1965) per le conseguenze di un incidente del 23. Nel mese di marzo del 2001 le prestazioni sono ritornate al 31. Nel mese di marzo del 2004 si è presentato la decisione del 6. Nel luglio del 2005 è stato arrestato. A. A disposizione del 21. Nell'ottobre del 2004, l'Allianza di assicurazioni svizzere (in prosieguo: Allianz) ha messo in atto il R._ (geb. 1965) per le conseguenze di un incidente del 23. Nel mese di marzo del 2001 le prestazioni sono ritornate al 31. Nel mese di marzo del 2004 si è presentato la decisione del 6. Nel luglio del 2005 è stato arrestato. di B. Il 7. Nel novembre 2005 R._ ha presentato una denuncia contro la decisione di interrogatorio al Tribunale amministrativo del Cantone di Schwyz. Con la lettera del 9. Nel novembre del 2005, il vicepresidente del Tribunale amministrativo ha informato gli assicurati che la denuncia è stata presentata in ritardo secondo la legge cantonale massiccia, il motivo per cui non è possibile procedere, e gli ha dato l'opportunità di pronunciarsi. Con l’ingresso del 15. Nel novembre 2005 R._ ha presentato una richiesta di ripristino del termine di reclamo. Con la decisione del 6. Nel dicembre 2005 il Tribunale amministrativo non ha presentato la denuncia. di B. Il 7. 
Nel novembre 2005 R._ ha presentato una denuncia contro la decisione di interrogatorio al Tribunale amministrativo del Cantone di Schwyz. Con la lettera del 9. Nel novembre del 2005, il vicepresidente del Tribunale amministrativo ha informato gli assicurati che la denuncia è stata presentata in ritardo secondo la legge cantonale massiccia, il motivo per cui non è possibile procedere, e gli ha dato l'opportunità di pronunciarsi. Con l’ingresso del 15. Nel novembre 2005 R._ ha presentato una richiesta di ripristino del termine di reclamo. Con la decisione del 6. Nel dicembre 2005 il Tribunale amministrativo non ha presentato la denuncia. C. Con un ricorso al Tribunale amministrativo, R._ chiede alla causa principale che, annullando la decisione pregiudiziale, il tribunale cantonale sia obbligato a presentare il ricorso del 7. di entrare nel novembre 2005. Dal punto di vista procedurale, il giudice può presentare la richiesta giuridica di aderire agli atti pregiudiziali e di ordinare un secondo cambio di scrittura. Il Tribunale amministrativo del Cantone di Schwyz e l'Alleanza concludono il ricorso alla Corte amministrativa. L’Ufficio federale per la salute rinuncia ad una consultazione."""]]).toDF("text") result = clf_pipeline.fit(data).transform(data) ```
## Results

```bash
+----------------------------------------------------------------------------------------------------+-------------+
|                                                                                            document|        class|
+----------------------------------------------------------------------------------------------------+-------------+
|Attualità: A. Disponibile dal 21. Nell'ottobre del 2004, l'Allianza di assicurazioni svizzere (in...|insurance law|
+----------------------------------------------------------------------------------------------------+-------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_bert_swiss_judgements|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|it|
|Size:|392.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

Training data is available [here](https://zenodo.org/record/7109926#.Y1gJwexBw8E).

## Benchmarking

```bash
| label         | precision | recall | f1-score | support |
|---------------|-----------|--------|----------|---------|
| civil-law     | 0.90      | 0.93   | 0.92     | 1144    |
| insurance-law | 0.93      | 0.96   | 0.95     | 1034    |
| other         | 0.88      | 0.44   | 0.58     | 32      |
| penal-law     | 0.91      | 0.95   | 0.93     | 1219    |
| public-law    | 0.93      | 0.88   | 0.90     | 1433    |
| social-law    | 0.96      | 0.92   | 0.94     | 924     |
| accuracy      | -         | -      | 0.92     | 5786    |
| macro-avg     | 0.92      | 0.85   | 0.87     | 5786    |
| weighted-avg  | 0.92      | 0.92   | 0.92     | 5786    |
```

---
layout: model
title: Detect Living Species (w2v_cc_300d)
author: John Snow Labs
name: ner_living_species
date: 2022-06-22
tags: [es, ner, clinical, licensed]
task: Named Entity Recognition
language: es
edition: Healthcare NLP 3.5.3
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Extract living species from clinical texts in Spanish, a task critical to scientific disciplines such as medicine, biology, ecology/biodiversity, nutrition and agriculture.
This model is trained using `w2v_cc_300d` embeddings. It is trained on the [LivingNER](https://temu.bsc.es/livingner/) corpus that is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. ## Predicted Entities `HUMAN`, `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_es_3.5.3_3.0_1655907754521.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_es_3.5.3_3.0_1655907754521.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","es")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_living_species", "es","clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. 
Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","es") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_living_species", "es","clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. 
Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.living_species").predict("""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.""") ```
## Results

```bash
+--------------+-------+
|ner_chunk     |label  |
+--------------+-------+
|Lactante varón|HUMAN  |
|familiares    |HUMAN  |
|personales    |HUMAN  |
|neonatal      |HUMAN  |
|legumbres     |SPECIES|
|lentejas      |SPECIES|
|garbanzos     |SPECIES|
|legumbres     |SPECIES|
|madre         |HUMAN  |
|Cacahuete     |SPECIES|
|padres        |HUMAN  |
+--------------+-------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_living_species|
|Compatibility:|Healthcare NLP 3.5.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|es|
|Size:|15.1 MB|

## References

[https://temu.bsc.es/livingner/](https://temu.bsc.es/livingner/)

## Benchmarking

```bash
label         precision  recall  f1-score  support
B-HUMAN       1.00       1.00    1.00      3281
B-SPECIES     1.00       1.00    1.00      3712
I-HUMAN       1.00       0.99    0.99      297
I-SPECIES     0.90       0.99    0.95      1732
micro-avg     0.98       1.00    0.99      9022
macro-avg     0.97       0.99    0.98      9022
weighted-avg  0.98       1.00    0.99      9022
```

---
layout: model
title: Italian BertForQuestionAnswering model (from antoniocappiello)
author: John Snow Labs
name: bert_qa_bert_base_italian_uncased_squad_it_antoniocappiello
date: 2022-06-03
tags: [it, open_source, question_answering, bert]
task: Question Answering
language: it
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-italian-uncased-squad-it` is an Italian model originally trained by `antoniocappiello`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_italian_uncased_squad_it_antoniocappiello_it_4.0.0_3.0_1654249486819.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_italian_uncased_squad_it_antoniocappiello_it_4.0.0_3.0_1654249486819.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_italian_uncased_squad_it_antoniocappiello","it") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_italian_uncased_squad_it_antoniocappiello","it") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("it.answer_question.squad.bert.base_uncased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_italian_uncased_squad_it_antoniocappiello|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|it|
|Size:|410.3 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/antoniocappiello/bert-base-italian-uncased-squad-it
- http://sag.art.uniroma2.it/demo-software/squadit/
- https://github.com/crux82/squad-it/blob/master/README.md#evaluating-a-neural-model-over-squad-it

---
layout: model
title: Legal Holidays Clause Binary Classifier
author: John Snow Labs
name: legclf_holidays_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `holidays` clause type. To use this model, make sure you provide enough context as input.

Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do Binary Classification at the sentence level.

If you have long legal documents and want to look for clauses, we recommend splitting them using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `holidays`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_holidays_clause_en_1.0.0_3.2_1660123586412.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_holidays_clause_en_1.0.0_3.2_1660123586412.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_holidays_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
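The 512-token limit mentioned in the description can be respected by chunking long inputs before they reach the pipeline. A rough sketch using whitespace tokens as a proxy (an approximation only: the model's subword tokenizer usually produces more pieces than a whitespace split, so a smaller budget such as 400 leaves headroom; `chunk_by_tokens` is an illustrative helper, not part of Spark NLP):

```python
def chunk_by_tokens(text, max_tokens=400):
    """Split text into pieces of at most max_tokens whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

long_clause = "word " * 1000
pieces = chunk_by_tokens(long_clause.strip())
print([len(p.split()) for p in pieces])  # → [400, 400, 200]
```

Each piece can then be classified separately, and a document can be flagged as containing a `holidays` clause if any of its pieces is labeled that way.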
## Results

```bash
+----------+
|result    |
+----------+
|[holidays]|
|[other]   |
|[other]   |
|[holidays]|
+----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_holidays_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|

## References

Legal documents, scraped from the Internet and classified in-house.

## Benchmarking

```bash
label         precision  recall  f1-score  support
holidays      0.98       0.94    0.96      49
other         0.96       0.99    0.97      68
accuracy      -          -       0.97      117
macro-avg     0.97       0.96    0.96      117
weighted-avg  0.97       0.97    0.97      117
```

---
layout: model
title: English RobertaForQuestionAnswering (from comacrae)
author: John Snow Labs
name: roberta_qa_roberta_unaugmentedv3
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-unaugmentedv3` is an English model originally trained by `comacrae`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_unaugmentedv3_en_4.0.0_3.0_1655738511638.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_unaugmentedv3_en_4.0.0_3.0_1655738511638.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_unaugmentedv3","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_unaugmentedv3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.augmented").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_unaugmentedv3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/comacrae/roberta-unaugmentedv3 --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from ericklerouge123) author: John Snow Labs name: xlmroberta_ner_ericklerouge123_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `ericklerouge123`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_ericklerouge123_base_finetuned_panx_de_4.1.0_3.0_1660432622351.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_ericklerouge123_base_finetuned_panx_de_4.1.0_3.0_1660432622351.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_ericklerouge123_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_ericklerouge123_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_ericklerouge123_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/ericklerouge123/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: English asr_english_filipino_wav2vec2_l_xls_r_test_05 TFWav2Vec2ForCTC from Khalsuu author: John Snow Labs name: pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_05 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_english_filipino_wav2vec2_l_xls_r_test_05` is an English model originally trained by Khalsuu. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_05_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_05_en_4.2.0_3.0_1664114174628.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_05_en_4.2.0_3.0_1664114174628.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_05', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_05", lang = "en") val annotations = pipeline.transform(audioDF) ```
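The snippets above assume an `audioDF` whose `audio_content` column holds the raw waveform as an array of floats. A stdlib-only sketch of producing such floats from a 16-bit mono PCM WAV file (the `load_wav_as_floats` name and the commented `spark.createDataFrame` wiring are illustrative assumptions, not part of this card):

```python
import struct
import wave

def load_wav_as_floats(path):
    """Read a 16-bit mono PCM WAV file and scale each sample to [-1.0, 1.0].

    Wav2vec2 models expect 16 kHz mono audio, so the file should already
    be at that sample rate; resampling is out of scope here.
    """
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "expected 16-bit PCM"
        assert wav.getnchannels() == 1, "expected mono audio"
        raw = wav.readframes(wav.getnframes())
    # '<h' = little-endian signed 16-bit, the WAV PCM sample format.
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [s / 32768.0 for s in samples]

# Hypothetical wiring into the pipeline above ('spark' and the file name
# are assumptions):
# audioDF = spark.createDataFrame([(load_wav_as_floats("sample.wav"),)],
#                                 ["audio_content"])
```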
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_05| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering model (from MrAnderson) author: John Snow Labs name: bert_qa_bert_base_2048_full_trivia_copied_embeddings date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-2048-full-trivia-copied-embeddings` is an English model originally trained by `MrAnderson`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_2048_full_trivia_copied_embeddings_en_4.0.0_3.0_1654179629858.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_2048_full_trivia_copied_embeddings_en_4.0.0_3.0_1654179629858.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_2048_full_trivia_copied_embeddings","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_2048_full_trivia_copied_embeddings","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.trivia.bert.base_2048.by_MrAnderson").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_2048_full_trivia_copied_embeddings| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|412.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/MrAnderson/bert-base-2048-full-trivia-copied-embeddings --- layout: model title: English BertForQuestionAnswering model (from graviraja) author: John Snow Labs name: bert_qa_covidbert_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `covidbert_squad` is an English model originally trained by `graviraja`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_covidbert_squad_en_4.0.0_3.0_1654187350931.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_covidbert_squad_en_4.0.0_3.0_1654187350931.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_covidbert_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_covidbert_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.covid_bert").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_covidbert_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/graviraja/covidbert_squad --- layout: model title: French asr_exp_w2v2r_vp_100k_accent_france_5_belgium_5_s607 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_vp_100k_accent_france_5_belgium_5_s607 date: 2022-09-26 tags: [wav2vec2, fr, audio, open_source, asr] task: Automatic Speech Recognition language: fr edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_vp_100k_accent_france_5_belgium_5_s607` is a French model originally trained by jonatasgrosman. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_exp_w2v2r_vp_100k_accent_france_5_belgium_5_s607_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_france_5_belgium_5_s607_fr_4.2.0_3.0_1664201843650.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_france_5_belgium_5_s607_fr_4.2.0_3.0_1664201843650.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_vp_100k_accent_france_5_belgium_5_s607", "fr")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_vp_100k_accent_france_5_belgium_5_s607", "fr") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_vp_100k_accent_france_5_belgium_5_s607| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|fr| |Size:|1.2 GB| --- layout: model title: Sentence Entity Resolver for Snomed Codes author: John Snow Labs name: sbertresolve_snomed date: 2021-09-16 tags: [snomed, de, clinical, licensed] task: Entity Resolution language: de edition: Healthcare NLP 3.2.2 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to SNOMED codes for the German language using `sent_bert_base_cased` (de) embeddings. ## Predicted Entities `SNOMED Codes` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_SNOMED_DE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/14.German_Healthcare_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_snomed_de_3.2.2_2.4_1631826969583.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_snomed_de_3.2.2_2.4_1631826969583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "de")\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") snomed_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_snomed", "de", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("snomed_code") snomed_pipelineModel = PipelineModel( stages = [ documentAssembler, sbert_embedder, snomed_resolver]) snomed_lp = LightPipeline(snomed_pipelineModel) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "de") .setInputCols("ner_chunk") .setOutputCol("sbert_embeddings") val snomed_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_snomed", "de", "clinical/models") .setInputCols("sbert_embeddings") .setOutputCol("snomed_code") val snomed_pipelineModel = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, snomed_resolver)).fit(Seq("").toDF("text")) val snomed_lp = new LightPipeline(snomed_pipelineModel) ``` {:.nlu-block} ```python import nlu nlu.load("de.resolve.snomed").predict("""Put your text here.""") ```
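The resolver returns the best SNOMED code in each annotation's `result` and the ranked candidate list in its metadata. A small helper for pairing candidates with their distances — it assumes Spark NLP's resolver metadata convention of `:::`-joined strings under the `all_k_results` and `all_k_distances` keys (verify against your version's output):

```python
def top_k_candidates(metadata, k=5):
    """Zip the resolver's ranked codes with their distances.

    'metadata' is one annotation's metadata dict; the ':::' separator and
    key names follow the Spark NLP SentenceEntityResolverModel convention.
    """
    codes = metadata["all_k_results"].split(":::")
    dists = [float(d) for d in metadata["all_k_distances"].split(":::")]
    return list(zip(codes, dists))[:k]
```

With a `LightPipeline`, this could be applied per chunk, e.g. `top_k_candidates(snomed_lp.fullAnnotate("Bronchialkarzinom")[0]["snomed_code"][0].metadata)`.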
## Results ```bash | | chunks | code | resolutions | all_codes | all_distances | |---:|:------------------|:--------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | 0 | Bronchialkarzinom | 22628 | Bronchialkarzinom, Bronchuskarzinom, Rektumkarzinom, Klavikulakarzinom, Lippenkarzinom, Urothelkarzinom, Hodenteratokarzinom, Unterbauchkarzinom, Teratokarzinom, Oropharynxkarzinom, Harnleiterkarzinom, Herzbeutelkarzinom, Thekazellkarzinom, Plattenepithelkarzinom, Weichteilkarzinom, Perikardkarzinom, Zervixkarzinom, Samenstrangkarzinom, Nierenkelchkarzinom, Querkolonkarzinom, Perianalkarzinom, Endozervixkarzinom, Parotiskarzinom, Gehörgangskarzinom, Prostatakarzinom| [22628, 111139, 18116, 107569, 18830, 22909, 16259, 111193, 22383, 19807, 22613, 20014, 74820, 21331, 30182, 20015, 23130, 22068, 20340, 29968, 15757, 23917, 25303, 17800, 21706] | [0.0000, 0.0073, 0.0090, 0.0098, 0.0098, 0.0102, 0.0102, 0.0110, 0.0111, 0.0120, 0.0121, 0.0123, 0.0128, 0.0130, 0.0129, 0.0131, 0.0128, 0.0131, 0.0135, 0.0133, 0.0137, 0.0137, 0.0139, 0.0137, 0.0139] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbertresolve_snomed| |Compatibility:|Healthcare NLP 
3.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[snomed_code]| |Language:|de| |Case sensitive:|false| --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_2_h_128 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-2_H-128` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_2_h_128_zh_4.2.4_3.0_1670325870728.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_2_h_128_zh_4.2.4_3.0_1670325870728.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_2_h_128","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_2_h_128","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
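The `embeddings` column produced above holds one vector per token, and such vectors are typically compared with cosine similarity. A stdlib-only sketch of that comparison, applicable to the float arrays you get back from `result.select("embeddings.embeddings")`:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Identical vectors score 1.0, orthogonal ones 0.0; for this 128-dimensional model each compared vector has 128 entries.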
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_2_h_128| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|12.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-2_H-128 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: English XlmRoBertaForQuestionAnswering (from deepset) author: John Snow Labs name: xlm_roberta_qa_xlm_roberta_large_squad2 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-large-squad2` is an English model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_large_squad2_en_4.0.0_3.0_1655996042440.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_large_squad2_en_4.0.0_3.0_1655996042440.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_large_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_large_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.xlm_roberta.large").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_roberta_large_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.9 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/deepset/xlm-roberta-large-squad2 - https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/ - https://www.linkedin.com/company/deepset-ai/ - https://twitter.com/deepset_ai - http://www.deepset.ai/jobs - https://haystack.deepset.ai/community/join - https://public-mlflow.deepset.ai/#/experiments/124/runs/3a540e3f3ecf4dd98eae8fc6d457ff20 - https://github.com/deepset-ai/haystack/ - https://github.com/deepset-ai/FARM - https://deepset.ai/germanquad - https://github.com/deepmind/xquad - https://deepset.ai - https://deepset.ai/german-bert - https://github.com/deepset-ai/haystack/discussions - https://github.com/facebookresearch/MLQA --- layout: model title: Detect Living Species(embeddings_scielo_300d) author: John Snow Labs name: ner_living_species_300 date: 2022-07-29 tags: [es, ner, clinical, licensed] task: Named Entity Recognition language: es edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract living species from clinical texts in Spanish which is critical to scientific disciplines like medicine, biology, ecology/biodiversity, nutrition and agriculture. This model is trained using `embeddings_scielo_300d` embeddings. It is trained on the [LivingNER](https://temu.bsc.es/livingner/) corpus that is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. 
## Predicted Entities `HUMAN`, `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_300_es_4.0.0_3.0_1659075919931.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_300_es_4.0.0_3.0_1659075919931.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("embeddings_scielo_300d","es","clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_living_species_300", "es","clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. 
Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_scielo_300d","es","clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_living_species_300", "es","clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. 
Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.living_species.300").predict("""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.""") ```
## Results ```bash +--------------+-------+ |ner_chunk |label | +--------------+-------+ |Lactante varón|HUMAN | |familiares |HUMAN | |personales |HUMAN | |neonatal |HUMAN | |legumbres |SPECIES| |lentejas |SPECIES| |garbanzos |SPECIES| |legumbres |SPECIES| |madre |HUMAN | |Cacahuete |SPECIES| |padres |HUMAN | +--------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_300| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|15.0 MB| ## Benchmarking ```bash label precision recall f1-score support B-HUMAN 0.98 0.97 0.98 3281 B-SPECIES 0.94 0.98 0.96 3712 I-HUMAN 0.87 0.81 0.84 297 I-SPECIES 0.79 0.89 0.84 1732 micro-avg 0.92 0.95 0.94 9022 macro-avg 0.90 0.91 0.90 9022 weighted-avg 0.93 0.95 0.94 9022 ``` --- layout: model title: English BertForMaskedLM Base Uncased model (from ayansinha) author: John Snow Labs name: bert_embeddings_false_positives_scancode_base_uncased_l8_1 date: 2022-12-06 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `false-positives-scancode-bert-base-uncased-L8-1` is an English model originally trained by `ayansinha`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_false_positives_scancode_base_uncased_l8_1_en_4.2.4_3.0_1670326295941.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_false_positives_scancode_base_uncased_l8_1_en_4.2.4_3.0_1670326295941.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_false_positives_scancode_base_uncased_l8_1","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_false_positives_scancode_base_uncased_l8_1","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_false_positives_scancode_base_uncased_l8_1| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|409.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/ayansinha/false-positives-scancode-bert-base-uncased-L8-1 - https://github.com/nexB/scancode-results-analyzer - https://github.com/nexB/scancode-results-analyzer#quickstart---local-machine - https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py --- layout: model title: Legal Permits Clause Binary Classifier author: John Snow Labs name: legclf_permits_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `permits` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. 
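As a plain-Python illustration (independent of Spark NLP; the function name is ours), the first technique, paragraph splitting by multiline, can be sketched as:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs on blank lines (multiline splitting)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "First clause paragraph.\n\nSecond clause paragraph.\n\nThird clause paragraph."
print(split_paragraphs(doc))
```

Each resulting paragraph can then be passed to the classifier as its own row.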
Note that this model's embeddings allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (see the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `permits` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_permits_clause_en_1.0.0_3.2_1660122841810.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_permits_clause_en_1.0.0_3.2_1660122841810.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_permits_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
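The 512-token limit mentioned in the description can be respected with a simple pre-splitting step before the pipeline. A rough whitespace-based sketch follows (our own helper; the model's subword tokenizer counts tokens differently, so treat the limit conservatively):

```python
def chunk_by_tokens(text: str, max_tokens: int = 512) -> list:
    """Greedily group whitespace-separated tokens into chunks of at most max_tokens."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

print(chunk_by_tokens("one two three four five", max_tokens=2))
# → ['one two', 'three four', 'five']
```

Each chunk can then be classified independently, and the per-chunk predictions aggregated as needed.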
## Results ```bash +-------+ | result| +-------+ |[permits]| |[other]| |[other]| |[permits]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_permits_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 0.96 0.98 94 permits 0.90 1.00 0.95 36 accuracy - - 0.97 130 macro-avg 0.95 0.98 0.96 130 weighted-avg 0.97 0.97 0.97 130 ``` --- layout: model title: Legal Beverages And Sugar Document Classifier (EURLEX) author: John Snow Labs name: legclf_beverages_and_sugar_bert date: 2023-03-06 tags: [en, legal, classification, clauses, beverages_and_sugar, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the `legclf_beverages_and_sugar_bert` model, a Bert Sentence Embeddings Document Classifier, classifies whether the document belongs to the class Beverages_and_Sugar or not (binary classification) according to EuroVoc labels. 
## Predicted Entities `Beverages_and_Sugar`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_beverages_and_sugar_bert_en_1.0.0_3.0_1678111822961.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_beverages_and_sugar_bert_en_1.0.0_3.0_1678111822961.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_beverages_and_sugar_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Beverages_and_Sugar]| |[Other]| |[Other]| |[Beverages_and_Sugar]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_beverages_and_sugar_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Beverages_and_Sugar 0.89 0.84 0.87 50 Other 0.85 0.90 0.87 49 accuracy - - 0.87 99 macro-avg 0.87 0.87 0.87 99 weighted-avg 0.87 0.87 0.87 99 ``` --- layout: model title: English asr_wav2vec2_common_voice_accents_indian TFWav2Vec2ForCTC from willcai author: John Snow Labs name: pipeline_asr_wav2vec2_common_voice_accents_indian date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_common_voice_accents_indian` is an English model originally trained by willcai. 
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_common_voice_accents_indian_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_common_voice_accents_indian_en_4.2.0_3.0_1664106206690.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_common_voice_accents_indian_en_4.2.0_3.0_1664106206690.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_common_voice_accents_indian', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_common_voice_accents_indian", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_common_voice_accents_indian| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Modern Greek (1453-) asr_wav2vec2_large_xlsr_53_greek_by_jonatasgrosman TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_greek_by_jonatasgrosman date: 2022-09-25 tags: [wav2vec2, el, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: el edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_greek_by_jonatasgrosman` is a Modern Greek (1453-) model originally trained by jonatasgrosman. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_greek_by_jonatasgrosman_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_greek_by_jonatasgrosman_el_4.2.0_3.0_1664108673057.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_greek_by_jonatasgrosman_el_4.2.0_3.0_1664108673057.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_greek_by_jonatasgrosman', lang = 'el') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_greek_by_jonatasgrosman", lang = "el") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_greek_by_jonatasgrosman| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|el| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Portuguese BertForMaskedLM Base Cased model (from neuralmind) author: John Snow Labs name: bert_embeddings_base_portuguese_cased date: 2022-12-02 tags: [pt, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: pt edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-portuguese-cased` is a Portuguese model originally trained by `neuralmind`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_portuguese_cased_pt_4.2.4_3.0_1670018715684.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_portuguese_cased_pt_4.2.4_3.0_1670018715684.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_portuguese_cased","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_portuguese_cased","pt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_portuguese_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|pt| |Size:|408.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/neuralmind/bert-base-portuguese-cased - https://imgur.com/JZ7Hynh.jpg - https://github.com/neuralmind-ai/portuguese-bert/ --- layout: model title: English T5ForConditionalGeneration Small Cased model (from JulesBelveze) author: John Snow Labs name: t5_small_headline_generator date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-headline-generator` is an English model originally trained by `JulesBelveze`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_headline_generator_en_4.3.0_3.0_1675126378182.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_headline_generator_en_4.3.0_3.0_1675126378182.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_headline_generator","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_headline_generator","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_headline_generator| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|282.9 MB| ## References - https://huggingface.co/JulesBelveze/t5-small-headline-generator --- layout: model title: Sentence Entity Resolver for HCPCS Codes author: John Snow Labs name: sbiobertresolve_hcpcs date: 2022-02-28 tags: [en, licensed] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.4.0 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to [Healthcare Common Procedure Coding System (HCPCS)](https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/HCPCS/index.html#:~:text=The%20Healthcare%20Common%20Procedure%20Coding,%2C%20supplies%2C%20products%20and%20services.) codes using 'sbiobert_base_cased_mli' sentence embeddings. ## Predicted Entities `HCPCS Codes` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_HCPCS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_hcpcs_en_3.4.0_2.4_1646036125003.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_hcpcs_en_3.4.0_2.4_1646036125003.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings\ .pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\ .setInputCols(["ner_chunk"])\ .setOutputCol("sentence_embeddings") hcpcs_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_hcpcs", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sentence_embeddings"]) \ .setOutputCol("hcpcs_code")\ .setDistanceFunction("EUCLIDEAN") hcpcs_pipeline = Pipeline( stages = [ documentAssembler, sbert_embedder, hcpcs_resolver]) data = spark.createDataFrame([["Breast prosthesis, mastectomy bra, with integrated breast prosthesis form, unilateral, any size, any type"]]).toDF("text") results = hcpcs_pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en","clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sentence_embeddings") val hcpcs_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_hcpcs", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sentence_embeddings")) .setOutputCol("hcpcs_code") .setDistanceFunction("EUCLIDEAN") val hcpcs_pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, hcpcs_resolver)) val data = Seq("Breast prosthesis, mastectomy bra, with integrated breast prosthesis form, unilateral, any size, any type").toDF("text") val results = hcpcs_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.hcpcs").predict("""Breast prosthesis, mastectomy bra, with integrated breast prosthesis form, unilateral, any size, any type""") ```
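Conceptually, the resolver returns the candidate code whose sentence embedding lies closest to the input chunk's embedding under the configured Euclidean distance. A toy sketch with made-up two-dimensional vectors follows (the real model uses full sBioBERT embeddings and its own index; the codes and numbers here are illustrative only):

```python
from math import dist

codes = ["L8001", "L8002", "L8000"]                    # candidate HCPCS codes (illustrative)
candidate_vecs = [(0.9, 0.1), (0.8, 0.3), (0.2, 0.7)]  # hypothetical embeddings
query_vec = (0.85, 0.15)                               # embedding of the input chunk

# Pick the code whose embedding has the smallest Euclidean distance to the query
best_code = min(zip(codes, candidate_vecs), key=lambda cv: dist(cv[1], query_vec))[0]
print(best_code)  # → L8001
```

The `all_codes` column in the results below corresponds to the same ranking, ordered by increasing distance rather than truncated to the single best match.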
## Results ```bash +--+---------------------------------------------------------------------------------------------------------+----------+----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | |ner_chunk |hcpcs_code|all_codes |resolutions | +--+---------------------------------------------------------------------------------------------------------+----------+----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |0 |Breast prosthesis, mastectomy bra, with integrated breast prosthesis form, unilateral, any size, any type|L8001 |[L8001, L8002, L8000, L8033, L8032, ...]|'Breast prosthesis, mastectomy bra, with integrated breast prosthesis form, unilateral, any size, any type', 'Breast prosthesis, mastectomy bra, with integrated breast prosthesis form, bilateral, any size, any type', 'Breast prosthesis, mastectomy bra, without integrated breast prosthesis form, any size, any type', 'Nipple prosthesis, custom fabricated, reusable, any material, any type, each', ...| 
+--+---------------------------------------------------------------------------------------------------------+----------+----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_hcpcs| |Compatibility:|Healthcare NLP 3.4.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[hcpcs_code]| |Language:|en| |Size:|21.5 MB| |Case sensitive:|false| --- layout: model title: Split Sentences in English Texts author: John Snow Labs name: sentence_detector_dl class: SentenceDetectorDLModel language: en nav_key: models date: 2020-09-13 task: Sentence Detection edition: Spark NLP 2.6.2 spark_version: 2.4 tags: [open_source,sentence_detection,en] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation. 
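To see why a learned boundary detector helps over simple rules, consider a naive regex splitter (our own sketch, not part of Spark NLP): it handles clean prose, but misses any boundary where the space after the period is missing, which is exactly the kind of input the usage example below feeds to the model.

```python
import re

def naive_split(text: str) -> list:
    """Rule-based splitting: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(naive_split("John loves Mary. Mary loves Peter."))  # two sentences, as expected
print(naive_split("John loves Mary.Mary loves Peter."))   # one blob: no space after '.'
```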
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_en_2.6.2_2.4_1600002888450.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_en_2.6.2_2.4_1600002888450.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "en") \ .setInputCols(["document"]) \ .setOutputCol("sentences") sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL])) sd_model.fullAnnotate("""John loves Mary.Mary loves Peter. Peter loves Helen .Helen loves John; Total: four people involved.""") ``` ```scala val documenter = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols(Array("document")) .setOutputCol("sentence") val pipeline = new Pipeline().setStages(Array(documenter, model)) val data = Seq("John loves Mary.Mary loves Peter. Peter loves Helen .Helen loves John; Total: four people involved.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""John loves Mary.Mary loves Peter. Peter loves Helen .Helen loves John; Total: four people involved."""] sent_df = nlu.load('sentence_detector.deep').predict(text, output_level='sentence') sent_df ```
{:.h2_title} ## Results ```bash +---+------------------------------+ | 0 | John loves Mary. | +---+------------------------------+ | 1 | Mary loves Peter | +---+------------------------------+ | 2 | Peter loves Helen . | +---+------------------------------+ | 3 | Helen loves John; | +---+------------------------------+ | 4 | Total: four people involved. | +---+------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---------------|-------------------------------------------| | Name: | sentence_detector_dl | | Type: | SentenceDetectorDLModel | | Compatibility: | Spark NLP 2.6.2+ | | License: | Open Source | | Edition: | Official | |Input labels: | [document] | |Output labels: | [sentence] | | Language: | en | {:.h2_title} ## Data Source Please visit the [repo](https://github.com/dbmdz/deep-eos) for more information. {:.h2_title} ## Benchmarking ```bash label Accuracy Recall Prec F1 0 0.98 1.00 0.96 0.98 ``` --- layout: model title: English image_classifier_vit_rust_image_classification_10 ViTForImageClassification from SummerChiam author: John Snow Labs name: image_classifier_vit_rust_image_classification_10 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rust_image_classification_10` is an English model originally trained by SummerChiam. 
## Predicted Entities `nonrust`, `rust` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_10_en_4.1.0_3.0_1660166162294.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_10_en_4.1.0_3.0_1660166162294.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_rust_image_classification_10", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_rust_image_classification_10", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_rust_image_classification_10| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Fast Neural Machine Translation Model from Kongo to English author: John Snow Labs name: opus_mt_kg_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, kg, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `kg` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_kg_en_xx_2.7.0_2.4_1609163723315.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_kg_en_xx_2.7.0_2.4_1609163723315.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_kg_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_kg_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.kg.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_kg_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Turkish BERT Base Cased (BERTurk) author: John Snow Labs name: bert_base_turkish_cased date: 2021-05-20 tags: [open_source, embeddings, bert, tr, turkish] task: Embeddings language: tr edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description BERTurk is a community-driven cased BERT model for Turkish. Some datasets used for pretraining and evaluation are contributed from the awesome Turkish NLP community, as well as the decision for the model name: BERTurk. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_turkish_cased_tr_3.1.0_2.4_1621508465134.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_turkish_cased_tr_3.1.0_2.4_1621508465134.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_base_turkish_cased", "tr") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_base_turkish_cased", "tr") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("tr.embed.bert").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_turkish_cased| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[embeddings]| |Language:|tr| |Case sensitive:|true| ## Data Source [https://huggingface.co/dbmdz/bert-base-turkish-cased](https://huggingface.co/dbmdz/bert-base-turkish-cased) ## Benchmarking For results on PoS tagging or NER tasks, please refer to [this repository](https://github.com/stefan-it/turkish-bert). --- layout: model title: Named Entity Recognition (NER) Model in Danish (Dane 6B 100) author: John Snow Labs name: dane_ner_6B_100 date: 2020-08-30 task: Named Entity Recognition language: da edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [ner, da, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description DaNE is a Named Entity Recognition (or NER) model for Danish, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. Dane NER 6B 100 is trained with GloVe 6B 100 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_DA/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/dane_ner_6B_100_da_2.6.0_2.4_1598810267725.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/dane_ner_6B_100_da_2.6.0_2.4_1598810267725.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained('glove_100d') \ .setInputCols(['document', 'token']) \ .setOutputCol('embeddings') ner_model = NerDLModel.pretrained("dane_ner_6B_100", "da") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, softwareudvikler, investor og filantrop. Han er bedst kendt som medstifter af Microsoft Corporation. I løbet af sin karriere hos Microsoft havde Gates stillinger som formand, administrerende direktør (administrerende direktør), præsident og chefsoftwarearkitekt, samtidig med at han var den største individuelle aktionær indtil maj 2014. Han er en af \u200b\u200bde mest kendte iværksættere og pionerer inden for mikrocomputerrevolution i 1970'erne og 1980'erne. Født og opvokset i Seattle, Washington, var Gates grundlægger af Microsoft sammen med barndomsvennen Paul Allen i 1975 i Albuquerque, New Mexico; det fortsatte med at blive verdens største virksomhed inden for personlig computersoftware. Gates førte virksomheden som formand og administrerende direktør, indtil han trådte tilbage som administrerende direktør i januar 2000, men han forblev formand og blev chefsoftwarearkitekt. I slutningen af \u200b\u200b1990'erne var Gates blevet kritiseret for sin forretningstaktik, der er blevet betragtet som konkurrencebegrænsende. Denne udtalelse er blevet opretholdt ved adskillige retsafgørelser. 
I juni 2006 meddelte Gates, at han ville overgå til en deltidsrolle i Microsoft og fuldtidsarbejde i Bill & Melinda Gates Foundation, det private velgørende fundament, som han og hans kone, Melinda Gates, oprettede i 2000. Han overførte gradvist sine pligter til Ray Ozzie og Craig Mundie. Han trådte tilbage som formand for Microsoft i februar 2014 og tiltrådte en ny stilling som teknologirådgiver for at støtte den nyudnævnte administrerende direktør Satya Nadella."]], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_100d") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("dane_ner_6B_100", "da") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, softwareudvikler, investor og filantrop. Han er bedst kendt som medstifter af Microsoft Corporation. I løbet af sin karriere hos Microsoft havde Gates stillinger som formand, administrerende direktør (administrerende direktør), præsident og chefsoftwarearkitekt, samtidig med at han var den største individuelle aktionær indtil maj 2014. Han er en af de mest kendte iværksættere og pionerer inden for mikrocomputerrevolution i 1970'erne og 1980'erne. Født og opvokset i Seattle, Washington, var Gates grundlægger af Microsoft sammen med barndomsvennen Paul Allen i 1975 i Albuquerque, New Mexico; det fortsatte med at blive verdens største virksomhed inden for personlig computersoftware. Gates førte virksomheden som formand og administrerende direktør, indtil han trådte tilbage som administrerende direktør i januar 2000, men han forblev formand og blev chefsoftwarearkitekt. 
I slutningen af ​​1990'erne var Gates blevet kritiseret for sin forretningstaktik, der er blevet betragtet som konkurrencebegrænsende. Denne udtalelse er blevet opretholdt ved adskillige retsafgørelser. I juni 2006 meddelte Gates, at han ville overgå til en deltidsrolle i Microsoft og fuldtidsarbejde i Bill & Melinda Gates Foundation, det private velgørende fundament, som han og hans kone, Melinda Gates, oprettede i 2000. Han overførte gradvist sine pligter til Ray Ozzie og Craig Mundie. Han trådte tilbage som formand for Microsoft i februar 2014 og tiltrådte en ny stilling som teknologirådgiver for at støtte den nyudnævnte administrerende direktør Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, softwareudvikler, investor og filantrop. Han er bedst kendt som medstifter af Microsoft Corporation. I løbet af sin karriere hos Microsoft havde Gates stillinger som formand, administrerende direktør (administrerende direktør), præsident og chefsoftwarearkitekt, samtidig med at han var den største individuelle aktionær indtil maj 2014. Han er en af ​​de mest kendte iværksættere og pionerer inden for mikrocomputerrevolution i 1970'erne og 1980'erne. Født og opvokset i Seattle, Washington, var Gates grundlægger af Microsoft sammen med barndomsvennen Paul Allen i 1975 i Albuquerque, New Mexico; det fortsatte med at blive verdens største virksomhed inden for personlig computersoftware. Gates førte virksomheden som formand og administrerende direktør, indtil han trådte tilbage som administrerende direktør i januar 2000, men han forblev formand og blev chefsoftwarearkitekt. I slutningen af ​​1990'erne var Gates blevet kritiseret for sin forretningstaktik, der er blevet betragtet som konkurrencebegrænsende. Denne udtalelse er blevet opretholdt ved adskillige retsafgørelser. 
I juni 2006 meddelte Gates, at han ville overgå til en deltidsrolle i Microsoft og fuldtidsarbejde i Bill & Melinda Gates Foundation, det private velgørende fundament, som han og hans kone, Melinda Gates, oprettede i 2000. Han overførte gradvist sine pligter til Ray Ozzie og Craig Mundie. Han trådte tilbage som formand for Microsoft i februar 2014 og tiltrådte en ny stilling som teknologirådgiver for at støtte den nyudnævnte administrerende direktør Satya Nadella."""] ner_df = nlu.load('da.ner.6B100D').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
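The pipeline's `ner_converter` stage merges the token-level IOB tags produced by the NER model into entity chunks. As a conceptual illustration only (this is not Spark NLP's actual implementation), the merge rule can be sketched in plain Python:

```python
# Conceptual sketch of IOB-tag merging: "B-" starts a chunk, a matching "I-"
# extends it, anything else closes it. Illustration only, not Spark NLP code.
def iob_to_chunks(tokens, tags):
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = (token, tag[2:])
        elif tag.startswith("I-") and current and tag[2:] == current[1]:
            current = (current[0] + " " + token, current[1])
        else:
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return chunks

tokens = ["William", "Henry", "Gates", "III", "er", "medstifter", "af", "Microsoft"]
tags = ["B-PER", "I-PER", "I-PER", "I-PER", "O", "O", "O", "B-ORG"]
print(iob_to_chunks(tokens, tags))
# [('William Henry Gates III', 'PER'), ('Microsoft', 'ORG')]
```

The chunk/label pairs this produces have the same shape as the `chunk` and `ner_label` columns shown in the Results table.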
{:.h2_title} ## Results ```bash +-------------------------------+---------+ |chunk |ner_label| +-------------------------------+---------+ |William Henry Gates III |PER | |amerikansk |MISC | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PER | |1970'erne |MISC | |1980'erne |MISC | |Seattle |LOC | |Washington |LOC | |Gates |LOC | |Microsoft |ORG | |Paul Allen |PER | |Albuquerque |LOC | |New Mexico |LOC | |1990'erne |MISC | |Gates |PER | |Gates |PER | |Microsoft |ORG | |Bill & Melinda Gates Foundation|ORG | |Melinda Gates |PER | +-------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|dane_ner_6B_100| |Type:|ner| |Compatibility:|Spark NLP 2.6.0+| |Edition:|Official| |License:|Open Source| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|da| |Case sensitive:|false| {:.h2_title} ## Data Source The detailed information can be found at [https://www.aclweb.org/anthology/2020.lrec-1.565.pdf](https://www.aclweb.org/anthology/2020.lrec-1.565.pdf) --- layout: model title: English T5ForConditionalGeneration Small Cased model (from Unbabel) author: John Snow Labs name: t5_gec_small date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `gec-t5_small` is an English model originally trained by `Unbabel`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_gec_small_en_4.3.0_3.0_1675102492460.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_gec_small_en_4.3.0_3.0_1675102492460.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_gec_small","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_gec_small","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
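Since this is a grammatical-error-correction model, a natural way to inspect its output is to diff the input sentence against the corrected text in the `answers` column. A minimal, framework-free sketch using Python's standard `difflib` (the sentences below are made-up examples, not actual model output):

```python
import difflib

# Hypothetical GEC input/output pair; diff them token-by-token to see
# which words the correction changed.
source = "He are going to school .".split()
corrected = "He is going to school .".split()

for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, source, corrected).get_opcodes():
    if op != "equal":
        print(op, source[i1:i2], "->", corrected[j1:j2])
# replace ['are'] -> ['is']
```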
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_gec_small| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|287.5 MB| ## References - https://huggingface.co/Unbabel/gec-t5_small - https://arxiv.org/pdf/2106.03830.pdf --- layout: model title: Word2Vec Embeddings in Vietnamese (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, vi, open_source] task: Embeddings language: vi edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_vi_3.4.1_3.0_1647466838102.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_vi_3.4.1_3.0_1647466838102.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","vi") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Tôi yêu Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","vi") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Tôi yêu Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("vi.embed.w2v_cc_300d").predict("""Tôi yêu Spark NLP""") ```
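Once tokens are mapped to vectors by the `embeddings` column, semantically similar words end up with similar vectors, typically compared via cosine similarity. A toy sketch of that comparison (the 3-d vectors below are made up for illustration; the real model uses 300 dimensions):

```python
import math

# Cosine similarity between two embedding vectors: dot product divided by
# the product of their norms. Toy vectors, not actual w2v_cc_300d output.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

v_cat = [0.9, 0.1, 0.3]
v_dog = [0.8, 0.2, 0.25]
v_car = [0.1, 0.9, 0.7]
print(cosine(v_cat, v_dog) > cosine(v_cat, v_car))  # True: cat is closer to dog
```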
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|vi| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Spanish BertForSequenceClassification Tiny Cased model (from mrm8488) author: John Snow Labs name: bert_sequence_classifier_spanish_tinybert_betito_finetuned_xnli date: 2022-07-13 tags: [es, open_source, bert, sequence_classification] task: Text Classification language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanish-TinyBERT-betito-finetuned-xnli-es` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_spanish_tinybert_betito_finetuned_xnli_es_4.0.0_3.0_1657720702941.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_spanish_tinybert_betito_finetuned_xnli_es_4.0.0_3.0_1657720702941.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") classifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_spanish_tinybert_betito_finetuned_xnli","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, classifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val classifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_spanish_tinybert_betito_finetuned_xnli","es") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, classifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
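Under the hood, a sequence-classification head produces one logit per label, and a softmax turns those logits into the probabilities behind the `class` column. A minimal sketch of that final step (the logit values and label order here are made up for illustration; XNLI uses entailment/neutral/contradiction labels):

```python
import math

# Numerically stable softmax over classification logits, followed by argmax.
# Logits are invented for this example, not actual model output.
def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

labels = ["entailment", "neutral", "contradiction"]
logits = [2.1, 0.3, -1.2]
probs = softmax(logits)
print(labels[probs.index(max(probs))])  # entailment
```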
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_spanish_tinybert_betito_finetuned_xnli| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|es| |Size:|54.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mrm8488/spanish-TinyBERT-betito-finetuned-xnli-es --- layout: model title: Legal Agreement Document Binary Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_agreement_bert date: 2022-12-18 tags: [en, legal, classification, licensed, document, bert, agreement, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_agreement_bert_en_1.0.0_3.0_1671393839002.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_agreement_bert_en_1.0.0_3.0_1671393839002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[agreement]| |[other]| |[other]| |[agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support agreement 0.75 0.71 0.73 90 other 0.88 0.90 0.89 204 accuracy - - 0.84 294 macro-avg 0.81 0.80 0.81 294 weighted-avg 0.84 0.84 0.84 294 ``` --- layout: model title: Detect PHI in medical text (biobert) author: John Snow Labs name: ner_deid_biobert date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect PHI such as Name, Date, Age, etc. in medical text for de-identification. We stuck to the official annotation guidelines (AG) for the 2014 i2b2 Deid challenge while annotating new datasets for this model. 
All the details regarding the nuances and explanations for AG can be found here [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/) ## Predicted Entities `PROFESSION`, `CONTACT`, `DATE`, `NAME`, `AGE`, `ID`, `LOCATION` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_biobert_en_3.0.0_3.0_1617260631832.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_biobert_en_3.0.0_3.0_1617260631832.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased") \ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_deid_biobert", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_deid_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("EXAMPLE_TEXT").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu 
nlu.load("en.med_ner.deid.biobert").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_biobert| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| --- layout: model title: Translate Niuean to English Pipeline author: John Snow Labs name: translate_niu_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, niu, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `niu` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_niu_en_xx_2.7.0_2.4_1609690816720.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_niu_en_xx_2.7.0_2.4_1609690816720.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_niu_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_niu_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.niu.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_niu_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Detect PHI in medical text author: John Snow Labs name: ner_deid_biobert_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, deidentification, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_deid_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_deid_biobert_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_biobert_pipeline_en_3.4.1_3.0_1647866776890.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_biobert_pipeline_en_3.4.1_3.0_1647866776890.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_biobert_pipeline", "en", "clinical/models") pipeline.annotate("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""") ``` ```scala val pipeline = new PretrainedPipeline("ner_deid_biobert_pipeline", "en", "clinical/models") pipeline.annotate("A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.") ``` {:.nlu-block} ```python import nlu nlu.load("en.deid.ner_biobert.pipeline").predict("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""") ```
## Results ```bash +-----------------------------+--------+ |chunks |entities| +-----------------------------+--------+ |2093-01-13 |DATE | |David Hale |NAME | |Hendrickson |NAME | |Ora |NAME | |7194334 |ID | |01/13/93 |DATE | |Oliveira |LOCATION| |1-11-2000 |DATE | |Cocke County Baptist Hospital|LOCATION| |Keats Street |LOCATION| |Brothers |LOCATION| +-----------------------------+--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.0 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter --- layout: model title: Legal Titles Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_titles_bert date: 2023-03-05 tags: [en, legal, classification, clauses, titles, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Titles` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings of this model allow up to 512 tokens. If your input is longer, consider splitting it into smaller pieces (see the same tutorial linked above). This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, producing a series of True/False values, one for each legal clause model you have added. ## Predicted Entities `Titles`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_titles_bert_en_1.0.0_3.0_1678049976477.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_titles_bert_en_1.0.0_3.0_1678049976477.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_titles_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
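The document-splitting advice above (paragraph splitting by multiline, plus the 512-token embedding limit) can be sketched in plain Python. This is only an illustrative pre-processing step, not part of Spark NLP itself, and the whitespace token count is a rough proxy for the model's subword tokenizer:

```python
import re

def split_paragraphs(text):
    """Multiline splitting: break a document into provisions at blank lines."""
    return [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]

def within_token_budget(paragraph, max_tokens=512):
    # Rough whitespace count; the real 512-token limit applies to subword tokens.
    return len(paragraph.split()) <= max_tokens

doc = "Section 1. Titles clause text.\n\nSection 2. Another provision.\n\n\nSection 3."
provisions = [p for p in split_paragraphs(doc) if within_token_budget(p)]
```

Each surviving provision can then be fed to the classifier pipeline as one row of the input DataFrame.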
## Results ```bash +-------+ |result| +-------+ |[Titles]| |[Other]| |[Other]| |[Titles]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_titles_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.92 0.88 0.90 51 Titles 0.83 0.88 0.85 33 accuracy - - 0.88 84 macro-avg 0.87 0.88 0.88 84 weighted-avg 0.88 0.88 0.88 84 ``` --- layout: model title: English BertForQuestionAnswering Cased model (from ParanoidAndroid) author: John Snow Labs name: bert_qa_paranoidandroid_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `ParanoidAndroid`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_paranoidandroid_finetuned_squad_en_4.0.0_3.0_1657186188742.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_paranoidandroid_finetuned_squad_en_4.0.0_3.0_1657186188742.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_paranoidandroid_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_paranoidandroid_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_paranoidandroid_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ParanoidAndroid/bert-finetuned-squad --- layout: model title: Legal Term of employment Clause Binary Classifier author: John Snow Labs name: legclf_term_of_employment_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `term-of-employment` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings of this model allow up to 512 tokens. If your input is longer, consider splitting it into smaller pieces (see the same tutorial linked above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, producing a series of True/False values, one for each legal clause model you have added. ## Predicted Entities `other`, `term-of-employment` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_term_of_employment_clause_en_1.0.0_3.2_1660123080620.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_term_of_employment_clause_en_1.0.0_3.2_1660123080620.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_term_of_employment_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
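Running several binary clause classifiers over the same provision yields one True/False flag per classifier, as described above. A minimal sketch of how those outputs might be collected in plain Python (the `combine_clause_flags` helper is hypothetical; the model names and labels are the ones from this page):

```python
def combine_clause_flags(predictions):
    """Turn {classifier name: predicted label} into {classifier name: True/False}.

    A classifier 'fires' whenever it predicts anything other than its 'other' label.
    """
    return {name: label.lower() != "other" for name, label in predictions.items()}

# Example: two classifiers applied to the same provision.
flags = combine_clause_flags({
    "legclf_term_of_employment_clause": "term-of-employment",
    "legclf_titles_bert": "Other",
})
```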
## Results ```bash +-------+ |result| +-------+ |[term-of-employment]| |[other]| |[other]| |[term-of-employment]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_term_of_employment_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 0.98 0.99 96 term-of-employment 0.95 1.00 0.97 38 accuracy - - 0.99 134 macro-avg 0.97 0.99 0.98 134 weighted-avg 0.99 0.99 0.99 134 ``` --- layout: model title: Legal Master Lease Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_master_lease_agreement_bert date: 2022-11-25 tags: [en, legal, classification, agreement, master_lease, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_master_lease_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `master-lease-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `master-lease-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_master_lease_agreement_bert_en_1.0.0_3.0_1669367845988.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_master_lease_agreement_bert_en_1.0.0_3.0_1669367845988.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_master_lease_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[master-lease-agreement]| |[other]| |[other]| |[master-lease-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_master_lease_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support master-lease-agreement 0.93 0.85 0.89 33 other 0.93 0.97 0.95 65 accuracy - - 0.93 98 macro-avg 0.93 0.91 0.92 98 weighted-avg 0.93 0.93 0.93 98 ``` --- layout: model title: Mapping Entities (Drug Substances) with Corresponding UMLS CUI Codes author: John Snow Labs name: umls_drug_substance_mapper date: 2022-07-11 tags: [umls, chunk_mapper, en, licensed] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps entities (Drug Substances) to their corresponding UMLS CUI codes. ## Predicted Entities `umls_code` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/umls_drug_substance_mapper_en_4.0.0_3.0_1657578250652.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/umls_drug_substance_mapper_en_4.0.0_3.0_1657578250652.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("clinical_ner") ner_model_converter = NerConverterInternal()\ .setInputCols("sentence", "token", "clinical_ner")\ .setOutputCol("ner_chunk") chunkerMapper = ChunkMapperModel.pretrained("umls_drug_substance_mapper", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings")\ .setRels(["umls_code"])\ .setLowerCase(True) mapper_pipeline = Pipeline().setStages([ document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_model_converter, chunkerMapper]) test_data = spark.createDataFrame([["The patient was given metformin, lenvatinib and lavender 700 ml/ml"]]).toDF("text") result = mapper_pipeline.fit(test_data).transform(test_data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel .pretrained("ner_posology_greedy", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("clinical_ner") val ner_model_converter = 
new NerConverterInternal() .setInputCols("sentence", "token", "clinical_ner") .setOutputCol("ner_chunk") val chunkerMapper = ChunkMapperModel .pretrained("umls_drug_substance_mapper", "en", "clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("mappings") .setRels(Array("umls_code")) val mapper_pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_model_converter, chunkerMapper)) val test_data = Seq("The patient was given metformin, lenvatinib and lavender 700 ml/ml").toDF("text") val result = mapper_pipeline.fit(test_data).transform(test_data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.umls_drug_substance_mapper").predict("""The patient was given metformin, lenvatinib and lavender 700 ml/ml""") ```
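Conceptually, the chunk mapper behaves like a case-insensitive dictionary lookup from recognized drug chunks to UMLS CUIs. Here is a toy pure-Python sketch using only the three example mappings from this page (the real model covers the full UMLS drug-substance vocabulary, and `map_chunks` is a hypothetical helper, not a Spark NLP API):

```python
# Toy lookup table; the pretrained model ships tens of thousands of such entries.
umls_map = {
    "metformin": "C0025598",
    "lenvatinib": "C2986924",
    "lavender 700 ml/ml": "C0772360",
}

def map_chunks(chunks, mapping, lower_case=True):
    # Mirrors setLowerCase(True): normalize each chunk before lookup.
    return {c: mapping.get(c.lower() if lower_case else c, "NONE") for c in chunks}
```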
## Results ```bash +------------------+---------+ |ner_chunk |umls_code| +------------------+---------+ |metformin |C0025598 | |lenvatinib |C2986924 | |lavender 700 ml/ml|C0772360 | +------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|umls_drug_substance_mapper| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|30.1 MB| ## References 2022AA UMLS dataset’s Clinical Drug, Pharmacologic Substance, Antibiotic, Hazardous or Poisonous Substance categories. https://www.nlm.nih.gov/research/umls/index.html --- layout: model title: Stopwords Remover for Italian language (624 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, it, open_source] task: Stop Words Removal language: it edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_it_3.4.1_3.0_1646673016901.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_it_3.4.1_3.0_1646673016901.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","it") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Non sei migliore di me"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","it") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Non sei migliore di me").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("it.stopwords").predict("""Non sei migliore di me""") ```
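Under the hood, stopword removal is simply token filtering against a fixed list. A toy pure-Python sketch with a handful of Italian stopwords (the pretrained model ships the full 624-entry stopwords-iso list, not this subset):

```python
# Toy subset of Italian stopwords for illustration only.
stopwords_it = {"non", "sei", "di", "me", "e", "il", "la"}

def clean_tokens(text, stopwords):
    """Drop tokens whose lowercase form appears in the stopword list."""
    return [tok for tok in text.split() if tok.lower() not in stopwords]

clean_tokens("Non sei migliore di me", stopwords_it)
```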
## Results ```bash +----------+ |result | +----------+ |[migliore]| +----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|it| |Size:|3.1 KB| --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from huxxx657) author: John Snow Labs name: roberta_qa_base_finetuned_scrambled_squad_5_new date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-scrambled-squad-5-new` is an English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_scrambled_squad_5_new_en_4.3.0_3.0_1674216999446.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_scrambled_squad_5_new_en_4.3.0_3.0_1674216999446.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_scrambled_squad_5_new","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_scrambled_squad_5_new","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_finetuned_scrambled_squad_5_new| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-scrambled-squad-5-new --- layout: model title: English image_classifier_vit_snacks ViTForImageClassification from Shivagowri author: John Snow Labs name: image_classifier_vit_snacks date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_snacks` is an English model originally trained by Shivagowri. ## Predicted Entities `salad`, `candy`, `muffin`, `banana`, `grape`, `popcorn`, `pretzel`, `pineapple`, `juice`, `orange`, `doughnut`, `carrot`, `waffle`, `cake`, `cookie`, `ice cream`, `watermelon`, `hot dog`, `apple`, `strawberry` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_snacks_en_4.1.0_3.0_1660173257599.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_snacks_en_4.1.0_3.0_1660173257599.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_snacks", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_snacks", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_snacks| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.0 MB| --- layout: model title: BioBERT Sentence Embeddings (Pubmed) author: John Snow Labs name: sent_biobert_pubmed_base_cased date: 2020-09-19 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.2 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains the pre-trained weights of BioBERT, a language representation model for the biomedical domain, especially designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_biobert_pubmed_base_cased_en_2.6.2_2.4_1600449483871.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_biobert_pubmed_base_cased_en_2.6.2_2.4_1600449483871.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.biobert.pubmed_base_cased').predict(text, output_level='sentence') embeddings_df ```
{:.h2_title} ## Results ```bash en_embed_sentence_biobert_pubmed_base_cased_embeddings sentence [0.209750697016716, 0.21535921096801758, -0.59... I hate cancer [0.01466107927262783, -0.20778851211071014, -0... Antibiotics aren't painkiller ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_biobert_pubmed_base_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.2| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert) --- layout: model title: Legal General Distribution Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_general_distribution_agreement_bert date: 2022-11-24 tags: [en, legal, classification, agreement, general_distribution, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_general_distribution_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `general-distribution-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. 
## Predicted Entities `general-distribution-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_general_distribution_agreement_bert_en_1.0.0_3.0_1669313767606.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_general_distribution_agreement_bert_en_1.0.0_3.0_1669313767606.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_general_distribution_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[general-distribution-agreement]| |[other]| |[other]| |[general-distribution-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_general_distribution_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.3 MB| ## References Legal documents scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support general-distribution-agreement 0.91 1.00 0.95 10 other 1.00 0.86 0.92 7 accuracy - - 0.94 17 macro-avg 0.95 0.93 0.94 17 weighted-avg 0.95 0.94 0.94 17 ``` --- layout: model title: Legal Records Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_records_bert date: 2023-03-05 tags: [en, legal, classification, clauses, records, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Records` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.
If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Keep in mind that the embeddings used by this model allow up to 512 tokens; if your text is longer, consider splitting it into smaller pieces (see the same tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, yielding a series of True/False values for each of the clause models you have added. ## Predicted Entities `Records`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_records_bert_en_1.0.0_3.0_1678049931654.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_records_bert_en_1.0.0_3.0_1678049931654.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_records_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
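The paragraph-level splitting recommended above, together with a check against the 512-token embedding limit, can be done in plain Python before building the Spark DataFrame. A minimal sketch, assuming paragraphs are separated by blank lines (whitespace tokens only approximate the embedding model's tokenizer):

```python
import re

MAX_TOKENS = 512  # approximate window of the sentence embeddings used here

def split_paragraphs(text, max_tokens=MAX_TOKENS):
    """Split a document on blank lines; flag paragraphs that may exceed
    the embedding window (whitespace tokens approximate real tokens)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

doc = ("RECORDS. The Company shall maintain complete books and records.\n\n"
       "GOVERNING LAW. This Agreement is governed by the laws of Delaware.")
for paragraph, fits in split_paragraphs(doc):
    print(fits, paragraph[:20])
```

Each resulting paragraph can then become one row of the `text` column fed to the pipeline.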
## Results ```bash +-------+ |result| +-------+ |[Records]| |[Other]| |[Other]| |[Records]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_records_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.93 0.98 0.95 51 Records 0.96 0.87 0.91 30 accuracy - - 0.94 81 macro-avg 0.94 0.92 0.93 81 weighted-avg 0.94 0.94 0.94 81 ``` --- layout: model title: Sentence Entity Resolver for UMLS CUI Codes (Clinical Drug) author: John Snow Labs name: sbiobertresolve_umls_clinical_drugs date: 2021-10-11 tags: [entity_resolution, licensed, clinical, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.2.3 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities to UMLS CUI codes. It is trained on the `2021AB` UMLS dataset. The complete dataset has 127 different categories, and this model is trained on the `Clinical Drug` category using `sbiobert_base_cased_mli` embeddings.
## Predicted Entities `Predicts UMLS codes for Clinical Drug medical concepts` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_clinical_drugs_en_3.2.3_3.0_1633912003256.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_clinical_drugs_en_3.2.3_3.0_1633912003256.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The ```sbiobertresolve_umls_clinical_drugs``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_posology``` as the NER model, with ```DRUG``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_umls_clinical_drugs","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") pipeline = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver]) data = spark.createDataFrame([["""She was immediately given hydrogen peroxide 30 mg to treat the infection on her leg, and has been advised Neosporin Cream for 5 days. She has a history of taking magnesium hydroxide 100mg/1ml and metformin 1000 mg."""]]).toDF("text") results = pipeline.fit(data).transform(data) ``` ```scala ... val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") .setCaseSensitive(false) val resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_umls_clinical_drugs", "en", "clinical/models") .setInputCols(Array("ner_chunk_doc", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val p_model = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver)) val data = Seq("She was immediately given hydrogen peroxide 30 mg to treat the infection on her leg, and has been advised Neosporin Cream for 5 days. 
She has a history of taking magnesium hydroxide 100mg/1ml and metformin 1000 mg.").toDF("text") val res = p_model.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.umls_clinical_drugs").predict("""She was immediately given hydrogen peroxide 30 mg to treat the infection on her leg, and has been advised Neosporin Cream for 5 days. She has a history of taking magnesium hydroxide 100mg/1ml and metformin 1000 mg.""") ```
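Conceptually, the resolver embeds each NER chunk with `sbiobert_base_cased_mli` and retrieves the closest UMLS concepts by Euclidean distance, matching `setDistanceFunction("EUCLIDEAN")` above. A toy sketch with made-up 3-dimensional vectors and a tiny index, not the model's real 768-dimensional embeddings:

```python
import math

# Hypothetical concept index: UMLS CUI -> toy embedding vector.
INDEX = {
    "C1126248": [0.9, 0.1, 0.0],   # hydrogen peroxide 30 mg/ml
    "C0987664": [0.0, 0.8, 0.2],   # metformin 1000 mg
    "C0132149": [0.1, 0.1, 0.9],   # neosporin cream
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def resolve(chunk_vec, k=2):
    """Return the k nearest CUI codes for a chunk embedding."""
    return sorted(INDEX, key=lambda cui: euclidean(chunk_vec, INDEX[cui]))[:k]

print(resolve([0.85, 0.15, 0.05], k=1))  # ['C1126248']
```

The `all_k_codes` column in the results below is exactly this kind of ranked top-k list, produced at scale.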
## Results ```bash | | chunk | code | code_description | all_k_code_desc | all_k_codes | |---:|:------------------------------|:---------|:---------------------------|:-------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | 0 | hydrogen peroxide 30 mg | C1126248 | hydrogen peroxide 30 mg/ml | ['C1126248', 'C0304655', 'C1605252', 'C0304656', 'C1154260'] | ['hydrogen peroxide 30 mg/ml', 'hydrogen peroxide solution 30%', 'hydrogen peroxide 30 mg/ml [proxacol]', 'hydrogen peroxide 30 mg/ml cutaneous solution', 'benzoyl peroxide 30 mg/ml'] | | 1 | Neosporin Cream | C0132149 | neosporin cream | ['C0132149', 'C0358174', 'C0357999', 'C0307085', 'C0698810'] | ['neosporin cream', 'nystan cream', 'nystadermal cream', 'nupercainal cream', 'nystaform cream'] | | 2 | magnesium hydroxide 100mg/1ml | C1134402 | magnesium hydroxide 100 mg | ['C1134402', 'C1126785', 'C4317023', 'C4051486', 'C4047137'] | ['magnesium hydroxide 100 mg', 'magnesium hydroxide 100 mg/ml', 'magnesium sulphate 100mg/ml injection', 'magnesium sulfate 100 mg', 'magnesium sulfate 100 mg/ml'] | | 3 | metformin 1000 mg | C0987664 | metformin 1000 mg | ['C0987664', 'C2719784', 'C0978482', 'C2719786', 'C4282269'] | ['metformin 1000 mg', 'metformin hydrochloride 1000 mg', 'metformin hcl 1000mg tab', 'metformin hydrochloride 1000 mg [fortamet]', 'metformin hcl 1000mg sa tab'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_umls_clinical_drugs| |Compatibility:|Healthcare NLP 3.2.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_chunk_embeddings]| |Output Labels:|[output]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on `2021AB` UMLS dataset's `Clinical Drug` category. 
https://www.nlm.nih.gov/research/umls/index.html --- layout: model title: Legal NER for NDA (Non-compete Clause) author: John Snow Labs name: legner_nda_non_compete date: 2023-04-10 tags: [en, legal, licensed, ner, nda, non_compete] task: Named Entity Recognition language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an NER model, aimed to be run **only** after detecting the `NON_COMP` clause with a proper classifier (use the `legmulticlf_mnda_sections_paragraph_other` model for that purpose). It will extract the following entities: `NON_COMPETE_COUNTRY` and `NON_COMPETE_TERM`. ## Predicted Entities `NON_COMPETE_COUNTRY`, `NON_COMPETE_TERM` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_nda_non_compete_en_1.0.0_3.0_1681096039352.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_nda_non_compete_en_1.0.0_3.0_1681096039352.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_nda_non_compete", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""The employee shall not engage in any business activities that compete with the company in France for a period of two years after leaving the company."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
## Results ```bash +---------+-------------------+ |chunk |ner_label | +---------+-------------------+ |France |NON_COMPETE_COUNTRY| |two years|NON_COMPETE_TERM | +---------+-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_nda_non_compete| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.2 MB| ## References In-house annotations on Non-disclosure Agreements ## Benchmarking ```bash label precision recall f1-score support NON_COMPETE_COUNTRY 1.00 1.00 1.00 8 NON_COMPETE_TERM 1.00 1.00 1.00 15 micro-avg 1.00 1.00 1.00 23 macro-avg 1.00 1.00 1.00 23 weighted-avg 1.00 1.00 1.00 23 ``` --- layout: model title: Understand Increased or Decreased Amounts and Percentages in Context author: John Snow Labs name: finassertion_increase_decrease_amounts date: 2023-03-11 tags: [en, licensed, finance, assertion, responsibility] task: Assertion Status language: en edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true recommended: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an Assertion Status model which allows you to detect whether a mentioned amount or percentage is stated to have increased or decreased in context. ## Predicted Entities `INCREASE`, `DECREASE`, `NOT_STATED` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finassertion_increase_decrease_amounts_en_1.0.0_3.0_1678533483377.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finassertion_increase_decrease_amounts_en_1.0.0_3.0_1678533483377.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pandas as pd document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token")\ .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'", '%', '&']) embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = finance.BertForTokenClassification.pretrained("finner_responsibility_reports", "en", "finance/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(["AMOUNT", "PERCENTAGE"]) fin_assertion = finance.AssertionDLModel.pretrained("finassertion_increase_decrease_amounts", "en", "finance/models")\ .setInputCols(["sentence", "ner_chunk", "embeddings"])\ .setOutputCol("assertion") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter, fin_assertion ]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text_list = ["""This reduction in GHG emissions from the previous year can be attributed to a decrease in Scope 2 emissions from indirect energy use, which decreased from 13,907 metric tons CO2e in 2020 to 12,297 metric tons CO2e in 2021.""", """Cal Water's year-over-year total energy consumption increased slightly from 584,719 GJ in 2020 to 587,923 GJ in 2021.""", """In 2020, 89 % of our employees received a year-end performance review while in 2021, this increased to 93 %.""", """With over 80,000 consultants and professionals in 400 locations globally, CGI has a strong presence in the 
technology sector, offering end-to-end services to over 5,500 clients ."""] df = spark.createDataFrame(pd.DataFrame({"text" : text_list})) result = model.transform(df) ```
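For intuition only: assertion status here answers "does the context say this number went up, went down, or neither?". The keyword heuristic below is a toy illustration of that idea, not what the trained `finassertion_increase_decrease_amounts` model does (the model relies on contextual embeddings rather than keyword matching):

```python
def toy_assertion(sentence, chunk):
    """Crude directional-cue lookup around an amount; a real assertion
    model uses the learned context representation instead of keywords."""
    if chunk not in sentence:
        return None
    context = sentence.lower()
    if any(w in context for w in ("increase", "increased", "rose", "grew")):
        return "INCREASE"
    if any(w in context for w in ("decrease", "decreased", "fell", "declined", "reduction")):
        return "DECREASE"
    return "NOT_STATED"

print(toy_assertion("Total energy consumption increased from 584,719 GJ to 587,923 GJ.",
                    "584,719"))  # INCREASE
```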
## Results ```bash +-------+----------+----------+ |chunk |ner_label |assertion | +-------+----------+----------+ |13,907 |AMOUNT |DECREASE | |12,297 |AMOUNT |DECREASE | |584,719|AMOUNT |INCREASE | |587,923|AMOUNT |INCREASE | |89 % |PERCENTAGE|INCREASE | |93 % |PERCENTAGE|INCREASE | |80,000 |AMOUNT |NOT_STATED| |400 |AMOUNT |NOT_STATED| |5,500 |AMOUNT |NOT_STATED| +-------+----------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finassertion_increase_decrease_amounts| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, ner_chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| |Size:|2.2 MB| ## References In-house annotations on Responsibility and ESG Reports ## Benchmarking ```bash label precision recall f1-score support DECREASE 0.88 0.91 0.89 97 INCREASE 0.84 0.77 0.80 56 NOT_STATED 0.89 0.90 0.89 94 accuracy - - 0.87 247 macro-avg 0.87 0.86 0.86 247 weighted-avg 0.87 0.87 0.87 247 ``` --- layout: model title: Spanish NER Pipeline author: John Snow Labs name: roberta_token_classifier_bne_capitel_ner_pipeline date: 2022-06-25 tags: [roberta, token_classifier, spanish, ner, es, open_source] task: Named Entity Recognition language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [roberta_token_classifier_bne_capitel_ner_es](https://nlp.johnsnowlabs.com/2021/12/07/roberta_token_classifier_bne_capitel_ner_es.html) model.
## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_bne_capitel_ner_pipeline_es_4.0.0_3.0_1656123876363.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_bne_capitel_ner_pipeline_es_4.0.0_3.0_1656123876363.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("roberta_token_classifier_bne_capitel_ner_pipeline", lang = "es") pipeline.annotate("Me llamo Antonio y trabajo en la fábrica de Mercedes-Benz en Madrid.") ``` ```scala val pipeline = new PretrainedPipeline("roberta_token_classifier_bne_capitel_ner_pipeline", lang = "es") pipeline.annotate("Me llamo Antonio y trabajo en la fábrica de Mercedes-Benz en Madrid.") ```
## Results ```bash +------------------------+---------+ |chunk |ner_label| +------------------------+---------+ |Antonio |PER | |fábrica de Mercedes-Benz|ORG | |Madrid |LOC | +------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_token_classifier_bne_capitel_ner_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|es| |Size:|459.4 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - RoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: English RobertaForQuestionAnswering Cased model (from alistvt) author: John Snow Labs name: roberta_qa_01_dialdoc date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `01-roberta-dialdoc` is a English model originally trained by `alistvt`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_01_dialdoc_en_4.3.0_3.0_1674206907196.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_01_dialdoc_en_4.3.0_3.0_1674206907196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_01_dialdoc","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_01_dialdoc","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_01_dialdoc| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/alistvt/01-roberta-dialdoc --- layout: model title: Named Entity Recognition - BERT Tiny (OntoNotes) author: John Snow Labs name: onto_small_bert_L2_128 date: 2020-12-05 task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [ner, open_source, en] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Onto is a Named Entity Recognition (or NER) model trained on OntoNotes 5.0. It can extract up to 18 entities such as people, places, organizations, money, time, date, etc. This model uses the pretrained `small_bert_L2_128` embeddings model from the `BertEmbeddings` annotator as an input. ## Predicted Entities `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_small_bert_L2_128_en_2.7.0_2.4_1607198998042.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_small_bert_L2_128_en_2.7.0_2.4_1607198998042.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... ner_onto = NerDLModel.pretrained("onto_small_bert_L2_128", "en") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."]], ["text"])) ``` ```scala ... val ner_onto = NerDLModel.pretrained("onto_small_bert_L2_128", "en") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter)) val data = Seq("William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."""] ner_df = nlu.load('en.ner.onto.bert.small_l2_128').predict(text, output_level='chunk') ner_df[["entities", "entities_class"]] ```
{:.h2_title} ## Results ```bash +----------------------------+---------+ |chunk |ner_label| +----------------------------+---------+ |William Henry Gates III |PERSON | |October 28, 1955 |DATE | |American |NORP | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PERSON | |May 2014 |DATE | |the microcomputer revolution|EVENT | |1970s |DATE | |1980s |DATE | |Seattle |GPE | |Washington |GPE | |Paul Allen |PERSON | |1975 |DATE | |Albuquerque |GPE | |New Mexico |GPE | |Gates |PERSON | |January 2000 |DATE | |the late 1990s |DATE | |Gates |PERSON | +----------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_small_bert_L2_128| |Type:|ner| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source The model is trained based on data from [OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) ## Benchmarking ```bash Micro-average: prec: 0.86477494, rec: 0.8204466, f1: 0.8420278 CoNLL Eval: processed 152728 tokens with 11257 phrases; found: 10772 phrases; correct: 9153. 
accuracy: 96.73%; 9153 11257 10772 precision: 84.97%; recall: 81.31%; FB1: 83.10 CARDINAL: 733 935 890 precision: 82.36%; recall: 78.40%; FB1: 80.33 890 DATE: 1278 1602 1494 precision: 85.54%; recall: 79.78%; FB1: 82.56 1494 EVENT: 22 63 45 precision: 48.89%; recall: 34.92%; FB1: 40.74 45 FAC: 67 135 114 precision: 58.77%; recall: 49.63%; FB1: 53.82 114 GPE: 2044 2240 2201 precision: 92.87%; recall: 91.25%; FB1: 92.05 2201 LANGUAGE: 8 22 14 precision: 57.14%; recall: 36.36%; FB1: 44.44 14 LAW: 12 40 15 precision: 80.00%; recall: 30.00%; FB1: 43.64 15 LOC: 104 179 155 precision: 67.10%; recall: 58.10%; FB1: 62.28 155 MONEY: 265 314 316 precision: 83.86%; recall: 84.39%; FB1: 84.13 316 NORP: 775 841 886 precision: 87.47%; recall: 92.15%; FB1: 89.75 886 ORDINAL: 180 195 239 precision: 75.31%; recall: 92.31%; FB1: 82.95 239 ORG: 1280 1795 1548 precision: 82.69%; recall: 71.31%; FB1: 76.58 1548 PERCENT: 308 349 350 precision: 88.00%; recall: 88.25%; FB1: 88.13 350 PERSON: 1784 1988 2032 precision: 87.80%; recall: 89.74%; FB1: 88.76 2032 PRODUCT: 33 76 49 precision: 67.35%; recall: 43.42%; FB1: 52.80 49 QUANTITY: 83 105 112 precision: 74.11%; recall: 79.05%; FB1: 76.50 112 TIME: 124 212 205 precision: 60.49%; recall: 58.49%; FB1: 59.47 205 WORK_OF_ART: 53 166 107 precision: 49.53%; recall: 31.93%; FB1: 38.83 107 ``` --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_8 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`bert-base-uncased-few-shot-k-512-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_8_en_4.0.0_3.0_1657185304800.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_8_en_4.0.0_3.0_1657185304800.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_8","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_8","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_8|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-512-finetuned-squad-seed-8

---
layout: model
title: Indonesian T5ForConditionalGeneration Cased model (from muchad)
author: John Snow Labs
name: t5_idt5_qa_qg
date: 2023-01-30
tags: [id, open_source, t5, tensorflow]
task: Text Generation
language: id
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `idt5-qa-qg` is an Indonesian model originally trained by `muchad`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_idt5_qa_qg_id_4.3.0_3.0_1675102982556.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_idt5_qa_qg_id_4.3.0_3.0_1675102982556.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_idt5_qa_qg","id") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_idt5_qa_qg","id")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_idt5_qa_qg| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|id| |Size:|981.3 MB| ## References - https://huggingface.co/muchad/idt5-qa-qg - https://github.com/Wikidepia/indonesian_datasets/tree/master/question-answering/squad - https://ai.muchad.com/qg/ - https://t.me/caritahubot - https://colab.research.google.com/github/muchad/qaqg/blob/main/idT5_Question_Generation.ipynb - https://colab.research.google.com/github/muchad/qaqg/blob/main/idT5_Question_Answering.ipynb --- layout: model title: Korean BertForQuestionAnswering Cased model (from Taekyoon) author: John Snow Labs name: bert_qa_komrc_train date: 2022-07-07 tags: [ko, open_source, bert, question_answering] task: Question Answering language: ko edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `komrc_train` is a Korean model originally trained by `Taekyoon`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_komrc_train_ko_4.0.0_3.0_1657189679496.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_komrc_train_ko_4.0.0_3.0_1657189679496.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_komrc_train","ko") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_komrc_train","ko")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_komrc_train|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|ko|
|Size:|406.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/Taekyoon/komrc_train

---
layout: model
title: Adverse Drug Events Classifier (LogReg)
author: John Snow Labs
name: classifier_logreg_ade
date: 2023-05-04
tags: [en, clinical, text_classification, logreg, licensed, ade]
task: Text Classification
language: en
edition: Healthcare NLP 4.4.1
spark_version: 3.0
supported: true
annotator: DocumentLogRegClassifierModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is trained with the Logistic Regression algorithm and classifies a text/sentence into two categories:

- `True`: the sentence is talking about a possible ADE
- `False`: the sentence doesn’t have any information about an ADE

The corpus used for model training is the ADE-Corpus-V2 Dataset: Adverse Drug Reaction Data. This is a dataset for classifying whether a sentence is ADE-related (True) or not (False).

## Predicted Entities

`True`, `False`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifier_logreg_ade_en_4.4.1_3.0_1683217611053.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifier_logreg_ade_en_4.4.1_3.0_1683217611053.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

logreg = DocumentLogRegClassifierModel.pretrained("classifier_logreg_ade", "en", "clinical/models")\
    .setInputCols("token")\
    .setOutputCol("prediction")

clf_Pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    logreg])

data = spark.createDataFrame([["""I feel great after taking tylenol."""], ["""Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient."""]]).toDF("text")

result = clf_Pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val logreg = DocumentLogRegClassifierModel.pretrained("classifier_logreg_ade", "en", "clinical/models")
    .setInputCols("token")
    .setOutputCol("prediction")

val clf_Pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, logreg))

val data = Seq("I feel great after taking tylenol", "Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.").toDF("text")

val result = clf_Pipeline.fit(data).transform(data)
```
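Conceptually, a logistic-regression document classifier like this one scores a bag-of-tokens feature vector and thresholds the sigmoid output. The following is a minimal, self-contained sketch of that decision rule — the vocabulary and weights are invented for illustration and are not the model's actual parameters, and this is not Spark NLP's implementation:

```python
import math

def predict_ade(tokens, weights, bias=0.0, threshold=0.5):
    """Score a bag of tokens with per-token weights and apply a sigmoid."""
    score = bias + sum(weights.get(t.lower(), 0.0) for t in tokens)
    prob = 1.0 / (1.0 + math.exp(-score))
    return prob >= threshold  # True -> ADE-related

# Illustrative weights: positive values push towards the ADE class.
weights = {"induced": 2.0, "reaction": 1.5, "great": -1.0, "feel": -0.5}

print(predict_ade("I feel great after taking tylenol".split(), weights))   # False
print(predict_ade("aspirin induced asthma reaction".split(), weights))     # True
```

The real model learns such weights from TF features over the whole training corpus; the sketch only shows why token evidence flips the True/False decision.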
## Results ```bash +----------------------------------------------------------------------------------------+-------+ |text |result | +----------------------------------------------------------------------------------------+-------+ |I feel great after taking tylenol |[False]| |Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.|[True] | +----------------------------------------------------------------------------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifier_logreg_ade| |Compatibility:|Healthcare NLP 4.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[prediction]| |Language:|en| |Size:|596.4 KB| ## References The corpus used for model training is ADE-Corpus-V2 Dataset: Adverse Drug Reaction Data. This is a dataset for classification of a sentence if it is ADE-related (True) or not (False). Reference: Gurulingappa et al., Benchmark Corpus to Support Information Extraction for Adverse Drug Effects, JBI, 2012. http://www.sciencedirect.com/science/article/pii/S1532046412000615 ## Benchmarking ```bash label precision recall f1-score support False 0.91 0.92 0.92 3362 True 0.79 0.79 0.79 1361 accuracy - - 0.88 4723 macro_avg 0.85 0.85 0.85 4723 weighted_avg 0.88 0.88 0.88 4723 ``` --- layout: model title: English DistilBertForQuestionAnswering model (from ParulChaudhari) author: John Snow Labs name: distilbert_qa_ParulChaudhari_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`distilbert-base-uncased-finetuned-squad` is an English model originally trained by `ParulChaudhari`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_ParulChaudhari_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724325249.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_ParulChaudhari_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724325249.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ParulChaudhari_base_uncased_finetuned_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ParulChaudhari_base_uncased_finetuned_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_ParulChaudhari").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
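Under the hood, extractive QA models like the one above score every token as a possible answer start and end; the predicted answer is the highest-scoring valid span. A minimal sketch of that span-selection step (the logits below are made up for illustration; this is not Spark NLP's internal code):

```python
def best_span(start_logits, end_logits, max_len=30):
    """Pick (start, end) maximising start_logits[s] + end_logits[e], with s <= e."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Toy logits over the tokenised context "My name is Clara and I live in Berkeley ."
start = [0.1, 0.0, 0.2, 3.1, 0.0, 0.0, 0.1, 0.0, 0.3, 0.0]
end   = [0.0, 0.1, 0.0, 2.8, 0.2, 0.0, 0.0, 0.0, 0.4, 0.1]

print(best_span(start, end))  # (3, 3) -> the token span covering "Clara"
```

The `s <= e` and `max_len` constraints are what keep the decoded span well-formed.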
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_ParulChaudhari_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/ParulChaudhari/distilbert-base-uncased-finetuned-squad --- layout: model title: Extract Information from Termination Clauses (sm) author: John Snow Labs name: legner_termination date: 2022-11-09 tags: [termination, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description IMPORTANT: Don't run this model on the whole legal agreement. Instead: - Split by paragraphs. You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration; - Use the `legclf_termination_clause` Text Classifier to select only these paragraphs; This is a NER model which extracts information from Termination Clauses, like the subject (Who? Which party?) the action (verb) the object (What?) and the Indirect Object (to Whom?). ## Predicted Entities `TERMINATION_SUBJECT`, `TERMINATION_ACTION`, `TERMINATION_OBJECT`, `TERMINATION_INDIRECT_OBJECT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_termination_en_1.0.0_3.0_1667988803376.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_termination_en_1.0.0_3.0_1667988803376.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained('legner_termination','en','legal/models')\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])

text = "(b) Either Party may terminate this Agreement"

data = spark.createDataFrame([[text]]).toDF("text")

model = nlpPipeline.fit(data)

result = model.transform(data)
```
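The description above recommends splitting agreements into paragraphs before classification and NER. A minimal splitting sketch, assuming paragraphs are separated by blank lines (purely illustrative — the linked workshop notebooks show more robust approaches):

```python
import re

def split_paragraphs(document: str):
    """Split a document on blank lines and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

agreement = """1. Term.

(b) Either Party may terminate this Agreement.

2. Governing Law."""

paragraphs = split_paragraphs(agreement)
print(len(paragraphs))  # 3
# Each paragraph can then become one row of the "text" column, e.g.:
# data = spark.createDataFrame([[p] for p in paragraphs]).toDF("text")
```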
## Results ```bash +-----------+---------------------+ | token| ner_label| +-----------+---------------------+ | (| O| | b| O| | )| O| | Either|B-TERMINATION_SUBJECT| | Party|I-TERMINATION_SUBJECT| | may| B-TERMINATION_ACTION| | terminate| I-TERMINATION_ACTION| | this| O| | Agreement| O| +-----------+---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_termination| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.4 MB| ## References In-house annotations of CUAD dataset. ## Benchmarking ```bash label tp fp fn prec rec f1 I-TERMINATION_INDIRECT_OBJECT 6 2 0 0.75 1.0 0.85714287 B-TERMINATION_INDIRECT_OBJECT 5 3 2 0.625 0.71428573 0.6666667 B-TERMINATION_OBJECT 48 13 25 0.78688526 0.65753424 0.7164179 I-TERMINATION_ACTION 84 11 12 0.8842105 0.875 0.8795811 I-TERMINATION_OBJECT 337 75 145 0.81796116 0.6991701 0.75391495 B-TERMINATION_SUBJECT 43 5 1 0.8958333 0.97727275 0.9347826 I-TERMINATION_SUBJECT 38 6 0 0.8636364 1.0 0.9268293 B-TERMINATION_ACTION 42 4 1 0.9130435 0.9767442 0.94382024 Macro-average - - - 0.8170713 0.86250085 0.83917177 Micro-average - - - 0.83518004 0.76425856 0.7981469 ``` --- layout: model title: Swahili XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_kinyarwanda_finetuned_ner_swahili date: 2022-08-01 tags: [sw, open_source, xlm_roberta, ner] task: Named Entity Recognition language: sw edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`xlm-roberta-base-finetuned-kinyarwanda-finetuned-ner-swahili` is a Swahili model originally trained by `mbeukman`. ## Predicted Entities `PER`, `DATE`, `ORG`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_kinyarwanda_finetuned_ner_swahili_sw_4.1.0_3.0_1659354241457.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_kinyarwanda_finetuned_ner_swahili_sw_4.1.0_3.0_1659354241457.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_kinyarwanda_finetuned_ner_swahili","sw") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_kinyarwanda_finetuned_ner_swahili","sw")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("document", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
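The `NerConverter` stage above merges token-level IOB tags (the `B-`/`I-` prefixes the token classifier emits) into entity chunks. A minimal sketch of that merging logic — illustrative only, not Spark NLP's implementation:

```python
def merge_bio(tokens, tags):
    """Collapse (token, BIO-tag) pairs into (chunk_text, label) tuples."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O" or an inconsistent I- tag closes the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Either", "Party", "may", "terminate", "this", "Agreement"]
tags = ["B-SUBJECT", "I-SUBJECT", "B-ACTION", "I-ACTION", "O", "O"]
print(merge_bio(tokens, tags))
# [('Either Party', 'SUBJECT'), ('may terminate', 'ACTION')]
```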
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_base_finetuned_kinyarwanda_finetuned_ner_swahili|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|sw|
|Size:|1.0 GB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-kinyarwanda-finetuned-ner-swahili
- https://arxiv.org/abs/2103.11811
- https://github.com/Michael-Beukman/NERTransfer
- https://github.com/masakhane-io/masakhane-ner

---
layout: model
title: Legal Restrictions Clause Binary Classifier
author: John Snow Labs
name: legclf_restrictions_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `restrictions` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.

If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `restrictions`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_restrictions_clause_en_1.0.0_3.2_1660123954238.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_restrictions_clause_en_1.0.0_3.2_1660123954238.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_restrictions_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
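As noted in the description, several binary clause classifiers can be run over the same paragraphs, yielding one True/False per clause type. A sketch of collecting those outputs into a single flag map — the classifier names and predictions below are illustrative, not real pipeline output:

```python
def clause_flags(results, positive_labels):
    """Turn per-classifier predicted labels into True/False flags.

    results: mapping of classifier name -> predicted label for one paragraph
    positive_labels: mapping of classifier name -> label meaning 'clause present'
    """
    return {name: results[name] == positive_labels[name] for name in results}

# Hypothetical predictions for one paragraph from two clause classifiers:
predictions = {"legclf_restrictions_clause": "restrictions",
               "legclf_termination_clause": "other"}
positives = {"legclf_restrictions_clause": "restrictions",
             "legclf_termination_clause": "termination"}

print(clause_flags(predictions, positives))
# {'legclf_restrictions_clause': True, 'legclf_termination_clause': False}
```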
## Results

```bash
+--------------+
|result        |
+--------------+
|[restrictions]|
|[other]       |
|[other]       |
|[restrictions]|
+--------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_restrictions_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
       other       0.81    0.97      0.88       61
restrictions       0.91    0.60      0.72       35
    accuracy          -       -      0.83       96
   macro-avg       0.86    0.78      0.80       96
weighted-avg       0.85    0.83      0.82       96
```

---
layout: model
title: Word2Vec Embeddings in French (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-02-03
tags: [cc, embeddings, fastText, word2vec, fr, open_source]
task: Embeddings
language: fr
edition: Spark NLP 3.4.0
spark_version: 3.0
supported: true
annotator: WordEmbeddingsModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Word Embeddings lookup annotator that maps tokens to vectors.

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_fr_3.4.0_3.0_1643891127135.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_fr_3.4.0_3.0_1643891127135.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "fr")\ .setInputCols(["document", "token"])\ .setOutputCol("embeddings") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("fr.embed.w2v_cc_300d").predict("""Put your text here.""") ```
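Once tokens are mapped to 300-dimensional vectors by the lookup above, a typical downstream use is comparing words by cosine similarity. A small self-contained sketch, using toy 3-d vectors in place of the real 300-d embeddings (the words and values are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

roi = [0.9, 0.1, 0.0]    # toy vector for "roi"
reine = [0.8, 0.2, 0.1]  # toy vector for "reine"
pomme = [0.0, 0.1, 0.9]  # toy vector for "pomme"

# Semantically related words should score higher than unrelated ones.
print(cosine(roi, reine) > cosine(roi, pomme))  # True
```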
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|fr|
|Size:|1.3 GB|
|Case sensitive:|false|
|Dimension:|300|

## References

[fastText common crawl word embeddings for French](https://fasttext.cc/docs/en/crawl-vectors.html).

---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab51 TFWav2Vec2ForCTC from hassnain
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_colab51
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab51` is an English model originally trained by hassnain.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab51_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab51_en_4.2.0_3.0_1664021524524.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab51_en_4.2.0_3.0_1664021524524.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab51', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab51", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab51|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|355.0 MB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: English Bert Embeddings (from smeylan)
author: John Snow Labs
name: bert_embeddings_childes_bert
date: 2022-04-11
tags: [bert, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `childes-bert` is an English model originally trained by `smeylan`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_childes_bert_en_3.4.2_3.0_1649672895554.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_childes_bert_en_3.4.2_3.0_1649672895554.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_childes_bert","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_childes_bert","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.childes_bert").predict("""I love Spark NLP""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_childes_bert|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|410.0 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/smeylan/childes-bert

---
layout: model
title: Legal Compliance With Laws Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_compliance_with_laws_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, compliance_with_laws, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

This model is a Binary Classifier (True, False) for the `Compliance_With_Laws` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.

If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
Take into consideration that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Compliance_With_Laws`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_compliance_with_laws_bert_en_1.0.0_3.0_1678050642566.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_compliance_with_laws_bert_en_1.0.0_3.0_1678050642566.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_compliance_with_laws_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
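Because the sentence embeddings behind this classifier handle at most 512 tokens, long filings are usually split before they reach the pipeline. Below is a minimal pure-Python sketch of greedy paragraph packing — an illustration only: it counts whitespace tokens as a rough proxy for the model's real subword count, and the workshop tutorial linked above shows the Spark NLP way to do this.

```python
def chunk_by_tokens(text, max_tokens=512):
    """Greedily pack paragraphs into chunks of at most max_tokens
    whitespace tokens. A single paragraph longer than the limit
    becomes its own (oversized) chunk and would need further splitting."""
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        n = len(para.split())
        if current and count + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each returned chunk can then become one row of the input DataFrame passed to the pipeline.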
## Results ```bash +----------------------+ |result                | +----------------------+ |[Compliance_With_Laws]| |[Other]               | |[Other]               | |[Compliance_With_Laws]| +----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_compliance_with_laws_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Compliance_With_Laws 0.93 0.91 0.92 123 Other 0.93 0.94 0.93 151 accuracy - - 0.93 274 macro-avg 0.93 0.93 0.93 274 weighted-avg 0.93 0.93 0.93 274 ``` --- layout: model title: English Deberta Embeddings model (from S2312dal) author: John Snow Labs name: deberta_embeddings_m5_mlm date: 2023-03-13 tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, en, tensorflow] task: Embeddings language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DeBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `M5_MLM` is an English model originally trained by `S2312dal`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_m5_mlm_en_4.3.1_3.0_1678703598087.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_m5_mlm_en_4.3.1_3.0_1678703598087.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_m5_mlm","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_m5_mlm","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|deberta_embeddings_m5_mlm| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|691.7 MB| |Case sensitive:|false| ## References https://huggingface.co/S2312dal/M5_MLM --- layout: model title: Context Spell Checker for the Italian Language author: John Snow Labs name: spellcheck_dl date: 2021-03-08 tags: [it, open_source] supported: true task: Spell Check language: it edition: Spark NLP 2.7.4 spark_version: 2.4 annotator: ContextSpellCheckerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an Italian Context Spell Checker trained on the Paisà corpus. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spellcheck_dl_it_2.7.4_2.4_1615238955709.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/spellcheck_dl_it_2.7.4_2.4_1615238955709.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The model works at the token level, so you must put it after tokenization. The model can change the length of the tokens when correcting words, so keep this in mind when using it before other annotators that may work with absolute references to the original document, such as NerConverter.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python assembler = DocumentAssembler()\ .setInputCol("value")\ .setOutputCol("document") tokenizer = RecursiveTokenizer()\ .setInputCols("document")\ .setOutputCol("token")\ .setPrefixes(["\"", "“", "(", "[", "\n", ".", "l'", "dell'", "nell'", "sull'", "all'", "d'", "un'"])\ .setSuffixes(["\"", "”", ".", ",", "?", ")", "]", "!", ";", ":"]) spellChecker = ContextSpellCheckerModel.pretrained("spellcheck_dl", "it")\ .setInputCols("token")\ .setOutputCol("corrected") pipeline = Pipeline(stages=[assembler, tokenizer, spellChecker]) ```
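The caveat above about absolute references can be shown without Spark NLP: when a correction changes a token's length, the start offset of every later token shifts, so annotations recorded against the original text stop lining up. A toy sketch — the Italian tokens and the `abbaa` → `abbaia` correction are invented for illustration:

```python
def apply_corrections(tokens, corrections):
    """Replace tokens by their corrections and return the corrected
    text plus the start offset of each token in that new text."""
    out, offsets, pos = [], [], 0
    for tok in tokens:
        fixed = corrections.get(tok, tok)
        out.append(fixed)
        offsets.append(pos)
        pos += len(fixed) + 1  # +1 for the joining space
    return " ".join(out), offsets

corrected, offsets = apply_corrections(
    ["il", "cane", "abbaa", "forte"],   # "abbaa" -> "abbaia" gains one char
    {"abbaa": "abbaia"},
)
# "forte" started at offset 14 in the original text but sits at 15 now,
# which is why offset-based annotators must run before the spell checker.
```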
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|spellcheck_dl| |Compatibility:|Spark NLP 2.7.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[corrected]| |Language:|it| ## Data Source Paisà Italian Language Corpus. --- layout: model title: German RoBERTa Embeddings (from benjamin) author: John Snow Labs name: roberta_embeddings_roberta_base_wechsel_german date: 2022-04-14 tags: [roberta, embeddings, de, open_source] task: Embeddings language: de edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-base-wechsel-german` is a German model originally trained by `benjamin`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_base_wechsel_german_de_3.4.2_3.0_1649947974859.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_base_wechsel_german_de_3.4.2_3.0_1649947974859.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_base_wechsel_german","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_base_wechsel_german","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.embed.roberta_base_wechsel_german").predict("""Ich liebe Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_roberta_base_wechsel_german| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|468.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/benjamin/roberta-base-wechsel-german - https://github.com/CPJKU/wechsel - https://arxiv.org/abs/2112.06598 --- layout: model title: Portuguese DistilBERT Embeddings (from Geotrend) author: John Snow Labs name: distilbert_embeddings_distilbert_base_pt_cased date: 2022-04-12 tags: [distilbert, embeddings, pt, open_source] task: Embeddings language: pt edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-pt-cased` is a Portuguese model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_pt_cased_pt_3.4.2_3.0_1649783515589.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_pt_cased_pt_3.4.2_3.0_1649783515589.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_pt_cased","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Eu amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_pt_cased","pt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Eu amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.embed.distilbert_base_cased").predict("""Eu amo Spark NLP""") ```
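Once any of these embedding annotators has produced vectors, a common downstream step is comparing them with cosine similarity. A dependency-free sketch on made-up 4-dimensional vectors (real BERT/DistilBERT token embeddings are 768-dimensional):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [0.2, 0.1, -0.4, 0.3]   # hypothetical token embedding
b = [0.2, 0.1, -0.4, 0.3]   # identical vector -> similarity ~1.0
c = [-0.3, 0.4, 0.1, -0.2]  # dissimilar vector -> much lower score
```

In practice you would pull the `embeddings` column out of `result` and feed each annotation's vector into a comparison like this.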
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_base_pt_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|pt| |Size:|233.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/distilbert-base-pt-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: German BertForTokenClassification Base Cased model (from mrm8488) author: John Snow Labs name: bert_token_classifier_base_german_finetuned_ler date: 2022-11-30 tags: [de, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: de edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-german-finetuned-ler` is a German model originally trained by `mrm8488`. ## Predicted Entities `EUN`, `LIT`, `RR`, `INN`, `RS`, `PER`, `VO`, `UN`, `MRK`, `AN`, `LD`, `STR`, `GRT`, `ORG`, `GS`, `VT`, `LDS`, `ST`, `VS` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_base_german_finetuned_ler_de_4.2.4_3.0_1669814863747.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_base_german_finetuned_ler_de_4.2.4_3.0_1669814863747.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_base_german_finetuned_ler","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_base_german_finetuned_ler","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_base_german_finetuned_ler| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|de| |Size:|407.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mrm8488/bert-base-german-finetuned-ler - https://github.com/elenanereiss/Legal-Entity-Recognition - http://www.rechtsprechung-im-internet.de - https://colab.research.google.com/drive/156Qrd7NsUHwA3nmQ6gXdZY0NzOvqk9AT?usp=sharing - https://github.com/elenanereiss/Legal-Entity-Recognition/blob/master/docs/Annotationsrichtlinien.pdf - https://twitter.com/mrm8488 --- layout: model title: English BertForQuestionAnswering model (from juliusco) author: John Snow Labs name: bert_qa_juliusco_distilbert_base_uncased_finetuned_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `juliusco`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_juliusco_distilbert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654187507139.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_juliusco_distilbert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654187507139.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_juliusco_distilbert_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_juliusco_distilbert_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.distilled_base_uncased.by_juliusco").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
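Extractive QA models such as this one score, for each token of the context, how likely it is to start or end the answer, and return the best-scoring span. A simplified sketch of that span selection — the tokens and logit values below are invented, and production heads score spans more efficiently than this brute-force loop:

```python
def best_span(start_logits, end_logits, max_len=15):
    """Return (start, end) token indices of the highest-scoring span
    with end >= start and length capped at max_len."""
    best, span = float("-inf"), (0, 0)
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            if s + end_logits[j] > best:
                best, span = s + end_logits[j], (i, j)
    return span

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start  = [0.1, 0.2, 0.1, 4.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end    = [0.1, 0.1, 0.1, 3.5, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]
i, j = best_span(start, end)
answer = " ".join(tokens[i:j + 1])  # "Clara"
```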
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_juliusco_distilbert_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|403.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/juliusco/distilbert-base-uncased-finetuned-squad --- layout: model title: English asr_wav2vec2_ksponspeech TFWav2Vec2ForCTC from Taeham author: John Snow Labs name: asr_wav2vec2_ksponspeech date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_ksponspeech` is an English model originally trained by Taeham. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_ksponspeech_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_ksponspeech_en_4.2.0_3.0_1664102551862.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_ksponspeech_en_4.2.0_3.0_1664102551862.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_ksponspeech", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_ksponspeech", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
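Wav2Vec2ForCTC emits one character prediction per audio frame; the transcript comes from CTC decoding, which collapses repeated frame predictions and drops the blank symbol. A toy sketch of the greedy decoding rule — the vocabulary and frame sequence are invented, and the annotator performs this step internally:

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeats and remove blanks (greedy CTC)."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(id_to_char[i])
        prev = i
    return "".join(out)

vocab = {1: "h", 2: "e", 3: "l", 4: "o"}
# frames: h h e <blank> l l <blank> l o  ->  "hello"
# (the blank between the two l's is what keeps the double letter)
transcript = ctc_greedy_decode([1, 1, 2, 0, 3, 3, 0, 3, 4], vocab)
```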
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_ksponspeech| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: English RobertaForQuestionAnswering Cased model (from sunitha) author: John Snow Labs name: roberta_qa_customds_finetune date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-customds-finetune` is an English model originally trained by `sunitha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_customds_finetune_en_4.3.0_3.0_1674220035505.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_customds_finetune_en_4.3.0_3.0_1674220035505.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_customds_finetune","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_customds_finetune","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_customds_finetune| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/sunitha/roberta-customds-finetune --- layout: model title: English RobertaForQuestionAnswering (from CNT-UPenn) author: John Snow Labs name: roberta_qa_RoBERTa_for_seizureFrequency_QA date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `RoBERTa_for_seizureFrequency_QA` is an English model originally trained by `CNT-UPenn`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_RoBERTa_for_seizureFrequency_QA_en_4.0.0_3.0_1655727173665.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_RoBERTa_for_seizureFrequency_QA_en_4.0.0_3.0_1655727173665.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_RoBERTa_for_seizureFrequency_QA","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_RoBERTa_for_seizureFrequency_QA","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.by_CNT-UPenn").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_RoBERTa_for_seizureFrequency_QA| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|466.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/CNT-UPenn/RoBERTa_for_seizureFrequency_QA - https://doi.org/10.1093/jamia/ocac018 --- layout: model title: Detect Living Species(biobert_embeddings_biomedical) author: John Snow Labs name: ner_living_species_bert date: 2022-06-22 tags: [pt, ner, clinical, licensed, biobert] task: Named Entity Recognition language: pt edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract living species from clinical texts in Portuguese, which is critical to scientific disciplines like medicine, biology, ecology/biodiversity, nutrition and agriculture. This model is trained using `biobert_embeddings_biomedical` embeddings. It is trained on the [LivingNER](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) corpus that is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. **NOTE:** 1. The text files were translated from Spanish with a neural machine translation system. 2. The annotations were translated with the same neural machine translation system. 3. The translated annotations were transferred to the translated text files using an annotation transfer technology.
## Predicted Entities `HUMAN`, `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_pt_3.5.3_3.0_1655923613188.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_pt_3.5.3_3.0_1655923613188.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("biobert_embeddings_biomedical","pt")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_living_species_bert", "pt","clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""Uma rapariga de 16 anos com um historial pessoal de asma apresentou ao departamento de dermatologia com lesões cutâneas assintomáticas que tinham estado presentes durante 2 meses. A paciente tinha sido tratada com creme corticosteróide devido a uma suspeita inicial de eczema atópico, apesar do qual apresentava um crescimento progressivo marcado das lesões. Tinha um gato doméstico que ela nunca tinha levado ao veterinário. O exame físico revelou placas em forma de anel com uma borda periférica activa na parte superior das costas e nos aspectos laterais do pescoço e da face. Cultura local obtida por raspagem de tapete isolado Trichophyton rubrum. 
Com base em dados clínicos e cultura, foi estabelecido o diagnóstico de tinea incognito."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("biobert_embeddings_biomedical","pt") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_living_species_bert", "pt","clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""Uma rapariga de 16 anos com um historial pessoal de asma apresentou ao departamento de dermatologia com lesões cutâneas assintomáticas que tinham estado presentes durante 2 meses. A paciente tinha sido tratada com creme corticosteróide devido a uma suspeita inicial de eczema atópico, apesar do qual apresentava um crescimento progressivo marcado das lesões. Tinha um gato doméstico que ela nunca tinha levado ao veterinário. O exame físico revelou placas em forma de anel com uma borda periférica activa na parte superior das costas e nos aspectos laterais do pescoço e da face. Cultura local obtida por raspagem de tapete isolado Trichophyton rubrum. 
Com base em dados clínicos e cultura, foi estabelecido o diagnóstico de tinea incognito.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.med_ner.living_species.bert").predict("""Uma rapariga de 16 anos com um historial pessoal de asma apresentou ao departamento de dermatologia com lesões cutâneas assintomáticas que tinham estado presentes durante 2 meses. A paciente tinha sido tratada com creme corticosteróide devido a uma suspeita inicial de eczema atópico, apesar do qual apresentava um crescimento progressivo marcado das lesões. Tinha um gato doméstico que ela nunca tinha levado ao veterinário. O exame físico revelou placas em forma de anel com uma borda periférica activa na parte superior das costas e nos aspectos laterais do pescoço e da face. Cultura local obtida por raspagem de tapete isolado Trichophyton rubrum. Com base em dados clínicos e cultura, foi estabelecido o diagnóstico de tinea incognito.""") ```
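Under the hood, the `NerConverter` stage assembles token-level BIO tags (`B-SPECIES`, `I-SPECIES`, `O`, …) emitted by the NER model into the entity chunks shown in the results. A minimal, library-independent sketch of that grouping logic (plain Python, not the Spark NLP implementation):

```python
def bio_to_chunks(tokens, tags):
    """Group BIO-tagged tokens into (chunk_text, label) pairs,
    mirroring what a NerConverter-style stage does conceptually."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(bio_to_chunks(
    ["Tinha", "um", "gato", ";", "Trichophyton", "rubrum"],
    ["O", "O", "B-SPECIES", "O", "B-SPECIES", "I-SPECIES"]))
# [('gato', 'SPECIES'), ('Trichophyton rubrum', 'SPECIES')]
```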
## Results

```bash
+-------------------+-------+
|ner_chunk          |label  |
+-------------------+-------+
|rapariga           |HUMAN  |
|pessoal            |HUMAN  |
|paciente           |HUMAN  |
|gato               |SPECIES|
|veterinário        |HUMAN  |
|Trichophyton rubrum|SPECIES|
+-------------------+-------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_living_species_bert|
|Compatibility:|Healthcare NLP 3.5.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|pt|
|Size:|16.4 MB|

## References

[https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/)

## Benchmarking

```bash
       label  precision  recall  f1-score  support
     B-HUMAN       0.87    0.92      0.89     2827
   B-SPECIES       0.70    0.82      0.76     2798
     I-HUMAN       0.96    0.40      0.56      180
   I-SPECIES       0.75    0.77      0.76     1100
   micro-avg       0.78    0.84      0.81     6905
   macro-avg       0.82    0.72      0.74     6905
weighted-avg       0.79    0.84      0.81     6905
```

---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from google)
author: John Snow Labs
name: t5_efficient_small_nl40
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-nl40` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl40_en_4.3.0_3.0_1675122791666.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl40_en_4.3.0_3.0_1675122791666.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_small_nl40","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_small_nl40","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_small_nl40|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|628.5 MB|

## References

- https://huggingface.co/google/t5-efficient-small-nl40
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145

---
layout: model
title: Clinical Deidentification (French)
author: John Snow Labs
name: clinical_deidentification
date: 2022-02-23
tags: [deid, fr, licensed]
task: De-identification
language: fr
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pipeline can be used to deidentify PHI information from medical texts in French. The pipeline can mask and obfuscate the following entities: `DATE`, `AGE`, `SEX`, `PROFESSION`, `ORGANIZATION`, `PHONE`, `E-MAIL`, `ZIP`, `STREET`, `CITY`, `COUNTRY`, `PATIENT`, `DOCTOR`, `HOSPITAL`, `MEDICALRECORD`, `SSN`, `IDNUM`, `ACCOUNT`, `PLATE`, `USERNAME`, `URL`, and `IPADDR`.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_fr_3.4.1_3.0_1645643868811.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_fr_3.4.1_3.0_1645643868811.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "fr", "clinical/models") sample = """COMPTE-RENDU D'HOSPITALISATION PRENOM : Jean NOM : Dubois NUMÉRO DE SÉCURITÉ SOCIALE : 1780160471058 ADRESSE : 18 Avenue Matabiau VILLE : Grenoble CODE POSTAL : 38000 DATE DE NAISSANCE : 03/03/1946 Âge : 70 ans Sexe : H COURRIEL : jdubois@hotmail.fr DATE D'ADMISSION : 12/12/2016 MÉDÉCIN : Dr Michel Renaud RAPPORT CLINIQUE : 70 ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour. Il nous a été adressé car il présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale. L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV. L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal. Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml. 
ADDRESSÉ À : Dre Marie Breton - Centre Hospitalier de Bellevue Service D'Endocrinologie et de Nutrition - Rue Paulin Bussières, 38000 Grenoble COURRIEL : mariebreton@chb.fr
"""

result = deid_pipeline.annotate(sample)
```
```scala
val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "fr", "clinical/models")

val sample = "COMPTE-RENDU D'HOSPITALISATION PRENOM : Jean NOM : Dubois NUMÉRO DE SÉCURITÉ SOCIALE : 1780160471058 ADRESSE : 18 Avenue Matabiau VILLE : Grenoble CODE POSTAL : 38000 DATE DE NAISSANCE : 03/03/1946 Âge : 70 ans Sexe : H COURRIEL : jdubois@hotmail.fr DATE D'ADMISSION : 12/12/2016 MÉDÉCIN : Dr Michel Renaud RAPPORT CLINIQUE : 70 ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour. Il nous a été adressé car il présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale. L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV. L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal. Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml.
ADDRESSÉ À : Dre Marie Breton - Centre Hospitalier de Bellevue Service D'Endocrinologie et de Nutrition - Rue Paulin Bussières, 38000 Grenoble COURRIEL : mariebreton@chb.fr " val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("fr.deid_obfuscated").predict("""COMPTE-RENDU D'HOSPITALISATION PRENOM : Jean NOM : Dubois NUMÉRO DE SÉCURITÉ SOCIALE : 1780160471058 ADRESSE : 18 Avenue Matabiau VILLE : Grenoble CODE POSTAL : 38000 DATE DE NAISSANCE : 03/03/1946 Âge : 70 ans Sexe : H COURRIEL : jdubois@hotmail.fr DATE D'ADMISSION : 12/12/2016 MÉDÉCIN : Dr Michel Renaud RAPPORT CLINIQUE : 70 ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour. Il nous a été adressé car il présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale. L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV. L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal. Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml. ADDRESSÉ À : Dre Marie Breton - Centre Hospitalier de Bellevue Service D'Endocrinologie et de Nutrition - Rue Paulin Bussières, 38000 Grenoble COURRIEL : mariebreton@chb.fr """) ```
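Conceptually, once the PHI chunks have been extracted, the pipeline's masking modes reduce to simple string-replacement policies. A toy re-implementation in plain Python (a hypothetical helper, not the Spark NLP `DeIdentification` API) that mimics the three masked output styles shown in the results; in the real entity-label mode each chunk is replaced by a tag such as `<PATIENT>`:

```python
def mask_entities(text, entities, policy="entity_labels", fixed="****"):
    """Apply one de-identification masking policy to pre-extracted
    PHI chunks. `entities` is a list of (chunk, label) pairs found
    in `text`; this is a conceptual sketch, not the library's API."""
    for chunk, label in entities:
        if policy == "entity_labels":
            replacement = f"<{label}>"          # mask with entity labels
        elif policy == "chars":
            # same-length mask, bracketed: "Dubois" -> "[****]"
            replacement = "[" + "*" * max(len(chunk) - 2, 0) + "]"
        else:
            replacement = fixed                 # fixed-length chars
        text = text.replace(chunk, replacement)
    return text

sample = "PRENOM : Jean NOM : Dubois VILLE : Grenoble"
ents = [("Jean", "PATIENT"), ("Dubois", "PATIENT"), ("Grenoble", "CITY")]
print(mask_entities(sample, ents, policy="fixed"))
# PRENOM : **** NOM : **** VILLE : ****
```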
## Results ```bash Masked with entity labels ------------------------------ COMPTE-RENDU D'HOSPITALISATION PRENOM : NOM : NUMÉRO DE SÉCURITÉ SOCIALE : ADRESSE : VILLE : CODE POSTAL : DATE DE NAISSANCE : Âge : Sexe : COURRIEL : DATE D'ADMISSION : MÉDÉCIN : RAPPORT CLINIQUE : ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour. nous a été adressé car présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale. L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV. L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal. Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml. 
ADDRESSÉ À : - Service D'Endocrinologie et de Nutrition - , COURRIEL : Masked with chars ------------------------------ COMPTE-RENDU D'HOSPITALISATION PRENOM : [**] NOM : [****] NUMÉRO DE SÉCURITÉ SOCIALE : [***********] ADRESSE : [****************] VILLE : [******] CODE POSTAL : [***] DATE DE NAISSANCE : [********] Âge : [****] Sexe : * COURRIEL : [****************] DATE D'ADMISSION : [********] MÉDÉCIN : [**************] RAPPORT CLINIQUE : ** ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour. ** nous a été adressé car ** présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale. L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV. L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal. Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml. 
ADDRESSÉ À : [**************] - [****************************] Service D'Endocrinologie et de Nutrition - [******************], [***] [******] COURRIEL : [****************] Masked with fixed length chars ------------------------------ COMPTE-RENDU D'HOSPITALISATION PRENOM : **** NOM : **** NUMÉRO DE SÉCURITÉ SOCIALE : **** ADRESSE : **** VILLE : **** CODE POSTAL : **** DATE DE NAISSANCE : **** Âge : **** Sexe : **** COURRIEL : **** DATE D'ADMISSION : **** MÉDÉCIN : **** RAPPORT CLINIQUE : **** ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour. **** nous a été adressé car **** présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale. L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV. L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal. Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml. ADDRESSÉ À : **** - **** Service D'Endocrinologie et de Nutrition - ****, **** **** COURRIEL : **** Obfuscated ------------------------------ COMPTE-RENDU D'HOSPITALISATION PRENOM : Mme Ollivier NOM : Mme Traore NUMÉRO DE SÉCURITÉ SOCIALE : 164033818514436 ADRESSE : 731, boulevard de Legrand VILLE : Sainte Antoine CODE POSTAL : 37443 DATE DE NAISSANCE : 18/03/1946 Âge : 46 Sexe : Femme COURRIEL : georgeslemonnier@live.com DATE D'ADMISSION : 10/01/2017 MÉDÉCIN : Pr. 
Manon Dupuy RAPPORT CLINIQUE : 26 ans, retraité, sans allergie médicamenteuse connue, qui présente comme antécédents : ancien accident du travail avec fractures vertébrales et des côtes ; opéré de la maladie de Dupuytren à la main droite et d'un pontage ilio-fémoral gauche ; diabète de type II, hypercholestérolémie et hyperuricémie ; alcoolisme actif, fume 20 cigarettes / jour. Homme nous a été adressé car Homme présentait une hématurie macroscopique postmictionnelle à une occasion et une microhématurie persistante par la suite, avec une miction normale. L'examen physique a montré un bon état général, avec un abdomen et des organes génitaux normaux ; le toucher rectal était compatible avec un adénome de la prostate de grade I/IV. L'analyse d'urine a montré 4 globules rouges/champ et 0-5 leucocytes/champ ; le reste du sédiment était normal. Hémogramme normal ; la biochimie a montré une glycémie de 169 mg/dl et des triglycérides de 456 mg/dl ; les fonctions hépatiques et rénales étaient normales. PSA de 1,16 ng/ml. 
ADDRESSÉ À : Dr Tristan-Gilbert Poulain - CENTRE HOSPITALIER D'ORTHEZ Service D'Endocrinologie et de Nutrition - 6, avenue Pages, 37443 Sainte Antoine COURRIEL : massecatherine@bouygtel.fr ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|fr| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - RegexMatcherModel - RegexMatcherModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: Smaller BERT Sentence Embeddings (L-10_H-256_A-4) author: John Snow Labs name: sent_small_bert_L10_256 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L10_256_en_2.6.0_2.4_1598350461634.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L10_256_en_2.6.0_2.4_1598350461634.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L10_256", "en") \
    .setInputCols("sentence") \
    .setOutputCol("sentence_embeddings")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])

pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"]))
```
```scala
...
val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L10_256", "en")
    .setInputCols("sentence")
    .setOutputCol("sentence_embeddings")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))

val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu

text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('en.embed_sentence.small_bert_L10_256').predict(text, output_level='sentence')
embeddings_df
```
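Downstream, sentence vectors like these are typically compared with cosine similarity (e.g., for semantic search or clustering). A minimal sketch, independent of Spark NLP, using toy low-dimensional vectors in place of the model's 256-dimensional output:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy 4-dimensional stand-ins for the model's 256-dimensional vectors
a = [0.1, -0.2, 0.3, 0.4]
print(round(cosine_similarity(a, a), 6))  # 1.0 for identical vectors
```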
{:.h2_title}
## Results

```bash
en_embed_sentence_small_bert_L10_256_embeddings        sentence
[-0.38345780968666077, -0.15031394362449646, -...     I hate cancer
[0.170307457447052, -0.0283052921295166, -0.33...     Antibiotics aren't painkiller
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sent_small_bert_L10_256|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|[en]|
|Dimension:|256|
|Case sensitive:|false|

{:.h2_title}
## Data Source

The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-256_A-4/1

---
layout: model
title: Translate English to Italic languages Pipeline
author: John Snow Labs
name: translate_en_itc
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, itc, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `itc` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_itc_xx_2.7.0_2.4_1609699320357.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_itc_xx_2.7.0_2.4_1609699320357.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_itc", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_itc", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.itc').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_en_itc|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Sentence Entity Resolver for ICD-O (base)
author: John Snow Labs
name: sbiobertresolve_icdo_base
date: 2021-07-02
tags: [entity_resolution, licensed, en, clinical]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.1.0
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model maps extracted medical entities to ICD-O codes (Topography & Morphology codes) using BioBert Sentence Embeddings. Given an oncological entity found in the text (via NER models like ner_jsl), it returns top terms and resolutions along with the corresponding ICD-O codes to present more granularity with respect to body parts mentioned. It also returns the original `Topography` codes, the `Morphology` codes (comprising `Histology` and `Behavior` codes), and descriptions in the aux metadata.

## Predicted Entities

ICD-O Codes and their normalized definition with `sbiobert_base_cased_mli` embeddings.
{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icdo_base_en_3.1.0_3.0_1625252163641.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icdo_base_en_3.1.0_3.0_1625252163641.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use

The `sbiobertresolve_icdo_base` resolver model must be used with `sbiobert_base_cased_mli` as the embeddings and `ner_jsl` as the NER model, with `Oncological` set in `.setWhiteList()`.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") icdo_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_icdo_base","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icdo_resolver]) data = spark.createDataFrame([["The patient is a very pleasant 61-year-old female with a strong family history of colon polyps. The patient reports her first polyps noted at the age of 50. We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma. She also has history of several malignancies in the family. Her father died of a brain tumor at the age of 81. Her sister died at the age of 65 breast cancer. She has two maternal aunts with history of lung cancer both of whom were smoker. Also a paternal grandmother who was diagnosed with leukemia at 86 and a paternal grandfather who had B-cell lymphoma."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala ... 
val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")

val sbert_embedder = BertSentenceEmbeddings
    .pretrained("sbiobert_base_cased_mli","en","clinical/models")
    .setInputCols(Array("ner_chunk_doc"))
    .setOutputCol("sbert_embeddings")

val icdo_resolver = SentenceEntityResolverModel
    .pretrained("sbiobertresolve_icdo_base","en", "clinical/models")
    .setInputCols(Array("ner_chunk", "sbert_embeddings"))
    .setOutputCol("resolution")
    .setDistanceFunction("EUCLIDEAN")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icdo_resolver))

val data = Seq("The patient is a very pleasant 61-year-old female with a strong family history of colon polyps. The patient reports her first polyps noted at the age of 50. We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma. She also has history of several malignancies in the family. Her father died of a brain tumor at the age of 81. Her sister died at the age of 65 breast cancer. She has two maternal aunts with history of lung cancer both of whom were smoker. Also a paternal grandmother who was diagnosed with leukemia at 86 and a paternal grandfather who had B-cell lymphoma.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu

nlu.load("en.resolve.icdo.base").predict("""The patient is a very pleasant 61-year-old female with a strong family history of colon polyps. The patient reports her first polyps noted at the age of 50. We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma. She also has history of several malignancies in the family. Her father died of a brain tumor at the age of 81. Her sister died at the age of 65 breast cancer. She has two maternal aunts with history of lung cancer both of whom were smoker.
Also a paternal grandmother who was diagnosed with leukemia at 86 and a paternal grandfather who had B-cell lymphoma.""") ```
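With `setDistanceFunction("EUCLIDEAN")`, resolution amounts to a nearest-neighbor lookup of the chunk's sentence embedding against the indexed ICD-O reference embeddings. A toy sketch of that lookup with made-up 2-dimensional vectors (the real index uses `sbiobert_base_cased_mli` embeddings, not these values):

```python
import math

def resolve(chunk_embedding, code_index, k=3):
    """Return the k ICD-O codes whose reference embeddings are
    closest to the chunk embedding by Euclidean distance --
    a toy stand-in for SentenceEntityResolverModel's lookup."""
    def dist(v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(chunk_embedding, v)))
    return sorted(code_index, key=lambda item: dist(item[1]))[:k]

# hypothetical reference vectors for three codes from the results table
index = [
    ("9050/3-C38.3", [0.9, 0.1]),   # Mesothelioma, malignant
    ("8550/3-C50.9", [0.1, 0.9]),   # Acinar cell carcinoma of breast
    ("8001/3-C39.8", [0.5, 0.5]),   # Tumor cells, malignant
]
top = resolve([0.85, 0.15], index, k=1)
print(top[0][0])  # 9050/3-C38.3
```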
## Results ```bash +--------------------+-----+---+-----------+-------------+----------------------------------+ | chunk|begin|end| entity| code| all_k_resolutions| +--------------------+-----+---+-----------+-------------+----------------------------------+ | mesothelioma| 255|266|Oncological|9050/3||C38.3|Mesothelioma, malignant ...| |several malignancies| 293|312|Oncological|8001/3||C39.8|Tumor cells, malignant ...| | brain tumor| 350|360|Oncological|8001/4||C71.7|Tumor cells, malignant of brain...| | breast cancer| 413|425|Oncological|8550/3||C50.9|Acinar cell carcinoma of breast...| | lung cancer| 471|481|Oncological|8046/3||C34.3|Non-small cell carcinoma of low...| | leukemia| 560|567|Oncological|980-994 |Leukemias ...| | B-cell lymphoma| 610|624|Oncological|967-969 |Mature B-cell lymphomas ...| +--------------------+-----+---+-----------+-------------+----------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icdo_base| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sbert_embeddings]| |Output Labels:|[icdo_code]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on ICD-O Histology Behaviour dataset with `sbiobert_base_cased_mli` sentence embeddings. https://apps.who.int/iris/bitstream/handle/10665/96612/9789241548496_eng.pdf --- layout: model title: Swedish BertForQuestionAnswering Base Cased model (from KBLab) author: John Snow Labs name: bert_qa_base_swedish_cased_squad_experimental date: 2022-07-07 tags: [sv, open_source, bert, question_answering] task: Question Answering language: sv edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`bert-base-swedish-cased-squad-experimental` is a Swedish model originally trained by `KBLab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_swedish_cased_squad_experimental_sv_4.0.0_3.0_1657183499388.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_swedish_cased_squad_experimental_sv_4.0.0_3.0_1657183499388.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_swedish_cased_squad_experimental","sv") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["Vad är mitt namn?", "Jag heter Clara och jag bor i Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_swedish_cased_squad_experimental","sv")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("Vad är mitt namn?", "Jag heter Clara och jag bor i Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
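An extractive QA head like this one scores every context token as a potential answer start and end; decoding then picks the span that maximizes the summed scores. A minimal sketch of that decoding step with made-up logits (not the annotator's internals):

```python
def best_span(start_logits, end_logits, max_len=10):
    """Pick the (start, end) token span maximizing start+end logits,
    the standard decoding step for extractive QA heads."""
    best, best_score = (0, 0), float("-inf")
    for s, sl in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = sl + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["Jag", "heter", "Clara", "och", "jag", "bor", "i", "Berkeley", "."]
start = [0.1, 0.2, 5.0, 0.1, 0.0, 0.1, 0.0, 1.0, 0.0]  # hypothetical logits
end   = [0.0, 0.1, 4.5, 0.2, 0.0, 0.1, 0.0, 0.9, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```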
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_swedish_cased_squad_experimental| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|sv| |Size:|465.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/KBLab/bert-base-swedish-cased-squad-experimental --- layout: model title: Clinical Deidentification (Italian) author: John Snow Labs name: clinical_deidentification date: 2022-03-28 tags: [deidentification, it, licensed, pipeline] task: De-identification language: it edition: Healthcare NLP 3.4.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to deidentify PHI information from medical texts in Italian. The pipeline can mask and obfuscate the following entities: `DATE`, `AGE`, `SEX`, `PROFESSION`, `ORGANIZATION`, `PHONE`, `E-MAIL`, `ZIP`, `STREET`, `CITY`, `COUNTRY`, `PATIENT`, `DOCTOR`, `HOSPITAL`, `MEDICALRECORD`, `SSN`, `IDNUM`, `ACCOUNT`, `PLATE`, `USERNAME`, `URL`, and `IPADDR`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_it_3.4.2_3.0_1648484125736.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_it_3.4.2_3.0_1648484125736.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "it", "clinical/models") sample = """RAPPORTO DI RICOVERO NOME: Lodovico Fibonacci CODICE FISCALE: MVANSK92F09W408A INDIRIZZO: Viale Burcardo 7 CITTÀ : Napoli CODICE POSTALE: 80139 DATA DI NASCITA: 03/03/1946 ETÀ: 70 anni SESSO: M EMAIL: lpizzo@tim.it DATA DI AMMISSIONE: 12/12/2016 DOTTORE: Eva Viviani RAPPORTO CLINICO: 70 anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno. È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale. L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV. L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale. L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml. INDIRIZZATO A: Dott. 
Bruno Ferrabosco - ASL Napoli 1 Centro, Dipartimento di Endocrinologia e Nutrizione - Stretto Scamarcio 320, 80138 Napoli EMAIL: bferrabosco@poste.it"""

result = deid_pipeline.annotate(sample)
```
```scala
val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "it", "clinical/models")

val sample = "RAPPORTO DI RICOVERO NOME: Lodovico Fibonacci CODICE FISCALE: MVANSK92F09W408A INDIRIZZO: Viale Burcardo 7 CITTÀ : Napoli CODICE POSTALE: 80139 DATA DI NASCITA: 03/03/1946 ETÀ: 70 anni SESSO: M EMAIL: lpizzo@tim.it DATA DI AMMISSIONE: 12/12/2016 DOTTORE: Eva Viviani RAPPORTO CLINICO: 70 anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno. È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale. L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV. L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale. L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml. INDIRIZZATO A: Dott. 
Bruno Ferrabosco - ASL Napoli 1 Centro, Dipartimento di Endocrinologia e Nutrizione - Stretto Scamarcio 320, 80138 Napoli EMAIL: bferrabosco@poste.it" val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("it.deid.clinical").predict("""RAPPORTO DI RICOVERO NOME: Lodovico Fibonacci CODICE FISCALE: MVANSK92F09W408A INDIRIZZO: Viale Burcardo 7 CITTÀ : Napoli CODICE POSTALE: 80139 DATA DI NASCITA: 03/03/1946 ETÀ: 70 anni SESSO: M EMAIL: lpizzo@tim.it DATA DI AMMISSIONE: 12/12/2016 DOTTORE: Eva Viviani RAPPORTO CLINICO: 70 anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno. È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale. L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV. L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale. L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml. INDIRIZZATO A: Dott. Bruno Ferrabosco - ASL Napoli 1 Centro, Dipartimento di Endocrinologia e Nutrizione - Stretto Scamarcio 320, 80138 Napoli EMAIL: bferrabosco@poste.it""") ```
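`annotate()` returns a plain Python dict keyed by the pipeline's output column names. The sketch below shows the kind of post-processing that dict supports; the key names and the tiny rows are assumptions for illustration, not this pipeline's actual output:

```python
# Made-up miniature of an annotate() result dict; the real key names and
# values depend on the pipeline's configured output columns.
result = {
    "masked": ["NOME: <PATIENT>", "DOTTORE: <DOCTOR>"],
    "obfuscated": ["NOME: Scotto-Polani", "DOTTORE: Sig. Fredo Marangoni"],
}

masked_text = "\n".join(result["masked"])
print(masked_text)
```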
## Results ```bash Masked with entity labels ------------------------------ RAPPORTO DI RICOVERO NOME: CODICE FISCALE: INDIRIZZO: CITTÀ : CODICE POSTALE: DATA DI NASCITA: ETÀ: anni SESSO: EMAIL: DATA DI AMMISSIONE: DOTTORE: RAPPORTO CLINICO: anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno. È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale. L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV. L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale. L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml. INDIRIZZATO A: Dott. - , Dipartimento di Endocrinologia e Nutrizione - , EMAIL: Masked with chars ------------------------------ RAPPORTO DI RICOVERO NOME: [****************] CODICE FISCALE: [**************] INDIRIZZO: [**************] CITTÀ : [****] CODICE POSTALE: [***]DATA DI NASCITA: [********] ETÀ: **anni SESSO: * EMAIL: [***********] DATA DI AMMISSIONE: [********] DOTTORE: [*********] RAPPORTO CLINICO: **anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno. 
È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale. L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV. L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale. L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml. INDIRIZZATO A: Dott. [**************] - [*****************], Dipartimento di Endocrinologia e Nutrizione - [*******************], [***] [****] EMAIL: [******************] Masked with fixed length chars ------------------------------ RAPPORTO DI RICOVERO NOME: **** CODICE FISCALE: **** INDIRIZZO: **** CITTÀ : **** CODICE POSTALE: ****DATA DI NASCITA: **** ETÀ: **** anni SESSO: **** EMAIL: **** DATA DI AMMISSIONE: **** DOTTORE: **** RAPPORTO CLINICO: **** anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno. È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale. L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV. L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale. 
L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml. INDIRIZZATO A: Dott. **** - ****, Dipartimento di Endocrinologia e Nutrizione - ****, **** **** EMAIL: **** Obfuscated ------------------------------ RAPPORTO DI RICOVERO NOME: Scotto-Polani CODICE FISCALE: ECI-QLN77G15L455Y INDIRIZZO: Viale Orlando 808 CITTÀ : Sesto Raimondo CODICE POSTALE: 53581DATA DI NASCITA: 09/03/1946 ETÀ: 5 anni SESSO: U EMAIL: HenryWatson@world.com DATA DI AMMISSIONE: 10/01/2017 DOTTORE: Sig. Fredo Marangoni RAPPORTO CLINICO: 5 anni, pensionato, senza allergie farmacologiche note, che presenta la seguente storia: ex incidente sul lavoro con fratture vertebrali e costali; operato per la malattia di Dupuytren alla mano destra e un bypass ileo-femorale sinistro; diabete di tipo II, ipercolesterolemia e iperuricemia; alcolismo attivo, fuma 20 sigarette/giorno. È stato indirizzato a noi perché ha presentato un'ematuria macroscopica post-evacuazione in un'occasione e una microematuria persistente in seguito, con un'evacuazione normale. L'esame fisico ha mostrato buone condizioni generali, con addome e genitali normali; l'esame digitale rettale era coerente con un adenoma prostatico di grado I/IV. L'analisi delle urine ha mostrato 4 globuli rossi/campo e 0-5 leucociti/campo; il resto del sedimento era normale. L'emocromo è normale; la biochimica ha mostrato una glicemia di 169 mg/dl e trigliceridi 456 mg/dl; la funzione epatica e renale sono normali. PSA di 1,16 ng/ml. INDIRIZZATO A: Dott. Antonio Rusticucci - ASL 7 DI CARBONIA AZIENDA U.S.L. N. 
7, Dipartimento di Endocrinologia e Nutrizione - Via Giorgio 0 Appartamento 26, 03461 Sesto Raimondo EMAIL: murat.g@jsl.com ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Language:|it| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ContextualParserModel - ContextualParserModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - RegexMatcherModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: Multilingual T5ForConditionalGeneration Cased model (from unicamp-dl) author: John Snow Labs name: t5_translation_pt2en date: 2023-01-31 tags: [pt, en, open_source, t5, xx, tensorflow] task: Text Generation language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `translation-pt-en-t5` is a Multilingual model originally trained by `unicamp-dl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_translation_pt2en_xx_4.3.0_3.0_1675157565884.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_translation_pt2en_xx_4.3.0_3.0_1675157565884.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_translation_pt2en","xx") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_translation_pt2en","xx")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
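After `transform`, the generated translation is carried in the `answers` column as annotations. The snippet below is an illustrative, hand-written stand-in for a collected row (the sentence is invented, not real model output):

```python
# Hypothetical collected row; each "answers" entry's `result` field holds
# the generated English text. The data here is made up.
rows = [{"answers": [{"result": "The book is on the table."}]}]

translations = [ann["result"] for row in rows for ann in row["answers"]]
print(translations[0])
```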
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_translation_pt2en|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|xx|
|Size:|914.0 MB|

## References

- https://huggingface.co/unicamp-dl/translation-pt-en-t5
- https://github.com/unicamp-dl/Lite-T5-Translation
- https://aclanthology.org/2020.wmt-1.90.pdf

---
layout: model
title: English asr_wav2vec2_large_xlsr_53_english_by_jonatasgrosman TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xlsr_53_english_by_jonatasgrosman
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_english_by_jonatasgrosman` is an English model originally trained by jonatasgrosman.

NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_english_by_jonatasgrosman_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_english_by_jonatasgrosman_en_4.2.0_3.0_1664036190630.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_english_by_jonatasgrosman_en_4.2.0_3.0_1664036190630.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_english_by_jonatasgrosman', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_english_by_jonatasgrosman", lang = "en") val annotations = pipeline.transform(audioDF) ```
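Both snippets assume an existing `audioDF` whose rows carry raw audio as an array of floats. A minimal, library-free sketch of producing such values from 16-bit PCM samples (the column name `audio_content` and mono input are assumptions for illustration):

```python
# Convert fake 16-bit PCM samples to the normalized [-1.0, 1.0) floats an
# audio DataFrame column would hold. Not Spark NLP API, just arithmetic.
import array

pcm = array.array("h", [0, 16384, -16384, 32767])
floats = [s / 32768.0 for s in pcm]

row = {"audio_content": floats}
print(row["audio_content"][1])  # 0.5
```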
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_english_by_jonatasgrosman|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|1.2 GB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: Lemmatizer (Indonesian, SpacyLookup)
author: John Snow Labs
name: lemma_spacylookup
date: 2022-03-03
tags: [open_source, lemmatizer, id]
task: Lemmatization
language: id
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This Indonesian Lemmatizer is a scalable, production-ready version of the Rule-based Lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/).

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_id_3.4.1_3.0_1646316503301.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_id_3.4.1_3.0_1646316503301.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","id") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Anda tidak lebih baik dari saya"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","id") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("Anda tidak lebih baik dari saya").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("id.lemma.spacylookup").predict("""Anda tidak lebih baik dari saya""") ```
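Under the hood, a SpacyLookup-style lemmatizer is essentially a token-to-lemma dictionary with a pass-through fallback for unknown tokens. A minimal sketch (the two table entries are illustrative examples, not taken from the real lookup data):

```python
# Dictionary-based lemmatization with a pass-through fallback; the table
# entries are illustrative, not the actual spaCy lookup table.
lookup = {"berjalan": "jalan", "makanan": "makan"}

def lemmatize(tokens, table):
    return [table.get(t.lower(), t) for t in tokens]

print(lemmatize(["Anda", "berjalan"], lookup))  # ['Anda', 'jalan']
```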
## Results

```bash
+--------------------------------------+
|result                                |
+--------------------------------------+
|[Anda, tidak, lebih, baik, dari, saya]|
+--------------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|lemma_spacylookup|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[lemma]|
|Language:|id|
|Size:|370.9 KB|

---
layout: model
title: English ElectraForQuestionAnswering model (from armageddon)
author: John Snow Labs
name: electra_qa_base_squad2_covid_deepset
date: 2022-06-22
tags: [en, open_source, electra, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-squad2-covid-qa-deepset` is an English model originally trained by `armageddon`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_base_squad2_covid_deepset_en_4.0.0_3.0_1655920773134.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_base_squad2_covid_deepset_en_4.0.0_3.0_1655920773134.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_squad2_covid_deepset","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_squad2_covid_deepset","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2_covid.electra.base").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_base_squad2_covid_deepset| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|408.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/armageddon/electra-base-squad2-covid-qa-deepset --- layout: model title: Sentence Embeddings - sbert mini (tuned) author: John Snow Labs name: sbert_jsl_mini_umls_uncased date: 2021-06-30 tags: [embeddings, clinical, licensed, en] task: Embeddings language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 2.4 supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained to generate contextual sentence embeddings of input sentences. {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_mini_umls_uncased_en_3.1.0_2.4_1625050218116.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_mini_umls_uncased_en_3.1.0_2.4_1625050218116.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
sbiobert_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_mini_umls_uncased","en","clinical/models") \
    .setInputCols(["sentence"]) \
    .setOutputCol("sbert_embeddings")
```
```scala
val sbiobert_embeddings = BertSentenceEmbeddings
    .pretrained("sbert_jsl_mini_umls_uncased","en","clinical/models")
    .setInputCols(Array("sentence"))
    .setOutputCol("sbert_embeddings")
```

{:.nlu-block}
```python
import nlu
nlu.load("en.embed_sentence.bert.jsl_mini_umlsuncased").predict("""Put your text here.""")
```
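Sentence vectors produced this way are usually compared with cosine similarity (the STS(cos) figure in the benchmarking section uses the same measure). A self-contained sketch with toy 4-dimensional vectors (the real model emits 768 dimensions):

```python
# Cosine similarity between two embedding vectors; parallel vectors
# score ~1.0, orthogonal vectors 0.0.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

print(cosine([1.0, 0.0, 2.0, 0.0], [2.0, 0.0, 4.0, 0.0]))  # ~1.0
```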
## Results

```bash
Gives a 768-dimensional vector representation of the sentence.
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sbert_jsl_mini_umls_uncased|
|Compatibility:|Healthcare NLP 3.1.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Case sensitive:|false|

## Data Source

Tuned on the MedNLI and UMLS datasets.

## Benchmarking

```bash
MedNLI     Score
Acc        0.677
STS(cos)   0.681
```

---
layout: model
title: English asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune TFWav2Vec2ForCTC from hrdipto
author: John Snow Labs
name: asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune` is an English model originally trained by hrdipto.

NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune_en_4.2.0_3.0_1664041647997.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune_en_4.2.0_3.0_1664041647997.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_xls_r_300m_bangla_command_generated_data_finetune|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|

---
layout: model
title: English asr_wav2vec2_base_checkpoint_10 TFWav2Vec2ForCTC from jiobiala24
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_checkpoint_10
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_checkpoint_10` is an English model originally trained by jiobiala24.

NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_checkpoint_10_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_checkpoint_10_en_4.2.0_3.0_1664020588231.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_checkpoint_10_en_4.2.0_3.0_1664020588231.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_checkpoint_10', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_checkpoint_10", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_checkpoint_10| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|349.2 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Polish Legal Roberta Embeddings author: John Snow Labs name: roberta_base_polish_legal date: 2023-02-16 tags: [pl, polish, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: pl edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-polish-roberta-base` is a Polish model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_polish_legal_pl_4.2.4_3.0_1676556043773.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_polish_legal_pl_4.2.4_3.0_1676556043773.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_polish_legal", "pl") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")
```
```scala
val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_polish_legal", "pl")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")
```
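RoBERTa emits one embedding per token. When a single vector per sentence is needed (e.g., for similarity search), mean pooling over the token vectors is a common recipe; a sketch with toy 3-dimensional vectors (the real model produces 768 dimensions):

```python
# Mean pooling: average the per-token vectors column-wise.
token_vecs = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]

sent_vec = [sum(col) / len(token_vecs) for col in zip(*token_vecs)]
print(sent_vec)  # [2.0, 1.0, 1.0]
```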
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_polish_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|pl| |Size:|416.0 MB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-polish-roberta-base --- layout: model title: Extract relations between NIHSS entities author: John Snow Labs name: redl_nihss_biobert date: 2021-11-16 tags: [en, licensed, clinical, relation_extraction] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.3.2 spark_version: 3.0 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Relate scale items and their measurements according to NIHSS guidelines. ## Predicted Entities `Has_Value` : Measurement is related to the entity, `0` : Measurement is not related to the entity {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_NIHSS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_nihss_biobert_en_3.3.2_3.0_1637038860417.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_nihss_biobert_en_3.3.2_3.0_1637038860417.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = sparknlp.annotators.Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel().pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") words_embedder = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel.pretrained("ner_nihss", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverter() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel().pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") # Set a filter on pairs of named entities which will be treated as relation candidates re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks") re_model = RelationExtractionDLModel().pretrained('redl_nihss_biobert', 'en', "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) text= "There , her initial NIHSS score was 4 , as recorded by the ED physicians . This included 2 for weakness in her left leg and 2 for what they felt was subtle ataxia in her left arm and leg ." 
data = spark.createDataFrame([[text]]).toDF("text") p_model = pipeline.fit(data) result = p_model.transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_nihss", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") val re_model = RelationExtractionDLModel.pretrained("redl_nihss_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("""There , her initial NIHSS score was 4 , as recorded by the ED physicians . 
This included 2 for weakness in her left leg and 2 for what they felt was subtle ataxia in her left arm and leg .""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.extract_relation.nihss").predict("""There , her initial NIHSS score was 4 , as recorded by the ED physicians . This included 2 for weakness in her left leg and 2 for what they felt was subtle ataxia in her left arm and leg .""") ```
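Downstream of the pipeline, each row of the `relations` column carries a predicted label and a confidence score. The sketch below is a minimal, Spark-free illustration of the typical post-processing step: keeping only `Has_Value` pairs at or above the 0.5 prediction threshold used above. The field names mirror the output columns shown in the Results; the helper itself is hypothetical, not part of Spark NLP.

```python
def keep_has_value(relations, threshold=0.5):
    """Keep only 'Has_Value' relations whose confidence clears the threshold.

    `relations` is assumed to be a list of dicts with the fields the model
    emits per candidate pair: the relation label, the two chunks, and a
    confidence score.
    """
    return [
        r for r in relations
        if r["relation"] == "Has_Value" and r["confidence"] >= threshold
    ]

# Values taken from the example output for the NIHSS sentence above.
relations = [
    {"relation": "Has_Value", "chunk1": "NIHSS score", "chunk2": "4", "confidence": 0.9998851},
    {"relation": "0", "chunk1": "2", "chunk2": "2", "confidence": 0.97510725},
    {"relation": "Has_Value", "chunk1": "2", "chunk2": "left leg", "confidence": 0.9987311},
]

kept = keep_has_value(relations)  # only the two Has_Value pairs survive
```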
## Results ```bash +---------+-----------+-------------+-----------+-----------+------------+-------------+-----------+--------------------+----------+ | relation| entity1|entity1_begin|entity1_end| chunk1| entity2|entity2_begin|entity2_end| chunk2|confidence| +---------+-----------+-------------+-----------+-----------+------------+-------------+-----------+--------------------+----------+ |Has_Value| NIHSS| 20| 30|NIHSS score| Measurement| 36| 36| 4| 0.9998851| |Has_Value|Measurement| 89| 89| 2| 6a_LeftLeg| 111| 118| left leg| 0.9987311| | 0|Measurement| 89| 89| 2| Measurement| 124| 124| 2|0.97510725| | 0|Measurement| 89| 89| 2|7_LimbAtaxia| 156| 185|ataxia in her lef...| 0.999889| | 0| 6a_LeftLeg| 111| 118| left leg| Measurement| 124| 124| 2|0.99989617| | 0| 6a_LeftLeg| 111| 118| left leg|7_LimbAtaxia| 156| 185|ataxia in her lef...| 0.9999521| |Has_Value|Measurement| 124| 124| 2|7_LimbAtaxia| 156| 185|ataxia in her lef...| 0.9896319| +---------+-----------+-------------+-----------+-----------+------------+-------------+-----------+--------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_nihss_biobert| |Compatibility:|Healthcare NLP 3.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Data Source @article{wangnational, title={National Institutes of Health Stroke Scale (NIHSS) Annotations for the MIMIC-III Database}, author={Wang, Jiayang and Huang, Xiaoshuo and Yang, Lin and Li, Jiao} } ## Benchmarking ```bash Relation Recall Precision F1 Support 0 0.989 0.976 0.982 611 Has_Value 0.983 0.992 0.988 889 Avg. 0.986 0.984 0.985 - Weighted-Avg. 
0.985 0.985 0.985 - ``` --- layout: model title: English DistilBertForQuestionAnswering model (from Gayathri) author: John Snow Labs name: distilbert_qa_Gayathri_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Gayathri`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Gayathri_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724182325.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Gayathri_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724182325.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Gayathri_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Gayathri_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Gayathri").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
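The nlu one-liner above packs the question and the context into a single string separated by `|||`. A small helper (hypothetical, not part of the nlu library) showing how such an input splits back into its two parts:

```python
def split_qa_input(text, sep="|||"):
    """Split an nlu-style 'question|||context' string into (question, context)."""
    question, _, context = text.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa_input("What is my name?|||My name is Clara and I live in Berkeley.")
```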
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_Gayathri_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Gayathri/distilbert-base-uncased-finetuned-squad --- layout: model title: English Bert Embeddings Cased model (from Shafin) author: John Snow Labs name: bert_embeddings_chemical_uncased_finetuned_cust_c2 date: 2023-02-21 tags: [open_source, bert, bert_embeddings, bertformaskedlm, en, tensorflow] task: Embeddings language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chemical-bert-uncased-finetuned-cust-c2` is an English model originally trained by `shafin`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chemical_uncased_finetuned_cust_c2_en_4.3.0_3.0_1676998811176.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chemical_uncased_finetuned_cust_c2_en_4.3.0_3.0_1676998811176.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_chemical_uncased_finetuned_cust_c2","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_chemical_uncased_finetuned_cust_c2","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chemical_uncased_finetuned_cust_c2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|412.1 MB| |Case sensitive:|true| ## References https://huggingface.co/shafin/chemical-bert-uncased-finetuned-cust-c2 --- layout: model title: Word2Vec Embeddings in Dutch (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, nl, open_source] task: Embeddings language: nl edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_nl_3.4.1_3.0_1647294285411.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_nl_3.4.1_3.0_1647294285411.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","nl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ik hou van vonk nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","nl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ik hou van vonk nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("nl.embed.w2v_cc_300d").predict("""Ik hou van vonk nlp""") ```
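Once the 300-dimensional vectors land in the `embeddings` column, a common downstream step is comparing tokens by cosine similarity. A minimal sketch of that computation in plain Python, assuming the vectors have already been collected out of Spark (the helper name is our own, not a Spark NLP API):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Out-of-vocabulary tokens come back as zero vectors; treat them as
    # having zero similarity rather than dividing by zero.
    return dot / norm if norm else 0.0
```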
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|nl| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English AlbertForQuestionAnswering model (from replydotai) author: John Snow Labs name: albert_qa_xxlarge_v1_finetuned_squad2 date: 2022-06-24 tags: [en, open_source, albert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: AlBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `albert-xxlarge-v1-finetuned-squad2` is an English model originally trained by `replydotai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_xxlarge_v1_finetuned_squad2_en_4.0.0_3.0_1656063937190.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_xxlarge_v1_finetuned_squad2_en_4.0.0_3.0_1656063937190.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_xxlarge_v1_finetuned_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_xxlarge_v1_finetuned_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.albert.xxl").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_qa_xxlarge_v1_finetuned_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|772.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/replydotai/albert-xxlarge-v1-finetuned-squad2 --- layout: model title: Clinical Deidentification Pipeline (Romanian) author: John Snow Labs name: clinical_deidentification date: 2022-06-28 tags: [licensed, clinical, ro, deid, deidentification] task: Pipeline Healthcare language: ro edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline is trained with `w2v_cc_300d` Romanian embeddings and can be used to deidentify PHI information from medical texts in Romanian. The PHI information will be masked and obfuscated in the resulting text. 
The pipeline can mask, fake or obfuscate the following entities: `AGE`, `CITY`, `COUNTRY`, `DATE`, `DOCTOR`, `EMAIL`, `FAX`, `HOSPITAL`, `IDNUM`, `LOCATION-OTHER`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `PROFESSION`, `STREET`, `ZIP`, `ACCOUNT`, `LICENSE`, `PLATE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_ro_4.0.0_3.0_1656402882837.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_ro_4.0.0_3.0_1656402882837.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "ro", "clinical/models") sample = """Medic : Dr. Agota EVELYN, C.N.P : 2450502264401, Data setului de analize: 25 May 2022 Varsta : 77, Nume si Prenume : BUREAN MARIA Tel: +40(235)413773, E-mail : hale@gmail.com, Licență : B004256985M, Înmatriculare : CD205113, Cont : FXHZ7170951927104999, Spitalul Pentru Ochi de Deal Drumul Oprea Nr. 972 Vaslui, 737405 """ result = deid_pipeline.annotate(sample) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = new PretrainedPipeline("clinical_deidentification", "ro", "clinical/models") val sample = """Medic : Dr. Agota EVELYN, C.N.P : 2450502264401, Data setului de analize: 25 May 2022 Varsta : 77, Nume si Prenume : BUREAN MARIA Tel: +40(235)413773, E-mail : hale@gmail.com, Licență : B004256985M, Înmatriculare : CD205113, Cont : FXHZ7170951927104999, Spitalul Pentru Ochi de Deal Drumul Oprea Nr. 972 Vaslui, 737405 """ val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("ro.deid.clinical").predict("""Medic : Dr. Agota EVELYN, C.N.P : 2450502264401, Data setului de analize: 25 May 2022 Varsta : 77, Nume si Prenume : BUREAN MARIA Tel: +40(235)413773, E-mail : hale@gmail.com, Licență : B004256985M, Înmatriculare : CD205113, Cont : FXHZ7170951927104999, Spitalul Pentru Ochi de Deal Drumul Oprea Nr. 972 Vaslui, 737405 """) ```
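Under the hood, the fixed-length masking mode replaces every detected PHI chunk with the same `****` token. A Spark-free sketch of that substitution, assuming inclusive begin/end character offsets as Spark NLP annotations emit them (the helper name is hypothetical, for illustration only):

```python
def mask_fixed_length(text, chunks, mask="****"):
    """Replace each detected PHI chunk with a fixed-length mask.

    `chunks` are (begin, end) character offsets as emitted by the NER
    stage; offsets are assumed inclusive, as in Spark NLP annotations.
    """
    out, prev = [], 0
    for begin, end in sorted(chunks):
        out.append(text[prev:begin])  # keep non-PHI text verbatim
        out.append(mask)              # hide the chunk behind the mask
        prev = end + 1
    out.append(text[prev:])
    return "".join(out)

# Offsets cover "Agota EVELYN" (12-23) and "77" (35-36) in this snippet.
masked = mask_fixed_length("Medic : Dr. Agota EVELYN, Varsta : 77", [(12, 23), (35, 36)])
# masked == "Medic : Dr. ****, Varsta : ****"
```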
## Results ```bash Masked with entity labels ------------------------------ Medic : Dr. , C.N.P : , Data setului de analize: Varsta : , Nume si Prenume : Tel: , E-mail : , Licență : , Înmatriculare : , Cont : , , Masked with chars ------------------------------ Medic : Dr. [**********], C.N.P : [***********], Data setului de analize: [*********] Varsta : **, Nume si Prenume : [**********] Tel: [************], E-mail : [************], Licență : [*********], Înmatriculare : [******], Cont : [******************], [**************************] [******************] [****], [****] Masked with fixed length chars ------------------------------ Medic : Dr. ****, C.N.P : ****, Data setului de analize: **** Varsta : ****, Nume si Prenume : **** Tel: ****, E-mail : ****, Licență : ****, Înmatriculare : ****, Cont : ****, **** **** ****, **** Obfuscated ------------------------------ Medic : Dr. Doina Gheorghiu, C.N.P : 6794561192919, Data setului de analize: 01-04-2001 Varsta : 91, Nume si Prenume : Dragomir Emilia Tel: 0248 551 376, E-mail : tudorsmaranda@kappa.ro, Licență : T003485962M, Înmatriculare : AR-65-UPQ, Cont : KHHO5029180812813651, Centrul Medical de Evaluare si Recuperare pentru Copii si Tineri Cristian Serban Buzias Aleea Voinea Curcani, 328479 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Language:|ro| |Size:|1.2 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ChunkMergeModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from machine2049) author: John Snow Labs 
name: distilbert_qa_base_uncased_finetuned_duorc date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-duorc_distilbert` is an English model originally trained by `machine2049`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_duorc_en_4.3.0_3.0_1672767924782.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_duorc_en_4.3.0_3.0_1672767924782.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_duorc","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_duorc","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_duorc| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/machine2049/distilbert-base-uncased-finetuned-duorc_distilbert --- layout: model title: French CamemBert Embeddings (from DataikuNLP) author: John Snow Labs name: camembert_embeddings_DataikuNLP_camembert_base date: 2022-05-23 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `camembert-base` is a French model originally trained by `DataikuNLP`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_DataikuNLP_camembert_base_fr_3.4.4_3.0_1653321075374.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_DataikuNLP_camembert_base_fr_3.4.4_3.0_1653321075374.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_DataikuNLP_camembert_base","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_DataikuNLP_camembert_base","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_DataikuNLP_camembert_base| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/DataikuNLP/camembert-base - https://arxiv.org/abs/1911.03894 - https://camembert-model.fr/ --- layout: model title: English BertForQuestionAnswering Base Cased model (from Slavka) author: John Snow Labs name: bert_qa_base_cased_finetuned_log_parser_winlogbeat_nowhitespac date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased-finetuned-log-parser-winlogbeat_nowhitespace` is an English model originally trained by `Slavka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_cased_finetuned_log_parser_winlogbeat_nowhitespac_en_4.0.0_3.0_1657182762527.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_cased_finetuned_log_parser_winlogbeat_nowhitespac_en_4.0.0_3.0_1657182762527.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_cased_finetuned_log_parser_winlogbeat_nowhitespac","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_cased_finetuned_log_parser_winlogbeat_nowhitespac","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_cased_finetuned_log_parser_winlogbeat_nowhitespac| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Slavka/bert-base-cased-finetuned-log-parser-winlogbeat_nowhitespace --- layout: model title: Abkhazian asr_xls_test TFWav2Vec2ForCTC from pere author: John Snow Labs name: asr_xls_test date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_test` is an Abkhazian model originally trained by pere. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_xls_test_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xls_test_ab_4.2.0_3.0_1664020688638.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xls_test_ab_4.2.0_3.0_1664020688638.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_xls_test", "ab")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_xls_test", "ab") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_xls_test| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ab| |Size:|446.8 KB| --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from squirro) author: John Snow Labs name: roberta_qa_base_squad_v2 date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-base-squad_v2` is an English model originally trained by `squirro`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad_v2_en_4.2.4_3.0_1669985131628.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad_v2_en_4.2.4_3.0_1669985131628.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad_v2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_squad_v2| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|307.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/squirro/distilroberta-base-squad_v2 - https://rajpurkar.github.io/SQuAD-explorer/ - http://squirro.com - https://open.spotify.com/show/6NPLcv9EyaD2DcNT8v89Kb - https://podcasts.apple.com/us/podcast/redefining-ai/id1613934397 - https://www.linkedin.com/company/squirroag - https://www.linkedin.com/showcase/the-squirro-academy - https://twitter.com/Squirro - https://www.facebook.com/squirro - https://www.instagram.com/squirro/ - https://paperswithcode.com/sota?task=Question+Answering&dataset=The+Stanford+Question+Answering+Dataset --- layout: model title: Pipeline to Detect PHI (Deidentification) author: John Snow Labs name: ner_deid_large_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, deidentification, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deid_large](https://nlp.johnsnowlabs.com/2021/03/31/ner_deid_large_en.html) model.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_large_pipeline_en_3.4.1_3.0_1647869607303.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_large_pipeline_en_3.4.1_3.0_1647869607303.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_large_pipeline", "en", "clinical/models") pipeline.annotate("HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.") ``` ```scala val pipeline = new PretrainedPipeline("ner_deid_large_pipeline", "en", "clinical/models") pipeline.annotate("HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. 
He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.deid_large.pipeline").predict("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. 
A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.""") ```
## Results ```bash +---------------+---------+ |chunk |ner_label| +---------------+---------+ |Smith |NAME | |VA Hospital |LOCATION | |Day Hospital |LOCATION | |02/04/2003 |DATE | |Smith |NAME | |Day Hospital |LOCATION | |Smith |NAME | |Smith |NAME | |7 Ardmore Tower|LOCATION | |Hart. |NAME | |Smith |NAME | |02/07/2003 |DATE | +---------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_large_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC2GM_Gene_ImbalancedPubMedBERT date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC2GM-Gene_ImbalancedPubMedBERT` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `GENE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC2GM_Gene_ImbalancedPubMedBERT_en_4.0.0_3.0_1657107999526.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC2GM_Gene_ImbalancedPubMedBERT_en_4.0.0_3.0_1657107999526.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC2GM_Gene_ImbalancedPubMedBERT","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC2GM_Gene_ImbalancedPubMedBERT","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC2GM_Gene_ImbalancedPubMedBERT| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|408.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC2GM-Gene_ImbalancedPubMedBERT --- layout: model title: Ganda asr_wav2vec2_luganda_by_birgermoell TFWav2Vec2ForCTC from birgermoell author: John Snow Labs name: pipeline_asr_wav2vec2_luganda_by_birgermoell date: 2022-09-24 tags: [wav2vec2, lg, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: lg edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_luganda_by_birgermoell` is a Ganda model originally trained by birgermoell. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_luganda_by_birgermoell_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_luganda_by_birgermoell_lg_4.2.0_3.0_1664037236402.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_luganda_by_birgermoell_lg_4.2.0_3.0_1664037236402.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_luganda_by_birgermoell', lang = 'lg') annotations = pipeline.transform(audioDF) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_luganda_by_birgermoell", lang = "lg") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_luganda_by_birgermoell| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|lg| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: French T5ForConditionalGeneration Base Cased model (from JDBN) author: John Snow Labs name: t5_base_qg_fquad date: 2023-01-30 tags: [fr, open_source, t5, tensorflow] task: Text Generation language: fr edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-fr-qg-fquad` is a French model originally trained by `JDBN`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_qg_fquad_fr_4.3.0_3.0_1675109420285.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_qg_fquad_fr_4.3.0_3.0_1675109420285.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_base_qg_fquad","fr") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_base_qg_fquad","fr") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_base_qg_fquad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|fr| |Size:|923.2 MB| ## References - https://huggingface.co/JDBN/t5-base-fr-qg-fquad - https://github.com/patil-suraj/question_generation --- layout: model title: English BertForQuestionAnswering model (from NeuML) author: John Snow Labs name: bert_qa_bert_small_cord19qa date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-small-cord19qa` is an English model originally trained by `NeuML`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_small_cord19qa_en_4.0.0_3.0_1654184749665.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_small_cord19qa_en_4.0.0_3.0_1654184749665.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_small_cord19qa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_small_cord19qa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.cord19.bert.small").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_small_cord19qa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|130.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/NeuML/bert-small-cord19qa - https://www.kaggle.com/davidmezzetti/cord19-qa?select=cord19-qa.json - https://www.semanticscholar.org/cord19 --- layout: model title: Understanding Non-compete Items in Non-Compete Clauses (Bert) author: John Snow Labs name: legclf_nda_non_compete_items_bert date: 2023-05-17 tags: [en, legal, licensed, bert, classification, nda, non_compete, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Given a clause classified as `NON_COMP` using the `legmulticlf_mnda_sections_paragraph_other` classifier, you can subclassify the sentences as `NON_COMPETE_ITEMS`, or `OTHER` from it using the `legclf_nda_non_compete_items_bert` model. It has been trained with the SOTA approach. ## Predicted Entities `NON_COMPETE_ITEMS`, `OTHER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_nda_non_compete_items_bert_en_1.0.0_3.0_1684358961459.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_nda_non_compete_items_bert_en_1.0.0_3.0_1684358961459.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pandas as pd document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") sequence_classifier = legal.BertForSequenceClassification.pretrained("legclf_nda_non_compete_items_bert", "en", "legal/models")\ .setInputCols(["document", "token"])\ .setOutputCol("class")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) clf_pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, sequence_classifier ]) empty_df = spark.createDataFrame([['']]).toDF("text") model = clf_pipeline.fit(empty_df) text_list = [ """This Agreement will be binding upon and inure to the benefit of each Party and its respective heirs, successors and assigns""", """Activity that is in direct competition with the Company's business, including but not limited to developing, marketing, or selling products or services that are similar to those of the Company.""" ] df = spark.createDataFrame(pd.DataFrame({"text" : text_list})) result = model.transform(df) ```
## Results ```bash +--------------------------------------------------------------------------------+-----------------+ | text| class| +--------------------------------------------------------------------------------+-----------------+ |This Agreement will be binding upon and inure to the benefit of each Party an...| OTHER| |Activity that is in direct competition with the Company's business, including...|NON_COMPETE_ITEMS| +--------------------------------------------------------------------------------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_nda_non_compete_items_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References In-house annotations on the Non-disclosure Agreements ## Benchmarking ```bash label precision recall f1-score support NON_COMPETE_ITEMS 1.00 1.00 1.00 10 OTHER 1.00 1.00 1.00 64 accuracy - - 1.00 74 macro avg 1.00 1.00 1.00 74 weighted avg 1.00 1.00 1.00 74 ``` --- layout: model title: English image_classifier_vit_rare_puppers_new_auth ViTForImageClassification from nateraw author: John Snow Labs name: image_classifier_vit_rare_puppers_new_auth date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rare_puppers_new_auth` is an English model originally trained by nateraw.
## Predicted Entities `corgi`, `samoyed`, `shiba inu` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rare_puppers_new_auth_en_4.1.0_3.0_1660168114642.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rare_puppers_new_auth_en_4.1.0_3.0_1660168114642.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_rare_puppers_new_auth", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_rare_puppers_new_auth", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_rare_puppers_new_auth| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English image_classifier_vit_ALL ViTForImageClassification from Ahmed9275 author: John Snow Labs name: image_classifier_vit_ALL date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_ALL` is an English model originally trained by Ahmed9275. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_ALL_en_4.1.0_3.0_1660168743224.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_ALL_en_4.1.0_3.0_1660168743224.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_ALL", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_ALL", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_ALL| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: NER Pipeline for Treatments - Voice of the Patient author: John Snow Labs name: ner_vop_treatment_pipeline date: 2023-06-10 tags: [licensed, ner, en, vop, treatment, clinical, pipeline] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline extracts mentions of treatment entities from health-related text in colloquial language. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_treatment_pipeline_en_4.4.3_3.0_1686430967159.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_treatment_pipeline_en_4.4.3_3.0_1686430967159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_vop_treatment_pipeline", "en", "clinical/models") pipeline.annotate(" My grandpa was diagnosed with type 2 diabetes and had to make some changes to his lifestyle. He also takes metformin and glipizide to help regulate his blood sugar levels. It's been a bit of an adjustment, but he's doing well. ") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_vop_treatment_pipeline", "en", "clinical/models") val result = pipeline.annotate(" My grandpa was diagnosed with type 2 diabetes and had to make some changes to his lifestyle. He also takes metformin and glipizide to help regulate his blood sugar levels. It's been a bit of an adjustment, but he's doing well. ") ```
## Results ```bash | chunk | ner_label | |:----------|:------------| | metformin | Drug | | glipizide | Drug | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_treatment_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|791.6 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Translate English to Japanese Pipeline author: John Snow Labs name: translate_en_jap date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, jap, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `jap` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_jap_xx_2.7.0_2.4_1609688554440.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_jap_xx_2.7.0_2.4_1609688554440.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_jap", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_jap", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.jap').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_jap| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Preamble Clause Binary Classifier author: John Snow Labs name: legclf_preamble_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `preamble` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
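As an illustration of the splitting advice above, here is a minimal sketch in plain Python. The blank-line paragraph rule and the whitespace-based 512-token cap are assumptions for illustration, not the exact logic used in the tutorial (real subword tokenizers count tokens differently):

```python
# Hypothetical helper: chunk a long legal document into paragraph-level
# candidates before sending each one to the clause classifier.
def split_into_clause_candidates(text, max_tokens=512):
    """Split on blank lines, then trim each paragraph to max_tokens
    whitespace-delimited tokens to respect the embedding limit."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [" ".join(p.split()[:max_tokens]) for p in paragraphs]

doc = "PREAMBLE\n\nThis Agreement is entered into by and between the parties.\n\n1. Definitions."
for chunk in split_into_clause_candidates(doc):
    print(chunk)
```

Each resulting chunk can then be loaded into the `clause_text` column that the pipeline below expects.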
## Predicted Entities `other`, `preamble` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_preamble_clause_en_1.0.0_3.2_1660123833867.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_preamble_clause_en_1.0.0_3.2_1660123833867.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_preamble_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[preamble]| |[other]| |[other]| |[preamble]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_preamble_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.94 0.94 0.94 66 preamble 0.84 0.84 0.84 25 accuracy - - 0.91 91 macro avg 0.89 0.89 0.89 91 weighted avg 0.91 0.91 0.91 91 ``` --- layout: model title: English asr_wav2vec2_base_timit_demo_colab2_by_hassnain TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab2_by_hassnain date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab2_by_hassnain` is an English model originally trained by hassnain.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab2_by_hassnain_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab2_by_hassnain_en_4.2.0_3.0_1664040008933.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab2_by_hassnain_en_4.2.0_3.0_1664040008933.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab2_by_hassnain', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab2_by_hassnain", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab2_by_hassnain| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English RobertaForQuestionAnswering (from mrm8488) author: John Snow Labs name: roberta_qa_roberta_base_1B_1_finetuned_squadv1 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-1B-1-finetuned-squadv1` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_1B_1_finetuned_squadv1_en_4.0.0_3.0_1655729682223.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_1B_1_finetuned_squadv1_en_4.0.0_3.0_1655729682223.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_1B_1_finetuned_squadv1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_1B_1_finetuned_squadv1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_1B_1_finetuned_squadv1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|447.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/roberta-base-1B-1-finetuned-squadv1 - https://twitter.com/mrm8488 - https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/ - https://www.linkedin.com/in/manuel-romero-cs/ --- layout: model title: XLM-RoBERTa Base, CoNLL-03 NER Pipeline author: John Snow Labs name: xlm_roberta_base_token_classifier_conll03_pipeline date: 2022-06-19 tags: [open_source, ner, token_classifier, xlm_roberta, conll03, xlm, base, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [xlm_roberta_base_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/10/03/xlm_roberta_base_token_classifier_conll03_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_base_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655654206397.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_base_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655654206397.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("xlm_roberta_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("xlm_roberta_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PERSON | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_base_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|851.9 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - XlmRoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: Emotion Detection Classifier author: John Snow Labs name: classifierdl_use_emotion date: 2021-01-09 task: Text Classification language: en nav_key: models edition: Spark NLP 2.7.1 spark_version: 2.4 tags: [open_source, en, classifier, emotion] supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Automatically identify Joy, Surprise, Fear, Sadness emotions in Tweets. ## Predicted Entities `surprise`, `sadness`, `fear`, `joy` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN_EMOTION/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN_EMOTION.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_emotion_en_2.7.1_2.4_1610190563302.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_emotion_en_2.7.1_2.4_1610190563302.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") use = UniversalSentenceEncoder.pretrained('tfhub_use', lang="en") \ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") classifier = ClassifierDLModel.pretrained('classifierdl_use_emotion', 'en') \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("sentiment") nlpPipeline = Pipeline(stages=[document_assembler, use, classifier]) data = spark.createDataFrame([["@Mira I just saw you on live t.v!!"], ["Just home from group celebration - dinner at Trattoria Gianni, then Hershey Felder's performance - AMAZING!!"], ["Nooooo! My dad turned off the internet so I can't listen to band music!"], ["My soul has just been pierced by the most evil look from @rickosborneorg. A mini panic attack and chill in bones followed soon after."]]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val use = UniversalSentenceEncoder.pretrained("tfhub_use", "en") .setInputCols(Array("document")) .setOutputCol("sentence_embeddings") val classifier = ClassifierDLModel.pretrained("classifierdl_use_emotion", "en") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("sentiment") val pipeline = new Pipeline().setStages(Array(documentAssembler, use, classifier)) val data = Seq(Array("@Mira I just saw you on live t.v!!", "Just home from group celebration - dinner at Trattoria Gianni, then Hershey Felder's performance - AMAZING!!", "Nooooo! My dad turned off the internet so I can't listen to band music!", "My soul has just been pierced by the most evil look from @rickosborneorg. 
A mini panic attack and chill in bones followed soon after.")).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""@Mira I just saw you on live t.v!!"""] emotion_df = nlu.load('en.classify.emotion.use').predict(text, output_level='document') emotion_df[["document", "emotion"]] ```
## Results ```bash +----------+---------+ |document|sentiment| +----------+---------+ |@Mira I just saw you on live t.v!!|surprise| |Just home from group celebration - dinner at Trattoria Gianni, then Hershey Felder's performance - AMAZING!!|joy| |Nooooo! My dad turned off the internet so I can't listen to band music!|sadness| |My soul has just been pierced by the most evil look from @rickosborneorg. A mini panic attack and chill in bones followed soon after.|fear| +----------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_use_emotion| |Compatibility:|Spark NLP 2.7.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Dependencies:|tfhub_use| ## Data Source This model is trained on multiple datasets, including YouTube comments, Twitter, and the ISEAR dataset.
## Benchmarking ```bash label precision recall f1-score support fear 0.78 0.67 0.72 2253 joy 0.71 0.68 0.69 3000 sadness 0.69 0.73 0.71 3075 surprise 0.67 0.73 0.70 3067 accuracy - - 0.71 11395 macro-avg 0.71 0.70 0.71 11395 weighted-avg 0.71 0.71 0.71 11395 ``` --- layout: model title: Chinese Bert Embeddings (Siku Quanshu corpus) author: John Snow Labs name: bert_embeddings_sikubert date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `sikubert` is a Chinese model originally trained by `SIKU-BERT`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_sikubert_zh_3.4.2_3.0_1649669690991.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_sikubert_zh_3.4.2_3.0_1649669690991.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_sikubert","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_sikubert","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.sikubert").predict("""I love Spark NLP""") ```
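Downstream of a pipeline like the one above, each row of the `embeddings` column holds one vector per token. As a generic, library-agnostic illustration (not part of the Spark NLP API), such vectors are commonly compared with cosine similarity:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score near 1.0; orthogonal ones near 0.0.
print(cosine_similarity([0.2, 0.8, 0.1], [0.2, 0.8, 0.1]))
```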
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_sikubert| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|408.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/SIKU-BERT/sikubert - https://github.com/SIKU-BERT/SikuBERT-for-digital-humanities-and-classical-Chinese-information-processing --- layout: model title: English asr_wav2vec2_large_10min_lv60_self TFWav2Vec2ForCTC from Splend1dchan author: John Snow Labs name: pipeline_asr_wav2vec2_large_10min_lv60_self date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_10min_lv60_self` is an English model originally trained by Splend1dchan. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_10min_lv60_self_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_10min_lv60_self_en_4.2.0_3.0_1664024573442.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_10min_lv60_self_en_4.2.0_3.0_1664024573442.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_10min_lv60_self', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_10min_lv60_self", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_10min_lv60_self| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|757.6 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Stop Words Cleaner for Catalan author: John Snow Labs name: stopwords_ca date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: ca edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, ca] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_ca_ca_2.5.4_2.4_1594742440888.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_ca_ca_2.5.4_2.4_1594742440888.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_ca", "ca") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("A part de ser el rei del nord, John Snow és un metge anglès i líder en el desenvolupament de l'anestèsia i la higiene mèdica.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_ca", "ca") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("A part de ser el rei del nord, John Snow és un metge anglès i líder en el desenvolupament de l'anestèsia i la higiene mèdica.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""A part de ser el rei del nord, John Snow és un metge anglès i líder en el desenvolupament de l'anestèsia i la higiene mèdica."""] stopword_df = nlu.load('ca.stopwords').predict(text) stopword_df[["cleanTokens"]] ```
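Conceptually, the cleaner simply filters tokens against a fixed, case-insensitive word list (matching `Case sensitive: false` in the model information below). Here is a pure-Python sketch of that idea; the Catalan stop words listed are a tiny assumed subset for illustration, not the model's actual list:

```python
# Assumed illustrative subset of Catalan stop words, NOT the model's list.
CATALAN_STOPWORDS = {"a", "de", "del", "el", "i", "en", "la", "un", "és"}

def remove_stopwords(tokens, stopwords=CATALAN_STOPWORDS):
    """Drop tokens found in the stop-word list, ignoring case."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ["A", "part", "de", "ser", "el", "rei", "del", "nord"]
print(remove_stopwords(tokens))  # ['part', 'ser', 'rei', 'nord']
```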
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=2, end=5, result='part', metadata={'sentence': '0'}), Row(annotatorType='token', begin=10, end=12, result='ser', metadata={'sentence': '0'}), Row(annotatorType='token', begin=17, end=19, result='rei', metadata={'sentence': '0'}), Row(annotatorType='token', begin=25, end=28, result='nord', metadata={'sentence': '0'}), Row(annotatorType='token', begin=29, end=29, result=',', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_ca| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|ca| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: English asr_wav2vec2_large_xlsr_53_gpt TFWav2Vec2ForCTC from voidful author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_gpt date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_gpt` is an English model originally trained by voidful.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_gpt_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_gpt_en_4.2.0_3.0_1664095239689.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_gpt_en_4.2.0_3.0_1664095239689.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_gpt", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_gpt", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
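Both snippets above assume an existing `audioDf` DataFrame with a column of raw audio samples, but not how it is built. As a hedged sketch (the `audio_content` column name and the mono 16-bit PCM assumption are illustrative, not a guaranteed Spark NLP contract), WAV bytes can be decoded to normalized floats using only the standard library:

```python
import io
import struct
import wave

def wav_to_floats(wav_bytes):
    """Decode mono 16-bit PCM WAV bytes into floats in [-1.0, 1.0]."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        assert w.getsampwidth() == 2, "expected 16-bit PCM"
        frames = w.readframes(w.getnframes())
    count = len(frames) // 2
    return [s / 32768.0 for s in struct.unpack("<%dh" % count, frames)]

# Hypothetical usage with an existing SparkSession:
# with open("sample.wav", "rb") as f:
#     floats = wav_to_floats(f.read())
# audioDf = spark.createDataFrame([(floats,)], ["audio_content"])
```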
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_gpt| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.3 GB| --- layout: model title: English BertForQuestionAnswering model (from ydshieh) author: John Snow Labs name: bert_qa_ydshieh_bert_base_cased_squad2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased-squad2` is an English model originally trained by `ydshieh`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_ydshieh_bert_base_cased_squad2_en_4.0.0_3.0_1654179878924.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_ydshieh_bert_base_cased_squad2_en_4.0.0_3.0_1654179878924.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_ydshieh_bert_base_cased_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_ydshieh_bert_base_cased_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.base_cased.by_ydshieh").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_ydshieh_bert_base_cased_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ydshieh/bert-base-cased-squad2 --- layout: model title: Detect Assertion Status (assertion_ml_en) author: John Snow Labs name: assertion_ml_en date: 2020-01-30 task: Assertion Status language: en nav_key: models edition: Healthcare NLP 2.4.0 spark_version: 2.4 tags: [clinical, licensed, ner, en] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Logistic regression-based model that assigns an assertion status to recognized clinical entities. ## Predicted Labels Hypothetical, Present, Absent, Possible, Conditional, Associated_with_someone_else {:.btn-box} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_dl_large_en_2.5.0_2.4_1590022282256.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_dl_large_en_2.5.0_2.4_1590022282256.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter, AssertionLogRegModel.
{% include programmingLanguageSelectScalaPython.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") clinical_assertion = AssertionLogRegModel.pretrained("assertion_ml", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) light_result = LightPipeline(model).fullAnnotate('Patient has a headache for the last 2 weeks and appears anxious when she walks fast. No alopecia noted. She denies pain')[0] ``` ```scala ... val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val clinical_assertion_ml = AssertionLogRegModel.pretrained("assertion_ml", "en", "clinical/models") .setInputCols("sentence", "ner_chunk", "embeddings") .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion_ml)) val data = Seq("Patient has a headache for the last 2 weeks and appears anxious when she walks fast. No alopecia noted. She denies pain").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.h2_title} ## Results The output is a dataframe with a sentence per row and an "assertion" column containing all of the assertion labels in the sentence. The assertion column also contains assertion character indices and other metadata. To get only the entity chunks and assertion labels, without the metadata, select "ner_chunk.result" and "assertion.result" from your output dataframe. ```bash | | chunks | entities | assertion | |---|------------|----------|-------------| | 0 | a headache | PROBLEM | present | | 1 | anxious | PROBLEM | conditional | | 2 | alopecia | PROBLEM | absent | | 3 | pain | PROBLEM | absent | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_ml_en_2.4.0_2.4| |Type:|ner| |Compatibility:|Spark NLP 2.4.0+| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence, ner_chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| |Case sensitive:|false| {:.h2_title} ## Data Source Trained on the 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text with 'embeddings_clinical'. https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ --- layout: model title: Legal Voting Clause Binary Classifier author: John Snow Labs name: legclf_voting_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `voting` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.
If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens; if you have more than that, consider splitting the text into smaller pieces (see the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `voting` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_voting_clause_en_1.0.0_3.2_1660123185915.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_voting_clause_en_1.0.0_3.2_1660123185915.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_voting_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
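Stacking several of these binary clause classifiers yields one predicted label per model; collapsing those labels into True/False flags is plain post-processing. A sketch with hypothetical model names and outputs (no Spark required):

```python
# Hypothetical outputs of several binary clause classifiers for one clause text
predictions = {
    "legclf_voting_clause": "voting",
    "legclf_standard_of_care_clause": "other",
}

# A binary clause classifier asserts its clause type whenever it
# predicts anything other than "other"
flags = {name: label != "other" for name, label in predictions.items()}
```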
## Results ```bash +-------+ | result| +-------+ |[voting]| |[other]| |[other]| |[voting]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_voting_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.97 0.98 0.98 193 voting 0.96 0.93 0.95 75 accuracy - - 0.97 268 macro-avg 0.97 0.96 0.96 268 weighted-avg 0.97 0.97 0.97 268 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_128_finetuned_squad_seed_6 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-128-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_128_finetuned_squad_seed_6_en_4.3.0_3.0_1674213983376.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_128_finetuned_squad_seed_6_en_4.3.0_3.0_1674213983376.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_128_finetuned_squad_seed_6","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_128_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_128_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|422.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-128-finetuned-squad-seed-6 --- layout: model title: Detect Assertion Status from Oncology Treatments author: John Snow Labs name: assertion_oncology_treatment_binary_wip date: 2022-10-01 tags: [licensed, clinical, oncology, en, assertion, treatment] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects the assertion status of oncology treatment entities. The model identifies if the treatment has been used (Present_Or_Past status), or if it is only mentioned in hypothetical or absent sentences (Hypothetical_Or_Absent status). ## Predicted Entities `Hypothetical_Or_Absent`, `Present_Or_Past` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_treatment_binary_wip_en_4.1.0_3.0_1664641549969.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_treatment_binary_wip_en_4.1.0_3.0_1664641549969.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(["Cancer_Surgery", "Chemotherapy"]) assertion = AssertionDLModel.pretrained("assertion_oncology_treatment_binary_wip", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion]) data = spark.createDataFrame([["The patient underwent a mastectomy two years ago. We recommend to start chemotherapy."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Cancer_Surgery", "Chemotherapy")) val assertion = AssertionDLModel.pretrained("assertion_oncology_treatment_binary_wip","en","clinical/models") .setInputCols(Array("sentence","ner_chunk","embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion)) val data = Seq("""The patient underwent a mastectomy two years ago. We recommend to start chemotherapy.""").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.oncology_treatment_binary").predict("""The patient underwent a mastectomy two years ago. We recommend to start chemotherapy.""")  ```
## Results ```bash | chunk | ner_label | assertion | |:-------------|:---------------|:-----------------------| | mastectomy | Cancer_Surgery | Present_Or_Past | | chemotherapy | Chemotherapy | Hypothetical_Or_Absent | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_oncology_treatment_binary_wip| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion_pred]| |Language:|en| |Size:|1.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label precision recall f1-score support Hypothetical_Or_Absent 0.76 0.77 0.76 128.0 Present_Or_Past 0.75 0.73 0.74 118.0 macro-avg 0.75 0.75 0.75 246.0 weighted-avg 0.75 0.75 0.75 246.0 ``` --- layout: model title: Word2Vec Embeddings in Central Bikol (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, bcl, open_source] task: Embeddings language: bcl edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_bcl_3.4.1_3.0_1647290353278.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_bcl_3.4.1_3.0_1647290353278.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","bcl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","bcl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("bcl.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
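Word vectors produced by embeddings models like this one are commonly compared with cosine similarity. A self-contained pure-Python sketch (toy 3-dimensional vectors, not actual 300-dimensional model output):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for token embeddings from the "embeddings" column
v_king = [0.5, 0.7, 0.1]
v_queen = [0.45, 0.72, 0.12]
```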
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|bcl| |Size:|65.5 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_small_lm_adapt date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-lm-adapt` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_lm_adapt_en_4.3.0_3.0_1675126486527.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_lm_adapt_en_4.3.0_3.0_1675126486527.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_lm_adapt","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_lm_adapt","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_lm_adapt| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|179.3 MB| ## References - https://huggingface.co/google/t5-small-lm-adapt - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#lm-adapted-t511lm100k - https://arxiv.org/abs/2002.05202 - https://arxiv.org/pdf/1910.10683.pdf --- layout: model title: Fast Neural Machine Translation Model from Marathi to English author: John Snow Labs name: opus_mt_mr_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, mr, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `mr` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_mr_en_xx_2.7.0_2.4_1609167855864.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_mr_en_xx_2.7.0_2.4_1609167855864.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_mr_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = "Text to translate" result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_mr_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.mr.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_mr_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Agreement and Declaration Document Classifier (Longformer) author: John Snow Labs name: legclf_agreement_and_declaration date: 2022-11-10 tags: [licensed, legal, en, classification, agreement, declaration] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_agreement_and_declaration` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `agreement-and-declaration` or not (Binary Classification). Longformers have a limit of 4096 tokens, so only the first 4096 tokens will be taken into account. We have found that for the vast majority of documents in legal corpora, if they are clean and contain only the legal document without any extra leading information, 4096 tokens are enough to perform Document Classification. If not, let us know and we can carry out another approach for you: splitting the document into chunks of 4096 tokens and averaging their embeddings, then training on the averaged version, which means the whole document is taken into account. But this should not typically be required. 
## Predicted Entities `agreement-and-declaration`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_agreement_and_declaration_en_1.0.0_3.0_1668106274229.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_agreement_and_declaration_en_1.0.0_3.0_1668106274229.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token") \ .setOutputCol("embeddings") sembeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_agreement_and_declaration", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, sembeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[agreement-and-declaration]| |[other]| |[other]| |[agreement-and-declaration]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_agreement_and_declaration| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support agreement-and-declaration 0.97 1.00 0.98 32 other 1.00 0.98 0.99 66 accuracy - - 0.99 98 macro-avg 0.98 0.99 0.99 98 weighted-avg 0.99 0.99 0.99 98 ``` --- layout: model title: English BertForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_8 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-256-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_8_en_4.0.0_3.0_1657192470456.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_8_en_4.0.0_3.0_1657192470456.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_8","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_8","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_8| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|384.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-256-finetuned-squad-seed-8 --- layout: model title: Arabic Bert Embeddings (Large, Arabert Model, v02, Twitter) author: John Snow Labs name: bert_embeddings_bert_large_arabertv02_twitter date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-arabertv02-twitter` is an Arabic model originally trained by `aubmindlab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_arabertv02_twitter_ar_3.4.2_3.0_1649678133718.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_arabertv02_twitter_ar_3.4.2_3.0_1649678133718.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_arabertv02_twitter","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_arabertv02_twitter","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.bert_large_arabertv02_twitter").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_large_arabertv02_twitter| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|1.4 GB| |Case sensitive:|true| ## References - https://huggingface.co/aubmindlab/bert-large-arabertv02-twitter - https://github.com/google-research/bert - https://arxiv.org/abs/2003.00104 - https://github.com/WissamAntoun/pydata_khobar_meetup - https://sites.aub.edu.lb/mindlab/ - https://www.yakshof.com/#/ - https://www.behance.net/rahalhabib - https://www.linkedin.com/in/wissam-antoun-622142b4/ - https://twitter.com/wissam_antoun - https://github.com/WissamAntoun - https://www.linkedin.com/in/fadybaly/ - https://twitter.com/fadybaly - https://github.com/fadybaly --- layout: model title: Legal Stockholder Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_stockholder_agreement date: 2022-11-10 tags: [legal, licensed, classification, stockholder, agreement, en] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_stockholder_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `stockholder-agreement` or not (Binary Classification). Longformers have a restriction on 4096 tokens, so only the first 4096 tokens will be taken into account. We have realised that for the big majority of the documents in legal corpora, if they are clean and only contain the legal document without any extra information before, 4096 is enough to perform Document Classification. 
If that is not your case, let us know and we can apply another approach for you: splitting the document into 4096-token chunks, averaging their embeddings, and training on the averaged version, which means the whole document is taken into account. In practice, however, this should not be required. ## Predicted Entities `stockholder-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_stockholder_agreement_en_1.0.0_3.0_1668110286989.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_stockholder_agreement_en_1.0.0_3.0_1668110286989.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token") \ .setOutputCol("embeddings") sembeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_stockholder_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, sembeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
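The chunk-and-average fallback mentioned in the description (split long documents into 4096-token windows, embed each window, average the vectors) can be sketched in plain Python. This is only an illustration of the idea, not Spark NLP code: `embed_window` is a hypothetical stand-in for whatever embedding call you use, mocked here with a toy function.

```python
# Sketch of the chunk-and-average strategy for documents longer than the
# 4096-token Longformer window. `embed_window` is a hypothetical stand-in
# for a real embedding call.

def chunk(tokens, size=4096):
    """Yield consecutive windows of at most `size` tokens."""
    for start in range(0, len(tokens), size):
        yield tokens[start:start + size]

def average_embeddings(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def embed_document(tokens, embed_window, size=4096):
    """Embed each window and average, so the whole document
    contributes to the final representation."""
    return average_embeddings([embed_window(w) for w in chunk(tokens, size)])

# Toy demonstration with a fake 2-dimensional embedder.
fake_embed = lambda window: [float(len(window)), 1.0]
doc = list(range(10000))  # 10000 "tokens" -> windows of 4096, 4096, 1808
vec = embed_document(doc, fake_embed)
```

A classifier would then be trained on these averaged vectors instead of the first-window embeddings.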
## Results ```bash +-------+ | result| +-------+ |[stockholder-agreement]| |[other]| |[other]| |[stockholder-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_stockholder_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.2 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.97 1.00 0.98 65 stockholder-agreement 1.00 0.95 0.97 39 accuracy - - 0.98 104 macro-avg 0.99 0.97 0.98 104 weighted-avg 0.98 0.98 0.98 104 ``` --- layout: model title: Legal Standard of care Clause Binary Classifier author: John Snow Labs name: legclf_standard_of_care_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `standard-of-care` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline makes the model see only sentences rather than the whole text, so it is better to skip it unless you want to do binary classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `standard-of-care` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_standard_of_care_clause_en_1.0.0_3.2_1660123028396.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_standard_of_care_clause_en_1.0.0_3.2_1660123028396.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_standard_of_care_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
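The paragraph-splitting recommendation from the description can be sketched without Spark NLP: split on blank lines and use a rough whitespace token count to flag paragraphs that would exceed the 512-token embedding window. The helper below is hypothetical; the actual tutorial uses Spark NLP annotators, and a whitespace count only approximates the model's tokenizer.

```python
import re

def split_paragraphs(text, max_tokens=512):
    """Split a document on blank lines; pair each paragraph with a flag
    saying whether its rough (whitespace-based) token count fits the
    512-token embedding window."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

doc = (
    "7.1 Standard of Care. The Adviser shall exercise reasonable care.\n"
    "\n"
    "7.2 Indemnification. The Fund shall indemnify the Adviser."
)
chunks = split_paragraphs(doc)
# Each short-enough paragraph can then be classified separately.
```

Paragraphs flagged as too long would be split further (for example by headers or subheaders) before classification.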
## Results ```bash +-------+ | result| +-------+ |[standard-of-care]| |[other]| |[other]| |[standard-of-care]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_standard_of_care_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.95 0.98 0.96 97 standard-of-care 0.94 0.86 0.90 36 accuracy - - 0.95 133 macro-avg 0.94 0.92 0.93 133 weighted-avg 0.95 0.95 0.95 133 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Manx author: John Snow Labs name: opus_mt_en_gv date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, gv, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. 
- source languages: `en` - target languages: `gv` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_gv_xx_2.7.0_2.4_1609169580658.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_gv_xx_2.7.0_2.4_1609169580658.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_gv", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentences to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_gv", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentences to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.gv').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_gv| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English T5ForConditionalGeneration Base Cased model (from erickfm) author: John Snow Labs name: t5_base_finetuned_bias date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-finetuned-bias` is an English model originally trained by `erickfm`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_bias_en_4.3.0_3.0_1675108728968.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_bias_en_4.3.0_3.0_1675108728968.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_base_finetuned_bias","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_base_finetuned_bias","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_base_finetuned_bias| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|922.1 MB| ## References - https://huggingface.co/erickfm/t5-base-finetuned-bias - https://github.com/rpryzant/neutralizing-bias --- layout: model title: Pipeline to Map SNOMED Codes to Their Corresponding ICD10-CM Codes author: John Snow Labs name: snomed_icd10cm_mapping date: 2023-03-29 tags: [en, licensed, clinical, pipeline, chunk_mapping, snomed, icd10cm] task: Chunk Mapping language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the `snomed_icd10cm_mapper` model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/snomed_icd10cm_mapping_en_4.3.2_3.2_1680122747029.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/snomed_icd10cm_mapping_en_4.3.2_3.2_1680122747029.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("snomed_icd10cm_mapping", "en", "clinical/models") result = pipeline.fullAnnotate("128041000119107 292278006 293072005") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("snomed_icd10cm_mapping", "en", "clinical/models") val result = pipeline.fullAnnotate("128041000119107 292278006 293072005") ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.snomed_to_icd10cm.pipe").predict("""Put your text here.""") ```
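The pipeline above accepts several space-separated SNOMED codes in one string and returns the mapped ICD10-CM codes in the same order, joined with `|`. A small post-processing helper (hypothetical, not part of the library) can pair them back up:

```python
def pair_codes(snomed_field, icd10cm_field):
    """Pair each SNOMED code with its ICD10-CM mapping, assuming both
    fields list codes in the same order, separated by '|' (or spaces)."""
    snomed = snomed_field.replace("|", " ").split()
    icd10cm = [c.strip() for c in icd10cm_field.split("|")]
    return dict(zip(snomed, icd10cm))

# Values taken from the Results section below.
mapping = pair_codes("128041000119107 | 292278006 | 293072005",
                     "K22.70 | T43.595 | T37.1X5")
```

This makes it easy to look up the ICD10-CM code for any single SNOMED code in the batch.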
## Results ```bash | | snomed_code | icd10cm_code | |---:|:----------------------------------------|:---------------------------| | 0 | 128041000119107 | 292278006 | 293072005 | K22.70 | T43.595 | T37.1X5 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|snomed_icd10cm_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.5 MB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_128_finetuned_squad_seed_42 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-128-finetuned-squad-seed-42` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_128_finetuned_squad_seed_42_en_4.3.0_3.0_1674213900898.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_128_finetuned_squad_seed_42_en_4.3.0_3.0_1674213900898.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_128_finetuned_squad_seed_42","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_128_finetuned_squad_seed_42","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_128_finetuned_squad_seed_42| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|431.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-128-finetuned-squad-seed-42 --- layout: model title: English image_classifier_vit_base_patch16_224_finetuned_pneumothorax ViTForImageClassification from mrm8488 author: John Snow Labs name: image_classifier_vit_base_patch16_224_finetuned_pneumothorax date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_patch16_224_finetuned_pneumothorax` is an English model originally trained by mrm8488. ## Predicted Entities `normal`, `pneumothorax` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_finetuned_pneumothorax_en_4.1.0_3.0_1660170763244.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_finetuned_pneumothorax_en_4.1.0_3.0_1660170763244.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_patch16_224_finetuned_pneumothorax", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_patch16_224_finetuned_pneumothorax", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_patch16_224_finetuned_pneumothorax| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: ICD10CM Neoplasms Entity Resolver author: John Snow Labs name: chunkresolve_icd10cm_neoplasms_clinical date: 2021-04-02 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Entity Resolution model based on KNN using Word Embeddings + Word Movers Distance. ## Predicted Entities ICD10-CM codes and their normalized definitions, using ``clinical_embeddings``. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_neoplasms_clinical_en_3.0.0_3.0_1617355435147.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_neoplasms_clinical_en_3.0.0_3.0_1617355435147.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... neoplasm_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_neoplasms_clinical","en","clinical/models")\ .setInputCols("token","chunk_embeddings")\ .setOutputCol("resolution") pipeline_puerile = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, neoplasm_resolver]) data = spark.createDataFrame([["""The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion."""]]).toDF("text") model = pipeline_puerile.fit(data) results = model.transform(data) ``` ```scala ... val neoplasm_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_neoplasms_clinical","en","clinical/models") .setInputCols(Array("token","chunk_embeddings")) .setOutputCol("resolution") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, neoplasm_resolver)) val data = Seq("The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. 
She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion.").toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash chunk entity icd10_neoplasm_description icd10_neoplasm_code 0 patient Organism Acute myelomonocytic leukemia, in remission C9251 1 infant Organism Malignant (primary) neoplasm, unspecified C801 2 nose Organ Malignant neoplasm of nasal cavity C300 3 She Organism Malignant neoplasm of thyroid gland C73 4 She Organism Malignant neoplasm of thyroid gland C73 5 She Organism Malignant neoplasm of thyroid gland C73 6 Aldex Gene_or_gene_product Acute megakaryoblastic leukemia not having ach... C9420 7 ear Organ Other benign neoplasm of skin of right ear and... D2321 8 She Organism Malignant neoplasm of thyroid gland C73 9 She Organism Malignant neoplasm of thyroid gland C73 10 She Organism Malignant neoplasm of thyroid gland C73 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|chunkresolve_icd10cm_neoplasms_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token, chunk_embeddings]| |Output Labels:|[icd10cm]| |Language:|en| --- layout: model title: Fast Neural Machine Translation Model from English to Czech author: John Snow Labs name: opus_mt_en_cs date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, cs, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. 
- source languages: `en` - target languages: `cs` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_cs_xx_2.7.0_2.4_1609169748097.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_cs_xx_2.7.0_2.4_1609169748097.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_cs", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentences to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_cs", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentences to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.cs').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_cs| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from spacemanidol) author: John Snow Labs name: bert_qa_neuralmagic_bert_squad_12layer_0sparse date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `neuralmagic-bert-squad-12layer-0sparse` is an English model originally trained by `spacemanidol`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_neuralmagic_bert_squad_12layer_0sparse_en_4.0.0_3.0_1654188887647.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_neuralmagic_bert_squad_12layer_0sparse_en_4.0.0_3.0_1654188887647.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_neuralmagic_bert_squad_12layer_0sparse","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_neuralmagic_bert_squad_12layer_0sparse","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_neuralmagic_bert_squad_12layer_0sparse| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/spacemanidol/neuralmagic-bert-squad-12layer-0sparse --- layout: model title: English XLMRobertaForTokenClassification Base Uncased model (from asahi417) author: John Snow Labs name: xlmroberta_ner_tner_base_uncased_ontonotes5 date: 2022-08-13 tags: [en, open_source, xlm_roberta, ner] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tner-xlm-roberta-base-uncased-ontonotes5` is an English model originally trained by `asahi417`. ## Predicted Entities `language`, `product`, `percent`, `time`, `quantity`, `ordinal number`, `law`, `cardinal number`, `facility`, `event`, `geopolitical area`, `organization`, `group`, `money`, `work of art`, `person`, `location`, `date` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_base_uncased_ontonotes5_en_4.1.0_3.0_1660423763480.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_base_uncased_ontonotes5_en_4.1.0_3.0_1660423763480.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_base_uncased_ontonotes5","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_base_uncased_ontonotes5","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_tner_base_uncased_ontonotes5| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|791.8 MB| |Case sensitive:|false| |Max sentence length:|256| ## References - https://huggingface.co/asahi417/tner-xlm-roberta-base-uncased-ontonotes5 - https://github.com/asahi417/tner --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from Moussab) author: John Snow Labs name: roberta_qa_deepset_base_squad2_orkg_which_5e_05 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deepset-roberta-base-squad2-orkg-which-5e-05` is an English model originally trained by `Moussab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_which_5e_05_en_4.3.0_3.0_1674209836673.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_which_5e_05_en_4.3.0_3.0_1674209836673.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_which_5e_05","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_which_5e_05","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
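Span-based question answering heads like the one above score every context token as a potential answer start or end and return the best-scoring span. A simplified, framework-free sketch of that selection step (the token scores below are made up purely for illustration):

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Pick the (start, end) token indices maximizing start + end score, with end >= start."""
    best_score, best_ij = float("-inf"), (0, 0)
    for i, s in enumerate(start_scores):
        # Only consider spans up to max_answer_len tokens long.
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score, best_ij = s + end_scores[j], (i, j)
    return best_ij

context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 1.0, 0.0]  # hypothetical logits
end   = [0.0, 0.1, 0.0, 4.0, 0.2, 0.0, 0.0, 0.0, 2.0, 0.1]
i, j = best_span(start, end)
answer = " ".join(context[i:j + 1])  # "Clara"
```

The annotator additionally handles subword tokens, "no answer" thresholds (for SQuAD2-style models), and character offsets; this sketch shows only the core argmax over spans.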
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_deepset_base_squad2_orkg_which_5e_05| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Moussab/deepset-roberta-base-squad2-orkg-which-5e-05 --- layout: model title: English asr_wav2vec2_large_xls_r_300m_urdu_v2 TFWav2Vec2ForCTC from omar47 author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_urdu_v2 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_urdu_v2` is an English model originally trained by omar47. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_large_xls_r_300m_urdu_v2_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_urdu_v2_en_4.2.0_3.0_1664117963655.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_urdu_v2_en_4.2.0_3.0_1664117963655.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_urdu_v2", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_urdu_v2", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
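The snippets above assume an `audioDf` DataFrame whose `audio_content` column holds arrays of floating-point samples. A standard-library-only sketch of decoding 16-bit PCM WAV data into that shape (the input WAV is synthesized in memory purely for illustration; in practice you would read a real file):

```python
import io
import math
import struct
import wave

def write_demo_wav(buf, sample_rate=16000, n=160):
    """Write a short 440 Hz sine tone as 16-bit mono PCM (stand-in for a real recording)."""
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)       # 16-bit samples
        w.setframerate(sample_rate)
        frames = [int(32767 * 0.5 * math.sin(2 * math.pi * 440 * t / sample_rate))
                  for t in range(n)]
        w.writeframes(struct.pack("<%dh" % n, *frames))

def wav_to_floats(buf):
    """Decode 16-bit PCM WAV bytes into a list of floats in [-1.0, 1.0]."""
    with wave.open(buf, "rb") as w:
        raw = w.readframes(w.getnframes())
    return [s / 32768.0 for s in struct.unpack("<%dh" % (len(raw) // 2), raw)]

buf = io.BytesIO()
write_demo_wav(buf)
buf.seek(0)
samples = wav_to_floats(buf)
# e.g. audioDf = spark.createDataFrame([(samples,)], ["audio_content"])  # shape AudioAssembler expects
```

Note that Wav2Vec2 models generally expect 16 kHz mono input, so resample before assembling if your source audio differs.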
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_urdu_v2| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Detect PHI for Deidentification (Generic - Augmented) author: John Snow Labs name: ner_deid_generic_augmented date: 2021-06-01 tags: [clinical, deid, ner, en, licensed] task: [Named Entity Recognition, De-identification] language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The Named Entity Recognition annotator allows a generic model to be trained using a deep learning algorithm (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Generic - Augmented) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 7 entities. This NER model was trained on a combination of the i2b2 train set and an augmented version of it. We adhered to the official annotation guideline (AG) of the 2014 i2b2 Deid challenge while annotating new datasets for this model.
All the details regarding the nuances and explanations for AG can be found here [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/) ## Predicted Entities `DATE`, `NAME`, `LOCATION`, `PROFESSION`, `CONTACT`, `AGE`, `ID` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_augmented_en_3.0.3_2.4_1622538123966.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_augmented_en_3.0.3_2.4_1622538123966.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") deid_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk_generic") nlpPipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, word_embeddings, deid_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame(pd.DataFrame({"text": ["""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. 
Phone +1 (302) 786-5227."""]}))) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val deid_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk_generic") val nlpPipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, deid_ner, ner_converter)) val data = Seq("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227.""").toDS.toDF("text") val result = nlpPipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.deid.generic_augmented").predict("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227.""") ```
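The per-entity precision, recall, and F1 figures reported in the Benchmarking section below follow directly from the tp/fp/fn counts in the same table. A small sketch of the arithmetic:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp)          # of predicted entities, how many were right
    recall = tp / (tp + fn)             # of gold entities, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return round(precision, 4), round(recall, 4), round(f1, 4)

# Reproduces the CONTACT row of the benchmarking table: tp=341, fp=15, fn=14
print(prf(341, 15, 14))  # (0.9579, 0.9606, 0.9592)
```

The macro-average is the unweighted mean of the per-entity F1 scores, while the micro-average is computed from the summed tp/fp/fn counts, so it weights frequent entities more heavily.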
## Results ```bash +-----------------------------+---------+ |chunk |ner_label| +-----------------------------+---------+ |2093-01-13 |DATE | |David Hale |NAME | |Hendrickson, Ora |NAME | |7194334 |ID | |01/13/93 |DATE | |Oliveira |NAME | |25 |AGE | |1-11-2000 |DATE | |Cocke County Baptist Hospital|LOCATION | |0295 Keats Street |LOCATION | |(302) 786-5227 |CONTACT | +-----------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic_augmented| |Compatibility:|Healthcare NLP 3.0.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source A custom data set which is created from the i2b2-PHI train and the augmented version of the i2b2-PHI train set is used. ## Benchmarking ```bash entity tp fp fn total precision recall f1 CONTACT 341.0 15.0 14.0 355.0 0.9579 0.9606 0.9592 NAME 5065.0 165.0 205.0 5270.0 0.9685 0.9611 0.9648 DATE 5532.0 53.0 110.0 5642.0 0.9905 0.9805 0.9855 ID 614.0 23.0 71.0 685.0 0.9639 0.8964 0.9289 LOCATION 2686.0 164.0 284.0 2970.0 0.9425 0.9044 0.923 PROFESSION 228.0 28.0 145.0 373.0 0.8906 0.6113 0.725 AGE 713.0 34.0 49.0 762.0 0.9545 0.9357 0.945 macro-avg - - - - - - 0.91876 micro-avg - - - - - - 0.95616 ``` --- layout: model title: Chinese XlmRoBertaForQuestionAnswering (from bhavikardeshna) author: John Snow Labs name: xlm_roberta_qa_xlm_roberta_base_chinese date: 2022-06-23 tags: [zh, open_source, question_answering, xlmroberta] task: Question Answering language: zh edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-chinese` is a Chinese model originally trained by `bhavikardeshna`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_chinese_zh_4.0.0_3.0_1655989391641.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_chinese_zh_4.0.0_3.0_1655989391641.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_base_chinese","zh") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_base_chinese","zh") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("zh.answer_question.xlm_roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_roberta_base_chinese| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|zh| |Size:|888.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/bhavikardeshna/xlm-roberta-base-chinese --- layout: model title: Detect radiology concepts (ner_radiology_wip_clinical) author: John Snow Labs name: ner_radiology_wip_clinical date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect clinical concepts from radiology reports and text using a pretrained NER model. ## Predicted Entities `ImagingFindings`, `Direction`, `OtherFindings`, `Measurements`, `Score`, `BodyPart`, `Medical_Device`, `Test`, `ManualFix`, `Procedure`, `Disease_Syndrome_Disorder`, `Test_Result`, `Imaging_Technique`, `ImagingTest`, `Symptom`, `Units` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_RADIOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_radiology_wip_clinical_en_3.0.0_3.0_1617260824931.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_radiology_wip_clinical_en_3.0.0_3.0_1617260824931.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_radiology_wip_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_radiology_wip_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("EXAMPLE_TEXT").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.nlu-block} ```python import nlu nlu.load("en.med_ner.radiology.wip_clinical").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_radiology_wip_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| --- layout: model title: Part of Speech for Swedish author: John Snow Labs name: pos_ud_tal date: 2021-03-08 tags: [part_of_speech, open_source, swedish, pos_ud_tal, sv] task: Part of Speech Tagging language: sv edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - ADJ - NOUN - ADP - VERB - PUNCT - PRON - ADV - SCONJ - NUM - AUX - PART - DET - CCONJ - PROPN - SYM - INTJ {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_tal_sv_3.0.0_3.0_1615230340020.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_tal_sv_3.0.0_3.0_1615230340020.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_tal", "sv") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Hej från John Snow Labs! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_tal", "sv") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Hej från John Snow Labs! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Hej från John Snow Labs! "] token_df = nlu.load('sv.pos.ud_tal').predict(text) token_df ```
## Results ```bash token pos 0 Hej NOUN 1 från ADP 2 John PROPN 3 Snow PROPN 4 Labs PROPN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_tal| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|sv| --- layout: model title: English asr_model_sid_voxforge_cetuc_2 TFWav2Vec2ForCTC from joaoalvarenga author: John Snow Labs name: pipeline_asr_model_sid_voxforge_cetuc_2 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_model_sid_voxforge_cetuc_2` is an English model originally trained by joaoalvarenga. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_model_sid_voxforge_cetuc_2_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_model_sid_voxforge_cetuc_2_en_4.2.0_3.0_1664021468830.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_model_sid_voxforge_cetuc_2_en_4.2.0_3.0_1664021468830.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_model_sid_voxforge_cetuc_2', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_model_sid_voxforge_cetuc_2", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_model_sid_voxforge_cetuc_2| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Explain Document DL Pipeline for Farsi/Persian author: John Snow Labs name: recognize_entities_dl date: 2022-06-29 tags: [pipeline, ner, fa, open_source] task: Named Entity Recognition language: fa edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description `recognize_entities_dl` is a pretrained pipeline that performs basic text processing steps and recognizes entities. It covers most of the common text processing tasks on your DataFrame. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_FA/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/3.SparkNLP_Pretrained_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/recognize_entities_dl_fa_4.0.0_3.0_1656494752817.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/recognize_entities_dl_fa_4.0.0_3.0_1656494752817.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('recognize_entities_dl', lang = 'fa') annotations = pipeline.fullAnnotate("""به گزارش خبرنگار ایرنا ، بر اساس تصمیم این مجمع ، محمد قمی نماینده مردم پاکدشت به عنوان رئیس و علی‌اکبر موسوی خوئینی و شمس‌الدین وهابی نمایندگان مردم تهران به عنوان نواب رئیس انتخاب شدند""")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("recognize_entities_dl", lang = "fa") val result = pipeline.fullAnnotate("""به گزارش خبرنگار ایرنا ، بر اساس تصمیم این مجمع ، محمد قمی نماینده مردم پاکدشت به عنوان رئیس و علی‌اکبر موسوی خوئینی و شمس‌الدین وهابی نمایندگان مردم تهران به عنوان نواب رئیس انتخاب شدند""")(0) ``` {:.nlu-block} ```python import nlu text = ["""به گزارش خبرنگار ایرنا ، بر اساس تصمیم این مجمع ، محمد قمی نماینده مردم پاکدشت به عنوان رئیس و علی‌اکبر موسوی خوئینی و شمس‌الدین وهابی نمایندگان مردم تهران به عنوان نواب رئیس انتخاب شدند"""] result_df = nlu.load('fa.recognize_entities_dl').predict(text) result_df ```
## Results ```bash | | document | sentence | token | clean_tokens | lemma | pos | embeddings | ner | entities | |---:|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------|:---------------|:---------|:------|:-------------|:------|:---------------------| | 0 | "به گزارش خبرنگار ایرنا ، بر اساس تصمیم این مجمع ، محمد قمی نماینده مردم پاکدشت به عنوان رئیس و علی‌اکبر موسوی خوئینی و شمس‌الدین وهابی نمایندگان مردم تهران به عنوان نواب رئیس انتخاب شدند | "به گزارش خبرنگار ایرنا ، بر اساس تصمیم این مجمع ، محمد قمی نماینده مردم پاکدشت به عنوان رئیس و علی‌اکبر موسوی خوئینی و شمس‌الدین وهابی نمایندگان مردم تهران به عنوان نواب رئیس انتخاب شدند | " | " | " | PUNCT | " | O | خبرنگار ایرنا | | 1 | | | به | گزارش | به | ADP | به | O | محمد قمی | | 2 | | | گزارش | خبرنگار | گزارش | NOUN | گزارش | O | پاکدشت | | 3 | | | خبرنگار | ایرنا | خبرنگار | NOUN | خبرنگار | B-ORG | علی‌اکبر موسوی خوئینی | | 4 | | | ایرنا | ، | ایرنا | PROPN | ایرنا | I-ORG | شمس‌الدین وهابی | | 5 | | | ، | اساس | ؛ | PUNCT | ، | O | تهران | | 6 | | | بر | تصمیم | بر | ADP | بر | O | | | 7 | | | اساس | این | اساس | NOUN | اساس | O | | | 8 | | | تصمیم | مجمع | تصمیم | NOUN | تصمیم | O | | | 9 | | | این | ، | این | DET | این | O | | | 10 | | | مجمع | محمد | مجمع | NOUN | مجمع | O | | | 11 | | | ، | قمی | ؛ | PUNCT | ، | O | | | 12 | | | محمد | نماینده | محمد | PROPN | محمد | B-PER | | | 13 | | | قمی | پاکدشت | قمی | PROPN | قمی | I-PER | | | 14 | | | نماینده | عنوان | نماینده | NOUN | نماینده | O | | | 15 | | | مردم | رئیس | مردم | NOUN | مردم | O | | | 16 | | | پاکدشت | علی‌اکبر | پاکدشت | PROPN | پاکدشت | B-LOC | | | 17 | | | به | موسوی | به | ADP | به | O | | | 18 | | | عنوان | 
خوئینی | عنوان | NOUN | عنوان | O | | | 19 | | | رئیس | شمس‌الدین | رئیس | NOUN | رئیس | O | | | 20 | | | و | وهابی | او | CCONJ | و | O | | | 21 | | | علی‌اکبر | نمایندگان | علی‌اکبر | PROPN | علی‌اکبر | B-PER | | | 22 | | | موسوی | تهران | موسوی | PROPN | موسوی | I-PER | | | 23 | | | خوئینی | عنوان | خوئینی | PROPN | خوئینی | I-PER | | | 24 | | | و | نواب | او | CCONJ | و | O | | | 25 | | | شمس‌الدین | رئیس | شمس‌الدین | PROPN | شمس‌الدین | B-PER | | | 26 | | | وهابی | انتخاب | وهابی | PROPN | وهابی | I-PER | | | 27 | | | نمایندگان | | نماینده | NOUN | نمایندگان | O | | | 28 | | | مردم | | مردم | NOUN | مردم | O | | | 29 | | | تهران | | تهران | PROPN | تهران | B-LOC | | | 30 | | | به | | به | ADP | به | O | | | 31 | | | عنوان | | عنوان | NOUN | عنوان | O | | | 32 | | | نواب | | نواب | NOUN | نواب | O | | | 33 | | | رئیس | | رئیس | NOUN | رئیس | O | | | 34 | | | انتخاب | | انتخاب | NOUN | انتخاب | O | | | 35 | | | شدند | | کرد#کن | VERB | شدند | O | | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|recognize_entities_dl| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|fa| |Size:|1.2 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - StopWordsCleaner - LemmatizerModel - PerceptronModel - WordEmbeddingsModel - NerDLModel - NerConverter --- layout: model title: Pipeline to Detect Radiology Entities, Assign Assertion Status and Find Relations author: John Snow Labs name: explain_clinical_doc_radiology date: 2022-08-03 tags: [licensed, clinical, en, ner, assertion, relation_extraction, radiology] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pipeline for detecting radiology entities with the `ner_radiology` NER model, assigning their assertion status with 
`assertion_dl_radiology` model, and extracting relations between the diagnosis, test, and findings with `re_test_problem_finding` relation extraction model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_radiology_en_4.0.0_3.0_1659519840982.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_radiology_en_4.0.0_3.0_1659519840982.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("explain_clinical_doc_radiology", "en", "clinical/models") result = pipeline.fullAnnotate("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""")[0] ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("explain_clinical_doc_radiology", "en", "clinical/models") val result = pipeline.fullAnnotate("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""")(0) ``` {:.nlu-block} ```python import nlu nlu.load("en.explain_doc.clinical_radiology.pipeline").predict("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""") ```
## Results ```bash +----+------------------------------------------+---------------------------+ | | chunks | entities | |---:|:-----------------------------------------|:--------------------------| | 0 | Bilateral breast | BodyPart | | 1 | ultrasound | ImagingTest | | 2 | ovoid mass | ImagingFindings | | 3 | 0.5 x 0.5 x 0.4 | Measurements | | 4 | cm | Units | | 5 | anteromedial aspect of the left shoulder | BodyPart | | 6 | mass | ImagingFindings | | 7 | isoechoic echotexture | ImagingFindings | | 8 | muscle | BodyPart | | 9 | internal color flow | ImagingFindings | | 10 | benign fibrous tissue | ImagingFindings | | 11 | lipoma | Disease_Syndrome_Disorder | +----+------------------------------------------+---------------------------+ +----+-----------------------+---------------------------+-------------+ | | chunks | entities | assertion | |---:|:----------------------|:--------------------------|:------------| | 0 | ultrasound | ImagingTest | Confirmed | | 1 | ovoid mass | ImagingFindings | Confirmed | | 2 | mass | ImagingFindings | Confirmed | | 3 | isoechoic echotexture | ImagingFindings | Confirmed | | 4 | internal color flow | ImagingFindings | Negative | | 5 | benign fibrous tissue | ImagingFindings | Suspected | | 6 | lipoma | Disease_Syndrome_Disorder | Suspected | +----+-----------------------+---------------------------+-------------+ +---------+-----------------+-----------------------+---------------------------+------------+ |relation | entity1 | chunk1 | entity2 | chunk2 | |--------:|:----------------|:----------------------|:--------------------------|:-----------| | 1 | ImagingTest | ultrasound | ImagingFindings | ovoid mass | | 0 | ImagingFindings | benign fibrous tissue | Disease_Syndrome_Disorder | lipoma | +---------+-----------------+-----------------------+---------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_clinical_doc_radiology| |Type:|pipeline| 
|Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - NerConverter - AssertionDLModel - PerceptronModel - DependencyParserModel - RelationExtractionModel --- layout: model title: English BertForQuestionAnswering model (from vumichien) author: John Snow Labs name: bert_qa_tf_bert_base_cased_squad2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tf-bert-base-cased-squad2` is an English model originally trained by `vumichien`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_tf_bert_base_cased_squad2_en_4.0.0_3.0_1654192377114.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_tf_bert_base_cased_squad2_en_4.0.0_3.0_1654192377114.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tf_bert_base_cased_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_tf_bert_base_cased_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.base_cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
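The NLU one-liner above packs the question and its context into a single string separated by `|||`. A tiny helper (hypothetical, not part of the `nlu` package) makes that encoding explicit and reusable:

```python
def to_nlu_qa_input(question: str, context: str, sep: str = "|||") -> str:
    """Join a question and its context with NLU's question/context separator."""
    return f"{question}{sep}{context}"

s = to_nlu_qa_input("What's my name?", "My name is Clara and I live in Berkeley.")
print(s)  # What's my name?|||My name is Clara and I live in Berkeley.
```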
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_tf_bert_base_cased_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/vumichien/tf-bert-base-cased-squad2 --- layout: model title: Word2Vec Embeddings in Northern Frisian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, frr, open_source] task: Embeddings language: frr edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_frr_3.4.1_3.0_1647448216003.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_frr_3.4.1_3.0_1647448216003.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","frr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","frr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("frr.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
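Each token is mapped to a 300-dimensional vector, and downstream tasks typically compare tokens by cosine similarity of those vectors. A self-contained sketch of the comparison (toy 3-d vectors stand in for the model's 300-d output):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))  # 1.0 for identical vectors
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 for orthogonal vectors
```

With the real model, the vectors would come from the `embeddings` metadata of the transformed DataFrame.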
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|frr| |Size:|120.9 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Augment Company Names with NASDAQ database author: John Snow Labs name: finmapper_nasdaq_data_company_name date: 2022-10-22 tags: [en, finance, companies, nasdaq, ticker, licensed] task: Chunk Mapping language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true recommended: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Financial Chunk Mapper which, given a company name, retrieves extra information about the company, including: - Company Name - Stock Exchange - Section - Sic codes - Industry - Category - Currency - Location - Previous names (first_name) - Company type (INC, CORP, etc.) - and some more. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FIN_LEG_COMPANY_AUGMENTATION/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finmapper_nasdaq_data_company_name_en_1.0.0_3.0_1666474142842.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finmapper_nasdaq_data_company_name_en_1.0.0_3.0_1666474142842.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') tokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = finance.NerModel.pretrained('finner_orgs_prods_alias', 'en', 'finance/models')\ .setInputCols(["document", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") # Optional: To normalize the ORG name using NASDAQ data before the mapping ########################################################################## chunkToDoc = nlp.Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") chunk_embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use_lg", "en")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("chunk_embeddings") use_er_model = finance.SentenceEntityResolverModel.pretrained('finel_nasdaq_data_company_name', 'en', 'finance/models')\ .setInputCols("chunk_embeddings")\ .setOutputCol('normalized')\ .setDistanceFunction("EUCLIDEAN") ########################################################################## CM = finance.ChunkMapperModel()\ .pretrained('finmapper_nasdaq_data_company_name', 'en', 'finance/models')\ .setInputCols(["normalized"])\ .setOutputCol("mappings") #or ner_chunk without normalization pipeline = nlp.Pipeline().setStages([document_assembler, tokenizer, embeddings, ner_model, ner_converter, chunkToDoc, # Optional for normalization chunk_embeddings, # Optional for normalization use_er_model, # Optional for normalization CM]) text = """GLEASON CORP is a company which ...""" test_data = spark.createDataFrame([[text]]).toDF("text") model = pipeline.fit(test_data) lp = nlp.LightPipeline(model) lp.fullAnnotate(text) ```
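Each mapping in the output is an annotation carrying a `relation` (the field name, e.g. `ticker` or `exchange`) and a `result` (the value), as the Results section shows. Collecting them into a lookup dict is straightforward; this sketch assumes the rows have already been extracted as `(relation, result)` pairs:

```python
def mappings_to_dict(rows):
    """Turn (relation, result) pairs from the chunk mapper into a field -> value dict."""
    return {relation: result for relation, result in rows}

# Pairs mirroring the Results section for "GLEASON CORP".
rows = [
    ("ticker", "GLE1"),
    ("name", "GLEASON CORP"),
    ("exchange", "NYSE"),
    ("category", "Domestic Common Stock"),
]
info = mappings_to_dict(rows)
print(info["exchange"])  # NYSE
```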
## Results ```bash Row(annotatorType='labeled_dependency', begin=0, end=11, relation='ticker', result='GLE1'...) Row(annotatorType='labeled_dependency', begin=0, end=11, relation='name', result='GLEASON CORP'...) Row(annotatorType='labeled_dependency', begin=0, end=11, relation='exchange', result='NYSE'...) Row(annotatorType='labeled_dependency', begin=0, end=11, relation='category', result='Domestic Common Stock'...) ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finmapper_nasdaq_data_company_name| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|989.1 KB| ## References NASDAQ Database --- layout: model title: Luo (Kenya and Tanzania) XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_swahili_finetuned_ner date: 2022-08-01 tags: [luo, open_source, xlm_roberta, ner] task: Named Entity Recognition language: luo edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-swahili-finetuned-ner-luo` is a Luo (Kenya and Tanzania) model originally trained by `mbeukman`. 
## Predicted Entities `DATE`, `PER`, `ORG`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_swahili_finetuned_ner_luo_4.1.0_3.0_1659355757719.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_swahili_finetuned_ner_luo_4.1.0_3.0_1659355757719.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_swahili_finetuned_ner","luo") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_swahili_finetuned_ner","luo") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
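Under the hood, `NerConverter` merges the classifier's token-level IOB tags (`B-PER`, `I-PER`, `O`, ...) into entity chunks. A simplified, dependency-free version of that merging logic:

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (entity_label, chunk_text) pairs."""
    chunks, current_label, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                chunks.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:  # "O" or an inconsistent I- tag closes any open chunk
            if current_tokens:
                chunks.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        chunks.append((current_label, " ".join(current_tokens)))
    return chunks

tokens = ["John", "Smith", "lives", "in", "Nairobi"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(iob_to_chunks(tokens, tags))  # [('PER', 'John Smith'), ('LOC', 'Nairobi')]
```

The real annotator also tracks character offsets and confidence metadata, but the grouping rule is the same.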
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_swahili_finetuned_ner| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|luo| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-swahili-finetuned-ner-luo - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://github.com/masakhane-io/masakhane-ner --- layout: model title: English image_classifier_vit_base_patch16_224_finetuned_kvasirv2_colonoscopy ViTForImageClassification from mrm8488 author: John Snow Labs name: image_classifier_vit_base_patch16_224_finetuned_kvasirv2_colonoscopy date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_base_patch16_224_finetuned_kvasirv2_colonoscopy` is a English model originally trained by mrm8488. 
## Predicted Entities `ulcerative-colitis`, `normal-pylorus`, `normal-cecum`, `normal-z-line`, `esophagitis`, `dyed-lifted-polyps`, `dyed-resection-margins`, `polyps` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_finetuned_kvasirv2_colonoscopy_en_4.1.0_3.0_1660167194867.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_finetuned_kvasirv2_colonoscopy_en_4.1.0_3.0_1660167194867.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_patch16_224_finetuned_kvasirv2_colonoscopy", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_patch16_224_finetuned_kvasirv2_colonoscopy", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
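The classifier emits one label per image, chosen as the highest-scoring of the eight classes listed above. The score post-processing reduces to an argmax or top-k; a dependency-free sketch over a hypothetical score dict (the values here are made up for illustration):

```python
def top_k(scores, k=3):
    """Return the k highest-scoring (label, score) pairs, best first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

scores = {
    "polyps": 0.72,
    "normal-cecum": 0.15,
    "esophagitis": 0.08,
    "dyed-lifted-polyps": 0.05,
}
print(top_k(scores, k=2))  # [('polyps', 0.72), ('normal-cecum', 0.15)]
```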
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_patch16_224_finetuned_kvasirv2_colonoscopy| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English Named Entity Recognition (from lucifermorninstar011) author: John Snow Labs name: distilbert_ner_autotrain_defector_ner_846726994 date: 2022-05-16 tags: [distilbert, ner, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-defector_ner-846726994` is an English model originally trained by `lucifermorninstar011`. ## Predicted Entities `Name`, `Zip`, `Job`, `Company`, `OOV` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_autotrain_defector_ner_846726994_en_3.4.2_3.0_1652719415807.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_autotrain_defector_ner_846726994_en_3.4.2_3.0_1652719415807.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_autotrain_defector_ner_846726994","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_autotrain_defector_ner_846726994","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_ner_autotrain_defector_ner_846726994| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/lucifermorninstar011/autotrain-defector_ner-846726994 --- layout: model title: English BertForTokenClassification Large Cased model (from 51la5) author: John Snow Labs name: bert_token_classifier_large_ner date: 2022-11-30 tags: [en, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-NER` is an English model originally trained by `51la5`. ## Predicted Entities `MISC`, `LOC`, `ORG`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_large_ner_en_4.2.4_3.0_1669815272612.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_large_ner_en_4.2.4_3.0_1669815272612.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_large_ner","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_large_ner","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_large_ner| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/51la5/bert-large-NER - https://www.aclweb.org/anthology/W03-0419.pdf - https://arxiv.org/pdf/1810.04805 - https://github.com/google-research/bert/issues/223 --- layout: model title: English BertForQuestionAnswering model (from madlag) author: John Snow Labs name: bert_qa_bert_base_uncased_squad1.1_block_sparse_0.07_v1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squad1.1-block-sparse-0.07-v1` is an English model originally trained by `madlag`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squad1.1_block_sparse_0.07_v1_en_4.0.0_3.0_1654181416112.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squad1.1_block_sparse_0.07_v1_en_4.0.0_3.0_1654181416112.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_squad1.1_block_sparse_0.07_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_squad1.1_block_sparse_0.07_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased_x2.44_f87.7_d26_hybrid_filled_v1.by_madlag").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
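Extractive QA models of this kind score every context token as a possible answer start and as a possible answer end; the predicted answer is the span maximizing the combined score with start ≤ end. A simplified decoder over toy scores (the numbers are illustrative, not the model's actual logits):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick (start, end) token indices maximizing start+end score, start <= end."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 2.5, 0.1, 0.1, 0.1, 0.1, 0.3, 0.0]
end   = [0.0, 0.1, 0.1, 2.0, 0.1, 0.1, 0.1, 0.1, 0.4, 0.0]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```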
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_squad1.1_block_sparse_0.07_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|140.1 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/madlag/bert-base-uncased-squad1.1-block-sparse-0.07-v1 - https://rajpurkar.github.io/SQuAD-explorer - https://www.aclweb.org/anthology/N19-1423.pdf - https://www.aclweb.org/anthology/N19-1423/ - https://arxiv.org/abs/2005.07683 --- layout: model title: Classification of Self Reported Vaccine Status (BioBERT) author: John Snow Labs name: bert_sequence_classifier_self_reported_vaccine_status_tweet date: 2022-07-29 tags: [bert, licensed, en, clinical, classifier, sequence_classification, public_health, vaccine, tweet] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true recommended: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model identifies English tweets that self-report COVID-19 vaccination status, distinguishing them from general vaccine chatter. 
## Predicted Entities `Vaccine_chatter`, `Self_reports` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_VACCINE_STATUS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4SC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_self_reported_vaccine_status_tweet_en_4.0.0_3.0_1659076646646.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_self_reported_vaccine_status_tweet_en_4.0.0_3.0_1659076646646.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_self_reported_vaccine_status_tweet", "en", "clinical/models")\ .setInputCols(["document",'token'])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) example = spark.createDataFrame(["I came to a point finally and i've vaccinated, didnt feel pain.Suggest everyone", "If Pfizer believes we need a booster shot, we need it. Who knows their product better? Following the guidance of @CDCgov is how I wound up w/ Covid-19 and having to shut down my K-2 classroom for an entire week. I will do whatever it takes to protect my students, friends, family."], StringType()).toDF("text") result = pipeline.fit(example).transform(example) result.select("text", "class.result").show(truncate=False) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_self_reported_vaccine_status_tweet", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) // a couple of simple examples val example = Seq(Array("I came to a point finally and i've vaccinated, didnt feel pain.Suggest everyone", "If Pfizer believes we need a booster shot, we need it. Who knows their product better? Following the guidance of @CDCgov is how I wound up w/ Covid-19 and having to shut down my K-2 classroom for an entire week. 
I will do whatever it takes to protect my students, friends, family.")).toDF("text") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.self_reported_vaccine_status").predict("""If Pfizer believes we need a booster shot, we need it. Who knows their product better? Following the guidance of @CDCgov is how I wound up w/ Covid-19 and having to shut down my K-2 classroom for an entire week. I will do whatever it takes to protect my students, friends, family.""") ```
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ |text |result | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ |I came to a point finally and i've vaccinated, didnt feel pain. Suggest everyone |[Self_reports] | |If Pfizer believes we need a booster shot, we need it. Who knows their product better? Following the guidance of @CDCgov is how I wound up w/ Covid-19 and having to shut down my K-2 classroom for an entire week. I will do whatever it takes to protect my students, friends, family.|[Vaccine_chatter]| +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_self_reported_vaccine_status_tweet| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References [SMM4H 2022](https://healthlanguageprocessing.org/smm4h-2022/) ## Benchmarking ```bash label precision recall f1-score support Self_reports 0.79 0.78 0.78 311 Vaccine_chatter 0.97 0.97 0.97 2410 accuracy - - 0.95 2721 macro-avg 0.88 0.88 0.88 2721 weighted-avg 0.95 0.95 0.95 2721 ``` 
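The per-class F1 scores in the benchmark above are the harmonic mean of precision and recall; for example, the `Self_reports` row's 0.79 precision and 0.78 recall yield an F1 of about 0.78:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.79, 0.78), 2))  # 0.78  (Self_reports row)
print(round(f1(0.97, 0.97), 2))  # 0.97  (Vaccine_chatter row)
```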
--- layout: model title: Detect Temporal Relations for Clinical Events author: John Snow Labs name: re_temporal_events_clinical date: 2020-09-28 task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 2.6.0 spark_version: 2.4 tags: [re, en, clinical, licensed] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model can be used to identify temporal relationships among clinical events. ## Predicted Entities `AFTER`, `BEFORE`, `OVERLAP` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CLINICAL_EVENTS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/RE_CLINICAL_EVENTS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_temporal_events_clinical_en_2.5.5_2.4_1597774124917.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_temporal_events_clinical_en_2.5.5_2.4_1597774124917.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, PerceptronModel, DependencyParserModel, WordEmbeddingsModel, NerDLModel, NerConverter, RelationExtractionModel. The table below shows the `re_temporal_events_clinical` RE model, its labels, the optimal NER model, and the meaningful relation pairs. | RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS | |:---------------------------:|:----------------------:|:-------------------:|:-------------------------:| | re_temporal_events_clinical | AFTER, BEFORE, OVERLAP | ner_events_clinical | [“No need to set pairs.”] |
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel().pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverter() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel().pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") clinical_re_Model = RelationExtractionModel()\ .pretrained("re_temporal_events_clinical", "en", 'clinical/models')\ .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\ .setOutputCol("relations")\ .setMaxSyntacticDistance(4)\ .setPredictionThreshold(0.9)\ .setRelationPairs(["date-problem", "occurrence-date"]) # Possible relation pairs. Default: All Relations. nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos_tagger, word_embeddings, clinical_ner, ner_converter, dependency_parser, clinical_re_Model]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate("""The patient is a 56-year-old right-handed female with longstanding intermittent right low back pain, who was involved in a motor vehicle accident in September of 2005. 
At that time, she did not notice any specific injury, but five days later, she started getting abnormal right low back pain.""") ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") val clinical_re_Model = RelationExtractionModel.pretrained("re_temporal_events_clinical", "en", "clinical/models") .setInputCols(Array("embeddings", "pos_tags", "ner_chunks", "dependencies")) .setOutputCol("relations") .setMaxSyntacticDistance(4) .setPredictionThreshold(0.9f) .setRelationPairs(Array("date-problem", "occurrence-date")) val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos_tagger, word_embeddings, clinical_ner, ner_converter, dependency_parser, clinical_re_Model)) val data = Seq("""The patient is a 56-year-old right-handed female with longstanding intermittent right low back pain, who was involved in a motor vehicle accident in September of 2005. 
At that time, she did not notice any specific injury, but five days later, she started getting abnormal right low back pain.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.temporal_events_clinical").predict("""The patient is a 56-year-old right-handed female with longstanding intermittent right low back pain, who was involved in a motor vehicle accident in September of 2005. At that time, she did not notice any specific injury, but five days later, she started getting abnormal right low back pain.""") ```
{:.h2_title} ## Results ```bash +----+------------+------------+-----------------+---------------+--------------------------+-----------+-----------------+---------------+---------------------+--------------+ | | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | +====+============+============+=================+===============+==========================+===========+=================+===============+=====================+==============+ | 0 | OVERLAP | OCCURRENCE | 121 | 144 | a motor vehicle accident | DATE | 149 | 165 | September of 2005 | 0.999975 | +----+------------+------------+-----------------+---------------+--------------------------+-----------+-----------------+---------------+---------------------+--------------+ | 1 | OVERLAP | DATE | 171 | 179 | that time | PROBLEM | 201 | 219 | any specific injury | 0.956654 | +----+------------+------------+-----------------+---------------+--------------------------+-----------+-----------------+---------------+---------------------+--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_temporal_events_clinical| |Type:|re| |Compatibility:|Healthcare NLP 2.6.0 +| |Edition:|Official| |License:|Licensed| |Input Labels:|[embeddings, pos_tags, ner_chunks, dependencies]| |Output Labels:|[relations]| |Language:|[en]| |Case sensitive:|false| | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained on data gathered and manually annotated by John Snow Labs https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ {:.h2_title} ## Benchmarking ```bash |Relation | Recall | Precision | F1 | |---------:|--------:|----------:|-----:| | OVERLAP | 0.81 | 0.73 | 0.77 | | BEFORE | 0.85 | 0.88 | 0.86 | | AFTER | 0.38 | 0.46 | 0.43 | ``` --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: 
roberta_qa_rule_based_twostage_quadruplet_epochs_1_shard_1_squad2.0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_twostage_quadruplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_twostage_quadruplet_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223883289.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_twostage_quadruplet_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223883289.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_twostage_quadruplet_epochs_1_shard_1_squad2.0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_twostage_quadruplet_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_twostage_quadruplet_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|306.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_twostage_quadruplet_epochs_1_shard_1_squad2.0 --- layout: model title: English asr_wav2vec2_base_timit_demo_colab30 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab30 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab30` is an English model originally trained by hassnain. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab30_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab30_en_4.2.0_3.0_1664020671616.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab30_en_4.2.0_3.0_1664020671616.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab30', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab30", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab30| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Persian BertForMaskedLM Base Uncased model (from HooshvareLab) author: John Snow Labs name: bert_embeddings_fa_base_uncased date: 2022-12-02 tags: [fa, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: fa edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-fa-base-uncased` is a Persian model originally trained by `HooshvareLab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_fa_base_uncased_fa_4.2.4_3.0_1670019389838.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_fa_base_uncased_fa_4.2.4_3.0_1670019389838.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_fa_base_uncased","fa") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_fa_base_uncased","fa") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_fa_base_uncased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|fa| |Size:|609.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/HooshvareLab/bert-fa-base-uncased - https://github.com/hooshvare/parsbert - https://arxiv.org/abs/2005.12515 - https://dumps.wikimedia.org/fawiki/ - https://github.com/miras-tech/MirasText - https://bigbangpage.com/ - https://www.chetor.com/ - https://www.eligasht.com/Blog/ - https://www.digikala.com/mag/ - https://www.ted.com/talks - https://github.com/hooshvare/parsbert/issues --- layout: model title: Arabic Lemmatizer author: John Snow Labs name: lemma date: 2020-11-28 task: Lemmatization language: ar edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [lemmatizer, ar, open_source] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model converts words to their base form. For example, it maps the past and present tenses of a verb, or the singular and plural forms of a noun, to a single form, enabling downstream models to treat different surface forms of the same word alike. The lemmatizer takes the context surrounding a word into consideration to determine which root is correct when the word form alone is ambiguous. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_ar_2.7.0_2.4_1606572966993.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_ar_2.7.0_2.4_1606572966993.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of a pipeline after tokenisation.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "ar") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate(['يرفض رفضت يرفضون أرفض نرفض رفض رفضوا ترفض رفضتا']) ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "ar") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("يرفض رفضت يرفضون أرفض نرفض رفض رفضوا ترفض رفضتا").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""يرفض رفضت يرفضون أرفض نرفض رفض رفضوا ترفض رفضتا"""] lemma_df = nlu.load('ar.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
## Results ```bash {'lemma': [Annotation(token, 0, 3, رَفَض, {'sentence': '0'}), Annotation(token, 5, 8, رَفَض, {'sentence': '0'}), Annotation(token, 10, 15, رَفَض, {'sentence': '0'}), Annotation(token, 17, 20, رَفَض, {'sentence': '0'}), Annotation(token, 22, 25, نرفض, {'sentence': '0'}), Annotation(token, 27, 29, رَفض, {'sentence': '0'}), Annotation(token, 31, 35, رَفَض, {'sentence': '0'}), Annotation(token, 37, 40, رَفَض, {'sentence': '0'}), Annotation(token, 42, 46, رَفَض, {'sentence': '0'})]} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|ar| ## Data Source This model is trained on data obtained from [https://universaldependencies.org/](https://universaldependencies.org/) --- layout: model title: Wolof XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_swahili_finetuned_ner_wolof date: 2022-08-01 tags: [wo, open_source, xlm_roberta, ner] task: Named Entity Recognition language: wo edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-swahili-finetuned-ner-wolof` is a Wolof model originally trained by `mbeukman`. 
## Predicted Entities `DATE`, `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_swahili_finetuned_ner_wolof_wo_4.1.0_3.0_1659355944958.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_swahili_finetuned_ner_wolof_wo_4.1.0_3.0_1659355944958.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_swahili_finetuned_ner_wolof","wo") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_swahili_finetuned_ner_wolof","wo") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_swahili_finetuned_ner_wolof| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|wo| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-swahili-finetuned-ner-wolof - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://github.com/masakhane-io/masakhane-ner --- layout: model title: Categorize Chat Messages from Customer Service author: John Snow Labs name: finclf_customer_service_category date: 2023-02-03 tags: [en, licensed, finance, customer, classification, tensorflow] task: Text Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: FinanceClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Text Classification model that can help you classify chat messages from customer service. ## Predicted Entities `ACCOUNT`, `CANCELLATION_FEE`, `CONTACT`, `DELIVERY`, `FEEDBACK`, `INVOICE`, `NEWSLETTER`, `ORDER`, `OTHER`, `PAYMENT`, `REFUND`, `SHIPPING_ADDRESS` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_customer_service_category_en_1.0.0_3.0_1675417487415.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_customer_service_category_en_1.0.0_3.0_1675417487415.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") embeddings = nlp.UniversalSentenceEncoder.pretrained() \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = finance.ClassifierDLModel.pretrained("finclf_customer_service_category", "en", "finance/models")\ .setInputCols("sentence_embeddings") \ .setOutputCol("class") pipeline = nlp.Pipeline().setStages( [ document_assembler, embeddings, docClassifier ] ) empty_data = spark.createDataFrame([[""]]).toDF("text") model = pipeline.fit(empty_data) light_model = nlp.LightPipeline(model) result = light_model.annotate("""can I place an order from Finland?""") result["class"] ```
## Results ```bash ['DELIVERY'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_customer_service_category| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References https://github.com/bitext/customer-support-intent-detection-evaluation-dataset ## Benchmarking ```bash label precision recall f1-score support ACCOUNT 0.99 0.99 0.99 180 CANCELLATION_FEE 1.00 1.00 1.00 30 CONTACT 0.98 1.00 0.99 60 DELIVERY 1.00 1.00 1.00 60 FEEDBACK 0.97 0.95 0.96 60 INVOICE 1.00 1.00 1.00 60 NEWSLETTER 0.94 1.00 0.97 30 ORDER 1.00 0.99 1.00 120 OTHER 1.00 0.97 0.98 63 PAYMENT 0.95 1.00 0.98 60 REFUND 0.99 0.98 0.98 90 SHIPPING_ADDRESS 1.00 0.98 0.99 60 accuracy - - 0.99 873 macro-avg 0.99 0.99 0.99 873 weighted-avg 0.99 0.99 0.99 873 ``` --- layout: model title: Sango asr_wav2vec2_large_xlsr_53_swiss_german TFWav2Vec2ForCTC from Yves author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_swiss_german date: 2022-09-24 tags: [wav2vec2, sg, audio, open_source, asr] task: Automatic Speech Recognition language: sg edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_swiss_german` is a Sango model originally trained by Yves. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_swiss_german_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_swiss_german_sg_4.2.0_3.0_1664022648999.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_swiss_german_sg_4.2.0_3.0_1664022648999.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_swiss_german", "sg")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_swiss_german", "sg") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_swiss_german| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|sg| |Size:|1.2 GB| --- layout: model title: Legal Accounting terms Clause Binary Classifier author: John Snow Labs name: legclf_accounting_terms_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `accounting-terms` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. 
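For the paragraph-splitting option mentioned above, a minimal sketch of splitting a long document by multiline (blank-line) boundaries before classification, independent of Spark NLP — the `split_paragraphs` helper and the sample clause text are illustrative, not part of the library:

```python
import re

def split_paragraphs(text: str) -> list[str]:
    """Split a document into paragraphs on runs of blank lines."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

# Hypothetical contract excerpt for illustration only.
contract = """Section 1. Accounting Terms.

All accounting terms not specifically defined herein shall be construed
in accordance with GAAP.


Section 2. Other Definitions.

Capitalized terms used herein have the meanings set forth below."""

paragraphs = split_paragraphs(contract)
# Each paragraph can now be fed to the clause classifier as a separate row.
print(len(paragraphs))  # 4
```

Each resulting paragraph stays well under the 512-token limit of the embeddings, and can be loaded into the `clause_text` column used in the pipeline below.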
## Predicted Entities `other`, `accounting-terms` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_accounting_terms_clause_en_1.0.0_3.2_1660122090467.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_accounting_terms_clause_en_1.0.0_3.2_1660122090467.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_accounting_terms_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +------------------+ |result            | +------------------+ |[accounting-terms]| |[other]           | |[other]           | |[accounting-terms]| +------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_accounting_terms_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support accounting-terms 1.00 1.00 1.00 26 other 1.00 1.00 1.00 109 accuracy - - 1.00 135 macro-avg 1.00 1.00 1.00 135 weighted-avg 1.00 1.00 1.00 135 ``` --- layout: model title: English RobertaForQuestionAnswering Cased model (from xinranyyyy) author: John Snow Labs name: roberta_qa_checkpoint_finetuned_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta_checkpoint-finetuned-squad` is an English model originally trained by `xinranyyyy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_checkpoint_finetuned_squad_en_4.3.0_3.0_1674223200816.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_checkpoint_finetuned_squad_en_4.3.0_3.0_1674223200816.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_checkpoint_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_checkpoint_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
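Under the hood, an extractive QA model like this one answers by selecting a span of the context rather than generating text. The span-selection step can be sketched framework-free; the scores below are toy, hypothetical values, not the model's actual logits:

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) token pair maximizing start+end score, with end >= start."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 3.0, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]
end   = [0.1, 0.1, 0.1, 2.5, 0.1, 0.1, 0.1, 0.1, 0.8, 0.1]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```

This mirrors why the pipeline above needs both `document_question` and `document_context`: the question conditions the scores, and the answer is always a substring of the context.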
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_checkpoint_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|465.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/xinranyyyy/roberta_checkpoint-finetuned-squad --- layout: model title: Translate French to English Pipeline author: John Snow Labs name: translate_fr_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, fr, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `fr` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_fr_en_xx_2.7.0_2.4_1609688968930.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_fr_en_xx_2.7.0_2.4_1609688968930.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_fr_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_fr_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.fr.translate_to.en').predict(text, output_level='sentence') translate_df ```
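Because translation cost grows quickly with sequence length, long documents are usually broken into sentences before calling the pipeline (note the `output_level='sentence'` in the NLU example above). In production the pipeline's own sentence detection handles this; purely as an illustration of the idea, a naive splitter might look like:

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break on ., ! or ? followed by whitespace.
    Illustrative only -- real pipelines use a trained sentence detector."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

doc = "Bonjour le monde. Comment allez-vous? Très bien!"
print(split_sentences(doc))
# -> ['Bonjour le monde.', 'Comment allez-vous?', 'Très bien!']
```

Translating a list of short sentences instead of one long string keeps per-call sequence length, and therefore runtime, bounded.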
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_fr_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English Legal Contracts BertEmbeddings model (Base, Uncased) author: John Snow Labs name: bert_base_uncased_contracts date: 2022-06-30 tags: [open_source, bert, embeddings, finance, contracts, en] task: Embeddings language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Word Embeddings model, trained on legal contracts, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-contracts` is an English model originally trained by `nlpaueb`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_uncased_contracts_en_4.0.0_3.0_1656599332268.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_uncased_contracts_en_4.0.0_3.0_1656599332268.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_base_uncased_contracts","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_base_uncased_contracts","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP.").toDF("text") val result = pipeline.fit(data).transform(data) ```
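The `embeddings` column produced above holds one dense vector per token; downstream components typically compare such vectors with cosine similarity. A toy sketch of that comparison, using 3-dimensional stand-ins with made-up values (real BERT vectors are 768-dimensional):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-d stand-ins for token embeddings.
contract = [0.9, 0.1, 0.2]
agreement = [0.85, 0.15, 0.25]
banana = [0.1, 0.9, 0.3]

# Contextual embeddings place related legal terms closer together.
print(cosine(contract, agreement) > cosine(contract, banana))  # -> True
```

The same comparison applies unchanged to the 768-dimensional vectors this model emits.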
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_uncased_contracts| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|410.0 MB| |Case sensitive:|true| ## References https://huggingface.co/nlpaueb/bert-base-uncased-contracts --- layout: model title: English T5ForConditionalGeneration Base Cased model (from swcrazyfan) author: John Snow Labs name: t5_kingjamesify_base date: 2023-01-30 tags: [en, open_source, t5] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `KingJamesify-T5-Base` is an English model originally trained by `swcrazyfan`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_kingjamesify_base_en_4.3.0_3.0_1675098022215.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_kingjamesify_base_en_4.3.0_3.0_1675098022215.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_kingjamesify_base","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_kingjamesify_base","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_kingjamesify_base| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|915.1 MB| ## References - https://huggingface.co/swcrazyfan/KingJamesify-T5-Base --- layout: model title: Detect bacterial species (embedding_clinical_medium) author: John Snow Labs name: ner_bacterial_species_emb_clinical_medium date: 2023-05-23 tags: [ner, clinical, bacterial_species, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect different types of species of bacteria in text using pretrained NER model. ## Predicted Entities `SPECIES` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_BACTERIAL_SPECIES/){:.button.button-orange} [Open in Colab](https://demo.johnsnowlabs.com/healthcare/NER_BACTERIAL_SPECIES/){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_bacterial_species_emb_clinical_medium_en_4.4.2_3.0_1684848483995.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_bacterial_species_emb_clinical_medium_en_4.4.2_3.0_1684848483995.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") species_ner = MedicalNerModel.pretrained("ner_bacterial_species_emb_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("species_ner") species_ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "species_ner"]) \ .setOutputCol("species_ner_chunk") species_ner_pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, word_embeddings, species_ner, species_ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") species_ner_model = species_ner_pipeline.fit(empty_data) results = species_ner_model.transform(spark.createDataFrame([[''' Proportions of Veillonella parvula and Prevotella melaninogenica were higher in saliva and on the lateral and dorsal surfaces of the tongue, while Streptococcus mitis and S. oralis were in significantly lower proportions in saliva and on the tongue dorsum. Cluster analysis resulted in the formation of 2 clusters with >85% similarity. Cluster 1 comprised saliva, lateral and dorsal tongue surfaces, while Cluster 2 comprised the remaining soft tissue locations. V. parvula, P. melaninogenica, Eikenella corrodens, Neisseria mucosa, Actinomyces odontolyticus, Fusobacterium periodonticum, F. nucleatum ss vincentii and Porphyromonas gingivalis were in significantly higher proportions in Cluster 1 and S. mitis, S. oralis and S. 
noxia were significantly higher in Cluster 2. These findings were confirmed using data from the 44 subjects providing plaque samples.''']]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val species_ner_model = MedicalNerModel.pretrained("ner_bacterial_species_emb_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("species_ner") val species_ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "species_ner")) .setOutputCol("species_ner_chunk") val species_pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, species_ner_model, species_ner_converter)) val data = Seq("""Proportions of Veillonella parvula and Prevotella melaninogenica were higher in saliva and on the lateral and dorsal surfaces of the tongue, while Streptococcus mitis and S. oralis were in significantly lower proportions in saliva and on the tongue dorsum. Cluster analysis resulted in the formation of 2 clusters with >85% similarity. Cluster 1 comprised saliva, lateral and dorsal tongue surfaces, while Cluster 2 comprised the remaining soft tissue locations. V. parvula, P. melaninogenica, Eikenella corrodens, Neisseria mucosa, Actinomyces odontolyticus, Fusobacterium periodonticum, F. nucleatum ss vincentii and Porphyromonas gingivalis were in significantly higher proportions in Cluster 1 and S. mitis, S. oralis and S. noxia were significantly higher in Cluster 2. These findings were confirmed using data from the 44 subjects providing plaque samples.""").toDS.toDF("text") val result = species_pipeline.fit(data).transform(data) ```
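In the results below, `begin` and `end` are inclusive character offsets of each detected chunk within the input text. A plain-Python sketch of that offset convention (computed here for a short snippet of the abstract, so the numbers differ slightly from the table's, whose input starts with leading whitespace):

```python
def chunk_offsets(text, chunks):
    """Locate each chunk in text and return (chunk, begin, end),
    with an inclusive end index as in Spark NLP annotations."""
    rows = []
    for chunk in chunks:
        begin = text.find(chunk)
        rows.append((chunk, begin, begin + len(chunk) - 1))
    return rows

text = "Proportions of Veillonella parvula and Prevotella melaninogenica were higher."
for chunk, b, e in chunk_offsets(text, ["Veillonella parvula", "Prevotella melaninogenica"]):
    print(chunk, b, e)
```

Inclusive end offsets mean `text[begin:end + 1]` recovers the chunk exactly.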
## Results ```bash | | chunks | begin | end | entities | |---:|:----------------------------|--------:|------:|:-----------| | 0 | Veillonella parvula | 16 | 34 | SPECIES | | 1 | Prevotella melaninogenica | 40 | 64 | SPECIES | | 2 | Streptococcus mitis | 148 | 166 | SPECIES | | 3 | S. oralis | 172 | 180 | SPECIES | | 4 | V. parvula | 464 | 473 | SPECIES | | 5 | P. melaninogenica | 476 | 492 | SPECIES | | 6 | Eikenella corrodens | 495 | 513 | SPECIES | | 7 | Neisseria mucosa | 516 | 531 | SPECIES | | 8 | Actinomyces odontolyticus | 534 | 558 | SPECIES | | 9 | Fusobacterium periodonticum | 561 | 587 | SPECIES | | 10 | F. nucleatum ss vincentii | 590 | 614 | SPECIES | | 11 | Porphyromonas gingivalis | 620 | 643 | SPECIES | | 12 | S. mitis | 703 | 710 | SPECIES | | 13 | S. oralis | 713 | 721 | SPECIES | | 14 | S. noxia | 727 | 734 | SPECIES | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_bacterial_species_emb_clinical_medium| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|2.8 MB| ## Benchmarking ```bash label precision recall f1-score support SPECIES 0.80 0.82 0.81 1810 micro-avg 0.80 0.82 0.81 1810 macro-avg 0.80 0.82 0.81 1810 weighted-avg 0.80 0.82 0.81 1810 ``` --- layout: model title: English BertForTokenClassification Cased model (from RJ3vans) author: John Snow Labs name: bert_pos_13.05.2022.SSCCVspanTagger date: 2022-07-06 tags: [bert, pos, part_of_speech, en, open_source] task: Part of Speech Tagging language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`13.05.2022.SSCCVspanTagger` is an English model originally trained by `RJ3vans`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_13.05.2022.SSCCVspanTagger_en_4.0.0_3.0_1657118899607.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_13.05.2022.SSCCVspanTagger_en_4.0.0_3.0_1657118899607.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_13.05.2022.SSCCVspanTagger","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_13.05.2022.SSCCVspanTagger","en") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
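A token classifier like this one emits one label per token; contiguous B-/I- tags are then merged back into spans. A minimal grouping sketch (the `SPAN` label below is hypothetical — the model's actual tag set is defined by its training data):

```python
def bio_to_spans(tokens, tags):
    """Group B-/I- token tags into (label, span_text) pairs; O breaks a span."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = (tag[2:], [token])   # start a new span
            spans.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)       # continue the open span
        else:
            current = None                 # O tag or label mismatch closes it
    return [(label, " ".join(toks)) for label, toks in spans]

tokens = ["The", "man", "who", "arrived", "smiled", "."]
tags   = ["O", "O", "B-SPAN", "I-SPAN", "O", "O"]
print(bio_to_spans(tokens, tags))  # -> [('SPAN', 'who arrived')]
```

This is the same reconstruction a `NerConverter`-style stage performs on the tagger's per-token output.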
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_13.05.2022.SSCCVspanTagger| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/RJ3vans/13.05.2022.SSCCVspanTagger - http://rgcl.wlv.ac.uk/~richard/Evans2020_SentenceSimplificationForTextProcessing.pdf --- layout: model title: Mapping Entities with Corresponding ICD-10-CM Codes author: John Snow Labs name: icd10cm_mapper date: 2023-05-30 tags: [icd10cm, chunk_mapper, clinical, licensed, en] task: Chunk Mapping language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps entities with their corresponding ICD-10-CM codes. ## Predicted Entities `icd10cm_code` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10cm_mapper_en_4.4.2_3.0_1685478700624.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10cm_mapper_en_4.4.2_3.0_1685478700624.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel\ .pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel\ .pretrained("ner_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols("sentence", "token", "ner")\ .setOutputCol("ner_chunk")\ .setWhiteList(["PROBLEM"]) chunkerMapper = ChunkMapperModel\ .pretrained("icd10cm_mapper", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings")\ .setRels(["icd10cm_code"]) mapper_pipeline = Pipeline().setStages([ document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter, chunkerMapper]) test_data = spark.createDataFrame([["A 35-year-old male with a history of chronic renal insufficiency, type 2 diabetes mellitus diagnosed eight years prior, hypertension, and hyperlipidemia, presented with a two-week history of polydipsia, poor appetite, and vomiting."]]).toDF("text") mapper_model = mapper_pipeline.fit(test_data) result= mapper_model.transform(test_data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel 
.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("PROBLEM")) val chunkerMapper = ChunkMapperModel .pretrained("icd10cm_mapper", "en", "clinical/models") .setInputCols("ner_chunk") .setOutputCol("mappings") .setRels(Array("icd10cm_code")) val mapper_pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter, chunkerMapper)) val data = Seq("A 35-year-old male with a history of chronic renal insufficiency, type 2 diabetes mellitus diagnosed eight years prior, hypertension, and hyperlipidemia, presented with a two-week history of polydipsia, poor appetite, and vomiting.").toDS.toDF("text") val result = mapper_pipeline.fit(data).transform(data) ```
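Conceptually, a chunk mapper is a lookup from entity text to one or more codes. A toy sketch using a handful of the mappings from the Results table (the real model covers far more of the ICD-10-CM vocabulary and is applied inside the pipeline, not as a plain dict):

```python
# Hypothetical miniature lookup table; codes taken from the Results table.
icd10cm = {
    "chronic renal insufficiency": "N18.9",
    "type 2 diabetes mellitus": "E11",
    "hypertension": "I10",
    "hyperlipidemia": "E78.5",
    "polydipsia": "R63.1",
    "poor appetite": "R63.0",
    "vomiting": "R11.1",
}

def map_chunks(chunks, table):
    """Lower-case each chunk and look it up; None means no mapping found."""
    return [(c, table.get(c.lower())) for c in chunks]

print(map_chunks(["Hypertension", "vomiting"], icd10cm))
# -> [('Hypertension', 'I10'), ('vomiting', 'R11.1')]
```

The pipeline's NER stage supplies the `PROBLEM` chunks; the mapper stage then resolves each chunk to its `icd10cm_code` relation.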
## Results ```bash +---------------------------+-------+------------+ |ner_chunk |entity |icd10cm_code| +---------------------------+-------+------------+ |chronic renal insufficiency|PROBLEM|N18.9 | |type 2 diabetes mellitus |PROBLEM|E11 | |hypertension |PROBLEM|I10 | |hyperlipidemia |PROBLEM|E78.5 | |polydipsia |PROBLEM|R63.1 | |poor appetite |PROBLEM|R63.0 | |vomiting |PROBLEM|R11.1 | +---------------------------+-------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd10cm_mapper| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|14.1 MB| --- layout: model title: English T5ForConditionalGeneration Cased model (from leslyarun) author: John Snow Labs name: t5_grammatical_error_correction date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `grammatical-error-correction` is an English model originally trained by `leslyarun`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_grammatical_error_correction_en_4.3.0_3.0_1675102821479.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_grammatical_error_correction_en_4.3.0_3.0_1675102821479.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_grammatical_error_correction","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_grammatical_error_correction","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_grammatical_error_correction| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|920.7 MB| ## References - https://huggingface.co/leslyarun/grammatical-error-correction --- layout: model title: French CamemBert Embeddings (from h4d35) author: John Snow Labs name: camembert_embeddings_h4d35_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `h4d35`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_h4d35_generic_model_fr_3.4.4_3.0_1653988685837.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_h4d35_generic_model_fr_3.4.4_3.0_1653988685837.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_h4d35_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_h4d35_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_h4d35_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/h4d35/dummy-model --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_8 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-128-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_8_en_4.0.0_3.0_1655731444229.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_8_en_4.0.0_3.0_1655731444229.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_8","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_8","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_128d_seed_8").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_8| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|422.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-128-finetuned-squad-seed-8 --- layout: model title: Legal NER for MAPA(Multilingual Anonymisation for Public Administrations) author: John Snow Labs name: legner_mapa date: 2023-04-28 tags: [fi, licensed, ner, legal, mapa] task: Named Entity Recognition language: fi edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union. This model extracts `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, and `PERSON` entities from `Finnish` documents. ## Predicted Entities `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, `PERSON` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_mapa_fi_1.0.0_3.0_1682671773751.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_mapa_fi_1.0.0_3.0_1682671773751.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_base_finnish_legal","fi")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_mapa", "fi", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""Liberato vaati 22.5.2007 päivätyllä kanteellaan Tribunale di Teramossa ( Teramon alioikeus, Italia ) asumuseroa Grigorescusta ja lapsen huoltajuutta."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
## Results ```bash +-------------+---------+ |chunk |ner_label| +-------------+---------+ |Liberato |PERSON | |22.5.2007 |DATE | |Italia |ADDRESS | |Grigorescusta|PERSON | +-------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_mapa| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|fi| |Size:|1.4 MB| ## References The dataset is available [here](https://huggingface.co/datasets/joelito/mapa). ## Benchmarking ```bash label precision recall f1-score support ADDRESS 0.81 0.93 0.86 27 AMOUNT 1.00 1.00 1.00 2 DATE 0.92 0.95 0.94 61 ORGANISATION 0.88 0.81 0.85 27 PERSON 0.93 0.95 0.94 40 micro-avg 0.90 0.92 0.91 157 macro-avg 0.91 0.93 0.92 157 weighted-avg 0.90 0.92 0.91 157 ``` --- layout: model title: English asr_wav2vec2_xls_r_300m_german_english TFWav2Vec2ForCTC from aware-ai author: John Snow Labs name: asr_wav2vec2_xls_r_300m_german_english date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_german_english` is an English model originally trained by aware-ai. 
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_xls_r_300m_german_english_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_german_english_en_4.2.0_3.0_1664111876863.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_german_english_en_4.2.0_3.0_1664111876863.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xls_r_300m_german_english", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xls_r_300m_german_english", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xls_r_300m_german_english| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Legal Existence Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_existence_bert date: 2023-03-05 tags: [en, legal, classification, clauses, existence, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Existence` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Existence`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_existence_bert_en_1.0.0_3.0_1678050589588.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_existence_bert_en_1.0.0_3.0_1678050589588.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_existence_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
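The paragraph splitting (by multiline) recommended in the description can be sketched as a plain-Python pre-processing step before building the Spark DataFrame; the regex and the `split_paragraphs` helper are illustrative only, not part of Spark NLP:

```python
import re

def split_paragraphs(text: str):
    """Split a document into provision-sized paragraphs on blank lines."""
    # One or more blank lines (possibly containing whitespace) delimit paragraphs
    parts = re.split(r"\n\s*\n", text)
    # Drop empty fragments and surrounding whitespace
    return [p.strip() for p in parts if p.strip()]

doc = "1. Existence.\nThe Company is duly organized.\n\n2. Notices.\nAll notices shall be in writing."
print(split_paragraphs(doc))
# → ['1. Existence.\nThe Company is duly organized.', '2. Notices.\nAll notices shall be in writing.']
```

Each returned paragraph can then be fed to the classifier as a separate row, keeping every input well under the 512-token limit of the embeddings.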
## Results ```bash +-----------+ |result     | +-----------+ |[Existence]| |[Other]    | |[Other]    | |[Existence]| +-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_existence_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Existence 0.96 1.00 0.98 26 Other 1.00 0.98 0.99 43 accuracy - - 0.99 69 macro-avg 0.98 0.99 0.98 69 weighted-avg 0.99 0.99 0.99 69 ``` --- layout: model title: Chinese BertForQuestionAnswering model (from luhua) author: John Snow Labs name: bert_qa_chinese_pretrain_mrc_macbert_large date: 2022-06-06 tags: [zh, open_source, question_answering, bert] task: Question Answering language: zh edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_pretrain_mrc_macbert_large` is a Chinese model originally trained by `luhua`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_chinese_pretrain_mrc_macbert_large_zh_4.0.0_3.0_1654537920747.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_chinese_pretrain_mrc_macbert_large_zh_4.0.0_3.0_1654537920747.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_chinese_pretrain_mrc_macbert_large","zh") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_chinese_pretrain_mrc_macbert_large","zh") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("zh.answer_question.mac_bert.large").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_chinese_pretrain_mrc_macbert_large| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|zh| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/luhua/chinese_pretrain_mrc_macbert_large - https://github.com/basketballandlearn/MRC_Competition_Dureader --- layout: model title: Pipeline to Detect Clinical Entities author: John Snow Labs name: bert_token_classifier_ner_clinical_pipeline date: 2022-03-15 tags: [clincal, ner, bert_token_classifier, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_clinical](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_clinical_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CLINICAL/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_pipeline_en_3.4.1_3.0_1647346130209.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_pipeline_en_3.4.1_3.0_1647346130209.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python clinical_pipeline = PretrainedPipeline("bert_token_classifier_ner_clinical_pipeline", "en", "clinical/models") clinical_pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . 
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge .") ``` ```scala val clinical_pipeline = new PretrainedPipeline("bert_token_classifier_ner_clinical_pipeline", "en", "clinical/models") clinical_pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . 
Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . 
She had close follow-up with endocrinology post discharge .") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.clinical_pipeline").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . 
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge .""") ```
## Results ```bash +-----------------------------+---------+ |chunk |ner_label| +-----------------------------+---------+ |gestational diabetes mellitus|PROBLEM | |type two diabetes mellitus |PROBLEM | |T2DM |PROBLEM | |HTG-induced pancreatitis |PROBLEM | |an acute hepatitis |PROBLEM | |obesity |PROBLEM | |a body mass index |TEST | |BMI |TEST | |polyuria |PROBLEM | |polydipsia |PROBLEM | |poor appetite |PROBLEM | |vomiting |PROBLEM | |amoxicillin |TREATMENT| |a respiratory tract infection|PROBLEM | |metformin |TREATMENT| |glipizide |TREATMENT| |dapagliflozin |TREATMENT| |T2DM |PROBLEM | |atorvastatin |TREATMENT| |gemfibrozil |TREATMENT| +-----------------------------+---------+ only showing top 20 rows ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverter - Finisher --- layout: model title: Detect Clinical Entities (ner_jsl_slim) author: John Snow Labs name: ner_jsl_slim date: 2021-08-13 tags: [ner, en, clinical, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.2.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a pretrained named entity recognition deep learning model for clinical terminology. It is based on `ner_jsl` model, but with more generalised entities. Definitions of Predicted Entities: - `Death_Entity`: Mentions that indicate the death of a patient. - `Medical_Device`: All mentions related to medical devices and supplies. - `Vital_Signs_Header`: Identifies section headers that correspond to Vital Signs of a patient. 
- `Allergen`: Allergen related extractions mentioned in the document. - `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients. - `Clinical_Dept`: Terms that indicate the medical and/or surgical departments. - `Symptom`: All the symptoms mentioned in the document, of a patient or someone else. - `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by naked eye. - `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient. - `Age`: All mention of ages, past or present, related to the patient or with anybody else. - `Birth_Entity`: Mentions that indicate giving birth. - `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else. - `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs). - `Test_Result`: Terms related to all the test results present in the document (clinical tests results are included). - `Test`: Mentions of laboratory, pathology, and radiological tests. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments. - `Treatment`: Includes therapeutic and minimally invasive treatment and procedures (invasive treatments or procedures are extracted as "Procedure"). - `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.). 
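Given the entity definitions above, a common downstream step is keeping only a subset of labels from the converter's output. A minimal plain-Python sketch — the `(chunk, label)` pairs and the `select_labels` helper are illustrative, not actual model output or Spark NLP API:

```python
# Illustrative (chunk, label) pairs, shaped like the NerConverter output
entities = [
    ("mammography", "Test"),
    ("soft tissue lump", "Symptom"),
    ("breast cancer", "Oncological"),
    ("shoulder", "Body_Part"),
]

def select_labels(pairs, wanted):
    """Keep only the chunks whose label is in the wanted set."""
    wanted = set(wanted)
    return [chunk for chunk, label in pairs if label in wanted]

print(select_labels(entities, {"Oncological", "Symptom"}))
# → ['soft tissue lump', 'breast cancer']
```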
## Predicted Entities `Death_Entity`, `Medical_Device`, `Vital_Sign`, `Alergen`, `Drug`, `Clinical_Dept`, `Lifestyle`, `Symptom`, `Body_Part`, `Physical_Measurement`, `Admission_Discharge`, `Date_Time`, `Age`, `Birth_Entity`, `Header`, `Oncological`, `Substance_Quantity`, `Test_Result`, `Test`, `Procedure`, `Treatment`, `Disease_Syndrome_Disorder`, `Pregnancy_Newborn`, `Demographics` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_slim_en_3.2.0_3.0_1628875762291.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_slim_en_3.2.0_3.0_1628875762291.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") jsl_ner = MedicalNerModel.pretrained("ner_jsl_slim", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("jsl_ner") jsl_ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "jsl_ner"]) \ .setOutputCol("ner_chunk") jsl_ner_pipeline = Pipeline().setStages([ documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter]) jsl_ner_model = jsl_ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame([["""HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. 
Patient denies personal history of breast cancer."""]]).toDF("text") result = jsl_ner_model.transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val jsl_ner = MedicalNerModel.pretrained("ner_jsl_slim", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("jsl_ner") val jsl_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "jsl_ner")) .setOutputCol("ner_chunk") val jsl_ner_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter)) val data = Seq("""HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer.""").toDS.toDF("text") val result = jsl_ner_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl_slim").predict("""HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer.""") ```
## Results ```bash | | chunk | entity | |---:|:-----------------|:-------------| | 0 | HISTORY: | Header | | 1 | 30-year-old | Age | | 2 | female | Demographics | | 3 | mammography | Test | | 4 | soft tissue lump | Symptom | | 5 | shoulder | Body_Part | | 6 | breast cancer | Oncological | | 7 | her mother | Demographics | | 8 | age 58 | Age | | 9 | breast cancer | Oncological | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_slim| |Compatibility:|Healthcare NLP 3.2.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on data annotated by JSL. ## Benchmarking ```bash label tp fp fn prec rec f1 B-Medical_Device 2696 444 282 0.8585987 0.9053055 0.8813337 I-Physical_Measurement 220 16 34 0.9322034 0.8661417 0.8979592 B-Procedure 1800 239 281 0.8827857 0.8649688 0.8737864 B-Drug 1865 218 237 0.8953432 0.8872502 0.8912784 I-Test_Result 289 203 292 0.5873983 0.4974182 0.5386766 I-Pregnancy_Newborn 150 41 104 0.7853403 0.5905512 0.6741573 B-Admission_Discharge 255 35 6 0.8793103 0.9770115 0.9255898 B-Demographics 4609 119 123 0.9748308 0.9740068 0.9744186 I-Lifestyle 71 49 20 0.5916666 0.7802198 0.6729857 B-Header 2463 53 122 0.9789348 0.9528046 0.965693 I-Date_Time 928 184 191 0.8345324 0.8293119 0.8319139 B-Test_Result 866 198 262 0.8139097 0.7677305 0.7901459 I-Treatment 114 37 46 0.7549669 0.7125 0.733119 B-Clinical_Dept 688 83 76 0.8923476 0.9005235 0.8964169 B-Test 1920 333 313 0.8521970 0.8598298 0.8559965 B-Death_Entity 36 9 2 0.8 0.9473684 0.8674699 B-Lifestyle 268 58 50 0.8220859 0.8427673 0.8322981 B-Date_Time 823 154 176 0.8423746 0.8238238 0.8329959 I-Age 136 34 49 0.8 0.7351351 0.7661972 I-Oncological 345 41 19 0.8937824 0.9478022 0.9199999 I-Body_Part 3717 720 424 0.8377282 0.8976093 0.8666356 B-Pregnancy_Newborn 153 51 104 0.75 0.5953307 0.6637744 B-Treatment 169 41 58 0.8047619 0.7444933 0.7734553 I-Procedure 2302 326 
417 0.8759513 0.8466348 0.8610435 B-Birth_Entity 6 5 7 0.5454545 0.4615384 0.5 I-Vital_Sign 639 197 93 0.7643540 0.8729508 0.815051 I-Header 4451 111 216 0.9756685 0.9537176 0.9645682 I-Death_Entity 2 0 0 1 1 1 I-Clinical_Dept 621 54 39 0.92 0.9409091 0.9303371 I-Test 1593 378 353 0.8082192 0.8186022 0.8133775 B-Age 472 43 51 0.9165048 0.9024856 0.9094413 I-Symptom 4227 1271 1303 0.7688250 0.7643761 0.7665941 I-Demographics 321 53 53 0.8582887 0.8582887 0.8582887 B-Body_Part 6312 912 809 0.8737541 0.8863923 0.8800279 B-Physical_Measurement 91 10 17 0.9009901 0.8425926 0.8708134 B-Disease_Syndrome_Disorder 2817 336 433 0.8934348 0.8667692 0.8799001 B-Symptom 4522 830 747 0.8449178 0.8582274 0.8515206 I-Disease_Syndrome_Disorder 2814 386 530 0.879375 0.8415072 0.8600244 I-Drug 3737 612 517 0.859278 0.8784673 0.8687667 I-Medical_Device 1825 331 131 0.8464749 0.9330266 0.8876459 B-Oncological 276 28 27 0.9078947 0.9108911 0.9093904 B-Vital_Sign 429 97 79 0.8155893 0.8444882 0.8297872 Macro-average 62038 9340 9110 0.7678277 0.7648211 0.7663215 Micro-average 62038 9340 9110 0.8691473 0.8719570 0.87055 ``` --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_tingtingyuli TFWav2Vec2ForCTC from tingtingyuli author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab_by_tingtingyuli date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_base_timit_demo_colab_by_tingtingyuli` is a English model originally trained by tingtingyuli. 
NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_by_tingtingyuli_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_tingtingyuli_en_4.2.0_3.0_1664108351521.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_tingtingyuli_en_4.2.0_3.0_1664108351521.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_by_tingtingyuli", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_by_tingtingyuli", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab_by_tingtingyuli| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|349.4 MB| --- layout: model title: NER Pipeline for 10 High Resourced Languages author: John Snow Labs name: xlm_roberta_large_token_classifier_hrl_pipeline date: 2022-04-04 tags: [arabic, german, english, spanish, french, italian, latvian, dutch, portuguese, chinese, xlm, roberta, ner, xx, open_source] task: Named Entity Recognition language: xx edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [xlm_roberta_large_token_classifier_hrl](https://nlp.johnsnowlabs.com/2021/12/26/xlm_roberta_large_token_classifier_hrl_xx.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_HRL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/Ner_HRL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_hrl_pipeline_xx_3.4.1_3.0_1649060718074.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_hrl_pipeline_xx_3.4.1_3.0_1649060718074.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("xlm_roberta_large_token_classifier_hrl_pipeline", lang = "xx") pipeline.annotate("يمكنكم مشاهدة أمير منطقة الرياض الأمير فيصل بن بندر بن عبد العزيز في كل مناسبة وافتتاح تتعلق بمشاريع التعليم والصحة وخدمة الطرق والمشاريع الثقافية في منطقة الرياض.") ``` ```scala val pipeline = new PretrainedPipeline("xlm_roberta_large_token_classifier_hrl_pipeline", lang = "xx") pipeline.annotate("يمكنكم مشاهدة أمير منطقة الرياض الأمير فيصل بن بندر بن عبد العزيز في كل مناسبة وافتتاح تتعلق بمشاريع التعليم والصحة وخدمة الطرق والمشاريع الثقافية في منطقة الرياض.") ```
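The pipeline's NerConverter stage (listed under Included Models) turns token-level B-/I- tags from the classifier into the chunk/label pairs returned by `annotate`. A minimal plain-Python sketch of that merging logic (illustrative, not the annotator's actual implementation):

```python
# Merge consecutive B-X / I-X tags into (chunk, label) pairs.
def bio_to_chunks(tokens, tags):
    chunks = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            chunks.append([token, tag[2:]])
        elif tag.startswith("I-") and chunks and chunks[-1][1] == tag[2:]:
            chunks[-1][0] += " " + token
        # "O" tags and stray I- tags are skipped in this sketch
    return [(text, label) for text, label in chunks]

tokens = ["Centre", "Hospitalier", "De", "Plaisir", "in", "Paris"]
tags = ["B-HOSPITAL", "I-HOSPITAL", "I-HOSPITAL", "I-HOSPITAL", "O", "B-LOC"]
print(bio_to_chunks(tokens, tags))
# [('Centre Hospitalier De Plaisir', 'HOSPITAL'), ('Paris', 'LOC')]
```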
## Results ```bash +---------------------------+---------+ |chunk |ner_label| +---------------------------+---------+ |الرياض |LOC | |فيصل بن بندر بن عبد العزيز |PER | |الرياض |LOC | +---------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_large_token_classifier_hrl_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|xx| |Size:|1.8 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - XlmRoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_12_h_768 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-12_H-768` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_768_zh_4.2.4_3.0_1670325846629.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_768_zh_4.2.4_3.0_1670325846629.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_768","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_768","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
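BertEmbeddings produces one vector per token; a common follow-up step (for example, Spark NLP's SentenceEmbeddings annotator with average pooling) collapses them into a single sentence vector. A minimal plain-Python sketch of average pooling, with made-up two-dimensional vectors:

```python
# Average per-token vectors into one sentence-level vector.
def average_pool(token_vectors):
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
print(average_pool(token_vectors))  # [3.0, 2.0]
```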
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_12_h_768| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|383.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-12_H-768 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_2_h_128 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-2_H-128` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_2_h_128_zh_4.2.4_3.0_1670021609529.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_2_h_128_zh_4.2.4_3.0_1670021609529.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_2_h_128","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_2_h_128","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_2_h_128| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|12.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-2_H-128 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Fast Neural Machine Translation Model from North Germanic Languages to English author: John Snow Labs name: opus_mt_gmq_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, gmq, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
- source languages: `gmq` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_gmq_en_xx_2.7.0_2.4_1609163881569.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_gmq_en_xx_2.7.0_2.4_1609163881569.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_gmq_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["PUT YOUR TEXT HERE"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_gmq_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("PUT YOUR TEXT HERE").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.gmq.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
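Marian translates sentence by sentence, which is why the pipeline runs SentenceDetectorDLModel before the transformer. The real detector is a trained model; a naive punctuation-based approximation of that splitting step (a sketch only, not the detector's actual behavior):

```python
import re

# Split text into sentences at whitespace that follows ., !, or ?.
# A trained detector also handles abbreviations, numbers, etc.
def naive_sentence_split(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(naive_sentence_split("Hej! Hur mår du? Jag mår bra."))
# ['Hej!', 'Hur mår du?', 'Jag mår bra.']
```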
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_gmq_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4CHEMD_Chem_Modified_BioBERT_512 date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Chem-Modified-BioBERT-512` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Modified_BioBERT_512_en_4.0.0_3.0_1657108318400.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Modified_BioBERT_512_en_4.0.0_3.0_1657108318400.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Modified_BioBERT_512","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Modified_BioBERT_512","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4CHEMD_Chem_Modified_BioBERT_512| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|403.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4CHEMD-Chem-Modified-BioBERT-512 --- layout: model title: Legal Documents Clause Binary Classifier author: John Snow Labs name: legclf_documents_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `documents` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `documents` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_documents_clause_en_1.0.0_3.2_1660122381904.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_documents_clause_en_1.0.0_3.2_1660122381904.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_documents_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
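The description above recommends splitting long documents into paragraphs (by multiline) before classification, since the embeddings accept up to 512 tokens. A minimal plain-Python sketch of that splitting step, with hypothetical clause text:

```python
import re

# Split a document on blank lines (one or more empty/whitespace lines).
def split_paragraphs(document):
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

doc = ("CLAUSE 1. Documents.\nAll documents shall be kept by the parties.\n"
       "\n"
       "CLAUSE 2. Notices.\nAny notice shall be in writing.")
paragraphs = split_paragraphs(doc)
print(len(paragraphs))  # 2
```

Each resulting paragraph can then be sent through the classifier pipeline as its own `clause_text` row.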
## Results ```bash +-------+ | result| +-------+ |[documents]| |[other]| |[other]| |[documents]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_documents_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.2 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support documents 0.82 0.77 0.79 81 other 0.93 0.95 0.94 284 accuracy - - 0.91 365 macro-avg 0.88 0.86 0.87 365 weighted-avg 0.91 0.91 0.91 365 ``` --- layout: model title: Detect PHI for Deidentification purposes (French) author: John Snow Labs name: ner_deid_subentity date: 2022-02-14 tags: [deid, fr, licensed] task: De-identification language: fr edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (French) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 15 entities. This NER model is trained on an internally annotated custom dataset, the [French WikiNER dataset](https://metatext.io/datasets/wikiner), a [public dataset of French company names](https://www.data.gouv.fr/fr/datasets/entreprises-immatriculees-en-2017/), a [public dataset of French hospital names](https://salesdorado.com/fichiers-prospection/hopitaux/) and several data augmentation mechanisms.
## Predicted Entities `PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `E-MAIL`, `USERNAME`, `ZIP`, `MEDICALRECORD`, `PROFESSION`, `PHONE`, `DOCTOR`, `AGE`, `STREET`, `CITY`, `COUNTRY` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_fr_3.4.1_3.0_1644838067533.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_fr_3.4.1_3.0_1644838067533.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "fr")\ .setInputCols(["sentence", "token"])\ .setOutputCol("word_embeddings") clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "fr", "clinical/models")\ .setInputCols(["sentence","token", "word_embeddings"])\ .setOutputCol("ner") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner]) text = ["J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015."] data = spark.createDataFrame([text]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "fr") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "fr", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner)) val text = "J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015." 
val data = Seq(text).toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fr.med_ner.deid_subentity").predict("""J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015.""") ```
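A typical next step after detecting PHI chunks is to replace each one with its entity label. A minimal masking sketch over (begin, end, label) spans; the offsets below are hypothetical, in practice they come from the NER chunk annotations:

```python
# Replace each (begin, end, label) span in the text with <LABEL>.
def mask_phi(text, spans):
    # Process spans from the end of the string so that earlier
    # offsets remain valid after each replacement.
    for begin, end, label in sorted(spans, reverse=True):
        text = text[:begin] + f"<{label}>" + text[end:]
    return text

text = "Michel Martinez (49 ans)"
spans = [(0, 15, "PATIENT"), (17, 19, "AGE")]  # hypothetical offsets
print(mask_phi(text, spans))  # <PATIENT> (<AGE> ans)
```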
## Results ```bash +------------+----------+ | token| ner_label| +------------+----------+ | J'ai| O| | vu| O| | en| O| |consultation| O| | Michel| B-PATIENT| | Martinez| I-PATIENT| | (| O| | 49| B-AGE| | ans| O| | )| O| | adressé| O| | au| O| | Centre|B-HOSPITAL| | Hospitalier|I-HOSPITAL| | De|I-HOSPITAL| | Plaisir|I-HOSPITAL| | pour| O| | un| O| | diabète| O| | mal| O| | contrôlé| O| | avec| O| | des| O| | symptômes| O| | datant| O| | de| O| | Mars| B-DATE| | 2015| I-DATE| | .| O| +------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|fr| |Size:|15.0 MB| ## References - Internal JSL annotated corpus - [French WikiNER dataset](https://metatext.io/datasets/wikiner) - [Public dataset of French company names](https://www.data.gouv.fr/fr/datasets/entreprises-immatriculees-en-2017/) - [Public dataset of French hospital names](https://salesdorado.com/fichiers-prospection/hopitaux/) ## Benchmarking ```bash label tp fp fn total precision recall f1 PATIENT 1966.0 124.0 135.0 2101.0 0.9407 0.9357 0.9382 HOSPITAL 315.0 23.0 19.0 334.0 0.932 0.9431 0.9375 DATE 2605.0 31.0 49.0 2654.0 0.9882 0.9815 0.9849 ORGANIZATION 503.0 142.0 159.0 662.0 0.7798 0.7598 0.7697 CITY 2296.0 370.0 351.0 2647.0 0.8612 0.8674 0.8643 MAIL 46.0 0.0 0.0 46.0 1.0 1.0 1.0 STREET 31.0 4.0 3.0 34.0 0.8857 0.9118 0.8986 USERNAME 91.0 1.0 14.0 105.0 0.9891 0.8667 0.9239 ZIP 33.0 0.0 0.0 33.0 1.0 1.0 1.0 MEDICALRECORD 100.0 11.0 2.0 102.0 0.9009 0.9804 0.939 PROFESSION 321.0 59.0 87.0 408.0 0.8447 0.7868 0.8147 PHONE 114.0 3.0 2.0 116.0 0.9744 0.9828 0.9785 COUNTRY 287.0 14.0 51.0 338.0 0.9535 0.8491 0.8983 DOCTOR 622.0 7.0 4.0 626.0 0.9889 0.9936 0.9912 AGE 370.0 52.0 71.0 441.0 0.8768 0.839 0.8575 macro - - - - - - 0.9197 micro - - - - - - 0.9154 ``` --- layout: model title: 
English RobertaForQuestionAnswering (from Gam) author: John Snow Labs name: roberta_qa_roberta_base_finetuned_cuad date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-cuad` is an English model originally trained by `Gam`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_cuad_en_4.0.0_3.0_1655733742036.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_cuad_en_4.0.0_3.0_1655733742036.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_finetuned_cuad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_finetuned_cuad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.cuad.roberta.base.by_Gam").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
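An extractive QA head like this one scores every token position as a possible answer start and end; the predicted answer is the highest-scoring valid span (start not after end). A plain-Python sketch with made-up logits, not the model's actual scores:

```python
# Pick the best (start, end) span from per-token start/end logits.
def best_span(start_logits, end_logits, max_len=15):
    best = (0, 0, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best[:2]

tokens = ["My", "name", "is", "Clara"]
start_logits = [0.1, 0.2, 0.1, 2.5]  # made-up values
end_logits = [0.1, 0.3, 0.2, 3.0]
s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # Clara
```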
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_finetuned_cuad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|451.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Gam/roberta-base-finetuned-cuad --- layout: model title: Multilingual (Croatian, Slovenian, English) Bert Embeddings (Base) author: John Snow Labs name: bert_embeddings_crosloengual_bert date: 2022-04-11 tags: [bert, embeddings, en, hr, sl, xx, multilingual, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `crosloengual-bert` is an English model originally trained by `EMBEDDIA`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_crosloengual_bert_en_3.4.2_3.0_1649671890116.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_crosloengual_bert_en_3.4.2_3.0_1649671890116.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_crosloengual_bert","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_crosloengual_bert","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.crosloengual_bert").predict("""I love Spark NLP""") ```
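Embedding vectors like the ones produced above are usually compared with cosine similarity (1.0 for identical directions, 0.0 for orthogonal ones). A self-contained sketch with toy two-dimensional vectors:

```python
import math

# Cosine similarity: dot product divided by the product of norms.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(round(cosine([1.0, 0.0], [1.0, 0.0]), 3))  # 1.0
print(round(cosine([1.0, 0.0], [0.0, 1.0]), 3))  # 0.0
```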
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_crosloengual_bert| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|466.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/EMBEDDIA/crosloengual-bert - https://arxiv.org/abs/2006.07890 --- layout: model title: Detect Assertion Status (assertion_dl_scope_L10R10) author: John Snow Labs name: assertion_dl_scope_L10R10 date: 2022-03-17 tags: [clinical, en, assertion, licensed] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 3.4.2 spark_version: 2.4 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model considers 10 tokens on the left and 10 tokens on the right side of the clinical entities extracted by NER models and assigns their assertion status based on their context in this scope. ## Predicted Entities `present`, `absent`, `possible`, `conditional`, `associated_with_someone_else`, `hypothetical` {:.btn-box} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_dl_scope_L10R10_en_3.4.2_2.4_1647530830424.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_dl_scope_L10R10_en_3.4.2_2.4_1647530830424.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") token = Tokenizer()\ .setInputCols(['sentence'])\ .setOutputCol('token') word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") clinical_assertion = AssertionDLModel.pretrained("assertion_dl_scope_L10R10", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") nlpPipeline = Pipeline(stages=[document,sentenceDetector, token, word_embeddings,clinical_ner,ner_converter, clinical_assertion]) text = "Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer." 
data = spark.createDataFrame([[text]]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val clinical_assertion = AssertionDLModel.pretrained("assertion_dl_scope_L10R10", "en", "clinical/models") .setInputCols(Array("sentence", "ner_chunk", "embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion)) val data = Seq("Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.l10r10").predict("""Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer.""") ```
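The ±10-token scope described above can be illustrated with a small, self-contained sketch. `scope_window` is a hypothetical helper (not part of Spark NLP) showing which context tokens the model considers around an extracted entity:

```python
# Hypothetical illustration of the L10R10 scope: the model looks at up to
# 10 tokens to the left and 10 tokens to the right of an extracted entity.
def scope_window(tokens, start, end, left=10, right=10):
    """Return the (left, right) context around the entity tokens[start:end+1]."""
    left_ctx = tokens[max(0, start - left):start]
    right_ctx = tokens[end + 1:end + 1 + right]
    return left_ctx, right_ctx

tokens = "He shows no stomach pain and he maintained on an epidural".split()
# The entity "stomach pain" spans token indices 3..4
left_ctx, right_ctx = scope_window(tokens, 3, 4)
```

Here `left_ctx` is `['He', 'shows', 'no']`; the negation cue `no` falling inside the window is what allows the model to assign `absent` to `stomach pain`.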
## Results ```bash +---------------+---------+----------------------------+ |chunk |entity |assertion | +---------------+---------+----------------------------+ |severe fever |PROBLEM |present | |sore throat |PROBLEM |present | |stomach pain |PROBLEM |absent | |an epidural |TREATMENT|present | |PCA |TREATMENT|present | |pain control |PROBLEM |present | |short of breath|PROBLEM |conditional | |CT |TEST |present | |lung tumor |PROBLEM |present | |Alzheimer |PROBLEM |associated_with_someone_else| +---------------+---------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_dl_scope_L10R10| |Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| |Size:|1.4 MB| ## References Trained on 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text with ‘embeddings_clinical’. https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ ## Benchmarking ```bash label tp fp fn prec rec f1 absent 812 48 71 0.94418603 0.9195923 0.93172693 present 2463 127 141 0.9509652 0.9458525 0.948402 conditional 25 19 28 0.5681818 0.4716981 0.5154639 associated_with_someone_else 36 7 9 0.8372093 0.8 0.8181818 hypothetical 147 31 28 0.8258427 0.84 0.8328612 possible 159 87 42 0.64634144 0.7910448 0.71140933 Macro-average - - - 0.79545444 0.7946979 0.795076 Micro-average - - - 0.91946477 0.9194648 0.91946477 ``` --- layout: model title: Legal Clauses Multilabel Classifier author: John Snow Labs name: legmulticlf_edgar date: 2022-08-30 tags: [en, legal, classification, clauses, edgar, ledgar, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true recommended: true annotator: MultiClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Multilabel Document Classification 
model, which can be used to identify up to 15 classes in texts. The classes are the following: - terminations - assigns - notices - amendments - waivers - survival - successors - governing laws - severability - expenses - assignments - warranties - representations - entire agreements - counterparts ## Predicted Entities `terminations`, `assigns`, `notices`, `amendments`, `waivers`, `survival`, `successors`, `governing laws`, `severability`, `expenses`, `assignments`, `warranties`, `representations`, `entire agreements`, `counterparts` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/LEGMULTICLF_LEDGAR/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legmulticlf_edgar_en_1.0.0_3.2_1661858359724.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legmulticlf_edgar_en_1.0.0_3.2_1661858359724.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_uncased_legal", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") multiClassifier = nlp.MultiClassifierDLModel.pretrained("legmulticlf_edgar", "en", "legal/models") \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") ledgar_pipeline = nlp.Pipeline( stages=[document, embeddings, multiClassifier]) light_pipeline = nlp.LightPipeline(ledgar_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) result = light_pipeline.annotate("""(a) No failure or delay by the Administrative Agent or any Lender in exercising any right or power hereunder shall operate as a waiver thereof, nor shall any single or partial exercise of any such right or power, or any abandonment or discontinuance of steps to enforce such a right or power, preclude any other or further exercise thereof or the exercise of any other right or power. The rights and remedies of the Administrative Agent and the Lenders hereunder are cumulative and are not exclusive of any rights or remedies that they would otherwise have. No waiver of any provision of this Agreement or consent to any departure by the Borrower therefrom shall in any event be effective unless the same shall be permitted by paragraph (b) of this Section, and then such waiver or consent shall be effective only in the specific instance and for the purpose for which given. Without limiting the generality of the foregoing, the making of a Loan shall not be construed as a waiver of any Default, regardless of whether the Administrative Agent or any Lender may have had notice or knowledge of such Default at the time.""") result["class"] ```
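Because this is a multilabel model, downstream code often needs the prediction as a fixed-length indicator vector over the 15 classes listed above. A minimal sketch in plain Python (`to_multi_hot` is a hypothetical helper, not a Spark NLP API):

```python
# The 15 clause classes from the model description
LABELS = ["terminations", "assigns", "notices", "amendments", "waivers",
          "survival", "successors", "governing laws", "severability",
          "expenses", "assignments", "warranties", "representations",
          "entire agreements", "counterparts"]

def to_multi_hot(predicted):
    """Turn a list of predicted labels into a 0/1 vector over LABELS."""
    present = set(predicted)
    return [1 if label in present else 0 for label in LABELS]

# The labels predicted for the sample clause above
vec = to_multi_hot(["waivers", "amendments"])
```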
## Results ```bash ['waivers', 'amendments'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legmulticlf_edgar| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|13.9 MB| ## References Ledgar dataset, available at https://metatext.io/datasets/ledgar, with in-house data augmentation ## Benchmarking ```bash label precision recall f1-score support amendments 0.89 0.66 0.76 2126 expenses 0.74 0.45 0.56 783 assigns 0.82 0.36 0.50 1156 counterparts 0.99 0.97 0.98 1903 entire_agreements 0.98 0.91 0.94 2168 expenses 0.99 0.53 0.70 817 governing_laws 0.96 0.98 0.97 2608 notices 0.94 0.94 0.94 1888 representations 0.91 0.72 0.80 911 severability 0.97 0.95 0.96 1640 successors 0.90 0.50 0.64 1423 survival 0.95 0.85 0.90 1175 terminations 0.62 0.76 0.68 912 waivers 0.92 0.59 0.72 1474 warranties 0.82 0.66 0.73 756 micro-avg 0.92 0.77 0.84 21740 macro-avg 0.89 0.72 0.78 21740 weighted-avg 0.91 0.77 0.82 21740 samples-avg 0.81 0.80 0.80 21740 ``` --- layout: model title: Translate Luvale to English Pipeline author: John Snow Labs name: translate_lue_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, lue, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `lue` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_lue_en_xx_2.7.0_2.4_1609686986745.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_lue_en_xx_2.7.0_2.4_1609686986745.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_lue_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_lue_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.lue.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_lue_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: German asr_exp_w2v2t_vp_100k_s627 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_exp_w2v2t_vp_100k_s627 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2t_vp_100k_s627` is a German model originally trained by jonatasgrosman. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2t_vp_100k_s627_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2t_vp_100k_s627_de_4.2.0_3.0_1664106080087.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2t_vp_100k_s627_de_4.2.0_3.0_1664106080087.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2t_vp_100k_s627', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2t_vp_100k_s627", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_exp_w2v2t_vp_100k_s627| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_16_finetuned_squad_seed_10 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-16-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_16_finetuned_squad_seed_10_en_4.3.0_3.0_1674214242487.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_16_finetuned_squad_seed_10_en_4.3.0_3.0_1674214242487.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_16_finetuned_squad_seed_10","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_16_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_16_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|416.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-16-finetuned-squad-seed-10 --- layout: model title: English BertForQuestionAnswering model (from huggingface) author: John Snow Labs name: bert_qa_prunebert_base_uncased_6_finepruned_w_distil_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `prunebert-base-uncased-6-finepruned-w-distil-squad` is an English model originally trained by `huggingface`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_prunebert_base_uncased_6_finepruned_w_distil_squad_en_4.0.0_3.0_1654189045251.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_prunebert_base_uncased_6_finepruned_w_distil_squad_en_4.0.0_3.0_1654189045251.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_prunebert_base_uncased_6_finepruned_w_distil_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_prunebert_base_uncased_6_finepruned_w_distil_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.distilled_base_uncased.by_huggingface").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_prunebert_base_uncased_6_finepruned_w_distil_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|141.0 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/huggingface/prunebert-base-uncased-6-finepruned-w-distil-squad --- layout: model title: Legal Notification Clause Binary Classifier author: John Snow Labs name: legclf_notification_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `notification` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `notification` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_notification_clause_en_1.0.0_3.2_1660123773456.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_notification_clause_en_1.0.0_3.2_1660123773456.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_notification_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
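The description above recommends splitting long documents into paragraphs before classification. A minimal sketch of paragraph splitting by multiline, in plain Python (a simplification of the techniques in the linked tutorial, not a Spark NLP API):

```python
def split_paragraphs(text):
    """Split a document on blank lines and drop empty fragments."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

doc = ("NOTICES. All notices under this Agreement shall be in writing.\n\n"
       "GOVERNING LAW. This Agreement shall be governed by the laws of Delaware.")
paragraphs = split_paragraphs(doc)
# Each paragraph can then be loaded into the "clause_text" column
# and passed through the pipeline above.
```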
## Results ```bash +-------+ | result| +-------+ |[notification]| |[other]| |[other]| |[notification]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_notification_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.2 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support notification 0.87 0.91 0.89 90 other 0.96 0.94 0.95 214 accuracy - - 0.93 304 macro-avg 0.92 0.93 0.92 304 weighted-avg 0.94 0.93 0.93 304 ``` --- layout: model title: Amharic RoBERTa Embeddings author: John Snow Labs name: roberta_embeddings_am_roberta date: 2022-04-14 tags: [roberta, embeddings, am, open_source] task: Embeddings language: am edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `am-roberta` is an Amharic model originally trained by `uhhlt`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_am_roberta_am_3.4.2_3.0_1649949250060.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_am_roberta_am_3.4.2_3.0_1649949250060.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_am_roberta","am") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["ስካርቻ nlp እወዳለሁ"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_am_roberta","am") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("ስካርቻ nlp እወዳለሁ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("am.embed.am_roberta").predict("""ስካርቻ nlp እወዳለሁ""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_am_roberta| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|am| |Size:|1.6 GB| |Case sensitive:|true| ## References - https://huggingface.co/uhhlt/am-roberta - https://github.com/uhh-lt/amharicmodels - https://www.mdpi.com/1999-5903/13/11/275 --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_dl4 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-dl4` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dl4_en_4.3.0_3.0_1675118749298.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dl4_en_4.3.0_3.0_1675118749298.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_dl4","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_dl4","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_dl4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|131.9 MB| ## References - https://huggingface.co/google/t5-efficient-small-dl4 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Clinical Findings to UMLS Code Pipeline author: John Snow Labs name: umls_clinical_findings_resolver_pipeline date: 2023-03-10 tags: [en, licensed, umls, pipeline] task: Entity Resolution language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps entities (Clinical Findings) with their corresponding UMLS CUI codes. You’ll just feed your text and it will return the corresponding UMLS codes. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/umls_clinical_findings_resolver_pipeline_en_4.3.0_3.2_1678436541287.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/umls_clinical_findings_resolver_pipeline_en_4.3.0_3.2_1678436541287.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("umls_clinical_findings_resolver_pipeline", "en", "clinical/models") text = "HTG-induced pancreatitis associated with an acute hepatitis, and obesity" result = pipeline.annotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("umls_clinical_findings_resolver_pipeline", "en", "clinical/models") val text = "HTG-induced pancreatitis associated with an acute hepatitis, and obesity" val result = pipeline.annotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.umls_clinical_findings_resolver").predict("""HTG-induced pancreatitis associated with an acute hepatitis, and obesity""") ```
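`annotate` returns plain Python lists, so the extracted chunks and their codes can be zipped into a lookup table. A sketch with the result mocked to match the Results section; treat the `ner_chunk` and `umls_code` keys as assumptions, since the actual keys depend on the pipeline's output column names:

```python
# Mocked annotate() output; in practice the dict keys depend on the
# pipeline's configured output columns.
result = {
    "ner_chunk": ["HTG-induced pancreatitis", "an acute hepatitis", "obesity"],
    "umls_code": ["C1963198", "C4750596", "C1963185"],
}
# Pair each chunk with its resolved UMLS CUI
chunk_to_cui = dict(zip(result["ner_chunk"], result["umls_code"]))
```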
## Results ```bash +------------------------+---------+---------+ |chunk |ner_label|umls_code| +------------------------+---------+---------+ |HTG-induced pancreatitis|PROBLEM |C1963198 | |an acute hepatitis |PROBLEM |C4750596 | |obesity |PROBLEM |C1963185 | +------------------------+---------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|umls_clinical_findings_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|4.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger --- layout: model title: English BertForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: bert_qa_emanuals_squad2.0 date: 2022-07-06 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `EManuals_BERT_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_emanuals_squad2.0_en_4.0.0_3.0_1657101207353.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_emanuals_squad2.0_en_4.0.0_3.0_1657101207353.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_emanuals_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_emanuals_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_emanuals_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/EManuals_BERT_squad2.0 --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from aaraki) author: John Snow Labs name: distilbert_qa_aaraki_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `aaraki`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_aaraki_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769590924.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_aaraki_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769590924.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_aaraki_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_aaraki_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_aaraki_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aaraki/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal Effective date Clause Binary Classifier author: John Snow Labs name: legclf_effective_date_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `effective-date` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Bear in mind that this model's embeddings allow up to 512 tokens; if your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above). 
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `effective-date` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_effective_date_clause_en_1.0.0_3.2_1660123458833.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_effective_date_clause_en_1.0.0_3.2_1660123458833.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_effective_date_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
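The paragraph-splitting advice from the description can be sketched in plain Python before building `clause_text` rows. This is only an illustration of splitting by blank lines (the "multiline" technique), not the workshop implementation, and the sample contract text is invented:

```python
import re

def split_paragraphs(text):
    """Split a document on runs of blank lines and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = (
    "1. EFFECTIVE DATE\nThis Agreement is effective as of the date of last signature.\n"
    "\n"
    "2. TERMINATION\nEither party may terminate with thirty days notice."
)
paragraphs = split_paragraphs(doc)
```

Each resulting paragraph can then be fed to the classifier as a separate `clause_text` row.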
## Results ```bash +----------------+ | result| +----------------+ |[effective-date]| |[other]| |[other]| |[effective-date]| +----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_effective_date_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support effective-date 0.97 0.85 0.91 46 other 0.93 0.99 0.96 94 accuracy - - 0.94 140 macro-avg 0.95 0.92 0.93 140 weighted-avg 0.94 0.94 0.94 140 ``` --- layout: model title: Czech Legal Roberta Embeddings author: John Snow Labs name: roberta_base_czech_legal date: 2023-02-16 tags: [cs, czech, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: cs edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-czech-roberta-base` is a Czech model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_czech_legal_cs_4.2.4_3.0_1676561802836.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_czech_legal_cs_4.2.4_3.0_1676561802836.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_czech_legal", "cs")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_czech_legal", "cs") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_czech_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|cs| |Size:|416.0 MB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-czech-roberta-base --- layout: model title: Chinese BertForMaskedLM Cased model (from hfl) author: John Snow Labs name: bert_embeddings_rbt6 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rbt6` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt6_zh_4.2.4_3.0_1670022879408.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt6_zh_4.2.4_3.0_1670022879408.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt6","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt6","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_rbt6| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|224.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/rbt6 - https://arxiv.org/abs/1906.08101 - https://github.com/google-research/bert - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://arxiv.org/abs/2004.13922 --- layout: model title: Sentence Entity Resolver for ICD-10-PCS (Augmented) author: John Snow Labs name: sbiobertresolve_icd10pcs_augmented date: 2022-10-28 tags: [entity_resolution, clinical, en, licensed] task: Entity Resolution language: en nav_key: models edition: Spark NLP for Healthcare 4.2.1 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD-10-PCS codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It is trained on the augmented version of the dataset used in the previous ICD-10-PCS resolver model. 
## Predicted Entities `ICD-10-PCS Codes` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10pcs_augmented_en_4.2.1_3.0_1666966980428.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10pcs_augmented_en_4.2.1_3.0_1666966980428.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(['Procedure', 'Test', 'Test_Result', 'Treatment', 'Pulse', 'Imaging_Technique', 'Labour_Delivery', 'Blood_Pressure', 'Oxygen_Therapy', 'Weight', 'LDL', 'O2_Saturation', 'BMI', 'Vaccine', 'Respiration', 'Temperature', 'Birth_Entity', 'Triglycerides']) chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") icd10pcs_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_icd10pcs_augmented","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10pcs_resolver]) text = [["""Given the severity of her abdominal examination and her persistence of her symptoms, it is detected that need for laparoscopic appendectomy and possible open appendectomy as well as pyeloplasty. 
We recommend performing a mediastinoscopy"""]] data= spark.createDataFrame(text).toDF('text') results = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val ner_converter = NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Procedure", "Test", "Test_Result", "Treatment", "Pulse", "Imaging_Technique", "Labour_Delivery", "Blood_Pressure", "Oxygen_Therapy", "Weight", "LDL", "O2_Saturation", "BMI", "Vaccine", "Respiration", "Temperature", "Birth_Entity", "Triglycerides")) val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val icd10pcs_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_icd10pcs_augmented","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10pcs_resolver)) val data = Seq("Given the severity of her abdominal examination and her persistence of her 
symptoms, it is detected that need for laparoscopic appendectomy and possible open appendectomy as well as pyeloplasty. We recommend performing a mediastinoscopy").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd10pcs_augmented").predict("""Given the severity of her abdominal examination and her persistence of her symptoms, it is detected that need for laparoscopic appendectomy and possible open appendectomy as well as pyeloplasty. We recommend performing a mediastinoscopy""") ```
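ICD-10-PCS codes have a fixed format: seven characters, each drawn from the digits 0-9 and the letters A-H, J-N, P-Z (the letters I and O are excluded to avoid confusion with 1 and 0). A quick plain-Python sanity check for codes returned by the resolver, offered as an illustration rather than part of the pipeline:

```python
# Valid ICD-10-PCS characters: 0-9 plus A-Z minus I and O.
VALID_CHARS = set("0123456789ABCDEFGHJKLMNPQRSTUVWXYZ")

def is_icd10pcs(code):
    """Return True if `code` is a structurally valid 7-character ICD-10-PCS code."""
    return len(code) == 7 and all(ch in VALID_CHARS for ch in code)
```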
## Results ```bash +-------------------------+---------+-------------+------------------------------------------------------------+------------------------------------------------------------+ | ner_chunk| entity|icd10pcs_code| resolutions| all_codes| +-------------------------+---------+-------------+------------------------------------------------------------+------------------------------------------------------------+ | abdominal examination| Test| 2W63XZZ|[traction of abdominal wall [traction of abdominal wall],...|[2W63XZZ, BW40ZZZ, DWY37ZZ, 0WJFXZZ, 2W03X2Z, 0WJF4ZZ, 0W...| |laparoscopic appendectomy|Procedure| 0DTJ8ZZ|[resection of appendix, endo [resection of appendix, endo...|[0DTJ8ZZ, 0DT84ZZ, 0DTJ4ZZ, 0WBH4ZZ, 0DTR4ZZ, 0DBJ8ZZ, 0D...| | open appendectomy|Procedure| 0DBJ0ZZ|[excision of appendix, open approach [excision of appendi...|[0DBJ0ZZ, 0DTJ0ZZ, 0DBA0ZZ, 0D5J0ZZ, 0DB80ZZ, 0DB90ZZ, 04...| | pyeloplasty|Procedure| 0TS84ZZ|[reposition bilateral ureters, perc endo approach [reposi...|[0TS84ZZ, 0TS74ZZ, 069B3ZZ, 06SB3ZZ, 0TR74JZ, 0TQ43ZZ, 04...| | mediastinoscopy|Procedure| BB1CZZZ|[fluoroscopy of mediastinum [fluoroscopy of mediastinum],...|[BB1CZZZ, 0WJC4ZZ, BB4CZZZ, 0WJC3ZZ, 0WHC33Z, 0WHC43Z, 0W...| +-------------------------+---------+-------------+------------------------------------------------------------+------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icd10pcs_augmented| |Compatibility:|Spark NLP for Healthcare 4.2.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[icd10pcs_code]| |Language:|en| |Size:|649.1 MB| |Case sensitive:|false| ## References Trained on ICD-10 Procedure Coding System dataset with sbiobert_base_cased_mli sentence embeddings. 
https://www.icd10data.com/ICD10PCS/Codes --- layout: model title: Part of Speech Tagging - French author: John Snow Labs name: pos_ud_gsd date: 2021-04-29 tags: [pos, parts_of_speech, fr, open_source] supported: true task: Part of Speech Tagging language: fr edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - DET - NOUN - ADJ - AUX - VERB - ADV - ADP - SCONJ - PRON - PUNCT - PROPN - CCONJ - NUM - SYM - X - PART - INTJ {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_fr_3.0.0_3.0_1619656324911.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_fr_3.0.0_3.0_1619656324911.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_gsd", "fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Bonjour de John Snow Labs!']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_gsd", "fr") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Bonjour de John Snow Labs!").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = "Bonjour de John Snow Labs!" pos_df = nlu.load('pos_ud_gsd').predict(text) ```
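The tagger emits exactly one UD tag per token, so the token and tag output arrays can simply be zipped together to rebuild the table in the Results section. A minimal plain-Python sketch using the values from that table:

```python
# Token and tag lists taken from the Results table; zip pairs them positionally.
tokens = ["Bonjour", "de", "John", "Snow", "Labs", "!"]
tags = ["INTJ", "ADP", "PROPN", "PROPN", "PROPN", "PUNCT"]
pairs = list(zip(tokens, tags))
```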
## Results ```bash token_result pos_result 0 Bonjour INTJ 1 de ADP 2 John PROPN 3 Snow PROPN 4 Labs PROPN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_gsd| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|fr| --- layout: model title: Finnish asr_wav2vec2_base_10k_voxpopuli TFWav2Vec2ForCTC from facebook author: John Snow Labs name: pipeline_asr_wav2vec2_base_10k_voxpopuli date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_10k_voxpopuli` is a Finnish model originally trained by facebook. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_10k_voxpopuli_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_10k_voxpopuli_fi_4.2.0_3.0_1664020221628.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_10k_voxpopuli_fi_4.2.0_3.0_1664020221628.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_10k_voxpopuli', lang = 'fi') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_10k_voxpopuli", lang = "fi") val annotations = pipeline.transform(audioDF) ```
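Wav2vec2 models consume mono 16 kHz audio represented as an array of floats. If your source audio is raw signed 16-bit PCM, a minimal normalisation sketch you might apply before building `audioDF` (the function name and input format here are illustrative assumptions, not part of the pipeline API):

```python
# Scale signed 16-bit integer samples into the [-1.0, 1.0) float range
# expected for wav2vec2-style audio input.
def pcm16_to_float(samples):
    return [s / 32768.0 for s in samples]

floats = pcm16_to_float([0, 16384, -32768])
```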
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_10k_voxpopuli| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| |Size:|228.1 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English T5ForConditionalGeneration Base Cased model (from macavaney) author: John Snow Labs name: t5_doc2query_base_msmarco date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `doc2query-t5-base-msmarco` is an English model originally trained by `macavaney`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_doc2query_base_msmarco_en_4.3.0_3.0_1675101186297.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_doc2query_base_msmarco_en_4.3.0_3.0_1675101186297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_doc2query_base_msmarco","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_doc2query_base_msmarco","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
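doc2query models such as this one generate queries that are appended to a document's text before indexing, so lexical retrieval can match vocabulary the document itself never uses (the docTTTTTquery approach in the references). A minimal plain-Python sketch of that expansion step; the generated queries here are hypothetical stand-ins for real T5 output:

```python
# Append model-generated queries to the document text prior to indexing.
def expand(doc, generated_queries):
    return doc + " " + " ".join(generated_queries)

expanded = expand(
    "The Manhattan Project produced the first nuclear weapons.",
    ["who created the first nuclear weapon", "what was the manhattan project"],
)
```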
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_doc2query_base_msmarco| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|474.3 MB| ## References - https://huggingface.co/macavaney/doc2query-t5-base-msmarco - https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/T5-passage/t5-base.zip - https://github.com/terrierteam/pyterrier_doc2query - https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf - https://arxiv.org/abs/2007.14271 --- layout: model title: English T5ForConditionalGeneration Cased model (from dbernsohn) author: John Snow Labs name: t5_wikisql_sql2en date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5_wikisql_SQL2en` is an English model originally trained by `dbernsohn`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_wikisql_sql2en_en_4.3.0_3.0_1675157160888.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_wikisql_sql2en_en_4.3.0_3.0_1675157160888.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_wikisql_sql2en","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_wikisql_sql2en","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_wikisql_sql2en| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|288.2 MB| ## References - https://huggingface.co/dbernsohn/t5_wikisql_SQL2en - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://github.com/DorBernsohn/CodeLM/tree/main/SQLM - https://www.linkedin.com/in/dor-bernsohn-70b2b1146/ --- layout: model title: Legal English Bert Embeddings (Small, Uncased) author: John Snow Labs name: bert_embeddings_legal_bert_small_uncased date: 2022-04-11 tags: [bert, embeddings, en, open_source, legal] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Small version of the Legal Pretrained Bert Embeddings model (uncased), uploaded to Hugging Face, adapted and imported into Spark NLP. `legal-bert-small-uncased` is an English model originally trained by `nlpaueb`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_legal_bert_small_uncased_en_3.4.2_3.0_1649676353340.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_legal_bert_small_uncased_en_3.4.2_3.0_1649676353340.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_legal_bert_small_uncased","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_legal_bert_small_uncased","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.legal_bert_small_uncased").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_legal_bert_small_uncased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|131.9 MB| |Case sensitive:|false| ## References - https://huggingface.co/nlpaueb/legal-bert-small-uncased - https://aclanthology.org/2020.findings-emnlp.261/ - https://eur-lex.europa.eu/ - https://www.legislation.gov.uk/ - https://case.law/ - https://www.sec.gov/edgar.shtml - https://archive.org/details/legal_bert_fp - http://nlp.cs.aueb.gr/ --- layout: model title: Portuguese Bert Embeddings (Small, Cased) author: John Snow Labs name: bert_embeddings_bert_small_gl_cased date: 2022-04-11 tags: [bert, embeddings, pt, open_source] task: Embeddings language: pt edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-small-gl-cased` is a Portuguese model originally trained by `marcosgg`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_small_gl_cased_pt_3.4.2_3.0_1649674068652.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_small_gl_cased_pt_3.4.2_3.0_1649674068652.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_small_gl_cased","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Eu amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_small_gl_cased","pt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Eu amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.embed.bert_small_gl_cased").predict("""Eu amo Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_small_gl_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|pt| |Size:|312.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/marcosgg/bert-small-gl-cased - https://github.com/marcospln/homonymy_acl21 - https://arxiv.org/abs/2106.13553 --- layout: model title: English T5ForConditionalGeneration Base Cased model (from ZhangCheng) author: John Snow Labs name: t5_base_v1.1_fine_tuned_for_question_generation date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `T5v1.1-Base-Fine-Tuned-for-Question-Generation` is an English model originally trained by `ZhangCheng`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_v1.1_fine_tuned_for_question_generation_en_4.3.0_3.0_1675099669115.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_v1.1_fine_tuned_for_question_generation_en_4.3.0_3.0_1675099669115.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_base_v1.1_fine_tuned_for_question_generation","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_base_v1.1_fine_tuned_for_question_generation","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_base_v1.1_fine_tuned_for_question_generation| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|1.0 GB| ## References - https://huggingface.co/ZhangCheng/T5v1.1-Base-Fine-Tuned-for-Question-Generation --- layout: model title: Fast Neural Machine Translation Model from Wallisian to English author: John Snow Labs name: opus_mt_wls_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, wls, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `wls` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_wls_en_xx_2.7.0_2.4_1609165011595.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_wls_en_xx_2.7.0_2.4_1609165011595.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_wls_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_wls_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.wls.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_wls_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Detect PHI for Deidentification (Glove - Subentity) author: John Snow Labs name: ner_deid_subentity_glove_pipeline date: 2023-03-13 tags: [ner, deid, licensed, en, glove, clinical] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_deid_subentity_glove](https://nlp.johnsnowlabs.com/2021/06/06/ner_deid_subentity_glove_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_glove_pipeline_en_4.3.0_3.2_1678734155007.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_glove_pipeline_en_4.3.0_3.2_1678734155007.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_subentity_glove_pipeline", "en", "clinical/models") text = '''Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_subentity_glove_pipeline", "en", "clinical/models") val text = "Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227." val result = pipeline.fullAnnotate(text) ```
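The chunk table in the Results section below can be produced by flattening the `fullAnnotate` output. A minimal sketch, assuming annotations are exposed as dicts with `result`, `begin`, `end`, and a `metadata` map carrying `entity` and `confidence` (the exact annotation objects depend on your Spark NLP version; the mocked input here is illustrative, not real pipeline output):

```python
# Flatten NER chunk annotations into (text, begin, end, label, confidence) rows.
def chunks_to_rows(annotations, chunk_col="ner_chunks"):
    rows = []
    for chunk in annotations.get(chunk_col, []):
        rows.append((
            chunk["result"],                          # the chunk text
            chunk["begin"],                           # start offset in the document
            chunk["end"],                             # end offset in the document
            chunk["metadata"].get("entity"),          # PHI label, e.g. DATE, DOCTOR
            float(chunk["metadata"].get("confidence", 0.0)),
        ))
    return rows

# Mocked fullAnnotate output for the first two chunks of the example text.
mock = {"ner_chunks": [
    {"result": "2093-01-13", "begin": 14, "end": 23,
     "metadata": {"entity": "DATE", "confidence": "1.0"}},
    {"result": "David Hale", "begin": 27, "end": 36,
     "metadata": {"entity": "DOCTOR", "confidence": "0.99"}},
]}

for text, begin, end, label, conf in chunks_to_rows(mock):
    print(f"{text:<15} {begin:>4} {end:>4} {label:<13} {conf:.2f}")
```

From rows like these it is straightforward to build a pandas DataFrame matching the table below.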
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:------------------------------|--------:|------:|:--------------|-------------:| | 0 | 2093-01-13 | 14 | 23 | DATE | 1 | | 1 | David Hale | 27 | 36 | DOCTOR | 0.99 | | 2 | Hendrickson Ora | 55 | 69 | PATIENT | 0.60755 | | 3 | 7194334 | 78 | 84 | MEDICALRECORD | 0.9993 | | 4 | 01/13/93 | 93 | 100 | DATE | 1 | | 5 | Oliveira | 110 | 117 | DOCTOR | 0.9082 | | 6 | 25 | 121 | 122 | AGE | 0.9665 | | 7 | 2079-11-09 | 150 | 159 | DATE | 1 | | 8 | Cocke County Baptist Hospital | 163 | 191 | HOSPITAL | 0.731325 | | 9 | 0295 Keats Street | 195 | 211 | STREET | 0.737067 | | 10 | 302-786-5227 | 221 | 232 | PHONE | 0.9882 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity_glove_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|167.4 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Fast Neural Machine Translation Model from Hindi to English author: John Snow Labs name: opus_mt_hi_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, hi, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `hi` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_hi_en_xx_2.7.0_2.4_1609163929050.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_hi_en_xx_2.7.0_2.4_1609163929050.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_hi_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_hi_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.hi.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_hi_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English T5ForConditionalGeneration Base Cased model (from marksverdhei) author: John Snow Labs name: t5_base_define date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-define` is an English model originally trained by `marksverdhei`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_define_en_4.3.0_3.0_1675108436852.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_define_en_4.3.0_3.0_1675108436852.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_base_define","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_base_define","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_base_define| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|926.2 MB| ## References - https://huggingface.co/marksverdhei/t5-base-define - https://gist.github.com/marksverdhei/0a13f67e65460b71c05fcf558a6a91ae --- layout: model title: Persian BertForQuestionAnswering Cased model (from AlirezaBaneshi) author: John Snow Labs name: bert_qa_testpersianqa date: 2022-07-07 tags: [fa, open_source, bert, question_answering] task: Question Answering language: fa edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `testPersianQA` is a Persian model originally trained by `AlirezaBaneshi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_testpersianqa_fa_4.0.0_3.0_1657192915259.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_testpersianqa_fa_4.0.0_3.0_1657192915259.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_testpersianqa","fa") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["اسم من چیست؟", "نام من کلارا است و من در برکلی زندگی می کنم."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_testpersianqa","fa") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("اسم من چیست؟", "نام من کلارا است و من در برکلی زندگی می کنم.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_testpersianqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|fa| |Size:|607.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AlirezaBaneshi/testPersianQA --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_declutr_emanuals_techqa date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `declutr-emanuals-techqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_declutr_emanuals_techqa_en_4.0.0_3.0_1655728155165.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_declutr_emanuals_techqa_en_4.0.0_3.0_1655728155165.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_declutr_emanuals_techqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_declutr_emanuals_techqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.techqa_declutr_emanuals.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
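Under the hood, a span-classification QA head like this one scores every token as a potential answer start and end, then picks the best valid pair. A toy sketch of that decoding step (the logits below are made up for illustration, not model output):

```python
# Pick the (start, end) token pair with the highest combined logit score,
# requiring start <= end and a bounded span length.
def best_span(start_logits, end_logits, max_len=30):
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 1.0, 0.0]
end_logits   = [0.0, 0.1, 0.0, 4.5, 0.2, 0.0, 0.0, 0.1, 1.2, 0.0]

s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```

The annotator performs this decoding internally; the sketch only shows why the `answer` column contains a contiguous substring of the context.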
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_declutr_emanuals_techqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|466.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/declutr-emanuals-techqa --- layout: model title: English T5ForConditionalGeneration Base Cased model (from google) author: John Snow Labs name: t5_efficient_base_ff2000 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-ff2000` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_ff2000_en_4.3.0_3.0_1675111887886.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_ff2000_en_4.3.0_3.0_1675111887886.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_base_ff2000","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_base_ff2000","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_base_ff2000| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|402.3 MB| ## References - https://huggingface.co/google/t5-efficient-base-ff2000 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English LongformerForQuestionAnswering model (from ponmari) author: John Snow Labs name: longformer_qa_ponmari date: 2022-06-26 tags: [en, open_source, longformer, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: LongformerForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Question-Answering` is an English model originally trained by `ponmari`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/longformer_qa_ponmari_en_4.0.0_3.0_1656255321712.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/longformer_qa_ponmari_en_4.0.0_3.0_1656255321712.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_ponmari","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_ponmari","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.longformer").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|longformer_qa_ponmari| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|526.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ponmari/Question-Answering --- layout: model title: English asr_wav2vec2_base_timit_demo_by_patrickvonplaten TFWav2Vec2ForCTC from patrickvonplaten author: John Snow Labs name: asr_wav2vec2_base_timit_demo_by_patrickvonplaten date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_by_patrickvonplaten` is an English model originally trained by patrickvonplaten. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_by_patrickvonplaten_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_by_patrickvonplaten_en_4.2.0_3.0_1664025392086.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_by_patrickvonplaten_en_4.2.0_3.0_1664025392086.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_by_patrickvonplaten", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_by_patrickvonplaten", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
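The examples above assume an `audioDf` whose `audio_content` column holds arrays of floats. A minimal, stdlib-only sketch of decoding 16-bit PCM WAV bytes into normalized floats in [-1.0, 1.0]; in a real job you would wrap the resulting list with `spark.createDataFrame(...)`, and the exact schema `AudioAssembler` expects should be checked against your Spark NLP version:

```python
import io
import struct
import wave

def wav_to_floats(wav_bytes):
    """Decode mono 16-bit PCM WAV bytes into floats normalized to [-1.0, 1.0]."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        assert wf.getsampwidth() == 2, "sketch handles 16-bit PCM only"
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a tiny in-memory WAV to demonstrate the round trip.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)  # Wav2Vec2 models expect 16 kHz audio
    wf.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))

floats = wav_to_floats(buf.getvalue())
print(floats[:3])  # first samples: 0.0, 0.5, -0.5
```

For production workloads a dedicated audio library (e.g. librosa or soundfile) is the usual choice; this sketch only shows the float-array shape the pipeline consumes.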
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_by_patrickvonplaten| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|349.3 MB| --- layout: model title: Swedish asr_wav2vec2_large_xlsr_swedish TFWav2Vec2ForCTC from marma author: John Snow Labs name: asr_wav2vec2_large_xlsr_swedish date: 2022-09-25 tags: [wav2vec2, sv, audio, open_source, asr] task: Automatic Speech Recognition language: sv edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_swedish` is a Swedish model originally trained by marma. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_swedish_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_swedish_sv_4.2.0_3.0_1664118671142.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_swedish_sv_4.2.0_3.0_1664118671142.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_swedish", "sv")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_swedish", "sv") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_swedish| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|sv| |Size:|756.1 MB| --- layout: model title: Sentence Entity Resolver for Snomed Concepts, Body Structure Version author: John Snow Labs name: sbertresolve_snomed_bodyStructure_med date: 2021-06-15 tags: [snomed, en, entity_resolution, licensed] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical (anatomical structures) entities to Snomed codes (body structure version) using sentence embeddings. ## Predicted Entities Snomed Codes and their normalized definition with `sbert_jsl_medium_uncased` embeddings. {:.btn-box} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbertresolve_snomed_bodyStructure_med_en_3.1.0_3.0_1623766196031.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbertresolve_snomed_bodyStructure_med_en_3.1.0_3.0_1623766196031.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") jsl_sbert_embedder = BertSentenceEmbeddings.pretrained('sbert_jsl_medium_uncased','en','clinical/models')\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") snomed_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_snomed_bodyStructure_med", "en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("snomed_code") snomed_pipelineModel = PipelineModel( stages = [ documentAssembler, jsl_sbert_embedder, snomed_resolver]) snomed_lp = LightPipeline(snomed_pipelineModel) result = snomed_lp.fullAnnotate("Amputation stump") ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbert_jsl_medium_uncased","en","clinical/models") .setInputCols("ner_chunk") .setOutputCol("sbert_embeddings") val snomed_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_snomed_bodyStructure_med", "en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("snomed_code") val snomed_pipeline = new Pipeline().setStages(Array(document_assembler, sbert_embedder, snomed_resolver)) val snomed_pipelineModel = snomed_pipeline.fit(Seq("").toDF("text")) val snomed_lp = new LightPipeline(snomed_pipelineModel) val result = snomed_lp.fullAnnotate("Amputation stump") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.snomed_body_structure_med").predict("""Amputation stump""") ```
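Under the hood, a sentence-entity resolver ranks candidate codes by embedding distance to the input chunk: the closest code wins, and the remaining candidates are reported in ascending distance order. A toy sketch of that ranking with made-up low-dimensional vectors (not the model's actual embeddings or codes):

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def rank_candidates(query_vec, candidates):
    # candidates: {code: embedding}; return codes sorted by ascending distance
    scored = [(cosine_distance(query_vec, vec), code) for code, vec in candidates.items()]
    return [code for _, code in sorted(scored)]

# Illustrative vectors only: an identical vector gets distance 0.0,
# mirroring the exact match in the resolver's output.
query = [1.0, 0.0, 0.0]
candidates = {
    "38033009": [1.0, 0.0, 0.0],
    "771359009": [0.9, 0.1, 0.0],
    "771364008": [0.5, 0.5, 0.0],
}
print(rank_candidates(query, candidates))  # closest code first
```

The `all_codes`/`all_distances` columns in the results section reflect exactly this ordering.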
## Results ```bash | | chunks | code | resolutions | all_codes | all_distances | |---:|:-----------------|:---------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------| | 0 | amputation stump | 38033009 | [Amputation stump, Amputation stump of upper limb, Amputation stump of left upper limb, Amputation stump of lower limb, Amputation stump of left lower limb, Amputation stump of right upper limb, Amputation stump of right lower limb, ...]| ['38033009', '771359009', '771364008', '771358001', '771367001', '771365009', '771368006', ...] | ['0.0000', '0.0773', '0.0858', '0.0863', '0.0905', '0.0911', '0.0972', ...] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbertresolve_snomed_bodyStructure_med| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[snomed_code]| |Language:|en| |Case sensitive:|true| ## Data Source https://www.snomed.org/ --- layout: model title: Multilingual DistilBertForQuestionAnswering Cased model (from mrm8488) author: John Snow Labs name: distilbert_qa_finedtuned_squad date: 2023-01-03 tags: [xx, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`distilbert-multi-finedtuned-squad-pt` is a Multilingual model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_finedtuned_squad_xx_4.3.0_3.0_1672774117656.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_finedtuned_squad_xx_4.3.0_3.0_1672774117656.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_finedtuned_squad","xx")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_finedtuned_squad","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_finedtuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|505.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/distilbert-multi-finedtuned-squad-pt --- layout: model title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011) author: John Snow Labs name: distilbert_token_classifier_autotrain_company_all_903429540 date: 2023-03-03 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-company_all-903429540` is an English model originally trained by `ismail-lucifer011`. ## Predicted Entities `Company`, `OOV` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_company_all_903429540_en_4.3.1_3.0_1677881697961.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_company_all_903429540_en_4.3.1_3.0_1677881697961.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_company_all_903429540","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_company_all_903429540","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_company_all_903429540| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ismail-lucifer011/autotrain-company_all-903429540 --- layout: model title: Finnish asr_wav2vec2_large_xlsr_53_finnish_by_aapot TFWav2Vec2ForCTC from aapot author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_aapot date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_finnish_by_aapot` is a Finnish model originally trained by aapot. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_aapot_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_aapot_fi_4.2.0_3.0_1664019288191.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_aapot_fi_4.2.0_3.0_1664019288191.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_aapot', lang = 'fi') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_aapot", lang = "fi") val annotations = pipeline.transform(audioDF) ```
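The snippet above assumes an `audioDF` whose audio column holds raw samples as arrays of floats at 16 kHz, the rate Wav2Vec2 models expect. A minimal standard-library sketch of preparing such data (the synthetic tone and file name are illustrative, not part of the pipeline):

```python
import math
import struct
import wave

SAMPLE_RATE = 16_000  # Wav2Vec2 models expect 16 kHz mono audio

# Synthesize a 0.1 s, 440 Hz tone and write it as 16-bit PCM WAV
# (a stand-in for a real recording).
n_samples = SAMPLE_RATE // 10
pcm = [int(32767 * 0.5 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE))
       for t in range(n_samples)]
with wave.open("tone.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(SAMPLE_RATE)
    f.writeframes(struct.pack("<%dh" % n_samples, *pcm))

# Read it back and normalize to floats in [-1.0, 1.0], the shape of data
# an audio DataFrame column typically carries.
with wave.open("tone.wav", "rb") as f:
    raw = f.readframes(f.getnframes())
floats = [s / 32768.0 for s in struct.unpack("<%dh" % (len(raw) // 2), raw)]
print(len(floats), min(floats), max(floats))
```

With a Spark session in scope, something like `spark.createDataFrame([(floats,)], ["audio_content"])` would then give `audioDF` its input (the column name here is an assumption carried over from the sibling model cards).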
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_aapot| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Lemmatizer (Croatian, SpacyLookup) author: John Snow Labs name: lemma_spacylookup date: 2022-03-03 tags: [open_source, lemmatizer, hr] task: Lemmatization language: hr edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Croatian Lemmatizer is a scalable, production-ready version of the Rule-based Lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_hr_3.4.1_3.0_1646316576267.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_hr_3.4.1_3.0_1646316576267.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","hr") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Nisi bolji od mene"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","hr") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("Nisi bolji od mene").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("hr.lemma").predict("""Nisi bolji od mene""") ```
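Lookup-based lemmatizers like this one are table-driven: each token is matched against a form-to-lemma dictionary, and unseen forms pass through unchanged. A toy sketch of that behavior (the two Croatian entries are illustrative, not the model's full lookup table):

```python
# Toy form -> lemma table; the real model ships a full Croatian lookup table.
LEMMA_TABLE = {
    "bolji": "dobar",  # "better" -> "good"
    "mene": "ja",      # "me" -> "I"
}

def lemmatize(tokens):
    # Look each token up; unseen forms fall back to themselves.
    return [LEMMA_TABLE.get(tok, tok) for tok in tokens]

print(lemmatize(["Nisi", "bolji", "od", "mene"]))
# -> ['Nisi', 'dobar', 'od', 'ja']
```

This matches the lemmas the pretrained model produces for the example sentence in the results section.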
## Results ```bash +---------------------+ |result | +---------------------+ |[Nisi, dobar, od, ja]| +---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma_spacylookup| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|hr| |Size:|15.5 MB| --- layout: model title: Chinese BERT Base author: John Snow Labs name: bert_base_chinese date: 2021-05-20 tags: [zh, chinese, bert, embeddings, open_source] task: Embeddings language: zh edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture. It was originally published by Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", 2018. The weights of this model are those released by the original BERT authors. This model has been pre-trained for Chinese on Wikipedia. For training, random input masking has been applied independently to word pieces (as in the original BERT paper). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_chinese_zh_3.1.0_2.4_1621517505756.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_chinese_zh_3.1.0_2.4_1621517505756.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_base_chinese", "zh") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_base_chinese", "zh") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_chinese| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[embeddings]| |Language:|zh| |Case sensitive:|true| ## Data Source [https://huggingface.co/bert-base-chinese](https://huggingface.co/bert-base-chinese) --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_8 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-32-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_8_en_4.0.0_3.0_1655732774400.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_8_en_4.0.0_3.0_1655732774400.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_8","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_8","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_32d_seed_8").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_8| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|417.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-32-finetuned-squad-seed-8 --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from shivkumarganesh) author: John Snow Labs name: distilbert_qa_shivkumarganesh_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `shivkumarganesh`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_shivkumarganesh_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772704308.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_shivkumarganesh_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772704308.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_shivkumarganesh_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_shivkumarganesh_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_shivkumarganesh_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/shivkumarganesh/distilbert-base-uncased-finetuned-squad --- layout: model title: Sentence Embeddings - sbert tiny (tuned) author: John Snow Labs name: sbert_jsl_tiny_uncased date: 2021-06-30 tags: [embeddings, clinical, licensed, en] task: Embeddings language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 2.4 supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained to generate contextual sentence embeddings of input sentences. {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_tiny_uncased_en_3.1.0_2.4_1625050227188.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_tiny_uncased_en_3.1.0_2.4_1625050227188.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sbiobert_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_tiny_uncased","en","clinical/models").setInputCols(["sentence"]).setOutputCol("sbert_embeddings") ``` ```scala val sbiobert_embeddings = BertSentenceEmbeddings .pretrained("sbert_jsl_tiny_uncased","en","clinical/models") .setInputCols(Array("sentence")) .setOutputCol("sbert_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed_sentence.bert.jsl_tiny_uncased").predict("""Put your text here.""") ```
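Sentence embedders collapse per-token representations into one fixed-size vector per sentence; mean pooling is one common strategy for this step. A toy sketch of component-wise averaging (illustrative only, not necessarily this model's exact pooling):

```python
def mean_pool(token_vectors):
    # Average token vectors component-wise into one sentence vector.
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Three made-up 2-dimensional token vectors collapse to one sentence vector.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(tokens))  # -> [3.0, 4.0]
```

The resulting sentence vector is what downstream annotators such as `SentenceEntityResolverModel` consume.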
## Results ```bash Gives a 768 dimensional vector representation of the sentence. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbert_jsl_tiny_uncased| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Case sensitive:|false| ## Data Source Tuned on MedNLI dataset ## Benchmarking ```bash MedNLI Score Acc 0.625 STS(cos) 0.682 ``` --- layout: model title: Smaller BERT Embeddings (L-8_H-256_A-4) author: John Snow Labs name: small_bert_L8_256 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L8_256_en_2.6.0_2.4_1598344454830.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L8_256_en_2.6.0_2.4_1598344454830.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L8_256", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L8_256", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L8_256').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash token en_embed_bert_small_L8_256_embeddings I [-0.7437451481819153, 0.2771262526512146, 0.34... love [-0.08792433142662048, 0.33525994420051575, 0.... NLP [0.15274208784103394, -0.5481979846954346, -1.... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|small_bert_L8_256| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|256| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-256_A-4/1 --- layout: model title: English AlbertForQuestionAnswering model (from armageddon) author: John Snow Labs name: albert_qa_xxlarge_v2_squad2_covid_deepset date: 2022-06-24 tags: [en, open_source, albert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: AlBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `albert-xxlarge-v2-squad2-covid-qa-deepset` is an English model originally trained by `armageddon`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_xxlarge_v2_squad2_covid_deepset_en_4.0.0_3.0_1656064084373.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_xxlarge_v2_squad2_covid_deepset_en_4.0.0_3.0_1656064084373.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_xxlarge_v2_squad2_covid_deepset","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_xxlarge_v2_squad2_covid_deepset","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_covid.albert.xxl_v2").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_qa_xxlarge_v2_squad2_covid_deepset| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|771.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/armageddon/albert-xxlarge-v2-squad2-covid-qa-deepset --- layout: model title: Pipeline to Detect PHI for Deidentification (BertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_deid_pipeline date: 2023-03-20 tags: [licensed, berfortokenclassification, deid, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [bert_token_classifier_ner_deid](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_deid_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_deid_pipeline_en_4.3.0_3.2_1679307104702.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_deid_pipeline_en_4.3.0_3.2_1679307104702.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_deid_pipeline", "en", "clinical/models") text = '''A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_deid_pipeline", "en", "clinical/models") val text = "A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_deid.pipeline").predict("""A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""") ```
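A typical next step after de-identification NER is to mask the detected PHI in the source text using the character offsets the pipeline returns. This post-processing is not part of the pretrained pipeline itself; the `mask_phi` helper below is a minimal illustrative sketch that assumes chunks arrive as `(begin, end, label)` tuples with inclusive end offsets, matching the pipeline's output format:

```python
def mask_phi(text, chunks):
    """Replace each detected PHI chunk with <LABEL>, working right-to-left
    so that earlier character offsets remain valid after each replacement."""
    for begin, end, label in sorted(chunks, reverse=True):
        text = text[:begin] + "<" + label + ">" + text[end + 1:]  # end is inclusive
    return text

text = "Record date : 2093-01-13, David Hale, M.D."
chunks = [(14, 23, "DATE"), (26, 35, "DOCTOR")]
print(mask_phi(text, chunks))  # -> Record date : <DATE>, <DOCTOR>, M.D.
```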
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:------------------------------|--------:|------:|:--------------|-------------:| | 0 | 2093-01-13 | 17 | 26 | DATE | 0.957256 | | 1 | David Hale | 29 | 38 | DOCTOR | 0.983641 | | 2 | Hendrickson, Ora | 53 | 68 | PATIENT | 0.992943 | | 3 | 7194334 | 76 | 82 | MEDICALRECORD | 0.999349 | | 4 | Oliveira | 91 | 98 | DOCTOR | 0.763455 | | 5 | Cocke County Baptist Hospital | 114 | 142 | HOSPITAL | 0.999558 | | 6 | 0295 Keats Street | 145 | 161 | STREET | 0.997889 | | 7 | 302) 786-5227 | 174 | 186 | PHONE | 0.970114 | | 8 | Brothers Coal-Mine | 253 | 270 | ORGANIZATION | 0.998911 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_deid_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|405.0 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Universal Sentence Encoder Multilingual author: John Snow Labs name: tfhub_use_multi date: 2020-12-08 task: Embeddings language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [xx, embeddings, open_source] deprecated: true annotator: UniversalSentenceEncoder article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks. The model is trained and optimized for greater-than-word length text, such as sentences, phrases, or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is the variable-length text and the output is a 512-dimensional vector. 
The universal-sentence-encoder model was trained with a deep averaging network (DAN) encoder. It encodes text in 16 languages (Arabic, Chinese-simplified, Chinese-traditional, English, French, German, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish, Thai, Turkish, Russian). The details are described in the paper "[Multilingual Universal Sentence Encoder for Semantic Retrieval](https://arxiv.org/abs/1907.04307)". Note: This model only works on Linux and macOS operating systems and is not compatible with Windows due to the incompatibility of the SentencePiece library. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/tfhub_use_multi_xx_2.7.0_2.4_1607427221245.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/tfhub_use_multi_xx_2.7.0_2.4_1607427221245.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_multi", "xx") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP'], ['Me encanta usar SparkNLP']], ["text"])) ``` ```scala ... val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_multi", "xx") .setInputCols("document") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I love NLP", "Me encanta usar SparkNLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP", "Me encanta usar SparkNLP"] embeddings_df = nlu.load('xx.use.multi').predict(text, output_level='sentence') embeddings_df ```
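Because every sentence is mapped to a fixed 512-dimensional vector, cross-lingual semantic similarity reduces to comparing vectors, most commonly with cosine similarity. A minimal pure-Python sketch with toy low-dimensional vectors standing in for real embeddings (the actual model outputs 512 dimensions):

```python
import math

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their norms.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy vectors standing in for sentence embeddings.
en = [0.1, -0.3, 0.2, 0.4]
es = [0.12, -0.28, 0.18, 0.41]   # a close paraphrase in another language
other = [-0.4, 0.1, -0.3, 0.05]  # an unrelated sentence
print(cosine_similarity(en, es) > cosine_similarity(en, other))  # -> True
```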
## Results It outputs a 512-dimensional vector for each sentence. ```bash xx_use_multi_embeddings sentence 0 [-0.07108945399522781, 0.034001532942056656, 0... I love NLP 1 [-0.029642866924405098, -0.027465445920825005,... Me encanta usar SparkNLP ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|tfhub_use_multi| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|xx| ## Data Source This embeddings model is imported from [https://tfhub.dev/google/universal-sentence-encoder-multilingual/3](https://tfhub.dev/google/universal-sentence-encoder-multilingual/3) ## Benchmarking - We apply this model to the STS benchmark for semantic similarity. The evaluation can be seen in the [example notebook](https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb). Results are shown below: ```bash STSBenchmark | dev | test | -----------------------------------|--------|-------| Correlation coefficient of Pearson | 0.829 | 0.809 | ``` - For semantic similarity retrieval, we evaluate the model on the [Quora and AskUbuntu retrieval tasks](https://arxiv.org/abs/1811.08008). Results are shown below: ```bash Dataset | Quora | AskUbuntu | Average | -----------------------|-------|-----------|---------| Mean Average Precision | 89.2 | 39.9 | 64.6 | ``` - For translation pair retrieval, we evaluate the model on the United Nations Parallel Corpus. 
Results are shown below: ```bash Language Pair | en-es | en-fr | en-ru | en-zh | ---------------|--------|-------|-------|-------| Precision@1 | 85.8 | 82.7 | 87.4 | 79.5 | ``` --- layout: model title: Legal Political Framework Document Classifier (EURLEX) author: John Snow Labs name: legclf_political_framework_bert date: 2023-03-06 tags: [en, legal, classification, clauses, political_framework, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The `legclf_political_framework_bert` model is a Bert Sentence Embeddings Document Classifier that classifies whether a given document belongs to the class `Political_Framework` or not (binary classification), according to EuroVoc labels. ## Predicted Entities `Political_Framework`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_political_framework_bert_en_1.0.0_3.0_1678111675759.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_political_framework_bert_en_1.0.0_3.0_1678111675759.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_political_framework_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Political_Framework]| |[Other]| |[Other]| |[Political_Framework]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_political_framework_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.82 0.86 0.84 42 Political_Framework 0.83 0.78 0.81 37 accuracy - - 0.82 79 macro-avg 0.82 0.82 0.82 79 weighted-avg 0.82 0.82 0.82 79 ``` --- layout: model title: Translate Malayalam to English Pipeline author: John Snow Labs name: translate_ml_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ml, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `ml` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ml_en_xx_2.7.0_2.4_1609687691628.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ml_en_xx_2.7.0_2.4_1609687691628.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ml_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ml_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ml.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ml_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: BERT Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on SQuAD 2.0 author: John Snow Labs name: bert_wiki_books_squad2 date: 2021-08-30 tags: [en, open_source, bert_embeddings, squad_2_dataset, wikipedia_dataset, books_corpus_dataset] task: Embeddings language: en nav_key: models edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses a BERT base architecture initialized from https://tfhub.dev/google/experts/bert/wiki_books/1 and fine-tuned on SQuAD 2.0 This is a BERT base architecture but some changes have been made to the original training and export scheme based on more recent learnings. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_wiki_books_squad2_en_3.2.0_3.0_1630328938565.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_wiki_books_squad2_en_3.2.0_3.0_1630328938565.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_wiki_books_squad2", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_wiki_books_squad2", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.wiki_books_squad2').predict(text, output_level='token') embeddings_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_wiki_books_squad2| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Case sensitive:|false| ## Data Source [1]: [Wikipedia dataset](https://dumps.wikimedia.org/) [2]: [BooksCorpus dataset](http://yknzhu.wixsite.com/mbweb) [3]: [Stanford Queston Answering (SQuAD 2.0) dataset](https://rajpurkar.github.io/SQuAD-explorer/) This Model has been imported from: https://tfhub.dev/google/experts/bert/wiki_books/squad2/2 --- layout: model title: Part of Speech for Yoruba author: John Snow Labs name: pos_ud_ytb date: 2021-03-09 tags: [part_of_speech, open_source, yoruba, pos_ud_ytb, yo] task: Part of Speech Tagging language: yo edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - ADP - NOUN - DET - VERB - CCONJ - PUNCT - PRON - ADJ - AUX - SCONJ - ADV - NUM - PART - PROPN - X {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_ytb_yo_3.0.0_3.0_1615292243232.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_ytb_yo_3.0.0_3.0_1615292243232.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_ytb", "yo") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Kaabo lati awọn laanu snown Johan! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_ytb", "yo") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Kaabo lati awọn laanu snown Johan! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Kaabo lati awọn laanu snown Johan! "] token_df = nlu.load('yo.pos').predict(text) token_df ```
## Results ```bash token pos 0 Kaabo NOUN 1 lati VERB 2 awọn NOUN 3 laanu ADP 4 snown VERB 5 Johan PROPN 6 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_ytb| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|yo| --- layout: model title: Translate Multiple languages to English Pipeline author: John Snow Labs name: translate_mul_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, mul, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `mul` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_mul_en_xx_2.7.0_2.4_1609688751635.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_mul_en_xx_2.7.0_2.4_1609688751635.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_mul_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_mul_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.mul.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_mul_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Chinese Bert Embeddings (from genggui001) author: John Snow Labs name: bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `chinese_roberta_wwm_large_ext_fix_mlm` is a Chinese model originally trained by `genggui001`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm_zh_3.4.2_3.0_1649670762047.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm_zh_3.4.2_3.0_1649670762047.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.chinese_roberta_wwm_large_ext_fix_mlm").predict("""I love Spark NLP""") ```
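`BertEmbeddings` emits one vector per token; when a single sentence-level vector is needed (for example for similarity or clustering), a common aggregation is element-wise mean pooling over the token vectors. A minimal illustrative sketch with toy 3-dimensional vectors (the real model's vectors are much larger); `mean_pool` is a hypothetical helper, not a Spark NLP API:

```python
def mean_pool(token_vectors):
    """Average token embeddings element-wise into one sentence vector."""
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n
            for i in range(len(token_vectors[0]))]

# Three toy token vectors standing in for BERT token embeddings.
tokens = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0]]
print(mean_pool(tokens))  # -> [2.0, 2.0, 2.0]
```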
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|1.2 GB| |Case sensitive:|true| ## References - https://huggingface.co/genggui001/chinese_roberta_wwm_large_ext_fix_mlm - https://github.com/ymcui/Chinese-BERT-wwm/issues/98 - https://github.com/genggui001/chinese_roberta_wwm_large_ext_fix_mlm --- layout: model title: Pipeline to Detect Clinical Entities (ner_jsl_greedy) author: John Snow Labs name: ner_jsl_greedy_pipeline date: 2023-03-14 tags: [ner, en, licensed, clinical] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_jsl_greedy](https://nlp.johnsnowlabs.com/2021/06/24/ner_jsl_greedy_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_greedy_pipeline_en_4.3.0_3.2_1678780756612.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_greedy_pipeline_en_4.3.0_3.2_1678780756612.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_jsl_greedy_pipeline", "en", "clinical/models") text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_jsl_greedy_pipeline", "en", "clinical/models") val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. 
The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl_greedy.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-----------------------------------------------|--------:|------:|:-----------------------------|-------------:| | 0 | 21-day-old | 17 | 26 | Age | 0.9817 | | 1 | Caucasian | 28 | 36 | Race_Ethnicity | 0.9998 | | 2 | male | 38 | 41 | Gender | 0.9922 | | 3 | for 2 days | 48 | 57 | Duration | 0.6968 | | 4 | congestion | 62 | 71 | Symptom | 0.875 | | 5 | mom | 75 | 77 | Gender | 0.8156 | | 6 | suctioning yellow discharge | 88 | 114 | Symptom | 0.2697 | | 7 | nares | 135 | 139 | External_body_part_or_region | 0.6216 | | 8 | she | 147 | 149 | Gender | 0.9965 | | 9 | mild problems with his breathing while feeding | 168 | 213 | Symptom | 0.444029 | | 10 | perioral cyanosis | 237 | 253 | Symptom | 0.3283 | | 11 | retractions | 258 | 268 | Symptom | 0.957 | | 12 | One day ago | 272 | 282 | RelativeDate | 0.646267 | | 13 | mom | 285 | 287 | Gender | 0.692 | | 14 | tactile temperature | 304 | 322 | Symptom | 0.20765 | | 15 | Tylenol | 345 | 351 | Drug | 0.9951 | | 16 | Baby | 354 | 357 | Age | 0.981 | | 17 | decreased p.o. intake | 377 | 397 | Symptom | 0.437375 | | 18 | His | 400 | 402 | Gender | 0.999 | | 19 | 20 minutes | 439 | 448 | Duration | 0.20415 | | 20 | q.2h. 
| 450 | 454 | Frequency | 0.6406 | | 21 | to 5 to 10 minutes | 456 | 473 | Duration | 0.12444 | | 22 | his | 488 | 490 | Gender | 0.9904 | | 23 | respiratory congestion | 492 | 513 | Symptom | 0.5294 | | 24 | He | 516 | 517 | Gender | 0.9989 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_greedy_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: XLM-RoBERTa Base NER Pipeline author: John Snow Labs name: xlm_roberta_base_token_classifier_ontonotes_pipeline date: 2022-06-19 tags: [open_source, ner, token_classifier, xlm_roberta, ontonotes, xlm, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [xlm_roberta_base_token_classifier_ontonotes](https://nlp.johnsnowlabs.com/2021/10/03/xlm_roberta_base_token_classifier_ontonotes_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_base_token_classifier_ontonotes_pipeline_en_4.0.0_3.0_1655654146234.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_base_token_classifier_ontonotes_pipeline_en_4.0.0_3.0_1655654146234.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("xlm_roberta_base_token_classifier_ontonotes_pipeline", lang = "en") pipeline.annotate("My name is John and I have been working at John Snow Labs since November 2020.") ``` ```scala val pipeline = new PretrainedPipeline("xlm_roberta_base_token_classifier_ontonotes_pipeline", lang = "en") pipeline.annotate("My name is John and I have been working at John Snow Labs since November 2020.") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PERSON | |John Snow Labs|ORG | |November 2020 |DATE | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_base_token_classifier_ontonotes_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|858.4 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - XlmRoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: English BertForQuestionAnswering model (from kamilali) author: John Snow Labs name: bert_qa_distilbert_base_uncased_finetuned_custom date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-custom` is an English model originally trained by `kamilali`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_distilbert_base_uncased_finetuned_custom_en_4.0.0_3.0_1654187453913.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_distilbert_base_uncased_finetuned_custom_en_4.0.0_3.0_1654187453913.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_distilbert_base_uncased_finetuned_custom","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_distilbert_base_uncased_finetuned_custom","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.distilled_base_uncased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
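Under the hood, extractive QA annotators such as `BertForQuestionAnswering` score every start and end token position in the context and return the highest-scoring valid span. A minimal, framework-independent sketch of that span selection (plain Python; the logits below are illustrative toy numbers, not output from this model):

```python
def best_span(start_logits, end_logits, max_len=30):
    """Return (start, end) maximizing start_logits[s] + end_logits[e] with s <= e."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        # Only consider spans that begin at s and stay within max_len tokens
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Toy logits over 6 context tokens, e.g. "My name is Clara and Berkeley"
start = [0.1, 0.2, 0.1, 2.5, 0.0, 0.3]
end   = [0.0, 0.1, 0.2, 2.0, 0.1, 0.4]
print(best_span(start, end))  # -> (3, 3): the single-token span "Clara"
```

The real annotator performs the same argmax over the model's logits and maps the winning token span back to the answer text in the `answer` output column.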
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_distilbert_base_uncased_finetuned_custom| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/kamilali/distilbert-base-uncased-finetuned-custom --- layout: model title: Korean BertForQuestionAnswering model (from obokkkk) author: John Snow Labs name: bert_qa_bert_base_multilingual_cased_finetuned_klue date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: ko edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased-finetuned-klue` is a Korean model originally trained by `obokkkk`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_finetuned_klue_ko_4.0.0_3.0_1654180066545.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_finetuned_klue_ko_4.0.0_3.0_1654180066545.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_multilingual_cased_finetuned_klue","ko") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_multilingual_cased_finetuned_klue","ko") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("ko.answer_question.klue.bert.multilingual_base_cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_multilingual_cased_finetuned_klue| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ko| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/obokkkk/bert-base-multilingual-cased-finetuned-klue --- layout: model title: English BertForQuestionAnswering model (from ericRosello) author: John Snow Labs name: bert_qa_bert_base_uncased_finetuned_squad_frozen_v2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-finetuned-squad-frozen-v2` is an English model originally trained by `ericRosello`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_finetuned_squad_frozen_v2_en_4.0.0_3.0_1654181177113.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_finetuned_squad_frozen_v2_en_4.0.0_3.0_1654181177113.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_finetuned_squad_frozen_v2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_finetuned_squad_frozen_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased_v2.by_ericRosello").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_finetuned_squad_frozen_v2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/ericRosello/bert-base-uncased-finetuned-squad-frozen-v2 --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_news_pretrain_ft_new_news date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `news_pretrain_roberta_FT_new_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_news_pretrain_ft_new_news_en_4.3.0_3.0_1674211658822.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_news_pretrain_ft_new_news_en_4.3.0_3.0_1674211658822.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_news_pretrain_ft_new_news","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_news_pretrain_ft_new_news","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_news_pretrain_ft_new_news| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|467.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/news_pretrain_roberta_FT_new_newsqa --- layout: model title: Spanish DistilBERT Embeddings (from Recognai) author: John Snow Labs name: distilbert_embeddings_distilbert_base_es_multilingual_cased date: 2022-04-12 tags: [distilbert, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-es-multilingual-cased` is a Spanish model originally trained by `Recognai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_es_multilingual_cased_es_3.4.2_3.0_1649783303797.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_es_multilingual_cased_es_3.4.2_3.0_1649783303797.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_es_multilingual_cased","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_es_multilingual_cased","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.distilbert_base_es_multilingual_cased").predict("""Me encanta chispa nlp""") ```
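The token vectors in the `embeddings` output column are usually compared with cosine similarity for tasks like semantic search or clustering. A small self-contained sketch of that comparison (the 4-dimensional vectors below are toy values, not actual model output):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 4-dimensional "embeddings" for two tokens
u = [0.2, 0.1, 0.4, 0.3]
v = [0.1, 0.0, 0.5, 0.2]
print(round(cosine(u, v), 3))  # -> 0.933
```

Values close to 1.0 indicate near-identical directions in embedding space; real DistilBERT vectors have 768 dimensions but the formula is unchanged.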
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_base_es_multilingual_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|237.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/Recognai/distilbert-base-es-multilingual-cased - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Detect PHI for Deidentification in Romanian (BERT) author: John Snow Labs name: ner_deid_subentity_bert date: 2022-06-27 tags: [deidentification, bert, phi, ner, ro, licensed] task: Named Entity Recognition language: ro edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow a generic model to be trained using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Romanian) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It is trained with `bert_base_cased` embeddings and can detect 17 entities. This NER model is trained on a combination of custom datasets with several data augmentation mechanisms.
## Predicted Entities `AGE`, `CITY`, `COUNTRY`, `DATE`, `DOCTOR`, `EMAIL`, `FAX`, `HOSPITAL`, `IDNUM`, `LOCATION-OTHER`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `PROFESSION`, `STREET`, `ZIP` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_bert_ro_4.0.0_3.0_1656311815383.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_bert_ro_4.0.0_3.0_1656311815383.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")\ .setInputCols(["sentence","token"])\ .setOutputCol("word_embeddings") clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_bert", "ro", "clinical/models")\ .setInputCols(["sentence","token","word_embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter]) text = """ Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""" data = spark.createDataFrame([[text]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro") .setInputCols(Array("sentence","token")) .setOutputCol("word_embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_bert", "ro", "clinical/models") .setInputCols(Array("sentence","token","word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( 
documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter)) val text = """Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""" val data = Seq(text).toDS.toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ro.med_ner.deid.subentity.bert").predict(""" Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401""") ```
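The `NerConverter` stage in the pipeline above is what turns token-level IOB tags from the NER model into the entity chunks shown in the results. A minimal sketch of that merging logic (plain Python; the tokens and tags are illustrative, not the model's literal output):

```python
def iob_to_chunks(tokens, tags):
    """Collapse IOB tags (B-X / I-X / O) into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the previous chunk
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)  # continue the open chunk
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["BUREAN", "MARIA", ",", "Varsta", ":", "77"]
tags   = ["B-PATIENT", "I-PATIENT", "O", "O", "O", "B-AGE"]
print(iob_to_chunks(tokens, tags))
# -> [('BUREAN MARIA', 'PATIENT'), ('77', 'AGE')]
```

This is why multi-token names such as "BUREAN MARIA" surface as a single `PATIENT` chunk in the `ner_chunk` column.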
## Results ```bash +----------------------------+---------+ |chunk |ner_label| +----------------------------+---------+ |Spitalul Pentru Ochi de Deal|HOSPITAL | |Drumul Oprea Nr |STREET | |Vaslui |CITY | |737405 |ZIP | |+40(235)413773 |PHONE | |25 May 2022 |DATE | |BUREAN MARIA |PATIENT | |77 |AGE | |Agota Evelyn Tımar |DOCTOR | |2450502264401 |IDNUM | +----------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity_bert| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ro| |Size:|16.5 MB| ## References - Custom John Snow Labs datasets - Data augmentation techniques ## Benchmarking ```bash label precision recall f1-score support AGE 0.98 0.95 0.96 1186 CITY 0.94 0.87 0.90 299 COUNTRY 0.90 0.73 0.81 108 DATE 0.98 0.95 0.96 4518 DOCTOR 0.91 0.94 0.93 1979 EMAIL 1.00 0.62 0.77 8 FAX 0.98 0.95 0.96 56 HOSPITAL 0.92 0.85 0.88 881 IDNUM 0.98 0.99 0.98 235 LOCATION-OTHER 1.00 0.85 0.92 13 MEDICALRECORD 0.99 1.00 1.00 444 ORGANIZATION 0.86 0.76 0.81 75 PATIENT 0.91 0.87 0.89 937 PHONE 0.96 0.98 0.97 302 PROFESSION 0.85 0.82 0.83 161 STREET 0.96 0.94 0.95 173 ZIP 0.99 0.98 0.99 138 micro-avg 0.95 0.93 0.94 11513 macro-avg 0.95 0.89 0.91 11513 weighted-avg 0.95 0.93 0.94 11513 ``` --- layout: model title: English RobertaForQuestionAnswering Cased model (from thatdramebaazguy) author: John Snow Labs name: roberta_qa_movie_mitmovie_squad date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`movie-roberta-MITmovie-squad` is an English model originally trained by `thatdramebaazguy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_movie_mitmovie_squad_en_4.2.4_3.0_1669985249159.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_movie_mitmovie_squad_en_4.2.4_3.0_1669985249159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_movie_mitmovie_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_movie_mitmovie_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_movie_mitmovie_squad| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|466.6 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/thatdramebaazguy/movie-roberta-MITmovie-squad - https://github.com/ibm-aur-nlp/domain-specific-QA - https://github.com/adityaarunsinghal/Domain-Adaptation/blob/master/scripts/shell_scripts/movieR_NER_squad.sh - https://github.com/adityaarunsinghal/Domain-Adaptation/ --- layout: model title: Legal Purchase Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_purchase_agreement_bert date: 2022-11-25 tags: [en, legal, classification, agreement, purchase, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_purchase_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `purchase-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `purchase-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_purchase_agreement_bert_en_1.0.0_3.0_1669368990577.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_purchase_agreement_bert_en_1.0.0_3.0_1669368990577.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_purchase_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +--------------------+ |result | +--------------------+ |[purchase-agreement]| |[other] | |[other] | |[purchase-agreement]| +--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_purchase_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.90 0.93 0.92 82 purchase-agreement 0.85 0.81 0.83 42 accuracy - - 0.89 124 macro-avg 0.88 0.87 0.87 124 weighted-avg 0.89 0.89 0.89 124 ``` --- layout: model title: Portuguese Bert Embeddings (Base) author: John Snow Labs name: bert_embeddings_bert_base_portuguese_cased date: 2022-04-11 tags: [bert, embeddings, pt, open_source] task: Embeddings language: pt edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-portuguese-cased` is a Portuguese model originally trained by `neuralmind`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_portuguese_cased_pt_3.4.2_3.0_1649673570343.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_portuguese_cased_pt_3.4.2_3.0_1649673570343.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_portuguese_cased","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Eu amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_portuguese_cased","pt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Eu amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.embed.bert_base_portuguese_cased").predict("""Eu amo Spark NLP""") ```
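`BertEmbeddings` emits one vector per token; for sentence-level tasks these are often mean-pooled into a single sentence vector. A toy sketch of that pooling (illustrative 3-dimensional vectors, not actual model output, which is 768-dimensional):

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Three toy token embeddings for a three-token sentence
tokens = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [2.0, 1.0, 1.0]]
print(mean_pool(tokens))  # -> [2.0, 1.0, 1.0]
```

Spark NLP also offers dedicated sentence-embedding annotators, but mean pooling token vectors is a common lightweight alternative.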
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_portuguese_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|pt| |Size:|408.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/neuralmind/bert-base-portuguese-cased - https://github.com/neuralmind-ai/portuguese-bert/ --- layout: model title: English BertForQuestionAnswering model (from anindabitm) author: John Snow Labs name: bert_qa_sagemaker_BioclinicalBERT_ADR date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `sagemaker-BioclinicalBERT-ADR` is an English model originally trained by `anindabitm`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_sagemaker_BioclinicalBERT_ADR_en_4.0.0_3.0_1654189301281.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_sagemaker_BioclinicalBERT_ADR_en_4.0.0_3.0_1654189301281.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sagemaker_BioclinicalBERT_ADR","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_sagemaker_BioclinicalBERT_ADR","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bio_clinical.bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_sagemaker_BioclinicalBERT_ADR| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|403.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anindabitm/sagemaker-BioclinicalBERT-ADR --- layout: model title: Chinese BertForTokenClassification Tiny Cased model (from ckiplab) author: John Snow Labs name: bert_token_classifier_tiny_chinese_ws date: 2022-11-30 tags: [zh, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-tiny-chinese-ws` is a Chinese model originally trained by `ckiplab`. ## Predicted Entities `B`, `I` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_tiny_chinese_ws_zh_4.2.4_3.0_1669815451961.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_tiny_chinese_ws_zh_4.2.4_3.0_1669815451961.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_tiny_chinese_ws","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_tiny_chinese_ws","zh") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_tiny_chinese_ws| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|zh| |Size:|43.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/ckiplab/bert-tiny-chinese-ws - https://github.com/ckiplab/ckip-transformers - https://muyang.pro - https://ckip.iis.sinica.edu.tw --- layout: model title: Arabic ElectraForQuestionAnswering model (from Damith) author: John Snow Labs name: electra_qa_AraELECTRA_discriminator_SOQAL date: 2022-06-22 tags: [ar, open_source, electra, question_answering] task: Question Answering language: ar edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `AraELECTRA-discriminator-SOQAL` is an Arabic model originally trained by `Damith`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_AraELECTRA_discriminator_SOQAL_ar_4.0.0_3.0_1655918555995.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_AraELECTRA_discriminator_SOQAL_ar_4.0.0_3.0_1655918555995.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_AraELECTRA_discriminator_SOQAL","ar") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_AraELECTRA_discriminator_SOQAL","ar") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.answer_question.electra").predict("""ما هو اسمي؟|||اسمي كلارا وأنا أعيش في بيركلي.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_AraELECTRA_discriminator_SOQAL| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ar| |Size:|504.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Damith/AraELECTRA-discriminator-SOQAL --- layout: model title: Part of Speech for Catalan author: John Snow Labs name: pos_ud_ancora date: 2021-03-09 tags: [part_of_speech, open_source, catalan, pos_ud_ancora, ca] task: Part of Speech Tagging language: ca edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - DET - PROPN - PUNCT - AUX - VERB - NOUN - ADP - NUM - ADJ - CCONJ - PRON - SCONJ - SYM - ADV - PART - X - INTJ {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_ancora_ca_3.0.0_3.0_1615292158091.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_ancora_ca_3.0.0_3.0_1615292158091.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_ancora", "ca") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Hola de John Snow Labs! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_ancora", "ca") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Hola de John Snow Labs! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Hola de John Snow Labs! "] token_df = nlu.load('ca.pos').predict(text) token_df ```
## Results ```bash token pos 0 Hola PROPN 1 de ADP 2 John PROPN 3 Snow PROPN 4 Labs PROPN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_ancora| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|ca| --- layout: model title: English DistilBertForTokenClassification Cased model (from Lucifermorningstar011) author: John Snow Labs name: distilbert_token_classifier_autotrain_final_784824218 date: 2023-03-06 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-final-784824218` is an English model originally trained by `Lucifermorningstar011`. ## Predicted Entities `9`, `0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824218_en_4.3.1_3.0_1678134202429.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824218_en_4.3.1_3.0_1678134202429.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824218","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824218","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_final_784824218| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Lucifermorningstar011/autotrain-final-784824218 --- layout: model title: Telugu Bert Embeddings (from neuralspace-reverie) author: John Snow Labs name: bert_embeddings_indic_transformers_te_bert date: 2022-04-11 tags: [bert, embeddings, te, open_source] task: Embeddings language: te edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `indic-transformers-te-bert` is a Telugu model originally trained by `neuralspace-reverie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_te_bert_te_3.4.2_3.0_1649675210689.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_te_bert_te_3.4.2_3.0_1649675210689.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_indic_transformers_te_bert","te") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["నేను స్పార్క్ nlp ను ప్రేమిస్తున్నాను"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_indic_transformers_te_bert","te") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("నేను స్పార్క్ nlp ను ప్రేమిస్తున్నాను").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("te.embed.indic_transformers_te_bert").predict("""నేను స్పార్క్ nlp ను ప్రేమిస్తున్నాను""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_indic_transformers_te_bert| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|te| |Size:|612.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-te-bert - https://oscar-corpus.com/ --- layout: model title: Swahili XlmRoBertaForQuestionAnswering (from cjrowe) author: John Snow Labs name: xlm_roberta_qa_afriberta_base_finetuned_tydiqa date: 2022-06-23 tags: [sw, open_source, question_answering, xlmroberta] task: Question Answering language: sw edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `afriberta_base-finetuned-tydiqa` is a Swahili model originally trained by `cjrowe`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_afriberta_base_finetuned_tydiqa_sw_4.0.0_3.0_1655984062960.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_afriberta_base_finetuned_tydiqa_sw_4.0.0_3.0_1655984062960.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_afriberta_base_finetuned_tydiqa","sw") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_afriberta_base_finetuned_tydiqa","sw") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("sw.answer_question.tydiqa.xlm_roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_afriberta_base_finetuned_tydiqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|sw| |Size:|415.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/cjrowe/afriberta_base-finetuned-tydiqa --- layout: model title: Word Segmenter for Chinese author: John Snow Labs name: wordseg_gsd_ud_trad date: 2021-03-09 tags: [word_segmentation, open_source, chinese, wordseg_gsd_ud_trad, zh] task: Word Segmentation language: zh edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: WordSegmenterModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description [WordSegmenterModel-WSM](https://en.wikipedia.org/wiki/Text_segmentation) is based on a maximum entropy probability model that detects word boundaries in Chinese text. Chinese text is written without white space between the words, and a computer-based application cannot know a priori which sequence of ideograms forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/chinese/word_segmentation/words_segmenter_demo.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_gsd_ud_trad_zh_3.0.0_3.0_1615292446833.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_gsd_ud_trad_zh_3.0.0_3.0_1615292446833.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud_trad", "zh") \ .setInputCols(["document"]) \ .setOutputCol("token") pipeline = Pipeline(stages=[document_assembler, word_segmenter]) ws_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) example = spark.createDataFrame([['从John Snow Labs你好! ']], ["text"]) result = ws_model.transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud_trad", "zh") .setInputCols(Array("document")) .setOutputCol("token") val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter)) val data = Seq("从John Snow Labs你好! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["从John Snow Labs你好! "] token_df = nlu.load('zh.segment_words.gsd').predict(text) token_df ```
## Results ```bash 0 从J 1 o 2 h 3 n 4 Snow 5 L 6 a 7 b 8 s 9 你 10 好 11 ! Name: token, dtype: object ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wordseg_gsd_ud_trad| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[words_segmented]| |Language:|zh| --- layout: model title: Receipts Binary Classification author: John Snow Labs name: finvisualclf_vit_tickets date: 2022-09-07 tags: [en, finance, classification, tickets, licensed] task: Image Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true recommended: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a ViT (Visual Transformer) model, which can be used to carry out Binary Classification (true or false) on pictures / photos / images. This model has been trained in-house with different corpora, including: - CORD - COCO - In-house annotated receipts You can use this model to filter out non-tickets from a folder of images or mobile pictures, and then use Visual NLP to extract information using the layout and the text features. ## Predicted Entities `ticket`, `no_ticket` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finvisualclf_vit_tickets_en_1.0.0_3.2_1662560058841.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finvisualclf_vit_tickets_en_1.0.0_3.2_1662560058841.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier_loaded = nlp.ViTForImageClassification.pretrained("finvisualclf_vit_tickets", "en", "finance/models")\ .setInputCols(["image_assembler"])\ .setOutputCol("class") pipeline = nlp.Pipeline().setStages([ document_assembler, imageClassifier_loaded ]) test_image = spark.read\ .format("image")\ .option("dropInvalid", value = True)\ .load("./ticket.JPEG") result = pipeline.fit(test_image).transform(test_image) result.select("class.result").show(1, False) ```
## Results ```bash +--------+ |result | +--------+ |[ticket]| +--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finvisualclf_vit_tickets| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| ## References Cord, rvl-cdip, visual-genome and an external receipt dataset ## Benchmarking ```bash label score training_loss 0.0006 validation_loss 0.0044 f1 0.9997 ``` --- layout: model title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011) author: John Snow Labs name: distilbert_token_classifier_autotrain_company_vs_all_902129475 date: 2023-03-14 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-company_vs_all-902129475` is an English model originally trained by `ismail-lucifer011`. ## Predicted Entities `Company`, `OOV` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_company_vs_all_902129475_en_4.3.1_3.0_1678783345162.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_company_vs_all_902129475_en_4.3.1_3.0_1678783345162.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_company_vs_all_902129475","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_company_vs_all_902129475","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_company_vs_all_902129475| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ismail-lucifer011/autotrain-company_vs_all-902129475 --- layout: model title: English asr_wav2vec2_base_timit_demo_colab92 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab92 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab92` is an English model originally trained by hassnain. NOTE: This model only works on a CPU; if you need to run it on a GPU device, please use asr_wav2vec2_base_timit_demo_colab92_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab92_en_4.2.0_3.0_1664022414316.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab92_en_4.2.0_3.0_1664022414316.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab92", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab92", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
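Both snippets above transform an `audioDf` that is assumed to already exist. As a sketch of how such a DataFrame might be built — assuming the input is a 16-bit PCM WAV file, and with `sample.wav` as a purely hypothetical path — the raw samples can be decoded into the array of floats that `AudioAssembler`'s input column expects:

```python
import struct
import wave

def wav_to_floats(path):
    # Decode a 16-bit PCM WAV file into a list of floats in [-1, 1],
    # the shape of data AudioAssembler reads from "audio_content".
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "this sketch handles 16-bit PCM only"
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Hypothetical usage with the pipeline above:
# audioDf = spark.createDataFrame([[wav_to_floats("sample.wav")]],
#                                 ["audio_content"])
```

In practice a library such as librosa or soundfile would handle resampling to the model's expected rate as well; the stdlib version here only illustrates the float-array layout.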
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab92| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: English BertForQuestionAnswering model (from aodiniz) author: John Snow Labs name: bert_qa_bert_uncased_L_10_H_512_A_8_cord19_200616_squad2_covid_qna date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-10_H-512_A-8_cord19-200616_squad2_covid-qna` is an English model originally trained by `aodiniz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_10_H_512_A_8_cord19_200616_squad2_covid_qna_en_4.0.0_3.0_1654185183590.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_10_H_512_A_8_cord19_200616_squad2_covid_qna_en_4.0.0_3.0_1654185183590.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_10_H_512_A_8_cord19_200616_squad2_covid_qna","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_uncased_L_10_H_512_A_8_cord19_200616_squad2_covid_qna","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_covid_cord19.bert.uncased_10l_512d_a8a_512d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_uncased_L_10_H_512_A_8_cord19_200616_squad2_covid_qna| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|177.9 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-10_H-512_A-8_cord19-200616_squad2_covid-qna --- layout: model title: English T5ForConditionalGeneration Small Cased model (from mrm8488) author: John Snow Labs name: t5_small_finetuned_quora_for_paraphrasing date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-finetuned-quora-for-paraphrasing` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_finetuned_quora_for_paraphrasing_en_4.3.0_3.0_1675126066647.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_finetuned_quora_for_paraphrasing_en_4.3.0_3.0_1675126066647.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_finetuned_quora_for_paraphrasing","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_finetuned_quora_for_paraphrasing","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_finetuned_quora_for_paraphrasing| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|269.4 MB| ## References - https://huggingface.co/mrm8488/t5-small-finetuned-quora-for-paraphrasing - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/pdf/1910.10683.pdf - https://i.imgur.com/jVFMMWR.png - https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb - https://twitter.com/mrm8488 - https://www.linkedin.com/in/manuel-romero-cs/ --- layout: model title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_izzy_lazerson TFWav2Vec2ForCTC from izzy-lazerson author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_turkish_colab_by_izzy_lazerson date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_izzy_lazerson` is an English model originally trained by izzy-lazerson.
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_300m_turkish_colab_by_izzy_lazerson_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_izzy_lazerson_en_4.2.0_3.0_1664018577570.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_izzy_lazerson_en_4.2.0_3.0_1664018577570.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_izzy_lazerson", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_izzy_lazerson", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
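Decoding in Wav2Vec2ForCTC follows the standard CTC best-path rule: take the highest-scoring token per audio frame, collapse consecutive repeats, then drop blanks. The sketch below is a plain-Python illustration of that rule only, not Spark NLP internals (the function name is ours):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Greedy (best-path) CTC decoding: collapse consecutive repeats, then drop blanks."""
    decoded = []
    prev = None
    for token in frame_ids:
        # a token is emitted only when it differs from the previous frame and is not the blank
        if token != prev and token != blank:
            decoded.append(token)
        prev = token
    return decoded

# e.g. frames [1, 1, 0, 2, 2, 0, 1] with blank=0 decode to [1, 2, 1]
```

Real decoders apply this rule to the per-frame argmax over vocabulary logits and map the surviving ids back to characters; beam-search variants with a language model follow the same collapse-and-drop principle.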
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_turkish_colab_by_izzy_lazerson| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: English asr_wav2vec2_base_timit_moaiz_explast TFWav2Vec2ForCTC from moaiz237 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_moaiz_explast date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_moaiz_explast` is an English model originally trained by moaiz237. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_moaiz_explast_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_moaiz_explast_en_4.2.0_3.0_1664039531385.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_moaiz_explast_en_4.2.0_3.0_1664039531385.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_moaiz_explast', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_moaiz_explast", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_moaiz_explast| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Fast Neural Machine Translation Model from English to Swedish author: John Snow Labs name: opus_mt_en_sv date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, sv, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `sv` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_sv_xx_2.7.0_2.4_1609168667170.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_sv_xx_2.7.0_2.4_1609168667170.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_sv", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(["text to translate"]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_sv", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.sv').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_sv| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_fpdm_hier_roberta_FT_new_newsqa date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fpdm_hier_roberta_FT_new_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_hier_roberta_FT_new_newsqa_en_4.0.0_3.0_1655728611408.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_hier_roberta_FT_new_newsqa_en_4.0.0_3.0_1655728611408.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_hier_roberta_FT_new_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_fpdm_hier_roberta_FT_new_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.roberta.qa_fpdm_hier_roberta_ft_new_newsqa.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_fpdm_hier_roberta_FT_new_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|461.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/fpdm_hier_roberta_FT_new_newsqa --- layout: model title: Smaller BERT Embeddings (L-4_H-256_A-4) author: John Snow Labs name: small_bert_L4_256 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L4_256_en_2.6.0_2.4_1598344409205.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L4_256_en_2.6.0_2.4_1598344409205.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L4_256", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L4_256", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L4_256').predict(text, output_level='token') embeddings_df ```
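The knowledge-distillation setup mentioned in the description trains the compact student to match temperature-softened teacher probabilities. A minimal plain-Python sketch of the two ingredients, temperature-scaled softmax and the KL term (function names are ours, for illustration only):

```python
import math

def softmax_t(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature yields softer teacher targets."""
    m = max(x / temperature for x in logits)  # subtract the max for numerical stability
    exps = [math.exp(x / temperature - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): the student distribution q is trained to match the teacher p."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
```

In practice the distillation loss mixes this KL term (on softened logits) with the ordinary cross-entropy on gold labels.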
{:.h2_title} ## Results ```bash token en_embed_bert_small_L4_256_embeddings I [-0.406830757856369, 1.3900043964385986, 0.599... love [0.17971229553222656, 0.8448253870010376, 0.30... NLP [0.30530980229377747, 0.3500273525714874, -0.2... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|small_bert_L4_256| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|256| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/1 --- layout: model title: Judgements Classification (Argument Type) author: John Snow Labs name: legclf_bert_judgements_argtype date: 2022-09-07 tags: [en, legal, judgements, argument, echr, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Text Classification model, aimed to identify the different argument types in Court Decision texts about Human Rights. This model was inspired by [this](https://arxiv.org/pdf/2208.06178.pdf) paper, which uses a different approach (Named Entity Recognition). The classes are listed below. Please check the [original paper](https://arxiv.org/pdf/2208.06178.pdf) for more information about them.
- APPLICATION CASE - DECISION ECHR - LEGAL BASIS - LEGITIMATE PURPOSE - NECESSITY/PROPORTIONALITY - NON CONTESTATION - OTHER - PRECEDENTS ECHR ## Predicted Entities `APPLICATION CASE`, `DECISION ECHR`, `LEGAL BASIS`, `LEGITIMATE PURPOSE`, `NECESSITY/PROPORTIONALITY`, `NON CONTESTATION`, `OTHER`, `PRECEDENTS ECHR` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/LEG_JUDGEMENTS_CLF/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_bert_judgements_argtype_en_1.0.0_3.2_1662562438186.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_bert_judgements_argtype_en_1.0.0_3.2_1662562438186.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python text_list = ["""Indeed, given that the expectation of protection of private life may be reduced on account of the public functions exercised, the Court considers that, in order to ensure a fair balancing of the interests at stake, the domestic courts, in assessing the facts submitted for their examination, ought to have taken into account the potential impact of the Prince's status as Head of State, and to have attempted, in that context, to determine the parts of the impugned article that belonged to the strictly private domain and what fell within the public sphere.""", """Article 8 requires that the domestic authorities should strike a fair balance between the interests of the child and those of the parents, and that, in the balancing process, particular importance should be attached to the best interests of the child, which, depending on their nature and seriousness, may override those of the parents. In particular, a parent can not be entitled under Article 8 to have such measures taken as would harm the child's health and development ( see Sahin, cited above, § 66, and Sommerfeld, cited above, § 64 )."""] text_list = [x.lower() for x in text_list] text_list document_assembler = nlp.DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = nlp.Tokenizer()\ .setInputCols(['document'])\ .setOutputCol("token") clf_model = legal.BertForSequenceClassification.pretrained("legclf_bert_judgements_argtype", "en", "legal/models")\ .setInputCols(['document','token'])\ .setOutputCol("class")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) clf_pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, clf_model ]) # Generating example empty_df = spark.createDataFrame([['']]).toDF("text") model = clf_pipeline.fit(empty_df) light_model = LightPipeline(model) import pandas as pd df = spark.createDataFrame(pd.DataFrame({"text" : text_list})) result = model.transform(df) 
import pyspark.sql.functions as F result.select(F.explode(F.arrays_zip('document.result', 'class.result')).alias("cols"))\ .select(F.expr("cols['0']").alias("document"), F.expr("cols['1']").alias("class")).show(truncate = 60) ```
## Results ```bash +------------------------------------------------------------+----------------+ | document| class| +------------------------------------------------------------+----------------+ |indeed, given that the expectation of protection of priva...|APPLICATION CASE| |article 8 requires that the domestic authorities should s...| PRECEDENTS ECHR| +------------------------------------------------------------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_bert_judgements_argtype| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|410.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References Based on https://arxiv.org/pdf/2208.06178.pdf with in-house post-processing ## Benchmarking ```bash label precision recall f1-score support APPLICATION_CASE 0.85 0.83 0.84 983 DECISION_ECHR 0.82 0.86 0.84 103 LEGAL_BASIS 0.61 0.50 0.55 40 LEGITIMATE_PURPOSE 0.94 0.88 0.91 17 NECESSITY/PROPORTIONALITY 0.62 0.66 0.64 207 NON_CONTESTATION 0.64 0.69 0.67 13 OTHER 0.97 0.97 0.97 2557 PRECEDENTS_ECHR 0.80 0.85 0.83 262 accuracy - - 0.91 4182 macro-avg 0.78 0.78 0.78 4182 weighted-avg 0.91 0.91 0.91 4182 ``` --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_fpdm_triplet_ft_news date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.
`fpdm_triplet_roberta_FT_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_triplet_ft_news_en_4.3.0_3.0_1674211248579.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_fpdm_triplet_ft_news_en_4.3.0_3.0_1674211248579.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_triplet_ft_news","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_fpdm_triplet_ft_news","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
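Extractive QA models such as this one score each context token as a possible answer start and end; the predicted answer is the span maximizing the combined score, subject to start ≤ end and a length cap. A rough plain-Python sketch of that selection rule (illustrative only; the annotator additionally handles tokenization and masking):

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start_logits[i] + end_logits[j], with i <= j < i + max_answer_len."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        # only consider end positions at or after the start, within the length cap
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best_score:
                best_score, best = score, (i, j)
    return best
```

Production implementations vectorize this search and exclude spans that fall inside the question or on special tokens.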
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_fpdm_triplet_ft_news| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|458.6 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/fpdm_triplet_roberta_FT_newsqa --- layout: model title: English Bert Embeddings (from wilsontam) author: John Snow Labs name: bert_embeddings_bert_base_uncased_dstc9 date: 2022-04-11 tags: [bert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-uncased-dstc9` is an English model originally trained by `wilsontam`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_uncased_dstc9_en_3.4.2_3.0_1649673032641.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_uncased_dstc9_en_3.4.2_3.0_1649673032641.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_uncased_dstc9","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_uncased_dstc9","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.bert_base_uncased_dstc9").predict("""I love Spark NLP""") ```
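Downstream, token embeddings like those produced above are most often compared with cosine similarity. A minimal plain-Python sketch (the helper name is ours):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

With real 768-dimensional BERT vectors the same computation applies unchanged; libraries typically batch it as a normalized matrix product.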
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_uncased_dstc9| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|410.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/wilsontam/bert-base-uncased-dstc9 - https://github.com/alexa/alexa-with-dstc9-track1-dataset --- layout: model title: Legal Taxes Clause Binary Classifier author: John Snow Labs name: legclf_taxes_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `taxes` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `taxes` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_taxes_clause_en_1.0.0_3.2_1660124052298.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_taxes_clause_en_1.0.0_3.2_1660124052298.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_taxes_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
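As the description recommends, long contracts should be split into paragraph-sized chunks (e.g. by multiline breaks) before being fed to the classifier, so each clause candidate fits within the 512-token embedding limit. A minimal sketch of multiline paragraph splitting (plain Python; `split_paragraphs` is a hypothetical helper, not part of Spark NLP):

```python
import re

def split_paragraphs(text):
    """Split a document on one or more blank lines; drop empty chunks."""
    chunks = re.split(r"\n\s*\n", text)
    return [c.strip() for c in chunks if c.strip()]

# each returned chunk can then become one row of the "clause_text" DataFrame above
```

Splitting by headers or sub-headers follows the same pattern with a different regular expression.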
## Results ```bash +-------+ | result| +-------+ |[taxes]| |[other]| |[other]| |[taxes]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_taxes_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.95 0.97 0.96 115 taxes 0.90 0.85 0.88 41 accuracy - - 0.94 156 macro-avg 0.92 0.91 0.92 156 weighted-avg 0.94 0.94 0.94 156 ``` --- layout: model title: English asr_wav2vec2_large_xlsr_53_arabic_by_logicbloke TFWav2Vec2ForCTC from logicbloke author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_arabic_by_logicbloke date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_arabic_by_logicbloke` is an English model originally trained by logicbloke. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_arabic_by_logicbloke_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_arabic_by_logicbloke_en_4.2.0_3.0_1664095628498.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_arabic_by_logicbloke_en_4.2.0_3.0_1664095628498.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_arabic_by_logicbloke", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_arabic_by_logicbloke", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
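The `audioDf` referenced above must expose an `audio_content` column holding raw floating-point samples (wav2vec2 models are typically trained on 16 kHz audio). As a minimal, library-agnostic sketch of that preprocessing step — the function name is ours, and this is only one way to produce the floats — a 16-bit PCM mono WAV file can be decoded with Python's standard `wave` module:

```python
import struct
import wave

def wav_to_floats(path):
    """Decode a 16-bit PCM mono WAV file into samples in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "expects 16-bit PCM"
        frames = wav.readframes(wav.getnframes())
    # "<h" = little-endian signed 16-bit; normalize by 2**15.
    ints = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in ints]
```

The resulting list of floats is what would populate the `audio_content` column (for example via `spark.createDataFrame`) before calling `pipeline.fit`/`transform`.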
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_arabic_by_logicbloke| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: English image_classifier_vit_modeversion1_m6_e4n ViTForImageClassification from sudo-s author: John Snow Labs name: image_classifier_vit_modeversion1_m6_e4n date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_modeversion1_m6_e4n` is an English model originally trained by sudo-s. ## Predicted Entities `45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146` {:.btn-box} 
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_modeversion1_m6_e4n_en_4.1.0_3.0_1660168388194.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_modeversion1_m6_e4n_en_4.1.0_3.0_1660168388194.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_modeversion1_m6_e4n", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_modeversion1_m6_e4n", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_modeversion1_m6_e4n| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.3 MB| --- layout: model title: Japanese BERT Base author: John Snow Labs name: bert_base_japanese date: 2021-09-16 tags: [bert, ja, open_source, japanese, embeddings] task: Embeddings language: ja edition: Spark NLP 3.2.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture. It was originally published by Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, 2018. The weights of this model are those released by the original BERT authors. The models are trained on the Japanese version of Wikipedia. The training corpus is generated from the Wikipedia Cirrussearch dump file as of August 31, 2020. The generated corpus files are 4.0GB in total, consisting of approximately 30M sentences. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_japanese_ja_3.2.2_3.0_1631803687387.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_japanese_ja_3.2.2_3.0_1631803687387.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_base_japanese", "ja") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_base_japanese", "ja") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_japanese| |Compatibility:|Spark NLP 3.2.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Case sensitive:|true| ## Data Source The models are trained on the Japanese version of Wikipedia. The training corpus is generated from the Wikipedia Cirrussearch dump file as of August 31, 2020. The generated corpus files are 4.0GB in total, consisting of approximately 30M sentences. https://github.com/cl-tohoku/bert-japanese --- layout: model title: Fast Neural Machine Translation Model from American Sign Language to Spanish author: John Snow Labs name: opus_mt_ase_es date: 2021-06-01 tags: [open_source, seq2seq, translation, ase, es, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). source languages: ase target languages: es {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ase_es_xx_3.1.0_2.4_1622555281797.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ase_es_xx_3.1.0_2.4_1622555281797.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ase_es", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ase_es", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.American Sign Language.translate_to.Spanish').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ase_es| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Liability Clause Binary Classifier author: John Snow Labs name: legclf_liability_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `liability` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. 
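The paragraph-level splitting recommended above can be sketched outside Spark NLP as well. A minimal illustration (the function names are ours, and the whitespace token count is only a rough stand-in for the model's real tokenizer) that breaks a document on blank lines and flags chunks exceeding the 512-token embedding limit:

```python
import re

MAX_TOKENS = 512  # context window of the sentence embeddings used here

def split_paragraphs(text):
    """Paragraph splitting 'by multiline': break on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def exceeds_limit(paragraph, limit=MAX_TOKENS):
    """Rough length check using whitespace tokens, not real wordpieces."""
    return len(paragraph.split()) > limit

contract = ("9. LIABILITY.\nThe Company shall not be liable for indirect damages.\n\n"
            "10. TAXES.\nEach party bears its own taxes.")
paragraphs = split_paragraphs(contract)              # two clause-sized chunks
oversized = [p for p in paragraphs if exceeds_limit(p)]  # empty here
```

Each resulting chunk can then be fed to the classifier as its own row in the `clause_text` column.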
## Predicted Entities `other`, `liability` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_liability_clause_en_1.0.0_3.2_1660122616869.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_liability_clause_en_1.0.0_3.2_1660122616869.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_liability_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
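The macro and weighted averages reported in the Benchmarking section below follow directly from the per-class rows, which makes a quick sanity check easy:

```python
# Per-class F1 and support, copied from the benchmarking table.
f1 = {"liability": 0.79, "other": 0.94}
support = {"liability": 31, "other": 105}
total = sum(support.values())  # 136 clauses in the test set

# macro-avg: unweighted mean over classes (~0.87).
macro_f1 = sum(f1.values()) / len(f1)

# weighted-avg: mean weighted by class support (~0.91).
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total
```

Both values round to the macro-avg and weighted-avg rows reported below.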
## Results ```bash +-----------+ | result| +-----------+ |[liability]| |[other]| |[other]| |[liability]| +-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_liability_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support liability 0.85 0.74 0.79 31 other 0.93 0.96 0.94 105 accuracy - - 0.91 136 macro-avg 0.89 0.85 0.87 136 weighted-avg 0.91 0.91 0.91 136 ``` --- layout: model title: Translate English to Celtic languages Pipeline author: John Snow Labs name: translate_en_cel date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, cel, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as GPU is recommended. 
- source languages: `en` - target languages: `cel` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_cel_xx_2.7.0_2.4_1609691211925.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_cel_xx_2.7.0_2.4_1609691211925.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_cel", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_cel", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.cel').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_cel| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English image_classifier_vit_pond_image_classification_5 ViTForImageClassification from SummerChiam author: John Snow Labs name: image_classifier_vit_pond_image_classification_5 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_pond_image_classification_5` is an English model originally trained by SummerChiam. ## Predicted Entities `Normal`, `Boiling`, `Algae`, `NormalCement`, `NormalRain`, `BoilingNight`, `NormalNight` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_5_en_4.1.0_3.0_1660172359191.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_5_en_4.1.0_3.0_1660172359191.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_pond_image_classification_5", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_pond_image_classification_5", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_pond_image_classification_5| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Sentence Detection in Kannada Text author: John Snow Labs name: sentence_detector_dl date: 2021-08-30 tags: [kn, open_source, sentence_detection] task: Sentence Detection language: kn edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_kn_3.2.0_3.0_1630336398052.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_kn_3.2.0_3.0_1630336398052.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "kn") \ .setInputCols(["document"]) \ .setOutputCol("sentences") sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL])) sd_model.fullAnnotate("""ಇಂಗ್ಲಿಷ್ ಓದುವ ಪ್ಯಾರಾಗಳ ಉತ್ತಮ ಮೂಲವನ್ನು ಹುಡುಕುತ್ತಿರುವಿರಾ? ನೀವು ಸರಿಯಾದ ಸ್ಥಳಕ್ಕೆ ಬಂದಿದ್ದೀರಿ. ಇತ್ತೀಚಿನ ಅಧ್ಯಯನದ ಪ್ರಕಾರ, ಇಂದಿನ ಯುವಜನರಲ್ಲಿ ಓದುವ ಅಭ್ಯಾಸವು ವೇಗವಾಗಿ ಕಡಿಮೆಯಾಗುತ್ತಿದೆ. ಅವರು ಕೆಲವು ಸೆಕೆಂಡುಗಳಿಗಿಂತ ಹೆಚ್ಚು ಕಾಲ ಆಂಗ್ಲ ಓದುವ ಪ್ಯಾರಾಗ್ರಾಫ್ ಮೇಲೆ ಕೇಂದ್ರೀಕರಿಸಲು ಸಾಧ್ಯವಿಲ್ಲ! ಅಲ್ಲದೆ, ಓದುವುದು ಎಲ್ಲಾ ಸ್ಪರ್ಧಾತ್ಮಕ ಪರೀಕ್ಷೆಗಳ ಅವಿಭಾಜ್ಯ ಅಂಗವಾಗಿತ್ತು. ಹಾಗಾದರೆ, ನಿಮ್ಮ ಓದುವ ಕೌಶಲ್ಯವನ್ನು ನೀವು ಹೇಗೆ ಸುಧಾರಿಸುತ್ತೀರಿ? ಈ ಪ್ರಶ್ನೆಗೆ ಉತ್ತರವು ವಾಸ್ತವವಾಗಿ ಇನ್ನೊಂದು ಪ್ರಶ್ನೆಯಾಗಿದೆ: ಓದುವ ಕೌಶಲ್ಯದ ಉಪಯೋಗವೇನು? ಓದುವ ಮುಖ್ಯ ಉದ್ದೇಶ 'ಅರ್ಥ ಮಾಡಿಕೊಳ್ಳುವುದು'.""") ``` ```scala val documenter = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "kn") .setInputCols(Array("document")) .setOutputCol("sentence") val pipeline = new Pipeline().setStages(Array(documenter, model)) val data = Seq("ಇಂಗ್ಲಿಷ್ ಓದುವ ಪ್ಯಾರಾಗಳ ಉತ್ತಮ ಮೂಲವನ್ನು ಹುಡುಕುತ್ತಿರುವಿರಾ? ನೀವು ಸರಿಯಾದ ಸ್ಥಳಕ್ಕೆ ಬಂದಿದ್ದೀರಿ. ಇತ್ತೀಚಿನ ಅಧ್ಯಯನದ ಪ್ರಕಾರ, ಇಂದಿನ ಯುವಜನರಲ್ಲಿ ಓದುವ ಅಭ್ಯಾಸವು ವೇಗವಾಗಿ ಕಡಿಮೆಯಾಗುತ್ತಿದೆ. ಅವರು ಕೆಲವು ಸೆಕೆಂಡುಗಳಿಗಿಂತ ಹೆಚ್ಚು ಕಾಲ ಆಂಗ್ಲ ಓದುವ ಪ್ಯಾರಾಗ್ರಾಫ್ ಮೇಲೆ ಕೇಂದ್ರೀಕರಿಸಲು ಸಾಧ್ಯವಿಲ್ಲ! ಅಲ್ಲದೆ, ಓದುವುದು ಎಲ್ಲಾ ಸ್ಪರ್ಧಾತ್ಮಕ ಪರೀಕ್ಷೆಗಳ ಅವಿಭಾಜ್ಯ ಅಂಗವಾಗಿತ್ತು. ಹಾಗಾದರೆ, ನಿಮ್ಮ ಓದುವ ಕೌಶಲ್ಯವನ್ನು ನೀವು ಹೇಗೆ ಸುಧಾರಿಸುತ್ತೀರಿ? ಈ ಪ್ರಶ್ನೆಗೆ ಉತ್ತರವು ವಾಸ್ತವವಾಗಿ ಇನ್ನೊಂದು ಪ್ರಶ್ನೆಯಾಗಿದೆ: ಓದುವ ಕೌಶಲ್ಯದ ಉಪಯೋಗವೇನು? ಓದುವ ಮುಖ್ಯ ಉದ್ದೇಶ 'ಅರ್ಥ ಮಾಡಿಕೊಳ್ಳುವುದು'.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load('kn.sentence_detector').predict("ಇಂಗ್ಲಿಷ್ ಓದುವ ಪ್ಯಾರಾಗಳ ಉತ್ತಮ ಮೂಲವನ್ನು ಹುಡುಕುತ್ತಿರುವಿರಾ? ನೀವು ಸರಿಯಾದ ಸ್ಥಳಕ್ಕೆ ಬಂದಿದ್ದೀರಿ. 
ಇತ್ತೀಚಿನ ಅಧ್ಯಯನದ ಪ್ರಕಾರ, ಇಂದಿನ ಯುವಜನರಲ್ಲಿ ಓದುವ ಅಭ್ಯಾಸವು ವೇಗವಾಗಿ ಕಡಿಮೆಯಾಗುತ್ತಿದೆ. ಅವರು ಕೆಲವು ಸೆಕೆಂಡುಗಳಿಗಿಂತ ಹೆಚ್ಚು ಕಾಲ ಆಂಗ್ಲ ಓದುವ ಪ್ಯಾರಾಗ್ರಾಫ್ ಮೇಲೆ ಕೇಂದ್ರೀಕರಿಸಲು ಸಾಧ್ಯವಿಲ್ಲ! ಅಲ್ಲದೆ, ಓದುವುದು ಎಲ್ಲಾ ಸ್ಪರ್ಧಾತ್ಮಕ ಪರೀಕ್ಷೆಗಳ ಅವಿಭಾಜ್ಯ ಅಂಗವಾಗಿತ್ತು. ಹಾಗಾದರೆ, ನಿಮ್ಮ ಓದುವ ಕೌಶಲ್ಯವನ್ನು ನೀವು ಹೇಗೆ ಸುಧಾರಿಸುತ್ತೀರಿ? ಈ ಪ್ರಶ್ನೆಗೆ ಉತ್ತರವು ವಾಸ್ತವವಾಗಿ ಇನ್ನೊಂದು ಪ್ರಶ್ನೆಯಾಗಿದೆ: ಓದುವ ಕೌಶಲ್ಯದ ಉಪಯೋಗವೇನು? ಓದುವ ಮುಖ್ಯ ಉದ್ದೇಶ 'ಅರ್ಥ ಮಾಡಿಕೊಳ್ಳುವುದು'.", output_level ='sentence') ```
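As a point of reference for what the neural model adds, the classic baseline is a rule-based split on terminal punctuation. A minimal sketch (our own illustrative code, shown on English text for readability) makes the failure mode obvious:

```python
import re

def naive_sentences(text):
    """Rule-based baseline: break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Fine on simple input...
simple = naive_sentences("Was he late? No!")          # 2 sentences
# ...but abbreviations fool the rule. A trained detector such as
# SentenceDetectorDL learns that 'Dr.' is not a sentence boundary;
# the naive rule wrongly breaks after it.
tricky = naive_sentences("Dr. Smith arrived today.")  # 2 'sentences'
```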
## Results ```bash +---------------------------------------------------------------------------------------------+ |result | +---------------------------------------------------------------------------------------------+ |[ಇಂಗ್ಲಿಷ್ ಓದುವ ಪ್ಯಾರಾಗಳ ಉತ್ತಮ ಮೂಲವನ್ನು ಹುಡುಕುತ್ತಿರುವಿರಾ?] | |[ನೀವು ಸರಿಯಾದ ಸ್ಥಳಕ್ಕೆ ಬಂದಿದ್ದೀರಿ.] | |[ಇತ್ತೀಚಿನ ಅಧ್ಯಯನದ ಪ್ರಕಾರ, ಇಂದಿನ ಯುವಜನರಲ್ಲಿ ಓದುವ ಅಭ್ಯಾಸವು ವೇಗವಾಗಿ ಕಡಿಮೆಯಾಗುತ್ತಿದೆ.] | |[ಅವರು ಕೆಲವು ಸೆಕೆಂಡುಗಳಿಗಿಂತ ಹೆಚ್ಚು ಕಾಲ ಆಂಗ್ಲ ಓದುವ ಪ್ಯಾರಾಗ್ರಾಫ್ ಮೇಲೆ ಕೇಂದ್ರೀಕರಿಸಲು ಸಾಧ್ಯವಿಲ್ಲ!] | |[ಅಲ್ಲದೆ, ಓದುವುದು ಎಲ್ಲಾ ಸ್ಪರ್ಧಾತ್ಮಕ ಪರೀಕ್ಷೆಗಳ ಅವಿಭಾಜ್ಯ ಅಂಗವಾಗಿತ್ತು.] | |[ಹಾಗಾದರೆ, ನಿಮ್ಮ ಓದುವ ಕೌಶಲ್ಯವನ್ನು ನೀವು ಹೇಗೆ ಸುಧಾರಿಸುತ್ತೀರಿ?] | |[ಈ ಪ್ರಶ್ನೆಗೆ ಉತ್ತರವು ವಾಸ್ತವವಾಗಿ ಇನ್ನೊಂದು ಪ್ರಶ್ನೆಯಾಗಿದೆ:] | |[ಓದುವ ಕೌಶಲ್ಯದ ಉಪಯೋಗವೇನು?] | |[ಓದುವ ಮುಖ್ಯ ಉದ್ದೇಶ 'ಅರ್ಥ ಮಾಡಿಕೊಳ್ಳುವುದು'.] | +---------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentence_detector_dl| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[sentences]| |Language:|kn| ## Benchmarking ```bash label Accuracy Recall Prec F1 0 0.98 1.00 0.96 0.98 ``` --- layout: model title: English T5ForConditionalGeneration Large Cased model (from google) author: John Snow Labs name: t5_efficient_large_nl2 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-large-nl2` is an English model originally trained by `google`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_large_nl2_en_4.3.0_3.0_1675117675037.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_large_nl2_en_4.3.0_3.0_1675117675037.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_large_nl2","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_large_nl2","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_large_nl2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|238.9 MB| ## References - https://huggingface.co/google/t5-efficient-large-nl2 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: PICO Classifier author: John Snow Labs name: classifierdl_pico_biobert date: 2021-01-21 task: Text Classification language: en nav_key: models edition: Healthcare NLP 2.7.1 spark_version: 2.4 tags: [en, licensed, clinical, classifier] supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Classify medical text according to PICO framework. ## Predicted Entities `CONCLUSIONS`, `DESIGN_SETTING`, `INTERVENTION`, `PARTICIPANTS`, `FINDINGS`, `MEASUREMENTS`, `AIMS`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CLASSIFICATION_PICO/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CLINICAL_CLASSIFICATION.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifierdl_pico_biobert_en_2.7.1_2.4_1611248887230.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifierdl_pico_biobert_en_2.7.1_2.4_1611248887230.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler().setInputCol('text').setOutputCol('document') tokenizer = Tokenizer().setInputCols('document').setOutputCol('token') embeddings = BertEmbeddings.pretrained('biobert_pubmed_base_cased')\ .setInputCols(["document", 'token'])\ .setOutputCol("word_embeddings") sentence_embeddings = SentenceEmbeddings() \ .setInputCols(["document", "word_embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") classifier = ClassifierDLModel.pretrained('classifierdl_pico_biobert', 'en', 'clinical/models')\ .setInputCols(['document', 'token', 'sentence_embeddings']).setOutputCol('class') nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings, sentence_embeddings, classifier]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate(["""A total of 10 adult daily smokers who reported at least one stressful event and coping episode and provided post-quit data.""", """When carbamazepine is withdrawn from the combination therapy, aripiprazole dose should then be reduced."""]) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.pico").predict("""A total of 10 adult daily smokers who reported at least one stressful event and coping episode and provided post-quit data.""") ```
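The `SentenceEmbeddings` stage above collapses BioBERT's per-token vectors into one fixed-size vector via `setPoolingStrategy("AVERAGE")`, i.e. an element-wise mean. A toy sketch of that pooling step (3-dimensional vectors for illustration; real BioBERT embeddings are 768-dimensional):

```python
def average_pool(token_vectors):
    """Element-wise mean over token embeddings -> one sentence embedding."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Three toy 'token embeddings' standing in for BioBERT output.
tokens = [[1.0, 0.0, 2.0],
          [3.0, 2.0, 0.0],
          [2.0, 4.0, 1.0]]
sentence_embedding = average_pool(tokens)  # [2.0, 2.0, 1.0]
```

The single pooled vector is what `ClassifierDLModel` consumes as `sentence_embeddings`.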
## Results ```bash | sentences | class | |------------------------------------------------------|--------------| | A total of 10 adult daily smokers who reported at... | PARTICIPANTS | | When carbamazepine is withdrawn from the combinat... | CONCLUSIONS | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_pico_biobert| |Compatibility:|Spark NLP 2.7.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Dependencies:|biobert_pubmed_base_cased| ## Data Source Trained on a custom dataset derived from PICO classification dataset. ## Benchmarking ```bash precision recall f1-score support AIMS 0.9229 0.9186 0.9207 7815 CONCLUSIONS 0.8556 0.8401 0.8478 8837 DESIGN_SETTING 0.8556 0.7494 0.7990 11551 FINDINGS 0.8949 0.9342 0.9142 18827 INTERVENTION 0.6866 0.7508 0.7173 4920 MEASUREMENTS 0.7564 0.8664 0.8077 6505 PARTICIPANTS 0.8483 0.7559 0.7994 5539 accuracy 0.8495 63994 macro avg 0.8315 0.8308 0.8294 63994 weighted avg 0.8517 0.8495 0.8491 63994 ``` --- layout: model title: English BertForQuestionAnswering model (from motiondew) author: John Snow Labs name: bert_qa_bert_set_date_1_lr_2e_5_bs_32_ep_4 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-set_date_1-lr-2e-5-bs-32-ep-4` is an English model originally trained by `motiondew`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_set_date_1_lr_2e_5_bs_32_ep_4_en_4.0.0_3.0_1654184718919.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_set_date_1_lr_2e_5_bs_32_ep_4_en_4.0.0_3.0_1654184718919.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_set_date_1_lr_2e_5_bs_32_ep_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_set_date_1_lr_2e_5_bs_32_ep_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.32d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_set_date_1_lr_2e_5_bs_32_ep_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/motiondew/bert-set_date_1-lr-2e-5-bs-32-ep-4 --- layout: model title: Relation extraction between dates and clinical entities (ReDL) author: John Snow Labs name: redl_date_clinical_biobert date: 2021-02-04 task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 2.7.3 spark_version: 2.4 tags: [licensed, clinical, en, relation_extraction] supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Identify whether a test was conducted or a diagnosis was made on a specific date by checking the relations between clinical entities and dates. `1`: the date and the clinical entity are related; `0`: the date and the clinical entity are not related. ## Predicted Entities `0`, `1` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CLINICAL_DATE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.1.Clinical_Relation_Extraction_BodyParts_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_date_clinical_biobert_en_2.7.3_2.4_1612448249418.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_date_clinical_biobert_en_2.7.3_2.4_1612448249418.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = sparknlp.annotators.Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel.pretrained("jsl_ner_wip_greedy_clinical", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverter() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") # Set a filter on pairs of named entities which will be treated as relation candidates re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks")\ .setRelationPairs(["symptom-date", "date-procedure", "relativedate-test", "test-date"]) # This model is trained on sentence-level data. # It can also be trained on document-level relations; in that case, use "document" instead of "sentence" as input when predicting. 
re_model = RelationExtractionDLModel.pretrained("redl_date_clinical_biobert", "en", "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) text = "This 73 y/o patient had CT on 1/12/95, with progressive memory and cognitive decline since 8/11/94." data = spark.createDataFrame([[text]]).toDF("text") p_model = pipeline.fit(data) result = p_model.transform(data) ``` ```scala ... val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("jsl_ner_wip_greedy_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") .setRelationPairs(Array("symptom-date", "date-procedure", "relativedate-test", "test-date")) // This model is trained on sentence-level data. // It can also be trained on document-level relations; in that case, use "document" instead of "sentence" as input when predicting. val re_model = RelationExtractionDLModel.pretrained("redl_date_clinical_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("This 73 y/o patient had CT on 1/12/95, with progressive memory and cognitive decline since 8/11/94.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.date").predict("""This 73 y/o patient had CT on 1/12/95, with progressive memory and cognitive decline since 8/11/94.""") ```
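The pipeline's output can be post-processed outside Spark. As a minimal sketch (plain Python; `related_pairs` is a hypothetical helper, and the row layout mirrors the results table in this card: relation label, entity types, chunks, confidence), one might keep only the (chunk, date) pairs the model marked as related above a confidence threshold:

```python
def related_pairs(rows, threshold=0.5):
    """rows: (relation, entity1_type, chunk1, entity2_type, chunk2, confidence).
    Keep pairs the model labeled '1' (related) with confidence >= threshold."""
    pairs = []
    for relation, _, chunk1, _, chunk2, confidence in rows:
        if relation == "1" and confidence >= threshold:
            pairs.append((chunk1, chunk2))
    return pairs

rows = [
    ("1", "Test", "CT", "Date", "1/12/95", 1.0),
    ("1", "Symptom", "progressive memory and cognitive decline", "Date", "8/11/94", 1.0),
    ("0", "Test", "CT", "Date", "8/11/94", 0.71),  # not related; dropped
]
print(related_pairs(rows))
# [('CT', '1/12/95'), ('progressive memory and cognitive decline', '8/11/94')]
```

The threshold here plays the same role as `setPredictionThreshold` in the pipeline above.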
## Results ```bash | | relations | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |---|-----------|---------|---------------|-------------|------------------------------------------|---------|---------------|-------------|---------|------------| | 0 | 1 | Test | 24 | 25 | CT | Date | 31 | 37 | 1/12/95 | 1.0 | | 1 | 1 | Symptom | 45 | 84 | progressive memory and cognitive decline | Date | 92 | 98 | 8/11/94 | 1.0 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_date_clinical_biobert| |Compatibility:|Healthcare NLP 2.7.3+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Data Source Trained on an internal dataset. ## Benchmarking ```bash Relation Recall Precision F1 Support 0 0.738 0.729 0.734 84 1 0.945 0.947 0.946 416 Avg. 0.841 0.838 0.840 ``` --- layout: model title: Abkhazian asr_hf_challenge_test TFWav2Vec2ForCTC from Iskaj author: John Snow Labs name: pipeline_asr_hf_challenge_test date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_hf_challenge_test` is an Abkhazian model originally trained by Iskaj. 
NOTE: This pipeline only works on a CPU; if you need to run this pipeline on a GPU device, please use pipeline_asr_hf_challenge_test_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_hf_challenge_test_ab_4.2.0_3.0_1664021301569.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_hf_challenge_test_ab_4.2.0_3.0_1664021301569.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_hf_challenge_test', lang = 'ab') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_hf_challenge_test", lang = "ab") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_hf_challenge_test| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|ab| |Size:|452.3 KB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Spanish Named Entity Recognition (from mrm8488) author: John Snow Labs name: bert_ner_bert_spanish_cased_finetuned_ner date: 2022-05-09 tags: [bert, ner, token_classification, es, open_source] task: Named Entity Recognition language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-spanish-cased-finetuned-ner` is a Spanish model originally trained by `mrm8488`. ## Predicted Entities `LOC`, `PER`, `ORG`, `MISC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_spanish_cased_finetuned_ner_es_3.4.2_3.0_1652096454738.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_spanish_cased_finetuned_ner_es_3.4.2_3.0_1652096454738.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_spanish_cased_finetuned_ner","es") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_spanish_cased_finetuned_ner","es") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_bert_spanish_cased_finetuned_ner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|es| |Size:|410.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/mrm8488/bert-spanish-cased-finetuned-ner - https://www.kaggle.com/nltkdata/conll-corpora - https://github.com/dccuchile/beto - https://twitter.com/mrm8488 --- layout: model title: T5 Question Generation (Small) author: John Snow Labs name: t5_question_generation_small date: 2022-07-05 tags: [en, open_source] task: Text Generation language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true recommended: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Text Generation model, originally trained on the SQuAD dataset and then finetuned by the AllenAI team to generate questions from texts. Its strength is that it can generate questions even from a very small number of tokens, for example a subject and a verb (`Amazon` `should provide`), which would return a question similar to `What Amazon should provide?`. The generated questions can in turn be fed to Question Answering models as the question parameter, with a bigger paragraph as context. This way, you first generate questions on the fly, then look for their answers in the text. Moreover, the input of this model can even be a concatenation of entities from NER (`EMV` - ORG, `will provide` - ACTION). 
## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_question_generation_small_en_4.0.0_3.0_1657032292222.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_question_generation_small_en_4.0.0_3.0_1657032292222.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") t5 = T5Transformer() \ .pretrained("t5_question_generation_small") \ .setTask("")\ .setMaxOutputLength(200)\ .setInputCols(["documents"]) \ .setOutputCol("question") data_df = spark.createDataFrame([["EMV will pay"]]).toDF("text") pipeline = Pipeline().setStages([document_assembler, t5]) results = pipeline.fit(data_df).transform(data_df) results.select("question.result").show(truncate=False) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("documents") val t5 = T5Transformer.pretrained("t5_question_generation_small") .setTask("") .setMaxOutputLength(200) .setInputCols("documents") .setOutputCol("question") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("EMV will pay").toDF("text") val result = pipeline.fit(data).transform(data) result.select("question.result").show(false) ```
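As the description notes, generated questions can feed a Question Answering model. A minimal sketch of the hand-off (plain Python; `qa_inputs` is a hypothetical helper, and the `question|||context` string format follows the nlu QA convention used elsewhere on this page):

```python
def qa_inputs(questions, context):
    """Pair each generated question with the source paragraph, using the
    'question|||context' convention accepted by nlu question-answering loads."""
    return [f"{q}|||{context}" for q in questions]

# The question below is the example output of t5_question_generation_small;
# the context sentence is illustrative.
inputs = qa_inputs(["What will EMV pay?"], "EMV will pay the merchant fees.")
print(inputs[0])
# What will EMV pay?|||EMV will pay the merchant fees.
```

Each string in `inputs` could then be passed to a QA predict call to complete the generate-then-answer loop.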
## Results ```bash +--------------------+ |result | +--------------------+ |[What will EMV pay?]| +--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_question_generation_small| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[question]| |Language:|en| |Size:|148.0 MB| ## References SQUAD2.0 --- layout: model title: Pipeline to Detect PHI for Deidentification (Augmented) author: John Snow Labs name: ner_deid_synthetic_pipeline date: 2023-03-13 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deid_synthetic](https://nlp.johnsnowlabs.com/2021/03/31/ner_deid_synthetic_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_synthetic_pipeline_en_4.3.0_3.2_1678735795195.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_synthetic_pipeline_en_4.3.0_3.2_1678735795195.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_synthetic_pipeline", "en", "clinical/models") text = '''Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_synthetic_pipeline", "en", "clinical/models") val text = "Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227." val result = pipeline.fullAnnotate(text) ```
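For de-identification, the detected chunks are typically replaced in the text using the begin/end character offsets the pipeline returns. A minimal sketch of that step (plain Python; `mask_phi` is a hypothetical helper, and the offsets mirror the results table in this card):

```python
def mask_phi(text, chunks):
    """chunks: (begin, end, label) tuples with inclusive character offsets,
    as in the pipeline's ner_chunk annotations. Replacing right-to-left
    keeps earlier offsets valid while the string shrinks/grows."""
    for begin, end, label in sorted(chunks, reverse=True):
        text = text[:begin] + f"<{label}>" + text[end + 1:]
    return text

masked = mask_phi(
    "Record date : 2093-01-13 , David Hale",
    [(14, 23, "DATE"), (27, 36, "NAME")],
)
print(masked)
# Record date : <DATE> , <NAME>
```

Healthcare NLP ships dedicated de-identification annotators for this; the sketch only illustrates the offset arithmetic.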
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:------------------------------|--------:|------:|:------------|-------------:| | 0 | 2093-01-13 | 14 | 23 | DATE | 1 | | 1 | David Hale | 27 | 36 | NAME | 0.85705 | | 2 | Hendrickson Ora | 55 | 69 | NAME | 0.8646 | | 3 | 7194334 | 78 | 84 | ID | 1 | | 4 | 01/13/93 | 93 | 100 | DATE | 1 | | 5 | Oliveira | 110 | 117 | NAME | 0.9998 | | 6 | 25 | 121 | 122 | AGE | 0.9951 | | 7 | 2079-11-09 | 150 | 159 | DATE | 0.9999 | | 8 | Cocke County Baptist Hospital | 163 | 191 | LOCATION | 0.968825 | | 9 | 0295 Keats Street | 195 | 211 | LOCATION | 0.7831 | | 10 | 302-786-5227 | 221 | 232 | CONTACT | 0.9985 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_synthetic_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Albanian BertForTokenClassification Base Cased model (from Kushtrim) author: John Snow Labs name: bert_token_classifier_base_multilingual_cased_finetuned_albanian_ner date: 2022-11-30 tags: [sq, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: sq edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased-finetuned-albanian-ner` is an Albanian model originally trained by `Kushtrim`. 
## Predicted Entities `MISC`, `LOC`, `ORG`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_base_multilingual_cased_finetuned_albanian_ner_sq_4.2.4_3.0_1669814948982.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_base_multilingual_cased_finetuned_albanian_ner_sq_4.2.4_3.0_1669814948982.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_base_multilingual_cased_finetuned_albanian_ner","sq") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_base_multilingual_cased_finetuned_albanian_ner","sq") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_base_multilingual_cased_finetuned_albanian_ner| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|sq| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Kushtrim/bert-base-multilingual-cased-finetuned-albanian-ner --- layout: model title: Translate Atlantic-Congo languages to English Pipeline author: John Snow Labs name: translate_alv_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, alv, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `alv` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_alv_en_xx_2.7.0_2.4_1609691953961.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_alv_en_xx_2.7.0_2.4_1609691953961.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_alv_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_alv_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.alv.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_alv_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4CHEMD_Chem_Original_PubMedBERT_512 date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Chem-Original-PubMedBERT-512` is a English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Original_PubMedBERT_512_en_4.0.0_3.0_1657108707338.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Original_PubMedBERT_512_en_4.0.0_3.0_1657108707338.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Original_PubMedBERT_512","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Original_PubMedBERT_512","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
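Token-level predictions like these are usually merged into entity chunks (inside a Spark NLP pipeline, `NerConverter` does this). For illustration only, a minimal sketch of the underlying BIO-merging logic in plain Python, with a hypothetical `bio_to_chunks` helper and made-up example tokens:

```python
def bio_to_chunks(tokens, tags):
    """Merge B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the previous chunk
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)  # continue the open chunk
        else:  # "O" tag or inconsistent I- tag: close any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(bio_to_chunks(
    ["Aspirin", "inhibits", "cyclooxygenase"],
    ["B-Chemical", "O", "B-Chemical"]))
# [('Aspirin', 'Chemical'), ('cyclooxygenase', 'Chemical')]
```

For this model the only entity label is `Chemical`, so every chunk carries that label.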
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4CHEMD_Chem_Original_PubMedBERT_512| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|408.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4CHEMD-Chem-Original-PubMedBERT-512 --- layout: model title: Entity Recognizer LG author: John Snow Labs name: entity_recognizer_lg date: 2022-06-25 tags: [nl, open_source] task: Named Entity Recognition language: nl edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_lg is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps and recognizes entities. It performs most of the common text processing tasks on your dataframe {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_nl_4.0.0_3.0_1656131090325.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_nl_4.0.0_3.0_1656131090325.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("entity_recognizer_lg", "nl") result = pipeline.annotate("""I love johnsnowlabs! """) ``` {:.nlu-block} ```python import nlu nlu.load("nl.ner.lg").predict("""I love johnsnowlabs! """) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_lg| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|nl| |Size:|2.5 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - NerDLModel - NerConverter --- layout: model title: Pipeline to Detect PHI for Deidentification (Generic - Augmented) author: John Snow Labs name: ner_deid_generic_augmented_pipeline date: 2023-06-13 tags: [licensed, ner, clinical, deidentification, generic, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_deid_generic_augmented](https://nlp.johnsnowlabs.com/2021/06/30/ner_deid_generic_augmented_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_augmented_pipeline_en_4.4.4_3.2_1686664043324.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_augmented_pipeline_en_4.4.4_3.2_1686664043324.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_generic_augmented_pipeline", "en", "clinical/models") pipeline.annotate("A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227.") ``` ```scala val pipeline = new PretrainedPipeline("ner_deid_generic_augmented_pipeline", "en", "clinical/models") pipeline.annotate("A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.deid_generic_augmented.pipeline").predict("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227.""") ```
## Results

```bash
Results

+-------------------------------------------------+---------+
|chunk                                            |ner_label|
+-------------------------------------------------+---------+
|2093-01-13                                       |DATE     |
|David Hale                                       |NAME     |
|Hendrickson                                      |NAME     |
|Ora MR.                                          |LOCATION |
|7194334                                          |ID       |
|01/13/93                                         |DATE     |
|Oliveira                                         |NAME     |
|25                                               |AGE      |
|1-11-2000                                        |DATE     |
|Cocke County Baptist Hospital. 0295 Keats Street.|LOCATION |
|(302) 786-5227                                   |CONTACT  |
+-------------------------------------------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_deid_generic_augmented_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter

---
layout: model
title: Word2Vec Embeddings in Slovenian (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, sl, open_source]
task: Embeddings
language: sl
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Word Embeddings lookup annotator that maps tokens to vectors.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sl_3.4.1_3.0_1647458729591.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sl_3.4.1_3.0_1647458729591.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("sl.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|sl|
|Size:|1.2 GB|
|Case sensitive:|false|
|Dimension:|300|

---
layout: model
title: English Deberta Embeddings model (from iewaij)
author: John Snow Labs
name: deberta_embeddings_v3_base_lm
date: 2023-03-12
tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, en, tensorflow]
task: Embeddings
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DeBertaEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deberta-v3-base-lm` is an English model originally trained by `iewaij`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_v3_base_lm_en_4.3.1_3.0_1678629806286.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_v3_base_lm_en_4.3.1_3.0_1678629806286.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_v3_base_lm","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_v3_base_lm","en")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|deberta_embeddings_v3_base_lm| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|691.5 MB| |Case sensitive:|false| ## References https://huggingface.co/iewaij/deberta-v3-base-lm --- layout: model title: Smaller BERT Sentence Embeddings (L-2_H-768_A-12) author: John Snow Labs name: sent_small_bert_L2_768 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L2_768_en_2.6.0_2.4_1598350960245.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L2_768_en_2.6.0_2.4_1598350960245.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_768", "en") \
    .setInputCols("sentence") \
    .setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"]))
```
```scala
...
val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_768", "en")
  .setInputCols("sentence")
  .setOutputCol("sentence_embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('en.embed_sentence.small_bert_L2_768').predict(text, output_level='sentence')
embeddings_df
```
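Sentence embeddings such as those shown in the Results section below are typically compared with cosine similarity. A toy sketch with made-up 3-dimensional vectors (the real model emits 768-dimensional ones; the numbers here are illustrative only):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Stand-ins for the 768-dimensional embeddings the model actually produces.
v1 = [-0.0977, 0.1616, -0.0452]
v2 = [0.3276, -0.1468, -0.0912]
score = cosine(v1, v2)
print(round(score, 4))
```

Scores close to 1 indicate semantically similar sentences; values near 0 or below indicate unrelated or contrasting ones.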
{:.h2_title}
## Results

```bash
sentence                         en_embed_sentence_small_bert_L2_768_embeddings
I hate cancer                    [-0.09778258204460144, 0.16162623465061188, -0...
Antibiotics aren't painkiller    [0.32761386036872864, -0.14685948193073273, -0...
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sent_small_bert_L2_768|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|[en]|
|Dimension:|768|
|Case sensitive:|false|

{:.h2_title}
## Data Source

The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-768_A-12/1

---
layout: model
title: COVID-19 Sentiment Classifier (BioBERT)
author: John Snow Labs
name: bert_sequence_classifier_covid_sentiment
date: 2022-08-01
tags: [public_health, covid19_sentiment, en, licenced]
task: Sentiment Analysis
language: en
nav_key: models
edition: Healthcare NLP 4.0.2
spark_version: 3.0
supported: true
recommended: true
annotator: MedicalBertForSequenceClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a [BioBERT](https://nlp.johnsnowlabs.com/2022/07/18/biobert_pubmed_base_cased_v1.2_en_3_0.html) based sentiment analysis model that can extract information from COVID-19 pandemic-related tweets. The model predicts whether a tweet contains positive, negative, or neutral sentiments about the COVID-19 pandemic.

## Predicted Entities

`neutral`, `positive`, `negative`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_covid_sentiment_en_4.0.2_3.0_1659344524584.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_covid_sentiment_en_4.0.2_3.0_1659344524584.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_covid_sentiment", "en", "clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame([ ["British Department of Health confirms first two cases of in UK"], ["so my trip to visit my australian exchange student just got canceled bc of coronavirus. im heartbroken :("], [ "I wish everyone to be safe at home and stop pandemic"]] ).toDF("text") result = pipeline.fit(data).transform(data) result.select("text", "class.result").show(truncate=False) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_covid_sentiment", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq(Array("British Department of Health confirms first two cases of in UK", "so my trip to visit my australian exchange student just got canceled bc of coronavirus. im heartbroken :(", "I wish everyone to be safe at home and stop pandemic" )).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.bert_sequence.covid_sentiment").predict("""so my trip to visit my australian exchange student just got canceled bc of coronavirus. im heartbroken :(""") ```
## Results

```bash
+---------------------------------------------------------------------------------------------------------+----------+
|text                                                                                                     |result    |
+---------------------------------------------------------------------------------------------------------+----------+
|British Department of Health confirms first two cases of in UK                                           |[neutral] |
|so my trip to visit my australian exchange student just got canceled bc of coronavirus. im heartbroken :(|[negative]|
|I wish everyone to be safe at home and stop pandemic                                                     |[positive]|
+---------------------------------------------------------------------------------------------------------+----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_covid_sentiment|
|Compatibility:|Healthcare NLP 4.0.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|406.5 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

Curated from several academic and in-house datasets.

## Benchmarking

```bash
       label  precision  recall  f1-score  support
    negative       0.96    0.97      0.97     3284
    positive       0.94    0.96      0.95     1207
     neutral       0.96    0.94      0.95     3232
    accuracy          -       -      0.96     7723
   macro-avg       0.95    0.96      0.96     7723
weighted-avg       0.96    0.96      0.96     7723
```

---
layout: model
title: English BertForQuestionAnswering model (from aodiniz)
author: John Snow Labs
name: bert_qa_bert_uncased_L_6_H_128_A_2_squad2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-6_H-128_A-2_squad2` is an English model originally trained by `aodiniz`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_6_H_128_A_2_squad2_en_4.0.0_3.0_1654185377419.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_6_H_128_A_2_squad2_en_4.0.0_3.0_1654185377419.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_6_H_128_A_2_squad2","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
  .setInputCols("question", "context")
  .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
  .pretrained("bert_qa_bert_uncased_L_6_H_128_A_2_squad2","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
  ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
  ("What's my name?", "My name is Clara and I live in Berkeley."))
  .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.bert.uncased_6l_128d_a2a_128d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_uncased_L_6_H_128_A_2_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|19.9 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/aodiniz/bert_uncased_L-6_H-128_A-2_squad2

---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab52 TFWav2Vec2ForCTC from hassnain
author: John Snow Labs
name: asr_wav2vec2_base_timit_demo_colab52
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab52` is an English model originally trained by hassnain.

NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab52_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab52_en_4.2.0_3.0_1664021979565.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab52_en_4.2.0_3.0_1664021979565.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab52", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab52", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
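Both snippets above reference an `audioDf` that is never constructed. Wav2Vec2 models expect 16 kHz mono audio as an array of floats in an `audio_content` column. A standard-library sketch of producing that float array from 16-bit PCM WAV bytes — the synthetic tone and the final Spark wrapping (e.g. `spark.createDataFrame([(floats,)], ["audio_content"])`) are illustrative assumptions, not part of this card:

```python
import io
import math
import struct
import wave

def wav_to_floats(wav_bytes: bytes) -> list:
    """Decode 16-bit PCM mono WAV bytes into floats in [-1.0, 1.0]."""
    with wave.open(io.BytesIO(wav_bytes)) as wf:
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# One second of a 440 Hz tone at 16 kHz, standing in for a real recording.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    pcm = b"".join(
        struct.pack("<h", int(16000 * math.sin(2 * math.pi * 440 * i / 16000)))
        for i in range(16000)
    )
    wf.writeframes(pcm)

floats = wav_to_floats(buf.getvalue())
print(len(floats))  # 16000 samples, ready for the "audio_content" column
```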
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab52| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: NER Pipeline for Temporal Mentions - Voice of the Patient author: John Snow Labs name: ner_vop_temporal_pipeline date: 2023-06-10 tags: [pipeline, licensed, temporal, ner, en, vop] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline extracts mentions of temporal entities from health-related text in colloquial language. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_temporal_pipeline_en_4.4.3_3.0_1686423272687.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_temporal_pipeline_en_4.4.3_3.0_1686423272687.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_vop_temporal_pipeline", "en", "clinical/models") pipeline.annotate(" I broke my arm playing football last month and had to get surgery in the orthopedic department. The cast just came off yesterday and I'm excited to start physical therapy and get back to the game.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_vop_temporal_pipeline", "en", "clinical/models") val result = pipeline.annotate(" I broke my arm playing football last month and had to get surgery in the orthopedic department. The cast just came off yesterday and I'm excited to start physical therapy and get back to the game.") ```
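The `annotate` call above returns plain Python structures, so results can be post-processed without Spark. A hypothetical sketch pairing chunks with labels — the dictionary below imitates the shape of the output and is not captured from a real run:

```python
# Stand-in for the output of pipeline.annotate(...); the key names and
# values here are assumptions mirroring the Results table.
annotations = {
    "ner_chunk": ["last month", "yesterday"],
    "ner_label": ["DateTime", "DateTime"],
}

# Pair each detected chunk with its predicted label.
pairs = list(zip(annotations["ner_chunk"], annotations["ner_label"]))
print(pairs)  # [('last month', 'DateTime'), ('yesterday', 'DateTime')]
```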
## Results ```bash | chunk | ner_label | |:-----------|:------------| | last month | DateTime | | yesterday | DateTime | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_temporal_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|791.6 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English image_classifier_vit_vc_bantai__withoutAMBI_adunest_v2 ViTForImageClassification from AykeeSalazar author: John Snow Labs name: image_classifier_vit_vc_bantai__withoutAMBI_adunest_v2 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_vc_bantai__withoutAMBI_adunest_v2` is a English model originally trained by AykeeSalazar. ## Predicted Entities `nonViolation`, `publicDrinking`, `publicSmoking` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vc_bantai__withoutAMBI_adunest_v2_en_4.1.0_3.0_1660166245519.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vc_bantai__withoutAMBI_adunest_v2_en_4.1.0_3.0_1660166245519.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_vc_bantai__withoutAMBI_adunest_v2", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_vc_bantai__withoutAMBI_adunest_v2", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_vc_bantai__withoutAMBI_adunest_v2|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|

---
layout: model
title: Legal Restrictions on transfer Clause Binary Classifier
author: John Snow Labs
name: legclf_restrictions_on_transfer_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `restrictions-on-transfer` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.

If you have big legal documents and you want to look for clauses, we recommend you split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities `other`, `restrictions-on-transfer` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_restrictions_on_transfer_clause_en_1.0.0_3.2_1660122961388.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_restrictions_on_transfer_clause_en_1.0.0_3.2_1660122961388.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_restrictions_on_transfer_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
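The description above recommends splitting long documents into clause-sized pieces before classification. A minimal sketch of "paragraph splitting (by multiline)" — the regex and the sample contract are assumptions for illustration, not the tutorial's exact code:

```python
import re

contract = """Section 1. Restrictions on Transfer.
No party may transfer its interest without the prior written consent of the others.

Section 2. Governing Law.
This Agreement is governed by the laws of the State of Delaware."""

# Split on blank lines ("by multiline") and drop empty fragments; each
# paragraph then becomes one row of the `clause_text` column above.
paragraphs = [p.strip() for p in re.split(r"\n\s*\n", contract) if p.strip()]
print(len(paragraphs))  # 2
```

Each paragraph can then be fed to the classifier via `spark.createDataFrame([[p] for p in paragraphs]).toDF("clause_text")`.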
## Results

```bash
+--------------------------+
|result                    |
+--------------------------+
|[restrictions-on-transfer]|
|[other]                   |
|[other]                   |
|[restrictions-on-transfer]|
+--------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_restrictions_on_transfer_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
                   label  precision  recall  f1-score  support
                   other       0.93    0.96      0.94       92
restrictions-on-transfer       0.85    0.77      0.81       30
                accuracy          -       -      0.91      122
               macro-avg       0.89    0.86      0.87      122
            weighted-avg       0.91    0.91      0.91      122
```

---
layout: model
title: Detect Persons, Locations, Organizations and Misc Entities in Italian (WikiNER 6B 300)
author: John Snow Labs
name: wikiner_6B_300
date: 2020-02-03
task: Named Entity Recognition
language: it
edition: Spark NLP 2.4.0
spark_version: 2.4
tags: [ner, it, open_source]
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

{:.h2_title}
## Description

WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 6B 300 is trained with GloVe 6B 300 word embeddings, so be sure to use the same embeddings in the pipeline.

{:.h2_title}
## Predicted Entities

Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_IT){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_IT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_6B_300_it_2.4.0_2.4_1579717534334.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_6B_300_it_2.4.0_2.4_1579717534334.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang="xx") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")
ner_model = NerDLModel.pretrained("wikiner_6B_300", "it") \
    .setInputCols(["document", "token", "embeddings"]) \
    .setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (nato il 28 ottobre 1955) è un magnate d'affari americano, sviluppatore di software, investitore e filantropo. È noto soprattutto come co-fondatore di Microsoft Corporation. Durante la sua carriera in Microsoft, Gates ha ricoperto le posizioni di presidente, amministratore delegato (CEO), presidente e capo architetto del software, pur essendo il principale azionista individuale fino a maggio 2014. È uno dei più noti imprenditori e pionieri del rivoluzione dei microcomputer degli anni '70 e '80. Nato e cresciuto a Seattle, Washington, Gates ha co-fondato Microsoft con l'amico d'infanzia Paul Allen nel 1975, ad Albuquerque, nel New Mexico; divenne la più grande azienda di software per personal computer al mondo. Gates ha guidato l'azienda come presidente e CEO fino a quando non si è dimesso da CEO nel gennaio 2000, ma è rimasto presidente e divenne capo architetto del software. Alla fine degli anni '90, Gates era stato criticato per le sue tattiche commerciali, che erano state considerate anticoncorrenziali. Questa opinione è stata confermata da numerose sentenze giudiziarie. Nel giugno 2006, Gates ha annunciato che sarebbe passato a un ruolo part-time presso Microsoft e un lavoro a tempo pieno presso la Bill & Melinda Gates Foundation, la fondazione di beneficenza privata che lui e sua moglie, Melinda Gates, hanno fondato nel 2000.
A poco a poco trasferì i suoi doveri a Ray Ozzie e Craig Mundie. Si è dimesso da presidente di Microsoft nel febbraio 2014 e ha assunto un nuovo incarico come consulente tecnologico per supportare il neo nominato CEO Satya Nadella."]], ["text"]))
```
```scala
...
val embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang="xx")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("wikiner_6B_300", "it")
  .setInputCols(Array("document", "token", "embeddings"))
  .setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))
val data = Seq("William Henry Gates III (nato il 28 ottobre 1955) è un magnate d'affari americano, sviluppatore di software, investitore e filantropo. È noto soprattutto come co-fondatore di Microsoft Corporation. Durante la sua carriera in Microsoft, Gates ha ricoperto le posizioni di presidente, amministratore delegato (CEO), presidente e capo architetto del software, pur essendo il principale azionista individuale fino a maggio 2014. È uno dei più noti imprenditori e pionieri del rivoluzione dei microcomputer degli anni '70 e '80. Nato e cresciuto a Seattle, Washington, Gates ha co-fondato Microsoft con l'amico d'infanzia Paul Allen nel 1975, ad Albuquerque, nel New Mexico; divenne la più grande azienda di software per personal computer al mondo. Gates ha guidato l'azienda come presidente e CEO fino a quando non si è dimesso da CEO nel gennaio 2000, ma è rimasto presidente e divenne capo architetto del software. Alla fine degli anni '90, Gates era stato criticato per le sue tattiche commerciali, che erano state considerate anticoncorrenziali. Questa opinione è stata confermata da numerose sentenze giudiziarie. 
Nel giugno 2006, Gates ha annunciato che sarebbe passato a un ruolo part-time presso Microsoft e un lavoro a tempo pieno presso la Bill & Melinda Gates Foundation, la fondazione di beneficenza privata che lui e sua moglie, Melinda Gates, hanno fondato nel 2000. A poco a poco trasferì i suoi doveri a Ray Ozzie e Craig Mundie. Si è dimesso da presidente di Microsoft nel febbraio 2014 e ha assunto un nuovo incarico come consulente tecnologico per supportare il neo nominato CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (nato il 28 ottobre 1955) è un magnate d'affari americano, sviluppatore di software, investitore e filantropo. È noto soprattutto come co-fondatore di Microsoft Corporation. Durante la sua carriera in Microsoft, Gates ha ricoperto le posizioni di presidente, amministratore delegato (CEO), presidente e capo architetto del software, pur essendo il principale azionista individuale fino a maggio 2014. È uno dei più noti imprenditori e pionieri del rivoluzione dei microcomputer degli anni '70 e '80. Nato e cresciuto a Seattle, Washington, Gates ha co-fondato Microsoft con l'amico d'infanzia Paul Allen nel 1975, ad Albuquerque, nel New Mexico; divenne la più grande azienda di software per personal computer al mondo. Gates ha guidato l'azienda come presidente e CEO fino a quando non si è dimesso da CEO nel gennaio 2000, ma è rimasto presidente e divenne capo architetto del software. Alla fine degli anni '90, Gates era stato criticato per le sue tattiche commerciali, che erano state considerate anticoncorrenziali. Questa opinione è stata confermata da numerose sentenze giudiziarie. Nel giugno 2006, Gates ha annunciato che sarebbe passato a un ruolo part-time presso Microsoft e un lavoro a tempo pieno presso la Bill & Melinda Gates Foundation, la fondazione di beneficenza privata che lui e sua moglie, Melinda Gates, hanno fondato nel 2000. 
A poco a poco trasferì i suoi doveri a Ray Ozzie e Craig Mundie. Si è dimesso da presidente di Microsoft nel febbraio 2014 e ha assunto un nuovo incarico come consulente tecnologico per supportare il neo nominato CEO Satya Nadella."""] ner_df = nlu.load('it.ner.wikiner.glove.6B_300').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
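As a quick post-processing sketch (illustrative only, not part of the model), the (chunk, ner_label) pairs this pipeline produces can be tallied per label with plain Python. The sample pairs below are hand-copied from the Results section, not produced by a live run:

```python
from collections import Counter

# Hypothetical (chunk, ner_label) pairs mirroring the Results table;
# a real run would read these from the pipeline's NER output columns.
predictions = [
    ("William Henry Gates III", "PER"),
    ("Microsoft Corporation", "ORG"),
    ("Seattle", "LOC"),
    ("Paul Allen", "PER"),
    ("Microsoft", "ORG"),
]

# Tally how many chunks received each WikiNER label.
label_counts = Counter(label for _, label in predictions)
print(label_counts["PER"])  # → 2
```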
{:.h2_title} ## Results ```bash +-------------------------------+---------+ |chunk |ner_label| +-------------------------------+---------+ |William Henry Gates III |PER | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |MISC | |Nato |ORG | |Seattle |LOC | |Washington |LOC | |Gates |PER | |Microsoft |ORG | |Paul Allen |PER | |Albuquerque |LOC | |New Mexico |LOC | |Gates |MISC | |Gates |PER | |Gates |PER | |Microsoft |ORG | |Bill & Melinda Gates Foundation|MISC | |Melinda Gates |PER | |Ray Ozzie |PER | |Craig Mundie |PER | +-------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wikiner_6B_300| |Type:|ner| |Compatibility:| Spark NLP 2.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|it| |Case sensitive:|false| {:.h2_title} ## Data Source The model was trained on data from [https://it.wikipedia.org](https://it.wikipedia.org) --- layout: model title: Legal Distributive Trades Document Classifier (EURLEX) author: John Snow Labs name: legclf_distributive_trades_bert date: 2023-03-06 tags: [en, legal, classification, clauses, distributive_trades, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the `legclf_distributive_trades_bert` model, a BERT Sentence Embeddings Document Classifier, determines whether the document belongs to the Distributive_Trades class or not (binary classification) according to EuroVoc labels. 
## Predicted Entities `Distributive_Trades`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_distributive_trades_bert_en_1.0.0_3.0_1678111781654.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_distributive_trades_bert_en_1.0.0_3.0_1678111781654.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_distributive_trades_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Distributive_Trades]| |[Other]| |[Other]| |[Distributive_Trades]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_distributive_trades_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.3 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Distributive_Trades 0.87 0.80 0.84 51 Other 0.80 0.87 0.83 45 accuracy - - 0.83 96 macro-avg 0.83 0.84 0.83 96 weighted-avg 0.84 0.83 0.83 96 ``` --- layout: model title: Legal Security Clause Binary Classifier author: John Snow Labs name: legclf_security_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `security` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. 
If you have more than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `security` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_security_clause_en_1.0.0_3.2_1660123976848.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_security_clause_en_1.0.0_3.2_1660123976848.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_security_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
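The paragraph-splitting strategy recommended above can be sketched in plain Python before any Spark NLP code runs. The sample document and the whitespace-token proxy for the 512-token embedding limit are illustrative assumptions, not part of the model:

```python
import re

# A toy legal document; real input would come from your own files.
document = """SECTION 1. SECURITY.
The Collateral shall secure the Obligations.

SECTION 2. NOTICES.
All notices shall be in writing."""

# Paragraph splitting "by multiline": split on one or more blank lines.
paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

for p in paragraphs:
    # Whitespace tokens are only a rough proxy for wordpiece tokens,
    # but they flag paragraphs at risk of hitting the 512-token limit.
    if len(p.split()) > 512:
        print("WARNING: paragraph may be truncated by the embedder")

print(len(paragraphs))  # → 2
```

Each resulting paragraph can then be fed to the classifier as a separate `clause_text` row.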
## Results ```bash +-------+ | result| +-------+ |[security]| |[other]| |[other]| |[security]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_security_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.94 0.97 0.95 119 security 0.94 0.86 0.90 57 accuracy - - 0.94 176 macro-avg 0.94 0.92 0.93 176 weighted-avg 0.94 0.94 0.94 176 ``` --- layout: model title: Pipeline to Detect Medication Entities, Assign Assertion Status and Find Relations author: John Snow Labs name: explain_clinical_doc_medication date: 2022-08-10 tags: [licensed, clinical, ner, en, assertion, relation_extraction, posology, medication] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pipeline for detecting posology entities with the `ner_posology_large` NER model, assigning their assertion status with the `assertion_jsl` model, and extracting relations between posology-related terminology with the `posology_re` relation extraction model. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.Pretrained_Clinical_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_medication_en_4.0.0_3.0_1660153472886.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_medication_en_4.0.0_3.0_1660153472886.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("explain_clinical_doc_medication", "en", "clinical/models") result = pipeline.fullAnnotate("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2. She received a course of Bactrim for 14 days for UTI. She was prescribed 5000 units of Fragmin subcutaneously daily, and along with Lantus 40 units subcutaneously at bedtime.""")[0] ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("explain_clinical_doc_medication", "en", "clinical/models") val result = pipeline.fullAnnotate("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2. She received a course of Bactrim for 14 days for UTI. She was prescribed 5000 units of Fragmin subcutaneously daily, and along with Lantus 40 units subcutaneously at bedtime.""")(0) ``` {:.nlu-block} ```python import nlu nlu.load("en.explain_dco.clinical_medication.pipeline").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2. She received a course of Bactrim for 14 days for UTI. She was prescribed 5000 units of Fragmin subcutaneously daily, and along with Lantus 40 units subcutaneously at bedtime.""") ```
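A hedged sketch of post-processing: once the `fullAnnotate` output has been flattened into plain (chunk, entity) tuples, the DRUG mentions can be filtered with a list comprehension. The sample rows below are hand-copied from the Results section, not generated by running the pipeline:

```python
# Illustrative (chunk, entity) tuples mirroring the NER table in Results.
ner_chunks = [
    ("insulin", "DRUG"), ("Bactrim", "DRUG"), ("for 14 days", "DURATION"),
    ("5000 units", "DOSAGE"), ("Fragmin", "DRUG"), ("subcutaneously", "ROUTE"),
]

# Keep only the drug mentions.
drugs = [chunk for chunk, label in ner_chunks if label == "DRUG"]
print(drugs)  # → ['insulin', 'Bactrim', 'Fragmin']
```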
## Results ```bash +----+----------------+------------+ | | chunks | entities | |---:|:---------------|:-----------| | 0 | insulin | DRUG | | 1 | Bactrim | DRUG | | 2 | for 14 days | DURATION | | 3 | 5000 units | DOSAGE | | 4 | Fragmin | DRUG | | 5 | subcutaneously | ROUTE | | 6 | daily | FREQUENCY | | 7 | Lantus | DRUG | | 8 | 40 units | DOSAGE | | 9 | subcutaneously | ROUTE | | 10 | at bedtime | FREQUENCY | +----+----------------+------------+ +----+----------+------------+-------------+ | | chunks | entities | assertion | |---:|:---------|:-----------|:------------| | 0 | insulin | DRUG | Present | | 1 | Bactrim | DRUG | Past | | 2 | Fragmin | DRUG | Planned | | 3 | Lantus | DRUG | Planned | +----+----------+------------+-------------+ +----------------+-----------+------------+-----------+----------------+ | relation | entity1 | chunk1 | entity2 | chunk2 | |:---------------|:----------|:-----------|:----------|:---------------| | DRUG-DURATION | DRUG | Bactrim | DURATION | for 14 days | | DOSAGE-DRUG | DOSAGE | 5000 units | DRUG | Fragmin | | DRUG-ROUTE | DRUG | Fragmin | ROUTE | subcutaneously | | DRUG-FREQUENCY | DRUG | Fragmin | FREQUENCY | daily | | DRUG-DOSAGE | DRUG | Lantus | DOSAGE | 40 units | | DRUG-ROUTE | DRUG | Lantus | ROUTE | subcutaneously | | DRUG-FREQUENCY | DRUG | Lantus | FREQUENCY | at bedtime | +----------------+-----------+------------+-----------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_clinical_doc_medication| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - NerConverterInternalModel - AssertionDLModel - PerceptronModel - DependencyParserModel - PosologyREModel --- layout: model title: Medical Spell Checker Pipeline author: John Snow Labs name: 
spellcheck_clinical_pipeline date: 2023-03-31 tags: [spellcheck, medical, medical_spell_check, spell_corrector, spell_pipeline, en, licensed, clinical] task: Pipeline Healthcare language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained medical spellchecker pipeline is built on top of the `spellcheck_clinical` model. This pipeline is for PySpark 2.4.x users with Spark NLP 3.4.2 and above. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CONTEXTUAL_SPELL_CHECKER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_pipeline_en_4.3.2_3.2_1680277489830.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_pipeline_en_4.3.2_3.2_1680277489830.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("spellcheck_clinical_pipeline", "en", "clinical/models") example = ["Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.", "With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.", "Abdomen is sort, nontender, and nonintended.", "Patient not showing pain or any wealth problems.", "No cute distress"] pipeline.fullAnnotate(example) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("spellcheck_clinical_pipeline", "en", "clinical/models") val example = Array("Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.", "With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.", "Abdomen is sort, nontender, and nonintended.", "Patient not showing pain or any wealth problems.", "No cute distress") pipeline.fullAnnotate(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.spell.clinical.pipeline").predict("""Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.""") ```
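Assuming the pipeline returns parallel `token` and `checked` lists as shown in the Results section, the actual corrections can be recovered by zipping the two. The lists below are a hand-copied sample, not live output:

```python
# Parallel lists mirroring one sentence from the Results section.
tokens  = ["Witth", "the", "hell", "of", "phisical", "terapy"]
checked = ["With",  "the", "cell", "of", "physical", "therapy"]

# Pair each original token with its correction; keep only the changes.
corrections = [(t, c) for t, c in zip(tokens, checked) if t != c]
print(corrections)
# → [('Witth', 'With'), ('hell', 'cell'), ('phisical', 'physical'), ('terapy', 'therapy')]
```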
## Results ```bash [{'checked': ['With','the','cell','of','physical','therapy','the','patient','was','ambulated','and','on','postoperative',',','the','patient','tolerating','a','post','surgical','soft','diet','.'], 'document': ['Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.'], 'token': ['Witth','the','hell','of','phisical','terapy','the','patient','was','imbulated','and','on','postoperative',',','the','impatient','tolerating','a','post','curgical','soft','diet','.']}, {'checked': ['With','pain','well','controlled','on','oral','pain','medications',',','she','was','discharged','to','rehabilitation','facility','.'], 'document': ['With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.'], 'token': ['With','paint','wel','controlled','on','orall','pain','medications',',','she','was','discharged','too','reihabilitation','facilitay','.']}, {'checked': ['Abdomen','is','soft',',','nontender',',','and','nondistended','.'], 'document': ['Abdomen is sort, nontender, and nonintended.'], 'token': ['Abdomen','is','sort',',','nontender',',','and','nonintended','.']}, {'checked': ['Patient','not','showing','pain','or','any','health','problems','.'], 'document': ['Patient not showing pain or any wealth problems.'], 'token': ['Patient','not','showing','pain','or','any','wealth','problems','.']}, {'checked': ['No', 'acute', 'distress'], 'document': ['No cute distress'], 'token': ['No', 'cute', 'distress']}] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|spellcheck_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|100.1 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - ContextSpellCheckerModel --- layout: model title: Czech BertForMaskedLM Cased model (from fav-kky) author: John Snow Labs name: 
bert_embeddings_fernet_c5 date: 2022-12-02 tags: [cs, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: cs edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `FERNET-C5` is a Czech model originally trained by `fav-kky`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_fernet_c5_cs_4.2.4_3.0_1670015195520.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_fernet_c5_cs_4.2.4_3.0_1670015195520.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_fernet_c5","cs") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_fernet_c5","cs") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
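Downstream, token embeddings like the ones this model produces are commonly compared with cosine similarity. A minimal pure-Python sketch, with two toy 2-dimensional vectors standing in for the model's real dense vectors:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Identical directions score 1.0; orthogonal directions score 0.0.
print(round(cosine([1.0, 0.0], [1.0, 0.0]), 2))  # → 1.0
```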
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_fernet_c5| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|cs| |Size:|612.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/fav-kky/FERNET-C5 - https://arxiv.org/abs/2107.10042 --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from thatdramebaazguy) author: John Snow Labs name: roberta_qa_base_mitmovie_squad date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-MITmovie-squad` is an English model originally trained by `thatdramebaazguy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_mitmovie_squad_en_4.2.4_3.0_1669985621604.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_mitmovie_squad_en_4.2.4_3.0_1669985621604.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_mitmovie_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_mitmovie_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_mitmovie_squad| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|461.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/thatdramebaazguy/roberta-base-MITmovie-squad - https://github.com/ibm-aur-nlp/domain-specific-QA - https://github.com/adityaarunsinghal/Domain-Adaptation/blob/master/scripts/shell_scripts/movieR_NER_squad.sh - https://github.com/adityaarunsinghal/Domain-Adaptation/ --- layout: model title: Detect Assertion Status from Smoking Status Entity author: John Snow Labs name: assertion_oncology_smoking_status_wip date: 2022-10-11 tags: [licensed, clinical, oncology, en, assertion] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects the assertion status of the Smoking_Status entity. It classifies extractions as Present, Past or Absent. 
## Predicted Entities `Absent`, `Past`, `Present` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_smoking_status_wip_en_4.0.0_3.0_1665522281153.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_smoking_status_wip_en_4.0.0_3.0_1665522281153.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(["Smoking_Status"]) assertion = AssertionDLModel.pretrained("assertion_oncology_smoking_status_wip", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion]) data = spark.createDataFrame([["The patient quit smoking three years ago."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) 
.setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Smoking_Status")) val assertion = AssertionDLModel.pretrained("assertion_oncology_smoking_status_wip","en","clinical/models") .setInputCols(Array("sentence","ner_chunk","embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion)) val data = Seq("""The patient quit smoking three years ago.""").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.oncology_smoking_status").predict("""The patient quit smoking three years ago.""") ```
## Results ```bash | chunk | ner_label | assertion | |:--------|:---------------|:------------| | smoking | Smoking_Status | Past | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_oncology_smoking_status_wip| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion_pred]| |Language:|en| |Size:|1.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label precision recall f1-score support Absent 0.58 0.94 0.71 16.0 Past 0.88 0.65 0.75 23.0 Present 0.80 0.57 0.67 14.0 macro-avg 0.75 0.72 0.71 53.0 weighted-avg 0.77 0.72 0.72 53.0 ``` --- layout: model title: Chinese BertForMaskedLM Base Cased model (from model-attribution-challenge) author: John Snow Labs name: bert_embeddings_model_attribution_challenge_base_chinese date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-chinese` is a Chinese model originally trained by `model-attribution-challenge`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_model_attribution_challenge_base_chinese_zh_4.2.4_3.0_1670016400120.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_model_attribution_challenge_base_chinese_zh_4.2.4_3.0_1670016400120.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_model_attribution_challenge_base_chinese","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_model_attribution_challenge_base_chinese","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_model_attribution_challenge_base_chinese| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|383.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/model-attribution-challenge/bert-base-chinese - https://aclanthology.org/2021.acl-long.330.pdf - https://dl.acm.org/doi/pdf/10.1145/3442188.3445922 --- layout: model title: Pipeline for Adverse Drug Events author: John Snow Labs name: explain_clinical_doc_ade date: 2023-04-20 tags: [en, clinical, licensed, ade, pipeline] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pipeline for Adverse Drug Events (ADE) with `ner_ade_biobert`, `assertion_dl_biobert`, `classifierdl_ade_conversational_biobert`, and `re_ade_biobert` . It will classify the document, extract ADE and DRUG clinical entities, assign assertion status to ADE entities, and relate Drugs with their ADEs. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_ade_en_4.3.0_3.2_1682021368267.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_ade_en_4.3.0_3.2_1682021368267.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("explain_clinical_doc_ade", "en", "clinical/models") text = """Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""" result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("explain_clinical_doc_ade", "en", "clinical/models") val text = """Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""" val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.explain_doc.clinical_ade").predict("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""") ```
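The relation-extraction stage emits one row per candidate ADE–DRUG pair with a 1/0 relation flag, as in the Results section. A hedged plain-Python sketch of keeping only confirmed pairs (the rows are simulated as dicts here; the real pipeline returns Spark NLP annotation objects, so this is purely a post-processing illustration):

```python
# Simulated relation-extraction rows mirroring the Results table;
# real output comes back as Spark NLP annotations, not dicts.
rows = [
    {"chunk1": "severe fatigue", "chunk2": "Lipitor",  "relation": "1"},
    {"chunk1": "cramps",         "chunk2": "Lipitor",  "relation": "0"},
    {"chunk1": "severe fatigue", "chunk2": "voltaren", "relation": "0"},
    {"chunk1": "cramps",         "chunk2": "voltaren", "relation": "1"},
]

def confirmed_pairs(rows):
    """Keep only the (ADE, DRUG) pairs the model marked as related."""
    return [(r["chunk1"], r["chunk2"]) for r in rows if r["relation"] == "1"]

print(confirmed_pairs(rows))  # [('severe fatigue', 'Lipitor'), ('cramps', 'voltaren')]
```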
## Results ```bash Class: True NER_Assertion: | | chunk | entity | assertion | |----|-------------------------|------------|-------------| | 0 | Lipitor | DRUG | - | | 1 | severe fatigue | ADE | Conditional | | 2 | voltaren | DRUG | - | | 3 | cramps | ADE | Conditional | Relations: | | chunk1 | entity1 | chunk2 | entity2 | relation | |----|-------------------------------|------------|-------------|---------|----------| | 0 | severe fatigue | ADE | Lipitor | DRUG | 1 | | 1 | cramps | ADE | Lipitor | DRUG | 0 | | 2 | severe fatigue | ADE | voltaren | DRUG | 0 | | 3 | cramps | ADE | voltaren | DRUG | 1 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_clinical_doc_ade| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|485.0 MB| ## Included Models - DocumentAssembler - TokenizerModel - BertEmbeddings - SentenceEmbeddings - ClassifierDLModel - MedicalNerModel - NerConverterInternalModel - PerceptronModel - DependencyParserModel - RelationExtractionModel - NerConverterInternalModel - AssertionDLModel --- layout: model title: Spanish RobertaForQuestionAnswering Large Cased model (from PlanTL-GOB-ES) author: John Snow Labs name: roberta_qa_plantl_gob_es_large_bne_s_c date: 2022-12-02 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-bne-sqac` is a Spanish model originally trained by `PlanTL-GOB-ES`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_plantl_gob_es_large_bne_s_c_es_4.2.4_3.0_1669987149868.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_plantl_gob_es_large_bne_s_c_es_4.2.4_3.0_1669987149868.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_plantl_gob_es_large_bne_s_c","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_plantl_gob_es_large_bne_s_c","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
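Extractive QA models like this one score every context token as a potential answer start and end, and return the highest-scoring valid span. A toy illustration of that selection step (this is a simplified sketch, not the actual Spark NLP internals; scores are invented):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick (start, end) maximizing start+end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 2.5, 0.0, 0.0, 0.1, 0.0, 0.3, 0.0]
end   = [0.0, 0.1, 0.0, 2.1, 0.0, 0.0, 0.0, 0.1, 0.4, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```

Real implementations also normalize the scores and forbid spans that cross the question/context boundary; those details are omitted here.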
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_plantl_gob_es_large_bne_s_c| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|es| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne-sqac - https://arxiv.org/abs/1907.11692 - http://www.bne.es/en/Inicio/index.html - https://github.com/PlanTL-GOB-ES/lm-spanish - https://www.apache.org/licenses/LICENSE-2.0 - http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405 - https://paperswithcode.com/sota?task=question-answering&dataset=SQAC --- layout: model title: Word2Vec Embeddings in Tamil (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, ta, open_source] task: Embeddings language: ta edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ta_3.4.1_3.0_1647462039197.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ta_3.4.1_3.0_1647462039197.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ta") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["நான் தீப்பொறி NLP ஐ நேசிக்கிறேன்"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ta") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("நான் தீப்பொறி NLP ஐ நேசிக்கிறேன்").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ta.embed.w2v_cc_300d").predict("""நான் தீப்பொறி NLP ஐ நேசிக்கிறேன்""") ```
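With static word vectors like these, semantic similarity between two tokens is typically measured as the cosine of the angle between their 300-dimensional vectors from the `embeddings` column. A minimal plain-Python sketch with toy 3-dimensional vectors (not real model output):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction -> similarity ~1.0
c = [-3.0, 0.0, 1.0]  # dot product 0 -> similarity 0.0
print(round(cosine(a, b), 3))  # 1.0
print(round(cosine(a, c), 3))  # 0.0
```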
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|ta| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English DistilBertForQuestionAnswering model (from SupriyaArun) author: John Snow Labs name: distilbert_qa_SupriyaArun_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is a English model originally trained by `SupriyaArun`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_SupriyaArun_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724707818.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_SupriyaArun_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724707818.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_SupriyaArun_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_SupriyaArun_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_SupriyaArun").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
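In the NLU one-liner above, question and context are passed as a single string joined by `|||`. If you build many such inputs, a tiny helper keeps the convention in one place (the separator is inferred from the snippet; treat it as an assumption about the NLU input format rather than documented API):

```python
SEP = "|||"  # separator used in the nlu snippet above (assumed convention)

def make_qa_input(question, context):
    """Join a question and its context the way the NLU snippet expects."""
    return f"{question}{SEP}{context}"

print(make_qa_input("What is my name?", "My name is Clara and I live in Berkeley."))
# What is my name?|||My name is Clara and I live in Berkeley.
```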
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_SupriyaArun_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/SupriyaArun/distilbert-base-uncased-finetuned-squad --- layout: model title: NER Pipeline for Clinical Problems - Voice of the Patient author: John Snow Labs name: ner_vop_problem_pipeline date: 2023-06-10 tags: [licensed, pipeline, en, ner, vop, problem] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline extracts mentions of clinical problems from health-related text in colloquial language. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_pipeline_en_4.4.3_3.0_1686413426747.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_pipeline_en_4.4.3_3.0_1686413426747.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_vop_problem_pipeline", "en", "clinical/models") pipeline.annotate(" I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms. ") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_vop_problem_pipeline", "en", "clinical/models") val result = pipeline.annotate(" I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms. ") ```
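Pipelines like this one end with a `NerConverter` stage that merges token-level IOB tags into the chunks shown in the Results section. A simplified stand-alone sketch of that merging logic (not the Spark NLP implementation; it ignores edge cases such as label changes inside a chunk):

```python
def merge_bio(tokens, tags):
    """Merge (token, IOB-tag) pairs into (chunk, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # "O" (or a stray I- tag) closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["diagnosed", "me", "with", "rheumatoid", "arthritis"]
tags   = ["O", "O", "O", "B-Disease", "I-Disease"]
print(merge_bio(tokens, tags))  # [('rheumatoid arthritis', 'Disease')]
```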
## Results ```bash | chunk | ner_label | |:---------------------|:------------| | pain | Symptom | | fatigue | Symptom | | rheumatoid arthritis | Disease | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_problem_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|791.6 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Detect Persons, Locations, Organizations and Misc Entities in Spanish (WikiNER 6B 300) author: John Snow Labs name: wikiner_6B_300 date: 2020-02-03 task: Named Entity Recognition language: es edition: Spark NLP 2.4.0 spark_version: 2.4 tags: [ner, es, open_source] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 6B 300 is trained with GloVe 6B 300 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_ES){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_ES.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_6B_300_es_2.4.0_2.4_1581971942090.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_6B_300_es_2.4.0_2.4_1581971942090.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang="xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("wikiner_6B_300", "es") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (nacido el 28 de octubre de 1955) es un magnate de los negocios, desarrollador de software, inversor y filántropo estadounidense. Es mejor conocido como el cofundador de Microsoft Corporation. Durante su carrera en Microsoft, Gates ocupó los cargos de presidente, director ejecutivo (CEO), presidente y arquitecto de software en jefe, y también fue el mayor accionista individual hasta mayo de 2014. Es uno de los empresarios y pioneros más conocidos de revolución de la microcomputadora de los años setenta y ochenta. Nacido y criado en Seattle, Washington, Gates cofundó Microsoft con su amigo de la infancia Paul Allen en 1975, en Albuquerque, Nuevo México; se convirtió en la compañía de software de computadora personal más grande del mundo. Gates dirigió la compañía como presidente y CEO hasta que dejó el cargo de CEO en enero de 2000, pero siguió siendo presidente y se convirtió en el arquitecto jefe de software. A fines de la década de 1990, Gates había sido criticado por sus tácticas comerciales, que se han considerado anticompetitivas. Esta opinión ha sido confirmada por numerosas sentencias judiciales. 
En junio de 2006, Gates anunció que haría la transición a un puesto de medio tiempo en Microsoft y trabajaría a tiempo completo en la Fundación Bill y Melinda Gates, la fundación caritativa privada que él y su esposa, Melinda Gates, establecieron en 2000. Poco a poco transfirió sus deberes a Ray Ozzie y Craig Mundie. Renunció como presidente de Microsoft en febrero de 2014 y asumió un nuevo cargo como asesor tecnológico para apoyar al recién nombrado CEO Satya Nadella.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang="xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("wikiner_6B_300", "es") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (nacido el 28 de octubre de 1955) es un magnate de los negocios, desarrollador de software, inversor y filántropo estadounidense. Es mejor conocido como el cofundador de Microsoft Corporation. Durante su carrera en Microsoft, Gates ocupó los cargos de presidente, director ejecutivo (CEO), presidente y arquitecto de software en jefe, y también fue el mayor accionista individual hasta mayo de 2014. Es uno de los empresarios y pioneros más conocidos de revolución de la microcomputadora de los años setenta y ochenta. Nacido y criado en Seattle, Washington, Gates cofundó Microsoft con su amigo de la infancia Paul Allen en 1975, en Albuquerque, Nuevo México; se convirtió en la compañía de software de computadora personal más grande del mundo. Gates dirigió la compañía como presidente y CEO hasta que dejó el cargo de CEO en enero de 2000, pero siguió siendo presidente y se convirtió en el arquitecto jefe de software. 
A fines de la década de 1990, Gates había sido criticado por sus tácticas comerciales, que se han considerado anticompetitivas. Esta opinión ha sido confirmada por numerosas sentencias judiciales. En junio de 2006, Gates anunció que haría la transición a un puesto de medio tiempo en Microsoft y trabajaría a tiempo completo en la Fundación Bill y Melinda Gates, la fundación caritativa privada que él y su esposa, Melinda Gates, establecieron en 2000. Poco a poco transfirió sus deberes a Ray Ozzie y Craig Mundie. Renunció como presidente de Microsoft en febrero de 2014 y asumió un nuevo cargo como asesor tecnológico para apoyar al recién nombrado CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (nacido el 28 de octubre de 1955) es un magnate de los negocios, desarrollador de software, inversor y filántropo estadounidense. Es mejor conocido como el cofundador de Microsoft Corporation. Durante su carrera en Microsoft, Gates ocupó los cargos de presidente, director ejecutivo (CEO), presidente y arquitecto de software en jefe, y también fue el mayor accionista individual hasta mayo de 2014. Es uno de los empresarios y pioneros más conocidos de revolución de la microcomputadora de los años setenta y ochenta. Nacido y criado en Seattle, Washington, Gates cofundó Microsoft con su amigo de la infancia Paul Allen en 1975, en Albuquerque, Nuevo México; se convirtió en la compañía de software de computadora personal más grande del mundo. Gates dirigió la compañía como presidente y CEO hasta que dejó el cargo de CEO en enero de 2000, pero siguió siendo presidente y se convirtió en el arquitecto jefe de software. A fines de la década de 1990, Gates había sido criticado por sus tácticas comerciales, que se han considerado anticompetitivas. Esta opinión ha sido confirmada por numerosas sentencias judiciales. 
En junio de 2006, Gates anunció que haría la transición a un puesto de medio tiempo en Microsoft y trabajaría a tiempo completo en la Fundación Bill y Melinda Gates, la fundación caritativa privada que él y su esposa, Melinda Gates, establecieron en 2000. Poco a poco transfirió sus deberes a Ray Ozzie y Craig Mundie. Renunció como presidente de Microsoft en febrero de 2014 y asumió un nuevo cargo como asesor tecnológico para apoyar al recién nombrado CEO Satya Nadella."""] ner_df = nlu.load('es.ner.wikiner.glove.6B_300').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title} ## Results ```bash +----------------------------+---------+ |chunk |ner_label| +----------------------------+---------+ |William Henry Gates III |PER | |Microsoft Corporation |ORG | |Durante su carrera |MISC | |Microsoft |ORG | |Gates |PER | |CEO |ORG | |Nacido y criado |MISC | |Seattle |LOC | |Washington |LOC | |Gates |PER | |Microsoft |ORG | |Paul Allen |PER | |Albuquerque |LOC | |Nuevo México |LOC | |Gates |PER | |CEO |ORG | |CEO |ORG | |A fines de la década de 1990|MISC | |Gates |PER | |Esta opinión |MISC | +----------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wikiner_6B_300| |Type:|ner| |Compatibility:| Spark NLP 2.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| |Case sensitive:|false| {:.h2_title} ## Data Source The model is trained on data from [https://es.wikipedia.org](https://es.wikipedia.org) --- layout: model title: Legal Agricultural Policy Document Classifier (EURLEX) author: John Snow Labs name: legclf_agricultural_policy_bert date: 2023-03-06 tags: [en, legal, classification, clauses, agricultural_policy, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_agricultural_policy_bert model, a Bert Sentence Embeddings Document Classifier, classifies whether the document belongs to the class Agricultural_Policy or not (Binary Classification) according to EuroVoc labels.
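The underlying EUR-Lex data is multi-label (each law carries several EuroVoc concepts); a binary classifier like this one is typically derived by collapsing those labels into target-vs-Other. A hedged sketch of that relabeling step (the label lists below are illustrative, not taken from the actual training data):

```python
def to_binary(eurovoc_labels, target="Agricultural_Policy"):
    """Collapse a multi-label EuroVoc annotation into a binary label."""
    return target if target in eurovoc_labels else "Other"

docs = [
    ["Agricultural_Policy", "Trade"],  # carries the target concept
    ["Environment"],                   # does not
]
print([to_binary(labels) for labels in docs])  # ['Agricultural_Policy', 'Other']
```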
## Predicted Entities `Agricultural_Policy`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_agricultural_policy_bert_en_1.0.0_3.0_1678111696003.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_agricultural_policy_bert_en_1.0.0_3.0_1678111696003.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_agricultural_policy_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Agricultural_Policy]| |[Other]| |[Other]| |[Agricultural_Policy]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_agricultural_policy_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.0 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Agricultural_Policy 0.84 0.90 0.87 980 Other 0.86 0.79 0.82 796 accuracy - - 0.85 1776 macro-avg 0.85 0.84 0.84 1776 weighted-avg 0.85 0.85 0.85 1776 ``` --- layout: model title: Multilingual (English, German, Spanish) DistilBertForQuestionAnswering model (from ZYW) author: John Snow Labs name: distilbert_qa_en_de_es_model date: 2022-06-08 tags: [en, de, es, open_source, distilbert, question_answering, xx] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `en-de-es-model` is a multilingual model originally trained by `ZYW`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_en_de_es_model_xx_4.0.0_3.0_1654728203267.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_en_de_es_model_xx_4.0.0_3.0_1654728203267.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_en_de_es_model","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_en_de_es_model","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("xx.answer_question.distil_bert.en_de_es_tuned.by_ZYW").predict("""PUT YOUR QUESTION HERE|||PUT YOUR CONTEXT HERE""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_en_de_es_model| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|505.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ZYW/en-de-es-model --- layout: model title: Legal Disclosures Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_disclosures_bert date: 2023-03-05 tags: [en, legal, classification, clauses, disclosures, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from the US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Disclosures` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc.
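The first technique listed above, paragraph splitting by multiline, can be approximated outside Spark NLP with a regular expression on blank lines; a minimal sketch (the Workshop notebook linked above shows the full approach):

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Clause 1. Disclosures...\n\nClause 2. Term and Termination...\n\n"
print(split_paragraphs(doc))
# ['Clause 1. Disclosures...', 'Clause 2. Term and Termination...']
```

Each resulting paragraph can then be fed to the classifier as its own row, keeping every input under the 512-token limit.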
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Disclosures`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_disclosures_bert_en_1.0.0_3.0_1678050000589.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_disclosures_bert_en_1.0.0_3.0_1678050000589.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_disclosures_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Disclosures]| |[Other]| |[Other]| |[Disclosures]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_disclosures_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Disclosures 0.97 0.86 0.91 70 Other 0.90 0.98 0.94 97 accuracy - - 0.93 167 macro-avg 0.94 0.92 0.92 167 weighted-avg 0.93 0.93 0.93 167 ``` --- layout: model title: Part of Speech for Polish author: John Snow Labs name: pos_ud_lfg date: 2020-05-03 18:18:00 +0800 task: Part of Speech Tagging language: pl edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [pos, pl] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_lfg_pl_2.5.0_2.4_1588518541171.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_lfg_pl_2.5.0_2.4_1588518541171.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_lfg", "pl") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Oprócz bycia królem północy, John Snow jest angielskim lekarzem i liderem w rozwoju anestezjologii i higieny medycznej.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_lfg", "pl") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Oprócz bycia królem północy, John Snow jest angielskim lekarzem i liderem w rozwoju anestezjologii i higieny medycznej.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Oprócz bycia królem północy, John Snow jest angielskim lekarzem i liderem w rozwoju anestezjologii i higieny medycznej."""] pos_df = nlu.load('pl.pos.ud_lfg').predict(text, output_level='token') pos_df ```
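`fullAnnotate` returns one annotation object per token in the `pos` column, with the tag in `result` and the original token in `metadata['word']`, as the Results section shows. A minimal plain-Python sketch (using dict-shaped annotations for illustration) of flattening that into (word, tag) pairs:

```python
def to_pairs(pos_annotations):
    """Pair each token with its predicted part-of-speech tag."""
    return [(a["metadata"]["word"], a["result"]) for a in pos_annotations]

sample = [
    {"result": "ADP", "metadata": {"word": "Oprócz"}},
    {"result": "NOUN", "metadata": {"word": "bycia"}},
]
pairs = to_pairs(sample)  # [("Oprócz", "ADP"), ("bycia", "NOUN")]
```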
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=5, result='ADP', metadata={'word': 'Oprócz'}), Row(annotatorType='pos', begin=7, end=11, result='NOUN', metadata={'word': 'bycia'}), Row(annotatorType='pos', begin=13, end=18, result='NOUN', metadata={'word': 'królem'}), Row(annotatorType='pos', begin=20, end=26, result='NOUN', metadata={'word': 'północy'}), Row(annotatorType='pos', begin=27, end=27, result='PUNCT', metadata={'word': ','}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_lfg| |Type:|pos| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|pl| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Legal NER for NDA (Names of the Parties Clauses) author: John Snow Labs name: legner_nda_names_of_parties date: 2023-04-10 tags: [en, legal, licensed, ner, nda, names_of_parties] task: Named Entity Recognition language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a NER model, aimed to be run **only** after detecting the `NAMES_OF_PARTIES` clause with a proper classifier (use `legmulticlf_mnda_sections_paragraph_other` model for that purpose). It will extract the following entities: `ALIAS`, `EFFDATE_NUMERIC`, `LOCATION`, and `PARTY`. 
## Predicted Entities `ALIAS`, `EFFDATE_NUMERIC`, `LOCATION`, `PARTY` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_nda_names_of_parties_en_1.0.0_3.0_1681153822264.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_nda_names_of_parties_en_1.0.0_3.0_1681153822264.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_nda_names_of_parties", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""This Confidentiality Agreement (this "Agreement") is dated effective as of the 4th day of June 2001, between Amerada Hess Corporation, a Delaware corporation ("AHC"), and Triton Energy Limited, a Cayman Islands company (the "Company")."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
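Each chunk in the `ner_chunk` column carries the extracted span in `result` and its label in `metadata['entity']`. A hedged sketch, in plain Python over dict-shaped chunks, of turning that into the chunk/label table the Results section displays:

```python
def chunk_table(chunks):
    """Collect (chunk text, entity label) rows from ner_chunk annotations."""
    return [(c["result"], c["metadata"]["entity"]) for c in chunks]

rows = chunk_table([
    {"result": "Amerada Hess Corporation", "metadata": {"entity": "PARTY"}},
    {"result": "AHC", "metadata": {"entity": "ALIAS"}},
])
```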
## Results ```bash +------------------------+---------------+ |chunk |ner_label | +------------------------+---------------+ |4th day of June 2001 |EFFDATE_NUMERIC| |Amerada Hess Corporation|PARTY | |Delaware |LOCATION | |AHC |ALIAS | |Triton Energy Limited |PARTY | |Cayman Islands |LOCATION | |Company |ALIAS | +------------------------+---------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_nda_names_of_parties| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References In-house annotations on Non-disclosure Agreements ## Benchmarking ```bash label precision recall f1-score support ALIAS 0.92 0.96 0.94 25 EFFDATE_NUMERIC 0.90 0.96 0.93 27 LOCATION 1.00 0.93 0.96 14 PARTY 0.77 0.88 0.82 26 micro-avg 0.88 0.93 0.91 92 macro-avg 0.90 0.93 0.91 92 weighted-avg 0.88 0.93 0.91 92 ``` --- layout: model title: English XlmRoBertaForQuestionAnswering (from jeew) author: John Snow Labs name: xlm_roberta_qa_xlm_roberta_ckpt_95000 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-ckpt-95000` is an English model originally trained by `jeew`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_ckpt_95000_en_4.0.0_3.0_1655992139810.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_ckpt_95000_en_4.0.0_3.0_1655992139810.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_ckpt_95000","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_ckpt_95000","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.xlm_roberta.by_jeew").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
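The NLU one-liner above packs the question and context into a single string separated by `|||`. A tiny helper (hypothetical, for illustration only) makes the expected input format explicit:

```python
def qa_input(question: str, context: str) -> str:
    """Join question and context with the '|||' separator the NLU predict call expects."""
    return f"{question}|||{context}"

packed = qa_input("What's my name?", "My name is Clara and I live in Berkeley.")
```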
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_roberta_ckpt_95000| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|807.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/jeew/xlm-roberta-ckpt-95000 --- layout: model title: Relation Extraction between different oncological entity types (ReDL) author: John Snow Labs name: redl_oncology_biobert_wip date: 2022-09-29 tags: [licensed, clinical, oncology, en, relation_extraction, temporal, test, biomarker, anatomy] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This relation extraction model identifies relations between dates and other clinical entities, between tumor mentions and their size, between anatomical entities and other clinical entities, and between tests and their results. In contrast to re_oncology_granular, all these relation types are labeled as is_related_to. The different types of relations can be identified considering the pairs of entities that are linked. 
## Predicted Entities `is_related_to` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_oncology_biobert_wip_en_4.1.0_3.0_1664478881483.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_oncology_biobert_wip_en_4.1.0_3.0_1664478881483.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use relation pairs to include only the combinations of entities that are relevant in your case.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunk", "dependencies"]) \ .setOutputCol("re_ner_chunk") \ .setMaxSyntacticDistance(10) \ .setRelationPairs(["Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery"]) re_model = RelationExtractionDLModel.pretrained("redl_oncology_biobert_wip", "en", "clinical/models") \ .setInputCols(["re_ner_chunk", "sentence"]) \ .setOutputCol("relation_extraction") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model]) data = spark.createDataFrame([["A mastectomy was performed two months ago, and a 3 cm mass was extracted."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = 
new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentence", "pos_tags", "token")) .setOutputCol("dependencies") val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols("ner_chunk", "dependencies") .setOutputCol("re_ner_chunk") .setMaxSyntacticDistance(10) .setRelationPairs(Array("Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery")) val re_model = RelationExtractionDLModel.pretrained("redl_oncology_biobert_wip", "en", "clinical/models") .setPredictionThreshold(0.5f) .setInputCols("re_ner_chunk", "sentence") .setOutputCol("relation_extraction") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("A mastectomy was performed two months ago, and a 3 cm mass was extracted.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python 
import nlu nlu.load("en.relation.oncology_biobert_wip").predict("""A mastectomy was performed two months ago, and a 3 cm mass was extracted.""") ```
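Because `setRelationPairs` matches directed entity pairs, each combination above is listed in both directions. A small sketch (a hypothetical helper, not part of the library) that expands unordered entity combinations into the directed list the filter expects:

```python
def relation_pairs(*combos):
    """Expand unordered entity combinations into both directed 'A-B' pair strings."""
    pairs = []
    for a, b in combos:
        pairs += [f"{a}-{b}", f"{b}-{a}"]
    return pairs

pairs = relation_pairs(("Tumor_Finding", "Tumor_Size"),
                       ("Cancer_Surgery", "Relative_Date"))
```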
## Results ```bash | chunk1 | entity1 | chunk2 | entity2 | relation | confidence | |:-----------|:---------------|:---------------|:--------------|:--------------|-------------:| | mastectomy | Cancer_Surgery | two months ago | Relative_Date | is_related_to | 0.914221 | | 3 cm | Tumor_Size | mass | Tumor_Finding | is_related_to | 0.90399 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_oncology_biobert_wip| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|405.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label recall precision f1 O 0.82 0.89 0.86 is_related_to 0.90 0.84 0.87 macro-avg 0.86 0.87 0.86 ``` --- layout: model title: Swedish asr_test_by_marma TFWav2Vec2ForCTC from marma author: John Snow Labs name: pipeline_asr_test_by_marma date: 2022-09-25 tags: [wav2vec2, sv, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: sv edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_test_by_marma` is a Swedish model originally trained by marma. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_test_by_marma_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_test_by_marma_sv_4.2.0_3.0_1664116677013.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_test_by_marma_sv_4.2.0_3.0_1664116677013.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_test_by_marma', lang = 'sv') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_test_by_marma", lang = "sv") val annotations = pipeline.transform(audioDF) ```
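Wav2Vec2 pipelines consume mono 16 kHz audio as arrays of floats. A stdlib-only sketch (assuming a 16-bit mono WAV file; real pipelines typically use an audio library and resample as needed) of loading samples into the float shape an `audio_content` column expects:

```python
import struct
import wave

def load_wav_floats(path):
    """Read a 16-bit mono WAV file into a list of floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1 or w.getsampwidth() != 2:
            raise ValueError("expects 16-bit mono WAV")
        raw = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [s / 32768.0 for s in samples]
```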
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_test_by_marma| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|sv| |Size:|756.3 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from thatdramebaazguy) author: John Snow Labs name: roberta_qa_base_squad date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad` is an English model originally trained by `thatdramebaazguy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad_en_4.2.4_3.0_1669986533701.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad_en_4.2.4_3.0_1669986533701.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_squad| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|461.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/thatdramebaazguy/roberta-base-squad - https://github.com/adityaarunsinghal/Domain-Adaptation/blob/master/scripts/shell_scripts/train_movieR_just_squadv1.sh - https://github.com/adityaarunsinghal/Domain-Adaptation/ --- layout: model title: English image_classifier_vit_base_beans_demo_v2 ViTForImageClassification from nateraw author: John Snow Labs name: image_classifier_vit_base_beans_demo_v2 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_beans_demo_v2` is an English model originally trained by nateraw. ## Predicted Entities `angular_leaf_spot`, `bean_rust`, `healthy` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_beans_demo_v2_en_4.1.0_3.0_1660166573645.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_beans_demo_v2_en_4.1.0_3.0_1660166573645.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_beans_demo_v2", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_beans_demo_v2", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
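The pipeline above assumes an `imageDF` has already been assembled, typically by reading an image folder with Spark's image data source. As a minimal, Spark-free sketch, collecting the candidate image files first can help validate the input directory before the pipeline runs:

```python
from pathlib import Path

def image_paths(folder, exts=(".jpg", ".jpeg", ".png")):
    """List image files in a folder, filtered by extension (case-insensitive)."""
    return sorted(p for p in Path(folder).glob("*") if p.suffix.lower() in exts)
```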
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_beans_demo_v2| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Detect PHI for Deidentification (BertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_deid date: 2022-01-06 tags: [licensed, berfortokenclassification, deid, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Deidentification NER is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 23 entities. This NER model is trained on a combination of the i2b2 train set and a re-augmented version of the i2b2 train set, using `BertForTokenClassification`. We adhered to the official annotation guidelines (AG) of the 2014 i2b2 Deid challenge while annotating the new datasets for this model. 
All the details regarding the nuances and explanations for the AG can be found here: [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/) ## Predicted Entities `MEDICALRECORD`, `ORGANIZATION`, `DOCTOR`, `USERNAME`, `PROFESSION`, `HEALTHPLAN`, `URL`, `CITY`, `DATE`, `LOCATION-OTHER`, `STATE`, `PATIENT`, `DEVICE`, `COUNTRY`, `ZIP`, `PHONE`, `HOSPITAL`, `EMAIL`, `IDNUM`, `STREET`, `BIOID`, `FAX`, `AGE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_DE){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_deid_en_3.3.4_2.4_1641472006823.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_deid_en_3.3.4_2.4_1641472006823.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_deid", "en", "clinical/models")\ .setInputCols(["token", "document"])\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ documentAssembler, tokenizer, tokenClassifier, ner_converter]) data = spark.createDataFrame([["""A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_deid", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("ner") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("document","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( documentAssembler, tokenizer, tokenClassifier, ner_converter)) val data = Seq("""A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. 
Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_deid").predict("""A. Record date : 2093-01-13, David Hale, M.D. Name : Hendrickson, Ora MR. # 7194334. PCP : Oliveira, non-smoking. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""") ```
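Downstream deidentification usually replaces each detected chunk with its label. A naive plain-Python sketch over (chunk, label) pairs; in practice, Spark NLP's `DeIdentification` annotator handles character offsets and obfuscation properly:

```python
def mask_phi(text, chunks):
    """Replace each detected PHI chunk with an angle-bracketed label placeholder."""
    for chunk, label in chunks:
        text = text.replace(chunk, f"<{label}>")
    return text

masked = mask_phi("Record date : 2093-01-13, David Hale, M.D.",
                  [("2093-01-13", "DATE"), ("David Hale", "DOCTOR")])
```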
## Results ```bash +-----------------------------+-------------+ |chunk |ner_label | +-----------------------------+-------------+ |2093-01-13 |DATE | |David Hale |DOCTOR | |Hendrickson, Ora |PATIENT | |7194334 |MEDICALRECORD| |Oliveira |PATIENT | |Cocke County Baptist Hospital|HOSPITAL | |0295 Keats Street |STREET | |302) 786-5227 |PHONE | |Brothers Coal-Mine |ORGANIZATION | +-----------------------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_deid| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.4 MB| |Case sensitive:|true| |Max sentence length:|128| ## Data Source A custom data set created from the i2b2-PHI train set and the re-augmented version of the i2b2-PHI train set is used. ## Benchmarking ```bash label precision recall f1-score support B-AGE 0.92 0.80 0.86 1050 B-CITY 0.71 0.93 0.80 530 B-COUNTRY 0.94 0.72 0.82 179 B-DATE 0.99 0.99 0.99 20434 B-DEVICE 0.68 0.66 0.67 35 B-DOCTOR 0.93 0.91 0.92 3609 B-EMAIL 0.92 1.00 0.96 11 B-HOSPITAL 0.94 0.90 0.92 2225 B-IDNUM 0.88 0.64 0.74 1185 B-LOCATION-OTHER 0.83 0.60 0.70 25 B-MEDICALRECORD 0.84 0.97 0.90 2263 B-ORGANIZATION 0.92 0.68 0.79 171 B-PATIENT 0.89 0.86 0.88 2297 B-PHONE 0.90 0.96 0.93 755 B-PROFESSION 0.86 0.81 0.83 265 B-STATE 0.97 0.87 0.92 261 B-STREET 0.98 0.99 0.99 184 B-USERNAME 0.96 0.78 0.86 357 B-ZIP 0.96 0.96 0.96 444 I-CITY 0.74 0.83 0.78 138 I-DATE 0.98 0.96 0.97 955 I-DEVICE 1.00 1.00 1.00 3 I-DOCTOR 0.92 0.98 0.95 3134 I-HOSPITAL 0.95 0.92 0.93 1322 I-LOCATION-OTHER 1.00 1.00 1.00 8 I-MEDICALRECORD 1.00 0.91 0.95 23 I-ORGANIZATION 0.98 0.61 0.75 77 I-PATIENT 0.89 0.81 0.85 1281 I-PHONE 0.97 0.96 0.97 374 I-PROFESSION 0.95 0.82 0.88 232 I-STREET 0.98 0.98 0.98 391 O 1.00 1.00 1.00 585606 accuracy - - 0.99 629960 macro-avg 0.79 0.71 0.73 629960 weighted-avg 0.99 0.99 0.99 629960 ``` --- layout: 
model title: English XlmRoBertaForQuestionAnswering (from teacookies) author: John Snow Labs name: xlm_roberta_qa_autonlp_roberta_base_squad2_24465518 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-roberta-base-squad2-24465518` is an English model originally trained by `teacookies`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465518_en_4.0.0_3.0_1655986437519.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465518_en_4.0.0_3.0_1655986437519.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465518","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465518","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.xlm_roberta.base_24465518.by_teacookies").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_roberta_base_squad2_24465518| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|887.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-roberta-base-squad2-24465518 --- layout: model title: Sentence Detection in Russian Text author: John Snow Labs name: sentence_detector_dl date: 2021-08-30 tags: [ru, open_source, sentence_detection] task: Sentence Detection language: ru edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_ru_3.2.0_3.0_1630320562697.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_ru_3.2.0_3.0_1630320562697.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "ru") \ .setInputCols(["document"]) \ .setOutputCol("sentences") sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL])) sd_model.fullAnnotate("""Ищете отличный источник абзацев для чтения на английском? Вы пришли в нужное место. Согласно недавнему исследованию, привычка к чтению у современной молодежи стремительно сокращается. Они не могут сосредоточиться на данном абзаце для чтения на английском дольше нескольких секунд! Кроме того, чтение было и остается неотъемлемой частью всех конкурсных экзаменов. Итак, как улучшить свои навыки чтения? Ответ на этот вопрос на самом деле представляет собой другой вопрос: какова польза от навыков чтения? Основная цель чтения - «понять смысл».""") ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "ru") .setInputCols(Array("document")) .setOutputCol("sentence") val pipeline = new Pipeline().setStages(Array(documenter, model)) val data = Seq("Ищете отличный источник абзацев для чтения на английском? Вы пришли в нужное место. Согласно недавнему исследованию, привычка к чтению у современной молодежи стремительно сокращается. Они не могут сосредоточиться на данном абзаце для чтения на английском дольше нескольких секунд! Кроме того, чтение было и остается неотъемлемой частью всех конкурсных экзаменов. Итак, как улучшить свои навыки чтения? Ответ на этот вопрос на самом деле представляет собой другой вопрос: какова польза от навыков чтения? Основная цель чтения - «понять смысл».").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python nlu.load('ru.sentence_detector').predict("Ищете отличный источник абзацев для чтения на английском? Вы пришли в нужное место. Согласно недавнему исследованию, привычка к чтению у современной молодежи стремительно сокращается. Они не могут сосредоточиться на данном абзаце для чтения на английском дольше нескольких секунд! Кроме того, чтение было и остается неотъемлемой частью всех конкурсных экзаменов. Итак, как улучшить свои навыки чтения? Ответ на этот вопрос на самом деле представляет собой другой вопрос: какова польза от навыков чтения? Основная цель чтения - «понять смысл».", output_level='sentence') ```
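The DL model's advantage over simple rule-based splitting is robustness to abbreviations and unusual punctuation. For intuition only, here is a deliberately naive regex splitter (not part of Spark NLP; an English toy example is used for readability). It handles plain sentence-final punctuation but breaks on abbreviations, a case the pretrained model is trained to handle:

```python
import re

def naive_split(text):
    # Split on whitespace that follows sentence-final punctuation.
    # Deliberately simplistic: it wrongly splits after abbreviations
    # like "Dr.", which a trained sentence-boundary model avoids.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

print(naive_split("Looking for reading material? You came to the right place."))
print(naive_split("He met Dr. Smith today."))  # wrongly splits after "Dr."
```

The second call illustrates the failure mode that motivates a learned detector.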
## Results ```bash +-----------------------------------------------------------------------------------------------------+ |result | +-----------------------------------------------------------------------------------------------------+ |[Ищете отличный источник абзацев для чтения на английском?] | |[Вы пришли в нужное место.] | |[Согласно недавнему исследованию, привычка к чтению у современной молодежи стремительно сокращается.]| |[Они не могут сосредоточиться на данном абзаце для чтения на английском дольше нескольких секунд!] | |[Кроме того, чтение было и остается неотъемлемой частью всех конкурсных экзаменов.] | |[Итак, как улучшить свои навыки чтения?] | |[Ответ на этот вопрос на самом деле представляет собой другой вопрос:] | |[какова польза от навыков чтения?] | |[Основная цель чтения - «понять смысл».] | +-----------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentence_detector_dl| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[sentences]| |Language:|ru| ## Benchmarking ```bash label Accuracy Recall Prec F1 0 0.98 1.00 0.96 0.98 ``` --- layout: model title: Telugu XLMRoBerta Embeddings (from neuralspace-reverie) author: John Snow Labs name: xlmroberta_embeddings_indic_transformers_te_xlmroberta date: 2022-05-13 tags: [te, open_source, xlm_roberta, embeddings] task: Embeddings language: te edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: XlmRoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `indic-transformers-te-xlmroberta` is a Telugu model originally trained by `neuralspace-reverie`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_indic_transformers_te_xlmroberta_te_3.4.4_3.0_1652439933257.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_indic_transformers_te_xlmroberta_te_3.4.4_3.0_1652439933257.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_indic_transformers_te_xlmroberta","te") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["నేను స్పార్క్ NLP ని ప్రేమిస్తున్నాను"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_indic_transformers_te_xlmroberta","te") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("నేను స్పార్క్ NLP ని ప్రేమిస్తున్నాను").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_embeddings_indic_transformers_te_xlmroberta| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|te| |Size:|505.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-te-xlmroberta - https://oscar-corpus.com/ --- layout: model title: Universal Sentence Encoder Large author: John Snow Labs name: tfhub_use_lg date: 2020-04-17 task: Embeddings language: en nav_key: models edition: Spark NLP 2.4.0 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: UniversalSentenceEncoder article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks. The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable length English text and the output is a 512 dimensional vector. We apply this model to the STS benchmark for semantic similarity, and the results can be seen in the example notebook made available. The universal-sentence-encoder model is trained with a deep averaging network (DAN) encoder. The details are described in the paper "[Universal Sentence Encoder](https://arxiv.org/abs/1803.11175)". 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/tfhub_use_lg_en_2.4.0_2.4_1587136993894.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/tfhub_use_lg_en_2.4.0_2.4_1587136993894.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_lg", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP'], ['Many thanks']], ["text"])) ``` ```scala ... val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_lg", "en") .setInputCols("document") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I love NLP", "Many thanks").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP", "Many thanks"] embeddings_df = nlu.load('en.embed_sentence.tfhub_use.lg').predict(text, output_level='sentence') embeddings_df ```
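The 512-dimensional sentence vectors produced by this model are typically compared with cosine similarity, as in the STS benchmark mentioned in the description. A minimal, library-free sketch of that comparison, using toy low-dimensional vectors as stand-ins for real USE embeddings:

```python
import math

def cosine_similarity(u, v):
    # Standard cosine similarity; USE vectors are 512-dimensional,
    # but the formula is dimension-agnostic.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dimensional vectors standing in for real 512-d embeddings.
a = [0.1, 0.3, -0.2, 0.4]
b = [0.2, 0.3, -0.1, 0.4]
print(round(cosine_similarity(a, b), 4))
```

In practice the vectors would come from the `sentence_embeddings` column of `result` above.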
{:.h2_title} ## Results ```bash en_embed_sentence_tfhub_use_lg_embeddings sentence 0 [0.05463508144021034, 0.013395714573562145, 0.... I love NLP 1 [0.03631748631596565, 0.006253095343708992, 0.... Many thanks ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|tfhub_use_lg| |Type:|embeddings| |Compatibility:|Spark NLP 2.4.0| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|512| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/google/universal-sentence-encoder-large/3](https://tfhub.dev/google/universal-sentence-encoder-large/3) --- layout: model title: Legal Communications Document Classifier (EURLEX) author: John Snow Labs name: legclf_communications_bert date: 2023-03-06 tags: [en, legal, classification, clauses, communications, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_communications_bert model, a BERT Sentence Embeddings Document Classifier, classifies whether the document belongs to the class Communications or not (binary classification) according to EuroVoc labels. 
## Predicted Entities `Communications`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_communications_bert_en_1.0.0_3.0_1678111642902.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_communications_bert_en_1.0.0_3.0_1678111642902.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_communications_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Communications]| |[Other]| |[Other]| |[Communications]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_communications_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Communications 0.84 0.88 0.86 121 Other 0.87 0.83 0.85 120 accuracy - - 0.85 241 macro-avg 0.86 0.85 0.85 241 weighted-avg 0.86 0.85 0.85 241 ``` --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_10 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-256-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_10_en_4.0.0_3.0_1657184671029.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_10_en_4.0.0_3.0_1657184671029.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_10","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-256-finetuned-squad-seed-10 --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from stevemobs) author: John Snow Labs name: distilbert_qa_stevemobs_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is a English model originally trained by `stevemobs`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_stevemobs_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772903093.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_stevemobs_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772903093.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_stevemobs_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_stevemobs_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_stevemobs_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/stevemobs/distilbert-base-uncased-finetuned-squad --- layout: model title: Pipeline to Mapping SNOMED Codes with Their Corresponding UMLS Codes author: John Snow Labs name: snomed_umls_mapping date: 2022-06-27 tags: [snomed, umls, pipeline, chunk_mapper, clinical, licensed, en] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of `snomed_umls_mapper` model. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/snomed_umls_mapping_en_3.5.3_3.0_1656368000448.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/snomed_umls_mapping_en_3.5.3_3.0_1656368000448.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("snomed_umls_mapping", "en", "clinical/models") result= pipeline.fullAnnotate("733187009 449433008 51264003") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("snomed_umls_mapping", "en", "clinical/models") val result= pipeline.fullAnnotate("733187009 449433008 51264003") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.snomed.umls").predict("""733187009 449433008 51264003""") ```
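Conceptually, the chunk-mapper stage behaves like a code-to-code lookup table. The sketch below is a hypothetical in-memory stand-in for the pretrained `snomed_umls_mapper` model, seeded only with the three SNOMED-to-UMLS pairs shown in this page's example output; the real model covers the full vocabulary:

```python
# Hypothetical stand-in for the pretrained ChunkMapperModel; the three
# pairs below are taken from this pipeline's example output.
snomed_to_umls = {
    "733187009": "C4546029",
    "449433008": "C3164619",
    "51264003": "C0271267",
}

def map_codes(text):
    # One UMLS code (or None for unknown codes) per whitespace-separated
    # SNOMED code, mirroring the pipeline's input/output shape.
    return [snomed_to_umls.get(code) for code in text.split()]

print(map_codes("733187009 449433008 51264003"))
```

Unknown codes map to `None` here; the real annotator simply produces no mapping for them.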
## Results ```bash |    | snomed_code | umls_code | |---:|:------------|:----------| |  0 | 733187009   | C4546029  | |  1 | 449433008   | C3164619  | |  2 | 51264003    | C0271267  | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|snomed_umls_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|5.1 MB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel --- layout: model title: English RoBERTa Embeddings (Sampling strategy 'div select') author: John Snow Labs name: roberta_embeddings_distilroberta_base_climate_d date: 2022-04-14 tags: [roberta, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilroberta-base-climate-d` is an English model originally trained by `climatebert`. Sampling strategy d: As expressed in the author's paper [here](https://arxiv.org/pdf/2110.12010.pdf), d is "div select", meaning 70% of the most diverse sentences of one of the corpora were used, discarding the rest. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distilroberta_base_climate_d_en_3.4.2_3.0_1649947414236.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distilroberta_base_climate_d_en_3.4.2_3.0_1649947414236.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_distilroberta_base_climate_d","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_distilroberta_base_climate_d","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.distilroberta_base_climate_d").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_distilroberta_base_climate_d| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|310.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/climatebert/distilroberta-base-climate-d - https://arxiv.org/abs/2110.12010 --- layout: model title: Smaller BERT Sentence Embeddings (L-6_H-128_A-2) author: John Snow Labs name: sent_small_bert_L6_128 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L6_128_en_2.6.0_2.4_1598350323564.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L6_128_en_2.6.0_2.4_1598350323564.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L6_128", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer'], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L6_128", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L6_128').predict(text, output_level='sentence') embeddings_df ```
{:.h2_title} ## Results ```bash en_embed_sentence_small_bert_L6_128_embeddings sentence [0.40656131505966187, 0.7434929013252258, -1.2... I hate cancer [-0.5999047160148621, 0.7300994396209717, -0.3... Antibiotics aren't painkiller ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L6_128| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|128| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/google/small_bert/bert_uncased_L-6_H-128_A-2/2 --- layout: model title: Extract Information from Termination Clauses (Md) author: John Snow Labs name: legner_termination_md date: 2022-12-01 tags: [en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description IMPORTANT: Don't run this model on the whole legal agreement. Instead: - Split by paragraphs. You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration; - Use the `legclf_termination_clause` Text Classifier to select only these paragraphs; This is a NER model which extracts information from Termination Clauses, like the subject (Who? Which party?), the action (verb), the object (What?) and the indirect object (to Whom?). This is the `md` (medium) version of the model, trained with more data and more resistant to false positives outside the specific section, which may help when running it at whole-document level (although not recommended). 
## Predicted Entities `TERMINATION_SUBJECT`, `TERMINATION_ACTION`, `TERMINATION_OBJECT`, `TERMINATION_INDIRECT_OBJECT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_termination_md_en_1.0.0_3.0_1669894724125.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_termination_md_en_1.0.0_3.0_1669894724125.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained('legner_termination_md','en','legal/models')\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter]) text = "(b) Either Party may terminate this Agreement" data = spark.createDataFrame([[text]]).toDF("text") model = nlpPipeline.fit(data) result = model.transform(data) ```
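The `ner_converter` stage assembles the model's token-level BIO tags into entity chunks. A minimal pure-Python sketch of that grouping logic (an illustration of the BIO scheme, not Spark NLP's actual implementation), using the tags this model predicts for the example sentence:

```python
def bio_to_chunks(tokens, labels):
    """Group BIO-tagged tokens into (entity_type, text) chunks,
    mirroring what an NER converter does downstream of the tagger."""
    chunks, current, current_type = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current:  # close the previous chunk before opening a new one
                chunks.append((current_type, " ".join(current)))
            current, current_type = [token], label[2:]
        elif label.startswith("I-") and current:
            current.append(token)
        else:  # "O" tag (or stray "I-" with no open chunk) ends any chunk
            if current:
                chunks.append((current_type, " ".join(current)))
            current, current_type = [], None
    if current:
        chunks.append((current_type, " ".join(current)))
    return chunks

tokens = ["(", "b", ")", "Either", "Party", "may", "terminate", "this", "Agreement"]
labels = ["O", "O", "O", "B-TERMINATION_SUBJECT", "I-TERMINATION_SUBJECT",
          "B-TERMINATION_ACTION", "I-TERMINATION_ACTION", "O", "O"]
print(bio_to_chunks(tokens, labels))
```

This yields one chunk per entity span, which is the shape the `ner_chunk` column exposes.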
## Results ```bash +-----------+---------------------+ | token| ner_label| +-----------+---------------------+ | (| O| | b| O| | )| O| | Either|B-TERMINATION_SUBJECT| | Party|I-TERMINATION_SUBJECT| | may| B-TERMINATION_ACTION| | terminate| I-TERMINATION_ACTION| | this| O| | Agreement| O| +-----------+---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_termination_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.1 MB| ## References In-house annotations of CUAD dataset. ## Benchmarking ```bash label tp fp fn prec rec f1 I-TERMINATION_INDIRECT_OBJECT 4 0 6 1.0 0.4 0.5714286 B-TERMINATION_INDIRECT_OBJECT 3 1 4 0.75 0.42857143 0.5454545 B-TERMINATION_OBJECT 38 22 36 0.6333333 0.5135135 0.5671642 I-TERMINATION_ACTION 85 27 5 0.7589286 0.9444444 0.8415842 I-TERMINATION_OBJECT 294 172 294 0.6309013 0.5 0.55787474 B-TERMINATION_SUBJECT 37 10 8 0.78723407 0.82222223 0.8043478 I-TERMINATION_SUBJECT 26 8 7 0.7647059 0.7878788 0.7761194 B-TERMINATION_ACTION 36 7 5 0.8372093 0.8780488 0.8571428 Macro-average 523 247 365 0.770289 0.6593349 0.7105064 Micro-average 523 247 365 0.6792208 0.588964 0.6308806 ``` --- layout: model title: Extract Clinical Problem Entities (low granularity) from Voice of the Patient Documents (embeddings_clinical) author: John Snow Labs name: ner_vop_problem_reduced date: 2023-06-07 tags: [licensed, clinical, ner, en, vop, problem] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts clinical problems from documents written in the patient’s own words. The taxonomy is reduced (one label for all clinical problems). 
## Predicted Entities `Problem`, `HealthStatus`, `Modifier` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_reduced_en_4.4.3_3.0_1686148012169.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_reduced_en_4.4.3_3.0_1686148012169.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_problem_reduced", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. 
After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_problem_reduced", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
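The `result` DataFrame nests the NER output as annotation structs. Once rows are collected to the driver, flattening them into the chunk/label pairs shown in the Results section is plain Python. A minimal sketch, assuming the standard Spark NLP annotation schema (`result` string plus a `metadata` map holding `entity`); the sample row below is illustrative, not real model output:

```python
# Sketch: flatten collected Spark NLP NER annotations into (chunk, label) pairs.
# The sample row below mimics the annotation schema; it is not real model output.
def chunks_with_labels(ner_chunks):
    """ner_chunks: list of annotation-like dicts, each with a 'result' string
    and a 'metadata' dict holding the 'entity' label."""
    return [(c["result"], c["metadata"]["entity"]) for c in ner_chunks]

sample = [
    {"result": "pain", "metadata": {"entity": "Problem"}},
    {"result": "fatigue", "metadata": {"entity": "Problem"}},
    {"result": "rheumatoid arthritis", "metadata": {"entity": "Problem"}},
]
print(chunks_with_labels(sample))
```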
## Results ```bash | chunk | ner_label | |:---------------------|:------------| | pain | Problem | | fatigue | Problem | | rheumatoid arthritis | Problem | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_problem_reduced| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
## Benchmarking ```bash label tp fp fn total precision recall f1 Problem 5871 617 1339 7210 0.90 0.81 0.86 HealthStatus 86 30 21 107 0.74 0.80 0.77 Modifier 818 196 321 1139 0.81 0.72 0.76 macro_avg 6775 843 1681 8456 0.82 0.78 0.80 micro_avg 6775 843 1681 8456 0.89 0.80 0.85 ``` --- layout: model title: Portuguese BertForMaskedLM Cased model (from pucpr) author: John Snow Labs name: bert_embeddings_biobertpt_bio date: 2022-12-02 tags: [pt, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: pt edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobertpt-bio` is a Portuguese model originally trained by `pucpr`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_biobertpt_bio_pt_4.2.4_3.0_1670020771475.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_biobertpt_bio_pt_4.2.4_3.0_1670020771475.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_biobertpt_bio","pt") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_biobertpt_bio","pt")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_biobertpt_bio|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|pt|
|Size:|667.8 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/pucpr/biobertpt-bio
- https://www.aclweb.org/anthology/2020.clinicalnlp-1.7/
- https://github.com/HAILab-PUCPR/BioBERTpt

---
layout: model
title: Extract Tickers on Financial Texts
author: John Snow Labs
name: finner_ticker
date: 2022-08-09
tags: [en, financial, ner, ticker, trading, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model detects Trading Symbols / Tickers in texts. You can then use Chunk Mappers to get more information about the company the ticker belongs to. This is a light version of the model, trained on tweets. You can find heavier (transformer-based, more specifically RoBERTa-based) models in our Models Hub.

## Predicted Entities

`TICKER`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_TICKER/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_ticker_en_1.0.0_3.2_1660037397073.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_ticker_en_1.0.0_3.2_1660037397073.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from pyspark.sql import functions as F

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_ticker", "en", "finance/models")\
    .setInputCols(["document", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

data = spark.createDataFrame([["""TTSLA, DTV, AMZN, NFLX and GPRO continue to look good here. All of them need to continue and make it into"""]]).toDF("text")

result = model.transform(data)

result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
      .select(F.expr("cols['0']").alias("ticker"),
              F.expr("cols['1']['entity']").alias("label")).show()
```
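The description notes that extracted tickers can be fed to Chunk Mappers to retrieve company information. Conceptually that step is a lookup keyed by the ticker chunk; the sketch below illustrates the idea in plain Python with a hypothetical mapping (the real Chunk Mapper ships its own data and API, which this does not reproduce):

```python
# Toy illustration of the ticker -> company lookup that a Chunk Mapper provides.
# TICKER_INFO is a hypothetical example mapping, not the mapper's actual data.
TICKER_INFO = {
    "AMZN": "Amazon.com, Inc.",
    "NFLX": "Netflix, Inc.",
    "GPRO": "GoPro, Inc.",
}

def map_tickers(tickers):
    """Return (ticker, company-or-None) pairs, preserving input order."""
    return [(t, TICKER_INFO.get(t)) for t in tickers]

print(map_tickers(["AMZN", "NFLX", "TTSLA"]))
```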
## Results

```bash
+------+------+
|ticker| label|
+------+------+
| TTSLA|TICKER|
|   DTV|TICKER|
|  AMZN|TICKER|
|  NFLX|TICKER|
|  GPRO|TICKER|
+------+------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|finner_ticker|
|Type:|finance|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|1.2 MB|

## References

Original dataset (https://www.kaggle.com/omermetinn/tweets-about-the-top-companies-from-2015-to-2020) and weak labelling on in-house texts.

## Benchmarking

```bash
label precision recall f1-score support
TICKER 0.97 0.96 0.97 9823
micro-avg 0.97 0.96 0.97 9823
macro-avg 0.97 0.96 0.97 9823
weighted-avg 0.97 0.96 0.97 9823
```

---
layout: model
title: English DistilBertForQuestionAnswering Base Cased model (from victorlee071200)
author: John Snow Labs
name: distilbert_qa_base_cased_finetuned_squad_v2
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-finetuned-squad_v2` is an English model originally trained by `victorlee071200`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_finetuned_squad_v2_en_4.3.0_3.0_1672767020407.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_finetuned_squad_v2_en_4.3.0_3.0_1672767020407.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_finetuned_squad_v2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_finetuned_squad_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
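The `answer` column of `result` holds annotation structs whose `result` field is the extracted span. Pulling the first answer out of a collected row is a one-liner; a sketch in plain Python (the sample dict is illustrative, shaped like the annotation schema, and its metadata key is an assumption):

```python
# Sketch: read the extracted answer span from a collected QA result row.
# sample_row below is illustrative, not real model output.
def first_answer(row):
    """row: dict for one example, with 'answer' a list of annotation-like dicts."""
    answers = row.get("answer", [])
    return answers[0]["result"] if answers else None

sample_row = {"answer": [{"result": "Clara"}]}
print(first_answer(sample_row))
```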
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_cased_finetuned_squad_v2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/victorlee071200/distilbert-base-cased-finetuned-squad_v2

---
layout: model
title: English Bert Embeddings (from Geotrend)
author: John Snow Labs
name: bert_embeddings_bert_base_en_cased
date: 2022-04-11
tags: [bert, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-en-cased` is an English model originally trained by `Geotrend`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_en_cased_en_3.4.2_3.0_1649672809725.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_en_cased_en_3.4.2_3.0_1649672809725.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_en_cased","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_en_cased","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.bert_base_en_cased").predict("""I love Spark NLP""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_en_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|405.2 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/Geotrend/bert-base-en-cased
- https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf
- https://github.com/Geotrend-research/smaller-transformers

---
layout: model
title: English asr_wav2vec2_tcrs_runtest TFWav2Vec2ForCTC from neelan-elucidate-ai
author: John Snow Labs
name: asr_wav2vec2_tcrs_runtest
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_tcrs_runtest` is an English model originally trained by neelan-elucidate-ai.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_tcrs_runtest_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_tcrs_runtest_en_4.2.0_3.0_1664104261756.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_tcrs_runtest_en_4.2.0_3.0_1664104261756.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_tcrs_runtest", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_tcrs_runtest", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_tcrs_runtest|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|355.0 MB|

---
layout: model
title: Typo Detector Pipeline for Icelandic
author: John Snow Labs
name: distilbert_token_classifier_typo_detector_pipeline
date: 2022-03-08
tags: [icelandic, typo, ner, distilbert, is, open_source]
task: Named Entity Recognition
language: is
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of [distilbert_token_classifier_typo_detector_is](https://nlp.johnsnowlabs.com/2022/01/19/distilbert_token_classifier_typo_detector_is.html).

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/TYPO_DETECTOR_IS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/DistilBertForTokenClassification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_typo_detector_pipeline_is_3.4.1_3.0_1646741501592.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_typo_detector_pipeline_is_3.4.1_3.0_1646741501592.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python typo_pipeline = PretrainedPipeline("distilbert_token_classifier_typo_detector_pipeline", lang = "is") typo_pipeline.annotate("Það er miög auðvelt að draga marktækar álykanir af texta með Spark NLP.") ``` ```scala val typo_pipeline = new PretrainedPipeline("distilbert_token_classifier_typo_detector_pipeline", lang = "is") typo_pipeline.annotate("Það er miög auðvelt að draga marktækar álykanir af texta með Spark NLP.") ```
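`annotate` returns a plain dict mapping each output column to a list of strings, so post-processing needs no Spark at all. A sketch, assuming the pipeline exposes its detected typo chunks under a `ner_chunk` key (the sample dict below is illustrative, not real pipeline output):

```python
# Sketch: pull flagged typo chunks out of an annotate()-style result dict.
def detected_typos(annotations):
    """annotations: dict of output-column name -> list of strings."""
    return annotations.get("ner_chunk", [])

sample = {
    "token": ["Það", "er", "miög", "auðvelt"],
    "ner_chunk": ["miög", "álykanir"],
}
print(detected_typos(sample))
```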
## Results ```bash +--------+---------+ |chunk |ner_label| +--------+---------+ |miög |PO | |álykanir|PO | +--------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_typo_detector_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|is| |Size:|505.8 MB| ## Included Models - DocumentAssembler - TokenizerModel - DistilBertForTokenClassification - NerConverter - Finisher --- layout: model title: Fast Neural Machine Translation Model from English to Ilocano author: John Snow Labs name: opus_mt_en_ilo date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, ilo, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `ilo` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ilo_xx_2.7.0_2.4_1609168193338.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ilo_xx_2.7.0_2.4_1609168193338.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_ilo", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])

light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

data = ["text to translate"]

result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_ilo", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("text to translate").toDS.toDF("text")

val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.ilo').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_en_ilo|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English RobertaForQuestionAnswering (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_8
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-1024-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_8_en_4.0.0_3.0_1655730987421.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_8_en_4.0.0_3.0_1655730987421.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_8","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = RoBertaForQuestionAnswering
    .pretrained("roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_8","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.base_1024d_seed_8").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_8|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|439.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-1024-finetuned-squad-seed-8

---
layout: model
title: DistilRoBERTa Base Ontonotes NER Pipeline
author: John Snow Labs
name: distilroberta_base_token_classifier_ontonotes_pipeline
date: 2022-04-23
tags: [open_source, ner, token_classifier, distilroberta, ontonotes, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [distilroberta_base_token_classifier_ontonotes](https://nlp.johnsnowlabs.com/2021/09/26/distilroberta_base_token_classifier_ontonotes_en.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilroberta_base_token_classifier_ontonotes_pipeline_en_3.4.1_3.0_1650713643088.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilroberta_base_token_classifier_ontonotes_pipeline_en_3.4.1_3.0_1650713643088.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
pipeline = PretrainedPipeline("distilroberta_base_token_classifier_ontonotes_pipeline", lang = "en")

pipeline.annotate("My name is John and I have been working at John Snow Labs since November 2020.")
```
```scala
val pipeline = new PretrainedPipeline("distilroberta_base_token_classifier_ontonotes_pipeline", lang = "en")

pipeline.annotate("My name is John and I have been working at John Snow Labs since November 2020.")
```
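Once the pipeline's chunks and labels are paired up, simple aggregations like per-label entity counts are one `Counter` away. A sketch in plain Python; the `pairs` list below is illustrative, matching the entities expected for the example sentence rather than captured pipeline output:

```python
from collections import Counter

def label_counts(chunk_label_pairs):
    """Count entity chunks per NER label from (chunk, label) pairs."""
    return Counter(label for _, label in chunk_label_pairs)

# Illustrative pairs matching the example sentence's expected entities.
pairs = [("John", "PERSON"), ("John Snow Labs", "ORG"), ("November 2020", "DATE")]
print(label_counts(pairs))
```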
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PERSON | |John Snow Labs|ORG | |November 2020 |DATE | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilroberta_base_token_classifier_ontonotes_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|307.5 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - RoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: Multilingual T5ForConditionalGeneration Base Cased model (from google) author: John Snow Labs name: t5_flan_base date: 2023-01-30 tags: [vi, ne, fi, ur, ku, yo, si, ru, it, zh, la, hi, he, xh, so, ca, ar, as, sw, en, ro, ig, te, th, ta, ce, es, gu, or, fr, ka, "no", li, cr, ch, be, ha, ga, ja, pa, ko, sl, open_source, t5, xx, tensorflow] task: Text Generation language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `flan-t5-base` is a Multilingual model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_flan_base_xx_4.3.0_3.0_1675102308493.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_flan_base_xx_4.3.0_3.0_1675102308493.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_flan_base","xx") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_flan_base","xx")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
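FLAN-T5 is instruction-tuned, so the string placed in the `text` column is a natural-language prompt rather than a task-prefixed input. Preparing data for the pipeline above is therefore plain string formatting; a sketch with illustrative templates (the template wording is an assumption, not a required format):

```python
# Sketch: build instruction-style prompts for an instruction-tuned T5 model.
# The templates here are illustrative examples, not a fixed prompt format.
def make_prompt(task, payload):
    templates = {
        "summarize": "Summarize: {}",
        "translate_de": "Translate English to German: {}",
    }
    return templates[task].format(payload)

print(make_prompt("translate_de", "My name is Arthur"))
```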
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_flan_base|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|xx|
|Size:|1.0 GB|

## References

- https://huggingface.co/google/flan-t5-base
- https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints
- https://arxiv.org/pdf/2210.11416.pdf
- https://github.com/google-research/t5x
- https://github.com/google/jax
- https://mlco2.github.io/impact#compute
- https://arxiv.org/abs/1910.09700

---
layout: model
title: English RobertaForQuestionAnswering (from aravind-812)
author: John Snow Labs
name: roberta_qa_roberta_train_json
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-train-json` is an English model originally trained by `aravind-812`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_train_json_en_4.0.0_3.0_1655738356583.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_train_json_en_4.0.0_3.0_1655738356583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_train_json","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_train_json","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.by_aravind-812").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
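The nlu one-liner above packs the question and context into a single string separated by `|||`. A tiny helper for composing such inputs (a hypothetical convenience function, not part of the nlu API):

```python
def qa_input(question: str, context: str) -> str:
    # Hypothetical helper: join question and context with the '|||'
    # separator expected by nlu question-answering loaders.
    return f"{question}|||{context}"

print(qa_input("What's my name?", "My name is Clara and I live in Berkeley."))
# What's my name?|||My name is Clara and I live in Berkeley.
```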
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_train_json|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/aravind-812/roberta-train-json

---
layout: model
title: Translate Vietnamese to English Pipeline
author: John Snow Labs
name: translate_vi_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, vi, en, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `vi`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_vi_en_xx_2.7.0_2.4_1609698667324.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_vi_en_xx_2.7.0_2.4_1609698667324.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_vi_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_vi_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.vi.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_vi_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Translate Indonesian to English Pipeline
author: John Snow Labs
name: translate_id_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, id, en, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `id`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_id_en_xx_2.7.0_2.4_1609690563318.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_id_en_xx_2.7.0_2.4_1609690563318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_id_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_id_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.id.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_id_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Indonesian BertForMaskedLM Base Cased model (from cahya)
author: John Snow Labs
name: bert_embeddings_base_indonesian_1.5g
date: 2022-12-02
tags: [id, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: id
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-indonesian-1.5G` is an Indonesian model originally trained by `cahya`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_indonesian_1.5g_id_4.2.4_3.0_1670017797375.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_indonesian_1.5g_id_4.2.4_3.0_1670017797375.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_indonesian_1.5g", "id") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_indonesian_1.5g", "id")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
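Downstream, the token vectors in the `embeddings` column are commonly compared with cosine similarity. A minimal pure-Python sketch of that comparison (illustrative only, not a Spark NLP API):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score ~1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ≈ 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```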
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_base_indonesian_1.5g|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|id|
|Size:|415.3 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/cahya/bert-base-indonesian-1.5G
- https://github.com/cahya-wirawan/indonesian-language-models/tree/master/Transformers

---
layout: model
title: English DistilBertForQuestionAnswering model (from keras-io)
author: John Snow Labs
name: distilbert_qa_transformers_qa
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `transformers-qa` is an English model originally trained by `keras-io`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_transformers_qa_en_4.0.0_3.0_1654728935950.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_transformers_qa_en_4.0.0_3.0_1654728935950.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_transformers_qa", "en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_transformers_qa", "en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.distil_bert.by_keras-io").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_transformers_qa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|244.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/keras-io/transformers-qa
- https://keras.io/examples/nlp/question_answering/
- https://paperswithcode.com/sota?task=Question+Answering&dataset=SQuAD

---
layout: model
title: Legal Assignments Clause Binary Classifier
author: John Snow Labs
name: legclf_assignments_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `assignments` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Note that this model's embeddings support up to 512 tokens; if your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
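A minimal paragraph splitter along the lines suggested above, splitting on blank lines (an illustrative sketch, not the workshop code):

```python
import re

def split_paragraphs(text: str) -> list:
    # Split a document into paragraphs on one or more blank lines,
    # dropping empty fragments and surrounding whitespace.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "First clause text.\n\nSecond clause text.\n\n\nThird clause text."
print(split_paragraphs(doc))
# ['First clause text.', 'Second clause text.', 'Third clause text.']
```

Each resulting paragraph can then be fed to the classifier as a separate `clause_text` row.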
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, producing as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `assignments`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_assignments_clause_en_1.0.0_3.2_1660123239154.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_assignments_clause_en_1.0.0_3.2_1660123239154.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_assignments_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-------------+
|       result|
+-------------+
|[assignments]|
|      [other]|
|      [other]|
|[assignments]|
+-------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_assignments_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.2 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
 assignments       0.91    0.93      0.92      103
       other       0.97    0.96      0.97      247
    accuracy          -       -      0.95      350
   macro-avg       0.94    0.95      0.95      350
weighted-avg       0.95    0.95      0.95      350
```

---
layout: model
title: Legal NER - Whereas Clauses (Md)
author: John Snow Labs
name: legner_whereas_md
date: 2022-12-01
tags: [whereas, en, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

IMPORTANT: Don't run this model on the whole legal agreement. Instead:

- Split by paragraphs. You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration;
- Use the `legclf_cuad_whereas_clause` Text Classifier to select only these paragraphs.

This is a Legal NER model, able to process WHEREAS clauses, detecting the SUBJECT (who?), the ACTION, the OBJECT (what?) and, in some cases, the INDIRECT OBJECT (to whom?) of the clause.

This is the `md` (medium) version of the model, trained with more data and more resistant to false positives outside the specific section, which may help when running it at whole-document level (although that is not recommended).
## Predicted Entities `WHEREAS_SUBJECT`, `WHEREAS_OBJECT`, `WHEREAS_ACTION` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/LEGALNER_WHEREAS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_whereas_md_en_1.0.0_3.0_1669892674388.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_whereas_md_en_1.0.0_3.0_1669892674388.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained('legner_whereas_md', 'en', 'legal/models')\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""WHEREAS, Seller and Buyer have entered into that certain Stock Purchase Agreement, dated November 14, 2018 (the "Stock Purchase Agreement"); WHEREAS, pursuant to the Stock Purchase Agreement, Seller has agreed to sell and transfer, and Buyer has agreed to purchase and acquire, all of Seller's right, title and interest in and to Armstrong Wood Products, Inc., a Delaware corporation ("AWP") and its Subsidiaries, the Company and HomerWood Hardwood Flooring Company, a Delaware corporation ("HHFC," and together with the Company, the "Company Subsidiaries" and together with AWP, the "Company Entities" and each a "Company Entity") by way of a purchase by Buyer and sale by Seller of the Shares, all upon the terms and condition set forth therein;"""] res = model.transform(spark.createDataFrame([text]).toDF("text")) ```
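`NerConverter` merges the token-level IOB tags (shown in the results below) into entity chunks. A pure-Python sketch of that grouping logic (illustrative, not the Spark NLP implementation):

```python
def iob_to_chunks(tokens, tags):
    # Group IOB-tagged tokens into (entity_type, text) chunks:
    # "B-" opens a chunk, matching "I-" extends it, "O" closes it.
    chunks, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                chunks.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:  # "O", or an I- tag that does not continue the open chunk
            if current_tokens:
                chunks.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [], None
    if current_tokens:
        chunks.append((current_type, " ".join(current_tokens)))
    return chunks

tokens = ["WHEREAS", ",", "Seller", "has", "agreed", "to", "sell"]
tags = ["O", "O", "B-WHEREAS_SUBJECT", "B-WHEREAS_ACTION",
        "I-WHEREAS_ACTION", "I-WHEREAS_ACTION", "I-WHEREAS_ACTION"]
print(iob_to_chunks(tokens, tags))
# [('WHEREAS_SUBJECT', 'Seller'), ('WHEREAS_ACTION', 'has agreed to sell')]
```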
## Results ```bash +------------+-----------------+ | token| ner_label| +------------+-----------------+ | WHEREAS| O| | ,| O| | Seller|B-WHEREAS_SUBJECT| | and| O| | Buyer|B-WHEREAS_SUBJECT| | have| B-WHEREAS_ACTION| | entered| I-WHEREAS_ACTION| | into| I-WHEREAS_ACTION| | that| B-WHEREAS_OBJECT| | certain| I-WHEREAS_OBJECT| | Stock| I-WHEREAS_OBJECT| | Purchase| I-WHEREAS_OBJECT| | Agreement| I-WHEREAS_OBJECT| | ,| O| | dated| O| | November| O| | 14| O| | ,| O| | 2018| O| | (| O| | the| O| | "| O| | Stock| O| | Purchase| O| | Agreement| O| | ");| O| | WHEREAS| O| | ,| O| | pursuant| O| | to| O| | the| O| | Stock| O| | Purchase| O| | Agreement| O| | ,| O| | Seller|B-WHEREAS_SUBJECT| | has| B-WHEREAS_ACTION| | agreed| I-WHEREAS_ACTION| | to| I-WHEREAS_ACTION| | sell| I-WHEREAS_ACTION| | and| O| | transfer| O| | ,| O| | and| O| | Buyer|B-WHEREAS_SUBJECT| | has| B-WHEREAS_ACTION| | agreed| I-WHEREAS_ACTION| | to| I-WHEREAS_ACTION| | purchase| I-WHEREAS_ACTION| | and| O| | acquire| O| | ,| O| | all| O| | of| O| | Seller's| O| | right| O| | ,| O| | title| O| | and| O| | interest| O| | in| O| | and| O| | to| O| | Armstrong| O| | Wood| O| | Products| O| | ,| O| | Inc| O| | .,| O| | a| O| | Delaware| O| | corporation| O| | ("| O| | AWP| O| | ")| O| | and| O| | its| O| |Subsidiaries| O| | ,| O| | the| O| | Company| O| | and| O| | HomerWood| O| | Hardwood| O| | Flooring| O| | Company| O| | ,| O| | a| O| | Delaware| O| | corporation| O| | ("| O| | HHFC| O| | ,"| O| +------------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_whereas_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.1 MB| ## References Manual annotations on CUAD dataset ## Benchmarking ```bash label tp fp fn prec rec f1 B-WHEREAS_SUBJECT 95 12 5 0.88785046 0.95 0.9178744 I-WHEREAS_ACTION 112 36 13 0.7567568 0.896 0.82051283 
I-WHEREAS_SUBJECT 31 6 6 0.8378378 0.8378378 0.8378378
B-WHEREAS_OBJECT 59 33 30 0.6413044 0.66292137 0.6519337
B-WHEREAS_ACTION 87 12 3 0.8787879 0.96666664 0.9206349
I-WHEREAS_OBJECT 221 108 65 0.67173254 0.77272725 0.71869916
Macro-average 605 207 122 0.77904505 0.8476922 0.81192017
Micro-average 605 207 122 0.7450739 0.83218706 0.78622484
```

---
layout: model
title: Legal Limitations Clause Binary Classifier
author: John Snow Labs
name: legclf_limitations_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `limitations` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Note that this model's embeddings support up to 512 tokens; if your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, producing as an output a series of True/False values for each of the legal clause models you have added.
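Benchmarking tables in these model cards report precision, recall, and F1 derived from raw tp/fp/fn counts. A minimal sketch of that derivation, checked against the `legner_whereas_md` micro-average figures above:

```python
def prf(tp, fp, fn):
    # Precision, recall, and F1 from true-positive / false-positive /
    # false-negative counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Micro-average counts from the legner_whereas_md benchmark: tp=605, fp=207, fn=122
p, r, f1 = prf(605, 207, 122)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.7451 0.8322 0.7862
```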
## Predicted Entities `other`, `limitations` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_limitations_clause_en_1.0.0_3.2_1660123690963.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_limitations_clause_en_1.0.0_3.2_1660123690963.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_limitations_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-------------+
|       result|
+-------------+
|[limitations]|
|      [other]|
|      [other]|
|[limitations]|
+-------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_limitations_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
 limitations       0.82    0.67      0.74       73
       other       0.81    0.90      0.86      115
    accuracy          -       -      0.81      188
   macro-avg       0.81    0.79      0.80      188
weighted-avg       0.81    0.81      0.81      188
```

---
layout: model
title: Pipeline for Medical Text Summarization
author: John Snow Labs
name: summarizer_generic_jsl_pipeline
date: 2023-05-31
tags: [licensed, en, clinical, summarization]
task: Summarization
language: en
edition: Healthcare NLP 4.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [summarizer_generic_jsl](https://nlp.johnsnowlabs.com/2023/03/30/summarizer_generic_jsl_en.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_generic_jsl_pipeline_en_4.4.1_3.0_1685532825334.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_generic_jsl_pipeline_en_4.4.1_3.0_1685532825334.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("summarizer_generic_jsl_pipeline", "en", "clinical/models") text = """Patient with hypertension, syncope, and spinal stenosis - for recheck. (Medical Transcription Sample Report) SUBJECTIVE: The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS: Reviewed and unchanged from the dictation on 12/03/2003. MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash. """ result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("summarizer_generic_jsl_pipeline", "en", "clinical/models") val text = """Patient with hypertension, syncope, and spinal stenosis - for recheck. (Medical Transcription Sample Report) SUBJECTIVE: The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS: Reviewed and unchanged from the dictation on 12/03/2003. MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash. """ val result = pipeline.fullAnnotate(text) ```
## Results

```bash
The patient is 78 years old and has hypertension. She has a history of chest pain, palpations, orthopedics, and spinal stenosis. She has a prescription of Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin, and TriViFlor 25 mg two pills daily.
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|summarizer_generic_jsl_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|936.7 MB|

## Included Models

- DocumentAssembler
- MedicalSummarizer

---
layout: model
title: English BertForQuestionAnswering model (from tiennvcs)
author: John Snow Labs
name: bert_qa_bert_base_uncased_finetuned_docvqa
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-finetuned-docvqa` is an English model originally trained by `tiennvcs`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_finetuned_docvqa_en_4.0.0_3.0_1654180907520.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_finetuned_docvqa_en_4.0.0_3.0_1654180907520.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_finetuned_docvqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_finetuned_docvqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.docvqa.base_uncased.by_tiennvcs").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_finetuned_docvqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/tiennvcs/bert-base-uncased-finetuned-docvqa --- layout: model title: Finnish asr_wav2vec2_base_10k_voxpopuli TFWav2Vec2ForCTC from facebook author: John Snow Labs name: asr_wav2vec2_base_10k_voxpopuli date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_10k_voxpopuli` is a Finnish model originally trained by facebook. NOTE: This model only works on a CPU; if you need to run it on a GPU device, please use asr_wav2vec2_base_10k_voxpopuli_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_10k_voxpopuli_fi_4.2.0_3.0_1664020194078.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_10k_voxpopuli_fi_4.2.0_3.0_1664020194078.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_10k_voxpopuli", "fi")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_10k_voxpopuli", "fi") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
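Under the hood, a Wav2Vec2ForCTC model emits per-frame symbol predictions that are decoded with CTC greedy decoding: collapse consecutive repeats, then drop the blank symbol. A minimal pure-Python sketch of that decoding step (the vocabulary and frame IDs below are illustrative, not the model's actual vocabulary):

```python
def ctc_greedy_decode(frame_ids, id2char, blank_id=0):
    """Collapse repeated frame predictions, then drop CTC blanks."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(id2char[i])
        prev = i
    return "".join(out)

id2char = {1: "h", 2: "e", 3: "l", 4: "o"}
# frames: h h e <blank> l l <blank> l o
print(ctc_greedy_decode([1, 1, 2, 0, 3, 3, 0, 3, 4], id2char))  # → hello
```

The blank between the two `l` runs is what lets CTC keep a genuine double letter while still merging repeated frames of the same symbol.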
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_10k_voxpopuli| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|fi| |Size:|228.1 MB| --- layout: model title: Detect Genetic Cancer Entities author: John Snow Labs name: ner_cancer_genetics class: NerDLModel language: en nav_key: models repository: clinical/models date: 2020-04-22 task: Named Entity Recognition edition: Healthcare NLP 2.4.2 spark_version: 2.4 tags: [clinical,licensed,ner,en] supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description The named entity recognition annotator allows a generic model to be trained with a deep learning architecture (char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Pretrained named entity recognition deep learning model for biology and genetics terms. ## Predicted Entities ``DNA``, ``RNA``, ``cell_line``, ``cell_type``, ``protein``. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_TUMOR/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_cancer_genetics_en_2.4.2_2.4_1587567870408.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_cancer_genetics_en_2.4.2_2.4_1587567870408.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_cancer_genetics", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.']], ["text"])) ``` ```scala ... val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_cancer_genetics", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") ... 
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.cancer").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. 
Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.""") ```
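The model emits token-level IOB tags, and the `ner_converter` stage then merges B-/I- runs into entity chunks. A minimal pure-Python sketch of that merging logic, independent of Spark NLP (function name and inputs are illustrative):

```python
def iob_to_chunks(tokens, tags):
    """Merge token-level IOB tags (B-X / I-X / O) into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O", or an I- tag that doesn't continue the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["The", "human", "KCNJ9", "(", "Kir", "3.3"]
tags   = ["O", "B-protein", "I-protein", "O", "B-protein", "I-protein"]
print(iob_to_chunks(tokens, tags))
# → [('human KCNJ9', 'protein'), ('Kir 3.3', 'protein')]
```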
{:.h2_title} ## Results ```bash +-------------------+---------+ | token|ner_label| +-------------------+---------+ | The| O| | human|B-protein| | KCNJ9|I-protein| | (| O| | Kir|B-protein| | 3.3|I-protein| | ,| O| | GIRK3|B-protein| | )| O| | is| O| | a| O| | member| O| | of| O| | the| O| |G-protein-activated|B-protein| | inwardly|I-protein| | rectifying|I-protein| | potassium|I-protein| | (|I-protein| | GIRK|I-protein| | )|I-protein| | channel|I-protein| | family|I-protein| | .| O| | Here| O| | we| O| | describe| O| | the| O| |genomicorganization| O| | of| O| | the| O| | KCNJ9| B-DNA| | locus| I-DNA| | on| O| | chromosome| B-DNA| | 1q21-23| I-DNA| +-------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---------------|----------------------------------| | Name: | ner_cancer_genetics | | Type: | NerDLModel | | Compatibility: | 2.4.2 | | License: | Licensed | | Edition: | Official | |Input labels: | sentence, token, word_embeddings | |Output labels: | ner | | Language: | en | | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained on Cancer Genetics (CG) task of the BioNLP Shared Task 2013 with `embeddings_clinical`. 
https://aclanthology.org/W13-2008/ {:.h2_title} ## Benchmarking ```bash label tp fp fn prec rec f1 B-cell_line 581 148 151 0.79698217 0.79371583 0.79534566 I-DNA 2751 735 317 0.7891566 0.89667535 0.8394873 I-protein 4416 867 565 0.8358887 0.88656896 0.8604832 B-protein 5288 763 660 0.8739051 0.8890383 0.8814068 I-cell_line 1148 244 301 0.82471263 0.79227054 0.80816615 I-RNA 221 60 27 0.78647685 0.891129 0.83553874 B-RNA 157 40 36 0.79695433 0.8134715 0.8051282 B-cell_type 1127 292 250 0.7942213 0.8184459 0.8061516 I-cell_type 1547 392 263 0.7978339 0.85469615 0.82528675 B-DNA 1513 444 387 0.77312213 0.7963158 0.7845475 Macro-average prec: 0.8069253, rec: 0.84323275, f1: 0.82467955 Micro-average prec: 0.82471186, rec: 0.86377037, f1: 0.84378934 ``` --- layout: model title: English Legal Roberta Embeddings author: John Snow Labs name: roberta_large_english_legal date: 2023-02-17 tags: [en, english, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-english-roberta-large` is an English model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_large_english_legal_en_4.2.4_3.0_1676644962452.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_large_english_legal_en_4.2.4_3.0_1676644962452.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_large_english_legal", "en")\ .setInputCols(["sentence"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_large_english_legal", "en") .setInputCols("sentence") .setOutputCol("embeddings") ```
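Downstream tasks often pool the per-token vectors in the `embeddings` column into a single sentence vector. A minimal numpy sketch of mean pooling (the vectors below are hypothetical, not actual model output):

```python
import numpy as np

# Hypothetical 4-dimensional token vectors for one sentence.
token_embeddings = np.array([
    [0.2, -0.1, 0.4, 0.0],
    [0.0,  0.3, 0.2, 0.1],
    [0.4, -0.3, 0.0, 0.3],
])

# Mean pooling: average over the token axis to get one sentence vector.
sentence_vector = token_embeddings.mean(axis=0)
print(sentence_vector.shape)  # → (4,)
```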
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_large_english_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-english-roberta-large --- layout: model title: Punjabi Bert Embeddings (from monsoon-nlp) author: John Snow Labs name: bert_embeddings_muril_adapted_local date: 2022-04-11 tags: [bert, embeddings, pa, open_source] task: Embeddings language: pa edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `muril-adapted-local` is a Punjabi model originally trained by `monsoon-nlp`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_pa_3.4.2_3.0_1649674956785.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_muril_adapted_local_pa_3.4.2_3.0_1649674956785.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","pa") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["ਮੈਨੂੰ ਸਪਾਰਕ ਐਨਐਲਪੀ ਪਸੰਦ ਹੈ"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_muril_adapted_local","pa") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("ਮੈਨੂੰ ਸਪਾਰਕ ਐਨਐਲਪੀ ਪਸੰਦ ਹੈ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pa.embed.muril_adapted_local").predict("""ਮੈਨੂੰ ਸਪਾਰਕ ਐਨਐਲਪੀ ਪਸੰਦ ਹੈ""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_muril_adapted_local| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|pa| |Size:|888.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/monsoon-nlp/muril-adapted-local - https://tfhub.dev/google/MuRIL/1 --- layout: model title: Translate South Caucasian languages to English Pipeline author: John Snow Labs name: translate_ccs_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ccs, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `ccs` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ccs_en_xx_2.7.0_2.4_1609689201716.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ccs_en_xx_2.7.0_2.4_1609689201716.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ccs_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ccs_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ccs.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ccs_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Conventions Classification author: John Snow Labs name: legclf_conventions date: 2022-08-09 tags: [es, legal, conventions, classification, licensed] task: Text Classification language: es edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Roberta-based Legal Sequence Classifier NLP model to label texts as one of the following categories: - Convención Internacional sobre la Protección de los Derechos de todos los Trabajadores Migratorios y de sus Familias - Convención de los Derechos del Niño - Convención sobre la Eliminación de todas las formas de Discriminación contra la Mujer - Pacto Internacional de Derechos Civiles y Políticos - Convención Internacional Sobre la Eliminación de Todas las Formas de Discriminación Racial - Convención contra la Tortura y otros Tratos o Penas Crueles, Inhumanos o Degradantes - Convención sobre los Derechos de las Personas con Discapacidad - Pacto Internacional de Derechos Económicos, Sociales y Culturales This model was originally trained on 3799 legal texts (see the original work [here](https://huggingface.co/hackathon-pln-es/jurisbert-class-tratados-internacionales-sistema-universal)) and has been fine-tuned by JSL on more texts scraped from the internet with weak labelling (for example, https://www.un.org/es/events/childrenday/pdf/derechos.pdf for `Convención de los Derechos del Niño`). 
## Predicted Entities `Convención sobre la Eliminación de todas las formas de Discriminación contra la Mujer`, `Convención sobre los Derechos de las Personas con Discapacidad`, `Convención Internacional Sobre la Eliminación de Todas las Formas de Discriminación Racial`, `Convención contra la Tortura y otros Tratos o Penas Crueles, Inhumanos o Degradantes`, `Convención Internacional sobre la Protección de los Derechos de todos los Trabajadores Migratorios y de sus Familias`, `Convención de los Derechos del Niño`, `Pacto Internacional de Derechos Económicos, Sociales y Culturales`, `Pacto Internacional de Derechos Civiles y Políticos` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_conventions_es_1.0.0_3.2_1660056648122.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_conventions_es_1.0.0_3.2_1660056648122.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = nlp.RoBertaForSequenceClassification.pretrained("legclf_conventions","es", "legal/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("class") pipeline = nlp.Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) text = """La Convención, a lo largo de sus 54 artículos, reconoce que los niños (seres humanos menores de 18 años) son individuos con derecho de pleno desarrollo físico, mental y social, y con derecho a expresar libremente sus opiniones. Además la Convención es también un modelo para la salud, la supervivencia y el progreso de toda la sociedad humana. """ data = spark.createDataFrame([[text]]).toDF("text") result = pipeline.fit(data).transform(data) ```
## Results ```bash +--------------------+-------------------------------------+ | text| result| +--------------------+-------------------------------------+ |La Convención, a ...|[Convención de los Derechos del Niño]| +--------------------+-------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_conventions| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|es| |Size:|466.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References This model was originally trained on 3799 legal texts (see the original work [here](https://huggingface.co/hackathon-pln-es/jurisbert-class-tratados-internacionales-sistema-universal)) and has been fine-tuned by JSL on more texts scraped from the internet with weak labelling (for example, https://www.un.org/es/events/childrenday/pdf/derechos.pdf for `Convención de los Derechos del Niño`). ## Benchmarking ```bash label precision recall f1-score support accuracy - - 0.90 120 macro-avg 0.90 0.91 0.90 120 weighted-avg 0.90 0.90 0.90 120 ``` --- layout: model title: English image_classifier_vit_llama_or_potato ViTForImageClassification from osanseviero author: John Snow Labs name: image_classifier_vit_llama_or_potato date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_llama_or_potato` is an English model originally trained by osanseviero. 
## Predicted Entities `llamas`, `potato` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_llama_or_potato_en_4.1.0_3.0_1660170666558.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_llama_or_potato_en_4.1.0_3.0_1660170666558.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_llama_or_potato", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_llama_or_potato", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_llama_or_potato| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: RxNorm Xsmall ChunkResolver author: John Snow Labs name: chunkresolve_rxnorm_xsmall_clinical date: 2021-04-16 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Entity resolution model based on KNN, using word embeddings and Word Mover's Distance. ## Predicted Entities RxNorm codes and their normalized definitions, resolved with `clinical_embeddings`. {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_xsmall_clinical_en_3.0.0_3.0_1618603394135.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_xsmall_clinical_en_3.0.0_3.0_1618603394135.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... rxnorm_resolver = ChunkEntityResolverModel\ .pretrained("chunkresolve_rxnorm_xsmall_clinical", "en", "clinical/models")\ .setEnableLevenshtein(True)\ .setNeighbours(200).setAlternatives(5).setDistanceWeights([3,11,0,0,0,9])\ .setInputCols(["token", "chunk_embeddings"])\ .setOutputCol("rxnorm_resolution")\ .setPoolingStrategy("MAX") pipeline_rxnorm = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, rxnorm_resolver]) data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation."""]]).toDF("text") model = pipeline_rxnorm.fit(data) results = model.transform(data) ``` ```scala ... 
val rxnorm_resolver = ChunkEntityResolverModel .pretrained("chunkresolve_rxnorm_xsmall_clinical", "en", "clinical/models") .setEnableLevenshtein(true) .setNeighbours(200).setAlternatives(5).setDistanceWeights(Array(3,11,0,0,0,9)) .setInputCols(Array("token", "chunk_embeddings")) .setOutputCol("rxnorm_resolution") .setPoolingStrategy("MAX") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, rxnorm_resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.").toDF("text") val result = pipeline.fit(data).transform(data) ```
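The resolver ranks candidate codes by distance between the chunk embedding and each code's embedding (the actual model combines Word Mover's Distance with the Levenshtein and pooling settings configured above). A simplified cosine-distance KNN sketch with toy 2-d vectors (the vectors are hypothetical; the codes are real RxNorm identifiers from the results below):

```python
import math

def cosine_dist(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def resolve(chunk_vec, code_index, k=3):
    """Return the k codes whose embeddings are nearest to the chunk embedding."""
    scored = sorted(code_index.items(), key=lambda kv: cosine_dist(chunk_vec, kv[1]))
    return [code for code, _ in scored[:k]]

# Toy 2-d "embeddings" for three RxNorm codes (hypothetical vectors).
index = {"861731": [0.9, 0.1], "310488": [0.1, 0.9], "1488574": [0.5, 0.5]}
print(resolve([0.85, 0.2], index, k=2))  # → ['861731', '1488574']
```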
## Results ```bash +---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ | chunk| entity| target_text| code|confidence| +---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ | metformin|TREATMENT|Glipizide Metformin hydrochloride:::Glyburide Metformin hydrochloride:::Glipizide Metformin hydro...| 861731| 0.2000| | glipizide|TREATMENT| Glipizide:::Glipizide:::Glipizide:::Glipizide:::Glipizide Metformin hydrochloride| 310488| 0.2499| |dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG|TREATMENT|dapagliflozin saxagliptin:::dapagliflozin saxagliptin:::dapagliflozin saxagliptin:::dapagliflozin...|1925504| 0.2080| | dapagliflozin|TREATMENT| dapagliflozin:::dapagliflozin:::dapagliflozin:::dapagliflozin:::dapagliflozin saxagliptin|1488574| 0.2492| +---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|chunkresolve_rxnorm_xsmall_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token, chunk_embeddings]| |Output Labels:|[rxnorm]| |Language:|en| --- layout: model title: Detect Anatomical Regions author: John Snow Labs name: ner_anatomy date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for anatomy terms. 
The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. ## Predicted Entities `Anatomical_system`, `Cell`, `Cellular_component`, `Developing_anatomical_structure`, `Immaterial_anatomical_entity`, `Multi-tissue_structure`, `Organ`, `Organism_subdivision`, `Organism_substance`, `Pathological_formation`, `Tissue` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ANATOMY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_en_3.0.0_3.0_1617208433342.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_en_3.0.0_3.0_1617208433342.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_anatomy", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. 
Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.']], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_anatomy", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now. General: Well-developed female, in no acute distress, afebrile. HEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist. Neck: No lymphadenopathy. Chest: Clear. Abdomen: Positive bowel sounds and soft. Dermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. 
Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.anatomy").predict("""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.""") ```
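Once predictions are collected from the `results` DataFrame, simple post-processing can reorganize them by label; a minimal plain-Python sketch (the chunk/label pairs below are illustrative, mirroring this card's results, and `group_by_label` is a hypothetical helper, not part of Spark NLP):

```python
from collections import defaultdict

# Illustrative (chunk, label) pairs as they would be collected from the
# `results` DataFrame produced by the pipeline above.
chunks = [
    ("skin", "Organ"),
    ("Extraocular muscles", "Organ"),
    ("turbinates", "Multi-tissue_structure"),
    ("Mucous membranes", "Tissue"),
]

def group_by_label(pairs):
    """Group recognized chunks under their predicted anatomical label."""
    grouped = defaultdict(list)
    for chunk, label in pairs:
        grouped[label].append(chunk)
    return dict(grouped)

print(group_by_label(chunks))
```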
## Results ```bash +-------------------+----------------------+ |chunk |ner | +-------------------+----------------------+ |skin |Organ | |Extraocular muscles|Organ | |turbinates |Multi-tissue_structure| |Mucous membranes |Tissue | |Neck |Organism_subdivision | |bowel |Organ | |skin |Organ | +-------------------+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_anatomy| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on the Anatomical Entity Mention (AnEM) corpus with ``'embeddings_clinical'``. http://www.nactem.ac.uk/anatomy/ ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|----------------------------------:|-----:|-----:|-----:|---------:|---------:|---------:| | 0 | B-Immaterial_anatomical_entity | 4 | 0 | 1 | 1 | 0.8 | 0.888889 | | 1 | B-Cellular_component | 14 | 4 | 7 | 0.777778 | 0.666667 | 0.717949 | | 2 | B-Organism_subdivision | 21 | 7 | 3 | 0.75 | 0.875 | 0.807692 | | 3 | I-Cell | 47 | 8 | 5 | 0.854545 | 0.903846 | 0.878505 | | 4 | B-Tissue | 14 | 2 | 10 | 0.875 | 0.583333 | 0.7 | | 5 | B-Anatomical_system | 5 | 1 | 3 | 0.833333 | 0.625 | 0.714286 | | 6 | B-Organism_substance | 26 | 2 | 8 | 0.928571 | 0.764706 | 0.83871 | | 7 | B-Cell | 86 | 6 | 11 | 0.934783 | 0.886598 | 0.910053 | | 8 | I-Immaterial_anatomical_entity | 5 | 0 | 0 | 1 | 1 | 1 | | 9 | I-Tissue | 16 | 1 | 6 | 0.941176 | 0.727273 | 0.820513 | | 10 | I-Pathological_formation | 20 | 0 | 1 | 1 | 0.952381 | 0.97561 | | 11 | I-Anatomical_system | 7 | 0 | 0 | 1 | 1 | 1 | | 12 | B-Organ | 30 | 7 | 3 | 0.810811 | 0.909091 | 0.857143 | | 13 | B-Pathological_formation | 35 | 5 | 3 | 0.875 | 0.921053 | 0.897436 | | 14 | I-Cellular_component | 4 | 0 | 3 | 1 | 0.571429 | 0.727273 | | 15 | I-Multi-tissue_structure | 26 | 10 | 6 | 0.722222 | 0.8125 | 0.764706 | | 16 | 
B-Multi-tissue_structure | 57 | 23 | 8 | 0.7125 | 0.876923 | 0.786207 | | 17 | I-Organism_substance | 6 | 2 | 0 | 0.75 | 1 | 0.857143 | | 18 | Macro-average | 424 | 84 | 88 | 0.731775 | 0.682666 | 0.706368 | | 19 | Micro-average | 424 | 84 | 88 | 0.834646 | 0.828125 | 0.831372 | ``` --- layout: model title: Chinese Deberta Embeddings Cased model (from IDEA-CCNL) author: John Snow Labs name: deberta_embeddings_erlangshen_v2_chinese_sentencepiece date: 2023-02-23 tags: [open_source, deberta, deberta_embeddings, debertav2formaskedlm, zh, tensorflow] task: Embeddings language: zh edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: DeBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DebertaV2ForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Erlangshen-DeBERTa-v2-186M-Chinese-SentencePiece` is a Chinese model originally trained by `IDEA-CCNL`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_erlangshen_v2_chinese_sentencepiece_zh_4.3.0_3.0_1677192535880.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_erlangshen_v2_chinese_sentencepiece_zh_4.3.0_3.0_1677192535880.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_erlangshen_v2_chinese_sentencepiece","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark-NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_erlangshen_v2_chinese_sentencepiece","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark-NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|deberta_embeddings_erlangshen_v2_chinese_sentencepiece| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|zh| |Size:|445.8 MB| |Case sensitive:|false| ## References https://huggingface.co/IDEA-CCNL/Erlangshen-DeBERTa-v2-186M-Chinese-SentencePiece --- layout: model title: Arabic Part of Speech Tagger (Egyptian Arabic POS, Modern Standard Arabic-MSA, Dialectal Arabic-DA and Classical Arabic-CA) author: John Snow Labs name: bert_pos_bert_base_arabic_camelbert_mix_pos_egy date: 2022-04-26 tags: [bert, pos, part_of_speech, ar, open_source] task: Part of Speech Tagging language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-mix-pos-egy` is an Arabic model originally trained by `CAMeL-Lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_mix_pos_egy_ar_3.4.2_3.0_1650993504741.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_mix_pos_egy_ar_3.4.2_3.0_1650993504741.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_mix_pos_egy","ar") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_mix_pos_egy","ar") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("أنا أحب الشرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.pos.arabic_camelbert_mix_pos_egy").predict("""أنا أحب الشرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_arabic_camelbert_mix_pos_egy| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ar| |Size:|407.4 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-mix-pos-egy - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://github.com/CAMeL-Lab/camel_tools --- layout: model title: Obligations Pipeline author: John Snow Labs name: legpipe_obligations date: 2022-08-24 tags: [en, legal, obligations, licensed] task: [Named Entity Recognition, Part of Speech Tagging, Dependency Parser] language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: Pipeline article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description IMPORTANT: Don't run this model on the whole legal agreement. Instead: - Split by paragraphs. You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration; - Use the `legclf_cuad_obligations_clause` Text Classifier to select only those paragraphs. This is a Pretrained Pipeline to process agreements, more specifically the sentences where all the obligations of the parties are expressed (what they agreed upon in the contract). This pipeline returns: - NER entities for the subject, the action/verb, the object and the indirect object of the clause; - Syntactic dependencies of the chunks, so that you can disambiguate in case different clauses/agreements are present in the same sentence. This model does not include a Sentence Detector; it executes everything at document level. If you want to split by sentences, do it beforehand and call this pipeline with the text of the sentences. 
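The pre-processing recommended above (split by paragraphs, then keep only obligation paragraphs) can be sketched in plain Python; the splitting itself needs no Spark NLP, and `is_obligation_clause` below is a naive keyword stand-in for the `legclf_cuad_obligations_clause` classifier, not its real API:

```python
import re

def split_paragraphs(agreement_text):
    """Split a legal agreement into paragraphs on blank lines."""
    paragraphs = re.split(r"\n\s*\n", agreement_text)
    return [p.strip() for p in paragraphs if p.strip()]

def is_obligation_clause(paragraph):
    # Placeholder for the legclf_cuad_obligations_clause prediction;
    # a naive keyword check stands in for the real classifier here.
    return any(kw in paragraph.lower() for kw in ("shall", "agrees to"))

text = ("1. Definitions.\n\n"
        "The Supplier agrees to provide the Buyer with all necessary documents.\n\n"
        "2. Governing law.")
# Only the paragraphs flagged as obligations would be sent to the pipeline.
candidates = [p for p in split_paragraphs(text) if is_obligation_clause(p)]
print(candidates)
```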
## Predicted Entities `OBLIGATION_SUBJECT`, `OBLIGATION_ACTION`, `OBLIGATION`, `OBLIGATION_INDIRECT_OBJECT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legpipe_obligations_en_1.0.0_3.2_1661342149969.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legpipe_obligations_en_1.0.0_3.2_1661342149969.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from johnsnowlabs import * obligations_pipeline = PretrainedPipeline("legpipe_obligations", "en", "legal/models") pipeline_result = obligations_pipeline.fullAnnotate('The Supplier agrees to provide the Buyer with all the necessary documents to fulfill the agreement') # Return NER chunks pipeline_result[0]['ner_chunk'] # Visualize the Dependencies dependency_vis = viz.DependencyParserVisualizer() dependency_vis.display(pipeline_result[0], # should be the results of a single example, not the complete dataframe pos_col = 'pos', # specify the pos column dependency_col = 'dependencies', # specify the dependency column dependency_type_col = 'dependency_type' # specify the dependency type column ) ```
## Results ```bash # NER ['Supplier', 'agrees to provide', 'Buyer', 'with all the necessary documents to fulfill the agreement'] # DEP # Use Spark NLP Display to see the dependency tree ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legpipe_obligations| |Type:|pipeline| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|435.9 MB| ## References In-house annotations on CUAD dataset ## Included Models - nlp.DocumentAssembler - nlp.Tokenizer - nlp.PerceptronModel - nlp.DependencyParserModel - nlp.TypedDependencyParserModel - legal.BertForTokenClassification - nlp.NerConverter --- layout: model title: Modern Greek (1453-) asr_wav2vec2_large_xlsr_greek_1 TFWav2Vec2ForCTC from skylord author: John Snow Labs name: asr_wav2vec2_large_xlsr_greek_1 date: 2022-09-25 tags: [wav2vec2, el, audio, open_source, asr] task: Automatic Speech Recognition language: el edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_large_xlsr_greek_1` is a Modern Greek (1453-) model originally trained by skylord. NOTE: This model only works on a CPU, if you need to use this model on a GPU device please use asr_wav2vec2_large_xlsr_greek_1_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_greek_1_el_4.2.0_3.0_1664110192753.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_greek_1_el_4.2.0_3.0_1664110192753.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_greek_1", "el")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_greek_1", "el") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_greek_1| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|el| |Size:|1.2 GB| --- layout: model title: Legal Terms Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_terms_bert date: 2023-03-05 tags: [en, legal, classification, clauses, terms, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Terms` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Terms`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_terms_bert_en_1.0.0_3.0_1678050597560.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_terms_bert_en_1.0.0_3.0_1678050597560.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_terms_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
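As a rough guard for the 512-token embedding limit mentioned in the description, provisions can be pre-filtered by an approximate length check before entering the pipeline; this sketch uses whitespace tokens as a cheap proxy (actual BERT subword counts will be higher, and `needs_splitting` is an illustrative helper, not part of the library):

```python
def needs_splitting(provision, max_tokens=512):
    """Flag provisions whose rough token count exceeds the model limit.

    Whitespace tokens only approximate the BERT subword count, so this
    is a conservative pre-filter, not an exact check.
    """
    return len(provision.split()) > max_tokens

short = "This Agreement constitutes the entire agreement between the parties."
long_doc = "clause " * 600  # 600 whitespace tokens, over the limit
print(needs_splitting(short), needs_splitting(long_doc))
```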
## Results ```bash +-------+ |result| +-------+ |[Terms]| |[Other]| |[Other]| |[Terms]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_terms_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.95 0.93 0.94 168 Terms 0.91 0.93 0.92 136 accuracy - - 0.93 304 macro-avg 0.93 0.93 0.93 304 weighted-avg 0.93 0.93 0.93 304 ``` --- layout: model title: Detect Assertion Status (assertion_dl_biobert) - supports confidence scores author: John Snow Labs name: assertion_dl_biobert date: 2021-01-26 task: Assertion Status language: en nav_key: models edition: Healthcare NLP 2.7.2 spark_version: 2.4 tags: [assertion, en, licensed, clinical, biobert] supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Assign assertion status to clinical entities extracted by NER based on their context in the text. ## Predicted Entities `absent`, `present`, `conditional`, `associated_with_someone_else`, `hypothetical`, `possible`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_dl_biobert_en_2.7.2_2.4_1611647486798.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_dl_biobert_en_2.7.2_2.4_1611647486798.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\ .setInputCols(["sentence",'token'])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical_biobert", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") clinical_assertion = AssertionDLModel.pretrained("assertion_dl_biobert", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion ]) data = spark.createDataFrame([["""Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. No alopecia noted. 
She denies pain"""]]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") val clinical_assertion = AssertionDLModel.pretrained("assertion_dl_biobert","en", "clinical/models") .setInputCols(Array("sentence", "ner_chunk", "embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion)) val data = Seq("""Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. No alopecia noted. She denies pain""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.biobert").predict("""Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. No alopecia noted. She denies pain.""") ```
## Results ```bash +----------+---------+-----------+ |chunk |ner_label|assertion | +----------+---------+-----------+ |a headache|PROBLEM |present | |a head CT |TEST |present | |anxious |PROBLEM |conditional| |alopecia |PROBLEM |absent | |pain |PROBLEM |absent | +----------+---------+-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_dl_biobert| |Compatibility:|Spark NLP 2.7.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| ## Data Source Trained on i2b2 assertion data. ## Benchmarking ```bash label tp fp fn prec rec f1 absent 769 51 57 0.937805 0.930993 0.934386 present 2575 161 102 0.941155 0.961898 0.951413 conditional 20 14 23 0.588235 0.465116 0.519481 associated_with_someone_else 51 9 15 0.85 0.772727 0.809524 hypothetical 129 13 15 0.908451 0.895833 0.902098 possible 114 44 80 0.721519 0.587629 0.647727 Macro-average 3658 292 292 0.824527 0.769033 0.795814 Micro-average 3658 292 292 0.926076 0.926076 0.926076 ``` --- layout: model title: Legal Entire Agreement Clause Binary Classifier author: John Snow Labs name: legclf_entire_agreement_clause date: 2022-12-07 tags: [en, legal, classification, clauses, entire_agreement, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `entire-agreement` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification as sentence level. 
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `entire-agreement`, `other` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_entire_agreement_clause_en_1.0.0_3.0_1670444342705.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_entire_agreement_clause_en_1.0.0_3.0_1670444342705.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_entire_agreement_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
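When several binary clause classifiers are run over the same paragraph, as suggested in the description, their outputs can be merged into a single clause-to-flag map; a plain-Python sketch of that merge step (classifier names, labels, and `merge_clause_predictions` are all illustrative, not a Spark NLP API):

```python
def merge_clause_predictions(predictions):
    """Merge per-classifier binary results into clause -> bool flags.

    `predictions` maps a clause name (e.g. "entire-agreement") to the
    label string returned by its binary classifier; anything other than
    "other" counts as a positive detection.
    """
    return {clause: label != "other" for clause, label in predictions.items()}

# Illustrative outputs from three hypothetical clause classifiers
# applied to the same paragraph:
outputs = {
    "entire-agreement": "entire-agreement",
    "terms": "other",
    "governing-law": "other",
}
print(merge_clause_predictions(outputs))
```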
## Results ```bash +-------+ |result| +-------+ |[entire-agreement]| |[other]| |[other]| |[entire-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_entire_agreement_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scrapped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support entire-agreement 1.00 1.00 1.00 24 other 1.00 1.00 1.00 73 accuracy - - 1.00 97 macro-avg 1.00 1.00 1.00 97 weighted-avg 1.00 1.00 1.00 97 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Samoan author: John Snow Labs name: opus_mt_en_sm date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, sm, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `en` - target languages: `sm` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_sm_xx_2.7.0_2.4_1609163860994.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_sm_xx_2.7.0_2.4_1609163860994.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_sm", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_sm", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.sm').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_sm| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Marathi Lemmatizer author: John Snow Labs name: lemma date: 2020-07-29 23:37:00 +0800 task: Lemmatization language: mr edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [lemmatizer, mr] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb#scrollTo=bbzEH9u7tdxR){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_mr_2.5.5_2.4_1596055007712.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_mr_2.5.5_2.4_1596055007712.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "mr") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("उत्तरेचा राजा होण्याव्यतिरिक्त, जॉन स्नो एक इंग्रज चिकित्सक आहे आणि भूल आणि वैद्यकीय स्वच्छतेच्या विकासासाठी अग्रगण्य आहे.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "mr") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("उत्तरेचा राजा होण्याव्यतिरिक्त, जॉन स्नो एक इंग्रज चिकित्सक आहे आणि भूल आणि वैद्यकीय स्वच्छतेच्या विकासासाठी अग्रगण्य आहे.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""उत्तरेचा राजा होण्याव्यतिरिक्त, जॉन स्नो एक इंग्रज चिकित्सक आहे आणि भूल आणि वैद्यकीय स्वच्छतेच्या विकासासाठी अग्रगण्य आहे."""] lemma_df = nlu.load('mr.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
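The snippets above elide the `document_assembler` and `tokenizer` stages and require the pretrained Marathi model to run. As a rough, self-contained illustration of what a dictionary-based lemmatizer does, here is a sketch in plain Python; the form-to-root mapping below is made up for the example and is not the model's actual dictionary:

```python
# Toy dictionary-based lemmatizer: every inflected form is mapped to one root,
# so "ran" and "runs" are treated as the same word downstream.
LEMMA_DICT = {"ran": "run", "running": "run", "runs": "run", "better": "good"}

def lemmatize(tokens):
    # Common fallback: keep the surface form when no root is known.
    return [LEMMA_DICT.get(t.lower(), t) for t in tokens]

print(lemmatize(["He", "ran", "and", "runs"]))  # ['He', 'run', 'and', 'run']
```

The pretrained model works from a much larger form-to-lemma mapping, but the input/output contract is the same: one lemma annotation per input token.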
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=7, result='उत्तरेचा', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=9, end=12, result='राजा', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=14, end=29, result='होण्याव्यतिरिक्त', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=30, end=30, result=',', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=32, end=34, result='जॉन', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|mr| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Fast Neural Machine Translation Model from Samoan to English author: John Snow Labs name: opus_mt_sm_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, sm, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
- source languages: `sm` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_sm_en_xx_2.7.0_2.4_1609169212342.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_sm_en_xx_2.7.0_2.4_1609169212342.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_sm_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_sm_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.sm.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_sm_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4CHEMD_Chem_Modified_SciBERT_384 date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Chem-Modified-SciBERT-384` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Modified_SciBERT_384_en_4.0.0_3.0_1657108472368.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Modified_SciBERT_384_en_4.0.0_3.0_1657108472368.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Modified_SciBERT_384","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Modified_SciBERT_384","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
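The classifier above emits one label per token (this model predicts the single entity type `Chemical`). To turn per-token IOB-style labels into entity chunks, Spark NLP pipelines typically append an `NerConverter` stage; the plain-Python sketch below shows the grouping logic itself, with hypothetical tokens and labels rather than actual model output:

```python
# Group IOB-style per-token labels (B-/I-/O) into (entity_type, text) chunks.
def chunk_entities(tokens, labels):
    chunks, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):                 # a new entity starts here
            if current:
                chunks.append(current)
            current = (lab[2:], [tok])
        elif lab.startswith("I-") and current and current[0] == lab[2:]:
            current[1].append(tok)               # continue the open entity
        else:                                    # "O" or an inconsistent tag
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(etype, " ".join(toks)) for etype, toks in chunks]

toks = ["Aspirin", "and", "acetic", "acid", "react"]
labs = ["B-Chemical", "O", "B-Chemical", "I-Chemical", "O"]
print(chunk_entities(toks, labs))  # [('Chemical', 'Aspirin'), ('Chemical', 'acetic acid')]
```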
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4CHEMD_Chem_Modified_SciBERT_384| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|410.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4CHEMD-Chem-Modified-SciBERT-384 --- layout: model title: Fast Neural Machine Translation Model from English to Malayalam author: John Snow Labs name: opus_mt_en_ml date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, ml, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `ml` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ml_xx_2.7.0_2.4_1609169768035.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ml_xx_2.7.0_2.4_1609169768035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_ml", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_ml", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.ml').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_ml| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4CHEMD_Chem_Original_BlueBERT_512 date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Chem-Original-BlueBERT-512` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Original_BlueBERT_512_en_4.0.0_3.0_1657108632479.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Original_BlueBERT_512_en_4.0.0_3.0_1657108632479.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Original_BlueBERT_512","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Original_BlueBERT_512","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4CHEMD_Chem_Original_BlueBERT_512| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|408.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4CHEMD-Chem-Original-BlueBERT-512 --- layout: model title: Spanish Bert Embeddings (from flax-community) author: John Snow Labs name: bert_embeddings_alberti_bert_base_multilingual_cased date: 2022-04-11 tags: [bert, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `alberti-bert-base-multilingual-cased` is a Spanish model originally trained by `flax-community`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_alberti_bert_base_multilingual_cased_es_3.4.2_3.0_1649671065273.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_alberti_bert_base_multilingual_cased_es_3.4.2_3.0_1649671065273.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_alberti_bert_base_multilingual_cased","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_alberti_bert_base_multilingual_cased","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.alberti_bert_base_multilingual_cased").predict("""Me encanta chispa nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_alberti_bert_base_multilingual_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|667.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/flax-community/alberti-bert-base-multilingual-cased - https://github.com/google/flax - https://github.com/linhd-postdata/averell/ - https://postdata.linhd.uned.es/ - https://github.com/pruizf/disco - https://github.com/bncolorado/CorpusSonetosSigloDeOro - https://github.com/bncolorado/CorpusGeneralPoesiaLiricaCastellanaDelSigloDeOro - https://github.com/linhd-postdata/gongocorpus - http://obvil.sorbonne-universite.site/corpus/gongora/gongora_obra-poetica - https://github.com/alhuber1502/ECPA - https://github.com/waynegraham/for_better_for_verse - https://crisco2.unicaen.fr/verlaine/index.php?navigation=accueil - https://github.com/linhd-postdata/metrique-en-ligne - https://github.com/linhd-postdata/biblioteca_italiana - http://www.bibliotecaitaliana.it/ - https://github.com/versotym/corpusCzechVerse - https://gitlab.com/stichotheque/stichotheque-pt - https://github.com/linhd-postdata/poesi.as - http://www.poesi.as/ - https://github.com/aparrish/gutenberg-poetry-corpus - https://www.kaggle.com/ahmedabelal/arabic-poetry - https://github.com/THUNLP-AIPoet/Datasets/tree/master/CCPC - https://github.com/sks190/SKVR - https://github.com/linhd-postdata/textgrid-poetry - https://textgrid.de/en/digitale-bibliothek - https://github.com/tnhaider/german-rhyme-corpus - https://github.com/ELTE-DH/verskorpusz - https://www.kaggle.com/oliveirasp6/poems-in-portuguese - https://www.kaggle.com/grafstor/19-000-russian-poems - https://discord.com/channels/858019234139602994/859113060068229190 --- layout: model title: OCR Base for Handwritten Text v2 author: John Snow Labs name: ocr_base_handwritten_v2 date: 2022-06-15 tags: [en, 
licensed] task: OCR Text Detection & Recognition language: en nav_key: models edition: Visual NLP 3.4.0 spark_version: 3.0 supported: true annotator: ImageToTextV2 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description OCR base model for recognizing handwritten text, based on the TrOCR architecture. The TrOCR model was proposed in TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to perform optical character recognition (OCR). The abstract from the paper is the following: Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.
## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/ocr_base_handwritten_v2_en_3.4.0_3.0_1655299331379.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/ocr_base_handwritten_v2_en_3.4.0_3.0_1655299331379.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ocr = ImageToTextV2.pretrained("ocr_base_handwritten_v2", "en", "clinical/ocr") ocr.setInputCols(["image"]) ocr.setOutputCol("text") result = ocr.transform(image_text_lines_df).collect() print(result[0].text) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ocr_base_handwritten_v2| |Type:|ocr| |Compatibility:|Visual NLP 3.4.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|781.9 MB| --- layout: model title: Stop Words Cleaner for Breton author: John Snow Labs name: stopwords_br date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: br edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, br] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_br_br_2.5.4_2.4_1594742440778.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_br_br_2.5.4_2.4_1594742440778.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_br", "br") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Distaolit dimp hon dleoù evel m'hor bo ivez distaolet d'hon dleourion.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_br", "br") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Distaolit dimp hon dleoù evel m'hor bo ivez distaolet d'hon dleourion.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Distaolit dimp hon dleoù evel m'hor bo ivez distaolet d'hon dleourion."""] stopword_df = nlu.load('br.stopwords').predict(text) stopword_df[["cleanTokens"]] ```
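Under the hood, stop-word cleaning is a membership test against a fixed word list. Here is a minimal plain-Python sketch of the idea; the three-word list below is an invented sample, not the model's actual Breton stop-word list:

```python
# Toy stop-word cleaner: drop tokens found in the stop-word set.
STOP_WORDS = {"hon", "evel", "ivez"}

def clean_tokens(tokens, stop_words=STOP_WORDS):
    # Case-insensitive lookup, matching the model's "Case sensitive: false" setting.
    return [t for t in tokens if t.lower() not in stop_words]

print(clean_tokens(["Distaolit", "dimp", "hon", "dleoù"]))  # ['Distaolit', 'dimp', 'dleoù']
```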
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=8, result='Distaolit', metadata={'sentence': '0'}), Row(annotatorType='token', begin=10, end=13, result='dimp', metadata={'sentence': '0'}), Row(annotatorType='token', begin=15, end=17, result='hon', metadata={'sentence': '0'}), Row(annotatorType='token', begin=19, end=23, result='dleoù', metadata={'sentence': '0'}), Row(annotatorType='token', begin=25, end=28, result='evel', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_br| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|br| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: Word Segmenter for Chinese author: John Snow Labs name: wordseg_ctb9 date: 2021-03-09 tags: [word_segmentation, open_source, chinese, wordseg_ctb9, zh] task: Word Segmentation language: zh edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: WordSegmenterModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description [WordSegmenterModel-WSM](https://en.wikipedia.org/wiki/Text_segmentation) is based on a maximum-entropy probability model that detects word boundaries in Chinese text. Chinese text is written without white space between words, so a computer-based application cannot know a priori which sequence of ideograms forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step.
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/chinese/word_segmentation/words_segmenter_demo.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_ctb9_zh_3.0.0_3.0_1615292298505.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_ctb9_zh_3.0.0_3.0_1615292298505.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python word_segmenter = WordSegmenterModel.pretrained("wordseg_ctb9", "zh") \ .setInputCols(["document"]) \ .setOutputCol("token") pipeline = Pipeline(stages=[document_assembler, word_segmenter]) ws_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) example = spark.createDataFrame([['从John Snow Labs你好! ']], ["text"]) result = ws_model.transform(example) ``` ```scala val word_segmenter = WordSegmenterModel.pretrained("wordseg_ctb9", "zh") .setInputCols(Array("document")) .setOutputCol("token") val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter)) val data = Seq("从John Snow Labs你好! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["从John Snow Labs你好! "] token_df = nlu.load('zh.segment_words.ctb9').predict(text) token_df ```
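The pretrained segmenter uses a learned maximum-entropy model, but the reason segmentation needs a model at all is easy to show with a much simpler technique: greedy longest-match against a vocabulary (forward maximum matching). The tiny vocabulary below is hypothetical:

```python
# Forward maximum matching: at each position, take the longest vocabulary entry.
VOCAB = {"你好", "你", "好", "从"}

def forward_max_match(text, vocab=VOCAB, max_len=4):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if cand in vocab or length == 1:
                tokens.append(cand)
                i += length
                break
    return tokens

print(forward_max_match("从你好"))  # ['从', '你好']
```

A statistical model such as this annotator's handles out-of-vocabulary words and ambiguous boundaries that greedy matching gets wrong.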
## Results ```bash 0 从 1 J 2 o 3 h 4 n 5 S 6 n 7 o 8 w 9 Labs 10 你 11 好 12 ! Name: token, dtype: object ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wordseg_ctb9| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[words_segmented]| |Language:|zh| --- layout: model title: English BertForQuestionAnswering model (from stevemobs) author: John Snow Labs name: bert_qa_bert_finetuned_squad_pytorch date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad-pytorch` is an English model originally trained by `stevemobs`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_squad_pytorch_en_4.0.0_3.0_1654535957648.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_squad_pytorch_en_4.0.0_3.0_1654535957648.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_finetuned_squad_pytorch","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_finetuned_squad_pytorch","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_stevemobs").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_finetuned_squad_pytorch| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/stevemobs/bert-finetuned-squad-pytorch --- layout: model title: Hindi DistilBertForMaskedLM Cased model (from neuralspace-reverie) author: John Snow Labs name: distilbert_embeddings_indic_transformers date: 2022-12-12 tags: [hi, open_source, distilbert_embeddings, distilbertformaskedlm] task: Embeddings language: hi edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `indic-transformers-hi-distilbert` is a Hindi model originally trained by `neuralspace-reverie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_indic_transformers_hi_4.2.4_3.0_1670864885093.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_indic_transformers_hi_4.2.4_3.0_1670864885093.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_indic_transformers","hi") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(False) pipeline = Pipeline(stages=[documentAssembler, tokenizer, distilbert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_indic_transformers","hi") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, distilbert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
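A common downstream use of the extracted `embeddings` annotations is comparing token or sentence vectors with cosine similarity. A stdlib-only sketch with toy vectors (not real model output):

```python
# Cosine similarity between two embedding vectors.
# The vectors below are toy values for illustration; real DistilBERT
# embeddings from the pipeline above are high-dimensional floats.
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine([1.0, 0.0], [1.0, 0.0]), 2))  # 1.0
```

Identical vectors score 1.0, orthogonal ones 0.0; this is the usual way to rank similar tokens or sentences once embeddings are extracted.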
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_indic_transformers| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|hi| |Size:|247.5 MB| |Case sensitive:|false| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-hi-distilbert - https://oscar-corpus.com/ --- layout: model title: Legal Contracts Clause Binary Classifier author: John Snow Labs name: legclf_contracts_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `contracts` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
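The paragraph splitting and 512-token check described above can be sketched in plain Python (a minimal illustration with names of our choosing; it counts whitespace tokens, so real subword counts from the embeddings tokenizer will differ):

```python
# Minimal sketch: split a legal document on blank lines ("paragraph
# splitting by multiline") and keep only chunks under the 512-token
# embedding limit. Plain Python for illustration; the linked tutorial
# shows the Spark NLP way to do this.
import re

MAX_TOKENS = 512  # embedding limit mentioned above

def split_paragraphs(text):
    """Split on one or more blank lines; drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def within_limit(paragraph, max_tokens=MAX_TOKENS):
    """Rough whitespace-token check; subword counts will be higher."""
    return len(paragraph.split()) <= max_tokens

doc = "Clause 1. The parties agree...\n\nClause 2. Termination..."
chunks = [p for p in split_paragraphs(doc) if within_limit(p)]
print(len(chunks))  # 2
```

Each surviving chunk can then be fed to the classifier as one `clause_text` row.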
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `contracts` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_contracts_clause_en_1.0.0_3.2_1660123359175.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_contracts_clause_en_1.0.0_3.2_1660123359175.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_contracts_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
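Since each clause model is binary, combining several of them (as the description notes) amounts to collecting one True/False flag per clause type. A plain-Python sketch of that aggregation; the model names and labels below are illustrative, not a list of actual models:

```python
# Hypothetical aggregation of several binary clause classifiers' outputs.
# Each model predicts either its clause name or "other" for a given text.
def aggregate_clause_flags(predictions):
    """predictions maps model name -> predicted label; returns True/False flags."""
    return {name: (label != "other") for name, label in predictions.items()}

preds = {
    "legclf_contracts_clause": "contracts",        # illustrative output
    "legclf_confidentiality_clause": "other",      # hypothetical model name
}
print(aggregate_clause_flags(preds))
# {'legclf_contracts_clause': True, 'legclf_confidentiality_clause': False}
```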
## Results ```bash +-------+ | result| +-------+ |[contracts]| |[other]| |[other]| |[contracts]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_contracts_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support contracts 0.87 0.77 0.82 35 other 0.85 0.92 0.88 50 accuracy - - 0.86 85 macro-avg 0.86 0.85 0.85 85 weighted-avg 0.86 0.86 0.86 85 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from Marscen) author: John Snow Labs name: roberta_qa_base_squad2_finetuned_squad2 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-finetuned-squad2` is an English model originally trained by `Marscen`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_finetuned_squad2_en_4.3.0_3.0_1674219552956.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_finetuned_squad2_en_4.3.0_3.0_1674219552956.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2_finetuned_squad2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2_finetuned_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
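Extractive QA models like this one return the answer as a span of the `context` column rather than generated text. A toy plain-Python illustration of span extraction from predicted character offsets (the offsets are hand-picked for this example sentence, not real model output):

```python
# Toy illustration: an extractive QA model predicts start/end offsets
# into the context, and the answer is simply that substring.
def extract_span(context, start, end):
    """Return the answer substring for predicted character offsets."""
    return context[start:end]

ctx = "My name is Clara and I live in Berkeley."
print(extract_span(ctx, 11, 16))  # Clara
```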
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_squad2_finetuned_squad2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Marscen/roberta-base-squad2-finetuned-squad2 --- layout: model title: Fast Neural Machine Translation Model from Gun to English author: John Snow Labs name: opus_mt_guw_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, guw, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `guw` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_guw_en_xx_2.7.0_2.4_1609165028271.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_guw_en_xx_2.7.0_2.4_1609165028271.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_guw_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_guw_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.guw.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_guw_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Ukrainian Part of Speech Tagger (Large) author: John Snow Labs name: bert_pos_bert_large_slavic_cyrillic_upos date: 2022-04-26 tags: [bert, pos, part_of_speech, uk, open_source] task: Part of Speech Tagging language: uk edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-slavic-cyrillic-upos` is a Ukrainian model originally trained by `KoichiYasuoka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_large_slavic_cyrillic_upos_uk_3.4.2_3.0_1650993800857.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_large_slavic_cyrillic_upos_uk_3.4.2_3.0_1650993800857.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_large_slavic_cyrillic_upos","uk") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Я люблю Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_large_slavic_cyrillic_upos","uk") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Я люблю Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("uk.pos.bert_large_slavic_cyrillic_upos").predict("""Я люблю Spark NLP""") ```
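The annotator emits one UPOS tag per token, so pairing tokens with their tags downstream is a simple zip. A plain-Python sketch (the tags shown are plausible UPOS values for the example sentence, not actual model output):

```python
# Pair each token with its predicted UPOS tag, as produced by the
# token classifier above. Tags here are illustrative.
def tag_pairs(tokens, tags):
    """Zip tokens with their UPOS tags, checking lengths agree."""
    if len(tokens) != len(tags):
        raise ValueError("token/tag length mismatch")
    return list(zip(tokens, tags))

print(tag_pairs(["Я", "люблю", "Spark", "NLP"],
                ["PRON", "VERB", "PROPN", "PROPN"]))
# [('Я', 'PRON'), ('люблю', 'VERB'), ('Spark', 'PROPN'), ('NLP', 'PROPN')]
```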
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_large_slavic_cyrillic_upos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|uk| |Size:|1.6 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/KoichiYasuoka/bert-large-slavic-cyrillic-upos - https://universaldependencies.org/be/ - https://universaldependencies.org/bg/ - https://universaldependencies.org/ru/ - https://universaldependencies.org/treebanks/sr_set/ - https://universaldependencies.org/treebanks/uk_iu/ - https://universaldependencies.org/u/pos/ - https://github.com/KoichiYasuoka/esupar --- layout: model title: Detect Temporal Relations for Clinical Events (Enriched) author: John Snow Labs name: re_temporal_events_enriched_clinical date: 2020-09-28 task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 2.6.0 spark_version: 2.4 tags: [re, en, clinical, licensed] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model can be used to identify temporal relationships among clinical events. 
## Predicted Entities `BEFORE`, `AFTER`, `SIMULTANEOUS`, `BEGUN_BY`, `ENDED_BY`, `DURING`, `BEFORE_OVERLAP` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CLINICAL_EVENTS/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_temporal_events_enriched_clinical_en_2.5.5_2.4_1597775105767.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_temporal_events_enriched_clinical_en_2.5.5_2.4_1597775105767.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, PerceptronModel, DependencyParserModel, WordEmbeddingsModel, NerDLModel, NerConverter, RelationExtractionModel. In the table below, the `re_temporal_events_enriched_clinical` RE model, its labels, optimal NER model, and meaningful relation pairs are illustrated. | RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS | |:------------------------------------:|:-----------------------------------------------------------------------:|:-------------------:|:-------------------------:| | re_temporal_events_enriched_clinical | BEFORE, AFTER, SIMULTANEOUS, BEGUN_BY, ENDED_BY, DURING, BEFORE_OVERLAP | ner_events_clinical | [“No need to set pairs.”] |
{% include programmingLanguageSelectScalaPython.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = sparknlp.annotators.Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel().pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverter() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel().pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") clinical_re_Model = RelationExtractionModel()\ .pretrained("re_temporal_events_enriched_clinical", "en", 'clinical/models')\ .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\ .setOutputCol("relations")\ .setMaxSyntacticDistance(4) #default: 0 nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos_tagger, word_embeddings, clinical_ner, ner_converter, dependency_parser, clinical_re_Model]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate("""The patient is a 56-year-old right-handed female with longstanding intermittent right low back pain, who was involved in a motor vehicle accident in September of 2005. 
At that time, she did not notice any specific injury, but five days later, she started getting abnormal right low back pain.""") ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val pos_tagger = PerceptronModel().pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel().pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") val clinical_re_Model = RelationExtractionModel() .pretrained("re_temporal_events_enriched_clinical", "en", "clinical/models") .setInputCols("embeddings", "pos_tags", "ner_chunks", "dependencies") .setOutputCol("relations") .setMaxSyntacticDistance(4) val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos_tagger, word_embeddings, clinical_ner, ner_converter, dependency_parser, clinical_re_Model)) val data = Seq("""The patient is a 56-year-old right-handed female with longstanding intermittent right low back pain, who was involved in a motor vehicle accident in September of 2005. 
At that time, she did not notice any specific injury, but five days later, she started getting abnormal right low back pain.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ```
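`setMaxSyntacticDistance(4)` limits which entity pairs are considered for relation extraction by their distance in the dependency tree. A toy plain-Python illustration of distance-based pair filtering, using linear token distance as a simplified stand-in for true syntactic distance (entity names and positions are made up for the example text above):

```python
# Toy illustration: keep only entity pairs whose positional distance is
# within a threshold, analogous to setMaxSyntacticDistance in the RE model.
# Real syntactic distance is measured over the dependency parse, not tokens.
def candidate_pairs(entities, max_distance):
    """entities: list of (name, token_position). Return pairs within range."""
    pairs = []
    for i, (name1, pos1) in enumerate(entities):
        for name2, pos2 in entities[i + 1:]:
            if abs(pos2 - pos1) <= max_distance:
                pairs.append((name1, name2))
    return pairs

ents = [("back pain", 10), ("accident", 13), ("injury", 30)]
print(candidate_pairs(ents, 4))  # [('back pain', 'accident')]
```

Raising the threshold admits more distant (and often noisier) candidate pairs, which is the trade-off the parameter controls.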
{:.h2_title} ## Results ```bash +----+------------+-----------+-----------------+---------------+-----------------------------------------------+------------+-----------------+---------------+--------------------------+--------------+ | | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | +====+============+===========+=================+===============+===============================================+============+=================+===============+==========================+==============+ | 0 | OVERLAP | PROBLEM | 54 | 98 | longstanding intermittent right low back pain | OCCURRENCE | 121 | 144 | a motor vehicle accident | 0.532308 | +----+------------+-----------+-----------------+---------------+-----------------------------------------------+------------+-----------------+---------------+--------------------------+--------------+ | 1 | AFTER | DATE | 171 | 179 | that time | PROBLEM | 201 | 219 | any specific injury | 0.577288 | +----+------------+-----------+-----------------+---------------+-----------------------------------------------+------------+-----------------+---------------+--------------------------+--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_temporal_events_enriched_clinical| |Type:|re| |Compatibility:|Healthcare NLP 2.6.0 +| |Edition:|Official| |License:|Licensed| |Input Labels:|[embeddings, pos_tags, ner_chunks, dependencies]| |Output Labels:|[relations]| |Language:|[en]| |Case sensitive:|false| {:.h2_title} ## Data Source Trained on data gathered and manually annotated by John Snow Labs https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ {:.h2_title} ## Benchmarking ```bash |Relation | Recall | Precision | F1 | |---------:|--------:|----------:|-----:| | OVERLAP | 0.81 | 0.73 | 0.77 | | BEFORE | 0.85 | 0.88 | 0.86 | | AFTER | 0.38 | 0.46 | 0.43 | ``` --- layout: model title: English DistilBertForQuestionAnswering Base 
Uncased model (from kaggleodin) author: John Snow Labs name: distilbert_qa_kaggleodin_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `kaggleodin`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_kaggleodin_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771610077.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_kaggleodin_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771610077.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kaggleodin_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kaggleodin_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_kaggleodin_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/kaggleodin/distilbert-base-uncased-finetuned-squad --- layout: model title: Multilingual DistilBertForMaskedLM Cased model (from hf-maintainers) author: John Snow Labs name: distilbert_embeddings_base_multilingual_cased date: 2023-02-23 tags: [distilbert, open_source, distilbert_embeddings, distilbertformaskedlm, xx, tensorflow] task: Embeddings language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-multilingual-cased` is a Multilingual model originally trained by `hf-maintainers`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_base_multilingual_cased_xx_4.3.0_3.0_1677190875798.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_base_multilingual_cased_xx_4.3.0_3.0_1677190875798.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_base_multilingual_cased","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_base_multilingual_cased","xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_base_multilingual_cased| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|xx| |Size:|505.8 MB| |Case sensitive:|false| ## References https://huggingface.co/distilbert-base-multilingual-cased --- layout: model title: Swiss Legal Roberta Embeddings author: John Snow Labs name: roberta_base_swiss_legal date: 2023-02-16 tags: [gsw, swiss, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: gsw edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-swiss-roberta-base` is a Swiss model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_swiss_legal_gsw_4.2.4_3.0_1676580464742.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_swiss_legal_gsw_4.2.4_3.0_1676580464742.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_swiss_legal", "gsw")\
    .setInputCols(["sentence"])\
    .setOutputCol("embeddings")
```
```scala
val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_swiss_legal", "gsw")
  .setInputCols("sentence")
  .setOutputCol("embeddings")
```
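Downstream tasks typically compare the resulting sentence vectors by cosine similarity. A minimal, Spark-free sketch of that comparison (the three-dimensional vectors below are illustrative placeholders, not actual model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Placeholder vectors standing in for sentence embeddings.
v1 = [0.2, 0.1, 0.7]
v2 = [0.2, 0.1, 0.7]
v3 = [0.9, -0.3, 0.1]

print(round(cosine_similarity(v1, v2), 4))  # identical vectors -> 1.0
print(cosine_similarity(v1, v3) < 1.0)
```

In practice the vectors would come from the `embeddings` output column of the pipeline above; only the comparison step is shown here.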
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_base_swiss_legal|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|gsw|
|Size:|695.1 MB|
|Case sensitive:|true|

## References

https://huggingface.co/joelito/legal-swiss-roberta-base

---
layout: model
title: Relation Extraction Between Body Parts and Problem Entities (ReDL)
author: John Snow Labs
name: redl_bodypart_problem_biobert
date: 2021-06-01
tags: [licensed, en, clinical, relation_extraction]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 3.0.3
spark_version: 3.0
supported: true
annotator: RelationExtractionDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Relation extraction between body parts and problem entities in clinical texts. `1`: indicates that there is a relation between a body part entity and an entity labeled as a problem (diagnosis, symptom, etc.). `0`: indicates that there is no relation between body part and problem entities.

## Predicted Entities

`0`, `1`

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.1.Clinical_Relation_Extraction_BodyParts_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_problem_biobert_en_3.0.3_3.0_1622578544064.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_problem_biobert_en_3.0.3_3.0_1622578544064.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = sparknlp.annotators.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens", "embeddings"])\
    .setOutputCol("ner_tags")

ner_converter = NerConverter()\
    .setInputCols(["sentences", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

# Set a filter on the pairs of named entities that will be treated as relation candidates
re_ner_chunk_filter = RENerChunksFilter()\
    .setInputCols(["ner_chunks", "dependencies"])\
    .setMaxSyntacticDistance(10)\
    .setOutputCol("re_ner_chunks")\
    .setRelationPairs(["SYMPTOM-EXTERNAL_BODY_PART_OR_REGION", "EXTERNAL_BODY_PART_OR_REGION-SYMPTOM"])

# This model was trained on sentence-level relations. It can also be trained on
# document-level relations; in that case, use "document" instead of "sentence" as input at prediction time.
re_model = RelationExtractionDLModel.pretrained("redl_bodypart_problem_biobert", "en", "clinical/models")\
    .setPredictionThreshold(0.5)\
    .setInputCols(["re_ner_chunks", "sentences"])\
    .setOutputCol("relations")

pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model])

text = "No neurologic deficits other than some numbness in his left hand."
data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
...
val documenter = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentencer = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentences")

val tokenizer = new Tokenizer()
  .setInputCols("sentences")
  .setOutputCol("tokens")

val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens"))
  .setOutputCol("pos_tags")

val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens"))
  .setOutputCol("embeddings")

val ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens", "embeddings"))
  .setOutputCol("ner_tags")

val ner_converter = new NerConverter()
  .setInputCols(Array("sentences", "tokens", "ner_tags"))
  .setOutputCol("ner_chunks")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentences", "pos_tags", "tokens"))
  .setOutputCol("dependencies")

// Set a filter on the pairs of named entities that will be treated as relation candidates
val re_ner_chunk_filter = new RENerChunksFilter()
  .setInputCols(Array("ner_chunks", "dependencies"))
  .setMaxSyntacticDistance(10)
  .setOutputCol("re_ner_chunks")
  .setRelationPairs(Array("SYMPTOM-EXTERNAL_BODY_PART_OR_REGION", "EXTERNAL_BODY_PART_OR_REGION-SYMPTOM"))

// This model was trained on sentence-level relations. It can also be trained on
// document-level relations; in that case, use "document" instead of "sentence" as input at prediction time.
val re_model = RelationExtractionDLModel.pretrained("redl_bodypart_problem_biobert", "en", "clinical/models")
  .setPredictionThreshold(0.5)
  .setInputCols(Array("re_ner_chunks", "sentences"))
  .setOutputCol("relations")

val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model))

val data = Seq("No neurologic deficits other than some numbness in his left hand.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.relation.bodypart.problem").predict("""No neurologic deficits other than some numbness in his left hand.""")
```
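The `RENerChunksFilter` stage above keeps only entity-chunk pairs whose label combination is whitelisted in `setRelationPairs` and whose separation does not exceed the maximum syntactic distance. A Spark-free toy sketch of that candidate-filtering idea (the chunk tuples and the simple positional distance are simplified stand-ins for the real dependency-tree-based logic):

```python
def candidate_pairs(chunks, relation_pairs, max_distance):
    """chunks: list of (text, label, position) tuples.
    Returns the (chunk, chunk) pairs eligible as relation candidates."""
    allowed = {tuple(p.split("-")) for p in relation_pairs}
    pairs = []
    for i, (t1, l1, p1) in enumerate(chunks):
        for t2, l2, p2 in chunks[i + 1:]:
            # Keep the pair only if its label combination is whitelisted
            # and the chunks are close enough to plausibly relate.
            if (l1, l2) in allowed and abs(p1 - p2) <= max_distance:
                pairs.append((t1, t2))
    return pairs

chunks = [
    ("numbness", "SYMPTOM", 6),
    ("hand", "EXTERNAL_BODY_PART_OR_REGION", 11),
    ("deficits", "PROBLEM", 2),
]
print(candidate_pairs(
    chunks,
    ["SYMPTOM-EXTERNAL_BODY_PART_OR_REGION", "EXTERNAL_BODY_PART_OR_REGION-SYMPTOM"],
    10,
))  # [('numbness', 'hand')]
```

Only the surviving pairs are passed on to the relation classifier, which is why a tight `setMaxSyntacticDistance` can noticeably reduce prediction time.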
## Results

```bash
| | relation | entity1 | chunk1 | entity2 | chunk2 | confidence |
|---:|-----------:|:----------|:---------|:-----------------------------|:---------|-------------:|
| 0 | 1 | Symptom | numbness | External_body_part_or_region | hand | 1 |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|redl_bodypart_problem_biobert|
|Compatibility:|Healthcare NLP 3.0.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Case sensitive:|true|

## Data Source

Trained on an internal dataset.

## Benchmarking

```bash
Relation  Recall  Precision  F1     Support
0         0.762   0.814      0.787  315
1         0.938   0.917      0.927  885
Avg.      0.850   0.865      0.857  -
```

---
layout: model
title: Pipeline to Detect Oncology-Specific Entities
author: John Snow Labs
name: ner_oncology_pipeline
date: 2023-03-08
tags: [licensed, clinical, en, oncology, biomarker, treatment]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [ner_oncology](https://nlp.johnsnowlabs.com/2022/11/24/ner_oncology_en.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_pipeline_en_4.3.0_3.2_1678283681531.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_pipeline_en_4.3.0_3.2_1678283681531.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_oncology_pipeline", "en", "clinical/models") text = '''The had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to the residual breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_oncology_pipeline", "en", "clinical/models") val text = "The had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to the residual breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy." val result = pipeline.fullAnnotate(text) ```
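For inspection, the list returned by `fullAnnotate` is usually flattened into rows of chunk text and label. A minimal sketch of that post-processing on a mocked result (the `result`/`metadata` fields below are assumptions modeled on typical Spark NLP annotation output, shown here without a Spark session):

```python
def chunks_to_rows(annotations):
    """Flatten a list of annotation-like dicts into (chunk, label, confidence) rows."""
    rows = []
    for ann in annotations:
        rows.append((
            ann["result"],
            ann["metadata"].get("entity"),
            ann["metadata"].get("confidence"),
        ))
    return rows

# Mocked annotations imitating one document's ner-chunk output column.
mock = [
    {"result": "mastectomy", "metadata": {"entity": "Cancer_Surgery", "confidence": "0.952"}},
    {"result": "breast cancer", "metadata": {"entity": "Cancer_Dx", "confidence": "0.9272"}},
]
for row in chunks_to_rows(mock):
    print(row)
```

With the real pipeline, the mocked list would be replaced by the corresponding entry of `result[0]` from `pipeline.fullAnnotate(text)`.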
## Results

```bash
| | ner_chunks | begin | end | ner_label | confidence |
|---:|:-------------------------------|--------:|------:|:----------------------|-------------:|
| 0 | left | 31 | 34 | Direction | 0.9913 |
| 1 | mastectomy | 36 | 45 | Cancer_Surgery | 0.952 |
| 2 | axillary lymph node dissection | 54 | 83 | Cancer_Surgery | 0.744525 |
| 3 | left | 91 | 94 | Direction | 0.9966 |
| 4 | breast cancer | 96 | 108 | Cancer_Dx | 0.9272 |
| 5 | twenty years ago | 110 | 125 | Relative_Date | 0.857067 |
| 6 | tumor | 132 | 136 | Tumor_Finding | 0.9959 |
| 7 | positive | 142 | 149 | Biomarker_Result | 0.9958 |
| 8 | ER | 155 | 156 | Biomarker | 0.9952 |
| 9 | PR | 162 | 163 | Biomarker | 0.9709 |
| 10 | radiotherapy | 183 | 194 | Radiotherapy | 0.9997 |
| 11 | breast | 229 | 234 | Site_Breast | 0.8288 |
| 12 | cancer | 241 | 246 | Cancer_Dx | 0.9949 |
| 13 | recurred | 248 | 255 | Response_To_Treatment | 0.9849 |
| 14 | right | 262 | 266 | Direction | 0.9993 |
| 15 | lung | 268 | 271 | Site_Lung | 0.9982 |
| 16 | metastasis | 273 | 282 | Metastasis | 0.9999 |
| 17 | 13 years later | 284 | 297 | Relative_Date | 0.791433 |
| 18 | adriamycin | 346 | 355 | Chemotherapy | 0.9999 |
| 19 | 60 mg/m2 | 358 | 365 | Dosage | 0.91785 |
| 20 | cyclophosphamide | 372 | 387 | Chemotherapy | 0.9999 |
| 21 | 600 mg/m2 | 390 | 398 | Dosage | 0.9647 |
| 22 | six courses | 406 | 416 | Cycle_Count | 0.6798 |
| 23 | first line | 422 | 431 | Line_Of_Therapy | 0.9792 |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_oncology_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel

---
layout: model
title: Legal NER for MAPA (Multilingual Anonymisation for Public Administrations)
author: John Snow Labs
name: legner_mapa
date: 2023-04-27
tags: [lt, licensed, ner, legal, mapa]
task: Named Entity Recognition
language: lt
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union. This model extracts `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, and `PERSON` entities from `Lithuanian` documents.

## Predicted Entities

`ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, `PERSON`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_mapa_lt_1.0.0_3.0_1682599671257.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_mapa_lt_1.0.0_3.0_1682599671257.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_lt_cased", "lt")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_mapa", "lt", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""Iš pagrindinės bylos matyti, kad Martin-Meat darbuotojai buvo komandiruoti į Austriją laikotarpiu nuo 2007 m iki 2012 m mėsos išpjaustymo darbams Alpenrind patalpose atlikti."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
## Results

```bash
+-----------+------------+
|chunk      |ner_label   |
+-----------+------------+
|Martin-Meat|ORGANISATION|
|Austriją   |ADDRESS     |
|2007 m     |DATE        |
|2012 m     |DATE        |
|Alpenrind  |ORGANISATION|
+-----------+------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legner_mapa|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|lt|
|Size:|1.4 MB|

## References

The dataset is available [here](https://huggingface.co/datasets/joelito/mapa).

## Benchmarking

```bash
label         precision  recall  f1-score  support
ADDRESS       0.86       0.75    0.80      8
AMOUNT        1.00       0.64    0.78      11
DATE          0.97       0.97    0.97      65
ORGANISATION  0.81       0.86    0.83      35
PERSON        0.87       0.84    0.85      56
macro-avg     0.90       0.87    0.89      175
macro-avg     0.90       0.81    0.85      175
weighted-avg  0.90       0.87    0.89      175
```

---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from phanimvsk)
author: John Snow Labs
name: distilbert_qa_edtech_v2
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Edtech_v2` is an English model originally trained by `phanimvsk`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_edtech_v2_en_4.3.0_3.0_1672765466938.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_edtech_v2_en_4.3.0_3.0_1672765466938.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_edtech_v2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_edtech_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_edtech_v2|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/phanimvsk/Edtech_v2

---
layout: model
title: English BertForQuestionAnswering model (from madlag)
author: John Snow Labs
name: bert_qa_bert_base_uncased_squadv1_x2.32_f86.6_d15_hybrid_v1
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squadv1-x2.32-f86.6-d15-hybrid-v1` is an English model originally trained by `madlag`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squadv1_x2.32_f86.6_d15_hybrid_v1_en_4.0.0_3.0_1654181597581.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squadv1_x2.32_f86.6_d15_hybrid_v1_en_4.0.0_3.0_1654181597581.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_squadv1_x2.32_f86.6_d15_hybrid_v1","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
  .setInputCols("question", "context")
  .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
  .pretrained("bert_qa_bert_base_uncased_squadv1_x2.32_f86.6_d15_hybrid_v1","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
  ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
  ("What's my name?", "My name is Clara and I live in Berkeley."))
  .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.base_uncased_x2.01_f89.2_d30_hybrid_rewind_opt_v1.by_madlag").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_uncased_squadv1_x2.32_f86.6_d15_hybrid_v1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|149.1 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/madlag/bert-base-uncased-squadv1-x2.32-f86.6-d15-hybrid-v1
- https://rajpurkar.github.io/SQuAD-explorer
- https://www.aclweb.org/anthology/N19-1423.pdf

---
layout: model
title: Chinese Bert Embeddings (Large)
author: John Snow Labs
name: bert_embeddings_bert_large_chinese
date: 2022-04-11
tags: [bert, embeddings, zh, open_source]
task: Embeddings
language: zh
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-chinese` is a Chinese model originally trained by `yechen`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_chinese_zh_3.4.2_3.0_1649669611777.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_chinese_zh_3.4.2_3.0_1649669611777.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_chinese","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_chinese","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.bert_large_chinese").predict("""I love Spark NLP""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_large_chinese|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|1.2 GB|
|Case sensitive:|true|

## References

- https://huggingface.co/yechen/bert-large-chinese

---
layout: model
title: Persian Named Entity Recognition (from HooshvareLab)
author: John Snow Labs
name: roberta_ner_roberta_fa_zwnj_base_ner
date: 2022-05-03
tags: [roberta, ner, open_source, fa]
task: Named Entity Recognition
language: fa
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-fa-zwnj-base-ner` is a Persian model originally trained by `HooshvareLab`.

## Predicted Entities

`PRO`, `PCT`, `PER`, `ORG`, `DAT`, `TIM`, `EVE`, `FAC`, `LOC`, `MON`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_ner_roberta_fa_zwnj_base_ner_fa_3.4.2_3.0_1651594463153.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_ner_roberta_fa_zwnj_base_ner_fa_3.4.2_3.0_1651594463153.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_roberta_fa_zwnj_base_ner","fa") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["من عاشق جرقه nlp هستم"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_roberta_fa_zwnj_base_ner","fa") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("من عاشق جرقه nlp هستم").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fa.ner.roberta_fa_zwnj_base_ner").predict("""من عاشق جرقه nlp هستم""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_ner_roberta_fa_zwnj_base_ner|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|fa|
|Size:|442.7 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/HooshvareLab/roberta-fa-zwnj-base-ner
- https://github.com/HaniehP/PersianNER
- http://nsurl.org/2019-2/tasks/task-7-named-entity-recognition-ner-for-farsi/
- https://elisa-ie.github.io/wikiann/
- https://github.com/hooshvare/parsner/issues

---
layout: model
title: Legal Zero-shot NER
author: John Snow Labs
name: legner_roberta_zeroshot
date: 2022-09-02
tags: [en, legal, ner, zero, shot, zeroshot, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
recommended: true
annotator: ZeroShotNER
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is trained to carry out a Zero-Shot Named Entity Recognition (NER) approach, detecting any kind of entities with no training dataset, just the pretrained RoBERTa embeddings (included in the model) and some examples.

## Predicted Entities

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/legal/LEGNER_ZEROSHOT/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_roberta_zeroshot_en_1.0.0_3.2_1662113815288.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_roberta_zeroshot_en_1.0.0_3.2_1662113815288.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sparktokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

zero_shot_ner = legal.ZeroShotNerModel.pretrained("legner_roberta_zeroshot", "en", "legal/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("zero_shot_ner")\
    .setEntityDefinitions(
        {
            "DATE": ["When was the company acquisition?", "When was the company purchase agreement?", "When was the agreement?"],
            "ORG": ["Which company?"],
            "STATE": ["Which state?"],
            "AGREEMENT": ["What kind of agreement?"],
            "LICENSE": ["What kind of license?"],
            "LICENSE_RECIPIENT": ["To whom the license is granted?"]
        })

nerconverter = nlp.NerConverter()\
    .setInputCols(["document", "token", "zero_shot_ner"])\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sparktokenizer,
    zero_shot_ner,
    nerconverter,
])

sample_text = [
    "In March 2012, as part of a longer-term strategy, the Company acquired Vertro, Inc., which owned and operated the ALOT product portfolio.",
    "In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.",
    "This INTELLECTUAL PROPERTY AGREEMENT, dated as of December 31, 2018 (the 'Effective Date') is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ('Seller') and AFI Licensing LLC, a Delaware company('Licensing')",
    "The Company hereby grants to Seller a perpetual, non- exclusive, royalty-free license",
]

p_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

res = p_model.transform(spark.createDataFrame(sample_text, StringType()).toDF("text"))

res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.begin, res.ner_chunk.end, res.ner_chunk.metadata)).alias("cols")) \
    .select(F.expr("cols['0']").alias("chunk"),
            F.expr("cols['3']['entity']").alias("ner_label"))\
    .filter("ner_label!='O'")\
    .show(truncate=False)
```
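Under the hood, zero-shot NER of this kind reframes each entity definition as an extractive-QA prompt: every question is asked against the text, and a span answered with sufficient confidence is tagged with that question's entity label. A toy, Spark-free sketch of the idea (the keyword-matching `toy_qa` below is a deliberately naive stand-in for the real RoBERTa-based span extractor, and all names in it are hypothetical):

```python
def toy_qa(question, text):
    """Naive stand-in for an extractive QA model: returns (span, score)."""
    for token in text.split():
        if "agreement" in question.lower() and token.endswith("agreement"):
            return token, 0.9
        if "state" in question.lower() and token == "Delaware":
            return token, 0.8
    return None, 0.0

def zero_shot_ner(text, entity_definitions, threshold=0.5):
    """Tag spans by asking each entity-definition question against the text."""
    entities = []
    for label, questions in entity_definitions.items():
        for q in questions:
            span, score = toy_qa(q, text)
            if span and score >= threshold:
                entities.append((span, label))
    return entities

defs = {"AGREEMENT": ["What kind of agreement?"], "STATE": ["Which state?"]}
print(zero_shot_ner("This purchase agreement is governed by Delaware law.", defs))
```

This also explains why the model card's `setEntityDefinitions` takes questions rather than label lists: the questions themselves are the only "training signal" the model sees.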
## Results ```bash +---------------------------------------+-----------------+ |chunk |ner_label | +---------------------------------------+-----------------+ |March 2012 |DATE | |Vertro, Inc |ORG | |February 2017 |DATE | |asset purchase agreement |AGREEMENT | |NetSeer |ORG | |INTELLECTUAL PROPERTY AGREEMENT |AGREEMENT | |December 31, 2018 |DATE | |Armstrong Flooring |ORG | |Delaware |STATE | |AFI Licensing LLC |ORG | |Delaware |ORG | |Seller |LICENSE_RECIPIENT| |perpetual, non- exclusive, royalty-free|LICENSE | +---------------------------------------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_roberta_zeroshot| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|460.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References Legal Roberta Embeddings --- layout: model title: Drug Substance to UMLS Code Pipeline author: John Snow Labs name: umls_drug_substance_resolver_pipeline date: 2023-03-30 tags: [en, licensed, umls, pipeline, resolver, clinical] task: Pipeline Healthcare language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps entities (Drug Substances) with their corresponding UMLS CUI codes. You’ll just feed your text and it will return the corresponding UMLS codes. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/umls_drug_substance_resolver_pipeline_en_4.3.2_3.2_1680193781641.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/umls_drug_substance_resolver_pipeline_en_4.3.2_3.2_1680193781641.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("umls_drug_substance_resolver_pipeline", "en", "clinical/models") pipeline.annotate("The patient was given metformin, lenvatinib and Magnesium hydroxide 100mg/1ml") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = PretrainedPipeline("umls_drug_substance_resolver_pipeline", "en", "clinical/models") val result = pipeline.annotate("The patient was given metformin, lenvatinib and Magnesium hydroxide 100mg/1ml") ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.umls_drug_substance_resolver").predict("""The patient was given metformin, lenvatinib and Magnesium hydroxide 100mg/1ml""") ```
## Results ```bash +-----------------------------+---------+---------+ |chunk |ner_label|umls_code| +-----------------------------+---------+---------+ |metformin |DRUG |C0025598 | |lenvatinib |DRUG |C2986924 | |Magnesium hydroxide 100mg/1ml|DRUG |C1134402 | +-----------------------------+---------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|umls_drug_substance_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|5.1 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger --- layout: model title: Thai BertForQuestionAnswering model (from wicharnkeisei) author: John Snow Labs name: bert_qa_thai_bert_multi_cased_finetuned_xquadv1_finetuned_squad date: 2022-06-02 tags: [th, open_source, question_answering, bert] task: Question Answering language: th edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `thai-bert-multi-cased-finetuned-xquadv1-finetuned-squad` is a Thai model originally trained by `wicharnkeisei`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_thai_bert_multi_cased_finetuned_xquadv1_finetuned_squad_th_4.0.0_3.0_1654192425473.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_thai_bert_multi_cased_finetuned_xquadv1_finetuned_squad_th_4.0.0_3.0_1654192425473.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_thai_bert_multi_cased_finetuned_xquadv1_finetuned_squad","th") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_thai_bert_multi_cased_finetuned_xquadv1_finetuned_squad","th") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("th.answer_question.xquad_squad.bert.cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_thai_bert_multi_cased_finetuned_xquadv1_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|th| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/wicharnkeisei/thai-bert-multi-cased-finetuned-xquadv1-finetuned-squad - https://github.com/iapp-technology/iapp-wiki-qa-dataset --- layout: model title: Legal Obligations Clause Binary Classifier author: John Snow Labs name: legclf_cuad_obligations_clause date: 2022-09-20 tags: [en, legal, classification, clauses, obligations, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `obligations` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
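As a rough illustration of the first technique above (paragraph splitting by multiline), here is a minimal, library-free sketch. The helper name and the whitespace-based token count are illustrative assumptions, not the tutorial's actual code; the real 512-token limit is defined by the embedding model's own tokenizer.

```python
# Hypothetical helper: split a long legal document into paragraphs on
# blank lines, keeping only chunks that roughly fit a 512-token limit.
def split_into_paragraphs(text, max_tokens=512):
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for p in paragraphs:
        # Rough whitespace-token count; the true limit depends on the
        # embedding model's own tokenizer, so leave some headroom.
        if len(p.split()) <= max_tokens:
            chunks.append(p)
        else:
            # Oversized paragraph: split further, e.g. by lines/headers.
            chunks.extend(line for line in p.split("\n") if line.strip())
    return chunks

doc = "CLAUSE 1.\nSome obligations text.\n\nCLAUSE 2.\nOther text."
chunks = split_into_paragraphs(doc)
```

Each resulting chunk can then be fed to the classifier as a separate row of the `clause_text` column.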
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `obligations` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cuad_obligations_clause_en_1.0.0_3.2_1663693203916.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cuad_obligations_clause_en_1.0.0_3.2_1663693203916.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_cuad_obligations_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
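The description above notes that several binary clause classifiers can be run over the same text, each answering True/False for its own clause type. A minimal sketch of collapsing those per-classifier labels into a single clause-to-flag map — the classifier names and label values here are purely illustrative, not real model outputs:

```python
def aggregate_clause_flags(results):
    """Collapse per-classifier binary outputs into clause_type -> bool.

    `results` maps a clause type (e.g. "obligations") to the label that
    the corresponding binary classifier returned (the clause type itself
    when positive, or "other" when negative).
    """
    return {clause: label == clause for clause, label in results.items()}

# Illustrative outputs from two hypothetical binary clause classifiers:
flags = aggregate_clause_flags(
    {"obligations": "obligations", "indemnification": "other"}
)
```

In a real pipeline the per-classifier labels would come from each model's `category` output column.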
## Results ```bash +-------------+ |result | +-------------+ |[obligations]| |[other] | |[other] | |[obligations]| +-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_cuad_obligations_clause| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.4 MB| ## References In-house annotations on CUAD dataset ## Benchmarking ```bash label precision recall f1-score support obligations 0.88 0.79 0.83 95 other 0.88 0.93 0.90 152 accuracy - - 0.88 247 macro-avg 0.88 0.86 0.87 247 weighted-avg 0.88 0.88 0.88 247 ``` --- layout: model title: Detect Entities (Onto 300) author: John Snow Labs name: onto_300 date: 2020-02-03 task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 2.4.0 spark_version: 2.4 tags: [ner, en, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Onto is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. Onto was trained on the OntoNotes text corpus. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. Onto 300 is trained with GloVe 840B 300 word embeddings, so be sure to use the same embeddings in the pipeline.
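The description above says the model reads word embeddings, where semantically similar words sit closer together. A toy illustration of that idea using cosine similarity — the 3-dimensional vectors below are made up for illustration only (real GloVe 840B vectors have 300 dimensions):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product of the vectors divided by the
    # product of their magnitudes; closer to 1.0 means more similar.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors, not real GloVe values:
king = [0.8, 0.3, 0.1]
queen = [0.7, 0.4, 0.1]
banana = [0.1, 0.9, 0.7]

# Related words score higher than unrelated ones.
assert cosine(king, queen) > cosine(king, banana)
```

This is why the NER model must be paired with the exact embeddings it was trained on: the coordinates only mean something within one embedding space.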
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_300_en_2.4.0_2.4_1579729071854.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_300_en_2.4.0_2.4_1579729071854.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## Predicted Entities `CARDINAL`, `EVENT`, `WORK_OF_ART`, `ORG`, `DATE`, `GPE`, `PERSON`, `PRODUCT`, `NORP`, `ORDINAL`, `MONEY`, `LOC`, `FAC`, `LAW`, `TIME`, `PERCENT`, `QUANTITY`, `LANGUAGE` ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained('glove_840B_300', lang='xx') \ .setInputCols(['document', 'token']) \ .setOutputCol('embeddings') ner_model = NerDLModel.pretrained("onto_300", "en") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."]], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("onto_300", "en") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."""] ner_df = nlu.load('en.ner.onto.glove.6B_300d').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title} ## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |William Henry Gates III|PERSON | |October 28, 1955 |DATE | |American |NORP | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PERSON | |), |PERSON | |May 2014 |DATE | |one |CARDINAL | |1970s |DATE | |1980s |DATE | |Born |PERSON | |Seattle |GPE | |Washington |GPE | |Gates |PERSON | |Microsoft |ORG | |Paul Allen |PERSON | |1975 |DATE | |Albuquerque |GPE | |New Mexico |GPE | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_300| |Type:|ner| |Compatibility:|Spark NLP 2.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Case sensitive:|false| {:.h2_title} ## Data Source The model is trained based on data from [https://catalog.ldc.upenn.edu/LDC2013T19](https://catalog.ldc.upenn.edu/LDC2013T19) --- layout: model title: Translate English to Shona Pipeline author: John Snow Labs name: translate_en_sn date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, sn, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `sn` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_sn_xx_2.7.0_2.4_1609698784995.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_sn_xx_2.7.0_2.4_1609698784995.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_sn", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_sn", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.sn').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_sn| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Named Entity Recognition for Bengali (GloVe 840B 300d) author: John Snow Labs name: ner_jifs_glove_840B_300d date: 2021-01-27 task: Named Entity Recognition language: bn edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [bn, ner, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates named entities in a text, that can be used to find features such as names of people, places, and organizations. The model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. This model uses the pre-trained `glove_840B_300` embeddings model from `WordEmbeddings` annotator as an input, so be sure to use the same embeddings in the pipeline. ## Predicted Entities `PER`, `LOC`, `ORG`, `OBJ`, `O` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_jifs_glove_840B_300d_bn_2.7.0_2.4_1611770574503.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ner_jifs_glove_840B_300d_bn_2.7.0_2.4_1611770574503.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", "xx")\ .setInputCols("document", "token") \ .setOutputCol("embeddings") ner = NerDLModel.pretrained("ner_jifs_glove_840B_300d", "bn") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner]) example = spark.createDataFrame([["৯০ এর দশকের শুরুর দিকে বৃহৎ আকারে মার্কিন যুক্তরাষ্ট্রে এর প্রয়োগের প্রক্রিয়া শুরু হয়'"]], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", "xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_jifs_glove_840B_300d", "bn") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner)) val data = Seq("৯০ এর দশকের শুরুর দিকে বৃহৎ আকারে মার্কিন যুক্তরাষ্ট্রে এর প্রয়োগের প্রক্রিয়া শুরু হয়").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["৯০ এর দশকের শুরুর দিকে বৃহৎ আকারে মার্কিন যুক্তরাষ্ট্রে এর প্রয়োগের প্রক্রিয়া শুরু হয়"] ner_df = nlu.load('bn.ner').predict(text, output_level='token') ner_df ```
## Results ```bash +-------------+-----+ |token |ner | +-------------+-----+ |৯০ |O | |এর |O | |দশকের |O | |শুরুর |O | |দিকে |O | |বৃহৎ |O | |আকারে |O | |মার্কিন |B-LOC| |যুক্তরাষ্ট্রে|I-LOC| |এর |O | |প্রয়োগের |O | |প্রক্রিয়া |O | |শুরু |O | |হয় |O | |' |O | +-------------+-----+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jifs_glove_840B_300d| |Type:|ner| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|bn| ## Data Source The model was trained on the [Bengali NER](https://github.com/MISabic/NER-Bangla-Dataset) data set introduced in the Journal of Intelligent & Fuzzy Systems. Reference: - Karim, Redwanul & Islam, M. A. & Simanto, Sazid & Chowdhury, Saif & Roy, Kalyan & Neon, Adnan & Hasan, Md & Firoze, Adnan & Rahman, Mohammad. (2019). A step towards information extraction: Named entity recognition in Bangla using deep learning. Journal of Intelligent & Fuzzy Systems. 37. 1-13. 10.3233/JIFS-179349. 
## Benchmarking ```bash | | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | B-LOC | 0.81 | 0.72 | 0.76 | 2005 | | B-OBJ | 0.66 | 0.08 | 0.13 | 573 | | B-ORG | 0.67 | 0.31 | 0.42 | 853 | | B-PER | 0.76 | 0.76 | 0.76 | 4035 | | I-LOC | 0.64 | 0.52 | 0.58 | 357 | | I-OBJ | 0.00 | 0.00 | 0.00 | 57 | | I-ORG | 0.65 | 0.37 | 0.47 | 516 | | I-PER | 0.76 | 0.73 | 0.74 | 1223 | | O | 0.93 | 0.97 | 0.95 | 39499 | | accuracy | | | 0.90 | 49118 | | macro avg | 0.65 | 0.49 | 0.54 | 49118 | | weighted avg | 0.89 | 0.90 | 0.89 | 49118 | ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_10 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-1024-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_10_en_4.3.0_3.0_1674213377790.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_10_en_4.3.0_3.0_1674213377790.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_10","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_1024_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|439.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-1024-finetuned-squad-seed-10 --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_2 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-1024-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_2_en_4.0.0_3.0_1655730749277.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_2_en_4.0.0_3.0_1655730749277.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_1024d_seed_2").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|439.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-1024-finetuned-squad-seed-2 --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from RathodSankul) author: John Snow Labs name: distilbert_qa_rathodsankul_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `RathodSankul`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_rathodsankul_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769055919.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_rathodsankul_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769055919.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_rathodsankul_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_rathodsankul_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_rathodsankul_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/RathodSankul/distilbert-base-uncased-finetuned-squad --- layout: model title: Self-Reported Covid-19 Symptoms Classifier (BERT) author: John Snow Labs name: bert_sequence_classifier_self_reported_symptoms_tweet date: 2022-07-28 tags: [es, clinical, licensed, public_health, classifier, sequence_classification, covid_19, tweet, symptom] task: Text Classification language: es edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [BERT based](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) classifier that can classify the origin of symptoms related to Covid-19 from Spanish tweets. This model is intended for direct use as a classification model and the target classes are: Lit-News_mentions, Self_reports, non-personal_reports. 
## Predicted Entities `Lit-News_mentions`, `Self_reports`, `non-personal_reports` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_COVID_SYMPTOMS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4TC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_self_reported_symptoms_tweet_es_4.0.0_3.0_1659022252550.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_self_reported_symptoms_tweet_es_4.0.0_3.0_1659022252550.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_self_reported_symptoms_tweet", "es", "clinical/models")\ .setInputCols(["document", "token"])\ .setOutputCol("class")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame(["Las vacunas 3 y hablamos inminidad vivo Son bichito vivo dentro de líquido de la vacuna suelen tener reacciones alérgicas si que sepan", "Yo pense que me estaba dando el coronavirus porque cuando me levante casi no podia respirar pero que si era que tenia la nariz topada de mocos.", "Tos, dolor de garganta y fiebre, los síntomas más reportados por los porteños con coronavirus"], StringType()).toDF("text") result = model.transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_self_reported_symptoms_tweet", "es", "clinical/models") .setInputCols(Array("document", "token")) .setOutputCol("class") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) val data = Seq("Las vacunas 3 y hablamos inminidad vivo Son bichito vivo dentro de líquido de la vacuna suelen tener reacciones alérgicas si que sepan", "Yo pense que me estaba dando el coronavirus porque cuando me levante casi no podia respirar pero que si era que tenia la nariz topada de mocos.", "Tos, dolor de garganta y fiebre, los síntomas más reportados por los porteños con coronavirus").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.classify.self_reported_symptoms").predict("""Yo pense que me estaba dando el coronavirus porque cuando me levante casi no podia respirar pero que si era que tenia la nariz topada de mocos.""") ```
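Under the hood, the sequence classifier produces one logit per target class and the predicted label is the argmax of the softmax over those logits. A minimal illustrative sketch using this model's three labels (the logits below are made-up numbers, not real model outputs):

```python
import math

# This model's target classes, in an assumed fixed order.
labels = ["Lit-News_mentions", "Self_reports", "non-personal_reports"]

def predict(logits):
    """Softmax over class logits, then argmax to pick the label."""
    exps = [math.exp(x - max(logits)) for x in logits]  # shift for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return labels[probs.index(max(probs))], probs

label, probs = predict([0.4, 3.1, 0.8])
print(label)  # -> Self_reports
```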
## Results ```bash +-------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+ |text |result | +-------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+ |Las vacunas 3 y hablamos inminidad vivo Son bichito vivo dentro de líquido de la vacuna suelen tener reacciones alérgicas si que sepan |[non-personal_reports]| |Yo pense que me estaba dando el coronavirus porque cuando me levante casi no podia respirar pero que si era que tenia la nariz topada de mocos.|[Self_reports] | |Tos, dolor de garganta y fiebre, los síntomas más reportados por los porteños con coronavirus |[Lit-News_mentions] | +-------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_self_reported_symptoms_tweet| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|es| |Size:|412.4 MB| |Case sensitive:|true| |Max sentence length:|128| ## Benchmarking ```bash label precision recall f1-score support Lit-News_mentions 0.93 0.95 0.94 309 Self_reports 0.65 0.74 0.69 72 non-personal_reports 0.79 0.67 0.73 122 accuracy - - 0.85 503 macro-avg 0.79 0.79 0.79 503 weighted-avg 0.85 0.85 0.85 503 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from haunt224) author: John Snow Labs name: distilbert_qa_haunt224_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: 
tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `haunt224`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_haunt224_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771115606.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_haunt224_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771115606.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_haunt224_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_haunt224_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_haunt224_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/haunt224/distilbert-base-uncased-finetuned-squad --- layout: model title: Detect Problems, Tests and Treatments (ner_clinical_large) author: John Snow Labs name: ner_clinical_large_en date: 2020-05-23 task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 2.5.0 spark_version: 2.4 tags: [ner, en, clinical, licensed] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terms. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. {:.h2_title} ## Predicted Entities `PROBLEM`, `TEST`, `TREATMENT`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_EVENTS_CLINICAL/){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_large_en_2.5.0_2.4_1590021302624.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_large_en_2.5.0_2.4_1590021302624.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPython.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.']], ["text"])) ``` ```scala ... 
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_clinical_large", "en", "clinical/models") .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.").toDF("text") val result = pipeline.fit(data).transform(data) ```
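The `ner_converter` stage referenced above merges the token-level IOB tags produced by the NER model (`B-PROBLEM`, `I-PROBLEM`, `O`, ...) into full entity chunks. A minimal, library-free sketch of that merging logic:

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # begin a new entity
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)           # continue the current entity
        else:                               # "O" or a tag mismatch: flush
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["She", "has", "breast", "cancer", "treated", "with", "taxanes"]
tags   = ["O", "O", "B-PROBLEM", "I-PROBLEM", "O", "O", "B-TREATMENT"]
print(iob_to_chunks(tokens, tags))
# -> [('breast cancer', 'PROBLEM'), ('taxanes', 'TREATMENT')]
```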
{:.h2_title} ## Results The output is a dataframe with a sentence per row and a ``"ner"`` column containing all of the entity labels in the sentence, entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select ``"token.result"`` and ``"ner.result"`` from your output dataframe: ```bash +-----------------------------------------------------------+---------+ |chunk |ner_label| +-----------------------------------------------------------+---------+ |the G-protein-activated inwardly rectifying potassium (GIRK|TREATMENT| |the genomicorganization |TREATMENT| |a candidate gene forType II diabetes mellitus |PROBLEM | |byapproximately |TREATMENT| |single nucleotide polymorphisms |TREATMENT| |aVal366Ala substitution |TREATMENT| |an 8 base-pair |TREATMENT| |insertion/deletion |PROBLEM | |Ourexpression studies |TEST | |the transcript in various humantissues |PROBLEM | |fat andskeletal muscle |PROBLEM | |furtherstudies |PROBLEM | |the KCNJ9 protein |TREATMENT| |evaluation |TEST | |Type II diabetes |PROBLEM | |the treatment |TREATMENT| |breast cancer |PROBLEM | |the standard therapy |TREATMENT| |anthracyclines |TREATMENT| |taxanes |TREATMENT| +-----------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical_large| |Type:|ner| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|[en]| |Case sensitive:|false| {:.h2_title} ## Data Source Trained on augmented 2010 i2b2 challenge data with 'embeddings_clinical'. 
https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ {:.h2_title} ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|--------------:|------:|------:|------:|---------:|---------:|---------:| | 0 | I-TREATMENT | 6625 | 1187 | 1329 | 0.848054 | 0.832914 | 0.840416 | | 1 | I-PROBLEM | 15142 | 1976 | 2542 | 0.884566 | 0.856254 | 0.87018 | | 2 | B-PROBLEM | 11005 | 1065 | 1587 | 0.911765 | 0.873968 | 0.892466 | | 3 | I-TEST | 6748 | 923 | 1264 | 0.879677 | 0.842237 | 0.86055 | | 4 | B-TEST | 8196 | 942 | 1029 | 0.896914 | 0.888455 | 0.892665 | | 5 | B-TREATMENT | 8271 | 1265 | 1073 | 0.867345 | 0.885167 | 0.876165 | | 6 | Macro-average | 55987 | 7358 | 8824 | 0.881387 | 0.863166 | 0.872181 | | 7 | Micro-average | 55987 | 7358 | 8824 | 0.883842 | 0.86385 | 0.873732 | ``` --- layout: model title: Spell Checker in English Text author: John Snow Labs name: check_spelling_dl date: 2022-01-04 tags: [open_source, en] task: Spell Check language: en nav_key: models edition: Spark NLP 3.3.4 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Spell Checker is a sequence-to-sequence pipeline that detects and corrects spelling errors in your input text. It's based on Levenshtein Automaton for generating candidate corrections and a Neural Language Model for ranking corrections. You can download the pretrained pipeline that comes ready to use. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/check_spelling_dl_en_3.3.4_3.0_1641304582335.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/check_spelling_dl_en_3.3.4_3.0_1641304582335.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline_local = PretrainedPipeline("check_spelling_dl") testDoc = ''' During the summer we have the hottest ueather. I have a black ueather jacket, so nice.I intrduce you to my sister, she is called ueather. ''' result=pipeline_local.annotate(testDoc) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("check_spelling_dl", lang = "en") val result = pipeline.annotate("During the summer we have the hottest ueather. I have a black ueather jacket, so nice.I intrduce you to my sister, she is called ueather.") ``` {:.nlu-block} ```python import nlu nlu.load("en.spell").predict(""" During the summer we have the hottest ueather. I have a black ueather jacket, so nice.I intrduce you to my sister, she is called ueather. """) ```
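Candidate generation in this pipeline is built on edit distance (via a Levenshtein automaton); the neural language model then ranks the candidates in context, which is why `ueather` is corrected three different ways above. A plain dynamic-programming sketch of the distance metric itself (illustrative only, not Spark NLP's implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        row = [i]
        for j, cb in enumerate(b, 1):
            row.append(min(prev[j] + 1,                 # deletion
                           row[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = row
    return prev[-1]

# "ueather" is one edit away from several real words; context decides which.
for cand in ["weather", "leather", "heather"]:
    print(cand, levenshtein("ueather", cand))  # each -> 1
```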
## Results ```bash ('During', 'During'), ('the', 'the'), ('summer', 'summer'), ('we', 'we'), ('have', 'have'), ('the', 'the'), ('hottest', 'hottest'), ('ueather', 'weather'), ('.', '.'), ('I', 'I'), ('have', 'have'), ('a', 'a'), ('black', 'black'), ('ueather', 'leather'), ('jacket', 'jacket'), (',', ','), ('so', 'so'), ('nice', 'nice'), ('.', '.'), ('I', 'I'), ('intrduce', 'introduce'), ('you', 'you'), ('to', 'to'), ('my', 'my'), ('sister', 'sister'), (',', ','), ('she', 'she'), ('is', 'is'), ('called', 'called'), ('ueather', 'Heather'), ('.', '.') ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|check_spelling_dl| |Type:|pipeline| |Compatibility:|Spark NLP 3.3.4+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|118.1 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - ContextSpellCheckerModel --- layout: model title: English AlbertForQuestionAnswering model (from vumichien) author: John Snow Labs name: albert_qa_vumichien_base_v2_squad2 date: 2022-06-24 tags: [en, open_source, albert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: AlBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `albert-base-v2-squad2` is an English model originally trained by `vumichien`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_vumichien_base_v2_squad2_en_4.0.0_3.0_1656063726931.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_vumichien_base_v2_squad2_en_4.0.0_3.0_1656063726931.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_vumichien_base_v2_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_vumichien_base_v2_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.albert.base_v2").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_qa_vumichien_base_v2_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|42.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/vumichien/albert-base-v2-squad2 --- layout: model title: Stop Words Cleaner for Romanian author: John Snow Labs name: stopwords_ro date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: ro edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, ro] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_ro_ro_2.5.4_2.4_1594742441548.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_ro_ro_2.5.4_2.4_1594742441548.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_ro", "ro") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("În afară de a fi regele nordului, John Snow este un medic englez și un lider în dezvoltarea anesteziei și igienei medicale.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_ro", "ro") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("În afară de a fi regele nordului, John Snow este un medic englez și un lider în dezvoltarea anesteziei și igienei medicale.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""În afară de a fi regele nordului, John Snow este un medic englez și un lider în dezvoltarea anesteziei și igienei medicale."""] stopword_df = nlu.load('ro.stopwords').predict(text) stopword_df[['cleanTokens']] ```
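Conceptually, stop-word removal is just a token filter over a fixed word list. A minimal sketch; the short Romanian stop-word list here is an assumption for the example, not the model's actual list:

```python
# A small assumed subset of Romanian stop words, for illustration only.
stopwords_ro = {"de", "a", "fi", "este", "un", "și", "în", "la", "cu"}

def clean_tokens(tokens, stopwords):
    """Drop tokens whose lowercase form appears in the stop-word set."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ["John", "Snow", "este", "un", "medic", "englez"]
print(clean_tokens(tokens, stopwords_ro))  # -> ['John', 'Snow', 'medic', 'englez']
```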
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=3, end=7, result='afară', metadata={'sentence': '0'}), Row(annotatorType='token', begin=12, end=12, result='a', metadata={'sentence': '0'}), Row(annotatorType='token', begin=17, end=22, result='regele', metadata={'sentence': '0'}), Row(annotatorType='token', begin=24, end=31, result='nordului', metadata={'sentence': '0'}), Row(annotatorType='token', begin=32, end=32, result=',', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_ro| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|ro| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: Legal Settlement Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_settlement_agreement date: 2022-11-24 tags: [en, legal, classification, agreement, settlement, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_settlement_agreement` model is a Legal Longformer Document Classifier to classify whether a document belongs to the class `settlement-agreement` or not (binary classification). Longformers have a limit of 4096 tokens, so only the first 4096 tokens are taken into account. In our experience, for the large majority of documents in legal corpora, 4096 tokens are enough for document classification, provided the documents are clean and contain only the legal text without extra material in front. 
If not, let us know and we can apply another approach: splitting the document into 4096-token chunks, averaging the chunk embeddings, and training on the averaged version, so that the whole document is taken into account. In practice, this should rarely be required. ## Predicted Entities `settlement-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_settlement_agreement_en_1.0.0_3.0_1669294931648.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_settlement_agreement_en_1.0.0_3.0_1669294931648.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_settlement_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
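The chunk-and-average fallback mentioned in the description (for documents longer than 4096 tokens) can be sketched with plain lists; in the real pipeline the per-chunk vectors would come from the Longformer embeddings stage, so the vectors below are stand-ins:

```python
def chunk(tokens, size=4096):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def average(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

tokens = list(range(10000))          # stand-in for a 10k-token document
chunks = chunk(tokens)
print([len(c) for c in chunks])      # -> [4096, 4096, 1808]

# Stand-in per-chunk embeddings (one 3-dim vector per chunk).
embs = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0]]
print(average(embs))                 # -> [2.0, 2.0, 2.0]
```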
## Results ```bash +-------+ |result| +-------+ |[settlement-agreement]| |[other]| |[other]| |[settlement-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_settlement_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.4 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.95 0.96 0.95 90 settlement-agreement 0.90 0.88 0.89 41 accuracy - - 0.93 131 macro-avg 0.92 0.92 0.92 131 weighted-avg 0.93 0.93 0.93 131 ``` --- layout: model title: English asr_wav2vec2_xls_r_300m_mixed_by_malay_huggingface TFWav2Vec2ForCTC from malay-huggingface author: John Snow Labs name: asr_wav2vec2_xls_r_300m_mixed_by_malay_huggingface date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_mixed_by_malay_huggingface` is an English model originally trained by malay-huggingface. 
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_xls_r_300m_mixed_by_malay_huggingface_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_mixed_by_malay_huggingface_en_4.2.0_3.0_1664104503040.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_mixed_by_malay_huggingface_en_4.2.0_3.0_1664104503040.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xls_r_300m_mixed_by_malay_huggingface", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xls_r_300m_mixed_by_malay_huggingface", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
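Both snippets above assume an existing `audioDf` whose `audio_content` column holds the raw waveform as an array of floats. As a minimal sketch (not part of the original card), such an array can be produced from a 16-bit mono PCM WAV file using only the Python standard library; the file name `sample.wav` is a placeholder:

```python
import wave
import struct

def wav_to_floats(path):
    """Read a 16-bit PCM mono WAV file and return its samples scaled to [-1.0, 1.0]."""
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    # Each sample is a little-endian signed 16-bit integer.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Then, in a Spark session:
# audioDf = spark.createDataFrame([(wav_to_floats("sample.wav"),)], ["audio_content"])
```

Wav2Vec2 models are typically trained on 16 kHz audio, so resample recordings that use a different sample rate before running the pipeline.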
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xls_r_300m_mixed_by_malay_huggingface| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Extract Oncology Tests author: John Snow Labs name: ner_oncology_test date: 2022-10-25 tags: [licensed, clinical, oncology, en, ner, test] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Healthcare 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts mentions of tests from oncology texts, including pathology tests and imaging tests. Definitions of Predicted Entities: - `Biomarker`: Biological molecules that indicate the presence or absence of cancer, or the type of cancer. Oncogenes are excluded from this category. - `Biomarker_Result`: Terms or values that are identified as the result of a biomarker. - `Imaging_Test`: Imaging tests mentioned in texts, such as "chest CT scan". - `Oncogene`: Mentions of genes that are implicated in the etiology of cancer. - `Pathology_Test`: Mentions of biopsies or tests that use tissue samples.
## Predicted Entities `Biomarker`, `Biomarker_Result`, `Imaging_Test`, `Oncogene`, `Pathology_Test` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_test_en_4.0.0_3.0_1666721761945.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_test_en_4.0.0_3.0_1666721761945.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_test", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["A biopsy was conducted using an ultrasound guided thick-needle. 
His chest computed tomography (CT) scan was negative."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_test", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("A biopsy was conducted using an ultrasound guided thick-needle. His chest computed tomography (CT) scan was negative.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_test").predict("""A biopsy was conducted using an ultrasound guided thick-needle. His chest computed tomography (CT) scan was negative.""") ```
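The Results table below is a flattened view of the `ner_chunk` output. A minimal post-processing sketch, assuming (for illustration) that the chunk annotations have already been collected into plain Python dicts carrying the `result` text and a `metadata` map with the entity label:

```python
def chunks_with_labels(annotations):
    """Turn collected chunk annotations into (chunk_text, ner_label) pairs."""
    return [(ann["result"], ann["metadata"]["entity"]) for ann in annotations]
```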
## Results ```bash | chunk | ner_label | |:-------------------------------|:---------------| | biopsy | Pathology_Test | | ultrasound guided thick-needle | Pathology_Test | | chest computed tomography | Imaging_Test | | CT | Imaging_Test | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_test| |Compatibility:|Spark NLP for Healthcare 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.2 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Imaging_Test 2020 229 184 2204 0.90 0.92 0.91 Biomarker_Result 1177 186 268 1445 0.86 0.81 0.84 Pathology_Test 888 276 162 1050 0.76 0.85 0.80 Biomarker 1287 254 228 1515 0.84 0.85 0.84 Oncogene 365 89 84 449 0.80 0.81 0.81 macro_avg 5737 1034 926 6663 0.83 0.85 0.84 micro_avg 5737 1034 926 6663 0.85 0.86 0.85 ``` --- layout: model title: Word2Vec Embeddings in Assamese (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, as, open_source] task: Embeddings language: as edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_as_3.4.1_3.0_1647284195001.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_as_3.4.1_3.0_1647284195001.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","as") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","as") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("as.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
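The `embeddings` column holds one 300-dimensional vector per token. A common next step is to mean-pool the token vectors into a single document vector; a minimal sketch over plain Python lists (illustrative only, not a Spark NLP API):

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]
```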
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|as| |Size:|170.4 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Legal Information Clause Binary Classifier author: John Snow Labs name: legclf_information_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `information` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities `other`, `information` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_information_clause_en_1.0.0_3.2_1660122534714.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_information_clause_en_1.0.0_3.2_1660122534714.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_information_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
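As the description notes, running several of these binary clause classifiers side by side yields a series of True/False flags per document. A minimal aggregation sketch (the classifier names are illustrative; each classifier emits `other` for its negative class):

```python
def clause_flags(predictions):
    """Convert {classifier_name: predicted_label} into boolean clause flags,
    treating any label other than 'other' as a positive hit."""
    return {name: label != "other" for name, label in predictions.items()}
```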
## Results ```bash +-------+ | result| +-------+ |[information]| |[other]| |[other]| |[information]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_information_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support financial-information 0.95 0.78 0.85 68 other 0.92 0.98 0.95 167 accuracy - - 0.92 235 macro-avg 0.93 0.88 0.90 235 weighted-avg 0.92 0.92 0.92 235 ``` --- layout: model title: Explain Document pipeline for Norwegian (Bokmal) (explain_document_lg) author: John Snow Labs name: explain_document_lg date: 2021-03-23 tags: [open_source, norwegian_bokmal, explain_document_lg, pipeline, "no"] supported: true task: [Named Entity Recognition, Lemmatization] language: "no" edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_lg is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps and recognizes entities.
It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_no_3.0.0_3.0_1616517073984.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_lg_no_3.0.0_3.0_1616517073984.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_lg', lang = 'no') annotations = pipeline.fullAnnotate("Hei fra John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_lg", lang = "no") val result = pipeline.fullAnnotate("Hei fra John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hei fra John Snow Labs! "] result_df = nlu.load('no.explain.lg').predict(text) result_df ```
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:-----------------------------|:----------------------------|:----------------------------------------|:----------------------------------------|:--------------------------------------------|:-----------------------------|:---------------------------------------|:-----------------------| | 0 | ['Hei fra John Snow Labs! '] | ['Hei fra John Snow Labs!'] | ['Hei', 'fra', 'John', 'Snow', 'Labs!'] | ['Hei', 'fra', 'John', 'Snow', 'Labs!'] | ['PROPN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.0639619976282119,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'B-PROD'] | ['John Snow', 'Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_lg| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|no| --- layout: model title: English Part of Speech Tagger (Base, UPOS-Universal Part-Of-Speech) author: John Snow Labs name: roberta_pos_roberta_base_english_upos date: 2022-05-03 tags: [roberta, pos, part_of_speech, en, open_source] task: Part of Speech Tagging language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-base-english-upos` is an English model originally trained by `KoichiYasuoka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_pos_roberta_base_english_upos_en_3.4.2_3.0_1651596229172.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_pos_roberta_base_english_upos_en_3.4.2_3.0_1651596229172.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_pos_roberta_base_english_upos","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_pos_roberta_base_english_upos","en") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.pos.roberta_base_english_upos").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_pos_roberta_base_english_upos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|448.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/KoichiYasuoka/roberta-base-english-upos - https://universaldependencies.org/en/ - https://universaldependencies.org/u/pos/ - https://github.com/KoichiYasuoka/esupar --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_small_ssm date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-ssm` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_ssm_en_4.3.0_3.0_1675155844003.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_ssm_en_4.3.0_3.0_1675155844003.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_ssm","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_ssm","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_ssm| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|179.4 MB| ## References - https://huggingface.co/google/t5-small-ssm - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/pdf/2002.08909.pdf - https://arxiv.org/abs/1910.10683.pdf - https://goo.gle/t5-cbqa - https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/how_much_know_ledge_image.png --- layout: model title: Legal Reimbursement Of Expenses Clause Binary Classifier author: John Snow Labs name: legclf_reimbursement_of_expenses_clause date: 2023-01-29 tags: [en, legal, classification, reimbursement, expenses, clauses, reimbursement_of_expenses, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `reimbursement-of-expenses` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `reimbursement-of-expenses`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_reimbursement_of_expenses_clause_en_1.0.0_3.0_1675004913668.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_reimbursement_of_expenses_clause_en_1.0.0_3.0_1675004913668.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_reimbursement_of_expenses_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[reimbursement-of-expenses]| |[other]| |[other]| |[reimbursement-of-expenses]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_reimbursement_of_expenses_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.94 0.77 0.85 39 reimbursement-of-expenses 0.77 0.94 0.85 32 accuracy - - 0.85 71 macro-avg 0.85 0.85 0.85 71 weighted-avg 0.86 0.85 0.85 71 ``` --- layout: model title: English DistilBertForTokenClassification Cased model (from ml6team) author: John Snow Labs name: distilbert_token_classifier_keyphrase_extraction_inspec date: 2023-03-14 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyphrase-extraction-distilbert-inspec` is an English model originally trained by `ml6team`. ## Predicted Entities `KEY` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_inspec_en_4.3.1_3.0_1678782826902.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_inspec_en_4.3.1_3.0_1678782826902.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_inspec","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_inspec","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
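The `ner` column carries one tag per token, so contiguous keyphrase tokens still have to be merged into phrases. A minimal sketch, assuming the usual BIO scheme with a single `KEY` class (the tag names are an assumption; inspect the model's actual labels, or use a NerConverter stage for the same purpose):

```python
def group_keyphrases(tokens, tags):
    """Merge tokens tagged B-KEY/I-KEY into keyphrase strings."""
    phrases, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B-KEY":
            if current:
                phrases.append(" ".join(current))
            current = [tok]
        elif tag == "I-KEY" and current:
            current.append(tok)
        else:
            if current:
                phrases.append(" ".join(current))
                current = []
    if current:
        phrases.append(" ".join(current))
    return phrases
```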
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_keyphrase_extraction_inspec| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ml6team/keyphrase-extraction-distilbert-inspec - https://dl.acm.org/doi/10.3115/1119355.1119383 - https://paperswithcode.com/sota?task=Keyphrase+Extraction&dataset=inspec --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_16_finetuned_squad_seed_8 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-16-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_16_finetuned_squad_seed_8_en_4.3.0_3.0_1674214569437.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_16_finetuned_squad_seed_8_en_4.3.0_3.0_1674214569437.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_16_finetuned_squad_seed_8","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_16_finetuned_squad_seed_8","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
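When the same question is run against several candidate contexts, the extracted answers can be ranked by a confidence score; Spark NLP exposes scores in the annotation metadata, though the exact key can vary, so the `(answer_text, score)` pairs below are an assumption for illustration:

```python
def best_answer(candidates):
    """Pick the answer text with the highest confidence score
    from a list of (answer_text, score) pairs."""
    return max(candidates, key=lambda pair: pair[1])[0]
```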
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_16_finetuned_squad_seed_8| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|415.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-16-finetuned-squad-seed-8 --- layout: model title: Smaller BERT Sentence Embeddings (L-12_H-512_A-8) author: John Snow Labs name: sent_small_bert_L12_512 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L12_512_en_2.6.0_2.4_1598350859875.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L12_512_en_2.6.0_2.4_1598350859875.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L12_512", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L12_512", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L12_512').predict(text, output_level='sentence') embeddings_df ```
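Sentence vectors like the ones shown in the Results below are typically compared with cosine similarity; a minimal sketch using only the standard library:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```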
{:.h2_title} ## Results ```bash sentence en_embed_sentence_small_bert_L12_512_embeddings I hate cancer [-0.11204898357391357, 0.7771605849266052, -0.... Antibiotics aren't painkiller [0.3647293746471405, 0.5351018905639648, 0.118... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L12_512| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|512| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-512_A-8/1 --- layout: model title: Stop Words Cleaner for Esperanto author: John Snow Labs name: stopwords_eo date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: eo edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, eo] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_eo_eo_2.5.4_2.4_1594742438724.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_eo_eo_2.5.4_2.4_1594742438724.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_eo", "eo") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Krom esti la norda reĝo, John Snow estas angla kuracisto kaj gvidanto en la disvolviĝo de anestezo kaj medicina higieno.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_eo", "eo") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Krom esti la norda reĝo, John Snow estas angla kuracisto kaj gvidanto en la disvolviĝo de anestezo kaj medicina higieno.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Krom esti la norda reĝo, John Snow estas angla kuracisto kaj gvidanto en la disvolviĝo de anestezo kaj medicina higieno."""] stopword_df = nlu.load('eo.stopwords').predict(text) stopword_df[["cleanTokens"]] ```
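Under the hood this is a simple token filter: tokens found in a stop-word list are dropped and the remaining tokens keep their order. A rough sketch in plain Python, using a tiny hypothetical Esperanto stop-word list (the pretrained model ships a much larger one):

```python
# Tiny, hypothetical stop-word list; the pretrained model uses a much larger one.
ESPERANTO_STOPWORDS = {"la", "kaj", "de", "en"}

def clean_tokens(tokens, stopwords=ESPERANTO_STOPWORDS):
    # Case-insensitive filter that preserves the order of the remaining tokens.
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ["Krom", "esti", "la", "norda", "re\u011do", "John", "Snow"]
print(clean_tokens(tokens))
```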
{:.h2_title}
## Results

```bash
[Row(annotatorType='token', begin=0, end=3, result='Krom', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=5, end=8, result='esti', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=13, end=17, result='norda', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=19, end=22, result='reĝo', metadata={'sentence': '0'}),
Row(annotatorType='token', begin=23, end=23, result=',', metadata={'sentence': '0'}),
...]
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|stopwords_eo|
|Type:|stopwords|
|Compatibility:|Spark NLP 2.5.4+|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|eo|
|Case sensitive:|false|
|License:|Open Source|

{:.h2_title}
## Data Source
The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords)

---
layout: model
title: Dutch Named Entity Recognition (from Davlan)
author: John Snow Labs
name: bert_ner_bert_base_multilingual_cased_ner_hrl
date: 2022-05-09
tags: [bert, ner, token_classification, nl, open_source]
task: Named Entity Recognition
language: nl
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-multilingual-cased-ner-hrl` is a Dutch model originally trained by `Davlan`.
## Predicted Entities `LOC`, `DATE`, `PER`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_multilingual_cased_ner_hrl_nl_3.4.2_3.0_1652099961577.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_multilingual_cased_ner_hrl_nl_3.4.2_3.0_1652099961577.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_multilingual_cased_ner_hrl","nl") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ik hou van Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_multilingual_cased_ner_hrl","nl") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ik hou van Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
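The `ner` column holds one BIO-style tag per token (e.g. `B-ORG`, `I-ORG`, `O`). Downstream code usually merges consecutive tagged tokens into entity spans; a minimal sketch in plain Python, with illustrative tags rather than actual model output:

```python
def group_entities(tokens, tags):
    # Merge consecutive B-/I- tags of the same type into (text, label) spans.
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

tokens = ["Ik", "hou", "van", "Spark", "NLP"]
tags = ["O", "O", "O", "B-ORG", "I-ORG"]
print(group_entities(tokens, tags))
```

In Spark NLP the same aggregation can also be done in-pipeline with an `NerConverter` stage.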
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_ner_bert_base_multilingual_cased_ner_hrl|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|nl|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/Davlan/bert-base-multilingual-cased-ner-hrl
- https://camel.abudhabi.nyu.edu/anercorp/
- https://www.clips.uantwerpen.be/conll2003/ner/
- https://www.clips.uantwerpen.be/conll2002/ner/
- https://github.com/EuropeanaNewspapers/ner-corpora/tree/master/enp_FR.bnf.bio
- https://ontotext.fbk.eu/icab.html
- https://github.com/LUMII-AILab/FullStack/tree/master/NamedEntities
- https://github.com/davidsbatista/NER-datasets/tree/master/Portuguese

---
layout: model
title: Financial Work experience Section Binary Classifier
author: John Snow Labs
name: finclf_work_experience_item
date: 2022-11-03
tags: [en, finance, classification, 10k, annual, reports, sec, filings, licensed]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
annotator: FinanceClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `work_experience` item type of 10K Annual Reports. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at the sentence level.
If you have big financial documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Finance Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Note that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial linked above).

## Predicted Entities

`other`, `work_experience`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_work_experience_item_en_1.0.0_3.0_1667484198932.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_work_experience_item_en_1.0.0_3.0_1667484198932.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") useEmbeddings = nlp.UniversalSentenceEncoder.pretrained() \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("finclf_work_experience_item", "en", "finance/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, useEmbeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
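The splitting advice above (paragraph splitting by multiline) can be sketched in plain Python: break a long filing on blank lines and check each chunk against the 512-token embedding limit before classification. The whitespace-token heuristic here is an assumption for illustration, not the workshop's exact implementation:

```python
import re

def split_paragraphs(text, max_tokens=512):
    # Split on blank lines, then flag whether each chunk fits the embedding limit.
    # Whitespace tokens are a rough proxy for the model's actual tokenizer.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

doc = "Work experience section.\n\nOther boilerplate text."
for chunk, fits in split_paragraphs(doc):
    print(fits, chunk)
```

Chunks flagged `False` would need further splitting (e.g. by headers or sentences) before being fed to the classifier.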
## Results

```bash
+-----------------+
|           result|
+-----------------+
|[work_experience]|
|          [other]|
|          [other]|
|[work_experience]|
+-----------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|finclf_work_experience_item|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.5 MB|

## References

Weak labelling on documents from the Edgar database

## Benchmarking

```bash
          label  precision  recall  f1-score  support
          other       0.94    0.98      0.96      432
work_experience       0.91    0.79      0.85      130
       accuracy          -       -      0.93      562
      macro-avg       0.93    0.88      0.90      562
   weighted-avg       0.93    0.93      0.93      562
```

---
layout: model
title: Legal Security Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_security_agreement_bert
date: 2022-11-25
tags: [en, legal, classification, agreement, security, licensed, bert]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The `legclf_security_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `security-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time.

## Predicted Entities

`security-agreement`, `other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_security_agreement_bert_en_1.0.0_3.0_1669370866506.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_security_agreement_bert_en_1.0.0_3.0_1669370866506.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_security_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+--------------------+
|              result|
+--------------------+
|[security-agreement]|
|             [other]|
|             [other]|
|[security-agreement]|
+--------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_security_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|

## References

Legal documents, scraped from the Internet and classified in-house, plus SEC documents

## Benchmarking

```bash
             label  precision  recall  f1-score  support
             other       0.95    0.85      0.90       82
security-agreement       0.79    0.92      0.85       49
          accuracy          -       -      0.88      131
         macro-avg       0.87    0.89      0.87      131
      weighted-avg       0.89    0.88      0.88      131
```

---
layout: model
title: Sinhala T5ForConditionalGeneration Base Cased model (from sankhajay)
author: John Snow Labs
name: t5_mt5_base_sinaha_qa
date: 2023-01-30
tags: [si, open_source, t5, tensorflow]
task: Text Generation
language: si
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mt5-base-sinaha-qa` is a Sinhala model originally trained by `sankhajay`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_mt5_base_sinaha_qa_si_4.3.0_3.0_1675106104485.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_mt5_base_sinaha_qa_si_4.3.0_3.0_1675106104485.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_mt5_base_sinaha_qa","si") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_mt5_base_sinaha_qa","si")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_mt5_base_sinaha_qa|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|si|
|Size:|1.2 GB|

## References

- https://huggingface.co/sankhajay/mt5-base-sinaha-qa

---
layout: model
title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_6
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-256-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_6_en_4.0.0_3.0_1657184814398.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_6_en_4.0.0_3.0_1657184814398.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_6","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])
data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_6","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))
val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
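Extractive QA models such as this one score every token as a candidate answer start and end, and the predicted answer is the span with the best combined score. A toy sketch of that selection step in plain Python; the logits below are made up for illustration:

```python
def best_span(start_logits, end_logits, max_len=10):
    # Pick the (start, end) pair maximizing start + end score, with start <= end
    # and a cap on span length.
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end =   [0.1, 0.1, 0.1, 4.8, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))
```

With these illustrative logits the selected span is the single token "Clara", matching the expected answer for the example question above.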
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_base_uncased_few_shot_k_256_finetuned_squad_seed_6|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.8 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-256-finetuned-squad-seed-6

---
layout: model
title: English T5ForConditionalGeneration Base Cased model (from mrm8488)
author: John Snow Labs
name: t5_base_finetuned_summarize_news
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-finetuned-summarize-news` is an English model originally trained by `mrm8488`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_summarize_news_en_4.3.0_3.0_1675109192613.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_summarize_news_en_4.3.0_3.0_1675109192613.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_base_finetuned_summarize_news","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_base_finetuned_summarize_news","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_base_finetuned_summarize_news|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|924.2 MB|

## References

- https://huggingface.co/mrm8488/t5-base-finetuned-summarize-news
- https://github.com/abhimishra91
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://www.kaggle.com/sunnysai12345/news-summary
- https://arxiv.org/pdf/1910.10683.pdf
- https://i.imgur.com/jVFMMWR.png
- https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb
- https://twitter.com/mrm8488
- https://www.linkedin.com/in/manuel-romero-cs/

---
layout: model
title: Legal Question Answering (RoBerta)
author: John Snow Labs
name: legqa_roberta
date: 2022-08-09
tags: [en, legal, qa, licensed]
task: Question Answering
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Legal RoBerta-based Question Answering model, trained on squad-v2 and fine-tuned on proprietary Legal questions and answers.

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legqa_roberta_en_1.0.0_3.2_1660054617548.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legqa_roberta_en_1.0.0_3.2_1660054617548.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) spanClassifier = nlp.RoBertaForQuestionAnswering.pretrained("legqa_roberta","en", "legal/models") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = nlp.Pipeline().setStages([ documentAssembler, spanClassifier ]) example = spark.createDataFrame([["Who was subjected to torture?", "The applicant submitted that her husband was subjected to treatment amounting to abuse whilst in the custody of police."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) result.select('answer.result').show() ```
## Results

```bash
`her husband`
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legqa_roberta|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|447.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

Trained on squad-v2, fine-tuned on proprietary Legal questions and answers.

---
layout: model
title: English DistilBertForTokenClassification Cased model (from ml6team)
author: John Snow Labs
name: distilbert_token_classifier_keyphrase_extraction_inspec
date: 2023-03-03
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyphrase-extraction-distilbert-inspec` is an English model originally trained by `ml6team`.

## Predicted Entities

`KEY`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_inspec_en_4.3.0_3.0_1677880864789.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_inspec_en_4.3.0_3.0_1677880864789.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_inspec","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_inspec","en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))
val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_token_classifier_keyphrase_extraction_inspec|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/ml6team/keyphrase-extraction-distilbert-inspec
- https://dl.acm.org/doi/10.3115/1119355.1119383
- https://paperswithcode.com/sota?task=Keyphrase+Extraction&dataset=inspec

---
layout: model
title: Greek BERT Sentence Base Uncased Embedding
author: John Snow Labs
name: sent_bert_base_uncased
date: 2021-09-06
tags: [greek, open_source, bert_sentence_embeddings, uncased, el]
task: Embeddings
language: el
edition: Spark NLP 3.2.2
spark_version: 3.0
supported: true
annotator: BertSentenceEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

A Greek version of the BERT pre-trained language model. The pre-training corpora of bert-base-greek-uncased-v1 include:

- The Greek part of Wikipedia,
- The Greek part of the European Parliament Proceedings Parallel Corpus, and
- The Greek part of OSCAR, a cleansed version of Common Crawl.

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_base_uncased_el_3.2.2_3.0_1630926274392.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_base_uncased_el_3.2.2_3.0_1630926274392.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_uncased", "el") \
    .setInputCols("sentence") \
    .setOutputCol("bert_sentence")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings])
```
```scala
...
val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_uncased", "el")
    .setInputCols("sentence")
    .setOutputCol("bert_sentence")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings))
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sent_bert_base_uncased|
|Compatibility:|Spark NLP 3.2.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[bert_sentence]|
|Language:|el|
|Case sensitive:|true|

## Data Source

The model is imported from: https://huggingface.co/nlpaueb/bert-base-greek-uncased-v1

---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from kevinbror)
author: John Snow Labs
name: distilbert_qa_kevinbror_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `kevinbror`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_kevinbror_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771775858.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_kevinbror_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771775858.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

question_answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kevinbror_base_uncased_finetuned_squad","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[document_assembler, question_answering])
data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val questionAnswering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kevinbror_base_uncased_finetuned_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, questionAnswering))
val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_kevinbror_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/kevinbror/distilbert-base-uncased-finetuned-squad --- layout: model title: Part of Speech for Danish author: John Snow Labs name: pos_ud_ddt date: 2020-07-29 23:34:00 +0800 task: Part of Speech Tagging language: da edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [pos, da] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_ddt_da_2.5.5_2.4_1596053892919.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_ddt_da_2.5.5_2.4_1596053892919.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

pos = PerceptronModel.pretrained("pos_ud_ddt", "da") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate("John Snow er bortset fra at være kongen i nord, en engelsk læge og en leder inden for udvikling af anæstesi og medicinsk hygiejne.")
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val pos = PerceptronModel.pretrained("pos_ud_ddt", "da")
    .setInputCols(Array("document", "token"))
    .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("John Snow er bortset fra at være kongen i nord, en engelsk læge og en leder inden for udvikling af anæstesi og medicinsk hygiejne.").toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["""John Snow er bortset fra at være kongen i nord, en engelsk læge og en leder inden for udvikling af anæstesi og medicinsk hygiejne."""]
pos_df = nlu.load('da.pos').predict(text, output_level='token')
pos_df
```
{:.h2_title}
## Results

```bash
[Row(annotatorType='pos', begin=0, end=3, result='PROPN', metadata={'word': 'John'}),
Row(annotatorType='pos', begin=5, end=8, result='PROPN', metadata={'word': 'Snow'}),
Row(annotatorType='pos', begin=10, end=11, result='AUX', metadata={'word': 'er'}),
Row(annotatorType='pos', begin=13, end=19, result='VERB', metadata={'word': 'bortset'}),
Row(annotatorType='pos', begin=21, end=23, result='ADP', metadata={'word': 'fra'}),
...]
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pos_ud_ddt|
|Type:|pos|
|Compatibility:|Spark NLP 2.5.5+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[pos]|
|Language:|da|
|Case sensitive:|false|
|License:|Open Source|

{:.h2_title}
## Data Source
The model is imported from [https://universaldependencies.org](https://universaldependencies.org)

---
layout: model
title: English image_classifier_vit_dwarf_goats ViTForImageClassification from micole66
author: John Snow Labs
name: image_classifier_vit_dwarf_goats
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_dwarf_goats` is an English model originally trained by micole66.
## Predicted Entities `african pygmy goat`, `nigerian dwarf goat` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_dwarf_goats_en_4.1.0_3.0_1660170488905.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_dwarf_goats_en_4.1.0_3.0_1660170488905.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_dwarf_goats", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_dwarf_goats", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
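The pipeline above assumes an `imageDF` of images has already been built. A minimal sketch of constructing it with Spark's built-in `image` data source (the folder path here is a placeholder, not part of the original example):

```python
def load_image_df(spark, path):
    """Sketch: read a folder of images into the `image` column that ImageAssembler consumes.

    `spark` is an active SparkSession; `path` is a placeholder for your image folder.
    """
    return (spark.read
                 .format("image")              # Spark's built-in image data source
                 .option("dropInvalid", True)  # skip files that cannot be decoded
                 .load(path))

# imageDF = load_image_df(spark, "path/to/images/")
```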
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_dwarf_goats|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|

---
layout: model
title: English BertForQuestionAnswering model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_6
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-64-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_6_en_4.0.0_3.0_1654191762091.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_6_en_4.0.0_3.0_1654191762091.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_6","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_64d_seed_6").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_6|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|378.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-64-finetuned-squad-seed-6

---
layout: model
title: Stopwords Remover for Afrikaans language (51 entries)
author: John Snow Labs
name: stopwords_iso
date: 2022-03-07
tags: [stopwords, af, open_source]
task: Stop Words Removal
language: af
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/).

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_af_3.4.1_3.0_1646672335208.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_af_3.4.1_3.0_1646672335208.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

stop_words = StopWordsCleaner.pretrained("stopwords_iso","af") \
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])

example = spark.createDataFrame([["Jy is nie beter as ek nie"]], ["text"])

results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val stopWords = StopWordsCleaner.pretrained("stopwords_iso","af")
    .setInputCols(Array("token"))
    .setOutputCol("cleanTokens")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stopWords))

val data = Seq("Jy is nie beter as ek nie").toDF("text")

val results = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("af.stopwords").predict("""Jy is nie beter as ek nie""")
```
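Conceptually, the cleaner performs token-level set membership against the bundled stopwords-iso list. A pure-Python sketch, using only a hypothetical handful of the 51 Afrikaans entries rather than the real list:

```python
# Illustrative subset of the stopwords-iso Afrikaans list (the model ships the full list).
AF_STOPWORDS = {"jy", "is", "nie", "as", "ek"}

def clean_tokens(tokens, stopwords=AF_STOPWORDS):
    # Keep every token whose lowercased form is not in the stop-word set.
    return [t for t in tokens if t.lower() not in stopwords]

print(clean_tokens("Jy is nie beter as ek nie".split()))
# -> ['beter']
```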
## Results

```bash
+-------+
|result |
+-------+
|[beter]|
+-------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|af|
|Size:|1.5 KB|

---
layout: model
title: English Electra Embeddings (from google)
author: John Snow Labs
name: electra_embeddings_electra_small_generator
date: 2022-05-17
tags: [en, open_source, electra, embeddings]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-small-generator` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_small_generator_en_3.4.4_3.0_1652786668435.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_small_generator_en_3.4.4_3.0_1652786668435.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_small_generator","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_small_generator","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|electra_embeddings_electra_small_generator|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|51.5 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/google/electra-small-generator
- https://arxiv.org/pdf/1406.2661.pdf
- https://rajpurkar.github.io/SQuAD-explorer/
- https://openreview.net/pdf?id=r1xMH1BtvB
- https://gluebenchmark.com/
- https://www.clips.uantwerpen.be/conll2000/chunking/

---
layout: model
title: Dutch RobertaForMaskedLM Base Cased model (from pdelobelle)
author: John Snow Labs
name: roberta_embeddings_pdelobelle_robbert_v2_dutch_base
date: 2022-12-12
tags: [nl, open_source, roberta_embeddings, robertaformaskedlm]
task: Embeddings
language: nl
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `robbert-v2-dutch-base` is a Dutch model originally trained by `pdelobelle`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_pdelobelle_robbert_v2_dutch_base_nl_4.2.4_3.0_1670858924094.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_pdelobelle_robbert_v2_dutch_base_nl_4.2.4_3.0_1670858924094.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_pdelobelle_robbert_v2_dutch_base","nl") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_pdelobelle_robbert_v2_dutch_base","nl")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_pdelobelle_robbert_v2_dutch_base|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|nl|
|Size:|438.4 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/pdelobelle/robbert-v2-dutch-base
- https://github.com/iPieter/RobBERT
- https://scholar.google.com/scholar?oi=bibs&hl=en&cites=7180110604335112086
- https://www.aclweb.org/anthology/2021.wassa-1.27/
- https://arxiv.org/pdf/2001.06286.pdf
- https://biblio.ugent.be/publication/8704637/file/8704638.pdf
- https://arxiv.org/pdf/2004.02814.pdf
- https://github.com/proycon/deepfrog
- https://arxiv.org/pdf/2010.13652.pdf
- https://www.cambridge.org/core/journals/natural-language-engineering/article/abs/automatic-classification-of-participant-roles-in-cyberbullying-can-we-detect-victims-bullies-and-bystanders-in-social-media-text/A2079C2C738C29428E666810B8903342
- https://gitlab.com/spelfouten/dutch-simpletransformers/
- https://arxiv.org/pdf/2101.05716.pdf
- https://medium.com/broadhorizon-cmotions/nlp-with-r-part-5-state-of-the-art-in-nlp-transformers-bert-3449e3cd7494
- https://people.cs.kuleuven.be/~pieter.delobelle/robbert/
- https://arxiv.org/abs/1907.11692
- https://github.com/pytorch/fairseq/tree/master/examples/roberta
- https://github.com/benjaminvdb/110kDBRD
- https://www.statmt.org/europarl/
- https://arxiv.org/abs/2001.02943
- https://universaldependencies.org/treebanks/nl_lassysmall/index.html
- https://www.clips.uantwerpen.be/conll2002/ner/
- https://oscar-corpus.com/
- https://github.com/iPieter/RobBERT#how-to-replicate-our-paper-experiments
- https://arxiv.org/abs/1909.11942
- https://camembert-model.fr/
- https://en.wikipedia.org/wiki/Robbert
- https://muppet.fandom.com/wiki/Bert
- https://github.com/iPieter/RobBERT/blob/master/res/robbert_logo.png
- https://people.cs.kuleuven.be/~pieter.delobelle
- https://thomaswinters.be
- https://people.cs.kuleuven.be/~bettina.berendt/

---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_base_few_shot_k_64_finetuned_squad_seed_4
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-64-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_64_finetuned_squad_seed_4_en_4.3.0_3.0_1674216065319.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_64_finetuned_squad_seed_4_en_4.3.0_3.0_1674216065319.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_64_finetuned_squad_seed_4","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_64_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_few_shot_k_64_finetuned_squad_seed_4|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|419.7 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-64-finetuned-squad-seed-4

---
layout: model
title: English RobertaForQuestionAnswering (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_4
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-512-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_4_en_4.0.0_3.0_1655733028142.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_4_en_4.0.0_3.0_1655733028142.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_512d_seed_4").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_4|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|432.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-512-finetuned-squad-seed-4

---
layout: model
title: English BertForTokenClassification Cased model (from ghadeermobasher)
author: John Snow Labs
name: bert_ner_BC4_Original_biobert_v1.1
date: 2022-07-06
tags: [bert, ner, open_source, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4-Original-biobert-v1.1` is an English model originally trained by `ghadeermobasher`.

## Predicted Entities

`Chemical`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_Original_biobert_v1.1_en_4.0.0_3.0_1657108073709.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_Original_biobert_v1.1_en_4.0.0_3.0_1657108073709.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token")

tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_Original_biobert_v1.1","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_Original_biobert_v1.1","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier))

val data = Seq("PUT YOUR STRING HERE").toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4_Original_biobert_v1.1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|403.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4-Original-biobert-v1.1 --- layout: model title: Stopwords Remover for Basque language (98 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, eu, open_source] task: Stop Words Removal language: eu edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_eu_3.4.1_3.0_1646673215020.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_eu_3.4.1_3.0_1646673215020.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

stop_words = StopWordsCleaner.pretrained("stopwords_iso","eu") \
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words])

example = spark.createDataFrame([["Ni baino hobea zara"]], ["text"])

results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val stopWords = StopWordsCleaner.pretrained("stopwords_iso","eu")
    .setInputCols(Array("token"))
    .setOutputCol("cleanTokens")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stopWords))

val data = Seq("Ni baino hobea zara").toDF("text")

val results = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("eu.stopwords").predict("""Ni baino hobea zara""")
```
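As with the other stopwords-iso models, the removal itself is plain set membership over tokens. A pure-Python sketch, using only a hypothetical one-entry slice of the 98-entry Basque list:

```python
# Illustrative slice of the stopwords-iso Basque list (the model ships the full list).
EU_STOPWORDS = {"ni"}

def clean_tokens(tokens, stopwords=EU_STOPWORDS):
    # Drop tokens whose lowercased form appears in the stop-word set.
    return [t for t in tokens if t.lower() not in stopwords]

print(clean_tokens("Ni baino hobea zara".split()))
# -> ['baino', 'hobea', 'zara']
```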
## Results

```bash
+--------------------+
|result              |
+--------------------+
|[baino, hobea, zara]|
+--------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|stopwords_iso|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[cleanTokens]|
|Language:|eu|
|Size:|1.6 KB|

---
layout: model
title: English DistilBertForTokenClassification Base Cased model (from 51la5)
author: John Snow Labs
name: distilbert_token_classifier_base_ner
date: 2023-03-14
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-NER` is an English model originally trained by `51la5`.

## Predicted Entities

`PER`, `ORG`, `MISC`, `LOC`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_ner_en_4.3.1_3.0_1678782919178.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_ner_en_4.3.1_3.0_1678782919178.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_ner","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_ner","en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_base_ner| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/51la5/distilbert-base-NER - https://paperswithcode.com/sota?task=Token+Classification&dataset=conll2003 --- layout: model title: Fast Neural Machine Translation Model from Indic Languages to English author: John Snow Labs name: opus_mt_inc_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, inc, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `inc` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_inc_en_xx_2.7.0_2.4_1609167954147.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_inc_en_xx_2.7.0_2.4_1609167954147.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_inc_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["PUT YOUR TEXT TO TRANSLATE HERE"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_inc_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("PUT YOUR TEXT TO TRANSLATE HERE").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.inc.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_inc_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English image_classifier_vit_pneumonia_bielefeld_dl_course ViTForImageClassification from eren23 author: John Snow Labs name: image_classifier_vit_pneumonia_bielefeld_dl_course date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_pneumonia_bielefeld_dl_course` is an English model originally trained by eren23. ## Predicted Entities `BACTERIA`, `NORMAL`, `VIRUS` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pneumonia_bielefeld_dl_course_en_4.1.0_3.0_1660169052203.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pneumonia_bielefeld_dl_course_en_4.1.0_3.0_1660169052203.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_pneumonia_bielefeld_dl_course", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_pneumonia_bielefeld_dl_course", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_pneumonia_bielefeld_dl_course| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English image_classifier_vit_base_cifar10 ViTForImageClassification from thapasushil author: John Snow Labs name: image_classifier_vit_base_cifar10 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_cifar10` is an English model originally trained by thapasushil. ## Predicted Entities `original`, `upside-down` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_cifar10_en_4.1.0_3.0_1660167752665.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_cifar10_en_4.1.0_3.0_1660167752665.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_cifar10", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_cifar10", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_cifar10| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English XlmRoBertaForQuestionAnswering (from SauravMaheshkar) author: John Snow Labs name: xlm_roberta_qa_xlm_roberta_base_chaii date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-chaii` is an English model originally trained by `SauravMaheshkar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_chaii_en_4.0.0_3.0_1655989282280.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_chaii_en_4.0.0_3.0_1655989282280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_base_chaii","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_base_chaii","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.chaii.xlm_roberta.base.by_SauravMaheshkar").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_roberta_base_chaii| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|888.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/SauravMaheshkar/xlm-roberta-base-chaii --- layout: model title: English asr_wav2vec2_large_xlsr_53_arabic_egyptian_by_arbml TFWav2Vec2ForCTC from arbml author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_arabic_egyptian_by_arbml date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_arabic_egyptian_by_arbml` is an English model originally trained by arbml. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_arabic_egyptian_by_arbml_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_arabic_egyptian_by_arbml_en_4.2.0_3.0_1664096795617.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_arabic_egyptian_by_arbml_en_4.2.0_3.0_1664096795617.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_arabic_egyptian_by_arbml", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_arabic_egyptian_by_arbml", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_arabic_egyptian_by_arbml| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: English asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu TFWav2Vec2ForCTC from adelgalu author: John Snow Labs name: asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu` is an English model originally trained by adelgalu. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu_en_4.2.0_3.0_1664098836005.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu_en_4.2.0_3.0_1664098836005.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_google_colab_by_adelgalu| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|349.3 MB| --- layout: model title: Longformer Base NER Pipeline author: John Snow Labs name: longformer_base_token_classifier_conll03_pipeline date: 2022-04-20 tags: [ner, longformer, pipeline, conll, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [longformer_base_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/10/09/longformer_base_token_classifier_conll03_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/longformer_base_token_classifier_conll03_pipeline_en_3.4.1_3.0_1650456150982.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/longformer_base_token_classifier_conll03_pipeline_en_3.4.1_3.0_1650456150982.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("longformer_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I am working at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("longformer_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I am working at John Snow Labs.") ```
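`annotate` returns a plain dictionary keyed by the pipeline's output columns, so pairing the recognized chunks with their labels is ordinary Python. A minimal sketch — the dictionary below is illustrative (the exact key names depend on the pipeline's output columns), not the pipeline's literal output:

```python
# Illustrative shape of a PretrainedPipeline.annotate() result for a NER
# pipeline: parallel lists of entity chunks and their predicted labels.
annotations = {
    "ner_chunk": ["John", "John Snow Labs"],
    "label": ["PER", "ORG"],
}

# Pair each chunk with its label.
entities = list(zip(annotations["ner_chunk"], annotations["label"]))
print(entities)  # [('John', 'PER'), ('John Snow Labs', 'ORG')]
```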
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PER | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|longformer_base_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|516.1 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - LongformerForTokenClassification - NerConverter - Finisher --- layout: model title: English asr_iloko TFWav2Vec2ForCTC from denden author: John Snow Labs name: asr_iloko date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_iloko` is an English model originally trained by denden. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_iloko_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_iloko_en_4.2.0_3.0_1664095361343.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_iloko_en_4.2.0_3.0_1664095361343.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_iloko", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_iloko", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_iloko| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Public Health Mention Sequence Classifier (PHS-BERT) author: John Snow Labs name: bert_sequence_classifier_health_mentions date: 2022-07-25 tags: [public_health, en, licensed, sequence_classification, health, mention] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [PHS-BERT](https://arxiv.org/abs/2204.04521)-based sequence classification model that can classify public health mentions in social media text. Mentions are classified into three labels: personal health mentions, figurative mentions, and other mentions. The classes are described in more detail below: `health_mention`: The text contains a health mention that specifically indicates someone's health situation, i.e. someone has a certain disease or symptoms (including death). e.g.; *My PCR test is positive. I have severe joint pain, muscle pain and a headache right now.* `other_mention`: The text contains a health mention but does not state a specific person's situation; these are general health mentions, such as informative statements or discussions about a disease. e.g.; *Aluminum is a light metal that causes dementia and Alzheimer's disease.* `figurative_mention`: The text mentions a specific disease or symptom, but uses it metaphorically and does not convey health-related information. e.g.; *I don't wanna fall in love. 
If I ever did that, I think I'd have a heart attack.* ## Predicted Entities `figurative_mention`, `other_mention`, `health_mention` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_health_mentions_en_4.0.0_3.0_1658746315237.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_health_mentions_en_4.0.0_3.0_1658746315237.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python # Sample Python Code document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_health_mentions", "en", "clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame([["Another uncle of mine had a heart attack and passed away. Will be cremated Saturday I think I ve gone numb again RIP Uncle Mike"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_health_mentions", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq("Another uncle of mine had a heart attack and passed away. Will be cremated Saturday I think I ve gone numb again RIP Uncle Mike").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.health_mentions").predict("""Another uncle of mine had a heart attack and passed away. Will be cremated Saturday I think I ve gone numb again RIP Uncle Mike""") ```
## Results ```bash +-------------------------------------------------------------------------------------------------------------------------------+----------------+ |text |class | +-------------------------------------------------------------------------------------------------------------------------------+----------------+ |Another uncle of mine had a heart attack and passed away. Will be cremated Saturday I think I ve gone numb again RIP Uncle Mike|[health_mention]| +-------------------------------------------------------------------------------------------------------------------------------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_health_mentions| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References Curated from several academic and in-house datasets. ## Benchmarking ```bash label precision recall f1-score support health_mention 0.85 0.86 0.86 1352 other_mention 0.90 0.89 0.89 2151 figurative_mention 0.86 0.87 0.86 1386 accuracy - - 0.87 4889 macro-avg 0.87 0.87 0.87 4889 weighted-avg 0.87 0.87 0.87 4889 ``` --- layout: model title: Legal Deposit Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_deposit_agreement date: 2022-12-06 tags: [en, legal, classification, agreement, deposit, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_deposit_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `deposit-agreement` or not (Binary Classification). 
Longformer models are limited to 4096 tokens, so only the first 4096 tokens of each document are taken into account. We have found that for the vast majority of documents in legal corpora, if they are clean and contain only the legal document without any extra preceding information, 4096 tokens are enough for Document Classification. If that is not the case for your documents, let us know and we can apply an alternative approach: splitting the document into 4096-token chunks, averaging their embeddings, and training on the averaged version, so that the whole document is taken into account. In practice, however, this should rarely be required. ## Predicted Entities `deposit-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_deposit_agreement_en_1.0.0_3.0_1670357636929.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_deposit_agreement_en_1.0.0_3.0_1670357636929.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_deposit_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
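The chunk-and-average fallback mentioned in the description (for documents longer than the 4096-token limit) can be sketched in pure Python. This is a hypothetical illustration, not part of the pretrained pipeline; `embed` is a placeholder for a real embedding model:

```python
# Sketch of the chunk-and-average strategy: split a long token sequence
# into windows of at most `max_len` tokens, embed each window, and
# average the window vectors into a single document embedding.

def chunk(tokens, max_len=4096):
    # Consecutive, non-overlapping windows of up to max_len tokens.
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def embed(window):
    # Placeholder: a real model would return a dense vector per window.
    return [float(len(window))]

def document_embedding(tokens, max_len=4096):
    vectors = [embed(w) for w in chunk(tokens, max_len)]
    dim = len(vectors[0])
    # Element-wise mean over all window embeddings.
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

tokens = ["tok"] * 10000           # a document longer than the 4096 limit
doc_vec = document_embedding(tokens)  # averages 3 window embeddings
```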
## Results ```bash +-------------------+ |result | +-------------------+ |[deposit-agreement]| |[other] | |[other] | |[deposit-agreement]| +-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_deposit_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.2 MB| ## References Legal documents, scrapped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support deposit-agreement 1.00 0.93 0.97 61 other 0.97 1.00 0.98 111 accuracy - - 0.98 172 macro-avg 0.98 0.97 0.97 172 weighted-avg 0.98 0.98 0.98 172 ``` --- layout: model title: English asr_wav2vec2_base_960h TFWav2Vec2ForCTC from facebook author: John Snow Labs name: pipeline_asr_wav2vec2_base_960h date: 2022-09-23 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_960h` is an English model originally trained by facebook. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_960h_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_960h_en_4.2.0_3.0_1663934860018.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_960h_en_4.2.0_3.0_1663934860018.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_960h', lang = 'en') annotations = pipeline.fullAnnotate(listAudioFloats) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_960h", lang = "en") val annotations = pipeline.fullAnnotate(listAudioFloats) ```
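The `listAudioFloats` variable above is the raw audio signal as a list of floats (Wav2Vec2 models typically expect 16 kHz mono input). A minimal sketch of building it from a 16-bit PCM WAV file using only the Python standard library — the file path is illustrative:

```python
import struct
import wave

def wav_to_floats(path):
    # Read raw frames from a 16-bit PCM WAV file.
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit PCM samples"
        frames = w.readframes(w.getnframes())
    # Unpack little-endian signed 16-bit integers.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    # Normalize to the [-1.0, 1.0] float range.
    return [s / 32768.0 for s in samples]

# Usage (path is illustrative):
# listAudioFloats = wav_to_floats("speech.wav")
```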
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_960h| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|227.6 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Chinese BertForMaskedLM Base Cased model (from hfl) author: John Snow Labs name: bert_embeddings_chinese_lert_base date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese-lert-base` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_lert_base_zh_4.2.4_3.0_1670021017956.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_lert_base_zh_4.2.4_3.0_1670021017956.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_lert_base","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_lert_base","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
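The `embeddings` column produced above holds one vector per token. Downstream tasks often compare such vectors with cosine similarity; as a generic, framework-free sketch (toy 4-dimensional vectors stand in for the model's 768-dimensional outputs):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Identical vectors score 1.0; orthogonal vectors score 0.0.
print(cosine([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0]))  # 1.0
```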
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_lert_base| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|383.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/chinese-lert-base - https://github.com/ymcui/LERT/blob/main/README_EN.md - https://arxiv.org/abs/2211.05344 --- layout: model title: English image_classifier_vit_modelversion01 ViTForImageClassification from sudo-s author: John Snow Labs name: image_classifier_vit_modelversion01 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_modelversion01` is an English model originally trained by sudo-s.
## Predicted Entities `45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_modelversion01_en_4.1.0_3.0_1660167477101.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_modelversion01_en_4.1.0_3.0_1660167477101.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_modelversion01", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_modelversion01", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_modelversion01| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.3 MB| --- layout: model title: Detect PHI for Deidentification purposes (Portuguese, reduced entities) author: John Snow Labs name: ner_deid_generic date: 2022-04-13 tags: [deid, deidentification, pt, licensed, clinical] task: De-identification language: pt edition: Healthcare NLP 3.4.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Portuguese) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 8 entities. This NER model is trained with a combination of custom datasets and data augmentation techniques.
## Predicted Entities `CONTACT`, `NAME`, `DATE`, `ID`, `SEX`, `LOCATION`, `PROFESSION`, `AGE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_pt_3.4.2_3.0_1649846957944.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_pt_3.4.2_3.0_1649846957944.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.WordEmbeddingsModel.pretrained("w2v_cc_300d", "pt")\ .setInputCols(["sentence","token"])\ .setOutputCol("word_embeddings") clinical_ner = medical.NerModel.pretrained("ner_deid_generic", "pt", "clinical/models")\ .setInputCols(["sentence","token","word_embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter]) text = [''' Detalhes do paciente. Nome do paciente: Pedro Gonçalves NHC: 2569870. Endereço: Rua Das Flores 23. Cidade/ Província: Porto. Código Postal: 21754-987. Dados de cuidados. Data de nascimento: 10/10/1963. Idade: 53 anos Sexo: Homen Data de admissão: 17/06/2016. 
Doutora: Maria Santos '''] data = spark.createDataFrame([text]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "pt") .setInputCols(Array("sentence","token")) .setOutputCol("word_embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic", "pt", "clinical/models") .setInputCols(Array("sentence","token","word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter)) val text = """Detalhes do paciente. Nome do paciente: Pedro Gonçalves NHC: 2569870. Endereço: Rua Das Flores 23. Cidade/ Província: Porto. Código Postal: 21754-987. Dados de cuidados. Data de nascimento: 10/10/1963. Idade: 53 anos Sexo: Homen Data de admissão: 17/06/2016. Doutora: Maria Santos""" val df = Seq(text).toDF("text") val results = pipeline.fit(df).transform(df) ``` {:.nlu-block} ```python import nlu nlu.load("pt.med_ner.deid.generic").predict(""" Detalhes do paciente. Nome do paciente: Pedro Gonçalves NHC: 2569870. Endereço: Rua Das Flores 23. Cidade/ Província: Porto. Código Postal: 21754-987. Dados de cuidados. Data de nascimento: 10/10/1963. Idade: 53 anos Sexo: Homen Data de admissão: 17/06/2016. Doutora: Maria Santos """) ```
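The NerConverter stage groups the model's IOB tags into labeled chunks. A minimal pure-Python sketch of that grouping logic (an approximation of NerConverter's behavior, not its actual implementation):

```python
def iob_to_chunks(tokens, tags):
    """Group B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag closes any open chunk and starts a new one.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # An I- tag of the same label continues the open chunk.
            current.append(tok)
        else:
            # O tags (or mismatched I- tags) close the open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Doutora", ":", "Maria", "Santos"]
tags = ["O", "O", "B-NAME", "I-NAME"]
print(iob_to_chunks(tokens, tags))  # [('Maria Santos', 'NAME')]
```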
## Results ```bash +-----------------+---------+ |chunk |ner_label| +-----------------+---------+ |Pedro Gonçalves |NAME | |2569870 |ID | |Rua Das Flores 23|LOCATION | |Porto |LOCATION | |21754-987 |LOCATION | |10/10/1963 |DATE | |53 |AGE | |17/06/2016 |DATE | |Maria Santos |NAME | +-----------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic| |Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|pt| |Size:|15.0 MB| ## References - Custom John Snow Labs datasets - Data augmentation techniques ## Benchmarking ```bash label tp fp fn total precision recall f1 CONTACT 191.0 2.0 2.0 193.0 0.9896 0.9896 0.9896 NAME 2640.0 82.0 52.0 2692.0 0.9699 0.9807 0.9752 DATE 1316.0 24.0 5.0 1321.0 0.9821 0.9962 0.9891 ID 54.0 3.0 9.0 63.0 0.9474 0.8571 0.9 SEX 669.0 9.0 8.0 677.0 0.9867 0.9882 0.9875 LOCATION 5784.0 149.0 206.0 5990.0 0.9749 0.9656 0.9702 PROFESSION 249.0 17.0 27.0 276.0 0.9361 0.9022 0.9188 AGE 536.0 14.0 10.0 546.0 0.9745 0.9817 0.9781 macro - - - - - - 0.9636 weighted - - - - - - 0.9736 ``` --- layout: model title: Legal Fractional shares Clause Binary Classifier author: John Snow Labs name: legclf_fractional_shares_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `fractional-shares` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial linked above). This model can be combined with any of the hundreds of other Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `fractional-shares` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_fractional_shares_clause_en_1.0.0_3.2_1660122460122.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_fractional_shares_clause_en_1.0.0_3.2_1660122460122.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_fractional_shares_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
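The paragraph-splitting strategy recommended above can be sketched in plain Python. Note this is a rough approximation: the whitespace word count stands in for the real subword tokenizer, which may count tokens differently:

```python
import re

def split_paragraphs(text, max_tokens=512):
    """Split a long document on blank lines and keep only pieces that fit
    the classifier's 512-token input window (whitespace words as a proxy)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [p for p in paragraphs if len(p.split()) <= max_tokens]

doc = "Clause one text.\n\nClause two text, a bit longer.\n\n\nClause three."
print(split_paragraphs(doc))
```

Each returned piece can then be fed to the classifier as a separate `clause_text` row.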
## Results ```bash +-------+ | result| +-------+ |[fractional-shares]| |[other]| |[other]| |[fractional-shares]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_fractional_shares_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support fractional-shares 1.00 1.00 1.00 36 other 1.00 1.00 1.00 107 accuracy - - 1.00 143 macro-avg 1.00 1.00 1.00 143 weighted-avg 1.00 1.00 1.00 143 ``` --- layout: model title: Stop Words Cleaner for Swahili author: John Snow Labs name: stopwords_sw date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: sw edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, sw] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions.
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_sw_sw_2.5.4_2.4_1594742438383.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_sw_sw_2.5.4_2.4_1594742438383.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_sw", "sw") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Mbali na kuwa mfalme wa kaskazini, John Snow ni daktari wa Kiingereza na kiongozi katika ukuzaji wa anesthesia na usafi wa matibabu.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_sw", "sw") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Mbali na kuwa mfalme wa kaskazini, John Snow ni daktari wa Kiingereza na kiongozi katika ukuzaji wa anesthesia na usafi wa matibabu.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Mbali na kuwa mfalme wa kaskazini, John Snow ni daktari wa Kiingereza na kiongozi katika ukuzaji wa anesthesia na usafi wa matibabu."""] stopword_df = nlu.load('sw.stopwords').predict(text) stopword_df[['cleanTokens']] ```
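Conceptually, the cleaner simply drops any token found in its Swahili stop-word list. A minimal pure-Python sketch (the stop words shown are a hypothetical subset of the model's actual list):

```python
# Hypothetical subset of the Swahili stop-word list bundled with the model.
SW_STOPWORDS = {"na", "wa", "ni", "katika", "kuwa"}

def remove_stopwords(tokens, stopwords=SW_STOPWORDS):
    """Drop tokens whose lowercase form is in the stop-word set
    (the pretrained model is likewise case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ["Mbali", "na", "kuwa", "mfalme", "wa", "kaskazini"]
print(remove_stopwords(tokens))  # ['Mbali', 'mfalme', 'kaskazini']
```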
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=4, result='Mbali', metadata={'sentence': '0'}), Row(annotatorType='token', begin=14, end=19, result='mfalme', metadata={'sentence': '0'}), Row(annotatorType='token', begin=24, end=32, result='kaskazini', metadata={'sentence': '0'}), Row(annotatorType='token', begin=33, end=33, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=35, end=38, result='John', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_sw| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|sw| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: English DistilBertForQuestionAnswering model (from tiennvcs) Infovqa author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_infovqa date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-infovqa` is an English model originally trained by `tiennvcs`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_infovqa_en_4.0.0_3.0_1654723940578.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_infovqa_en_4.0.0_3.0_1654723940578.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_infovqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_infovqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.base_uncased.by_tiennvcs").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
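Under the hood, extractive QA models score every token as a candidate answer start and end, then pick the best-scoring valid span. A toy, framework-free illustration of that span-selection step (not the annotator's actual implementation; the logits below are made up):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick the (start, end) pair maximizing start_logit + end_logit,
    with end >= start and a bounded span length."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            if s + end_logits[j] > best_score:
                best_score = s + end_logits[j]
                best = (i, j)
    return best

tokens = ["my", "name", "is", "clara", "and", "i", "live", "in", "berkeley"]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3]
end = [0.1, 0.1, 0.1, 4.8, 0.2, 0.1, 0.1, 0.1, 0.4]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # clara
```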
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_infovqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/tiennvcs/distilbert-base-uncased-finetuned-infovqa --- layout: model title: English image_classifier_vit_indian_snacks ViTForImageClassification from thak123 author: John Snow Labs name: image_classifier_vit_indian_snacks date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_indian_snacks` is an English model originally trained by thak123. ## Predicted Entities `marker`, `pencil`, `pens`, `crayon`, `chalk` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_indian_snacks_en_4.1.0_3.0_1660170875179.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_indian_snacks_en_4.1.0_3.0_1660170875179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_indian_snacks", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_indian_snacks", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_indian_snacks| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Fast Neural Machine Translation Model from English to Ukrainian author: John Snow Labs name: opus_mt_en_uk date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, uk, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `uk` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_uk_xx_2.7.0_2.4_1609167383827.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_uk_xx_2.7.0_2.4_1609167383827.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_uk", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_uk", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.uk').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_uk| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from English to Kuanyama author: John Snow Labs name: opus_mt_en_kj date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, kj, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `kj` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_kj_xx_2.7.0_2.4_1609164412728.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_kj_xx_2.7.0_2.4_1609164412728.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_kj", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_kj", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.kj').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_kj| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: RoBERTa Base Ontonotes NER Pipeline author: John Snow Labs name: roberta_base_token_classifier_ontonotes_pipeline date: 2022-06-25 tags: [open_source, ner, token_classifier, roberta, ontonotes, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [roberta_base_token_classifier_ontonotes](https://nlp.johnsnowlabs.com/2021/09/26/roberta_base_token_classifier_ontonotes_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_token_classifier_ontonotes_pipeline_en_4.0.0_3.0_1656118335450.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_token_classifier_ontonotes_pipeline_en_4.0.0_3.0_1656118335450.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("roberta_base_token_classifier_ontonotes_pipeline", lang = "en") pipeline.annotate("My name is John and I have been working at John Snow Labs since November 2020.") ``` ```scala val pipeline = new PretrainedPipeline("roberta_base_token_classifier_ontonotes_pipeline", lang = "en") pipeline.annotate("My name is John and I have been working at John Snow Labs since November 2020.") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PERSON | |John Snow Labs|ORG | |November 2020 |DATE | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_token_classifier_ontonotes_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|456.5 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - RoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: Recognize Entities DL Pipeline for Danish - Medium author: John Snow Labs name: entity_recognizer_md date: 2021-03-22 tags: [open_source, danish, entity_recognizer_md, pipeline, da] supported: true task: [Named Entity Recognition, Lemmatization] language: da edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_md is a pretrained pipeline that can be used to process text with a simple pipeline that performs basic processing steps and recognizes entities. It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_da_3.0.0_3.0_1616455694691.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_da_3.0.0_3.0_1616455694691.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('entity_recognizer_md', lang = 'da')
annotations = pipeline.fullAnnotate("Hej fra John Snow Labs! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("entity_recognizer_md", lang = "da")
val result = pipeline.fullAnnotate("Hej fra John Snow Labs! ")(0)
```

{:.nlu-block}
```python
import nlu

text = ["Hej fra John Snow Labs! "]
result_df = nlu.load('da.ner.md').predict(text)
result_df
```
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:-----------------------------|:----------------------------|:----------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Hej fra John Snow Labs! '] | ['Hej fra John Snow Labs!'] | ['Hej', 'fra', 'John', 'Snow', 'Labs!'] | [[0.4006600081920624,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|da| --- layout: model title: Detect Relations Between Genes and Phenotypes author: John Snow Labs name: re_human_phenotype_gene_clinical date: 2020-09-30 task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 2.6.0 spark_version: 2.4 tags: [re, en, licensed, clinical] supported: true article_header: type: cover use_language_switcher: "Python" --- {:.h2_title} ## Description This model can be used to identify relations between genes and phenotypes. `1` : There is a relation between gene and phenotype, `0` : There is not a relation between gene and phenotype. 
{:.h2_title}
## Predicted Entities

`0`, `1`

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_human_phenotype_gene_clinical_en_2.5.5_2.4_1598560152543.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_human_phenotype_gene_clinical_en_2.5.5_2.4_1598560152543.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

{:.h2_title}
## How to use

Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, PerceptronModel, DependencyParserModel, WordEmbeddingsModel, NerDLModel, NerConverter, RelationExtractionModel.

The table below lists the `re_human_phenotype_gene_clinical` RE model, its labels, the optimal NER model, and the meaningful relation pairs.

| RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS |
|:--------------------------------:|:--------------:|:---------------------------------:|---------------------------|
| re_human_phenotype_gene_clinical | 0,1 | ner_human_phenotype_gene_clinical | [“No need to set pairs.”] |
{% include programmingLanguageSelectScalaPython.html %}
```python
...
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

ner_tagger = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens", "embeddings"])\
    .setOutputCol("ner_tags")

ner_converter = NerConverter()\
    .setInputCols(["sentences", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

clinical_re_Model = RelationExtractionModel.pretrained("re_human_phenotype_gene_clinical", "en", "clinical/models")\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setRelationPairs(["hp-gene", "gene-hp"])\
    .setMaxSyntacticDistance(4)

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, pos_tagger, ner_tagger, ner_converter, dependency_parser, clinical_re_Model])

light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

annotations = light_pipeline.fullAnnotate("""Bilateral colobomatous microphthalmia and developmental delay in whom genetic studies identified a homozygous TENM3""")
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentences")

val tokenizer = new Tokenizer()
    .setInputCols("sentences")
    .setOutputCol("tokens")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens"))
    .setOutputCol("embeddings")

val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens"))
    .setOutputCol("pos_tags")

val ner_tagger = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens", "embeddings"))
    .setOutputCol("ner_tags")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentences", "tokens", "ner_tags"))
    .setOutputCol("ner_chunks")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
    .setInputCols(Array("sentences", "pos_tags", "tokens"))
    .setOutputCol("dependencies")

val clinical_re_Model = RelationExtractionModel.pretrained("re_human_phenotype_gene_clinical", "en", "clinical/models")
    .setInputCols(Array("embeddings", "pos_tags", "ner_chunks", "dependencies"))
    .setOutputCol("relations")
    .setRelationPairs(Array("hp-gene", "gene-hp"))
    .setMaxSyntacticDistance(4)

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, pos_tagger, ner_tagger, ner_converter, dependency_parser, clinical_re_Model))

val data = Seq("""Bilateral colobomatous microphthalmia and developmental delay in whom genetic studies identified a homozygous TENM3""").toDS().toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.h2_title} ## Results ```bash +----+------------+-----------+-----------------+---------------+---------------------+-----------+-----------------+---------------+---------------------+--------------+ | | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | +====+============+===========+=================+===============+=====================+===========+=================+===============+=====================+==============+ | 0 | 1 | HP | 23 | 36 | microphthalmia | HP | 42 | 60 | developmental delay | 0.999954 | +----+------------+-----------+-----------------+---------------+---------------------+-----------+-----------------+---------------+---------------------+--------------+ | 1 | 1 | HP | 23 | 36 | microphthalmia | GENE | 110 | 114 | TENM3 | 0.999999 | +----+------------+-----------+-----------------+---------------+---------------------+-----------+-----------------+---------------+---------------------+--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_human_phenotype_gene_clinical| |Type:|re| |Compatibility:|Healthcare NLP 2.6.0 +| |Edition:|Official| |License:|Licensed| |Input Labels:|[embeddings, pos_tags, ner_chunks, dependencies]| |Output Labels:|[relations]| |Language:|[en]| |Case sensitive:|false| ## Data source This model was trained with data from https://github.com/lasigeBioTM/PGR For further details please refer to https://aclweb.org/anthology/papers/N/N19/N19-1152/ --- layout: model title: Spanish RoBERTa Embeddings (from flax-community) author: John Snow Labs name: roberta_embeddings_bertin_roberta_large_spanish date: 2022-04-14 tags: [roberta, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to 
Hugging Face, adapted and imported into Spark NLP. `bertin-roberta-large-spanish` is a Spanish model originally trained by `flax-community`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_bertin_roberta_large_spanish_es_3.4.2_3.0_1649945547128.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_bertin_roberta_large_spanish_es_3.4.2_3.0_1649945547128.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_bertin_roberta_large_spanish","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_bertin_roberta_large_spanish","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.bertin_roberta_large_spanish").predict("""Me encanta chispa nlp""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_bertin_roberta_large_spanish|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|es|
|Size:|233.6 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/flax-community/bertin-roberta-large-spanish
- https://github.com/google/flax
- https://discord.com/channels/858019234139602994/859113060068229190

---
layout: model
title: Japanese Bert Embeddings (Large, Character Tokenization)
author: John Snow Labs
name: bert_embeddings_bert_large_japanese_char
date: 2022-04-11
tags: [bert, embeddings, ja, open_source]
task: Embeddings
language: ja
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-japanese-char` is a Japanese model originally trained by `cl-tohoku`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_japanese_char_ja_3.4.2_3.0_1649674601870.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_japanese_char_ja_3.4.2_3.0_1649674601870.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_japanese_char","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["私はSpark NLPを愛しています"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_japanese_char","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("私はSpark NLPを愛しています").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ja.embed.bert_large_japanese_char").predict("""私はSpark NLPを愛しています""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_large_japanese_char|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ja|
|Size:|1.2 GB|
|Case sensitive:|true|

## References

- https://huggingface.co/cl-tohoku/bert-large-japanese-char
- https://github.com/google-research/bert
- https://pypi.org/project/unidic-lite/
- https://github.com/cl-tohoku/bert-japanese/tree/v2.0
- https://taku910.github.io/mecab/
- https://github.com/neologd/mecab-ipadic-neologd
- https://github.com/polm/fugashi
- https://github.com/polm/unidic-lite
- https://www.tensorflow.org/tfrc/
- https://creativecommons.org/licenses/by-sa/3.0/

---
layout: model
title: Legal Termination of employment Clause Binary Classifier
author: John Snow Labs
name: legclf_termination_of_employment_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `termination-of-employment` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.
If you have long legal documents and want to look for clauses, we recommend splitting the documents first using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `termination-of-employment`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_termination_of_employment_clause_en_1.0.0_3.2_1660124066973.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_termination_of_employment_clause_en_1.0.0_3.2_1660124066973.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_termination_of_employment_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
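The description above recommends paragraph splitting (by multiline) before classification. A minimal, framework-agnostic sketch of that pre-processing step — plain Python, independent of Spark NLP, with an invented sample contract — so each paragraph can become its own `clause_text` row:

```python
import re

def split_paragraphs(text):
    # Split on one or more blank lines; drop empty fragments so each
    # remaining paragraph can be classified independently.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("1. TERMINATION. The Company may terminate this Agreement...\n\n"
       "2. SEVERANCE. Upon termination, the Executive shall receive...\n\n\n"
       "3. MISCELLANEOUS.")
paragraphs = split_paragraphs(doc)
print(len(paragraphs))  # 3
```

Each element of `paragraphs` can then be loaded into the `clause_text` column of the DataFrame shown above.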
## Results

```bash
+---------------------------+
|                     result|
+---------------------------+
|[termination-of-employment]|
|                    [other]|
|                    [other]|
|[termination-of-employment]|
+---------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_termination_of_employment_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|

## References

Legal documents, scraped from the Internet and classified in-house.

## Benchmarking

```bash
                    label  precision  recall  f1-score  support
                    other       0.97    1.00      0.99       72
termination-of-employment       1.00    0.95      0.98       42
                 accuracy          -       -      0.98      114
                macro-avg       0.99    0.98      0.98      114
             weighted-avg       0.98    0.98      0.98      114
```

---
layout: model
title: English asr_wav2vec2_base_960h_by_facebook TFWav2Vec2ForCTC from facebook
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_960h_by_facebook
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_960h_by_facebook` is an English model originally trained by facebook.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_960h_by_facebook_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_960h_by_facebook_en_4.2.0_3.0_1664035782305.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_960h_by_facebook_en_4.2.0_3.0_1664035782305.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_960h_by_facebook', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_960h_by_facebook", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_960h_by_facebook| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|227.6 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Word2Vec Embeddings in Mingrelian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, xmf, open_source] task: Embeddings language: xmf edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_xmf_3.4.1_3.0_1647446081530.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_xmf_3.4.1_3.0_1647446081530.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","xmf") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","xmf") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("xmf.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
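Each token in the resulting `embeddings` column carries a 300-dimensional vector (see Dimension in the metadata below). Once the vectors are pulled out of the result, comparing tokens reduces to plain cosine similarity. A minimal sketch with short toy vectors standing in for the real 300-d ones:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot(u, v) / (|u| * |v|).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d stand-ins for the model's 300-d token vectors.
v_love = [0.21, -0.44, 0.73]
v_like = [0.19, -0.40, 0.70]
v_rock = [-0.65, 0.12, -0.08]

print(cosine(v_love, v_like) > cosine(v_love, v_rock))  # True
```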
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|xmf|
|Size:|127.3 MB|
|Case sensitive:|false|
|Dimension:|300|

---
layout: model
title: Legal Severance Clause Binary Classifier
author: John Snow Labs
name: legclf_severance_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `severance` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.

If you have long legal documents and want to look for clauses, we recommend splitting the documents first using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities `other`, `severance` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_severance_clause_en_1.0.0_3.2_1660124007026.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_severance_clause_en_1.0.0_3.2_1660124007026.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_severance_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
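The 512-token limit noted in the description above can be screened for cheaply before running the pipeline. Whitespace splitting only approximates the model's subword tokenizer (subword counts are usually higher), so the sketch below flags texts well before the hard cap; the clause strings are invented examples:

```python
def needs_split(text, max_tokens=512, safety_margin=0.75):
    # Approximate the subword token count with whitespace words, then
    # flag anything above a conservative fraction of the 512-token cap.
    return len(text.split()) > max_tokens * safety_margin

short_clause = "Either party may terminate this Agreement upon thirty days written notice."
long_clause = "whereas " * 600
print(needs_split(short_clause), needs_split(long_clause))  # False True
```

Texts flagged `True` are candidates for the paragraph or header splitting techniques described above.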
## Results

```bash
+-----------+
|     result|
+-----------+
|[severance]|
|    [other]|
|    [other]|
|[severance]|
+-----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_severance_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|

## References

Legal documents, scraped from the Internet and classified in-house.

## Benchmarking

```bash
       label  precision  recall  f1-score  support
       other       0.94    0.94      0.94       68
   severance       0.86    0.86      0.86       28
    accuracy          -       -      0.92       96
   macro-avg       0.90    0.90      0.90       96
weighted-avg       0.92    0.92      0.92       96
```

---
layout: model
title: Explain Document pipeline for Russian (explain_document_lg)
author: John Snow Labs
name: explain_document_lg
date: 2021-03-23
tags: [open_source, russian, explain_document_lg, pipeline, ru]
supported: true
task: [Named Entity Recognition, Lemmatization]
language: ru
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The explain_document_lg is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps and recognizes entities.
It performs most of the common text processing tasks on your dataframe.

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_ru_3.0.0_3.0_1616501405939.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_lg_ru_3.0.0_3.0_1616501405939.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('explain_document_lg', lang = 'ru')
annotations = pipeline.fullAnnotate("Здравствуйте из Джона Снежных Лабораторий! ")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("explain_document_lg", lang = "ru")
val result = pipeline.fullAnnotate("Здравствуйте из Джона Снежных Лабораторий! ")(0)
```

{:.nlu-block}
```python
import nlu

text = ["Здравствуйте из Джона Снежных Лабораторий! "]
result_df = nlu.load('ru.explain.lg').predict(text)
result_df
```
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:------------------------------------------------|:-----------------------------------------------|:-----------------------------------------------------------|:-----------------------------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:-------------------------------| | 0 | ['Здравствуйте из Джона Снежных Лабораторий! '] | ['Здравствуйте из Джона Снежных Лабораторий!'] | ['Здравствуйте', 'из', 'Джона', 'Снежных', 'Лабораторий!'] | ['здравствовать', 'из', 'Джон', 'Снежных', 'Лабораторий!'] | ['NOUN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.0, 0.0, 0.0, 0.0,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['Джона Снежных Лабораторий!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_lg| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|ru| --- layout: model title: Word2Vec Embeddings in Fiji Hindi (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-15 tags: [cc, embeddings, fastText, word2vec, hif, open_source] task: Embeddings language: hif edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_hif_3.4.1_3.0_1647370472863.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_hif_3.4.1_3.0_1647370472863.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","hif") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","hif") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("hif.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|w2v_cc_300d|
|Type:|embeddings|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|hif|
|Size:|46.6 MB|
|Case sensitive:|false|
|Dimension:|300|

---
layout: model
title: English XlmRoBertaForQuestionAnswering (from teacookies)
author: John Snow Labs
name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265900
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-more_fine_tune_24465520-26265900` is an English model originally trained by `teacookies`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265900_en_4.0.0_3.0_1655984666096.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265900_en_4.0.0_3.0_1655984666096.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265900","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265900","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265900").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
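In the NLU one-liner above, the question and context travel in a single string joined by the `|||` separator. A minimal plain-Python sketch of that packing convention (the helper below is illustrative, not nlu's internal code):

```python
def split_qa(packed: str):
    """Split a 'question|||context' string into its two parts."""
    question, _, context = packed.partition("|||")
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
```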
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265900| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|888.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265900 --- layout: model title: Translate English to Maltese Pipeline author: John Snow Labs name: translate_en_mt date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, mt, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module especially on larger sequence. The use of an accelerator such as GPU is recommended. - source languages: `en` - target languages: `mt` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_mt_xx_2.7.0_2.4_1609692007492.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_mt_xx_2.7.0_2.4_1609692007492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_mt", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_mt", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.mt').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_mt| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Oncology Pipeline for Therapies author: John Snow Labs name: oncology_therapy_pipeline date: 2022-11-04 tags: [licensed, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Healthcare 4.2.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline includes Named-Entity Recognition and Assertion Status models to extract information from oncology texts. This pipeline focuses on entities related to therapies. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ONCOLOGY/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/oncology_therapy_pipeline_en_4.2.2_3.0_1667593592479.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/oncology_therapy_pipeline_en_4.2.2_3.0_1667593592479.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("oncology_therapy_pipeline", "en", "clinical/models") pipeline.fullAnnotate("The patient underwent a mastectomy two years ago. She is currently receiving her second cycle of adriamycin and cyclophosphamide, and is in good overall condition.")[0] ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("oncology_therapy_pipeline", "en", "clinical/models") val result = pipeline.fullAnnotate("""The patient underwent a mastectomy two years ago. She is currently receiving her second cycle of adriamycin and cyclophosphamide, and is in good overall condition.""")(0) ``` {:.nlu-block} ```python import nlu nlu.load("en.oncology_therpay.pipeline").predict("""The patient underwent a mastectomy two years ago. She is currently receiving her second cycle of adriamycin and cyclophosphamide, and is in good overall condition.""") ```
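The pipeline's `fullAnnotate` output can be flattened into (chunk, ner_label) rows like the tables shown in the Results section. A minimal sketch, assuming the usual Spark NLP annotation shape (a `result` text plus an `entity` key in `metadata`) and using stand-in objects instead of a live clinical pipeline:

```python
from collections import namedtuple

# Stand-in for Spark NLP's Annotation: fullAnnotate chunk annotations
# expose .result (the chunk text) and .metadata (including the 'entity' label).
Annotation = namedtuple("Annotation", ["result", "metadata"])

def chunk_table(annotations):
    """Turn chunk annotations into (chunk, ner_label) rows."""
    return [(a.result, a.metadata["entity"]) for a in annotations]

chunks = [
    Annotation("mastectomy", {"entity": "Cancer_Surgery"}),
    Annotation("second cycle", {"entity": "Cycle_Number"}),
    Annotation("adriamycin", {"entity": "Chemotherapy"}),
]
rows = chunk_table(chunks)
```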
## Results ```bash ******************** ner_oncology_wip results ******************** | chunk | ner_label | |:-----------------|:---------------| | mastectomy | Cancer_Surgery | | second cycle | Cycle_Number | | adriamycin | Chemotherapy | | cyclophosphamide | Chemotherapy | ******************** ner_oncology_wip results ******************** | chunk | ner_label | |:-----------------|:---------------| | mastectomy | Cancer_Surgery | | second cycle | Cycle_Number | | adriamycin | Chemotherapy | | cyclophosphamide | Chemotherapy | ******************** ner_oncology_wip results ******************** | chunk | ner_label | |:-----------------|:---------------| | mastectomy | Cancer_Surgery | | second cycle | Cycle_Number | | adriamycin | Cancer_Therapy | | cyclophosphamide | Cancer_Therapy | ******************** ner_oncology_unspecific_posology_wip results ******************** | chunk | ner_label | |:-----------------|:---------------------| | mastectomy | Cancer_Therapy | | second cycle | Posology_Information | | adriamycin | Cancer_Therapy | | cyclophosphamide | Cancer_Therapy | ******************** assertion_oncology_wip results ******************** | chunk | ner_label | assertion | |:-----------------|:---------------|:------------| | mastectomy | Cancer_Surgery | Past | | adriamycin | Chemotherapy | Present | | cyclophosphamide | Chemotherapy | Present | ******************** assertion_oncology_treatment_binary_wip results ******************** | chunk | ner_label | assertion | |:-----------------|:---------------|:----------------| | mastectomy | Cancer_Surgery | Present_Or_Past | | adriamycin | Chemotherapy | Present_Or_Past | | cyclophosphamide | Chemotherapy | Present_Or_Past | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|oncology_therapy_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP for Healthcare 4.2.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - 
SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ChunkMergeModel - AssertionDLModel - AssertionDLModel --- layout: model title: English DistilBertForQuestionAnswering model (from Ifenna) author: John Snow Labs name: distilbert_qa_dbert_3epoch date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dbert-3epoch` is a English model originally trained by `Ifenna`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_dbert_3epoch_en_4.0.0_3.0_1654723432240.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_dbert_3epoch_en_4.0.0_3.0_1654723432240.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_dbert_3epoch","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_dbert_3epoch","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.by_Ifenna").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_dbert_3epoch| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Ifenna/dbert-3epoch --- layout: model title: Portuguese BertForMaskedLM Cased model (from pucpr) author: John Snow Labs name: bert_embeddings_biobertpt_clin date: 2022-12-02 tags: [pt, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: pt edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobertpt-clin` is a Portuguese model originally trained by `pucpr`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_biobertpt_clin_pt_4.2.4_3.0_1670020835268.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_biobertpt_clin_pt_4.2.4_3.0_1670020835268.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_biobertpt_clin","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_biobertpt_clin","pt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_biobertpt_clin| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|pt| |Size:|667.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/pucpr/biobertpt-clin - https://www.aclweb.org/anthology/2020.clinicalnlp-1.7/ - https://www.aclweb.org/anthology/2020.clinicalnlp-1.7/ - https://github.com/HAILab-PUCPR/BioBERTpt --- layout: model title: English BertForMaskedLM Base Uncased model (from mlcorelib) author: John Snow Labs name: bert_embeddings_deberta_base_uncased date: 2022-12-02 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deberta-base-uncased` is a English model originally trained by `mlcorelib`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_deberta_base_uncased_en_4.2.4_3.0_1670021994466.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_deberta_base_uncased_en_4.2.4_3.0_1670021994466.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_deberta_base_uncased","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_deberta_base_uncased","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_deberta_base_uncased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|409.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/mlcorelib/deberta-base-uncased - https://arxiv.org/abs/1810.04805 - https://github.com/google-research/bert - https://yknzhu.wixsite.com/mbweb - https://en.wikipedia.org/wiki/English_Wikipedia --- layout: model title: Legal Meetings Clause Binary Classifier author: John Snow Labs name: legclf_meetings_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `meetings` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
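The paragraph-splitting option listed above can be done before the text ever reaches the classifier. A minimal sketch that splits on blank lines and flags paragraphs that would exceed the 512-token embedding window (the regex and the whitespace token count are simplifying assumptions for illustration, not part of Legal NLP):

```python
import re

def split_paragraphs(text, max_tokens=512):
    """Split a document on blank lines; flag whether each paragraph
    fits the embedding model's 512-token window (whitespace tokens)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n+", text) if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

doc = ("Section 7. Meetings of the Board shall be held quarterly.\n\n"
       "Section 8. Notices shall be sent in writing.")
chunks = split_paragraphs(doc)
```

Each resulting paragraph can then be fed to the classifier pipeline as its own row.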
This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `meetings` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_meetings_clause_en_1.0.0_3.2_1660123728288.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_meetings_clause_en_1.0.0_3.2_1660123728288.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_meetings_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
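Combining several binary clause classifiers, as the description suggests, amounts to collecting each model's predicted category into a per-clause True/False flag. A plain-Python sketch with stand-in predictions (the dict shape below is an assumption for illustration, not the classifiers' actual output format):

```python
def clause_flags(predictions):
    """Map {clause_name: predicted_label} to {clause_name: bool};
    True when the model predicted its clause label rather than 'other'."""
    return {name: label != "other" for name, label in predictions.items()}

flags = clause_flags({"meetings": "meetings", "attorney_fees": "other"})
```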
## Results ```bash +----------+ | result| +----------+ |[meetings]| | [other]| | [other]| |[meetings]| +----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_meetings_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support meetings 0.96 0.93 0.94 27 other 0.97 0.99 0.98 75 accuracy - - 0.97 102 macro-avg 0.97 0.96 0.96 102 weighted-avg 0.97 0.97 0.97 102 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_512_finetuned_squad_seed_10 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-512-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_512_finetuned_squad_seed_10_en_4.3.0_3.0_1674215413181.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_512_finetuned_squad_seed_10_en_4.3.0_3.0_1674215413181.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_512_finetuned_squad_seed_10","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_512_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_512_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|433.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-512-finetuned-squad-seed-10 --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_3 TFWav2Vec2ForCTC from doddle124578 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab_3 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_3` is an English model originally trained by doddle124578. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_3_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_3_en_4.2.0_3.0_1664038327564.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_3_en_4.2.0_3.0_1664038327564.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_3', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_3", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_3| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Fast Neural Machine Translation Model from English to North Germanic Languages author: John Snow Labs name: opus_mt_en_gmq date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, gmq, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `gmq` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_gmq_xx_2.7.0_2.4_1609170508234.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_gmq_xx_2.7.0_2.4_1609170508234.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_gmq", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_gmq", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.gmq').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_gmq| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English T5ForConditionalGeneration Mini Cased model (from google) author: John Snow Labs name: t5_efficient_mini date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-mini` is a English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_mini_en_4.3.0_3.0_1675118119036.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_mini_en_4.3.0_3.0_1675118119036.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_mini","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_mini","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_mini| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|83.9 MB| ## References - https://huggingface.co/google/t5-efficient-mini - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from evelynerhuan) author: John Snow Labs name: distilbert_qa_base_uncased_model_1 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-model-1` is a English model originally trained by `evelynerhuan`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_model_1_en_4.3.0_3.0_1672773999562.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_model_1_en_4.3.0_3.0_1672773999562.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_model_1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_model_1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_model_1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/evelynerhuan/distilbert-base-uncased-model-1 --- layout: model title: Legal Attorney fees Clause Binary Classifier (md) author: John Snow Labs name: legclf_attorney_fees_md date: 2023-01-11 tags: [en, legal, classification, document, agreement, contract, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `attorney-fees` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have long legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings of this model allow up to 512 tokens. If your input is longer, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `attorney-fees` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_attorney_fees_md_en_1.0.0_3.0_1673460260302.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_attorney_fees_md_en_1.0.0_3.0_1673460260302.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_attorney_fees_md", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
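As the description recommends, long contracts should be pre-split into paragraph-sized chunks before classification. A minimal multiline paragraph-splitting sketch in plain Python (independent of Spark NLP; the helper name and sample text are illustrative only):

```python
import re

def split_into_paragraphs(text: str) -> list:
    # Split on blank lines ("multiline" splitting) and drop empty fragments,
    # so each chunk can become one row of the clause_text column.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = (
    "1. FEES.\n\n"
    "Each party shall bear its own attorney fees.\n\n"
    "2. NOTICES.\nNotices shall be in writing."
)
paragraphs = split_into_paragraphs(doc)  # three paragraph chunks
```

Each resulting chunk can then be loaded into the `clause_text` DataFrame column shown above.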
## Results ```bash +---------------+ |         result| +---------------+ |[attorney-fees]| |        [other]| |        [other]| |[attorney-fees]| +---------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_attorney_fees_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash precision recall f1-score support choice-of-law 1.00 0.94 0.97 32 other 0.95 1.00 0.97 39 accuracy 0.97 71 macro avg 0.98 0.97 0.97 71 weighted avg 0.97 0.97 0.97 71 ``` --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_roberta_FT_new_newsqa date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta_FT_new_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_FT_new_newsqa_en_4.0.0_3.0_1655738813374.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_FT_new_newsqa_en_4.0.0_3.0_1655738813374.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_FT_new_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_FT_new_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.roberta.qa_roberta_ft_new_newsqa.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_FT_new_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|461.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/roberta_FT_new_newsqa --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4_CHEM_PubmedBERT date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4_CHEM_PubmedBERT` is a English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_CHEM_PubmedBERT_en_4.0.0_3.0_1657109029095.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_CHEM_PubmedBERT_en_4.0.0_3.0_1657109029095.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_CHEM_PubmedBERT","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_CHEM_PubmedBERT","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4_CHEM_PubmedBERT| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|407.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4_CHEM_PubmedBERT --- layout: model title: Explain Document Pipeline for Swedish author: John Snow Labs name: explain_document_sm date: 2021-03-22 tags: [open_source, swedish, explain_document_sm, pipeline, sv] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: sv edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_sm is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps. It performs most of the common text processing tasks on your dataframe {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_sv_3.0.0_3.0_1616428447759.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_sm_sv_3.0.0_3.0_1616428447759.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_sm', lang = 'sv') annotations = pipeline.fullAnnotate("Hej från John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_sm", lang = "sv") val result = pipeline.fullAnnotate("Hej från John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hej från John Snow Labs! "] result_df = nlu.load('sv.explain').predict(text) result_df ```
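`fullAnnotate` returns a dictionary keyed by output column (`token`, `lemma`, `pos`, `ner`, …). A sketch of pairing tokens with their tags, using plain lists as stand-ins for the `Annotation` objects the pipeline actually returns (the values below are illustrative, mirroring the Swedish example sentence):

```python
# Hypothetical `.result` values extracted from each annotation column.
annotations = {
    "token": ["Hej", "från", "John", "Snow", "Labs!"],
    "pos":   ["NOUN", "ADP", "PROPN", "PROPN", "PROPN"],
    "ner":   ["O", "O", "B-PER", "I-PER", "I-PER"],
}

# Zip the parallel lists into (token, pos, ner) triples for easy inspection.
tagged = list(zip(annotations["token"], annotations["pos"], annotations["ner"]))
```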
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:------------------------------|:-----------------------------|:-----------------------------------------|:-----------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Hej från John Snow Labs! '] | ['Hej från John Snow Labs!'] | ['Hej', 'från', 'John', 'Snow', 'Labs!'] | ['Hej', 'från', 'John', 'Snow', 'Labs!'] | ['NOUN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.0306969992816448,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_sm| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|sv| --- layout: model title: Extract temporal entities (Voice of the Patients) author: John Snow Labs name: ner_vop_temporal_wip date: 2023-04-20 tags: [licensed, clinical, en, ner, vop, patient, temporal] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts temporal references from documents written in the patient’s own words. Note: the ‘wip’ suffix indicates that the model development is work-in-progress; the model will be finalised and its performance will be improved in upcoming releases.
## Predicted Entities `Frequency`, `Duration`, `DateTime` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_temporal_wip_en_4.4.0_3.0_1682012928919.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_temporal_wip_en_4.4.0_3.0_1682012928919.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_temporal_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["I broke my arm playing football last month and had to get surgery in the orthopedic department. The cast just came off yesterday and I'm excited to start physical therapy and get back to the game."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_temporal_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("I broke my arm playing football last month and had to get surgery in the orthopedic department. The cast just came off yesterday and I'm excited to start physical therapy and get back to the game.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
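To turn the transformed DataFrame into readable (chunk, label) pairs, the `ner_chunk` annotations can be flattened. A plain-Python sketch with dicts standing in for the annotation rows (in a real pipeline you would explode the `ner_chunk` column and read its `entity` metadata key; the values below mirror the example sentence):

```python
# Hypothetical flattened ner_chunk annotations for the example sentence.
ner_chunks = [
    {"result": "last month", "metadata": {"entity": "DateTime"}},
    {"result": "yesterday", "metadata": {"entity": "DateTime"}},
]

# Pair each chunk text with its predicted label.
pairs = [(c["result"], c["metadata"]["entity"]) for c in ner_chunks]
```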
## Results ```bash | chunk | ner_label | |:-----------|:------------| | last month | DateTime | | yesterday | DateTime | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_temporal_wip| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.9 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello, I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
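The micro-averaged scores in the benchmarking table below pool the tp/fp/fn counts across all labels before computing precision, recall, and F1. The calculation can be sketched as follows (rounding in the published figures may differ slightly):

```python
def micro_metrics(counts):
    # Pool true positives, false positives, and false negatives across labels,
    # then compute precision, recall, and F1 from the pooled counts.
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# (tp, fp, fn) per label: Frequency, Duration, DateTime
per_label = [(886, 213, 186), (1888, 323, 383), (3886, 597, 437)]
p, r, f1 = micro_metrics(per_label)
```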
## Benchmarking ```bash label tp fp fn total precision recall f1 Frequency 886 213 186 1072 0.81 0.83 0.82 Duration 1888 323 383 2271 0.85 0.83 0.84 DateTime 3886 597 437 4323 0.87 0.90 0.88 macro_avg 6660 1133 1006 7666 0.84 0.85 0.85 micro_avg 6660 1133 1006 7666 0.86 0.87 0.86 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from SauravMaheshkar) author: John Snow Labs name: roberta_qa_base_chaii date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-chaii` is an English model originally trained by `SauravMaheshkar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_chaii_en_4.3.0_3.0_1674213133232.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_chaii_en_4.3.0_3.0_1674213133232.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_chaii","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_chaii","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_chaii| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/SauravMaheshkar/roberta-base-chaii --- layout: model title: Spanish RobertaForQuestionAnswering (from BSC-TeMU) author: John Snow Labs name: roberta_qa_BSC_TeMU_roberta_large_bne_sqac date: 2022-06-20 tags: [es, open_source, question_answering, roberta] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-bne-sqac` is a Spanish model originally trained by `BSC-TeMU`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_BSC_TeMU_roberta_large_bne_sqac_es_4.0.0_3.0_1655736117023.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_BSC_TeMU_roberta_large_bne_sqac_es_4.0.0_3.0_1655736117023.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_BSC_TeMU_roberta_large_bne_sqac","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_BSC_TeMU_roberta_large_bne_sqac","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.sqac.roberta.large").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_BSC_TeMU_roberta_large_bne_sqac| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|es| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/BSC-TeMU/roberta-large-bne-sqac - https://github.com/PlanTL-SANIDAD/lm-spanish - http://www.bne.es/en/Inicio/index.html - https://arxiv.org/abs/1907.11692 - https://arxiv.org/abs/2107.07253 --- layout: model title: English asr_wav2vec2_base_checkpoint_6 TFWav2Vec2ForCTC from jiobiala24 author: John Snow Labs name: pipeline_asr_wav2vec2_base_checkpoint_6 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_checkpoint_6` is an English model originally trained by jiobiala24. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_checkpoint_6_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_checkpoint_6_en_4.2.0_3.0_1664020811557.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_checkpoint_6_en_4.2.0_3.0_1664020811557.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_checkpoint_6', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_checkpoint_6", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_checkpoint_6| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|349.2 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Translate Ukrainian to English Pipeline author: John Snow Labs name: translate_uk_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, uk, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. - source languages: `uk` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_uk_en_xx_2.7.0_2.4_1609699258519.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_uk_en_xx_2.7.0_2.4_1609699258519.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_uk_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_uk_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.uk.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_uk_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Arabic Part of Speech Tagger (Gulf Arabic POS, Modern Standard Arabic-MSA, Dialectal Arabic-DA, and Classical Arabic-CA) author: John Snow Labs name: bert_pos_bert_base_arabic_camelbert_mix_pos_glf date: 2022-04-26 tags: [bert, pos, part_of_speech, ar, open_source] task: Part of Speech Tagging language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-mix-pos-glf` is an Arabic model originally trained by `CAMeL-Lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_mix_pos_glf_ar_3.4.2_3.0_1650993456567.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_mix_pos_glf_ar_3.4.2_3.0_1650993456567.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_mix_pos_glf","ar") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_mix_pos_glf","ar") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("أنا أحب الشرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.pos.arabic_camelbert_mix_pos_glf").predict("""أنا أحب الشرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_arabic_camelbert_mix_pos_glf| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ar| |Size:|407.4 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-mix-pos-glf - https://camel.abudhabi.nyu.edu/annotated-gumar-corpus/ - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://github.com/CAMeL-Lab/camel_tools --- layout: model title: Legal Business Day Clause Binary Classifier author: John Snow Labs name: legclf_business_day_clause date: 2022-12-07 tags: [en, legal, classification, clauses, business_day, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `business-day` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `business-day`, `other` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_business_day_clause_en_1.0.0_3.0_1670445407280.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_business_day_clause_en_1.0.0_3.0_1670445407280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_business_day_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
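The paragraph-splitting advice above can be sketched in plain Python. This is a minimal illustration, not the workshop's actual code; the `split_paragraphs` helper is hypothetical, and a whitespace token count is used as a rough stand-in for the model's real WordPiece token count:

```python
def split_paragraphs(text, max_tokens=512):
    """Split a document into paragraphs on blank lines, then greedily
    pack paragraphs so each chunk stays under a rough whitespace-token
    budget (a crude proxy for the model's 512-token embedding limit)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for p in paragraphs:
        n = len(p.split())  # rough token count, not real WordPiece tokens
        if current and count + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(p)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = "Clause one text.\n\nClause two text.\n\nClause three text."
print(split_paragraphs(doc, max_tokens=512))  # everything fits in one chunk
print(split_paragraphs(doc, max_tokens=3))    # one chunk per paragraph
```

Each resulting chunk can then be placed in its own row of the `text` column before running the classifier pipeline above.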
## Results ```bash +-------+ |result| +-------+ |[business-day]| |[other]| |[other]| |[business-day]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_business_day_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support business-day 0.90 1.00 0.95 28 other 1.00 0.96 0.98 73 accuracy - - 0.97 101 macro-avg 0.95 0.98 0.96 101 weighted-avg 0.97 0.97 0.97 101 ``` --- layout: model title: Legal Termination For Convenience Clause Binary Classifier author: John Snow Labs name: legclf_termination_for_convenience_clause date: 2023-01-27 tags: [en, legal, classification, termination, convenience, clauses, termination_for_convenience, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `termination-for-convenience` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.
If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `termination-for-convenience`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_termination_for_convenience_clause_en_1.0.0_3.0_1674819602785.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_termination_for_convenience_clause_en_1.0.0_3.0_1674819602785.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_termination_for_convenience_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[termination-for-convenience]| |[other]| |[other]| |[termination-for-convenience]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_termination_for_convenience_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 0.97 0.99 38 termination-for-convenience 0.96 1.00 0.98 24 accuracy - - 0.98 62 macro-avg 0.98 0.99 0.98 62 weighted-avg 0.98 0.98 0.98 62 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from anasaqsme) author: John Snow Labs name: distilbert_qa_anasaqsme_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `anasaqsme`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_anasaqsme_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769787475.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_anasaqsme_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769787475.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_anasaqsme_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_anasaqsme_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_anasaqsme_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anasaqsme/distilbert-base-uncased-finetuned-squad --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_6 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-32-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_6_en_4.0.0_3.0_1657185049292.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_6_en_4.0.0_3.0_1657185049292.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_6","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_32_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-32-finetuned-squad-seed-6 --- layout: model title: Sentence Embeddings - Biobert cased (MedNLI) author: John Snow Labs name: sbiobert_base_cased_mli date: 2020-11-27 task: Embeddings language: en nav_key: models edition: Healthcare NLP 2.6.4 spark_version: 2.4 tags: [embeddings, en, licensed] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model is trained to generate contextual sentence embeddings of input sentences. It has been fine-tuned on the MedNLI dataset to provide state-of-the-art performance on STS and SentEval benchmarks. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobert_base_cased_mli_en_2.6.4_2.4_1606225728763.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobert_base_cased_mli_en_2.6.4_2.4_1606225728763.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, BertSentenceEmbeddings. The output of this model can be used in tasks like NER, Classification, Entity Resolution etc.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sbiobert_embeddings = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") ``` ```scala val sbiobert_embeddings = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed_sentence.biobert.mli").predict("""Put your text here.""") ```
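For downstream tasks like entity resolution, the sentence vectors produced above are typically compared with cosine similarity. A minimal plain-Python sketch, using toy 4-dimensional vectors as stand-ins for the real 768-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length embedding vectors:
    # dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-d stand-ins for the real 768-d sentence embeddings.
v1 = [1.0, 0.0, 1.0, 0.0]
v2 = [1.0, 0.0, 1.0, 0.0]
v3 = [0.0, 1.0, 0.0, 1.0]
print(cosine_similarity(v1, v2))  # identical vectors -> 1.0
print(cosine_similarity(v1, v3))  # orthogonal vectors -> 0.0
```

In practice you would extract the `embeddings` arrays from the `sbert_embeddings` column of the transformed DataFrame and compare candidate chunks the same way.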
{:.h2_title} ## Results Gives a 768-dimensional vector representation of the sentence. {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobert_base_cased_mli| |Type:|BertSentenceEmbeddings| |Compatibility:|Spark NLP 2.6.4 +| |Edition:|Official| |License:|Licensed| |Input Labels:|[ner_chunk]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Case sensitive:|false| {:.h2_title} ## Data Source Tuned on the MedNLI dataset using BioBERT weights. --- layout: model title: Word2Vec Embeddings in Corsican (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, co, open_source] task: Embeddings language: co edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_co_3.4.1_3.0_1647291100582.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_co_3.4.1_3.0_1647291100582.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","co") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Amu Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","co") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Amu Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("co.embed.w2v_cc_300d").predict("""Amu Spark NLP""") ```
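Since this annotator produces one 300-dimensional vector per token, a common follow-up is to mean-pool the token vectors into a single sentence vector. A minimal sketch, using toy 3-dimensional vectors as stand-ins for the real 300-dimensional word embeddings:

```python
def mean_pool(token_vectors):
    # Average a list of equal-length token embeddings into one
    # sentence-level vector (component-wise mean).
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Toy 3-d stand-ins for the real 300-d word vectors.
tokens = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
print(mean_pool(tokens))  # -> [2.0, 2.0, 2.0]
```

In a Spark NLP pipeline the equivalent step is usually handled by the `SentenceEmbeddings` annotator with its pooling strategy set to `AVERAGE`.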
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|co| |Size:|57.5 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Legal Private Placement Warrants Purchase Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_private_placement_warrants_purchase_agreement_bert date: 2023-02-02 tags: [en, legal, classification, private, placement, warrants, purchase, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_private_placement_warrants_purchase_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `private-placement-warrants-purchase-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `private-placement-warrants-purchase-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_private_placement_warrants_purchase_agreement_bert_en_1.0.0_3.0_1675360216586.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_private_placement_warrants_purchase_agreement_bert_en_1.0.0_3.0_1675360216586.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_private_placement_warrants_purchase_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[private-placement-warrants-purchase-agreement]| |[other]| |[other]| |[private-placement-warrants-purchase-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_private_placement_warrants_purchase_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.1 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 1.00 1.00 1.00 99 private-placement-warrants-purchase-agreement 1.00 1.00 1.00 42 accuracy - - 1.00 141 macro-avg 1.00 1.00 1.00 141 weighted-avg 1.00 1.00 1.00 141 ``` --- layout: model title: Multilingual BertForQuestionAnswering Cased model (from roshnir) author: John Snow Labs name: bert_qa_mbert_finetuned_mlqa_vi_hi_dev date: 2022-07-07 tags: [xx, open_source, bert, question_answering] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mBert-finetuned-mlqa-dev-vi-hi` is a Multilingual model originally trained by `roshnir`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_vi_hi_dev_xx_4.0.0_3.0_1657190264031.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_vi_hi_dev_xx_4.0.0_3.0_1657190264031.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_vi_hi_dev","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_vi_hi_dev","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_mbert_finetuned_mlqa_vi_hi_dev| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|626.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/roshnir/mBert-finetuned-mlqa-dev-vi-hi --- layout: model title: Legal Survival Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_survival_bert date: 2023-03-05 tags: [en, legal, classification, clauses, survival, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from the US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Survival` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Survival`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_survival_bert_en_1.0.0_3.0_1678049964185.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_survival_bert_en_1.0.0_3.0_1678049964185.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_survival_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
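Running several of these binary clause classifiers over the same chunk yields one label per model; collapsing those labels into a clause-to-True/False map is a one-liner. A minimal sketch, where the `predictions` dict is a hypothetical stand-in for the per-model pipeline outputs:

```python
def clause_flags(predictions):
    """Map each classifier's predicted label to a boolean:
    True when the clause-specific label fired, False for 'other'."""
    return {clause: label.lower() != "other"
            for clause, label in predictions.items()}

# Hypothetical outputs from three independent binary clause classifiers.
preds = {
    "survival": "Survival",
    "business-day": "other",
    "termination-for-convenience": "termination-for-convenience",
}
print(clause_flags(preds))
# -> {'survival': True, 'business-day': False, 'termination-for-convenience': True}
```

Each classifier would write to its own output column in the pipeline; this step just gathers those columns into one summary per document chunk.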
## Results ```bash +-------+ |result| +-------+ |[Survival]| |[Other]| |[Other]| |[Survival]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_survival_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.97 0.99 0.98 206 Survival 0.98 0.96 0.97 183 accuracy - - 0.97 389 macro-avg 0.97 0.97 0.97 389 weighted-avg 0.97 0.97 0.97 389 ``` --- layout: model title: Arabic Bert Embeddings (Base, Arabert Model, v2) author: John Snow Labs name: bert_embeddings_bert_base_arabertv2 date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabertv2` is an Arabic model originally trained by `aubmindlab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabertv2_ar_3.4.2_3.0_1649677188338.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabertv2_ar_3.4.2_3.0_1649677188338.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabertv2","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabertv2","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.bert_base_arabertv2").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_arabertv2| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|507.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/aubmindlab/bert-base-arabertv2 - https://github.com/google-research/bert - https://arxiv.org/abs/2003.00104 - https://github.com/WissamAntoun/pydata_khobar_meetup - http://alt.qcri.org/farasa/segmenter.html - https://github.com/google-research/bert/blob/master/multilingual.md - https://github.com/elnagara/HARD-Arabic-Dataset - https://www.aclweb.org/anthology/D15-1299 - https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf - https://github.com/mohamedadaly/LABR - http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp - https://github.com/husseinmozannar/SOQAL - https://github.com/aub-mind/arabert/blob/master/AraBERT/README.md - https://arxiv.org/abs/2003.00104v2 - https://archive.org/details/arwiki-20190201 - https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4 - https://www.aclweb.org/anthology/W19-4619 - https://sites.aub.edu.lb/mindlab/ - https://www.yakshof.com/#/ - https://www.behance.net/rahalhabib - https://www.linkedin.com/in/wissam-antoun-622142b4/ - https://twitter.com/wissam_antoun - https://github.com/WissamAntoun - https://www.linkedin.com/in/fadybaly/ - https://twitter.com/fadybaly - https://github.com/fadybaly --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_el16_dl4 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-el16-dl4` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el16_dl4_en_4.3.0_3.0_1675119722159.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el16_dl4_en_4.3.0_3.0_1675119722159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_small_el16_dl4","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_small_el16_dl4","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_small_el16_dl4|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|192.1 MB|

## References

- https://huggingface.co/google/t5-efficient-small-el16-dl4
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145

---
layout: model
title: German XLMRobertaForTokenClassification Base Cased model (from evs)
author: John Snow Labs
name: xlmroberta_ner_evs_base_finetuned_panx
date: 2022-08-13
tags: [de, open_source, xlm_roberta, ner]
task: Named Entity Recognition
language: de
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `evs`.

## Predicted Entities

`PER`, `LOC`, `ORG`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_evs_base_finetuned_panx_de_4.1.0_3.0_1660432742289.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_evs_base_finetuned_panx_de_4.1.0_3.0_1660432742289.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_evs_base_finetuned_panx","de") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_evs_base_finetuned_panx","de")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("document", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_evs_base_finetuned_panx|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|de|
|Size:|854.5 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/evs/xlm-roberta-base-finetuned-panx-de
- https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme

---
layout: model
title: English image_classifier_vit_rock_challenge_DeiT_solo_2 ViTForImageClassification from dimbyTa
author: John Snow Labs
name: image_classifier_vit_rock_challenge_DeiT_solo_2
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rock_challenge_DeiT_solo_2` is an English model originally trained by dimbyTa.

## Predicted Entities

`fines`, `large`, `medium`, `pellets`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rock_challenge_DeiT_solo_2_en_4.1.0_3.0_1660170420119.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rock_challenge_DeiT_solo_2_en_4.1.0_3.0_1660170420119.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_rock_challenge_DeiT_solo_2", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_rock_challenge_DeiT_solo_2", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_rock_challenge_DeiT_solo_2|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|81.8 MB|

---
layout: model
title: Legal Inspection Clause Binary Classifier
author: John Snow Labs
name: legclf_inspection_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `inspection` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.

If you have big legal documents, and you want to look for clauses, we recommend you split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
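The paragraph-splitting recommendation above can be sketched in a few lines of plain Python. This is a minimal illustration only: the function name, the blank-line paragraph convention, and the whitespace-based token count are assumptions for the sketch, not part of Spark NLP or the workshop notebook (a real pipeline should count wordpiece tokens).

```python
# Hypothetical helper: split a long legal document into clause-sized
# candidates before sending each one through the classifier pipeline.
def split_into_clause_candidates(text, max_tokens=512):
    """Split on blank lines and keep only chunks that fit the
    ~512-token budget of the sentence embeddings (whitespace count
    is a rough proxy for the real wordpiece count)."""
    chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [c for c in chunks if len(c.split()) <= max_tokens]

doc = ("Seller grants Buyer the right to inspect the goods.\n\n"
       "This Agreement shall be governed by the laws of Delaware.")
candidates = split_into_clause_candidates(doc)
# Each candidate can then be classified as a separate row.
```

Each candidate string would become one row of the `clause_text` column consumed by the pipeline below.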
## Predicted Entities `other`, `inspection` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_inspection_clause_en_1.0.0_3.2_1660123616387.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_inspection_clause_en_1.0.0_3.2_1660123616387.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_inspection_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+------------+
|      result|
+------------+
|[inspection]|
|     [other]|
|     [other]|
|[inspection]|
+------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_inspection_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
  inspection       0.89    0.89      0.89       27
       other       0.95    0.95      0.95       64
    accuracy          -       -      0.93       91
   macro-avg       0.92    0.92      0.92       91
weighted-avg       0.93    0.93      0.93       91
```

---
layout: model
title: Medical Question Answering (flan_t5_base_jsl_qa)
author: John Snow Labs
name: flan_t5_base_jsl_qa
date: 2023-05-15
tags: [licensed, clinical, en, qa, question_answering, flan_t5, tensorflow]
task: Question Answering
language: en
edition: Healthcare NLP 4.4.2
spark_version: 3.0
supported: true
engine: tensorflow
annotator: MedicalQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The flan_t5_base_jsl_qa model is designed to work seamlessly with the MedicalQuestionAnswering annotator. This model provides a powerful and efficient solution for accurately answering medical questions and delivering insightful information in the medical domain.

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/flan_t5_base_jsl_qa_en_4.4.2_3.0_1684180120739.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/flan_t5_base_jsl_qa_en_4.4.2_3.0_1684180120739.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler()\
    .setInputCols("question", "context")\
    .setOutputCols("document_question", "document_context")

med_qa = MedicalQuestionAnswering.pretrained("flan_t5_base_jsl_qa","en","clinical/models")\
    .setInputCols(["document_question", "document_context"])\
    .setCustomPrompt("{DOCUMENT} {QUESTION}")\
    .setMaxNewTokens(50)\
    .setOutputCol("answer")

pipeline = Pipeline(stages=[document_assembler, med_qa])

# doi: 10.3758/s13414-011-0157-z.
paper_abstract = "The visual indexing theory proposed by Zenon Pylyshyn (Cognition, 32, 65-97, 1989) predicts that visual attention mechanisms are employed when mental images are projected onto a visual scene."

long_question = "What is the effect of directing attention on memory?"

data = spark.createDataFrame([[long_question, paper_abstract]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val med_qa = MedicalQuestionAnswering.pretrained("flan_t5_base_jsl_qa", "en", "clinical/models")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setMaxNewTokens(50)
    .setCustomPrompt("{DOCUMENT} {QUESTION}")

val pipeline = new Pipeline().setStages(Array(document_assembler, med_qa))

val paper_abstract = "The visual indexing theory proposed by Zenon Pylyshyn (Cognition, 32, 65–97, 1989) predicts that visual attention mechanisms are employed when mental images are projected onto a visual scene. Recent eye-tracking studies have supported this hypothesis by showing that people tend to look at empty places where requested information has been previously presented. However, it has remained unclear to what extent this behavior is related to memory performance. The aim of the present study was to explore whether the manipulation of spatial attention can facilitate memory retrieval. In two experiments, participants were asked first to memorize a set of four objects and then to determine whether a probe word referred to any of the objects. The results of both experiments indicate that memory accuracy is not affected by the current focus of attention and that all the effects of directing attention to specific locations on response times can be explained in terms of stimulus–stimulus and stimulus–response spatial compatibility."

val long_question = "What is the effect of directing attention on memory?"
val yes_no_question = "Does directing attention improve memory for items?"

val data = Seq((long_question, paper_abstract)).toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
## Results ```bash +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |result | +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |[The effect of directing attention on memory is that it can help to improve memory retention and recall. It can help to reduce the amount of time spent on tasks, such as focusing on one task at a time, or focusing on ]| +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|flan_t5_base_jsl_qa| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|920.8 MB| |Case sensitive:|true| --- layout: model title: English DistilBertForQuestionAnswering model (from ericRosello) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_squad_frozen_v2 date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad-frozen-v2` is a English model originally trained by `ericRosello`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad_frozen_v2_en_4.0.0_3.0_1654726653762.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad_frozen_v2_en_4.0.0_3.0_1654726653762.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_frozen_v2","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_frozen_v2","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased_v2.by_ericRosello").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_finetuned_squad_frozen_v2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/ericRosello/distilbert-base-uncased-finetuned-squad-frozen-v2

---
layout: model
title: English BertForQuestionAnswering model (from mrm8488)
author: John Snow Labs
name: bert_qa_bert_medium_finetuned_squadv2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-medium-finetuned-squadv2` is an English model originally trained by `mrm8488`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_medium_finetuned_squadv2_en_4.0.0_3.0_1654183717880.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_medium_finetuned_squadv2_en_4.0.0_3.0_1654183717880.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_medium_finetuned_squadv2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_medium_finetuned_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.medium_v2").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_medium_finetuned_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|154.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/bert-medium-finetuned-squadv2 - https://twitter.com/mrm8488 - https://github.com/google-research - https://arxiv.org/abs/1908.08962 - https://rajpurkar.github.io/SQuAD-explorer/ - https://github.com/google-research/bert/ - https://www.linkedin.com/in/manuel-romero-cs/ --- layout: model title: Resolver Company Names to Tickers author: John Snow Labs name: finel_names2tickers date: 2022-09-08 tags: [en, finance, companies, tickers, nasdaq, licensed] task: Entity Resolution language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an Entity Resolution / Entity Linking model, which is able to provide Ticker / Trading Symbols using a Company Name as an input. You can use any NER which extracts Organizations / Companies / Parties to then send the output to this Entity Linking model and get the Ticker / Trading Symbol (given the company has one). ## Predicted Entities {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/financial_company_normalization){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finel_names2tickers_en_1.0.0_3.2_1662636940877.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finel_names2tickers_en_1.0.0_3.2_1662636940877.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("ner_chunk")

embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
    .setInputCols("ner_chunk") \
    .setOutputCol("sentence_embeddings")

resolver = finance.SentenceEntityResolverModel.pretrained("finel_names2tickers", "en", "finance/models") \
    .setInputCols(["ner_chunk", "sentence_embeddings"]) \
    .setOutputCol("name")\
    .setDistanceFunction("EUCLIDEAN")

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    embeddings,
    resolver])

# Fit on an empty DataFrame so the pipeline can be wrapped in a LightPipeline
empty_df = spark.createDataFrame([[""]]).toDF("text")
pipelineModel = pipeline.fit(empty_df)

lp = nlp.LightPipeline(pipelineModel)

lp.fullAnnotate("apple")
```
## Results ```bash +--------+---------+----------------------------------------------+-------------------------+-----------------------------------+ | chunk| code | all_codes| resolutions | all_distances| +--------+---------+----------------------------------------------+-------------------------+-----------------------------------+ | apple | Apple | [Apple, Apple inc., Apple INC. , Apple Inc.] |[AAPL, AAPL, AAPL, AAPL] | [0.0000, 0.1093, 0.1093, 0.1093] | +--------+---------+----------------------------------------------+-------------------------+-----------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finel_names2tickers| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[company_ticker]| |Language:|en| |Size:|20.3 MB| |Case sensitive:|false| ## References https://data.world/johnsnowlabs/list-of-companies-in-nasdaq-exchanges --- layout: model title: Translate English to Chinyanja Pipeline author: John Snow Labs name: translate_en_ny date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, ny, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `en`
- target languages: `ny`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ny_xx_2.7.0_2.4_1609689292026.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ny_xx_2.7.0_2.4_1609689292026.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ny", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ny", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ny').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_ny| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate Ilocano to English Pipeline author: John Snow Labs name: translate_ilo_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ilo, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module especially on larger sequence. The use of an accelerator such as GPU is recommended. - source languages: `ilo` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ilo_en_xx_2.7.0_2.4_1609688854197.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ilo_en_xx_2.7.0_2.4_1609688854197.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ilo_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ilo_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ilo.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_ilo_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from shila)
author: John Snow Labs
name: distilbert_qa_shila_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `shila`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_shila_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772638325.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_shila_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772638325.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_shila_base_uncased_finetuned_squad","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_shila_base_uncased_finetuned_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_shila_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/shila/distilbert-base-uncased-finetuned-squad --- layout: model title: Side Effect Classifier (BioBERT) author: John Snow Labs name: bert_sequence_classifier_vop_side_effect date: 2023-05-24 tags: [en, clinical, licensed, classifier, side_effect, adverse_event, vop, tensorflow] task: Text Classification language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true engine: tensorflow annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [BioBERT based](https://github.com/dmis-lab/biobert) classifier that classifies texts written by patients as True if side effects from treatments or procedures are mentioned. ## Predicted Entities `True`, `False` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_side_effect_en_4.4.2_3.0_1684930827651.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_side_effect_en_4.4.2_3.0_1684930827651.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_vop_side_effect", "en", "clinical/models")\ .setInputCols(["document",'token'])\ .setOutputCol("prediction") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame(["I felt kind of dizzy after taking that medication for a month.", "I had a dental procedure last week and everything went well."], StringType()).toDF("text") result = pipeline.fit(data).transform(data) # Checking results result.select("text", "prediction.result").show(truncate=False) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_vop_side_effect", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("prediction") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq("I felt kind of dizzy after taking that medication for a month.", "I had a dental procedure last week and everything went well.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
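Internally, a sequence classifier like this one reduces the per-document logits from the BioBERT head to a single `True`/`False` label via softmax and argmax. A minimal, framework-free sketch of that final step (the logit values and label order below are invented for illustration, not taken from this model):

```python
import math

# Hypothetical label order and raw logits for two documents;
# a real run would get these from the fine-tuned classification head.
labels = ["False", "True"]
logits = [[-1.2, 2.3],   # "I felt kind of dizzy ..."
          [3.1, -0.7]]   # "I had a dental procedure ..."

def softmax(xs):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

predictions = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(labels)), key=lambda i: probs[i])
    predictions.append(labels[best])

print(predictions)  # ['True', 'False'] for these example logits
```

The `prediction.result` column shown above holds exactly this kind of argmax label, one per input document.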
## Results ```bash +--------------------------------------------------------------+-------+ |text |result | +--------------------------------------------------------------+-------+ |I felt kind of dizzy after taking that medication for a month.|[True] | |I had a dental procedure last week and everything went well. |[False]| +--------------------------------------------------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_vop_side_effect| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I”m 20 year old girl. I”m diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I”m taking weekly supplement of vitamin D and 1000 mcg b12 daily. I”m taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I”m facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. 
So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. ## Benchmarking ```bash label precision recall f1-score support False 0.881423 0.822878 0.851145 271 True 0.717647 0.802632 0.757764 152 accuracy - - 0.815603 423 macro-avg 0.799535 0.812755 0.804455 423 weighted-avg 0.822572 0.815603 0.817590 423 ``` --- layout: model title: Chinese Bert Embeddings (from SIKU-BERT) author: John Snow Labs name: bert_embeddings_sikuroberta date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `sikuroberta` is a Chinese model originally trained by `SIKU-BERT`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_sikuroberta_zh_3.4.2_3.0_1649669282915.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_sikuroberta_zh_3.4.2_3.0_1649669282915.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_sikuroberta","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_sikuroberta","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.sikuroberta").predict("""I love Spark NLP""") ```
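The `embeddings` column produced above holds one dense vector per token. A common way to compare tokens (or pooled sentences) is cosine similarity; a self-contained sketch using toy low-dimensional vectors in place of the model's real token embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy 4-dimensional vectors standing in for the model's token embeddings.
vec_a = [0.2, 0.1, 0.9, 0.3]
vec_b = [0.1, 0.0, 0.8, 0.4]
vec_c = [-0.7, 0.5, -0.1, 0.0]

# Tokens with similar contexts should score closer to 1.0.
print(cosine(vec_a, vec_b) > cosine(vec_a, vec_c))  # True
```

In practice you would pull the `embeddings` arrays out of the result DataFrame and apply the same function to the real vectors.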
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_sikuroberta| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|408.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/SIKU-BERT/sikuroberta - https://github.com/SIKU-BERT/SikuBERT-for-digital-humanities-and-classical-Chinese-information-processing --- layout: model title: English RobertaForQuestionAnswering Cased model (from skandaonsolve) author: John Snow Labs name: roberta_qa_finetuned_state date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-finetuned-state` is an English model originally trained by `skandaonsolve`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_state_en_4.3.0_3.0_1674220497335.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_state_en_4.3.0_3.0_1674220497335.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_state","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_state","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
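Extractive QA models like this one score each context token as a possible answer start and answer end; the returned `answer` is the highest-scoring valid span. A toy, framework-free sketch of that span-selection step (the tokens and scores below are invented for illustration):

```python
# Hypothetical context tokens with per-token start/end scores;
# a real model produces one start and one end logit per sub-word token.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.2, 0.3, 4.1, 0.1, 0.0, 0.2, 0.1, 0.3, 0.0]
end_scores   = [0.0, 0.1, 0.2, 3.9, 0.2, 0.1, 0.1, 0.2, 0.4, 0.1]

def best_span(starts, ends, max_len=8):
    """Return (i, j) maximizing starts[i] + ends[j] with i <= j < i + max_len."""
    best, best_score = (0, 0), float("-inf")
    for i in range(len(starts)):
        for j in range(i, min(i + max_len, len(ends))):
            score = starts[i] + ends[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

i, j = best_span(start_scores, end_scores)
print(" ".join(tokens[i:j + 1]))  # Clara
```

The `max_len` cap mirrors the common practice of bounding answer length so a stray high end score far from the start cannot produce an implausibly long span.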
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_finetuned_state| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/skandaonsolve/roberta-finetuned-state --- layout: model title: Hebrew BertForTokenClassification Cased model (from tokeron) author: John Snow Labs name: bert_token_classifier_aleph_finetuned_metaphor_detection date: 2022-11-30 tags: [he, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: he edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `alephbert-finetuned-metaphor-detection` is a Hebrew model originally trained by `tokeron`. ## Predicted Entities `metaphor` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_aleph_finetuned_metaphor_detection_he_4.2.4_3.0_1669814114385.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_aleph_finetuned_metaphor_detection_he_4.2.4_3.0_1669814114385.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_aleph_finetuned_metaphor_detection","he") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_aleph_finetuned_metaphor_detection","he") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
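The `ner` output assigns one label per token; downstream, IOB-style labels are usually merged into entity chunks (in Spark NLP this is typically done with a NerConverter stage). A rough, self-contained sketch of that merging logic, using hypothetical English tokens and tags for readability:

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB tags (B-X, I-X, O) into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the previous chunk
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)  # continue the open chunk
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

# Hypothetical tagging for a metaphor-detection run.
tokens = ["Time", "is", "a", "thief", "tonight"]
tags = ["O", "O", "B-metaphor", "I-metaphor", "O"]
print(iob_to_chunks(tokens, tags))  # [('a thief', 'metaphor')]
```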
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_aleph_finetuned_metaphor_detection| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|he| |Size:|470.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/tokeron/alephbert-finetuned-metaphor-detection --- layout: model title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011) author: John Snow Labs name: distilbert_token_classifier_autotrain_company_all_903429548 date: 2023-03-14 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-company_all-903429548` is an English model originally trained by `ismail-lucifer011`. ## Predicted Entities `Company`, `OOV` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_company_all_903429548_en_4.3.1_3.0_1678783401900.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_company_all_903429548_en_4.3.1_3.0_1678783401900.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_company_all_903429548","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_company_all_903429548","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_company_all_903429548| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ismail-lucifer011/autotrain-company_all-903429548 --- layout: model title: Pipeline to Detect Adverse Drug Events author: John Snow Labs name: bert_token_classifier_ner_ade_pipeline date: 2022-03-21 tags: [licensed, ner, ade, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [bert_token_classifier_ner_ade](https://nlp.johnsnowlabs.com/2022/01/04/bert_token_classifier_ner_ade_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ade_pipeline_en_3.4.1_3.0_1647862348142.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ade_pipeline_en_3.4.1_3.0_1647862348142.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("bert_token_classifier_ner_ade_pipeline", "en", "clinical/models") pipeline.annotate("Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.") ``` ```scala val pipeline = new PretrainedPipeline("bert_token_classifier_ner_ade_pipeline", "en", "clinical/models") pipeline.annotate("Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ade_pipeline").predict("""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.""") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |Lipitor |DRUG | |severe fatigue|ADE | |voltaren |DRUG | |cramps |ADE | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_ade_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverter --- layout: model title: English asr_wav2vec2_base_20sec_timit_and_dementiabank TFWav2Vec2ForCTC from shields author: John Snow Labs name: pipeline_asr_wav2vec2_base_20sec_timit_and_dementiabank date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_20sec_timit_and_dementiabank` is an English model originally trained by shields. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_20sec_timit_and_dementiabank_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_20sec_timit_and_dementiabank_en_4.2.0_3.0_1664113473996.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_20sec_timit_and_dementiabank_en_4.2.0_3.0_1664113473996.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_20sec_timit_and_dementiabank', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_20sec_timit_and_dementiabank", lang = "en") val annotations = pipeline.transform(audioDF) ```
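Transcriptions produced by an ASR pipeline like this are conventionally evaluated with word error rate (WER): the word-level edit distance between hypothesis and reference, divided by the reference length. A small self-contained implementation for checking outputs (not part of the pipeline itself):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / len(ref)

# One dropped word out of six -> WER of 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```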
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_20sec_timit_and_dementiabank| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|354.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal English Bert Embeddings (Base, Uncased) author: John Snow Labs name: bert_embeddings_legal_bert_base_uncased date: 2022-04-11 tags: [bert, embeddings, en, open_source, legal] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Legal Pretrained Bert Embeddings model, trained with uncased text, uploaded to Hugging Face, adapted and imported into Spark NLP. `legal-bert-base-uncased` is an English model originally trained by `nlpaueb`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_legal_bert_base_uncased_en_3.4.2_3.0_1649676015756.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_legal_bert_base_uncased_en_3.4.2_3.0_1649676015756.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_legal_bert_base_uncased","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_legal_bert_base_uncased","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.legal_bert_base_uncased").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_legal_bert_base_uncased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|410.1 MB| |Case sensitive:|false| ## References - https://huggingface.co/nlpaueb/legal-bert-base-uncased - https://aclanthology.org/2020.findings-emnlp.261/ - https://eur-lex.europa.eu/ - https://www.legislation.gov.uk/ - https://case.law/ - https://www.sec.gov/edgar.shtml - https://archive.org/details/legal_bert_fp - http://nlp.cs.aueb.gr/ --- layout: model title: English T5ForConditionalGeneration Large Cased model (from google) author: John Snow Labs name: t5_efficient_large_dl6 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-large-dl6` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_large_dl6_en_4.3.0_3.0_1675114965992.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_large_dl6_en_4.3.0_3.0_1675114965992.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_large_dl6","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_large_dl6","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_large_dl6| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|899.5 MB| ## References - https://huggingface.co/google/t5-efficient-large-dl6 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Universal Sentence Encoder Multilingual (tfhub_use_multi) author: John Snow Labs name: tfhub_use_multi date: 2021-05-06 tags: [xx, open_source, embeddings] deprecated: true task: Embeddings language: xx edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: UniversalSentenceEncoder article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks. The model is trained and optimized for greater-than-word length text, such as sentences, phrases, or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable-length text and the output is a 512-dimensional vector. The universal-sentence-encoder model was trained with a deep averaging network (DAN) encoder. This model supports text in 16 languages: Arabic, Chinese-Simplified, Chinese-Traditional, English, French, German, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish, Thai, Turkish, and Russian. The details are described in the paper "[Multilingual Universal Sentence Encoder for Semantic Retrieval](https://arxiv.org/abs/1907.04307)". 
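The deep averaging network (DAN) idea is simple: average the input token embeddings element-wise, then pass that average through feed-forward layers to obtain a fixed-size sentence vector. A toy illustration of the averaging step (embedding values are invented, and the real encoder works in 512 dimensions with learned sub-word embeddings):

```python
# Toy 3-dimensional token embeddings standing in for the model's real ones.
token_embeddings = [
    [0.2, 0.4, 0.1],   # "I"
    [0.6, 0.0, 0.3],   # "love"
    [0.1, 0.5, 0.2],   # "NLP"
]

def average_pool(vectors):
    """DAN first step: element-wise mean of the token vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

# In the real model, feed-forward layers are applied on top of this mean.
sentence_vector = average_pool(token_embeddings)
print(sentence_vector)
```

Because the encoder averages first, its cost grows only linearly with sentence length, which is part of why DAN-based encoders are fast at inference time.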
Note: This model only works on Linux and macOS operating systems and is not compatible with Windows due to the incompatibility of the SentencePiece library. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/tfhub_use_multi_xx_3.0.0_3.0_1620291657399.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/tfhub_use_multi_xx_3.0.0_3.0_1620291657399.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_multi", "xx") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") ``` ```scala val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_multi", "xx") .setInputCols("document") .setOutputCol("sentence_embeddings") ``` {:.nlu-block} ```python import nlu text = ["I love NLP", "Me encanta usar SparkNLP"] embeddings_df = nlu.load('xx.use.multi').predict(text, output_level='sentence') embeddings_df ```
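The STS benchmark figures reported for this model are Pearson correlations between model similarity scores and human judgments. For reference, the coefficient is computed like this (the score lists below are invented for illustration):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented cosine similarities vs. human STS gold scores (0-5 scale).
model_scores = [0.92, 0.35, 0.71, 0.10, 0.55]
gold_scores  = [4.8, 1.5, 3.9, 0.4, 2.7]
print(round(pearson(model_scores, gold_scores), 3))
```

A coefficient near 1.0 means the model's similarity ranking closely tracks the human one; the reported 0.829/0.809 dev/test values are in that regime.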
## Results ```bash Each sentence is encoded as a 512-dimensional embedding vector. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|tfhub_use_multi| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|xx| ## Benchmarking ```bash - We apply this model to the STS benchmark for semantic similarity. STSBenchmark | dev | test | -----------------------------------|--------|-------| Correlation coefficient of Pearson | 0.829 | 0.809 | - For semantic similarity retrieval, we evaluate the model on the [Quora and AskUbuntu retrieval tasks](https://arxiv.org/abs/1811.08008). Results are shown below: Dataset | Quora | AskUbuntu | Average | -----------------------|-------|-----------|---------| Mean Average Precision | 89.2 | 39.9 | 64.6 | - For translation pair retrieval, we evaluate the model on the United Nations Parallel Corpus. Results are shown below: Language Pair | en-es | en-fr | en-ru | en-zh | ---------------|--------|-------|-------|-------| Precision@1 | 85.8 | 82.7 | 87.4 | 79.5 | ``` --- layout: model title: English BertForQuestionAnswering Cased model (from susghosh) author: John Snow Labs name: bert_qa_susghosh_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `susghosh`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_susghosh_finetuned_squad_en_4.0.0_3.0_1657186764452.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_susghosh_finetuned_squad_en_4.0.0_3.0_1657186764452.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_susghosh_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_susghosh_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
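Under the hood, span-based QA heads such as `BertForQuestionAnswering` score every candidate start and end token in the context and return the best-scoring span as the `answer` annotation. The helper below is a framework-free sketch of that decoding step (the function name and toy logits are illustrative, not Spark NLP API):

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Pick (start, end) maximizing start_logit + end_logit with start <= end.

    Mirrors the standard decoding step of extractive QA heads; real models
    emit one logit per sub-word token, these are toy values.
    """
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider spans of bounded length that end at or after the start.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best
```

The winning token indices are then mapped back to character offsets in the context to produce the answer text.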
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_susghosh_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/susghosh/bert-finetuned-squad --- layout: model title: Sentence Entity Resolver for ICD-O (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_icdo language: en nav_key: models repository: clinical/models date: 2020-11-27 task: Entity Resolution edition: Healthcare NLP 2.6.4 spark_version: 2.4 tags: [clinical,entity_resolution,en] supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model maps extracted medical entities to ICD-O codes using Bert Sentence Embeddings. Given an oncological entity found in the text (via NER models like ner_jsl), it returns top terms and resolutions along with the corresponding `Morphology` codes comprising `Histology` and `Behavior` codes. ## Predicted Entities ICD-O Codes and their normalized definition with ``sbiobert_base_cased_mli`` embeddings.
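Conceptually, the resolver is a nearest-neighbour search: the sentence embedding of the detected chunk is compared against the embeddings of all ICD-O terms, and candidate codes are ranked by the configured distance (here `EUCLIDEAN`). A minimal, framework-free sketch of that ranking step, using toy 2-d vectors in place of the real sBioBERT sentence embeddings (names are illustrative, not the actual implementation):

```python
import math

def top_k_codes(query, candidates, k=3):
    """Rank (code, vector) candidates by Euclidean distance to the query
    embedding, closest first -- a sketch of EUCLIDEAN distance resolution."""
    def euclidean(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    ranked = sorted(candidates, key=lambda cv: euclidean(query, cv[1]))
    return [code for code, _ in ranked[:k]]
```

The resolver additionally returns the resolved terms and a confidence derived from the distances, as shown in the Results table of this card.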
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icdo_en_2.6.4_2.4_1606235766320.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icdo_en_2.6.4_2.4_1606235766320.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") icdo_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icdo","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icdo_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala ... 
val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val icdo_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_icdo","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icdo_resolver)) val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.h2_title} ## Results ```bash +--------------------+-----+---+---------+------+----------+--------------------+--------------------+ | chunk|begin|end| entity| code|confidence| resolutions| codes| +--------------------+-----+---+---------+------+----------+--------------------+--------------------+ | hypertension| 68| 79| PROBLEM|8312/3| 0.3558|Renal cell carcin...|8312/3:::9964/3::...| |chronic renal ins...| 83|109| PROBLEM|9980/3| 0.5290|Refractory anemia...|9980/3:::8312/3::...| | COPD| 113|116| PROBLEM|9950/3| 0.2092|Polycythemia vera...|9950/3:::8141/3::...| | gastritis| 120|128| PROBLEM|8150/3| 0.1795|Islet cell carcin...|8150/3:::8153/3::...|
| TIA| 136|138| PROBLEM|9570/0| 0.4843|Neuroma, NOS:::Ca...|9570/0:::8692/3::...| |a non-ST elevatio...| 182|202| PROBLEM|8343/2| 0.1914|Non-invasive EFVP...|8343/2:::9150/0::...| |Guaiac positive s...| 208|229| PROBLEM|8155/3| 0.1069|Vipoma:::Myeloid ...|8155/3:::9930/3::...| |cardiac catheteri...| 295|317| TEST|8045/3| 0.1144|Combined small ce...|8045/3:::9705/3::...| | PTCA| 324|327|TREATMENT|9735/3| 0.0924|Plasmablastic lym...|9735/3:::9365/3::...| | mid LAD lesion| 332|345| PROBLEM|9383/1| 0.0845|Subependymoma:::D...|9383/1:::8806/3::...| +--------------------+-----+---+---------+------+----------+--------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---------------|---------------------| | Name: | sbiobertresolve_icdo | | Type: | SentenceEntityResolverModel | | Compatibility: | Spark NLP 2.6.4 + | | License: | Licensed | | Edition: | Official | |Input labels: | [ner_chunk, chunk_embeddings] | |Output labels: | [resolution] | | Language: | en | | Dependencies: | sbiobert_base_cased_mli | {:.h2_title} ## Data Source Trained on the ICD-O Histology Behaviour dataset with ``sbiobert_base_cased_mli`` sentence embeddings. https://apps.who.int/iris/bitstream/handle/10665/96612/9789241548496_eng.pdf --- layout: model title: English BertForQuestionAnswering model (from krinal214) author: John Snow Labs name: bert_qa_bert_all_squad_ben_tel_context date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-all-squad_ben_tel_context` is an English model originally trained by `krinal214`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_all_squad_ben_tel_context_en_4.0.0_3.0_1654179514602.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_all_squad_ben_tel_context_en_4.0.0_3.0_1654179514602.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_all_squad_ben_tel_context","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_all_squad_ben_tel_context","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad_ben_tel.bert.by_krinal214").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_all_squad_ben_tel_context| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/krinal214/bert-all-squad_ben_tel_context --- layout: model title: Legal Enforcement Clause Binary Classifier author: John Snow Labs name: legclf_enforcement_clause date: 2022-09-28 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `enforcement` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `enforcement` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_enforcement_clause_en_1.0.0_3.0_1664363133652.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_enforcement_clause_en_1.0.0_3.0_1664363133652.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_enforcement_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
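As the description notes, several binary clause classifiers can be run over the same text and their outputs folded into one True/False map per clause. A small post-processing sketch of that folding step (the function name and the dict-shaped input are illustrative, not Spark NLP output):

```python
def combine_clause_flags(predictions):
    """Turn per-classifier labels into boolean clause flags.

    `predictions` maps a clause name to the label its binary classifier
    produced for the document; a clause is flagged True when its
    classifier answered anything other than "other".
    """
    return {clause: label != "other" for clause, label in predictions.items()}
```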
## Results ```bash +-------------+ | result| +-------------+ |[enforcement]| |[other]| |[other]| |[enforcement]| +-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_enforcement_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support enforcement 0.83 0.72 0.77 40 other 0.88 0.93 0.91 88 accuracy - - 0.87 128 macro-avg 0.86 0.83 0.84 128 weighted-avg 0.87 0.87 0.86 128 ``` --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_2 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-32-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_2_en_4.0.0_3.0_1655732552413.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_2_en_4.0.0_3.0_1655732552413.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_32d_seed_2").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|417.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-32-finetuned-squad-seed-2 --- layout: model title: Fast Neural Machine Translation Model from English to Mon-Khmer Languages author: John Snow Labs name: opus_mt_en_mkh date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, mkh, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `mkh` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_mkh_xx_2.7.0_2.4_1609164391064.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_mkh_xx_2.7.0_2.4_1609164391064.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_mkh", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_mkh", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.mkh').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_mkh| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Word2Vec Embeddings in Maltese (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, mt, open_source] task: Embeddings language: mt edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_mt_3.4.1_3.0_1647445636747.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_mt_3.4.1_3.0_1647445636747.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","mt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Inħobb Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","mt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Inħobb Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("mt.embed.w2v_cc_300d").predict("""Inħobb Spark NLP""") ```
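Since this is a lookup-style embedding and the model is not case sensitive, retrieval conceptually reduces to a table lookup on the lowercased token, with out-of-vocabulary tokens falling back to a zero vector (Spark NLP's default for unmatched tokens). A toy sketch with a 3-dimensional table standing in for the real 300-dimensional model (names are illustrative):

```python
def lookup_embedding(token, table, dim=3):
    """Case-insensitive embedding lookup with a zero-vector OOV fallback.

    `table` is a toy stand-in for the model's token -> vector store;
    the real model maps tokens to 300-dimensional vectors.
    """
    return table.get(token.lower(), [0.0] * dim)
```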
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|mt| |Size:|113.8 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Japanese BertForMaskedLM Large Cased model (from cl-tohoku) author: John Snow Labs name: bert_embeddings_large_japanese_char date: 2022-12-02 tags: [ja, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ja edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-japanese-char` is a Japanese model originally trained by `cl-tohoku`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_japanese_char_ja_4.2.4_3.0_1670020339308.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_japanese_char_ja_4.2.4_3.0_1670020339308.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_japanese_char","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_japanese_char","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_large_japanese_char| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Size:|1.2 GB| |Case sensitive:|true| ## References - https://huggingface.co/cl-tohoku/bert-large-japanese-char - https://github.com/google-research/bert - https://pypi.org/project/unidic-lite/ - https://github.com/cl-tohoku/bert-japanese/tree/v2.0 - https://taku910.github.io/mecab/ - https://github.com/neologd/mecab-ipadic-neologd - https://github.com/polm/fugashi - https://github.com/polm/unidic-lite - https://www.tensorflow.org/tfrc/ - https://creativecommons.org/licenses/by-sa/3.0/ --- layout: model title: Legal Tenant Improvement (TI) Allowance Clause Binary Classifier author: John Snow Labs name: legclf_ti_allowance_clause date: 2022-12-07 tags: [en, legal, classification, clauses, ti_allowance, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `ti-allowance` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.
If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `ti-allowance`, `other` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_ti_allowance_clause_en_1.0.0_3.0_1670445457826.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_ti_allowance_clause_en_1.0.0_3.0_1670445457826.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_ti_allowance_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +--------------+ | result| +--------------+ |[ti-allowance]| |[other]| |[other]| |[ti-allowance]| +--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_ti_allowance_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.99 1.00 1.00 246 ti-allowance 1.00 0.98 0.99 99 accuracy - - 0.99 345 macro-avg 1.00 0.99 0.99 345 weighted-avg 0.99 0.99 0.99 345 ``` --- layout: model title: Detect PHI for Deidentification purposes (Spanish, reduced entities) author: John Snow Labs name: ner_deid_generic date: 2022-01-18 tags: [deid, es, licensed] task: De-identification language: es edition: Healthcare NLP 3.3.4 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Spanish) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 7 entities. This NER model is trained with a combination of custom datasets, the Spanish CoNLL 2002 corpus, the MeddoProf dataset and several data augmentation mechanisms, and is a reduced version of `ner_deid_subentity`.
## Predicted Entities `CONTACT`, `NAME`, `DATE`, `ID`, `LOCATION`, `PROFESSION`, `AGE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_ES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_es_3.3.4_3.0_1642528473168.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_es_3.3.4_3.0_1642528473168.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("word_embeddings") clinical_ner = medical.NerModel.pretrained("ner_deid_generic", "es", "clinical/models")\ .setInputCols(["sentence","token","word_embeddings"])\ .setOutputCol("ner") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner]) text = [''' Antonio Pérez Juan, nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14/03/2020 y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. 
'''] df = spark.createDataFrame([text]).toDF("text") results = nlpPipeline.fit(df).transform(df) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("word_embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic", "es", "clinical/models") .setInputCols(Array("sentence","token","word_embeddings")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner)) val text = """Antonio Pérez Juan, nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14/03/2020 y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.""" val df = Seq(text).toDS.toDF("text") val results = pipeline.fit(df).transform(df) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.deid.generic").predict(""" Antonio Pérez Juan, nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14/03/2020 y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. """) ```
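The pipeline above stops at token-level BIO tags, which is exactly what the results below show. In a Spark NLP pipeline these tags are typically grouped into entity chunks by appending a `NerConverter` stage; as a plain-Python illustration of that grouping logic (a sketch only, independent of Spark NLP), a helper might look like:

```python
# Illustration (plain Python, no Spark needed) of how token-level BIO tags
# are grouped into entity chunks -- the job a NerConverter stage performs.
def bio_to_chunks(tokens, tags):
    """Collapse (token, BIO-tag) pairs into (entity_text, label) chunks."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # a previous chunk is still open; close it
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)  # continue the open chunk
        else:  # an "O" tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Antonio", "Pérez", "Juan", ",", "nacido", "en", "Cadiz"]
tags = ["B-NAME", "I-NAME", "I-NAME", "O", "O", "O", "B-LOCATION"]
print(bio_to_chunks(tokens, tags))
# [('Antonio Pérez Juan', 'NAME'), ('Cadiz', 'LOCATION')]
```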
## Results ```bash +------------+----------+ | token| ner_label| +------------+----------+ | Antonio| B-NAME| | Pérez| I-NAME| | Juan| I-NAME| | ,| O| | nacido| O| | en| O| | Cadiz|B-LOCATION| | ,| O| | España|B-LOCATION| | .| O| | Aún| O| | no| O| | estaba| O| | vacunado| O| | ,| O| | se| O| | infectó| O| | con| O| | Covid-19| O| | el| O| | dia| O| | 14/03/2020| B-DATE| | y| O| | tuvo| O| | que| O| | ir| O| | al| O| | Hospital| O| | Fue| O| | tratado| O| | con| O| | anticuerpos| O| |monoclonales| O| | en| O| | la| O| | Clinica|B-LOCATION| | San|I-LOCATION| | Carlos|I-LOCATION| | .| O| +------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, word_embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|15.0 MB| |Dependencies:|embeddings_sciwiki_300d| ## Data Source - Internal JSL annotated corpus - [Spanish CoNLL 2002](https://www.clips.uantwerpen.be/conll2002/ner/data/) - [MeddoProf](https://temu.bsc.es/meddoprof/data/) ## Benchmarking ```bash label tp fp fn total precision recall f1 CONTACT 166.0 9.0 8.0 174.0 0.9486 0.954 0.9513 NAME 2879.0 195.0 191.0 3070.0 0.9366 0.9378 0.9372 DATE 1839.0 29.0 18.0 1857.0 0.9845 0.9903 0.9874 ID 119.0 11.0 12.0 131.0 0.9154 0.9084 0.9119 LOCATION 5149.0 711.0 607.0 5756.0 0.8787 0.8945 0.8865 PROFESSION 236.0 49.0 168.0 404.0 0.8281 0.5842 0.6851 AGE 313.0 33.0 50.0 363.0 0.9046 0.8623 0.8829 macro - - - - - - 0.891749 micro - - - - - - 0.909897 ``` --- layout: model title: Translate Galician to English Pipeline author: John Snow Labs name: translate_gl_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, gl, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## 
Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially for longer sequences; the use of an accelerator such as a GPU is recommended. - source languages: `gl` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_gl_en_xx_2.7.0_2.4_1609691535926.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_gl_en_xx_2.7.0_2.4_1609691535926.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_gl_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_gl_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.gl.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_gl_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Entity Recognizer LG author: John Snow Labs name: entity_recognizer_lg date: 2022-06-25 tags: [pt, open_source] task: Named Entity Recognition language: pt edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_lg is a pretrained pipeline that performs basic text processing steps and recognizes entities. It handles most of the common text processing tasks on your dataframe. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_pt_4.0.0_3.0_1656130333737.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_pt_4.0.0_3.0_1656130333737.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("entity_recognizer_lg", "pt") result = pipeline.annotate("""I love johnsnowlabs! """) ``` {:.nlu-block} ```python import nlu nlu.load("pt.ner.lg").predict("""I love johnsnowlabs! """) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_lg| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|pt| |Size:|2.5 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - NerDLModel - NerConverter --- layout: model title: Korean BertForQuestionAnswering Cased model (from arogyaGurkha) author: John Snow Labs name: bert_qa_kobert_finetuned_squad_kor_v1 date: 2022-07-07 tags: [ko, open_source, bert, question_answering] task: Question Answering language: ko edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `kobert-finetuned-squad_kor_v1` is a Korean model originally trained by `arogyaGurkha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_kobert_finetuned_squad_kor_v1_ko_4.0.0_3.0_1657189640225.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_kobert_finetuned_squad_kor_v1_ko_4.0.0_3.0_1657189640225.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_kobert_finetuned_squad_kor_v1","ko") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_kobert_finetuned_squad_kor_v1","ko") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_kobert_finetuned_squad_kor_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ko| |Size:|343.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/arogyaGurkha/kobert-finetuned-squad_kor_v1 --- layout: model title: English image_classifier_vit_ex_for_evan ViTForImageClassification from nateraw author: John Snow Labs name: image_classifier_vit_ex_for_evan date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_ex_for_evan` is an English model originally trained by nateraw. ## Predicted Entities `cheetah`, `elephant`, `giraffe`, `rhino` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_ex_for_evan_en_4.1.0_3.0_1660168048082.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_ex_for_evan_en_4.1.0_3.0_1660168048082.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_ex_for_evan", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_ex_for_evan", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_ex_for_evan| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Translate San Salvador Kongo to English Pipeline author: John Snow Labs name: translate_kwy_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, kwy, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially for longer sequences; the use of an accelerator such as a GPU is recommended. - source languages: `kwy` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_kwy_en_xx_2.7.0_2.4_1609700686259.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_kwy_en_xx_2.7.0_2.4_1609700686259.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_kwy_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_kwy_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.kwy.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_kwy_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Hausa Named Entity Recognition (from mbeukman) author: John Snow Labs name: xlmroberta_ner_xlm_roberta_base_finetuned_hausa_finetuned_ner_hausa date: 2022-05-17 tags: [xlm_roberta, ner, token_classification, ha, open_source] task: Named Entity Recognition language: ha edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-hausa-finetuned-ner-hausa` is a Hausa model originally trained by `mbeukman`. ## Predicted Entities `PER`, `ORG`, `LOC`, `DATE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_hausa_finetuned_ner_hausa_ha_3.4.2_3.0_1652808444160.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_hausa_finetuned_ner_hausa_ha_3.4.2_3.0_1652808444160.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_hausa_finetuned_ner_hausa","ha") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ina son Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_hausa_finetuned_ner_hausa","ha") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ina son Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_xlm_roberta_base_finetuned_hausa_finetuned_ner_hausa| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ha| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-hausa-finetuned-ner-hausa - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://www.apache.org/licenses/LICENSE-2.0 - https://github.com/Michael-Beukman/N --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from jhoonk) author: John Snow Labs name: distilbert_qa_jhoonk_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `jhoonk`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_jhoonk_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771478380.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_jhoonk_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771478380.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_jhoonk_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_jhoonk_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_jhoonk_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/jhoonk/distilbert-base-uncased-finetuned-squad --- layout: model title: Financial Risk factors Item Binary Classifier author: John Snow Labs name: finclf_risk_factors_item date: 2022-08-10 tags: [en, finance, classification, 10k, annual, reports, sec, filings, licensed] task: Text Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `risk_factors` item type of 10K Annual Reports. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big financial documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Finance NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial link provided above).
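The first splitting technique mentioned above (paragraph splitting by multiline) combined with a check against the 512-token limit can be sketched in plain Python. This is an illustrative sketch only: the whitespace tokenization below is an assumption, since the actual token count depends on the embedding model's tokenizer, and real pipelines would use the workshop utilities linked above.

```python
import re

# A minimal sketch of "paragraph splitting (by multiline)" plus a rough
# length check against the 512-token embedding window mentioned above.
# Whitespace tokenization is an illustrative assumption only.
MAX_TOKENS = 512

def split_into_paragraphs(text):
    """Split a document on blank lines and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def within_limit(paragraph, max_tokens=MAX_TOKENS):
    """Rough check: does this paragraph fit the embedding window?"""
    return len(paragraph.split()) <= max_tokens

doc = "Item 1A. Risk Factors\n\nOur business is subject to...\n\nItem 2. Properties"
paragraphs = [p for p in split_into_paragraphs(doc) if within_limit(p)]
print(paragraphs)
# ['Item 1A. Risk Factors', 'Our business is subject to...', 'Item 2. Properties']
```

Each surviving paragraph can then be fed to the classifier as its own row of the input dataframe.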
## Predicted Entities `other`, `risk_factors` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_risk_factors_item_en_1.0.0_3.2_1660154465203.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_risk_factors_item_en_1.0.0_3.2_1660154465203.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") useEmbeddings = nlp.UniversalSentenceEncoder.pretrained() \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("finclf_risk_factors_item", "en", "finance/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, useEmbeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
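The precision, recall and F1 figures reported in the benchmarking sections of these cards follow the standard definitions. As a sketch, the CONTACT row of the earlier `ner_deid_generic` benchmark (tp=166, fp=9, fn=8) can be reproduced from its counts:

```python
# How per-label precision / recall / F1 are derived from true-positive,
# false-positive and false-negative counts (the tp/fp/fn columns some
# benchmarking tables report).
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 4), round(recall, 4), round(f1, 4)

# CONTACT row of the ner_deid_generic benchmark: tp=166, fp=9, fn=8
print(prf(166, 9, 8))
# (0.9486, 0.954, 0.9513)
```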
## Results ```bash +-------+ | result| +-------+ |[risk_factors]| |[other]| |[other]| |[risk_factors]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_risk_factors_item| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.6 MB| ## References Weak labelling on documents from the Edgar database ## Benchmarking ```bash label precision recall f1-score support other 0.92 0.92 0.92 1277 risk_factors 0.91 0.92 0.91 1228 accuracy - - 0.92 2505 macro-avg 0.92 0.92 0.92 2505 weighted-avg 0.92 0.92 0.92 2505 ``` --- layout: model title: Pipeline to Detect Living Species (bert_base_cased) author: John Snow Labs name: ner_living_species_bert_pipeline date: 2023-03-13 tags: [ro, ner, clinical, licensed, bert] task: Named Entity Recognition language: ro edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_living_species_bert](https://nlp.johnsnowlabs.com/2022/06/23/ner_living_species_bert_ro_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_pipeline_ro_4.3.0_3.2_1678729997910.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_pipeline_ro_4.3.0_3.2_1678729997910.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_living_species_bert_pipeline", "ro", "clinical/models") text = '''O femeie în vârstă de 26 de ani, însărcinată în 11 săptămâni, a consultat serviciul de urgențe dermatologice pentru că prezenta, de 4 zile, leziuni punctiforme dureroase de debut brusc pe vârful degetelor. Pacientul raportează că leziunile au început pe degete și ulterior s-au extins la degetele de la picioare. Markerii de imunitate, ANA și crioagglutininele, au fost negativi, iar serologia VHB a indicat doar vaccinarea. Pe baza acestor rezultate, diagnosticul de vasculită a fost exclus și, având în vedere diagnosticul suspectat de erupție cutanată cu mănuși și șosete, s-a efectuat serologia pentru virusul Ebstein Barr. Exantemă la mănuși și șosete datorat parvovirozei B19. Având în vedere suspiciunea unei afecțiuni infecțioase cu aceste caracteristici, a fost solicitată serologia pentru EBV, enterovirus și parvovirus B19, cu IgM pozitiv pentru acesta din urmă în două ocazii. De asemenea, nu au existat semne de anemie fetală sau complicații ale acesteia.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_living_species_bert_pipeline", "ro", "clinical/models") val text = "O femeie în vârstă de 26 de ani, însărcinată în 11 săptămâni, a consultat serviciul de urgențe dermatologice pentru că prezenta, de 4 zile, leziuni punctiforme dureroase de debut brusc pe vârful degetelor. Pacientul raportează că leziunile au început pe degete și ulterior s-au extins la degetele de la picioare. Markerii de imunitate, ANA și crioagglutininele, au fost negativi, iar serologia VHB a indicat doar vaccinarea. 
Pe baza acestor rezultate, diagnosticul de vasculită a fost exclus și, având în vedere diagnosticul suspectat de erupție cutanată cu mănuși și șosete, s-a efectuat serologia pentru virusul Ebstein Barr. Exantemă la mănuși și șosete datorat parvovirozei B19. Având în vedere suspiciunea unei afecțiuni infecțioase cu aceste caracteristici, a fost solicitată serologia pentru EBV, enterovirus și parvovirus B19, cu IgM pozitiv pentru acesta din urmă în două ocazii. De asemenea, nu au existat semne de anemie fetală sau complicații ale acesteia." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:---------------------|--------:|------:|:------------|-------------:| | 0 | femeie | 2 | 7 | HUMAN | 0.9998 | | 1 | Pacientul | 206 | 214 | HUMAN | 0.9993 | | 2 | VHB | 394 | 396 | SPECIES | 1 | | 3 | virusul Ebstein Barr | 606 | 625 | SPECIES | 0.996467 | | 4 | parvovirozei B19 | 665 | 680 | SPECIES | 0.5685 | | 5 | EBV | 799 | 801 | SPECIES | 0.9999 | | 6 | enterovirus | 804 | 814 | SPECIES | 0.9984 | | 7 | parvovirus B19 | 819 | 832 | SPECIES | 0.99255 | | 8 | fetală | 932 | 937 | HUMAN | 0.9994 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_bert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|ro| |Size:|483.8 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Fast Neural Machine Translation Model from Bislama to English author: John Snow Labs name: opus_mt_bi_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, bi, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.
- source languages: `bi` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bi_en_xx_2.7.0_2.4_1609167548214.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bi_en_xx_2.7.0_2.4_1609167548214.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_bi_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Your sentence to translate!"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_bi_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.bi.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_bi_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from ktrapeznikov) author: John Snow Labs name: bert_qa_biobert_v1.1_pubmed_squad_v2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobert_v1.1_pubmed_squad_v2` is an English model originally trained by `ktrapeznikov`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_v1.1_pubmed_squad_v2_en_4.0.0_3.0_1654185755500.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_biobert_v1.1_pubmed_squad_v2_en_4.0.0_3.0_1654185755500.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biobert_v1.1_pubmed_squad_v2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_biobert_v1.1_pubmed_squad_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_pubmed.biobert.v2").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
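Under the hood, span-extraction QA heads like this annotator score every context position as a potential answer start and end, then pick the highest-scoring valid pair. A simplified sketch of that selection step, using made-up scores rather than the model's real logits:

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return the (start, end) pair maximizing start+end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, ss in enumerate(start_scores):
        # Only consider answer spans of bounded length starting at s.
        for e in range(s, min(s + max_len, len(end_scores))):
            if ss + end_scores[e] > best_score:
                best, best_score = (s, e), ss + end_scores[e]
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 3.1, 0.0, 0.1, 0.0, 0.0, 0.5, 0.0]  # hypothetical logits
end = [0.0, 0.1, 0.0, 2.8, 0.2, 0.0, 0.0, 0.1, 0.9, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```

The real annotator also handles tokenization offsets and the SQuAD v2 "no answer" case; this only shows the core argmax over valid spans.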
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_biobert_v1.1_pubmed_squad_v2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|403.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ktrapeznikov/biobert_v1.1_pubmed_squad_v2 - https://rajpurkar.github.io/SQuAD-explorer/ --- layout: model title: Finnish asr_wav2vec2_base_voxpopuli_v2_finetuned TFWav2Vec2ForCTC from Finnish-NLP author: John Snow Labs name: asr_wav2vec2_base_voxpopuli_v2_finetuned date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_voxpopuli_v2_finetuned` is a Finnish model originally trained by Finnish-NLP. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_voxpopuli_v2_finetuned_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_voxpopuli_v2_finetuned_fi_4.2.0_3.0_1664038253139.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_voxpopuli_v2_finetuned_fi_4.2.0_3.0_1664038253139.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_voxpopuli_v2_finetuned", "fi")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_voxpopuli_v2_finetuned", "fi") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
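The snippets above assume an `audioDf` whose `audio_content` column holds arrays of floats. Producing such arrays is outside the card's scope, but as a rough sketch, a 16 kHz mono 16-bit WAV can be decoded into normalized floats with the Python standard library (the DataFrame wrapping itself is omitted here):

```python
import io
import struct
import wave

def wav_to_floats(fileobj):
    """Decode mono 16-bit PCM WAV data into a list of floats in [-1.0, 1.0]."""
    with wave.open(fileobj, "rb") as wf:
        assert wf.getnchannels() == 1 and wf.getsampwidth() == 2
        frames = wf.readframes(wf.getnframes())
    n = len(frames) // 2  # two bytes per 16-bit sample
    return [s / 32768.0 for s in struct.unpack("<%dh" % n, frames)]

# Build a 10 ms, 16 kHz silent clip in memory just for the demo.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(struct.pack("<160h", *([0] * 160)))
buf.seek(0)

floats = wav_to_floats(buf)
print(len(floats), min(floats), max(floats))  # 160 0.0 0.0
```

In practice the float arrays are then placed in a Spark DataFrame column named `audio_content` before calling the pipeline; 16 kHz mono input is what Wav2vec2-style models expect.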
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_voxpopuli_v2_finetuned| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|fi| |Size:|349.3 MB| --- layout: model title: Clinical Deidentification author: John Snow Labs name: clinical_deidentification date: 2022-01-06 tags: [deidentification, licensed, pipeline, de] task: Pipeline Healthcare language: de edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to deidentify PHI information from **German** medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `STREET`, `USERNAME`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE`, `CONTACT`, `ID`, `ZIP`, `ACCOUNT`, `SSN`, `DLN`, `PLATE` entities. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_de_3.3.4_2.4_1641504439647.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_de_3.3.4_2.4_1641504439647.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "de", "clinical/models") sample = """Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""" result = deid_pipeline.annotate(sample) print("\n".join(result['masked'])) print("\n".join(result['obfuscated'])) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = PretrainedPipeline("clinical_deidentification","de","clinical/models") val sample = """Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""" val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("de.deid.clinical").predict("""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""") ```
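Conceptually, the masking step replaces each PHI chunk detected by the NER stages with its entity label. A minimal, self-contained illustration of that replacement, using hand-written (hypothetical) character spans in place of real NER output:

```python
def mask_phi(text, chunks):
    """Replace each (start, end, label) span with <LABEL>, working right to left
    so earlier offsets stay valid as the string shrinks or grows."""
    for start, end, label in sorted(chunks, reverse=True):
        text = text[:start] + "<" + label + ">" + text[end:]
    return text

sample = "Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus eingeliefert."
# Hypothetical spans; the actual pipeline finds these with MedicalNerModel stages.
chunks = [(0, 14, "PATIENT"), (34, 50, "DATE"), (55, 80, "HOSPITAL")]
print(mask_phi(sample, chunks))
# <PATIENT> wird am Morgen des <DATE> ins <HOSPITAL> eingeliefert.
```

Obfuscation works the same way, except each span is replaced with a plausible fake value of the same entity type instead of a label tag.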
## Results ```bash <PATIENT> wird am Morgen des <DATE> ins <HOSPITAL> in <CITY> eingeliefert. <PATIENT> ist <AGE> Jahre alt und hat zu viel Wasser in den Beinen. Mathias Farber wird am Morgen des 05-01-1978 ins Rechts der Isar Hospital in Berlin eingeliefert. Mathias Farber ist 56 Jahre alt und hat zu viel Wasser in den Beinen. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Language:|de| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ContextualParserModel - ChunkMergeModel - DeIdentificationModel --- layout: model title: English RobertaForQuestionAnswering Cased model (from z-uo) author: John Snow Labs name: roberta_qa_sper date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-qasper` is an English model originally trained by `z-uo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_sper_en_4.2.4_3.0_1669988473791.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_sper_en_4.2.4_3.0_1669988473791.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_sper","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_sper","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_sper| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/z-uo/roberta-qasper --- layout: model title: English image_classifier_vit_Visual_transformer_chihuahua_cookies ViTForImageClassification from peterbonnesoeur author: John Snow Labs name: image_classifier_vit_Visual_transformer_chihuahua_cookies date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_Visual_transformer_chihuahua_cookies` is a English model originally trained by peterbonnesoeur. ## Predicted Entities `samoyed`, `chihuahua`, `shiba inu`, `cookies`, `corgi` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Visual_transformer_chihuahua_cookies_en_4.1.0_3.0_1660167471745.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Visual_transformer_chihuahua_cookies_en_4.1.0_3.0_1660167471745.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_Visual_transformer_chihuahua_cookies", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_Visual_transformer_chihuahua_cookies", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_Visual_transformer_chihuahua_cookies| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Legal NER for NDA (Exceptions Clause) author: John Snow Labs name: legner_nda_exceptions date: 2023-04-06 tags: [en, licensed, legal, ner, nda, exceptions] task: Named Entity Recognition language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a NER model, aimed to be run **only** after detecting the `EXCEPTIONS` clause with a proper classifier (use legmulticlf_mnda_sections_paragraph_other for that purpose). It will extract the following entities: `EXCLUDED_INFO` and `EXCLUSION_GROUND`. ## Predicted Entities `EXCLUDED_INFO`, `EXCLUSION_GROUND` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_nda_exceptions_en_1.0.0_3.0_1680796058978.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_nda_exceptions_en_1.0.0_3.0_1680796058978.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_nda_exceptions", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""( ii ) was within the Recipient’s or its Recipient Representatives possession prior to its being furnished to the Recipient or its Recipient Representatives by or on behalf of the Provider pursuant here to , provided that the source of such information was not bound by a confidentiality agreement with, or other contractual, legal or fiduciary obligation of confidentiality to, the Provider or any other party with respect to such information."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
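The `NerConverter` stage in the pipeline above merges token-level B-/I- tags into entity chunks. The grouping logic it applies can be sketched as:

```python
def bio_to_chunks(tokens, tags):
    """Group (token, BIO-tag) pairs into (chunk_text, label) spans."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close any open chunk before starting a new one
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)  # continue the open chunk
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

# Illustrative tags only; the real model produces these per token.
tokens = ["within", "the", "Recipient's", "possession", "prior", "to"]
tags = ["O", "O", "O", "B-EXCLUDED_INFO", "B-EXCLUSION_GROUND", "I-EXCLUSION_GROUND"]
print(bio_to_chunks(tokens, tags))
# [('possession', 'EXCLUDED_INFO'), ('prior to', 'EXCLUSION_GROUND')]
```

This matches the shape of the Results table below: chunks paired with their `ner_label`.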
## Results ```bash +----------+----------------+ |chunk |ner_label | +----------+----------------+ |possession|EXCLUDED_INFO | |prior to |EXCLUSION_GROUND| +----------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_nda_exceptions| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References In-house annotations on the Non-disclosure Agreements ## Benchmarking ```bash label precision recall f1-score support B-EXCLUDED_INFO 0.84 0.91 0.87 34 B-EXCLUSION_GROUND 0.85 0.91 0.88 32 I-EXCLUSION_GROUND 0.91 0.76 0.83 51 I-EXCLUDED_INFO 1.00 0.50 0.67 4 micro-avg 0.87 0.83 0.85 121 macro-avg 0.90 0.77 0.81 121 weighted-avg 0.88 0.83 0.85 121 ``` --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_10 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-256-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_10_en_4.0.0_3.0_1654191475689.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_10_en_4.0.0_3.0_1654191475689.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_10","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_256d_seed_10").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|383.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-256-finetuned-squad-seed-10 --- layout: model title: Russian Lemmatizer author: John Snow Labs name: lemma date: 2020-03-12 13:52:00 +0800 task: Lemmatization language: ru edition: Spark NLP 2.4.4 spark_version: 2.4 tags: [lemmatizer, ru] supported: true annotator: LemmatizerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_ru_2.4.4_2.4_1584013425855.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_ru_2.4.4_2.4_1584013425855.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "ru") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Помимо того, что он король севера, Джон Сноу - английский врач и лидер в разработке анестезии и медицинской гигиены.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "ru") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Помимо того, что он король севера, Джон Сноу - английский врач и лидер в разработке анестезии и медицинской гигиены.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Помимо того, что он король севера, Джон Сноу - английский врач и лидер в разработке анестезии и медицинской гигиены."""] lemma_df = nlu.load('ru.lemma').predict(text, output_level = "document") lemma_df.lemma.values[0] ```
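Conceptually, a dictionary-backed lemmatizer like this one maps every inflected surface form to a stored root, falling back to the surface form for out-of-vocabulary tokens. A toy illustration with a tiny hand-made (hypothetical) Russian lookup table:

```python
# Toy form -> lemma table; real models ship dictionaries with many thousands of entries.
LEMMAS = {
    "того": "то",
    "короля": "король",
    "врачом": "врач",
}

def lemmatize(tokens):
    # Lowercase for lookup; keep the surface form when the token is unknown.
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

print(lemmatize(["Того", "короля"]))  # ['то', 'король']
```

The pretrained model additionally uses context to disambiguate forms that could map to more than one root, which a plain table lookup cannot do.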
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=5, result='помимо', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=7, end=10, result='то', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=11, end=11, result=',', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=13, end=15, result='что', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=17, end=18, result='он', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.4.4| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|ru| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_squadv2_recipe_3_epochs date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squadv2-recipe-roberta-3-epochs` is an English model originally trained by `AnonymousSub`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_squadv2_recipe_3_epochs_en_4.3.0_3.0_1674224063985.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_squadv2_recipe_3_epochs_en_4.3.0_3.0_1674224063985.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_squadv2_recipe_3_epochs","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_squadv2_recipe_3_epochs","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_squadv2_recipe_3_epochs| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|467.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/squadv2-recipe-roberta-3-epochs --- layout: model title: Legal Portuguese Embeddings (Base, Petitions) author: John Snow Labs name: bert_embeddings_bert_base_portuguese_cased_finetuned_peticoes date: 2022-04-11 tags: [bert, embeddings, pt, open_source] task: Embeddings language: pt edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-portuguese-cased-finetuned-peticoes` is a Portuguese model originally trained by `Luciano`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_portuguese_cased_finetuned_peticoes_pt_3.4.2_3.0_1649673737932.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_portuguese_cased_finetuned_peticoes_pt_3.4.2_3.0_1649673737932.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_portuguese_cased_finetuned_peticoes","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Eu amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_portuguese_cased_finetuned_peticoes","pt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Eu amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.embed.bert_base_portuguese_cased_finetuned_peticoes").predict("""Eu amo Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_portuguese_cased_finetuned_peticoes| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|pt| |Size:|408.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/Luciano/bert-base-portuguese-cased-finetuned-peticoes --- layout: model title: English RobertaForMaskedLM Base Cased model author: John Snow Labs name: roberta_embeddings_base date: 2022-12-12 tags: [en, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base` is an English model originally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_base_en_4.2.4_3.0_1670859138319.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_base_en_4.2.4_3.0_1670859138319.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_base","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_base| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|300.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/roberta-base - https://arxiv.org/abs/1907.11692 - https://github.com/pytorch/fairseq/tree/master/examples/roberta - https://yknzhu.wixsite.com/mbweb - https://en.wikipedia.org/wiki/English_Wikipedia - https://commoncrawl.org/2016/10/news-dataset-available/ - https://github.com/jcpeterson/openwebtext - https://arxiv.org/abs/1806.02847 --- layout: model title: English XLMRobertaForTokenClassification Base Cased model (from tner) author: John Snow Labs name: xlmroberta_ner_tner_base_conll2003 date: 2022-08-13 tags: [en, open_source, xlm_roberta, ner] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-conll2003` is an English model originally trained by `tner`. ## Predicted Entities `other`, `person`, `location`, `organization` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_base_conll2003_en_4.1.0_3.0_1660426294879.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_base_conll2003_en_4.1.0_3.0_1660426294879.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_base_conll2003","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_base_conll2003","en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("document", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
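The `NerConverter` stage above merges token-level IOB tags into entity chunks. The core idea can be sketched in plain Python (a simplified illustration, not Spark NLP's actual implementation):

```python
def iob_to_chunks(tokens, tags):
    """Merge token-level IOB tags (B-person, I-person, O, ...) into (text, label) chunks."""
    chunks, current_toks, current_label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and current_label != tag[2:]):
            if current_toks:
                chunks.append((" ".join(current_toks), current_label))
            current_toks, current_label = [tok], tag[2:]
        elif tag.startswith("I-"):
            current_toks.append(tok)
        else:  # "O" tag: close any open chunk
            if current_toks:
                chunks.append((" ".join(current_toks), current_label))
            current_toks, current_label = [], None
    if current_toks:
        chunks.append((" ".join(current_toks), current_label))
    return chunks

tokens = ["John", "lives", "in", "New", "York"]
tags = ["B-person", "O", "O", "B-location", "I-location"]
print(iob_to_chunks(tokens, tags))  # [('John', 'person'), ('New York', 'location')]
```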
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_tner_base_conll2003|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|792.8 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/tner/xlm-roberta-base-conll2003
- https://github.com/asahi417/tner
---
layout: model
title: Image Processing algorithms to improve Document Quality
author: John Snow Labs
name: image_processing
date: 2023-01-03
tags: [en, licensed, ocr, image_processing]
task: Document Image Processing
language: en
nav_key: models
edition: Visual NLP 4.0.0
spark_version: 3.2.1
supported: true
annotator: ImageProcessing
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Automatically discovering knowledge from documents is a challenging task and an open research problem, and poor input image quality makes it even harder. To mitigate this, different image processing algorithms can be applied to document images to improve their quality and, in turn, the performance of downstream computer vision steps such as text detection, text recognition, OCR, and table detection. The image processing algorithms included in this project are: image scaling, adaptive thresholding, erosion, dilation, object removal, median blur, and GPU image transformation.

## Predicted Entities

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/ocr/IMAGE_PROCESSING/){:.button.button-orange.button-orange-trans.co.button-icon}
[Open in Colab](https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/tutorials/Certification_Trainings/1.2.Image_processing.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
import pkg_resources

binary_to_image = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image")

scale_transformer = ImageTransformer() \
    .addScalingTransform(2) \
    .setInputCol("image") \
    .setOutputCol("scaled_image") \
    .setImageType(ImageType.TYPE_BYTE_GRAY)

threshold_transformer = ImageTransformer() \
    .addAdaptiveThreshold(21, 20) \
    .setInputCol("image") \
    .setOutputCol("thresholded_image")

erode_transformer = ImageTransformer() \
    .addErodeTransform(2, 2) \
    .setInputCol("image") \
    .setOutputCol("eroded_image")

dilate_transformer = ImageTransformer() \
    .addDilateTransform(1, 2) \
    .setInputCol("image") \
    .setOutputCol("dilated_image")

removebg_transformer = ImageTransformer() \
    .addScalingTransform(2) \
    .addAdaptiveThreshold(31, 2) \
    .addRemoveObjects(10, 500) \
    .setInputCol("image") \
    .setOutputCol("corrected_image")

deblur_transformer = ImageTransformer() \
    .addScalingTransform(2) \
    .addMedianBlur(3) \
    .setInputCol("image") \
    .setOutputCol("corrected_image")

multiple_transformer = GPUImageTransformer() \
    .addScalingTransform(8) \
    .addOtsuTransform() \
    .addErodeTransform(3, 3) \
    .setInputCol("image") \
    .setOutputCol("multiple_image")

# All stages are transformers, so they can be assembled directly into a PipelineModel.
pipeline_scaled = PipelineModel(stages=[binary_to_image, scale_transformer])
pipeline_thresholded = PipelineModel(stages=[binary_to_image, threshold_transformer])
pipeline_eroded = PipelineModel(stages=[binary_to_image, erode_transformer])
pipeline_dilated = PipelineModel(stages=[binary_to_image, dilate_transformer])
pipeline_removebg = PipelineModel(stages=[binary_to_image, removebg_transformer])
pipeline_deblured = PipelineModel(stages=[binary_to_image, deblur_transformer])
pipeline_multiple = PipelineModel(stages=[binary_to_image, multiple_transformer])

image_path = pkg_resources.resource_filename("sparkocr", "resources/ocr/images/check.jpg")
image_example_df = spark.read.format("binaryFile").load(image_path)

result_scaled = pipeline_scaled.transform(image_example_df).cache()
result_thresholded = pipeline_thresholded.transform(image_example_df).cache()
result_eroded = pipeline_eroded.transform(image_example_df).cache()
result_dilated = pipeline_dilated.transform(image_example_df).cache()
result_removebg = pipeline_removebg.transform(image_example_df).cache()
result_deblured = pipeline_deblured.transform(image_example_df).cache()
result_multiple = pipeline_multiple.transform(image_example_df).cache()
```
```scala
val binary_to_image = new BinaryToImage()
    .setInputCol("content")
    .setOutputCol("image")

val scale_transformer = new ImageTransformer()
    .addScalingTransform(2)
    .setInputCol("image")
    .setOutputCol("scaled_image")
    .setImageType(ImageType.TYPE_BYTE_GRAY)

val threshold_transformer = new ImageTransformer()
    .addAdaptiveThreshold(21, 20)
    .setInputCol("image")
    .setOutputCol("thresholded_image")

val erode_transformer = new ImageTransformer()
    .addErodeTransform(2, 2)
    .setInputCol("image")
    .setOutputCol("eroded_image")

val dilate_transformer = new ImageTransformer()
    .addDilateTransform(1, 2)
    .setInputCol("image")
    .setOutputCol("dilated_image")

val removebg_transformer = new ImageTransformer()
    .addScalingTransform(2)
    .addAdaptiveThreshold(31, 2)
    .addRemoveObjects(10, 500)
    .setInputCol("image")
    .setOutputCol("corrected_image")

val deblur_transformer = new ImageTransformer()
    .addScalingTransform(2)
    .addMedianBlur(3)
    .setInputCol("image")
    .setOutputCol("corrected_image")

val multiple_transformer = new GPUImageTransformer()
    .addScalingTransform(8)
    .addOtsuTransform()
    .addErodeTransform(3, 3)
    .setInputCol("image")
    .setOutputCol("multiple_image")

val pipeline_scaled = new Pipeline().setStages(Array(binary_to_image, scale_transformer))
val pipeline_thresholded = new Pipeline().setStages(Array(binary_to_image, threshold_transformer))
val pipeline_eroded = new Pipeline().setStages(Array(binary_to_image, erode_transformer))
val pipeline_dilated = new Pipeline().setStages(Array(binary_to_image, dilate_transformer))
val pipeline_removebg = new Pipeline().setStages(Array(binary_to_image, removebg_transformer))
val pipeline_deblured = new Pipeline().setStages(Array(binary_to_image, deblur_transformer))
val pipeline_multiple = new Pipeline().setStages(Array(binary_to_image, multiple_transformer))

// Adjust to a local path to the sample image.
val image_path = "resources/ocr/images/check.jpg"
val image_example_df = spark.read.format("binaryFile").load(image_path)

val result_scaled = pipeline_scaled.fit(image_example_df).transform(image_example_df).cache()
val result_thresholded = pipeline_thresholded.fit(image_example_df).transform(image_example_df).cache()
val result_eroded = pipeline_eroded.fit(image_example_df).transform(image_example_df).cache()
val result_dilated = pipeline_dilated.fit(image_example_df).transform(image_example_df).cache()
val result_removebg = pipeline_removebg.fit(image_example_df).transform(image_example_df).cache()
val result_deblured = pipeline_deblured.fit(image_example_df).transform(image_example_df).cache()
val result_multiple = pipeline_multiple.fit(image_example_df).transform(image_example_df).cache()
```
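Conceptually, the erode and dilate transforms used above are morphological operations on (near-)binary images: dilation grows foreground regions, erosion shrinks them. A minimal pure-Python sketch of the idea (illustrative only, not Spark OCR's implementation):

```python
def _window(img, i, j, k, pad):
    """Values of the k x k neighborhood around (i, j), padding out-of-bounds with `pad`."""
    h, w, r = len(img), len(img[0]), k // 2
    return [img[i + di][j + dj] if 0 <= i + di < h and 0 <= j + dj < w else pad
            for di in range(-r, r + 1) for dj in range(-r, r + 1)]

def dilate(img, k=3):
    """A pixel becomes 1 if any pixel in its k x k window is 1."""
    return [[max(_window(img, i, j, k, 0)) for j in range(len(img[0]))]
            for i in range(len(img))]

def erode(img, k=3):
    """A pixel stays 1 only if every pixel in its k x k window is 1."""
    return [[min(_window(img, i, j, k, 1)) for j in range(len(img[0]))]
            for i in range(len(img))]

img = [[0] * 5 for _ in range(5)]
img[2][2] = 1                                  # a single foreground pixel
assert sum(map(sum, dilate(img))) == 9         # grows into a 3 x 3 block
assert sum(map(sum, erode(dilate(img)))) == 1  # erosion shrinks it back
```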
## Example

{%- capture input_image -%}
![Screenshot](/assets/images/examples_ocr/image2.png)
{%- endcapture -%}

{%- capture output_image -%}
![Screenshot](/assets/images/examples_ocr/image2_out2.png)
{%- endcapture -%}

{% include templates/input_output_image.md
input_image=input_image
output_image=output_image
%}

## Model Information

{:.table-model}
|---|---|
|Model Name:|image_processing|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
---
layout: model
title: English T5ForConditionalGeneration Tiny Cased model (from google)
author: John Snow Labs
name: t5_efficient_tiny_nl12
date: 2023-01-31
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-nl12` is an English model originally trained by `google`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nl12_en_4.3.0_3.0_1675123754998.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nl12_en_4.3.0_3.0_1675123754998.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_efficient_tiny_nl12","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_efficient_tiny_nl12","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_efficient_tiny_nl12|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|74.4 MB|

## References

- https://huggingface.co/google/t5-efficient-tiny-nl12
- https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
- https://arxiv.org/abs/2109.10686
- https://github.com/google-research/google-research/issues/986#issuecomment-1035051145
---
layout: model
title: Mapping Drugs to Their Categories as well as Other Brand and Names
author: John Snow Labs
name: drug_category_mapper
date: 2022-12-18
tags: [category, chunk_mapper, drug, licensed, clinical, en]
task: Chunk Mapping
language: en
nav_key: models
edition: Healthcare NLP 4.2.2
spark_version: 3.0
supported: true
annotator: ChunkMapperModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained model maps drugs to their categories and to other brands and names. It has two category levels: main category and subcategory.

## Predicted Entities

`main_category`, `sub_category`, `other_name`

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/drug_category_mapper_en_4.2.2_3.0_1671374094037.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/drug_category_mapper_en_4.2.2_3.0_1671374094037.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

chunkerMapper = ChunkMapperModel.pretrained("drug_category_mapper", "en", "clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("mappings")\
    .setRels(["main_category", "sub_category", "other_name"])

pipeline = Pipeline().setStages([
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    ner,
    converter,
    chunkerMapper])

text = "She is given OxyContin, folic acid, levothyroxine, Norvasc, aspirin, Neurontin"

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val chunkerMapper = ChunkMapperModel.pretrained("drug_category_mapper", "en", "clinical/models")
    .setInputCols("ner_chunk")
    .setOutputCol("mappings")
    .setRels(Array("main_category", "sub_category", "other_name"))

val pipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    ner,
    converter,
    chunkerMapper))

val text = "She is given OxyContin, folic acid, levothyroxine, Norvasc, aspirin, Neurontin"

val data = Seq(text).toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.drug_category").predict("""She is given OxyContin, folic acid, levothyroxine, Norvasc, aspirin, Neurontin""")
```
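Conceptually, the `ChunkMapperModel` stage is a dictionary lookup from each detected chunk to its mapped relations. A minimal sketch of that idea (the single table entry below is copied from the Results section; the model's actual dictionary is far larger and this is not its real implementation):

```python
# One illustrative entry; the real model ships a much larger mapping table.
MAPPINGS = {
    "oxycontin": {
        "main_category": "Pain Management",
        "sub_category": "Opioid Analgesics",
        "other_name": "Oxaydo",
    },
}

def map_chunk(chunk, rels=("main_category", "sub_category", "other_name")):
    """Look up a chunk (case-insensitively) and return one value per relation."""
    entry = MAPPINGS.get(chunk.lower(), {})
    return {rel: entry.get(rel, "NONE") for rel in rels}

print(map_chunk("OxyContin")["main_category"])  # Pain Management
```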
## Results

```bash
+-------------+---------------------+-----------------------------------+-----------+
|    ner_chunk|        main_category|                       sub_category|other_names|
+-------------+---------------------+-----------------------------------+-----------+
|    OxyContin|      Pain Management|                  Opioid Analgesics|     Oxaydo|
|   folic acid|         Nutritionals|            Vitamins, Water-Soluble|    Folvite|
|levothyroxine|Metabolic & Endocrine|                   Thyroid Products|     Levo T|
|      Norvasc|       Cardiovascular|                 Antianginal Agents|   Katerzia|
|      aspirin|       Cardiovascular|Antiplatelet Agents, Cardiovascular|        ASA|
|    Neurontin|          Neurologics|                       GABA Analogs|    Gralise|
+-------------+---------------------+-----------------------------------+-----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|drug_category_mapper|
|Compatibility:|Healthcare NLP 4.2.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[ner_chunk]|
|Output Labels:|[mappings]|
|Language:|en|
|Size:|526.0 KB|
---
layout: model
title: Lemmatizer (Russian, SpacyLookup)
author: John Snow Labs
name: lemma_spacylookup
date: 2022-03-03
tags: [open_source, lemmatizer, ru]
task: Lemmatization
language: ru
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This Russian Lemmatizer is a scalable, production-ready version of the Rule-based Lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/).

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_ru_3.4.1_3.0_1646316536145.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_ru_3.4.1_3.0_1646316536145.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","ru") \
    .setInputCols(["token"]) \
    .setOutputCol("lemma")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer])

example = spark.createDataFrame([["Ты не лучше меня"]], ["text"])

results = pipeline.fit(example).transform(example)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","ru")
    .setInputCols(Array("token"))
    .setOutputCol("lemma")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer))

val data = Seq("Ты не лучше меня").toDF("text")

val results = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("ru.lemma.spacylookup").predict("""Ты не лучше меня""")
```
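Under the hood, a lookup lemmatizer like this one is essentially a dictionary from inflected forms to lemmas, with out-of-vocabulary tokens passed through unchanged. A minimal sketch (the two table entries below are illustrative, not the model's actual dictionary):

```python
# Illustrative lookup table; the real model holds the full spaCy lookup data for Russian.
LEMMA_TABLE = {"лучше": "хороший", "меня": "меня"}

def lemmatize(tokens):
    """Map each token to its lemma if known, otherwise keep it as-is."""
    return [LEMMA_TABLE.get(token.lower(), token) for token in tokens]

print(lemmatize(["Ты", "не", "лучше", "меня"]))  # ['Ты', 'не', 'хороший', 'меня']
```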
## Results

```bash
+-----------------------+
|result                 |
+-----------------------+
|[Ты, не, хороший, меня]|
+-----------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|lemma_spacylookup|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[lemma]|
|Language:|ru|
|Size:|6.8 MB|
---
layout: model
title: Legal Fees Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_fees_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, fees, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

This model is a Binary Classifier (True, False) for the `Fees` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it is better to skip it unless you want to do Binary Classification at the sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`Fees`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_fees_bert_en_1.0.0_3.0_1678049919527.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_fees_bert_en_1.0.0_3.0_1678049919527.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

doc_classifier = legal.ClassifierDLModel.pretrained("legclf_fees_bert", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    document_assembler,
    embeddings,
    doc_classifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")

model = nlpPipeline.fit(df)

result = model.transform(df)
```
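The description above recommends paragraph splitting before classification. A minimal sketch of splitting by multiline (blank lines), with each resulting paragraph intended to become one row of the input DataFrame (the helper name and length threshold are illustrative, not part of the library):

```python
def split_into_paragraphs(text, min_chars=20):
    """Split on blank lines and drop fragments too short to carry a clause."""
    paragraphs = [p.strip() for p in text.split("\n\n")]
    return [p for p in paragraphs if len(p) >= min_chars]

contract = (
    "FEES. The Borrower shall pay to the Lender a fee of 2% per annum.\n\n"
    "GOVERNING LAW. This Agreement is governed by the laws of New York.\n\n"
    "IX."
)

paragraphs = split_into_paragraphs(contract)
print(len(paragraphs))  # 2 -- the short "IX." fragment is dropped
# Each paragraph would then become one row of the classifier's input, e.g.
# df = spark.createDataFrame([[p] for p in paragraphs]).toDF("text")
```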
## Results

```bash
+-------+
|result |
+-------+
|[Fees] |
|[Other]|
|[Other]|
|[Fees] |
+-------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_fees_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.5 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
       label  precision  recall  f1-score  support
        Fees       0.89    0.97      0.93       69
       Other       0.98    0.92      0.95       97
    accuracy          -       -      0.94      166
   macro-avg       0.94    0.94      0.94      166
weighted-avg       0.94    0.94      0.94      166
```
---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_2_h_256
date: 2022-12-06
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-2_H-256` is a Chinese model originally trained by `uer`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_2_h_256_zh_4.2.4_3.0_1670325879533.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_2_h_256_zh_4.2.4_3.0_1670325879533.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_2_h_256","zh") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_2_h_256","zh")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_2_h_256|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|27.3 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/uer/chinese_roberta_L-2_H-256
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://cloud.tencent.com/
---
layout: model
title: Detect Clinical Entities (ner_jsl_greedy)
author: John Snow Labs
name: ner_jsl_greedy
date: 2021-06-21
tags: [ner, licensed, clinical, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.1.0
spark_version: 3.0
supported: true
annotator: NotDefined
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained named entity recognition deep learning model for clinical terminology. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. This model is the official version of the jsl_ner_wip_greedy_clinical model.

Definitions of Predicted Entities:

- `Injury_or_Poisoning`: Physical harm or injury caused to the body, including those caused by accidents, falls, or poisoning of a patient or someone else.
- `Direction`: All the information relating to the laterality of the internal and external organs.
- `Test`: Mentions of laboratory, pathology, and radiological tests.
- `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient.
- `Death_Entity`: Mentions that indicate the death of a patient.
- `Relationship_Status`: State of a patient's romantic or social relationships (e.g.
single, married, divorced).
- `Duration`: The duration of a medical treatment or medication use.
- `Respiration`: Number of breaths per minute.
- `Hyperlipidemia`: Terms that indicate hyperlipidemia with relevant subtypes and synonyms.
- `Birth_Entity`: Mentions that indicate giving birth.
- `Age`: All mentions of ages, past or present, related to the patient or anybody else.
- `Labour_Delivery`: Extractions include stages of labor and delivery.
- `Family_History_Header`: Identifies section headers that correspond to Family History of the patient.
- `BMI`: Numeric values and other text information related to Body Mass Index.
- `Temperature`: All mentions that refer to body temperature.
- `Alcohol`: Terms that indicate alcohol use, abuse or drinking issues of a patient or someone else.
- `Kidney_Disease`: Terms that refer to any kidney diseases (includes mentions of modifiers such as "Acute" or "Chronic").
- `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else.
- `Medical_History_Header`: Identifies section headers that correspond to Past Medical History of a patient.
- `Cerebrovascular_Disease`: All terms that refer to cerebrovascular diseases and events.
- `Oxygen_Therapy`: Breathing support triggered by patient or entirely or partially by machine (e.g. ventilator, BPAP, CPAP).
- `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements.
- `Psychological_Condition`: All the mental health diagnoses, disorders, conditions or syndromes of a patient or someone else.
- `Heart_Disease`: All mentions of acquired, congenital or degenerative heart diseases.
- `Employment`: All mentions of patient or provider occupational titles and employment status.
- `Obesity`: Terms related to a patient being obese (overweight and BMI are extracted as different labels).
- `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.).
- `Pregnancy`: All terms related to Pregnancy (excluding terms that are extracted with their specific labels, such as "Labour_Delivery" etc.).
- `ImagingFindings`: All mentions of radiographic and imaging findings.
- `Procedure`: All mentions of invasive medical or surgical procedures or treatments.
- `Medical_Device`: All mentions related to medical devices and supplies.
- `Race_Ethnicity`: All terms that refer to racial and national origin of sociocultural groups.
- `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital signs Headers are extracted separately with their specific labels).
- `Symptom`: All the symptoms mentioned in the document, of a patient or someone else.
- `Treatment`: Includes therapeutic and minimally invasive treatment and procedures (invasive treatments or procedures are extracted as "Procedure").
- `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs).
- `Route`: Drug and medication administration routes available described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Drug_Ingredient`: Active ingredient/s found in drug products.
- `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted.
- `Diet`: All mentions and information regarding patients' dietary habits.
- `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by naked eye.
- `LDL`: All mentions related to the lab test and results for LDL (Low Density Lipoprotein).
- `VS_Finding`: Qualitative data (e.g. Fever, Cyanosis, Tachycardia) and any other symptoms that refer to vital signs.
- `Allergen`: Allergen related extractions mentioned in the document.
- `EKG_Findings`: All mentions of EKG readings.
- `Imaging_Technique`: All mentions of special radiographic views or special imaging techniques used in radiology.
- `Triglycerides`: All mentions of terms related to the specific lab test for Triglycerides.
- `RelativeTime`: Time references that are relative to different times or events (e.g. words such as "approximately", "in the morning").
- `Gender`: Gender-specific nouns and pronouns.
- `Pulse`: Peripheral heart rate, without advanced information like measurement location.
- `Social_History_Header`: Identifies section headers that correspond to Social History of a patient.
- `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs).
- `Diabetes`: All terms related to diabetes mellitus.
- `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately.
- `Internal_organ_or_component`: All mentions related to internal body parts or organs that can not be examined by naked eye.
- `Clinical_Dept`: Terms that indicate the medical and/or surgical departments.
- `Form`: Drug and medication forms available described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients.
- `Strength`: Potency of one unit of a drug (or a combination of drugs); the available measurement units are described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Fetus_NewBorn`: All terms related to fetuses, infants and newborns (excluding terms that are extracted with their specific labels, such as "Labour_Delivery" and "Pregnancy").
- `RelativeDate`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "approximately two years ago", "about two days ago").
- `Height`: All mentions related to a patient's height.
- `Test_Result`: Terms related to all the test results present in the document (clinical test results are included).
- `Sexually_Active_or_Sexual_Orientation`: All terms related to sexuality, sexual orientation and sexual activity.
- `Frequency`: Frequency of administration for a prescribed dose.
- `Time`: Specific time references (hours and/or minutes).
- `Weight`: All mentions related to a patient's weight.
- `Vaccine`: Generic and brand names of vaccines or vaccination procedures.
- `Vital_Signs_Header`: Identifies section headers that correspond to the Vital Signs of a patient.
- `Communicable_Disease`: Includes all mentions of communicable diseases.
- `Dosage`: Quantity prescribed by the physician for an active ingredient; the available measurement units are described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm).
- `Overweight`: Terms related to the patient being overweight (BMI and Obesity are extracted separately).
- `Hypertension`: All terms related to hypertension (quantitative data such as 150/100 is extracted as Blood_Pressure).
- `HDL`: Terms related to the lab test for HDL (High-Density Lipoprotein).
- `Total_Cholesterol`: Terms related to the lab test and results for cholesterol.
- `Smoking`: All mentions of the smoking status of a patient.
- `Date`: Mentions of an exact date, in any format, including day number, month and/or year.

## Predicted Entities

`Injury_or_Poisoning`, `Direction`, `Test`, `Admission_Discharge`, `Death_Entity`, `Relationship_Status`, `Duration`, `Hyperlipidemia`, `Respiration`, `Birth_Entity`, `Age`, `Family_History_Header`, `Labour_Delivery`, `BMI`, `Temperature`, `Alcohol`, `Kidney_Disease`, `Oncological`, `Medical_History_Header`, `Cerebrovascular_Disease`, `Oxygen_Therapy`, `O2_Saturation`, `Psychological_Condition`, `Heart_Disease`, `Employment`, `Obesity`, `Disease_Syndrome_Disorder`, `Pregnancy`, `ImagingFindings`, `Procedure`, `Medical_Device`, `Race_Ethnicity`, `Section_Header`, `Drug`, `Symptom`, `Treatment`, `Substance`, `Route`, `Blood_Pressure`, `Diet`, `External_body_part_or_region`, `LDL`, `VS_Finding`, `Allergen`, `EKG_Findings`, `Imaging_Technique`, `Triglycerides`, `RelativeTime`, `Gender`, `Pulse`, `Social_History_Header`, `Substance_Quantity`, `Diabetes`, `Modifier`, `Internal_organ_or_component`, `Clinical_Dept`, `Form`, `Strength`, `Fetus_NewBorn`, `RelativeDate`, `Height`, `Test_Result`, `Time`, `Frequency`, `Sexually_Active_or_Sexual_Orientation`, `Weight`, `Vaccine`, `Vital_Signs_Header`, `Communicable_Disease`, `Dosage`, `Hypertension`, `HDL`, `Overweight`, `Total_Cholesterol`, `Smoking`, `Date`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_greedy_en_3.1.0_3.0_1624308839002.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_greedy_en_3.1.0_3.0_1624308839002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

jsl_ner = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("jsl_ner")

jsl_ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "jsl_ner"]) \
    .setOutputCol("ner_chunk")

jsl_ner_pipeline = Pipeline().setStages([
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    jsl_ner,
    jsl_ner_converter])

jsl_ner_model = jsl_ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

data = spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."]]).toDF("text")

result = jsl_ner_model.transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val jsl_ner = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("jsl_ner")

val jsl_ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "jsl_ner"))
    .setOutputCol("ner_chunk")

val jsl_ner_pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    jsl_ner,
    jsl_ner_converter))

val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text")

val result = jsl_ner_pipeline.fit(data).transform(data)
```
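The `NerConverter` stage merges the token-level IOB tags emitted by the NER model into entity chunks. As a rough, Spark-independent illustration (plain Python; the helper name is ours, not part of the library), the merging logic can be sketched as:

```python
def iob_to_chunks(tokens, tags):
    """Merge token-level IOB tags (B-X, I-X, O) into (chunk_text, label) pairs."""
    chunks, current_tokens, current_label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new chunk, closing any open one.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:  # "O" or an I- tag that does not continue the open chunk
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks

tokens = ["The", "patient", "is", "a", "21-day-old", "Caucasian", "male"]
tags   = ["O", "O", "O", "O", "B-Age", "B-Race_Ethnicity", "B-Gender"]
print(iob_to_chunks(tokens, tags))
# [('21-day-old', 'Age'), ('Caucasian', 'Race_Ethnicity'), ('male', 'Gender')]
```

The actual annotator also carries character offsets and confidence scores along with each chunk; this sketch keeps only the text and label.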
## Results

```bash
+----------------------------------------------+----------------------------+
|chunk                                         |ner_label                   |
+----------------------------------------------+----------------------------+
|21-day-old                                    |Age                         |
|Caucasian                                     |Race_Ethnicity              |
|male                                          |Gender                      |
|for 2 days                                    |Duration                    |
|congestion                                    |Symptom                     |
|mom                                           |Gender                      |
|suctioning yellow discharge                   |Symptom                     |
|nares                                         |External_body_part_or_region|
|she                                           |Gender                      |
|mild problems with his breathing while feeding|Symptom                     |
|perioral cyanosis                             |Symptom                     |
|retractions                                   |Symptom                     |
|One day ago                                   |RelativeDate                |
|mom                                           |Gender                      |
|tactile temperature                           |Symptom                     |
|Tylenol                                       |Drug                        |
|Baby                                          |Age                         |
|decreased p.o. intake                         |Symptom                     |
|His                                           |Gender                      |
|20 minutes                                    |Duration                    |
+----------------------------------------------+----------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_jsl_greedy|
|Compatibility:|Healthcare NLP 3.1.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|

## Data Source

Trained on data gathered and manually annotated by John Snow Labs. https://www.johnsnowlabs.com/data/

## Benchmarking

```bash
entity tp fp fn total precision recall f1
VS_Finding 229.0 56.0 34.0 263.0 0.8035 0.8707 0.8358
Direction 4009.0 479.0 403.0 4412.0 0.8933 0.9087 0.9009
Female_Reproducti... 2.0 1.0 3.0 5.0 0.6667 0.4 0.5
Respiration 80.0 9.0 14.0 94.0 0.8989 0.8511 0.8743
Cerebrovascular_D... 82.0 27.0 18.0 100.0 0.7523 0.82 0.7847
not 4.0 0.0 0.0 4.0 1.0 1.0 1.0
Family_History_He... 86.0 4.0 3.0 89.0 0.9556 0.9663 0.9609
Heart_Disease 469.0 76.0 83.0 552.0 0.8606 0.8496 0.8551
ImagingFindings 68.0 38.0 75.0 143.0 0.6415 0.4755 0.5462
RelativeTime 141.0 76.0 66.0 207.0 0.6498 0.6812 0.6651
Strength 720.0 49.0 58.0 778.0 0.9363 0.9254 0.9308
Smoking 117.0 8.0 6.0 123.0 0.936 0.9512 0.9435
Medical_Device 3584.0 730.0 359.0 3943.0 0.8308 0.909 0.8681
EKG_Findings 41.0 20.0 45.0 86.0 0.6721 0.4767 0.5578
Pulse 138.0 23.0 24.0 162.0 0.8571 0.8519 0.8545
Psychological_Con... 121.0 14.0 29.0 150.0 0.8963 0.8067 0.8491
Overweight 5.0 2.0 0.0 5.0 0.7143 1.0 0.8333
Triglycerides 3.0 0.0 0.0 3.0 1.0 1.0 1.0
Obesity 49.0 6.0 4.0 53.0 0.8909 0.9245 0.9074
Admission_Discharge 325.0 30.0 2.0 327.0 0.9155 0.9939 0.9531
HDL 2.0 1.0 1.0 3.0 0.6667 0.6667 0.6667
Diabetes 118.0 13.0 7.0 125.0 0.9008 0.944 0.9219
Section_Header 3778.0 148.0 138.0 3916.0 0.9623 0.9648 0.9635
Age 617.0 52.0 47.0 664.0 0.9223 0.9292 0.9257
O2_Saturation 34.0 11.0 19.0 53.0 0.7556 0.6415 0.6939
Kidney_Disease 114.0 5.0 12.0 126.0 0.958 0.9048 0.9306
Test 2668.0 526.0 498.0 3166.0 0.8353 0.8427 0.839
Communicable_Disease 25.0 12.0 9.0 34.0 0.6757 0.7353 0.7042
Hypertension 152.0 10.0 6.0 158.0 0.9383 0.962 0.95
External_body_par... 2652.0 387.0 340.0 2992.0 0.8727 0.8864 0.8795
Oxygen_Therapy 67.0 21.0 23.0 90.0 0.7614 0.7444 0.7528
Test_Result 1124.0 227.0 258.0 1382.0 0.832 0.8133 0.8225
Modifier 539.0 185.0 309.0 848.0 0.7445 0.6356 0.6858
BMI 7.0 1.0 1.0 8.0 0.875 0.875 0.875
Labour_Delivery 75.0 19.0 23.0 98.0 0.7979 0.7653 0.7813
Employment 249.0 51.0 57.0 306.0 0.83 0.8137 0.8218
Clinical_Dept 948.0 95.0 80.0 1028.0 0.9089 0.9222 0.9155
Time 36.0 7.0 7.0 43.0 0.8372 0.8372 0.8372
Procedure 3180.0 460.0 480.0 3660.0 0.8736 0.8689 0.8712
Diet 50.0 29.0 30.0 80.0 0.6329 0.625 0.6289
Oncological 478.0 46.0 50.0 528.0 0.9122 0.9053 0.9087
LDL 3.0 0.0 2.0 5.0 1.0 0.6 0.75
Symptom 6801.0 1097.0 1097.0 7898.0 0.8611 0.8611 0.8611
Temperature 109.0 12.0 7.0 116.0 0.9008 0.9397 0.9198
Vital_Signs_Header 213.0 27.0 16.0 229.0 0.8875 0.9301 0.9083
Relationship_Status 42.0 2.0 1.0 43.0 0.9545 0.9767 0.9655
Total_Cholesterol 10.0 4.0 5.0 15.0 0.7143 0.6667 0.6897
Blood_Pressure 167.0 22.0 23.0 190.0 0.8836 0.8789 0.8813
Injury_or_Poisoning 510.0 83.0 111.0 621.0 0.86 0.8213 0.8402
Drug_Ingredient 1698.0 160.0 158.0 1856.0 0.9139 0.9149 0.9144
Treatment 156.0 40.0 54.0 210.0 0.7959 0.7429 0.7685
Assertion_SocialD... 4.0 0.0 6.0 10.0 1.0 0.4 0.5714
Pregnancy 100.0 45.0 41.0 141.0 0.6897 0.7092 0.6993
Vaccine 13.0 3.0 6.0 19.0 0.8125 0.6842 0.7429
Disease_Syndrome_... 2861.0 452.0 376.0 3237.0 0.8636 0.8838 0.8736
Height 25.0 8.0 9.0 34.0 0.7576 0.7353 0.7463
Frequency 650.0 157.0 148.0 798.0 0.8055 0.8145 0.81
Route 872.0 83.0 85.0 957.0 0.9131 0.9112 0.9121
Death_Entity 49.0 7.0 6.0 55.0 0.875 0.8909 0.8829
Duration 367.0 132.0 95.0 462.0 0.7355 0.7944 0.7638
Internal_organ_or... 6532.0 1016.0 987.0 7519.0 0.8654 0.8687 0.8671
Alcohol 79.0 20.0 12.0 91.0 0.798 0.8681 0.8316
Date 515.0 19.0 19.0 534.0 0.9644 0.9644 0.9644
Hyperlipidemia 47.0 2.0 1.0 48.0 0.9592 0.9792 0.9691
Social_History_He... 89.0 9.0 4.0 93.0 0.9082 0.957 0.9319
Race_Ethnicity 113.0 0.0 3.0 116.0 1.0 0.9741 0.9869
Imaging_Technique 47.0 31.0 30.0 77.0 0.6026 0.6104 0.6065
Drug_BrandName 963.0 72.0 79.0 1042.0 0.9304 0.9242 0.9273
RelativeDate 553.0 128.0 121.0 674.0 0.812 0.8205 0.8162
Gender 6043.0 59.0 87.0 6130.0 0.9903 0.9858 0.9881
Form 227.0 35.0 47.0 274.0 0.8664 0.8285 0.847
Dosage 279.0 42.0 62.0 341.0 0.8692 0.8182 0.8429
Medical_History_H... 117.0 4.0 11.0 128.0 0.9669 0.9141 0.9398
Substance 59.0 16.0 16.0 75.0 0.7867 0.7867 0.7867
Weight 85.0 19.0 21.0 106.0 0.8173 0.8019 0.8095
macro - - - - - - 0.7286
micro - - - - - - 0.8715
```

---
layout: model
title: Explain Document Pipeline for Portuguese
author: John Snow Labs
name: explain_document_md
date: 2021-03-22
tags: [open_source, portuguese, explain_document_md, pipeline, pt]
supported: true
task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging]
language: pt
edition: Spark NLP 3.0.0
spark_version: 3.0
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The explain_document_md is a pretrained pipeline that processes text with a simple sequence of basic annotators. It performs most of the common text processing tasks on your dataframe.

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_pt_3.0.0_3.0_1616433461478.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_md_pt_3.0.0_3.0_1616433461478.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('explain_document_md', lang='pt')
annotations = pipeline.fullAnnotate("Olá de John Snow Labs! ")[0]
annotations.keys()
```

```scala
val pipeline = new PretrainedPipeline("explain_document_md", lang = "pt")
val result = pipeline.fullAnnotate("Olá de John Snow Labs! ")(0)
```

{:.nlu-block}
```python
import nlu

text = ["Olá de John Snow Labs! "]
result_df = nlu.load('pt.explain.md').predict(text)
result_df
```
## Results

```bash
|    | document                    | sentence                   | token                                  | lemma                                  | pos                                         | embeddings                   | ner                                   | entities            |
|---:|:----------------------------|:---------------------------|:---------------------------------------|:---------------------------------------|:--------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------|
|  0 | ['Olá de John Snow Labs! '] | ['Olá de John Snow Labs!'] | ['Olá', 'de', 'John', 'Snow', 'Labs!'] | ['Olá', 'de', 'John', 'Snow', 'Labs!'] | ['PROPN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.0, 0.0, 0.0, 0.0,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|explain_document_md|
|Type:|pipeline|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|pt|

---
layout: model
title: Fast Neural Machine Translation Model from Artificial Languages to English
author: John Snow Labs
name: opus_mt_art_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, art, en, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `art`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_art_en_xx_2.7.0_2.4_1609168940602.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_art_en_xx_2.7.0_2.4_1609168940602.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_art_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])

light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

data = ["text to translate"]

result = light_pipeline.fullAnnotate(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_art_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("text to translate").toDS.toDF("text")

val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu

text = ["text to translate"]
opus_df = nlu.load('xx.art.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_art_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English BertForQuestionAnswering Tiny Cased model (from haritzpuerto)
author: John Snow Labs
name: bert_qa_tinybert_general_4l_312d_squad
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `TinyBERT_General_4L_312D-squad` is an English model originally trained by `haritzpuerto`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_tinybert_general_4l_312d_squad_en_4.0.0_3.0_1657182485738.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_tinybert_general_4l_312d_squad_en_4.0.0_3.0_1657182485738.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tinybert_general_4l_312d_squad", "en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tinybert_general_4l_312d_squad", "en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
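Extractive QA models of this kind score every candidate answer span in the context with start and end logits and return the best-scoring span. A minimal, library-independent sketch of that selection step (the helper and the toy logits are illustrative, not part of Spark NLP):

```python
def best_span(start_logits, end_logits, max_len=30):
    """Pick the (start, end) pair maximising start_logits[s] + end_logits[e], with s <= e."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        # Only consider spans of bounded length beginning at s.
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Toy example: the context token "Clara" gets the highest start and end scores.
tokens = ["What", "is", "my", "name", "?", "My", "name", "is", "Clara"]
start  = [0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 0.2, 0.1, 2.5]
end    = [0.0, 0.1, 0.0, 0.2, 0.1, 0.2, 0.1, 0.3, 2.8]
s, e = best_span(start, end)
print(tokens[s:e + 1])  # ['Clara']
```

Real implementations add details this sketch omits, such as excluding spans that fall inside the question or handling unanswerable questions.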
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_tinybert_general_4l_312d_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|54.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/haritzpuerto/TinyBERT_General_4L_312D-squad

---
layout: model
title: Javanese RobertaForMaskedLM Small Cased model (from w11wo)
author: John Snow Labs
name: roberta_embeddings_javanese_small_imdb
date: 2022-12-12
tags: [jv, open_source, roberta_embeddings, robertaformaskedlm]
task: Embeddings
language: jv
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `javanese-roberta-small-imdb` is a Javanese model originally trained by `w11wo`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_javanese_small_imdb_jv_4.2.4_3.0_1670858797384.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_javanese_small_imdb_jv_4.2.4_3.0_1670858797384.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_javanese_small_imdb", "jv") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_javanese_small_imdb", "jv")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
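The pipeline attaches one embedding vector to each token; downstream tasks typically compare these vectors with cosine similarity. A small self-contained sketch (toy 4-dimensional vectors stand in for the model's real per-token output; the vector values are invented for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Two similar vectors point in roughly the same direction; the third does not.
v_a = [0.9, 0.1, 0.3, 0.0]
v_b = [0.8, 0.2, 0.4, 0.1]
v_c = [0.0, 0.9, 0.0, 0.8]
print(cosine(v_a, v_b) > cosine(v_a, v_c))  # True
```

With real embeddings the same comparison is usually vectorised with NumPy over the `embeddings` column extracted from `result`.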
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_javanese_small_imdb|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|jv|
|Size:|468.6 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/w11wo/javanese-roberta-small-imdb
- https://arxiv.org/abs/1907.11692
- https://github.com/sgugger
- https://w11wo.github.io/

---
layout: model
title: Named Entity Recognition for Thai (GloVe 840B 300d)
author: John Snow Labs
name: ner_lst20_glove_840B_300d
date: 2021-01-11
task: Named Entity Recognition
language: th
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [th, ner, open_source]
supported: true
annotator: NerDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model annotates named entities in a text, which can be used to find features such as names of people, places, and organizations. The model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. This model uses the pre-trained `glove_840B_300` embeddings model from the `WordEmbeddings` annotator as an input, so be sure to use the same embeddings in the pipeline.

## Predicted Entities

Brands-`BRN`, Designations (position or job title)-`DES`, Date and time-`DTM`, Locations-`LOC`, Measurements-`MEA`, Names-`NAME`, Numbers-`NUM`, Organizations-`ORG`, Persons-`PER`, Terminology-`TRM`, Titles-`TTL`.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_EN/){:.button.button-orange.button-orange-trans.co.button-icon}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_lst20_glove_840B_300d_th_2.7.0_2.4_1610360616038.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ner_lst20_glove_840B_300d_th_2.7.0_2.4_1610360616038.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

word_segmenter = WordSegmenterModel.pretrained("wordseg_best", "th")\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", "xx")\
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

ner = NerDLModel.pretrained("ner_lst20_glove_840B_300d", "th") \
    .setInputCols(["document", "token", "embeddings"]) \
    .setOutputCol("ner")

...

pipeline = Pipeline(stages=[document_assembler, sentence_detector, word_segmenter, embeddings, ner, ner_converter])

example = spark.createDataFrame([['Mona Lisa เป็นภาพวาดสีน้ำมันในศตวรรษที่ 16 ที่สร้างโดย Leonardo จัดขึ้นที่พิพิธภัณฑ์ลูฟร์ในปารีส']], ["text"])

result = pipeline.fit(example).transform(example)
```

```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val word_segmenter = WordSegmenterModel.pretrained("wordseg_best", "th")
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", "xx")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")

val ner = NerDLModel.pretrained("ner_lst20_glove_840B_300d", "th")
    .setInputCols(Array("document", "token", "embeddings"))
    .setOutputCol("ner")

...

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter, embeddings, ner, ner_converter))

val data = Seq("Mona Lisa เป็นภาพวาดสีน้ำมันในศตวรรษที่ 16 ที่สร้างโดย Leonardo จัดขึ้นที่พิพิธภัณฑ์ลูฟร์ในปารีส").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu

text = ["""Mona Lisa เป็นภาพวาดสีน้ำมันในศตวรรษที่ 16 ที่สร้างโดย Leonardo จัดขึ้นที่พิพิธภัณฑ์ลูฟร์ในปารีส"""]
ner_df = nlu.load('th.ner.lst20.glove_840B_300D').predict(text, output_level='token')
ner_df
```
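This model emits LST20-style tags with a B_/I_/E_ scheme rather than plain IOB (the results show tags such as `B_PER` and `E_PER`). A plain-Python sketch of decoding such tags into entity spans (the helper is illustrative, not part of Spark NLP):

```python
def bioes_to_entities(tokens, tags):
    """Decode B_X / I_X / E_X / O tags into (entity_text, label) pairs.
    Spans left open at the end (B_ with no matching E_) are dropped."""
    entities, buf, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B_"):
            buf, label = [token], tag[2:]          # open a new span
        elif tag.startswith("I_") and label == tag[2:]:
            buf.append(token)                      # continue the open span
        elif tag.startswith("E_") and label == tag[2:]:
            buf.append(token)                      # close the span
            entities.append((" ".join(buf), label))
            buf, label = [], None
        else:  # "O" or an inconsistent tag discards any open span
            buf, label = [], None
    return entities

tokens = ["Mona", "Lisa", "เป็น", "ภาพวาด", "Leonardo"]
tags   = ["B_PER", "E_PER", "O", "O", "B_PER"]
print(bioes_to_entities(tokens, tags))
# [('Mona Lisa', 'PER')]
```

Production decoders are usually more lenient (e.g. accepting a lone `B_` tag as a single-token entity, as "Leonardo" appears in the results); this sketch keeps only well-formed B_…E_ spans.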
## Results

```bash
+----------+-----+
|token     |ner  |
+----------+-----+
|Mona      |B_PER|
|Lisa      |E_PER|
|เป็น       |O    |
|ภาพวาด     |O    |
|สีน้ำมัน      |O    |
|ใน        |O    |
|ศตวรรษ     |O    |
|ที่        |O    |
|16        |O    |
|ที่        |O    |
|สร้าง      |O    |
|โดย       |O    |
|Leonardo  |B_PER|
|จัด        |O    |
|ขึ้น        |O    |
|ที่        |O    |
|พิพิธภัณฑ์    |O    |
|ลูฟร์       |O    |
|ใน        |O    |
|ปารีส      |O    |
+----------+-----+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_lst20_glove_840B_300d|
|Type:|ner|
|Compatibility:|Spark NLP 2.7.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|th|

## Data Source

The model was trained on the [LST20 Corpus](https://aiat.or.th/lst20-corpus/) from the National Electronics and Computer Technology Center (NECTEC).

## Benchmarking

```bash
|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| B_BRN        | 0.26      | 0.23   | 0.24     | 47      |
| B_DES        | 0.92      | 0.89   | 0.91     | 1176    |
| B_DTM        | 0.83      | 0.81   | 0.82     | 1329    |
| B_LOC        | 0.75      | 0.70   | 0.72     | 2344    |
| B_MEA        | 0.76      | 0.80   | 0.78     | 3155    |
| B_NUM        | 0.71      | 0.59   | 0.64     | 1240    |
| B_ORG        | 0.79      | 0.78   | 0.78     | 4248    |
| B_PER        | 0.92      | 0.92   | 0.92     | 3269    |
| B_TRM        | 0.87      | 0.77   | 0.81     | 128     |
| B_TTL        | 0.97      | 0.98   | 0.98     | 1379    |
| E_BRN        | 0.86      | 0.75   | 0.80     | 8       |
| E_DES        | 0.94      | 0.82   | 0.88     | 198     |
| E_DTM        | 0.80      | 0.79   | 0.80     | 1151    |
| E_LOC        | 0.71      | 0.70   | 0.71     | 851     |
| E_MEA        | 0.69      | 0.77   | 0.73     | 830     |
| E_NUM        | 0.80      | 0.61   | 0.69     | 79      |
| E_ORG        | 0.80      | 0.76   | 0.78     | 2090    |
| E_PER        | 0.93      | 0.96   | 0.94     | 1586    |
| E_TRM        | 0.33      | 0.17   | 0.22     | 12      |
| I_BRN        | 0.75      | 0.60   | 0.67     | 5       |
| I_DES        | 0.79      | 0.63   | 0.70     | 204     |
| I_DTM        | 0.92      | 0.86   | 0.89     | 2969    |
| I_LOC        | 0.47      | 0.46   | 0.47     | 462     |
| I_MEA        | 0.64      | 0.74   | 0.69     | 935     |
| I_NUM        | 0.87      | 0.71   | 0.78     | 115     |
| I_ORG        | 0.81      | 0.75   | 0.78     | 3015    |
| I_PER        | 0.93      | 0.95   | 0.94     | 1604    |
| I_TRM        | 0.40      | 0.13   | 0.20     | 15      |
| I_TTL        | 0.67      | 0.50   | 0.57     | 4       |
| accuracy     |           |        | 0.95     | 207278  |
| macro avg    | 0.76      | 0.71   | 0.73     | 207278  |
| weighted avg | 0.95      | 0.95   | 0.95     | 207278  |
```

---
layout: model
title: Extract Negation and Uncertainty Entities from Spanish Medical Texts
author: John Snow Labs
name: ner_negation_uncertainty
date: 2022-08-13
tags: [es, clinical, licensed, ner, unc, usco, neg, nsco, negation, uncertainty]
task: Named Entity Recognition
language: es
edition: Healthcare NLP 4.0.2
spark_version: 3.0
supported: true
recommended: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This Named Entity Recognition model is intended for detecting relevant entities in Spanish medical texts. It was trained with the MedicalNerApproach annotator, which trains generic neural-network-based NER models. The model detects Negation Trigger (NEG), Negation Scope (NSCO), Uncertainty Trigger (UNC) and Uncertainty Scope (USCO).

## Predicted Entities

`NEG`, `UNC`, `USCO`, `NSCO`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_negation_uncertainty_es_4.0.2_3.0_1660357762363.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_negation_uncertainty_es_4.0.2_3.0_1660357762363.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained('ner_negation_uncertainty', "es", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["""Con diagnóstico probable de cirrosis hepática (no conocida previamente) y peritonitis espontanea primaria con tratamiento durante 8 dias con ceftriaxona en el primer ingreso (no se realizó paracentesis control por escasez de liquido). Lesión tumoral en hélix izquierdo de 0,5 cms. 
de diámetro susceptible de ca basocelular perlado."""]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documenter = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")
    .setInputCols(Array("sentence","token"))
    .setOutputCol("embeddings")

val ner_model = MedicalNerModel.pretrained("ner_negation_uncertainty", "es", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documenter, sentenceDetector, tokenizer, word_embeddings, ner_model, ner_converter))

// A plain Seq of strings produces the single string column the pipeline expects
val data = Seq("""Con diagnóstico probable de cirrosis hepática (no conocida previamente) y peritonitis espontanea primaria con tratamiento durante 8 dias con ceftriaxona en el primer ingreso (no se realizó paracentesis control por escasez de liquido). Lesión tumoral en hélix izquierdo de 0,5 cms. de diámetro susceptible de ca basocelular perlado.""").toDS().toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("es.med_ner.negtaion_uncertainty").predict("""Con diagnóstico probable de cirrosis hepática (no conocida previamente) y peritonitis espontanea primaria con tratamiento durante 8 dias con ceftriaxona en el primer ingreso (no se realizó paracentesis control por escasez de liquido). Lesión tumoral en hélix izquierdo de 0,5 cms. de diámetro susceptible de ca basocelular perlado.""")
```
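Downstream, each NEG/UNC trigger chunk can be paired with the NSCO/USCO scope chunk that follows it. Below is a minimal pure-Python sketch over already-extracted (chunk, label) pairs; it assumes the model emits a trigger immediately before its scope, as in the Results section of this card, and `pair_triggers_with_scopes` is an illustrative helper, not a Spark NLP API:

```python
def pair_triggers_with_scopes(chunks, pairs=(("NEG", "NSCO"), ("UNC", "USCO"))):
    """Pair each trigger chunk with the scope chunk that directly follows it.

    `chunks` is a list of (text, label) tuples in document order; returns
    (trigger_label, trigger_text, scope_text) triples.
    """
    scope_of = dict(pairs)
    out = []
    for (text, label), nxt in zip(chunks, chunks[1:] + [None]):
        if label in scope_of and nxt is not None and nxt[1] == scope_of[label]:
            out.append((label, text, nxt[0]))
    return out

chunks = [("probable de", "UNC"), ("cirrosis hepática", "USCO"),
          ("no", "NEG"), ("conocida previamente", "NSCO")]
print(pair_triggers_with_scopes(chunks))
# [('UNC', 'probable de', 'cirrosis hepática'), ('NEG', 'no', 'conocida previamente')]
```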
## Results

```bash
+------------------------------------------------------+---------+
|chunk |ner_label|
+------------------------------------------------------+---------+
|probable de |UNC |
|cirrosis hepática |USCO |
|no |NEG |
|conocida previamente |NSCO |
|no |NEG |
|se realizó paracentesis control por escasez de liquido|NSCO |
|susceptible de |UNC |
|ca basocelular perlado |USCO |
+------------------------------------------------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_negation_uncertainty|
|Compatibility:|Healthcare NLP 4.0.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|es|
|Size:|16.2 MB|

## References

The model is based on the following reference paper: Salvador Lima-López, Eulàlia Farré-Maduell, Antonio Miranda-Escalada, Vicent Briva-Iglesias and Martin Krallinger, "NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts". Procesamiento del Lenguaje Natural, Revista nº 67, septiembre de 2021, pp. 243-256.
## Benchmarking

```bash
label precision recall f1-score support
B-NEG 0.93 0.97 0.95 1409
I-NEG 0.80 0.90 0.85 119
B-UNC 0.82 0.85 0.83 395
I-UNC 0.77 0.78 0.77 166
B-USCO 0.76 0.79 0.77 394
I-USCO 0.61 0.81 0.69 1468
B-NSCO 0.92 0.92 0.92 1308
I-NSCO 0.87 0.89 0.88 3806
micro-avg 0.82 0.89 0.85 9065
macro-avg 0.81 0.86 0.83 9065
weighted-avg 0.83 0.89 0.86 9065
```

---
layout: model
title: English DistilBertForTokenClassification Cased model (from ml6team)
author: John Snow Labs
name: distilbert_token_classifier_keyphrase_extraction_openkp
date: 2023-03-06
tags: [en, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: en
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `keyphrase-extraction-distilbert-openkp` is an English model originally trained by `ml6team`.

## Predicted Entities

`KEY`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_openkp_en_4.3.1_3.0_1678133922182.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_keyphrase_extraction_openkp_en_4.3.1_3.0_1678133922182.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_openkp","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_keyphrase_extraction_openkp","en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
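The per-token `KEY` tags can be merged into keyphrases downstream. A minimal pure-Python sketch, assuming IOB2 tags (`B-KEY`, `I-KEY`, `O`) on the token sequence, which is the usual output shape for single-class token classifiers; `group_keyphrases` is an illustrative helper, not part of Spark NLP:

```python
def group_keyphrases(tokens, tags):
    """Merge consecutive B-KEY/I-KEY tokens into keyphrase strings."""
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-KEY":
            if current:                      # flush the previous phrase
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I-KEY" and current:
            current.append(token)
        else:                                # "O" closes any open phrase
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

print(group_keyphrases(["Keyphrase", "extraction", "is", "useful"],
                       ["B-KEY", "I-KEY", "O", "O"]))
# ['Keyphrase extraction']
```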
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_keyphrase_extraction_openkp| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ml6team/keyphrase-extraction-distilbert-openkp - https://github.com/microsoft/OpenKP - https://arxiv.org/abs/1911.02671 - https://paperswithcode.com/sota?task=Keyphrase+Extraction&dataset=openkp --- layout: model title: Smaller BERT Embeddings (L-10_H-128_A-2) author: John Snow Labs name: small_bert_L10_128 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L10_128_en_2.6.0_2.4_1598344364541.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L10_128_en_2.6.0_2.4_1598344364541.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L10_128", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L10_128", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L10_128').predict(text, output_level='token') embeddings_df ```
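To reduce these token embeddings to a single sentence vector, a common choice is mean pooling; inside a pipeline, Spark NLP's SentenceEmbeddings annotator covers the same need. Below is a pure-Python sketch of the operation itself (`mean_pool` is an illustrative helper, not a Spark NLP API):

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token embedding vectors
    into one sentence vector (assumes a non-empty list)."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```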
{:.h2_title}
## Results

```bash
token en_embed_bert_small_L10_128_embeddings
I [0.06678155064582825, 0.4304381012916565, 0.42...
love [-0.4905094504356384, 0.6187271475791931, 0.56...
NLP [0.17027147114276886, -0.49662041664123535, 1....
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|small_bert_L10_128|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|en|
|Dimension:|128|
|Case sensitive:|false|

{:.h2_title}
## Data Source

The model is imported from [https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/1](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/1)

---
layout: model
title: Legal Consulting Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_consulting_agreement_bert
date: 2022-11-24
tags: [en, legal, classification, agreement, consulting, licensed, bert]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The `legclf_consulting_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the `consulting-agreement` class or not (binary classification). Compared to the Longformer model, this model is lighter and faster at inference.
## Predicted Entities `consulting-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_consulting_agreement_bert_en_1.0.0_3.0_1669308795135.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_consulting_agreement_bert_en_1.0.0_3.0_1669308795135.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_consulting_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
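Once the predictions for a corpus have been collected from the `category` column, a quick sanity check is a label tally. A pure-Python sketch, assuming each prediction arrives as a single-element list, as in the Results section of this card; `category_counts` is an illustrative helper, not part of the library:

```python
def category_counts(predictions):
    """Tally document-level predictions such as [["consulting-agreement"], ["other"]]."""
    counts = {}
    for pred in predictions:
        # Each row's result is a one-element list; accept bare strings too.
        label = pred[0] if isinstance(pred, list) else pred
        counts[label] = counts.get(label, 0) + 1
    return counts

print(category_counts([["consulting-agreement"], ["other"], ["other"]]))
# {'consulting-agreement': 1, 'other': 2}
```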
## Results

```bash
+----------------------+
|result                |
+----------------------+
|[consulting-agreement]|
|[other]               |
|[other]               |
|[consulting-agreement]|
+----------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_consulting_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.8 MB|

## References

Legal documents scraped from the Internet and classified in-house, plus SEC documents.

## Benchmarking

```bash
label precision recall f1-score support
consulting-agreement 0.88 0.93 0.90 30
other 0.97 0.93 0.95 61
accuracy - - 0.93 91
macro-avg 0.92 0.93 0.93 91
weighted-avg 0.94 0.93 0.93 91
```

---
layout: model
title: Part of Speech for Galician
author: John Snow Labs
name: pos_ud_treegal
date: 2020-07-29 23:34:00 +0800
task: Part of Speech Tagging
language: gl
edition: Spark NLP 2.5.5
spark_version: 2.4
tags: [pos, gl]
supported: true
annotator: PerceptronModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

{:.h2_title}
## Description

This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part-of-speech model is useful for automatically extracting the grammatical structure of a piece of text.
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_treegal_gl_2.5.5_2.4_1596053906222.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_treegal_gl_2.5.5_2.4_1596053906222.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_treegal", "gl") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Ademais de ser o rei do norte, John Snow é un médico inglés e un líder no desenvolvemento da anestesia e a hixiene médica.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_treegal", "gl") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Ademais de ser o rei do norte, John Snow é un médico inglés e un líder no desenvolvemento da anestesia e a hixiene médica.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Ademais de ser o rei do norte, John Snow é un médico inglés e un líder no desenvolvemento da anestesia e a hixiene médica."""] pos_df = nlu.load('gl.pos').predict(text, output_level='token') pos_df ```
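The `fullAnnotate` output can be reshaped into (word, tag) pairs for downstream use. A pure-Python sketch, assuming each annotation exposes the tag as `result` and the surface form under `metadata["word"]`, matching the Row output in the Results section; `word_tag_pairs` is an illustrative helper, not a Spark NLP API:

```python
def word_tag_pairs(rows):
    """Rebuild (word, tag) pairs from fullAnnotate-style POS rows,
    where each row is a dict with "result" and metadata["word"]."""
    return [(r["metadata"]["word"], r["result"]) for r in rows]

rows = [{"result": "ADV", "metadata": {"word": "Ademais"}},
        {"result": "ADP", "metadata": {"word": "de"}},
        {"result": "AUX", "metadata": {"word": "ser"}}]
print(word_tag_pairs(rows))
# [('Ademais', 'ADV'), ('de', 'ADP'), ('ser', 'AUX')]
```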
{:.h2_title}
## Results

```bash
[Row(annotatorType='pos', begin=0, end=6, result='ADV', metadata={'word': 'Ademais'}),
Row(annotatorType='pos', begin=8, end=9, result='ADP', metadata={'word': 'de'}),
Row(annotatorType='pos', begin=11, end=13, result='AUX', metadata={'word': 'ser'}),
Row(annotatorType='pos', begin=15, end=15, result='DET', metadata={'word': 'o'}),
Row(annotatorType='pos', begin=17, end=19, result='NOUN', metadata={'word': 'rei'}),
...]
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pos_ud_treegal|
|Type:|pos|
|Compatibility:|Spark NLP 2.5.5+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[pos]|
|Language:|gl|
|Case sensitive:|false|
|License:|Open Source|

{:.h2_title}
## Data Source

The model is imported from [https://universaldependencies.org](https://universaldependencies.org)

---
layout: model
title: Extract the Names of Drugs & Chemicals
author: John Snow Labs
name: ner_chemd_clinical
date: 2021-11-04
tags: [chemdner, chemd, ner, clinical, en, licensed]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.3.0
spark_version: 2.4
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is trained with the `embeddings_clinical` word embeddings to extract the names of chemical compounds and drugs in medical texts. The entities that can be detected are as follows:

- `SYSTEMATIC`: Systematic names of chemical mentions, e.g. IUPAC and IUPAC-like names. (e.g. 2-Acetoxybenzoic acid, 2-Acetoxybenzenecarboxylic acid)
- `IDENTIFIERS`: Database identifiers of chemicals: CAS numbers, PubChem identifiers, registry numbers and ChEBI and CHEMBL ids. (e.g. CAS Registry Number: 501-36-0445154 CID 445154, CHEBI:28262, CHEMBL504)
- `FORMULA`: Mentions of molecular formula, SMILES, InChI, InChIKey. (e.g. CC(=O)Oc1ccccc1C(=O)O, C9H8O4)
- `TRIVIAL`: Trivial, trade (brand), common or generic names of compounds.
It includes International Nonproprietary Name (INN) as well as British Approved Name (BAN) and United States Adopted Name (USAN). (e.g. Aspirin, Acylpyrin, paracetamol) - `ABBREVIATION` : Mentions of abbreviations and acronyms of chemicals compounds and drugs. (e.g. DMSO, GABA) - `FAMILY`: Chemical families that can be associated to some chemical structure are also included. (e.g. Iodopyridazines (FAMILY- SYSTEMATIC)) - `MULTIPLE` : Mentions that do correspond to chemicals that are not described in a continuous string of characters. This is often the case of mentions of multiple chemicals joined by coordinated clauses. (e.g. thieno2,3-d and thieno3,2-d fused oxazin-4-ones) ## Predicted Entities `SYSTEMATIC`, `IDENTIFIERS`, `FORMULA`, `TRIVIAL`, `ABBREVIATION`, `FAMILY`, `MULTIPLE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CHEMD/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chemd_clinical_en_3.3.0_2.4_1636027285679.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chemd_clinical_en_3.3.0_2.4_1636027285679.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
import pandas as pd

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

chemd_ner = MedicalNerModel.pretrained('ner_chemd_clinical', 'en', 'clinical/models') \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, word_embeddings, chemd_ner, ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chemd_model = nlpPipeline.fit(empty_data)

results = chemd_model.transform(spark.createDataFrame(pd.DataFrame({"text": ["""Isolation, Structure Elucidation, and Iron-Binding Properties of Lystabactins, Siderophores Isolated from a Marine Pseudoalteromonas sp. The marine bacterium Pseudoalteromonas sp. S2B, isolated from the Gulf of Mexico after the Deepwater Horizon oil spill, was found to produce lystabactins A, B, and C (1-3), three new siderophores. The structures were elucidated through mass spectrometry, amino acid analysis, and NMR. The lystabactins are composed of serine (Ser), asparagine (Asn), two formylated/hydroxylated ornithines (FOHOrn), dihydroxy benzoic acid (Dhb), and a very unusual nonproteinogenic amino acid, 4,8-diamino-3-hydroxyoctanoic acid (LySta). 
The iron-binding properties of the compounds were investigated through a spectrophotometric competition."""]})))
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val chemd_ner = MedicalNerModel.pretrained("ner_chemd_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val nlpPipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, chemd_ner, ner_converter))

val data = Seq("""Isolation, Structure Elucidation, and Iron-Binding Properties of Lystabactins, Siderophores Isolated from a Marine Pseudoalteromonas sp. The marine bacterium Pseudoalteromonas sp. S2B, isolated from the Gulf of Mexico after the Deepwater Horizon oil spill, was found to produce lystabactins A, B, and C (1-3), three new siderophores. The structures were elucidated through mass spectrometry, amino acid analysis, and NMR. The lystabactins are composed of serine (Ser), asparagine (Asn), two formylated/hydroxylated ornithines (FOHOrn), dihydroxy benzoic acid (Dhb), and a very unusual nonproteinogenic amino acid, 4,8-diamino-3-hydroxyoctanoic acid (LySta). The iron-binding properties of the compounds were investigated through a spectrophotometric competition.""").toDS.toDF("text")

val result = nlpPipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.chemd").predict("""Isolation, Structure Elucidation, and Iron-Binding Properties of Lystabactins, Siderophores Isolated from a Marine Pseudoalteromonas sp. The marine bacterium Pseudoalteromonas sp. S2B, isolated from the Gulf of Mexico after the Deepwater Horizon oil spill, was found to produce lystabactins A, B, and C (1-3), three new siderophores. The structures were elucidated through mass spectrometry, amino acid analysis, and NMR. The lystabactins are composed of serine (Ser), asparagine (Asn), two formylated/hydroxylated ornithines (FOHOrn), dihydroxy benzoic acid (Dhb), and a very unusual nonproteinogenic amino acid, 4,8-diamino-3-hydroxyoctanoic acid (LySta). The iron-binding properties of the compounds were investigated through a spectrophotometric competition.""")
```
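When only some of the seven chemical classes matter downstream, the extracted chunks can be filtered by label; inside the pipeline, NerConverter's whitelist serves the same purpose. A pure-Python sketch over already-extracted (chunk, label) pairs (`filter_by_label` is an illustrative helper, not a Spark NLP API):

```python
def filter_by_label(chunks, keep):
    """Keep only the (chunk, label) pairs whose label is in `keep`."""
    keep = set(keep)
    return [(text, label) for text, label in chunks if label in keep]

chunks = [("Aspirin", "TRIVIAL"), ("C9H8O4", "FORMULA"), ("DMSO", "ABBREVIATION")]
print(filter_by_label(chunks, ["TRIVIAL", "FORMULA"]))
# [('Aspirin', 'TRIVIAL'), ('C9H8O4', 'FORMULA')]
```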
## Results ```bash +----------------------------------+------------+ |chunk |ner_label | +----------------------------------+------------+ |Lystabactins |FAMILY | |lystabactins A, B, and C |MULTIPLE | |amino acid |FAMILY | |lystabactins |FAMILY | |serine |TRIVIAL | |Ser |FORMULA | |asparagine |TRIVIAL | |Asn |FORMULA | |formylated/hydroxylated ornithines|FAMILY | |FOHOrn |FORMULA | |dihydroxy benzoic acid |SYSTEMATIC | |amino acid |FAMILY | |4,8-diamino-3-hydroxyoctanoic acid|SYSTEMATIC | |LySta |ABBREVIATION| +----------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_chemd_clinical| |Compatibility:|Healthcare NLP 3.3.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source CHEMDNER Corpus. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331685/ ## Benchmarking ```bash +------------+------+-----+-----+------+---------+------+------+ | entity| tp| fp| fn| total|precision|recall| f1| +------------+------+-----+-----+------+---------+------+------+ | FORMULA|2296.0|196.0| 99.0|2395.0| 0.9213|0.9587|0.9396| | IDENTIFIER| 294.0| 33.0| 25.0| 319.0| 0.8991|0.9216|0.9102| | MULTIPLE| 310.0|284.0| 58.0| 368.0| 0.5219|0.8424|0.6445| | FAMILY|1865.0|277.0|347.0|2212.0| 0.8707|0.8431|0.8567| |ABBREVIATION|1648.0|188.0|158.0|1806.0| 0.8976|0.9125| 0.905| | SYSTEMATIC|3381.0|336.0|307.0|3688.0| 0.9096|0.9168|0.9132| | TRIVIAL|3862.0|255.0|241.0|4103.0| 0.9381|0.9413|0.9397| +------------+------+-----+-----+------+---------+------+------+ +------------------+ | macro| +------------------+ |0.8726928630563308| +------------------+ +------------------+ | micro| +------------------+ |0.9086394322529923| +------------------+ ``` --- layout: model title: Japanese Electra Embeddings (from Cinnamon) author: John Snow Labs name: electra_embeddings_electra_small_japanese_generator date: 2022-05-17 tags: [ja, open_source, electra, 
embeddings]
task: Embeddings
language: ja
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-small-japanese-generator` is a Japanese model originally trained by `Cinnamon`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_small_japanese_generator_ja_3.4.4_3.0_1652786693322.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_small_japanese_generator_ja_3.4.4_3.0_1652786693322.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_small_japanese_generator","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Spark NLPが大好きです"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_small_japanese_generator","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Spark NLPが大好きです").toDF("text") val result = pipeline.fit(data).transform(data) ```
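Token vectors from the `embeddings` column can then be compared with cosine similarity, e.g. to find tokens with similar contexts. A self-contained sketch (`cosine` is an illustrative helper, not a Spark NLP API):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors
    (assumes equal length and non-zero norms)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0
```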
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|electra_embeddings_electra_small_japanese_generator|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|ja|
|Size:|52.4 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/Cinnamon/electra-small-japanese-generator
- https://openreview.net/pdf?id=r1xMH1BtvB
- https://dumps.wikimedia.org/jawiki/latest
- https://www.aclweb.org/anthology/P16-1162.pdf
- https://github.com/neologd/mecab-ipadic-neologd

---
layout: model
title: Sentence Entity Resolver for UMLS CUI Codes
author: John Snow Labs
name: sbiobertresolve_umls_major_concepts
date: 2021-05-16
tags: [entity_resolution, clinical, licensed, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.0.4
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model maps clinical entities and concepts to four major categories of UMLS CUI codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It loads about 6X faster than previous versions. The load process is also more memory-friendly: the peak memory required during loading is lower, which reduces the chance of OOM exceptions and relaxes hardware requirements.
## Predicted Entities

This model returns CUI (concept unique identifier) codes for `Clinical Findings`, `Medical Devices`, `Anatomical Structures` and `Injuries & Poisoning` terms.

{:.btn-box}
[Live Demo](http://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_major_concepts_en_3.0.4_3.0_1621188910976.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_umls_major_concepts_en_3.0.4_3.0_1621188910976.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use

The ```sbiobertresolve_umls_major_concepts``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_jsl``` as the NER model, with ```Cerebrovascular_Disease, Communicable_Disease, Diabetes, Disease_Syndrome_Disorder, Heart_Disease, Hyperlipidemia, Hypertension, Injury_or_Poisoning, Kidney_Disease, Medical-Device, Obesity, Oncological, Overweight, Psychological_Condition, Symptom, VS_Finding, ImagingFindings, EKG_Findings``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_umls_major_concepts", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") pipeline = Pipeline(stages = [document_assembler, sentence_detector, tokens, embeddings, ner, ner_converter, chunk2doc, sbert_embedder, resolver]) data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting."""]]).toDF("text") results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.umls").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.""") ```
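The resolver ranks candidate CUI codes by embedding distance. Once the top-k candidates and their distances have been collected from the output metadata (the exact metadata keys vary by Spark NLP version), selecting the closest code is a one-liner. A pure-Python sketch (`best_code` is an illustrative helper, not a Spark NLP API):

```python
def best_code(candidates):
    """Pick the CUI with the smallest distance from (cui, distance) pairs."""
    return min(candidates, key=lambda c: c[1])[0]

print(best_code([("C2183115", 0.01), ("C0085207", 0.12)]))  # C2183115
```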
## Results

```bash
|    | ner_chunk                     | resolution |
|---:|:------------------------------|:-----------|
|  0 | 28-year-old                   | C1864118   |
|  1 | female                        | C3887375   |
|  2 | gestational diabetes mellitus | C2183115   |
|  3 | eight years prior             | C5195266   |
|  4 | subsequent                    | C3844350   |
|  5 | type two diabetes mellitus    | C4014362   |
|  6 | T2DM                          | C4014362   |
|  7 | HTG-induced pancreatitis      | C4554179   |
|  8 | three years prior             | C1866782   |
|  9 | acute                         | C1332147   |
| 10 | hepatitis                     | C1963279   |
| 11 | obesity                       | C1963185   |
| 12 | body mass index               | C0578022   |
| 13 | 33.5 kg/m2                    | C2911054   |
| 14 | one-week                      | C0420331   |
| 15 | polyuria                      | C3278312   |
| 16 | polydipsia                    | C3278316   |
| 17 | poor appetite                 | C0541799   |
| 18 | vomiting                      | C1963281   |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sbiobertresolve_umls_major_concepts|
|Compatibility:|Healthcare NLP 3.0.4+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[umls_code]|
|Language:|en|
|Case sensitive:|false|

---
layout: model
title: English image_classifier_vit_Check_Aligned_Teeth ViTForImageClassification from steven123
author: John Snow Labs
name: image_classifier_vit_Check_Aligned_Teeth
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_Check_Aligned_Teeth` is an English model originally trained by steven123.
## Predicted Entities `Aligned Teeth`, `Crooked Teeth` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Check_Aligned_Teeth_en_4.1.0_3.0_1660168392408.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Check_Aligned_Teeth_en_4.1.0_3.0_1660168392408.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_Check_Aligned_Teeth", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_Check_Aligned_Teeth", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
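For intuition, the classification head ends in a softmax over the two predicted labels. The toy sketch below uses made-up logits (not actual model outputs) to show how a label and confidence score are derived:

```python
import math

labels = ["Aligned Teeth", "Crooked Teeth"]

def predict(logits, labels):
    """Softmax the raw scores and return (best_label, probability)."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical logits for a single image; the real model emits these
# from its ViT classification head.
label, prob = predict([2.1, -0.4], labels)
print(label, round(prob, 3))
```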
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_Check_Aligned_Teeth|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|

---
layout: model
title: Detect Entities in Twitter texts
author: John Snow Labs
name: bert_token_classifier_ner_btc
date: 2021-09-09
tags: [en, open_source, ner, btc, twitter]
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.2.2
spark_version: 2.4
supported: true
annotator: BertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

- This model is trained on the Broad Twitter Corpus (BTC) dataset, so it can reliably detect entities in Twitter-based texts.
- It is based on `bert_base_cased` embeddings, which are included in the model, so you don't need a separate embeddings component in the NLP pipeline.

## Predicted Entities

`PER`, `LOC`, `ORG`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_BTC/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_BTC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_ner_btc_en_3.2.2_2.4_1631195072459.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_ner_btc_en_3.2.2_2.4_1631195072459.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_btc", "en")\
.setInputCols("token", "document")\
.setOutputCol("ner")\
.setCaseSensitive(True)

ner_converter = NerConverter()\
.setInputCols(["document","token","ner"])\
.setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter])

model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']})))

test_sentences = ["""Pentagram's Dominic Lippa is working on a new identity for University of Arts London."""]
result = model.transform(spark.createDataFrame(pd.DataFrame({'text': test_sentences})))
```
```scala
...
val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_btc", "en")
.setInputCols("token", "document")
.setOutputCol("ner")
.setCaseSensitive(true)

val ner_converter = new NerConverter()
.setInputCols(Array("document","token","ner"))
.setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter))

val data = Seq("Pentagram's Dominic Lippa is working on a new identity for University of Arts London.").toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.classifier_ner_btc").predict("""Pentagram's Dominic Lippa is working on a new identity for University of Arts London.""")
```
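The `NerConverter` stage merges token-level BIO tags into entity chunks. A simplified pure-Python sketch of that merging logic, using hypothetical tokens and tags rather than actual model output:

```python
def bio_to_chunks(tokens, tags):
    """Merge token-level BIO tags into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and label != tag[2:]):
            # A new entity begins: flush any open chunk first.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)  # continuation of the open entity
        else:  # "O" closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(bio_to_chunks(
    ["Dominic", "Lippa", "works", "at", "Pentagram"],
    ["B-PER", "I-PER", "O", "O", "B-ORG"],
))  # -> [('Dominic Lippa', 'PER'), ('Pentagram', 'ORG')]
```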
## Results

```bash
+--------------------------+---------+
|chunk                     |ner_label|
+--------------------------+---------+
|Pentagram's               |ORG      |
|Dominic Lippa             |PER      |
|University of Arts London |ORG      |
+--------------------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_btc|
|Compatibility:|Spark NLP 3.2.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[ner]|
|Language:|en|
|Case sensitive:|true|
|Max sentence length:|128|

## Data Source

https://github.com/juand-r/entity-recognition-datasets/tree/master/data/BTC

## Benchmarking

```bash
label         precision  recall  f1-score  support
B-LOC         0.90       0.79    0.84      536
B-ORG         0.80       0.79    0.79      821
B-PER         0.95       0.62    0.75      1575
I-LOC         0.96       0.76    0.85      181
I-ORG         0.88       0.81    0.84      217
I-PER         0.99       0.91    0.95      315
O             0.97       0.99    0.98      26217
accuracy      -          -       0.96      29862
macro-avg     0.92       0.81    0.86      29862
weighted-avg  0.96       0.96    0.96      29862
```

---
layout: model
title: English RoBERTa Embeddings (Base, Titles)
author: John Snow Labs
name: roberta_embeddings_distilroberta_base_finetuned_jira_qt_issue_title
date: 2022-04-14
tags: [roberta, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilroberta-base-finetuned-jira-qt-issue-title` is an English model originally trained by `ietz`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distilroberta_base_finetuned_jira_qt_issue_title_en_3.4.2_3.0_1649947342812.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distilroberta_base_finetuned_jira_qt_issue_title_en_3.4.2_3.0_1649947342812.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_distilroberta_base_finetuned_jira_qt_issue_title","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_distilroberta_base_finetuned_jira_qt_issue_title","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.distilroberta_base_finetuned_jira_qt_issue_title").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_distilroberta_base_finetuned_jira_qt_issue_title| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|309.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/ietz/distilroberta-base-finetuned-jira-qt-issue-title --- layout: model title: Galician RoBERTa Embeddings Cased model (from mrm8488) author: John Snow Labs name: roberta_embeddings_robertinh date: 2022-07-14 tags: [gl, open_source, roberta, embeddings] task: Embeddings language: gl edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `RoBERTinha` is a Galician model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_robertinh_gl_4.0.0_3.0_1657810028126.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_robertinh_gl_4.0.0_3.0_1657810028126.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")

embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_robertinh","gl") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_robertinh","gl")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_robertinh| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|gl| |Size:|314.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/mrm8488/RoBERTinha --- layout: model title: English BertForMaskedLM Base Cased model (from ayansinha) author: John Snow Labs name: bert_embeddings_lic_class_scancode_base_cased_l32_1 date: 2022-12-02 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `lic-class-scancode-bert-base-cased-L32-1` is a English model originally trained by `ayansinha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_lic_class_scancode_base_cased_l32_1_en_4.2.4_3.0_1670022582962.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_lic_class_scancode_base_cased_l32_1_en_4.2.4_3.0_1670022582962.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_lic_class_scancode_base_cased_l32_1","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_lic_class_scancode_base_cased_l32_1","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_lic_class_scancode_base_cased_l32_1|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|406.4 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/ayansinha/lic-class-scancode-bert-base-cased-L32-1
- https://github.com/nexB/scancode-results-analyzer
- https://github.com/nexB/scancode-results-analyzer#quickstart---local-machine
- https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py

---
layout: model
title: English BertForQuestionAnswering model (from xraychen)
author: John Snow Labs
name: bert_qa_mqa_sim
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mqa-sim` is an English model originally trained by `xraychen`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mqa_sim_en_4.0.0_3.0_1654188374685.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mqa_sim_en_4.0.0_3.0_1654188374685.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mqa_sim","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer") \
.setCaseSensitive(True)

pipeline = Pipeline().setStages([
document_assembler,
spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
.setInputCols("question", "context")
.setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
.pretrained("bert_qa_mqa_sim","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)
.setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
("What's my name?", "My name is Clara and I live in Berkeley."))
.toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.bert.sim.by_xraychen").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
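For intuition, extractive QA heads score every context token as a possible answer start and answer end, then pick the highest-scoring valid span. A toy sketch with made-up scores (the real model produces logits over wordpiece tokens):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick (start, end) maximizing start+end score with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            if ss + end_scores[e] > best_score:
                best_score, best = ss + end_scores[e], (s, e)
    return best

context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Hypothetical per-token scores for the question "What's my name?"
start = [0.1, 0.2, 0.1, 3.0, 0.1, 0.0, 0.1, 0.0, 0.5, 0.0]
end   = [0.0, 0.1, 0.2, 2.8, 0.1, 0.0, 0.0, 0.1, 0.4, 0.0]

s, e = best_span(start, end)
print(" ".join(context[s:e + 1]))  # -> Clara
```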
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_mqa_sim| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/xraychen/mqa-sim --- layout: model title: BERT Sentence Embeddings (Base Cased) author: John Snow Labs name: sent_bert_base_cased date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains a deep bidirectional transformer trained on Wikipedia and the BookCorpus. The details are described in the paper "[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_base_cased_en_2.6.0_2.4_1598346030732.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_base_cased_en_2.6.0_2.4_1598346030732.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
.setInputCols("sentence") \
.setOutputCol("sentence_embeddings")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])

pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"]))
```
```scala
...
val embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")
.setInputCols("sentence")
.setOutputCol("sentence_embeddings")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))

val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu

text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('en.embed_sentence.bert_base_cased').predict(text, output_level='sentence')
embeddings_df
```
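Downstream, sentence embeddings like these are typically compared with cosine similarity. A minimal sketch on toy three-dimensional vectors (stand-ins for the model's 768-dimensional outputs):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy stand-ins for two rows of the sentence_embeddings column.
v1 = [0.68, 0.05, -0.21]
v2 = [0.35, -0.07, 0.12]

print(round(cosine(v1, v2), 3))
```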
{:.h2_title}
## Results

```bash
en_embed_sentence_bert_base_cased_embeddings	sentence

[0.6762558221817017, 0.05294809862971306, -0.2...	I hate cancer
[0.3460788130760193, -0.06936988979578018, 0.1...	Antibiotics aren't painkiller
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|sent_bert_base_cased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|[en]|
|Dimension:|768|
|Case sensitive:|true|

{:.h2_title}
## Data Source

The model is imported from [https://tfhub.dev/google/bert_cased_L-24_H-1024_A-16/1](https://tfhub.dev/google/bert_cased_L-24_H-1024_A-16/1)

---
layout: model
title: Translate Turkic languages to English Pipeline
author: John Snow Labs
name: translate_trk_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, trk, en, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `trk` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_trk_en_xx_2.7.0_2.4_1609688163945.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_trk_en_xx_2.7.0_2.4_1609688163945.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_trk_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_trk_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.trk.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_trk_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Option Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_option_agreement_bert date: 2022-12-06 tags: [en, legal, classification, agreement, option, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_option_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `option-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `option-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_option_agreement_bert_en_1.0.0_3.0_1670349613252.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_option_agreement_bert_en_1.0.0_3.0_1670349613252.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_option_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+------------------+
|result            |
+------------------+
|[option-agreement]|
|[other]           |
|[other]           |
|[option-agreement]|
+------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_option_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|

## References

Legal documents, scraped from the Internet and classified in-house, plus SEC documents

## Benchmarking

```bash
label             precision  recall  f1-score  support
option-agreement  0.94       0.82    0.88      39
other             0.90       0.97    0.93      65
accuracy          -          -       0.91      104
macro-avg         0.92       0.89    0.91      104
weighted-avg      0.92       0.91    0.91      104
```

---
layout: model
title: Chinese BertForMaskedLM Cased model (from uer)
author: John Snow Labs
name: bert_embeddings_chinese_roberta_l_12_h_512
date: 2022-12-06
tags: [zh, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: zh
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-12_H-512` is a Chinese model originally trained by `uer`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_512_zh_4.2.4_3.0_1670325817656.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_512_zh_4.2.4_3.0_1670325817656.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_512","zh") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_512","zh")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_12_h_512|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|184.8 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/uer/chinese_roberta_L-12_H-512
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://cloud.tencent.com/

---
layout: model
title: Conditional Random Field Based Named Entity Recognizer
author: John Snow Labs
name: ner_crf
date: 2021-03-28
tags: [ner, named_entity_recognition, crf, named_entity_recognition_crf, ner_crf, en, open_source]
supported: true
task: Named Entity Recognition
language: en
nav_key: models
edition: Spark NLP 3.0.0
spark_version: 3.0
annotator: NerCrfModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This Named Entity Recognizer is based on a [CRF Algorithm](https://en.wikipedia.org/wiki/Conditional_random_field).

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/NER_DA/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_crf_en_3.0.0_3.0_1616965829076.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ner_crf_en_3.0.0_3.0_1616965829076.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

posTagger = PerceptronModel.pretrained().setInputCols(["token", "document"]).setOutputCol("pos")

embeds = WordEmbeddingsModel.pretrained().setInputCols(["token", "document"]).setOutputCol("embeddings")

nerCrf = NerCrfModel.pretrained().setInputCols(["document", "token", "pos", "embeddings"]).setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, posTagger, embeds, nerCrf])

df = spark.createDataFrame([['Donald Trump and Angela Merkel dont share many oppinions']], ["text"])

result = pipeline.fit(df).transform(df)
result.select("ner.result").show(truncate=False)
result.select("ner").show(truncate=False)
```
```scala
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val tokenizer = new Tokenizer().setInputCols(Array("document")).setOutputCol("token")

val posTagger = PerceptronModel.pretrained().setInputCols(Array("token", "document")).setOutputCol("pos")

val embeds = WordEmbeddingsModel.pretrained().setInputCols(Array("token", "document")).setOutputCol("embeddings")

val nerCrf = NerCrfModel.pretrained().setInputCols(Array("document", "token", "pos", "embeddings")).setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, posTagger, embeds, nerCrf))

val df = Seq("Donald Trump and Angela Merkel dont share many oppinions").toDF("text")

val result = pipeline.fit(df).transform(df)
result.select("ner.result").show(false)
result.select("ner").show(false)
```

{:.nlu-block}
```python
import nlu
nlu.load('ner.crf').predict("Donald Trump and Angela Merkel dont share many oppinions")
```
## Results ```bash +-------------------------------------------+ |result | +-------------------------------------------+ |[I-PER, I-PER, O, I-PER, I-PER, O, O, O, O]| +-------------------------------------------+ +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |ner | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |[[named_entity, 0, 5, I-PER, [word -> Donald], []], [named_entity, 7, 11, I-PER, [word -> Trump], []], [named_entity, 13, 15, O, [word -> and], []], [named_entity, 17, 22, I-PER, [word -> Angela], []], [named_entity, 24, 29, I-PER, [word -> Merkel], []], [named_entity, 31, 34, O, [word -> dont], []], [named_entity, 36, 40, O, [word -> share], []], [named_entity, 42, 45, O, [word -> many], []], [named_entity, 47, 55, O, [word -> oppinions], []]]| 
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_crf| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, pos, embeddings]| |Output Labels:|[ner]| |Language:|en| --- layout: model title: Legal NER for Indian Court Documents author: John Snow Labs name: legner_indian_court_judgement date: 2022-10-25 tags: [en, legal, ner, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an NER model trained on Indian court dataset, aimed to extract the following entities from judgement documents. ## Predicted Entities `COURT`, `PETITIONER`, `RESPONDENT`, `JUDGE`, `DATE`, `ORG`, `GPE`, `STATUTE`, `PROVISION`, `PRECEDENT`, `CASE_NUMBER`, `WITNESS`, `OTHER_PERSON` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_indian_court_judgement_en_1.0.0_3.0_1666698501448.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_indian_court_judgement_en_1.0.0_3.0_1666698501448.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ .setCleanupMode("shrink") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_base_cased", "en")\ .setInputCols("sentence", "token")\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_indian_court_judgement", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""Let fresh bailable warrant of Rs.20,000/- (Rupees Twenty Thousand) be issued through Superintendent of Police, Dhar to the respondents No.1 Sikandar and No.2 Aziz for a date to be fixed by the Registry to secure the presence of the respondents No.1 and 2, made returnable within six weeks. P.K.Jaiswal) Judge (Jarat Kumar Jain) Judge ns. 
W.P.No.1361/2013 14/12/2015 Parties through their Counsel."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") .setCleanupMode("shrink") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_base_cased", "en") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") .setMaxSentenceLength(512) .setCaseSensitive(true) val ner_model = NerModel.pretrained("legner_indian_court_judgement", "en", "legal/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""Let fresh bailable warrant of Rs.20,000/- (Rupees Twenty Thousand) be issued through Superintendent of Police, Dhar to the respondents No.1 Sikandar and No.2 Aziz for a date to be fixed by the Registry to secure the presence of the respondents No.1 and 2, made returnable within six weeks. P.K.Jaiswal) Judge (Jarat Kumar Jain) Judge ns. W.P.No.1361/2013 14/12/2015 Parties through their Counsel.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
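`NerConverter` builds the `ner_chunk` column by merging runs of `B-`/`I-` tags with the same entity type into single chunks. A simplified pure-Python sketch of that merging logic — illustrative only, the annotator itself handles more edge cases:

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, entity_label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            if current:                     # close any open chunk first
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-"):
            current.append(tok)             # continue the open chunk
        else:                               # "O" ends any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(iob_to_chunks(
    ["P.K.Jaiswal", ")", "Judge", "(", "Jarat", "Kumar", "Jain"],
    ["B-JUDGE", "O", "O", "O", "B-JUDGE", "I-JUDGE", "I-JUDGE"]))
# [('P.K.Jaiswal', 'JUDGE'), ('Jarat Kumar Jain', 'JUDGE')]
```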
## Results ```bash +----------------+-----------+ |chunk           |label      | +----------------+-----------+ |Dhar            |GPE        | |Sikandar        |RESPONDENT | |Aziz            |RESPONDENT | |P.K.Jaiswal     |JUDGE      | |Jarat Kumar Jain|JUDGE      | |W.P.No.1361/2013|CASE_NUMBER| |14/12/2015      |DATE       | +----------------+-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_indian_court_judgement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.4 MB| ## References Training data is available [here](https://github.com/Legal-NLP-EkStep/legal_NER#3-data). ## Benchmarking ```bash label precision recall f1-score support CASE_NUMBER 0.83 0.80 0.82 112 COURT 0.92 0.94 0.93 140 DATE 0.97 0.97 0.97 204 GPE 0.81 0.75 0.78 95 JUDGE 0.84 0.86 0.85 57 ORG 0.75 0.76 0.76 131 OTHER_PERSON 0.83 0.90 0.86 241 PETITIONER 0.76 0.61 0.68 36 PRECEDENT 0.84 0.84 0.84 127 PROVISION 0.90 0.94 0.92 220 RESPONDENT 0.64 0.70 0.67 23 STATUTE 0.92 0.96 0.94 157 WITNESS 0.93 0.78 0.85 87 micro-avg 0.87 0.87 0.87 1630 macro-avg 0.84 0.83 0.83 1630 weighted-avg 0.87 0.87 0.87 1630 ``` --- layout: model title: Pipeline to Detect Clinical Conditions (ner_eu_clinical_condition) author: John Snow Labs name: ner_eu_clinical_condition_pipeline date: 2023-03-07 tags: [en, clinical, licensed, ner] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_eu_clinical_condition](https://nlp.johnsnowlabs.com/2023/02/06/ner_eu_clinical_condition_en.html) model. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_pipeline_en_4.3.0_3.2_1678213988790.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_pipeline_en_4.3.0_3.2_1678213988790.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_eu_clinical_condition_pipeline", "en", "clinical/models") text = " Hyperparathyroidism was considered upon the fourth occasion. The history of weakness and generalized joint pains were present. He also had history of epigastric pain diagnosed informally as gastritis. He had previously had open reduction and internal fixation for the initial two fractures under general anesthesia. He sustained mandibular fracture. " result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_eu_clinical_condition_pipeline", "en", "clinical/models") val text = " Hyperparathyroidism was considered upon the fourth occasion. The history of weakness and generalized joint pains were present. He also had history of epigastric pain diagnosed informally as gastritis. He had previously had open reduction and internal fixation for the initial two fractures under general anesthesia. He sustained mandibular fracture. " val result = pipeline.fullAnnotate(text) ```
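`fullAnnotate` returns, for each input text, the chunk annotations with their character offsets plus a metadata map that carries the entity label and model confidence. A sketch of flattening such annotations into the tabular form shown in the Results section — written against plain dicts for illustration; real Spark NLP annotations expose the same fields, so adapt the accessors to your version:

```python
def chunks_to_rows(chunk_annotations):
    """Flatten chunk annotations into (chunk, begin, end, entity, confidence) rows."""
    return [
        {
            "chunk": ann["result"],
            "begin": ann["begin"],
            "end": ann["end"],
            "entity": ann["metadata"].get("entity"),
            "confidence": float(ann["metadata"].get("confidence", "0.0")),
        }
        for ann in chunk_annotations
    ]

sample = [{"result": "Hyperparathyroidism", "begin": 1, "end": 19,
           "metadata": {"entity": "clinical_condition", "confidence": "0.9375"}}]
print(chunks_to_rows(sample))
# [{'chunk': 'Hyperparathyroidism', 'begin': 1, 'end': 19, 'entity': 'clinical_condition', 'confidence': 0.9375}]
```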
## Results ```bash | | chunks | begin | end | entities | confidence | |---:|:------------------------|--------:|------:|:-------------------|-------------:| | 0 | Hyperparathyroidism | 1 | 19 | clinical_condition | 0.9375 | | 1 | weakness | 77 | 84 | clinical_condition | 0.9779 | | 2 | generalized joint pains | 90 | 112 | clinical_condition | 0.717333 | | 3 | epigastric pain | 151 | 165 | clinical_condition | 0.64985 | | 4 | gastritis | 191 | 199 | clinical_condition | 0.9543 | | 5 | fractures | 281 | 289 | clinical_condition | 0.9726 | | 6 | anesthesia | 305 | 314 | clinical_condition | 0.991 | | 7 | mandibular fracture | 330 | 348 | clinical_condition | 0.54925 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_eu_clinical_condition_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Detect Chemicals in Medical text (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_chemicals date: 2022-01-06 tags: [berfortokenclassification, ner, chemicals, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract different types of chemical compounds mentioned in text using pretrained NER model. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. 
## Predicted Entities `CHEM` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemicals_en_3.3.4_2.4_1641465134046.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemicals_en_3.3.4_2.4_1641465134046.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_chemicals", "en", "clinical/models")\ .setInputCols("token", "document")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter]) data = spark.createDataFrame([["""The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. "A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_chemicals", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("ner") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("document","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter)) val data = Seq("""The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. 
"A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_chemical").predict("""The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. "A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.""") ```
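Because the tokenizer splits around hyphens, chunks come back with padded hyphens (e.g. `povidone - iodine` in the results below). If you need compact compound names, a small post-processing step can collapse them — a convenience sketch, not part of the model itself:

```python
import re

def normalize_hyphens(chunk):
    """Collapse tokenizer-introduced spaces around hyphens in a chunk."""
    return re.sub(r"\s*-\s*", "-", chunk)

print(normalize_hyphens("chlorhexidine - digluconate"))  # chlorhexidine-digluconate
print(normalize_hyphens("povidone - iodine"))            # povidone-iodine
```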
## Results ```bash +---------------------------+---------+ |chunk                      |ner_label| +---------------------------+---------+ |p - choloroaniline         |CHEM     | |chlorhexidine - digluconate|CHEM     | |kanamycin                  |CHEM     | |colistin                   |CHEM     | |povidone - iodine          |CHEM     | +---------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_chemicals| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## Data Source This model is trained on a custom dataset by John Snow Labs. ## Benchmarking ```bash label precision recall f1-score support B-CHEM 0.99 0.92 0.95 30731 I-CHEM 0.99 0.93 0.96 31270 accuracy - - 0.93 62001 macro-avg 0.96 0.95 0.96 62001 weighted-avg 0.99 0.93 0.96 62001 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from shiromart) author: John Snow Labs name: distilbert_qa_shiromart_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `shiromart`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_shiromart_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772671335.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_shiromart_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772671335.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_shiromart_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_shiromart_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
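Since this checkpoint is fine-tuned on SQuAD, the customary way to sanity-check its answers against gold spans is SQuAD-style exact match, which normalizes case, punctuation, and English articles before comparing. A sketch, independent of Spark NLP:

```python
import re
import string

def normalize_answer(s):
    """Lowercase, drop punctuation and English articles, squeeze whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold):
    return normalize_answer(prediction) == normalize_answer(gold)

print(exact_match("Clara", "clara"))       # True
print(exact_match("the Clara", "Clara."))  # True
```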
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_shiromart_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/shiromart/distilbert-base-uncased-finetuned-squad --- layout: model title: English T5ForConditionalGeneration Base Cased model (from mrm8488) author: John Snow Labs name: t5_base_finetuned_break_data_question_retrieval date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-finetuned-break_data-question-retrieval` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_break_data_question_retrieval_en_4.3.0_3.0_1675108919499.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_finetuned_break_data_question_retrieval_en_4.3.0_3.0_1675108919499.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_base_finetuned_break_data_question_retrieval","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_base_finetuned_break_data_question_retrieval","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_base_finetuned_break_data_question_retrieval| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|878.5 MB| ## References - https://huggingface.co/mrm8488/t5-base-finetuned-break_data-question-retrieval - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/pdf/1910.10683.pdf - https://i.imgur.com/jVFMMWR.png - https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb - https://twitter.com/psuraj28 - https://twitter.com/mrm8488 - https://www.linkedin.com/in/manuel-romero-cs/ --- layout: model title: Named Entity Recognition Profiling (Biobert) author: John Snow Labs name: ner_profiling_biobert date: 2023-04-26 tags: [licensed, en, clinical, biobert, profiling, ner_profiling, ner] task: [Named Entity Recognition, Pipeline Healthcare] language: en edition: Healthcare NLP 4.4.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to explore all the available pretrained NER models at once. When you run this pipeline over your text, you will end up with the predictions coming out of each pretrained clinical NER model trained with `biobert_pubmed_base_cased`. Compared to the previous version, it adds the outputs of several more NER models. 
Here are the NER models that this pretrained pipeline includes: `jsl_ner_wip_greedy_biobert`, `jsl_rd_ner_wip_greedy_biobert`, `ner_ade_biobert`, `ner_anatomy_biobert`, `ner_anatomy_coarse_biobert`, `ner_bionlp_biobert`, `ner_cellular_biobert`, `ner_chemprot_biobert`, `ner_clinical_biobert`, `ner_deid_biobert`, `ner_deid_enriched_biobert`, `ner_diseases_biobert`, `ner_events_biobert`, `ner_human_phenotype_gene_biobert`, `ner_human_phenotype_go_biobert`, `ner_jsl_biobert`, `ner_jsl_enriched_biobert`, `ner_jsl_greedy_biobert`, `ner_living_species_biobert`, `ner_posology_biobert`, `ner_posology_large_biobert`, `ner_risk_factors_biobert` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_profiling_biobert_en_4.4.0_3.2_1682511035778.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_profiling_biobert_en_4.4.0_3.2_1682511035778.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline ner_profiling_pipeline = PretrainedPipeline('ner_profiling_biobert', 'en', 'clinical/models') result = ner_profiling_pipeline.annotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.""") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val ner_profiling_pipeline = PretrainedPipeline("ner_profiling_biobert", "en", "clinical/models") val result = ner_profiling_pipeline.annotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.""") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.profiling_biobert").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.""") ```
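`annotate` on this profiling pipeline returns one dictionary whose keys interleave the outputs of all bundled models. A small helper for collecting the per-model chunk lists — the `_chunks` key suffix and the sample dictionary shape below are assumptions for illustration, so inspect `result.keys()` in your version to confirm the naming:

```python
def group_model_outputs(result, suffix="_chunks"):
    """Map each bundled model's name to its list of predicted chunks."""
    return {key[: -len(suffix)]: value
            for key, value in result.items() if key.endswith(suffix)}

# Hypothetical shape of an annotate() result:
result = {
    "sentence": ["A 28-year-old female ..."],
    "ner_diseases_biobert_chunks": ["gestational diabetes mellitus", "obesity"],
    "ner_risk_factors_biobert_chunks": ["obesity"],
}
print(group_model_outputs(result))
# {'ner_diseases_biobert': ['gestational diabetes mellitus', 'obesity'], 'ner_risk_factors_biobert': ['obesity']}
```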
## Results ```bash ******************** ner_diseases_biobert Model Results ******************** [('gestational diabetes mellitus', 'Disease'), ('type two diabetes mellitus', 'Disease'), ('T2DM', 'Disease'), ('HTG-induced pancreatitis', 'Disease'), ('hepatitis', 'Disease'), ('obesity', 'Disease'), ('polyuria', 'Disease'), ('polydipsia', 'Disease'), ('poor appetite', 'Disease'), ('vomiting', 'Disease')] ******************** ner_events_biobert Model Results ******************** [('gestational diabetes mellitus', 'PROBLEM'), ('eight years', 'DURATION'), ('presentation', 'OCCURRENCE'), ('type two diabetes mellitus ( T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('three years', 'DURATION'), ('presentation', 'OCCURRENCE'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index', 'TEST'), ('BMI', 'TEST'), ('presented', 'OCCURRENCE'), ('a one-week', 'DURATION'), ('polyuria', 'PROBLEM'), ('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')] ******************** ner_jsl_biobert Model Results ******************** [('28-year-old', 'Age'), ('female', 'Gender'), ('gestational diabetes mellitus', 'Diabetes'), ('eight years prior', 'RelativeDate'), ('type two diabetes mellitus', 'Diabetes'), ('T2DM', 'Disease_Syndrome_Disorder'), ('HTG-induced pancreatitis', 'Disease_Syndrome_Disorder'), ('three years prior', 'RelativeDate'), ('acute', 'Modifier'), ('hepatitis', 'Disease_Syndrome_Disorder'), ('obesity', 'Obesity'), ('body mass index', 'BMI'), ('BMI ) of 33.5 kg/m2', 'BMI'), ('one-week', 'Duration'), ('polyuria', 'Symptom'), ('polydipsia', 'Symptom'), ('poor appetite', 'Symptom'), ('vomiting', 'Symptom')] ******************** ner_clinical_biobert Model Results ******************** [('gestational diabetes mellitus', 'PROBLEM'), ('subsequent type two diabetes mellitus ( T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index ( BMI )', 
'TEST'), ('polyuria', 'PROBLEM'), ('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')] ******************** ner_risk_factors_biobert Model Results ******************** [('diabetes mellitus', 'DIABETES'), ('subsequent type two diabetes mellitus', 'DIABETES'), ('obesity', 'OBESE')] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_profiling_biobert| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|766.5 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - Finisher --- layout: model title: Named Entity Recognition (NER) Model in Swedish (GloVe 6B 100) author: John Snow Labs name: swedish_ner_6B_100 date: 2020-08-30 task: Named Entity Recognition language: sv edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [ner, sv, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Swedish NER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. 
This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. The model is trained with GloVe 6B 100 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Product-`PRO`, Date-`DATE`, Event-`EVENT`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_SV/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/swedish_ner_6B_100_sv_2.6.0_2.4_1598810268071.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/swedish_ner_6B_100_sv_2.6.0_2.4_1598810268071.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_100d") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("swedish_ner_6B_100", "sv") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (född 28 oktober 1955) är en amerikansk affärsmagnat, mjukvaruutvecklare, investerare och filantrop. Han är mest känd som medgrundare av Microsoft Corporation. Under sin karriär på Microsoft innehade Gates befattningar som styrelseordförande, verkställande direktör (VD), VD och programvaruarkitekt samtidigt som han var den största enskilda aktieägaren fram till maj 2014. Han är en av de mest kända företagarna och pionjärerna inom mikrodatorrevolutionen på 1970- och 1980-talet. Född och uppvuxen i Seattle, Washington, grundade Gates Microsoft tillsammans med barndomsvän Paul Allen 1975 i Albuquerque, New Mexico; det blev vidare världens största datorprogramföretag. Gates ledde företaget som styrelseordförande och VD tills han avgick som VD i januari 2000, men han förblev ordförande och blev chef för programvaruarkitekt. Under slutet av 1990-talet hade Gates kritiserats för sin affärstaktik, som har ansetts konkurrensbegränsande. Detta yttrande har upprätthållits genom många domstolsbeslut. I juni 2006 meddelade Gates att han skulle gå över till en deltidsroll på Microsoft och heltid på Bill & Melinda Gates Foundation, den privata välgörenhetsstiftelsen som han och hans fru, Melinda Gates, grundade 2000. Han överförde gradvis sina uppgifter till Ray Ozzie och Craig Mundie. 
Han avgick som styrelseordförande i Microsoft i februari 2014 och tillträdde en ny tjänst som teknologrådgivare för att stödja den nyutnämnda VD Satya Nadella.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_100d") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("swedish_ner_6B_100", "sv") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (född 28 oktober 1955) är en amerikansk affärsmagnat, mjukvaruutvecklare, investerare och filantrop. Han är mest känd som medgrundare av Microsoft Corporation. Under sin karriär på Microsoft innehade Gates befattningar som styrelseordförande, verkställande direktör (VD), VD och programvaruarkitekt samtidigt som han var den största enskilda aktieägaren fram till maj 2014. Han är en av de mest kända företagarna och pionjärerna inom mikrodatorrevolutionen på 1970- och 1980-talet. Född och uppvuxen i Seattle, Washington, grundade Gates Microsoft tillsammans med barndomsvän Paul Allen 1975 i Albuquerque, New Mexico; det blev vidare världens största datorprogramföretag. Gates ledde företaget som styrelseordförande och VD tills han avgick som VD i januari 2000, men han förblev ordförande och blev chef för programvaruarkitekt. Under slutet av 1990-talet hade Gates kritiserats för sin affärstaktik, som har ansetts konkurrensbegränsande. Detta yttrande har upprätthållits genom många domstolsbeslut. I juni 2006 meddelade Gates att han skulle gå över till en deltidsroll på Microsoft och heltid på Bill & Melinda Gates Foundation, den privata välgörenhetsstiftelsen som han och hans fru, Melinda Gates, grundade 2000. Han överförde gradvis sina uppgifter till Ray Ozzie och Craig Mundie. 
Han avgick som styrelseordförande i Microsoft i februari 2014 och tillträdde en ny tjänst som teknologrådgivare för att stödja den nyutnämnda VD Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (född 28 oktober 1955) är en amerikansk affärsmagnat, mjukvaruutvecklare, investerare och filantrop. Han är mest känd som medgrundare av Microsoft Corporation. Under sin karriär på Microsoft innehade Gates befattningar som styrelseordförande, verkställande direktör (VD), VD och programvaruarkitekt samtidigt som han var den största enskilda aktieägaren fram till maj 2014. Han är en av de mest kända företagarna och pionjärerna inom mikrodatorrevolutionen på 1970- och 1980-talet. Född och uppvuxen i Seattle, Washington, grundade Gates Microsoft tillsammans med barndomsvän Paul Allen 1975 i Albuquerque, New Mexico; det blev vidare världens största datorprogramföretag. Gates ledde företaget som styrelseordförande och VD tills han avgick som VD i januari 2000, men han förblev ordförande och blev chef för programvaruarkitekt. Under slutet av 1990-talet hade Gates kritiserats för sin affärstaktik, som har ansetts konkurrensbegränsande. Detta yttrande har upprätthållits genom många domstolsbeslut. I juni 2006 meddelade Gates att han skulle gå över till en deltidsroll på Microsoft och heltid på Bill & Melinda Gates Foundation, den privata välgörenhetsstiftelsen som han och hans fru, Melinda Gates, grundade 2000. Han överförde gradvis sina uppgifter till Ray Ozzie och Craig Mundie. Han avgick som styrelseordförande i Microsoft i februari 2014 och tillträdde en ny tjänst som teknologrådgivare för att stödja den nyutnämnda VD Satya Nadella."""] ner_df = nlu.load('sv.ner.6B_100').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title} ## Results ```bash +---------------------+---------+ |chunk |ner_label| +---------------------+---------+ |William Henry Gates |PER | |Microsoft Corporation|ORG | |Microsoft |ORG | |Gates |PER | |VD |MISC | |Seattle |LOC | |Washington |LOC | |Gates Microsoft |PER | |Paul Allen |PER | |Albuquerque |LOC | |New Mexico |ORG | |Gates |PER | |Gates |PER | |Microsoft |ORG | |Melinda Gates |PER | |Melinda Gates |PER | |Ray Ozzie |PER | |Craig Mundie |PER | |Microsoft |ORG | |VD Satya Nadella |MISC | +---------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|swedish_ner_6B_100| |Type:|ner| |Compatibility:| Spark NLP 2.6.0+| |Edition:|Official| |License:|Open Source| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|sv| |Case sensitive:|false| {:.h2_title} ## Data Source Trained on a custom dataset with multi-lingual GloVe Embeddings ``glove_6B_100``. --- layout: model title: Word2Vec Embeddings in Bashkir (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, ba, open_source] task: Embeddings language: ba edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ba_3.4.1_3.0_1647285040442.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ba_3.4.1_3.0_1647285040442.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ba") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ba") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ba.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
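The `embeddings` column holds one vector per token (300 dimensions for this model). Downstream tasks typically compare such vectors with cosine similarity; below is a minimal, library-free sketch of that comparison. The 3-dimensional vectors are toy stand-ins for the model's real 300-dimensional ones.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy 3-d stand-ins for the model's 300-d token vectors.
print(cosine([0.2, 0.1, 0.7], [0.2, 0.1, 0.7]))  # identical vectors -> ~1.0
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal vectors -> ~0.0
```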
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|ba| |Size:|447.1 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Legal Technology And Technical Regulations Document Classifier (EURLEX) author: John Snow Labs name: legclf_technology_and_technical_regulations_bert date: 2023-03-06 tags: [en, legal, classification, clauses, technology_and_technical_regulations, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_technology_and_technical_regulations_bert model, a Bert Sentence Embeddings Document Classifier, determines whether the document belongs to the class Technology_and_Technical_Regulations or not (binary classification) according to EuroVoc labels. ## Predicted Entities `Technology_and_Technical_Regulations`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_technology_and_technical_regulations_bert_en_1.0.0_3.0_1678111609892.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_technology_and_technical_regulations_bert_en_1.0.0_3.0_1678111609892.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_technology_and_technical_regulations_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
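The benchmarking section further down reports per-label precision, recall and F1. As a reminder of how those three figures relate, here is a minimal sketch computing them from raw counts; the counts in the example call are illustrative, not taken from this model's evaluation.

```python
def prf(tp, fp, fn):
    # Precision, recall and F1 from true-positive, false-positive
    # and false-negative counts for one label.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf(50, 50, 50))  # -> (0.5, 0.5, 0.5)
```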
## Results ```bash +-------+ |result| +-------+ |[Technology_and_Technical_Regulations]| |[Other]| |[Other]| |[Technology_and_Technical_Regulations]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_technology_and_technical_regulations_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.85 0.82 0.84 337 Technology_and_Technical_Regulations 0.83 0.86 0.85 349 accuracy - - 0.84 686 macro-avg 0.84 0.84 0.84 686 weighted-avg 0.84 0.84 0.84 686 ``` --- layout: model title: Detect PHI for Deidentification purposes (Spanish, reduced entities, Roberta Embeddings) author: John Snow Labs name: ner_deid_generic_roberta date: 2022-01-17 tags: [deid, es, licensed] task: De-identification language: es edition: Healthcare NLP 3.3.4 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Spanish) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 7 entities. This NER model is trained on a combination of custom datasets, the Spanish CoNLL 2002 corpus, the MeddoProf dataset and several data augmentation mechanisms; it is a reduced version of `ner_deid_subentity_roberta` and uses Roberta Clinical Embeddings.
## Predicted Entities `CONTACT`, `NAME`, `DATE`, `ID`, `LOCATION`, `PROFESSION`, `AGE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_ES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_roberta_es_3.3.4_3.0_1642437901644.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_roberta_es_3.3.4_3.0_1642437901644.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = medical.NerModel.pretrained("ner_deid_generic_roberta", "es", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, roberta_embeddings, clinical_ner]) text = [''' Antonio Pérez Juan, nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14/03/2020 y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. 
'''] df = spark.createDataFrame([text]).toDF("text") results = nlpPipeline.fit(df).transform(df) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val roberta_embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_roberta", "es", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, roberta_embeddings, clinical_ner)) val text = """Antonio Pérez Juan, nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14/03/2020 y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.""" val df = Seq(text).toDS.toDF("text") val results = pipeline.fit(df).transform(df) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.deid.generic_roberta").predict(""" Antonio Pérez Juan, nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14/03/2020 y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. """) ```
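The `ner` column carries one IOB tag per token (a B-/I- prefix plus the entity label, as the results below show). Grouping those tags into entity chunks is the job of Spark NLP's `NerConverter` stage; the following is a minimal, library-free sketch of that same grouping logic, not the annotator's actual implementation.

```python
def iob_to_chunks(tokens, labels):
    # Group token-level IOB tags (B-XXX begins a chunk, I-XXX continues it,
    # O is outside) into (chunk_text, label) pairs. I- tags that do not
    # continue a matching chunk are dropped, which is enough for a sketch.
    chunks, current, current_label = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                chunks.append((" ".join(current), current_label))
            current, current_label = [tok], lab[2:]
        elif lab.startswith("I-") and current_label == lab[2:]:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), current_label))
            current, current_label = [], None
    if current:
        chunks.append((" ".join(current), current_label))
    return chunks

tokens = ["Antonio", "Pérez", "Juan", ",", "nacido", "en", "Cadiz"]
labels = ["B-NAME", "I-NAME", "I-NAME", "O", "O", "O", "B-LOCATION"]
print(iob_to_chunks(tokens, labels))  # [('Antonio Pérez Juan', 'NAME'), ('Cadiz', 'LOCATION')]
```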
## Results ```bash +------------+----------+ | token| ner_label| +------------+----------+ | Antonio| B-NAME| | Pérez| I-NAME| | Juan| I-NAME| | ,| O| | nacido| O| | en| O| | Cadiz|B-LOCATION| | ,| O| | España|B-LOCATION| | .| O| | Aún| O| | no| O| | estaba| O| | vacunado| O| | ,| O| | se| O| | infectó| O| | con| O| | Covid-19| O| | el| O| | dia| O| | 14/03/2020| B-DATE| | y| O| | tuvo| O| | que| O| | ir| O| | al| O| | Hospital| O| | Fue| O| | tratado| O| | con| O| | anticuerpos| O| |monoclonales| O| | en| O| | la| O| | Clinica|B-LOCATION| | San|I-LOCATION| | Carlos|I-LOCATION| | .| O| +------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic_roberta| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|16.3 MB| |Dependencies:|roberta_base_biomedical| ## Data Source - Internal JSL annotated corpus - [Spanish conLL](https://www.clips.uantwerpen.be/conll2002/ner/data/) - [MeddoProf](https://temu.bsc.es/meddoprof/data/) ## Benchmarking ```bash label tp fp fn total precision recall f1 CONTACT 171.0 10.0 3.0 174.0 0.9448 0.9828 0.9634 NAME 2732.0 198.0 219.0 2951.0 0.9324 0.9258 0.9291 DATE 1644.0 27.0 23.0 1667.0 0.9838 0.9862 0.985 ID 114.0 11.0 7.0 121.0 0.912 0.9421 0.9268 LOCATION 4850.0 623.0 594.0 5444.0 0.8862 0.8909 0.8885 PROFESSION 266.0 66.0 123.0 389.0 0.8012 0.6838 0.7379 AGE 303.0 50.0 45.0 348.0 0.8584 0.8707 0.8645 macro - - - - - - 0.8993 micro - - - - - - 0.9094 ``` --- layout: model title: Legal Proposal Summarization author: John Snow Labs name: legsum_proposal date: 2023-02-16 tags: [en, licensed, summarization, legal, tensorflow] task: Summarization language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description 
This model is fine-tuned on a legal dataset of EU proposals. It summarizes a proposal on a socially important issue. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legsum_proposal_en_1.0.0_3.0_1676587991098.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legsum_proposal_en_1.0.0_3.0_1676587991098.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") t5 = nlp.T5Transformer().pretrained("legsum_proposal", "en", "legal/models")\ .setTask("summarize")\ .setMaxOutputLength(512)\ .setInputCols(["document"])\ .setOutputCol("summaries") text = """The main reason for migration is poverty, and often times it is down to corruption in the leadership of poor countries. What people in such countries demand time and again is that the EU does not engage with their government, and does not supply financial support (which tends to end up in the wrong hands). The EU needs a strict line of engagement. One could envision a rating list by the EU that defines clear requirements support receiving nations must fulfill. Support should be granted in the form of improved economic conditions, such as increased import quota, discounted machinery, and technical know-how injection, not in terms of financial support. Countries failing to fulfill the requirements, especially those with indications of corruption must be put under strict embargoes.""" data_df = spark.createDataFrame([[text]]).toDF("text") pipeline = nlp.Pipeline().setStages([document_assembler, t5]) results = pipeline.fit(data_df).transform(data_df) results.select("summaries.result").show(truncate=False) ```
## Results ```bash People in poor countries demand that the EU does not engage with their government and do not provide financial support. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legsum_proposal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[summaries]| |Language:|en| |Size:|925.9 MB| ## References Training dataset available [here](https://touche.webis.de/clef23/touche23-web/multilingual-stance-classification.html#data) --- layout: model title: Legal America Document Classifier (EURLEX) author: John Snow Labs name: legclf_america_bert date: 2023-03-06 tags: [en, legal, classification, clauses, america, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_america_bert model, a Bert Sentence Embeddings Document Classifier, determines whether the document belongs to the class America or not (binary classification) according to EuroVoc labels. ## Predicted Entities `America`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_america_bert_en_1.0.0_3.0_1678111843368.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_america_bert_en_1.0.0_3.0_1678111843368.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_america_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[America]| |[Other]| |[Other]| |[America]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_america_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support America 0.85 0.92 0.88 49 Other 0.91 0.83 0.87 48 accuracy - - 0.88 97 macro-avg 0.88 0.88 0.88 97 weighted-avg 0.88 0.88 0.88 97 ``` --- layout: model title: Stopwords Remover for Slovak language (418 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, sk, open_source] task: Stop Words Removal language: sk edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_sk_3.4.1_3.0_1646673255405.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_sk_3.4.1_3.0_1646673255405.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","sk") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Nie ste lepší ako ja"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","sk") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Nie ste lepší ako ja").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("sk.stopwords").predict("""Nie ste lepší ako ja""") ```
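Conceptually, `StopWordsCleaner` drops every token that appears in its stopword list. A minimal, library-free sketch of that filtering follows; the stopword set here is a tiny illustrative subset, not the model's full 418-entry Slovak list, and matching is case-insensitive to mirror the model's `Case sensitive: false` setting.

```python
# Illustrative subset only -- the real model ships its own 418-entry list.
SLOVAK_STOPWORDS = {"nie", "ste", "ako", "ja"}

def clean_tokens(tokens, stopwords=SLOVAK_STOPWORDS):
    # Keep only tokens whose lowercased form is not a stopword.
    return [t for t in tokens if t.lower() not in stopwords]

print(clean_tokens("Nie ste lepší ako ja".split()))  # -> ['lepší']
```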
## Results ```bash +-------+ |result | +-------+ |[lepší]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|sk| |Size:|2.4 KB| --- layout: model title: Fast Neural Machine Translation Model from Kuanyama to English author: John Snow Labs name: opus_mt_kj_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, kj, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `kj` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_kj_en_xx_2.7.0_2.4_1609168083477.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_kj_en_xx_2.7.0_2.4_1609168083477.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_kj_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_kj_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.kj.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_kj_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Arabic Named Entity Recognition (from boda) author: John Snow Labs name: bert_ner_ANER date: 2022-05-04 tags: [bert, ner, token_classification, ar, open_source] task: Named Entity Recognition language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `ANER` is an Arabic model originally trained by `boda`. ## Predicted Entities ` حكومة `, ` شخص ديني `, ` رياضي `, ` ترفيه `, ` كتاب `, ` شعب(أمة) `, ` مطار `, ` مؤسسة دينية `, ` مركز سكني `, ` فنان `, ` غير حكومي `, ` أراضي البناء `, ` سماوي `, ` أرض طبيعية `, ` منشأة منطقة فرعية `, ` هواء `, ` مدينة أو ضاحية `, ` عالم `, ` محامي `, ` سوفتوير(برمجيات) `, ` أرض `, ` منفجر `, ` كيميائي `, ` رياضة `, ` رماية `, ` طريق `, ` ماء `, ` تجاري `, ` `, ` سياسي `, ` صوت `, ` علوم طبية `, ` نووي `, ` فظ `, ` فيلم `, ` قارة `, ` ولاية أو مقاطعة `, ` شخص `, ` حاد `, ` مهندس `, ` مسطح مائي `, ` عتاد `, ` طعام `, ` قذيفة `, ` دواء `, ` تعليمي `, ` نبات `, ` شرطة `, ` رجل أعمال `, ` إعلام `, ` مجموعة ` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_ANER_ar_3.4.2_3.0_1651630311382.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_ANER_ar_3.4.2_3.0_1651630311382.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_ANER","ar") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_ANER","ar") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("أنا أحب الشرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.ner.ANER").predict("""أنا أحب الشرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_ANER| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ar| |Size:|506.0 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/boda/ANER - https://drive.google.com/file/d/1cNnKf-jS-3sjBXF2b0rkh517z9EzFFT4/view?usp=sharing - https://fsalotaibi.kau.edu.sa/Pages-Arabic-NE-Corpora.aspx - https://github.com/aub-mind/arabert - https://github.com/BodaSadalla98 --- layout: model title: English BertForQuestionAnswering Cased model (from horsbug98) author: John Snow Labs name: bert_qa_part_2_mbert_model_e1 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Part_2_mBERT_Model_E1` is an English model originally trained by `horsbug98`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_part_2_mbert_model_e1_en_4.0.0_3.0_1657182372814.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_part_2_mbert_model_e1_en_4.0.0_3.0_1657182372814.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_part_2_mbert_model_e1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_part_2_mbert_model_e1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
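Internally, extractive QA models of this kind score every token position in the context as a candidate answer start and answer end, and the annotator returns the best-scoring span. A minimal, library-free sketch of that span selection follows; the tokens and logits are made up for illustration and are not the model's actual outputs.

```python
def best_span(start_logits, end_logits, max_len=15):
    # Pick the (start, end) pair maximizing start_logits[s] + end_logits[e],
    # with s <= e and a bounded span length -- the usual span-selection rule.
    best, best_score = (0, 0), float("-inf")
    for s, sl in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = sl + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 6.0, 0.0, 0.1, 0.0, 0.0, 1.0, 0.0]  # made-up logits
end   = [0.0, 0.1, 0.0, 5.5, 0.2, 0.0, 0.0, 0.1, 1.2, 0.0]  # made-up logits
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```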
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_part_2_mbert_model_e1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|665.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/horsbug98/Part_2_mBERT_Model_E1 --- layout: model title: English BertForQuestionAnswering model (from husnu) author: John Snow Labs name: bert_qa_xtremedistil_l6_h256_uncased_TQUAD_finetuned_lr_2e_05_epochs_9 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xtremedistil-l6-h256-uncased-TQUAD-finetuned_lr-2e-05_epochs-9` is an English model originally trained by `husnu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_xtremedistil_l6_h256_uncased_TQUAD_finetuned_lr_2e_05_epochs_9_en_4.0.0_3.0_1654192611826.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_xtremedistil_l6_h256_uncased_TQUAD_finetuned_lr_2e_05_epochs_9_en_4.0.0_3.0_1654192611826.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_xtremedistil_l6_h256_uncased_TQUAD_finetuned_lr_2e_05_epochs_9","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_xtremedistil_l6_h256_uncased_TQUAD_finetuned_lr_2e_05_epochs_9","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.tquad.bert.xtremedistiled_uncased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_xtremedistil_l6_h256_uncased_TQUAD_finetuned_lr_2e_05_epochs_9| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|47.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/husnu/xtremedistil-l6-h256-uncased-TQUAD-finetuned_lr-2e-05_epochs-9 --- layout: model title: English BertForQuestionAnswering Uncased model (from aodiniz) author: John Snow Labs name: bert_qa_uncased_l_2_h_128_a_2_squad2_covid_qna date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-2_H-128_A-2_squad2_covid-qna` is an English model originally trained by `aodiniz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_uncased_l_2_h_128_a_2_squad2_covid_qna_en_4.0.0_3.0_1657188902163.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_uncased_l_2_h_128_a_2_squad2_covid_qna_en_4.0.0_3.0_1657188902163.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_uncased_l_2_h_128_a_2_squad2_covid_qna","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_uncased_l_2_h_128_a_2_squad2_covid_qna","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_uncased_l_2_h_128_a_2_squad2_covid_qna| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|16.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-2_H-128_A-2_squad2_covid-qna --- layout: model title: English DistilBertForQuestionAnswering model (from en) author: John Snow Labs name: distilbert_qa_en_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `en`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_en_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725203237.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_en_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725203237.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_en_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_en_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_en").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
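The NLU one-liner above packs the question and context into a single string separated by a triple pipe (`|||`). If you are preparing such inputs programmatically, the pair can be split back apart with plain Python. This is a minimal sketch under that assumption; the helper name `split_qa_pair` is ours, not part of the NLU library:

```python
def split_qa_pair(packed, sep="|||"):
    """Split an NLU-style 'question|||context' string into its two parts."""
    question, context = packed.split(sep, 1)
    # Trim whitespace and any surrounding quotes around each part.
    return question.strip().strip('"'), context.strip().strip('"')

q, c = split_qa_pair('What is my name?|||"My name is Clara and I live in Berkeley."')
```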
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_en_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/en/distilbert-base-uncased-finetuned-squad --- layout: model title: Financial Form 10k summary Item Binary Classifier author: John Snow Labs name: finclf_form_10k_summary_item date: 2022-08-10 tags: [en, finance, classification, 10k, annual, reports, sec, filings, licensed] task: Text Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `form_10k_summary` item type of 10K Annual Reports. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big financial documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Finance NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
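As a rough illustration of the paragraph-splitting option mentioned above (splitting by multiline), blank lines can serve as paragraph boundaries, and a simple whitespace token count can flag chunks that may exceed the 512-token limit of the embeddings. This is a plain-Python sketch under those assumptions, not the workshop code; real subword tokenizers will typically count more tokens than `str.split` does:

```python
import re

def split_paragraphs(text):
    # Treat one or more blank lines as a paragraph boundary.
    return [p.strip() for p in re.split(r"\n\s*\n+", text) if p.strip()]

def maybe_too_long(paragraph, max_tokens=512):
    # Crude whitespace token count; subword tokenizers may count more.
    return len(paragraph.split()) > max_tokens

doc = "Item 1. Business\nWe make widgets.\n\nItem 1A. Risk Factors\nDemand may fall."
chunks = split_paragraphs(doc)
```

Each resulting chunk can then be fed to the classifier as its own row, instead of passing the whole filing at once.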
## Predicted Entities `other`, `form_10k_summary` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_form_10k_summary_item_en_1.0.0_3.2_1660154435146.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_form_10k_summary_item_en_1.0.0_3.2_1660154435146.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") useEmbeddings = nlp.UniversalSentenceEncoder.pretrained() \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("finclf_form_10k_summary_item", "en", "finance/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, useEmbeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
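The weighted averages reported in the Benchmarking section of this card can be re-derived from the per-label F1 scores and supports. This small support-weighted mean (our own helper, not part of Spark NLP) reproduces the 0.74 weighted F1 from the per-label values:

```python
def support_weighted_mean(scores, supports):
    # Average per-label scores, weighting each label by its support.
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total

# Per-label F1 and support for form_10k_summary and other, from the card.
weighted_f1 = support_weighted_mean([0.73, 0.74], [145, 162])
```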
## Results ```bash +------------------+ |result | +------------------+ |[form_10k_summary]| |[other] | |[other] | |[form_10k_summary]| +------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_form_10k_summary_item| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.6 MB| ## References Weak labelling on documents from Edgar database ## Benchmarking ```bash label precision recall f1-score support form_10k_summary 0.71 0.74 0.73 145 other 0.76 0.73 0.74 162 accuracy - - 0.74 307 macro-avg 0.74 0.74 0.74 307 weighted-avg 0.74 0.74 0.74 307 ``` --- layout: model title: Slovak BertForTokenClassification Cased model (from crabz) author: John Snow Labs name: bert_token_classifier_fernet_cc_sk_ner date: 2022-11-30 tags: [sk, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: sk edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `FERNET-CC_sk-ner` is a Slovak model originally trained by `crabz`. ## Predicted Entities `5`, `1`, `3`, `0`, `4`, `2`, `6` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_fernet_cc_sk_ner_sk_4.2.4_3.0_1669813459817.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_fernet_cc_sk_ner_sk_4.2.4_3.0_1669813459817.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_fernet_cc_sk_ner","sk") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_fernet_cc_sk_ner","sk") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_fernet_cc_sk_ner| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|sk| |Size:|610.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/crabz/FERNET-CC_sk-ner - https://paperswithcode.com/sota?task=Token+Classification&dataset=wikiann+sk --- layout: model title: Legal Officers Clause Binary Classifier author: John Snow Labs name: legclf_officers_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `officers` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, producing as an output a True/False value for each of the legal clause models you have added. ## Predicted Entities `other`, `officers` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_officers_clause_en_1.0.0_3.2_1660123780978.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_officers_clause_en_1.0.0_3.2_1660123780978.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_officers_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
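For the Benchmarking table of this card, the macro average is the unweighted mean of the per-label scores (unlike the support-weighted variant). A quick sanity check in plain Python, with the `macro_mean` helper being our own illustration rather than part of Spark NLP:

```python
def macro_mean(scores):
    # Unweighted mean across labels, ignoring support.
    return sum(scores) / len(scores)

# Per-label F1 for officers (0.96) and other (0.98), from the card.
macro_f1 = macro_mean([0.96, 0.98])
```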
## Results ```bash +----------+ |result | +----------+ |[officers]| |[other] | |[other] | |[officers]| +----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_officers_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support officers 0.97 0.94 0.96 35 other 0.97 0.99 0.98 74 accuracy - - 0.97 109 macro-avg 0.97 0.96 0.97 109 weighted-avg 0.97 0.97 0.97 109 ``` --- layout: model title: English DistilBertForQuestionAnswering model (from andi611) Squad2 author: John Snow Labs name: distilbert_qa_base_uncased_squad2_with_ner date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-squad2-with-ner` is an English model originally trained by `andi611`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_with_ner_en_4.0.0_3.0_1654727322583.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_with_ner_en_4.0.0_3.0_1654727322583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2_with_ner","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2_with_ner","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_conll.distil_bert.base_uncased.by_andi611").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_squad2_with_ner| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/andi611/distilbert-base-uncased-squad2-with-ner --- layout: model title: Italian RobertaForQuestionAnswering (from Gantenbein) author: John Snow Labs name: roberta_qa_ADDI_IT_RoBERTa date: 2022-06-20 tags: [open_source, question_answering, roberta] task: Question Answering language: it edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ADDI-IT-RoBERTa` is an Italian model originally trained by `Gantenbein`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ADDI_IT_RoBERTa_it_4.0.0_3.0_1655726552278.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ADDI_IT_RoBERTa_it_4.0.0_3.0_1655726552278.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_ADDI_IT_RoBERTa","it") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_ADDI_IT_RoBERTa","it") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("it.answer_question.roberta.it_tuned.by_Gantenbein").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_ADDI_IT_RoBERTa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|it| |Size:|422.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Gantenbein/ADDI-IT-RoBERTa --- layout: model title: Japanese Electra Embeddings (from izumi-lab) author: John Snow Labs name: electra_embeddings_electra_small_paper_japanese_fin_generator date: 2022-05-17 tags: [ja, open_source, electra, embeddings] task: Embeddings language: ja edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-small-paper-japanese-fin-generator` is a Japanese model originally trained by `izumi-lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_small_paper_japanese_fin_generator_ja_3.4.4_3.0_1652786703298.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_small_paper_japanese_fin_generator_ja_3.4.4_3.0_1652786703298.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_small_paper_japanese_fin_generator","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Spark NLPが大好きです"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_small_paper_japanese_fin_generator","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Spark NLPが大好きです").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electra_small_paper_japanese_fin_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ja| |Size:|19.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/izumi-lab/electra-small-paper-japanese-fin-generator - https://github.com/google-research/electra - https://github.com/retarfi/language-pretraining/tree/v1.0 - https://arxiv.org/abs/2003.10555 - https://creativecommons.org/licenses/by-sa/4.0/ --- layout: model title: English asr_wav2vec2_base_timit_moaiz_explast TFWav2Vec2ForCTC from moaiz237 author: John Snow Labs name: asr_wav2vec2_base_timit_moaiz_explast date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_moaiz_explast` is an English model originally trained by moaiz237. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_moaiz_explast_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_moaiz_explast_en_4.2.0_3.0_1664039499235.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_moaiz_explast_en_4.2.0_3.0_1664039499235.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_moaiz_explast", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_moaiz_explast", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_moaiz_explast| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_2 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-128-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_2_en_4.0.0_3.0_1655731191487.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_2_en_4.0.0_3.0_1655731191487.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_128d_seed_2").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|422.9 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-128-finetuned-squad-seed-2

---
layout: model
title: English XlmRoBertaForQuestionAnswering (from ncthuan)
author: John Snow Labs
name: xlm_roberta_qa_xlm_l_uetqa
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-l-uetqa` is an English model originally trained by `ncthuan`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_l_uetqa_en_4.0.0_3.0_1655988514002.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_l_uetqa_en_4.0.0_3.0_1655988514002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_l_uetqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_l_uetqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.xlm_roberta.by_ncthuan").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_xlm_l_uetqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.9 GB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/ncthuan/xlm-l-uetqa

---
layout: model
title: English BertForQuestionAnswering Base Cased model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_0
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-128-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_0_en_4.0.0_3.0_1657191824179.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_0_en_4.0.0_3.0_1657191824179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_0","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_0","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_0|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|381.1 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-128-finetuned-squad-seed-0

---
layout: model
title: Romanian ALBERT Embeddings (from dragosnicolae555)
author: John Snow Labs
name: albert_embeddings_ALR_BERT
date: 2022-04-14
tags: [albert, embeddings, ro, open_source]
task: Embeddings
language: ro
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: AlBertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `ALR_BERT` is a Romanian model originally trained by `dragosnicolae555`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_ALR_BERT_ro_3.4.2_3.0_1649954326845.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_ALR_BERT_ro_3.4.2_3.0_1649954326845.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = AlbertEmbeddings.pretrained("albert_embeddings_ALR_BERT","ro") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Îmi place Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_ALR_BERT","ro") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Îmi place Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ro.embed.ALR_BERT").predict("""Îmi place Spark NLP""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|albert_embeddings_ALR_BERT|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|ro|
|Size:|54.5 MB|
|Case sensitive:|false|

## References

- https://huggingface.co/dragosnicolae555/ALR_BERT

---
layout: model
title: Legal Recitals Clause Binary Classifier
author: John Snow Labs
name: legclf_recitals_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `recitals` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Keep in mind that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
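The paragraph-splitting step recommended above can be done outside Spark NLP as a simple pre-processing pass. The sketch below is only illustrative: the helper name and the blank-line regex are assumptions, not part of the library or the tutorial.

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into candidate clause chunks on blank lines."""
    # Two or more consecutive newlines delimit a paragraph.
    chunks = re.split(r"\n{2,}", text)
    # Drop empty chunks and trim surrounding whitespace.
    return [c.strip() for c in chunks if c.strip()]

doc = "WHEREAS, the parties wish to cooperate.\n\n1. Definitions.\nTerms used herein have the meanings below."
paragraphs = split_paragraphs(doc)
# Each paragraph can then be fed to the classifier as a separate row.
```

Each resulting chunk stays well under the 512-token embedding limit for typical contracts, and can be classified independently by this model.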
## Predicted Entities `other`, `recitals` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_recitals_clause_en_1.0.0_3.2_1660123893858.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_recitals_clause_en_1.0.0_3.2_1660123893858.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_recitals_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+----------+
|    result|
+----------+
|[recitals]|
|   [other]|
|   [other]|
|[recitals]|
+----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_recitals_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.1 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
       other       0.94    0.96      0.95      151
    recitals       0.91    0.86      0.89       72
    accuracy          -       -      0.93      223
   macro-avg       0.92    0.91      0.92      223
weighted-avg       0.93    0.93      0.93      223
```

---
layout: model
title: English BertForQuestionAnswering model (from tli8hf)
author: John Snow Labs
name: bert_qa_unqover_bert_base_uncased_newsqa
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `unqover-bert-base-uncased-newsqa` is an English model originally trained by `tli8hf`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_unqover_bert_base_uncased_newsqa_en_4.0.0_3.0_1654192522291.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_unqover_bert_base_uncased_newsqa_en_4.0.0_3.0_1654192522291.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_unqover_bert_base_uncased_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_unqover_bert_base_uncased_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.bert.base_uncased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_unqover_bert_base_uncased_newsqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/tli8hf/unqover-bert-base-uncased-newsqa

---
layout: model
title: Sentiment Analysis of Italian texts
author: John Snow Labs
name: bert_sequence_classifier_sentiment
date: 2021-12-21
tags: [italian, sentiment, bert, sequence, it, open_source]
task: Sentiment Analysis
language: it
edition: Spark NLP 3.3.4
spark_version: 2.4
supported: true
annotator: BertForSequenceClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model was imported from `Hugging Face` and it's been fine-tuned for the Italian language, leveraging `Bert` embeddings and `BertForSequenceClassification` for text classification purposes.

## Predicted Entities

`negative`, `positive`, `neutral`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_sentiment_it_3.3.4_2.4_1640079386384.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_sentiment_it_3.3.4_2.4_1640079386384.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

sequenceClassifier = BertForSequenceClassification \
    .pretrained('bert_sequence_classifier_sentiment', 'it') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('class')

pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier])

example = spark.createDataFrame([['Ho mal di testa e mi sento male.']]).toDF("text")

result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_sentiment", "it")
  .setInputCols("document", "token")
  .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))

val example = Seq("Ho mal di testa e mi sento male.").toDS.toDF("text")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("it.classify.sentiment").predict("""Ho mal di testa e mi sento male.""")
```
## Results

```bash
['negative']
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_sentiment|
|Compatibility:|Spark NLP 3.3.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|it|
|Size:|415.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## Data Source

[https://huggingface.co/neuraly/bert-base-italian-cased-sentiment](https://huggingface.co/neuraly/bert-base-italian-cased-sentiment)

## Benchmarking

```bash
   label  score
accuracy   0.82
```

---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab2_by_sameearif88 TFWav2Vec2ForCTC from sameearif88
author: John Snow Labs
name: asr_wav2vec2_base_timit_demo_colab2_by_sameearif88
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab2_by_sameearif88` is an English model originally trained by sameearif88.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab2_by_sameearif88_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab2_by_sameearif88_en_4.2.0_3.0_1664040783641.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab2_by_sameearif88_en_4.2.0_3.0_1664040783641.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab2_by_sameearif88", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab2_by_sameearif88", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab2_by_sameearif88| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: Fast Neural Machine Translation Model from Morisyen to English author: John Snow Labs name: opus_mt_mfe_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, mfe, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `mfe` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_mfe_en_xx_2.7.0_2.4_1609164196495.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_mfe_en_xx_2.7.0_2.4_1609164196495.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_mfe_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols("document")
  .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_mfe_en", "xx")
  .setInputCols("sentence")
  .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.mfe.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_mfe_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab_by_ntp0102 TFWav2Vec2ForCTC from ntp0102
author: John Snow Labs
name: asr_wav2vec2_base_timit_demo_colab_by_ntp0102
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_ntp0102` is an English model originally trained by ntp0102.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_by_ntp0102_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_ntp0102_en_4.2.0_3.0_1664026565165.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_ntp0102_en_4.2.0_3.0_1664026565165.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_by_ntp0102", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_by_ntp0102", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab_by_ntp0102| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|349.4 MB| --- layout: model title: Universal Sentence Encoder XLING English and German author: John Snow Labs name: tfhub_use_xling_en_de date: 2020-12-08 task: Embeddings language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 deprecated: true tags: [embeddings, open_source, xx] supported: true annotator: UniversalSentenceEncoder article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The Universal Sentence Encoder Cross-lingual (XLING) module is an extension of the Universal Sentence Encoder that includes training on multiple tasks across languages. The multi-task training setup is based on the paper "Learning Cross-lingual Sentence Representations via a Multi-task Dual Encoder". This specific module is trained on English and German (en-de) tasks, and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and tasks, with the goal of learning text representations that are useful out-of-the-box for a number of applications. The input to the module is variable length English or German text and the output is a 512 dimensional vector. We note that one does not need to specify the language that the input is in, as the model was trained such that English and German text with similar meanings will have similar (high dot product score) embeddings. We also note that this model can be used for monolingual English (and potentially monolingual German) tasks with comparable or even better performance than the purely English Universal Sentence Encoder. 
Note: This model only works on Linux and macOS operating systems and is not compatible with Windows due to the incompatibility of the SentencePiece library. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/tfhub_use_xling_en_de_xx_2.7.0_2.4_1607440247381.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/tfhub_use_xling_en_de_xx_2.7.0_2.4_1607440247381.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_xling_en_de", "xx") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([['I love NLP'], ['Ich benutze gerne SparkNLP']], ["text"]))
```
```scala
...
val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_xling_en_de", "xx")
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I love NLP", "Ich benutze gerne SparkNLP").toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["I love NLP", "Ich benutze gerne SparkNLP"]
embeddings_df = nlu.load('xx.use.xling_en_de').predict(text, output_level='sentence')
embeddings_df
```
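Because English and German sentences land in a shared 512-dimensional space, embeddings from this model can be compared directly with a dot product or cosine similarity. A minimal sketch in plain Python follows; the 3-dimensional vectors are toy stand-ins for real 512-dimensional model output, not values produced by the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins; real output vectors have 512 dimensions.
en_vec = [0.1, 0.3, -0.2]     # e.g. embedding of "I love NLP"
de_vec = [0.11, 0.28, -0.19]  # e.g. embedding of "Ich liebe NLP"
score = cosine_similarity(en_vec, de_vec)  # near 1.0 for similar meanings
```

Cross-lingual pairs with similar meaning should score high without ever declaring the input language, which is the property the description above highlights.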
## Results

It gives a 512-dimensional vector for each sentence.

```bash
                       xx_use_xling_en_de_embeddings                    sentence
0  [-0.011391869746148586, -0.0591440312564373, -...                  I love NLP
1  [-0.06441862881183624, -0.05351972579956055, 0...  Ich benutze gerne SparkNLP
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|tfhub_use_xling_en_de|
|Compatibility:|Spark NLP 2.7.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[sentence_embeddings]|
|Language:|xx|

## Data Source

This embeddings model is imported from [https://tfhub.dev/google/universal-sentence-encoder-xling/en-de/1](https://tfhub.dev/google/universal-sentence-encoder-xling/en-de/1)

---
layout: model
title: French CamemBert Embeddings (from edge2992)
author: John Snow Labs
name: camembert_embeddings_edge2992_generic_model
date: 2022-05-31
tags: [fr, open_source, camembert, embeddings]
task: Embeddings
language: fr
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: CamemBertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `edge2992`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_edge2992_generic_model_fr_3.4.4_3.0_1653988047667.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_edge2992_generic_model_fr_3.4.4_3.0_1653988047667.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_edge2992_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_edge2992_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_edge2992_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/edge2992/dummy-model --- layout: model title: English T5ForConditionalGeneration Small Cased model (from doc2query) author: John Snow Labs name: t5_msmarco_small_v1 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `msmarco-t5-small-v1` is an English model originally trained by `doc2query`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_msmarco_small_v1_en_4.3.0_3.0_1675105793797.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_msmarco_small_v1_en_4.3.0_3.0_1675105793797.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_msmarco_small_v1","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_msmarco_small_v1","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_msmarco_small_v1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|340.7 MB| ## References - https://huggingface.co/doc2query/msmarco-t5-small-v1 - https://arxiv.org/abs/1904.08375 - https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf - https://arxiv.org/abs/2104.08663 - https://github.com/UKPLab/beir - https://www.sbert.net/examples/unsupervised_learning/query_generation/README.html - https://github.com/microsoft/MSMARCO-Passage-Ranking --- layout: model title: English BertForQuestionAnswering Cased model (from motiondew) author: John Snow Labs name: bert_qa_sd2_lr_5e_5_bs_32_e_3 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-sd2-lr-5e-5-bs-32-e-3` is an English model originally trained by `motiondew`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_sd2_lr_5e_5_bs_32_e_3_en_4.0.0_3.0_1657188117800.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_sd2_lr_5e_5_bs_32_e_3_en_4.0.0_3.0_1657188117800.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sd2_lr_5e_5_bs_32_e_3","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sd2_lr_5e_5_bs_32_e_3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_sd2_lr_5e_5_bs_32_e_3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/motiondew/bert-sd2-lr-5e-5-bs-32-e-3 --- layout: model title: German asr_exp_w2v2r_xls_r_gender_male_10_female_0_s204 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_xls_r_gender_male_10_female_0_s204 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_gender_male_10_female_0_s204` is a German model originally trained by jonatasgrosman. NOTE: This model only works on a CPU; if you need to run it on a GPU device, please use asr_exp_w2v2r_xls_r_gender_male_10_female_0_s204_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_gender_male_10_female_0_s204_de_4.2.0_3.0_1664120459175.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_gender_male_10_female_0_s204_de_4.2.0_3.0_1664120459175.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_xls_r_gender_male_10_female_0_s204", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_xls_r_gender_male_10_female_0_s204", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
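The `audio_content` column fed to `AudioAssembler` holds each clip as an array of floats. How you build that array depends on your audio loader; as one illustrative assumption, WAV files commonly carry 16-bit little-endian PCM samples, which can be normalized to floats with only the standard library (the helper name here is ours, not part of Spark NLP):

```python
import struct

def pcm16_to_floats(raw: bytes) -> list:
    """Convert 16-bit little-endian PCM samples to floats in [-1.0, 1.0)."""
    count = len(raw) // 2
    samples = struct.unpack("<%dh" % count, raw[: count * 2])
    return [s / 32768.0 for s in samples]

# Two samples: maximum negative amplitude, then silence.
frames = struct.pack("<2h", -32768, 0)
print(pcm16_to_floats(frames))
```

A list of such float arrays can then be turned into `audioDf` with `spark.createDataFrame(...)` before running the pipeline above.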
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_xls_r_gender_male_10_female_0_s204| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: English RoBERTa Embeddings (ECtHR dataset) author: John Snow Labs name: roberta_embeddings_fairlex_ecthr_minilm date: 2022-04-14 tags: [roberta, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `fairlex-ecthr-minilm` is an English model originally trained by `coastalcph`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_fairlex_ecthr_minilm_en_3.4.2_3.0_1649947433661.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_fairlex_ecthr_minilm_en_3.4.2_3.0_1649947433661.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_fairlex_ecthr_minilm","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_fairlex_ecthr_minilm","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.fairlex_ecthr_minilm").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_fairlex_ecthr_minilm| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|114.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/coastalcph/fairlex-ecthr-minilm - https://coastalcph.github.io - https://github.com/iliaschalkidis - https://twitter.com/KiddoThe2B --- layout: model title: English image_classifier_vit_autotrain_cifar10__base ViTForImageClassification from abhishek author: John Snow Labs name: image_classifier_vit_autotrain_cifar10__base date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_autotrain_cifar10__base` is an English model originally trained by abhishek. ## Predicted Entities `deer`, `bird`, `dog`, `horse`, `automobile`, `truck`, `frog`, `ship`, `airplane`, `cat` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_autotrain_cifar10__base_en_4.1.0_3.0_1660167561719.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_autotrain_cifar10__base_en_4.1.0_3.0_1660167561719.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_autotrain_cifar10__base", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_autotrain_cifar10__base", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_autotrain_cifar10__base| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: German Public Health Mention Sequence Classifier (GBERT-base) author: John Snow Labs name: bert_sequence_classifier_health_mentions_gbert date: 2022-08-10 tags: [public_health, de, licensed, sequence_classification, health_mention] task: Text Classification language: de edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [GBERT-base](https://arxiv.org/pdf/2010.10906.pdf) based sequence classification model that can classify public health mentions in German social media text. ## Predicted Entities `non-health`, `health-related` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_health_mentions_gbert_de_4.0.2_3.0_1660133710298.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_health_mentions_gbert_de_4.0.2_3.0_1660133710298.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_health_mentions_gbert", "de", "clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame([ ["bis vor ein paar wochen hatte ich auch manchmal migräne, aber aktuell habe ich keine probleme"], ["der spiegelt ist für meine zwecke im badezimmer zu klein, es klappt nichtm harre zu machen"] ]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_health_mentions_gbert", "de", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq("bis vor ein paar wochen hatte ich auch manchmal migräne, aber aktuell habe ich keine probleme", "der spiegelt ist für meine zwecke im badezimmer zu klein, es klappt nichtm harre zu machen").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.classify.bert_sequence.health_mentions_gbert").predict("""bis vor ein paar wochen hatte ich auch manchmal migräne, aber aktuell habe ich keine probleme""") ```
## Results ```bash +---------------------------------------------------------------------------------------------+----------------+ |text |result | +---------------------------------------------------------------------------------------------+----------------+ |bis vor ein paar wochen hatte ich auch manchmal migräne, aber aktuell habe ich keine probleme|[health-related]| |der spiegelt ist für meine zwecke im badezimmer zu klein, es klappt nichtm harre zu machen |[non-health] | +---------------------------------------------------------------------------------------------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_health_mentions_gbert| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|de| |Size:|412.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References Curated from several academic and in-house datasets. ## Benchmarking ```bash label precision recall f1-score support non-health 0.97 0.91 0.94 82 health-related 0.91 0.97 0.94 69 accuracy - - 0.94 151 macro-avg 0.94 0.94 0.94 151 weighted-avg 0.94 0.94 0.94 151 ``` --- layout: model title: Javanese RoBERTa Embeddings (Small, Javanese Wikipedia) author: John Snow Labs name: roberta_embeddings_javanese_roberta_small date: 2022-04-14 tags: [roberta, embeddings, jv, open_source] task: Embeddings language: jv edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `javanese-roberta-small` is a Javanese model originally trained by `w11wo`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_javanese_roberta_small_jv_3.4.2_3.0_1649948128252.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_javanese_roberta_small_jv_3.4.2_3.0_1649948128252.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_javanese_roberta_small","jv") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_javanese_roberta_small","jv") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("jv.embed.javanese_roberta_small").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_javanese_roberta_small| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|jv| |Size:|468.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/w11wo/javanese-roberta-small - https://arxiv.org/abs/1907.11692 - https://github.com/sgugger - https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb - https://w11wo.github.io/ --- layout: model title: Multilingual BertForQuestionAnswering model (from kuppuluri) author: John Snow Labs name: bert_qa_telugu_bertu_tydiqa date: 2022-06-02 tags: [te, en, open_source, question_answering, bert, xx] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `telugu_bertu_tydiqa` is a Multilingual model originally trained by `kuppuluri`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_telugu_bertu_tydiqa_xx_4.0.0_3.0_1654192209158.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_telugu_bertu_tydiqa_xx_4.0.0_3.0_1654192209158.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_telugu_bertu_tydiqa","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_telugu_bertu_tydiqa","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("xx.answer_question.tydiqa.bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_telugu_bertu_tydiqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|xx| |Size:|413.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/kuppuluri/telugu_bertu_tydiqa - https://github.com/google-research-datasets/tydiqa --- layout: model title: Amharic Named Entity Recognition (from mbeukman) author: John Snow Labs name: xlmroberta_ner_xlm_roberta_base_finetuned_amharic_finetuned_ner_amharic date: 2022-05-17 tags: [xlm_roberta, ner, token_classification, am, open_source] task: Named Entity Recognition language: am edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-amharic-finetuned-ner-amharic` is an Amharic model originally trained by `mbeukman`. ## Predicted Entities `PER`, `ORG`, `LOC`, `DATE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_amharic_finetuned_ner_amharic_am_3.4.2_3.0_1652810288724.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_amharic_finetuned_ner_amharic_am_3.4.2_3.0_1652810288724.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_amharic_finetuned_ner_amharic","am") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["ስካርቻ nlp እወዳለሁ"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_amharic_finetuned_ner_amharic","am") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("ስካርቻ nlp እወዳለሁ").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_xlm_roberta_base_finetuned_amharic_finetuned_ner_amharic| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|am| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-amharic-finetuned-ner-amharic - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://www.apache.org/licenses/LICENSE-2.0 - https://github.com/Michael-Beukm --- layout: model title: Part of Speech for Latvian author: John Snow Labs name: pos_ud_lvtb date: 2020-07-29 23:34:00 +0800 task: Part of Speech Tagging language: lv edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [pos, lv] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_lvtb_lv_2.5.5_2.4_1596054308284.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_lvtb_lv_2.5.5_2.4_1596054308284.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_lvtb", "lv") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Džons Snovs ir ne tikai ziemeļu karalis, bet arī angļu ārsts un anestēzijas un medicīniskās higiēnas attīstības līderis.") ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_lvtb", "lv") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Džons Snovs ir ne tikai ziemeļu karalis, bet arī angļu ārsts un anestēzijas un medicīniskās higiēnas attīstības līderis.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Džons Snovs ir ne tikai ziemeļu karalis, bet arī angļu ārsts un anestēzijas un medicīniskās higiēnas attīstības līderis."""] pos_df = nlu.load('lv.pos').predict(text, output_level='token') pos_df ```
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=4, result='PROPN', metadata={'word': 'Džons'}), Row(annotatorType='pos', begin=6, end=10, result='PROPN', metadata={'word': 'Snovs'}), Row(annotatorType='pos', begin=12, end=13, result='AUX', metadata={'word': 'ir'}), Row(annotatorType='pos', begin=15, end=16, result='CCONJ', metadata={'word': 'ne'}), Row(annotatorType='pos', begin=18, end=22, result='CCONJ', metadata={'word': 'tikai'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_lvtb| |Type:|pos| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|lv| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: English asr_wav2vec2_base_20sec_timit_and_dementiabank TFWav2Vec2ForCTC from shields author: John Snow Labs name: asr_wav2vec2_base_20sec_timit_and_dementiabank date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_20sec_timit_and_dementiabank` is an English model originally trained by shields.
NOTE: This model only works on a CPU; if you need to run it on a GPU device, please use asr_wav2vec2_base_20sec_timit_and_dementiabank_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_20sec_timit_and_dementiabank_en_4.2.0_3.0_1664113179233.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_20sec_timit_and_dementiabank_en_4.2.0_3.0_1664113179233.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_20sec_timit_and_dementiabank", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_20sec_timit_and_dementiabank", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_20sec_timit_and_dementiabank| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|354.9 MB| --- layout: model title: English BertForQuestionAnswering model (from mrp) author: John Snow Labs name: bert_qa_mrp_bert_finetuned_squad date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `mrp`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mrp_bert_finetuned_squad_en_4.0.0_3.0_1654535707327.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mrp_bert_finetuned_squad_en_4.0.0_3.0_1654535707327.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mrp_bert_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_mrp_bert_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_mrp").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
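For context on what the `answer` column contains: extractive QA models of this family score every token as a candidate answer start and candidate answer end, then return the best-scoring span with start ≤ end. The following is an illustrative plain-Python sketch of that span-selection idea, not Spark NLP's actual implementation:

```python
# Sketch of extractive-QA span selection: pick (start, end) token indices
# maximising start_logit + end_logit, with start <= end and a length cap.
def best_span(start_logits, end_logits, max_len=30):
    best = None  # (score, start, end)
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            score = s + end_logits[j]
            if best is None or score > best[0]:
                best = (score, i, j)
    return best[1], best[2]
```

With context tokens `["My", "name", "is", "Clara"]` and logits peaking on "Clara", the selected span would be the single token at index 3.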
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_mrp_bert_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrp/bert-finetuned-squad --- layout: model title: Fast Neural Machine Translation Model from Germanic Languages to English author: John Snow Labs name: opus_mt_gem_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, gem, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `gem` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_gem_en_xx_2.7.0_2.4_1609169462862.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_gem_en_xx_2.7.0_2.4_1609169462862.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_gem_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Ich wünsche Ihnen einen schönen Tag."] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_gem_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Ich wünsche Ihnen einen schönen Tag.").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.gem.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_gem_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Consents Clause Binary Classifier author: John Snow Labs name: legclf_consents_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `consents` clause type. To use this model, make sure you provide enough context as an input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. 
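As a rough illustration of the paragraph-splitting approach recommended above (this is a plain-Python sketch, not the workshop's actual code; the whitespace word count is only a proxy for the model's real tokenizer):

```python
# Sketch: split a long contract into blank-line-separated paragraphs and keep
# only pieces within the model's 512-token budget before classification.
import re


def split_into_paragraphs(text, max_tokens=512):
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Whitespace splitting approximates the token count; the real embeddings
    # tokenizer may count more tokens than this.
    return [p for p in paragraphs if len(p.split()) <= max_tokens]
```

Each returned paragraph can then be fed to the classifier as a separate row in the `clause_text` column.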
## Predicted Entities `other`, `consents` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_consents_clause_en_1.0.0_3.2_1660123344303.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_consents_clause_en_1.0.0_3.2_1660123344303.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_consents_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +----------+ | result| +----------+ |[consents]| |[other]| |[other]| |[consents]| +----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_consents_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support consents 0.92 0.94 0.93 90 other 0.97 0.96 0.97 188 accuracy - - 0.96 278 macro-avg 0.95 0.95 0.95 278 weighted-avg 0.96 0.96 0.96 278 ``` --- layout: model title: Multilingual BERT Sentence Embeddings (Base Cased) author: John Snow Labs name: sent_bert_multi_cased date: 2020-08-25 task: Embeddings language: xx edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, xx] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains a deep bidirectional transformer trained on Wikipedia and the BookCorpus. The details are described in the paper "[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_multi_cased_xx_2.6.0_2.4_1598347692999.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_multi_cased_xx_2.6.0_2.4_1598347692999.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_bert_multi_cased", "xx") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer'], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_bert_multi_cased", "xx") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('xx.embed_sentence.bert.cased').predict(text, output_level='sentence') embeddings_df ```
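The sentence vectors this model produces are typically compared with cosine similarity, for example for semantic search or clustering. A minimal sketch of that computation:

```python
# Cosine similarity between two embedding vectors (e.g. two rows of the
# "sentence_embeddings" column), computed with the standard library only.
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```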
{:.h2_title} ## Results ```bash xx_embed_sentence_bert_cased_embeddings sentence [-0.3695415258407593, -0.33228799700737, 0.553... I hate cancer [-0.2776091396808624, -0.35221806168556213, 0.... Antibiotics aren't painkiller ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_bert_multi_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|xx| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3](https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3) --- layout: model title: Detect Pathogen, Medical Condition and Medicine (BertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_pathogen date: 2022-07-28 tags: [licensed, clinical, en, ner, pathogen, medical_condition, medicine, berfortokenclassification] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained named entity recognition (NER) model is a deep learning model for detecting medical conditions (influenza, headache, malaria, etc.), medicine (aspirin, penicillin, methotrexate) and pathogens (Corona Virus, Zika Virus, E. Coli, etc.) in clinical texts. This model is trained with the `BertForTokenClassification` method from the transformers library and imported into Spark NLP. 
## Predicted Entities `medicine`, `medical_condition`, `pathogen` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_pathogen_en_4.0.0_3.0_1659033806498.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_pathogen_en_4.0.0_3.0_1659033806498.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') sentenceDetector = SentenceDetectorDLModel.pretrained()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(['sentence']) \ .setOutputCol('token') tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_pathogen", "en", "clinical/models")\ .setInputCols(['token', "sentence"])\ .setOutputCol("label")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["sentence","token","label"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter ]) data = spark.createDataFrame([["""Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. 
This can progress to loss of skin color, a fast heart rate as it becomes more severe; while it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_pathogen", "en", "clinical/models") .setInputCols(Array("token", "sentence")) .setOutputCol("label") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("sentence","token","label")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documenter, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val data = Seq("Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. This can progress to loss of skin color, a fast heart rate as it becomes more severe; while it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.pathogen").predict("""Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. 
Signs of dehydration often begin with loss of the normal stretchiness of the skin. This can progress to loss of skin color, a fast heart rate as it becomes more severe; while it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions.""") ```
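The `NerConverter` stage in the pipelines above merges token-level IOB tags into entity chunks. The following plain-Python sketch illustrates the idea; it is not Spark NLP's actual implementation:

```python
# Sketch of IOB chunk merging: "B-" starts a chunk, a matching "I-" extends it,
# anything else closes it. Returns (chunk_text, entity_label) pairs.
def merge_iob(tokens, tags):
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = (token, tag[2:])
        elif tag.startswith("I-") and current and tag[2:] == current[1]:
            current = (current[0] + " " + token, current[1])
        else:
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return chunks
```

For example, tokens `["rabies", "virus", "and", "Lyssavirus"]` with tags `["B-Pathogen", "I-Pathogen", "O", "B-Pathogen"]` yield the two chunks "rabies virus" and "Lyssavirus".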
## Results ```bash +---------------+----------------+ |chunk          |label | +---------------+----------------+ |Racecadotril |Medicine | |loperamide |Medicine | |Diarrhea |MedicalCondition| |loose |MedicalCondition| |liquid |MedicalCondition| |watery |MedicalCondition| |bowel movements|MedicalCondition| |dehydration |MedicalCondition| |loss |MedicalCondition| |color |MedicalCondition| |fast |MedicalCondition| |heart rate |MedicalCondition| |rabies virus |Pathogen | |Lyssavirus |Pathogen | |Ephemerovirus |Pathogen | +---------------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_pathogen| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## Benchmarking ```bash label precision recall f1-score support MedicalCondition 0.73 0.78 0.75 49 Medicine 0.95 0.95 0.95 38 Pathogen 0.77 0.91 0.83 11 micro-avg 0.82 0.86 0.84 98 macro-avg 0.82 0.88 0.84 98 weighted-avg 0.82 0.86 0.84 98 ``` --- layout: model title: GPT2 text-to-text model (Medium) author: John Snow Labs name: gpt2_medium date: 2021-12-03 tags: [gpt2, en, open_source] task: Text Generation language: en nav_key: models edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: GPT2Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where the model is primed with an input and it generates a lengthy continuation. This is the medium model (larger than the base model). 
Other models (base, large) are also available in [Models Hub](https://nlp.johnsnowlabs.com/models?task=Text+Generation) ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/gpt2_medium_en_3.4.0_3.0_1638517188768.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/gpt2_medium_en_3.4.0_3.0_1638517188768.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") gpt2 = GPT2Transformer.pretrained("gpt2_medium") \ .setInputCols(["documents"]) \ .setMaxOutputLength(50) \ .setOutputCol("generation") pipeline = Pipeline().setStages([documentAssembler, gpt2]) data = spark.createDataFrame([["My name is Leonardo."]]).toDF("text") result = pipeline.fit(data).transform(data) result.select("generation.result").show(truncate=False) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("documents") val gpt2 = GPT2Transformer.pretrained("gpt2_medium") .setInputCols(Array("documents")) .setMinOutputLength(10) .setMaxOutputLength(50) .setDoSample(false) .setTopK(50) .setNoRepeatNgramSize(3) .setOutputCol("generation") val pipeline = new Pipeline().setStages(Array(documentAssembler, gpt2)) val data = Seq("My name is Leonardo.").toDF("text") val result = pipeline.fit(data).transform(data) result.select("generation.result").show(truncate = false) ``` {:.nlu-block} ```python import nlu nlu.load("en.gpt2.medium").predict("""My name is Leonardo.""") ```
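The Scala example sets `setNoRepeatNgramSize(3)`, which bans any token that would complete a 3-gram already present in the generated text. An illustrative sketch of that blocking rule (not the transformer's actual decoder code):

```python
# Sketch of no-repeat n-gram blocking: given the tokens generated so far,
# return the set of next tokens that would repeat an existing n-gram.
def banned_next_tokens(generated, n=3):
    if len(generated) < n - 1:
        return set()
    prefix = tuple(generated[-(n - 1):])  # the last n-1 tokens
    banned = set()
    for i in range(len(generated) - n + 1):
        # If this position starts the same (n-1)-token prefix, the token that
        # followed it earlier is banned now.
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned
```

At each decoding step, the decoder zeroes out (or assigns negative infinity to) the logits of the banned tokens before sampling or taking the argmax.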
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |result | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |[ My name is Leonardo. I am a man of letters. I have been a man for many years. I was born in the year 1776. I came to the | | United States in 1776, and I have lived in the United Kingdom since 1776] | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|gpt2_medium| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[generation]| |Language:|en| ## Data Source OpenAI WebText - a corpus created by scraping web pages with emphasis on document quality. It consists of over 8 million documents for a total of 40 GB of text. All Wikipedia documents were removed from WebText. --- layout: model title: English DistilBertForQuestionAnswering model (from guhuawuli) author: John Snow Labs name: distilbert_qa_guhuawuli_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`distilbert-base-uncased-finetuned-squad` is an English model originally trained by `guhuawuli`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_guhuawuli_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725334451.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_guhuawuli_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725334451.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_guhuawuli_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_guhuawuli_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_guhuawuli").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_guhuawuli_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/guhuawuli/distilbert-base-uncased-finetuned-squad --- layout: model title: BERT Biomedical Embeddings author: John Snow Labs name: bert_biomed_pubmed_uncased date: 2022-02-21 tags: [pubmed, biomedical, bert, microsoft, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This embeddings model was imported from `Hugging Face` ([link](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext)). It is pretrained from scratch using abstracts from PubMed and full-text articles from PubMedCentral. This model achieves state-of-the-art performance on many biomedical NLP tasks, and currently holds the top score on the Biomedical Language Understanding and Reasoning Benchmark. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_biomed_pubmed_uncased_en_3.4.0_3.0_1645440129466.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_biomed_pubmed_uncased_en_3.4.0_3.0_1645440129466.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_biomed_pubmed_uncased", "en")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_biomed_pubmed_uncased", "en") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_biomed_pubmed_uncased| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|411.0 MB| |Case sensitive:|true| ## References [https://pubmed.ncbi.nlm.nih.gov/](https://pubmed.ncbi.nlm.nih.gov/) ``` @misc{pubmedbert, author = {Yu Gu and Robert Tinn and Hao Cheng and Michael Lucas and Naoto Usuyama and Xiaodong Liu and Tristan Naumann and Jianfeng Gao and Hoifung Poon}, title = {Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing}, year = {2020}, eprint = {arXiv:2007.15779}, } ``` --- layout: model title: Detect Clinical Entities (BertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_jsl date: 2023-05-04 tags: [licensed, en, clinical, jsl, ner, berfortokenclassification, tensorflow] task: Named Entity Recognition language: en edition: Healthcare NLP 3.4.0 spark_version: 3.0 supported: true recommended: true engine: tensorflow annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terminology. This model is trained with the `BertForTokenClassification` method from the transformers library and imported into Spark NLP. Definitions of Predicted Entities: - `Injury_or_Poisoning`: Physical harm or injury caused to the body, including those caused by accidents, falls, or poisoning of a patient or someone else. - `Direction`: All the information relating to the laterality of the internal and external organs. - `Test`: Mentions of laboratory, pathology, and radiological tests. - `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient. - `Death_Entity`: Mentions that indicate the death of a patient. 
- `Relationship_Status`: State of the patient's romantic or social relationships (e.g. single, married, divorced). - `Duration`: The duration of a medical treatment or medication use. - `Respiration`: Number of breaths per minute. - `Hyperlipidemia`: Terms that indicate hyperlipidemia with relevant subtypes and synonyms. - `Birth_Entity`: Mentions that indicate giving birth. - `Age`: All mentions of ages, past or present, related to the patient or to anybody else. - `Labour_Delivery`: Extractions include stages of labor and delivery. - `Family_History_Header`: Identifies section headers that correspond to Family History of the patient. - `BMI`: Numeric values and other text information related to Body Mass Index. - `Temperature`: All mentions that refer to body temperature. - `Alcohol`: Terms that indicate alcohol use, abuse or drinking issues of a patient or someone else. - `Kidney_Disease`: Terms that refer to any kidney diseases (includes mentions of modifiers such as "Acute" or "Chronic"). - `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else. - `Medical_History_Header`: Identifies section headers that correspond to Past Medical History of a patient. - `Cerebrovascular_Disease`: All terms that refer to cerebrovascular diseases and events. - `Oxygen_Therapy`: Breathing support triggered by the patient or entirely or partially by machine (e.g. ventilator, BPAP, CPAP). - `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements. - `Psychological_Condition`: All the mental health diagnoses, disorders, conditions or syndromes of a patient or someone else. - `Heart_Disease`: All mentions of acquired, congenital or degenerative heart diseases. - `Employment`: All mentions of patient or provider occupational titles and employment status. - `Obesity`: Terms related to a patient being obese (overweight and BMI are extracted as different labels). 
- `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.). - `Pregnancy`: All terms related to Pregnancy (excluding terms that are extracted with their specific labels, such as "Labour_Delivery" etc.). - `ImagingFindings`: All mentions of radiographic and imagistic findings. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments. - `Medical_Device`: All mentions related to medical devices and supplies. - `Race_Ethnicity`: All terms that refer to racial and national origin of sociocultural groups. - `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital signs Headers are extracted separately with their specific labels). - `Symptom`: All the symptoms mentioned in the document, of a patient or someone else. - `Treatment`: Includes therapeutic and minimally invasive treatment and procedures (invasive treatments or procedures are extracted as "Procedure"). - `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs). - `Route`: Drug and medication administration routes, as described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_Ingredient`: Active ingredient/s found in drug products. - `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted. - `Diet`: All mentions and information regarding the patient's dietary habits. - `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by the naked eye. - `LDL`: All mentions related to the lab test and results for LDL (Low Density Lipoprotein). 
- `VS_Finding`: Qualitative data (e.g. Fever, Cyanosis, Tachycardia) and any other symptoms that refer to vital signs. - `Allergen`: Allergen-related extractions mentioned in the document. - `EKG_Findings`: All mentions of EKG readings. - `Imaging_Technique`: All mentions of special radiographic views or special imaging techniques used in radiology. - `Triglycerides`: All terms related to the specific lab test for Triglycerides. - `RelativeTime`: Time references that are relative to different times or events (e.g. words such as "approximately", "in the morning"). - `Gender`: Gender-specific nouns and pronouns. - `Pulse`: Peripheral heart rate, without advanced information like measurement location. - `Social_History_Header`: Identifies section headers that correspond to Social History of a patient. - `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs). - `Diabetes`: All terms related to diabetes mellitus. - `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately. - `Internal_organ_or_component`: All mentions related to internal body parts or organs that cannot be examined with the naked eye. - `Clinical_Dept`: Terms that indicate the medical and/or surgical departments. - `Form`: Drug and medication forms, as described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing one or more active ingredients. 
- `Strength`: Potency of one unit of a drug (or a combination of drugs); the measurement units are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Fetus_NewBorn`: All terms related to a fetus, infant or newborn (excluding terms that are extracted with their specific labels, such as "Labour_Delivery", "Pregnancy" etc.). - `RelativeDate`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "approximately two years ago", "about two days ago"). - `Height`: All mentions related to a patient's height. - `Test_Result`: Terms related to all the test results present in the document (clinical test results are included). - `Sexually_Active_or_Sexual_Orientation`: All terms that are related to sexuality, sexual orientations and sexual activity. - `Frequency`: Frequency of administration for a prescribed dose. - `Time`: Specific time references (hour and/or minutes). - `Weight`: All mentions related to a patient's weight. - `Vaccine`: Generic and brand names of vaccines or vaccination procedures. - `Vital_Signs_Header`: Identifies section headers that correspond to Vital Signs of a patient. - `Communicable_Disease`: Includes all mentions of communicable diseases. - `Dosage`: Quantity prescribed by the physician for an active ingredient; the measurement units are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Overweight`: Terms related to the patient being overweight (BMI and Obesity are extracted separately). - `Hypertension`: All terms related to Hypertension (quantitative data such as 150/100 is extracted as Blood_Pressure). 
- `HDL`: Terms related to the lab test for HDL (High Density Lipoprotein). - `Total_Cholesterol`: Terms related to the lab test and results for cholesterol. - `Smoking`: All mentions of smoking status of a patient. - `Date`: Mentions of an exact date, in any format, including day number, month and/or year. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_en_3.4.0_3.0_1683228461283.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_en_3.4.0_3.0_1683228461283.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_jsl", "en", "clinical/models")\ .setInputCols(["token", "sentence"])\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverterInternal()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) sample_text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature." 
df = spark.createDataFrame([[sample_text]]).toDF("text") result = pipeline.fit(df).transform(df) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_jsl", "en", "clinical/models") .setInputCols(Array("token", "sentence")) .setOutputCol("ner") .setCaseSensitive(true) val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val sample_text = Seq("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.").toDS.toDF("text") val result = pipeline.fit(sample_text).transform(sample_text) ```
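The `NerConverterInternal` stage merges the token classifier's `B-`/`I-` tags into multi-token entity chunks. The idea behind that merging step can be sketched in plain Python (a simplified illustration of IOB decoding, not the actual Spark NLP implementation):

```python
# Simplified IOB-tag merging: consecutive B-/I- tags of the same label
# are joined into one (chunk, label) pair; "O" tags close any open chunk.
def merge_iob(tokens, tags):
    chunks = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:  # "O", or an I- tag that does not continue the open chunk
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks

tokens = ["21-day-old", "Caucasian", "male", "with", "perioral", "cyanosis"]
tags = ["B-Age", "B-Race_Ethnicity", "B-Gender", "O", "B-Symptom", "I-Symptom"]
print(merge_iob(tokens, tags))
# [('21-day-old', 'Age'), ('Caucasian', 'Race_Ethnicity'), ('male', 'Gender'), ('perioral cyanosis', 'Symptom')]
```

This is why "perioral cyanosis" appears as a single `Symptom` chunk in the results below, even though the classifier tags each token separately.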
## Results ```bash +-----------------------------------------+----------------------------+ |chunk |ner_label | +-----------------------------------------+----------------------------+ |21-day-old |Age | |Caucasian |Race_Ethnicity | |male |Gender | |2 days |Duration | |congestion |Symptom | |mom |Gender | |yellow |Symptom | |discharge |Symptom | |nares |External_body_part_or_region| |she |Gender | |mild |Modifier | |problems with his breathing while feeding|Symptom | |perioral cyanosis |Symptom | |retractions |Symptom | |One day ago |RelativeDate | |mom |Gender | |tactile temperature |Symptom | |Tylenol |Drug_BrandName | |Baby-girl |Age | |decreased |Symptom | +-----------------------------------------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_jsl| |Compatibility:|Healthcare NLP 3.4.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|408.3 MB| |Case sensitive:|true| |Max sentence length:|128| ## Benchmarking ```bash label precision recall f1-score support B-Admission_Discharge 0.87 0.98 0.92 421 B-Age 0.97 0.97 0.97 2143 B-Alcohol 0.90 0.85 0.87 101 B-Allergen 0.29 0.32 0.30 31 B-BMI 0.77 0.77 0.77 13 B-Birth_Entity 0.30 0.60 0.40 5 B-Blood_Pressure 0.84 0.86 0.85 193 B-Cerebrovascular_Disease 0.63 0.81 0.71 176 B-Clinical_Dept 0.88 0.91 0.90 1459 B-Communicable_Disease 0.78 0.84 0.81 50 B-Date 0.97 0.99 0.98 1393 B-Death_Entity 0.71 0.82 0.76 44 B-Diabetes 0.95 0.97 0.96 145 B-Diet 0.67 0.51 0.58 112 B-Direction 0.90 0.93 0.91 5617 B-Disease_Syndrome_Disorder 0.89 0.89 0.89 7569 B-Dosage 0.75 0.79 0.77 333 B-Drug_BrandName 0.95 0.93 0.94 3165 B-Drug_Ingredient 0.92 0.94 0.93 5134 B-Duration 0.81 0.84 0.82 389 B-EKG_Findings 0.63 0.46 0.53 145 B-Employment 0.87 0.91 0.89 406 B-External_body_part_or_region 0.83 0.85 0.84 3647 B-Family_History_Header 1.00 1.00 1.00 333 B-Fetus_NewBorn 0.64 0.70 0.67 166 
B-Form 0.81 0.79 0.80 334 B-Frequency 0.93 0.95 0.94 1343 B-Gender 0.99 0.99 0.99 5644 B-HDL 0.55 1.00 0.71 6 B-Heart_Disease 0.88 0.89 0.89 1052 B-Height 1.00 0.73 0.84 26 B-Hyperlipidemia 0.97 0.99 0.98 236 B-Hypertension 0.97 0.98 0.97 506 B-ImagingFindings 0.50 0.46 0.48 187 B-Imaging_Technique 0.90 0.78 0.84 95 B-Injury_or_Poisoning 0.87 0.84 0.86 793 B-Internal_organ_or_component 0.89 0.90 0.89 14484 B-Kidney_Disease 0.83 0.79 0.81 201 B-LDL 1.00 1.00 1.00 12 B-Labour_Delivery 0.67 0.72 0.69 129 B-Medical_Device 0.88 0.91 0.89 7574 B-Medical_History_Header 0.97 0.98 0.98 218 B-Modifier 0.83 0.83 0.83 4226 B-O2_Saturation 0.74 0.68 0.71 59 B-Obesity 0.81 0.81 0.81 108 B-Oncological 0.94 0.93 0.94 917 B-Overweight 1.00 0.80 0.89 10 B-Oxygen_Therapy 0.74 0.84 0.79 148 B-Pregnancy 0.82 0.80 0.81 279 B-Procedure 0.90 0.91 0.91 8012 B-Psychological_Condition 0.78 0.87 0.82 245 B-Pulse 0.88 0.93 0.90 152 B-Race_Ethnicity 0.99 1.00 0.99 157 B-Relationship_Status 0.89 0.93 0.91 44 B-RelativeDate 0.81 0.83 0.82 630 B-RelativeTime 0.66 0.62 0.64 167 B-Respiration 0.93 0.97 0.95 131 B-Route 0.85 0.90 0.87 1430 B-Section_Header 0.98 0.98 0.98 15527 B-Sexually_Active_or_Sexual_Orientation 0.83 0.56 0.67 9 B-Smoking 0.94 0.92 0.93 186 B-Social_History_Header 0.90 1.00 0.95 401 B-Strength 0.91 0.94 0.93 721 B-Substance 0.98 0.88 0.93 160 B-Symptom 0.88 0.88 0.88 13889 B-Temperature 0.92 0.94 0.93 161 B-Test 0.86 0.90 0.88 5336 B-Test_Result 0.86 0.87 0.87 2240 B-Time 0.74 0.87 0.80 68 B-Total_Cholesterol 0.66 0.72 0.69 29 B-Treatment 0.79 0.75 0.77 391 B-Triglycerides 1.00 1.00 1.00 22 B-VS_Finding 0.80 0.86 0.83 555 B-Vaccine 0.62 0.71 0.67 35 B-Vaccine_Name 0.65 0.75 0.70 20 B-Vital_Signs_Header 0.92 0.96 0.94 1131 B-Weight 0.70 0.76 0.73 90 I-Age 0.83 0.79 0.81 202 I-Alcohol 0.50 0.25 0.33 8 I-Allergen 0.25 0.50 0.33 2 I-BMI 0.71 0.74 0.73 27 I-Blood_Pressure 0.91 0.92 0.91 476 I-Cerebrovascular_Disease 0.48 0.79 0.59 66 I-Clinical_Dept 0.93 0.94 0.94 878 
I-Communicable_Disease 0.29 0.71 0.42 7 I-Date 0.91 0.95 0.93 136 I-Death_Entity 0.33 0.50 0.40 2 I-Diabetes 0.97 0.95 0.96 81 I-Diet 0.65 0.54 0.59 80 I-Direction 0.81 0.80 0.80 353 I-Disease_Syndrome_Disorder 0.88 0.86 0.87 4633 I-Dosage 0.85 0.80 0.82 266 I-Drug_BrandName 0.78 0.64 0.70 67 I-Drug_Ingredient 0.80 0.85 0.82 437 I-Duration 0.87 0.92 0.89 802 I-EKG_Findings 0.68 0.50 0.58 209 I-Employment 0.77 0.77 0.77 141 I-External_body_part_or_region 0.79 0.81 0.80 1063 I-Family_History_Header 0.99 1.00 0.99 424 I-Fetus_NewBorn 0.63 0.55 0.59 154 I-Frequency 0.84 0.79 0.82 510 I-Gender 0.43 0.30 0.35 10 I-HDL 0.50 1.00 0.67 3 I-Heart_Disease 0.89 0.88 0.88 1120 I-Height 0.95 0.94 0.95 66 I-Hypertension 0.33 0.32 0.33 22 I-ImagingFindings 0.61 0.59 0.60 256 I-Imaging_Technique 0.86 0.83 0.84 59 I-Injury_or_Poisoning 0.78 0.76 0.77 616 I-Internal_organ_or_component 0.89 0.89 0.89 6723 I-Kidney_Disease 0.88 0.92 0.90 223 I-LDL 1.00 1.00 1.00 8 I-Labour_Delivery 0.65 0.77 0.70 124 I-Medical_Device 0.88 0.92 0.90 4220 I-Medical_History_Header 1.00 0.99 0.99 939 I-Modifier 0.63 0.59 0.61 305 I-O2_Saturation 0.88 0.80 0.84 110 I-Obesity 0.68 0.65 0.67 20 I-Oncological 0.94 0.92 0.93 692 I-Oxygen_Therapy 0.77 0.75 0.76 61 I-Pregnancy 0.59 0.68 0.63 143 I-Procedure 0.91 0.90 0.90 7483 I-Psychological_Condition 0.76 0.82 0.79 77 I-Pulse 0.92 0.90 0.91 226 I-Race_Ethnicity 1.00 1.00 1.00 1 I-RelativeDate 0.83 0.90 0.86 890 I-RelativeTime 0.73 0.68 0.71 233 I-Respiration 0.98 0.93 0.96 117 I-Route 0.90 0.90 0.90 375 I-Section_Header 0.97 0.98 0.98 12864 I-Sexually_Active_or_Sexual_Orientation 0.57 1.00 0.73 4 I-Smoking 0.40 0.33 0.36 6 I-Social_History_Header 0.92 0.99 0.95 478 I-Strength 0.86 0.94 0.90 502 I-Substance 0.84 0.76 0.80 50 I-Symptom 0.78 0.75 0.77 7167 I-Temperature 0.97 0.99 0.98 233 I-Test 0.81 0.85 0.83 3038 I-Test_Result 0.59 0.63 0.61 413 I-Time 0.80 0.80 0.80 71 I-Total_Cholesterol 0.82 0.86 0.84 43 I-Treatment 0.72 0.73 0.73 187 I-Triglycerides 0.87 
1.00 0.93 20 I-VS_Finding 0.55 0.59 0.57 86 I-Vaccine 1.00 0.67 0.80 6 I-Vaccine_Name 0.86 1.00 0.92 18 I-Vital_Signs_Header 0.94 0.97 0.96 1063 I-Weight 0.93 0.88 0.90 237 O 0.97 0.96 0.97 200407 accuracy - - 0.93 386755 macro-avg 0.77 0.79 0.77 386755 weighted-avg 0.93 0.93 0.93 386755 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from Swty) author: John Snow Labs name: roberta_qa_base_squad_finetuned_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad-finetuned-squad` is an English model originally trained by `Swty`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad_finetuned_squad_en_4.3.0_3.0_1674218785890.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad_finetuned_squad_en_4.3.0_3.0_1674218785890.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
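Extractive QA models of this kind score every context token as a possible answer start and answer end, and the highest-scoring valid span is returned as the answer. The span-selection step can be illustrated with a toy sketch in plain Python (an assumption-laden simplification, not the Spark NLP internals; the scores here are made up):

```python
# Toy span selection for extractive QA: choose (start, end) maximizing
# start_score + end_score subject to start <= end, then join those tokens.
def best_span(tokens, start_scores, end_scores):
    best = (float("-inf"), 0, 0)
    for i, s in enumerate(start_scores):
        for j in range(i, len(end_scores)):
            score = s + end_scores[j]
            if score > best[0]:
                best = (score, i, j)
    _, i, j = best
    return " ".join(tokens[i:j + 1])

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley"]
start = [0.1, 0.2, 0.1, 4.0, 0.1, 0.1, 0.1, 0.1, 0.3]
end = [0.1, 0.1, 0.2, 3.5, 0.1, 0.1, 0.1, 0.1, 0.4]
print(best_span(tokens, start, end))  # Clara
```

With the example data above ("What's my name?" over the Clara/Berkeley context), a well-trained model concentrates the start and end scores on "Clara", which is what the pipeline returns in its `answer` column.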
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_squad_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Swty/roberta-base-squad-finetuned-squad --- layout: model title: Smaller BERT Embeddings (L-6_H-512_A-8) author: John Snow Labs name: small_bert_L6_512 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L6_512_en_2.6.0_2.4_1598344643979.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L6_512_en_2.6.0_2.4_1598344643979.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L6_512", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L6_512", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L6_512').predict(text, output_level='token') embeddings_df ```
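The embeddings column produced above holds one dense 512-dimensional vector per token; a common way to compare such vectors is cosine similarity. A minimal, framework-free sketch (the vectors here are tiny stand-ins for the real 512-dimensional outputs):

```python
import math

# Cosine similarity between two embedding vectors, e.g. the per-token
# vectors produced by the embeddings stage above.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

In practice you would pull the vectors out of the `embeddings` column of the transformed DataFrame and compare tokens or pooled sentences the same way.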
{:.h2_title} ## Results ```bash token en_embed_bert_small_L6_512_embeddings I [-1.024675965309143, 1.3911163806915283, 0.305... love [-0.46703386306762695, -0.024186130613088608, ... NLP [0.2468228042125702, -0.9001016616821289, -0.2... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|small_bert_L6_512| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|512| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-512_A-8/1 --- layout: model title: Lysandre's Tapas Table Understanding (Base) author: John Snow Labs name: table_qa_tapas_temporary_repo date: 2022-09-30 tags: [en, table, qa, question, answering, open_source] task: Table Question Answering language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: TapasForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Zero-shot Table Understanding Model which allows you to carry out Question Answering on Spark DataFrames. If your data is stored in a table format such as CSV, load it into a Spark DataFrame before using this model. Size of this model: Base. Has aggregation operations?: True. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_temporary_repo_en_4.2.0_3.0_1664530579779.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_temporary_repo_en_4.2.0_3.0_1664530579779.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python json_data = """ { "header": ["name", "money", "age"], "rows": [ ["Donald Trump", "$100,000,000", "75"], ["Elon Musk", "$20,000,000,000,000", "55"] ] } """ queries = [ "Who earns less than 200,000,000?", "Who earns 100,000,000?", "How much money has Donald Trump?", "How old are they?", ] data = spark.createDataFrame([ [json_data, " ".join(queries)] ]).toDF("table_json", "questions") document_assembler = MultiDocumentAssembler() \ .setInputCols("table_json", "questions") \ .setOutputCols("document_table", "document_questions") sentence_detector = SentenceDetector() \ .setInputCols(["document_questions"]) \ .setOutputCol("questions") table_assembler = TableAssembler()\ .setInputCols(["document_table"])\ .setOutputCol("table") tapas = TapasForQuestionAnswering\ .pretrained("table_qa_tapas_temporary_repo","en")\ .setInputCols(["questions", "table"])\ .setOutputCol("answers") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, table_assembler, tapas ]) model = pipeline.fit(data) model\ .transform(data)\ .selectExpr("explode(answers) AS answer")\ .select("answer")\ .show(truncate=False) ```
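The `table_json` column in the example above follows a simple schema: a JSON object with a `header` list and a `rows` list of equal-length string lists. When your data starts out as Python objects rather than a hand-written string, the JSON can be built with the standard library (a sketch; the field names follow the example above):

```python
import json

# Build the table JSON consumed by the TableAssembler from plain Python data.
header = ["name", "money", "age"]
rows = [
    ["Donald Trump", "$100,000,000", "75"],
    ["Elon Musk", "$20,000,000,000,000", "55"],
]
table_json = json.dumps({"header": header, "rows": rows})

# Sanity-check the round trip: every row must match the header's width.
parsed = json.loads(table_json)
assert all(len(row) == len(parsed["header"]) for row in parsed["rows"])
print(parsed["header"])  # ['name', 'money', 'age']
```

The resulting `table_json` string can then be placed in the DataFrame exactly as the hand-written `json_data` string is in the example.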
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |answer | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} | |{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|table_qa_tapas_temporary_repo| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|413.9 MB| |Case sensitive:|false| ## References https://huggingface.co/models?pipeline_tag=table-question-answering --- layout: model title: Word Embeddings for French (word2vec_wiki_1000) author: John Snow Labs name: word2vec_wiki_1000 date: 2022-01-26 tags: [fr, open_source] task: Embeddings language: fr edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This French Word2Vec model was trained by [Jean-Philippe Fauconnier](https://fauconnier.github.io/) on a [French Wikipedia Dump of 
2015](https://dumps.wikimedia.org/frwiki/) with a window size of 100 and 1000 dimensions. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/word2vec_wiki_1000_fr_3.4.0_3.0_1643203216331.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/word2vec_wiki_1000_fr_3.4.0_3.0_1643203216331.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("word2vec_wiki_1000", "fr")\ .setInputCols(["document", "token"])\ .setOutputCol("embeddings") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("word2vec_wiki_1000", "fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("fr.embed.word2vec_wiki_1000").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|word2vec_wiki_1000| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|248.8 MB| |Case sensitive:|false| |Dimension:|1000| ## References This model was trained by [Jean-Philippe Fauconnier](https://fauconnier.github.io/) on a [French Wikipedia Dump of 2015](https://dumps.wikimedia.org/frwiki/). [[1]](#1) [1] Fauconnier, Jean-Philippe (2015), French Word Embeddings, http://fauconnier.github.io --- layout: model title: German asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s103 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s103 date: 2022-09-26 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s103` is a German model originally trained by jonatasgrosman. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s103_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s103_de_4.2.0_3.0_1664189006934.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s103_de_4.2.0_3.0_1664189006934.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s103", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s103", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s103| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: Italian CamemBert Embeddings (from Musixmatch) author: John Snow Labs name: camembert_embeddings_umberto_commoncrawl_cased_v1 date: 2022-05-23 tags: [it, open_source, camembert, embeddings] task: Embeddings language: it edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `umberto-commoncrawl-cased-v1` is an Italian model originally trained by `Musixmatch`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_umberto_commoncrawl_cased_v1_it_3.4.4_3.0_1653321322277.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_umberto_commoncrawl_cased_v1_it_3.4.4_3.0_1653321322277.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_umberto_commoncrawl_cased_v1","it") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Adoro Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_umberto_commoncrawl_cased_v1","it") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Adoro Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_umberto_commoncrawl_cased_v1| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|it| |Size:|265.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1 - https://github.com/musixmatchresearch/umberto - https://traces1.inria.fr/oscar/ - http://bit.ly/35zO7GH - https://github.com/google/sentencepiece - https://github.com/UniversalDependencies/UD_Italian-ISDT - https://github.com/UniversalDependencies/UD_Italian-ParTUT - http://www.evalita.it/ - https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500 - https://www.sciencedirect.com/science/article/pii/S0004370212000276?via%3Dihub - https://github.com/loretoparisi - https://github.com/simonefrancia - https://github.com/paulthemagno - https://twitter.com/Musixmatch - https://twitter.com/musixmatchai - https://github.com/musixmatchresearch --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from victorlee071200) author: John Snow Labs name: roberta_qa_base_finetuned_squad_v2 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-base-finetuned-squad_v2` is an English model originally trained by `victorlee071200`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_squad_v2_en_4.3.0_3.0_1674210438571.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_squad_v2_en_4.3.0_3.0_1674210438571.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_squad_v2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_squad_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
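Extractive QA models of this kind score every context token as a candidate answer start and answer end, and return the highest-scoring valid span. A minimal pure-Python sketch of that selection step (illustrative only, not the Spark NLP internals; the tokens and scores below are made up):

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return (start, end) token indices maximizing start + end score over valid spans."""
    best_score, best = float("-inf"), (0, 0)
    for i, s in enumerate(start_scores):
        # Only consider spans that start at i, end at or after i, and stay short
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score, best = s + end_scores[j], (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 2.5, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]
end = [0.0, 0.1, 0.2, 2.2, 0.1, 0.0, 0.1, 0.0, 0.4, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```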
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_finetuned_squad_v2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|307.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/victorlee071200/distilroberta-base-finetuned-squad_v2 --- layout: model title: Cyberbullying Classifier in Turkish texts. author: John Snow Labs name: classifierdl_berturk_cyberbullying date: 2021-07-21 tags: [tr, cyberbullying, classification, public, berturk, open_source] task: Text Classification language: tr edition: Spark NLP 3.1.2 spark_version: 2.4 supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Identifies whether a Turkish text contains cyberbullying or not. ## Predicted Entities `Negative`, `Positive` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_TR_CYBERBULLYING/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_TR_CYBERBULLYING.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_berturk_cyberbullying_tr_3.1.2_2.4_1626884209141.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_berturk_cyberbullying_tr_3.1.2_2.4_1626884209141.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... berturk_embeddings = BertEmbeddings.pretrained("bert_base_turkish_uncased", "tr") \ .setInputCols("document", "lemma") \ .setOutputCol("embeddings") embeddingsSentence = SentenceEmbeddings() \ .setInputCols(["document", "embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") document_classifier = ClassifierDLModel.pretrained('classifierdl_berturk_cyberbullying', 'tr') \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") berturk_pipeline = Pipeline(stages=[document_assembler, tokenizer, normalizer, stopwords_cleaner, lemma, berturk_embeddings, embeddingsSentence, document_classifier]) light_pipeline = LightPipeline(berturk_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) result = light_pipeline.annotate("""Gidişin olsun, dönüşün olmasın inşallah senin..""") result["class"] ``` ```scala ... val berturk_embeddings = BertEmbeddings.pretrained("bert_base_turkish_uncased", "tr") .setInputCols("document", "lemma") .setOutputCol("embeddings") val embeddingsSentence = new SentenceEmbeddings() .setInputCols(Array("document", "embeddings")) .setOutputCol("sentence_embeddings") .setPoolingStrategy("AVERAGE") val document_classifier = ClassifierDLModel.pretrained("classifierdl_berturk_cyberbullying", "tr") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val berturk_pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, normalizer, stopwords_cleaner, lemma, berturk_embeddings, embeddingsSentence, document_classifier)) val light_pipeline = new LightPipeline(berturk_pipeline.fit(Seq("").toDF("text"))) val result = light_pipeline.annotate("Gidişin olsun, dönüşün olmasın inşallah senin..") ``` {:.nlu-block} ```python import nlu nlu.load("tr.classify.cyberbullying").predict("""Gidişin olsun, dönüşün olmasın inşallah senin..""") ```
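The `AVERAGE` pooling strategy used by `SentenceEmbeddings` above reduces the per-token embeddings to one sentence vector via an element-wise mean. A toy sketch of the computation (2-d vectors stand in for the real embedding dimensions):

```python
def average_pool(token_embeddings):
    """Element-wise mean over token vectors -> a single sentence vector."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[d] for vec in token_embeddings) / n for d in range(dim)]

# Three toy token vectors of dimension 2
print(average_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))  # [3.0, 4.0]
```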
## Results ```bash ['Negative'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_berturk_cyberbullying| |Compatibility:|Spark NLP 3.1.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|tr| ## Data Source Trained on a custom dataset with Turkish Bert embeddings (BERTurk). ## Benchmarking ```bash precision recall f1-score support Negative 0.83 0.80 0.81 970 Positive 0.84 0.87 0.86 1225 accuracy 0.84 2195 macro avg 0.84 0.83 0.84 2195 weighted avg 0.84 0.84 0.84 2195 ``` --- layout: model title: Legal Master Repurchase Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_master_repurchase_agreement_bert date: 2022-11-25 tags: [en, legal, classification, agreement, master_repurchase, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_master_repurchase_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `master-repurchase-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `master-repurchase-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_master_repurchase_agreement_bert_en_1.0.0_3.0_1669368349135.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_master_repurchase_agreement_bert_en_1.0.0_3.0_1669368349135.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_master_repurchase_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
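The weighted averages in the benchmarking section are support-weighted means of the per-class scores. As a quick sanity check, the weighted F1 can be recomputed from the per-class F1 values and supports:

```python
def weighted_avg(scores, supports):
    """Support-weighted mean, as in the 'weighted-avg' row of a classification report."""
    return sum(s * n for s, n in zip(scores, supports)) / sum(supports)

# Per-class F1 and support: master-repurchase-agreement, other
f1 = [0.94, 0.98]
support = [35, 82]
print(round(weighted_avg(f1, support), 2))  # 0.97
```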
## Results ```bash +-------+ |result| +-------+ |[master-repurchase-agreement]| |[other]| |[other]| |[master-repurchase-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_master_repurchase_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support master-repurchase-agreement 0.97 0.91 0.94 35 other 0.96 0.99 0.98 82 accuracy - - 0.97 117 macro-avg 0.97 0.95 0.96 117 weighted-avg 0.97 0.97 0.97 117 ``` --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s381 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s381 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s381` is a German model originally trained by jonatasgrosman.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s381_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s381_de_4.2.0_3.0_1664113050219.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s381_de_4.2.0_3.0_1664113050219.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s381', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s381", lang = "de") val annotations = pipeline.transform(audioDF) ```
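The Wav2Vec2ForCTC stage inside this pipeline predicts one character distribution per audio frame, and the transcript is obtained with CTC decoding: consecutive repeats are collapsed and the blank symbol is removed. A greedy-decoding sketch of the idea (illustrative only; `_` stands in for the CTC blank):

```python
def ctc_greedy_decode(frame_labels, blank="_"):
    """Collapse consecutive repeated labels, then remove CTC blanks."""
    out, prev = [], None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)

# 12 hypothetical per-frame argmax labels decoding to "hello"
print(ctc_greedy_decode(list("hh_e_ll_llo_")))  # hello
```

Note that the blank between the two `ll` runs is what preserves the double "l" in the output.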
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_exp_w2v2r_xls_r_accent_germany_0_austria_10_s381| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering Base Uncased model (from kamalkraj) author: John Snow Labs name: bert_qa_base_uncased_squad_v1.0_finetuned date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squad-v1.0-finetuned` is an English model originally trained by `kamalkraj`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_squad_v1.0_finetuned_en_4.0.0_3.0_1657185715352.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_squad_v1.0_finetuned_en_4.0.0_3.0_1657185715352.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_squad_v1.0_finetuned","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_squad_v1.0_finetuned","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_squad_v1.0_finetuned| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/kamalkraj/bert-base-uncased-squad-v1.0-finetuned --- layout: model title: English image_classifier_vit_vc_bantai__withoutAMBI ViTForImageClassification from AykeeSalazar author: John Snow Labs name: image_classifier_vit_vc_bantai__withoutAMBI date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_vc_bantai__withoutAMBI` is an English model originally trained by AykeeSalazar. ## Predicted Entities `Public Drinking`, `Public Smoking`, `non-violation` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vc_bantai__withoutAMBI_en_4.1.0_3.0_1660171133898.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vc_bantai__withoutAMBI_en_4.1.0_3.0_1660171133898.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_vc_bantai__withoutAMBI", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_vc_bantai__withoutAMBI", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
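ViTForImageClassification emits one logit per class (here `Public Drinking`, `Public Smoking`, `non-violation`); a softmax converts the logits to probabilities, and the argmax becomes the predicted label in the `class` column. A sketch with hypothetical logits:

```python
import math

def softmax(logits):
    """Numerically stable softmax: shift by the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["Public Drinking", "Public Smoking", "non-violation"]
probs = softmax([0.2, 0.1, 2.9])  # hypothetical logits
print(classes[probs.index(max(probs))])  # non-violation
```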
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_vc_bantai__withoutAMBI| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Dutch Named Entity Recognition (from Babelscape) author: John Snow Labs name: bert_ner_wikineural_multilingual_ner date: 2022-05-09 tags: [bert, ner, token_classification, nl, open_source] task: Named Entity Recognition language: nl edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `wikineural-multilingual-ner` is a Dutch model originally trained by `Babelscape`. ## Predicted Entities `LOC`, `PER`, `ORG`, `MISC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_wikineural_multilingual_ner_nl_3.4.2_3.0_1652100024867.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_wikineural_multilingual_ner_nl_3.4.2_3.0_1652100024867.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_wikineural_multilingual_ner","nl") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ik hou van Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_wikineural_multilingual_ner","nl") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ik hou van Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
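The `ner` column above holds IOB-style tags (`B-LOC`, `I-PER`, `O`, …) over the `LOC`, `PER`, `ORG`, and `MISC` classes; a downstream `NerConverter` would merge them into entity chunks. A simplified sketch of that merging logic (illustrative, not the Spark NLP implementation):

```python
def merge_bio(tokens, tags):
    """Group B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # "O" (or a stray I- tag) closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Ik", "hou", "van", "Spark", "NLP"]
tags = ["O", "O", "O", "B-MISC", "I-MISC"]
print(merge_bio(tokens, tags))  # [('Spark NLP', 'MISC')]
```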
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_wikineural_multilingual_ner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|nl| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Babelscape/wikineural-multilingual-ner - https://github.com/Babelscape/wikineural - https://aclanthology.org/2021.findings-emnlp.215/ - https://creativecommons.org/licenses/by-nc-sa/4.0/ --- layout: model title: Financial Question Answering (Bert, Large) author: John Snow Labs name: finqa_bert_large date: 2023-01-03 tags: [en, licensed] task: Question Answering language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Large financial Bert-based Question Answering model, trained on SQuAD v2 and fine-tuned on proprietary financial questions and answers. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finqa_bert_large_en_1.0.0_3.0_1672759452867.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finqa_bert_large_en_1.0.0_3.0_1672759452867.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) spanClassifier = nlp.BertForQuestionAnswering.pretrained("finqa_bert_large","en", "finance/models") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = nlp.Pipeline().setStages([ documentAssembler, spanClassifier ]) example = spark.createDataFrame([["On which market is their common stock traded?", "Our common stock is traded on the Nasdaq Global Select Market under the symbol CDNS."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) result.select('answer.result').show() ```
## Results ```bash `Nasdaq Global Select Market under the symbol CDNS` ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finqa_bert_large| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References Trained on SQuAD v2 and fine-tuned on proprietary financial questions and answers. --- layout: model title: Entity Recognizer LG author: John Snow Labs name: entity_recognizer_lg date: 2022-06-25 tags: [pl, open_source] task: Named Entity Recognition language: pl edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_lg is a pretrained pipeline that performs basic text processing steps and recognizes entities; it covers most of the common text processing tasks on your DataFrame. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_pl_4.0.0_3.0_1656129618293.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_pl_4.0.0_3.0_1656129618293.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("entity_recognizer_lg", "pl") result = pipeline.annotate("""I love johnsnowlabs! """) ``` {:.nlu-block} ```python import nlu nlu.load("pl.ner.lg").predict("""I love johnsnowlabs! """) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_lg| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|pl| |Size:|2.5 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - NerDLModel - NerConverter --- layout: model title: Pipeline to Detect Diagnoses and Procedures (Spanish) author: John Snow Labs name: ner_diag_proc_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, es] task: Named Entity Recognition language: es edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_diag_proc](https://nlp.johnsnowlabs.com/2021/03/31/ner_diag_proc_es.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_diag_proc_pipeline_es_4.3.0_3.2_1678879521980.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_diag_proc_pipeline_es_4.3.0_3.2_1678879521980.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_diag_proc_pipeline", "es", "clinical/models") text = '''HISTORIA DE ENFERMEDAD ACTUAL: El Sr. Smith es un hombre blanco veterano de 60 años con múltiples comorbilidades, que tiene antecedentes de cáncer de vejiga diagnosticado hace aproximadamente dos años por el Hospital VA. Allí se sometió a una resección. Debía ser ingresado en el Hospital de Día para una cistectomía. Fue visto en la Clínica de Urología y Clínica de Radiología el 02/04/2003. CURSO DE HOSPITAL: El Sr. Smith se presentó en el Hospital de Día antes de la cirugía de Urología. En evaluación, EKG, ecocardiograma fue anormal, se obtuvo una consulta de Cardiología. Luego se procedió a una resonancia magnética de estrés con adenosina cardíaca, la misma fue positiva para isquemia inducible, infarto subendocárdico inferolateral leve a moderado con isquemia peri-infarto. Además, se observa isquemia inducible en el tabique lateral inferior. El Sr. Smith se sometió a un cateterismo del corazón izquierdo, que reveló una enfermedad de las arterias coronarias de dos vasos. La RCA, proximal estaba estenosada en un 95% y la distal en un 80% estenosada. La LAD media estaba estenosada en un 85% y la LAD distal estaba estenosada en un 85%. Se colocaron cuatro stents de metal desnudo Multi-Link Vision para disminuir las cuatro lesiones al 0%. Después de la intervención, el Sr. Smith fue admitido en 7 Ardmore Tower bajo el Servicio de Cardiología bajo la dirección del Dr. Hart. El Sr. Smith tuvo un curso hospitalario post-intervención sin complicaciones. 
Se mantuvo estable para el alta hospitalaria el 07/02/2003 con instrucciones de tomar Plavix diariamente durante un mes y Urología está al tanto de lo mismo.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_diag_proc_pipeline", "es", "clinical/models") val text = "HISTORIA DE ENFERMEDAD ACTUAL: El Sr. Smith es un hombre blanco veterano de 60 años con múltiples comorbilidades, que tiene antecedentes de cáncer de vejiga diagnosticado hace aproximadamente dos años por el Hospital VA. Allí se sometió a una resección. Debía ser ingresado en el Hospital de Día para una cistectomía. Fue visto en la Clínica de Urología y Clínica de Radiología el 02/04/2003. CURSO DE HOSPITAL: El Sr. Smith se presentó en el Hospital de Día antes de la cirugía de Urología. En evaluación, EKG, ecocardiograma fue anormal, se obtuvo una consulta de Cardiología. Luego se procedió a una resonancia magnética de estrés con adenosina cardíaca, la misma fue positiva para isquemia inducible, infarto subendocárdico inferolateral leve a moderado con isquemia peri-infarto. Además, se observa isquemia inducible en el tabique lateral inferior. El Sr. Smith se sometió a un cateterismo del corazón izquierdo, que reveló una enfermedad de las arterias coronarias de dos vasos. La RCA, proximal estaba estenosada en un 95% y la distal en un 80% estenosada. La LAD media estaba estenosada en un 85% y la LAD distal estaba estenosada en un 85%. Se colocaron cuatro stents de metal desnudo Multi-Link Vision para disminuir las cuatro lesiones al 0%. Después de la intervención, el Sr. Smith fue admitido en 7 Ardmore Tower bajo el Servicio de Cardiología bajo la dirección del Dr. Hart. El Sr. Smith tuvo un curso hospitalario post-intervención sin complicaciones. Se mantuvo estable para el alta hospitalaria el 07/02/2003 con instrucciones de tomar Plavix diariamente durante un mes y Urología está al tanto de lo mismo." 
val result = pipeline.fullAnnotate(text) ```
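Each chunk returned by `fullAnnotate` carries `begin` and `end` character offsets into the input text. Both offsets are inclusive, so slicing with `end + 1` recovers the surface form; for example, using the offsets of the first `DIAGNOSTICO` chunk reported in the results:

```python
text = "HISTORIA DE ENFERMEDAD ACTUAL: El Sr. Smith es un hombre blanco veterano de 60 años"
# Spark NLP offsets are inclusive on both ends, so slice with end + 1
begin, end = 12, 21  # offsets of the first DIAGNOSTICO chunk
print(text[begin:end + 1])  # ENFERMEDAD
```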
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:--------------------------------------|--------:|------:|:--------------|-------------:| | 0 | ENFERMEDAD | 12 | 21 | DIAGNOSTICO | 0.9989 | | 1 | cáncer de vejiga | 140 | 155 | DIAGNOSTICO | 0.8191 | | 2 | resección | 243 | 251 | PROCEDIMIENTO | 0.9616 | | 3 | cistectomía | 305 | 315 | PROCEDIMIENTO | 0.9554 | | 4 | estrés | 627 | 632 | DIAGNOSTICO | 0.9225 | | 5 | infarto subendocárdico | 705 | 726 | DIAGNOSTICO | 0.82595 | | 6 | isquemia | 804 | 811 | DIAGNOSTICO | 0.96 | | 7 | cateterismo del corazón izquierdo, | 884 | 917 | PROCEDIMIENTO | 0.68288 | | 8 | enfermedad de las arterias coronarias | 934 | 970 | DIAGNOSTICO | 0.75594 | | 9 | estenosada | 1010 | 1019 | DIAGNOSTICO | 0.9288 | | 10 | LAD | 1068 | 1070 | DIAGNOSTICO | 0.9365 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_diag_proc_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|es| |Size:|383.2 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Word2Vec Embeddings in Malayalam (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, ml, open_source] task: Embeddings language: ml edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ml_3.4.1_3.0_1647445548113.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ml_3.4.1_3.0_1647445548113.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ml") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["എനിക്ക് സ്പാർക്ക് എൻഎൽപി ഇഷ്ടമാണ്"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ml") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("എനിക്ക് സ്പാർക്ക് എൻഎൽപി ഇഷ്ടമാണ്").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ml.embed.w2v_cc_300d").predict("""എനിക്ക് സ്പാർക്ക് എൻഎൽപി ഇഷ്ടമാണ്""") ```
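A common downstream use of these 300-dimensional vectors is ranking words by cosine similarity. A self-contained sketch (3-d toy vectors stand in for the real 300-d fastText vectors):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]), 2))  # 1.0
print(round(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 2))  # 0.0
```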
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|ml| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Latvian Legal Roberta Embeddings author: John Snow Labs name: roberta_base_latvian_legal date: 2023-02-16 tags: [lv, latvian, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: lv edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-latvian-roberta-base` is a Latvian model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_latvian_legal_lv_4.2.4_3.0_1676555343069.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_latvian_legal_lv_4.2.4_3.0_1676555343069.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_latvian_legal", "lv")\ .setInputCols(["sentence"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_latvian_legal", "lv") .setInputCols("sentence") .setOutputCol("embeddings") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_latvian_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|lv| |Size:|416.0 MB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-latvian-roberta-base --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_izzy_lazerson TFWav2Vec2ForCTC from izzy-lazerson author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_izzy_lazerson date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_izzy_lazerson` is an English model originally trained by izzy-lazerson. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_by_izzy_lazerson_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_izzy_lazerson_en_4.2.0_3.0_1664112234389.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_izzy_lazerson_en_4.2.0_3.0_1664112234389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_izzy_lazerson', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_izzy_lazerson", lang = "en") val annotations = pipeline.transform(audioDF) ```
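The pipeline's first stage, `AudioAssembler`, expects `audioDF` to carry a column of floating-point samples. As a hedged sketch (the file name and 16 kHz rate are illustrative assumptions, not requirements of this pipeline), a mono 16-bit PCM WAV file can be decoded into normalized floats with Python's standard library before building the DataFrame:

```python
import math
import struct
import wave

def wav_to_floats(path):
    """Decode a mono 16-bit PCM WAV file into floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "expects 16-bit PCM"
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Self-contained demo: synthesize a 0.1 s, 440 Hz tone at 16 kHz, then decode it.
rate, seconds, freq = 16000, 0.1, 440.0
pcm = [int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / rate))
       for i in range(int(rate * seconds))]
with wave.open("demo.wav", "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(rate)
    out.writeframes(struct.pack("<%dh" % len(pcm), *pcm))

floats = wav_to_floats("demo.wav")
print(len(floats))  # 1600 samples for 0.1 s at 16 kHz
# audioDF = spark.createDataFrame([[floats]], ["audio_content"])  # then run the pipeline
```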
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_izzy_lazerson| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|354.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Explain Clinical Document Pipeline - CARP author: John Snow Labs name: explain_clinical_doc_carp date: 2021-04-01 tags: [pipeline, en, clinical, licensed] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pipeline with `ner_clinical`, `assertion_dl`, `re_clinical` and `ner_posology`. It will extract clinical and medication entities, assign assertion status and find relationships between clinical entities. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.Pretrained_Clinical_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_carp_en_3.0.0_3.0_1617296754955.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_carp_en_3.0.0_3.0_1617296754955.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python carp_pipeline = PretrainedPipeline("explain_clinical_doc_carp","en","clinical/models") annotations = carp_pipeline.fullAnnotate("""A 28-year-old female with a history of gestational diabetes mellitus, used to take metformin 1000 mg two times a day, presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting. She was seen by the endocrinology service and discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals.""")[0] annotations.keys() ``` ```scala val carp_pipeline = new PretrainedPipeline("explain_clinical_doc_carp","en","clinical/models") val result = carp_pipeline.fullAnnotate("""A 28-year-old female with a history of gestational diabetes mellitus, used to take metformin 1000 mg two times a day, presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting. She was seen by the endocrinology service and discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals.""")(0) ``` {:.nlu-block} ```python import nlu nlu.load("en.explain_doc.carp").predict("""A 28-year-old female with a history of gestational diabetes mellitus, used to take metformin 1000 mg two times a day, presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting. She was seen by the endocrinology service and discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals.""") ```
## Results ```bash | | chunks | ner_clinical | assertion | posology_chunk | ner_posology | relations | |---|-------------------------------|--------------|-----------|------------------|--------------|-----------| | 0 | gestational diabetes mellitus | PROBLEM | present | metformin | Drug | TrAP | | 1 | metformin | TREATMENT | present | 1000 mg | Strength | TrCP | | 2 | polyuria | PROBLEM | present | two times a day | Frequency | TrCP | | 3 | polydipsia | PROBLEM | present | 40 units | Dosage | TrWP | | 4 | poor appetite | PROBLEM | present | insulin glargine | Drug | TrCP | | 5 | vomiting | PROBLEM | present | at night | Frequency | TrAP | | 6 | insulin glargine | TREATMENT | present | 12 units | Dosage | TrAP | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_clinical_doc_carp| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| --- layout: model title: English BertForQuestionAnswering Cased model (from motiondew) author: John Snow Labs name: bert_qa_sd2 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-sd2` is an English model originally trained by `motiondew`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_sd2_en_4.0.0_3.0_1657188076719.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_sd2_en_4.0.0_3.0_1657188076719.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sd2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sd2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_sd2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/motiondew/bert-sd2 --- layout: model title: Hungarian Lemmatizer author: John Snow Labs name: lemma date: 2020-05-05 12:50:00 +0800 task: Lemmatization language: hu edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [lemmatizer, hu] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_hu_2.5.0_2.4_1588671968880.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_hu_2.5.0_2.4_1588671968880.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "hu") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Az északi király kivételével John Snow angol orvos, vezető szerepet játszik az érzéstelenítés és az orvosi higiénia fejlesztésében.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "hu") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Az északi király kivételével John Snow angol orvos, vezető szerepet játszik az érzéstelenítés és az orvosi higiénia fejlesztésében.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Az északi király kivételével John Snow angol orvos, vezető szerepet játszik az érzéstelenítés és az orvosi higiénia fejlesztésében."""] lemma_df = nlu.load('hu.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
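Conceptually, the lemmatizer maps each inflected surface form back to its dictionary root, e.g. `kivételével` ("with the exception of") to `kivétel` ("exception"), as in the sample output. A toy pure-Python sketch of that form-to-lemma lookup (the tiny dictionary here is illustrative, not the model's actual data):

```python
# Toy form -> lemma dictionary; the pretrained model ships a far larger mapping
# and also uses surrounding context to resolve ambiguous forms.
HU_LEMMAS = {
    "kivételével": "kivétel",        # "with the exception of" -> "exception"
    "fejlesztésében": "fejlesztés",  # "in the development of" -> "development"
}

def lemmatize(tokens, lemma_dict):
    # Fall back to the surface form when no lemma is known (e.g. proper nouns).
    return [lemma_dict.get(tok, tok) for tok in tokens]

print(lemmatize(["király", "kivételével", "John"], HU_LEMMAS))
```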
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=1, result='Az', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=3, end=8, result='északi', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=10, end=15, result='király', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=17, end=27, result='kivétel', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=29, end=32, result='John', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|hu| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: English BertForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_2 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-256-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_2_en_4.0.0_3.0_1657192323298.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_2_en_4.0.0_3.0_1657192323298.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|384.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-256-finetuned-squad-seed-2 --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-256-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_0_en_4.0.0_3.0_1655731990121.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_0_en_4.0.0_3.0_1655731990121.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_256d_seed_0").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|427.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-256-finetuned-squad-seed-0 --- layout: model title: Detect Chemicals in Text (embeddings_clinical_medium) author: John Snow Labs name: ner_chemicals_emb_clinical_medium date: 2023-06-02 tags: [ner, clinical, licensed, en, chemicals] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract different types of chemical compounds mentioned in text using a pretrained NER model trained with `embeddings_clinical_medium` word embeddings. ## Predicted Entities `CHEM` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CHEMICALS/){:.button.button-orange} [Open in Colab](https://nlp.johnsnowlabs.com/2021/04/01/ner_chemicals_en.html){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chemicals_emb_clinical_medium_en_4.4.3_3.0_1685716059468.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chemicals_emb_clinical_medium_en_4.4.3_3.0_1685716059468.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") chemicals_ner = MedicalNerModel.pretrained("ner_chemicals_emb_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("chemicals_ner") chemicals_ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "chemicals_ner"]) \ .setOutputCol("chemicals_ner_chunk") chemicals_ner_pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, word_embeddings, chemicals_ner, chemicals_ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") chemicals_ner_model = chemicals_ner_pipeline.fit(empty_data) results = chemicals_ner_model.transform(spark.createDataFrame([[''' Differential cell - protective function of two resveratrol (trans - 3, 5, 4 - trihydroxystilbene) glucosides against oxidative stress. Resveratrol (trans - 3, 5, 4 - trihydroxystilbene ; RSV) , a natural polyphenol, exerts a beneficial effect on health and diseases. RSV targets and activates the NAD(+) - dependent protein deacetylase SIRT1; in turn, SIRT1 induces an intracellular antioxidative mechanism by inducing mitochondrial superoxide dismutase (SOD2). 
Most RSV found in plants is glycosylated, and the effect of these glycosylated forms on SIRT1 has not been studied.''']]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val chemicals_ner_model = MedicalNerModel.pretrained("ner_chemicals_emb_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("chemicals_ner") val chemicals_ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "chemicals_ner")) .setOutputCol("chemicals_ner_chunk") val chemicals_pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, chemicals_ner_model, chemicals_ner_converter)) val data = Seq(""" Differential cell - protective function of two resveratrol (trans - 3, 5, 4 - trihydroxystilbene) glucosides against oxidative stress. Resveratrol (trans - 3, 5, 4 - trihydroxystilbene ; RSV) , a natural polyphenol, exerts a beneficial effect on health and diseases. RSV targets and activates the NAD(+) - dependent protein deacetylase SIRT1; in turn, SIRT1 induces an intracellular antioxidative mechanism by inducing mitochondrial superoxide dismutase (SOD2). Most RSV found in plants is glycosylated, and the effect of these glycosylated forms on SIRT1 has not been studied.""").toDS.toDF("text") val result = chemicals_pipeline.fit(data).transform(data) ```
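The `NerConverterInternal` stage above merges token-level `B-`/`I-` NER tags into entity chunks like those in the results table. A minimal pure-Python sketch of that IOB merging idea (illustrative only, not Spark NLP's actual implementation; the sample tokens and tags are made up):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" tag or an inconsistent I- tag closes the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["RSV", "targets", "superoxide", "dismutase"]
tags = ["B-CHEM", "O", "B-CHEM", "I-CHEM"]
print(iob_to_chunks(tokens, tags))  # [('RSV', 'CHEM'), ('superoxide dismutase', 'CHEM')]
```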
## Results ```bash | | chunks | begin | end | entities | |---:|:-------------------------------------------------|--------:|------:|:-----------| | 0 | resveratrol | 48 | 58 | CHEM | | 1 | trans - 3, 5, 4 - trihydroxystilbene) glucosides | 61 | 108 | CHEM | | 2 | Resveratrol | 136 | 146 | CHEM | | 3 | trans - 3, 5, 4 - trihydroxystilbene | 149 | 185 | CHEM | | 4 | RSV | 189 | 191 | CHEM | | 5 | polyphenol | 206 | 215 | CHEM | | 6 | RSV | 270 | 272 | CHEM | | 7 | NAD(+ | 300 | 304 | CHEM | | 8 | superoxide | 436 | 445 | CHEM | | 9 | RSV | 470 | 472 | CHEM | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_chemicals_emb_clinical_medium| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|2.8 MB| ## Benchmarking ```bash label precision recall f1-score support CHEM 0.95 0.92 0.94 62001 micro_avg 0.95 0.92 0.94 62001 macro_avg 0.95 0.92 0.94 62001 weighted_avg 0.95 0.92 0.94 62001 ``` --- layout: model title: German asr_exp_w2v2t_wav2vec2_s982 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_exp_w2v2t_wav2vec2_s982 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_exp_w2v2t_wav2vec2_s982` is a German model originally trained by jonatasgrosman. 
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2t_wav2vec2_s982_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2t_wav2vec2_s982_de_4.2.0_3.0_1664122759914.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2t_wav2vec2_s982_de_4.2.0_3.0_1664122759914.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2t_wav2vec2_s982', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2t_wav2vec2_s982", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_exp_w2v2t_wav2vec2_s982| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English image_classifier_vit_exper_batch_8_e8 ViTForImageClassification from sudo-s author: John Snow Labs name: image_classifier_vit_exper_batch_8_e8 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_exper_batch_8_e8` is a English model originally trained by sudo-s. ## Predicted Entities `45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146` {:.btn-box} 
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper_batch_8_e8_en_4.1.0_3.0_1660172993943.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper_batch_8_e8_en_4.1.0_3.0_1660172993943.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_exper_batch_8_e8", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_exper_batch_8_e8", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_exper_batch_8_e8| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.3 MB| --- layout: model title: Legal Proprietary rights Clause Binary Classifier author: John Snow Labs name: legclf_proprietary_rights_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `proprietary_rights` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. 
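The paragraph-splitting recommendation above can be sketched in plain Python: split on blank lines and flag any piece whose whitespace token count exceeds the 512-token budget. Note the whitespace count is only a rough proxy, since the model's own subword tokenizer may produce more tokens:

```python
MAX_TOKENS = 512  # the sentence embeddings used downstream accept up to 512 tokens

def split_paragraphs(text):
    # Paragraph splitting "by multiline": blank lines delimit paragraphs.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def needs_further_splitting(paragraph, limit=MAX_TOKENS):
    # Whitespace tokens approximate (and usually undercount) subword tokens.
    return len(paragraph.split()) > limit

# Hypothetical document text, for illustration only.
doc = "1. PROPRIETARY RIGHTS.\n\nAll right, title and interest in the Software...\n\n2. MISCELLANEOUS."
for p in split_paragraphs(doc):
    print(needs_further_splitting(p), p[:40])
```

Each resulting paragraph can then be fed to the classifier as a separate `clause_text` row.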
## Predicted Entities `other`, `proprietary_rights` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_proprietary_rights_clause_en_1.0.0_3.2_1660123848891.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_proprietary_rights_clause_en_1.0.0_3.2_1660123848891.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_proprietary_rights_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[proprietary_rights]| |[other]| |[other]| |[proprietary_rights]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_proprietary_rights_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.88 1.00 0.94 51 proprietary_rights 1.00 0.78 0.88 32 accuracy - - 0.92 83 macro-avg 0.94 0.89 0.91 83 weighted-avg 0.93 0.92 0.91 83 ``` --- layout: model title: Latin Lemmatizer author: John Snow Labs name: lemma date: 2020-07-29 23:37:00 +0800 task: Lemmatization language: la edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [lemmatizer, la] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous.
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb#scrollTo=bbzEH9u7tdxR){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_la_2.5.5_2.4_1596055005368.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_la_2.5.5_2.4_1596055005368.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "la") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Alius est esse regem Aquilonis, et de Anglis medicus et nives Ioannes dux in progressus medicinae anesthesia et hygiene.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "la") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Alius est esse regem Aquilonis, et de Anglis medicus et nives Ioannes dux in progressus medicinae anesthesia et hygiene.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Alius est esse regem Aquilonis, et de Anglis medicus et nives Ioannes dux in progressus medicinae anesthesia et hygiene."""] lemma_df = nlu.load('la.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=4, result='Alius', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=6, end=8, result='sum', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=10, end=13, result='sum', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=15, end=19, result='regem', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=21, end=29, result='Aquilonis', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|la| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Language Detection & Identification Pipeline - 43 Languages author: John Snow Labs name: detect_language_43 date: 2020-12-05 task: [Pipeline Public, Language Detection, Sentence Detection] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [language_detection, open_source, pipeline, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Language detection and identification is the task of automatically detecting the language(s) present in a document based on the content of the document. ``LanguageDetectorDL`` is an annotator that detects the language of documents or sentences depending on the ``inputCols``. In addition, ``LanguageDetectorDL`` can accurately detect language from documents with mixed languages by coalescing sentences and selecting the best candidate. We have designed and developed Deep Learning models using CNN architectures in TensorFlow/Keras.
The models are trained on large datasets such as Wikipedia and Tatoeba with high accuracy evaluated on the Europarl dataset. The output is a language code in Wiki Code style: [https://en.wikipedia.org/wiki/List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias). This pipeline can detect the following languages: ## Predicted Entities `Arabic`, `Belarusian`, `Bulgarian`, `Czech`, `Danish`, `German`, `Greek`, `English`, `Esperanto`, `Spanish`, `Estonian`, `Persian`, `Finnish`, `French`, `Hebrew`, `Hindi`, `Hungarian`, `Interlingua`, `Indonesian`, `Icelandic`, `Italian`, `Japanese`, `Korean`, `Latin`, `Lithuanian`, `Latvian`, `Macedonian`, `Marathi`, `Dutch`, `Polish`, `Portuguese`, `Romanian`, `Russian`, `Slovak`, `Slovenian`, `Serbian`, `Swedish`, `Tagalog`, `Turkish`, `Tatar`, `Ukrainian`, `Vietnamese`, `Chinese`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/LANGUAGE_DETECTOR/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/detect_language_43_xx_2.7.0_2.4_1607185195681.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/detect_language_43_xx_2.7.0_2.4_1607185195681.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("detect_language_43", lang = "xx") pipeline.annotate("French author who helped pioneer the science-fiction genre.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("detect_language_43", lang = "xx") pipeline.annotate("French author who helped pioneer the science-fiction genre.") ``` {:.nlu-block} ```python import nlu text = ["French author who helped pioneer the science-fiction genre."] lang_df = nlu.load("xx.classify.lang.43").predict(text) lang_df ```
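``LanguageDetectorDL`` performs the sentence coalescing described above internally; conceptually it amounts to aggregating per-sentence confidences and picking the best overall candidate. A rough pure-Python sketch with made-up confidence scores (this is not the annotator's actual API):

```python
from collections import defaultdict

def coalesce_language(sentence_scores):
    """Sum per-sentence confidence per language and return the best candidate."""
    totals = defaultdict(float)
    for scores in sentence_scores:          # one {lang: confidence} dict per sentence
        for lang, conf in scores.items():
            totals[lang] += conf
    return max(totals, key=totals.get)

# Mixed-language document: mostly English with one ambiguous sentence
mixed = [{"en": 0.9, "fr": 0.1}, {"fr": 0.6, "en": 0.4}, {"en": 0.8, "fr": 0.2}]
print(coalesce_language(mixed))  # -> en
```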
## Results ```bash {'document': ['French author who helped pioneer the science-fiction genre.'], 'language': ['en']} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|detect_language_43| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - LanguageDetectorDL --- layout: model title: Icelandic DistilBertForTokenClassification Cased model (from m3hrdadfi) author: John Snow Labs name: distilbert_token_classifier_icelandic_ner_distilbert date: 2023-03-03 tags: [is, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: is edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `icelandic-ner-distilbert` is an Icelandic model originally trained by `m3hrdadfi`. ## Predicted Entities `Money`, `Date`, `Time`, `Percent`, `Miscellaneous`, `Location`, `Person`, `Organization` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_icelandic_ner_distilbert_is_4.3.1_3.0_1677881983874.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_icelandic_ner_distilbert_is_4.3.1_3.0_1677881983874.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_icelandic_ner_distilbert","is") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_icelandic_ner_distilbert","is") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_icelandic_ner_distilbert| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|is| |Size:|505.8 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/m3hrdadfi/icelandic-ner-distilbert - http://hdl.handle.net/20.500.12537/42 - https://en.ru.is/ - https://github.com/m3hrdadfi/icelandic-ner/issues --- layout: model title: Generic Classifier for Adverse Drug Events (SVM) author: John Snow Labs name: generic_svm_classifier_ade date: 2023-05-09 tags: [generic_classifier, svm, ade, clinical, licensed, en, text_classification] task: Text Classification language: en edition: Healthcare NLP 4.4.1 spark_version: 3.0 supported: true annotator: GenericSVMClassifierModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained with the Generic Classifier annotator and the Support Vector Machine (SVM) algorithm and classifies text/sentence into two categories: True : The sentence is talking about a possible ADE False : The sentence doesn’t have any information about an ADE. The corpus used for model training is ADE-Corpus-V2 Dataset: Adverse Drug Reaction Data. This is a dataset for classification of a sentence if it is ADE-related (True) or not (False). ## Predicted Entities `True`, `False` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/generic_svm_classifier_ade_en_4.4.1_3.0_1683644825302.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/generic_svm_classifier_ade_en_4.4.1_3.0_1683644825302.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("word_embeddings") sentence_embeddings = SentenceEmbeddings() \ .setInputCols(["document", "word_embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") features_asm = FeaturesAssembler()\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("features") generic_classifier = GenericClassifierModel.pretrained("generic_svm_classifier_ade", "en", "clinical/models")\ .setInputCols(["features"])\ .setOutputCol("class") clf_Pipeline = Pipeline(stages=[ document_assembler, tokenizer, word_embeddings, sentence_embeddings, features_asm, generic_classifier]) data = spark.createDataFrame([["""None of the patients required treatment for the overdose."""], ["""I feel a bit drowsy & have a little blurred vision after taking an insulin"""]]).toDF("text") result = clf_Pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models") .setInputCols(Array("document", "token")) .setOutputCol("word_embeddings") val sentence_embeddings = new SentenceEmbeddings() .setInputCols(Array("document", "word_embeddings")) .setOutputCol("sentence_embeddings") .setPoolingStrategy("AVERAGE") val features_asm = new FeaturesAssembler() .setInputCols("sentence_embeddings") .setOutputCol("features") val generic_classifier = GenericClassifierModel.pretrained("generic_svm_classifier_ade", "en", "clinical/models") .setInputCols("features")
.setOutputCol("class") val clf_Pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, word_embeddings, sentence_embeddings, features_asm, generic_classifier)) val data = Seq("None of the patients required treatment for the overdose.", "I feel a bit drowsy & have a little blurred vision after taking an insulin").toDS.toDF("text") val result = clf_Pipeline.fit(data).transform(data) ```
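The `SentenceEmbeddings` stage above with the `AVERAGE` pooling strategy simply takes the element-wise mean of the word vectors before they reach the SVM classifier. A toy illustration in plain Python with made-up 3-dimensional vectors:

```python
def average_pool(word_vectors):
    """Element-wise mean over word embeddings -> one sentence-level vector."""
    dim = len(word_vectors[0])
    n = len(word_vectors)
    return [sum(vec[i] for vec in word_vectors) / n for i in range(dim)]

# Two "word embeddings" pooled into a single sentence embedding
tokens = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
print(average_pool(tokens))  # -> [2.0, 1.0, 1.0]
```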
## Results ```bash +--------------------------------------------------------------------------+-------+ |text |result | +--------------------------------------------------------------------------+-------+ |None of the patients required treatment for the overdose. |[False]| |I feel a bit drowsy & have a little blurred vision after taking an insulin|[True] | +--------------------------------------------------------------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|generic_svm_classifier_ade| |Compatibility:|Healthcare NLP 4.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[feature_vector]| |Output Labels:|[prediction]| |Language:|en| |Size:|16.4 KB| ## References The corpus used for model training is ADE-Corpus-V2 Dataset: Adverse Drug Reaction Data. This is a dataset for classification of a sentence if it is ADE-related (True) or not (False). Reference: Gurulingappa et al., Benchmark Corpus to Support Information Extraction for Adverse Drug Effects, JBI, 2012. http://www.sciencedirect.com/science/article/pii/S1532046412000615 ## Benchmarking ```bash label precision recall f1-score support False 0.84 0.92 0.88 3362 True 0.74 0.58 0.65 1361 accuracy - - 0.82 4723 macro avg 0.79 0.75 0.76 4723 weighted avg 0.81 0.82 0.81 4723 ``` --- layout: model title: Enawené-Nawé BertForTokenClassification Cased model (from shreyas-singh) author: John Snow Labs name: bert_token_classifier_autotrain_medicaltokenclassification_1279048948 date: 2022-11-30 tags: [unk, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: unk edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.
`autotrain-MedicalTokenClassification-1279048948` is an Enawené-Nawé model originally trained by `shreyas-singh`. ## Predicted Entities `MISC`, `LOC`, `ORG`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autotrain_medicaltokenclassification_1279048948_unk_4.2.4_3.0_1669814373500.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_autotrain_medicaltokenclassification_1279048948_unk_4.2.4_3.0_1669814373500.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autotrain_medicaltokenclassification_1279048948","unk") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_autotrain_medicaltokenclassification_1279048948","unk") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_autotrain_medicaltokenclassification_1279048948| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|unk| |Size:|408.6 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/shreyas-singh/autotrain-MedicalTokenClassification-1279048948 --- layout: model title: Entity Recognizer LG author: John Snow Labs name: entity_recognizer_lg date: 2022-07-07 tags: [sv, open_source] task: Named Entity Recognition language: sv edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_lg is a pretrained pipeline that processes text with basic processing steps and recognizes entities. It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_sv_4.0.0_3.0_1657200680574.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_sv_4.0.0_3.0_1657200680574.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("entity_recognizer_lg", "sv") result = pipeline.annotate("""I love johnsnowlabs! """) ``` {:.nlu-block} ```python import nlu nlu.load("sv.ner.lg").predict("""I love johnsnowlabs! """) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_lg| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|sv| |Size:|2.5 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - NerDLModel - NerConverter --- layout: model title: Legal Voting Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_voting_agreement date: 2022-11-24 tags: [en, legal, classification, agreement, voting, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_voting_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `voting-agreement` or not (Binary Classification). Longformers have a restriction of 4096 tokens, so only the first 4096 tokens will be taken into account. We have realised that for the vast majority of the documents in legal corpora, if they are clean and only contain the legal document without any extra information before, 4096 tokens are enough to perform Document Classification. If not, let us know and we can carry out another approach for you: getting chunks of 4096 tokens and averaging the embeddings, then training with the averaged version, which means the whole document will be taken into account. But this theoretically should not be required.
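The fallback strategy described above (chunking a long document into 4096-token windows and averaging their embeddings) can be sketched as follows in plain Python; `embed_chunk` is a hypothetical stand-in for the real Longformer embedding call:

```python
def chunk(tokens, size=4096):
    """Yield consecutive windows of at most `size` tokens."""
    for start in range(0, len(tokens), size):
        yield tokens[start:start + size]

def document_embedding(tokens, embed_chunk, size=4096):
    """Average the per-chunk embeddings so the whole document is represented."""
    vectors = [embed_chunk(c) for c in chunk(tokens, size)]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Toy embedder: the "embedding" is just [chunk length], to keep the sketch runnable
toy_embed = lambda c: [float(len(c))]
doc_vec = document_embedding(list(range(10000)), toy_embed)  # chunks of 4096, 4096, 1808 tokens
```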
## Predicted Entities `voting-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_voting_agreement_en_1.0.0_3.0_1669295269897.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_voting_agreement_en_1.0.0_3.0_1669295269897.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_voting_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[voting-agreement]| |[other]| |[other]| |[voting-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_voting_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.97 0.99 0.98 90 voting-agreement 0.97 0.90 0.93 31 accuracy - - 0.97 121 macro avg 0.97 0.95 0.96 121 weighted avg 0.97 0.97 0.97 121 ``` --- layout: model title: Stopwords Remover for Arabic language (386 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, ar, open_source] task: Stop Words Removal language: ar edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_ar_3.4.1_3.0_1646672341952.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_ar_3.4.1_3.0_1646672341952.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","ar") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["أنت لست أفضل مني"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","ar") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("أنت لست أفضل مني").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.stopwords").predict("""أنت لست أفضل مني""") ```
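Under the hood, `StopWordsCleaner` simply drops tokens found in the pretrained stopword list. A minimal equivalent in plain Python (the two-word stopword set here is illustrative, not the model's full 386-entry Arabic list):

```python
stopwords = {"أنت", "لست"}  # tiny illustrative subset of the pretrained list

def clean_tokens(tokens, stopwords):
    """Keep only tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]

print(clean_tokens(["أنت", "لست", "أفضل", "مني"], stopwords))  # -> ['أفضل', 'مني']
```

This mirrors the Results section below: the two stopwords are removed and only the content words remain.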
## Results ```bash +-----------+ |result | +-----------+ |[أفضل, مني]| +-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|ar| |Size:|2.6 KB| --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from huxxx657) author: John Snow Labs name: roberta_qa_base_finetuned_squad_3 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad-3` is an English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_squad_3_en_4.3.0_3.0_1674217654479.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_squad_3_en_4.3.0_3.0_1674217654479.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_squad_3","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_squad_3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_finetuned_squad_3| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-squad-3 --- layout: model title: Basque BertForQuestionAnswering model (from MarcBrun) author: John Snow Labs name: bert_qa_ixambert_finetuned_squad_eu_MarcBrun date: 2022-06-03 tags: [open_source, question_answering, bert] task: Question Answering language: eu edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ixambert-finetuned-squad-eu` is a Basque model orginally trained by `MarcBrun`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_ixambert_finetuned_squad_eu_MarcBrun_eu_4.0.0_3.0_1654249870079.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_ixambert_finetuned_squad_eu_MarcBrun_eu_4.0.0_3.0_1654249870079.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_ixambert_finetuned_squad_eu_MarcBrun","eu") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_ixambert_finetuned_squad_eu_MarcBrun","eu") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("eu.answer_question.squad.ixam_bert.eu_tuned.by_MarcBrun").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_ixambert_finetuned_squad_eu_MarcBrun| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|eu| |Size:|661.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/MarcBrun/ixambert-finetuned-squad-eu --- layout: model title: Russian T5ForConditionalGeneration Small Cased model (from anzorq) author: John Snow Labs name: t5_kbd_lat_835k_3m_small date: 2023-01-30 tags: [ru, open_source, t5, tensorflow] task: Text Generation language: ru edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `kbd_lat-835k_ru-3M_t5-small` is a Russian model originally trained by `anzorq`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_kbd_lat_835k_3m_small_ru_4.3.0_3.0_1675104021877.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_kbd_lat_835k_3m_small_ru_4.3.0_3.0_1675104021877.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_kbd_lat_835k_3m_small","ru") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_kbd_lat_835k_3m_small","ru") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_kbd_lat_835k_3m_small| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ru| |Size:|211.3 MB| ## References - https://huggingface.co/anzorq/kbd_lat-835k_ru-3M_t5-small --- layout: model title: Greek BertForQuestionAnswering Cased model (from Danastos) author: John Snow Labs name: bert_qa_qacombined_el_4 date: 2022-07-07 tags: [el, open_source, bert, question_answering] task: Question Answering language: el edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `qacombined_bert_el_4` is a Greek model originally trained by `Danastos`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_qacombined_el_4_el_4.0.0_3.0_1657190872506.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_qacombined_el_4_el_4.0.0_3.0_1657190872506.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_qacombined_el_4","el") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Ποιο είναι το όνομά μου?", "Το όνομά μου είναι Κλάρα και μένω στο Μπέρκλεϊ."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_qacombined_el_4","el") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Ποιο είναι το όνομά μου?", "Το όνομά μου είναι Κλάρα και μένω στο Μπέρκλεϊ.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_qacombined_el_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|el| |Size:|421.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Danastos/qacombined_bert_el_4 --- layout: model title: Clinical Ner Assertion author: John Snow Labs name: clinical_ner_assertion class: PipelineModel language: en nav_key: models repository: clinical/models date: 2020-01-31 tags: [ner, clinical, licensed] supported: true edition: Spark NLP 2.4.0 spark_version: 2.4 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description A pretrained pipeline with ``ner_clinical`` and ``assertion_dl``. It will extract clinical entities and assign assertion status for them. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.Pretrained_Clinical_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_ner_assertion_en_2.4.0_2.4_1580481098096.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_ner_assertion_en_2.4.0_2.4_1580481098096.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python pipeline = PretrainedPipeline("clinical_ner_assertion","en","clinical/models") result = pipeline.fullAnnotate("""She is admitted to The John Hopkins Hospital 2 days ago with a history of gestational diabetes mellitus diagnosed. She denied pain and any headache. She was seen by the endocrinology service and she was discharged on 03/02/2018 on 40 units of insulin glargine, 12 units of insulin lispro, and metformin 1000 mg two times a day. She had close follow-up with endocrinology post discharge.""")[0] result.keys() ``` ```scala val pipeline = new PretrainedPipeline("clinical_ner_assertion","en","clinical/models") val result = pipeline.fullAnnotate("She is admitted to The John Hopkins Hospital 2 days ago with a history of gestational diabetes mellitus diagnosed. She denied pain and any headache. She was seen by the endocrinology service and she was discharged on 03/02/2018 on 40 units of insulin glargine, 12 units of insulin lispro, and metformin 1000 mg two times a day. She had close follow-up with endocrinology post discharge.")(0) ```
{:.h2_title} ## Results ```bash chunks entities assertion 0 gestational diabetes mellitus PROBLEM present 1 pain PROBLEM absent 2 headache PROBLEM absent 3 insulin glargine TREATMENT present 4 insulin lispro TREATMENT present 5 metformin TREATMENT present ``` {:.model-param} ## Model Information {:.table-model} |---------------|------------------------| | Name: | clinical_ner_assertion | | Type: | PipelineModel | | Compatibility: | Spark NLP 2.4.0+ | | License: | Licensed | | Edition: | Official | | Language: | en | {:.h2_title} ## Included Models - ner_clinical - assertion_dl --- layout: model title: English BertForQuestionAnswering Cased model (from peggyhuang) author: John Snow Labs name: bert_qa_finetune_scibert_v2 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `finetune-SciBert-v2` is a English model originally trained by `peggyhuang`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_finetune_scibert_v2_en_4.0.0_3.0_1657189393157.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_finetune_scibert_v2_en_4.0.0_3.0_1657189393157.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_finetune_scibert_v2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_finetune_scibert_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_finetune_scibert_v2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|410.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/peggyhuang/finetune-SciBert-v2 --- layout: model title: ICD10 to UMLS Code Mapping author: John Snow Labs name: icd10cm_umls_mapping date: 2023-03-29 tags: [en, licensed, icd10cm, umls, pipeline, chunk_mapping] task: Chunk Mapping language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps ICD10CM codes to UMLS codes without using any text data. You’ll just feed white space-delimited ICD10CM codes and it will return the corresponding UMLS codes as a list. If there is no mapping, the original code is returned with no mapping. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10cm_umls_mapping_en_4.3.2_3.2_1680119059492.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10cm_umls_mapping_en_4.3.2_3.2_1680119059492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("icd10cm_umls_mapping", "en", "clinical/models") result = pipeline.fullAnnotate(['M8950', 'R822', 'R0901']) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("icd10cm_umls_mapping", "en", "clinical/models") val result = pipeline.fullAnnotate(Array("M8950", "R822", "R0901")) ``` {:.nlu-block} ```python import nlu nlu.load("en.icd10cm.umls.mapping").predict("""Put your text here.""") ```
## Results ```bash {'icd10cm': ['M89.50', 'R82.2', 'R09.01'], 'umls': ['C4721411', 'C0159076', 'C0004044']} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd10cm_umls_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|952.6 KB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel --- layout: model title: English image_classifier_vit_opencampus_age_detection ViTForImageClassification from chradden author: John Snow Labs name: image_classifier_vit_opencampus_age_detection date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_opencampus_age_detection` is a English model originally trained by chradden. ## Predicted Entities `pensioner portrait face`, `teenager portrait face`, `millennials portrait face`, `child portrait face`, `generation x portrait face` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_opencampus_age_detection_en_4.1.0_3.0_1660166579619.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_opencampus_age_detection_en_4.1.0_3.0_1660166579619.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_opencampus_age_detection", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_opencampus_age_detection", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_opencampus_age_detection| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Translate East Slavic languages to English Pipeline author: John Snow Labs name: translate_zle_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, zle, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `zle` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_zle_en_xx_2.7.0_2.4_1609690665624.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_zle_en_xx_2.7.0_2.4_1609690665624.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_zle_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_zle_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.zle.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_zle_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Slovak Legal Roberta Embeddings author: John Snow Labs name: roberta_base_slovak_legal date: 2023-02-17 tags: [sk, slovak, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: sk edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-slovak-roberta-base` is a Slovak model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_slovak_legal_sk_4.2.4_3.0_1676642929159.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_slovak_legal_sk_4.2.4_3.0_1676642929159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_slovak_legal", "sk")\ .setInputCols(["sentence"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_slovak_legal", "sk") .setInputCols("sentence") .setOutputCol("embeddings") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_slovak_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|sk| |Size:|415.9 MB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-slovak-roberta-base --- layout: model title: Legal Recognition Clause Binary Classifier author: John Snow Labs name: legclf_recognition_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `recognition` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
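The paragraph splitting (by multiline) mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration only, not part of the Legal NLP library; the function name, the `min_chars` threshold, and the sample clause text are our own assumptions:

```python
import re

def split_paragraphs(text, min_chars=30):
    # Split on one or more blank lines (the "multiline" boundary),
    # then drop very short fragments (headers, page numbers) so the
    # classifier receives enough context per chunk.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text)]
    return [p for p in paragraphs if len(p) >= min_chars]

# Hypothetical document: a header, a clause paragraph, and page residue.
doc = """RECOGNITION OF RIGHTS

Each party hereby acknowledges and recognizes the validity of the
obligations set forth in this Agreement.

Page 3"""

for clause in split_paragraphs(doc):
    print(clause)
```

Each returned chunk can then be fed to the classifier pipeline as a row of the `clause_text` column shown in the usage example.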
## Predicted Entities `other`, `recognition` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_recognition_clause_en_1.0.0_3.2_1660123901737.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_recognition_clause_en_1.0.0_3.2_1660123901737.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_recognition_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------------+ | result| +-------------+ |[recognition]| |[other]| |[other]| |[recognition]| +-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_recognition_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.92 0.99 0.95 70 recognition 0.98 0.87 0.92 46 accuracy - - 0.94 116 macro-avg 0.95 0.93 0.94 116 weighted-avg 0.94 0.94 0.94 116 ``` --- layout: model title: English Bert Embeddings Cased model (from Shafin) author: John Snow Labs name: bert_embeddings_chemical_uncased_finetuned_cust_c1_cust date: 2023-02-21 tags: [open_source, bert, bert_embeddings, bertformaskedlm, en, tensorflow] task: Embeddings language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chemical-bert-uncased-finetuned-cust-c1-cust` is an English model originally trained by `Shafin`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chemical_uncased_finetuned_cust_c1_cust_en_4.3.0_3.0_1677001598364.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chemical_uncased_finetuned_cust_c1_cust_en_4.3.0_3.0_1677001598364.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_chemical_uncased_finetuned_cust_c1_cust","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_chemical_uncased_finetuned_cust_c1_cust","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chemical_uncased_finetuned_cust_c1_cust| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|412.1 MB| |Case sensitive:|true| ## References https://huggingface.co/shafin/chemical-bert-uncased-finetuned-cust-c1-cust --- layout: model title: Part of Speech for Urdu author: John Snow Labs name: pos_ud_udtb date: 2021-03-09 tags: [part_of_speech, open_source, urdu, pos_ud_udtb, ur] task: Part of Speech Tagging language: ur edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - PROPN - ADP - NUM - NOUN - CCONJ - ADJ - VERB - AUX - PUNCT - DET - PRON - ADV - PART - SCONJ - X - INTJ {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_udtb_ur_3.0.0_3.0_1615292277973.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_udtb_ur_3.0.0_3.0_1615292277973.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_udtb", "ur") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['جان برف لیبز سے ہیلو! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_udtb", "ur") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("جان برف لیبز سے ہیلو! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["جان برف لیبز سے ہیلو! "] token_df = nlu.load('ur.pos.ud_udtb').predict(text) token_df ```
## Results ```bash token pos 0 جان PROPN 1 برف PROPN 2 لیبز PROPN 3 سے ADP 4 ہیلو VERB 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_udtb| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|ur| --- layout: model title: Multilingual T5ForConditionalGeneration Cased model (from csebuetnlp) author: John Snow Labs name: t5_banglat5_nmt_bn2en date: 2023-01-30 tags: [bn, en, open_source, t5, xx, tensorflow] task: Text Generation language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `banglat5_nmt_bn_en` is a Multilingual model originally trained by `csebuetnlp`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_banglat5_nmt_bn2en_xx_4.3.0_3.0_1675100164821.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_banglat5_nmt_bn2en_xx_4.3.0_3.0_1675100164821.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_banglat5_nmt_bn2en","xx") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_banglat5_nmt_bn2en","xx") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_banglat5_nmt_bn2en| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|xx| |Size:|1.0 GB| ## References - https://huggingface.co/csebuetnlp/banglat5_nmt_bn_en - https://github.com/csebuetnlp/normalizer --- layout: model title: Explain Document DL Pipeline for English author: John Snow Labs name: explain_document_dl date: 2021-03-23 tags: [open_source, english, explain_document_dl, pipeline, en] supported: true task: [Named Entity Recognition, Lemmatization] language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_dl is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps and recognizes entities. It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_dl_en_3.0.0_3.0_1616473268265.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_dl_en_3.0.0_3.0_1616473268265.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_dl', lang = 'en') annotations = pipeline.fullAnnotate("The Mona Lisa is an oil painting from the 16th century.")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_dl", lang = "en") val result = pipeline.fullAnnotate("The Mona Lisa is an oil painting from the 16th century.")(0) ``` {:.nlu-block} ```python import nlu text = ["The Mona Lisa is an oil painting from the 16th century."] result_df = nlu.load('en.explain.dl').predict(text) result_df ```
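The pipeline's `ner` output uses token-level BIO tags, and the `entities` key merges consecutive `B-`/`I-` tags into chunks. A minimal pure-Python sketch of that merging, for intuition only (Spark NLP's own converter handles this internally; `bio_to_chunks` is a hypothetical helper, not part of the library):

```python
def bio_to_chunks(tokens, tags):
    """Merge BIO-tagged tokens into entity chunks, e.g. B-PER/I-PER -> one PER span."""
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # a new entity starts; flush any open chunk first
            if current:
                chunks.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            # continuation of the current entity
            current.append(token)
        else:
            # outside tag: close any open chunk
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tokens = ["The", "Mona", "Lisa", "is", "an", "oil", "painting"]
tags = ["O", "B-PER", "I-PER", "O", "O", "O", "O"]
print(bio_to_chunks(tokens, tags))  # ['Mona Lisa']
```

This mirrors how the `entities` column in the results table is derived from the `token` and `ner` columns.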
## Results ```bash +--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------+-----------+ | text| document| sentence| token| checked| lemma| stem| pos| embeddings| ner| entities| +--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------+-----------+ |The Mona Lisa is an oil painting from the 16th ...|[The Mona Lisa is an oil painting from the 16th...|[The Mona Lisa is an oil painting from the 16th...|[The, Mona, Lisa, is, an, oil, painting, from, ...|[The, Mona, Lisa, is, an, oil, painting, from, ...|[The, Mona, Lisa, be, an, oil, painting, from, ...|[the, mona, lisa, i, an, oil, paint, from, the,...|[DT, NNP, NNP, VBZ, DT, NN, NN, IN, DT, JJ, NN, .]|[[-0.038194, -0.24487, 0.72812, -0.39961, 0.083...|[O, B-PER, I-PER, O, O, O, O, O, O, O, O, O]|[Mona Lisa]| 
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------+-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_dl| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| --- layout: model title: Portuguese DistilBertForQuestionAnswering model (from mrm8488) author: John Snow Labs name: distilbert_qa_multi_finedtuned_squad date: 2022-06-09 tags: [question_answering, distilbert, pt, open_source] task: Question Answering language: pt edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-multi-finedtuned-squad-pt` is a Portuguese model originally trained by `mrm8488`. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_multi_finedtuned_squad_pt_4.0.0_3.0_1654758698273.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_multi_finedtuned_squad_pt_4.0.0_3.0_1654758698273.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_multi_finedtuned_squad", "pt") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Qual é o meu nome?", "Meu nome é Clara e moro em Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_multi_finedtuned_squad", "pt") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Qual é o meu nome?", "Meu nome é Clara e moro em Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.answer_question.squad.distil_bert").predict("""Qual é o meu nome?|||"Meu nome é Clara e moro em Berkeley.""") ```
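In the nlu snippet above, the question and its context travel in a single string separated by `|||`. A tiny helper for assembling that input (an illustrative sketch; `nlu_qa_input` is not an nlu API, and the separator is simply the one shown in the example):

```python
def nlu_qa_input(question, context, sep="|||"):
    """Join a question and its context into the 'question|||context' form used above."""
    return f"{question}{sep}{context}"

print(nlu_qa_input("Qual é o meu nome?", "Meu nome é Clara e moro em Berkeley."))
```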
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_multi_finedtuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|pt| |Size:|505.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/tli8hf/unqover-distilbert-base-uncased-squad --- layout: model title: Castilian, Spanish BertForQuestionAnswering model (from CenIA) author: John Snow Labs name: bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_mlqa date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-uncased-finetuned-qa-mlqa` is a Castilian, Spanish model originally trained by `CenIA`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_mlqa_es_4.0.0_3.0_1654180564395.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_mlqa_es_4.0.0_3.0_1654180564395.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_mlqa","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_mlqa","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.mlqa.bert.base_uncased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_mlqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|410.2 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/CenIA/bert-base-spanish-wwm-uncased-finetuned-qa-mlqa --- layout: model title: English DistilBERT Embeddings (85% sparse) author: John Snow Labs name: distilbert_embeddings_distilbert_base_uncased_sparse_85_unstructured_pruneofa date: 2022-04-12 tags: [distilbert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-uncased-sparse-85-unstructured-pruneofa` is an English model originally trained by `Intel`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_uncased_sparse_85_unstructured_pruneofa_en_3.4.2_3.0_1649783418930.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_uncased_sparse_85_unstructured_pruneofa_en_3.4.2_3.0_1649783418930.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_uncased_sparse_85_unstructured_pruneofa","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_uncased_sparse_85_unstructured_pruneofa","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.distilbert_base_uncased_sparse_85_unstructured_pruneofa").predict("""I love Spark NLP""") ```
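Once the `embeddings` column has been collected to the driver (for example via something like `result.selectExpr("explode(embeddings.embeddings)").collect()`, exact projection depending on your schema), token vectors can be compared directly. A minimal cosine-similarity sketch in plain Python, independent of Spark:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# identical directions -> similarity 1.0; orthogonal -> 0.0
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```

In practice the vectors would be the 768-dimensional arrays produced by the annotator; the toy 2-d vectors here only illustrate the computation.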
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_base_uncased_sparse_85_unstructured_pruneofa| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|132.8 MB| |Case sensitive:|false| ## References - https://huggingface.co/Intel/distilbert-base-uncased-sparse-85-unstructured-pruneofa - https://arxiv.org/abs/2111.05754 - https://github.com/IntelLabs/Model-Compression-Research-Package/tree/main/research/prune-once-for-all --- layout: model title: Explain Document Pipeline for Polish author: John Snow Labs name: explain_document_md date: 2021-03-22 tags: [open_source, polish, explain_document_md, pipeline, pl] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: pl edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_md is a pretrained pipeline that we can use to process text with a simple pipeline that performs basic processing steps. It performs most of the common text processing tasks on your dataframe {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_pl_3.0.0_3.0_1616434210078.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_md_pl_3.0.0_3.0_1616434210078.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_md', lang = 'pl') annotations = pipeline.fullAnnotate("Witaj z John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_md", lang = "pl") val result = pipeline.fullAnnotate("Witaj z John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Witaj z John Snow Labs! "] result_df = nlu.load('pl.explain.md').predict(text) result_df ```
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:-----------------------------|:----------------------------|:----------------------------------------|:----------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Witaj z John Snow Labs! '] | ['Witaj z John Snow Labs!'] | ['Witaj', 'z', 'John', 'Snow', 'Labs!'] | ['witać', 'z', 'John', 'Snow', 'Labs!'] | ['VERB', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.0, 0.0, 0.0, 0.0,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|pl| --- layout: model title: English asr_wav2vec2_large_xlsr_upper_sorbian_mixed TFWav2Vec2ForCTC from jimregan author: John Snow Labs name: asr_wav2vec2_large_xlsr_upper_sorbian_mixed date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_large_xlsr_upper_sorbian_mixed` is a English model originally trained by jimregan. 
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_upper_sorbian_mixed_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_upper_sorbian_mixed_en_4.2.0_3.0_1664020234664.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_upper_sorbian_mixed_en_4.2.0_3.0_1664020234664.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_upper_sorbian_mixed", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_upper_sorbian_mixed", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
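Both snippets assume an `audioDf` DataFrame whose `audio_content` column holds arrays of floats; Wav2Vec2 models generally expect 16 kHz mono audio scaled to [-1, 1]. A minimal sketch of converting signed 16-bit PCM samples to that range (the DataFrame schema in the comment is an assumption for illustration, not a documented contract):

```python
def pcm16_to_float(samples):
    """Scale signed 16-bit PCM samples (-32768..32767) to floats in [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]

# Hypothetical construction of the DataFrame expected by AudioAssembler:
# audioDf = spark.createDataFrame([(pcm16_to_float(raw_samples),)], ["audio_content"])
print(pcm16_to_float([0, 16384, -32768]))  # [0.0, 0.5, -1.0]
```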
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_upper_sorbian_mixed| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Fast Neural Machine Translation Model from Esperanto to English author: John Snow Labs name: opus_mt_eo_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, eo, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `eo` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_eo_en_xx_2.7.0_2.4_1609166672544.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_eo_en_xx_2.7.0_2.4_1609166672544.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_eo_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_eo_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.eo.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
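The `SentenceDetectorDLModel` stage above splits the input so that Marian translates sentence by sentence, keeping each sequence within the model's maximum input length. For intuition, here is a naive regex-based splitter that does a rough version of the same job (illustrative only; it is no substitute for the pretrained, multilingual detector):

```python
import re

def naive_sentence_split(text):
    """Naively split on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(naive_sentence_split("Saluton! Kiel vi fartas? Mi fartas bone."))
```

A rule like this breaks on abbreviations and quotations, which is exactly why the pipeline uses a learned sentence detector instead.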
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_eo_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate English to Luba-Lulua Pipeline author: John Snow Labs name: translate_en_lua date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, lua, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `lua` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_lua_xx_2.7.0_2.4_1609687243733.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_lua_xx_2.7.0_2.4_1609687243733.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_lua", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_lua", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.lua').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_lua| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Word2Vec Embeddings in Luxembourgish (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, lb, open_source] task: Embeddings language: lb edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_lb_3.4.1_3.0_1647443574578.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_lb_3.4.1_3.0_1647443574578.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","lb") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ech hu gär Spark nltp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","lb") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ech hu gär Spark nltp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("lb.embed.w2v_cc_300d").predict("""Ech hu gär Spark nltp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|lb| |Size:|370.0 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Finnish BERT Embeddings (Base Uncased) author: John Snow Labs name: bert_base_finnish_uncased date: 2022-01-03 tags: [open_source, embeddings, fi, bert] task: Embeddings language: fi edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A version of Google's BERT deep transfer learning model for Finnish. The model can be fine-tuned to achieve state-of-the-art results for various Finnish natural language processing tasks. FinBERT features a custom 50,000 wordpiece vocabulary that has much better coverage of Finnish words than e.g. the previously released multilingual BERT models from Google. FinBERT has been pre-trained for 1 million steps on over 3 billion tokens (24B characters) of Finnish text drawn from news, online discussion, and internet crawls. By contrast, Multilingual BERT was trained on Wikipedia texts, where the Finnish Wikipedia text is approximately 3% of the amount used to train FinBERT. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_finnish_uncased_fi_3.4.0_3.0_1641223281610.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_finnish_uncased_fi_3.4.0_3.0_1641223281610.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_base_finnish_uncased", "fi") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") sample_data = spark.createDataFrame([['Syväoppimisen tavoitteena on luoda algoritmien avulla neuroverkko, joka pystyy ratkaisemaan sille annetut ongelmat.']], ["text"]) nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(sample_data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_base_finnish_uncased", "fi") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("Syväoppimisen tavoitteena on luoda algoritmien avulla neuroverkko, joka pystyy ratkaisemaan sille annetut ongelmat.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fi.embed_sentence.bert").predict("""Syväoppimisen tavoitteena on luoda algoritmien avulla neuroverkko, joka pystyy ratkaisemaan sille annetut ongelmat.""")  ```
## Results ```bash +--------------------+-------------+ | embeddings| token| +--------------------+-------------+ |[0.9422476, -0.14...|Syväoppimisen| |[2.0408847, -1.45...| tavoitteena| |[2.33223, -1.7228...| on| |[0.6425015, -0.96...| luoda| |[0.10455999, -0.2...| algoritmien| |[0.28626734, -0.2...| avulla| |[1.0091506, -0.75...| neuroverkko| |[1.501086, -0.651...| ,| |[1.2654709, -0.82...| joka| |[1.710053, -0.406...| pystyy| |[0.43736708, -0.2...| ratkaisemaan| |[1.0496894, 0.191...| sille| |[0.8630942, -0.16...| annetut| |[0.50174934, -1.3...| ongelmat| |[0.27278847, -0.9...| .| +--------------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_finnish_uncased| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[bert]| |Language:|fi| |Size:|464.1 MB| |Case sensitive:|false| --- layout: model title: French CamemBert Embeddings (from ericchchiu) author: John Snow Labs name: camembert_embeddings_ericchchiu_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model orginally trained by `ericchchiu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_ericchchiu_generic_model_fr_3.4.4_3.0_1653988475622.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_ericchchiu_generic_model_fr_3.4.4_3.0_1653988475622.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_ericchchiu_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_ericchchiu_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_ericchchiu_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/ericchchiu/dummy-model --- layout: model title: English DistilBertForQuestionAnswering Cased model author: John Snow Labs name: distilbert_qa_base_cased_distilled_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad` is an English model originally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_distilled_squad_en_4.0.0_3.0_1654723573821.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_distilled_squad_en_4.0.0_3.0_1654723573821.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_distilled_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_distilled_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_cased.by_uploaded by huggingface").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
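In the nlu one-liner above, the question and context travel in a single string separated by `|||`. A small hypothetical helper (not part of nlu itself) illustrating that packing convention:

```python
def pack_qa(question, context):
    # Join question and context with the '|||' separator used by nlu QA loaders.
    return f"{question}|||{context}"

def unpack_qa(packed):
    # Split a packed string back into its (question, context) parts.
    question, _, context = packed.partition("|||")
    return question, context

packed = pack_qa("What is my name?", "My name is Clara and I live in Berkeley.")
print(unpack_qa(packed))
```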
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_cased_distilled_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/distilbert-base-cased-distilled-squad --- layout: model title: Translate Tumbuka to English Pipeline author: John Snow Labs name: translate_tum_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, tum, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `tum` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_tum_en_xx_2.7.0_2.4_1609690520108.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_tum_en_xx_2.7.0_2.4_1609690520108.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_tum_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_tum_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.tum.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_tum_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Classifying Proposal Comments author: John Snow Labs name: legclf_bert_support_proposal date: 2023-02-17 tags: [en, legal, licensed, classification, proposal, support, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Given a proposal on a socially important issue, the model classifies whether a comment is `In_Favor`, `Against`, or `Neutral` towards the proposal. ## Predicted Entities `In_Favor`, `Neutral`, `Against` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_bert_support_proposal_en_1.0.0_3.0_1676599695968.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_bert_support_proposal_en_1.0.0_3.0_1676599695968.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"]) \ .setOutputCol("token") classifier = legal.BertForSequenceClassification.pretrained("legclf_bert_support_proposal", "en", "legal/models")\ .setInputCols(["document", "token"])\ .setOutputCol("class") clf_pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, classifier ]) empty_df = spark.createDataFrame([['']]).toDF("text") model = clf_pipeline.fit(empty_df) sample_text = ["""This is one of the most boring movies I have ever seen, its horrible. Christopher Lee is good but he is hardly in it, the only good part is the opening scene. Don't be fooled by the title. "End of the World" is truly a bad movie, I stopped watching it close to the end it was so bad, only for die-hard b-movie fans that have the brain to stand this vomit.""", """Of course, there is still a lot of possible improvement in the pipeline, but we definitely don't have to wait for some genius new technology to start. Why am I so definitely against this proposal though it sounds so reasonable and helpful? I'm definitely against the notion that we'll have to wait for a new genius industrial technology to show up to even think of starting a proper transformation. In my opinion, the opposite is true: We have to start right now with what we have & by the way develop better concepts of how to use all the technology & methods already available optimally. And for me, nuclear energy which is - by the way - relaunched with this proposal, is definitely not part of the game, not even in the modular mini-nuke version of Mr. Gates. There are people who know much more about renewable energy than Mr. Gates & completely energy independent who hate that book because of this crap.""", """One common defense policy would strengthen the voice and influence in our own backyard. 
A strong EU army can be a stabilizing factor in the unstable regions around our continent. We Europeans should take our safety and defense into our own hands and not rely on the US to do it for us.""" ] test = spark.createDataFrame([[t] for t in sample_text]).toDF("text") result = model.transform(test) ```
## Results ```bash +--------+--------------------+ | class| document| +--------+--------------------+ | Neutral|This is one of the...| | Against|Of course, there ...| |In_Favor|One common defense...| +--------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_bert_support_proposal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|403.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References Train dataset available [here](https://touche.webis.de/clef23/touche23-web/multilingual-stance-classification.html#data) ## Benchmarking ```bash label precision recall f1-score support Against 0.84 0.87 0.86 85 In_Favor 0.87 0.84 0.86 90 Neutral 0.98 0.98 0.98 57 accuracy - - 0.89 232 macro-avg 0.90 0.90 0.90 232 weighted-avg 0.89 0.89 0.89 232 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from huxxx657) author: John Snow Labs name: roberta_qa_base_finetuned_scrambled_squad_10_new date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-scrambled-squad-10-new` is an English model originally trained by `huxxx657`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_scrambled_squad_10_new_en_4.3.0_3.0_1674216770257.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_scrambled_squad_10_new_en_4.3.0_3.0_1674216770257.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_scrambled_squad_10_new","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_scrambled_squad_10_new","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_finetuned_scrambled_squad_10_new| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-scrambled-squad-10-new --- layout: model title: English RobertaForQuestionAnswering Large Cased model (from mbartolo) author: John Snow Labs name: roberta_qa_large_syn date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-synqa` is an English model originally trained by `mbartolo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_large_syn_en_4.2.4_3.0_1669988247364.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_large_syn_en_4.2.4_3.0_1669988247364.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_syn","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_syn","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_large_syn| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbartolo/roberta-large-synqa - https://arxiv.org/abs/2002.00293 - https://arxiv.org/abs/2104.08678 - https://paperswithcode.com/sota?task=Question+Answering&dataset=squad --- layout: model title: Detect financial entities author: John Snow Labs name: ner_financial_contract date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract key entities in financial contracts using this pretrained NER model. ## Predicted Entities `ORG`, `PER`, `MISC`, `LOC` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_financial_contract_en_3.0.0_3.0_1617260833629.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_financial_contract_en_3.0.0_3.0_1617260833629.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", "xx")\ .setInputCols("sentence", "token")\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_financial_contract", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = """Hans is a professor at the Norwegian University of Copenhagen, and he is a true Copenhagener.""" result = model.transform(spark.createDataFrame([[text]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", "xx") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_financial_contract", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter)) val data = Seq("""Hans is a professor at the Norwegian University of Copenhagen, and he is a true Copenhagener.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.financial_contract").predict("""Put your text here.""") ```
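`NerConverter` groups the token-level IOB tags emitted by the NER model into the chunk/label pairs shown in the Results below. A simplified pure-Python sketch of that grouping logic (illustrative only, not the actual implementation):

```python
def iob_to_chunks(tokens, tags):
    # Merge B-/I- tagged tokens into (chunk_text, label) pairs,
    # as NerConverter does for the ner_chunk column.
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O" tag or inconsistent continuation closes the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Hans", "is", "a", "professor", "at", "Norwegian", "University"]
tags = ["B-PER", "O", "O", "O", "O", "B-ORG", "I-ORG"]
print(iob_to_chunks(tokens, tags))  # [('Hans', 'PER'), ('Norwegian University', 'ORG')]
```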
## Results ```bash +--------------------+---------+ |chunk |ner_label| +--------------------+---------+ |professor |PER | |Norwegian University|PER | |Copenhagen |LOC | +--------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_financial_contract| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s368 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s368 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s368` is a German model originally trained by jonatasgrosman. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s368_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s368_de_4.2.0_3.0_1664117328908.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s368_de_4.2.0_3.0_1664117328908.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s368', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s368", lang = "de") val annotations = pipeline.transform(audioDF) ```
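Both snippets assume an `audioDF` whose `audio_content` column holds floating-point audio samples. A minimal sketch of producing such a float array from a WAV file using only the standard library (the synthetic tone file and the 16 kHz rate are assumptions for illustration; a real pipeline would read actual recordings at the sample rate the model expects):

```python
import math
import struct
import wave

def read_wav_floats(path):
    # Read a 16-bit mono WAV file and normalize samples to floats in [-1, 1].
    with wave.open(path, "rb") as wf:
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Write a short synthetic 440 Hz tone so the example is self-contained.
rate = 16000
with wave.open("tone.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(rate)
    for i in range(rate // 10):  # 100 ms of audio
        wf.writeframes(struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * 440 * i / rate))))

floats = read_wav_floats("tone.wav")
print(len(floats))  # 1600 samples

# The float list could then become one row of audioDF, e.g.:
#   audioDF = spark.createDataFrame([[floats]], ["audio_content"])
```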
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s368| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Translate Venda to English Pipeline author: John Snow Labs name: translate_ve_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ve, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `ve` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ve_en_xx_2.7.0_2.4_1609691141431.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ve_en_xx_2.7.0_2.4_1609691141431.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ve_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ve_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ve.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ve_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English Deberta Embeddings model (from totoro4007) author: John Snow Labs name: deberta_embeddings_mlm_tuned_v3_large_mnli_fever_anli_ling_wanli date: 2023-03-12 tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, en, tensorflow] task: Embeddings language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DeBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `MLM-tuned-DeBERTa-v3-large-mnli-fever-anli-ling-wanli` is an English model originally trained by `totoro4007`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_mlm_tuned_v3_large_mnli_fever_anli_ling_wanli_en_4.3.1_3.0_1678635128966.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_mlm_tuned_v3_large_mnli_fever_anli_ling_wanli_en_4.3.1_3.0_1678635128966.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_mlm_tuned_v3_large_mnli_fever_anli_ling_wanli","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_mlm_tuned_v3_large_mnli_fever_anli_ling_wanli","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|deberta_embeddings_mlm_tuned_v3_large_mnli_fever_anli_ling_wanli| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|1.6 GB| |Case sensitive:|false| ## References https://huggingface.co/totoro4007/MLM-tuned-DeBERTa-v3-large-mnli-fever-anli-ling-wanli --- layout: model title: Part of Speech for Chinese author: John Snow Labs name: pos_ud_gsd date: 2021-03-09 tags: [part_of_speech, open_source, chinese, pos_ud_gsd, zh] task: Part of Speech Tagging language: zh edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - VERB - ADJ - PUNCT - ADV - NUM - NOUN - PRON - PART - X - ADP - DET - CONJ - PROPN - AUX - SYM {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_zh_3.0.0_3.0_1615292352973.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_zh_3.0.0_3.0_1615292352973.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_gsd", "zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['从John Snow Labs你好! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_gsd", "zh") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("从John Snow Labs你好! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["从John Snow Labs你好! "] token_df = nlu.load('zh.pos.ud_gsd').predict(text) token_df ```
## Results ```bash token pos 0 从 ADP 1 JohnSnowLabs X 2 你 PRON 3 好 ADV 4 ! VERB ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_gsd| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|zh| --- layout: model title: Fast Neural Machine Translation Model from English to Basque author: John Snow Labs name: opus_mt_en_eu date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, eu, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `eu` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_eu_xx_2.7.0_2.4_1609167764286.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_eu_xx_2.7.0_2.4_1609167764286.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_eu", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_eu", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.eu').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_eu| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: German Electra Embeddings (from stefan-it) author: John Snow Labs name: electra_embeddings_electra_base_gc4_64k_1000000_cased_generator date: 2022-05-17 tags: [de, open_source, electra, embeddings] task: Embeddings language: de edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-gc4-64k-1000000-cased-generator` is a German model originally trained by `stefan-it`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_1000000_cased_generator_de_3.4.4_3.0_1652786313340.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_1000000_cased_generator_de_3.4.4_3.0_1652786313340.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_1000000_cased_generator","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_1000000_cased_generator","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
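Each annotation in the resulting `embeddings` column carries a per-token float vector, so once the vectors are collected to the driver, comparing tokens reduces to plain vector math. A stdlib-only sketch of cosine similarity over two such vectors (the function name is ours, not part of the Spark NLP API):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
```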
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electra_base_gc4_64k_1000000_cased_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|de| |Size:|222.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/stefan-it/electra-base-gc4-64k-1000000-cased-generator - https://german-nlp-group.github.io/projects/gc4-corpus.html - https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf --- layout: model title: Russian T5ForConditionalGeneration Cased model (from IlyaGusev) author: John Snow Labs name: t5_rut5_tox date: 2023-01-30 tags: [ru, open_source, t5, tensorflow] task: Text Generation language: ru edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rut5_tox` is a Russian model originally trained by `IlyaGusev`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_rut5_tox_ru_4.3.0_3.0_1675107092843.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_rut5_tox_ru_4.3.0_3.0_1675107092843.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_rut5_tox","ru") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_rut5_tox","ru") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_rut5_tox| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ru| |Size:|955.7 MB| ## References - https://huggingface.co/IlyaGusev/rut5_tox --- layout: model title: Chinese BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_zh_cased date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-zh-cased` is a Chinese model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_zh_cased_zh_4.2.4_3.0_1670019345444.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_zh_cased_zh_4.2.4_3.0_1670019345444.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_zh_cased","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_zh_cased","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
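A common follow-up to token-level embeddings is pooling them into a single sentence vector, which is what downstream annotators such as `SentenceEmbeddings` do with the `AVERAGE` strategy. A minimal stdlib-only sketch of that mean-pooling step, operating on already-collected vectors (the helper name is ours):

```python
def mean_pool(token_vectors):
    """Average a list of token embedding vectors into one sentence vector."""
    if not token_vectors:
        return []
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # -> [2.0, 3.0]
```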
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_zh_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|360.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-zh-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_10_h_256 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-10_H-256` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_10_h_256_zh_4.2.4_3.0_1670325728244.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_10_h_256_zh_4.2.4_3.0_1670325728244.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_10_h_256","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_10_h_256","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_10_h_256| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|51.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-10_H-256 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: English T5ForConditionalGeneration Cased model (from ceshine) author: John Snow Labs name: t5_paraphrase_quora_paws date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-paraphrase-quora-paws` is an English model originally trained by `ceshine`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_paraphrase_quora_paws_en_4.3.0_3.0_1675125097390.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_paraphrase_quora_paws_en_4.3.0_3.0_1675125097390.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_paraphrase_quora_paws","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_paraphrase_quora_paws","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_paraphrase_quora_paws| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|924.7 MB| ## References - https://huggingface.co/ceshine/t5-paraphrase-quora-paws - https://github.com/ceshine/finetuning-t5/tree/master/paraphrase --- layout: model title: Sentence Entity Resolver for NCI-t author: John Snow Labs name: sbiobertresolve_ncit date: 2023-03-26 tags: [entity_resolution, clinical, en, licensed, ncit] task: Entity Resolution language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities related to clinical care, translational and basic research, public information and administrative activities to NCI-t codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. ## Predicted Entities `NCI-t codes` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_ncit_en_4.3.2_3.0_1679843528109.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_ncit_en_4.3.2_3.0_1679843528109.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setPreservePosition(False) chunk2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_ncit","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_converter, chunk2doc, sbert_embedder, resolver]) data = spark.createDataFrame([["""45 years old patient had Percutaneous mitral valve repair. He had Pericardiectomy 2 years ago. 
He has left cardiac ventricular systolic dysfunction in his history."""]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_ncit","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver)) val data = Seq("45 years old patient had Percutaneous mitral valve repair. He had Pericardiectomy 2 years ago. He has left cardiac ventricular systolic dysfunction in his history.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
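Each resolution annotation packs its ranked candidates into delimiter-separated metadata strings. A minimal sketch of unpacking them into `(code, resolution)` pairs; the metadata key names (`all_k_results`, `all_k_resolutions`) and the `:::` separator are assumptions from common Spark NLP for Healthcare conventions and should be verified against your installed version:

```python
def top_candidates(metadata, k=3, sep=":::"):
    """Split the resolver's packed candidate strings into (code, resolution) pairs.

    Assumes metadata keys 'all_k_results' and 'all_k_resolutions' joined by ':::';
    check these against your Spark NLP for Healthcare version.
    """
    codes = metadata.get("all_k_results", "").split(sep)
    texts = metadata.get("all_k_resolutions", "").split(sep)
    return list(zip(codes, texts))[:k]

meta = {
    "all_k_results": "C51643:::C51618:::C100004",
    "all_k_resolutions": "pericardiectomy:::pericardiostomy:::pericardial stripping",
}
print(top_candidates(meta, k=2))
```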
## Results ```bash | sent_id | ner_chunk | entity | NCI-t Code | all_codes | resolutions | |----------:|:----------------------------------------------|:----------|:-------------|:---------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------| | 0 | Percutaneous mitral valve repair | TREATMENT | C100003 | ['C100003', 'C158019', 'C80449', 'C50818', 'C80448'| ['percutaneous mitral valve repair [percutaneous mitral valve repair]', 'transcatheter mitral valve repair [transcatheter mitral valve...| | 1 | Pericardiectomy | TREATMENT | C51643 | ['C51643', 'C51618', 'C100004', 'C62550', 'C51791' | ['pericardiectomy [pericardiectomy]', 'pericardiostomy [pericardiostomy]', 'pericardial stripping [pericardial stripping]', 'pulpectom...| | 2 | left cardiac ventricular systolic dysfunction | PROBLEM | C64251 | ['C64251', 'C146719', 'C55062', 'C50629', 'C111655'| ['left cardiac ventricular systolic dysfunction [left cardiac ventricular systolic dysfunction]', 'left ventricular systolic dysfuncti...| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_ncit| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[nci-t_code]| |Language:|en| |Size:|1.5 GB| |Case sensitive:|false| ## References https://evs.nci.nih.gov/ftp1/NCI_Thesaurus/ --- layout: model title: Pipeline to Detect Adverse Drug Events (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_ade_pipeline date: 2023-03-20 tags: [ner, bertfortokenclassification, adverse, ade, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on 
the top of [bert_token_classifier_ner_ade](https://nlp.johnsnowlabs.com/2022/01/04/bert_token_classifier_ner_ade_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ade_pipeline_en_4.3.0_3.2_1679308398107.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_ade_pipeline_en_4.3.0_3.2_1679308398107.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_ade_pipeline", "en", "clinical/models") text = '''Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_ade_pipeline", "en", "clinical/models") val text = "Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ade_pipeline").predict("""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.""") ```
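`fullAnnotate` returns, per input text, a dict of annotation lists; each annotation exposes `.result`, `.begin`, `.end`, and a `.metadata` dict. A small sketch of flattening the `ner_chunk` annotations into rows like the results table below (the helper and the `FakeAnnotation` stand-in are ours, for illustration without a Spark session):

```python
def chunks_to_rows(annotations):
    """Flatten 'ner_chunk' annotations into (text, begin, end, label) rows.

    Assumes Spark NLP-style annotations with .result, .begin, .end and
    a .metadata dict carrying an 'entity' key.
    """
    return [(a.result, a.begin, a.end, a.metadata.get("entity"))
            for a in annotations.get("ner_chunk", [])]

class FakeAnnotation:
    """Minimal stand-in for a Spark NLP Annotation, for demonstration."""
    def __init__(self, result, begin, end, entity):
        self.result, self.begin, self.end = result, begin, end
        self.metadata = {"entity": entity}

rows = chunks_to_rows({"ner_chunk": [FakeAnnotation("naloxone", 40, 47, "DRUG")]})
print(rows)  # -> [('naloxone', 40, 47, 'DRUG')]
```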
## Results ```bash | ner_chunk | begin | end | ner_label | confidence | |-------------|---------|-------|-------------|--------------| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_ade_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.9 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: English BertForQuestionAnswering model (from peggyhuang) author: John Snow Labs name: bert_qa_finetune_bert_base_v3 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `finetune-bert-base-v3` is an English model originally trained by `peggyhuang`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_finetune_bert_base_v3_en_4.0.0_3.0_1654187775889.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_finetune_bert_base_v3_en_4.0.0_3.0_1654187775889.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_finetune_bert_base_v3","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_finetune_bert_base_v3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.base_v3.by_peggyhuang").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
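The `MultiDocumentAssembler` expects one `(question, context)` pair per row, so it is worth validating the pairs before handing them to `spark.createDataFrame(...).toDF("question", "context")`. A small Spark-free sketch of that shaping step (the helper name is ours):

```python
def qa_rows(pairs):
    """Validate and shape (question, context) pairs for a two-column DataFrame."""
    rows = []
    for question, context in pairs:
        if not question.strip() or not context.strip():
            raise ValueError("question and context must both be non-empty")
        rows.append((question, context))
    return rows

print(qa_rows([("What's my name?", "My name is Clara and I live in Berkeley.")]))
```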
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_finetune_bert_base_v3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/peggyhuang/finetune-bert-base-v3 --- layout: model title: Norwegian BertForMaskedLM Cased model (from ltgoslo) author: John Snow Labs name: bert_embeddings_norbert date: 2022-12-02 tags: ["no", open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: "no" edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `norbert` is a Norwegian model originally trained by `ltgoslo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_norbert_no_4.2.4_3.0_1670022742785.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_norbert_no_4.2.4_3.0_1670022742785.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_norbert","no") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_norbert","no") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_norbert| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|no| |Size:|417.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/ltgoslo/norbert - http://vectors.nlpl.eu/repository/20/216.zip - http://norlm.nlpl.eu/ - https://github.com/ltgoslo/NorBERT - https://arxiv.org/abs/2104.06546 - https://www.eosc-nordic.eu/ - https://www.mn.uio.no/ifi/english/research/projects/sant/index.html - https://www.mn.uio.no/ifi/english/research/groups/ltg/ --- layout: model title: Arabic XlmRoBertaForQuestionAnswering (from bhavikardeshna) author: John Snow Labs name: xlm_roberta_qa_xlm_roberta_base_arabic date: 2022-06-23 tags: [ar, open_source, question_answering, xlmroberta] task: Question Answering language: ar edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-arabic` is an Arabic model originally trained by `bhavikardeshna`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_arabic_ar_4.0.0_3.0_1655989183562.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_base_arabic_ar_4.0.0_3.0_1655989183562.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_base_arabic","ar") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_base_arabic","ar") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("ar.answer_question.xlm_roberta.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_roberta_base_arabic| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|ar| |Size:|884.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/bhavikardeshna/xlm-roberta-base-arabic --- layout: model title: Sentiment Analysis of Swahili texts author: John Snow Labs name: classifierdl_xlm_roberta_sentiment date: 2021-12-29 tags: [swahili, sentiment, sw, open_source] task: Sentiment Analysis language: sw edition: Spark NLP 3.3.4 spark_version: 3.0 supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model identifies positive or negative sentiments in Swahili texts. ## Predicted Entities `Negative`, `Positive` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_SW/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_SW.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_xlm_roberta_sentiment_sw_3.3.4_3.0_1640766370034.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_xlm_roberta_sentiment_sw_3.3.4_3.0_1640766370034.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") normalizer = Normalizer() \ .setInputCols(["token"]) \ .setOutputCol("normalized") stopwords_cleaner = StopWordsCleaner.pretrained("stopwords_sw", "sw") \ .setInputCols(["normalized"]) \ .setOutputCol("cleanTokens")\ .setCaseSensitive(False) embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base_finetuned_swahili", "sw")\ .setInputCols(["document", "cleanTokens"])\ .setOutputCol("embeddings") embeddingsSentence = SentenceEmbeddings() \ .setInputCols(["document", "embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") sentimentClassifier = ClassifierDLModel.pretrained("classifierdl_xlm_roberta_sentiment", "sw") \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") sw_pipeline = Pipeline(stages=[document_assembler, tokenizer, normalizer, stopwords_cleaner, embeddings, embeddingsSentence, sentimentClassifier]) light_pipeline = LightPipeline(sw_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) result1 = light_pipeline.annotate("Hadithi yenyewe ni ya kutabirika tu na ya uvivu.") result2 = light_pipeline.annotate("Mtandao wa kushangaza wa 4G katika mji wa Mombasa pamoja na mipango nzuri sana na ya bei rahisi.") print(result1["class"], result2["class"], sep = "\n") ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val normalizer = new Normalizer() .setInputCols(Array("token")) .setOutputCol("normalized") val stopwords_cleaner = StopWordsCleaner.pretrained("stopwords_sw", "sw") .setInputCols(Array("normalized")) .setOutputCol("cleanTokens") .setCaseSensitive(false) val embeddings = 
XlmRoBertaEmbeddings.pretrained("xlm_roberta_base_finetuned_swahili", "sw") .setInputCols(Array("document", "cleanTokens")) .setOutputCol("embeddings") val embeddingsSentence = new SentenceEmbeddings() .setInputCols(Array("document", "embeddings")) .setOutputCol("sentence_embeddings") .setPoolingStrategy("AVERAGE") val sentimentClassifier = ClassifierDLModel.pretrained("classifierdl_xlm_roberta_sentiment", "sw") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val sw_sentiment_pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, normalizer, stopwords_cleaner, embeddings, embeddingsSentence, sentimentClassifier)) val light_pipeline = new LightPipeline(sw_sentiment_pipeline.fit(Seq("").toDF("text"))) val result1 = light_pipeline.annotate("Hadithi yenyewe ni ya kutabirika tu na ya uvivu.") val result2 = light_pipeline.annotate("Mtandao wa kushangaza wa 4G katika mji wa Mombasa pamoja na mipango nzuri sana na ya bei rahisi.") ```
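The `AVERAGE` pooling strategy used by `SentenceEmbeddings` above takes the element-wise mean of the token-level vectors. A minimal pure-Python sketch of the idea, using toy 3-dimensional vectors rather than the model's real 768-dimensional embeddings (this is an illustration, not the Spark NLP implementation):

```python
def average_pool(token_vectors):
    """Element-wise mean of token embedding vectors (toy sketch)."""
    if not token_vectors:
        return []
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

# Toy 3-dimensional "token embeddings" for a two-token sentence
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
sentence_embedding = average_pool(tokens)
print(sentence_embedding)  # [2.0, 3.0, 4.0]
```

The resulting fixed-size sentence vector is what the downstream `ClassifierDLModel` stage consumes.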
## Results ```bash ['Negative'] ['Positive'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_xlm_roberta_sentiment| |Compatibility:|Spark NLP 3.3.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|sw| |Size:|23.0 MB| ## Data Source [https://github.com/Jinamizi/Swahili-sentiment-analysis](https://github.com/Jinamizi/Swahili-sentiment-analysis) ## Benchmarking ```bash label precision recall f1-score support Negative 0.79 0.84 0.81 85 Positive 0.86 0.82 0.84 103 accuracy - - 0.82 188 macro-avg 0.82 0.83 0.82 188 weighted-avg 0.83 0.82 0.82 188 ``` --- layout: model title: Detect mentions of general medical terms (coarse) author: John Snow Labs name: ner_medmentions_coarse date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract general medical terms, such as body parts, cells, genes, and symptoms, from text using a pretrained NER model. 
## Predicted Entities `Qualitative_Concept`, `Organization`, `Manufactured_Object`, `Amino_Acid,_Peptide,_or_Protein`, `Pharmacologic_Substance`, `Professional_or_Occupational_Group`, `Cell_Component`, `Neoplastic_Process`, `Substance`, `Laboratory_Procedure`, `Nucleic_Acid,_Nucleoside,_or_Nucleotide`, `Research_Activity`, `Gene_or_Genome`, `Indicator,_Reagent,_or_Diagnostic_Aid`, `Biologic_Function`, `Chemical`, `Mammal`, `Molecular_Function`, `Quantitative_Concept`, `Prokaryote`, `Mental_or_Behavioral_Dysfunction`, `Injury_or_Poisoning`, `Body_Location_or_Region`, `Spatial_Concept`, `Nucleotide_Sequence`, `Tissue`, `Pathologic_Function`, `Body_Substance`, `Fungus`, `Mental_Process`, `Medical_Device`, `Plant`, `Health_Care_Activity`, `Clinical_Attribute`, `Genetic_Function`, `Food`, `Therapeutic_or_Preventive_Procedure`, `Body_Part,_Organ,_or_Organ_Component`, `Geographic_Area`, `Virus`, `Biomedical_or_Dental_Material`, `Diagnostic_Procedure`, `Eukaryote`, `Anatomical_Structure`, `Organism_Attribute`, `Molecular_Biology_Research_Technique`, `Organic_Chemical`, `Cell`, `Daily_or_Recreational_Activity`, `Population_Group`, `Disease_or_Syndrome`, `Group`, `Sign_or_Symptom`, `Body_System` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_medmentions_coarse_en_3.0.0_3.0_1617260791003.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_medmentions_coarse_en_3.0.0_3.0_1617260791003.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_medmentions_coarse", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_medmentions_coarse", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("EXAMPLE_TEXT").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} 
```python import nlu nlu.load("en.med_ner.medmentions").predict("""Put your text here.""") ```
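The `NerConverter` stage in the pipeline above merges IOB-tagged tokens into entity chunks. A toy sketch of that merging logic, with simplified hypothetical labels (not the actual Spark NLP implementation, and shorter names than this model's real entity classes):

```python
def merge_iob(tokens, tags):
    """Merge (token, IOB-tag) pairs into (chunk, entity) pairs (toy sketch)."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new chunk begins
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)          # continue the open chunk
        else:                              # "O" closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["The", "left", "atrium", "was", "dilated"]
tags = ["O", "B-Body_Part", "I-Body_Part", "O", "B-Sign_or_Symptom"]
print(merge_iob(tokens, tags))
# [('left atrium', 'Body_Part'), ('dilated', 'Sign_or_Symptom')]
```

The `ner_chunk` column produced by the pipeline contains the equivalent merged chunks, each annotated with its entity label in the metadata.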
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_medmentions_coarse| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token, embeddings]| |Output Labels:|[ner]| |Language:|en| --- layout: model title: Chinese Word Segmentation author: John Snow Labs name: wordseg_large date: 2021-09-20 tags: [cn, zh, word_segmentation, open_source] task: Word Segmentation language: zh edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: WordSegmenterModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WordSegmenterModel (WSM) is based on a maximum entropy probability model to detect word boundaries in Chinese text. Chinese text is written without white space between the words, and a computer-based application cannot know a priori which sequence of ideograms form a word. Many natural language processing tasks such as part-of-speech (POS) and named entity recognition (NER) require word segmentation as an initial step. Reference: > Xue, Nianwen. “Chinese word segmentation as character tagging.” International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing. 2003. ## Predicted Entities {:.btn-box} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/chinese/word_segmentation/words_segmenter_demo.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_large_zh_3.0.0_3.0_1632144130387.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_large_zh_3.0.0_3.0_1632144130387.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh")\ .setInputCols("document")\ .setOutputCol("token") pipeline = Pipeline(stages=[ document_assembler, word_segmenter ]) ws_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) example = spark.createDataFrame([['然而,这样的处理也衍生了一些问题。']], ["text"]) result = ws_model.transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh") .setInputCols("document") .setOutputCol("token") val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter)) val data = Seq("然而,这样的处理也衍生了一些问题。").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python pipe = nlu.load('zh.segment_words.large') zh_data = ["然而,这样的处理也衍生了一些问题。"] df = pipe.predict(zh_data , output_level='token') df ```
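The character-tagging formulation from Xue (2003) referenced in the description labels each character as Begin, Middle, End, or Single; word boundaries are then read off the tag sequence. A toy decoder sketch with hypothetical tags (an illustration of the scheme, not the trained model's output):

```python
def decode_bmes(chars, tags):
    """Rebuild words from per-character BMES tags (toy sketch)."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":            # single-character word
            if current:
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":          # begin a multi-character word
            if current:
                words.append(current)
            current = ch
        elif tag == "M":          # middle of the current word
            current += ch
        else:                     # "E" ends the current word
            words.append(current + ch)
            current = ""
    if current:
        words.append(current)
    return words

chars = list("然而这样")
tags = ["B", "E", "B", "E"]
print(decode_bmes(chars, tags))  # ['然而', '这样']
```

The pretrained model predicts the per-character tags with a maximum entropy model; decoding them back into words is the simple step shown here.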
## Results ```bash +----------------------------------+--------------------------------------------------------+ |text |result | +----------------------------------+--------------------------------------------------------+ |然而,这样的处理也衍生了一些问题。|[然而, ,, 这样, 的, 处理, 也, 衍生, 了, 一些, 问题, 。]| +----------------------------------+--------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wordseg_large| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[token]| |Language:|zh| ## Data Source We created a curated large data set obtained from the Chinese Treebank, Weibo, and SIGHAN 2005 data sets. ## Benchmarking ```bash | Model | precision | recall | f1-score | |---------------|--------------|--------------|--------------| | WORDSEG_CTB | 0.6453 | 0.6341 | 0.6397 | | WORDSEG_WEIBO | 0.5454 | 0.5655 | 0.5553 | | WORDSEG_MSRA | 0.5984 | 0.6088 | 0.6035 | | WORDSEG_PKU | 0.6094 | 0.6321 | 0.6206 | | WORDSEG_LARGE | 0.6326 | 0.6269 | 0.6297 | ``` --- layout: model title: Part of Speech for Catalan author: John Snow Labs name: pos_ud_ancora date: 2020-07-29 23:34:00 +0800 task: Part of Speech Tagging language: ca edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [pos, ca] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_ancora_ca_2.5.5_2.4_1596053819077.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_ancora_ca_2.5.5_2.4_1596053819077.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_ancora", "ca") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("A part de ser el rei del nord, John Snow és un metge anglès i líder en el desenvolupament de l'anestèsia i la higiene mèdica.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_ancora", "ca") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("A part de ser el rei del nord, John Snow és un metge anglès i líder en el desenvolupament de l'anestèsia i la higiene mèdica.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""A part de ser el rei del nord, John Snow és un metge anglès i líder en el desenvolupament de l'anestèsia i la higiene mèdica."""] pos_df = nlu.load('ca.pos').predict(text, output_level='token') pos_df ```
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=0, result='ADP', metadata={'word': 'A'}), Row(annotatorType='pos', begin=2, end=5, result='NOUN', metadata={'word': 'part'}), Row(annotatorType='pos', begin=7, end=8, result='ADP', metadata={'word': 'de'}), Row(annotatorType='pos', begin=10, end=12, result='AUX', metadata={'word': 'ser'}), Row(annotatorType='pos', begin=14, end=15, result='DET', metadata={'word': 'el'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_ancora| |Type:|pos| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|ca| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: German BertForMaskedLM Cased model (from smanjil) author: John Snow Labs name: bert_embeddings_german_medbert date: 2022-12-02 tags: [de, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: de edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `German-MedBERT` is a German model originally trained by `smanjil`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_german_medbert_de_4.2.4_3.0_1670015301798.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_german_medbert_de_4.2.4_3.0_1670015301798.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_german_medbert","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_german_medbert","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_german_medbert| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|409.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/smanjil/German-MedBERT - https://opus4.kobv.de/opus4-rhein-waal/frontdoor/index/index/searchtype/collection/id/16225/start/0/rows/10/doctypefq/masterthesis/docId/740 - https://www.linkedin.com/in/manjil-shrestha-038527b4/ --- layout: model title: Spanish NER for Laws and Money author: John Snow Labs name: legner_law_money date: 2022-09-28 tags: [es, legal, ner, laws, money, licensed] task: Named Entity Recognition language: es edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Spanish Named Entity Recognition model for detecting laws and monetary amounts. This model was trained in-house using available annotations from this [dataset](https://huggingface.co/datasets/scjnugacj/scjn_dataset_ner) and weak labelling from this [model](https://huggingface.co/pitoneros/NER_LAW_MONEY4) ## Predicted Entities `LAW`, `MONEY` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_law_money_es_1.0.0_3.0_1664362333282.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_law_money_es_1.0.0_3.0_1664362333282.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = nlp.RoBertaForTokenClassification.pretrained("legner_law_money", "es", "legal/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = nlp.Pipeline( stages=[ documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) text = "La recaudación del ministerio del interior fue de 20,000,000 euros así constatado por el artículo 24 de la Constitución Española." data = spark.createDataFrame([[""]]).toDF("text") fitmodel = pipeline.fit(data) light_model = nlp.LightPipeline(fitmodel) light_result = light_model.fullAnnotate(text) for n in light_result[0]['ner_chunk']: print(f"{n.result} ({n.metadata['entity']})") ```
## Results ```bash 20,000,000 euros (MONEY) artículo 24 de la Constitución Española (LAW) ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_law_money| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|es| |Size:|414.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## References This model was trained in-house using available annotations from this [dataset](https://huggingface.co/datasets/scjnugacj/scjn_dataset_ner) and weak labelling from this [model](https://huggingface.co/pitoneros/NER_LAW_MONEY4) ## Benchmarking ```bash label precision recall f1-score support LAW 0.95 0.96 0.96 20 MONEY 0.98 0.99 0.99 106 accuracy - - 0.98 126 macro-avg 0.97 0.98 0.97 126 weighted-avg 0.98 0.99 0.99 126 ``` --- layout: model title: English BertForQuestionAnswering model (from KevinChoi) author: John Snow Labs name: bert_qa_KevinChoi_bert_finetuned_squad date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `KevinChoi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_KevinChoi_bert_finetuned_squad_en_4.0.0_3.0_1654535497769.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_KevinChoi_bert_finetuned_squad_en_4.0.0_3.0_1654535497769.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_KevinChoi_bert_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_KevinChoi_bert_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_KevinChoi").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
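Extractive QA models like the one above score each token position as a potential answer start and answer end; the predicted answer is the highest-scoring valid span in the context. A toy sketch of that span-selection step with hypothetical scores (not the model's real logits or its exact decoding rules):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick (start, end) maximizing start+end score with start <= end (toy sketch)."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider spans of bounded length starting at s
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end   = [0.1, 0.1, 0.2, 4.5, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```

In the pipeline above, this decoding happens inside the annotator; the `answer` output column carries the selected span as an annotation.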
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_KevinChoi_bert_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/KevinChoi/bert-finetuned-squad --- layout: model title: Legal Exchange Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_exchange_agreement_bert date: 2022-11-24 tags: [en, legal, classification, agreement, exchange, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_exchange_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `exchange-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `exchange-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_exchange_agreement_bert_en_1.0.0_3.0_1669311478318.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_exchange_agreement_bert_en_1.0.0_3.0_1669311478318.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_exchange_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +--------------------+ |result | +--------------------+ |[exchange-agreement]| |[other] | |[other] | |[exchange-agreement]| +--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_exchange_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scrapped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support exchange-agreement 0.97 0.75 0.85 40 other 0.89 0.99 0.94 82 accuracy - - 0.91 122 macro-avg 0.93 0.87 0.89 122 weighted-avg 0.92 0.91 0.91 122 ``` --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_2 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-512-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_2_en_4.0.0_3.0_1657185175793.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_2_en_4.0.0_3.0_1657185175793.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-512-finetuned-squad-seed-2 --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from chuchun9) author: John Snow Labs name: distilbert_qa_chuchun9_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `chuchun9`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_chuchun9_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770448637.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_chuchun9_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770448637.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_chuchun9_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_chuchun9_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_chuchun9_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/chuchun9/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal Non discrimination Clause Binary Classifier author: John Snow Labs name: legclf_non_discrimination_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `non-discrimination` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
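The paragraph-splitting option above (splitting by multiline) can be sketched in plain Python, independently of Spark NLP. This is a minimal illustration only; the helper name and the sample clauses are invented for the example:

```python
import re

def split_into_paragraphs(text: str) -> list:
    """Split a document into candidate clause chunks on blank lines,
    dropping empty fragments and surrounding whitespace."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Hypothetical two-clause contract excerpt, for illustration only.
contract = """Section 1. Non-Discrimination.
The Company shall not discriminate on any protected basis.

Section 2. Insurance.
The Contractor shall maintain liability insurance."""

paragraphs = split_into_paragraphs(contract)
# Each resulting paragraph can then be passed to the classifier as a separate row.
```

Each chunk stays well under the 512-token limit of the embeddings, which is the point of splitting before classification.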
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `non-discrimination` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_non_discrimination_clause_en_1.0.0_3.2_1660122736812.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_non_discrimination_clause_en_1.0.0_3.2_1660122736812.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_non_discrimination_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
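The description mentions chaining several clause classifiers to get one True/False-style signal per clause type. A rough sketch of what that could look like, reusing the API from the example above (this assumes a Legal NLP license; `legclf_insurance_clause` is another clause model from Models Hub, used here purely for illustration, and the output column names are made up):

```python
# Sketch only — requires Spark NLP for Legal and a valid license to run.
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("clause_text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

# One classifier per clause type, each writing to its own output column.
nonDiscrimination = nlp.ClassifierDLModel.pretrained("legclf_non_discrimination_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("non_discrimination_category")

insurance = nlp.ClassifierDLModel.pretrained("legclf_insurance_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("insurance_category")

nlpPipeline = nlp.Pipeline(stages=[documentAssembler, embeddings, nonDiscrimination, insurance])
```

Since every classifier predicts either its clause label or `other`, each output column effectively behaves as a yes/no flag for that clause type.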
## Results ```bash +-------+ | result| +-------+ |[non-discrimination]| |[other]| |[other]| |[non-discrimination]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_non_discrimination_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support non-discrimination 1.00 0.85 0.92 41 other 0.94 1.00 0.97 103 accuracy - - 0.96 144 macro-avg 0.97 0.93 0.95 144 weighted-avg 0.96 0.96 0.96 144 ``` --- layout: model title: Fast Neural Machine Translation Model from Berber to English author: John Snow Labs name: opus_mt_ber_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ber, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list).
- source languages: `ber` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ber_en_xx_2.7.0_2.4_1609167784403.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ber_en_xx_2.7.0_2.4_1609167784403.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ber_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["YOUR TEXT HERE"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ber_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("YOUR TEXT HERE").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ber.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ber_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Text cleaner v1 author: John Snow Labs name: text_cleaner_v1 date: 2023-01-10 tags: [en, licensed] task: OCR Text Cleaner language: en nav_key: models edition: Visual NLP 4.1.0 spark_version: 3.2.1 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Model for cleaning images that contain text. It is based on a text detection model with extra post-processing. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/ocr/IMAGE_CLEANER/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/Cards/SparkOcrImageCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/text_cleaner_v1_en_3.0.0_2.4_1640088709401.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/text_cleaner_v1_en_3.0.0_2.4_1640088709401.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pdf_to_image = PdfToImage() \ .setInputCol("content") \ .setOutputCol("image") \ .setResolution(300) ocr = ImageToText() \ .setInputCol("image") \ .setOutputCol("text") \ .setConfidenceThreshold(70) \ .setIgnoreResolution(False) cleaner = ImageTextCleaner \ .pretrained("text_cleaner_v1", "en", "clinical/ocr") \ .setInputCol("image") \ .setOutputCol("corrected_image") \ .setMedianBlur(0) \ .setSizeThreshold(10) \ .setTextThreshold(0.3) \ .setLinkThreshold(0.2) \ .setPadding(5) \ .setBinarize(False) ocr_corrected = ImageToText() \ .setInputCol("corrected_image") \ .setOutputCol("corrected_text") \ .setConfidenceThreshold(70) \ .setIgnoreResolution(False) pipeline = PipelineModel(stages=[ pdf_to_image, ocr, cleaner, ocr_corrected ]) pdf_example = 'data/pdfs/noised.pdf' pdf_example_df = spark.read.format("binaryFile").load(pdf_example).cache() results = pipeline.transform(pdf_example_df).cache() ``` ```scala val pdf_to_image = new PdfToImage() .setInputCol("content") .setOutputCol("image") .setResolution(300) val ocr = new ImageToText() .setInputCol("image") .setOutputCol("text") .setConfidenceThreshold(70) .setIgnoreResolution(false) val cleaner = ImageTextCleaner .pretrained("text_cleaner_v1", "en", "clinical/ocr") .setInputCol("image") .setOutputCol("corrected_image") .setMedianBlur(0) .setSizeThreshold(10) .setTextThreshold(0.3) .setLinkThreshold(0.2) .setPadding(5) .setBinarize(false) val ocr_corrected = new ImageToText() .setInputCol("corrected_image") .setOutputCol("corrected_text") .setConfidenceThreshold(70) .setIgnoreResolution(false) val pipeline = new PipelineModel().setStages(Array( pdf_to_image, ocr, cleaner, ocr_corrected)) val pdf_example = "data/pdfs/noised.pdf" val pdf_example_df = spark.read.format("binaryFile").load(pdf_example).cache() val results = pipeline.transform(pdf_example_df).cache() ```
## Example {%- capture input_image -%} ![Screenshot](/assets/images/examples_ocr/image4.png) {%- endcapture -%} {%- capture output_image -%} ![Screenshot](/assets/images/examples_ocr/image4_out.png) {%- endcapture -%} {% include templates/input_output_image.md input_image=input_image output_image=output_image %} ## Output text ```bash Detected text: Sample specifications written by , BLEND CASING RECASING - OLD GOLD STRAIGHT Tobacco Blend Control for Sample No. 5030 Cigarettes: OLD GOLD STRAIGHT John H. M. Bohlken FINAL FLAVOR MENTHOL FLAVOR Tars and Nicotine, Taste Panel, Burning Time, Gas Phase Analysis, Benzo (A) Pyrene Analyses — T/C -CF~ O.C S51: Fee - Written by -- John H. M. Bohlken Original to -Mr. C. L. Tucker, dr. Copies to ---Dr. A. W. Spears C ~ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|text_cleaner_v1| |Type:|ocr| |Compatibility:|Visual NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|77.1 MB| --- layout: model title: Legal Insurance Clause Binary Classifier author: John Snow Labs name: legclf_insurance_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `insurance` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification as sentence level. 
If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `insurance` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_insurance_clause_en_1.0.0_3.2_1660122557369.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_insurance_clause_en_1.0.0_3.2_1660122557369.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_insurance_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[insurance]| |[other]| |[other]| |[insurance]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_insurance_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support insurance 1.00 0.87 0.93 31 other 0.96 1.00 0.98 85 accuracy - - 0.97 116 macro-avg 0.98 0.94 0.95 116 weighted-avg 0.97 0.97 0.96 116 ``` --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_6 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-512-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_6_en_4.0.0_3.0_1654191630691.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_6_en_4.0.0_3.0_1654191630691.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_6","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_512d_seed_6").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|387.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-512-finetuned-squad-seed-6 --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from Ayushb) author: John Snow Labs name: roberta_qa_base_ft_esg date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-ft-esg` is an English model originally trained by `Ayushb`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_ft_esg_en_4.3.0_3.0_1674217855538.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_ft_esg_en_4.3.0_3.0_1674217855538.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_ft_esg","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_ft_esg","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_ft_esg| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|416.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Ayushb/roberta-base-ft-esg - https://www.github.com/Ayush1702 --- layout: model title: English image_classifier_vit_south_indian_foods ViTForImageClassification from Amrrs author: John Snow Labs name: image_classifier_vit_south_indian_foods date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_south_indian_foods` is an English model originally trained by Amrrs. ## Predicted Entities `idli`, `dosai`, `vadai`, `idiyappam`, `puttu` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_south_indian_foods_en_4.1.0_3.0_1660167045190.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_south_indian_foods_en_4.1.0_3.0_1660167045190.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_south_indian_foods", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_south_indian_foods", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_south_indian_foods| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Clinical Deidentification author: John Snow Labs name: german_deid_pipeline_spark24 date: 2022-03-03 tags: [licensed, de, deidentification, pipeline, clinical] task: Pipeline Public language: de edition: Healthcare NLP 3.4.1 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to deidentify PHI information from **German** medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `CITY`, `STREET`, `USERNAME`, `PROFESSION`, `PHONE`, `COUNTRY`, `DOCTOR`, `AGE`, `CONTACT`, `ID`, `ZIP`, `ACCOUNT`, `SSN`, `DLN`, `PLATE` entities. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_DE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/german_deid_pipeline_spark24_de_3.4.1_2.4_1646330797926.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/german_deid_pipeline_spark24_de_3.4.1_2.4_1646330797926.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "de", "clinical/models") sample = """Zusammenfassung : Michael Berger wird am Morgen des 12 Dezember 2018 ins St.Elisabeth Krankenhaus eingeliefert. Herr Michael Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: T0110053F Platte A-BC124 Kontonummer: DE89370400440532013000 SSN : 13110587M565 Lizenznummer: B072RRE2I55 Adresse : St.Johann-Straße 13 19300 """ result = deid_pipeline.annotate(sample) print("\n".join(result['masked'])) print("\n".join(result['masked_with_chars'])) print("\n".join(result['masked_fixed_length_chars'])) print("\n".join(result['obfuscated'])) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = PretrainedPipeline("clinical_deidentification","de","clinical/models") val sample = "Zusammenfassung : Michael Berger wird am Morgen des 12 Dezember 2018 ins St.Elisabeth Krankenhaus eingeliefert. Herr Michael Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: T0110053F Platte A-BC124 Kontonummer: DE89370400440532013000 SSN : 13110587M565 Lizenznummer: B072RRE2I55 Adresse : St.Johann-Straße 13 19300" val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("de.deid.pipeline").predict("""Zusammenfassung : Michael Berger wird am Morgen des 12 Dezember 2018 ins St.Elisabeth Krankenhaus eingeliefert. Herr Michael Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: T0110053F Platte A-BC124 Kontonummer: DE89370400440532013000 SSN : 13110587M565 Lizenznummer: B072RRE2I55 Adresse : St.Johann-Straße 13 19300 """) ```
## Results ```bash Masked with entity labels ------------------------------ Zusammenfassung : <PATIENT> wird am Morgen des <DATE> ins <HOSPITAL> eingeliefert. Herr <PATIENT> ist <AGE> Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: <ID> Platte <PLATE> Kontonummer: <ACCOUNT> SSN : <SSN> Lizenznummer: <DLN> Adresse : <STREET> <ZIP> Masked with chars ------------------------------ Zusammenfassung : [************] wird am Morgen des [**************] ins [**********************] eingeliefert. Herr [************] ist ** Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: [*******] Platte [*****] Kontonummer: [********************] SSN : [**********] Lizenznummer: [*********] Adresse : [*****************] [***] Masked with fixed length chars ------------------------------ Zusammenfassung : **** wird am Morgen des **** ins **** eingeliefert. Herr **** ist **** Jahre alt und hat zu viel Wasser in den Beinen. Persönliche Daten : ID-Nummer: **** Platte **** Kontonummer: **** SSN : **** Lizenznummer: **** Adresse : **** **** Obfuscated ------------------------------ Zusammenfassung : Herrmann Kallert wird am Morgen des 11-26-1977 ins International Neuroscience eingeliefert. Herr Herrmann Kallert ist 79 Jahre alt und hat zu viel Wasser in den Beinen.
Persönliche Daten : ID-Nummer: 136704D357 Platte QA348G Kontonummer: 192837465738 SSN : 1310011981M454 Lizenznummer: XX123456 Adresse : Klingelhöferring 31206 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|german_deid_pipeline_spark24| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|de| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ContextualParserModel - ChunkMergeModel - DeIdentificationModel - Finisher --- layout: model title: Explain Clinical Document Pipeline - CARP author: John Snow Labs name: explain_clinical_doc_carp date: 2023-04-20 tags: [pipeline, en, clinical, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pipeline with `ner_clinical`, `assertion_dl`, `re_clinical` and `ner_posology`. It will extract clinical and medication entities, assign assertion status and find relationships between clinical entities. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_carp_en_4.3.0_3.2_1682023443679.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_carp_en_4.3.0_3.2_1682023443679.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("explain_clinical_doc_carp", "en", "clinical/models") text = """A 28-year-old female with a history of gestational diabetes mellitus, used to take metformin 1000 mg two times a day, presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting. She was seen by the endocrinology service and discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals.""" result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("explain_clinical_doc_carp", "en", "clinical/models") val text = """A 28-year-old female with a history of gestational diabetes mellitus, used to take metformin 1000 mg two times a day, presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting. She was seen by the endocrinology service and discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals.""" val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.explain_doc.carp").predict("""A 28-year-old female with a history of gestational diabetes mellitus, used to take metformin 1000 mg two times a day, presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting. She was seen by the endocrinology service and discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals.""") ```
## Results ```bash | | chunks | ner_clinical | assertion | posology_chunk | ner_posology | relations | |---|-------------------------------|--------------|-----------|------------------|--------------|-----------| | 0 | gestational diabetes mellitus | PROBLEM | present | metformin | Drug | TrAP | | 1 | metformin | TREATMENT | present | 1000 mg | Strength | TrCP | | 2 | polyuria | PROBLEM | present | two times a day | Frequency | TrCP | | 3 | polydipsia | PROBLEM | present | 40 units | Dosage | TrWP | | 4 | poor appetite | PROBLEM | present | insulin glargine | Drug | TrCP | | 5 | vomiting | PROBLEM | present | at night | Frequency | TrAP | | 6 | insulin glargine | TREATMENT | present | 12 units | Dosage | TrAP | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_clinical_doc_carp| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.8 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - PerceptronModel - DependencyParserModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel - MedicalNerModel - NerConverterInternalModel - AssertionDLModel - RelationExtractionModel --- layout: model title: Extract Entities Related to TNM Staging author: John Snow Labs name: ner_oncology_tnm date: 2022-10-25 tags: [licensed, clinical, oncology, en, ner] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Healthcare 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts staging information and mentions related to tumors, lymph nodes and metastases. Definitions of Predicted Entities: - `Cancer_Dx`: Mentions of cancer diagnoses (such as "breast cancer") or pathological types that are usually used as synonyms for "cancer" (e.g. "carcinoma"). 
When anatomical references are present, they are included in the Cancer_Dx extraction. - `Lymph_Node`: Mentions of lymph nodes and pathological findings of the lymph nodes. - `Lymph_Node_Modifier`: Words that refer to a lymph node being abnormal (such as "enlargement"). - `Metastasis`: Terms that indicate a metastatic disease. Anatomical references are not included in these extractions. - `Staging`: Mentions of cancer stage such as "stage 2b" or "T2N1M0". It also includes words such as "in situ", "early-stage" or "advanced". - `Tumor`: All nonspecific terms that may be related to tumors, either malignant or benign (for example: "mass", "tumor", "lesion", or "neoplasm"). - `Tumor_Description`: Information related to tumor characteristics, such as size, presence of invasion, grade and histological type. ## Predicted Entities `Cancer_Dx`, `Lymph_Node`, `Lymph_Node_Modifier`, `Metastasis`, `Staging`, `Tumor`, `Tumor_Description` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_tnm_en_4.0.0_3.0_1666720053687.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_tnm_en_4.0.0_3.0_1666720053687.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_tnm", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The final diagnosis was metastatic breast carcinoma, and it was classified as T2N1M1 stage IV. 
The histological grade of this 4 cm tumor was grade 2."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_tnm", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The final diagnosis was metastatic breast carcinoma, and it was classified as T2N1M1 stage IV. The histological grade of this 4 cm tumor was grade 2.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_tnm").predict("""The final diagnosis was metastatic breast carcinoma, and it was classified as T2N1M1 stage IV. The histological grade of this 4 cm tumor was grade 2.""") ```
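The precision, recall and F1 figures reported in the Benchmarking section of this card follow directly from the raw tp/fp/fn counts, and the micro average pools those counts across all labels before computing the scores. A minimal sketch reproducing two rows of that table (counts copied from the card; the helper name is illustrative):

```python
def prf(tp, fp, fn):
    # precision, recall and F1 from raw true-positive / false-positive /
    # false-negative counts, rounded to two decimals as in the card
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return round(p, 2), round(r, 2), round(f1, 2)

# Metastasis row: tp=358, fp=15, fn=12
print(prf(358, 15, 12))      # -> (0.96, 0.97, 0.96)

# micro average: counts pooled over all labels (tp=6259, fp=859, fn=843)
print(prf(6259, 859, 843))   # -> (0.88, 0.88, 0.88)
```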
## Results ```bash | chunk | ner_label | |:-----------------|:------------------| | metastatic | Metastasis | | breast carcinoma | Cancer_Dx | | T2N1M1 stage IV | Staging | | 4 cm | Tumor_Description | | tumor | Tumor | | grade 2 | Tumor_Description | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_tnm| |Compatibility:|Spark NLP for Healthcare 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.2 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Lymph_Node 570 77 77 647 0.88 0.88 0.88 Staging 232 22 26 258 0.91 0.90 0.91 Lymph_Node_Modifier 30 5 5 35 0.86 0.86 0.86 Tumor_Description 2651 581 490 3141 0.82 0.84 0.83 Tumor 1116 72 141 1257 0.94 0.89 0.91 Metastasis 358 15 12 370 0.96 0.97 0.96 Cancer_Dx 1302 87 92 1394 0.94 0.93 0.94 macro_avg 6259 859 843 7102 0.90 0.90 0.90 micro_avg 6259 859 843 7102 0.88 0.88 0.88 ``` --- layout: model title: English LongformerForQuestionAnswering Large model (from manishiitg) author: John Snow Labs name: longformer_qa_recruit_large date: 2022-06-26 tags: [en, open_source, longformer, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: LongformerForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `longformer-recruit-qa-large` is an English model originally trained by `manishiitg`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/longformer_qa_recruit_large_en_4.0.0_3.0_1656255664915.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/longformer_qa_recruit_large_en_4.0.0_3.0_1656255664915.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_recruit_large","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_recruit_large","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.longformer.large").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|longformer_qa_recruit_large| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.6 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/manishiitg/longformer-recruit-qa-large --- layout: model title: Persian XLMRobertaForTokenClassification Base Cased model (from BK-V) author: John Snow Labs name: xlmroberta_ner_base_finetuned_arman date: 2022-08-13 tags: [fa, open_source, xlm_roberta, ner] task: Named Entity Recognition language: fa edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-arman-fa` is a Persian model originally trained by `BK-V`. ## Predicted Entities `pers`, `event`, `org`, `loc`, `pro`, `fac` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_arman_fa_4.1.0_3.0_1660426868156.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_arman_fa_4.1.0_3.0_1660426868156.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_arman","fa") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_arman","fa") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_arman| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|fa| |Size:|841.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/BK-V/xlm-roberta-base-finetuned-arman-fa --- layout: model title: Pipeline to Extract Biomarker Information author: John Snow Labs name: ner_biomarker_pipeline date: 2023-03-14 tags: [en, ner, clinical, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_biomarker](https://nlp.johnsnowlabs.com/2021/11/26/ner_biomarker_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_biomarker_pipeline_en_4.3.0_3.2_1678777993811.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_biomarker_pipeline_en_4.3.0_3.2_1678777993811.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_biomarker_pipeline", "en", "clinical/models") text = '''Here , we report the first case of an intraductal tubulopapillary neoplasm of the pancreas with clear cell morphology . Immunohistochemistry revealed positivity for Pan-CK , CK7 , CK8/18 , MUC1 , MUC6 , carbonic anhydrase IX , CD10 , EMA , β-catenin and e-cadherin ''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_biomarker_pipeline", "en", "clinical/models") val text = "Here , we report the first case of an intraductal tubulopapillary neoplasm of the pancreas with clear cell morphology . Immunohistochemistry revealed positivity for Pan-CK , CK7 , CK8/18 , MUC1 , MUC6 , carbonic anhydrase IX , CD10 , EMA , β-catenin and e-cadherin " val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.biomarker.pipeline").predict("""Here , we report the first case of an intraductal tubulopapillary neoplasm of the pancreas with clear cell morphology . Immunohistochemistry revealed positivity for Pan-CK , CK7 , CK8/18 , MUC1 , MUC6 , carbonic anhydrase IX , CD10 , EMA , β-catenin and e-cadherin """) ```
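In the Results section of this pipeline's card, `begin` and `end` are inclusive character offsets into the input text, so a chunk can be recovered by slicing with `end + 1`. A quick check against the example sentence used above:

```python
# opening of the example sentence from the fullAnnotate call above
text = ("Here , we report the first case of an intraductal tubulopapillary "
        "neoplasm of the pancreas with clear cell morphology .")

# Spark NLP annotation offsets are inclusive, so slice with end + 1
assert text[38:49] == "intraductal"               # begin=38, end=48
assert text[66:90] == "neoplasm of the pancreas"  # begin=66, end=89
print("offsets check out")
```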
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-------------------------|--------:|------:|:----------------------|-------------:| | 0 | intraductal | 38 | 48 | CancerModifier | 0.9998 | | 1 | tubulopapillary | 50 | 64 | CancerModifier | 0.9995 | | 2 | neoplasm of the pancreas | 66 | 89 | CancerDx | 0.7239 | | 3 | clear cell | 96 | 105 | CancerModifier | 0.96745 | | 4 | Immunohistochemistry | 120 | 139 | Test | 0.9768 | | 5 | positivity | 150 | 159 | Biomarker_Measurement | 0.8704 | | 6 | Pan-CK | 165 | 170 | Biomarker | 0.998 | | 7 | CK7 | 174 | 176 | Biomarker | 0.9977 | | 8 | CK8/18 | 180 | 185 | Biomarker | 0.9988 | | 9 | MUC1 | 189 | 192 | Biomarker | 0.9965 | | 10 | MUC6 | 196 | 199 | Biomarker | 0.9974 | | 11 | carbonic anhydrase IX | 203 | 223 | Biomarker | 0.814033 | | 12 | CD10 | 227 | 230 | Biomarker | 0.9975 | | 13 | EMA | 234 | 236 | Biomarker | 0.9985 | | 14 | β-catenin | 240 | 248 | Biomarker | 0.9948 | | 15 | e-cadherin | 254 | 263 | Biomarker | 0.9952 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_biomarker_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4CHEMD_Chem_Modified_BlueBERT_384 date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness 
using Spark NLP. `BC4CHEMD-Chem-Modified-BlueBERT-384` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Modified_BlueBERT_384_en_4.0.0_3.0_1657108356733.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Modified_BlueBERT_384_en_4.0.0_3.0_1657108356733.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Modified_BlueBERT_384","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Modified_BlueBERT_384","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4CHEMD_Chem_Modified_BlueBERT_384| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|408.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4CHEMD-Chem-Modified-BlueBERT-384 --- layout: model title: English asr_wav2vec2_vee_demo_colab TFWav2Vec2ForCTC from KISSz author: John Snow Labs name: pipeline_asr_wav2vec2_vee_demo_colab date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_vee_demo_colab` is an English model originally trained by KISSz. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_vee_demo_colab_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_vee_demo_colab_en_4.2.0_3.0_1664104907151.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_vee_demo_colab_en_4.2.0_3.0_1664104907151.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_vee_demo_colab', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_vee_demo_colab", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_vee_demo_colab| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|3.3 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering model (from aodiniz) author: John Snow Labs name: bert_qa_bert_uncased_L_4_H_768_A_12_squad2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-4_H-768_A-12_squad2` is an English model originally trained by `aodiniz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_768_A_12_squad2_en_4.0.0_3.0_1654185350625.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_768_A_12_squad2_en_4.0.0_3.0_1654185350625.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_4_H_768_A_12_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_uncased_L_4_H_768_A_12_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.uncased_4l_768d_a12a_768d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
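In the NLU one-liner above, `|||` separates the question from the context inside a single string, mirroring the two columns that the `MultiDocumentAssembler` expects. The convention can be sketched in plain Python (the helper name is illustrative, not part of the nlu API):

```python
def split_qa(payload, sep="|||"):
    # split a "question|||context" string into the question/context pair
    question, _, context = payload.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
print(c)  # My name is Clara and I live in Berkeley.
```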
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_uncased_L_4_H_768_A_12_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|195.3 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-4_H-768_A-12_squad2 --- layout: model title: Fast Neural Machine Translation Model from English to Tahitian author: John Snow Labs name: opus_mt_en_ty date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, ty, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `ty` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ty_xx_2.7.0_2.4_1609168829489.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ty_xx_2.7.0_2.4_1609168829489.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentences") marian = MarianTransformer.pretrained("opus_mt_en_ty", "xx")\ .setInputCols(["sentences"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_ty", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.ty').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_ty| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering Mini Uncased model (from haritzpuerto) author: John Snow Labs name: bert_qa_minilm_l12_h384_uncased_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `MiniLM-L12-H384-uncased-squad` is an English model originally trained by `haritzpuerto`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_minilm_l12_h384_uncased_squad_en_4.0.0_3.0_1657182079855.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_minilm_l12_h384_uncased_squad_en_4.0.0_3.0_1657182079855.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_minilm_l12_h384_uncased_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_minilm_l12_h384_uncased_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_minilm_l12_h384_uncased_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|124.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/haritzpuerto/MiniLM-L12-H384-uncased-squad --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from mvonwyl) author: John Snow Labs name: roberta_qa_base_finetuned_squad2 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad2` is an English model originally trained by `mvonwyl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_squad2_en_4.3.0_3.0_1674217768093.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_squad2_en_4.3.0_3.0_1674217768093.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_squad2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_finetuned_squad2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mvonwyl/roberta-base-finetuned-squad2 --- layout: model title: Translate English to Hiligaynon Pipeline author: John Snow Labs name: translate_en_hil date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, hil, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `hil` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_hil_xx_2.7.0_2.4_1609691235841.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_hil_xx_2.7.0_2.4_1609691235841.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_hil", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_hil", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.hil').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_en_hil|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English DistilBertForQuestionAnswering model (from minhdang241) Tapt
author: John Snow Labs
name: distilbert_qa_robustqa_tapt
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `robustqa-tapt` is an English model originally trained by `minhdang241`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_robustqa_tapt_en_4.0.0_3.0_1654728600426.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_robustqa_tapt_en_4.0.0_3.0_1654728600426.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_robustqa_tapt","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_robustqa_tapt","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.distil_bert.by_minhdang241").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_robustqa_tapt| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/minhdang241/robustqa-tapt --- layout: model title: Sentence Entity Resolver for ICD10-CM (Augmented) author: John Snow Labs name: sbiobertresolve_icd10cm_augmented language: en nav_key: models repository: clinical/models date: 2020-12-13 task: Entity Resolution edition: Healthcare NLP 2.6.5 spark_version: 2.4 tags: [clinical,entity_resolution,en] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model maps extracted medical entities to ICD10-CM codes using chunk embeddings (augmented with synonyms, four times richer than previous resolver). {:.h2_title} ## Predicted Entities ICD10-CM Codes and their normalized definition with ``sbiobert_base_cased_mli`` sentence embeddings. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_2.6.4_2.4_1607890300949.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_2.6.4_2.4_1607890300949.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala ... 
val chunk2doc = new Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")

val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sbert_embeddings")

val icd10_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_icd10cm_augmented","en", "clinical/models")
.setInputCols(Array("sbert_embeddings"))
.setOutputCol("resolution")
.setDistanceFunction("EUCLIDEAN")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver))

val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret's Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.h2_title}
## Results

```bash
+--------------------+---------+------+------------------------------------------+---------------------+
|               chunk|   entity|  code|                         all_k_resolutions|          all_k_codes|
+--------------------+---------+------+------------------------------------------+---------------------+
|        hypertension|  PROBLEM|   I10|hypertension:::hypertension monitored::...|I10:::Z8679:::I159...|
|chronic renal ins...|  PROBLEM|  N189|chronic renal insufficiency:::chronic r...|N189:::P2930:::N19...|
|                COPD|  PROBLEM|  J449|copd - chronic obstructive pulmonary di...|J449:::J984:::J628...|
|           gastritis|  PROBLEM| K2970|gastritis:::chemical gastritis:::gastri...|K2970:::K2960:::K2...|
|                 TIA|  PROBLEM| S0690|cerebral trauma (disorder):::cerebral c...|S0690:::S060X:::G4...|
|a non-ST elevatio...|  PROBLEM|  I219|silent myocardial infarction (disorder)...|I219:::I248:::I256...|
|Guaiac positive s...|  PROBLEM|  K921|guaiac-positive stools:::acholic stool ...|K921:::R195:::R15:...|
|      mid LAD lesion|  PROBLEM| I2102|stemi involving left anterior descendin...|I2102:::I2101:::Q2...|
+--------------------+---------+------+------------------------------------------+---------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---------------|---------------------|
| Name: | sbiobertresolve_icd10cm_augmented |
| Type: | SentenceEntityResolverModel |
| Compatibility: | Spark NLP 2.6.5+ |
| License: | Licensed |
| Edition: | Official |
| Input labels: | [ner_chunk, chunk_embeddings] |
| Output labels: | [resolution] |
| Language: | en |
| Dependencies: | sbiobert_base_cased_mli |

{:.h2_title}
## Data Source

Trained on the ICD10 Clinical Modification dataset with ``sbiobert_base_cased_mli`` sentence embeddings. https://www.icd10data.com/ICD10CM/Codes/

---
layout: model
title: English Legal Roberta Embeddings
author: John Snow Labs
name: roberta_base_english_legal
date: 2023-02-17
tags: [en, english, embeddings, transformer, open_source, legal, tensorflow]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-english-roberta-base` is an English model originally trained by `joelito`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_english_legal_en_4.2.4_3.0_1676642304747.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_english_legal_en_4.2.4_3.0_1676642304747.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_english_legal", "en")\
.setInputCols(["sentence"])\
.setOutputCol("embeddings")
```
```scala
val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_english_legal", "en")
.setInputCols("sentence")
.setOutputCol("embeddings")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_base_english_legal|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|416.0 MB|
|Case sensitive:|true|

## References

https://huggingface.co/joelito/legal-english-roberta-base

---
layout: model
title: Fast Neural Machine Translation Model from Hiligaynon to English
author: John Snow Labs
name: opus_mt_hil_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, hil, en, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.

- source languages: `hil`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_hil_en_xx_2.7.0_2.4_1609167877079.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_hil_en_xx_2.7.0_2.4_1609167877079.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_hil_en", "xx")\
.setInputCols(["sentence"])\
.setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_hil_en", "xx")
.setInputCols("sentence")
.setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("Your sentence to translate!").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.hil.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_hil_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English RobertaForQuestionAnswering (from pierrerappolt)
author: John Snow Labs
name: roberta_qa_cart
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cart` is an English model originally trained by `pierrerappolt`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_cart_en_4.0.0_3.0_1655727847167.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_cart_en_4.0.0_3.0_1655727847167.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_cart","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_cart","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.by_pierrerappolt").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_cart|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|464.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/pierrerappolt/cart

---
layout: model
title: English DistilBertForMaskedLM Base Uncased model
author: John Snow Labs
name: distilbert_embeddings_base_uncased
date: 2022-12-12
tags: [en, open_source, distilbert_embeddings, distilbertformaskedlm]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased` is an English model originally trained by HuggingFace.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_base_uncased_en_4.2.4_3.0_1670864757483.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_base_uncased_en_4.2.4_3.0_1670864757483.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

tokenizer = Tokenizer() \
.setInputCols("document") \
.setOutputCol("token")

distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_base_uncased","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(False)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, distilbert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")

val distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_base_uncased","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setCaseSensitive(false)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, distilbert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_embeddings_base_uncased|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|

## References

- https://huggingface.co/distilbert-base-uncased
- https://arxiv.org/abs/1910.01108
- https://yknzhu.wixsite.com/mbweb
- https://en.wikipedia.org/wiki/English_Wikipedia

---
layout: model
title: English DistilBertForQuestionAnswering model (from jgammack)
author: John Snow Labs
name: distilbert_qa_jgammack_base_uncased_squad
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-squad` is an English model originally trained by `jgammack`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_jgammack_base_uncased_squad_en_4.0.0_3.0_1654727227938.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_jgammack_base_uncased_squad_en_4.0.0_3.0_1654727227938.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
.setInputCols(["question", "context"]) \
.setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_jgammack_base_uncased_squad","en") \
.setInputCols(["document_question", "document_context"]) \
.setOutputCol("answer")\
.setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_jgammack_base_uncased_squad","en")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_jgammack").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_jgammack_base_uncased_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/jgammack/distilbert-base-uncased-squad

---
layout: model
title: Pipeline to Map MESH Codes to Their Corresponding UMLS Codes
author: John Snow Labs
name: mesh_umls_mapping
date: 2023-03-29
tags: [en, licensed, clinical, resolver, pipeline, chunk_mapping, mesh, umls]
task: Chunk Mapping
language: en
edition: Healthcare NLP 4.3.2
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the `mesh_umls_mapper` model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/mesh_umls_mapping_en_4.3.2_3.2_1680120667976.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/mesh_umls_mapping_en_4.3.2_3.2_1680120667976.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("mesh_umls_mapping", "en", "clinical/models")

result = pipeline.fullAnnotate("C028491 D019326 C579867")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = new PretrainedPipeline("mesh_umls_mapping", "en", "clinical/models")

val result = pipeline.fullAnnotate("C028491 D019326 C579867")
```

{:.nlu-block}
```python
import nlu
nlu.load("en.mesh.umls.mapping").predict("""Put your text here.""")
```
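Conceptually, the chunk-mapping stage is a per-code dictionary lookup from MeSH codes to UMLS codes. A minimal pure-Python sketch of that behavior follows; it is illustrative only — the three pairs mirror the example in the Results section, while the actual pipeline bundles a complete mapping resource.

```python
# Illustrative sketch of what a MeSH -> UMLS chunk mapper does: each
# extracted code is looked up in a mapping table. The three pairs below
# mirror the pipeline's example output; the real model ships a full resource.
MESH_TO_UMLS = {
    "C028491": "C0043904",
    "D019326": "C0045010",
    "C579867": "C3696376",
}

def map_mesh_to_umls(text):
    """Map each whitespace-separated MeSH code to its UMLS code (None if unmapped)."""
    return [MESH_TO_UMLS.get(code) for code in text.split()]

print(map_mesh_to_umls("C028491 D019326 C579867"))
```

Unmapped codes come back as `None`, which is where a fallback (e.g. a second resolver stage) would normally take over.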
## Results

```bash
|    | mesh_code | umls_code |
|---:|:----------|:----------|
|  0 | C028491   | C0043904  |
|  1 | D019326   | C0045010  |
|  2 | C579867   | C3696376  |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|mesh_umls_mapping|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|3.8 MB|

## Included Models

- DocumentAssembler
- TokenizerModel
- ChunkMapperModel

---
layout: model
title: Legal Confidentiality Clause Binary Classifier
author: John Snow Labs
name: legclf_cuad_confidentiality_clause
date: 2022-11-28
tags: [en, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `confidentiality` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level.

If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
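The paragraph-splitting strategy mentioned above can be sketched in plain Python. This is only an illustration, not part of the Legal NLP API: it splits on blank lines (the "multiline" heuristic) and flags paragraphs whose rough whitespace token count exceeds the 512-token embedding budget — a real tokenizer would count subword tokens, so treat the count as an approximation.

```python
import re

def split_for_classification(text, max_tokens=512):
    """Split a long document on blank lines and flag each paragraph with
    whether it fits a rough whitespace-token budget (True = fits)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

sample = "1. CONFIDENTIALITY. Each party shall keep the terms secret.\n\n2. TERM. This Agreement lasts two years."
for clause, fits in split_for_classification(sample):
    print(fits, clause[:30])
```

Each `(paragraph, fits)` pair can then be fed to the classifier individually, keeping every input within the encoder's limit.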
This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `confidentiality`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cuad_confidentiality_clause_en_1.0.0_3.0_1669637089061.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cuad_confidentiality_clause_en_1.0.0_3.0_1669637089061.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_cuad_confidentiality_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-----------------+
|           result|
+-----------------+
|[confidentiality]|
|          [other]|
|          [other]|
|[confidentiality]|
+-----------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_cuad_confidentiality_clause|
|Type:|legal|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.2 MB|

## References

In-house annotations on the CUAD dataset

## Benchmarking

```bash
          label  precision  recall  f1-score  support
confidentiality       1.00    0.91      0.95       11
          other       0.95    1.00      0.97       19
       accuracy          -       -      0.97       30
      macro-avg       0.97    0.95      0.96       30
   weighted-avg       0.97    0.97      0.97       30
```

---
layout: model
title: Translate Pedi to English Pipeline
author: John Snow Labs
name: translate_nso_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, nso, en, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended.
- source languages: `nso` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_nso_en_xx_2.7.0_2.4_1609699554960.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_nso_en_xx_2.7.0_2.4_1609699554960.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_nso_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_nso_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.nso.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_nso_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English DistilBertForQuestionAnswering Cased model (from AyushPJ)
author: John Snow Labs
name: distilbert_qa_ai_club_inductions_21_nlp_bert
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ai-club-inductions-21-nlp-distilBERT` is an English model originally trained by `AyushPJ`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_ai_club_inductions_21_nlp_bert_en_4.3.0_3.0_1672765575832.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_ai_club_inductions_21_nlp_bert_en_4.3.0_3.0_1672765575832.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ai_club_inductions_21_nlp_bert","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ai_club_inductions_21_nlp_bert","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_ai_club_inductions_21_nlp_bert| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AyushPJ/ai-club-inductions-21-nlp-distilBERT --- layout: model title: Detect Anatomical Structures (Single Entity - embeddings_clinical) author: John Snow Labs name: ner_anatomy_coarse_en date: 2020-11-04 task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 2.6.1 spark_version: 2.4 tags: [ner, en, licensed, clinical] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description An NER model to extract all types of anatomical references in text using "embeddings_clinical" embeddings. It is a single entity model and generalizes all anatomical references to a single entity. ## Predicted Entities `Anatomy` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ANATOMY/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_en_2.6.1_2.4_1604435935079.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_en_2.6.1_2.4_1604435935079.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. 
Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPython.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_anatomy_coarse", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) data = spark.createDataFrame([["content in the lung tissue"]]).toDF("text") model = nlp_pipeline.fit(data) results = model.transform(data) ``` ```scala ... val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_anatomy_coarse", "en", "clinical/models") .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("content in the lung tissue").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.h2_title} ## Results The output is a dataframe with a sentence per row and a ``ner`` column containing all of the entity labels in the sentence, entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select ``token.result`` and ``ner.result`` from your output dataframe or add the ``Finisher`` to the end of your pipeline. ```bash | | ner_chunk | entity | |---:|------------------:|----------:| | 0 | lung tissue | Anatomy | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_anatomy_coarse| |Type:|NerDLModel| |Compatibility:|Spark NLP 2.6.1 +| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|[en]| |Case sensitive:|false| {:.h2_title} ## Data Source Trained on a custom dataset using 'embeddings_clinical'. {:.h2_title} ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|:--------------|------:|------:|------:|---------:|---------:|---------:| | 0 | B-Anatomy | 2568 | 165 | 158 | 0.939627 | 0.94204 | 0.940832 | | 1 | I-Anatomy | 1692 | 89 | 169 | 0.950028 | 0.909189 | 0.92916 | | 2 | Macro-average | 4260 | 254 | 327 | 0.944827 | 0.925614 | 0.935122 | | 3 | Micro-average | 4260 | 254 | 327 | 0.943731 | 0.928712 | 0.936161 | ``` --- layout: model title: 10K Item Section Classifier author: John Snow Labs name: finclf_10k_items date: 2023-03-10 tags: [en, licensed, classifier, 10k_items, finance, tensorflow] task: Text Classification language: en edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Multiclass classification model which identifies the item (section) number in a 10K filing. 
## Predicted Entities `section_1`, `section_2`, `section_3`, `section_7`, `section_8`, `section_10`, `section_12`, `section_13`, `section_14`, `section_15`, `section_1A`, `section_1B`, `section_7A`, `section_9A`, `section_9B` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_10k_items_en_1.0.0_3.0_1678450523713.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_10k_items_en_1.0.0_3.0_1678450523713.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") seq_classifier = finance.BertForSequenceClassification.pretrained("finclf_10k_items", "en", "finance/models") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = nlp.Pipeline(stages=[documentAssembler, tokenizer, seq_classifier]) data = spark.createDataFrame([["These issues could negatively affect the timely collection of our U.S. government invoices."]]).toDF("text") result = pipeline.fit(data).transform(data) ```
## Results ```bash +------------+ | result| +------------+ |[section_10]| +------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_10k_items| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|412.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References The training dataset is available [here](https://huggingface.co/datasets/JanosAudran/financial-reports-sec) ## Benchmarking ```bash label precision recall f1-score support section_1 0.59 0.66 0.62 112 section_10 0.73 0.72 0.72 137 section_12 0.95 1.00 0.97 124 section_13 0.93 0.94 0.94 212 section_14 0.99 0.97 0.98 172 section_15 0.91 0.84 0.87 139 section_1A 0.85 0.86 0.85 92 section_1B 0.70 0.64 0.67 233 section_2 0.85 0.78 0.81 172 section_3 0.60 0.69 0.64 224 section_7 0.92 0.93 0.92 164 section_7A 0.89 0.90 0.89 99 section_8 0.80 0.97 0.88 72 section_9A 0.91 0.93 0.92 75 section_9B 0.77 0.63 0.69 147 accuracy - - 0.81 2174 macro-avg 0.83 0.83 0.83 2174 weighted-avg 0.82 0.81 0.81 2174 ``` --- layout: model title: English image_classifier_vit_base_patch16_224_in21k_ucSat ViTForImageClassification from YKXBCi author: John Snow Labs name: image_classifier_vit_base_patch16_224_in21k_ucSat date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_patch16_224_in21k_ucSat` is an English model originally trained by YKXBCi.
## Predicted Entities `buildings`, `denseresidential`, `storagetanks`, `tenniscourt`, `parkinglot`, `golfcourse`, `intersection`, `harbor`, `river`, `runway`, `mediumresidential`, `chaparral`, `freeway`, `overpass`, `mobilehomepark`, `baseballdiamond`, `agricultural`, `airplane`, `sparseresidential`, `forest`, `beach` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_in21k_ucSat_en_4.1.0_3.0_1660165787445.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_in21k_ucSat_en_4.1.0_3.0_1660165787445.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_patch16_224_in21k_ucSat", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_patch16_224_in21k_ucSat", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_patch16_224_in21k_ucSat| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.0 MB| --- layout: model title: Fast Neural Machine Translation Model from Afrikaans to English author: John Snow Labs name: opus_mt_af_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, af, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `af` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_af_en_xx_2.7.0_2.4_1609168281652.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_af_en_xx_2.7.0_2.4_1609168281652.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_af_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_af_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.af.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_af_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English T5ForConditionalGeneration Small Cased model (from allenai) author: John Snow Labs name: t5_unifiedqa_v2_small_1363200 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `unifiedqa-v2-t5-small-1363200` is an English model originally trained by `allenai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_unifiedqa_v2_small_1363200_en_4.3.0_3.0_1675158081810.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_unifiedqa_v2_small_1363200_en_4.3.0_3.0_1675158081810.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_unifiedqa_v2_small_1363200","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_unifiedqa_v2_small_1363200","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
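The placeholder string above stands in for a UnifiedQA-style input: the upstream allenai/unifiedqa repo feeds the question and any answer options or context to T5 as one lowercased string joined by a literal `\n` token. A small stdlib helper sketching that convention (treat the exact format as an assumption to verify against the upstream README):

```python
def unifiedqa_input(question: str, *fields: str) -> str:
    """Build the single-string input UnifiedQA-style T5 checkpoints are
    trained on: lowercased, with question/options/context joined by a
    literal backslash-n token (assumed format; verify upstream)."""
    parts = [question.strip()] + [f.strip() for f in fields]
    return " \\n ".join(parts).lower()

# e.g. a multiple-choice question with its options
text = unifiedqa_input(
    "Which is the best conductor?",
    "(a) iron (b) copper (c) wood",
)
```

The resulting string would replace `"PUT YOUR STRING HERE"` in the snippet above.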
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_unifiedqa_v2_small_1363200| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|148.1 MB| ## References - https://huggingface.co/allenai/unifiedqa-v2-t5-small-1363200 - https://github.com/allenai/unifiedqa --- layout: model title: Clinical Deidentification (Spanish, augmented) author: John Snow Labs name: clinical_deidentification_augmented date: 2023-06-13 tags: [deid, es, licensed] task: De-identification language: es edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline is trained with sciwiki_300d embeddings and can be used to de-identify PHI in Spanish medical texts. It differs from the previous `clinical_deidentification` pipeline in that it includes the `ner_deid_subentity_augmented` NER model and improvements to the ContextualParsers and RegexMatchers. The PHI will be masked and obfuscated in the resulting text.
The pipeline can mask, fake or obfuscate the following entities: `AGE`, `DATE`, `PROFESSION`, `EMAIL`, `USERNAME`, `STREET`, `COUNTRY`, `CITY`, `DOCTOR`, `HOSPITAL`, `PATIENT`, `URL`, `MEDICALRECORD`, `IDNUM`, `ORGANIZATION`, `PHONE`, `ZIP`, `ACCOUNT`, `SSN`, `PLATE`, `SEX` and `IPADDR` ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_augmented_es_4.4.4_3.2_1686663780408.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_augmented_es_4.4.4_3.2_1686663780408.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from johnsnowlabs import * deid_pipeline = PretrainedPipeline("clinical_deidentification_augmented", "es", "clinical/models") sample = """Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. 
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com""" result = deid_pipeline .annotate(sample) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = new PretrainedPipeline("clinical_deidentification_augmented", "es", "clinical/models") sample = "Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. 
En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com " val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("es.deid.clinical_augmented").predict("""Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. 
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com""") ```
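Conceptually, masking replaces each detected PHI span with its entity label, while obfuscation swaps in a plausible fake value. A stdlib-only illustration of that difference on the e-mail entity (the real pipeline detects spans with NER models and contextual parsers, not a bare regex):

```python
import re

# Conceptual illustration only: treat an e-mail address as the detected PHI span.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask(text: str) -> str:
    """Replace each detected span with its entity label."""
    return EMAIL.sub("<EMAIL>", text)

def obfuscate(text: str, fake: str = "jane.doe@example.com") -> str:
    """Replace each detected span with a plausible fake value."""
    return EMAIL.sub(fake, text)

note = "Correo electrónico: marietta84@hotmail.com"
masked = mask(note)          # -> "Correo electrónico: <EMAIL>"
obfuscated = obfuscate(note)
```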
En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com " val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("es.deid.clinical_augmented").predict("""Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910 04. Domicilio: Calle Losada Martí 23, 5 B.. Localidad/ Provincia: Madrid. CP: 28016. Datos asistenciales. Fecha de nacimiento: 15/04/1977. País: España. Edad: 37 años Sexo: F. Fecha de Ingreso: 05/06/2018. Médico: María Merino Viveros NºCol: 28 28 35489. Informe clínico del paciente: varón de 37 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Extremadura en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. 
En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; TTPA 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: Rosa de Bengala +++; Test de Coombs > 1/1280; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). El paciente mejora significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. 
María Merino Viveros Hospital Universitario de Getafe Servicio de Endocrinología y Nutrición Carretera de Toledo km 12,500 28905 Getafe - Madrid (España) Correo electrónico: marietta84@hotmail.com""") ```
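The Results section below shows the same note de-identified under four policies: masking with entity labels, masking with same-length character runs, masking with fixed-length character runs, and obfuscation with surrogate values. A minimal, Spark-free Python sketch of the three masking policies (the `mask_entities` helper and its inputs are illustrative, not part of the pipeline API):

```python
def mask_entities(text, chunks, policy="entity_labels", fixed_len=4):
    """De-identify `text` by replacing each detected PHI chunk.

    chunks: list of (chunk_text, label) pairs, as an NER step would produce.
    policy:
      - "entity_labels": replace the chunk with its label, e.g. <PATIENT>
      - "chars":         bracketed asterisks preserving the chunk length, e.g. [**]
      - "fixed_chars":   a fixed-length run of asterisks, e.g. ****
    """
    for chunk, label in chunks:
        if policy == "entity_labels":
            repl = "<" + label + ">"
        elif policy == "chars":
            repl = "[" + "*" * max(len(chunk) - 2, 1) + "]"
        else:  # "fixed_chars"
            repl = "*" * fixed_len
        text = text.replace(chunk, repl)
    return text

sample = "Nombre: Jose. Localidad: Madrid. Fecha de Ingreso: 05/06/2018."
chunks = [("Jose", "PATIENT"), ("Madrid", "CITY"), ("05/06/2018", "DATE")]

print(mask_entities(sample, chunks, "entity_labels"))
# Nombre: <PATIENT>. Localidad: <CITY>. Fecha de Ingreso: <DATE>.
print(mask_entities(sample, chunks, "chars"))
# Nombre: [**]. Localidad: [****]. Fecha de Ingreso: [********].
print(mask_entities(sample, chunks, "fixed_chars"))
# Nombre: ****. Localidad: ****. Fecha de Ingreso: ****.
```

Obfuscation additionally swaps each chunk for a realistic surrogate of the same type (a different name, a shifted date, and so on); in the pipeline that is handled by the `DeIdentificationModel` stages with their faker dictionaries.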
## Results ```bash Results Masked with entity labels ------------------------------ Datos . Nombre: . Apellidos: . NHC: . NASS: . Domicilio: , B.. Localidad/ Provincia: . CP: . Datos asistenciales. Fecha de nacimiento: . País: . Edad: años Sexo: . Fecha de Ingreso: . Médico: NºCol: . Informe clínico : de años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; 25,8 seg. 
Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: +++; Test de Coombs > ; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. Servicio de Endocrinología y Nutrición - () Correo electrónico: Masked with chars ------------------------------ Datos [**********]. Nombre: [**] . Apellidos: [*************]. NHC: [*****]. NASS: [************]. Domicilio: [*******************], * B.. Localidad/ Provincia: [****]. CP: [***]. Datos asistenciales. Fecha de nacimiento: [********]. País: [****]. Edad: ** años Sexo: *. Fecha de Ingreso: [********]. Médico: [******************] NºCol: [*********]. 
Informe clínico [**********]: [***] de ** años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en [*********] en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; [**] 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. 
Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: [*************] +++; Test de Coombs > [****]; Brucellacapt > 1/5120. Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). [*********] [****] significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. [******************] [******************************] Servicio de Endocrinología y Nutrición [***************************] [***] [****] - [****] ([****]) Correo electrónico: [********************] Masked with fixed length chars ------------------------------ Datos ****. Nombre: **** . Apellidos: ****. NHC: ****. NASS: ****. Domicilio: ****, **** B.. Localidad/ Provincia: ****. CP: ****. Datos asistenciales. Fecha de nacimiento: ****. País: ****. Edad: **** años Sexo: ****. Fecha de Ingreso: ****. Médico: **** NºCol: ****. Informe clínico ****: **** de **** años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. 
Antes de comenzar el cuadro estuvo en **** en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; **** 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: **** +++; Test de Coombs > ****; Brucellacapt > 1/5120. 
Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). **** **** significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. **** **** Servicio de Endocrinología y Nutrición **** **** **** - **** (****) Correo electrónico: **** Obfuscated ------------------------------ Datos Hombre. Nombre: Aurora Garrido Paez . Apellidos: Aurora Garrido Paez. NHC: BBBBBBBBQR648597. NASS: 48127833R. Domicilio: C/ Rambla, 246, 5 B.. Localidad/ Provincia: Alicante. CP: 24202. Datos asistenciales. Fecha de nacimiento: 21/04/1977. País: Portugal. Edad: 56 años Sexo: Hombre. Fecha de Ingreso: 10/07/2018. Médico: Francisco José Roca Bermúdez NºCol: 21344083-P. Informe clínico Hombre: 041010000011 de 56 años con vida previa activa que refiere dolores osteoarticulares de localización variable en el último mes y fiebre en la última semana con picos (matutino y vespertino) de 40 C las últimas 24-48 horas, por lo que acude al Servicio de Urgencias. Antes de comenzar el cuadro estuvo en Zaragoza en una región endémica de brucella, ingiriendo leche de cabra sin pasteurizar y queso de dicho ganado. 
Entre los comensales aparecieron varios casos de brucelosis. Durante el ingreso para estudio del síndrome febril con antecedentes epidemiológicos de posible exposición a Brucella presenta un cuadro de orquiepididimitis derecha. La exploración física revela: Tª 40,2 C; T.A: 109/68 mmHg; Fc: 105 lpm. Se encuentra consciente, orientado, sudoroso, eupneico, con buen estado de nutrición e hidratación. En cabeza y cuello no se palpan adenopatías, ni bocio ni ingurgitación de vena yugular, con pulsos carotídeos simétricos. Auscultación cardíaca rítmica, sin soplos, roces ni extratonos. Auscultación pulmonar con conservación del murmullo vesicular. Abdomen blando, depresible, sin masas ni megalias. En la exploración neurológica no se detectan signos meníngeos ni datos de focalidad. Extremidades sin varices ni edemas. Pulsos periféricos presentes y simétricos. En la exploración urológica se aprecia el teste derecho aumentado de tamaño, no adherido a piel, con zonas de fluctuación e intensamente doloroso a la palpación, con pérdida del límite epidídimo-testicular y transiluminación positiva. Los datos analíticos muestran los siguentes resultados: Hemograma: Hb 13,7 g/dl; leucocitos 14.610/mm3 (neutrófilos 77%); plaquetas 206.000/ mm3. VSG: 40 mm 1ª hora. Coagulación: TQ 87%; Tecnogroup SL 25,8 seg. Bioquímica: Glucosa 117 mg/dl; urea 29 mg/dl; creatinina 0,9 mg/dl; sodio 136 mEq/l; potasio 3,6 mEq/l; GOT 11 U/l; GPT 24 U/l; GGT 34 U/l; fosfatasa alcalina 136 U/l; calcio 8,3 mg/dl. Orina: sedimento normal. Durante el ingreso se solicitan Hemocultivos: positivo para Brucella y Serologías específicas para Brucella: María Miguélez Sanz +++; Test de Coombs > 07-25-1974; Brucellacapt > 1/5120. 
Las pruebas de imagen solicitadas ( Rx tórax, Ecografía abdominal, TAC craneal, Ecocardiograma transtorácico) no evidencian patología significativa, excepto la Ecografía testicular, que muestra engrosamiento de la bolsa escrotal con pequeña cantidad de líquido con septos y testículo aumentado de tamaño con pequeñas zonas hipoecoicas en su interior que pueden representar microabscesos. Con el diagnóstico de orquiepididimitis secundaria a Brucella se instaura tratamiento sintomático (antitérmicos, antiinflamatorios, reposo y elevación testicular) así como tratamiento antibiótico específico: Doxiciclina 100 mg vía oral cada 12 horas (durante 6 semanas) y Estreptomicina 1 gramo intramuscular cada 24 horas (durante 3 semanas). F. 041010000011 significativamente de su cuadro tras una semana de ingreso, decidiéndose el alta a su domicilio donde completó la pauta de tratamiento antibiótico. En revisiones sucesivas en consultas se constató la completa remisión del cuadro. Remitido por: Dra. Francisco José Roca Bermúdez Hospital 12 de Octubre Servicio de Endocrinología y Nutrición Calle Ramón y Cajal s/n 03129 Zaragoza - Alicante (Portugal) Correo electrónico: richard@company.it {:.model-param} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification_augmented| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|es| |Size:|281.3 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - RegexMatcherModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: Brazilian Portuguese NER for Laws (Bert, Large) author: John Snow Labs name: legner_br_bert_large date: 2022-09-28 tags: [pt, legal, ner, laws, 
licensed] task: Named Entity Recognition language: pt edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Deep Learning Portuguese Named Entity Recognition model for the legal domain, trained using Large Bert Embeddings, and is able to predict the following entities: - ORGANIZACAO (Organizations) - JURISPRUDENCIA (Jurisprudence) - PESSOA (Person) - TEMPO (Time) - LOCAL (Location) - LEGISLACAO (Laws) - O (Other) You can find different versions of this model in Models Hub: - With a Deep Learning architecture (non-transformer) and Base Embeddings; - With a Deep Learning architecture (non-transformer) and Large Embeddings; - With a Transformers Architecture and Base Embeddings; - With a Transformers Architecture and Large Embeddings; ## Predicted Entities `PESSOA`, `ORGANIZACAO`, `LEGISLACAO`, `JURISPRUDENCIA`, `TEMPO`, `LOCAL` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_LEGAL_PT/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_br_bert_large_pt_1.0.0_3.0_1664362018252.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_br_bert_large_pt_1.0.0_3.0_1664362018252.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
import pandas as pd

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token")

tokenClassifier = nlp.BertForTokenClassification.pretrained("legner_br_bert_large","pt", "legal/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("ner")

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    tokenClassifier])

example = spark.createDataFrame(pd.DataFrame({'text': ["""diante do exposto , com fundamento nos artigos 32 , i , e 33 , da lei 8.443/1992 , submetem-se os autos à consideração superior , com posterior encaminhamento ao ministério público junto ao tcu e ao gabinete do relator , propondo : a ) conhecer do recurso e , no mérito , negar-lhe provimento ; b ) comunicar ao recorrente , ao superior tribunal militar e ao tribunal regional federal da 2ª região , a fim de fornecer subsídios para os processos judiciais 2001.34.00.024796-9 e 2003.34.00.044227-3 ; e aos demais interessados a deliberação que vier a ser proferida por esta corte ” ."""]}))

result = pipeline.fit(example).transform(example)
```
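The `ner` output column carries one IOB tag per token; entity chunks such as the LEGISLACAO span in the Results table are recovered by grouping each `B-` tag with the `I-` tags that follow it. A minimal, Spark-free sketch of that grouping (the `bio_to_chunks` helper is illustrative; in a pipeline this is what an `NerConverter` stage does):

```python
def bio_to_chunks(tokens, tags):
    """Group (token, IOB-tag) pairs into (entity_text, label) chunks.

    A chunk starts at a B- tag and extends over the following I- tags
    that carry the same label."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

# A fragment of the prediction shown in the Results table:
tokens = ["nos", "artigos", "32", ",", "i", ",", "e", "33", ",", "da", "lei", "8.443/1992", ","]
tags = ["O", "B-LEGISLACAO", "I-LEGISLACAO", "I-LEGISLACAO", "I-LEGISLACAO", "I-LEGISLACAO",
        "I-LEGISLACAO", "I-LEGISLACAO", "I-LEGISLACAO", "I-LEGISLACAO", "I-LEGISLACAO",
        "I-LEGISLACAO", "O"]
print(bio_to_chunks(tokens, tags))
# [('artigos 32 , i , e 33 , da lei 8.443/1992', 'LEGISLACAO')]
```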
## Results ```bash +-------------------+----------------+ | token| ner| +-------------------+----------------+ | diante| O| | do| O| | exposto| O| | ,| O| | com| O| | fundamento| O| | nos| O| | artigos| B-LEGISLACAO| | 32| I-LEGISLACAO| | ,| I-LEGISLACAO| | i| I-LEGISLACAO| | ,| I-LEGISLACAO| | e| I-LEGISLACAO| | 33| I-LEGISLACAO| | ,| I-LEGISLACAO| | da| I-LEGISLACAO| | lei| I-LEGISLACAO| | 8.443/1992| I-LEGISLACAO| | ,| O| | submetem-se| O| | os| O| | autos| O| | à| O| | consideração| O| | superior| O| | ,| O| | com| O| | posterior| O| | encaminhamento| O| | ao| O| | ministério| B-ORGANIZACAO| | público| I-ORGANIZACAO| | junto| O| | ao| O| | tcu| B-ORGANIZACAO| | e| O| | ao| O| | gabinete| O| | do| O| | relator| O| | ,| O| | propondo| O| | :| O| | a| O| | )| O| | conhecer| O| | do| O| | recurso| O| | e| O| | ,| O| | no| O| | mérito| O| | ,| O| | negar-lhe| O| | provimento| O| | ;| O| | b| O| | )| O| | comunicar| O| | ao| O| | recorrente| O| | ,| O| | ao| O| | superior| B-ORGANIZACAO| | tribunal| I-ORGANIZACAO| | militar| I-ORGANIZACAO| | e| O| | ao| O| | tribunal| B-ORGANIZACAO| | regional| I-ORGANIZACAO| | federal| I-ORGANIZACAO| | da| I-ORGANIZACAO| | 2ª| I-ORGANIZACAO| | região| I-ORGANIZACAO| | ,| O| | a| O| | fim| O| | de| O| | fornecer| O| | subsídios| O| | para| O| | os| O| | processos| O| | judiciais| O| |2001.34.00.024796-9|B-JURISPRUDENCIA| | e| O| |2003.34.00.044227-3|B-JURISPRUDENCIA| | ;| O| | e| O| | aos| O| | demais| O| | interessados| O| | a| O| | deliberação| O| | que| O| | vier| O| | a| O| | ser| O| | proferida| O| | por| O| | esta| O| | corte| O| | ”| O| | .| O| +-------------------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_br_bert_large| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|pt| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|128| ## References Original texts 
available in https://paperswithcode.com/sota?task=Token+Classification&dataset=lener_br and in-house data augmentation with weak labelling ## Benchmarking ```bash label precision recall f1-score support B-JURISPRUDENCIA 0.84 0.91 0.88 175 B-LEGISLACAO 0.96 0.96 0.96 347 B-LOCAL 0.69 0.68 0.68 40 B-ORGANIZACAO 0.95 0.71 0.81 441 B-PESSOA 0.91 0.95 0.93 221 B-TEMPO 0.94 0.86 0.90 176 I-JURISPRUDENCIA 0.86 0.91 0.89 461 I-LEGISLACAO 0.98 0.99 0.98 2012 I-LOCAL 0.54 0.53 0.53 72 I-ORGANIZACAO 0.94 0.76 0.84 768 I-PESSOA 0.93 0.98 0.95 461 I-TEMPO 0.90 0.85 0.88 66 O 0.99 1.00 0.99 38419 accuracy - - 0.98 43659 macro-avg 0.88 0.85 0.86 43659 weighted-avg 0.98 0.98 0.98 43659 ``` --- layout: model title: Swahili XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_luo_finetuned_ner_swahili date: 2022-08-01 tags: [sw, open_source, xlm_roberta, ner] task: Named Entity Recognition language: sw edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-luo-finetuned-ner-swahili` is a Swahili model originally trained by `mbeukman`. ## Predicted Entities `PER`, `DATE`, `ORG`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_luo_finetuned_ner_swahili_sw_4.1.0_3.0_1659354566122.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_luo_finetuned_ner_swahili_sw_4.1.0_3.0_1659354566122.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_luo_finetuned_ner_swahili","sw") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_luo_finetuned_ner_swahili","sw")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("document", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
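The model information below lists a maximum sentence length of 256 tokens, so longer inputs are truncated at inference time. If your documents exceed that, one common workaround is to split the token sequence into overlapping windows before classification and merge the predictions afterwards. A hedged sketch of the windowing step (the `window_tokens` helper is illustrative, not a Spark NLP API):

```python
def window_tokens(tokens, max_len=256, stride=32):
    """Split a token list into overlapping windows of at most max_len tokens,
    so no window exceeds the model's maximum sentence length.

    Consecutive windows overlap by `stride` tokens, giving boundary tokens
    some context in both windows."""
    if len(tokens) <= max_len:
        return [tokens]
    windows, start = [], 0
    while start < len(tokens):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += max_len - stride
    return windows

toks = ["t%d" % i for i in range(600)]
wins = window_tokens(toks)
assert all(len(w) <= 256 for w in wins)   # every window fits the model
assert wins[0][-32:] == wins[1][:32]      # adjacent windows share the overlap
```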
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_luo_finetuned_ner_swahili| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|sw| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-luo-finetuned-ner-swahili - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://github.com/masakhane-io/masakhane-ner --- layout: model title: Extract clinical problems (Voice of the Patients) author: John Snow Labs name: ner_vop_problem_wip date: 2023-05-19 tags: [licensed, clinical, en, ner, vop, patient, problem] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts clinical problems from documents written in patients’ own words, using a granular taxonomy. Note: The ‘wip’ suffix indicates that model development is a work in progress; the model will be finalised and its performance improved in upcoming releases. ## Predicted Entities `PsychologicalCondition`, `Disease`, `Symptom`, `HealthStatus`, `Modifier`, `InjuryOrPoisoning` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_wip_en_4.4.2_3.0_1684512901437.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_wip_en_4.4.2_3.0_1684512901437.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_problem_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. 
After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_problem_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:---------------------|:------------| | pain | Symptom | | fatigue | Symptom | | rheumatoid arthritis | Disease | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_problem_wip| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.9 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
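The benchmarking table below reports raw tp/fp/fn counts alongside the derived precision, recall, and F1 scores. Those derived columns follow directly from the counts; as a sanity check, the micro-average row can be reproduced with a few lines of plain Python:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# micro-average row of the ner_vop_problem_wip benchmark: tp=6560, fp=994, fn=1673
p, r, f1 = prf(6560, 994, 1673)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.87 0.8 0.83
```

Micro-averaging pools the counts over all labels before computing the metrics, so frequent labels like `Symptom` dominate; the macro-average row instead averages the per-label scores equally.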
## Benchmarking ```bash label tp fp fn total precision recall f1 PsychologicalCondition 315 30 26 341 0.91 0.92 0.92 Disease 1708 261 262 1970 0.87 0.87 0.87 Symptom 3669 440 1016 4685 0.89 0.78 0.83 HealthStatus 80 20 25 105 0.80 0.76 0.78 Modifier 695 210 292 987 0.77 0.70 0.73 InjuryOrPoisoning 93 33 52 145 0.74 0.64 0.69 macro_avg 6560 994 1673 8233 0.83 0.78 0.80 micro_avg 6560 994 1673 8233 0.87 0.80 0.83 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from Gam) author: John Snow Labs name: roberta_qa_base_finetuned_cuad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-cuad` is an English model originally trained by `Gam`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_cuad_en_4.3.0_3.0_1674216413698.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_cuad_en_4.3.0_3.0_1674216413698.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_cuad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_cuad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
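Under the hood, an extractive QA head such as `RoBertaForQuestionAnswering` scores every context token as a potential answer start and as a potential answer end, and returns the highest-scoring valid span. A toy sketch of that span selection; the tokens and scores below are invented for illustration, not actual model outputs:

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) pair maximizing start + end score, with end >= start."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        # only consider spans of at most max_len tokens
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.3, 5.0, 0.1, 0.0, 0.1, 0.2, 1.0, 0.0]
end   = [0.0, 0.1, 0.2, 4.5, 0.2, 0.1, 0.0, 0.1, 1.2, 0.1]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```

The `answer` column produced by the pipeline holds the text of the selected span for each (question, context) row.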
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_finetuned_cuad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|451.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Gam/roberta-base-finetuned-cuad --- layout: model title: Extraction of biomarker information author: John Snow Labs name: ner_biomarker date: 2021-11-26 tags: [en, ner, clinical, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained to extract biomarkers, therapies, and oncological and other general concepts from text. ## Predicted Entities `Oncogenes`, `Tumor_Finding`, `UnspecificTherapy`, `Ethnicity`, `Age`, `ResponseToTreatment`, `Biomarker`, `HormonalTherapy`, `Staging`, `Drug`, `CancerDx`, `Radiotherapy`, `CancerSurgery`, `TargetedTherapy`, `PerformanceStatus`, `CancerModifier`, `Radiological_Test_Result`, `Biomarker_Measurement`, `Metastasis`, `Radiological_Test`, `Chemotherapy`, `Test`, `Dosage`, `Test_Result`, `Immunotherapy`, `Date`, `Gender`, `Prognostic_Biomarkers`, `Duration`, `Predictive_Biomarkers` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_BIOMARKER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_biomarker_en_3.3.3_3.0_1637935088644.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3
URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_biomarker_en_3.3.3_3.0_1637935088644.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = WordEmbeddingsModel.pretrained('embeddings_clinical', 'en', 'clinical/models') \ .setInputCols(['sentence', 'token']) \ .setOutputCol('embeddings') clinical_ner = MedicalNerModel.pretrained("ner_biomarker", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["Here , we report the first case of an intraductal tubulopapillary neoplasm of the pancreas with clear cell morphology . 
Immunohistochemistry revealed positivity for Pan-CK , CK7 , CK8/18 , MUC1 , MUC6 , carbonic anhydrase IX , CD10 , EMA , β-catenin and e-cadherin "]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_biomarker", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("""Here , we report the first case of an intraductal tubulopapillary neoplasm of the pancreas with clear cell morphology . Immunohistochemistry revealed positivity for Pan-CK , CK7 , CK8/18 , MUC1 , MUC6 , carbonic anhydrase IX , CD10 , EMA , β-catenin and e-cadherin """).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.biomarker").predict("""Here , we report the first case of an intraductal tubulopapillary neoplasm of the pancreas with clear cell morphology . Immunohistochemistry revealed positivity for Pan-CK , CK7 , CK8/18 , MUC1 , MUC6 , carbonic anhydrase IX , CD10 , EMA , β-catenin and e-cadherin """) ```
## Results ```bash | | ner_chunk | entity | confidence | |---:|:-------------------------|:----------------------|-------------:| | 0 | intraductal | CancerModifier | 0.9934 | | 1 | tubulopapillary | CancerModifier | 0.6403 | | 2 | neoplasm of the pancreas | CancerDx | 0.758825 | | 3 | clear cell | CancerModifier | 0.9633 | | 4 | Immunohistochemistry | Test | 0.9534 | | 5 | positivity | Biomarker_Measurement | 0.8795 | | 6 | Pan-CK | Biomarker | 0.9975 | | 7 | CK7 | Biomarker | 0.9975 | | 8 | CK8/18 | Biomarker | 0.9987 | | 9 | MUC1 | Biomarker | 0.9967 | | 10 | MUC6 | Biomarker | 0.9972 | | 11 | carbonic anhydrase IX | Biomarker | 0.937567 | | 12 | CD10 | Biomarker | 0.9974 | | 13 | EMA | Biomarker | 0.9899 | | 14 | β-catenin | Biomarker | 0.8059 | | 15 | e-cadherin | Biomarker | 0.9806 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_biomarker| |Compatibility:|Healthcare NLP 3.3.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on data sampled from Mimic-III, and annotated in-house. 
## Benchmarking ```bash label tp fp fn prec rec f1 I-Oncogenes 73 65 64 0.5289855 0.5328467 0.53090906 B-Radiotherapy 130 8 12 0.942029 0.91549295 0.9285714 B-Chemotherapy 644 31 28 0.9540741 0.9583333 0.956199 I-Radiotherapy 72 4 8 0.94736844 0.9 0.92307687 B-Predictive_Biomarkers 0 0 2 0.0 0.0 0.0 I-Staging 71 11 30 0.86585367 0.7029703 0.77595633 B-Radiological_Test_Result 0 3 20 0.0 0.0 0.0 B-Drug 18 10 19 0.64285713 0.4864865 0.5538461 B-Dosage 123 20 28 0.86013985 0.81456953 0.8367347 I-Test_Result 22 11 44 0.6666667 0.33333334 0.44444448 I-CancerModifier 349 41 86 0.8948718 0.80229884 0.8460606 I-Predictive_Biomarkers 0 0 1 0.0 0.0 0.0 B-Date 131 19 34 0.87333333 0.7939394 0.831746 B-HormonalTherapy 114 5 12 0.9579832 0.9047619 0.9306123 B-Radiological_Test 105 38 21 0.73426574 0.8333333 0.78066915 B-Ethnicity 8 0 1 1.0 0.8888889 0.94117653 I-Radiological_Test 69 50 15 0.57983196 0.8214286 0.67980295 I-UnspecificTherapy 59 8 6 0.880597 0.9076923 0.8939394 I-Immunotherapy 100 25 22 0.8 0.8196721 0.80971664 B-UnspecificTherapy 92 16 12 0.8518519 0.88461536 0.8679245 I-ResponseToTreatment 5 18 76 0.2173913 0.061728396 0.09615384 B-ResponseToTreatment 6 18 38 0.25 0.13636364 0.1764706 B-Test_Result 23 17 20 0.575 0.53488374 0.55421686 I-Biomarker_Measurement 47 46 61 0.50537634 0.4351852 0.4676617 B-Test 286 145 138 0.6635731 0.6745283 0.6690058 B-TargetedTherapy 675 74 75 0.9012016 0.9 0.9006004 I-Biomarker 732 250 237 0.74541754 0.75541794 0.75038445 I-Radiological_Test_Result 8 6 86 0.5714286 0.08510638 0.14814815 B-CancerSurgery 194 29 34 0.8699552 0.85087717 0.86031044 I-Duration 37 47 57 0.44047618 0.39361703 0.41573036 B-Oncogenes 342 118 229 0.74347824 0.5989492 0.66343355 I-CancerDx 1272 131 123 0.90662867 0.911828 0.9092209 I-Age 19 4 4 0.82608694 0.82608694 0.826087 B-Immunotherapy 300 29 16 0.9118541 0.9493671 0.9302325 I-Prognostic_Biomarkers 4 3 7 0.5714286 0.36363637 0.44444445 B-Tumor_Finding 574 225 141 0.718398 0.8027972 0.75825626 B-CancerDx 
2620 205 169 0.9274336 0.9394048 0.9333808 I-TargetedTherapy 317 70 38 0.8191214 0.89295775 0.8544474 B-Gender 52 14 10 0.7878788 0.83870965 0.81250006 B-Metastasis 584 41 44 0.9344 0.9299363 0.9321628 I-Dosage 69 16 19 0.8117647 0.78409094 0.7976879 B-CancerModifier 852 135 166 0.8632219 0.83693516 0.84987533 B-Staging 71 27 23 0.7244898 0.7553192 0.7395834 I-Tumor_Finding 79 58 92 0.57664233 0.4619883 0.512987 I-Test 168 96 123 0.6363636 0.57731956 0.60540545 B-Age 42 7 6 0.85714287 0.875 0.8659794 I-HormonalTherapy 54 7 3 0.8852459 0.94736844 0.91525424 B-PerformanceStatus 11 2 0 0.84615386 1.0 0.9166667 I-Chemotherapy 60 6 9 0.90909094 0.8695652 0.8888889 I-Date 116 15 9 0.8854962 0.928 0.90625 B-Prognostic_Biomarkers 33 11 35 0.75 0.4852941 0.58928573 B-Duration 30 50 38 0.375 0.44117647 0.40540543 I-Metastasis 32 14 45 0.6956522 0.41558442 0.5203252 B-Biomarker_Measurement 437 124 175 0.7789661 0.71405226 0.745098 I-CancerSurgery 128 17 30 0.8827586 0.8101266 0.8448845 I-Drug 2 0 8 1.0 0.2 0.3333333 B-Biomarker 3027 571 332 0.8413007 0.9011611 0.8702027 I-PerformanceStatus 37 15 0 0.71153843 1.0 0.83146065 Macro-average 15525 3026 3181 0.7223804 0.675604 0.69820964 Micro-average 15525 3026 3181 0.8368821 0.8299476 0.8334004 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from mbeck) author: John Snow Labs name: roberta_qa_mbeck_base_squad2 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2` is a English model originally trained by `mbeck`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_mbeck_base_squad2_en_4.3.0_3.0_1674219019104.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_mbeck_base_squad2_en_4.3.0_3.0_1674219019104.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_mbeck_base_squad2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_mbeck_base_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_mbeck_base_squad2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|459.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeck/roberta-base-squad2 --- layout: model title: RxNorm Scd ChunkResolver author: John Snow Labs name: chunkresolve_rxnorm_scd_clinical date: 2021-04-16 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Entity resolution model based on KNN using word embeddings and Word Mover's Distance. ## Predicted Entities RxNorm codes and their normalized definitions with `clinical_embeddings`. {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_scd_clinical_en_3.0.0_3.0_1618603397185.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_scd_clinical_en_3.0.0_3.0_1618603397185.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... rxnormResolver = ChunkEntityResolverModel.pretrained("chunkresolve_rxnorm_scd_clinical", "en", "clinical/models")\ .setEnableLevenshtein(True)\ .setNeighbours(200).setAlternatives(5).setDistanceWeights([3,3,2,0,0,7])\ .setInputCols(["token", "chunk_embs_drug"])\ .setOutputCol("rxnorm_resolution") pipeline_rxnorm = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, jslNer, drugNer, jslConverter, drugConverter, jslChunkEmbeddings, drugChunkEmbeddings, rxnormResolver]) model = pipeline_rxnorm.fit(spark.createDataFrame([['']]).toDF("text")) results = model.transform(data) ``` ```scala ... val rxnormResolver = ChunkEntityResolverModel.pretrained("chunkresolve_rxnorm_scd_clinical", "en", "clinical/models") .setEnableLevenshtein(true) .setNeighbours(200).setAlternatives(5).setDistanceWeights(Array(3,3,2,0,0,7)) .setInputCols(Array("token", "chunk_embs_drug")) .setOutputCol("rxnorm_resolution") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, jslNer, drugNer, jslConverter, drugConverter, jslChunkEmbeddings, drugChunkEmbeddings, rxnormResolver)) val result = pipeline.fit(Seq("").toDS.toDF("text")).transform(data) ```
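The resolver above enables Levenshtein re-ranking (`setEnableLevenshtein(True)`), i.e. edit distance between the query chunk and candidate RxNorm descriptions contributes to the final candidate ordering. For reference, a compact pure-Python edit-distance implementation of the kind such re-ranking relies on (a sketch, not the library's internal code):

```python
def levenshtein(a, b):
    """Minimum number of single-character edits (insert/delete/substitute) turning a into b."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion from a
                            curr[j - 1] + 1,             # insertion into a
                            prev[j - 1] + (ca != cb)))   # substitution (free on match)
        prev = curr
    return prev[-1]

print(levenshtein("creatinine", "creatine"))  # 2
```

Small distances mean near-exact string matches, so this signal complements the embedding-based KNN distance for chunks with typos or minor spelling variants.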
## Results ```bash | coords | chunk | entity | rxnorm_opts | |--------------|-------------|-----------|-----------------------------------------------------------------------------------------| | 3::278::287 | creatinine | DrugChem | [(849628, Creatinine 800 MG Oral Capsule), (252180, Urea 10 MG/ML Topical Lotion), ...] | | 7::83::93 | cholesterol | DrugChem | [(2104173, beta Sitosterol 35 MG Oral Tablet), (832876, phytosterol esters 500 MG O...] | | 10::397::406 | creatinine | DrugChem | [(849628, Creatinine 800 MG Oral Capsule), (252180, Urea 10 MG/ML Topical Lotion), ...] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|chunkresolve_rxnorm_scd_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token, chunk_embeddings]| |Output Labels:|[rxnorm]| |Language:|en| --- layout: model title: English RobertaForQuestionAnswering (from Ching) author: John Snow Labs name: roberta_qa_negation_detector date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `negation_detector` is an English model originally trained by `Ching`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_negation_detector_en_4.0.0_3.0_1655729130284.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_negation_detector_en_4.0.0_3.0_1655729130284.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_negation_detector","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_negation_detector","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.by_Ching").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_negation_detector| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Ching/negation_detector --- layout: model title: Stopwords Remover for Turkish language (552 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, tr, open_source] task: Stop Words Removal language: tr edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_tr_3.4.1_3.0_1646673242019.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_tr_3.4.1_3.0_1646673242019.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","tr") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Sen benden daha iyi değilsin"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","tr") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Sen benden daha iyi değilsin").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("tr.stopwords").predict("""Sen benden daha iyi değilsin""")
```
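Conceptually, the `StopWordsCleaner` stage simply drops any token found in a fixed stopword list. A minimal sketch of that filtering with a tiny hand-picked subset of Turkish stopwords (the four entries below are assumptions for illustration, not the full 552-entry stopwords-iso list, and the case-insensitive matching is a simplification):

```python
stopwords_tr = {"sen", "benden", "daha", "iyi"}  # tiny illustrative subset

def remove_stopwords(tokens, stopwords):
    """Keep only tokens whose lowercase form is not in the stopword list."""
    return [t for t in tokens if t.lower() not in stopwords]

print(remove_stopwords("Sen benden daha iyi değilsin".split(), stopwords_tr))
# ['değilsin']
```

This mirrors the expected pipeline result shown below, where only `değilsin` survives the cleaning.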
## Results ```bash +----------+ |result | +----------+ |[değilsin]| +----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|tr| |Size:|3.1 KB| --- layout: model title: Legal Economic Policy Document Classifier (EURLEX) author: John Snow Labs name: legclf_economic_policy_bert date: 2023-03-06 tags: [en, legal, classification, clauses, economic_policy, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The `legclf_economic_policy_bert` model is a BERT sentence-embeddings document classifier that predicts whether a given document belongs to the class Economic_Policy or not (binary classification), according to EuroVoc labels. ## Predicted Entities `Economic_Policy`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_economic_policy_bert_en_1.0.0_3.0_1678111818763.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_economic_policy_bert_en_1.0.0_3.0_1678111818763.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_economic_policy_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Economic_Policy]| |[Other]| |[Other]| |[Economic_Policy]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_economic_policy_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.9 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Economic_Policy 0.88 0.88 0.88 584 Other 0.86 0.86 0.86 515 accuracy - - 0.87 1099 macro-avg 0.87 0.87 0.87 1099 weighted-avg 0.87 0.87 0.87 1099 ``` --- layout: model title: English asr_wav2vec2_base_timit_demo_colab7_by_sameearif88 TFWav2Vec2ForCTC from sameearif88 author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab7_by_sameearif88 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab7_by_sameearif88` is an English model originally trained by sameearif88.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab7_by_sameearif88_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab7_by_sameearif88_en_4.2.0_3.0_1664020381433.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab7_by_sameearif88_en_4.2.0_3.0_1664020381433.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab7_by_sameearif88", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab7_by_sameearif88", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab7_by_sameearif88| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: English BertForMaskedLM Base Cased model (from VMware) author: John Snow Labs name: bert_embeddings_v_2021_base date: 2022-12-06 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `vbert-2021-base` is an English model originally trained by `VMware`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_v_2021_base_en_4.2.4_3.0_1670327173838.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_v_2021_base_en_4.2.4_3.0_1670327173838.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_v_2021_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_v_2021_base","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_v_2021_base| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|409.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/VMware/vbert-2021-base - https://medium.com/@rickbattle/weaknesses-of-wordpiece-tokenization-eb20e37fec99 --- layout: model title: Fast Neural Machine Translation Model from English to Albanian author: John Snow Labs name: opus_mt_en_sq date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, sq, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `sq` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_sq_xx_2.7.0_2.4_1609166389560.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_sq_xx_2.7.0_2.4_1609166389560.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_sq", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_sq", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.sq').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_sq| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Clean Slang in Texts author: John Snow Labs name: clean_slang date: 2022-06-15 tags: [en, open_source] task: Text Classification language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The clean_slang pipeline is a pretrained pipeline that processes text through a few basic steps (document assembly, tokenization and normalization) to replace slang terms with standard English. It performs most of the common text processing tasks on your dataframe. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/clean_slang_en_4.0.0_3.0_1655323062566.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/clean_slang_en_4.0.0_3.0_1655323062566.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('clean_slang', lang='en') testDoc = ''' yo, what is wrong with ya? ''' result = pipeline.annotate(testDoc) ``` ```scala val pipeline = new PretrainedPipeline("clean_slang", lang = "en") val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs ! "] result_df = nlu.load('en.clean.slang').predict(text) result_df ```
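Conceptually, the normalization this pipeline applies amounts to a dictionary lookup after tokenization. The sketch below illustrates the idea in plain Python; the `SLANG` dictionary is hypothetical, not the lookup table shipped with clean_slang.

```python
# Toy illustration of slang normalization: lowercase, strip punctuation,
# tokenize, then map slang tokens to standard English forms.
import string

# Hypothetical slang dictionary for demonstration only.
SLANG = {"yo": "hey", "ya": "you", "u": "you", "gonna": "going to"}

def clean_slang(text: str) -> list:
    # Remove punctuation and split on whitespace.
    tokens = text.lower().translate(str.maketrans("", "", string.punctuation)).split()
    # Replace slang tokens; leave everything else untouched.
    return [SLANG.get(tok, tok) for tok in tokens]

print(clean_slang("yo, what is wrong with ya?"))
# -> ['hey', 'what', 'is', 'wrong', 'with', 'you']
```

The real pipeline does this with a `NormalizerModel` stage inside Spark, so the same transformation scales to a full dataframe.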
## Results ```bash ['hey', 'what', 'is', 'wrong', 'with', 'you'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clean_slang| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|32.1 KB| ## Included Models - DocumentAssembler - TokenizerModel - NormalizerModel --- layout: model title: German asr_exp_w2v2r_xls_r_gender_male_0_female_10_s922 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_exp_w2v2r_xls_r_gender_male_0_female_10_s922 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_gender_male_0_female_10_s922` is a German model originally trained by jonatasgrosman. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2r_xls_r_gender_male_0_female_10_s922_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_gender_male_0_female_10_s922_de_4.2.0_3.0_1664120050981.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_gender_male_0_female_10_s922_de_4.2.0_3.0_1664120050981.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2r_xls_r_gender_male_0_female_10_s922', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2r_xls_r_gender_male_0_female_10_s922", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_exp_w2v2r_xls_r_gender_male_0_female_10_s922| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Multilingual DistilBertForQuestionAnswering model (from ZYW) author: John Snow Labs name: distilbert_qa_en_de_vi_zh_es_model date: 2022-06-08 tags: [en, de, vi, zh, es, open_source, distilbert, question_answering, xx] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `en-de-vi-zh-es-model` is a multilingual model originally trained by `ZYW`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_en_de_vi_zh_es_model_xx_4.0.0_3.0_1654728319675.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_en_de_vi_zh_es_model_xx_4.0.0_3.0_1654728319675.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_en_de_vi_zh_es_model","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_en_de_vi_zh_es_model","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("xx.answer_question.distil_bert.vi_zh_es_tuned.by_ZYW").predict("""PUT YOUR QUESTION HERE|||PUT YOUR CONTEXT HERE""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_en_de_vi_zh_es_model| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|505.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ZYW/en-de-vi-zh-es-model --- layout: model title: English asr_wav2vec2_xls_r_timit_tokenizer_base TFWav2Vec2ForCTC from hrdipto author: John Snow Labs name: asr_wav2vec2_xls_r_timit_tokenizer_base date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_timit_tokenizer_base` is an English model originally trained by hrdipto. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_xls_r_timit_tokenizer_base_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_timit_tokenizer_base_en_4.2.0_3.0_1664040291907.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_timit_tokenizer_base_en_4.2.0_3.0_1664040291907.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xls_r_timit_tokenizer_base", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xls_r_timit_tokenizer_base", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xls_r_timit_tokenizer_base| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|349.4 MB| --- layout: model title: Legal Registration Rights Agreement Document Binary Classifier (Longformer) author: John Snow Labs name: legclf_registration_rights_agreement date: 2022-12-18 tags: [en, legal, classification, licensed, document, longformer, registration, rights, agreement, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_registration_rights_agreement` model is a Longformer Document Classifier used to classify if the document belongs to the class `registration-rights-agreement` or not (Binary Classification). Longformers have a limit of 4096 tokens, so only the first 4096 tokens will be taken into account. We have found that for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document without extra leading text, 4096 tokens are enough to perform Document Classification. If your document exceeds 4096 tokens, you can try the following: split it into chunks of 4096 tokens, embed each chunk, and average the embeddings, then train with the averaged version, which means the whole document is taken into account.
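The chunk-and-average strategy for documents longer than the 4096-token Longformer window can be sketched in plain Python. This is a minimal illustration of the idea only; the `embed` argument is a stand-in for any sentence-embedding call, and these helper names are illustrative, not part of the Spark NLP API.

```python
# Sketch: split a long token sequence into 4096-token windows, embed each
# window, and average the vectors so the whole document contributes to the
# final representation used for classification.
from typing import Callable, List

import numpy as np

MAX_TOKENS = 4096  # Longformer input limit

def chunk_tokens(tokens: List[str], size: int = MAX_TOKENS) -> List[List[str]]:
    """Split a token list into consecutive windows of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def document_embedding(tokens: List[str],
                       embed: Callable[[List[str]], np.ndarray]) -> np.ndarray:
    """Embed each chunk with `embed` and average the resulting vectors."""
    chunks = chunk_tokens(tokens)
    return np.mean([embed(chunk) for chunk in chunks], axis=0)
```

Training the classifier on these averaged vectors means no part of a long document is silently truncated, at the cost of blurring together the chunks' individual signals.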
## Predicted Entities `registration-rights-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_registration_rights_agreement_en_1.0.0_3.0_1671393670292.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_registration_rights_agreement_en_1.0.0_3.0_1671393670292.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_registration_rights_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[registration-rights-agreement]| |[other]| |[other]| |[registration-rights-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_registration_rights_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.4 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.98 0.99 0.98 205 registration-rights-agreement 0.98 0.95 0.97 108 accuracy - - 0.98 313 macro-avg 0.98 0.97 0.98 313 weighted-avg 0.98 0.98 0.98 313 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from holtin) author: John Snow Labs name: distilbert_qa_base_uncased_holtin_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-holtin-finetuned-squad` is an English model originally trained by `holtin`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_holtin_finetuned_squad_en_4.3.0_3.0_1672773933566.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_holtin_finetuned_squad_en_4.3.0_3.0_1672773933566.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_holtin_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_holtin_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_holtin_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/holtin/distilbert-base-uncased-holtin-finetuned-squad --- layout: model title: English DistilBertForQuestionAnswering Cased model (from Shushant) author: John Snow Labs name: distilbert_qa_contamination_try date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ContaminationQuestionAnsweringTry` is an English model originally trained by `Shushant`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_contamination_try_en_4.2.5_3.0_1672750831060.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_contamination_try_en_4.2.5_3.0_1672750831060.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_contamination_try","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_contamination_try","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_contamination_try| |Compatibility:|Spark NLP 4.2.5+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Shushant/ContaminationQuestionAnsweringTry --- layout: model title: English T5ForConditionalGeneration Large Cased model (from google) author: John Snow Labs name: t5_efficient_large_el2 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-large-el2` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_large_el2_en_4.3.0_3.0_1675115770479.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_large_el2_en_4.3.0_3.0_1675115770479.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_large_el2","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_large_el2","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_large_el2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|948.2 MB| ## References - https://huggingface.co/google/t5-efficient-large-el2 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Legal Limited Partnership Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_limited_partnership_agreement date: 2022-11-10 tags: [en, legal, classification, agreement, limited_partnership, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_limited_partnership_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `limited-partnership-agreement` or not (Binary Classification). Longformers have a limit of 4096 tokens, so only the first 4096 tokens will be taken into account. We have found that for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document without extra leading text, 4096 tokens are enough to perform Document Classification. If not, let us know and we can take another approach for you: splitting the document into chunks of 4096 tokens and averaging the embeddings, then training with the averaged version, which means the whole document is taken into account. In theory, though, this should not be required.
## Predicted Entities `limited-partnership-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_limited_partnership_agreement_en_1.0.0_3.0_1668116513540.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_limited_partnership_agreement_en_1.0.0_3.0_1668116513540.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_limited_partnership_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[limited-partnership-agreement]| |[other]| |[other]| |[limited-partnership-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_limited_partnership_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.4 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support limited-partnership-agreement 1.00 1.00 1.0 24 other 1.00 1.00 1.0 31 accuracy - - 1.0 55 macro-avg 1.00 1.00 1.0 55 weighted-avg 1.00 1.00 1.0 55 ``` --- layout: model title: Relation Extraction between Biomarkers and Results author: John Snow Labs name: re_oncology_biomarker_result_wip date: 2022-09-27 tags: [licensed, clinical, oncology, en, relation_extraction, test, biomarker] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: RelationExtractionModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This relation extraction model links Biomarker and Oncogene extractions to their corresponding Biomarker_Result extractions. 
## Predicted Entities `is_finding_of`, `O` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_oncology_biomarker_result_wip_en_4.0.0_3.0_1664291278366.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_oncology_biomarker_result_wip_en_4.0.0_3.0_1664291278366.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") re_model = RelationExtractionModel.pretrained("re_oncology_biomarker_result_wip", "en", "clinical/models") \ .setInputCols(["embeddings", "pos_tags", "ner_chunk", "dependencies"]) \ .setOutputCol("relation_extraction") \ .setRelationPairs(['Biomarker-Biomarker_Result', 'Biomarker_Result-Biomarker', 'Oncogene-Biomarker_Result', 'Biomarker_Result-Oncogene']) \ .setMaxSyntacticDistance(10) pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_model]) data = spark.createDataFrame([["Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. 
The test was positive for ER and PR, and negative for HER2."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentence", "pos_tags", "token")) .setOutputCol("dependencies") val re_model = RelationExtractionModel.pretrained("re_oncology_biomarker_result_wip", "en", "clinical/models") .setInputCols(Array("embeddings", "pos_tags", "ner_chunk", "dependencies")) .setOutputCol("relation_extraction") .setRelationPairs(Array("Biomarker-Biomarker_Result", "Biomarker_Result-Biomarker", "Oncogene-Biomarker_Result", "Biomarker_Result-Oncogene")) .setMaxSyntacticDistance(10) val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_model)) val data = Seq("Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. 
The test was positive for ER and PR, and negative for HER2.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.oncology_biomarker_result").predict("""Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.""") ```
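For post-processing, the relation rows can be filtered on the confidence column before use. A plain-Python sketch (the dicts below are hypothetical rows mirroring the model's output fields, not the actual Spark schema):

```python
# Hypothetical relation rows mirroring the model's output fields.
relations = [
    {"chunk1": "negative", "chunk2": "thyroid transcription factor-1",
     "relation": "is_finding_of", "confidence": 0.99925953},
    {"chunk1": "positive", "chunk2": "HER2", "relation": "O", "confidence": 0.96865135},
    {"chunk1": "negative", "chunk2": "HER2", "relation": "is_finding_of", "confidence": 0.99124444},
]

def keep_relations(rows, min_confidence=0.5):
    """Drop the 'O' (no-relation) label and low-confidence predictions."""
    return [r for r in rows if r["relation"] != "O" and r["confidence"] >= min_confidence]

kept = keep_relations(relations)
```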
## Results ```bash chunk1 entity1 chunk2 entity2 relation confidence negative Biomarker_Result thyroid transcription factor-1 Biomarker is_finding_of 0.99925953 negative Biomarker_Result napsin Biomarker is_finding_of 0.98856175 positive Biomarker_Result ER Biomarker is_finding_of 0.9833266 positive Biomarker_Result PR Biomarker is_finding_of 0.94771445 positive Biomarker_Result HER2 Oncogene O 0.96865135 ER Biomarker negative Biomarker_Result O 0.998276 PR Biomarker negative Biomarker_Result O 0.98595536 negative Biomarker_Result HER2 Oncogene is_finding_of 0.99124444 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_oncology_biomarker_result_wip| |Type:|re| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]| |Output Labels:|[relations]| |Language:|en| |Size:|265.8 KB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label recall precision f1 O 0.88 0.95 0.91 is_finding_of 0.95 0.89 0.92 macro-avg 0.92 0.92 0.92 ``` --- layout: model title: Portuguese BertForQuestionAnswering model (from mrm8488) author: John Snow Labs name: bert_qa_bert_base_portuguese_cased_finetuned_squad_v1_pt_mrm8488 date: 2022-06-03 tags: [pt, open_source, question_answering, bert] task: Question Answering language: pt edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-portuguese-cased-finetuned-squad-v1-pt` is a Portuguese model originally trained by `mrm8488`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_portuguese_cased_finetuned_squad_v1_pt_mrm8488_pt_4.0.0_3.0_1654249690597.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_portuguese_cased_finetuned_squad_v1_pt_mrm8488_pt_4.0.0_3.0_1654249690597.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_portuguese_cased_finetuned_squad_v1_pt_mrm8488","pt") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_portuguese_cased_finetuned_squad_v1_pt_mrm8488","pt") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("pt.answer_question.squad.bert.base_cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_portuguese_cased_finetuned_squad_v1_pt_mrm8488| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|pt| |Size:|406.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/bert-base-portuguese-cased-finetuned-squad-v1-pt --- layout: model title: Legal Binding effect Clause Binary Classifier author: John Snow Labs name: legclf_binding_effect_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `binding-effect` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
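Paragraph splitting by multiline, the first technique listed above, can be approximated with a regular expression. A minimal sketch (for anything more robust, see the linked tutorial):

```python
import re

def split_paragraphs(text: str):
    """Split a document on blank lines (one or more consecutive empty lines)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("BINDING EFFECT. This Agreement shall be binding upon the parties hereto.\n\n"
       "GOVERNING LAW. This Agreement shall be governed by the laws of Delaware.")
clauses = split_paragraphs(doc)
```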
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `binding-effect` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_binding_effect_clause_en_1.0.0_3.2_1660123276697.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_binding_effect_clause_en_1.0.0_3.2_1660123276697.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_binding_effect_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
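As noted in the description, several binary clause classifiers can be run over the same text and their predictions collected into a single clause profile. A hedged plain-Python sketch (the lambda classifiers are stand-ins for fitted pipelines, which are not shown):

```python
# Each entry stands in for a fitted binary clause-classifier pipeline; here they
# are plain callables returning the predicted label for a clause text.
classifiers = {
    "binding-effect": lambda text: "binding-effect" if "binding" in text.lower() else "other",
    "disputes": lambda text: "disputes" if "dispute" in text.lower() else "other",
}

def clause_profile(text: str) -> dict:
    """Map each clause type to True/False depending on its classifier's label."""
    return {name: clf(text) == name for name, clf in classifiers.items()}

profile = clause_profile("This Agreement shall be binding upon the parties hereto.")
```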
## Results ```bash +-------+ | result| +-------+ |[binding-effect]| |[other]| |[other]| |[binding-effect]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_binding_effect_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support binding-effect 0.98 0.98 0.98 46 other 0.99 0.99 0.99 90 accuracy - - 0.99 136 macro-avg 0.98 0.98 0.98 136 weighted-avg 0.99 0.99 0.99 136 ``` --- layout: model title: English asr_wav2vec2_common_voice_accents_indian TFWav2Vec2ForCTC from willcai author: John Snow Labs name: asr_wav2vec2_common_voice_accents_indian date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_common_voice_accents_indian` is an English model originally trained by willcai. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_common_voice_accents_indian_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_common_voice_accents_indian_en_4.2.0_3.0_1664105851293.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_common_voice_accents_indian_en_4.2.0_3.0_1664105851293.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_common_voice_accents_indian", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_common_voice_accents_indian", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
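The snippets above assume an existing `audioDf`; `AudioAssembler` consumes a column of floating-point audio samples. A minimal sketch of producing such an array from a synthetic 16 kHz mono tone (the Spark lines are commented out because no `SparkSession` is set up here):

```python
import math

SAMPLE_RATE = 16_000  # Wav2Vec2 models are trained on 16 kHz audio

def tone(freq_hz: float, seconds: float, rate: int = SAMPLE_RATE):
    """Generate a mono sine wave as the list of floats AudioAssembler expects."""
    n = int(seconds * rate)
    return [math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n)]

samples = tone(440.0, 0.5)
# With a SparkSession available, the DataFrame would be built along these lines:
# audioDf = spark.createDataFrame([(samples,)], ["audio_content"])
```

In practice the floats would come from decoding a real audio file (for example with `librosa` or `soundfile`) rather than a generated tone.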
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_common_voice_accents_indian| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: English BertForQuestionAnswering model (from augustoortiz) author: John Snow Labs name: bert_qa_bert_finetuned_squad2 date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad2` is an English model originally trained by `augustoortiz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_squad2_en_4.0.0_3.0_1654536008933.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_finetuned_squad2_en_4.0.0_3.0_1654536008933.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_finetuned_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_finetuned_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.by_augustoortiz").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_finetuned_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/augustoortiz/bert-finetuned-squad2 --- layout: model title: German BertForQuestionAnswering model (from deutsche-telekom) author: John Snow Labs name: bert_qa_bert_multi_english_german_squad2 date: 2022-06-02 tags: [de, open_source, question_answering, bert] task: Question Answering language: de edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-multi-english-german-squad2` is a German model originally trained by `deutsche-telekom`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_multi_english_german_squad2_de_4.0.0_3.0_1654184577781.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_multi_english_german_squad2_de_4.0.0_3.0_1654184577781.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_multi_english_german_squad2","de") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_multi_english_german_squad2","de") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("de.answer_question.squadv2.bert").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_multi_english_german_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|de| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/deutsche-telekom/bert-multi-english-german-squad2 - https://rajpurkar.github.io/SQuAD-explorer/ - https://github.com/google-research/bert/blob/master/multilingual.md --- layout: model title: Explain Document pipeline for Finnish (explain_document_lg) author: John Snow Labs name: explain_document_lg date: 2021-03-23 tags: [open_source, finnish, explain_document_lg, pipeline, fi] supported: true task: [Named Entity Recognition, Lemmatization] language: fi edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_lg is a pretrained pipeline that performs basic text processing steps and recognizes entities. It covers most of the common text processing tasks on your DataFrame. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_fi_3.0.0_3.0_1616528814552.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_lg_fi_3.0.0_3.0_1616528814552.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_lg', lang = 'fi') annotations = pipeline.fullAnnotate("Hei John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_lg", lang = "fi") val result = pipeline.fullAnnotate("Hei John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hei John Snow Labs! "] result_df = nlu.load('fi.explain.lg').predict(text) result_df ```
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:-------------------------|:------------------------|:---------------------------------|:---------------------------------|:------------------------------------|:-----------------------------|:---------------------------------|:--------------------| | 0 | ['Hei John Snow Labs! '] | ['Hei John Snow Labs!'] | ['Hei', 'John', 'Snow', 'Labs!'] | ['hei', 'John', 'Snow', 'Labs!'] | ['INTJ', 'PROPN', 'PROPN', 'PROPN'] | [[0.0639619976282119,.,...]] | ['O', 'B-PRO', 'I-PRO', 'I-PRO'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_lg| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| --- layout: model title: German asr_wav2vec2_large_xlsr_53_german_cv8 TFWav2Vec2ForCTC from oliverguhr author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_german_cv8 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_large_xlsr_53_german_cv8` is a German model originally trained by oliverguhr. 
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_german_cv8_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_german_cv8_de_4.2.0_3.0_1664102524467.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_german_cv8_de_4.2.0_3.0_1664102524467.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_german_cv8', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_german_cv8", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_german_cv8| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English asr_wynehills_mimi_ASR TFWav2Vec2ForCTC from mimi author: John Snow Labs name: asr_wynehills_mimi_ASR date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wynehills_mimi_ASR` is an English model originally trained by mimi. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wynehills_mimi_ASR_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wynehills_mimi_ASR_en_4.2.0_3.0_1664097187391.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wynehills_mimi_ASR_en_4.2.0_3.0_1664097187391.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wynehills_mimi_ASR", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wynehills_mimi_ASR", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wynehills_mimi_ASR| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|354.9 MB| --- layout: model title: RxNorm to MeSH Code Mapping author: John Snow Labs name: rxnorm_mesh_mapping date: 2021-05-04 tags: [rxnorm, mesh, en, licensed] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.0.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps RxNorm codes to MeSH codes without using any text data. Feed it whitespace-delimited RxNorm codes and it will return the corresponding MeSH codes as a list. If there is no mapping, the original code is returned unchanged. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_mesh_mapping_en_3.0.2_3.0_1620134962818.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_mesh_mapping_en_3.0.2_3.0_1620134962818.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("rxnorm_mesh_mapping","en","clinical/models") pipeline.annotate("1191 6809 47613") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("rxnorm_mesh_mapping","en","clinical/models") val result = pipeline.annotate("1191 6809 47613") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.rxnorm.mesh").predict("""1191 6809 47613""") ```
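The fall-through behaviour described above, where unmapped codes are returned unchanged, can be sketched as a plain dictionary lookup. The mapping entries below are an illustrative subset taken from this card's results, not the full vocabulary:

```python
# Illustrative subset of the RxNorm -> MeSH mapping.
RXNORM_TO_MESH = {"1191": "D001241", "6809": "D008687", "47613": "D019355"}

def map_codes(codes: str):
    """Map whitespace-delimited RxNorm codes; keep the original code when unmapped."""
    return [RXNORM_TO_MESH.get(c, c) for c in codes.split()]

# "99999" has no mapping, so it passes through unchanged.
mapped = map_codes("1191 6809 99999")
```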
## Results ```bash {'rxnorm': ['1191', '6809', '47613'], 'mesh': ['D001241', 'D008687', 'D019355']} Note: | RxNorm | Details | | ---------- | -------------------:| | 1191 | aspirin | | 6809 | metformin | | 47613 | calcium citrate | | MeSH | Details | | ---------- | -------------------:| | D001241 | Aspirin | | D008687 | Metformin | | D019355 | Calcium Citrate | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|rxnorm_mesh_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.0.2+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - TokenizerModel - LemmatizerModel - Finisher --- layout: model title: English DistilBertForQuestionAnswering model (from mcurmei) Long author: John Snow Labs name: distilbert_qa_single_label_N_max_long_training date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `single_label_N_max_long_training` is an English model originally trained by `mcurmei`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_single_label_N_max_long_training_en_4.0.0_3.0_1654728707587.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_single_label_N_max_long_training_en_4.0.0_3.0_1654728707587.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_single_label_N_max_long_training","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_single_label_N_max_long_training","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.single_label_n_max_long_training.by_mcurmei").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
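The nlu one-liner above joins the question and its context with a `|||` separator. A small helper for building that input string, shown as a plain-Python sketch:

```python
def to_nlu_qa_input(question, context):
    """Join a question and its context with the '|||' separator used by nlu QA models."""
    return f"{question}|||{context}"

print(to_nlu_qa_input("What is my name?", "My name is Clara and I live in Berkeley."))
# What is my name?|||My name is Clara and I live in Berkeley.
```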
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_single_label_N_max_long_training| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mcurmei/single_label_N_max_long_training --- layout: model title: Legal Disputes Clause Binary Classifier author: John Snow Labs name: legclf_disputes_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `disputes` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
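Paragraph splitting by multiline, the first technique listed above, can be sketched in plain Python. The Legal NLP tutorial linked above covers the library's own splitting utilities; this regex version is only an illustration:

```python
import re

def split_paragraphs(text):
    """Split a document into paragraphs on blank lines (multiline splitting)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "ARTICLE I\nDefinitions.\n\nARTICLE II\nDisputes."
print(split_paragraphs(doc))
# ['ARTICLE I\nDefinitions.', 'ARTICLE II\nDisputes.']
```

Each resulting piece can then be fed to the classifier as its own row, giving the model clause-sized context instead of a whole contract.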
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `disputes` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_disputes_clause_en_1.0.0_3.2_1660122366920.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_disputes_clause_en_1.0.0_3.2_1660122366920.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_disputes_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +----------+ |result| +----------+ |[disputes]| |[other]| |[other]| |[disputes]| +----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_disputes_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support disputes 0.96 0.77 0.86 31 other 0.93 0.99 0.96 99 accuracy - - 0.94 130 macro-avg 0.95 0.88 0.91 130 weighted-avg 0.94 0.94 0.94 130 ``` --- layout: model title: Legal Business Expenses Clause Binary Classifier author: John Snow Labs name: legclf_business_expenses_clause date: 2023-01-29 tags: [en, legal, classification, business, expenses, clauses, business_expenses, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `business-expenses` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc.
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `business-expenses`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_business_expenses_clause_en_1.0.0_3.0_1675005882013.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_business_expenses_clause_en_1.0.0_3.0_1675005882013.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_business_expenses_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
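Since the sentence embeddings accept at most 512 tokens, longer clauses can be pre-chunked before classification. A rough whitespace-token sketch (the model's own tokenizer may count tokens differently, so treat 512 here as an approximate budget):

```python
def chunk_by_tokens(text, max_tokens=512):
    """Split text into pieces of at most max_tokens whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

pieces = chunk_by_tokens("word " * 1100)
print([len(p.split()) for p in pieces])  # [512, 512, 76]
```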
## Results ```bash +-------------------+ |result| +-------------------+ |[business-expenses]| |[other]| |[other]| |[business-expenses]| +-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_business_expenses_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support business-expenses 0.82 0.96 0.88 24 other 0.97 0.87 0.92 39 accuracy - - 0.90 63 macro-avg 0.90 0.92 0.90 63 weighted-avg 0.91 0.90 0.91 63 ``` --- layout: model title: English BertForQuestionAnswering model (from mirbostani) author: John Snow Labs name: bert_qa_bert_base_uncased_finetuned_newsqa date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-finetuned-newsqa` is an English model originally trained by `mirbostani`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_finetuned_newsqa_en_4.0.0_3.0_1654180972768.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_finetuned_newsqa_en_4.0.0_3.0_1654180972768.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_finetuned_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_finetuned_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.bert.base_uncased.by_mirbostani").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_finetuned_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/mirbostani/bert-base-uncased-finetuned-newsqa - https://github.com/Maluuba/newsqa --- layout: model title: English BertForQuestionAnswering Cased model (from spasis) author: John Snow Labs name: bert_qa_spasis_finetuned_squad_accelera date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad-accelerate` is an English model originally trained by `spasis`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spasis_finetuned_squad_accelera_en_4.0.0_3.0_1657187066875.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spasis_finetuned_squad_accelera_en_4.0.0_3.0_1657187066875.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spasis_finetuned_squad_accelera","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spasis_finetuned_squad_accelera","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spasis_finetuned_squad_accelera| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/spasis/bert-finetuned-squad-accelerate --- layout: model title: Legal Acquisition Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_acquisition_agreement_bert date: 2023-02-02 tags: [en, legal, classification, acquisition, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_acquisition_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `acquisition-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `acquisition-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_acquisition_agreement_bert_en_1.0.0_3.0_1675360092921.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_acquisition_agreement_bert_en_1.0.0_3.0_1675360092921.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_acquisition_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
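Benchmarking tables for these classifiers report macro and weighted averages; the weighted average is simply a support-weighted mean of the per-label scores. A small sketch using the acquisition-agreement per-label F1 scores and supports from the benchmarking table below:

```python
def weighted_avg(scores):
    """Support-weighted average of per-label scores, given [(score, support), ...]."""
    total = sum(support for _, support in scores)
    return sum(score * support for score, support in scores) / total

# F1 of 0.91 over 33 examples and 0.96 over 73 examples.
print(round(weighted_avg([(0.91, 33), (0.96, 73)]), 2))  # 0.94
```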
## Results ```bash +-----------------------+ |result| +-----------------------+ |[acquisition-agreement]| |[other]| |[other]| |[acquisition-agreement]| +-----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_acquisition_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support acquisition-agreement 0.91 0.91 0.91 33 other 0.96 0.96 0.96 73 accuracy - - 0.94 106 macro-avg 0.93 0.93 0.93 106 weighted-avg 0.94 0.94 0.94 106 ``` --- layout: model title: English asr_vakyansh_wav2vec2_indian_english_enm_700 TFWav2Vec2ForCTC from Harveenchadha author: John Snow Labs name: pipeline_asr_vakyansh_wav2vec2_indian_english_enm_700 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_vakyansh_wav2vec2_indian_english_enm_700` is an English model originally trained by Harveenchadha.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_vakyansh_wav2vec2_indian_english_enm_700_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_vakyansh_wav2vec2_indian_english_enm_700_en_4.2.0_3.0_1664094816417.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_vakyansh_wav2vec2_indian_english_enm_700_en_4.2.0_3.0_1664094816417.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_vakyansh_wav2vec2_indian_english_enm_700', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_vakyansh_wav2vec2_indian_english_enm_700", lang = "en") val annotations = pipeline.transform(audioDF) ```
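The `audioDF` above is assumed to hold arrays of float samples. If your audio arrives as signed 16-bit PCM integers, a minimal plain-Python sketch of normalizing them to the [-1.0, 1.0] float range Wav2Vec2 models expect (resampling to the model's training rate, typically 16 kHz, is a separate step not shown here):

```python
def pcm16_to_float(samples):
    """Normalize signed 16-bit PCM samples (-32768..32767) to floats in [-1.0, 1.0]."""
    return [s / 32768.0 for s in samples]

print(pcm16_to_float([0, 16384, -32768]))  # [0.0, 0.5, -1.0]
```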
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_vakyansh_wav2vec2_indian_english_enm_700| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|227.6 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English asr_wav2vec2_base_timit_demo_colab10 TFWav2Vec2ForCTC from sameearif88 author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab10 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab10` is an English model originally trained by sameearif88. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab10_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab10_en_4.2.0_3.0_1664019640274.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab10_en_4.2.0_3.0_1664019640274.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab10", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab10", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab10| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: French CamemBert Embeddings (from elliotsmith) author: John Snow Labs name: camembert_embeddings_elliotsmith_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `elliotsmith`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_elliotsmith_generic_model_fr_3.4.4_3.0_1653988265199.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_elliotsmith_generic_model_fr_3.4.4_3.0_1653988265199.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_elliotsmith_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_elliotsmith_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
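A common downstream use of token embeddings like these is similarity comparison. A self-contained cosine-similarity sketch over plain float vectors (in practice the vectors would come from the `embeddings` output column above):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0
```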
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_elliotsmith_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/elliotsmith/dummy-model --- layout: model title: Extract temporal relations among clinical events (ReDL) author: John Snow Labs name: redl_temporal_events_biobert date: 2023-01-15 tags: [relation_extraction, en, clinical, licensed, tensorflow] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract relations between clinical events in terms of time: whether one event occurred before, after, or overlapping another event. ## Predicted Entities `AFTER`, `BEFORE`, `OVERLAP` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CLINICAL_EVENTS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_temporal_events_biobert_en_4.2.4_3.0_1673778147598.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_temporal_events_biobert_en_4.2.4_3.0_1673778147598.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = sparknlp.annotators.Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") words_embedder = WordEmbeddingsModel() \ .pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel.pretrained("ner_events_clinical", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverterInternal() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel() \ .pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks") re_model = RelationExtractionDLModel()\ .pretrained("redl_temporal_events_biobert", "en", "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) text = "She is diagnosed with cancer in 1991. 
Then she was admitted to Mayo Clinic in May 2000 and discharged in October 2001" data = spark.createDataFrame([[text]]).toDF("text") p_model = pipeline.fit(data) result = p_model.transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_events_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") // The dataset this model is trained to is sentence-wise. // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. 
val re_model = RelationExtractionDLModel() .pretrained("redl_temporal_events_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("""She is diagnosed with cancer in 1991. Then she was admitted to Mayo Clinic in May 2000 and discharged in October 2001""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.temporal_events").predict("""She is diagnosed with cancer in 1991. Then she was admitted to Mayo Clinic in May 2000 and discharged in October 2001""") ```
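The pipeline above applies `setPredictionThreshold(0.5)` at inference time; the same kind of confidence filtering can be done post hoc on extracted relations. A plain-Python sketch over rows shaped like the results table (the row dicts below are a hypothetical shape for illustration):

```python
def filter_relations(rows, threshold=0.5):
    """Keep only relation rows whose confidence meets the threshold."""
    return [r for r in rows if r["confidence"] >= threshold]

# Hypothetical extracted-relation rows, for illustration only.
rows = [
    {"relation": "BEFORE", "chunk1": "diagnosed", "chunk2": "cancer", "confidence": 0.78},
    {"relation": "OVERLAP", "chunk1": "Mayo Clinic", "chunk2": "October 2001", "confidence": 0.54},
    {"relation": "AFTER", "chunk1": "admitted", "chunk2": "1991", "confidence": 0.31},
]
print([r["relation"] for r in filter_relations(rows)])  # ['BEFORE', 'OVERLAP']
```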
## Results ```bash +--------+-------------+-------------+-----------+-----------+-------------+-------------+-----------+------------+----------+ |relation| entity1|entity1_begin|entity1_end| chunk1| entity2|entity2_begin|entity2_end| chunk2|confidence| +--------+-------------+-------------+-----------+-----------+-------------+-------------+-----------+------------+----------+ | BEFORE| OCCURRENCE| 7| 15| diagnosed| PROBLEM| 22| 27| cancer|0.78168863| | OVERLAP| PROBLEM| 22| 27| cancer| DATE| 32| 35| 1991| 0.8492274| | AFTER| OCCURRENCE| 51| 58| admitted|CLINICAL_DEPT| 63| 73| Mayo Clinic|0.85629463| | BEFORE| OCCURRENCE| 51| 58| admitted| OCCURRENCE| 91| 100| discharged| 0.6843513| | OVERLAP|CLINICAL_DEPT| 63| 73|Mayo Clinic| DATE| 78| 85| May 2000| 0.7844673| | BEFORE|CLINICAL_DEPT| 63| 73|Mayo Clinic| OCCURRENCE| 91| 100| discharged|0.60411876| | OVERLAP|CLINICAL_DEPT| 63| 73|Mayo Clinic| DATE| 105| 116|October 2001| 0.540761| | BEFORE| DATE| 78| 85| May 2000| OCCURRENCE| 91| 100| discharged| 0.6042761| | OVERLAP| DATE| 78| 85| May 2000| DATE| 105| 116|October 2001|0.64867175| | BEFORE| OCCURRENCE| 91| 100| discharged| DATE| 105| 116|October 2001| 0.5302478| +--------+-------------+-------------+-----------+-----------+-------------+-------------+-----------+------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_temporal_events_biobert| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|401.7 MB| ## References Trained on temporal clinical events benchmark dataset. ## Benchmarking ```bash label Recall Precision F1 Support AFTER 0.332 0.655 0.440 2123 BEFORE 0.868 0.908 0.887 13817 OVERLAP 0.887 0.733 0.802 7860 Avg. 
0.695 0.765 0.710 - ``` --- layout: model title: Recognize Entities DL pipeline for English - Small author: John Snow Labs name: onto_recognize_entities_sm date: 2021-03-22 tags: [open_source, english, onto_recognize_entities_sm, pipeline, en] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The onto_recognize_entities_sm is a pretrained pipeline that performs the basic text processing steps, covering most of the common text processing tasks on your DataFrame. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_sm_en_3.0.0_3.0_1616441224446.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_sm_en_3.0.0_3.0_1616441224446.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('onto_recognize_entities_sm', lang = 'en') annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("onto_recognize_entities_sm", lang = "en") val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs ! "] result_df = nlu.load('en.ner.onto.sm').predict(text) result_df ```
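The `entities` column shown in the Results below is produced by the pipeline's NER-converter stage, which merges token-level BIO tags into entity chunks. A simplified, dependency-free sketch of that merging logic (the real annotator also tracks character offsets):

```python
def bio_to_chunks(tokens, tags):
    """Merge BIO-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # an "O" tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Hello", "from", "John", "Snow", "Labs", "!"]
tags = ["O", "O", "B-ORG", "I-ORG", "I-ORG", "O"]
print(bio_to_chunks(tokens, tags))  # [('John Snow Labs', 'ORG')]
```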
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:-----------------------------|:-------------------------------------------|:-------------------| | 0 | ['Hello from John Snow Labs ! '] | ['Hello from John Snow Labs !'] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | [[0.2668800055980682,.,...]] | ['O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O'] | ['John Snow Labs'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_recognize_entities_sm| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| --- layout: model title: English BertForQuestionAnswering model (from deepset) author: John Snow Labs name: bert_qa_bert_medium_squad2_distilled date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-medium-squad2-distilled` is an English model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_medium_squad2_distilled_en_4.0.0_3.0_1654183742024.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_medium_squad2_distilled_en_4.0.0_3.0_1654183742024.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_medium_squad2_distilled","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_medium_squad2_distilled","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.distilled_medium").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
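Under the hood, extractive QA models like this one score every token of the context as a potential answer start and end, then pick the highest-scoring valid span. A toy illustration of that span-selection step (the scores below are made up, not produced by the model):

```python
def best_span(start_scores, end_scores, max_len=10):
    """Pick the (start, end) token pair maximizing start+end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            if ss + end_scores[e] > best_score:
                best_score = ss + end_scores[e]
                best = (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Hypothetical per-token logits; a real model would emit these from its QA head.
start_scores = [0.1, 0.2, 0.1, 3.0, 0.1, 0.0, 0.1, 0.1, 0.5, 0.0]
end_scores   = [0.0, 0.1, 0.2, 2.5, 0.1, 0.0, 0.1, 0.1, 0.4, 0.0]
s, e = best_span(start_scores, end_scores)
print(" ".join(tokens[s:e + 1]))  # Clara
```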
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_medium_squad2_distilled| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|154.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/deepset/bert-medium-squad2-distilled - https://github.com/deepset-ai/haystack/discussions - https://deepset.ai - https://twitter.com/deepset_ai - http://www.deepset.ai/jobs - https://haystack.deepset.ai/community/join - https://github.com/deepset-ai/haystack/ - https://deepset.ai/german-bert - https://www.linkedin.com/company/deepset-ai/ - https://github.com/deepset-ai/FARM - https://deepset.ai/germanquad --- layout: model title: English DistilBertForQuestionAnswering Base Cased model (from yashwantk) author: John Snow Labs name: distilbert_qa_yashwantk_base_cased_led_squad_finetuned date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-finetuned-squad` is an English model originally trained by `yashwantk`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_yashwantk_base_cased_led_squad_finetuned_en_4.3.0_3.0_1672766595118.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_yashwantk_base_cased_led_squad_finetuned_en_4.3.0_3.0_1672766595118.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_yashwantk_base_cased_led_squad_finetuned","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_yashwantk_base_cased_led_squad_finetuned","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_yashwantk_base_cased_led_squad_finetuned| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/yashwantk/distilbert-base-cased-distilled-squad-finetuned-squad --- layout: model title: Arabic Bert Embeddings (Arabert model, Covid-19) author: John Snow Labs name: bert_embeddings_arabert_c19 date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `arabert_c19` is an Arabic model originally trained by `moha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_arabert_c19_ar_3.4.2_3.0_1649678530708.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_arabert_c19_ar_3.4.2_3.0_1649678530708.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_arabert_c19","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_arabert_c19","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.arabert_c19").predict("""أنا أحب شرارة NLP""") ```
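The `embeddings` column produced above holds one dense vector per token; downstream components typically compare such vectors with cosine similarity. A dependency-free sketch with tiny made-up vectors (real BERT embeddings are 768-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional token embeddings for illustration only.
v1 = [0.2, 0.1, -0.4, 0.8]
v2 = [0.25, 0.05, -0.35, 0.75]
v3 = [-0.7, 0.6, 0.1, -0.2]

print(round(cosine_similarity(v1, v2), 3))  # close to 1.0: similar tokens
print(round(cosine_similarity(v1, v3), 3))  # negative: dissimilar tokens
```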
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_arabert_c19| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|507.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/moha/arabert_c19 - https://arxiv.org/pdf/2105.03143.pdf - https://arxiv.org/abs/2004.04315 - https://github.com/MohamedHadjAmeur --- layout: model title: Dutch NER Pipeline author: John Snow Labs name: bert_token_classifier_dutch_udlassy_ner_pipeline date: 2022-04-19 tags: [open_source, ner, dutch, token_classifier, bert, treatment, nl] task: Named Entity Recognition language: nl edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_dutch_udlassy_ner](https://nlp.johnsnowlabs.com/2021/12/08/bert_token_classifier_dutch_udlassy_ner_nl.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_dutch_udlassy_ner_pipeline_nl_3.4.1_3.0_1650374307543.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_dutch_udlassy_ner_pipeline_nl_3.4.1_3.0_1650374307543.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("bert_token_classifier_dutch_udlassy_ner_pipeline", lang = "nl") pipeline.annotate("Mijn naam is Peter Fergusson. Ik woon sinds oktober 2011 in New York en werk 5 jaar bij Tesla Motor.") ``` ```scala val pipeline = new PretrainedPipeline("bert_token_classifier_dutch_udlassy_ner_pipeline", lang = "nl") pipeline.annotate("Mijn naam is Peter Fergusson. Ik woon sinds oktober 2011 in New York en werk 5 jaar bij Tesla Motor.") ```
## Results ```bash +---------------+---------+ |chunk |ner_label| +---------------+---------+ |Peter Fergusson|PERSON | |oktober 2011 |DATE | |New York |GPE | |5 jaar |DATE | |Tesla Motor |ORG | +---------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_dutch_udlassy_ner_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|nl| |Size:|407.9 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertForTokenClassification - NerConverter - Finisher --- layout: model title: SDOH Alcohol Usage For Binary Classification author: John Snow Labs name: genericclassifier_sdoh_alcohol_usage_binary_sbiobert_cased_mli date: 2023-01-14 tags: [en, licensed, generic_classifier, sdoh, alcohol, clinical] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true recommended: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Generic Classifier model detects alcohol use in clinical notes and was trained using the GenericClassifierApproach annotator. `Present:` if the patient is a current consumer of alcohol, or consumed alcohol in the past and has quit. `Never:` if the patient has never consumed alcohol. `None:` if there is no related text. ## Predicted Entities `Present`, `Never`, `None` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_alcohol_usage_binary_sbiobert_cased_mli_en_4.2.4_3.0_1673699002618.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_alcohol_usage_binary_sbiobert_cased_mli_en_4.2.4_3.0_1673699002618.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") features_asm = FeaturesAssembler()\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("features") generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_alcohol_usage_binary_sbiobert_cased_mli", 'en', 'clinical/models')\ .setInputCols(["features"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, sentence_embeddings, features_asm, generic_classifier ]) text_list = ["Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. He uses alcohol and cigarettes", "Employee in neuro departmentin at the Center Hospital 18. Widower since 2001. Current smoker since 20 years. No EtOH or illicits.", "Patient smoked 4 ppd x 37 years, quitting 22 years ago. 
He is widowed, lives alone, has three children."] df = spark.createDataFrame(text_list, StringType()).toDF("text") result = pipeline.fit(df).transform(df) result.select("text", "class.result").show(truncate=100) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence_embeddings") val features_asm = new FeaturesAssembler() .setInputCols("sentence_embeddings") .setOutputCol("features") val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_alcohol_usage_binary_sbiobert_cased_mli", "en", "clinical/models") .setInputCols("features") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_embeddings, features_asm, generic_classifier)) val data = Seq("Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. He uses alcohol and cigarettes.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.generic.sdoh_alchol_binary_sbiobert_cased").predict("""Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. He uses alcohol and cigarettes""") ```
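The Benchmarking section below reports macro and weighted averages across the three labels. The weighted average is simply the support-weighted mean of the per-label scores, which can be verified directly from the table's figures:

```python
# Support-weighted average of per-label F1 scores, using the figures from
# this card's Benchmarking section.
scores = {          # label: (f1, support)
    "Never":   (0.85, 523),
    "None":    (0.81, 341),
    "Present": (0.87, 516),
}

def weighted_avg_f1(scores):
    """Average F1 weighted by each label's support (number of examples)."""
    total = sum(support for _, support in scores.values())
    return sum(f1 * support for f1, support in scores.values()) / total

print(round(weighted_avg_f1(scores), 2))  # 0.85, matching the reported weighted-avg
```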
## Results ```bash +----------------------------------------------------------------------------------------------------+---------+ | text| result| +----------------------------------------------------------------------------------------------------+---------+ |Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 2...|[Present]| |Employee in neuro departmentin at the Center Hospital 18. Widower since 2001. Current smoker sinc...| [Never]| |Patient smoked 4 ppd x 37 years, quitting 22 years ago. He is widowed, lives alone, has three chi...| [None]| +----------------------------------------------------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|genericclassifier_sdoh_alcohol_usage_binary_sbiobert_cased_mli| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[features]| |Output Labels:|[prediction]| |Language:|en| |Size:|3.4 MB| ## Benchmarking ```bash label precision recall f1-score support Never 0.85 0.86 0.85 523 None 0.81 0.82 0.81 341 Present 0.88 0.86 0.87 516 accuracy - - 0.85 1380 macro-avg 0.85 0.85 0.85 1380 weighted-avg 0.85 0.85 0.85 1380 ``` --- layout: model title: Sentence Entity Resolver for Snomed Aux Concepts, INT version (``sbiobert_base_cased_mli`` embeddings) author: John Snow Labs name: sbiobertresolve_snomed_auxConcepts_int date: 2021-05-16 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts to Snomed codes (INT version) codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. 
This model extracts Morph Abnormality, Procedure, Substance, Physical Object, and Body Structure Snomed concepts. It loads about 6X faster than previous versions, and the load process is more memory-friendly: peak memory use during loading is lower, reducing the chance of OOM exceptions and relaxing hardware requirements. ## Predicted Entities Predicts Snomed Codes and their normalized definition for each chunk. {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_auxConcepts_int_en_3.0.4_3.0_1621191454309.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_auxConcepts_int_en_3.0.4_3.0_1621191454309.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") chunk2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") snomed_aux_int_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_snomed_auxConcepts_int","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_aux_int_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . 
Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("sbert_embeddings") val snomed_aux_int_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_snomed_auxConcepts_int","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_aux_int_resolver)) val data = Seq("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , 
gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.snomed.aux_concepts_int").predict("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""") ```
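`setDistanceFunction("EUCLIDEAN")` tells the resolver to rank candidate SNOMED codes by Euclidean distance between the chunk's sentence embedding and each candidate's embedding, keeping the closest codes first. A dependency-free sketch of that ranking with tiny, illustrative vectors and codes (real sbiobert embeddings are 768-dimensional):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def resolve(chunk_vec, candidates):
    """Return candidate codes sorted by distance to the chunk embedding."""
    return sorted(candidates, key=lambda c: euclidean(chunk_vec, c["vec"]))

# Hypothetical 3-dim embeddings and illustrative codes, not real model data.
chunk_vec = [0.9, 0.1, 0.0]
candidates = [
    {"code": "38341003", "vec": [0.8, 0.2, 0.1]},   # nearby -> ranked first
    {"code": "73211009", "vec": [-0.5, 0.9, 0.3]},  # far away -> ranked last
]
print([c["code"] for c in resolve(chunk_vec, candidates)])
```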
## Results ```bash +--------------------+-----+---+---------+---------------+----------+--------------------+--------------------+ | chunk|begin|end| entity| code|confidence| resolutions| codes| +--------------------+-----+---+---------+---------------+----------+--------------------+--------------------+ | hypertension| 68| 79| PROBLEM| 148439002| 0.2138|risk factors pres...|148439002:::42595...| |chronic renal ins...| 83|109| PROBLEM| 722403003| 0.8517|gastrointestinal ...|722403003:::13781...| | COPD| 113|116| PROBLEM|845101000000100| 0.0962|management of chr...|845101000000100::...| | gastritis| 120|128| PROBLEM| 711498001| 0.3398|magnetic resonanc...|711498001:::71771...| | TIA| 136|138| PROBLEM| 449758002| 0.1927|traumatic infarct...|449758002:::85844...| |a non-ST elevatio...| 182|202| PROBLEM| 1411000087101| 0.0823|ct of left knee::...|1411000087101:::3...| |Guaiac positive s...| 208|229| PROBLEM| 388507006| 0.0555|asparagus rast:::...|388507006:::71771...| |cardiac catheteri...| 295|317| TEST| 41976001| 0.9790|cardiac catheteri...|41976001:::705921...| | PTCA| 324|327|TREATMENT| 312644004| 0.0616|angioplasty of po...|312644004:::41507...| | mid LAD lesion| 332|345| PROBLEM| 91749005| 0.1399|structure of firs...|91749005:::917470...| +--------------------+-----+---+---------+---------------+----------+--------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_snomed_auxConcepts_int| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk, sbert_embeddings]| |Output Labels:|[snomed_code_int_aux_loaded]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on SNOMED (INT version) Findings with ``sbiobert_base_cased_mli`` sentence embeddings. 
http://www.snomed.org/ --- layout: model title: Summarize clinical guidelines author: John Snow Labs name: summarizer_clinical_guidelines_large date: 2023-05-08 tags: [en, summarizer, clinical, licensed, tensorflow] task: Summarization language: en edition: Healthcare NLP 4.4.0 spark_version: 3.0 supported: true engine: tensorflow annotator: MedicalSummarizer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Based on Flan-T5-large, this model is finetuned to summarize clinical guidelines (only for Asthma and Breast Cancer as of now) into four different sections: Overview, Causes, Symptoms, Treatments. The context length of this model is 768 tokens. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_guidelines_large_en_4.4.0_3.0_1683577432272.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_guidelines_large_en_4.4.0_3.0_1683577432272.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") summarizer = MedicalSummarizer.pretrained("summarizer_clinical_guidelines_large", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("summary")\ .setMaxTextLength(768)\ .setMaxNewTokens(512) pipeline = Pipeline(stages=[ document, summarizer ]) text = """Clinical Guidelines for Breast Cancer: Breast cancer is the most common type of cancer among women. It occurs when the cells in the breast start growing abnormally, forming a lump or mass. This can result in the spread of cancerous cells to other parts of the body. Breast cancer may occur in both men and women but is more prevalent in women. The exact cause of breast cancer is unknown. However, several risk factors can increase your likelihood of developing breast cancer, such as: - A personal or family history of breast cancer - A genetic mutation, such as BRCA1 or BRCA2 - Exposure to radiation - Age (most commonly occurring in women over 50) - Early onset of menstruation or late menopause - Obesity - Hormonal factors, such as taking hormone replacement therapy Breast cancer may not present symptoms during its early stages. Symptoms typically manifest as the disease progresses. Some notable symptoms include: - A lump or thickening in the breast or underarm area - Changes in the size or shape of the breast - Nipple discharge - Nipple changes in appearance, such as inversion or flattening - Redness or swelling in the breast Treatment for breast cancer depends on several factors, including the stage of the cancer, the location of the tumor, and the individual's overall health. Common treatment options include: - Surgery (such as lumpectomy or mastectomy) - Radiation therapy - Chemotherapy - Hormone therapy - Targeted therapy Early detection is crucial for the successful treatment of breast cancer. 
Women are advised to routinely perform self-examinations and undergo regular mammogram testing starting at age 40. If you notice any changes in your breast tissue, consult with your healthcare provider immediately.""" data = spark.createDataFrame([[text]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val summarizer = MedicalSummarizer.pretrained("summarizer_clinical_guidelines_large", "en", "clinical/models") .setInputCols("document") .setOutputCol("summary") .setMaxTextLength(768) .setMaxNewTokens(512) val pipeline = new Pipeline().setStages(Array(document_assembler, summarizer)) val text = """Clinical Guidelines for Breast Cancer: Breast cancer is the most common type of cancer among women. It occurs when the cells in the breast start growing abnormally, forming a lump or mass. This can result in the spread of cancerous cells to other parts of the body. Breast cancer may occur in both men and women but is more prevalent in women. The exact cause of breast cancer is unknown. However, several risk factors can increase your likelihood of developing breast cancer, such as: - A personal or family history of breast cancer - A genetic mutation, such as BRCA1 or BRCA2 - Exposure to radiation - Age (most commonly occurring in women over 50) - Early onset of menstruation or late menopause - Obesity - Hormonal factors, such as taking hormone replacement therapy Breast cancer may not present symptoms during its early stages. Symptoms typically manifest as the disease progresses. 
Some notable symptoms include: - A lump or thickening in the breast or underarm area - Changes in the size or shape of the breast - Nipple discharge - Nipple changes in appearance, such as inversion or flattening - Redness or swelling in the breast Treatment for breast cancer depends on several factors, including the stage of the cancer, the location of the tumor, and the individual's overall health. Common treatment options include: - Surgery (such as lumpectomy or mastectomy) - Radiation therapy - Chemotherapy - Hormone therapy - Targeted therapy Early detection is crucial for the successful treatment of breast cancer. Women are advised to routinely perform self-examinations and undergo regular mammogram testing starting at age 40. If you notice any changes in your breast tissue, consult with your healthcare provider immediately. """ val data = Seq(text).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
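The `setMaxTextLength(768)` parameter above caps how much input the summarizer reads. For guidelines longer than that, one pragmatic pre-processing step (a plain-Python sketch, not part of the library; whitespace tokens are only a rough stand-in for the model's real tokenizer) is to chunk the document first and summarize each chunk separately:

```python
# Hedged sketch: greedily pack whitespace-delimited "tokens" into chunks
# that fit under the summarizer's max input length. A real pipeline would
# count tokens with the model's own tokenizer instead.

def chunk_by_token_limit(text, max_tokens=768):
    """Split text into chunks of at most max_tokens whitespace tokens."""
    words = text.split()
    chunks = []
    for start in range(0, len(words), max_tokens):
        chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks

doc = ("word " * 2000).strip()  # stand-in for a long clinical guideline
chunks = chunk_by_token_limit(doc)
```

Each chunk could then be fed through the pipeline above as a separate row, and the per-chunk summaries concatenated.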
## Results ```bash Overview of the disease: Breast cancer is the most common type of cancer among women, occurring when the cells in the breast start growing abnormally, forming a lump or mass. It can result in the spread of cancerous cells to other parts of the body. Causes: The exact cause of breast cancer is unknown, but several risk factors can increase the likelihood of developing it, such as a personal or family history, a genetic mutation, exposure to radiation, age, early onset of menstruation or late menopause, obesity, and hormonal factors. Symptoms: Symptoms of breast cancer typically manifest as the disease progresses, including a lump or thickening in the breast or underarm area, changes in the size or shape of the breast, nipple discharge, nipple changes in appearance, and redness or swelling in the breast. Treatment recommendations: Treatment for breast cancer depends on several factors, including the stage of the cancer, the location of the tumor, and the individual's overall health. Common treatment options include surgery, radiation therapy, chemotherapy, hormone therapy, and targeted therapy. Early detection is crucial for successful treatment of breast cancer. Women are advised to routinely perform self-examinations and undergo regular mammogram testing starting at age 40. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_clinical_guidelines_large| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|2.9 GB| ## References Trained on in-house curated data. 
--- layout: model title: Abkhazian asr_xls_r_ab_spanish TFWav2Vec2ForCTC from joheras author: John Snow Labs name: pipeline_asr_xls_r_ab_spanish date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_ab_spanish` is an Abkhazian model originally trained by joheras. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_xls_r_ab_spanish_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_ab_spanish_ab_4.2.0_3.0_1664020958116.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_ab_spanish_ab_4.2.0_3.0_1664020958116.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_xls_r_ab_spanish', lang = 'ab') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_xls_r_ab_spanish", lang = "ab") val annotations = pipeline.transform(audioDF) ```
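The pipeline expects `audioDF` to hold the raw waveform as an array of floats in an `audio_content` column. As a hedged, stdlib-only illustration (your audio tooling may differ; this handles only 16-bit PCM mono WAV), here is one way to decode a clip into such floats:

```python
# Hedged sketch: decode a 16-bit PCM mono WAV into floats in [-1, 1],
# the kind of array an "audio_content" column typically carries.
import io
import struct
import wave

def wav_to_floats(wav_bytes):
    """Decode 16-bit PCM mono WAV bytes into a list of floats in [-1, 1]."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        assert wf.getsampwidth() == 2 and wf.getnchannels() == 1
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Synthesize a tiny silent clip in memory to demonstrate the round trip.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(struct.pack("<160h", *([0] * 160)))  # 10 ms of silence
floats = wav_to_floats(buf.getvalue())
```

Rows of such float lists would then be assembled into a DataFrame with the column name the pipeline expects.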
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_xls_r_ab_spanish| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|ab| |Size:|451.9 KB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_4 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-128-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_4_en_4.0.0_3.0_1657184243263.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_4_en_4.0.0_3.0_1657184243263.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
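For intuition, span-extraction models such as `BertForQuestionAnswering` score every token as a candidate answer start and end, then pick the highest-scoring valid span. A toy plain-Python sketch (the scores below are made up, not the model's real logits):

```python
# Hedged sketch of extractive-QA span selection: maximize
# start_score[i] + end_score[j] subject to i <= j.

def best_span(start_scores, end_scores):
    """Return (start, end) indices maximizing start + end with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, len(end_scores)):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]
end   = [0.0, 0.1, 0.0, 4.0, 0.2, 0.0, 0.0, 0.1, 0.2, 0.0]
s, e = best_span(start, end)
answer = " ".join(tokens[s:e + 1])
```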
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-128-finetuned-squad-seed-4 --- layout: model title: English BertForQuestionAnswering model (from MrAnderson) author: John Snow Labs name: bert_qa_bert_base_4096_full_trivia_copied_embeddings date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-4096-full-trivia-copied-embeddings` is an English model originally trained by `MrAnderson`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_4096_full_trivia_copied_embeddings_en_4.0.0_3.0_1654179649725.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_4096_full_trivia_copied_embeddings_en_4.0.0_3.0_1654179649725.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_4096_full_trivia_copied_embeddings","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_4096_full_trivia_copied_embeddings","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.trivia.bert.base_4096.by_MrAnderson").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_4096_full_trivia_copied_embeddings| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|418.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/MrAnderson/bert-base-4096-full-trivia-copied-embeddings --- layout: model title: Legal Interpretation Clause Binary Classifier author: John Snow Labs name: legclf_interpretation_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `interpretation` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Keep in mind that this model's embeddings allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial link provided above).
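The paragraph-splitting technique mentioned above can be sketched in plain Python (an illustration only, not the Legal NLP tutorial's implementation):

```python
# Hedged sketch: split a long contract on blank lines ("multiline"
# paragraph splitting) so each candidate section can be classified
# separately by clause classifiers.
import re

def split_paragraphs(text):
    """Split text on blank lines and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

contract = """1. INTERPRETATION

Headings are for convenience only.

2. PAYMENT

Fees are due within 30 days."""
sections = split_paragraphs(contract)
```

Each resulting section would then become one row of the `clause_text` column fed to the classifier.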
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `interpretation` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_interpretation_clause_en_1.0.0_3.2_1660123631357.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_interpretation_clause_en_1.0.0_3.2_1660123631357.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_interpretation_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
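Because each `legclf_*` model is a binary classifier for one clause type, running several of them over the same clause yields a True/False profile per clause. A minimal sketch with mocked model outputs (real labels would come from each pipeline's `category` column; `confidentiality` below is a hypothetical sibling model, not a specific published one):

```python
# Hedged sketch: aggregate several binary clause classifiers' predicted
# labels into a boolean profile for one clause. Outputs are mocked.

def clause_profile(predictions):
    """Map {clause_type: predicted_label} to {clause_type: bool}."""
    return {ctype: label == ctype for ctype, label in predictions.items()}

mock = {
    "interpretation": "interpretation",  # this model predicted True
    "confidentiality": "other",          # a hypothetical sibling predicted False
}
profile = clause_profile(mock)
```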
## Results ```bash +-------+ | result| +-------+ |[interpretation]| |[other]| |[other]| |[interpretation]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_interpretation_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support interpretation 0.91 0.78 0.84 27 other 0.92 0.97 0.95 74 accuracy - - 0.92 101 macro-avg 0.92 0.88 0.89 101 weighted-avg 0.92 0.92 0.92 101 ``` --- layout: model title: Detect Clinical Entities (ner_jsl_greedy) author: John Snow Labs name: ner_jsl_greedy date: 2021-06-24 tags: [ner, en, licensed, clinical] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 2.4 supported: true annotator: NotDefined article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terminology. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. This model is the official version of the jsl_ner_wip_greedy_clinical model. Definitions of Predicted Entities: - `Injury_or_Poisoning`: Physical harm or injury caused to the body, including those caused by accidents, falls, or poisoning of a patient or someone else. - `Direction`: All the information relating to the laterality of the internal and external organs. - `Test`: Mentions of laboratory, pathology, and radiological tests. - `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient. - `Death_Entity`: Mentions that indicate the death of a patient.
- `Relationship_Status`: State of the patient's romantic or social relationships (e.g. single, married, divorced). - `Duration`: The duration of a medical treatment or medication use. - `Respiration`: Number of breaths per minute. - `Hyperlipidemia`: Terms that indicate hyperlipidemia with relevant subtypes and synonyms. - `Birth_Entity`: Mentions that indicate giving birth. - `Age`: All mentions of ages, past or present, related to the patient or anybody else. - `Labour_Delivery`: Extractions include stages of labor and delivery. - `Family_History_Header`: Identifies section headers that correspond to Family History of the patient. - `BMI`: Numeric values and other text information related to Body Mass Index. - `Temperature`: All mentions that refer to body temperature. - `Alcohol`: Terms that indicate alcohol use, abuse or drinking issues of a patient or someone else. - `Kidney_Disease`: Terms that refer to any kidney diseases (includes mentions of modifiers such as "Acute" or "Chronic"). - `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else. - `Medical_History_Header`: Identifies section headers that correspond to Past Medical History of a patient. - `Cerebrovascular_Disease`: All terms that refer to cerebrovascular diseases and events. - `Oxygen_Therapy`: Breathing support triggered by the patient or provided entirely or partially by a machine (e.g. ventilator, BPAP, CPAP). - `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements. - `Psychological_Condition`: All the mental health diagnoses, disorders, conditions or syndromes of a patient or someone else. - `Heart_Disease`: All mentions of acquired, congenital or degenerative heart diseases. - `Employment`: All mentions of patient or provider occupational titles and employment status. - `Obesity`: Terms related to a patient being obese (overweight and BMI are extracted as different labels).
- `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.). - `Pregnancy`: All terms related to Pregnancy (excluding terms that are extracted with their specific labels, such as "Labour_Delivery" etc.). - `ImagingFindings`: All mentions of radiographic and imagistic findings. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments. - `Medical_Device`: All mentions related to medical devices and supplies. - `Race_Ethnicity`: All terms that refer to racial and national origin of sociocultural groups. - `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital signs Headers are extracted separately with their specific labels). - `Symptom`: All the symptoms mentioned in the document, of a patient or someone else. - `Treatment`: Includes therapeutic and minimally invasive treatment and procedures (invasive treatments or procedures are extracted as "Procedure"). - `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs). - `Route`: Drug and medication administration routes as described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_Ingredient`: Active ingredient/s found in drug products. - `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted. - `Diet`: All mentions and information regarding the patient's dietary habits. - `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by the naked eye. - `LDL`: All mentions related to the lab test and results for LDL (Low Density Lipoprotein).
- `VS_Finding`: Qualitative data (e.g. Fever, Cyanosis, Tachycardia) and any other symptoms that refer to vital signs. - `Allergen`: Allergen related extractions mentioned in the document. - `EKG_Findings`: All mentions of EKG readings. - `Imaging_Technique`: All mentions of special radiographic views or special imaging techniques used in radiology. - `Triglycerides`: All terms related to the specific lab test for Triglycerides. - `RelativeTime`: Time references that are relative to different times or events (e.g. words such as "approximately", "in the morning"). - `Gender`: Gender-specific nouns and pronouns. - `Pulse`: Peripheral heart rate, without advanced information like measurement location. - `Social_History_Header`: Identifies section headers that correspond to Social History of a patient. - `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs). - `Diabetes`: All terms related to diabetes mellitus. - `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately. - `Internal_organ_or_component`: All mentions related to internal body parts or organs that cannot be examined by the naked eye. - `Clinical_Dept`: Terms that indicate the medical and/or surgical departments. - `Form`: Drug and medication forms as described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients.
- `Strength`: Potency of one unit of drug (or a combination of drugs); the available measurement units are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Fetus_NewBorn`: All terms related to fetus, infant, new born (excluding terms that are extracted with their specific labels, such as "Labour_Delivery", "Pregnancy" etc.). - `RelativeDate`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "approximately two years ago", "about two days ago"). - `Height`: All mentions related to a patient's height. - `Test_Result`: Terms related to all the test results present in the document (clinical test results are included). - `Sexually_Active_or_Sexual_Orientation`: All terms that are related to sexuality, sexual orientations and sexual activity. - `Frequency`: Frequency of administration for a dose prescribed. - `Time`: Specific time references (hour and/or minutes). - `Weight`: All mentions related to a patient's weight. - `Vaccine`: Generic and brand name of vaccines or vaccination procedure. - `Vital_Signs_Header`: Identifies section headers that correspond to Vital Signs of a patient. - `Communicable_Disease`: Includes all mentions of communicable diseases. - `Dosage`: Quantity prescribed by the physician for an active ingredient; the available measurement units are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Overweight`: Terms related to the patient being overweight (BMI and Obesity are extracted separately). - `Hypertension`: All terms related to Hypertension (quantitative data such as 150/100 is extracted as Blood_Pressure).
- `HDL`: Terms related to the lab test for HDL (High Density Lipoprotein). - `Total_Cholesterol`: Terms related to the lab test and results for cholesterol. - `Smoking`: All mentions of smoking status of a patient. - `Date`: Mentions of an exact date, in any format, including day number, month and/or year. ## Predicted Entities `Injury_or_Poisoning`, `Direction`, `Test`, `Admission_Discharge`, `Death_Entity`, `Relationship_Status`, `Duration`, `Hyperlipidemia`, `Respiration`, `Birth_Entity`, `Age`, `Family_History_Header`, `Labour_Delivery`, `BMI`, `Temperature`, `Alcohol`, `Kidney_Disease`, `Oncological`, `Medical_History_Header`, `Cerebrovascular_Disease`, `Oxygen_Therapy`, `O2_Saturation`, `Psychological_Condition`, `Heart_Disease`, `Employment`, `Obesity`, `Disease_Syndrome_Disorder`, `Pregnancy`, `ImagingFindings`, `Procedure`, `Medical_Device`, `Race_Ethnicity`, `Section_Header`, `Drug`, `Symptom`, `Treatment`, `Substance`, `Route`, `Blood_Pressure`, `Diet`, `External_body_part_or_region`, `LDL`, `VS_Finding`, `Allergen`, `EKG_Findings`, `Imaging_Technique`, `Triglycerides`, `RelativeTime`, `Gender`, `Pulse`, `Social_History_Header`, `Substance_Quantity`, `Diabetes`, `Modifier`, `Internal_organ_or_component`, `Clinical_Dept`, `Form`, `Strength`, `Fetus_NewBorn`, `RelativeDate`, `Height`, `Test_Result`, `Time`, `Frequency`, `Sexually_Active_or_Sexual_Orientation`, `Weight`, `Vaccine`, `Vital_Signs_Header`, `Communicable_Disease`, `Dosage`, `Hypertension`, `HDL`, `Overweight`, `Total_Cholesterol`, `Smoking`, `Date`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_greedy_en_3.1.0_2.4_1624567745679.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_greedy_en_3.1.0_2.4_1624567745679.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") jsl_ner = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("jsl_ner") jsl_ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "jsl_ner"]) \ .setOutputCol("ner_chunk") jsl_ner_pipeline = Pipeline().setStages([ documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter]) jsl_ner_model = jsl_ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame([["""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature."""]]).toDF("text") result = jsl_ner_model.transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val jsl_ner = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("jsl_ner") val jsl_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "jsl_ner")) .setOutputCol("ner_chunk") val jsl_ner_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter)) val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text") val result = jsl_ner_pipeline.fit(data).transform(data) ```
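The `NerConverter` stage merges the model's token-level IOB tags into the entity chunks shown in the Results table below. A simplified plain-Python sketch of that merging (the tags here are illustrative, not real model output):

```python
# Hedged sketch: merge IOB-tagged tokens into (chunk_text, label) pairs,
# the shape of the ner_chunk output produced by NerConverter.

def iob_to_chunks(tokens, tags):
    """Merge IOB tags into (chunk, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["The", "patient", "is", "a", "21-day-old", "Caucasian", "male"]
tags = ["O", "O", "O", "O", "B-Age", "B-Race_Ethnicity", "B-Gender"]
chunks = iob_to_chunks(tokens, tags)
```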
## Results ```bash +----------------------------------------------+----------------------------+ |chunk |ner_label | +----------------------------------------------+----------------------------+ |21-day-old |Age | |Caucasian |Race_Ethnicity | |male |Gender | |for 2 days |Duration | |congestion |Symptom | |mom |Gender | |suctioning yellow discharge |Symptom | |nares |External_body_part_or_region| |she |Gender | |mild problems with his breathing while feeding|Symptom | |perioral cyanosis |Symptom | |retractions |Symptom | |One day ago |RelativeDate | |mom |Gender | |tactile temperature |Symptom | |Tylenol |Drug | |Baby |Age | |decreased p.o. intake |Symptom | |His |Gender | |20 minutes |Duration | +----------------------------------------------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_greedy| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on data gathered and manually annotated by John Snow Labs. https://www.johnsnowlabs.com/data/ ## Benchmarking ```bash entity tp fp fn total precision recall f1 VS_Finding 229.0 56.0 34.0 263.0 0.8035 0.8707 0.8358 Direction 4009.0 479.0 403.0 4412.0 0.8933 0.9087 0.9009 Female_Reproducti... 2.0 1.0 3.0 5.0 0.6667 0.4 0.5 Respiration 80.0 9.0 14.0 94.0 0.8989 0.8511 0.8743 Cerebrovascular_D... 82.0 27.0 18.0 100.0 0.7523 0.82 0.7847 not 4.0 0.0 0.0 4.0 1.0 1.0 1.0 Family_History_He... 
86.0 4.0 3.0 89.0 0.9556 0.9663 0.9609 Heart_Disease 469.0 76.0 83.0 552.0 0.8606 0.8496 0.8551 ImagingFindings 68.0 38.0 75.0 143.0 0.6415 0.4755 0.5462 RelativeTime 141.0 76.0 66.0 207.0 0.6498 0.6812 0.6651 Strength 720.0 49.0 58.0 778.0 0.9363 0.9254 0.9308 Smoking 117.0 8.0 6.0 123.0 0.936 0.9512 0.9435 Medical_Device 3584.0 730.0 359.0 3943.0 0.8308 0.909 0.8681 EKG_Findings 41.0 20.0 45.0 86.0 0.6721 0.4767 0.5578 Pulse 138.0 23.0 24.0 162.0 0.8571 0.8519 0.8545 Psychological_Con... 121.0 14.0 29.0 150.0 0.8963 0.8067 0.8491 Overweight 5.0 2.0 0.0 5.0 0.7143 1.0 0.8333 Triglycerides 3.0 0.0 0.0 3.0 1.0 1.0 1.0 Obesity 49.0 6.0 4.0 53.0 0.8909 0.9245 0.9074 Admission_Discharge 325.0 30.0 2.0 327.0 0.9155 0.9939 0.9531 HDL 2.0 1.0 1.0 3.0 0.6667 0.6667 0.6667 Diabetes 118.0 13.0 7.0 125.0 0.9008 0.944 0.9219 Section_Header 3778.0 148.0 138.0 3916.0 0.9623 0.9648 0.9635 Age 617.0 52.0 47.0 664.0 0.9223 0.9292 0.9257 O2_Saturation 34.0 11.0 19.0 53.0 0.7556 0.6415 0.6939 Kidney_Disease 114.0 5.0 12.0 126.0 0.958 0.9048 0.9306 Test 2668.0 526.0 498.0 3166.0 0.8353 0.8427 0.839 Communicable_Disease 25.0 12.0 9.0 34.0 0.6757 0.7353 0.7042 Hypertension 152.0 10.0 6.0 158.0 0.9383 0.962 0.95 External_body_par... 
2652.0 387.0 340.0 2992.0 0.8727 0.8864 0.8795 Oxygen_Therapy 67.0 21.0 23.0 90.0 0.7614 0.7444 0.7528 Test_Result 1124.0 227.0 258.0 1382.0 0.832 0.8133 0.8225 Modifier 539.0 185.0 309.0 848.0 0.7445 0.6356 0.6858 BMI 7.0 1.0 1.0 8.0 0.875 0.875 0.875 Labour_Delivery 75.0 19.0 23.0 98.0 0.7979 0.7653 0.7813 Employment 249.0 51.0 57.0 306.0 0.83 0.8137 0.8218 Clinical_Dept 948.0 95.0 80.0 1028.0 0.9089 0.9222 0.9155 Time 36.0 7.0 7.0 43.0 0.8372 0.8372 0.8372 Procedure 3180.0 460.0 480.0 3660.0 0.8736 0.8689 0.8712 Diet 50.0 29.0 30.0 80.0 0.6329 0.625 0.6289 Oncological 478.0 46.0 50.0 528.0 0.9122 0.9053 0.9087 LDL 3.0 0.0 2.0 5.0 1.0 0.6 0.75 Symptom 6801.0 1097.0 1097.0 7898.0 0.8611 0.8611 0.8611 Temperature 109.0 12.0 7.0 116.0 0.9008 0.9397 0.9198 Vital_Signs_Header 213.0 27.0 16.0 229.0 0.8875 0.9301 0.9083 Relationship_Status 42.0 2.0 1.0 43.0 0.9545 0.9767 0.9655 Total_Cholesterol 10.0 4.0 5.0 15.0 0.7143 0.6667 0.6897 Blood_Pressure 167.0 22.0 23.0 190.0 0.8836 0.8789 0.8813 Injury_or_Poisoning 510.0 83.0 111.0 621.0 0.86 0.8213 0.8402 Drug_Ingredient 1698.0 160.0 158.0 1856.0 0.9139 0.9149 0.9144 Treatment 156.0 40.0 54.0 210.0 0.7959 0.7429 0.7685 Assertion_SocialD... 4.0 0.0 6.0 10.0 1.0 0.4 0.5714 Pregnancy 100.0 45.0 41.0 141.0 0.6897 0.7092 0.6993 Vaccine 13.0 3.0 6.0 19.0 0.8125 0.6842 0.7429 Disease_Syndrome_... 2861.0 452.0 376.0 3237.0 0.8636 0.8838 0.8736 Height 25.0 8.0 9.0 34.0 0.7576 0.7353 0.7463 Frequency 650.0 157.0 148.0 798.0 0.8055 0.8145 0.81 Route 872.0 83.0 85.0 957.0 0.9131 0.9112 0.9121 Death_Entity 49.0 7.0 6.0 55.0 0.875 0.8909 0.8829 Duration 367.0 132.0 95.0 462.0 0.7355 0.7944 0.7638 Internal_organ_or... 6532.0 1016.0 987.0 7519.0 0.8654 0.8687 0.8671 Alcohol 79.0 20.0 12.0 91.0 0.798 0.8681 0.8316 Date 515.0 19.0 19.0 534.0 0.9644 0.9644 0.9644 Hyperlipidemia 47.0 2.0 1.0 48.0 0.9592 0.9792 0.9691 Social_History_He... 
89.0 9.0 4.0 93.0 0.9082 0.957 0.9319 Race_Ethnicity 113.0 0.0 3.0 116.0 1.0 0.9741 0.9869 Imaging_Technique 47.0 31.0 30.0 77.0 0.6026 0.6104 0.6065 Drug_BrandName 963.0 72.0 79.0 1042.0 0.9304 0.9242 0.9273 RelativeDate 553.0 128.0 121.0 674.0 0.812 0.8205 0.8162 Gender 6043.0 59.0 87.0 6130.0 0.9903 0.9858 0.9881 Form 227.0 35.0 47.0 274.0 0.8664 0.8285 0.847 Dosage 279.0 42.0 62.0 341.0 0.8692 0.8182 0.8429 Medical_History_H... 117.0 4.0 11.0 128.0 0.9669 0.9141 0.9398 Substance 59.0 16.0 16.0 75.0 0.7867 0.7867 0.7867 Weight 85.0 19.0 21.0 106.0 0.8173 0.8019 0.8095 macro - - - - - - 0.7286 micro - - - - - - 0.8715 ``` --- layout: model title: NER Model for 6 Scandinavian Languages author: John Snow Labs name: bert_token_classifier_scandi_ner date: 2021-12-09 tags: [danish, norwegian, swedish, icelandic, faroese, ner, xx, open_source] task: Named Entity Recognition language: xx edition: Spark NLP 3.3.2 spark_version: 2.4 supported: true recommended: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model was imported from `Hugging Face` and it's been fine-tuned for 6 Scandinavian languages (Danish, Norwegian-Bokmål, Norwegian-Nynorsk, Swedish, Icelandic, Faroese), leveraging `Bert` embeddings and `BertForTokenClassification` for NER purposes. 
## Predicted Entities `PER`, `ORG`, `LOC`, `MISC` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_SCANDINAVIAN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_scandi_ner_xx_3.3.2_2.4_1639044930234.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_scandi_ner_xx_3.3.2_2.4_1639044930234.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_scandi_ner", "xx")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = """Hans er professor ved Statens Universitet, som ligger i København, og han er en rigtig københavner.""" result = model.transform(spark.createDataFrame([[text]]).toDF("text")) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_scandi_ner", "xx") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val example = Seq("Hans er professor ved Statens Universitet, som ligger i København, og han er en rigtig københavner.").toDF("text") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu 
nlu.load("xx.ner.scandinavian").predict("""Hans er professor ved Statens Universitet, som ligger i København, og han er en rigtig københavner.""") ```
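The `ner_chunk` column shown in the Results section comes from merging the classifier's token-level BIO tags into entity spans. A minimal pure-Python sketch of that merging logic (independent of Spark NLP; the token and tag lists are illustrative, not actual model output):

```python
# Sketch of BIO-tag merging as performed by a NER converter stage.
def bio_to_chunks(tokens, tags):
    """Merge B-/I- tagged tokens into (chunk, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Hans", "er", "professor", "ved", "Statens", "Universitet"]
tags = ["B-PER", "O", "O", "O", "B-ORG", "I-ORG"]
print(bio_to_chunks(tokens, tags))  # [('Hans', 'PER'), ('Statens Universitet', 'ORG')]
```

This is only a sketch of the idea; in the pipeline above, `NerConverter` performs this step using the `sentence`, `token`, and `ner` columns.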
## Results ```bash +-------------------+---------+ |chunk |ner_label| +-------------------+---------+ |Hans |PER | |Statens Universitet|ORG | |København |LOC | |københavner |MISC | +-------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_scandi_ner| |Compatibility:|Spark NLP 3.3.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|666.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## Data Source [https://huggingface.co/saattrupdan/nbailab-base-ner-scandi](https://huggingface.co/saattrupdan/nbailab-base-ner-scandi) ## Benchmarking ```bash languages : F1 Score: ---------- -------- Danish 0.8744 Bokmål 0.9106 Nynorsk 0.9042 Swedish 0.8837 Icelandic 0.8861 Faroese 0.9022 ``` --- layout: model title: English asr_wav2vec2_large_xls_r_300m_georgian_large TFWav2Vec2ForCTC from RaphaelKalandadze author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_300m_georgian_large date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_georgian_large` is an English model originally trained by RaphaelKalandadze. 
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_georgian_large_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_georgian_large_en_4.2.0_3.0_1664120573735.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_georgian_large_en_4.2.0_3.0_1664120573735.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_georgian_large', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_georgian_large", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_georgian_large| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Word2Vec Embeddings in Divehi (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, dv, open_source] task: Embeddings language: dv edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_dv_3.4.1_3.0_1647293864694.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_dv_3.4.1_3.0_1647293864694.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","dv") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","dv") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("dv.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
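A common downstream use of the `embeddings` column produced above is comparing token vectors, for instance by cosine similarity. A minimal pure-Python sketch of that comparison; the short 3-dimensional vectors here are made up for brevity (this model actually emits 300-dimensional vectors):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical short vectors; w2v_cc_300d produces 300 dimensions per token.
v1 = [0.2, 0.1, 0.4]
v2 = [0.2, 0.1, 0.4]
v3 = [-0.4, 0.0, 0.2]
print(cosine_similarity(v1, v2))  # ~1.0 for identical vectors
print(cosine_similarity(v1, v3))
```

In practice you would pull the per-token vectors out of the `embeddings` annotation column of `result` and feed them to a comparison like this.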
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|dv| |Size:|89.2 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Extract Anatomical Entities from Oncology Texts author: John Snow Labs name: ner_oncology_anatomy_general date: 2022-10-25 tags: [licensed, clinical, oncology, en, ner, anatomy] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Healthcare 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts anatomical entities using an unspecific label. Definitions of Predicted Entities: - `Anatomical_Site`: Relevant anatomical terms mentioned in text. - `Direction`: Directional and laterality terms, such as "left", "right", "bilateral", "upper" and "lower". ## Predicted Entities `Anatomical_Site`, `Direction` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_general_en_4.0.0_3.0_1666720431299.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_anatomy_general_en_4.0.0_3.0_1666720431299.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_anatomy_general", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_anatomy_general", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new 
Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.anatomy_general").predict("""The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver.""") ```
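The per-label metrics in this card's Benchmarking section are derived from the tp/fp/fn counts by the standard formulas. A quick sketch, checked against the Anatomical_Site row reported for this model:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Anatomical_Site counts from this card's Benchmarking section.
p, r, f1 = prf(tp=2946, fp=549, fn=638)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.84 0.82 0.83
```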
## Results ```bash | chunk | ner_label | |:--------|:----------------| | left | Direction | | breast | Anatomical_Site | | lungs | Anatomical_Site | | liver | Anatomical_Site | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_anatomy_general| |Compatibility:|Spark NLP for Healthcare 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.3 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Anatomical_Site 2946 549 638 3584 0.84 0.82 0.83 Direction 864 209 120 984 0.81 0.88 0.84 macro_avg 3810 758 758 4568 0.82 0.85 0.84 micro_avg 3810 758 758 4568 0.83 0.83 0.83 ``` --- layout: model title: Part of Speech for Slovenian author: John Snow Labs name: pos_ud_ssj date: 2021-03-09 tags: [part_of_speech, open_source, slovenian, pos_ud_ssj, sl] task: Part of Speech Tagging language: sl edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. 
## Predicted Entities - PUNCT - DET - NOUN - AUX - VERB - PRON - ADP - SCONJ - PROPN - ADJ - CCONJ - PART - ADV - NUM - X - INTJ {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_ssj_sl_3.0.0_3.0_1615292232360.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_ssj_sl_3.0.0_3.0_1615292232360.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_ssj", "sl") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Pozdravljeni iz JOHN Snow Labs! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_ssj", "sl") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Pozdravljeni iz JOHN Snow Labs! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Pozdravljeni iz JOHN Snow Labs! "] token_df = nlu.load('sl.pos').predict(text) token_df ```
## Results ```bash token pos 0 Pozdravljeni ADJ 1 iz ADP 2 JOHN PROPN 3 Snow PROPN 4 Labs PROPN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_ssj| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|sl| --- layout: model title: Explain Document Pipeline for Dutch author: John Snow Labs name: explain_document_sm date: 2021-03-22 tags: [open_source, dutch, explain_document_sm, pipeline, nl] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: nl edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_sm is a pretrained pipeline that processes text with a few basic steps, covering most of the common text processing tasks on your dataframe. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_nl_3.0.0_3.0_1616423469893.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_sm_nl_3.0.0_3.0_1616423469893.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_sm', lang = 'nl') annotations = pipeline.fullAnnotate("Hallo van John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_sm", lang = "nl") val result = pipeline.fullAnnotate("Hallo van John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hallo van John Snow Labs! "] result_df = nlu.load('nl.explain').predict(text) result_df ```
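The result of `fullAnnotate` can be navigated like any Python mapping keyed by the pipeline's output columns. A sketch with a hypothetical annotations dict shaped like this pipeline's output (the keys and values here are illustrative, not the exact annotation objects Spark NLP returns):

```python
# Hypothetical, simplified shape of one fullAnnotate result.
annotations = {
    "token": ["Hallo", "van", "John", "Snow", "Labs!"],
    "pos": ["PROPN", "ADP", "PROPN", "PROPN", "PROPN"],
    "entities": ["John Snow Labs!"],
}

# Pair each token with its predicted part of speech.
pairs = list(zip(annotations["token"], annotations["pos"]))
print(pairs[:2])  # [('Hallo', 'PROPN'), ('van', 'ADP')]
print(annotations["entities"])  # ['John Snow Labs!']
```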
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:---|:---|:---|:---|:---|:---|:---|:---| | 0 | ['Hallo van John Snow Labs! '] | ['Hallo van John Snow Labs!'] | ['Hallo', 'van', 'John', 'Snow', 'Labs!'] | ['Hallo', 'van', 'John', 'Snow', 'Labs!'] | ['PROPN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.3653799891471863,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_sm| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|nl| --- layout: model title: Korean BertForQuestionAnswering model (from sangrimlee) author: John Snow Labs name: bert_qa_bert_base_multilingual_cased_korquad date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: ko edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased-korquad` is a Korean model originally trained by `sangrimlee`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_korquad_ko_4.0.0_3.0_1654180248745.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_korquad_ko_4.0.0_3.0_1654180248745.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_multilingual_cased_korquad","ko") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_multilingual_cased_korquad","ko") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("ko.answer_question.korquad.bert.multilingual_base_cased.by_sangrimlee").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
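In the NLU one-liner above, the question and its context travel in a single string separated by `|||`. A tiny pure-Python sketch of that packing convention; the `split_question_context` helper is illustrative and not NLU's actual parsing code:

```python
def split_question_context(s, sep="|||"):
    """Split a combined 'question|||context' string into its two parts."""
    question, _, context = s.partition(sep)
    return question.strip(), context.strip()

q, c = split_question_context("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
print(c)  # My name is Clara and I live in Berkeley.
```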
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_multilingual_cased_korquad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ko| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/sangrimlee/bert-base-multilingual-cased-korquad --- layout: model title: Google's T5 for closed book question answering author: John Snow Labs name: google_t5_small_ssm_nq date: 2020-12-21 task: Question Answering language: en nav_key: models edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, t5, en] supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a text-to-text model trained by Google on the colossal, cleaned version of Common Crawl's web crawl corpus (C4) dataset and then fine-tuned on Wikipedia and the Natural Questions (NQ) dataset. The model can answer free-text questions, such as "Which is the capital of France?", without relying on any context or external resources. ## Predicted Entities \[DOCUMENT] {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/google_t5_small_ssm_nq_en_2.7.0_2.4_1608552073257.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/google_t5_small_ssm_nq_en_2.7.0_2.4_1608552073257.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.annotator import SentenceDetectorDLModel, T5Transformer data = spark.createDataFrame([ [1, "Which is the capital of France? Who was the first president of USA?"], [1, "Which is the capital of Bulgaria ?"], [2, "Who is Donald Trump?"]]).toDF("id", "text") document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") sentence_detector = SentenceDetectorDLModel\ .pretrained()\ .setInputCols(["documents"])\ .setOutputCol("questions") t5 = T5Transformer\ .pretrained("google_t5_small_ssm_nq")\ .setInputCols(["questions"])\ .setOutputCol("answers") pipeline = Pipeline().setStages([document_assembler, sentence_detector, t5]) results = pipeline.fit(data).transform(data) results.select("questions.result", "answers.result").show(truncate=False) ``` ```scala val testData = spark.createDataFrame(Seq( (1, "Which is the capital of France? Who was the first president of USA?"), (1, "Which is the capital of Bulgaria ?"), (2, "Who is Donald Trump?") )).toDF("id", "text") val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("documents") val sentenceDetector = SentenceDetectorDLModel .pretrained() .setInputCols(Array("documents")) .setOutputCol("questions") val t5 = T5Transformer .pretrained("google_t5_small_ssm_nq") .setInputCols(Array("questions")) .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, t5)) val model = pipeline.fit(testData) val results = model.transform(testData) results.select("questions.result", "answers.result").show(truncate = false) ``` {:.nlu-block} ```python import nlu nlu.load("en.t5").predict("""Which is the capital of France? Who was the first president of USA?""") ```
## Results ```bash +----------------------------------------------------------------------+--------------------------+ |result |result | +----------------------------------------------------------------------+--------------------------+ |[Which is the capital of France?, Who was the first president of USA?]|[Paris, George Washington]| |[Which is the capital of Bulgaria ?] |[Sofia] | |[Who is Donald Trump?] |[a United States citizen] | +----------------------------------------------------------------------+--------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|google_t5_small_ssm_nq| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|en| ## Data Source C4, Wikipedia, NQ --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_0 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-16-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_0_en_4.0.0_3.0_1657184411386.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_0_en_4.0.0_3.0_1657184411386.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-16-finetuned-squad-seed-0 --- layout: model title: Swahili XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_swahili_finetuned_ner_swahili date: 2022-08-01 tags: [sw, open_source, xlm_roberta, ner] task: Named Entity Recognition language: sw edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-swahili-finetuned-ner-swahili` is a Swahili model originally trained by `mbeukman`. ## Predicted Entities `DATE`, `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_swahili_finetuned_ner_swahili_sw_4.1.0_3.0_1659355857459.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_swahili_finetuned_ner_swahili_sw_4.1.0_3.0_1659355857459.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_swahili_finetuned_ner_swahili","sw") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_swahili_finetuned_ner_swahili","sw") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
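The `NerConverter` stage above merges the token-level IOB tags emitted by the classifier into entity chunks. A rough, self-contained illustration of that merging logic (plain Python with a made-up Swahili example; not the actual Spark NLP implementation):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, entity_label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" tag or an I- tag that does not continue the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Rais", "Samia", "Suluhu", "anaishi", "Dodoma"]
tags = ["O", "B-PER", "I-PER", "O", "B-LOC"]
print(iob_to_chunks(tokens, tags))  # -> [('Samia Suluhu', 'PER'), ('Dodoma', 'LOC')]
```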
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_swahili_finetuned_ner_swahili| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|sw| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-swahili-finetuned-ner-swahili - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://github.com/masakhane-io/masakhane-ner --- layout: model title: English asr_english_filipino_wav2vec2_l_xls_r_test_05 TFWav2Vec2ForCTC from Khalsuu author: John Snow Labs name: asr_english_filipino_wav2vec2_l_xls_r_test_05 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_english_filipino_wav2vec2_l_xls_r_test_05` is an English model originally trained by Khalsuu. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_english_filipino_wav2vec2_l_xls_r_test_05_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_english_filipino_wav2vec2_l_xls_r_test_05_en_4.2.0_3.0_1664114092459.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_english_filipino_wav2vec2_l_xls_r_test_05_en_4.2.0_3.0_1664114092459.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_english_filipino_wav2vec2_l_xls_r_test_05", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_english_filipino_wav2vec2_l_xls_r_test_05", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
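`Wav2Vec2ForCTC` emits a frame-level character distribution that is decoded with CTC. A toy sketch of greedy CTC decoding (collapse repeated frame labels, then drop the blank symbol); this illustrates the idea only, not Spark NLP internals:

```python
from itertools import groupby

BLANK = "_"  # CTC blank symbol (hypothetical alphabet)

def ctc_greedy_decode(frame_labels):
    """Collapse repeated frame labels, then remove CTC blanks."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in collapsed if label != BLANK)

# Per-frame argmax labels for a short utterance (made up for illustration)
frames = ["h", "h", "_", "e", "e", "l", "_", "l", "l", "o", "_"]
print(ctc_greedy_decode(frames))  # -> hello
```

Note how the blank between the two `l` runs is what allows a doubled letter to survive the repeat-collapsing step.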
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_english_filipino_wav2vec2_l_xls_r_test_05| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Legal Exclusive License Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_exclusive_license_agreement date: 2022-11-10 tags: [en, legal, classification, agreement, exclusive_license, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_exclusive_license_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `exclusive-license-agreement` or not (Binary Classification). Longformers are limited to 4096 tokens, so only the first 4096 tokens of each document are taken into account. We have found that, for the vast majority of documents in legal corpora, 4096 tokens are enough to perform Document Classification, provided the documents are clean and contain only the legal text without extra leading material. If that is not the case for your documents, let us know and we can apply an alternative approach: splitting each document into 4096-token chunks, averaging their embeddings, and training on the averaged representation, so that the whole document is taken into account. In theory, however, this should not be required.
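The chunk-and-average fallback described above can be sketched as follows: split the token sequence into fixed-size windows, embed each window, and average the window embeddings into a single document vector. The `embed` function below is a hypothetical stand-in for the Longformer encoder:

```python
def chunk(tokens, size=4096):
    """Split a token list into consecutive windows of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def embed(tokens):
    # Stand-in for a real encoder: here, just the mean token length as a 1-d "vector".
    return [sum(len(t) for t in tokens) / len(tokens)]

def document_embedding(tokens, size=4096):
    """Average the embeddings of fixed-size chunks so the whole document counts."""
    vectors = [embed(c) for c in chunk(tokens, size)]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

doc = ["token"] * 10000      # a document longer than one 4096-token window
print(len(chunk(doc)))       # -> 3 windows (4096 + 4096 + 1808)
```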
## Predicted Entities `exclusive-license-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_exclusive_license_agreement_en_1.0.0_3.0_1668114191123.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_exclusive_license_agreement_en_1.0.0_3.0_1668114191123.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = nlp.ClassifierDLModel.pretrained("legclf_exclusive_license_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-----------------------------+ | result| +-----------------------------+ |[exclusive-license-agreement]| |[other]| |[other]| |[exclusive-license-agreement]| +-----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_exclusive_license_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.2 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support exclusive-license-agreement 1.00 0.97 0.98 32 other 0.99 1.00 0.99 66 accuracy - - 0.99 98 macro-avg 0.99 0.98 0.99 98 weighted-avg 0.99 0.99 0.99 98 ``` --- layout: model title: German Named Entity Recognition (from philschmid) author: John Snow Labs name: bert_ner_gbert_base_germaner date: 2022-05-09 tags: [bert, ner, token_classification, de, open_source] task: Named Entity Recognition language: de edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `gbert-base-germaner` is a German model originally trained by `philschmid`. ## Predicted Entities `LOC`, `PER`, `ORG`, `OTH` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_gbert_base_germaner_de_3.4.2_3.0_1652099349088.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_gbert_base_germaner_de_3.4.2_3.0_1652099349088.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_gbert_base_germaner","de") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_gbert_base_germaner","de") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
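The pipeline above relies on a pretrained deep-learning sentence detector (`sentence_detector_dl`) before tokenization. As a rough mental model only, a naive rule-based splitter looks like this; the pretrained model handles abbreviations, quotes, and other edge cases far better than this sketch:

```python
import re

def naive_split(text):
    """Split on sentence-final punctuation followed by whitespace and a capital."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-ZÄÖÜ])", text)
    return [p.strip() for p in parts if p.strip()]

text = "Ich liebe Spark NLP. Es läuft auf Apache Spark! Skaliert es? Ja."
print(naive_split(text))
```

A rule like this already fails on "Dr. ABC" or "z. B.", which is why a learned sentence detector is used in the actual pipeline.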
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_gbert_base_germaner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|410.3 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/philschmid/gbert-base-germaner - https://paperswithcode.com/sota?task=Token+Classification&dataset=germaner --- layout: model title: Dutch, Flemish BertForQuestionAnswering model (from horsbug98) author: John Snow Labs name: bert_qa_Part_2_BERT_Multilingual_Dutch_Model_E1 date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: nl edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Part_2_BERT_Multilingual_Dutch_Model_E1` is a Dutch, Flemish model originally trained by `horsbug98`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Part_2_BERT_Multilingual_Dutch_Model_E1_nl_4.0.0_3.0_1654178958570.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Part_2_BERT_Multilingual_Dutch_Model_E1_nl_4.0.0_3.0_1654178958570.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Part_2_BERT_Multilingual_Dutch_Model_E1","nl") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_Part_2_BERT_Multilingual_Dutch_Model_E1","nl") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("nl.answer_question.tydiqa.bert.multilingual").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
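Under the hood, question-answering transformers consume the question and context as a single packed sequence, truncated to the model's 512-token limit (the `setMaxSentenceLength(512)` above). A schematic of that packing, using whole words instead of the subword IDs a real model operates on:

```python
def pack_qa_input(question, context, max_len=512):
    """Pack question and context BERT-style: [CLS] question [SEP] context [SEP]."""
    q_toks = question.split()
    c_toks = context.split()
    # Reserve room for the three special tokens, truncating the context first.
    budget = max_len - len(q_toks) - 3
    tokens = ["[CLS]"] + q_toks + ["[SEP]"] + c_toks[:budget] + ["[SEP]"]
    # Segment 0 covers [CLS] + question + first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(q_toks) + 2) + [1] * (len(tokens) - len(q_toks) - 2)
    return tokens, segment_ids

tokens, segs = pack_qa_input("What's my name?", "My name is Clara and I live in Berkeley.")
print(tokens[:5])  # -> ['[CLS]', "What's", 'my', 'name?', '[SEP]']
```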
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_Part_2_BERT_Multilingual_Dutch_Model_E1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|nl| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/horsbug98/Part_2_BERT_Multilingual_Dutch_Model_E1 --- layout: model title: Detect Risk Factors author: John Snow Labs name: ner_risk_factors date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for Heart Disease Risk Factors and Personal Health Information. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. ## Predicted Entities `CAD`, `DIABETES`, `FAMILY_HIST`, `HYPERLIPIDEMIA`, `HYPERTENSION`, `MEDICATION`, `OBESE`, `PHI`, `SMOKER`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_RISK_FACTORS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_risk_factors_en_3.0.0_3.0_1617208442757.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_risk_factors_en_3.0.0_3.0_1617208442757.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_risk_factors", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["""HISTORY OF PRESENT ILLNESS: The patient is a 40-year-old white male who presents with a chief complaint of "chest pain". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. The severity of the pain has progressively increased. He describes the pain as a sharp and heavy pain which radiates to his neck & left arm. He ranks the pain a 7 on a scale of 1-10. He admits some shortness of breath & diaphoresis. He states that he has had nausea & 3 episodes of vomiting tonight. He denies any fever or chills. He admits prior episodes of similar pain prior to his PTCA in 1995. He states the pain is somewhat worse with walking and seems to be relieved with rest. There is no change in pain with positioning. He states that he took 3 nitroglycerin tablets sublingually over the past 1 hour, which he states has partially relieved his pain.
The patient ranks his present pain a 4 on a scale of 1-10. The most recent episode of pain has lasted one-hour. The patient denies any history of recent surgery, head trauma, recent stroke, abnormal bleeding such as blood in urine or stool or nosebleed.\n\nREVIEW OF SYSTEMS: All other systems reviewed & are negative.\n\nPAST MEDICAL HISTORY: Diabetes mellitus type II, hypertension, coronary artery disease, atrial fibrillation, status post PTCA in 1995 by Dr. ABC.\n\nSOCIAL HISTORY: Denies alcohol or drugs. Smokes 2 packs of cigarettes per day. Works as a banker.\n\nFAMILY HISTORY: Positive for coronary artery disease (father & brother)."""]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_risk_factors", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("""HISTORY OF PRESENT ILLNESS: The patient is a 40-year-old white male who presents with a chief complaint of "chest pain".The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. The severity of the pain has progressively increased. He describes the pain as a sharp and heavy pain which radiates to his neck & left arm.
He ranks the pain a 7 on a scale of 1-10. He admits some shortness of breath & diaphoresis. He states that he has had nausea & 3 episodes of vomiting tonight. He denies any fever or chills. He admits prior episodes of similar pain prior to his PTCA in 1995. He states the pain is somewhat worse with walking and seems to be relieved with rest. There is no change in pain with positioning. He states that he took 3 nitroglycerin tablets sublingually over the past 1 hour, which he states has partially relieved his pain. The patient ranks his present pain a 4 on a scale of 1-10. The most recent episode of pain has lasted one-hour.The patient denies any history of recent surgery, head trauma, recent stroke, abnormal bleeding such as blood in urine or stool or nosebleed.REVIEW OF SYSTEMS: All other systems reviewed & are negative.PAST MEDICAL HISTORY: Diabetes mellitus type II, hypertension, coronary artery disease, atrial fibrillation, status post PTCA in 1995 by Dr. ABC.SOCIAL HISTORY: Denies alcohol or drugs. Smokes 2 packs of cigarettes per day. Works as a banker.FAMILY HISTORY: Positive for coronary artery disease (father & brother).""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.risk_factors").predict("""chest pain""") ```
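The benchmarking section below reports raw per-label `tp`/`fp`/`fn` counts alongside the derived scores, which follow directly from precision = tp/(tp+fp), recall = tp/(tp+fn), and F1 as their harmonic mean; the micro-average pools the counts over all labels before dividing. A small sketch reproducing the micro-averaged row from the pooled totals:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw true/false positive/negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Micro-average: pool counts across all labels, then compute once.
p, r, f1 = prf(tp=4533, fp=1784, fn=1600)
print(round(p, 6), round(r, 6), round(f1, 6))
```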
## Results ```bash +------------------------------------+------------+ |chunk |ner | +------------------------------------+------------+ |diabetic |DIABETES | |coronary artery disease |CAD | |Diabetes mellitus type II |DIABETES | |hypertension |HYPERTENSION| |coronary artery disease |CAD | |1995 |PHI | |ABC |PHI | |Smokes 2 packs of cigarettes per day|SMOKER | |banker |PHI | +------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_risk_factors| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on plain n2c2 2014: De-identification and Heart Disease Risk Factors Challenge datasets with ``embeddings_clinical``. https://portal.dbmi.hms.harvard.edu/projects/n2c2-2014/ ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|-----------------:|------:|------:|------:|---------:|---------:|---------:| | 1 | I-HYPERLIPIDEMIA | 7 | 7 | 7 | 0.5 | 0.5 | 0.5 | | 2 | B-CAD | 104 | 52 | 101 | 0.666667 | 0.507317 | 0.576177 | | 3 | I-DIABETES | 127 | 67 | 92 | 0.654639 | 0.579909 | 0.615012 | | 4 | B-HYPERTENSION | 173 | 52 | 64 | 0.768889 | 0.729958 | 0.748918 | | 5 | B-OBESE | 46 | 20 | 3 | 0.69697 | 0.938776 | 0.8 | | 6 | B-PHI | 1968 | 599 | 252 | 0.766654 | 0.886486 | 0.822227 | | 7 | B-HYPERLIPIDEMIA | 71 | 17 | 14 | 0.806818 | 0.835294 | 0.820809 | | 8 | I-SMOKER | 116 | 73 | 94 | 0.613757 | 0.552381 | 0.581454 | | 9 | I-OBESE | 9 | 8 | 4 | 0.529412 | 0.692308 | 0.6 | | 10 | I-FAMILY_HIST | 5 | 0 | 10 | 1 | 0.333333 | 0.5 | | 11 | B-DIABETES | 190 | 59 | 58 | 0.763052 | 0.766129 | 0.764587 | | 12 | B-MEDICATION | 838 | 224 | 81 | 0.789077 | 0.911861 | 0.846037 | | 13 | I-PHI | 597 | 202 | 136 | 0.747184 | 0.814461 | 0.779373 | | 14 | Macro-average | 4533 | 1784 | 1600 | 0.620602 | 0.567477 | 0.592852 | | 15 | Micro-average | 4533 | 1784 | 1600 | 0.717588 | 0.739116 | 0.728193 |
``` --- layout: model title: Detect Persons, Locations, Organizations and Misc Entities in Dutch (WikiNER 6B 100) author: John Snow Labs name: wikiner_6B_100 date: 2020-05-10 task: Named Entity Recognition language: nl edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [ner, nl, open_source] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 6B 100 is trained with GloVe 6B 100 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_NL){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_NL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_6B_100_nl_2.5.0_2.4_1588546201140.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_6B_100_nl_2.5.0_2.4_1588546201140.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_100d") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("wikiner_6B_100", "nl") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["""William Henry Gates III (geboren 28 oktober 1955) is een Amerikaanse zakenmagnaat, softwareontwikkelaar, investeerder en filantroop. Hij is vooral bekend als medeoprichter van Microsoft Corporation. Tijdens zijn carrière bij Microsoft bekleedde Gates de functies van voorzitter, chief executive officer (CEO), president en chief software architect, terwijl hij ook de grootste individuele aandeelhouder was tot mei 2014. Hij is een van de bekendste ondernemers en pioniers van de microcomputerrevolutie van de jaren 70 en 80. Gates, geboren en getogen in Seattle, Washington, richtte in 1975 samen met jeugdvriend Paul Allen Microsoft op in Albuquerque, New Mexico; het werd 's werelds grootste personal computer softwarebedrijf. Gates leidde het bedrijf als voorzitter en CEO totdat hij in januari 2000 aftrad als CEO, maar hij bleef voorzitter en werd chief software architect. Eind jaren negentig kreeg Gates kritiek vanwege zijn zakelijke tactieken, die als concurrentiebeperkend werden beschouwd. Deze mening is bevestigd door tal van gerechtelijke uitspraken. In juni 2006 kondigde Gates aan dat hij zou overgaan naar een parttime functie bij Microsoft en fulltime gaan werken bij de Bill & Melinda Gates Foundation, de particuliere liefdadigheidsstichting die hij en zijn vrouw, Melinda Gates, in 2000 hebben opgericht. Hij droeg geleidelijk zijn taken over aan Ray Ozzie en Craig Mundie.
Hij trad in februari 2014 af als voorzitter van Microsoft en nam een nieuwe functie aan als technologieadviseur ter ondersteuning van de nieuw aangestelde CEO Satya Nadella."""]], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_100d") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("wikiner_6B_100", "nl") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""William Henry Gates III (geboren 28 oktober 1955) is een Amerikaanse zakenmagnaat, softwareontwikkelaar, investeerder en filantroop. Hij is vooral bekend als medeoprichter van Microsoft Corporation. Tijdens zijn carrière bij Microsoft bekleedde Gates de functies van voorzitter, chief executive officer (CEO), president en chief software architect, terwijl hij ook de grootste individuele aandeelhouder was tot mei 2014. Hij is een van de bekendste ondernemers en pioniers van de microcomputerrevolutie van de jaren 70 en 80. Gates, geboren en getogen in Seattle, Washington, richtte in 1975 samen met jeugdvriend Paul Allen Microsoft op in Albuquerque, New Mexico; het werd 's werelds grootste personal computer softwarebedrijf. Gates leidde het bedrijf als voorzitter en CEO totdat hij in januari 2000 aftrad als CEO, maar hij bleef voorzitter en werd chief software architect. Eind jaren negentig kreeg Gates kritiek vanwege zijn zakelijke tactieken, die als concurrentiebeperkend werden beschouwd. Deze mening is bevestigd door tal van gerechtelijke uitspraken. In juni 2006 kondigde Gates aan dat hij zou overgaan naar een parttime functie bij Microsoft en fulltime gaan werken bij de Bill & Melinda Gates Foundation, de particuliere liefdadigheidsstichting die hij en zijn vrouw, Melinda Gates, in 2000 hebben opgericht.
Hij droeg geleidelijk zijn taken over aan Ray Ozzie en Craig Mundie. Hij trad in februari 2014 af als voorzitter van Microsoft en nam een nieuwe functie aan als technologieadviseur ter ondersteuning van de nieuw aangestelde CEO Satya Nadella.""").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (geboren 28 oktober 1955) is een Amerikaanse zakenmagnaat, softwareontwikkelaar, investeerder en filantroop. Hij is vooral bekend als medeoprichter van Microsoft Corporation. Tijdens zijn carrière bij Microsoft bekleedde Gates de functies van voorzitter, chief executive officer (CEO), president en chief software architect, terwijl hij ook de grootste individuele aandeelhouder was tot mei 2014. Hij is een van de bekendste ondernemers en pioniers van de microcomputerrevolutie van de jaren 70 en 80. Gates, geboren en getogen in Seattle, Washington, richtte in 1975 samen met jeugdvriend Paul Allen Microsoft op in Albuquerque, New Mexico; het werd 's werelds grootste personal computer softwarebedrijf. Gates leidde het bedrijf als voorzitter en CEO totdat hij in januari 2000 aftrad als CEO, maar hij bleef voorzitter en werd chief software architect. Eind jaren negentig kreeg Gates kritiek vanwege zijn zakelijke tactieken, die als concurrentiebeperkend werden beschouwd. Deze mening is bevestigd door tal van gerechtelijke uitspraken. In juni 2006 kondigde Gates aan dat hij zou overgaan naar een parttime functie bij Microsoft en fulltime gaan werken bij de Bill & Melinda Gates Foundation, de particuliere liefdadigheidsstichting die hij en zijn vrouw, Melinda Gates, in 2000 hebben opgericht. Hij droeg geleidelijk zijn taken over aan Ray Ozzie en Craig Mundie.
Hij trad in februari 2014 af als voorzitter van Microsoft en nam een nieuwe functie aan als technologieadviseur ter ondersteuning van de nieuw aangestelde CEO Satya Nadella."""] ner_df = nlu.load('nl.ner.wikiner.glove.6B_100').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
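Because the NER network was trained on top of GloVe 6B 100-dimensional vectors, the lookup at inference time must produce vectors from the same space and dimensionality, and unseen words typically fall back to a zero vector. A toy version of that lookup, with a tiny hypothetical vocabulary standing in for the real 400k-token GloVe table:

```python
DIM = 100

# Tiny stand-in vocabulary; real GloVe 6B covers hundreds of thousands of tokens.
glove = {
    "microsoft": [0.1] * DIM,
    "seattle": [0.2] * DIM,
}

def lookup(token):
    """Return the GloVe vector for a token, or a zero vector if out-of-vocabulary."""
    return glove.get(token.lower(), [0.0] * DIM)

vec = lookup("Microsoft")
print(len(vec), lookup("zzzunknown")[0])  # dimensionality check and OOV fallback
```

Swapping in embeddings of a different dimensionality (or a different training corpus) would feed the NER network vectors it was never trained on, which is why the pipeline must reuse `glove_100d`.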
{:.h2_title} ## Results ```bash +-------------------------------+---------+ |chunk |ner_label| +-------------------------------+---------+ |William Henry Gates III |PER | |Amerikaanse |MISC | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PER | |CEO |ORG | |Gates |PER | |Seattle |LOC | |Washington |LOC | |Paul Allen |PER | |Microsoft |ORG | |Albuquerque |LOC | |New Mexico |LOC | |Gates |PER | |CEO |ORG | |CEO |ORG | |Gates |PER | |Gates |PER | |Microsoft |ORG | |Bill & Melinda Gates Foundation|ORG | +-------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wikiner_6B_100| |Type:|ner| |Compatibility:| Spark NLP 2.5.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|nl| |Case sensitive:|false| {:.h2_title} ## Data Source The model was trained on data from [https://nl.wikipedia.org](https://nl.wikipedia.org) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from sjchoure) author: John Snow Labs name: distilbert_qa_sjchoure_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `sjchoure`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_sjchoure_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772805034.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_sjchoure_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772805034.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sjchoure_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sjchoure_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
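Since this model is fine-tuned on SQuAD, its predictions are typically scored with SQuAD-style token-level F1 and exact match. The helper below is a minimal, self-contained sketch of that metric (not part of Spark NLP); the normalization here is a simplification of the official SQuAD evaluation script.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, collapse whitespace (SQuAD-style)."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def token_f1(prediction, gold):
    """Token-level F1 between a predicted answer span and a gold span."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Clara", "Clara"))           # 1.0 (exact match)
print(token_f1("in Berkeley", "Berkeley"))  # partial credit, ~0.667
```

For the example in the pipeline above, a prediction of "Clara" against the gold answer "Clara" scores a perfect 1.0.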
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_sjchoure_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/sjchoure/distilbert-base-uncased-finetuned-squad --- layout: model title: English DistilBertForQuestionAnswering Base Cased model (from Moussab) author: John Snow Labs name: distilbert_qa_base_cased_led_squad_orkg_no_label_5e_05 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-orkg-no-label-5e-05` is an English model originally trained by `Moussab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_orkg_no_label_5e_05_en_4.3.0_3.0_1672766791475.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_orkg_no_label_5e_05_en_4.3.0_3.0_1672766791475.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_orkg_no_label_5e_05","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_orkg_no_label_5e_05","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_cased_led_squad_orkg_no_label_5e_05| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Moussab/distilbert-base-cased-distilled-squad-orkg-no-label-5e-05 --- layout: model title: Extract Income and Social Status Entities from Social Determinants of Health Texts author: John Snow Labs name: ner_sdoh_income_social_status_wip date: 2023-02-10 tags: [licensed, clinical, social_determinants, en, ner, income, social_status, sdoh, public_health] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.8 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts income and social status information related to Social Determinants of Health from various kinds of biomedical documents. 
## Predicted Entities `Education`, `Marital_Status`, `Financial_Status`, `Population_Group`, `Employment` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/SOCIAL_DETERMINANT_NER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/SOCIAL_DETERMINANT_NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_income_social_status_wip_en_4.2.8_3.0_1675999206708.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_income_social_status_wip_en_4.2.8_3.0_1675999206708.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_sdoh_income_social_status_wip", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter ]) sample_texts = ["Pt is described as divorced and pleasant when approached but keeps to himself. Pt is working as a plumber, but he gets financial diffuculties. He has a son student at college. 
His family is imigrant for 2 years."] data = spark.createDataFrame(sample_texts, StringType()).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_sdoh_income_social_status_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter )) val data = Seq("Pt is described as divorced and pleasant when approached but keeps to himself. Pt is working as a plumber, but he gets financial diffuculties. He has a son student at college. His family is imigrant for 2 years.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash +-----------+----------------+-----+---+----------------------+ |sentence_id|ner_label |begin|end|chunk | +-----------+----------------+-----+---+----------------------+ |0 |Marital_Status |19 |26 |divorced | |1 |Employment |98 |104|plumber | |1 |Financial_Status|119 |140|financial diffuculties| |2 |Education |156 |162|student | |2 |Education |167 |173|college | |3 |Population_Group|190 |197|imigrant | +-----------+----------------+-----+---+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_sdoh_income_social_status_wip| |Compatibility:|Healthcare NLP 4.2.8+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|856.8 KB| ## Benchmarking ```bash label tp fp fn total precision recall f1 Education 95.0 20.0 18.0 113.0 0.826087 0.840708 0.833333 Population_Group 41.0 0.0 5.0 46.0 1.000000 0.891304 0.942529 Financial_Status 286.0 52.0 82.0 368.0 0.846154 0.777174 0.810198 Employment 3968.0 142.0 215.0 4183.0 0.965450 0.948601 0.956952 Marital_Status 167.0 1.0 7.0 174.0 0.994048 0.959770 0.976608 ``` --- layout: model title: Key Value Recognition on 10K filings author: John Snow Labs name: visualner_keyvalue_10kfilings date: 2022-09-21 tags: [en, licensed] task: OCR Object Detection language: en nav_key: models edition: Visual NLP 4.0.0 spark_version: 3.2 supported: true annotator: VisualDocumentNERv21 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Form Recognition / Key Value extraction model, trained on the summary page of SEC 10K filings. It extracts KEY, VALUE, and HEADER entities, where HEADER is the title of the filing. 
## Predicted Entities `KEY`, `VALUE`, `HEADER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/visualner_keyvalue_10kfilings_en_4.0.0_3.2_1663781115795.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/visualner_keyvalue_10kfilings_en_4.0.0_3.2_1663781115795.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from pyspark.sql import functions as f binary_to_image = BinaryToImage()\ .setOutputCol("image") \ .setImageType(ImageType.TYPE_3BYTE_BGR) img_to_hocr = ImageToHocr()\ .setInputCol("image")\ .setOutputCol("hocr")\ .setIgnoreResolution(False)\ .setOcrParams(["preserve_interword_spaces=0"]) tokenizer = HocrTokenizer()\ .setInputCol("hocr")\ .setOutputCol("token") doc_ner = VisualDocumentNerV21()\ .pretrained("visualner_keyvalue_10kfilings", "en", "clinical/ocr")\ .setInputCols(["token", "image"])\ .setOutputCol("entities") draw = ImageDrawAnnotations() \ .setInputCol("image") \ .setInputChunksCol("entities") \ .setOutputCol("image_with_annotations") \ .setFontSize(10) \ .setLineWidth(4)\ .setRectColor(Color.red) # OCR pipeline pipeline = PipelineModel(stages=[ binary_to_image, img_to_hocr, tokenizer, doc_ner, draw ]) bin_df = spark.read.format("binaryFile").load('data/t01.jpg') bin_df.show() results = pipeline.transform(bin_df).cache() res = results.collect() ## since pyspark 2.3 doesn't have element_at, 'getItem' is invoked path_array = f.split(results['path'], '/') # from pyspark 2.4 onwards: # results.withColumn("filename", f.element_at(f.split("path", "/"), -1)) \ results.withColumn('filename', path_array.getItem(f.size(path_array)- 1)) \ .withColumn("exploded_entities", f.explode("entities")) \ .select("filename", "exploded_entities") \ .show(truncate=False) ```
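The entity labels in the output below use a TAG-B / TAG-I suffix convention (e.g. `HEADER-B`, `HEADER-I`), with `OTHERS` marking non-entity tokens; multi-token entities are reassembled by merging each B tag with the I tags that follow it. A minimal, self-contained sketch of that merge (illustration only, not a Spark NLP API), using tokens from the output table:

```python
def merge_bio(tokens):
    """Group (word, label) pairs into (tag, text) entities, where labels
    use the TAG-B / TAG-I suffix convention and 'OTHERS' is non-entity."""
    entities, current_tag, current_words = [], None, []
    for word, label in tokens:
        if label.endswith("-B"):                      # a new entity starts
            if current_tag:
                entities.append((current_tag, " ".join(current_words)))
            current_tag, current_words = label[:-2], [word]
        elif label.endswith("-I") and current_tag == label[:-2]:
            current_words.append(word)                # continue the open entity
        else:                                         # OTHERS, or a dangling I tag
            if current_tag:
                entities.append((current_tag, " ".join(current_words)))
            current_tag, current_words = None, []
    if current_tag:                                   # flush the last open entity
        entities.append((current_tag, " ".join(current_words)))
    return entities

sample = [("UNITED", "HEADER-B"), ("STATES", "HEADER-I"),
          ("SECURITIES", "HEADER-B"), ("AND", "HEADER-I"),
          ("EXCHANGE", "HEADER-I"), ("COMMISSION", "HEADER-I"),
          ("Yes", "KEY-B"), ("LI", "VALUE-B"), ("No", "KEY-B")]
print(merge_bio(sample))
# [('HEADER', 'UNITED STATES'), ('HEADER', 'SECURITIES AND EXCHANGE COMMISSION'),
#  ('KEY', 'Yes'), ('VALUE', 'LI'), ('KEY', 'No')]
```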
## Results ```bash +--------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ |filename|exploded_entities | +--------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ |t01.jpg |{named_entity, 268, 269, OTHERS, {confidence -> 96, width -> 14, x -> 822, y -> 1101, word -> of, token -> of, height -> 34}, []} | |t01.jpg |{named_entity, 271, 273, OTHERS, {confidence -> 89, width -> 33, x -> 837, y -> 1112, word -> the, token -> the, height -> 13}, []} | |t01.jpg |{named_entity, 275, 277, OTHERS, {confidence -> 89, width -> 30, x -> 874, y -> 1113, word -> Act., token -> act, height -> 12}, []} | |t01.jpg |{named_entity, 280, 282, KEY-B, {confidence -> 94, width -> 26, x -> 910, y -> 1113, word -> Yes, token -> yes, height -> 12}, []} | |t01.jpg |{named_entity, 284, 285, VALUE-B, {confidence -> 45, width -> 13, x -> 944, y -> 1112, word -> LI, token -> li, height -> 13}, []} | |t01.jpg |{named_entity, 287, 288, KEY-B, {confidence -> 83, width -> 22, x -> 963, y -> 1113, word -> No, token -> no, height -> 12}, []} | |t01.jpg |{named_entity, 290, 295, HEADER-B, {confidence -> 96, width -> 89, x -> 1493, y -> 13, word -> UNITED, token -> united, height -> 16}, []} | |t01.jpg |{named_entity, 297, 302, HEADER-I, {confidence -> 95, width -> 83, x -> 1590, y -> 13, word -> STATES, token -> states, height -> 16}, []} | |t01.jpg |{named_entity, 304, 313, HEADER-B, {confidence -> 95, width -> 221, x -> 1186, y -> 45, word -> SECURITIES, token -> securities, height -> 25}, []} | |t01.jpg |{named_entity, 315, 317, HEADER-I, {confidence -> 95, width -> 80, x -> 1415, y -> 45, word -> AND, token -> and, height -> 25}, []} | |t01.jpg |{named_entity, 319, 326, HEADER-I, {confidence -> 96, width -> 212, x -> 1507, y -> 45, word -> EXCHANGE, token -> exchange, height -> 25}, []} | 
|t01.jpg |{named_entity, 328, 337, HEADER-I, {confidence -> 95, width -> 249, x -> 1732, y -> 45, word -> COMMISSION, token -> commission, height -> 25}, []} | |t01.jpg |{named_entity, 339, 348, HEADER-B, {confidence -> 96, width -> 125, x -> 1461, y -> 86, word -> Washington,, token -> washington, height -> 21}, []} | |t01.jpg |{named_entity, 351, 351, HEADER-I, {confidence -> 93, width -> 43, x -> 1595, y -> 86, word -> D.C., token -> d, height -> 16}, []} | |t01.jpg |{named_entity, 356, 360, HEADER-I, {confidence -> 93, width -> 59, x -> 1646, y -> 86, word -> 20549, token -> 20549, height -> 16}, []} | |t01.jpg |{named_entity, 362, 365, HEADER-B, {confidence -> 93, width -> 112, x -> 1484, y -> 159, word -> FORM, token -> form, height -> 25}, []} | |t01.jpg |{named_entity, 367, 368, HEADER-I, {confidence -> 91, width -> 77, x -> 1609, y -> 159, word -> 10-K, token -> 10, height -> 25}, []} | +--------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|visualner_keyvalue_10kfilings| |Type:|ocr| |Compatibility:|Visual NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|744.3 MB| ## References Sec 10K filings --- layout: model title: German BertForQuestionAnswering model (from bhavikardeshna) author: John Snow Labs name: bert_qa_multilingual_bert_base_cased_german date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: de edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`multilingual-bert-base-cased-german` is a German model originally trained by `bhavikardeshna`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_bert_base_cased_german_de_4.0.0_3.0_1654188494456.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_bert_base_cased_german_de_4.0.0_3.0_1654188494456.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_multilingual_bert_base_cased_german","de") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_multilingual_bert_base_cased_german","de") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("de.answer_question.bert.multilingual_german_tuned_base_cased.by_bhavikardeshna").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_multilingual_bert_base_cased_german| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|de| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/bhavikardeshna/multilingual-bert-base-cased-german --- layout: model title: Detect Problems, Tests and Treatments author: John Snow Labs name: ner_healthcare date: 2021-04-21 tags: [ner, licensed, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for healthcare. Includes Problem, Test and Treatment entities. The SparkNLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. ## Predicted Entities `PROBLEM`, `TEST`, `TREATMENT`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_en_3.0.0_3.0_1619015116634.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_en_3.0.0_3.0_1619015116634.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_healthcare", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . 
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG ."]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_healthcare", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . 
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG .""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.healthcare").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG .""") ```
## Results ```bash | | chunk | ner_label | |---|-------------------------------|-----------| | 0 | a respiratory tract infection | PROBLEM | | 1 | metformin | TREATMENT | | 2 | glipizide | TREATMENT | | 3 | dapagliflozin | TREATMENT | | 4 | T2DM | PROBLEM | | 5 | atorvastatin | TREATMENT | | 6 | gemfibrozil | TREATMENT | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_healthcare| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on 2010 i2b2 challenge data. https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ ## Benchmarking ```bash label tp fp fn prec rec f1 I-TREATMENT 6625 1187 1329 0.848054 0.832914 0.840416 I-PROBLEM 15142 1976 2542 0.884566 0.856254 0.87018 B-PROBLEM 11005 1065 1587 0.911765 0.873968 0.892466 I-TEST 6748 923 1264 0.879677 0.842237 0.86055 B-TEST 8196 942 1029 0.896914 0.888455 0.892665 B-TREATMENT 8271 1265 1073 0.867345 0.885167 0.876165 Macro-average 55987 7358 8824 0.881387 0.863166 0.872181 Micro-average 55987 7358 8824 0.883842 0.86385 0.873732 ``` --- layout: model title: Google's Tapas Table Understanding (Large, WIKISQL) author: John Snow Labs name: table_qa_tapas_large_finetuned_wikisql_supervised date: 2022-09-30 tags: [en, table, qa, question, answering, open_source] task: Table Question Answering language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: TapasForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Zero-shot Table Understanding Model which allows you to carry out Question Answering on Spark DataFrames. If your table is stored in a file format such as CSV, load it into Spark before using the model. 
Size of this model: Large. Has aggregation operations: True. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_large_finetuned_wikisql_supervised_en_4.2.0_3.0_1664530623547.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_large_finetuned_wikisql_supervised_en_4.2.0_3.0_1664530623547.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python json_data = """ { "header": ["name", "money", "age"], "rows": [ ["Donald Trump", "$100,000,000", "75"], ["Elon Musk", "$20,000,000,000,000", "55"] ] } """ queries = [ "Who earns less than 200,000,000?", "Who earns 100,000,000?", "How much money has Donald Trump?", "How old are they?", ] data = spark.createDataFrame([ [json_data, " ".join(queries)] ]).toDF("table_json", "questions") document_assembler = MultiDocumentAssembler() \ .setInputCols("table_json", "questions") \ .setOutputCols("document_table", "document_questions") sentence_detector = SentenceDetector() \ .setInputCols(["document_questions"]) \ .setOutputCol("questions") table_assembler = TableAssembler()\ .setInputCols(["document_table"])\ .setOutputCol("table") tapas = TapasForQuestionAnswering\ .pretrained("table_qa_tapas_large_finetuned_wikisql_supervised","en")\ .setInputCols(["questions", "table"])\ .setOutputCol("answers") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, table_assembler, tapas ]) model = pipeline.fit(data) model\ .transform(data)\ .selectExpr("explode(answers) AS answer")\ .select("answer")\ .show(truncate=False) ```
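The `table_json` string above is written by hand; if your table lives in a CSV file, a small stdlib helper can produce the same `{"header": ..., "rows": ...}` layout that the TableAssembler consumes. A minimal sketch (the helper name and the assumption that the first CSV row holds the column names are mine):

```python
import csv
import io
import json

def csv_to_table_json(csv_text):
    """Convert CSV text (header in the first row) into the
    {"header": [...], "rows": [[...], ...]} JSON layout used above."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return json.dumps({"header": rows[0], "rows": rows[1:]})

csv_text = "name,money,age\nDonald Trump,$100000000,75\nElon Musk,$20000000000000,55"
print(csv_to_table_json(csv_text))
```

For a file on disk, read it with `open(path).read()` (or load it as a Spark DataFrame and serialize per partition) before passing the text to the helper.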
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |answer | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} | |{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|table_qa_tapas_large_finetuned_wikisql_supervised| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| ## References https://www.microsoft.com/en-us/download/details.aspx?id=54253 https://github.com/ppasupat/WikiTableQuestions https://github.com/salesforce/WikiSQL --- layout: model title: Legal NER - License / Permission Clauses (Bert, sm) author: John Snow Labs name: legner_bert_grants date: 2022-08-12 tags: [en, legal, ner, grants, permissions, licenses, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## 
Description This model aims to detect License grants / permissions in agreements, provided by a Subject (PERMISSION_SUBJECT) to a Recipient (PERMISSION_INDIRECT_OBJECT). The permission itself is in the PERMISSION tag. There is a lighter (non-transformer based) version of this model available as `legner_grants_md`. ## Predicted Entities `PERMISSION`, `PERMISSION_SUBJECT`, `PERMISSION_OBJECT`, `PERMISSION_INDIRECT_OBJECT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_bert_grants_en_1.0.0_3.2_1660292396316.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_bert_grants_en_1.0.0_3.2_1660292396316.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") tokenClassifier = legal.BertForTokenClassification.pretrained("legner_bert_grants", "en", "legal/models")\ .setInputCols("token", "document")\ .setOutputCol("label")\ .setCaseSensitive(True) pipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, tokenClassifier ] ) import pandas as pd p_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']}))) text = """Fox grants to Licensee a limited, exclusive (except as otherwise may be provided in this Agreement), non-transferable (except as permitted in Paragraph 17(d)) right and license""" res = p_model.transform(spark.createDataFrame([[text]]).toDF("text")) from pyspark.sql import functions as F res.select(F.explode(F.arrays_zip('token.result', 'label.result')).alias("cols")) \ .select(F.expr("cols['0']").alias("token"), F.expr("cols['1']").alias("ner_label"))\ .show(20, truncate=100) ```
## Results ```bash +----------------+----------------------------+ | token| ner_label| +----------------+----------------------------+ | Fox| B-PERMISSION_SUBJECT| | grants| O| | to| O| | Licensee|B-PERMISSION_INDIRECT_OBJECT| | a| O| | limited| B-PERMISSION| | ,| I-PERMISSION| | exclusive| I-PERMISSION| | (| I-PERMISSION| | except| I-PERMISSION| | as| I-PERMISSION| | otherwise| I-PERMISSION| | may| I-PERMISSION| | be| I-PERMISSION| | provided| I-PERMISSION| | in| I-PERMISSION| | this| I-PERMISSION| | Agreement| I-PERMISSION| | ),| I-PERMISSION| |non-transferable| I-PERMISSION| +----------------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_bert_grants| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|412.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## References Manual annotations on CUAD dataset ## Benchmarking ```bash label precision recall f1-score support B-PERMISSION 0.88 0.79 0.83 38 B-PERMISSION_INDIRECT_OBJECT 0.85 0.94 0.89 36 B-PERMISSION_SUBJECT 0.89 0.85 0.87 40 I-PERMISSION 0.80 0.69 0.74 342 O 0.94 0.97 0.95 1827 accuracy - - 0.92 2292 macro-avg 0.85 0.81 0.86 2292 weighted-avg 0.91 0.92 0.91 2292 ``` --- layout: model title: Stop Words Cleaner for Somali author: John Snow Labs name: stopwords_so date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: so edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, so] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. 
Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_so_so_2.5.4_2.4_1594742441799.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_so_so_2.5.4_2.4_1594742441799.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_so", "so") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Marka laga reebo inuu yahay boqorka woqooyiga, John Snow waa dhakhtar Ingiriis ah oo hormuud u ah horumarinta suuxdinta iyo nadaafadda caafimaadka.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_so", "so") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Marka laga reebo inuu yahay boqorka woqooyiga, John Snow waa dhakhtar Ingiriis ah oo hormuud u ah horumarinta suuxdinta iyo nadaafadda caafimaadka.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Marka laga reebo inuu yahay boqorka woqooyiga, John Snow waa dhakhtar Ingiriis ah oo hormuud u ah horumarinta suuxdinta iyo nadaafadda caafimaadka."""] stopword_df = nlu.load('so.stopwords').predict(text) stopword_df[['cleanTokens']] ```
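Conceptually, the cleaner filters tokens against a fixed stopword list. A minimal pure-Python sketch of that behavior; the tiny list below is illustrative only, not the model's actual (much larger) Somali list:

```python
# Illustrative only: a few common Somali function words.
SOMALI_STOPWORDS = {"waa", "oo", "iyo", "u", "ah", "inuu"}

def clean_tokens(tokens, stopwords=SOMALI_STOPWORDS, case_sensitive=False):
    """Drop tokens found in the stopword list; case-insensitive by default,
    matching the card's `Case sensitive: false`."""
    if case_sensitive:
        return [t for t in tokens if t not in stopwords]
    lowered = {s.lower() for s in stopwords}
    return [t for t in tokens if t.lower() not in lowered]

print(clean_tokens(["John", "Snow", "waa", "dhakhtar", "Ingiriis", "ah"]))
```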
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=4, result='Marka', metadata={'sentence': '0'}), Row(annotatorType='token', begin=6, end=9, result='laga', metadata={'sentence': '0'}), Row(annotatorType='token', begin=11, end=15, result='reebo', metadata={'sentence': '0'}), Row(annotatorType='token', begin=22, end=26, result='yahay', metadata={'sentence': '0'}), Row(annotatorType='token', begin=28, end=34, result='boqorka', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_so| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|so| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: Sentence Entity Resolver for LOINC (sbluebert_base_uncased_mli embeddings) author: John Snow Labs name: sbluebertresolve_loinc_uncased date: 2021-12-31 tags: [loinc, entity_resolution, clinical, en, licensed] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted clinical NER entities to LOINC codes using `sbluebert_base_uncased_mli` Sentence Bert Embeddings. It was trained on an augmented version of the uncased (lowercased) dataset used in previous LOINC resolver models. 
## Predicted Entities `LOINC Code` {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbluebertresolve_loinc_uncased_en_3.3.4_2.4_1640945648577.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbluebertresolve_loinc_uncased_en_3.3.4_2.4_1640945648577.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols("document")\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical','en', 'clinical/models')\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(['Test']) chunk2doc = Chunk2Doc() \ .setInputCols("ner_chunk") \ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbluebert_base_uncased_mli", "en", "clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings")\ .setCaseSensitive(True) resolver = SentenceEntityResolverModel.pretrained("sbluebertresolve_loinc_uncased", "en", "clinical/models") \ .setInputCols(["sbert_embeddings"])\ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") pipeline_loinc = Pipeline(stages = [ documentAssembler, sentenceDetector, tokenizer, word_embeddings, ner, ner_converter, chunk2doc, sbert_embedder, resolver ]) test = """The patient is a 22-year-old female with a history of obesity. She has a BMI of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126. 
Her hgba1c is 8.2%.""" sparkDF = spark.createDataFrame([[test]]).toDF("text") result = pipeline_loinc.fit(sparkDF).transform(sparkDF) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Test")) val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbluebert_base_uncased_mli", "en", "clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") .setCaseSensitive(true) val resolver = SentenceEntityResolverModel.pretrained("sbluebertresolve_loinc_uncased", "en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline_loinc = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, ner, ner_converter, chunk2doc, sbert_embedder, resolver)) val data = Seq("The patient is a 22-year-old female with a history of obesity. She has a BMI of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126. 
Her hgba1c is 8.2%.").toDF("text") val result = pipeline_loinc.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.loinc_uncased").predict("""The patient is a 22-year-old female with a history of obesity. She has a BMI of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126. Her hgba1c is 8.2%.""") ```
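Conceptually, the resolver embeds each NER chunk and returns the LOINC code whose stored vector is nearest under the configured distance (`EUCLIDEAN` above). A toy pure-Python illustration; the 3-dimensional vectors (and the middle code's mapping) are made up for the sketch, while real sbluebert embeddings are 768-dimensional:

```python
import math

# Hypothetical embedding table: LOINC code -> vector (illustrative values).
CODE_VECTORS = {
    "39156-5": [0.9, 0.1, 0.0],   # Body mass index
    "1920-8":  [0.1, 0.8, 0.2],   # illustrative second code
    "41995-2": [0.0, 0.2, 0.9],   # Hemoglobin A1c
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def resolve(chunk_vec):
    """Return (best_code, all_codes ranked by distance), mirroring the
    resolver's `resolution` / `all_codes` outputs."""
    ranked = sorted(CODE_VECTORS, key=lambda c: euclidean(chunk_vec, CODE_VECTORS[c]))
    return ranked[0], ranked

# A chunk embedding close to the BMI vector resolves to the BMI code.
best, ranked = resolve([0.85, 0.15, 0.05])
print(best, ranked)
```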
## Results ```bash +-------------------------------------+------+-----------+----------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ner_chunk|entity| resolution| all_codes| resolutions| +-------------------------------------+------+-----------+----------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | BMI| Test| 39156-5|[39156-5, LP35925-4, BDYCRC, 73964-9, 59574-4,...] |[Body mass index, Body mass index (BMI), Body circumference, Body muscle mass, Body mass index (BMI) [Percentile], ...] | | aspartate aminotransferase| Test| 14409-7|['14409-7', '16325-3', '1916-6', '16324-6',...] |['Aspartate aminotransferase', 'Alanine aminotransferase/Aspartate aminotransferase', 'Aspartate aminotransferase/Alanine aminotransferase', 'Alanine aminotransferase', ...] | | alanine aminotransferase| Test| 16324-6|['16324-6', '1916-6', '16325-3', '59245-1',...] |['Alanine aminotransferase', 'Aspartate aminotransferase/Alanine aminotransferase', 'Alanine aminotransferase/Aspartate aminotransferase', 'Alanine glyoxylate aminotransferase',...] | | hgba1c| Test| 41995-2|['41995-2', 'LP35944-5', 'LP19717-5', '43150-2',...]|['Hemoglobin A1c', 'HbA1c measurement device', 'HBA1 gene', 'HbA1c measurement device panel', ...] 
| +-------------------------------------+------+-----------+------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbluebertresolve_loinc_uncased| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[loinc_code]| |Language:|en| |Size:|647.9 MB| |Case sensitive:|false| ## Data Source Trained on standard LOINC coding system. --- layout: model title: Pipeline to Detect Drug Information (Large) author: John Snow Labs name: ner_posology_large_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_posology_large](https://nlp.johnsnowlabs.com/2021/03/31/ner_posology_large_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_large_pipeline_en_4.3.0_3.2_1678869355529.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_large_pipeline_en_4.3.0_3.2_1678869355529.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_posology_large_pipeline", "en", "clinical/models") text = '''The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. 
daily, and Bactrim DS b.i.d.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_posology_large_pipeline", "en", "clinical/models") val text = "The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d." 
val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posoloy_large.pipeline").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""") ```
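A common post-processing step (not part of the pretrained pipeline itself) is to regroup the flat chunk list so that each DRUG carries the attribute chunks (DOSAGE, ROUTE, FREQUENCY, ...) that follow it. A minimal sketch over (text, label) pairs like those in the Results table:

```python
def group_posology(chunks):
    """Attach non-DRUG chunks to the most recent DRUG chunk."""
    grouped, current = [], None
    for text, label in chunks:
        if label == "DRUG":
            current = {"drug": text}
            grouped.append(current)
        elif current is not None:
            current.setdefault(label.lower(), []).append(text)
    return grouped

# A slice of the pipeline output shown below.
chunks = [("Fragmin", "DRUG"), ("5000 units", "DOSAGE"),
          ("subcutaneously", "ROUTE"), ("daily", "FREQUENCY"),
          ("Lantus", "DRUG"), ("40 units", "DOSAGE")]
print(group_posology(chunks))
```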
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:---------------|--------:|------:|:------------|-------------:| | 0 | insulin | 59 | 65 | DRUG | 0.9752 | | 1 | Bactrim | 346 | 352 | DRUG | 0.9994 | | 2 | for 14 days | 354 | 364 | DURATION | 0.796067 | | 3 | Fragmin | 925 | 931 | DRUG | 0.9995 | | 4 | 5000 units | 933 | 942 | DOSAGE | 0.6773 | | 5 | subcutaneously | 944 | 957 | ROUTE | 0.9987 | | 6 | daily | 959 | 963 | FREQUENCY | 0.999 | | 7 | Xenaderm | 966 | 973 | DRUG | 0.8853 | | 8 | topically | 985 | 993 | ROUTE | 0.9916 | | 9 | b.i.d | 995 | 999 | FREQUENCY | 0.995 | | 10 | Lantus | 1003 | 1008 | DRUG | 0.9994 | | 11 | 40 units | 1010 | 1017 | DOSAGE | 0.86805 | | 12 | subcutaneously | 1019 | 1032 | ROUTE | 0.9986 | | 13 | at bedtime | 1034 | 1043 | FREQUENCY | 0.84895 | | 14 | OxyContin | 1046 | 1054 | DRUG | 0.9875 | | 15 | 30 mg | 1056 | 1060 | STRENGTH | 0.97695 | | 16 | p.o. | 1062 | 1065 | ROUTE | 0.8367 | | 17 | q.12 h | 1067 | 1072 | FREQUENCY | 0.93305 | | 18 | folic acid | 1076 | 1085 | DRUG | 0.9569 | | 19 | 1 mg | 1087 | 1090 | STRENGTH | 0.83715 | | 20 | daily | 1092 | 1096 | FREQUENCY | 0.9998 | | 21 | levothyroxine | 1099 | 1111 | DRUG | 0.9794 | | 22 | 0.1 mg | 1113 | 1118 | STRENGTH | 0.9325 | | 23 | p.o. 
| 1120 | 1123 | ROUTE | 0.6783 | | 24 | daily | 1125 | 1129 | FREQUENCY | 0.9925 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_large_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English DistilBertForQuestionAnswering model (from mcurmei) Unique author: John Snow Labs name: distilbert_qa_unique_N_max date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `unique_N_max` is an English model originally trained by `mcurmei`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_unique_N_max_en_4.0.0_3.0_1654729023137.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_unique_N_max_en_4.0.0_3.0_1654729023137.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_unique_N_max","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_unique_N_max","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.unique_n_max.by_mcurmei").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
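Under the hood, extractive QA models like this one score every candidate answer span via start/end token logits and return the highest-scoring valid span. A toy sketch of that selection step; the tokens and logits below are made up:

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick (start, end) maximizing start_logits[s] + end_logits[e], e >= s,
    with the span capped at max_len tokens."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
end   = [0.0, 0.1, 0.0, 4.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```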
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_unique_N_max| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mcurmei/unique_N_max --- layout: model title: Chinese Bert Embeddings (Base) author: John Snow Labs name: bert_embeddings_mengzi_bert_base date: 2022-04-08 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `mengzi-bert-base` is a Chinese model originally trained by `Langboat`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_mengzi_bert_base_zh_3.4.2_3.0_1649441520471.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_mengzi_bert_base_zh_3.4.2_3.0_1649441520471.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_mengzi_bert_base","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_mengzi_bert_base","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.mengzi_bert_base").predict("""I love Spark NLP""") ```
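The annotator emits one embedding vector per token; a common follow-up is mean-pooling them into a single sentence vector. A pure-Python sketch with tiny made-up vectors (real BERT embeddings are 768-dimensional):

```python
def mean_pool(token_vectors):
    """Average per-token embeddings into one sentence embedding."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# 3 tokens, dimension 2 (illustrative).
vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(vectors))  # [3.0, 4.0]
```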
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_mengzi_bert_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|245.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/Langboat/mengzi-bert-base - https://arxiv.org/abs/2110.06696 --- layout: model title: English asr_wav2vec2_base_timit_demo_colab3_by_hassnain TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab3_by_hassnain date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab3_by_hassnain` is an English model originally trained by hassnain. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab3_by_hassnain_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab3_by_hassnain_en_4.2.0_3.0_1664040251119.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab3_by_hassnain_en_4.2.0_3.0_1664040251119.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab3_by_hassnain", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab3_by_hassnain", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
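The snippets above assume an `audioDf` whose `audio_content` column already holds the waveform as an array of floats sampled at 16 kHz, which is what Wav2Vec2 expects. A minimal sketch of normalizing signed 16-bit PCM samples into that float range; the commented Spark line is an untested assumption about the schema:

```python
def pcm16_to_floats(samples):
    """Normalize signed 16-bit PCM integers to floats in [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]

pcm = [0, 16384, -16384, 32767, -32768]
floats = pcm16_to_floats(pcm)
print(floats)
# With a running Spark session one might then build (untested sketch):
# audioDf = spark.createDataFrame([(floats,)], ["audio_content"])
```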
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab3_by_hassnain| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: Sentence Embeddings - sbiobert (tuned) author: John Snow Labs name: sbiobert_jsl_cased date: 2021-06-30 tags: [embeddings, clinical, licensed, en] task: Embeddings language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 2.4 supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained to generate contextual sentence embeddings of input sentences. {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobert_jsl_cased_en_3.1.0_2.4_1625050229429.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobert_jsl_cased_en_3.1.0_2.4_1625050229429.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sbiobert_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_jsl_cased", "en", "clinical/models")\ .setInputCols(["sentence"])\ .setOutputCol("sbert_embeddings") ``` ```scala val sbiobert_embeddings = BertSentenceEmbeddings .pretrained("sbiobert_jsl_cased","en","clinical/models") .setInputCols(Array("sentence")) .setOutputCol("sbert_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed_sentence.biobert.jsl_cased").predict("""Put your text here.""") ```
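The `STS(cos)` figure in this card's benchmark refers to cosine similarity between sentence vectors. A minimal pure-Python version of that score; the 4-dimensional vectors below stand in for the real 768-dimensional sbiobert embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for two sentence embeddings.
v1 = [0.2, 0.1, 0.9, 0.3]
v2 = [0.25, 0.05, 0.85, 0.35]
print(round(cosine(v1, v2), 3))
```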
## Results ```bash Gives a 768-dimensional vector representation of the sentence. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobert_jsl_cased| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Case sensitive:|true| ## Data Source Tuned on the MedNLI and UMLS datasets ## Benchmarking ```bash MedNLI Score Acc 0.788 STS(cos) 0.736 ``` --- layout: model title: Arabic Named Entity Recognition (from hatmimoha) author: John Snow Labs name: bert_ner_arabic_ner date: 2022-05-09 tags: [bert, ner, token_classification, ar, open_source] task: Named Entity Recognition language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `arabic-ner` is an Arabic model originally trained by `hatmimoha`. ## Predicted Entities `DATE`, `PRICE`, `PRODUCT`, `COMPETITION`, `ORGANIZATION`, `LOCATION`, `DISEASE`, `PERSON`, `EVENT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_arabic_ner_ar_3.4.2_3.0_1652099460424.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_arabic_ner_ar_3.4.2_3.0_1652099460424.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_arabic_ner","ar") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_arabic_ner","ar") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("أنا أحب الشرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_arabic_ner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ar| |Size:|412.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/hatmimoha/arabic-ner - https://github.com/hatmimoha/arabic-ner --- layout: model title: English image_classifier_vit_baseball_stadium_foods ViTForImageClassification from nateraw author: John Snow Labs name: image_classifier_vit_baseball_stadium_foods date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_baseball_stadium_foods` is an English model originally trained by nateraw. ## Predicted Entities `nachos`, `popcorn`, `cotton candy`, `hot dog`, `hamburger` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_baseball_stadium_foods_en_4.1.0_3.0_1660169381512.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_baseball_stadium_foods_en_4.1.0_3.0_1660169381512.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_baseball_stadium_foods", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_baseball_stadium_foods", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_baseball_stadium_foods| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Pipeline to Detect Radiology Concepts with Biobert (WIP Greedy) author: John Snow Labs name: jsl_rd_ner_wip_greedy_biobert_pipeline date: 2022-03-21 tags: [licensed, ner, wip, biobert, radiology, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [jsl_rd_ner_wip_greedy_biobert](https://nlp.johnsnowlabs.com/2021/07/26/jsl_rd_ner_wip_greedy_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_rd_ner_wip_greedy_biobert_pipeline_en_3.4.1_3.0_1647867332070.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_rd_ner_wip_greedy_biobert_pipeline_en_3.4.1_3.0_1647867332070.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("jsl_rd_ner_wip_greedy_biobert_pipeline", "en", "clinical/models") pipeline.annotate("Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.") ``` ```scala val pipeline = new PretrainedPipeline("jsl_rd_ner_wip_greedy_biobert_pipeline", "en", "clinical/models") pipeline.annotate("Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.wip_greedy_biobert.pipeline").predict("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""") ```
## Results ```bash | | chunk | entity | |---:|:----------------------|:--------------------------| | 0 | Bilateral | Direction | | 1 | breast | BodyPart | | 2 | ultrasound | ImagingTest | | 3 | ovoid mass | ImagingFindings | | 4 | 0.5 x 0.5 x 0.4 | Measurements | | 5 | cm | Units | | 6 | left | Direction | | 7 | shoulder | BodyPart | | 8 | mass | ImagingFindings | | 9 | isoechoic echotexture | ImagingFindings | | 10 | muscle | BodyPart | | 11 | internal color flow | ImagingFindings | | 12 | benign fibrous tissue | ImagingFindings | | 13 | lipoma | Disease_Syndrome_Disorder | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|jsl_rd_ner_wip_greedy_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.6 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter --- layout: model title: Pipeline to Summarize Clinical Guidelines author: John Snow Labs name: summarizer_clinical_guidelines_large_pipeline date: 2023-05-30 tags: [licensed, en, clinical, summarization, guidelines] task: Summarization language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [summarizer_clinical_guidelines_large](https://nlp.johnsnowlabs.com/2023/05/08/summarizer_clinical_guidelines_large_en.html) model. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_guidelines_large_pipeline_en_4.4.2_3.0_1685424006278.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_guidelines_large_pipeline_en_4.4.2_3.0_1685424006278.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("summarizer_clinical_guidelines_large_pipeline", "en", "clinical/models") text = """Clinical Guidelines for Breast Cancer: Breast cancer is the most common type of cancer among women. It occurs when the cells in the breast start growing abnormally, forming a lump or mass. This can result in the spread of cancerous cells to other parts of the body. Breast cancer may occur in both men and women but is more prevalent in women. The exact cause of breast cancer is unknown. However, several risk factors can increase your likelihood of developing breast cancer, such as: - A personal or family history of breast cancer - A genetic mutation, such as BRCA1 or BRCA2 - Exposure to radiation - Age (most commonly occurring in women over 50) - Early onset of menstruation or late menopause - Obesity - Hormonal factors, such as taking hormone replacement therapy Breast cancer may not present symptoms during its early stages. Symptoms typically manifest as the disease progresses. Some notable symptoms include: - A lump or thickening in the breast or underarm area - Changes in the size or shape of the breast - Nipple discharge - Nipple changes in appearance, such as inversion or flattening - Redness or swelling in the breast Treatment for breast cancer depends on several factors, including the stage of the cancer, the location of the tumor, and the individual's overall health. Common treatment options include: - Surgery (such as lumpectomy or mastectomy) - Radiation therapy - Chemotherapy - Hormone therapy - Targeted therapy Early detection is crucial for the successful treatment of breast cancer. Women are advised to routinely perform self-examinations and undergo regular mammogram testing starting at age 40. If you notice any changes in your breast tissue, consult with your healthcare provider immediately. 
""" result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("summarizer_clinical_guidelines_large_pipeline", "en", "clinical/models") val text = """Clinical Guidelines for Breast Cancer: Breast cancer is the most common type of cancer among women. It occurs when the cells in the breast start growing abnormally, forming a lump or mass. This can result in the spread of cancerous cells to other parts of the body. Breast cancer may occur in both men and women but is more prevalent in women. The exact cause of breast cancer is unknown. However, several risk factors can increase your likelihood of developing breast cancer, such as: - A personal or family history of breast cancer - A genetic mutation, such as BRCA1 or BRCA2 - Exposure to radiation - Age (most commonly occurring in women over 50) - Early onset of menstruation or late menopause - Obesity - Hormonal factors, such as taking hormone replacement therapy Breast cancer may not present symptoms during its early stages. Symptoms typically manifest as the disease progresses. Some notable symptoms include: - A lump or thickening in the breast or underarm area - Changes in the size or shape of the breast - Nipple discharge - Nipple changes in appearance, such as inversion or flattening - Redness or swelling in the breast Treatment for breast cancer depends on several factors, including the stage of the cancer, the location of the tumor, and the individual's overall health. Common treatment options include: - Surgery (such as lumpectomy or mastectomy) - Radiation therapy - Chemotherapy - Hormone therapy - Targeted therapy Early detection is crucial for the successful treatment of breast cancer. Women are advised to routinely perform self-examinations and undergo regular mammogram testing starting at age 40. If you notice any changes in your breast tissue, consult with your healthcare provider immediately. 
""" val result = pipeline.fullAnnotate(text) ```
## Results ```bash Overview of the disease: Breast cancer is the most common type of cancer among women, occurring when the cells in the breast start growing abnormally, forming a lump or mass. It can result in the spread of cancerous cells to other parts of the body. Causes: The exact cause of breast cancer is unknown, but several risk factors can increase the likelihood of developing it, such as a personal or family history, a genetic mutation, exposure to radiation, age, early onset of menstruation or late menopause, obesity, and hormonal factors. Symptoms: Symptoms of breast cancer typically manifest as the disease progresses, including a lump or thickening in the breast or underarm area, changes in the size or shape of the breast, nipple discharge, nipple changes in appearance, and redness or swelling in the breast. Treatment recommendations: Treatment for breast cancer depends on several factors, including the stage of the cancer, the location of the tumor, and the individual's overall health. Common treatment options include surgery, radiation therapy, chemotherapy, hormone therapy, and targeted therapy. Early detection is crucial for successful treatment of breast cancer. Women are advised to routinely perform self-examinations and undergo regular mammogram testing starting at age 40. 
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_clinical_guidelines_large_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|3.0 GB| ## Included Models - DocumentAssembler - MedicalSummarizer --- layout: model title: English image_classifier_vit_exper_batch_16_e4 ViTForImageClassification from sudo-s author: John Snow Labs name: image_classifier_vit_exper_batch_16_e4 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_exper_batch_16_e4` is an English model originally trained by sudo-s. ## Predicted Entities `45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146` {:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper_batch_16_e4_en_4.1.0_3.0_1660173127276.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper_batch_16_e4_en_4.1.0_3.0_1660173127276.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_exper_batch_16_e4", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_exper_batch_16_e4", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_exper_batch_16_e4| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.3 MB| --- layout: model title: English BertForQuestionAnswering Base Cased model (from baru98) author: John Snow Labs name: bert_qa_baru98_base_cased_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased-finetuned-squad` is an English model originally trained by `baru98`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_baru98_base_cased_finetuned_squad_en_4.0.0_3.0_1657182841081.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_baru98_base_cased_finetuned_squad_en_4.0.0_3.0_1657182841081.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_baru98_base_cased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_baru98_base_cased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_baru98_base_cased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/baru98/bert-base-cased-finetuned-squad --- layout: model title: Hausa BertForMaskedLM Base Cased model (from Davlan) author: John Snow Labs name: bert_embeddings_base_multilingual_cased_finetuned_swahili date: 2022-12-02 tags: [ha, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ha edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased-finetuned-swahili` is a Hausa model originally trained by `Davlan`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_multilingual_cased_finetuned_swahili_ha_4.2.4_3.0_1670018460397.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_multilingual_cased_finetuned_swahili_ha_4.2.4_3.0_1670018460397.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_multilingual_cased_finetuned_swahili","ha") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_multilingual_cased_finetuned_swahili","ha") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_multilingual_cased_finetuned_swahili| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ha| |Size:|666.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/Davlan/bert-base-multilingual-cased-finetuned-swahili - http://data.statmt.org/cc-100/ - https://github.com/masakhane-io/masakhane-ner --- layout: model title: English BertForQuestionAnswering model (from ankitkupadhyay) author: John Snow Labs name: bert_qa_ankitkupadhyay_bert_finetuned_squad date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `ankitkupadhyay`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_ankitkupadhyay_bert_finetuned_squad_en_4.0.0_3.0_1654535567926.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_ankitkupadhyay_bert_finetuned_squad_en_4.0.0_3.0_1654535567926.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_ankitkupadhyay_bert_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_ankitkupadhyay_bert_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_ankitkupadhyay").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_ankitkupadhyay_bert_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ankitkupadhyay/bert-finetuned-squad --- layout: model title: Legal Registration rights Clause Binary Classifier author: John Snow Labs name: legclf_registration_rights_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `registration-rights` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `registration-rights` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_registration_rights_clause_en_1.0.0_3.2_1660123916625.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_registration_rights_clause_en_1.0.0_3.2_1660123916625.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_registration_rights_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
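The paragraph splitting recommended in the description above can be sketched without Spark NLP at all. The following is a minimal illustrative helper (not part of the library): it splits a document on blank lines and greedily merges consecutive paragraphs while staying under a whitespace-token budget, a rough proxy for the model's 512-token embedding limit.

```python
import re

def split_paragraphs(text: str, max_tokens: int = 512):
    """Split a document into paragraphs on blank lines, then greedily merge
    consecutive paragraphs while the running whitespace-token count stays
    within max_tokens (a rough proxy for the embedding limit)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_len = [], [], 0
    for p in paragraphs:
        n = len(p.split())
        # Flush the current chunk if adding this paragraph would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(p)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each resulting chunk can then be passed as one row of the `clause_text` column to the pipeline above, yielding one clause prediction per chunk.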
## Results ```bash +-------+ | result| +-------+ |[registration-rights]| |[other]| |[other]| |[registration-rights]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_registration_rights_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.90 0.97 0.93 65 registration-rights 0.94 0.83 0.88 41 accuracy - - 0.92 106 macro-avg 0.92 0.90 0.91 106 weighted-avg 0.92 0.92 0.91 106 ``` --- layout: model title: Fast Neural Machine Translation Model from English to East Slavic Languages author: John Snow Labs name: opus_mt_en_zle date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, zle, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `en` - target languages: `zle` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_zle_xx_2.7.0_2.4_1609169786290.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_zle_xx_2.7.0_2.4_1609169786290.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_zle", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_zle", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.zle').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_zle| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Chinese BertForQuestionAnswering model (from liam168) author: John Snow Labs name: bert_qa_qa_roberta_base_chinese_extractive date: 2022-06-02 tags: [zh, open_source, question_answering, bert] task: Question Answering language: zh edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `qa-roberta-base-chinese-extractive` is a Chinese model originally trained by `liam168`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_qa_roberta_base_chinese_extractive_zh_4.0.0_3.0_1654189085576.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_qa_roberta_base_chinese_extractive_zh_4.0.0_3.0_1654189085576.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_qa_roberta_base_chinese_extractive","zh") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_qa_roberta_base_chinese_extractive","zh") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("zh.answer_question.bert.base.by_liam168").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""")  ```
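Under the hood, an extractive QA model like this one scores every context token as a possible answer start and as a possible answer end, then picks the highest-scoring valid span. A minimal sketch of that selection step, with made-up scores (illustrative only, not Spark NLP or model-internal code):

```python
# Toy span selection for extractive QA: pick the (start, end) pair with the
# highest combined score, subject to end >= start. Scores are invented.

def best_span(start_scores, end_scores, max_len=30):
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # only consider spans that start at s and are at most max_len tokens
        for e in range(s, min(s + max_len, len(end_scores))):
            if s_score + end_scores[e] > best_score:
                best_score = s_score + end_scores[e]
                best = (s, e)
    return best

tokens = ["My", "name", "is", "Clara"]
start_scores = [0.1, 0.2, 0.1, 2.0]
end_scores = [0.0, 0.1, 0.3, 1.9]
s, e = best_span(start_scores, end_scores)
answer = " ".join(tokens[s:e + 1])  # -> "Clara"
```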
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_qa_roberta_base_chinese_extractive| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|zh| |Size:|381.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/liam168/qa-roberta-base-chinese-extractive --- layout: model title: Legal Stock Purchase Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_stock_purchase_agreement date: 2022-11-10 tags: [en, legal, classification, agreement, stock_purchase, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_stock_purchase_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `stock-purchase-agreement` or not (Binary Classification). Longformers have a limit of 4096 tokens, so only the first 4096 tokens will be taken into account. We have found that for the vast majority of documents in legal corpora, if they are clean and contain only the legal document without any extra information before it, 4096 tokens are enough to perform Document Classification. If not, let us know and we can carry out another approach for you: splitting the document into chunks of 4096 tokens, averaging their embeddings, and training on the averaged version, which means the whole document is taken into account. But this should not usually be required. 
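The chunk-and-average fallback described above can be sketched in plain Python. The 4096-token window comes from the Longformer limit; the helper names and the toy low-dimensional embeddings are invented for illustration:

```python
# Sketch: average embeddings over consecutive 4096-token chunks of a long
# document, so every part of the document contributes. Illustrative only --
# not Spark NLP API code.

def chunk(tokens, size=4096):
    """Split a token list into consecutive windows of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def averaged_embedding(token_embeddings, size=4096):
    """Mean-pool each chunk, then average the per-chunk vectors."""
    chunks = chunk(token_embeddings, size)
    chunk_means = [
        [sum(dim) / len(c) for dim in zip(*c)]  # mean over tokens in the chunk
        for c in chunks
    ]
    return [sum(dim) / len(chunk_means) for dim in zip(*chunk_means)]
```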
## Predicted Entities `stock-purchase-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_stock_purchase_agreement_en_1.0.0_3.0_1668119188697.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_stock_purchase_agreement_en_1.0.0_3.0_1668119188697.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = nlp.ClassifierDLModel.pretrained("legclf_stock_purchase_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +--------------------------+ | result| +--------------------------+ |[stock-purchase-agreement]| |[other]| |[other]| |[stock-purchase-agreement]| +--------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_stock_purchase_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.87 0.91 0.89 43 stock-purchase-agreement 0.88 0.83 0.85 35 accuracy - - 0.87 78 macro-avg 0.87 0.87 0.87 78 weighted-avg 0.87 0.87 0.87 78 ``` --- layout: model title: Detect Assertion Status from Demographic Entities author: John Snow Labs name: assertion_oncology_demographic_binary_wip date: 2022-10-11 tags: [licensed, clinical, oncology, en, assertion] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects if a demographic entity refers to the patient or to someone else. 
## Predicted Entities `Patient`, `Someone_Else` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_demographic_binary_wip_en_4.0.0_3.0_1665522157316.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_demographic_binary_wip_en_4.0.0_3.0_1665522157316.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(["Age", "Gender"]) assertion = AssertionDLModel.pretrained("assertion_oncology_demographic_binary_wip", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion]) data = spark.createDataFrame([["One sister was diagnosed with breast cancer at the age of 40."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") 
.setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Age", "Gender")) val assertion = AssertionDLModel.pretrained("assertion_oncology_demographic_binary_wip","en","clinical/models") .setInputCols(Array("sentence","ner_chunk","embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion)) val data = Seq("""One sister was diagnosed with breast cancer at the age of 40.""").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.oncology_demographic_binary_wip").predict("""One sister was diagnosed with breast cancer at the age of 40.""") ```
## Results ```bash | chunk | ner_label | assertion | |:----------|:------------|:-------------| | sister | Gender | Someone_Else | | age of 40 | Age | Someone_Else | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_oncology_demographic_binary_wip| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion_pred]| |Language:|en| |Size:|1.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label precision recall f1-score support Patient 0.93 0.86 0.89 29.0 Someone_Else 0.88 0.93 0.90 30.0 macro-avg 0.90 0.90 0.90 59.0 weighted-avg 0.90 0.90 0.90 59.0 ``` --- layout: model title: Hebrew BertForMaskedLM Base Cased model (from onlplab) author: John Snow Labs name: bert_embeddings_onlplab_aleph_base date: 2022-12-02 tags: [he, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: he edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `alephbert-base` is a Hebrew model originally trained by `onlplab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_onlplab_aleph_base_he_4.2.4_3.0_1670015499265.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_onlplab_aleph_base_he_4.2.4_3.0_1670015499265.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_onlplab_aleph_base","he") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_onlplab_aleph_base","he") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
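Since the underlying checkpoint is a BertForMaskedLM model, its pretraining head scores every vocabulary item for a masked position, and a softmax turns those raw scores into probabilities. A toy sketch with an invented three-word vocabulary (not Spark NLP code):

```python
# Toy masked-LM head: raw per-vocabulary scores for one masked position are
# normalized with softmax; the argmax is the predicted token. Vocabulary and
# scores are invented for illustration.
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["dog", "cat", "nlp"]
probs = softmax([1.0, 0.5, 3.0])
predicted = vocab[max(range(len(vocab)), key=probs.__getitem__)]  # -> "nlp"
```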
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_onlplab_aleph_base| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|he| |Size:|473.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/onlplab/alephbert-base - https://arxiv.org/abs/1810.04805 - https://oscar-corpus.com/ - https://dumps.wikimedia.org/hewiki/latest/ --- layout: model title: English BertForQuestionAnswering Cased model (from cjjie) author: John Snow Labs name: bert_qa_cjjie_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `cjjie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_cjjie_finetuned_squad_en_4.0.0_3.0_1657186415232.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_cjjie_finetuned_squad_en_4.0.0_3.0_1657186415232.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_cjjie_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_cjjie_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_cjjie_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/cjjie/bert-finetuned-squad --- layout: model title: English asr_sp_proj TFWav2Vec2ForCTC from behroz author: John Snow Labs name: asr_sp_proj date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_sp_proj` is a English model originally trained by behroz. NOTE: This model only works on a CPU, if you need to use this model on a GPU device please use asr_sp_proj_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_sp_proj_en_4.2.0_3.0_1664019382185.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_sp_proj_en_4.2.0_3.0_1664019382185.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_sp_proj", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_sp_proj", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
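The `Wav2Vec2ForCTC` annotator's name refers to Connectionist Temporal Classification: the network emits one symbol per audio frame, and decoding collapses repeated symbols and removes the blank to recover text. A greedy-decoding sketch with an invented blank symbol (not the Spark NLP implementation):

```python
# Greedy CTC decoding sketch: collapse consecutive repeats, then drop blanks.
# The "_" blank symbol and the frame sequence are invented for illustration.

BLANK = "_"

def ctc_greedy_decode(frame_symbols):
    out = []
    prev = None
    for sym in frame_symbols:
        # keep a symbol only when it differs from the previous frame
        # and is not the CTC blank
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

print(ctc_greedy_decode(list("hh_e_lll_lo__")))  # -> "hello"
```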
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_sp_proj| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|349.4 MB| --- layout: model title: Legal Lease Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_commercial_lease date: 2023-02-08 tags: [commercial, lease, en, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_commercial_lease` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `commercial-lease` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `commercial-lease`, `other` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_commercial_lease_en_1.0.0_3.0_1675878333648.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_commercial_lease_en_1.0.0_3.0_1675878333648.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_commercial_lease", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +------------------+ |result| +------------------+ |[commercial-lease]| |[other]| |[other]| |[commercial-lease]| +------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_commercial_lease| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support commercial-lease 1.00 0.97 0.98 30 other 0.95 1.00 0.98 21 accuracy - - 0.98 51 macro-avg 0.98 0.98 0.98 51 weighted-avg 0.98 0.98 0.98 51 ``` --- layout: model title: Named Entity Recognition for Japanese (GloVe 840B 300d) author: John Snow Labs name: ner_ud_gsd_glove_840B_300d date: 2021-01-03 task: Named Entity Recognition language: ja edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [ner, ja, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates named entities in a text, which can be used to find features such as names of people, places, and organizations. The model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. This model uses the pre-trained `glove_840B_300` embeddings model from `WordEmbeddings` annotator as an input, so be sure to use the same embeddings in the pipeline. ## Predicted Entities `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `MOVEMENT`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `TITLE_AFFIX`, and `WORK_OF_ART`. 
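"Closer together" in the description above is usually measured with cosine similarity between embedding vectors. The 3-dimensional vectors below are invented for illustration; real `glove_840B_300` vectors have 300 dimensions:

```python
# Cosine similarity sketch: semantically related words get embedding vectors
# pointing in similar directions. Toy 3-d vectors, invented for illustration.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

king = [0.9, 0.8, 0.1]
queen = [0.85, 0.75, 0.2]
banana = [0.1, 0.2, 0.9]
assert cosine(king, queen) > cosine(king, banana)  # related words score higher
```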
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_JA/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_ud_gsd_glove_840B_300d_ja_2.7.0_2.4_1609712569080.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ner_ud_gsd_glove_840B_300d_ja_2.7.0_2.4_1609712569080.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja")\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", "xx")\ .setInputCols("document", "token") \ .setOutputCol("embeddings") ner = NerDLModel.pretrained("ner_ud_gsd_glove_840B_300d", "ja") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("entities") pipeline = Pipeline(stages=[document_assembler, sentence_detector, word_segmenter, embeddings, ner, ner_converter]) example = spark.createDataFrame([['5月13日に放送されるフジテレビ系「僕らの音楽」にて、福原美穂とAIという豪華共演が決定した。']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", "xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_ud_gsd_glove_840B_300d", "ja") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols("sentence", "token", "ner") .setOutputCol("entities") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter, embeddings, ner, ner_converter)) val data = Seq("5月13日に放送されるフジテレビ系「僕らの音楽」にて、福原美穂とAIという豪華共演が決定した。").toDF("text") val result = 
pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ja.ner").predict("""Put your text here.""") ```
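The pipeline needs `WordSegmenterModel` because Japanese is written without spaces, so token boundaries must be inferred. A toy greedy longest-match segmenter over a tiny invented dictionary illustrates the task (the pretrained model is learned, not dictionary-based):

```python
# Greedy longest-match segmentation sketch for space-free text. The
# dictionary is invented; a real segmenter is a trained model.

def segment(text, vocab):
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            # fall back to a single character when nothing matches
            if text[i:j] in vocab or j == i + 1:
                out.append(text[i:j])
                i = j
                break
    return out

print(segment("放送される", {"放送", "される", "決定"}))  # -> ['放送', 'される']
```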
## Results ```bash +----------+------+ |token |ner | +----------+------+ |5月 |DATE | |13日 |DATE | |に |O | |放送 |O | |さ |O | |れる |O | |フジテレビ|O | |系 |O | |「 |O | |僕らの音楽|O | |」 |O | |にて |O | |、 |O | |福原美穂 |PERSON| |と |O | |AI |O | |と |O | |いう |O | |豪華 |O | |共演 |O | |が |O | |決定 |O | |し |O | |た |O | |。 |O | +----------+------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_ud_gsd_glove_840B_300d| |Type:|ner| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ja| ## Data Source The model was trained on the Universal Dependencies, curated by Google. Reference: > Asahara, M., Kanayama, H., Tanaka, T., Miyao, Y., Uematsu, S., Mori, S., Matsumoto, Y., Omura, M., & Murawaki, Y. (2018). Universal Dependencies Version 2 for Japanese. In LREC-2018. ## Benchmarking ```bash | ner_tag | precision | recall | f1-score | support | |:------------:|:---------:|:------:|:--------:|:-------:| | DATE | 1.00 | 0.86 | 0.92 | 84 | | EVENT | 1.00 | 0.14 | 0.25 | 14 | | FAC | 1.00 | 0.15 | 0.26 | 20 | | GPE | 1.00 | 0.01 | 0.02 | 82 | | LANGUAGE | 0.00 | 0.00 | 0.00 | 6 | | LAW | 0.00 | 0.00 | 0.00 | 3 | | LOC | 0.00 | 0.00 | 0.00 | 25 | | MONEY | 0.86 | 0.86 | 0.86 | 7 | | MOVEMENT | 0.00 | 0.00 | 0.00 | 4 | | NORP | 1.00 | 0.11 | 0.19 | 28 | | ORDINAL | 0.92 | 0.85 | 0.88 | 13 | | ORG | 0.44 | 0.35 | 0.39 | 75 | | PERCENT | 1.00 | 1.00 | 1.00 | 7 | | PERSON | 0.71 | 0.06 | 0.10 | 89 | | PRODUCT | 0.42 | 0.48 | 0.45 | 23 | | QUANTITY | 0.98 | 0.78 | 0.87 | 78 | | TIME | 1.00 | 1.00 | 1.00 | 13 | | TITLE_AFFIX | 0.00 | 0.00 | 0.00 | 20 | | WORK_OF_ART | 1.00 | 0.22 | 0.36 | 18 | | accuracy | 0.97 | 12419 | | | | macro avg | 0.67 | 0.39 | 0.43 | 12419 | | weighted avg | 0.96 | 0.97 | 0.96 | 12419 | ``` --- layout: model title: Fast Neural Machine Translation Model from English to Macedonian author: John Snow Labs name: opus_mt_en_mk date: 2020-12-28 task: 
Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, mk, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `mk` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_mk_xx_2.7.0_2.4_1609168599762.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_mk_xx_2.7.0_2.4_1609168599762.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_mk", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_mk", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.mk').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_mk| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English asr_xlsr_53_bemba_5hrs TFWav2Vec2ForCTC from csikasote author: John Snow Labs name: asr_xlsr_53_bemba_5hrs date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xlsr_53_bemba_5hrs` is an English model originally trained by csikasote. NOTE: This model only works on a CPU; if you need to use this model on a GPU device please use asr_xlsr_53_bemba_5hrs_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xlsr_53_bemba_5hrs_en_4.2.0_3.0_1664035808883.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xlsr_53_bemba_5hrs_en_4.2.0_3.0_1664035808883.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_xlsr_53_bemba_5hrs", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_xlsr_53_bemba_5hrs", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_xlsr_53_bemba_5hrs| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: News Classifier of German text author: John Snow Labs name: classifierdl_bert_news date: 2021-07-12 tags: [de, news, classifier, open_source, german] task: Text Classification language: de edition: Spark NLP 3.1.1 spark_version: 2.4 supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Classifies German news texts into one of seven categories. ## Predicted Entities `Inland`, `International`, `Panorama`, `Sport`, `Web`, `Wirtschaft`, `Wissenschaft` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_DE_NEWS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_DE_NEWS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_bert_news_de_3.1.1_2.4_1626079085859.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_bert_news_de_3.1.1_2.4_1626079085859.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = BertSentenceEmbeddings\ .pretrained('sent_bert_multi_cased', 'xx') \ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") document_classifier = ClassifierDLModel.pretrained("classifierdl_bert_news", "de") \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") nlpPipeline = Pipeline(stages=[document, embeddings, document_classifier]) light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.annotate("Niki Lauda in einem McLaren MP 4/2 TAG Turbo. Mit diesem Gefährt sicherte sich der Österreicher 1984 seinen dritten Weltmeistertitel, einen halben (!)") ``` ```scala val document = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val embeddings = BertSentenceEmbeddings .pretrained("sent_bert_multi_cased", "xx") .setInputCols("document") .setOutputCol("sentence_embeddings") val document_classifier = ClassifierDLModel.pretrained("classifierdl_bert_news", "de") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val nlpPipeline = new Pipeline().setStages(Array(document, embeddings, document_classifier)) val data = Seq("""Niki Lauda in einem McLaren MP 4/2 TAG Turbo. Mit diesem Gefährt sicherte sich der Österreicher 1984 seinen dritten Weltmeistertitel, einen halben (!)""").toDS.toDF("text") val result = nlpPipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.classify.news").predict("""Niki Lauda in einem McLaren MP 4/2 TAG Turbo. Mit diesem Gefährt sicherte sich der Österreicher 1984 seinen dritten Weltmeistertitel, einen halben (!)""") ```
## Results ```bash ['Sport'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_bert_news| |Compatibility:|Spark NLP 3.1.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|de| ## Data Source Trained on a custom dataset with multi-lingual Bert Sentence Embeddings. ## Benchmarking ```bash label precision recall f1-score support Inland 0.78 0.81 0.79 102 International 0.80 0.89 0.84 151 Panorama 0.84 0.70 0.76 168 Sport 0.98 0.99 0.98 120 Web 0.93 0.90 0.91 168 Wirtschaft 0.74 0.83 0.78 141 Wissenschaft 0.84 0.75 0.80 57 accuracy - - 0.84 907 macro-avg 0.84 0.84 0.84 907 weighted-avg 0.85 0.84 0.84 907 ``` --- layout: model title: Pipeline to Detect medical risk factors (biobert) author: John Snow Labs name: ner_risk_factors_biobert_pipeline date: 2023-03-20 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_risk_factors_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_risk_factors_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_risk_factors_biobert_pipeline_en_4.3.0_3.2_1679314882627.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_risk_factors_biobert_pipeline_en_4.3.0_3.2_1679314882627.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_risk_factors_biobert_pipeline", "en", "clinical/models") text = '''ISTORY OF PRESENT ILLNESS: The patient is a 40-year-old white male who presents with a chief complaint of "chest pain". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. The severity of the pain has progressively increased. He describes the pain as a sharp and heavy pain which radiates to his neck & left arm. He ranks the pain a 7 on a scale of 1-10. He admits some shortness of breath & diaphoresis. He states that he has had nausea & 3 episodes of vomiting tonight. He denies any fever or chills. He admits prior episodes of similar pain prior to his PTCA in 1995. He states the pain is somewhat worse with walking and seems to be relieved with rest. There is no change in pain with positioning. He states that he took 3 nitroglycerin tablets sublingually over the past 1 hour, which he states has partially relieved his pain. The patient ranks his present pain a 4 on a scale of 1-10. The most recent episode of pain has lasted one-hour. The patient denies any history of recent surgery, head trauma, recent stroke, abnormal bleeding such as blood in urine or stool or nosebleed. REVIEW OF SYSTEMS: All other systems reviewed & are negative. PAST MEDICAL HISTORY: Diabetes mellitus type II, hypertension, coronary artery disease, atrial fibrillation, status post PTCA in 1995 by Dr. ABC. SOCIAL HISTORY: Denies alcohol or drugs. Smokes 2 packs of cigarettes per day. Works as a banker. 
FAMILY HISTORY: Positive for coronary artery disease (father & brother).''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_risk_factors_biobert_pipeline", "en", "clinical/models") val text = """ISTORY OF PRESENT ILLNESS: The patient is a 40-year-old white male who presents with a chief complaint of "chest pain". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. The severity of the pain has progressively increased. He describes the pain as a sharp and heavy pain which radiates to his neck & left arm. He ranks the pain a 7 on a scale of 1-10. He admits some shortness of breath & diaphoresis. He states that he has had nausea & 3 episodes of vomiting tonight. He denies any fever or chills. He admits prior episodes of similar pain prior to his PTCA in 1995. He states the pain is somewhat worse with walking and seems to be relieved with rest. There is no change in pain with positioning. He states that he took 3 nitroglycerin tablets sublingually over the past 1 hour, which he states has partially relieved his pain. The patient ranks his present pain a 4 on a scale of 1-10. The most recent episode of pain has lasted one-hour. The patient denies any history of recent surgery, head trauma, recent stroke, abnormal bleeding such as blood in urine or stool or nosebleed. REVIEW OF SYSTEMS: All other systems reviewed & are negative. PAST MEDICAL HISTORY: Diabetes mellitus type II, hypertension, coronary artery disease, atrial fibrillation, status post PTCA in 1995 by Dr. ABC. SOCIAL HISTORY: Denies alcohol or drugs. Smokes 2 packs of cigarettes per day. Works as a banker. FAMILY HISTORY: Positive for coronary artery disease (father & brother)."""
val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.risk_factors_biobert.pipeline").predict("""ISTORY OF PRESENT ILLNESS: The patient is a 40-year-old white male who presents with a chief complaint of "chest pain". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. The severity of the pain has progressively increased. He describes the pain as a sharp and heavy pain which radiates to his neck & left arm. He ranks the pain a 7 on a scale of 1-10. He admits some shortness of breath & diaphoresis. He states that he has had nausea & 3 episodes of vomiting tonight. He denies any fever or chills. He admits prior episodes of similar pain prior to his PTCA in 1995. He states the pain is somewhat worse with walking and seems to be relieved with rest. There is no change in pain with positioning. He states that he took 3 nitroglycerin tablets sublingually over the past 1 hour, which he states has partially relieved his pain. The patient ranks his present pain a 4 on a scale of 1-10. The most recent episode of pain has lasted one-hour. The patient denies any history of recent surgery, head trauma, recent stroke, abnormal bleeding such as blood in urine or stool or nosebleed. REVIEW OF SYSTEMS: All other systems reviewed & are negative. PAST MEDICAL HISTORY: Diabetes mellitus type II, hypertension, coronary artery disease, atrial fibrillation, status post PTCA in 1995 by Dr. ABC. SOCIAL HISTORY: Denies alcohol or drugs. Smokes 2 packs of cigarettes per day. Works as a banker. FAMILY HISTORY: Positive for coronary artery disease (father & brother).""") ```
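`fullAnnotate` returns one result per input text, keyed by output column, with each NER chunk carrying its label and confidence in its metadata. A minimal, Spark-free sketch of flattening such a structure into rows — the `mock_result` shape below is an illustrative stand-in modeled on Spark NLP annotations, not the library's exact return type:

```python
# Illustrative stand-in for one fullAnnotate() result; real annotations
# are Annotation objects with .result, .begin, .end and .metadata fields.
mock_result = {
    "ner_chunk": [
        {"result": "diabetic", "begin": 135, "end": 142,
         "metadata": {"entity": "DIABETES", "confidence": "0.9689"}},
        {"result": "hypertension", "begin": 1341, "end": 1352,
         "metadata": {"entity": "HYPERTENSION", "confidence": "0.956"}},
    ]
}

def chunks_to_rows(result, col="ner_chunk"):
    """Flatten annotated chunks into (text, label, confidence) tuples."""
    return [
        (a["result"], a["metadata"]["entity"], float(a["metadata"]["confidence"]))
        for a in result[col]
    ]

rows = chunks_to_rows(mock_result)
print(rows[0])  # ('diabetic', 'DIABETES', 0.9689)
```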
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-----------------------------------------|--------:|------:|:-------------|-------------:| | 0 | diabetic | 135 | 142 | DIABETES | 0.9689 | | 1 | prior history of coronary artery disease | 154 | 193 | CAD | 0.419617 | | 2 | PTCA in 1995. | 698 | 710 | CAD | 0.574925 | | 3 | Diabetes mellitus type II | 1314 | 1338 | DIABETES | 0.946325 | | 4 | hypertension | 1341 | 1352 | HYPERTENSION | 0.956 | | 5 | coronary artery disease | 1355 | 1377 | CAD | 0.7962 | | 6 | Smokes 2 packs of cigarettes per day | 1480 | 1515 | SMOKER | 0.461643 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_risk_factors_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.2 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Stop Words Cleaner for Turkish author: John Snow Labs name: stopwords_tr date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: tr edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, tr] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. 
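The idea can be illustrated without Spark: filter a token list against a fixed stop-word set. The three-word set below is an illustrative subset only — the pretrained model ships a much larger Turkish list:

```python
# Tiny illustrative subset of Turkish stop words -- the pretrained
# model uses a far larger list.
stop_words = {"ve", "bir", "yanı", "sıra"}

def clean_tokens(tokens):
    """Keep only tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["John", "Snow", "ve", "bir", "doktor"]
print(clean_tokens(tokens))  # ['John', 'Snow', 'doktor']
```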
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_tr_tr_2.5.4_2.4_1594742440173.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_tr_tr_2.5.4_2.4_1594742440173.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_tr", "tr") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("John Snow, kuzeyin kralı olmanın yanı sıra bir İngiliz doktordur ve anestezi ve tıbbi hijyenin geliştirilmesinde liderdir.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_tr", "tr") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("John Snow, kuzeyin kralı olmanın yanı sıra bir İngiliz doktordur ve anestezi ve tıbbi hijyenin geliştirilmesinde liderdir.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""John Snow, kuzeyin kralı olmanın yanı sıra bir İngiliz doktordur ve anestezi ve tıbbi hijyenin geliştirilmesinde liderdir."""] stopword_df = nlu.load('tr.stopwords').predict(text) stopword_df[['cleanTokens']] ```
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=3, result='John', metadata={'sentence': '0'}), Row(annotatorType='token', begin=5, end=8, result='Snow', metadata={'sentence': '0'}), Row(annotatorType='token', begin=9, end=9, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=11, end=17, result='kuzeyin', metadata={'sentence': '0'}), Row(annotatorType='token', begin=19, end=23, result='kralı', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_tr| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|tr| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: Relation Extraction Between Body Parts and Problem Entities (ReDL) author: John Snow Labs name: redl_bodypart_problem_biobert date: 2023-01-14 tags: [licensed, en, clinical, relation_extraction, tensorflow] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Relation extraction between body parts and problem entities in clinical texts. `1`: indicates a relation between a body-part entity and an entity labeled as a problem (diagnosis, symptom, etc.); `0`: indicates no relation between the body-part and problem entities.
## Predicted Entities `0`, `1` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.1.Clinical_Relation_Extraction_BodyParts_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_problem_biobert_en_4.2.4_3.0_1673713187801.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_problem_biobert_en_4.2.4_3.0_1673713187801.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverterInternal() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") # Set a filter on pairs of named entities which will be treated as relation candidates re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks")\ .setRelationPairs(["SYMPTOM-EXTERNAL_BODY_PART_OR_REGION", "EXTERNAL_BODY_PART_OR_REGION-SYMPTOM"]) # The dataset this model was trained on is sentence-wise. # This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input.
re_model = RelationExtractionDLModel.pretrained('redl_bodypart_problem_biobert', 'en', "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) text = "No neurologic deficits other than some numbness in his left hand." data = spark.createDataFrame([[text]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") .setRelationPairs(Array("SYMPTOM-EXTERNAL_BODY_PART_OR_REGION","EXTERNAL_BODY_PART_OR_REGION-SYMPTOM")) // The dataset this model was trained on is sentence-wise. // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. val re_model = RelationExtractionDLModel.pretrained("redl_bodypart_problem_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("No neurologic deficits other than some numbness in his left hand.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.bodypart.problem").predict("""No neurologic deficits other than some numbness in his left hand.""") ```
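The per-label F1 values reported under Benchmarking are the harmonic mean of precision and recall; a quick sanity check in plain Python:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Per-label precision/recall taken from the Benchmarking section.
print(round(f1(0.814, 0.762), 3))  # label 0 -> 0.787
print(round(f1(0.917, 0.938), 3))  # label 1 -> 0.927
```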
## Results ```bash +--------+-------+-------------------+----------------------------+------+----------+ |relation|entity1|chunk1 |entity2 |chunk2|confidence| +--------+-------+-------------------+----------------------------+------+----------+ |0 |Symptom|neurologic deficits|External_body_part_or_region|hand |0.8320218 | |1 |Symptom|numbness |External_body_part_or_region|hand |0.99943227| +--------+-------+-------------------+----------------------------+------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_bodypart_problem_biobert| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|401.7 MB| ## References Trained on internal dataset. ## Benchmarking ```bash label Recall Precision F1 Support 0 0.762 0.814 0.787 315 1 0.938 0.917 0.927 885 Avg. 0.850 0.865 0.857 - ``` --- layout: model title: English asr_XLS_R_53_english TFWav2Vec2ForCTC from BakhtUllah123 author: John Snow Labs name: asr_XLS_R_53_english date: 2022-09-26 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_XLS_R_53_english` is an English model originally trained by BakhtUllah123.
NOTE: This model only works on a CPU; if you need to run it on a GPU device, please use asr_XLS_R_53_english_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_XLS_R_53_english_en_4.2.0_3.0_1664203351344.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_XLS_R_53_english_en_4.2.0_3.0_1664203351344.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_XLS_R_53_english", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_XLS_R_53_english", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_XLS_R_53_english| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Part of Speech for Czech author: John Snow Labs name: pos_ud_pdt date: 2021-03-08 tags: [part_of_speech, open_source, czech, pos_ud_pdt, cs] task: Part of Speech Tagging language: cs edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - ADV - ADJ - SCONJ - NOUN - VERB - PUNCT - AUX - DET - ADP - PROPN - NUM - PRON - CCONJ - PART - SYM - INTJ - X {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_pdt_cs_3.0.0_3.0_1615230287310.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_pdt_cs_3.0.0_3.0_1615230287310.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_pdt", "cs") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Ahoj z John Snow Labs! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_pdt", "cs") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Ahoj z John Snow Labs! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Ahoj z John Snow Labs! """] token_df = nlu.load('cs.pos.ud_pdt').predict(text) token_df ```
## Results ```bash token pos 0 Ahoj PROPN 1 z ADP 2 John PROPN 3 Snow PROPN 4 Labs PROPN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_pdt| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|cs| --- layout: model title: Legal NER - Whereas Clauses (sm) author: John Snow Labs name: legner_whereas date: 2022-08-12 tags: [en, legal, ner, whereas, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description IMPORTANT: Don't run this model on the whole legal agreement. Instead: - Split by paragraphs. You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration; - Use the `legclf_cuad_whereas_clause` Text Classifier to select only these paragraphs; This is a Legal NER Model, able to process WHEREAS clauses, to detect the SUBJECT (Who?), the ACTION, the OBJECT (what?) and, in some cases, the INDIRECT OBJECT (to whom?) of the clause. ## Predicted Entities `WHEREAS_SUBJECT`, `WHEREAS_OBJECT`, `WHEREAS_ACTION` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/LEGALNER_WHEREAS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_whereas_en_1.0.0_3.2_1660294083004.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_whereas_en_1.0.0_3.2_1660294083004.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained('legner_whereas', 'en', 'legal/models')\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""WHEREAS, Seller and Buyer have entered into that certain Stock Purchase Agreement, dated November 14, 2018 (the "Stock Purchase Agreement"); WHEREAS, pursuant to the Stock Purchase Agreement, Seller has agreed to sell and transfer, and Buyer has agreed to purchase and acquire, all of Seller's right, title and interest in and to Armstrong Wood Products, Inc., a Delaware corporation ("AWP") and its Subsidiaries, the Company and HomerWood Hardwood Flooring Company, a Delaware corporation ("HHFC," and together with the Company, the "Company Subsidiaries" and together with AWP, the "Company Entities" and each a "Company Entity") by way of a purchase by Buyer and sale by Seller of the Shares, all upon the terms and condition set forth therein;"""] res = model.transform(spark.createDataFrame([text]).toDF("text")) ```
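The model tags tokens in IOB format; chunk-level spans such as `WHEREAS_SUBJECT` come from merging each `B-` tag with its following `I-` tags. A minimal plain-Python sketch of that merge, independent of the `NerConverter` used in the pipeline above:

```python
def merge_bio(tokens, tags):
    """Merge IOB tags into (chunk_text, label) spans."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close any open chunk before starting a new one
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # an "O" tag ends any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["WHEREAS", ",", "Seller", "has", "agreed", "to", "sell"]
tags = ["O", "O", "B-WHEREAS_SUBJECT", "B-WHEREAS_ACTION",
        "I-WHEREAS_ACTION", "I-WHEREAS_ACTION", "I-WHEREAS_ACTION"]
print(merge_bio(tokens, tags))
```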
## Results ```bash +------------+-----------------+ | token| ner_label| +------------+-----------------+ | WHEREAS| O| | ,| O| | Seller|B-WHEREAS_SUBJECT| | and| O| | Buyer|B-WHEREAS_SUBJECT| | have| B-WHEREAS_ACTION| | entered| I-WHEREAS_ACTION| | into| I-WHEREAS_ACTION| | that| B-WHEREAS_OBJECT| | certain| I-WHEREAS_OBJECT| | Stock| I-WHEREAS_OBJECT| | Purchase| I-WHEREAS_OBJECT| | Agreement| I-WHEREAS_OBJECT| | ,| O| | dated| O| | November| O| | 14| O| | ,| O| | 2018| O| | (| O| | the| O| | "| O| | Stock| O| | Purchase| O| | Agreement| O| | ");| O| | WHEREAS| O| | ,| O| | pursuant| O| | to| O| | the| O| | Stock| O| | Purchase| O| | Agreement| O| | ,| O| | Seller|B-WHEREAS_SUBJECT| | has| B-WHEREAS_ACTION| | agreed| I-WHEREAS_ACTION| | to| I-WHEREAS_ACTION| | sell| I-WHEREAS_ACTION| | and| O| | transfer| O| | ,| O| | and| O| | Buyer|B-WHEREAS_SUBJECT| | has| B-WHEREAS_ACTION| | agreed| I-WHEREAS_ACTION| | to| I-WHEREAS_ACTION| | purchase| I-WHEREAS_ACTION| | and| O| | acquire| O| | ,| O| | all| O| | of| O| | Seller's| O| | right| O| | ,| O| | title| O| | and| O| | interest| O| | in| O| | and| O| | to| O| | Armstrong| O| | Wood| O| | Products| O| | ,| O| | Inc| O| | .,| O| | a| O| | Delaware| O| | corporation| O| | ("| O| | AWP| O| | ")| O| | and| O| | its| O| |Subsidiaries| O| | ,| O| | the| O| | Company| O| | and| O| | HomerWood| O| | Hardwood| O| | Flooring| O| | Company| O| | ,| O| | a| O| | Delaware| O| | corporation| O| | ("| O| | HHFC| O| | ,"| O| +------------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_whereas| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.5 MB| ## References Manual annotations on CUAD dataset ## Benchmarking ```bash label tp fp fn prec rec f1 B-WHEREAS_SUBJECT 191 14 15 0.9317073 0.92718446 0.9294404 I-WHEREAS_ACTION 202 38 59 0.84166664 
0.77394634 0.8063872 I-WHEREAS_SUBJECT 52 8 16 0.8666667 0.7647059 0.8125 B-WHEREAS_OBJECT 101 63 68 0.61585367 0.5976331 0.6066066 B-WHEREAS_ACTION 152 19 16 0.8888889 0.9047619 0.89675516 I-WHEREAS_OBJECT 361 194 194 0.65045047 0.65045047 0.65045047 Macro-average 1059 336 368 0.7992056 0.76978034 0.784217 Micro-average 1059 336 368 0.7591398 0.74211633 0.75053155 ``` --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_4 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-16-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_4_en_4.0.0_3.0_1655731726407.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_4_en_4.0.0_3.0_1655731726407.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_seed_4").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|416.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-16-finetuned-squad-seed-4 --- layout: model title: Chinese BertForMaskedLM Large Cased model (from genggui001) author: John Snow Labs name: bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_wwm_large_ext_fix_mlm` is a Chinese model originally trained by `genggui001`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm_zh_4.2.4_3.0_1670021892719.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm_zh_4.2.4_3.0_1670021892719.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
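The `embeddings` column holds one vector per token. Downstream applications typically compare such vectors with cosine similarity; the following is a minimal, library-agnostic sketch, where the toy 4-dimensional vectors are invented for illustration (real BERT-large embeddings have 1024 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- values are made up purely for illustration.
v1 = [0.2, 0.1, 0.4, 0.3]
v2 = [0.2, 0.1, 0.4, 0.3]

print(cosine_similarity(v1, v2))  # identical vectors -> 1.0
```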
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_wwm_large_ext_fix_mlm| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|1.2 GB| |Case sensitive:|true| ## References - https://huggingface.co/genggui001/chinese_roberta_wwm_large_ext_fix_mlm - https://github.com/ymcui/Chinese-BERT-wwm/issues/98 - https://github.com/genggui001/chinese_roberta_wwm_large_ext_fix_mlm --- layout: model title: English DistilBertForQuestionAnswering Cased model (from saraks) author: John Snow Labs name: distilbert_qa_cuad_agreement_date_08_31_v1 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cuad-distil-agreement_date-08-31-v1` is an English model originally trained by `saraks`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_agreement_date_08_31_v1_en_4.3.0_3.0_1672766029156.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_agreement_date_08_31_v1_en_4.3.0_3.0_1672766029156.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_agreement_date_08_31_v1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_agreement_date_08_31_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
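Under the hood, extractive QA heads of this kind score every token as a potential answer start or end and decode the highest-scoring valid span. A simplified, framework-free sketch of that decoding step follows; the tokens and scores are invented for illustration and are not the model's actual outputs:

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) pair maximizing start+end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]  # made-up scores
end   = [0.1, 0.1, 0.1, 4.8, 0.2, 0.1, 0.1, 0.1, 0.4, 0.1]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```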
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_cuad_agreement_date_08_31_v1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/saraks/cuad-distil-agreement_date-08-31-v1 --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_2_h_256 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-2_H-256` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_2_h_256_zh_4.2.4_3.0_1670021618182.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_2_h_256_zh_4.2.4_3.0_1670021618182.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_2_h_256","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_2_h_256","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_2_h_256| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|27.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-2_H-256 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: English asr_model_2 TFWav2Vec2ForCTC from niclas author: John Snow Labs name: asr_model_2 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_model_2` is an English model originally trained by niclas. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_model_2_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_model_2_en_4.2.0_3.0_1664097688800.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_model_2_en_4.2.0_3.0_1664097688800.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_model_2", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_model_2", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_model_2| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Pipeline to Detect Normalized Genes and Human Phenotypes author: John Snow Labs name: ner_human_phenotype_go_biobert_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, gene, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_human_phenotype_go_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_human_phenotype_go_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_go_biobert_pipeline_en_3.4.1_3.0_1647868134832.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_go_biobert_pipeline_en_3.4.1_3.0_1647868134832.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_human_phenotype_go_biobert_pipeline", "en", "clinical/models") pipeline.annotate("Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.") ``` ```scala val pipeline = new PretrainedPipeline("ner_human_phenotype_go_biobert_pipeline", "en", "clinical/models") pipeline.annotate("Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.phenotype_go_biobert.pipeline").predict("""Another disease that shares two of the tumor components of CT, namely GIST and tricarboxylic acid cycle is the Carney-Stratakis syndrome (CSS) or dyad.""") ```
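The final NerConverter stage of this pipeline merges token-level IOB tags into entity chunks like the ones shown in the Results section. A simplified, stand-alone sketch of that merging logic (not the converter's actual implementation):

```python
def merge_iob(tokens, labels):
    """Group (token, IOB-label) pairs into (chunk_text, entity) tuples."""
    chunks, current, entity = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):  # a new chunk begins
            if current:
                chunks.append((" ".join(current), entity))
            current, entity = [token], label[2:]
        elif label.startswith("I-") and current:  # continue the open chunk
            current.append(token)
        else:  # "O" (or a stray I-) closes any open chunk
            if current:
                chunks.append((" ".join(current), entity))
            current, entity = [], None
    if current:
        chunks.append((" ".join(current), entity))
    return chunks

tokens = ["tricarboxylic", "acid", "cycle", "is", "a", "pathway"]
labels = ["B-GO", "I-GO", "I-GO", "O", "O", "O"]
print(merge_iob(tokens, labels))  # -> [('tricarboxylic acid cycle', 'GO')]
```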
## Results ```bash +------------------------+--------+ |chunks |entities| +------------------------+--------+ |tumor |HP | |tricarboxylic acid cycle|GO | +------------------------+--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_human_phenotype_go_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.0 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter --- layout: model title: Bangla Bert Embeddings (from neuralspace-reverie) author: John Snow Labs name: bert_embeddings_indic_transformers_bn_bert date: 2022-04-11 tags: [bert, embeddings, bn, open_source] task: Embeddings language: bn edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `indic-transformers-bn-bert` is a Bangla model originally trained by `neuralspace-reverie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_bn_bert_bn_3.4.2_3.0_1649673426239.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_indic_transformers_bn_bert_bn_3.4.2_3.0_1649673426239.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_indic_transformers_bn_bert","bn") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["আমি স্পার্ক এনএলপি ভালোবাসি"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_indic_transformers_bn_bert","bn") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("আমি স্পার্ক এনএলপি ভালোবাসি").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("bn.embed.indic_transformers_bn_bert").predict("""আমি স্পার্ক এনএলপি ভালোবাসি""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_indic_transformers_bn_bert| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|bn| |Size:|505.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-bn-bert - https://oscar-corpus.com/ --- layout: model title: English Part of Speech Tagger (from QCRI) author: John Snow Labs name: bert_pos_bert_base_multilingual_cased_pos_english date: 2022-05-09 tags: [bert, pos, part_of_speech, en, open_source] task: Part of Speech Tagging language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-multilingual-cased-pos-english` is an English model originally trained by `QCRI`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_multilingual_cased_pos_english_en_3.4.2_3.0_1652091659315.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_multilingual_cased_pos_english_en_3.4.2_3.0_1652091659315.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_multilingual_cased_pos_english","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_multilingual_cased_pos_english","en") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_multilingual_cased_pos_english| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|665.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/QCRI/bert-base-multilingual-cased-pos-english --- layout: model title: ICD10CM Poison Entity Resolver author: John Snow Labs name: chunkresolve_icd10cm_poison_ext_clinical class: ChunkEntityResolverModel language: en nav_key: models repository: clinical/models date: 2020-04-28 task: Entity Resolution edition: Healthcare NLP 2.4.5 spark_version: 2.4 tags: [clinical,licensed,entity_resolution,en] deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Entity resolution model based on KNN over word embeddings, using Word Mover's Distance. ## Predicted Entities ICD10-CM codes and their normalized definitions, resolved with ``clinical_embeddings``. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_ICD10_CM.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_poison_ext_clinical_en_2.4.5_2.4_1588106053455.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_poison_ext_clinical_en_2.4.5_2.4_1588106053455.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python ... model = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_poison_ext_clinical","en","clinical/models")\ .setInputCols("token","chunk_embeddings")\ .setOutputCol("icd10_code") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_chunker, chunk_embeddings, model]) light_pipeline = LightPipeline(pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) light_pipeline.fullAnnotate("""The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion.""") ``` ```scala ... val model = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_poison_ext_clinical","en","clinical/models") .setInputCols("token","chunk_embeddings") .setOutputCol("icd10_code") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_chunker, chunk_embeddings, model)) val data = Seq("The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. 
She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion.").toDF("text") val result = pipeline.fit(data).transform(data) ```
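Conceptually, the resolver maps each chunk embedding to the nearest code in its vocabulary. The description above mentions KNN with Word Mover's Distance; this sketch substitutes plain Euclidean distance and invented toy vectors purely to illustrate the nearest-neighbor idea, not the model's actual internals:

```python
import math

def resolve(chunk_vec, code_vectors):
    """Return the code whose reference vector is closest to the chunk embedding."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(code_vectors, key=lambda code: dist(chunk_vec, code_vectors[code]))

# Toy 3-d "embeddings" standing in for real clinical word embeddings.
code_vectors = {
    "R05":   [0.9, 0.1, 0.0],   # Cough
    "R0981": [0.1, 0.8, 0.1],   # Nasal congestion
}
print(resolve([0.85, 0.15, 0.0], code_vectors))  # -> R05
```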
## Results ```bash | # | chunk | begin | end | entity | icd10_description | icd10_code | |--:|---------------------:|------:|----:|--------:|--------------------------------------------------:|------------| | 0 | a cold, cough | 75 | 87 | PROBLEM | Chronic obstructive pulmonary disease, unspeci... | J449 | | 1 | runny nose | 94 | 103 | PROBLEM | Nasal congestion | R0981 | | 2 | difficulty breathing | 210 | 229 | PROBLEM | Shortness of breath | R0602 | | 3 | her cough | 235 | 243 | PROBLEM | Cough | R05 | | 4 | fairly congested | 365 | 380 | PROBLEM | Edema, unspecified | R609 | | 5 | difficulty breathing | 590 | 609 | PROBLEM | Shortness of breath | R0602 | | 6 | more congested | 625 | 638 | PROBLEM | Edema, unspecified | R609 | | 7 | trouble sleeping | 759 | 774 | PROBLEM | Activity, sleeping | Y9384 | | 8 | congestion | 789 | 798 | PROBLEM | Nasal congestion | R0981 | ``` {:.model-param} ## Model Information {:.table-model} |----------------|------------------------------------------| | Name: | chunkresolve_icd10cm_poison_ext_clinical | | Type: | ChunkEntityResolverModel | | Compatibility: | Spark NLP 2.4.5+ | | License: | Licensed | |Edition:|Official| |Input labels: | [token, chunk_embeddings] | |Output labels: | [icd10_code] | | Language: | en | | Case sensitive: | True | | Dependencies: | embeddings_clinical | {:.h2_title} ## Data Source Trained on ICD10CM Dataset Range: T1500XA-T879 https://www.icd10data.com/ICD10CM/Codes/S00-T88 --- layout: model title: Pipeline to Detect Entities Related to Cancer Diagnosis author: John Snow Labs name: ner_oncology_diagnosis_pipeline date: 2023-03-09 tags: [licensed, clinical, en, ner, oncology] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the
[ner_oncology_diagnosis](https://nlp.johnsnowlabs.com/2022/11/24/ner_oncology_diagnosis_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_diagnosis_pipeline_en_4.3.0_3.2_1678346464717.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_diagnosis_pipeline_en_4.3.0_3.2_1678346464717.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_oncology_diagnosis_pipeline", "en", "clinical/models") text = '''Two years ago, the patient presented with a tumor in her left breast and adenopathies. She was diagnosed with invasive ductal carcinoma. Last week she was also found to have a lung metastasis.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_oncology_diagnosis_pipeline", "en", "clinical/models") val text = "Two years ago, the patient presented with a tumor in her left breast and adenopathies. She was diagnosed with invasive ductal carcinoma. Last week she was also found to have a lung metastasis." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-------------|--------:|------:|:------------------|-------------:| | 0 | tumor | 44 | 48 | Tumor_Finding | 0.9958 | | 1 | adenopathies | 73 | 84 | Adenopathy | 0.6287 | | 2 | invasive | 110 | 117 | Histological_Type | 0.9965 | | 3 | ductal | 119 | 124 | Histological_Type | 0.9996 | | 4 | carcinoma | 126 | 134 | Cancer_Dx | 0.9988 | | 5 | metastasis | 181 | 190 | Metastasis | 0.9996 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_diagnosis_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Legal Service Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_service_agreement_bert date: 2022-11-25 tags: [en, legal, classification, agreement, service, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_service_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `service-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. 
## Predicted Entities `service-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_service_agreement_bert_en_1.0.0_3.0_1669371284792.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_service_agreement_bert_en_1.0.0_3.0_1669371284792.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_service_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
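The Benchmarking section of these cards reports per-label precision, recall, and F1. These follow the standard definitions from true-positive, false-positive, and false-negative counts; a quick sanity-check sketch with made-up counts:

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)          # fraction of predictions that are right
    recall = tp / (tp + fn)             # fraction of gold labels recovered
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Invented counts, for illustration only.
p, r, f = prf1(tp=90, fp=10, fn=20)
print(round(p, 4), round(r, 4), round(f, 4))  # -> 0.9 0.8182 0.8571
```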
## Results ```bash +-------------------+ |result | +-------------------+ |[service-agreement]| |[other] | |[other] | |[service-agreement]| +-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_service_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.89 0.95 0.92 60 service-agreement 0.88 0.77 0.82 30 accuracy - - 0.89 90 macro-avg 0.89 0.86 0.87 90 weighted-avg 0.89 0.89 0.89 90 ``` --- layout: model title: Pipeline to Extract entities in clinical trial abstracts author: John Snow Labs name: ner_clinical_trials_abstracts_pipeline date: 2023-03-09 tags: [ner, clinical, en, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_clinical_trials_abstracts](https://nlp.johnsnowlabs.com/2022/06/22/ner_clinical_trials_abstracts_en_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_trials_abstracts_pipeline_en_4.3.0_3.2_1678386393248.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_trials_abstracts_pipeline_en_4.3.0_3.2_1678386393248.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_clinical_trials_abstracts_pipeline", "en", "clinical/models") text = '''A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes. In a multicentre, open, randomised study, 570 patients with Type 2 diabetes, aged 34 - 80 years, were treated for 52 weeks with insulin glargine or NPH insulin given once daily at bedtime.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_clinical_trials_abstracts_pipeline", "en", "clinical/models") val text = "A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes. In a multicentre, open, randomised study, 570 patients with Type 2 diabetes, aged 34 - 80 years, were treated for 52 weeks with insulin glargine or NPH insulin given once daily at bedtime." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical_trials_abstracts.pipe").predict("""A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes. In a multicentre, open, randomised study, 570 patients with Type 2 diabetes, aged 34 - 80 years, were treated for 52 weeks with insulin glargine or NPH insulin given once daily at bedtime.""") ```
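The Results table below reports `begin`/`end` character offsets for each detected chunk. In Spark NLP annotations these offsets are inclusive on both ends; a small plain-Python check against the opening of the example sentence used above (offsets copied from the table):

```python
# Spark NLP chunk annotations use inclusive character offsets: `begin` and
# `end` both point at characters inside the chunk. Verify against the
# example text used above (offsets copied from the Results table).
text = ("A one-year, randomised, multicentre trial comparing insulin glargine "
        "with NPH insulin in combination with oral agents in patients with "
        "type 2 diabetes.")

chunks = [("randomised", 12, 21),
          ("multicentre", 24, 34),
          ("insulin glargine", 52, 67),
          ("NPH insulin", 74, 84),
          ("type 2 diabetes", 135, 149)]

for chunk, begin, end in chunks:
    # Python slices are end-exclusive, hence the `end + 1`.
    assert text[begin:end + 1] == chunk
```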
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-----------------|--------:|------:|:-------------------|-------------:| | 0 | randomised | 12 | 21 | CTDesign | 0.9996 | | 1 | multicentre | 24 | 34 | CTDesign | 0.9998 | | 2 | insulin glargine | 52 | 67 | Drug | 0.99135 | | 3 | NPH insulin | 74 | 84 | Drug | 0.9687 | | 4 | type 2 diabetes | 135 | 149 | DisorderOrSyndrome | 0.999933 | | 5 | multicentre | 157 | 167 | CTDesign | 0.9997 | | 6 | open | 170 | 173 | CTDesign | 0.9988 | | 7 | randomised | 176 | 185 | CTDesign | 0.9984 | | 8 | 570 | 194 | 196 | NumberPatients | 0.9906 | | 9 | Type 2 diabetes | 212 | 226 | DisorderOrSyndrome | 0.9999 | | 10 | 34 | 234 | 235 | Age | 0.9999 | | 11 | 80 | 239 | 240 | Age | 0.9931 | | 12 | 52 weeks | 266 | 273 | Duration | 0.9794 | | 13 | insulin glargine | 280 | 295 | Drug | 0.989 | | 14 | NPH insulin | 300 | 310 | Drug | 0.97955 | | 15 | once daily | 318 | 327 | DrugTime | 0.999 | | 16 | bedtime | 332 | 338 | DrugTime | 0.9937 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical_trials_abstracts_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Detect Clinical Entities (bert_token_classifier_ner_clinical) author: John Snow Labs name: bert_token_classifier_ner_clinical date: 2021-08-28 tags: [ner, clinical, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.2.0 spark_version: 2.4 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a BERT-based version of the ner_clinical model and it is 4% better than the legacy NER model (MedicalNerModel) 
that is based on BiLSTM-CNN-Char architecture. ## Predicted Entities `PROBLEM`, `TEST`, `TREATMENT` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_en_3.2.0_2.4_1630165904492.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_clinical_en_3.2.0_2.4_1630165904492.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_clinical", "en", "clinical/models")\ .setInputCols("token", "sentence")\ .setOutputCol("ner") pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, tokenClassifier ]) data = spark.createDataFrame([[''' A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . 
Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge . 
''']]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_clinical", "en", "clinical/models") .setInputCols(Array("token", "sentence")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector,tokenizer,tokenClassifier)) val data = Seq("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . 
Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge .""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_clinical").predict(""" A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . 
Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . 
It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge . """) ```
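Because the pipeline above has no chunk-converter stage, the Results table below lists one row per token, so multi-word entities such as "gestational diabetes mellitus" appear split across rows. In a full pipeline a `NerConverter` stage rebuilds chunks from the IOB tags; a naive plain-Python sketch of the same idea (hypothetical helper, merging only adjacent tokens with identical labels) looks like:

```python
# Merge consecutive tokens that share a label into multi-word chunks.
# Naive sketch: a real pipeline would use NerConverter, which respects
# IOB boundaries and so can separate adjacent distinct entities.
def merge_token_labels(tokens):
    chunks = []
    for word, label in tokens:
        if chunks and chunks[-1][1] == label:
            chunks[-1][0] = chunks[-1][0] + " " + word
        else:
            chunks.append([word, label])
    return [(text, label) for text, label in chunks]

# Token-level rows as they appear in the Results table below.
rows = [("gestational", "PROBLEM"), ("diabetes", "PROBLEM"),
        ("mellitus", "PROBLEM"), ("body", "TEST"),
        ("mass", "TEST"), ("index", "TEST")]
print(merge_token_labels(rows))
# [('gestational diabetes mellitus', 'PROBLEM'), ('body mass index', 'TEST')]
```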
## Results ```bash +------------+-------+ |chunk |label | +------------+-------+ |gestational |PROBLEM| |diabetes |PROBLEM| |mellitus |PROBLEM| |type |PROBLEM| |two |PROBLEM| |diabetes |PROBLEM| |mellitus |PROBLEM| |T2DM |PROBLEM| |HTG-induced |PROBLEM| |pancreatitis|PROBLEM| |acute |PROBLEM| |hepatitis |PROBLEM| |obesity |PROBLEM| |body |TEST | |mass |TEST | |index |TEST | |BMI |TEST | |polyuria |PROBLEM| |polydipsia |PROBLEM| |poor |PROBLEM| |appetite |PROBLEM| |vomiting |PROBLEM| +------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_clinical| |Compatibility:|Healthcare NLP 3.2.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Case sensitive:|true| |Max sentence length:|128| ## Data Source Trained on augmented 2010 i2b2 challenge data with ‘embeddings_clinical’. https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ ## Benchmarking ```bash label precision recall f1-score support PROBLEM 0.88 0.92 0.90 30276 TEST 0.91 0.86 0.88 17237 TREATMENT 0.87 0.88 0.88 17298 O 0.97 0.97 0.97 202438 accuracy - - 0.95 267249 macro-avg 0.91 0.91 0.91 267249 weighted-avg 0.95 0.95 0.95 267249 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Austronesian Languages author: John Snow Labs name: opus_mt_en_map date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, map, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. 
Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `map` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_map_xx_2.7.0_2.4_1609164865008.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_map_xx_2.7.0_2.4_1609164865008.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_map", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Your sentence to translate!"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_map", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.map').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_map| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Labour Market Document Classifier (EURLEX) author: John Snow Labs name: legclf_labour_market_bert date: 2023-03-06 tags: [en, legal, classification, clauses, labour_market, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_labour_market_bert model is a BERT Sentence Embeddings Document Classifier that classifies whether a given document belongs to the class Labour_Market or not (Binary Classification) according to EuroVoc labels. ## Predicted Entities `Labour_Market`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_labour_market_bert_en_1.0.0_3.0_1678111744774.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_labour_market_bert_en_1.0.0_3.0_1678111744774.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_labour_market_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Labour_Market]| |[Other]| |[Other]| |[Labour_Market]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_labour_market_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Labour_Market 0.89 0.95 0.92 43 Other 0.95 0.88 0.91 41 accuracy - - 0.92 84 macro-avg 0.92 0.92 0.92 84 weighted-avg 0.92 0.92 0.92 84 ``` --- layout: model title: English asr_wav2vec2_base_timit_demo_colab57 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab57 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab57` is an English model originally trained by hassnain. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab57_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab57_en_4.2.0_3.0_1664023693889.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab57_en_4.2.0_3.0_1664023693889.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab57", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab57", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
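The snippets above assume an `audioDf` with an `audio_content` column but never construct it. Wav2Vec2 checkpoints expect mono audio resampled to 16 kHz, represented as floats in [-1.0, 1.0]; a stdlib-only sketch (synthetic sine wave, hypothetical variable names) of producing a value in that format:

```python
# `audioDf` is assumed to hold one row per recording, with an
# `audio_content` column of floats. Generate one second of synthetic
# 16 kHz mono audio normalised to [-1.0, 1.0] as a stand-in.
import math

SAMPLE_RATE = 16_000  # Hz; the rate Wav2Vec2 checkpoints are trained on

def sine_wave(freq_hz=440.0, seconds=1.0, rate=SAMPLE_RATE):
    n = int(rate * seconds)
    return [math.sin(2 * math.pi * freq_hz * t / rate) for t in range(n)]

audio_content = sine_wave()
assert len(audio_content) == SAMPLE_RATE
assert all(-1.0 <= s <= 1.0 for s in audio_content)
# Hypothetical: wrap it into the DataFrame the pipeline above consumes.
# audioDf = spark.createDataFrame([(audio_content,)], ["audio_content"])
```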
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab57| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: Translate Southern Sotho to English Pipeline author: John Snow Labs name: translate_st_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, st, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `st` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_st_en_xx_2.7.0_2.4_1609686667658.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_st_en_xx_2.7.0_2.4_1609686667658.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_st_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_st_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.st.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_st_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Financial Statements Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_financial_statements_bert date: 2023-03-05 tags: [en, legal, classification, clauses, financial_statements, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Financial_Statements` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Financial_Statements`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_financial_statements_bert_en_1.0.0_3.0_1678050605783.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_financial_statements_bert_en_1.0.0_3.0_1678050605783.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_financial_statements_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
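As the description above recommends, long filings should be split into paragraphs before classification, since the sentence embeddings cover at most 512 tokens. A stdlib-only sketch (hypothetical helper; whitespace tokens as a rough proxy for the model's tokenizer) that splits on blank lines and flags which paragraphs fit the limit:

```python
# Split a document on blank lines (paragraph splitting "by multiline")
# and flag paragraphs that fit the 512-token embedding limit.
# Whitespace splitting is only a rough proxy for the real tokenizer.
import re

MAX_TOKENS = 512

def split_paragraphs(text, max_tokens=MAX_TOKENS):
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

doc = ("Financial statements shall be delivered quarterly.\n\n"
       "Governing law: Delaware.")
for paragraph, fits in split_paragraphs(doc):
    print(fits, paragraph)
```

Each paragraph that fits would then go through the `document_assembler` pipeline above as its own row.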
## Results ```bash +-------+ |result| +-------+ |[Financial_Statements]| |[Other]| |[Other]| |[Financial_Statements]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_financial_statements_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Financial_Statements 1.00 0.97 0.99 74 Other 0.98 1.00 0.99 103 accuracy - - 0.99 177 macro-avg 0.99 0.99 0.99 177 weighted-avg 0.99 0.99 0.99 177 ``` --- layout: model title: RCT Binary Classifier (BioBERT Sentence Embeddings) Pipeline author: John Snow Labs name: rct_binary_classifier_biobert_pipeline date: 2022-06-06 tags: [licensed, clinical, classifier, rct, en] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.4.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [rct_binary_classifier_biobert](https://nlp.johnsnowlabs.com/2022/05/27/rct_binary_classifier_biobert_en_3_0.html) model. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CLASSIFICATION_RCT/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CLASSIFICATION_RCT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rct_binary_classifier_biobert_pipeline_en_3.4.2_3.0_1654516722542.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rct_binary_classifier_biobert_pipeline_en_3.4.2_3.0_1654516722542.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("rct_binary_classifier_biobert_pipeline", "en", "clinical/models") result = pipeline.annotate("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. 
""") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("rct_binary_classifier_biobert_pipeline", "en", "clinical/models") val result = pipeline.annotate("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. """) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.rct_binary_biobert.pipeline").predict("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. 
A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. """) ```
## Results ```bash +----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |rct |text | 
+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |true|Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. 
Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. | +----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information 
{:.table-model} |---|---| |Model Name:|rct_binary_classifier_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|427.8 MB| ## Included Models - DocumentAssembler - BertSentenceEmbeddings - ClassifierDLModel --- layout: model title: Recognize Entities DL Pipeline for German - Medium author: John Snow Labs name: entity_recognizer_md date: 2021-03-22 tags: [open_source, german, entity_recognizer_md, pipeline, de] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: de edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The entity_recognizer_md is a pretrained pipeline that performs the basic text processing steps, covering the most common tasks (tokenization, named entity recognition, lemmatization, and part-of-speech tagging) on your DataFrame. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_de_3.0.0_3.0_1616447205269.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_de_3.0.0_3.0_1616447205269.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('entity_recognizer_md', lang = 'de') annotations = pipeline.fullAnnotate("Hallo aus John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("entity_recognizer_md", lang = "de") val result = pipeline.fullAnnotate("Hallo aus John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hallo aus John Snow Labs! "] result_df = nlu.load('de.ner.recognizer').predict(text) result_df ```
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:-------------------------------|:------------------------------|:------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Hallo aus John Snow Labs! '] | ['Hallo aus John Snow Labs!'] | ['Hallo', 'aus', 'John', 'Snow', 'Labs!'] | [[0.5910000205039978,.,...]] | ['O', 'O', 'I-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|entity_recognizer_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|de| --- layout: model title: English RobertaForQuestionAnswering (from rsvp-ai) author: John Snow Labs name: roberta_qa_bertserini_roberta_base date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bertserini-roberta-base` is an English model originally trained by `rsvp-ai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_bertserini_roberta_base_en_4.0.0_3.0_1655727802702.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_bertserini_roberta_base_en_4.0.0_3.0_1655727802702.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_bertserini_roberta_base","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_bertserini_roberta_base","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.base.by_rsvp-ai").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_bertserini_roberta_base| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|449.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/rsvp-ai/bertserini-roberta-base --- layout: model title: Word2Vec Embeddings in Welsh (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, cy, open_source] task: Embeddings language: cy edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_cy_3.4.1_3.0_1647467344941.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_cy_3.4.1_3.0_1647467344941.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","cy") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Rwy'n caru Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","cy") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Rwy'n caru Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("cy.embed.w2v_cc_300d").predict("""Rwy'n caru Spark NLP""") ```
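Conceptually, a word-embeddings lookup annotator is a dictionary from token strings to fixed-size vectors. A minimal Python sketch of the idea (the 3-dimensional vectors below are invented purely for illustration; the real model maps tokens to 300-dimensional vectors, and `caru` stands in for an out-of-vocabulary token):

```python
# Toy token-to-vector lookup mimicking how an embeddings lookup annotator works.
# The vectors here are made up; the actual model uses 300-d vectors.
toy_table = {
    "Spark": [0.1, 0.2, 0.3],
    "NLP": [0.4, 0.5, 0.6],
}

def embed(tokens, table, dim=3):
    """Map each token to its vector; unknown tokens fall back to a zero vector
    (a common out-of-vocabulary policy for embedding lookups)."""
    zero = [0.0] * dim
    return [table.get(tok, zero) for tok in tokens]

vectors = embed(["Spark", "NLP", "caru"], toy_table)
```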
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|cy| |Size:|292.0 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English RobertaForQuestionAnswering (from deepset) author: John Snow Labs name: roberta_qa_tinyroberta_6l_768d date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tinyroberta-6l-768d` is an English model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_tinyroberta_6l_768d_en_4.0.0_3.0_1655739981393.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_tinyroberta_6l_768d_en_4.0.0_3.0_1655739981393.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_tinyroberta_6l_768d","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_tinyroberta_6l_768d","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.tiny_768d").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_tinyroberta_6l_768d| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|307.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/deepset/tinyroberta-6l-768d - https://www.linkedin.com/company/deepset-ai/ - https://arxiv.org/pdf/1909.10351.pdf - https://haystack.deepset.ai/community/join - https://github.com/deepset-ai/haystack - https://github.com/deepset-ai/FARM - http://www.deepset.ai/jobs - https://twitter.com/deepset_ai - https://github.com/deepset-ai/haystack/discussions - https://github.com/deepset-ai/haystack/ - https://deepset.ai - https://haystack.deepset.ai/guides/model-distillation - https://deepset.ai/germanquad - https://deepset.ai/german-bert --- layout: model title: Stop Words Cleaner for Arabic author: John Snow Labs name: stopwords_ar date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: ar edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, ar] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. 
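The operation itself is straightforward: drop every token that appears in a fixed stop-word list. A minimal Python sketch with a toy English list (the pretrained model ships its own curated Arabic list; the case-insensitive comparison mirrors the model's `Case sensitive: false` setting):

```python
# Toy stop-word removal: keep only tokens that are not in the stop-word set.
stop_words = {"the", "a", "of", "and", "in"}  # toy list for illustration only

def clean_tokens(tokens, stop_words):
    # Case-insensitive membership test, as in the pretrained cleaner.
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["John", "Snow", "was", "a", "pioneer", "in", "the",
          "development", "of", "anaesthesia"]
cleaned = clean_tokens(tokens, stop_words)
```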
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_ar_ar_2.5.4_2.4_1594742440256.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_ar_ar_2.5.4_2.4_1594742440256.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_ar", "ar") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("بخلاف كونه ملك الشمال ، فإن جون سنو طبيب إنجليزي ورائد في تطوير التخدير والنظافة الطبية.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_ar", "ar") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("بخلاف كونه ملك الشمال ، فإن جون سنو طبيب إنجليزي ورائد في تطوير التخدير والنظافة الطبية.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""بخلاف كونه ملك الشمال ، فإن جون سنو طبيب إنجليزي ورائد في تطوير التخدير والنظافة الطبية"""] stopword_df = nlu.load('ar.stopwords').predict(text) stopword_df[["cleanTokens"]] ```
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=4, result='بخلاف', metadata={'sentence': '0'}), Row(annotatorType='token', begin=6, end=9, result='كونه', metadata={'sentence': '0'}), Row(annotatorType='token', begin=11, end=13, result='ملك', metadata={'sentence': '0'}), Row(annotatorType='token', begin=15, end=20, result='الشمال', metadata={'sentence': '0'}), Row(annotatorType='token', begin=24, end=26, result='فإن', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_ar| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|ar| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: English image_classifier_vit_base_beans_demo_v5 ViTForImageClassification from Miss author: John Snow Labs name: image_classifier_vit_base_beans_demo_v5 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_beans_demo_v5` is an English model originally trained by Miss. 
## Predicted Entities `lion`, `tulip`, `keyboard`, `cra`, `bus`, `dolphin`, `plate`, `beaver`, `skyscraper`, `tiger`, `bear`, `trout`, `porcupine`, `sea`, `shrew`, `squirrel`, `snail`, `leopard`, `palm_tree`, `turtle`, `orchid`, `skunk`, `hamster`, `oak_tree`, `lizard`, `bridge`, `sunflower`, `pickup_truck`, `orange`, `man`, `mouse`, `cup`, `whale`, `seal`, `television`, `snake`, `crocodile`, `cockroach`, `bed`, `otter`, `caterpillar`, `woman`, `rocket`, `butterfly`, `bicycle`, `spider`, `motorcycle`, `lawn_mower`, `wolf`, `raccoon`, `can`, `cloud`, `clock`, `worm`, `tank`, `ray`, `house`, `girl`, `dinosaur`, `willow_tree`, `maple_tree`, `kangaroo`, `cattle`, `bee`, `chair`, `aquarium_fish`, `shark`, `baby`, `tractor`, `sweet_pepper`, `plain`, `lamp`, `boy`, `telephone`, `mushroom`, `couch`, `apple`, `wardrobe`, `train`, `pine_tree`, `pear`, `road`, `mountain`, `castle`, `bowl`, `lobster`, `elephant`, `beetle`, `possum`, `forest`, `flatfish`, `poppy`, `fox`, `streetcar`, `chimpanzee`, `bottle`, `rose`, `rabbit`, `table`, `camel` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_beans_demo_v5_en_4.1.0_3.0_1660170085775.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_beans_demo_v5_en_4.1.0_3.0_1660170085775.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_beans_demo_v5", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_beans_demo_v5", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_beans_demo_v5| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.2 MB| --- layout: model title: Medical Text Generation (T5-based) author: John Snow Labs name: text_generator_generic_jsl_base date: 2023-04-03 tags: [licensed, en, clinical, text_generation, tensorflow] task: Text Generation language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true engine: tensorflow annotator: MedicalTextGenerator article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a modified, Flan-T5-based (LLM) text generation model, fine-tuned on natural-instruction datasets by John Snow Labs. Given a few tokens as a prompt, it can generate human-like, conceptually meaningful text of up to 512 tokens from an input text of up to 1024 tokens. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/MEDICAL_TEXT_GENERATION/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/33.1.Medical_Text_Generation.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/text_generator_generic_jsl_base_en_4.3.2_3.0_1680519245746.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/text_generator_generic_jsl_base_en_4.3.2_3.0_1680519245746.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("prompt")\ .setOutputCol("document_prompt") med_text_generator = MedicalTextGenerator.pretrained("text_generator_generic_jsl_base", "en", "clinical/models")\ .setInputCols("document_prompt")\ .setOutputCol("answer")\ .setMaxNewTokens(256)\ .setDoSample(True)\ .setTopK(3)\ .setRandomSeed(42) pipeline = Pipeline(stages=[document_assembler, med_text_generator]) data = spark.createDataFrame([["the patient is admitted to the clinic with a severe back pain and "]]).toDF("prompt") pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("prompt") .setOutputCol("document_prompt") val med_text_generator = MedicalTextGenerator.pretrained("text_generator_generic_jsl_base", "en", "clinical/models") .setInputCols("document_prompt") .setOutputCol("answer") .setMaxNewTokens(256) .setDoSample(true) .setTopK(3) .setRandomSeed(42) val pipeline = new Pipeline().setStages(Array(document_assembler, med_text_generator)) val data = Seq(Array("the patient is admitted to the clinic with a severe back pain and ")).toDS.toDF("prompt") val result = pipeline.fit(data).transform(data) ```
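The `setDoSample(True)` / `setTopK(3)` parameters above control sampled decoding: at each step, generation is restricted to the few most probable next tokens and one of them is drawn at random. A minimal sketch of top-k filtering over a toy next-token distribution (the tokens and probabilities are invented for illustration, not the model's internals):

```python
import random

def top_k_sample(probs, k, rng):
    """Keep the k most probable tokens and draw one of them at random."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [t for t, _ in top]
    weights = [p for _, p in top]  # random.choices normalizes weights internally
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution after a prompt; values are made up.
probs = {"a": 0.40, "left": 0.25, "numbness": 0.20, "the": 0.10, "xylophone": 0.05}
rng = random.Random(42)  # fixed seed, analogous to setRandomSeed(42)
token = top_k_sample(probs, k=3, rng=rng)  # always one of the 3 most probable tokens
```

With `k=3`, low-probability tokens such as `"the"` and `"xylophone"` can never be emitted, which is how top-k sampling trades diversity for coherence.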
## Results ```bash ['the patient is admitted to the clinic with a severe back pain and a severe left - sided leg pain. The patient was diagnosed with a lumbar disc herniation and underwent a discectomy. The patient was discharged on the third postoperative day. The patient was followed up for a period of 6 months and was found to be asymptomatic. A rare case of a giant cell tumor of the sacrum. Giant cell tumors ( GCTs ) are benign, locally aggressive tumors that are most commonly found in the long bones of the extremities. They are rarely found in the spine. We report a case of a GCT of the sacrum in a young female patient. The patient presented with a history of progressive lower back pain and a palpable mass in the left buttock. The patient underwent a left hemilaminectomy and biopsy. The histopathological examination revealed a GCT. The patient was treated with a combination of surgery and radiation therapy. The patient was followed up for 2 years and no recurrence was observed. A rare case of a giant cell tumor of the sacrum. Giant cell tumors ( GCTs ) are benign, locally aggressive tumors that are most commonly found in the long bones of the extremities. They are rarely found in the spine. 
We report a case of a GCT'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|text_generator_generic_jsl_base| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|920.8 MB| |Case sensitive:|true| --- layout: model title: Premise About Health Mandates Related to Covid-19 Classifier (BioBERT) author: John Snow Labs name: bert_sequence_classifier_health_mandates_premise_tweet date: 2022-08-08 tags: [en, clinical, licensed, public_health, classifier, sequence_classification, covid_19, tweet, premise, mandate] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [BioBERT-based](https://github.com/dmis-lab/biobert) classifier that detects premises about health mandates related to Covid-19 in tweets. It is intended for direct use as a classification model; the target classes are: Has no premise (no argument), Has premise (argument). 
## Predicted Entities `Has premise (argument)`, `Has no premise (no argument)` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_MANDATES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4SC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_health_mandates_premise_tweet_en_4.0.2_3.0_1659983362057.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_health_mandates_premise_tweet_en_4.0.2_3.0_1659983362057.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from pyspark.sql.types import StringType document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_health_mandates_premise_tweet", "en", "clinical/models")\ .setInputCols(["document", "token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame(["""It's too dangerous to hold the RNC, but let's send students and teachers back to school.""", """So is the flu and pneumonia what are their s stop the Media Manipulation covid has treatments You're Speaker Pelosi nephew so stop the agenda LIES.""", """Just a quick update to my U.S. followers, I'll be making a stop in all 50 states this spring! No tickets needed, just don't wash your hands, cough on each other.""", """Go to a restaurant no mask Do a food shop wear a mask INCONSISTENT No Masks No Masks.""", """But if schools close who is gonna occupy those graves Cause politiciansprotected smokers protected drunkardsprotected school kids amp teachers""", """New title Maskhole I think I'm going to use this very soon coronavirus.""" ], StringType()).toDF("text") result = pipeline.fit(data).transform(data) result.select("class.result", "text").show(truncate=False) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_health_mandates_premise_tweet", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq(Array("It's too dangerous to hold the RNC, but let's 
send students and teachers back to school.", "So is the flu and pneumonia what are their s stop the Media Manipulation covid has treatments You're Speaker Pelosi nephew so stop the agenda LIES.", "Just a quick update to my U.S. followers, I'll be making a stop in all 50 states this spring! No tickets needed, just don't wash your hands, cough on each other.", "Go to a restaurant no mask Do a food shop wear a mask INCONSISTENT No Masks No Masks.", "But if schools close who is gonna occupy those graves Cause politiciansprotected smokers protected drunkardsprotected school kids amp teachers", "New title Maskhole I think I'm going to use this very soon coronavirus.")).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.clasify.health_premise").predict("""Just a quick update to my U.S. followers, I'll be making a stop in all 50 states this spring! No tickets needed, just don't wash your hands, cough on each other.""") ```
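The classifier emits the verbose label strings listed under Predicted Entities. If a downstream step only needs a boolean flag, a small post-processing helper can map them; this is a plain-Python sketch, and the one-element-list row shape is an assumption based on the example output in this card:

```python
def has_premise(result_row):
    """Return True if the sequence classifier tagged the tweet as
    containing a premise (an argument). `result_row` mimics the
    one-element `class.result` list produced per document."""
    label = result_row[0] if result_row else None
    return label == "Has premise (argument)"

rows = [
    ["Has premise (argument)"],
    ["Has no premise (no argument)"],
]
print([has_premise(r) for r in rows])  # [True, False]
```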
## Results ```bash +------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+ |text |result | +------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+ |It's too dangerous to hold the RNC, but let's send students and teachers back to school. |[Has premise (argument)] | |So is the flu and pneumonia what are their s stop the Media Manipulation covid has treatments You're Speaker Pelosi nephew so stop the agenda LIES. |[Has premise (argument)] | |Just a quick update to my U.S. followers, I'll be making a stop in all 50 states this spring! No tickets needed, just don't wash your hands, cough on each other.|[Has no premise (no argument)]| |Go to a restaurant no mask Do a food shop wear a mask INCONSISTENT No Masks No Masks. |[Has no premise (no argument)]| |But if schools close who is gonna occupy those graves Cause politiciansprotected smokers protected drunkardsprotected school kids amp teachers |[Has premise (argument)] | |New title Maskhole I think I'm going to use this very soon coronavirus. 
|[Has no premise (no argument)]| +------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_health_mandates_premise_tweet| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References The dataset is Covid-19-specific and consists of tweets collected via a series of keywords associated with that disease. ## Benchmarking ```bash label precision recall f1-score support Has_no_premise_(no_argument) 0.9000 0.6992 0.7870 502 Has_premise_(argument) 0.6576 0.8815 0.7532 329 accuracy - - 0.7714 831 macro-avg 0.7788 0.7903 0.7701 831 weighted-avg 0.8040 0.7714 0.7736 831 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from haddadalwi) author: John Snow Labs name: distilbert_qa_haddadalwi_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `haddadalwi`.
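The model is fine-tuned on SQuAD, where extractive answers are conventionally scored with token-overlap F1 between the predicted and gold spans. A simplified sketch of that metric (illustrative only; it omits SQuAD's article and punctuation normalization):

```python
from collections import Counter

def squad_token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 as used in SQuAD-style evaluation, simplified:
    lowercase whitespace tokens, no article/punctuation stripping."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(squad_token_f1("Clara", "Clara"))                    # 1.0
print(round(squad_token_f1("in Berkeley", "Berkeley"), 2))  # 0.67
```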
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_haddadalwi_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771017489.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_haddadalwi_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771017489.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_haddadalwi_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_haddadalwi_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_haddadalwi_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/haddadalwi/distilbert-base-uncased-finetuned-squad --- layout: model title: Extract Dates in Financial Documents author: John Snow Labs name: finner_sec_dates date: 2022-11-01 tags: [date, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Finance 1.0.0 spark_version: 3.0 supported: true annotator: FinanceNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This NER model uses `bert_embeddings_sec_bert_base` embeddings, trained on SEC documents and combined with OntoNotes 2012 annotations, to extract DATE entities. The model is light but highly accurate. ## Predicted Entities `DATE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_sec_dates_en_1.0.0_3.0_1667305896514.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_sec_dates_en_1.0.0_3.0_1667305896514.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from johnsnowlabs import nlp, finance import pyspark.sql.functions as F from pyspark.sql.types import StringType document = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokens = nlp.Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") ner = finance.NerModel.pretrained("finner_sec_dates", "en", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("label") text = "For the fiscal year ended December 31, 2021, Amazon reported a profit of ..." df = spark.createDataFrame([text], StringType()).toDF('text') pipeline = nlp.Pipeline(stages = [document, sentence, tokens, embeddings, ner]) fit_model = pipeline.fit(df) res = fit_model.transform(df) res.select(F.explode(F.arrays_zip(res.token.result, res.label.result)).alias("cols")) \ .select(F.expr("cols['0']").alias("token"), F.expr("cols['1']").alias("label")).show(truncate=50) ```
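The model emits token-level IOB labels (`B-DATE`/`I-DATE`/`O`), as the Results section shows; merging them into date spans is straightforward. Inside a Spark NLP pipeline this is what a `NerConverter` stage does, but the idea can be sketched in plain Python (simplified: it does not handle type changes between adjacent chunks without a `B-` tag):

```python
def bio_to_chunks(tokens, labels):
    """Merge IOB token labels (e.g. B-DATE / I-DATE / O) into
    (chunk_text, chunk_type) pairs."""
    chunks, current, ctype = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                chunks.append((" ".join(current), ctype))
            current, ctype = [tok], lab[2:]
        elif lab.startswith("I-") and current:
            current.append(tok)
        else:  # "O", or a dangling "I-" with no open chunk
            if current:
                chunks.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:
        chunks.append((" ".join(current), ctype))
    return chunks

tokens = ["For", "the", "fiscal", "year", "ended", "December", "31", ",", "2021", ","]
labels = ["O", "B-DATE", "I-DATE", "I-DATE", "I-DATE", "I-DATE", "I-DATE", "I-DATE", "I-DATE", "O"]
print(bio_to_chunks(tokens, labels))
# [('the fiscal year ended December 31 , 2021', 'DATE')]
```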
## Results ```bash +--------+------+ | token| label| +--------+------+ | For| O| | the|B-DATE| | fiscal|I-DATE| | year|I-DATE| | ended|I-DATE| |December|I-DATE| | 31|I-DATE| | ,|I-DATE| | 2021|I-DATE| | ,| O| | Amazon| O| |reported| O| | a| O| | profit| O| | of| O| | .| O| | .| O| | .| O| +--------+------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_sec_dates| |Compatibility:|Spark NLP for Finance 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.6 MB| ## References In-house annotations on SEC 10K filings, Ontonotes 2012 ## Benchmarking ```bash label tp fp fn prec rec f1 B-DATE 3572 278 252 0.9277922 0.9341004 0.9309356 I-DATE 4300 339 245 0.92692393 0.94609463 0.9364112 macro-avg 7872 617 497 0.92735803 0.9400975 0.93368435 micro-avg 7872 617 497 0.9273177 0.94061416 0.93391865 ``` --- layout: model title: Part of Speech for Danish author: John Snow Labs name: pos_ud_ddt date: 2021-03-09 tags: [part_of_speech, open_source, danish, pos_ud_ddt, da] task: Part of Speech Tagging language: da edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. 
## Predicted Entities - ADP - NOUN - AUX - PROPN - VERB - SCONJ - ADV - DET - ADJ - PUNCT - CCONJ - PRON - NUM - PART - X - INTJ - SYM {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_ddt_da_3.0.0_3.0_1615292166545.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_ddt_da_3.0.0_3.0_1615292166545.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_ddt", "da") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Hej fra John Snow Labs! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_ddt", "da") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Hej fra John Snow Labs! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Hej fra John Snow Labs! "] token_df = nlu.load('da.pos').predict(text) token_df ```
## Results ```bash token pos 0 Hej NOUN 1 fra ADP 2 John PROPN 3 Snow PROPN 4 Labs PROPN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_ddt| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|da| --- layout: model title: Detect Adverse Drug Events (bert-clinical) author: John Snow Labs name: ner_ade_clinicalbert date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect adverse drug events in tweets, reviews, and medical text using pretrained NER model. ## Predicted Entities `DRUG`, `ADE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_ade_clinicalbert_en_3.0.0_3.0_1617260830764.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_ade_clinicalbert_en_3.0.0_3.0_1617260830764.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = BertEmbeddings.pretrained("biobert_clinical_base_cased")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_ade_clinicalbert", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = BertEmbeddings.pretrained("biobert_clinical_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_ade_clinicalbert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("EXAMPLE_TEXT").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu 
nlu.load("en.med_ner.ade.clinical_bert").predict("""Put your text here.""") ```
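The per-label F1 scores reported in this card's Benchmarking section are the harmonic mean of precision and recall, so they can be sanity-checked from the table's own columns (the last digit may differ occasionally because the table rounds precision and recall first):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision/recall pairs copied from the Benchmarking table in this card.
bench = {"B-ADE": (0.46, 0.79), "I-DRUG": (0.96, 0.26), "O": (0.96, 0.98)}
for label, (p, r) in bench.items():
    print(label, round(f1(p, r), 2))
# B-ADE 0.58
# I-DRUG 0.41
# O 0.97
```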
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_ade_clinicalbert| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Benchmarking ```bash label precision recall f1-score support B-ADE 0.46 0.79 0.58 3582 B-DRUG 0.90 0.62 0.74 11763 I-ADE 0.45 0.76 0.56 4309 I-DRUG 0.96 0.26 0.41 7654 O 0.96 0.98 0.97 303457 accuracy - - 0.94 330765 macro-avg 0.75 0.68 0.65 330765 weighted-avg 0.95 0.94 0.94 330765 ``` --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_ff1000 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-ff1000` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_ff1000_en_4.3.0_3.0_1675120730150.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_ff1000_en_4.3.0_3.0_1675120730150.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_ff1000","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_ff1000","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_ff1000| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|123.8 MB| ## References - https://huggingface.co/google/t5-efficient-small-ff1000 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: T5 for Formal to Informal Style Transfer author: John Snow Labs name: t5_formal_to_informal_styletransfer date: 2022-01-12 tags: [t5, open_source, en] task: Text Generation language: en nav_key: models edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a text-to-text model based on T5 fine-tuned to generate informal text from a formal text input, for the task "transfer Formal to Casual:". It is based on Prithiviraj Damodaran's Styleformer. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/T5_LINGUISTIC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/T5_LINGUISTIC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_formal_to_informal_styletransfer_en_3.4.0_3.0_1641984515976.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_formal_to_informal_styletransfer_en_3.4.0_3.0_1641984515976.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import sparknlp from sparknlp.base import * from sparknlp.annotator import * spark = sparknlp.start() documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") t5 = T5Transformer.pretrained("t5_formal_to_informal_styletransfer") \ .setTask("transfer Formal to Casual:") \ .setInputCols(["documents"]) \ .setMaxOutputLength(200) \ .setOutputCol("transfers") pipeline = Pipeline().setStages([documentAssembler, t5]) data = spark.createDataFrame([["Please leave the room now."]]).toDF("text") result = pipeline.fit(data).transform(data) result.select("transfers.result").show(truncate=False) ``` ```scala import spark.implicits._ import com.johnsnowlabs.nlp.base.DocumentAssembler import com.johnsnowlabs.nlp.annotators.seq2seq.T5Transformer import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("documents") val t5 = T5Transformer.pretrained("t5_formal_to_informal_styletransfer") .setTask("transfer Formal to Casual:") .setMaxOutputLength(200) .setInputCols("documents") .setOutputCol("transfers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("Please leave the room now.") .toDF("text") val result = pipeline.fit(data).transform(data) result.select("transfers.result").show(false) ``` {:.nlu-block} ```python import nlu nlu.load("en.t5.formal_to_informal_styletransfer").predict("""transfer Formal to Casual:""") ```
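`setTask` prepends the task prefix to every input document before generation, which is how a multi-task T5 model is steered toward the style-transfer behavior. Conceptually (a plain-Python illustration of what the transformer arranges internally, not Spark NLP code):

```python
def build_t5_input(task_prefix: str, text: str) -> str:
    """Mimic how a T5 task prefix is prepended to the input text
    before the model generates its output."""
    return f"{task_prefix} {text}"

print(build_t5_input("transfer Formal to Casual:", "Please leave the room now."))
# transfer Formal to Casual: Please leave the room now.
```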
## Results ```bash +---------------------+ |result | +---------------------+ |[leave the room now.]| +---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_formal_to_informal_styletransfer| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[transfers]| |Language:|en| |Size:|923.9 MB| ## Data Source The original model is from the transformers library: https://huggingface.co/prithivida/formal_to_informal_styletransfer --- layout: model title: English asr_wav2vec2_base_10000 TFWav2Vec2ForCTC from jiobiala24 author: John Snow Labs name: asr_wav2vec2_base_10000 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_10000` is an English model originally trained by jiobiala24. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_10000_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_10000_en_4.2.0_3.0_1664118385712.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_10000_en_4.2.0_3.0_1664118385712.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_10000", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_10000", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
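Transcription quality for ASR models like this one is conventionally measured with word error rate: the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. A minimal self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic programming over the word-level edit distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 2))  # 0.33
```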
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_10000| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|349.3 MB| --- layout: model title: Legal Retirement Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_retirement_agreement_bert date: 2023-01-29 tags: [en, legal, classification, retirement, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_retirement_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `retirement-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `retirement-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_retirement_agreement_bert_en_1.0.0_3.0_1674990584883.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_retirement_agreement_bert_en_1.0.0_3.0_1674990584883.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_retirement_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +----------------------+ |result | +----------------------+ |[retirement-agreement]| |[other] | |[other] | |[retirement-agreement]| +----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_retirement_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.98 1.00 0.99 101 retirement-agreement 1.00 0.96 0.98 49 accuracy - - 0.99 150 macro-avg 0.99 0.98 0.98 150 weighted-avg 0.99 0.99 0.99 150 ``` --- layout: model title: Oncology Pipeline for Diagnosis Entities author: John Snow Labs name: oncology_diagnosis_pipeline date: 2022-12-01 tags: [licensed, pipeline, oncology, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline includes Named-Entity Recognition, Assertion Status, Relation Extraction and Entity Resolution models to extract information from oncology texts. This pipeline focuses on entities related to oncological diagnosis.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/oncology_diagnosis_pipeline_en_4.2.2_3.0_1669901190921.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/oncology_diagnosis_pipeline_en_4.2.2_3.0_1669901190921.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("oncology_diagnosis_pipeline", "en", "clinical/models") pipeline.fullAnnotate("Two years ago, the patient presented with a 4-cm tumor in her left breast. She was diagnosed with ductal carcinoma. According to her last CT, she has no lung metastases.")[0] ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("oncology_diagnosis_pipeline", "en", "clinical/models") val result = pipeline.fullAnnotate("""Two years ago, the patient presented with a 4-cm tumor in her left breast. She was diagnosed with ductal carcinoma. According to her last CT, she has no lung metastases.""")(0) ``` {:.nlu-block} ```python import nlu nlu.load("en.oncology_diagnosis.pipeline").predict("""Two years ago, the patient presented with a 4-cm tumor in her left breast. She was diagnosed with ductal carcinoma. According to her last CT, she has no lung metastases.""") ```
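Downstream consumers of this pipeline often keep only the entities whose assertion status is `Present`. A minimal post-processing sketch in plain Python; the `(chunk, ner_label, assertion)` triples mirror the assertion_oncology_wip rows in the Results section, while the actual Spark NLP output objects have a richer structure:

```python
def present_entities(rows):
    """Keep only entities whose assertion status is 'Present'.
    Each row is a (chunk, ner_label, assertion) triple."""
    return [(chunk, label) for chunk, label, assertion in rows
            if assertion == "Present"]

rows = [
    ("tumor", "Tumor_Finding", "Present"),
    ("ductal", "Histological_Type", "Present"),
    ("carcinoma", "Cancer_Dx", "Present"),
    ("metastases", "Metastasis", "Absent"),
]
print(present_entities(rows))
# [('tumor', 'Tumor_Finding'), ('ductal', 'Histological_Type'), ('carcinoma', 'Cancer_Dx')]
```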
## Results ```bash ******************** ner_oncology_wip results ******************** | chunk | ner_label | |:-----------|:------------------| | 4-cm | Tumor_Size | | tumor | Tumor_Finding | | left | Direction | | breast | Site_Breast | | ductal | Histological_Type | | carcinoma | Cancer_Dx | | lung | Site_Lung | | metastases | Metastasis | ******************** ner_oncology_diagnosis_wip results ******************** | chunk | ner_label | |:-----------|:------------------| | 4-cm | Tumor_Size | | tumor | Tumor_Finding | | ductal | Histological_Type | | carcinoma | Cancer_Dx | | metastases | Metastasis | ******************** ner_oncology_tnm_wip results ******************** | chunk | ner_label | |:-----------|:------------------| | 4-cm | Tumor_Description | | tumor | Tumor | | ductal | Tumor_Description | | carcinoma | Cancer_Dx | | metastases | Metastasis | ******************** assertion_oncology_wip results ******************** | chunk | ner_label | assertion | |:-----------|:------------------|:------------| | tumor | Tumor_Finding | Present | | ductal | Histological_Type | Present | | carcinoma | Cancer_Dx | Present | | metastases | Metastasis | Absent | ******************** assertion_oncology_problem_wip results ******************** | chunk | ner_label | assertion | |:-----------|:------------------|:-----------------------| | tumor | Tumor_Finding | Medical_History | | ductal | Histological_Type | Medical_History | | carcinoma | Cancer_Dx | Medical_History | | metastases | Metastasis | Hypothetical_Or_Absent | ******************** re_oncology_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:---------|:--------------|:-----------|:--------------|:--------------| | 4-cm | Tumor_Size | tumor | Tumor_Finding | is_related_to | | 4-cm | Tumor_Size | carcinoma | Cancer_Dx | O | | tumor | Tumor_Finding | breast | Site_Breast | is_related_to | | breast | Site_Breast | carcinoma | Cancer_Dx | O | | lung | Site_Lung | metastases | 
Metastasis | is_related_to | ******************** re_oncology_granular_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:---------|:--------------|:-----------|:--------------|:---------------| | 4-cm | Tumor_Size | tumor | Tumor_Finding | is_size_of | | 4-cm | Tumor_Size | carcinoma | Cancer_Dx | O | | tumor | Tumor_Finding | breast | Site_Breast | is_location_of | | breast | Site_Breast | carcinoma | Cancer_Dx | O | | lung | Site_Lung | metastases | Metastasis | is_location_of | ******************** re_oncology_size_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:---------|:-----------|:----------|:--------------|:-----------| | 4-cm | Tumor_Size | tumor | Tumor_Finding | is_size_of | | 4-cm | Tumor_Size | carcinoma | Cancer_Dx | O | ******************** ICD-O resolver results ******************** | chunk | ner_label | code | normalized_term | |:-----------|:------------------|:-------|:------------------| | tumor | Tumor_Finding | 8000/1 | tumor | | breast | Site_Breast | C50 | breast | | ductal | Histological_Type | 8500/2 | dcis | | carcinoma | Cancer_Dx | 8010/3 | carcinoma | | lung | Site_Lung | C34.9 | lung | | metastases | Metastasis | 8000/6 | tumor, metastatic | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|oncology_diagnosis_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|2.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ChunkMergeModel - AssertionDLModel - AssertionDLModel - PerceptronModel - DependencyParserModel - RelationExtractionModel - RelationExtractionModel - RelationExtractionModel - ChunkMergeModel - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel --- layout: 
model title: English asr_wav2vec2_large_xls_r_300m_georgian_v0.6 TFWav2Vec2ForCTC from pavle-tsotskolauri author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_georgian_v0.6 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_georgian_v0.6` is an English model originally trained by pavle-tsotskolauri. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_300m_georgian_v0.6_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_georgian_v0.6_en_4.2.0_3.0_1664118973050.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_georgian_v0.6_en_4.2.0_3.0_1664118973050.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_georgian_v0.6", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_georgian_v0.6", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
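The `audio_content` column fed to `AudioAssembler` is expected to hold the raw waveform as an array of floats (typically 16 kHz mono). As a minimal, Spark-free sketch of that conversion step — assuming 16-bit little-endian PCM input; the helper name is ours, not part of the Spark NLP API:

```python
import struct

def pcm16_to_floats(pcm_bytes: bytes) -> list:
    """Normalize little-endian 16-bit PCM samples to floats in [-1.0, 1.0]."""
    n = len(pcm_bytes) // 2
    samples = struct.unpack("<%dh" % n, pcm_bytes[: n * 2])
    return [s / 32768.0 for s in samples]

# Three samples: silence, maximum positive, maximum negative.
pcm = struct.pack("<3h", 0, 32767, -32768)
floats = pcm16_to_floats(pcm)
print(floats)  # [0.0, 0.999969482421875, -1.0]
```

A list of such float arrays (one per recording) can then be turned into the `audioDf` DataFrame used above.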
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_georgian_v0.6| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|2.3 GB| --- layout: model title: Legal Relation Extraction (Alias) author: John Snow Labs name: legre_org_prod_alias date: 2022-08-17 tags: [en, legal, re, relations, licensed] task: Relation Extraction language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model can be used to extract Aliases of Companies or Product names. An "Alias" is a name used in a document to refer to the original name of a company or product. Examples: - John Snow Labs, also known as JSL - John Snow Labs ("JSL") - etc. ## Predicted Entities `has_alias`, `has_collective_alias` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legre_org_prod_alias_en_1.0.0_3.2_1660739037434.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legre_org_prod_alias_en_1.0.0_3.2_1660739037434.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")\ .setInputCols(["document", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") reDL = legal.RelationExtractionDLModel.pretrained("legre_org_prod_alias", "en", "legal/models")\ .setPredictionThreshold(0.1)\ .setInputCols(["ner_chunk", "document"])\ .setOutputCol("relations") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, ner_model, ner_converter, reDL]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text=''' On March 12, 2020 we closed a Loan and Security Agreement with Hitachi Capital America Corp. ("Hitachi") the terms of which are described in this report which replaced our credit facility with Western Alliance Bank. ''' lmodel = nlp.LightPipeline(model) lmodel.fullAnnotate(text) ```
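For intuition, the surface pattern behind `has_alias` is often a short quoted name in parentheses following the full company name, as in the sample sentence above. A naive regex baseline for that one pattern (illustration only — the actual model is a deep-learning relation extractor over NER chunks, not a regex):

```python
import re

# Matches a quoted alias in parentheses, e.g. ("Hitachi") or ("JSL").
ALIAS_PATTERN = re.compile(r'\("([^"]+)"\)')

text = 'Loan and Security Agreement with Hitachi Capital America Corp. ("Hitachi")'
aliases = ALIAS_PATTERN.findall(text)
print(aliases)  # ['Hitachi']
```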
## Results ```bash relation entity1 entity1_begin entity1_end chunk1 entity2 entity2_begin entity2_end chunk2 confidence has_alias ORG 64 92 Hitachi Capital America Corp. ALIAS 96 102 Hitachi 0.9983972 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legre_org_prod_alias| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|409.9 MB| ## References Manual annotations on CUAD dataset and 10K filings ## Benchmarking ```bash label Recall Precision F1 Support has_alias 0.920 1.000 0.958 50 has_collective_alias 1.000 0.750 0.857 6 no_rel 1.000 0.957 0.978 44 Avg. 0.973 0.902 0.931 - Weighted-Avg. 0.960 0.966 0.961 - ``` --- layout: model title: Legal Reliance Clause Binary Classifier author: John Snow Labs name: legclf_reliance_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `reliance` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `reliance` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_reliance_clause_en_1.0.0_3.2_1660122923811.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_reliance_clause_en_1.0.0_3.2_1660122923811.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_reliance_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
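The description above recommends paragraph splitting (by multiline) before classification, so each candidate clause fits the embedding window. A minimal stdlib sketch of that preprocessing step — the function name and sample text are ours, not part of the Legal NLP API:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document on blank lines so each candidate clause is classified separately."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = (
    "RELIANCE.\nEach party acknowledges that it has not relied on any "
    "representation not set out in this Agreement.\n\n"
    "GOVERNING LAW.\nThis Agreement shall be governed by the laws of England."
)
paragraphs = split_paragraphs(doc)
print(len(paragraphs))  # 2
```

Each resulting paragraph would then go into the `clause_text` column used by the pipeline above.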
## Results ```bash +-------+ | result| +-------+ |[reliance]| |[other]| |[other]| |[reliance]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_reliance_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.97 0.97 0.97 104 reliance 0.88 0.88 0.88 24 accuracy - - 0.95 128 macro-avg 0.92 0.92 0.92 128 weighted-avg 0.95 0.95 0.95 128 ``` --- layout: model title: Chinese BertForMaskedLM Mini Cased model (from hfl) author: John Snow Labs name: bert_embeddings_minirbt_h288 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `minirbt-h288` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_minirbt_h288_zh_4.2.4_3.0_1670326889548.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_minirbt_h288_zh_4.2.4_3.0_1670326889548.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_minirbt_h288","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_minirbt_h288","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_minirbt_h288| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|46.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/minirbt-h288 - https://github.com/iflytek/MiniRBT - https://github.com/ymcui/LERT - https://github.com/ymcui/PERT - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/iflytek/HFL-Anthology --- layout: model title: English RobertaForTokenClassification Base Cased model (from mrm8488) author: John Snow Labs name: roberta_ner_codebert_base_finetuned_stackoverflow date: 2022-07-19 tags: [open_source, roberta, ner, stackoverflow, codebert, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa NER model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `codebert_finetuned_stackoverflow` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_ner_codebert_base_finetuned_stackoverflow_en_4.0.0_3.0_1658212367861.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_ner_codebert_base_finetuned_stackoverflow_en_4.0.0_3.0_1658212367861.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") ner = RoBertaForTokenClassification.pretrained("roberta_ner_codebert_base_finetuned_stackoverflow","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, ner]) data = spark.createDataFrame([["PUT YOUR STRING HERE."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val ner = RoBertaForTokenClassification.pretrained("roberta_ner_codebert_base_finetuned_stackoverflow","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, ner)) val data = Seq("PUT YOUR STRING HERE.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_ner_codebert_base_finetuned_stackoverflow| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|467.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References https://huggingface.co/mrm8488/codebert-base-finetuned-stackoverflow-ner --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Nadhiya) author: John Snow Labs name: distilbert_qa_nadhiya_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Nadhiya`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_nadhiya_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768819403.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_nadhiya_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768819403.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_nadhiya_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_nadhiya_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
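Extractive question-answering models of this kind predict a start/end span inside the context, and the answer string is simply that slice of the context. A toy illustration of the span-to-text step (the offsets are hard-coded here for clarity; in the pipeline above they come from the model):

```python
def answer_span(context: str, start: int, end: int) -> str:
    """Recover the answer text from predicted character offsets."""
    return context[start:end]

context = "My name is Clara and I live in Berkeley."
answer = answer_span(context, 11, 16)
print(answer)  # Clara
```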
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_nadhiya_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Nadhiya/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal Force majeure Clause Binary Classifier author: John Snow Labs name: legclf_force_majeure_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `force-majeure` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `force-majeure` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_force_majeure_clause_en_1.0.0_3.2_1660122448873.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_force_majeure_clause_en_1.0.0_3.2_1660122448873.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_force_majeure_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
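Since the underlying sentence embeddings accept at most 512 tokens, a quick pre-check can flag texts that should be split before classification. This is a rough whitespace-based count (real subword counts will be higher), and the function name is ours:

```python
def needs_chunking(text: str, max_tokens: int = 512) -> bool:
    """Rough guard: compare a whitespace token count against the embedding window."""
    return len(text.split()) > max_tokens

short_clause = "Neither party shall be liable for delays caused by events beyond its reasonable control."
long_doc = "word " * 600
print(needs_chunking(short_clause), needs_chunking(long_doc))  # False True
```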
## Results ```bash +-------+ | result| +-------+ |[force-majeure]| |[other]| |[other]| |[force-majeure]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_force_majeure_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support force-majeure 1.00 1.00 1.00 32 other 1.00 1.00 1.00 90 accuracy - - 1.00 122 macro-avg 1.00 1.00 1.00 122 weighted-avg 1.00 1.00 1.00 122 ``` --- layout: model title: English RobertaForSequenceClassification Cased model (from mbyanfei) author: John Snow Labs name: roberta_classifier_autotrain_amazon_shoe_reviews_classification_1104340243 date: 2022-12-09 tags: [en, open_source, roberta, sequence_classification, classification, tensorflow] task: Text Classification language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-amazon-shoe-reviews-classification-1104340243` is an English model originally trained by `mbyanfei`. 
## Predicted Entities `1`, `0`, `4`, `3`, `2` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_classifier_autotrain_amazon_shoe_reviews_classification_1104340243_en_4.2.4_3.0_1670623254207.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_classifier_autotrain_amazon_shoe_reviews_classification_1104340243_en_4.2.4_3.0_1670623254207.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autotrain_amazon_shoe_reviews_classification_1104340243","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_classifier]) data = spark.createDataFrame([["I love you!"], ["I feel lucky to be here."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autotrain_amazon_shoe_reviews_classification_1104340243","en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_classifier)) val data = Seq("I love you!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_classifier_autotrain_amazon_shoe_reviews_classification_1104340243| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|446.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/mbyanfei/autotrain-amazon-shoe-reviews-classification-1104340243 --- layout: model title: English T5ForConditionalGeneration Cased model (from valurank) author: John Snow Labs name: t5_paraphraser date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-paraphraser` is an English model originally trained by `valurank`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_paraphraser_en_4.3.0_3.0_1675125180653.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_paraphraser_en_4.3.0_3.0_1675125180653.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_paraphraser","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_paraphraser","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_paraphraser| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|920.7 MB| ## References - https://huggingface.co/valurank/t5-paraphraser --- layout: model title: German asr_wav2vec2_base_german_cv9 TFWav2Vec2ForCTC from oliverguhr author: John Snow Labs name: pipeline_asr_wav2vec2_base_german_cv9 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_german_cv9` is a German model originally trained by oliverguhr. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_german_cv9_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_german_cv9_de_4.2.0_3.0_1664101264900.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_german_cv9_de_4.2.0_3.0_1664101264900.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_german_cv9', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_german_cv9", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_german_cv9| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|349.2 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Word2Vec Embeddings in Telugu (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, te, open_source] task: Embeddings language: te edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_te_3.4.1_3.0_1647463349772.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_te_3.4.1_3.0_1647463349772.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","te") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["నేను స్పార్క్ nlp ను ప్రేమిస్తున్నాను"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","te") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("నేను స్పార్క్ nlp ను ప్రేమిస్తున్నాను").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("te.embed.w2v_cc_300d").predict("""నేను స్పార్క్ nlp ను ప్రేమిస్తున్నాను""") ```
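The 300-dimensional vectors emitted in the `embeddings` column are usually compared with cosine similarity. A stdlib-only sketch of that comparison, using toy 2-d vectors in place of the real 300-d ones:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

similarity = round(cosine([1.0, 0.0], [1.0, 1.0]), 4)
print(similarity)  # 0.7071
```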
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|te| |Size:|1.1 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English BertForQuestionAnswering model (from aodiniz) author: John Snow Labs name: bert_qa_bert_uncased_L_4_H_256_A_4_cord19_200616_squad2_covid_qna date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-4_H-256_A-4_cord19-200616_squad2_covid-qna` is an English model originally trained by `aodiniz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_256_A_4_cord19_200616_squad2_covid_qna_en_4.0.0_3.0_1654185259989.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_256_A_4_cord19_200616_squad2_covid_qna_en_4.0.0_3.0_1654185259989.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_4_H_256_A_4_cord19_200616_squad2_covid_qna","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_uncased_L_4_H_256_A_4_cord19_200616_squad2_covid_qna","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_covid_cord19.bert.uncased_4l_256d_a4a_256d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_uncased_L_4_H_256_A_4_cord19_200616_squad2_covid_qna| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|42.1 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-4_H-256_A-4_cord19-200616_squad2_covid-qna --- layout: model title: English ALBERT Embeddings (Large) author: John Snow Labs name: albert_embeddings_albert_large_v1 date: 2022-04-14 tags: [albert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `albert-large-v1` is an English model originally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_large_v1_en_3.4.2_3.0_1649954227169.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_large_v1_en_3.4.2_3.0_1649954227169.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_large_v1","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_large_v1","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_embeddings_albert_large_v1| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|68.0 MB| |Case sensitive:|false| ## References - https://huggingface.co/albert-large-v1 - https://arxiv.org/abs/1909.11942 - https://github.com/google-research/albert - https://yknzhu.wixsite.com/mbweb - https://en.wikipedia.org/wiki/English_Wikipedia --- layout: model title: Zero-shot Legal NER (CUAD, base) author: John Snow Labs name: legner_roberta_zeroshot_cuad_base date: 2023-01-30 tags: [zero, shot, cuad, en, licensed, tensorflow] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a zero-shot NER model, trained using RoBERTa on SQuAD and fine-tuned to perform zero-shot NER on the CUAD legal dataset. A specific prompt is required to use it; for example, to extract PARTIES: ``` "Highlight the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. Details: The two or more parties who signed the contract" ``` ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_roberta_zeroshot_cuad_base_en_1.0.0_3.0_1675088394988.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_roberta_zeroshot_cuad_base_en_1.0.0_3.0_1675088394988.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") zeroshot = nlp.ZeroShotNerModel.pretrained("legner_roberta_zeroshot_cuad_base","en","legal/models")\ .setInputCols(["document", "token"])\ .setOutputCol("zero_shot_ner")\ .setEntityDefinitions( { 'PARTIES': ['Highlight the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. Details: The two or more parties who signed the contract'] }) nerconverter = nlp.NerConverter()\ .setInputCols(["document", "token", "zero_shot_ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline().setStages([ document_assembler, tokenizer, zeroshot, nerconverter ]) from pyspark.sql import types as T sample_text = ["""THIS CREDIT AGREEMENT is dated as of April 29, 2010, and is made by and among P.H. GLATFELTER COMPANY, a Pennsylvania corporation ( the "COMPANY") and certain of its subsidiaries. Identified on the signature pages hereto (each a "BORROWER" and collectively, the "BORROWERS"), each of the GUARANTORS (as hereinafter defined), the LENDERS (as hereinafter defined), PNC BANK, NATIONAL ASSOCIATION, in its capacity as agent for the Lenders under this Agreement (hereinafter referred to in such capacity as the "ADMINISTRATIVE AGENT"), and, for the limited purpose of public identification in trade tables, PNC CAPITAL MARKETS LLC and CITIZENS BANK OF PENNSYLVANIA, as joint arrangers and joint bookrunners, and CITIZENS BANK OF PENNSYLVANIA, as syndication agent.""".replace('\n',' ')] p_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) res = p_model.transform(spark.createDataFrame(sample_text, T.StringType()).toDF("text")) res.show() ```
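Under the hood, prompt-based zero-shot NER reuses an extractive question-answering model: each entity definition is posed as a question against the text, and the answered character span is reported as a chunk with that entity label. The sketch below illustrates only that mapping logic; `toy_qa` is a hypothetical stand-in that always points at the first all-caps company mention, not the real RoBERTa QA annotator.

```python
import re

def toy_qa(question, context):
    """Stand-in for an extractive QA model: return (start, end) char offsets."""
    m = re.search(r"[A-Z][A-Z\. ]+(?:COMPANY|LLC|BANK)", context)
    return (m.start(), m.end()) if m else None

def zero_shot_ner(context, entity_definitions, qa=toy_qa):
    """For each label, ask its prompt(s) and turn answer spans into chunks."""
    chunks = []
    for label, prompts in entity_definitions.items():
        for prompt in prompts:
            span = qa(prompt, context)
            if span:
                start, end = span
                chunks.append((context[start:end].strip(), label))
    return chunks

text = 'THIS CREDIT AGREEMENT is made by P.H. GLATFELTER COMPANY, a corporation.'
defs = {"PARTIES": ["The two or more parties who signed the contract"]}
print(zero_shot_ner(text, defs))
```

This mirrors the shape of `setEntityDefinitions`: a dict of label to prompt list, with one QA call per prompt.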
## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |P.H. GLATFELTER COMPANY|PARTIES | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_roberta_zeroshot_cuad_base| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|453.9 MB| |Case sensitive:|true| |Max sentence length:|128| ## References SQuAD and CUAD --- layout: model title: Japanese Bert Embeddings (from hiroshi-matsuda-rit) author: John Snow Labs name: bert_embeddings_bert_base_japanese_basic_char_v2 date: 2022-04-11 tags: [bert, embeddings, ja, open_source] task: Embeddings language: ja edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-japanese-basic-char-v2` is a Japanese model originally trained by `hiroshi-matsuda-rit`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_japanese_basic_char_v2_ja_3.4.2_3.0_1649674878739.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_japanese_basic_char_v2_ja_3.4.2_3.0_1649674878739.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_japanese_basic_char_v2","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["私はSpark NLPを愛しています"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_japanese_basic_char_v2","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("私はSpark NLPを愛しています").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ja.embed.bert_base_japanese_basic_char_v2").predict("""私はSpark NLPを愛しています""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_japanese_basic_char_v2| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Size:|340.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/hiroshi-matsuda-rit/bert-base-japanese-basic-char-v2 --- layout: model title: Fast Neural Machine Translation Model from English to Marathi author: John Snow Labs name: opus_mt_en_mr date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, mr, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `mr` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_mr_xx_2.7.0_2.4_1609168537482.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_mr_xx_2.7.0_2.4_1609168537482.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_mr", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_mr", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.mr').predict(text, output_level='sentence') opus_df ```
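The pipeline above works sentence by sentence: `SentenceDetectorDLModel` splits the document and `MarianTransformer` translates each sentence independently. A toy stdlib sketch of that control flow follows; the `fake_translate` stub merely tags its input and is not the actual opus_mt_en_mr model:

```python
import re

def split_sentences(text):
    """Naive sentence splitter standing in for SentenceDetectorDLModel."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def fake_translate(sentence):
    """Placeholder for the Marian model: tags the sentence instead of translating."""
    return "[mr] " + sentence

def translate(text):
    """Split, translate each sentence, and rejoin - the pipeline's shape."""
    return " ".join(fake_translate(s) for s in split_sentences(text))

print(translate("Hello world. How are you?"))  # [mr] Hello world. [mr] How are you?
```

Splitting first keeps each Marian input short, which matters because the transformer has a bounded input length.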
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_mr| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from English to Swazi author: John Snow Labs name: opus_mt_en_ss date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, ss, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `ss` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ss_xx_2.7.0_2.4_1609169911267.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ss_xx_2.7.0_2.4_1609169911267.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_ss", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_ss", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.ss').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_ss| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: COVID BERT Sentence Embeddings (Large Uncased) author: John Snow Labs name: sent_covidbert_large_uncased date: 2020-08-27 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description BERT-large-uncased model, pretrained on a corpus of messages from Twitter about COVID-19. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_covidbert_large_uncased_en_2.6.0_2.4_1598488155401.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_covidbert_large_uncased_en_2.6.0_2.4_1598488155401.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_covidbert_large_uncased", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer'], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_covidbert_large_uncased", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.covidbert.large_uncased').predict(text, output_level='sentence') embeddings_df ```
{:.h2_title} ## Results ```bash en_embed_sentence_covidbert_large_uncased_embeddings sentence [-1.3138830661773682, 0.592442512512207, -0.21... I hate cancer [0.08157740533351898, 0.2123042196035385, 0.15... Antibiotics aren't painkiller ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|covidbert_large_uncased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|1024| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/digitalepidemiologylab/covid-twitter-bert/2 --- layout: model title: Finnish BERT Sentence Embeddings (Base Uncased) author: John Snow Labs name: sent_bert_finnish_uncased date: 2020-08-31 task: Embeddings language: fi edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, fi] supported: true deprecated: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A version of Google's BERT deep transfer learning model for Finnish. The model can be fine-tuned to achieve state-of-the-art results for various Finnish natural language processing tasks. `FinBERT` features a custom 50,000 wordpiece vocabulary that has much better coverage of Finnish words. `FinBERT` has been pre-trained for 1 million steps on over 3 billion tokens (24B characters) of Finnish text drawn from news, online discussion, and internet crawls. By contrast, Multilingual BERT was trained on Wikipedia texts, where the Finnish Wikipedia text is approximately 3% of the amount used to train `FinBERT`. These features allow `FinBERT` to outperform not only Multilingual BERT but also all previously proposed models when fine-tuned for Finnish natural language processing tasks. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_finnish_uncased_fi_2.6.0_2.4_1598897885576.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_finnish_uncased_fi_2.6.0_2.4_1598897885576.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_bert_finnish_uncased", "fi") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['Vihaan syöpää'], ['antibiootit eivät ole kipulääkkeitä']], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_bert_finnish_uncased", "fi") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("Vihaan syöpää","antibiootit eivät ole kipulääkkeitä").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Vihaan syöpää", "antibiootit eivät ole kipulääkkeitä"] embeddings_df = nlu.load('fi.embed_sentence.bert.cased').predict(text, output_level='sentence') embeddings_df ```
{:.h2_title} ## Results ```bash sentence fi_embed_sentence_bert_uncased_embeddings Vihaan syöpää [-0.32807931303977966, -0.18222537636756897, 0... antibiootit eivät ole kipulääkkeitä [-0.192955881357193, -0.11151257902383804, 0.7... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_bert_finnish_uncased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[fi]| |Dimension:|768| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://github.com/TurkuNLP/FinBERT --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from vishvamahadevan) author: John Snow Labs name: distilbert_qa_vishvamahadevan_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `vishvamahadevan`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_vishvamahadevan_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773067737.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_vishvamahadevan_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773067737.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vishvamahadevan_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vishvamahadevan_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
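For context, extractive QA models such as this one score every token as a potential answer start and as a potential answer end, then pick the highest-scoring valid (start, end) pair. The scores below are invented for illustration; the real model derives them from the encoded question and context:

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) token pair maximizing start+end score, start <= end."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best[2]:
                best = (s, e, score)
    return best[:2]

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.0]
end   = [0.1, 0.1, 0.2, 4.5, 0.1, 0.1, 0.1, 0.1, 0.4, 0.0]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # prints "Clara"
```

The `max_len` cap mirrors the common constraint that answers cannot exceed a fixed number of tokens.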
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_vishvamahadevan_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/vishvamahadevan/distilbert-base-uncased-finetuned-squad --- layout: model title: Arabic Bert Embeddings (Base, Trained on a sixteenth of the full MSA dataset) author: John Snow Labs name: bert_embeddings_bert_base_arabic_camelbert_msa_sixteenth date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-msa-sixteenth` is an Arabic model originally trained by `CAMeL-Lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabic_camelbert_msa_sixteenth_ar_3.4.2_3.0_1649678782020.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabic_camelbert_msa_sixteenth_ar_3.4.2_3.0_1649678782020.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabic_camelbert_msa_sixteenth","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabic_camelbert_msa_sixteenth","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.bert_base_arabic_camelbert_msa_sixteenth").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_arabic_camelbert_msa_sixteenth| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|409.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa-sixteenth - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://catalog.ldc.upenn.edu/LDC2011T11 - http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus - https://vlo.clarin.eu/search;jsessionid=31066390B2C9E8C6304845BA79869AC1?1&q=osian - https://archive.org/details/arwiki-20190201 - https://oscar-corpus.com/ - https://github.com/google-research/bert - https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/tokenization.py#L286-L297 - https://github.com/CAMeL-Lab/camel_tools --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_large_data_seed_4 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-data-seed-4` is an English model originally trained by `anas-awadalla`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_data_seed_4_en_4.0.0_3.0_1655736706076.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_data_seed_4_en_4.0.0_3.0_1655736706076.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_data_seed_4","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
  .setInputCols("question", "context")
  .setOutputCols("document_question", "document_context")

val spanClassifier = RoBertaForQuestionAnswering
  .pretrained("roberta_qa_roberta_large_data_seed_4","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
  ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
  ("What's my name?", "My name is Clara and I live in Berkeley."))
  .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.roberta.large_seed_4").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
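Conceptually, an extractive QA head like this one scores every context token as a candidate answer start and end, then returns the best-scoring span. A toy, self-contained sketch of that selection step (the tokens and scores below are made up for illustration, not real model outputs):

```python
# Toy start/end scores over context tokens, illustrating how an extractive
# QA head picks an answer span. All numbers here are invented.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.2, 0.1, 3.5, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end_scores   = [0.1, 0.1, 0.1, 3.2, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]

# best start position, then best end position at or after the start
start = max(range(len(tokens)), key=lambda i: start_scores[i])
end = max(range(start, len(tokens)), key=lambda i: end_scores[i])
answer = " ".join(tokens[start:end + 1])
print(answer)
```

The real annotator does this over subword tokens and maps the span back to the original text before emitting the `answer` annotation.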
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_large_data_seed_4|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/anas-awadalla/roberta-large-data-seed-4

---
layout: model
title: English image_classifier_vit_base_patch16_224_in21k_finetuned_cifar10 ViTForImageClassification from tanlq
author: John Snow Labs
name: image_classifier_vit_base_patch16_224_in21k_finetuned_cifar10
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_patch16_224_in21k_finetuned_cifar10` is an English model originally trained by tanlq.

## Predicted Entities

`deer`, `bird`, `dog`, `horse`, `automobile`, `truck`, `frog`, `ship`, `airplane`, `cat`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_in21k_finetuned_cifar10_en_4.1.0_3.0_1660167505586.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_in21k_finetuned_cifar10_en_4.1.0_3.0_1660167505586.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_patch16_224_in21k_finetuned_cifar10", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_patch16_224_in21k_finetuned_cifar10", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
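The classifier head behind this annotator emits one logit per CIFAR-10 class; a softmax turns those into probabilities and the highest-probability label becomes the `class` annotation. A self-contained sketch of that final step, with made-up logits:

```python
import math

# CIFAR-10 label set, as listed under Predicted Entities for this model
labels = ["airplane", "automobile", "bird", "cat", "deer",
          "dog", "frog", "horse", "ship", "truck"]
# invented logits standing in for the ViT head's output for one image
logits = [0.2, 1.1, -0.3, 4.0, 0.0, 0.5, -1.2, 0.1, 0.3, 0.9]

# softmax: exponentiate and normalize so the scores sum to 1
exps = [math.exp(z) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]

pred = labels[probs.index(max(probs))]
print(pred, round(max(probs), 3))
```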
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_base_patch16_224_in21k_finetuned_cifar10|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|

---
layout: model
title: Translate English to Danish Pipeline
author: John Snow Labs
name: translate_en_da
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, da, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `en`
- target languages: `da`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_da_xx_2.7.0_2.4_1609688141755.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_da_xx_2.7.0_2.4_1609688141755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_da", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_da", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.da').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_da| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Detect Radiology Related Concepts author: John Snow Labs name: ner_radiology_wip_clinical_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_radiology_wip_clinical](https://nlp.johnsnowlabs.com/2021/04/01/ner_radiology_wip_clinical_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_RADIOLOGY/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_RADIOLOGY.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_radiology_wip_clinical_pipeline_en_3.4.1_3.0_1647874482693.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_radiology_wip_clinical_pipeline_en_3.4.1_3.0_1647874482693.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_radiology_wip_clinical_pipeline", "en", "clinical/models") pipeline.fullAnnotate("Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_radiology_wip_clinical_pipeline", "en", "clinical/models") pipeline.fullAnnotate("Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.radiology.clinical_wip.pipeline").predict("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""") ```
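The `NerConverter` stage inside this pipeline merges the model's token-level IOB tags into labelled chunks. A simplified, self-contained sketch of that grouping (the tags below are illustrative, and the real converter additionally tracks character offsets and original whitespace):

```python
# Illustrative tokens and IOB tags, loosely following this card's example text
tokens = ["Bilateral", "breast", "ultrasound", "demonstrated", "an", "ovoid", "mass"]
tags = ["B-Direction", "B-BodyPart", "B-ImagingTest", "O", "O",
        "B-ImagingFindings", "I-ImagingFindings"]

chunks = []
for token, tag in zip(tokens, tags):
    if tag.startswith("B-"):
        # B- opens a new chunk labelled with the entity type
        chunks.append([token, tag[2:]])
    elif tag.startswith("I-") and chunks:
        # I- continues the most recent chunk
        chunks[-1][0] += " " + token

pairs = [tuple(c) for c in chunks]
print(pairs)
```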
## Results ```bash +---------------------+-------------------------+ |chunks |entities | +---------------------+-------------------------+ |Bilateral |Direction | |breast |BodyPart | |ultrasound |ImagingTest | |ovoid mass |ImagingFindings | |0.5 x 0.5 x 0.4 |Measurements | |cm |Units | |anteromedial aspect |BodyPart | |left |Direction | |shoulder |BodyPart | |mass |ImagingFindings | |isoechoic echotexture|ImagingFindings | |muscle |BodyPart | |internal color flow |ImagingFindings | |benign fibrous tissue|ImagingFindings | |lipoma |Disease_Syndrome_Disorder| +---------------------+-------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_radiology_wip_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Fast Neural Machine Translation Model from Niger-Kordofanian Languages to English author: John Snow Labs name: opus_mt_nic_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, nic, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `nic` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_nic_en_xx_2.7.0_2.4_1609170330732.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_nic_en_xx_2.7.0_2.4_1609170330732.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_nic_en", "xx") \
    .setInputCols(["sentence"]) \
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols("document")
  .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_nic_en", "xx")
  .setInputCols("sentence")
  .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.nic.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_nic_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Document Visual Question Answering optimized with DONUT
author: John Snow Labs
name: docvqa_donut_base_opt
date: 2023-01-17
tags: [en, licensed]
task: Document Visual Question Answering
language: en
nav_key: models
edition: Visual NLP 4.3.0
spark_version: 3.2.1
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Document understanding transformer (Donut) model pretrained for the Document Visual Question Answering (DocVQA) task. The dataset comes from the Document Visual Question Answering [competition](https://rrc.cvc.uab.es/?ch=17) and consists of 50K questions defined on more than 12K documents. This model was optimized with ONNX.

Donut is a new method of document understanding that utilizes an OCR-free end-to-end Transformer model. Donut does not require off-the-shelf OCR engines/APIs, yet it shows state-of-the-art performance on various visual document understanding tasks, such as visual document classification or information extraction (a.k.a. document parsing). Paper: [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han and Seunghyun Park.

DocVQA seeks to inspire a “purpose-driven” point of view in Document Analysis and Recognition research, where the document content is extracted and used to respond to high-level tasks defined by the human consumers of this information.
## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/ocr/VISUAL_QUESTION_ANSWERING/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/Cards/SparkOcrVisualQuestionAnswering_opt.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/docvqa_donut_base_opt_en_4.3.0_3.0_1673269990047.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/docvqa_donut_base_opt_en_4.3.0_3.0_1673269990047.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
import pkg_resources

binary_to_image = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image") \
    .setImageType(ImageType.TYPE_3BYTE_BGR)

visual_question_answering = VisualQuestionAnswering() \
    .pretrained("docvqa_donut_base_opt", "en", "clinical/ocr") \
    .setInputCol("image") \
    .setOutputCol("answers") \
    .setQuestionsCol("questions")

# OCR pipeline
pipeline = PipelineModel(stages=[
    binary_to_image,
    visual_question_answering
])

test_image_path = pkg_resources.resource_filename('sparkocr', 'resources/ocr/vqa/agenda.png')
bin_df = spark.read.format("binaryFile").load(test_image_path)

questions = [["When it finish the Coffee Break?",
              "Who is giving the Introductory Remarks?",
              "Who is going to take part of the individual interviews?"]]
questions_df = spark.createDataFrame([questions])
questions_df = questions_df.withColumnRenamed("_1", "questions")

image_and_questions = bin_df.join(questions_df)
results = pipeline.transform(image_and_questions).cache()
results.select(results.answers).show(truncate=False)
```
```scala
val binary_to_image = new BinaryToImage()
  .setInputCol("content")
  .setOutputCol("image")
  .setImageType(ImageType.TYPE_3BYTE_BGR)

val visual_question_answering = VisualQuestionAnswering()
  .pretrained("docvqa_donut_base_opt", "en", "clinical/ocr")
  .setInputCol("image")
  .setOutputCol("answers")
  .setQuestionsCol("questions")

// OCR pipeline
val pipeline = new Pipeline().setStages(Array(binary_to_image, visual_question_answering))

// adjust this to point at a local copy of the test image
val test_image_path = "resources/ocr/vqa/agenda.png"
val bin_df = spark.read.format("binaryFile").load(test_image_path)

val questions = Array("When it finish the Coffee Break?",
  "Who is giving the Introductory Remarks?",
  "Who is going to take part of the individual interviews?")
val questions_df = Seq(Tuple1(questions)).toDF("questions")

val image_and_questions = bin_df.join(questions_df)
val results = pipeline.fit(image_and_questions).transform(image_and_questions).cache()
results.select("answers").show(false)
```
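The `answers` column pairs each input question with its predicted answer. As a trivial, self-contained sketch of that `question -> answer` formatting (strings taken from this card's example):

```python
# Questions from the card's example and the answers the model returned for them
questions = ["When it finish the Coffee Break?",
             "Who is giving the Introductory Remarks?"]
answers = ["11:39 a.m.", "lee a. waller, trrf vice presi- ident"]

# format each pair the way the answers column displays them
formatted = [f"{q} -> {a}" for q, a in zip(questions, answers)]
print(formatted)
```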
## Example

### Input:
```bash
+-----------------------------------------------------------------------------------------------------------------------------------+
|questions                                                                                                                          |
+-----------------------------------------------------------------------------------------------------------------------------------+
|[When it finish the Coffee Break?, Who is giving the Introductory Remarks?, Who is going to take part of the individual interviews?]|
+-----------------------------------------------------------------------------------------------------------------------------------+
```
![Screenshot](/assets/images/examples_ocr/image12.png)

### Output:
```bash
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|answers                                                                                                                                                                         |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[When it finish the Coffee Break? -> 11:39 a.m., Who is giving the Introductory Remarks? -> lee a. waller, trrf vice presi- ident, Who is going to take part of the individual interviews? -> trrf]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

## Model Information

{:.table-model}
|---|---|
|Model Name:|docvqa_donut_base_opt|
|Type:|ocr|
|Compatibility:|Visual NLP 4.3.0+|
|License:|Licensed|

---
layout: model
title: Detect Clinical Events (Admissions)
author: John Snow Labs
name: ner_events_admission_clinical
date: 2021-03-31
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model can be used to detect clinical events in medical text, with a focus on admission entities.

## Predicted Entities

`DATE`, `TIME`, `PROBLEM`, `TEST`, `TREATMENT`, `OCCURRENCE`, `CLINICAL_DEPT`, `EVIDENTIAL`, `DURATION`, `FREQUENCY`, `ADMISSION`, `DISCHARGE`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_EVENTS_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_events_admission_clinical_en_3.0.0_3.0_1617209704296.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_events_admission_clinical_en_3.0.0_3.0_1617209704296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_events_admission_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["The patient presented to the emergency room last evening"]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_events_admission_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("""The 
patient presented to the emergency room last evening""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.admission_events").predict("""The patient presented to the emergency room last evening""") ```
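For each extracted chunk, the pipeline also reports character-level begin/end offsets into the input text (end offsets are inclusive in Spark NLP). A self-contained sketch of how those offsets line up with the example sentence, using `str.index` as a stand-in for the annotator's bookkeeping:

```python
text = "The patient presented to the emergency room last evening"
chunks = ["presented", "the emergency room", "last evening"]

offsets = []
for chunk in chunks:
    begin = text.index(chunk)          # first character of the chunk
    end = begin + len(chunk) - 1       # inclusive last character
    offsets.append((chunk, begin, end))

for chunk, begin, end in offsets:
    print(chunk, begin, end)
```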
## Results ```bash +----+-----------------------------+---------+---------+-----------------+ | | chunk | begin | end | entity | +====+=============================+=========+=========+=================+ | 0 | presented | 12 | 20 | EVIDENTIAL | +----+-----------------------------+---------+---------+-----------------+ | 1 | the emergency room | 25 | 42 | CLINICAL_DEPT | +----+-----------------------------+---------+---------+-----------------+ | 2 | last evening | 44 | 55 | DATE | +----+-----------------------------+---------+---------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_events_admission_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on augmented/enriched i2b2 events data with clinical_embeddings. The data for Admissions has been enriched specifically. ## Benchmarking ```bash label tp fp fn prec rec f1 I-TIME 42 6 9 0.875 0.8235294 0.8484849 I-TREATMENT 1134 111 312 0.9108434 0.7842324 0.8428094 B-OCCURRENCE 406 344 382 0.5413333 0.51522845 0.52795845 I-DURATION 160 42 71 0.7920792 0.6926407 0.73903 B-DATE 500 32 49 0.9398496 0.9107468 0.92506933 I-DATE 309 54 49 0.8512397 0.8631285 0.8571429 B-ADMISSION 206 1 2 0.9951691 0.99038464 0.9927711 I-PROBLEM 2394 390 412 0.85991377 0.85317177 0.8565295 B-CLINICAL_DEPT 327 64 77 0.8363171 0.8094059 0.8226415 B-TIME 44 12 15 0.78571427 0.7457627 0.76521736 I-CLINICAL_DEPT 597 62 78 0.90591806 0.8844444 0.8950525 B-PROBLEM 1643 260 252 0.86337364 0.86701846 0.86519223 I-FREQUENCY 35 21 39 0.625 0.47297296 0.5384615 I-TEST 1082 171 117 0.86352754 0.9024187 0.8825449 B-TEST 781 125 127 0.8620309 0.86013216 0.86108047 B-TREATMENT 1283 176 202 0.87936944 0.8639731 0.87160325 B-DISCHARGE 155 0 1 1.0 0.99358976 0.99678457 B-EVIDENTIAL 269 25 75 0.914966 0.78197676 0.84326017 B-DURATION 97 43 44 0.69285715 0.6879433 
0.6903914 B-FREQUENCY 70 16 33 0.81395346 0.6796116 0.7407407 tp: 11841 fp: 2366 fn: 2680 labels: 22 Macro-average prec: 0.8137135, rec: 0.7533389, f1: 0.7823631 Micro-average prec: 0.83346236, rec: 0.8154397, f1: 0.8243525 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Luo (Kenya and Tanzania) author: John Snow Labs name: opus_mt_en_luo date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, luo, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `luo` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_luo_xx_2.7.0_2.4_1609167663526.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_luo_xx_2.7.0_2.4_1609167663526.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_luo", "xx") \
    .setInputCols(["sentence"]) \
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("Your sentence to translate!")
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols("document")
  .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_luo", "xx")
  .setInputCols("sentence")
  .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("Your sentence to translate!").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.luo').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_en_luo|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English BertForQuestionAnswering model (from HomayounSadri)
author: John Snow Labs
name: bert_qa_HomayounSadri_bert_base_uncased_finetuned_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-finetuned-squad` is an English model originally trained by `HomayounSadri`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_HomayounSadri_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654180994211.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_HomayounSadri_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654180994211.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_HomayounSadri_bert_base_uncased_finetuned_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
  .setInputCols("question", "context")
  .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
  .pretrained("bert_qa_HomayounSadri_bert_base_uncased_finetuned_squad","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
  ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
  ("What's my name?", "My name is Clara and I live in Berkeley."))
  .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.base_uncased.by_HomayounSadri").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_HomayounSadri_bert_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/HomayounSadri/bert-base-uncased-finetuned-squad --- layout: model title: Detect Cancer Genetics (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_bionlp date: 2022-01-03 tags: [bertfortokenclassification, ner, bionlp, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.0 spark_version: 2.4 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts biological and genetic terms from cancer-related texts using a pretrained NER model. It is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. ## Predicted Entities `Amino_acid`, `Anatomical_system`, `Cancer`, `Cell`, `Cellular_component`, `Developing_anatomical_Structure`, `Gene_or_gene_product`, `Immaterial_anatomical_entity`, `Multi-tissue_structure`, `Organ`, `Organism`, `Organism_subdivision`, `Simple_chemical`, `Tissue`, `Organism_substance`, `Pathological_formation` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bionlp_en_3.4.0_2.4_1641222741515.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bionlp_en_3.4.0_2.4_1641222741515.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassification.pretrained("bert_token_classifier_ner_bionlp", "en", "clinical/models")\ .setInputCols("token", "document")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter]) data = spark.createDataFrame([["""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassification.pretrained("bert_token_classifier_ner_bionlp", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("ner") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("document","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter)) val data = Seq("""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. 
The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.bionlp").predict("""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.""") ```
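The `NerConverter` stage in the pipeline above merges the classifier's token-level IOB tags into whole entity chunks. A simplified pure-Python sketch of that merging logic (the token/tag pairs are invented for illustration, chosen to mirror the sample sentence; the real converter also tracks character offsets):

```python
# Illustrative IOB-to-chunk merging (simplified NerConverter-style behavior, not the Spark NLP API).
def merge_iob(tokens, tags):
    """Group consecutive B-/I- tags of the same label into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):                 # a new entity starts
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)                  # continue the open entity
        else:                                    # "O" tag or inconsistent I- tag: close any open entity
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["infection", "of", "bone", "marrow", "or", "blastoderm", "cultures"]
tags = ["O", "O", "B-Multi-tissue_structure", "I-Multi-tissue_structure", "O", "B-Cell", "I-Cell"]
print(merge_iob(tokens, tags))
# [('bone marrow', 'Multi-tissue_structure'), ('blastoderm cultures', 'Cell')]
```

The chunk/label pairs in the Results section are produced by exactly this kind of grouping over the classifier's per-token predictions.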
## Results ```bash +-------------------+----------------------+ |chunk |ner_label | +-------------------+----------------------+ |erbA IRES |Organism | |erbA/myb virus |Organism | |erythroid cells |Cell | |bone marrow |Multi-tissue_structure| |blastoderm cultures|Cell | |erbA/myb IRES virus|Organism | |erbA IRES virus |Organism | |blastoderm |Cell | +-------------------+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_bionlp| |Compatibility:|Healthcare NLP 3.4.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## Data Source Trained on the Cancer Genetics (CG) task of the BioNLP Shared Task 2013. https://aclanthology.org/W13-2008/ ## Benchmarking ```bash Label precision recall f1-score support B-Cancer 0.88 0.82 0.85 924 B-Cell 0.84 0.86 0.85 1013 B-Cellular_component 0.87 0.84 0.86 180 B-Developing_anatomical_structure 0.65 0.65 0.65 17 B-Gene_or_gene_product 0.62 0.79 0.69 2520 B-Immaterial_anatomical_entity 0.68 0.74 0.71 31 B-Multi-tissue_structure 0.84 0.76 0.80 303 B-Organ 0.78 0.74 0.76 156 B-Organism 0.93 0.86 0.89 518 B-Organism_subdivision 0.74 0.51 0.61 39 B-Organism_substance 0.93 0.66 0.77 102 B-Pathological_formation 0.85 0.60 0.71 88 B-Simple_chemical 0.61 0.75 0.68 727 B-Tissue 0.74 0.83 0.78 184 I-Amino_acid 0.60 1.00 0.75 3 I-Cancer 0.91 0.69 0.78 604 I-Cell 0.98 0.74 0.84 1091 I-Cellular_component 0.88 0.62 0.73 69 I-Multi-tissue_structure 0.89 0.86 0.87 162 I-Organ 0.67 0.59 0.62 17 I-Organism 0.84 0.45 0.59 120 I-Organism_substance 0.80 0.50 0.62 24 I-Pathological_formation 0.81 0.56 0.67 39 I-Tissue 0.83 0.86 0.84 111 accuracy - - 0.64 12129 macro-avg 0.73 0.56 0.60 12129 weighted-avg 0.83 0.64 0.68 12129 ``` --- layout: model title: Mapping SNOMED Codes with Their Corresponding ICDO Codes author: John Snow Labs name: snomed_icdo_mapper 
date: 2022-06-26 tags: [snomed, icdo, chunk_mapper, clinical, licensed, en] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps SNOMED codes to corresponding ICDO codes under the Unified Medical Language System (UMLS). ## Predicted Entities `icdo_code` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/snomed_icdo_mapper_en_3.5.3_3.0_1656279162444.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/snomed_icdo_mapper_en_3.5.3_3.0_1656279162444.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") snomed_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_snomed_findings_aux_concepts", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("snomed_code")\ .setDistanceFunction("EUCLIDEAN") chunkerMapper = ChunkMapperModel\ .pretrained("snomed_icdo_mapper", "en", "clinical/models")\ .setInputCols(["snomed_code"])\ .setOutputCol("icdo_mappings")\ .setRels(["icdo_code"]) pipeline = Pipeline(stages = [ documentAssembler, sbert_embedder, snomed_resolver, chunkerMapper ]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) light_pipeline = LightPipeline(model) result = light_pipeline.fullAnnotate("Structure of tendon of gluteus minimus") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols("ner_chunk") .setOutputCol("sbert_embeddings") val snomed_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_snomed_findings_aux_concepts", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("snomed_code") .setDistanceFunction("EUCLIDEAN") val chunkerMapper = ChunkMapperModel .pretrained("snomed_icdo_mapper", "en", "clinical/models") .setInputCols("snomed_code") .setOutputCol("icdo_mappings") .setRels(Array("icdo_code")) val pipeline = new Pipeline().setStages(Array( documentAssembler, sbert_embedder, snomed_resolver, chunkerMapper )) val data = Seq("Structure of tendon of gluteus minimus").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` 
{:.nlu-block} ```python import nlu nlu.load("en.snomed_to_icdo").predict("""Structure of tendon of gluteus minimus""") ```
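Conceptually, the `ChunkMapperModel` stage is a keyed lookup: each SNOMED code produced by the resolver is mapped to its `icdo_code` relation. A minimal pure-Python sketch of that idea (the table here is reduced to the single pair shown in the Results section; the real model ships the full UMLS-derived mapping):

```python
# Illustrative code-to-code chunk mapping (conceptual sketch, not the Spark NLP API).
SNOMED_TO_ICDO = {
    # Pair taken from the Results table: "Structure of tendon of gluteus minimus"
    "128501000": "C49.5",
}

def map_code(snomed_code, rel="icdo_code"):
    """Return the requested relation for a SNOMED code, or None if unmapped."""
    if rel != "icdo_code":
        return None
    return SNOMED_TO_ICDO.get(snomed_code)

print(map_code("128501000"))  # C49.5
```

The `setRels(["icdo_code"])` call in the pipeline selects which relation the mapper emits when a model provides several.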
## Results ```bash | | ner_chunk | snomed_code | icdo_mappings | |---:|:---------------------------------------|:------------|----------------:| | 0 | Structure of tendon of gluteus minimus | 128501000 | C49.5 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|snomed_icdo_mapper| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[snomed_code]| |Output Labels:|[mappings]| |Language:|en| |Size:|203.5 KB| --- layout: model title: English BertForTokenClassification Cased model (from kktoto) author: John Snow Labs name: bert_pos_4L_weight_decay date: 2022-07-06 tags: [bert, pos, part_of_speech, en, open_source] task: Part of Speech Tagging language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `4L_weight_decay` is an English model originally trained by `kktoto`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_4L_weight_decay_en_4.0.0_3.0_1657118162458.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_4L_weight_decay_en_4.0.0_3.0_1657118162458.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_4L_weight_decay","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_4L_weight_decay","en") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_4L_weight_decay| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|42.9 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/kktoto/4L_weight_decay --- layout: model title: SDOH Housing Insecurity For Classification author: John Snow Labs name: genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli date: 2023-04-10 tags: [en, licenced, clinical, sdoh, housing, biobert, generic_classifier, licensed] task: Text Classification language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: GenericClassifierModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Generic Classifier model detects whether a patient has housing insecurity. If the clinical note mentions housing problems, the model identifies them; if there is no housing issue, or none is mentioned in the text, the note is classified as having no housing insecurity. The model is trained using the GenericClassifierApproach annotator. `Housing_Insecurity`: The patient has housing problems. `No_Housing_Insecurity`: The patient has no housing problems, or they are not mentioned in the clinical notes. ## Predicted Entities `Housing_Insecurity`, `No_Housing_Insecurity` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli_en_4.3.2_3.2_1681116895742.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli_en_4.3.2_3.2_1681116895742.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") features_asm = FeaturesAssembler()\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("features") generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli", 'en', 'clinical/models')\ .setInputCols(["features"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, sentence_embeddings, features_asm, generic_classifier ]) text_list = ["""Patient: Mary H. Background: Mary is a 40-year-old woman who has been diagnosed with asthma and allergies. She has been managing her conditions with medication and regular follow-up appointments with her healthcare provider. She lives in a rented apartment with her husband and two children and has been stably housed for the past five years. Presenting problem: Mary presents to the clinic for a routine check-up and reports no significant changes in her health status or symptoms related to her asthma or allergies. However, she expresses concerns about the quality of the air in her apartment and potential environmental triggers that could impact her health. Medical history: Mary has a medical history of asthma and allergies. She takes an inhaler and antihistamines to manage her conditions. Social history: Mary is married with two children and lives in a rented apartment. She and her husband both work full-time jobs and have health insurance. They have savings and are able to cover basic expenses. Assessment: The clinician assesses Mary's medical conditions and determines that her asthma and allergies are stable and well-controlled. 
The clinician also assesses Mary's housing situation and determines that her apartment building is in good condition and does not present any immediate environmental hazards. Plan: The clinician advises Mary to continue to monitor her health conditions and to report any changes or concerns to her healthcare team. The clinician also prescribes a referral to an allergist who can provide additional evaluation and treatment for her allergies. The clinician recommends that Mary and her family take steps to minimize potential environmental triggers in their apartment, such as avoiding smoking and using air purifiers. The clinician advises Mary to continue to maintain her stable housing situation and to seek assistance if any financial or housing issues arise. """, """Patient: Sarah L. Background: Sarah is a 35-year-old woman who has been experiencing housing insecurity for the past year. She was evicted from her apartment due to an increase in rent, which she could not afford, and has been staying with friends and family members ever since. She works as a part-time sales associate at a retail store and has no medical conditions. Presenting problem: Sarah presents to the clinic with complaints of increased stress and anxiety related to her housing insecurity. She reports feeling constantly on edge and worried about where she will sleep each night. She is also having difficulty concentrating at work and has been missing shifts due to her anxiety. Medical history: Sarah has no significant medical history and takes no medications. Social history: Sarah is currently single and has no children. She has a high school diploma but has not attended college. She has been working at her current job for three years and earns minimum wage. She has no savings and relies on her income to cover basic expenses. Assessment: The clinician assesses Sarah's mental health and determines that she is experiencing symptoms of anxiety and depression related to her housing insecurity. 
The clinician also assesses Sarah's housing situation and determines that she is at risk for homelessness if she is unable to secure stable housing soon. Plan: The clinician refers Sarah to a social worker who can help her connect with local housing resources, including subsidized housing programs and emergency shelters. The clinician also prescribes an antidepressant medication to help manage her symptoms of anxiety and depression. The clinician advises Sarah to continue to seek employment opportunities that may offer higher pay and stability."""] df = spark.createDataFrame(text_list, StringType()).toDF("text") result = pipeline.fit(df).transform(df) result.select("text", "class.result").show(truncate=100) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence_embeddings") val features_asm = new FeaturesAssembler() .setInputCols("sentence_embeddings") .setOutputCol("features") val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli", "en", "clinical/models") .setInputCols("features") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_embeddings, features_asm, generic_classifier)) val data = Seq(Array("""Patient: Mary H. Background: Mary is a 40-year-old woman who has been diagnosed with asthma and allergies. She has been managing her conditions with medication and regular follow-up appointments with her healthcare provider. She lives in a rented apartment with her husband and two children and has been stably housed for the past five years. Presenting problem: Mary presents to the clinic for a routine check-up and reports no significant changes in her health status or symptoms related to her asthma or allergies. 
However, she expresses concerns about the quality of the air in her apartment and potential environmental triggers that could impact her health. Medical history: Mary has a medical history of asthma and allergies. She takes an inhaler and antihistamines to manage her conditions. Social history: Mary is married with two children and lives in a rented apartment. She and her husband both work full-time jobs and have health insurance. They have savings and are able to cover basic expenses. Assessment: The clinician assesses Mary's medical conditions and determines that her asthma and allergies are stable and well-controlled. The clinician also assesses Mary's housing situation and determines that her apartment building is in good condition and does not present any immediate environmental hazards. Plan: The clinician advises Mary to continue to monitor her health conditions and to report any changes or concerns to her healthcare team. The clinician also prescribes a referral to an allergist who can provide additional evaluation and treatment for her allergies. The clinician recommends that Mary and her family take steps to minimize potential environmental triggers in their apartment, such as avoiding smoking and using air purifiers. The clinician advises Mary to continue to maintain her stable housing situation and to seek assistance if any financial or housing issues arise. """, """Patient: Sarah L. Background: Sarah is a 35-year-old woman who has been experiencing housing insecurity for the past year. She was evicted from her apartment due to an increase in rent, which she could not afford, and has been staying with friends and family members ever since. She works as a part-time sales associate at a retail store and has no medical conditions. Presenting problem: Sarah presents to the clinic with complaints of increased stress and anxiety related to her housing insecurity. She reports feeling constantly on edge and worried about where she will sleep each night. 
She is also having difficulty concentrating at work and has been missing shifts due to her anxiety. Medical history: Sarah has no significant medical history and takes no medications. Social history: Sarah is currently single and has no children. She has a high school diploma but has not attended college. She has been working at her current job for three years and earns minimum wage. She has no savings and relies on her income to cover basic expenses. Assessment: The clinician assesses Sarah's mental health and determines that she is experiencing symptoms of anxiety and depression related to her housing insecurity. The clinician also assesses Sarah's housing situation and determines that she is at risk for homelessness if she is unable to secure stable housing soon. Plan: The clinician refers Sarah to a social worker who can help her connect with local housing resources, including subsidized housing programs and emergency shelters. The clinician also prescribes an antidepressant medication to help manage her symptoms of anxiety and depression. The clinician advises Sarah to continue to seek employment opportunities that may offer higher pay and stability.""")).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
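The pipeline above reduces each document to a single sentence-embedding feature vector, which the `GenericClassifierModel` scores against each label and resolves by argmax. A toy pure-Python sketch of that final step (the 4-dimensional "embedding" and weight rows are invented for illustration; the real model uses 768-dimensional sBioBERT embeddings and trained parameters):

```python
# Illustrative classification over a feature vector (toy sketch, not the Spark NLP API).
LABELS = ["Housing_Insecurity", "No_Housing_Insecurity"]
# Invented weight rows, one per label.
WEIGHTS = [
    [0.9, -0.2, 0.4, 0.1],
    [-0.3, 0.8, -0.1, 0.5],
]

def classify(features):
    """Score each label as a dot product with its weight row and return the argmax label."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in WEIGHTS]
    return LABELS[scores.index(max(scores))]

print(classify([1.0, 0.1, 0.5, 0.0]))  # Housing_Insecurity
```

In the real pipeline the `FeaturesAssembler` produces the feature vector and the classifier's learned layers replace the hand-written weights.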
## Results ```bash +----------------------------------------------------------------------------------------------------+-----------------------+ |                                                                                                text|                 result| +----------------------------------------------------------------------------------------------------+-----------------------+ |Patient: Mary H.\n\nBackground: Mary is a 40-year-old woman who has been diagnosed with asthma an...|[No_Housing_Insecurity]| |Patient: Sarah L.\n\nBackground: Sarah is a 35-year-old woman who has been experiencing housing i...|   [Housing_Insecurity]| +----------------------------------------------------------------------------------------------------+-----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[features]| |Output Labels:|[prediction]| |Language:|en| |Size:|3.4 MB| |Dependencies:|sbiobert_base_cased_mli| ## References Internal SDOH Project ## Benchmarking ```bash label precision recall f1-score support Housing_Insecurity 0.83 0.81 0.82 64 No_Housing_Insecurity 0.86 0.87 0.86 83 accuracy - - 0.84 147 macro-avg 0.84 0.84 0.84 147 weighted-avg 0.84 0.84 0.84 147 ``` --- layout: model title: Swedish BertForQuestionAnswering model (from marbogusz) author: John Snow Labs name: bert_qa_bert_multi_cased_squad_sv_marbogusz date: 2022-06-03 tags: [open_source, question_answering, bert] task: Question Answering language: sv edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-multi-cased-squad_sv` is a Swedish model originally trained by `marbogusz`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_multi_cased_squad_sv_marbogusz_sv_4.0.0_3.0_1654249792429.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_multi_cased_squad_sv_marbogusz_sv_4.0.0_3.0_1654249792429.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_multi_cased_squad_sv_marbogusz","sv") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_multi_cased_squad_sv_marbogusz","sv") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("sv.answer_question.squad.bert.cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_multi_cased_squad_sv_marbogusz| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|sv| |Size:|465.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/marbogusz/bert-multi-cased-squad_sv --- layout: model title: Fast Neural Machine Translation Model from Georgian to English author: John Snow Labs name: opus_mt_ka_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ka, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `ka` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ka_en_xx_2.7.0_2.4_1609168048004.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ka_en_xx_2.7.0_2.4_1609168048004.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ka_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("PUT YOUR STRING HERE") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ka_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ka.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ka_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English T5ForConditionalGeneration Large Cased model (from google) author: John Snow Labs name: t5_efficient_large_nh2 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-large-nh2` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_large_nh2_en_4.3.0_3.0_1675116555931.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_large_nh2_en_4.3.0_3.0_1675116555931.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_large_nh2","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_large_nh2","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_large_nh2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|976.3 MB| ## References - https://huggingface.co/google/t5-efficient-large-nh2 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Brazilian Portuguese NER for Laws (Large) author: John Snow Labs name: legner_br_large date: 2022-09-27 tags: [pt, licensed] task: Named Entity Recognition language: pt edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Deep Learning Portuguese Named Entity Recognition model for the legal domain, trained using Base Bert Embeddings, and is able to predict the following entities: - ORGANIZACAO (Organizations) - JURISPRUDENCIA (Jurisprudence) - PESSOA (Person) - TEMPO (Time) - LOCAL (Location) - LEGISLACAO (Laws) - O (Other) You can find different versions of this model in Models Hub: - With a Deep Learning architecture (non-transformer) and Base Embeddings; - With a Deep Learning architecture (non-transformer) and Large Embeddings; - With a Transformers Architecture and Base Embeddings; - With a Transformers Architecture and Large Embeddings; ## Predicted Entities `PESSOA`, `ORGANIZACAO`, `LEGISLACAO`, `JURISPRUDENCIA`, `TEMPO`, `LOCAL` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_LEGAL_PT/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_br_large_pt_1.0.0_3.0_1664278379247.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_br_large_pt_1.0.0_3.0_1664278379247.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pandas as pd document_assembler = nlp.DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = nlp.Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') embeddings = nlp.BertEmbeddings.pretrained("bert_portuguese_base_cased", "pt")\ .setInputCols("document", "token") \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained('legner_br_large', 'pt', 'legal/models') \ .setInputCols(['document', 'token', 'embeddings']) \ .setOutputCol('ner') ner_converter = nlp.NerConverter() \ .setInputCols(['document', 'token', 'ner']) \ .setOutputCol('ner_chunk') pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, ner_model, ner_converter ]) example = spark.createDataFrame(pd.DataFrame({'text': ["""diante do exposto , com fundamento nos artigos 32 , i , e 33 , da lei 8.443/1992 , submetem-se os autos à consideração superior , com posterior encaminhamento ao ministério público junto ao tcu e ao gabinete do relator , propondo : a ) conhecer do recurso e , no mérito , negar-lhe provimento ; b ) comunicar ao recorrente , ao superior tribunal militar e ao tribunal regional federal da 2ª região , a fim de fornecer subsídios para os processos judiciais 2001.34.00.024796-9 e 2003.34.00.044227-3 ; e aos demais interessados a deliberação que vier a ser proferida por esta corte ” ."""]})) result = pipeline.fit(example).transform(example) ```
## Results ```bash +-------------------+----------------+ | token| ner| +-------------------+----------------+ | diante| O| | do| O| | exposto| O| | ,| O| | com| O| | fundamento| O| | nos| O| | artigos| B-LEGISLACAO| | 32| I-LEGISLACAO| | ,| I-LEGISLACAO| | i| I-LEGISLACAO| | ,| I-LEGISLACAO| | e| I-LEGISLACAO| | 33| I-LEGISLACAO| | ,| I-LEGISLACAO| | da| I-LEGISLACAO| | lei| I-LEGISLACAO| | 8.443/1992| I-LEGISLACAO| | ,| O| | submetem-se| O| | os| O| | autos| O| | à| O| | consideração| O| | superior| O| | ,| O| | com| O| | posterior| O| | encaminhamento| O| | ao| O| | ministério| B-ORGANIZACAO| | público| I-ORGANIZACAO| | junto| O| | ao| O| | tcu| B-ORGANIZACAO| | e| O| | ao| O| | gabinete| O| | do| O| | relator| O| | ,| O| | propondo| O| | :| O| | a| O| | )| O| | conhecer| O| | do| O| | recurso| O| | e| O| | ,| O| | no| O| | mérito| O| | ,| O| | negar-lhe| O| | provimento| O| | ;| O| | b| O| | )| O| | comunicar| O| | ao| O| | recorrente| O| | ,| O| | ao| O| | superior| B-ORGANIZACAO| | tribunal| I-ORGANIZACAO| | militar| I-ORGANIZACAO| | e| O| | ao| O| | tribunal| B-ORGANIZACAO| | regional| I-ORGANIZACAO| | federal| I-ORGANIZACAO| | da| I-ORGANIZACAO| | 2ª| I-ORGANIZACAO| | região| I-ORGANIZACAO| | ,| O| | a| O| | fim| O| | de| O| | fornecer| O| | subsídios| O| | para| O| | os| O| | processos| O| | judiciais| O| |2001.34.00.024796-9|B-JURISPRUDENCIA| | e| O| |2003.34.00.044227-3|B-JURISPRUDENCIA| | ;| O| | e| O| | aos| O| | demais| O| | interessados| O| | a| O| | deliberação| O| | que| O| | vier| O| | a| O| | ser| O| | proferida| O| | por| O| | esta| O| | corte| O| | ”| O| | .| O| +-------------------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_br_large| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|pt| |Size:|19.5 MB| ## References Original texts available in 
https://paperswithcode.com/sota?task=Token+Classification&dataset=lener_br and in-house data augmentation with weak labelling ## Benchmarking ```bash label precision recall f1-score support B-JURISPRUDENCIA 0.84 0.91 0.88 175 B-LEGISLACAO 0.96 0.96 0.96 347 B-LOCAL 0.69 0.68 0.68 40 B-ORGANIZACAO 0.95 0.71 0.81 441 B-PESSOA 0.91 0.95 0.93 221 B-TEMPO 0.94 0.86 0.90 176 I-JURISPRUDENCIA 0.86 0.91 0.89 461 I-LEGISLACAO 0.98 0.99 0.98 2012 I-LOCAL 0.54 0.53 0.53 72 I-ORGANIZACAO 0.94 0.76 0.84 768 I-PESSOA 0.93 0.98 0.95 461 I-TEMPO 0.90 0.85 0.88 66 O 0.99 1.00 0.99 38419 accuracy - - 0.98 43659 macro-avg 0.88 0.85 0.86 43659 weighted-avg 0.98 0.98 0.98 43659 ``` --- layout: model title: Japanese BertForMaskedLM Large Cased model (from cl-tohoku) author: John Snow Labs name: bert_embeddings_large_japanese date: 2022-12-02 tags: [ja, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ja edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-japanese` is a Japanese model originally trained by `cl-tohoku`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_japanese_ja_4.2.4_3.0_1670020232480.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_large_japanese_ja_4.2.4_3.0_1670020232480.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_japanese","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_large_japanese","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_large_japanese| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Size:|1.3 GB| |Case sensitive:|true| ## References - https://huggingface.co/cl-tohoku/bert-large-japanese - https://github.com/google-research/bert - https://pypi.org/project/unidic-lite/ - https://github.com/cl-tohoku/bert-japanese/tree/v2.0 - https://taku910.github.io/mecab/ - https://github.com/neologd/mecab-ipadic-neologd - https://github.com/polm/fugashi - https://github.com/polm/unidic-lite - https://www.tensorflow.org/tfrc/ - https://creativecommons.org/licenses/by-sa/3.0/ --- layout: model title: Recognize Entities OntoNotes pipeline - BERT Tiny author: John Snow Labs name: onto_recognize_entities_bert_tiny date: 2021-03-23 tags: [open_source, english, onto_recognize_entities_bert_tiny, pipeline, en] supported: true task: [Named Entity Recognition, Lemmatization] language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The onto_recognize_entities_bert_tiny is a pretrained pipeline that performs basic text processing steps and recognizes entities. 
It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_tiny_en_3.0.0_3.0_1616477524764.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_tiny_en_3.0.0_3.0_1616477524764.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('onto_recognize_entities_bert_tiny', lang = 'en') annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("onto_recognize_entities_bert_tiny", lang = "en") val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs ! "] result_df = nlu.load('en.ner.onto.bert.tiny').predict(text) result_df ```
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:-----------------------------|:----------------------------------------------------|:-------------------| | 0 | ['Hello from John Snow Labs ! '] | ['Hello from John Snow Labs !'] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | [[-1.526878952980041,.,...]] | ['O', 'O', 'B-PERSON', 'I-PERSON', 'I-PERSON', 'O'] | ['John Snow Labs'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_recognize_entities_bert_tiny| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| --- layout: model title: Arabic Named Entity Recognition (from Davlan) author: John Snow Labs name: bert_ner_bert_base_multilingual_cased_ner_hrl date: 2022-05-04 tags: [bert, ner, token_classification, ar, open_source] task: Named Entity Recognition language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-multilingual-cased-ner-hrl` is an Arabic model originally trained by `Davlan`. ## Predicted Entities `DATE`, `LOC`, `ORG`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_multilingual_cased_ner_hrl_ar_3.4.2_3.0_1651630145936.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_multilingual_cased_ner_hrl_ar_3.4.2_3.0_1651630145936.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_multilingual_cased_ner_hrl","ar") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_multilingual_cased_ner_hrl","ar") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("أنا أحب الشرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.ner.multilingual_cased_ner_hrl").predict("""أنا أحب الشرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_bert_base_multilingual_cased_ner_hrl| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ar| |Size:|665.8 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Davlan/bert-base-multilingual-cased-ner-hrl - https://camel.abudhabi.nyu.edu/anercorp/ - https://www.clips.uantwerpen.be/conll2003/ner/ - https://www.clips.uantwerpen.be/conll2002/ner/ - https://github.com/EuropeanaNewspapers/ner-corpora/tree/master/enp_FR.bnf.bio - https://ontotext.fbk.eu/icab.html - https://github.com/LUMII-AILab/FullStack/tree/master/NamedEntities - https://github.com/davidsbatista/NER-datasets/tree/master/Portuguese --- layout: model title: Legal Successors And Assigns Clause Binary Classifier author: John Snow Labs name: legclf_successors_and_assigns_clause date: 2022-12-18 tags: [en, legal, classification, licensed, clause, bert, successors, and, assigns, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the successors-and-assigns clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it’s better to skip it, unless you want to do Binary Classification at the sentence level. 
If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `successors-and-assigns`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_successors_and_assigns_clause_en_1.0.0_3.0_1671393638580.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_successors_and_assigns_clause_en_1.0.0_3.0_1671393638580.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
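As a minimal illustration of the paragraph-splitting approach mentioned above, the sketch below (plain Python, no Spark required; the helper name, regex, and whitespace-based token budget are illustrative assumptions, not part of the Legal NLP API) chunks a long document on blank lines while keeping each chunk under a rough 512-token limit:

```python
# Hedged sketch: split a long legal document into paragraph-sized chunks
# before classification, assuming paragraphs are separated by blank lines.
import re

def split_paragraphs(text: str, max_tokens: int = 512):
    """Split on blank lines, then greedily merge paragraphs while keeping
    each chunk under a rough whitespace-token budget (a single paragraph
    longer than the budget is kept whole)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, count = [], [], 0
    for p in paragraphs:
        n = len(p.split())
        if current and count + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(p)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = "Clause 1. ...\n\nClause 2. ...\n\nClause 3. ..."
print(split_paragraphs(doc, max_tokens=512))
```

Each returned chunk can then be sent through the classification pipeline as a separate row.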
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_successors_and_assigns_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[successors-and-assigns]| |[other]| |[other]| |[successors-and-assigns]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_successors_and_assigns_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 1.00 0.95 0.97 39 successors-and-assigns 0.94 1.00 0.97 33 accuracy - - 0.97 72 macro-avg 0.97 0.97 0.97 72 weighted-avg 0.97 0.97 0.97 72 ``` --- layout: model title: French CamemBert Embeddings (from xkang) author: John Snow Labs name: camembert_embeddings_xkang_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `xkang`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_xkang_generic_model_fr_3.4.4_3.0_1653990717265.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_xkang_generic_model_fr_3.4.4_3.0_1653990717265.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_xkang_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_xkang_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_xkang_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/xkang/dummy-model --- layout: model title: English RobertaForQuestionAnswering (from benny6) author: John Snow Labs name: roberta_qa_roberta_tydiqa date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-tydiqa` is an English model originally trained by `benny6`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_tydiqa_en_4.0.0_3.0_1655738456483.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_tydiqa_en_4.0.0_3.0_1655738456483.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_tydiqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_tydiqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.tydiqa.roberta").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
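Note that the nlu one-liner above encodes the question and context in a single string separated by `|||`. A trivial helper (hypothetical, plain Python; the function name is not part of the nlu API) makes that explicit when building batches:

```python
# Build the "question|||context" strings used by nlu's QA loader.
# The helper name is illustrative, not part of the nlu API.
def qa_input(question: str, context: str, sep: str = "|||") -> str:
    return f"{question}{sep}{context}"

rows = [
    qa_input("What's my name?", "My name is Clara and I live in Berkeley."),
]
print(rows[0])
# → What's my name?|||My name is Clara and I live in Berkeley.
```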
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_tydiqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|471.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/benny6/roberta-tydiqa --- layout: model title: Financial NER for German Financial Statements author: John Snow Labs name: finner_financial_entity_value date: 2023-03-25 tags: [ner, licensed, finance, de] task: Named Entity Recognition language: de edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: FinanceNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a German NER model trained on German Financial Statements, aimed at extracting the following entities from financial documents. ## Predicted Entities `FINANCIAL_ENTITY`, `FINANCIAL_VALUE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_financial_entity_value_de_1.0.0_3.0_1679702467400.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_financial_entity_value_de_1.0.0_3.0_1679702467400.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_sentence_embeddings_financial","de") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") ner_model= finance.NerModel.pretrained("finner_financial_entity_value", "de", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter ] ) import pandas as pd import pyspark.sql.functions as F p_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']}))) text = 'Die Kapitalstruktur wird im Wesentlichen durch eine weitere Reduzierung der langfristigen Bankverbindlichkeiten um 3.000 TEUR auf 0 TEUR , einer Erhöhung der Rückstellungen um 3.397 TEUR auf 31.717 TEUR sowie die Erhöhung des Eigenkapitals um 1.771 TEUR auf 110.668 TEUR beeinflusst .' res = p_model.transform(spark.createDataFrame([[text]]).toDF("text")) result_df = res.select(F.explode(F.arrays_zip(res.token.result,res.ner.result, res.ner.metadata)).alias("cols"))\ .select(F.expr("cols['0']").alias("token"), F.expr("cols['1']").alias("label"), F.expr("cols['2']['confidence']").alias("confidence")) result_df.show(50, truncate=100) ```
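The pipeline above prints one BIO label per token. If you want entity-level output without going through the `ner_chunk` column, the token/label pairs can be collapsed into chunks with a few lines of plain Python (the helper below is an illustrative sketch, not part of the Finance NLP API):

```python
# Hedged sketch: collapse BIO tags produced by an NER model into
# (entity_type, text) chunks, e.g. to pair FINANCIAL_ENTITY spans with
# the FINANCIAL_VALUE spans that follow them.
def bio_chunks(tokens, labels):
    chunks, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                chunks.append(current)
            current = (lab[2:], [tok])
        elif lab.startswith("I-") and current and current[0] == lab[2:]:
            current[1].append(tok)
        else:  # "O" or an I- tag that does not continue the open chunk
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(etype, " ".join(words)) for etype, words in chunks]

print(bio_chunks(
    ["langfristigen", "Bankverbindlichkeiten", "um", "3.000"],
    ["B-FINANCIAL_ENTITY", "I-FINANCIAL_ENTITY", "O", "O"]))
# → [('FINANCIAL_ENTITY', 'langfristigen Bankverbindlichkeiten')]
```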
## Results ```bash +---------------------+------------------+ | token| label| +---------------------+------------------+ | Die| O| | Kapitalstruktur| O| | wird| O| | im| O| | Wesentlichen| O| | durch| O| | eine| O| | weitere| O| | Reduzierung| O| | der| O| | langfristigen|B-FINANCIAL_ENTITY| |Bankverbindlichkeiten|I-FINANCIAL_ENTITY| | um| O| | 3.000| O| | TEUR| O| | auf| O| | 0| B-FINANCIAL_VALUE| | TEUR| O| | ,| O| | einer| O| | Erhöhung| O| | der| O| | Rückstellungen|B-FINANCIAL_ENTITY| | um| O| | 3.397| O| | TEUR| O| | auf| O| | 31.717| B-FINANCIAL_VALUE| | TEUR| O| | sowie| O| | die| O| | Erhöhung| O| | des| O| | Eigenkapitals|B-FINANCIAL_ENTITY| | um| O| | 1.771| O| | TEUR| O| | auf| O| | 110.668| B-FINANCIAL_VALUE| | TEUR| O| | beeinflusst| O| | .| O| +---------------------+------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_financial_entity_value| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|de| |Size:|1.1 MB| ## Benchmarking ```bash label precision recall f1-score support B-FINANCIAL_ENTITY 0.8947 0.9444 0.9189 18 B-FINANCIAL_VALUE 1.0000 0.8750 0.9333 16 I-FINANCIAL_ENTITY 0.8000 0.6154 0.6957 13 micro-avg 0.9070 0.8298 0.8667 47 macro-avg 0.8982 0.8116 0.8493 47 weighted-avg 0.9044 0.8298 0.8621 47 ``` --- layout: model title: English asr_exp_w2v2r_xls_r_gender_male_10_female_0_s287 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_exp_w2v2r_xls_r_gender_male_10_female_0_s287 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide 
scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_gender_male_10_female_0_s287` is an English model originally trained by jonatasgrosman. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2r_xls_r_gender_male_10_female_0_s287_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_gender_male_10_female_0_s287_en_4.2.0_3.0_1664093877249.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_gender_male_10_female_0_s287_en_4.2.0_3.0_1664093877249.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2r_xls_r_gender_male_10_female_0_s287', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2r_xls_r_gender_male_10_female_0_s287", lang = "en") val annotations = pipeline.transform(audioDF) ```
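The pretrained pipeline expects an `audioDF` whose `audio_content` column holds the raw waveform as an array of floats. As a hedged sketch using only the Python standard library (the file name is a placeholder, and 16-bit mono PCM input at the model's sampling rate is assumed), such a column can be prepared like this:

```python
# Hedged sketch: decode a 16-bit mono PCM WAV file into floats in
# [-1.0, 1.0), the shape Spark NLP's audio annotators consume.
import wave
import struct

def wav_to_floats(path):
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "sketch assumes 16-bit PCM samples"
        frames = w.readframes(w.getnframes())
    # "<h" = little-endian signed 16-bit; normalize by 2**15.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# floats = wav_to_floats("speech.wav")  # placeholder file name
# audioDF = spark.createDataFrame([[floats]], ["audio_content"])
```

The resulting DataFrame can then be passed to `pipeline.transform(audioDF)` as shown above.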
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_exp_w2v2r_xls_r_gender_male_10_female_0_s287| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Igbo Named Entity Recognition (from Davlan) author: John Snow Labs name: xlmroberta_ner_xlm_roberta_base_wikiann_ner date: 2022-05-17 tags: [xlm_roberta, ner, token_classification, ig, open_source] task: Named Entity Recognition language: ig edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-wikiann-ner` is an Igbo model originally trained by `Davlan`. ## Predicted Entities `PER`, `ORG`, `LOC`, `DATE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_wikiann_ner_ig_3.4.2_3.0_1652809045092.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_wikiann_ner_ig_3.4.2_3.0_1652809045092.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_wikiann_ner","ig") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ahụrụ m n'anya na-atọ m ụtọ"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_wikiann_ner","ig") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ahụrụ m n'anya na-atọ m ụtọ").toDF("text") val result = pipeline.fit(data).transform(data) ```
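The `ner` output column above holds one IOB2-style tag per token (e.g. `B-PER`, `I-PER`, `O`). As an illustration of how such tags group into entity chunks, here is a plain-Python sketch that is independent of Spark NLP; the function name and the sample tokens/tags are invented for this example:

```python
def iob_to_chunks(tokens, tags):
    """Group IOB2 tags (B-PER, I-PER, O, ...) into (entity_type, text) chunks."""
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)
        else:  # "O" or an inconsistent I- tag closes the open chunk
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(words)) for label, words in chunks]

# Invented tokens/tags, mimicking the model's PER/ORG/LOC/DATE label set
chunks = iob_to_chunks(
    ["John", "Smith", "lives", "in", "Lagos"],
    ["B-PER", "I-PER", "O", "O", "B-LOC"],
)  # [('PER', 'John Smith'), ('LOC', 'Lagos')]
```

In Spark NLP itself, this grouping is normally done by adding an `NerConverter` stage after the token classifier; the sketch only shows what that conversion amounts to.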
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_xlm_roberta_base_wikiann_ner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ig| |Size:|859.4 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Davlan/xlm-roberta-base-wikiann-ner --- layout: model title: Legal Trademark License Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_trademark_license_agreement_bert date: 2023-01-26 tags: [en, legal, classification, trademark, license, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_trademark_license_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `trademark-license-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `trademark-license-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_trademark_license_agreement_bert_en_1.0.0_3.0_1674735429658.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_trademark_license_agreement_bert_en_1.0.0_3.0_1674735429658.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_trademark_license_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[trademark-license-agreement]| |[other]| |[other]| |[trademark-license-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_trademark_license_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.93 0.97 0.95 116 trademark-license-agreement 0.94 0.84 0.89 58 accuracy - - 0.93 174 macro-avg 0.93 0.91 0.92 174 weighted-avg 0.93 0.93 0.93 174 ``` --- layout: model title: English image_classifier_vit_koala_panda_wombat ViTForImageClassification from nateraw author: John Snow Labs name: image_classifier_vit_koala_panda_wombat date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_koala_panda_wombat` is an English model originally trained by nateraw. ## Predicted Entities `koala`, `panda`, `wombat` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_koala_panda_wombat_en_4.1.0_3.0_1660172477788.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_koala_panda_wombat_en_4.1.0_3.0_1660172477788.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_koala_panda_wombat", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_koala_panda_wombat", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
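For intuition, an image-classification head like this one produces one logit per class, and the predicted label is the argmax after a softmax. A minimal, self-contained sketch (the logit values below are made up for illustration and are not real model output):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["koala", "panda", "wombat"]   # the model's three classes
probs = softmax([2.0, 0.5, 0.1])        # illustrative logits, not model output
best = labels[probs.index(max(probs))]  # "koala"
```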
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_koala_panda_wombat| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Smaller BERT Sentence Embeddings (L-6_H-768_A-12) author: John Snow Labs name: sent_small_bert_L6_768 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L6_768_en_2.6.0_2.4_1598351137007.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L6_768_en_2.6.0_2.4_1598351137007.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L6_768", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer'], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L6_768", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L6_768').predict(text, output_level='sentence') embeddings_df ```
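Sentence embeddings such as these are usually compared with cosine similarity. A plain-Python sketch (the short vectors below are made-up truncations for illustration, not real 768-dimensional model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [0.0155, -0.1755, 0.0312]   # illustrative truncation of one embedding
v2 = [0.4997, 0.1496, 0.0301]    # illustrative truncation of another
score = cosine_similarity(v1, v2)  # in [-1, 1]; higher means more similar
```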
{:.h2_title} ## Results ```bash en_embed_sentence_small_bert_L6_768_embeddings sentence [0.01553951483219862, -0.1754797250032425, 0.0... I hate cancer [0.4996863603591919, 0.14960810542106628, 0.03... Antibiotics aren't painkiller ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L6_768| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-768_A-12/1 --- layout: model title: Word2Vec Embeddings in Serbo-Croatian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, sh, open_source] task: Embeddings language: sh edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sh_3.4.1_3.0_1647457026961.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sh_3.4.1_3.0_1647457026961.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("sh.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|sh| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Translate English to Efik Pipeline author: John Snow Labs name: translate_en_efi date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, efi, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `efi` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_efi_xx_2.7.0_2.4_1609686118786.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_efi_xx_2.7.0_2.4_1609686118786.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_efi", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_efi", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.efi').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_efi| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Spanish RoBERTa Embeddings (from scjnugacj) author: John Snow Labs name: roberta_embeddings_jurisbert date: 2022-04-14 tags: [roberta, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `jurisbert` is a Spanish model originally trained by `scjnugacj`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_jurisbert_es_3.4.2_3.0_1649945233111.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_jurisbert_es_3.4.2_3.0_1649945233111.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_jurisbert","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_jurisbert","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.jurisbert").predict("""Me encanta chispa nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_jurisbert| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|466.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/scjnugacj/jurisbert --- layout: model title: Legal Change of control Clause Binary Classifier author: John Snow Labs name: legclf_change_of_control_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `change-of-control` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities `other`, `change-of-control` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_change_of_control_clause_en_1.0.0_3.2_1660123299460.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_change_of_control_clause_en_1.0.0_3.2_1660123299460.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_change_of_control_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
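The paragraph-splitting approach recommended in the description can be sketched in plain Python before building the DataFrame. The splitting rule (blank lines) and all names below are illustrative assumptions, not part of the model's API:

```python
import re

def split_paragraphs(document: str):
    """Split a document on blank lines, dropping empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

# Invented two-clause document for illustration
doc = "CHANGE OF CONTROL. Upon any merger ...\n\nGOVERNING LAW. This Agreement ..."
paragraphs = split_paragraphs(doc)
# Each paragraph can then become one row of the "clause_text" DataFrame
# fed to the classification pipeline above.
```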
## Results ```bash +-------+ | result| +-------+ |[change-of-control]| |[other]| |[other]| |[change-of-control]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_change_of_control_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support change-of-control 1.00 0.86 0.92 28 other 0.94 1.00 0.97 68 accuracy - - 0.96 96 macro-avg 0.97 0.93 0.95 96 weighted-avg 0.96 0.96 0.96 96 ``` --- layout: model title: Untyped Dependency Parsing for English author: John Snow Labs name: dependency_conllu date: 2022-06-29 tags: [dependency_parsing, unlabelled_dependency_parsing, untyped_dependency_parsing, en, open_source] task: Dependency Parser language: en nav_key: models edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: DependencyParserModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Untyped Dependency parser, trained on the CoNLL dataset. Dependency parsing is the task of extracting a dependency parse of a sentence that represents its grammatical structure and defines the relationships between “head” words and the words which modify those heads. Example: in the sentence "I prefer the morning flight through Denver", the root is "prefer"; "I" is its nsubj and "flight" its dobj, while "the" (det), "morning" (nmod), and "Denver" (nmod, with case marker "through") all attach to "flight". In the original figure, these relations are drawn as directed, labeled arcs from heads to dependents.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/dependency_conllu_en_3.4.4_3.0_1656845289670.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/dependency_conllu_en_3.4.4_3.0_1656845289670.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.annotators import * documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document") sentenceDetector = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence") tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token") posTagger = PerceptronModel.pretrained().setInputCols(["token", "sentence"]).setOutputCol("pos") dependencyParser = DependencyParserModel.pretrained().setInputCols(["sentence", "pos", "token"]).setOutputCol("dependency") typedDependencyParser = TypedDependencyParserModel.pretrained().setInputCols(["token", "pos", "dependency"]).setOutputCol("labdep") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, posTagger, dependencyParser, typedDependencyParser]) # Create data frame df = spark.createDataFrame([["Dependencies represents relationships betweens words in a Sentence"]]).toDF("text") result = pipeline.fit(df).transform(df) result.select("dependency.result").show(truncate=False) ``` ```scala import com.johnsnowlabs.nlp.DocumentAssembler import com.johnsnowlabs.nlp.annotator._ import org.apache.spark.ml.Pipeline import spark.implicits._ val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document") val sentenceDetector = new SentenceDetector().setInputCols(Array("document")).setOutputCol("sentence") val tokenizer = new Tokenizer().setInputCols(Array("sentence")).setOutputCol("token") val posTagger = PerceptronModel.pretrained().setInputCols(Array("token", "sentence")).setOutputCol("pos") val dependencyParser = DependencyParserModel.pretrained().setInputCols(Array("sentence", "pos", "token")).setOutputCol("dependency") val typedDependencyParser = TypedDependencyParserModel.pretrained().setInputCols(Array("token", "pos", "dependency")).setOutputCol("labdep") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, 
posTagger, dependencyParser, typedDependencyParser)) val df = Seq("Dependencies represents relationships betweens words in a Sentence").toDF("text") val result = pipeline.fit(df).transform(df) result.select("dependency.result").show(false) ``` {:.nlu-block} ```python nlu.load("dep.untyped").predict("Dependencies represents relationships betweens words in a Sentence") ```
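Conceptually, an untyped parse like the one returned in the `dependency` column can be represented as one head index per token (0 for the root). The hand-written parse below is the standard textbook analysis of the example sentence from the description, not model output:

```python
sentence = ["I", "prefer", "the", "morning", "flight", "through", "Denver"]
heads    = [2,    0,        5,     5,         2,        7,         5]  # 1-based; 0 = root

def dependents(head_idx):
    """Words whose head is the (1-based) token at head_idx."""
    return [w for w, h in zip(sentence, heads) if h == head_idx]

root = dependents(0)[0]  # "prefer"
args = dependents(2)     # ["I", "flight"]: subject and object of "prefer"
```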
## Results ```bash +---------------------------------------------------------------------------------+ |result | +---------------------------------------------------------------------------------+ |[ROOT, Dependencies, represents, words, relationships, Sentence, Sentence, words]| +---------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|dependency_conllu| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, pos, token]| |Output Labels:|[dep_root]| |Language:|en| |Size:|17.5 MB| ## Data Source CoNLL --- layout: model title: Italian BertForQuestionAnswering model (from luigisaetta) author: John Snow Labs name: bert_qa_squad_xxl_cased_hub1 date: 2022-06-28 tags: [it, open_source, bert, question_answering] task: Question Answering language: it edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squad_it_xxl_cased_hub1` is an Italian model originally trained by `luigisaetta`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_squad_xxl_cased_hub1_it_4.0.0_3.0_1656413780942.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_squad_xxl_cased_hub1_it_4.0.0_3.0_1656413780942.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_squad_xxl_cased_hub1","it") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Qual è il mio nome?", "Mi chiamo Clara e vivo a Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_squad_xxl_cased_hub1","it") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Qual è il mio nome?", "Mi chiamo Clara e vivo a Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("it.answer_question.squad.bert.xxl_cased").predict("""Qual è il mio nome?|||"Mi chiamo Clara e vivo a Berkeley.""")
```
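Under the hood, extractive QA models score every context token as a possible answer start and end, and the answer is the best-scoring valid span. A self-contained sketch of that decoding step (the token scores below are invented for illustration, not real model output):

```python
context = ["Mi", "chiamo", "Clara", "e", "vivo", "a", "Berkeley", "."]
start_scores = [0.1, 0.2, 3.0, 0.1, 0.1, 0.1, 0.5, 0.1]  # invented
end_scores   = [0.1, 0.1, 2.5, 0.1, 0.1, 0.1, 0.8, 0.1]  # invented

def best_span(starts, ends, max_len=5):
    """Highest-scoring (start, end) pair with start <= end and bounded length."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(starts):
        for j in range(i, min(i + max_len, len(ends))):
            if s + ends[j] > best[2]:
                best = (i, j, s + ends[j])
    return best[0], best[1]

i, j = best_span(start_scores, end_scores)
answer = " ".join(context[i:j + 1])  # "Clara"
```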
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_squad_xxl_cased_hub1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|it| |Size:|413.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/luigisaetta/squad_it_xxl_cased_hub1 - https://github.com/luigisaetta/nlp-qa-italian/blob/main/train_squad_it_final1.ipynb --- layout: model title: English BertForQuestionAnswering Cased model (from hendrixcosta) author: John Snow Labs name: bert_qa_hendrixcosta date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `hendrixcosta` is an English model originally trained by `hendrixcosta`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_hendrixcosta_en_4.0.0_3.0_1657189435012.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_hendrixcosta_en_4.0.0_3.0_1657189435012.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_hendrixcosta","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_hendrixcosta","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_hendrixcosta| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|405.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/hendrixcosta/hendrixcosta --- layout: model title: English DistilBertForQuestionAnswering model (from nlpunibo) Config2 author: John Snow Labs name: distilbert_qa_base_config2 date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert_base_config2` is an English model originally trained by `nlpunibo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_config2_en_4.0.0_3.0_1654727826792.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_config2_en_4.0.0_3.0_1654727826792.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_config2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_config2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.base_config2.by_nlpunibo").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
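The nlu one-liner above packs the question and its context into a single string separated by `|||`. A small sketch of that packing convention, assuming the separator semantics shown in the snippet (the helper names here are illustrative, not part of the nlu API):

```python
def pack(question: str, context: str) -> str:
    """Join a question and its context with the '|||' separator used
    in the nlu question-answering one-liner above."""
    return f"{question}|||{context}"

def unpack(packed: str):
    """Split a packed string back into (question, context)."""
    question, _, context = packed.partition("|||")
    return question, context

packed = pack("What is my name?", "My name is Clara and I live in Berkeley.")
print(unpack(packed))
```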
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_config2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/nlpunibo/distilbert_base_config2 --- layout: model title: Loinc Sentence Entity Resolver author: John Snow Labs name: sbluebertresolve_loinc date: 2021-04-29 tags: [en, licensed, clinical, entity_resolution] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true deprecated: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Map clinical NER entities to LOINC codes. ## Predicted Entities LOINC codes - per input NER entity {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbluebertresolve_loinc_en_3.0.2_3.0_1619678534366.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbluebertresolve_loinc_en_3.0.2_3.0_1619678534366.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbluebert_base_uncased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel.pretrained("sbluebertresolve_loinc","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") pipeline_loinc = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver]) data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting."""]]).toDF("text") model = pipeline_loinc.fit(data) results = model.transform(data) ```
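The resolver is configured with `setDistanceFunction("EUCLIDEAN")`: candidate LOINC codes are ranked by the Euclidean distance between the chunk's sentence embedding and each candidate's embedding, and the closest candidate wins. A toy sketch of that ranking step (the 3-dimensional vectors below are made up stand-ins; real sentence embeddings are much higher-dimensional):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up low-dimensional "embeddings" keyed by hypothetical LOINC codes.
query = [0.9, 0.1, 0.0]
candidates = {
    "45636-8": [1.0, 0.0, 0.0],
    "59574-4": [0.0, 1.0, 0.0],
}
best = min(candidates, key=lambda code: euclidean(query, candidates[code]))
print(best)  # 45636-8
```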
## Results ```bash | | chunk | loinc_code | |---:|:--------------------------------------|:-------------| | 0 | gestational diabetes mellitus | 45636-8 | | 1 | subsequent type two diabetes mellitus | 44877-9 | | 2 | T2DM | 45636-8 | | 3 | HTG-induced pancreatitis | 79102-0 | | 4 | an acute hepatitis | 28083-4 | | 5 | obesity | 50227-8 | | 6 | a body mass index | 59574-4 | | 7 | BMI | 59574-4 | | 8 | polyuria | 28239-2 | | 9 | polydipsia | 90552-1 | | 10 | poor appetite | 65961-5 | | 11 | vomiting | 81224-8 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbluebertresolve_loinc| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[loinc_code]| |Language:|en| ## Data Source Trained on standard LOINC coding system. --- layout: model title: Multilingual BertForQuestionAnswering Cased model (from roshnir) author: John Snow Labs name: bert_qa_mbert_finetuned_mlqa_de_hi_dev date: 2022-07-07 tags: [xx, open_source, bert, question_answering] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mBert-finetuned-mlqa-dev-de-hi` is a Multilingual model originally trained by `roshnir`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_de_hi_dev_xx_4.0.0_3.0_1657189876761.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_de_hi_dev_xx_4.0.0_3.0_1657189876761.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_de_hi_dev","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_de_hi_dev","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_mbert_finetuned_mlqa_de_hi_dev| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|626.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/roshnir/mBert-finetuned-mlqa-dev-de-hi --- layout: model title: Pipeline to Detect chemicals in text author: John Snow Labs name: ner_chemicals_pipeline date: 2023-03-14 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_chemicals](https://nlp.johnsnowlabs.com/2021/04/01/ner_chemicals_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chemicals_pipeline_en_4.3.0_3.2_1678826470686.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chemicals_pipeline_en_4.3.0_3.2_1678826470686.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_chemicals_pipeline", "en", "clinical/models") text = '''The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_chemicals_pipeline", "en", "clinical/models") val text = "The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.chemicals.pipeline").predict("""The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.""") ```
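Each NER chunk returned by `fullAnnotate` carries `begin`/`end` character offsets into the input text. Those offsets can be reproduced with a plain string search, keeping in mind that Spark NLP's `end` index is inclusive. A minimal sketch over a prefix of the example sentence:

```python
text = ("The results have shown that the product p - choloroaniline is not a "
        "significant factor in chlorhexidine - digluconate associated erosive "
        "cystitis.")

def chunk_offsets(text, chunk):
    """Return (begin, end) character offsets of the first occurrence of
    a chunk, with an inclusive end index as Spark NLP annotations use."""
    begin = text.find(chunk)
    return begin, begin + len(chunk) - 1

print(chunk_offsets(text, "p - choloroaniline"))           # (40, 57)
print(chunk_offsets(text, "chlorhexidine - digluconate"))  # (90, 116)
```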
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:----------------------------|--------:|------:|:------------|-------------:| | 0 | p - choloroaniline | 40 | 57 | CHEM | 0.935767 | | 1 | chlorhexidine - digluconate | 90 | 116 | CHEM | 0.855367 | | 2 | kanamycin | 168 | 176 | CHEM | 0.9824 | | 3 | colistin | 180 | 187 | CHEM | 0.9911 | | 4 | povidone - iodine | 193 | 209 | CHEM | 0.8111 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_chemicals_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Stopwords Remover for Indonesian language (758 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, id, open_source] task: Stop Words Removal language: id edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_id_3.4.1_3.0_1646673043909.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_id_3.4.1_3.0_1646673043909.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","id") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Anda tidak lebih baik dari saya"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","id") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Anda tidak lebih baik dari saya").toDF("text") val results = pipeline.fit(data).transform(data) ```
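Every token in the example sentence ("Anda tidak lebih baik dari saya") is itself a common Indonesian stopword, so the cleaner removes all of them. A plain-Python sketch of the same case-insensitive filtering, using a small hand-picked subset of the stopwords-iso list rather than the full 758-entry vocabulary:

```python
# Illustrative subset of the Indonesian stopwords-iso list; the pretrained
# model ships the full 758-entry list.
stopwords_id = {"anda", "tidak", "lebih", "baik", "dari", "saya"}

tokens = "Anda tidak lebih baik dari saya".split()
clean = [t for t in tokens if t.lower() not in stopwords_id]
print(clean)  # [] -- every token was a stopword
```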
## Results ```bash +------+ |result| +------+ |[] | +------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|id| |Size:|3.7 KB| --- layout: model title: Detect Assertion Status for Radiology author: John Snow Labs name: assertion_dl_radiology date: 2021-03-18 tags: [assertion, en, licensed, radiology, clinical] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 2.7.4 spark_version: 2.4 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Assign assertion status to clinical entities extracted by Radiology NER based on their context in the text. ## Predicted Entities `Confirmed`, `Suspected`, `Negative`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION_RADIOLOGY/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_dl_radiology_en_2.7.4_2.4_1616071311532.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_dl_radiology_en_2.7.4_2.4_1616071311532.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Extract radiology entities with the radiology NER model in the pipeline and assign assertion status to them with the `assertion_dl_radiology` pretrained model. Note: The example below is taken, for demo purposes, from: https://www.mtsamples.com/site/pages/sample.asp?Type=95-Radiology&Sample=1391-Chest%20PA%20&%20Lateral
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") radiology_ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(["ImagingFindings"]) radiology_assertion = AssertionDLModel.pretrained("assertion_dl_radiology", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, word_embeddings, radiology_ner, ner_converter, radiology_assertion]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = LightPipeline(nlpPipeline.fit(empty_data)) text = """ INTERPRETATION: There has been interval development of a moderate left-sided pneumothorax with near complete collapse of the left upper lobe. The lower lobe appears aerated. There is stable, diffuse, bilateral interstitial thickening with no definite acute air space consolidation. The heart and pulmonary vascularity are within normal limits. Left-sided port is seen with Groshong tip at the SVC/RA junction. 
No evidence for acute fracture, malalignment, or dislocation.""" result = model.fullAnnotate(text) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val radiology_ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("ImagingFindings")) val radiology_assertion = AssertionDLModel.pretrained("assertion_dl_radiology", "en", "clinical/models") .setInputCols(Array("sentence", "ner_chunk", "embeddings")) .setOutputCol("assertion") val nlpPipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, radiology_ner, ner_converter, radiology_assertion)) val text = """ INTERPRETATION: There has been interval development of a moderate left-sided pneumothorax with near complete collapse of the left upper lobe. The lower lobe appears aerated. There is stable, diffuse, bilateral interstitial thickening with no definite acute air space consolidation. The heart and pulmonary vascularity are within normal limits. Left-sided port is seen with Groshong tip at the SVC/RA junction.
No evidence for acute fracture, malalignment, or dislocation.""" val data = Seq(text).toDS.toDF("text") val result = nlpPipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.radiology").predict(""" INTERPRETATION: There has been interval development of a moderate left-sided pneumothorax with near complete collapse of the left upper lobe. The lower lobe appears aerated. There is stable, diffuse, bilateral interstitial thickening with no definite acute air space consolidation. The heart and pulmonary vascularity are within normal limits. Left-sided port is seen with Groshong tip at the SVC/RA junction. No evidence for acute fracture, malalignment, or dislocation.""") ```
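The Benchmarking section of this card reports per-label precision and recall plus macro- and micro-averages derived from tp/fp/fn counts. A sketch of how those averages are computed, using the counts copied from the card's own Benchmarking table:

```python
# Per-label (tp, fp, fn) counts, copied from the Benchmarking table below.
counts = {
    "Suspected": (629, 155, 159),
    "Negative": (417, 53, 36),
    "Confirmed": (2252, 173, 186),
}

# Macro-average: mean of the per-label precisions.
precisions = [tp / (tp + fp) for tp, fp, fn in counts.values()]
macro_prec = sum(precisions) / len(precisions)

# Micro-average: precision over the pooled counts of all labels.
total_tp = sum(tp for tp, fp, fn in counts.values())
total_fp = sum(fp for tp, fp, fn in counts.values())
micro_prec = total_tp / (total_tp + total_fp)

print(round(macro_prec, 5))  # 0.87273
print(round(micro_prec, 5))  # 0.89644
```

The same pooling logic applied to recall and F1 reproduces the remaining averaged figures in the table.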
## Results ```bash | | ner_chunk | assertion | |---:|:------------------------------|:------------| | 0 | pneumothorax | Confirmed | | 1 | complete collapse | Confirmed | | 2 | aerated | Confirmed | | 3 | thickening | Confirmed | | 4 | acute air space consolidation | Negative | | 5 | within normal limits | Confirmed | | 6 | acute fracture | Negative | | 7 | malalignment | Negative | | 8 | dislocation | Negative | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_dl_radiology| |Compatibility:|Healthcare NLP 2.7.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| ## Data Source Custom internal labeled radiology dataset. ## Benchmarking ```bash label tp fp fn prec rec f1 Suspected 629 155 159 0.8022959 0.7982234 0.80025446 Negative 417 53 36 0.88723403 0.9205298 0.9035753 Confirmed 2252 173 186 0.9286598 0.92370796 0.92617726 tp: 3298 fp: 381 fn: 381 labels: 3 Macro-average prec: 0.87272996, rec: 0.88082033, f1: 0.8767565 Micro-average prec: 0.89643925, rec: 0.89643925, f1: 0.89643925 ``` --- layout: model title: Finnish T5ForConditionalGeneration Small Cased model (from Finnish-NLP) author: John Snow Labs name: t5_small_nl16 date: 2023-01-31 tags: [fi, open_source, t5, tensorflow] task: Text Generation language: fi edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-nl16-finnish` is a Finnish model originally trained by `Finnish-NLP`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_nl16_fi_4.3.0_3.0_1675126599389.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_nl16_fi_4.3.0_3.0_1675126599389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_nl16","fi") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_nl16","fi") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_nl16| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|fi| |Size:|751.2 MB| ## References - https://huggingface.co/Finnish-NLP/t5-small-nl16-finnish - https://arxiv.org/abs/1910.10683 - https://github.com/google-research/text-to-text-transfer-transformer - https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511 - https://arxiv.org/abs/2002.05202 - https://arxiv.org/abs/2109.10686 - http://urn.fi/urn:nbn:fi:lb-2017070501 - http://urn.fi/urn:nbn:fi:lb-2021050401 - http://urn.fi/urn:nbn:fi:lb-2018121001 - http://urn.fi/urn:nbn:fi:lb-2020021803 - https://sites.research.google/trc/about/ - https://github.com/google-research/t5x - https://github.com/spyysalo/yle-corpus - https://github.com/aajanki/eduskunta-vkk - https://sites.research.google/trc/ - https://www.linkedin.com/in/aapotanskanen/ - https://www.linkedin.com/in/rasmustoivanen/ --- layout: model title: Legal Grievance procedure Clause Binary Classifier (md) author: John Snow Labs name: legclf_grievance_procedure_md date: 2022-11-25 tags: [en, legal, classification, document, agreement, contract, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `grievance-procedure` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.
If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `grievance-procedure` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_grievance_procedure_md_en_1.0.0_3.0_1669376484167.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_grievance_procedure_md_en_1.0.0_3.0_1669376484167.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_grievance_procedure_md", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
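As the description notes, several binary clause classifiers can be run over the same clause and their outputs collected into one True/False profile per clause. A sketch of that aggregation step; the per-classifier labels below are hard-coded stand-ins for real pipeline outputs, and the classifier names are illustrative:

```python
# Hypothetical outputs of three binary clause classifiers for one clause;
# in practice each value would come from a separate legclf_* model's
# "category" column.
clause_predictions = {
    "grievance-procedure": "grievance-procedure",
    "confidentiality": "other",
    "termination": "other",
}

# A classifier "fires" when it predicts its own clause type, i.e. anything
# other than the "other" label.
profile = {name: label != "other" for name, label in clause_predictions.items()}
print(profile)
```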
## Results ```bash +-------+ | result| +-------+ |[grievance-procedure]| |[other]| |[other]| |[grievance-procedure]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_grievance_procedure_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scrapped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash precision recall f1-score support grievance-procedure 1.00 0.96 0.98 24 other 0.97 1.00 0.99 39 accuracy 0.98 63 macro avg 0.99 0.98 0.98 63 weighted avg 0.98 0.98 0.98 63 ``` --- layout: model title: English asr_wav2vec2_large_lv60_timit_asr TFWav2Vec2ForCTC from elgeish author: John Snow Labs name: pipeline_asr_wav2vec2_large_lv60_timit_asr date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_lv60_timit_asr` is an English model originally trained by elgeish. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device please use pipeline_asr_wav2vec2_large_lv60_timit_asr_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_lv60_timit_asr_en_4.2.0_3.0_1664040478629.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_lv60_timit_asr_en_4.2.0_3.0_1664040478629.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_lv60_timit_asr', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_lv60_timit_asr", lang = "en") val annotations = pipeline.transform(audioDF) ```
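The pipeline's AudioAssembler stage consumes raw audio as an array of floats, so 16-bit PCM samples must first be scaled into the [-1.0, 1.0] range. A plain-Python sketch of that scaling (an assumption-laden sketch: in practice you would load samples with an audio library, and Wav2Vec2 models typically expect 16 kHz mono input):

```python
def pcm16_to_float(samples):
    """Scale signed 16-bit PCM samples (-32768..32767) into [-1.0, 1.0]
    floats, as expected for audio_content arrays."""
    return [s / 32768.0 for s in samples]

pcm = [0, 16384, -32768, 32767]
print(pcm16_to_float(pcm))  # [0.0, 0.5, -1.0, 0.999969482421875]
```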
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_lv60_timit_asr| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from arvalinno) author: John Snow Labs name: distilbert_qa_arvalinno_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `arvalinno`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_arvalinno_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769984456.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_arvalinno_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769984456.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_arvalinno_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_arvalinno_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_arvalinno_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/arvalinno/distilbert-base-uncased-finetuned-squad --- layout: model title: English DistilBertForQuestionAnswering Cased model (from ksabeh) author: John Snow Labs name: distilbert_qa_attribute_correction date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-attribute-correction` is an English model originally trained by `ksabeh`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_attribute_correction_en_4.3.0_3.0_1672766328885.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_attribute_correction_en_4.3.0_3.0_1672766328885.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_attribute_correction","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_attribute_correction","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_attribute_correction| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ksabeh/distilbert-attribute-correction --- layout: model title: English DistilBertForQuestionAnswering Cased model (from ruishan-lin) author: John Snow Labs name: distilbert_qa_investopedia_qna date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `investopedia-QnA` is an English model originally trained by `ruishan-lin`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_investopedia_qna_en_4.3.0_3.0_1672775226663.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_investopedia_qna_en_4.3.0_3.0_1672775226663.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_investopedia_qna","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_investopedia_qna","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_investopedia_qna| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ruishan-lin/investopedia-QnA --- layout: model title: Word2Vec Embeddings in Ukrainian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, uk, open_source] task: Embeddings language: uk edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_uk_3.4.1_3.0_1647465021657.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_uk_3.4.1_3.0_1647465021657.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","uk") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Я люблю Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","uk") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Я люблю Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("uk.embed.w2v_cc_300d").predict("""Я люблю Spark NLP""") ```
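The `embeddings` column above carries one 300-dimensional vector per token. A common follow-up step is comparing those vectors by cosine similarity; a minimal plain-Python sketch (no Spark required — the short vectors below are illustrative stand-ins, not actual model output):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative 3-d stand-ins for two token vectors:
print(cosine_similarity([0.2, 0.1, 0.7], [0.1, 0.3, 0.6]))
```

With the real model, the vectors come from each row's `embeddings` annotations; nearby words in meaning should score closer to 1.0.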
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|uk| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Fast Neural Machine Translation Model from English to San Salvador Kongo author: John Snow Labs name: opus_mt_en_kwy date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, kwy, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `en` - target languages: `kwy` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_kwy_xx_2.7.0_2.4_1609169192298.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_kwy_xx_2.7.0_2.4_1609169192298.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_kwy", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_kwy", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.kwy').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_kwy| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering model (from sunitha) author: John Snow Labs name: distilbert_qa_AQG_CV_Squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `AQG_CV_Squad` is an English model originally trained by `sunitha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_AQG_CV_Squad_en_4.0.0_3.0_1654691481417.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_AQG_CV_Squad_en_4.0.0_3.0_1654691481417.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_AQG_CV_Squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_AQG_CV_Squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.by_sunitha").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
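The `nlu` one-liner above joins question and context into a single string with a `|||` separator. Splitting such a string back into its two parts is a one-liner as well; a plain-Python sketch (the helper name is ours, not part of any library):

```python
def split_qa_input(text, sep="|||"):
    # Split a "question|||context" string into its two parts.
    question, _, context = text.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa_input("What is my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What is my name?
```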
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_AQG_CV_Squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/sunitha/AQG_CV_Squad --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_8 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-1024-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_8_en_4.0.0_3.0_1657184119274.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_8_en_4.0.0_3.0_1657184119274.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_8","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_8","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_1024_finetuned_squad_seed_8| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-1024-finetuned-squad-seed-8 --- layout: model title: NER for Demographic Extended (healthcare) author: John Snow Labs name: ner_demographic_extended_healthcare date: 2023-06-08 tags: [demographic, licensed, clinical, en, ner, extended] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model identifies mentions in healthcare text that refer to a patient's demographic characteristics, such as `race`, `ethnicity`, `gender`, `age`, `socioeconomic status`, or `geographic location`. ## Predicted Entities `Gender`, `Age`, `Race_ethnicity`, `Employment_status`, `Job_title`, `Marital_status`, `Political_afiliation`, `Union_membership`, `Sexual_orientation`, `Religion`, `Height`, `Weight`, `Obesity`, `Unhealthy_habits` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_demographic_extended_healthcare_en_4.4.3_3.0_1686217338322.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_demographic_extended_healthcare_en_4.4.3_3.0_1686217338322.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_demographic_extended_healthcare","en","clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner")\ .setLabelCasing("upper") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") ner_pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, word_embeddings, ner, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") ner_model = ner_pipeline.fit(empty_data) data = spark.createDataFrame([["""Patient Information: Gender: Non-binary Age: 68 years old Race: Black Employment status: Retired Marital Status: Divorced Sexual Orientation: Asexual Religion: Judaism Body Mass Index: 29.1 Unhealthy Habits: Substance use Socioeconomic Status: Low Income Area of Residence: Rural setting Disability Status: Blindness Chief Complaint: The patient presented to the emergency department with complaint of severe chest pain that started suddenly while asleep. 
"""]]).toDF("text") result = ner_model.transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_demographic_extended_healthcare","en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") .setLabelCasing("upper") val nerConverter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val nerPipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, wordEmbeddings, ner, nerConverter )) ```
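The Results section pairs each detected chunk with its label and confidence. That layout can be reproduced from the `ner_chunk` annotations' metadata; a plain-Python sketch over stand-in data (the helper and the sample rows are ours — only the metadata keys `entity` and `confidence` mirror Spark NLP's chunk metadata):

```python
def collect_chunks(annotations):
    # annotations: list of (chunk_text, metadata) pairs, where metadata carries
    # the "entity" and "confidence" keys attached to each ner_chunk annotation.
    return [(text, meta["entity"], float(meta["confidence"]))
            for text, meta in annotations]

# Invented sample rows shaped like the model's output:
rows = collect_chunks([
    ("Non-binary", {"entity": "GENDER", "confidence": "0.9987"}),
    ("Divorced", {"entity": "MARITAL_STATUS", "confidence": "0.9996"}),
])
```

Each tuple in `rows` then corresponds to one row of the chunk/label/confidence table.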
## Results ```bash +-------------+------------------+----------+ |chunk |ner_label |confidence| +-------------+------------------+----------+ |Non-binary |GENDER |0.9987 | |68 years old |AGE |0.6892667 | |Black |RACE_ETHNICITY |0.9226 | |Retired |EMPLOYMENT_STATUS |0.9426 | |Divorced |MARITAL_STATUS |0.9996 | |Asexual |SEXUAL_ORIENTATION|1.0 | |Judaism |RELIGION |0.986 | |Substance use|UNHEALTHY_HABITS |0.48755002| +-------------+------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_demographic_extended_healthcare| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.1 MB| ## References Trained on an in-house dataset. ## Benchmarking ```bash label TP FP FN Total Precision Recall F1 B-Age 115 2.0 6 121 0.982906 0.950413 0.966387 I-Age 107 4.0 2 109 0.963964 0.981651 0.972727 B-Employment_status 82 3.0 5 87 0.964706 0.942529 0.953488 I-Employment_status 0 1.0 2 2 0.000000 0.000000 0.000000 B-Gender 110 1.0 21 131 0.990991 0.839695 0.909091 I-Gender 0 0.0 1 1 0.000000 0.000000 0.000000 B-Height 22 1.0 2 24 0.956522 0.916667 0.936170 I-Height 39 1.0 1 40 0.975000 0.975000 0.975000 B-Job_title 34 3.0 16 50 0.918919 0.680000 0.781609 I-Job_title 19 2.0 9 28 0.904762 0.678571 0.775510 B-Marital_Status 80 5.0 6 86 0.941176 0.930233 0.935673 I-Marital_Status 9 1.0 1 10 0.900000 0.900000 0.900000 B-Obesity 56 2.0 2 58 0.965517 0.965517 0.965517 I-Obesity 2 0.0 2 4 1.000000 0.500000 0.666667 B-Political_affiliation 19 0.0 0 19 1.000000 1.000000 1.000000 B-Race_ethnicity 89 5.0 4 93 0.946809 0.956989 0.951872 I-Race_ethnicity 27 3.0 2 29 0.900000 0.931034 0.915254 B-Religion 70 3.0 4 74 0.958904 0.945946 0.952381 I-Religion 2 0.0 5 7 1.000000 0.285714 0.444444 B-Sexual_orientation 57 0.0 0 57 1.000000 1.000000 1.000000 B-Unhealthy_habits 254 27.0 82 336 0.903915 0.755952 0.823339 I-Unhealthy_habits 141 9.0 54 195 0.940000 0.723077 0.817391
B-Union_membership 9 1.0 4 13 0.900000 0.692308 0.782609 I-Union_membership 39 1.0 4 43 0.975000 0.906977 0.939759 B-Weight 26 1.0 1 27 0.962963 0.962963 0.962963 I-Weight 25 1.0 0 25 0.961538 1.000000 0.980392 Macro-average 1433 77.0 236 - 0.881291 0.785432 0.830605 Micro-average 1433 77.0 236 - 0.949006 0.858597 0.901541 ``` --- layout: model title: Extract Pharmacological Entities from Spanish Medical Texts author: John Snow Labs name: ner_pharmacology date: 2022-08-13 tags: [es, clinical, licensed, ner, pharmacology, proteinas, normalizables] task: Named Entity Recognition language: es edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Named Entity Recognition model detects pharmacological entities in Spanish medical texts. It was trained with the MedicalNerApproach annotator, which trains generic neural-network-based NER models. The model detects PROTEINAS and NORMALIZABLES. ## Predicted Entities `PROTEINAS`, `NORMALIZABLES` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_pharmacology_es_4.0.2_3.0_1660355686728.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_pharmacology_es_4.0.2_3.0_1660355686728.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained('ner_pharmacology', "es", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["""Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. 
Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa)."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_pharmacology", "es", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documenter, sentenceDetector, tokenizer, word_embeddings, ner_model, ner_converter)) val data = Seq("""Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa).""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.pharmacology").predict("""Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa).""") ```
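The per-label figures in the Benchmarking section follow the standard precision/recall/F1 definitions. A quick plain-Python sketch to recompute them from raw TP/FP/FN counts (no Spark needed; the counts passed in below are illustrative):

```python
def prf(tp, fp, fn):
    # Standard precision, recall and F1 from true/false positive and
    # false negative counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts: 90 true positives, 10 false positives, 10 false negatives.
p, r, f = prf(90, 10, 10)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.9 0.9 0.9
```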
## Results ```bash +---------------+-------------+ |chunk |ner_label | +---------------+-------------+ |creatinkinasa |PROTEINAS | |LDH |PROTEINAS | |urea |NORMALIZABLES| |CA 19.9 |PROTEINAS | |vimentina |PROTEINAS | |S-100 |PROTEINAS | |HMB-45 |PROTEINAS | |actina |PROTEINAS | |Cisplatino |NORMALIZABLES| |Interleukina II|PROTEINAS | |Dacarbacina |NORMALIZABLES| |Interferon alfa|PROTEINAS | +---------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_pharmacology| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|16.3 MB| ## References The model is prepared using the reference paper: "NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts", Salvador Lima-Lopez, Eulalia Farré-Maduell, Antonio Miranda-Escalada, Vicent Briva-Iglesias and Martin Krallinger. Procesamiento del Lenguaje Natural, Revista nº 67, septiembre de 2021, pp. 243-256. ## Benchmarking ```bash label precision recall f1-score support B-PROTEINAS 0.88 0.93 0.90 813 I-PROTEINAS 0.83 0.71 0.77 321 B-NORMALIZABLES 0.94 0.93 0.93 954 I-NORMALIZABLES 0.87 0.84 0.86 134 micro-avg 0.90 0.89 0.90 2222 macro-avg 0.88 0.85 0.86 2222 weighted-avg 0.90 0.89 0.89 2222 ``` --- layout: model title: Legal Positions Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_positions_bert date: 2023-03-05 tags: [en, legal, classification, clauses, positions, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. 
The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a binary classifier (True, False) for the `Positions` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at the sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that this model's embeddings support up to 512 tokens; if your text is longer, consider splitting it into smaller pieces (the tutorial linked above also covers this). This model can be combined with any of the other hundreds of Legal Clause Classifiers in Models Hub, producing a True/False value for each of the legal clause models you have added. ## Predicted Entities `Positions`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_positions_bert_en_1.0.0_3.0_1678050569234.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_positions_bert_en_1.0.0_3.0_1678050569234.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_positions_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
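The Description recommends splitting long documents into provisions before classification, e.g. paragraph splitting by multiline. A minimal plain-Python sketch of that pre-processing step (the helper is ours; the workshop tutorial shows fuller variants):

```python
import re

def split_paragraphs(text):
    # Split a document into candidate provisions on blank lines
    # (one or more empty lines act as the provision boundary).
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

provisions = split_paragraphs("Clause one.\n\nClause two.\n\n\nClause three.")
print(len(provisions))  # 3
```

Each resulting provision can then be fed to the classifier as its own row in the input DataFrame.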
## Results ```bash +-----------+ |result | +-----------+ |[Positions]| |[Other] | |[Other] | |[Positions]| +-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_positions_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 1.00 0.97 0.98 32 Positions 0.95 1.00 0.97 19 accuracy - - 0.98 51 macro-avg 0.97 0.98 0.98 51 weighted-avg 0.98 0.98 0.98 51 ``` --- layout: model title: English image_classifier_vit_robot2 ViTForImageClassification from sudo-s author: John Snow Labs name: image_classifier_vit_robot2 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_robot2` is an English model originally trained by sudo-s. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_robot2_en_4.1.0_3.0_1660171676018.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_robot2_en_4.1.0_3.0_1660171676018.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_robot2", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_robot2", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_robot2| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.3 MB| --- layout: model title: Legal Loan And Security Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_loan_and_security_agreement_bert date: 2022-11-25 tags: [en, legal, classification, agreement, loan, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_loan_and_security_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `loan-and-security-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `loan-and-security-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_loan_and_security_agreement_bert_en_1.0.0_3.0_1669367546145.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_loan_and_security_agreement_bert_en_1.0.0_3.0_1669367546145.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_loan_and_security_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[loan-and-security-agreement]| |[other]| |[other]| |[loan-and-security-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_loan_and_security_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support loan-and-security-agreement 0.97 0.88 0.92 33 other 0.94 0.98 0.96 65 accuracy - - 0.95 98 macro-avg 0.95 0.93 0.94 98 weighted-avg 0.95 0.95 0.95 98 ``` --- layout: model title: Swahili XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_naija_finetuned_ner_swahili date: 2022-08-01 tags: [sw, open_source, xlm_roberta, ner] task: Named Entity Recognition language: sw edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-naija-finetuned-ner-swahili` is a Swahili model originally trained by `mbeukman`.
## Predicted Entities `PER`, `DATE`, `ORG`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_naija_finetuned_ner_swahili_sw_4.1.0_3.0_1659354658764.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_naija_finetuned_ner_swahili_sw_4.1.0_3.0_1659354658764.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_naija_finetuned_ner_swahili","sw") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_naija_finetuned_ner_swahili","sw") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_naija_finetuned_ner_swahili| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|sw| |Size:|1.0 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-naija-finetuned-ner-swahili - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://github.com/masakhane-io/masakhane-ner --- layout: model title: Part of Speech for Italian author: John Snow Labs name: pos_ud_isdt date: 2021-03-08 tags: [part_of_speech, open_source, italian, pos_ud_isdt, it] task: Part of Speech Tagging language: it edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`.
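The averaged-perceptron idea behind `PerceptronModel` can be illustrated with a small self-contained sketch. This is not the Spark NLP implementation — the feature names and toy data below are illustrative assumptions — but it shows the core trick: the final weights are the average of the weight vector over all training steps, which damps late-training oscillation.

```python
# Conceptual sketch of an averaged perceptron tagger (NOT Spark NLP internals).
# examples: list of (feature_dict, gold_tag) pairs; tags: candidate tag set.

def train_averaged_perceptron(examples, tags, epochs=5):
    weights = {}     # (feature, tag) -> current weight
    totals = {}      # (feature, tag) -> accumulated weight mass for averaging
    timestamps = {}  # (feature, tag) -> last step this key's total was refreshed
    step = 0

    def score(features, tag):
        return sum(weights.get((f, tag), 0.0) * v for f, v in features.items())

    def bump(feature, tag, delta):
        key = (feature, tag)
        # credit the old weight for every step it was "alive" before changing it
        totals[key] = totals.get(key, 0.0) + (step - timestamps.get(key, 0)) * weights.get(key, 0.0)
        timestamps[key] = step
        weights[key] = weights.get(key, 0.0) + delta

    for _ in range(epochs):
        for features, gold in examples:
            step += 1
            guess = max(tags, key=lambda t: score(features, t))
            if guess != gold:  # standard perceptron update on mistakes only
                for f, v in features.items():
                    bump(f, gold, +v)
                    bump(f, guess, -v)

    # final averaged weights over all steps
    averaged = {}
    for key, w in weights.items():
        total = totals.get(key, 0.0) + (step - timestamps.get(key, 0)) * w
        averaged[key] = total / step
    return averaged
```

Real taggers use richer features (word shape, suffixes, previous tags), but the averaging step is the same.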
## Predicted Entities - PROPN - PUNCT - NOUN - ADP - ADJ - DET - AUX - VERB - PRON - CCONJ - NUM - ADV - INTJ - SCONJ - X - SYM - PART {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_isdt_it_3.0.0_3.0_1615225751277.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_isdt_it_3.0.0_3.0_1615225751277.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_isdt", "it") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Ciao da John Snow Labs! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_isdt", "it") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Ciao da John Snow Labs! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Ciao da John Snow Labs! "] token_df = nlu.load('it.pos.ud_isdt').predict(text) token_df ```
## Results ```bash token pos 0 Ciao VERB 1 da ADP 2 John PROPN 3 Snow PROPN 4 Labs PROPN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_isdt| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|it| --- layout: model title: English image_classifier_vit_pond_image_classification_2 ViTForImageClassification from SummerChiam author: John Snow Labs name: image_classifier_vit_pond_image_classification_2 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_pond_image_classification_2` is an English model originally trained by SummerChiam. ## Predicted Entities `Normal`, `Boiling`, `Algae`, `NormalCement`, `NormalRain`, `BoilingNight`, `NormalNight` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_2_en_4.1.0_3.0_1660166680787.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_2_en_4.1.0_3.0_1660166680787.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_pond_image_classification_2", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_pond_image_classification_2", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_pond_image_classification_2| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Bangla T5ForConditionalGeneration Cased model (from csebuetnlp) author: John Snow Labs name: t5_banglat5_banglaparaphrase date: 2023-01-30 tags: [bn, open_source, t5, tensorflow] task: Text Generation language: bn edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `banglat5_banglaparaphrase` is a Bangla model originally trained by `csebuetnlp`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_banglat5_banglaparaphrase_bn_4.3.0_3.0_1675100065577.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_banglat5_banglaparaphrase_bn_4.3.0_3.0_1675100065577.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_banglat5_banglaparaphrase","bn") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_banglat5_banglaparaphrase","bn") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_banglat5_banglaparaphrase| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|bn| |Size:|1.0 GB| ## References - https://huggingface.co/csebuetnlp/banglat5_banglaparaphrase - https://github.com/csebuetnlp/BanglaNLG - https://github.com/csebuetnlp/normalizer --- layout: model title: Detect Adverse Drug Events (clinical_large) author: John Snow Labs name: ner_ade_emb_clinical_large date: 2023-05-21 tags: [ner, ade, drug, licensed, clinical, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect adverse reactions to drugs in reviews, tweets, and medical text using a pre-trained NER model. ## Predicted Entities `DRUG`, `ADE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_ade_emb_clinical_large_en_4.4.2_3.0_1684710290191.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_ade_emb_clinical_large_en_4.4.2_3.0_1684710290191.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_ade_emb_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(['sentence', 'token', 'ner'])\ .setOutputCol('ner_chunk') pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter ]) sample_df = spark.createDataFrame([["Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps."]]).toDF("text") result = pipeline.fit(sample_df).transform(sample_df) ```
```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_ade_emb_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter)) val sample_data = Seq("Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps.").toDS.toDF("text") val result = pipeline.fit(sample_data).transform(sample_data) ```
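To turn the `ner_chunk` annotations into a flat table, the collected rows can be post-processed on the driver. A minimal pure-Python sketch, assuming the annotations have been read into dicts mirroring Spark NLP's annotation schema (`result`, `begin`, `end`, `metadata['entity']`) — the sample rows are illustrative:

```python
# Flatten collected ner_chunk annotations into (chunk, begin, end, label) rows.
# In a live pipeline the dicts would come from result.select("ner_chunk").collect().

def chunks_to_rows(ner_chunks):
    rows = []
    for ann in ner_chunks:
        rows.append((ann["result"], ann["begin"], ann["end"], ann["metadata"]["entity"]))
    return rows

sample = [
    {"result": "Lipitor", "begin": 12, "end": 18, "metadata": {"entity": "DRUG"}},
    {"result": "severe fatigue", "begin": 52, "end": 65, "metadata": {"entity": "ADE"}},
]
for row in chunks_to_rows(sample):
    print(row)
```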
## Results ```bash +--------------+-----+---+---------+ |chunk |begin|end|ner_label| +--------------+-----+---+---------+ |Lipitor |12 |18 |DRUG | |severe fatigue|52 |65 |ADE | |voltaren |97 |104|DRUG | |cramps |152 |157|ADE | +--------------+-----+---+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_ade_emb_clinical_large| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|2.7 MB| ## Benchmarking ```bash label precision recall f1-score support DRUG 0.92 0.91 0.92 16032 ADE 0.82 0.80 0.81 6142 micro-avg 0.89 0.88 0.89 22174 macro-avg 0.87 0.86 0.86 22174 weighted-avg 0.89 0.88 0.89 22174 ``` --- layout: model title: English BertForQuestionAnswering Mini Cased model (from anas-awadalla) author: John Snow Labs name: bert_qa_mini_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-mini-finetuned-squad` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mini_finetuned_squad_en_4.0.0_3.0_1657187799271.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mini_finetuned_squad_en_4.0.0_3.0_1657187799271.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mini_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mini_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
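Extractive QA models like this one score every context token as a potential answer start and answer end, then pick the valid span maximizing the combined score. A framework-free sketch of that span-selection step — not Spark NLP internals, and the tokens and scores below are made-up:

```python
# Pick the (start, end) pair with the highest start_score + end_score,
# subject to start <= end and a maximum span length.

def best_span(tokens, start_scores, end_scores, max_len=10):
    best = (0, 0)
    best_score = float("-inf")
    for i in range(len(tokens)):
        for j in range(i, min(i + max_len, len(tokens))):
            s = start_scores[i] + end_scores[j]
            if s > best_score:
                best_score = s
                best = (i, j)
    return " ".join(tokens[best[0]:best[1] + 1])

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 3.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0]
end   = [0.0, 0.1, 0.0, 2.5, 0.0, 0.0, 0.0, 0.0, 0.7, 0.0]
print(best_span(tokens, start, end))  # prints: Clara
```

Production models do this over subword tokens and logits, but the argmax over valid pairs is the same idea.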
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_mini_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|42.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-mini-finetuned-squad --- layout: model title: English T5ForConditionalGeneration Tiny Cased model (from google) author: John Snow Labs name: t5_efficient_tiny_el12 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-el12` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_el12_en_4.3.0_3.0_1675123332949.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_el12_en_4.3.0_3.0_1675123332949.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_tiny_el12","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_tiny_el12","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_tiny_el12| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|74.4 MB| ## References - https://huggingface.co/google/t5-efficient-tiny-el12 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Translate English to Niger-Kordofanian languages Pipeline author: John Snow Labs name: translate_en_nic date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, nic, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `nic` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_nic_xx_2.7.0_2.4_1609691259533.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_nic_xx_2.7.0_2.4_1609691259533.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_nic", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_nic", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.nic').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_nic| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Compensation Clause Binary Classifier author: John Snow Labs name: legclf_compensation_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `compensation` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
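The "paragraph splitting (by multiline)" technique mentioned above can be sketched in a few lines: break a long agreement on blank lines so each classifier input stays within the ~512-token embedding limit. The sample text below is illustrative, not from a real filing:

```python
import re

def split_paragraphs(text):
    # one or more blank lines (possibly containing stray spaces) separate paragraphs
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = """SECTION 5. COMPENSATION.

The Executive shall receive an annual base salary.

SECTION 6. TERMINATION."""

for p in split_paragraphs(doc):
    print(p)
```

Each resulting paragraph can then be fed to the classifier as its own `clause_text` row.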
## Predicted Entities `other`, `compensation` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_compensation_clause_en_1.0.0_3.2_1660123321963.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_compensation_clause_en_1.0.0_3.2_1660123321963.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_compensation_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[compensation]| |[other]| |[other]| |[compensation]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_compensation_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support compensation 0.94 0.88 0.91 34 other 0.95 0.97 0.96 78 accuracy - - 0.95 112 macro-avg 0.94 0.93 0.94 112 weighted-avg 0.95 0.95 0.95 112 ``` --- layout: model title: General Oncology Pipeline author: John Snow Labs name: oncology_general_pipeline date: 2022-12-01 tags: [licensed, pipeline, oncology, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline includes Named-Entity Recognition, Assertion Status and Relation Extraction models to extract information from oncology texts. This pipeline extracts diagnoses, treatments, tests, anatomical references and demographic entities.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/oncology_general_pipeline_en_4.2.2_3.0_1669899456383.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/oncology_general_pipeline_en_4.2.2_3.0_1669899456383.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("oncology_general_pipeline", "en", "clinical/models") pipeline.annotate("The patient underwent a left mastectomy for a left breast cancer two months ago. The tumor is positive for ER and PR.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("oncology_general_pipeline", "en", "clinical/models") val result = pipeline.fullAnnotate("""The patient underwent a left mastectomy for a left breast cancer two months ago. The tumor is positive for ER and PR.""")(0) ``` {:.nlu-block} ```python import nlu nlu.load("en.oncology_general.pipeline").predict("""The patient underwent a left mastectomy for a left breast cancer two months ago. The tumor is positive for ER and PR.""") ```
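The `annotate` call returns a plain dictionary keyed by output column, with parallel lists of annotations. Assuming the chunk and label columns come back as equal-length lists (the `mock_result` below is illustrative only; real column names depend on the pipeline's stages), a small helper can pair them up without Spark:

```python
def pair_chunks_with_labels(result, chunk_col="ner_chunk", label_col="ner_label"):
    """Zip parallel chunk/label lists from an annotate()-style dict."""
    return list(zip(result.get(chunk_col, []), result.get(label_col, [])))

# Hypothetical output shape for the example sentence above.
mock_result = {
    "ner_chunk": ["left", "mastectomy", "breast cancer"],
    "ner_label": ["Direction", "Cancer_Surgery", "Cancer_Dx"],
}
print(pair_chunks_with_labels(mock_result))
# [('left', 'Direction'), ('mastectomy', 'Cancer_Surgery'), ('breast cancer', 'Cancer_Dx')]
```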
## Results

```bash
******************** ner_oncology_wip results ********************

| chunk          | ner_label        |
|:---------------|:-----------------|
| left           | Direction        |
| mastectomy     | Cancer_Surgery   |
| left           | Direction        |
| breast cancer  | Cancer_Dx        |
| two months ago | Relative_Date    |
| tumor          | Tumor_Finding    |
| positive       | Biomarker_Result |
| ER             | Biomarker        |
| PR             | Biomarker        |

******************** ner_oncology_diagnosis_wip results ********************

| chunk         | ner_label     |
|:--------------|:--------------|
| breast cancer | Cancer_Dx     |
| tumor         | Tumor_Finding |

******************** ner_oncology_tnm_wip results ********************

| chunk         | ner_label |
|:--------------|:----------|
| breast cancer | Cancer_Dx |
| tumor         | Tumor     |

******************** ner_oncology_therapy_wip results ********************

| chunk      | ner_label      |
|:-----------|:---------------|
| mastectomy | Cancer_Surgery |

******************** ner_oncology_test_wip results ********************

| chunk    | ner_label        |
|:---------|:-----------------|
| positive | Biomarker_Result |
| ER       | Biomarker        |
| PR       | Biomarker        |

******************** assertion_oncology_wip results ********************

| chunk         | ner_label      | assertion |
|:--------------|:---------------|:----------|
| mastectomy    | Cancer_Surgery | Past      |
| breast cancer | Cancer_Dx      | Present   |
| tumor         | Tumor_Finding  | Present   |
| ER            | Biomarker      | Present   |
| PR            | Biomarker      | Present   |

******************** re_oncology_wip results ********************

| chunk1        | entity1          | chunk2         | entity2       | relation      |
|:--------------|:-----------------|:---------------|:--------------|:--------------|
| mastectomy    | Cancer_Surgery   | two months ago | Relative_Date | is_related_to |
| breast cancer | Cancer_Dx        | two months ago | Relative_Date | is_related_to |
| tumor         | Tumor_Finding    | ER             | Biomarker     | O             |
| tumor         | Tumor_Finding    | PR             | Biomarker     | O             |
| positive      | Biomarker_Result | ER             | Biomarker     | is_related_to |
| positive      | Biomarker_Result | PR             | Biomarker     | is_related_to |

******************** re_oncology_granular_wip results ********************

| chunk1        | entity1          | chunk2         | entity2       | relation      |
|:--------------|:-----------------|:---------------|:--------------|:--------------|
| mastectomy    | Cancer_Surgery   | two months ago | Relative_Date | is_date_of    |
| breast cancer | Cancer_Dx        | two months ago | Relative_Date | is_date_of    |
| tumor         | Tumor_Finding    | ER             | Biomarker     | O             |
| tumor         | Tumor_Finding    | PR             | Biomarker     | O             |
| positive      | Biomarker_Result | ER             | Biomarker     | is_finding_of |
| positive      | Biomarker_Result | PR             | Biomarker     | is_finding_of |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|oncology_general_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.2.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- MedicalNerModel
- NerConverter
- ChunkMergeModel
- ChunkMergeModel
- AssertionDLModel
- PerceptronModel
- DependencyParserModel
- RelationExtractionModel
- RelationExtractionModel

---
layout: model
title: French Bert Embeddings (from amine)
author: John Snow Labs
name: bert_embeddings_bert_base_5lang_cased
date: 2022-04-11
tags: [bert, embeddings, fr, open_source]
task: Embeddings
language: fr
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-5lang-cased` is a French model originally trained by `amine`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_5lang_cased_fr_3.4.2_3.0_1649675721237.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_5lang_cased_fr_3.4.2_3.0_1649675721237.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_5lang_cased","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark Nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_5lang_cased","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark Nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fr.embed.bert_5lang_cased").predict("""J'adore Spark Nlp""") ```
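Once `result` is collected, each token carries a float vector in the `embeddings` column. Comparing two such vectors needs only a cosine-similarity helper; this stdlib-only sketch works on any pair of equal-length float lists (the sample vectors are made up):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```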
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_5lang_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|fr| |Size:|464.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/amine/bert-base-5lang-cased - https://cloud.google.com/compute/docs/machine-types#n1_machine_type --- layout: model title: Persian (Farsi) ALBERT Embeddings author: John Snow Labs name: albert_embeddings_albert_fa_zwnj_base_v2 date: 2022-04-14 tags: [albert, embeddings, fa, open_source] task: Embeddings language: fa edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: AlBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `albert-fa-zwnj-base-v2` is a Persian model orginally trained by `HooshvareLab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_fa_zwnj_base_v2_fa_3.4.2_3.0_1649954311972.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_fa_zwnj_base_v2_fa_3.4.2_3.0_1649954311972.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_fa_zwnj_base_v2","fa") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["من عاشق جرقه NLP هستم"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_fa_zwnj_base_v2","fa") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("من عاشق جرقه NLP هستم").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fa.embed.albert_fa_zwnj_base_v2").predict("""من عاشق جرقه NLP هستم""") ```
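The model emits one vector per token; to get a single sentence-level vector from the collected token embeddings, a common approach is mean pooling. A minimal sketch, assuming the token vectors have already been pulled out of `result` as plain float lists:

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    if not token_vectors:
        return []
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / len(token_vectors) for i in range(dim)]

print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```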
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_embeddings_albert_fa_zwnj_base_v2| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|fa| |Size:|44.7 MB| |Case sensitive:|false| ## References - https://huggingface.co/HooshvareLab/albert-fa-zwnj-base-v2 - https://github.com/m3hrdadfi/albert-persian --- layout: model title: English asr_wav2vec2_base_common_voice_second_colab TFWav2Vec2ForCTC from zoha author: John Snow Labs name: asr_wav2vec2_base_common_voice_second_colab date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_base_common_voice_second_colab` is a English model originally trained by zoha. NOTE: This model only works on a CPU, if you need to use this model on a GPU device please use asr_wav2vec2_base_common_voice_second_colab_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_common_voice_second_colab_en_4.2.0_3.0_1664024597921.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_common_voice_second_colab_en_4.2.0_3.0_1664024597921.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_common_voice_second_colab", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_common_voice_second_colab", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
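Wav2Vec2 expects mono float audio (typically 16 kHz). Assuming your input is a mono 16-bit PCM WAV file, a stdlib-only decoder to floats in [-1, 1] might look like the sketch below; `sample.wav` and the `audio_content` column name are illustrative:

```python
import struct
import wave

def wav_to_floats(path):
    """Decode a mono 16-bit PCM WAV file (path or file object) to floats in [-1, 1]."""
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2 and wav.getnchannels() == 1
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Hypothetical usage with the pipeline above:
# audioDf = spark.createDataFrame([[wav_to_floats("sample.wav")]], ["audio_content"])
```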
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_common_voice_second_colab| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|349.4 MB| --- layout: model title: Part of Speech for Thai author: John Snow Labs name: pos_lst20 date: 2021-03-09 tags: [part_of_speech, open_source, thai, pos_lst20, th] task: Part of Speech Tagging language: th edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - NN - VV - PS - NG - NU - CL - PU - CC - AX - AV - FX - AJ - PR - PA - IJ - XX {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_lst20_th_3.0.0_3.0_1615292399291.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_lst20_th_3.0.0_3.0_1615292399291.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pos = PerceptronModel.pretrained("pos_lst20", "th") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    pos
])

example = spark.createDataFrame([['สวัสดีจาก John Snow Labs! ']], ["text"])
result = pipeline.fit(example).transform(example)
```

```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val pos = PerceptronModel.pretrained("pos_lst20", "th")
    .setInputCols(Array("document", "token"))
    .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))

val data = Seq("สวัสดีจาก John Snow Labs! ").toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu

text = ["สวัสดีจาก John Snow Labs! "]
token_df = nlu.load('th.pos').predict(text)
token_df
```
## Results ```bash token pos 0 สวัสดีจาก CC 1 John NN 2 Snow NN 3 Labs NN 4 ! PU ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_lst20| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|th| --- layout: model title: English asr_wav2vec2_base_timit_demo_colab971 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab971 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_base_timit_demo_colab971` is a English model originally trained by hassnain. NOTE: This pipeline only works on a CPU, if you need to use this pipeline on a GPU device please use pipeline_asr_wav2vec2_base_timit_demo_colab971_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab971_en_4.2.0_3.0_1664043093338.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab971_en_4.2.0_3.0_1664043093338.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
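Given rows of (token, tag) pairs like the table above, the distribution of predicted tags can be tallied with the stdlib alone:

```python
from collections import Counter

def tag_distribution(rows):
    """Count POS tags in (token, tag) pairs, most common first."""
    return Counter(tag for _, tag in rows).most_common()

rows = [("สวัสดีจาก", "CC"), ("John", "NN"), ("Snow", "NN"), ("Labs", "NN"), ("!", "PU")]
print(tag_distribution(rows))  # [('NN', 3), ('CC', 1), ('PU', 1)]
```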
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab971', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab971", lang = "en") val annotations = pipeline.transform(audioDF) ```
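To sanity-check transcriptions from a pipeline like this against reference text, the standard metric is word error rate (word-level Levenshtein distance divided by reference length). This helper is not part of the pipeline itself, just a plain-Python evaluation sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
```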
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab971| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Fast Neural Machine Translation Model from English to Pijin author: John Snow Labs name: opus_mt_en_pis date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, pis, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `pis` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_pis_xx_2.7.0_2.4_1609170240745.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_pis_xx_2.7.0_2.4_1609170240745.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_pis", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

data = ["text to translate"]
result = light_pipeline.fullAnnotate(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_pis", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("text to translate").toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu

text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.pis').predict(text, output_level='sentence')
opus_df
```
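MarianTransformer truncates inputs beyond its maximum sequence length, which is why the sentence detector precedes it. For very long sentences, a crude pre-splitting step can keep each unit under a word budget; this chunker is illustrative only (the `max_words` budget is an assumption, not a model parameter):

```python
def chunk_words(text, max_words=50):
    """Split text into chunks of at most max_words words each."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

print(chunk_words("one two three four five", max_words=2))
# ['one two', 'three four', 'five']
```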
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_pis| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model author: John Snow Labs name: bert_qa_bert_large_cased_whole_word_masking_finetuned_squad date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-cased-whole-word-masking-finetuned-squad` is a English model orginally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_cased_whole_word_masking_finetuned_squad_en_4.0.0_3.0_1654536242460.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_cased_whole_word_masking_finetuned_squad_en_4.0.0_3.0_1654536242460.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_cased_whole_word_masking_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_large_cased_whole_word_masking_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.large_cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
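Under the hood, extractive QA models like this one score every context position as a potential answer start and end, then pick the best valid span. A minimal sketch of that span-selection step (toy score lists, not the model's real logits):

```python
def best_span(start_scores, end_scores, max_len=30):
    """Pick (start, end) indices maximizing start+end score, with end >= start."""
    best, best_score = (0, 0), float("-inf")
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            if ss + end_scores[e] > best_score:
                best_score, best = ss + end_scores[e], (s, e)
    return best

print(best_span([0.1, 2.0, 0.3], [0.2, 0.1, 1.5]))  # (1, 2)
```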
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_large_cased_whole_word_masking_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/bert-large-cased-whole-word-masking-finetuned-squad - https://github.com/google-research/bert - https://arxiv.org/abs/1810.04805 - https://en.wikipedia.org/wiki/English_Wikipedia - https://yknzhu.wixsite.com/mbweb --- layout: model title: English asr_english_filipino_wav2vec2_l_xls_r_test_07 TFWav2Vec2ForCTC from Khalsuu author: John Snow Labs name: asr_english_filipino_wav2vec2_l_xls_r_test_07 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_english_filipino_wav2vec2_l_xls_r_test_07` is a English model originally trained by Khalsuu. NOTE: This model only works on a CPU, if you need to use this model on a GPU device please use asr_english_filipino_wav2vec2_l_xls_r_test_07_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_english_filipino_wav2vec2_l_xls_r_test_07_en_4.2.0_3.0_1664116428412.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_english_filipino_wav2vec2_l_xls_r_test_07_en_4.2.0_3.0_1664116428412.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_english_filipino_wav2vec2_l_xls_r_test_07", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_english_filipino_wav2vec2_l_xls_r_test_07", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
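If the source audio is not at the model's expected 16 kHz sample rate, resample it before building `audioDf`. The nearest-neighbour sketch below shows the idea with no dependencies; a proper resampler (e.g. polyphase filtering in a DSP library) is preferable for real transcription quality:

```python
def resample_nearest(samples, src_rate, dst_rate=16000):
    """Nearest-neighbour resampling of a float sample list to dst_rate."""
    n_out = int(len(samples) * dst_rate / src_rate)
    return [samples[min(int(i * src_rate / dst_rate), len(samples) - 1)]
            for i in range(n_out)]

print(len(resample_nearest([0.0] * 32000, src_rate=32000)))  # 16000
```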
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_english_filipino_wav2vec2_l_xls_r_test_07| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Legal Africa Document Classifier (EURLEX) author: John Snow Labs name: legclf_africa_bert date: 2023-03-06 tags: [en, legal, classification, clauses, africa, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_africa_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether or not it belongs to the class Africa (Binary Classification) according to EuroVoc labels. ## Predicted Entities `Africa`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_africa_bert_en_1.0.0_3.0_1678111589297.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_africa_bert_en_1.0.0_3.0_1678111589297.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_africa_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
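The benchmarking section of cards like this reports per-label precision and recall. For a binary classifier, those figures come from parallel gold/predicted label lists; a stdlib-only sketch of the computation (the sample lists are made up):

```python
def precision_recall(gold, pred, positive):
    """Precision and recall for one label, from parallel gold/pred label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if p == positive and g != positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

gold = ["Africa", "Other", "Africa", "Other"]
pred = ["Africa", "Other", "Other", "Other"]
print(precision_recall(gold, pred, "Africa"))  # (1.0, 0.5)
```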
## Results ```bash +-------+ |result| +-------+ |[Africa]| |[Other]| |[Other]| |[Africa]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_africa_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Africa 0.96 0.94 0.95 48 Other 0.95 0.96 0.96 57 accuracy - - 0.95 105 macro-avg 0.95 0.95 0.95 105 weighted-avg 0.95 0.95 0.95 105 ``` --- layout: model title: English RobertaForQuestionAnswering (from huxxx657) author: John Snow Labs name: roberta_qa_roberta_base_finetuned_deletion_squad_15 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-deletion-squad-15` is a English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_deletion_squad_15_en_4.0.0_3.0_1655733897667.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_deletion_squad_15_en_4.0.0_3.0_1655733897667.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_finetuned_deletion_squad_15","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_finetuned_deletion_squad_15","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_deletion_15.by_huxxx657").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_finetuned_deletion_squad_15| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-deletion-squad-15 --- layout: model title: English DistilBertForQuestionAnswering Base Cased model (from victorlee071200) author: John Snow Labs name: distilbert_qa_victorlee071200_base_cased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-finetuned-squad` is a English model originally trained by `victorlee071200`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_victorlee071200_base_cased_finetuned_squad_en_4.3.0_3.0_1672766988389.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_victorlee071200_base_cased_finetuned_squad_en_4.3.0_3.0_1672766988389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_victorlee071200_base_cased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_victorlee071200_base_cased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_victorlee071200_base_cased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/victorlee071200/distilbert-base-cased-finetuned-squad --- layout: model title: Detect Posology concepts (clinical_large) author: John Snow Labs name: ner_posology_emb_clinical_large date: 2023-04-12 tags: [ner, clinical, licensed, en, posology] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects drug, dosage, and administration instructions in text using a pretrained NER model. ## Predicted Entities `DRUG`, `STRENGTH`, `FREQUENCY`, `DURATION`, `DOSAGE`, `ROUTE`, `FORM` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_POSOLOGY.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_emb_clinical_large_en_4.3.2_3.0_1681303545819.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_emb_clinical_large_en_4.3.2_3.0_1681303545819.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") posology_ner = MedicalNerModel.pretrained("ner_posology_emb_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("posology_ner") posology_ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "posology_ner"]) \ .setOutputCol("posology_ner_chunk") posology_ner_pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, word_embeddings, posology_ner, posology_ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") posology_ner_model = posology_ner_pipeline.fit(empty_data) results = posology_ner_model.transform(spark.createDataFrame([["The patient has been advised Aspirin 81 milligrams QDay. insulin 50 units in a.m. HCTZ 50 mg QDay. 
Nitroglycerin 1/150 sublingually."]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val posology_ner_model = MedicalNerModel.pretrained("ner_posology_emb_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("posology_ner") val posology_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "posology_ner")) .setOutputCol("posology_ner_chunk") val posology_pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, posology_ner_model, posology_ner_converter)) val data = Seq(""" The patient has been advised Aspirin 81 milligrams QDay. insulin 50 units in a.m. HCTZ 50 mg QDay. Nitroglycerin 1/150 sublingually.""").toDS.toDF("text") val result = posology_pipeline.fit(data).transform(data) ```
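The chunks produced by `posology_ner_converter` can be post-processed outside the pipeline; here is a minimal pure-Python sketch of grouping them into per-drug records. The sample chunks mirror the Results section, but the grouping logic itself is illustrative, not part of the model:

```python
# Group (chunk, entity) pairs from the NER output into per-drug records.
# Each DRUG mention starts a new record; subsequent attributes attach to it.
chunks = [
    ("Aspirin", "DRUG"), ("81 milligrams", "STRENGTH"), ("QDay", "FREQUENCY"),
    ("insulin", "DRUG"), ("50 units", "STRENGTH"),
]

records = []
for text, label in chunks:
    if label == "DRUG":
        records.append({"drug": text})      # start a new record at each drug mention
    elif records:
        records[-1][label.lower()] = text   # attach attributes to the latest drug

print(records)
```

This greedy left-to-right grouping assumes attributes always follow the drug they modify, which holds for simple prescriptions like the example sentence.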
## Results ```bash | | chunks | begin | end | entities | |---:|:--------------|--------:|------:|:-----------| | 0 | Aspirin | 268 | 274 | DRUG | | 1 | 81 milligrams | 276 | 288 | STRENGTH | | 2 | QDay | 290 | 293 | FREQUENCY | | 3 | insulin | 296 | 302 | DRUG | | 4 | 50 units | 304 | 311 | STRENGTH | | 5 | HCTZ | 321 | 324 | DRUG | | 6 | 50 mg | 326 | 330 | STRENGTH | | 7 | QDay | 332 | 335 | FREQUENCY | | 8 | Nitroglycerin | 338 | 350 | DRUG | | 9 | 1/150 | 352 | 356 | STRENGTH | | 10 | sublingually | 358 | 369 | FREQUENCY | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_emb_clinical_large| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|2.8 MB| ## Benchmarking ```bash precision recall f1-score support DRUG 0.88 0.92 0.90 2252 STRENGTH 0.89 0.92 0.91 2290 FREQUENCY 0.92 0.90 0.91 1782 DURATION 0.76 0.83 0.79 463 DOSAGE 0.62 0.65 0.64 476 ROUTE 0.88 0.88 0.88 394 FORM 0.89 0.72 0.79 773 micro avg 0.87 0.87 0.87 8430 macro avg 0.83 0.83 0.83 8430 weighted avg 0.87 0.87 0.87 8430 ``` --- layout: model title: French asr_exp_w2v2r_vp_100k_accent_france_2_belgium_8_s709 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_vp_100k_accent_france_2_belgium_8_s709 date: 2022-09-26 tags: [wav2vec2, fr, audio, open_source, asr] task: Automatic Speech Recognition language: fr edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_exp_w2v2r_vp_100k_accent_france_2_belgium_8_s709` is a French model originally trained by jonatasgrosman. 
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_exp_w2v2r_vp_100k_accent_france_2_belgium_8_s709_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_france_2_belgium_8_s709_fr_4.2.0_3.0_1664196892697.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_france_2_belgium_8_s709_fr_4.2.0_3.0_1664196892697.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_vp_100k_accent_france_2_belgium_8_s709", "fr")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_vp_100k_accent_france_2_belgium_8_s709", "fr") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_vp_100k_accent_france_2_belgium_8_s709| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|fr| |Size:|1.2 GB| --- layout: model title: Spanish RobertaForQuestionAnswering Large Cased model (from nlp-en-es) author: John Snow Labs name: roberta_qa_bertin_large_finetuned_s_c date: 2022-12-02 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bertin-large-finetuned-sqac` is a Spanish model originally trained by `nlp-en-es`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_bertin_large_finetuned_s_c_es_4.2.4_3.0_1669984994226.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_bertin_large_finetuned_s_c_es_4.2.4_3.0_1669984994226.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_bertin_large_finetuned_s_c","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_bertin_large_finetuned_s_c","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_bertin_large_finetuned_s_c| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|es| |Size:|450.6 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/nlp-en-es/bertin-large-finetuned-sqac --- layout: model title: Legal Approvals Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_approvals_bert date: 2023-03-05 tags: [en, legal, classification, clauses, approvals, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a binary classifier (True, False) for the `Approvals` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do binary classification at the sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc.
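As a rough illustration of the first splitting technique (paragraph splitting by multiline), here is a minimal pure-Python sketch, independent of the Spark NLP helpers used in the tutorial:

```python
import re

def split_paragraphs(text):
    """Split a document into paragraphs on runs of blank lines."""
    # A newline, optional whitespace, and another newline marks a paragraph break.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

# Hypothetical two-clause fragment, just to show the splitting behavior.
doc = "1. TERM.\nThis Agreement starts today.\n\n2. APPROVALS.\nAll approvals shall be in writing."
print(split_paragraphs(doc))
```

Each resulting paragraph can then be fed to the classifier as one document, giving the model the full clause as context rather than individual sentences.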
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Approvals`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_approvals_bert_en_1.0.0_3.0_1678050666848.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_approvals_bert_en_1.0.0_3.0_1678050666848.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_approvals_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
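When several binary clause classifiers are combined as described in the description, each one contributes a label per document; a minimal pure-Python sketch of collapsing those labels into True/False flags (the second model name is hypothetical, used only to show the pattern):

```python
# Labels as read from each binary classifier's output column.
clause_results = {
    "legclf_approvals_bert": "Approvals",        # classifier predicted its target class
    "legclf_indemnification_bert": "Other",      # hypothetical second classifier
}

# A clause type is flagged True when its classifier did not answer "Other".
flags = {name: label != "Other" for name, label in clause_results.items()}
print(flags)
```

The same dictionary-comprehension pattern scales to any number of clause classifiers sharing one embeddings stage.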
## Results ```bash +-----------+ |result | +-----------+ |[Approvals]| |[Other] | |[Other] | |[Approvals]| +-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_approvals_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.3 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Approvals 0.88 0.93 0.90 15 Other 0.96 0.93 0.95 29 accuracy - - 0.93 44 macro-avg 0.92 0.93 0.93 44 weighted-avg 0.93 0.93 0.93 44 ``` --- layout: model title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_9 TFWav2Vec2ForCTC from nimrah author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_turkish_colab_9 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_9` is an English model originally trained by nimrah. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_300m_turkish_colab_9_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_9_en_4.2.0_3.0_1664117919703.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_9_en_4.2.0_3.0_1664117919703.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_9", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_9", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_turkish_colab_9| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Stop Words Cleaner for Zulu author: John Snow Labs name: stopwords_zu date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: zu edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, zu] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_zu_zu_2.5.4_2.4_1594742440504.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_zu_zu_2.5.4_2.4_1594742440504.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_zu", "zu") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Ngaphandle kokuba yinkosi yasenyakatho, uJohn Snow ungudokotela waseNgilandi futhi ungumholi ekwenziweni kwe-anesthesia kanye nenhlanzeko yezokwelapha.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_zu", "zu") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Ngaphandle kokuba yinkosi yasenyakatho, uJohn Snow ungudokotela waseNgilandi futhi ungumholi ekwenziweni kwe-anesthesia kanye nenhlanzeko yezokwelapha.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Ngaphandle kokuba yinkosi yasenyakatho, uJohn Snow ungudokotela waseNgilandi futhi ungumholi ekwenziweni kwe-anesthesia kanye nenhlanzeko yezokwelapha."""] stopword_df = nlu.load('zu.stopwords').predict(text) stopword_df[['cleanTokens']] ```
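Conceptually, the cleaner is a set-based token filter; a toy pure-Python sketch (the tiny hand-picked English stop list is illustrative only, not the pretrained Zulu list shipped with the model):

```python
# Toy stop-word filter: drop tokens found in a fixed set (case-insensitive).
STOP_WORDS = {"the", "a", "of", "and", "in"}

def clean_tokens(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(clean_tokens(["The", "meaning", "of", "a", "text"]))
```

The pretrained model works the same way but carries a curated Zulu stop-word list and plugs into a Spark NLP pipeline after the tokenizer.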
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=9, result='Ngaphandle', metadata={'sentence': '0'}), Row(annotatorType='token', begin=11, end=16, result='kokuba', metadata={'sentence': '0'}), Row(annotatorType='token', begin=18, end=24, result='yinkosi', metadata={'sentence': '0'}), Row(annotatorType='token', begin=26, end=37, result='yasenyakatho', metadata={'sentence': '0'}), Row(annotatorType='token', begin=38, end=38, result=',', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_zu| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|zu| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: Indonesian BertForQuestionAnswering Cased model (from FirmanBr) author: John Snow Labs name: bert_qa_firmanindolanguagemodel date: 2022-07-07 tags: [id, open_source, bert, question_answering] task: Question Answering language: id edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `FirmanIndoLanguageModel` is an Indonesian model originally trained by `FirmanBr`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_firmanindolanguagemodel_id_4.0.0_3.0_1657181996433.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_firmanindolanguagemodel_id_4.0.0_3.0_1657181996433.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_firmanindolanguagemodel","id") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Siapa namaku?", "Nama saya Clara dan saya tinggal di Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_firmanindolanguagemodel","id") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Siapa namaku?", "Nama saya Clara dan saya tinggal di Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_firmanindolanguagemodel| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|id| |Size:|413.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/FirmanBr/FirmanIndoLanguageModel --- layout: model title: English BertForQuestionAnswering model (from batterydata) author: John Snow Labs name: bert_qa_bert_base_cased_squad_v1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased-squad-v1` is an English model originally trained by `batterydata`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_cased_squad_v1_en_4.0.0_3.0_1654179817071.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_cased_squad_v1_en_4.0.0_3.0_1654179817071.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_cased_squad_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_cased_squad_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_cased.by_batterydata").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_cased_squad_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/batterydata/bert-base-cased-squad-v1 - https://github.com/ShuHuang/batterybert --- layout: model title: Spanish BertForTokenClassification Cased model (from mrm8488) author: John Snow Labs name: bert_pos_spanish_cased_finetuned_pos_16_tags date: 2022-07-19 tags: [open_source, bert, ner, pos, es] task: Part of Speech Tagging language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BERT POS model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-spanish-cased-finetuned-pos-16-tags` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_spanish_cased_finetuned_pos_16_tags_es_4.0.0_3.0_1658249425886.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_spanish_cased_finetuned_pos_16_tags_es_4.0.0_3.0_1658249425886.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") pos = BertForTokenClassification.pretrained("bert_pos_spanish_cased_finetuned_pos_16_tags","es") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, pos]) data = spark.createDataFrame([["PUT YOUR STRING HERE."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val pos = BertForTokenClassification.pretrained("bert_pos_spanish_cased_finetuned_pos_16_tags","es") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, pos)) val data = Seq("PUT YOUR STRING HERE.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_spanish_cased_finetuned_pos_16_tags| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|es| |Size:|410.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## References https://huggingface.co/mrm8488/bert-spanish-cased-finetuned-pos-16-tags --- layout: model title: RoBERTa Base Ontonotes NER Pipeline author: John Snow Labs name: roberta_base_token_classifier_ontonotes_pipeline date: 2022-04-23 tags: [open_source, ner, token_classifier, roberta, ontonotes, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [roberta_base_token_classifier_ontonotes](https://nlp.johnsnowlabs.com/2021/09/26/roberta_base_token_classifier_ontonotes_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_token_classifier_ontonotes_pipeline_en_3.4.1_3.0_1650716691150.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_token_classifier_ontonotes_pipeline_en_3.4.1_3.0_1650716691150.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("roberta_base_token_classifier_ontonotes_pipeline", lang = "en") pipeline.annotate("My name is John and I have been working at John Snow Labs since November 2020.") ``` ```scala val pipeline = new PretrainedPipeline("roberta_base_token_classifier_ontonotes_pipeline", lang = "en") pipeline.annotate("My name is John and I have been working at John Snow Labs since November 2020.")
```
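The `annotate` call returns a plain Python dict mapping output column names to lists of strings. A minimal sketch of pairing the recognized chunks with their labels; the dict below is a hand-written stand-in for the pipeline's output, and the key names are assumptions:

```python
# Stand-in for the dict returned by pipeline.annotate(...); the real key
# names produced by the pipeline's NerConverter/Finisher stages may differ
# (assumption).
annotations = {
    "ner_chunk": ["John", "John Snow Labs", "November 2020"],
    "label": ["PERSON", "ORG", "DATE"],
}

# Pair each recognized chunk with its predicted label.
pairs = list(zip(annotations["ner_chunk"], annotations["label"]))
for chunk, label in pairs:
    print(f"{chunk:15} {label}")
```

This is the same chunk/label pairing shown in the Results table below.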
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PERSON | |John Snow Labs|ORG | |November 2020 |DATE | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_token_classifier_ontonotes_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|456.5 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - RoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: Lemmatizer (Italian, SpacyLookup) author: John Snow Labs name: lemma_spacylookup date: 2022-03-03 tags: [open_source, lemmatizer, it] task: Lemmatization language: it edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Italian Lemmatizer is a scalable, production-ready version of the Rule-based Lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_it_3.4.1_3.0_1646316509683.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_it_3.4.1_3.0_1646316509683.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","it") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Non sei migliore di me"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","it") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("Non sei migliore di me").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("it.lemma.spacylookup").predict("""Non sei migliore di me""") ```
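Lookup-based lemmatization, the approach behind this model, is a plain table lookup with a fall-back to the surface form. A minimal sketch, using a tiny hand-written table in place of the model's full Italian lookup data:

```python
# Tiny hand-written table standing in for the model's full Italian lookup
# data (assumption); real tables hold tens of thousands of entries.
lemma_table = {"sei": "essere", "migliore": "buono"}

def lemmatize(tokens):
    # Fall back to the surface form when a token is not in the table.
    return [lemma_table.get(token.lower(), token) for token in tokens]

print(lemmatize("Non sei migliore di me".split()))
# ['Non', 'essere', 'buono', 'di', 'me']
```

The fall-back behavior is why unknown tokens such as "Non" pass through unchanged in the Results below.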
## Results ```bash +----------------------------+ |result | +----------------------------+ |[Non, essere, buono, di, me]| +----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma_spacylookup| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|it| |Size:|3.5 MB| --- layout: model title: Legal Health Document Classifier (EURLEX) author: John Snow Labs name: legclf_health_bert date: 2023-03-06 tags: [en, legal, classification, clauses, health, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_health_bert model is a BERT Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the Health class or not (binary classification) according to EuroVoc labels. ## Predicted Entities `Health`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_health_bert_en_1.0.0_3.0_1678111626210.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_health_bert_en_1.0.0_3.0_1678111626210.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_health_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
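The classifier emits per-class scores and the predicted category is the argmax. A minimal sketch of that final step, with made-up score dicts standing in for real model output:

```python
# Final step of a ClassifierDL-style model: pick the class with the highest
# score per document. The score dicts are made-up stand-ins (assumption).
batch_scores = [
    {"Health": 0.87, "Other": 0.13},
    {"Health": 0.08, "Other": 0.92},
]
predictions = [max(scores, key=scores.get) for scores in batch_scores]
print(predictions)  # ['Health', 'Other']
```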
## Results ```bash +--------+ |result | +--------+ |[Health]| |[Other] | |[Other] | |[Health]| +--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_health_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Health 0.92 0.93 0.93 635 Other 0.92 0.91 0.91 536 accuracy - - 0.92 1171 macro-avg 0.92 0.92 0.92 1171 weighted-avg 0.92 0.92 0.92 1171 ``` --- layout: model title: English BertForQuestionAnswering Cased model (from MyMild) author: John Snow Labs name: bert_qa_mymild_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `MyMild`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mymild_finetuned_squad_en_4.0.0_3.0_1657186147542.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mymild_finetuned_squad_en_4.0.0_3.0_1657186147542.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mymild_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mymild_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
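Extractive QA models such as this one predict a start/end span over the context; the answer is the substring that span covers. A minimal sketch of that final step, with hand-picked indices rather than actual model output:

```python
# Extractive QA reduces to span selection: the model scores each position
# as a possible start/end, and the answer is the covered substring. The
# indices below are hand-picked for illustration, not model output.
context = "My name is Clara and I live in Berkeley."
start, end = 11, 16  # span covering "Clara"
answer = context[start:end]
print(answer)  # Clara
```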
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_mymild_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/MyMild/bert-finetuned-squad --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from clevrly) author: John Snow Labs name: roberta_qa_base_finetuned_hotpot date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-hotpot_qa` is an English model originally trained by `clevrly`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_hotpot_en_4.3.0_3.0_1674216656084.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_hotpot_en_4.3.0_3.0_1674216656084.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_hotpot","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_hotpot","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_finetuned_hotpot| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.6 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/clevrly/roberta-base-finetuned-hotpot_qa --- layout: model title: Spanish RoBERTa Embeddings (Base, Stepwise, Using Sequence Length 512) author: John Snow Labs name: roberta_embeddings_bertin_base_stepwise_exp_512seqlen date: 2022-04-14 tags: [roberta, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bertin-base-stepwise-exp-512seqlen` is a Spanish model originally trained by `bertin-project`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_bertin_base_stepwise_exp_512seqlen_es_3.4.2_3.0_1649945978602.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_bertin_base_stepwise_exp_512seqlen_es_3.4.2_3.0_1649945978602.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_bertin_base_stepwise_exp_512seqlen","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_bertin_base_stepwise_exp_512seqlen","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.bertin_base_stepwise_exp_512seqlen").predict("""Me encanta chispa nlp""") ```
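Token embeddings like these can be mean-pooled into a single sentence vector when a sentence-level representation is needed. A minimal pure-Python sketch, with toy 3-dimensional vectors standing in for the model's real token embeddings:

```python
# Mean-pool per-token vectors into one sentence vector. Toy 3-dim vectors
# stand in for the model's real token embeddings (assumption).
token_vecs = [
    [0.1, 0.2, 0.3],
    [0.3, 0.0, 0.1],
    [0.2, 0.4, 0.2],
]
n = len(token_vecs)
# zip(*token_vecs) iterates dimension-wise across all tokens.
sentence_vec = [round(sum(dim) / n, 4) for dim in zip(*token_vecs)]
print(sentence_vec)  # [0.2, 0.2, 0.2]
```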
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_bertin_base_stepwise_exp_512seqlen| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|234.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/bertin-project/bertin-base-stepwise-exp-512seqlen --- layout: model title: Smaller BERT Sentence Embeddings (L-2_H-128_A-2) author: John Snow Labs name: sent_small_bert_L2_128 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L2_128_en_2.6.0_2.4_1598350305687.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L2_128_en_2.6.0_2.4_1598350305687.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_128", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_128", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L2_128').predict(text, output_level='sentence') embeddings_df ```
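Sentence vectors like these are typically compared with cosine similarity. A minimal pure-Python sketch, with toy 4-dimensional vectors standing in for the model's 128-dimensional output:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dim vectors standing in for real 128-dim sentence embeddings
# (assumption).
v1 = [0.1, -0.4, 0.2, 0.3]
v2 = [0.2, -0.3, 0.1, 0.4]
print(round(cosine(v1, v2), 3))  # 0.933
```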
{:.h2_title} ## Results ```bash sentence en_embed_sentence_small_bert_L2_128_embeddings I hate cancer [-1.2620444297790527, -0.40405017137527466, -1... Antibiotics aren't painkiller [-0.9117010831832886, 0.26326191425323486, -0.... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L2_128| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|en| |Dimension:|128| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from eldadshulman) author: John Snow Labs name: distilbert_qa_eldadshulman_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `eldadshulman`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_eldadshulman_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770650405.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_eldadshulman_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770650405.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_eldadshulman_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_eldadshulman_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_eldadshulman_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/eldadshulman/distilbert-base-uncased-finetuned-squad --- layout: model title: Extract aspects and entities from airline questions (ATIS dataset) author: John Snow Labs name: nerdl_atis_840b_300d task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 2.7.1 spark_version: 2.4 date: 2021-01-25 tags: [en, ner, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract important aspects from user questions, such as 'arrive_date' (arrival time of a flight) and 'fare_amount' (fare information), plus 77 other types of aspects and entities, to automate the process of question answering.
## Predicted Entities `aircraft_code`, `airline_code`, `airline_name`, `airport_code`, `airport_name`, `arrive_date.date_relative`, `arrive_date.day_name`, `arrive_date.day_number`, `arrive_date.month_name`, `arrive_date.today_relative`, `arrive_time.end_time`, `arrive_time.period_mod`, `arrive_time.period_of_day`, `arrive_time.start_time`, `arrive_time.time`, `arrive_time.time_relative`, `city_name`, `class_type`, `connect`, `cost_relative`, `day_name`, `day_number`, `days_code`, `depart_date.date_relative`, `depart_date.day_name`, `depart_date.day_number`, `depart_date.month_name`, `depart_date.today_relative`, `depart_date.year`, `depart_time.end_time`, `depart_time.period_mod`, `depart_time.period_of_day`, `depart_time.start_time`, `depart_time.time`, `depart_time.time_relative`, `economy`, `fare_amount`, `fare_basis_code`, `flight_days`, `flight_mod`, `flight_number`, `flight_stop`, `flight_time`, `fromloc.airport_code`, `fromloc.airport_name`, `fromloc.city_name`, `fromloc.state_code`, `fromloc.state_name`, `meal`, `meal_code`, `meal_description`, `mod`, `month_name`, `or`, `period_of_day`, `restriction_code`, `return_date.date_relative`, `return_date.day_name`, `return_date.day_number`, `return_date.month_name`, `return_date.today_relative`, `return_time.period_mod`, `return_time.period_of_day`, `round_trip`, `state_code`, `state_name`, `stoploc.airport_name`, `stoploc.city_name`, `stoploc.state_code`, `time`, `time_relative`, `today_relative`, `toloc.airport_code`, `toloc.airport_name`, `toloc.city_name`, `toloc.country_name`, `toloc.state_code`, `toloc.state_name`, `transport_type`. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/nerdl_atis_840b_300d_en_2.7.1_2.4_1611573523640.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/nerdl_atis_840b_300d_en_2.7.1_2.4_1611573523640.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", "xx")\ .setInputCols("document", "token") \ .setOutputCol("embeddings") ner = NerDLModel.pretrained("nerdl_atis_840b_300d", "en") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings, ner, ner_converter]) example = spark.createDataFrame([['i want to fly from baltimore to dallas round trip']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", "xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("nerdl_atis_840b_300d", "en") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner, ner_converter)) val data = Seq("i want to fly from baltimore to dallas round trip").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.ner.airline").predict("""i want to fly from baltimore to dallas round trip""") ```
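The NerConverter stage merges the model's per-token B-/I-/O tags into (chunk, label) pairs. A minimal sketch of that merge, with hand-written tags standing in for actual model output:

```python
# Merge per-token B-/I-/O tags into (chunk, label) pairs, as NerConverter
# does. The tags below are hand-written stand-ins for model output
# (assumption).
tokens = "i want to fly from baltimore to dallas round trip".split()
tags = ["O", "O", "O", "O", "O",
        "B-fromloc.city_name", "O", "B-toloc.city_name",
        "B-round_trip", "I-round_trip"]

chunks = []
for token, tag in zip(tokens, tags):
    if tag.startswith("B-"):
        # A B- tag opens a new chunk labeled with the tag's suffix.
        chunks.append((token, tag[2:]))
    elif tag.startswith("I-") and chunks:
        # An I- tag extends the most recent chunk.
        text, label = chunks[-1]
        chunks[-1] = (text + " " + token, label)

print(chunks)
```

The resulting pairs match the chunk/label columns shown in the Results table below.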
## Results ```bash +-------------------+---------------------+ | ner_chunk | label | +-------------------+---------------------+ | baltimore | fromloc.city_name | | dallas | toloc.city_name | | round trip | round_trip | +-------------------+---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|nerdl_atis_840b_300d| |Type:|ner| |Compatibility:|Spark NLP 2.7.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Dependencies:|glove_840B_300| ## Data Source This model is trained on the ATIS dataset https://www.kaggle.com/hassanamin/atis-airlinetravelinformationsystem ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |----:|-----------------------------:|------:|-----:|-----:|---------:|---------:|---------:| | 0 | B-airline_name | 101 | 5 | 0 | 0.95283 | 1 | 0.975845 | | 1 | B-booking_class | 0 | 0 | 1 | 0 | 0 | 0 | | 2 | B-toloc.state_code | 18 | 0 | 0 | 1 | 1 | 1 | | 3 | B-arrive_time.period_of_day | 6 | 2 | 0 | 0.75 | 1 | 0.857143 | | 4 | B-flight_number | 11 | 5 | 0 | 0.6875 | 1 | 0.814815 | | 5 | B-depart_date.year | 3 | 0 | 0 | 1 | 1 | 1 | | 6 | I-depart_time.time_relative | 0 | 0 | 1 | 0 | 0 | 0 | | 7 | I-depart_time.time | 52 | 1 | 0 | 0.981132 | 1 | 0.990476 | | 8 | B-transport_type | 10 | 0 | 0 | 1 | 1 | 1 | | 9 | I-depart_date.day_number | 14 | 1 | 1 | 0.933333 | 0.933333 | 0.933333 | | 10 | I-city_name | 12 | 1 | 18 | 0.923077 | 0.4 | 0.55814 | | 11 | B-depart_time.end_time | 2 | 0 | 1 | 1 | 0.666667 | 0.8 | | 12 | B-depart_date.date_relative | 17 | 2 | 0 | 0.894737 | 1 | 0.944445 | | 13 | B-toloc.country_name | 1 | 0 | 0 | 1 | 1 | 1 | | 14 | I-depart_time.start_time | 1 | 0 | 0 | 1 | 1 | 1 | | 15 | B-flight_stop | 21 | 0 | 0 | 1 | 1 | 1 | | 16 | B-airline_code | 27 | 1 | 7 | 0.964286 | 0.794118 | 0.870968 | | 17 | B-stoploc.city_name | 20 | 1 | 0 | 0.952381 | 1 | 0.97561 | | 18 | I-fromloc.airport_name | 15 | 19 
| 0 | 0.441176 | 1 | 0.612245 | | 19 | B-round_trip | 70 | 0 | 3 | 1 | 0.958904 | 0.979021 | | 20 | B-day_name | 1 | 0 | 1 | 1 | 0.5 | 0.666667 | | 21 | B-depart_time.time_relative | 63 | 2 | 2 | 0.969231 | 0.969231 | 0.969231 | | 22 | B-fromloc.airport_name | 10 | 12 | 2 | 0.454545 | 0.833333 | 0.588235 | | 23 | B-meal_code | 0 | 1 | 1 | 0 | 0 | 0 | | 24 | B-fromloc.airport_code | 5 | 6 | 0 | 0.454545 | 1 | 0.625 | | 25 | B-arrive_time.start_time | 8 | 1 | 0 | 0.888889 | 1 | 0.941177 | | 26 | I-state_name | 0 | 0 | 1 | 0 | 0 | 0 | | 27 | I-arrive_time.time | 34 | 0 | 1 | 1 | 0.971429 | 0.985507 | | 28 | B-airport_code | 3 | 1 | 6 | 0.75 | 0.333333 | 0.461538 | | 29 | B-depart_time.start_time | 2 | 0 | 1 | 1 | 0.666667 | 0.8 | | 30 | B-arrive_time.time_relative | 27 | 3 | 4 | 0.9 | 0.870968 | 0.885246 | | 31 | I-cost_relative | 2 | 0 | 1 | 1 | 0.666667 | 0.8 | | 32 | B-flight | 0 | 0 | 1 | 0 | 0 | 0 | | 33 | B-toloc.state_name | 25 | 3 | 3 | 0.892857 | 0.892857 | 0.892857 | | 34 | I-arrive_time.start_time | 1 | 0 | 0 | 1 | 1 | 1 | | 35 | B-compartment | 0 | 0 | 1 | 0 | 0 | 0 | | 36 | B-toloc.airport_code | 4 | 0 | 0 | 1 | 1 | 1 | | 37 | B-connect | 6 | 0 | 0 | 1 | 1 | 1 | | 38 | I-return_date.date_relative | 2 | 0 | 1 | 1 | 0.666667 | 0.8 | | 39 | I-depart_time.end_time | 2 | 0 | 1 | 1 | 0.666667 | 0.8 | | 40 | B-meal | 16 | 1 | 0 | 0.941176 | 1 | 0.969697 | | 41 | B-arrive_date.month_name | 5 | 2 | 1 | 0.714286 | 0.833333 | 0.769231 | | 42 | B-arrive_date.day_number | 5 | 2 | 1 | 0.714286 | 0.833333 | 0.769231 | | 43 | I-toloc.state_name | 1 | 0 | 0 | 1 | 1 | 1 | | 44 | B-arrive_date.date_relative | 2 | 0 | 0 | 1 | 1 | 1 | | 45 | I-fromloc.state_name | 1 | 0 | 0 | 1 | 1 | 1 | | 46 | B-depart_time.period_of_day | 121 | 1 | 9 | 0.991803 | 0.930769 | 0.960317 | | 47 | I-toloc.airport_name | 3 | 2 | 0 | 0.6 | 1 | 0.75 | | 48 | B-class_type | 24 | 1 | 0 | 0.96 | 1 | 0.979592 | | 49 | B-period_of_day | 3 | 0 | 1 | 1 | 0.75 | 0.857143 | | 50 | I-flight_number | 0 | 0 | 1 
| 0 | 0 | 0 | | 51 | B-stoploc.airport_code | 0 | 0 | 1 | 0 | 0 | 0 | | 52 | B-flight_mod | 23 | 0 | 1 | 1 | 0.958333 | 0.978723 | | 53 | I-airport_name | 13 | 2 | 16 | 0.866667 | 0.448276 | 0.590909 | | 54 | B-arrive_time.time | 33 | 2 | 1 | 0.942857 | 0.970588 | 0.956522 | | 55 | B-arrive_date.day_name | 11 | 3 | 0 | 0.785714 | 1 | 0.88 | | 56 | I-restriction_code | 3 | 0 | 0 | 1 | 1 | 1 | | 57 | I-flight_time | 1 | 0 | 0 | 1 | 1 | 1 | | 58 | B-meal_description | 10 | 0 | 0 | 1 | 1 | 1 | | 59 | I-flight_mod | 1 | 0 | 5 | 1 | 0.166667 | 0.285714 | | 60 | B-flight_days | 10 | 0 | 0 | 1 | 1 | 1 | | 61 | I-stoploc.city_name | 10 | 2 | 0 | 0.833333 | 1 | 0.909091 | | 62 | B-economy | 6 | 0 | 0 | 1 | 1 | 1 | | 63 | I-arrive_date.day_number | 0 | 1 | 0 | 0 | 0 | 0 | | 64 | B-toloc.airport_name | 3 | 1 | 0 | 0.75 | 1 | 0.857143 | | 65 | I-class_type | 17 | 0 | 0 | 1 | 1 | 1 | | 66 | B-state_code | 1 | 0 | 0 | 1 | 1 | 1 | | 67 | B-aircraft_code | 27 | 0 | 6 | 1 | 0.818182 | 0.9 | | 68 | B-days_code | 0 | 0 | 1 | 0 | 0 | 0 | | 69 | B-or | 3 | 3 | 0 | 0.5 | 1 | 0.666667 | | 70 | B-return_date.date_relative | 1 | 1 | 2 | 0.5 | 0.333333 | 0.4 | | 71 | B-fromloc.state_name | 17 | 0 | 0 | 1 | 1 | 1 | | 72 | B-depart_date.month_name | 54 | 1 | 2 | 0.981818 | 0.964286 | 0.972973 | | 73 | B-fromloc.city_name | 703 | 6 | 1 | 0.991537 | 0.99858 | 0.995046 | | 74 | B-restriction_code | 4 | 2 | 0 | 0.666667 | 1 | 0.8 | | 75 | B-state_name | 0 | 0 | 9 | 0 | 0 | 0 | | 76 | B-city_name | 32 | 4 | 25 | 0.888889 | 0.561404 | 0.688172 | | 77 | I-round_trip | 70 | 0 | 1 | 1 | 0.985915 | 0.992908 | | 78 | I-transport_type | 0 | 0 | 1 | 0 | 0 | 0 | | 79 | B-depart_date.day_name | 210 | 2 | 2 | 0.990566 | 0.990566 | 0.990566 | | 80 | B-fare_basis_code | 17 | 4 | 0 | 0.809524 | 1 | 0.894737 | | 81 | I-arrive_time.end_time | 8 | 1 | 0 | 0.888889 | 1 | 0.941177 | | 82 | B-depart_date.day_number | 53 | 1 | 2 | 0.981482 | 0.963636 | 0.972477 | | 83 | B-return_date.day_name | 0 | 0 | 2 | 0 | 0 | 0 | 
| 84 | B-toloc.city_name | 713 | 21 | 3 | 0.97139 | 0.99581 | 0.983448 | | 85 | I-fare_amount | 2 | 0 | 0 | 1 | 1 | 1 | | 86 | B-fromloc.state_code | 23 | 0 | 0 | 1 | 1 | 1 | | 87 | B-cost_relative | 36 | 0 | 1 | 1 | 0.972973 | 0.986301 | | 88 | I-arrive_time.time_relative | 0 | 0 | 4 | 0 | 0 | 0 | | 89 | B-mod | 0 | 0 | 2 | 0 | 0 | 0 | | 90 | B-flight_time | 1 | 0 | 0 | 1 | 1 | 1 | | 91 | B-arrive_time.end_time | 8 | 1 | 0 | 0.888889 | 1 | 0.941177 | | 92 | B-fare_amount | 2 | 0 | 0 | 1 | 1 | 1 | | 93 | I-toloc.city_name | 263 | 12 | 2 | 0.956364 | 0.992453 | 0.974074 | | 94 | I-fromloc.city_name | 176 | 1 | 1 | 0.99435 | 0.99435 | 0.99435 | | 95 | B-airport_name | 10 | 2 | 11 | 0.833333 | 0.47619 | 0.606061 | | 96 | I-airline_name | 65 | 0 | 0 | 1 | 1 | 1 | | 97 | I-depart_time.period_of_day | 0 | 0 | 1 | 0 | 0 | 0 | | 98 | B-depart_date.today_relative | 8 | 0 | 1 | 1 | 0.888889 | 0.941177 | | 99 | B-depart_time.time | 57 | 8 | 0 | 0.876923 | 1 | 0.934426 | | 100 | B-depart_time.period_mod | 5 | 0 | 0 | 1 | 1 | 1 | | 101 | Macro-average | 3487 | 157 | 176 | 0.768428 | 0.758601 | 0.763483 | | 102 | Micro-average | 3487 | 157 | 176 | 0.956916 | 0.951952 | 0.954427 | ``` --- layout: model title: Legal NER for MAPA(Multilingual Anonymisation for Public Administrations) author: John Snow Labs name: legner_mapa date: 2023-04-27 tags: [nl, ner, licensed, legal, mapa] task: Named Entity Recognition language: nl edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union. This model extracts `ADDRESS`, `DATE`, `ORGANISATION`, and `PERSON` entities from `Dutch` documents. 
## Predicted Entities `ADDRESS`, `DATE`, `ORGANISATION`, `PERSON` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_mapa_nl_1.0.0_3.0_1682600676432.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_mapa_nl_1.0.0_3.0_1682600676432.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_nl_cased", "nl")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_mapa", "nl", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""Liberato en Grigorescu zijn op 22 oktober 2005 in Rome ( Italië ) in het huwelijk getreden en hebben tot de geboorte van hun kind op 20 februari 2006 in die lidstaat samengewoond."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
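The `ner_converter` stage above merges the token-level IOB tags produced by the NER model (`B-PERSON`, `I-PERSON`, `O`, …) into entity chunks. As a rough, framework-free illustration of that merging logic (a sketch of the idea, not Spark NLP's actual implementation):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # a new entity begins; flush any open chunk first
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # continuation of the open entity
            current.append(tok)
        else:
            # O tag (or inconsistent I- tag): close the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Liberato", "en", "Grigorescu", "zijn", "op", "22", "oktober", "2005"]
tags   = ["B-PERSON", "O", "B-PERSON", "O", "O", "B-DATE", "I-DATE", "I-DATE"]
print(iob_to_chunks(tokens, tags))
# → [('Liberato', 'PERSON'), ('Grigorescu', 'PERSON'), ('22 oktober 2005', 'DATE')]
```

This mirrors how the example sentence yields the separate `PERSON` chunks and the multi-token `DATE` chunk shown in the Results section.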
## Results ```bash +----------------+---------+ |chunk |ner_label| +----------------+---------+ |Liberato |PERSON | |Grigorescu |PERSON | |22 oktober 2005 |DATE | |Rome ( Italië ) |ADDRESS | |20 februari 2006|DATE | +----------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_mapa| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|nl| |Size:|1.4 MB| ## References The dataset is available [here](https://huggingface.co/datasets/joelito/mapa). ## Benchmarking ```bash label precision recall f1-score support ADDRESS 0.87 0.81 0.84 16 DATE 0.98 0.98 0.98 54 ORGANISATION 0.83 0.97 0.90 31 PERSON 0.90 0.92 0.91 39 micro-avg 0.91 0.94 0.93 140 macro-avg 0.90 0.92 0.91 140 weighted-avg 0.91 0.94 0.93 140 ``` --- layout: model title: Detect Assertion Status (assertion_dl_biobert_scope_L10R10) author: John Snow Labs name: assertion_dl_biobert_scope_L10R10 date: 2022-03-23 tags: [licensed, clinical, en, assertion, biobert] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 3.4.2 spark_version: 3.0 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained using `biobert_pubmed_base_cased` BERT token embeddings. It considers 10 tokens on the left and 10 tokens on the right side of the clinical entities extracted by NER models and assigns their assertion status based on their context in this scope.
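The L10R10 scope described above amounts to a simple token window around each entity. A plain-Python schematic of that windowing (illustrative only; the helper name and indices are assumptions, not the model's internals):

```python
def scope_window(tokens, ent_start, ent_end, left=10, right=10):
    """Return the tokens the assertion model sees: up to `left` tokens
    before the entity, the entity itself, and up to `right` tokens after."""
    lo = max(0, ent_start - left)
    hi = min(len(tokens), ent_end + 1 + right)
    return tokens[lo:hi]

tokens = "He shows no stomach pain and he maintained on an epidural".split()
# entity "stomach pain" spans token indices 3..4
print(scope_window(tokens, 3, 4, left=10, right=10))
```

With L10R10 the window around "stomach pain" covers the whole short sentence, including the negation cue "no" that drives the `absent` label.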
## Predicted Entities `present`, `absent`, `possible`, `conditional`, `associated_with_someone_else`, `hypothetical` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_dl_biobert_scope_L10R10_en_3.4.2_3.0_1648032139325.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_dl_biobert_scope_L10R10_en_3.4.2_3.0_1648032139325.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") token = Tokenizer()\ .setInputCols(['sentence'])\ .setOutputCol('token') embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical_biobert", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") clinical_assertion = AssertionDLModel.pretrained("assertion_dl_biobert_scope_L10R10","en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") nlpPipeline = Pipeline(stages=[document, sentenceDetector, token, embeddings, clinical_ner, ner_converter, clinical_assertion]) text = "Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer." 
data = spark.createDataFrame([[text]]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") val clinical_assertion = AssertionDLModel.pretrained("assertion_dl_biobert_scope_L10R10","en", "clinical/models") .setInputCols(Array("sentence", "ner_chunk", "embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter, clinical_assertion)) val data = Seq("Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.biobert_l10210").predict("""Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA for pain control. He also became short of breath with climbing a flight of stairs. After CT, lung tumor located at the right lower lobe. Father with Alzheimer.""") ```
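The benchmark for this model reports per-label tp/fp/fn counts alongside precision, recall, and F1; the three scores follow directly from the counts. A quick sanity check in plain Python, using the counts from the `absent` row:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 'absent' row of the benchmark: tp=839, fp=89, fn=44
p, r, f = prf(839, 89, 44)
print(round(p, 7), round(r, 7), round(f, 7))
# → 0.9040948 0.9501699 0.9265599
```

The macro-average is the unweighted mean of these per-label scores, while the micro-average pools the tp/fp/fn totals before computing them, which is why the two rows differ.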
## Results ```bash +---------------+---------+----------------------------+ |chunk |ner_label|assertion | +---------------+---------+----------------------------+ |severe fever |PROBLEM |present | |sore throat |PROBLEM |present | |stomach pain |PROBLEM |absent | |an epidural |TREATMENT|present | |PCA |TREATMENT|present | |pain control |TREATMENT|present | |short of breath|PROBLEM |conditional | |CT |TEST |present | |lung tumor |PROBLEM |present | |Alzheimer |PROBLEM |associated_with_someone_else| +---------------+---------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_dl_biobert_scope_L10R10| |Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| |Size:|3.2 MB| ## References Trained on 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text with `biobert_pubmed_base_cased`. https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ ## Benchmarking ```bash label tp fp fn prec rec f1 absent 839 89 44 0.9040948 0.9501699 0.9265599 present 2436 127 168 0.9504487 0.9354839 0.9429069 conditional 29 21 24 0.58 0.5471698 0.5631067 associated_with_someone_else 39 11 6 0.78 0.8666670 0.8210527 hypothetical 164 44 11 0.7884616 0.9371429 0.8563969 possible 126 36 75 0.7777778 0.6268657 0.6942149 Macro-average 3633 328 328 0.7967971 0.8105832 0.8036310 Micro-average 3633 328 328 0.9171926 0.9171926 0.9171926 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from PremalMatalia) author: John Snow Labs name: roberta_qa_base_best_squad2 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: 
"Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-best-squad2` is an English model originally trained by `PremalMatalia`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_best_squad2_en_4.3.0_3.0_1674212826277.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_best_squad2_en_4.3.0_3.0_1674212826277.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_best_squad2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_best_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
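Span-extraction QA models of this kind score each context token as a candidate answer start and end, then return the highest-scoring valid span. A toy illustration of that selection step (the scores below are invented for the example, not produced by the model):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick (start, end) maximizing start_scores[s] + end_scores[e], with s <= e
    and a bounded span length, as span-based QA heads typically do."""
    best, best_score = (0, 0), float("-inf")
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            if ss + end_scores[e] > best_score:
                best, best_score = (s, e), ss + end_scores[e]
    return best

context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.1, 0.0, 0.0, 0.0, 1.0, 0.0]  # toy start scores
end   = [0.0, 0.1, 0.0, 4.8, 0.2, 0.0, 0.0, 0.0, 1.1, 0.0]  # toy end scores
s, e = best_span(start, end)
print(" ".join(context[s:e + 1]))
# → Clara
```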
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_best_squad2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/PremalMatalia/roberta-base-best-squad2 --- layout: model title: Legal NER for MAPA (Multilingual Anonymisation for Public Administrations) author: John Snow Labs name: legner_mapa date: 2023-04-27 tags: [fr, ner, licensed, legal, mapa] task: Named Entity Recognition language: fr edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union. This model extracts `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, and `PERSON` entities from `French` documents. ## Predicted Entities `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, `PERSON` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_mapa_fr_1.0.0_3.0_1682596162755.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_mapa_fr_1.0.0_3.0_1682596162755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_fr_cased", "fr")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_mapa", "fr", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""Heeren, administrateur, vu la phase écrite de la procédure et à la suite de l’audience du 28 novembre 2017, rend le présent Arrêt Antécédents du litige 1 La requérante, Foshan Lihua Ceramic Co."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
## Results ```bash +-----------------------+------------+ |chunk |ner_label | +-----------------------+------------+ |Heeren |PERSON | |28 novembre 2017 |DATE | |Foshan Lihua Ceramic Co|ORGANISATION| +-----------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_mapa| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|fr| |Size:|1.4 MB| ## References The dataset is available [here](https://huggingface.co/datasets/joelito/mapa). ## Benchmarking ```bash label precision recall f1-score support ADDRESS 1.00 1.00 1.00 11 AMOUNT 1.00 1.00 1.00 4 DATE 1.00 0.96 0.98 28 ORGANISATION 1.00 0.95 0.98 22 PERSON 0.94 0.94 0.94 31 micro-avg 0.98 0.96 0.97 96 macro-avg 0.99 0.97 0.98 96 weighted-avg 0.98 0.96 0.97 96 ``` --- layout: model title: English RobertaForQuestionAnswering (from huxxx657) author: John Snow Labs name: roberta_qa_roberta_base_finetuned_squad_1 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad-1` is an English model originally trained by `huxxx657`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_squad_1_en_4.0.0_3.0_1655734407785.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_squad_1_en_4.0.0_3.0_1655734407785.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_finetuned_squad_1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_finetuned_squad_1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_v1.by_huxxx657").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_finetuned_squad_1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|463.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-squad-1 --- layout: model title: Legal Vacation Clause Binary Classifier author: John Snow Labs name: legclf_vacation_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `vacation` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
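The first technique listed above, paragraph splitting by multiline, can be approximated with a few lines of plain Python before feeding each piece to the classifier (a minimal sketch, not the workshop notebook's exact code):

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines and drop empty fragments,
    so each paragraph can be classified on its own."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("VACATION.\nEmployee accrues 20 days of paid vacation per year.\n\n"
       "GOVERNING LAW.\nThis Agreement is governed by the laws of Delaware.")
for clause in split_paragraphs(doc):
    print(clause.splitlines()[0])
```

Each resulting fragment stays well under the 512-token embedding limit and gives the classifier a single clause's worth of context.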
This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `vacation` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_vacation_clause_en_1.0.0_3.2_1660124104524.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_vacation_clause_en_1.0.0_3.2_1660124104524.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_vacation_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[vacation]| |[other]| |[other]| |[vacation]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_vacation_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 0.99 0.99 84 vacation 0.98 1.00 0.99 51 accuracy - - 0.99 135 macro-avg 0.99 0.99 0.99 135 weighted-avg 0.99 0.99 0.99 135 ``` --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_10 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-256-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_10_en_4.0.0_3.0_1655732066077.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_10_en_4.0.0_3.0_1655732066077.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_10","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_256d_seed_10").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|427.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-256-finetuned-squad-seed-10 --- layout: model title: Detect Clinical Entities (Slim version, BertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_jsl_slim date: 2021-09-24 tags: [ner, bert, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.2.0 spark_version: 2.4 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a pretrained named entity recognition deep learning model for clinical terminology. It is based on the `bert_token_classifier_ner_jsl` model, but with more generalized entities. This model is trained with BertForTokenClassification method from the `transformers` library and imported into Spark NLP. Definitions of Predicted Entities: - `Death_Entity`: Mentions that indicate the death of a patient. - `Medical_Device`: All mentions related to medical devices and supplies. - `Vital_Signs_Header`: Identifies section headers that correspond to Vital Signs of a patient. - `Allergen`: Allergen related extractions mentioned in the document. - `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients. - `Clinical_Dept`: Terms that indicate the medical and/or surgical departments. - `Symptom`: All the symptoms mentioned in the document, of a patient or someone else. 
- `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by naked eye. - `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient. - `Age`: All mention of ages, past or present, related to the patient or with anybody else. - `Birth_Entity`: Mentions that indicate giving birth. - `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else. - `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs). - `Test_Result`: Terms related to all the test results present in the document (clinical tests results are included). - `Test`: Mentions of laboratory, pathology, and radiological tests. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments. - `Treatment`: Includes therapeutic and minimally invasive treatment and procedures (invasive treatments or procedures are extracted as "Procedure"). - `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.). 
## Predicted Entities `Death_Entity`, `Medical_Device`, `Vital_Sign`, `Alergen`, `Drug`, `Clinical_Dept`, `Lifestyle`, `Symptom`, `Body_Part`, `Physical_Measurement`, `Admission_Discharge`, `Date_Time`, `Age`, `Birth_Entity`, `Header`, `Oncological`, `Substance_Quantity`, `Test_Result`, `Test`, `Procedure`, `Treatment`, `Disease_Syndrome_Disorder`, `Pregnancy_Newborn`, `Demographics` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_BERT_TOKEN_CLASSIFIER/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_slim_en_3.2.0_2.4_1632473007308.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_slim_en_3.2.0_2.4_1632473007308.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_jsl_slim", "en", "clinical/models")\ .setInputCols("token", "sentence")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) sample_text = """HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer.""" result = model.transform(spark.createDataFrame([[sample_text]]).toDF("text")) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_jsl_slim", "en", "clinical/models") .setInputCols(Array("token", "sentence")) .setOutputCol("ner") .setCaseSensitive(true) val
ner_converter = new NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val sample_text = Seq("""HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer.""").toDS.toDF("text") val result = pipeline.fit(sample_text).transform(sample_text) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_jsl_slim").predict("""HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer.""") ```
## Results ```bash +----------------+------------+ |chunk |ner_label | +----------------+------------+ |HISTORY: |Header | |30-year-old |Age | |female |Demographics| |mammography |Test | |soft tissue lump|Symptom | |shoulder |Body_Part | |breast cancer |Oncological | |her mother |Demographics| |age 58 |Age | |breast cancer |Oncological | +----------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_jsl_slim| |Compatibility:|Healthcare NLP 3.2.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Case sensitive:|true| |Max sentence length:|256| ## Data Source Trained on data annotated by JSL. ## Benchmarking ```bash label precision recall f1-score support B-Admission_Discharge 0.82 0.99 0.90 282 B-Age 0.88 0.83 0.85 576 B-Body_Part 0.84 0.91 0.87 8582 B-Clinical_Dept 0.86 0.94 0.90 909 B-Date_Time 0.82 0.77 0.79 1062 B-Death_Entity 0.66 0.98 0.79 43 B-Demographics 0.97 0.98 0.98 5285 B-Disease_Syndrome_Disorder 0.84 0.89 0.86 4259 B-Drug 0.88 0.87 0.87 2555 B-Header 0.97 0.66 0.78 3911 B-Lifestyle 0.77 0.83 0.80 371 B-Medical_Device 0.84 0.87 0.85 3605 B-Oncological 0.86 0.91 0.89 408 B-Physical_Measurement 0.84 0.81 0.82 135 B-Pregnancy_Newborn 0.66 0.71 0.68 245 B-Procedure 0.82 0.88 0.85 2654 B-Symptom 0.83 0.86 0.85 6545 B-Test 0.82 0.83 0.83 2448 B-Test_Result 0.76 0.81 0.78 1280 B-Treatment 0.70 0.76 0.73 275 B-Vital_Sign 0.85 0.87 0.86 627 I-Age 0.84 0.90 0.87 166 I-Alergen 0.00 0.00 0.00 5 I-Body_Part 0.86 0.89 0.88 4946 I-Clinical_Dept 0.92 0.93 0.93 806 I-Date_Time 0.82 0.91 0.86 1173 I-Demographics 0.89 0.84 0.86 416 I-Disease_Syndrome_Disorder 0.87 0.85 0.86 4385 I-Drug 0.83 0.86 0.85 5199 I-Header 0.85 0.97 0.90 6763 I-Lifestyle 0.77 0.69 0.73 134 I-Medical_Device 0.86 0.86 0.86 2341 I-Oncological 0.85 0.94 0.89 515 I-Physical_Measurement 0.88 0.94 0.91 329 I-Pregnancy_Newborn 0.66 0.70 0.68 273 I-Procedure 0.87 0.86
0.87 3414 I-Symptom 0.79 0.75 0.77 6485 I-Test 0.82 0.77 0.79 2283 I-Test_Result 0.67 0.56 0.61 649 I-Treatment 0.69 0.72 0.70 194 I-Vital_Sign 0.88 0.90 0.89 918 O 0.97 0.97 0.97 210520 accuracy - - 0.94 297997 macro-avg 0.74 0.74 0.73 297997 weighted-avg 0.94 0.94 0.94 297997 ``` --- layout: model title: Legal International Security Document Classifier (EURLEX) author: John Snow Labs name: legclf_international_security_bert date: 2023-03-06 tags: [en, legal, classification, clauses, international_security, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from EuroVoc, a multilingual thesaurus maintained by that office. Given a document, the `legclf_international_security_bert` model, a Bert Sentence Embeddings Document Classifier, predicts whether the document belongs to the class International_Security or not (Binary Classification) according to EuroVoc labels. ## Predicted Entities `International_Security`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_international_security_bert_en_1.0.0_3.0_1678111613932.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_international_security_bert_en_1.0.0_3.0_1678111613932.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_international_security_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
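The card's language switcher also lists Scala, but only a Python snippet is shown. Below is a minimal Scala sketch that mirrors the Python pipeline above; the annotator class names and the `"legal/models"` repository argument are carried over from the Python block (the licensed `legal.` Python wrapper corresponds to the licensed Scala classes), so verify them against your Spark NLP for Legal distribution before use.

```scala
// Sketch: Scala equivalent of the Python pipeline above. Assumes a running
// SparkSession with the Spark NLP for Legal jars and a valid license.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

// Stands in for the licensed legal.ClassifierDLModel wrapper used in Python.
val docClassifier = ClassifierDLModel.pretrained("legclf_international_security_bert", "en", "legal/models")
  .setInputCols("sentence_embeddings")
  .setOutputCol("category")

val pipeline = new Pipeline().setStages(Array(documentAssembler, embeddings, docClassifier))

val df = Seq("YOUR TEXT HERE").toDF("text")
val result = pipeline.fit(df).transform(df)
```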
## Results ```bash +------------------------+ |result | +------------------------+ |[International_Security]| |[Other] | |[Other] | |[International_Security]| +------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_international_security_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support International_Security 0.95 0.93 0.94 60 Other 0.94 0.95 0.95 65 accuracy - - 0.94 125 macro-avg 0.94 0.94 0.94 125 weighted-avg 0.94 0.94 0.94 125 ``` --- layout: model title: English BertForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_4 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-16-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_4_en_4.0.0_3.0_1657192113814.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_4_en_4.0.0_3.0_1657192113814.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|376.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-16-finetuned-squad-seed-4 --- layout: model title: Italian T5ForConditionalGeneration Small Cased model (from it5) author: John Snow Labs name: t5_it5_efficient_small_el32_wiki_summarization date: 2023-01-30 tags: [it, open_source, t5, tensorflow] task: Text Generation language: it edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `it5-efficient-small-el32-wiki-summarization` is an Italian model originally trained by `it5`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_wiki_summarization_it_4.3.0_3.0_1675103710830.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_wiki_summarization_it_4.3.0_3.0_1675103710830.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_wiki_summarization","it") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_wiki_summarization","it") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_it5_efficient_small_el32_wiki_summarization| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|it| |Size:|594.0 MB| ## References - https://huggingface.co/it5/it5-efficient-small-el32-wiki-summarization - https://github.com/stefan-it - https://www.semanticscholar.org/paper/WITS%3A-Wikipedia-for-Italian-Text-Summarization-Casola-Lavelli/ad6c83122e721c7c0db4a40727dac3b4762cd2b1 - https://arxiv.org/abs/2203.03759 - https://gsarti.com - https://malvinanissim.github.io - https://arxiv.org/abs/2109.10686 - https://github.com/gsarti/it5 - https://paperswithcode.com/sota?task=Wikipedia+Summarization&dataset=WITS --- layout: model title: Understanding Perpetuity in "Return of Confidential Information" Clauses (Bert) author: John Snow Labs name: legclf_nda_perpetuity_bert date: 2023-05-17 tags: [en, legal, licensed, bert, nda, classification, perpetuity, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Given a clause classified as `RETURN_OF_CONF_INFO` using the `legmulticlf_mnda_sections_paragraph_other` classifier, you can further classify its sentences as `PERPETUITY` or `OTHER` using the `legclf_nda_perpetuity_bert` model. It has been trained with a state-of-the-art (SOTA) approach. ## Predicted Entities `PERPETUITY`, `OTHER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_nda_perpetuity_bert_en_1.0.0_3.0_1684353033843.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_nda_perpetuity_bert_en_1.0.0_3.0_1684353033843.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pandas as pd document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") sequence_classifier = legal.BertForSequenceClassification.pretrained("legclf_nda_perpetuity_bert", "en", "legal/models")\ .setInputCols(["document", "token"])\ .setOutputCol("class")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) clf_pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, sequence_classifier ]) empty_df = spark.createDataFrame([['']]).toDF("text") model = clf_pipeline.fit(empty_df) text_list = [ """Notwithstanding the return or destruction of all Evaluation Material, you or your Representatives shall continue to be bound by your obligations of confidentiality and other obligations hereunder.""", """There are no intended third party beneficiaries to this Agreement.""" ] df = spark.createDataFrame(pd.DataFrame({"text" : text_list})) result = model.transform(df) ```
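For parity with the other cards on this page, here is a minimal Scala sketch of the same sequence-classification pipeline. The class name `LegalBertForSequenceClassification` is taken from this card's `annotator` front matter and the model/repository names from the Python block; treat them as assumptions to verify against your Spark NLP for Legal distribution.

```scala
// Sketch: Scala equivalent of the Python pipeline above. Assumes a running
// SparkSession with the Spark NLP for Legal jars and a valid license.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

// Annotator name taken from the card's front matter (licensed library).
val sequenceClassifier = LegalBertForSequenceClassification.pretrained("legclf_nda_perpetuity_bert", "en", "legal/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("class")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, sequenceClassifier))

val data = Seq(
  "Notwithstanding the return or destruction of all Evaluation Material, you or your Representatives shall continue to be bound by your obligations of confidentiality and other obligations hereunder.",
  "There are no intended third party beneficiaries to this Agreement."
).toDF("text")

val result = pipeline.fit(data).transform(data)
```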
## Results ```bash +--------------------------------------------------------------------------------+----------+ | text| class| +--------------------------------------------------------------------------------+----------+ |Notwithstanding the return or destruction of all Evaluation Material, you or ...|PERPETUITY| | There are no intended third party beneficiaries to this Agreement.| OTHER| +--------------------------------------------------------------------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_nda_perpetuity_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References In-house annotations on Non-disclosure Agreements ## Benchmarking ```bash label precision recall f1-score support OTHER 0.98 1.00 0.99 60 PERPETUITY 1.00 0.89 0.94 9 accuracy - - 0.99 69 macro avg 0.99 0.94 0.97 69 weighted avg 0.99 0.99 0.99 69 ``` --- layout: model title: English image_classifier_vit_sea_mammals ViTForImageClassification from Neto71 author: John Snow Labs name: image_classifier_vit_sea_mammals date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_sea_mammals` is an English model originally trained by Neto71. 
## Predicted Entities `blue whale`, `dolphin`, `orca whale` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_sea_mammals_en_4.1.0_3.0_1660166060426.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_sea_mammals_en_4.1.0_3.0_1660166060426.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_sea_mammals", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_sea_mammals", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_sea_mammals| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Fast Neural Machine Translation Model from English to Greek author: John Snow Labs name: opus_mt_en_el date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, el, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `el` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_el_xx_2.7.0_2.4_1609163144290.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_el_xx_2.7.0_2.4_1609163144290.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_el", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Your text to translate."] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_el", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your text to translate.").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.el').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_el| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Company Name Normalization (Edgar Database) author: John Snow Labs name: finel_edgar_company_name date: 2022-08-30 tags: [en, finance, companies, edgar, licensed] task: Entity Resolution language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true recommended: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an Entity Linking / Entity Resolution model, which allows you to map a Company Name extracted by any NER model to the name used by the SEC in the Edgar Database. This comes in handy when you afterwards use Edgar Chunk Mappers on the output of this resolution to carry out data augmentation and retrieve additional information stored in the Edgar Database about a company. For more information about data augmentation, check the `Chunk Mapping` task in Models Hub. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/ER_EDGAR_CRUNCHBASE/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finel_edgar_company_name_en_1.0.0_3.2_1661866108362.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finel_edgar_company_name_en_1.0.0_3.2_1661866108362.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \ .setInputCols("ner_chunk") \ .setOutputCol("sentence_embeddings") resolver = finance.SentenceEntityResolverModel.pretrained("finel_edgar_company_name", "en", "finance/models")\ .setInputCols(["ner_chunk", "sentence_embeddings"]) \ .setOutputCol("normalized")\ .setDistanceFunction("EUCLIDEAN") pipeline = nlp.Pipeline( stages = [ documentAssembler, embeddings, resolver]) empty_df = spark.createDataFrame([[""]]).toDF("text") pipelineModel = pipeline.fit(empty_df) lp = nlp.LightPipeline(pipelineModel) lp.fullAnnotate("CONTACT GOLD") ```
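Although only Python is shown, the language switcher also lists Scala. A minimal Scala sketch of the same resolution pipeline follows; the annotator names mirror the Python block (the licensed `finance.` Python wrapper corresponds to the licensed Scala classes), and should be verified against your Finance NLP distribution.

```scala
// Sketch: Scala equivalent of the Python pipeline above. Assumes a running
// SparkSession with the Finance NLP jars and a valid license.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("ner_chunk") // the card feeds the whole document as the chunk

val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use", "en")
  .setInputCols("ner_chunk")
  .setOutputCol("sentence_embeddings")

val resolver = SentenceEntityResolverModel.pretrained("finel_edgar_company_name", "en", "finance/models")
  .setInputCols(Array("ner_chunk", "sentence_embeddings"))
  .setOutputCol("normalized")
  .setDistanceFunction("EUCLIDEAN")

val pipeline = new Pipeline().setStages(Array(documentAssembler, embeddings, resolver))

// Fit on an empty DataFrame, then use a LightPipeline for single strings.
val emptyDf = Seq("").toDF("text")
val lp = new LightPipeline(pipeline.fit(emptyDf))
lp.fullAnnotate("CONTACT GOLD")
```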
## Results ```bash +--------------+----------+---------------------------------------------------------+--------------------------------------------------------------------------------------------+-------------------------------------------+ | chunk | code | all_codes | resolutions | all_distances | +--------------+----------+---------------------------------------------------------+--------------------------------------------------------------------------------------------+-------------------------------------------+ | CONTACT GOLD | 981369960| [981369960, 271989147, 208531222, 273566922, 270348508] |[Contact Gold Corp, Guskin Gold Corp, Yinfu Gold Corp, MAGELLAN GOLD Corp, Star Gold Corp] | [0.1733, 0.3700, 0.3867, 0.4103, 0.4121] | +--------------+----------+---------------------------------------------------------+--------------------------------------------------------------------------------------------+-------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finel_edgar_company_name| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[original_company_name]| |Language:|en| |Size:|315.1 MB| |Case sensitive:|false| ## References In-house scraping and postprocessing of the SEC Edgar Database --- layout: model title: English asr_wav2vec2_xls_r_300m_mixed_by_malay_huggingface TFWav2Vec2ForCTC from malay-huggingface author: John Snow Labs name: pipeline_asr_wav2vec2_xls_r_300m_mixed_by_malay_huggingface date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and 
production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_mixed_by_malay_huggingface` is an English model originally trained by malay-huggingface. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xls_r_300m_mixed_by_malay_huggingface_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_mixed_by_malay_huggingface_en_4.2.0_3.0_1664104576012.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_mixed_by_malay_huggingface_en_4.2.0_3.0_1664104576012.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_300m_mixed_by_malay_huggingface', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_300m_mixed_by_malay_huggingface", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xls_r_300m_mixed_by_malay_huggingface| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_3percent TFWav2Vec2ForCTC from creynier author: John Snow Labs name: pipeline_asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_3percent date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_3percent` is an English model originally trained by creynier. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_3percent_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_3percent_en_4.2.0_3.0_1664103989166.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_3percent_en_4.2.0_3.0_1664103989166.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_3percent', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_3percent", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_3percent| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|349.4 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Spanish RobertaForQuestionAnswering (from PlanTL-GOB-ES) author: John Snow Labs name: roberta_qa_PlanTL_GOB_ES_roberta_base_bne_sqac date: 2022-06-20 tags: [es, open_source, question_answering, roberta] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bne-sqac` is a Spanish model originally trained by `PlanTL-GOB-ES`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_PlanTL_GOB_ES_roberta_base_bne_sqac_es_4.0.0_3.0_1655730149548.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_PlanTL_GOB_ES_roberta_base_bne_sqac_es_4.0.0_3.0_1655730149548.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_PlanTL_GOB_ES_roberta_base_bne_sqac","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_PlanTL_GOB_ES_roberta_base_bne_sqac","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.sqac.roberta.base").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_PlanTL_GOB_ES_roberta_base_bne_sqac| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|es| |Size:|460.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne-sqac - https://arxiv.org/abs/2107.07253 - http://www.bne.es/en/Inicio/index.html - https://arxiv.org/abs/1907.11692 - https://github.com/PlanTL-GOB-ES/lm-spanish --- layout: model title: Legal Limited Liability Company Agreement Document Binary Classifier (Longformer) author: John Snow Labs name: legclf_limited_liability_company_agreement date: 2022-12-18 tags: [en, legal, classification, licensed, document, longformer, limited, liability, company, agreement, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_limited_liability_company_agreement` model is a Longformer Document Classifier used to classify if the document belongs to the class `limited-liability-company-agreement` or not (Binary Classification). Longformers have a limit of 4096 tokens, so only the first 4096 tokens will be taken into account. We have found that for the vast majority of documents in legal corpora, if they are clean and contain only the legal document without any extra leading material, 4096 tokens are enough to perform Document Classification. If your document has more than 4096 tokens, you can try the following: split it into chunks of 4096 tokens, average the chunk embeddings, and train with the averaged version, which means the whole document is taken into account. 
## Predicted Entities `limited-liability-company-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_limited_liability_company_agreement_en_1.0.0_3.0_1671393675315.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_limited_liability_company_agreement_en_1.0.0_3.0_1671393675315.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_limited_liability_company_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
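This card, too, only shows Python despite the Scala option in the language switcher. Here is a minimal Scala sketch of the same Longformer document-classification pipeline; annotator and model names mirror the Python block (the `nlp.`/`legal.` Python wrappers correspond to the Scala classes) and should be verified against your Spark NLP for Legal distribution.

```scala
// Sketch: Scala equivalent of the Python pipeline above. Assumes a running
// SparkSession with the Spark NLP for Legal jars and a valid license.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = LongformerEmbeddings.pretrained("legal_longformer_base", "en")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")

// Average the token embeddings into one vector per document.
val sentenceEmbeddings = new SentenceEmbeddings()
  .setInputCols(Array("document", "embeddings"))
  .setOutputCol("sentence_embeddings")
  .setPoolingStrategy("AVERAGE")

// Stands in for the licensed legal.ClassifierDLModel wrapper used in Python.
val docClassifier = ClassifierDLModel.pretrained("legclf_limited_liability_company_agreement", "en", "legal/models")
  .setInputCols("sentence_embeddings")
  .setOutputCol("category")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings, sentenceEmbeddings, docClassifier))

val df = Seq("YOUR TEXT HERE").toDF("text")
val result = pipeline.fit(df).transform(df)
```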
## Results ```bash +-------------------------------------+ |result | +-------------------------------------+ |[limited-liability-company-agreement]| |[other] | |[other] | |[limited-liability-company-agreement]| +-------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_limited_liability_company_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.4 MB| ## References Legal documents, scraped from the Internet and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support limited-liability-company-agreement 1.00 0.96 0.98 98 other 0.98 1.00 0.99 231 accuracy - - 0.99 329 macro-avg 0.99 0.98 0.99 329 weighted-avg 0.99 0.99 0.99 329 ``` --- layout: model title: Legal Applicable Law Clause Binary Classifier author: John Snow Labs name: legclf_applicable_law_clause date: 2022-12-18 tags: [en, legal, classification, licensed, clause, bert, applicable, law, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the applicable-law clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it’s better to skip it, unless you want to do Binary Classification at sentence level. 
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `applicable-law`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_applicable_law_clause_en_1.0.0_3.0_1671393641361.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_applicable_law_clause_en_1.0.0_3.0_1671393641361.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_applicable_law_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
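The paragraph-splitting technique recommended in the description (splitting by multiline) can be sketched in plain Python before the text reaches the Spark pipeline. This is a minimal illustration under the assumption that paragraphs are separated by blank lines; it is not the workshop notebook's exact code:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document on blank lines (two or more consecutive newlines)."""
    return [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]

contract = (
    "GOVERNING LAW. This Agreement shall be governed by the laws of Delaware.\n\n"
    "NOTICES. All notices shall be in writing and delivered by hand or mail."
)
paragraphs = split_paragraphs(contract)
print(len(paragraphs))  # 2
# Each paragraph then becomes one row of the "text" column fed to the classifier:
# spark.createDataFrame([[p] for p in paragraphs]).toDF("text")
```

Splitting this way keeps each classifier input well under the 512-token embedding limit for typical clause-length paragraphs.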
## Results ```bash +-------+ |result| +-------+ |[applicable-law]| |[other]| |[other]| |[applicable-law]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_applicable_law_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents + Lawinsider categorization ## Benchmarking ```bash label precision recall f1-score support applicable-law 1.00 1.00 1.00 28 other 1.00 1.00 1.00 39 accuracy - - 1.00 67 macro-avg 1.00 1.00 1.00 67 weighted-avg 1.00 1.00 1.00 67 ``` --- layout: model title: English AlbertForQuestionAnswering model (from rowan1224) Squad author: John Snow Labs name: albert_qa_squad_slp date: 2022-06-24 tags: [en, open_source, albert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: AlBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `albert-squad-slp` is an English model originally trained by `rowan1224`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_squad_slp_en_4.0.0_3.0_1656063748783.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_squad_slp_en_4.0.0_3.0_1656063748783.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_squad_slp","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_squad_slp","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.albert.by_rowan1224").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_qa_squad_slp| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|42.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/rowan1224/albert-squad-slp --- layout: model title: Word2Vec Embeddings in Uyghur (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, ug, open_source] task: Embeddings language: ug edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ug_3.4.1_3.0_1647465490706.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ug_3.4.1_3.0_1647465490706.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ug") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["مەن stark nlp نى ياخشى كۆرىمەن"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ug") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("مەن stark nlp نى ياخشى كۆرىمەن").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ug.embed.w2v_cc_300d").predict("""مەن stark nlp نى ياخشى كۆرىمەن""") ```
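Downstream, the 300-dimensional vectors in the `embeddings` column are typically compared with cosine similarity. A toy numpy sketch (3-dimensional vectors standing in for the real 300-dimensional output):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = np.array([1.0, 2.0, 2.0])   # toy "word" vector
v2 = np.array([2.0, 4.0, 4.0])   # same direction as v1 -> similarity 1.0
v3 = np.array([-2.0, 1.0, 0.0])  # orthogonal to v1 -> similarity 0.0

print(cosine_similarity(v1, v2))  # 1.0
print(cosine_similarity(v1, v3))  # 0.0
```

Word2Vec embeddings make semantically related tokens point in similar directions, which is what this comparison exploits.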
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|ug| |Size:|148.3 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Pipeline to Detect Clinical Concepts (WIP Modifier) author: John Snow Labs name: jsl_ner_wip_modifier_clinical_pipeline date: 2022-03-21 tags: [licensed, ner, wip, clinical, modifier, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [jsl_ner_wip_modifier_clinical](https://nlp.johnsnowlabs.com/2021/04/01/jsl_ner_wip_modifier_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_modifier_clinical_pipeline_en_3.4.1_3.0_1647866811633.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_modifier_clinical_pipeline_en_3.4.1_3.0_1647866811633.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("jsl_ner_wip_modifier_clinical_pipeline", "en", "clinical/models") pipeline.annotate("EXAMPLE MEDICAL TEXT") ``` ```scala val pipeline = new PretrainedPipeline("jsl_ner_wip_modifier_clinical_pipeline", "en", "clinical/models") pipeline.annotate("EXAMPLE MEDICAL TEXT") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.wip_modifier_clinical.pipeline").predict("""EXAMPLE MEDICAL TEXT""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|jsl_ner_wip_modifier_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Czech Lemmatizer author: John Snow Labs name: lemma date: 2020-05-05 11:14:00 +0800 task: Lemmatization language: cs edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [lemmatizer, cs] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_cs_2.5.0_2.4_1588666300042.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_cs_2.5.0_2.4_1588666300042.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "cs") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Kromě toho, že je králem severu, je John Snow anglickým lékařem a lídrem ve vývoji anestezie a lékařské hygieny.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "cs") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Kromě toho, že je králem severu, je John Snow anglickým lékařem a lídrem ve vývoji anestezie a lékařské hygieny.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Kromě toho, že je králem severu, je John Snow anglickým lékařem a lídrem ve vývoji anestezie a lékařské hygieny."""] lemma_df = nlu.load('cs.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=4, result='Kromě', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=6, end=9, result='ten', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=10, end=10, result=',', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=12, end=13, result='že', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=15, end=16, result='on', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|cs| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Extract medical devices and clinical department mentions (Voice of the Patients) author: John Snow Labs name: ner_vop_clinical_dept_wip date: 2023-04-20 tags: [licensed, clinical, en, ner, vop, patient] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts mentions of medical devices and clinical departments from documents written in patients’ own words. Note: the ‘wip’ suffix indicates that model development is work-in-progress; the model will be finalised and its performance improved in upcoming releases. 
## Predicted Entities `MedicalDevice`, `AdmissionDischarge`, `ClinicalDept` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_clinical_dept_wip_en_4.4.0_3.0_1682012308508.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_clinical_dept_wip_en_4.4.0_3.0_1682012308508.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_clinical_dept_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["My little brother is having surgery tomorrow in the orthopedic department. He is getting a titanium plate put in his leg to help it heal faster. Wishing him a speedy recovery!"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_clinical_dept_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("My little brother is having surgery tomorrow in the orthopedic department. He is getting a titanium plate put in his leg to help it heal faster. Wishing him a speedy recovery!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
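The `NerConverterInternal` stage above merges the token-level B-/I- tags emitted by the NER model into entity chunks. A simplified pure-Python sketch of that BIO-merging logic (hypothetical tags, not actual model output):

```python
def merge_bio(tokens, tags):
    """Merge BIO tags into (chunk_text, label) pairs, as an NER converter does."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O" tag or inconsistent I- tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["surgery", "in", "the", "orthopedic", "department", "."]
tags   = ["O", "O", "O", "B-ClinicalDept", "I-ClinicalDept", "O"]
print(merge_bio(tokens, tags))  # [('orthopedic department', 'ClinicalDept')]
```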
## Results ```bash | chunk | ner_label | |:----------------------|:--------------| | orthopedic department | ClinicalDept | | titanium plate | MedicalDevice | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_clinical_dept_wip| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|4.0 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I”m 20 year old girl. I”m diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I”m taking weekly supplement of vitamin D and 1000 mcg b12 daily. I”m taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I”m facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
## Benchmarking ```bash label tp fp fn total precision recall f1 MedicalDevice 227 59 83 310 0.79 0.73 0.76 AdmissionDischarge 23 2 5 28 0.92 0.82 0.87 ClinicalDept 271 30 37 308 0.90 0.88 0.89 macro_avg 521 91 125 646 0.87 0.81 0.84 micro_avg 521 91 125 646 0.85 0.81 0.83 ``` --- layout: model title: Portuguese Bert Embeddings (Base, Cased) author: John Snow Labs name: bert_embeddings_bert_base_gl_cased date: 2022-04-11 tags: [bert, embeddings, pt, open_source] task: Embeddings language: pt edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-gl-cased` is a Portuguese model originally trained by `marcosgg`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_gl_cased_pt_3.4.2_3.0_1649673799584.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_gl_cased_pt_3.4.2_3.0_1649673799584.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_gl_cased","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Eu amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_gl_cased","pt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Eu amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.embed.bert_base_gl_cased").predict("""Eu amo Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_gl_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|pt| |Size:|667.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/marcosgg/bert-base-gl-cased - https://github.com/marcospln/homonymy_acl21 - https://arxiv.org/abs/2106.13553 --- layout: model title: Legal Background Clause Binary Classifier author: John Snow Labs name: legclf_background_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `background` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `background` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_background_clause_en_1.0.0_3.2_1660123254032.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_background_clause_en_1.0.0_3.2_1660123254032.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_background_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
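The benchmarking sections in these cards report both `macro-avg` (unweighted mean over classes) and `weighted-avg` (mean weighted by class support). A small sketch of the difference, using this model's per-class F1 scores:

```python
def macro_f1(scores):
    """Unweighted mean of per-class F1 scores."""
    return sum(f1 for f1, _ in scores) / len(scores)

def weighted_f1(scores):
    """Mean of per-class F1 scores weighted by class support."""
    total = sum(support for _, support in scores)
    return sum(f1 * support for f1, support in scores) / total

# (f1, support) per class: background, other
scores = [(0.81, 26), (0.93, 81)]

print(round(macro_f1(scores), 2))     # 0.87
print(round(weighted_f1(scores), 2))  # 0.9
```

The weighted average is pulled toward the majority `other` class, while the macro average treats both clause types equally.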
## Results ```bash +-------+ | result| +-------+ |[background]| |[other]| |[other]| |[background]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_background_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support background 0.73 0.92 0.81 26 other 0.97 0.89 0.93 81 accuracy - - 0.90 107 macro-avg 0.85 0.91 0.87 107 weighted-avg 0.91 0.90 0.90 107 ``` --- layout: model title: Relation Extraction between Tumors and Sizes (ReDL) author: John Snow Labs name: redl_oncology_size_biobert_wip date: 2023-01-15 tags: [licensed, clinical, oncology, en, relation_extraction, tensorflow] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This relation extraction model links Tumor_Size extractions to their corresponding Tumor_Finding extractions. 
## Predicted Entities `is_size_of`, `O` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_oncology_size_biobert_wip_en_4.2.4_3.0_1673772352847.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_oncology_size_biobert_wip_en_4.2.4_3.0_1673772352847.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Tumor_Finding and Tumor_Size should be included in the relation pairs.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") re_ner_chunk_filter = RENerChunksFilter()\ .setInputCols(["ner_chunk", "dependencies"])\ .setOutputCol("re_ner_chunk")\ .setMaxSyntacticDistance(10)\ .setRelationPairs(["Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding"]) re_model = RelationExtractionDLModel.pretrained("redl_oncology_size_biobert_wip", "en", "clinical/models")\ .setInputCols(["re_ner_chunk", "sentence"])\ .setOutputCol("relation_extraction") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model]) data = spark.createDataFrame([["The patient presented a 2 cm mass in her left breast, and the tumor in her other breast was 3 cm long."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new 
DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentence", "pos_tags", "token")) .setOutputCol("dependencies") val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunk", "dependencies")) .setOutputCol("re_ner_chunk") .setMaxSyntacticDistance(10) .setRelationPairs(Array("Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding")) val re_model = RelationExtractionDLModel.pretrained("redl_oncology_size_biobert_wip", "en", "clinical/models") .setPredictionThreshold(0.5f) .setInputCols(Array("re_ner_chunk", "sentence")) .setOutputCol("relation_extraction") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("The patient presented a 2 cm mass in her left breast, and the tumor in her other breast was 3 cm long.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu 
nlu.load("en.relation.oncology_size_biobert").predict("""The patient presented a 2 cm mass in her left breast, and the tumor in her other breast was 3 cm long.""") ```
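The Scala example sets `setPredictionThreshold(0.5f)` on the relation extraction model; an equivalent post-hoc filter over extracted rows can be sketched in plain Python (toy rows shaped like the output schema; the third row is invented for illustration):

```python
relations = [
    {"relation": "is_size_of", "chunk1": "2 cm",  "chunk2": "mass",   "confidence": 0.9604708},
    {"relation": "is_size_of", "chunk1": "tumor", "chunk2": "3 cm",   "confidence": 0.99731797},
    {"relation": "O",          "chunk1": "tumor", "chunk2": "breast", "confidence": 0.41},  # hypothetical row
]

THRESHOLD = 0.5
# Keep only positive relations whose confidence clears the threshold.
kept = [r for r in relations if r["relation"] != "O" and r["confidence"] >= THRESHOLD]
print(len(kept))  # 2
```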
## Results ```bash +----------+-------------+-------------+-----------+------+-------------+-------------+-----------+------+----------+ | relation| entity1|entity1_begin|entity1_end|chunk1| entity2|entity2_begin|entity2_end|chunk2|confidence| +----------+-------------+-------------+-----------+------+-------------+-------------+-----------+------+----------+ |is_size_of| Tumor_Size| 24| 27| 2 cm|Tumor_Finding| 29| 32| mass| 0.9604708| |is_size_of|Tumor_Finding| 62| 66| tumor| Tumor_Size| 92| 95| 3 cm|0.99731797| +----------+-------------+-------------+-----------+------+-------------+-------------+-----------+------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_oncology_size_biobert_wip| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|401.7 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label recall precision f1 support O 0.87 0.84 0.86 143.0 is_size_of 0.85 0.88 0.86 157.0 macro-avg 0.86 0.86 0.86 - ``` --- layout: model title: Legal Duties Clause Binary Classifier author: John Snow Labs name: legclf_duties_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `duties` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. 
If you have long legal documents and want to look for clauses, we recommend splitting them first using any of the techniques covered in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Keep in mind that the embeddings used by this model allow up to 512 tokens; if your input is longer, consider splitting it into smaller pieces (the tutorial linked above also covers this). This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, yielding a series of True/False values, one for each legal clause model you add. ## Predicted Entities `other`, `duties` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_duties_clause_en_1.0.0_3.2_1660123451472.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_duties_clause_en_1.0.0_3.2_1660123451472.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_duties_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
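After the pipeline runs, the predicted clause label is in the `category` annotation column. A minimal post-processing sketch, assuming each collected `category.result` value is a one-element list such as `["duties"]` (an assumption to verify against your own output shape):

```python
# Sketch: flatten collected ClassifierDL predictions to plain labels.
# The one-label-per-row shape is an assumption; inspect
# result.select("category.result").collect() on your data to confirm.
def to_labels(rows):
    """Reduce each collected `category.result` list to a single label."""
    return [r[0] if r else None for r in rows]

# e.g. rows = [row.result for row in result.select("category.result").collect()]
rows = [["duties"], ["other"], ["other"], ["duties"]]
print(to_labels(rows))
```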
## Results ```bash +--------+ | result | +--------+ |[duties]| | [other]| | [other]| |[duties]| +--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_duties_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house. ## Benchmarking ```bash label precision recall f1-score support duties 0.96 0.92 0.94 24 other 0.97 0.99 0.98 68 accuracy - - 0.97 92 macro-avg 0.96 0.95 0.96 92 weighted-avg 0.97 0.97 0.97 92 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from huxxx657) author: John Snow Labs name: roberta_qa_base_finetuned_squad_1 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad-1` is an English model originally trained by `huxxx657`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_squad_1_en_4.3.0_3.0_1674217536049.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_squad_1_en_4.3.0_3.0_1674217536049.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_squad_1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_squad_1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
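When you collect the `answer` annotations, each annotation typically carries the answer text plus metadata. A minimal pure-Python sketch for picking the highest-scoring answer, assuming an annotation shape with a `result` string and a `metadata` dict containing a `score` key (an assumption to confirm against your own output):

```python
# Sketch: pick the highest-scoring answer from collected QA annotations.
# The "result"/"metadata"/"score" keys are assumptions about the
# annotation shape; inspect your own fullAnnotate output to confirm.
def best_answer(annotations):
    scored = [(a["result"], float(a["metadata"].get("score", 0.0))) for a in annotations]
    return max(scored, key=lambda t: t[1])[0] if scored else None

sample = [
    {"result": "Clara", "metadata": {"score": "0.98"}},
    {"result": "Berkeley", "metadata": {"score": "0.11"}},
]
print(best_answer(sample))  # Clara
```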
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_finetuned_squad_1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/huxxx657/roberta-base-finetuned-squad-1 --- layout: model title: Pipeline to Detect Anatomical Structures (Single Entity - embeddings_clinical) author: John Snow Labs name: ner_anatomy_coarse_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_anatomy_coarse](https://nlp.johnsnowlabs.com/2021/03/31/ner_anatomy_coarse_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_pipeline_en_3.4.1_3.0_1647873248325.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_coarse_pipeline_en_3.4.1_3.0_1647873248325.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_anatomy_coarse_pipeline", "en", "clinical/models") pipeline.annotate("content in the lung tissue") ``` ```scala val pipeline = new PretrainedPipeline("ner_anatomy_coarse_pipeline", "en", "clinical/models") pipeline.annotate("content in the lung tissue") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.anatomy_coarse.pipeline").predict("""content in the lung tissue""") ```
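`annotate()` returns a dict of lists keyed by output column. A minimal sketch for pairing the recognized chunks with their labels; the key names below (`ner_chunk`, `ner_label`) are assumptions mirroring the Results table, and may differ in the actual pipeline output (labels often live in `fullAnnotate` metadata instead):

```python
# Sketch: pair NER chunks with their labels from an annotate()-style dict.
# The key names are assumptions; print your annotated dict's keys first.
def chunks_with_labels(annotated, chunk_key="ner_chunk", label_key="ner_label"):
    return list(zip(annotated.get(chunk_key, []), annotated.get(label_key, [])))

annotated = {"ner_chunk": ["lung tissue"], "ner_label": ["Anatomy"]}
print(chunks_with_labels(annotated))  # [('lung tissue', 'Anatomy')]
```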
## Results ```bash | | ner_chunk | entity | |---:|------------------:|----------:| | 0 | lung tissue | Anatomy | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_anatomy_coarse_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from deepset) author: John Snow Labs name: roberta_qa_base_squad2_distilled date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-distilled` is an English model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_distilled_en_4.2.4_3.0_1669986886723.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad2_distilled_en_4.2.4_3.0_1669986886723.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2_distilled","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad2_distilled","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_squad2_distilled| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|463.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/deepset/roberta-base-squad2-distilled - http://deepset.ai/ - https://haystack.deepset.ai/ - https://deepset.ai/german-bert - https://deepset.ai/germanquad - https://github.com/deepset-ai/haystack - https://docs.haystack.deepset.ai - https://haystack.deepset.ai/community - https://twitter.com/deepset_ai - https://www.linkedin.com/company/deepset-ai/ - https://github.com/deepset-ai/haystack/discussions - https://deepset.ai - http://www.deepset.ai/jobs - https://paperswithcode.com/sota?task=Question+Answering&dataset=squad_v2 --- layout: model title: Russian T5ForConditionalGeneration Base Cased model (from IlyaGusev) author: John Snow Labs name: t5_rut5_base_sum_gazeta date: 2023-01-30 tags: [ru, open_source, t5, tensorflow] task: Text Generation language: ru edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rut5_base_sum_gazeta` is a Russian model originally trained by `IlyaGusev`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_rut5_base_sum_gazeta_ru_4.3.0_3.0_1675106989342.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_rut5_base_sum_gazeta_ru_4.3.0_3.0_1675106989342.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_rut5_base_sum_gazeta","ru") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_rut5_base_sum_gazeta","ru") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
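T5 inputs are bounded, so very long articles are often pre-chunked before summarization. A minimal sketch of naive word-based chunking; note the budget here counts whitespace-separated words as a rough proxy, while the model's real limit is measured in subword tokens:

```python
# Sketch: naive pre-chunking of long text before feeding it to the
# summarizer. Word count is only a proxy for the true subword-token limit.
def chunk_text(text, max_words=400):
    """Split text into pieces of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

print(chunk_text("a b c d e", max_words=2))  # ['a b', 'c d', 'e']
```

Each chunk can then be summarized separately and the partial summaries concatenated or summarized again.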
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_rut5_base_sum_gazeta| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ru| |Size:|991.7 MB| ## References - https://huggingface.co/IlyaGusev/rut5_base_sum_gazeta - https://colab.research.google.com/drive/1re5E26ZIDUpAx1gOCZkbF3hcwjozmgG0 - https://github.com/IlyaGusev/summarus/blob/master/external/hf_scripts/train.py - https://github.com/IlyaGusev/summarus/blob/master/external/hf_scripts/configs/t5_training_config.json - https://github.com/IlyaGusev/summarus/blob/master/evaluate.py --- layout: model title: Mapping ICD10-CM Codes with Their Corresponding SNOMED Codes author: John Snow Labs name: icd10cm_snomed_mapper date: 2022-06-26 tags: [icd10cm, snomed, clinical, en, chunk_mapper, licensed] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps ICD10-CM codes to corresponding SNOMED codes under the Unified Medical Language System (UMLS). ## Predicted Entities `snomed_code` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10cm_snomed_mapper_en_3.5.3_3.0_1656230731120.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10cm_snomed_mapper_en_3.5.3_3.0_1656230731120.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc", "en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("icd10cm_code")\ .setDistanceFunction("EUCLIDEAN") chunkerMapper = ChunkMapperModel.pretrained("icd10cm_snomed_mapper", "en", "clinical/models")\ .setInputCols(["icd10cm_code"])\ .setOutputCol("mappings")\ .setRels(["snomed_code"]) pipeline = Pipeline(stages = [ documentAssembler, sbert_embedder, icd_resolver, chunkerMapper]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) light_pipeline = LightPipeline(model) result = light_pipeline.fullAnnotate("Diabetes Mellitus") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sbert_embeddings") val icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc", "en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("icd10cm_code") .setDistanceFunction("EUCLIDEAN") val chunkerMapper = ChunkMapperModel.pretrained("icd10cm_snomed_mapper", "en","clinical/models") .setInputCols(Array("icd10cm_code")) .setOutputCol("mappings") .setRels(Array("snomed_code")) val pipeline = new Pipeline().setStages(Array( documentAssembler, sbert_embedder, icd_resolver, chunkerMapper)) val data = Seq("Diabetes Mellitus").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu
nlu.load("en.icd10cm_to_snomed").predict("""Diabetes Mellitus""") ```
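From a `fullAnnotate` result row you can zip the resolved ICD-10-CM codes with their mapped SNOMED codes. A minimal pure-Python sketch, assuming each annotation is dict-like with a `result` field under the `icd10cm_code` and `mappings` keys (an assumption to verify against your own output):

```python
# Sketch: collect (ICD-10-CM, SNOMED) pairs from a fullAnnotate-style row.
# The key and field names are assumptions about the annotation shape;
# confirm them on your own result before relying on this.
def code_mappings(row, code_col="icd10cm_code", map_col="mappings"):
    codes = [a["result"] for a in row.get(code_col, [])]
    snomed = [a["result"] for a in row.get(map_col, [])]
    return list(zip(codes, snomed))

row = {"icd10cm_code": [{"result": "Z833"}], "mappings": [{"result": "160402005"}]}
print(code_mappings(row))  # [('Z833', '160402005')]
```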
## Results ```bash | | ner_chunk | icd10cm_code | snomed_mappings | |---:|:------------------|:---------------|------------------:| | 0 | Diabetes Mellitus | Z833 | 160402005 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd10cm_snomed_mapper| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[icd10_code]| |Output Labels:|[mappings]| |Language:|en| |Size:|1.1 MB| --- layout: model title: Fast Neural Machine Translation Model from Portuguese-Based Creoles And Pidgins to English author: John Snow Labs name: opus_mt_cpp_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, cpp, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `cpp` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_cpp_en_xx_2.7.0_2.4_1609164463762.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_cpp_en_xx_2.7.0_2.4_1609164463762.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentences") marian = MarianTransformer.pretrained("opus_mt_cpp_en", "xx")\ .setInputCols(["sentences"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_cpp_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.cpp.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_cpp_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Chinese BertForMaskedLM Cased model (from hfl) author: John Snow Labs name: bert_embeddings_rbt3 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rbt3` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt3_zh_4.2.4_3.0_1670022812260.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt3_zh_4.2.4_3.0_1670022812260.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt3","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt3","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_rbt3| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|144.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/rbt3 - https://arxiv.org/abs/1906.08101 - https://github.com/google-research/bert - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://arxiv.org/abs/2004.13922 --- layout: model title: Javanese Bert Embeddings (Small, Wikipedia) author: John Snow Labs name: bert_embeddings_javanese_bert_small date: 2022-04-11 tags: [bert, embeddings, jv, open_source] task: Embeddings language: jv edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `javanese-bert-small` is a Javanese model originally trained by `w11wo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_javanese_bert_small_jv_3.4.2_3.0_1649676642753.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_javanese_bert_small_jv_3.4.2_3.0_1649676642753.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_javanese_bert_small","jv") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_javanese_bert_small","jv") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("jv.embed.javanese_bert_small").predict("""I love Spark NLP""") ```
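`BertEmbeddings` produces one vector per token; a common way to get a single sentence-level vector for downstream similarity or classification is mean pooling over the token vectors. A minimal pure-Python sketch (in practice you would run this over the collected `embeddings` column):

```python
# Sketch: mean-pool per-token embedding vectors into one sentence vector.
# token_vectors is a list of equal-length float lists, e.g. collected
# from the pipeline's "embeddings" output column.
def mean_pool(token_vectors):
    if not token_vectors:
        return []
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```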
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_javanese_bert_small| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|jv| |Size:|410.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/w11wo/javanese-bert-small - https://arxiv.org/abs/1810.04805 - https://github.com/sgugger - https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb - https://w11wo.github.io/ --- layout: model title: English DistilBertForQuestionAnswering Base Cased model (from Moussab) author: John Snow Labs name: distilbert_qa_base_cased_led_squad_orkg_how_1e_4 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-orkg-how-1e-4` is an English model originally trained by `Moussab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_orkg_how_1e_4_en_4.3.0_3.0_1672766694745.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_orkg_how_1e_4_en_4.3.0_3.0_1672766694745.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_orkg_how_1e_4","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_orkg_how_1e_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_cased_led_squad_orkg_how_1e_4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Moussab/distilbert-base-cased-distilled-squad-orkg-how-1e-4 --- layout: model title: English AlbertForQuestionAnswering model (from ktrapeznikov) author: John Snow Labs name: albert_qa_xlarge_v2_squad_v2 date: 2022-06-24 tags: [en, open_source, albert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: AlBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `albert-xlarge-v2-squad-v2` is an English model originally trained by `ktrapeznikov`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_xlarge_v2_squad_v2_en_4.0.0_3.0_1656063815863.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_xlarge_v2_squad_v2_en_4.0.0_3.0_1656063815863.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_xlarge_v2_squad_v2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_xlarge_v2_squad_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.albert.xl_v2").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_qa_xlarge_v2_squad_v2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|205.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ktrapeznikov/albert-xlarge-v2-squad-v2 - https://rajpurkar.github.io/SQuAD-explorer/ --- layout: model title: Pipeline to Detect PHI for deidentification purposes author: John Snow Labs name: ner_deid_subentity_augmented_i2b2_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, phideidentification, i2b2, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deid_subentity_augmented_i2b2](https://nlp.johnsnowlabs.com/2021/11/29/ner_deid_subentity_augmented_i2b2_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_augmented_i2b2_pipeline_en_3.4.1_3.0_1647870819436.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_augmented_i2b2_pipeline_en_3.4.1_3.0_1647870819436.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_subentity_augmented_i2b2_pipeline", "en", "clinical/models") pipeline.annotate("A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 years old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.") ``` ```scala val pipeline = new PretrainedPipeline("ner_deid_subentity_augmented_i2b2_pipeline", "en", "clinical/models") pipeline.annotate("A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 years old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.") ``` {:.nlu-block} ```python import nlu nlu.load("en.deid.subentity_ner_augmented_i2b2.pipeline").predict("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 years old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""") ```
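A minimal post-processing sketch, assuming the pipeline's `annotate` output is a dict of lists with columns named `ner_chunk` and `ner_label` (the key names are an assumption and may differ depending on the pipeline's stages):

```python
# Hypothetical annotate() output: a dict of lists keyed by output column name.
# The keys "ner_chunk" and "ner_label" are assumptions for illustration.
annotations = {
    "ner_chunk": ["2093-01-13", "David Hale", "Hendrickson, Ora"],
    "ner_label": ["DATE", "DOCTOR", "PATIENT"],
}

# Pair each detected chunk with its entity label, as in the Results table.
pairs = list(zip(annotations["ner_chunk"], annotations["ner_label"]))
```

This yields `(chunk, label)` tuples that mirror the two columns of the Results table.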
## Results ```bash +-----------------------------+-------------+ |chunk |ner_label | +-----------------------------+-------------+ |2093-01-13 |DATE | |David Hale |DOCTOR | |Hendrickson, Ora |PATIENT | |7194334 |MEDICALRECORD| |01/13/93 |DATE | |Oliveira |DOCTOR | |25 |AGE | |1-11-2000 |DATE | |Cocke County Baptist Hospital|HOSPITAL | |0295 Keats Street |STREET | |(302) 786-5227 |PHONE | |Brothers Coal-Mine Corp |ORGANIZATION | +-----------------------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity_augmented_i2b2_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_nl24 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-nl24` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl24_en_4.3.0_3.0_1675122070674.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl24_en_4.3.0_3.0_1675122070674.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_nl24","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_nl24","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_nl24| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|402.2 MB| ## References - https://huggingface.co/google/t5-efficient-small-nl24 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English BertForQuestionAnswering model (from krinal214) author: John Snow Labs name: bert_qa_bert_all date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-all` is an English model originally trained by `krinal214`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_all_en_4.0.0_3.0_1654179448504.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_all_en_4.0.0_3.0_1654179448504.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_all","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_all","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.tydiqa.bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_all| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/krinal214/bert-all --- layout: model title: Pipeline to Detect Living Species author: John Snow Labs name: bert_token_classifier_ner_living_species_pipeline date: 2023-03-20 tags: [es, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: es edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_living_species](https://nlp.johnsnowlabs.com/2022/06/27/bert_token_classifier_ner_living_species_es_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_pipeline_es_4.3.0_3.2_1679304476657.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_pipeline_es_4.3.0_3.2_1679304476657.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_living_species_pipeline", "es", "clinical/models") text = '''Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_living_species_pipeline", "es", "clinical/models") val text = "Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. 
Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual." val result = pipeline.fullAnnotate(text) ```
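Spark NLP reports `begin`/`end` annotation offsets as inclusive character indices into the input text. A quick sanity-check sketch against the first row of the Results table (the offsets below are taken from that table):

```python
# Spark NLP annotation offsets are inclusive on both ends, so slicing the
# original text needs a +1 on the end index (Python slices are exclusive).
text = "Lactante varón de dos años."
begin, end = 0, 13  # offsets reported for the chunk "Lactante varón"
chunk = text[begin:end + 1]
```

This is handy when verifying that downstream code reconstructs chunks correctly from offsets.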
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:---------------|--------:|------:|:------------|-------------:| | 0 | Lactante varón | 0 | 13 | HUMAN | 0.999294 | | 1 | familiares | 41 | 50 | HUMAN | 0.999974 | | 2 | personales | 78 | 87 | HUMAN | 0.999983 | | 3 | neonatal | 116 | 123 | HUMAN | 0.999961 | | 4 | legumbres | 162 | 170 | SPECIES | 0.999973 | | 5 | lentejas | 243 | 250 | SPECIES | 0.999977 | | 6 | garbanzos | 254 | 262 | SPECIES | 0.99997 | | 7 | legumbres | 290 | 298 | SPECIES | 0.999974 | | 8 | madre | 334 | 338 | HUMAN | 0.999971 | | 9 | Cacahuete | 616 | 624 | SPECIES | 0.99997 | | 10 | padres | 728 | 733 | HUMAN | 0.999971 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_living_species_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|es| |Size:|410.6 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Wiam) author: John Snow Labs name: distilbert_qa_wiam_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Wiam`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_wiam_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769558209.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_wiam_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769558209.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_wiam_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_wiam_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_wiam_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Wiam/distilbert-base-uncased-finetuned-squad --- layout: model title: Google's Tapas Table Understanding (Base, WTQ) author: John Snow Labs name: table_qa_tapas_base_finetuned_wtq date: 2022-09-30 tags: [en, table, qa, question, answering, open_source] task: Table Question Answering language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: TapasForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a zero-shot table understanding model that lets you carry out question answering over Spark DataFrames. If your data is stored in a tabular file format such as CSV, load it into a DataFrame with Spark first. Size of this model: Base. Has aggregation operations?: True. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_base_finetuned_wtq_en_4.2.0_3.0_1664530550311.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_base_finetuned_wtq_en_4.2.0_3.0_1664530550311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python json_data = """ { "header": ["name", "money", "age"], "rows": [ ["Donald Trump", "$100,000,000", "75"], ["Elon Musk", "$20,000,000,000,000", "55"] ] } """ queries = [ "Who earns less than 200,000,000?", "Who earns 100,000,000?", "How much money has Donald Trump?", "How old are they?", ] data = spark.createDataFrame([ [json_data, " ".join(queries)] ]).toDF("table_json", "questions") document_assembler = MultiDocumentAssembler() \ .setInputCols("table_json", "questions") \ .setOutputCols("document_table", "document_questions") sentence_detector = SentenceDetector() \ .setInputCols(["document_questions"]) \ .setOutputCol("questions") table_assembler = TableAssembler()\ .setInputCols(["document_table"])\ .setOutputCol("table") tapas = TapasForQuestionAnswering\ .pretrained("table_qa_tapas_base_finetuned_wtq","en")\ .setInputCols(["questions", "table"])\ .setOutputCol("answers") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, table_assembler, tapas ]) model = pipeline.fit(data) model\ .transform(data)\ .selectExpr("explode(answers) AS answer")\ .select("answer")\ .show(truncate=False) ```
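The `TableAssembler` stage above consumes the table as a JSON string with `"header"` and `"rows"` keys. Building that string programmatically (a sketch mirroring the snippet above) avoids manual escaping when the table comes from application data:

```python
import json

# Build the JSON table expected by TableAssembler: a dict with "header"
# (column names) and "rows" (lists of string cell values), serialized to JSON.
table = {
    "header": ["name", "money", "age"],
    "rows": [
        ["Donald Trump", "$100,000,000", "75"],
        ["Elon Musk", "$20,000,000,000,000", "55"],
    ],
}
json_data = json.dumps(table)
```

The resulting `json_data` can be placed directly in the `table_json` column of the input DataFrame.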
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |answer | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} | |{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|table_qa_tapas_base_finetuned_wtq| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|413.9 MB| |Case sensitive:|false| ## References https://www.microsoft.com/en-us/download/details.aspx?id=54253 https://github.com/ppasupat/WikiTableQuestions --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_ft_new_news date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: 
"Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta_FT_new_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ft_new_news_en_4.3.0_3.0_1674222919626.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ft_new_news_en_4.3.0_3.0_1674222919626.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ft_new_news","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ft_new_news","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_ft_new_news| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|461.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/roberta_FT_new_newsqa --- layout: model title: Part of Speech for Slovak author: John Snow Labs name: pos_snk date: 2021-03-23 tags: [pos, open_source, sk] supported: true task: Part of Speech Tagging language: sk edition: Spark NLP 2.7.5 spark_version: 2.4 annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron` architecture. ## Predicted Entities - ADJ - ADP - ADV - AUX - CCONJ - DET - NOUN - NUM - PART - PRON - PROPN - PUNCT - VERB - X {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_snk_sk_2.7.5_2.4_1616510497891.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_snk_sk_2.7.5_2.4_1616510497891.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_snk", "sk")\ .setInputCols(["document", "token"])\ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Potom Maju nežne pohladila po hlávke a vraví : Spoznáš krásny veľký svet , Maja , hrejivé slniečko a nádherné lúky plné kvetov .']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_snk", "sk") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Potom Maju nežne pohladila po hlávke a vraví : Spoznáš krásny veľký svet , Maja , hrejivé slniečko a nádherné lúky plné kvetov .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Potom Maju nežne pohladila po hlávke a vraví : Spoznáš krásny veľký svet , Maja , hrejivé slniečko a nádherné lúky plné kvetov ."] token_df = nlu.load('sk.pos.snk').predict(text) token_df ```
## Results ```bash +--------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ |text |result | +--------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ |Potom Maju nežne pohladila po hlávke a vraví : Spoznáš krásny veľký svet , Maja , hrejivé slniečko a nádherné lúky plné kvetov .|[ADV, PROPN, ADV, VERB, ADP, NOUN, CCONJ, VERB, PUNCT, VERB, ADJ, ADJ, NOUN, PUNCT, PROPN, PUNCT, ADJ, NOUN, CCONJ, ADJ, NOUN, ADJ, NOUN, PUNCT]| +--------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_snk| |Compatibility:|Spark NLP 2.7.5+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[pos]| |Language:|sk| ## Data Source The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) data set. 
## Benchmarking ```bash | | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | ADJ | 0.91 | 0.91 | 0.91 | 1711 | | ADP | 0.99 | 0.99 | 0.99 | 1188 | | ADV | 0.82 | 0.82 | 0.82 | 442 | | AUX | 0.96 | 0.93 | 0.94 | 302 | | CCONJ | 0.97 | 0.92 | 0.95 | 447 | | DET | 0.94 | 0.95 | 0.94 | 504 | | NOUN | 0.90 | 0.96 | 0.93 | 3287 | | NUM | 0.97 | 0.92 | 0.94 | 412 | | PART | 0.73 | 0.77 | 0.75 | 177 | | PRON | 0.97 | 0.97 | 0.97 | 380 | | PROPN | 0.90 | 0.79 | 0.84 | 879 | | PUNCT | 1.00 | 1.00 | 1.00 | 1806 | | SCONJ | 0.99 | 0.98 | 0.99 | 182 | | SYM | 1.00 | 0.14 | 0.25 | 14 | | VERB | 0.93 | 0.94 | 0.93 | 1176 | | X | 0.60 | 0.25 | 0.35 | 121 | | accuracy | | | 0.93 | 13028 | | macro avg | 0.91 | 0.83 | 0.84 | 13028 | | weighted avg | 0.93 | 0.93 | 0.93 | 13028 | ``` --- layout: model title: Dutch RoBERTa Embeddings (Non Shuffled) author: John Snow Labs name: roberta_embeddings_robbertje_1_gb_non_shuffled date: 2022-04-14 tags: [roberta, embeddings, nl, open_source] task: Embeddings language: nl edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `robbertje-1-gb-non-shuffled` is a Dutch model originally trained by `DTAI-KULeuven`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_robbertje_1_gb_non_shuffled_nl_3.4.2_3.0_1649949119650.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_robbertje_1_gb_non_shuffled_nl_3.4.2_3.0_1649949119650.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_robbertje_1_gb_non_shuffled","nl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ik hou van vonk nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_robbertje_1_gb_non_shuffled","nl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ik hou van vonk nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("nl.embed.robbertje_1_gb_non_shuffled").predict("""Ik hou van vonk nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_robbertje_1_gb_non_shuffled| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|nl| |Size:|279.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/DTAI-KULeuven/robbertje-1-gb-non-shuffled - http://github.com/iPieter/robbert - http://github.com/iPieter/robbertje - https://www.clinjournal.org/clinj/article/view/131 - https://www.clin31.ugent.be - https://arxiv.org/abs/2101.05716 --- layout: model title: Arabic Bert Embeddings (ARBERT model) author: John Snow Labs name: bert_embeddings_ARBERT date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `ARBERT` is an Arabic model originally trained by `UBC-NLP`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_ARBERT_ar_3.4.2_3.0_1649677924761.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_ARBERT_ar_3.4.2_3.0_1649677924761.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_ARBERT","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_ARBERT","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.arbert").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_ARBERT| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|608.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/UBC-NLP/ARBERT - https://mageed.arts.ubc.ca/files/2020/12/marbert_arxiv_2020.pdf - https://github.com/UBC-NLP/marbert - https://doi.org/10.14288/SOCKEYE - https://www.tensorflow.org/tfrc --- layout: model title: English asr_wav2vec2_xls_r_1b_english TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_wav2vec2_xls_r_1b_english date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_1b_english` is an English model originally trained by jonatasgrosman. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_xls_r_1b_english_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_1b_english_en_4.2.0_3.0_1664015676390.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_1b_english_en_4.2.0_3.0_1664015676390.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_1b_english', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_1b_english", lang = "en") val annotations = pipeline.transform(audioDF) ```
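The pipeline expects `audioDF` to contain a column of floating-point audio samples. As a hedged sketch (the file name is hypothetical, and Wav2Vec2 models generally assume 16 kHz mono input), a 16-bit PCM WAV file can be converted to that shape with only the standard library:

```python
import struct
import wave

def wav_to_floats(path):
    """Read a mono 16-bit PCM WAV file and scale samples to [-1.0, 1.0]."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "expected 16-bit PCM"
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]
```

`audioDF` could then be built with, e.g., `spark.createDataFrame([[wav_to_floats("sample.wav")]], ["audio_content"])`, matching the `audio_content` column name used in the AudioAssembler examples on this site.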
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xls_r_1b_english| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|3.6 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English RobertaForSequenceClassification Cased model (from emekaboris) author: John Snow Labs name: roberta_classifier_autonlp_txc_17923124 date: 2022-12-09 tags: [en, open_source, roberta, sequence_classification, classification, tensorflow] task: Text Classification language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-txc-17923124` is an English model originally trained by `emekaboris`. ## Predicted Entities `15.0`, `24.0`, `10.0`, `8.0`, `4.0`, `17.0`, `3.0`, `23.0`, `5.0`, `6.0`, `1.0`, `21.0`, `18.0`, `19.0`, `14.0`, `16.0`, `20.0`, `7.0`, `13.0`, `11.0`, `12.0`, `9.0`, `22.0`, `2.0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_classifier_autonlp_txc_17923124_en_4.2.4_3.0_1670623102986.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_classifier_autonlp_txc_17923124_en_4.2.4_3.0_1670623102986.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autonlp_txc_17923124","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_classifier]) data = spark.createDataFrame([["I love you!"], ["I feel lucky to be here."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autonlp_txc_17923124","en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_classifier)) val data = Seq("I love you!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_classifier_autonlp_txc_17923124| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|427.0 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/emekaboris/autonlp-txc-17923124 --- layout: model title: Hindi RobertaForSequenceClassification Cased model (from neuralspace) author: John Snow Labs name: roberta_classifier_autotrain_citizen_nlu_hindi_1370952776 date: 2022-12-09 tags: [hi, open_source, roberta, sequence_classification, classification, tensorflow] task: Text Classification language: hi edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-citizen_nlu_hindi-1370952776` is a Hindi model originally trained by `neuralspace`. 
## Predicted Entities `ReportingMissingPets`, `EligibilityForBloodDonationCovidGap`, `ReportingPropertyTakeOver`, `IntentForBloodReceivalAppointment`, `EligibilityForBloodDonationSTD`, `InquiryForDoctorConsultation`, `InquiryOfCovidSymptoms`, `InquiryForVaccineCount`, `InquiryForCovidPrevention`, `InquiryForVaccinationRequirements`, `EligibilityForBloodDonationForPregnantWomen`, `ReportingCyberCrime`, `ReportingHitAndRun`, `ReportingTresspassing`, `InquiryofBloodDonationRequirements`, `ReportingMurder`, `ReportingVehicleAccident`, `ReportingMissingPerson`, `EligibilityForBloodDonationAgeLimit`, `ReportingAnimalPoaching`, `InquiryOfEmergencyContact`, `InquiryForQuarantinePeriod`, `ContactRealPerson`, `IntentForBloodDonationAppointment`, `ReportingMissingVehicle`, `InquiryForCovidRecentCasesCount`, `InquiryOfContact`, `StatusOfFIR`, `InquiryofVaccinationAgeLimit`, `InquiryForCovidTotalCasesCount`, `EligibilityForBloodDonationGap`, `InquiryofPostBloodDonationEffects`, `InquiryofPostBloodReceivalCareSchemes`, `EligibilityForBloodReceiversBloodGroup`, `EligitbilityForVaccine`, `InquiryOfLockdownDetails`, `ReportingSexualAssault`, `InquiryForVaccineCost`, `InquiryForCovidDeathCount`, `ReportingDrugConsumption`, `ReportingDrugTrafficing`, `InquiryofPostBloodDonationCertificate`, `ReportingDowry`, `ReportingChildAbuse`, `ReportingAnimalAbuse`, `InquiryofPostBloodReceivalEffects`, `Eligibility For BloodDonationWithComorbidities`, `InquiryOfTiming`, `InquiryForCovidActiveCasesCount`, `InquiryOfLocation`, `InquiryofPostBloodDonationCareSchemes`, `ReportingTheft`, `InquiryForTravelRestrictions`, `ReportingDomesticViolence`, `InquiryofBloodReceivalRequirements` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_classifier_autotrain_citizen_nlu_hindi_1370952776_hi_4.2.4_3.0_1670623674646.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_classifier_autotrain_citizen_nlu_hindi_1370952776_hi_4.2.4_3.0_1670623674646.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autotrain_citizen_nlu_hindi_1370952776","hi") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_classifier]) data = spark.createDataFrame([["I love you!"], ["I feel lucky to be here."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autotrain_citizen_nlu_hindi_1370952776","hi") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_classifier)) val data = Seq("I love you!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
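After `transform`, the `class` output column holds category annotations. Once collected to the driver, they can be flattened with plain Python; the helper below is purely illustrative (the dict shape mirrors Spark NLP's annotation fields, reduced here to just `result`), and the intent label used in the example is one of the Predicted Entities listed above.

```python
def predicted_labels(rows):
    """Map each input text to its first predicted label.

    `rows` is a list of (text, annotations) pairs, where each annotation
    is a dict with a 'result' field, as produced by collecting the
    'class' output column of the pipeline.
    """
    return {text: anns[0]["result"] if anns else None for text, anns in rows}

print(predicted_labels([("sample text", [{"result": "InquiryOfContact"}])]))
# → {'sample text': 'InquiryOfContact'}
```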
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_classifier_autotrain_citizen_nlu_hindi_1370952776| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|hi| |Size:|314.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/neuralspace/autotrain-citizen_nlu_hindi-1370952776 --- layout: model title: Lemmatizer (Macedonian, SpacyLookup) author: John Snow Labs name: lemma_spacylookup date: 2022-03-03 tags: [open_source, lemmatizer, mk] task: Lemmatization language: mk edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Macedonian Lemmatizer is a scalable, production-ready version of the rule-based lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_mk_3.4.1_3.0_1646316593272.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_mk_3.4.1_3.0_1646316593272.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","mk") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Вие не сте подобри од мене"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","mk") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("Вие не сте подобри од мене").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("mk.lemma").predict("""Вие не сте подобри од мене""") ```
## Results ```bash +---------------------------------+ |result | +---------------------------------+ |[Вие, не, сте, подобри, од, мене]| +---------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma_spacylookup| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|mk| |Size:|9.4 KB| --- layout: model title: English BertForQuestionAnswering model (from madlag) author: John Snow Labs name: bert_qa_bert_large_uncased_whole_word_masking_finetuned_squadv2 date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-whole-word-masking-finetuned-squadv2` is an English model originally trained by `madlag`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_finetuned_squadv2_en_4.0.0_3.0_1654537196929.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_finetuned_squadv2_en_4.0.0_3.0_1654537196929.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_whole_word_masking_finetuned_squadv2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_large_uncased_whole_word_masking_finetuned_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.large_uncased_whole_word_masking_v2.by_madlag").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
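The nlu one-liner above packs the question and its context into a single string separated by `|||`. A tiny helper makes that convention explicit (illustrative only; the function name is not part of the nlu API):

```python
def nlu_qa_input(question, context):
    """Join a question and its context with the '|||' separator used by
    nlu's question-answering predict() calls."""
    return f"{question}|||{context}"

print(nlu_qa_input("What's my name?", "My name is Clara and I live in Berkeley."))
# → What's my name?|||My name is Clara and I live in Berkeley.
```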
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_large_uncased_whole_word_masking_finetuned_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/madlag/bert-large-uncased-whole-word-masking-finetuned-squadv2 --- layout: model title: English image_classifier_vit_base_patch16_224_in21k_euroSat ViTForImageClassification from YKXBCi author: John Snow Labs name: image_classifier_vit_base_patch16_224_in21k_euroSat date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_patch16_224_in21k_euroSat` is an English model originally trained by YKXBCi. ## Predicted Entities `Residential`, `AnnualCrop`, `Highway`, `Pasture`, `SeaLake`, `Industrial`, `HerbaceousVegetation`, `River`, `PermanentCrop`, `Forest` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_in21k_euroSat_en_4.1.0_3.0_1660168635128.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_in21k_euroSat_en_4.1.0_3.0_1660168635128.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_patch16_224_in21k_euroSat", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_patch16_224_in21k_euroSat", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
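The snippets above assume an existing `imageDF`. One way to build it is Spark's image data source, e.g. `spark.read.format("image").load(folder)`. As a hedged, pure-Python preliminary step (the folder name and extension set are assumptions), the candidate image paths can be gathered first:

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}

def image_paths(folder):
    """Return sorted image file paths under `folder`, searched recursively."""
    return sorted(
        str(p) for p in Path(folder).rglob("*")
        if p.suffix.lower() in IMAGE_EXTS
    )
```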
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_patch16_224_in21k_euroSat| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Fast Neural Machine Translation Model from Maltese to English author: John Snow Labs name: opus_mt_mt_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, mt, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `mt` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_mt_en_xx_2.7.0_2.4_1609167442534.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_mt_en_xx_2.7.0_2.4_1609167442534.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_mt_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_mt_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.mt.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_mt_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering model (from AyushPJ) author: John Snow Labs name: distilbert_qa_test_squad_trained_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `test-squad-trained-finetuned-squad` is an English model originally trained by `AyushPJ`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_test_squad_trained_finetuned_squad_en_4.0.0_3.0_1654728899742.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_test_squad_trained_finetuned_squad_en_4.0.0_3.0_1654728899742.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_test_squad_trained_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_test_squad_trained_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.by_AyushPJ").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_test_squad_trained_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AyushPJ/test-squad-trained-finetuned-squad --- layout: model title: Detect Neoplasms author: John Snow Labs name: ner_neoplasms date: 2021-03-31 tags: [ner, clinical, licensed, es] task: Named Entity Recognition language: es edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The Named Entity Recognition annotator allows a generic model to be trained with a deep learning algorithm (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Neoplasms NER is a Named Entity Recognition model that annotates text to find references to tumors. The only entity it annotates is MalignantNeoplasm. Neoplasms NER is trained with the 'embeddings_scielowiki_300d' word embeddings model, so be sure to use the same embeddings in the pipeline.
## Predicted Entities ``MORFOLOGIA_NEOPLASIA`` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_TUMOR_ES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_neoplasms_es_3.0.0_3.0_1617208439630.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_neoplasms_es_3.0.0_3.0_1617208439630.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embed = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d","es","clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("word_embeddings") ner = MedicalNerModel.pretrained("ner_neoplasms","es","clinical/models")\ .setInputCols("sentence","token","word_embeddings")\ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embed, ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['HISTORIA DE ENFERMEDAD ACTUAL: El Sr. Smith es un hombre blanco veterano de 60 años con múltiples comorbilidades, que tiene antecedentes de cáncer de vejiga diagnosticado hace aproximadamente dos años por el Hospital VA. Allí se sometió a una resección. Debía ser ingresado en el Hospital de Día para una cistectomía. Fue visto en la Clínica de Urología y Clínica de Radiología el 02/04/2003. CURSO DE HOSPITAL: El Sr. Smith se presentó en el Hospital de Día antes de la cirugía de Urología. En evaluación, EKG, ecocardiograma fue anormal, se obtuvo una consulta de Cardiología. Luego se procedió a una resonancia magnética de estrés con adenosina cardíaca, la misma fue positiva para isquemia inducible, infarto subendocárdico inferolateral leve a moderado con isquemia peri-infarto. Además, se observa isquemia inducible en el tabique lateral inferior. El Sr. Smith se sometió a un cateterismo del corazón izquierdo, que reveló una enfermedad de las arterias coronarias de dos vasos. 
La RCA, proximal estaba estenosada en un 95% y la distal en un 80% estenosada. La LAD media estaba estenosada en un 85% y la LAD distal estaba estenosada en un 85%. Se colocaron cuatro stents de metal desnudo Multi-Link Vision para disminuir las cuatro lesiones al 0%. Después de la intervención, el Sr. Smith fue admitido en 7 Ardmore Tower bajo el Servicio de Cardiología bajo la dirección del Dr. Hart. El Sr. Smith tuvo un curso hospitalario post-intervención sin complicaciones. Se mantuvo estable para el alta hospitalaria el 07/02/2003 con instrucciones de tomar Plavix diariamente durante un mes y Urología está al tanto de lo mismo.']], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embed = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d","es","clinical/models") .setInputCols(Array("document","token")) .setOutputCol("word_embeddings") val ner = MedicalNerModel.pretrained("ner_neoplasms","es","clinical/models") .setInputCols(Array("sentence","token","word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embed, ner, ner_converter)) val data = Seq("""HISTORIA DE ENFERMEDAD ACTUAL: El Sr. Smith es un hombre blanco veterano de 60 años con múltiples comorbilidades, que tiene antecedentes de cáncer de vejiga diagnosticado hace aproximadamente dos años por el Hospital VA. Allí se sometió a una resección. Debía ser ingresado en el Hospital de Día para una cistectomía. Fue visto en la Clínica de Urología y Clínica de Radiología el 02/04/2003. CURSO DE HOSPITAL: El Sr. 
Smith se presentó en el Hospital de Día antes de la cirugía de Urología. En evaluación, EKG, ecocardiograma fue anormal, se obtuvo una consulta de Cardiología. Luego se procedió a una resonancia magnética de estrés con adenosina cardíaca, la misma fue positiva para isquemia inducible, infarto subendocárdico inferolateral leve a moderado con isquemia peri-infarto. Además, se observa isquemia inducible en el tabique lateral inferior. El Sr. Smith se sometió a un cateterismo del corazón izquierdo, que reveló una enfermedad de las arterias coronarias de dos vasos. La RCA, proximal estaba estenosada en un 95% y la distal en un 80% estenosada. La LAD media estaba estenosada en un 85% y la LAD distal estaba estenosada en un 85%. Se colocaron cuatro stents de metal desnudo Multi-Link Vision para disminuir las cuatro lesiones al 0%. Después de la intervención, el Sr. Smith fue admitido en 7 Ardmore Tower bajo el Servicio de Cardiología bajo la dirección del Dr. Hart. El Sr. Smith tuvo un curso hospitalario post-intervención sin complicaciones. Se mantuvo estable para el alta hospitalaria el 07/02/2003 con instrucciones de tomar Plavix diariamente durante un mes y Urología está al tanto de lo mismo.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.neoplasm").predict("""HISTORIA DE ENFERMEDAD ACTUAL: El Sr. Smith es un hombre blanco veterano de 60 años con múltiples comorbilidades, que tiene antecedentes de cáncer de vejiga diagnosticado hace aproximadamente dos años por el Hospital VA. Allí se sometió a una resección. Debía ser ingresado en el Hospital de Día para una cistectomía. Fue visto en la Clínica de Urología y Clínica de Radiología el 02/04/2003. CURSO DE HOSPITAL: El Sr. Smith se presentó en el Hospital de Día antes de la cirugía de Urología. En evaluación, EKG, ecocardiograma fue anormal, se obtuvo una consulta de Cardiología. 
Luego se procedió a una resonancia magnética de estrés con adenosina cardíaca, la misma fue positiva para isquemia inducible, infarto subendocárdico inferolateral leve a moderado con isquemia peri-infarto. Además, se observa isquemia inducible en el tabique lateral inferior. El Sr. Smith se sometió a un cateterismo del corazón izquierdo, que reveló una enfermedad de las arterias coronarias de dos vasos. La RCA, proximal estaba estenosada en un 95% y la distal en un 80% estenosada. La LAD media estaba estenosada en un 85% y la LAD distal estaba estenosada en un 85%. Se colocaron cuatro stents de metal desnudo Multi-Link Vision para disminuir las cuatro lesiones al 0%. Después de la intervención, el Sr. Smith fue admitido en 7 Ardmore Tower bajo el Servicio de Cardiología bajo la dirección del Dr. Hart. El Sr. Smith tuvo un curso hospitalario post-intervención sin complicaciones. Se mantuvo estable para el alta hospitalaria el 07/02/2003 con instrucciones de tomar Plavix diariamente durante un mes y Urología está al tanto de lo mismo.""") ```
## Results ```bash +------+--------------------+ |chunk |ner_label | +------+--------------------+ |cáncer|MORFOLOGIA_NEOPLASIA| +------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_neoplasms| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| --- layout: model title: Financial Job Titles NER author: John Snow Labs name: finner_bert_roles date: 2022-08-30 tags: [en, finance, ner, job, titles, jobs, roles, licensed] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true recommended: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Financial nlp.BertForTokenClassification NER model aimed to extract Job Titles / Roles of people in Companies, and was trained using Resumes, Wikipedia Articles, Financial and Legal documents, annotated in-house. ## Predicted Entities `ROLE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FINNER_ROLES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_bert_roles_en_1.0.0_3.2_1661846269918.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_bert_roles_en_1.0.0_3.2_1661846269918.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from johnsnowlabs import * import pyspark.sql.functions as F documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") tokenClassifier = finance.BertForTokenClassification.pretrained("finner_bert_roles", "en", "finance/models")\ .setInputCols("token", "document")\ .setOutputCol("label")\ .setCaseSensitive(True) ner_converter = nlp.NerConverter()\ .setInputCols(["document","token","label"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, tokenClassifier, ner_converter ] ) import pandas as pd p_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']}))) text = 'Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon' res = p_model.transform(spark.createDataFrame([[text]]).toDF("text")) result_df = res.select(F.explode(F.arrays_zip(res.token.result,res.label.result, res.label.metadata)).alias("cols"))\ .select(F.expr("cols['0']").alias("token"), F.expr("cols['1']").alias("label"), F.expr("cols['2']['confidence']").alias("confidence")) result_df.show(50, truncate=100) ```
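The `NerConverter` stage in the pipeline above groups the classifier's token-level B-/I-/O predictions into entity chunks. The grouping logic it applies can be sketched in plain Python (a hypothetical helper for illustration, not part of the Spark NLP API):

```python
def bio_to_chunks(tokens, labels):
    """Group BIO-tagged tokens into (chunk_text, entity_type) pairs.

    A chunk starts at a B- tag and extends through following I- tags of
    the same entity type; an O token closes any open chunk.
    """
    chunks, current, current_type = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                chunks.append((" ".join(current), current_type))
            current, current_type = [tok], lab[2:]
        elif lab.startswith("I-") and current and lab[2:] == current_type:
            current.append(tok)
        else:  # "O", or an I- tag with no matching open chunk
            if current:
                chunks.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        chunks.append((" ".join(current), current_type))
    return chunks

tokens = ["American", "entrepreneur", ",", "founder", "and", "CEO"]
labels = ["B-ROLE", "I-ROLE", "O", "B-ROLE", "O", "B-ROLE"]
print(bio_to_chunks(tokens, labels))
# [('American entrepreneur', 'ROLE'), ('founder', 'ROLE'), ('CEO', 'ROLE')]
```

This is why `ner_chunk` contains multi-token spans such as "American entrepreneur" even though the classifier itself emits one label per token.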
## Results ```bash +------------+---------+----------+ | token|ner_label|confidence| +------------+---------+----------+ | Jeffrey| O| 0.9984| | Preston| O| 0.9878| | Bezos| O| 0.9939| | is| O| 0.999| | an| O| 0.9988| | American| B-ROLE| 0.8294| |entrepreneur| I-ROLE| 0.9358| | ,| O| 0.9979| | founder| B-ROLE| 0.8645| | and| O| 0.857| | CEO| B-ROLE| 0.72| | of| O| 0.995| | Amazon| O| 0.9428| +------------+---------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_bert_roles| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|402.8 MB| |Case sensitive:|true| |Max sentence length:|128| ## References In-house annotations on Wikidata, the CUAD dataset, Financial 10-K documents, and Resumes ## Benchmarking ```bash label tp fp fn prec rec f1 B-ROLE 3553 174 262 0.95331365 0.9313237 0.9421904 I-ROLE 4868 250 243 0.9511528 0.95245546 0.9518037 Macro-average 8421 424 505 0.9522332 0.9418896 0.9470331 Micro-average 8421 424 505 0.9520633 0.9434237 0.94772375 ``` --- layout: model title: English asr_wav2vec2_base_timit_demo_colab70 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab70 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab70` is an English model originally trained by hassnain. 
NOTE: This pipeline only works on a CPU; if you need to run this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab70_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab70_en_4.2.0_3.0_1664021746199.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab70_en_4.2.0_3.0_1664021746199.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab70', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab70", lang = "en") val annotations = pipeline.transform(audioDF) ```
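The `audioDF` passed to the pipeline above is expected to carry the raw waveform as an array of normalized floats. A minimal sketch of converting signed 16-bit PCM samples (as read from a typical 16 kHz mono WAV file, loading step assumed) into that representation:

```python
def pcm16_to_floats(samples):
    """Normalize signed 16-bit PCM samples to floats in [-1.0, 1.0]."""
    return [s / 32768.0 for s in samples]

# A tiny synthetic clip; a real one would come from an audio loader
# such as librosa or soundfile.
raw = [0, 16384, -32768, 32767]
floats = pcm16_to_floats(raw)
print(floats[:3])  # [0.0, 0.5, -1.0]
```

Those floats, placed in a single `audio_content` column of a Spark DataFrame, are what the pipeline's `AudioAssembler` stage consumes.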
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab70| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal NER for NDA (Required Disclosure Clauses) author: John Snow Labs name: legner_nda_req_discl date: 2023-04-24 tags: [en, legal, licensed, ner, nda, disclosure] task: Named Entity Recognition language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a NER model, aimed to be run **only** after detecting the `REQ_DISCL` clause with a proper classifier (use `legmulticlf_mnda_sections_paragraph_other` model for that purpose). It will extract the following entities: `DISCLOSURE_BASIS`, `REQ_DISCLOSURE_CONFID`, `REQ_DISCLOSURE_COOPERATION`, `REQ_DISCLOSURE_LEGAL`, `REQ_DISCLOSURE_NOTICE`, `REQ_DISCLOSURE_PARTY`, `REQ_DISCLOSURE_REMEDY`, and `REQ_OBLIGATION_ACTION`. ## Predicted Entities `DISCLOSURE_BASIS`, `REQ_DISCLOSURE_CONFID`, `REQ_DISCLOSURE_COOPERATION`, `REQ_DISCLOSURE_LEGAL`, `REQ_DISCLOSURE_NOTICE`, `REQ_DISCLOSURE_PARTY`, `REQ_DISCLOSURE_REMEDY`, `REQ_OBLIGATION_ACTION` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_nda_req_discl_en_1.0.0_3.0_1682327765264.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_nda_req_discl_en_1.0.0_3.0_1682327765264.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_nda_req_discl", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""If the Discloser waives the Recipient’s compliance with the agreement or fails to obtain a protective order or other appropriate remedies, the Recipient will furnish only that portion of the Confidential Information that is legally required to be disclosed and will use its best efforts to obtain confidential treatment for such Confidential Information."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
## Results ```bash +----------------------+--------------------------+ |chunk |ner_label | +----------------------+--------------------------+ |Discloser |REQ_DISCLOSURE_PARTY | |obtain |REQ_OBLIGATION_ACTION | |protective order |REQ_DISCLOSURE_REMEDY | |appropriate remedies |REQ_DISCLOSURE_REMEDY | |furnish |REQ_OBLIGATION_ACTION | |legally required |REQ_DISCLOSURE_LEGAL | |best efforts |REQ_DISCLOSURE_COOPERATION| |obtain |REQ_OBLIGATION_ACTION | |confidential treatment|REQ_DISCLOSURE_CONFID | +----------------------+--------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_nda_req_discl| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References In-house annotations on Non-disclosure Agreements ## Benchmarking ```bash label precision recall f1-score support DISCLOSURE_BASIS 0.77 0.70 0.73 57 REQ_DISCLOSURE_CONFID 0.96 0.93 0.95 29 REQ_DISCLOSURE_COOPERATION 1.00 0.94 0.97 17 REQ_DISCLOSURE_LEGAL 0.93 0.77 0.84 35 REQ_DISCLOSURE_NOTICE 0.89 0.89 0.89 19 REQ_DISCLOSURE_PARTY 1.00 0.89 0.94 38 REQ_DISCLOSURE_REMEDY 1.00 1.00 1.00 52 REQ_OBLIGATION_ACTION 0.95 0.86 0.90 121 macro-avg 0.94 0.86 0.90 368 weighted-avg 0.93 0.86 0.90 368 ``` --- layout: model title: Chunk Entity Resolver RxNorm-scdc author: John Snow Labs name: chunkresolve_rxnorm_in_clinical date: 2021-04-16 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to RxNorm codes using chunk embeddings (augmented with synonyms, four times richer than the previous resolver). 
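The resolver's core retrieval step is a nearest-neighbor lookup: the chunk embedding is compared against every indexed RxNorm entry and the closest one wins. A toy sketch of that idea with Euclidean distance (the vectors, codes, and 3-d index below are purely illustrative, not the model's real embeddings):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy embedding index: RxNorm code -> (preferred term, vector).
# The real resolver indexes full clinical chunk embeddings.
index = {
    "601021": ("metformin", [0.9, 0.1, 0.0]),
    "241604": ("glipizide", [0.1, 0.9, 0.0]),
}

def resolve(chunk_vec):
    """Return the (code, term) whose vector is nearest to chunk_vec."""
    code, (term, _) = min(index.items(),
                          key=lambda kv: euclidean(chunk_vec, kv[1][1]))
    return code, term

print(resolve([0.8, 0.2, 0.1]))  # ('601021', 'metformin')
```

The confidence column in the resolver's output reflects how decisively the nearest candidate beats the alternatives.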
## Predicted Entities RxNorm codes {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_RXNORM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_in_clinical_en_3.0.0_3.0_1618605233213.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_rxnorm_in_clinical_en_3.0.0_3.0_1618605233213.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... resolver = ChunkEntityResolverModel.pretrained("chunkresolve_rxnorm_in_clinical","en","clinical/models") .setInputCols("token","chunk_embeddings") .setOutputCol("entity") pipeline = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, resolver]) data = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation."]]).toDF("text") model = pipeline.fit(data) results = model.transform(data) ... ``` ```scala ... val resolver = ChunkEntityResolverModel.pretrained("chunkresolve_rxnorm_in_clinical","en","clinical/models") .setInputCols("token","chunk_embeddings") .setOutputCol("entity") val pipeline = new Pipeline().setStages(Array(document_assembler, sbert_embedder, resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. 
Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.").toDF("text") val result = pipeline.fit(data).transform(data) ```
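In the Results table that follows, `target_text` packs the top-ranked candidate resolutions into a single `:::`-separated string; splitting on that delimiter recovers the ordered candidate list (plain-Python sketch, candidates shortened for illustration):

```python
# One target_text cell as emitted by the resolver (truncated here).
target_text = ("Dapagliflozin Tablets:::dapagliflozin 5 mg oral tablet"
               ":::dapagliflozin 10 mg oral tablet")

candidates = target_text.split(":::")
print(candidates[0])   # best-ranked candidate: Dapagliflozin Tablets
print(len(candidates)) # 3
```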
## Results ```bash +---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ | chunk| entity| target_text| code|confidence| +---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ | metformin|TREATMENT|metFORMIN compounding powder:::Metformin Hydrochloride Powder:::metFORMIN 500 mg oral tablet:::me...| 601021| 0.2364| | glipizide|TREATMENT|Glipizide Powder:::Glipizide Crystal:::Glipizide Tablets:::glipiZIDE 5 mg oral tablet:::glipiZIDE...| 241604| 0.3647| |dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG|TREATMENT|Ezetimibe and Atorvastatin Tablets:::Amlodipine and Atorvastatin Tablets:::Atorvastatin Calcium T...|1422084| 0.3407| | dapagliflozin|TREATMENT|Dapagliflozin Tablets:::dapagliflozin 5 mg oral tablet:::dapagliflozin 10 mg oral tablet:::Dapagl...|1488568| 0.7070| +---------------------------------------------------------------+---------+----------------------------------------------------------------------------------------------------+-------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|chunkresolve_rxnorm_in_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token, chunk_embeddings]| |Output Labels:|[rxnorm_code]| |Language:|en| --- layout: model title: English RobertaForSequenceClassification Cased model (from ds198799) author: John Snow Labs name: roberta_classifier_autonlp_predict_roi_1_29797730 date: 2022-12-09 tags: [en, open_source, roberta, sequence_classification, classification, tensorflow] task: Text Classification language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover 
use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-predict_ROI_1-29797730` is an English model originally trained by `ds198799`. ## Predicted Entities `2.0`, `3.0`, `1.0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_classifier_autonlp_predict_roi_1_29797730_en_4.2.4_3.0_1670622865249.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_classifier_autonlp_predict_roi_1_29797730_en_4.2.4_3.0_1670622865249.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autonlp_predict_roi_1_29797730","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_classifier]) data = spark.createDataFrame([["I love you!"], ["I feel lucky to be here."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_autonlp_predict_roi_1_29797730","en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_classifier)) val data = Seq("I love you!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_classifier_autonlp_predict_roi_1_29797730| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|424.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ds198799/autonlp-predict_ROI_1-29797730 --- layout: model title: English BertForQuestionAnswering model (from internetoftim) author: John Snow Labs name: bert_qa_internetoftim_bert_large_uncased_squad date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-squad` is an English model originally trained by `internetoftim`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_internetoftim_bert_large_uncased_squad_en_4.0.0_3.0_1654536620322.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_internetoftim_bert_large_uncased_squad_en_4.0.0_3.0_1654536620322.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_internetoftim_bert_large_uncased_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_internetoftim_bert_large_uncased_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.large_uncased.by_internetoftim").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
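Under the hood, a span-classification QA model like the one above scores every token as a potential answer start and end; the answer is the best-scoring valid (start, end) pair. That selection can be sketched with toy logits (illustrative numbers only, not the model's real scores):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Return (start, end) maximizing start+end score, with start <= end
    and a bounded span length."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end   = [0.1, 0.1, 0.1, 4.8, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```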
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_internetoftim_bert_large_uncased_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|798.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/internetoftim/bert-large-uncased-squad --- layout: model title: Identify intent in general text - SNIPS dataset author: John Snow Labs name: classifierdl_use_snips date: 2021-02-15 task: Text Classification language: en nav_key: models edition: Spark NLP 2.7.3 spark_version: 2.4 tags: [open_source, classifier, en] supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Understand general commands and recognise the intent. ## Predicted Entities `AddToPlaylist`, `BookRestaurant`, `GetWeather`, `PlayMusic`, `RateBook`, `SearchCreativeWork`, `SearchScreeningEvent`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_CLS_SNIPS){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_snips_en_2.7.3_2.4_1613416966282.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_snips_en_2.7.3_2.4_1613416966282.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = UniversalSentenceEncoder.pretrained('tfhub_use', lang="en") \ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") classifier = ClassifierDLModel.pretrained('classifierdl_use_snips').setInputCols(['sentence_embeddings']).setOutputCol('class') nlp_pipeline = Pipeline(stages=[document_assembler, embeddings, classifier]) l_model = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = l_model.fullAnnotate(["i want to bring six of us to a bistro in town that serves hot chicken sandwich that is within the same area", "show weather forcast for t h stone memorial st joseph peninsula state park on one hour from now"]) ``` ```scala ... val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use", lang="en") .setInputCols(Array("document")) .setOutputCol("sentence_embeddings") val classifier = ClassifierDLModel.pretrained("classifierdl_use_snips", "en").setInputCols(Array("sentence_embeddings")).setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, embeddings, classifier)) val data = Seq("i want to bring six of us to a bistro in town that serves hot chicken sandwich that is within the same area", "show weather forcast for t h stone memorial st joseph peninsula state park on one hour from now").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.ner.snips").predict("""i want to bring six of us to a bistro in town that serves hot chicken sandwich that is within the same area""") ```
## Results ```bash +---------------------------------------------------------------------------------------------------------------+----------------+ | document | label | +---------------------------------------------------------------------------------------------------------------+----------------+ | i want to bring six of us to a bistro in town that serves hot chicken sandwich that is within the same area | BookRestaurant | | show weather forcast for t h stone memorial st joseph peninsula state park on one hour from now | GetWeather | +---------------------------------------------------------------------------------------------------------------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_use_snips| |Compatibility:|Spark NLP 2.7.3+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| ## Data Source This model is trained on the NLU Benchmark, SNIPS dataset https://github.com/MiuLab/SlotGated-SLU ## Benchmarking ```bash precision recall f1-score support AddToPlaylist 0.98 0.97 0.97 124 BookRestaurant 0.98 0.99 0.98 92 GetWeather 1.00 0.98 0.99 104 PlayMusic 0.85 0.95 0.90 86 RateBook 1.00 1.00 1.00 80 SearchCreativeWork 0.82 0.84 0.83 107 SearchScreeningEvent 0.95 0.85 0.90 107 accuracy 0.94 700 macro avg 0.94 0.94 0.94 700 weighted avg 0.94 0.94 0.94 700 ``` --- layout: model title: English BertForQuestionAnswering model (from Seongkyu) author: John Snow Labs name: bert_qa_Seongkyu_bert_base_cased_finetuned_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and 
production-readiness using Spark NLP. `bert-base-cased-finetuned-squad` is an English model originally trained by `Seongkyu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Seongkyu_bert_base_cased_finetuned_squad_en_4.0.0_3.0_1654179731471.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Seongkyu_bert_base_cased_finetuned_squad_en_4.0.0_3.0_1654179731471.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Seongkyu_bert_base_cased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_Seongkyu_bert_base_cased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_cased.by_Seongkyu").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_Seongkyu_bert_base_cased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Seongkyu/bert-base-cased-finetuned-squad --- layout: model title: ICD10CM Sentence Entity Resolver (Slim, normalized) author: John Snow Labs name: sbiobertresolve_icd10cm_slim_normalized date: 2022-05-12 tags: [licensed, clinical, en, entity_resolution, icd10] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.5.1 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts to ICD10 CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. In this model, synonyms having low cosine similarity to unnormalized terms are dropped, making the model slim. 
It also returns the official resolution text, in brackets, inside the metadata. ## Predicted Entities `ICD10 CM Codes` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_ICD10_CM.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_slim_normalized_en_3.5.1_3.0_1652337920061.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_slim_normalized_en_3.5.1_3.0_1652337920061.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("word_embeddings") ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token", "word_embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(["PROBLEM"]) c2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sentence_embeddings")\ .setCaseSensitive(False) icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_slim_normalized", "en", "clinical/models") \ .setInputCols(["sentence_embeddings"]) \ .setOutputCol("icd_code")\ .setDistanceFunction("EUCLIDEAN")\ .setReturnCosineDistances(True) resolver_pipeline = Pipeline( stages = [ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter, c2doc, sbert_embedder, icd_resolver ]) data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis and obesity , presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. 
Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection."""]]).toDF("text") result = resolver_pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("word_embeddings") val ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("PROBLEM")) val c2doc = new Chunk2Doc() .setInputCols(Array("ner_chunk")) .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sentence_embeddings") .setCaseSensitive(false) val resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_slim_normalized", "en", "clinical/models") .setInputCols(Array("sentence_embeddings")) .setOutputCol("icd_code") .setDistanceFunction("EUCLIDEAN") .setReturnCosineDistances(true) val resolver_pipeline = new Pipeline().setStages(Array(document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter, c2doc, sbert_embedder, resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode 
of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis and obesity , presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.").toDS.toDF("text") val results = resolver_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd10cm.slim_normalized").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis and obesity , presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.""") ```
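The resolver returns its top-k candidates as `:::`-delimited strings (`all_k_codes`, `all_k_resolutions`), with the official ICD-10-CM description in square brackets, as shown in the Results section. A minimal, pure-Python sketch of unpacking that metadata into `(code, matched term, official description)` tuples — `parse_candidates` is a hypothetical helper written for illustration, not a Spark NLP API:

```python
import re

def parse_candidates(all_k_codes: str, all_k_resolutions: str):
    """Split the ':::'-delimited resolver metadata into aligned tuples."""
    codes = all_k_codes.split(":::")
    resolutions = all_k_resolutions.split(":::")
    parsed = []
    for code, res in zip(codes, resolutions):
        # Each resolution looks like "matched term [Official ICD-10-CM description]"
        m = re.match(r"(.*?)\s*\[(.*)\]\s*$", res)
        term, official = (m.group(1), m.group(2)) if m else (res, "")
        parsed.append((code.strip(), term.strip(), official))
    return parsed

# Hard-coded excerpt standing in for the icd_code column's metadata
candidates = parse_candidates(
    "E11.9:::O24.41",
    "gestational diabetes mellitus [Type 2 diabetes mellitus without complications]"
    ":::gestational diabetes [Gestational diabetes mellitus]",
)
print(candidates[0])
```

In a real pipeline these strings come from the metadata of the `icd_code` output column; the excerpt above is hard-coded only to show the format.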
## Results ```bash +-------------------------------------+-------+--------+---------------------------------------------------------------------------------+---------------------------------------------------+ | chunk| entity|icd_code| all_k_resolutions| all_k_codes| +-------------------------------------+-------+--------+---------------------------------------------------------------------------------+---------------------------------------------------+ | gestational diabetes mellitus|PROBLEM| E11.9|gestational diabetes mellitus [Type 2 diabetes mellitus without complications]...|E11.9:::O24.41:::O24.919:::O24.419:::O24.439:::....| |subsequent type two diabetes mellitus|PROBLEM| O24.11|pre-existing type 2 diabetes mellitus [Pre-existing type 2 diabetes mellitus, ...|O24.11:::E11.8:::E11.9:::E11:::E13.9:::E11.3:::....| | T2DM|PROBLEM| E11.9|t2dm [Type 2 diabetes mellitus without complications]:::gm>2 [GM2 gangliosidos...|E11.9:::E75.00:::H35.89:::F80.0:::R44.8:::M79.89...| | HTG-induced pancreatitis|PROBLEM| F10.988|alcohol-induced pancreatitis [Alcohol use, unspecified with other alcohol-indu...|F10.988:::K85.9:::K85.3:::K85:::K85.2:::K85.8:::...| | acute hepatitis|PROBLEM| B17.9|acute hepatitis [Acute viral hepatitis, unspecified]:::acute hepatitis [Acute ...|B17.9:::K72.0:::B15.9:::B15:::B17.2:::Z03.89:::....| | obesity|PROBLEM| E66.8|abdominal obesity [Other obesity]:::overweight and obesity [Overweight and obe...|E66.8:::E66:::E66.01:::E66.9:::Z91.89:::E66.3:::...| | polyuria|PROBLEM| R35|polyuria [Polyuria]:::nocturnal polyuria [Nocturnal polyuria]:::other polyuria...|R35:::R35.81:::R35.89:::R31:::R30.0:::E72.01:::....| | polydipsia|PROBLEM| R63.1|polydipsia [Polydipsia]:::psychogenic polydipsia [Other impulse disorders]:::p...|R63.1:::F63.89:::O40.9XX0:::O40:::G47.50:::G47.5...| | poor appetite|PROBLEM| R63.0|poor appetite [Anorexia]:::patient dissatisfied with nutrition regime [Persons...|R63.0:::Z76.89:::R53.1:::R10.9:::R45.81:::R44.8:...| | 
vomiting|PROBLEM| R11.1|vomiting [Vomiting]:::vomiting [Vomiting, unspecified]:::intermittent vomiting...|R11.1:::R11.10:::R11:::G43.A:::G43.A0:::R11.0:::...| | a respiratory tract infection|PROBLEM| J06.9|upper respiratory tract infection [Acute upper respiratory infection, unspecif...|J06.9:::T17:::T17.9:::J04.10:::J22:::J98.8:::J98.9.| +-------------------------------------+-------+--------+---------------------------------------------------------------------------------+---------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icd10cm_slim_normalized| |Compatibility:|Healthcare NLP 3.5.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Size:|846.3 MB| |Case sensitive:|false| --- layout: model title: Translate English to Central Bikol Pipeline author: John Snow Labs name: translate_en_bcl date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, bcl, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `bcl` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_bcl_xx_2.7.0_2.4_1609699015704.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_bcl_xx_2.7.0_2.4_1609699015704.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_bcl", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_bcl", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.bcl').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_bcl| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBERT Embeddings (from Geotrend) author: John Snow Labs name: distilbert_embeddings_distilbert_base_en_cased date: 2022-04-12 tags: [distilbert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-en-cased` is an English model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_en_cased_en_3.4.2_3.0_1649783396033.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_en_cased_en_3.4.2_3.0_1649783396033.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_en_cased","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_en_cased","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.distilbert_base_en_cased").predict("""I love Spark NLP""") ```
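Downstream tasks typically compare the resulting token or sentence vectors by cosine similarity. A minimal plain-Python sketch of that comparison — the 4-dimensional toy vectors below stand in for the model's 768-dimensional outputs and are made up for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy vectors standing in for 768-dim DistilBERT embeddings
v_king = [0.2, 0.7, 0.1, 0.5]
v_queen = [0.25, 0.65, 0.05, 0.55]
v_car = [0.9, 0.1, 0.8, 0.0]

# Semantically closer words should score higher
print(cosine(v_king, v_queen) > cosine(v_king, v_car))  # → True
```

In practice you would read the real vectors from the `embeddings` output column of the pipeline above.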
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_base_en_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|243.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/distilbert-base-en-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from raisinbl) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_squad_2_384_1 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad_2_384_1` is an English model originally trained by `raisinbl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad_2_384_1_en_4.3.0_3.0_1672773697863.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad_2_384_1_en_4.3.0_3.0_1672773697863.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_2_384_1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_2_384_1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
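Under the hood, an extractive QA head scores every context token as a possible answer start and end, and the best-scoring valid span is returned. The annotator performs this decoding internally; the following is a plain-Python sketch with made-up scores, assuming a simple `start + end` scoring rule:

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick (start, end) maximizing start_score + end_score with end >= start."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best[:2]

context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley"]
# Toy scores: the model would produce one start and one end score per token
start = [0.1, 0.2, 0.1, 3.0, 0.0, 0.1, 0.2, 0.1, 0.5]
end   = [0.0, 0.1, 0.2, 2.5, 0.1, 0.0, 0.1, 0.2, 0.4]
i, j = best_span(start, end)
print(" ".join(context[i:j + 1]))  # → Clara
```

Real models emit logits over subword tokens and also score a "no answer" option for SQuAD 2.0-style data; this sketch omits both for brevity.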
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_squad_2_384_1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/raisinbl/distilbert-base-uncased-finetuned-squad_2_384_1 --- layout: model title: Multilingual DistilBertForQuestionAnswering Base Cased model (from Khanh) author: John Snow Labs name: distilbert_qa_khanh_base_cased_finetuned_squad date: 2023-01-03 tags: [xx, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-multilingual-cased-finetuned-squad` is a Multilingual model originally trained by `Khanh`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_khanh_base_cased_finetuned_squad_xx_4.3.0_3.0_1672767135066.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_khanh_base_cased_finetuned_squad_xx_4.3.0_3.0_1672767135066.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_khanh_base_cased_finetuned_squad","xx")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_khanh_base_cased_finetuned_squad","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_khanh_base_cased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|505.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Khanh/distilbert-base-multilingual-cased-finetuned-squad --- layout: model title: Stopwords Remover for Nepali language (488 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, ne, open_source] task: Stop Words Removal language: ne edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_ne_3.4.1_3.0_1646672355573.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_ne_3.4.1_3.0_1646672355573.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","ne") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["तिमी म भन्दा राम्रो छैनौ"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","ne") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("तिमी म भन्दा राम्रो छैनौ").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ne.stopwords").predict("""तिमी म भन्दा राम्रो छैनौ""") ```
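Conceptually, the cleaner simply drops any token found in its stopword list. A plain-Python sketch of that filtering step — the four-word set below is an illustrative subset chosen to reproduce the example sentence's result, not the model's actual 488-entry list:

```python
# Tiny illustrative stopword subset (the real model ships 488 entries)
stopwords = {"तिमी", "म", "भन्दा", "राम्रो"}

tokens = ["तिमी", "म", "भन्दा", "राम्रो", "छैनौ"]
# Keep only tokens that are not stopwords
clean = [t for t in tokens if t not in stopwords]
print(clean)  # → ['छैनौ']
```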
## Results ```bash +------+ |result| +------+ |[छैनौ]| +------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|ne| |Size:|3.4 KB| --- layout: model title: Chinese Bert Embeddings (Base, Finance) author: John Snow Labs name: bert_embeddings_mengzi_bert_base_fin date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `mengzi-bert-base-fin` is a Chinese model originally trained by `Langboat`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_mengzi_bert_base_fin_zh_3.4.2_3.0_1649669735398.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_mengzi_bert_base_fin_zh_3.4.2_3.0_1649669735398.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_mengzi_bert_base_fin","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_mengzi_bert_base_fin","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.mengzi_bert_base_fin").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_mengzi_bert_base_fin| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|383.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/Langboat/mengzi-bert-base-fin - https://arxiv.org/abs/2110.06696 --- layout: model title: Legal Subcontractors Clause Binary Classifier author: John Snow Labs name: legclf_subcontractors_clause date: 2023-01-29 tags: [en, legal, classification, subcontractors, clauses, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `subcontractors` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that this model's embeddings allow up to 512 tokens; if you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). 
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, yielding a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `subcontractors`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_subcontractors_clause_en_1.0.0_3.0_1674993925698.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_subcontractors_clause_en_1.0.0_3.0_1674993925698.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_subcontractors_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
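When several binary clause classifiers are stacked as described above, their `category` outputs can be collapsed into one boolean flag per clause type. A hypothetical post-processing sketch in plain Python (not a Spark NLP API) — `clause_flags` and the example labels are made up for illustration:

```python
def clause_flags(categories):
    """Map {clause_type: predicted_label} to {clause_type: True/False}.

    Each binary classifier predicts either its own clause type or 'other',
    so a prediction equal to the clause type means the clause is present.
    """
    return {clause: label == clause for clause, label in categories.items()}

# Hypothetical predictions from two stacked clause classifiers
preds = {"subcontractors": "subcontractors", "indemnification": "other"}
print(clause_flags(preds))  # → {'subcontractors': True, 'indemnification': False}
```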
## Results ```bash +----------------+ |          result| +----------------+ |[subcontractors]| |         [other]| |         [other]| |[subcontractors]| +----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_subcontractors_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 0.97 0.99 39 subcontractors 0.95 1.00 0.98 21 accuracy - - 0.98 60 macro-avg 0.98 0.99 0.98 60 weighted-avg 0.98 0.98 0.98 60 ``` --- layout: model title: Pipeline for Medical Text Summarization author: John Snow Labs name: summarizer_generic_jsl_pipeline date: 2023-05-29 tags: [licensed, en, clinical, text_summarization] task: Summarization language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [summarizer_generic_jsl](https://nlp.johnsnowlabs.com/2023/03/30/summarizer_generic_jsl_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_generic_jsl_pipeline_en_4.4.2_3.0_1685400186557.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_generic_jsl_pipeline_en_4.4.2_3.0_1685400186557.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("summarizer_generic_jsl_pipeline", "en", "clinical/models") text = """Patient with hypertension, syncope, and spinal stenosis - for recheck. (Medical Transcription Sample Report) SUBJECTIVE: The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS: Reviewed and unchanged from the dictation on 12/03/2003. MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash. """ result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("summarizer_generic_jsl_pipeline", "en", "clinical/models") val text = """Patient with hypertension, syncope, and spinal stenosis - for recheck. (Medical Transcription Sample Report) SUBJECTIVE: The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS: Reviewed and unchanged from the dictation on 12/03/2003. MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash. """ val result = pipeline.fullAnnotate(text) ```
## Results ```bash The patient is 78 years old and has hypertension. She has a history of chest pain, palpations, orthopedics, and spinal stenosis. She has a prescription of Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin, and TriViFlor 25 mg two pills daily. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_generic_jsl_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|936.7 MB| ## Included Models - DocumentAssembler - MedicalSummarizer --- layout: model title: Pipeline to Detect bacterial species author: John Snow Labs name: ner_bacterial_species_pipeline date: 2023-03-14 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_bacterial_species](https://nlp.johnsnowlabs.com/2021/04/01/ner_bacterial_species_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_bacterial_species_pipeline_en_4.3.0_3.2_1678804215844.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_bacterial_species_pipeline_en_4.3.0_3.2_1678804215844.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_bacterial_species_pipeline", "en", "clinical/models") text = '''Based on these genetic and phenotypic properties, we propose that strain SMSP (T) represents a novel species of the genus Methanoregula, for which we propose the name Methanoregula formicica sp. nov., with the type strain SMSP (T) (= NBRC 105244 (T) = DSM 22288 (T)).''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_bacterial_species_pipeline", "en", "clinical/models") val text = "Based on these genetic and phenotypic properties, we propose that strain SMSP (T) represents a novel species of the genus Methanoregula, for which we propose the name Methanoregula formicica sp. nov., with the type strain SMSP (T) (= NBRC 105244 (T) = DSM 22288 (T))." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.bacterial_species.pipeline").predict("""Based on these genetic and phenotypic properties, we propose that strain SMSP (T) represents a novel species of the genus Methanoregula, for which we propose the name Methanoregula formicica sp. nov., with the type strain SMSP (T) (= NBRC 105244 (T) = DSM 22288 (T)).""") ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:------------------------|--------:|------:|:------------|-------------:| | 0 | SMSP (T) | 73 | 80 | SPECIES | 0.9725 | | 1 | Methanoregula formicica | 167 | 189 | SPECIES | 0.97935 | | 2 | SMSP (T) | 222 | 229 | SPECIES | 0.991975 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_bacterial_species_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Named Entity Recognition (NER) Model in Danish (Dane 6B 300) author: John Snow Labs name: dane_ner_6B_300 date: 2020-08-30 task: Named Entity Recognition language: da edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [ner, da, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description DaNE is a Named Entity Recognition (or NER) model for Danish, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. Dane NER 6B 300 is trained with GloVe 6B 300 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_DA/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/dane_ner_6B_300_da_2.6.0_2.4_1598810268069.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/dane_ner_6B_300_da_2.6.0_2.4_1598810268069.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang = "xx") \ .setInputCols(['document', 'token']) \ .setOutputCol('embeddings') ner_model = NerDLModel.pretrained("dane_ner_6B_300", "da") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, softwareudvikler, investor og filantrop. Han er bedst kendt som medstifter af Microsoft Corporation. I løbet af sin karriere hos Microsoft havde Gates stillinger som formand, administrerende direktør (administrerende direktør), præsident og chefsoftwarearkitekt, samtidig med at han var den største individuelle aktionær indtil maj 2014. Han er en af \u200b\u200bde mest kendte iværksættere og pionerer inden for mikrocomputerrevolution i 1970'erne og 1980'erne. Født og opvokset i Seattle, Washington, var Gates grundlægger af Microsoft sammen med barndomsvennen Paul Allen i 1975 i Albuquerque, New Mexico; det fortsatte med at blive verdens største virksomhed inden for personlig computersoftware. Gates førte virksomheden som formand og administrerende direktør, indtil han trådte tilbage som administrerende direktør i januar 2000, men han forblev formand og blev chefsoftwarearkitekt. I slutningen af \u200b\u200b1990'erne var Gates blevet kritiseret for sin forretningstaktik, der er blevet betragtet som konkurrencebegrænsende. Denne udtalelse er blevet opretholdt ved adskillige retsafgørelser. 
I juni 2006 meddelte Gates, at han ville overgå til en deltidsrolle i Microsoft og fuldtidsarbejde i Bill & Melinda Gates Foundation, det private velgørende fundament, som han og hans kone, Melinda Gates, oprettede i 2000. Han overførte gradvist sine pligter til Ray Ozzie og Craig Mundie. Han trådte tilbage som formand for Microsoft i februar 2014 og tiltrådte en ny stilling som teknologirådgiver for at støtte den nyudnævnte administrerende direktør Satya Nadella."]], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang = "xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("dane_ner_6B_300", "da") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, softwareudvikler, investor og filantrop. Han er bedst kendt som medstifter af Microsoft Corporation. I løbet af sin karriere hos Microsoft havde Gates stillinger som formand, administrerende direktør (administrerende direktør), præsident og chefsoftwarearkitekt, samtidig med at han var den største individuelle aktionær indtil maj 2014. Han er en af de mest kendte iværksættere og pionerer inden for mikrocomputerrevolution i 1970'erne og 1980'erne. Født og opvokset i Seattle, Washington, var Gates grundlægger af Microsoft sammen med barndomsvennen Paul Allen i 1975 i Albuquerque, New Mexico; det fortsatte med at blive verdens største virksomhed inden for personlig computersoftware. Gates førte virksomheden som formand og administrerende direktør, indtil han trådte tilbage som administrerende direktør i januar 2000, men han forblev formand og blev chefsoftwarearkitekt.
I slutningen af ​​1990'erne var Gates blevet kritiseret for sin forretningstaktik, der er blevet betragtet som konkurrencebegrænsende. Denne udtalelse er blevet opretholdt ved adskillige retsafgørelser. I juni 2006 meddelte Gates, at han ville overgå til en deltidsrolle i Microsoft og fuldtidsarbejde i Bill & Melinda Gates Foundation, det private velgørende fundament, som han og hans kone, Melinda Gates, oprettede i 2000. Han overførte gradvist sine pligter til Ray Ozzie og Craig Mundie. Han trådte tilbage som formand for Microsoft i februar 2014 og tiltrådte en ny stilling som teknologirådgiver for at støtte den nyudnævnte administrerende direktør Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (født 28. oktober 1955) er en amerikansk forretningsmagnat, softwareudvikler, investor og filantrop. Han er bedst kendt som medstifter af Microsoft Corporation. I løbet af sin karriere hos Microsoft havde Gates stillinger som formand, administrerende direktør (administrerende direktør), præsident og chefsoftwarearkitekt, samtidig med at han var den største individuelle aktionær indtil maj 2014. Han er en af ​​de mest kendte iværksættere og pionerer inden for mikrocomputerrevolution i 1970'erne og 1980'erne. Født og opvokset i Seattle, Washington, var Gates grundlægger af Microsoft sammen med barndomsvennen Paul Allen i 1975 i Albuquerque, New Mexico; det fortsatte med at blive verdens største virksomhed inden for personlig computersoftware. Gates førte virksomheden som formand og administrerende direktør, indtil han trådte tilbage som administrerende direktør i januar 2000, men han forblev formand og blev chefsoftwarearkitekt. I slutningen af ​​1990'erne var Gates blevet kritiseret for sin forretningstaktik, der er blevet betragtet som konkurrencebegrænsende. Denne udtalelse er blevet opretholdt ved adskillige retsafgørelser. 
I juni 2006 meddelte Gates, at han ville overgå til en deltidsrolle i Microsoft og fuldtidsarbejde i Bill & Melinda Gates Foundation, det private velgørende fundament, som han og hans kone, Melinda Gates, oprettede i 2000. Han overførte gradvist sine pligter til Ray Ozzie og Craig Mundie. Han trådte tilbage som formand for Microsoft i februar 2014 og tiltrådte en ny stilling som teknologirådgiver for at støtte den nyudnævnte administrerende direktør Satya Nadella."""] ner_df = nlu.load('da.ner.6B300D').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title} ## Results ```bash +-------------------------------+---------+ |chunk |ner_label| +-------------------------------+---------+ |William Henry Gates |PER | |amerikansk |MISC | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |ORG | |1970'erne |MISC | |1980'erne |MISC | |Seattle |LOC | |Washington |LOC | |Gates |LOC | |Microsoft |ORG | |Paul Allen |PER | |Albuquerque |LOC | |New Mexico |LOC | |Gates |PER | |1990'erne |MISC | |Gates |PER | |Gates |PER | |Microsoft |ORG | |Bill & Melinda Gates Foundation|ORG | +-------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|dane_ner_6B_300| |Type:|ner| |Compatibility:| Spark NLP 2.6.0+| |Edition:|Official| |License:|Open Source| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|da| |Case sensitive:|false| {:.h2_title} ## Data Source The detailed information can be found at [https://www.aclweb.org/anthology/2020.lrec-1.565.pdf](https://www.aclweb.org/anthology/2020.lrec-1.565.pdf) --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_2 TFWav2Vec2ForCTC from doddle124578 author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab_2 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_2` is an English model originally trained by doddle124578.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use `asr_wav2vec2_base_timit_demo_colab_2_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_2_en_4.2.0_3.0_1664038077329.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_2_en_4.2.0_3.0_1664038077329.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_2", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_2", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
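The `audio_content` column fed to `AudioAssembler` holds raw audio samples as floats. How you produce that array is outside the snippet above; below is one stdlib-only sketch for 16-bit PCM mono WAV files. The normalization constant and the mono/16-bit restrictions are assumptions for the sketch — libraries like librosa or soundfile are the more common route, and the model expects 16 kHz audio.

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit PCM mono WAV file into a list of floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expects 16-bit PCM"
        assert wf.getnchannels() == 1, "expects mono audio"
        frames = wf.readframes(wf.getnframes())
    # "<h" = little-endian signed 16-bit; divide by 32768 to normalize.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# The resulting list can then be wrapped into the expected one-column DataFrame:
# audioDf = spark.createDataFrame([[wav_to_floats("sample.wav")]], ["audio_content"])
```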
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab_2| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|354.9 MB| --- layout: model title: Portuguese BertForTokenClassification Cased model (from LucasFerro) author: John Snow Labs name: bert_token_classifier_biobertpt_clin_tempclinbr date: 2022-11-30 tags: [pt, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: pt edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biobertpt-clin-tempclinbr` is a Portuguese model originally trained by `LucasFerro`. ## Predicted Entities `Teste`, `Problema`, `DepartamentoClinico`, `Tratamento`, `Evidencia`, ``, `Ocorrencia` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_biobertpt_clin_tempclinbr_pt_4.2.4_3.0_1669822235861.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_biobertpt_clin_tempclinbr_pt_4.2.4_3.0_1669822235861.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_biobertpt_clin_tempclinbr","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_biobertpt_clin_tempclinbr","pt") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
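The token classifier writes one IOB-style tag per token to the `ner` column; in a full pipeline, a NerConverter stage merges those tags into entity chunks. As a plain-Python illustration of that merge step (not the actual Spark NLP implementation — the sample tokens and tags are invented, reusing labels from the Predicted Entities list above):

```python
def merge_bio(tokens, tags):
    """Merge parallel token/IOB-tag lists into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and label != tag[2:]):
            if current:                       # close any open chunk first
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-"):            # continue the current chunk
            current.append(tok)
        else:                                 # "O" tag ends any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(merge_bio(
    ["Paciente", "com", "cefaleia", "há", "3", "dias"],
    ["O", "O", "B-Problema", "O", "B-Ocorrencia", "I-Ocorrencia"],
))  # → [('cefaleia', 'Problema'), ('3 dias', 'Ocorrencia')]
```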
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_biobertpt_clin_tempclinbr| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|pt| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/LucasFerro/biobertpt-clin-tempclinbr --- layout: model title: Named Entity Recognition - ELECTRA Base (OntoNotes) author: John Snow Labs name: onto_electra_base_uncased date: 2020-12-05 task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [ner, open_source, en] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Onto is a Named Entity Recognition (or NER) model trained on OntoNotes 5.0. It can extract up to 18 entities such as people, places, organizations, money, time, date, etc. This model uses the pretrained `electra_base_uncased` embeddings model from the `BertEmbeddings` annotator as an input. ## Predicted Entities `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_electra_base_uncased_en_2.7.0_2.4_1607203076517.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_electra_base_uncased_en_2.7.0_2.4_1607203076517.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("electra_base_uncased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") ner_onto = NerDLModel.pretrained("onto_electra_base_uncased", "en") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."]], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("electra_base_uncased", "en") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_onto = NerDLModel.pretrained("onto_electra_base_uncased", "en") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter)) val data = Seq("William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."""] ner_df = nlu.load('en.ner.onto.electra.uncased_base').predict(text, output_level='chunk') ner_df[["entities", "entities_class"]] ```
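The Benchmarking section below reports a micro-average alongside per-label CoNLL scores. Micro-averaging pools correct, gold, and predicted counts across all labels before computing precision, recall, and F1 (rather than averaging per-label scores); a small sketch with invented counts:

```python
def micro_prf(counts):
    """counts: {label: (correct, gold, predicted)} -> (precision, recall, f1)."""
    correct = sum(c for c, _, _ in counts.values())
    gold = sum(g for _, g, _ in counts.values())
    pred = sum(p for _, _, p in counts.values())
    precision = correct / pred
    recall = correct / gold
    return precision, recall, 2 * precision * recall / (precision + recall)

# Invented counts for two labels, just to show the pooling.
p, r, f1 = micro_prf({"PERSON": (90, 100, 95), "ORG": (40, 60, 55)})
print(round(p, 4), round(r, 4), round(f1, 4))  # → 0.8667 0.8125 0.8387
```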
{:.h2_title} ## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |William Henry Gates III|PERSON | |October 28, 1955 |DATE | |American |NORP | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PERSON | |May 2014 |DATE | |one |CARDINAL | |the 1970s |DATE | |1980s |DATE | |Seattle |GPE | |Washington |GPE | |Gates |PERSON | |Microsoft |ORG | |Paul Allen |PERSON | |1975 |DATE | |Albuquerque |GPE | |New Mexico |GPE | |Gates |PERSON | |January 2000 |DATE | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_electra_base_uncased| |Type:|ner| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source The model is trained based on data from [OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) ## Benchmarking ```bash Micro-average: prec: 0.88154626, rec: 0.88217854, f1: 0.8818623 CoNLL Eval: processed 152728 tokens with 11257 phrases; found: 11296 phrases; correct: 9871. 
accuracy: 97.56%; precision: 87.38%; recall: 87.69%; FB1: 87.54

label        correct   gold  found  precision  recall    FB1
CARDINAL         800    935    959     83.42%  85.56%  84.48
DATE            1396   1602   1652     84.50%  87.14%  85.80
EVENT             23     63     38     60.53%  36.51%  45.54
FAC               60    135     81     74.07%  44.44%  55.56
GPE             2102   2240   2205     95.33%  93.84%  94.58
LANGUAGE           8     22     10     80.00%  36.36%  50.00
LAW               16     40     21     76.19%  40.00%  52.46
LOC              118    179    191     61.78%  65.92%  63.78
MONEY            285    314    329     86.63%  90.76%  88.65
NORP             801    841    897     89.30%  95.24%  92.17
ORDINAL          180    195    225     80.00%  92.31%  85.71
ORG             1538   1795   1816     84.69%  85.68%  85.18
PERCENT          312    349    348     89.66%  89.40%  89.53
PERSON          1892   1988   2025     93.43%  95.17%  94.29
PRODUCT           39     76     49     79.59%  51.32%  62.40
QUANTITY          81    105    102     79.41%  77.14%  78.26
TIME             136    212    216     62.96%  64.15%  63.55
WORK_OF_ART       84    166    132     63.64%  50.60%  56.38
``` --- layout: model title: Fast Neural Machine Translation Model from English to Tetela author: John Snow Labs name: opus_mt_en_tll date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, tll, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team.
Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `en` - target languages: `tll` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_tll_xx_2.7.0_2.4_1609170104300.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_tll_xx_2.7.0_2.4_1609170104300.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_tll", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("PUT YOUR STRING HERE") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_tll", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.tll').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_tll| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Detect Anatomical Regions (embeddings_clinical_large) author: John Snow Labs name: ner_anatomy_emb_clinical_large date: 2023-05-15 tags: [ner, clinical, licensed, en, anatomy] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for anatomy terms. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. ## Predicted Entities {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb#scrollTo=rUehS3qTdHUh){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_emb_clinical_large_en_4.4.2_3.0_1684140698076.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_emb_clinical_large_en_4.4.2_3.0_1684140698076.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") anatomy_ner = MedicalNerModel.pretrained("ner_anatomy_emb_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("anatomy_ner") anatomy_ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "anatomy_ner"]) \ .setOutputCol("anatomy_ner_chunk") posology_ner_pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, word_embeddings, anatomy_ner, anatomy_ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") posology_ner_model = posology_ner_pipeline.fit(empty_data) results = posology_ner_model.transform(spark.createDataFrame([['''This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear.
Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.''']]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val anatomy_ner_model = MedicalNerModel.pretrained("ner_anatomy_emb_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("anatomy_ner") val anatomy_ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "anatomy_ner")) .setOutputCol("anatomy_ner_chunk") val posology_pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, anatomy_ner_model, anatomy_ner_converter)) val data = Seq("""This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema.
Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.""").toDS.toDF("text") val result = posology_pipeline.fit(data).transform(data) ```
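In the Results table below, `begin` and `end` are character offsets into the input text, and `end` points at the last character of the chunk (inclusive), so Python slicing needs `end + 1`. A self-contained check of that convention on a toy string (the offsets here are computed for this example, not taken from the model):

```python
def chunk_at(text, begin, end):
    """Spark NLP-style offsets: begin/end are inclusive character indices."""
    return text[begin:end + 1]

text = "Some dryness of her skin."
# "skin" starts at index 20 and ends at index 23 (inclusive).
print(chunk_at(text, 20, 23))  # → skin
```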
## Results ```bash | | chunks | begin | end | entities | |---:|:--------------------|--------:|------:|:-----------------------| | 0 | toe | 326 | 328 | Organism_subdivision | | 1 | redness | 348 | 354 | Pathological_formation | | 2 | erythema | 360 | 367 | Pathological_formation | | 3 | skin | 374 | 377 | Organ | | 4 | Extraocular muscles | 574 | 592 | Organ | | 5 | turbinates | 659 | 668 | Multi-tissue_structure | | 6 | Mucous membranes | 716 | 731 | Tissue | | 7 | Neck | 744 | 747 | Organism_subdivision | | 8 | bowel sounds | 802 | 813 | Pathological_formation | | 9 | toe | 904 | 906 | Organ | | 10 | skin | 956 | 959 | Organ | | 11 | toe | 1046 | 1048 | Organ | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_anatomy_emb_clinical_large| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|2.8 MB| ## References Trained on the Anatomical Entity Mention (AnEM) corpus: [http://www.nactem.ac.uk/anatomy/](http://www.nactem.ac.uk/anatomy/) ## Benchmarking ```bash label precision recall f1-score support tissue_structure 0.69 0.77 0.73 130 Organ 0.95 0.81 0.88 52 Cell 0.92 0.96 0.94 118 Organism_subdivision 0.85 0.50 0.63 22 Pathological_formation 0.98 0.86 0.92 58 Cellular_component 0.54 0.50 0.52 26 Organism_substance 0.91 0.74 0.82 43 Anatomical_system 1.00 0.67 0.80 6 Immaterial_anatomical_entity 1.00 0.33 0.50 6 Tissue 0.67 0.62 0.65 32 Developing_anatomical_structure 1.00 0.20 0.33 5 micro-avg 0.82 0.78 0.80 498 macro-avg 0.87 0.63 0.70 498 weighted-avg 0.83 0.78 0.80 498 ``` --- layout: model title: English asr_wav2vec2_common_voice_accents_3 TFWav2Vec2ForCTC from willcai author: John Snow Labs name: asr_wav2vec2_common_voice_accents_3 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC 
article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_common_voice_accents_3` is an English model originally trained by willcai. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_common_voice_accents_3_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_common_voice_accents_3_en_4.2.0_3.0_1664122762264.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_common_voice_accents_3_en_4.2.0_3.0_1664122762264.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_common_voice_accents_3", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_common_voice_accents_3", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
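Both snippets assume an existing `audioDf` DataFrame whose `audio_content` column holds the raw audio as an array of floats (Wav2Vec2 models expect 16 kHz mono). As an illustrative sketch only — the helper and file name below are hypothetical, not part of this model card — one way to decode a 16-bit PCM WAV file into that float representation with the Python standard library:

```python
import struct
import wave

def wav_to_floats(path):
    # Decode a mono 16-bit PCM WAV file into a list of floats in [-1, 1],
    # the representation AudioAssembler's input column expects.
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]
```

`audioDf` could then be built with something like `spark.createDataFrame([(wav_to_floats("sample.wav"),)], ["audio_content"])`, where `sample.wav` stands in for your own recording.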
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_common_voice_accents_3| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Fast Neural Machine Translation Model from English to Dravidian Languages author: John Snow Labs name: opus_mt_en_dra date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, dra, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `dra` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_dra_xx_2.7.0_2.4_1609170416753.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_dra_xx_2.7.0_2.4_1609170416753.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_dra", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Your sentence to translate!"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_dra", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.dra').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_dra| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_roberta_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739375389.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739375389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_roberta_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_rule_based_roberta_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base_rule_based_hier_quadruplet_0.1_epochs_1_shard_1.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_roberta_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|460.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_hier_quadruplet_0.1_epochs_1_shard_1_squad2.0 --- layout: model title: English DistilBertForQuestionAnswering model (from FabianWillner) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_triviaqa date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-triviaqa` is an English model originally trained by `FabianWillner`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_triviaqa_en_4.0.0_3.0_1654726988765.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_triviaqa_en_4.0.0_3.0_1654726988765.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_triviaqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_triviaqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.trivia.distil_bert.base_uncased").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_triviaqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/FabianWillner/distilbert-base-uncased-finetuned-triviaqa --- layout: model title: Translate English to Galician Pipeline author: John Snow Labs name: translate_en_gl date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, gl, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `gl` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_gl_xx_2.7.0_2.4_1609690791897.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_gl_xx_2.7.0_2.4_1609690791897.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_gl", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_gl", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.gl').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_gl| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_dm768 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-dm768` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dm768_en_4.3.0_3.0_1675119328836.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_dm768_en_4.3.0_3.0_1675119328836.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_dm768","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_dm768","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_dm768| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|221.5 MB| ## References - https://huggingface.co/google/t5-efficient-small-dm768 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Pipeline to Detect Living Species author: John Snow Labs name: bert_token_classifier_ner_living_species_pipeline date: 2023-03-20 tags: [it, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: it edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [bert_token_classifier_ner_living_species](https://nlp.johnsnowlabs.com/2022/06/27/bert_token_classifier_ner_living_species_it_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_pipeline_it_4.3.0_3.2_1679304618895.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_pipeline_it_4.3.0_3.2_1679304618895.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_living_species_pipeline", "it", "clinical/models") text = '''Una donna di 74 anni è stata ricoverata con dolore addominale diffuso, ipossia e astenia di 2 settimane di evoluzione. La sua storia personale includeva ipertensione in trattamento con amiloride/idroclorotiazide e dislipidemia controllata con lovastatina. La sua storia familiare era: madre morta di cancro gastrico, fratello con cirrosi epatica di eziologia sconosciuta e sorella con carcinoma epatocellulare. Lo studio eziologico delle diverse cause di malattia epatica cronica comprendeva: virus epatotropi (HBV, HCV) e HIV, studio dell'autoimmunità, ceruloplasmina, ferritina e porfirine nelle urine, tutti risultati negativi. Il paziente è stato messo in trattamento anticoagulante con acenocumarolo e diuretici a tempo indeterminato.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_living_species_pipeline", "it", "clinical/models") val text = "Una donna di 74 anni è stata ricoverata con dolore addominale diffuso, ipossia e astenia di 2 settimane di evoluzione. La sua storia personale includeva ipertensione in trattamento con amiloride/idroclorotiazide e dislipidemia controllata con lovastatina. La sua storia familiare era: madre morta di cancro gastrico, fratello con cirrosi epatica di eziologia sconosciuta e sorella con carcinoma epatocellulare. Lo studio eziologico delle diverse cause di malattia epatica cronica comprendeva: virus epatotropi (HBV, HCV) e HIV, studio dell'autoimmunità, ceruloplasmina, ferritina e porfirine nelle urine, tutti risultati negativi. Il paziente è stato messo in trattamento anticoagulante con acenocumarolo e diuretici a tempo indeterminato." 
val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-----------------|--------:|------:|:------------|-------------:| | 0 | donna | 4 | 8 | HUMAN | 0.999699 | | 1 | personale | 133 | 141 | HUMAN | 0.999581 | | 2 | madre | 285 | 289 | HUMAN | 0.999633 | | 3 | fratello | 317 | 324 | HUMAN | 0.999641 | | 4 | sorella | 373 | 379 | HUMAN | 0.999622 | | 5 | virus epatotropi | 493 | 508 | SPECIES | 0.999333 | | 6 | HBV | 511 | 513 | SPECIES | 0.99968 | | 7 | HCV | 516 | 518 | SPECIES | 0.999616 | | 8 | HIV | 523 | 525 | SPECIES | 0.999383 | | 9 | paziente | 634 | 641 | HUMAN | 0.99977 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_living_species_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|it| |Size:|410.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Extract Negation and Uncertainty Entities from Spanish Medical Texts (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_negation_uncertainty date: 2022-08-11 tags: [es, clinical, licensed, token_classification, bert, ner, negation, uncertainty, linguistics] task: Named Entity Recognition language: es edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Named Entity Recognition model is intended for detecting relevant entities from Spanish medical texts and trained using the BertForTokenClassification method from the transformers library and [BERT based](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) embeddings. The model detects Negation Trigger (NEG), Negation Scope (NSCO), Uncertainty Trigger (UNC) and Uncertainty Scope (USCO). 
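The classifier emits one BIO tag per token (`B-NEG`, `I-NSCO`, `O`, …), and the `NerConverter` stage shown in the pipeline below merges those tags into labeled chunks. As a rough illustration of that merging step (a simplified pure-Python sketch, not Spark NLP's actual implementation):

```python
def bio_to_chunks(tokens, tags):
    # Group BIO-tagged tokens into (chunk_text, label) pairs:
    # B-X starts a chunk, a matching I-X continues it, anything else closes it.
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks
```

For instance, `bio_to_chunks(["no", "conocida", "previamente"], ["B-NEG", "B-NSCO", "I-NSCO"])` yields `[("no", "NEG"), ("conocida previamente", "NSCO")]`, mirroring the trigger/scope chunks in the results below.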
## Predicted Entities `NEG`, `NSCO`, `UNC`, `USCO` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_negation_uncertainty_es_4.0.2_3.0_1660231547751.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_negation_uncertainty_es_4.0.2_3.0_1660231547751.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_negation_uncertainty", "es", "clinical/models")\ .setInputCols("token", "sentence")\ .setOutputCol("label")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["sentence","token","label"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame(["Con diagnóstico probable de cirrosis hepática (no conocida previamente) y peritonitis espontanea primaria con tratamiento durante 8 dias con ceftriaxona en el primer ingreso (no se realizó paracentesis control por escasez de liquido). Lesión tumoral en hélix izquierdo de 0,5 cms. 
de diámetro susceptible de ca basocelular perlado."], StringType()).toDF("text") result = model.transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_negation_uncertainty", "es", "clinical/models") .setInputCols(Array("token", "sentence")) .setOutputCol("label") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("sentence","token","label")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val data = Seq("""Con diagnóstico probable de cirrosis hepática (no conocida previamente) y peritonitis espontanea primaria con tratamiento durante 8 dias con ceftriaxona en el primer ingreso (no se realizó paracentesis control por escasez de liquido). Lesión tumoral en hélix izquierdo de 0,5 cms. de diámetro susceptible de ca basocelular perlado.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.classify.bert_token.negation_uncertainty").predict("""Con diagnóstico probable de cirrosis hepática (no conocida previamente) y peritonitis espontanea primaria con tratamiento durante 8 dias con ceftriaxona en el primer ingreso (no se realizó paracentesis control por escasez de liquido). Lesión tumoral en hélix izquierdo de 0,5 cms. de diámetro susceptible de ca basocelular perlado.""") ```
## Results ```bash +------------------------------------------------------+---------+ |chunk |ner_label| +------------------------------------------------------+---------+ |probable |UNC | |de cirrosis hepática |USCO | |no |NEG | |conocida previamente |NSCO | |no |NEG | |se realizó paracentesis control por escasez de liquido|NSCO | |susceptible de |UNC | |ca basocelular perlado |USCO | +------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_negation_uncertainty| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|es| |Size:|410.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References The model is prepared using the reference paper: "NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts", Salvador Lima-López, Eulàlia Farré-Maduell, Antonio Miranda-Escalada, Vicent Briva-Iglesias and Martin Krallinger. Procesamiento del Lenguaje Natural, Revista nº 67, septiembre de 2021, pp. 243-256. 
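The benchmarking section that follows reports micro-, macro-, and weighted-average scores alongside the per-label rows. As a reminder of how the macro and weighted summary rows are derived from the per-label rows (a small illustrative sketch, not part of the evaluation code):

```python
def macro_and_weighted_f1(rows):
    # rows: list of (precision, recall, f1, support) tuples, one per label.
    # Macro-average treats every label equally; weighted-average weights
    # each label's F1 by its support (number of gold instances).
    total_support = sum(support for *_, support in rows)
    macro = sum(f1 for _, _, f1, _ in rows) / len(rows)
    weighted = sum(f1 * support for _, _, f1, support in rows) / total_support
    return macro, weighted
```

With `rows = [(1.0, 1.0, 1.0, 3), (0.0, 0.0, 0.0, 1)]`, for example, the macro F1 is 0.5 while the weighted F1 is 0.75, which is why the two rows can differ noticeably when label supports are unbalanced.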
## Benchmarking ```bash label precision recall f1-score support B-NEG 0.9599 0.9667 0.9633 1833 I-NEG 0.9216 0.9276 0.9246 152 B-UNC 0.9040 0.8898 0.8968 508 I-UNC 0.8772 0.8242 0.8499 182 B-USCO 0.9164 0.8983 0.9073 708 I-USCO 0.8473 0.8596 0.8534 2350 B-NSCO 0.9475 0.9560 0.9517 2022 I-NSCO 0.9323 0.9345 0.9334 5774 micro-avg 0.9207 0.9239 0.9223 13529 macro-avg 0.9133 0.9071 0.9101 13529 weighted-avg 0.9208 0.9239 0.9223 13529 ``` --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-1024-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_2_en_4.0.0_3.0_1654189536606.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_2_en_4.0.0_3.0_1654189536606.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_1024d_seed_2").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|390.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-1024-finetuned-squad-seed-2 --- layout: model title: German asr_exp_w2v2r_xls_r_gender_male_5_female_5_s896 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_xls_r_gender_male_5_female_5_s896 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_gender_male_5_female_5_s896` is a German model originally trained by jonatasgrosman. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_exp_w2v2r_xls_r_gender_male_5_female_5_s896_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_gender_male_5_female_5_s896_de_4.2.0_3.0_1664119391168.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_gender_male_5_female_5_s896_de_4.2.0_3.0_1664119391168.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_xls_r_gender_male_5_female_5_s896", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_xls_r_gender_male_5_female_5_s896", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
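Both snippets assume an `audioDf` whose `audio_content` column holds raw audio as an array of floats (Wav2Vec2 models expect 16 kHz mono input). A minimal, Spark-free sketch of decoding a mono 16-bit PCM WAV into such an array with only the Python standard library (the commented DataFrame line is illustrative, not part of the card):

```python
import io
import struct
import wave

def wav_to_floats(wav_bytes):
    """Decode mono 16-bit PCM WAV bytes into floats normalized to [-1, 1]."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit PCM"
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Illustrative only: the float list becomes the "audio_content" column, e.g.
# audioDf = spark.createDataFrame([(wav_to_floats(raw_bytes),)], ["audio_content"])
```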
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_xls_r_gender_male_5_female_5_s896| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: Translate Arabic to English Pipeline author: John Snow Labs name: translate_ar_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ar, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with contributions from many academic groups (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial partners. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `ar` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ar_en_xx_2.7.0_2.4_1609691189858.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ar_en_xx_2.7.0_2.4_1609691189858.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ar_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ar_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ar.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ar_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English image_classifier_vit_architectural_styles ViTForImageClassification from gatecitypreservation author: John Snow Labs name: image_classifier_vit_architectural_styles date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_architectural_styles` is an English model originally trained by gatecitypreservation. ## Predicted Entities `craftsman bungalow architecture`, `queen anne architecture`, `tudor cottage architecture`, `mid-century modern ranch`, `classical revival architecture` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_architectural_styles_en_4.1.0_3.0_1660166786409.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_architectural_styles_en_4.1.0_3.0_1660166786409.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_architectural_styles", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_architectural_styles", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
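The classifier emits one label per image by taking the highest-scoring class from its classification head. Conceptually, that final step looks like the following (a generic sketch over made-up logits, not Spark NLP internals):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_class(logits, labels):
    """Return the best label and its probability."""
    probs = softmax(logits)
    i = max(range(len(probs)), key=probs.__getitem__)
    return labels[i], probs[i]

label, prob = top_class([1.2, 3.4, 0.1],
                        ["queen anne architecture", "tudor cottage architecture", "other"])
print(label)  # -> tudor cottage architecture
```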
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_architectural_styles| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Detect PHI for Deidentification purposes (Spanish, Roberta embeddings) author: John Snow Labs name: ner_deid_subentity_roberta date: 2022-01-17 tags: [deid, es, licensed] task: De-identification language: es edition: Healthcare NLP 3.3.4 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow a generic model to be trained using a deep learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, "Named Entity Recognition with Bidirectional LSTM-CNNs". Deidentification NER (Spanish) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 13 entities. This NER model is trained with a combination of custom datasets, the CoNLL 2002 Spanish corpus, the MeddoProf dataset and several data augmentation mechanisms. This model uses Roberta Clinical Embeddings.
## Predicted Entities `PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `E-MAIL`, `USERNAME`, `LOCATION`, `ZIP`, `MEDICALRECORD`, `PROFESSION`, `PHONE`, `DOCTOR`, `AGE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEID_ES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_roberta_es_3.3.4_3.0_1642428102794.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_roberta_es_3.3.4_3.0_1642428102794.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = medical.NerModel.pretrained("ner_deid_subentity_roberta", "es", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, roberta_embeddings, clinical_ner]) text = [''' Antonio Pérez Juan, nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. 
'''] df = spark.createDataFrame([text]).toDF("text") results = nlpPipeline.fit(df).transform(df) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val roberta_embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_roberta", "es", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, roberta_embeddings, clinical_ner)) val text = """Antonio Pérez Juan, nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.""" val df = Seq(text).toDS.toDF("text") val results = pipeline.fit(df).transform(df) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.deid.subentity_roberta").predict(""" Antonio Pérez Juan, nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos. """) ```
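The Benchmarking section further below reports per-label tp/fp/fn counts alongside macro and micro F1. The micro average pools the raw counts across all labels before computing precision and recall; recomputing it from the table's counts is a quick sanity check (the tuples below are copied from that table):

```python
# (tp, fp, fn) per label, copied from the benchmarking table.
counts = {
    "PATIENT": (1946, 157, 213), "HOSPITAL": (272, 82, 87),
    "DATE": (1632, 24, 35), "ORGANIZATION": (2460, 479, 513),
    "MAIL": (58, 0, 0), "USERNAME": (95, 1, 10),
    "LOCATION": (1734, 416, 381), "ZIP": (13, 0, 4),
    "MEDICALRECORD": (111, 11, 10), "PROFESSION": (273, 72, 116),
    "PHONE": (108, 12, 8), "DOCTOR": (641, 32, 46), "AGE": (284, 37, 64),
}

def micro_f1(counts):
    """Micro-averaged F1: pool tp/fp/fn across all labels first."""
    tp = sum(v[0] for v in counts.values())
    fp = sum(v[1] for v in counts.values())
    fn = sum(v[2] for v in counts.values())
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(round(micro_f1(counts), 4))  # close to the reported micro F1 of 0.87258
```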
## Results ```bash +------------+----------+ | token| ner_label| +------------+----------+ | Antonio| B-PATIENT| | Pérez| I-PATIENT| | Juan| I-PATIENT| | ,| O| | nacido| O| | en| O| | Cadiz|B-LOCATION| | ,| O| | España|B-LOCATION| | .| O| | Aún| O| | no| O| | estaba| O| | vacunado| O| | ,| O| | se| O| | infectó| O| | con| O| | Covid-19| O| | el| O| | dia| O| | 14| B-DATE| | de| I-DATE| | Marzo| I-DATE| | y| O| | tuvo| O| | que| O| | ir| O| | al| O| | Hospital| O| | Fue| O| | tratado| O| | con| O| | anticuerpos| O| |monoclonales| O| | en| O| | la| O| | Clinica|B-HOSPITAL| | San|I-HOSPITAL| | Carlos|I-HOSPITAL| | .| O| +------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity_roberta| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|16.3 MB| |Dependencies:|roberta_base_biomedical| ## Data Source - Internal JSL annotated corpus - [Spanish conLL](https://www.clips.uantwerpen.be/conll2002/ner/data/) - [MeddoProf](https://temu.bsc.es/meddoprof/data/) ## Benchmarking ```bash label tp fp fn total precision recall f1 PATIENT 1946.0 157.0 213.0 2159.0 0.9253 0.9013 0.9132 HOSPITAL 272.0 82.0 87.0 359.0 0.7684 0.7577 0.763 DATE 1632.0 24.0 35.0 1667.0 0.9855 0.979 0.9822 ORGANIZATION 2460.0 479.0 513.0 2973.0 0.837 0.8274 0.8322 MAIL 58.0 0.0 0.0 58.0 1.0 1.0 1.0 USERNAME 95.0 1.0 10.0 105.0 0.9896 0.9048 0.9453 LOCATION 1734.0 416.0 381.0 2115.0 0.8065 0.8199 0.8131 ZIP 13.0 0.0 4.0 17.0 1.0 0.7647 0.8667 MEDICALRECORD 111.0 11.0 10.0 121.0 0.9098 0.9174 0.9136 PROFESSION 273.0 72.0 116.0 389.0 0.7913 0.7018 0.7439 PHONE 108.0 12.0 8.0 116.0 0.9 0.931 0.9153 DOCTOR 641.0 32.0 46.0 687.0 0.9525 0.933 0.9426 AGE 284.0 37.0 64.0 348.0 0.8847 0.8161 0.849 macro - - - - - - 0.88308 micro - - - - - - 0.87258 ``` --- layout: model title: English ElectraForQuestionAnswering model (from superspray) 
author: John Snow Labs name: electra_qa_large_discriminator_squad2_custom_dataset date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra_large_discriminator_squad2_custom_dataset` is an English model originally trained by `superspray`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_large_discriminator_squad2_custom_dataset_en_4.0.0_3.0_1655921639854.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_large_discriminator_squad2_custom_dataset_en_4.0.0_3.0_1655921639854.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_large_discriminator_squad2_custom_dataset","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_large_discriminator_squad2_custom_dataset","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.electra.large.by_superspray").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_large_discriminator_squad2_custom_dataset| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/superspray/electra_large_discriminator_squad2_custom_dataset --- layout: model title: English RobertaForQuestionAnswering Cased model (from raphaelsty) author: John Snow Labs name: roberta_qa_carbonblog date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `carbonblog` is an English model originally trained by `raphaelsty`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_carbonblog_en_4.2.4_3.0_1669985037575.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_carbonblog_en_4.2.4_3.0_1669985037575.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_carbonblog","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_carbonblog","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_carbonblog| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|307.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/raphaelsty/carbonblog - https://raphaelsty.github.io/blog/template/ --- layout: model title: Legal Severance Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_severance_agreement_bert date: 2022-12-06 tags: [en, legal, classification, agreement, severance, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_severance_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `severance-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `severance-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_severance_agreement_bert_en_1.0.0_3.0_1670349917119.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_severance_agreement_bert_en_1.0.0_3.0_1670349917119.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_severance_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +---------------------+ |result | +---------------------+ |[severance-agreement]| |[other] | |[other] | |[severance-agreement]| +---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_severance_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.95 1.00 0.97 35 severance-agreement 1.00 0.91 0.95 22 accuracy - - 0.96 57 macro-avg 0.97 0.95 0.96 57 weighted-avg 0.97 0.96 0.96 57 ``` --- layout: model title: Pipeline to Detect Adverse Drug Events (bert-clinical) author: John Snow Labs name: ner_ade_clinicalbert_pipeline date: 2023-03-14 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_ade_clinicalbert](https://nlp.johnsnowlabs.com/2021/04/01/ner_ade_clinicalbert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_ade_clinicalbert_pipeline_en_4.3.0_3.2_1678809489216.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_ade_clinicalbert_pipeline_en_4.3.0_3.2_1678809489216.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_ade_clinicalbert_pipeline", "en", "clinical/models") text = '''Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_ade_clinicalbert_pipeline", "en", "clinical/models") val text = "Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical_bert_ade.pipeline").predict("""Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps.""") ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:---------------|--------:|------:|:------------|-------------:| | 0 | Lipitor | 12 | 18 | DRUG | 0.9975 | | 1 | severe fatigue | 52 | 65 | ADE | 0.7094 | | 2 | voltaren | 97 | 104 | DRUG | 0.9202 | | 3 | cramps | 152 | 157 | ADE | 0.5992 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_ade_clinicalbert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.3 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Basic NLP Pipeline for Spanish from TEMU_BSC for PlanTL author: cayorodriguez name: pipeline_bsc_roberta_base_bne date: 2022-11-22 tags: [es, open_source] task: Pipeline Public language: es edition: Spark NLP 4.0.0 spark_version: 3.2 supported: false article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Basic NLP pipeline, by TEMU-BSC for PlanTL-GOB-ES, with Tokenization, lemmatization, NER, embeddings and Normalization, using roberta_base_bne transformer. 
{:.btn-box} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/cayorodriguez/pipeline_bsc_roberta_base_bne_es_4.0.0_3.2_1669122787149.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/cayorodriguez/pipeline_bsc_roberta_base_bne_es_4.0.0_3.2_1669122787149.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import sparknlp spark = sparknlp.start() from sparknlp.annotator import * from sparknlp.base import * from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("pipeline_bsc_roberta_base_bne", "es", "@cayorodriguez") from sparknlp.base import LightPipeline light_model = LightPipeline(pipeline) text = "La Reserva Federal de el Gobierno de EE UU aprueba una de las mayorores subidas de tipos de interés desde 1994." light_result = light_model.annotate(text) result = pipeline.annotate("Veo al hombre de los Estados Unidos con el telescopio") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_bsc_roberta_base_bne| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Community| |Language:|es| |Size:|2.0 GB| |Dependencies:|roberta_base_bne| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - NormalizerModel - StopWordsCleaner - RoBertaEmbeddings - SentenceEmbeddings - EmbeddingsFinisher - LemmatizerModel - RoBertaForTokenClassification - RoBertaForTokenClassification - NerConverter --- layout: model title: English DistilBertForTokenClassification Base Uncased model (from sarahmiller137) author: John Snow Labs name: distilbert_token_classifier_base_uncased_ft_conll2003 date: 2023-03-06 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-ft-conll2003` is an English model originally trained by `sarahmiller137`. ## Predicted Entities `LOC`, `ORG`, `PER`, `MISC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_uncased_ft_conll2003_en_4.3.1_3.0_1678133838174.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_uncased_ft_conll2003_en_4.3.1_3.0_1678133838174.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_uncased_ft_conll2003","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_uncased_ft_conll2003","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
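The token classifier above emits one BIO tag (`B-PER`, `I-PER`, `O`, ...) per token; downstream, a converter stage such as Spark NLP's NerConverter groups those tags into entity chunks. A simplified sketch of that grouping logic (illustrative, not the library's actual code):

```python
def bio_to_chunks(tokens, tags):
    """Group (token, BIO-tag) pairs into (entity_type, text) chunks."""
    chunks, cur_type, cur_toks = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != cur_type):
            if cur_toks:
                chunks.append((cur_type, " ".join(cur_toks)))
            cur_type, cur_toks = tag[2:], [tok]
        elif tag.startswith("I-"):
            cur_toks.append(tok)
        else:  # "O" ends any open chunk
            if cur_toks:
                chunks.append((cur_type, " ".join(cur_toks)))
            cur_type, cur_toks = None, []
    if cur_toks:
        chunks.append((cur_type, " ".join(cur_toks)))
    return chunks

print(bio_to_chunks(["John", "Smith", "visited", "Paris"],
                    ["B-PER", "I-PER", "O", "B-LOC"]))
# -> [('PER', 'John Smith'), ('LOC', 'Paris')]
```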
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_base_uncased_ft_conll2003| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/sarahmiller137/distilbert-base-uncased-ft-conll2003 - https://aclanthology.org/W03-0419 - https://paperswithcode.com/sota?task=Token+Classification&dataset=conll2003 --- layout: model title: Modern Greek (1453-) asr_wav2vec2_large_xlsr_53_greek_by_PereLluis13 TFWav2Vec2ForCTC from PereLluis13 author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_greek_by_PereLluis13 date: 2022-09-25 tags: [wav2vec2, el, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: el edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_greek_by_PereLluis13` is a Modern Greek (1453-) model originally trained by PereLluis13. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_greek_by_PereLluis13_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_greek_by_PereLluis13_el_4.2.0_3.0_1664107066755.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_greek_by_PereLluis13_el_4.2.0_3.0_1664107066755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_greek_by_PereLluis13', lang = 'el') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_greek_by_PereLluis13", lang = "el") val annotations = pipeline.transform(audioDF) ```
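The Wav2Vec2ForCTC stage inside this pipeline decodes frame-level predictions with CTC: consecutive repeats are merged, then blank frames are dropped. A toy sketch of greedy CTC collapse over integer token ids (illustrative only; using id 0 as the blank is an assumption of this sketch):

```python
def ctc_greedy_collapse(frame_ids, blank=0):
    """Merge consecutive repeats, then drop blanks - greedy CTC decoding."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

# Frames [blank, a, a, blank, a, b, b, blank] collapse to [a, a, b]:
print(ctc_greedy_collapse([0, 1, 1, 0, 1, 2, 2, 0]))  # -> [1, 1, 2]
```

The blank token is what lets CTC emit the same symbol twice in a row (the two `a`s are separated by a blank frame).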
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_greek_by_PereLluis13| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|el| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal Management Contract Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_management_contract_bert date: 2022-11-25 tags: [en, legal, classification, agreement, management, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_management_contract_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `management-contract` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `management-contract`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_management_contract_bert_en_1.0.0_3.0_1669367729935.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_management_contract_bert_en_1.0.0_3.0_1669367729935.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_management_contract_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[management-contract]| |[other]| |[other]| |[management-contract]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_management_contract_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support management-contract 0.97 1.00 0.99 33 other 1.00 0.98 0.99 63 accuracy - - 0.99 96 macro-avg 0.99 0.99 0.99 96 weighted-avg 0.99 0.99 0.99 96 ``` --- layout: model title: German asr_wav2vec2_large_xlsr_german_by_maxidl TFWav2Vec2ForCTC from maxidl author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_german_by_maxidl date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_german_by_maxidl` is a German model originally trained by maxidl.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_german_by_maxidl_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_german_by_maxidl_de_4.2.0_3.0_1664097995013.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_german_by_maxidl_de_4.2.0_3.0_1664097995013.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_german_by_maxidl', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_german_by_maxidl", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_german_by_maxidl| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal Share Exchange Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_share_exchange_agreement_bert date: 2022-11-25 tags: [en, legal, classification, agreement, share_exchange, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_share_exchange_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `share-exchange-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `share-exchange-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_share_exchange_agreement_bert_en_1.0.0_3.0_1669371750031.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_share_exchange_agreement_bert_en_1.0.0_3.0_1669371750031.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_share_exchange_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[share-exchange-agreement]| |[other]| |[other]| |[share-exchange-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_share_exchange_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.94 0.92 0.93 65 share-exchange-agreement 0.85 0.88 0.87 33 accuracy - - 0.91 98 macro-avg 0.90 0.90 0.90 98 weighted-avg 0.91 0.91 0.91 98 ``` --- layout: model title: Dutch NER Pipeline author: John Snow Labs name: bert_token_classifier_dutch_udlassy_ner_pipeline date: 2022-06-25 tags: [open_source, ner, dutch, token_classifier, bert, treatment, nl] task: Named Entity Recognition language: nl edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_dutch_udlassy_ner](https://nlp.johnsnowlabs.com/2021/12/08/bert_token_classifier_dutch_udlassy_ner_nl.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_dutch_udlassy_ner_pipeline_nl_4.0.0_3.0_1656119432774.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_dutch_udlassy_ner_pipeline_nl_4.0.0_3.0_1656119432774.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("bert_token_classifier_dutch_udlassy_ner_pipeline", lang = "nl") pipeline.annotate("Mijn naam is Peter Fergusson. Ik woon sinds oktober 2011 in New York en werk 5 jaar bij Tesla Motor.") ``` ```scala val pipeline = new PretrainedPipeline("bert_token_classifier_dutch_udlassy_ner_pipeline", lang = "nl") pipeline.annotate("Mijn naam is Peter Fergusson. Ik woon sinds oktober 2011 in New York en werk 5 jaar bij Tesla Motor.") ```
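The NerConverter stage inside this pipeline is what merges the classifier's token-level B-/I- tags into entity chunks. A rough pure-Python sketch of that BIO merging (illustrative only, not the library's actual implementation):

```python
def bio_to_chunks(tokens, tags):
    """Merge token-level BIO tags into (chunk, label) pairs.

    Sketch of what a NerConverter-style step does; the real Spark NLP
    implementation also tracks character offsets and edge cases.
    """
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new chunk, closing any open one.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # Continuation of the current chunk.
            current.append(token)
        else:
            # O tag (or inconsistent I-) closes the open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(bio_to_chunks(
    ["Peter", "Fergusson", "woont", "in", "New", "York"],
    ["B-PERSON", "I-PERSON", "O", "O", "B-GPE", "I-GPE"]))
# → [('Peter Fergusson', 'PERSON'), ('New York', 'GPE')]
```

The tokens and labels here mirror the example sentence used above; the real pipeline derives them from the BERT token classifier.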
## Results ```bash +---------------+---------+ |chunk |ner_label| +---------------+---------+ |Peter Fergusson|PERSON | |oktober 2011 |DATE | |New York |GPE | |5 jaar |DATE | |Tesla Motor |ORG | +---------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_dutch_udlassy_ner_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|nl| |Size:|407.9 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertForTokenClassification - NerConverter - Finisher --- layout: model title: Sentiment Analysis of tweets Pipeline (analyze_sentimentdl_use_twitter) author: John Snow Labs name: analyze_sentimentdl_use_twitter date: 2021-01-18 task: [Embeddings, Sentiment Analysis, Pipeline Public] language: en nav_key: models edition: Spark NLP 2.7.1 spark_version: 2.4 tags: [en, sentiment, pipeline, open_source] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pre-trained pipeline to analyze sentiment in tweets and classify them into 'positive' and 'negative' classes using `Universal Sentence Encoder` embeddings {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/analyze_sentimentdl_use_twitter_en_2.7.1_2.4_1610993470852.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/analyze_sentimentdl_use_twitter_en_2.7.1_2.4_1610993470852.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("analyze_sentimentdl_use_twitter", lang = "en") result = pipeline.fullAnnotate(["im meeting up with one of my besties tonight! Cant wait!! - GIRL TALK!!", "is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!"]) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("analyze_sentimentdl_use_twitter", lang = "en") val result = pipeline.fullAnnotate(Array("im meeting up with one of my besties tonight! Cant wait!! - GIRL TALK!!", "is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!")) ``` {:.nlu-block} ```python import nlu text = ["im meeting up with one of my besties tonight! Cant wait!! - GIRL TALK!!", "is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!"] sentiment_df = nlu.load('en.sentiment.twitter.use').predict(text) sentiment_df ```
## Results ```bash | | document | sentiment | |---:|:---------------------------------------------------------------------------------------------------------------- |:------------| | 0 | im meeting up with one of my besties tonight! Cant wait!! - GIRL TALK!! | positive | | 1 | is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah! | negative | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|analyze_sentimentdl_use_twitter| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.1+| |Edition:|Official| |Language:|en| ## Included Models `tfhub_use`, `sentimentdl_use_twitter` --- layout: model title: English asr_wav2vec2_thai_ASR TFWav2Vec2ForCTC from Rattana author: John Snow Labs name: asr_wav2vec2_thai_ASR date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_thai_ASR` is an English model originally trained by Rattana. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_thai_ASR_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_thai_ASR_en_4.2.0_3.0_1664112632452.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_thai_ASR_en_4.2.0_3.0_1664112632452.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_thai_ASR", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_thai_ASR", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_thai_ASR| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Legal Registration Rights Agreement Document Binary Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_registration_rights_agreement_bert date: 2022-12-18 tags: [en, legal, classification, licensed, document, bert, registration, rights, agreement, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_registration_rights_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `registration-rights-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `registration-rights-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_registration_rights_agreement_bert_en_1.0.0_3.0_1671393849485.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_registration_rights_agreement_bert_en_1.0.0_3.0_1671393849485.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_registration_rights_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
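As a sanity check on the Benchmarking section below, the macro and weighted averages can be recomputed from the per-class F1 scores and supports (the figures used here are the rounded values from the table):

```python
# Per-class F1 and support, as rounded in the benchmarking table.
scores = {
    "other": (0.96, 204),
    "registration-rights-agreement": (0.93, 113),
}

total = sum(support for _, support in scores.values())          # 317 documents
macro_f1 = sum(f1 for f1, _ in scores.values()) / len(scores)   # unweighted mean
weighted_f1 = sum(f1 * s for f1, s in scores.values()) / total  # support-weighted

print(total, round(macro_f1, 3), round(weighted_f1, 3))
```

Both averages agree with the table's reported 0.95 values up to rounding of the per-class inputs.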
## Results ```bash +-------+ |result| +-------+ |[registration-rights-agreement]| |[other]| |[other]| |[registration-rights-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_registration_rights_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.95 0.98 0.96 204 registration-rights-agreement 0.96 0.90 0.93 113 accuracy - - 0.95 317 macro-avg 0.96 0.94 0.95 317 weighted-avg 0.95 0.95 0.95 317 ``` --- layout: model title: English asr_wav2vec2_xls_r_300m TFWav2Vec2ForCTC from hgharibi author: John Snow Labs name: asr_wav2vec2_xls_r_300m date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m` is an English model originally trained by hgharibi. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_xls_r_300m_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_en_4.2.0_3.0_1664016802341.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_en_4.2.0_3.0_1664016802341.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xls_r_300m", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xls_r_300m", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xls_r_300m| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Sentence Embeddings - sbert tiny (tuned) author: John Snow Labs name: sbert_jsl_tiny_uncased date: 2021-05-14 tags: [embeddings, clinical, licensed, en] task: Embeddings language: en nav_key: models edition: Healthcare NLP 3.0.3 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained to generate contextual sentence embeddings of input sentences. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_tiny_uncased_en_3.0.3_2.4_1621017118423.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_tiny_uncased_en_3.0.3_2.4_1621017118423.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sbiobert_embeddings = BertSentenceEmbeddings\ .pretrained("sbert_jsl_tiny_uncased","en","clinical/models")\ .setInputCols(["sentence"])\ .setOutputCol("sbert_embeddings") ``` ```scala val sbiobert_embeddings = BertSentenceEmbeddings .pretrained("sbert_jsl_tiny_uncased","en","clinical/models") .setInputCols(Array("sentence")) .setOutputCol("sbert_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed_sentence.bert.jsl_tiny_uncased").predict("""Put your text here.""") ```
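The `sbert_embeddings` column produced above holds one vector per sentence, and downstream similarity between sentences is typically measured with cosine similarity. A stdlib sketch, with short toy vectors standing in for the real embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-d stand-ins for the much longer sentence embeddings.
v1 = [0.1, 0.3, -0.2, 0.5]
v2 = [0.1, 0.25, -0.1, 0.45]
print(round(cosine_similarity(v1, v2), 3))
```

In practice you would pull the float arrays out of the `sbert_embeddings` annotation results and compare them the same way.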
## Results ```bash Gives a 768-dimensional vector representation of the sentence. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbert_jsl_tiny_uncased| |Compatibility:|Healthcare NLP 3.0.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Case sensitive:|false| ## Data Source Tuned on the MedNLI dataset ## Benchmarking ```bash MedNLI Score Acc 0.625 STS(cos) 0.682 ``` --- layout: model title: Turkish BertForQuestionAnswering model (from lserinol) author: John Snow Labs name: bert_qa_bert_turkish_question_answering date: 2022-06-02 tags: [tr, open_source, question_answering, bert] task: Question Answering language: tr edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-turkish-question-answering` is a Turkish model originally trained by `lserinol`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_turkish_question_answering_tr_4.0.0_3.0_1654185022312.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_turkish_question_answering_tr_4.0.0_3.0_1654185022312.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_turkish_question_answering","tr") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_turkish_question_answering","tr") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("tr.answer_question.bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
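In the nlu snippet above, a single string carries both the question and the context, separated by `|||`. A tiny sketch of that convention (how nlu itself parses the separator may differ):

```python
def split_qa(payload, sep="|||"):
    """Split an nlu-style 'question|||context' string into (question, context).

    Illustrative helper only; not part of the nlu API.
    """
    question, _, context = payload.partition(sep)
    return question.strip(), context.strip()

# Mirrors the payload passed to nlu.load(...).predict above.
print(split_qa("What's my name?|||My name is Clara and I live in Berkeley."))
```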
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_turkish_question_answering| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|tr| |Size:|412.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/lserinol/bert-turkish-question-answering --- layout: model title: Fast Neural Machine Translation Model from Fijian to English author: John Snow Labs name: opus_mt_fj_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, fj, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `fj` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_fj_en_xx_2.7.0_2.4_1609169151401.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_fj_en_xx_2.7.0_2.4_1609169151401.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_fj_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_fj_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.fj.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_fj_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from English to Salishan Languages author: John Snow Labs name: opus_mt_en_sal date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, sal, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `sal` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_sal_xx_2.7.0_2.4_1609164488470.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_sal_xx_2.7.0_2.4_1609164488470.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_sal", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_sal", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.sal').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_sal| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Word2Vec Embeddings in Sindhi (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, sd, open_source] task: Embeddings language: sd edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sd_3.4.1_3.0_1647457161182.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sd_3.4.1_3.0_1647457161182.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sd") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["مون کي اسپارڪ اين ايل پي سان پيار آهي"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sd") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("مون کي اسپارڪ اين ايل پي سان پيار آهي").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("sd.embed.w2v_cc_300d").predict("""مون کي اسپارڪ اين ايل پي سان پيار آهي""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|sd| |Size:|78.2 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Named Entity Recognition for Chinese (BERT-MSRA Dataset) author: John Snow Labs name: ner_msra_bert_768d date: 2021-01-03 task: Named Entity Recognition language: zh edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [zh, cn, ner, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates named entities in a text, which can be used to identify features such as names of people, places, and organizations. The model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. This model uses the pre-trained `bert_base_chinese` embeddings model from the `BertEmbeddings` annotator as an input, so be sure to use the same embeddings in the pipeline. ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_ZH/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_msra_bert_768d_zh_2.7.0_2.4_1609703549977.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ner_msra_bert_768d_zh_2.7.0_2.4_1609703549977.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh")\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained(name='bert_base_chinese', lang='zh')\ .setInputCols("document", "token") \ .setOutputCol("embeddings") ner = NerDLModel.pretrained("ner_msra_bert_768d", "zh") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("entities") pipeline = Pipeline(stages=[document_assembler, sentence_detector, word_segmenter, embeddings, ner, ner_converter]) example = spark.createDataFrame([['马云在浙江省杭州市出生,是阿里巴巴集团的主要创始人。']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh") .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_base_chinese", "zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_msra_bert_768d", "zh") .setInputCols("document", "token", "embeddings") .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols("sentence", "token", "ner") .setOutputCol("entities") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter, embeddings, ner, ner_converter)) val data = Seq("马云在浙江省杭州市出生,是阿里巴巴集团的主要创始人。").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu 
text = ["""马云在浙江省杭州市出生,是阿里巴巴集团的主要创始人。"""] ner_df = nlu.load('zh.ner.msra.bert_768D').predict(text, output_level='token') ner_df ```
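After word segmentation, the NER stage assigns a tag to each token and `NerConverter` groups tagged tokens into entity chunks. The grouping logic can be sketched in pure Python over hypothetical IOB2-tagged tokens (not actual model output); note that Chinese tokens concatenate without spaces:

```python
def iob_to_chunks(tokens, tags):
    # Convert IOB2 tags (B-PER, I-PER, O, ...) into (chunk_text, label) pairs,
    # roughly what Spark NLP's NerConverter does after the NER stage.
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = [token, tag[2:]]
            chunks.append(current)
        elif tag.startswith("I-") and current is not None and current[1] == tag[2:]:
            current[0] += token  # no space: Chinese tokens concatenate directly
        else:
            current = None  # "O" or an inconsistent tag closes the chunk
    return [(text, label) for text, label in chunks]

tokens = ["马云", "在", "浙江省", "杭州市", "出生"]
tags   = ["B-PER", "O", "B-LOC", "B-LOC", "O"]
print(iob_to_chunks(tokens, tags))  # [('马云', 'PER'), ('浙江省', 'LOC'), ('杭州市', 'LOC')]
```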
## Results ```bash +------------+---+ |token |ner| +------------+---+ |马云 |PER| |浙江省 |LOC| |杭州市 |LOC| |出生 |ORG| |阿里巴巴集团|ORG| |创始人 |PER| +------------+---+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_msra_bert_768d| |Type:|ner| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|zh| ## Data Source The model was trained on the [MSRA (Levow, 2006)](https://www.aclweb.org/anthology/W06-0115/) data set created by "Microsoft Research Asia". ## Benchmarking ```bash | ner_tag | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | LOC | 0.97 | 0.97 | 0.97 | 2777 | | O | 1.00 | 1.00 | 1.00 | 146826 | | ORG | 0.88 | 0.99 | 0.93 | 1292 | | PER | 0.97 | 0.97 | 0.97 | 1430 | | accuracy | | | 1.00 | 152325 | | macro avg | 0.95 | 0.98 | 0.97 | 152325 | | weighted avg | 1.00 | 1.00 | 1.00 | 152325 | ``` --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from be4rr) author: John Snow Labs name: xlmroberta_ner_be4rr_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `be4rr`. 
## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_be4rr_base_finetuned_panx_de_4.1.0_3.0_1660431406884.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_be4rr_base_finetuned_panx_de_4.1.0_3.0_1660431406884.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_be4rr_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_be4rr_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_be4rr_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/be4rr/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: English RobertaForQuestionAnswering (from AmazonScience) author: John Snow Labs name: roberta_qa_qanlu date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `qanlu` is an English model originally trained by `AmazonScience`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_qanlu_en_4.0.0_3.0_1655729446230.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_qanlu_en_4.0.0_3.0_1655729446230.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_qanlu","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_qanlu","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.by_AmazonScience").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
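Under the hood, an extractive QA model like this one scores every context token as a potential answer start and answer end, and returns the highest-scoring valid span. The selection step can be sketched in pure Python — the scores below are invented for illustration, not real model logits:

```python
def best_span(tokens, start_scores, end_scores, max_len=10):
    # Pick (i, j) maximizing start_scores[i] + end_scores[j],
    # subject to i <= j < i + max_len (a valid, bounded span).
    best = (0, 0, float("-inf"))
    for i in range(len(tokens)):
        for j in range(i, min(i + max_len, len(tokens))):
            score = start_scores[i] + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    i, j, _ = best
    return " ".join(tokens[i:j + 1])

context = "My name is Clara and I live in Berkeley .".split()
# Hypothetical per-token start/end scores peaking at "Clara".
start = [0.1, 0.2, 0.1, 5.0, 0.3, 0.1, 0.2, 0.1, 0.4, 0.0]
end   = [0.1, 0.1, 0.2, 4.8, 0.2, 0.1, 0.1, 0.1, 0.3, 0.0]
print(best_span(context, start, end))  # Clara
```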
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_qanlu| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AmazonScience/qanlu - https://github.com/amazon-research/question-answering-nlu - https://assets.amazon.science/33/ea/800419b24a09876601d8ab99bfb9/language-model-is-all-you-need-natural-language-understanding-as-question-answering.pdf --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_bert_base_uncased_few_shot_k_16_finetuned_squad_seed_42 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-16-finetuned-squad-seed-42` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_few_shot_k_16_finetuned_squad_seed_42_en_4.0.0_3.0_1654180799941.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_few_shot_k_16_finetuned_squad_seed_42_en_4.0.0_3.0_1654180799941.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_few_shot_k_16_finetuned_squad_seed_42","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_few_shot_k_16_finetuned_squad_seed_42","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased_seed_42").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_few_shot_k_16_finetuned_squad_seed_42| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-16-finetuned-squad-seed-42 --- layout: model title: RoBERTa Large CoNLL-03 NER Pipeline author: ahmedlone127 name: roberta_large_token_classifier_conll03_pipeline date: 2022-06-13 tags: [open_source, ner, token_classifier, roberta, conll03, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: false article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [roberta_large_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/26/roberta_large_token_classifier_conll03_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/roberta_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655160899783.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/roberta_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655160899783.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("roberta_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("roberta_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PERSON | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_large_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Community| |Language:|en| |Size:|29.0 KB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - RegexMatcherModel --- layout: model title: Legal Regions Of Eu Member States Document Classifier (EURLEX) author: John Snow Labs name: legclf_regions_of_eu_member_states_bert date: 2023-03-06 tags: [en, legal, classification, clauses, regions_of_eu_member_states, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The `legclf_regions_of_eu_member_states_bert` model is a BERT Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the `Regions_of_Eu_Member_States` class or not (binary classification) according to the EuroVoc labels. 
## Predicted Entities `Regions_of_Eu_Member_States`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_regions_of_eu_member_states_bert_en_1.0.0_3.0_1678111785780.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_regions_of_eu_member_states_bert_en_1.0.0_3.0_1678111785780.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_regions_of_eu_member_states_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
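The classifier head scores the two classes from the sentence embedding and predicts the higher-probability one. That final decision step amounts to a softmax followed by an argmax, sketched below with invented class scores (not real model output):

```python
import math

LABELS = ["Regions_of_Eu_Member_States", "Other"]

def predict_label(scores):
    # Softmax over raw class scores (shifted by the max for numerical
    # stability), then take the argmax label with its probability.
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return LABELS[probs.index(max(probs))], max(probs)

label, prob = predict_label([2.1, -0.4])
print(label, round(prob, 3))  # Regions_of_Eu_Member_States 0.924
```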
## Results ```bash +-----------------------------+ |result | +-----------------------------+ |[Regions_of_Eu_Member_States]| |[Other] | |[Other] | |[Regions_of_Eu_Member_States]| +-----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_regions_of_eu_member_states_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.91 0.93 0.92 198 Regions_of_Eu_Member_States 0.94 0.91 0.93 218 accuracy - - 0.92 416 macro-avg 0.92 0.92 0.92 416 weighted-avg 0.92 0.92 0.92 416 ``` --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_8_h_128 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-8_H-128` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_8_h_128_zh_4.2.4_3.0_1670021767115.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_8_h_128_zh_4.2.4_3.0_1670021767115.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_8_h_128","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_8_h_128","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_8_h_128| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|17.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-8_H-128 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Legal Zero-shot Relation Extraction author: John Snow Labs name: legre_zero_shot date: 2022-08-22 tags: [en, legal, re, zero, shot, zero_shot, licensed] task: Relation Extraction language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Zero-shot Relation Extraction model, meaning that it does not require any training data, just a few examples of the relation types you are looking for, to output a proper result. Make sure you keep the proper syntax of the relations you want to extract. 
For example: ``` re_model.setRelationalCategories({ "GRANTS_TO": ["{OBLIGATION_SUBJECT} grants {OBLIGATION_INDIRECT_OBJECT}"], "GRANTS": ["{OBLIGATION_SUBJECT} grants {OBLIGATION_ACTION}"] }) ``` - The keys of the dictionary are the names of the relations (`GRANTS_TO`, `GRANTS`) - The values are lists of sentences with similar examples of the relation - The values in brackets are the NER labels extracted by an NER component beforehand ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legre_zero_shot_en_1.0.0_3.2_1661181212397.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legre_zero_shot_en_1.0.0_3.2_1661181212397.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sparktokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") tokenClassifier = legal.BertForTokenClassifier.pretrained('legner_obligations','en', 'legal/models')\ .setInputCols("token", "document")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = nlp.NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") re_model = legal.ZeroShotRelationExtractionModel.pretrained("legre_zero_shot", "en", "legal/models")\ .setInputCols(["ner_chunk", "sentence"]) \ .setOutputCol("relations") # Remember it's 2 curly brackets instead of one if you are using Spark NLP < 4.0 re_model.setRelationalCategories({ "GRANTS_TO": ["{OBLIGATION_SUBJECT} grants {OBLIGATION_INDIRECT_OBJECT}"], "GRANTS": ["{OBLIGATION_SUBJECT} grants {OBLIGATION_ACTION}"] }) pipeline = nlp.Pipeline() \ .setStages([document_assembler, sparktokenizer, tokenClassifier, ner_converter, re_model ]) # create Spark DF sample_text = """Fox grants to Licensee a limited, exclusive right and license""" data = spark.createDataFrame([[sample_text]]).toDF("text") model = pipeline.fit(data) results = model.transform(data) # ner output results.selectExpr("explode(ner_chunk) as ner").show(truncate=False) # relations output results.selectExpr("explode(relations) as relation").show(truncate=False) ```
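Internally, each relation template is instantiated into a hypothesis sentence (e.g. "Fox grants Licensee") and an NLI model scores whether the input text entails it; hypotheses predicted as entailment with sufficient confidence become relations. A toy sketch of that final filtering step — the scores and the `PERMITS` relation are invented for illustration, not actual model output:

```python
def select_relations(hypothesis_scores, threshold=0.5):
    # hypothesis_scores: {relation_name: (nli_prediction, confidence)}
    # Keep only relations whose hypothesis was predicted as entailed
    # with confidence at or above the threshold.
    return [
        (rel, conf)
        for rel, (pred, conf) in hypothesis_scores.items()
        if pred == "entail" and conf >= threshold
    ]

scores = {
    "GRANTS":    ("entail", 0.759),
    "GRANTS_TO": ("entail", 0.982),
    "PERMITS":   ("contradict", 0.640),  # hypothetical extra relation
}
print(select_relations(scores))  # [('GRANTS', 0.759), ('GRANTS_TO', 0.982)]
```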
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------+ |ner | +----------------------------------------------------------------------------------------------------------------------------+ |[chunk, 0, 2, Fox, [entity -> OBLIGATION_SUBJECT, sentence -> 0, chunk -> 0, confidence -> 0.6905101], []] | |[chunk, 4, 9, grants, [entity -> OBLIGATION_ACTION, sentence -> 0, chunk -> 1, confidence -> 0.7512371], []] | |[chunk, 14, 21, Licensee, [entity -> OBLIGATION_INDIRECT_OBJECT, sentence -> 0, chunk -> 2, confidence -> 0.8294538], []] | |[chunk, 23, 31, a limited, [entity -> OBLIGATION, sentence -> 0, chunk -> 3, confidence -> 0.7429814], []] | |[chunk, 34, 60, exclusive right and license, [entity -> OBLIGATION, sentence -> 0, chunk -> 4, confidence -> 0.9236847], []]| +----------------------------------------------------------------------------------------------------------------------------+ +-------------+ |relation | +-------------+ |[category, 0, 91, GRANTS, [entity1_begin -> 0, relation -> GRANTS, hypothesis -> Fox grants grants, confidence -> 0.7592092, nli_prediction -> entail, entity1 -> OBLIGATION_SUBJECT, syntactic_distance -> undefined, chunk2 -> grants, entity2_end -> 9, entity1_end -> 2, entity2_begin -> 4, entity2 -> OBLIGATION_ACTION, chunk1 -> Fox, sentence -> 0], []] | |[category, 92, 185, GRANTS_TO, [entity1_begin -> 0, relation -> GRANTS_TO, hypothesis -> Fox grants Licensee, confidence -> 0.9822127, nli_prediction -> entail, entity1 -> OBLIGATION_SUBJECT, syntactic_distance -> undefined, chunk2 -> Licensee, entity2_end -> 21, entity1_end -> 2, entity2_begin -> 14, entity2 -> OBLIGATION_INDIRECT_OBJECT, chunk1 -> Fox, sentence -> 0], []]| +-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legre_zero_shot| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|406.4 
MB| |Case sensitive:|true| ## References Bert Base (cased) trained on the GLUE MNLI dataset. --- layout: model title: Pipeline to Detect Drug Chemicals author: John Snow Labs name: ner_drugs_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_drugs](https://nlp.johnsnowlabs.com/2021/03/31/ner_drugs_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_drugs_pipeline_en_4.3.0_3.2_1678878172100.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_drugs_pipeline_en_4.3.0_3.2_1678878172100.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_drugs_pipeline", "en", "clinical/models") text = '''The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulnessof vinorelbine monotherapy in patients with advanced or recurrent breast cancerafter standard therapy, we evaluated the efficacy and safety of vinorelbine inpatients previously treated with anthracyclines and taxanes.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_drugs_pipeline", "en", "clinical/models") val text = "The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. 
Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulnessof vinorelbine monotherapy in patients with advanced or recurrent breast cancerafter standard therapy, we evaluated the efficacy and safety of vinorelbine inpatients previously treated with anthracyclines and taxanes." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.drugs.pipeline").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. 
We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulnessof vinorelbine monotherapy in patients with advanced or recurrent breast cancerafter standard therapy, we evaluated the efficacy and safety of vinorelbine inpatients previously treated with anthracyclines and taxanes.""") ```
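`fullAnnotate` returns one dictionary per input text, with annotation lists keyed by output column. A minimal post-processing sketch, using plain dictionaries as a hypothetical stand-in for Spark NLP's `Annotation` objects (the `result`/`begin`/`end`/`metadata` fields mirror the shape of `result[0]["ner_chunk"]`):

```python
# Stand-in data mimicking the shape of result[0]["ner_chunk"] from the pipeline above.
chunks = [
    {"result": "potassium", "begin": 92, "end": 100,
     "metadata": {"entity": "DrugChem", "confidence": "0.5346"}},
    {"result": "vinorelbine", "begin": 1203, "end": 1213,
     "metadata": {"entity": "DrugChem", "confidence": "0.9729"}},
]

def to_rows(chunks):
    # Flatten each chunk into a (text, begin, end, label, confidence) tuple.
    return [
        (c["result"], c["begin"], c["end"],
         c["metadata"]["entity"], float(c["metadata"]["confidence"]))
        for c in chunks
    ]

rows = to_rows(chunks)
```

In practice, pass `result[0]["ner_chunk"]` from the pipeline above instead of the stand-in list.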
## Results

```bash
|    | ner_chunk      |   begin |   end | ner_label   |   confidence |
|---:|:---------------|--------:|------:|:------------|-------------:|
|  0 | potassium      |      92 |   100 | DrugChem    |       0.5346 |
|  1 | anthracyclines |    1124 |  1137 | DrugChem    |       0.9639 |
|  2 | taxanes        |    1143 |  1149 | DrugChem    |       0.6532 |
|  3 | vinorelbine    |    1203 |  1213 | DrugChem    |       0.9729 |
|  4 | vinorelbine    |    1343 |  1353 | DrugChem    |       0.9815 |
|  5 | anthracyclines |    1390 |  1403 | DrugChem    |       0.9447 |
|  6 | taxanes        |    1409 |  1415 | DrugChem    |       0.6213 |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_drugs_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel

---
layout: model
title: Translate English to Kwangali Pipeline
author: John Snow Labs
name: translate_en_kwn
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, kwn, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences.
The use of an accelerator such as a GPU is recommended.

- source languages: `en`
- target languages: `kwn`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_kwn_xx_2.7.0_2.4_1609690865884.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_kwn_xx_2.7.0_2.4_1609690865884.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_kwn", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_kwn", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.kwn').predict(text, output_level='sentence') translate_df ```
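`annotate` also accepts a list of texts, returning one dictionary per input keyed by the pipeline's output columns. A hedged sketch of collecting document-level translations (the `translation` key and the placeholder results are assumptions — inspect the keys of the dictionaries your pipeline actually returns):

```python
# Placeholder dicts mimicking the shape of pipeline.annotate([...]) output,
# where each value is a list of sentence-level results.
results = [
    {"translation": ["Sentence one translated.", "Sentence two translated."]},
    {"translation": ["Another document."]},
]

# Join sentence-level outputs back into one string per input document.
documents = [" ".join(r["translation"]) for r in results]
```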
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_kwn| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Research And Intellectual Property Document Classifier (EURLEX) author: John Snow Labs name: legclf_research_and_intellectual_property_bert date: 2023-03-06 tags: [en, legal, classification, clauses, research_and_intellectual_property, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_research_and_intellectual_property_bert model, it is a Bert Sentence Embeddings Document Classifier, classifies if the document belongs to the class Research_and_Intellectual_Property or not (Binary Classification) according to EuroVoc labels. ## Predicted Entities `Research_and_Intellectual_Property`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_research_and_intellectual_property_bert_en_1.0.0_3.0_1678111868049.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_research_and_intellectual_property_bert_en_1.0.0_3.0_1678111868049.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_research_and_intellectual_property_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
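The `category` column holds one predicted label per document. A small sketch of turning collected predictions into a boolean flag (the `predictions` list is a plain-Python stand-in for collecting the `category.result` column from the transformed DataFrame above):

```python
# Stand-in for the collected "category.result" column: one single-label list per row.
predictions = [["Research_and_Intellectual_Property"], ["Other"], ["Other"]]

# Flag documents classified into the target class.
flags = [p[0] == "Research_and_Intellectual_Property" for p in predictions]
```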
## Results

```bash
+------------------------------------+
|result                              |
+------------------------------------+
|[Research_and_Intellectual_Property]|
|[Other]                             |
|[Other]                             |
|[Research_and_Intellectual_Property]|
+------------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_research_and_intellectual_property_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|21.6 MB|

## References

Train dataset available [here](https://huggingface.co/datasets/lex_glue)

## Benchmarking

```bash
label                               precision  recall  f1-score  support
Other                               0.80       0.82    0.81      110
Research_and_Intellectual_Property  0.85       0.83    0.84      133
accuracy                            -          -       0.83      243
macro-avg                           0.83       0.83    0.83      243
weighted-avg                        0.83       0.83    0.83      243
```

---
layout: model
title: OCR base v2 optimized for printed text
author: John Snow Labs
name: ocr_base_printed_v2_opt
date: 2023-01-17
tags: [en, licensed]
task: OCR Text Detection & Recognition
language: en
nav_key: models
edition: Visual NLP 4.2.4
spark_version: 3.2.1
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

OCR base printed model v2, optimized with ONNX, for recognizing printed text, based on a TrOCR model pretrained on printed datasets. It is an OCR base model for text recognition based on the TrOCR architecture. The TrOCR model was proposed in TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to perform optical character recognition (OCR).

The abstract from the paper is the following:

Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation.
In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/ocr/RECOGNIZE_PRINTED/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/Cards/SparkOcrImageToTextPrinted_V2_opt.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/ocr_base_printed_v2_opt_en_4.2.2_3.0_1670605909000.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/ocr_base_printed_v2_opt_en_4.2.2_3.0_1670605909000.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
binary_to_image = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image") \
    .setImageType(ImageType.TYPE_3BYTE_BGR)

text_detector = ImageTextDetectorV2 \
    .pretrained("image_text_detector_v2", "en", "clinical/ocr") \
    .setInputCol("image") \
    .setOutputCol("text_regions") \
    .setWithRefiner(True) \
    .setSizeThreshold(-1) \
    .setLinkThreshold(0.3) \
    .setWidth(500)

ocr = ImageToTextV2Opt.pretrained("ocr_base_printed_v2_opt", "en", "clinical/ocr") \
    .setInputCols(["image", "text_regions"]) \
    .setGroupImages(True) \
    .setOutputCol("text") \
    .setRegionsColumn("text_regions")

draw_regions = ImageDrawRegions() \
    .setInputCol("image") \
    .setInputRegionsCol("text_regions") \
    .setOutputCol("image_with_regions") \
    .setRectColor(Color.green) \
    .setRotated(True)

pipeline = PipelineModel(stages=[
    binary_to_image,
    text_detector,
    ocr,
    draw_regions
])

imagePath = pkg_resources.resource_filename('sparkocr', 'resources/ocr/images/check.jpg')
image_df = spark.read.format("binaryFile").load(imagePath)
result = pipeline.transform(image_df).cache()
```

```scala
val binary_to_image = new BinaryToImage()
  .setInputCol("content")
  .setOutputCol("image")
  .setImageType(ImageType.TYPE_3BYTE_BGR)

val text_detector = ImageTextDetectorV2
  .pretrained("image_text_detector_v2", "en", "clinical/ocr")
  .setInputCol("image")
  .setOutputCol("text_regions")
  .setWithRefiner(true)
  .setSizeThreshold(-1)
  .setLinkThreshold(0.3)
  .setWidth(500)

val ocr = ImageToTextV2Opt
  .pretrained("ocr_base_printed_v2_opt", "en", "clinical/ocr")
  .setInputCols(Array("image", "text_regions"))
  .setGroupImages(true)
  .setOutputCol("text")
  .setRegionsColumn("text_regions")

val draw_regions = new ImageDrawRegions()
  .setInputCol("image")
  .setInputRegionsCol("text_regions")
  .setOutputCol("image_with_regions")
  .setRectColor(Color.green)
  .setRotated(true)

val pipeline = new PipelineModel().setStages(Array(
  binary_to_image, text_detector, ocr, draw_regions))

// Use a local path to the sample image (the Python example resolves it via pkg_resources)
val imagePath = "resources/ocr/images/check.jpg"
val image_df = spark.read.format("binaryFile").load(imagePath)
val result = pipeline.transform(image_df).cache()
```
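ImageToTextV2 recognizes text per detected region; when stitching a page back together, regions are typically ordered by position. A hypothetical sketch with `(y, text)` tuples standing in for the region metadata produced by the pipeline above (the coordinates are invented for illustration):

```python
# Hypothetical (y_position, text) pairs for detected regions, in arbitrary order.
regions = [
    (120, "11302 EUCLID AVENUE"),
    (40, "STARBUCKS STORE #10208"),
    (200, "CLEVELAND, OH"),
]

# Sort top-to-bottom by y coordinate and join into page text.
page_text = "\n".join(text for _, text in sorted(regions))
```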
## Example {%- capture input_image -%} ![Screenshot](/assets/images/examples_ocr/image2.png) {%- endcapture -%} {%- capture output_image -%} ![Screenshot](/assets/images/examples_ocr/image2_out3.png) {%- endcapture -%} {% include templates/input_output_image.md input_image=input_image output_image=output_image %} ## Output text ```bash STARBUCKS STORE #10208 11302 EUCLID AVENUE CLEVELAND, OH (216) 229-0749 CHK 664290 12/07/2014 06:43 PM 1912003- DRAWER: 2. REG: 2 VT PEP MOCHA SBUX CARD : 4.95 XXXXXXXXXXXX3228 SUBTOTAL: $4.95 TOTAL @ 6 @ < $4.95 CHANGE DUE $0.00 ................ CCHECK CLOSED 12/07/2014 06:43 PM SBUX CARD X3228 NEW BALANCE: 37.45 CARD IS REGISTERED ``` ## Model Information {:.table-model} |---|---| |Model Name:|ocr_base_printed_v2_opt| |Type:|ocr| |Compatibility:|Visual NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| --- layout: model title: Detect Entities Related to Cancer Diagnosis author: John Snow Labs name: ner_oncology_diagnosis_wip date: 2022-09-30 tags: [licensed, clinical, oncology, en, ner] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts entities related to cancer diagnosis, such as Metastasis, Histological_Type or Tumor_Size. Definitions of Predicted Entities: - `Adenopathy`: Mentions of pathological findings of the lymph nodes. - `Cancer_Dx`: Mentions of cancer diagnoses (such as "breast cancer") or pathological types that are usually used as synonyms for "cancer" (e.g. "carcinoma"). When anatomical references are present, they are included in the Cancer_Dx extraction. - `Cancer_Score`: Clinical or imaging scores that are specific for cancer settings (e.g. "BI-RADS" or "Allred score"). - `Grade`: All pathological grading of tumors (e.g. "grade 1") or degrees of cellular differentiation (e.g. 
"well-differentiated") - `Histological_Type`: Histological variants or cancer subtypes, such as "papillary", "clear cell" or "medullary". - `Invasion`: Mentions that refer to tumor invasion, such as "invasion" or "involvement". Metastases or lymph node involvement are excluded from this category. - `Metastasis`: Terms that indicate a metastatic disease. Anatomical references are not included in these extractions. - `Pathology_Result`: The findings of a biopsy from the pathology report that is not covered by another entity (e.g. "malignant ductal cells"). - `Performance_Status`: Mentions of performance status scores, such as ECOG and Karnofsky. The name of the score is extracted together with the result (e.g. "ECOG performance status of 4"). - `Staging`: Mentions of cancer stage such as "stage 2b" or "T2N1M0". It also includes words such as "in situ", "early-stage" or "advanced". - `Tumor_Finding`: All nonspecific terms that may be related to tumors, either malignant or benign (for example: "mass", "tumor", "lesion", or "neoplasm"). - `Tumor_Size`: Size of the tumor, including numerical value and unit of measurement (e.g. "3 cm"). 
## Predicted Entities `Adenopathy`, `Cancer_Dx`, `Cancer_Score`, `Grade`, `Histological_Type`, `Invasion`, `Metastasis`, `Pathology_Result`, `Performance_Status`, `Staging`, `Tumor_Finding`, `Tumor_Size` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_diagnosis_wip_en_4.0.0_3.0_1664561418256.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_diagnosis_wip_en_4.0.0_3.0_1664561418256.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_diagnosis_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["Two years ago, the patient presented with a tumor in her left breast and adenopathies. 
She was diagnosed with invasive ductal carcinoma.Last week she was also found to have a lung metastasis."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_diagnosis_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Two years ago, the patient presented with a tumor in her left breast and adenopathies. She was diagnosed with invasive ductal carcinoma. Last week she was also found to have a lung metastasis.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_diseases_wip").predict("""Two years ago, the patient presented with a tumor in her left breast and adenopathies. She was diagnosed with invasive ductal carcinoma.Last week she was also found to have a lung metastasis.""") ```
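Downstream code often wants the extracted chunks grouped by entity type. A minimal sketch over `(chunk, label)` pairs like those in the Results section (plain tuples stand in for the exploded `ner_chunk` column):

```python
from collections import defaultdict

# (chunk, label) pairs as produced by the NER converter for the example text.
pairs = [
    ("tumor", "Tumor_Finding"),
    ("adenopathies", "Adenopathy"),
    ("invasive", "Histological_Type"),
    ("ductal", "Histological_Type"),
    ("carcinoma", "Cancer_Dx"),
    ("metastasis", "Metastasis"),
]

by_label = defaultdict(list)
for chunk, label in pairs:
    by_label[label].append(chunk)
```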
## Results

```bash
| chunk        | ner_label         |
|:-------------|:------------------|
| tumor        | Tumor_Finding     |
| adenopathies | Adenopathy        |
| invasive     | Histological_Type |
| ductal       | Histological_Type |
| carcinoma    | Cancer_Dx         |
| metastasis   | Metastasis        |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_oncology_diagnosis_wip|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|858.8 KB|

## References

In-house annotated oncology case reports.

## Benchmarking

```bash
label              tp     fp    fn     total  precision recall f1
Histological_Type  210.0  38.0  133.0  343.0  0.85      0.61   0.71
Staging            172.0  17.0  44.0   216.0  0.91      0.80   0.85
Cancer_Score       29.0   6.0   30.0   59.0   0.83      0.49   0.62
Tumor_Finding      837.0  48.0  105.0  942.0  0.95      0.89   0.92
Invasion           99.0   14.0  34.0   133.0  0.88      0.74   0.80
Tumor_Size         710.0  75.0  142.0  852.0  0.90      0.83   0.87
Adenopathy         30.0   8.0   14.0   44.0   0.79      0.68   0.73
Performance_Status 50.0   8.0   50.0   100.0  0.86      0.50   0.63
Pathology_Result   514.0  249.0 341.0  855.0  0.67      0.60   0.64
Metastasis         276.0  18.0  13.0   289.0  0.94      0.96   0.95
Cancer_Dx          946.0  48.0  120.0  1066.0 0.95      0.89   0.92
Grade              149.0  20.0  49.0   198.0  0.88      0.75   0.81
macro_avg          4022.0 549.0 1075.0 5097.0 0.87      0.73   0.79
micro_avg          NaN    NaN   NaN    NaN    0.88      0.79   0.83
```

---
layout: model
title: Thai Word Segmentation
author: John Snow Labs
name: wordseg_best
date: 2021-09-22
tags: [th, open_source, word_segmentation]
task: Word Segmentation
language: th
edition: Spark NLP 3.0.0
spark_version: 3.0
supported: true
annotator: WordSegmenterModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

WordSegmenterModel (WSM) is based on a maximum entropy probability model to detect word boundaries in Thai text. Thai text is written without white space between words, so a computer-based application cannot know _a priori_ which sequence of characters forms a word.
Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step.

References:

> Xue, Nianwen. "Chinese word segmentation as character tagging." International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese Language Processing. 2003.

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_best_th_3.0.0_3.0_1632311280102.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_best_th_3.0.0_3.0_1632311280102.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use

Use as part of an NLP pipeline as a substitute for the Tokenizer stage.
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# The word segmenter takes the place of the Tokenizer stage and emits the "token" column.
word_segmenter = WordSegmenterModel.pretrained('wordseg_best', 'th')\
    .setInputCols("document")\
    .setOutputCol("token")

pipeline = Pipeline(stages=[document_assembler, word_segmenter])

example = spark.createDataFrame([['จวนจะถึงร้านที่คุณจองโต๊ะไว้แล้วจ้ะ']], ["text"])
result = pipeline.fit(example).transform(example)
```

```scala
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// The word segmenter takes the place of the Tokenizer stage and emits the "token" column.
val word_segmenter = WordSegmenterModel.pretrained("wordseg_best", "th")
  .setInputCols("document")
  .setOutputCol("token")

val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter))

val data = Seq("จวนจะถึงร้านที่คุณจองโต๊ะไว้แล้วจ้ะ").toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
pipe = nlu.load('th.segment_words')
data = ["จวนจะถึงร้านที่คุณจองโต๊ะไว้แล้วจ้ะ"]
df = pipe.predict(data, output_level='token')
df
```
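A quick sanity check on segmentation output: the word segmenter only inserts boundaries, so the tokens it emits concatenate back to the original string. Sketched with the example sentence and the tokens from the Results section:

```python
text = "จวนจะถึงร้านที่คุณจองโต๊ะไว้แล้วจ้ะ"
tokens = ["จวน", "จะ", "ถึง", "ร้าน", "ที่", "คุณ", "จอง", "โต๊ะ", "ไว้", "แล้ว", "จ้ะ"]

# Segmentation adds no characters, so joining the tokens restores the input.
roundtrip = "".join(tokens) == text
```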
## Results ```bash +-----------------------------------+---------------------------------------------------------+ |text |result | +-----------------------------------+---------------------------------------------------------+ |จวนจะถึงร้านที่คุณจองโต๊ะไว้แล้วจ้ะ|[จวน, จะ, ถึง, ร้าน, ที่, คุณ, จอง, โต๊ะ, ไว้, แล้ว, จ้ะ]| +-----------------------------------+---------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wordseg_best| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[token]| |Language:|th| ## Data Source The model was trained on the [BEST](http://thailang.nectec.or.th/best) corpus from the National Electronics and Computer Technology Center (NECTEC). References: > Krit Kosawat, Monthika Boriboon, Patcharika Chootrakool, Ananlada Chotimongkol, Supon Klaithin, Sarawoot Kongyoung, Kanyanut Kriengket, Sitthaa Phaholphinyo, Sumonmas Purodakananda, Tipraporn Thanakulwarapas, and Chai Wutiwiwatchai, "BEST 2009: Thai word segmentation software contest," in Proc. 8th Int. Symp. Natural Language Process. (SNLP), Bangkok, Thailand, Oct.20-22, 2009, pp.83-88. > Monthika Boriboon, Kanyanut Kriengket, Patcharika Chootrakool, Sitthaa Phaholphinyo, Sumonmas Purodakananda, Tipraporn Thanakulwarapas, and Krit Kosawat, "BEST corpus development and analysis," in Proc. 2nd Int. Conf. Asian Language Process. (IALP), Singapore, Dec.7-9, 2009, pp.322-327. 
## Benchmarking ```bash | Model | precision | recall | f1-score | |--------------|-----------|--------|----------| | WORDSEG_BEST | 0.4791 | 0.6245 | 0.5422 | ``` --- layout: model title: Pipeline to Mapping ICDO Codes with Their Corresponding SNOMED Codes author: John Snow Labs name: icdo_snomed_mapping date: 2023-06-13 tags: [en, licensed, clinical, resolver, pipeline, chunk_mapping, icdo, snomed] task: Chunk Mapping language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of `icdo_snomed_mapper` model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icdo_snomed_mapping_en_4.4.4_3.2_1686665533628.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icdo_snomed_mapping_en_4.4.4_3.2_1686665533628.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("icdo_snomed_mapping", "en", "clinical/models")

result = pipeline.fullAnnotate("8120/1 8170/3 8380/3")
```

```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = new PretrainedPipeline("icdo_snomed_mapping", "en", "clinical/models")

val result = pipeline.fullAnnotate("8120/1 8170/3 8380/3")
```

{:.nlu-block}
```python
import nlu
nlu.load("en.map_entity.icdo_to_snomed.pipe").predict("""8120/1 8170/3 8380/3""")
```
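The mapper returns SNOMED codes positionally for whitespace-separated ICD-O input. A small sketch pairing the codes from the Results section (plain strings stand in for parsing the pipeline's annotation output):

```python
# Input ICD-O codes and the SNOMED codes returned for them, in order.
icdo_codes = "8120/1 8170/3 8380/3".split()
snomed_codes = "45083001 25370001 30289006".split()

# Pair them positionally into a lookup table.
mapping = dict(zip(icdo_codes, snomed_codes))
```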
## Results

```bash
|    | icdo_code | snomed_code |
|---:|:----------|:------------|
|  0 | 8120/1    | 45083001    |
|  1 | 8170/3    | 25370001    |
|  2 | 8380/3    | 30289006    |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|icdo_snomed_mapping|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|137.3 KB|

## Included Models

- DocumentAssembler
- TokenizerModel
- ChunkMapperModel

---
layout: model
title: Translate English to Hausa Pipeline
author: John Snow Labs
name: translate_en_ha
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, ha, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `ha` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ha_xx_2.7.0_2.4_1609689159455.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ha_xx_2.7.0_2.4_1609689159455.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ha", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ha", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ha').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_en_ha|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: German asr_wav2vec2_large_xlsr_53_german_cv8 TFWav2Vec2ForCTC from oliverguhr
author: John Snow Labs
name: asr_wav2vec2_large_xlsr_53_german_cv8
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_german_cv8` is a German model originally trained by oliverguhr.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_german_cv8_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_german_cv8_de_4.2.0_3.0_1664102427064.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_german_cv8_de_4.2.0_3.0_1664102427064.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_german_cv8", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_german_cv8", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
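The `audioDf` referenced above needs an `audio_content` column of float samples; Wav2Vec2 models generally expect 16 kHz mono input. A hypothetical sketch building one second of synthetic audio with the standard library (the DataFrame line is commented out because it needs a live `spark` session):

```python
import math

SAMPLE_RATE = 16000  # Hz; Wav2Vec2 models are typically trained on 16 kHz audio

# One second of a quiet 440 Hz sine wave as plain Python floats.
samples = [0.1 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]

# audioDf = spark.createDataFrame([[samples]], ["audio_content"])  # requires a SparkSession
```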
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_german_cv8| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: English asr_wav2vec2_xls_r_300m_AM_CV8_v1 TFWav2Vec2ForCTC from emre author: John Snow Labs name: pipeline_asr_wav2vec2_xls_r_300m_AM_CV8_v1 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_AM_CV8_v1` is an English model originally trained by emre. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xls_r_300m_AM_CV8_v1_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_AM_CV8_v1_en_4.2.0_3.0_1664038308114.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_AM_CV8_v1_en_4.2.0_3.0_1664038308114.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_300m_AM_CV8_v1', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_300m_AM_CV8_v1", lang = "en") val annotations = pipeline.transform(audioDF) ```
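The snippet above assumes an `audioDF` whose `audio_content` column holds the raw waveform as an array of floats. As an illustrative sketch (not part of the original card), one way to decode a mono 16-bit PCM WAV file into such an array using only the Python standard library; the final Spark step is left as a comment because it assumes an active `spark` session:

```python
import wave
import struct

def wav_to_floats(path: str) -> list:
    """Decode a mono 16-bit PCM WAV file into floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wav:
        frames = wav.readframes(wav.getnframes())
    # Each sample is a little-endian signed 16-bit integer.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Hypothetical usage with an active Spark session:
# audioDF = spark.createDataFrame([[wav_to_floats("sample.wav")]], ["audio_content"])
```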
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xls_r_300m_AM_CV8_v1| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Pipeline to Extract Cancer Therapies and Granular Posology Information author: John Snow Labs name: ner_oncology_posology_pipeline date: 2023-03-08 tags: [licensed, clinical, en, oncology, ner, treatment, posology] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_oncology_posology](https://nlp.johnsnowlabs.com/2022/11/24/ner_oncology_posology_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_posology_pipeline_en_4.3.0_3.2_1678284759416.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_posology_pipeline_en_4.3.0_3.2_1678284759416.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_oncology_posology_pipeline", "en", "clinical/models") text = '''The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving his second cycle of chemotherapy and is in good overall condition.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_oncology_posology_pipeline", "en", "clinical/models") val text = "The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving his second cycle of chemotherapy and is in good overall condition." val result = pipeline.fullAnnotate(text) ```
## Results ```bash |    | ner_chunks       |   begin |   end | ner_label      |   confidence | |---:|:-----------------|--------:|------:|:---------------|-------------:| |  0 | adriamycin       |      46 |    55 | Cancer_Therapy |      1       | |  1 | 60 mg/m2         |      58 |    65 | Dosage         |      0.92005 | |  2 | cyclophosphamide |      72 |    87 | Cancer_Therapy |      0.9999  | |  3 | 600 mg/m2        |      90 |    98 | Dosage         |      0.9229  | |  4 | six courses      |     106 |   116 | Cycle_Count    |      0.494   | |  5 | second cycle     |     150 |   161 | Cycle_Number   |      0.98675 | |  6 | chemotherapy     |     166 |   177 | Cancer_Therapy |      1       | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_posology_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English asr_wav2vec2_xls_r_300m_hindi_lm TFWav2Vec2ForCTC from shoubhik author: John Snow Labs name: pipeline_asr_wav2vec2_xls_r_300m_hindi_lm date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_hindi_lm` is an English model originally trained by shoubhik.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_xls_r_300m_hindi_lm_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_hindi_lm_en_4.2.0_3.0_1664106569621.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xls_r_300m_hindi_lm_en_4.2.0_3.0_1664106569621.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xls_r_300m_hindi_lm', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xls_r_300m_hindi_lm", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xls_r_300m_hindi_lm| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Finnish asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_aapot TFWav2Vec2ForCTC from aapot author: John Snow Labs name: asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_aapot date: 2022-09-25 tags: [wav2vec2, fi, audio, open_source, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_aapot` is a Finnish model originally trained by aapot. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_aapot_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_aapot_fi_4.2.0_3.0_1664094457723.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_aapot_fi_4.2.0_3.0_1664094457723.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_aapot", "fi")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_aapot", "fi") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_aapot| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|fi| |Size:|3.6 GB| --- layout: model title: Castilian, Spanish BertForQuestionAnswering model (from CenIA) author: John Snow Labs name: bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_sqac date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-cased-finetuned-qa-sqac` is a Castilian, Spanish model originally trained by `CenIA`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_sqac_es_4.0.0_3.0_1654180417429.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_sqac_es_4.0.0_3.0_1654180417429.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_sqac","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_sqac","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.sqac.bert.base_cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_sqac| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|es| |Size:|410.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/CenIA/bert-base-spanish-wwm-cased-finetuned-qa-sqac --- layout: model title: Word2Vec Embeddings in Venetian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, vec, open_source] task: Embeddings language: vec edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_vec_3.4.1_3.0_1647465762438.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_vec_3.4.1_3.0_1647465762438.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","vec") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","vec") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("vec.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|vec| |Size:|133.6 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_masapasa TFWav2Vec2ForCTC from masapasa author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_masapasa date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_masapasa` is an English model originally trained by masapasa. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_masapasa_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_masapasa_en_4.2.0_3.0_1664096660826.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_masapasa_en_4.2.0_3.0_1664096660826.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_masapasa', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_masapasa", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_turkish_colab_by_masapasa| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|756.6 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering model (from andi611) author: John Snow Labs name: bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_conll2003_with_neg_with_repeat date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-whole-word-masking-squad2-with-ner-conll2003-with-neg-with-repeat` is an English model originally trained by `andi611`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_conll2003_with_neg_with_repeat_en_4.0.0_3.0_1654537420339.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_conll2003_with_neg_with_repeat_en_4.0.0_3.0_1654537420339.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_conll2003_with_neg_with_repeat","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_conll2003_with_neg_with_repeat","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_conll.bert.large_uncased.by_andi611").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_conll2003_with_neg_with_repeat| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/andi611/bert-large-uncased-whole-word-masking-squad2-with-ner-conll2003-with-neg-with-repeat --- layout: model title: Ukrainian Lemmatizer author: John Snow Labs name: lemma date: 2020-05-05 12:35:00 +0800 task: Lemmatization language: uk edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [lemmatizer, uk] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_uk_2.5.0_2.4_1588671294202.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_uk_2.5.0_2.4_1588671294202.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "uk") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("За винятком того, що є королем півночі, Джон Сноу є англійським лікарем та лідером у розвитку анестезії та медичної гігієни.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "uk") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("За винятком того, що є королем півночі, Джон Сноу є англійським лікарем та лідером у розвитку анестезії та медичної гігієни.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""За винятком того, що є королем півночі, Джон Сноу є англійським лікарем та лідером у розвитку анестезії та медичної гігієни."""] lemma_df = nlu.load('uk.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
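The mapping described above (all inflected forms of a word collapsing to one root) can be illustrated with a toy dictionary lookup. The token-to-lemma pairs are taken from the sample output shown in the Results section; the function itself is a simplified sketch, not the model's actual context-aware algorithm:

```python
def lemmatize(tokens, lemma_dict):
    """Look up each token's root form, keeping the token itself as a fallback."""
    return [lemma_dict.get(token.lower(), token) for token in tokens]

# Pairs drawn from the sample output (the real model also uses context to resolve ambiguity):
lemma_dict = {"винятком": "виняток", "того": "те"}
lemmas = lemmatize(["За", "винятком", "того", ","], lemma_dict)
```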
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=1, result='За', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=3, end=10, result='виняток', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=12, end=15, result='те', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=16, end=16, result=',', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=18, end=19, result='що', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|uk| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: ALBERT XLarge CoNNL-03 NER Pipeline author: John Snow Labs name: albert_xlarge_token_classifier_conll03_pipeline date: 2022-04-23 tags: [open_source, ner, token_classifier, albert, conll03, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [albert_xlarge_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/26/albert_xlarge_token_classifier_conll03_en.html) model. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_xlarge_token_classifier_conll03_pipeline_en_3.4.1_3.0_1650712616814.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_xlarge_token_classifier_conll03_pipeline_en_3.4.1_3.0_1650712616814.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("albert_xlarge_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("albert_xlarge_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ```
## Results ```bash +--------------+---------+ |chunk         |ner_label| +--------------+---------+ |John          |PER      | |John Snow Labs|ORG      | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_xlarge_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|206.5 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - AlbertForTokenClassification - NerConverter - Finisher --- layout: model title: Legal Miscellaneous Clause Binary Classifier author: John Snow Labs name: legclf_miscellaneous_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `miscellaneous` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
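The paragraph splitting (by multiline) option mentioned above can be sketched without any Spark NLP dependency. This is a minimal illustration using a regex on blank lines, not the workshop notebook's exact implementation, and the contract text is a made-up example:

```python
import re

def split_paragraphs(document: str):
    """Split a document into candidate clauses on runs of blank lines."""
    parts = re.split(r"\n\s*\n", document)
    return [p.strip() for p in parts if p.strip()]

# Hypothetical contract fragment; each paragraph becomes one row of the classifier's input.
contract = ("10. MISCELLANEOUS.\nAll notices shall be in writing.\n\n"
            "11. GOVERNING LAW.\nThis Agreement is governed by the laws of the State of Delaware.")
paragraphs = split_paragraphs(contract)
```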
This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `miscellaneous` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_miscellaneous_clause_en_1.0.0_3.2_1660123735778.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_miscellaneous_clause_en_1.0.0_3.2_1660123735778.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_miscellaneous_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +---------------+ |         result| +---------------+ |[miscellaneous]| |        [other]| |        [other]| |[miscellaneous]| +---------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_miscellaneous_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scrapped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support miscellaneous 0.93 0.61 0.74 23 other 0.85 0.98 0.91 54 accuracy - - 0.87 77 macro-avg 0.89 0.80 0.83 77 weighted-avg 0.88 0.87 0.86 77 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Arabic author: John Snow Labs name: opus_mt_en_ar date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, ar, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list).
- source languages: `en` - target languages: `ar` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ar_xx_2.7.0_2.4_1609169007179.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ar_xx_2.7.0_2.4_1609169007179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_ar", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_ar", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.ar').predict(text, output_level='sentence') opus_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_en_ar|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Pipeline to Detect Living Species (w2v_cc_300d)
author: John Snow Labs
name: ner_living_species_pipeline
date: 2023-03-13
tags: [fr, ner, clinical, licensed]
task: Named Entity Recognition
language: fr
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [ner_living_species](https://nlp.johnsnowlabs.com/2022/06/23/ner_living_species_fr_3_0.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_pipeline_fr_4.3.0_3.2_1678705447222.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_pipeline_fr_4.3.0_3.2_1678705447222.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_living_species_pipeline", "fr", "clinical/models") text = '''Femme de 47 ans allergique à l'iode, fumeuse sociale, opérée pour des varices, deux césariennes et un abcès fessier. Vit avec son mari et ses trois enfants, travaille comme enseignante. Initialement, le patient a eu une bonne évolution, mais au 2ème jour postopératoire, il a commencé à montrer une instabilité hémodynamique. Les sérologies pour Coxiella burnetii, Bartonella henselae, Borrelia burgdorferi, Entamoeba histolytica, Toxoplasma gondii, herpès simplex virus 1 et 2, cytomégalovirus, virus d'Epstein Barr, virus de la varicelle et du zona et parvovirus B19 étaient négatives. Cependant, un test au rose Bengale positif pour Brucella, le test de Coombs et les agglutinations étaient également positifs avec un titre de 1/40.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_living_species_pipeline", "fr", "clinical/models") val text = "Femme de 47 ans allergique à l'iode, fumeuse sociale, opérée pour des varices, deux césariennes et un abcès fessier. Vit avec son mari et ses trois enfants, travaille comme enseignante. Initialement, le patient a eu une bonne évolution, mais au 2ème jour postopératoire, il a commencé à montrer une instabilité hémodynamique. Les sérologies pour Coxiella burnetii, Bartonella henselae, Borrelia burgdorferi, Entamoeba histolytica, Toxoplasma gondii, herpès simplex virus 1 et 2, cytomégalovirus, virus d'Epstein Barr, virus de la varicelle et du zona et parvovirus B19 étaient négatives. Cependant, un test au rose Bengale positif pour Brucella, le test de Coombs et les agglutinations étaient également positifs avec un titre de 1/40." val result = pipeline.fullAnnotate(text) ```
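The `begin` and `end` values returned by `fullAnnotate` are inclusive character offsets into the input string, so `text[begin:end + 1]` recovers each chunk. A quick standalone sanity check against the offsets reported for this example (pure Python, no Spark session needed):

```python
# Spark NLP chunk annotations carry inclusive `begin`/`end` character
# offsets, so slicing text[begin:end + 1] reproduces each chunk exactly.
text = ("Femme de 47 ans allergique à l'iode, fumeuse sociale, opérée pour des "
        "varices, deux césariennes et un abcès fessier. Vit avec son mari et "
        "ses trois enfants, travaille comme enseignante.")

# (begin, end, expected_chunk) triples taken from the pipeline's output
for begin, end, chunk in [(0, 4, "Femme"), (130, 133, "mari"), (148, 154, "enfants")]:
    assert text[begin:end + 1] == chunk
```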
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:---------------------------------|--------:|------:|:------------|-------------:| | 0 | Femme | 0 | 4 | HUMAN | 1 | | 1 | mari | 130 | 133 | HUMAN | 0.982 | | 2 | enfants | 148 | 154 | HUMAN | 0.9863 | | 3 | patient | 203 | 209 | HUMAN | 0.9989 | | 4 | Coxiella burnetii | 346 | 362 | SPECIES | 0.9309 | | 5 | Bartonella henselae | 365 | 383 | SPECIES | 0.99275 | | 6 | Borrelia burgdorferi | 386 | 405 | SPECIES | 0.98795 | | 7 | Entamoeba histolytica | 408 | 428 | SPECIES | 0.98455 | | 8 | Toxoplasma gondii | 431 | 447 | SPECIES | 0.9736 | | 9 | cytomégalovirus | 479 | 493 | SPECIES | 0.9979 | | 10 | virus d'Epstein Barr | 496 | 515 | SPECIES | 0.788667 | | 11 | virus de la varicelle et du zona | 518 | 549 | SPECIES | 0.788543 | | 12 | parvovirus B19 | 554 | 567 | SPECIES | 0.9341 | | 13 | Brucella | 636 | 643 | SPECIES | 0.9993 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|fr| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Financial NER on EDGAR Documents author: John Snow Labs name: finner_sec_edgar date: 2023-04-13 tags: [en, licensed, finance, ner, sec] task: Named Entity Recognition language: en edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: FinanceNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Financial NER model extracts `ORG`, `INST`, `LAW`, `COURT`, `PER`, `LOC`, `MISC`, `ALIAS`, and `TICKER` entities from the US SEC EDGAR documents. 
## Predicted Entities `ALIAS`, `COURT`, `INST`, `LAW`, `LOC`, `MISC`, `ORG`, `PER`, `TICKER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_sec_edgar_en_1.0.0_3.0_1681390760896.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_sec_edgar_en_1.0.0_3.0_1681390760896.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = finance.NerModel.pretrained("finner_sec_edgar", "en", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""In our opinion, the accompanying consolidated balance sheets and the related consolidated statements of operations, of changes in stockholders' equity, and of cash flows present fairly, in all material respects, the financial position of SunGard Capital Corp. II and its subsidiaries ( SCC II ) at December 31, 2010, and 2009, and the results of their operations and their cash flows for each of the three years in the period ended December 31, 2010, in conformity with accounting principles generally accepted in the United States of America."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
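The `NerConverter` stage merges token-level IOB tags emitted by the NER model into entity chunks. A minimal pure-Python sketch of that merging logic (illustrative only — not the actual Spark NLP implementation, and the token/tag example below is hypothetical):

```python
def iob_to_chunks(tokens, tags):
    """Merge token-level IOB tags (B-XXX / I-XXX / O) into (chunk, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O" tag or inconsistent continuation closes the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["SunGard", "Capital", "Corp.", "II", "(", "SCC", "II", ")"]
tags   = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "B-ALIAS", "I-ALIAS", "O"]
print(iob_to_chunks(tokens, tags))
# [('SunGard Capital Corp. II', 'ORG'), ('SCC II', 'ALIAS')]
```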
## Results ```bash +----------------------------------------+---------+ |chunk |ner_label| +----------------------------------------+---------+ |SunGard Capital Corp. II |ORG | |SCC II |ALIAS | |accounting principles generally accepted|LAW | |United States of America |LOC | +----------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_sec_edgar| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References In-house annotations ## Benchmarking ```bash label precision recall f1-score support ALIAS 0.88 0.93 0.90 84 COURT 1.00 0.83 0.91 6 INST 0.95 0.78 0.86 76 LAW 0.90 0.90 0.90 166 LOC 0.83 0.83 0.83 139 MISC 0.83 0.81 0.82 226 ORG 0.88 0.87 0.87 430 PER 0.91 0.79 0.85 66 TICKER 0.86 0.86 0.86 7 micro-avg 0.87 0.85 0.86 1200 macro-avg 0.89 0.84 0.87 1200 weighted-avg 0.88 0.85 0.86 1200 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Urdu author: John Snow Labs name: opus_mt_en_ur date: 2020-12-29 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, ur, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `en` - target languages: `ur` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ur_xx_2.7.0_2.4_1609254422333.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ur_xx_2.7.0_2.4_1609254422333.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_ur", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_ur", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.ur').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_en_ur|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Pipeline to Detect Drugs, Experimental Drugs and Cycles Information
author: John Snow Labs
name: ner_posology_experimental_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, drug, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [ner_posology_experimental](https://nlp.johnsnowlabs.com/2021/09/01/ner_posology_experimental_en.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_experimental_pipeline_en_3.4.1_3.0_1647872053101.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_experimental_pipeline_en_3.4.1_3.0_1647872053101.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_posology_experimental_pipeline", "en", "clinical/models") pipeline.annotate("Y-90 Humanized Anti-Tac: 10 mCi (if a bone marrow transplant was part of the patient's previous therapy) or 15 mCi of yttrium labeled anti-TAC; followed by calcium trisodium Inj (Ca DTPA). Calcium-DTPA: Ca-DTPA will be administered intravenously on Days 1-3 to clear the radioactive agent from the body.") ``` ```scala val pipeline = new PretrainedPipeline("ner_posology_experimental_pipeline", "en", "clinical/models") pipeline.annotate("Y-90 Humanized Anti-Tac: 10 mCi (if a bone marrow transplant was part of the patient's previous therapy) or 15 mCi of yttrium labeled anti-TAC; followed by calcium trisodium Inj (Ca DTPA). Calcium-DTPA: Ca-DTPA will be administered intravenously on Days 1-3 to clear the radioactive agent from the body.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology_experimental.pipeline").predict("""Y-90 Humanized Anti-Tac: 10 mCi (if a bone marrow transplant was part of the patient's previous therapy) or 15 mCi of yttrium labeled anti-TAC; followed by calcium trisodium Inj (Ca DTPA). Calcium-DTPA: Ca-DTPA will be administered intravenously on Days 1-3 to clear the radioactive agent from the body.""") ```
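The pipeline reports each chunk with inclusive `begin`/`end` character offsets into the input text, so slicing with `text[begin:end + 1]` reproduces the chunk. A standalone sanity check against the offsets reported for this example (pure Python, no Spark needed):

```python
# Prefix of the example input; `begin`/`end` offsets are inclusive.
text = ("Y-90 Humanized Anti-Tac: 10 mCi (if a bone marrow transplant was part "
        "of the patient's previous therapy) or 15 mCi of yttrium labeled anti-TAC;")

# (begin, end, expected_chunk) triples taken from the pipeline's output
checks = [
    (15, 22, "Anti-Tac"),
    (25, 30, "10 mCi"),
    (108, 113, "15 mCi"),
    (118, 141, "yttrium labeled anti-TAC"),
]
for begin, end, chunk in checks:
    assert text[begin:end + 1] == chunk
```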
## Results

```bash
|    | chunk                    |   begin |   end | entity   |
|---:|:-------------------------|--------:|------:|:---------|
|  0 | Anti-Tac                 |      15 |    22 | Drug     |
|  1 | 10 mCi                   |      25 |    30 | Dosage   |
|  2 | 15 mCi                   |     108 |   113 | Dosage   |
|  3 | yttrium labeled anti-TAC |     118 |   141 | Drug     |
|  4 | calcium trisodium Inj    |     156 |   176 | Drug     |
|  5 | Calcium-DTPA             |     191 |   202 | Drug     |
|  6 | Ca-DTPA                  |     205 |   211 | Drug     |
|  7 | intravenously            |     234 |   246 | Route    |
|  8 | Days 1-3                 |     251 |   258 | Cycleday |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_posology_experimental_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter

---
layout: model
title: Pipeline to Detect Problem, Test and Treatment (Large)
author: John Snow Labs
name: ner_clinical_large_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, problem, test, treatment, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [ner_clinical_large](https://nlp.johnsnowlabs.com/2021/03/31/ner_clinical_large_en.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_large_pipeline_en_3.4.1_3.0_1647872951545.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_large_pipeline_en_3.4.1_3.0_1647872951545.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_clinical_large_pipeline", "en", "clinical/models") pipeline.annotate("The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.") ``` ```scala val pipeline = new PretrainedPipeline("ner_clinical_large_pipeline", "en", "clinical/models") pipeline.annotate("The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. 
We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical_large.pipeline").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes.""") ```
## Results ```bash +-----------------------------------------------------------+---------+ |the G-protein-activated inwardly rectifying potassium (GIRK|TREATMENT| |the genomicorganization |TREATMENT| |a candidate gene forType II diabetes mellitus |PROBLEM | |byapproximately |TREATMENT| |single nucleotide polymorphisms |TREATMENT| |aVal366Ala substitution |TREATMENT| |an 8 base-pair |TREATMENT| |insertion/deletion |PROBLEM | |Ourexpression studies |TEST | |the transcript in various humantissues |PROBLEM | |fat andskeletal muscle |PROBLEM | |furtherstudies |PROBLEM | |the KCNJ9 protein |TREATMENT| |evaluation |TEST | |Type II diabetes |PROBLEM | |the treatment |TREATMENT| |breast cancer |PROBLEM | |the standard therapy |TREATMENT| |anthracyclines |TREATMENT| |taxanes |TREATMENT| +-----------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical_large_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Turkish BertForQuestionAnswering Cased model (from Aybars) author: John Snow Labs name: bert_qa_modelonwhol date: 2022-07-07 tags: [tr, open_source, bert, question_answering] task: Question Answering language: tr edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ModelOnWhole` is a Turkish model originally trained by `Aybars`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_modelonwhol_tr_4.0.0_3.0_1657182195552.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_modelonwhol_tr_4.0.0_3.0_1657182195552.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_modelonwhol","tr") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_modelonwhol","tr")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_modelonwhol| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|tr| |Size:|689.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Aybars/ModelOnWhole --- layout: model title: Fast Neural Machine Translation Model from Hiri Motu to English author: John Snow Labs name: opus_mt_ho_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ho, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `ho` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ho_en_xx_2.7.0_2.4_1609165109583.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ho_en_xx_2.7.0_2.4_1609165109583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_ho_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate("text to translate")
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_ho_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val data = Seq("text to translate").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.ho.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ho_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Word2Vec Embeddings in Low German (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, nds, open_source] task: Embeddings language: nds edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_nds_3.4.1_3.0_1647443417814.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_nds_3.4.1_3.0_1647443417814.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","nds") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","nds") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("nds.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
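Each token is mapped to a 300-dimensional vector, and downstream components typically compare tokens by cosine similarity. A minimal stdlib-Python illustration of that comparison (the 3-d vectors below are toy stand-ins, not actual embeddings from this model):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for 300-d embeddings of related / unrelated words.
water = [0.9, 0.1, 0.3]
sea   = [0.8, 0.2, 0.4]
tax   = [-0.1, 0.9, -0.5]

# Related words should score higher than unrelated ones.
assert cosine(water, sea) > cosine(water, tax)
```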
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|nds| |Size:|288.5 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Fast Neural Machine Translation Model from Arabic to Russian author: John Snow Labs name: opus_mt_ar_ru date: 2021-06-01 tags: [open_source, seq2seq, translation, ar, ru, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). source languages: ar target languages: ru {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ar_ru_xx_3.1.0_2.4_1622557691586.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ar_ru_xx_3.1.0_2.4_1622557691586.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_ar_ru", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_ar_ru", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Arabic.translate_to.Russian').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_ar_ru|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Translate English to Spanish Pipeline
author: John Snow Labs
name: translate_en_es
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, en, es, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `en`
- target languages: `es`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_es_xx_2.7.0_2.4_1609688119311.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_es_xx_2.7.0_2.4_1609688119311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_es", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_es", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.es').predict(text, output_level='sentence') translate_df ```
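The `annotate` call above returns a plain dictionary that maps each output column to a list of annotation strings. A minimal plain-Python sketch of pulling the translation out of such a result (the dictionary contents below are a hypothetical stand-in, not actual model output):

```python
# Hypothetical shape of a PretrainedPipeline.annotate() result: a dict
# mapping each output column to its list of annotation strings. The
# Spanish text is an illustrative stand-in, not actual model output.
result = {
    "document": ["Your sentence to translate!"],
    "sentence": ["Your sentence to translate!"],
    "translation": ["Tu frase para traducir!"],  # hypothetical output
}

def get_translations(annotations):
    """Pull the translated sentences out of an annotate() result."""
    return annotations.get("translation", [])

print(get_translations(result))
```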
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_es| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Extract relations between drugs and proteins author: John Snow Labs name: re_drugprot_clinical date: 2022-01-05 tags: [relation_extraction, clinical, en, licensed] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 3.0 supported: true annotator: RelationExtractionModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description NOTE: This model has been improved upon by a new SOTA, BERT-based relation extraction model, which you can find [here](https://nlp.johnsnowlabs.com/2022/01/05/redl_drugprot_biobert_en.html). Detect interactions between chemical compounds/drugs and genes/proteins using Spark NLP's `RelationExtractionModel()` by classifying whether a specified semantic relation holds between chemical and gene entities within a sentence or document. The entity labels used during training were derived from the [custom NER model](https://nlp.johnsnowlabs.com/2021/12/20/ner_drugprot_clinical_en.html) created by our team for the [DrugProt corpus](https://zenodo.org/record/5119892). These include `CHEMICAL` for chemical compounds/drugs, `GENE` for genes/proteins and `GENE_AND_CHEMICAL` for entity mentions of type `GENE` and of type `CHEMICAL` that overlap (such as enzymes and small peptides). The relation categories from the [DrugProt corpus](https://zenodo.org/record/5119892) were condensed from 13 categories to 10 due to low numbers of examples for certain categories. 
This merging process involved grouping the `SUBSTRATE_PRODUCT-OF` and `SUBSTRATE` relation categories together and grouping the `AGONIST-ACTIVATOR`, `AGONIST-INHIBITOR` and `AGONIST` relation categories together. ## Predicted Entities `INHIBITOR`, `DIRECT-REGULATOR`, `SUBSTRATE`, `ACTIVATOR`, `INDIRECT-UPREGULATOR`, `INDIRECT-DOWNREGULATOR`, `ANTAGONIST`, `PRODUCT-OF`, `PART-OF`, `AGONIST` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb#scrollTo=8tgB0NdZJlQU){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_drugprot_clinical_en_3.3.4_3.0_1641397921687.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_drugprot_clinical_en_3.3.4_3.0_1641397921687.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The table below lists the `re_drugprot_clinical` RE model, its labels, its optimal companion NER model, and the meaningful relation pairs.
| RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS |
|:--------------------:|:-----------------:|:---------------------:|:----------------------------------------------------------------------------------------|
| re_drugprot_clinical | INHIBITOR, DIRECT-REGULATOR, SUBSTRATE, ACTIVATOR, INDIRECT-UPREGULATOR, INDIRECT-DOWNREGULATOR, ANTAGONIST, PRODUCT-OF, PART-OF, AGONIST | ner_drugprot_clinical | ["chemical-gene", "chemical-gene_and_chemical", "gene_and_chemical-gene"] |
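The category merging described above can be sketched as a simple label lookup. Only the merges named in the text are mapped; all other DrugProt labels are assumed to pass through unchanged:

```python
# Merges described for the DrugProt corpus: SUBSTRATE_PRODUCT-OF is folded
# into SUBSTRATE, and the AGONIST-* variants are folded into AGONIST.
# All other labels are assumed to pass through unchanged.
MERGES = {
    "SUBSTRATE_PRODUCT-OF": "SUBSTRATE",
    "AGONIST-ACTIVATOR": "AGONIST",
    "AGONIST-INHIBITOR": "AGONIST",
}

def condense(label: str) -> str:
    """Map an original DrugProt relation label to the condensed label set."""
    return MERGES.get(label, label)

print(condense("AGONIST-ACTIVATOR"))  # AGONIST
print(condense("INHIBITOR"))          # INHIBITOR
```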
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") words_embedder = WordEmbeddingsModel()\ .pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentences", "tokens"])\ .setOutputCol("embeddings") drugprot_ner_tagger = MedicalNerModel.pretrained("ner_drugprot_clinical", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverter()\ .setInputCols(["sentences", "tokens", "ner_tags"])\ .setOutputCol("ner_chunks") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models")\ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel()\ .pretrained("dependency_conllu", "en")\ .setInputCols(["sentences", "pos_tags", "tokens"])\ .setOutputCol("dependencies") drugprot_re_model = RelationExtractionModel()\ .pretrained("re_drugprot_clinical", "en", 'clinical/models')\ .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\ .setOutputCol("relations")\ .setMaxSyntacticDistance(4)\ .setPredictionThreshold(0.9)\ .setRelationPairs(['CHEMICAL-GENE']) # Possible relation pairs. Default: All Relations. pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, words_embedder, drugprot_ner_tagger, ner_converter, pos_tagger, dependency_parser, drugprot_re_model]) text='''Lipid specific activation of the murine P4-ATPase Atp8a1 (ATPase II). The asymmetric transbilayer distribution of phosphatidylserine (PS) in the mammalian plasma membrane and secretory vesicles is maintained, in part, by an ATP-dependent transporter. 
This aminophospholipid "flippase" selectively transports PS to the cytosolic leaflet of the bilayer and is sensitive to vanadate, Ca(2+), and modification by sulfhydryl reagents. Although the flippase has not been positively identified, a subfamily of P-type ATPases has been proposed to function as transporters of amphipaths, including PS and other phospholipids. A candidate PS flippase ATP8A1 (ATPase II), originally isolated from bovine secretory vesicles, is a member of this subfamily based on sequence homology to the founding member of the subfamily, the yeast protein Drs2, which has been linked to ribosomal assembly, the formation of Golgi-coated vesicles, and the maintenance of PS asymmetry. To determine if ATP8A1 has biochemical characteristics consistent with a PS flippase, a murine homologue of this enzyme was expressed in insect cells and purified. The purified Atp8a1 is inactive in detergent micelles or in micelles containing phosphatidylcholine, phosphatidic acid, or phosphatidylinositol, is minimally activated by phosphatidylglycerol or phosphatidylethanolamine (PE), and is maximally activated by PS. The selectivity for PS is dependent upon multiple elements of the lipid structure. Similar to the plasma membrane PS transporter, Atp8a1 is activated only by the naturally occurring sn-1,2-glycerol isomer of PS and not the sn-2,3-glycerol stereoisomer. Both flippase and Atp8a1 activities are insensitive to the stereochemistry of the serine headgroup. Most modifications of the PS headgroup structure decrease recognition by the plasma membrane PS flippase. Activation of Atp8a1 is also reduced by these modifications; phosphatidylserine-O-methyl ester, lysophosphatidylserine, glycerophosphoserine, and phosphoserine, which are not transported by the plasma membrane flippase, do not activate Atp8a1. Weakly translocated lipids (PE, phosphatidylhydroxypropionate, and phosphatidylhomoserine) are also weak Atp8a1 activators. 
However, N-methyl-phosphatidylserine, which is transported by the plasma membrane flippase at a rate equivalent to PS, is incapable of activating Atp8a1 activity. These results indicate that the ATPase activity of the secretory granule Atp8a1 is activated by phospholipids binding to a specific site whose properties (PS selectivity, dependence upon glycerol but not serine, stereochemistry, and vanadate sensitivity) are similar to, but distinct from, the properties of the substrate binding site of the plasma membrane flippase.''' data = spark.createDataFrame([[text]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala ... val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val drugprot_ner_tagger = MedicalNerModel.pretrained("ner_drugprot_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. 
val drugprot_re_Model = RelationExtractionModel() .pretrained("re_drugprot_clinical", "en", "clinical/models") .setInputCols(Array("embeddings", "pos_tags", "ner_chunks", "dependencies")) .setOutputCol("relations") .setMaxSyntacticDistance(4) .setPredictionThreshold(0.9) .setRelationPairs(Array("CHEMICAL-GENE")) // Possible relation pairs. Default: All Relations. val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, words_embedder, drugprot_ner_tagger, ner_converter, pos_tagger, dependency_parser, drugprot_re_Model)) val data = Seq("""Lipid specific activation of the murine P4-ATPase Atp8a1 (ATPase II). The asymmetric transbilayer distribution of phosphatidylserine (PS) in the mammalian plasma membrane and secretory vesicles is maintained, in part, by an ATP-dependent transporter. This aminophospholipid "flippase" selectively transports PS to the cytosolic leaflet of the bilayer and is sensitive to vanadate, Ca(2+), and modification by sulfhydryl reagents. Although the flippase has not been positively identified, a subfamily of P-type ATPases has been proposed to function as transporters of amphipaths, including PS and other phospholipids. A candidate PS flippase ATP8A1 (ATPase II), originally isolated from bovine secretory vesicles, is a member of this subfamily based on sequence homology to the founding member of the subfamily, the yeast protein Drs2, which has been linked to ribosomal assembly, the formation of Golgi-coated vesicles, and the maintenance of PS asymmetry. To determine if ATP8A1 has biochemical characteristics consistent with a PS flippase, a murine homologue of this enzyme was expressed in insect cells and purified. The purified Atp8a1 is inactive in detergent micelles or in micelles containing phosphatidylcholine, phosphatidic acid, or phosphatidylinositol, is minimally activated by phosphatidylglycerol or phosphatidylethanolamine (PE), and is maximally activated by PS. 
The selectivity for PS is dependent upon multiple elements of the lipid structure. Similar to the plasma membrane PS transporter, Atp8a1 is activated only by the naturally occurring sn-1,2-glycerol isomer of PS and not the sn-2,3-glycerol stereoisomer. Both flippase and Atp8a1 activities are insensitive to the stereochemistry of the serine headgroup. Most modifications of the PS headgroup structure decrease recognition by the plasma membrane PS flippase. Activation of Atp8a1 is also reduced by these modifications; phosphatidylserine-O-methyl ester, lysophosphatidylserine, glycerophosphoserine, and phosphoserine, which are not transported by the plasma membrane flippase, do not activate Atp8a1. Weakly translocated lipids (PE, phosphatidylhydroxypropionate, and phosphatidylhomoserine) are also weak Atp8a1 activators. However, N-methyl-phosphatidylserine, which is transported by the plasma membrane flippase at a rate equivalent to PS, is incapable of activating Atp8a1 activity. These results indicate that the ATPase activity of the secretory granule Atp8a1 is activated by phospholipids binding to a specific site whose properties (PS selectivity, dependence upon glycerol but not serine, stereochemistry, and vanadate sensitivity) are similar to, but distinct from, the properties of the substrate binding site of the plasma membrane flippase.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.drugprot.clinical").predict("""Lipid specific activation of the murine P4-ATPase Atp8a1 (ATPase II). The asymmetric transbilayer distribution of phosphatidylserine (PS) in the mammalian plasma membrane and secretory vesicles is maintained, in part, by an ATP-dependent transporter. This aminophospholipid "flippase" selectively transports PS to the cytosolic leaflet of the bilayer and is sensitive to vanadate, Ca(2+), and modification by sulfhydryl reagents. 
Although the flippase has not been positively identified, a subfamily of P-type ATPases has been proposed to function as transporters of amphipaths, including PS and other phospholipids. A candidate PS flippase ATP8A1 (ATPase II), originally isolated from bovine secretory vesicles, is a member of this subfamily based on sequence homology to the founding member of the subfamily, the yeast protein Drs2, which has been linked to ribosomal assembly, the formation of Golgi-coated vesicles, and the maintenance of PS asymmetry. To determine if ATP8A1 has biochemical characteristics consistent with a PS flippase, a murine homologue of this enzyme was expressed in insect cells and purified. The purified Atp8a1 is inactive in detergent micelles or in micelles containing phosphatidylcholine, phosphatidic acid, or phosphatidylinositol, is minimally activated by phosphatidylglycerol or phosphatidylethanolamine (PE), and is maximally activated by PS. The selectivity for PS is dependent upon multiple elements of the lipid structure. Similar to the plasma membrane PS transporter, Atp8a1 is activated only by the naturally occurring sn-1,2-glycerol isomer of PS and not the sn-2,3-glycerol stereoisomer. Both flippase and Atp8a1 activities are insensitive to the stereochemistry of the serine headgroup. Most modifications of the PS headgroup structure decrease recognition by the plasma membrane PS flippase. Activation of Atp8a1 is also reduced by these modifications; phosphatidylserine-O-methyl ester, lysophosphatidylserine, glycerophosphoserine, and phosphoserine, which are not transported by the plasma membrane flippase, do not activate Atp8a1. Weakly translocated lipids (PE, phosphatidylhydroxypropionate, and phosphatidylhomoserine) are also weak Atp8a1 activators. However, N-methyl-phosphatidylserine, which is transported by the plasma membrane flippase at a rate equivalent to PS, is incapable of activating Atp8a1 activity. 
These results indicate that the ATPase activity of the secretory granule Atp8a1 is activated by phospholipids binding to a specific site whose properties (PS selectivity, dependence upon glycerol but not serine, stereochemistry, and vanadate sensitivity) are similar to, but distinct from, the properties of the substrate binding site of the plasma membrane flippase.""") ```
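The `setPredictionThreshold(0.9)` parameter in the pipeline above drops low-confidence relations inside the annotator; the same filtering can be sketched in plain Python over rows shaped like the results table (the sample rows and scores below are hypothetical):

```python
# Hypothetical relation rows shaped like the pipeline's output table.
rows = [
    {"relation": "SUBSTRATE", "chunk1": "PS", "chunk2": "flippase", "confidence": 0.998399},
    {"relation": "ACTIVATOR", "chunk1": "PE", "chunk2": "Atp8a1", "confidence": 0.852},
]

def filter_by_confidence(rows, threshold=0.9):
    """Keep only relations at or above the prediction threshold."""
    return [r for r in rows if r["confidence"] >= threshold]

kept = filter_by_confidence(rows)
print([r["relation"] for r in kept])  # ['SUBSTRATE']
```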
## Results ```bash +---------+--------+-------------+-----------+--------------------+-------+-------------+-----------+--------------------+----------+ | relation| entity1|entity1_begin|entity1_end| chunk1|entity2|entity2_begin|entity2_end| chunk2|confidence| +---------+--------+-------------+-----------+--------------------+-------+-------------+-----------+--------------------+----------+ |SUBSTRATE|CHEMICAL| 308| 310| PS| GENE| 275| 283| flippase| 0.998399| |ACTIVATOR|CHEMICAL| 1563| 1578| sn-1,2-glycerol| GENE| 1479| 1509|plasma membrane P...| 0.999304| |ACTIVATOR|CHEMICAL| 1563| 1578| sn-1,2-glycerol| GENE| 1511| 1517| Atp8a1| 0.979057| |ACTIVATOR|CHEMICAL| 2112| 2114| PE| GENE| 2189| 2195| Atp8a1| 0.998299| |ACTIVATOR|CHEMICAL| 2116| 2145|phosphatidylhydro...| GENE| 2189| 2195| Atp8a1| 0.981534| |ACTIVATOR|CHEMICAL| 2151| 2173|phosphatidylhomos...| GENE| 2189| 2195| Atp8a1| 0.988504| |SUBSTRATE|CHEMICAL| 2217| 2244|N-methyl-phosphat...| GENE| 2290| 2298| flippase| 0.994092| |ACTIVATOR|CHEMICAL| 1292| 1312|phosphatidylglycerol| GENE| 1134| 1140| Atp8a1| 0.994409| |ACTIVATOR|CHEMICAL| 1316| 1340|phosphatidylethan...| GENE| 1134| 1140| Atp8a1| 0.988359| |ACTIVATOR|CHEMICAL| 1342| 1344| PE| GENE| 1134| 1140| Atp8a1| 0.988399| |ACTIVATOR|CHEMICAL| 1377| 1379| PS| GENE| 1134| 1140| Atp8a1| 0.996349| |ACTIVATOR|CHEMICAL| 2526| 2528| PS| GENE| 2444| 2450| Atp8a1| 0.978597| |ACTIVATOR|CHEMICAL| 2526| 2528| PS| GENE| 2403| 2409| ATPase| 0.988679| +---------+--------+-------------+-----------+--------------------+-------+-------------+-----------+--------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_drugprot_clinical| |Type:|re| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]| |Output Labels:|[relations]| |Language:|en| |Size:|9.7 MB| ## Data Source This model was trained on the [DrugProt 
corpus](https://zenodo.org/record/5119892). This model has been improved using a Deep Learning Relation Extraction approach, resulting in the model available [here](https://nlp.johnsnowlabs.com/2022/01/05/redl_drugprot_biobert_en.html) with the following metrics ## Benchmarking ```bash label precision recall f1-score support ACTIVATOR 0.39 0.29 0.33 235 AGONIST 0.71 0.67 0.69 138 ANTAGONIST 0.79 0.77 0.78 215 DIRECT-REGULATOR 0.64 0.77 0.70 442 INDIRECT-DOWNREGULATOR 0.44 0.44 0.44 321 INDIRECT-UPREGULATOR 0.49 0.43 0.46 292 INHIBITOR 0.79 0.75 0.77 1119 PART-OF 0.74 0.82 0.78 246 PRODUCT-OF 0.51 0.37 0.43 153 SUBSTRATE 0.58 0.69 0.63 486 accuracy - - 0.65 3647 macro-avg 0.61 0.60 0.60 3647 weighted-avg 0.65 0.65 0.64 3647 - - - - - ACTIVATOR 0.885 0.776 0.827 235 AGONIST 0.810 0.925 0.864 137 ANTAGONIST 0.970 0.919 0.944 199 DIRECT-REGULATOR 0.836 0.901 0.867 403 INDIRECT-DOWNREGULATOR 0.885 0.850 0.867 313 INDIRECT-UPREGULATOR 0.844 0.887 0.865 270 INHIBITOR 0.947 0.937 0.942 1083 PART-OF 0.939 0.889 0.913 247 PRODUCT-OF 0.697 0.953 0.805 145 SUBSTRATE 0.912 0.884 0.898 468 Avg 0.873 0.892 0.879 3647 Weighted-Avg 0.897 0.899 0.897 3647 ``` --- layout: model title: Chinese BertForMaskedLM Base Cased model (from hfl) author: John Snow Labs name: bert_embeddings_chinese_mac_base date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese-macbert-base` is a Chinese model originally trained by `hfl`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_mac_base_zh_4.2.4_3.0_1670021176806.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_mac_base_zh_4.2.4_3.0_1670021176806.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_mac_base","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_mac_base","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
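The embeddings this model produces are plain float vectors, so downstream similarity scoring is ordinary cosine similarity. A minimal standard-library sketch (the short vectors are made-up stand-ins for real embedding outputs):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up low-dimensional stand-ins for real embedding vectors.
emb_a = [0.2, 0.1, 0.4]
emb_b = [0.2, 0.1, 0.4]
emb_c = [-0.4, 0.0, 0.1]

print(round(cosine_similarity(emb_a, emb_b), 3))  # 1.0
```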
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_mac_base| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|383.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/chinese-macbert-base - https://github.com/ymcui/MacBERT/blob/master/LICENSE - https://2020.emnlp.org - https://arxiv.org/abs/2004.13922 - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://github.com/chatopera/Synonyms --- layout: model title: Translate English to Bemba (Zambia) Pipeline author: John Snow Labs name: translate_en_bem date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, bem, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this module is computationally expensive, especially on longer sequences; the use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `bem` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_bem_xx_2.7.0_2.4_1609701823082.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_bem_xx_2.7.0_2.4_1609701823082.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_bem", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_bem", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.bem').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_bem| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Detect Persons, Locations, Organizations and Misc Entities in French (WikiNER 840B 300) author: John Snow Labs name: wikiner_840B_300 date: 2020-02-03 task: Named Entity Recognition language: fr edition: Spark NLP 2.4.0 spark_version: 2.4 tags: [ner, fr, open_source] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 840B 300 is trained with GloVe 840B 300 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_FR){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_FR.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_840B_300_fr_2.4.0_2.4_1579699913554.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_840B_300_fr_2.4.0_2.4_1579699913554.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("wikiner_840B_300", "fr") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (né le 28 octobre 1955) est un magnat des affaires, développeur de logiciels, investisseur et philanthrope américain. Il est surtout connu comme le co-fondateur de Microsoft Corporation. Au cours de sa carrière chez Microsoft, Gates a occupé les postes de président, chef de la direction (PDG), président et architecte logiciel en chef, tout en étant le plus grand actionnaire individuel jusqu"en mai 2014. Il est l"un des entrepreneurs et pionniers les plus connus du révolution des micro-ordinateurs des années 1970 et 1980. Né et élevé à Seattle, Washington, Gates a cofondé Microsoft avec son ami d"enfance Paul Allen en 1975, à Albuquerque, au Nouveau-Mexique; il est devenu la plus grande société de logiciels informatiques au monde. Gates a dirigé l"entreprise en tant que président-directeur général jusqu"à sa démission en tant que PDG en janvier 2000, mais il est resté président et est devenu architecte logiciel en chef. À la fin des années 1990, Gates avait été critiqué pour ses tactiques commerciales, considérées comme anticoncurrentielles. Cette opinion a été confirmée par de nombreuses décisions de justice. 
En juin 2006, Gates a annoncé qu"il passerait à un poste à temps partiel chez Microsoft et à un emploi à temps plein à la Bill & Melinda Gates Foundation, la fondation caritative privée que lui et sa femme, Melinda Gates, ont créée en 2000. Il a progressivement transféré ses fonctions à Ray Ozzie et Craig Mundie. Il a démissionné de son poste de président de Microsoft en février 2014 et a assumé un nouveau poste de conseiller technologique pour soutenir le nouveau PDG Satya Nadella.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("wikiner_840B_300", "fr") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (né le 28 octobre 1955) est un magnat des affaires, développeur de logiciels, investisseur et philanthrope américain. Il est surtout connu comme le co-fondateur de Microsoft Corporation. Au cours de sa carrière chez Microsoft, Gates a occupé les postes de président, chef de la direction (PDG), président et architecte logiciel en chef, tout en étant le plus grand actionnaire individuel jusqu"en mai 2014. Il est l"un des entrepreneurs et pionniers les plus connus du révolution des micro-ordinateurs des années 1970 et 1980. Né et élevé à Seattle, Washington, Gates a cofondé Microsoft avec son ami d"enfance Paul Allen en 1975, à Albuquerque, au Nouveau-Mexique; il est devenu la plus grande société de logiciels informatiques au monde. Gates a dirigé l"entreprise en tant que président-directeur général jusqu"à sa démission en tant que PDG en janvier 2000, mais il est resté président et est devenu architecte logiciel en chef. 
À la fin des années 1990, Gates avait été critiqué pour ses tactiques commerciales, considérées comme anticoncurrentielles. Cette opinion a été confirmée par de nombreuses décisions de justice. En juin 2006, Gates a annoncé qu"il passerait à un poste à temps partiel chez Microsoft et à un emploi à temps plein à la Bill & Melinda Gates Foundation, la fondation caritative privée que lui et sa femme, Melinda Gates, ont créée en 2000. Il a progressivement transféré ses fonctions à Ray Ozzie et Craig Mundie. Il a démissionné de son poste de président de Microsoft en février 2014 et a assumé un nouveau poste de conseiller technologique pour soutenir le nouveau PDG Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (né le 28 octobre 1955) est un magnat des affaires, développeur de logiciels, investisseur et philanthrope américain. Il est surtout connu comme le co-fondateur de Microsoft Corporation. Au cours de sa carrière chez Microsoft, Gates a occupé les postes de président, chef de la direction (PDG), président et architecte logiciel en chef, tout en étant le plus grand actionnaire individuel jusqu'en mai 2014. Il est l'un des entrepreneurs et pionniers les plus connus du révolution des micro-ordinateurs des années 1970 et 1980. Né et élevé à Seattle, Washington, Gates a cofondé Microsoft avec son ami d'enfance Paul Allen en 1975, à Albuquerque, au Nouveau-Mexique; il est devenu la plus grande société de logiciels informatiques au monde. Gates a dirigé l'entreprise en tant que président-directeur général jusqu'à sa démission en tant que PDG en janvier 2000, mais il est resté président et est devenu architecte logiciel en chef. À la fin des années 1990, Gates avait été critiqué pour ses tactiques commerciales, considérées comme anticoncurrentielles. Cette opinion a été confirmée par de nombreuses décisions de justice. 
En juin 2006, Gates a annoncé qu'il passerait à un poste à temps partiel chez Microsoft et à un emploi à temps plein à la Bill & Melinda Gates Foundation, la fondation caritative privée que lui et sa femme, Melinda Gates, ont créée en 2000. Il a progressivement transféré ses fonctions à Ray Ozzie et Craig Mundie. Il a démissionné de son poste de président de Microsoft en février 2014 et a assumé un nouveau poste de conseiller technologique pour soutenir le nouveau PDG Satya Nadella."""] ner_df = nlu.load('fr.ner.wikiner.glove.840B_300').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title} ## Results ```bash +-------------------------------+---------+ |chunk |ner_label| +-------------------------------+---------+ |William Henry Gates III |PER | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PER | |Né et élevé à Seattle |MISC | |Washington |LOC | |Gates |PER | |Microsoft |ORG | |Paul Allen |PER | |Albuquerque |LOC | |Nouveau-Mexique |LOC | |Gates |PER | |Gates |PER | |Gates |PER | |Microsoft |ORG | |Bill & Melinda Gates Foundation|ORG | |Melinda Gates |PER | |Ray Ozzie |PER | |Craig Mundie |PER | |Microsoft |ORG | +-------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wikiner_840B_300| |Type:|ner| |Compatibility:| Spark NLP 2.4.0| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|fr| |Case sensitive:|false| {:.h2_title} ## Data Source The model is trained on data from [https://fr.wikipedia.org](https://fr.wikipedia.org) --- layout: model title: Fast Neural Machine Translation Model from English to Hebrew author: John Snow Labs name: opus_mt_en_he date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, he, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `en` - target languages: `he` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_he_xx_2.7.0_2.4_1609164019932.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_he_xx_2.7.0_2.4_1609164019932.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_he", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_he", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.he').predict(text, output_level='sentence') opus_df ```
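Both snippets run a sentence detector ahead of the translator because the Marian annotator translates one sentence annotation at a time. The following plain-Python sketch (a naive regex splitter, not the trained `SentenceDetectorDLModel`) only illustrates that pre-splitting step:

```python
import re

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    # The trained SentenceDetectorDLModel handles abbreviations and other
    # edge cases far better; this is only a stand-in to show the idea.
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]

print(split_sentences("Hello world. How are you? Fine."))
# ['Hello world.', 'How are you?', 'Fine.']
```

Each element would then be translated independently and the results concatenated back in order.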
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_he| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Releases Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_releases_bert date: 2023-03-05 tags: [en, legal, classification, clauses, releases, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Releases` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens.
If your text is longer than that, consider splitting it into smaller pieces (the same tutorial linked above covers this as well). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in the Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Releases`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_releases_bert_en_1.0.0_3.0_1678049955995.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_releases_bert_en_1.0.0_3.0_1678049955995.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_releases_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
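As the description suggests, long contracts should be split into provisions before classification. Below is a minimal plain-Python sketch of the "paragraph splitting (by multiline)" approach (not the Spark NLP splitter itself; the sample contract text is made up):

```python
import re

def split_paragraphs(doc):
    # Split on one or more blank lines so that each provision becomes
    # its own row for the classification pipeline.
    return [p.strip() for p in re.split(r"\n\s*\n", doc) if p.strip()]

contract = "RELEASES. Each party releases the other...\n\nGOVERNING LAW. This Agreement is governed by..."
provisions = split_paragraphs(contract)
# Feed each provision to the pipeline as a separate "text" row.
```

Each resulting provision carries enough context for the classifier, unlike single sentences.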
## Results ```bash +----------+ |result    | +----------+ |[Releases]| |[Other]   | |[Other]   | |[Releases]| +----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_releases_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Training dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.93 0.89 0.91 94 Releases 0.86 0.91 0.89 68 accuracy - - 0.90 162 macro-avg 0.90 0.90 0.90 162 weighted-avg 0.90 0.90 0.90 162 ``` --- layout: model title: Detect Assertion Status (assertion_jsl_large) author: John Snow Labs name: assertion_jsl_large date: 2021-07-24 tags: [licensed, clinical, assertion, en] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 3.1.2 spark_version: 2.4 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The deep neural network architecture for assertion status detection in Spark NLP is based on a BiLSTM framework and is a modified version of the architecture proposed by Fancellu et al. (Fancellu, Lopez, and Webber 2016). Its goal is to classify the assertions made on given medical concepts as present, absent, or possible in the patient; conditionally present in the patient under certain circumstances; hypothetically present in the patient at some future point; or mentioned in the patient report but associated with someone else (Uzuner et al. 2011). In addition to the data used in our other assertion models, in-house annotations on a curated dataset (6K clinical notes) were used to augment the base assertion dataset (2010 i2b2/VA).
## Predicted Entities `present`, `absent`, `possible`, `planned`, `someoneelse`, `past` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_jsl_large_en_3.1.2_2.4_1627156678782.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_jsl_large_en_3.1.2_2.4_1627156678782.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") clinical_assertion = AssertionDLModel.pretrained("assertion_jsl_large", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion]) data = spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature."]]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala ... val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val clinical_assertion = AssertionDLModel.pretrained("assertion_jsl_large", "en", "clinical/models") .setInputCols(Array("sentence", "ner_chunk", "embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, clinical_assertion)) val data = Seq("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER.
His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.jsl_large").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
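A common follow-up step is to keep only the findings the model asserts as actually present in the patient. A toy post-processing sketch in plain Python (the triples are copied from this card's example output):

```python
def present_findings(rows):
    # rows: (chunk, ner_label, assertion) triples from the pipeline output.
    # Keep only Symptom chunks asserted as present in the patient.
    return [chunk for chunk, label, assertion in rows
            if label == "Symptom" and assertion == "present"]

rows = [
    ("congestion", "Symptom", "present"),
    ("discharge", "Symptom", "present"),
    ("perioral cyanosis", "Symptom", "absent"),
    ("retractions", "Symptom", "absent"),
    ("mom", "Gender", "someoneelse"),
]
print(present_findings(rows))  # ['congestion', 'discharge']
```

The same filter can be inverted to collect explicitly negated findings (`assertion == "absent"`).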
## Results The output is a dataframe with one sentence per row and an `assertion` column containing all of the assertion labels in the sentence. The assertion column also contains assertion character indices and other metadata. To get only the entity chunks and assertion labels, without the metadata, select `ner_chunk.result` and `assertion.result` from your output dataframe. ```bash +-----------------------------------------+-----+---+----------------------------+-------+-----------+ |chunk |begin|end|ner_label |sent_id|assertion | +-----------------------------------------+-----+---+----------------------------+-------+-----------+ |21-day-old |17 |26 |Age |0 |present | |Caucasian |28 |36 |Race_Ethnicity |0 |present | |male |38 |41 |Gender |0 |someoneelse| |for 2 days |48 |57 |Duration |0 |present | |congestion |62 |71 |Symptom |0 |present | |mom |75 |77 |Gender |0 |someoneelse| |yellow |99 |104|Modifier |0 |present | |discharge |106 |114|Symptom |0 |present | |nares |135 |139|External_body_part_or_region|0 |someoneelse| |she |147 |149|Gender |0 |present | |mild |168 |171|Modifier |0 |present | |problems with his breathing while feeding|173 |213|Symptom |0 |present | |perioral cyanosis |237 |253|Symptom |0 |absent | |retractions |258 |268|Symptom |0 |absent | |One day ago |272 |282|RelativeDate |1 |someoneelse| |mom |285 |287|Gender |1 |someoneelse| |Tylenol |345 |351|Drug_BrandName |1 |someoneelse| |Baby |354 |357|Age |2 |someoneelse| |decreased p.o. 
intake |377 |397|Symptom |2 |someoneelse| |His |400 |402|Gender |3 |someoneelse| +-----------------------------------------+-----+---+----------------------------+-------+-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_jsl_large| |Compatibility:|Healthcare NLP 3.1.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| ## Data Source Trained on the 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text with `embeddings_clinical`. [https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/](https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/) ## Benchmarking ```bash label prec rec f1 absent 0.957 0.949 0.953 someoneelse 0.958 0.936 0.947 planned 0.766 0.657 0.707 possible 0.852 0.884 0.868 past 0.894 0.890 0.892 present 0.902 0.917 0.910 Macro-average 0.888 0.872 0.880 Micro-average 0.908 0.908 0.908 ``` --- layout: model title: English BertForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: bert_qa_rule_based_quadruplet_epochs_1_shard_1_squad2.0 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_bert_quadruplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_quadruplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1657190951032.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_rule_based_quadruplet_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1657190951032.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_quadruplet_epochs_1_shard_1_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_rule_based_quadruplet_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
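Under the hood, extractive QA heads like `BertForQuestionAnswering` score every token as a potential answer start and end, and the answer is the span maximising start + end score. A simplified plain-Python decoder over made-up logits (it omits the [CLS] "no answer" handling that SQuAD2-style models also perform):

```python
def best_span(start_logits, end_logits, max_len=15):
    # Pick (i, j) with i <= j < i + max_len maximising
    # start_logits[i] + end_logits[j].
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            if s + end_logits[j] > best_score:
                best_score, best = s + end_logits[j], (i, j)
    return best

# Toy logits for the card's example context; real models emit one value
# per wordpiece token.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.0, 0.1, 0.0, 1.0, 0.0]
end = [0.0, 0.1, 0.0, 4.0, 0.0, 0.0, 0.0, 0.2, 2.0, 0.0]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```

The `max_len` cap prevents degenerate spans covering most of the context.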
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_rule_based_quadruplet_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/rule_based_bert_quadruplet_epochs_1_shard_1_squad2.0 --- layout: model title: Mapping Entities (Clinical Findings) with Corresponding UMLS CUI Codes author: John Snow Labs name: umls_clinical_findings_mapper date: 2022-07-08 tags: [umls, chunk_mapper, clinical, licensed, en] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps clinical entities and concepts to 4 major categories of UMLS CUI codes ## Predicted Entities `umls_code` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/umls_clinical_findings_mapper_en_4.0.0_3.0_1657279676626.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/umls_clinical_findings_mapper_en_4.0.0_3.0_1657279676626.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("clinical_ner") ner_model_converter = NerConverterInternal()\ .setInputCols("sentence", "token", "clinical_ner")\ .setOutputCol("ner_chunk") chunkerMapper = ChunkMapperModel.pretrained("umls_clinical_findings_mapper", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings")\ .setRels(["umls_code"])\ .setLowerCase(True) mapper_pipeline = Pipeline().setStages([ document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_model_converter, chunkerMapper]) test_data = spark.createDataFrame([["A 28-year-old female with a history of obesity with BMI of 33.5 kg/m2, presented with a one-week history of vomiting."]]).toDF("text") result = mapper_pipeline.fit(test_data).transform(test_data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel .pretrained("ner_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) 
.setOutputCol("clinical_ner") val ner_model_converter = new NerConverterInternal() .setInputCols("sentence", "token", "clinical_ner") .setOutputCol("ner_chunk") val chunkerMapper = ChunkMapperModel .pretrained("umls_clinical_findings_mapper", "en", "clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("mappings") .setRels(Array("umls_code")) val mapper_pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, ner_model, ner_model_converter, chunkerMapper)) val test_data = Seq("A 28-year-old female with a history of obesity with BMI of 33.5 kg/m2, presented with a one-week history of vomiting.").toDF("text") val result = mapper_pipeline.fit(test_data).transform(test_data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.umls_clinical_findings_mapper").predict("""A 28-year-old female with a history of obesity with BMI of 33.5 kg/m2, presented with a one-week history of vomiting.""") ```
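Conceptually, the chunk mapper is a large lookup table from recognized chunks to codes. A toy plain-Python analogue (the three CUI codes come from this card's example output; the `.lower()` call mirrors `setLowerCase(True)`, and the real mapper holds roughly 200K entries):

```python
def map_chunks(chunks, mapping):
    # Lowercase each NER chunk before the lookup, mirroring setLowerCase(True);
    # unknown chunks map to None.
    return {chunk: mapping.get(chunk.lower()) for chunk in chunks}

umls = {"obesity": "C1963185", "bmi": "C0578022", "vomiting": "C1963281"}
print(map_chunks(["Obesity", "BMI", "vomiting"], umls))
# {'Obesity': 'C1963185', 'BMI': 'C0578022', 'vomiting': 'C1963281'}
```

In the real annotator the relation name (`umls_code`) selects which of possibly several mappings to emit.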
## Results ```bash +---------+---------+ |ner_chunk|umls_code| +---------+---------+ |obesity |C1963185 | |BMI |C0578022 | |vomiting |C1963281 | +---------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|umls_clinical_findings_mapper| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|18.9 MB| ## References 200K clinical-findings concepts from [https://www.nlm.nih.gov/research/umls/index.html](https://www.nlm.nih.gov/research/umls/index.html) --- layout: model title: Fast Neural Machine Translation Model from Central Bikol to English author: John Snow Labs name: opus_mt_bcl_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, bcl, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `bcl` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_en_xx_2.7.0_2.4_1609166247655.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_en_xx_2.7.0_2.4_1609166247655.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_bcl_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_bcl_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.bcl.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_bcl_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: BioBERT Embeddings (Clinical) author: John Snow Labs name: biobert_clinical_base_cased date: 2020-09-19 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.2 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains the pre-trained weights of ClinicalBERT for generic clinical text. This domain-specific model improves performance on 3 of 5 clinical NLP tasks and establishes a new state of the art on the MedNLI dataset. The details are described in the paper "[Publicly Available Clinical BERT Embeddings](https://www.aclweb.org/anthology/W19-1909/)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/biobert_clinical_base_cased_en_2.6.2_2.4_1600531096837.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/biobert_clinical_base_cased_en_2.6.2_2.4_1600531096837.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("biobert_clinical_base_cased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer']], ["text"])) ``` ```scala val embeddings = BertEmbeddings.pretrained("biobert_clinical_base_cased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I hate cancer").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer"] embeddings_df = nlu.load('en.embed.biobert.clinical_base_cased').predict(text, output_level='token') embeddings_df ```
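Downstream, the 768-dimensional token vectors are typically compared with cosine similarity. A self-contained sketch with tiny made-up vectors standing in for real embeddings:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot(u, v) / (|u| * |v|).
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Real BioBERT vectors have 768 dimensions; 3 are enough to show the maths.
cancer = [0.29, 0.22, -0.55]
tumour = [0.31, 0.20, -0.50]
hello = [-0.40, 0.90, 0.10]
print(cosine(cancer, tumour) > cosine(cancer, hello))  # True
```

Related clinical terms should score closer to 1 than unrelated ones, which is the basis for embedding-based retrieval and entity resolution.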
{:.h2_title} ## Results ```bash token en_embed_biobert_clinical_base_cased_embeddings I [0.2206662893295288, 0.41324421763420105, -0.3... hate [-0.19311018288135529, 0.6037888526916504, -0.... cancer [0.2895708680152893, 0.22499887645244598, -0.5... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|biobert_clinical_base_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.2| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/EmilyAlsentzer/clinicalBERT](https://github.com/EmilyAlsentzer/clinicalBERT) --- layout: model title: Legal Employment Clause Binary Classifier author: John Snow Labs name: legclf_employment_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `employment` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens.
If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `employment` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_employment_clause_en_1.0.0_3.2_1660123473742.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_employment_clause_en_1.0.0_3.2_1660123473742.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_employment_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[employment]| |[other]| |[other]| |[employment]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_employment_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support employment 0.91 0.89 0.90 35 other 0.95 0.96 0.96 84 accuracy - - 0.94 119 macro-avg 0.93 0.93 0.93 119 weighted-avg 0.94 0.94 0.94 119 ``` --- layout: model title: English BertForQuestionAnswering model (from mrm8488) author: John Snow Labs name: bert_qa_bert_small_wrslb_finetuned_squadv1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-small-wrslb-finetuned-squadv1` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_small_wrslb_finetuned_squadv1_en_4.0.0_3.0_1654184975314.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_small_wrslb_finetuned_squadv1_en_4.0.0_3.0_1654184975314.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_small_wrslb_finetuned_squadv1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_small_wrslb_finetuned_squadv1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.small").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_small_wrslb_finetuned_squadv1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|107.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/bert-small-wrslb-finetuned-squadv1 --- layout: model title: Word2Vec Embeddings in Northern Sotho (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, nso, open_source] task: Embeddings language: nso edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_nso_3.4.1_3.0_1647448245360.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_nso_3.4.1_3.0_1647448245360.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","nso") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","nso") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("nso.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|nso| |Size:|51.9 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Arabic Bert Embeddings (Large, Arabert Model, v2) author: John Snow Labs name: bert_embeddings_bert_large_arabertv2 date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-arabertv2` is an Arabic model originally trained by `aubmindlab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_arabertv2_ar_3.4.2_3.0_1649677693301.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_arabertv2_ar_3.4.2_3.0_1649677693301.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_arabertv2","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_arabertv2","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.bert_large_arabertv2").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_large_arabertv2| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|1.4 GB| |Case sensitive:|true| ## References - https://huggingface.co/aubmindlab/bert-large-arabertv2 - https://github.com/google-research/bert - https://arxiv.org/abs/2003.00104 - https://github.com/WissamAntoun/pydata_khobar_meetup - http://alt.qcri.org/farasa/segmenter.html - https://github.com/google-research/bert/blob/master/multilingual.md - https://github.com/elnagara/HARD-Arabic-Dataset - https://www.aclweb.org/anthology/D15-1299 - https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf - https://github.com/mohamedadaly/LABR - http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp - https://github.com/husseinmozannar/SOQAL - https://github.com/aub-mind/arabert/blob/master/AraBERT/README.md - https://arxiv.org/abs/2003.00104v2 - https://archive.org/details/arwiki-20190201 - https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4 - https://www.aclweb.org/anthology/W19-4619 - https://sites.aub.edu.lb/mindlab/ - https://www.yakshof.com/#/ - https://www.behance.net/rahalhabib - https://www.linkedin.com/in/wissam-antoun-622142b4/ - https://twitter.com/wissam_antoun - https://github.com/WissamAntoun - https://www.linkedin.com/in/fadybaly/ - https://twitter.com/fadybaly - https://github.com/fadybaly --- layout: model title: Russian T5ForConditionalGeneration Base Cased model (from IlyaGusev) author: John Snow Labs name: t5_rut5_base_headline_gen_telegram date: 2023-01-30 tags: [ru, open_source, t5, tensorflow] task: Text Generation language: ru edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: 
cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rut5_base_headline_gen_telegram` is a Russian model originally trained by `IlyaGusev`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_rut5_base_headline_gen_telegram_ru_4.3.0_3.0_1675106899958.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_rut5_base_headline_gen_telegram_ru_4.3.0_3.0_1675106899958.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_rut5_base_headline_gen_telegram","ru") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_rut5_base_headline_gen_telegram","ru") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_rut5_base_headline_gen_telegram| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ru| |Size:|995.4 MB| ## References - https://huggingface.co/IlyaGusev/rut5_base_headline_gen_telegram - https://www.dropbox.com/s/ykqk49a8avlmnaf/ru_all_split.tar.gz - https://github.com/IlyaGusev/summarus/blob/master/external/hf_scripts/train.py --- layout: model title: Pipeline to Detect Living Species (bert_embeddings_bert_base_italian_xxl_cased) author: John Snow Labs name: ner_living_species_bert_pipeline date: 2023-03-13 tags: [it, ner, clinical, licensed, bert] task: Named Entity Recognition language: it edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_living_species_bert](https://nlp.johnsnowlabs.com/2022/06/23/ner_living_species_bert_it_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_pipeline_it_4.3.0_3.2_1678730309748.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_pipeline_it_4.3.0_3.2_1678730309748.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_living_species_bert_pipeline", "it", "clinical/models") text = '''Una donna di 74 anni è stata ricoverata con dolore addominale diffuso, ipossia e astenia di 2 settimane di evoluzione. La sua storia personale includeva ipertensione in trattamento con amiloride/idroclorotiazide e dislipidemia controllata con lovastatina. La sua storia familiare era: madre morta di cancro gastrico, fratello con cirrosi epatica di eziologia sconosciuta e sorella con carcinoma epatocellulare. Lo studio eziologico delle diverse cause di malattia epatica cronica comprendeva: virus epatotropi (HBV, HCV) e HIV, studio dell'autoimmunità, ceruloplasmina, ferritina e porfirine nelle urine, tutti risultati negativi. Il paziente è stato messo in trattamento anticoagulante con acenocumarolo e diuretici a tempo indeterminato.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_living_species_bert_pipeline", "it", "clinical/models") val text = "Una donna di 74 anni è stata ricoverata con dolore addominale diffuso, ipossia e astenia di 2 settimane di evoluzione. La sua storia personale includeva ipertensione in trattamento con amiloride/idroclorotiazide e dislipidemia controllata con lovastatina. La sua storia familiare era: madre morta di cancro gastrico, fratello con cirrosi epatica di eziologia sconosciuta e sorella con carcinoma epatocellulare. Lo studio eziologico delle diverse cause di malattia epatica cronica comprendeva: virus epatotropi (HBV, HCV) e HIV, studio dell'autoimmunità, ceruloplasmina, ferritina e porfirine nelle urine, tutti risultati negativi. Il paziente è stato messo in trattamento anticoagulante con acenocumarolo e diuretici a tempo indeterminato." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-----------------|--------:|------:|:------------|-------------:| | 0 | donna | 4 | 8 | HUMAN | 0.9997 | | 1 | personale | 133 | 141 | HUMAN | 1 | | 2 | madre | 285 | 289 | HUMAN | 1 | | 3 | fratello | 317 | 324 | HUMAN | 0.9995 | | 4 | sorella | 373 | 379 | HUMAN | 0.9997 | | 5 | virus epatotropi | 493 | 508 | SPECIES | 0.75615 | | 6 | HBV | 511 | 513 | SPECIES | 0.9886 | | 7 | HCV | 516 | 518 | SPECIES | 0.9745 | | 8 | HIV | 523 | 525 | SPECIES | 0.9838 | | 9 | paziente | 634 | 641 | HUMAN | 0.9994 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_bert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|it| |Size:|432.4 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: French CamemBert Embeddings (from juliencarbonnell) author: John Snow Labs name: camembert_embeddings_juliencarbonnell_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `juliencarbonnell`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_juliencarbonnell_generic_model_fr_3.4.4_3.0_1653989010937.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_juliencarbonnell_generic_model_fr_3.4.4_3.0_1653989010937.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_juliencarbonnell_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_juliencarbonnell_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_juliencarbonnell_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/juliencarbonnell/dummy-model --- layout: model title: English asr_wav2vec2_large_xls_r_300m_hindi_home_colab_11 TFWav2Vec2ForCTC from nimrah author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_hindi_home_colab_11 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_hindi_home_colab_11` is an English model originally trained by nimrah. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_300m_hindi_home_colab_11_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_hindi_home_colab_11_en_4.2.0_3.0_1664117412915.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_hindi_home_colab_11_en_4.2.0_3.0_1664117412915.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_hindi_home_colab_11", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_hindi_home_colab_11", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_hindi_home_colab_11| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Pipeline to Detect PHI for Deidentification (Subentity-Augmented) author: John Snow Labs name: ner_deid_subentity_augmented_pipeline date: 2023-03-13 tags: [deid, ner, en, i2b2, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deid_subentity_augmented](https://nlp.johnsnowlabs.com/2021/09/03/ner_deid_subentity_augmented_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_augmented_pipeline_en_4.3.0_3.2_1678734896498.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_augmented_pipeline_en_4.3.0_3.2_1678734896498.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_subentity_augmented_pipeline", "en", "clinical/models") text = '''Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_subentity_augmented_pipeline", "en", "clinical/models") val text = "Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.deid.subentity_ner_augmented.pipeline").predict("""Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.""") ```
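The chunks returned by `fullAnnotate` carry inclusive `begin`/`end` character offsets, which is all that is needed to mask the detected PHI in the original text. A minimal plain-Python sketch (the helper name `mask_phi` is hypothetical, not part of the pipeline API) that replaces each chunk with its label:

```python
def mask_phi(text, chunks):
    # chunks: list of (begin, end, label) with inclusive character offsets,
    # as in Spark NLP annotation results. Replace right-to-left so earlier
    # offsets stay valid while the string is being edited.
    for begin, end, label in sorted(chunks, reverse=True):
        text = text[:begin] + "<" + label + ">" + text[end + 1:]
    return text

text = "Record date : 2093-01-13 , David Hale , M.D ."
chunks = [(14, 23, "DATE"), (27, 36, "DOCTOR")]
print(mask_phi(text, chunks))
# -> Record date : <DATE> , <DOCTOR> , M.D .
```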
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:------------------------------|--------:|------:|:--------------|-------------:| | 0 | 2093-01-13 | 14 | 23 | DATE | 1 | | 1 | David Hale | 27 | 36 | DOCTOR | 0.97385 | | 2 | Hendrickson Ora | 55 | 69 | PATIENT | 0.9932 | | 3 | 7194334 | 78 | 84 | MEDICALRECORD | 0.9993 | | 4 | 01/13/93 | 93 | 100 | DATE | 1 | | 5 | Oliveira | 110 | 117 | DOCTOR | 0.9993 | | 6 | 25 | 121 | 122 | AGE | 0.9905 | | 7 | 2079-11-09 | 150 | 159 | DATE | 0.9998 | | 8 | Cocke County Baptist Hospital | 163 | 191 | HOSPITAL | 0.97485 | | 9 | 0295 Keats Street | 195 | 211 | STREET | 0.8209 | | 10 | 302-786-5227 | 221 | 232 | PHONE | 0.9541 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity_augmented_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Translate Macedonian to English Pipeline author: John Snow Labs name: translate_mk_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, mk, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as GPU is recommended. - source languages: `mk` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_mk_en_xx_2.7.0_2.4_1609688189807.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_mk_en_xx_2.7.0_2.4_1609688189807.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_mk_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_mk_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.mk.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_mk_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Restricted Use Clause Binary Classifier author: John Snow Labs name: legclf_restricted_use_clause date: 2023-02-13 tags: [en, legal, classification, restricted, use, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `restricted_use` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have long legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `restricted_use`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_restricted_use_clause_en_1.0.0_3.0_1676303037732.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_restricted_use_clause_en_1.0.0_3.0_1676303037732.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_restricted_use_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
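Since each binary clause classifier emits either its own clause label or `other`, running several of them over the same text reduces to one boolean per clause. A small plain-Python sketch of that reduction (the helper `clause_flags` is hypothetical and operates on the collected `category.result` strings, not on the Spark NLP API directly):

```python
def clause_flags(predictions, clause_labels):
    # predictions: {clause_name: predicted_label} collected from each
    # binary classifier's output column. Any label other than the clause's
    # own name (e.g. "other") counts as False.
    return {name: label == name for name, label in predictions.items()
            if name in clause_labels}

preds = {"restricted_use": "restricted_use", "employment": "other"}
print(clause_flags(preds, {"restricted_use", "employment"}))
# -> {'restricted_use': True, 'employment': False}
```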
## Results ```bash +-------+ |result| +-------+ |[restricted_use]| |[other]| |[other]| |[restricted_use]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_restricted_use_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 0.91 0.95 11 restricted_use 0.94 1.00 0.97 17 accuracy - - 0.96 28 macro-avg 0.97 0.95 0.96 28 weighted-avg 0.97 0.96 0.96 28 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Seychellois Creole author: John Snow Labs name: opus_mt_en_crs date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, crs, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `en` - target languages: `crs` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_crs_xx_2.7.0_2.4_1609169500447.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_crs_xx_2.7.0_2.4_1609169500447.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentences") marian = MarianTransformer.pretrained("opus_mt_en_crs", "xx")\ .setInputCols(["sentences"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your text to translate here.") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_crs", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your text to translate here.").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.crs').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_crs| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Stock options Clause Binary Classifier author: John Snow Labs name: legclf_stock_options_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `stock-options` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
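The paragraph-splitting advice above can be sketched without any Spark NLP dependency. This is a minimal illustration (the function name and the whitespace-based token count are assumptions; BERT tokenizers produce subword tokens, so treat the count as a rough upper-bound check, not an exact measure):

```python
import re

def split_paragraphs(text, max_tokens=512):
    """Split a document into paragraphs on blank lines, and flag whether
    each paragraph likely fits the embedding model's 512-token window.
    Whitespace token counts are only an approximation of subword counts."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

doc = "RESTRICTED USE.\nLicensee shall not sublicense...\n\nSTOCK OPTIONS.\nThe Board may grant options..."
for clause, fits in split_paragraphs(doc):
    print(fits, clause[:30])
```

Each resulting paragraph can then be fed to the classifier as a separate row of the `clause_text` column shown in the pipeline below.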
## Predicted Entities `other`, `stock-options` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_stock_options_clause_en_1.0.0_3.2_1660123043343.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_stock_options_clause_en_1.0.0_3.2_1660123043343.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_stock_options_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +---------------+ |         result| +---------------+ |[stock-options]| |        [other]| |        [other]| |[stock-options]| +---------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_stock_options_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.98 0.97 0.97 87 stock-options 0.86 0.90 0.88 21 accuracy - - 0.95 108 macro-avg 0.92 0.94 0.93 108 weighted-avg 0.95 0.95 0.95 108 ``` --- layout: model title: Pipeline to Detect Living Species author: John Snow Labs name: ner_living_species_pipeline date: 2023-03-13 tags: [en, ner, clinical, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_living_species](https://nlp.johnsnowlabs.com/2022/06/22/ner_living_species_en_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_pipeline_en_4.3.0_3.2_1678707530799.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_pipeline_en_4.3.0_3.2_1678707530799.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_living_species_pipeline", "en", "clinical/models") text = '''42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_living_species_pipeline", "en", "clinical/models") val text = "42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. 
In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:------------------------|--------:|------:|:------------|-------------:| | 0 | woman | 12 | 16 | HUMAN | 0.9993 | | 1 | bacterial | 145 | 153 | SPECIES | 0.9815 | | 2 | Fusarium spp | 337 | 348 | SPECIES | 0.9644 | | 3 | patient | 355 | 361 | HUMAN | 0.9984 | | 4 | species | 507 | 513 | SPECIES | 0.8838 | | 5 | Fusarium solani complex | 522 | 544 | SPECIES | 0.748667 | | 6 | antifungals | 792 | 802 | SPECIES | 0.9847 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: German asr_wav2vec2_large_xlsr_53_german_with_lm TFWav2Vec2ForCTC from aware-ai author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_german_with_lm date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_large_xlsr_53_german_with_lm` is a German model originally trained by aware-ai. 
NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_large_xlsr_53_german_with_lm_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_german_with_lm_de_4.2.0_3.0_1664098470326.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_german_with_lm_de_4.2.0_3.0_1664098470326.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_german_with_lm", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_german_with_lm", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_german_with_lm| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: English BertForQuestionAnswering model (from batterydata) author: John Snow Labs name: bert_qa_batterybert_cased_squad_v1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `batterybert-cased-squad-v1` is an English model originally trained by `batterydata`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_batterybert_cased_squad_v1_en_4.0.0_3.0_1654179292884.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_batterybert_cased_squad_v1_en_4.0.0_3.0_1654179292884.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_batterybert_cased_squad_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_batterybert_cased_squad_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad_battery.bert.cased.by_batterydata").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_batterybert_cased_squad_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/batterydata/batterybert-cased-squad-v1 - https://github.com/ShuHuang/batterybert --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from kaipo-chang) author: John Snow Labs name: distilbert_qa_kaipo_chang_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `kaipo-chang`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_kaipo_chang_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771642916.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_kaipo_chang_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771642916.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kaipo_chang_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kaipo_chang_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_kaipo_chang_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/kaipo-chang/distilbert-base-uncased-finetuned-squad --- layout: model title: English DistilBertForQuestionAnswering model (from emre) author: John Snow Labs name: distilbert_qa_emre_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `emre`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_emre_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725170981.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_emre_base_uncased_finetuned_squad_en_4.0.0_3.0_1654725170981.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_emre_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_emre_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_emre").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_emre_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/emre/distilbert-base-uncased-finetuned-squad --- layout: model title: English BertForQuestionAnswering model (from mrm8488) author: John Snow Labs name: bert_qa_bert_tiny_4_finetuned_squadv2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-tiny-4-finetuned-squadv2` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_tiny_4_finetuned_squadv2_en_4.0.0_3.0_1654184992654.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_tiny_4_finetuned_squadv2_en_4.0.0_3.0_1654184992654.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_tiny_4_finetuned_squadv2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_tiny_4_finetuned_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.tiny_v4.by_mrm8488").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
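The NLU one-liners in these question-answering cards pack the question and the context into a single string separated by `|||`. A tiny helper makes the convention explicit (the helper name is illustrative, not part of the nlu API):

```python
def to_nlu_qa_input(question: str, context: str) -> str:
    """Join a question/context pair with the '|||' separator used by
    nlu question-answering predict() calls. Helper name is illustrative."""
    return f"{question}|||{context}"

s = to_nlu_qa_input("What's my name?", "My name is Clara and I live in Berkeley.")
# s == "What's my name?|||My name is Clara and I live in Berkeley."
```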
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_tiny_4_finetuned_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|23.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/bert-tiny-4-finetuned-squadv2 --- layout: model title: Drug Substance to UMLS Code Pipeline author: John Snow Labs name: umls_drug_substance_resolver_pipeline date: 2022-07-25 tags: [en, licensed, umls, pipeline] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps entities (Drug Substances) with their corresponding UMLS CUI codes. You’ll just feed your text and it will return the corresponding UMLS codes. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/umls_drug_substance_resolver_pipeline_en_4.0.0_3.0_1658737965746.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/umls_drug_substance_resolver_pipeline_en_4.0.0_3.0_1658737965746.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("umls_drug_substance_resolver_pipeline", "en", "clinical/models") pipeline.annotate("The patient was given metformin, lenvatinib and Magnesium hydroxide 100mg/1ml") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = PretrainedPipeline("umls_drug_substance_resolver_pipeline", "en", "clinical/models") val result = pipeline.annotate("The patient was given metformin, lenvatinib and Magnesium hydroxide 100mg/1ml") ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.umls_drug_substance_resolver").predict("""The patient was given metformin, lenvatinib and Magnesium hydroxide 100mg/1ml""") ```
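`annotate()` on a pretrained pipeline returns a plain dict mapping output columns to lists of strings, so pairing each chunk with its label and mapped code is a simple `zip`. The sketch below uses mocked output; the column names (`chunk`, `ner_label`, `umls_code`) are assumptions for illustration and may differ from the pipeline's actual output columns:

```python
# Mocked annotate() output: a dict of output-column -> list of strings.
# Column names here are assumptions, chosen to mirror the Results table.
annotations = {
    "chunk": ["metformin", "lenvatinib", "Magnesium hydroxide 100mg/1ml"],
    "ner_label": ["DRUG", "DRUG", "DRUG"],
    "umls_code": ["C0025598", "C2986924", "C1134402"],
}

# Re-assemble the parallel lists into per-entity rows.
rows = list(zip(annotations["chunk"], annotations["ner_label"], annotations["umls_code"]))
for chunk, label, code in rows:
    print(f"{chunk:<30} {label:<6} {code}")
```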
## Results ```bash +-----------------------------+---------+---------+ |chunk |ner_label|umls_code| +-----------------------------+---------+---------+ |metformin |DRUG |C0025598 | |lenvatinib |DRUG |C2986924 | |Magnesium hydroxide 100mg/1ml|DRUG |C1134402 | +-----------------------------+---------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|umls_drug_substance_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|5.1 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger --- layout: model title: English DistilBertForQuestionAnswering Cased model (from manishiitg) author: John Snow Labs name: distilbert_qa_squad_256seq_8batch_test date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-squad-256seq-8batch-test` is an English model originally trained by `manishiitg`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_squad_256seq_8batch_test_en_4.3.0_3.0_1672774315502.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_squad_256seq_8batch_test_en_4.3.0_3.0_1672774315502.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_256seq_8batch_test","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_256seq_8batch_test","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_squad_256seq_8batch_test| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/manishiitg/distilbert-squad-256seq-8batch-test --- layout: model title: Longformer Large NER Pipeline author: ahmedlone127 name: longformer_large_token_classifier_conll03_pipeline date: 2022-06-14 tags: [open_source, ner, token_classifier, longformer, conll, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: false article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [longformer_large_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/10/09/longformer_large_token_classifier_conll03_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/longformer_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655214628921.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/longformer_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655214628921.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("longformer_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I am working at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("longformer_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I am working at John Snow Labs.") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PER | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|longformer_large_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Community| |Language:|en| |Size:|1.5 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - LongformerForTokenClassification - NerConverter - Finisher --- layout: model title: English DistilBertForTokenClassification Base Cased model (from 51la5) author: John Snow Labs name: distilbert_token_classifier_base_ner date: 2023-03-03 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-NER` is an English model originally trained by `51la5`. ## Predicted Entities `LOC`, `ORG`, `PER`, `MISC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_ner_en_4.3.0_3.0_1677881358175.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_ner_en_4.3.0_3.0_1677881358175.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_ner","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_ner","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
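The classifier emits one IOB-style label per token (`B-PER`, `I-PER`, `O`, …); downstream, consecutive `B-`/`I-` tags are typically merged into entity chunks (this is what Spark NLP's NerConverter stage does). A minimal pure-Python sketch of that merging logic, shown on hypothetical tags rather than actual model output:

```python
# Merge token-level IOB tags (e.g. B-PER, I-PER, O) into entity chunks.
# Illustrative only: the tags below are hypothetical, not real model output.
def iob_to_chunks(tokens, tags):
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = [token, tag[2:]]
        elif tag.startswith("I-") and current and current[1] == tag[2:]:
            current[0] += " " + token
        else:  # "O" or a dangling I- tag ends the current chunk
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [tuple(c) for c in chunks]

print(iob_to_chunks(
    ["My", "name", "is", "John", "and", "I", "work", "at", "John", "Snow", "Labs"],
    ["O", "O", "O", "B-PER", "O", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG"]))
# [('John', 'PER'), ('John Snow Labs', 'ORG')]
```

The same logic explains why a phrase like `John Snow Labs` surfaces as a single ORG chunk even though the model labels three separate tokens.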
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_base_ner| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/51la5/distilbert-base-NER - https://paperswithcode.com/sota?task=Token+Classification&dataset=conll2003 --- layout: model title: Mapping Abbreviations and Acronyms of Medical Regulatory Activities with Their Definitions and Categories author: John Snow Labs name: abbreviation_category_mapper date: 2022-11-16 tags: [abbreviation, definition, category, licensed, en, clinical, chunk_mapper] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 4.2.1 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps abbreviations and acronyms of medical regulatory activities with their definitions and categories. 
Predicted categories: `general`, `problem`, `test`, `treatment`, `medical_condition`, `clinical_dept`, `drug`, `nursing`, `internal_organ_or_component`, `hospital_unit`, `drug_frequency`, `employment`, `procedure` ## Predicted Entities `definition`, `category` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/abbreviation_category_mapper_en_4.2.1_3.0_1668594867892.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/abbreviation_category_mapper_en_4.2.1_3.0_1668594867892.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") abbr_ner = MedicalNerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("abbr_ner") abbr_converter = NerConverter() \ .setInputCols(["sentence", "token", "abbr_ner"]) \ .setOutputCol("abbr_ner_chunk") chunkerMapper = ChunkMapperModel.pretrained("abbreviation_category_mapper", "en", "clinical/models")\ .setInputCols(["abbr_ner_chunk"])\ .setOutputCol("mappings")\ .setRels(["definition", "category"]) pipeline = Pipeline().setStages([ document_assembler, sentence_detector, tokenizer, word_embeddings, abbr_ner, abbr_converter, chunkerMapper]) text = ["""Gravid with estimated fetal weight of 6-6/12 pounds. LABORATORY DATA: Laboratory tests include a CBC which is normal. VDRL: Nonreactive HIV: Negative. One-Hour Glucose: 117. 
Group B strep has not been done as yet."""] data = spark.createDataFrame([text]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val abbr_ner = MedicalNerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("abbr_ner") val abbr_converter = new NerConverter() .setInputCols(Array("sentence", "token", "abbr_ner")) .setOutputCol("abbr_ner_chunk") val chunkerMapper = ChunkMapperModel.pretrained("abbreviation_category_mapper", "en", "clinical/models") .setInputCols("abbr_ner_chunk") .setOutputCol("mappings") .setRels(Array("definition", "category")) val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, abbr_ner, abbr_converter, chunkerMapper)) val sample_text = """Gravid with estimated fetal weight of 6-6/12 pounds. LABORATORY DATA: Laboratory tests include a CBC which is normal. VDRL: Nonreactive HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.""" val data = Seq(sample_text).toDS.toDF("text") val result= pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.abbreviation_category").predict("""Gravid with estimated fetal weight of 6-6/12 pounds. LABORATORY DATA: Laboratory tests include a CBC which is normal. VDRL: Nonreactive HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.""") ```
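Conceptually, the ChunkMapper stage is a lookup: each NER chunk is resolved to its `definition` and `category` relations. A toy pure-Python sketch of that lookup, seeded with the three abbreviations from the sample text (the real model ships a much larger trained mapping):

```python
# Toy illustration of the chunk-mapper lookup: each recognized abbreviation
# chunk is resolved to its relations. Only these three entries are shown;
# the pretrained model covers far more abbreviations.
mappings = {
    "CBC":  {"definition": "complete blood count", "category": "general"},
    "VDRL": {"definition": "Venereal Disease Research Laboratories", "category": "clinical_dept"},
    "HIV":  {"definition": "Human immunodeficiency virus", "category": "medical_condition"},
}

def map_chunks(chunks, rels=("definition", "category")):
    # Unknown chunks map to "NONE", mirroring the annotator's fallback label.
    return [{r: mappings.get(c, {}).get(r, "NONE") for r in rels} for c in chunks]

print(map_chunks(["CBC", "HIV"]))
```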
## Results ```bash | | chunk | category | definition | |---:|:--------|:------------------|:---------------------------------------| | 0 | CBC | general | complete blood count | | 1 | VDRL | clinical_dept | Venereal Disease Research Laboratories | | 2 | HIV | medical_condition | Human immunodeficiency virus | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|abbreviation_category_mapper| |Compatibility:|Healthcare NLP 4.2.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[abbr_ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|128.2 KB| --- layout: model title: Legal Fisheries Document Classifier (EURLEX) author: John Snow Labs name: legclf_fisheries_bert date: 2023-03-06 tags: [en, legal, classification, clauses, fisheries, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_fisheries_bert model is a Bert Sentence Embeddings Document Classifier that determines whether a given document belongs to the Fisheries class or not (binary classification), according to EuroVoc labels. ## Predicted Entities `Fisheries`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_fisheries_bert_en_1.0.0_3.0_1678111835145.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_fisheries_bert_en_1.0.0_3.0_1678111835145.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_fisheries_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
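Under the hood the classifier scores both labels for each document, and the `category` column carries the top-scoring one. A minimal sketch of that final selection step, with made-up scores (only the two class names are real):

```python
# The document classifier outputs the label with the highest score.
# The scores below are hypothetical; only the two classes are real.
def classify(scores):
    return max(scores, key=scores.get)

docs = [{"Fisheries": 0.91, "Other": 0.09}, {"Fisheries": 0.12, "Other": 0.88}]
print([classify(s) for s in docs])
# ['Fisheries', 'Other']
```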
## Results ```bash +-------+ |result| +-------+ |[Fisheries]| |[Other]| |[Other]| |[Fisheries]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_fisheries_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Training dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Fisheries 0.98 0.94 0.96 478 Other 0.94 0.98 0.96 439 accuracy - - 0.96 917 macro-avg 0.96 0.96 0.96 917 weighted-avg 0.96 0.96 0.96 917 ``` --- layout: model title: English T5ForConditionalGeneration Large Cased model (from google) author: John Snow Labs name: t5_efficient_large_dl4 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-large-dl4` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_large_dl4_en_4.3.0_3.0_1675114615864.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_large_dl4_en_4.3.0_3.0_1675114615864.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_large_dl4","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_large_dl4","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_large_dl4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|834.7 MB| ## References - https://huggingface.co/google/t5-efficient-large-dl4 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English AlbertForQuestionAnswering model (from 123tarunanand) author: John Snow Labs name: albert_qa_xlarge_finetuned date: 2022-06-24 tags: [en, open_source, albert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: AlBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `albert-xlarge-finetuned` is an English model originally trained by `123tarunanand`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_xlarge_finetuned_en_4.0.0_3.0_1656063767338.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_xlarge_finetuned_en_4.0.0_3.0_1656063767338.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_xlarge_finetuned","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_xlarge_finetuned","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.albert.xl").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
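Extractive QA models such as this one score every context token as a possible answer start and end, and the returned `answer` is the best-scoring span. A self-contained sketch of that span selection, using made-up scores for the example sentence:

```python
# Extractive QA returns start/end scores over context tokens; the answer is
# the highest-scoring valid span. The scores below are invented for
# illustration, not real model logits.
def best_span(tokens, start_scores, end_scores, max_len=8):
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(tokens))):  # end >= start
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return " ".join(tokens[best[0]:best[1] + 1])

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley"]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 0.3]
end   = [0.0, 0.1, 0.0, 4.0, 0.2, 0.0, 0.1, 0.0, 1.0]
print(best_span(tokens, start, end))  # Clara
```

Real implementations additionally mask out spans that fall in the question part of the input; the max-length cap above plays a similar validity-filtering role.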
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_qa_xlarge_finetuned| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|205.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/123tarunanand/albert-xlarge-finetuned - https://rajpurkar.github.io/SQuAD-explorer/ --- layout: model title: FastText Word Embeddings in German author: John Snow Labs name: w2v_cc_300d date: 2022-03-21 tags: [embeddings, de, open_source] task: Embeddings language: de edition: Spark NLP 2.5.5 spark_version: 2.4 supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. ## Predicted Entities {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/14.German_Healthcare_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_de_2.5.5_2.4_1647888218499.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_de_2.5.5_2.4_1647888218499.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python model = WordEmbeddingsModel.pretrained("w2v_cc_300d","de")\ .setInputCols(["document","token"])\ .setOutputCol("word_embeddings") ``` ```scala val model = WordEmbeddingsModel.pretrained("w2v_cc_300d","de") .setInputCols(Array("document","token")) .setOutputCol("word_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("de.embed.w2v").predict("""Put your text here.""") ```
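Once tokens are mapped to vectors, the typical downstream operation is comparing them with cosine similarity. A small self-contained sketch (the 3-dimensional vectors are made up for illustration; the real model emits 300 dimensions):

```python
import math

# Word vectors are compared with cosine similarity; related words score
# higher. These 3-d vectors are invented, standing in for the model's
# 300-d output.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

hund, katze, haus = [1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]
print(round(cosine(hund, katze), 3))  # high: related words
print(round(cosine(hund, haus), 3))   # lower: unrelated words
```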
## Results ```bash Word2Vec feature vectors based on `w2v_cc_300d`. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 2.5.5+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|de| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| ## References FastText common crawl word embeddings for German https://fasttext.cc/docs/en/crawl-vectors.html --- layout: model title: Clean Slang Pipeline for English author: John Snow Labs name: clean_slang date: 2021-03-24 tags: [open_source, english, clean_slang, pipeline, en] supported: true task: [Named Entity Recognition, Lemmatization] language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description clean_slang is a pretrained pipeline that performs basic text processing steps and recognizes entities. It covers most of the common text processing tasks on your dataframe. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/clean_slang_en_3.0.0_3.0_1616544456744.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/clean_slang_en_3.0.0_3.0_1616544456744.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('clean_slang', lang = 'en') annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("clean_slang", lang = "en") val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs ! "] result_df = nlu.load('en.clean.slang').predict(text) result_df ```
## Results ```bash | | document | token | normal | |---:|:---------------------------------|:-----------------------------------------------|:------------------------------------------| | 0 | ['Hello from John Snow Labs ! '] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | ['Hello', 'from', 'John', 'Snow', 'Labs'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clean_slang| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| --- layout: model title: Extract temporal relations among clinical events (ReDL) author: John Snow Labs name: redl_temporal_events_biobert date: 2021-02-04 task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 2.7.3 spark_version: 2.4 tags: [licensed, clinical, en, relation_extraction] supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract relations between clinical events in terms of time, i.e., whether an event occurred before, after, or overlapping another event. ## Predicted Entities `AFTER`, `BEFORE`, `OVERLAP` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CLINICAL_EVENTS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_temporal_events_biobert_en_2.7.3_2.4_1612440268550.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_temporal_events_biobert_en_2.7.3_2.4_1612440268550.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = sparknlp.annotators.Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") words_embedder = WordEmbeddingsModel() \ .pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel.pretrained("ner_events_clinical", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverter() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel() \ .pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") #Set a filter on pairs of named entities which will be treated as relation candidates re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks") #.setRelationPairs(['SYMPTOM-EXTERNAL_BODY_PART_OR_REGION']) re_model = RelationExtractionDLModel()\ .pretrained("redl_temporal_events_biobert", "en", "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) text = "She is diagnosed with cancer in 1991. 
Then she was admitted to Mayo Clinic in May 2000 and discharged in October 2001" data = spark.createDataFrame([[text]]).toDF("text") p_model = pipeline.fit(data) result = p_model.transform(data) ``` ```scala ... val documenter = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = sparknlp.annotators.Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_events_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") // .setRelationPairs(Array("SYMPTOM-EXTERNAL_BODY_PART_OR_REGION")) // The dataset this model is trained on is sentence-wise. // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. 
val re_model = RelationExtractionDLModel() .pretrained("redl_temporal_events_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("She is diagnosed with cancer in 1991. Then she was admitted to Mayo Clinic in May 2000 and discharged in October 2001").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.temporal_events").predict("""She is diagnosed with cancer in 1991. Then she was admitted to Mayo Clinic in May 2000 and discharged in October 2001""") ```
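The `setPredictionThreshold(0.5)` call above controls which candidate pairs survive into the `relations` column: any prediction whose confidence falls below the threshold is dropped. Sketched in plain Python with hypothetical relation tuples:

```python
# Keep only relation predictions whose confidence clears the threshold,
# mirroring setPredictionThreshold(0.5). Tuples are (label, chunk1, chunk2,
# confidence) and are hypothetical, not actual model output.
relations = [
    ("BEFORE", "diagnosed", "cancer", 0.78),
    ("OVERLAP", "cancer", "1991", 0.85),
    ("AFTER", "admitted", "discharged", 0.33),
]

def filter_relations(rels, threshold=0.5):
    return [r for r in rels if r[3] >= threshold]

print(filter_relations(relations))  # the 0.33 prediction is dropped
```

Raising the threshold trades recall for precision, which is why the card exposes it as a tunable parameter rather than a fixed value.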
## Results ```bash +--------+-------------+-------------+-----------+-----------+-------------+-------------+-----------+------------+----------+ |relation| entity1|entity1_begin|entity1_end| chunk1| entity2|entity2_begin|entity2_end| chunk2|confidence| +--------+-------------+-------------+-----------+-----------+-------------+-------------+-----------+------------+----------+ | BEFORE| OCCURRENCE| 7| 15| diagnosed| PROBLEM| 22| 27| cancer|0.78168863| | OVERLAP| PROBLEM| 22| 27| cancer| DATE| 32| 35| 1991| 0.8492274| | AFTER| OCCURRENCE| 51| 58| admitted|CLINICAL_DEPT| 63| 73| Mayo Clinic|0.85629463| | BEFORE| OCCURRENCE| 51| 58| admitted| OCCURRENCE| 91| 100| discharged| 0.6843513| | OVERLAP|CLINICAL_DEPT| 63| 73|Mayo Clinic| DATE| 78| 85| May 2000| 0.7844673| | BEFORE|CLINICAL_DEPT| 63| 73|Mayo Clinic| OCCURRENCE| 91| 100| discharged|0.60411876| | OVERLAP|CLINICAL_DEPT| 63| 73|Mayo Clinic| DATE| 105| 116|October 2001| 0.540761| | BEFORE| DATE| 78| 85| May 2000| OCCURRENCE| 91| 100| discharged| 0.6042761| | OVERLAP| DATE| 78| 85| May 2000| DATE| 105| 116|October 2001|0.64867175| | BEFORE| OCCURRENCE| 91| 100| discharged| DATE| 105| 116|October 2001| 0.5302478| +--------+-------------+-------------+-----------+-----------+-------------+-------------+-----------+------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_temporal_events_biobert| |Compatibility:|Healthcare NLP 2.7.3+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Data Source Trained on temporal clinical events benchmark dataset. ## Benchmarking ```bash Relation Recall Precision F1 Support AFTER 0.332 0.655 0.440 2123 BEFORE 0.868 0.908 0.887 13817 OVERLAP 0.887 0.733 0.802 7860 Avg. 0.695 0.765 0.710 ``` --- layout: model title: Chinese BertForMaskedLM Cased model (from hfl) author: John Snow Labs name: bert_embeddings_rbt4_h312 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rbt4-h312` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt4_h312_zh_4.2.4_3.0_1670022861902.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt4_h312_zh_4.2.4_3.0_1670022861902.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt4_h312","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_rbt4_h312","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_rbt4_h312| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|43.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/rbt4-h312 - https://github.com/iflytek/MiniRBT - https://github.com/ymcui/LERT - https://github.com/ymcui/PERT - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/iflytek/HFL-Anthology --- layout: model title: English asr_xlsr_wav2vec_english TFWav2Vec2ForCTC from harshit345 author: John Snow Labs name: pipeline_asr_xlsr_wav2vec_english date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xlsr_wav2vec_english` is an English model originally trained by harshit345. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_xlsr_wav2vec_english_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xlsr_wav2vec_english_en_4.2.0_3.0_1664043396664.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xlsr_wav2vec_english_en_4.2.0_3.0_1664043396664.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_xlsr_wav2vec_english', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_xlsr_wav2vec_english", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_xlsr_wav2vec_english| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from vnktrmnb) author: John Snow Labs name: distilbert_qa_vnktrmnb_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `vnktrmnb`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_vnktrmnb_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773200065.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_vnktrmnb_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773200065.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vnktrmnb_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vnktrmnb_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
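In both snippets the predicted span lands in the `answer` column as a list of annotations per row. Stripped of Spark, the extraction step amounts to reading each annotation's `result` field; a minimal plain-Python sketch (the dict layout below is an illustrative stand-in, not the exact Spark NLP annotation schema):

```python
def extract_answers(rows):
    """Collect answer strings from rows shaped like the `answer` column:
    each row is a list of annotation-like dicts carrying a 'result' key
    (illustrative stand-in for the real annotation struct)."""
    return [[ann["result"] for ann in row] for row in rows]

# One row, one predicted span:
rows = [[{"annotatorType": "chunk", "result": "Clara"}]]
print(extract_answers(rows))  # [['Clara']]
```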
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_vnktrmnb_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/vnktrmnb/distilbert-base-uncased-finetuned-squad --- layout: model title: Pipeline to Detect Relations Between Genes and Phenotypes author: John Snow Labs name: re_human_phenotype_gene_clinical_pipeline date: 2022-03-31 tags: [licensed, clinical, re, genes, phenotypes, en] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [re_human_phenotype_gene_clinical](https://nlp.johnsnowlabs.com/2020/09/30/re_human_phenotype_gene_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_human_phenotype_gene_clinical_pipeline_en_3.4.1_3.0_1648734276384.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_human_phenotype_gene_clinical_pipeline_en_3.4.1_3.0_1648734276384.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("re_human_phenotype_gene_clinical_pipeline", "en", "clinical/models") pipeline.annotate("Bilateral colobomatous microphthalmia and developmental delay in whom genetic studies identified a homozygous TENM3") ``` ```scala val pipeline = new PretrainedPipeline("re_human_phenotype_gene_clinical_pipeline", "en", "clinical/models") pipeline.annotate("Bilateral colobomatous microphthalmia and developmental delay in whom genetic studies identified a homozygous TENM3") ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.human_gene_clinical.pipeline").predict("""Bilateral colobomatous microphthalmia and developmental delay in whom genetic studies identified a homozygous TENM3""") ```
## Results ```bash +----+------------+-----------+-----------------+---------------+---------------------+-----------+-----------------+---------------+---------------------+--------------+ | | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | +====+============+===========+=================+===============+=====================+===========+=================+===============+=====================+==============+ | 0 | 1 | HP | 23 | 36 | microphthalmia | HP | 42 | 60 | developmental delay | 0.999954 | +----+------------+-----------+-----------------+---------------+---------------------+-----------+-----------------+---------------+---------------------+--------------+ | 1 | 1 | HP | 23 | 36 | microphthalmia | GENE | 110 | 114 | TENM3 | 0.999999 | +----+------------+-----------+-----------------+---------------+---------------------+-----------+-----------------+---------------+---------------------+--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_human_phenotype_gene_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - PerceptronModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - DependencyParserModel - RelationExtractionModel --- layout: model title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah TFWav2Vec2ForCTC from nimrah author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging 
Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah` is an English model originally trained by nimrah. NOTE: This model only works on a CPU; if you need to run it on a GPU device, please use asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah_en_4.2.0_3.0_1664115274192.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah_en_4.2.0_3.0_1664115274192.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
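Both snippets above assume an existing `audioDf` whose `audio_content` column holds the raw float samples, but the card does not show how to build it. A minimal sketch of turning a 16-bit PCM WAV file into such floats using only the standard library (the file name is a placeholder, and the 16 kHz assumption is ours — XLS-R-style Wav2vec2 models are typically trained on 16 kHz mono audio):

```python
import struct
import wave


def wav_to_floats(path):
    """Read a 16-bit PCM mono WAV file and return its samples
    as floats normalized to [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        frames = wf.readframes(wf.getnframes())
        # '<h' = little-endian signed 16-bit; two bytes per sample.
        samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]


# With a SparkSession in scope, the floats can then be wrapped into the
# single-column DataFrame the pipeline expects (column name from the card):
# audioDf = spark.createDataFrame([(wav_to_floats("sample.wav"),)],
#                                 ["audio_content"])
```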
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_turkish_colab_by_nimrah| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Swedish Lemmatizer author: John Snow Labs name: lemma date: 2020-05-05 11:16:00 +0800 task: Lemmatization language: sv edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [lemmatizer, sv] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_sv_2.5.0_2.4_1588666548498.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_sv_2.5.0_2.4_1588666548498.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "sv") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Förutom att vara kungen i norr är John Snow en engelsk läkare och en ledare inom utveckling av anestesi och medicinsk hygien.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "sv") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Förutom att vara kungen i norr är John Snow en engelsk läkare och en ledare inom utveckling av anestesi och medicinsk hygien.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Förutom att vara kungen i norr är John Snow en engelsk läkare och en ledare inom utveckling av anestesi och medicinsk hygien."""] lemma_df = nlu.load('sv.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=6, result='Förutom', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=8, end=10, result='att', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=12, end=15, result='vara', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=17, end=22, result='kung', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=24, end=24, result='i', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|sv| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Italian T5ForConditionalGeneration Small Cased model (from it5) author: John Snow Labs name: t5_it5_efficient_small_el32_question_answering date: 2023-01-30 tags: [it, open_source, t5, tensorflow] task: Text Generation language: it edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `it5-efficient-small-el32-question-answering` is an Italian model originally trained by `it5`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_question_answering_it_4.3.0_3.0_1675103534190.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_question_answering_it_4.3.0_3.0_1675103534190.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_question_answering","it") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_question_answering","it") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_it5_efficient_small_el32_question_answering| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|it| |Size:|593.9 MB| ## References - https://huggingface.co/it5/it5-efficient-small-el32-question-answering - https://github.com/stefan-it - https://arxiv.org/abs/2203.03759 - https://gsarti.com - https://malvinanissim.github.io - https://github.com/gsarti/it5 - https://paperswithcode.com/sota?task=Question+Answering&dataset=SQuAD-IT --- layout: model title: Recognize Entities OntoNotes pipeline - ELECTRA Small author: John Snow Labs name: onto_recognize_entities_electra_small date: 2021-03-22 tags: [open_source, english, onto_recognize_entities_electra_small, pipeline, en] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The onto_recognize_entities_electra_small pipeline is a pretrained pipeline that performs the basic text processing steps and covers most of the common text processing tasks on your DataFrame. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_electra_small_en_3.0.0_3.0_1616444187316.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_electra_small_en_3.0.0_3.0_1616444187316.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('onto_recognize_entities_electra_small', lang = 'en') annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("onto_recognize_entities_electra_small", lang = "en") val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs ! "] result_df = nlu.load('en.ner.onto.electra.small').predict(text) result_df ```
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:-----------------------------|:-------------------------------------------|:-------------------| | 0 | ['Hello from John Snow Labs ! '] | ['Hello from John Snow Labs !'] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | [[0.2279076874256134,.,...]] | ['O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O'] | ['John Snow Labs'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_recognize_entities_electra_small| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| --- layout: model title: Pipeline to Detect Medical Risk Factors author: John Snow Labs name: ner_risk_factors_biobert_pipeline date: 2022-03-21 tags: [licensed, ner, biobert, risk_factor, en, clinical] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_risk_factors_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_risk_factors_biobert_en.html) model. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_RISK_FACTORS/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_RISK_FACTORS.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_risk_factors_biobert_pipeline_en_3.4.1_3.0_1647871536746.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_risk_factors_biobert_pipeline_en_3.4.1_3.0_1647871536746.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_risk_factors_biobert_pipeline", "en", "clinical/models") pipeline.fullAnnotate('HISTORY OF PRESENT ILLNESS: The patient is a 40-year-old white male who presents with a chief complaint of "chest pain". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. The severity of the pain has progressively increased. He describes the pain as a sharp and heavy pain which radiates to his neck & left arm. He ranks the pain a 7 on a scale of 1-10. He admits some shortness of breath & diaphoresis. He states that he has had nausea & 3 episodes of vomiting tonight. He denies any fever or chills. He admits prior episodes of similar pain prior to his PTCA in 1995. He states the pain is somewhat worse with walking and seems to be relieved with rest. There is no change in pain with positioning. He states that he took 3 nitroglycerin tablets sublingually over the past 1 hour, which he states has partially relieved his pain. The patient ranks his present pain a 4 on a scale of 1-10. The most recent episode of pain has lasted one-hour. The patient denies any history of recent surgery, head trauma, recent stroke, abnormal bleeding such as blood in urine or stool or nosebleed.\n\nREVIEW OF SYSTEMS: All other systems reviewed & are negative.\n\nPAST MEDICAL HISTORY: Diabetes mellitus type II, hypertension, coronary artery disease, atrial fibrillation, status post PTCA in 1995 by Dr. ABC.\n\nSOCIAL HISTORY: Denies alcohol or drugs. Smokes 2 packs of cigarettes per day. 
Works as a banker.\n\nFAMILY HISTORY: Positive for coronary artery disease (father & brother).') ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_risk_factors_biobert_pipeline", "en", "clinical/models") pipeline.fullAnnotate("HISTORY OF PRESENT ILLNESS: The patient is a 40-year-old white male who presents with a chief complaint of \"chest pain\". The patient is diabetic and has a prior history of coronary artery disease. The patient presents today stating that his chest pain started yesterday evening and has been somewhat intermittent. The severity of the pain has progressively increased. He describes the pain as a sharp and heavy pain which radiates to his neck & left arm. He ranks the pain a 7 on a scale of 1-10. He admits some shortness of breath & diaphoresis. He states that he has had nausea & 3 episodes of vomiting tonight. He denies any fever or chills. He admits prior episodes of similar pain prior to his PTCA in 1995. He states the pain is somewhat worse with walking and seems to be relieved with rest. There is no change in pain with positioning. He states that he took 3 nitroglycerin tablets sublingually over the past 1 hour, which he states has partially relieved his pain. The patient ranks his present pain a 4 on a scale of 1-10. The most recent episode of pain has lasted one-hour. The patient denies any history of recent surgery, head trauma, recent stroke, abnormal bleeding such as blood in urine or stool or nosebleed.\n\nREVIEW OF SYSTEMS: All other systems reviewed & are negative.\n\nPAST MEDICAL HISTORY: Diabetes mellitus type II, hypertension, coronary artery disease, atrial fibrillation, status post PTCA in 1995 by Dr. ABC.\n\nSOCIAL HISTORY: Denies alcohol or drugs. Smokes 2 packs of cigarettes per day. 
Works as a banker.\n\nFAMILY HISTORY: Positive for coronary artery disease (father & brother).") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.risk_factors_biobert.pipeline").predict("""chest pain""") ```
## Results ```bash +----------------------------------------+------------+ |chunks |entities | +----------------------------------------+------------+ |diabetic |DIABETES | |prior history of coronary artery disease|CAD | |PTCA in 1995. |CAD | |Diabetes mellitus type II |DIABETES | |hypertension |HYPERTENSION| |coronary artery disease |CAD | |Smokes 2 packs of cigarettes per day |SMOKER | +----------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_risk_factors_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.0 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from SimulSt) author: John Snow Labs name: xlmroberta_ner_simulst_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `SimulSt`. 
## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_simulst_base_finetuned_panx_de_4.1.0_3.0_1660430532299.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_simulst_base_finetuned_panx_de_4.1.0_3.0_1660430532299.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_simulst_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_simulst_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
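The `NerConverter` stage above merges token-level IOB tags into entity chunks. As a rough illustration of that grouping logic — a plain-Python sketch, not the annotator's actual implementation:

```python
def iob_to_chunks(tokens, tags):
    """Group parallel (token, IOB tag) lists into (chunk_text, label) spans.
    'B-X' opens a chunk, a matching 'I-X' extends it, anything else closes it."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks


tokens = ["Hello", "from", "John", "Snow", "Labs", "!"]
tags = ["O", "O", "B-ORG", "I-ORG", "I-ORG", "O"]
print(iob_to_chunks(tokens, tags))  # [('John Snow Labs', 'ORG')]
```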
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_simulst_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/SimulSt/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Russian Bert Embeddings (from Geotrend) author: John Snow Labs name: bert_embeddings_bert_base_ru_cased date: 2022-04-11 tags: [bert, embeddings, ru, open_source] task: Embeddings language: ru edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-ru-cased` is a Russian model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_ru_cased_ru_3.4.2_3.0_1649674150803.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_ru_cased_ru_3.4.2_3.0_1649674150803.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_ru_cased","ru") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Я люблю искра NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_ru_cased","ru") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Я люблю искра NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ru.embed.bert_base_ru_cased").predict("""Я люблю искра NLP""") ```
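After `transform`, the `embeddings` column holds one vector per token. Downstream, such vectors are usually compared with cosine similarity; a minimal helper independent of Spark (the vectors below are toy values, not real BERT outputs):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Identical directions score 1.0, orthogonal directions score 0.0:
print(cosine([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0
```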
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_ru_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ru| |Size:|364.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-ru-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: German Public Health Mention Sequence Classifier (German-MedBERT) author: John Snow Labs name: bert_sequence_classifier_health_mentions_medbert date: 2022-08-10 tags: [public_health, de, licensed, sequence_classification, health_mention] task: Text Classification language: de edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [German-MedBERT](https://opus4.kobv.de/opus4-rhein-waal/frontdoor/index/index/searchtype/collection/id/16225/start/0/rows/10/doctypefq/masterthesis/docId/740) based sequence classification model that can classify public health mentions in German social media text. ## Predicted Entities `non-health`, `health-related` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_health_mentions_medbert_de_4.0.2_3.0_1660133738049.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_health_mentions_medbert_de_4.0.2_3.0_1660133738049.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_health_mentions_medbert", "de", "clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame([ ["Diabetes habe ich schon seit meiner Kindheit, seit der Pubertätch nehme Insulin."], ["Die Hochzeitszeitung ist zum Glück sehr schön geworden. Das Brautpaar gat sich gefreut."] ]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_health_mentions_medbert", "de", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq("Diabetes habe ich schon seit meiner Kindheit, seit der Pubertätch nehme Insulin.", "Die Hochzeitszeitung ist zum Glück sehr schön geworden. Das Brautpaar gat sich gefreut.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.classify.bert_sequence.health_mentions_medbert").predict("""Die Hochzeitszeitung ist zum Glück sehr schön geworden. Das Brautpaar gat sich gefreut.""") ```
## Results ```bash +---------------------------------------------------------------------------------------+----------------+ |text |result | +---------------------------------------------------------------------------------------+----------------+ |Diabetes habe ich schon seit meiner Kindheit, seit der Pubertätch nehme Insulin. |[health-related]| |Die Hochzeitszeitung ist zum Glück sehr schön geworden. Das Brautpaar gat sich gefreut.|[non-health] | +---------------------------------------------------------------------------------------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_health_mentions_medbert| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|de| |Size:|409.8 MB| |Case sensitive:|true| |Max sentence length:|128| ## References Curated from several academic and in-house datasets. ## Benchmarking ```bash label precision recall f1-score support non-health 0.97 0.88 0.92 82 health-related 0.87 0.97 0.92 69 accuracy - - 0.92 151 macro-avg 0.92 0.92 0.92 151 weighted-avg 0.93 0.92 0.92 151 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from ksabeh) author: John Snow Labs name: roberta_qa_base_attribute_correction_mlm date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-attribute-correction-mlm` is an English model originally trained by `ksabeh`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_attribute_correction_mlm_en_4.3.0_3.0_1674212707495.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_attribute_correction_mlm_en_4.3.0_3.0_1674212707495.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_attribute_correction_mlm","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_attribute_correction_mlm","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_attribute_correction_mlm| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|466.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/ksabeh/roberta-base-attribute-correction-mlm --- layout: model title: Legal Asset Purchase Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_asset_purchase_agreement_bert date: 2022-11-24 tags: [en, legal, classification, agreement, asset_purchase, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_asset_purchase_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `asset-purchase-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `asset-purchase-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_asset_purchase_agreement_bert_en_1.0.0_3.0_1669308592040.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_asset_purchase_agreement_bert_en_1.0.0_3.0_1669308592040.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_asset_purchase_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
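The classifier's output can be post-processed in plain Python. Below is a minimal, hypothetical sketch assuming the usual `fullAnnotate`-style output shape used across Spark NLP (annotations carrying a `result` string and a `metadata` dict of per-label confidences); the helper `top_category` and the mocked `docs` are illustrative, not part of this model card.

```python
# Hypothetical post-processing helper: collapse a fullAnnotate()-style
# output into (label, confidence) pairs. Annotations are mocked here
# as plain dicts with "result" and "metadata" keys for illustration.

def top_category(annotated_doc):
    """Return (label, confidence) for the first category annotation."""
    categories = annotated_doc.get("category", [])
    if not categories:
        return ("other", 0.0)
    first = categories[0]
    label = first["result"]
    # ClassifierDL-style models expose one confidence score per label
    # in the annotation metadata, keyed by the label name.
    confidence = float(first["metadata"].get(label, 0.0))
    return (label, confidence)

# Mocked output for two documents:
docs = [
    {"category": [{"result": "asset-purchase-agreement",
                   "metadata": {"asset-purchase-agreement": "0.97",
                                "other": "0.03"}}]},
    {"category": [{"result": "other",
                   "metadata": {"asset-purchase-agreement": "0.12",
                                "other": "0.88"}}]},
]
labels = [top_category(d)[0] for d in docs]
```

The same helper works for any of the binary document classifiers in this family, since they share the `category` output column convention.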
## Results ```bash +-------+ |result| +-------+ |[asset-purchase-agreement]| |[other]| |[other]| |[asset-purchase-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_asset_purchase_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support asset-purchase-agreement 0.91 0.76 0.83 38 other 0.87 0.95 0.91 65 accuracy - - 0.88 103 macro-avg 0.89 0.86 0.87 103 weighted-avg 0.89 0.88 0.88 103 ``` --- layout: model title: English T5ForConditionalGeneration Small Cased model (from rohitsroch) author: John Snow Labs name: t5_hybrid_hbh_small_ami_sum date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `hybrid_hbh_t5-small_ami_sum` is an English model originally trained by `rohitsroch`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_hybrid_hbh_small_ami_sum_en_4.3.0_3.0_1675102906707.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_hybrid_hbh_small_ami_sum_en_4.3.0_3.0_1675102906707.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_hybrid_hbh_small_ami_sum","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_hybrid_hbh_small_ami_sum","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_hybrid_hbh_small_ami_sum| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|288.9 MB| ## References - https://huggingface.co/rohitsroch/hybrid_hbh_t5-small_ami_sum - https://doi.org/10.1145/3508546.3508640 --- layout: model title: English image_classifier_vit_teeth_verify ViTForImageClassification from steven123 author: John Snow Labs name: image_classifier_vit_teeth_verify date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_teeth_verify` is an English model originally trained by steven123. ## Predicted Entities `Good Teeth`, `Missing Teeth`, `Rotten Teeth` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_teeth_verify_en_4.1.0_3.0_1660170214773.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_teeth_verify_en_4.1.0_3.0_1660170214773.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_teeth_verify", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_teeth_verify", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_teeth_verify| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: RE Pipeline between Problem, Test, and Findings in Reports author: John Snow Labs name: re_test_problem_finding_pipeline date: 2022-03-31 tags: [licensed, clinical, relation_extraction, problem, test, findings, en] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [re_test_problem_finding](https://nlp.johnsnowlabs.com/2021/04/19/re_test_problem_finding_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_RADIOLOGY/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/RE_RADIOLOGY.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_test_problem_finding_pipeline_en_3.4.1_3.0_1648733292407.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_test_problem_finding_pipeline_en_3.4.1_3.0_1648733292407.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("re_test_problem_finding_pipeline", "en", "clinical/models") pipeline.fullAnnotate("Targeted biopsy of this lesion for histological correlation should be considered.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("re_test_problem_finding_pipeline", "en", "clinical/models") pipeline.fullAnnotate("Targeted biopsy of this lesion for histological correlation should be considered.") ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.test_problem_finding.pipeline").predict("""Targeted biopsy of this lesion for histological correlation should be considered.""") ```
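The relation annotations returned by `fullAnnotate` can be flattened into rows like those shown in the Results table. Below is a hypothetical sketch: it assumes the common Spark NLP relation-extraction output shape, where each relation annotation carries a `result` (the relation value) and `metadata` keys such as `entity1`, `chunk1`, `entity2`, `chunk2`; the annotations are mocked as plain dicts here.

```python
# Illustrative helper: flatten relation-extraction annotations into
# table rows. The annotation objects are mocked as dicts carrying the
# metadata keys commonly emitted by Spark NLP RE models.

def relations_to_rows(relation_annotations):
    """Turn a list of relation annotations into row dicts."""
    rows = []
    for i, rel in enumerate(relation_annotations):
        m = rel["metadata"]
        rows.append({
            "index": i,
            "relations": rel["result"],
            "entity1": m["entity1"], "chunk1": m["chunk1"],
            "entity2": m["entity2"], "chunk2": m["chunk2"],
        })
    return rows

# Mocked annotation for the sentence in the example above:
mocked = [{"result": "1",
           "metadata": {"entity1": "PROCEDURE", "chunk1": "biopsy",
                        "entity2": "SYMPTOM", "chunk2": "lesion"}}]
rows = relations_to_rows(mocked)
```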
## Results ```bash | index | relations | entity1 | chunk1 | entity2 | chunk2 | |-------|--------------|--------------|---------------------|--------------|---------| | 0 | 1 | PROCEDURE | biopsy | SYMPTOM | lesion | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_test_problem_finding_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - PerceptronModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - DependencyParserModel - RelationExtractionModel --- layout: model title: Legal Redemption Clause Binary Classifier author: John Snow Labs name: legclf_redemption_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `redemption` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `redemption` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_redemption_clause_en_1.0.0_3.2_1660122886423.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_redemption_clause_en_1.0.0_3.2_1660122886423.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
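The paragraph-splitting technique mentioned above can be sketched in plain Python before the Spark pipeline is applied. The regex and the minimum-length threshold below are illustrative choices, not the workshop's exact implementation.

```python
import re

def split_paragraphs(text, min_chars=30):
    """Split a document into candidate clause paragraphs on blank lines,
    dropping fragments too short to classify meaningfully."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if len(p.strip()) >= min_chars]

doc = """REDEMPTION. The Company may redeem the Notes in whole or in part
at any time prior to maturity at the redemption price set forth herein.

IN WITNESS WHEREOF, the parties have executed this Agreement."""

paragraphs = split_paragraphs(doc)
# Each paragraph can then be loaded into a Spark DataFrame, e.g.:
# spark.createDataFrame([[p] for p in paragraphs]).toDF("clause_text")
```

Feeding paragraphs rather than whole documents keeps each input within the 512-token limit of the sentence embeddings used by this classifier.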
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_redemption_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[redemption]| |[other]| |[other]| |[redemption]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_redemption_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.96 0.95 0.95 100 redemption 0.85 0.88 0.87 33 accuracy - - 0.93 133 macro-avg 0.91 0.91 0.91 133 weighted-avg 0.93 0.93 0.93 133 ``` --- layout: model title: Sentence Entity Resolver for CVX author: John Snow Labs name: sbiobertresolve_cvx date: 2022-10-12 tags: [entity_resolution, cvx, clinical, en, licensed] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 4.2.1 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps vaccine entities to CVX codes using sbiobert_base_cased_mli Sentence Bert Embeddings. Additionally, this model returns the status of the vaccine (Active/Inactive/Pending/Non-US) in the all_k_aux_labels column. ## Predicted Entities `CVX Code`, `Status` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cvx_en_4.2.1_3.0_1665597761894.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cvx_en_4.2.1_3.0_1665597761894.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") cvx_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_cvx", "en", "clinical/models")\ .setInputCols(["sbert_embeddings"])\ .setOutputCol("cvx_code")\ .setDistanceFunction("EUCLIDEAN") pipelineModel = PipelineModel( stages = [ documentAssembler, sbert_embedder, cvx_resolver ]) light_model = LightPipeline(pipelineModel) result = light_model.fullAnnotate(["Sinovac", "Moderna", "BIOTHRAX"]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en","clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sbert_embeddings") val cvx_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_cvx", "en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("cvx_code") .setDistanceFunction("EUCLIDEAN") val cvx_pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, cvx_resolver)) val cvx_pipelineModel = cvx_pipeline.fit(Seq("").toDF("text")) val light_model = new LightPipeline(cvx_pipelineModel) val result = light_model.fullAnnotate(Array("Sinovac", "Moderna", "BIOTHRAX")) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.cvx").predict("""Put your text here.""")  ```
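The vaccine status that this resolver stores in `all_k_aux_labels` can be paired with the resolved code in plain Python. The sketch below is hypothetical: it assumes a `fullAnnotate`-style output where the `cvx_code` annotation's metadata holds the auxiliary labels for the top-k candidates as a `":::"`-separated string (the delimiter is an assumption here), and the annotation is mocked as a dict.

```python
# Illustrative helper: pair each resolved CVX code with the status
# stored in all_k_aux_labels. Assumes (hypothetically) that the aux
# labels for the k candidates are ":::"-separated, best match first.

def cvx_with_status(annotated, vaccine):
    """Return a row dict with the chunk, best CVX code, and its status."""
    ann = annotated["cvx_code"][0]
    code = ann["result"]
    status = ann["metadata"]["all_k_aux_labels"].split(":::")[0]
    return {"ner_chunk": vaccine, "cvx_code": code, "status": status}

# Mocked annotation for one input chunk:
mocked = {"cvx_code": [{"result": "24",
                        "metadata": {"all_k_aux_labels": "Active:::Inactive"}}]}
row = cvx_with_status(mocked, "BIOTHRAX")
```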
## Results ```bash +----------+--------+-------------------------------------------------------+--------+ |ner_chunk |cvx_code|resolved_text |Status | +----------+--------+-------------------------------------------------------+--------+ |Sinovac |511 |COVID-19 IV Non-US Vaccine (CoronaVac, Sinovac) |Non-US | |Moderna |227 |COVID-19, mRNA, LNP-S, PF, pediatric 50 mcg/0.5 mL dose|Inactive| |BIOTHRAX |24 |anthrax |Active | +----------+--------+-------------------------------------------------------+--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_cvx| |Compatibility:|Healthcare NLP 4.2.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[bert_embeddings]| |Output Labels:|[cvx_code]| |Language:|en| |Size:|1.6 MB| |Case sensitive:|false| --- layout: model title: Japanese BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_ja_cased date: 2022-12-02 tags: [ja, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ja edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-ja-cased` is a Japanese model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_ja_cased_ja_4.2.4_3.0_1670018071699.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_ja_cased_ja_4.2.4_3.0_1670018071699.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_ja_cased","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_ja_cased","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_ja_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Size:|350.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-ja-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: NER Pipeline for 9 African Languages author: John Snow Labs name: distilbert_base_token_classifier_masakhaner_pipeline date: 2022-03-18 tags: [hausa, igbo, kinyarwanda, luganda, nigerian, pidgin, swahilu, wolof, yoruba, xx, open_source] task: Named Entity Recognition language: xx edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [distilbert_base_token_classifier_masakhaner](https://nlp.johnsnowlabs.com/2022/01/18/distilbert_base_token_classifier_masakhaner_xx.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/Ner_masakhaner/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/Ner_masakhaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_base_token_classifier_masakhaner_pipeline_xx_3.4.1_3.0_1647607984667.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_base_token_classifier_masakhaner_pipeline_xx_3.4.1_3.0_1647607984667.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python masakhaner_pipeline = PretrainedPipeline("distilbert_base_token_classifier_masakhaner_pipeline", lang = "xx") masakhaner_pipeline.annotate("Ilé-iṣẹ́ẹ Mohammed Sani Musa, Activate Technologies Limited, ni ó kó ẹ̀rọ Ìwé-pélébé Ìdìbò Alálòpẹ́ (PVCs) tí a lò fún ìbò ọdún-un 2019, ígbà tí ó jẹ́ òǹdíjedupò lábẹ́ ẹgbẹ́ olóṣèlúu tí ó ń tukọ̀ ètò ìṣèlú lọ́wọ́ All rogressives Congress (APC) fún Aṣojú Ìlà-Oòrùn Niger, ìyẹn gẹ́gẹ́ bí ilé iṣẹ́ aṣèwádìí, Premium Times ṣe tẹ̀ ẹ́ jáde.") ``` ```scala val masakhaner_pipeline = new PretrainedPipeline("distilbert_base_token_classifier_masakhaner_pipeline", lang = "xx") masakhaner_pipeline.annotate("Ilé-iṣẹ́ẹ Mohammed Sani Musa, Activate Technologies Limited, ni ó kó ẹ̀rọ Ìwé-pélébé Ìdìbò Alálòpẹ́ (PVCs) tí a lò fún ìbò ọdún-un 2019, ígbà tí ó jẹ́ òǹdíjedupò lábẹ́ ẹgbẹ́ olóṣèlúu tí ó ń tukọ̀ ètò ìṣèlú lọ́wọ́ All rogressives Congress (APC) fún Aṣojú Ìlà-Oòrùn Niger, ìyẹn gẹ́gẹ́ bí ilé iṣẹ́ aṣèwádìí, Premium Times ṣe tẹ̀ ẹ́ jáde.") ```
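The chunks shown in the Results table are produced by the pipeline's NerConverter stage, which merges the token-level BIO tags emitted by the token classifier. A simplified, illustrative re-implementation of that merging logic (not the pipeline's actual code) looks like this:

```python
def merge_bio(tokens, tags):
    """Merge token-level BIO tags into (chunk, label) pairs,
    mimicking what a NerConverter-style stage does."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new chunk, closing any open one.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # I- continues the open chunk only if the label matches.
            current.append(tok)
        else:
            # O tag (or inconsistent I-) closes the open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Mohammed", "Sani", "Musa", ",", "Premium", "Times"]
tags = ["B-PER", "I-PER", "I-PER", "O", "B-ORG", "I-ORG"]
chunks = merge_bio(tokens, tags)
```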
## Results ```bash +-----------------------------+---------+ |chunk |ner_label| +-----------------------------+---------+ |Mohammed Sani Musa |PER | |Activate Technologies Limited|ORG | |ọdún-un 2019 |DATE | |All rogressives Congress |ORG | |APC |ORG | |Aṣojú Ìlà-Oòrùn Niger |LOC | |Premium Times |ORG | +-----------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_base_token_classifier_masakhaner_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|xx| |Size:|505.8 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - DistilBertForTokenClassification - NerConverter - Finisher --- layout: model title: Malay DistilBertForMaskedLM Small Cased model (from w11wo) author: John Snow Labs name: distilbert_embeddings_malaysian_small date: 2022-12-12 tags: [ms, open_source, distilbert_embeddings, distilbertformaskedlm] task: Embeddings language: ms edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `malaysian-distilbert-small` is a Malay model originally trained by `w11wo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_malaysian_small_ms_4.2.4_3.0_1670864988773.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_malaysian_small_ms_4.2.4_3.0_1670864988773.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_malaysian_small","ms") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(False) pipeline = Pipeline(stages=[documentAssembler, tokenizer, distilbert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_malaysian_small","ms") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, distilbert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_malaysian_small| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ms| |Size:|248.4 MB| |Case sensitive:|false| ## References - https://huggingface.co/w11wo/malaysian-distilbert-small - https://arxiv.org/abs/1910.01108 - https://github.com/sgugger - https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb - https://w11wo.github.io/ --- layout: model title: Chinese BertForQuestionAnswering Base Cased model (from bhavikardeshna) author: John Snow Labs name: bert_qa_multilingual_base_cased_chines date: 2022-07-07 tags: [zh, open_source, bert, question_answering] task: Question Answering language: zh edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `multilingual-bert-base-cased-chinese` is a Chinese model originally trained by `bhavikardeshna`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_base_cased_chines_zh_4.0.0_3.0_1657190418828.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_base_cased_chines_zh_4.0.0_3.0_1657190418828.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_multilingual_base_cased_chines","zh") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_multilingual_base_cased_chines","zh") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_multilingual_base_cased_chines| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|zh| |Size:|665.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/bhavikardeshna/multilingual-bert-base-cased-chinese --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from tli8hf) author: John Snow Labs name: roberta_qa_unqover_base_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `unqover-roberta-base-squad` is an English model originally trained by `tli8hf`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_unqover_base_squad_en_4.3.0_3.0_1674224552223.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_unqover_base_squad_en_4.3.0_3.0_1674224552223.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_unqover_base_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_unqover_base_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_unqover_base_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|463.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/tli8hf/unqover-roberta-base-squad --- layout: model title: Normalize Parent Companies Names using Wikidata author: John Snow Labs name: finel_wiki_parentorgs date: 2023-01-18 tags: [parent, wikipedia, wikidata, en, licensed] task: Entity Resolution language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an Entity Resolution model, aimed to normalize a previously extracted ORG entity using its reference name in Wikidata. This is useful for then applying the `finel_wiki_parentorgs` Chunk Mapping model to get information about subsidiaries, countries, stock exchanges, etc. It also returns the TICKER, which can be retrieved from the `aux_label` column in the metadata. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finel_wiki_parentorgs_en_1.0.0_3.0_1674038525188.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finel_wiki_parentorgs_en_1.0.0_3.0_1674038525188.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \ .setInputCols("ner_chunk") \ .setOutputCol("sentence_embeddings") resolver = finance.SentenceEntityResolverModel.pretrained("finel_wiki_parentorgs", "en", "finance/models")\ .setInputCols(["sentence_embeddings"]) \ .setOutputCol("normalized_name")\ .setDistanceFunction("EUCLIDEAN") pipeline = nlp.Pipeline( stages = [ documentAssembler, embeddings, resolver ]) pipelineModel = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) lp = nlp.LightPipeline(pipelineModel) test_pred = lp.fullAnnotate('ALPHABET') print(test_pred[0]['normalized_name'][0].result) print(test_pred[0]['normalized_name'][0].metadata['all_k_aux_labels'].split(':::')[0]) ```
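The resolver packs auxiliary data such as the ticker into the `all_k_aux_labels` metadata field, with values separated by `:::`, which is why the snippet above calls `split(':::')[0]`. A plain-Python sketch of pulling the top value out of an annotation's metadata dict (the sample string below is hypothetical; only the field name and the `:::` separator come from the snippet above):

```python
def top_aux_label(metadata: dict) -> str:
    """Return the first value stored in the 'all_k_aux_labels' field.

    Values are ':::'-separated, mirroring the split(':::')[0] call
    used on the LightPipeline output above.
    """
    return metadata["all_k_aux_labels"].split(":::")[0]

# Hypothetical metadata dict shaped like a fullAnnotate() annotation's metadata
sample = {"all_k_aux_labels": "GOOGL:::NASDAQ"}
print(top_aux_label(sample))  # GOOGL
```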
## Results ```bash Alphabet Inc. Aux data: GOOGL ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finel_wiki_parentorgs| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[original_company_name]| |Language:|en| |Size:|2.8 MB| |Case sensitive:|false| ## References Wikidata dump about company holdings using SPARQL --- layout: model title: Translate Pijin to English Pipeline author: John Snow Labs name: translate_pis_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, pis, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `pis` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_pis_en_xx_2.7.0_2.4_1609688574385.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_pis_en_xx_2.7.0_2.4_1609688574385.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_pis_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_pis_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.pis.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_pis_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Medical Question Answering (biogpt) author: John Snow Labs name: medical_qa_biogpt date: 2023-05-17 tags: [licensed, clinical, en, biogpt, pubmed, question_answering, tensorflow] task: Question Answering language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true engine: tensorflow annotator: MedicalQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is directly ported from the official BioGPT [implementation](https://github.com/microsoft/BioGPT), which is trained on PubMed abstracts and then fine-tuned on the PubMedQA dataset. It is the baseline version, called [BioGPT-QA-PubMedQA-BioGPT](https://msramllasc.blob.core.windows.net/modelrelease/BioGPT/checkpoints/QA-PubMedQA-BioGPT.tgz). Two question types are supported: `"short"` (producing yes/no/maybe answers) and `"long"` (producing full answers).
## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/BIOGPT_MEDICAL_QUESTION_ANSWERING/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/31.Medical_Question_Answering.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/medical_qa_biogpt_en_4.4.2_3.0_1684313829161.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/medical_qa_biogpt_en_4.4.2_3.0_1684313829161.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler()\ .setInputCols("question", "context")\ .setOutputCols("document_question", "document_context") med_qa = sparknlp_jsl.annotators.MedicalQuestionAnswering\ .pretrained("medical_qa_biogpt","en","clinical/models")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setMaxNewTokens(30)\ .setTopK(1)\ .setQuestionType("long") # "short" pipeline = Pipeline(stages=[document_assembler, med_qa]) paper_abstract = "The visual indexing theory proposed by Zenon Pylyshyn (Cognition, 32, 65–97, 1989) predicts that visual attention mechanisms are employed when mental images are projected onto a visual scene. Recent eye-tracking studies have supported this hypothesis by showing that people tend to look at empty places where requested information has been previously presented. However, it has remained unclear to what extent this behavior is related to memory performance. The aim of the present study was to explore whether the manipulation of spatial attention can facilitate memory retrieval. In two experiments, participants were asked first to memorize a set of four objects and then to determine whether a probe word referred to any of the objects. The results of both experiments indicate that memory accuracy is not affected by the current focus of attention and that all the effects of directing attention to specific locations on response times can be explained in terms of stimulus–stimulus and stimulus–response spatial compatibility." long_question = "What is the effect of directing attention on memory?" yes_no_question = "Does directing attention improve memory for items?" 
data = spark.createDataFrame( [ [long_question, paper_abstract, "long"], [yes_no_question, paper_abstract, "short"], ] ).toDF("question", "context", "question_type") pipeline.fit(data).transform(data.where("question_type == 'long'"))\ .select("answer.result")\ .show(truncate=False) ############################### # for the short answer med_qa.setQuestionType("short") # "long" pipeline = Pipeline(stages=[document_assembler, med_qa]) pipeline.fit(data).transform(data.where("question_type == 'short'"))\ .select("answer.result")\ .show(truncate=False) ``` ```scala val document_assembler = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val med_qa = MedicalQuestionAnswering .pretrained("medical_qa_biogpt", "en", "clinical/models") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setMaxNewTokens(30) .setTopK(1) .setQuestionType("long") // "short" val pipeline = new Pipeline().setStages(Array(document_assembler, med_qa)) val paper_abstract = "The visual indexing theory proposed by Zenon Pylyshyn (Cognition, 32, 65–97, 1989) predicts that visual attention mechanisms are employed when mental images are projected onto a visual scene. Recent eye-tracking studies have supported this hypothesis by showing that people tend to look at empty places where requested information has been previously presented. However, it has remained unclear to what extent this behavior is related to memory performance. The aim of the present study was to explore whether the manipulation of spatial attention can facilitate memory retrieval. In two experiments, participants were asked first to memorize a set of four objects and then to determine whether a probe word referred to any of the objects. The results of both experiments indicate that memory accuracy is not affected by the current focus of attention and that all the effects of directing attention to specific locations on response times can be explained in terms of stimulus–stimulus and stimulus–response spatial compatibility." val long_question = "What is the effect of directing attention on memory?" val yes_no_question = "Does directing attention improve memory for items?" val data = Seq( (long_question, paper_abstract, "long"), (yes_no_question, paper_abstract, "short")) .toDS.toDF("question", "context", "question_type") val result = pipeline.fit(data).transform(data) ```
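Because `setQuestionType` is set once per annotator, the example above runs the pipeline twice, filtering the DataFrame by `question_type` each time. The routing logic can be sketched in plain Python (the row tuples mirror the `(question, context, question_type)` columns built above; the helper name is illustrative):

```python
# Rows shaped like the (question, context, question_type) DataFrame above
rows = [
    ("What is the effect of directing attention on memory?", "...abstract...", "long"),
    ("Does directing attention improve memory for items?", "...abstract...", "short"),
]

def rows_for(question_type: str, data):
    # Mirrors data.where("question_type == '<type>'") in the Spark example
    return [row for row in data if row[2] == question_type]

print(len(rows_for("long", rows)))   # 1
print(len(rows_for("short", rows)))  # 1
```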
## Results ```bash +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |result | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |[The effect of directing attention on memory is that it can help to improve the accuracy and recall of a document. It can help to improve the accuracy of a document by allowing the user to quickly and easily access the information they need. It can also help to improve the overall efficiency of a document by allowing the user to quickly]| +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|medical_qa_biogpt| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.1 GB| |Case sensitive:|true| --- layout: model title: Spanish BertForQuestionAnswering model (from MMG) author: John Snow Labs name: bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac date: 2022-06-02 tags: [es, open_source, question_answering, bert] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-cased-finetuned-sqac` is a Spanish model originally trained by `MMG`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_es_4.0.0_3.0_1654180491232.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_es_4.0.0_3.0_1654180491232.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.sqac.bert.base_cased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
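The NLU one-liner expects the question and context joined by a `|||` separator, as in the `predict` call above. A small plain-Python helper for building that input string (the helper name is ours, not part of the NLU API):

```python
def to_nlu_qa_input(question: str, context: str) -> str:
    # NLU question-answering input format: "<question>|||<context>"
    return f"{question}|||{context}"

print(to_nlu_qa_input("What's my name?", "My name is Clara and I live in Berkeley."))
# What's my name?|||My name is Clara and I live in Berkeley.
```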
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|410.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/MMG/bert-base-spanish-wwm-cased-finetuned-sqac --- layout: model title: English DistilBertForQuestionAnswering Base Cased model (from Moussab) author: John Snow Labs name: distilbert_qa_base_cased_led_squad_orkg_what_1e_04 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-orkg-what-1e-04` is an English model originally trained by `Moussab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_orkg_what_1e_04_en_4.3.0_3.0_1672766823767.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_orkg_what_1e_04_en_4.3.0_3.0_1672766823767.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_orkg_what_1e_04","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_orkg_what_1e_04","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_cased_led_squad_orkg_what_1e_04| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Moussab/distilbert-base-cased-distilled-squad-orkg-what-1e-04 --- layout: model title: English T5ForConditionalGeneration Small Cased model (from mrm8488) author: John Snow Labs name: t5_small_finetuned_squadv1 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-finetuned-squadv1` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_finetuned_squadv1_en_4.3.0_3.0_1675126129461.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_finetuned_squadv1_en_4.3.0_3.0_1675126129461.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_finetuned_squadv1","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_finetuned_squadv1","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_finetuned_squadv1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|274.6 MB| ## References - https://huggingface.co/mrm8488/t5-small-finetuned-squadv1 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://rajpurkar.github.io/SQuAD-explorer/ - https://arxiv.org/pdf/1910.10683.pdf - https://i.imgur.com/jVFMMWR.png - https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb - https://twitter.com/psuraj28 - https://twitter.com/mrm8488 - https://www.linkedin.com/in/manuel-romero-cs/ --- layout: model title: English image_classifier_vit_hugging_geese ViTForImageClassification from osanseviero author: John Snow Labs name: image_classifier_vit_hugging_geese date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_hugging_geese` is an English model originally trained by osanseviero. ## Predicted Entities `swan`, `goose`, `dog`, `duck`, `pigeon` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_hugging_geese_en_4.1.0_3.0_1660169832802.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_hugging_geese_en_4.1.0_3.0_1660169832802.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_hugging_geese", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_hugging_geese", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_hugging_geese| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Finnish asr_wav2vec2_base_voxpopuli_v2_finetuned TFWav2Vec2ForCTC from Finnish-NLP author: John Snow Labs name: pipeline_asr_wav2vec2_base_voxpopuli_v2_finetuned date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_voxpopuli_v2_finetuned` is a Finnish model originally trained by Finnish-NLP. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use `pipeline_asr_wav2vec2_base_voxpopuli_v2_finetuned_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_voxpopuli_v2_finetuned_fi_4.2.0_3.0_1664038305396.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_voxpopuli_v2_finetuned_fi_4.2.0_3.0_1664038305396.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_voxpopuli_v2_finetuned', lang = 'fi') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_voxpopuli_v2_finetuned", lang = "fi") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_voxpopuli_v2_finetuned| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| |Size:|349.3 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering model (from SreyanG-NVIDIA) author: John Snow Labs name: bert_qa_SreyanG_NVIDIA_bert_base_cased_finetuned_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased-finetuned-squad` is an English model originally trained by `SreyanG-NVIDIA`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_SreyanG_NVIDIA_bert_base_cased_finetuned_squad_en_4.0.0_3.0_1654179753356.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_SreyanG_NVIDIA_bert_base_cased_finetuned_squad_en_4.0.0_3.0_1654179753356.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_SreyanG_NVIDIA_bert_base_cased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_SreyanG_NVIDIA_bert_base_cased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_cased.by_SreyanG-NVIDIA").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_SreyanG_NVIDIA_bert_base_cased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/SreyanG-NVIDIA/bert-base-cased-finetuned-squad --- layout: model title: Detect Assertion Status from Entities Related to Cancer Diagnosis author: John Snow Labs name: assertion_oncology_problem_wip date: 2022-10-11 tags: [licensed, clinical, oncology, en, assertion] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects the assertion status of entities related to cancer diagnosis (including Metastasis, Cancer_Dx and Tumor_Finding, among others). ## Predicted Entities `Family_History`, `Hypothetical_Or_Absent`, `Medical_History`, `Possible` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_problem_wip_en_4.0.0_3.0_1665520053860.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_problem_wip_en_4.0.0_3.0_1665520053860.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(["Cancer_Dx"]) assertion = AssertionDLModel.pretrained("assertion_oncology_problem_wip", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion]) data = spark.createDataFrame([["The patient was diagnosed with breast cancer. 
Her family history is positive for other cancers."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Cancer_Dx")) val assertion = AssertionDLModel.pretrained("assertion_oncology_problem_wip","en","clinical/models") .setInputCols(Array("sentence","ner_chunk","embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion)) val data = Seq("""The patient was diagnosed with breast cancer. Her family history is positive for other cancers.""").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.oncology_problem_wip").predict("""The patient was diagnosed with breast cancer. Her family history is positive for other cancers.""") ```
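Once the pipeline has run, the assertion output reduces to one row per detected entity: the chunk text, its NER label, and its assertion status. A minimal pure-Python sketch of that pairing step, assuming the chunk texts, entity labels, and assertion labels have already been collected into parallel lists (the Spark extraction itself is omitted):

```python
def pair_assertions(chunks, ner_labels, assertions):
    """Zip parallel lists of chunk texts, entity labels, and assertion
    labels into one dict per detected entity."""
    if not (len(chunks) == len(ner_labels) == len(assertions)):
        raise ValueError("inputs must be parallel lists of equal length")
    return [
        {"chunk": c, "ner_label": n, "assertion": a}
        for c, n, a in zip(chunks, ner_labels, assertions)
    ]

# Values taken from the Results table for this model.
rows = pair_assertions(
    ["breast cancer", "cancers"],
    ["Cancer_Dx", "Cancer_Dx"],
    ["Medical_History", "Family_History"],
)
```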
## Results ```bash | chunk | ner_label | assertion | |:--------------|:------------|:----------------| | breast cancer | Cancer_Dx | Medical_History | | cancers | Cancer_Dx | Family_History | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_oncology_problem_wip| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion_pred]| |Language:|en| |Size:|1.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label precision recall f1-score support Family_History 0.75 0.75 0.75 12.0 Hypothetical_Or_Absent 0.87 0.81 0.84 310.0 Medical_History 0.76 0.86 0.81 304.0 Possible 0.71 0.61 0.65 92.0 macro-avg 0.77 0.76 0.76 718.0 weighted-avg 0.80 0.80 0.80 718.0 ``` --- layout: model title: Pipeline to Detect Time-related Terminology author: ahmedlone127 name: roberta_token_classifier_timex_semeval_pipeline date: 2022-06-14 tags: [timex, semeval, ner, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: false article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [roberta_token_classifier_timex_semeval](https://nlp.johnsnowlabs.com/2021/12/28/roberta_token_classifier_timex_semeval_en.html) model.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_TIMEX_SEMEVAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/roberta_token_classifier_timex_semeval_pipeline_en_4.0.0_3.0_1655212741280.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/roberta_token_classifier_timex_semeval_pipeline_en_4.0.0_3.0_1655212741280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python timex_pipeline = PretrainedPipeline("roberta_token_classifier_timex_semeval_pipeline", lang = "en") timex_pipeline.annotate("Model training was started at 22:12C and it took 3 days from Tuesday to Friday.") ``` ```scala val timex_pipeline = new PretrainedPipeline("roberta_token_classifier_timex_semeval_pipeline", lang = "en") timex_pipeline.annotate("Model training was started at 22:12C and it took 3 days from Tuesday to Friday.") ```
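The `annotate` call returns a dictionary mapping output columns to lists of strings. A small helper, sketched here with illustrative key names (`ner_chunk`, `ner_label` — check your pipeline's actual output column names), pairs each detected chunk with its predicted label:

```python
def chunks_with_labels(annotations, chunk_key="ner_chunk", label_key="ner_label"):
    """Pair chunk texts with their predicted labels from an annotate()-style
    dict of parallel lists. Key names are illustrative, not guaranteed."""
    chunks = annotations.get(chunk_key, [])
    labels = annotations.get(label_key, [])
    return list(zip(chunks, labels))

# Example shaped like the Results table for this pipeline.
example = {
    "ner_chunk": ["3", "days", "Tuesday"],
    "ner_label": ["Number", "Calendar-Interval", "Day-Of-Week"],
}
pairs = chunks_with_labels(example)
```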
## Results ```bash +-------+-----------------+ |chunk |ner_label | +-------+-----------------+ |22:12C |Period | |3 |Number | |days |Calendar-Interval| |Tuesday|Day-Of-Week | |to |Between | |Friday |Day-Of-Week | +-------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_token_classifier_timex_semeval_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Community| |Language:|en| |Size:|439.5 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - RoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: English ElectraForQuestionAnswering model (from carlosserquen) author: John Snow Labs name: electra_qa_elctrafp date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electrafp` is an English model originally trained by `carlosserquen`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_elctrafp_en_4.0.0_3.0_1655921705697.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_elctrafp_en_4.0.0_3.0_1655921705697.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_elctrafp","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_elctrafp","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.electra.by_carlosserquen").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_elctrafp| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|51.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/carlosserquen/electrafp --- layout: model title: English DistilBertForQuestionAnswering Uncased model author: John Snow Labs name: distilbert_qa_base_uncased_distilled_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-distilled-squad` is an English model originally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_distilled_squad_en_4.0.0_3.0_1654723785022.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_distilled_squad_en_4.0.0_3.0_1654723785022.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_distilled_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_distilled_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_uploaded by huggingface").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
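The NLU one-liner above packs the question and its context into a single string separated by `|||`. A trivial helper makes that convention explicit (the helper name is illustrative; the separator is taken from the examples in these cards):

```python
def nlu_qa_input(question, context):
    """Join a question and its context with the '|||' separator used in the
    nlu.load(...).predict examples for question-answering models."""
    return f"{question}|||{context}"

query = nlu_qa_input("What is my name?", "My name is Clara and I live in Berkeley.")
```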
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_distilled_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/distilbert-base-uncased-distilled-squad --- layout: model title: Legal Survival Clause Binary Classifier author: John Snow Labs name: legclf_survival_clause date: 2022-12-18 tags: [en, legal, classification, licensed, clause, bert, survival, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the survival clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it’s better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that this model's embeddings allow up to 512 tokens. If your text is longer, consider splitting it into smaller pieces (you can also check the same tutorial link provided above).
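The multiline paragraph splitting suggested above can be as simple as breaking on blank lines. The sketch below is a naive illustration only: real token counts come from the model's subword tokenizer, so the whitespace word count here is just a rough guard against the 512-token limit.

```python
def split_paragraphs(text, max_words=512):
    """Split text on blank lines; further chunk any paragraph whose
    whitespace word count exceeds max_words (a rough proxy for the
    model's subword-token limit)."""
    pieces = []
    for para in text.split("\n\n"):
        words = para.split()
        if not words:
            continue  # skip empty paragraphs
        for i in range(0, len(words), max_words):
            pieces.append(" ".join(words[i:i + max_words]))
    return pieces

parts = split_paragraphs("Clause one text.\n\nClause two text.")
```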
This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `survival`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_survival_clause_en_1.0.0_3.0_1671393633271.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_survival_clause_en_1.0.0_3.0_1671393633271.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_survival_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
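When several of these binary clause classifiers are run over the same text, their outputs can be collected into one clause-to-presence map. A pure-Python sketch, assuming each classifier emits either its clause name or `other` (as in the Predicted Entities for these models):

```python
def clause_flags(predictions):
    """Map each clause classifier's predicted label to a boolean:
    True for anything other than the 'other' class."""
    return {clause: label != "other" for clause, label in predictions.items()}

# Hypothetical outputs from running two clause classifiers on one document.
flags = clause_flags({"survival": "survival", "terms": "other"})
```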
## Results ```bash +-------+ |result| +-------+ |[survival]| |[other]| |[other]| |[survival]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_survival_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.97 1.00 0.99 39 survival 1.00 0.97 0.99 35 accuracy - - 0.99 74 macro-avg 0.99 0.99 0.99 74 weighted-avg 0.99 0.99 0.99 74 ``` --- layout: model title: Legal Terms Clause Binary Classifier author: John Snow Labs name: legclf_terms_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `terms` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that this model's embeddings allow up to 512 tokens.
If you have more than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `terms` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_terms_clause_en_1.0.0_3.2_1660124074605.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_terms_clause_en_1.0.0_3.2_1660124074605.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_terms_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[terms]| |[other]| |[other]| |[terms]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_terms_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.3 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.92 0.95 0.93 577 terms 0.88 0.82 0.85 271 accuracy - - 0.91 848 macro-avg 0.90 0.88 0.89 848 weighted-avg 0.91 0.91 0.91 848 ``` --- layout: model title: English BertForQuestionAnswering Cased model (from motiondew) author: John Snow Labs name: bert_qa_set_date_2_lr_3e_5_bs_32_ep_3 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-set_date_2-lr-3e-5-bs-32-ep-3` is an English model originally trained by `motiondew`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_2_lr_3e_5_bs_32_ep_3_en_4.0.0_3.0_1657188515309.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_2_lr_3e_5_bs_32_ep_3_en_4.0.0_3.0_1657188515309.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_2_lr_3e_5_bs_32_ep_3","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_2_lr_3e_5_bs_32_ep_3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_set_date_2_lr_3e_5_bs_32_ep_3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/motiondew/bert-set_date_2-lr-3e-5-bs-32-ep-3 --- layout: model title: Fast Neural Machine Translation Model from Tagalog to English author: John Snow Labs name: opus_mt_tl_en date: 2020-12-29 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, tl, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `tl` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_tl_en_xx_2.7.0_2.4_1609254516008.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_tl_en_xx_2.7.0_2.4_1609254516008.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_tl_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_tl_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.tl.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_tl_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Portuguese BertForTokenClassification Cased model (from pucpr) author: John Snow Labs name: bert_token_classifier_clinicalnerpt_finding date: 2022-11-30 tags: [pt, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: pt edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `clinicalnerpt-finding` is a Portuguese model originally trained by `pucpr`. ## Predicted Entities `Finding` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_clinicalnerpt_finding_pt_4.2.4_3.0_1669822533848.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_clinicalnerpt_finding_pt_4.2.4_3.0_1669822533848.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_clinicalnerpt_finding","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_clinicalnerpt_finding","pt") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
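The token classifier emits one IOB tag per token (for this model, `B-Finding`, `I-Finding`, or `O`); downstream, a `NerConverter` stage would merge those tags into entity chunks. A self-contained sketch of that merge step, useful for sanity-checking raw tag output (the example tokens are illustrative):

```python
def merge_iob(tokens, tags):
    """Merge parallel token / IOB-tag lists into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # flush the previous chunk
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)  # continue the open chunk
        else:
            if current:  # O tag or mismatched I- tag closes the chunk
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

chunks = merge_iob(
    ["dor", "de", "cabeca", "leve"],
    ["B-Finding", "I-Finding", "I-Finding", "O"],
)
```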
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_clinicalnerpt_finding| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|pt| |Size:|665.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/pucpr/clinicalnerpt-finding - https://www.aclweb.org/anthology/2020.clinicalnlp-1.7/ - https://github.com/HAILab-PUCPR/SemClinBr - https://github.com/HAILab-PUCPR/BioBERTpt --- layout: model title: Indonesian Lemmatizer author: John Snow Labs name: lemma date: 2020-07-29 23:34:00 +0800 task: Lemmatization language: id edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [lemmatizer, id] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb#scrollTo=bbzEH9u7tdxR){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_id_2.5.5_2.4_1596054397023.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_id_2.5.5_2.4_1596054397023.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "id") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Selain menjadi raja utara, John Snow adalah seorang dokter Inggris dan pemimpin dalam pengembangan anestesi dan kebersihan medis.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "id") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Selain menjadi raja utara, John Snow adalah seorang dokter Inggris dan pemimpin dalam pengembangan anestesi dan kebersihan medis.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Selain menjadi raja utara, John Snow adalah seorang dokter Inggris dan pemimpin dalam pengembangan anestesi dan kebersihan medis."""] lemma_df = nlu.load('id.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
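For a quick look at the output, the lemma strings can be joined back into a sentence. A sketch assuming the lemmas have already been extracted from the annotation rows into a plain list (as in the Results shown for this model):

```python
def lemmas_to_text(lemmas):
    """Join lemma strings with spaces, attaching common punctuation
    to the preceding token instead of space-separating it."""
    punctuation = {",", ".", ";", ":", "!", "?"}
    out = ""
    for lemma in lemmas:
        if lemma in punctuation or not out:
            out += lemma
        else:
            out += " " + lemma
    return out

# Lemmas taken from the Results rows for this model.
text = lemmas_to_text(["selain", "menjadi", "raja", "utara", ","])
```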
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=5, result='selain', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=7, end=13, result='menjadi', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=15, end=18, result='raja', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=20, end=24, result='utara', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=25, end=25, result=',', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|id| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Fast Neural Machine Translation Model from Azerbaijani to English author: John Snow Labs name: opus_mt_az_en date: 2021-06-01 tags: [open_source, seq2seq, translation, az, en, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `az` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_az_en_xx_3.1.0_2.4_1622551101110.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_az_en_xx_3.1.0_2.4_1622551101110.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_az_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_az_en", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Azerbaijani.translate_to.English').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_az_en|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Fast Neural Machine Translation Model from West Germanic Languages to English
author: John Snow Labs
name: opus_mt_gmw_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, gmw, en, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

- source languages: `gmw`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_gmw_en_xx_2.7.0_2.4_1609168579667.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_gmw_en_xx_2.7.0_2.4_1609168579667.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_gmw_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_gmw_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.gmw.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_gmw_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English image_classifier_vit_base_patch32_384 ViTForImageClassification from google
author: John Snow Labs
name: image_classifier_vit_base_patch32_384
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_patch32_384` is an English model originally trained by google.
## Predicted Entities `turnstile`, `damselfly`, `mixing bowl`, `sea snake`, `cockroach, roach`, `buckle`, `beer glass`, `bulbul`, `lumbermill, sawmill`, `whippet`, `Australian terrier`, `television, television system`, `hoopskirt, crinoline`, `horse cart, horse-cart`, `guillotine`, `malamute, malemute, Alaskan malamute`, `coyote, prairie wolf, brush wolf, Canis latrans`, `colobus, colobus monkey`, `hognose snake, puff adder, sand viper`, `sock`, `burrito`, `printer`, `bathing cap, swimming cap`, `chiton, coat-of-mail shell, sea cradle, polyplacophore`, `Rottweiler`, `cello, violoncello`, `pitcher, ewer`, `computer keyboard, keypad`, `bow`, `peacock`, `ballplayer, baseball player`, `refrigerator, icebox`, `solar dish, solar collector, solar furnace`, `passenger car, coach, carriage`, `African chameleon, Chamaeleo chamaeleon`, `oboe, hautboy, hautbois`, `toyshop`, `Leonberg`, `howler monkey, howler`, `bluetick`, `African elephant, Loxodonta africana`, `American lobster, Northern lobster, Maine lobster, Homarus americanus`, `combination lock`, `black-and-tan coonhound`, `bonnet, poke bonnet`, `harvester, reaper`, `Appenzeller`, `iron, smoothing iron`, `electric locomotive`, `lycaenid, lycaenid butterfly`, `sandbar, sand bar`, `Cardigan, Cardigan Welsh corgi`, `pencil sharpener`, `jean, blue jean, denim`, `backpack, back pack, knapsack, packsack, rucksack, haversack`, `monitor`, `ice cream, icecream`, `apiary, bee house`, `water jug`, `American coot, marsh hen, mud hen, water hen, Fulica americana`, `ground beetle, carabid beetle`, `jigsaw puzzle`, `ant, emmet, pismire`, `wreck`, `kuvasz`, `gyromitra`, `Ibizan hound, Ibizan Podenco`, `brown bear, bruin, Ursus arctos`, `bolo tie, bolo, bola tie, bola`, `Pembroke, Pembroke Welsh corgi`, `French bulldog`, `prison, prison house`, `ballpoint, ballpoint pen, ballpen, Biro`, `stage`, `airliner`, `dogsled, dog sled, dog sleigh`, `redshank, Tringa totanus`, `menu`, `Indian cobra, Naja naja`, `swab, swob, mop`, `window screen`, 
`brain coral`, `artichoke, globe artichoke`, `loupe, jeweler's loupe`, `loudspeaker, speaker, speaker unit, loudspeaker system, speaker system`, `panpipe, pandean pipe, syrinx`, `wok`, `croquet ball`, `plate`, `scoreboard`, `Samoyed, Samoyede`, `ocarina, sweet potato`, `beaver`, `borzoi, Russian wolfhound`, `horizontal bar, high bar`, `stretcher`, `seat belt, seatbelt`, `obelisk`, `forklift`, `feather boa, boa`, `frying pan, frypan, skillet`, `barbershop`, `hamper`, `face powder`, `Siamese cat, Siamese`, `ladle`, `dingo, warrigal, warragal, Canis dingo`, `mountain tent`, `head cabbage`, `echidna, spiny anteater, anteater`, `Polaroid camera, Polaroid Land camera`, `dumbbell`, `espresso`, `notebook, notebook computer`, `Norfolk terrier`, `binoculars, field glasses, opera glasses`, `carpenter's kit, tool kit`, `moving van`, `catamaran`, `tiger beetle`, `bikini, two-piece`, `Siberian husky`, `studio couch, day bed`, `bulletproof vest`, `lawn mower, mower`, `promontory, headland, head, foreland`, `soap dispenser`, `vulture`, `dam, dike, dyke`, `brambling, Fringilla montifringilla`, `toilet tissue, toilet paper, bathroom tissue`, `ringlet, ringlet butterfly`, `tiger cat`, `mobile home, manufactured home`, `Norwich terrier`, `little blue heron, Egretta caerulea`, `English setter`, `Tibetan mastiff`, `rocking chair, rocker`, `mask`, `maze, labyrinth`, `bookcase`, `viaduct`, `sweatshirt`, `plow, plough`, `basenji`, `typewriter keyboard`, `Windsor tie`, `coral fungus`, `desktop computer`, `Kerry blue terrier`, `Angora, Angora rabbit`, `can opener, tin opener`, `shield, buckler`, `triumphal arch`, `horned viper, cerastes, sand viper, horned asp, Cerastes cornutus`, `miniature schnauzer`, `tape player`, `jaguar, panther, Panthera onca, Felis onca`, `hook, claw`, `file, file cabinet, filing cabinet`, `chime, bell, gong`, `shower curtain`, `window shade`, `acoustic guitar`, `gas pump, gasoline pump, petrol pump, island dispenser`, `cicada, cicala`, `Petri dish`, `paintbrush`, 
`banana`, `chickadee`, `mountain bike, all-terrain bike, off-roader`, `lighter, light, igniter, ignitor`, `oil filter`, `cab, hack, taxi, taxicab`, `Christmas stocking`, `rugby ball`, `black widow, Latrodectus mactans`, `bustard`, `fiddler crab`, `web site, website, internet site, site`, `chocolate sauce, chocolate syrup`, `chainlink fence`, `fireboat`, `cocktail shaker`, `airship, dirigible`, `projectile, missile`, `bagel, beigel`, `screwdriver`, `oystercatcher, oyster catcher`, `pot, flowerpot`, `water bottle`, `Loafer`, `drumstick`, `soccer ball`, `cairn, cairn terrier`, `padlock`, `tow truck, tow car, wrecker`, `bloodhound, sleuthhound`, `punching bag, punch bag, punching ball, punchball`, `great grey owl, great gray owl, Strix nebulosa`, `scale, weighing machine`, `trench coat`, `briard`, `cheetah, chetah, Acinonyx jubatus`, `entertainment center`, `Boston bull, Boston terrier`, `Arabian camel, dromedary, Camelus dromedarius`, `steam locomotive`, `coil, spiral, volute, whorl, helix`, `plane, carpenter's plane, woodworking plane`, `gondola`, `spider web, spider's web`, `bathtub, bathing tub, bath, tub`, `pelican`, `miniature poodle`, `cowboy boot`, `perfume, essence`, `lakeside, lakeshore`, `timber wolf, grey wolf, gray wolf, Canis lupus`, `moped`, `sunscreen, sunblock, sun blocker`, `Brabancon griffon`, `puffer, pufferfish, blowfish, globefish`, `lifeboat`, `pool table, billiard table, snooker table`, `Bouvier des Flandres, Bouviers des Flandres`, `Pomeranian`, `theater curtain, theatre curtain`, `marimba, xylophone`, `baboon`, `vacuum, vacuum cleaner`, `pill bottle`, `pick, plectrum, plectron`, `hen`, `American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier`, `digital watch`, `pier`, `oxygen mask`, `Tibetan terrier, chrysanthemum dog`, `ostrich, Struthio camelus`, `water ouzel, dipper`, `drilling platform, offshore rig`, `magnetic compass`, `throne`, `butternut squash`, `minibus`, `EntleBucher`, `carousel, carrousel, 
merry-go-round, roundabout, whirligig`, `hot pot, hotpot`, `rain barrel`, `wood rabbit, cottontail, cottontail rabbit`, `miniature pinscher`, `partridge`, `three-toed sloth, ai, Bradypus tridactylus`, `English springer, English springer spaniel`, `corkscrew, bottle screw`, `fur coat`, `robin, American robin, Turdus migratorius`, `dowitcher`, `ruddy turnstone, Arenaria interpres`, `water snake`, `stove`, `Great Pyrenees`, `soft-coated wheaten terrier`, `carbonara`, `snail`, `breastplate, aegis, egis`, `wolf spider, hunting spider`, `hatchet`, `CD player`, `axolotl, mud puppy, Ambystoma mexicanum`, `pomegranate`, `poncho`, `leatherback turtle, leatherback, leathery turtle, Dermochelys coriacea`, `lorikeet`, `spatula`, `jay`, `platypus, duckbill, duckbilled platypus, duck-billed platypus, Ornithorhynchus anatinus`, `stethoscope`, `flagpole, flagstaff`, `coho, cohoe, coho salmon, blue jack, silver salmon, Oncorhynchus kisutch`, `agama`, `red wolf, maned wolf, Canis rufus, Canis niger`, `beaker`, `eft`, `pretzel`, `brassiere, bra, bandeau`, `frilled lizard, Chlamydosaurus kingi`, `joystick`, `goldfish, Carassius auratus`, `fig`, `maypole`, `caldron, cauldron`, `admiral`, `impala, Aepyceros melampus`, `spotted salamander, Ambystoma maculatum`, `syringe`, `hog, pig, grunter, squealer, Sus scrofa`, `handkerchief, hankie, hanky, hankey`, `tarantula`, `cheeseburger`, `pinwheel`, `sax, saxophone`, `dung beetle`, `broccoli`, `cassette player`, `milk can`, `traffic light, traffic signal, stoplight`, `shovel`, `sarong`, `tabby, tabby cat`, `parallel bars, bars`, `ladybug, ladybeetle, lady beetle, ladybird, ladybird beetle`, `quill, quill pen`, `giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca`, `steel drum`, `quail`, `Blenheim spaniel`, `wig`, `hamster`, `ice lolly, lolly, lollipop, popsicle`, `seashore, coast, seacoast, sea-coast`, `chest`, `worm fence, snake fence, snake-rail fence, Virginia fence`, `missile`, `beer bottle`, `yellow lady's slipper, yellow 
lady-slipper, Cypripedium calceolus, Cypripedium parviflorum`, `breakwater, groin, groyne, mole, bulwark, seawall, jetty`, `white wolf, Arctic wolf, Canis lupus tundrarum`, `guacamole`, `porcupine, hedgehog`, `trolleybus, trolley coach, trackless trolley`, `greenhouse, nursery, glasshouse`, `trimaran`, `Italian greyhound`, `potter's wheel`, `jacamar`, `wallet, billfold, notecase, pocketbook`, `Lakeland terrier`, `green lizard, Lacerta viridis`, `indigo bunting, indigo finch, indigo bird, Passerina cyanea`, `green mamba`, `walking stick, walkingstick, stick insect`, `crossword puzzle, crossword`, `eggnog`, `barrow, garden cart, lawn cart, wheelbarrow`, `remote control, remote`, `bicycle-built-for-two, tandem bicycle, tandem`, `wool, woolen, woollen`, `black grouse`, `abaya`, `marmoset`, `golf ball`, `jeep, landrover`, `Mexican hairless`, `dishwasher, dish washer, dishwashing machine`, `jersey, T-shirt, tee shirt`, `planetarium`, `goose`, `mailbox, letter box`, `capuchin, ringtail, Cebus capucinus`, `marmot`, `orangutan, orang, orangutang, Pongo pygmaeus`, `coffeepot`, `ambulance`, `shopping basket`, `pop bottle, soda bottle`, `red fox, Vulpes vulpes`, `crash helmet`, `street sign`, `affenpinscher, monkey pinscher, monkey dog`, `Arctic fox, white fox, Alopex lagopus`, `sidewinder, horned rattlesnake, Crotalus cerastes`, `ruffed grouse, partridge, Bonasa umbellus`, `muzzle`, `measuring cup`, `canoe`, `reflex camera`, `fox squirrel, eastern fox squirrel, Sciurus niger`, `French loaf`, `killer whale, killer, orca, grampus, sea wolf, Orcinus orca`, `dial telephone, dial phone`, `thimble`, `bubble`, `vizsla, Hungarian pointer`, `running shoe`, `mailbag, postbag`, `radio telescope, radio reflector`, `piggy bank, penny bank`, `Chihuahua`, `chambered nautilus, pearly nautilus, nautilus`, `Airedale, Airedale terrier`, `kimono`, `green snake, grass snake`, `rubber eraser, rubber, pencil eraser`, `upright, upright piano`, `orange`, `revolver, six-gun, six-shooter`, `ashcan, 
trash can, garbage can, wastebin, ash bin, ash-bin, ashbin, dustbin, trash barrel, trash bin`, `drum, membranophone, tympan`, `Dungeness crab, Cancer magister`, `lipstick, lip rouge`, `gong, tam-tam`, `fountain`, `tub, vat`, `malinois`, `sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita`, `German short-haired pointer`, `apron`, `Irish setter, red setter`, `dishrag, dishcloth`, `school bus`, `candle, taper, wax light`, `bib`, `cash machine, cash dispenser, automated teller machine, automatic teller machine, automated teller, automatic teller, ATM`, `power drill`, `English foxhound`, `miniskirt, mini`, `swing`, `slug`, `hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa`, `rifle`, `Saluki, gazelle hound`, `Sealyham terrier, Sealyham`, `bullet train, bullet`, `hyena, hyaena`, `ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus`, `toy terrier`, `goblet`, `safe`, `cup`, `electric guitar`, `red wine`, `restaurant, eating house, eating place, eatery`, `wall clock`, `washbasin, handbasin, washbowl, lavabo, wash-hand basin`, `red-breasted merganser, Mergus serrator`, `crate`, `banded gecko`, `hippopotamus, hippo, river horse, Hippopotamus amphibius`, `tick`, `tripod`, `sombrero`, `desk`, `sea slug, nudibranch`, `racer, race car, racing car`, `pizza, pizza pie`, `dining table, board`, `Saint Bernard, St Bernard`, `komondor`, `electric ray, crampfish, numbfish, torpedo`, `prairie chicken, prairie grouse, prairie fowl`, `coffee mug`, `hammer`, `golfcart, golf cart`, `unicycle, monocycle`, `bison`, `soup bowl`, `rapeseed`, `golden retriever`, `plastic bag`, `grey fox, gray fox, Urocyon cinereoargenteus`, `water tower`, `house finch, linnet, Carpodacus mexicanus`, `barbell`, `hair slide`, `tiger, Panthera tigris`, `black-footed ferret, ferret, Mustela nigripes`, `meat loaf, meatloaf`, `hand blower, blow dryer, blow drier, hair dryer, hair drier`, `overskirt`, `gibbon, Hylobates lar`, `Gila monster, Heloderma suspectum`, `toucan`, 
`snowmobile`, `pencil box, pencil case`, `scuba diver`, `cloak`, `Sussex spaniel`, `otter`, `Greater Swiss Mountain dog`, `great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias`, `torch`, `magpie`, `tiger shark, Galeocerdo cuvieri`, `wing`, `Border collie`, `bell cote, bell cot`, `sea anemone, anemone`, `teapot`, `sea urchin`, `screen, CRT screen`, `bookshop, bookstore, bookstall`, `oscilloscope, scope, cathode-ray oscilloscope, CRO`, `crib, cot`, `police van, police wagon, paddy wagon, patrol wagon, wagon, black Maria`, `hartebeest`, `manhole cover`, `iPod`, `rock python, rock snake, Python sebae`, `nipple`, `suspension bridge`, `safety pin`, `sea lion`, `cougar, puma, catamount, mountain lion, painter, panther, Felis concolor`, `mantis, mantid`, `wardrobe, closet, press`, `projector`, `Granny Smith`, `diamondback, diamondback rattlesnake, Crotalus adamanteus`, `pirate, pirate ship`, `espresso maker`, `African hunting dog, hyena dog, Cape hunting dog, Lycaon pictus`, `cradle`, `common newt, Triturus vulgaris`, `tricycle, trike, velocipede`, `bobsled, bobsleigh, bob`, `thunder snake, worm snake, Carphophis amoenus`, `thresher, thrasher, threshing machine`, `banjo`, `armadillo`, `pajama, pyjama, pj's, jammies`, `ski`, `Maltese dog, Maltese terrier, Maltese`, `leafhopper`, `book jacket, dust cover, dust jacket, dust wrapper`, `silky terrier, Sydney silky`, `Shih-Tzu`, `wallaby, brush kangaroo`, `cardigan`, `sturgeon`, `freight car`, `home theater, home theatre`, `sundial`, `African crocodile, Nile crocodile, Crocodylus niloticus`, `odometer, hodometer, mileometer, milometer`, `sliding door`, `vine snake`, `West Highland white terrier`, `mongoose`, `hornbill`, `beagle`, `European gallinule, Porphyrio porphyrio`, `submarine, pigboat, sub, U-boat`, `Komodo dragon, Komodo lizard, dragon lizard, giant lizard, Varanus komodoensis`, `cock`, `pedestal, plinth, footstall`, `accordion, piano accordion, squeeze box`, `gown`, `lynx, catamount`, 
`guenon, guenon monkey`, `Walker hound, Walker foxhound`, `standard schnauzer`, `reel`, `hip, rose hip, rosehip`, `grasshopper, hopper`, `Dutch oven`, `stone wall`, `hard disc, hard disk, fixed disk`, `snow leopard, ounce, Panthera uncia`, `shopping cart`, `digital clock`, `hourglass`, `Border terrier`, `Old English sheepdog, bobtail`, `academic gown, academic robe, judge's robe`, `spiny lobster, langouste, rock lobster, crawfish, crayfish, sea crawfish`, `spotlight, spot`, `dome`, `barn spider, Araneus cavaticus`, `bee eater`, `basketball`, `cliff dwelling`, `folding chair`, `isopod`, `Doberman, Doberman pinscher`, `bittern`, `sunglasses, dark glasses, shades`, `picket fence, paling`, `Crock Pot`, `ibex, Capra ibex`, `neck brace`, `cardoon`, `cassette`, `amphibian, amphibious vehicle`, `minivan`, `analog clock`, `trailer truck, tractor trailer, trucking rig, rig, articulated lorry, semi`, `yurt`, `cliff, drop, drop-off`, `Bernese mountain dog`, `teddy, teddy bear`, `sloth bear, Melursus ursinus, Ursus ursinus`, `bassoon`, `toaster`, `ptarmigan`, `Gordon setter`, `night snake, Hypsiglena torquata`, `grand piano, grand`, `purse`, `clumber, clumber spaniel`, `shoji`, `hair spray`, `maillot`, `knee pad`, `space heater`, `bottlecap`, `chiffonier, commode`, `chain saw, chainsaw`, `sulphur butterfly, sulfur butterfly`, `pay-phone, pay-station`, `kelpie`, `mouse, computer mouse`, `car wheel`, `cornet, horn, trumpet, trump`, `container ship, containership, container vessel`, `matchstick`, `scabbard`, `American black bear, black bear, Ursus americanus, Euarctos americanus`, `langur`, `rock crab, Cancer irroratus`, `lionfish`, `speedboat`, `black stork, Ciconia nigra`, `knot`, `disk brake, disc brake`, `mosquito net`, `white stork, Ciconia ciconia`, `abacus`, `titi, titi monkey`, `grocery store, grocery, food market, market`, `waffle iron`, `pickelhaube`, `wooden spoon`, `Norwegian elkhound, elkhound`, `earthstar`, `sewing machine`, `balance beam, beam`, `potpie`, `chain 
mail, ring mail, mail, chain armor, chain armour, ring armor, ring armour`, `Staffordshire bullterrier, Staffordshire bull terrier`, `switch, electric switch, electrical switch`, `dhole, Cuon alpinus`, `paddle, boat paddle`, `limousine, limo`, `Shetland sheepdog, Shetland sheep dog, Shetland`, `space bar`, `library`, `paddlewheel, paddle wheel`, `alligator lizard`, `Band Aid`, `Persian cat`, `bull mastiff`, `tailed frog, bell toad, ribbed toad, tailed toad, Ascaphus trui`, `sports car, sport car`, `football helmet`, `laptop, laptop computer`, `lens cap, lens cover`, `tennis ball`, `violin, fiddle`, `lab coat, laboratory coat`, `cinema, movie theater, movie theatre, movie house, picture palace`, `weasel`, `bow tie, bow-tie, bowtie`, `macaw`, `dough`, `whiskey jug`, `microphone, mike`, `spoonbill`, `bassinet`, `mud turtle`, `velvet`, `warthog`, `plunger, plumber's helper`, `dugong, Dugong dugon`, `honeycomb`, `badger`, `dragonfly, darning needle, devil's darning needle, sewing needle, snake feeder, snake doctor, mosquito hawk, skeeter hawk`, `bee`, `doormat, welcome mat`, `fountain pen`, `giant schnauzer`, `assault rifle, assault gun`, `limpkin, Aramus pictus`, `siamang, Hylobates syndactylus, Symphalangus syndactylus`, `albatross, mollymawk`, `confectionery, confectionary, candy store`, `harp`, `parachute, chute`, `barrel, cask`, `tank, army tank, armored combat vehicle, armoured combat vehicle`, `collie`, `kite`, `puck, hockey puck`, `stupa, tope`, `buckeye, horse chestnut, conker`, `patio, terrace`, `broom`, `Dandie Dinmont, Dandie Dinmont terrier`, `scorpion`, `agaric`, `balloon`, `bucket, pail`, `squirrel monkey, Saimiri sciureus`, `Eskimo dog, husky`, `zebra`, `garter snake, grass snake`, `indri, indris, Indri indri, Indri brevicaudatus`, `tractor`, `guinea pig, Cavia cobaya`, `maraca`, `red-backed sandpiper, dunlin, Erolia alpina`, `bullfrog, Rana catesbeiana`, `trilobite`, `Japanese spaniel`, `gorilla, Gorilla gorilla`, `monastery`, `centipede`, `terrapin`, 
`llama`, `long-horned beetle, longicorn, longicorn beetle`, `boxer`, `curly-coated retriever`, `mortar`, `hammerhead, hammerhead shark`, `goldfinch, Carduelis carduelis`, `garden spider, Aranea diademata`, `stopwatch, stop watch`, `grey whale, gray whale, devilfish, Eschrichtius gibbosus, Eschrichtius robustus`, `leaf beetle, chrysomelid`, `birdhouse`, `king crab, Alaska crab, Alaskan king crab, Alaska king crab, Paralithodes camtschatica`, `stole`, `bell pepper`, `radiator`, `flatworm, platyhelminth`, `mushroom`, `Scotch terrier, Scottish terrier, Scottie`, `liner, ocean liner`, `toilet seat`, `lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens`, `zucchini, courgette`, `harvestman, daddy longlegs, Phalangium opilio`, `Newfoundland, Newfoundland dog`, `flamingo`, `whiptail, whiptail lizard`, `geyser`, `cleaver, meat cleaver, chopper`, `sea cucumber, holothurian`, `American egret, great white heron, Egretta albus`, `parking meter`, `beacon, lighthouse, beacon light, pharos`, `coucal`, `motor scooter, scooter`, `mitten`, `cannon`, `weevil`, `megalith, megalithic structure`, `stinkhorn, carrion fungus`, `ear, spike, capitulum`, `box turtle, box tortoise`, `snowplow, snowplough`, `tench, Tinca tinca`, `modem`, `tobacco shop, tobacconist shop, tobacconist`, `barn`, `skunk, polecat, wood pussy`, `African grey, African gray, Psittacus erithacus`, `Madagascar cat, ring-tailed lemur, Lemur catta`, `holster`, `barometer`, `sleeping bag`, `washer, automatic washer, washing machine`, `recreational vehicle, RV, R.V.`, `drake`, `tray`, `butcher shop, meat market`, `china cabinet, china closet`, `medicine chest, medicine cabinet`, `photocopier`, `Yorkshire terrier`, `starfish, sea star`, `racket, racquet`, `park bench`, `Labrador retriever`, `whistle`, `clog, geta, patten, sabot`, `volcano`, `quilt, comforter, comfort, puff`, `leopard, Panthera pardus`, `cauliflower`, `swimming trunks, bathing trunks`, `American chameleon, anole, Anolis carolinensis`, `alp`, 
`mortarboard`, `barracouta, snoek`, `cocker spaniel, English cocker spaniel, cocker`, `space shuttle`, `beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon`, `harmonica, mouth organ, harp, mouth harp`, `gasmask, respirator, gas helmet`, `wombat`, `Model T`, `wild boar, boar, Sus scrofa`, `hermit crab`, `flat-coated retriever`, `rotisserie`, `jinrikisha, ricksha, rickshaw`, `trifle`, `bannister, banister, balustrade, balusters, handrail`, `go-kart`, `bakery, bakeshop, bakehouse`, `ski mask`, `dock, dockage, docking facility`, `Egyptian cat`, `oxcart`, `redbone`, `shoe shop, shoe-shop, shoe store`, `convertible`, `ox`, `crayfish, crawfish, crawdad, crawdaddy`, `cowboy hat, ten-gallon hat`, `conch`, `spaghetti squash`, `toy poodle`, `saltshaker, salt shaker`, `microwave, microwave oven`, `triceratops`, `necklace`, `castle`, `streetcar, tram, tramcar, trolley, trolley car`, `eel`, `diaper, nappy, napkin`, `standard poodle`, `prayer rug, prayer mat`, `radio, wireless`, `crane`, `envelope`, `rule, ruler`, `gar, garfish, garpike, billfish, Lepisosteus osseus`, `spider monkey, Ateles geoffroyi`, `Irish wolfhound`, `German shepherd, German shepherd dog, German police dog, alsatian`, `umbrella`, `sunglass`, `aircraft carrier, carrier, flattop, attack aircraft carrier`, `water buffalo, water ox, Asiatic buffalo, Bubalus bubalis`, `jellyfish`, `groom, bridegroom`, `tree frog, tree-frog`, `steel arch bridge`, `lemon`, `pickup, pickup truck`, `vault`, `groenendael`, `baseball`, `junco, snowbird`, `maillot, tank suit`, `gazelle`, `jack-o'-lantern`, `military uniform`, `slide rule, slipstick`, `wire-haired fox terrier`, `acorn squash`, `electric fan, blower`, `Brittany spaniel`, `chimpanzee, chimp, Pan troglodytes`, `pillow`, `binder, ring-binder`, `schipperke`, `Afghan hound, Afghan`, `plate rack`, `car mirror`, `hand-held computer, hand-held microcomputer`, `papillon`, `schooner`, `Bedlington terrier`, `cellular telephone, cellular phone, 
cellphone, cell, mobile phone`, `altar`, `Chesapeake Bay retriever`, `cabbage butterfly`, `polecat, fitch, foulmart, foumart, Mustela putorius`, `comic book`, `French horn, horn`, `daisy`, `organ, pipe organ`, `mashed potato`, `acorn`, `fly`, `chain`, `American alligator, Alligator mississipiensis`, `mink`, `garbage truck, dustcart`, `totem pole`, `wine bottle`, `strawberry`, `cricket`, `European fire salamander, Salamandra salamandra`, `coral reef`, `Welsh springer spaniel`, `bighorn, bighorn sheep, cimarron, Rocky Mountain bighorn, Rocky Mountain sheep, Ovis canadensis`, `snorkel`, `bald eagle, American eagle, Haliaeetus leucocephalus`, `meerkat, mierkat`, `grille, radiator grille`, `nematode, nematode worm, roundworm`, `anemone fish`, `corn`, `loggerhead, loggerhead turtle, Caretta caretta`, `palace`, `suit, suit of clothes`, `pineapple, ananas`, `macaque`, `ping-pong ball`, `ram, tup`, `church, church building`, `koala, koala bear, kangaroo bear, native bear, Phascolarctos cinereus`, `hare`, `bath towel`, `strainer`, `yawl`, `otterhound, otter hound`, `table lamp`, `king snake, kingsnake`, `lotion`, `lion, king of beasts, Panthera leo`, `thatch, thatched roof`, `basset, basset hound`, `black and gold garden spider, Argiope aurantia`, `barber chair`, `proboscis monkey, Nasalis larvatus`, `consomme`, `Irish terrier`, `Irish water spaniel`, `common iguana, iguana, Iguana iguana`, `Weimaraner`, `Great Dane`, `pug, pug-dog`, `rhinoceros beetle`, `vase`, `brass, memorial tablet, plaque`, `kit fox, Vulpes macrotis`, `king penguin, Aptenodytes patagonica`, `vending machine`, `dalmatian, coach dog, carriage dog`, `rock beauty, Holocanthus tricolor`, `pole`, `cuirass`, `bolete`, `jackfruit, jak, jack`, `monarch, monarch butterfly, milkweed butterfly, Danaus plexippus`, `chow, chow chow`, `nail`, `packet`, `half track`, `Lhasa, Lhasa apso`, `boathouse`, `hay`, `valley, vale`, `slot, one-armed bandit`, `volleyball`, `carton`, `shower cap`, `tile roof`, `lacewing, lacewing 
fly`, `patas, hussar monkey, Erythrocebus patas`, `boa constrictor, Constrictor constrictor`, `black swan, Cygnus atratus`, `lampshade, lamp shade`, `mousetrap`, `crutch`, `vestment`, `Pekinese, Pekingese, Peke`, `tusker`, `warplane, military plane`, `sandal`, `screw`, `custard apple`, `Scottish deerhound, deerhound`, `spindle`, `keeshond`, `hummingbird`, `letter opener, paper knife, paperknife`, `cucumber, cuke`, `bearskin, busby, shako`, `fire engine, fire truck`, `trombone`, `ringneck snake, ring-necked snake, ring snake`, `sorrel`, `fire screen, fireguard`, `paper towel`, `flute, transverse flute`, `hotdog, hot dog, red hot`, `Indian elephant, Elephas maximus`, `mosque`, `stingray`, `Rhodesian ridgeback`, `four-poster` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch32_384_en_4.1.0_3.0_1660166215289.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch32_384_en_4.1.0_3.0_1660166215289.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
image_assembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

imageClassifier = ViTForImageClassification \
    .pretrained("image_classifier_vit_base_patch32_384", "en") \
    .setInputCols("image_assembler") \
    .setOutputCol("class")

pipeline = Pipeline(stages=[
    image_assembler,
    imageClassifier,
])

pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
val imageAssembler = new ImageAssembler()
    .setInputCol("image")
    .setOutputCol("image_assembler")

val imageClassifier = ViTForImageClassification
    .pretrained("image_classifier_vit_base_patch32_384", "en")
    .setInputCols("image_assembler")
    .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))
val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
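Each prediction in the `class` column is one of the ImageNet label strings listed above, and many of those strings bundle several comma-separated synonyms (e.g. `cockroach, roach`). If only the first synonym is wanted downstream, a small post-processing helper (hypothetical, not part of Spark NLP) suffices:

```python
# Hypothetical post-processing helper: ImageNet labels often pack several
# synonyms into one comma-separated string; keep only the first one.
def primary_label(label: str) -> str:
    return label.split(", ")[0]

predictions = ["cockroach, roach", "beer glass", "tow truck, tow car, wrecker"]
print([primary_label(p) for p in predictions])  # ['cockroach', 'beer glass', 'tow truck']
```

In a Spark pipeline this would typically be wrapped in a UDF and applied to the `class.result` column.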
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_base_patch32_384|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|331.3 MB|

---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from saburbutt)
author: John Snow Labs
name: roberta_qa_base_tweet_model
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta_base_tweetqa_model` is an English model originally trained by `saburbutt`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_tweet_model_en_4.3.0_3.0_1674223144072.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_tweet_model_en_4.3.0_3.0_1674223144072.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_tweet_model","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_tweet_model","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
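Batch scoring follows the same shape: every input row pairs one question with one context passage, and the MultiDocumentAssembler reads them from the `question` and `context` columns. A plain-Python sketch (no Spark needed) of assembling such rows before handing them to `spark.createDataFrame(rows).toDF("question", "context")`; the second question/context pair is made up for illustration:

```python
# Each row is a (question, context) pair; the second pair below is a
# hypothetical example, reusing the card's sample context.
questions = ["What's my name?", "Where do I live?"]
contexts = ["My name is Clara and I live in Berkeley."] * 2
rows = list(zip(questions, contexts))
print(len(rows))  # 2
```

Keeping questions and contexts column-aligned like this matters because the model answers each question only from the context in the same row.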
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_tweet_model| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|432.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/saburbutt/roberta_base_tweetqa_model --- layout: model title: Fast Neural Machine Translation Model from English to Germanic Languages author: John Snow Labs name: opus_mt_en_gem date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, gem, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `gem` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_gem_xx_2.7.0_2.4_1609164515709.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_gem_xx_2.7.0_2.4_1609164515709.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_gem", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentences to translate.") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_gem", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentences to translate.").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.gem').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_gem| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering Cased model (from AmazonScience) author: John Snow Labs name: roberta_qa_nlu date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `qanlu` is an English model originally trained by `AmazonScience`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_nlu_en_4.2.4_3.0_1669985427449.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_nlu_en_4.2.4_3.0_1669985427449.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_nlu","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_nlu","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_nlu| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AmazonScience/qanlu - https://assets.amazon.science/33/ea/800419b24a09876601d8ab99bfb9/language-model-is-all-you-need-natural-language-understanding-as-question-answering.pdf - https://github.com/amazon-research/question-answering-nlu --- layout: model title: Extract conditions and benefits from drug reviews author: John Snow Labs name: ner_supplement_clinical date: 2022-02-01 tags: [licensed, ner, en, clinical] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts conditions and the reported benefits of using drugs for those conditions from drug reviews. ## Predicted Entities `CONDITION`, `BENEFIT` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_supplement_clinical_en_3.3.4_3.0_1643674915917.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_supplement_clinical_en_3.3.4_3.0_1643674915917.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical', 'en', 'clinical/models') \ .setInputCols(['sentence', 'token']) \ .setOutputCol('embeddings') ner = MedicalNerModel.pretrained('ner_supplement_clinical', 'en', 'clinical/models') \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner_tags") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner_tags"]) \ .setOutputCol("ner_chunk") ner_pipeline = Pipeline( stages = [ documentAssembler, sentenceDetector, tokenizer, embeddings, ner, ner_converter ]) sample_df = spark.createDataFrame([["Excellent!. The state of health improves, nervousness disappears, and night sleep improves. It also promotes hair and nail growth. I recommend :)"]]).toDF("text") result = ner_pipeline.fit(sample_df).transform(sample_df) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_supplement_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner_tags")) .setOutputCol("ner_chunk") val ner_pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, embeddings, ner, ner_converter)) val sample_df = Seq("""Excellent!. The state of health improves, nervousness disappears, and night sleep improves. It also promotes hair and nail growth. I recommend :)""").toDS.toDF("text") val result = ner_pipeline.fit(sample_df).transform(sample_df) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.supplement_clinical").predict("""Excellent!. The state of health improves, nervousness disappears, and night sleep improves. It also promotes hair and nail growth. I recommend :)""") ```
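The `NerConverter` stage in the pipelines above is what turns the model's token-level BIO tags (`ner_tags`) into labeled chunks (`ner_chunk`). A minimal plain-Python sketch of that merging logic, using toy tags rather than real model output:

```python
# Merge BIO tags into (label, chunk_text) pairs: a "B-" tag opens a chunk,
# matching "I-" tags extend it, anything else closes it.
def bio_to_chunks(tokens, tags):
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)
        else:
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(words)) for label, words in chunks]

tokens = ["night", "sleep", "improves", ",", "nervousness", "disappears"]
tags   = ["B-BENEFIT", "I-BENEFIT", "I-BENEFIT", "O", "B-CONDITION", "O"]
chunks = bio_to_chunks(tokens, tags)
```

This mirrors the `chunk`/`ner_label` pairs shown in the Results section of this card.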
## Results ```bash +------------------------+---------------+ | chunk | ner_label | +------------------------+---------------+ | nervousness | CONDITION | | night sleep improves | BENEFIT | | hair | BENEFIT | | nail | BENEFIT | +------------------------+---------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_supplement_clinical| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|14.5 MB| ## References Trained on healthsea dataset: https://github.com/explosion/healthsea/tree/main/project/assets/ner ## Benchmarking ```bash label tp fp fn prec rec f1 B-BENEFIT 268 39 42 0.87296414 0.86451614 0.86871964 I-CONDITION 178 29 72 0.8599034 0.712 0.7789934 I-BENEFIT 52 14 32 0.7878788 0.61904764 0.6933334 B-CONDITION 365 78 61 0.82392776 0.85680753 0.840046 Macro-average 863 160 207 0.8361685 0.7630928 0.7979612 Micro-average 863 160 207 0.84359723 0.80654204 0.8246535 ``` --- layout: model title: Legal Exclusive License Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_exclusive_license_agreement_bert date: 2022-11-24 tags: [en, legal, classification, agreement, exclusive_license, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_exclusive_license_agreement_bert` model is a Bert Sentence Embeddings Document Classifier that classifies whether a document belongs to the class `exclusive-license-agreement` or not (binary classification). Unlike the Longformer model, this model is lighter and faster at inference.
## Predicted Entities `exclusive-license-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_exclusive_license_agreement_bert_en_1.0.0_3.0_1669312972689.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_exclusive_license_agreement_bert_en_1.0.0_3.0_1669312972689.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_exclusive_license_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[exclusive-license-agreement]| |[other]| |[other]| |[exclusive-license-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_exclusive_license_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support exclusive-license-agreement 0.98 0.95 0.96 43 other 0.98 0.99 0.98 82 accuracy - - 0.98 125 macro-avg 0.98 0.97 0.97 125 weighted-avg 0.98 0.98 0.98 125 ``` --- layout: model title: English DistilBertForQuestionAnswering model (from Shashidhar) author: John Snow Labs name: distilbert_qa_Shashidhar_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Shashidhar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Shashidhar_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724473194.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Shashidhar_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724473194.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Shashidhar_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Shashidhar_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Shashidhar").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_Shashidhar_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Shashidhar/distilbert-base-uncased-finetuned-squad --- layout: model title: Twi asr_wav2vec2large_xlsr_akan TFWav2Vec2ForCTC from azunre author: John Snow Labs name: pipeline_asr_wav2vec2large_xlsr_akan date: 2022-09-24 tags: [wav2vec2, tw, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: tw edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2large_xlsr_akan` is a Twi model originally trained by azunre. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2large_xlsr_akan_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2large_xlsr_akan_tw_4.2.0_3.0_1664022300448.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2large_xlsr_akan_tw_4.2.0_3.0_1664022300448.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2large_xlsr_akan', lang = 'tw') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2large_xlsr_akan", lang = "tw") val annotations = pipeline.transform(audioDF) ```
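The `audioDF` passed to `transform` above must contain the raw audio samples as an array of floats per row (Wav2Vec2 models expect 16 kHz mono input). A minimal standard-library sketch of decoding 16-bit PCM WAV bytes into normalized floats — the helper name and the in-memory test tone are illustrative, not part of the Spark NLP API:

```python
import io
import math
import struct
import wave

# Decode 16-bit PCM WAV bytes into floats in [-1, 1], the form of data the
# pipeline's AudioAssembler stage expects in its input column.
def wav_bytes_to_floats(data: bytes):
    with wave.open(io.BytesIO(data)) as w:
        assert w.getsampwidth() == 2, "expects 16-bit PCM"
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a 10 ms, 440 Hz mono test tone at 16 kHz in memory to show the round trip.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    pcm = [int(32767 * math.sin(2 * math.pi * 440 * t / 16000)) for t in range(160)]
    w.writeframes(struct.pack("<160h", *pcm))

floats = wav_bytes_to_floats(buf.getvalue())
# The float list can then back a one-column DataFrame, e.g.
# audioDF = spark.createDataFrame([(floats,)], ["audio_content"])
```

In practice you would read real recordings (e.g. with `librosa` or `soundfile`) and resample them to 16 kHz before building the DataFrame.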
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2large_xlsr_akan| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|tw| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Turkish BertForTokenClassification Cased model (from yanekyuk) author: John Snow Labs name: bert_token_classifier_berturk_cased_keyword_discriminator date: 2022-11-30 tags: [tr, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: tr edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `berturk-cased-keyword-discriminator` is a Turkish model originally trained by `yanekyuk`. ## Predicted Entities `CON`, `ENT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_berturk_cased_keyword_discriminator_tr_4.2.4_3.0_1669815494903.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_berturk_cased_keyword_discriminator_tr_4.2.4_3.0_1669815494903.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_berturk_cased_keyword_discriminator","tr") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_berturk_cased_keyword_discriminator","tr") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_berturk_cased_keyword_discriminator| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|tr| |Size:|412.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/yanekyuk/berturk-cased-keyword-discriminator --- layout: model title: English asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_3percent TFWav2Vec2ForCTC from creynier author: John Snow Labs name: asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_3percent date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_3percent` is an English model originally trained by creynier. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_3percent_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_3percent_en_4.2.0_3.0_1664103960597.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_3percent_en_4.2.0_3.0_1664103960597.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_3percent", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_3percent", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_swbd_turn_eos_long_short_utt_removed_3percent| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|349.4 MB| --- layout: model title: Italian BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_it_cased date: 2022-12-02 tags: [it, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: it edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-it-cased` is an Italian model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_it_cased_it_4.2.4_3.0_1670017874211.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_it_cased_it_4.2.4_3.0_1670017874211.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_it_cased","it") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_it_cased","it") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
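Once extracted, the token vectors in the `embeddings` column are typically compared with cosine similarity (for example, to find semantically related words). A small plain-Python sketch with toy 4-dimensional vectors standing in for the real 768-dimensional BERT outputs:

```python
import math

# Cosine similarity: close to 1 for near-parallel vectors, near 0 for
# unrelated ones.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy embeddings: "gatto" should sit close to "cat", "auto" far from both.
v_cat   = [1.0, 0.9, 0.1, 0.0]
v_gatto = [0.9, 1.0, 0.2, 0.1]
v_auto  = [0.0, 0.1, 1.0, 0.9]
sim_close = cosine(v_cat, v_gatto)
sim_far   = cosine(v_cat, v_auto)
```

The same comparison applies unchanged to the real embedding arrays returned in `result`.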
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_it_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|it| |Size:|396.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-it-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Legal Multilabel Classifier on Online Terms of Service author: John Snow Labs name: legmulticlf_online_terms_of_service_english date: 2023-04-26 tags: [en, licensed, multilabel, classification, legal, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: MultiClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a multi-label text classification model that identifies potentially unfair clauses in online Terms of Service. It predicts the following classes: - Arbitration - Choice_of_law - Content_removal - Jurisdiction - Limitation_of_liability - Other - Unilateral_change - Unilateral_termination ## Predicted Entities `Arbitration`, `Choice_of_law`, `Content_removal`, `Jurisdiction`, `Limitation_of_liability`, `Other`, `Unilateral_change`, `Unilateral_termination` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legmulticlf_online_terms_of_service_english_en_1.0.0_3.0_1682519205970.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legmulticlf_online_terms_of_service_english_en_1.0.0_3.0_1682519205970.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol('text')\ .setOutputCol('document') tokenizer = nlp.Tokenizer() \ .setInputCols(['document'])\ .setOutputCol('token') embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \ .setInputCols(['document', 'token'])\ .setOutputCol("embeddings") embeddingsSentence = nlp.SentenceEmbeddings() \ .setInputCols(['document', 'embeddings'])\ .setOutputCol('sentence_embeddings')\ .setPoolingStrategy('AVERAGE') classifierdl = nlp.MultiClassifierDLModel.pretrained('legmulticlf_online_terms_of_service_english', 'en', 'legal/models')\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("class") clf_pipeline = nlp.Pipeline(stages=[document_assembler, tokenizer, embeddings, embeddingsSentence, classifierdl]) df = spark.createDataFrame([["We are not responsible or liable for (and have no obligation to verify) any wrong or misspelled email address or inaccurate or wrong (mobile) phone number or credit card number."]]).toDF("text") model = clf_pipeline.fit(df) result = model.transform(df) result.select("text", "class.result").show(truncate=False) ```
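Because this is a multi-label model, each document can receive several labels at once, and the Benchmarking section of this card therefore reports a samples average alongside micro and macro averages: F1 is computed per document over its label set, then averaged across documents. A toy plain-Python sketch with hypothetical gold/predicted label sets (not this model's actual output):

```python
# Per-document F1 over label sets, as used for the "samples-avg" row.
def f1(true_set, pred_set):
    if not true_set and not pred_set:
        return 1.0  # both empty: perfect agreement
    tp = len(true_set & pred_set)
    if tp == 0:
        return 0.0
    prec = tp / len(pred_set)
    rec = tp / len(true_set)
    return 2 * prec * rec / (prec + rec)

# Hypothetical gold and predicted label sets for two clauses.
gold = [{"Limitation_of_liability"}, {"Jurisdiction", "Choice_of_law"}]
pred = [{"Limitation_of_liability"}, {"Jurisdiction"}]
samples_f1 = sum(f1(t, p) for t, p in zip(gold, pred)) / len(gold)
```

Here the first clause is fully correct (F1 = 1.0) and the second misses one of two gold labels (F1 ≈ 0.67), giving a samples-average of about 0.83.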
## Results ```bash +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+ |sentence |result | +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+ |We are not responsible or liable for (and have no obligation to verify) any wrong or misspelled email address or inaccurate or wrong (mobile) phone number or credit card number.|[Limitation_of_liability]| +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legmulticlf_online_terms_of_service_english| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|13.9 MB| ## References Train dataset available [here](https://huggingface.co/datasets/joelito/online_terms_of_service) ## Benchmarking ```bash label precision recall f1-score support Arbitration 1.00 0.50 0.67 4 Choice_of_law 0.67 0.67 0.67 3 Content_removal 1.00 0.67 0.80 3 Jurisdiction 0.80 1.00 0.89 4 Limitation_of_liability 0.73 0.73 0.73 15 Other 0.86 0.89 0.88 28 Unilateral_change 0.86 1.00 0.92 6 Unilateral_termination 1.00 0.80 0.89 5 micro-avg 0.84 0.82 0.83 68 macro-avg 0.86 0.78 0.81 68 weighted-avg 0.85 0.82 0.83 68 samples-avg 0.80 0.82 0.81 68 ``` --- layout: model title: English RobertaForQuestionAnswering (from 123tarunanand) author: John Snow Labs name: roberta_qa_roberta_base_finetuned date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark 
NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned` is an English model originally trained by `123tarunanand`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_en_4.0.0_3.0_1655733687732.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_finetuned_en_4.0.0_3.0_1655733687732.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_finetuned","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_finetuned","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.base.by_123tarunanand").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_finetuned| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/123tarunanand/roberta-base-finetuned --- layout: model title: Fast Neural Machine Translation Model from English to Tok Pisin author: John Snow Labs name: opus_mt_en_tpi date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, tpi, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `en` - target languages: `tpi` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_tpi_xx_2.7.0_2.4_1609163638663.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_tpi_xx_2.7.0_2.4_1609163638663.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_tpi", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_tpi", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.tpi').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_tpi| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Management Document Classifier (EURLEX) author: John Snow Labs name: legclf_management_bert date: 2023-03-06 tags: [en, legal, classification, clauses, management, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_management_bert model, it is a Bert Sentence Embeddings Document Classifier, classifies if the document belongs to the class Management or not (Binary Classification) according to EuroVoc labels. ## Predicted Entities `Management`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_management_bert_en_1.0.0_3.0_1678111568847.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_management_bert_en_1.0.0_3.0_1678111568847.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_management_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +------------+ |result | +------------+ |[Management]| |[Other] | |[Other] | |[Management]| +------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_management_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Management 0.94 0.88 0.91 117 Other 0.85 0.92 0.88 87 accuracy - - 0.90 204 macro-avg 0.89 0.90 0.90 204 weighted-avg 0.90 0.90 0.90 204 ``` --- layout: model title: Pipeline to Detect Living Species (bert_base_cased) author: John Snow Labs name: ner_living_species_bert_pipeline date: 2023-03-13 tags: [es, ner, clinical, licensed, bert] task: Named Entity Recognition language: es edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_living_species_bert](https://nlp.johnsnowlabs.com/2022/06/22/ner_living_species_bert_es_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_pipeline_es_4.3.0_3.2_1678730831600.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_pipeline_es_4.3.0_3.2_1678730831600.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_living_species_bert_pipeline", "es", "clinical/models") text = '''Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_living_species_bert_pipeline", "es", "clinical/models") val text = "Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. 
Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual." val result = pipeline.fullAnnotate(text) ```
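Spark NLP annotations report inclusive character offsets (`begin`, `end`), as shown in the results table for this pipeline. A minimal plain-Python sketch (no Spark required) of how a chunk's text can be recovered from those offsets; the helper name `chunk_text` is illustrative:

```python
# Spark NLP annotations carry inclusive character offsets (begin, end).
# Recover a chunk's text from the offsets reported in the results table.
text = "Lactante varón de dos años. Antecedentes familiares sin interés."

def chunk_text(source, begin, end):
    # `end` is inclusive in Spark NLP, so slice one past it.
    return source[begin:end + 1]

print(chunk_text(text, 0, 13))   # Lactante varón
print(chunk_text(text, 41, 50))  # familiares
```

Note the `end + 1` in the slice: Python slices are end-exclusive, while Spark NLP offsets are end-inclusive.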
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:---------------|--------:|------:|:------------|-------------:| | 0 | Lactante varón | 0 | 13 | HUMAN | 0.98915 | | 1 | familiares | 41 | 50 | HUMAN | 1 | | 2 | personales | 78 | 87 | HUMAN | 1 | | 3 | neonatal | 116 | 123 | HUMAN | 0.9921 | | 4 | legumbres | 162 | 170 | SPECIES | 0.9995 | | 5 | lentejas | 243 | 250 | SPECIES | 1 | | 6 | garbanzos | 254 | 262 | SPECIES | 1 | | 7 | legumbres | 290 | 298 | SPECIES | 0.9993 | | 8 | madre | 334 | 338 | HUMAN | 1 | | 9 | Cacahuete | 616 | 624 | SPECIES | 1 | | 10 | padres | 728 | 733 | HUMAN | 1 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_bert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|es| |Size:|429.3 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Legal Management Agreement Document Binary Classifier (Longformer) author: John Snow Labs name: legclf_management_agreement date: 2022-12-18 tags: [en, legal, classification, licensed, document, longformer, management, agreement, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_management_agreement` model is a Longformer Document Classifier used to classify if the document belongs to the class `management-agreement` or not (Binary Classification). Longformers have a restriction on 4096 tokens, so only the first 4096 tokens will be taken into account. 
In our experience, for the vast majority of documents in legal corpora, if they are clean and contain only the legal document itself without any extra leading material, 4096 tokens are enough to perform Document Classification. If your document exceeds 4096 tokens, you can split it into chunks of 4096 tokens, average their embeddings, and train with the averaged version, so that the whole document is taken into account. ## Predicted Entities `management-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_management_agreement_en_1.0.0_3.0_1671393667813.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_management_agreement_en_1.0.0_3.0_1671393667813.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_management_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
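The chunk-and-average strategy described above for documents longer than 4096 tokens can be sketched in plain Python. The `embed` function below is a toy stand-in for a real encoder such as the Longformer used by this model; only the windowing and averaging logic are the point:

```python
# Sketch of the chunk-and-average strategy: split a token sequence into
# windows of at most max_len tokens, embed each window, and average the
# window embeddings into a single document vector.
MAX_LEN = 4096

def windows(tokens, max_len=MAX_LEN):
    # Non-overlapping windows of at most max_len tokens each.
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def embed(window):
    # Toy 2-d "embedding": (mean token length, window size). A real
    # pipeline would return the encoder's pooled embedding instead.
    return (sum(len(t) for t in window) / len(window), float(len(window)))

def document_vector(tokens):
    # Element-wise mean of the per-window embeddings.
    vecs = [embed(w) for w in windows(tokens)]
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

tokens = ["token"] * 10000            # a document longer than one window
print(len(windows(tokens)))           # 3 windows: 4096 + 4096 + 1808
vec = document_vector(tokens)
```

This way every token contributes to the final document vector instead of being truncated at the 4096-token boundary.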
## Results ```bash +----------------------+ |result | +----------------------+ |[management-agreement]| |[other] | |[other] | |[management-agreement]| +----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_management_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support management-agreement 0.95 0.93 0.94 115 other 0.96 0.97 0.96 184 accuracy - - 0.95 299 macro-avg 0.95 0.95 0.95 299 weighted-avg 0.95 0.95 0.95 299 ``` --- layout: model title: Legal General Provisions Clause Binary Classifier author: John Snow Labs name: legclf_general_provisions_clause date: 2022-12-18 tags: [en, legal, classification, licensed, clause, bert, general, provisions, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the general-provisions clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc.
Take into consideration the embeddings of this model allows up to 512 tokens. If you have more than that, consider splitting in smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause model you have added. ## Predicted Entities `general-provisions`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_general_provisions_clause_en_1.0.0_3.0_1671393654655.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_general_provisions_clause_en_1.0.0_3.0_1671393654655.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_general_provisions_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
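The paragraph-splitting technique mentioned in the description ("paragraph splitting (by multiline)") can be sketched with a simple regular expression before feeding each piece to the classifier; the section texts below are illustrative only:

```python
import re

# Sketch of "paragraph splitting (by multiline)": break a long legal
# document into paragraphs on blank lines, so each piece stays within
# the classifier's 512-token budget.
def split_paragraphs(document):
    parts = re.split(r"\n\s*\n", document)
    return [p.strip() for p in parts if p.strip()]

doc = "Section 1. General Provisions.\n\nSection 2. Governing Law.\n\n\nSection 3. Notices."
print(split_paragraphs(doc))
```

Each resulting paragraph can then be sent through the classification pipeline as its own row of the input DataFrame.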
## Results ```bash +--------------------+ |result | +--------------------+ |[general-provisions]| |[other] | |[other] | |[general-provisions]| +--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_general_provisions_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support general-provisions 1.00 0.81 0.90 27 other 0.89 1.00 0.94 39 accuracy - - 0.92 66 macro-avg 0.94 0.91 0.92 66 weighted-avg 0.93 0.92 0.92 66 ``` --- layout: model title: Pipeline to Resolve Medication Codes author: John Snow Labs name: medication_resolver_pipeline date: 2022-09-01 tags: [resolver, snomed, umls, rxnorm, ndc, ade, en, licensed, pipeline] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true recommended: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pretrained resolver pipeline to extract medications and resolve their adverse reactions (ADE), RxNorm, UMLS, NDC, and SNOMED CT codes, as well as actions/treatments, in clinical text. Actions/treatments are available for branded medications, and SNOMED codes are available for non-branded medications. This pipeline can be used as a LightPipeline (with `annotate/fullAnnotate`). You can use `medication_resolver_transform_pipeline` for Spark transform.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/medication_resolver_pipeline_en_4.0.2_3.0_1662044306623.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/medication_resolver_pipeline_en_4.0.2_3.0_1662044306623.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline med_resolver_pipeline = PretrainedPipeline("medication_resolver_pipeline", "en", "clinical/models") text = """The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""" result = med_resolver_pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val med_resolver_pipeline = new PretrainedPipeline("medication_resolver_pipeline", "en", "clinical/models") val result = med_resolver_pipeline.fullAnnotate("""The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.medication").predict("""The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""") ```
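Internally, the pipeline relies on chunk-mapper stages to turn extracted drug mentions into codes. A toy plain-Python stand-in for that lookup step, using two RxNorm codes taken from the results table below (the dictionary itself is illustrative, not the pipeline's actual mapping resource):

```python
# Toy stand-in for the pipeline's ChunkMapper stages: a dictionary from
# a normalized drug mention to its RxNorm code. The two codes below come
# from the results table; everything else is illustrative.
RXNORM = {
    "lescol 40 mg": "103919",
    "eviplera": "217010",
}

def resolve_rxnorm(chunk):
    # Normalize the chunk and fall back to "NONE" when no code is found,
    # mirroring the "NONE" cells in the results table.
    return RXNORM.get(chunk.strip().lower(), "NONE")

print(resolve_rxnorm("Lescol 40 MG"))   # 103919
print(resolve_rxnorm("Aspirin"))        # NONE
```

The real pipeline additionally falls back to sentence-embedding resolvers when a direct mapping is not available, which a dictionary lookup cannot capture.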
## Results ```bash | | chunks | entities | ADE | RxNorm | Action | Treatment | UMLS | SNOMED_CT | NDC_Product | NDC_Package | |---:|:-----------------------------|:-----------|:----------------------------|---------:|:---------------------------|:-------------------------------------------|:---------|:------------|:--------------|:--------------| | 0 | Amlodopine Vallarta 10-320mg | DRUG | Gynaecomastia | 722131 | NONE | NONE | C1949334 | 425838008 | 00093-7693 | 00093-7693-56 | | 1 | Eviplera | DRUG | Anxiety | 217010 | Inhibitory Bone Resorption | Osteoporosis | C0720318 | NONE | NONE | NONE | | 2 | Lescol 40 MG | DRUG | NONE | 103919 | Hypocholesterolemic | Heterozygous Familial Hypercholesterolemia | C0353573 | NONE | 00078-0234 | 00078-0234-05 | | 3 | Everolimus 1.5 mg tablet | DRUG | Acute myocardial infarction | 2056895 | NONE | NONE | C4723581 | NONE | 00054-0604 | 00054-0604-21 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|medication_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|3.1 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel - TextMatcherModel - ChunkMergeModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger - ResolverMerger - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - Finisher --- layout: model title: Detect Persons, Locations, Organizations and Misc Entities in Spanish (WikiNER 6B 100) author: John Snow Labs name: wikiner_6B_100 date: 2020-02-03 task: Named Entity Recognition language: es edition: Spark NLP 2.4.0 spark_version: 2.4 tags: [ner, es, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: 
"Python-Scala-Java" --- ## Description WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 6B 100 is trained with GloVe 6B 100 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_ES){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_ES.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_6B_100_es_2.4.0_2.4_1581971941700.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_6B_100_es_2.4.0_2.4_1581971941700.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_100d") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("wikiner_6B_100", "es") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (nacido el 28 de octubre de 1955) es un magnate de los negocios, desarrollador de software, inversor y filántropo estadounidense. Es mejor conocido como el cofundador de Microsoft Corporation. Durante su carrera en Microsoft, Gates ocupó los cargos de presidente, director ejecutivo (CEO), presidente y arquitecto de software en jefe, y también fue el mayor accionista individual hasta mayo de 2014. Es uno de los empresarios y pioneros más conocidos de revolución de la microcomputadora de los años setenta y ochenta. Nacido y criado en Seattle, Washington, Gates cofundó Microsoft con su amigo de la infancia Paul Allen en 1975, en Albuquerque, Nuevo México; se convirtió en la compañía de software de computadora personal más grande del mundo. Gates dirigió la compañía como presidente y CEO hasta que dejó el cargo de CEO en enero de 2000, pero siguió siendo presidente y se convirtió en el arquitecto jefe de software. A fines de la década de 1990, Gates había sido criticado por sus tácticas comerciales, que se han considerado anticompetitivas. Esta opinión ha sido confirmada por numerosas sentencias judiciales. En junio de 2006, Gates anunció que haría la transición a un puesto de medio tiempo en Microsoft y trabajaría a tiempo completo en la Fundación Bill y Melinda Gates, la fundación caritativa privada que él y su esposa, Melinda Gates, establecieron en 2000. 
Poco a poco transfirió sus deberes a Ray Ozzie y Craig Mundie. Renunció como presidente de Microsoft en febrero de 2014 y asumió un nuevo cargo como asesor tecnológico para apoyar al recién nombrado CEO Satya Nadella.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_100d") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("wikiner_6B_100", "es") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (nacido el 28 de octubre de 1955) es un magnate de los negocios, desarrollador de software, inversor y filántropo estadounidense. Es mejor conocido como el cofundador de Microsoft Corporation. Durante su carrera en Microsoft, Gates ocupó los cargos de presidente, director ejecutivo (CEO), presidente y arquitecto de software en jefe, y también fue el mayor accionista individual hasta mayo de 2014. Es uno de los empresarios y pioneros más conocidos de revolución de la microcomputadora de los años setenta y ochenta. Nacido y criado en Seattle, Washington, Gates cofundó Microsoft con su amigo de la infancia Paul Allen en 1975, en Albuquerque, Nuevo México; se convirtió en la compañía de software de computadora personal más grande del mundo. Gates dirigió la compañía como presidente y CEO hasta que dejó el cargo de CEO en enero de 2000, pero siguió siendo presidente y se convirtió en el arquitecto jefe de software. A fines de la década de 1990, Gates había sido criticado por sus tácticas comerciales, que se han considerado anticompetitivas. Esta opinión ha sido confirmada por numerosas sentencias judiciales. 
En junio de 2006, Gates anunció que haría la transición a un puesto de medio tiempo en Microsoft y trabajaría a tiempo completo en la Fundación Bill y Melinda Gates, la fundación caritativa privada que él y su esposa, Melinda Gates, establecieron en 2000. Poco a poco transfirió sus deberes a Ray Ozzie y Craig Mundie. Renunció como presidente de Microsoft en febrero de 2014 y asumió un nuevo cargo como asesor tecnológico para apoyar al recién nombrado CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (nacido el 28 de octubre de 1955) es un magnate de los negocios, desarrollador de software, inversor y filántropo estadounidense. Es mejor conocido como el cofundador de Microsoft Corporation. Durante su carrera en Microsoft, Gates ocupó los cargos de presidente, director ejecutivo (CEO), presidente y arquitecto de software en jefe, y también fue el mayor accionista individual hasta mayo de 2014. Es uno de los empresarios y pioneros más conocidos de revolución de la microcomputadora de los años setenta y ochenta. Nacido y criado en Seattle, Washington, Gates cofundó Microsoft con su amigo de la infancia Paul Allen en 1975, en Albuquerque, Nuevo México; se convirtió en la compañía de software de computadora personal más grande del mundo. Gates dirigió la compañía como presidente y CEO hasta que dejó el cargo de CEO en enero de 2000, pero siguió siendo presidente y se convirtió en el arquitecto jefe de software. A fines de la década de 1990, Gates había sido criticado por sus tácticas comerciales, que se han considerado anticompetitivas. Esta opinión ha sido confirmada por numerosas sentencias judiciales. En junio de 2006, Gates anunció que haría la transición a un puesto de medio tiempo en Microsoft y trabajaría a tiempo completo en la Fundación Bill y Melinda Gates, la fundación caritativa privada que él y su esposa, Melinda Gates, establecieron en 2000. 
Poco a poco transfirió sus deberes a Ray Ozzie y Craig Mundie. Renunció como presidente de Microsoft en febrero de 2014 y asumió un nuevo cargo como asesor tecnológico para apoyar al recién nombrado CEO Satya Nadella."""] ner_df = nlu.load('es.ner.wikiner.glove.6B_100').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
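The description above notes that the model consumes word embeddings in which semantically similar words lie closer together. A toy cosine-similarity sketch with made-up 3-dimensional vectors (real GloVe 6B 100 vectors are 100-dimensional):

```python
import math

# Toy illustration of "semantically similar words are closer together":
# cosine similarity between small made-up vectors.
vectors = {
    "rey":   [0.90, 0.10, 0.30],
    "reina": [0.85, 0.15, 0.35],
    "mesa":  [0.10, 0.90, 0.00],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine(vectors["rey"], vectors["reina"]))  # close to 1
print(cosine(vectors["rey"], vectors["mesa"]))   # much lower
```

This geometric closeness is why the card insists on pairing the NER model with the exact embeddings it was trained on (`glove_100d`): a different embedding space would place words differently.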
{:.h2_title} ## Results ```bash +------------------------------+---------+ |chunk |ner_label| +------------------------------+---------+ |William Henry Gates III |PER | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PER | |Nacido |ORG | |Seattle |LOC | |Washington |LOC | |Gates |PER | |Microsoft |ORG | |Paul Allen |PER | |Albuquerque |LOC | |Nuevo México |LOC | |Gates |PER | |CEO |ORG | |CEO |ORG | |Gates |PER | |Gates |PER | |Microsoft |ORG | |Fundación Bill y Melinda Gates|ORG | |Melinda Gates |PER | +------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wikiner_6B_100| |Type:|ner| |Compatibility:|Spark NLP 2.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| |Case sensitive:|false| {:.h2_title} ## Data Source The model is trained on data from [https://es.wikipedia.org](https://es.wikipedia.org) --- layout: model title: Legal Trustee may file proofs of claim Clause Binary Classifier author: John Snow Labs name: legclf_trustee_may_file_proofs_of_claim_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `trustee-may-file-proofs-of-claim` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do Binary Classification at sentence level.
If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Bear in mind that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `trustee-may-file-proofs-of-claim` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_trustee_may_file_proofs_of_claim_clause_en_1.0.0_3.2_1660123140632.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_trustee_may_file_proofs_of_claim_clause_en_1.0.0_3.2_1660123140632.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_trustee_may_file_proofs_of_claim_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
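As the description notes, long filings should be split into clause-sized passages before classification, and the underlying sentence embeddings accept at most 512 tokens. A minimal, framework-free sketch of paragraph splitting with a rough token-budget pre-filter (helper names are illustrative, and the whitespace word count only approximates the model's subword tokenization):

```python
import re

def split_paragraphs(document: str) -> list[str]:
    # Split on blank lines ("by multiline"), dropping empty fragments.
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

def within_token_budget(paragraph: str, max_tokens: int = 512) -> bool:
    # Whitespace tokens underestimate subword tokens, so keep a margin in practice.
    return len(paragraph.split()) <= max_tokens

doc = """The Trustee may file proofs of claim on behalf of the Holders.

This Agreement shall be governed by the laws of the State of New York."""

clauses = [p for p in split_paragraphs(doc) if within_token_budget(p)]
# Each surviving paragraph becomes one row of the `clause_text` column used above.
```

The resulting passages map one-to-one onto rows of the `clause_text` DataFrame column consumed by the pipeline in the snippet above.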
## Results ```bash +----------------------------------+ |result                            | +----------------------------------+ |[trustee-may-file-proofs-of-claim]| |[other]                           | |[other]                           | |[trustee-may-file-proofs-of-claim]| +----------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_trustee_may_file_proofs_of_claim_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.97 0.97 0.97 30 trustee-may-file-proofs-of-claim 0.93 0.93 0.93 15 accuracy - - 0.96 45 macro-avg 0.95 0.95 0.95 45 weighted-avg 0.96 0.96 0.96 45 ``` --- layout: model title: English asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten TFWav2Vec2ForCTC from patrickvonplaten author: John Snow Labs name: asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten` is an English model originally trained by patrickvonplaten.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten_en_4.2.0_3.0_1664114039241.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten_en_4.2.0_3.0_1664114039241.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
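The `audioDf` referenced above is assumed to be a DataFrame whose `audio_content` column holds arrays of floating-point samples (Wav2Vec2 models expect 16 kHz mono audio). A stdlib-only sketch of decoding 16-bit PCM mono WAV bytes into normalized floats, using a synthetic in-memory file so the example is self-contained; in practice a library such as librosa would also handle resampling:

```python
import io
import struct
import wave

def wav_to_floats(wav_bytes: bytes) -> list[float]:
    # Decode 16-bit PCM mono WAV bytes into floats in [-1.0, 1.0].
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        assert wav.getsampwidth() == 2 and wav.getnchannels() == 1
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Synthesize a tiny 4-sample WAV in memory for demonstration.
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))

floats = wav_to_floats(buf.getvalue())
```

One row of such floats per recording would then populate the DataFrame, e.g. `audioDf = spark.createDataFrame([[floats]]).toDF("audio_content")` (a hypothetical single-file example).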
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_google_colab_by_patrickvonplaten| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|349.3 MB| --- layout: model title: Sound Medical Classification Pipeline - Voice of the Patient author: John Snow Labs name: bert_sequence_classifier_vop_sound_medical_pipeline date: 2023-06-14 tags: [licensed, en, clinical, classification, vop] task: Text Classification language: en edition: Healthcare NLP 4.4.3 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline includes the Medical Bert for Sequence Classification model to identify whether the suggestion mentioned in the text is medically sound. The pipeline is built on top of the [bert_sequence_classifier_vop_sound_medical](https://nlp.johnsnowlabs.com/2023/06/13/bert_sequence_classifier_vop_sound_medical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_sound_medical_pipeline_en_4.4.3_3.2_1686710496292.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_sound_medical_pipeline_en_4.4.3_3.2_1686710496292.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_sequence_classifier_vop_sound_medical_pipeline", "en", "clinical/models") pipeline.annotate("I had a lung surgery for emphyema and after surgery my xray showing some recovery.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_sequence_classifier_vop_sound_medical_pipeline", "en", "clinical/models") val result = pipeline.annotate("I had a lung surgery for emphyema and after surgery my xray showing some recovery.") ```
## Results ```bash | text | prediction | |:-----------------------------------------------------------------------------------|:-------------| | I had a lung surgery for emphyema and after surgery my xray showing some recovery. | True | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_vop_sound_medical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|406.4 MB| ## Included Models - DocumentAssembler - TokenizerModel - MedicalBertForSequenceClassification --- layout: model title: Romanian T5ForConditionalGeneration Small Cased model (from BlackKakapo) author: John Snow Labs name: t5_small_paraphrase date: 2023-01-31 tags: [ro, open_source, t5, tensorflow] task: Text Generation language: ro edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-paraphrase-ro` is a Romanian model originally trained by `BlackKakapo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_paraphrase_ro_4.3.0_3.0_1675126644590.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_paraphrase_ro_4.3.0_3.0_1675126644590.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_paraphrase","ro") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_paraphrase","ro") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_paraphrase| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ro| |Size:|288.6 MB| ## References - https://huggingface.co/BlackKakapo/t5-small-paraphrase-ro --- layout: model title: Translate Tsonga to English Pipeline author: John Snow Labs name: translate_ts_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ts, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `ts` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ts_en_xx_2.7.0_2.4_1609691282237.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ts_en_xx_2.7.0_2.4_1609691282237.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ts_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ts_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ts.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ts_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForTokenClassification Cased model (from Lucifermorningstar011) author: John Snow Labs name: distilbert_token_classifier_autotrain_final_784824209 date: 2023-03-14 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-final-784824209` is an English model originally trained by `Lucifermorningstar011`. ## Predicted Entities `9`, `0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824209_en_4.3.1_3.0_1678783290458.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824209_en_4.3.1_3.0_1678783290458.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824209","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824209","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_final_784824209| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Lucifermorningstar011/autotrain-final-784824209 --- layout: model title: German XLMRobertaForTokenClassification Large Cased model (from bettertextapp) author: John Snow Labs name: xlmroberta_ner_gpt2_large_detector_de_v1 date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `gpt2-large-detector-de-v1` is a German model originally trained by `bettertextapp`. ## Predicted Entities `DELETE`, `KEEP`, `ADD`, `REPLACE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_gpt2_large_detector_de_v1_de_4.1.0_3.0_1660422438967.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_gpt2_large_detector_de_v1_de_4.1.0_3.0_1660422438967.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_gpt2_large_detector_de_v1","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_gpt2_large_detector_de_v1","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_gpt2_large_detector_de_v1| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|811.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/bettertextapp/gpt2-large-detector-de-v1 --- layout: model title: English asr_wav2vec2_base_timit_demo_colab6_by_hassnain TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab6_by_hassnain date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab6_by_hassnain` is an English model originally trained by hassnain. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab6_by_hassnain_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab6_by_hassnain_en_4.2.0_3.0_1664040502041.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab6_by_hassnain_en_4.2.0_3.0_1664040502041.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab6_by_hassnain", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab6_by_hassnain", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab6_by_hassnain| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: Spanish BertForTokenClassification Cased model (from NazaGara) author: John Snow Labs name: bert_token_classifier_ner_fine_tuned_beto date: 2022-11-30 tags: [es, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: es edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `NER-fine-tuned-BETO` is a Spanish model originally trained by `NazaGara`. ## Predicted Entities `PER`, `MISC`, `ORG`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_ner_fine_tuned_beto_es_4.2.4_3.0_1669813932223.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_ner_fine_tuned_beto_es_4.2.4_3.0_1669813932223.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_fine_tuned_beto","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_fine_tuned_beto","es") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
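The pipeline above emits one label per token in the `ner` column; the card's `PER`, `MISC`, `ORG`, `LOC` entities typically surface at token level as IOB tags (`B-PER`, `I-PER`, ...). Assembling those per-token tags into entity chunks is the job of Spark NLP's `NerConverter` annotator; a simplified, framework-free sketch of that grouping logic (assuming well-formed IOB input):

```python
def iob_to_chunks(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    # Merge consecutive B-/I- tags of the same entity type into (text, label) chunks.
    chunks: list[tuple[str, str]] = []
    current: list[str] = []
    entity = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), entity))
            current, entity = [token], tag[2:]
        elif tag.startswith("I-") and entity == tag[2:]:
            current.append(token)
        else:  # "O" (or a stray I- tag): close any open chunk.
            if current:
                chunks.append((" ".join(current), entity))
            current, entity = [], None
    if current:
        chunks.append((" ".join(current), entity))
    return chunks

tokens = ["Gabriel", "García", "Márquez", "nació", "en", "Aracataca"]
tags = ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC"]
# iob_to_chunks(tokens, tags) ->
#   [("Gabriel García Márquez", "PER"), ("Aracataca", "LOC")]
```

In a Spark NLP pipeline this step would simply be an extra `NerConverter` stage producing an `ner_chunk` column, as shown in other NER cards in this collection.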
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_fine_tuned_beto| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|es| |Size:|410.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/NazaGara/NER-fine-tuned-BETO --- layout: model title: English BertForMaskedLM Cased model (from lordtt13) author: John Snow Labs name: bert_embeddings_covid_scibert date: 2022-12-02 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `COVID-SciBERT` is an English model originally trained by `lordtt13`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_covid_scibert_en_4.2.4_3.0_1670014857338.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_covid_scibert_en_4.2.4_3.0_1670014857338.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_covid_scibert","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_covid_scibert","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_covid_scibert| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|415.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/lordtt13/COVID-SciBERT - https://arxiv.org/abs/1903.10676 - https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge - https://github.com/lordtt13/word-embeddings/blob/master/COVID-19%20Research%20Data/COVID-SciBERT.ipynb - https://github.com/lordtt13 - https://www.linkedin.com/in/tanmay-thakur-6bb5a9154/ --- layout: model title: Fast Neural Machine Translation Model from Lushai to English author: John Snow Labs name: opus_mt_lus_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, lus, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list).
- source languages: `lus` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_lus_en_xx_2.7.0_2.4_1609164045034.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_lus_en_xx_2.7.0_2.4_1609164045034.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_lus_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_lus_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.lus.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_lus_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English image_classifier_vit_base_food101 ViTForImageClassification from nateraw author: John Snow Labs name: image_classifier_vit_base_food101 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_food101` is an English model originally trained by nateraw.
## Predicted Entities `grilled_cheese_sandwich`, `edamame`, `onion_rings`, `french_onion_soup`, `french_fries`, `creme_brulee`, `lobster_roll_sandwich`, `bruschetta`, `breakfast_burrito`, `caprese_salad`, `churros`, `omelette`, `club_sandwich`, `chocolate_mousse`, `nachos`, `bread_pudding`, `steak`, `hummus`, `panna_cotta`, `filet_mignon`, `sashimi`, `hot_and_sour_soup`, `cannoli`, `ravioli`, `samosa`, `grilled_salmon`, `lobster_bisque`, `seaweed_salad`, `macaroni_and_cheese`, `fish_and_chips`, `caesar_salad`, `dumplings`, `baby_back_ribs`, `fried_rice`, `oysters`, `peking_duck`, `guacamole`, `greek_salad`, `donuts`, `risotto`, `escargots`, `crab_cakes`, `waffles`, `carrot_cake`, `prime_rib`, `tuna_tartare`, `pho`, `chocolate_cake`, `bibimbap`, `fried_calamari`, `spaghetti_bolognese`, `gnocchi`, `chicken_quesadilla`, `frozen_yogurt`, `apple_pie`, `baklava`, `pulled_pork_sandwich`, `clam_chowder`, `eggs_benedict`, `lasagna`, `ceviche`, `paella`, `foie_gras`, `spring_rolls`, `falafel`, `miso_soup`, `pork_chop`, `ramen`, `pad_thai`, `garlic_bread`, `macarons`, `ice_cream`, `mussels`, `chicken_wings`, `pancakes`, `gyoza`, `poutine`, `croque_madame`, `pizza`, `cheese_plate`, `beignets`, `huevos_rancheros`, `french_toast`, `sushi`, `takoyaki`, `spaghetti_carbonara`, `beef_tartare`, `scallops`, `cup_cakes`, `tacos`, `deviled_eggs`, `beet_salad`, `tiramisu`, `cheesecake`, `strawberry_shortcake`, `beef_carpaccio`, `hamburger`, `red_velvet_cake`, `hot_dog`, `shrimp_and_grits`, `chicken_curry` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_food101_en_4.1.0_3.0_1660169184081.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_food101_en_4.1.0_3.0_1660169184081.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_food101", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_food101", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
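Under the hood, the classifier scores each of the 101 food labels for an image, and the `class` column holds the top-scoring one. The final step can be sketched in plain Python as a softmax followed by an argmax (the three labels and their logit values here are invented purely for illustration):

```python
import math

# Hypothetical logits for three of the 101 Food-101 labels (values are made up)
logits = {"pizza": 2.1, "hamburger": 0.3, "sushi": -1.2}

# Softmax turns raw scores into probabilities that sum to 1
exps = {label: math.exp(score) for label, score in logits.items()}
total = sum(exps.values())
probs = {label: e / total for label, e in exps.items()}

# The predicted class is the argmax over the probabilities
predicted = max(probs, key=probs.get)
print(predicted)  # pizza
```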
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_food101| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.2 MB| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from kenlevine) author: John Snow Labs name: distilbert_qa_kenlevine_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `kenlevine`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_kenlevine_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771743253.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_kenlevine_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771743253.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kenlevine_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kenlevine_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
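Extractive question answering, the task this model performs, scores each context token as a possible answer start and a possible answer end, then returns the best-scoring valid span. A toy, pure-Python sketch of that span-selection step (the tokens come from the example above; the score values are invented for illustration):

```python
# Toy start/end scores for each context token (values are made up)
context_tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.2, 0.1, 3.0, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]
end_scores = [0.1, 0.1, 0.1, 2.5, 0.2, 0.1, 0.1, 0.1, 0.6, 0.1]

# The answer is the span (s, e) with s <= e maximizing start_scores[s] + end_scores[e]
best = max(
    ((s, e) for s in range(len(context_tokens)) for e in range(s, len(context_tokens))),
    key=lambda span: start_scores[span[0]] + end_scores[span[1]],
)
answer = " ".join(context_tokens[best[0]:best[1] + 1])
print(answer)  # Clara
```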
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_kenlevine_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/kenlevine/distilbert-base-uncased-finetuned-squad --- layout: model title: Translate Turkish to English Pipeline author: John Snow Labs name: translate_tr_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, tr, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this module is very computationally expensive, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `tr` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_tr_en_xx_2.7.0_2.4_1609687539715.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_tr_en_xx_2.7.0_2.4_1609687539715.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_tr_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_tr_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.tr.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_tr_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Detect PHI (Deidentification) author: John Snow Labs name: ner_deid_large date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Deidentification NER (Large) is a Named Entity Recognition model that annotates text to find protected health information that may need to be deidentified. The entities it annotates are Age, Contact, Date, Id, Location, Name, and Profession. This model is trained with the `embeddings_clinical` word embeddings model, so be sure to use the same embeddings in the pipeline. We adhered to the official annotation guidelines (AG) for the 2014 i2b2 Deid challenge while annotating new datasets for this model. All the details regarding the nuances of and explanations for the AG can be found here: [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/) ## Predicted Entities `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`, `HEALTHPLAN`, `URL`.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_large_en_3.0.0_3.0_1617209688468.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_large_en_3.0.0_3.0_1617209688468.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_deid_large","en","clinical/models") \ .setInputCols("sentence","token","embeddings") \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("entities") nlp_pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) input_text = ["""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. 
The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same."""] result = light_pipeline.fullAnnotate(input_text) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_deid_large","en","clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("entities") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery.
On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.deid.large").predict("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. 
Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.""") ```
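Once PHI chunks like those in the Results table have been extracted, a typical deidentification step replaces each chunk in the text with its entity label. A minimal pure-Python sketch of that masking step (the short sentence and chunk list here are a made-up condensation of the example note, not the model's actual output format):

```python
# Hypothetical text plus (chunk, label) pairs as a deid NER model might return them
text = "Mr. Smith was seen on 02/04/2003 at the VA Hospital by Dr. Hart."
chunks = [("Smith", "NAME"), ("02/04/2003", "DATE"), ("VA Hospital", "LOCATION"), ("Hart", "NAME")]

# Replace every detected chunk with a <LABEL> placeholder
for chunk, label in chunks:
    text = text.replace(chunk, f"<{label}>")
print(text)  # Mr. <NAME> was seen on <DATE> at the <LOCATION> by Dr. <NAME>.
```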
## Results ```bash +---------------+---------+ |chunk |ner_label| +---------------+---------+ |Smith |NAME | |VA Hospital |LOCATION | |Day Hospital |LOCATION | |02/04/2003 |DATE | |Smith |NAME | |Day Hospital |LOCATION | |Smith |NAME | |Smith |NAME | |7 Ardmore Tower|LOCATION | |Hart |NAME | |Smith |NAME | |02/07/2003 |DATE | +---------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_large| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on plain n2c2 2014: De-identification and Heart Disease Risk Factors Challenge datasets with `embeddings_clinical` https://portal.dbmi.hms.harvard.edu/projects/n2c2-2014/ ## Benchmarking ```bash label tp fp fn prec rec f1 I-NAME 1096 47 80 0.95888 0.931973 0.945235 I-CONTACT 93 0 4 1 0.958763 0.978947 I-AGE 3 1 6 0.75 0.333333 0.461538 B-DATE 2078 42 52 0.980189 0.975587 0.977882 I-DATE 474 39 25 0.923977 0.9499 0.936759 I-LOCATION 755 68 76 0.917375 0.908544 0.912938 I-PROFESSION 78 8 9 0.906977 0.896552 0.901734 B-NAME 1182 101 36 0.921278 0.970443 0.945222 B-AGE 259 10 11 0.962825 0.959259 0.961039 B-ID 146 8 11 0.948052 0.929936 0.938907 B-PROFESSION 76 9 21 0.894118 0.783505 0.835165 B-LOCATION 556 87 71 0.864697 0.886762 0.875591 I-ID 64 8 3 0.888889 0.955224 0.920863 B-CONTACT 40 7 5 0.851064 0.888889 0.869565 Macro-average 6900 435 410 0.912023 0.880619 0.896046 Micro-average 6900 435 410 0.940695 0.943912 0.942301 ``` --- layout: model title: Translate Greek languages to English Pipeline author: John Snow Labs name: translate_grk_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, grk, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## 
Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this module is very computationally expensive, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `grk` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_grk_en_xx_2.7.0_2.4_1609687780254.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_grk_en_xx_2.7.0_2.4_1609687780254.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_grk_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_grk_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.grk.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_grk_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Stop Words Cleaner for Italian author: John Snow Labs name: stopwords_it date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: it edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, it] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_it_it_2.5.4_2.4_1594742442063.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_it_it_2.5.4_2.4_1594742442063.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_it", "it") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Oltre ad essere il re del nord, John Snow è un medico inglese e leader nello sviluppo dell'anestesia e dell'igiene medica.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_it", "it") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Oltre ad essere il re del nord, John Snow è un medico inglese e leader nello sviluppo dell'anestesia e dell'igiene medica.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Oltre ad essere il re del nord, John Snow è un medico inglese e leader nello sviluppo dell'anestesia e dell'igiene medica."""] stopword_df = nlu.load('it.stopwords').predict(text) stopword_df[['cleanTokens']] ```
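Conceptually, StopWordsCleaner filters the token stream against a fixed word list. The same idea, sketched in plain Python with a tiny, made-up subset of Italian stop words (the pretrained model ships a full list):

```python
# Toy subset of Italian stop words, for illustration only
stop_words = {"oltre", "ad", "essere", "il", "del", "è", "un", "e"}
tokens = ["Oltre", "ad", "essere", "il", "re", "del", "nord", ",", "John", "Snow"]

# Case-insensitive filtering, mirroring the model's Case sensitive: false setting
clean_tokens = [t for t in tokens if t.lower() not in stop_words]
print(clean_tokens)  # ['re', 'nord', ',', 'John', 'Snow']
```

Note how the surviving tokens match the first rows of the Results section: content words like "re" and "nord" are kept while function words are dropped.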
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=19, end=20, result='re', metadata={'sentence': '0'}), Row(annotatorType='token', begin=26, end=29, result='nord', metadata={'sentence': '0'}), Row(annotatorType='token', begin=30, end=30, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=32, end=35, result='John', metadata={'sentence': '0'}), Row(annotatorType='token', begin=37, end=40, result='Snow', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_it| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|it| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_nl2 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-nl2` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl2_en_4.3.0_3.0_1675121636200.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl2_en_4.3.0_3.0_1675121636200.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_nl2","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_nl2","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_nl2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|91.6 MB| ## References - https://huggingface.co/google/t5-efficient-small-nl2 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4CHEMD_Chem_Original_SciBERT_512 date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Chem-Original-SciBERT-512` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Original_SciBERT_512_en_4.0.0_3.0_1657108778563.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Chem_Original_SciBERT_512_en_4.0.0_3.0_1657108778563.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Original_SciBERT_512","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Chem_Original_SciBERT_512","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
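The token classifier emits per-token BIO tags (`B-Chemical`, `I-Chemical`, `O`); downstream, a NerConverter-style pass groups those tags into entity chunks. A toy, pure-Python sketch of that grouping step (the tokens and tags below are invented for illustration):

```python
# Hypothetical tokens with BIO tags, as a chemical NER model might produce them
tokens = ["Treated", "with", "acetylsalicylic", "acid", "daily", "."]
tags = ["O", "O", "B-Chemical", "I-Chemical", "O", "O"]

# Group B-/I- runs into chunks; O (or a new B-) closes the current chunk
chunks, current = [], []
for token, tag in zip(tokens, tags):
    if tag.startswith("B-"):
        if current:
            chunks.append(" ".join(current))
        current = [token]
    elif tag.startswith("I-") and current:
        current.append(token)
    else:
        if current:
            chunks.append(" ".join(current))
        current = []
if current:
    chunks.append(" ".join(current))
print(chunks)  # ['acetylsalicylic acid']
```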
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4CHEMD_Chem_Original_SciBERT_512| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|410.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4CHEMD-Chem-Original-SciBERT-512 --- layout: model title: Stop Words Cleaner for Slovak author: John Snow Labs name: stopwords_sk date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: sk edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, sk] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_sk_sk_2.5.4_2.4_1594742441462.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_sk_sk_2.5.4_2.4_1594742441462.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_sk", "sk") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Okrem toho, že je kráľom severu, je John Snow anglickým lekárom a lídrom vo vývoji anestézie a lekárskej hygieny.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_sk", "sk") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Okrem toho, že je kráľom severu, je John Snow anglickým lekárom a lídrom vo vývoji anestézie a lekárskej hygieny.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Okrem toho, že je kráľom severu, je John Snow anglickým lekárom a lídrom vo vývoji anestézie a lekárskej hygieny."""] stopword_df = nlu.load('sk.stopwords').predict(text) stopword_df[['cleanTokens']] ```
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=4, result='Okrem', metadata={'sentence': '0'}), Row(annotatorType='token', begin=10, end=10, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=15, end=16, result='je', metadata={'sentence': '0'}), Row(annotatorType='token', begin=18, end=23, result='kráľom', metadata={'sentence': '0'}), Row(annotatorType='token', begin=25, end=30, result='severu', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_sk| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|sk| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from ak987) author: John Snow Labs name: distilbert_qa_ak987_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `ak987`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_ak987_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769722481.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_ak987_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769722481.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ak987_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ak987_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_ak987_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/ak987/distilbert-base-uncased-finetuned-squad --- layout: model title: Detect Oncology-Specific Entities author: John Snow Labs name: ner_oncology date: 2022-11-24 tags: [licensed, clinical, en, oncology, biomarker, treatment] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts more than 40 oncology-related entities, including therapies, tests and staging. Definitions of Predicted Entities: - `Adenopathy`: Mentions of pathological findings of the lymph nodes. - `Age`: All mentions of age, past or present, related to the patient or to anybody else. - `Biomarker`: Biological molecules that indicate the presence or absence of cancer, or the type of cancer. Oncogenes are excluded from this category. - `Biomarker_Result`: Terms or values that are identified as the result of a biomarker. - `Cancer_Dx`: Mentions of cancer diagnoses (such as "breast cancer") or pathological types that are usually used as synonyms for "cancer" (e.g. "carcinoma"). When anatomical references are present, they are included in the Cancer_Dx extraction. - `Cancer_Score`: Clinical or imaging scores that are specific for cancer settings (e.g. "BI-RADS" or "Allred score"). - `Cancer_Surgery`: Terms that indicate surgery as a form of cancer treatment. - `Chemotherapy`: Mentions of chemotherapy drugs, or unspecific words such as "chemotherapy". 
- `Cycle_Count`: The total number of cycles of an oncological therapy being administered (e.g. "5 cycles"). - `Cycle_Day`: References to the day of the cycle of oncological therapy (e.g. "day 5"). - `Cycle_Number`: The number of the cycle of an oncological therapy that is being applied (e.g. "third cycle"). - `Date`: Mentions of exact dates, in any format, including day number, month and/or year. - `Death_Entity`: Words that indicate the death of the patient or someone else (including family members), such as "died" or "passed away". - `Direction`: Directional and laterality terms, such as "left", "right", "bilateral", "upper" and "lower". - `Dosage`: The quantity prescribed by the physician for an active ingredient. - `Duration`: Words indicating the duration of a treatment (e.g. "for 2 weeks"). - `Frequency`: Words indicating the frequency of treatment administration (e.g. "daily" or "bid"). - `Gender`: Gender-specific nouns and pronouns (including words such as "him" or "she", and family members such as "father"). - `Grade`: All pathological grading of tumors (e.g. "grade 1") or degrees of cellular differentiation (e.g. "well-differentiated"). - `Histological_Type`: Histological variants or cancer subtypes, such as "papillary", "clear cell" or "medullary". - `Hormonal_Therapy`: Mentions of hormonal drugs used to treat cancer, or unspecific words such as "hormonal therapy". - `Imaging_Test`: Imaging tests mentioned in texts, such as "chest CT scan". - `Immunotherapy`: Mentions of immunotherapy drugs, or unspecific words such as "immunotherapy". - `Invasion`: Mentions that refer to tumor invasion, such as "invasion" or "involvement". Metastases or lymph node involvement are excluded from this category. - `Line_Of_Therapy`: Explicit references to the line of therapy of an oncological therapy (e.g. "first-line treatment"). - `Metastasis`: Terms that indicate a metastatic disease. Anatomical references are not included in these extractions. 
- `Oncogene`: Mentions of genes that are implicated in the etiology of cancer. - `Pathology_Result`: The findings of a biopsy from the pathology report that is not covered by another entity (e.g. "malignant ductal cells"). - `Pathology_Test`: Mentions of biopsies or tests that use tissue samples. - `Performance_Status`: Mentions of performance status scores, such as ECOG and Karnofsky. The name of the score is extracted together with the result (e.g. "ECOG performance status of 4"). - `Race_Ethnicity`: The race and ethnicity categories include racial and national origin or sociocultural groups. - `Radiotherapy`: Terms that indicate the use of Radiotherapy. - `Response_To_Treatment`: Terms related to clinical progress of the patient related to cancer treatment, including "recurrence", "bad response" or "improvement". - `Relative_Date`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "yesterday" or "three years later"). - `Route`: Words indicating the type of administration route (such as "PO" or "transdermal"). - `Site_Bone`: Anatomical terms that refer to the human skeleton. - `Site_Brain`: Anatomical terms that refer to the central nervous system (including the brain stem and the cerebellum). - `Site_Breast`: Anatomical terms that refer to the breasts. - `Site_Liver`: Anatomical terms that refer to the liver. - `Site_Lung`: Anatomical terms that refer to the lungs. - `Site_Lymph_Node`: Anatomical terms that refer to lymph nodes, excluding adenopathies. - `Site_Other_Body_Part`: Relevant anatomical terms that are not included in the rest of the anatomical entities. - `Smoking_Status`: All mentions of smoking related to the patient or to someone else. - `Staging`: Mentions of cancer stage such as "stage 2b" or "T2N1M0". It also includes words such as "in situ", "early-stage" or "advanced". - `Targeted_Therapy`: Mentions of targeted therapy drugs, or unspecific words such as "targeted therapy". 
- `Tumor_Finding`: All nonspecific terms that may be related to tumors, either malignant or benign (for example: "mass", "tumor", "lesion", or "neoplasm"). - `Tumor_Size`: Size of the tumor, including numerical value and unit of measurement (e.g. "3 cm"). - `Unspecific_Therapy`: Terms that indicate a known cancer therapy but that is not specific to any other therapy entity (e.g. "chemoradiotherapy" or "adjuvant therapy"). ## Predicted Entities `Histological_Type`, `Direction`, `Staging`, `Cancer_Score`, `Imaging_Test`, `Cycle_Number`, `Tumor_Finding`, `Site_Lymph_Node`, `Invasion`, `Response_To_Treatment`, `Smoking_Status`, `Tumor_Size`, `Cycle_Count`, `Adenopathy`, `Age`, `Biomarker_Result`, `Unspecific_Therapy`, `Site_Breast`, `Chemotherapy`, `Targeted_Therapy`, `Radiotherapy`, `Performance_Status`, `Pathology_Test`, `Site_Other_Body_Part`, `Cancer_Surgery`, `Line_Of_Therapy`, `Pathology_Result`, `Hormonal_Therapy`, `Site_Bone`, `Biomarker`, `Immunotherapy`, `Cycle_Day`, `Frequency`, `Route`, `Duration`, `Death_Entity`, `Metastasis`, `Site_Liver`, `Cancer_Dx`, `Grade`, `Date`, `Site_Lung`, `Site_Brain`, `Relative_Date`, `Race_Ethnicity`, `Gender`, `Oncogene`, `Dosage`, `Radiation_Dose` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_en_4.2.2_3.0_1669306355829.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_en_4.2.2_3.0_1669306355829.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to the residual breast. The cancer recurred as a right lung metastasis 13 years later. 
The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to the residual breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology").predict("""The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to the residual breast. 
The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.""") ```
## Results ```bash | chunk | ner_label | |:-------------------------------|:----------------------| | left | Direction | | mastectomy | Cancer_Surgery | | axillary lymph node dissection | Cancer_Surgery | | left | Direction | | breast cancer | Cancer_Dx | | twenty years ago | Relative_Date | | tumor | Tumor_Finding | | positive | Biomarker_Result | | ER | Biomarker | | PR | Biomarker | | radiotherapy | Radiotherapy | | breast | Site_Breast | | cancer | Cancer_Dx | | recurred | Response_To_Treatment | | right | Direction | | lung | Site_Lung | | metastasis | Metastasis | | 13 years later | Relative_Date | | adriamycin | Chemotherapy | | 60 mg/m2 | Dosage | | cyclophosphamide | Chemotherapy | | 600 mg/m2 | Dosage | | six courses | Cycle_Count | | first line | Line_Of_Therapy | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.6 MB| ## References In-house annotated oncology case reports. 
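The per-label scores in the Benchmarking section below are derived from raw true-positive / false-positive / false-negative counts, so they can be recomputed directly as a sanity check. For example, for the `Metastasis` row (tp=353, fp=18, fn=17):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts, rounded to 2 decimals."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

# Metastasis row from the benchmarking table: tp=353, fp=18, fn=17
print(prf(353, 18, 17))  # -> (0.95, 0.95, 0.95)
```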
## Benchmarking ```bash label tp fp fn total precision recall f1 Histological_Type 339 75 114 453 0.82 0.75 0.78 Direction 832 163 152 984 0.84 0.85 0.84 Staging 229 31 29 258 0.88 0.89 0.88 Cancer_Score 37 8 25 62 0.82 0.60 0.69 Imaging_Test 2027 214 177 2204 0.90 0.92 0.91 Cycle_Number 73 29 24 97 0.72 0.75 0.73 Tumor_Finding 1114 64 143 1257 0.95 0.89 0.91 Site_Lymph_Node 491 53 60 551 0.90 0.89 0.90 Invasion 158 36 23 181 0.81 0.87 0.84 Response_To_Treatment 431 149 165 596 0.74 0.72 0.73 Smoking_Status 66 18 2 68 0.79 0.97 0.87 Tumor_Size 1050 112 79 1129 0.90 0.93 0.92 Cycle_Count 177 62 53 230 0.74 0.77 0.75 Adenopathy 67 12 29 96 0.85 0.70 0.77 Age 930 33 19 949 0.97 0.98 0.97 Biomarker_Result 1160 169 285 1445 0.87 0.80 0.84 Unspecific_Therapy 198 86 80 278 0.70 0.71 0.70 Site_Breast 125 15 22 147 0.89 0.85 0.87 Chemotherapy 814 55 65 879 0.94 0.93 0.93 Targeted_Therapy 195 27 33 228 0.88 0.86 0.87 Radiotherapy 276 29 34 310 0.90 0.89 0.90 Performance_Status 121 17 14 135 0.88 0.90 0.89 Pathology_Test 888 296 162 1050 0.75 0.85 0.79 Site_Other_Body_Part 909 275 592 1501 0.77 0.61 0.68 Cancer_Surgery 693 119 126 819 0.85 0.85 0.85 Line_Of_Therapy 101 11 5 106 0.90 0.95 0.93 Pathology_Result 655 279 487 1142 0.70 0.57 0.63 Hormonal_Therapy 169 4 16 185 0.98 0.91 0.94 Site_Bone 264 81 49 313 0.77 0.84 0.80 Biomarker 1259 238 256 1515 0.84 0.83 0.84 Immunotherapy 103 47 25 128 0.69 0.80 0.74 Cycle_Day 200 36 48 248 0.85 0.81 0.83 Frequency 354 27 73 427 0.93 0.83 0.88 Route 91 15 22 113 0.86 0.81 0.83 Duration 625 161 136 761 0.80 0.82 0.81 Death_Entity 34 2 4 38 0.94 0.89 0.92 Metastasis 353 18 17 370 0.95 0.95 0.95 Site_Liver 189 64 45 234 0.75 0.81 0.78 Cancer_Dx 1301 103 93 1394 0.93 0.93 0.93 Grade 190 27 46 236 0.88 0.81 0.84 Date 807 21 24 831 0.97 0.97 0.97 Site_Lung 469 110 90 559 0.81 0.84 0.82 Site_Brain 221 64 58 279 0.78 0.79 0.78 Relative_Date 1211 401 111 1322 0.75 0.92 0.83 Race_Ethnicity 57 8 5 62 0.88 0.92 0.90 Gender 1247 17 7 1254 0.99 0.99 
0.99 Oncogene 345 83 104 449 0.81 0.77 0.79 Dosage 900 30 160 1060 0.97 0.85 0.90 Radiation_Dose 108 5 18 126 0.96 0.86 0.90 macro_avg 24653 3999 4406 29059 0.85 0.84 0.84 micro_avg 24653 3999 4406 29059 0.86 0.85 0.85 ``` --- layout: model title: Dutch RoBERTa Embeddings (Shuffled) author: John Snow Labs name: roberta_embeddings_robbertje_1_gb_shuffled date: 2022-04-14 tags: [roberta, embeddings, nl, open_source] task: Embeddings language: nl edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `robbertje-1-gb-shuffled` is a Dutch model originally trained by `DTAI-KULeuven`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_robbertje_1_gb_shuffled_nl_3.4.2_3.0_1649949093454.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_robbertje_1_gb_shuffled_nl_3.4.2_3.0_1649949093454.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_robbertje_1_gb_shuffled","nl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ik hou van vonk nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_robbertje_1_gb_shuffled","nl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ik hou van vonk nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("nl.embed.robbertje_1_gb_shuffled").predict("""Ik hou van vonk nlp""") ```
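The `embeddings` output column holds one dense vector per token. A common downstream use is comparing two tokens (or pooled sentences) with cosine similarity, sketched here in plain Python on toy vectors (the 4-dimensional vectors stand in for the model's real, much larger token embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy 4-dimensional stand-ins for real token embeddings; values invented.
hou = [0.9, 0.1, 0.4, 0.2]
liefde = [0.85, 0.15, 0.35, 0.25]
nlp = [0.1, 0.9, 0.1, 0.8]
print(cosine(hou, liefde) > cosine(hou, nlp))  # True: closer vectors score higher
```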
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_robbertje_1_gb_shuffled| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|nl| |Size:|279.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/DTAI-KULeuven/robbertje-1-gb-shuffled - http://github.com/iPieter/robbert - http://github.com/iPieter/robbertje - https://www.clinjournal.org/clinj/article/view/131 - https://www.clin31.ugent.be - https://arxiv.org/abs/2101.05716 --- layout: model title: Swahili XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_ner_swahili date: 2022-08-01 tags: [sw, open_source, xlm_roberta, ner] task: Named Entity Recognition language: sw edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-ner-swahili` is a Swahili model originally trained by `mbeukman`. ## Predicted Entities `DATE`, `PER`, `ORG`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_ner_swahili_sw_4.1.0_3.0_1659355354706.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_ner_swahili_sw_4.1.0_3.0_1659355354706.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_ner_swahili","sw") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_ner_swahili","sw") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
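The `NerConverter` stage at the end of the pipeline groups token-level BIO tags (e.g. `B-PER`, `I-PER`) into entity chunks. The pure-Python sketch below shows that grouping; the Swahili tokens and tags are invented for illustration, not model output:

```python
def bio_to_chunks(tokens, tags):
    """Group BIO-tagged tokens into (text, label) chunks, as NerConverter does."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Rais", "Samia", "Suluhu", "alifika", "Dodoma", "jana"]
tags = ["O", "B-PER", "I-PER", "O", "B-LOC", "O"]
print(bio_to_chunks(tokens, tags))  # [('Samia Suluhu', 'PER'), ('Dodoma', 'LOC')]
```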
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_ner_swahili| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|sw| |Size:|777.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-ner-swahili - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://github.com/masakhane-io/masakhane-ner --- layout: model title: Mapping Entities with Corresponding RxNorm Codes author: John Snow Labs name: rxnorm_mapper date: 2022-06-27 tags: [rxnorm, chunk_mapper, clinical, licensed, en] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps entities with their corresponding RxNorm codes. ## Predicted Entities `rxnorm_code` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_mapper_en_3.5.3_3.0_1656325497141.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_mapper_en_3.5.3_3.0_1656325497141.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel\ .pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") posology_ner_model = MedicalNerModel\ .pretrained("ner_posology_greedy", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("posology_ner") posology_ner_converter = NerConverterInternal()\ .setInputCols("sentence", "token", "posology_ner")\ .setOutputCol("ner_chunk") chunkerMapper = ChunkMapperModel\ .pretrained("rxnorm_mapper", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings")\ .setRels(["rxnorm_code"]) mapper_pipeline = Pipeline().setStages([ document_assembler, sentence_detector, tokenizer, word_embeddings, posology_ner_model, posology_ner_converter, chunkerMapper]) test_data = spark.createDataFrame([["The patient was given Zyrtec 10 MG, Adapin 10 MG Oral Capsule, Septi-Soothe 0.5 Topical Spray"]]).toDF("text") mapper_model = mapper_pipeline.fit(test_data) result= mapper_model.transform(test_data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val posology_ner_model = MedicalNerModel .pretrained("ner_posology_greedy", "en", "clinical/models") .setInputCols(Array("sentence", "token", 
"embeddings")) .setOutputCol("posology_ner") val posology_ner_converter = new NerConverterInternal() .setInputCols("sentence", "token", "posology_ner") .setOutputCol("ner_chunk") val chunkerMapper = ChunkMapperModel .pretrained("rxnorm_mapper", "en", "clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("mappings") .setRels(Array("rxnorm_code")) val mapper_pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, posology_ner_model, posology_ner_converter, chunkerMapper)) val data = Seq("The patient was given Zyrtec 10 MG, Adapin 10 MG Oral Capsule, Septi-Soothe 0.5 Topical Spray").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.rxnorm_resolver").predict("""The patient was given Zyrtec 10 MG, Adapin 10 MG Oral Capsule, Septi-Soothe 0.5 Topical Spray""") ```
## Results ```bash +------------------------------+-----------+ |ner_chunk |rxnorm_code| +------------------------------+-----------+ |Zyrtec 10 MG |1011483 | |Adapin 10 MG Oral Capsule |1000050 | |Septi-Soothe 0.5 Topical Spray|1000046 | +------------------------------+-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|rxnorm_mapper| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|2.0 MB| --- layout: model title: Extract anatomical entities (Voice of the Patients) author: John Snow Labs name: ner_vop_anatomy_wip date: 2023-04-20 tags: [licensed, clinical, en, ner, vop, patient, anatomy] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts anatomical terms from texts written in the patient’s own words. Note: the ‘wip’ suffix indicates that the model is a work in progress; it will be finalized and its performance improved in upcoming releases. ## Predicted Entities `Laterality`, `BodyPart` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_anatomy_wip_en_4.4.0_3.0_1682012132406.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_anatomy_wip_en_4.4.0_3.0_1682012132406.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_anatomy_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["Ugh, I pulled a muscle in my neck from sleeping weird last night. 
It's like a knot in my trapezius and it hurts to turn my head."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_anatomy_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Ugh, I pulled a muscle in my neck from sleeping weird last night. It's like a knot in my trapezius and it hurts to turn my head.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
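Since this model predicts `Laterality` and `BodyPart` separately, a common post-processing step is attaching each laterality chunk to the body-part chunk that follows it. The sketch below works on (begin, label, text) triples like those in the pipeline output; the pairing rule itself is an assumption for illustration, not part of the model:

```python
def attach_laterality(chunks):
    """Pair each Laterality chunk with the next BodyPart chunk.
    `chunks` is an iterable of (begin, label, text) triples."""
    paired, pending = [], None
    for begin, label, text in sorted(chunks):
        if label == "Laterality":
            pending = text
        elif label == "BodyPart":
            paired.append(f"{pending} {text}" if pending else text)
            pending = None
    return paired

# Invented offsets/labels in the shape the ner_chunk column produces
chunks = [(10, "Laterality", "left"), (15, "BodyPart", "shoulder"),
          (40, "BodyPart", "neck")]
print(attach_laterality(chunks))  # ['left shoulder', 'neck']
```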
## Results ```bash | chunk | ner_label | |:----------|:------------| | muscle | BodyPart | | neck | BodyPart | | trapezius | BodyPart | | head | BodyPart | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_anatomy_wip| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.9 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
## Benchmarking ```bash label tp fp fn total precision recall f1 Laterality 508 39 94 602 0.93 0.84 0.88 BodyPart 2758 202 215 2973 0.93 0.93 0.93 macro_avg 3266 241 309 3575 0.93 0.88 0.90 micro_avg 3266 241 309 3575 0.93 0.91 0.92 ``` --- layout: model title: Spanish Electra Legal Word Embeddings Small model author: John Snow Labs name: legalectra_small date: 2022-07-08 tags: [open_source, legalectra, embeddings, electra, legal, small, es] task: Embeddings language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Spanish Legal Word Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legalectra-small-spanish` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/legalectra_small_es_4.0.0_3.0_1657294835823.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/legalectra_small_es_4.0.0_3.0_1657294835823.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") electra = BertEmbeddings.pretrained("legalectra_small","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, electra]) data = spark.createDataFrame([["Amo a Spark NLP."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val electra = BertEmbeddings.pretrained("legalectra_small","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, electra)) val data = Seq("Amo a Spark NLP.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legalectra_small| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|51.8 MB| |Case sensitive:|true| ## References https://huggingface.co/mrm8488/legalectra-small-spanish --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from ashutoshyadav4) author: John Snow Labs name: distilbert_qa_ashutoshyadav4_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `ashutoshyadav4`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_ashutoshyadav4_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770050371.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_ashutoshyadav4_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770050371.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ashutoshyadav4_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ashutoshyadav4_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_ashutoshyadav4_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/ashutoshyadav4/distilbert-base-uncased-finetuned-squad --- layout: model title: Czech asr_wav2vec2_large_xlsr_53_Czech TFWav2Vec2ForCTC from MehdiHosseiniMoghadam author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_Czech date: 2022-09-25 tags: [wav2vec2, cs, audio, open_source, asr] task: Automatic Speech Recognition language: cs edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_Czech` is a Czech model originally trained by MehdiHosseiniMoghadam. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_Czech_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_Czech_cs_4.2.0_3.0_1664119909971.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_Czech_cs_4.2.0_3.0_1664119909971.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_Czech", "cs")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_Czech", "cs") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_Czech| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|cs| |Size:|1.2 GB| --- layout: model title: English asr_temp TFWav2Vec2ForCTC from ying-tina author: John Snow Labs name: asr_temp date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_temp` is an English model originally trained by ying-tina. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_temp_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_temp_en_4.2.0_3.0_1664110575314.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_temp_en_4.2.0_3.0_1664110575314.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_temp", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_temp", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_temp| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from hiiii23) author: John Snow Labs name: distilbert_qa_hiiii23_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `hiiii23`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_hiiii23_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771181551.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_hiiii23_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771181551.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hiiii23_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hiiii23_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_hiiii23_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/hiiii23/distilbert-base-uncased-finetuned-squad --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s534 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s534 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s534` is a German model originally trained by jonatasgrosman. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s534_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s534_de_4.2.0_3.0_1664105077801.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s534_de_4.2.0_3.0_1664105077801.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s534', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s534", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s534| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Stopwords Remover for Croatian language (339 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, hr, open_source] task: Stop Words Removal language: hr edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_hr_3.4.1_3.0_1646673235236.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_hr_3.4.1_3.0_1646673235236.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","hr") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Nisi bolji od mene"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","hr") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Nisi bolji od mene").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("hr.stopwords").predict("""Nisi bolji od mene""") ```
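Conceptually, the cleaner simply drops every token that appears in its stopword list. A minimal standard-library sketch using a three-word subset of the 339-entry Croatian list (the subset and the case-insensitive matching are illustrative, not the full model):

```python
def remove_stopwords(tokens, stopwords):
    # Keep only tokens that are not in the stopword list (case-insensitive).
    lookup = {w.lower() for w in stopwords}
    return [t for t in tokens if t.lower() not in lookup]

# Illustrative subset of the Croatian stopwords-iso entries.
print(remove_stopwords("Nisi bolji od mene".split(), ["nisi", "od", "mene"]))
# ['bolji']
```

This mirrors the result below, where only "bolji" survives in the `cleanTokens` column.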
## Results ```bash +-------+ |result | +-------+ |[bolji]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|hr| |Size:|2.3 KB| --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_8_h_128 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-8_H-128` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_8_h_128_zh_4.2.4_3.0_1670326027194.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_8_h_128_zh_4.2.4_3.0_1670326027194.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_8_h_128","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_8_h_128","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_8_h_128| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|17.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-8_H-128 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: English image_classifier_vit_exper4_mesum5 ViTForImageClassification from sudo-s author: John Snow Labs name: image_classifier_vit_exper4_mesum5 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_exper4_mesum5` is an English model originally trained by sudo-s.
## Predicted Entities `45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper4_mesum5_en_4.1.0_3.0_1660168056154.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper4_mesum5_en_4.1.0_3.0_1660168056154.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_exper4_mesum5", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_exper4_mesum5", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_exper4_mesum5| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.3 MB| --- layout: model title: English T5ForConditionalGeneration Base Cased model (from Supiri) author: John Snow Labs name: t5_base_conversation date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-conversation` is an English model originally trained by `Supiri`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_conversation_en_4.3.0_3.0_1675108355643.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_conversation_en_4.3.0_3.0_1675108355643.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_base_conversation","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_base_conversation","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_base_conversation| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|926.0 MB| ## References - https://huggingface.co/Supiri/t5-base-conversation - https://docs.unrealengine.com/5.0/en-US/RenderingFeatures/Nanite/ - https://www.youtube.com/watch?v=WU0gvPcc3jQ - https://www.youtube.com/watch?v=Z1OtYGzUoSo - https://www.personality-database.com/profile/2790/hinata-hyga-naruto-shippden-mbti-personality-type --- layout: model title: Legal NER for MAPA(Multilingual Anonymisation for Public Administrations) author: John Snow Labs name: legner_mapa date: 2023-04-27 tags: [en, legal, ner, mapa, licensed] task: Named Entity Recognition language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union. This model extracts `ADDRESS`, `DATE`, `ORGANISATION`, and `PERSON` entities from `English` documents. ## Predicted Entities `ADDRESS`, `DATE`, `ORGANISATION`, `PERSON` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_mapa_en_1.0.0_3.0_1682592120053.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_mapa_en_1.0.0_3.0_1682592120053.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_en_cased", "en")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_mapa", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""From 1 February 2012 until 31 January 2014, thus including the period concerned, Martimpex's workers were posted to Austria to perform the same work."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
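The `ner_converter` stage merges the token-level IOB tags produced by the NER model (`B-DATE`, `I-DATE`, `O`, …) into whole chunks such as `1 February 2012`. A standard-library sketch of that merge (the helper is ours, shown only to illustrate what the `ner_chunk` column contains):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, label) pairs, the way an
    NerConverter stage builds chunks from token-level tags."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O" or an inconsistent I- tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(iob_to_chunks(
    ["From", "1", "February", "2012", "until", "31", "January", "2014"],
    ["O", "B-DATE", "I-DATE", "I-DATE", "O", "B-DATE", "I-DATE", "I-DATE"]))
# [('1 February 2012', 'DATE'), ('31 January 2014', 'DATE')]
```

The two DATE chunks match the first two rows of the results table below.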
## Results ```bash +---------------+------------+ |chunk |ner_label | +---------------+------------+ |1 February 2012|DATE | |31 January 2014|DATE | |Martimpex's |ORGANISATION| |Austria |ADDRESS | +---------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_mapa| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|1.4 MB| ## References The dataset is available [here](https://huggingface.co/datasets/joelito/mapa). ## Benchmarking ```bash label precision recall f1-score support ADDRESS 1.00 1.00 1.00 5 DATE 0.98 1.00 0.99 40 ORGANISATION 0.83 0.71 0.77 14 PERSON 0.98 0.85 0.91 48 macro-avg 0.96 0.90 0.93 107 macro-avg 0.95 0.89 0.92 107 weighted-avg 0.96 0.90 0.93 107 ``` --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from PlanTL-GOB-ES) author: John Snow Labs name: roberta_qa_plantl_gob_es_base_bne_s_c date: 2022-12-02 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bne-sqac` is a Spanish model originally trained by `PlanTL-GOB-ES`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_plantl_gob_es_base_bne_s_c_es_4.2.4_3.0_1669985854414.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_plantl_gob_es_base_bne_s_c_es_4.2.4_3.0_1669985854414.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_plantl_gob_es_base_bne_s_c","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_plantl_gob_es_base_bne_s_c","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_plantl_gob_es_base_bne_s_c|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|es|
|Size:|460.3 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne-sqac
- https://arxiv.org/abs/1907.11692
- http://www.bne.es/en/Inicio/index.html
- https://github.com/PlanTL-GOB-ES/lm-spanish
- https://www.apache.org/licenses/LICENSE-2.0
- http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405
- https://paperswithcode.com/sota?task=question-answering&dataset=SQAC

---
layout: model
title: Detect Clinical Entities (jsl_ner_wip_clinical)
author: John Snow Labs
name: jsl_ner_wip_clinical
date: 2021-03-31
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained named entity recognition deep learning model for clinical terminology. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNN.
## Predicted Entities `Injury_or_Poisoning`, `Direction`, `Test`, `Admission_Discharge`, `Death_Entity`, `Relationship_Status`, `Duration`, `Respiration`, `Hyperlipidemia`, `Birth_Entity`, `Age`, `Labour_Delivery`, `Family_History_Header`, `BMI`, `Temperature`, `Alcohol`, `Kidney_Disease`, `Oncological`, `Medical_History_Header`, `Cerebrovascular_Disease`, `Oxygen_Therapy`, `O2_Saturation`, `Psychological_Condition`, `Heart_Disease`, `Employment`, `Obesity`, `Disease_Syndrome_Disorder`, `Pregnancy`, `ImagingFindings`, `Procedure`, `Medical_Device`, `Race_Ethnicity`, `Section_Header`, `Symptom`, `Treatment`, `Substance`, `Route`, `Drug_Ingredient`, `Blood_Pressure`, `Diet`, `External_body_part_or_region`, `LDL`, `VS_Finding`, `Allergen`, `EKG_Findings`, `Imaging_Technique`, `Triglycerides`, `RelativeTime`, `Gender`, `Pulse`, `Social_History_Header`, `Substance_Quantity`, `Diabetes`, `Modifier`, `Internal_organ_or_component`, `Clinical_Dept`, `Form`, `Drug_BrandName`, `Strength`, `Fetus_NewBorn`, `RelativeDate`, `Height`, `Test_Result`, `Sexually_Active_or_Sexual_Orientation`, `Frequency`, `Time`, `Weight`, `Vaccine`, `Vital_Signs_Header`, `Communicable_Disease`, `Dosage`, `Overweight`, `Hypertension`, `HDL`, `Total_Cholesterol`, `Smoking`, `Date`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_clinical_en_3.0.0_3.0_1617208406089.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_clinical_en_3.0.0_3.0_1617208406089.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

jsl_ner = MedicalNerModel.pretrained("jsl_ner_wip_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("jsl_ner")

jsl_ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "jsl_ner"]) \
    .setOutputCol("ner_chunk")

jsl_ner_pipeline = Pipeline().setStages([
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    jsl_ner,
    jsl_ner_converter])

jsl_ner_model = jsl_ner_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))

data = spark.createDataFrame([["""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea.
His bowel movements are yellow colored and soft in nature."""]]).toDF("text")

result = jsl_ner_model.transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val jsl_ner = MedicalNerModel.pretrained("jsl_ner_wip_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("jsl_ner")

val jsl_ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "jsl_ner"))
    .setOutputCol("ner_chunk")

val jsl_ner_pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    jsl_ner,
    jsl_ner_converter))

val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea.
His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text")

val result = jsl_ner_pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
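The `NerConverter` stage at the end of the pipeline merges the token-level IOB tags emitted by the NER model (`B-Age`, `I-Age`, `O`, …) into the entity chunks shown in the Results table. A minimal sketch of that merging logic in plain Python (illustrative only, not Spark NLP's actual implementation):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # a new entity starts here
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)           # continuation of the open entity
        else:                               # "O" or an inconsistent tag closes it
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["The", "patient", "is", "a", "21-day-old", "Caucasian", "male"]
tags   = ["O", "O", "O", "O", "B-Age", "B-Race_Ethnicity", "B-Gender"]
print(iob_to_chunks(tokens, tags))
# -> [('21-day-old', 'Age'), ('Caucasian', 'Race_Ethnicity'), ('male', 'Gender')]
```

This is why the Results column `ner_label` carries one label per chunk even though the model itself scores individual tokens.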
## Results ```bash +-----------------------------------------+----------------------------+ |chunk |ner_label | +-----------------------------------------+----------------------------+ |21-day-old |Age | |Caucasian |Race_Ethnicity | |male |Gender | |for 2 days |Duration | |congestion |Symptom | |mom |Gender | |yellow |Modifier | |discharge |Symptom | |nares |External_body_part_or_region| |she |Gender | |mild |Modifier | |problems with his breathing while feeding|Symptom | |perioral cyanosis |Symptom | |retractions |Symptom | |One day ago |RelativeDate | |mom |Gender | |Tylenol |Drug_BrandName | |Baby |Age | |decreased p.o. intake |Symptom | |His |Gender | +-----------------------------------------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|jsl_ner_wip_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on data gathered and manually annotated by John Snow Labs. https://www.johnsnowlabs.com/data/ ## Benchmarking ```bash entity tp fp fn total precision recall f1 VS_Finding 235.0 46.0 43.0 278.0 0.8363 0.8453 0.8408 Direction 3972.0 465.0 458.0 4430.0 0.8952 0.8966 0.8959 Respiration 82.0 4.0 4.0 86.0 0.9535 0.9535 0.9535 Cerebrovascular_D... 93.0 20.0 24.0 117.0 0.823 0.7949 0.8087 Family_History_He... 88.0 6.0 3.0 91.0 0.9362 0.967 0.9514 Heart_Disease 447.0 82.0 119.0 566.0 0.845 0.7898 0.8164 RelativeTime 158.0 80.0 59.0 217.0 0.6639 0.7281 0.6945 Strength 624.0 58.0 53.0 677.0 0.915 0.9217 0.9183 Smoking 121.0 11.0 4.0 125.0 0.9167 0.968 0.9416 Medical_Device 3716.0 491.0 466.0 4182.0 0.8833 0.8886 0.8859 Pulse 136.0 22.0 14.0 150.0 0.8608 0.9067 0.8831 Psychological_Con... 
135.0 9.0 29.0 164.0 0.9375 0.8232 0.8766 Overweight 2.0 1.0 0.0 2.0 0.6667 1.0 0.8 Triglycerides 3.0 0.0 2.0 5.0 1.0 0.6 0.75 Obesity 42.0 5.0 6.0 48.0 0.8936 0.875 0.8842 Admission_Discharge 318.0 24.0 11.0 329.0 0.9298 0.9666 0.9478 HDL 3.0 0.0 0.0 3.0 1.0 1.0 1.0 Diabetes 110.0 14.0 8.0 118.0 0.8871 0.9322 0.9091 Section_Header 3740.0 148.0 157.0 3897.0 0.9619 0.9597 0.9608 Age 627.0 75.0 48.0 675.0 0.8932 0.9289 0.9107 O2_Saturation 34.0 14.0 17.0 51.0 0.7083 0.6667 0.6869 Kidney_Disease 96.0 12.0 34.0 130.0 0.8889 0.7385 0.8067 Test 2504.0 545.0 498.0 3002.0 0.8213 0.8341 0.8276 Communicable_Disease 21.0 10.0 6.0 27.0 0.6774 0.7778 0.7241 Hypertension 162.0 5.0 10.0 172.0 0.9701 0.9419 0.9558 External_body_par... 2626.0 356.0 413.0 3039.0 0.8806 0.8641 0.8723 Oxygen_Therapy 81.0 15.0 14.0 95.0 0.8438 0.8526 0.8482 Modifier 2341.0 404.0 539.0 2880.0 0.8528 0.8128 0.8324 Test_Result 1007.0 214.0 255.0 1262.0 0.8247 0.7979 0.8111 BMI 9.0 1.0 0.0 9.0 0.9 1.0 0.9474 Labour_Delivery 57.0 23.0 33.0 90.0 0.7125 0.6333 0.6706 Employment 271.0 59.0 55.0 326.0 0.8212 0.8313 0.8262 Fetus_NewBorn 66.0 33.0 51.0 117.0 0.6667 0.5641 0.6111 Clinical_Dept 923.0 110.0 83.0 1006.0 0.8935 0.9175 0.9053 Time 29.0 13.0 16.0 45.0 0.6905 0.6444 0.6667 Procedure 3185.0 462.0 501.0 3686.0 0.8733 0.8641 0.8687 Diet 36.0 20.0 45.0 81.0 0.6429 0.4444 0.5255 Oncological 459.0 61.0 55.0 514.0 0.8827 0.893 0.8878 LDL 3.0 0.0 3.0 6.0 1.0 0.5 0.6667 Symptom 7104.0 1302.0 1200.0 8304.0 0.8451 0.8555 0.8503 Temperature 116.0 6.0 8.0 124.0 0.9508 0.9355 0.9431 Vital_Signs_Header 215.0 29.0 24.0 239.0 0.8811 0.8996 0.8903 Relationship_Status 49.0 2.0 1.0 50.0 0.9608 0.98 0.9703 Total_Cholesterol 11.0 4.0 5.0 16.0 0.7333 0.6875 0.7097 Blood_Pressure 158.0 18.0 22.0 180.0 0.8977 0.8778 0.8876 Injury_or_Poisoning 579.0 130.0 127.0 706.0 0.8166 0.8201 0.8184 Drug_Ingredient 1716.0 153.0 132.0 1848.0 0.9181 0.9286 0.9233 Treatment 136.0 36.0 60.0 196.0 0.7907 0.6939 0.7391 Pregnancy 123.0 36.0 51.0 
174.0 0.7736 0.7069 0.7387 Vaccine 13.0 2.0 6.0 19.0 0.8667 0.6842 0.7647 Disease_Syndrome_... 2981.0 559.0 446.0 3427.0 0.8421 0.8699 0.8557 Height 30.0 10.0 15.0 45.0 0.75 0.6667 0.7059 Frequency 595.0 99.0 138.0 733.0 0.8573 0.8117 0.8339 Route 858.0 76.0 89.0 947.0 0.9186 0.906 0.9123 Duration 351.0 99.0 108.0 459.0 0.78 0.7647 0.7723 Death_Entity 43.0 14.0 5.0 48.0 0.7544 0.8958 0.819 Internal_organ_or... 6477.0 972.0 991.0 7468.0 0.8695 0.8673 0.8684 Alcohol 80.0 18.0 13.0 93.0 0.8163 0.8602 0.8377 Substance_Quantity 6.0 7.0 4.0 10.0 0.4615 0.6 0.5217 Date 498.0 38.0 19.0 517.0 0.9291 0.9632 0.9459 Hyperlipidemia 47.0 3.0 3.0 50.0 0.94 0.94 0.94 Social_History_He... 99.0 7.0 7.0 106.0 0.934 0.934 0.934 Race_Ethnicity 116.0 0.0 0.0 116.0 1.0 1.0 1.0 Imaging_Technique 40.0 18.0 47.0 87.0 0.6897 0.4598 0.5517 Drug_BrandName 859.0 62.0 61.0 920.0 0.9327 0.9337 0.9332 RelativeDate 566.0 124.0 143.0 709.0 0.8203 0.7983 0.8091 Gender 6096.0 80.0 101.0 6197.0 0.987 0.9837 0.9854 Dosage 244.0 31.0 57.0 301.0 0.8873 0.8106 0.8472 Form 234.0 32.0 55.0 289.0 0.8797 0.8097 0.8432 Medical_History_H... 114.0 9.0 10.0 124.0 0.9268 0.9194 0.9231 Birth_Entity 4.0 2.0 3.0 7.0 0.6667 0.5714 0.6154 Substance 59.0 8.0 11.0 70.0 0.8806 0.8429 0.8613 Sexually_Active_o... 
5.0 3.0 4.0 9.0 0.625 0.5556 0.5882 Weight 90.0 10.0 21.0 111.0 0.9 0.8108 0.8531 macro - - - - - - 0.8148 micro - - - - - - 0.8788 ``` --- layout: model title: Spanish RobertaForQuestionAnswering Large Cased model (from jamarju) author: John Snow Labs name: roberta_qa_large_bne_squad_2.0 date: 2022-12-02 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-bne-squad-2.0-es` is a Spanish model originally trained by `jamarju`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_large_bne_squad_2.0_es_4.2.4_3.0_1669987318036.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_large_bne_squad_2.0_es_4.2.4_3.0_1669987318036.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_bne_squad_2.0","es")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_large_bne_squad_2.0","es")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
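Because this model was fine-tuned on a SQuAD-2.0-style dataset, some questions are unanswerable from the given context. A common convention in SQuAD 2.0 systems is to keep the best span only if its score beats a "no answer" score (typically the score of the [CLS] position) by more than a tuned threshold. A small sketch of that decision rule, with made-up scores (this is background on the training objective, not this annotator's API):

```python
def keep_answer(best_span_score, null_score, threshold=0.0):
    """SQuAD 2.0-style decision: return True to keep the predicted span,
    False to predict 'no answer'. The threshold is tuned on dev data."""
    return best_span_score - null_score > threshold

# Made-up scores for illustration only
print(keep_answer(5.2, 1.0))  # confident span            -> True
print(keep_answer(0.8, 2.5))  # context lacks the answer  -> False
```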
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_large_bne_squad_2.0|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|es|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/jamarju/roberta-large-bne-squad-2.0-es
- https://github.com/PlanTL-SANIDAD/lm-spanish
- https://github.com/ccasimiro88/TranslateAlignRetrieve

---
layout: model
title: Sentiment Analysis of Turkish texts
author: John Snow Labs
name: classifierdl_use_sentiment
date: 2021-10-19
tags: [tr, sentiment, use, classification, open_source]
task: Sentiment Analysis
language: tr
edition: Spark NLP 3.3.0
spark_version: 2.4
supported: true
annotator: ClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model identifies the sentiments (positive or negative) in Turkish texts.

## Predicted Entities

`POSITIVE`, `NEGATIVE`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_TR/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_TR_SENTIMENT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_sentiment_tr_3.3.0_2.4_1634634525008.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_sentiment_tr_3.3.0_2.4_1634634525008.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")\
    .setCleanupMode("shrink")

embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_multi", "xx") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

sentimentClassifier = ClassifierDLModel.pretrained("classifierdl_use_sentiment", "tr") \
    .setInputCols(["document", "sentence_embeddings"]) \
    .setOutputCol("class")

tr_sentiment_pipeline = Pipeline(stages=[document, embeddings, sentimentClassifier])

light_pipeline = LightPipeline(tr_sentiment_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

result1 = light_pipeline.annotate("Bu sıralar kafam çok karışık.")
result2 = light_pipeline.annotate("Sınavımı geçtiğimi öğrenince derin bir nefes aldım.")

print(result1["class"], result2["class"], sep = "\n")
```
```scala
val document = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
    .setCleanupMode("shrink")

val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_multi", "xx")
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")

val sentimentClassifier = ClassifierDLModel.pretrained("classifierdl_use_sentiment", "tr")
    .setInputCols(Array("document", "sentence_embeddings"))
    .setOutputCol("class")

val tr_sentiment_pipeline = new Pipeline().setStages(Array(document, embeddings, sentimentClassifier))

val light_pipeline = new LightPipeline(tr_sentiment_pipeline.fit(Seq("").toDF("text")))

val result1 = light_pipeline.annotate("Bu sıralar kafam çok karışık.")
val result2 = light_pipeline.annotate("Sınavımı geçtiğimi öğrenince derin bir nefes aldım.")
```
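The Benchmarking section of this card reports macro and weighted averages; the difference is whether each class counts equally or in proportion to its support. A quick sketch of both aggregations, applied to the per-class F1 scores from the table (generic scikit-learn-style definitions, not this model's evaluation code):

```python
def macro_avg(scores):
    """Unweighted mean over classes."""
    return sum(scores.values()) / len(scores)

def weighted_avg(scores, support):
    """Mean over classes, weighted by class support."""
    total = sum(support.values())
    return sum(scores[c] * support[c] for c in scores) / total

# Per-class F1 and supports from this card's benchmark table
f1 = {"NEGATIVE": 0.87, "POSITIVE": 0.86}
support = {"NEGATIVE": 19967, "POSITIVE": 19826}

print(f"{macro_avg(f1):.4f}")           # -> 0.8650
print(f"{weighted_avg(f1, support):.4f}")  # -> 0.8650
```

With nearly balanced supports the two averages coincide here; they diverge on skewed datasets, where the weighted average tracks the majority class.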
## Results

```bash
['NEGATIVE']
['POSITIVE']
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|classifierdl_use_sentiment|
|Compatibility:|Spark NLP 3.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|tr|

## Data Source

https://raw.githubusercontent.com/gurkandyilmaz/sentiment/master/data/

## Benchmarking

```bash
label         precision  recall  f1-score  support
NEGATIVE           0.86    0.88      0.87    19967
POSITIVE           0.88    0.85      0.86    19826
accuracy                             0.87    39793
macro avg          0.87    0.87      0.87    39793
weighted avg       0.87    0.87      0.87    39793
```

---
layout: model
title: English RobertaForQuestionAnswering Tiny Cased model (from hf-internal-testing)
author: John Snow Labs
name: roberta_qa_tiny_random_forquestionanswering
date: 2023-01-20
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tiny-random-RobertaForQuestionAnswering` is an English model originally trained by `hf-internal-testing`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_tiny_random_forquestionanswering_en_4.3.0_3.0_1674224369695.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_tiny_random_forquestionanswering_en_4.3.0_3.0_1674224369695.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_tiny_random_forquestionanswering","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_tiny_random_forquestionanswering","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_tiny_random_forquestionanswering|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|681.7 KB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/hf-internal-testing/tiny-random-RobertaForQuestionAnswering

---
layout: model
title: Multilabel Classification of NDA Clauses (paragraph, medium)
author: John Snow Labs
name: legmulticlf_mnda_sections_paragraph_other
date: 2023-03-09
tags: [nda, en, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
recommended: true
engine: tensorflow
annotator: MultiClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a version of `legmulticlf_mnda_sections_other` (sentence, medium) that expects a larger-than-sentence context, ideally 2 to 5 sentences or a small paragraph, to provide more context. It should be run on sentences of the NDA clauses and will retrieve a series of 1..N labels for each of them. The clause types detected by this model in NDA / MNDA agreements are:

1. Parties to the Agreement - Names of the Parties Clause
2. Identification of What Information Is Confidential - Definition of Confidential Information Clause
3. Use of Confidential Information: Permitted Use Clause and Obligations of the Recipient
4. Time Frame of the Agreement - Termination Clause
5. Return of Confidential Information Clause
6. Remedies for Breaches of Agreement - Remedies Clause
7. Non-Solicitation Clause
8. Dispute Resolution Clause
9. Exceptions Clause
10. Non-competition clause
11.
Other: Nothing of the above (synonym of `[]`)

## Predicted Entities

`APPLIC_LAW`, `ASSIGNMENT`, `DEF_OF_CONF_INFO`, `DISPUTE_RESOL`, `EXCEPTIONS`, `NAMES_OF_PARTIES`, `NON_COMP`, `NON_SOLIC`, `PREAMBLE`, `REMEDIES`, `REQ_DISCL`, `RETURN_OF_CONF_INFO`, `TERMINATION`, `USE_OF_CONF_INFO`, `OTHER`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legmulticlf_mnda_sections_paragraph_other_en_1.0.0_3.0_1678377832037.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legmulticlf_mnda_sections_paragraph_other_en_1.0.0_3.0_1678377832037.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = (
    nlp.DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

sentence_detector = (
    nlp.SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")
    .setExplodeSentences(True)
    .setCustomBounds(['\n'])
)

embeddings = (
    nlp.UniversalSentenceEncoder.pretrained()
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")
)

paragraph_classifier = (
    nlp.MultiClassifierDLModel.pretrained("legmulticlf_mnda_sections_paragraph_other", "en", "legal/models")
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
)

sentence_pipeline = nlp.Pipeline(stages=[document_assembler, sentence_detector, embeddings, paragraph_classifier])
prediction_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, paragraph_classifier])

text = """RECITALS WHEREAS, Corvus Gold Nevada Inc., a Nevada corporation (“Corvus”) and AngloGold Xxxxxxx (U.S.A.) Exploration, a Delaware corporation (“Viewer”) entered into that certain Confidentiality Agreement with an effective date of December 4, 2017 (“CA”); and WHEREAS, Corvus and Viewer desire to amend the terms of the CA pursuant to the terms of this Amendment; and WHEREAS, any terms not defined herein shall have the meanings set forth in the CA, as amended from time to time EXECUTION VERSION VITAL IMAGES, INC. TOSHIBA MEDICAL SYSTEMS CORPORATION Confidentiality Agreement This Confidentiality Agreement (this “Agreement”) dated as of January 28, 2011, between VITAL IMAGES, INC., a Minnesota corporation (“Vital Images” or the “Company”), and TOSHIBA MEDICAL SYSTEMS CORPORATION, a Japanese corporation (“TMSC” or the “Receiving Company”).
W I T N E S S E T H: WHEREAS, the Parties wish to consider a strategic business transaction (the “Transaction”) and, in connection therewith, desire to set forth certain agreements regarding such consideration and the sharing of confidential and proprietary information by Vital Images with TMSC; """

df = spark.createDataFrame([[""]]).toDF("text")

sentence_model = sentence_pipeline.fit(df)
prediction_model = prediction_pipeline.fit(df)

sentence_lp = nlp.LightPipeline(sentence_model)
prediction_lp = nlp.LightPipeline(prediction_model)

res = sentence_lp.fullAnnotate(text)
sentences = [x.result for x in res[0]['sentence']]

for i, s in enumerate(sentences):
    prev_sentence = "" if i == 0 else sentences[i - 1]
    next_sentence = "" if i >= len(sentences) - 1 else sentences[i + 1]
    chunk = " ".join([prev_sentence, s, next_sentence]).strip()
    print(f"{prediction_lp.annotate(chunk)['class']}: {chunk}")
```
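The prev/current/next windowing in the snippet above is what turns sentence-level predictions into the paragraph-sized contexts this model was trained on. The same logic isolated as a plain function, with no Spark required:

```python
def context_windows(sentences):
    """For each sentence, join it with its immediate neighbours to form
    the chunk that is actually sent to the paragraph classifier."""
    chunks = []
    for i, s in enumerate(sentences):
        prev_sentence = "" if i == 0 else sentences[i - 1]
        next_sentence = "" if i >= len(sentences) - 1 else sentences[i + 1]
        chunks.append(" ".join([prev_sentence, s, next_sentence]).strip())
    return chunks

print(context_windows(["A.", "B.", "C."]))
# -> ['A. B.', 'A. B. C.', 'B. C.']
```

Note that each sentence appears in up to three chunks, so neighbouring predictions are correlated by construction; that is intentional, since the model expects 2 to 5 sentences of context.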
## Results

```bash
['PREAMBLE']: WHEREAS, Corvus Gold Nevada Inc., a Nevada corporation (“Corvus”) and AngloGold Xxxxxxx (U.S.A.) Exploration, a Delaware corporation (“Viewer”) entered into that certain Confidentiality Agreement with an effective date of December 4, 2017 (“CA”); and WHEREAS, Corvus and Viewer desire to amend the terms of the CA pursuant to the terms of this Amendment;
['DEF_OF_CONF_INFO']: and WHEREAS, Corvus and Viewer desire to amend the terms of the CA pursuant to the terms of this Amendment; and
['OTHER', 'PREAMBLE']: WHEREAS, Corvus and Viewer desire to amend the terms of the CA pursuant to the terms of this Amendment; and WHEREAS, any terms not defined herein shall have the meanings set forth in the CA, as amended from time to time
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legmulticlf_mnda_sections_paragraph_other|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|13.4 MB|

## References

In-house MNDA

## Benchmarking

```bash
label                precision  recall  f1-score  support
APPLIC_LAW                0.86    0.84      0.85       57
ASSIGNMENT                0.87    0.83      0.85       41
DEF_OF_CONF_INFO          0.86    0.76      0.81       67
DISPUTE_RESOL             0.84    0.69      0.76       70
EXCEPTIONS                0.84    0.79      0.82      109
NAMES_OF_PARTIES          0.90    0.76      0.83       50
NON_COMP                  0.79    0.67      0.72       33
NON_SOLIC                 0.81    0.82      0.81       82
OTHER                     0.91    0.89      0.90      838
PREAMBLE                  0.86    0.78      0.81       76
REMEDIES                  0.91    0.84      0.87       87
REQ_DISCL                 0.91    0.77      0.84       83
RETURN_OF_CONF_INFO       0.78    0.79      0.79       78
TERMINATION               0.74    0.67      0.70       42
USE_OF_CONF_INFO          0.77    0.84      0.80      200
micro-avg                 0.87    0.83      0.85     1913
macro-avg                 0.84    0.78      0.81     1913
weighted-avg              0.87    0.83      0.85     1913
samples-avg               0.82    0.84      0.83     1913
```

---
layout: model
title: English image_classifier_vit_tiny__random ViTForImageClassification from lysandre
author: John Snow Labs
name: image_classifier_vit_tiny__random
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_tiny__random` is a English model originally trained by lysandre. ## Predicted Entities `turnstile`, `damselfly`, `mixing bowl`, `sea snake`, `cockroach, roach`, `buckle`, `beer glass`, `bulbul`, `lumbermill, sawmill`, `whippet`, `Australian terrier`, `television, television system`, `hoopskirt, crinoline`, `horse cart, horse-cart`, `guillotine`, `malamute, malemute, Alaskan malamute`, `coyote, prairie wolf, brush wolf, Canis latrans`, `colobus, colobus monkey`, `hognose snake, puff adder, sand viper`, `sock`, `burrito`, `printer`, `bathing cap, swimming cap`, `chiton, coat-of-mail shell, sea cradle, polyplacophore`, `Rottweiler`, `cello, violoncello`, `pitcher, ewer`, `computer keyboard, keypad`, `bow`, `peacock`, `ballplayer, baseball player`, `refrigerator, icebox`, `solar dish, solar collector, solar furnace`, `passenger car, coach, carriage`, `African chameleon, Chamaeleo chamaeleon`, `oboe, hautboy, hautbois`, `toyshop`, `Leonberg`, `howler monkey, howler`, `bluetick`, `African elephant, Loxodonta africana`, `American lobster, Northern lobster, Maine lobster, Homarus americanus`, `combination lock`, `black-and-tan coonhound`, `bonnet, poke bonnet`, `harvester, reaper`, `Appenzeller`, `iron, smoothing iron`, `electric locomotive`, `lycaenid, lycaenid butterfly`, `sandbar, sand bar`, `Cardigan, Cardigan Welsh corgi`, `pencil sharpener`, `jean, blue jean, denim`, `backpack, back pack, knapsack, packsack, rucksack, haversack`, `monitor`, `ice cream, icecream`, `apiary, bee house`, `water jug`, `American coot, marsh hen, mud hen, water hen, Fulica americana`, `ground beetle, carabid beetle`, `jigsaw puzzle`, `ant, emmet, 
pismire`, `wreck`, `kuvasz`, `gyromitra`, `Ibizan hound, Ibizan Podenco`, `brown bear, bruin, Ursus arctos`, `bolo tie, bolo, bola tie, bola`, `Pembroke, Pembroke Welsh corgi`, `French bulldog`, `prison, prison house`, `ballpoint, ballpoint pen, ballpen, Biro`, `stage`, `airliner`, `dogsled, dog sled, dog sleigh`, `redshank, Tringa totanus`, `menu`, `Indian cobra, Naja naja`, `swab, swob, mop`, `window screen`, `brain coral`, `artichoke, globe artichoke`, `loupe, jeweler's loupe`, `loudspeaker, speaker, speaker unit, loudspeaker system, speaker system`, `panpipe, pandean pipe, syrinx`, `wok`, `croquet ball`, `plate`, `scoreboard`, `Samoyed, Samoyede`, `ocarina, sweet potato`, `beaver`, `borzoi, Russian wolfhound`, `horizontal bar, high bar`, `stretcher`, `seat belt, seatbelt`, `obelisk`, `forklift`, `feather boa, boa`, `frying pan, frypan, skillet`, `barbershop`, `hamper`, `face powder`, `Siamese cat, Siamese`, `ladle`, `dingo, warrigal, warragal, Canis dingo`, `mountain tent`, `head cabbage`, `echidna, spiny anteater, anteater`, `Polaroid camera, Polaroid Land camera`, `dumbbell`, `espresso`, `notebook, notebook computer`, `Norfolk terrier`, `binoculars, field glasses, opera glasses`, `carpenter's kit, tool kit`, `moving van`, `catamaran`, `tiger beetle`, `bikini, two-piece`, `Siberian husky`, `studio couch, day bed`, `bulletproof vest`, `lawn mower, mower`, `promontory, headland, head, foreland`, `soap dispenser`, `vulture`, `dam, dike, dyke`, `brambling, Fringilla montifringilla`, `toilet tissue, toilet paper, bathroom tissue`, `ringlet, ringlet butterfly`, `tiger cat`, `mobile home, manufactured home`, `Norwich terrier`, `little blue heron, Egretta caerulea`, `English setter`, `Tibetan mastiff`, `rocking chair, rocker`, `mask`, `maze, labyrinth`, `bookcase`, `viaduct`, `sweatshirt`, `plow, plough`, `basenji`, `typewriter keyboard`, `Windsor tie`, `coral fungus`, `desktop computer`, `Kerry blue terrier`, `Angora, Angora rabbit`, `can opener, tin opener`, 
`shield, buckler`, `triumphal arch`, `horned viper, cerastes, sand viper, horned asp, Cerastes cornutus`, `miniature schnauzer`, `tape player`, `jaguar, panther, Panthera onca, Felis onca`, `hook, claw`, `file, file cabinet, filing cabinet`, `chime, bell, gong`, `shower curtain`, `window shade`, `acoustic guitar`, `gas pump, gasoline pump, petrol pump, island dispenser`, `cicada, cicala`, `Petri dish`, `paintbrush`, `banana`, `chickadee`, `mountain bike, all-terrain bike, off-roader`, `lighter, light, igniter, ignitor`, `oil filter`, `cab, hack, taxi, taxicab`, `Christmas stocking`, `rugby ball`, `black widow, Latrodectus mactans`, `bustard`, `fiddler crab`, `web site, website, internet site, site`, `chocolate sauce, chocolate syrup`, `chainlink fence`, `fireboat`, `cocktail shaker`, `airship, dirigible`, `projectile, missile`, `bagel, beigel`, `screwdriver`, `oystercatcher, oyster catcher`, `pot, flowerpot`, `water bottle`, `Loafer`, `drumstick`, `soccer ball`, `cairn, cairn terrier`, `padlock`, `tow truck, tow car, wrecker`, `bloodhound, sleuthhound`, `punching bag, punch bag, punching ball, punchball`, `great grey owl, great gray owl, Strix nebulosa`, `scale, weighing machine`, `trench coat`, `briard`, `cheetah, chetah, Acinonyx jubatus`, `entertainment center`, `Boston bull, Boston terrier`, `Arabian camel, dromedary, Camelus dromedarius`, `steam locomotive`, `coil, spiral, volute, whorl, helix`, `plane, carpenter's plane, woodworking plane`, `gondola`, `spider web, spider's web`, `bathtub, bathing tub, bath, tub`, `pelican`, `miniature poodle`, `cowboy boot`, `perfume, essence`, `lakeside, lakeshore`, `timber wolf, grey wolf, gray wolf, Canis lupus`, `moped`, `sunscreen, sunblock, sun blocker`, `Brabancon griffon`, `puffer, pufferfish, blowfish, globefish`, `lifeboat`, `pool table, billiard table, snooker table`, `Bouvier des Flandres, Bouviers des Flandres`, `Pomeranian`, `theater curtain, theatre curtain`, `marimba, xylophone`, `baboon`, `vacuum, vacuum 
cleaner`, `pill bottle`, `pick, plectrum, plectron`, `hen`, `American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier`, `digital watch`, `pier`, `oxygen mask`, `Tibetan terrier, chrysanthemum dog`, `ostrich, Struthio camelus`, `water ouzel, dipper`, `drilling platform, offshore rig`, `magnetic compass`, `throne`, `butternut squash`, `minibus`, `EntleBucher`, `carousel, carrousel, merry-go-round, roundabout, whirligig`, `hot pot, hotpot`, `rain barrel`, `wood rabbit, cottontail, cottontail rabbit`, `miniature pinscher`, `partridge`, `three-toed sloth, ai, Bradypus tridactylus`, `English springer, English springer spaniel`, `corkscrew, bottle screw`, `fur coat`, `robin, American robin, Turdus migratorius`, `dowitcher`, `ruddy turnstone, Arenaria interpres`, `water snake`, `stove`, `Great Pyrenees`, `soft-coated wheaten terrier`, `carbonara`, `snail`, `breastplate, aegis, egis`, `wolf spider, hunting spider`, `hatchet`, `CD player`, `axolotl, mud puppy, Ambystoma mexicanum`, `pomegranate`, `poncho`, `leatherback turtle, leatherback, leathery turtle, Dermochelys coriacea`, `lorikeet`, `spatula`, `jay`, `platypus, duckbill, duckbilled platypus, duck-billed platypus, Ornithorhynchus anatinus`, `stethoscope`, `flagpole, flagstaff`, `coho, cohoe, coho salmon, blue jack, silver salmon, Oncorhynchus kisutch`, `agama`, `red wolf, maned wolf, Canis rufus, Canis niger`, `beaker`, `eft`, `pretzel`, `brassiere, bra, bandeau`, `frilled lizard, Chlamydosaurus kingi`, `joystick`, `goldfish, Carassius auratus`, `fig`, `maypole`, `caldron, cauldron`, `admiral`, `impala, Aepyceros melampus`, `spotted salamander, Ambystoma maculatum`, `syringe`, `hog, pig, grunter, squealer, Sus scrofa`, `handkerchief, hankie, hanky, hankey`, `tarantula`, `cheeseburger`, `pinwheel`, `sax, saxophone`, `dung beetle`, `broccoli`, `cassette player`, `milk can`, `traffic light, traffic signal, stoplight`, `shovel`, `sarong`, `tabby, tabby cat`, `parallel bars, bars`, 
`ladybug, ladybeetle, lady beetle, ladybird, ladybird beetle`, `quill, quill pen`, `giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca`, `steel drum`, `quail`, `Blenheim spaniel`, `wig`, `hamster`, `ice lolly, lolly, lollipop, popsicle`, `seashore, coast, seacoast, sea-coast`, `chest`, `worm fence, snake fence, snake-rail fence, Virginia fence`, `missile`, `beer bottle`, `yellow lady's slipper, yellow lady-slipper, Cypripedium calceolus, Cypripedium parviflorum`, `breakwater, groin, groyne, mole, bulwark, seawall, jetty`, `white wolf, Arctic wolf, Canis lupus tundrarum`, `guacamole`, `porcupine, hedgehog`, `trolleybus, trolley coach, trackless trolley`, `greenhouse, nursery, glasshouse`, `trimaran`, `Italian greyhound`, `potter's wheel`, `jacamar`, `wallet, billfold, notecase, pocketbook`, `Lakeland terrier`, `green lizard, Lacerta viridis`, `indigo bunting, indigo finch, indigo bird, Passerina cyanea`, `green mamba`, `walking stick, walkingstick, stick insect`, `crossword puzzle, crossword`, `eggnog`, `barrow, garden cart, lawn cart, wheelbarrow`, `remote control, remote`, `bicycle-built-for-two, tandem bicycle, tandem`, `wool, woolen, woollen`, `black grouse`, `abaya`, `marmoset`, `golf ball`, `jeep, landrover`, `Mexican hairless`, `dishwasher, dish washer, dishwashing machine`, `jersey, T-shirt, tee shirt`, `planetarium`, `goose`, `mailbox, letter box`, `capuchin, ringtail, Cebus capucinus`, `marmot`, `orangutan, orang, orangutang, Pongo pygmaeus`, `coffeepot`, `ambulance`, `shopping basket`, `pop bottle, soda bottle`, `red fox, Vulpes vulpes`, `crash helmet`, `street sign`, `affenpinscher, monkey pinscher, monkey dog`, `Arctic fox, white fox, Alopex lagopus`, `sidewinder, horned rattlesnake, Crotalus cerastes`, `ruffed grouse, partridge, Bonasa umbellus`, `muzzle`, `measuring cup`, `canoe`, `reflex camera`, `fox squirrel, eastern fox squirrel, Sciurus niger`, `French loaf`, `killer whale, killer, orca, grampus, sea wolf, Orcinus orca`, `dial 
telephone, dial phone`, `thimble`, `bubble`, `vizsla, Hungarian pointer`, `running shoe`, `mailbag, postbag`, `radio telescope, radio reflector`, `piggy bank, penny bank`, `Chihuahua`, `chambered nautilus, pearly nautilus, nautilus`, `Airedale, Airedale terrier`, `kimono`, `green snake, grass snake`, `rubber eraser, rubber, pencil eraser`, `upright, upright piano`, `orange`, `revolver, six-gun, six-shooter`, `ashcan, trash can, garbage can, wastebin, ash bin, ash-bin, ashbin, dustbin, trash barrel, trash bin`, `drum, membranophone, tympan`, `Dungeness crab, Cancer magister`, `lipstick, lip rouge`, `gong, tam-tam`, `fountain`, `tub, vat`, `malinois`, `sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita`, `German short-haired pointer`, `apron`, `Irish setter, red setter`, `dishrag, dishcloth`, `school bus`, `candle, taper, wax light`, `bib`, `cash machine, cash dispenser, automated teller machine, automatic teller machine, automated teller, automatic teller, ATM`, `power drill`, `English foxhound`, `miniskirt, mini`, `swing`, `slug`, `hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa`, `rifle`, `Saluki, gazelle hound`, `Sealyham terrier, Sealyham`, `bullet train, bullet`, `hyena, hyaena`, `ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus`, `toy terrier`, `goblet`, `safe`, `cup`, `electric guitar`, `red wine`, `restaurant, eating house, eating place, eatery`, `wall clock`, `washbasin, handbasin, washbowl, lavabo, wash-hand basin`, `red-breasted merganser, Mergus serrator`, `crate`, `banded gecko`, `hippopotamus, hippo, river horse, Hippopotamus amphibius`, `tick`, `tripod`, `sombrero`, `desk`, `sea slug, nudibranch`, `racer, race car, racing car`, `pizza, pizza pie`, `dining table, board`, `Saint Bernard, St Bernard`, `komondor`, `electric ray, crampfish, numbfish, torpedo`, `prairie chicken, prairie grouse, prairie fowl`, `coffee mug`, `hammer`, `golfcart, golf cart`, `unicycle, monocycle`, `bison`, `soup bowl`, `rapeseed`, 
`golden retriever`, `plastic bag`, `grey fox, gray fox, Urocyon cinereoargenteus`, `water tower`, `house finch, linnet, Carpodacus mexicanus`, `barbell`, `hair slide`, `tiger, Panthera tigris`, `black-footed ferret, ferret, Mustela nigripes`, `meat loaf, meatloaf`, `hand blower, blow dryer, blow drier, hair dryer, hair drier`, `overskirt`, `gibbon, Hylobates lar`, `Gila monster, Heloderma suspectum`, `toucan`, `snowmobile`, `pencil box, pencil case`, `scuba diver`, `cloak`, `Sussex spaniel`, `otter`, `Greater Swiss Mountain dog`, `great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias`, `torch`, `magpie`, `tiger shark, Galeocerdo cuvieri`, `wing`, `Border collie`, `bell cote, bell cot`, `sea anemone, anemone`, `teapot`, `sea urchin`, `screen, CRT screen`, `bookshop, bookstore, bookstall`, `oscilloscope, scope, cathode-ray oscilloscope, CRO`, `crib, cot`, `police van, police wagon, paddy wagon, patrol wagon, wagon, black Maria`, `hartebeest`, `manhole cover`, `iPod`, `rock python, rock snake, Python sebae`, `nipple`, `suspension bridge`, `safety pin`, `sea lion`, `cougar, puma, catamount, mountain lion, painter, panther, Felis concolor`, `mantis, mantid`, `wardrobe, closet, press`, `projector`, `Granny Smith`, `diamondback, diamondback rattlesnake, Crotalus adamanteus`, `pirate, pirate ship`, `espresso maker`, `African hunting dog, hyena dog, Cape hunting dog, Lycaon pictus`, `cradle`, `common newt, Triturus vulgaris`, `tricycle, trike, velocipede`, `bobsled, bobsleigh, bob`, `thunder snake, worm snake, Carphophis amoenus`, `thresher, thrasher, threshing machine`, `banjo`, `armadillo`, `pajama, pyjama, pj's, jammies`, `ski`, `Maltese dog, Maltese terrier, Maltese`, `leafhopper`, `book jacket, dust cover, dust jacket, dust wrapper`, `silky terrier, Sydney silky`, `Shih-Tzu`, `wallaby, brush kangaroo`, `cardigan`, `sturgeon`, `freight car`, `home theater, home theatre`, `sundial`, `African crocodile, Nile crocodile, Crocodylus niloticus`, 
`odometer, hodometer, mileometer, milometer`, `sliding door`, `vine snake`, `West Highland white terrier`, `mongoose`, `hornbill`, `beagle`, `European gallinule, Porphyrio porphyrio`, `submarine, pigboat, sub, U-boat`, `Komodo dragon, Komodo lizard, dragon lizard, giant lizard, Varanus komodoensis`, `cock`, `pedestal, plinth, footstall`, `accordion, piano accordion, squeeze box`, `gown`, `lynx, catamount`, `guenon, guenon monkey`, `Walker hound, Walker foxhound`, `standard schnauzer`, `reel`, `hip, rose hip, rosehip`, `grasshopper, hopper`, `Dutch oven`, `stone wall`, `hard disc, hard disk, fixed disk`, `snow leopard, ounce, Panthera uncia`, `shopping cart`, `digital clock`, `hourglass`, `Border terrier`, `Old English sheepdog, bobtail`, `academic gown, academic robe, judge's robe`, `spiny lobster, langouste, rock lobster, crawfish, crayfish, sea crawfish`, `spotlight, spot`, `dome`, `barn spider, Araneus cavaticus`, `bee eater`, `basketball`, `cliff dwelling`, `folding chair`, `isopod`, `Doberman, Doberman pinscher`, `bittern`, `sunglasses, dark glasses, shades`, `picket fence, paling`, `Crock Pot`, `ibex, Capra ibex`, `neck brace`, `cardoon`, `cassette`, `amphibian, amphibious vehicle`, `minivan`, `analog clock`, `trailer truck, tractor trailer, trucking rig, rig, articulated lorry, semi`, `yurt`, `cliff, drop, drop-off`, `Bernese mountain dog`, `teddy, teddy bear`, `sloth bear, Melursus ursinus, Ursus ursinus`, `bassoon`, `toaster`, `ptarmigan`, `Gordon setter`, `night snake, Hypsiglena torquata`, `grand piano, grand`, `purse`, `clumber, clumber spaniel`, `shoji`, `hair spray`, `maillot`, `knee pad`, `space heater`, `bottlecap`, `chiffonier, commode`, `chain saw, chainsaw`, `sulphur butterfly, sulfur butterfly`, `pay-phone, pay-station`, `kelpie`, `mouse, computer mouse`, `car wheel`, `cornet, horn, trumpet, trump`, `container ship, containership, container vessel`, `matchstick`, `scabbard`, `American black bear, black bear, Ursus americanus, Euarctos 
americanus`, `langur`, `rock crab, Cancer irroratus`, `lionfish`, `speedboat`, `black stork, Ciconia nigra`, `knot`, `disk brake, disc brake`, `mosquito net`, `white stork, Ciconia ciconia`, `abacus`, `titi, titi monkey`, `grocery store, grocery, food market, market`, `waffle iron`, `pickelhaube`, `wooden spoon`, `Norwegian elkhound, elkhound`, `earthstar`, `sewing machine`, `balance beam, beam`, `potpie`, `chain mail, ring mail, mail, chain armor, chain armour, ring armor, ring armour`, `Staffordshire bullterrier, Staffordshire bull terrier`, `switch, electric switch, electrical switch`, `dhole, Cuon alpinus`, `paddle, boat paddle`, `limousine, limo`, `Shetland sheepdog, Shetland sheep dog, Shetland`, `space bar`, `library`, `paddlewheel, paddle wheel`, `alligator lizard`, `Band Aid`, `Persian cat`, `bull mastiff`, `tailed frog, bell toad, ribbed toad, tailed toad, Ascaphus trui`, `sports car, sport car`, `football helmet`, `laptop, laptop computer`, `lens cap, lens cover`, `tennis ball`, `violin, fiddle`, `lab coat, laboratory coat`, `cinema, movie theater, movie theatre, movie house, picture palace`, `weasel`, `bow tie, bow-tie, bowtie`, `macaw`, `dough`, `whiskey jug`, `microphone, mike`, `spoonbill`, `bassinet`, `mud turtle`, `velvet`, `warthog`, `plunger, plumber's helper`, `dugong, Dugong dugon`, `honeycomb`, `badger`, `dragonfly, darning needle, devil's darning needle, sewing needle, snake feeder, snake doctor, mosquito hawk, skeeter hawk`, `bee`, `doormat, welcome mat`, `fountain pen`, `giant schnauzer`, `assault rifle, assault gun`, `limpkin, Aramus pictus`, `siamang, Hylobates syndactylus, Symphalangus syndactylus`, `albatross, mollymawk`, `confectionery, confectionary, candy store`, `harp`, `parachute, chute`, `barrel, cask`, `tank, army tank, armored combat vehicle, armoured combat vehicle`, `collie`, `kite`, `puck, hockey puck`, `stupa, tope`, `buckeye, horse chestnut, conker`, `patio, terrace`, `broom`, `Dandie Dinmont, Dandie Dinmont terrier`, 
`scorpion`, `agaric`, `balloon`, `bucket, pail`, `squirrel monkey, Saimiri sciureus`, `Eskimo dog, husky`, `zebra`, `garter snake, grass snake`, `indri, indris, Indri indri, Indri brevicaudatus`, `tractor`, `guinea pig, Cavia cobaya`, `maraca`, `red-backed sandpiper, dunlin, Erolia alpina`, `bullfrog, Rana catesbeiana`, `trilobite`, `Japanese spaniel`, `gorilla, Gorilla gorilla`, `monastery`, `centipede`, `terrapin`, `llama`, `long-horned beetle, longicorn, longicorn beetle`, `boxer`, `curly-coated retriever`, `mortar`, `hammerhead, hammerhead shark`, `goldfinch, Carduelis carduelis`, `garden spider, Aranea diademata`, `stopwatch, stop watch`, `grey whale, gray whale, devilfish, Eschrichtius gibbosus, Eschrichtius robustus`, `leaf beetle, chrysomelid`, `birdhouse`, `king crab, Alaska crab, Alaskan king crab, Alaska king crab, Paralithodes camtschatica`, `stole`, `bell pepper`, `radiator`, `flatworm, platyhelminth`, `mushroom`, `Scotch terrier, Scottish terrier, Scottie`, `liner, ocean liner`, `toilet seat`, `lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens`, `zucchini, courgette`, `harvestman, daddy longlegs, Phalangium opilio`, `Newfoundland, Newfoundland dog`, `flamingo`, `whiptail, whiptail lizard`, `geyser`, `cleaver, meat cleaver, chopper`, `sea cucumber, holothurian`, `American egret, great white heron, Egretta albus`, `parking meter`, `beacon, lighthouse, beacon light, pharos`, `coucal`, `motor scooter, scooter`, `mitten`, `cannon`, `weevil`, `megalith, megalithic structure`, `stinkhorn, carrion fungus`, `ear, spike, capitulum`, `box turtle, box tortoise`, `snowplow, snowplough`, `tench, Tinca tinca`, `modem`, `tobacco shop, tobacconist shop, tobacconist`, `barn`, `skunk, polecat, wood pussy`, `African grey, African gray, Psittacus erithacus`, `Madagascar cat, ring-tailed lemur, Lemur catta`, `holster`, `barometer`, `sleeping bag`, `washer, automatic washer, washing machine`, `recreational vehicle, RV, R.V.`, `drake`, `tray`, `butcher 
shop, meat market`, `china cabinet, china closet`, `medicine chest, medicine cabinet`, `photocopier`, `Yorkshire terrier`, `starfish, sea star`, `racket, racquet`, `park bench`, `Labrador retriever`, `whistle`, `clog, geta, patten, sabot`, `volcano`, `quilt, comforter, comfort, puff`, `leopard, Panthera pardus`, `cauliflower`, `swimming trunks, bathing trunks`, `American chameleon, anole, Anolis carolinensis`, `alp`, `mortarboard`, `barracouta, snoek`, `cocker spaniel, English cocker spaniel, cocker`, `space shuttle`, `beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon`, `harmonica, mouth organ, harp, mouth harp`, `gasmask, respirator, gas helmet`, `wombat`, `Model T`, `wild boar, boar, Sus scrofa`, `hermit crab`, `flat-coated retriever`, `rotisserie`, `jinrikisha, ricksha, rickshaw`, `trifle`, `bannister, banister, balustrade, balusters, handrail`, `go-kart`, `bakery, bakeshop, bakehouse`, `ski mask`, `dock, dockage, docking facility`, `Egyptian cat`, `oxcart`, `redbone`, `shoe shop, shoe-shop, shoe store`, `convertible`, `ox`, `crayfish, crawfish, crawdad, crawdaddy`, `cowboy hat, ten-gallon hat`, `conch`, `spaghetti squash`, `toy poodle`, `saltshaker, salt shaker`, `microwave, microwave oven`, `triceratops`, `necklace`, `castle`, `streetcar, tram, tramcar, trolley, trolley car`, `eel`, `diaper, nappy, napkin`, `standard poodle`, `prayer rug, prayer mat`, `radio, wireless`, `crane`, `envelope`, `rule, ruler`, `gar, garfish, garpike, billfish, Lepisosteus osseus`, `spider monkey, Ateles geoffroyi`, `Irish wolfhound`, `German shepherd, German shepherd dog, German police dog, alsatian`, `umbrella`, `sunglass`, `aircraft carrier, carrier, flattop, attack aircraft carrier`, `water buffalo, water ox, Asiatic buffalo, Bubalus bubalis`, `jellyfish`, `groom, bridegroom`, `tree frog, tree-frog`, `steel arch bridge`, `lemon`, `pickup, pickup truck`, `vault`, `groenendael`, `baseball`, `junco, snowbird`, `maillot, tank suit`, `gazelle`, 
`jack-o'-lantern`, `military uniform`, `slide rule, slipstick`, `wire-haired fox terrier`, `acorn squash`, `electric fan, blower`, `Brittany spaniel`, `chimpanzee, chimp, Pan troglodytes`, `pillow`, `binder, ring-binder`, `schipperke`, `Afghan hound, Afghan`, `plate rack`, `car mirror`, `hand-held computer, hand-held microcomputer`, `papillon`, `schooner`, `Bedlington terrier`, `cellular telephone, cellular phone, cellphone, cell, mobile phone`, `altar`, `Chesapeake Bay retriever`, `cabbage butterfly`, `polecat, fitch, foulmart, foumart, Mustela putorius`, `comic book`, `French horn, horn`, `daisy`, `organ, pipe organ`, `mashed potato`, `acorn`, `fly`, `chain`, `American alligator, Alligator mississipiensis`, `mink`, `garbage truck, dustcart`, `totem pole`, `wine bottle`, `strawberry`, `cricket`, `European fire salamander, Salamandra salamandra`, `coral reef`, `Welsh springer spaniel`, `bighorn, bighorn sheep, cimarron, Rocky Mountain bighorn, Rocky Mountain sheep, Ovis canadensis`, `snorkel`, `bald eagle, American eagle, Haliaeetus leucocephalus`, `meerkat, mierkat`, `grille, radiator grille`, `nematode, nematode worm, roundworm`, `anemone fish`, `corn`, `loggerhead, loggerhead turtle, Caretta caretta`, `palace`, `suit, suit of clothes`, `pineapple, ananas`, `macaque`, `ping-pong ball`, `ram, tup`, `church, church building`, `koala, koala bear, kangaroo bear, native bear, Phascolarctos cinereus`, `hare`, `bath towel`, `strainer`, `yawl`, `otterhound, otter hound`, `table lamp`, `king snake, kingsnake`, `lotion`, `lion, king of beasts, Panthera leo`, `thatch, thatched roof`, `basset, basset hound`, `black and gold garden spider, Argiope aurantia`, `barber chair`, `proboscis monkey, Nasalis larvatus`, `consomme`, `Irish terrier`, `Irish water spaniel`, `common iguana, iguana, Iguana iguana`, `Weimaraner`, `Great Dane`, `pug, pug-dog`, `rhinoceros beetle`, `vase`, `brass, memorial tablet, plaque`, `kit fox, Vulpes macrotis`, `king penguin, Aptenodytes patagonica`, 
`vending machine`, `dalmatian, coach dog, carriage dog`, `rock beauty, Holocanthus tricolor`, `pole`, `cuirass`, `bolete`, `jackfruit, jak, jack`, `monarch, monarch butterfly, milkweed butterfly, Danaus plexippus`, `chow, chow chow`, `nail`, `packet`, `half track`, `Lhasa, Lhasa apso`, `boathouse`, `hay`, `valley, vale`, `slot, one-armed bandit`, `volleyball`, `carton`, `shower cap`, `tile roof`, `lacewing, lacewing fly`, `patas, hussar monkey, Erythrocebus patas`, `boa constrictor, Constrictor constrictor`, `black swan, Cygnus atratus`, `lampshade, lamp shade`, `mousetrap`, `crutch`, `vestment`, `Pekinese, Pekingese, Peke`, `tusker`, `warplane, military plane`, `sandal`, `screw`, `custard apple`, `Scottish deerhound, deerhound`, `spindle`, `keeshond`, `hummingbird`, `letter opener, paper knife, paperknife`, `cucumber, cuke`, `bearskin, busby, shako`, `fire engine, fire truck`, `trombone`, `ringneck snake, ring-necked snake, ring snake`, `sorrel`, `fire screen, fireguard`, `paper towel`, `flute, transverse flute`, `hotdog, hot dog, red hot`, `Indian elephant, Elephas maximus`, `mosque`, `stingray`, `Rhodesian ridgeback`, `four-poster` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_tiny__random_en_4.1.0_3.0_1660165873976.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_tiny__random_en_4.1.0_3.0_1660165873976.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_tiny__random", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_tiny__random", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_tiny__random| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|410.5 KB| --- layout: model title: Legal Books and Records Clause Binary Classifier author: John Snow Labs name: legclf_books_and_records_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `books-and-records` clause type. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip it unless you want to do Binary Classification at the sentence level. If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Keep in mind that the embeddings used by this model allow up to 512 tokens; if your input is longer than that, consider splitting it into smaller pieces (the tutorial linked above also covers this). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in the Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
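The 512-token limit mentioned above can be screened for before inference with a cheap whitespace heuristic. This is only a rough sketch: the real subword tokenizer usually produces more tokens than whitespace words, so the effective limit is stricter than this check suggests.

```python
def needs_splitting(text: str, max_tokens: int = 512) -> bool:
    # Whitespace word count is a cheap lower bound on the subword token count.
    return len(text.split()) > max_tokens

short_clause = "The Company shall maintain accurate books and records."
long_document = "whereas " * 600  # 600 words, clearly over the limit

print(needs_splitting(short_clause))   # False
print(needs_splitting(long_document))  # True
```

Documents flagged by such a check are the ones worth running through the splitting techniques listed above before classification.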
## Predicted Entities `other`, `books-and-records` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_books_and_records_clause_en_1.0.0_3.2_1660122179925.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_books_and_records_clause_en_1.0.0_3.2_1660122179925.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_books_and_records_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
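The "paragraph splitting (by multiline)" technique recommended in the description can be sketched in plain Python before building the Spark DataFrame. The sample `contract_text` and the blank-line heuristic below are illustrative assumptions, not part of the model:

```python
import re

# Hypothetical contract text; in practice this is read from your documents.
contract_text = """Section 4.1 Books and Records.
The Company shall keep proper books of record and account.

Section 4.2 Counterparts.
This Agreement may be executed in counterparts."""

# Split on one or more blank lines ("by multiline") and drop empty chunks.
paragraphs = [p.strip() for p in re.split(r"\n\s*\n", contract_text) if p.strip()]

# Each paragraph then becomes one row of the classifier's input DataFrame:
# df = spark.createDataFrame([[p] for p in paragraphs]).toDF("clause_text")
print(len(paragraphs))  # 2
```

Splitting this way keeps each clause candidate within the 512-token limit while preserving enough context for the classifier.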
## Results

```bash
+-------------------+
|result             |
+-------------------+
|[books-and-records]|
|[other]            |
|[other]            |
|[books-and-records]|
+-------------------+
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_books_and_records_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house. ## Benchmarking

```bash
label              precision  recall  f1-score  support
books-and-records       0.91    0.94      0.92       31
other                   0.98    0.97      0.98      104
accuracy                   -       -      0.96      135
macro-avg               0.94    0.95      0.95      135
weighted-avg            0.96    0.96      0.96      135
```

--- layout: model title: Icelandic NER Model author: John Snow Labs name: roberta_token_classifier_icelandic_ner date: 2021-12-06 tags: [icelandic, roberta, token_classifier, ner, is, open_source] task: Named Entity Recognition language: is edition: Spark NLP 3.3.2 spark_version: 2.4 supported: true annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model, imported from Hugging Face, was fine-tuned on the MIM-GOLD-NER dataset for the Icelandic language, leveraging `RoBERTa` embeddings and using `RoBertaForTokenClassification` for NER purposes. ## Predicted Entities `Date`, `Location`, `Miscellaneous`, `Money`, `Organization`, `Percent`, `Person`, `Time` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_icelandic_ner_is_3.3.2_2.4_1638796728651.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_icelandic_ner_is_3.3.2_2.4_1638796728651.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_icelandic_ner", "is")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = """Ég heiti Peter Fergusson. Ég hef búið í New York síðan í október 2011 og unnið hjá Tesla Motor og þénað 100K $ á ári.""" result = model.transform(spark.createDataFrame([[text]]).toDF("text")) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_icelandic_ner", "is") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val example = Seq("Ég heiti Peter Fergusson. Ég hef búið í New York síðan í október 2011 og unnið hjá Tesla Motor og þénað 100K $ á ári.").toDS.toDF("text") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("is.ner").predict("""Ég heiti Peter Fergusson. Ég hef búið í New York síðan í október 2011 og unnið hjá Tesla Motor og þénað 100K $ á ári.""") ```
## Results

```bash
+----------------+------------+
|chunk           |ner_label   |
+----------------+------------+
|Peter Fergusson |Person      |
|New York        |Location    |
|október 2011    |Date        |
|Tesla Motor     |Organization|
|100K $          |Money       |
+----------------+------------+
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_token_classifier_icelandic_ner| |Compatibility:|Spark NLP 3.3.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|is| |Case sensitive:|true| |Max sentence length:|256| ## Data Source [https://huggingface.co/m3hrdadfi/icelandic-ner-roberta](https://huggingface.co/m3hrdadfi/icelandic-ner-roberta) ## Benchmarking

```bash
label           score
Macro-F1-Score  0.957209
Micro-F1-Score  0.951866
```

--- layout: model title: Earning Calls Financial NER (Generic, sm) author: John Snow Labs name: finner_earning_calls_generic_sm date: 2022-11-30 tags: [en, financial, ner, earning, calls, licensed] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: FinanceNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a `sm` (small) version of a financial model trained on Earning Calls transcripts to detect financial entities (NER model). This model is called `Generic` as it has fewer labels in comparison with the `Specific` version. Please note this model requires some tokenization configuration to extract the currency (see the Python snippet below). The currently available entities are: - AMOUNT: Numeric amounts, not percentages - ASSET: Current or Fixed Asset - ASSET_DECREASE: Decrease in the asset possession/exposure - ASSET_INCREASE: Increase in the asset possession/exposure - CF: Total cash flow - CF_DECREASE: Relative decrease in cash flow - CF_INCREASE: Relative increase in cash flow - COUNT: Number of items (not monetary, not percentages).
- CURRENCY: The currency of the amount - DATE: Generic dates in context where either it's not a fiscal year or it can't be asserted as such given the context - EXPENSE: An expense or loss - EXPENSE_DECREASE: A piece of information saying there was an expense decrease in that fiscal year - EXPENSE_INCREASE: A piece of information saying there was an expense increase in that fiscal year - FCF: Free Cash Flow - FISCAL_YEAR: A date which expresses which month the fiscal exercise was closed for a specific year - KPI: Key Performance Indicator, a quantifiable measure of performance over time for a specific objective - KPI_DECREASE: Relative decrease in a KPI - KPI_INCREASE: Relative increase in a KPI - LIABILITY: Current or Long-Term Liability (not from stockholders) - LIABILITY_DECREASE: Relative decrease in liability - LIABILITY_INCREASE: Relative increase in liability - ORG: Mention of a company/organization name - PERCENTAGE: Numeric amounts which are percentages - PROFIT: Profit or Revenue - PROFIT_DECLINE: A piece of information saying there was a profit / revenue decrease in that fiscal year - PROFIT_INCREASE: A piece of information saying there was a profit / revenue increase in that fiscal year - TICKER: Trading symbol of the company You can also check out the Relation Extraction model, which connects these entities together.
## Predicted Entities `AMOUNT`, `ASSET`, `ASSET_DECREASE`, `ASSET_INCREASE`, `CF`, `CF_DECREASE`, `CF_INCREASE`, `COUNT`, `CURRENCY`, `DATE`, `EXPENSE`, `EXPENSE_DECREASE`, `EXPENSE_INCREASE`, `FCF`, `FISCAL_YEAR`, `KPI`, `KPI_DECREASE`, `KPI_INCREASE`, `LIABILITY`, `LIABILITY_DECREASE`, `LIABILITY_INCREASE`, `ORG`, `PERCENTAGE`, `PROFIT`, `PROFIT_DECLINE`, `PROFIT_INCREASE`, `TICKER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_earning_calls_generic_sm_en_1.0.0_3.0_1669839690938.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_earning_calls_generic_sm_en_1.0.0_3.0_1669839690938.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pyspark.sql.functions as F document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token")\ .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '”', '’', '$','€']) embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512) ner_model = finance.NerModel.pretrained("finner_earning_calls_generic_sm", "en", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""Adjusted EPS was ahead of our expectations at $ 1.21 , and free cash flow is also ahead of our expectations despite a $ 1.5 billion additional tax payment we made related to the R&D amortization."""]]).toDF("text") model = pipeline.fit(data) result = model.transform(data) result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \ .select(F.expr("cols['0']").alias("text"), F.expr("cols['1']['entity']").alias("label")).show(200, truncate = False) ```
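For readers unfamiliar with BIO tagging, the `ner_converter` stage above is what merges token-level `B-`/`I-`/`O` labels into whole entity chunks. A minimal, framework-independent sketch of that grouping logic (plain Python, not the actual NerConverter implementation):

```python
def bio_to_chunks(tokens, tags):
    """Group (token, BIO-tag) pairs into (chunk_text, label) entities.

    A toy illustration of what a BIO-to-chunk converter does; the real
    NerConverter also tracks character offsets and confidence scores.
    """
    chunks, current_tokens, current_label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:  # "O", or an I- tag that does not continue the open chunk
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks

tokens = ["Adjusted", "EPS", "was", "at", "$", "1.21"]
tags = ["B-PROFIT", "I-PROFIT", "O", "O", "B-CURRENCY", "B-AMOUNT"]
print(bio_to_chunks(tokens, tags))
# → [('Adjusted EPS', 'PROFIT'), ('$', 'CURRENCY'), ('1.21', 'AMOUNT')]
```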
## Results ```bash +------------+----------+----------+ | token| ner_label|confidence| +------------+----------+----------+ | Adjusted| B-PROFIT| 0.9691| | EPS| I-PROFIT| 0.9954| | was| O| 1.0| | ahead| O| 1.0| | of| O| 1.0| | our| O| 1.0| |expectations| O| 1.0| | at| O| 1.0| | $|B-CURRENCY| 1.0| | 1.21| B-AMOUNT| 1.0| | ,| O| 0.9998| | and| O| 1.0| | free| B-FCF| 0.9981| | cash| I-FCF| 0.9998| | flow| I-FCF| 0.9998| | is| O| 1.0| | also| O| 1.0| | ahead| O| 1.0| | of| O| 1.0| | our| O| 1.0| |expectations| O| 1.0| | despite| O| 1.0| | a| O| 1.0| | $|B-CURRENCY| 1.0| | 1.5| B-AMOUNT| 1.0| | billion| I-AMOUNT| 0.9999| | additional| O| 0.998| | tax| O| 0.9532| | payment| O| 0.945| | we| O| 0.9999| | made| O| 1.0| | related| O| 1.0| | to| O| 1.0| | the| O| 1.0| | R&D| O| 0.9981| |amortization| O| 0.9973| | .| O| 1.0| +------------+----------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_earning_calls_generic_sm| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.2 MB| ## References In-house annotations on Earning Calls. 
## Benchmarking ```bash label tp fp fn prec rec f1 I-AMOUNT 383 1 3 0.9973958 0.992228 0.9948052 B-COUNT 13 5 2 0.7222222 0.8666667 0.78787875 B-AMOUNT 453 0 6 1.0 0.9869281 0.9934211 I-ORG 16 0 0 1.0 1.0 1.0 B-DATE 117 11 5 0.9140625 0.9590164 0.93600005 B-LIABILITY_DECREASE 1 1 0 0.5 1.0 0.6666667 I-LIABILITY 8 6 3 0.5714286 0.72727275 0.64000005 I-EXPENSE 75 13 52 0.85227275 0.5905512 0.69767445 I-KPI_INCREASE 6 3 8 0.6666667 0.42857143 0.5217392 B-LIABILITY 9 4 5 0.6923077 0.64285713 0.6666667 I-CF 18 1 18 0.94736844 0.5 0.6545455 I-COUNT 12 2 1 0.85714287 0.9230769 0.8888889 B-FCF 13 5 0 0.7222222 1.0 0.83870965 B-PROFIT_INCREASE 79 22 31 0.7821782 0.7181818 0.7488152 B-KPI_INCREASE 3 4 11 0.42857143 0.21428572 0.2857143 B-EXPENSE 41 19 38 0.68333334 0.51898736 0.5899281 I-PROFIT_DECLINE 5 7 22 0.41666666 0.18518518 0.25641027 I-LIABILITY_DECREASE 1 1 0 0.5 1.0 0.6666667 I-PROFIT 188 47 50 0.8 0.789916 0.79492605 B-CURRENCY 440 0 1 1.0 0.9977324 0.9988649 I-PROFIT_INCREASE 77 23 45 0.77 0.63114756 0.69369364 I-CURRENCY 6 0 0 1.0 1.0 1.0 B-CF 9 1 8 0.9 0.5294118 0.6666667 B-PROFIT 147 51 40 0.74242425 0.7860963 0.7636363 B-PERCENTAGE 417 2 4 0.99522674 0.99049884 0.99285716 B-TICKER 13 0 0 1.0 1.0 1.0 I-FISCAL_YEAR 3 0 0 1.0 1.0 1.0 B-ORG 14 0 0 1.0 1.0 1.0 B-EXPENSE_INCREASE 6 0 4 1.0 0.6 0.75 B-EXPENSE_DECREASE 1 0 1 1.0 0.5 0.6666667 B-ASSET 9 2 16 0.8181818 0.36 0.5 B-FISCAL_YEAR 1 0 0 1.0 1.0 1.0 I-EXPENSE_DECREASE 3 2 2 0.6 0.6 0.6 I-FCF 26 15 0 0.63414633 1.0 0.7761194 I-EXPENSE_INCREASE 8 0 3 1.0 0.72727275 0.84210527 Macro-average 2637 255 465 0.7494908 0.64362085 0.70253296 Micro-average 2637 255 465 0.9118257 0.8500967 0.8798799 ``` --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4CHEMD_Modified_scibert_scivocab_cased date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 
spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4CHEMD-Modified_scibert_scivocab_cased` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Modified_scibert_scivocab_cased_en_4.0.0_3.0_1657108884893.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4CHEMD_Modified_scibert_scivocab_cased_en_4.0.0_3.0_1657108884893.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Modified_scibert_scivocab_cased","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4CHEMD_Modified_scibert_scivocab_cased","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4CHEMD_Modified_scibert_scivocab_cased| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|410.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4CHEMD-Modified_scibert_scivocab_cased --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_mask_step_pretraining_base_squadv2_epochs_3 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mask_step_pretraining_roberta-base_squadv2_epochs_3` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_mask_step_pretraining_base_squadv2_epochs_3_en_4.3.0_3.0_1674211403400.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_mask_step_pretraining_base_squadv2_epochs_3_en_4.3.0_3.0_1674211403400.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_mask_step_pretraining_base_squadv2_epochs_3","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_mask_step_pretraining_base_squadv2_epochs_3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_mask_step_pretraining_base_squadv2_epochs_3| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/mask_step_pretraining_roberta-base_squadv2_epochs_3 --- layout: model title: Pipeline to Resolve Medication Codes author: John Snow Labs name: medication_resolver_pipeline date: 2023-03-31 tags: [resolver, snomed, umls, rxnorm, ndc, ade, en, licensed, pipeline] task: Pipeline Healthcare language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pretrained resolver pipeline to extract medications and resolve their adverse reactions (ADE), RxNorm, UMLS, NDC, SNOMED CT codes, and action/treatments in clinical text. Action/treatments are available for branded medication, and SNOMED codes are available for non-branded medication. This pipeline can be used as a LightPipeline (with `annotate/fullAnnotate`). You can use `medication_resolver_transform_pipeline` for Spark transform. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/medication_resolver_pipeline_en_4.3.2_3.2_1680263267789.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/medication_resolver_pipeline_en_4.3.2_3.2_1680263267789.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline med_resolver_pipeline = PretrainedPipeline("medication_resolver_pipeline", "en", "clinical/models") text = """The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""" result = med_resolver_pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val med_resolver_pipeline = new PretrainedPipeline("medication_resolver_pipeline", "en", "clinical/models") val result = med_resolver_pipeline.fullAnnotate("""The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.medication").predict("""The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""") ```
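`fullAnnotate` returns a list with one dictionary of annotation lists per input text. As an illustrative sketch (assuming the usual Spark NLP annotation shape, with a `result` string and a `metadata` dict per chunk; the helper name is hypothetical), the chunk annotations can be flattened into rows like those shown in the Results section:

```python
def chunks_to_rows(annotations):
    """Flatten chunk annotations (as dicts) into result-table rows.

    Illustrative only: assumes each annotation is a dict with a "result"
    string and a "metadata" dict, mirroring the shape of LightPipeline
    fullAnnotate output for chunk columns.
    """
    return [
        {"chunk": ann["result"], "entity": ann["metadata"].get("entity")}
        for ann in annotations
    ]

sample = [
    {"result": "Amlodopine Vallarta 10-320mg", "metadata": {"entity": "DRUG"}},
    {"result": "Eviplera", "metadata": {"entity": "DRUG"}},
]
print(chunks_to_rows(sample))
```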
## Results ```bash | | chunks | entities | ADE | RxNorm | Action | Treatment | UMLS | SNOMED_CT | NDC_Product | NDC_Package | |---:|:-----------------------------|:-----------|:----------------------------|---------:|:---------------------------|:-------------------------------------------|:---------|:------------|:--------------|:--------------| | 0 | Amlodopine Vallarta 10-320mg | DRUG | Gynaecomastia | 722131 | NONE | NONE | C1949334 | 425838008 | 00093-7693 | 00093-7693-56 | | 1 | Eviplera | DRUG | Anxiety | 217010 | Inhibitory Bone Resorption | Osteoporosis | C0720318 | NONE | NONE | NONE | | 2 | Lescol 40 MG | DRUG | NONE | 103919 | Hypocholesterolemic | Heterozygous Familial Hypercholesterolemia | C0353573 | NONE | 00078-0234 | 00078-0234-05 | | 3 | Everolimus 1.5 mg tablet | DRUG | Acute myocardial infarction | 2056895 | NONE | NONE | C4723581 | NONE | 00054-0604 | 00054-0604-21 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|medication_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|3.2 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel - TextMatcherModel - ChunkMergeModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger - ResolverMerger - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - Finisher --- layout: model title: English RobertaForMaskedLM Cased model (from amoux) author: John Snow Labs name: roberta_embeddings_cord19_1m7k date: 2022-12-12 tags: [en, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: 
"Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-cord19-1M7k` is a English model originally trained by `amoux`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_cord19_1m7k_en_4.2.4_3.0_1670859285277.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_cord19_1m7k_en_4.2.4_3.0_1670859285277.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_cord19_1m7k","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_cord19_1m7k","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_cord19_1m7k| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|367.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/amoux/roberta-cord19-1M7k - https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html --- layout: model title: Catalan, Valencian asr_Wav2Vec2_Large_XLSR_53_catalan TFWav2Vec2ForCTC from PereLluis13 author: John Snow Labs name: pipeline_asr_Wav2Vec2_Large_XLSR_53_catalan date: 2022-09-24 tags: [wav2vec2, ca, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: ca edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Wav2Vec2_Large_XLSR_53_catalan` is a Catalan, Valencian model originally trained by PereLluis13. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_Wav2Vec2_Large_XLSR_53_catalan_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_Wav2Vec2_Large_XLSR_53_catalan_ca_4.2.0_3.0_1664036656488.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_Wav2Vec2_Large_XLSR_53_catalan_ca_4.2.0_3.0_1664036656488.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_Wav2Vec2_Large_XLSR_53_catalan', lang = 'ca') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_Wav2Vec2_Large_XLSR_53_catalan", lang = "ca") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_Wav2Vec2_Large_XLSR_53_catalan| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|ca| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: BioBERT Sentence Embeddings (Pubmed PMC) author: John Snow Labs name: sent_biobert_pubmed_pmc_base_cased date: 2020-09-19 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.2 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains the pre-trained weights of BioBERT, a language representation model for the biomedical domain, especially designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_biobert_pubmed_pmc_base_cased_en_2.6.2_2.4_1600533114335.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_biobert_pubmed_pmc_base_cased_en_2.6.2_2.4_1600533114335.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_pmc_base_cased", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_pmc_base_cased", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.biobert.pubmed_pmc_base_cased').predict(text, output_level='sentence') embeddings_df ```
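Sentence embeddings like these are typically compared with cosine similarity to measure semantic closeness. A minimal sketch in plain Python, where toy 3-dimensional vectors stand in for the 768-dimensional BioBERT embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical vectors score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))
```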
{:.h2_title} ## Results ```bash sentence en_embed_sentence_biobert_pubmed_pmc_base_cased_embeddings I hate cancer [0.2354733943939209, 0.30127033591270447, -0.1... Antibiotics aren't painkiller [0.2837969958782196, 0.03842488303780556, 0.04... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_biobert_pubmed_pmc_base_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|en| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert) --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_roberta_base_squad2.0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2.0_en_4.0.0_3.0_1655735667354.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_squad2.0_en_4.0.0_3.0_1655735667354.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|460.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/roberta-base_squad2.0 --- layout: model title: English RobertaForQuestionAnswering (from arjunth2001) author: John Snow Labs name: roberta_qa_priv_qna date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `priv_qna` is an English model originally trained by `arjunth2001`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_priv_qna_en_4.0.0_3.0_1655729273853.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_priv_qna_en_4.0.0_3.0_1655729273853.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_priv_qna","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_priv_qna","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.by_arjunth2001").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_priv_qna| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/arjunth2001/priv_qna --- layout: model title: Fast Neural Machine Translation Model from Slovak to English author: John Snow Labs name: opus_mt_sk_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, sk, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `sk` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_sk_en_xx_2.7.0_2.4_1609168419909.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_sk_en_xx_2.7.0_2.4_1609168419909.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_sk_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_sk_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.sk.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_sk_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English image_classifier_vit_exper_batch_32_e4 ViTForImageClassification from sudo-s
author: John Snow Labs
name: image_classifier_vit_exper_batch_32_e4
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_exper_batch_32_e4` is an English model originally trained by sudo-s.
## Predicted Entities `45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper_batch_32_e4_en_4.1.0_3.0_1660171164864.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_exper_batch_32_e4_en_4.1.0_3.0_1660171164864.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_exper_batch_32_e4", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_exper_batch_32_e4", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
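Both snippets assume an `imageDF` of images already loaded into Spark. A minimal sketch of building it (the helper name and path are illustrative), using Spark's built-in `image` data source:

```python
# Illustrative helper: build the `imageDF` the pipeline above expects.
# Spark's built-in "image" data source reads common image formats;
# the "dropInvalid" option skips files that cannot be decoded.
def load_image_df(spark, path):
    return (spark.read
                 .format("image")
                 .option("dropInvalid", True)
                 .load(path))
```

For example, `imageDF = load_image_df(spark, "path/to/images/")` before calling `pipeline.fit(imageDF)`.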
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_exper_batch_32_e4|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|322.3 MB|

---
layout: model
title: RCT Binary Classifier (BioBERT) Pipeline
author: John Snow Labs
name: bert_sequence_classifier_binary_rct_biobert_pipeline
date: 2022-06-06
tags: [licensed, classifier, rct, clinical, en]
task: Text Classification
language: en
nav_key: models
edition: Healthcare NLP 3.4.2
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is a BioBERT-based classifier that determines whether an article is a randomized clinical trial (RCT) or not. It is built on top of the [bert_sequence_classifier_binary_rct_biobert](https://nlp.johnsnowlabs.com/2022/04/25/bert_sequence_classifier_binary_rct_biobert_en_3_0.html) model.

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/CLASSIFICATION_RCT/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CLASSIFICATION_RCT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_binary_rct_biobert_pipeline_en_3.4.2_3.0_1654510510935.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_binary_rct_biobert_pipeline_en_3.4.2_3.0_1654510510935.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_sequence_classifier_binary_rct_biobert_pipeline", "en", "clinical/models") result = pipeline.annotate("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. 
""") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_sequence_classifier_binary_rct_biobert_pipeline", "en", "clinical/models") val result = pipeline.annotate("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. """) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.bert_sequence.binary_rct_biobert.pipeline").predict("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. 
This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. """) ```
## Results

```bash
+----+----------------------------------------------------------------------------------------------------+
|rct |text                                                                                                |
+----+----------------------------------------------------------------------------------------------------+
|True|Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and...|
+----+----------------------------------------------------------------------------------------------------+
```

{:.model-param}
## Model Information
{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_binary_rct_biobert_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|406.0 MB|

## Included Models

- DocumentAssembler
- TokenizerModel
- MedicalBertForSequenceClassification

---
layout: model
title: Extract Oncology Tests
author: John Snow Labs
name: ner_oncology_test_wip
date: 2022-09-30
tags: [licensed, clinical, oncology, en, ner, test]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 4.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model extracts mentions of tests from oncology texts, including pathology tests and imaging tests.

Definitions of Predicted Entities:

- `Biomarker`: Biological molecules that indicate the presence or absence of cancer, or the type of cancer. Oncogenes are excluded from this category.
- `Biomarker_Result`: Terms or values that are identified as the result of a biomarker.
- `Imaging_Test`: Imaging tests mentioned in texts, such as "chest CT scan".
- `Oncogene`: Mentions of genes that are implicated in the etiology of cancer.
- `Pathology_Test`: Mentions of biopsies or tests that use tissue samples.
## Predicted Entities `Biomarker`, `Biomarker_Result`, `Imaging_Test`, `Oncogene`, `Pathology_Test` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_test_wip_en_4.0.0_3.0_1664562782979.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_test_wip_en_4.0.0_3.0_1664562782979.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_test_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["A biopsy was conducted using an ultrasound guided thick-needle. 
His chest computed tomography (CT) scan was negative."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_test_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("A biopsy was conducted using an ultrasound guided thick-needle. His chest computed tomography (CT) scan was negative.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_test_wip").predict("""A biopsy was conducted using an ultrasound guided thick-needle. His chest computed tomography (CT) scan was negative.""") ```
## Results

```bash
| chunk                     | ner_label      |
|:--------------------------|:---------------|
| biopsy                    | Pathology_Test |
| ultrasound                | Imaging_Test   |
| chest computed tomography | Imaging_Test   |
| CT                        | Imaging_Test   |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_oncology_test_wip|
|Compatibility:|Healthcare NLP 4.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|
|Size:|858.2 KB|

## References

In-house annotated oncology case reports.

## Benchmarking

```bash
           label      tp     fp     fn   total  precision  recall    f1
    Imaging_Test  1518.0  156.0  191.0  1709.0       0.91    0.89  0.90
Biomarker_Result   861.0  145.0  245.0  1106.0       0.86    0.78  0.82
  Pathology_Test   600.0  105.0  209.0   809.0       0.85    0.74  0.79
       Biomarker   917.0  166.0  194.0  1111.0       0.85    0.83  0.84
        Oncogene   274.0   84.0   83.0   357.0       0.77    0.77  0.77
       macro_avg  4170.0  656.0  922.0  5092.0       0.85    0.80  0.82
       micro_avg     NaN    NaN    NaN     NaN       0.86    0.82  0.84
```

---
layout: model
title: English BertForQuestionAnswering Cased model (from mkkc58)
author: John Snow Labs
name: bert_qa_mkkc58_finetuned_squad
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `mkkc58`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mkkc58_finetuned_squad_en_4.0.0_3.0_1657186632740.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mkkc58_finetuned_squad_en_4.0.0_3.0_1657186632740.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mkkc58_finetuned_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mkkc58_finetuned_squad","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
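To read the predictions back out of `result`, the nested `result` field of the `answer` annotation column can be flattened; a hypothetical helper (the name is ours, not part of the model card):

```python
# Illustrative helper: collect the predicted answer spans as plain strings.
# Spark NLP writes annotations to the "answer" column; the annotated text
# sits in the nested `result` field of each annotation.
def collect_answers(result_df):
    rows = result_df.select("answer.result").collect()
    # each row holds the list of answer strings for one question/context pair
    return [ans for row in rows for ans in row["result"]]
```

For example, `collect_answers(result)` on the sample above should yield a short list of answer strings such as the predicted name.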
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_mkkc58_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|404.3 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/mkkc58/bert-finetuned-squad

---
layout: model
title: English image_classifier_vit_digital ViTForImageClassification from lazyturtl
author: John Snow Labs
name: image_classifier_vit_digital
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_digital` is an English model originally trained by lazyturtl.

## Predicted Entities

`ansys`, `blender`, `roblox`, `sketchup`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_digital_en_4.1.0_3.0_1660169960054.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_digital_en_4.1.0_3.0_1660169960054.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_digital", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_digital", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_digital|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|

---
layout: model
title: Fast Neural Machine Translation Model from Dravidian Languages to English
author: John Snow Labs
name: opus_mt_dra_en
date: 2020-12-28
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, dra, en, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is developed mainly by the Microsoft Translator team, with contributions from many academic groups (most notably the University of Edinburgh and, earlier, the Adam Mickiewicz University in Poznań) and commercial partners. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.

- source languages: `dra`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_dra_en_xx_2.7.0_2.4_1609168145192.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_dra_en_xx_2.7.0_2.4_1609168145192.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_dra_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols("document")
  .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_dra_en", "xx")
  .setInputCols("sentence")
  .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.dra.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_dra_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Detect Cancer Genetics
author: John Snow Labs
name: ner_bionlp_en
date: 2020-01-30
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 2.4.0
spark_version: 2.4
tags: [clinical, licensed, ner, en]
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

{:.h2_title}
## Description

Pretrained named entity recognition deep learning model for biology and genetics terms. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.

{:.h2_title}
## Predicted Entities

``Amino_acid``, ``Anatomical_system``, ``Cancer``, ``Cell``, ``Cellular_component``, ``Developing_anatomical_Structure``, ``Gene_or_gene_product``, ``Immaterial_anatomical_entity``, ``Multi-tissue_structure``, ``Organ``, ``Organism``, ``Organism_subdivision``, ``Simple_chemical``, ``Tissue``, ``Organism_substance``, ``Pathological_formation``

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_TUMOR){:.button.button-orange}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_bionlp_en_2.4.0_2.4_1580237286004.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_bionlp_en_2.4.0_2.4_1580237286004.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

{:.h2_title}
## How to use

Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPython.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_bionlp", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes."""]], ["text"])) ``` ```scala ... val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_bionlp", "en", "clinical/models") .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") ... 
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.").toDF("text") val result = pipeline.fit(data).transform(data) ```
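The NerConverter stage in the pipeline above merges IOB2-tagged tokens into full entity chunks. For intuition, the merging logic can be sketched in plain Python, independent of Spark NLP (the helper name is illustrative):

```python
def bio_to_chunks(tokens, tags):
    """Merge IOB2 token tags into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag closes any open chunk and starts a new one.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # An I- tag with a matching label continues the open chunk.
            current.append(tok)
        else:
            # O tags (or label mismatches) close the open chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks
```

The real annotator additionally carries begin/end character offsets and other metadata, as shown in the Results section below.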
{:.h2_title} ## Results The output is a dataframe with one entity chunk per row, with the ``"ner_label"`` column containing the entity label, entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select ``"token.result"`` and ``"ner.result"`` from your output dataframe or add the ``"Finisher"`` to the end of your pipeline. ```bash |id |sentence_id|chunk |begin|end|ner_label | +---+-----------+----------------------+-----+---+--------------------+ |0 |0 |human |4 |8 |Organism | |0 |0 |Kir 3.3 |17 |23 |Gene_or_gene_product| |0 |0 |GIRK3 |26 |30 |Gene_or_gene_product| |0 |0 |potassium |92 |100|Simple_chemical | |0 |0 |GIRK |103 |106|Gene_or_gene_product| |0 |1 |chromosome 1q21-23 |188 |205|Cellular_component | |0 |5 |pancreas |697 |704|Organ | |0 |5 |tissues |740 |746|Tissue | |0 |5 |fat andskeletal muscle|749 |770|Tissue | |0 |6 |KCNJ9 |801 |805|Gene_or_gene_product| |0 |6 |Type II |940 |946|Gene_or_gene_product| |1 |0 |breast cancer |84 |96 |Cancer | |1 |0 |patients |134 |141|Organism | |1 |0 |anthracyclines |167 |180|Simple_chemical | |1 |0 |taxanes |186 |192|Simple_chemical | |1 |1 |vinorelbine |246 |256|Simple_chemical | |1 |1 |patients |273 |280|Organism | |1 |1 |breast |309 |314|Cancer | |1 |1 |vinorelbine inpatients|386 |407|Simple_chemical | |1 |1 |anthracyclines |433 |446|Simple_chemical | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_bionlp_en_2.4.0_2.4| |Type:|ner| |Compatibility:|Spark NLP 2.4.0+| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|[en]| |Case sensitive:|false| |Dependency:|embeddings_clinical| {:.h2_title} ## Data Source Trained on the Cancer Genetics (CG) task of the BioNLP Shared Task 2013 with ``embeddings_clinical``.
https://aclanthology.org/W13-2008/ {:.h2_title} ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|:----------------------------------|------:|------:|-----:|---------:|---------:|---------:| | 1 | I-Amino_acid | 1 | 0 | 2 | 1 | 0.333333 | 0.5 | | 2 | I-Simple_chemical | 264 | 39 | 358 | 0.871287 | 0.424437 | 0.570811 | | 3 | B-Immaterial_anatomical_entity | 19 | 12 | 12 | 0.612903 | 0.612903 | 0.612903 | | 4 | B-Cellular_component | 144 | 24 | 36 | 0.857143 | 0.8 | 0.827586 | | 5 | B-Cancer | 808 | 103 | 115 | 0.886937 | 0.875406 | 0.881134 | | 6 | I-Cell | 888 | 91 | 198 | 0.907048 | 0.81768 | 0.860048 | | 7 | B-Tissue | 137 | 44 | 47 | 0.756906 | 0.744565 | 0.750685 | | 8 | B-Organism_substance | 67 | 4 | 34 | 0.943662 | 0.663366 | 0.77907 | | 9 | B-Simple_chemical | 598 | 165 | 128 | 0.783748 | 0.823692 | 0.803224 | | 10 | B-Cell | 910 | 125 | 98 | 0.879227 | 0.902778 | 0.890847 | | 11 | I-Organ | 7 | 2 | 10 | 0.777778 | 0.411765 | 0.538462 | | 12 | I-Tissue | 86 | 21 | 25 | 0.803738 | 0.774775 | 0.788991 | | 13 | I-Pathological_formation | 20 | 5 | 19 | 0.8 | 0.512821 | 0.625 | | 14 | I-Organism | 58 | 13 | 62 | 0.816901 | 0.483333 | 0.60733 | | 15 | B-Gene_or_gene_product | 2354 | 282 | 165 | 0.89302 | 0.934498 | 0.913288 | | 16 | I-Cancer | 488 | 73 | 116 | 0.869875 | 0.807947 | 0.837768 | | 17 | B-Organ | 109 | 36 | 47 | 0.751724 | 0.698718 | 0.724252 | | 18 | B-Pathological_formation | 58 | 20 | 30 | 0.74359 | 0.659091 | 0.698795 | | 19 | I-Cellular_component | 33 | 5 | 36 | 0.868421 | 0.478261 | 0.616822 | | 20 | I-Multi-tissue_structure | 132 | 34 | 29 | 0.795181 | 0.819876 | 0.807339 | | 21 | B-Organism | 437 | 53 | 77 | 0.891837 | 0.850195 | 0.870518 | | 22 | I-Gene_or_gene_product | 1268 | 161 | 1086 | 0.887334 | 0.538658 | 0.670367 | | 23 | B-Multi-tissue_structure | 228 | 62 | 73 | 0.786207 | 0.757475 | 0.771574 | | 24 | Macro-average | 9159 | 1398 | 2948 | 0.76803 | 0.548396 | 0.639891 | | 25 | Micro-average | 9159 | 1398 | 
2948 | 0.867576 | 0.756505 | 0.808242 | ``` --- layout: model title: English asr_wav2vec2_base_10000 TFWav2Vec2ForCTC from jiobiala24 author: John Snow Labs name: pipeline_asr_wav2vec2_base_10000 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_10000` is an English model originally trained by jiobiala24. NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_base_10000_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_10000_en_4.2.0_3.0_1664118418185.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_10000_en_4.2.0_3.0_1664118418185.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_10000', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_10000", lang = "en") val annotations = pipeline.transform(audioDF) ```
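The pipeline expects an audio DataFrame whose raw-sample column contains the waveform as an array of floats. A stdlib-only sketch of loading a 16-bit PCM WAV file into that form (the helper name is illustrative; the 16 kHz mono assumption matches how Wav2Vec2 models are typically trained):

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit PCM WAV file and normalize samples to floats in [-1, 1]."""
    with wave.open(path, "rb") as w:
        if w.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        frames = w.readframes(w.getnframes())
    # "<h" = little-endian signed 16-bit integers, one per sample.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]
```

The resulting list of floats is what you would place in each row of the audio DataFrame before calling `pipeline.transform`.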
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_10000| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|349.3 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Spanish Legal RoBERTa Embeddings author: John Snow Labs name: roberta_embeddings_RoBERTalex date: 2022-04-14 tags: [roberta, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description RoBERTa Legal Embeddings, trained by `PlanTL-GOB-ES`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_RoBERTalex_es_3.4.2_3.0_1649945353788.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_RoBERTalex_es_3.4.2_3.0_1649945353788.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_RoBERTalex","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_RoBERTalex","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.RoBERTalex").predict("""Me encanta chispa nlp""") ```
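Downstream tasks often pool these token-level embeddings into a single sentence vector (Spark NLP's SentenceEmbeddings annotator with the AVERAGE pooling strategy does exactly this). A plain-Python sketch of mean pooling:

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token embedding vectors into one vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]
```

The pooled vector keeps the embedding dimensionality while summarizing the whole token sequence.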
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_RoBERTalex| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|300.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/BSC-TeMU/RoBERTalex - https://github.com/PlanTL-GOB-ES/lm-legal-es --- layout: model title: Fast Neural Machine Translation Model from Luganda to English author: John Snow Labs name: opus_mt_lg_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, lg, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `lg` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_lg_en_xx_2.7.0_2.4_1609166333092.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_lg_en_xx_2.7.0_2.4_1609166333092.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_lg_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_lg_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.lg.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
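MarianTransformer translates one sentence at a time, which is why a sentence detector precedes it in the pipeline. For intuition only, sentence splitting can be approximated with the standard library (the pretrained SentenceDetectorDLModel used above is far more robust than this sketch):

```python
import re

def naive_sentences(text):
    """Split text after sentence-final punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```

Each returned sentence would then be translated independently and the outputs concatenated.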
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_lg_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Detect Diseases author: John Snow Labs name: ner_diseases date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for diseases. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. ## Predicted Entities ``Disease``. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DIAG_PROC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_diseases_en_3.0.0_3.0_1617209733880.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_diseases_en_3.0.0_3.0_1617209733880.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_diseases", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive. 
']], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_diseases", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive. 
""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.diseases").predict("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive. """) ```
## Results ```bash +------------------------------+---------+ |chunk |ner | +------------------------------+---------+ |the cyst |Disease | |a large Prolene suture |Disease | |a very small incisional hernia|Disease | |the hernia cavity |Disease | |omentum |Disease | |the hernia |Disease | |the wound lesion |Disease | |The lesion |Disease | |the existing scar |Disease | |the cyst |Disease | |the wound |Disease | |this cyst down to its base |Disease | |a small incisional hernia |Disease | |The cyst |Disease | |The wound |Disease | +------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_diseases| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on i2b2 with ``embeddings_clinical``. https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|:--------------|-------:|-----:|-----:|---------:|---------:|---------:| | 0 | I-Disease | 5014 | 222 | 171 | 0.957601 | 0.96702 | 0.962288 | | 1 | B-Disease | 6004 | 213 | 159 | 0.965739 | 0.974201 | 0.969952 | | 2 | Macro-average | 11018 | 435 | 330 | 0.96167 | 0.970611 | 0.96612 | | 3 | Micro-average | 11018 | 435 | 330 | 0.962019 | 0.97092 | 0.966449 | ``` --- layout: model title: Legal Purchase agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_purchase_agreement date: 2022-10-24 tags: [en, legal, classification, document, agreement, contract, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_purchase_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class 
purchase-agreement or not (Binary Classification). Longformer models have a limit of 4096 tokens, so only the first 4096 tokens are taken into account. We have found that, for the large majority of documents in legal corpora that are clean and contain only the legal document itself without extra leading material, 4096 tokens are enough for document classification. If that is not the case for your documents, let us know and we can take another approach: split the document into 4096-token chunks, average their embeddings, and train on the averaged vectors, so that the whole document is taken into account. In theory, however, this should not be required. ## Predicted Entities `other`, `purchase-agreement` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_purchase_agreement_en_1.0.0_3.0_1666621006686.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_purchase_agreement_en_1.0.0_3.0_1666621006686.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token") \ .setOutputCol("embeddings") sembeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_purchase_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, sembeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
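The chunk-and-average fallback described above for documents longer than 4096 tokens can be sketched as follows; `embed` here is a stand-in for any function that maps a token window to a single embedding vector:

```python
def chunked_average(tokens, embed, max_len=4096):
    """Embed a long token sequence window by window and average the window vectors."""
    windows = [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
    vectors = [embed(w) for w in windows]
    dim = len(vectors[0])
    n = len(vectors)
    return [sum(vec[i] for vec in vectors) / n for i in range(dim)]
```

The averaged vector can then be fed to the classifier in place of the truncated-document embedding, so no part of the document is discarded.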
## Results ```bash +--------------------+ | result| +--------------------+ |[purchase-agreement]| | [other]| | [other]| |[purchase-agreement]| +--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_purchase_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.0 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.97 1.00 0.98 62 purchase-agreement 1.00 0.93 0.97 30 accuracy - - 0.98 92 macro-avg 0.98 0.97 0.97 92 weighted-avg 0.98 0.98 0.98 92 ``` --- layout: model title: Extract Demographic Entities from Voice of the Patient Documents (embeddings_clinical) author: John Snow Labs name: ner_vop_demographic date: 2023-06-06 tags: [licensed, clinical, ner, en, vop, demographic] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts demographic terms from documents written in the patient’s own words. ## Predicted Entities `Gender`, `Employment`, `RaceEthnicity`, `Age`, `Substance`, `RelationshipStatus`, `SubstanceQuantity` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_demographic_en_4.4.3_3.0_1686075018843.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_demographic_en_4.4.3_3.0_1686075018843.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_demographic", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["My grandma, who's 85 and Black, just had a pacemaker implanted in the cardiology department. 
The doctors say it'll help regulate her heartbeat and prevent future complications."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_demographic", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("My grandma, who's 85 and Black, just had a pacemaker implanted in the cardiology department. The doctors say it'll help regulate her heartbeat and prevent future complications.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:--------|:--------------| | grandma | Gender | | 85 | Age | | Black | RaceEthnicity | | doctors | Employment | | her | Gender | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_demographic| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
## Benchmarking ```bash label tp fp fn total precision recall f1 Gender 1296 30 21 1317 0.98 0.98 0.98 Employment 1183 70 60 1243 0.94 0.95 0.95 RaceEthnicity 30 1 3 33 0.97 0.91 0.94 Age 533 44 49 582 0.92 0.92 0.92 Substance 386 49 35 421 0.89 0.92 0.90 RelationshipStatus 21 3 3 24 0.88 0.88 0.88 SubstanceQuantity 61 14 24 85 0.81 0.72 0.76 macro_avg 3510 211 195 3705 0.91 0.90 0.90 micro_avg 3510 211 195 3705 0.94 0.95 0.95 ``` --- layout: model title: English DistilBertForTokenClassification Cased model (from Lucifermorningstar011) author: John Snow Labs name: distilbert_token_classifier_autotrain_final_784824209 date: 2023-03-03 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-final-784824209` is a English model originally trained by `Lucifermorningstar011`. ## Predicted Entities `9`, `0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824209_en_4.3.1_3.0_1677881842322.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_final_784824209_en_4.3.1_3.0_1677881842322.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824209","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_final_784824209","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_final_784824209| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Lucifermorningstar011/autotrain-final-784824209 --- layout: model title: Sentence Entity Resolver for HCPCS Codes author: John Snow Labs name: sbiobertresolve_hcpcs date: 2021-09-29 tags: [entity_resolution, hcpcs, licensed, en, clinical] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.2.3 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to [Healthcare Common Procedure Coding System (HCPCS)](https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/HCPCS/index.html#:~:text=The%20Healthcare%20Common%20Procedure%20Coding,%2C%20supplies%2C%20products%20and%20services.) codes using 'sbiobert_base_cased_mli' sentence embeddings. It also returns the domain information of the codes in the `all_k_aux_labels` parameter in the metadata of the result. 
## Predicted Entities

`HCPCS Codes`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_HCPCS/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_hcpcs_en_3.2.3_2.4_1632909577033.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_hcpcs_en_3.2.3_2.4_1632909577033.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use

The `sbiobertresolve_hcpcs` resolver model must be used with `sbiobert_base_cased_mli` as the embeddings model and `ner_jsl` as the NER model, with `Procedure` set in `.setWhiteList()`.
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("ner_chunk")

sbert_embedder = BertSentenceEmbeddings\
    .pretrained('sbiobert_base_cased_mli', 'en', 'clinical/models')\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("sentence_embeddings")

hcpcs_resolver = SentenceEntityResolverModel\
    .pretrained("sbiobertresolve_hcpcs", "en", "clinical/models")\
    .setInputCols(["ner_chunk", "sentence_embeddings"])\
    .setOutputCol("hcpcs_code")\
    .setDistanceFunction("EUCLIDEAN")

hcpcs_pipeline = Pipeline(stages=[documentAssembler, sbert_embedder, hcpcs_resolver])

data = spark.createDataFrame([["Breast prosthesis, mastectomy bra, with integrated breast prosthesis form, unilateral, any size, any type"]]).toDF("text")

results = hcpcs_pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("ner_chunk")

val sbert_embedder = BertSentenceEmbeddings
    .pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
    .setInputCols(Array("ner_chunk"))
    .setOutputCol("sentence_embeddings")

val hcpcs_resolver = SentenceEntityResolverModel
    .pretrained("sbiobertresolve_hcpcs", "en", "clinical/models")
    .setInputCols(Array("ner_chunk", "sentence_embeddings"))
    .setOutputCol("hcpcs_code")
    .setDistanceFunction("EUCLIDEAN")

val hcpcs_pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, hcpcs_resolver))

val data = Seq("Breast prosthesis, mastectomy bra, with integrated breast prosthesis form, unilateral, any size, any type").toDF("text")

val results = hcpcs_pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.hcpcs").predict("""Breast prosthesis, mastectomy bra, with integrated breast prosthesis form, unilateral, any size, any type""")
```
## Results ```bash +--+---------------------------------------------------------------------------------------------------------+----------+----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+ | |ner_chunk |hcpcs_code|all_codes |resolutions |domain | +--+---------------------------------------------------------------------------------------------------------+----------+----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+ |0 |Breast prosthesis, mastectomy bra, with integrated breast prosthesis form, unilateral, any size, any type|L8001 |[L8001, L8002, L8000, L8033, L8032, ...]|'Breast prosthesis, mastectomy bra, with integrated breast prosthesis form, unilateral, any size, any type', 'Breast prosthesis, mastectomy bra, with integrated breast prosthesis form, bilateral, any size, any type', 'Breast prosthesis, mastectomy bra, without integrated breast prosthesis form, any size, any type', 'Nipple prosthesis, custom fabricated, reusable, any material, any type, each', ...|Device, Device, Device, Device, Device, ...| 
+--+---------------------------------------------------------------------------------------------------------+----------+----------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_hcpcs| |Compatibility:|Healthcare NLP 3.2.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[hcpcs_code]| |Language:|en| |Case sensitive:|false| --- layout: model title: Pipeline to Extract neurologic deficits related to Stroke Scale (NIHSS) author: John Snow Labs name: ner_nihss_pipeline date: 2023-03-14 tags: [ner, en, licensed, clinical] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_nihss](https://nlp.johnsnowlabs.com/2021/11/15/ner_nihss_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_nihss_pipeline_en_4.3.0_3.2_1678778218996.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_nihss_pipeline_en_4.3.0_3.2_1678778218996.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_nihss_pipeline", "en", "clinical/models") text = '''Abdomen , soft , nontender . NIH stroke scale on presentation was 23 to 24 for , one for consciousness , two for month and year and two for eye / grip , one to two for gaze , two for face , eight for motor , one for limited ataxia , one to two for sensory , three for best language and two for attention . On the neurologic examination the patient was intermittently''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_nihss_pipeline", "en", "clinical/models") val text = "Abdomen , soft , nontender . NIH stroke scale on presentation was 23 to 24 for , one for consciousness , two for month and year and two for eye / grip , one to two for gaze , two for face , eight for motor , one for limited ataxia , one to two for sensory , three for best language and two for attention . On the neurologic examination the patient was intermittently" val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.nihss_pipeline").predict("""Abdomen , soft , nontender . NIH stroke scale on presentation was 23 to 24 for , one for consciousness , two for month and year and two for eye / grip , one to two for gaze , two for face , eight for motor , one for limited ataxia , one to two for sensory , three for best language and two for attention . On the neurologic examination the patient was intermittently""") ```
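Downstream of this pipeline, the extracted `Measurement` chunks can be paired with the NIHSS item entities they score (see the Results table below, where each score precedes its item, as in "one for consciousness"). An illustrative pure-Python post-processing sketch; the pairing heuristic and the entity list are assumptions for the example, not part of the pipeline:

```python
# Illustrative sketch: pair each NIHSS item entity with the nearest
# preceding Measurement chunk, mirroring the "score for item" phrasing.
def pair_scores(entities):
    """entities: list of (chunk, label) pairs in document order."""
    pairs, last_measurement = [], None
    for chunk, label in entities:
        if label == "Measurement":
            last_measurement = chunk          # remember the latest score
        elif label != "NIHSS" and last_measurement is not None:
            pairs.append((label, last_measurement))
            last_measurement = None           # each score is used once
    return pairs

entities = [("NIH stroke scale", "NIHSS"), ("23 to 24", "Measurement"),
            ("one", "Measurement"), ("consciousness", "1a_LOC"),
            ("two", "Measurement"), ("month and year", "1b_LOCQuestions")]
print(pair_scores(entities))
# [('1a_LOC', 'one'), ('1b_LOCQuestions', 'two')]
```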
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-----------------|--------:|------:|:----------------|-------------:| | 0 | NIH stroke scale | 29 | 44 | NIHSS | 0.973533 | | 1 | 23 to 24 | 66 | 73 | Measurement | 0.870567 | | 2 | one | 81 | 83 | Measurement | 0.8726 | | 3 | consciousness | 89 | 101 | 1a_LOC | 0.6322 | | 4 | two | 105 | 107 | Measurement | 0.9665 | | 5 | month and year | 113 | 126 | 1b_LOCQuestions | 0.846433 | | 6 | two | 132 | 134 | Measurement | 0.9659 | | 7 | eye / grip | 140 | 149 | 1c_LOCCommands | 0.889433 | | 8 | one | 153 | 155 | Measurement | 0.9917 | | 9 | two | 160 | 162 | Measurement | 0.5144 | | 10 | gaze | 168 | 171 | 2_BestGaze | 0.7272 | | 11 | two | 175 | 177 | Measurement | 0.9872 | | 12 | face | 183 | 186 | 4_FacialPalsy | 0.8758 | | 13 | eight | 190 | 194 | Measurement | 0.9013 | | 14 | one | 208 | 210 | Measurement | 0.9343 | | 15 | limited | 216 | 222 | 7_LimbAtaxia | 0.9326 | | 16 | ataxia | 224 | 229 | 7_LimbAtaxia | 0.5762 | | 17 | one to two | 233 | 242 | Measurement | 0.79 | | 18 | sensory | 248 | 254 | 8_Sensory | 0.9892 | | 19 | three | 258 | 262 | Measurement | 0.8896 | | 20 | best language | 268 | 280 | 9_BestLanguage | 0.89415 | | 21 | two | 286 | 288 | Measurement | 0.949 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_nihss_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Detect Chemicals in Medical Text author: John Snow Labs name: bert_token_classifier_ner_bc5cdr_chemicals date: 2022-07-25 tags: [en, ner, clinical, licensed, bertfortokenclasification] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: 
MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Chemicals, diseases, and their relations are among the most searched topics by PubMed users worldwide as they play central roles in many areas of biomedical research and healthcare, such as drug discovery and safety surveillance. In addition, identifying chemicals as biomarkers can be helpful in informing potential relationships between chemicals and pathologies. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. The model detects chemicals from a medical text. ## Predicted Entities `CHEM` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc5cdr_chemicals_en_4.0.0_3.0_1658752308692.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc5cdr_chemicals_en_4.0.0_3.0_1658752308692.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_bc5cdr_chemicals", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(512)

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, ner_model, ner_converter])

data = spark.createDataFrame([["""The possibilities that these cardiovascular findings might be the result of non-selective inhibition of monoamine oxidase or of amphetamine and metamphetamine are discussed. The results have shown that the degradation product p-choloroaniline is not a significant factor in chlorhexidine-digluconate associated erosive cystitis.
A high percentage of kanamycin - colistin and povidone-iodine irrigations were associated with erosive cystitis and suggested a possible complication with human usage."""]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_bc5cdr_chemicals", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("ner")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, ner_model, ner_converter))

val data = Seq("""The possibilities that these cardiovascular findings might be the result of non-selective inhibition of monoamine oxidase or of amphetamine and metamphetamine are discussed. The results have shown that the degradation product p-choloroaniline is not a significant factor in chlorhexidine-digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone-iodine irrigations were associated with erosive cystitis and suggested a possible complication with human usage.""").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.classify.token_bert.bc5cdr_chemicals").predict("""The possibilities that these cardiovascular findings might be the result of non-selective inhibition of monoamine oxidase or of amphetamine and metamphetamine are discussed.
The results have shown that the degradation product p-choloroaniline is not a significant factor in chlorhexidine-digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone-iodine irrigations were associated with erosive cystitis and suggested a possible complication with human usage.""") ```
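The Benchmarking section of this card reports macro and weighted averages alongside the per-label scores. As a quick illustration (the values below are taken from that table), those averages are derived from the per-label F1 scores and supports like this:

```python
# Illustrative check of how the macro-avg and weighted-avg rows in the
# Benchmarking table follow from the per-label (f1, support) values.
per_label = {"B-CHEM": (0.9309, 5385), "I-CHEM": (0.8539, 1628)}  # f1, support

total = sum(support for _, support in per_label.values())
macro = sum(f1 for f1, _ in per_label.values()) / len(per_label)
weighted = sum(f1 * support for f1, support in per_label.values()) / total

print(macro, weighted)  # about 0.8924 and 0.9130, matching the table
```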
## Results ```bash +-------------------------+-----+ |ner_chunk |label| +-------------------------+-----+ |amphetamine |CHEM | |metamphetamine |CHEM | |p-choloroaniline |CHEM | |chlorhexidine-digluconate|CHEM | |kanamycin |CHEM | |colistin |CHEM | |povidone-iodine |CHEM | +-------------------------+-----+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_bc5cdr_chemicals| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References [https://github.com/cambridgeltl/MTL-Bioinformatics-2016](https://github.com/cambridgeltl/MTL-Bioinformatics-2016) ## Benchmarking ```bash label precision recall f1-score support B-CHEM 0.8920 0.9734 0.9309 5385 I-CHEM 0.8129 0.8993 0.8539 1628 micro-avg 0.8734 0.9562 0.9129 7013 macro-avg 0.8524 0.9364 0.8924 7013 weighted-avg 0.8736 0.9562 0.9130 7013 ``` --- layout: model title: Legal BERT Sentence Base Uncased Embedding author: John Snow Labs name: sent_bert_base_uncased_legal date: 2021-09-06 tags: [legal, english, open_source, bert_sentence_embeddings, uncased, en] task: Embeddings language: en nav_key: models edition: Spark NLP 3.2.2 spark_version: 3.0 supported: true recommended: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description LEGAL-BERT is a family of BERT models for the legal domain, intended to assist legal NLP research, computational law, and legal technology applications. To pre-train the different variations of LEGAL-BERT, we collected 12 GB of diverse English legal text from several fields (e.g., legislation, court cases, contracts) scraped from publicly available resources. Sub-domains variants (CONTRACTS-, EURLEX-, ECHR-) and/or general LEGAL-BERT perform better than using BERT out of the box for domain-specific tasks. 
A lightweight model (33% the size of BERT-BASE) pre-trained from scratch on legal data with competitive performance is also available. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_base_uncased_legal_en_3.2.2_3.0_1630926286151.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_base_uncased_legal_en_3.2.2_3.0_1630926286151.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_uncased_legal", "en") \
    .setInputCols("sentence") \
    .setOutputCol("bert_sentence")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings])
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_uncased_legal", "en")
    .setInputCols("sentence")
    .setOutputCol("bert_sentence")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings))
```

{:.nlu-block}
```python
import nlu
nlu.load("en.embed_sentence.bert.base_uncased_legal").predict("""Put your text here.""")
```
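Sentence embeddings such as the `bert_sentence` output above are typically compared downstream with cosine similarity. A dependency-free illustrative sketch (this is not a Spark NLP API; the vectors are invented for the example):

```python
# Illustrative sketch: cosine similarity between two embedding vectors.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [0.2, 0.1, 0.4]
v2 = [0.2, 0.1, 0.4]
v3 = [-0.4, 0.0, 0.1]
print(cosine(v1, v2))  # identical vectors give a similarity of about 1.0
print(cosine(v1, v3))  # dissimilar vectors score lower (here, negative)
```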
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_bert_base_uncased_legal| |Compatibility:|Spark NLP 3.2.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[bert_sentence]| |Language:|en| |Case sensitive:|true| ## Data Source The model is imported from: https://huggingface.co/nlpaueb/legal-bert-base-uncased --- layout: model title: Portuguese Bert Embeddings (from Geotrend) author: John Snow Labs name: bert_embeddings_bert_base_pt_cased date: 2022-04-11 tags: [bert, embeddings, pt, open_source] task: Embeddings language: pt edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-pt-cased` is a Portuguese model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_pt_cased_pt_3.4.2_3.0_1649674030177.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_pt_cased_pt_3.4.2_3.0_1649674030177.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_pt_cased","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Eu amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_pt_cased","pt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Eu amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.embed.bert_base_pt_cased").predict("""Eu amo Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_pt_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|pt| |Size:|395.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-pt-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Legal Subordination Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_subordination_agreement_bert date: 2023-02-02 tags: [en, legal, classification, subordination, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_subordination_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `subordination-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `subordination-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_subordination_agreement_bert_en_1.0.0_3.0_1675359870706.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_subordination_agreement_bert_en_1.0.0_3.0_1675359870706.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_subordination_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-------------------------+
|result                   |
+-------------------------+
|[subordination-agreement]|
|[other]                  |
|[other]                  |
|[subordination-agreement]|
+-------------------------+
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_subordination_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.91 0.95 0.93 73 subordination-agreement 0.88 0.81 0.85 37 accuracy - - 0.90 110 macro-avg 0.90 0.88 0.89 110 weighted-avg 0.90 0.90 0.90 110 ``` --- layout: model title: Part of Speech for Slovak author: John Snow Labs name: pos_ud_snk date: 2021-03-08 tags: [part_of_speech, open_source, slovak, pos_ud_snk, sk] task: Part of Speech Tagging language: sk edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`.
## Predicted Entities - ADV - DET - ADJ - NOUN - VERB - PUNCT - PART - AUX - SCONJ - PRON - ADP - PROPN - X - CCONJ - NUM - INTJ - SYM {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_snk_sk_3.0.0_3.0_1615230329573.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_snk_sk_3.0.0_3.0_1615230329573.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pos = PerceptronModel.pretrained("pos_ud_snk", "sk") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])

example = spark.createDataFrame([['Ahoj z Johna Snow Labs!']], ["text"])

result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val pos = PerceptronModel.pretrained("pos_ud_snk", "sk")
    .setInputCols(Array("document", "token"))
    .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))

val data = Seq("Ahoj z Johna Snow Labs!").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["Ahoj z Johna Snow Labs!"]
token_df = nlu.load('sk.pos.ud_snk').predict(text)
token_df
```
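The description calls this tagger an averaged perceptron. A toy pure-Python sketch of the underlying perceptron update, for illustration only; the real `PerceptronModel` uses far richer features plus weight averaging, and the tiny training data below is invented:

```python
# Toy perceptron tagger: score each tag from simple features and nudge the
# weights toward the gold tag on every mistake (no averaging, for brevity).
from collections import defaultdict

TAGS = ["NOUN", "ADP", "PROPN", "PUNCT"]
weights = defaultdict(float)  # (feature, tag) -> weight

def feats(word, prev_tag):
    return ["w=" + word.lower(), "prev=" + prev_tag]

def predict(word, prev_tag):
    return max(TAGS, key=lambda t: sum(weights[(f, t)] for f in feats(word, prev_tag)))

# (word, previous gold tag, gold tag) triples; invented for the example
train = [("Ahoj", "<s>", "NOUN"), ("z", "NOUN", "ADP"),
         ("Johna", "ADP", "PROPN"), ("!", "PROPN", "PUNCT")]

for _ in range(3):  # a few passes over the toy data
    for word, prev_tag, gold in train:
        guess = predict(word, prev_tag)
        if guess != gold:  # additive update toward the gold tag
            for f in feats(word, prev_tag):
                weights[(f, gold)] += 1.0
                weights[(f, guess)] -= 1.0

print(predict("z", "NOUN"))  # ADP
```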
## Results ```bash token pos 0 Ahoj NOUN 1 z ADP 2 Johna PROPN 3 Snow PROPN 4 Labs PROPN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_snk| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|sk| --- layout: model title: Multilingual XLMRoBerta Embeddings (from jhu-clsp) author: John Snow Labs name: xlmroberta_embeddings_roberta_large_eng_ara_128k date: 2022-05-13 tags: [en, ar, open_source, xlm_roberta, embeddings, xx] task: Embeddings language: xx edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: XlmRoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-large-eng-ara-128k` is a Multilingual model originally trained by `jhu-clsp`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_roberta_large_eng_ara_128k_xx_3.4.4_3.0_1652440561319.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_roberta_large_eng_ara_128k_xx_3.4.4_3.0_1652440561319.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_roberta_large_eng_ara_128k","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_roberta_large_eng_ara_128k","xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
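Token-level embeddings like the `embeddings` column produced above are often mean-pooled into a single document vector for downstream use. An illustrative dependency-free sketch (this pooling step is not part of the pipeline above):

```python
# Illustrative sketch: average per-token embedding vectors into one
# fixed-size document vector (mean pooling).
def mean_pool(token_vectors):
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```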
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_embeddings_roberta_large_eng_ara_128k| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|xx| |Size:|1.0 GB| |Case sensitive:|true| ## References - https://huggingface.co/jhu-clsp/roberta-large-eng-ara-128k --- layout: model title: Translate English to Afro-Asiatic languages Pipeline author: John Snow Labs name: translate_en_afa date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, afa, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `afa` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_afa_xx_2.7.0_2.4_1609691368945.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_afa_xx_2.7.0_2.4_1609691368945.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_afa", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_afa", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.afa').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_afa| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from janeel) author: John Snow Labs name: roberta_qa_muppet_base_finetuned_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `muppet-roberta-base-finetuned-squad` is a English model originally trained by `janeel`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_muppet_base_finetuned_squad_en_4.3.0_3.0_1674211542810.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_muppet_base_finetuned_squad_en_4.3.0_3.0_1674211542810.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_muppet_base_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_muppet_base_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_muppet_base_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/janeel/muppet-roberta-base-finetuned-squad --- layout: model title: Legal Benefits Of Indenture Clause Binary Classifier author: John Snow Labs name: legclf_benefits_of_indenture_clause date: 2023-01-27 tags: [en, legal, classification, benefits, indenture, clauses, benefits_of_indenture, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `benefits-of-indenture` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `benefits-of-indenture`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_benefits_of_indenture_clause_en_1.0.0_3.0_1674821259919.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_benefits_of_indenture_clause_en_1.0.0_3.0_1674821259919.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_benefits_of_indenture_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
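The paragraph-level splitting recommended above can be done before the Spark NLP pipeline with plain Python. The sketch below is a minimal, hypothetical helper (`split_paragraphs` and its character budget `max_chars` are our own names; characters are used only as a rough proxy for the 512-token embedding limit):

```python
import re

def split_paragraphs(text: str, max_chars: int = 2000):
    """Split a long legal document into paragraph-level chunks.

    Paragraphs are assumed to be separated by blank lines; paragraphs longer
    than `max_chars` are split further on sentence-ish boundaries as a rough
    proxy for the 512-token limit of the sentence embeddings.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    for p in paragraphs:
        if len(p) <= max_chars:
            chunks.append(p)
            continue
        # Fall back to naive sentence splitting for oversized paragraphs.
        sentences = re.split(r"(?<=[.;])\s+", p)
        buf = ""
        for s in sentences:
            if buf and len(buf) + len(s) + 1 > max_chars:
                chunks.append(buf)
                buf = s
            else:
                buf = f"{buf} {s}".strip()
        if buf:
            chunks.append(buf)
    return chunks
```

Each returned chunk can then be loaded as a separate row of the `text` column and classified independently.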
## Results ```bash +-----------------------+ |result                 | +-----------------------+ |[benefits-of-indenture]| |[other]                | |[other]                | |[benefits-of-indenture]| +-----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_benefits_of_indenture_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support benefits-of-indenture 1.00 0.82 0.90 22 other 0.87 1.00 0.93 27 accuracy - - 0.92 49 macro-avg 0.94 0.91 0.92 49 weighted-avg 0.93 0.92 0.92 49 ``` --- layout: model title: Detect Assertion Status for Oncology Treatments author: John Snow Labs name: assertion_oncology_treatment_binary_wip date: 2022-07-25 tags: [licensed, english, clinical, assertion, oncology, cancer, treatment, en] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 3.5.0 spark_version: 3.0 supported: true published: false annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Assign assertion status to oncology treatment entities extracted by ner_oncology_wip based on their context. This model predicts if a treatment mentioned in text has been used by the patient (in the past or in the present) or not used (mentioned as absent, as a treatment plan or as something hypothetical). 
## Predicted Entities `Present_Or_Past`, `Hypothetical_Or_Absent` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_treatment_binary_wip_en_3.5.0_3.0_1658774066204.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_treatment_binary_wip_en_3.5.0_3.0_1658774066204.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(['document'])\ .setOutputCol('sentence') tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained('embeddings_clinical', 'en', 'clinical/models')\ .setInputCols(["sentence", 'token']) \ .setOutputCol("embeddings") ner_oncology = MedicalNerModel.pretrained('ner_oncology_wip', 'en', 'clinical/models')\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_oncology_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(['Chemotherapy', 'Immunotherapy', 'Hormonal_Therapy', 'Targeted_Therapy', 'Unspecific_Therapy', 'Cancer_Therapy', 'Radiotherapy']) assertion = AssertionDLModel.pretrained("assertion_oncology_treatment_binary_wip", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner_oncology, ner_oncology_converter, assertion]) data = spark.createDataFrame([["The patient underwent a mastectomy three years ago. She continued with paclitaxel and trastuzumab for her breast cancer. She was not treated with radiotherapy. 
We discussed the possibility of using chemotherapy."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_oncology = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_oncology_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Chemotherapy", "Immunotherapy", "Hormonal_Therapy", "Targeted_Therapy", "Unspecific_Therapy", "Cancer_Therapy", "Radiotherapy")) val assertion = AssertionDLModel.pretrained("assertion_oncology_treatment_binary_wip", "en", "clinical/models") .setInputCols(Array("sentence", "ner_chunk", "embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, ner_oncology, ner_oncology_converter, assertion)) val data = Seq("""The patient underwent a mastectomy three years ago. She continued with paclitaxel and trastuzumab for her breast cancer. She was not treated with radiotherapy. We discussed the possibility of using chemotherapy.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.oncology_treatment_binary_wip").predict("""The patient underwent a mastectomy three years ago. She continued with paclitaxel and trastuzumab for her breast cancer. 
She was not treated with radiotherapy. We discussed the possibility of using chemotherapy.""") ```
## Results ```bash +------------+-----+---+----------------+----------------------+ | chunk|begin|end| ner_label| assertion_status| +------------+-----+---+----------------+----------------------+ | mastectomy| 24| 33| Cancer_Surgery| Present_Or_Past| | paclitaxel| 71| 80| Chemotherapy| Present_Or_Past| | trastuzumab| 86| 96|Targeted_Therapy| Present_Or_Past| |radiotherapy| 146|157| Radiotherapy|Hypothetical_Or_Absent| |chemotherapy| 198|209| Chemotherapy|Hypothetical_Or_Absent| +------------+-----+---+----------------+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_oncology_treatment_binary_wip| |Compatibility:|Healthcare NLP 3.5.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion_pred]| |Language:|en| |Size:|1.4 MB| ## References Trained on case reports sampled from PubMed, and annotated in-house. ## Benchmarking ```bash label tp fp fn prec rec f1 Present_Or_Past 50 13 14 0.7936508 0.78125 0.78740156 Hypothetical_Or_Absent 68 14 13 0.8292683 0.83950615 0.83435583 Macro-average 118 27 27 0.8114595 0.8103781 0.8109184 Micro-average 118 27 27 0.8137931 0.8137931 0.81379306 ``` --- layout: model title: Spanish BertForSequenceClassification Tiny Cased model (from mrm8488) author: John Snow Labs name: bert_sequence_classifier_spanish_tinybert_betito_finetuned_mnli date: 2022-07-13 tags: [es, open_source, bert, sequence_classification] task: Text Classification language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanish-TinyBERT-betito-finetuned-mnli` is a Spanish model originally trained by `mrm8488`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_spanish_tinybert_betito_finetuned_mnli_es_4.0.0_3.0_1657720799471.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_spanish_tinybert_betito_finetuned_mnli_es_4.0.0_3.0_1657720799471.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") classifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_spanish_tinybert_betito_finetuned_mnli","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, classifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val classifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_spanish_tinybert_betito_finetuned_mnli","es") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, classifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_spanish_tinybert_betito_finetuned_mnli| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|es| |Size:|54.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mrm8488/spanish-TinyBERT-betito-finetuned-mnli --- layout: model title: Fast Neural Machine Translation Model from English to German author: John Snow Labs name: opus_mt_en_de date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, de, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `de` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_de_xx_2.7.0_2.4_1609166744674.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_de_xx_2.7.0_2.4_1609166744674.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_de", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_de", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.de').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_de| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Relation extraction between Drugs and ADE (biobert) author: John Snow Labs name: re_ade_biobert date: 2021-07-16 tags: [licensed, en, relation_extraction, clinical] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.1.2 spark_version: 3.0 supported: true annotator: RelationExtractionModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model relates drugs and the adverse reactions caused by them; it predicts whether an adverse event is caused by a drug or not. It is based on `biobert_pubmed_base_cased` embeddings. `1`: the adverse event and drug entities are related; `0`: the adverse event and drug entities are not related. ## Predicted Entities `0`, `1` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ADE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_ade_biobert_en_3.1.2_3.0_1626434941786.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_ade_biobert_en_3.1.2_3.0_1626434941786.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") embedder = BertEmbeddings.pretrained("biobert_pubmed_base_cased", "en")\ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") ner_tagger = MedicalNerModel() \ .pretrained("ner_ade_biobert", "en", "clinical/models") \ .setInputCols(["sentences", "tokens", "embeddings"]) \ .setOutputCol("ner_tags") ner_chunker = NerConverter() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = sparknlp.annotators.DependencyParserModel()\ .pretrained("dependency_conllu", "en")\ .setInputCols(["sentences", "pos_tags", "tokens"])\ .setOutputCol("dependencies") re_model = RelationExtractionModel()\ .pretrained("re_ade_biobert", "en", 'clinical/models')\ .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\ .setOutputCol("relations")\ .setMaxSyntacticDistance(3)\ .setPredictionThreshold(0.5)\ .setRelationPairs(["ade-drug", "drug-ade"]) nlp_pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, embedder, pos_tagger, ner_tagger, ner_chunker, dependency_parser, re_model]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) text ="""Been taking Lipitor for 15 years , have experienced sever fatigue a lot!!! . 
Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""" annotations = light_pipeline.fullAnnotate(text) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val embedder = BertEmbeddings.pretrained("biobert_pubmed_base_cased", "en") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_ade_biobert", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") val re_model = RelationExtractionModel.pretrained("re_ade_biobert", "en", "clinical/models") .setInputCols(Array("embeddings", "pos_tags", "ner_chunks", "dependencies")) .setOutputCol("relations") .setMaxSyntacticDistance(3) // default: 0 .setPredictionThreshold(0.5f) // default: 0.5 .setRelationPairs(Array("drug-ade", "ade-drug")) // possible relation pairs; default: all relations val nlpPipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, embedder, pos_tagger, ner_tagger, ner_converter, dependency_parser, re_model)) val data = Seq("""Been taking Lipitor for 15 years , have experienced sever fatigue a lot!!! . 
Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""").toDS.toDF("text") val result = nlpPipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.ade_biobert").predict("""Been taking Lipitor for 15 years , have experienced sever fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps""") ```
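Once the annotations from `fullAnnotate` have been flattened into rows, keeping only the positive drug–ADE pairs is a small pure-Python step. This is a sketch under assumptions: `positive_ade_pairs` is a hypothetical helper, and the dict keys mirror the columns shown in the Results section below (how you flatten the annotations into such rows is up to you):

```python
def positive_ade_pairs(rows, label="1"):
    """Filter relation-extraction rows, keeping only the pairs the model
    marked as related (relation == "1").

    `rows` is assumed to be a list of dicts with keys matching the results
    table columns: chunk1, entity1, chunk2, entity2, relation.
    """
    return [
        (r["chunk1"], r["chunk2"])
        for r in rows
        if r["relation"] == label
    ]
```

Applied to the example document, this would keep only the (ADE, drug) pairs the model considers causally linked.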
## Results ```bash | | chunk1 | entity1 | chunk2 | entity2 | relation | |----|-------------------------------|------------|-------------|---------|----------| | 0 | sever fatigue | ADE | Lipitor | DRUG | 1 | | 1 | cramps | ADE | Lipitor | DRUG | 0 | | 2 | sever fatigue | ADE | voltaren | DRUG | 0 | | 3 | cramps | ADE | voltaren | DRUG | 1 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_ade_biobert| |Type:|re| |Compatibility:|Healthcare NLP 3.1.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]| |Output Labels:|[relations]| |Language:|en| ## Data Source This model is trained on custom data annotated by JSL. ## Benchmarking ```bash label precision recall f1-score support 0 0.91 0.92 0.92 1670 1 0.92 0.91 0.91 1673 micro-avg 0.92 0.92 0.92 3343 macro-avg 0.92 0.92 0.92 3343 weighted-avg 0.92 0.92 0.92 3343 ``` --- layout: model title: Pipeline to Detect Cellular/Molecular Biology Entities author: John Snow Labs name: bert_token_classifier_ner_cellular_pipeline date: 2022-03-10 tags: [cellular, ner, bert_token_classifier, en, licensed, clinical] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_cellular](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_cellular_en.html) model. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CELLULAR/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_CELLULAR.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_cellular_pipeline_en_3.4.1_2.4_1646908073117.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_cellular_pipeline_en_3.4.1_2.4_1646908073117.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline cellular_pipeline = PretrainedPipeline("bert_token_classifier_ner_cellular_pipeline", "en", "clinical/models") cellular_pipeline.fullAnnotate("Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val cellular_pipeline = new PretrainedPipeline("bert_token_classifier_ner_cellular_pipeline", "en", "clinical/models") cellular_pipeline.fullAnnotate("Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. 
We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.cellular_pipeline").predict("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""") ```
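After flattening the pipeline output into `(chunk, ner_label)` pairs like those in the Results section below, a quick sanity check is to tally the chunks per label. A minimal sketch, assuming you have already extracted such pairs (`count_labels` is a hypothetical helper, not part of the pretrained pipeline):

```python
from collections import Counter

def count_labels(chunks):
    """Count extracted chunks per NER label.

    `chunks` is assumed to be a list of (chunk_text, ner_label) pairs,
    obtained by flattening the pipeline's ner_chunk annotations.
    """
    return Counter(label for _, label in chunks)
```

On the example abstract above, this would summarize how many `protein` versus `DNA` mentions were detected.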
## Results ```bash +-----------------------------------------------------------+---------+ |chunk |ner_label| +-----------------------------------------------------------+---------+ |intracellular signaling proteins |protein | |human T-cell leukemia virus type 1 promoter |DNA | |Tax |protein | |Tax-responsive element 1 |DNA | |cyclic AMP-responsive members |protein | |CREB/ATF family |protein | |transcription factors |protein | |Tax |protein | |human T-cell leukemia virus type 1 Tax-responsive element 1|DNA | |TRE-1 |DNA | |lacZ gene |DNA | |CYC1 promoter |DNA | |TRE-1 |DNA | |cyclic AMP response element-binding protein |protein | |CREB |protein | |CREB |protein | |GAL4 activation domain |protein | |GAD |protein | |reporter gene |DNA | |Tax |protein | +-----------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_cellular_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.4 MB| ## Included Models - DocumentAssembler - TokenizerModel - MedicalBertForTokenClassifier - NerConverter - Finisher --- layout: model title: Typo Detector for Icelandic author: John Snow Labs name: distilbert_token_classifier_typo_detector date: 2022-01-19 tags: [typo, distilbert, icelandic, token_classification, is, open_source] task: Named Entity Recognition language: is edition: Spark NLP 3.3.4 spark_version: 3.0 supported: true annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model was imported from `Hugging Face` ([link](https://huggingface.co/m3hrdadfi/typo-detector-distilbert-is)) and has been trained on synthetic Icelandic data to detect typos, leveraging `DistilBERT` embeddings and `DistilBertForTokenClassification` for NER purposes. It classifies typo tokens as `PO`. 
## Predicted Entities `PO` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_typo_detector_is_3.3.4_3.0_1642599810600.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_typo_detector_is_3.3.4_3.0_1642599810600.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_typo_detector", "is")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) text = """Það er miög auðvelt að draga marktækar álykanir af texta með Spark NLP.""" data = spark.createDataFrame([[text]]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_typo_detector", "is") .setInputCols(Array("sentence","token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val example = Seq("Það er miög auðvelt að draga marktækar álykanir af texta með Spark NLP.").toDS.toDF("text") val result = pipeline.fit(example).transform(example) ```
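To eyeball the detections, the character offsets carried by the `ner_chunk` annotations can be used to mark typos inline. A minimal pure-Python sketch (`mark_typos` is a hypothetical helper; offsets are assumed to be character positions with *inclusive* end positions, per Spark NLP annotation conventions):

```python
def mark_typos(text, chunks):
    """Wrap detected typo spans in [[ ]] markers.

    `chunks` is assumed to be a list of (begin, end) character offsets with
    inclusive end positions. Spans are applied right-to-left so that the
    earlier offsets stay valid while the string grows.
    """
    for begin, end in sorted(chunks, reverse=True):
        text = text[:begin] + "[[" + text[begin:end + 1] + "]]" + text[end + 1:]
    return text
```

For the example sentence, marking the span of `miög` would render it as `[[miög]]` while leaving the rest of the text untouched.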
## Results ```bash +--------+---------+ |chunk |ner_label| +--------+---------+ |miög |PO | |álykanir|PO | +--------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_typo_detector| |Compatibility:|Spark NLP 3.3.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|is| |Size:|505.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## Benchmarking ```bash label precision recall f1-score support micro-avg 0.98954 0.967603 0.978448 43800.0 macro-avg 0.98954 0.967603 0.978448 43800.0 weighted-avg 0.98954 0.967603 0.978448 43800.0 ``` --- layout: model title: Detect PHI for Deidentification purposes (French) author: John Snow Labs name: ner_deid_subentity date: 2022-02-11 tags: [deid, fr, licensed] task: De-identification language: fr edition: Healthcare NLP 3.4.1 spark_version: 2.4 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Named Entity Recognition annotators allow a generic model to be trained using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (French) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 15 entities. This NER model is trained on an internally annotated custom dataset, the [French WikiNER dataset](https://metatext.io/datasets/wikiner), a [public dataset of French company names](https://www.data.gouv.fr/fr/datasets/entreprises-immatriculees-en-2017/), a [public dataset of French hospital names](https://salesdorado.com/fichiers-prospection/hopitaux/), and several data augmentation mechanisms. 
## Predicted Entities `PATIENT`, `HOSPITAL`, `DATE`, `ORGANIZATION`, `E-MAIL`, `USERNAME`, `ZIP`, `MEDICALRECORD`, `PROFESSION`, `PHONE`, `DOCTOR`, `AGE`, `STREET`, `CITY`, `COUNTRY` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.1.Clinical_Multi_Language_Deidentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_fr_3.4.1_2.4_1644590174130.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_fr_3.4.1_2.4_1644590174130.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "fr")\ .setInputCols(["sentence", "token"])\ .setOutputCol("word_embeddings") clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "fr", "clinical/models")\ .setInputCols(["sentence","token", "word_embeddings"])\ .setOutputCol("ner") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner]) text = ["J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015."] data = spark.createDataFrame([text]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "fr") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "fr", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner)) val text = "J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 
2015." val data = Seq(text).toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fr.med_ner.deid_subentity").predict("""J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015.""") ```
## Results ```bash +------------+----------+ | token| ner_label| +------------+----------+ | J'ai| O| | vu| O| | en| O| |consultation| O| | Michel| B-PATIENT| | Martinez| I-PATIENT| | (| O| | 49| B-AGE| | ans| O| | )| O| | adressé| O| | au| O| | Centre|B-HOSPITAL| | Hospitalier|I-HOSPITAL| | De|I-HOSPITAL| | Plaisir|I-HOSPITAL| | pour| O| | un| O| | diabète| O| | mal| O| | contrôlé| O| | avec| O| | des| O| | symptômes| O| | datant| O| | de| O| | Mars| B-DATE| | 2015| I-DATE| | .| O| +------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|fr| |Size:|15.0 MB| ## References - Internal JSL annotated corpus - [French WikiNER dataset](https://metatext.io/datasets/wikiner) - [Public dataset of French company names](https://www.data.gouv.fr/fr/datasets/entreprises-immatriculees-en-2017/) - [Public dataset of French hospital names](https://salesdorado.com/fichiers-prospection/hopitaux/) ## Benchmarking ```bash label tp fp fn total precision recall f1 PATIENT 1966.0 124.0 135.0 2101.0 0.9407 0.9357 0.9382 HOSPITAL 315.0 23.0 19.0 334.0 0.932 0.9431 0.9375 DATE 2605.0 31.0 49.0 2654.0 0.9882 0.9815 0.9849 ORGANIZATION 503.0 142.0 159.0 662.0 0.7798 0.7598 0.7697 CITY 2296.0 370.0 351.0 2647.0 0.8612 0.8674 0.8643 MAIL 46.0 0.0 0.0 46.0 1.0 1.0 1.0 STREET 31.0 4.0 3.0 34.0 0.8857 0.9118 0.8986 USERNAME 91.0 1.0 14.0 105.0 0.9891 0.8667 0.9239 ZIP 33.0 0.0 0.0 33.0 1.0 1.0 1.0 MEDICALRECORD 100.0 11.0 2.0 102.0 0.9009 0.9804 0.939 PROFESSION 321.0 59.0 87.0 408.0 0.8447 0.7868 0.8147 PHONE 114.0 3.0 2.0 116.0 0.9744 0.9828 0.9785 COUNTRY 287.0 14.0 51.0 338.0 0.9535 0.8491 0.8983 DOCTOR 622.0 7.0 4.0 626.0 0.9889 0.9936 0.9912 AGE 370.0 52.0 71.0 441.0 0.8768 0.839 0.8575 macro - - - - - - 0.9197 micro - - - - - - 0.9154 ``` --- layout: model title: 
Legal Underwriting Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_underwriting_agreement date: 2022-12-06 tags: [en, legal, classification, agreement, underwriting, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_underwriting_agreement` model is a Legal Longformer Document Classifier that determines whether a document belongs to the class `underwriting-agreement` or not (Binary Classification). Longformers are limited to 4096 tokens, so only the first 4096 tokens of each document are taken into account. We have found that for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document itself without extra leading material, 4096 tokens are enough to perform Document Classification. If not, let us know and we can take another approach for you: splitting the document into 4096-token chunks, averaging their embeddings, and training on the averaged version, so that the whole document is taken into account. In theory, though, this should not be required. ## Predicted Entities `underwriting-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_underwriting_agreement_en_1.0.0_3.0_1670358194573.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_underwriting_agreement_en_1.0.0_3.0_1670358194573.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_underwriting_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
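The chunk-and-average fallback described above can be sketched outside Spark NLP with plain NumPy. The function below is an illustrative assumption about how such averaging would work, not part of the Spark NLP API:

```python
import numpy as np

MAX_LEN = 4096  # Longformer context limit discussed above

def chunked_mean_embedding(token_embeddings, max_len=MAX_LEN):
    """Average per-token embeddings within each max_len-token chunk,
    then average the chunk vectors so the whole document contributes."""
    n_tokens, _ = token_embeddings.shape
    chunk_means = [
        token_embeddings[start:start + max_len].mean(axis=0)
        for start in range(0, n_tokens, max_len)
    ]
    return np.mean(chunk_means, axis=0)

# A 10,000-token document with 768-dim token embeddings collapses
# to a single 768-dim document vector.
doc = np.random.rand(10_000, 768)
vec = chunked_mean_embedding(doc)
print(vec.shape)  # (768,)
```

Note that averaging chunk means weights the final (shorter) chunk slightly more than an overall token mean would; for long, homogeneous legal documents this difference is usually negligible.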
## Results ```bash +------------------------+ |result | +------------------------+ |[underwriting-agreement]| |[other] | |[other] | |[underwriting-agreement]| +------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_underwriting_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.2 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.97 1.00 0.99 111 underwriting-agreement 1.00 0.93 0.96 43 accuracy - - 0.98 154 macro-avg 0.99 0.97 0.98 154 weighted-avg 0.98 0.98 0.98 154 ``` --- layout: model title: NER Model for 9 African Languages author: John Snow Labs name: distilbert_base_token_classifier_masakhaner date: 2022-01-18 tags: [xx, open_source] task: Named Entity Recognition language: xx edition: Spark NLP 3.3.4 spark_version: 3.0 supported: true annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is imported from `Hugging Face` ([link](https://huggingface.co/Davlan/distilbert-base-multilingual-cased-masakhaner)) and has been fine-tuned on the MasakhaNER dataset for 9 African languages (Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian Pidgin, Swahili, Wolof, and Yorùbá), leveraging `DistilBert` embeddings and `DistilBertForTokenClassification` for NER purposes. ## Predicted Entities `DATE`, `LOC`, `ORG`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_base_token_classifier_masakhaner_xx_3.3.4_3.0_1642512428599.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_base_token_classifier_masakhaner_xx_3.3.4_3.0_1642512428599.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_base_token_classifier_masakhaner", "xx")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) text = """Ilé-iṣẹ́ẹ Mohammed Sani Musa, Activate Technologies Limited, ni ó kó ẹ̀rọ Ìwé-pélébé Ìdìbò Alálòpẹ́ (PVCs) tí a lò fún ìbò ọdún-un 2019, ígbà tí ó jẹ́ òǹdíjedupò lábẹ́ ẹgbẹ́ olóṣèlúu tí ó ń tukọ̀ ètò ìṣèlú lọ́wọ́ All rogressives Congress (APC) fún Aṣojú Ìlà-Oòrùn Niger, ìyẹn gẹ́gẹ́ bí ilé iṣẹ́ aṣèwádìí, Premium Times ṣe tẹ̀ ẹ́ jáde.""" data = spark.createDataFrame([[text]]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_base_token_classifier_masakhaner", "xx") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val example = Seq("Ilé-iṣẹ́ẹ Mohammed Sani Musa, Activate Technologies Limited, ni ó kó ẹ̀rọ Ìwé-pélébé Ìdìbò 
Alálòpẹ́ (PVCs) tí a lò fún ìbò ọdún-un 2019, ígbà tí ó jẹ́ òǹdíjedupò lábẹ́ ẹgbẹ́ olóṣèlúu tí ó ń tukọ̀ ètò ìṣèlú lọ́wọ́ All rogressives Congress (APC) fún Aṣojú Ìlà-Oòrùn Niger, ìyẹn gẹ́gẹ́ bí ilé iṣẹ́ aṣèwádìí, Premium Times ṣe tẹ̀ ẹ́ jáde.").toDF("text") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("xx.ner.masakhaner.distilbert").predict("""Ilé-iṣẹ́ẹ Mohammed Sani Musa, Activate Technologies Limited, ni ó kó ẹ̀rọ Ìwé-pélébé Ìdìbò Alálòpẹ́ (PVCs) tí a lò fún ìbò ọdún-un 2019, ígbà tí ó jẹ́ òǹdíjedupò lábẹ́ ẹgbẹ́ olóṣèlúu tí ó ń tukọ̀ ètò ìṣèlú lọ́wọ́ All rogressives Congress (APC) fún Aṣojú Ìlà-Oòrùn Niger, ìyẹn gẹ́gẹ́ bí ilé iṣẹ́ aṣèwádìí, Premium Times ṣe tẹ̀ ẹ́ jáde.""") ```
## Results ```bash +-------------------------------+---------+ |chunk |ner_label| +-------------------------------+---------+ |Mohammed Sani Musa |PER | |Activate Technologies Limited |ORG | |ọdún-un 2019 |DATE | |All rogressives Congress |ORG | |APC |ORG | |Ìlà-Oòrùn Niger |LOC | |Premium Times |ORG | +-------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_base_token_classifier_masakhaner| |Compatibility:|Spark NLP 3.3.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|505.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## Data Source [https://github.com/masakhane-io/masakhane-ner](https://github.com/masakhane-io/masakhane-ner) ## Benchmarking ```bash language: F1-score: -------- -------- hau 88.88 ibo 84.87 kin 74.19 lug 78.43 luo 73.32 pcm 87.98 swa 86.20 wol 64.67 yor 78.10 ``` --- layout: model title: Stop Words Cleaner for Bengali author: John Snow Labs name: stopwords_bn date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: bn edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, bn] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_bn_bn_2.5.4_2.4_1594742440339.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_bn_bn_2.5.4_2.4_1594742440339.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_bn", "bn") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("উত্তরের রাজা হওয়া ছাড়াও জন স্নো একজন ইংরেজ চিকিত্সক এবং অবেদন এবং মেডিকেল হাইজিনের বিকাশের এক নেতা") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_bn", "bn") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("উত্তরের রাজা হওয়া ছাড়াও জন স্নো একজন ইংরেজ চিকিত্সক এবং অবেদন এবং মেডিকেল হাইজিনের বিকাশের এক নেতা").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""উত্তরের রাজা হওয়া ছাড়াও জন স্নো একজন ইংরেজ চিকিত্সক এবং অবেদন এবং মেডিকেল হাইজিনের বিকাশের এক নেতা"""] stopword_df = nlu.load('bn.stopwords').predict(text) stopword_df[["cleanTokens"]] ```
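Conceptually, the annotator simply filters tokens against a fixed word list. A minimal pure-Python sketch of that idea follows, using a tiny, hypothetical English stop-word set for illustration rather than the actual Bengali resource shipped with `stopwords_bn`:

```python
# Hypothetical miniature stop-word set for illustration only;
# the real stopwords_bn model ships a curated Bengali word list.
STOP_WORDS = {"a", "an", "and", "the", "of", "in"}

def clean_tokens(tokens):
    """Keep only tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(clean_tokens(["John", "Snow", "was", "a", "leader", "in", "anaesthesia"]))
# ['John', 'Snow', 'was', 'leader', 'anaesthesia']
```

The pretrained model does the same job at scale inside a Spark pipeline, emitting the surviving tokens in the `cleanTokens` column.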
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=6, result='উত্তরের', metadata={'sentence': '0'}), Row(annotatorType='token', begin=8, end=11, result='রাজা', metadata={'sentence': '0'}), Row(annotatorType='token', begin=13, end=17, result='হওয়া', metadata={'sentence': '0'}), Row(annotatorType='token', begin=19, end=24, result='ছাড়াও', metadata={'sentence': '0'}), Row(annotatorType='token', begin=26, end=27, result='জন', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_bn| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|bn| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: Pipeline to Summarize Clinical Guidelines author: John Snow Labs name: summarizer_clinical_guidelines_large_pipeline date: 2023-05-31 tags: [licensed, en, clinical, summarization, guidelines] task: Summarization language: en edition: Healthcare NLP 4.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [summarizer_clinical_guidelines_large](https://nlp.johnsnowlabs.com/2023/05/08/summarizer_clinical_guidelines_large_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_guidelines_large_pipeline_en_4.4.1_3.0_1685529261348.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_guidelines_large_pipeline_en_4.4.1_3.0_1685529261348.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("summarizer_clinical_guidelines_large_pipeline", "en", "clinical/models") text = """Clinical Guidelines for Breast Cancer: Breast cancer is the most common type of cancer among women. It occurs when the cells in the breast start growing abnormally, forming a lump or mass. This can result in the spread of cancerous cells to other parts of the body. Breast cancer may occur in both men and women but is more prevalent in women. The exact cause of breast cancer is unknown. However, several risk factors can increase your likelihood of developing breast cancer, such as: - A personal or family history of breast cancer - A genetic mutation, such as BRCA1 or BRCA2 - Exposure to radiation - Age (most commonly occurring in women over 50) - Early onset of menstruation or late menopause - Obesity - Hormonal factors, such as taking hormone replacement therapy Breast cancer may not present symptoms during its early stages. Symptoms typically manifest as the disease progresses. Some notable symptoms include: - A lump or thickening in the breast or underarm area - Changes in the size or shape of the breast - Nipple discharge - Nipple changes in appearance, such as inversion or flattening - Redness or swelling in the breast Treatment for breast cancer depends on several factors, including the stage of the cancer, the location of the tumor, and the individual's overall health. Common treatment options include: - Surgery (such as lumpectomy or mastectomy) - Radiation therapy - Chemotherapy - Hormone therapy - Targeted therapy Early detection is crucial for the successful treatment of breast cancer. Women are advised to routinely perform self-examinations and undergo regular mammogram testing starting at age 40. If you notice any changes in your breast tissue, consult with your healthcare provider immediately. 
""" result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("summarizer_clinical_guidelines_large_pipeline", "en", "clinical/models") val text = """Clinical Guidelines for Breast Cancer: Breast cancer is the most common type of cancer among women. It occurs when the cells in the breast start growing abnormally, forming a lump or mass. This can result in the spread of cancerous cells to other parts of the body. Breast cancer may occur in both men and women but is more prevalent in women. The exact cause of breast cancer is unknown. However, several risk factors can increase your likelihood of developing breast cancer, such as: - A personal or family history of breast cancer - A genetic mutation, such as BRCA1 or BRCA2 - Exposure to radiation - Age (most commonly occurring in women over 50) - Early onset of menstruation or late menopause - Obesity - Hormonal factors, such as taking hormone replacement therapy Breast cancer may not present symptoms during its early stages. Symptoms typically manifest as the disease progresses. Some notable symptoms include: - A lump or thickening in the breast or underarm area - Changes in the size or shape of the breast - Nipple discharge - Nipple changes in appearance, such as inversion or flattening - Redness or swelling in the breast Treatment for breast cancer depends on several factors, including the stage of the cancer, the location of the tumor, and the individual's overall health. Common treatment options include: - Surgery (such as lumpectomy or mastectomy) - Radiation therapy - Chemotherapy - Hormone therapy - Targeted therapy Early detection is crucial for the successful treatment of breast cancer. Women are advised to routinely perform self-examinations and undergo regular mammogram testing starting at age 40. If you notice any changes in your breast tissue, consult with your healthcare provider immediately. 
""" val result = pipeline.fullAnnotate(text) ```
## Results ```bash Overview of the disease: Breast cancer is the most common type of cancer among women, occurring when the cells in the breast start growing abnormally, forming a lump or mass. It can result in the spread of cancerous cells to other parts of the body. Causes: The exact cause of breast cancer is unknown, but several risk factors can increase the likelihood of developing it, such as a personal or family history, a genetic mutation, exposure to radiation, age, early onset of menstruation or late menopause, obesity, and hormonal factors. Symptoms: Symptoms of breast cancer typically manifest as the disease progresses, including a lump or thickening in the breast or underarm area, changes in the size or shape of the breast, nipple discharge, nipple changes in appearance, and redness or swelling in the breast. Treatment recommendations: Treatment for breast cancer depends on several factors, including the stage of the cancer, the location of the tumor, and the individual's overall health. Common treatment options include surgery, radiation therapy, chemotherapy, hormone therapy, and targeted therapy. Early detection is crucial for successful treatment of breast cancer. Women are advised to routinely perform self-examinations and undergo regular mammogram testing starting at age 40. 
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_clinical_guidelines_large_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|3.0 GB| ## Included Models - DocumentAssembler - MedicalSummarizer --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s543 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s543 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s543` is a German model originally trained by jonatasgrosman. NOTE: This pipeline only works on a CPU, if you need to use this pipeline on a GPU device please use pipeline_asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s543_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s543_de_4.2.0_3.0_1664118524786.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s543_de_4.2.0_3.0_1664118524786.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s543', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s543", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s543| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Sabanê RobertaForQuestionAnswering (from jgammack) author: John Snow Labs name: roberta_qa_SAE_roberta_base_squad date: 2022-06-20 tags: [open_source, question_answering, roberta] task: Question Answering language: sae edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `SAE-roberta-base-squad` is a Sabanê model originally trained by `jgammack`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_SAE_roberta_base_squad_sae_4.0.0_3.0_1655727417245.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_SAE_roberta_base_squad_sae_4.0.0_3.0_1655727417245.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_SAE_roberta_base_squad","sae") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_SAE_roberta_base_squad","sae") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("sae.answer_question.squad.roberta.base_sae.by_jgammack").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_SAE_roberta_base_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|sae| |Size:|466.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/jgammack/SAE-roberta-base-squad --- layout: model title: Sentence Entity Resolver for ICD10-CM (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_icd10cm language: en nav_key: models repository: clinical/models date: 2020-11-27 task: Entity Resolution edition: Healthcare NLP 2.6.4 spark_version: 2.4 tags: [clinical,entity_resolution,en] supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model maps extracted medical entities to ICD10-CM codes using sentence embeddings. {:.h2_title} ## Predicted Entities ICD10-CM Codes and their normalized definition with ``sbiobert_base_cased_mli`` sentence embeddings. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_en_2.6.4_2.4_1606235759310.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_en_2.6.4_2.4_1606235759310.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val chunk2doc = new Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val icd10_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_icd10cm","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer,
word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver)) val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.h2_title} ## Results ```bash +--------------------+-----+---+---------+------+----------+--------------------+--------------------+ | chunk|begin|end| entity| code|confidence| all_k_resolutions| all_k_codes| +--------------------+-----+---+---------+------+----------+--------------------+--------------------+ | hypertension| 68| 79| PROBLEM| I150| 0.2606|Renovascular hype...|I150:::K766:::I10...| |chronic renal ins...| 83|109| PROBLEM| N186| 0.2059|End stage renal d...|N186:::D631:::P96...| | COPD| 113|116| PROBLEM| I2781| 0.2132|Cor pulmonale (ch...|I2781:::J449:::J4...| | gastritis| 120|128| PROBLEM| K5281| 0.1425|Eosinophilic gast...|K5281:::K140:::K9...| | TIA| 136|138| PROBLEM| G459| 0.1152|Transient cerebra...|G459:::I639:::T79...| |a non-ST elevatio...| 182|202| PROBLEM| I214| 0.0889|Non-ST elevation ...|I214:::I256:::M62...| |Guaiac positive s...| 208|229| PROBLEM| K626| 0.0631|Ulcer of anus and...|K626:::K380:::R15...| |cardiac catheteri...| 295|317| TEST| Z950| 0.2549|Presence of cardi...|Z950:::Z95811:::I...| | PTCA| 324|327|TREATMENT| Z9861| 0.1268|Coronary angiopla...|Z9861:::Z9862:::I...| | mid LAD lesion| 332|345| PROBLEM|L02424| 0.1117|Furuncle of left ...|L02424:::Q202:::L...| 
+--------------------+-----+---+---------+------+----------+--------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---------------|---------------------| | Name: | sbiobertresolve_icd10cm | | Type: | SentenceEntityResolverModel | | Compatibility: | Spark NLP 2.6.4+ | | License: | Licensed | | Edition: | Official | |Input labels: | [ner_chunk, chunk_embeddings] | |Output labels: | [resolution] | | Language: | en | | Dependencies: | sbiobert_base_cased_mli | {:.h2_title} ## Data Source Trained on the ICD10 Clinical Modification dataset with ``sbiobert_base_cased_mli`` sentence embeddings. https://www.icd10data.com/ICD10CM/Codes/ --- layout: model title: Translate English to Swahili Pipeline author: John Snow Labs name: translate_en_sw date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, sw, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `sw` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_sw_xx_2.7.0_2.4_1609701869435.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_sw_xx_2.7.0_2.4_1609701869435.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_sw", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_sw", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.sw').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_sw| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from Ewe to English author: John Snow Labs name: opus_mt_ee_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ee, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `ee` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ee_en_xx_2.7.0_2.4_1609170215866.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ee_en_xx_2.7.0_2.4_1609170215866.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ee_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ee_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ee.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ee_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from Chinese to English author: John Snow Labs name: opus_mt_zh_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, zh, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `zh` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_zh_en_xx_2.7.0_2.4_1609166881669.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_zh_en_xx_2.7.0_2.4_1609166881669.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_zh_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_zh_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.zh.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_zh_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate English to Icelandic Pipeline author: John Snow Labs name: translate_en_is date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, is, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `is` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_is_xx_2.7.0_2.4_1609685862144.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_is_xx_2.7.0_2.4_1609685862144.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_is", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_is", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.is').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_is| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Context Spell Checker for English author: John Snow Labs name: spellcheck_dl date: 2022-04-01 tags: [spellcheck, en, open_source] task: Spell Check language: en nav_key: models edition: Spark NLP 3.4.1 spark_version: 2.4 supported: true annotator: ContextSpellCheckerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Spell Checker is a sequence-to-sequence model that detects and corrects spelling errors in your input text. It’s based on Levenshtein Automaton for generating candidate corrections and a Neural Language Model for ranking corrections. The model is trained for PySpark 2.4.x users with SparkNLP 3.4.1. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CONTEXTUAL_SPELL_CHECKER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spellcheck_dl_en_3.4.1_2.4_1648817790618.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/spellcheck_dl_en_3.4.1_2.4_1648817790618.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = RecursiveTokenizer()\ .setInputCols(["document"])\ .setOutputCol("token")\ .setPrefixes(["\"", "“", "(", "[", "\n", "."]) \ .setSuffixes(["\"", "”", ".", ",", "?", ")", "]", "!", ";", ":", "'s", "’s"]) spellModel = ContextSpellCheckerModel\ .pretrained("spellcheck_dl", "en")\ .setInputCols("token")\ .setOutputCol("checked") pipeline = Pipeline(stages = [documentAssembler, tokenizer, spellModel]) empty_df = spark.createDataFrame([[""]]).toDF("text") lp = LightPipeline(pipeline.fit(empty_df)) text = ["During the summer we have the best ueather.", "I have a black ueather jacket, so nice."] lp.annotate(text) ``` ```scala val assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new RecursiveTokenizer() .setInputCols(Array("document")) .setOutputCol("token") .setPrefixes(Array("\"", "“", "(", "[", "\n", ".")) .setSuffixes(Array("\"", "”", ".", ",", "?", ")", "]", "!", ";", ":", "'s", "’s")) val spellChecker = ContextSpellCheckerModel. pretrained("spellcheck_dl", "en"). setInputCols("token"). setOutputCol("checked") val pipeline = new Pipeline().setStages(Array(assembler, tokenizer, spellChecker)) val empty_df = Seq("").toDF("text") val lp = new LightPipeline(pipeline.fit(empty_df)) val text = Array("During the summer we have the best ueather.", "I have a black ueather jacket, so nice.") lp.annotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("spell").predict("""During the summer we have the best ueather.""") ```
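For intuition, the candidate-generation idea behind the spell checker (Levenshtein edits, then ranking) can be sketched in plain Python. This is an illustration only: the vocabulary and frequency counts below are invented, and a simple frequency lookup stands in for the model's neural language model.

```python
# Illustrative sketch of edit-distance-1 candidate generation, the idea
# behind the Levenshtein-automaton step of a context spell checker.
# The vocabulary and frequencies are made up for this example.

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word, vocab_freq):
    """Return the most frequent in-vocabulary candidate, else the word itself."""
    if word in vocab_freq:
        return word
    candidates = edits1(word) & vocab_freq.keys()
    return max(candidates, key=vocab_freq.get) if candidates else word

vocab = {"weather": 120, "leather": 40, "summer": 80}
print(correct("ueather", vocab))  # "weather" (highest-frequency candidate)
```

A frequency dictionary cannot use context, which is why "ueather" here always becomes "weather"; the neural language model in the real annotator is what picks "leather" in the jacket sentence above.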
## Results ```bash [{'checked': ['During', 'the', 'summer', 'we', 'have', 'the', 'best', 'weather', '.'], 'document': ['During the summer we have the best ueather.'], 'token': ['During', 'the', 'summer', 'we', 'have', 'the', 'best', 'ueather', '.']}, {'checked': ['I', 'have', 'a', 'black', 'leather', 'jacket', ',', 'so', 'nice', '.'], 'document': ['I have a black ueather jacket, so nice.'], 'token': ['I', 'have', 'a', 'black', 'ueather', 'jacket', ',', 'so', 'nice', '.']}] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|spellcheck_dl| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[corrected]| |Language:|en| |Size:|99.6 MB| ## References Combination of custom public data sets. --- layout: model title: Legal Iron Steel And Other Metal Industries Document Classifier (EURLEX) author: John Snow Labs name: legclf_iron_steel_and_other_metal_industries_bert date: 2023-03-06 tags: [en, legal, classification, clauses, iron_steel_and_other_metal_industries, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_iron_steel_and_other_metal_industries_bert model, a Bert Sentence Embeddings Document Classifier, determines whether it belongs to the class Iron_Steel_and_Other_Metal_Industries or not (binary classification) according to EuroVoc labels.
## Predicted Entities `Iron_Steel_and_Other_Metal_Industries`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_iron_steel_and_other_metal_industries_bert_en_1.0.0_3.0_1678111622171.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_iron_steel_and_other_metal_industries_bert_en_1.0.0_3.0_1678111622171.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_iron_steel_and_other_metal_industries_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
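Conceptually, the classifier above is a small classification head on top of a fixed-size sentence embedding. The toy sketch below shows that shape with a single logistic unit; the 4-dimensional "embedding", weights, and bias are invented for illustration (real `sent_bert_base_cased` embeddings are 768-dimensional, and the actual ClassifierDL head is trained, not hand-set).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classify(embedding, weights, bias, threshold=0.5):
    """Toy binary head: dot(embedding, weights) + bias -> sigmoid -> label."""
    z = sum(e * w for e, w in zip(embedding, weights)) + bias
    p = sigmoid(z)
    label = "Iron_Steel_and_Other_Metal_Industries" if p >= threshold else "Other"
    return label, p

# Made-up 4-d "sentence embedding" and weights, purely for the example.
label, prob = classify([0.4, -0.2, 0.9, 0.1], [1.5, 0.3, 2.0, -0.5], -1.0)
print(label)  # Iron_Steel_and_Other_Metal_Industries
```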
## Results ```bash +-------+ |result| +-------+ |[Iron_Steel_and_Other_Metal_Industries]| |[Other]| |[Other]| |[Iron_Steel_and_Other_Metal_Industries]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_iron_steel_and_other_metal_industries_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Iron_Steel_and_Other_Metal_Industries 0.84 0.91 0.87 99 Other 0.90 0.83 0.86 98 accuracy - - 0.87 197 macro-avg 0.87 0.87 0.87 197 weighted-avg 0.87 0.87 0.87 197 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from FabianWillner) author: John Snow Labs name: distilbert_qa_fabianwillner_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `FabianWillner`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_fabianwillner_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768522736.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_fabianwillner_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768522736.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_fabianwillner_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(False) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_fabianwillner_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
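Under the hood, extractive QA models of this kind score every context token as a candidate answer start and answer end, then return the highest-scoring valid span (start before or equal to end). A minimal, self-contained sketch; the tokens and logits below are invented for the example and do not come from the model.

```python
# Toy illustration of extractive-QA span selection: pick the (start, end)
# pair that maximizes start_score + end_score subject to start <= end.

def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) token indices of the best-scoring answer span."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.0, 0.1, 0.0, 0.1, 0.3, 0.0]  # made-up logits
end   = [0.0, 0.1, 0.2, 4.5, 0.1, 0.0, 0.1, 0.0, 0.4, 0.1]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```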
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_fabianwillner_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/FabianWillner/distilbert-base-uncased-finetuned-squad --- layout: model title: Detect Assertion Status from Entities Related to Cancer Diagnosis author: John Snow Labs name: assertion_oncology_problem_wip date: 2022-10-01 tags: [licensed, clinical, oncology, en, assertion] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects the assertion status of entities related to cancer diagnosis (including Metastasis, Cancer_Dx and Tumor_Finding, among others). ## Predicted Entities `Absent`, `Family`, `Hypothetical`, `Possible`, `Present` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_problem_wip_en_4.1.0_3.0_1664641418708.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_problem_wip_en_4.1.0_3.0_1664641418708.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(["Cancer_Dx"]) assertion = AssertionDLModel.pretrained("assertion_oncology_problem_wip", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion]) data = spark.createDataFrame([["""The patient was diagnosed with breast cancer. 
Her family history is positive for other cancers."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Cancer_Dx")) val assertion = AssertionDLModel.pretrained("assertion_oncology_problem_wip","en","clinical/models") .setInputCols(Array("sentence","ner_chunk","embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion)) val data = Seq("""The patient was diagnosed with breast cancer. Her family history is positive for other cancers.""").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.oncology_problem_wip").predict("""The patient was diagnosed with breast cancer. Her family history is positive for other cancers.""") ```
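For intuition only, the assertion task can be caricatured with a few surface cues. The real AssertionDLModel is a neural classifier over the entity chunk and its sentence context, not a rule list; the cue phrases below are invented to illustrate the input/output shape of the task.

```python
# Toy, rule-based caricature of assertion-status classification.
# Cue lists are made up purely for illustration.

CUES = [
    ("family history", "Family"),
    ("no evidence of", "Absent"),
    ("denies", "Absent"),
    ("rule out", "Possible"),
    ("suspected", "Possible"),
    ("risk of", "Hypothetical"),
]

def toy_assertion(sentence, chunk):
    """Assign an assertion label to an entity chunk from its sentence."""
    lowered = sentence.lower()
    for cue, label in CUES:
        if cue in lowered:
            return label
    return "Present"

print(toy_assertion("The patient was diagnosed with breast cancer.", "breast cancer"))  # Present
print(toy_assertion("Her family history is positive for other cancers.", "cancers"))    # Family
```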
## Results ```bash | chunk | ner_label | assertion | |:--------------|:------------|:----------------| | breast cancer | Cancer_Dx | Medical_History | | cancers | Cancer_Dx | Family_History | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_oncology_problem_wip| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion_pred]| |Language:|en| |Size:|1.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label precision recall f1-score support Absent 0.88 0.87 0.87 154.0 Family 0.67 1.00 0.80 8.0 Hypothetical 0.81 0.77 0.79 77.0 Possible 0.62 0.61 0.62 54.0 Present 0.78 0.79 0.78 155.0 macro-avg 0.75 0.81 0.77 448.0 weighted-avg 0.80 0.79 0.79 448.0 ``` --- layout: model title: English Bert Embeddings (Cased) author: John Snow Labs name: bert_embeddings_lic_class_scancode_bert_base_cased_L32_1 date: 2022-04-11 tags: [bert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `lic-class-scancode-bert-base-cased-L32-1` is an English model originally trained by `ayansinha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_lic_class_scancode_bert_base_cased_L32_1_en_3.4.2_3.0_1649672853388.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_lic_class_scancode_bert_base_cased_L32_1_en_3.4.2_3.0_1649672853388.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_lic_class_scancode_bert_base_cased_L32_1","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_lic_class_scancode_bert_base_cased_L32_1","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.lic_class_scancode_bert_base_cased_L32_1").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_lic_class_scancode_bert_base_cased_L32_1| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|406.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/ayansinha/lic-class-scancode-bert-base-cased-L32-1 - https://github.com/nexB/scancode-results-analyzer - https://github.com/nexB/scancode-results-analyzer#quickstart---local-machine - https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py --- layout: model title: Part of Speech for Slovenian author: John Snow Labs name: pos_ud_ssj date: 2020-07-29 23:35:00 +0800 task: Part of Speech Tagging language: sl edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [pos, sl] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_ssj_sl_2.5.5_2.4_1596054388189.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_ssj_sl_2.5.5_2.4_1596054388189.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_ssj", "sl") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("John Snow je poleg tega, da je severni kralj, angleški zdravnik in vodilni v razvoju anestezije in zdravstvene higiene.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_ssj", "sl") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("John Snow je poleg tega, da je severni kralj, angleški zdravnik in vodilni v razvoju anestezije in zdravstvene higiene.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""John Snow je poleg tega, da je severni kralj, angleški zdravnik in vodilni v razvoju anestezije in zdravstvene higiene."""] pos_df = nlu.load('sl.pos').predict(text, output_level='token') pos_df ```
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=3, result='PROPN', metadata={'word': 'John'}), Row(annotatorType='pos', begin=5, end=8, result='PROPN', metadata={'word': 'Snow'}), Row(annotatorType='pos', begin=10, end=11, result='AUX', metadata={'word': 'je'}), Row(annotatorType='pos', begin=13, end=17, result='ADP', metadata={'word': 'poleg'}), Row(annotatorType='pos', begin=19, end=22, result='DET', metadata={'word': 'tega'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_ssj| |Type:|pos| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|sl| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Stop Words Cleaner for Spanish author: John Snow Labs name: stopwords_es date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: es edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, es] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. 
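The removal step described above can be illustrated outside Spark NLP with a plain-Python toy filter. The stop-word set below is a small hypothetical subset of common Spanish function words, not the list this model actually ships with:

```python
# Toy illustration only -- not the Spark NLP annotator. The stop-word set
# below is a small hypothetical subset of common Spanish function words.
stopwords = {"de", "ser", "el", "del", "es", "un", "y", "en", "la"}

tokens = "Además de ser el rey del norte".lower().split()
# Keep only tokens that are not stop words.
clean = [t for t in tokens if t not in stopwords]
print(clean)  # ['además', 'rey', 'norte']
```

The pretrained `stopwords_es` model applies the same idea with a curated Spanish word list, emitting the surviving tokens in the `cleanTokens` column.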
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_es_es_2.5.4_2.4_1594742441303.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_es_es_2.5.4_2.4_1594742441303.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_es", "es") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Además de ser el rey del norte, John Snow es un médico inglés y líder en el desarrollo de la anestesia y la higiene médica.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_es", "es") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Además de ser el rey del norte, John Snow es un médico inglés y líder en el desarrollo de la anestesia y la higiene médica.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Además de ser el rey del norte, John Snow es un médico inglés y líder en el desarrollo de la anestesia y la higiene médica."""] stopword_df = nlu.load('es.stopwords_es').predict(text) stopword_df[["cleanTokens"]] ```
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=17, end=19, result='rey', metadata={'sentence': '0'}), Row(annotatorType='token', begin=25, end=29, result='norte', metadata={'sentence': '0'}), Row(annotatorType='token', begin=30, end=30, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=32, end=35, result='John', metadata={'sentence': '0'}), Row(annotatorType='token', begin=37, end=40, result='Snow', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_es| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|es| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: Fast Neural Machine Translation Model from English to Oromo author: John Snow Labs name: opus_mt_en_om date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, om, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. 
- source languages: `en` - target languages: `om` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_om_xx_2.7.0_2.4_1609168354999.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_om_xx_2.7.0_2.4_1609168354999.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_om", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_om", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.om').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_om| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering model (from andi611) Squad author: John Snow Labs name: distilbert_qa_andi611_base_uncased_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-squad` is an English model originally trained by `andi611`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_andi611_base_uncased_squad_en_4.0.0_3.0_1654727196238.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_andi611_base_uncased_squad_en_4.0.0_3.0_1654727196238.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_andi611_base_uncased_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_andi611_base_uncased_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_andi611").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
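Under the hood, extractive QA models like this one score every context token as a candidate answer start and end, then return the best-scoring span. A toy sketch of that span selection, with invented scores standing in for real model logits:

```python
# Toy illustration of extractive-QA span selection -- the scores below are
# invented for illustration, not real model logits.
context = "My name is Clara and I live in Berkeley .".split()
start_scores = [0.1, 0.2, 0.1, 3.0, 0.1, 0.1, 0.2, 0.1, 0.5, 0.1]
end_scores   = [0.1, 0.1, 0.1, 2.5, 0.2, 0.1, 0.1, 0.1, 0.3, 0.1]

# Pick the best start token, then the best end token at or after the start.
start = max(range(len(start_scores)), key=start_scores.__getitem__)
end = max(range(start, len(end_scores)), key=end_scores.__getitem__)
answer = " ".join(context[start:end + 1])
print(answer)  # Clara
```

The Spark NLP annotator performs this selection internally and emits the chosen span in the `answer` column.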
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_andi611_base_uncased_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/andi611/distilbert-base-uncased-squad --- layout: model title: Spanish RobertaForTokenClassification Base Cased model (from bertin-project) author: John Snow Labs name: roberta_token_classifier_bertin_base_ner_conll2002 date: 2023-03-01 tags: [es, open_source, roberta, token_classification, ner, tensorflow] task: Named Entity Recognition language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bertin-base-ner-conll2002-es` is a Spanish model originally trained by `bertin-project`. ## Predicted Entities `MISC`, `LOC`, `PER`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_bertin_base_ner_conll2002_es_4.3.0_3.0_1677703750308.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_bertin_base_ner_conll2002_es_4.3.0_3.0_1677703750308.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_bertin_base_ner_conll2002","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_bertin_base_ner_conll2002","es") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_token_classifier_bertin_base_ner_conll2002| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|es| |Size:|426.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/bertin-project/bertin-base-ner-conll2002-es --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from hugsao123) author: John Snow Labs name: xlmroberta_ner_hugsao123_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `hugsao123`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_hugsao123_base_finetuned_panx_de_4.1.0_3.0_1660433733970.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_hugsao123_base_finetuned_panx_de_4.1.0_3.0_1660433733970.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_hugsao123_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_hugsao123_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_hugsao123_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|827.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/hugsao123/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from emre) author: John Snow Labs name: distilbert_qa_emre_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `emre`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_emre_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770682951.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_emre_base_uncased_finetuned_squad_en_4.3.0_3.0_1672770682951.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_emre_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_emre_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_emre_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/emre/distilbert-base-uncased-finetuned-squad --- layout: model title: Word2Vec Embeddings in Tibetan (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, bo, open_source] task: Embeddings language: bo edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_bo_3.4.1_3.0_1647463403113.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_bo_3.4.1_3.0_1647463403113.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","bo") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","bo") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("bo.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
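What the `embeddings` column is typically used for downstream can be sketched with cosine similarity. The two vectors below are hypothetical 3-dimensional stand-ins; the actual model produces a 300-dimensional vector per token:

```python
import math

# Hypothetical 3-d vectors; the actual w2v_cc_300d model emits 300-d vectors.
v_a = [0.8, 0.1, 0.3]
v_b = [0.7, 0.2, 0.3]

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(round(cosine(v_a, v_b), 3))  # 0.989
```

Tokens with similar meanings end up with similar vectors, so a high cosine score signals semantic relatedness.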
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|bo| |Size:|46.1 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from zhangfx7) author: John Snow Labs name: distilbert_qa_zhangfx7_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `zhangfx7`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_zhangfx7_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773333983.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_zhangfx7_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773333983.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_zhangfx7_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_zhangfx7_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_zhangfx7_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/zhangfx7/distilbert-base-uncased-finetuned-squad --- layout: model title: English BertForQuestionAnswering model (from mrbalazs5) author: John Snow Labs name: bert_qa_mrbalazs5_bert_finetuned_squad date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `mrbalazs5`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mrbalazs5_bert_finetuned_squad_en_4.0.0_3.0_1654535679451.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mrbalazs5_bert_finetuned_squad_en_4.0.0_3.0_1654535679451.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mrbalazs5_bert_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_mrbalazs5_bert_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_mrbalazs5").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_mrbalazs5_bert_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrbalazs5/bert-finetuned-squad --- layout: model title: English DistilBertForQuestionAnswering model (from andi611) Squad2 with Neg author: John Snow Labs name: distilbert_qa_base_uncased_squad2_with_ner_with_neg date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-squad2-with-ner-with-neg` is an English model originally trained by `andi611`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_with_ner_with_neg_en_4.0.0_3.0_1654727377477.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_with_ner_with_neg_en_4.0.0_3.0_1654727377477.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2_with_ner_with_neg","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2_with_ner_with_neg","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_conll.distil_bert.base_uncased_with_neg.by_andi611").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_squad2_with_ner_with_neg| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/andi611/distilbert-base-uncased-squad2-with-ner-with-neg --- layout: model title: XLM-RoBERTa 40-Language NER Pipeline author: John Snow Labs name: xlm_roberta_token_classifier_ner_40_lang_pipeline date: 2022-04-22 tags: [open_source, ner, token_classifier, xlm_roberta, multilang, "40", xx] task: Named Entity Recognition language: xx edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [xlm_roberta_token_classifier_ner_40_lang](https://nlp.johnsnowlabs.com/2021/09/28/xlm_roberta_token_classifier_ner_40_lang_xx.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_token_classifier_ner_40_lang_pipeline_xx_3.4.1_3.0_1650628752833.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_token_classifier_ner_40_lang_pipeline_xx_3.4.1_3.0_1650628752833.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("xlm_roberta_token_classifier_ner_40_lang_pipeline", lang = "xx") pipeline.annotate(["My name is John and I work at John Snow Labs.", "انا اسمي احمد واعمل في ارامكو"]) ``` ```scala val pipeline = new PretrainedPipeline("xlm_roberta_token_classifier_ner_40_lang_pipeline", lang = "xx") pipeline.annotate(Array("My name is John and I work at John Snow Labs.", "انا اسمي احمد واعمل في ارامكو")) ```
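`annotate` returns one dict of annotation lists per input string. A small helper can pair each recognized chunk with its label; this is a sketch, and the `ner_chunk` and `entities` key names are assumptions about the pipeline's output dict that should be checked against your own results:

```python
def chunks_with_labels(annotations):
    """Pair each recognized chunk with its predicted NER label.

    Assumes the dict returned by PretrainedPipeline.annotate() exposes
    parallel lists under 'ner_chunk' (chunk text) and 'entities' (labels);
    these key names are illustrative, not guaranteed.
    """
    return list(zip(annotations.get("ner_chunk", []),
                    annotations.get("entities", [])))

sample = {"ner_chunk": ["John", "John Snow Labs"],
          "entities": ["PER", "ORG"]}
print(chunks_with_labels(sample))  # [('John', 'PER'), ('John Snow Labs', 'ORG')]
```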
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PER | |John Snow Labs|ORG | |احمد |PER | |ارامكو |ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_token_classifier_ner_40_lang_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|xx| |Size:|967.7 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - XlmRoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: Detect Clinical Entities (ner_jsl) author: John Snow Labs name: ner_jsl date: 2022-10-19 tags: [ner, licensed, en, clinical] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Healthcare 4.2.0 spark_version: 3.0 supported: true recommended: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terminology. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. This model is the official version of the jsl_ner_wip_clinical model. Definitions of Predicted Entities: - `Injury_or_Poisoning`: Physical harm or injury caused to the body, including those caused by accidents, falls, or poisoning of a patient or someone else. - `Direction`: All the information relating to the laterality of the internal and external organs. - `Test`: Mentions of laboratory, pathology, and radiological tests. - `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient. - `Death_Entity`: Mentions that indicate the death of a patient. - `Relationship_Status`: State of the patient's romantic or social relationships (e.g. single, married, divorced). 
- `Duration`: The duration of a medical treatment or medication use. - `Respiration`: Number of breaths per minute. - `Hyperlipidemia`: Terms that indicate hyperlipidemia with relevant subtypes and synonyms. - `Birth_Entity`: Mentions that indicate giving birth. - `Age`: All mentions of ages, past or present, related to the patient or to anybody else. - `Labour_Delivery`: Extractions include stages of labor and delivery. - `Family_History_Header`: Identifies section headers that correspond to Family History of the patient. - `BMI`: Numeric values and other text information related to Body Mass Index. - `Temperature`: All mentions that refer to body temperature. - `Alcohol`: Terms that indicate alcohol use, abuse or drinking issues of a patient or someone else. - `Kidney_Disease`: Terms that refer to any kidney diseases (includes mentions of modifiers such as "Acute" or "Chronic"). - `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else. - `Medical_History_Header`: Identifies section headers that correspond to Past Medical History of a patient. - `Cerebrovascular_Disease`: All terms that refer to cerebrovascular diseases and events. - `Oxygen_Therapy`: Breathing support triggered by the patient or entirely or partially by machine (e.g. ventilator, BPAP, CPAP). - `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements. - `Psychological_Condition`: All mental health diagnoses, disorders, conditions or syndromes of a patient or someone else. - `Heart_Disease`: All mentions of acquired, congenital or degenerative heart diseases. - `Employment`: All mentions of patient or provider occupational titles and employment status. - `Obesity`: Terms related to a patient being obese (overweight and BMI are extracted as different labels). 
- `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.). - `Pregnancy`: All terms related to pregnancy (excluding terms that are extracted with their specific labels, such as "Labour_Delivery" etc.). - `ImagingFindings`: All mentions of radiographic and imaging findings. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments. - `Medical_Device`: All mentions related to medical devices and supplies. - `Race_Ethnicity`: All terms that refer to racial and national origin of sociocultural groups. - `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital Signs headers are extracted separately with their specific labels). - `Symptom`: All the symptoms mentioned in the document, of a patient or someone else. - `Treatment`: Includes therapeutic and minimally invasive treatments and procedures (invasive treatments or procedures are extracted as "Procedure"). - `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs). - `Route`: Drug and medication administration routes, as described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_Ingredient`: Active ingredient(s) found in drug products. - `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted. - `Diet`: All mentions and information regarding the patient's dietary habits. - `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by the naked eye. - `LDL`: All mentions related to the lab test and results for LDL (Low Density Lipoprotein). 
- `VS_Finding`: Qualitative data (e.g. fever, cyanosis, tachycardia) and any other symptoms that refer to vital signs. - `Allergen`: Allergen related extractions mentioned in the document. - `EKG_Findings`: All mentions of EKG readings. - `Imaging_Technique`: All mentions of special radiographic views or special imaging techniques used in radiology. - `Triglycerides`: All mentions related to the specific lab test for triglycerides. - `RelativeTime`: Time references that are relative to different times or events (e.g. words such as "approximately", "in the morning"). - `Gender`: Gender-specific nouns and pronouns. - `Pulse`: Peripheral heart rate, without advanced information like measurement location. - `Social_History_Header`: Identifies section headers that correspond to Social History of a patient. - `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs). - `Diabetes`: All terms related to diabetes mellitus. - `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately. - `Internal_organ_or_component`: All mentions related to internal body parts or organs that cannot be examined by the naked eye. - `Clinical_Dept`: Terms that indicate the medical and/or surgical departments. - `Form`: Drug and medication forms, as described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients. 
- `Strength`: Potency of one unit of drug (or a combination of drugs); the measurement units available are described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Fetus_NewBorn`: All terms related to fetus, infant, new born (excluding terms that are extracted with their specific labels, such as "Labour_Delivery", "Pregnancy" etc.). - `RelativeDate`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "approximately two years ago", "about two days ago"). - `Height`: All mentions related to a patient's height. - `Test_Result`: Terms related to all the test results present in the document (clinical test results are included). - `Sexually_Active_or_Sexual_Orientation`: All terms that are related to sexuality, sexual orientations and sexual activity. - `Frequency`: Frequency of administration for a dose prescribed. - `Time`: Specific time references (hour and/or minutes). - `Weight`: All mentions related to a patient's weight. - `Vaccine`: Generic and brand names of vaccines or vaccination procedures. - `Vital_Signs_Header`: Identifies section headers that correspond to Vital Signs of a patient. - `Communicable_Disease`: Includes all mentions of communicable diseases. - `Dosage`: Quantity prescribed by the physician for an active ingredient; measurement units are described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Overweight`: Terms related to the patient being overweight (BMI and Obesity are extracted separately). - `Hypertension`: All terms related to hypertension (quantitative data such as 150/100 is extracted as Blood_Pressure). 
- `HDL`: Terms related to the lab test for HDL (High Density Lipoprotein). - `Total_Cholesterol`: Terms related to the lab test and results for cholesterol. - `Smoking`: All mentions of smoking status of a patient. - `Date`: Mentions of an exact date, in any format, including day number, month and/or year. ## Predicted Entities `Injury_or_Poisoning`, `Direction`, `Test`, `Admission_Discharge`, `Death_Entity`, `Relationship_Status`, `Duration`, `Respiration`, `Hyperlipidemia`, `Birth_Entity`, `Age`, `Labour_Delivery`, `Family_History_Header`, `BMI`, `Temperature`, `Alcohol`, `Kidney_Disease`, `Oncological`, `Medical_History_Header`, `Cerebrovascular_Disease`, `Oxygen_Therapy`, `O2_Saturation`, `Psychological_Condition`, `Heart_Disease`, `Employment`, `Obesity`, `Disease_Syndrome_Disorder`, `Pregnancy`, `ImagingFindings`, `Procedure`, `Medical_Device`, `Race_Ethnicity`, `Section_Header`, `Symptom`, `Treatment`, `Substance`, `Route`, `Drug_Ingredient`, `Blood_Pressure`, `Diet`, `External_body_part_or_region`, `LDL`, `VS_Finding`, `Allergen`, `EKG_Findings`, `Imaging_Technique`, `Triglycerides`, `RelativeTime`, `Gender`, `Pulse`, `Social_History_Header`, `Substance_Quantity`, `Diabetes`, `Modifier`, `Internal_organ_or_component`, `Clinical_Dept`, `Form`, `Drug_BrandName`, `Strength`, `Fetus_NewBorn`, `RelativeDate`, `Height`, `Test_Result`, `Sexually_Active_or_Sexual_Orientation`, `Frequency`, `Time`, `Weight`, `Vaccine`, `Vaccine_Name`, `Vital_Signs_Header`, `Communicable_Disease`, `Dosage`, `Overweight`, `Hypertension`, `HDL`, `Total_Cholesterol`, `Smoking`, `Date` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} 
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_en_4.2.0_3.0_1666181370373.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_en_4.2.0_3.0_1666181370373.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_jsl","en","clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner")\ .setLabelCasing("upper") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") ner_pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, word_embeddings, ner, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") ner_model = ner_pipeline.fit(empty_data) data = spark.createDataFrame([["""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). Additionally, there is no side effect observed after Influenza vaccine. One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. 
Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature. """]]).toDF("text") result = ner_model.transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val jsl_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("jsl_ner") val jsl_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "jsl_ner")) .setOutputCol("ner_chunk") val jsl_ner_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter)) val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). Additionally, there is no side effect observed after Influenza vaccine. One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. 
Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text") val result = jsl_ner_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). Additionally, there is no side effect observed after Influenza vaccine. One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature. """) ```
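The transform above returns `ner_chunk` annotations carrying the chunk text, character offsets, and the entity label in metadata. As a pure-Python sketch, the chunk/begin/end/entity rows shown in the Results section can be rebuilt from collected annotations; plain dicts stand in here for the annotation structs you would get after a `select` and `collect`:

```python
def to_rows(ner_chunks):
    """Turn collected chunk annotations into (chunk, begin, end, entity) tuples.

    Assumes each annotation is a dict with 'result', 'begin', 'end' and a
    'metadata' dict holding the entity label -- the shape chunk annotations
    take once collected, sketched here with plain dicts.
    """
    return [(c["result"], c["begin"], c["end"], c["metadata"]["entity"])
            for c in ner_chunks]

chunks = [{"result": "21-day-old", "begin": 18, "end": 27,
           "metadata": {"entity": "Age"}}]
print(to_rows(chunks))  # [('21-day-old', 18, 27, 'Age')]
```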
## Results ```bash | | chunks | begin | end | sentence_id | entities | |---:|:------------------------------------------|--------:|------:|--------------:|:-----------------------------| | 0 | 21-day-old | 18 | 27 | 0 | Age | | 1 | Caucasian | 29 | 37 | 0 | Race_Ethnicity | | 2 | male | 39 | 42 | 0 | Gender | | 3 | 2 days | 53 | 58 | 0 | Duration | | 4 | congestion | 63 | 72 | 0 | Symptom | | 5 | mom | 76 | 78 | 0 | Gender | | 6 | suctioning yellow discharge | 89 | 115 | 0 | Symptom | | 7 | nares | 136 | 140 | 0 | External_body_part_or_region | | 8 | she | 148 | 150 | 0 | Gender | | 9 | mild | 169 | 172 | 0 | Modifier | | 10 | problems with his breathing while feeding | 174 | 214 | 0 | Symptom | | 11 | perioral cyanosis | 238 | 254 | 0 | Symptom | | 12 | retractions | 259 | 269 | 0 | Symptom | | 13 | Influenza vaccine | 326 | 342 | 1 | Vaccine_Name | | 14 | One day ago | 345 | 355 | 2 | RelativeDate | | 15 | mom | 358 | 360 | 2 | Gender | | 16 | tactile temperature | 377 | 395 | 2 | Symptom | | 17 | Tylenol | 418 | 424 | 2 | Drug_BrandName | | 18 | Baby | 427 | 430 | 3 | Age | | 19 | decreased p.o | 450 | 462 | 3 | Symptom | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl| |Compatibility:|Spark NLP for Healthcare 4.2.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|15.2 MB| ## References Trained on data gathered and manually annotated by John Snow Labs. https://www.johnsnowlabs.com/data/ ## Benchmarking ```bash label tp fp fn total precision recall f1 VS_Finding 207.0 37.0 26.0 233.0 0.8484 0.8884 0.8679 Direction 3642.0 418.0 264.0 3906.0 0.897 0.9324 0.9144 Respiration 58.0 5.0 4.0 62.0 0.9206 0.9355 0.928 Cerebrovascular_D... 93.0 20.0 12.0 105.0 0.823 0.8857 0.8532 Family_History_He... 
77.0 1.0 1.0 78.0 0.9872 0.9872 0.9872 Heart_Disease 453.0 47.0 50.0 503.0 0.906 0.9006 0.9033 ImagingFindings 85.0 35.0 101.0 186.0 0.7083 0.457 0.5556 RelativeTime 156.0 36.0 50.0 206.0 0.8125 0.7573 0.7839 Strength 648.0 24.0 27.0 675.0 0.9643 0.96 0.9621 Smoking 115.0 8.0 3.0 118.0 0.935 0.9746 0.9544 Medical_Device 3167.0 368.0 283.0 3450.0 0.8959 0.918 0.9068 Allergen 1.0 0.0 8.0 9.0 1.0 0.1111 0.2 EKG_Findings 36.0 13.0 33.0 69.0 0.7347 0.5217 0.6102 Pulse 119.0 14.0 6.0 125.0 0.8947 0.952 0.9225 Psychological_Con... 117.0 14.0 15.0 132.0 0.8931 0.8864 0.8897 Triglycerides 4.0 1.0 0.0 4.0 0.8 1.0 0.8889 Overweight 3.0 0.0 0.0 3.0 1.0 1.0 1.0 Obesity 40.0 1.0 1.0 41.0 0.9756 0.9756 0.9756 Admission_Discharge 307.0 26.0 5.0 312.0 0.9219 0.984 0.9519 HDL 3.0 0.0 1.0 4.0 1.0 0.75 0.8571 Diabetes 117.0 3.0 3.0 120.0 0.975 0.975 0.975 Section_Header 3327.0 103.0 109.0 3436.0 0.97 0.9683 0.9691 Age 556.0 22.0 31.0 587.0 0.9619 0.9472 0.9545 O2_Saturation 28.0 3.0 6.0 34.0 0.9032 0.8235 0.8615 Kidney_Disease 97.0 10.0 19.0 116.0 0.9065 0.8362 0.87 Test 2603.0 391.0 357.0 2960.0 0.8694 0.8794 0.8744 Communicable_Disease 22.0 6.0 6.0 28.0 0.7857 0.7857 0.7857 Hypertension 144.0 5.0 5.0 149.0 0.9664 0.9664 0.9664 External_body_par... 
2401.0 228.0 378.0 2779.0 0.9133 0.864 0.8879 Oxygen_Therapy 69.0 14.0 10.0 79.0 0.8313 0.8734 0.8519 Modifier 2229.0 304.0 354.0 2583.0 0.88 0.863 0.8714 Test_Result 1169.0 165.0 187.0 1356.0 0.8763 0.8621 0.8691 BMI 5.0 3.0 1.0 6.0 0.625 0.8333 0.7143 Labour_Delivery 66.0 15.0 17.0 83.0 0.8148 0.7952 0.8049 Employment 220.0 16.0 37.0 257.0 0.9322 0.856 0.8925 Fetus_NewBorn 53.0 16.0 23.0 76.0 0.7681 0.6974 0.731 Clinical_Dept 843.0 69.0 53.0 896.0 0.9243 0.9408 0.9325 Time 28.0 8.0 11.0 39.0 0.7778 0.7179 0.7467 Procedure 2893.0 326.0 307.0 3200.0 0.8987 0.9041 0.9014 Diet 29.0 3.0 18.0 47.0 0.9063 0.617 0.7342 Oncological 419.0 41.0 36.0 455.0 0.9109 0.9209 0.9158 LDL 3.0 0.0 1.0 4.0 1.0 0.75 0.8571 Symptom 6559.0 876.0 908.0 7467.0 0.8822 0.8784 0.8803 Temperature 86.0 7.0 3.0 89.0 0.9247 0.9663 0.9451 Vital_Signs_Header 191.0 25.0 19.0 210.0 0.8843 0.9095 0.8967 Total_Cholesterol 13.0 3.0 7.0 20.0 0.8125 0.65 0.7222 Relationship_Status 52.0 5.0 2.0 54.0 0.9123 0.963 0.9369 Blood_Pressure 132.0 15.0 11.0 143.0 0.898 0.9231 0.9103 Injury_or_Poisoning 500.0 64.0 86.0 586.0 0.8865 0.8532 0.8696 Drug_Ingredient 1505.0 128.0 91.0 1596.0 0.9216 0.943 0.9322 Treatment 134.0 21.0 25.0 159.0 0.8645 0.8428 0.8535 Pregnancy 89.0 23.0 20.0 109.0 0.7946 0.8165 0.8054 Vaccine 7.0 2.0 2.0 9.0 0.7778 0.7778 0.7778 Disease_Syndrome_... 2684.0 383.0 344.0 3028.0 0.8751 0.8864 0.8807 Height 22.0 3.0 1.0 23.0 0.88 0.9565 0.9167 Frequency 604.0 74.0 67.0 671.0 0.8909 0.9001 0.8955 Route 783.0 89.0 64.0 847.0 0.8979 0.9244 0.911 Duration 352.0 83.0 41.0 393.0 0.8092 0.8957 0.8502 Death_Entity 41.0 3.0 3.0 44.0 0.9318 0.9318 0.9318 Internal_organ_or... 5915.0 811.0 713.0 6628.0 0.8794 0.8924 0.8859 Vaccine_Name 5.0 0.0 3.0 8.0 1.0 0.625 0.7692 Alcohol 72.0 4.0 6.0 78.0 0.9474 0.9231 0.9351 Substance_Quantity 3.0 4.0 0.0 3.0 0.4286 1.0 0.6 Date 544.0 26.0 17.0 561.0 0.9544 0.9697 0.962 Hyperlipidemia 44.0 4.0 0.0 44.0 0.9167 1.0 0.9565 Social_History_He... 
93.0 3.0 4.0 97.0 0.9688 0.9588 0.9637 Imaging_Technique 59.0 4.0 31.0 90.0 0.9365 0.6556 0.7712 Race_Ethnicity 113.0 0.0 0.0 113.0 1.0 1.0 1.0 Drug_BrandName 819.0 53.0 41.0 860.0 0.9392 0.9523 0.9457 RelativeDate 530.0 86.0 89.0 619.0 0.8604 0.8562 0.8583 Gender 5414.0 55.0 47.0 5461.0 0.9899 0.9914 0.9907 Form 204.0 24.0 35.0 239.0 0.8947 0.8536 0.8737 Dosage 211.0 21.0 48.0 259.0 0.9095 0.8147 0.8595 Medical_History_H... 105.0 7.0 2.0 107.0 0.9375 0.9813 0.9589 Birth_Entity 4.0 0.0 2.0 6.0 1.0 0.6667 0.8 Substance 72.0 14.0 12.0 84.0 0.8372 0.8571 0.8471 Sexually_Active_o... 7.0 0.0 0.0 7.0 1.0 1.0 1.0 Weight 77.0 8.0 11.0 88.0 0.9059 0.875 0.8902 macro - - - - - - 0.8674 micro - - - - - - 0.9054 ``` --- layout: model title: Detect Anatomical Regions (embeddings_clinical_medium) author: John Snow Labs name: ner_anatomy_emb_clinical_medium date: 2023-05-15 tags: [ner, clinical, licensed, en, anatomy] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.1 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for anatomy terms. 
The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. ## Predicted Entities {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_emb_clinical_medium_en_4.4.1_3.0_1684136633973.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_anatomy_emb_clinical_medium_en_4.4.1_3.0_1684136633973.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") anatomy_ner = MedicalNerModel.pretrained("ner_anatomy_emb_clinical_medium", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("anatomy_ner") anatomy_ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "anatomy_ner"]) \ .setOutputCol("anatomy_ner_chunk") posology_ner_pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, word_embeddings, anatomy_ner, anatomy_ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") posology_ner_model = posology_ner_pipeline.fit(empty_data) results = posology_ner_model.transform(spark.createDataFrame([['''This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. 
Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.''']]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val anatomy_ner_model = MedicalNerModel.pretrained("ner_anatomy_emb_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("anatomy_ner") val anatomy_ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "anatomy_ner")) .setOutputCol("anatomy_ner_chunk") val posology_pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, anatomy_ner_model, anatomy_ner_converter)) val data = Seq(""" This is an 11-year-old female who comes in for two different things. 1. She was seen by the allergist. No allergies present, so she stopped her Allegra, but she is still real congested and does a lot of snorting. They do not notice a lot of snoring at night though, but she seems to be always like that. 2. On her right great toe, she has got some redness and erythema. 
Her skin is kind of peeling a little bit, but it has been like that for about a week and a half now.\nGeneral: Well-developed female, in no acute distress, afebrile.\nHEENT: Sclerae and conjunctivae clear. Extraocular muscles intact. TMs clear. Nares patent. A little bit of swelling of the turbinates on the left. Oropharynx is essentially clear. Mucous membranes are moist.\nNeck: No lymphadenopathy.\nChest: Clear.\nAbdomen: Positive bowel sounds and soft.\nDermatologic: She has got redness along the lateral portion of her right great toe, but no bleeding or oozing. Some dryness of her skin. Her toenails themselves are very short and even on her left foot and her left great toe the toenails are very short.""").toDS.toDF("text")

val result = posology_pipeline.fit(data).transform(data)
```
## Results ```bash | | chunks | begin | end | entities | |---:|:--------------------|--------:|------:|:-----------------------| | 0 | skin | 374 | 377 | Organ | | 1 | Extraocular muscles | 574 | 592 | Multi-tissue_structure | | 2 | Nares | 613 | 617 | Multi-tissue_structure | | 3 | turbinates | 659 | 668 | Multi-tissue_structure | | 4 | Oropharynx | 683 | 692 | Multi-tissue_structure | | 5 | Mucous membranes | 716 | 731 | Cellular_component | | 6 | Neck | 744 | 747 | Organism_subdivision | | 7 | bowel | 802 | 806 | Multi-tissue_structure | | 8 | skin | 956 | 959 | Organ | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_anatomy_emb_clinical_medium| |Compatibility:|Healthcare NLP 4.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|2.8 MB| ## References Trained on the Anatomical Entity Mention (AnEM) corpus with http://www.nactem.ac.uk/anatomy/ ## Benchmarking ```bash label precision recall f1-score support tissue_structure 0.81 0.67 0.73 130 Organ 0.87 0.79 0.83 52 Cell 0.89 1.00 0.94 118 Organism_subdivision 0.69 0.50 0.58 22 Pathological_formation 0.96 0.91 0.94 58 Cellular_component 0.65 0.65 0.65 26 Organism_substance 0.93 0.86 0.89 43 Anatomical_system 1.00 0.50 0.67 6 Immaterial_anatomical_entity 1.00 0.67 0.80 6 Tissue 0.88 0.88 0.88 32 Developing_anatomical_structure 1.00 0.20 0.33 5 micro-avg 0.86 0.80 0.83 498 macro-avg 0.88 0.69 0.75 498 weighted-avg 0.86 0.80 0.82 498 ``` --- layout: model title: Medical Spell Checker Pipeline author: John Snow Labs name: spellcheck_clinical_pipeline date: 2022-04-19 tags: [spellcheck, medical, medical_spell_check, spell_corrector, spell_pipeline, en, licensed, clinical] task: Spell Check language: en nav_key: models edition: Healthcare NLP 3.4.2 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This 
pretrained medical spellchecker pipeline is built on top of the [spellcheck_clinical](https://nlp.johnsnowlabs.com/2022/04/18/spellcheck_clinical_en_2_4.html) model. This pipeline is intended for PySpark 2.4.x users with Spark NLP 3.4.2 and above. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CONTEXTUAL_SPELL_CHECKER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_pipeline_en_3.4.2_2.4_1650360182939.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_pipeline_en_3.4.2_2.4_1650360182939.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("spellcheck_clinical_pipeline", "en", "clinical/models") example = ["Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.", "With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.", "Abdomen is sort, nontender, and nonintended.", "Patient not showing pain or any wealth problems.", "No cute distress"] pipeline.fullAnnotate(example) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("spellcheck_clinical_pipeline", "en", "clinical/models") val example = Array("Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.", "With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.", "Abdomen is sort, nontender, and nonintended.", "Patient not showing pain or any wealth problems.", "No cute distress") pipeline.fullAnnotate(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.spell.clinical.pipeline").predict("""Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.""") ```
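Each element returned by `fullAnnotate` pairs the original `token` list with a parallel `checked` list (see the Results section). When the two lists align one-to-one, as in the examples here, the actual corrections can be pulled out with a few lines of plain Python — a sketch assuming that output shape, independent of Spark:

```python
def corrections(annotation):
    # Keep only the (original, corrected) pairs the spellchecker changed;
    # assumes token and checked are parallel, equal-length lists.
    return [(tok, fix)
            for tok, fix in zip(annotation["token"], annotation["checked"])
            if tok != fix]

# One of the documented example outputs, abbreviated:
sample = {
    "token":   ["Abdomen", "is", "sort", ",", "nontender", ",", "and", "nonintended", "."],
    "checked": ["Abdomen", "is", "soft", ",", "nontender", ",", "and", "nondistended", "."],
}
print(corrections(sample))  # [('sort', 'soft'), ('nonintended', 'nondistended')]
```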
## Results ```bash [{'checked': ['With','the','cell','of','physical','therapy','the','patient','was','ambulated','and','on','postoperative',',','the','patient','tolerating','a','post','surgical','soft','diet','.'], 'document': ['Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.'], 'token': ['Witth','the','hell','of','phisical','terapy','the','patient','was','imbulated','and','on','postoperative',',','the','impatient','tolerating','a','post','curgical','soft','diet','.']}, {'checked': ['With','pain','well','controlled','on','oral','pain','medications',',','she','was','discharged','to','rehabilitation','facility','.'], 'document': ['With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.'], 'token': ['With','paint','wel','controlled','on','orall','pain','medications',',','she','was','discharged','too','reihabilitation','facilitay','.']}, {'checked': ['Abdomen','is','soft',',','nontender',',','and','nondistended','.'], 'document': ['Abdomen is sort, nontender, and nonintended.'], 'token': ['Abdomen','is','sort',',','nontender',',','and','nonintended','.']}, {'checked': ['Patient','not','showing','pain','or','any','health','problems','.'], 'document': ['Patient not showing pain or any wealth problems.'], 'token': ['Patient','not','showing','pain','or','any','wealth','problems','.']}, {'checked': ['No', 'acute', 'distress'], 'document': ['No cute distress'], 'token': ['No', 'cute', 'distress']}] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|spellcheck_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|141.3 MB| ## Included Models - DocumentAssembler - TokenizerModel - ContextSpellCheckerModel --- layout: model title: Legal Conditions to effectiveness Clause Binary Classifier author: John Snow Labs name: 
legclf_conditions_to_effectiveness_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `conditions-to-effectiveness` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. 
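As a concrete illustration of the first technique above, paragraph splitting (by multiline) can be approximated with a regular expression before feeding each piece to the classifier — a minimal sketch, not the workshop notebook's exact code:

```python
import re

def split_paragraphs(text):
    # Split on blank lines (one or more) and drop empty fragments.
    return [p.strip() for p in re.split(r"\n\s*\n+", text) if p.strip()]

# Hypothetical two-clause fragment for illustration:
doc = ("12. CONDITIONS TO EFFECTIVENESS. This Amendment shall become effective "
       "upon satisfaction of the following conditions...\n\n"
       "13. GOVERNING LAW. This Amendment shall be governed by the laws of...")
paragraphs = split_paragraphs(doc)
print(len(paragraphs))  # 2
```

Each resulting paragraph can then be loaded into the `clause_text` column used in the usage example.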
## Predicted Entities `other`, `conditions-to-effectiveness` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_conditions_to_effectiveness_clause_en_1.0.0_3.2_1660122255225.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_conditions_to_effectiveness_clause_en_1.0.0_3.2_1660122255225.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_conditions_to_effectiveness_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[conditions-to-effectiveness]| |[other]| |[other]| |[conditions-to-effectiveness]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_conditions_to_effectiveness_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support conditions-to-effectiveness 0.98 1.00 0.99 46 other 1.00 0.99 1.00 143 accuracy - - 0.99 189 macro-avg 0.99 1.00 0.99 189 weighted-avg 0.99 0.99 0.99 189 ``` --- layout: model title: Detect clinical entities (ner_jsl_enriched_biobert) author: John Snow Labs name: ner_jsl_enriched_biobert date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect symptoms, modifiers, age, drugs, treatments, tests and a lot more using a single pretrained NER model. Definitions of Predicted Entities: - `Symptom`: All the symptoms mentioned in the document, of a patient or someone else. - `Pulse`: Peripheral heart rate, without advanced information like measurement location. - `Age`: All mentions of ages, past or present, related to the patient or anybody else. - `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately. - `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs). - `Weight`: All mentions related to a patient's weight. 
- `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments. - `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted. - `Gender`: Gender-specific nouns and pronouns. - `Temperature`: All mentions that refer to body temperature. - `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital signs Headers are extracted separately with their specific labels). - `Route`: Drug and medication administration routes, as described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements. - `Respiration`: Number of breaths per minute. - `Frequency`: Frequency of administration for a dose prescribed. - `Dosage`: Quantity prescribed by the physician for an active ingredient; measurement units are described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Allergen`: Allergen related extractions mentioned in the document. 
## Predicted Entities `Symptom_Name`, `Pulse_Rate`, `Negation`, `Age`, `Modifier`, `Substance_Name`, `Causative_Agents_(Virus_and_Bacteria)`, `Diagnosis`, `Weight`, `Drug_Name`, `Procedure_Name`, `Lab_Name`, `Blood_Pressure`, `Lab_Result`, `Gender`, `Name`, `Temperature`, `Section_Name`, `Route`, `Maybe`, `O2_Saturation`, `Respiratory_Rate`, `Procedure`, `Frequency`, `Dosage`, `Allergenic_substance` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_biobert_en_3.0.0_3.0_1617260842011.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_biobert_en_3.0.0_3.0_1617260842011.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") jsl_ner = MedicalNerModel.pretrained("ner_jsl_enriched_biobert", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("jsl_ner") jsl_ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "jsl_ner"]) \ .setOutputCol("ner_chunk") jsl_ner_pipeline = Pipeline().setStages([ documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter]) model = jsl_ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature."]], ["text"])) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val jsl_ner = MedicalNerModel.pretrained("ner_jsl_enriched_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("jsl_ner") val jsl_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "jsl_ner")) .setOutputCol("ner_chunk") val jsl_ner_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter)) val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text") val result = jsl_ner_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl.enriched_biobert").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
## Results ```bash +---------------------------+------------+ |chunk |ner_label | +---------------------------+------------+ |21-day-old |Age | |male |Gender | |mom |Gender | |she |Gender | |mild |Modifier | |problems with his breathing|Symptom_Name| |negative |Negation | |perioral cyanosis |Symptom_Name| |retractions |Symptom_Name| |mom |Gender | |Tylenol |Drug_Name | |His |Gender | |his |Gender | |respiratory congestion |Symptom_Name| |He |Gender | |tired |Symptom_Name| |fussy |Symptom_Name| |albuterol |Drug_Name | |His |Gender | |he |Gender | |he |Gender | |Mom |Gender | |denies |Negation | |diarrhea |Symptom_Name| |His |Gender | +---------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_enriched_biobert| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| --- layout: model title: English BertForQuestionAnswering Cased model (from qgrantq) author: John Snow Labs name: bert_qa_qgrantq_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is a English model originally trained by `qgrantq`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_qgrantq_finetuned_squad_en_4.0.0_3.0_1657186673461.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_qgrantq_finetuned_squad_en_4.0.0_3.0_1657186673461.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_qgrantq_finetuned_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_qgrantq_finetuned_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_qgrantq_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/qgrantq/bert-finetuned-squad --- layout: model title: Pipeline for detecting posology entities author: John Snow Labs name: recognize_entities_posology date: 2023-06-13 tags: [pipeline, en, licensed, clinical] task: Pipeline Healthcare language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pipeline with `ner_posology`. It will only extract medication entities. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/recognize_entities_posology_en_4.4.4_3.2_1686663455726.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/recognize_entities_posology_en_4.4.4_3.2_1686663455726.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('recognize_entities_posology', 'en', 'clinical/models') res = pipeline.fullAnnotate("""A 28-year-old female with a history of gestational diabetes mellitus, used to take metformin 1000 mg two times a day, presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . She was seen by the endocrinology service and discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals. """) ``` ```scala val era_pipeline = new PretrainedPipeline("recognize_entities_posology", "en", "clinical/models") val result = era_pipeline.fullAnnotate("""A 28-year-old female with a history of gestational diabetes mellitus, used to take metformin 1000 mg two times a day, presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . She was seen by the endocrinology service and discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals. """)(0) ``` {:.nlu-block} ```python import nlu nlu.load("en.recognize_entities.posology").predict("""A 28-year-old female with a history of gestational diabetes mellitus, used to take metformin 1000 mg two times a day, presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . She was seen by the endocrinology service and discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals. """) ```
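The `begin` and `end` values this pipeline reports are inclusive character offsets into the input string; a quick plain-Python sanity check for the first chunk, no Spark required:

```python
# Recompute the offsets of the first extracted chunk from the example text.
text = ("A 28-year-old female with a history of gestational diabetes mellitus, "
        "used to take metformin 1000 mg two times a day, presented with a "
        "one-week history of polyuria , polydipsia , poor appetite , and vomiting .")
chunk = "metformin"
begin = text.find(chunk)       # 83
end = begin + len(chunk) - 1   # 91 (inclusive, matching the annotator's convention)
print(begin, end)
```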
## Results ```bash | | chunk | begin | end | entity | |---:|:-----------------|--------:|------:|:----------| | 0 | metformin | 83 | 91 | DRUG | | 1 | 1000 mg | 93 | 99 | STRENGTH | | 2 | two times a day | 101 | 115 | FREQUENCY | | 3 | 40 units | 270 | 277 | DOSAGE | | 4 | insulin glargine | 282 | 297 | DRUG | | 5 | at night | 299 | 306 | FREQUENCY | | 6 | 12 units | 309 | 316 | DOSAGE | | 7 | insulin lispro | 321 | 334 | DRUG | | 8 | with meals | 336 | 345 | FREQUENCY | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|recognize_entities_posology| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Augment Company Tickers with NASDAQ database author: John Snow Labs name: finmapper_nasdaq_data_ticker date: 2022-10-22 tags: [en, finance, companies, nasdaq, ticker, licensed] task: Chunk Mapping language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Financial Chunk Mapper, which will retrieve, given a normalized Company Name (using, for example, `finer_nasdaq_data` to obtain the official Nasdaq company name), extra information about the company, including: - Ticker - Stock Exchange - Sector - SIC codes - Industry - Category - Currency - Location - Previous names (first_name) - Company type (INC, CORP, etc.) - and some more. 
## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FIN_LEG_COMPANY_AUGMENTATION/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finmapper_nasdaq_data_ticker_en_1.0.0_3.0_1666474260714.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finmapper_nasdaq_data_ticker_en_1.0.0_3.0_1666474260714.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = nlp.DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_ticker", "en", "finance/models")\
    .setInputCols(["document", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverterInternal()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

CM = finance.ChunkMapperModel.pretrained('finmapper_nasdaq_data_ticker', 'en', 'finance/models')\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("mappings")\
    .setRel('company_name')

pipeline = nlp.Pipeline().setStages([document_assembler, tokenizer, embeddings, ner_model, ner_converter, CM])

text = ["""There are some serious purchases and sales of GLE1 stock today."""]

test_data = spark.createDataFrame([text]).toDF("text")

model = pipeline.fit(test_data)

res = model.transform(test_data).select('mappings').collect()
```
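Each row in the collected `mappings` column is an annotation whose `metadata` names the relation (ticker, company_name, sector, ...). A minimal plain-Python sketch, over hypothetical dicts mirroring that shape, to regroup one chunk's rows into a single record:

```python
def regroup(mapping_rows):
    # Build relation name -> mapped value, e.g. {"ticker": "AMZN", ...};
    # rows without a 'relation' key (unmapped chunks) are skipped.
    record = {}
    for row in mapping_rows:
        rel = row["metadata"].get("relation")
        if rel:
            record[rel] = row["result"]
    return record

# Hypothetical rows mirroring the documented output shape:
rows = [
    {"result": "AMZN", "metadata": {"relation": "ticker"}},
    {"result": "Amazon.com Inc.", "metadata": {"relation": "company_name"}},
    {"result": "Consumer Cyclical", "metadata": {"relation": "sector"}},
]
print(regroup(rows))
```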
## Results ```bash [Row(mappings=[Row(annotatorType='labeled_dependency', begin=46, end=49, result='AMZN', metadata={'sentence': '0', 'chunk': '0', 'entity': 'AMZN', 'relation': 'ticker', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=46, end=49, result='Amazon.com Inc.', metadata={'sentence': '0', 'chunk': '0', 'entity': 'AMZN', 'relation': 'company_name', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=46, end=49, result='Amazon.com', metadata={'sentence': '0', 'chunk': '0', 'entity': 'AMZN', 'relation': 'short_name', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=46, end=49, result='Retail - Apparel & Specialty', metadata={'sentence': '0', 'chunk': '0', 'entity': 'AMZN', 'relation': 'industry', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=46, end=49, result='Consumer Cyclical', metadata={'sentence': '0', 'chunk': '0', 'entity': 'AMZN', 'relation': 'sector', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=57, end=61, result='NONE', metadata={'sentence': '0', 'chunk': '1', 'entity': 'today'}, embeddings=[])])] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finmapper_nasdaq_data_ticker| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|1.0 MB| ## References NASDAQ Database --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from jdeboever) author: John Snow Labs name: xlmroberta_ner_jboever_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained 
XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `jdeboever`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jboever_base_finetuned_panx_de_4.1.0_3.0_1660434567446.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jboever_base_finetuned_panx_de_4.1.0_3.0_1660434567446.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jboever_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jboever_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_jboever_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/jdeboever/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: English Named Entity Recognition (from mrm8488) author: John Snow Labs name: bert_ner_bert_small_finetuned_typo_detection date: 2022-05-09 tags: [bert, ner, token_classification, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-small-finetuned-typo-detection` is an English model originally trained by `mrm8488`. ## Predicted Entities `typo`, `ok` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_small_finetuned_typo_detection_en_3.4.2_3.0_1652097081276.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_small_finetuned_typo_detection_en_3.4.2_3.0_1652097081276.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_small_finetuned_typo_detection","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_small_finetuned_typo_detection","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_bert_small_finetuned_typo_detection| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|42.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/mrm8488/bert-small-finetuned-typo-detection - https://github.com/mhagiwara/github-typo-corpus - https://twitter.com/mrm8488 --- layout: model title: Translate Chuukese to English Pipeline author: John Snow Labs name: translate_chk_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, chk, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `chk` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_chk_en_xx_2.7.0_2.4_1609691595140.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_chk_en_xx_2.7.0_2.4_1609691595140.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_chk_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_chk_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.chk.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_chk_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_64_finetuned_squad_seed_6 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-64-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_64_finetuned_squad_seed_6_en_4.3.0_3.0_1674216152630.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_64_finetuned_squad_seed_6_en_4.3.0_3.0_1674216152630.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_64_finetuned_squad_seed_6","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_64_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_64_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|419.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-64-finetuned-squad-seed-6 --- layout: model title: Legal Representations And Warranties Of The Company Clause Binary Classifier author: John Snow Labs name: legclf_representations_and_warranties_of_the_company_clause date: 2022-12-18 tags: [en, legal, classification, licensed, clause, bert, representations, and, warranties, of, the, company, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the representations-and-warranties-of-the-company clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it’s better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `representations-and-warranties-of-the-company`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_representations_and_warranties_of_the_company_clause_en_1.0.0_3.0_1671393657125.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_representations_and_warranties_of_the_company_clause_en_1.0.0_3.0_1671393657125.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_representations_and_warranties_of_the_company_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[representations-and-warranties-of-the-company]| |[other]| |[other]| |[representations-and-warranties-of-the-company]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_representations_and_warranties_of_the_company_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.97 1.00 0.99 39 representations-and-warranties-of-the-company 1.00 0.96 0.98 28 accuracy - - 0.99 67 macro-avg 0.99 0.98 0.98 67 weighted-avg 0.99 0.99 0.99 67 ``` --- layout: model title: Pipeline to Detect PHI for Deidentification purposes (Italian, reduced entities) author: John Snow Labs name: ner_deid_generic_pipeline date: 2023-03-13 tags: [deid, it, licensed] task: Named Entity Recognition language: it edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/03/25/ner_deid_generic_it_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_pipeline_it_4.3.0_3.2_1678744038782.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_pipeline_it_4.3.0_3.2_1678744038782.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_generic_pipeline", "it", "clinical/models") text = '''Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_generic_pipeline", "it", "clinical/models") val text = "Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:----------------------|--------:|------:|:------------|:-------------| | 0 | Gastone Montanariello | 9 | 29 | NAME | | | 1 | 49 | 32 | 33 | AGE | | | 2 | Ospedale San Camillo | 55 | 74 | LOCATION | | | 3 | marzo 2015 | 128 | 137 | DATE | | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|it| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Legal Tax matters Clause Binary Classifier (md) author: John Snow Labs name: legclf_tax_matters_md date: 2023-01-11 tags: [en, legal, classification, document, agreement, contract, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `tax-matters` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. 
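The paragraph-splitting option listed above can be approximated outside Spark NLP with a few lines of plain Python. This is a minimal sketch of the "multiline" heuristic, not the workshop's actual implementation:

```python
import re

def split_paragraphs(text):
    # Split on runs of one or more blank lines ("multiline" splitting),
    # dropping empty fragments. Each resulting piece is typically short
    # enough to stay within a 512-token clause classifier's limit.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

contract = "1. TAX MATTERS.\nThe Company shall...\n\n2. COVENANTS.\nThe Company agrees..."
print(len(split_paragraphs(contract)))  # 2
```

Each returned paragraph can then be fed to the classifier as a separate row of the input DataFrame.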
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `tax-matters` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_tax_matters_md_en_1.0.0_3.0_1673460237640.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_tax_matters_md_en_1.0.0_3.0_1673460237640.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_tax_matters_md", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[tax-matters]| |[other]| |[other]| |[tax-matters]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_tax_matters_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash precision recall f1-score support affirmative-covenants 1.00 0.91 0.95 32 other 0.93 1.00 0.96 39 accuracy 0.96 71 macro avg 0.96 0.95 0.96 71 weighted avg 0.96 0.96 0.96 71 ``` --- layout: model title: Part of Speech for Spanish author: John Snow Labs name: pos_ud_gsd date: 2020-02-17 00:16:00 +0800 task: Part of Speech Tagging language: es edition: Spark NLP 2.4.0 spark_version: 2.4 tags: [pos, es] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_es_2.4.0_2.4_1581891015986.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_gsd_es_2.4.0_2.4_1581891015986.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_gsd", "es") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Además de ser el rey del norte, John Snow es un médico inglés y líder en el desarrollo de la anestesia y la higiene médica.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_gsd", "es") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Además de ser el rey del norte, John Snow es un médico inglés y líder en el desarrollo de la anestesia y la higiene médica.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Además de ser el rey del norte, John Snow es un médico inglés y líder en el desarrollo de la anestesia y la higiene médica."""] pos_df = nlu.load('es.pos.ud_gsd').predict(text, output_level='token') pos_df ```
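Spark NLP annotations report inclusive `begin`/`end` character offsets, as the POS rows in the Results below show. A simplified whitespace-tokenizer sketch reproducing that offset convention (the real `Tokenizer` is considerably more sophisticated):

```python
def token_offsets(text):
    # Compute Spark NLP-style inclusive begin/end character offsets for
    # whitespace-separated tokens (a simplification of the real tokenizer).
    offsets, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok) - 1  # Spark NLP's 'end' is inclusive
        offsets.append((tok, start, end))
        pos = end + 1
    return offsets

print(token_offsets("Además de ser el rey"))
# [('Además', 0, 5), ('de', 7, 8), ('ser', 10, 12), ('el', 14, 15), ('rey', 17, 19)]
```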
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=5, result='ADV', metadata={'word': 'Además'}), Row(annotatorType='pos', begin=7, end=8, result='ADP', metadata={'word': 'de'}), Row(annotatorType='pos', begin=10, end=12, result='AUX', metadata={'word': 'ser'}), Row(annotatorType='pos', begin=14, end=15, result='DET', metadata={'word': 'el'}), Row(annotatorType='pos', begin=17, end=19, result='NOUN', metadata={'word': 'rey'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_gsd| |Type:|pos| |Compatibility:|Spark NLP 2.4.0| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|es| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Abkhazian asr_xls_r_ab_test_by_FitoDS TFWav2Vec2ForCTC from FitoDS author: John Snow Labs name: pipeline_asr_xls_r_ab_test_by_FitoDS date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_ab_test_by_FitoDS` is an Abkhazian model originally trained by FitoDS.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_xls_r_ab_test_by_FitoDS_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_ab_test_by_FitoDS_ab_4.2.0_3.0_1664021177866.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_ab_test_by_FitoDS_ab_4.2.0_3.0_1664021177866.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_xls_r_ab_test_by_FitoDS', lang = 'ab') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_xls_r_ab_test_by_FitoDS", lang = "ab") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_xls_r_ab_test_by_FitoDS| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|ab| |Size:|448.1 KB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Identify Job Experiences in the Past author: John Snow Labs name: finassertiondl_past_roles date: 2022-09-09 tags: [en, finance, assertion, status, job, experiences, past, licensed] task: Assertion Status language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects whether any Role, Job Title, Person, Organization, Date, etc. entity extracted with NER is expressed as a Past Experience. ## Predicted Entities `NO_PAST`, `PAST` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/ASSERTIONDL_PAST_ROLES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finassertiondl_past_roles_en_1.0.0_3.2_1662762393161.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finassertiondl_past_roles_en_1.0.0_3.2_1662762393161.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") # nlp.Tokenizer splits words in a relevant format for NLP tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") # Add as many NER models as you wish here. We have added 2 as an example. # ================ tokenClassifier = finance.BertForTokenClassification.pretrained("finner_bert_roles", "en", "finance/models")\ .setInputCols("token", "document")\ .setOutputCol("label") ner = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\ .setInputCols("document", "token", "embeddings")\ .setOutputCol("label2") ner_converter = finance.NerConverterInternal() \ .setInputCols(["document", "token", "label"]) \ .setOutputCol("ner_chunk") ner_converter2 = finance.NerConverterInternal() \ .setInputCols(["document", "token", "label2"]) \ .setOutputCol("ner_chunk2") merger = finance.ChunkMergeApproach()\ .setInputCols(["ner_chunk", "ner_chunk2"])\ .setOutputCol("merged_chunk") # ================ assertion = finance.AssertionDLModel.pretrained("finassertiondl_past_roles", "en", "finance/models")\ .setInputCols(["document", "merged_chunk", "embeddings"]) \ .setOutputCol("assertion") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, tokenClassifier, ner, ner_converter, ner_converter2, merger, assertion ]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) lp = nlp.LightPipeline(model) r = lp.fullAnnotate("Mrs. Charles was before Managing Director at Liberty, LC") ```
## Results ```bash chunk,begin,end,entity_type,assertion Mrs. Charles,0,11,PERSON,PAST Managing Director,24,40,ROLE,PAST "Liberty, LC",45,55,ORG,PAST ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finassertiondl_past_roles| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, doc_chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| |Size:|2.2 MB| ## References In-house annotations from 10K Filings and Wikidata ## Benchmarking ```bash label tp fp fn prec rec f1 NO_PAST 362 6 13 0.9836956 0.96533334 0.974428 PAST 196 13 6 0.9377990 0.97029704 0.953771 Macro-average 558 19 19 0.9607473 0.96781516 0.964268 Micro-average 558 19 19 0.9670710 0.96707106 0.967071 ``` --- layout: model title: German Legal Roberta Embeddings author: John Snow Labs name: roberta_large_german_legal date: 2023-02-16 tags: [de, german, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: de edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-german-roberta-large` is a German model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_large_german_legal_de_4.2.4_3.0_1676578377503.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_large_german_legal_de_4.2.4_3.0_1676578377503.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_large_german_legal", "de")\ .setInputCols(["sentence"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_large_german_legal", "de") .setInputCols("sentence") .setOutputCol("embeddings") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_large_german_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|de| |Size:|1.3 GB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-german-roberta-large --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Evelyn18) author: John Snow Labs name: distilbert_qa_evelyn18_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_evelyn18_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768454233.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_evelyn18_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768454233.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_evelyn18_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_evelyn18_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_evelyn18_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Evelyn18/distilbert-base-uncased-finetuned-squad --- layout: model title: English image_classifier_vit_places ViTForImageClassification from Giuliano author: John Snow Labs name: image_classifier_vit_places date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_places` is an English model originally trained by Giuliano. ## Predicted Entities `Beach`, `City`, `Forest` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_places_en_4.1.0_3.0_1660165666296.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_places_en_4.1.0_3.0_1660165666296.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_places", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_places", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_places| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English asr_wav2vec2_large_lv60_timit_asr TFWav2Vec2ForCTC from elgeish author: John Snow Labs name: asr_wav2vec2_large_lv60_timit_asr date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_lv60_timit_asr` is an English model originally trained by elgeish. NOTE: This model only works on a CPU; if you need to run it on a GPU device, please use asr_wav2vec2_large_lv60_timit_asr_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_lv60_timit_asr_en_4.2.0_3.0_1664040351891.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_lv60_timit_asr_en_4.2.0_3.0_1664040351891.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_lv60_timit_asr", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_lv60_timit_asr", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
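Both snippets assume an `audioDf` whose `audio_content` column holds the raw waveform as an array of floats. A hedged, stdlib-only sketch of producing that representation from 16-bit PCM WAV data (a synthetic tone stands in for a real recording; in practice librosa or soundfile is commonly used to load and resample audio):

```python
import math
import struct
import wave

# write a tiny synthetic 16 kHz mono WAV as a stand-in for a real recording
rate, n = 16000, 1600
with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)  # 16-bit PCM
    w.setframerate(rate)
    samples = [int(0.5 * 32767 * math.sin(2 * math.pi * 440 * i / rate)) for i in range(n)]
    w.writeframes(struct.pack("<%dh" % n, *samples))

# read it back as a float array with values normalized to [-1, 1]
with wave.open("tone.wav", "rb") as w:
    pcm = struct.unpack("<%dh" % w.getnframes(), w.readframes(w.getnframes()))
floats = [s / 32768.0 for s in pcm]

# audioDf = spark.createDataFrame([(floats,)], ["audio_content"])  # then run the pipeline above
print(len(floats), min(floats), max(floats))
```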
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_lv60_timit_asr| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Pipeline to Detect Cancer Genetics author: John Snow Labs name: ner_bionlp_pipeline date: 2023-03-15 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_bionlp](https://nlp.johnsnowlabs.com/2021/03/31/ner_bionlp_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_bionlp_pipeline_en_4.3.0_3.2_1678865044035.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_bionlp_pipeline_en_4.3.0_3.2_1678865044035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_bionlp_pipeline", "en", "clinical/models") text = '''The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_bionlp_pipeline", "en", "clinical/models") val text = "The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. 
Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.bionlp.pipeline").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-----------------------|--------:|------:|:---------------------|-------------:| | 0 | human | 4 | 8 | Organism | 0.9996 | | 1 | Kir 3.3 | 17 | 23 | Gene_or_gene_product | 0.99635 | | 2 | GIRK3 | 26 | 30 | Gene_or_gene_product | 1 | | 3 | potassium | 92 | 100 | Simple_chemical | 0.9452 | | 4 | GIRK | 103 | 106 | Gene_or_gene_product | 0.998 | | 5 | chromosome 1q21-23 | 188 | 205 | Cellular_component | 0.80115 | | 6 | pancreas | 697 | 704 | Organ | 0.9994 | | 7 | tissues | 740 | 746 | Tissue | 0.975 | | 8 | fat andskeletal muscle | 749 | 770 | Tissue | 0.955433 | | 9 | KCNJ9 | 801 | 805 | Gene_or_gene_product | 0.9172 | | 10 | Type II | 940 | 946 | Gene_or_gene_product | 0.98845 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_bionlp_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Word2Vec Embeddings in Scottish Gaelic (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, gd, open_source] task: Embeddings language: gd edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_gd_3.4.1_3.0_1647455551902.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_gd_3.4.1_3.0_1647455551902.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","gd") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Tha gaol agam air spark nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","gd") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Tha gaol agam air spark nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("gd.embed.w2v_cc_300d").predict("""Tha gaol agam air spark nlp""") ```
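`WordEmbeddingsModel` is a lookup-style annotator: each token is matched against the pretrained table (case-insensitively for this model, per the metadata below), and out-of-vocabulary tokens typically receive a zero vector. An illustrative stdlib sketch of that behavior — the 3-dimensional table here is invented for the example, while the real model uses 300 dimensions:

```python
DIM = 3  # real model: 300

# hypothetical fragment of a pretrained lookup table
table = {
    "gaol": [0.1, 0.3, -0.2],
    "agam": [0.4, -0.1, 0.0],
}

def embed(token, table, dim=DIM, case_sensitive=False):
    # lookup with optional lower-casing; OOV tokens map to the zero vector
    key = token if case_sensitive else token.lower()
    return table.get(key, [0.0] * dim)

print(embed("Gaol", table))   # found via lower-casing
print(embed("spark", table))  # OOV -> zero vector
```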
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|gd| |Size:|82.9 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Understanding the Restriction Level of Assignment Clauses author: John Snow Labs name: legclf_nda_assigments date: 2023-04-07 tags: [en, licensed, legal, classification, nda, assigments, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Given a clause classified as `ASSIGNMENT` by the `legmulticlf_mnda_sections_paragraph_other` classifier, you can further classify its sentences as `PERMISSIVE_ASSIGNMENT`, `RESTRICTIVE_ASSIGNMENT`, or `OTHER` using the `legclf_nda_assigments` model. ## Predicted Entities `PERMISSIVE_ASSIGNMENT`, `RESTRICTIVE_ASSIGNMENT`, `OTHER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_nda_assigments_en_1.0.0_3.0_1680898751373.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_nda_assigments_en_1.0.0_3.0_1680898751373.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pandas as pd document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_embeddings = nlp.UniversalSentenceEncoder.pretrained()\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") classifier = legal.ClassifierDLModel.pretrained("legclf_nda_assigments", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_embeddings, classifier ]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text_list = ["""This Agreement will be binding upon and inure to the benefit of each Party and its respective heirs, successors and assigns""", """All notices and other communications provided for in this Agreement and the other Loan Documents shall be in writing and may (subject to paragraph (b) below) be telecopied (faxed), mailed by certified mail return receipt requested, or delivered by hand or overnight courier service to the intended recipient at the addresses specified below or at such other address as shall be designated by any party listed below in a notice to the other parties listed below given in accordance with this Section.""", """This Agreement is a personal contract for XCorp, and the rights and interests of XCorp hereunder may not be sold, transferred, assigned, pledged or hypothecated except as otherwise expressly permitted by the Company"""] df = spark.createDataFrame(pd.DataFrame({"text" : text_list})) result = model.transform(df) ```
## Results ```bash +--------------------------------------------------------------------------------+----------------------+ | text| class| +--------------------------------------------------------------------------------+----------------------+ |This Agreement will be binding upon and inure to the benefit of each Party an...| PERMISSIVE_ASSIGNMENT| |All notices and other communications provided for in this Agreement and the o...| OTHER| |This Agreement is a personal contract for XCorp, and the rights and interests...|RESTRICTIVE_ASSIGNMENT| +--------------------------------------------------------------------------------+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_nda_assigments| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References In-house annotations on the Non-disclosure Agreements ## Benchmarking ```bash label precision recall f1-score support OTHER 0.88 1.00 0.94 29 PERMISSIVE_ASSIGNMENT 1.00 0.85 0.92 13 RESTRICTIVE_ASSIGNMENT 0.95 0.86 0.90 22 accuracy - - 0.92 64 macro-avg 0.94 0.90 0.92 64 weighted-avg 0.93 0.92 0.92 64 ``` --- layout: model title: Extraction of Clinical Abbreviations and Acronyms author: John Snow Labs name: ner_abbreviation_emb_clinical_large date: 2023-05-12 tags: [ner, abbreviation, acronym, en, clinical, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.1 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained to extract clinical abbreviations and acronyms in text. 
## Predicted Entities `ABBR` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ABBREVIATION/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_abbreviation_emb_clinical_large_en_4.4.1_3.0_1683884760156.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_abbreviation_emb_clinical_large_en_4.4.1_3.0_1683884760156.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_abbreviation_emb_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(['sentence', 'token', 'ner'])\ .setOutputCol('ner_chunk') pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. 
Group B strep has not been done as yet."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_abbreviation_emb_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter)) val data = Seq("Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
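The `begin`/`end` values in the results below are character offsets into the input text, with `end` inclusive; the 3-character chunk `CBC` therefore spans 126–128. A quick stdlib check of that convention on the example text:

```python
text = ("Gravid with estimated fetal weight of 6-6/12 pounds. "
        "LOWER EXTREMITIES: No edema. "
        "LABORATORY DATA: Laboratory tests include a CBC which is normal.")

chunk = "CBC"
begin = text.index(chunk)
end = begin + len(chunk) - 1  # inclusive end, as in Spark NLP annotations
print(begin, end)  # 126 128
```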
## Results ```bash +-----+-----+---+---------+ |chunk|begin|end|ner_label| +-----+-----+---+---------+ |CBC |126 |128|ABBR | |AB |159 |160|ABBR | |VDRL |189 |192|ABBR | |HIV |247 |249|ABBR | +-----+-----+---+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_abbreviation_emb_clinical_large| |Compatibility:|Healthcare NLP 4.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|2.8 MB| ## References Trained on the in-house dataset. ## Benchmarking ```bash label precision recall f1-score support ABBR 0.93 0.97 0.95 620 micro-avg 0.93 0.97 0.95 620 macro-avg 0.93 0.97 0.95 620 weighted-avg 0.93 0.97 0.95 620 ``` --- layout: model title: Extract Clinical Problem Entities (low granularity) from Voice of the Patient Documents (embeddings_clinical_large) author: John Snow Labs name: ner_vop_problem_reduced_emb_clinical_large date: 2023-06-07 tags: [licensed, clinical, ner, en, vop, problem] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts clinical problem mentions from documents written in the patient's own words. The taxonomy is reduced (one label for all clinical problems).
## Predicted Entities `Problem`, `HealthStatus`, `Modifier` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_reduced_emb_clinical_large_en_4.4.3_3.0_1686148163011.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_reduced_emb_clinical_large_en_4.4.3_3.0_1686148163011.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_problem_reduced_emb_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. 
After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_problem_reduced_emb_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
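`NerConverterInternal` merges the per-token IOB tags in the `ner` column into the chunks shown in the results below. An illustrative stdlib sketch of that merging logic (not the actual implementation):

```python
def iob_to_chunks(tokens, tags):
    """Merge B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" or an inconsistent I- tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["I", "have", "rheumatoid", "arthritis", "and", "fatigue"]
tags   = ["O", "O", "B-Problem", "I-Problem", "O", "B-Problem"]
print(iob_to_chunks(tokens, tags))
# [('rheumatoid arthritis', 'Problem'), ('fatigue', 'Problem')]
```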
## Results ```bash | chunk | ner_label | |:---------------------|:------------| | pain | Problem | | fatigue | Problem | | rheumatoid arthritis | Problem | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_problem_reduced_emb_clinical_large| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical_large| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
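The counts in the Benchmarking section below follow the standard definitions — precision = tp/(tp+fp), recall = tp/(tp+fn), and F1 is their harmonic mean. A quick stdlib check against the `Problem` row (tp=6106, fp=992, fn=1104):

```python
def prf(tp, fp, fn):
    # precision = tp/(tp+fp), recall = tp/(tp+fn), F1 = harmonic mean of the two
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return round(p, 2), round(r, 2), round(f1, 2)

print(prf(6106, 992, 1104))  # Problem row: (0.86, 0.85, 0.85)
```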
## Benchmarking ```bash label tp fp fn total precision recall f1 Problem 6106 992 1104 7210 0.86 0.85 0.85 HealthStatus 79 21 28 107 0.79 0.74 0.76 Modifier 816 248 323 1139 0.77 0.72 0.74 macro_avg 7001 1261 1455 8456 0.81 0.77 0.78 micro_avg 7001 1261 1455 8456 0.85 0.83 0.83 ``` --- layout: model title: English T5ForConditionalGeneration Base Cased model (from mtreviso) author: John Snow Labs name: t5_ct5_base_wiki date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ct5-base-en-wiki` is an English model originally trained by `mtreviso`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_ct5_base_wiki_en_4.3.0_3.0_1675100704207.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_ct5_base_wiki_en_4.3.0_3.0_1675100704207.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_ct5_base_wiki","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_ct5_base_wiki","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_ct5_base_wiki| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|928.7 MB| ## References - https://huggingface.co/mtreviso/ct5-base-en-wiki - https://github.com/mtreviso/chunked-t5 --- layout: model title: Detect Disease Mentions (MedicalBertForTokenClassification) (BERT) author: John Snow Labs name: bert_token_classifier_disease_mentions_tweet date: 2022-07-28 tags: [es, clinical, licensed, public_health, ner, token_classification, disease, tweet] task: Named Entity Recognition language: es edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects disease mentions in Spanish tweets. It was trained with the BertForTokenClassification method from the Transformers library, using [BERT-based](https://huggingface.co/amine/bert-base-5lang-cased) embeddings. ## Predicted Entities `ENFERMEDAD` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_NER_DISEASE_ES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4TC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_disease_mentions_tweet_es_4.0.0_3.0_1659033666412.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_disease_mentions_tweet_es_4.0.0_3.0_1659033666412.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_disease_mentions_tweet", "es", "clinical/models")\ .setInputCols("token", "sentence")\ .setOutputCol("label")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["sentence","token","label"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) data = spark.createDataFrame([["""El diagnóstico fueron varios. Principal: Neumonía en el pulmón derecho. Sinusitis de caballo, Faringitis aguda e infección de orina, también elevada. Gripe No. Estuvo hablando conmigo, sin exagerar, mas de media hora, dándome ánimo y fuerza y que sabe, porque ha visto."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_disease_mentions_tweet", "es", "clinical/models") .setInputCols(Array("token", "sentence")) .setOutputCol("label") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("sentence","token","label")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val data = Seq("El diagnóstico fueron varios. 
Principal: Neumonía en el pulmón derecho. Sinusitis de caballo, Faringitis aguda e infección de orina, también elevada. Gripe No. Estuvo hablando conmigo, sin exagerar, mas de media hora, dándome ánimo y fuerza y que sabe, porque ha visto").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.classify.disease_mentions").predict("""El diagnóstico fueron varios. Principal: Neumonía en el pulmón derecho. Sinusitis de caballo, Faringitis aguda e infección de orina, también elevada. Gripe No. Estuvo hablando conmigo, sin exagerar, mas de media hora, dándome ánimo y fuerza y que sabe, porque ha visto.""") ```
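The `ner_chunk` column produced above holds chunk annotations whose `metadata` carries the entity label. A plain-Python sketch of pairing each chunk with its label, assuming the chunks are exposed as dicts shaped like Spark NLP annotations (a `result` string plus a `metadata` dict with an `entity` key), as `result.select("ner_chunk").collect()` rows would look once converted:

```python
# Hypothetical post-processing: pair each detected chunk with its entity
# label, assuming annotation-like dicts (shape is an assumption here).
def chunks_to_pairs(chunks):
    return [(c["result"], c["metadata"]["entity"]) for c in chunks]

chunks = [
    {"result": "Neumonía en el pulmón", "metadata": {"entity": "ENFERMEDAD"}},
    {"result": "Sinusitis", "metadata": {"entity": "ENFERMEDAD"}},
]
print(chunks_to_pairs(chunks))
```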
## Results ```bash +---------------------+----------+ |chunk |ner_label | +---------------------+----------+ |Neumonía en el pulmón|ENFERMEDAD| |Sinusitis |ENFERMEDAD| |Faringitis aguda |ENFERMEDAD| |infección de orina |ENFERMEDAD| |Gripe |ENFERMEDAD| +---------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_disease_mentions_tweet| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|es| |Size:|461.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## Benchmarking ```bash label precision recall f1-score support B-ENFERMEDAD 0.74 0.95 0.83 4243 I-ENFERMEDAD 0.64 0.79 0.71 1570 micro-avg 0.71 0.91 0.80 5813 macro-avg 0.69 0.87 0.77 5813 weighted-avg 0.71 0.91 0.80 5813 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Slavic Languages author: John Snow Labs name: opus_mt_en_sla date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, sla, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. 
- source languages: `en` - target languages: `sla` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_sla_xx_2.7.0_2.4_1609164576227.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_sla_xx_2.7.0_2.4_1609164576227.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_sla", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Text to translate."] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_sla", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate.").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.sla').predict(text, output_level='sentence') opus_df ```
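Because the sentence detector splits each document before translation, `LightPipeline.annotate` returns one translated string per sentence under the `translation` output column (a plain dict of lists, no Spark action needed). A minimal plain-Python sketch of stitching them back together; the dict below is illustrative, not real model output:

```python
# Sketch, assuming the dict shape LightPipeline.annotate() returns
# (output column name -> list of result strings); "translation" matches
# the output column configured in the pipeline above.
def join_translation(annotations):
    return " ".join(annotations.get("translation", []))

annotations = {"sentence": ["First sentence.", "Second sentence."],
               "translation": ["Translated one.", "Translated two."]}
print(join_translation(annotations))
```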
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_sla| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from Yapese to English author: John Snow Labs name: opus_mt_yap_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, yap, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `yap` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_yap_en_xx_2.7.0_2.4_1609168741209.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_yap_en_xx_2.7.0_2.4_1609168741209.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_yap_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Text to translate."] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_yap_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate.").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.yap.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_yap_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_task_specific_distilation_on_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-base-task-specific-distilation-on-squad` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_task_specific_distilation_on_squad_en_4.3.0_3.0_1674210519978.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_task_specific_distilation_on_squad_en_4.3.0_3.0_1674210519978.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_task_specific_distilation_on_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_task_specific_distilation_on_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
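The predicted answer span ends up in the nested `answer.result` column, one list per question/context row. A plain-Python sketch of picking the top answer per row, as the lists from `result.select("answer.result").collect()` would look (helper name and sample values are illustrative):

```python
# Hypothetical helper: take the first predicted answer per row, or None
# when the model returned nothing for that row.
def first_answers(rows):
    return [row[0] if row else None for row in rows]

# Illustrative shape for the example question/context pair above.
rows = [["Clara"]]
print(first_answers(rows))
```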
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_task_specific_distilation_on_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|307.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/distilroberta-base-task-specific-distilation-on-squad --- layout: model title: Arabic Bert Embeddings (Base, Arabert Model, v02) author: John Snow Labs name: bert_embeddings_bert_base_arabertv02 date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabertv02` is an Arabic model originally trained by `aubmindlab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabertv02_ar_3.4.2_3.0_1649677022222.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_arabertv02_ar_3.4.2_3.0_1649677022222.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabertv02","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabertv02","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.bert_base_arabertv02").predict("""أنا أحب شرارة NLP""") ```
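The pipeline above emits one embedding vector per token. When a single document-level vector is needed, the usual trick is mean pooling (Spark NLP also ships a `SentenceEmbeddings` annotator for this); the arithmetic, shown here on toy 2-dimensional vectors in plain Python:

```python
# Mean-pool a list of token vectors (equal-length lists of floats) into
# one averaged vector; toy values, not real BERT outputs.
def mean_pool(token_vectors):
    if not token_vectors:
        return []
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # -> [2.0, 3.0]
```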
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_arabertv02| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|508.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/aubmindlab/bert-base-arabertv02 - https://github.com/google-research/bert - https://arxiv.org/abs/2003.00104 - https://github.com/WissamAntoun/pydata_khobar_meetup - http://alt.qcri.org/farasa/segmenter.html - https://github.com/google-research/bert/blob/master/multilingual.md - https://github.com/elnagara/HARD-Arabic-Dataset - https://www.aclweb.org/anthology/D15-1299 - https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf - https://github.com/mohamedadaly/LABR - http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp - https://github.com/husseinmozannar/SOQAL - https://github.com/aub-mind/arabert/blob/master/AraBERT/README.md - https://arxiv.org/abs/2003.00104v2 - https://archive.org/details/arwiki-20190201 - https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4 - https://www.aclweb.org/anthology/W19-4619 - https://sites.aub.edu.lb/mindlab/ - https://www.yakshof.com/#/ - https://www.behance.net/rahalhabib - https://www.linkedin.com/in/wissam-antoun-622142b4/ - https://twitter.com/wissam_antoun - https://github.com/WissamAntoun - https://www.linkedin.com/in/fadybaly/ - https://twitter.com/fadybaly - https://github.com/fadybaly --- layout: model title: Detect Drug Chemicals author: John Snow Labs name: ner_drugs date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## 
Description Pretrained named entity recognition deep learning model for Drugs. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. ## Predicted Entities `DrugChem` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_drugs_en_3.0.0_3.0_1617209727819.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_drugs_en_3.0.0_3.0_1617209727819.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_drugs", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts a Val366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. 
The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulness of vinorelbine monotherapy in patients with advanced or recurrent breast cancer after standard therapy, we evaluated the efficacy and safety of vinorelbine in patients previously treated with anthracyclines and taxanes."]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_drugs", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. 
We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts a Val366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulness of vinorelbine monotherapy in patients with advanced or recurrent breast cancer after standard therapy, we evaluated the efficacy and safety of vinorelbine in patients previously treated with anthracyclines and taxanes.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.drugs").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts a Val366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. 
The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulness of vinorelbine monotherapy in patients with advanced or recurrent breast cancer after standard therapy, we evaluated the efficacy and safety of vinorelbine in patients previously treated with anthracyclines and taxanes.""") ```
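Once the chunks and labels have been collected to the driver, simple summaries need no Spark session at all. A plain-Python sketch that tallies predicted labels from `(chunk, label)` pairs like the ones shown in the Results table (the pairs below mirror that table; the helper name is an assumption):

```python
from collections import Counter

# Tally how many chunks each NER label received.
def label_counts(pairs):
    return Counter(label for _chunk, label in pairs)

pairs = [("potassium", "DrugChem"), ("anthracyclines", "DrugChem"),
         ("taxanes", "DrugChem"), ("vinorelbine", "DrugChem")]
print(label_counts(pairs))
```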
## Results ```bash +-----------------+---------+ |chunk |ner | +-----------------+---------+ |potassium |DrugChem | |anthracyclines |DrugChem | |taxanes |DrugChem | |vinorelbine |DrugChem | +-----------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_drugs| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on i2b2_med7 + FDA with 'embeddings_clinical'. https://www.i2b2.org/NLP/Medication ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|:--------------|------:|------:|------:|---------:|---------:|---------:| | 0 | B-DrugChem | 32745 | 1738 | 979 | 0.949598 | 0.97097 | 0.960165 | | 1 | I-DrugChem | 35522 | 1551 | 764 | 0.958164 | 0.978945 | 0.968443 | | 2 | Macro-average | 68267 | 3289 | 1743 | 0.953881 | 0.974958 | 0.964304 | | 3 | Micro-average | 68267 | 3289 | 1743 | 0.954036 | 0.975104 | 0.964455 | ``` --- layout: model title: English DistilBertForQuestionAnswering model (from Sounak) author: John Snow Labs name: distilbert_qa_finetuned date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-finetuned` is an English model originally trained by `Sounak`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_finetuned_en_4.0.0_3.0_1654727500908.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_finetuned_en_4.0.0_3.0_1654727500908.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_finetuned","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_finetuned","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.by_Sounak").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_finetuned| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Sounak/distilbert-finetuned --- layout: model title: Detect PHI for Deidentification (Subentity-Augmented) author: John Snow Labs name: ner_deid_subentity_augmented date: 2021-09-03 tags: [deid, ner, en, i2b2, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.2.0 spark_version: 2.4 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The Named Entity Recognition annotator allows a generic model to be trained with a deep learning algorithm (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER is a Named Entity Recognition model that annotates text to find protected health information that may need to be deidentified. It detects 23 entities. This NER model was trained on a combination of the i2b2 train set and a re-augmented version of it. We adhered to the official annotation guidelines (AG) of the 2014 i2b2 Deid challenge while annotating new datasets for this model. 
All the details regarding the nuances and explanations for the AG can be found here: [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/) ## Predicted Entities `MEDICALRECORD`, `ORGANIZATION`, `DOCTOR`, `USERNAME`, `PROFESSION`, `HEALTHPLAN`, `URL`, `CITY`, `DATE`, `LOCATION-OTHER`, `STATE`, `PATIENT`, `DEVICE`, `COUNTRY`, `ZIP`, `PHONE`, `HOSPITAL`, `EMAIL`, `IDNUM`, `STREET`, `BIOID`, `FAX`, `AGE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_augmented_en_3.2.0_2.4_1630671569402.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_augmented_en_3.2.0_2.4_1630671569402.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pandas as pd document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") deid_ner = MedicalNerModel.pretrained("ner_deid_subentity_augmented", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk_subentity") nlpPipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, word_embeddings, deid_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame(pd.DataFrame({"text": ["""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. 
Patient's complaints first surfaced when he started working for Brothers Coal-Mine."""]}))) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val deid_ner = MedicalNerModel.pretrained("ner_deid_subentity_augmented", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk_subentity") val nlpPipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, deid_ner, ner_converter)) val data = Seq("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""").toDS.toDF("text") val result = nlpPipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.deid.subentity_augmented").predict("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""") ```
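The detected PHI chunks carry begin/end character offsets (inclusive end, as in Spark NLP annotations), which is all that is needed to mask them out of the original text. Note that Healthcare NLP ships a dedicated `DeIdentification` annotator for production masking/obfuscation; the plain-Python sketch below (hypothetical helper, toy spans) only illustrates the offset arithmetic:

```python
# Replace each (begin, end) span -- end inclusive -- with a placeholder.
def mask_phi(text, spans, placeholder="<PHI>"):
    out, last = [], 0
    for begin, end in sorted(spans):
        out.append(text[last:begin])
        out.append(placeholder)
        last = end + 1
    out.append(text[last:])
    return "".join(out)

# Toy offsets covering "2093-01-13" and "David Hale" in the string below.
print(mask_phi("Record date : 2093-01-13, David Hale, M.D.", [(14, 23), (26, 35)]))
# -> Record date : <PHI>, <PHI>, M.D.
```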
## Results ```bash +-----------------------------+-------------+ |chunk |ner_label | +-----------------------------+-------------+ |2093-01-13 |DATE | |David Hale |DOCTOR | |Hendrickson, Ora |PATIENT | |7194334 |MEDICALRECORD| |01/13/93 |DATE | |Oliveira |DOCTOR | |25-year-old |AGE | |1-11-2000 |DATE | |Cocke County Baptist Hospital|HOSPITAL | |0295 Keats Street. |STREET | |(302) 786-5227 |PHONE | |Brothers Coal-Mine |ORGANIZATION | +-----------------------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_subentity_augmented| |Compatibility:|Healthcare NLP 3.2.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source A custom data set created from the i2b2-PHI train set and its re-augmented version was used. ## Benchmarking ```bash label tp fp fn total precision recall f1 PATIENT 1465.0 159.0 162.0 1627.0 0.9021 0.9004 0.9013 HOSPITAL 1417.0 120.0 167.0 1584.0 0.9219 0.8946 0.908 DATE 5513.0 57.0 129.0 5642.0 0.9898 0.9771 0.9834 ORGANIZATION 101.0 25.0 37.0 138.0 0.8016 0.7319 0.7652 CITY 277.0 47.0 64.0 341.0 0.8549 0.8123 0.8331 STREET 405.0 7.0 10.0 415.0 0.983 0.9759 0.9794 USERNAME 88.0 2.0 13.0 101.0 0.9778 0.8713 0.9215 DEVICE 10.0 0.0 0.0 10.0 1.0 1.0 1.0 IDNUM 168.0 27.0 42.0 210.0 0.8615 0.8 0.8296 STATE 172.0 15.0 33.0 205.0 0.9198 0.839 0.8776 ZIP 137.0 0.0 2.0 139.0 1.0 0.9856 0.9928 MEDICALRECORD 416.0 14.0 28.0 444.0 0.9674 0.9369 0.9519 OTHER 16.0 4.0 5.0 21.0 0.8 0.7619 0.7805 PROFESSION 261.0 22.0 75.0 336.0 0.9223 0.7768 0.8433 PHONE 328.0 21.0 20.0 348.0 0.9398 0.9425 0.9412 COUNTRY 97.0 15.0 31.0 128.0 0.8661 0.7578 0.8083 DOCTOR 3279.0 139.0 268.0 3547.0 0.9593 0.9244 0.9416 AGE 715.0 39.0 47.0 762.0 0.9483 0.9383 0.9433 macro - - - - - - 0.7715 micro - - - - - - 0.9406 ``` --- layout: model title: Explain Document pipeline for Polish (explain_document_lg) author: John 
Snow Labs name: explain_document_lg date: 2021-03-23 tags: [open_source, polish, explain_document_lg, pipeline, pl] supported: true task: [Named Entity Recognition, Lemmatization] language: pl edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_lg is a pretrained pipeline that performs the basic text processing steps and recognizes entities. It covers most of the common text processing tasks on your dataframe. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_pl_3.0.0_3.0_1616509213756.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_lg_pl_3.0.0_3.0_1616509213756.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_lg', lang = 'pl') annotations = pipeline.fullAnnotate("Witaj z John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_lg", lang = "pl") val result = pipeline.fullAnnotate("Witaj z John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Witaj z John Snow Labs! "] result_df = nlu.load('pl.explain.lg').predict(text) result_df ```
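The pipeline's `entities` output is produced by grouping the token-level IOB `ner` tags into chunks. A simplified, illustrative sketch of that grouping in plain Python (not the actual Spark NLP NerConverter implementation):

```python
def iob_to_chunks(tokens, tags):
    # B-X starts a new chunk, I-X continues the current one, O closes it
    chunks, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(" ".join(current))
            current = [tok]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tokens = ["Witaj", "z", "John", "Snow", "Labs!"]
tags = ["O", "O", "B-PER", "I-PER", "I-PER"]
# iob_to_chunks(tokens, tags) → ["John Snow Labs!"], as in the Results table
```
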
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:-----------------------------|:----------------------------|:----------------------------------------|:----------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Witaj z John Snow Labs! '] | ['Witaj z John Snow Labs!'] | ['Witaj', 'z', 'John', 'Snow', 'Labs!'] | ['witać', 'z', 'John', 'Snow', 'Labs!'] | ['VERB', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.4977500140666961,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_lg| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|pl| --- layout: model title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011) author: John Snow Labs name: distilbert_token_classifier_autotrain_name_all_904029577 date: 2023-03-03 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-name_all-904029577` is an English model originally trained by `ismail-lucifer011`. 
## Predicted Entities `OOV`, `Name` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_all_904029577_en_4.3.1_3.0_1677881580824.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_all_904029577_en_4.3.1_3.0_1677881580824.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_all_904029577","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_all_904029577","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_name_all_904029577| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ismail-lucifer011/autotrain-name_all-904029577 --- layout: model title: Translate English to Berber Pipeline author: John Snow Labs name: translate_en_ber date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, ber, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `ber` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ber_xx_2.7.0_2.4_1609691346190.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ber_xx_2.7.0_2.4_1609691346190.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ber", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ber", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ber').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_ber| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Detect Chemical Compounds and Genes author: John Snow Labs name: ner_chemprot_clinical date: 2020-09-21 task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 2.6.0 spark_version: 2.4 tags: [ner, en, clinical, licensed] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a pre-trained model that can be used to automatically detect all chemical compounds and gene mentions from medical texts. ## Predicted Entities `CHEMICAL`, `GENE-Y`, `GENE-N` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CHEMPROT_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_CHEMPROT_CLINICAL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chemprot_clinical_en_2.5.5_2.4_1599360199717.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chemprot_clinical_en_2.5.5_2.4_1599360199717.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_chemprot_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.") ``` ```scala ... val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_chemprot_clinical", "en", "clinical/models") .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.chemprot.clinical").predict("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""") ```
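The chunks returned by this pipeline carry `begin` and `end` values, which are inclusive character offsets into the input text. A minimal, framework-independent illustration of how such offsets relate to a chunk (`char_span` is an illustrative helper, not a Spark NLP function):

```python
text = ("Keratinocyte growth factor and acidic fibroblast growth factor "
        "are mitogens for primary cultures of mammary epithelium.")

def char_span(text, chunk):
    # begin/end are inclusive character offsets, as in Spark NLP annotations
    begin = text.find(chunk)
    return begin, begin + len(chunk) - 1

# char_span(text, "Keratinocyte growth factor")      → (0, 25)
# char_span(text, "acidic fibroblast growth factor") → (31, 61)
```

These offsets match the begin/end columns shown for the example sentence in the Results section.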
{:.h2_title} ## Results ```bash +----+---------------------------------+---------+-------+----------+ | | chunk | begin | end | entity | +====+=================================+=========+=======+==========+ | 0 | Keratinocyte growth factor | 0 | 25 | GENE-Y | +----+---------------------------------+---------+-------+----------+ | 1 | acidic fibroblast growth factor | 31 | 61 | GENE-Y | +----+---------------------------------+---------+-------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_chemprot_clinical| |Type:|ner| |Compatibility:|Healthcare NLP 2.6.0+| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Case sensitive:|false| {:.h2_title} ## Data Source This model was trained on the ChemProt corpus using 'embeddings_clinical' embeddings. Make sure you use the same embeddings when running the model. {:.h2_title} ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|:--------------|-------:|------:|-----:|---------:|---------:|---------:| | 0 | B-GENE-Y | 4650 | 1090 | 838 | 0.810105 | 0.847303 | 0.828286 | | 1 | B-GENE-N | 1732 | 981 | 1019 | 0.638408 | 0.629589 | 0.633968 | | 2 | I-GENE-Y | 1846 | 571 | 573 | 0.763757 | 0.763125 | 0.763441 | | 3 | B-CHEMICAL | 7512 | 804 | 1136 | 0.903319 | 0.86864 | 0.88564 | | 4 | I-CHEMICAL | 1059 | 169 | 253 | 0.862378 | 0.807165 | 0.833858 | | 5 | I-GENE-N | 1393 | 853 | 598 | 0.620214 | 0.699648 | 0.657541 | | 6 | Macro-average | 18192 | 4468 | 4417 | 0.766363 | 0.769245 | 0.767801 | | 7 | Micro-average | 18192 | 4468 | 4417 | 0.802824 | 0.804635 | 0.803729 | ``` --- layout: model title: English DistilBertForQuestionAnswering model (from Sarmad) author: John Snow Labs name: distilbert_qa_projectmodel_bert date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 
supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `projectmodel-bert` is an English model originally trained by `Sarmad`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_projectmodel_bert_en_4.0.0_3.0_1654728496446.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_projectmodel_bert_en_4.0.0_3.0_1654728496446.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_projectmodel_bert","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_projectmodel_bert","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
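In the nlu one-liner above, the question and the context are packed into a single string separated by `|||`. A tiny helper showing that convention (the helper name is illustrative, not an nlu API):

```python
def split_qa(payload):
    # nlu question-answering payloads use "|||" between question and context
    question, context = payload.split("|||", 1)
    return question.strip(), context.strip()

q, c = split_qa("What is my name?|||My name is Clara and I live in Berkeley.")
# q → "What is my name?", c → "My name is Clara and I live in Berkeley."
```
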
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_projectmodel_bert| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Sarmad/projectmodel-bert --- layout: model title: Italian BERT Base Uncased author: John Snow Labs name: bert_base_italian_uncased date: 2021-05-20 tags: [open_source, it, embeddings, bert, italian] task: Embeddings language: it edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The source data for the Italian BERT model consists of a recent Wikipedia dump and various texts from the [OPUS corpora](http://opus.nlpl.eu/) collection. The final training corpus has a size of 13GB and 2,050,057,573 tokens. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_italian_uncased_it_3.1.0_2.4_1621508298738.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_italian_uncased_it_3.1.0_2.4_1621508298738.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_base_italian_uncased", "it") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_base_italian_uncased", "it") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("it.embed.bert.uncased").predict("""Put your text here.""") ```
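Once the pipeline has produced token embeddings, a common downstream step is comparing vectors by cosine similarity. A minimal, framework-independent sketch of that comparison:

```python
import math

def cosine(u, v):
    # cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# vectors pointing the same way score 1.0, orthogonal vectors score 0.0
```
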
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_base_italian_uncased| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[embeddings]| |Language:|it| |Case sensitive:|true| ## Data Source [https://huggingface.co/dbmdz/bert-base-italian-uncased](https://huggingface.co/dbmdz/bert-base-italian-uncased) --- layout: model title: English image_classifier_vit__flyswot_test ViTForImageClassification from davanstrien author: John Snow Labs name: image_classifier_vit__flyswot_test date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit__flyswot_test` is an English model originally trained by davanstrien. ## Predicted Entities `EDGE + SPINE`, `OTHER`, `PAGE + FOLIO`, `FLYSHEET`, `CONTAINER`, `CONTROL SHOT`, `COVER`, `SCROLL` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit__flyswot_test_en_4.1.0_3.0_1660168360163.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit__flyswot_test_en_4.1.0_3.0_1660168360163.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit__flyswot_test", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit__flyswot_test", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit__flyswot_test| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.7 MB| --- layout: model title: Legal Multilabel Classifier on Covid-19 exceptions author: John Snow Labs name: legmulticlf_covid19_exceptions_english date: 2023-04-12 tags: [legal, classification, en, multilabel, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: MultiClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Multi-Label Text Classification model that can be used to identify up to 6 classes to facilitate analysis, discovery and comparison of legal texts related to COVID-19 exception measures. The classes are as follows: - Closures/lockdown - Government_oversight - Restrictions_of_daily_liberties - Restrictions_of_fundamental_rights_and_civil_liberties - State_of_Emergency - Suspension_of_international_cooperation_and_commitments ## Predicted Entities `Closures/lockdown`, `Government_oversight`, `Restrictions_of_daily_liberties`, `Restrictions_of_fundamental_rights_and_civil_liberties`, `State_of_Emergency`, `Suspension_of_international_cooperation_and_commitments` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legmulticlf_covid19_exceptions_english_en_1.0.0_3.0_1681315675753.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legmulticlf_covid19_exceptions_english_en_1.0.0_3.0_1681315675753.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol('text')\ .setOutputCol('document') tokenizer = nlp.Tokenizer() \ .setInputCols(['document'])\ .setOutputCol('token') embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \ .setInputCols(['document', 'token'])\ .setOutputCol("embeddings") embeddingsSentence = nlp.SentenceEmbeddings() \ .setInputCols(['document', 'embeddings'])\ .setOutputCol('sentence_embeddings')\ .setPoolingStrategy('AVERAGE') classifierdl = nlp.MultiClassifierDLModel.pretrained("legmulticlf_covid19_exceptions_english", "en", "legal/models") \ .setInputCols(["sentence_embeddings"])\ .setOutputCol("class") clf_pipeline = nlp.Pipeline(stages=[document_assembler, tokenizer, embeddings, embeddingsSentence, classifierdl]) df = spark.createDataFrame([["First, we must protect the NHS’s ability to cope. We must be confident that we are able to provide sufficient critical care and specialist treatment right across the UK. The NHS staff have been incredible. We must continue to support them as much as we can."]]).toDF("text") model = clf_pipeline.fit(df) result = model.transform(df) result.select("text", "class.result").show(truncate=False) ```
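Multilabel classifiers of this kind emit a probability per class and return every label whose score clears a threshold (0.5 by default for Spark NLP's MultiClassifierDL, to the best of our knowledge). An illustrative sketch with made-up scores — this is plain Python, not the model's actual output format:

```python
def labels_above_threshold(scores, threshold=0.5):
    # scores: label -> probability from the multilabel classification head
    return [label for label, p in sorted(scores.items()) if p >= threshold]

scores = {
    "Government_oversight": 0.91,   # hypothetical probabilities
    "Closures/lockdown": 0.12,
    "State_of_Emergency": 0.33,
}
# labels_above_threshold(scores) → ["Government_oversight"]
```
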
## Results ```bash +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+ |text |result | +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+ |First, we must protect the NHS’s ability to cope. We must be confident that we are able to provide sufficient critical care and specialist treatment right across the UK. The NHS staff have been incredible. We must continue to support them as much as we can.|[Government_oversight]| +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legmulticlf_covid19_exceptions_english| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|13.9 MB| ## References Train dataset available [here](https://huggingface.co/datasets/joelito/covid19_emergency_event) ## Benchmarking ```bash label precision recall f1-score support Closures/lockdown 1.00 0.60 0.75 10 Government_oversight 0.88 1.00 0.94 22 Restrictions_of_daily_liberties 0.83 0.95 0.89 21 Restrictions_of_fundamental_rights_and_civil_liberties 1.00 0.88 0.93 8 State_of_Emergency 1.00 0.89 0.94 28 Suspension_of_international_cooperation_and_commitments 1.00 1.00 1.00 2 micro-avg 0.92 0.90 0.91 91 macro-avg 0.95 0.89 0.91 91 
weighted-avg 0.93 0.90 0.91 91 samples-avg 0.91 0.91 0.91 91 ``` --- layout: model title: Word2Vec Embeddings in Turkmen (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, tk, open_source] task: Embeddings language: tk edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_tk_3.4.1_3.0_1647463931803.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_tk_3.4.1_3.0_1647463931803.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","tk") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","tk") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("tk.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
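Token vectors from a WordEmbeddingsModel are often pooled into a single sentence vector by element-wise averaging, which is what Spark NLP's SentenceEmbeddings annotator does with the AVERAGE strategy. A minimal sketch of that pooling in plain Python:

```python
def average_pool(vectors):
    # element-wise mean over token vectors, all of the same dimension
    dim = len(vectors[0])
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

# average_pool([[1.0, 2.0], [3.0, 4.0]]) → [2.0, 3.0]
```
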
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|tk| |Size:|114.6 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011) author: John Snow Labs name: distilbert_token_classifier_autotrain_name_all_904029577 date: 2023-03-06 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-name_all-904029577` is an English model originally trained by `ismail-lucifer011`. ## Predicted Entities `OOV`, `Name` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_all_904029577_en_4.3.1_3.0_1678133978255.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_all_904029577_en_4.3.1_3.0_1678133978255.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_all_904029577","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_all_904029577","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_name_all_904029577| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ismail-lucifer011/autotrain-name_all-904029577 --- layout: model title: English XlmRoBertaForQuestionAnswering (from teacookies) author: John Snow Labs name: xlm_roberta_qa_autonlp_roberta_base_squad2_24465521 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-roberta-base-squad2-24465521` is an English model originally trained by `teacookies`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465521_en_4.0.0_3.0_1655986759710.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465521_en_4.0.0_3.0_1655986759710.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465521","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465521","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.xlm_roberta.base_24465521.by_teacookies").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
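Under the hood, an extractive QA model like this one scores every token as a potential answer start and end; the predicted answer is the span maximizing the sum of start and end scores with start ≤ end. A minimal, framework-free sketch (the helper and logit values below are hypothetical, for illustration only):

```python
def best_span(start_logits, end_logits, max_len=30):
    """Pick the (start, end) pair maximizing start_logits[s] + end_logits[e], with s <= e."""
    best = (0, 0, float("-inf"))
    for s, sl in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = sl + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best[:2]

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 1.0, 0.0]  # toy logits, not model output
end   = [0.0, 0.1, 0.0, 4.0, 0.2, 0.0, 0.0, 0.0, 2.0, 0.1]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```

The real model performs this selection internally; the `answer` output column already contains the decoded span.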
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_roberta_base_squad2_24465521| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|887.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-roberta-base-squad2-24465521 --- layout: model title: Recognize Entities OntoNotes pipeline - BERT Base author: John Snow Labs name: onto_recognize_entities_bert_base date: 2021-03-23 tags: [open_source, english, onto_recognize_entities_bert_base, pipeline, en] supported: true task: [Named Entity Recognition, Lemmatization] language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The onto_recognize_entities_bert_base pretrained pipeline performs basic text processing steps and recognizes entities, covering most of the common text processing tasks on your DataFrame. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_base_en_3.0.0_3.0_1616474549934.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_bert_base_en_3.0.0_3.0_1616474549934.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('onto_recognize_entities_bert_base', lang = 'en') annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("onto_recognize_entities_bert_base", lang = "en") val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs ! "] result_df = nlu.load('en.ner.onto.bert.base').predict(text) result_df ```
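The `entities` column in the results below is produced by merging the per-token BIO tags (`B-ORG`, `I-ORG`, `O`, …) into contiguous chunks. The idea can be sketched in plain Python (a hypothetical helper for illustration, not the Spark NLP `NerConverter` API):

```python
def bio_to_chunks(tokens, tags):
    """Merge BIO-tagged tokens into (text, label) entity chunks."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):           # a new entity starts here
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)            # continue the current entity
        else:                              # "O" or an inconsistent tag closes the chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Hello", "from", "John", "Snow", "Labs", "!"]
tags = ["O", "O", "B-ORG", "I-ORG", "I-ORG", "O"]
print(bio_to_chunks(tokens, tags))  # -> [('John Snow Labs', 'ORG')]
```

In the pretrained pipeline this merging is handled for you, so the `entities` column already contains whole spans.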
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:-----------------------------|:-------------------------------------------|:-------------------| | 0 | ['Hello from John Snow Labs ! '] | ['Hello from John Snow Labs !'] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | [[-0.085488274693489,.,...]] | ['O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O'] | ['John Snow Labs'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_recognize_entities_bert_base| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| --- layout: model title: Relation Extraction between dates and other entities author: John Snow Labs name: re_oncology_temporal_wip date: 2022-09-27 tags: [licensed, clinical, oncology, en, relation_extraction, temporal] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: RelationExtractionModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This relation extraction model links Date and Relative_Date extractions to clinical entities such as Test or Cancer_Dx. 
## Predicted Entities `is_date_of`, `O` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_oncology_temporal_wip_en_4.0.0_3.0_1664297421226.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_oncology_temporal_wip_en_4.0.0_3.0_1664297421226.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Each relevant relation pair in the pipeline should include one date entity (Date or Relative_Date) and a clinical entity (such as Pathology_Test, Cancer_Dx or Chemotherapy).
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") re_model = RelationExtractionModel.pretrained("re_oncology_temporal_wip", "en", "clinical/models") \ .setInputCols(["embeddings", "pos_tags", "ner_chunk", "dependencies"]) \ .setOutputCol("relation_extraction") \ .setRelationPairs(["Cancer_Dx-Date", "Date-Cancer_Dx", "Relative_Date-Cancer_Dx", "Cancer_Dx-Relative_Date", "Cancer_Surgery-Date", "Date-Cancer_Surgery", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery"]) \ .setMaxSyntacticDistance(10) pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_model]) data = spark.createDataFrame([["Her breast cancer was diagnosed three years ago, and a bilateral mastectomy was performed last month."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = 
new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentence", "pos_tags", "token")) .setOutputCol("dependencies") val re_model = RelationExtractionModel.pretrained("re_oncology_temporal_wip", "en", "clinical/models") .setInputCols(Array("embeddings", "pos_tags", "ner_chunk", "dependencies")) .setOutputCol("relation_extraction") .setRelationPairs(Array("Cancer_Dx-Date", "Date-Cancer_Dx", "Relative_Date-Cancer_Dx", "Cancer_Dx-Relative_Date", "Cancer_Surgery-Date", "Date-Cancer_Surgery", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery")) .setMaxSyntacticDistance(10) val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_model)) val data = Seq("Her breast cancer was diagnosed three years ago, and a bilateral mastectomy was performed last month.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu 
nlu.load("en.relation.oncology_temporal_wip").predict("""Her breast cancer was diagnosed three years ago, and a bilateral mastectomy was performed last month.""") ```
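`setRelationPairs` restricts candidate relations to whitelisted entity-type pairs, and `setMaxSyntacticDistance` discards pairs that are too far apart in the dependency tree. The filtering idea can be sketched as follows (a simplified, hypothetical helper: a linear token distance stands in for the real syntactic distance):

```python
def candidate_pairs(chunks, allowed_pairs, max_distance=10):
    """Keep entity pairs whose types match an allowed 'Type1-Type2' string
    and whose (proxy) distance is within max_distance.
    chunks: list of (text, entity_type, position) triples."""
    allowed = {tuple(p.split("-")) for p in allowed_pairs}
    pairs = []
    for i, (t1, e1, p1) in enumerate(chunks):
        for t2, e2, p2 in chunks[i + 1:]:
            if (e1, e2) in allowed and abs(p1 - p2) <= max_distance:
                pairs.append((t1, t2))
    return pairs

chunks = [("breast cancer", "Cancer_Dx", 2),
          ("three years ago", "Relative_Date", 6),
          ("mastectomy", "Cancer_Surgery", 12),
          ("last month", "Relative_Date", 15)]
allowed = ["Cancer_Dx-Relative_Date", "Cancer_Surgery-Relative_Date"]
print(candidate_pairs(chunks, allowed))
# -> [('breast cancer', 'three years ago'), ('mastectomy', 'last month')]
```

Only the pairs that survive this filter are scored by the relation extraction model, which is why both directions of each pair (e.g. `Date-Cancer_Dx` and `Cancer_Dx-Date`) are listed explicitly in the pipeline above.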
## Results ```bash chunk1 entity1 chunk2 entity2 relation confidence breast cancer Cancer_Dx three years ago Relative_Date is_date_of 0.5886298 breast cancer Cancer_Dx last month Relative_Date O 0.9708738 three years ago Relative_Date mastectomy Cancer_Surgery O 0.6020852 mastectomy Cancer_Surgery last month Relative_Date is_date_of 0.9277692 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_oncology_temporal_wip| |Type:|re| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]| |Output Labels:|[relations]| |Language:|en| |Size:|265.6 KB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label recall precision f1 O 0.79 0.76 0.77 is_date_of 0.74 0.77 0.75 macro-avg 0.76 0.76 0.76 ``` --- layout: model title: Part of Speech for English (pos_anc) author: John Snow Labs name: pos_anc date: 2021-03-05 tags: [en, open_source, part of speech, pos] supported: true task: Part of Speech Tagging language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Part of Speech model for English, trained on the ANC dataset. It predicts Penn Treebank-style tags such as `NNP`, `IN`, and `UH`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_anc_en_3.0.0_3.0_1614962126490.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_anc_en_3.0.0_3.0_1614962126490.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_anc", "en") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Hello from John Snow Labs!']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_anc", "en") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Hello from John Snow Labs!").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs!"] pos_df = nlu.load('en.pos.anc').predict(text, output_level = "token") pos_df ```
{:.h2_title} ## Results ```bash +----------+----------+ |token_result|pos_result| +----------+----------+ |Hello |UH | |from |IN | |John |NNP | |Snow |NNP | |Labs |NNP | |! |. | +----------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_anc| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|en| --- layout: model title: Translate English to Greek Pipeline author: John Snow Labs name: translate_en_el date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, el, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `el` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_el_xx_2.7.0_2.4_1609687029159.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_el_xx_2.7.0_2.4_1609687029159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_el", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_el", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.el').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_el| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Toxic content classifier for Russian author: John Snow Labs name: bert_sequence_classifier_toxicity date: 2021-12-22 tags: [sentiment, bert, sequence, russian, ru, open_source] task: Text Classification language: ru edition: Spark NLP 3.3.4 spark_version: 2.4 supported: true annotator: BertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model was imported from `Hugging Face` and it's been fine-tuned for the Russian language, leveraging `Bert` embeddings and `BertForSequenceClassification` for text classification purposes. ## Predicted Entities `neutral`, `toxic` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_RU_TOXIC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_RU_TOXIC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_toxicity_ru_3.3.4_2.4_1640162987772.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_toxicity_ru_3.3.4_2.4_1640162987772.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = BertForSequenceClassification \ .pretrained('bert_sequence_classifier_toxicity', 'ru') \ .setInputCols(['token', 'document']) \ .setOutputCol('class') pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier]) example = spark.createDataFrame([["Ненавижу тебя, идиот."]]).toDF("text") result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_toxicity", "ru") .setInputCols("document", "token") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) val example = Seq("Ненавижу тебя, идиот.").toDS.toDF("text") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("ru.classify.toxic").predict("""Ненавижу тебя, идиот.""")  # Russian: "I hate you, idiot."
```
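The sequence classifier produces one logit per class (`neutral`, `toxic`); the predicted label is the argmax after a softmax. A minimal sketch with made-up logits (the helper and values are hypothetical, for illustration):

```python
import math

def classify(logits, labels):
    """Softmax the logits and return (best_label, probability)."""
    m = max(logits)                               # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]

label, prob = classify([-1.2, 2.3], ["neutral", "toxic"])  # toy logits, not model output
print(label)  # -> toxic
```

The `class` output column already contains this decoded label, so no manual post-processing is needed in the pipeline above.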
## Results ```bash ['toxic'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_toxicity| |Compatibility:|Spark NLP 3.3.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|ru| |Size:|665.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## Data Source [https://huggingface.co/SkolkovoInstitute/russian_toxicity_classifier](https://huggingface.co/SkolkovoInstitute/russian_toxicity_classifier) ## Benchmarking ```bash label precision recall f1-score support neutral 0.98 0.99 0.98 21384 toxic 0.94 0.92 0.93 4886 accuracy - - 0.97 26270 macro-avg 0.96 0.96 0.96 26270 weighted-avg 0.97 0.97 0.97 26270 ``` --- layout: model title: Fast Neural Machine Translation Model from Central Bikol to French author: John Snow Labs name: opus_mt_bcl_fr date: 2021-06-01 tags: [open_source, seq2seq, translation, bcl, fr, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. 
source languages: bcl target languages: fr {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_fr_xx_3.1.0_2.4_1622567201659.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_fr_xx_3.1.0_2.4_1622567201659.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_bcl_fr", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_bcl_fr", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Central Bikol.translate_to.French').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_bcl_fr| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Detect Clinical Entities (ner_jsl) author: John Snow Labs name: ner_jsl date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terminology. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Definitions of Predicted Entities: - `Age`: All mentions of ages, past or present, related to the patient or anybody else. - `Temperature`: All mentions that refer to body temperature. - `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments. - `Symptom`: All the symptoms mentioned in the document, of the patient or someone else. - `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs). - `Route`: Drug and medication administration routes, as described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted. 
- `Allergen`: Allergen-related extractions mentioned in the document. - `Gender`: Gender-specific nouns and pronouns. - `Pulse`: Peripheral heart rate, without advanced information like measurement location. - `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately. - `Frequency`: Frequency of administration for a dose prescribed. - `Weight`: All mentions related to a patient's weight. - `Dosage`: Quantity prescribed by the physician for an active ingredient; measurement units are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Respiration`: Number of breaths per minute. ## Predicted Entities `Diagnosis`, `Procedure_Name`, `Lab_Result`, `Procedure`, `Procedure_Findings`, `O2_Saturation`, `Procedure_incident_description`, `Dosage`, `Causative_Agents_(Virus_and_Bacteria)`, `Name`, `Cause_of_death`, `Substance_Name`, `Weight`, `Symptom_Name`, `Maybe`, `Modifier`, `Blood_Pressure`, `Frequency`, `Gender`, `Drug_incident_description`, `Age`, `Drug_Name`, `Temperature`, `Section_Name`, `Route`, `Negation`, `Negated`, `Allergenic_substance`, `Lab_Name`, `Respiratory_Rate`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_en_3.0.0_3.0_1617209720718.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_en_3.0.0_3.0_1617209720718.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") jsl_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("jsl_ner") jsl_ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "jsl_ner"]) \ .setOutputCol("ner_chunk") jsl_ner_pipeline = Pipeline().setStages([ documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter]) jsl_ner_model = jsl_ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature."]]).toDF("text") result = jsl_ner_model.transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val jsl_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("jsl_ner") val jsl_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "jsl_ner")) .setOutputCol("ner_chunk") val jsl_ner_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter)) val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text") val result = jsl_ner_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
## Results ```bash +---------------------------+------------+ |chunk |ner | +---------------------------+------------+ |21-day-old |Age | |male |Gender | |congestion |Symptom_Name| |mom |Gender | |suctioning yellow discharge|Symptom_Name| |she |Gender | |mild |Modifier | |problems with his breathing|Symptom_Name| |negative |Negated | |perioral cyanosis |Symptom_Name| |retractions |Symptom_Name| |mom |Gender | |Tylenol |Drug_Name | |His |Gender | |his |Gender | |respiratory congestion |Symptom_Name| |He |Gender | |tired |Symptom_Name| |fussy |Symptom_Name| |albuterol |Drug_Name | +---------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on data gathered and manually annotated by John Snow Labs. https://www.johnsnowlabs.com/data/ ## Benchmarking ```bash label tp fp fn prec rec f1 B-Pulse_Rate 77 39 12 0.663793 0.865169 0.75122 I-Diagnosis 2134 1139 1329 0.652001 0.616229 0.63361 I-Procedure_Name 2335 1329 956 0.637282 0.709511 0.671459 B-Lab_Result 601 182 94 0.767561 0.864748 0.813261 B-Procedure 1 0 5 1 0.166667 0.285714 B-Procedure_Findings 2 13 72 0.133333 0.027027 0.044943 B-O2_Saturation 1 3 4 0.25 0.2 0.222222 B-Dosage 477 197 68 0.707715 0.875229 0.782609 I-Causative_Agents_(Virus_and_Bacteria) 12 2 7 0.857143 0.631579 0.727273 B-Name 562 268 554 0.677108 0.503584 0.577595 I-Cause_of_death 9 5 11 0.642857 0.45 0.529412 I-Substance_Name 24 34 54 0.413793 0.307692 0.352941 I-Name 716 377 710 0.655078 0.502104 0.56848 B-Cause_of_death 9 6 8 0.6 0.529412 0.5625 B-Weight 52 22 9 0.702703 0.852459 0.77037 B-Symptom_Name 4364 1916 1652 0.694904 0.725399 0.709824 I-Maybe 27 51 61 0.346154 0.306818 0.325301 I-Symptom_Name 2073 1492 2348 0.581487 0.468898 0.519159 B-Modifier 1573 890 768 0.638652 0.671935 0.654871 
B-Blood_Pressure 76 19 13 0.8 0.853933 0.826087 B-Frequency 308 134 77 0.696833 0.8 0.744861 I-Gender 26 31 28 0.45614 0.481482 0.468468 I-Drug_incident_description 4 10 57 0.285714 0.0655738 0.106667 B-Drug_incident_description 2 5 23 0.285714 0.08 0.125 I-Age 5 0 9 1 0.357143 0.526316 B-Drug_Name 1741 490 290 0.780368 0.857213 0.816987 B-Substance_Name 148 41 48 0.783069 0.755102 0.768831 B-Temperature 56 23 13 0.708861 0.811594 0.756757 I-Procedure 1 0 7 1 0.125 0.222222 B-Section_Name 2711 317 166 0.89531 0.942301 0.918205 I-Route 119 110 189 0.519651 0.386364 0.443203 B-Maybe 143 80 127 0.641256 0.52963 0.580122 B-Gender 5166 709 58 0.879319 0.988897 0.930895 I-Dosage 434 196 87 0.688889 0.833013 0.754127 B-Causative_Agents_(Virus_and_Bacteria) 19 3 8 0.863636 0.703704 0.77551 I-Frequency 275 134 191 0.672372 0.590129 0.628571 B-Age 357 27 16 0.929688 0.957105 0.943197 I-Lab_Result 45 78 152 0.365854 0.228426 0.28125 B-Negation 99 38 38 0.722628 0.722628 0.722628 B-Diagnosis 2786 1342 913 0.674903 0.753177 0.711895 I-Section_Name 3885 1353 179 0.741695 0.955955 0.835304 B-Route 421 217 166 0.659875 0.717206 0.687347 I-Negation 11 30 24 0.268293 0.314286 0.289474 B-Procedure_Name 1490 811 522 0.647545 0.740557 0.690934 B-Negated 1490 332 215 0.817783 0.8739 0.844911 I-Allergenic_substance 1 0 12 1 0.0769231 0.142857 I-Negated 89 132 146 0.402715 0.378723 0.390351 I-Procedure_Findings 2 31 283 0.0606061 0.00701754 0.012578 B-Allergenic_substance 72 29 24 0.712871 0.75 0.730965 I-Weight 47 35 16 0.573171 0.746032 0.648276 B-Lab_Name 804 290 122 0.734918 0.868251 0.79604 I-Modifier 99 73 422 0.575581 0.190019 0.285714 I-Temperature 1 0 14 1 0.0666667 0.125 I-Drug_Name 362 284 261 0.560372 0.581059 0.570528 I-Lab_Name 284 194 127 0.594142 0.690998 0.63892 B-Respiratory_Rate 46 5 5 0.901961 0.901961 0.901961 Macro-average 38674 15571 13819 0.589085 0.515426 0.5498 Micro-average 38674 15571 13819 0.712951 0.736746 0.724653 ``` --- layout: model title: Arabic 
BertForMaskedLM Base Cased model (from aubmindlab) author: John Snow Labs name: bert_embeddings_base_arabert date: 2022-12-02 tags: [ar, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ar edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-arabert` is an Arabic model originally trained by `aubmindlab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabert_ar_4.2.4_3.0_1670015735145.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_arabert_ar_4.2.4_3.0_1670015735145.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabert","ar") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_arabert","ar")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
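The `embeddings` column of `result` holds one vector per token. A common downstream sanity check is cosine similarity between token vectors; the sketch below is plain Python with toy 4-dimensional vectors standing in for the real 768-dimensional BERT embeddings (the values are illustrative, not model output):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dimensional vectors standing in for the 768-dimensional
# token vectors found in the "embeddings" output column.
v1 = [0.1, 0.3, -0.2, 0.5]
v2 = [0.1, 0.3, -0.2, 0.5]
v3 = [-0.5, 0.2, 0.3, -0.1]

print(round(cosine(v1, v2), 6))  # 1.0 (identical vectors)
print(cosine(v1, v3) < 0)        # True (roughly opposite vectors)
```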
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_arabert| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|507.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/aubmindlab/bert-base-arabert - https://github.com/google-research/bert - https://arxiv.org/abs/2003.00104 - https://github.com/WissamAntoun/pydata_khobar_meetup - http://alt.qcri.org/farasa/segmenter.html - https://github.com/google-research/bert/blob/master/multilingual.md - https://github.com/elnagara/HARD-Arabic-Dataset - https://www.aclweb.org/anthology/D15-1299 - https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf - https://github.com/mohamedadaly/LABR - http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp - https://github.com/husseinmozannar/SOQAL - https://github.com/aub-mind/arabert/blob/master/AraBERT/README.md - https://arxiv.org/abs/2003.00104v2 - https://archive.org/details/arwiki-20190201 - https://www.semanticscholar.org/paper/1.5-billion-words-Arabic-Corpus-El-Khair/f3eeef4afb81223df96575adadf808fe7fe440b4 - https://www.aclweb.org/anthology/W19-4619 - https://sites.aub.edu.lb/mindlab/ - https://www.yakshof.com/#/ - https://www.behance.net/rahalhabib - https://www.linkedin.com/in/wissam-antoun-622142b4/ - https://twitter.com/wissam_antoun - https://github.com/WissamAntoun - https://www.linkedin.com/in/fadybaly/ - https://twitter.com/fadybaly - https://github.com/fadybaly --- layout: model title: Abkhazian asr_xls_r_ab_test_by_hf_test TFWav2Vec2ForCTC from hf-test author: John Snow Labs name: pipeline_asr_xls_r_ab_test_by_hf_test date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type:
cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_ab_test_by_hf_test` is an Abkhazian model originally trained by hf-test. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_xls_r_ab_test_by_hf_test_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_ab_test_by_hf_test_ab_4.2.0_3.0_1664021541031.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_ab_test_by_hf_test_ab_4.2.0_3.0_1664021541031.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_xls_r_ab_test_by_hf_test', lang = 'ab') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_xls_r_ab_test_by_hf_test", lang = "ab") val annotations = pipeline.transform(audioDF) ```
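The snippet above assumes an `audioDF` whose `audio_content` column already holds normalized float samples; how to produce those floats is not shown. A minimal, hypothetical sketch using only the Python standard library to decode mono 16-bit PCM WAV bytes into floats in [-1.0, 1.0]:

```python
import io
import struct
import wave

def wav_to_floats(wav_bytes):
    """Decode mono 16-bit PCM WAV bytes into floats in [-1.0, 1.0] --
    the kind of sample array an "audio_content" column is built from."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit PCM"
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a tiny in-memory WAV (four samples at 16 kHz) to demonstrate.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))

floats = wav_to_floats(buf.getvalue())
print(floats)  # [0.0, 0.5, -0.5, 0.999969482421875]
```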
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_xls_r_ab_test_by_hf_test| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|ab| |Size:|452.3 KB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Pipeline to Detect PHI for Deidentification (Enriched) author: John Snow Labs name: ner_deid_enriched_pipeline date: 2023-03-13 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deid_enriched](https://nlp.johnsnowlabs.com/2021/03/31/ner_deid_enriched_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_enriched_pipeline_en_4.3.0_3.2_1678736351306.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_enriched_pipeline_en_4.3.0_3.2_1678736351306.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_enriched_pipeline", "en", "clinical/models") text = '''HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_enriched_pipeline", "en", "clinical/models") val text = "HISTORY OF PRESENT ILLNESS: Mr. 
Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.deid_enriched.pipeline").predict("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. 
On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.""") ```
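`fullAnnotate` returns annotation objects that expose the chunk text, character offsets, and a metadata dict; flattening them into rows yields a table like the one in the Results section. A sketch using simplified dict stand-ins for the real annotation objects (the values are illustrative):

```python
# Simplified dict stand-ins for the chunk annotations returned by
# pipeline.fullAnnotate(text)[0]["ner_chunk"]; the real objects expose
# .result, .begin, .end and a .metadata dict with entity/confidence.
chunks = [
    {"result": "Smith", "begin": 32, "end": 36,
     "metadata": {"entity": "PATIENT", "confidence": "0.9997"}},
    {"result": "02/04/2003", "begin": 341, "end": 350,
     "metadata": {"entity": "DATE", "confidence": "0.9995"}},
]

rows = [
    (c["result"], c["begin"], c["end"],
     c["metadata"]["entity"], float(c["metadata"]["confidence"]))
    for c in chunks
]
for row in rows:
    print(row)  # e.g. ('Smith', 32, 36, 'PATIENT', 0.9997)
```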
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:--------------|--------:|------:|:------------|-------------:| | 0 | Smith | 32 | 36 | PATIENT | 0.9997 | | 1 | VA Hospital | 184 | 194 | HOSPITAL | 0.74455 | | 2 | 02/04/2003 | 341 | 350 | DATE | 0.9995 | | 3 | Smith | 374 | 378 | PATIENT | 0.9994 | | 4 | Smith | 782 | 786 | PATIENT | 0.9996 | | 5 | Smith | 1131 | 1135 | PATIENT | 0.9995 | | 6 | Ardmore Tower | 1155 | 1167 | PATIENT | 0.2694 | | 7 | Hart | 1221 | 1224 | DOCTOR | 0.9985 | | 8 | Smith | 1231 | 1235 | PATIENT | 0.9992 | | 9 | 02/07/2003 | 1329 | 1338 | DATE | 1 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_enriched_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Legal No strict construction Clause Binary Classifier author: John Snow Labs name: legclf_no_strict_construction_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `no-strict-construction` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. 
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens; if your texts are longer, consider splitting them into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `no-strict-construction` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_no_strict_construction_clause_en_1.0.0_3.2_1660122698925.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_no_strict_construction_clause_en_1.0.0_3.2_1660122698925.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_no_strict_construction_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
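The "paragraph splitting (by multiline)" strategy recommended in the description can be sketched in plain Python; this is a minimal stand-in for illustration, not Spark NLP's own splitting utilities:

```python
import re

def split_paragraphs(text):
    """Split a long document into paragraphs on blank lines -- the
    "paragraph splitting (by multiline)" strategy in its simplest form."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = """Section 1. Definitions.

This Agreement shall not be construed strictly against either party.

Section 2. Governing Law."""

paras = split_paragraphs(doc)
print(len(paras))  # 3
print(paras[1])
```

Each resulting paragraph can then be fed to the classifier pipeline as a separate `clause_text` row.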
## Results ```bash +-------+ | result| +-------+ |[no-strict-construction]| |[other]| |[other]| |[no-strict-construction]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_no_strict_construction_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support no-strict-construction 0.97 0.93 0.95 40 other 0.97 0.99 0.98 101 accuracy - - 0.97 141 macro-avg 0.97 0.96 0.96 141 weighted-avg 0.97 0.97 0.97 141 ``` --- layout: model title: Swedish BertForMaskedLM Cased model (from Addedk) author: John Snow Labs name: bert_embeddings_kb_distilled_cased date: 2022-12-06 tags: [sv, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: sv edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `kbbert-distilled-cased` is a Swedish model originally trained by `Addedk`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_kb_distilled_cased_sv_4.2.4_3.0_1670326799484.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_kb_distilled_cased_sv_4.2.4_3.0_1670326799484.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_kb_distilled_cased","sv") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_kb_distilled_cased","sv")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_kb_distilled_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|sv| |Size:|308.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/Addedk/kbbert-distilled-cased - https://spraakbanken.gu.se/en/resources/gigaword - https://github.com/AddedK/swedish-mbert-distillation/blob/main/azureML/pretrain_distillation.py - https://kth.diva-portal.org/smash/record.jsf?aq2=%5B%5B%5D%5D&c=2&af=%5B%5D&searchType=UNDERGRADUATE&sortOrder2=title_sort_asc&language=en&pid=diva2%3A1698451&aq=%5B%5B%7B%22freeText%22%3A%22added+kina%22%7D%5D%5D&sf=all&aqe=%5B%5D&sortOrder=author_sort_asc&onlyFullText=false&noOfRows=50&dswid=-6142 - https://arxiv.org/abs/2103.06418 --- layout: model title: BioBERT Embeddings (Pubmed PMC) author: John Snow Labs name: biobert_pubmed_pmc_base_cased date: 2020-09-19 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.2 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains pre-trained weights of BioBERT, a language representation model for the biomedical domain, especially designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)". 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/biobert_pubmed_pmc_base_cased_en_2.6.2_2.4_1600530770096.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/biobert_pubmed_pmc_base_cased_en_2.6.2_2.4_1600530770096.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("biobert_pubmed_pmc_base_cased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("biobert_pubmed_pmc_base_cased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I hate cancer").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer"] embeddings_df = nlu.load('en.embed.biobert.pubmed_pmc_base_cased').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash token en_embed_biobert_pubmed_pmc_base_cased_embeddings I [-0.012962102890014648, 0.27699071168899536, 0... hate [0.1688309609889984, 0.5337603688240051, 0.148... cancer [0.1850549429655075, 0.05875205248594284, -0.5... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|biobert_pubmed_pmc_base_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.2| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Ddaow) author: John Snow Labs name: distilbert_qa_ddaow_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is a English model originally trained by `Ddaow`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_ddaow_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768387383.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_ddaow_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768387383.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ddaow_base_uncased_finetuned_squad","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ddaow_base_uncased_finetuned_squad","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_ddaow_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Ddaow/distilbert-base-uncased-finetuned-squad --- layout: model title: English DistilBertForQuestionAnswering model (from leemii18) author: John Snow Labs name: distilbert_qa_robustqa_baseline_02 date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `robustqa-baseline-02` is a English model originally trained by `leemii18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_robustqa_baseline_02_en_4.0.0_3.0_1654728551710.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_robustqa_baseline_02_en_4.0.0_3.0_1654728551710.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_robustqa_baseline_02","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_robustqa_baseline_02","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.distil_bert.base").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_robustqa_baseline_02| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/leemii18/robustqa-baseline-02 --- layout: model title: Mapping Entities with Corresponding RxNorm Codes According to the National Institute of Health (NIH) Database author: John Snow Labs name: rxnorm_nih_mapper date: 2023-02-23 tags: [rxnorm, nih, chunk_mapping, clinical, en, licensed] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 4.3.0 spark_version: [3.0, 3.2] supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps entities with their corresponding RxNorm codes according to the National Institute of Health (NIH) database. It returns RxNorm codes with their NIH RxNorm term types in parentheses. ## Predicted Entities `rxnorm_code` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_nih_mapper_en_4.3.0_3.0_1677156206111.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_nih_mapper_en_4.3.0_3.0_1677156206111.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel\ .pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") posology_ner_model = MedicalNerModel\ .pretrained("ner_posology_greedy", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("posology_ner") posology_ner_converter = NerConverterInternal()\ .setInputCols("sentence", "token", "posology_ner")\ .setOutputCol("ner_chunk") chunkerMapper = ChunkMapperModel\ .pretrained("rxnorm_nih_mapper", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings")\ .setRels(["rxnorm_code"]) mapper_pipeline = Pipeline().setStages([ document_assembler, sentence_detector, tokenizer, word_embeddings, posology_ner_model, posology_ner_converter, chunkerMapper]) test_data = spark.createDataFrame([["The patient was given Adapin 10 MG Oral Capsule, acetohexamide and Parlodel"]]).toDF("text") mapper_model = mapper_pipeline.fit(test_data) result = mapper_model.transform(test_data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val posology_ner_model = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("posology_ner") val posology_ner_converter = new NerConverterInternal() .setInputCols("sentence", "token", "posology_ner") .setOutputCol("ner_chunk") val chunkerMapper = ChunkMapperModel.pretrained("rxnorm_nih_mapper", "en", "clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("mappings") .setRels(Array("rxnorm_code")) val mapper_pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, posology_ner_model, posology_ner_converter, chunkerMapper)) val data = Seq("The patient was given Adapin 10 MG Oral Capsule, acetohexamide and Parlodel").toDS.toDF("text") val result = mapper_pipeline.fit(data).transform(data) ```
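The mapper's `mappings` column holds strings such as `1911002 (SY)`: the RxNorm code followed by its NIH term type in parentheses, as described above. A minimal, Spark-free sketch of pulling the two fields apart (the helper name is illustrative, not part of the library):

```python
import re

def parse_mapping(mapping: str):
    """Split an RxNorm mapping like '1911002 (SY)' into (code, term_type)."""
    match = re.fullmatch(r"(\d+)\s*\(([A-Z]+)\)", mapping.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)

print(parse_mapping("1911002 (SY)"))  # ('1911002', 'SY')
```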
## Results ```bash +-------------------------+-------------+-----------+ |ner_chunk |mappings |relation | +-------------------------+-------------+-----------+ |Adapin 10 MG Oral Capsule|1911002 (SY) |rxnorm_code| |acetohexamide |12250421 (IN)|rxnorm_code| |Parlodel |829 (BN) |rxnorm_code| +-------------------------+-------------+-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|rxnorm_nih_mapper| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|10.3 MB| ## References Trained in February 2023 on NIH data: https://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from vitusya) author: John Snow Labs name: distilbert_qa_vitusya_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `vitusya`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_vitusya_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773100746.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_vitusya_base_uncased_finetuned_squad_en_4.3.0_3.0_1672773100746.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vitusya_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_vitusya_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_vitusya_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/vitusya/distilbert-base-uncased-finetuned-squad --- layout: model title: Chinese BertForQuestionAnswering model (from luhua) author: John Snow Labs name: bert_qa_chinese_pretrain_mrc_roberta_wwm_ext_large date: 2022-06-02 tags: [zh, open_source, question_answering, bert] task: Question Answering language: zh edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_pretrain_mrc_roberta_wwm_ext_large` is a Chinese model originally trained by `luhua`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_chinese_pretrain_mrc_roberta_wwm_ext_large_zh_4.0.0_3.0_1654187272890.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_chinese_pretrain_mrc_roberta_wwm_ext_large_zh_4.0.0_3.0_1654187272890.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_chinese_pretrain_mrc_roberta_wwm_ext_large","zh") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_chinese_pretrain_mrc_roberta_wwm_ext_large","zh") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("zh.answer_question.bert.large.by_luhua").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_chinese_pretrain_mrc_roberta_wwm_ext_large| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|zh| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/luhua/chinese_pretrain_mrc_roberta_wwm_ext_large - https://github.com/basketballandlearn/MRC_Competition_Dureader --- layout: model title: Legal Mechanical Engineering Document Classifier (EURLEX) author: John Snow Labs name: legclf_mechanical_engineering_bert date: 2023-03-06 tags: [en, legal, classification, clauses, mechanical_engineering, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_mechanical_engineering_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class Mechanical_Engineering or not (Binary Classification) according to EuroVoc labels. ## Predicted Entities `Mechanical_Engineering`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_mechanical_engineering_bert_en_1.0.0_3.0_1678111872224.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_mechanical_engineering_bert_en_1.0.0_3.0_1678111872224.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_mechanical_engineering_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +------------------------+ |result | +------------------------+ |[Mechanical_Engineering]| |[Other] | |[Other] | |[Mechanical_Engineering]| +------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_mechanical_engineering_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.5 MB| ## References Training dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Mechanical_Engineering 0.85 0.93 0.89 125 Other 0.92 0.84 0.88 125 accuracy - - 0.88 250 macro-avg 0.89 0.88 0.88 250 weighted-avg 0.89 0.88 0.88 250 ``` --- layout: model title: Pipeline to Detect Concepts in Drug Development Trials (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_drug_development_trials_pipeline date: 2022-03-23 tags: [licensed, ner, clinical, bertfortokenclassification, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_drug_development_trials](https://nlp.johnsnowlabs.com/2021/12/17/bert_token_classifier_drug_development_trials_en.html) model. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DRUGS_DEVELOPMENT_TRIALS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DRUGS_DEVELOPMENT_TRIALS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_drug_development_trials_pipeline_en_3.4.1_3.0_1648044113917.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_drug_development_trials_pipeline_en_3.4.1_3.0_1648044113917.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_drug_development_trials_pipeline", "en", "clinical/models") pipeline.fullAnnotate("In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_drug_development_trials_pipeline", "en", "clinical/models") pipeline.fullAnnotate("In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan.") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.druge_developement.pipeline").predict("""In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan.""") ```
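`fullAnnotate` returns annotation objects whose metadata carries the predicted entity label; the exact object structure varies by version, so the dictionaries below only illustrate the shape, they are not taken from the Spark NLP API. A sketch of flattening such output into the chunk/entity table shown in the results:

```python
# Illustrative annotation records (structure assumed, values from the example text).
annotations = [
    {"result": "overall survival", "metadata": {"entity": "End_Point"}},
    {"result": "topotecan", "metadata": {"entity": "Trial_Group"}},
    {"result": "4.0", "metadata": {"entity": "Value"}},
]

# Flatten into (chunk, entity) rows, one per recognized chunk.
rows = [(a["result"], a["metadata"]["entity"]) for a in annotations]
for chunk, entity in rows:
    print(f"{chunk}\t{entity}")
```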
## Results ```bash | | chunk | entity | |---:|:------------------|:--------------| | 0 | median | Duration | | 1 | overall survival | End_Point | | 2 | with | Trial_Group | | 3 | without topotecan | Trial_Group | | 4 | 4.0 | Value | | 5 | 3.6 months | Value | | 6 | 23 | Patient_Count | | 7 | 63 | Patient_Count | | 8 | 55 | Patient_Count | | 9 | 33 patients | Patient_Count | | 10 | topotecan | Trial_Group | | 11 | 11 | Patient_Count | | 12 | 61 | Patient_Count | | 13 | 66 | Patient_Count | | 14 | 32 patients | Patient_Count | | 15 | without topotecan | Trial_Group | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_drug_development_trials_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverter --- layout: model title: Word2Vec Embeddings in Waray (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, war, open_source] task: Embeddings language: war edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_war_3.4.1_3.0_1647467209473.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_war_3.4.1_3.0_1647467209473.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","war") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","war") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("war.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
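Embedding lookups like this are usually consumed downstream through vector similarity. A minimal cosine-similarity sketch, with toy 3-dimensional vectors standing in for the model's 300-dimensional output:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Identical vectors have similarity 1.0; orthogonal vectors have 0.0.
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))  # 1.0
```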
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|war| |Size:|560.1 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Legal Trust Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_trust_agreement date: 2022-11-24 tags: [en, legal, classification, agreement, trust, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_trust_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `trust-agreement` or not (Binary Classification). Longformers are limited to 4096 tokens, so only the first 4096 tokens are taken into account. We have found that, for the vast majority of documents in legal corpora, if they are clean and contain only the legal document without any extra leading material, 4096 tokens are enough to perform Document Classification. If not, let us know and we can take another approach for you: splitting the document into 4096-token chunks, averaging their embeddings, and training on the averaged version, so that the whole document is taken into account. In theory, however, this should not be required. ## Predicted Entities `trust-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_trust_agreement_en_1.0.0_3.0_1669295168150.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_trust_agreement_en_1.0.0_3.0_1669295168150.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_trust_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
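The description above mentions a fallback for documents longer than 4096 tokens: embed each 4096-token chunk and average the chunk embeddings. A Spark-free sketch of that idea, using tiny numbers and a stand-in `embed` function (both hypothetical, for illustration only):

```python
def chunked(tokens, size):
    """Yield consecutive fixed-size windows over a token list."""
    for i in range(0, len(tokens), size):
        yield tokens[i:i + size]

def embed(chunk):
    # Stand-in for a real embedder: here, a 1-d "vector" of mean token length.
    return [sum(len(t) for t in chunk) / len(chunk)]

def average_embeddings(tokens, chunk_size=4096):
    """Embed each chunk and average the vectors component-wise."""
    vectors = [embed(c) for c in chunked(tokens, chunk_size)]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

doc = ["lorem"] * 10
vec = average_embeddings(doc, chunk_size=4)
```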
## Results ```bash +-----------------+ |result | +-----------------+ |[trust-agreement]| |[other] | |[other] | |[trust-agreement]| +-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_trust_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.8 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.99 0.97 0.98 90 trust-agreement 0.93 0.97 0.95 38 accuracy - - 0.97 128 macro-avg 0.96 0.97 0.96 128 weighted-avg 0.97 0.97 0.97 128 ``` --- layout: model title: English DistilBertForQuestionAnswering model (from aaraki) author: John Snow Labs name: distilbert_qa_aaraki_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `aaraki`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_aaraki_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724898100.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_aaraki_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724898100.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_aaraki_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_aaraki_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_aaraki").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_aaraki_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aaraki/distilbert-base-uncased-finetuned-squad --- layout: model title: Detect Persons, Locations and Organization Entities in Turkish (bert_multi_cased) author: John Snow Labs name: turkish_ner_bert date: 2020-11-10 task: Named Entity Recognition language: tr edition: Spark NLP 2.6.2 spark_version: 2.4 tags: [tr, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Pretrained Named Entity Recognition (NER) deep learning model for Turkish texts. It recognizes Persons, Locations, and Organization entities using multi-lingual Bert word embeddings. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. {:.h2_title} ## Predicted Entities Persons-``PER``, Locations-``LOC``, Organizations-``ORG``. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_TR/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_TR.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/turkish_ner_bert_tr_2.6.2_2.4_1605043368882.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/turkish_ner_bert_tr_2.6.2_2.4_1605043368882.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... ner_model = NerDLModel.pretrained("turkish_ner_bert", "tr") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (28 Ekim 1955 doğumlu), Amerikalı bir iş adamı, yazılım geliştirici, yatırımcı ve hayırseverdir. En çok Microsoft şirketinin kurucu ortağı olarak bilinir. William Gates , Microsoft şirketindeki kariyeri boyunca başkan, icra kurulu başkanı, başkan ve yazılım mimarisi başkanı pozisyonlarında bulunmuş, aynı zamanda Mayıs 2014'e kadar en büyük bireysel hissedar olmuştur. O, 1970'lerin ve 1980'lerin mikrobilgisayar devriminin en tanınmış girişimcilerinden ve öncülerinden biridir. Seattle Washington'da doğup büyüyen William Gates, 1975'te New Mexico Albuquerque'de çocukluk arkadaşı Paul Allen ile Microsoft şirketini kurdu; dünyanın en büyük kişisel bilgisayar yazılım şirketi haline geldi. William Gates, Ocak 2000'de icra kurulu başkanı olarak istifa edene kadar şirketi başkan ve icra kurulu başkanı olarak yönetti ve daha sonra yazılım mimarisi başkanı oldu. 1990'ların sonlarında, William Gates rekabete aykırı olduğu düşünülen iş taktikleri nedeniyle eleştirilmişti. Bu görüş, çok sayıda mahkeme kararıyla onaylanmıştır. Haziran 2006'da William Gates, Microsoft şirketinde yarı zamanlı bir göreve ve 2000 yılında eşi Melinda Gates ile birlikte kurdukları özel hayır kurumu olan B&Melinda G. Vakfı'nda tam zamanlı çalışmaya geçeceğini duyurdu. Görevlerini kademeli olarak Ray Ozzie ve Craig Mundie'ye devretti. 
Şubat 2014'te Microsoft başkanlığından ayrıldı ve yeni atanan icra kurulu başkanı, Satya Nadella'yı desteklemek için teknoloji danışmanı olarak yeni bir göreve başladı."]], ["text"])) ``` ```scala ... val ner_model = NerDLModel.pretrained("turkish_ner_bert", "tr") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (28 Ekim 1955 doğumlu), Amerikalı bir iş adamı, yazılım geliştirici, yatırımcı ve hayırseverdir. En çok Microsoft şirketinin kurucu ortağı olarak bilinir. William Gates , Microsoft şirketindeki kariyeri boyunca başkan, icra kurulu başkanı, başkan ve yazılım mimarisi başkanı pozisyonlarında bulunmuş, aynı zamanda Mayıs 2014'e kadar en büyük bireysel hissedar olmuştur. O, 1970'lerin ve 1980'lerin mikrobilgisayar devriminin en tanınmış girişimcilerinden ve öncülerinden biridir. Seattle Washington'da doğup büyüyen William Gates, 1975'te New Mexico Albuquerque'de çocukluk arkadaşı Paul Allen ile Microsoft şirketini kurdu; dünyanın en büyük kişisel bilgisayar yazılım şirketi haline geldi. William Gates, Ocak 2000'de icra kurulu başkanı olarak istifa edene kadar şirketi başkan ve icra kurulu başkanı olarak yönetti ve daha sonra yazılım mimarisi başkanı oldu. 1990'ların sonlarında, William Gates rekabete aykırı olduğu düşünülen iş taktikleri nedeniyle eleştirilmişti. Bu görüş, çok sayıda mahkeme kararıyla onaylanmıştır. Haziran 2006'da William Gates, Microsoft şirketinde yarı zamanlı bir göreve ve 2000 yılında eşi Melinda Gates ile birlikte kurdukları özel hayır kurumu olan B&Melinda G. Vakfı'nda tam zamanlı çalışmaya geçeceğini duyurdu. Görevlerini kademeli olarak Ray Ozzie ve Craig Mundie'ye devretti. Şubat 2014'te Microsoft başkanlığından ayrıldı ve yeni atanan icra kurulu başkanı, Satya Nadella'yı desteklemek için teknoloji danışmanı olarak yeni bir göreve başladı.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (28 Ekim 1955 doğumlu), Amerikalı bir iş adamı, yazılım geliştirici, yatırımcı ve hayırseverdir. En çok Microsoft şirketinin kurucu ortağı olarak bilinir. William Gates , Microsoft şirketindeki kariyeri boyunca başkan, icra kurulu başkanı, başkan ve yazılım mimarisi başkanı pozisyonlarında bulunmuş, aynı zamanda Mayıs 2014'e kadar en büyük bireysel hissedar olmuştur. O, 1970'lerin ve 1980'lerin mikrobilgisayar devriminin en tanınmış girişimcilerinden ve öncülerinden biridir. Seattle Washington'da doğup büyüyen William Gates, 1975'te New Mexico Albuquerque'de çocukluk arkadaşı Paul Allen ile Microsoft şirketini kurdu; dünyanın en büyük kişisel bilgisayar yazılım şirketi haline geldi. William Gates, Ocak 2000'de icra kurulu başkanı olarak istifa edene kadar şirketi başkan ve icra kurulu başkanı olarak yönetti ve daha sonra yazılım mimarisi başkanı oldu. 1990'ların sonlarında, William Gates rekabete aykırı olduğu düşünülen iş taktikleri nedeniyle eleştirilmişti. Bu görüş, çok sayıda mahkeme kararıyla onaylanmıştır. Haziran 2006'da William Gates, Microsoft şirketinde yarı zamanlı bir göreve ve 2000 yılında eşi Melinda Gates ile birlikte kurdukları özel hayır kurumu olan B&Melinda G. Vakfı'nda tam zamanlı çalışmaya geçeceğini duyurdu. Görevlerini kademeli olarak Ray Ozzie ve Craig Mundie'ye devretti. Şubat 2014'te Microsoft başkanlığından ayrıldı ve yeni atanan icra kurulu başkanı, Satya Nadella'yı desteklemek için teknoloji danışmanı olarak yeni bir göreve başladı."""] ner_df = nlu.load('tr.ner.bert').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
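The `ner_converter` stage turns token-level BIO tags into the entity chunks shown in the results. A minimal sketch of that conversion (independent of Spark NLP's actual implementation; the function name is illustrative):

```python
def bio_to_chunks(tokens, tags):
    """Merge B-/I- tagged tokens into (chunk, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # "O" tag: flush any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["William", "Gates", "founded", "Microsoft"]
tags = ["B-PER", "I-PER", "O", "B-ORG"]
print(bio_to_chunks(tokens, tags))  # [('William Gates', 'PER'), ('Microsoft', 'ORG')]
```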
{:.h2_title} ## Results ```bash +-------------------------+---------+ |chunk |ner_label| +-------------------------+---------+ |William Henry Gates III |PER | |Microsoft |ORG | |William Gates |PER | |Microsoft |ORG | |Seattle Washington'da |LOC | |William Gates |PER | |New Mexico Albuquerque'de|LOC | |Paul Allen |PER | |Microsoft |ORG | |William Gates |PER | |William Gates |PER | |William Gates |PER | |Microsoft |ORG | |Melinda Gates |PER | |B&Melinda G. Vakfı'nda |ORG | |Ray Ozzie |PER | |Craig Mundie'ye |PER | |Microsoft |ORG | |Satya Nadella'yı |PER | +-------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|turkish_ner_bert| |Type:|ner| |Compatibility:|Spark NLP 2.6.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|tr| |Dependencies:|bert_multi_cased| {:.h2_title} ## Data Source Trained on a custom dataset with multi-lingual Bert Embeddings ``bert_multi_cased``. 
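The per-label figures in the benchmarking table that follows are straightforward functions of the raw tp/fp/fn counts; as a sanity check, the arithmetic can be reproduced in a few lines of plain Python (values taken from the B-LOC row and the micro totals of the table):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw true-positive / false-positive / false-negative counts."""
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return prec, rec, f1

# B-LOC row: tp=1949, fp=156, fn=158
print(prf(1949, 156, 158))   # ≈ (0.9258907, 0.9250119, 0.9254511)

# Micro-average over all labels: tp=9638, fp=959, fn=815
print(prf(9638, 959, 815))   # ≈ (0.9095027, 0.9220319, 0.9157245)
```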
{:.h2_title}
## Benchmarking
```bash
label   tp    fp   fn   prec        rec         f1
B-LOC   1949  156  158  0.92589074  0.9250119   0.9254511
I-ORG   1266  266  98   0.8263708   0.9281525   0.8743094
I-LOC   270   54   79   0.8333333   0.77363896  0.8023774
I-PER   1507  89   94   0.94423556  0.9412867   0.94275886
B-ORG   1805  242  119  0.88177824  0.9381497   0.90909094
B-PER   2841  152  267  0.9492148   0.91409266  0.93132275

tp: 9638 fp: 959 fn: 815 labels: 6
Macro-average prec: 0.8934706, rec: 0.90338874, f1: 0.89840233
Micro-average prec: 0.9095027, rec: 0.92203194, f1: 0.91572446
```

---
layout: model
title: Legal Parking Clause Binary Classifier
author: John Snow Labs
name: legclf_parking_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `parking` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into account that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (see the tutorial linked above).
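As an illustration of the first technique above (paragraph splitting by multiline), long documents can be pre-split in plain Python before building the Spark DataFrame. This is only a minimal sketch of the idea, not the workshop's exact implementation:

```python
import re

def split_paragraphs(text):
    """Split a document into paragraphs on blank lines (one or more empty lines)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

contract = (
    "9. Parking. Tenant may use the parking areas adjacent to the Premises.\n\n"
    "10. Governing Law. This Agreement shall be governed by the laws of Delaware."
)
clauses = split_paragraphs(contract)
# Each element can then become one row of the DataFrame fed to the classifier.
```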
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values, one for each clause model you have added.

## Predicted Entities

`other`, `parking`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_parking_clause_en_1.0.0_3.2_1660123803547.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_parking_clause_en_1.0.0_3.2_1660123803547.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_parking_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
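When several of these binary clause classifiers are run over the same text, each contributes one prediction that is either its clause name or `other`. Collapsing those predictions into the True/False flags described earlier can be sketched in plain Python (the clause names below are illustrative):

```python
def clause_flags(predictions):
    """predictions: {clause_name: predicted_label}, where the label is either
    the clause name itself or "other". Returns {clause_name: bool}."""
    return {clause: label != "other" for clause, label in predictions.items()}

flags = clause_flags({"parking": "parking", "governing-law": "other"})
print(flags)  # {'parking': True, 'governing-law': False}
```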
## Results

```bash
+---------+
|result   |
+---------+
|[parking]|
|[other]  |
|[other]  |
|[parking]|
+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_parking_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
label         precision  recall  f1-score  support
other         0.94       0.99    0.96      69
parking       0.96       0.85    0.90      27
accuracy      -          -       0.95      96
macro-avg     0.95       0.92    0.93      96
weighted-avg  0.95       0.95    0.95      96
```

---
layout: model
title: Legal Governing Law Clause Binary Classifier
author: John Snow Labs
name: legclf_governing_law_clause
date: 2022-12-07
tags: [en, legal, classification, clauses, governing_law, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `governing-law` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into account that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting the text into smaller pieces (see the tutorial linked above).

This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values, one for each clause model you have added.

## Predicted Entities

`governing-law`, `other`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_governing_law_clause_en_1.0.0_3.0_1670444230550.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_governing_law_clause_en_1.0.0_3.0_1670444230550.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_governing_law_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+---------------+
|result         |
+---------------+
|[governing-law]|
|[other]        |
|[other]        |
|[governing-law]|
+---------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_governing_law_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.8 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
label          precision  recall  f1-score  support
governing-law  0.96       1.00    0.98      25
other          1.00       0.99    0.99      73
accuracy       -          -       0.99      98
macro-avg      0.98       0.99    0.99      98
weighted-avg   0.99       0.99    0.99      98
```

---
layout: model
title: English DistilBertForQuestionAnswering Base Cased model (from superspray)
author: John Snow Labs
name: distilbert_qa_base_squad2_custom_dataset
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert_base_squad2_custom_dataset` is an English model originally trained by `superspray`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_squad2_custom_dataset_en_4.3.0_3.0_1672774514588.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_squad2_custom_dataset_en_4.3.0_3.0_1672774514588.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_squad2_custom_dataset","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_squad2_custom_dataset","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_squad2_custom_dataset|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/superspray/distilbert_base_squad2_custom_dataset

---
layout: model
title: Pipeline to Extract Entities in Spanish Clinical Trial Abstracts
author: John Snow Labs
name: ner_clinical_trials_abstracts_pipeline
date: 2023-03-09
tags: [es, clinical, licensed, ner, clinical_abstracts, chem, diso, proc]
task: Named Entity Recognition
language: es
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [ner_clinical_trials_abstracts](https://nlp.johnsnowlabs.com/2022/08/12/ner_clinical_trials_abstracts_es_3_0.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_trials_abstracts_pipeline_es_4.3.0_3.2_1678380962617.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_trials_abstracts_pipeline_es_4.3.0_3.2_1678380962617.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_clinical_trials_abstracts_pipeline", "es", "clinical/models") text = '''Efecto de la suplementación con ácido fólico sobre los niveles de homocisteína total en pacientes en hemodiálisis. La hiperhomocisteinemia es un marcador de riesgo independiente de morbimortalidad cardiovascular. Hemos prospectivamente reducir los niveles de homocisteína total (tHcy) mediante suplemento con ácido fólico y vitamina B6 (pp), valorando su posible correlación con dosis de diálisis, función residual y parámetros nutricionales.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_clinical_trials_abstracts_pipeline", "es", "clinical/models") val text = "Efecto de la suplementación con ácido fólico sobre los niveles de homocisteína total en pacientes en hemodiálisis. La hiperhomocisteinemia es un marcador de riesgo independiente de morbimortalidad cardiovascular. Hemos prospectivamente reducir los niveles de homocisteína total (tHcy) mediante suplemento con ácido fólico y vitamina B6 (pp), valorando su posible correlación con dosis de diálisis, función residual y parámetros nutricionales." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:------------------------------|--------:|------:|:------------|-------------:| | 0 | suplementación | 13 | 26 | PROC | 0.9987 | | 1 | ácido fólico | 32 | 43 | CHEM | 0.8828 | | 2 | niveles de homocisteína | 55 | 77 | PROC | 0.584633 | | 3 | hemodiálisis | 101 | 112 | PROC | 0.9998 | | 4 | hiperhomocisteinemia | 118 | 137 | DISO | 0.9977 | | 5 | niveles de homocisteína total | 248 | 276 | PROC | 0.604225 | | 6 | tHcy | 279 | 282 | PROC | 0.9699 | | 7 | ácido fólico | 309 | 320 | CHEM | 0.90385 | | 8 | vitamina B6 | 324 | 334 | CHEM | 0.9748 | | 9 | pp | 337 | 338 | CHEM | 0.96 | | 10 | diálisis | 388 | 395 | PROC | 0.9982 | | 11 | función residual | 398 | 414 | PROC | 0.73045 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical_trials_abstracts_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|es| |Size:|318.8 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - RoBertaEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Italian BERT Base Cased author: John Snow Labs name: bert_base_italian_cased date: 2021-05-20 tags: [open_source, it, italian, embeddings, bert] task: Embeddings language: it edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The source data for the Italian BERT model consists of a recent Wikipedia dump and various texts from the [OPUS corpora](http://opus.nlpl.eu/) collection. The final training corpus has a size of 13GB and 2,050,057,573 tokens. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_base_italian_cased_it_3.1.0_2.4_1621508025859.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_base_italian_cased_it_3.1.0_2.4_1621508025859.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_base_italian_cased", "it") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_base_italian_cased", "it") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("it.embed.bert").predict("""Put your text here.""") ```
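Each token annotated by the embeddings stage carries a dense vector (768 dimensions for a BERT base model). A common downstream use is comparing such vectors with cosine similarity; the sketch below shows the computation on toy 4-dimensional vectors standing in for real embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for 768-dimensional BERT token embeddings
print(cosine_similarity([1.0, 2.0, 0.0, 1.0], [1.0, 2.0, 0.0, 1.0]))  # ≈ 1.0 (identical)
print(cosine_similarity([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]))  # ≈ 0.0 (orthogonal)
```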
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_base_italian_cased|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token, sentence]|
|Output Labels:|[embeddings]|
|Language:|it|
|Case sensitive:|true|

## Data Source

[https://huggingface.co/dbmdz/bert-base-italian-cased](https://huggingface.co/dbmdz/bert-base-italian-cased)

---
layout: model
title: Legal Rules and Regulations Clause Binary Classifier
author: John Snow Labs
name: legclf_rules_and_regulations_clause
date: 2022-12-07
tags: [en, legal, classification, clauses, rules_and_regulations, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `rules-and-regulations` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into account that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (see the tutorial linked above).
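One simple way to stay under the 512-token limit mentioned above is to pre-chunk long texts before classification. The sketch below counts whitespace tokens with a safety margin, since the model's subword tokenizer will usually produce more tokens than a plain split:

```python
def chunk_by_tokens(text, max_tokens=400):
    """Greedily split text into chunks of at most max_tokens whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

long_text = "clause " * 1000          # 1000 whitespace tokens
chunks = chunk_by_tokens(long_text)   # -> 3 chunks of 400, 400 and 200 tokens
```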
This model can be combined with any of the other hundreds of Legal Clause Classifiers available in Models Hub, producing as output a series of True/False values, one for each clause model you have added.

## Predicted Entities

`rules-and-regulations`, `other`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_rules_and_regulations_clause_en_1.0.0_3.0_1670444814223.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_rules_and_regulations_clause_en_1.0.0_3.0_1670444814223.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_rules_and_regulations_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-----------------------+
|result                 |
+-----------------------+
|[rules-and-regulations]|
|[other]                |
|[other]                |
|[rules-and-regulations]|
+-----------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_rules_and_regulations_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.9 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
label                  precision  recall  f1-score  support
other                  0.97       1.00    0.99      73
rules-and-regulations  1.00       0.93    0.96      28
accuracy               -          -       0.98      101
macro-avg              0.99       0.96    0.97      101
weighted-avg           0.98       0.98    0.98      101
```

---
layout: model
title: Extract relations between chemicals and proteins (ReDL)
author: John Snow Labs
name: redl_chemprot_biobert
date: 2023-01-14
tags: [relation_extraction, licensed, en, clinical, tensorflow]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Detect interactions between chemicals and proteins using a BERT model, by classifying whether a specified semantic relation holds between the chemical and protein entities within a sentence or document.
## Predicted Entities

`CPR:1`, `CPR:2`, `CPR:3`, `CPR:4`, `CPR:5`, `CPR:6`, `CPR:7`, `CPR:8`, `CPR:9`, `CPR:10`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CHEM_PROT/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_chemprot_biobert_en_4.2.4_3.0_1673714908415.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_chemprot_biobert_en_4.2.4_3.0_1673714908415.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use

The table below shows the `redl_chemprot_biobert` RE model, its labels, the optimal NER model to use with it, and the meaningful relation pairs.

| RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS |
|:---------------------:|:---------------------------------------------------------------------:|:---------------------:|---------------------------|
| redl_chemprot_biobert | CPR:1, CPR:2, CPR:3, CPR:4, CPR:5, CPR:6, CPR:7, CPR:8, CPR:9, CPR:10 | ner_chemprot_clinical | ["No need to set pairs."] |
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

pos_tagger = PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

words_embedder = WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

ner_tagger = MedicalNerModel.pretrained("ner_chemprot_clinical", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ner_tags")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

# Set a filter on pairs of named entities which will be treated as relation candidates
re_ner_chunk_filter = RENerChunksFilter()\
    .setInputCols(["ner_chunks", "dependencies"])\
    .setMaxSyntacticDistance(10)\
    .setOutputCol("re_ner_chunks")
    #.setRelationPairs(['SYMPTOM-EXTERNAL_BODY_PART_OR_REGION'])

# This model was trained on sentence-level data. It can also be trained on
# document-level relations; in that case, use "document" instead of "sentence"
# as input when predicting.
re_model = RelationExtractionDLModel()\ .pretrained('redl_chemprot_biobert', 'en', "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) text='''In this study, we examined the effects of mitiglinide on various cloned K(ATP) channels (Kir6.2/SUR1, Kir6.2/SUR2A, and Kir6.2/SUR2B) reconstituted in COS-1 cells, and compared them to another meglitinide-related compound, nateglinide. Patch-clamp analysis using inside-out recording configuration showed that mitiglinide inhibits the Kir6.2/SUR1 channel currents in a dose-dependent manner (IC50 value, 100 nM) but does not significantly inhibit either Kir6.2/SUR2A or Kir6.2/SUR2B channel currents even at high doses (more than 10 microM). Nateglinide inhibits Kir6.2/SUR1 and Kir6.2/SUR2B channels at 100 nM, and inhibits Kir6.2/SUR2A channels at high concentrations (1 microM). Binding experiments on mitiglinide, nateglinide, and repaglinide to SUR1 expressed in COS-1 cells revealed that they inhibit the binding of [3H]glibenclamide to SUR1 (IC50 values: mitiglinide, 280 nM; nateglinide, 8 microM; repaglinide, 1.6 microM), suggesting that they all share a glibenclamide binding site. The insulin responses to glucose, mitiglinide, tolbutamide, and glibenclamide in MIN6 cells after chronic mitiglinide, nateglinide, or repaglinide treatment were comparable to those after chronic tolbutamide and glibenclamide treatment. 
These results indicate that, similar to the sulfonylureas, mitiglinide is highly specific to the Kir6.2/SUR1 complex, i.e., the pancreatic beta-cell K(ATP) channel, and suggest that mitiglinide may be a clinically useful anti-diabetic drug.''' data = spark.createDataFrame([[text]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols(Array("sentences")) .setOutputCol("tokens") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_chemprot_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") // .setRelationPairs(Array("SYMPTOM-EXTERNAL_BODY_PART_OR_REGION")) // The dataset this model is trained to is sentence-wise. // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. 
val re_model = RelationExtractionDLModel() .pretrained("redl_chemprot_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("In this study, we examined the effects of mitiglinide on various cloned K(ATP) channels (Kir6.2/SUR1, Kir6.2/SUR2A, and Kir6.2/SUR2B) reconstituted in COS-1 cells, and compared them to another meglitinide-related compound, nateglinide. Patch-clamp analysis using inside-out recording configuration showed that mitiglinide inhibits the Kir6.2/SUR1 channel currents in a dose-dependent manner (IC50 value, 100 nM) but does not significantly inhibit either Kir6.2/SUR2A or Kir6.2/SUR2B channel currents even at high doses (more than 10 microM). Nateglinide inhibits Kir6.2/SUR1 and Kir6.2/SUR2B channels at 100 nM, and inhibits Kir6.2/SUR2A channels at high concentrations (1 microM). Binding experiments on mitiglinide, nateglinide, and repaglinide to SUR1 expressed in COS-1 cells revealed that they inhibit the binding of [3H]glibenclamide to SUR1 (IC50 values: mitiglinide, 280 nM; nateglinide, 8 microM; repaglinide, 1.6 microM), suggesting that they all share a glibenclamide binding site. The insulin responses to glucose, mitiglinide, tolbutamide, and glibenclamide in MIN6 cells after chronic mitiglinide, nateglinide, or repaglinide treatment were comparable to those after chronic tolbutamide and glibenclamide treatment. 
These results indicate that, similar to the sulfonylureas, mitiglinide is highly specific to the Kir6.2/SUR1 complex, i.e., the pancreatic beta-cell K(ATP) channel, and suggest that mitiglinide may be a clinically useful anti-diabetic drug.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.chemprot").predict("""In this study, we examined the effects of mitiglinide on various cloned K(ATP) channels (Kir6.2/SUR1, Kir6.2/SUR2A, and Kir6.2/SUR2B) reconstituted in COS-1 cells, and compared them to another meglitinide-related compound, nateglinide. Patch-clamp analysis using inside-out recording configuration showed that mitiglinide inhibits the Kir6.2/SUR1 channel currents in a dose-dependent manner (IC50 value, 100 nM) but does not significantly inhibit either Kir6.2/SUR2A or Kir6.2/SUR2B channel currents even at high doses (more than 10 microM). Nateglinide inhibits Kir6.2/SUR1 and Kir6.2/SUR2B channels at 100 nM, and inhibits Kir6.2/SUR2A channels at high concentrations (1 microM). Binding experiments on mitiglinide, nateglinide, and repaglinide to SUR1 expressed in COS-1 cells revealed that they inhibit the binding of [3H]glibenclamide to SUR1 (IC50 values: mitiglinide, 280 nM; nateglinide, 8 microM; repaglinide, 1.6 microM), suggesting that they all share a glibenclamide binding site. The insulin responses to glucose, mitiglinide, tolbutamide, and glibenclamide in MIN6 cells after chronic mitiglinide, nateglinide, or repaglinide treatment were comparable to those after chronic tolbutamide and glibenclamide treatment. These results indicate that, similar to the sulfonylureas, mitiglinide is highly specific to the Kir6.2/SUR1 complex, i.e., the pancreatic beta-cell K(ATP) channel, and suggest that mitiglinide may be a clinically useful anti-diabetic drug.""") ```
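Conceptually, the `RENerChunksFilter` stage above keeps only those entity-chunk pairs that are close enough in the dependency tree (`setMaxSyntacticDistance(10)`), so the relation classifier never scores pairs that are far apart. The plain-Python sketch below illustrates the idea, using token distance as a simplified stand-in for syntactic distance; it is not the annotator's actual logic:

```python
def candidate_pairs(chunks, max_distance=10):
    """chunks: list of (label, start_token, end_token) tuples, sorted by position.
    Keep only pairs whose token gap does not exceed max_distance."""
    pairs = []
    for i, (label1, _start1, end1) in enumerate(chunks):
        for label2, start2, _end2 in chunks[i + 1:]:
            if start2 - end1 <= max_distance:
                pairs.append((label1, label2))
    return pairs

chunks = [("CHEMICAL", 8, 8), ("GENE-N", 14, 15), ("GENE-Y", 40, 41)]
print(candidate_pairs(chunks))  # [('CHEMICAL', 'GENE-N')] — the GENE-Y chunk is too far away
```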
## Results ```bash | | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |---:|:-----------|:----------|----------------:|--------------:|:------------------|:----------|----------------:|--------------:|:--------------|-------------:| | 0 | CPR:2 | CHEMICAL | 43 | 53 | mitiglinide | GENE-N | 80 | 87 | channels | 0.998399 | | 1 | CPR:2 | GENE-N | 80 | 87 | channels | CHEMICAL | 224 | 234 | nateglinide | 0.994489 | | 2 | CPR:2 | CHEMICAL | 706 | 716 | mitiglinide | GENE-Y | 751 | 754 | SUR1 | 0.999304 | | 3 | CPR:2 | CHEMICAL | 823 | 839 | [3H]glibenclamide | GENE-Y | 844 | 847 | SUR1 | 0.998923 | | 4 | CPR:2 | GENE-N | 998 | 1004 | insulin | CHEMICAL | 1019 | 1025 | glucose | 0.979057 | | 5 | CPR:2 | GENE-N | 998 | 1004 | insulin | CHEMICAL | 1028 | 1038 | mitiglinide | 0.988504 | | 6 | CPR:2 | GENE-N | 998 | 1004 | insulin | CHEMICAL | 1041 | 1051 | tolbutamide | 0.991856 | | 7 | CPR:2 | GENE-N | 998 | 1004 | insulin | CHEMICAL | 1058 | 1070 | glibenclamide | 0.994092 | | 8 | CPR:2 | GENE-N | 998 | 1004 | insulin | CHEMICAL | 1100 | 1110 | mitiglinide | 0.994409 | | 9 | CPR:2 | CHEMICAL | 1290 | 1300 | mitiglinide | GENE-N | 1387 | 1393 | channel | 0.981534 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_chemprot_biobert| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|401.7 MB| ## References Trained on ChemProt benchmark dataset. ## Benchmarking ```bash label Recall Precision F1 Support CPR:1 0.870 0.908 0.888 215 CPR:10 0.818 0.762 0.789 258 CPR:2 0.726 0.806 0.764 1651 CPR:3 0.788 0.785 0.787 657 CPR:4 0.901 0.855 0.878 1599 CPR:5 0.799 0.891 0.842 184 CPR:6 0.888 0.845 0.866 258 CPR:7 0.520 0.765 0.619 25 CPR:8 0.083 0.333 0.133 24 CPR:9 0.930 0.805 0.863 629 Avg. 
0.732 0.775 0.743 - ``` --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_0 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-512-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_0_en_4.0.0_3.0_1654191589323.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_0_en_4.0.0_3.0_1654191589323.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_512d_seed_0").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
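Under the hood, extractive QA annotators such as `BertForQuestionAnswering` score every context token as a candidate answer start and as a candidate answer end, then return the highest-scoring valid span. A minimal, self-contained sketch of that span-selection step — the scores below are illustrative numbers, not real model outputs:

```python
# Toy start/end scores for the context tokens of
# "My name is Clara and I live in Berkeley."
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.2, 0.1, 3.5, 0.0, 0.1, 0.0, 0.1, 0.4, 0.0]
end_scores = [0.0, 0.1, 0.2, 3.1, 0.1, 0.0, 0.1, 0.0, 0.9, 0.0]

def best_span(starts, ends, max_len=8):
    """Pick (start, end) maximizing start+end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(starts):
        for e in range(s, min(s + max_len, len(ends))):
            if s_score + ends[e] > best_score:
                best_score, best = s_score + ends[e], (s, e)
    return best

s, e = best_span(start_scores, end_scores)
answer = " ".join(tokens[s:e + 1])  # "Clara"
```

The real annotator works on model logits and WordPiece tokens, but the span search is the same idea.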
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|387.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-512-finetuned-squad-seed-0 --- layout: model title: Adverse Drug Events Classifier author: John Snow Labs name: classifierml_ade date: 2023-05-11 tags: [text_classification, ade, clinical, licensed, en] task: Text Classification language: en edition: Healthcare NLP 4.4.1 spark_version: 3.0 supported: true annotator: DocumentMLClassifierModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained with the DocumentMLClassifierApproach annotator and classifies a text/sentence into one of two categories: `True`: the sentence describes a possible ADE; `False`: the sentence contains no information about an ADE. The corpus used for training is the ADE-Corpus-V2 Dataset: Adverse Drug Reaction Data, a dataset for classifying whether a sentence is ADE-related (True) or not (False). ## Predicted Entities `True`, `False` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifierml_ade_en_4.4.1_3.0_1683819947444.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifierml_ade_en_4.4.1_3.0_1683819947444.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") classifier_ml = DocumentMLClassifierModel.pretrained("classifierml_ade", "en", "clinical/models")\ .setInputCols("token")\ .setOutputCol("prediction") clf_Pipeline = Pipeline(stages=[ document_assembler, tokenizer, classifier_ml]) data = spark.createDataFrame([["""I feel great after taking tylenol."""], ["""Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient."""]]).toDF("text") result = clf_Pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val classifier_ml = DocumentMLClassifierModel.pretrained("classifierml_ade", "en", "clinical/models") .setInputCols("token") .setOutputCol("prediction") val clf_Pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, classifier_ml)) val data = Seq("I feel great after taking tylenol", "Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.").toDS.toDF("text") val result = clf_Pipeline.fit(data).transform(data) ```
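For reference, the macro and weighted averages reported in this card's Benchmarking section are plain aggregations of the per-label scores. A self-contained sketch using the card's rounded F1 values and supports (small differences from the table come from rounding of the per-label inputs):

```python
# Per-label F1 and support, as reported in the benchmark table (rounded).
f1 = {"False": 0.92, "True": 0.79}
support = {"False": 3359, "True": 1364}

# Macro average: unweighted mean over labels.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: mean weighted by each label's support.
total = sum(support.values())
weighted_f1 = sum(f1[label] * support[label] for label in f1) / total
# macro_f1 ~ 0.855, weighted_f1 ~ 0.882
```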
## Results ```bash +----------------------------------------------------------------------------------------+-------+ |text |result | +----------------------------------------------------------------------------------------+-------+ |I feel great after taking tylenol |[False]| |Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.|[True] | +----------------------------------------------------------------------------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierml_ade| |Compatibility:|Healthcare NLP 4.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[prediction]| |Language:|en| |Size:|2.7 MB| ## References The corpus used for model training is ADE-Corpus-V2 Dataset: Adverse Drug Reaction Data. This is a dataset for classification of a sentence if it is ADE-related (True) or not (False). Reference: Gurulingappa et al., Benchmark Corpus to Support Information Extraction for Adverse Drug Effects, JBI, 2012. http://www.sciencedirect.com/science/article/pii/S1532046412000615 ## Benchmarking ```bash label precision recall f1-score support False 0.90 0.94 0.92 3359 True 0.85 0.75 0.79 1364 accuracy - - 0.89 4723 macro avg 0.87 0.85 0.86 4723 weighted avg 0.89 0.89 0.89 4723 ``` --- layout: model title: Translate Maltese to English Pipeline author: John Snow Labs name: translate_mt_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, mt, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. 
Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `mt` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_mt_en_xx_2.7.0_2.4_1609689642064.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_mt_en_xx_2.7.0_2.4_1609689642064.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_mt_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_mt_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.mt.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_mt_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Detect Relations Between Genes and Phenotypes author: John Snow Labs name: re_human_phenotype_gene_clinical_pipeline date: 2023-06-13 tags: [licensed, clinical, re, genes, phenotypes, en] task: Relation Extraction language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [re_human_phenotype_gene_clinical](https://nlp.johnsnowlabs.com/2020/09/30/re_human_phenotype_gene_clinical_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_human_phenotype_gene_clinical_pipeline_en_4.4.4_3.2_1686664681373.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_human_phenotype_gene_clinical_pipeline_en_4.4.4_3.2_1686664681373.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("re_human_phenotype_gene_clinical_pipeline", "en", "clinical/models") pipeline.annotate("Bilateral colobomatous microphthalmia and developmental delay in whom genetic studies identified a homozygous TENM3") ``` ```scala val pipeline = new PretrainedPipeline("re_human_phenotype_gene_clinical_pipeline", "en", "clinical/models") pipeline.annotate("Bilateral colobomatous microphthalmia and developmental delay in whom genetic studies identified a homozygous TENM3") ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.human_gene_clinical.pipeline").predict("""Bilateral colobomatous microphthalmia and developmental delay in whom genetic studies identified a homozygous TENM3""") ```
## Results ```bash +----+------------+-----------+-----------------+---------------+---------------------+-----------+-----------------+---------------+---------------------+--------------+ | | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | +====+============+===========+=================+===============+=====================+===========+=================+===============+=====================+==============+ | 0 | 1 | HP | 23 | 36 | microphthalmia | HP | 42 | 60 | developmental delay | 0.999954 | +----+------------+-----------+-----------------+---------------+---------------------+-----------+-----------------+---------------+---------------------+--------------+ | 1 | 1 | HP | 23 | 36 | microphthalmia | GENE | 110 | 114 | TENM3 | 0.999999 | +----+------------+-----------+-----------------+---------------+---------------------+-----------+-----------------+---------------+---------------------+--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_human_phenotype_gene_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - PerceptronModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - DependencyParserModel - RelationExtractionModel --- layout: model title: Translate English to Austro-Asiatic languages Pipeline author: John Snow Labs name: translate_en_aav date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, aav, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with
minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `aav` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_aav_xx_2.7.0_2.4_1609691091653.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_aav_xx_2.7.0_2.4_1609691091653.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_aav", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_aav", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.aav').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_aav| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Medical Spell Checker author: John Snow Labs name: spellcheck_clinical date: 2022-04-11 tags: [spellcheck, medical, medical_spell_checker, spell_checker, spelling_corrector, en, licensed, clinical] task: Spell Check language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: SpellCheckModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Contextual Spell Checker is a sequence-to-sequence model that detects and corrects spelling errors in your medical input text. It is based on a Levenshtein automaton for generating candidate corrections and a neural language model for ranking those candidates. This model has been trained on a dataset containing data from several sources: MTSamples, i2b2 clinical notes, and several specific medical corpora. You can download the model fully pretrained and ready to use. However, you can still customize it further without retraining a new model from scratch. This can be accomplished by providing custom definitions for the word classes the model has been trained on, namely Dates, Numbers, Ages, Units, and Medications.
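The candidate-generation step described above can be illustrated with plain edit distance: vocabulary words within a small Levenshtein distance of a misspelling become correction candidates, which the language model then ranks in context. A toy, self-contained sketch — a standard dynamic-programming distance over a made-up vocabulary, not the model's actual automaton:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# Made-up vocabulary; the real checker draws on its full training vocabulary.
vocab = ["surgical", "curative", "physical", "therapy", "clinical"]

def candidates(word, vocab, max_dist=2):
    """Vocabulary words within max_dist edits, closest first."""
    scored = sorted((levenshtein(word, v), v) for v in vocab)
    return [v for d, v in scored if d <= max_dist]

candidates("curgical", vocab)  # ["surgical"] — one substitution away
```

The actual model then scores each surviving candidate in sentence context with its language model, which is how "curgical" becomes "surgical" in the results below.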
## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/6.Clinical_Context_Spell_Checker.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_en_3.4.1_3.0_1649672133997.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/spellcheck_clinical_en_3.4.1_3.0_1649672133997.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token")\ .setContextChars(["*", "-", "“", "(", "[", "\n", ".","\"", "”", ",", "?", ")", "]", "!", ";", ":", "'s", "’s"]) spellModel = ContextSpellCheckerModel\ .pretrained('spellcheck_clinical', 'en', 'clinical/models')\ .setInputCols("token")\ .setOutputCol("checked") pipeline = Pipeline(stages = [ documentAssembler, tokenizer, spellModel]) light_pipeline = LightPipeline(pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) example = ["Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.", "With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.", "Abdomen is sort, nontender, and nonintended.", "Patient not showing pain or any wealth problems.", "No cute distress"] result = light_pipeline.annotate(example) ``` ```scala val assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") .setContextChars(Array("*", "-", "“", "(", "[", "\n", ".","\"", "”", ",", "?", ")", "]", "!", ";", ":", "'s", "’s")) val spellChecker = ContextSpellCheckerModel.pretrained("spellcheck_clinical", "en", "clinical/models") .setInputCols("token") .setOutputCol("checked") val pipeline = new Pipeline().setStages(Array( assembler, tokenizer, spellChecker)) val light_pipeline = new LightPipeline(pipeline.fit(Seq("").toDS.toDF("text"))) val text = Array("Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.", "With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.", "Abdomen is sort, nontender, and 
nonintended.", "Patient not showing pain or any wealth problems.", "No cute distress") val result = light_pipeline.annotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.spell.clinical").predict("""Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.""") ```
## Results ```bash [{'checked': ['With','the','cell','of','physical','therapy','the','patient','was','ambulated','and','on','postoperative',',','the','patient','tolerating','a','post','surgical','soft','diet','.'], 'document': ['Witth the hell of phisical terapy the patient was imbulated and on postoperative, the impatient tolerating a post curgical soft diet.'], 'token': ['Witth','the','hell','of','phisical','terapy','the','patient','was','imbulated','and','on','postoperative',',','the','impatient','tolerating','a','post','curgical','soft','diet','.']}, {'checked': ['With','pain','well','controlled','on','oral','pain','medications',',','she','was','discharged','to','rehabilitation','facility','.'], 'document': ['With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.'], 'token': ['With','paint','wel','controlled','on','orall','pain','medications',',','she','was','discharged','too','reihabilitation','facilitay','.']}, {'checked': ['Abdomen','is','soft',',','nontender',',','and','nondistended','.'], 'document': ['Abdomen is sort, nontender, and nonintended.'], 'token': ['Abdomen','is','sort',',','nontender',',','and','nonintended','.']}, {'checked': ['Patient','not','showing','pain','or','any','health','problems','.'], 'document': ['Patient not showing pain or any wealth problems.'], 'token': ['Patient','not','showing','pain','or','any','wealth','problems','.']}, {'checked': ['No', 'acute', 'distress'], 'document': ['No cute distress'], 'token': ['No', 'cute', 'distress']}] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|spellcheck_clinical| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[corrected]| |Language:|en| |Size:|141.2 MB| ## References MTSamples, i2b2 clinical notes, and several specific medical corpora. 
--- layout: model title: Sentiment Analysis of Tweets (sentimentdl_use_twitter) author: John Snow Labs name: sentimentdl_use_twitter date: 2021-01-18 task: Sentiment Analysis language: en nav_key: models edition: Spark NLP 2.7.1 spark_version: 2.4 tags: [en, sentiment, open_source] supported: true annotator: SentimentDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Classify sentiment in tweets as `negative` or `positive` using `Universal Sentence Encoder` embeddings. ## Predicted Entities `positive`, `negative` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentimentdl_use_twitter_en_2.7.1_2.4_1610983524713.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentimentdl_use_twitter_en_2.7.1_2.4_1610983524713.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") use = UniversalSentenceEncoder.pretrained('tfhub_use', lang="en") \ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") classifier = SentimentDLModel.pretrained('sentimentdl_use_twitter')\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("sentiment") nlp_pipeline = Pipeline(stages=[document_assembler, use, classifier ]) l_model = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = l_model.fullAnnotate(["im meeting up with one of my besties tonight! Cant wait!! - GIRL TALK!!", "is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!"]) ``` {:.nlu-block} ```python import nlu nlu.load("en.sentiment.twitter.dl").predict("""is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!""") ```
## Results ```bash | | document | sentiment | |---:|:---------------------------------------------------------------------------------------------------------------- |:------------| | 0 | im meeting up with one of my besties tonight! Cant wait!! - GIRL TALK!! | positive | | 1 | is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah! | negative | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentimentdl_use_twitter| |Compatibility:|Spark NLP 2.7.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[sentiment]| |Language:|en| |Dependencies:|tfhub_use| ## Data Source Trained on the Sentiment140 dataset comprising 1.6M tweets. https://www.kaggle.com/kazanova/sentiment140 ## Benchmarking ```bash loss: 7930.071 - acc: 0.80694044 - val_acc: 80.00508 - batches: 16000 ``` --- layout: model title: Smaller BERT Sentence Embeddings (L-8_H-768_A-12) author: John Snow Labs name: sent_small_bert_L8_768 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
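Knowledge distillation, as mentioned above, trains the compact student on the teacher's temperature-softened output distribution rather than on hard labels. A minimal pure-Python sketch of the soft-target cross-entropy — the logits here are illustrative numbers, not outputs of either model:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, T=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    p = softmax(teacher_logits, T)  # soft targets from the teacher
    q = softmax(student_logits, T)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.2]  # illustrative teacher logits
student = [3.0, 1.5, 0.5]  # illustrative student logits
loss = distill_loss(teacher, student)
# The loss is minimized when the student's distribution matches the teacher's.
```

In practice this soft-target term is usually combined with the ordinary hard-label loss on the fine-tuning data.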
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L8_768_en_2.6.0_2.4_1598351300711.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L8_768_en_2.6.0_2.4_1598351300711.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L8_768", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L8_768", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L8_768').predict(text, output_level='sentence') embeddings_df ```
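Sentence embeddings such as the 768-dimensional vectors this model produces are typically compared with cosine similarity. A small self-contained sketch, using toy 4-dimensional vectors as stand-ins for the real 768-dimensional outputs:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-d stand-ins for the real 768-d sent_small_bert_L8_768 vectors.
emb_a = [-0.34, -0.27, 0.11, 0.52]
emb_b = [-0.31, -0.25, 0.09, 0.48]  # stand-in for a semantically similar sentence
emb_c = [0.40, 0.02, -0.50, -0.10]  # stand-in for an unrelated sentence

similar = cosine(emb_a, emb_b)
unrelated = cosine(emb_a, emb_c)
# Closer meanings yield higher cosine similarity: similar > unrelated.
```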
{:.h2_title} ## Results ```bash sentence en_embed_sentence_small_bert_L8_768_embeddings I hate cancer [-0.34135347604751587, -0.27485668659210205, -... Antibiotics aren't painkiller [-0.09529726952314377, -0.12627077102661133, -... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L8_768| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-768_A-12/1 --- layout: model title: Dutch Part of Speech Tagger (from GroNLP) author: John Snow Labs name: bert_pos_bert_base_dutch_cased_upos_alpino date: 2022-05-09 tags: [bert, pos, part_of_speech, nl, open_source] task: Part of Speech Tagging language: nl edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-dutch-cased-upos-alpino` is a Dutch model originally trained by `GroNLP`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_dutch_cased_upos_alpino_nl_3.4.2_3.0_1652092515813.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_dutch_cased_upos_alpino_nl_3.4.2_3.0_1652092515813.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_dutch_cased_upos_alpino","nl") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ik hou van Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_dutch_cased_upos_alpino","nl") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ik hou van Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_dutch_cased_upos_alpino| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|nl| |Size:|407.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/GroNLP/bert-base-dutch-cased-upos-alpino - https://arxiv.org/abs/2105.02855 - https://github.com/wietsedv/low-resource-adapt - https://github.com/wietsedv/bertje --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_4 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-512-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_4_en_4.0.0_3.0_1657185218309.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_4_en_4.0.0_3.0_1657185218309.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(False) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
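Each element of the `answer` output column is a list of Spark NLP annotations whose `result` field holds the extracted answer span. As a minimal, pure-Python sketch (the field names follow the generic Spark NLP annotation schema, and the sample value is illustrative, not actual model output), collected rows can be flattened like this:

```python
def answers(rows):
    # Each row's "answer" column is a list of annotation dicts; take each result string.
    return [a["result"] for row in rows for a in row["answer"]]

# Illustrative collected output for the example question above
sample_rows = [{"answer": [{"result": "Clara"}]}]
print(answers(sample_rows))  # ['Clara']
```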
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_512_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-512-finetuned-squad-seed-4 --- layout: model title: English BertForQuestionAnswering model (from jimypbr) author: John Snow Labs name: bert_qa_jimypbr_bert_base_uncased_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squad` is an English model originally trained by `jimypbr`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_jimypbr_bert_base_uncased_squad_en_4.0.0_3.0_1654181306560.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_jimypbr_bert_base_uncased_squad_en_4.0.0_3.0_1654181306560.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_jimypbr_bert_base_uncased_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(False) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_jimypbr_bert_base_uncased_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased.by_jimypbr").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_jimypbr_bert_base_uncased_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|259.1 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/jimypbr/bert-base-uncased-squad --- layout: model title: Multilingual T5ForConditionalGeneration Cased model (from VietAI) author: John Snow Labs name: t5_envit5_translation date: 2023-01-30 tags: [vi, en, open_source, t5, xx, tensorflow] task: Text Generation language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `envit5-translation` is a Multilingual model originally trained by `VietAI`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_envit5_translation_xx_4.3.0_3.0_1675101501844.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_envit5_translation_xx_4.3.0_3.0_1675101501844.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_envit5_translation","xx") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_envit5_translation","xx") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_envit5_translation| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|xx| |Size:|599.3 MB| ## References - https://huggingface.co/VietAI/envit5-translation - https://paperswithcode.com/sota/machine-translation-on-iwslt2015-english-1?p=mtet-multi-domain-translation-for-english - https://paperswithcode.com/sota/on-phomt?p=mtet-multi-domain-translation-for-english-and - https://research.vietai.org/mtet/ - https://github.com/VinAIResearch/PhoMT - https://user-images.githubusercontent.com/44376091/195998681-5860e443-2071-4048-8a2b-873dcee14a72.png --- layout: model title: Extract Access to Healthcare Entities from Social Determinants of Health Texts author: John Snow Labs name: ner_sdoh_access_to_healthcare_wip date: 2023-02-24 tags: [licensed, clinical, en, social_determinants, ner, sdoh, public_health, access, healthcare, access_to_healthcare] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.3.1 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts access to healthcare information related to Social Determinants of Health from various kinds of biomedical documents. ## Predicted Entities `Insurance_Status`, `Healthcare_Institution`, `Access_To_Care` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_access_to_healthcare_wip_en_4.3.1_3.0_1677202491556.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_access_to_healthcare_wip_en_4.3.1_3.0_1677202491556.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from pyspark.sql.types import StringType document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_sdoh_access_to_healthcare_wip", "en", "clinical/models")\ .setInputCols(["sentence", "token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter ]) sample_texts = ["She has a pension and private health insurance, she reports feeling lonely and isolated.", "He also reported food insecurityduring his childhood and lack of access to adequate healthcare.", "She used to work as a unit clerk at XYZ Medical Center."] data = spark.createDataFrame(sample_texts, StringType()).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_sdoh_access_to_healthcare_wip", "en", "clinical/models") .setInputCols(Array("sentence", 
"token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter )) val data = Seq("She has a pension and private health insurance, she reports feeling lonely and isolated.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash +-----------------------------+-----+---+----------------------+ |chunk |begin|end|ner_label | +-----------------------------+-----+---+----------------------+ |private health insurance |22 |45 |Insurance_Status | |access to adequate healthcare|65 |93 |Access_To_Care | |XYZ Medical Center |36 |53 |Healthcare_Institution| +-----------------------------+-----+---+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_sdoh_access_to_healthcare_wip| |Compatibility:|Healthcare NLP 4.3.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.0 MB| ## Benchmarking ```bash label tp fp fn total precision recall f1 Healthcare_Institution 94.0 8.0 5.0 99.0 0.921569 0.949495 0.935323 Access_To_Care 561.0 23.0 38.0 599.0 0.960616 0.936561 0.948436 Insurance_Status 60.0 5.0 3.0 63.0 0.923077 0.952381 0.937500 ``` --- layout: model title: SpanBert Coreference Resolution author: John Snow Labs name: spanbert_base_coref date: 2022-06-14 tags: [en, open_source] task: Dependency Parser language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: SpanBertCorefModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A coreference resolution model identifies expressions which refer to the same entity in a text. For example, given a sentence "John told Mary he would like to borrow a book from her." the model will link "he" to "John" and "her" to "Mary". This model is based on SpanBert, which is fine-tuned on the OntoNotes 5.0 data set. 
## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spanbert_base_coref_en_4.0.0_3.0_1655203982784.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/spanbert_base_coref_en_4.0.0_3.0_1655203982784.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python data = spark.createDataFrame([["John told Mary he would like to borrow a book from her."]]).toDF("text") document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document") sentence_detector = SentenceDetector().setInputCols(["document"]).setOutputCol("sentences") tokenizer = Tokenizer().setInputCols(["sentences"]).setOutputCol("tokens") corefResolution = SpanBertCorefModel.pretrained("spanbert_base_coref").setInputCols(["sentences", "tokens"]).setOutputCol("corefs") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, corefResolution]) model = pipeline.fit(data) model.transform(data).selectExpr("explode(corefs) AS coref").selectExpr("coref.result as token", "coref.metadata").show(truncate=False) ``` ```scala val data = Seq("John told Mary he would like to borrow a book from her.").toDF("text") val document = new DocumentAssembler().setInputCol("text").setOutputCol("document") val sentencer = new SentenceDetector().setInputCols(Array("document")).setOutputCol("sentences") val tokenizer = new Tokenizer().setInputCols(Array("sentences")).setOutputCol("tokens") val corefResolution = SpanBertCorefModel.pretrained("spanbert_base_coref").setInputCols(Array("sentences", "tokens")).setOutputCol("corefs") val pipeline = new Pipeline().setStages(Array(document, sentencer, tokenizer, corefResolution)) val result = pipeline.fit(data).transform(data) result.selectExpr("explode(corefs) as coref").selectExpr("coref.result as token", "coref.metadata").show(truncate = false) ``` {:.nlu-block} ```python import nlu nlu.load("en.coreference.spanbert").predict("""John told Mary he would like to borrow a book from her.""") ```
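Each `corefs` annotation links a mention to its antecedent through the `head` entries of its metadata: antecedents carry `head -> ROOT`, and pronouns carry the antecedent's token. A minimal, pure-Python sketch of grouping collected `(token, metadata)` rows into coreference clusters (the row shape mirrors the two columns selected in the example above; the sample data is trimmed to the fields the grouping needs):

```python
def coref_clusters(rows):
    """Group (token, metadata) rows into clusters keyed by the antecedent token.

    Rows whose metadata head is "ROOT" open a cluster; other rows attach
    to the cluster of their head token.
    """
    clusters = {}
    for token, meta in rows:
        if meta.get("head") == "ROOT":
            clusters.setdefault(token, [token])
        else:
            clusters.setdefault(meta["head"], [meta["head"]]).append(token)
    return clusters

# Rows matching the example sentence "John told Mary he would like to borrow a book from her."
rows = [
    ("John", {"head": "ROOT"}),
    ("he",   {"head": "John"}),
    ("Mary", {"head": "ROOT"}),
    ("her",  {"head": "Mary"}),
]
print(coref_clusters(rows))  # {'John': ['John', 'he'], 'Mary': ['Mary', 'her']}
```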
## Results ```bash +-----+------------------------------------------------------------------------------------+ |token|metadata | +-----+------------------------------------------------------------------------------------+ |John |{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0}| |he |{head.sentence -> 0, head -> John, head.begin -> 0, head.end -> 3, sentence -> 0} | |Mary |{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0}| |her |{head.sentence -> 0, head -> Mary, head.begin -> 10, head.end -> 13, sentence -> 0} | +-----+------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|spanbert_base_coref| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentences, tokens]| |Output Labels:|[corefs]| |Language:|en| |Size:|566.3 MB| |Case sensitive:|true| ## References OntoNotes 5.0 ## Benchmarking ```bash label score f1 77.7 ``` https://github.com/mandarjoshi90/coref --- layout: model title: Pipeline to Detect Drugs and Proteins author: John Snow Labs name: ner_drugprot_clinical_pipeline date: 2023-03-14 tags: [ner, clinical, drugprot, en, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_drugprot_clinical](https://nlp.johnsnowlabs.com/2021/12/20/ner_drugprot_clinical_en.html) model. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_drugprot_clinical_pipeline_en_4.3.0_3.2_1678777770925.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_drugprot_clinical_pipeline_en_4.3.0_3.2_1678777770925.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_drugprot_clinical_pipeline", "en", "clinical/models") text = '''Anabolic effects of clenbuterol on skeletal muscle are mediated by beta 2-adrenoceptor activation.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_drugprot_clinical_pipeline", "en", "clinical/models") val text = "Anabolic effects of clenbuterol on skeletal muscle are mediated by beta 2-adrenoceptor activation." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical_drugprot.pipeline").predict("""Anabolic effects of clenbuterol on skeletal muscle are mediated by beta 2-adrenoceptor activation.""") ```
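`fullAnnotate` returns, per input text, the annotations of every output column; the `ner_chunk` entries can be flattened into rows like those in the Results section. A minimal, pure-Python sketch (field names follow the generic Spark NLP annotation schema; the sample values mirror the example sentence above):

```python
def chunk_rows(annotations):
    # Flatten ner_chunk annotation dicts into (chunk, begin, end, label) tuples.
    return [
        (a["result"], a["begin"], a["end"], a["metadata"]["entity"])
        for a in annotations
    ]

# Illustrative ner_chunk annotations for the clenbuterol example sentence
sample = [
    {"result": "clenbuterol", "begin": 20, "end": 30, "metadata": {"entity": "CHEMICAL"}},
    {"result": "beta 2-adrenoceptor", "begin": 67, "end": 85, "metadata": {"entity": "GENE"}},
]
print(chunk_rows(sample))
```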
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:--------------------|--------:|------:|:------------|-------------:| | 0 | clenbuterol | 20 | 30 | CHEMICAL | 0.9691 | | 1 | beta 2-adrenoceptor | 67 | 85 | GENE | 0.89855 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_drugprot_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Extract relations between chemicals and proteins (ReDL) author: John Snow Labs name: redl_chemprot_biobert date: 2021-07-24 tags: [relation_extraction, licensed, en, clinical] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.0.3 spark_version: 2.4 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect interactions between chemicals and proteins using BERT model by classifying whether a specified semantic relation holds between the chemical and protein entities within a sentence or document. 
## Predicted Entities `CPR:1`, `CPR:2`, `CPR:3`, `CPR:4`, `CPR:5`, `CPR:6`, `CPR:7`, `CPR:8`, `CPR:9`, `CPR:10` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CHEM_PROT){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_chemprot_biobert_en_3.0.3_2.4_1627111978465.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_chemprot_biobert_en_3.0.3_2.4_1627111978465.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use In the table below, `redl_chemprot_biobert` RE model, its labels, optimal NER model, and meaningful relation pairs are illustrated. | RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS | |:---------------------:|:---------------------------------------------------------------------:|:---------------------:|---------------------------| | redl_chemprot_biobert | CPR:1, CPR:2, CPR:3, CPR:4, CPR:5, CPR:6, CPR:7, CPR:8, CPR:9, CPR:10 | ner_chemprot_clinical | [“No need to set pairs.”] |
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = sparknlp.annotators.Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") words_embedder = WordEmbeddingsModel() \ .pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel.pretrained("ner_chemprot_clinical", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverter() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel() \ .pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") # Set a filter on pairs of named entities which will be treated as relation candidates re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks") #.setRelationPairs(['SYMPTOM-EXTERNAL_BODY_PART_OR_REGION']) # The dataset this model is trained to is sentence-wise. # This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. 
re_model = RelationExtractionDLModel()\ .pretrained('redl_chemprot_biobert', 'en', "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) text='''In this study, we examined the effects of mitiglinide on various cloned K(ATP) channels (Kir6.2/SUR1, Kir6.2/SUR2A, and Kir6.2/SUR2B) reconstituted in COS-1 cells, and compared them to another meglitinide-related compound, nateglinide. Patch-clamp analysis using inside-out recording configuration showed that mitiglinide inhibits the Kir6.2/SUR1 channel currents in a dose-dependent manner (IC50 value, 100 nM) but does not significantly inhibit either Kir6.2/SUR2A or Kir6.2/SUR2B channel currents even at high doses (more than 10 microM). Nateglinide inhibits Kir6.2/SUR1 and Kir6.2/SUR2B channels at 100 nM, and inhibits Kir6.2/SUR2A channels at high concentrations (1 microM). Binding experiments on mitiglinide, nateglinide, and repaglinide to SUR1 expressed in COS-1 cells revealed that they inhibit the binding of [3H]glibenclamide to SUR1 (IC50 values: mitiglinide, 280 nM; nateglinide, 8 microM; repaglinide, 1.6 microM), suggesting that they all share a glibenclamide binding site. The insulin responses to glucose, mitiglinide, tolbutamide, and glibenclamide in MIN6 cells after chronic mitiglinide, nateglinide, or repaglinide treatment were comparable to those after chronic tolbutamide and glibenclamide treatment. These results indicate that, similar to the sulfonylureas, mitiglinide is highly specific to the Kir6.2/SUR1 complex, i.e., the pancreatic beta-cell K(ATP) channel, and suggest that mitiglinide may be a clinically useful anti-diabetic drug.''' data = spark.createDataFrame([[text]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala ... 
val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols(Array("sentences")) .setOutputCol("tokens") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_chemprot_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") // .setRelationPairs(Array("SYMPTOM-EXTERNAL_BODY_PART_OR_REGION")) // The dataset this model is trained to is sentence-wise. // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. 
val re_model = RelationExtractionDLModel() .pretrained("redl_chemprot_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("In this study, we examined the effects of mitiglinide on various cloned K(ATP) channels (Kir6.2/SUR1, Kir6.2/SUR2A, and Kir6.2/SUR2B) reconstituted in COS-1 cells, and compared them to another meglitinide-related compound, nateglinide. Patch-clamp analysis using inside-out recording configuration showed that mitiglinide inhibits the Kir6.2/SUR1 channel currents in a dose-dependent manner (IC50 value, 100 nM) but does not significantly inhibit either Kir6.2/SUR2A or Kir6.2/SUR2B channel currents even at high doses (more than 10 microM). Nateglinide inhibits Kir6.2/SUR1 and Kir6.2/SUR2B channels at 100 nM, and inhibits Kir6.2/SUR2A channels at high concentrations (1 microM). Binding experiments on mitiglinide, nateglinide, and repaglinide to SUR1 expressed in COS-1 cells revealed that they inhibit the binding of [3H]glibenclamide to SUR1 (IC50 values: mitiglinide, 280 nM; nateglinide, 8 microM; repaglinide, 1.6 microM), suggesting that they all share a glibenclamide binding site. The insulin responses to glucose, mitiglinide, tolbutamide, and glibenclamide in MIN6 cells after chronic mitiglinide, nateglinide, or repaglinide treatment were comparable to those after chronic tolbutamide and glibenclamide treatment. 
These results indicate that, similar to the sulfonylureas, mitiglinide is highly specific to the Kir6.2/SUR1 complex, i.e., the pancreatic beta-cell K(ATP) channel, and suggest that mitiglinide may be a clinically useful anti-diabetic drug.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.chemprot").predict("""In this study, we examined the effects of mitiglinide on various cloned K(ATP) channels (Kir6.2/SUR1, Kir6.2/SUR2A, and Kir6.2/SUR2B) reconstituted in COS-1 cells, and compared them to another meglitinide-related compound, nateglinide. Patch-clamp analysis using inside-out recording configuration showed that mitiglinide inhibits the Kir6.2/SUR1 channel currents in a dose-dependent manner (IC50 value, 100 nM) but does not significantly inhibit either Kir6.2/SUR2A or Kir6.2/SUR2B channel currents even at high doses (more than 10 microM). Nateglinide inhibits Kir6.2/SUR1 and Kir6.2/SUR2B channels at 100 nM, and inhibits Kir6.2/SUR2A channels at high concentrations (1 microM). Binding experiments on mitiglinide, nateglinide, and repaglinide to SUR1 expressed in COS-1 cells revealed that they inhibit the binding of [3H]glibenclamide to SUR1 (IC50 values: mitiglinide, 280 nM; nateglinide, 8 microM; repaglinide, 1.6 microM), suggesting that they all share a glibenclamide binding site. The insulin responses to glucose, mitiglinide, tolbutamide, and glibenclamide in MIN6 cells after chronic mitiglinide, nateglinide, or repaglinide treatment were comparable to those after chronic tolbutamide and glibenclamide treatment. These results indicate that, similar to the sulfonylureas, mitiglinide is highly specific to the Kir6.2/SUR1 complex, i.e., the pancreatic beta-cell K(ATP) channel, and suggest that mitiglinide may be a clinically useful anti-diabetic drug.""") ```
## Results ```bash | | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |---:|:-----------|:----------|----------------:|--------------:|:------------------|:----------|----------------:|--------------:|:--------------|-------------:| | 0 | CPR:2 | CHEMICAL | 43 | 53 | mitiglinide | GENE-N | 80 | 87 | channels | 0.998399 | | 1 | CPR:2 | GENE-N | 80 | 87 | channels | CHEMICAL | 224 | 234 | nateglinide | 0.994489 | | 2 | CPR:2 | CHEMICAL | 706 | 716 | mitiglinide | GENE-Y | 751 | 754 | SUR1 | 0.999304 | | 3 | CPR:2 | CHEMICAL | 823 | 839 | [3H]glibenclamide | GENE-Y | 844 | 847 | SUR1 | 0.998923 | | 4 | CPR:2 | GENE-N | 998 | 1004 | insulin | CHEMICAL | 1019 | 1025 | glucose | 0.979057 | | 5 | CPR:2 | GENE-N | 998 | 1004 | insulin | CHEMICAL | 1028 | 1038 | mitiglinide | 0.988504 | | 6 | CPR:2 | GENE-N | 998 | 1004 | insulin | CHEMICAL | 1041 | 1051 | tolbutamide | 0.991856 | | 7 | CPR:2 | GENE-N | 998 | 1004 | insulin | CHEMICAL | 1058 | 1070 | glibenclamide | 0.994092 | | 8 | CPR:2 | GENE-N | 998 | 1004 | insulin | CHEMICAL | 1100 | 1110 | mitiglinide | 0.994409 | | 9 | CPR:2 | CHEMICAL | 1290 | 1300 | mitiglinide | GENE-N | 1387 | 1393 | channel | 0.981534 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_chemprot_biobert| |Compatibility:|Healthcare NLP 3.0.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Case sensitive:|true| ## Data Source Trained on ChemProt benchmark dataset. ## Benchmarking ```bash Relation Recall Precision F1 Support CPR:1 0.870 0.908 0.888 215 CPR:10 0.818 0.762 0.789 258 CPR:2 0.726 0.806 0.764 1651 CPR:3 0.788 0.785 0.787 657 CPR:4 0.901 0.855 0.878 1599 CPR:5 0.799 0.891 0.842 184 CPR:6 0.888 0.845 0.866 258 CPR:7 0.520 0.765 0.619 25 CPR:8 0.083 0.333 0.133 24 CPR:9 0.930 0.805 0.863 629 Avg. 
0.732 0.775 0.743 - ``` --- layout: model title: English T5ForConditionalGeneration Mini Cased model (from google) author: John Snow Labs name: t5_efficient_mini_nl24 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-mini-nl24` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_mini_nl24_en_4.3.0_3.0_1675118296258.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_mini_nl24_en_4.3.0_3.0_1675118296258.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_mini_nl24","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_mini_nl24","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_mini_nl24| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|265.3 MB| ## References - https://huggingface.co/google/t5-efficient-mini-nl24 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English RobertaForQuestionAnswering (from Andranik) author: John Snow Labs name: roberta_qa_TestQaV1 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `TestQaV1` is an English model originally trained by `Andranik`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_TestQaV1_en_4.0.0_3.0_1655727463217.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_TestQaV1_en_4.0.0_3.0_1655727463217.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_TestQaV1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_TestQaV1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.by_Andranik").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_TestQaV1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Andranik/TestQaV1 --- layout: model title: English asr_wav2vec2_large_in_lm TFWav2Vec2ForCTC from crossdelenna author: John Snow Labs name: asr_wav2vec2_large_in_lm date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_in_lm` is an English model originally trained by crossdelenna. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_in_lm_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_in_lm_en_4.2.0_3.0_1664103206886.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_in_lm_en_4.2.0_3.0_1664103206886.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_in_lm", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_in_lm", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
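The snippets above assume an `audioDf` with an `audio_content` column holding the raw audio as an array of floats. As a minimal sketch (our assumption: 16-bit PCM mono input, which is what wav2vec2-style models typically expect at 16 kHz), a WAV file can be decoded into that form with the standard library alone:

```python
import wave
import struct

def read_wav_as_floats(path):
    """Decode a 16-bit PCM mono WAV file into floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expects 16-bit PCM"
        assert wf.getnchannels() == 1, "expects mono audio"
        frames = wf.readframes(wf.getnframes())
    # little-endian signed 16-bit samples, normalized to [-1, 1]
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# The resulting list can back the "audio_content" column, e.g. (illustrative):
# audioDf = spark.createDataFrame([(read_wav_as_floats("speech.wav"),)], ["audio_content"])
```

Resampling to the model's expected sample rate, if needed, is left to an audio library.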
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_in_lm| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: English DistilBertForQuestionAnswering model (from rahulkuruvilla) C Version author: John Snow Labs name: distilbert_qa_COVID_DistilBERTc date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `COVID-DistilBERTc` is an English model originally trained by `rahulkuruvilla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_COVID_DistilBERTc_en_4.0.0_3.0_1654722938609.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_COVID_DistilBERTc_en_4.0.0_3.0_1654722938609.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_COVID_DistilBERTc","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_COVID_DistilBERTc","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.covid.distil_bert.c.by_rahulkuruvilla").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_COVID_DistilBERTc| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/rahulkuruvilla/COVID-DistilBERTc --- layout: model title: English BertForQuestionAnswering model (from rahulkuruvilla) author: John Snow Labs name: bert_qa_COVID_BERTb date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `COVID-BERTb` is an English model originally trained by `rahulkuruvilla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_COVID_BERTb_en_4.0.0_3.0_1654176571589.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_COVID_BERTb_en_4.0.0_3.0_1654176571589.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_COVID_BERTb","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_COVID_BERTb","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.covid_bert.b.by_rahulkuruvilla").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
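The NLU one-liner above packs the question and the context into a single string joined by `|||`. A minimal sketch of that convention, splitting such a string back into the two fields (the separator handling here is our reading of the snippet, not NLU's internal code):

```python
def split_qa_string(packed):
    """Split an NLU-style 'question|||context' string into its two parts."""
    question, _, context = packed.partition("|||")
    return question.strip(), context.strip()

q, c = split_qa_string("What's my name?|||My name is Clara and I live in Berkeley.")
# q == "What's my name?"; c == "My name is Clara and I live in Berkeley."
```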
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_COVID_BERTb| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/rahulkuruvilla/COVID-BERTb --- layout: model title: English BertForQuestionAnswering model (from aodiniz) author: John Snow Labs name: bert_qa_bert_uncased_L_4_H_512_A_8_cord19_200616_squad2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-4_H-512_A-8_cord19-200616_squad2` is an English model originally trained by `aodiniz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_512_A_8_cord19_200616_squad2_en_4.0.0_3.0_1654185284081.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_512_A_8_cord19_200616_squad2_en_4.0.0_3.0_1654185284081.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_4_H_512_A_8_cord19_200616_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_uncased_L_4_H_512_A_8_cord19_200616_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_cord19.bert.uncased_4l_512d_a8a_512d").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_uncased_L_4_H_512_A_8_cord19_200616_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|107.2 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-4_H-512_A-8_cord19-200616_squad2 --- layout: model title: Translate English to Cushitic languages Pipeline author: John Snow Labs name: translate_en_cus date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, cus, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial ones. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `cus` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_cus_xx_2.7.0_2.4_1609686323037.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_cus_xx_2.7.0_2.4_1609686323037.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_cus", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_cus", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.cus').predict(text, output_level='sentence') translate_df ```
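Since, as noted above, translation is expensive on longer sequences, it can help to split a long document into sentence-sized pieces before calling `annotate`. A rough, dependency-free sketch (the splitting heuristic and `max_chars` threshold are our own illustration, not part of the pipeline):

```python
import re

def sentence_batches(text, max_chars=200):
    """Greedily group sentences into batches no longer than max_chars."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    batches, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            batches.append(current)
            current = s
        else:
            current = (current + " " + s).strip()
    if current:
        batches.append(current)
    return batches

# Illustrative use with the pipeline above:
# for batch in sentence_batches(long_document):
#     pipeline.annotate(batch)
```

A proper sentence detector (e.g. the one shipped with Spark NLP) would be more robust than this regex for production use.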
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_cus| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Cased model (from saraks) author: John Snow Labs name: distilbert_qa_cuad_effective_date_08_29_v1 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cuad-distil-effective_date-08-29-v1` is an English model originally trained by `saraks`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_effective_date_08_29_v1_en_4.3.0_3.0_1672766130454.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_effective_date_08_29_v1_en_4.3.0_3.0_1672766130454.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_effective_date_08_29_v1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_effective_date_08_29_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_cuad_effective_date_08_29_v1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/saraks/cuad-distil-effective_date-08-29-v1 --- layout: model title: English RoBERTa Embeddings (Base, Titles, Bodies) author: John Snow Labs name: roberta_embeddings_distilroberta_base_finetuned_jira_qt_issue_titles_and_bodies date: 2022-04-14 tags: [roberta, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilroberta-base-finetuned-jira-qt-issue-titles-and-bodies` is an English model originally trained by `ietz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distilroberta_base_finetuned_jira_qt_issue_titles_and_bodies_en_3.4.2_3.0_1649947469810.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_distilroberta_base_finetuned_jira_qt_issue_titles_and_bodies_en_3.4.2_3.0_1649947469810.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_distilroberta_base_finetuned_jira_qt_issue_titles_and_bodies","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_distilroberta_base_finetuned_jira_qt_issue_titles_and_bodies","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.distilroberta_base_finetuned_jira_qt_issue_titles_and_bodies").predict("""I love Spark NLP""") ```
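The pipeline above yields one embedding vector per token. When a single sentence-level vector is needed (for example, to compare issue titles), a common approach is to mean-pool the token vectors; a small, framework-free sketch (the pooling choice is our illustration, not something this model prescribes):

```python
def mean_pool(token_embeddings):
    """Average a list of equal-length token vectors into one sentence vector."""
    if not token_embeddings:
        return []
    dim = len(token_embeddings[0])
    n = len(token_embeddings)
    return [sum(vec[i] for vec in token_embeddings) / n for i in range(dim)]

sentence_vec = mean_pool([[1.0, 2.0], [3.0, 4.0]])
# sentence_vec == [2.0, 3.0]
```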
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_distilroberta_base_finetuned_jira_qt_issue_titles_and_bodies| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|309.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/ietz/distilroberta-base-finetuned-jira-qt-issue-titles-and-bodies --- layout: model title: English BertForQuestionAnswering model (from madlag) author: John Snow Labs name: bert_qa_bert_base_uncased_squadv1_x2.01_f89.2_d30_hybrid_rewind_opt_v1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squadv1-x2.01-f89.2-d30-hybrid-rewind-opt-v1` is an English model originally trained by `madlag`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squadv1_x2.01_f89.2_d30_hybrid_rewind_opt_v1_en_4.0.0_3.0_1654181584203.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squadv1_x2.01_f89.2_d30_hybrid_rewind_opt_v1_en_4.0.0_3.0_1654181584203.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_squadv1_x2.01_f89.2_d30_hybrid_rewind_opt_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_squadv1_x2.01_f89.2_d30_hybrid_rewind_opt_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased_x1.96_f88.3_d27_hybrid_filled_opt_v1.by_madlag").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_squadv1_x2.01_f89.2_d30_hybrid_rewind_opt_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|194.1 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/madlag/bert-base-uncased-squadv1-x2.01-f89.2-d30-hybrid-rewind-opt-v1 - https://rajpurkar.github.io/SQuAD-explorer - https://www.aclweb.org/anthology/N19-1423.pdf --- layout: model title: Arabic BertForQuestionAnswering Cased model (from wonfs) author: John Snow Labs name: bert_qa_arabert_v2 date: 2022-07-07 tags: [ar, open_source, bert, question_answering] task: Question Answering language: ar edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `arabert-v2-qa` is an Arabic model originally trained by `wonfs`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_arabert_v2_ar_4.0.0_3.0_1657182574675.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_arabert_v2_ar_4.0.0_3.0_1657182574675.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_arabert_v2","ar") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_arabert_v2","ar") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("ما هو اسمي؟", "اسمي كلارا وأنا أعيش في بيركلي.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_arabert_v2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ar| |Size:|505.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/wonfs/arabert-v2-qa --- layout: model title: ELECTRA Sentence Embeddings(ELECTRA Small) author: John Snow Labs name: sent_electra_small_uncased date: 2020-08-27 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description ELECTRA is a BERT-like model that is pre-trained as a discriminator in a set-up resembling a generative adversarial network (GAN). It was originally published by: Kevin Clark and Minh-Thang Luong and Quoc V. Le and Christopher D. Manning: [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/forum?id=r1xMH1BtvB), ICLR 2020. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_electra_small_uncased_en_2.6.0_2.4_1598489761661.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_electra_small_uncased_en_2.6.0_2.4_1598489761661.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_electra_small_uncased", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_electra_small_uncased", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.electra_small_uncased').predict(text, output_level='sentence') embeddings_df ```
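Sentence vectors like those returned above are typically compared with cosine similarity; a minimal sketch in plain Python (the comparison method is a common convention, not something this model mandates):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical directions score ~1.0; orthogonal vectors score ~0.0.
score = cosine_similarity([0.42, -0.25], [0.42, -0.25])
```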
{:.h2_title} ## Results ```bash sentence en_embed_sentence_electra_small_uncased_embeddings I hate cancer [0.4288138449192047, -0.25909560918807983, -0.... Antibiotics aren't painkiller [0.04786013811826706, 0.14878112077713013, -0.... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_electra_small_uncased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|256| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/google/electra_small/2](https://tfhub.dev/google/electra_small/2) --- layout: model title: Pipeline to Detect concepts in drug development trials (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_drug_development_trials_pipeline date: 2023-03-20 tags: [ner, en, bertfortokenclassification, clinical, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [bert_token_classifier_drug_development_trials](https://nlp.johnsnowlabs.com/2022/06/18/bert_token_classifier_drug_development_trials_en_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_drug_development_trials_pipeline_en_4.3.0_3.2_1679304929639.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_drug_development_trials_pipeline_en_4.3.0_3.2_1679304929639.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_drug_development_trials_pipeline", "en", "clinical/models") text = '''In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_drug_development_trials_pipeline", "en", "clinical/models") val text = "In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.druge_developement.pipeline").predict("""In June 2003, the median overall survival with and without topotecan were 4.0 and 3.6 months, respectively. The best complete response ( CR ) , partial response ( PR ) , stable disease and progressive disease were observed in 23, 63, 55 and 33 patients, respectively, with topotecan, and 11, 61, 66 and 32 patients, respectively, without topotecan.""") ```
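`fullAnnotate` returns annotation objects whose `result` field holds the chunk text and whose `metadata` holds the NER label. A plain-Python sketch of collecting `(chunk, label)` pairs, using a stand-in class since the exact annotation type varies by version:

```python
# Stand-in mirroring the fields of Spark NLP's Annotation (illustrative only)
class Annotation:
    def __init__(self, result, begin, end, metadata):
        self.result, self.begin, self.end, self.metadata = result, begin, end, metadata

def chunks_with_labels(annotations):
    # pair each detected chunk with its NER label from the annotation metadata
    return [(a.result, a.metadata.get("entity")) for a in annotations]

# illustrative chunks for the clinical-trials example text
ner_chunks = [
    Annotation("June 2003", 3, 11, {"entity": "DATE"}),
    Annotation("overall survival", 25, 40, {"entity": "End_Point"}),
]
print(chunks_with_labels(ner_chunks))
# [('June 2003', 'DATE'), ('overall survival', 'End_Point')]
```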
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:------------------------|--------:|------:|:--------------|-------------:| | 0 | June 2003 | 3 | 11 | DATE | 0.996034 | | 1 | median | 18 | 23 | Duration | 0.999535 | | 2 | overall survival | 25 | 40 | End_Point | 0.996754 | | 3 | without topotecan | 52 | 68 | Trial_Group | 0.976542 | | 4 | 4.0 | 75 | 77 | Value | 0.998101 | | 5 | 3.6 months | 83 | 92 | Value | 0.998159 | | 6 | complete response ( CR | 118 | 140 | End_Point | 0.998629 | | 7 | partial response ( PR | 146 | 167 | End_Point | 0.998672 | | 8 | stable disease | 173 | 186 | End_Point | 0.996891 | | 9 | progressive disease | 192 | 210 | End_Point | 0.997602 | | 10 | 23 | 229 | 230 | Patient_Count | 0.998463 | | 11 | 63 | 233 | 234 | Patient_Count | 0.996301 | | 12 | 55 | 237 | 238 | Patient_Count | 0.996667 | | 13 | 33 patients | 244 | 254 | Patient_Count | 0.995486 | | 14 | topotecan | 277 | 285 | Trial_Group | 0.999624 | | 15 | 11 | 293 | 294 | Patient_Count | 0.998747 | | 16 | 61 | 297 | 298 | Patient_Count | 0.998314 | | 17 | 66 | 301 | 302 | Patient_Count | 0.998066 | | 18 | 32 patients | 308 | 318 | Patient_Count | 0.996285 | | 19 | without topotecan | 335 | 351 | Trial_Group | 0.971218 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_drug_development_trials_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.9 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: English BertForQuestionAnswering model (from andresestevez) author: John Snow Labs name: bert_qa_andresestevez_bert_base_cased_finetuned_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 
supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased-finetuned-squad` is an English model originally trained by `andresestevez`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_andresestevez_bert_base_cased_finetuned_squad_en_4.0.0_3.0_1654179773819.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_andresestevez_bert_base_cased_finetuned_squad_en_4.0.0_3.0_1654179773819.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_andresestevez_bert_base_cased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_andresestevez_bert_base_cased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
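After `transform`, the prediction sits in the `answer` output column as a list of annotations. A plain-Python sketch of pulling the answer strings out of collected rows (the dict shape below is an illustrative stand-in for collected `Row` objects, not the library's exact return type):

```python
# Illustrative stand-in for collected rows: the answer column holds a list of
# annotation dicts, each with a "result" string and optional metadata.
rows = [
    {"question": "What's my name?",
     "answer": [{"result": "Clara", "metadata": {"score": "0.99"}}]},
]

def extract_answers(collected_rows):
    # take the top annotation's result string per row; None if no answer found
    return [r["answer"][0]["result"] if r["answer"] else None for r in collected_rows]

print(extract_answers(rows))  # ['Clara']
```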
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_andresestevez_bert_base_cased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/andresestevez/bert-base-cased-finetuned-squad --- layout: model title: Key Value Recognition on 10K filings author: John Snow Labs name: visualner_keyvalue_10kfilings date: 2023-01-10 tags: [en, licensed] task: OCR Object Detection language: en nav_key: models edition: Visual NLP 4.0.0 spark_version: 3.2 supported: true annotator: VisualDocumentNERv21 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Form Recognition / Key Value extraction model, trained on the summary page of SEC 10K filings. It extracts KEY, VALUE, and HEADER entities, where HEADER is the title of the filing. ## Predicted Entities `KEY`, `VALUE`, `HEADER` {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/90.2.Financial_Visual_NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/visualner_keyvalue_10kfilings_en_4.0.0_3.2_1663781115795.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/visualner_keyvalue_10kfilings_en_4.0.0_3.2_1663781115795.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pyspark.sql.functions as f binary_to_image = BinaryToImage()\ .setInputCol("content") \ .setOutputCol("image") \ .setImageType(ImageType.TYPE_3BYTE_BGR) img_to_hocr = ImageToHocr()\ .setInputCol("image")\ .setOutputCol("hocr")\ .setIgnoreResolution(False)\ .setOcrParams(["preserve_interword_spaces=0"]) tokenizer = HocrTokenizer()\ .setInputCol("hocr")\ .setOutputCol("token") doc_ner = VisualDocumentNerV21()\ .pretrained("visualner_keyvalue_10kfilings", "en", "clinical/ocr")\ .setInputCols(["token", "image"])\ .setOutputCol("entities") draw = ImageDrawAnnotations() \ .setInputCol("image") \ .setInputChunksCol("entities") \ .setOutputCol("image_with_annotations") \ .setFontSize(10) \ .setLineWidth(4)\ .setRectColor(Color.red) # OCR pipeline pipeline = PipelineModel(stages=[ binary_to_image, img_to_hocr, tokenizer, doc_ner, draw ]) bin_df = spark.read.format("binaryFile").load('data/t01.jpg') results = pipeline.transform(bin_df).cache() res = results.collect() path_array = f.split(results['path'], '/') results.withColumn('filename', path_array.getItem(f.size(path_array)- 1)) \ .withColumn("exploded_entities", f.explode("entities")) \ .select("filename", "exploded_entities") \ .show(truncate=False) ``` ```scala val binary_to_image = new BinaryToImage() .setInputCol("content") .setOutputCol("image") .setImageType(ImageType.TYPE_3BYTE_BGR) val img_to_hocr = new ImageToHocr() .setInputCol("image") .setOutputCol("hocr") .setIgnoreResolution(false) .setOcrParams(Array("preserve_interword_spaces=0")) val tokenizer = new HocrTokenizer() .setInputCol("hocr") .setOutputCol("token") val doc_ner = VisualDocumentNerV21() .pretrained("visualner_keyvalue_10kfilings", "en", "clinical/ocr") .setInputCols(Array("token", "image")) .setOutputCol("entities") val draw = new ImageDrawAnnotations() .setInputCol("image") .setInputChunksCol("entities") .setOutputCol("image_with_annotations") .setFontSize(10) .setLineWidth(4) .setRectColor(Color.red) // OCR pipeline val pipeline = new Pipeline().setStages(Array( binary_to_image, img_to_hocr, tokenizer, doc_ner, draw)) val bin_df = spark.read.format("binaryFile").load("data/t01.jpg") val results = pipeline.fit(bin_df).transform(bin_df).cache() val res = results.collect() val path_array = split(col("path"), "/") results.withColumn("filename", path_array.getItem(size(path_array) - 1)) .withColumn("exploded_entities", explode(col("entities"))) .select("filename", "exploded_entities") .show(false) ```
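The entities come back with `-B`/`-I` suffixes marking the beginning/inside of a multi-token chunk, as the output section below shows. Merging them into whole KEY/VALUE/HEADER spans is a small post-processing step; a plain-Python sketch over illustrative `(label, token)` pairs:

```python
# Illustrative (label, token) pairs in reading order, mimicking the visual NER
# output; B/I suffixes mark the beginning/inside of a multi-token chunk.
entities = [
    ("KEY-B", "yes"), ("VALUE-B", "li"), ("KEY-B", "no"),
    ("HEADER-B", "united"), ("HEADER-I", "states"),
]

def merge_bio(pairs):
    # merge B-/I- tagged tokens into (label, chunk) spans
    chunks = []
    for label, token in pairs:
        base, sep, pos = label.rpartition("-")
        if not sep:                      # untagged labels (e.g. OTHERS) pass through
            base, pos = label, "B"
        if pos == "I" and chunks and chunks[-1][0] == base:
            chunks[-1] = (base, chunks[-1][1] + " " + token)
        else:
            chunks.append((base, token))
    return chunks

print(merge_bio(entities))
# [('KEY', 'yes'), ('VALUE', 'li'), ('KEY', 'no'), ('HEADER', 'united states')]
```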
## Example {%- capture input_image -%} ![Screenshot](/assets/images/examples_ocr/image11.jpg) {%- endcapture -%} {%- capture output_image -%} ![Screenshot](/assets/images/examples_ocr/image11_out.png) {%- endcapture -%} {% include templates/input_output_image.md input_image=input_image output_image=output_image %} ## Output text ```bash +--------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ |filename|exploded_entities | +--------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ |t01.jpg |{named_entity, 268, 269, OTHERS, {confidence -> 96, width -> 14, x -> 822, y -> 1101, word -> of, token -> of, height -> 34}, []} | |t01.jpg |{named_entity, 271, 273, OTHERS, {confidence -> 89, width -> 33, x -> 837, y -> 1112, word -> the, token -> the, height -> 13}, []} | |t01.jpg |{named_entity, 275, 277, OTHERS, {confidence -> 89, width -> 30, x -> 874, y -> 1113, word -> Act., token -> act, height -> 12}, []} | |t01.jpg |{named_entity, 280, 282, KEY-B, {confidence -> 94, width -> 26, x -> 910, y -> 1113, word -> Yes, token -> yes, height -> 12}, []} | |t01.jpg |{named_entity, 284, 285, VALUE-B, {confidence -> 45, width -> 13, x -> 944, y -> 1112, word -> LI, token -> li, height -> 13}, []} | |t01.jpg |{named_entity, 287, 288, KEY-B, {confidence -> 83, width -> 22, x -> 963, y -> 1113, word -> No, token -> no, height -> 12}, []} | |t01.jpg |{named_entity, 290, 295, HEADER-B, {confidence -> 96, width -> 89, x -> 1493, y -> 13, word -> UNITED, token -> united, height -> 16}, []} | |t01.jpg |{named_entity, 297, 302, HEADER-I, {confidence -> 95, width -> 83, x -> 1590, y -> 13, word -> STATES, token -> states, height -> 16}, []} | |t01.jpg |{named_entity, 304, 313, HEADER-B, {confidence -> 95, width -> 221, x -> 1186, y -> 45, word -> SECURITIES, token -> securities, 
height -> 25}, []} | |t01.jpg |{named_entity, 315, 317, HEADER-I, {confidence -> 95, width -> 80, x -> 1415, y -> 45, word -> AND, token -> and, height -> 25}, []} | |t01.jpg |{named_entity, 319, 326, HEADER-I, {confidence -> 96, width -> 212, x -> 1507, y -> 45, word -> EXCHANGE, token -> exchange, height -> 25}, []} | |t01.jpg |{named_entity, 328, 337, HEADER-I, {confidence -> 95, width -> 249, x -> 1732, y -> 45, word -> COMMISSION, token -> commission, height -> 25}, []} | |t01.jpg |{named_entity, 339, 348, HEADER-B, {confidence -> 96, width -> 125, x -> 1461, y -> 86, word -> Washington,, token -> washington, height -> 21}, []} | |t01.jpg |{named_entity, 351, 351, HEADER-I, {confidence -> 93, width -> 43, x -> 1595, y -> 86, word -> D.C., token -> d, height -> 16}, []} | |t01.jpg |{named_entity, 356, 360, HEADER-I, {confidence -> 93, width -> 59, x -> 1646, y -> 86, word -> 20549, token -> 20549, height -> 16}, []} | |t01.jpg |{named_entity, 362, 365, HEADER-B, {confidence -> 93, width -> 112, x -> 1484, y -> 159, word -> FORM, token -> form, height -> 25}, []} | |t01.jpg |{named_entity, 367, 368, HEADER-I, {confidence -> 91, width -> 77, x -> 1609, y -> 159, word -> 10-K, token -> 10, height -> 25}, []} | +--------+---------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|visualner_keyvalue_10kfilings| |Type:|ocr| |Compatibility:|Visual NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|744.3 MB| ## References Sec 10K filings --- layout: model title: Sentence Entity Resolver for Snomed Concepts, INT version (``sbiobert_base_cased_mli`` embeddings) author: John Snow Labs name: sbiobertresolve_snomed_findings_int language: en nav_key: models repository: clinical/models date: 2020-11-27 task: Entity Resolution edition: Healthcare NLP 2.6.4 spark_version: 2.4 tags: 
[clinical,entity_resolution,en] supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model maps extracted medical entities to Snomed codes (INT version) using chunk embeddings. {:.h2_title} ## Predicted Entities Snomed Codes and their normalized definition with ``sbiobert_base_cased_mli`` embeddings. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_findings_int_en_2.6.4_2.4_1606235761314.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_findings_int_en_2.6.4_2.4_1606235761314.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") snomed_int_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_snomed_findings_int","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_int_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala ... 
val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val snomed_int_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_snomed_findings_int","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_int_resolver)) val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.h2_title} ## Results ```bash +--------------------+-----+---+---------+---------------+----------+--------------------+--------------------+ | chunk|begin|end| entity| code|confidence| resolutions| codes| +--------------------+-----+---+---------+---------------+----------+--------------------+--------------------+ | hypertension| 68| 79| PROBLEM| 266285003| 0.8867|rheumatic myocard...|266285003:::15529...| |chronic renal ins...| 83|109| PROBLEM| 236425005| 0.2470|chronic renal imp...|236425005:::90688...| | COPD| 113|116| PROBLEM| 413839001| 0.0720|chronic lung dise...|413839001:::41384...| | gastritis| 
120|128| PROBLEM| 266502003| 0.3240|acute peptic ulce...|266502003:::45560...| | TIA| 136|138| PROBLEM|353101000119105| 0.0727|prostatic intraep...|353101000119105::...| |a non-ST elevatio...| 182|202| PROBLEM| 233843008| 0.2846|silent myocardial...|233843008:::71942...| |Guaiac positive s...| 208|229| PROBLEM| 168319009| 0.1167|stool culture pos...|168319009:::70396...| |cardiac catheteri...| 295|317| TEST| 301095005| 0.2137|cardiac finding::...|301095005:::25090...| | PTCA| 324|327|TREATMENT|842741000000109| 0.0631|occlusion of post...|842741000000109::...| | mid LAD lesion| 332|345| PROBLEM| 449567000| 0.0808|overriding left v...|449567000:::25342...| +--------------------+-----+---+---------+---------------+----------+--------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---------------|---------------------| | Name: | sbiobertresolve_snomed_findings_int | | Type: | SentenceEntityResolverModel | | Compatibility: | Spark NLP 2.6.4 + | | License: | Licensed | | Edition: | Official | |Input labels: | [ner_chunk, chunk_embeddings] | |Output labels: | [resolution] | | Language: | en | | Dependencies: | sbiobert_base_cased_mli | {:.h2_title} ## Data Source Trained on SNOMED (INT version) Findings with ``sbiobert_base_cased_mli`` sentence embeddings. http://www.snomed.org/ --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_6 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`roberta-base-few-shot-k-256-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_6_en_4.0.0_3.0_1655732262112.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_6_en_4.0.0_3.0_1655732262112.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_6","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_256d_seed_6").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
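The NLU one-liner packs question and context into a single string separated by `|||`, as the example above shows. A small helper for building that input (the separator is taken from the example above; treat it as a convention of these QA loads, not a documented API):

```python
def nlu_qa_input(question, context):
    # join question and context with the '|||' separator used by nlu QA loads
    return f"{question}|||{context}"

print(nlu_qa_input("What's my name?", "My name is Clara and I live in Berkeley."))
# What's my name?|||My name is Clara and I live in Berkeley.
```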
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_256_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|426.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-256-finetuned-squad-seed-6 --- layout: model title: English BertForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_4 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-256-finetuned-squad-seed-4` is a English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_4_en_4.0.0_3.0_1657192372566.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_4_en_4.0.0_3.0_1657192372566.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|384.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-256-finetuned-squad-seed-4 --- layout: model title: Fast Neural Machine Translation Model from Arabic to English author: John Snow Labs name: opus_mt_ar_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ar, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `ar` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ar_en_xx_2.7.0_2.4_1609164323248.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ar_en_xx_2.7.0_2.4_1609164323248.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ar_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ar_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ar.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
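`fullAnnotate` translates sentence by sentence, so a document comes back as a list of per-sentence translations. A plain-Python sketch of stitching them back into one string (the dict shape is an illustrative stand-in; the real output holds annotation objects):

```python
# Illustrative shape of LightPipeline.fullAnnotate output for one input text:
# each element of "translation" is one translated sentence.
annotated = [{"translation": [{"result": "Hello"}, {"result": "How are you?"}]}]

def join_translation(result):
    # stitch the per-sentence translations back into a single string
    return " ".join(part["result"] for part in result[0]["translation"])

print(join_translation(annotated))  # 'Hello How are you?'
```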
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ar_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: RCT Binary Classifier (BioBERT) author: John Snow Labs name: bert_sequence_classifier_binary_rct_biobert date: 2022-04-25 tags: [licensed, rct, classifier, en, clinical] task: Text Classification language: en nav_key: models edition: Healthcare NLP 3.5.0 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [BioBERT based](https://github.com/dmis-lab/biobert) classifier that can classify if an article is a randomized clinical trial (RCT) or not. ## Predicted Entities `True`, `False` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CLASSIFICATION_RCT/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CLASSIFICATION_RCT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_binary_rct_biobert_en_3.5.0_3.0_1650861635354.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_binary_rct_biobert_en_3.5.0_3.0_1650861635354.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequenceClassifier_loaded = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_binary_rct_biobert", "en", "clinical/models")\ .setInputCols(["document",'token'])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier_loaded ]) data = spark.createDataFrame([["""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. 
"""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_binary_rct_biobert", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. 
""").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.bert_sequence.binary_biobert").predict("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. """) ```
## Results ```bash | text | rct | |---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------| | Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. 
Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. | True | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_binary_rct_biobert| |Compatibility:|Healthcare NLP 3.5.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.0 MB| |Case sensitive:|true| |Max sentence length:|128| ## References https://arxiv.org/abs/1710.06071 ## Benchmarking ```bash label prec rec f1 support False 0.94 0.76 0.84 1982 True 0.76 0.94 0.84 1629 accuracy 0.84 0.84 0.84 3611 macro-avg 0.85 0.85 0.84 3611 weighted-avg 0.86 0.84 0.84 3611 ``` --- layout: model title: English RobertaForSequenceClassification Cased model (from Souvikcmsa) author: John Snow Labs name: roberta_classifier_sentiment_analysis date: 2022-12-09 tags: [en, open_source, roberta, sequence_classification, classification] task: Text Classification language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Roberta_Sentiment_Analysis` is an English model originally trained by `Souvikcmsa`. 
## Predicted Entities `4`, `0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_classifier_sentiment_analysis_en_4.2.4_3.0_1670621684250.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_classifier_sentiment_analysis_en_4.2.4_3.0_1670621684250.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_sentiment_analysis","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_classifier]) data = spark.createDataFrame([["I love you!"], ["I feel lucky to be here."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_sentiment_analysis","en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_classifier)) val data = Seq("I love you!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
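This classifier emits one label per input (here `4` or `0`). As background, sequence classifiers of this kind pick the argmax over per-label scores; a minimal pure-Python sketch of that step (the logits and their values are illustrative only, not this model's actual outputs):

```python
import math

def softmax(logits):
    """Convert raw classifier logits to probabilities."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, labels):
    """Pick the label with the highest probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Illustrative logits for the two classes this model exposes ("4" and "0").
label, prob = predict_label([2.1, -0.3], ["4", "0"])
```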
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_classifier_sentiment_analysis| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Souvikcmsa/Roberta_Sentiment_Analysis --- layout: model title: Clinical Deidentification (glove) author: John Snow Labs name: clinical_deidentification_glove date: 2021-06-08 tags: [deidentification, en, licensed, pipeline] task: De-identification language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline is trained with lightweight glove_100d embeddings and can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`, `CITY`, `COUNTRY`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `STREET`, `USERNAME`, `ZIP`, `ACCOUNT`, `LICENSE`, `VIN`, `SSN`, `DLN`, `PLATE`, `IPADDR` entities. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_glove_en_3.0.4_3.0_1623177289663.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_glove_en_3.0.4_3.0_1623177289663.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification_glove", "en", "clinical/models") deid_pipeline.annotate("Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN:324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years-old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = PretrainedPipeline("clinical_deidentification_glove","en","clinical/models") val result = deid_pipeline.annotate("Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN:324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years-old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.") ``` {:.nlu-block} ```python import nlu nlu.load("en.deid.glove_pipeline").predict("""Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN:324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years-old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.""") ```
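The `annotate()` call on a pretrained pipeline returns a plain Python dict mapping output column names to lists of strings, with the lists aligned sentence by sentence. A minimal post-processing sketch that pairs each sentence with its masked and obfuscated forms (the sample values here are abbreviated stand-ins, not the pipeline's exact output):

```python
# Stand-in for the dict shape that annotate() returns: one list per output
# column, all lists aligned by sentence index.
result = {
    "sentence":   ["Record date : 2093-01-13, David Hale, M.D.", "IP: 203.120.223.13."],
    "masked":     ["Record date : <DATE>, <DOCTOR>, M.D.", "IP: <IPADDR>."],
    "obfuscated": ["Record date : 2093-02-13, Shella Solan, M.D.", "IP: 444.444.444.444."],
}

# Zip the aligned lists into (original, masked, obfuscated) triples.
pairs = list(zip(result["sentence"], result["masked"], result["obfuscated"]))
for original, masked, obfuscated in pairs:
    print(f"{original!r} -> masked={masked!r} obfuscated={obfuscated!r}")
```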
## Results ```bash {'sentence': ['Record date : 2093-01-13, David Hale, M.D.', 'IP: 203.120.223.13.', 'The driver's license no:A334455B.', 'the SSN:324598674 and e-mail: hale@gmail.com.', 'Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93.', 'PCP : Oliveira, 25 years-old.', 'Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.'], 'masked': ['Record date : , , M.D.', 'IP: .', 'The driver's license .', 'the and e-mail: .', 'Name : MR. # Date : .', 'PCP : , years-old.', 'Record date : , Patient's VIN : .'], 'obfuscated': ['Record date : 2093-02-13, Shella Solan, M.D.', 'IP: 444.444.444.444.', 'The driver's license O497302436569.', 'the SSN-539-29-1060 and e-mail: Keith@google.com.', 'Name : Roscoe Kerns MR. # Q984288 Date : 10-08-1991.', 'PCP : Dr Rudell Dubin, 10 years-old.', 'Record date : 2079-12-30, Patient's VIN : 5eeee44ffff555666.'], 'ner_chunk': ['2093-01-13', 'David Hale', 'no:A334455B', 'SSN:324598674', 'Hendrickson, Ora', '719435', '01/13/93', 'Oliveira', '25', '2079-11-09', '1HGBH41JXMN109286']} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification_glove| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - TokenizerModel - LemmatizerModel - Finisher --- layout: model title: English BertForQuestionAnswering model (from krinal214) author: John Snow Labs name: bert_qa_bert_all_translated date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-all-translated` is an English model originally trained by `krinal214`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_all_translated_en_4.0.0_3.0_1654179576151.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_all_translated_en_4.0.0_3.0_1654179576151.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_all_translated","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_all_translated","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.by_krinal214").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
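In the nlu one-liner, the question and context travel together in a single string separated by `|||`. A small sketch of how such an input could be split back into its two parts (assuming a plain string split on that separator; the helper name `split_qa` is illustrative):

```python
def split_qa(payload, sep="|||"):
    """Split an nlu-style 'question|||context' string into its two parts."""
    question, _, context = payload.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
```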
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_all_translated| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/krinal214/bert-all-translated --- layout: model title: Google's Tapas Table Understanding (Large, SQA) author: John Snow Labs name: table_qa_tapas_large_finetuned_sqa date: 2022-09-30 tags: [en, table, qa, question, answering, open_source] task: Table Question Answering language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: TapasForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a zero-shot table understanding model that lets you carry out question answering over Spark DataFrames. If your data is stored in a table format such as CSV, load it into a Spark DataFrame before querying it. Size of this model: Large. Has aggregation operations?: False ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/table_qa_tapas_large_finetuned_sqa_en_4.2.0_3.0_1664530811900.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/table_qa_tapas_large_finetuned_sqa_en_4.2.0_3.0_1664530811900.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python json_data = """ { "header": ["name", "money", "age"], "rows": [ ["Donald Trump", "$100,000,000", "75"], ["Elon Musk", "$20,000,000,000,000", "55"] ] } """ queries = [ "Who earns less than 200,000,000?", "Who earns 100,000,000?", "How much money has Donald Trump?", "How old are they?", ] data = spark.createDataFrame([ [json_data, " ".join(queries)] ]).toDF("table_json", "questions") document_assembler = MultiDocumentAssembler() \ .setInputCols("table_json", "questions") \ .setOutputCols("document_table", "document_questions") sentence_detector = SentenceDetector() \ .setInputCols(["document_questions"]) \ .setOutputCol("questions") table_assembler = TableAssembler()\ .setInputCols(["document_table"])\ .setOutputCol("table") tapas = TapasForQuestionAnswering\ .pretrained("table_qa_tapas_large_finetuned_sqa","en")\ .setInputCols(["questions", "table"])\ .setOutputCol("answers") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, table_assembler, tapas ]) model = pipeline.fit(data) model\ .transform(data)\ .selectExpr("explode(answers) AS answer")\ .select("answer")\ .show(truncate=False) ```
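The `table_json` payload is ordinary JSON with a `header` list and a `rows` list of lists, one cell per column. A quick stdlib-only check of that shape, using the same table as the snippet above:

```python
import json

json_data = """
{
  "header": ["name", "money", "age"],
  "rows": [
    ["Donald Trump", "$100,000,000", "75"],
    ["Elon Musk", "$20,000,000,000,000", "55"]
  ]
}
"""

table = json.loads(json_data)
# Each row must line up with the header, one cell per column.
assert all(len(row) == len(table["header"]) for row in table["rows"])

# Turn rows into dicts keyed by column name for easier lookups.
records = [dict(zip(table["header"], row)) for row in table["rows"]]
```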
## Results ```bash +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |answer | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |{chunk, 0, 12, Donald Trump, {question -> Who earns less than 200,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, Donald Trump, {question -> Who earns 100,000,000?, aggregation -> NONE, cell_positions -> [0, 0], cell_scores -> 0.9999999}, []} | |{chunk, 0, 12, $100,000,000, {question -> How much money has Donald Trump?, aggregation -> NONE, cell_positions -> [1, 0], cell_scores -> 0.9999998}, []} | |{chunk, 0, 6, AVERAGE > 75, 55, {question -> How old are they?, aggregation -> AVERAGE, cell_positions -> [2, 0], [2, 1], cell_scores -> 0.99999976, 0.9999995}, []} | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|table_qa_tapas_large_finetuned_sqa| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| ## References https://www.microsoft.com/en-us/download/details.aspx?id=54253 --- layout: model title: French CamemBert Embeddings (from adam1224) author: John Snow Labs name: camembert_embeddings_adam1224_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide 
scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `adam1224`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_adam1224_generic_model_fr_3.4.4_3.0_1653987174659.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_adam1224_generic_model_fr_3.4.4_3.0_1653987174659.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_adam1224_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_adam1224_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
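The `embeddings` column holds one dense float vector per token. A common downstream step is comparing such vectors with cosine similarity; a generic pure-Python sketch (the vectors below are made up for illustration, not this model's output):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical vectors have similarity 1.0; orthogonal vectors have 0.0.
sim = cosine_similarity([0.2, 0.1, 0.4], [0.2, 0.1, 0.4])
```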
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_adam1224_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/adam1224/dummy-model --- layout: model title: Malay BertForMaskedLM Cased model (from StevenLimcorn) author: John Snow Labs name: bert_embeddings_melayubert date: 2022-12-02 tags: [ms, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: ms edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `MelayuBERT` is a Malay model originally trained by `StevenLimcorn`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_melayubert_ms_4.2.4_3.0_1670015457507.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_melayubert_ms_4.2.4_3.0_1670015457507.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_melayubert","ms") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_melayubert","ms") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_melayubert| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ms| |Size:|410.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/StevenLimcorn/MelayuBERT - https://arxiv.org/abs/1810.04805 - https://github.com/sgugger - https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb - https://github.com/stevenlimcorn - https://hf.co/w11wo --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_el16_dl2 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-el16-dl2` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el16_dl2_en_4.3.0_3.0_1675119642706.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_el16_dl2_en_4.3.0_3.0_1675119642706.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_el16_dl2","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_el16_dl2","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_el16_dl2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|175.9 MB| ## References - https://huggingface.co/google/t5-efficient-small-el16-dl2 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Fast Neural Machine Translation Model from English to Galician author: John Snow Labs name: opus_mt_en_gl date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, gl, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `gl` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_gl_xx_2.7.0_2.4_1609164535729.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_gl_xx_2.7.0_2.4_1609164535729.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_gl", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_gl", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.gl').predict(text, output_level='sentence') opus_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_en_gl|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Pipeline to Detect Drug Information
author: John Snow Labs
name: ner_posology_small_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, drug, en]
task: [Named Entity Recognition, Pipeline Healthcare]
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [ner_posology_small](https://nlp.johnsnowlabs.com/2021/03/31/ner_posology_small_en.html) model.

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange.button-orange-trans.arr.button-icon}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_POSOLOGY.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_small_pipeline_en_3.4.1_3.0_1647873277709.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_small_pipeline_en_3.4.1_3.0_1647873277709.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_posology_small_pipeline", "en", "clinical/models") pipeline.fullAnnotate("The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. 
daily, and Bactrim DS b.i.d.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_posology_small_pipeline", "en", "clinical/models") pipeline.fullAnnotate("The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. 
daily, and Bactrim DS b.i.d.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology_small.pipeline").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""") ```
## Results

```bash
+--------------+---------+
|chunk         |ner      |
+--------------+---------+
|insulin       |DRUG     |
|Bactrim       |DRUG     |
|for 14 days   |DURATION |
|Fragmin       |DRUG     |
|5000 units    |DOSAGE   |
|subcutaneously|ROUTE    |
|daily         |FREQUENCY|
|Xenaderm      |DRUG     |
|topically     |ROUTE    |
|b.i.d.,       |FREQUENCY|
|Lantus        |DRUG     |
|40 units      |DOSAGE   |
|subcutaneously|ROUTE    |
|at bedtime    |FREQUENCY|
|OxyContin     |DRUG     |
|30 mg         |STRENGTH |
|p.o           |ROUTE    |
|q.12 h        |FREQUENCY|
|folic acid    |DRUG     |
|1 mg          |STRENGTH |
+--------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_posology_small_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter

---
layout: model
title: Pipeline to Detect Clinical Entities (ner_eu_clinical_case - eu)
author: John Snow Labs
name: ner_eu_clinical_case_pipeline
date: 2023-03-08
tags: [eu, clinical, licensed, ner]
task: Named Entity Recognition
language: eu
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [ner_eu_clinical_case](https://nlp.johnsnowlabs.com/2023/02/02/ner_eu_clinical_case_eu.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_pipeline_eu_4.3.0_3.2_1678261023976.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_pipeline_eu_4.3.0_3.2_1678261023976.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_eu_clinical_case_pipeline", "eu", "clinical/models") text = " 3 urteko mutiko bat nahasmendu autistarekin unibertsitateko ospitaleko A pediatriako ospitalean. Ez du autismoaren espektroaren nahaste edo gaixotasun familiaren aurrekaririk. Mutilari komunikazio-nahaste larria diagnostikatu zioten, elkarrekintza sozialeko zailtasunak eta prozesamendu sentsorial atzeratua. Odol-analisiak normalak izan ziren (tiroidearen hormona estimulatzailea (TSH), hemoglobina, batez besteko bolumen corpuskularra (MCV) eta ferritina). Goiko endoskopiak mukosaren azpiko tumore bat ere erakutsi zuen, urdail-irteeren guztizko oztopoa eragiten zuena. Estroma gastrointestinalaren tumore baten susmoa ikusita, distaleko gastrektomia egin zen. Azterketa histopatologikoak agerian utzi zuen mukosaren azpiko zelulen ugaltzea. " result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_eu_clinical_case_pipeline", "eu", "clinical/models") val text = " 3 urteko mutiko bat nahasmendu autistarekin unibertsitateko ospitaleko A pediatriako ospitalean. Ez du autismoaren espektroaren nahaste edo gaixotasun familiaren aurrekaririk. Mutilari komunikazio-nahaste larria diagnostikatu zioten, elkarrekintza sozialeko zailtasunak eta prozesamendu sentsorial atzeratua. Odol-analisiak normalak izan ziren (tiroidearen hormona estimulatzailea (TSH), hemoglobina, batez besteko bolumen corpuskularra (MCV) eta ferritina). Goiko endoskopiak mukosaren azpiko tumore bat ere erakutsi zuen, urdail-irteeren guztizko oztopoa eragiten zuena. Estroma gastrointestinalaren tumore baten susmoa ikusita, distaleko gastrektomia egin zen. Azterketa histopatologikoak agerian utzi zuen mukosaren azpiko zelulen ugaltzea. " val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | chunks | begin | end | entities | confidence | |---:|:-------------------------|--------:|------:|:-------------------|-------------:| | 0 | 3 urteko mutiko bat | 1 | 19 | patient | 0.813975 | | 1 | nahasmendu | 21 | 30 | clinical_event | 0.9848 | | 2 | autismoaren espektroaren | 104 | 127 | clinical_condition | 0.344 | | 3 | nahaste | 129 | 135 | clinical_event | 0.996 | | 4 | gaixotasun | 141 | 150 | clinical_event | 0.9839 | | 5 | familiaren | 152 | 161 | patient | 0.8834 | | 6 | aurrekaririk | 163 | 174 | clinical_event | 0.8742 | | 7 | Mutilari | 177 | 184 | patient | 0.9477 | | 8 | komunikazio-nahaste | 186 | 204 | clinical_event | 0.8647 | | 9 | diagnostikatu | 213 | 225 | clinical_event | 0.9969 | | 10 | elkarrekintza | 235 | 247 | clinical_event | 0.9828 | | 11 | zailtasunak | 259 | 269 | clinical_event | 0.9897 | | 12 | prozesamendu | 275 | 286 | clinical_event | 0.9927 | | 13 | sentsorial | 288 | 297 | clinical_condition | 0.7912 | | 14 | Odol-analisiak | 310 | 323 | clinical_event | 0.9992 | | 15 | normalak | 325 | 332 | units_measurements | 0.7265 | | 16 | tiroidearen | 346 | 356 | bodypart | 0.9718 | | 17 | hormona | 358 | 364 | clinical_event | 0.9904 | | 18 | estimulatzailea | 366 | 380 | clinical_condition | 0.6005 | | 19 | TSH | 383 | 385 | clinical_event | 0.9976 | | 20 | hemoglobina | 389 | 399 | clinical_event | 0.9936 | | 21 | bolumen | 416 | 422 | clinical_event | 0.735 | | 22 | MCV | 439 | 441 | clinical_event | 0.9933 | | 23 | ferritina | 448 | 456 | clinical_event | 0.4228 | | 24 | Goiko | 460 | 464 | bodypart | 0.9564 | | 25 | endoskopiak | 466 | 476 | clinical_event | 0.9082 | | 26 | mukosaren azpiko | 478 | 493 | bodypart | 0.5929 | | 27 | tumore | 495 | 500 | clinical_event | 0.998 | | 28 | erakutsi | 510 | 517 | clinical_event | 0.9963 | | 29 | oztopoa | 550 | 556 | clinical_event | 0.9964 | | 30 | Estroma | 574 | 580 | clinical_event | 0.884 | | 31 | gastrointestinalaren | 582 | 601 | clinical_condition | 
      0.3525 |
|  32 | tumore                   |     603 |   608 | clinical_event     |       0.9896 |
|  33 | ikusita                  |     623 |   629 | clinical_event     |       0.9873 |
|  34 | distaleko                |     632 |   640 | bodypart           |       0.7425 |
|  35 | gastrektomia             |     642 |   653 | clinical_event     |       0.9986 |
|  36 | Azterketa                |     665 |   673 | clinical_event     |       0.9517 |
|  37 | agerian                  |     693 |   699 | clinical_event     |       0.9842 |
|  38 | utzi                     |     701 |   704 | clinical_event     |        0.925 |
|  39 | mukosaren azpiko zelulen |     711 |   734 | bodypart           |     0.754933 |
|  40 | ugaltzea                 |     736 |   743 | clinical_event     |       0.9989 |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_eu_clinical_case_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|eu|
|Size:|1.1 GB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel

---
layout: model
title: Lemmatizer (Greek, SpacyLookup)
author: John Snow Labs
name: lemma_spacylookup
date: 2022-03-08
tags: [open_source, lemmatizer, grc]
task: Lemmatization
language: grc
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This Greek Lemmatizer is a scalable, production-ready version of the rule-based lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/).

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_grc_3.4.1_3.0_1646753608262.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_grc_3.4.1_3.0_1646753608262.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","grc") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Δεν είσαι καλύτερος από μένα."]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","grc") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("Δεν είσαι καλύτερος από μένα.").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("grc.lemma.spacylookup").predict("""Δεν είσαι καλύτερος από μένα.""") ```
## Results

```bash
+-------------------------------------+
|result                               |
+-------------------------------------+
|[Δεν, είσαι, καλύτερος, από, μένα, .]|
+-------------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|lemma_spacylookup|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[lemma]|
|Language:|grc|
|Size:|806.9 KB|

---
layout: model
title: English T5ForConditionalGeneration Cased model (from yogi)
author: John Snow Labs
name: t5_autotrain_amazon_text_sum_730222226
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-amazon_text_sum-730222226` is an English model originally trained by `yogi`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_autotrain_amazon_text_sum_730222226_en_4.3.0_3.0_1675099816649.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_autotrain_amazon_text_sum_730222226_en_4.3.0_3.0_1675099816649.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_autotrain_amazon_text_sum_730222226","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_autotrain_amazon_text_sum_730222226","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_autotrain_amazon_text_sum_730222226|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|284.9 MB|

## References

- https://huggingface.co/yogi/autotrain-amazon_text_sum-730222226

---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab0_by_cuzeverynameistaken TFWav2Vec2ForCTC from cuzeverynameistaken
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_colab0_by_cuzeverynameistaken
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab0_by_cuzeverynameistaken` is an English model originally trained by cuzeverynameistaken.

NOTE: This pipeline only works on a CPU. If you need to run this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab0_by_cuzeverynameistaken_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab0_by_cuzeverynameistaken_en_4.2.0_3.0_1664039286755.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab0_by_cuzeverynameistaken_en_4.2.0_3.0_1664039286755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab0_by_cuzeverynameistaken', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab0_by_cuzeverynameistaken", lang = "en") val annotations = pipeline.transform(audioDF) ```
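The `audioDF` referenced above is not defined in the snippet. Spark NLP's ASR pipelines read a column of raw float samples (Wav2Vec2 models expect 16 kHz mono audio). Below is a minimal sketch of preparing such a row; the sine-wave stand-in, the 16 kHz rate, and the commented `librosa`/Spark calls are illustrative assumptions, not part of this card:

```python
import math

SAMPLE_RATE = 16000  # Wav2Vec2 models are trained on 16 kHz mono audio

# Stand-in for real audio, e.g. `samples, _ = librosa.load("sample.wav", sr=16000)`:
# here, a 0.1-second 440 Hz sine tone as a plain list of floats in [-1.0, 1.0].
samples = [math.sin(2.0 * math.pi * 440.0 * t / SAMPLE_RATE)
           for t in range(SAMPLE_RATE // 10)]

# The pipeline's AudioAssembler stage reads a float-array column, so a one-row
# DataFrame can be built like this (requires an active SparkSession `spark`):
# audioDF = spark.createDataFrame([(samples,)], ["audio_content"])
# annotations = pipeline.transform(audioDF)
```

Resampling to 16 kHz before building the DataFrame matters: feeding samples at another rate degrades transcription quality silently.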
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab0_by_cuzeverynameistaken|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|355.0 MB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: English image_classifier_vit_pond ViTForImageClassification from SummerChiam
author: John Snow Labs
name: image_classifier_vit_pond
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_pond` is an English model originally trained by SummerChiam.

## Predicted Entities

`NormalCement0`, `Boiling0`, `NormalNight0`, `Algae0`, `BoilingNight0`, `NormalRain0`, `Normal0`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_en_4.1.0_3.0_1660165886306.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_en_4.1.0_3.0_1660165886306.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_pond", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_pond", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_pond| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Turkish BertForQuestionAnswering Base Cased model (from husnu) author: John Snow Labs name: bert_qa_base_turkish_128k_cased_tquad2_finetuned_lr_2e_05_epochs_3 date: 2022-07-07 tags: [tr, open_source, bert, question_answering] task: Question Answering language: tr edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-turkish-128k-cased-finetuned_lr-2e-05_epochs-3TQUAD2-finetuned_lr-2e-05_epochs-3` is a Turkish model originally trained by `husnu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_turkish_128k_cased_tquad2_finetuned_lr_2e_05_epochs_3_tr_4.0.0_3.0_1657183679798.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_turkish_128k_cased_tquad2_finetuned_lr_2e_05_epochs_3_tr_4.0.0_3.0_1657183679798.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_turkish_128k_cased_tquad2_finetuned_lr_2e_05_epochs_3","tr") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_turkish_128k_cased_tquad2_finetuned_lr_2e_05_epochs_3","tr")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_base_turkish_128k_cased_tquad2_finetuned_lr_2e_05_epochs_3|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|tr|
|Size:|689.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/husnu/bert-base-turkish-128k-cased-finetuned_lr-2e-05_epochs-3TQUAD2-finetuned_lr-2e-05_epochs-3

---
layout: model
title: English asr_wav2vec2_base_demo_colab_by_thyagosme TFWav2Vec2ForCTC from thyagosme
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_demo_colab_by_thyagosme
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_demo_colab_by_thyagosme` is an English model originally trained by thyagosme.

NOTE: This pipeline only works on a CPU. If you need to run this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_demo_colab_by_thyagosme_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_demo_colab_by_thyagosme_en_4.2.0_3.0_1664108055641.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_demo_colab_by_thyagosme_en_4.2.0_3.0_1664108055641.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_demo_colab_by_thyagosme', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_demo_colab_by_thyagosme", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_demo_colab_by_thyagosme|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|354.9 MB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab50 TFWav2Vec2ForCTC from hassnain
author: John Snow Labs
name: pipeline_asr_wav2vec2_base_timit_demo_colab50
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab50` is an English model originally trained by hassnain.

NOTE: This pipeline only works on a CPU. If you need to run this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab50_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab50_en_4.2.0_3.0_1664021095237.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab50_en_4.2.0_3.0_1664021095237.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab50', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab50", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab50|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|en|
|Size:|355.0 MB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: Legal Therefore Clause Binary Classifier
author: John Snow Labs
name: legclf_therefore_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a binary classifier (True, False) for the `therefore` clause type. To use it, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences rather than the whole text, so it is better to skip them unless you want to do binary classification at the sentence level.

If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).

This model can be combined with any of the other hundreds of legal clause classifiers you will find in Models Hub, producing a series of True/False values for each of the legal clause models you have added.
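The "paragraph splitting (by multiline)" idea mentioned above can be sketched in plain Python, independent of Spark NLP: break a document on blank lines so each candidate clause fits within the model's 512-token window. The helper name and the whitespace-based token count are illustrative assumptions (real token counts depend on the tokenizer):

```python
import re

def split_paragraphs(text, max_tokens=512):
    """Split a document on blank lines and drop chunks over the token budget."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Crude whitespace token count; a real pipeline would use the model tokenizer.
    return [p for p in paragraphs if len(p.split()) <= max_tokens]

doc = ("WHEREAS, the parties wish to cooperate.\n\n"
       "NOW, THEREFORE, the parties agree as follows.")
for clause in split_paragraphs(doc):
    print(clause)
```

Each returned paragraph can then be fed to the classifier as a separate `clause_text` row.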
## Predicted Entities `other`, `therefore` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_therefore_clause_en_1.0.0_3.2_1660124082046.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_therefore_clause_en_1.0.0_3.2_1660124082046.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_therefore_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[therefore]| |[other]| |[other]| |[therefore]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_therefore_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.96 0.99 0.97 88 therefore 0.98 0.92 0.95 48 accuracy - - 0.96 136 macro-avg 0.97 0.95 0.96 136 weighted-avg 0.96 0.96 0.96 136 ``` --- layout: model title: Translate English to Arabic Pipeline author: John Snow Labs name: translate_en_ar date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, ar, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `ar` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ar_xx_2.7.0_2.4_1609686191172.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ar_xx_2.7.0_2.4_1609686191172.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ar", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ar", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ar').predict(text, output_level='sentence') translate_df ```
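Because long inputs dominate the cost noted in the description, a common workaround is to batch sentences so each `annotate` call sees a bounded amount of text. A rough pure-Python sketch of such greedy batching (the 50-word cap is an arbitrary illustrative choice, not a Spark NLP parameter):

```python
def batch_sentences(sentences, max_words=50):
    """Greedily group sentences into batches whose combined word count
    stays under max_words, so each translation call sees bounded input."""
    batches, current, count = [], [], 0
    for s in sentences:
        n = len(s.split())
        if current and count + n > max_words:
            batches.append(current)
            current, count = [], 0
        current.append(s)
        count += n
    if current:
        batches.append(current)
    return batches

sents = ["one two three"] * 40  # 40 sentences of 3 words each
batches = batch_sentences(sents)
```

Each batch can then be joined and passed to `pipeline.annotate` separately, keeping per-call latency predictable.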
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_ar| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Detect posology entities (biobert) author: John Snow Labs name: ner_posology_biobert_pipeline date: 2023-03-20 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_posology_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_posology_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_biobert_pipeline_en_4.3.0_3.2_1679316307940.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_biobert_pipeline_en_4.3.0_3.2_1679316307940.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_posology_biobert_pipeline", "en", "clinical/models") text = '''The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_posology_biobert_pipeline", "en", "clinical/models") val text = "The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology_biobert.pipeline").predict("""The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:--------------------|--------:|------:|:------------|-------------:| | 0 | 1 | 27 | 27 | DOSAGE | 0.9993 | | 1 | capsule | 29 | 35 | FORM | 0.9998 | | 2 | Advil | 40 | 44 | DRUG | 0.9999 | | 3 | 10 mg | 46 | 50 | STRENGTH | 0.98145 | | 4 | for 5 days | 52 | 61 | DURATION | 0.998833 | | 5 | magnesium hydroxide | 67 | 85 | DRUG | 0.82655 | | 6 | 100mg/1ml | 87 | 95 | STRENGTH | 0.9391 | | 7 | PO | 108 | 109 | ROUTE | 1 | | 8 | 40 units | 179 | 186 | DOSAGE | 0.87745 | | 9 | insulin glargine | 191 | 206 | DRUG | 0.9817 | | 10 | at night | 208 | 215 | FREQUENCY | 0.8641 | | 11 | 12 units | 218 | 225 | DOSAGE | 0.9533 | | 12 | insulin lispro | 230 | 243 | DRUG | 0.9476 | | 13 | with meals | 245 | 254 | FREQUENCY | 0.82125 | | 14 | metformin | 261 | 269 | DRUG | 0.9999 | | 15 | 1000 mg | 271 | 277 | STRENGTH | 0.91255 | | 16 | two times a day | 279 | 293 | FREQUENCY | 0.9969 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.2 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Fast Neural Machine Translation Model from Central Bikol to Finnish author: John Snow Labs name: opus_mt_bcl_fi date: 2021-06-01 tags: [open_source, seq2seq, translation, bcl, fi, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. 
Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `bcl` - target languages: `fi` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_fi_xx_3.1.0_2.4_1622558651339.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bcl_fi_xx_3.1.0_2.4_1622558651339.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_bcl_fi", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_bcl_fi", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Central Bikol.translate_to.Finnish').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_bcl_fi| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from jamie613) author: John Snow Labs name: xlmroberta_ner_jamie613_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `jamie613`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jamie613_base_finetuned_panx_de_4.1.0_3.0_1660434209491.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jamie613_base_finetuned_panx_de_4.1.0_3.0_1660434209491.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jamie613_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jamie613_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_jamie613_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/jamie613/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: French CamemBert Embeddings (from hackertec) author: John Snow Labs name: camembert_embeddings_hackertec_generic date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy` is a French model originally trained by `hackertec`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_hackertec_generic_fr_3.4.4_3.0_1653985899257.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_hackertec_generic_fr_3.4.4_3.0_1653985899257.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_hackertec_generic","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_hackertec_generic","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
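Downstream tasks often reduce the per-token vectors produced above to a single sentence vector by mean pooling, then compare sentences with cosine similarity. A minimal pure-Python sketch, using toy 3-dimensional vectors in place of real CamemBERT output (the vectors and dimensionality are illustrative assumptions):

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into one sentence vector."""
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / len(token_vectors) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy vectors standing in for real embedding output.
sent_a = mean_pool([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
sent_b = mean_pool([[1.0, 1.0, 0.0]])
similarity = cosine(sent_a, sent_b)
```

In practice the same pooling is applied to the `embeddings` column of the pipeline output.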
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_hackertec_generic| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/hackertec/dummy --- layout: model title: English ElectraForQuestionAnswering Small model (from mrm8488) Version-2 author: John Snow Labs name: electra_qa_small_finetuned_squadv2 date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-small-finetuned-squadv2` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_small_finetuned_squadv2_en_4.0.0_3.0_1655921290665.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_small_finetuned_squadv2_en_4.0.0_3.0_1655921290665.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_small_finetuned_squadv2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_small_finetuned_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.electra.small_v2").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_small_finetuned_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|51.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/electra-small-finetuned-squadv2 - https://rajpurkar.github.io/SQuAD-explorer/explore/v2.0/dev/ --- layout: model title: Turkish Electra Embeddings (from dbmdz) author: John Snow Labs name: electra_embeddings_electra_base_turkish_mc4_cased_generator date: 2022-05-17 tags: [tr, open_source, electra, embeddings] task: Embeddings language: tr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-turkish-mc4-cased-generator` is a Turkish model originally trained by `dbmdz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_turkish_mc4_cased_generator_tr_3.4.4_3.0_1652786611710.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_turkish_mc4_cased_generator_tr_3.4.4_3.0_1652786611710.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_turkish_mc4_cased_generator","tr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Spark NLP'yi seviyorum"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_turkish_mc4_cased_generator","tr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Spark NLP'yi seviyorum").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electra_base_turkish_mc4_cased_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|tr| |Size:|130.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/dbmdz/electra-base-turkish-mc4-cased-generator - https://zenodo.org/badge/latestdoi/237817454 - https://twitter.com/mervenoyann - https://github.com/allenai/allennlp/discussions/5265 - https://github.com/dbmdz - http://www.andrew.cmu.edu/user/ko/ --- layout: model title: Word Segmenter for Korean author: John Snow Labs name: wordseg_kaist_ud date: 2021-03-09 tags: [word_segmentation, open_source, korean, wordseg_kaist_ud, ko] task: Word Segmentation language: ko edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: WordSegmenterModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description [WordSegmenterModel-WSM](https://en.wikipedia.org/wiki/Text_segmentation) is based on a maximum entropy probability model that detects word boundaries in Korean text. Korean text is written without white space between the words, and a computer-based application cannot know a priori which sequence of ideograms forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step.
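To make the boundary-detection task concrete, here is a toy greedy longest-match segmenter over a small lexicon. This is purely illustrative: the real model learns maximum-entropy features and uses no dictionary, and the Latin-alphabet example text is an assumption chosen for readability.

```python
def max_match(text, lexicon):
    """Greedy longest-match segmentation: at each position take the
    longest lexicon word that matches; otherwise emit one character."""
    tokens, i = [], 0
    longest = max(len(w) for w in lexicon)
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + size] in lexicon:
                tokens.append(text[i:i + size])
                i += size
                break
        else:  # no lexicon word matched: fall back to a single character
            tokens.append(text[i])
            i += 1
    return tokens

lexicon = {"john", "snow", "labs"}
tokens = max_match("johnsnowlabs", lexicon)  # ["john", "snow", "labs"]
```

A statistical segmenter like this model replaces the dictionary lookup with a learned boundary probability at each character position, which handles unseen words far better.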
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/chinese/word_segmentation/words_segmenter_demo.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_kaist_ud_ko_3.0.0_3.0_1615292316292.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_kaist_ud_ko_3.0.0_3.0_1615292316292.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python word_segmenter = WordSegmenterModel.pretrained("wordseg_kaist_ud", "ko") \ .setInputCols(["sentence"]) \ .setOutputCol("token") pipeline = Pipeline(stages=[document_assembler, word_segmenter]) ws_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) example = spark.createDataFrame([['John Snow Labs에서 안녕하세요! ']], ["text"]) result = ws_model.transform(example) ``` ```scala val word_segmenter = WordSegmenterModel.pretrained("wordseg_kaist_ud", "ko") .setInputCols(Array("sentence")) .setOutputCol("token") val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter)) val data = Seq("John Snow Labs에서 안녕하세요! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["John Snow Labs에서 안녕하세요! "] token_df = nlu.load('ko.segment_words').predict(text) token_df ```
## Results ```bash 0 J 1 o 2 h 3 n 4 S 5 n 6 o 7 w 8 L 9 a 10 b 11 s 12 에 13 서 14 안 15 녕 16 하세요 17 ! Name: token, dtype: object ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wordseg_kaist_ud| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[words_segmented]| |Language:|ko| --- layout: model title: Fast Neural Machine Translation Model from English to Hausa author: John Snow Labs name: opus_mt_en_ha date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, ha, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `ha` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ha_xx_2.7.0_2.4_1609164301831.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ha_xx_2.7.0_2.4_1609164301831.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_ha", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_ha", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.ha').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_ha| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate English to Malayo-Polynesian languages Pipeline author: John Snow Labs name: translate_en_poz date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, poz, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module especially on larger sequence. The use of an accelerator such as GPU is recommended. - source languages: `en` - target languages: `poz` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_poz_xx_2.7.0_2.4_1609687099842.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_poz_xx_2.7.0_2.4_1609687099842.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_poz", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_poz", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.poz').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_poz| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English Medical BertForQuestionAnswering model (Uncased, Base, PubMed) author: John Snow Labs name: bert_qa_Shushant_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Medical Question Answering model, trained on PubMed, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. This is an English model originally trained by `Shushant`. This model is a fine-tuned version of `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Shushant_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT_en_4.0.0_3.0_1654176466849.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Shushant_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT_en_4.0.0_3.0_1654176466849.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Shushant_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_Shushant_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.pubmed.bert.base_uncased.by_Shushant").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_Shushant_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|408.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Shushant/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext-ContaminationQAmodel_PubmedBERT --- layout: model title: Vietnamese BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_vi_cased date: 2022-12-02 tags: [vi, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: vi edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-vi-cased` is a Vietnamese model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_vi_cased_vi_4.2.4_3.0_1670019307890.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_vi_cased_vi_4.2.4_3.0_1670019307890.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_vi_cased","vi") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_vi_cased","vi") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
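Each token in the `embeddings` column carries a fixed-size vector (768 dimensions for BERT-base variants such as this one). A common downstream use is comparing two such vectors with cosine similarity; a self-contained sketch, independent of Spark NLP:

```python
import math

def cosine_similarity(u, v):
    # cosine of the angle between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same = cosine_similarity([0.2, 0.5, 0.1], [0.2, 0.5, 0.1])       # identical vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])           # unrelated vectors
```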
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_vi_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|vi| |Size:|373.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-vi-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Clinical Drugs to UMLS Code Mapping author: John Snow Labs name: umls_drug_resolver_pipeline date: 2023-03-29 tags: [en, licensed, umls, pipeline] task: Pipeline Healthcare language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps entities (Clinical Drugs) with their corresponding UMLS CUI codes. You’ll just feed your text and it will return the corresponding UMLS codes. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/umls_drug_resolver_pipeline_en_4.3.2_3.2_1680127003093.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/umls_drug_resolver_pipeline_en_4.3.2_3.2_1680127003093.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("umls_drug_resolver_pipeline", "en", "clinical/models") result = pipeline.annotate("The patient was given Adapin 10 MG, coumadn 5 mg") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = PretrainedPipeline("umls_drug_resolver_pipeline", "en", "clinical/models") val result = pipeline.annotate("The patient was given Adapin 10 MG, coumadn 5 mg") ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.umls_drug_resolver").predict("""The patient was given Adapin 10 MG, coumadn 5 mg""") ```
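`annotate()` returns plain Python lists per output column, so the chunk-to-code mapping can be tabulated directly. A sketch using values from the Results section of this card; the column names `ner_chunk` and `umls_code` are assumptions for illustration, not the pipeline's documented output names:

```python
# values taken from the Results table of this card; key names are hypothetical
annotations = {
    "ner_chunk": ["Adapin 10 MG", "coumadn 5 mg"],
    "umls_code": ["C2930083", "C2723075"],
}

# pair each recognized drug chunk with its resolved UMLS CUI
mapping = dict(zip(annotations["ner_chunk"], annotations["umls_code"]))
```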
## Results ```bash +------------+---------+---------+ |chunk |ner_label|umls_code| +------------+---------+---------+ |Adapin 10 MG|DRUG |C2930083 | |coumadn 5 mg|DRUG |C2723075 | +------------+---------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|umls_drug_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|4.6 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger --- layout: model title: English image_classifier_vit_violation_classification_bantai__v80ep ViTForImageClassification from AykeeSalazar author: John Snow Labs name: image_classifier_vit_violation_classification_bantai__v80ep date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_violation_classification_bantai__v80ep` is an English model originally trained by AykeeSalazar.
## Predicted Entities `Public Smoking`, `Public-Drinking`, `ambiguous`, `non-violation` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_violation_classification_bantai__v80ep_en_4.1.0_3.0_1660166952488.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_violation_classification_bantai__v80ep_en_4.1.0_3.0_1660166952488.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_violation_classification_bantai__v80ep", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_violation_classification_bantai__v80ep", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
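Under the hood, the classifier scores each image against the four labels listed under Predicted Entities and keeps the highest-scoring one. A toy sketch of that final softmax/argmax step; the logit values here are made up for illustration:

```python
import math

LABELS = ["Public Smoking", "Public-Drinking", "ambiguous", "non-violation"]

def classify(logits):
    # numerically stable softmax, then argmax over the four labels
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = probs.index(max(probs))
    return LABELS[best], probs[best]

label, prob = classify([0.2, 0.1, 0.4, 2.3])  # made-up logits
```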
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_violation_classification_bantai__v80ep| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Named Entity Recognition Profiling (Biobert) author: John Snow Labs name: ner_profiling_biobert date: 2022-01-21 tags: [ner, ner_profiling, clinical, biobert, en, licensed] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.3.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to explore all the available pretrained NER models at once. When you run this pipeline over your text, you will end up with the predictions coming out of each pretrained clinical NER model trained with `biobert_pubmed_base_cased`. It has been updated by adding NER model outputs to the previous version. Here are the NER models that this pretrained pipeline includes: `ner_jsl_enriched_biobert`, `ner_clinical_biobert`, `ner_chemprot_biobert`, `ner_jsl_greedy_biobert`, `ner_bionlp_biobert`, `ner_human_phenotype_go_biobert`, `jsl_rd_ner_wip_greedy_biobert`, `ner_posology_large_biobert`, `ner_risk_factors_biobert`, `ner_anatomy_coarse_biobert`, `ner_deid_enriched_biobert`, `ner_human_phenotype_gene_biobert`, `ner_jsl_biobert`, `ner_events_biobert`, `ner_deid_biobert`, `ner_posology_biobert`, `ner_diseases_biobert`, `jsl_ner_wip_greedy_biobert`, `ner_ade_biobert`, `ner_anatomy_biobert`, `ner_cellular_biobert` . 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.2.Pretrained_NER_Profiling_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_profiling_biobert_en_3.3.1_3.0_1642755851782.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_profiling_biobert_en_3.3.1_3.0_1642755851782.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline ner_profiling_pipeline = PretrainedPipeline('ner_profiling_biobert', 'en', 'clinical/models') result = ner_profiling_pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val ner_profiling_pipeline = PretrainedPipeline("ner_profiling_biobert", "en", "clinical/models") val result = ner_profiling_pipeline.annotate("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.profiling_biobert").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .""") ```
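Because this pipeline runs 21 NER models at once, `annotate()` returns one output column per model alongside the shared columns. A sketch of separating the per-model chunk columns, assuming (hypothetically, for illustration) a `_chunks` suffix naming convention:

```python
# hypothetical annotate() output keys: shared columns plus one "<model>_chunks"
# column per NER model -- the exact naming here is an assumption
keys = [
    "document", "sentence", "token",
    "ner_clinical_biobert_chunks", "ner_jsl_biobert_chunks",
    "ner_diseases_biobert_chunks",
]

# isolate the per-model chunk columns for downstream comparison
chunk_columns = sorted(k for k in keys if k.endswith("_chunks"))
```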
## Results ```bash ******************** ner_diseases_biobert Model Results ******************** [('gestational diabetes mellitus', 'Disease'), ('type two diabetes mellitus', 'Disease'), ('T2DM', 'Disease'), ('HTG-induced pancreatitis', 'Disease'), ('hepatitis', 'Disease'), ('obesity', 'Disease'), ('polyuria', 'Disease'), ('polydipsia', 'Disease'), ('poor appetite', 'Disease'), ('vomiting', 'Disease')] ******************** ner_events_biobert Model Results ******************** [('gestational diabetes mellitus', 'PROBLEM'), ('eight years', 'DURATION'), ('presentation', 'OCCURRENCE'), ('type two diabetes mellitus ( T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('three years', 'DURATION'), ('presentation', 'OCCURRENCE'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index', 'TEST'), ('BMI', 'TEST'), ('presented', 'OCCURRENCE'), ('a one-week', 'DURATION'), ('polyuria', 'PROBLEM'), ('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')] ******************** ner_jsl_biobert Model Results ******************** [('28-year-old', 'Age'), ('female', 'Gender'), ('gestational diabetes mellitus', 'Diabetes'), ('eight years prior', 'RelativeDate'), ('type two diabetes mellitus', 'Diabetes'), ('T2DM', 'Disease_Syndrome_Disorder'), ('HTG-induced pancreatitis', 'Disease_Syndrome_Disorder'), ('three years prior', 'RelativeDate'), ('acute', 'Modifier'), ('hepatitis', 'Disease_Syndrome_Disorder'), ('obesity', 'Obesity'), ('body mass index', 'BMI'), ('BMI ) of 33.5 kg/m2', 'BMI'), ('one-week', 'Duration'), ('polyuria', 'Symptom'), ('polydipsia', 'Symptom'), ('poor appetite', 'Symptom'), ('vomiting', 'Symptom')] ******************** ner_clinical_biobert Model Results ******************** [('gestational diabetes mellitus', 'PROBLEM'), ('subsequent type two diabetes mellitus ( T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index ( BMI )', 
'TEST'), ('polyuria', 'PROBLEM'), ('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')] ******************** ner_risk_factors_biobert Model Results ******************** [('diabetes mellitus', 'DIABETES'), ('subsequent type two diabetes mellitus', 'DIABETES'), ('obesity', 'OBESE')] ... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_profiling_biobert| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.3.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|750.0 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel(x21) - NerConverter (x21) - Finisher --- layout: model title: Bangla DistilBertForMaskedLM Cased model (from neuralspace-reverie) author: John Snow Labs name: distilbert_embeddings_indic_transformers date: 2022-12-12 tags: [bn, open_source, distilbert_embeddings, distilbertformaskedlm] task: Embeddings language: bn edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `indic-transformers-bn-distilbert` is a Bangla model originally trained by `neuralspace-reverie`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_indic_transformers_bn_4.2.4_3.0_1670864859945.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_indic_transformers_bn_4.2.4_3.0_1670864859945.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_indic_transformers","bn") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(False) pipeline = Pipeline(stages=[documentAssembler, tokenizer, distilbert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_indic_transformers","bn") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, distilbert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
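This model is loaded with `setCaseSensitive(False)`: tokens are matched against the vocabulary after lowercasing. A toy illustration of the difference that setting makes (the vocabulary here is made up):

```python
vocabulary = {"ami", "bangla", "bhalobashi"}  # made-up lowercase vocabulary

def in_vocab(token, case_sensitive=False):
    # with case sensitivity off, the lookup key is lowercased first
    key = token if case_sensitive else token.lower()
    return key in vocabulary

hit_insensitive = in_vocab("Bangla")                       # matches "bangla"
hit_sensitive = in_vocab("Bangla", case_sensitive=True)    # exact match fails
```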
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_indic_transformers| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|bn| |Size:|248.3 MB| |Case sensitive:|false| ## References - https://huggingface.co/neuralspace-reverie/indic-transformers-bn-distilbert - https://oscar-corpus.com/ --- layout: model title: English DistilBertForQuestionAnswering model author: John Snow Labs name: distilbert_base_cased_qa_squad2 date: 2022-06-15 tags: [open_source, distilbert, question_answering, en] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad` is an English model originally trained by Hugging Face. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_base_cased_qa_squad2_en_4.0.0_3.0_1655291785089.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_base_cased_qa_squad2_en_4.0.0_3.0_1655291785089.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_base_cased_qa_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_base_cased_qa_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.distil_bert.base_cased").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
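Extractive QA models of this kind score every token as a potential answer start and end, and the answer is the highest-scoring valid span in the context. A simplified sketch of that span selection with made-up scores:

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    # pick (i, j) with i <= j maximizing start_scores[i] + end_scores[j]
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            if s + end_scores[j] > best[2]:
                best = (i, j, s + end_scores[j])
    return best[0], best[1]

tokens = ["My", "name", "is", "Clara"]
start, end = best_span([0.1, 0.2, 0.1, 3.0], [0.0, 0.1, 0.0, 2.5])  # made-up logits
answer = " ".join(tokens[start:end + 1])
```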
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_base_cased_qa_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References https://huggingface.co/distilbert-base-cased-distilled-squad --- layout: model title: English asr_wav2vec2_base_timit_demo_colab647 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab647 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab647` is an English model originally trained by hassnain. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab647_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab647_en_4.2.0_3.0_1664022878647.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab647_en_4.2.0_3.0_1664022878647.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab647', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab647", lang = "en") val annotations = pipeline.transform(audioDF) ```
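The `audio_content` column is expected to hold floating-point audio samples. If your source audio is signed 16-bit PCM (as in typical WAV files), the conventional conversion divides by 32768; a library-free sketch of that step (the helper name is ours, not a Spark NLP API):

```python
def pcm16_to_float(samples):
    # map signed 16-bit integers [-32768, 32767] into [-1.0, 1.0)
    return [s / 32768.0 for s in samples]

floats = pcm16_to_float([0, 16384, -32768, 32767])
```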
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab647| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering model (from nickmuchi) author: John Snow Labs name: bert_qa_nickmuchi_bert_finetuned_squad date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `nickmuchi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_nickmuchi_bert_finetuned_squad_en_4.0.0_3.0_1654535729252.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_nickmuchi_bert_finetuned_squad_en_4.0.0_3.0_1654535729252.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_nickmuchi_bert_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_nickmuchi_bert_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_nickmuchi").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
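The Scala example sets `setMaxSentenceLength(512)`, so contexts longer than that must be truncated or split before inference. One common workaround, sketched here as a hypothetical pre-processing step (not a Spark NLP API), is a sliding window with overlap so the answer cannot be lost at a chunk boundary:

```python
def split_context(tokens, max_len=512, stride=128):
    # hypothetical sliding window: consecutive chunks overlap by `stride` tokens
    if len(tokens) <= max_len:
        return [tokens]
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            return chunks
        start += max_len - stride

windows = split_context(list(range(1000)))
```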
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_nickmuchi_bert_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/nickmuchi/bert-finetuned-squad --- layout: model title: Multilabel Classification of NDA Clauses (sentences, small) author: John Snow Labs name: legmulticlf_mnda_sections date: 2023-02-02 tags: [nda, en, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: MultiClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model should be run on each sentence of the NDA clauses, and will retrieve a series of 1..N labels for each of them. The possible clause types detected by this model in NDA / MNDA agreements are: 1. Parties to the Agreement - Names of the Parties Clause 2. Identification of What Information Is Confidential - Definition of Confidential Information Clause 3. Use of Confidential Information: Permitted Use Clause and Obligations of the Recipient 4. Time Frame of the Agreement - Termination Clause 5. Return of Confidential Information Clause 6. Remedies for Breaches of Agreement - Remedies Clause 7. Non-Solicitation Clause 8. Dispute Resolution Clause 9. Exceptions Clause 10.
Non-competition clause ## Predicted Entities `APPLIC_LAW`, `ASSIGNMENT`, `DEF_OF_CONF_INFO`, `DISPUTE_RESOL`, `EXCEPTIONS`, `NAMES_OF_PARTIES`, `NON_COMP`, `NON_SOLIC`, `PREAMBLE`, `REMEDIES`, `REQ_DISCL`, `RETURN_OF_CONF_INFO`, `TERMINATION`, `USE_OF_CONF_INFO` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legmulticlf_mnda_sections_en_1.0.0_3.0_1675361534773.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legmulticlf_mnda_sections_en_1.0.0_3.0_1675361534773.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = ( nlp.DocumentAssembler().setInputCol("text").setOutputCol("document") ) sentence_splitter = ( nlp.SentenceDetector() .setInputCols(["document"]) .setOutputCol("sentence") .setCustomBounds(["\n"]) ) embeddings = ( nlp.UniversalSentenceEncoder.pretrained() .setInputCols("sentence") .setOutputCol("sentence_embeddings") ) classifierdl_pred = nlp.MultiClassifierDLModel.pretrained('legmulticlf_mnda_sections', 'en', 'legal/models')\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("class") clf_pipeline = nlp.Pipeline(stages=[document_assembler, sentence_splitter, embeddings, classifierdl_pred]) df = spark.createDataFrame([["Governing Law.\nThis Agreement shall be govern..."]]).toDF("text") res = clf_pipeline.fit(df).transform(df) res.select('text', 'class.result').show() ```
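MultiClassifierDL is multilabel: each of the 14 classes receives an independent score, and every label above a decision threshold is returned for the sentence (0.5 is assumed here as the default; adjust to your model's configuration). A minimal sketch of that selection step with made-up scores:

```python
def select_labels(scores, threshold=0.5):
    # keep every class whose independent score clears the threshold
    return sorted(label for label, score in scores.items() if score >= threshold)

# made-up sigmoid scores for one sentence
selected = select_labels({"APPLIC_LAW": 0.97, "PREAMBLE": 0.31, "DISPUTE_RESOL": 0.55})
```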
## Results ```bash [APPLIC_LAW] Governing Law.\nThis Agreement shall be govern... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legmulticlf_mnda_sections| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|12.9 MB| ## References In-house MNDA ## Benchmarking ```bash label precision recall f1-score support APPLIC_LAW 0.93 0.96 0.95 28 ASSIGNMENT 0.95 0.91 0.93 22 DEF_OF_CONF_INFO 0.92 0.80 0.86 30 DISPUTE_RESOL 0.76 0.89 0.82 28 EXCEPTIONS 0.77 0.91 0.83 11 NAMES_OF_PARTIES 0.94 0.88 0.91 33 NON_COMP 1.00 0.91 0.95 23 NON_SOLIC 0.88 0.94 0.91 16 PREAMBLE 0.79 0.85 0.81 26 REMEDIES 0.91 0.91 0.91 32 REQ_DISCL 0.92 0.92 0.92 13 RETURN_OF_CONF_INFO 1.00 0.96 0.98 24 TERMINATION 1.00 0.77 0.87 13 USE_OF_CONF_INFO 0.85 0.88 0.86 32 micro-avg 0.89 0.89 0.89 331 macro-avg 0.90 0.89 0.89 331 weighted-avg 0.90 0.89 0.89 331 samples-avg 0.87 0.89 0.88 331 ``` --- layout: model title: Stop Words Cleaner for Hungarian author: John Snow Labs name: stopwords_hu date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: hu edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, hu] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_hu_hu_2.5.4_2.4_1594742441137.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_hu_hu_2.5.4_2.4_1594742441137.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_hu", "hu") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Az északi király kivételével John Snow angol orvos, vezető szerepet játszik az érzéstelenítés és az orvosi higiénia fejlesztésében.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_hu", "hu") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Az északi király kivételével John Snow angol orvos, vezető szerepet játszik az érzéstelenítés és az orvosi higiénia fejlesztésében.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Az északi király kivételével John Snow angol orvos, vezető szerepet játszik az érzéstelenítés és az orvosi higiénia fejlesztésében."""] stopword_df = nlu.load('hu.stopwords').predict(text) stopword_df[['cleanTokens']] ```
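Conceptually, the cleaner drops any token found in a Hungarian stop-word list, and as the Results show, matching ignores case ("Az" is removed even though it is capitalized). A toy sketch with a tiny illustrative subset of such a list:

```python
hungarian_stopwords = {"az", "a", "és", "hogy"}  # tiny illustrative subset

def clean_tokens(tokens):
    # case-insensitive membership test, mirroring the cleaner's behaviour
    return [t for t in tokens if t.lower() not in hungarian_stopwords]

kept = clean_tokens(["Az", "északi", "király", "és", "az", "orvosi"])
```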
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=3, end=8, result='északi', metadata={'sentence': '0'}), Row(annotatorType='token', begin=10, end=15, result='király', metadata={'sentence': '0'}), Row(annotatorType='token', begin=17, end=27, result='kivételével', metadata={'sentence': '0'}), Row(annotatorType='token', begin=29, end=32, result='John', metadata={'sentence': '0'}), Row(annotatorType='token', begin=34, end=37, result='Snow', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_hu| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|hu| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: English BertForQuestionAnswering model (from HomayounSadri) author: John Snow Labs name: bert_qa_bert_base_uncased_finetuned_squad_v2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-finetuned-squad-v2` is an English model originally trained by `HomayounSadri`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_finetuned_squad_v2_en_4.0.0_3.0_1654181219445.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_finetuned_squad_v2_en_4.0.0_3.0_1654181219445.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_finetuned_squad_v2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_finetuned_squad_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.base_uncased_v2").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
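Extractive QA heads like this one score every token position of the context as a candidate answer start and end, and the predicted answer is the highest-scoring valid span. A toy pure-Python sketch of that span selection (the logits below are invented, not real model scores):

```python
def best_span(start_logits, end_logits, max_len=30):
    """Return the (start, end) pair maximizing start+end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end   = [0.1, 0.1, 0.2, 4.8, 0.2, 0.1, 0.1, 0.1, 0.4, 0.1]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```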
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_finetuned_squad_v2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/HomayounSadri/bert-base-uncased-finetuned-squad-v2 --- layout: model title: BERT Embeddings (Large Uncased) author: John Snow Labs name: bert_large_uncased date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains a deep bidirectional transformer trained on Wikipedia and the BookCorpus. The details are described in the paper "[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_large_uncased_en_2.6.0_2.4_1598341287005.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_large_uncased_en_2.6.0_2.4_1598341287005.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("bert_large_uncased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("bert_large_uncased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.large_uncased').predict(text, output_level='token') embeddings_df ```
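A typical downstream use of token embeddings like these is similarity comparison via cosine similarity. A minimal sketch over toy 4-dimensional vectors (real `bert_large_uncased` vectors have 1024 dimensions; the numbers are invented):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

love = [-0.39, -0.41, 0.25, 0.10]  # toy values, not real model output
like = [-0.35, -0.38, 0.20, 0.12]
rock = [0.80, 0.10, -0.60, 0.05]

print(cosine(love, like))  # high: vectors point in nearly the same direction
print(cosine(love, rock))  # much lower
```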
{:.h2_title} ## Results ```bash en_embed_bert_large_uncased_embeddings token [-0.07447264343500137, -0.337308406829834, -0.... I [-0.5735481977462769, -0.3580206632614136, -0.... love [-0.3929762840270996, -0.4147087037563324, 0.2... NLP ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_large_uncased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|1024| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/google/bert_uncased_L-24_H-1024_A-16/1](https://tfhub.dev/google/bert_uncased_L-24_H-1024_A-16/1) --- layout: model title: English asr_wav2vec2_base_timit_demo_test_jong TFWav2Vec2ForCTC from prows12 author: John Snow Labs name: asr_wav2vec2_base_timit_demo_test_jong date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_test_jong` is an English model originally trained by prows12. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_test_jong_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_test_jong_en_4.2.0_3.0_1664101526481.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_test_jong_en_4.2.0_3.0_1664101526481.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_test_jong", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_test_jong", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
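The CTC head of a Wav2Vec2 model emits one character prediction per audio frame; greedy decoding then collapses consecutive repeats and removes the blank symbol. A minimal sketch of that decoding step (the frame labels and blank symbol are made up for illustration):

```python
BLANK = "_"  # stand-in for the CTC blank token

def ctc_greedy_decode(frames, blank=BLANK):
    """Collapse repeated frame labels, then drop blanks."""
    out, prev = [], None
    for ch in frames:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_greedy_decode(list("_hh_e_ll_lloo__")))  # hello
```

Note how the blank between the two `ll` runs is what keeps the double "l" of "hello" from being collapsed into one.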
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_test_jong| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|349.4 MB| --- layout: model title: Legal In witness whereof Clause Binary Classifier author: John Snow Labs name: legclf_in_witness_whereof_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `in-witness-whereof` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens; if your text is longer than that, consider splitting it into smaller pieces (see the same tutorial linked above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
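For the paragraph-splitting technique mentioned in the description, a minimal pure-Python sketch (splitting on blank lines) looks like this; the Legal NLP tutorial linked above covers more robust approaches:

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines, dropping empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("1. Definitions.\nConfidential Information means...\n\n"
       "IN WITNESS WHEREOF, the parties have executed this Agreement.")
print(split_paragraphs(doc))
```

Each resulting paragraph can then be fed to the classifier as one row of the `clause_text` column.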
## Predicted Entities `other`, `in-witness-whereof` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_in_witness_whereof_clause_en_1.0.0_3.2_1660122497421.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_in_witness_whereof_clause_en_1.0.0_3.2_1660122497421.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_in_witness_whereof_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
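When several of these binary clause classifiers are run over the same paragraph, their labels can be merged into one True/False record per clause type, as the description suggests. A hypothetical pure-Python sketch of that aggregation (the model names and outputs here are invented):

```python
def merge_clause_flags(predictions):
    """Map each classifier's predicted label to True unless it is 'other'."""
    return {clause: label != "other" for clause, label in predictions.items()}

# Hypothetical per-model outputs for a single paragraph.
preds = {
    "in-witness-whereof": "in-witness-whereof",
    "governing-law": "other",
    "non-compete": "other",
}
print(merge_clause_flags(preds))
# {'in-witness-whereof': True, 'governing-law': False, 'non-compete': False}
```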
## Results

```bash
+--------------------+
|              result|
+--------------------+
|[in-witness-whereof]|
|             [other]|
|             [other]|
|[in-witness-whereof]|
+--------------------+
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_in_witness_whereof_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support in-witness-whereof 0.91 1.00 0.95 21 other 1.00 0.97 0.99 69 accuracy - - 0.98 90 macro-avg 0.96 0.99 0.97 90 weighted-avg 0.98 0.98 0.98 90 ``` --- layout: model title: Detect Clinical Entities (Slim version, BertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_jsl_slim date: 2022-01-06 tags: [ner, bertfortokenclassification, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a pretrained named entity recognition deep learning model for clinical terminology. It is based on the `bert_token_classifier_ner_jsl` model, but with more generalized entities. This model is trained with the BertForTokenClassification method from the `transformers` library and imported into Spark NLP. Definitions of Predicted Entities: - `Death_Entity`: Mentions that indicate the death of a patient. - `Medical_Device`: All mentions related to medical devices and supplies. - `Vital_Signs_Header`: Identifies section headers that correspond to Vital Signs of a patient. - `Allergen`: Allergen related extractions mentioned in the document. - `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients.
- `Clinical_Dept`: Terms that indicate the medical and/or surgical departments. - `Symptom`: All the symptoms mentioned in the document, of a patient or someone else. - `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by the naked eye. - `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient. - `Age`: All mentions of ages, past or present, related to the patient or anybody else. - `Birth_Entity`: Mentions that indicate giving birth. - `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else. - `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs). - `Test_Result`: Terms related to all the test results present in the document (clinical test results are included). - `Test`: Mentions of laboratory, pathology, and radiological tests. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments. - `Treatment`: Includes therapeutic and minimally invasive treatments and procedures (invasive treatments or procedures are extracted as "Procedure"). - `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease", etc.).
## Predicted Entities `Death_Entity`, `Medical_Device`, `Vital_Sign`, `Alergen`, `Drug`, `Clinical_Dept`, `Lifestyle`, `Symptom`, `Body_Part`, `Physical_Measurement`, `Admission_Discharge`, `Date_Time`, `Age`, `Birth_Entity`, `Header`, `Oncological`, `Substance_Quantity`, `Test_Result`, `Test`, `Procedure`, `Treatment`, `Disease_Syndrome_Disorder`, `Pregnancy_Newborn`, `Demographics` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_BERT_TOKEN_CLASSIFIER/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BERT_TOKEN_CLASSIFIER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_slim_en_3.3.4_2.4_1641473775238.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_slim_en_3.3.4_2.4_1641473775238.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_jsl_slim", "en", "clinical/models")\ .setInputCols("token", "sentence")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) sample_text = """HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer.""" data = spark.createDataFrame([[sample_text]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_jsl_slim", "en", "clinical/models") .setInputCols(Array("token", "sentence")) .setOutputCol("ner") .setCaseSensitive(true) val
ner_converter = new NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val data = Seq("""HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_jsl_slim").predict("""HISTORY: 30-year-old female presents for digital bilateral mammography secondary to a soft tissue lump palpated by the patient in the upper right shoulder. The patient has a family history of breast cancer within her mother at age 58. Patient denies personal history of breast cancer.""") ```
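The NerConverter stage merges token-level B-/I- tags into entity chunks. A minimal pure-Python sketch of that merging logic, using toy tags in the style of this model's labels:

```python
def bio_to_chunks(tokens, tags):
    """Group B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" or an inconsistent I- tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["soft", "tissue", "lump", "in", "the", "shoulder"]
tags = ["B-Symptom", "I-Symptom", "I-Symptom", "O", "O", "B-Body_Part"]
print(bio_to_chunks(tokens, tags))
# [('soft tissue lump', 'Symptom'), ('shoulder', 'Body_Part')]
```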
## Results ```bash +----------------+------------+ |chunk |ner_label | +----------------+------------+ |HISTORY: |Header | |30-year-old |Age | |female |Demographics| |mammography |Test | |soft tissue lump|Symptom | |shoulder |Body_Part | |breast cancer |Oncological | |her mother |Demographics| |age 58 |Age | |breast cancer |Oncological | +----------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_jsl_slim| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## Data Source Trained on data annotated by JSL. ## Benchmarking ```bash label precision recall f1-score support B-Admission_Discharge 0.82 0.99 0.90 282 B-Age 0.88 0.83 0.85 576 B-Body_Part 0.84 0.91 0.87 8582 B-Clinical_Dept 0.86 0.94 0.90 909 B-Date_Time 0.82 0.77 0.79 1062 B-Death_Entity 0.66 0.98 0.79 43 B-Demographics 0.97 0.98 0.98 5285 B-Disease_Syndrome_Disorder 0.84 0.89 0.86 4259 B-Drug 0.88 0.87 0.87 2555 B-Header 0.97 0.66 0.78 3911 B-Lifestyle 0.77 0.83 0.80 371 B-Medical_Device 0.84 0.87 0.85 3605 B-Oncological 0.86 0.91 0.89 408 B-Physical_Measurement 0.84 0.81 0.82 135 B-Pregnancy_Newborn 0.66 0.71 0.68 245 B-Procedure 0.82 0.88 0.85 2654 B-Symptom 0.83 0.86 0.85 6545 B-Test 0.82 0.83 0.83 2448 B-Test_Result 0.76 0.81 0.78 1280 B-Treatment 0.70 0.76 0.73 275 B-Vital_Sign 0.85 0.87 0.86 627 I-Age 0.84 0.90 0.87 166 I-Alergen 0.00 0.00 0.00 5 I-Body_Part 0.86 0.89 0.88 4946 I-Clinical_Dept 0.92 0.93 0.93 806 I-Date_Time 0.82 0.91 0.86 1173 I-Demographics 0.89 0.84 0.86 416 I-Disease_Syndrome_Disorder 0.87 0.85 0.86 4385 I-Drug 0.83 0.86 0.85 5199 I-Header 0.85 0.97 0.90 6763 I-Lifestyle 0.77 0.69 0.73 134 I-Medical_Device 0.86 0.86 0.86 2341 I-Oncological 0.85 0.94 0.89 515 I-Physical_Measurement 0.88 0.94 0.91 329 I-Pregnancy_Newborn 0.66 0.70 0.68 273
I-Procedure 0.87 0.86 0.87 3414 I-Symptom 0.79 0.75 0.77 6485 I-Test 0.82 0.77 0.79 2283 I-Test_Result 0.67 0.56 0.61 649 I-Treatment 0.69 0.72 0.70 194 I-Vital_Sign 0.88 0.90 0.89 918 O 0.97 0.97 0.97 210520 accuracy - - 0.94 297997 macro-avg 0.74 0.74 0.73 297997 weighted-avg 0.94 0.94 0.94 297997 ``` --- layout: model title: Russian DistilBertForQuestionAnswering Cased model (from AndrewChar) author: John Snow Labs name: distilbert_qa_model_5_epoch date: 2023-01-03 tags: [ru, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: ru edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `model-QA-5-epoch-RU` is a Russian model originally trained by `AndrewChar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_model_5_epoch_ru_4.3.0_3.0_1672775276891.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_model_5_epoch_ru_4.3.0_3.0_1672775276891.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_model_5_epoch","ru")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_model_5_epoch","ru") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_model_5_epoch| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ru| |Size:|505.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AndrewChar/model-QA-5-epoch-RU --- layout: model title: Sentiment Analysis of IMDB Reviews author: John Snow Labs name: sentimentdl_glove_imdb date: 2021-01-09 task: Sentiment Analysis language: en nav_key: models edition: Spark NLP 2.7.1 spark_version: 2.4 tags: [open_source, en, sentiment] supported: true annotator: SentimentDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Classify IMDB reviews in negative and positive categories. ## Predicted Entities `neg`, `pos` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_EN/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentimentdl_glove_imdb_en_2.7.1_2.4_1610208660282.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentimentdl_glove_imdb_en_2.7.1_2.4_1610208660282.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel().pretrained("glove_100d")\ .setInputCols(['document','tokens'])\ .setOutputCol('word_embeddings') sentence_embeddings = SentenceEmbeddings() \ .setInputCols(["document", "word_embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") classifier = SentimentDLModel().pretrained('sentimentdl_glove_imdb')\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("sentiment") nlp_pipeline = Pipeline(stages=[document_assembler, sentencer, tokenizer, embeddings, sentence_embeddings, classifier]) l_model = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = l_model.fullAnnotate('Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now!') ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_100d") .setInputCols(Array("document", "tokens")) .setOutputCol("word_embeddings") val sentence_embeddings = new SentenceEmbeddings() .setInputCols(Array("document", "word_embeddings")) .setOutputCol("sentence_embeddings") .setPoolingStrategy("AVERAGE") val classifier = SentimentDLModel.pretrained("sentimentdl_glove_imdb") .setInputCols(Array("sentence_embeddings")) .setOutputCol("sentiment") val pipeline = new Pipeline().setStages(Array(document_assembler, sentencer, tokenizer, embeddings, sentence_embeddings, classifier)) val data = Seq("Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad!
Horror and sword fight freaks,buy this movie now!").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.sentiment.imdb.glove").predict("""Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now!""") ```
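The AVERAGE pooling strategy configured on SentenceEmbeddings above means each sentence vector is simply the element-wise mean of its word vectors. A minimal sketch over toy 3-dimensional vectors (real `glove_100d` vectors have 100 dimensions):

```python
def average_pool(word_vectors):
    """Element-wise mean of a list of equal-length word vectors."""
    n = len(word_vectors)
    dim = len(word_vectors[0])
    return [sum(v[i] for v in word_vectors) / n for i in range(dim)]

vectors = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]  # toy word vectors
print(average_pool(vectors))  # [2.0, 2.0, 2.0]
```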
## Results

```bash
|    | document | sentiment |
|---:|:---|:---|
|  0 | Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now! | positive |
```

{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentimentdl_glove_imdb| |Compatibility:|Spark NLP 2.7.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[sentiment]| |Language:|en| |Dependencies:|glove_840B_300| ## Data Source https://ai.stanford.edu/~amaas/data/sentiment/ ## Benchmarking ```bash precision recall f1-score support neg 0.85 0.85 0.85 12500 pos 0.87 0.83 0.85 12500 accuracy 0.84 25000 macro avg 0.86 0.84 0.85 25000 weighted avg 0.86 0.84 0.85 25000 ``` --- layout: model title: Translate Fijian to English Pipeline author: John Snow Labs name: translate_fj_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, fj, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list).
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `fj` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_fj_en_xx_2.7.0_2.4_1609687193478.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_fj_en_xx_2.7.0_2.4_1609687193478.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_fj_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_fj_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.fj.translate_to.en').predict(text, output_level='sentence') translate_df ```
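Since translation cost grows quickly with sequence length (see the note above), long inputs are often split into sentences before being passed to `annotate`. A naive pure-Python sentence-splitter sketch (real pipelines use a proper sentence detector instead):

```python
import re

def naive_sentences(text):
    """Very rough split on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(naive_sentences("Bula! This is the first sentence. This is the second one."))
# ['Bula!', 'This is the first sentence.', 'This is the second one.']
```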
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_fj_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Swiss Judgements Classification (German) author: John Snow Labs name: legclf_bert_swiss_judgements date: 2022-10-27 tags: [de, legal, licensed, sequence_classification] task: Text Classification language: de edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Bert-based classifier that can be used to classify Swiss Judgement documents in the German language into the following 6 classes according to their case area. It has been trained with a SOTA approach. ## Predicted Entities `public law`, `civil law`, `insurance law`, `social law`, `penal law`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_bert_swiss_judgements_de_1.0.0_3.0_1666863676063.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_bert_swiss_judgements_de_1.0.0_3.0_1666863676063.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = nlp.Tokenizer()\ .setInputCols(['document'])\ .setOutputCol("token") clf_model = legal.BertForSequenceClassification.pretrained("legclf_bert_swiss_judgements", "de", "legal/models")\ .setInputCols(['document','token'])\ .setOutputCol('class')\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) clf_pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, clf_model ]) data = spark.createDataFrame([["""Sachverhalt: A. Mit Strafbefehl vom 30. Juli 2015 sprach die Staatsanwaltschaft Lenzburg-Aarau gegen X._ eine bedingte Geldstrafe von 150 Tagessätzen zu Fr. 150.-- (Probezeit vier Jahre) sowie eine Busse von Fr. 4'500.-- aus wegen Führens eines Motorfahrzeugs in angetrunkenem Zustand sowie wegen mehrfacher Anstiftung zu falschem Zeugnis. Die Staatsanwaltschaft legte X._ unter anderem zur Last, am 5. Juli 2013 nach Aussage von Zeugen sein Auto mit einem Blutalkoholgehalt von mindestens 2,12 Promille bestiegen und von Lenzburg an seinen Wohnort in Z._ gelenkt zu haben. Das nach Einsprache von X._ mit der Sache befasste Bezirksgericht Lenzburg sprach ihn vom Vorwurf der mehrfachen Anstiftung zu falschem Zeugnis frei und verurteilte ihn wegen Führens eines Motorfahrzeugs in angetrunkenem Zustand zu einer bedingten Geldstrafe von 105 Tagessätzen zu Fr. 210.-- (Probezeit zwei Jahre) und zu einer Busse von Fr. 4'400.-- (Urteil vom 15. August 2016). B. X._ erhob Berufung. Das Obergericht des Kantons Aargau wies das Rechtsmittel ab (Urteil vom 3. Juli 2017). C. Mit Beschwerde in Strafsachen beantragt X._, das angefochtene Urteil sei aufzuheben und er von Schuld und Strafe freizusprechen."""]]).toDF("text") result = clf_pipeline.fit(data).transform(data) ```
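The classifier above caps its input at 512 tokens (`setMaxSentenceLength(512)`), so longer judgements are truncated. A minimal sketch of pre-chunking a token list before classification — `chunk_tokens` is an illustrative helper, not a Spark NLP API:

```python
# Hypothetical helper: split a token list into consecutive chunks of at most
# max_len tokens, so each chunk fits the model's 512-token limit.
def chunk_tokens(tokens, max_len=512):
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

tokens = [f"tok{i}" for i in range(1100)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [512, 512, 76]
```

Each chunk can then be classified separately and the per-chunk labels aggregated as you see fit.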
## Results ```bash +----------------------------------------------------------------------------------------------------+---------+ | document| class| +----------------------------------------------------------------------------------------------------+---------+ |Sachverhalt: A. Mit Strafbefehl vom 30. Juli 2015 sprach die Staatsanwaltschaft Lenzburg-Aarau ge...|penal law| +----------------------------------------------------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_bert_swiss_judgements| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|de| |Size:|405.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References Training data is available [here](https://zenodo.org/record/7109926#.Y1gJwexBw8E). ## Benchmarking ```bash label precision recall f1-score support civil-law 0.93 0.96 0.94 809 insurance-law 0.92 0.94 0.93 357 other 0.76 0.70 0.73 23 penal-law 0.97 0.95 0.96 913 public-law 0.94 0.94 0.94 1048 social-law 0.97 0.95 0.96 719 accuracy - - 0.95 3869 macro-avg 0.92 0.91 0.91 3869 weighted-avg 0.95 0.95 0.95 3869 ``` --- layout: model title: Legal Vesting Clause Binary Classifier author: John Snow Labs name: legclf_vesting_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `vesting` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.
If you have large legal documents and want to look for clauses, we recommend splitting them using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Bear in mind that this model's embeddings allow up to 512 tokens; if your text is longer, consider splitting it into smaller pieces (you can also check the same tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `vesting` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_vesting_clause_en_1.0.0_3.2_1660124119544.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_vesting_clause_en_1.0.0_3.2_1660124119544.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_vesting_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
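The paragraph splitting (by multiline) technique recommended for long documents can be sketched in plain Python before feeding each paragraph to this classifier; `split_paragraphs` is a hypothetical helper shown only to illustrate the idea, not a Spark NLP API:

```python
import re

def split_paragraphs(text):
    """Split a document into paragraphs on blank lines (multiline splitting)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "VESTING.\nShares vest over four years.\n\nGOVERNING LAW.\nDelaware law applies."
print(split_paragraphs(doc))
```

Each resulting paragraph can then be placed in its own row of the `clause_text` column.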
## Results ```bash +-------+ | result| +-------+ |[vesting]| |[other]| |[other]| |[vesting]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_vesting_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 0.96 0.98 84 vesting 0.90 1.00 0.95 28 accuracy - - 0.97 112 macro-avg 0.95 0.98 0.97 112 weighted-avg 0.98 0.97 0.97 112 ``` --- layout: model title: Classifier for Adverse Drug Events author: John Snow Labs name: classifierdl_ade_biobert date: 2020-09-30 task: Text Classification language: en nav_key: models edition: Healthcare NLP 2.6.2 spark_version: 2.4 tags: [classifier, en, clinical, licensed] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model classifies whether a text is ADE-related (``True``) or not (``False``). {:.h2_title} ## Predicted Entities ``True``, ``False``.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifierdl_ade_biobert_en_2.6.0_2.4_1601594685053.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifierdl_ade_biobert_en_2.6.0_2.4_1601594685053.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use To classify whether your text is ADE-related, you can use this model as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, BertEmbeddings (``biobert_pubmed_base_cased``), SentenceEmbeddings, ClassifierDLModel.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained('biobert_pubmed_base_cased')\ .setInputCols(["document", 'token'])\ .setOutputCol("word_embeddings") sentence_embeddings = SentenceEmbeddings() \ .setInputCols(["document", "word_embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") classifier = ClassifierDLModel.pretrained('classifierdl_ade_biobert', 'en', 'clinical/models')\ .setInputCols(['document', 'token', 'sentence_embeddings']).setOutputCol('class') nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings, sentence_embeddings, classifier]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate("I feel a bit drowsy & have a little blurred vision after taking an insulin") ``` ```scala ... val embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("document", "token")) .setOutputCol("word_embeddings") val sentence_embeddings = new SentenceEmbeddings() .setInputCols(Array("document", "word_embeddings")) .setOutputCol("sentence_embeddings") .setPoolingStrategy("AVERAGE") val classifierADE = ClassifierDLModel.pretrained("classifierdl_ade_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "sentence_embeddings")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, embeddings, sentence_embeddings, classifierADE)) val data = Seq("I feel a bit drowsy & have a little blurred vision after taking an insulin").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.ade.biobert").predict("""I feel a bit drowsy & have a little blurred vision after taking an insulin""") ```
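The `SentenceEmbeddings` stage above uses the `AVERAGE` pooling strategy: conceptually, the sentence vector is the element-wise mean of its word vectors. A toy sketch of that pooling step (plain Python for illustration, not the Spark NLP implementation):

```python
# AVERAGE pooling: the sentence embedding is the per-dimension mean of the
# word embeddings produced by the upstream BertEmbeddings stage.
def average_pool(word_vectors):
    dim = len(word_vectors[0])
    n = len(word_vectors)
    return [sum(v[i] for v in word_vectors) / n for i in range(dim)]

vectors = [[1.0, 2.0], [3.0, 4.0]]  # two word vectors of dimension 2
print(average_pool(vectors))  # [2.0, 3.0]
```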
{:.h2_title} ## Results ``True``: the sentence mentions a possible ADE. ``False``: the sentence doesn't contain any information about an ADE. ```bash 'True' ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_ade_biobert| |Type:|ClassifierDLModel| |Compatibility:|Healthcare NLP 2.6.2 +| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|[en]| |Case sensitive:|True| {:.h2_title} ## Data Source Trained on a custom dataset comprising CADEC, DRUG-AE, and Twimed, using ``biobert_pubmed_base_cased`` embeddings. {:.h2_title} ## Benchmarking ```bash | | label | prec | rec | f1 | |---:|-----------------:|-------:|-------:|-------:| | 0 | False | 0.9469 | 0.9327 | 0.9398 | | 1 | True | 0.7603 | 0.8030 | 0.7811 | | 2 | Macro-average | 0.8536 | 0.8679 | 0.8604 | | 3 | Weighted-average | 0.9077 | 0.9055 | 0.9065 | ``` --- layout: model title: Explain Document Pipeline for Finnish author: John Snow Labs name: explain_document_md date: 2021-03-22 tags: [open_source, finnish, explain_document_md, pipeline, fi] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: fi edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_md is a pretrained pipeline that can be used to process text, performing basic processing steps.
It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_fi_3.0.0_3.0_1616437919150.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_md_fi_3.0.0_3.0_1616437919150.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_md', lang = 'fi') annotations = pipeline.fullAnnotate("Hei John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_md", lang = "fi") val result = pipeline.fullAnnotate("Hei John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hei John Snow Labs! "] result_df = nlu.load('fi.explain.md').predict(text) result_df ```
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:-------------------------|:------------------------|:---------------------------------|:---------------------------------|:------------------------------------|:-----------------------------|:---------------------------------|:--------------------| | 0 | ['Hei John Snow Labs! '] | ['Hei John Snow Labs!'] | ['Hei', 'John', 'Snow', 'Labs!'] | ['hei', 'John', 'Snow', 'Labs!'] | ['INTJ', 'PROPN', 'PROPN', 'PROPN'] | [[0.1868100017309188,.,...]] | ['O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_10 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-32-finetuned-squad-seed-10` is a English model orginally trained by `anas-awadalla`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_10_en_4.0.0_3.0_1654191523682.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_10_en_4.0.0_3.0_1654191523682.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_10","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_32d_seed_10").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
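In the NLU one-liner above, the question and context travel as a single string separated by `|||`. A minimal sketch of how such an input can be split back into its two parts (illustrative only; `split_qa` is not an NLU API):

```python
# Split a "question|||context" string into its two components.
def split_qa(text, sep="|||"):
    question, context = text.split(sep, 1)
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
print(c)  # My name is Clara and I live in Berkeley.
```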
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|377.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-32-finetuned-squad-seed-10 --- layout: model title: Legal Events Of Default Clause Binary Classifier author: John Snow Labs name: legclf_events_of_default_clause date: 2022-12-18 tags: [en, legal, classification, licensed, clause, bert, events, of, default, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the events-of-default clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it’s better to skip them, unless you want to do Binary Classification at the sentence level. If you have large legal documents and want to look for clauses, we recommend splitting them using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Bear in mind that this model's embeddings allow up to 512 tokens; if your text is longer, consider splitting it into smaller pieces (you can also check the same tutorial linked above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `events-of-default`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_events_of_default_clause_en_1.0.0_3.0_1671393644005.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_events_of_default_clause_en_1.0.0_3.0_1671393644005.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_events_of_default_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
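The `ClassifierDLModel` stage above emits one score per label (`events-of-default`, `other`) and predicts the highest-scoring one. A toy sketch of that softmax-and-argmax step, with made-up logits rather than the model's actual scores:

```python
import math

# Pick the label with the highest softmax probability from raw logits.
def predict(logits, labels):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return labels[probs.index(max(probs))]

# Hypothetical logits favoring the first label.
print(predict([2.1, -0.5], ["events-of-default", "other"]))  # events-of-default
```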
## Results ```bash +-------+ |result| +-------+ |[events-of-default]| |[other]| |[other]| |[events-of-default]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_events_of_default_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support events-of-default 0.92 0.96 0.94 24 other 0.97 0.95 0.96 39 accuracy - - 0.95 63 macro-avg 0.95 0.95 0.95 63 weighted-avg 0.95 0.95 0.95 63 ``` --- layout: model title: Language Detection & Identification Pipeline - 375 Languages author: John Snow Labs name: detect_language_375 date: 2020-12-05 task: [Pipeline Public, Language Detection, Sentence Detection] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [language_detection, open_source, pipeline, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Language detection and identification is the task of automatically detecting the language(s) present in a document based on the content of the document. ``LanguageDetectorDL`` is an annotator that detects the language of documents or sentences depending on the ``inputCols``. In addition, ``LanguageDetectorDL`` can accurately detect language from documents with mixed languages by coalescing sentences and selecting the best candidate. We have designed and developed Deep Learning models using CNN architectures in TensorFlow/Keras. The models are trained on large datasets such as Wikipedia and Tatoeba with high accuracy evaluated on the Europarl dataset. The output is a language code in Wiki Code style: [https://en.wikipedia.org/wiki/List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias).
This pipeline can detect the following languages: ## Predicted Entities `Abkhaz`, `Iraqi Arabic`, `Adyghe`, `Afrikaans`, `Gulf Arabic`, `Afrihili`, `Assyrian Neo-Aramaic`, `Ainu`, `Aklanon`, `Gheg Albanian`, `Amharic`, `Aragonese`, `Old English`, `Uab Meto`, `North Levantine Arabic`, `Arabic`, `Algerian Arabic`, `Moroccan Arabic`, `Egyptian Arabic`, `Assamese`, `Asturian`, `Kotava`, `Awadhi`, `Aymara`, `Azerbaijani`, `Bashkir`, `Baluchi`, `Balinese`, `Bavarian`, `Central Bikol`, `Belarusian`, `Berber`, `Bulgarian`, `Bhojpuri`, `Bislama`, `Banjar`, `Bambara`, `Bengali`, `Tibetan`, `Breton`, `Bodo`, `Bosnian`, `Buryat`, `Baybayanon`, `Brithenig`, `Catalan`, `Cayuga`, `Chavacano`, `Chechen`, `Cebuano`, `Chamorro`, `Chagatai`, `Chinook Jargon`, `Choctaw`, `Cherokee`, `Jin Chinese`, `Chukchi`, `Central Mnong`, `Corsican`, `Chinese Pidgin English`, `Crimean Tatar`, `Seychellois Creole`, `Czech`, `Kashubian`, `Chuvash`, `Welsh`, `CycL`, `Cuyonon`, `Danish`, `German`, `Dungan`, `Drents`, `Lower Sorbian`, `Central Dusun`, `Dhivehi`, `Dutton World Speedwords`, `Ewe`, `Emilian`, `Greek`, `Erromintxela`, `English`, `Middle English`, `Esperanto`, `Spanish`, `Estonian`, `Basque`, `Evenki`, `Extremaduran`, `Persian`, `Finnish`, `Fijian`, `Kven Finnish`, `Faroese`, `French`, `Middle French`, `Old French`, `North Frisian`, `Pulaar`, `Friulian`, `Nigerian Fulfulde`, `Frisian`, `Irish`, `Ga`, `Gagauz`, `Gan Chinese`, `Garhwali`, `Guadeloupean Creole French`, `Scottish Gaelic`, `Gilbertese`, `Galician`, `Guarani`, `Konkani (Goan)`, `Gronings`, `Gothic`, `Ancient Greek`, `Swiss German`, `Gujarati`, `Manx`, `Hausa`, `Hakka Chinese`, `Hawaiian`, `Ancient Hebrew`, `Hebrew`, `Hindi`, `Fiji Hindi`, `Hiligaynon`, `Hmong Njua (Green)`, `Ho`, `Croatian`, `Hunsrik`, `Upper Sorbian`, `Xiang Chinese`, `Haitian Creole`, `Hungarian`, `Armenian`, `Interlingua`, `Iban`, `Indonesian`, `Interlingue`, `Igbo`, `Nuosu`, `Inuktitut`, `Ilocano`, `Ido`, `Icelandic`, `Italian`, `Ingrian`, `Japanese`, 
`Jamaican Patois`, `Lojban`, `Juhuri (Judeo-Tat)`, `Jewish Palestinian Aramaic`, `Javanese`, `Georgian`, `Karakalpak`, `Kabyle`, `Kamba`, `Kekchi (Q'eqchi')`, `Khasi`, `Khakas`, `Kazakh`, `Greenlandic`, `Khmer`, `Kannada`, `Korean`, `Komi-Permyak`, `Komi-Zyrian`, `Karachay-Balkar`, `Karelian`, `Kashmiri`, `Kölsch`, `Kurdish`, `Kumyk`, `Cornish`, `Keningau Murut`, `Kyrgyz`, `Coastal Kadazan`, `Latin`, `Southern Subanen`, `Ladino`, `Luxembourgish`, `Láadan`, `Lingua Franca Nova`, `Luganda`, `Ligurian`, `Livonian`, `Lakota`, `Ladin`, `Lombard`, `Lingala`, `Lao`, `Louisiana Creole`, `Lithuanian`, `Latgalian`, `Latvian`, `Latvian`, `Literary Chinese`, `Laz`, `Madurese`, `Maithili`, `North Moluccan Malay`, `Moksha`, `Morisyen`, `Malagasy`, `Mambae`, `Marshallese`, `Meadow Mari`, `Maori`, `Mi'kmaq`, `Minangkabau`, `Macedonian`, `Malayalam`, `Mongolian`, `Manchu`, `Mon`, `Mohawk`, `Marathi`, `Hill Mari`, `Malay`, `Maltese`, `Tagal Murut`, `Mirandese`, `Hmong Daw (White)`, `Burmese`, `Erzya`, `Nauruan`, `Nahuatl`, `Norwegian Bokmål`, `Central Huasteca Nahuatl`, `Low German (Low Saxon)`, `Nepali`, `Newari`, `Ngeq`, `Guerrero Nahuatl`, `Niuean`, `Dutch`, `Orizaba Nahuatl`, `Norwegian Nynorsk`, `Norwegian`, `Nogai`, `Old Norse`, `Novial`, `Nepali`, `Naga (Tangshang)`, `Navajo`, `Chinyanja`, `Nyungar`, `Old Aramaic`, `Occitan`, `Ojibwe`, `Odia (Oriya)`, `Old East Slavic`, `Ossetian`, `Old Spanish`, `Old Saxon`, `Ottoman Turkish`, `Old Turkish`, `Punjabi (Eastern)`, `Pangasinan`, `Kapampangan`, `Papiamento`, `Palauan`, `Picard`, `Pennsylvania German`, `Palatine German`, `Phoenician`, `Pali`, `Polish`, `Piedmontese`, `Punjabi (Western)`, `Pipil`, `Old Prussian`, `Pashto`, `Portuguese`, `Quechua`, `K'iche'`, `Quenya`, `Rapa Nui`, `Rendille`, `Tarifit`, `Romansh`, `Kirundi`, `Romanian`, `Romani`, `Russian`, `Rusyn`, `Kinyarwanda`, `Okinawan`, `Sanskrit`, `Yakut`, `Sardinian`, `Sicilian`, `Scots`, `Sindhi`, `Northern Sami`, `Sango`, `Samogitian`, `Shuswap`, `Tachawit`, `Sinhala`, 
`Sindarin`, `Slovak`, `Slovenian`, `Samoan`, `Southern Sami`, `Shona`, `Somali`, `Albanian`, `Serbian`, `Swazi`, `Southern Sotho`, `Saterland Frisian`, `Sundanese`, `Sumerian`, `Swedish`, `Swahili`, `Swabian`, `Swahili`, `Syriac`, `Tamil`, `Telugu`, `Tetun`, `Tajik`, `Thai`, `Tahaggart Tamahaq`, `Tigrinya`, `Tigre`, `Turkmen`, `Tokelauan`, `Tagalog`, `Klingon`, `Talysh`, `Jewish Babylonian Aramaic`, `Temuan`, `Setswana`, `Tongan`, `Tonga (Zambezi)`, `Toki Pona`, `Tok Pisin`, `Old Tupi`, `Turkish`, `Tsonga`, `Tatar`, `Isan`, `Tuvaluan`, `Tahitian`, `Tuvinian`, `Talossan`, `Udmurt`, `Uyghur`, `Ukrainian`, `Umbundu`, `Urdu`, `Urhobo`, `Uzbek`, `Venetian`, `Veps`, `Vietnamese`, `Volapük`, `Võro`, `Walloon`, `Waray`, `Wolof`, `Shanghainese`, `Kalmyk`, `Xhosa`, `Mingrelian`, `Yiddish`, `Yoruba`, `Cantonese`, `Chinese`, `Malay (Vernacular)`, `Malay`, `Zulu`, `Zaza`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/LANGUAGE_DETECTOR/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/detect_language_375_xx_2.7.0_2.4_1607185980306.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/detect_language_375_xx_2.7.0_2.4_1607185980306.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("detect_language_375", lang = "xx") pipeline.annotate("French author who helped pioneer the science-fiction genre.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("detect_language_375", lang = "xx") pipeline.annotate("French author who helped pioneer the science-fiction genre.") ``` {:.nlu-block} ```python import nlu text = ["French author who helped pioneer the science-fiction genre."] lang_df = nlu.load("lang").predict(text) lang_df ```
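As described earlier, `LanguageDetectorDL` can coalesce sentence-level predictions for mixed-language documents and select the best candidate. A toy majority-vote sketch of that coalescing idea (illustrative only, not the actual implementation):

```python
from collections import Counter

# Coalesce per-sentence language predictions by majority vote.
def best_language(sentence_predictions):
    return Counter(sentence_predictions).most_common(1)[0][0]

# Hypothetical per-sentence predictions for a mostly-English document.
print(best_language(["en", "fr", "en", "en"]))  # en
```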
## Results ```bash {'document': ['French author who helped pioneer the science-fiction genre.'], 'language': ['en']} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|detect_language_375| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - LanguageDetectorDL --- layout: model title: Clinical Findings to UMLS Code Pipeline author: John Snow Labs name: umls_clinical_findings_resolver_pipeline date: 2023-03-30 tags: [en, licensed, umls, resolver, pipeline, clinical] task: Pipeline Healthcare language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps entities (Clinical Findings) with their corresponding UMLS CUI codes. You’ll just feed your text and it will return the corresponding UMLS codes. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/umls_clinical_findings_resolver_pipeline_en_4.3.2_3.2_1680168336993.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/umls_clinical_findings_resolver_pipeline_en_4.3.2_3.2_1680168336993.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("umls_clinical_findings_resolver_pipeline", "en", "clinical/models") text = "HTG-induced pancreatitis associated with an acute hepatitis, and obesity" result = pipeline.annotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("umls_clinical_findings_resolver_pipeline", "en", "clinical/models") val text = "HTG-induced pancreatitis associated with an acute hepatitis, and obesity" val result = pipeline.annotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.umls_clinical_findings_resolver").predict("""['HTG-induced pancreatitis associated with an acute hepatitis, and obesity']""") ```
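Conceptually, the pipeline's final step maps each recognized `PROBLEM` chunk to its UMLS CUI. A toy sketch of that mapping step — the dictionary below holds only the three example codes from this card, whereas the real pipeline resolves against the full UMLS vocabulary:

```python
# Stand-in chunk -> CUI lookup (real resolution uses embeddings + resolver models).
umls_map = {
    "HTG-induced pancreatitis": "C1963198",
    "an acute hepatitis": "C4750596",
    "obesity": "C1963185",
}

chunks = ["HTG-induced pancreatitis", "obesity"]
pairs = [(c, umls_map.get(c, "NONE")) for c in chunks]
print(pairs)  # [('HTG-induced pancreatitis', 'C1963198'), ('obesity', 'C1963185')]
```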
## Results ```bash +------------------------+---------+---------+ |chunk |ner_label|umls_code| +------------------------+---------+---------+ |HTG-induced pancreatitis|PROBLEM |C1963198 | |an acute hepatitis |PROBLEM |C4750596 | |obesity |PROBLEM |C1963185 | +------------------------+---------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|umls_clinical_findings_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|4.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger --- layout: model title: English asr_wav2vec2_large_xls_r_300m_cantonese TFWav2Vec2ForCTC from ivanlau author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_cantonese date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_large_xls_r_300m_cantonese` is a English model originally trained by ivanlau. 
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use `asr_wav2vec2_large_xls_r_300m_cantonese_gpu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_cantonese_en_4.2.0_3.0_1664112880096.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_cantonese_en_4.2.0_3.0_1664112880096.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_cantonese", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_cantonese", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
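The snippets above reference an `audioDf` that is never defined: `AudioAssembler` expects a column of raw audio samples as floats. A minimal sketch of producing such floats from a 16-bit PCM mono WAV file using only the Python standard library (the file name and the final commented `spark.createDataFrame` call are illustrative assumptions, not part of this model's documentation):

```python
import struct
import wave

def wav_to_floats(path):
    """Decode a 16-bit PCM mono WAV file into floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "expects 16-bit samples"
        assert wav.getnchannels() == 1, "expects mono audio"
        frames = wav.readframes(wav.getnframes())
    # "<h" = little-endian signed 16-bit; normalize by 2**15
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Illustrative only -- wrap the float array in a single-row Spark DataFrame:
# audioDf = spark.createDataFrame([[wav_to_floats("sample.wav")]], ["audio_content"])
```

Note that Wav2Vec2 XLS-R checkpoints are trained on 16 kHz audio, so input recorded at other rates generally needs resampling first.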
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_cantonese| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Detect Genes/Proteins (BC2GM) in Medical Texts author: John Snow Labs name: ner_biomedical_bc2gm date: 2022-05-11 tags: [bc2gm, ner, biomedical, gene_protein, gene, protein, en, licensed, clinical] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.5.1 spark_version: 2.4 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Named Entity Recognition annotator allows a generic model to be trained using a deep learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. This model has been trained to extract genes/proteins from medical text for PySpark 2.4.x users. ## Predicted Entities `GENE_PROTEIN` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_biomedical_bc2gm_en_3.5.1_2.4_1652262009994.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_biomedical_bc2gm_en_3.5.1_2.4_1652262009994.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical" ,"en", "clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_biomedical_bc2gm", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["Immunohistochemical staining was positive for S-100 in all 9 cases stained, positive for HMB-45 in 9 (90%) of 10, and negative for cytokeratin in all 9 cases in which myxoid melanoma remained in the block after previous sections."]]).toDF("text") result = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical" ,"en", "clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("word_embeddings") val ner = MedicalNerModel.pretrained("ner_biomedical_bc2gm", "en", "clinical/models") .setInputCols(Array("sentence", "token", "word_embeddings")) .setOutputCol("ner") val
ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Immunohistochemical staining was positive for S-100 in all 9 cases stained, positive for HMB-45 in 9 (90%) of 10, and negative for cytokeratin in all 9 cases in which myxoid melanoma remained in the block after previous sections.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.biomedical_bc2gm").predict("""Immunohistochemical staining was positive for S-100 in all 9 cases stained, positive for HMB-45 in 9 (90%) of 10, and negative for cytokeratin in all 9 cases in which myxoid melanoma remained in the block after previous sections.""") ```
## Results ```bash +-----------+------------+ |chunk |ner_label | +-----------+------------+ |S-100 |GENE_PROTEIN| |HMB-45 |GENE_PROTEIN| |cytokeratin|GENE_PROTEIN| +-----------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_biomedical_bc2gm| |Compatibility:|Healthcare NLP 3.5.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|14.6 MB| ## References Created by Smith et al. in 2008, the BioCreative II Gene Mention Recognition ([BC2GM](https://metatext.io/datasets/biocreative-ii-gene-mention-recognition-(bc2gm))) dataset contains data where participants are asked to identify a gene mention in a sentence by giving its start and end characters. The training set consists of a set of sentences, and for each sentence a set of gene mentions (GENE annotations). ## Benchmarking ```bash label precision recall f1-score support GENE_PROTEIN 0.83 0.82 0.82 6325 micro-avg 0.83 0.82 0.82 6325 macro-avg 0.83 0.82 0.82 6325 weighted-avg 0.83 0.82 0.82 6325 ``` --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_6 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-128-finetuned-squad-seed-6` is an English model originally trained by `anas-awadalla`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_6_en_4.0.0_3.0_1654191410803.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_6_en_4.0.0_3.0_1654191410803.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_6","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_6","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_128d_seed_6").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
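Extractive QA models like this one score every token as a potential answer start and as a potential answer end, and the answer is the highest-scoring valid span. A toy sketch of that selection step in plain Python (the logit values below are invented for illustration and are not produced by this model; the exact post-processing inside Spark NLP may differ):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick the (start, end) token pair maximizing start+end score, with end >= start."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        # only consider spans of bounded length that end at or after the start
        for j in range(i, min(i + max_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 4.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end   = [0.1, 0.1, 0.2, 3.5, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # prints "Clara"
```

With the invented logits above, the single token "Clara" wins, mirroring the expected answer to "What's my name?" in the example.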
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|380.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-128-finetuned-squad-seed-6 --- layout: model title: Arabic Part of Speech Tagger (Modern Standard Arabic (MSA), Egyptian Arabic POS) author: John Snow Labs name: bert_pos_bert_base_arabic_camelbert_msa_pos_egy date: 2022-04-26 tags: [bert, pos, part_of_speech, ar, open_source] task: Part of Speech Tagging language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-msa-pos-egy` is an Arabic model originally trained by `CAMeL-Lab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_msa_pos_egy_ar_3.4.2_3.0_1650993704537.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_msa_pos_egy_ar_3.4.2_3.0_1650993704537.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_msa_pos_egy","ar") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_msa_pos_egy","ar") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("أنا أحب الشرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.pos.arabic_camelbert_msa_pos_egy").predict("""أنا أحب الشرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_arabic_camelbert_msa_pos_egy| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ar| |Size:|407.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa-pos-egy - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://github.com/CAMeL-Lab/camel_tools --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_64_finetuned_squad_seed_2 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-64-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_64_finetuned_squad_seed_2_en_4.3.0_3.0_1674215981311.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_64_finetuned_squad_seed_2_en_4.3.0_3.0_1674215981311.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_64_finetuned_squad_seed_2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_64_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_64_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|419.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-64-finetuned-squad-seed-2 --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_2 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-128-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_2_en_4.0.0_3.0_1657184204245.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_2_en_4.0.0_3.0_1657184204245.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_128_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-128-finetuned-squad-seed-2 --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_mask_step_pretraining_recipes_base_squadv2_epochs_3 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mask_step_pretraining_recipes-roberta-base_squadv2_epochs_3` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_mask_step_pretraining_recipes_base_squadv2_epochs_3_en_4.3.0_3.0_1674211343406.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_mask_step_pretraining_recipes_base_squadv2_epochs_3_en_4.3.0_3.0_1674211343406.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_mask_step_pretraining_recipes_base_squadv2_epochs_3","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_mask_step_pretraining_recipes_base_squadv2_epochs_3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_mask_step_pretraining_recipes_base_squadv2_epochs_3| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|467.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/mask_step_pretraining_recipes-roberta-base_squadv2_epochs_3 --- layout: model title: BERT Embeddings trained on Wikipedia and BooksCorpus author: John Snow Labs name: bert_wiki_books date: 2021-08-30 tags: [en, bert_embeddings, wikipedia_dataset, books_corpus_dataset, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses a BERT base architecture pretrained from scratch on Wikipedia and BooksCorpus. This is a BERT base architecture but some changes have been made to the original training and export scheme based on more recent learning that improve its accuracy over the original BERT base checkpoint. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_wiki_books_en_3.2.0_3.0_1630328923814.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_wiki_books_en_3.2.0_3.0_1630328923814.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = BertEmbeddings.pretrained("bert_wiki_books", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val embeddings = BertEmbeddings.pretrained("bert_wiki_books", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.wiki_books').predict(text, output_level='token') embeddings_df ```
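A typical downstream use of the token embeddings produced above is semantic comparison via cosine similarity. A self-contained sketch on toy vectors (real `bert_wiki_books` vectors are 768-dimensional; the 3-d values here are invented purely for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Invented toy embeddings standing in for real BERT output vectors
v_king = [0.9, 0.1, 0.4]
v_queen = [0.8, 0.2, 0.5]
v_car = [-0.3, 0.9, 0.0]

# Related words should score higher than unrelated ones
print(cosine_similarity(v_king, v_queen) > cosine_similarity(v_king, v_car))  # True
```

The same function applies unchanged to the `embeddings` arrays in the pipeline's output rows.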
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_wiki_books| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Case sensitive:|false| ## Data Source [1]: [Wikipedia dataset](https://dumps.wikimedia.org/) [2]: [BooksCorpus dataset](http://yknzhu.wixsite.com/mbweb) This model has been imported from: https://tfhub.dev/google/experts/bert/wiki_books/2 --- layout: model title: T5-small fine-tuned on WikiSQL author: John Snow Labs name: t5_small_wikiSQL date: 2022-01-12 tags: [t5, open_source, en] task: Text Generation language: en nav_key: models edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Google's T5 small fine-tuned on WikiSQL for English to SQL translation. It will generate SQL code from natural language input when the task is set to "translate English to SQL:". ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/T5_SQL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/T5_SQL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_wikiSQL_en_3.4.0_3.0_1641982554211.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_wikiSQL_en_3.4.0_3.0_1641982554211.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import sparknlp from sparknlp.base import * from sparknlp.annotator import * from pyspark.ml import Pipeline spark = sparknlp.start() documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") t5 = T5Transformer.pretrained("t5_small_wikiSQL") \ .setTask("translate English to SQL:") \ .setInputCols(["documents"]) \ .setMaxOutputLength(200) \ .setOutputCol("sql") pipeline = Pipeline().setStages([documentAssembler, t5]) data = spark.createDataFrame([["How many customers have ordered more than 2 items?"]]).toDF("text") result = pipeline.fit(data).transform(data) result.select("sql.result").show(truncate=False) ``` ```scala import spark.implicits._ import com.johnsnowlabs.nlp.base.DocumentAssembler import com.johnsnowlabs.nlp.annotators.seq2seq.T5Transformer import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("documents") val t5 = T5Transformer.pretrained("t5_small_wikiSQL") .setTask("translate English to SQL:") .setInputCols("documents") .setMaxOutputLength(200) .setOutputCol("sql") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("How many customers have ordered more than 2 items?") .toDF("text") val result = pipeline.fit(data).transform(data) result.select("sql.result").show(false) ``` {:.nlu-block} ```python import nlu nlu.load("en.t5.wikiSQL").predict("""How many customers have ordered more than 2 items?""") ```
## Results ```bash +----------------------------------------------------+ |result | +----------------------------------------------------+ |[SELECT COUNT Customers FROM table WHERE Orders > 2]| +----------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_wikiSQL| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[sql]| |Language:|en| |Size:|262.1 MB| ## Data Source Model originally from the Hugging Face transformer model by Manuel Romero (mrm8488): https://huggingface.co/mrm8488/t5-small-finetuned-wikiSQL --- layout: model title: Extract Financial, Legal and Generic Entities in Arabic author: John Snow Labs name: legner_arabert_arabic date: 2022-10-02 tags: [ar, legal, licensed] task: Named Entity Recognition language: ar edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description AraBert v2 model, trained in house on the dataset available [here](https://ontology.birzeit.edu/Wojood/), augmented with financial and legal information.
The entities you can find in this model are: PERS (person), NORP (group of people), OCC (occupation), ORG (organization), GPE (geopolitical entity), LOC (geographical location), FAC (facility: landmark places), CURR (currency), EVENT, DATE, TIME, ORDINAL, CARDINAL, PERCENT, QUANTITY, UNIT, LANGUAGE, WEBSITE, LAW, MONEY, PRODUCT. ## Predicted Entities `NORP`, `PERS`, `LOC`, `MONEY`, `TIME`, `ORG`, `WEBSITE`, `ORDINAL`, `PERCENT`, `EVENT`, `QUANTITY`, `OCC`, `LANGUAGE`, `CARDINAL`, `DATE`, `GPE`, `PRODUCT`, `CURR`, `FAC`, `UNIT`, `LAW` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_arabert_arabic_ar_1.0.0_3.0_1664705605292.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_arabert_arabic_ar_1.0.0_3.0_1664705605292.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pandas as pd documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") tokenClassifier = legal.BertForTokenClassification.pretrained("legner_arabert_arabic", "ar", "legal/models")\ .setInputCols("token", "document")\ .setOutputCol("label") pipeline = nlp.Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) example = spark.createDataFrame(pd.DataFrame({'text': ["""أمثلة: جامعة بيرزيت وبالتعاون مع مؤسسة ادوارد سعيد تنظم مهرجان للفن الشعبي سيبدأ الساعة الرابعة عصرا، بتاريخ 16/5/2016. بورصة فلسطين تسجل ارتفاعا بنسبة 0.08% ، في جلسة بلغت قيمة تداولاتها أكثر من نصف مليون دولار . إنتخاب رئيس هيئة سوق رأس المال وتعديل مادة (4) في القانون الأساسي. مسيرة قرب باب العامود والذي 700 متر عن المسجد الأقصى."""]})) result = pipeline.fit(example).transform(example) ```
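The token-level output below uses BIO tags: `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. Merging tagged tokens back into chunks (what `NerConverter`-style components do inside Spark NLP) can be sketched in plain Python; the decoding rules below are the standard BIO convention, not this library's exact implementation:

```python
def bio_to_chunks(tokens, tags):
    """Merge BIO-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # a new entity starts; flush any open chunk first
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)  # continuation of the open entity
        else:
            # O tag, or an I- tag with no matching open entity
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(bio_to_chunks(
    ["جامعة", "بيرزيت", "مع", "ادوارد", "سعيد"],
    ["B-ORG", "I-ORG", "O", "B-PERS", "I-PERS"]))
# → [('جامعة بيرزيت', 'ORG'), ('ادوارد سعيد', 'PERS')]
```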
## Results ```bash ["أمثلة:","O"] ["جامعة","B-ORG"] ["بيرزيت","I-ORG"] ["وبالتعاون","O"] ["مع","O"] ["مؤسسة","B-ORG"] ["ادوارد","B-PERS"] ["سعيد","I-PERS"] ["تنظم","O"] ["مهرجان","B-EVENT"] ["للفن","I-EVENT"] ["الشعبي","I-EVENT"] ["سيبدأ","O"] ["الساعة","B-TIME"] ["الرابعة","I-TIME"] ["عصرا،","I-TIME"] ["بتاريخ","B-DATE"] ["16/5/2016.","I-DATE"] ["بورصة","B-ORG"] ["فلسطين","I-ORG"] ["تسجل","O"] ["ارتفاعا","O"] ["بنسبة","O"] ["0.08%","B-PERCENT"] ["،","O"] ["في","O"] ["جلسة","O"] ["بلغت","O"] ["قيمة","O"] ["تداولاتها","O"] ["أكثر","O"] ["من","O"] ["نصف","B-MONEY"] ["مليون","I-MONEY"] ["دولار","B-CURR"] [".","O"] ["إنتخاب","O"] ["رئيس","B-OCC"] ["هيئة","B-ORG"] ["سوق","I-ORG"] ["رأس","I-ORG"] ["المال","I-ORG"] ["وتعديل","O"] ["مادة","B-LAW"] ["(4)","I-LAW"] ["في","O"] ["القانون","B-LAW"] ["الأساسي.","O"] ["مسيرة","O"] ["قرب","O"] ["باب","B-FAC"] ["العامود","I-FAC"] ["والذي","O"] ["700","B-QUANTITY"] ["متر","B-UNIT"] ["عن","O"] ["المسجد","B-FAC"] ["الأقصى.","I-FAC"] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_arabert_arabic| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|ar| |Size:|505.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References https://ontology.birzeit.edu/Wojood/ ## Benchmarking ```bash label precision recall f1-score support B-CARDINAL 0.93 0.87 0.80 19 B-DATE 0.88 0.93 0.90 106 B-EVENT 1.00 0.86 0.92 14 B-FAC 1.00 0.67 0.80 3 B-GPE 0.88 0.85 0.87 89 B-LAW 1.00 0.50 0.67 6 B-NORP 0.72 0.81 0.76 32 B-OCC 0.88 0.83 0.85 52 B-ORDINAL 0.76 0.80 0.78 35 B-ORG 0.81 0.87 0.84 103 B-PERS 0.78 0.89 0.83 47 B-WEBSITE 0.62 1.00 0.77 5 I-CARDINAL 0.33 0.62 0.43 8 I-DATE 0.98 0.99 0.98 447 I-EVENT 0.91 0.91 0.91 23 I-FAC 0.75 0.43 0.55 7 I-GPE 0.80 0.92 0.86 53 I-LAW 1.00 0.85 0.92 13 I-NORP 0.65 0.48 0.55 23 I-OCC 0.97 0.86 0.91 96 I-ORG 0.87 0.91 0.89 139 I-PERS 0.94 1.00 0.97 60 I-WEBSITE 0.94 1.00 0.97 15 O 0.98 
0.97 0.98 3062 accuracy - - 0.95 4468 macro-avg 0.83 0.81 0.81 4468 weighted-avg 0.95 0.95 0.95 4468 ``` --- layout: model title: English DistilBertForTokenClassification Base Uncased model (from Datasaur) author: John Snow Labs name: distilbert_token_classifier_base_uncased_finetuned_conll2003 date: 2023-03-06 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-conll2003` is a English model originally trained by `Datasaur`. ## Predicted Entities `LOC`, `ORG`, `PER`, `MISC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_uncased_finetuned_conll2003_en_4.3.1_3.0_1678133950243.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_base_uncased_finetuned_conll2003_en_4.3.1_3.0_1678133950243.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_uncased_finetuned_conll2003","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_base_uncased_finetuned_conll2003","en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_base_uncased_finetuned_conll2003| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Datasaur/distilbert-base-uncased-finetuned-conll2003 --- layout: model title: English BertForQuestionAnswering Base Uncased model (from anas-awadalla) author: John Snow Labs name: bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_6 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-16-finetuned-squad-seed-6` is a English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_6_en_4.0.0_3.0_1657184582953.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_6_en_4.0.0_3.0_1657184582953.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_6","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_6","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_few_shot_k_16_finetuned_squad_seed_6| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-16-finetuned-squad-seed-6 --- layout: model title: English RobertaForQuestionAnswering Cased model (from skandaonsolve) author: John Snow Labs name: roberta_qa_finetuned_state2 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-finetuned-state2` is a English model originally trained by `skandaonsolve`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_state2_en_4.3.0_3.0_1674220557287.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_finetuned_state2_en_4.3.0_3.0_1674220557287.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_state2","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_finetuned_state2","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_finetuned_state2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/skandaonsolve/roberta-finetuned-state2 --- layout: model title: English BertForTokenClassification Uncased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC4_Original_bluebert_pubmed_uncased_L_12_H_768_A_12 date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC4-Original-bluebert_pubmed_uncased_L-12_H-768_A-12` is a English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_Original_bluebert_pubmed_uncased_L_12_H_768_A_12_en_4.0.0_3.0_1657108108305.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC4_Original_bluebert_pubmed_uncased_L_12_H_768_A_12_en_4.0.0_3.0_1657108108305.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token")

tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_Original_bluebert_pubmed_uncased_L_12_H_768_A_12","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC4_Original_bluebert_pubmed_uncased_L_12_H_768_A_12","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier))

val data = Seq("PUT YOUR STRING HERE").toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC4_Original_bluebert_pubmed_uncased_L_12_H_768_A_12| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|407.6 MB| |Case sensitive:|false| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC4-Original-bluebert_pubmed_uncased_L-12_H-768_A-12 --- layout: model title: English asr_liepa_lithuanian TFWav2Vec2ForCTC from birgermoell author: John Snow Labs name: asr_liepa_lithuanian date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_liepa_lithuanian` is a English model originally trained by birgermoell. NOTE: This model only works on a CPU, if you need to use this model on a GPU device please use asr_liepa_lithuanian_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_liepa_lithuanian_en_4.2.0_3.0_1664111853348.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_liepa_lithuanian_en_4.2.0_3.0_1664111853348.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
    .setInputCol("audio_content") \
    .setOutputCol("audio_assembler")

speech_to_text = Wav2Vec2ForCTC \
    .pretrained("asr_liepa_lithuanian", "en") \
    .setInputCols("audio_assembler") \
    .setOutputCol("text")

pipeline = Pipeline(stages=[
  audio_assembler,
  speech_to_text,
])

pipelineModel = pipeline.fit(audioDf)

pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
    .setInputCol("audio_content")
    .setOutputCol("audio_assembler")

val speechToText = Wav2Vec2ForCTC
    .pretrained("asr_liepa_lithuanian", "en")
    .setInputCols("audio_assembler")
    .setOutputCol("text")

val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))

val pipelineModel = pipeline.fit(audioDf)

val pipelineDF = pipelineModel.transform(audioDf)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_liepa_lithuanian| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|228.1 MB| --- layout: model title: Fast Neural Machine Translation Model from Southern Sotho to English author: John Snow Labs name: opus_mt_st_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, st, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `st` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_st_en_xx_2.7.0_2.4_1609168854147.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_st_en_xx_2.7.0_2.4_1609168854147.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_st_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])

light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

result = light_pipeline.fullAnnotate("PUT YOUR TEXT TO TRANSLATE HERE")
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_st_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("PUT YOUR TEXT TO TRANSLATE HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.st.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_st_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Classifier for Genders - BIOBERT author: John Snow Labs name: classifierdl_gender_biobert date: 2021-01-21 task: Text Classification language: en nav_key: models edition: Healthcare NLP 2.7.1 spark_version: 2.4 tags: [licensed, en, classifier, clinical] supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model classifies the gender of the patient in the clinical document using context. ## Predicted Entities `Female`, `Male`, `Unknown` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CLASSIFICATION_GENDER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/21_Gender_Classifier.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifierdl_gender_biobert_en_2.7.1_2.4_1611247084544.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifierdl_gender_biobert_en_2.7.1_2.4_1611247084544.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')

biobert_embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased") \
    .setInputCols(["document", "token"])\
    .setOutputCol("bert_embeddings")

sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "bert_embeddings"]) \
    .setOutputCol("sentence_bert_embeddings") \
    .setPoolingStrategy("AVERAGE")

genderClassifier = ClassifierDLModel.pretrained("classifierdl_gender_biobert", "en", "clinical/models") \
    .setInputCols(["document", "sentence_bert_embeddings"]) \
    .setOutputCol("gender")

nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, biobert_embeddings, sentence_embeddings, genderClassifier])

light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

annotations = light_pipeline.fullAnnotate("""social history: shows that does not smoke cigarettes or drink alcohol, lives in a nursing home.
family history: shows a family history of breast cancer.""")
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val biobert_embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")
    .setInputCols(Array("document","token"))
    .setOutputCol("bert_embeddings")

val sentence_embeddings = new SentenceEmbeddings()
    .setInputCols(Array("document", "bert_embeddings"))
    .setOutputCol("sentence_bert_embeddings")
    .setPoolingStrategy("AVERAGE")

val genderClassifier = ClassifierDLModel.pretrained("classifierdl_gender_biobert", "en", "clinical/models")
    .setInputCols(Array("document", "sentence_bert_embeddings"))
    .setOutputCol("gender")

val nlp_pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, biobert_embeddings, sentence_embeddings, genderClassifier))

val data = Seq("""social history: shows that does not smoke cigarettes or drink alcohol, lives in a nursing home. family history: shows a family history of breast cancer.""").toDS.toDF("text")

val result = nlp_pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.classify.gender.biobert").predict("""social history: shows that does not smoke cigarettes or drink alcohol, lives in a nursing home. family history: shows a family history of breast cancer.""")
```
## Results

```bash
Female
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|classifierdl_gender_biobert|
|Compatibility:|Spark NLP 2.7.1+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Dependencies:|biobert_pubmed_base_cased|

## Data Source

This model is trained on more than four thousand clinical documents (radiology reports, pathology reports, clinical visits, etc.), annotated internally.

## Benchmarking

```bash
       label  precision  recall  f1-score  support
      Female     0.9020  0.9364    0.9189      236
        Male     0.8761  0.7857    0.8285      126
     Unknown     0.7091  0.7647    0.7358       51
    accuracy          -       -    0.8692      413
   macro-avg     0.8291  0.8290    0.8277      413
weighted-avg     0.8703  0.8692    0.8687      413
```

---
layout: model
title: English Part of Speech Tagger (from sagorsarker)
author: John Snow Labs
name: bert_pos_codeswitch_spaeng_pos_lince
date: 2022-05-09
tags: [bert, pos, part_of_speech, en, open_source]
task: Part of Speech Tagging
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `codeswitch-spaeng-pos-lince` is an English model originally trained by `sagorsarker`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_codeswitch_spaeng_pos_lince_en_3.4.2_3.0_1652091591511.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_codeswitch_spaeng_pos_lince_en_3.4.2_3.0_1652091591511.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token")

tokenClassifier = BertForTokenClassification.pretrained("bert_pos_codeswitch_spaeng_pos_lince","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos")

pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_codeswitch_spaeng_pos_lince","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier))

val data = Seq("I love Spark NLP").toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_codeswitch_spaeng_pos_lince| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|665.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/sagorsarker/codeswitch-spaeng-pos-lince - https://ritual.uh.edu/lince/home - https://github.com/sagorbrur/codeswitch --- layout: model title: English BertForQuestionAnswering Uncased model (from aodiniz) author: John Snow Labs name: bert_qa_uncased_l_6_h_128_a_2_cord19_200616_squad2_covid_qna date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-6_H-128_A-2_cord19-200616_squad2_covid-qna` is a English model originally trained by `aodiniz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_uncased_l_6_h_128_a_2_cord19_200616_squad2_covid_qna_en_4.0.0_3.0_1657188934733.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_uncased_l_6_h_128_a_2_cord19_200616_squad2_covid_qna_en_4.0.0_3.0_1657188934733.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_uncased_l_6_h_128_a_2_cord19_200616_squad2_covid_qna","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_uncased_l_6_h_128_a_2_cord19_200616_squad2_covid_qna","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_uncased_l_6_h_128_a_2_cord19_200616_squad2_covid_qna| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|20.0 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-6_H-128_A-2_cord19-200616_squad2_covid-qna --- layout: model title: Fast Neural Machine Translation Model from English to West Germanic languages author: John Snow Labs name: opus_mt_en_gmw date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, gmw, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `gmw` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_gmw_xx_2.7.0_2.4_1609166311519.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_gmw_xx_2.7.0_2.4_1609166311519.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_gmw", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])

light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

result = light_pipeline.fullAnnotate("PUT YOUR TEXT TO TRANSLATE HERE")
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_gmw", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("PUT YOUR TEXT TO TRANSLATE HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.gmw').predict(text, output_level='sentence')
opus_df
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_gmw| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering (from mbartolo) author: John Snow Labs name: roberta_qa_roberta_large_synqa date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-synqa` is a English model originally trained by `mbartolo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_synqa_en_4.0.0_3.0_1655737922920.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_synqa_en_4.0.0_3.0_1655737922920.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_synqa","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = RoBertaForQuestionAnswering
    .pretrained("roberta_qa_roberta_large_synqa","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.synqa.roberta.large.by_mbartolo").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_large_synqa|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|1.3 GB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/mbartolo/roberta-large-synqa
- https://arxiv.org/abs/2002.00293
- https://arxiv.org/abs/2104.08678
---
layout: model
title: Financial Market Risk Item Binary Classifier
author: John Snow Labs
name: finclf_market_risk_item
date: 2022-08-10
tags: [en, finance, classification, 10k, annual, reports, sec, filings, licensed]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `market_risk` item type of 10K Annual Reports. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.

If you have big financial documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Finance NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
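As a rough illustration of the first splitting technique listed above (paragraph splitting by multiline), and not part of the official tutorial, a minimal plain-Python sketch could look like this; the sample text is invented for the example:

```python
import re

def split_paragraphs(text: str) -> list[str]:
    # Split on one or more blank lines and drop empty fragments.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

report = (
    "Item 7A. Market Risk.\nWe are exposed to interest rate risk.\n\n"
    "Item 8. Financial Statements.\nSee accompanying notes."
)
paragraphs = split_paragraphs(report)
# Each paragraph would then become one row of the "text" column fed to the classifier.
```

Each resulting paragraph stays well under the 512-token embedding limit in typical filings, and each can be classified independently.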
## Predicted Entities `other`, `market_risk` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_market_risk_item_en_1.0.0_3.2_1660154450360.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_market_risk_item_en_1.0.0_3.2_1660154450360.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") useEmbeddings = nlp.UniversalSentenceEncoder.pretrained() \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("finclf_market_risk_item", "en", "finance/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, useEmbeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-------------+
|       result|
+-------------+
|[market_risk]|
|      [other]|
|      [other]|
|[market_risk]|
+-------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|finclf_market_risk_item|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.6 MB|

## References

Weak labelling on documents from the Edgar database

## Benchmarking

```bash
       label  precision  recall  f1-score  support
 market_risk       0.79    0.82      0.81       74
       other       0.81    0.77      0.79       71
    accuracy          -       -      0.80      145
   macro-avg       0.80    0.80      0.80      145
weighted-avg       0.80    0.80      0.80      145
```
---
layout: model
title: Pipeline to Detect PHI for Deidentification (Sub Entity)
author: John Snow Labs
name: ner_deid_subentity_pipeline
date: 2023-03-13
tags: [de, deid, ner, licensed]
task: Named Entity Recognition
language: de
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [ner_deid_subentity](https://nlp.johnsnowlabs.com/2022/01/06/ner_deid_subentity_de.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_pipeline_de_4.3.0_3.2_1678739178086.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_subentity_pipeline_de_4.3.0_3.2_1678739178086.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_subentity_pipeline", "de", "clinical/models") text = '''Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_subentity_pipeline", "de", "clinical/models") val text = "Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("de.deid.ner_subentity.pipeline").predict("""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""") ```
## Results

```bash
|    | ner_chunks            |   begin |   end | ner_label   |   confidence |
|---:|:----------------------|--------:|------:|:------------|-------------:|
|  0 | Michael Berger        |       0 |    13 | PATIENT     |      0.99685 |
|  1 | 12 Dezember 2018      |      34 |    49 | DATE        |      1       |
|  2 | Elisabeth-Krankenhaus |      59 |    79 | HOSPITAL    |      0.6468  |
|  3 | Bad Kissingen         |      84 |    96 | CITY        |      0.69685 |
|  4 | Berger                |     117 |   122 | PATIENT     |      0.5764  |
|  5 | 76                    |     128 |   129 | AGE         |      1       |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_deid_subentity_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|de|
|Size:|1.3 GB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel
---
layout: model
title: Legal No Default Clause Binary Classifier
author: John Snow Labs
name: legclf_no_default_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `no-default` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `no-default`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_no_default_clause_en_1.0.0_3.2_1660123758322.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_no_default_clause_en_1.0.0_3.2_1660123758322.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_no_default_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
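As a rough, hypothetical illustration of the header/subheader splitting mentioned in the description above (the regex and sample contract are invented for this sketch, not taken from the tutorial), clause sections could be cut at numbered headings before each piece is sent to the classifier:

```python
import re

# Matches headings such as "1. DEFINITIONS" or "10.2 No Default" at line start.
HEADING = re.compile(r"^\d+(\.\d+)*\.?\s+[A-Z].*$", re.MULTILINE)

def split_by_headers(text: str) -> list[str]:
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts:
        return [text.strip()] if text.strip() else []
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]

contract = (
    "1. DEFINITIONS\nTerms used herein have the meanings set forth below.\n"
    "2. NO DEFAULT\nNo event of default has occurred and is continuing."
)
sections = split_by_headers(contract)
# Each section would then become one row of the "clause_text" column in the pipeline above.
```

Running several clause classifiers over the same sections would yield one True/False-style label per model, as described above.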
## Results

```bash
+------------+
|      result|
+------------+
|[no-default]|
|     [other]|
|     [other]|
|[no-default]|
+------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_no_default_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|

## References

Legal documents, scraped from the Internet and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
  no-default       0.98    0.85      0.91       53
       other       0.91    0.99      0.95       83
    accuracy          -       -      0.93      136
   macro-avg       0.94    0.92      0.93      136
weighted-avg       0.94    0.93      0.93      136
```
---
layout: model
title: English BertForQuestionAnswering model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_8
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-1024-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_8_en_4.0.0_3.0_1654538001568.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_8_en_4.0.0_3.0_1654538001568.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_8","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_8","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.span_bert.base_cased_1024d_seed_8").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_8|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|390.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-1024-finetuned-squad-seed-8
---
layout: model
title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket TFWav2Vec2ForCTC from lilitket
author: John Snow Labs
name: asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket` is an English model originally trained by lilitket.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket_en_4.2.0_3.0_1664095074111.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket_en_4.2.0_3.0_1664095074111.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
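The snippets above assume an `audioDf` DataFrame whose `audio_content` column holds arrays of floats. As a hedged, standard-library-only sketch of how such floats could be produced from a 16-bit mono PCM WAV file (the file name is hypothetical, and the Spark step is only indicated in a comment, not executed here):

```python
import struct
import wave

def wav_to_floats(path: str) -> list[float]:
    # Decode 16-bit mono PCM frames to floats in [-1.0, 1.0).
    with wave.open(path, "rb") as wav:
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# floats = wav_to_floats("speech.wav")  # hypothetical input file
# audioDf = spark.createDataFrame([(floats,)], ["audio_content"])  # Spark step, sketched only
```

Other sample rates or bit depths would need resampling/conversion first; this sketch only covers the simplest 16-bit mono case.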
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_large_xls_r_300m_turkish_colab_by_lilitket|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|1.2 GB|
---
layout: model
title: English BertForQuestionAnswering model (from deepset)
author: John Snow Labs
name: bert_qa_tinybert_6l_768d_squad2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tinybert-6l-768d-squad2` is an English model originally trained by `deepset`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_tinybert_6l_768d_squad2_en_4.0.0_3.0_1654192459375.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_tinybert_6l_768d_squad2_en_4.0.0_3.0_1654192459375.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_tinybert_6l_768d_squad2","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_tinybert_6l_768d_squad2","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.bert.tiny_768d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_tinybert_6l_768d_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|249.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/deepset/tinybert-6l-768d-squad2
- https://github.com/deepset-ai/haystack/discussions
- https://deepset.ai
- https://twitter.com/deepset_ai
- http://www.deepset.ai/jobs
- https://haystack.deepset.ai/community/join
- https://github.com/deepset-ai/haystack/
- https://deepset.ai/german-bert
- https://www.linkedin.com/company/deepset-ai/
- https://arxiv.org/pdf/1909.10351.pdf
- https://github.com/deepset-ai/FARM
- https://deepset.ai/germanquad
---
layout: model
title: NER Model Finder
author: John Snow Labs
name: ner_model_finder
date: 2021-11-24
tags: [pretrainedpipeline, clinical, ner, en, licensed]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Healthcare NLP 3.3.2
spark_version: 2.4
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is trained with BERT embeddings and can be used to find the most appropriate NER model given an entity name.

{:.btn-box}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.Pretrained_Clinical_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_model_finder_en_3.3.2_2.4_1637761259895.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_model_finder_en_3.3.2_2.4_1637761259895.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline ner_pipeline = PretrainedPipeline("ner_model_finder", "en", "clinical/models") result = ner_pipeline.annotate("medication") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val ner_pipeline = PretrainedPipeline("ner_model_finder","en","clinical/models") val result = ner_pipeline.annotate("medication") ``` {:.nlu-block} ```python import nlu nlu.load("en.ner.model_finder.pipeline").predict("""Put your text here.""") ```
## Results

```bash
{'model_names': ["['ner_posology', 'ner_posology_large', 'ner_posology_small', 'ner_posology_greedy', 'ner_drugs_large', 'ner_posology_experimental', 'ner_drugs_greedy', 'ner_ade_clinical', 'ner_jsl_slim', 'ner_posology_healthcare', 'ner_ade_healthcare', 'jsl_ner_wip_modifier_clinical', 'ner_ade_clinical', 'ner_jsl_greedy', 'ner_risk_factors']"]}
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_model_finder|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.3.2+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|

## Included Models

- DocumentAssembler
- BertSentenceEmbeddings
- SentenceEntityResolverModel
- Finisher
---
layout: model
title: Translate Georgian to English Pipeline
author: John Snow Labs
name: translate_ka_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, ka, en, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team; many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `ka` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ka_en_xx_2.7.0_2.4_1609688709188.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ka_en_xx_2.7.0_2.4_1609688709188.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ka_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ka_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ka.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_ka_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: Translate Tagalog to English Pipeline
author: John Snow Labs
name: translate_tl_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, tl, en, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team; many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `tl`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_tl_en_xx_2.7.0_2.4_1609685912308.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_tl_en_xx_2.7.0_2.4_1609685912308.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_tl_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_tl_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.tl.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_tl_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: English Bert Embeddings (from amine)
author: John Snow Labs
name: bert_embeddings_bert_base_5lang_cased
date: 2022-04-11
tags: [bert, embeddings, en, open_source]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-5lang-cased` is an English model originally trained by `amine`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_5lang_cased_en_3.4.2_3.0_1649672172777.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_5lang_cased_en_3.4.2_3.0_1649672172777.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_5lang_cased","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_5lang_cased","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.bert_base_5lang_cased").predict("""I love Spark NLP""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_bert_base_5lang_cased|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|en|
|Size:|464.0 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/amine/bert-base-5lang-cased
- https://cloud.google.com/compute/docs/machine-types#n1_machine_type
---
layout: model
title: English asr_english_filipino_wav2vec2_l_xls_r_test_04 TFWav2Vec2ForCTC from Khalsuu
author: John Snow Labs
name: pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_04
date: 2022-09-25
tags: [wav2vec2, en, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_english_filipino_wav2vec2_l_xls_r_test_04` is an English model originally trained by Khalsuu.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_04_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_04_en_4.2.0_3.0_1664109417906.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_04_en_4.2.0_3.0_1664109417906.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_04', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_04", lang = "en") val annotations = pipeline.transform(audioDF) ```
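The pipeline's internal `AudioAssembler` expects each row's audio as an array of floats. How you build `audioDF` depends on your setup; one minimal sketch, assuming 16-bit PCM mono WAV files (the helper name `read_wav_floats` is ours, not part of Spark NLP):

```python
import os
import struct
import tempfile
import wave

def read_wav_floats(path):
    # Decode 16-bit PCM mono WAV samples and normalize to [-1.0, 1.0).
    with wave.open(path, "rb") as w:
        raw = w.readframes(w.getnframes())
    samples = struct.unpack("<" + "h" * (len(raw) // 2), raw)
    return [s / 32768.0 for s in samples]
```

The resulting list can then be placed in a DataFrame column (in these pipelines the column is typically named `audio_content`) before calling `pipeline.transform`.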
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_04| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English BertForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_42 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-128-finetuned-squad-seed-42` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_42_en_4.0.0_3.0_1657191867446.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_42_en_4.0.0_3.0_1657191867446.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_42","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_42","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
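Under the hood, extractive QA models such as this one score every candidate answer span in the context: one start score and one end score per token, with the best valid (start ≤ end) pair winning. A minimal sketch of that decoding step, using toy logits rather than real model output:

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    # Pick the (start, end) token pair with the highest combined score,
    # subject to start <= end and a maximum answer length.
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best[0], best[1]

# token 1 has the best start score, token 2 the best end score
print(best_span([0.1, 5.0, 0.2], [0.0, 0.1, 4.0]))  # -> (1, 2)
```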
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_42| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|385.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-128-finetuned-squad-seed-42 --- layout: model title: Fast Neural Machine Translation Model from English to Danish author: John Snow Labs name: opus_mt_en_da date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, da, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `en` - target languages: `da` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_da_xx_2.7.0_2.4_1609163523252.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_da_xx_2.7.0_2.4_1609163523252.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_da", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_da", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.da').predict(text, output_level='sentence') opus_df ```
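The pipeline runs `SentenceDetectorDLModel` before the Marian stage because the transformer translates one sentence at a time; feeding it whole documents degrades quality and runs into length limits. As a rough illustration of what sentence detection contributes (the real annotator is a trained model, not a regex), a naive splitter:

```python
import re

def split_sentences(text):
    # Naive rule: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(split_sentences("Hello world. How are you?"))
# -> ['Hello world.', 'How are you?']
```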
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_da| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering model (from lewtun) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_squad_v1 date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad-v1` is an English model originally trained by `lewtun`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad_v1_en_4.0.0_3.0_1654726718619.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad_v1_en_4.0.0_3.0_1654726718619.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_lewtun").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")  ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_squad_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/lewtun/distilbert-base-uncased-finetuned-squad-v1 --- layout: model title: Fast Neural Machine Translation Model from English to Western Malayo-Polynesian Languages author: John Snow Labs name: opus_mt_en_pqw date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, pqw, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. - source languages: `en` - target languages: `pqw` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_pqw_xx_2.7.0_2.4_1609168491289.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_pqw_xx_2.7.0_2.4_1609168491289.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_pqw", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_pqw", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.pqw').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_pqw| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Arabic ALBERT Embeddings (x-large) author: John Snow Labs name: albert_embeddings_albert_xlarge_arabic date: 2022-04-14 tags: [albert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: AlBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `albert-xlarge-arabic` is an Arabic model originally trained by `asafaya`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_xlarge_arabic_ar_3.4.2_3.0_1649954299286.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_xlarge_arabic_ar_3.4.2_3.0_1649954299286.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_xlarge_arabic","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_xlarge_arabic","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.albert_xlarge_arabic").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_embeddings_albert_xlarge_arabic| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|221.6 MB| |Case sensitive:|false| ## References - https://huggingface.co/asafaya/albert-xlarge-arabic - https://oscar-corpus.com/ - http://commoncrawl.org/ - https://dumps.wikimedia.org/backup-index.html - https://github.com/google-research/albert - https://www.tensorflow.org/tfrc - https://github.com/KUIS-AI-Lab/Arabic-ALBERT/ --- layout: model title: English image_classifier_vit_Teeth_C ViTForImageClassification from steven123 author: John Snow Labs name: image_classifier_vit_Teeth_C date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_Teeth_C` is an English model originally trained by steven123. ## Predicted Entities `Good Teeth`, `Missing Teeth`, `Rotten Teeth` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Teeth_C_en_4.1.0_3.0_1660169227707.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Teeth_C_en_4.1.0_3.0_1660169227707.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_Teeth_C", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_Teeth_C", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
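`ViTForImageClassification` ends in a classification head that produces one logit per label (here `Good Teeth`, `Missing Teeth`, `Rotten Teeth`); the predicted class is the softmax argmax. A quick sketch of that final step, with toy logits rather than real model output:

```python
import math

def softmax(logits):
    # Shift by the max for numerical stability, then normalize.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Good Teeth", "Missing Teeth", "Rotten Teeth"]
probs = softmax([2.0, 0.1, -1.0])  # toy logits
print(labels[probs.index(max(probs))])  # -> Good Teeth
```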
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_Teeth_C| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Detect Clinical Conditions (ner_eu_clinical_condition - eu) author: John Snow Labs name: ner_eu_clinical_condition date: 2023-02-06 tags: [eu, clinical, licensed, ner, clinical_condition] task: Named Entity Recognition language: eu edition: Healthcare NLP 4.2.8 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition (NER) deep learning model for extracting clinical conditions from Basque texts. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNN. The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives. ## Predicted Entities `clinical_condition` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_eu_4.2.8_3.0_1675723038941.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_condition_eu_4.2.8_3.0_1675723038941.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","eu")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained('ner_eu_clinical_condition', "eu", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["""Gertaera honetatik bi hilabetetara, umea Larrialdietako Zerbitzura dator 4 egunetan zehar buruko mina eta bekokiko hantura azaltzeagatik, sukarrik izan gabe. Miaketan, haztapen mingarria duen bekokiko hantura bigunaz gain, ez da beste zeinurik azaltzen. Polakiuria eta tenesmo arina ere izan zuen egun horretan hematuriarekin batera. 
Geroztik sintomarik gabe dago."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","eu") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_eu_clinical_condition", "eu", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documenter, sentenceDetector, tokenizer, word_embeddings, ner_model, ner_converter)) val data = Seq(Array("""Gertaera honetatik bi hilabetetara, umea Larrialdietako Zerbitzura dator 4 egunetan zehar buruko mina eta bekokiko hantura azaltzeagatik, sukarrik izan gabe. Miaketan, haztapen mingarria duen bekokiko hantura bigunaz gain, ez da beste zeinurik azaltzen. Polakiuria eta tenesmo arina ere izan zuen egun horretan hematuriarekin batera. Geroztik sintomarik gabe dago.""")).toDS().toDF("text") val result = pipeline.fit(data).transform(data) ```
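The Benchmarking table at the end of this card reports tp=45, fp=4, fn=13 for `clinical_condition`. Those counts fully determine the precision, recall, and F1 figures, which is easy to sanity-check:

```python
tp, fp, fn = 45, 4, 13

precision = tp / (tp + fp)  # 45 / 49
recall = tp / (tp + fn)     # 45 / 58
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))
# -> 0.9184 0.7759 0.8411, matching the reported benchmark
```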
## Results ```bash +----------+------------------+ |chunk |ner_label | +----------+------------------+ |mina |clinical_condition| |hantura |clinical_condition| |sukarrik |clinical_condition| |mingarria |clinical_condition| |hantura |clinical_condition| |Polakiuria|clinical_condition| |sintomarik|clinical_condition| +----------+------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_eu_clinical_condition| |Compatibility:|Healthcare NLP 4.2.8+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|eu| |Size:|899.6 KB| ## References The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives. ## Benchmarking ```bash label tp fp fn total precision recall f1 clinical_condition 45.0 4.0 13.0 58.0 0.9184 0.7759 0.8411 macro - - - - - - 0.8411 micro - - - - - - 0.8411 ``` --- layout: model title: Clinical Deidentification Pipeline (English) author: John Snow Labs name: clinical_deidentification_wip date: 2022-07-08 tags: [deidentification, pipeline, clinical, en, licensed] task: De-identification language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to de-identify protected health information (PHI) in medical texts. The PHI will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`, `CITY`, `COUNTRY`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `MEDICALRECORD`, `ORGANIZATION`, `PATIENT`, `PHONE`, `STREET`, `USERNAME`, `ZIP`, `ACCOUNT`, `LICENSE`, `VIN`, `SSN`, `DLN`, `PLATE`, `IPADDR`, `EMAIL` entities. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_wip_en_4.0.0_3.0_1657298706277.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_wip_en_4.0.0_3.0_1657298706277.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification_wip", "en", "clinical/models") sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""" result = deid_pipeline.annotate(sample) print("\n".join(result['masked'])) print("\n".join(result['masked_with_chars'])) print("\n".join(result['masked_fixed_length_chars'])) print("\n".join(result['obfuscated'])) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = new PretrainedPipeline("clinical_deidentification_wip","en","clinical/models") val sample = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""" val result = deid_pipeline.annotate(sample) ``` {:.nlu-block} ```python import nlu nlu.load("en.deid.clinical_wip").predict("""Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""") ```
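The "masked with fixed length chars" output shown in the Results section replaces every detected PHI chunk with the same four-character mask. A stripped-down sketch of that one masking policy (the actual `DeIdentificationModel` also handles entity-label masking, char-for-char masking, and obfuscation):

```python
def mask_fixed_length(text, spans, mask="****"):
    # spans: sorted, non-overlapping (begin, end) character offsets of PHI chunks
    out, prev = [], 0
    for begin, end in spans:
        out.append(text[prev:begin])
        out.append(mask)
        prev = end
    out.append(text[prev:])
    return "".join(out)

sample = "Dr. John Green, ID: 123"
print(mask_fixed_length(sample, [(4, 14), (20, 23)]))
# -> Dr. ****, ID: ****
```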
## Results ```bash Masked with entity labels ------------------------------ Name : , Record date: , # . Dr. , ID, IP . He is a male was admitted to the for cystectomy on . Patient's VIN : , SSN , Driver's license . Phone , , , E-MAIL: . Masked with chars ------------------------------ Name : [**************], Record date: [********], # [****]. Dr. [********], ID[**********], IP [************]. He is a [*********] male was admitted to the [**********] for cystectomy on [******]. Patient's VIN : [***************], SSN [**********], Driver's license [*********]. Phone [************], [***************], [***********], E-MAIL: [*************]. Masked with fixed length chars ------------------------------ Name : ****, Record date: ****, # ****. Dr. ****, ID****, IP ****. He is a **** male was admitted to the **** for cystectomy on ****. Patient's VIN : ****, SSN ****, Driver's license ****. Phone ****, ****, ****, E-MAIL: ****. Obfuscated ------------------------------ Name : Craige Perks, Record date: 2093-02-06, # R2593192. Dr. Dr Felice Lacer, IDXO:4884578, IP 444.444.444.444. He is a 75 male was admitted to the MADISON VALLEY MEDICAL CENTER for cystectomy on 07-01-1972. Patient's VIN : 2BBBB11BBBB222999, SSN SSN-814-86-1962, Driver's license P055567317431. Phone 0381-6762484, Budaörsi út 14., New brunswick, E-MAIL: Reba@google.com. 
``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification_wip| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - RegexMatcherModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ChunkMergeModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: Relation Extraction between Test and Results author: John Snow Labs name: re_oncology_test_result_wip date: 2022-09-27 tags: [licensed, clinical, oncology, en, relation_extraction, test] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: RelationExtractionModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This relation extraction model links test extractions to their corresponding results. 
## Predicted Entities `is_finding_of`, `O` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_oncology_test_result_wip_en_4.0.0_3.0_1664301319746.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_oncology_test_result_wip_en_4.0.0_3.0_1664301319746.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Each relevant relation pair in the pipeline should include one test entity (such as Biomarker, Imaging_Test, Pathology_Test or Oncogene) and one result entity (such as Biomarker_Result, Pathology_Result or Tumor_Finding).
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") re_model = RelationExtractionModel.pretrained("re_oncology_test_result_wip", "en", "clinical/models") \ .setInputCols(["embeddings", "pos_tags", "ner_chunk", "dependencies"]) \ .setOutputCol("relation_extraction") \ .setRelationPairs(["Biomarker-Biomarker_Result", "Biomarker_Result-Biomarker", "Oncogene-Biomarker_Result", "Biomarker_Result-Oncogene", "Pathology_Test-Pathology_Result", "Pathology_Result-Pathology_Test"]) \ .setMaxSyntacticDistance(10) pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_model]) data = spark.createDataFrame([["Pathology showed tumor cells, which were positive for estrogen and progesterone receptors."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new 
DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentence", "pos_tags", "token")) .setOutputCol("dependencies") val re_model = RelationExtractionModel.pretrained("re_oncology_test_result_wip", "en", "clinical/models") .setInputCols(Array("embeddings", "pos_tags", "ner_chunk", "dependencies")) .setOutputCol("relation_extraction") .setRelationPairs(Array("Biomarker-Biomarker_Result", "Biomarker_Result-Biomarker", "Oncogene-Biomarker_Result", "Biomarker_Result-Oncogene", "Pathology_Test-Pathology_Result", "Pathology_Result-Pathology_Test")) .setMaxSyntacticDistance(10) val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_model)) val data = Seq("Pathology showed tumor cells, which were positive for estrogen and progesterone receptors.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu 
nlu.load("en.relation.oncology.test_result").predict("""Pathology showed tumor cells, which were positive for estrogen and progesterone receptors.""") ```
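The pair filter described above can be sketched in plain Python. This is an illustrative helper, not a Spark NLP API: the set mirrors the label pairs passed to `setRelationPairs` in the pipeline code, and listing both orderings of each pair makes the check direction-insensitive.

```python
# Illustrative sketch (plain Python, not the Spark NLP API): the relation
# extractor only scores entity pairs whose labels appear in the configured
# relation pairs. Both orders are listed, e.g. "Biomarker-Biomarker_Result"
# and "Biomarker_Result-Biomarker", so candidate order does not matter.
ALLOWED_PAIRS = {
    ("Biomarker", "Biomarker_Result"), ("Biomarker_Result", "Biomarker"),
    ("Oncogene", "Biomarker_Result"), ("Biomarker_Result", "Oncogene"),
    ("Pathology_Test", "Pathology_Result"), ("Pathology_Result", "Pathology_Test"),
}

def is_candidate(entity1: str, entity2: str) -> bool:
    """Return True when the two entity labels form an allowed test/result pair."""
    return (entity1, entity2) in ALLOWED_PAIRS

print(is_candidate("Pathology_Test", "Pathology_Result"))  # True
print(is_candidate("Biomarker", "Oncogene"))               # False
```

Pairs not listed (for example two test entities) are never scored, which keeps the extractor focused on test/result relations.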
## Results ```bash chunk1 entity1 chunk2 entity2 relation confidence Pathology Pathology_Test tumor cells Pathology_Result is_finding_of 0.53310496 positive Biomarker_Result estrogen Biomarker is_finding_of 0.9453165 positive Biomarker_Result progesterone receptors Biomarker is_finding_of 0.8816877 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_oncology_test_result_wip| |Type:|re| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]| |Output Labels:|[relations]| |Language:|en| |Size:|266.9 KB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label recall precision f1 O 0.84 0.88 0.86 is_finding_of 0.89 0.85 0.87 macro-avg 0.86 0.86 0.86 ``` --- layout: model title: Translate English to Marathi Pipeline author: John Snow Labs name: translate_en_mr date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, mr, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `mr` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/INDIAN_TRANSLATION_MARATHI/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TRANSLATION_PIPELINES_MODELS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_mr_xx_2.7.0_2.4_1609687217108.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_mr_xx_2.7.0_2.4_1609687217108.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_mr", lang = "xx") result = pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_mr", lang = "xx") val result = pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["Your sentence to translate!"] translate_df = nlu.load('xx.en.translate_to.mr').predict(text, output_level='sentence') translate_df ```
## Results ```bash +------------------------------+--------------------------+ |sentence |translation | +------------------------------+--------------------------+ |Your sentence to translate! |तू तुझ्या वाक्याचा अनुवाद करशील! | +------------------------------+--------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_mr| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English asr_wav2vec2_base_timit_demo_colab90 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab90 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab90` is an English model originally trained by hassnain. NOTE: This pipeline only works on a CPU; if you need to run this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab90_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab90_en_4.2.0_3.0_1664022247117.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab90_en_4.2.0_3.0_1664022247117.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab90', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab90", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab90| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_2 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-16-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_2_en_4.0.0_3.0_1655731657936.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_2_en_4.0.0_3.0_1655731657936.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_seed_2").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
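Extractive QA models like this one score every context token as a candidate answer start and as a candidate answer end; the predicted answer is the best-scoring valid span. A minimal plain-Python sketch of that selection step (the logit values are invented for illustration; this is not the Spark NLP internals):

```python
# Greedy span selection for extractive QA (illustrative only): pick the
# (start, end) pair with the highest combined score, subject to
# start <= end and a maximum answer length.
def best_span(start_logits, end_logits, max_len=15):
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best[0], best[1]

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.0, 0.2, 5.0, 0.0, 0.0, 0.1, 0.0, 1.0, 0.0]
end_logits   = [0.0, 0.1, 0.0, 4.5, 0.2, 0.0, 0.0, 0.1, 1.2, 0.0]
s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # Clara
```

Real models also compare the span score against the score of the "no answer" position before emitting a prediction.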
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_16_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|416.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-16-finetuned-squad-seed-2 --- layout: model title: Pipeline to Detect PHI in text (base) author: John Snow Labs name: ner_deid_sd_pipeline date: 2023-03-13 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_deid_sd](https://nlp.johnsnowlabs.com/2021/04/01/ner_deid_sd_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_sd_pipeline_en_4.3.0_3.2_1678732698616.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_sd_pipeline_en_4.3.0_3.2_1678732698616.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_sd_pipeline", "en", "clinical/models") text = '''Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_sd_pipeline", "en", "clinical/models") val text = "Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.deid.sd.pipeline").predict("""Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 302-786-5227.""") ```
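Downstream, detected PHI chunks are typically masked using their character offsets (begin and end are inclusive in the pipeline output). A minimal plain-Python sketch of offset-based masking, with illustrative chunks; this is not the Spark NLP de-identification annotator itself:

```python
# Illustrative, plain-Python masking of detected PHI chunks by character
# offsets (begin and end both inclusive, as in the pipeline output).
# Replacing right-to-left keeps earlier offsets valid while editing.
def mask_phi(text, chunks):
    for begin, end, label in sorted(chunks, key=lambda c: c[0], reverse=True):
        text = text[:begin] + f"<{label}>" + text[end + 1:]
    return text

text = "Record date : 2093-01-13 , David Hale , M.D ."
chunks = [(14, 23, "DATE"), (27, 36, "NAME")]
print(mask_phi(text, chunks))  # Record date : <DATE> , <NAME> , M.D .
```

Healthcare NLP ships a dedicated `DeIdentification` annotator for this step; the sketch only shows why the inclusive offsets in the output are sufficient to reconstruct a masked document.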
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:------------------------------|--------:|------:|:------------|-------------:| | 0 | 2093-01-13 | 14 | 23 | DATE | 0.9952 | | 1 | David Hale | 27 | 36 | NAME | 0.9834 | | 2 | Hendrickson Ora | 55 | 69 | NAME | 0.97745 | | 3 | 7194334 | 78 | 84 | ID | 0.999 | | 4 | 01/13/93 | 93 | 100 | DATE | 0.983 | | 5 | Oliveira | 110 | 117 | NAME | 0.9965 | | 6 | 25 | 121 | 122 | AGE | 0.9899 | | 7 | 2079-11-09 | 150 | 159 | DATE | 0.9841 | | 8 | Cocke County Baptist Hospital | 163 | 191 | LOCATION | 0.84345 | | 9 | 0295 Keats Street | 195 | 211 | LOCATION | 0.775333 | | 10 | 302-786-5227 | 221 | 232 | CONTACT | 0.9492 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_sd_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English asr_wav2vec2_xls_r_300m_AM_CV8_v1 TFWav2Vec2ForCTC from emre author: John Snow Labs name: asr_wav2vec2_xls_r_300m_AM_CV8_v1 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_AM_CV8_v1` is an English model originally trained by emre. 
NOTE: This model only works on a CPU; if you need to run this model on a GPU device, please use asr_wav2vec2_xls_r_300m_AM_CV8_v1_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_AM_CV8_v1_en_4.2.0_3.0_1664038251873.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_AM_CV8_v1_en_4.2.0_3.0_1664038251873.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xls_r_300m_AM_CV8_v1", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xls_r_300m_AM_CV8_v1", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
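Under the hood, a CTC model such as Wav2Vec2ForCTC emits one token per audio frame, and decoding collapses consecutive repeats and removes the blank token. A minimal plain-Python sketch of greedy CTC collapse (illustrative only, not the Spark NLP implementation; the blank symbol name is an assumption):

```python
# Greedy CTC decoding (illustrative): merge consecutive duplicate frame
# predictions, then drop the blank token. The blank lets the model emit
# genuine double letters, e.g. the "ll" in "hello" below.
BLANK = "<pad>"  # assumed blank symbol; real vocabularies vary

def ctc_collapse(frame_tokens):
    out, prev = [], None
    for tok in frame_tokens:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return "".join(out)

frames = ["h", "h", "<pad>", "e", "l", "<pad>", "l", "l", "o"]
print(ctc_collapse(frames))  # hello
```

Without the blank between the two "l" frames, the repeats would collapse into a single "l", which is why CTC vocabularies reserve a blank token.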
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xls_r_300m_AM_CV8_v1| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Translate English to Swazi Pipeline author: John Snow Labs name: translate_en_ss date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, ss, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `ss` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ss_xx_2.7.0_2.4_1609691721086.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ss_xx_2.7.0_2.4_1609691721086.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ss", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ss", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ss').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_ss| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering Base Uncased model (from alistvt) author: John Snow Labs name: bert_qa_base_uncased_pretrain_finetuned_coqa_falttened date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-pretrain-finetuned-coqa-falttened` is an English model originally trained by `alistvt`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_pretrain_finetuned_coqa_falttened_en_4.0.0_3.0_1657185673805.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_pretrain_finetuned_coqa_falttened_en_4.0.0_3.0_1657185673805.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_pretrain_finetuned_coqa_falttened","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_pretrain_finetuned_coqa_falttened","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_pretrain_finetuned_coqa_falttened| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/alistvt/bert-base-uncased-pretrain-finetuned-coqa-falttened --- layout: model title: Match Chunks in Texts author: John Snow Labs name: match_chunks date: 2022-01-04 tags: [en, open_source] task: Pipeline Public language: en nav_key: models edition: Spark NLP 3.3.4 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The pipeline uses regex `
<DT>?<JJ>*<NN>+`, which matches chunks over part-of-speech tags: an optional determiner, any number of adjectives, and one or more nouns. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/match_chunks_en_3.3.4_3.0_1641307675339.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/match_chunks_en_3.3.4_3.0_1641307675339.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline_local = PretrainedPipeline('match_chunks') result = pipeline_local.annotate("David visited the restaurant yesterday with his family. He also visited and the day before, but at that time he was alone. David again visited today with his colleagues. He and his friends really liked the food and hoped to visit again tomorrow.") result['chunk'] ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline import com.johnsnowlabs.nlp.SparkNLP SparkNLP.version() val testData = spark.createDataFrame(Seq( (1, "David visited the restaurant yesterday with his family. He also visited and the day before, but at that time he was alone. David again visited today with his colleagues. He and his friends really liked the food and hoped to visit again tomorrow."))).toDF("id", "text") val pipeline = PretrainedPipeline("match_chunks", lang="en") val annotation = pipeline.transform(testData) annotation.show() ``` {:.nlu-block} ```python import nlu nlu.load("en.match.chunks").predict("""David visited the restaurant yesterday with his family. He also visited and the day before, but at that time he was alone. David again visited today with his colleagues. He and his friends really liked the food and hoped to visit again tomorrow.""") ```
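The chunker matches a pattern over part-of-speech tags rather than over words. A minimal plain-Python sketch of the idea, using the common determiner–adjective–noun pattern `<DT>?<JJ>*<NN>+` as an example (illustrative only, not the Spark NLP Chunker):

```python
import re

# Illustrative POS-pattern chunking: encode each token's POS tag as "<TAG>",
# join the tags into one string, and run a tag regex over it. Character
# offsets of a match are mapped back to token indices by counting "<".
def chunk_spans(tags, pattern=r"(?:<DT>)?(?:<JJ>)*(?:<NN>)+"):
    tag_str = "".join(f"<{t}>" for t in tags)
    spans = []
    for m in re.finditer(pattern, tag_str):
        start = tag_str[:m.start()].count("<")
        end = start + m.group().count("<")
        spans.append((start, end))
    return spans

tokens = ["David", "visited", "the", "restaurant", "yesterday"]
tags   = ["NNP",   "VBD",     "DT",  "NN",         "NN"]
for s, e in chunk_spans(tags):
    print(" ".join(tokens[s:e]))  # the restaurant yesterday
```

This recovers the first chunk shown in the Results section ("yesterday" is tagged as a noun here, so it joins the noun-phrase chunk).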
## Results ```bash ['the restaurant yesterday', 'family', 'the day', 'that time', 'today', 'the food', 'tomorrow'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|match_chunks| |Type:|pipeline| |Compatibility:|Spark NLP 3.3.4+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|4.1 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - PerceptronModel - Chunker --- layout: model title: Sentence Entity Resolver for Snomed Aux Concepts, INT version (``sbiobert_base_cased_mli`` embeddings) author: John Snow Labs name: sbiobertresolve_snomed_auxConcepts date: 2021-05-16 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts to SNOMED codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It is capable of extracting Morph Abnormality, Procedure, Substance, Physical Object, and Body Structure concepts of SNOMED codes. It has a faster load time, with a speedup of about 6X compared to previous versions. The load process is also more memory-friendly: the maximum memory required during loading is smaller, reducing the chance of OOM exceptions and thus relaxing hardware requirements. ## Predicted Entities Predicts SNOMED codes and their normalized definitions for each chunk. 
{:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_auxConcepts_en_3.0.4_3.0_1621189567327.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_auxConcepts_en_3.0.4_3.0_1621189567327.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") chunk2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") snomed_aux_int_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_snomed_auxConcepts","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_aux_int_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . 
Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("sbert_embeddings") val snomed_aux_int_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_snomed_auxConcepts","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_aux_int_resolver)) val data = Seq("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , 
gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret's Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.snomed").predict("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret's Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""") ```
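The resolver compares each chunk's sentence embedding against precomputed concept embeddings using the configured distance function (`EUCLIDEAN` above) and returns the closest code. A minimal plain-Python sketch of that nearest-neighbor lookup, with a toy two-concept index (the vectors are invented for illustration; real sentence embeddings are 768-dimensional):

```python
import math

# Illustrative nearest-neighbor resolution with Euclidean distance.
# The code-to-vector index here is a toy stand-in for the resolver's
# precomputed SNOMED concept embeddings.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def resolve(chunk_vec, code_index):
    """Return the code whose embedding is closest to the chunk embedding."""
    return min(code_index, key=lambda code: euclidean(chunk_vec, code_index[code]))

code_index = {
    "41976001":  [0.9, 0.1, 0.0],  # toy vector for "cardiac catheterization"
    "148439002": [0.1, 0.8, 0.2],  # toy vector for another concept
}
print(resolve([0.85, 0.15, 0.05], code_index))  # 41976001
```

The `resolutions` and `codes` columns in the output are simply the k nearest neighbors under this distance, ordered by increasing distance.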
## Results ```bash +--------------------+-----+---+---------+---------------+----------+--------------------+--------------------+ | chunk|begin|end| entity| code|confidence| resolutions| codes| +--------------------+-----+---+---------+---------------+----------+--------------------+--------------------+ | hypertension| 68| 79| PROBLEM| 148439002| 0.2138|risk factors pres...|148439002:::42595...| |chronic renal ins...| 83|109| PROBLEM| 722403003| 0.8517|gastrointestinal ...|722403003:::13781...| | COPD| 113|116| PROBLEM|845101000000100| 0.0962|management of chr...|845101000000100::...| | gastritis| 120|128| PROBLEM| 711498001| 0.3398|magnetic resonanc...|711498001:::71771...| | TIA| 136|138| PROBLEM| 449758002| 0.1927|traumatic infarct...|449758002:::85844...| |a non-ST elevatio...| 182|202| PROBLEM| 1411000087101| 0.0823|ct of left knee::...|1411000087101:::3...| |Guaiac positive s...| 208|229| PROBLEM| 388507006| 0.0555|asparagus rast:::...|388507006:::71771...| |cardiac catheteri...| 295|317| TEST| 41976001| 0.9790|cardiac catheteri...|41976001:::705921...| | PTCA| 324|327|TREATMENT| 312644004| 0.0616|angioplasty of po...|312644004:::41507...| | mid LAD lesion| 332|345| PROBLEM| 91749005| 0.1399|structure of firs...|91749005:::917470...| +--------------------+-----+---+---------+---------------+----------+--------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_snomed_auxConcepts| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk, sbert_embeddings]| |Output Labels:|[snomed_code_ct_aux_loaded]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on SNOMED (INT version) Findings with ``sbiobert_base_cased_mli`` sentence embeddings. 
http://www.snomed.org/ --- layout: model title: Multilingual BertForQuestionAnswering Cased model (from roshnir) author: John Snow Labs name: bert_qa_mbert_finetuned_mlqa_en_hi_dev date: 2022-07-07 tags: [xx, open_source, bert, question_answering] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mBert-finetuned-mlqa-dev-en-hi` is a Multilingual model originally trained by `roshnir`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_en_hi_dev_xx_4.0.0_3.0_1657189998034.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_en_hi_dev_xx_4.0.0_3.0_1657189998034.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_en_hi_dev","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_en_hi_dev","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_mbert_finetuned_mlqa_en_hi_dev| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|626.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/roshnir/mBert-finetuned-mlqa-dev-en-hi --- layout: model title: Italian T5ForConditionalGeneration Small Cased model (from it5) author: John Snow Labs name: t5_it5_efficient_small_el32_news_summarization date: 2023-01-30 tags: [it, open_source, t5, tensorflow] task: Text Generation language: it edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `it5-efficient-small-el32-news-summarization` is an Italian model originally trained by `it5`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_news_summarization_it_4.3.0_3.0_1675103478274.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_news_summarization_it_4.3.0_3.0_1675103478274.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_news_summarization","it") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_news_summarization","it") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_it5_efficient_small_el32_news_summarization| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|it| |Size:|594.0 MB| ## References - https://huggingface.co/it5/it5-efficient-small-el32-news-summarization - https://github.com/stefan-it - https://arxiv.org/abs/2203.03759 - https://gsarti.com - https://malvinanissim.github.io - https://arxiv.org/abs/2109.10686 - https://github.com/gsarti/it5 - https://paperswithcode.com/sota?task=News+Summarization&dataset=NewsSum-IT --- layout: model title: Detect Clinical Entities (ner_jsl) author: John Snow Labs name: ner_jsl date: 2021-04-30 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terminology. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Definitions of Predicted Entities: - `Symptom`: All the symptoms mentioned in the document, of a patient or someone else. - `Pulse`: Peripheral heart rate, without advanced information like measurement location. - `Death_Entity`: Mentions that indicate the death of a patient. - `Age`: All mentions of ages, past or present, related to the patient or anybody else. - `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately. - `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs). 
- `Drug_Ingredient`: Active ingredient/s found in drug products. - `Weight`: All mentions related to a patient's weight. - `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments. - `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted. - `Gender`: Gender-specific nouns and pronouns. - `Temperature`: All mentions that refer to body temperature. - `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital signs Headers are extracted separately with their specific labels). - `Route`: Drug and medication administration routes, as described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements. - `Respiration`: Number of breaths per minute. - `Frequency`: Frequency of administration for a dose prescribed. - `Dosage`: Quantity prescribed by the physician for an active ingredient; measurement units are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Allergen`: Allergen-related extractions mentioned in the document. 
## Predicted Entities `Diagnosis` , `Procedure_Name` , `Lab_Result` , `Procedure` , `Procedure_Findings` , `O2_Saturation` , `Procedure_incident_description` , `Dosage` , `Causative_Agents_(Virus_and_Bacteria)` , `Name` , `Cause_of_death` , `Substance_Name` , `Weight` , `Symptom_Name` , `Maybe` , `Modifier` , `Blood_Pressure` , `Frequency` , `Gender` , `Drug_incident_description` , `Age` , `Drug_Name` , `Temperature` , `Section_Name` , `Route` , `Negation` , `Negated` , `Allergenic_substance` , `Lab_Name` , `Respiratory_Rate` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_en_3.0.0_3.0_1619768531594.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_en_3.0.0_3.0_1619768531594.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") jsl_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("jsl_ner") jsl_ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "jsl_ner"]) \ .setOutputCol("ner_chunk") jsl_ner_pipeline = Pipeline().setStages([ documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter]) jsl_ner_model = jsl_ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature."]]).toDF("text") result = jsl_ner_model.transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val jsl_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("jsl_ner") val jsl_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "jsl_ner")) .setOutputCol("ner_chunk") val jsl_ner_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter)) val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text") val result = jsl_ner_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
## Results ```bash +---------------------------+------------+ |chunk |ner | +---------------------------+------------+ |21-day-old |Age | |male |Gender | |congestion |Symptom_Name| |mom |Gender | |suctioning yellow discharge|Symptom_Name| |she |Gender | |mild |Modifier | |problems with his breathing|Symptom_Name| |negative |Negated | |perioral cyanosis |Symptom_Name| |retractions |Symptom_Name| |mom |Gender | |Tylenol |Drug_Name | |His |Gender | |his |Gender | |respiratory congestion |Symptom_Name| |He |Gender | |tired |Symptom_Name| |fussy |Symptom_Name| |albuterol |Drug_Name | +---------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on data gathered and manually annotated by John Snow Labs. https://www.johnsnowlabs.com/data/ ## Benchmarking ```bash label tp fp fn prec rec f1 B-Pulse_Rate 77 39 12 0.663793 0.865169 0.75122 I-Diagnosis 2134 1139 1329 0.652001 0.616229 0.63361 I-Procedure_Name 2335 1329 956 0.637282 0.709511 0.671459 B-Lab_Result 601 182 94 0.767561 0.864748 0.813261 B-Procedure 1 0 5 1 0.166667 0.285714 B-Procedure_Findings 2 13 72 0.133333 0.027027 0.044944 B-O2_Saturation 1 3 4 0.25 0.2 0.222222 B-Dosage 477 197 68 0.707715 0.875229 0.782609 I-Causative_Agents_(Virus_and_Bacteria) 12 2 7 0.857143 0.631579 0.727273 B-Name 562 268 554 0.677108 0.503584 0.577595 I-Cause_of_death 9 5 11 0.642857 0.45 0.529412 I-Substance_Name 24 34 54 0.413793 0.307692 0.352941 I-Name 716 377 710 0.655078 0.502104 0.56848 B-Cause_of_death 9 6 8 0.6 0.529412 0.5625 B-Weight 52 22 9 0.702703 0.852459 0.77037 B-Symptom_Name 4364 1916 1652 0.694904 0.725399 0.709824 I-Maybe 27 51 61 0.346154 0.306818 0.325301 I-Symptom_Name 2073 1492 2348 0.581487 0.468898 0.519159 B-Modifier 1573 890 768 0.638652 0.671935 0.654871 
B-Blood_Pressure 76 19 13 0.8 0.853933 0.826087 B-Frequency 308 134 77 0.696833 0.8 0.744861 I-Gender 26 31 28 0.45614 0.481482 0.468468 I-Drug_incident_description 4 10 57 0.285714 0.0655738 0.106667 B-Drug_incident_description 2 5 23 0.285714 0.08 0.125 I-Age 5 0 9 1 0.357143 0.526316 B-Drug_Name 1741 490 290 0.780368 0.857213 0.816987 B-Substance_Name 148 41 48 0.783069 0.755102 0.768831 B-Temperature 56 23 13 0.708861 0.811594 0.756757 I-Procedure 1 0 7 1 0.125 0.222222 B-Section_Name 2711 317 166 0.89531 0.942301 0.918205 I-Route 119 110 189 0.519651 0.386364 0.443203 B-Maybe 143 80 127 0.641256 0.52963 0.580122 B-Gender 5166 709 58 0.879319 0.988897 0.930895 I-Dosage 434 196 87 0.688889 0.833013 0.754127 B-Causative_Agents_(Virus_and_Bacteria) 19 3 8 0.863636 0.703704 0.77551 I-Frequency 275 134 191 0.672372 0.590129 0.628571 B-Age 357 27 16 0.929688 0.957105 0.943197 I-Lab_Result 45 78 152 0.365854 0.228426 0.28125 B-Negation 99 38 38 0.722628 0.722628 0.722628 B-Diagnosis 2786 1342 913 0.674903 0.753177 0.711895 I-Section_Name 3885 1353 179 0.741695 0.955955 0.835304 B-Route 421 217 166 0.659875 0.717206 0.687347 I-Negation 11 30 24 0.268293 0.314286 0.289474 B-Procedure_Name 1490 811 522 0.647545 0.740557 0.690934 B-Negated 1490 332 215 0.817783 0.8739 0.844911 I-Allergenic_substance 1 0 12 1 0.0769231 0.142857 I-Negated 89 132 146 0.402715 0.378723 0.390351 I-Procedure_Findings 2 31 283 0.0606061 0.00701754 0.012579 B-Allergenic_substance 72 29 24 0.712871 0.75 0.730965 I-Weight 47 35 16 0.573171 0.746032 0.648276 B-Lab_Name 804 290 122 0.734918 0.868251 0.79604 I-Modifier 99 73 422 0.575581 0.190019 0.285714 I-Temperature 1 0 14 1 0.0666667 0.125 I-Drug_Name 362 284 261 0.560372 0.581059 0.570528 I-Lab_Name 284 194 127 0.594142 0.690998 0.63892 B-Respiratory_Rate 46 5 5 0.901961 0.901961 0.901961 Macro-average 38674 15571 13819 0.589085 0.515426 0.5498 Micro-average 38674 15571 13819 0.712951 0.736746 0.724653 ``` --- layout: model title: Legal Due 
Authorization Clause Binary Classifier author: John Snow Labs name: legclf_due_authorization_clause date: 2023-01-29 tags: [en, legal, classification, tax, treatment, clauses, due_authorization, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `due-authorization` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at the sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into account that the embeddings of this model allow up to 512 tokens; if your input is longer, consider splitting it into smaller pieces (see the tutorial linked above). This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. 
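The splitting strategies listed above can be sketched in plain Python before handing each chunk to the classifier. The `split_paragraphs` helper and the sample text are illustrative assumptions for this sketch, not part of Spark NLP:

```python
import re

def split_paragraphs(text):
    """Split a document into paragraphs on blank lines, i.e. the
    'paragraph splitting (by multiline)' strategy mentioned above.
    Illustrative helper only."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

# Hypothetical two-clause document
sample = (
    "SECTION 1. Due Authorization.\n"
    "The execution and delivery of this Agreement have been duly authorized.\n"
    "\n"
    "SECTION 2. Governing Law.\n"
    "This Agreement shall be governed by the laws of the State of Delaware."
)

# Each resulting paragraph can then go through the classification pipeline,
# keeping every piece well under the 512-token limit of the embeddings.
for clause in split_paragraphs(sample):
    print(clause.splitlines()[0])
```

Each printed heading marks one paragraph that would be sent through the classifier as its own row.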
## Predicted Entities `due-authorization`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_due_authorization_clause_en_1.0.0_3.0_1674993500619.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_due_authorization_clause_en_1.0.0_3.0_1674993500619.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_due_authorization_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
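Combining this classifier with other clause classifiers, as the description suggests, amounts to collecting one True/False flag per model. A minimal plain-Python sketch of that aggregation step; the classifier names and predictions below are invented for illustration and do not come from Models Hub:

```python
def aggregate_clause_flags(predictions):
    """Collapse per-model outputs into clause -> bool flags.

    Each binary clause classifier predicts its own clause label when
    positive and 'other' otherwise, so the flag is an equality check.
    """
    return {clause: predicted == clause for clause, predicted in predictions}

# Hypothetical outputs of three clause classifiers on one paragraph
preds = [
    ("due-authorization", "due-authorization"),
    ("governing-law", "other"),
    ("tax-treatment", "other"),
]
print(aggregate_clause_flags(preds))
# {'due-authorization': True, 'governing-law': False, 'tax-treatment': False}
```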
## Results ```bash +-------------------+ |result | +-------------------+ |[due-authorization]| |[other] | |[other] | |[due-authorization]| +-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_due_authorization_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support due-authorization 0.98 1.00 0.99 61 other 1.00 0.99 1.00 106 accuracy - - 0.99 167 macro-avg 0.99 1.00 0.99 167 weighted-avg 0.99 0.99 0.99 167 ``` --- layout: model title: Ukrainian RoBERTa Embeddings author: John Snow Labs name: roberta_embeddings_ukr_roberta_base date: 2022-04-14 tags: [roberta, embeddings, uk, open_source] task: Embeddings language: uk edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `ukr-roberta-base` is a Ukrainian model originally trained by `youscan`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_ukr_roberta_base_uk_3.4.2_3.0_1649948817339.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_ukr_roberta_base_uk_3.4.2_3.0_1649948817339.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_ukr_roberta_base","uk") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Я люблю Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_ukr_roberta_base","uk") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Я люблю Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("uk.embed.ukr_roberta_base").predict("""Я люблю Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_ukr_roberta_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|uk| |Size:|474.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/youscan/ukr-roberta-base - https://dumps.wikimedia.org/ukwiki/latest/ukwiki-latest-pages-articles.xml.bz2 - https://oscar-public.huma-num.fr/shuffled/uk_dedup.txt.gz - https://github.com/youscan/language-models - https://twitter.com/vitaliradchenko --- layout: model title: Detect Chemical Compounds and Genes (BertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_chemprot date: 2021-10-19 tags: [berfortokenclassification, ner, chemprot, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.0 spark_version: 2.4 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect chemical compounds and genes in the medical text using the pretrained NER model. This model is trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. 
## Predicted Entities `CHEMICAL`, `GENE-Y`, `GENE-N` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CHEMPROT_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_CHEMPROT_CLINICAL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemprot_en_3.3.0_2.4_1634644903577.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemprot_en_3.3.0_2.4_1634644903577.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_chemprot", "en", "clinical/models")\ .setInputCols("token", "document")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, tokenizer, tokenClassifier, ner_converter]) sample_text = "Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium." df = spark.createDataFrame([[sample_text]]).toDF("text") result = pipeline.fit(df).transform(df) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_chemprot", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("ner") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("document","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, tokenizer, tokenClassifier, ner_converter)) val data = Seq("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.chemprot.bert").predict("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""") ```
## Results ```bash +-------------------------------+---------+ |chunk |ner_label| +-------------------------------+---------+ |Keratinocyte growth factor |GENE-Y | |acidic fibroblast growth factor|GENE-Y | +-------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_chemprot| |Compatibility:|Healthcare NLP 3.3.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Case sensitive:|true| |Max sentence length:|512| ## Data Source This model is trained on a [ChemProt corpus](https://biocreative.bioinformatics.udel.edu/). ## Benchmarking ```bash label precision recall f1-score support B-CHEMICAL 0.93 0.79 0.85 8649 B-GENE-N 0.63 0.56 0.59 2752 B-GENE-Y 0.82 0.73 0.77 5490 I-CHEMICAL 0.90 0.79 0.84 1313 I-GENE-N 0.72 0.62 0.67 1993 I-GENE-Y 0.81 0.72 0.77 2420 accuracy - - 0.73 22617 macro-avg 0.75 0.74 0.75 22617 weighted-avg 0.83 0.73 0.78 22617 ``` --- layout: model title: Legal Topic Classification on Greek Legislation author: John Snow Labs name: legclf_legal_code date: 2023-05-12 tags: [el, legal, classification, bert, licensed, tensorflow] task: Text Classification language: el edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Greek Legal Code (GLC) is a dataset consisting of approx. 47k legal resources from Greek legislation. The origin of GLC is “Permanent Greek Legislation Code - Raptarchis”, a collection of Greek legislative documents classified into multi-level (from broader to more specialized) categories. 
Given the text of a document, the `legclf_legal_code` model predicts the corresponding class ## Predicted Entities `ΒΙΟΜΗΧΑΝΙΚΗ_ΝΟΜΟΘΕΣΙΑ`, `ΝΟΜΟΘΕΣΙΑ_ΔΗΜΩΝ_ΚΑΙ_ΚΟΙΝΟΤΗΤΩΝ`, `ΕΚΠΑΙΔΕΥΤΙΚΗ_ΝΟΜΟΘΕΣΙΑ`, `ΕΘΝΙΚΗ_ΑΜΥΝΑ`, `ΕΚΚΛΗΣΙΑΣΤΙΚΗ_ΝΟΜΟΘΕΣΙΑ`, `ΕΘΝΙΚΗ_ΟΙΚΟΝΟΜΙΑ`, `ΔΗΜΟΣΙΟ_ΛΟΓΙΣΤΙΚΟ`, `ΕΜΠΟΡΙΚΗ_ΝΟΜΟΘΕΣΙΑ`, `ΣΤΡΑΤΟΣ_ΞΗΡΑΣ`, `ΔΗΜΟΣΙΟΙ_ΥΠΑΛΛΗΛΟΙ`, `ΔΙΠΛΩΜΑΤΙΚΗ_ΝΟΜΟΘΕΣΙΑ`, `ΤΕΛΩΝΕΙΑΚΗ_ΝΟΜΟΘΕΣΙΑ`, `ΤΑΧΥΔΡΟΜΕΙΑ_ΤΗΛΕΠΙΚΟΙΝΩΝΙΕΣ`, `ΠΟΛΕΜΙΚΟ_ΝΑΥΤΙΚΟ`, `ΑΣΦΑΛΙΣΤΙΚΑ_ΤΑΜΕΙΑ`, `ΓΕΩΡΓΙΚΗ_ΝΟΜΟΘΕΣΙΑ`, `ΑΓΡΟΤΙΚΗ_ΝΟΜΟΘΕΣΙΑ`, `ΑΣΤΙΚΗ_ΝΟΜΟΘΕΣΙΑ`, `ΠΟΛΙΤΙΚΗ_ΔΙΚΟΝΟΜΙΑ`, `ΚΟΙΝΩΝΙΚΕΣ_ΑΣΦΑΛΙΣΕΙΣ`, `ΟΙΚΟΝΟΜΙΚΗ_ΔΙΟΙΚΗΣΗ`, `ΔΙΟΙΚΗΤΙΚΗ_ΝΟΜΟΘΕΣΙΑ`, `ΠΟΛΕΜΙΚΗ_ΑΕΡΟΠΟΡΙΑ`, `ΔΑΣΗ_ΚΑΙ_ΚΤΗΝΟΤΡΟΦΙΑ`, `ΑΓΟΡΑΝΟΜΙΚΗ_ΝΟΜΟΘΕΣΙΑ`, `ΝΟΜΟΘΕΣΙΑ_ΕΠΙΜΕΛΗΤΗΡΙΩΝ_ΣΥΝΕΤΑΙΡΙΣΜΩΝ_ΚΑΙ_ΣΩΜΑΤΕΙΩΝ`, `ΥΓΕΙΟΝΟΜΙΚΗ_ΝΟΜΟΘΕΣΙΑ`, `ΔΙΟΙΚΗΣΗ_ΔΙΚΑΙΟΣΥΝΗΣ`, `ΣΥΓΚΟΙΝΩΝΙΕΣ`, `ΠΟΛΙΤΙΚΗ_ΑΕΡΟΠΟΡΙΑ`, `ΕΠΙΣΤΗΜΕΣ_ΚΑΙ_ΤΕΧΝΕΣ`, `ΡΑΔΙΟΦΩΝΙΑ_ΚΑΙ_ΤΥΠΟΣ`, `ΕΡΓΑΤΙΚΗ_ΝΟΜΟΘΕΣΙΑ`, `ΛΙΜΕΝΙΚΗ_ΝΟΜΟΘΕΣΙΑ`, `ΣΥΝΤΑΓΜΑΤΙΚΗ_ΝΟΜΟΘΕΣΙΑ`, `ΑΜΕΣΗ_ΦΟΡΟΛΟΓΙΑ`, `ΑΣΤΥΝΟΜΙΚΗ_ΝΟΜΟΘΕΣΙΑ`, `ΔΗΜΟΣΙΑ_ΕΡΓΑ`, `ΝΟΜΙΚΑ_ΠΡΟΣΩΠΑ_ΔΗΜΟΣΙΟΥ_ΔΙΚΑΙΟΥ`, `ΠΟΙΝΙΚΗ_ΝΟΜΟΘΕΣΙΑ`, `ΤΥΠΟΣ_ΚΑΙ_ΤΟΥΡΙΣΜΟΣ`, `ΚΟΙΝΩΝΙΚΗ_ΠΡΟΝΟΙΑ`, `ΕΜΠΟΡΙΚΗ_ΝΑΥΤΙΛΙΑ`, `ΠΕΡΙΟΥΣΙΑ_ΔΗΜΟΣΙΟΥ_ΚΑΙ_ΝΟΜΙΣΜΑ`, `ΕΜΜΕΣΗ_ΦΟΡΟΛΟΓΙΑ`, `ΕΛΕΓΚΤΙΚΟ_ΣΥΝΕΔΡΙΟ_ΚΑΙ_ΣΥΝΤΑΞΕΙΣ`, `ΝΟΜΟΘΕΣΙΑ_ΑΝΩΝΥΜΩΝ_ΕΤΑΙΡΕΙΩΝ_ΤΡΑΠΕΖΩΝ_ΚΑΙ_ΧΡΗΜΑΤΙΣΤΗΡΙΩΝ` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_legal_code_el_1.0.0_3.0_1683904327601.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_legal_code_el_1.0.0_3.0_1683904327601.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequence_classifier = legal.BertForSequenceClassification.pretrained("legclf_legal_code", "el", "legal/models")\ .setInputCols(["document","token"])\ .setOutputCol("class")\ .setCaseSensitive(True)\ .setMaxSentenceLength(512) clf_pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, sequence_classifier ]) empty_df = spark.createDataFrame([['']]).toDF("text") model = clf_pipeline.fit(empty_df) text = """14. ΠΡΟΕΔΡΙΚΟΝ ΔΙΑΤΑΓΜΑ υπ’ αριθ. 545 της 5/25 Ιουλ. 1979 (ΦΕΚ Α΄ 168) Περί αυξήσεως των υπό του Ταμείου Προνοίας Εργοληπτών Δημοσίων Έργων παρεχομένων εφ’ άπαξ βοηθημάτων. Έχοντες υπ’ όψιν: 1.Τας διατάξεις της παρ. 8 του άρθρ. 2 του Ν.Δ. 75/1946 «περί συστάσεως Ταμείου Προνοίας Εργοληπτών Δημοσίων Έργων». 2.Τας διατάξεις της παρ. 2 του άρθρ. 12 του Νόμ. 400/1976 «περί Υπουργικού Συμβουλίου και Υπουργείων» (ΦΕΚ 203/76 τ.Α΄). 3.Τας διατάξεις των άρθρ. 17 παρ. 2 εδάφ. β΄ περίπτ. αα΄ και 113 παρ. 2 εδάφ. α΄ του Π.Δ. 544/1977 (ΦΕΚ 178/77 τ.Α΄) ως η τελευταία αντικατεστάθη δια της παρ. 1 του άρθρ. 2 του Νόμ. 728/1977 (ΦΕΚ 316/77 τ.Α΄). 4.Την υπ’ αριθ. Δ3/2087/6.12.77 (ΦΕΚ 1278/77 τ.Β΄) απόφασιν του Πρωθυπουργού και του Υπουργού Κοινωνικών Υπηρεσιών «περί αναθέσεως αρμοδιοτήτων στους Υφυπουργούς Κοινωνικών Υπηρεσιών». 5.Την σύμφωνον γνώμην του Διοικητικού Συμβουλίου του Ταμείου Προνοίας Εργοληπτών Δημοσίων Έργων, ληφθείσα κατά την υπ’ αριθ. 1/17.1.79 συνεδρίασιν αυτού και υποβληθείσα ημίν δια της υπ’ αριθ. 583/24.1.79 αναφοράς του Ταμείου. 6.Την γνωμοδότησιν του Συμβουλίου Κοινων. Ασφαλείας, ληφθείσα κατά την υπ’ αριθ. 9/11.4.79 συνεδρίασιν αυτού της Κ΄ περιόδου. 7.Την υπ’ αριθ. 
490/1979 γνωμοδότησιν του Συμβουλίου της Επικρατείας, προτάσει του επί των Κοινωνικών Υπηρεσιών Υφυπουργού, αποφασίζομεν: Άρθρον μόνον.-1.Το υπό του Ταμείου Προνοίας Εργοληπτών Δημοσίων Έργων παρεχόμενον εις τους εξερχομένους του επαγγέλματος δι’ οιονδήποτε λόγον ησφαλισμένος αυτού, πλήρες εφ’ άπαξ βοήθημα (χορηγία) δια τους έχοντας συμπεπληρωμένην 35ετή υπηρεσίαν, καθορίζεται εφ’ εξής ως κάτωθι: α)Δια τους ησφαλισμένους Εργολήπτας Δημοσίων Έργων Γ΄ και Δ΄ τάξεως ως και τους μετόχους υπαλλήλους του Ταμείου και των Εργοληπτικών Οργανώσεων από του 6ου βαθμού συμπεριλαμβανομένου και άνω εις 474.500 δραχμάς. β)Δια τους λοιπούς ησφαλισμένους του Ταμείου εις 357.000 δραχμάς. 2.Εις περίπτωσιν κατά την οποίαν η συνολική υπηρεσία οιουδήποτε εκ των ανωτέρω ησφαλισμένων είναι μικροτέρα των 35 ετών και εφ’ όσον συντρέχουν αι προϋποθέσεις του άρθρ. 1 του Β.Δ/τος της 13/29 Μαρτ. 1947 «περί χορηγιών του Ταμείου Προνοίας Εργοληπτών Δημοσίων Έργων» ως ισχύει κατόπιν των τροποποιήσεων και συμπληρώσεων αυτού, ο ησφαλισμένος δικαιούται τόσων τριακοστών πέμπτων του πλήρους εφ’ άπαξ βοηθήματος, όσα και τα έτη της υπηρεσίας αυτού. 3.Προϋπηρεσία πλέον των 35 ετών δεν αναγνωρίζεται. (Αντί για τη σελ. 224,03(α) Σελ. 224,03(β) Τεύχος 709-Σελ. 113 Ταμείο Προνοίας Εργοληπτών 23.Γ.ε.12-14 Εις τον επί των Κοινωνικών Υπηρεσιών Υφυπουργόν ανατίθεμεν την δημοσίευσιν και εκτέλεσιν του παρόντος.""" result = model.transform(spark.createDataFrame([[text]]).toDF("text")) ```
## Results ```bash +--------------------+--------------+ | text| result| +--------------------+--------------+ |14. ΠΡΟΕΔΡΙΚΟΝ ΔΙ...|[ΔΗΜΟΣΙΑ_ΕΡΓΑ]| +--------------------+--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_legal_code| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|el| |Size:|356.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References The dataset is available [here](https://huggingface.co/datasets/greek_legal_code) ## Benchmarking ```bash label precision recall f1-score support ΑΓΟΡΑΝΟΜΙΚΗ_ΝΟΜΟΘΕΣΙΑ 0.79 0.83 0.81 41 ΑΓΡΟΤΙΚΗ_ΝΟΜΟΘΕΣΙΑ 0.84 0.87 0.85 30 ΑΜΕΣΗ_ΦΟΡΟΛΟΓΙΑ 0.87 0.90 0.89 94 ΑΣΤΙΚΗ_ΝΟΜΟΘΕΣΙΑ 0.81 0.83 0.82 41 ΑΣΤΥΝΟΜΙΚΗ_ΝΟΜΟΘΕΣΙΑ 0.85 0.87 0.86 70 ΑΣΦΑΛΙΣΤΙΚΑ_ΤΑΜΕΙΑ 0.93 0.95 0.94 121 ΒΙΟΜΗΧΑΝΙΚΗ_ΝΟΜΟΘΕΣΙΑ 0.85 0.83 0.84 93 ΓΕΩΡΓΙΚΗ_ΝΟΜΟΘΕΣΙΑ 0.88 0.92 0.90 155 ΔΑΣΗ_ΚΑΙ_ΚΤΗΝΟΤΡΟΦΙΑ 0.88 0.70 0.78 40 ΔΗΜΟΣΙΑ_ΕΡΓΑ 0.88 0.88 0.88 111 ΔΗΜΟΣΙΟ_ΛΟΓΙΣΤΙΚΟ 0.79 0.74 0.76 46 ΔΗΜΟΣΙΟΙ_ΥΠΑΛΛΗΛΟΙ 0.78 0.79 0.78 90 ΔΙΟΙΚΗΣΗ_ΔΙΚΑΙΟΣΥΝΗΣ 0.91 0.97 0.94 140 ΔΙΟΙΚΗΤΙΚΗ_ΝΟΜΟΘΕΣΙΑ 0.80 0.86 0.83 42 ΔΙΠΛΩΜΑΤΙΚΗ_ΝΟΜΟΘΕΣΙΑ 0.82 0.87 0.85 95 ΕΘΝΙΚΗ_ΑΜΥΝΑ 0.84 0.80 0.82 121 ΕΘΝΙΚΗ_ΟΙΚΟΝΟΜΙΑ 0.76 0.79 0.77 43 ΕΚΚΛΗΣΙΑΣΤΙΚΗ_ΝΟΜΟΘΕΣΙΑ 0.92 0.98 0.95 47 ΕΚΠΑΙΔΕΥΤΙΚΗ_ΝΟΜΟΘΕΣΙΑ 0.94 0.94 0.94 230 ΕΛΕΓΚΤΙΚΟ_ΣΥΝΕΔΡΙΟ_ΚΑΙ_ΣΥΝΤΑΞΕΙΣ 0.79 0.70 0.74 37 ΕΜΜΕΣΗ_ΦΟΡΟΛΟΓΙΑ 0.81 0.84 0.83 62 ΕΜΠΟΡΙΚΗ_ΝΑΥΤΙΛΙΑ 0.95 0.95 0.95 149 ΕΜΠΟΡΙΚΗ_ΝΟΜΟΘΕΣΙΑ 0.85 0.95 0.90 42 ΕΠΙΣΤΗΜΕΣ_ΚΑΙ_ΤΕΧΝΕΣ 0.94 0.95 0.95 285 ΕΡΓΑΤΙΚΗ_ΝΟΜΟΘΕΣΙΑ 0.90 0.88 0.89 95 ΚΟΙΝΩΝΙΚΕΣ_ΑΣΦΑΛΙΣΕΙΣ 0.92 0.89 0.91 66 ΚΟΙΝΩΝΙΚΗ_ΠΡΟΝΟΙΑ 0.91 0.85 0.88 71 ΛΙΜΕΝΙΚΗ_ΝΟΜΟΘΕΣΙΑ 0.91 0.96 0.94 78 ΝΟΜΙΚΑ_ΠΡΟΣΩΠΑ_ΔΗΜΟΣΙΟΥ_ΔΙΚΑΙΟΥ 0.95 0.91 0.93 67 ΝΟΜΟΘΕΣΙΑ_ΑΝΩΝΥΜΩΝ_ΕΤΑΙΡΕΙΩΝ_ΤΡΑΠΕΖΩΝ_ΚΑΙ_ΧΡΗΜΑΤΙΣΤΗΡΙΩΝ 0.89 0.81 0.85 72 ΝΟΜΟΘΕΣΙΑ_ΔΗΜΩΝ_ΚΑΙ_ΚΟΙΝΟΤΗΤΩΝ 0.90 0.85 0.88 55 ΝΟΜΟΘΕΣΙΑ_ΕΠΙΜΕΛΗΤΗΡΙΩΝ_ΣΥΝΕΤΑΙΡΙΣΜΩΝ_ΚΑΙ_ΣΩΜΑΤΕΙΩΝ 0.92 0.92 0.92 37 ΟΙΚΟΝΟΜΙΚΗ_ΔΙΟΙΚΗΣΗ 0.86 0.70 
0.77 53 ΠΕΡΙΟΥΣΙΑ_ΔΗΜΟΣΙΟΥ_ΚΑΙ_ΝΟΜΙΣΜΑ 0.80 0.80 0.80 46 ΠΟΙΝΙΚΗ_ΝΟΜΟΘΕΣΙΑ 0.96 0.94 0.95 49 ΠΟΛΕΜΙΚΗ_ΑΕΡΟΠΟΡΙΑ 0.89 0.83 0.86 29 ΠΟΛΕΜΙΚΟ_ΝΑΥΤΙΚΟ 0.94 0.77 0.85 43 ΠΟΛΙΤΙΚΗ_ΑΕΡΟΠΟΡΙΑ 0.93 0.90 0.92 30 ΠΟΛΙΤΙΚΗ_ΔΙΚΟΝΟΜΙΑ 0.86 0.95 0.90 19 ΡΑΔΙΟΦΩΝΙΑ_ΚΑΙ_ΤΥΠΟΣ 0.81 0.94 0.87 31 ΣΤΡΑΤΟΣ_ΞΗΡΑΣ 0.74 0.83 0.78 71 ΣΥΓΚΟΙΝΩΝΙΕΣ 0.91 0.92 0.91 178 ΣΥΝΤΑΓΜΑΤΙΚΗ_ΝΟΜΟΘΕΣΙΑ 0.84 0.74 0.79 87 ΤΑΧΥΔΡΟΜΕΙΑ_ΤΗΛΕΠΙΚΟΙΝΩΝΙΕΣ 0.96 0.93 0.95 84 ΤΕΛΩΝΕΙΑΚΗ_ΝΟΜΟΘΕΣΙΑ 0.87 0.84 0.85 73 ΤΥΠΟΣ_ΚΑΙ_ΤΟΥΡΙΣΜΟΣ 0.96 0.98 0.97 50 ΥΓΕΙΟΝΟΜΙΚΗ_ΝΟΜΟΘΕΣΙΑ 0.91 0.93 0.92 136 accuracy - - 0.88 3745 macro avg 0.87 0.87 0.87 3745 weighted avg 0.89 0.88 0.88 3745 ``` --- layout: model title: English RobertaForQuestionAnswering (from deepakvk) author: John Snow Labs name: roberta_qa_deepakvk_roberta_base_squad2_finetuned_squad date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2-finetuned-squad` is an English model originally trained by `deepakvk`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_deepakvk_roberta_base_squad2_finetuned_squad_en_4.0.0_3.0_1655735429401.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_deepakvk_roberta_base_squad2_finetuned_squad_en_4.0.0_3.0_1655735429401.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepakvk_roberta_base_squad2_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_deepakvk_roberta_base_squad2_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base.by_deepakvk").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
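Span-classification QA models like this one score every context token as a candidate answer start and end; the predicted answer is the best-scoring valid span. A toy sketch of that selection step, with hand-picked logits rather than real model output:

```python
def best_span(start_logits, end_logits, max_len=15):
    """Return (start, end) maximizing start+end score with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Toy tokenization of the example context, with invented logits that
# peak on "Clara" -- the answer to "What's my name?".
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.2, 0.1, 6.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end_logits   = [0.1, 0.1, 0.1, 5.5, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]
s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # Clara
```

The `max_len` cap mirrors the common practice of rejecting implausibly long answer spans.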
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_deepakvk_roberta_base_squad2_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/deepakvk/roberta-base-squad2-finetuned-squad --- layout: model title: Sentence Detection in Indonesian Text author: John Snow Labs name: sentence_detector_dl date: 2021-08-30 tags: [open_source, sentence_detection, id] task: Sentence Detection language: id edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_id_3.2.0_3.0_1630318954338.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_id_3.2.0_3.0_1630318954338.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "id") \ .setInputCols(["document"]) \ .setOutputCol("sentences") sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL])) sd_model.fullAnnotate("""Mencari sumber paragraf bacaan bahasa Inggris yang bagus? Anda telah datang ke tempat yang tepat. Menurut sebuah penelitian baru-baru ini, kebiasaan membaca di kalangan remaja saat ini menurun dengan cepat. Mereka tidak dapat fokus pada paragraf bacaan bahasa Inggris yang diberikan selama lebih dari beberapa detik! Juga, membaca adalah dan merupakan bagian integral dari semua ujian kompetitif. Jadi, bagaimana Anda meningkatkan keterampilan membaca Anda? Jawaban atas pertanyaan ini sebenarnya adalah pertanyaan lain: Apa gunanya keterampilan membaca? Tujuan utama membaca adalah 'untuk masuk akal'.""") ``` ```scala val documenter = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "id") .setInputCols(Array("document")) .setOutputCol("sentence") val pipeline = new Pipeline().setStages(Array(documenter, model)) val data = Seq("Mencari sumber paragraf bacaan bahasa Inggris yang bagus? Anda telah datang ke tempat yang tepat. Menurut sebuah penelitian baru-baru ini, kebiasaan membaca di kalangan remaja saat ini menurun dengan cepat. Mereka tidak dapat fokus pada paragraf bacaan bahasa Inggris yang diberikan selama lebih dari beberapa detik! Juga, membaca adalah dan merupakan bagian integral dari semua ujian kompetitif. Jadi, bagaimana Anda meningkatkan keterampilan membaca Anda? Jawaban atas pertanyaan ini sebenarnya adalah pertanyaan lain: Apa gunanya keterampilan membaca? 
Tujuan utama membaca adalah 'untuk masuk akal'.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python nlu.load('id.sentence_detector').predict("Mencari sumber paragraf bacaan bahasa Inggris yang bagus? Anda telah datang ke tempat yang tepat. Menurut sebuah penelitian baru-baru ini, kebiasaan membaca di kalangan remaja saat ini menurun dengan cepat. Mereka tidak dapat fokus pada paragraf bacaan bahasa Inggris yang diberikan selama lebih dari beberapa detik! Juga, membaca adalah dan merupakan bagian integral dari semua ujian kompetitif. Jadi, bagaimana Anda meningkatkan keterampilan membaca Anda? Jawaban atas pertanyaan ini sebenarnya adalah pertanyaan lain: Apa gunanya keterampilan membaca? Tujuan utama membaca adalah 'untuk masuk akal'.", output_level ='sentence') ```
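For contrast with the learned detector above, a naive rule-based splitter fits in a few lines, but it breaks on abbreviations, decimals, and similar edge cases, which is exactly what SentenceDetectorDL is trained to handle. (Illustrative only; not part of Spark NLP.)

```python
import re

def naive_split(text):
    """Split on sentence-final punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

text = ("Mencari sumber paragraf bacaan bahasa Inggris yang bagus? "
        "Anda telah datang ke tempat yang tepat.")
print(naive_split(text))          # two sentences, as expected

# Failure case: the abbreviation "Dr." triggers a spurious split.
print(naive_split("Dr. Smith arrived."))  # ['Dr.', 'Smith arrived.']
```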
## Results ```bash +---------------------------------------------------------------------------------------------------------------+ |result | +---------------------------------------------------------------------------------------------------------------+ |[Mencari sumber paragraf bacaan bahasa Inggris yang bagus?] | |[Anda telah datang ke tempat yang tepat.] | |[Menurut sebuah penelitian baru-baru ini, kebiasaan membaca di kalangan remaja saat ini menurun dengan cepat.] | |[Mereka tidak dapat fokus pada paragraf bacaan bahasa Inggris yang diberikan selama lebih dari beberapa detik!]| |[Juga, membaca adalah dan merupakan bagian integral dari semua ujian kompetitif.] | |[Jadi, bagaimana Anda meningkatkan keterampilan membaca Anda?] | |[Jawaban atas pertanyaan ini sebenarnya adalah pertanyaan lain:] | |[Apa gunanya keterampilan membaca?] | |[Tujuan utama membaca adalah 'untuk masuk akal'.] | +---------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentence_detector_dl| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[sentences]| |Language:|id| ## Benchmarking ```bash label Accuracy Recall Prec F1 0 0.98 1.00 0.96 0.98 ``` --- layout: model title: Fast and Accurate Language Identification - 375 Languages (CNN) author: John Snow Labs name: ld_wiki_tatoeba_cnn_375 date: 2020-12-05 task: Language Detection language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [language_detection, open_source, xx] supported: true annotator: LanguageDetectorDL article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Language detection and identification is the task of automatically detecting the language(s) present in a document based on the content of the document. 
``LanguageDetectorDL`` is an annotator that detects the language of documents or sentences depending on the ``inputCols``. In addition, ``LanguageDetectorDL`` can accurately detect language from documents with mixed languages by coalescing sentences and selecting the best candidate. We have designed and developed Deep Learning models using CNNs in TensorFlow/Keras. The models are trained on large datasets such as Wikipedia and Tatoeba, and evaluated with high accuracy on the Europarl dataset. The output is a language code in Wiki Code style: [https://en.wikipedia.org/wiki/List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias). This model can detect the following languages: `Abkhaz`, `Iraqi Arabic`, `Adyghe`, `Afrikaans`, `Gulf Arabic`, `Afrihili`, `Assyrian Neo-Aramaic`, `Ainu`, `Aklanon`, `Gheg Albanian`, `Amharic`, `Aragonese`, `Old English`, `Uab Meto`, `North Levantine Arabic`, `Arabic`, `Algerian Arabic`, `Moroccan Arabic`, `Egyptian Arabic`, `Assamese`, `Asturian`, `Kotava`, `Awadhi`, `Aymara`, `Azerbaijani`, `Bashkir`, `Baluchi`, `Balinese`, `Bavarian`, `Central Bikol`, `Belarusian`, `Berber`, `Bulgarian`, `Bhojpuri`, `Bislama`, `Banjar`, `Bambara`, `Bengali`, `Tibetan`, `Breton`, `Bodo`, `Bosnian`, `Buryat`, `Baybayanon`, `Brithenig`, `Catalan`, `Cayuga`, `Chavacano`, `Chechen`, `Cebuano`, `Chamorro`, `Chagatai`, `Chinook Jargon`, `Choctaw`, `Cherokee`, `Jin Chinese`, `Chukchi`, `Central Mnong`, `Corsican`, `Chinese Pidgin English`, `Crimean Tatar`, `Seychellois Creole`, `Czech`, `Kashubian`, `Chuvash`, `Welsh`, `CycL`, `Cuyonon`, `Danish`, `German`, `Dungan`, `Drents`, `Lower Sorbian`, `Central Dusun`, `Dhivehi`, `Dutton World Speedwords`, `Ewe`, `Emilian`, `Greek`, `Erromintxela`, `English`, `Middle English`, `Esperanto`, `Spanish`, `Estonian`, `Basque`, `Evenki`, `Extremaduran`, `Persian`, `Finnish`, `Fijian`, `Kven Finnish`, `Faroese`, `French`, `Middle French`, `Old French`, `North Frisian`, `Pulaar`, `Friulian`, `Nigerian Fulfulde`, `Frisian`, 
`Irish`, `Ga`, `Gagauz`, `Gan Chinese`, `Garhwali`, `Guadeloupean Creole French`, `Scottish Gaelic`, `Gilbertese`, `Galician`, `Guarani`, `Konkani (Goan)`, `Gronings`, `Gothic`, `Ancient Greek`, `Swiss German`, `Gujarati`, `Manx`, `Hausa`, `Hakka Chinese`, `Hawaiian`, `Ancient Hebrew`, `Hebrew`, `Hindi`, `Fiji Hindi`, `Hiligaynon`, `Hmong Njua (Green)`, `Ho`, `Croatian`, `Hunsrik`, `Upper Sorbian`, `Xiang Chinese`, `Haitian Creole`, `Hungarian`, `Armenian`, `Interlingua`, `Iban`, `Indonesian`, `Interlingue`, `Igbo`, `Nuosu`, `Inuktitut`, `Ilocano`, `Ido`, `Icelandic`, `Italian`, `Ingrian`, `Japanese`, `Jamaican Patois`, `Lojban`, `Juhuri (Judeo-Tat)`, `Jewish Palestinian Aramaic`, `Javanese`, `Georgian`, `Karakalpak`, `Kabyle`, `Kamba`, `Kekchi (Q'eqchi')`, `Khasi`, `Khakas`, `Kazakh`, `Greenlandic`, `Khmer`, `Kannada`, `Korean`, `Komi-Permyak`, `Komi-Zyrian`, `Karachay-Balkar`, `Karelian`, `Kashmiri`, `Kölsch`, `Kurdish`, `Kumyk`, `Cornish`, `Keningau Murut`, `Kyrgyz`, `Coastal Kadazan`, `Latin`, `Southern Subanen`, `Ladino`, `Luxembourgish`, `Láadan`, `Lingua Franca Nova`, `Luganda`, `Ligurian`, `Livonian`, `Lakota`, `Ladin`, `Lombard`, `Lingala`, `Lao`, `Louisiana Creole`, `Lithuanian`, `Latgalian`, `Latvian`, `Latvian`, `Literary Chinese`, `Laz`, `Madurese`, `Maithili`, `North Moluccan Malay`, `Moksha`, `Morisyen`, `Malagasy`, `Mambae`, `Marshallese`, `Meadow Mari`, `Maori`, `Mi'kmaq`, `Minangkabau`, `Macedonian`, `Malayalam`, `Mongolian`, `Manchu`, `Mon`, `Mohawk`, `Marathi`, `Hill Mari`, `Malay`, `Maltese`, `Tagal Murut`, `Mirandese`, `Hmong Daw (White)`, `Burmese`, `Erzya`, `Nauruan`, `Nahuatl`, `Norwegian Bokmål`, `Central Huasteca Nahuatl`, `Low German (Low Saxon)`, `Nepali`, `Newari`, `Ngeq`, `Guerrero Nahuatl`, `Niuean`, `Dutch`, `Orizaba Nahuatl`, `Norwegian Nynorsk`, `Norwegian`, `Nogai`, `Old Norse`, `Novial`, `Nepali`, `Naga (Tangshang)`, `Navajo`, `Chinyanja`, `Nyungar`, `Old Aramaic`, `Occitan`, `Ojibwe`, `Odia (Oriya)`, `Old East Slavic`, 
`Ossetian`, `Old Spanish`, `Old Saxon`, `Ottoman Turkish`, `Old Turkish`, `Punjabi (Eastern)`, `Pangasinan`, `Kapampangan`, `Papiamento`, `Palauan`, `Picard`, `Pennsylvania German`, `Palatine German`, `Phoenician`, `Pali`, `Polish`, `Piedmontese`, `Punjabi (Western)`, `Pipil`, `Old Prussian`, `Pashto`, `Portuguese`, `Quechua`, `K'iche'`, `Quenya`, `Rapa Nui`, `Rendille`, `Tarifit`, `Romansh`, `Kirundi`, `Romanian`, `Romani`, `Russian`, `Rusyn`, `Kinyarwanda`, `Okinawan`, `Sanskrit`, `Yakut`, `Sardinian`, `Sicilian`, `Scots`, `Sindhi`, `Northern Sami`, `Sango`, `Samogitian`, `Shuswap`, `Tachawit`, `Sinhala`, `Sindarin`, `Slovak`, `Slovenian`, `Samoan`, `Southern Sami`, `Shona`, `Somali`, `Albanian`, `Serbian`, `Swazi`, `Southern Sotho`, `Saterland Frisian`, `Sundanese`, `Sumerian`, `Swedish`, `Swahili`, `Swabian`, `Swahili`, `Syriac`, `Tamil`, `Telugu`, `Tetun`, `Tajik`, `Thai`, `Tahaggart Tamahaq`, `Tigrinya`, `Tigre`, `Turkmen`, `Tokelauan`, `Tagalog`, `Klingon`, `Talysh`, `Jewish Babylonian Aramaic`, `Temuan`, `Setswana`, `Tongan`, `Tonga (Zambezi)`, `Toki Pona`, `Tok Pisin`, `Old Tupi`, `Turkish`, `Tsonga`, `Tatar`, `Isan`, `Tuvaluan`, `Tahitian`, `Tuvinian`, `Talossan`, `Udmurt`, `Uyghur`, `Ukrainian`, `Umbundu`, `Urdu`, `Urhobo`, `Uzbek`, `Venetian`, `Veps`, `Vietnamese`, `Volapük`, `Võro`, `Walloon`, `Waray`, `Wolof`, `Shanghainese`, `Kalmyk`, `Xhosa`, `Mingrelian`, `Yiddish`, `Yoruba`, `Cantonese`, `Chinese`, `Malay (Vernacular)`, `Malay`, `Zulu`, `Zaza`. 
## Predicted Entities `ab`, `acm`, `ady`, `af`, `afb`, `afh`, `aii`, `ain`, `akl`, `aln`, `am`, `an`, `ang`, `aoz`, `apc`, `ar`, `arq`, `ary`, `arz`, `as`, `ast`, `avk`, `awa`, `ay`, `az`, `ba`, `bal`, `ban`, `bar`, `bcl`, `be`, `ber`, `bg`, `bho`, `bi`, `bjn`, `bm`, `bn`, `bo`, `br`, `brx`, `bs`, `bua`, `bvy`, `bzt`, `ca`, `cay`, `cbk`, `ce`, `ceb`, `ch`, `chg`, `chn`, `cho`, `chr`, `cjy`, `ckt`, `cmo`, `co`, `cpi`, `crh`, `crs`, `cs`, `csb`, `cv`, `cy`, `cycl`, `cyo`, `da`, `de`, `dng`, `drt`, `dsb`, `dtp`, `dv`, `dws`, `ee`, `egl`, `el`, `emx`, `en`, `enm`, `eo`, `es`, `et`, `eu`, `evn`, `ext`, `fa`, `fi`, `fj`, `fkv`, `fo`, `fr`, `frm`, `fro`, `frr`, `fuc`, `fur`, `fuv`, `fy`, `ga`, `gaa`, `gag`, `gan`, `gbm`, `gcf`, `gd`, `gil`, `gl`, `gn`, `gom`, `gos`, `got`, `grc`, `gsw`, `gu`, `gv`, `ha`, `hak`, `haw`, `hbo`, `he`, `hi`, `hif`, `hil`, `hnj`, `hoc`, `hr`, `hrx`, `hsb`, `hsn`, `ht`, `hu`, `hy`, `ia`, `iba`, `id`, `ie`, `ig`, `ii`, `ike`, `ilo`, `io`, `is`, `it`, `izh`, `ja`, `jam`, `jbo`, `jdt`, `jpa`, `jv`, `ka`, `kaa`, `kab`, `kam`, `kek`, `kha`, `kjh`, `kk`, `kl`, `km`, `kn`, `ko`, `koi`, `kpv`, `krc`, `krl`, `ks`, `ksh`, `ku`, `kum`, `kw`, `kxi`, `ky`, `kzj`, `la`, `laa`, `lad`, `lb`, `ldn`, `lfn`, `lg`, `lij`, `liv`, `lkt`, `lld`, `lmo`, `ln`, `lo`, `lou`, `lt`, `ltg`, `lv`, `lvs`, `lzh`, `lzz`, `mad`, `mai`, `max`, `mdf`, `mfe`, `mg`, `mgm`, `mh`, `mhr`, `mi`, `mic`, `min`, `mk`, `ml`, `mn`, `mnc`, `mnw`, `moh`, `mr`, `mrj`, `ms`, `mt`, `mvv`, `mwl`, `mww`, `my`, `myv`, `na`, `nah`, `nb`, `nch`, `nds`, `ne`, `new`, `ngt`, `ngu`, `niu`, `nl`, `nlv`, `nn`, `no`, `nog`, `non`, `nov`, `npi`, `nst`, `nv`, `ny`, `nys`, `oar`, `oc`, `oj`, `or`, `orv`, `os`, `osp`, `osx`, `ota`, `otk`, `pa`, `pag`, `pam`, `pap`, `pau`, `pcd`, `pdc`, `pfl`, `phn`, `pi`, `pl`, `pms`, `pnb`, `ppl`, `prg`, `ps`, `pt`, `qu`, `quc`, `qya`, `rap`, `rel`, `rif`, `rm`, `rn`, `ro`, `rom`, `ru`, `rue`, `rw`, `ryu`, `sa`, `sah`, `sc`, `scn`, `sco`, `sd`, `se`, `sg`, `sgs`, `shs`, `shy`, 
`si`, `sjn`, `sk`, `sl`, `sm`, `sma`, `sn`, `so`, `sq`, `sr`, `ss`, `st`, `stq`, `su`, `sux`, `sv`, `sw`, `swg`, `swh`, `syc`, `ta`, `te`, `tet`, `tg`, `th`, `thv`, `ti`, `tig`, `tk`, `tkl`, `tl`, `tlh`, `tly`, `tmr`, `tmw`, `tn`, `to`, `toi`, `toki`, `tpi`, `tpw`, `tr`, `ts`, `tt`, `tts`, `tvl`, `ty`, `tyv`, `tzl`, `udm`, `ug`, `uk`, `umb`, `ur`, `urh`, `uz`, `vec`, `vep`, `vi`, `vo`, `vro`, `wa`, `war`, `wo`, `wuu`, `xal`, `xh`, `xmf`, `yi`, `yo`, `yue`, `zh`, `zlm`, `zsm`, `zu`, `zza`. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ld_wiki_tatoeba_cnn_375_xx_2.7.0_2.4_1607184873730.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ld_wiki_tatoeba_cnn_375_xx_2.7.0_2.4_1607184873730.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... language_detector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_375", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("language") languagePipeline = Pipeline(stages=[documentAssembler, sentenceDetector, language_detector]) light_pipeline = LightPipeline(languagePipeline.fit(spark.createDataFrame([['']]).toDF("text"))) result = light_pipeline.fullAnnotate("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.") ``` ```scala ... val languageDetector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_375", "xx") .setInputCols("sentence") .setOutputCol("language") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, languageDetector)) val data = Seq("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala."] lang_df = nlu.load('xx.classify.wiki_375').predict(text, output_level='sentence') lang_df ```
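The annotator emits a Wiki-style language code (`fr` in the example above), so applications usually map codes to display names downstream. A minimal sketch with a hand-picked subset of the 375 supported codes:

```python
# Hand-picked subset of the model's Wiki-style output codes;
# the full list appears under "Predicted Entities" above.
WIKI_CODES = {
    "en": "English", "fr": "French", "de": "German",
    "id": "Indonesian", "pt": "Portuguese",
}

def code_to_name(code):
    """Map a detected language code to a display name."""
    return WIKI_CODES.get(code, f"unknown ({code})")

print(code_to_name("fr"))  # French
```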
## Results ```bash 'fr' ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ld_wiki_tatoeba_cnn_375| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[language]| |Language:|xx| ## Data Source Wikipedia and Tatoeba ## Benchmarking ```bash Evaluated on the Europarl dataset, which the model has never seen: +--------+-----+-------+------------------+ |src_lang|count|correct| precision| +--------+-----+-------+------------------+ | fr| 1000| 1000| 1.0| | de| 1000| 999| 0.999| | fi| 1000| 999| 0.999| | nl| 1000| 998| 0.998| | el| 1000| 997| 0.997| | en| 1000| 995| 0.995| | es| 1000| 994| 0.994| | it| 1000| 993| 0.993| | sv| 1000| 991| 0.991| | da| 1000| 987| 0.987| | pl| 914| 901|0.9857768052516411| | hu| 880| 866|0.9840909090909091| | pt| 1000| 980| 0.98| | et| 928| 907|0.9773706896551724| | ro| 784| 766|0.9770408163265306| | lt| 1000| 976| 0.976| | bg| 1000| 965| 0.965| | cs| 1000| 945| 0.945| | sk| 1000| 944| 0.944| | lv| 916| 843|0.9203056768558951| | sl| 914| 810|0.8862144420131292| +--------+-----+-------+------------------+ +-------+--------------------+ |summary| precision| +-------+--------------------+ | count| 21| | mean| 0.9758952066282511| | stddev|0.029434744995013935| | min| 0.8862144420131292| | max| 1.0| +-------+--------------------+ ``` --- layout: model title: English asr_wav2vec2_base_checkpoint_9 TFWav2Vec2ForCTC from jiobiala24 author: John Snow Labs name: asr_wav2vec2_base_checkpoint_9 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_checkpoint_9` is an English model originally 
trained by jiobiala24. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_wav2vec2_base_checkpoint_9_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_checkpoint_9_en_4.2.0_3.0_1664021017779.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_checkpoint_9_en_4.2.0_3.0_1664021017779.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_checkpoint_9", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_checkpoint_9", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
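`Wav2Vec2ForCTC` produces a per-frame character distribution that is decoded with CTC: greedy decoding takes the per-frame argmax, collapses consecutive repeats, and drops the blank symbol. A self-contained sketch of the collapse step, with `_` standing in for the CTC blank and hand-written frame outputs (not real model output):

```python
def ctc_greedy_collapse(frames, blank="_"):
    """Collapse consecutive repeats, then drop blanks (greedy CTC decode)."""
    out, prev = [], None
    for sym in frames:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# Toy per-frame argmax symbols; note the blank between the two "l" runs
# is what keeps the double letter in "hello".
frames = list("hh_e_lll_l_oo_")
print(ctc_greedy_collapse(frames))  # hello
```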
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_checkpoint_9| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|349.2 MB| --- layout: model title: Word2Vec Embeddings in Portuguese (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, pt, open_source] task: Embeddings language: pt edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_pt_3.4.1_3.0_1647453502800.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_pt_3.4.1_3.0_1647453502800.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Eu amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","pt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Eu amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.embed.w2v_cc_300d").predict("""Eu amo Spark NLP""") ```
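A typical downstream use of these lookups is comparing words by cosine similarity between their vectors. The sketch below uses toy 3-dimensional vectors as stand-ins for real 300-dimensional `w2v_cc_300d` lookups; the values are invented for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented stand-ins for embedding lookups of Portuguese words.
v_gato     = [0.9, 0.1, 0.0]  # "cat"
v_cachorro = [0.8, 0.2, 0.1]  # "dog"
v_carro    = [0.0, 0.1, 0.9]  # "car"

print(round(cosine(v_gato, v_cachorro), 3))  # high: related words
print(round(cosine(v_gato, v_carro), 3))     # low: unrelated words
```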
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|pt| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English image_classifier_vit_skin_type ViTForImageClassification from driboune author: John Snow Labs name: image_classifier_vit_skin_type date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_skin_type` is an English model originally trained by driboune. ## Predicted Entities `dark skin`, `light skin` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_skin_type_en_4.1.0_3.0_1660165919540.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_skin_type_en_4.1.0_3.0_1660165919540.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_skin_type", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_skin_type", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_skin_type| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Extract Cancer Therapies and Posology Information author: John Snow Labs name: ner_oncology_unspecific_posology date: 2022-10-25 tags: [licensed, clinical, oncology, en, ner, treatment, posology] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Healthcare 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts mentions of treatments and posology information using unspecific labels (low granularity). Definitions of Predicted Entities: - `Cancer_Therapy`: Mentions of cancer treatments, including chemotherapy, radiotherapy, surgery, and others. - `Posology_Information`: Terms related to the posology of the treatment, including duration, frequency, and dosage. ## Predicted Entities `Cancer_Therapy`, `Posology_Information` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_unspecific_posology_en_4.0.0_3.0_1666722206468.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_unspecific_posology_en_4.0.0_3.0_1666722206468.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_unspecific_posology", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. 
She is currently receiving his second cycle of chemotherapy and is in good overall condition."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_unspecific_posology", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving his second cycle of chemotherapy and is in good overall condition.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_unspecific_posology").predict("""The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving his second cycle of chemotherapy and is in good overall condition.""") ```
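The chunk/label pairs shown in the Results section can be pulled out of the `ner_chunk` column after collecting `result`. A sketch of that post-processing in plain Python (the dict-shaped rows below are a mocked stand-in for the annotation structs returned by `result.select("ner_chunk").collect()`):

```python
def chunks_to_pairs(rows):
    # Each row holds a list of chunk annotations; keep (chunk text, entity label).
    pairs = []
    for row in rows:
        for ann in row["ner_chunk"]:
            pairs.append((ann["result"], ann["metadata"]["entity"]))
    return pairs

# Mocked rows mirroring the first two chunks of the example sentence
rows = [{"ner_chunk": [
    {"result": "adriamycin", "metadata": {"entity": "Cancer_Therapy"}},
    {"result": "60 mg/m2", "metadata": {"entity": "Posology_Information"}},
]}]
print(chunks_to_pairs(rows))
```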
## Results ```bash | chunk | ner_label | |:-----------------|:---------------------| | adriamycin | Cancer_Therapy | | 60 mg/m2 | Posology_Information | | cyclophosphamide | Cancer_Therapy | | 600 mg/m2 | Posology_Information | | over six courses | Posology_Information | | second cycle | Posology_Information | | chemotherapy | Cancer_Therapy | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_unspecific_posology| |Compatibility:|Spark NLP for Healthcare 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.3 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Posology_Information 2663 244 399 3062 0.92 0.87 0.89 Cancer_Therapy 2580 317 247 2827 0.89 0.91 0.90 macro_avg 5243 561 646 5889 0.90 0.89 0.90 micro_avg 5243 561 646 5889 0.90 0.89 0.90 ``` --- layout: model title: English asr_wav2vec2_large_960h_lv60_self_with_wikipedia_lm TFWav2Vec2ForCTC from gxbag author: John Snow Labs name: pipeline_asr_wav2vec2_large_960h_lv60_self_with_wikipedia_lm date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_960h_lv60_self_with_wikipedia_lm` is an English model originally trained by gxbag.
NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use pipeline_asr_wav2vec2_large_960h_lv60_self_with_wikipedia_lm_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_960h_lv60_self_with_wikipedia_lm_en_4.2.0_3.0_1664026375800.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_960h_lv60_self_with_wikipedia_lm_en_4.2.0_3.0_1664026375800.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_960h_lv60_self_with_wikipedia_lm', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_960h_lv60_self_with_wikipedia_lm", lang = "en") val annotations = pipeline.transform(audioDF) ```
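The snippet above assumes an `audioDF` whose rows contain normalized float samples. One way to build such a column from 16-bit PCM audio, as a sketch using only the Python standard library (16 kHz mono is assumed, the tone is synthetic in place of a real recording, and the final `spark.createDataFrame` line is left as a comment because it needs a live session):

```python
import math
import struct

def pcm16_to_floats(raw: bytes) -> list:
    # Unpack little-endian 16-bit samples and scale them into [-1.0, 1.0],
    # the representation AudioAssembler expects in the audio column.
    samples = struct.unpack("<" + "h" * (len(raw) // 2), raw)
    return [s / 32768.0 for s in samples]

# One second of a synthetic 440 Hz tone at 16 kHz, standing in for file bytes
pcm = b"".join(
    struct.pack("<h", int(32767 * math.sin(2 * math.pi * 440 * t / 16000)))
    for t in range(16000)
)
floats = pcm16_to_floats(pcm)

# With a live Spark session, the DataFrame would then be built along these lines:
# audioDF = spark.createDataFrame([[floats]]).toDF("audio_content")
```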
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_960h_lv60_self_with_wikipedia_lm| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|333.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Japanese Bert Embeddings (Base, v2) author: John Snow Labs name: bert_embeddings_bert_base_japanese_v2 date: 2022-04-11 tags: [bert, embeddings, ja, open_source] task: Embeddings language: ja edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-japanese-v2` is a Japanese model originally trained by `cl-tohoku`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_japanese_v2_ja_3.4.2_3.0_1649674281970.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_japanese_v2_ja_3.4.2_3.0_1649674281970.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_japanese_v2","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["私はSpark NLPを愛しています"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_japanese_v2","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("私はSpark NLPを愛しています").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ja.embed.bert_base_japanese_v2").predict("""私はSpark NLPを愛しています""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_japanese_v2| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Size:|417.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/cl-tohoku/bert-base-japanese-v2 - https://github.com/google-research/bert - https://pypi.org/project/unidic-lite/ - https://github.com/cl-tohoku/bert-japanese/tree/v2.0 - https://taku910.github.io/mecab/ - https://github.com/neologd/mecab-ipadic-neologd - https://github.com/polm/fugashi - https://github.com/polm/unidic-lite - https://www.tensorflow.org/tfrc/ - https://creativecommons.org/licenses/by-sa/3.0/ --- layout: model title: Sentiment Analysis Pipeline for French texts author: John Snow Labs name: classifierdl_bert_sentiment_pipeline date: 2021-09-28 tags: [fr, sentiment, pipeline, open_source] task: Sentiment Analysis language: fr edition: Spark NLP 3.3.0 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline identifies the sentiments (positive or negative) in French texts.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/SENTIMENT_FR/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_Fr_Sentiment.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_bert_sentiment_pipeline_fr_3.3.0_2.4_1632835775093.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_bert_sentiment_pipeline_fr_3.3.0_2.4_1632835775093.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("classifierdl_bert_sentiment_pipeline", lang = "fr") result = pipeline.annotate("Mignolet vraiment dommage de ne jamais le voir comme titulaire") ``` ```scala val pipeline = new PretrainedPipeline("classifierdl_bert_sentiment_pipeline", lang = "fr") val result = pipeline.fullAnnotate("Mignolet vraiment dommage de ne jamais le voir comme titulaire")(0) ```
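`annotate` returns a plain dict keyed by output column name, so the predicted label can be read out directly. A sketch of that lookup (the `class` key is an assumption based on the usual ClassifierDL output column name, and the dict below is a mocked stand-in for a real `pipeline.annotate(...)` result):

```python
def predicted_label(annotations, key="class"):
    # First prediction for the given output column, or None if absent/empty.
    labels = annotations.get(key) or []
    return labels[0] if labels else None

# Mocked annotate() result; "class" is the assumed classifier output column
mock = {"document": ["Mignolet vraiment dommage de ne jamais le voir comme titulaire"],
        "class": ["NEGATIVE"]}
print(predicted_label(mock))  # prints NEGATIVE
```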
## Results ```bash ['NEGATIVE'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_bert_sentiment_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.3.0+| |License:|Open Source| |Edition:|Official| |Language:|fr| ## Included Models - DocumentAssembler - BertSentenceEmbeddings - ClassifierDLModel --- layout: model title: English DistilBertForTokenClassification Cased model (from f2io) author: John Snow Labs name: distilbert_token_classifier_ner_roles_openapi date: 2023-03-03 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ner-roles-openapi` is an English model originally trained by `f2io`. ## Predicted Entities `LOC`, `OR`, `PRG`, `ROLE`, `ORG`, `PER`, `ENTITY`, `MISC`, `ACTION` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_ner_roles_openapi_en_4.3.0_3.0_1677881330949.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_ner_roles_openapi_en_4.3.0_3.0_1677881330949.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_ner_roles_openapi","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_ner_roles_openapi","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_ner_roles_openapi| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/f2io/ner-roles-openapi --- layout: model title: Word2Vec Embeddings in Tajik (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, tg, open_source] task: Embeddings language: tg edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_tg_3.4.1_3.0_1647461560049.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_tg_3.4.1_3.0_1647461560049.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","tg") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ман nlp-ро дӯст медорам"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","tg") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ман nlp-ро дӯст медорам").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("tg.embed.w2v_cc_300d").predict("""Ман nlp-ро дӯст медорам""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|tg| |Size:|291.3 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: News Classifier Pipeline for Urdu texts author: John Snow Labs name: classifierdl_bert_news_pipeline date: 2022-02-09 tags: [urdu, news, classifier, pipeline, ur, open_source] task: Text Classification language: ur edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline classifies Urdu news into up to 7 categories. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_UR_NEWS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_UR_NEWS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_bert_news_pipeline_ur_3.4.0_3.0_1644402089229.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_bert_news_pipeline_ur_3.4.0_3.0_1644402089229.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("classifierdl_bert_news_pipeline", lang = "ur") result = pipeline.fullAnnotate("""گزشتہ ہفتے ایپل کے حصص میں 11 فیصد اضافہ ہوا ہے۔""") ``` ```scala val pipeline = new PretrainedPipeline("classifierdl_bert_news_pipeline", "ur") val result = pipeline.fullAnnotate("گزشتہ ہفتے ایپل کے حصص میں 11 فیصد اضافہ ہوا ہے۔")(0) ```
## Results ```bash ['business'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_bert_news_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Language:|ur| |Size:|1.8 GB| ## Included Models - DocumentAssembler - BertSentenceEmbeddings - ClassifierDLModel --- layout: model title: English image_classifier_vit_planes_trains_automobiles ViTForImageClassification from nateraw author: John Snow Labs name: image_classifier_vit_planes_trains_automobiles date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_planes_trains_automobiles` is an English model originally trained by nateraw. ## Predicted Entities `automobiles`, `planes`, `trains` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_planes_trains_automobiles_en_4.1.0_3.0_1660170662986.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_planes_trains_automobiles_en_4.1.0_3.0_1660170662986.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_planes_trains_automobiles", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_planes_trains_automobiles", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
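Both snippets assume an `imageDF` already exists; it is typically created with Spark's built-in image data source. A sketch (the folder path is a placeholder; the Spark call is shown as a comment next to a small stdlib helper that filters a directory listing down to the image formats the data source handles):

```python
from pathlib import Path

# Extensions accepted by Spark's "image" data source
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}

def image_files(names):
    # Keep only file names whose extension looks like a supported image format.
    return [n for n in names if Path(n).suffix.lower() in IMAGE_EXTS]

print(image_files(["plane.JPG", "notes.txt", "train.png"]))

# With a live session, the DataFrame itself would be built along these lines:
# imageDF = spark.read.format("image").option("dropInvalid", True).load("images/")
```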
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_planes_trains_automobiles| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: RCT Binary Classifier (USE) author: John Snow Labs name: rct_binary_classifier_use date: 2022-05-27 tags: [licensed, clinical, rct, classifier, en] task: Text Classification language: en nav_key: models edition: Healthcare NLP 3.4.2 spark_version: 3.0 supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a USE-based classifier that can classify whether an article is a randomized clinical trial (RCT) or not. ## Predicted Entities `true`, `false` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CLASSIFICATION_RCT/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CLASSIFICATION_RCT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rct_binary_classifier_use_en_3.4.2_3.0_1653676810143.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rct_binary_classifier_use_en_3.4.2_3.0_1653676810143.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") use = UniversalSentenceEncoder.pretrained()\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") classifier_dl = ClassifierDLModel.pretrained("rct_binary_classifier_use", "en", "clinical/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("class") use_clf_pipeline = Pipeline( stages = [ document_assembler, use, classifier_dl ]) data = spark.createDataFrame([["""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. 
"""]]).toDF("text") result = use_clf_pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val use = UniversalSentenceEncoder.pretrained() .setInputCols("document") .setOutputCol("sentence_embeddings") val classifier_dl = ClassifierDLModel.pretrained("rct_binary_classifier_use", "en", "clinical/models") .setInputCols(Array("sentence_embeddings")) .setOutputCol("class") val use_clf_pipeline = new Pipeline().setStages(Array(documenter, use, classifier_dl)) val data = Seq("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. 
""").toDS.toDF("text") val result = use_clf_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.rct_binary_use").predict("""Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. """) ```
## Results ```bash | text | rct | |---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------| | Abstract:Based on the American Society of Anesthesiologists' Practice Guidelines for Sedation and Analgesia by Non-Anesthesiologists (ASA-SED), a sedation training course aimed at improving medical safety was developed by the Japanese Association for Medical Simulation in 2011. This study evaluated the effect of debriefing on participants' perceptions of the essential points of the ASA-SED. A total of 38 novice doctors participated in the sedation training course during the research period. Of these doctors, 18 participated in the debriefing group, and 20 participated in non-debriefing group. 
Scoring of participants' guideline perceptions was conducted using an evaluation sheet (nine items, 16 points) created based on the ASA-SED. The debriefing group showed a greater perception of the ASA-SED, as reflected in the significantly higher scores on the evaluation sheet (median, 16 points) than the control group (median, 13 points; p < 0.05). No significant differences were identified before or during sedation, but the difference after sedation was significant (p < 0.05). Debriefing after sedation training courses may contribute to better perception of the ASA-SED, and may lead to enhanced attitudes toward medical safety during sedation and analgesia. | true | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|rct_binary_classifier_use| |Compatibility:|Healthcare NLP 3.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|20.9 MB| ## References https://arxiv.org/abs/1710.06071 ## Benchmarking ```bash label precision recall f1-score support false 0.84 0.80 0.82 2915 true 0.78 0.82 0.80 2545 accuracy - - 0.81 5460 macro-avg 0.81 0.81 0.81 5460 weighted-avg 0.81 0.81 0.81 5460 ``` --- layout: model title: Spanish RobertaForQuestionAnswering (from nlp-en-es) author: John Snow Labs name: roberta_qa_nlp_en_es_roberta_base_bne_finetuned_sqac date: 2022-06-20 tags: [es, open_source, question_answering, roberta] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bne-finetuned-sqac` is a Spanish model originally trained by `nlp-en-es`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_nlp_en_es_roberta_base_bne_finetuned_sqac_es_4.0.0_3.0_1655730047157.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_nlp_en_es_roberta_base_bne_finetuned_sqac_es_4.0.0_3.0_1655730047157.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_nlp_en_es_roberta_base_bne_finetuned_sqac","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_nlp_en_es_roberta_base_bne_finetuned_sqac","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.sqac.roberta.base.by_nlp-en-es").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
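The `answer` column holds the span of the context the model scores highest. As a rough illustration of how extractive QA selects that span (a toy sketch with made-up logits and tokens, not the annotator's internals):

```python
# Extractive QA scores every token as a possible answer start and end,
# then returns the best valid (start <= end) pair as the answer span.
# The logits below are hypothetical, for illustration only.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.2, 0.1, 3.0, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]
end_logits   = [0.1, 0.1, 0.1, 2.5, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]

best, best_score = (0, 0), float("-inf")
for s in range(len(tokens)):
    for e in range(s, len(tokens)):  # only consider valid spans: start <= end
        score = start_logits[s] + end_logits[e]
        if score > best_score:
            best, best_score = (s, e), score

start, end = best
answer = " ".join(tokens[start:end + 1])
print(answer)  # Clara
```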
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_nlp_en_es_roberta_base_bne_finetuned_sqac| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|es| |Size:|460.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/nlp-en-es/roberta-base-bne-finetuned-sqac - https://paperswithcode.com/sota?task=Question+Answering&dataset=sqac --- layout: model title: Word2Vec Embeddings in Mazanderani (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, mzn, open_source] task: Embeddings language: mzn edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_mzn_3.4.1_3.0_1647445915568.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_mzn_3.4.1_3.0_1647445915568.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","mzn") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","mzn") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("mzn.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
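Each token in the `embeddings` output column maps to a 300-dimensional vector, and semantically similar tokens should point in similar directions. A minimal sketch of comparing such vectors with cosine similarity, using hypothetical 4-dimensional stand-ins rather than real model output:

```python
# Cosine similarity compares vector directions, ignoring magnitude;
# the vectors below are made up to stand in for 300-d embeddings.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

v_cat = [0.9, 0.1, 0.3, 0.0]  # hypothetical embedding for "cat"
v_dog = [0.8, 0.2, 0.4, 0.1]  # hypothetical embedding for "dog"
v_car = [0.0, 0.9, 0.0, 0.8]  # hypothetical embedding for "car"

# Related words score higher than unrelated ones.
print(cosine(v_cat, v_dog) > cosine(v_cat, v_car))  # True
```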
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|mzn| |Size:|80.9 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Persian Named Entity Recognition (from HooshvareLab) author: John Snow Labs name: bert_ner_bert_fa_base_uncased_ner_arman date: 2022-05-09 tags: [bert, ner, token_classification, fa, open_source] task: Named Entity Recognition language: fa edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-fa-base-uncased-ner-arman` is a Persian model originally trained by `HooshvareLab`. ## Predicted Entities `fac`, `pers`, `pro`, `event`, `org`, `loc` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_fa_base_uncased_ner_arman_fa_3.4.2_3.0_1652099808382.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_fa_base_uncased_ner_arman_fa_3.4.2_3.0_1652099808382.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_fa_base_uncased_ner_arman","fa") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["من عاشق Spark NLP هستم"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_fa_base_uncased_ner_arman","fa") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("من عاشق Spark NLP هستم").toDF("text") val result = pipeline.fit(data).transform(data) ```
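The `ner` column contains one BIO tag per token (e.g. `B-pers`, `I-org`, `O`). Grouping those tags into entity chunks, as Spark NLP's `NerConverter` does downstream, can be sketched in plain Python (illustrative only, with made-up tokens and tags):

```python
# Toy sketch of grouping BIO tags into entity chunks: a "B-" tag opens a
# chunk, matching "I-" tags extend it, and anything else closes it.
def bio_to_chunks(tokens, tags):
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current: chunks.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)
        else:
            if current: chunks.append(current)
            current = None
    if current: chunks.append(current)
    return [(label, " ".join(toks)) for label, toks in chunks]

tokens = ["John", "Snow", "Labs", "is", "in", "Delaware"]
tags   = ["B-org", "I-org", "I-org", "O", "O", "B-loc"]
print(bio_to_chunks(tokens, tags))  # [('org', 'John Snow Labs'), ('loc', 'Delaware')]
```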
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_bert_fa_base_uncased_ner_arman| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|fa| |Size:|607.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/HooshvareLab/bert-fa-base-uncased-ner-arman - https://github.com/hooshvare/parsbert - https://github.com/HaniehP/PersianNER - https://github.com/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb - https://colab.research.google.com/github/hooshvare/parsbert-ner/blob/master/persian-ner-pipeline.ipynb - https://github.com/hooshvare/parsbert/issues --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_bert_quadruplet_epochs_1_shard_1_squad2.0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_bert_quadruplet_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_bert_quadruplet_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223261564.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_bert_quadruplet_epochs_1_shard_1_squad2.0_en_4.3.0_3.0_1674223261564.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_bert_quadruplet_epochs_1_shard_1_squad2.0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_bert_quadruplet_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_bert_quadruplet_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|460.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_bert_quadruplet_epochs_1_shard_1_squad2.0 --- layout: model title: English RobertaForQuestionAnswering Mini Cased model (from sguskin) author: John Snow Labs name: roberta_qa_dynamic_minilmv2_l6_h384_squad1.1 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dynamic-minilmv2-L6-H384-squad1.1` is an English model originally trained by `sguskin`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_dynamic_minilmv2_l6_h384_squad1.1_en_4.3.0_3.0_1674210790213.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_dynamic_minilmv2_l6_h384_squad1.1_en_4.3.0_3.0_1674210790213.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_dynamic_minilmv2_l6_h384_squad1.1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_dynamic_minilmv2_l6_h384_squad1.1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_dynamic_minilmv2_l6_h384_squad1.1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|112.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/sguskin/dynamic-minilmv2-L6-H384-squad1.1 --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from rbiswas4) author: John Snow Labs name: distilbert_qa_rbiswas4_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `rbiswas4`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_rbiswas4_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772274920.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_rbiswas4_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772274920.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_rbiswas4_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_rbiswas4_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_rbiswas4_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/rbiswas4/distilbert-base-uncased-finetuned-squad --- layout: model title: Named Entity Recognition for Korean (GloVe 840B 300d) author: John Snow Labs name: ner_kmou_glove_840B_300d date: 2021-01-03 task: Named Entity Recognition language: ko edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [ko, ner, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates named entities in a text in the `BIO` format, which can be used to find features such as names of people, places, and organizations. The model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. This model uses the pre-trained `glove_840B_300` embeddings model from the `WordEmbeddings` annotator as an input, so be sure to use the same embeddings in the pipeline. ## Predicted Entities Dates-`DT`, Locations-`LC`, Organizations-`OG`, Persons-`PS`, Time-`TI`.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_KO/){:.button.button-orange.button-orange-trans.co.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_kmou_glove_840B_300d_ko_2.7.0_2.4_1609716021199.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ner_kmou_glove_840B_300d_ko_2.7.0_2.4_1609716021199.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") word_segmenter = WordSegmenterModel.pretrained("wordseg_kaist_ud", "ko")\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", "xx")\ .setInputCols("document", "token") \ .setOutputCol("embeddings") ner = NerDLModel.pretrained("ner_kmou_glove_840B_300d", "ko") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... pipeline = Pipeline(stages=[document_assembler, sentence_detector, word_segmenter, embeddings, ner, ner_converter]) example = spark.createDataFrame([['라이프니츠 의 주도 로 베를린 에 세우 어 지 ㄴ 베를린 과학아카데미']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val word_segmenter = WordSegmenterModel.pretrained("wordseg_kaist_ud", "ko") .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", "xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_kmou_glove_840B_300d", "ko") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter, embeddings, ner, ner_converter)) val data = Seq("라이프니츠 의 주도 로 베를린 에 세우 어 지 ㄴ 베를린 과학아카데미").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["라이프니츠 의 주도 로 베를린 에 세우 어 지 ㄴ 베를린 과학아카데미"] ner_df = nlu.load('ko.ner').predict(text) ner_df ```
## Results ```bash +------------+----+ |token |ner | +------------+----+ |라이프니츠 |B-PS| |베를린 |B-OG| |과학아카데미 |I-OG| +------------+----+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_kmou_glove_840B_300d| |Type:|ner| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ko| ## Data Source The model was trained on the Korea Maritime and Ocean University [NLP data set](https://github.com/kmounlp/NER). ## Benchmarking ```bash | ner_tag | precision | recall | f1-score | support | |:------------:|:---------:|:------:|:--------:|:-------:| | B-DT | 0.95 | 0.29 | 0.44 | 132 | | B-LC | 0.00 | 0.00 | 0.00 | 166 | | B-OG | 1.00 | 0.06 | 0.11 | 149 | | B-PS | 0.86 | 0.13 | 0.23 | 287 | | B-TI | 0.50 | 0.05 | 0.09 | 20 | | I-DT | 0.94 | 0.36 | 0.52 | 164 | | I-LC | 0.00 | 0.00 | 0.00 | 4 | | I-OG | 1.00 | 0.08 | 0.15 | 25 | | I-PS | 1.00 | 0.08 | 0.15 | 12 | | I-TI | 0.50 | 0.10 | 0.17 | 10 | | O | 0.94 | 1.00 | 0.97 | 12830 | | accuracy | 0.94 | 13799 | | | | macro avg | 0.70 | 0.20 | 0.26 | 13799 | | weighted avg | 0.93 | 0.94 | 0.92 | 13799 | ``` --- layout: model title: English image_classifier_vit_Check_Gum_Teeth ViTForImageClassification from steven123 author: John Snow Labs name: image_classifier_vit_Check_Gum_Teeth date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_Check_Gum_Teeth` is an English model originally trained by steven123.
## Predicted Entities `Bad_Gum`, `Good_Gum` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Check_Gum_Teeth_en_4.1.0_3.0_1660168122164.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Check_Gum_Teeth_en_4.1.0_3.0_1660168122164.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_Check_Gum_Teeth", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_Check_Gum_Teeth", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
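The `class` output is produced by taking the highest-scoring label from the classification head. A toy sketch of that final step, using hypothetical logits over this model's two labels (not the annotator's internals):

```python
import math

# The ViT head emits one logit per label; softmax turns them into
# probabilities and argmax picks the predicted class.
labels = ["Bad_Gum", "Good_Gum"]  # this model's two predicted entities
logits = [0.7, 2.1]               # hypothetical model output

exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]
pred = labels[probs.index(max(probs))]
print(pred)  # Good_Gum
```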
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_Check_Gum_Teeth| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English BertForQuestionAnswering model (from madlag) author: John Snow Labs name: bert_qa_bert_base_uncased_squadv1_x1.96_f88.3_d27_hybrid_filled_opt_v1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squadv1-x1.96-f88.3-d27-hybrid-filled-opt-v1` is an English model originally trained by `madlag`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squadv1_x1.96_f88.3_d27_hybrid_filled_opt_v1_en_4.0.0_3.0_1654181570985.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squadv1_x1.96_f88.3_d27_hybrid_filled_opt_v1_en_4.0.0_3.0_1654181570985.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_squadv1_x1.96_f88.3_d27_hybrid_filled_opt_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_squadv1_x1.96_f88.3_d27_hybrid_filled_opt_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased_x1.84_f88.7_d36_hybrid_filled_v1.by_madlag").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
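The model's max sentence length is 512 tokens, so longer contexts must be split before scoring. One common approach (sketched here with a hypothetical helper, not necessarily Spark NLP's internal strategy) is overlapping windows with a fixed stride, so an answer cut off at one window boundary still appears whole in the next:

```python
# Split a token list into overlapping windows of at most max_len tokens,
# advancing by `stride` tokens each time; hypothetical helper for illustration.
def windows(tokens, max_len, stride):
    out, start = [], 0
    while True:
        out.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += stride
    return out

toks = [f"t{i}" for i in range(10)]
for w in windows(toks, max_len=4, stride=2):
    print(w)
```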
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_squadv1_x1.96_f88.3_d27_hybrid_filled_opt_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|188.1 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/madlag/bert-base-uncased-squadv1-x1.96-f88.3-d27-hybrid-filled-opt-v1 - https://rajpurkar.github.io/SQuAD-explorer - https://www.aclweb.org/anthology/N19-1423.pdf --- layout: model title: English asr_wav2vec2_xls_r_1b_english TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_wav2vec2_xls_r_1b_english date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_1b_english` is an English model originally trained by jonatasgrosman. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_xls_r_1b_english_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_1b_english_en_4.2.0_3.0_1664015553321.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_1b_english_en_4.2.0_3.0_1664015553321.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xls_r_1b_english", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xls_r_1b_english", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
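Wav2Vec2ForCTC emits a character prediction per audio frame, and the `text` output comes from CTC decoding of those frames. A toy sketch of greedy CTC decoding (collapse repeated frame predictions, then drop the blank symbol), purely illustrative rather than the annotator's actual decoder:

```python
from itertools import groupby

# CTC greedy decoding: adjacent identical frame predictions are merged,
# then the special blank symbol (here "_") is removed.
BLANK = "_"

def ctc_collapse(frame_preds):
    collapsed = [ch for ch, _ in groupby(frame_preds)]  # merge repeats
    return "".join(ch for ch in collapsed if ch != BLANK)

# Hypothetical per-frame predictions; the blank between the two "l" runs
# is what lets CTC keep the double letter in "hello".
frames = ["h", "h", "_", "e", "e", "_", "l", "l", "_", "l", "o", "o"]
print(ctc_collapse(frames))  # hello
```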
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xls_r_1b_english| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|3.6 GB| --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_terri1102 TFWav2Vec2ForCTC from terri1102 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_terri1102 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_terri1102` is an English model originally trained by terri1102. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_by_terri1102_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_terri1102_en_4.2.0_3.0_1664107472195.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_terri1102_en_4.2.0_3.0_1664107472195.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_terri1102', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_terri1102", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_terri1102| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|354.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: German DistilBERT Embeddings (from Geotrend) author: John Snow Labs name: distilbert_embeddings_distilbert_base_de_cased date: 2022-04-12 tags: [distilbert, embeddings, de, open_source] task: Embeddings language: de edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-de-cased` is a German model orginally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_de_cased_de_3.4.2_3.0_1649783710575.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_de_cased_de_3.4.2_3.0_1649783710575.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_de_cased","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_de_cased","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.embed.distilbert_base_de_cased").predict("""Ich liebe Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_base_de_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|236.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/distilbert-base-de-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Bemba (Zambia) asr_wav2vec2_large_xlsr_bemba TFWav2Vec2ForCTC from csikasote author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_bemba date: 2022-09-24 tags: [wav2vec2, bem, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: bem edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_large_xlsr_bemba` is a Bemba (Zambia) model originally trained by csikasote. NOTE: This pipeline only works on a CPU, if you need to use this pipeline on a GPU device please use pipeline_asr_wav2vec2_large_xlsr_bemba_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_bemba_bem_4.2.0_3.0_1664023162676.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_bemba_bem_4.2.0_3.0_1664023162676.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_bemba', lang = 'bem') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_bemba", lang = "bem") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_bemba| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|bem| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English RobertaForQuestionAnswering Cased model (from yirenl2) author: John Snow Labs name: roberta_qa_plm date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `plm_qa` is a English model originally trained by `yirenl2`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_plm_en_4.2.4_3.0_1669985368479.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_plm_en_4.2.4_3.0_1669985368479.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_plm","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_plm","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_plm| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/yirenl2/plm_qa - https://paperswithcode.com/sota?task=Question+Answering&dataset=squad_v2 --- layout: model title: English image_classifier_vit_rare_puppers2 ViTForImageClassification from Samlit author: John Snow Labs name: image_classifier_vit_rare_puppers2 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rare_puppers2` is an English model originally trained by Samlit. ## Predicted Entities `La Goulue Toulouse-Lautrec`, `aristide bruant Lautrec`, `la goulue Toulouse-Lautrec`, `Marcelle Lender Bolero`, `Salon at the Rue des Moulins Lautrec` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rare_puppers2_en_4.1.0_3.0_1660171547389.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rare_puppers2_en_4.1.0_3.0_1660171547389.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_rare_puppers2", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_rare_puppers2", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_rare_puppers2| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Word Segmenter for Japanese author: John Snow Labs name: wordseg_gsd_ud date: 2021-03-09 tags: [word_segmentation, open_source, japanese, wordseg_gsd_ud, ja] task: Word Segmentation language: ja edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: WordSegmenterModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description [WordSegmenterModel-WSM](https://en.wikipedia.org/wiki/Text_segmentation) is based on a maximum entropy probability model that detects word boundaries in Japanese text. Japanese text is written without white space between words, so a computer-based application cannot know a priori which sequence of ideograms forms a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/chinese/word_segmentation/words_segmenter_demo.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wordseg_gsd_ud_ja_3.0.0_3.0_1615292309908.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wordseg_gsd_ud_ja_3.0.0_3.0_1615292309908.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("sentence") word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") \ .setInputCols(["sentence"]) \ .setOutputCol("token") pipeline = Pipeline(stages=[document_assembler, word_segmenter]) ws_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) example = spark.createDataFrame([['ジョンスノーラボからこんにちは! ']], ["text"]) result = ws_model.transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("sentence") val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") .setInputCols(Array("sentence")) .setOutputCol("token") val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter)) val data = Seq("ジョンスノーラボからこんにちは! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["ジョンスノーラボからこんにちは! "] token_df = nlu.load('ja.segment_words').predict(text) token_df ```
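The boundary-detection problem described above can be illustrated with a toy greedy maximum-matching segmenter. This is only a conceptual sketch under a hand-picked vocabulary, not the maximum entropy model the pretrained annotator actually uses:

```python
def max_match(text, vocab):
    """Greedy longest-match segmentation: at each position take the longest
    vocabulary word starting there, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

vocab = {"から", "こんにちは"}
print(max_match("からこんにちは", vocab))  # ['から', 'こんにちは']
```

Real segmenters learn boundary probabilities from annotated corpora instead of relying on a fixed dictionary, which is why the statistical model handles unseen words far better than this sketch.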
## Results ```bash 0 ジョンス 1 ノ 2 ー 3 ラ 4 ボ 5 から 6 こん 7 に 8 ち 9 は 10 ! Name: token, dtype: object ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wordseg_gsd_ud| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[words_segmented]| |Language:|ja| --- layout: model title: English DistilBertForQuestionAnswering Cased model (from SauravMaheshkar) author: John Snow Labs name: distilbert_qa_base_cased_distilled_chaii date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-chaii` is a English model originally trained by `SauravMaheshkar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_distilled_chaii_en_4.0.0_3.0_1654723547031.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_distilled_chaii_en_4.0.0_3.0_1654723547031.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_distilled_chaii","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_distilled_chaii","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.chaii.distil_bert.base_cased").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
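Under the hood, extractive QA models of this kind score every token as a candidate answer start and end, then return the best-scoring valid span. A minimal sketch of that span-selection step (the helper is hypothetical, not the Spark NLP API):

```python
def best_span(start_logits, end_logits, max_len=30):
    """Pick the (start, end) token pair maximizing start+end score,
    requiring end >= start and a bounded span length."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara"]
start = [0.1, 0.2, 0.1, 3.0]
end = [0.0, 0.1, 0.2, 2.5]
s, e = best_span(start, end)
print(tokens[s:e + 1])  # ['Clara']
```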
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_cased_distilled_chaii| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/SauravMaheshkar/distilbert-base-cased-distilled-chaii --- layout: model title: English BertForQuestionAnswering Cased model (from motiondew) author: John Snow Labs name: bert_qa_set_date_3_lr_2e_5_bs_32_ep_4 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-set_date_3-lr-2e-5-bs-32-ep-4` is a English model originally trained by `motiondew`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_3_lr_2e_5_bs_32_ep_4_en_4.0.0_3.0_1657188613886.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_3_lr_2e_5_bs_32_ep_4_en_4.0.0_3.0_1657188613886.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_3_lr_2e_5_bs_32_ep_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_3_lr_2e_5_bs_32_ep_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_set_date_3_lr_2e_5_bs_32_ep_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/motiondew/bert-set_date_3-lr-2e-5-bs-32-ep-4 --- layout: model title: Embeddings Scielowiki 300 dims author: John Snow Labs name: embeddings_scielowiki_300d class: WordEmbeddingsModel language: es repository: clinical/models date: 2020-05-26 task: Embeddings edition: Healthcare NLP 2.5.0 spark_version: 2.4 tags: [clinical,embeddings,es] supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/embeddings_scielowiki_300d_es_2.5.0_2.4_1590467643391.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/embeddings_scielowiki_300d_es_2.5.0_2.4_1590467643391.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python model = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d","es","clinical/models") \ .setInputCols(["document","token"]) \ .setOutputCol("word_embeddings") ``` ```scala val model = WordEmbeddingsModel.pretrained("embeddings_scielowiki_300d","es","clinical/models") .setInputCols("document","token") .setOutputCol("word_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.scielowiki.300d").predict("""Put your text here.""") ```
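A word-embeddings lookup annotator of this kind maps each token to a fixed pretrained vector. A minimal sketch of the lookup idea, with a zero vector for out-of-vocabulary tokens (the function name and fallback policy are illustrative assumptions, not the annotator's internals):

```python
def lookup_embeddings(tokens, table, dim=300):
    """Map each token to its vector; unknown tokens get a zero vector."""
    zero = [0.0] * dim
    return [table.get(tok, zero) for tok in tokens]

# Illustrative 300-dimensional entry, mirroring this model's dimension.
table = {"paciente": [0.1] * 300}
vecs = lookup_embeddings(["paciente", "desconocida"], table)
print(len(vecs[0]), vecs[1][:3])  # 300 [0.0, 0.0, 0.0]
```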
{:.h2_title} ## Results Word2Vec feature vectors based on ``embeddings_scielowiki_300d``. {:.model-param} ## Model Information {:.table-model} |---------------|----------------------------| | Name: | embeddings_scielowiki_300d | | Type: | WordEmbeddingsModel | | Compatibility: | Spark NLP 2.5.0+ | | License: | Licensed | | Edition: | Official | |Input labels: | [document, token] | |Output labels: |[word_embeddings] | | Language: | es | | Dimension: | 300.0 | {:.h2_title} ## Data Source Trained on Scielo Articles + Clinical Wikipedia Articles https://zenodo.org/record/3744326#.XtViinVKh_U --- layout: model title: English BertForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_8 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-64-finetuned-squad-seed-8` is a English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_8_en_4.0.0_3.0_1657192728608.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_8_en_4.0.0_3.0_1657192728608.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_8","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_8","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_8| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|378.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-64-finetuned-squad-seed-8 --- layout: model title: Topic Identification (Banking) author: John Snow Labs name: finclf_bert_banking77 date: 2022-09-28 tags: [en, finance, bank, topic, classification, modeling, licensed] task: Text Classification language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: BertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Bert-based model, which can be used to classify texts into 77 banking-related classes. This is a Multiclass model, meaning only one label will be returned as an output. 
The classes are the following: - activate_my_card - age_limit - apple_pay_or_google_pay - atm_support - automatic_top_up - balance_not_updated_after_bank_transfer - balance_not_updated_after_cheque_or_cash_deposit - beneficiary_not_allowed - cancel_transfer - card_about_to_expire - card_acceptance - card_arrival - card_delivery_estimate - card_linking - card_not_working - card_payment_fee_charged - card_payment_not_recognised - card_payment_wrong_exchange_rate - card_swallowed - cash_withdrawal_charge - cash_withdrawal_not_recognised - change_pin - compromised_card - contactless_not_working - country_support - declined_card_payment - declined_cash_withdrawal - declined_transfer - direct_debit_payment_not_recognised - disposable_card_limits - edit_personal_details - exchange_charge - exchange_rate - exchange_via_app - extra_charge_on_statement - failed_transfer - fiat_currency_support - get_disposable_virtual_card - get_physical_card - getting_spare_card - getting_virtual_card - lost_or_stolen_card - lost_or_stolen_phone - order_physical_card - passcode_forgotten - pending_card_payment - pending_cash_withdrawal - pending_top_up - pending_transfer - pin_blocked - receiving_money - Refund_not_showing_up - request_refund - reverted_card_payment? 
- supported_cards_and_currencies - terminate_account - top_up_by_bank_transfer_charge - top_up_by_card_charge - top_up_by_cash_or_cheque - top_up_failed - top_up_limits - top_up_reverted - topping_up_by_card - transaction_charged_twice - transfer_fee_charged - transfer_into_account - transfer_not_received_by_recipient - transfer_timing - unable_to_verify_identity - verify_my_identity - verify_source_of_funds - verify_top_up - virtual_card_not_working - visa_or_mastercard - why_verify_identity - wrong_amount_of_cash_received - wrong_exchange_rate_for_cash_withdrawal ## Predicted Entities `activate_my_card`, `age_limit`, `card_acceptance`, `card_arrival`, `card_delivery_estimate`, `card_linking`, `card_not_working`, `card_payment_fee_charged`, `card_payment_not_recognised`, `card_payment_wrong_exchange_rate`, `card_swallowed`, `cash_withdrawal_charge`, `apple_pay_or_google_pay`, `cash_withdrawal_not_recognised`, `change_pin`, `compromised_card`, `contactless_not_working`, `country_support`, `declined_card_payment`, `declined_cash_withdrawal`, `declined_transfer`, `direct_debit_payment_not_recognised`, `disposable_card_limits`, `atm_support`, `edit_personal_details`, `exchange_charge`, `exchange_rate`, `exchange_via_app`, `extra_charge_on_statement`, `failed_transfer`, `fiat_currency_support`, `get_disposable_virtual_card`, `get_physical_card`, `getting_spare_card`, `automatic_top_up`, `getting_virtual_card`, `lost_or_stolen_card`, `lost_or_stolen_phone`, `order_physical_card`, `passcode_forgotten`, `pending_card_payment`, `pending_cash_withdrawal`, `pending_top_up`, `pending_transfer`, `pin_blocked`, `balance_not_updated_after_bank_transfer`, `receiving_money`, `Refund_not_showing_up`, `request_refund`, `reverted_card_payment?`, `supported_cards_and_currencies`, `terminate_account`, `top_up_by_bank_transfer_charge`, `top_up_by_card_charge`, `top_up_by_cash_or_cheque`, `top_up_failed`, `balance_not_updated_after_cheque_or_cash_deposit`, `top_up_limits`, 
`top_up_reverted`, `topping_up_by_card`, `transaction_charged_twice`, `transfer_fee_charged`, `transfer_into_account`, `transfer_not_received_by_recipient`, `transfer_timing`, `unable_to_verify_identity`, `verify_my_identity`, `beneficiary_not_allowed`, `verify_source_of_funds`, `verify_top_up`, `virtual_card_not_working`, `visa_or_mastercard`, `why_verify_identity`, `wrong_amount_of_cash_received`, `wrong_exchange_rate_for_cash_withdrawal`, `cancel_transfer`, `card_about_to_expire` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_BANKING/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_bert_banking77_en_1.0.0_3.0_1664361071567.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_bert_banking77_en_1.0.0_3.0_1664361071567.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = nlp.Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = finance.BertForSequenceClassification \ .pretrained('finclf_bert_banking77', 'en', 'finance/models') \ .setInputCols(['token', 'document']) \ .setOutputCol('class') \ .setCaseSensitive(True) \ .setMaxSentenceLength(512) pipeline = nlp.Pipeline( stages=[ document_assembler, tokenizer, sequenceClassifier]) example = spark.createDataFrame([['I am still waiting on my card?']]).toDF("text") result = pipeline.fit(example).transform(example) ```
## Results ```bash ['atm_support'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finclf_bert_banking77| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|410.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References Banking77 dataset (https://paperswithcode.com/dataset/banking77-oos) and in-house data augmentation ## Benchmarking ```bash label Score Loss 0.3031957447528839 Accuracy 0.9363636363636364 Macro_F1 0.9364655956915154 Micro_F1 0.9363636363636364 Weighted_F1 0.9364655956915157 Macro_Precision 0.9396792003322154 Micro_Precision 0.9363636363636364 Weighted_Precision 0.9396792003322155 Macro_Recall 0.9363636363636365 Micro_Recall 0.9363636363636364 Weighted_Recall 0.9363636363636364 ``` --- layout: model title: English ElectraForQuestionAnswering Large model (from sultan) author: John Snow Labs name: electra_qa_BioM_Large_SQuAD2 date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BioM-ELECTRA-Large-SQuAD2` is a English model originally trained by `sultan`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_BioM_Large_SQuAD2_en_4.0.0_3.0_1655919022213.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_BioM_Large_SQuAD2_en_4.0.0_3.0_1655919022213.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_BioM_Large_SQuAD2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_BioM_Large_SQuAD2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.electra.large").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_BioM_Large_SQuAD2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/sultan/BioM-ELECTRA-Large-SQuAD2 - http://participants-area.bioasq.org/results/9b/phaseB/ - https://github.com/salrowili/BioM-Transformers --- layout: model title: Indonesian RobertaForMaskedLM Base Cased model (from cahya) author: John Snow Labs name: roberta_embeddings_base_indonesian_522m date: 2022-12-12 tags: [id, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: id edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-indonesian-522M` is an Indonesian model originally trained by `cahya`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_base_indonesian_522m_id_4.2.4_3.0_1670859219449.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_base_indonesian_522m_id_4.2.4_3.0_1670859219449.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_base_indonesian_522m","id") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_base_indonesian_522m","id") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_base_indonesian_522m| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|id| |Size:|473.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/cahya/roberta-base-indonesian-522M - https://github.com/cahya-wirawan/indonesian-language-models/tree/master/Transformers --- layout: model title: Word2Vec Embeddings in Italian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-07 tags: [cc, embeddings, fastText, word2vec, it, open_source] task: Embeddings language: it edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_it_3.4.1_3.0_1646660816126.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_it_3.4.1_3.0_1646660816126.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "it")\ .setInputCols(["document", "token"])\ .setOutputCol("embeddings") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "it") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("it.embed.word2vec").predict("""Put your text here.""") ```
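A common downstream use of these 300-dimensional vectors is comparing tokens by cosine similarity. The toy snippet below sketches that computation in plain Python; the three short vectors are illustrative stand-ins for real `w2v_cc_300d` output, not actual embeddings.

```python
# Hedged sketch: cosine similarity between embedding vectors.
# v_gatto / v_cane / v_auto are made-up 3-d stand-ins for 300-d vectors.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

v_gatto = [0.20, 0.80, 0.10]   # "cat"
v_cane  = [0.25, 0.75, 0.05]   # "dog"
v_auto  = [0.90, 0.05, 0.40]   # "car"

# Semantically closer words should score higher.
print(cosine(v_gatto, v_cane) > cosine(v_gatto, v_auto))  # -> True
```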
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|it| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| ## References [fastText common crawl word embeddings for Italian](https://fasttext.cc/docs/en/crawl-vectors.html). --- layout: model title: Smaller BERT Sentence Embeddings (L-4_H-768_A-12) author: John Snow Labs name: sent_small_bert_L4_768 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L4_768_en_2.6.0_2.4_1598351030380.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L4_768_en_2.6.0_2.4_1598351030380.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L4_768", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer'], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L4_768", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L4_768').predict(text, output_level='sentence') embeddings_df ```
{:.h2_title} ## Results ```bash en_embed_sentence_small_bert_L4_768_embeddings sentence [0.4034431576728821, 0.08385054022073746, 0.08... I hate cancer [-0.2262536883354187, 0.02650507353246212, -0.... Antibiotics aren't painkiller ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L4_768| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-768_A-12/1 --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from dkasti) author: John Snow Labs name: xlmroberta_ner_dkasti_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `dkasti`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_dkasti_base_finetuned_panx_de_4.1.0_3.0_1660432395543.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_dkasti_base_finetuned_panx_de_4.1.0_3.0_1660432395543.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_dkasti_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_dkasti_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_dkasti_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/dkasti/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_8_austria_2_s42 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_exp_w2v2r_xls_r_accent_germany_8_austria_2_s42 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_8_austria_2_s42` is a German model originally trained by jonatasgrosman. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2r_xls_r_accent_germany_8_austria_2_s42_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_8_austria_2_s42_de_4.2.0_3.0_1664118970719.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_8_austria_2_s42_de_4.2.0_3.0_1664118970719.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2r_xls_r_accent_germany_8_austria_2_s42', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2r_xls_r_accent_germany_8_austria_2_s42", lang = "de") val annotations = pipeline.transform(audioDF) ```
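The Wav2Vec2ForCTC stage in this pipeline produces per-frame character predictions that are collapsed by CTC decoding: repeated labels are merged and blank tokens removed. The snippet below sketches that greedy-decoding idea in plain Python; the frame labels are illustrative, not real model output, and this is not Spark NLP's internal implementation.

```python
# Illustrative sketch of greedy CTC decoding (repeat-collapse, then drop blanks).
BLANK = "_"

def ctc_collapse(frame_labels, blank=BLANK):
    out = []
    prev = None
    for lab in frame_labels:
        # Keep a label only when it differs from the previous frame
        # and is not the CTC blank symbol.
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return "".join(out)

# Hypothetical per-frame argmax labels for the word "hallo"
frames = list("hh_aal_ll_oo")
print(ctc_collapse(frames))  # -> hallo
```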
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_exp_w2v2r_xls_r_accent_germany_8_austria_2_s42| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Translate English to Philippine languages Pipeline author: John Snow Labs name: translate_en_phi date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, phi, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `phi` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_phi_xx_2.7.0_2.4_1609691790773.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_phi_xx_2.7.0_2.4_1609691790773.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_phi", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_phi", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.phi').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_phi| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English image_classifier_vit_puppies_classify ViTForImageClassification from cherrypaca author: John Snow Labs name: image_classifier_vit_puppies_classify date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_puppies_classify` is an English model originally trained by cherrypaca. ## Predicted Entities `corgi`, `husky`, `pomeranian` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_puppies_classify_en_4.1.0_3.0_1660170671838.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_puppies_classify_en_4.1.0_3.0_1660170671838.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_puppies_classify", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_puppies_classify", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_puppies_classify| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Smaller BERT Embeddings (L-2_H-256_A-4) author: John Snow Labs name: small_bert_L2_256 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L2_256_en_2.6.0_2.4_1598344391697.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L2_256_en_2.6.0_2.4_1598344391697.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L2_256", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L2_256", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L2_256').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash token en_embed_bert_small_L2_256_embeddings I [-1.712011456489563, -1.076645851135254, 0.697... love [-1.1276499032974243, -0.9930340647697449, 1.5... NLP [-0.3206934928894043, 0.03202249854803085, 1.4... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|small_bert_L2_256| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|256| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-256_A-4/1 --- layout: model title: English image_classifier_vit_food ViTForImageClassification from nateraw author: John Snow Labs name: image_classifier_vit_food date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_food` is an English model originally trained by nateraw.
## Predicted Entities `grilled_cheese_sandwich`, `edamame`, `onion_rings`, `french_onion_soup`, `french_fries`, `creme_brulee`, `lobster_roll_sandwich`, `bruschetta`, `breakfast_burrito`, `caprese_salad`, `churros`, `omelette`, `club_sandwich`, `chocolate_mousse`, `nachos`, `bread_pudding`, `steak`, `hummus`, `panna_cotta`, `filet_mignon`, `sashimi`, `hot_and_sour_soup`, `cannoli`, `ravioli`, `samosa`, `grilled_salmon`, `lobster_bisque`, `seaweed_salad`, `macaroni_and_cheese`, `fish_and_chips`, `caesar_salad`, `dumplings`, `baby_back_ribs`, `fried_rice`, `oysters`, `peking_duck`, `guacamole`, `greek_salad`, `donuts`, `risotto`, `escargots`, `crab_cakes`, `waffles`, `carrot_cake`, `prime_rib`, `tuna_tartare`, `pho`, `chocolate_cake`, `bibimbap`, `fried_calamari`, `spaghetti_bolognese`, `gnocchi`, `chicken_quesadilla`, `frozen_yogurt`, `apple_pie`, `baklava`, `pulled_pork_sandwich`, `clam_chowder`, `eggs_benedict`, `lasagna`, `ceviche`, `paella`, `foie_gras`, `spring_rolls`, `falafel`, `miso_soup`, `pork_chop`, `ramen`, `pad_thai`, `garlic_bread`, `macarons`, `ice_cream`, `mussels`, `chicken_wings`, `pancakes`, `gyoza`, `poutine`, `croque_madame`, `pizza`, `cheese_plate`, `beignets`, `huevos_rancheros`, `french_toast`, `sushi`, `takoyaki`, `spaghetti_carbonara`, `beef_tartare`, `scallops`, `cup_cakes`, `tacos`, `deviled_eggs`, `beet_salad`, `tiramisu`, `cheesecake`, `strawberry_shortcake`, `beef_carpaccio`, `hamburger`, `red_velvet_cake`, `hot_dog`, `shrimp_and_grits`, `chicken_curry` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_food_en_4.1.0_3.0_1660167590552.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_food_en_4.1.0_3.0_1660167590552.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_food", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_food", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
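The classification head behind `ViTForImageClassification` ultimately maps per-image logits to one of the predicted labels via softmax and argmax. The sketch below shows that final step in plain Python; the tiny label subset and logit values are illustrative stand-ins, not actual model output.

```python
# Hedged sketch: logits -> softmax probabilities -> predicted label.
# Labels are a small subset of this model's classes; logits are made up.
import math

def softmax(logits):
    m = max(logits)                         # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

labels = ["pizza", "sushi", "ramen"]
logits = [0.4, 2.9, 1.1]                    # hypothetical network output

probs = softmax(logits)
print(labels[probs.index(max(probs))])  # -> sushi
```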
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_food| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.2 MB| --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_10_h_256 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-10_H-256` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_10_h_256_zh_4.2.4_3.0_1670021467871.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_10_h_256_zh_4.2.4_3.0_1670021467871.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_10_h_256","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_10_h_256","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_10_h_256| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|51.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-10_H-256 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Extraction of Clinical Abbreviations and Acronyms author: John Snow Labs name: ner_abbreviation_emb_clinical_medium date: 2023-05-12 tags: [ner, abbreviation, acronym, en, clinical, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.1 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained to extract clinical abbreviations and acronyms in text. ## Predicted Entities `ABBR` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ABBREVIATION/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_abbreviation_emb_clinical_medium_en_4.4.1_3.0_1683884147067.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_abbreviation_emb_clinical_medium_en_4.4.1_3.0_1683884147067.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_abbreviation_emb_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. 
Group B strep has not been done as yet."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_abbreviation_emb_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter)) val data = Seq("Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
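Spark NLP chunk annotations carry inclusive begin/end character offsets into the input text, so a chunk can be recovered as `text[begin:end + 1]`. The sketch below demonstrates this in plain Python on the opening of the example document above, using the offsets this model produces for `CBC` and `AB`.

```python
# Minimal sketch: mapping inclusive begin/end offsets back to chunk text.
text = ("Gravid with estimated fetal weight of 6-6/12 pounds. "
        "LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests "
        "include a CBC which is normal. Blood Type: AB positive.")

# (begin, end, label) rows for two of the detected abbreviation chunks
chunks = [(126, 128, "ABBR"), (159, 160, "ABBR")]

for begin, end, label in chunks:
    # end is inclusive, hence the +1 when slicing
    print(text[begin:end + 1], label)
# -> CBC ABBR
# -> AB ABBR
```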
## Results ```bash +-----+-----+---+---------+ |chunk|begin|end|ner_label| +-----+-----+---+---------+ |CBC |126 |128|ABBR | |AB |159 |160|ABBR | |VDRL |189 |192|ABBR | |HIV |247 |249|ABBR | +-----+-----+---+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_abbreviation_emb_clinical_medium| |Compatibility:|Healthcare NLP 4.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|2.8 MB| ## References Trained on an in-house dataset. ## Benchmarking ```bash label precision recall f1-score support ABBR 0.94 0.95 0.95 620 micro-avg 0.94 0.95 0.95 620 macro-avg 0.94 0.95 0.95 620 weighted-avg 0.94 0.95 0.95 620 ``` --- layout: model title: Portuguese BertForTokenClassification Cased model (from pucpr) author: John Snow Labs name: bert_token_classifier_clinicalnerpt_disorder date: 2022-11-30 tags: [pt, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: pt edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `clinicalnerpt-disorder` is a Portuguese model originally trained by `pucpr`. ## Predicted Entities `Disorder` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_clinicalnerpt_disorder_pt_4.2.4_3.0_1669822474674.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_clinicalnerpt_disorder_pt_4.2.4_3.0_1669822474674.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_clinicalnerpt_disorder","pt") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_clinicalnerpt_disorder","pt") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_clinicalnerpt_disorder| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|pt| |Size:|665.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/pucpr/clinicalnerpt-disorder - https://www.aclweb.org/anthology/2020.clinicalnlp-1.7/ - https://github.com/HAILab-PUCPR/SemClinBr - https://github.com/HAILab-PUCPR/BioBERTpt --- layout: model title: Part of Speech for Polish author: John Snow Labs name: pos_lfg date: 2021-03-23 tags: [pl, open_source] supported: true task: Part of Speech Tagging language: pl edition: Spark NLP 2.7.5 spark_version: 2.4 annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron` architecture. ## Predicted Entities - ADJ - ADP - ADV - AUX - CCONJ - DET - NOUN - NUM - PART - PRON - PROPN - PUNCT - VERB {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_lfg_pl_2.7.5_2.4_1616510144592.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_lfg_pl_2.7.5_2.4_1616510144592.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_lfg", "pl")\ .setInputCols(["sentence", "token"])\ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Zarobki wszystkich nauczycieli będą rosły co rok .']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_lfg", "pl") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Zarobki wszystkich nauczycieli będą rosły co rok .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Zarobki wszystkich nauczycieli będą rosły co rok ."] token_df = nlu.load('pl.pos.lfg').predict(text) token_df ```
## Results ```bash +--------------------------------------------------+----------------------------------------------+ |text |result | +--------------------------------------------------+----------------------------------------------+ |Zarobki wszystkich nauczycieli będą rosły co rok .|[NOUN, DET, NOUN, AUX, VERB, ADP, NOUN, PUNCT]| +--------------------------------------------------+----------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_lfg| |Compatibility:|Spark NLP 2.7.5+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[pos]| |Language:|pl| ## Data Source The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) data set. ## Benchmarking ```bash | | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | ADJ | 0.93 | 0.90 | 0.92 | 830 | | ADP | 0.98 | 0.99 | 0.99 | 1097 | | ADV | 0.91 | 0.94 | 0.93 | 589 | | AUX | 0.93 | 0.95 | 0.94 | 429 | | CCONJ | 0.98 | 0.99 | 0.98 | 354 | | DET | 0.94 | 0.91 | 0.93 | 324 | | INTJ | 0.67 | 0.33 | 0.44 | 6 | | NOUN | 0.93 | 0.95 | 0.94 | 2457 | | NUM | 0.92 | 0.94 | 0.93 | 90 | | PART | 0.99 | 0.95 | 0.97 | 597 | | PRON | 0.98 | 0.97 | 0.97 | 986 | | PROPN | 0.92 | 0.87 | 0.89 | 470 | | PUNCT | 1.00 | 1.00 | 1.00 | 2555 | | SCONJ | 0.97 | 0.99 | 0.98 | 141 | | VERB | 0.96 | 0.96 | 0.96 | 2187 | | accuracy | | | 0.96 | 13112 | | macro avg | 0.93 | 0.91 | 0.92 | 13112 | | weighted avg | 0.96 | 0.96 | 0.96 | 13112 | ``` --- layout: model title: English BertForQuestionAnswering Base Cased model (from victorlee071200) author: John Snow Labs name: bert_qa_victorlee071200_base_cased_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true
annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased-finetuned-squad` is an English model originally trained by `victorlee071200`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_victorlee071200_base_cased_finetuned_squad_en_4.0.0_3.0_1657182882040.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_victorlee071200_base_cased_finetuned_squad_en_4.0.0_3.0_1657182882040.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_victorlee071200_base_cased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_victorlee071200_base_cased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_victorlee071200_base_cased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/victorlee071200/bert-base-cased-finetuned-squad --- layout: model title: Financial ORG, PRODUCT and ALIAS NER (Small) author: John Snow Labs name: finner_orgs_prods_alias date: 2022-08-17 tags: [en, finance, ner, licensed] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true recommended: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Named Entity Recognition model, trained with a subset of generic CoNLL, financial and legal CoNLL, OntoNotes and several in-house corpora, to detect Organizations, Products and Aliases of Companies. ## Predicted Entities `ORG`, `PRODUCT`, `ALIAS` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FINNER_ORGPROD){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_orgs_prods_alias_en_1.0.0_3.2_1660733832114.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_orgs_prods_alias_en_1.0.0_3.2_1660733832114.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from pyspark.sql import functions as F documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""In 2020, we acquired certain assets of Spell Security Private Limited (also known as "Spell Security"). More specifically, their Compliance product - Policy Compliance (PC)")."""] res = model.transform(spark.createDataFrame([text]).toDF("text")) res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.metadata)).alias("cols")) \ .select(F.expr("cols['0']").alias("chunk"), F.expr("cols['1']['entity']").alias("ner_label"), F.expr("cols['1']['confidence']").alias("confidence")).show(truncate=False) ```
## Results ```bash +------------------------------+---------+----------+ |chunk |ner_label|confidence| +------------------------------+---------+----------+ |Spell Security Private Limited|ORG |0.8475 | |Spell Security |ALIAS |0.8871 | |Policy Compliance |PRODUCT |0.7991 | +------------------------------+---------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_orgs_prods_alias| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.7 MB| ## References CoNLL-2003, FinSec CoNLL, a subset of OntoNotes, in-house corpora ## Benchmarking ```bash label tp fp fn prec rec f1 I-ORG 12853 2621 2685 0.8306191 0.82719785 0.828905 B-PRODUCT 2306 697 932 0.76789874 0.712168 0.7389841 I-ALIAS 14 6 13 0.7 0.5185185 0.59574467 B-ORG 8967 2078 2311 0.81186056 0.79508775 0.80338657 I-PRODUCT 2336 803 1091 0.74418604 0.68164575 0.7115443 B-ALIAS 76 14 22 0.84444445 0.7755102 0.80851066 Macro-average 26552 6219 7054 0.78316814 0.7183547 0.7493626 Micro-average 26552 6219 7054 0.8102285 0.790097 0.80003613 ``` --- layout: model title: Cyberbullying Classifier Pipeline in Turkish texts author: John Snow Labs name: classifierdl_berturk_cyberbullying_pipeline date: 2021-08-13 tags: [tr, cyberbullying, pipeline, open_source] task: Pipeline Public language: tr edition: Spark NLP 3.1.3 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pre-trained pipeline identifies whether a Turkish text contains cyberbullying or not.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_berturk_cyberbullying_pipeline_tr_3.1.3_2.4_1628848526053.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_berturk_cyberbullying_pipeline_tr_3.1.3_2.4_1628848526053.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("classifierdl_berturk_cyberbullying_pipeline", "tr") result = pipeline.fullAnnotate("""Gidişin olsun, dönüşün olmasın inşallah senin..""") ``` ```scala val pipeline = new PretrainedPipeline("classifierdl_berturk_cyberbullying_pipeline", "tr") val result = pipeline.fullAnnotate("Gidişin olsun, dönüşün olmasın inşallah senin..")(0) ```
## Results ```bash ["Negative"] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_berturk_cyberbullying_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.1.3+| |License:|Open Source| |Edition:|Official| |Language:|tr| ## Included Models - DocumentAssembler - TokenizerModel - NormalizerModel - StopWordsCleaner - LemmatizerModel - BertEmbeddings - SentenceEmbeddings - ClassifierDLModel --- layout: model title: Legal Intellectual Property Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_intellectual_property_bert date: 2023-03-05 tags: [en, legal, classification, clauses, intellectual_property, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from the US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Intellectual_Property` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.
If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Intellectual_Property`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_intellectual_property_bert_en_1.0.0_3.0_1678050714983.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_intellectual_property_bert_en_1.0.0_3.0_1678050714983.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_intellectual_property_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
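The card above recommends paragraph splitting (by multiline) before feeding long contracts to the classifier. A minimal, library-independent sketch of that idea in plain Python — the helper name and regex are illustrative assumptions, not part of Spark NLP:

```python
import re

def split_paragraphs(text):
    """Split a document into provisions on blank lines (one or more empty lines).

    Hypothetical pre-processing helper: each returned chunk can then become
    its own row in the DataFrame passed to the binary clause classifier.
    """
    # Split wherever a blank line (possibly containing whitespace) separates blocks.
    chunks = re.split(r"\n\s*\n", text)
    # Drop empty chunks and trim surrounding whitespace.
    return [c.strip() for c in chunks if c.strip()]

contract = "1. GRANT OF LICENSE.\nLicensor grants...\n\n2. INTELLECTUAL PROPERTY.\nAll IP remains..."
paragraphs = split_paragraphs(contract)
```

Each element of `paragraphs` is one provision, giving the classifier enough context per row without exceeding the embedding length limit.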
## Results ```bash +-------+ |result| +-------+ |[Intellectual_Property]| |[Other]| |[Other]| |[Intellectual_Property]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_intellectual_property_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Intellectual_Property 0.98 0.96 0.97 52 Other 0.97 0.99 0.98 73 accuracy - - 0.98 125 macro-avg 0.98 0.97 0.98 125 weighted-avg 0.98 0.98 0.98 125 ``` --- layout: model title: Legal Witnesseth Clause Binary Classifier author: John Snow Labs name: legclf_witnesseth_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `witnesseth` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `witnesseth` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_witnesseth_clause_en_1.0.0_3.2_1660124134557.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_witnesseth_clause_en_1.0.0_3.2_1660124134557.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_witnesseth_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
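The card notes that several of these binary clause classifiers can be combined, yielding one True/False value per clause type. A small pure-Python sketch of that aggregation step — the clause names and labels below are hypothetical examples; in practice each label would come from a separate `ClassifierDLModel`'s `category` column:

```python
def clause_flags(classifier_outputs):
    """Aggregate per-classifier labels into a clause -> bool map.

    `classifier_outputs` maps a clause type to the label its binary
    classifier produced (e.g. 'witnesseth' or 'other'). A clause is
    flagged True whenever its own classifier did not answer 'other'.
    """
    return {clause: label != "other" for clause, label in classifier_outputs.items()}

# Hypothetical outputs from three binary classifiers run on one document:
outputs = {"witnesseth": "witnesseth", "solvency": "other", "whereas": "other"}
flags = clause_flags(outputs)
```

The resulting map answers "which clause types appear in this document?" in one lookup per clause.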
## Results ```bash +-------+ | result| +-------+ |[witnesseth]| |[other]| |[other]| |[witnesseth]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_witnesseth_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.3 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.95 0.94 0.95 125 witnesseth 0.88 0.89 0.88 55 accuracy - - 0.93 180 macro-avg 0.91 0.92 0.92 180 weighted-avg 0.93 0.93 0.93 180 ``` --- layout: model title: Fast Neural Machine Translation Model from Lozi to English author: John Snow Labs name: opus_mt_loz_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, loz, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list).
- source languages: `loz` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_loz_en_xx_2.7.0_2.4_1609170483069.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_loz_en_xx_2.7.0_2.4_1609170483069.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_loz_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = "text to translate" result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_loz_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.loz.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_loz_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Solvency Clause Binary Classifier author: John Snow Labs name: legclf_solvency_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `solvency` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities `other`, `solvency` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_solvency_clause_en_1.0.0_3.2_1660123006192.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_solvency_clause_en_1.0.0_3.2_1660123006192.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_solvency_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
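Because the sentence embeddings used here accept at most 512 tokens, longer provisions should be cut down before classification. A rough whitespace-token chunker in plain Python — the 512 budget comes from the card, while the helper itself is an illustrative assumption, not a Spark NLP API:

```python
def chunk_by_tokens(text, max_tokens=512):
    """Greedily pack whitespace-separated tokens into chunks of at most max_tokens.

    Real wordpiece tokenizers emit more tokens than whitespace splitting does,
    so a safety margin (e.g. max_tokens=400) is advisable in practice.
    """
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

# A 1000-word document splits into a full 512-word chunk plus the remainder.
chunks = chunk_by_tokens("word " * 1000, max_tokens=512)
```

Each chunk can then be classified independently, and the per-chunk results merged with an "any True" rule per clause type.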
## Results ```bash +-------+ | result| +-------+ |[solvency]| |[other]| |[other]| |[solvency]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_solvency_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support insolvency 0.98 0.98 0.98 43 other 0.99 0.99 0.99 101 accuracy - - 0.99 144 macro-avg 0.98 0.98 0.98 144 weighted-avg 0.99 0.99 0.99 144 ``` --- layout: model title: Explain Document Pipeline for Norwegian (Bokmal) author: John Snow Labs name: explain_document_md date: 2021-03-22 tags: [open_source, norwegian_bokmal, explain_document_md, pipeline, "no"] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: "no" edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_md is a pretrained pipeline that performs the most common text processing steps on your dataframe: sentence detection, tokenization, lemmatization, part-of-speech tagging, word embeddings and named entity recognition. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_no_3.0.0_3.0_1616435687010.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_md_no_3.0.0_3.0_1616435687010.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_md', lang = 'no') annotations = pipeline.fullAnnotate("Hei fra John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_md", lang = "no") val result = pipeline.fullAnnotate("Hei fra John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hei fra John Snow Labs! "] result_df = nlu.load('no.explain.md').predict(text) result_df ```
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:-----------------------------|:----------------------------|:----------------------------------------|:----------------------------------------|:--------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Hei fra John Snow Labs! '] | ['Hei fra John Snow Labs!'] | ['Hei', 'fra', 'John', 'Snow', 'Labs!'] | ['Hei', 'fra', 'John', 'Snow', 'Labs!'] | ['PROPN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.1868100017309188,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|no| --- layout: model title: Clinical Findings to UMLS Code Pipeline author: John Snow Labs name: umls_clinical_findings_resolver_pipeline date: 2022-07-26 tags: [en, licensed, umls, pipeline] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps entities (Clinical Findings) to their corresponding UMLS CUI codes. Simply feed in your text, and it will return the corresponding UMLS codes.
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/umls_clinical_findings_resolver_pipeline_en_4.0.0_3.0_1658822255140.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/umls_clinical_findings_resolver_pipeline_en_4.0.0_3.0_1658822255140.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("umls_clinical_findings_resolver_pipeline", "en", "clinical/models") pipeline.annotate("HTG-induced pancreatitis associated with an acute hepatitis, and obesity") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = PretrainedPipeline("umls_clinical_findings_resolver_pipeline", "en", "clinical/models") val result = pipeline.annotate("HTG-induced pancreatitis associated with an acute hepatitis, and obesity") ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.umls_clinical_findings_resolver").predict("""HTG-induced pancreatitis associated with an acute hepatitis, and obesity""")
```
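The `annotate()` call above returns a plain dictionary of output columns. A minimal post-processing sketch that pairs each recognized chunk with its resolved code — note the `ner_chunk` and `umls_code` key names are assumptions for illustration and may differ depending on the pipeline's output columns:

```python
def pair_chunks_with_codes(annotations):
    # `annotations` is assumed to be the dict returned by annotate();
    # the "ner_chunk" / "umls_code" keys are illustrative, not verified.
    chunks = annotations.get("ner_chunk", [])
    codes = annotations.get("umls_code", [])
    return list(zip(chunks, codes))

rows = pair_chunks_with_codes({
    "ner_chunk": ["HTG-induced pancreatitis", "an acute hepatitis", "obesity"],
    "umls_code": ["C1963198", "C4750596", "C1963185"],
})
# rows -> [("HTG-induced pancreatitis", "C1963198"), ("an acute hepatitis", "C4750596"), ("obesity", "C1963185")]
```

Each `(chunk, code)` pair can then be loaded into a DataFrame or written out as needed.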
## Results ```bash +------------------------+---------+---------+ |chunk |ner_label|umls_code| +------------------------+---------+---------+ |HTG-induced pancreatitis|PROBLEM |C1963198 | |an acute hepatitis |PROBLEM |C4750596 | |obesity |PROBLEM |C1963185 | +------------------------+---------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|umls_clinical_findings_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|4.3 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger --- layout: model title: English asr_english_filipino_wav2vec2_l_xls_r_test_03 TFWav2Vec2ForCTC from Khalsuu author: John Snow Labs name: asr_english_filipino_wav2vec2_l_xls_r_test_03 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_english_filipino_wav2vec2_l_xls_r_test_03` is an English model originally trained by Khalsuu.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_english_filipino_wav2vec2_l_xls_r_test_03_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_english_filipino_wav2vec2_l_xls_r_test_03_en_4.2.0_3.0_1664108041739.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_english_filipino_wav2vec2_l_xls_r_test_03_en_4.2.0_3.0_1664108041739.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_english_filipino_wav2vec2_l_xls_r_test_03", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_english_filipino_wav2vec2_l_xls_r_test_03", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
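The snippets above assume an existing `audioDf` DataFrame whose `audio_content` column holds the audio as an array of floats. A hedged sketch of producing that float array from raw little-endian 16-bit PCM bytes using only the standard library (in practice you might read the file with librosa or soundfile instead — this is just the normalization step):

```python
import struct

def pcm16_to_floats(raw: bytes) -> list:
    # Convert little-endian signed 16-bit PCM samples to floats in [-1.0, 1.0).
    n = len(raw) // 2
    samples = struct.unpack("<%dh" % n, raw[: n * 2])
    return [s / 32768.0 for s in samples]

# Two synthetic samples: 0 and 16384 (half of full scale)
floats = pcm16_to_floats(struct.pack("<2h", 0, 16384))
# floats -> [0.0, 0.5]
```

The resulting list can then be wrapped into a one-column DataFrame (e.g. `spark.createDataFrame([(floats,)], ["audio_content"])`, an assumed shape) to serve as `audioDf`.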
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_english_filipino_wav2vec2_l_xls_r_test_03| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Multilingual XLMRobertaForTokenClassification Base Cased model (from skr3178) author: John Snow Labs name: xlmroberta_ner_skr3178_base_finetuned_panx_all date: 2022-08-13 tags: [xx, open_source, xlm_roberta, ner] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-all` is a Multilingual model originally trained by `skr3178`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_skr3178_base_finetuned_panx_all_xx_4.1.0_3.0_1660428924987.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_skr3178_base_finetuned_panx_all_xx_4.1.0_3.0_1660428924987.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_skr3178_base_finetuned_panx_all","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_skr3178_base_finetuned_panx_all","xx") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
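The `NerConverter` stage above groups the per-token `B-`/`I-` predictions into entity chunks. A simplified sketch of that grouping logic (not the annotator's actual implementation, just the standard BIO-decoding idea):

```python
def bio_to_chunks(tokens, tags):
    # Group BIO-tagged tokens into (entity_text, label) chunks.
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" tag, or an I- tag that doesn't continue the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

chunks = bio_to_chunks(
    ["Hei", "fra", "John", "Snow", "Labs"],
    ["O", "O", "B-PER", "I-PER", "I-PER"],
)
# chunks -> [("John Snow Labs", "PER")]
```

This mirrors what lands in the `ner_chunk` output column: one chunk per contiguous `B-`/`I-` run, labeled `PER`, `LOC`, or `ORG` for this model.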
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_skr3178_base_finetuned_panx_all| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|861.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/skr3178/xlm-roberta-base-finetuned-panx-all --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_bert_medium_pretrained_finetuned_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-medium-pretrained-finetuned-squad` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_medium_pretrained_finetuned_squad_en_4.0.0_3.0_1654183730265.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_medium_pretrained_finetuned_squad_en_4.0.0_3.0_1654183730265.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_medium_pretrained_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_medium_pretrained_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.medium_finetuned.by_anas-awadalla").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_medium_pretrained_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|154.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-medium-pretrained-finetuned-squad --- layout: model title: Legal Notices Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_notices_bert date: 2023-03-05 tags: [en, legal, classification, clauses, notices, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from the US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Notices` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc.
Take into account that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, producing a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Notices`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_notices_bert_en_1.0.0_3.0_1678049992578.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_notices_bert_en_1.0.0_3.0_1678049992578.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_notices_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
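The description above recommends paragraph splitting (by multiline) before classification, so each provision reaches the classifier as its own row. A minimal sketch of that pre-processing step, splitting on blank lines:

```python
import re

def split_paragraphs(text: str) -> list:
    # Split a document into paragraphs on one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "NOTICES. All notices shall be in writing.\n\nGOVERNING LAW. This Agreement is governed by the laws of Delaware."
paragraphs = split_paragraphs(doc)
# Each paragraph can then become one row of the classifier's input DataFrame.
```

For real filings you would likely combine this with header/subheader splitting, as the linked tutorial describes.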
## Results ```bash +-------+ |result| +-------+ |[Notices]| |[Other]| |[Other]| |[Notices]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_notices_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Notices 0.98 0.97 0.97 301 Other 0.97 0.98 0.97 337 accuracy - - 0.97 638 macro-avg 0.97 0.97 0.97 638 weighted-avg 0.97 0.97 0.97 638 ``` --- layout: model title: English T5ForConditionalGeneration Cased model (from dbernsohn) author: John Snow Labs name: t5_measurement_time date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5_measurement_time` is an English model originally trained by `dbernsohn`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_measurement_time_en_4.3.0_3.0_1675156797621.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_measurement_time_en_4.3.0_3.0_1675156797621.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_measurement_time","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_measurement_time","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_measurement_time| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|278.9 MB| ## References - https://huggingface.co/dbernsohn/t5_measurement_time - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://www.tensorflow.org/datasets/catalog/math_dataset#mathdatasetmeasurement_time - https://github.com/DorBernsohn/CodeLM/tree/main/MathLM - https://www.linkedin.com/in/dor-bernsohn-70b2b1146/ --- layout: model title: Legal Adoption Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_adoption_agreement_bert date: 2023-01-29 tags: [en, legal, classification, adoption, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_adoption_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `adoption-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `adoption-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_adoption_agreement_bert_en_1.0.0_3.0_1674990271078.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_adoption_agreement_bert_en_1.0.0_3.0_1674990271078.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_adoption_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[adoption-agreement]| |[other]| |[other]| |[adoption-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_adoption_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.91 0.97 0.94 98 standstill-agreement 0.93 0.82 0.87 51 accuracy - - 0.92 149 macro-avg 0.92 0.90 0.91 149 weighted-avg 0.92 0.92 0.92 149 ``` --- layout: model title: Translate English to Urdu Pipeline author: John Snow Labs name: translate_en_ur date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, ur, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `ur` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ur_xx_2.7.0_2.4_1609690466742.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ur_xx_2.7.0_2.4_1609690466742.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ur", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ur", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ur').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_ur| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from ktrapeznikov) author: John Snow Labs name: bert_qa_scibert_scivocab_uncased_squad_v2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `scibert_scivocab_uncased_squad_v2` is an English model originally trained by `ktrapeznikov`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_scibert_scivocab_uncased_squad_v2_en_4.0.0_3.0_1654189468977.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_scibert_scivocab_uncased_squad_v2_en_4.0.0_3.0_1654189468977.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_scibert_scivocab_uncased_squad_v2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_scibert_scivocab_uncased_squad_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.scibert.uncased_v2").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_scibert_scivocab_uncased_squad_v2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|410.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/ktrapeznikov/scibert_scivocab_uncased_squad_v2 - https://rajpurkar.github.io/SQuAD-explorer/ --- layout: model title: Word2Vec Embeddings in Volapük (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, vo, open_source] task: Embeddings language: vo edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_vo_3.4.1_3.0_1647466940019.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_vo_3.4.1_3.0_1647466940019.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","vo") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","vo") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("vo.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
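The `embeddings` output column above carries one 300-dimensional vector per token. A sketch of comparing two such vectors with cosine similarity in plain Python (toy 3-d vectors stand in for the real 300-d ones):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d vectors standing in for real 300-d CC vectors
v1 = [0.2, 0.1, -0.4]
sim = cosine(v1, v1)  # a vector compared with itself scores 1.0
```

In practice you would pull the token vectors out of the `result` DataFrame before comparing them; this helper only shows the arithmetic.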
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|vo| |Size:|144.7 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Korean Electra Embeddings (from snunlp) author: John Snow Labs name: electra_embeddings_kr_electra_generator date: 2022-05-16 tags: [ko, open_source, electra, embeddings] task: Embeddings language: ko edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `KR-ELECTRA-generator` is a Korean model originally trained by `snunlp`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_kr_electra_generator_ko_3.4.4_3.0_1652716594290.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_kr_electra_generator_ko_3.4.4_3.0_1652716594290.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_kr_electra_generator","ko") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["나는 Spark NLP를 좋아합니다"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_kr_electra_generator","ko") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("나는 Spark NLP를 좋아합니다").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_kr_electra_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|ko| |Size:|124.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/snunlp/KR-ELECTRA-generator - https://github.com/google-research/electra - https://bitbucket.org/eunjeon/mecab-ko-dic/src/master/ - https://drive.google.com/file/d/1L_yKEDaXM_yDLwHm5QrXAncQZiMN3BBU/view?usp=sharing - https://github.com/monologg/KoELECTRA - https://github.com/snunlp/KR-ELECTRA --- layout: model title: Translate Kabyle to English Pipeline author: John Snow Labs name: translate_kab_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, kab, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `kab` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_kab_en_xx_2.7.0_2.4_1609690292388.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_kab_en_xx_2.7.0_2.4_1609690292388.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_kab_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_kab_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.kab.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_kab_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Sangita) author: John Snow Labs name: distilbert_qa_sangita_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Sangita`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_sangita_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769187333.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_sangita_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769187333.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sangita_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sangita_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_sangita_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Sangita/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal Base Salary Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_base_salary_bert date: 2023-03-05 tags: [en, legal, classification, clauses, base_salary, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a binary classifier (True, False) for the `Base_Salary` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at the sentence level. If you have large legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. 
Keep in mind that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Base_Salary`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_base_salary_bert_en_1.0.0_3.0_1678050573214.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_base_salary_bert_en_1.0.0_3.0_1678050573214.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_base_salary_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
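As the description notes, several binary clause classifiers can be run over the same document and their outputs merged into one True/False flag per clause type. A minimal post-processing sketch, assuming each classifier has already produced a single label per document; the `merge_clause_flags` helper and the `clause_results` dictionary shape are hypothetical illustrations, not part of Spark NLP:

```python
# Hypothetical post-processing: each binary clause classifier emits either
# its clause name (e.g. "Base_Salary") or "Other" for a given paragraph.
# Merging the labels yields one boolean flag per clause type.

def merge_clause_flags(clause_results):
    """clause_results maps a clause name to the label its classifier emitted."""
    return {clause: label != "Other" for clause, label in clause_results.items()}

flags = merge_clause_flags({
    "Base_Salary": "Base_Salary",   # classifier fired
    "Confidentiality": "Other",     # classifier did not fire
})
print(flags)  # {'Base_Salary': True, 'Confidentiality': False}
```

In practice the labels would come from the `category` column of each classifier's output rather than a hand-written dictionary.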
## Results ```bash +-------------+ |result | +-------------+ |[Base_Salary]| |[Other]| |[Other]| |[Base_Salary]| +-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_base_salary_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Base_Salary 1.00 1.00 1.0 77 Other 1.00 1.00 1.0 108 accuracy - - 1.0 185 macro-avg 1.00 1.00 1.0 185 weighted-avg 1.00 1.00 1.0 185 ``` --- layout: model title: English image_classifier_vit_base_patch16_224_finetuned_eurosat ViTForImageClassification from rwang5688 author: John Snow Labs name: image_classifier_vit_base_patch16_224_finetuned_eurosat date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_patch16_224_finetuned_eurosat` is an English model originally trained by rwang5688. 
## Predicted Entities `Residential`, `AnnualCrop`, `Highway`, `Pasture`, `SeaLake`, `Industrial`, `HerbaceousVegetation`, `River`, `PermanentCrop`, `Forest` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_finetuned_eurosat_en_4.1.0_3.0_1660171928095.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_finetuned_eurosat_en_4.1.0_3.0_1660171928095.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_patch16_224_finetuned_eurosat", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_patch16_224_finetuned_eurosat", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_patch16_224_finetuned_eurosat| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: NER Pipeline for German author: John Snow Labs name: xlm_roberta_large_token_classifier_conll03_pipeline date: 2022-06-27 tags: [german, roberta, xlm, ner, conll03, de, open_source] task: Named Entity Recognition language: de edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [xlm_roberta_large_token_classifier_conll03_de](https://nlp.johnsnowlabs.com/2021/12/25/xlm_roberta_large_token_classifier_conll03_de.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_conll03_pipeline_de_4.0.0_3.0_1656373648807.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_conll03_pipeline_de_4.0.0_3.0_1656373648807.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("xlm_roberta_large_token_classifier_conll03_pipeline", lang = "de") pipeline.annotate("Ibser begann seine Karriere beim ASK Ebreichsdorf. 2004 wechselte er zu Admira Wacker Mödling, wo er auch in der Akademie spielte.") ``` ```scala val pipeline = new PretrainedPipeline("xlm_roberta_large_token_classifier_conll03_pipeline", lang = "de") pipeline.annotate("Ibser begann seine Karriere beim ASK Ebreichsdorf. 2004 wechselte er zu Admira Wacker Mödling, wo er auch in der Akademie spielte.") ```
## Results ```bash +----------------------+---------+ |chunk |ner_label| +----------------------+---------+ |Ibser |PER | |ASK Ebreichsdorf |ORG | |Admira Wacker Mödling |ORG | +----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_large_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.8 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - XlmRoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: English T5ForConditionalGeneration Base Cased model (from google) author: John Snow Labs name: t5_efficient_base_nh32 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-nh32` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nh32_en_4.3.0_3.0_1675113289211.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nh32_en_4.3.0_3.0_1675113289211.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_base_nh32","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_base_nh32","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_base_nh32| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|745.5 MB| ## References - https://huggingface.co/google/t5-efficient-base-nh32 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_2 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-512-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_2_en_4.0.0_3.0_1655732960556.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_2_en_4.0.0_3.0_1655732960556.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_512d_seed_2").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
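When a question-answering model returns several candidate spans, you typically keep the one with the highest confidence. A small post-processing sketch; the `candidates` structure below is a hypothetical, hand-written stand-in for annotation output (in Spark NLP the score usually sits in the annotation's metadata as a string), and `best_answer` is an illustrative helper, not a library function:

```python
# Hypothetical shape of candidate answers: the span text plus a
# confidence score stored as a string in the metadata dictionary.
candidates = [
    {"result": "Clara", "metadata": {"score": "0.93"}},
    {"result": "Clara and I", "metadata": {"score": "0.41"}},
]

def best_answer(annotations):
    # Scores arrive as strings, so cast to float before comparing.
    return max(annotations, key=lambda a: float(a["metadata"]["score"]))["result"]

print(best_answer(candidates))  # Clara
```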
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_512_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|432.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-512-finetuned-squad-seed-2 --- layout: model title: Fast Neural Machine Translation Model from Turkish to English author: John Snow Labs name: opus_mt_tr_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, tr, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `tr` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_tr_en_xx_2.7.0_2.4_1609169109312.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_tr_en_xx_2.7.0_2.4_1609169109312.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_tr_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_tr_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.tr.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_tr_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from English to Gun author: John Snow Labs name: opus_mt_en_guw date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, guw, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `guw` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_guw_xx_2.7.0_2.4_1609164685066.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_guw_xx_2.7.0_2.4_1609164685066.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_guw", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_guw", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.guw').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_guw| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English asr_wav2vec2_large_xlsr_53_arabic_by_logicbloke TFWav2Vec2ForCTC from logicbloke author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_arabic_by_logicbloke date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_arabic_by_logicbloke` is an English model originally trained by logicbloke. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_arabic_by_logicbloke_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_arabic_by_logicbloke_en_4.2.0_3.0_1664095691550.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_arabic_by_logicbloke_en_4.2.0_3.0_1664095691550.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_arabic_by_logicbloke', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_arabic_by_logicbloke", lang = "en") val annotations = pipeline.transform(audioDF) ```
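The `audioDF` used above is assumed to hold raw audio as arrays of normalized floats. A self-contained sketch of the kind of preprocessing that produces such floats from 16-bit PCM WAV data, using only the standard library; the in-memory sine-wave WAV is a stand-in for a real recording, and the exact column schema expected by the pipeline should be checked against the Spark NLP documentation:

```python
import io
import math
import struct
import wave

def wav_bytes_to_floats(wav_bytes):
    # Decode 16-bit PCM frames and scale each int16 sample into [-1.0, 1.0).
    with wave.open(io.BytesIO(wav_bytes)) as wav:
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a tiny 16 kHz mono WAV in memory (stand-in for a real file on disk).
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    pcm = [int(32000 * math.sin(2 * math.pi * 440 * t / 16000)) for t in range(160)]
    wav.writeframes(struct.pack("<%dh" % len(pcm), *pcm))

floats = wav_bytes_to_floats(buf.getvalue())
print(len(floats))  # 160
```

The resulting float list could then be placed in the audio column of a Spark DataFrame before calling `pipeline.transform`.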
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_arabic_by_logicbloke| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Pipeline to Detect PHI for Generic Deidentification in Romanian author: John Snow Labs name: ner_deid_generic_pipeline date: 2023-03-09 tags: [deidentification, word2vec, phi, generic, ner, ro, licensed] task: Named Entity Recognition language: ro edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_deid_generic](https://nlp.johnsnowlabs.com/2022/07/08/ner_deid_generic_ro_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_pipeline_ro_4.3.0_3.2_1678382243449.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_pipeline_ro_4.3.0_3.2_1678382243449.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_deid_generic_pipeline", "ro", "clinical/models") text = '''Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui,737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_deid_generic_pipeline", "ro", "clinical/models") val text = "Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui,737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume si Prenume : BUREAN MARIA, Varsta: 77 Medic : Agota Evelyn Tımar C.N.P : 2450502264401" val result = pipeline.fullAnnotate(text) ```
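Deidentification pipelines report each detected chunk with inclusive begin/end character offsets, which is enough to mask the PHI in the original text. A minimal sketch of that masking step; `mask_phi` is an illustrative helper (not a Spark NLP function), and the offsets below are typed out by hand to mimic pipeline output:

```python
def mask_phi(text, chunks):
    """Replace each (begin, end, label) span with a <LABEL> placeholder.

    Offsets are inclusive, matching Spark NLP-style annotations; spans are
    replaced right-to-left so earlier offsets stay valid as the text shrinks.
    """
    for begin, end, label in sorted(chunks, reverse=True):
        text = text[:begin] + "<" + label + ">" + text[end + 1:]
    return text

text = "Nume si Prenume : BUREAN MARIA, Varsta: 77"
chunks = [(18, 29, "NAME"), (40, 41, "AGE")]
print(mask_phi(text, chunks))  # Nume si Prenume : <NAME>, Varsta: <AGE>
```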
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-----------------------------|--------:|------:|:------------|-------------:| | 0 | Spitalul Pentru Ochi de Deal | 0 | 27 | LOCATION | 0.88326 | | 1 | Drumul Oprea Nr. 972 | 30 | 49 | LOCATION | 0.98642 | | 2 | Vaslui,737405 România | 51 | 71 | LOCATION | 0.8018 | | 3 | +40(235)413773 | 78 | 91 | CONTACT | 1 | | 4 | 25 May 2022 | 118 | 128 | DATE | 1 | | 5 | BUREAN MARIA | 157 | 168 | NAME | 0.99965 | | 6 | 77 | 179 | 180 | AGE | 1 | | 7 | Agota Evelyn Tımar | 190 | 207 | NAME | 0.832933 | | 8 | 2450502264401 | 217 | 229 | ID | 1 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_generic_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|ro| |Size:|1.2 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Detect Entities (66-labeled) in General Scope (Few-NERD dataset) author: John Snow Labs name: nerdl_fewnerd_subentity_100d date: 2021-07-22 tags: [ner, en, fewnerd, public, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.1.1 spark_version: 2.4 supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained on Few-NERD/inter public dataset and it extracts 66 entities that are in general scope. 
## Predicted Entities `building-theater`, `art-other`, `location-bodiesofwater`, `other-god`, `organization-politicalparty`, `product-other`, `building-sportsfacility`, `building-restaurant`, `organization-sportsleague`, `event-election`, `organization-media/newspaper`, `product-software`, `other-educationaldegree`, `person-politician`, `person-soldier`, `other-disease`, `product-airplane`, `person-athlete`, `location-mountain`, `organization-company`, `other-biologything`, `location-other`, `other-livingthing`, `person-actor`, `organization-other`, `event-protest`, `art-film`, `other-award`, `other-astronomything`, `building-airport`, `product-food`, `person-other`, `event-disaster`, `product-weapon`, `event-sportsevent`, `location-park`, `product-ship`, `building-library`, `art-painting`, `building-other`, `other-currency`, `organization-education`, `person-scholar`, `organization-showorganization`, `person-artist/author`, `product-train`, `location-GPE`, `product-car`, `art-writtenart`, `event-attack/battle/war/militaryconflict`, `other-law`, `other-medical`, `organization-sportsteam`, `art-broadcastprogram`, `art-music`, `organization-government/governmentagency`, `other-language`, `event-other`, `person-director`, `other-chemicalthing`, `product-game`, `organization-religion`, `location-road/railway/highway/transit`, `location-island`, `building-hotel`, `building-hospital` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_FEW_NERD/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_FewNERD.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/nerdl_fewnerd_subentity_100d_en_3.1.1_2.4_1626970707030.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/public/models/nerdl_fewnerd_subentity_100d_en_3.1.1_2.4_1626970707030.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en")\ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") ner = NerDLModel.pretrained("nerdl_fewnerd_subentity_100d") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(['document', 'token', 'ner']) \ .setOutputCol('ner_chunk') nlp_pipeline = Pipeline(stages=[document_assembler, sentencer, tokenizer, embeddings, ner, ner_converter]) l_model = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = l_model.fullAnnotate("""12 Corazones ('12 Hearts') is Spanish-language dating game show produced in the United States for the television network Telemundo since January 2005, based on its namesake Argentine TV show format. The show is filmed in Los Angeles and revolves around the twelve Zodiac signs that identify each contestant. In 2008, Ho filmed a cameo in the Steven Spielberg feature film The Cloverfield Paradox, as a news pundit.""") ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("nerdl_fewnerd_subentity_100d") .setInputCols(Array("sentence", "token", "embeddings")).setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, embeddings, ner, ner_converter)) val data = Seq("12 Corazones ('12 Hearts') is Spanish-language dating game show produced in the United States for the television network Telemundo since January 2005, based on its namesake Argentine TV show format. The show is filmed in Los Angeles and revolves around the twelve Zodiac signs that identify each contestant. 
In 2008, Ho filmed a cameo in the Steven Spielberg feature film The Cloverfield Paradox, as a news pundit.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.ner.fewnerd_subentity").predict("""12 Corazones ('12 Hearts') is Spanish-language dating game show produced in the United States for the television network Telemundo since January 2005, based on its namesake Argentine TV show format. The show is filmed in Los Angeles and revolves around the twelve Zodiac signs that identify each contestant. In 2008, Ho filmed a cameo in the Steven Spielberg feature film The Cloverfield Paradox, as a news pundit.""") ```
## Results ```bash +-----------------------+----------------------------+ |chunk |ner_label | +-----------------------+----------------------------+ |Corazones ('12 Hearts')|art-broadcastprogram | |Spanish-language |other-language | |United States |location-GPE | |Telemundo |organization-media/newspaper| |Argentine TV |organization-media/newspaper| |Los Angeles |location-GPE | |Steven Spielberg |person-director | |Cloverfield Paradox |art-film | +-----------------------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|nerdl_fewnerd_subentity_100d| |Type:|ner| |Compatibility:|Spark NLP 3.1.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Few-NERD: A Few-shot Named Entity Recognition Dataset. Ding, Ning; Xu, Guangwei; Chen, Yulin; Wang, Xiaobin; Han, Xu; Xie, Pengjun; Zheng, Hai-Tao; Liu, Zhiyuan. ACL-IJCNLP, 2021.
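Each row in the benchmarking table below derives precision, recall, and F1 from the raw tp/fp/fn counts. A minimal sketch of those formulas, checked against the `film` row's counts:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# The `film` row of the benchmarking table: tp=1589, fp=725, fn=810
p, r, f1 = prf(1589.0, 725.0, 810.0)
# rounds to p ≈ 0.6867, r ≈ 0.6624, f1 ≈ 0.6743, matching the table
```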
## Benchmarking ```bash +--------------------+-------+------+-------+-------+---------+------+------+ | entity| tp| fp| fn| total|precision|recall| f1| +--------------------+-------+------+-------+-------+---------+------+------+ | disaster| 309.0| 114.0| 287.0| 596.0| 0.7305|0.5185|0.6065| | film| 1589.0| 725.0| 810.0| 2399.0| 0.6867|0.6624|0.6743| | mountain| 851.0| 175.0| 431.0| 1282.0| 0.8294|0.6638|0.7374| | currency| 280.0| 66.0| 189.0| 469.0| 0.8092| 0.597|0.6871| | scholar| 31.0| 12.0| 413.0| 444.0| 0.7209|0.0698|0.1273| | island| 829.0| 165.0| 372.0| 1201.0| 0.834|0.6903|0.7554| | politicalparty| 242.0| 86.0| 283.0| 525.0| 0.7378| 0.461|0.5674| | ship| 461.0| 207.0| 311.0| 772.0| 0.6901|0.5972|0.6403| | award| 2234.0| 279.0| 1245.0| 3479.0| 0.889|0.6421|0.7457| | showorganization| 120.0| 201.0| 273.0| 393.0| 0.3738|0.3053|0.3361| | religion| 218.0| 117.0| 415.0| 633.0| 0.6507|0.3444|0.4504| | education| 5788.0| 852.0| 1001.0| 6789.0| 0.8717|0.8526| 0.862| | park| 259.0| 295.0| 176.0| 435.0| 0.4675|0.5954|0.5238| | painting| 0.0| 0.0| 14.0| 14.0| 0.0| 0.0| 0.0| | hotel| 570.0| 150.0| 254.0| 824.0| 0.7917|0.6917|0.7383| | library| 218.0| 92.0| 134.0| 352.0| 0.7032|0.6193|0.6586| | livingthing| 576.0| 280.0| 312.0| 888.0| 0.6729|0.6486|0.6606| | educationaldegree| 189.0| 31.0| 47.0| 236.0| 0.8591|0.8008|0.8289| | director| 673.0| 227.0| 507.0| 1180.0| 0.7478|0.5703|0.6471| | food| 474.0| 375.0| 341.0| 815.0| 0.5583|0.5816|0.5697| | athlete| 1181.0| 529.0| 540.0| 1721.0| 0.6906|0.6862|0.6884| | software| 922.0| 460.0| 493.0| 1415.0| 0.6671|0.6516|0.6593| | protest| 162.0| 212.0| 275.0| 437.0| 0.4332|0.3707|0.3995| | other|12555.0|7510.0|14369.0|26924.0| 0.6257|0.4663|0.5344| | sportsleague| 1439.0| 654.0| 842.0| 2281.0| 0.6875|0.6309| 0.658| | airplane| 1295.0| 442.0| 463.0| 1758.0| 0.7455|0.7366|0.7411| | train| 135.0| 111.0| 198.0| 333.0| 0.5488|0.4054|0.4663| | biologything| 1574.0| 625.0| 924.0| 2498.0| 0.7158|0.6301|0.6702| | politician| 3107.0|1545.0| 
1688.0| 4795.0| 0.6679| 0.648|0.6578| | music| 419.0| 211.0| 182.0| 601.0| 0.6651|0.6972|0.6807| |government/govern...| 564.0| 656.0| 511.0| 1075.0| 0.4623|0.5247|0.4915| | media/newspaper| 1600.0|1072.0| 893.0| 2493.0| 0.5988|0.6418|0.6196| | actor| 674.0| 161.0| 274.0| 948.0| 0.8072| 0.711| 0.756| | language| 698.0| 226.0| 335.0| 1033.0| 0.7554|0.6757|0.7133| | chemicalthing| 592.0| 231.0| 687.0| 1279.0| 0.7193|0.4629|0.5633| | sportsfacility| 870.0| 334.0| 291.0| 1161.0| 0.7226|0.7494|0.7357| | hospital| 226.0| 472.0| 49.0| 275.0| 0.3238|0.8218|0.4645| | writtenart| 297.0| 203.0| 450.0| 747.0| 0.594|0.3976|0.4763| |road/railway/high...| 3238.0| 926.0| 1063.0| 4301.0| 0.7776|0.7528| 0.765| | election| 13.0| 13.0| 127.0| 140.0| 0.5|0.0929|0.1566| | soldier| 623.0| 537.0| 559.0| 1182.0| 0.5371|0.5271| 0.532| | god| 332.0| 157.0| 414.0| 746.0| 0.6789| 0.445|0.5377| | astronomything| 1120.0| 353.0| 232.0| 1352.0| 0.7604|0.8284|0.7929| |attack/battle/war...| 2516.0| 444.0| 590.0| 3106.0| 0.85| 0.81|0.8295| | broadcastprogram| 1056.0| 762.0| 811.0| 1867.0| 0.5809|0.5656|0.5731| | airport| 857.0| 96.0| 112.0| 969.0| 0.8993|0.8844|0.8918| | theater| 72.0| 31.0| 119.0| 191.0| 0.699| 0.377|0.4898| | weapon| 303.0| 190.0| 237.0| 540.0| 0.6146|0.5611|0.5866| | company| 5849.0|2632.0| 2570.0| 8419.0| 0.6897|0.6947|0.6922| | car| 413.0| 293.0| 207.0| 620.0| 0.585|0.6661|0.6229| | artist/author| 4172.0|1953.0| 1777.0| 5949.0| 0.6811|0.7013|0.6911| | medical| 94.0| 112.0| 192.0| 286.0| 0.4563|0.3287|0.3821| | disease| 1009.0| 476.0| 447.0| 1456.0| 0.6795| 0.693|0.6862| | game| 141.0| 120.0| 264.0| 405.0| 0.5402|0.3481|0.4234| | sportsevent| 1042.0| 553.0| 552.0| 1594.0| 0.6533|0.6537|0.6535| | sportsteam| 3657.0|1133.0| 1301.0| 4958.0| 0.7635|0.7376|0.7503| | restaurant| 285.0| 444.0| 201.0| 486.0| 0.3909|0.5864|0.4691| | bodiesofwater| 314.0| 91.0| 343.0| 657.0| 0.7753|0.4779|0.5913| | law| 1626.0| 583.0| 329.0| 1955.0| 0.7361|0.8317| 0.781| | GPE|22173.0|5585.0| 
3839.0|26012.0| 0.7988|0.8524|0.8247| +--------------------+-------+------+-------+-------+---------+------+------+ +-----------------+ | macro| +-----------------+ |0.608599546406531| +-----------------+ +-----------------+ | micro| +-----------------+ |0.684720504256685| +-----------------+ ``` --- layout: model title: Bulgarian RobertaForMaskedLM Small Cased model (from iarfmoose) author: John Snow Labs name: roberta_embeddings_small_bulgarian date: 2022-12-12 tags: [bg, open_source, roberta_embeddings, robertaformaskedlm] task: Embeddings language: bg edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-small-bulgarian` is a Bulgarian model originally trained by `iarfmoose`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_small_bulgarian_bg_4.2.4_3.0_1670859644104.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_small_bulgarian_bg_4.2.4_3.0_1670859644104.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_small_bulgarian","bg") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_loaded = RoBertaEmbeddings.pretrained("roberta_embeddings_small_bulgarian","bg") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
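The `embeddings` output column holds one vector per token; a common downstream use is comparing such vectors with cosine similarity. A minimal pure-Python sketch — the 3-dimensional vectors are made up for illustration, the real model produces much larger ones:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

v1 = [0.2, 0.1, 0.4]   # made-up token embedding
v2 = [0.2, 0.1, 0.4]   # identical direction -> similarity ≈ 1.0
v3 = [-0.4, 0.3, 0.0]  # unrelated direction -> much lower similarity
sim_same = cosine(v1, v2)
sim_diff = cosine(v1, v3)
```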
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_small_bulgarian| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|bg| |Size:|314.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/iarfmoose/roberta-small-bulgarian - https://arxiv.org/abs/1907.11692 - https://oscar-corpus.com/ - https://wortschatz.uni-leipzig.de/en/download/bulgarian --- layout: model title: English image_classifier_vit_vision_transformers_spain_or_italy_fan ViTForImageClassification from jeffboudier author: John Snow Labs name: image_classifier_vit_vision_transformers_spain_or_italy_fan date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_vision_transformers_spain_or_italy_fan` is an English model originally trained by jeffboudier. ## Predicted Entities `italy soccer fan`, `spain soccer fan` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vision_transformers_spain_or_italy_fan_en_4.1.0_3.0_1660166700234.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vision_transformers_spain_or_italy_fan_en_4.1.0_3.0_1660166700234.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_vision_transformers_spain_or_italy_fan", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_vision_transformers_spain_or_italy_fan", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
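The classifier emits one raw score per label (`italy soccer fan`, `spain soccer fan`) and picks the highest. A minimal sketch of turning raw logits into probabilities with softmax and selecting the prediction — the logit values below are made up for illustration:

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["italy soccer fan", "spain soccer fan"]
probs = softmax([2.0, 0.5])                       # made-up logits
prediction = labels[probs.index(max(probs))]
```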
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_vision_transformers_spain_or_italy_fan| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: ICD10CM Entity Resolver author: John Snow Labs name: chunkresolve_icd10cm_clinical date: 2021-04-02 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Entity resolution model based on KNN over word embeddings, using Word Mover's Distance. ## Predicted Entities ICD10-CM codes and their normalized definitions with `clinical_embeddings`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_clinical_en_3.0.0_3.0_1617355382919.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_clinical_en_3.0.0_3.0_1617355382919.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... icd10cm_resolution = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_clinical", "en", "clinical/models") \ .setInputCols(["ner_token", "chunk_embeddings"]) \ .setOutputCol("icd10cm_code") \ .setDistanceFunction("COSINE") \ .setNeighbours(5) pipeline_icd10cm = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, chunk_tokenizer, icd10cm_resolution]) data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG."""]]).toDF("text") pipeline_model = pipeline_icd10cm.fit(data) result = pipeline_model.transform(data) ``` ```scala ... 
val icd10cm_resolution = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_clinical", "en", "clinical/models") .setInputCols("ner_token", "chunk_embeddings") .setOutputCol("icd10cm_code") .setDistanceFunction("COSINE") .setNeighbours(5) val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, chunk_tokenizer, icd10cm_resolution)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG.").toDF("text") val result = pipeline.fit(data).transform(data) ```
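Conceptually, the resolver embeds each chunk and returns the candidate codes whose embeddings lie nearest under the configured distance (`setDistanceFunction("COSINE")`, `setNeighbours(5)`). A simplified pure-Python sketch of that k-nearest-neighbour lookup — the 3-dimensional embeddings for a few ICD-10-CM codes below are made up for illustration:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: smaller means closer."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def resolve(chunk_vec, candidates, k=5):
    """Return the k nearest (code, distance) pairs, closest first."""
    scored = [(code, cosine_distance(chunk_vec, vec))
              for code, vec in candidates.items()]
    return sorted(scored, key=lambda t: t[1])[:k]

# Made-up embeddings for three ICD-10-CM codes from the Results table
candidates = {
    "E1121": [0.9, 0.1, 0.0],
    "R631":  [0.1, 0.9, 0.2],
    "R81":   [0.0, 0.2, 0.9],
}
best = resolve([0.8, 0.2, 0.1], candidates, k=1)[0][0]
```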
## Results ```bash | | chunk | entity | resolved_text | code | cms | |---|-----------------------------|-----------|----------------------------------------------------|--------|---------------------------------------------------| | 0 | T2DM), | PROBLEM | Type 2 diabetes mellitus with diabetic nephrop... | E1121 | Type 2 diabetes mellitus with diabetic nephrop... | | 1 | T2DM | PROBLEM | Type 2 diabetes mellitus with diabetic nephrop... | E1121 | Type 2 diabetes mellitus with diabetic nephrop... | | 2 | polydipsia | PROBLEM | Polydipsia | R631 | Polydipsia:::Anhedonia:::Galactorrhea | | 3 | interference from turbidity | PROBLEM | Non-working side interference | M2656 | Non-working side interference:::Hemoglobinuria... | | 4 | polyuria | PROBLEM | Other polyuria | R358 | Other polyuria:::Polydipsia:::Generalized edem... | | 5 | lipemia | PROBLEM | Glycosuria | R81 | Glycosuria:::Pure hyperglyceridemia:::Hyperchy... | | 6 | starvation ketosis | PROBLEM | Propionic acidemia | E71121 | Propionic acidemia:::Bartter's syndrome:::Hypo... | | 7 | HTG | PROBLEM | Pure hyperglyceridemia | E781 | Pure hyperglyceridemia:::Familial hypercholest... 
| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|chunkresolve_icd10cm_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token, chunk_embeddings]| |Output Labels:|[icd10cm]| |Language:|en| --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from jasonyim2) author: John Snow Labs name: xlmroberta_ner_jasonyim2_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `jasonyim2`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jasonyim2_base_finetuned_panx_de_4.1.0_3.0_1660434335874.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jasonyim2_base_finetuned_panx_de_4.1.0_3.0_1660434335874.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jasonyim2_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jasonyim2_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
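The `NerConverter` stage assembles the token-level B-/I-/O tags in the `ner` column into entity chunks. A simplified sketch of that grouping logic (not the actual Spark NLP implementation), using a made-up German example with the model's `PER`/`LOC`/`ORG` labels:

```python
def bio_to_chunks(tokens, tags):
    """Group B-X/I-X token tags into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" or a stray I- tag: close any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Angela", "Merkel", "besuchte", "Berlin"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
chunks = bio_to_chunks(tokens, tags)
# -> [("Angela Merkel", "PER"), ("Berlin", "LOC")]
```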
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_jasonyim2_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/jasonyim2/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: French CamemBert Embeddings (from Ebtihal) author: John Snow Labs name: camembert_embeddings_ArBERTMo date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ArBERTMo` is a French model originally trained by `Ebtihal`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_ArBERTMo_fr_3.4.4_3.0_1653985627290.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_ArBERTMo_fr_3.4.4_3.0_1653985627290.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_ArBERTMo","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_ArBERTMo","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_ArBERTMo| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/Ebtihal/ArBERTMo --- layout: model title: Named Entity Recognition - BERT Mini (OntoNotes) author: John Snow Labs name: onto_small_bert_L4_256 date: 2020-12-05 task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [ner, open_source, en] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Onto is a Named Entity Recognition (or NER) model trained on OntoNotes 5.0. It can extract up to 18 entities such as people, places, organizations, money, time, date, etc. This model uses the pretrained `small_bert_L4_256` embeddings model from the `BertEmbeddings` annotator as an input. ## Predicted Entities `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_small_bert_L4_256_en_2.7.0_2.4_1607199231735.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_small_bert_L4_256_en_2.7.0_2.4_1607199231735.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... ner_onto = NerDLModel.pretrained("onto_small_bert_L4_256", "en") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."]], ["text"])) ``` ```scala ... val ner_onto = NerDLModel.pretrained("onto_small_bert_L4_256", "en") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_onto, ner_converter)) val data = Seq("William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."""] ner_df = nlu.load('en.ner.onto.bert.small_l4_256').predict(text, output_level='chunk') ner_df[["entities", "entities_class"]] ```
{:.h2_title} ## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |William Henry Gates III|PERSON | |October 28, 1955 |DATE | |American |NORP | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PERSON | |May 2014 |DATE | |one |CARDINAL | |1970s and 1980s |DATE | |Seattle |GPE | |Washington |GPE | |Gates |PERSON | |Paul Allen |PERSON | |1975 |DATE | |Albuquerque |GPE | |New Mexico |GPE | |Gates |ORG | |January 2000 |DATE | |the late 1990s |DATE | |Gates |PERSON | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_small_bert_L4_256| |Type:|ner| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source The model is trained based on data from [OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) ## Benchmarking ```bash Micro-average: prec: 0.8617996, rec: 0.85458803, f1: 0.8581787 CoNLL Eval: processed 152728 tokens with 11257 phrases; found: 11191 phrases; correct: 9476. 
accuracy: 97.05%; 9476 11257 11191 precision: 84.68%; recall: 84.18%; FB1: 84.43 CARDINAL: 771 935 934 precision: 82.55%; recall: 82.46%; FB1: 82.50 934 DATE: 1383 1602 1645 precision: 84.07%; recall: 86.33%; FB1: 85.19 1645 EVENT: 29 63 49 precision: 59.18%; recall: 46.03%; FB1: 51.79 49 FAC: 65 135 100 precision: 65.00%; recall: 48.15%; FB1: 55.32 100 GPE: 2054 2240 2211 precision: 92.90%; recall: 91.70%; FB1: 92.29 2211 LANGUAGE: 10 22 13 precision: 76.92%; recall: 45.45%; FB1: 57.14 13 LAW: 11 40 22 precision: 50.00%; recall: 27.50%; FB1: 35.48 22 LOC: 112 179 186 precision: 60.22%; recall: 62.57%; FB1: 61.37 186 MONEY: 272 314 317 precision: 85.80%; recall: 86.62%; FB1: 86.21 317 NORP: 781 841 856 precision: 91.24%; recall: 92.87%; FB1: 92.04 856 ORDINAL: 172 195 228 precision: 75.44%; recall: 88.21%; FB1: 81.32 228 ORG: 1383 1795 1749 precision: 79.07%; recall: 77.05%; FB1: 78.05 1749 PERCENT: 311 349 346 precision: 89.88%; recall: 89.11%; FB1: 89.50 346 PERSON: 1809 1988 2048 precision: 88.33%; recall: 91.00%; FB1: 89.64 2048 PRODUCT: 34 76 50 precision: 68.00%; recall: 44.74%; FB1: 53.97 50 QUANTITY: 83 105 106 precision: 78.30%; recall: 79.05%; FB1: 78.67 106 TIME: 138 212 228 precision: 60.53%; recall: 65.09%; FB1: 62.73 228 WORK_OF_ART: 58 166 103 precision: 56.31%; recall: 34.94%; FB1: 43.12 103 ``` --- layout: model title: Fast Neural Machine Translation Model from Tsonga to English author: John Snow Labs name: opus_mt_ts_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ts, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. 
Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `ts` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ts_en_xx_2.7.0_2.4_1609163263407.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ts_en_xx_2.7.0_2.4_1609163263407.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ts_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your text to translate.") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ts_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your text to translate.").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ts.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ts_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Purpose Clause Binary Classifier author: John Snow Labs name: legclf_purpose_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `purpose` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added.
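The "paragraph splitting (by multiline)" technique mentioned above can be sketched without any Spark NLP dependency. This is a plain-Python illustration of the idea only; the `split_paragraphs` helper is hypothetical and not part of the library:

```python
import re

def split_paragraphs(text: str) -> list:
    # Blank lines (two or more consecutive newlines) mark paragraph
    # boundaries in most plain-text legal documents.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = (
    "PURPOSE.\nThe purpose of this Agreement is to set forth the terms.\n\n"
    "GOVERNING LAW.\nThis Agreement shall be governed by the laws of Delaware."
)
paragraphs = split_paragraphs(doc)
print(len(paragraphs))  # 2: each paragraph can now be classified separately
```

Each resulting paragraph can then be loaded as a separate row of the input DataFrame, so the classifier sees one candidate clause at a time.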
## Predicted Entities `other`, `purpose` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_purpose_clause_en_1.0.0_3.2_1660123878905.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_purpose_clause_en_1.0.0_3.2_1660123878905.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_purpose_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[purpose]| |[other]| |[other]| |[purpose]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_purpose_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.1 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.94 0.98 0.96 93 purpose 0.94 0.84 0.89 37 accuracy - - 0.94 130 macro-avg 0.94 0.91 0.92 130 weighted-avg 0.94 0.94 0.94 130 ``` --- layout: model title: French CamemBert Embeddings (from osanseviero) author: John Snow Labs name: camembert_embeddings_generic_model_test date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model-test` is a French model originally trained by `osanseviero`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_generic_model_test_fr_3.4.4_3.0_1653991144897.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_generic_model_test_fr_3.4.4_3.0_1653991144897.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_generic_model_test","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_generic_model_test","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_generic_model_test| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/osanseviero/dummy-model-test --- layout: model title: Voice of the Patients (embeddings_clinical_medium) author: John Snow Labs name: ner_vop_wip_emb_clinical_medium date: 2023-05-19 tags: [licensed, clinical, en, ner, vop, patient] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts healthcare-related terms from documents written in the patient’s own sentences. Note: the ‘wip’ suffix indicates that the model is a work in progress; it will be finalised and its performance will be improved in upcoming releases.
## Predicted Entities `Allergen`, `RaceEthnicity`, `SubstanceQuantity`, `InjuryOrPoisoning`, `Treatment`, `Modifier`, `HealthStatus`, `MedicalDevice`, `TestResult`, `RelationshipStatus`, `Route`, `Measurements`, `AdmissionDischarge`, `Frequency`, `Procedure`, `Symptom`, `Substance`, `Duration`, `ClinicalDept`, `Dosage`, `Disease`, `Vaccine`, `Laterality`, `DateTime`, `Drug`, `Test`, `PsychologicalCondition`, `VitalTest`, `Form`, `Age`, `BodyPart`, `Employment`, `Gender` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_wip_emb_clinical_medium_en_4.4.2_3.0_1684509750529.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_wip_emb_clinical_medium_en_4.4.2_3.0_1684509750529.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_wip_emb_clinical_medium", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. 
I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_wip_emb_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . 
Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:---------------------|:-----------------------| | 20 year old | Age | | girl | Gender | | hyperthyroid | Disease | | 1 month ago | DateTime | | weak | Symptom | | light | Symptom | | panic attacks | PsychologicalCondition | | depression | PsychologicalCondition | | left | Laterality | | chest | BodyPart | | pain | Symptom | | increased | TestResult | | heart rate | VitalTest | | rapidly | Modifier | | weight loss | Symptom | | 4 months | Duration | | hospital | ClinicalDept | | discharged | AdmissionDischarge | | hospital | ClinicalDept | | blood tests | Test | | brain | BodyPart | | mri | Test | | ultrasound scan | Test | | endoscopy | Procedure | | doctors | Employment | | homeopathy doctor | Employment | | he | Gender | | hyperthyroid | Disease | | TSH | Test | | 0.15 | TestResult | | T3 | Test | | T4 | Test | | normal | TestResult | | b12 deficiency | Disease | | vitamin D deficiency | Disease | | weekly | Frequency | | supplement | Drug | | vitamin D | Drug | | 1000 mcg | Dosage | | b12 | Drug | | daily | Frequency | | homeopathy medicine | Treatment | | 40 days | Duration | | after 30 days | DateTime | | TSH | Test | | 0.5 | TestResult | | now | DateTime | | weakness | Symptom | | depression | PsychologicalCondition | | last week | DateTime | | rapid | TestResult | | heartrate | VitalTest | | allopathy medicine | Treatment | | homeopathy | Treatment | | thyroid | BodyPart | | allopathy | Treatment | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_wip_emb_clinical_medium| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.9 MB| |Dependencies:|embeddings_clinical_medium| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. 
I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
## Benchmarking ```bash label tp fp fn total precision recall f1 Gender 1296 22 21 1317 0.98 0.98 0.98 Employment 1171 43 72 1243 0.96 0.94 0.95 Vaccine 40 4 2 42 0.91 0.95 0.93 Age 536 38 46 582 0.93 0.92 0.93 PsychologicalCondition 413 35 31 444 0.92 0.93 0.93 BodyPart 2729 218 171 2900 0.93 0.94 0.93 AdmissionDischarge 29 0 5 34 1.00 0.85 0.92 Form 249 33 17 266 0.88 0.94 0.91 Drug 1298 127 142 1440 0.91 0.90 0.91 Substance 388 50 33 421 0.89 0.92 0.90 Laterality 545 58 83 628 0.90 0.87 0.89 ClinicalDept 277 24 49 326 0.92 0.85 0.88 Test 1041 121 167 1208 0.90 0.86 0.88 DateTime 4120 802 282 4402 0.84 0.94 0.88 Route 39 3 9 48 0.93 0.81 0.87 Disease 1759 338 256 2015 0.84 0.87 0.86 VitalTest 140 16 32 172 0.90 0.81 0.85 Dosage 329 42 83 412 0.89 0.80 0.84 Duration 1841 230 469 2310 0.89 0.80 0.84 Frequency 885 170 194 1079 0.84 0.82 0.83 Symptom 3771 764 804 4575 0.83 0.82 0.83 Allergen 34 2 12 46 0.94 0.74 0.83 RelationshipStatus 18 2 6 24 0.90 0.75 0.82 Procedure 555 108 150 705 0.84 0.79 0.81 HealthStatus 81 25 26 107 0.76 0.76 0.76 Modifier 799 184 340 1139 0.81 0.70 0.75 SubstanceQuantity 57 13 28 85 0.81 0.67 0.74 Treatment 135 18 93 228 0.88 0.59 0.71 InjuryOrPoisoning 117 36 59 176 0.76 0.66 0.71 TestResult 325 72 199 524 0.82 0.62 0.71 MedicalDevice 229 83 103 332 0.73 0.69 0.71 macro_avg 25246 3681 3984 29230 0.88 0.82 0.85 micro_avg 25246 3681 3984 29230 0.87 0.86 0.87 ``` --- layout: model title: Generic NER on Financial Texts author: John Snow Labs name: finner_sec_conll date: 2022-08-31 tags: [en, financial, ner, sec, licensed] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects Organizations (ORG), People (PER) and Locations (LOC) in financial texts. Was trained using manual annotations, coNll2003 and financial documents obtained from U.S. 
Securities and Exchange Commission (SEC) filings. Financial documents may be long, exceeding the limits of most standard Deep Learning and Transformer architectures. Please consider aggressive sentence-splitting mechanisms to break such documents into smaller chunks. ## Predicted Entities `ORG`, `LOC`, `PER` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/NER_SEC/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_sec_conll_en_1.0.0_3.2_1661944821177.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_sec_conll_en_1.0.0_3.2_1661944821177.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") # Consider using nlp.SentenceDetector with rules/patterns to get smaller chunks from long sentences sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_legal_bert_base_uncased","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") jsl_ner = finance.NerModel.pretrained("finner_sec_conll", "en", "finance/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("jsl_ner") jsl_ner_converter = nlp.NerConverter() \ .setInputCols(["sentence", "token", "jsl_ner"]) \ .setOutputCol("ner_chunk") jsl_ner_pipeline = nlp.Pipeline().setStages([ documentAssembler, sentence_detector, tokenizer, embeddings, jsl_ner, jsl_ner_converter]) text = """December 2007 SUBORDINATED LOAN AGREEMENT. THIS LOAN AGREEMENT is made on 7th December, 2007 BETWEEN: (1) SILICIUM DE PROVENCE S.A.S., a private company with limited liability, incorporated under the laws of France, whose registered office is situated at Usine de Saint Auban, France, represented by Mr.Frank Wouters, hereinafter referred to as the "Borrower", and ( 2 ) EVERGREEN SOLAR INC., a company incorporated in Delaware, U.S.A., with registered number 2426798, whose registered office is situated at Bartlett Street, Marlboro, Massachusetts, U.S.A. represented by Richard Chleboski, hereinafter referred to as "Lender".""" df = spark.createDataFrame([[text]]).toDF("text") model = jsl_ner_pipeline.fit(df) res = model.transform(df) ```
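The aggressive splitting recommended for long filings can be sketched in plain Python, independently of the Spark NLP pipeline above. The `chunk_sentences` helper below is hypothetical, shown only to illustrate pre-chunking a document before it is loaded into the DataFrame:

```python
import re

def chunk_sentences(text: str, max_chars: int = 300) -> list:
    # Split on sentence-ending punctuation followed by whitespace,
    # then greedily pack sentences into chunks of at most max_chars.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

filing = ("THIS LOAN AGREEMENT is made on 7th December, 2007. "
          "The Borrower is incorporated under the laws of France. "
          "The Lender is incorporated in Delaware, U.S.A.")
print(chunk_sentences(filing, max_chars=120))
```

Each chunk can then become one row of the input DataFrame, keeping every sentence well within the model's context limits.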
## Results ```bash +------------------------------------------------+-----+ |ner_chunk |label| +------------------------------------------------+-----+ |SILICIUM DE PROVENCE S.A.S |ORG | |France |LOC | |Usine de Saint Auban |LOC | |France |LOC | |Mr.Frank Wouters |PER | |Borrower |PER | |EVERGREEN SOLAR INC |ORG | |Delaware |LOC | |U.S.A |LOC | |Bartlett Street |LOC | |Marlboro |LOC | |Massachusetts |LOC | |U.S.A |LOC | |Richard Chleboski |PER | |Lender |PER | +------------------------------------------------+-----+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_sec_conll| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References Manual annotations, CoNLL-2003, and financial documents obtained from U.S. Securities and Exchange Commission (SEC) filings. ## Benchmarking ```bash label tp fp fn prec rec f1 B-LOC 14 6 11 0.7 0.56 0.6222222 I-ORG 59 30 3 0.66292137 0.9516129 0.781457 I-LOC 32 2 22 0.9411765 0.5925926 0.7272727 I-PER 18 4 5 0.8181818 0.7826087 0.8 B-ORG 47 17 5 0.734375 0.90384614 0.8103449 B-PER 211 7 2 0.9678899 0.9906103 0.97911835 Macro-average 381 66 48 0.80409074 0.7968784 0.8004684 Micro-average 381 66 48 0.852349 0.8881119 0.869863 ``` --- layout: model title: Legal Note Purchase Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_note_purchase_agreement_bert date: 2022-11-25 tags: [en, legal, classification, agreement, note_purchase, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_note_purchase_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to
the class `note-purchase-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `note-purchase-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_note_purchase_agreement_bert_en_1.0.0_3.0_1669368416923.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_note_purchase_agreement_bert_en_1.0.0_3.0_1669368416923.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_note_purchase_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[note-purchase-agreement]| |[other]| |[other]| |[note-purchase-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_note_purchase_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support note-purchase-agreement 0.92 0.79 0.85 28 other 0.87 0.95 0.91 41 accuracy - - 0.88 69 macro-avg 0.89 0.87 0.88 69 weighted-avg 0.89 0.88 0.88 69 ``` --- layout: model title: Detect clinical concepts (jsl_ner_wip_modifier_clinical) author: John Snow Labs name: jsl_ner_wip_modifier_clinical date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect modifiers and other clinical entities using a pretrained NER model.
## Predicted Entities `Kidney_Disease`, `Height`, `Family_History_Header`, `RelativeTime`, `Hypertension`, `HDL`, `Alcohol`, `Test`, `Substance`, `Fetus_NewBorn`, `Diet`, `Substance_Quantity`, `Allergen`, `Form`, `Birth_Entity`, `Age`, `Race_Ethnicity`, `Modifier`, `Internal_organ_or_component`, `Hyperlipidemia`, `ImagingFindings`, `Psychological_Condition`, `Triglycerides`, `Cerebrovascular_Disease`, `Obesity`, `Duration`, `Weight`, `Date`, `Test_Result`, `Strength`, `VS_Finding`, `Respiration`, `Social_History_Header`, `Employment`, `Injury_or_Poisoning`, `Medical_History_Header`, `Death_Entity`, `Relationship_Status`, `Oxygen_Therapy`, `Blood_Pressure`, `Gender`, `Section_Header`, `Oncological`, `Drug`, `Labour_Delivery`, `Heart_Disease`, `LDL`, `Medical_Device`, `Temperature`, `Treatment`, `Female_Reproductive_Status`, `Total_Cholesterol`, `Time`, `Disease_Syndrome_Disorder`, `Communicable_Disease`, `EKG_Findings`, `Diabetes`, `Route`, `External_body_part_or_region`, `Pulse`, `Vital_Signs_Header`, `Direction`, `Admission_Discharge`, `Overweight`, `RelativeDate`, `O2_Saturation`, `BMI`, `Vaccine`, `Pregnancy`, `Sexually_Active_or_Sexual_Orientation`, `Procedure`, `Frequency`, `Dosage`, `Symptom`, `Clinical_Dept`, `Smoking` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_modifier_clinical_en_3.0.0_3.0_1617260799422.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_modifier_clinical_en_3.0.0_3.0_1617260799422.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") jsl_ner = MedicalNerModel.pretrained("jsl_ner_wip_modifier_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("jsl_ner") jsl_ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "jsl_ner"]) \ .setOutputCol("ner_chunk") jsl_ner_pipeline = Pipeline().setStages([ documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter]) jsl_ner_model = jsl_ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature."]]).toDF("text") result = jsl_ner_model.transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val jsl_ner = MedicalNerModel.pretrained("jsl_ner_wip_modifier_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("jsl_ner") val jsl_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "jsl_ner")) .setOutputCol("ner_chunk") val jsl_ner_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter)) val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text") val result = jsl_ner_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl.wip.clinical.modifier").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
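The `NerConverter` stage in the pipeline above turns the token-level IOB tags (`B-…`/`I-…`) emitted by the NER model into the labeled chunks shown in the Results table below. A minimal pure-Python sketch of that grouping logic — illustrative only; the actual annotator also tracks character offsets and metadata:

```python
def merge_bio(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                      # a new chunk closes any open one
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)            # continue the open chunk
        else:                                # "O" (or a stray I-) closes it
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["The", "patient", "is", "a", "21-day-old", "Caucasian", "male"]
tags = ["O", "O", "O", "O", "B-Age", "B-Race_Ethnicity", "B-Gender"]
print(merge_bio(tokens, tags))
# → [('21-day-old', 'Age'), ('Caucasian', 'Race_Ethnicity'), ('male', 'Gender')]
```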
## Results ```bash +----------------------------------------------+----------------------------+ |chunk |ner_label | +----------------------------------------------+----------------------------+ |21-day-old |Age | |Caucasian |Race_Ethnicity | |male |Gender | |for 2 days |Duration | |congestion |Symptom | |mom |Gender | |suctioning yellow discharge |Symptom | |nares |External_body_part_or_region| |she |Gender | |mild problems with his breathing while feeding|Symptom | |perioral cyanosis |Symptom | |retractions |Symptom | |One day ago |RelativeDate | |mom |Gender | |tactile temperature |Symptom | |Tylenol |Drug | |Baby |Age | |decreased p.o. intake |Symptom | |His |Gender | |20 minutes |Duration | |q.2h. |Frequency | |to 5 to 10 minutes |Duration | |his |Gender | |respiratory congestion |Symptom | |He |Gender | |tired |Symptom | |fussy |Symptom | |over the past 2 days |RelativeDate | |albuterol |Drug | |ER |Clinical_Dept | |His |Gender | |urine output has also decreased |Symptom | |he |Gender | |per 24 hours |Frequency | |he |Gender | |per 24 hours |Frequency | |Mom |Gender | |diarrhea |Symptom | |His |Gender | |bowel |Internal_organ_or_component | +----------------------------------------------+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|jsl_ner_wip_modifier_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| --- layout: model title: Named Entity Recognition for Japanese (FastText 300d) author: John Snow Labs name: ner_ud_gsd_cc_300d date: 2021-09-09 tags: [ner, ja, open_source] task: Named Entity Recognition language: ja edition: Spark NLP 3.2.2 spark_version: 3.0 supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates named entities in a text, that can be used to find features such as 
names of people, places, and organizations. The model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. This model uses the pre-trained `japanese_cc_300d` embeddings model from the WordEmbeddings annotator as an input, so be sure to use the same embeddings in the pipeline. ## Predicted Entities `ORDINAL`, `PERSON`, `LAW`, `MOVEMENT`, `LOC`, `WORK_OF_ART`, `DATE`, `NORP`, `TITLE_AFFIX`, `QUANTITY`, `FAC`, `TIME`, `MONEY`, `LANGUAGE`, `GPE`, `EVENT`, `ORG`, `PERCENT`, `PRODUCT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_ud_gsd_cc_300d_ja_3.2.2_3.0_1631189041655.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ner_ud_gsd_cc_300d_ja_3.2.2_3.0_1631189041655.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import sparknlp from sparknlp.base import * from sparknlp.annotator import * from pyspark.ml import Pipeline documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") \ .setInputCols(["sentence"]) \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("japanese_cc_300d", "ja") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") # Load the pretrained NER tagger nerTagger = NerDLModel.pretrained("ner_ud_gsd_cc_300d", "ja") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") pipeline = Pipeline().setStages([ documentAssembler, sentence, word_segmenter, embeddings, nerTagger ]) data = spark.createDataFrame([["宮本茂氏は、日本の任天堂のゲームプロデューサーです。"]]).toDF("text") model = pipeline.fit(data) result = model.transform(data) result.selectExpr("explode(arrays_zip(token.result, ner.result))") \ .selectExpr("col['0'] as token", "col['1'] as ner") \ .show() ``` ```scala import spark.implicits._ import com.johnsnowlabs.nlp.DocumentAssembler import com.johnsnowlabs.nlp.annotator.{SentenceDetector, WordSegmenterModel} import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("japanese_cc_300d", "ja") .setInputCols("sentence", "token") .setOutputCol("embeddings") val nerTagger = NerDLModel.pretrained("ner_ud_gsd_cc_300d", "ja") 
.setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentence, word_segmenter, embeddings, nerTagger )) val data = Seq("宮本茂氏は、日本の任天堂のゲームプロデューサーです。").toDF("text") val model = pipeline.fit(data) val result = model.transform(data) result.selectExpr("explode(arrays_zip(token.result, ner.result))") .selectExpr("col['0'] as token", "col['1'] as ner") .show() ``` {:.nlu-block} ```python import nlu nlu.load("ja.ner.ud_gsd_cc_300d").predict("""宮本茂氏は、日本の任天堂のゲームプロデューサーです。""") ```
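The description above notes that the embeddings place semantically similar words closer together, which is what lets the NER tagger generalize beyond words it saw in training. A toy illustration of measuring that closeness with cosine similarity — the 3-d vectors are invented stand-ins for the real 300-dimensional fastText embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity: near 1.0 for similar directions, near 0.0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Invented toy vectors: two game companies should land closer to each
# other than to an unrelated word.
nintendo = [0.9, 0.1, 0.2]
sega = [0.8, 0.2, 0.3]
banana = [0.1, 0.9, 0.1]

assert cosine(nintendo, sega) > cosine(nintendo, banana)
```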
## Results ```bash +--------------+--------+ | token| ner| +--------------+--------+ | 宮本|B-PERSON| | 茂|I-PERSON| | 氏| O| | は| O| | 、| O| | 日本| B-GPE| | の| O| | 任天| B-FAC| | 堂| I-FAC| | の| O| | ゲーム| O| |プロデューサー| O| | です| O| | 。| O| +--------------+--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_ud_gsd_cc_300d| |Type:|ner| |Compatibility:|Spark NLP 3.2.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ja| |Dependencies:|japanese_cc_300d| ## Data Source The model was trained on the Universal Dependencies, curated by Google. A NER version was created by megagonlabs: https://github.com/megagonlabs/UD_Japanese-GSD Reference: Asahara, M., Kanayama, H., Tanaka, T., Miyao, Y., Uematsu, S., Mori, S., Matsumoto, Y., Omura, M., & Murawaki, Y. (2018). Universal Dependencies Version 2 for Japanese. In LREC-2018. ## Benchmarking ```bash label precision recall f1-score support DATE 0.91 0.92 0.92 206 EVENT 0.91 0.56 0.69 52 FAC 0.82 0.63 0.71 59 GPE 0.81 0.90 0.86 102 LANGUAGE 1.00 1.00 1.00 8 LAW 0.44 0.31 0.36 13 LOC 0.83 0.93 0.87 41 MONEY 0.80 1.00 0.89 20 MOVEMENT 0.38 0.55 0.44 11 NORP 0.98 0.81 0.88 57 O 0.99 0.99 0.99 11785 ORDINAL 0.79 0.94 0.86 32 ORG 0.82 0.74 0.78 179 PERCENT 1.00 1.00 1.00 16 PERSON 0.87 0.87 0.87 127 PRODUCT 0.61 0.72 0.66 50 QUANTITY 0.91 0.91 0.91 172 TIME 0.97 0.88 0.92 32 TITLE_AFFIX 0.81 0.92 0.86 24 WORK_OF_ART 0.71 0.83 0.77 48 accuracy - - 0.98 13034 macro-avg 0.82 0.82 0.81 13034 weighted-avg 0.98 0.98 0.98 13034 ``` --- layout: model title: Italian Part of Speech Tagger (from sachaarbonel) author: John Snow Labs name: bert_pos_bert_italian_cased_finetuned_pos date: 2022-05-09 tags: [bert, pos, part_of_speech, it, open_source] task: Part of Speech Tagging language: it edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: 
"Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-italian-cased-finetuned-pos` is an Italian model originally trained by `sachaarbonel`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_italian_cased_finetuned_pos_it_3.4.2_3.0_1652092443943.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_italian_cased_finetuned_pos_it_3.4.2_3.0_1652092443943.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_italian_cased_finetuned_pos","it") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Adoro Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_italian_cased_finetuned_pos","it") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Adoro Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_italian_cased_finetuned_pos| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|it| |Size:|410.3 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/sachaarbonel/bert-italian-cased-finetuned-pos - https://raw.githubusercontent.com/stefan-it/fine-tuned-berts-seq/master/scripts/preprocess.py - https://twitter.com/sachaarbonel - https://www.linkedin.com/in/sacha-arbonel --- layout: model title: Loinc Sentence Entity Resolver author: John Snow Labs name: sbiobertresolve_loinc date: 2021-05-16 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted clinical NER entities to LOINC codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings, and has faster load time, with a speedup of about 6X when compared to previous versions. Also the load process now is more memory friendly meaning that the maximum memory required during load time is smaller, reducing the chances of OOM exceptions, and thus relaxing hardware requirements. ## Predicted Entities Predicts LOINC Codes and their normalized definition for each chunk. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_LOINC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_loinc_en_3.0.4_3.0_1621189494152.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_loinc_en_3.0.4_3.0_1621189494152.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') sentenceDetector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") stopwords = StopWordsCleaner.pretrained()\ .setInputCols("token")\ .setOutputCol("cleanTokens")\ .setCaseSensitive(False) word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "cleanTokens"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "cleanTokens", "ner"]) \ .setOutputCol("ner_chunk") chunk2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_loinc","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") pipeline_loinc = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver]) data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and 
vomiting."""]]).toDF("text") results = pipeline_loinc.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val stopwords = StopWordsCleaner.pretrained() .setInputCols("token") .setOutputCol("cleanTokens") .setCaseSensitive(false) val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "cleanTokens")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "cleanTokens", "ner")) .setOutputCol("ner_chunk") val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("sbert_embeddings") val resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_loinc","en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, resolver)) val data = Seq("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of 
polyuria, polydipsia, poor appetite, and vomiting.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.loinc").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.""") ```
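The resolver above is configured with `setDistanceFunction("EUCLIDEAN")`: each extracted chunk's sentence embedding is matched against reference embeddings for the LOINC vocabulary, and the nearest code wins. A toy sketch of that nearest-neighbor lookup — the 2-d vectors and the two-entry index are invented for illustration; the real model searches a full LOINC index in the 768-dimensional sbiobert embedding space:

```python
import math

def resolve(chunk_vec, index):
    """Return the code whose reference embedding is nearest in Euclidean distance."""
    return min(index, key=lambda code: math.dist(chunk_vec, index[code]))

# Invented toy index: code -> reference embedding
index = {
    "59574-4": [0.9, 0.1],  # Body mass index
    "28239-2": [0.1, 0.9],  # Polyuria
}
print(resolve([0.85, 0.2], index))  # → '59574-4'
```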
## Results ```bash | | chunk | loinc_code | |---:|:--------------------------------------|:-------------| | 0 | gestational diabetes mellitus | 45636-8 | | 1 | subsequent type two diabetes mellitus | 44877-9 | | 2 | T2DM | 45636-8 | | 3 | HTG-induced pancreatitis | 66667-7 | | 4 | an acute hepatitis | 45690-5 | | 5 | obesity | 73708-0 | | 6 | a body mass index | 59574-4 | | 7 | BMI | 59574-4 | | 8 | polyuria | 28239-2 | | 9 | polydipsia | 90552-1 | | 10 | poor appetite | 28387-9 | | 11 | vomiting | 81224-8 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_loinc| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[loinc_code]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on the standard LOINC coding system. --- layout: model title: Finnish asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_aapot TFWav2Vec2ForCTC from aapot author: John Snow Labs name: pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_aapot date: 2022-09-25 tags: [wav2vec2, fi, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_aapot` is a Finnish model originally trained by aapot. 
NOTE: This pipeline only works on a CPU; if you need to use it on a GPU device, please use pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_aapot_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_aapot_fi_4.2.0_3.0_1664094639072.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_aapot_fi_4.2.0_3.0_1664094639072.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_aapot', lang = 'fi') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_aapot", lang = "fi") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_xlsr_1b_finnish_lm_v2_by_aapot| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| |Size:|3.6 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Translate Luganda to English Pipeline author: John Snow Labs name: translate_lg_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, lg, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial ones. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended. - source languages: `lg` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_lg_en_xx_2.7.0_2.4_1609689782642.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_lg_en_xx_2.7.0_2.4_1609689782642.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_lg_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_lg_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.lg.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_lg_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English asr_model_4 TFWav2Vec2ForCTC from niclas author: John Snow Labs name: asr_model_4 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_model_4` is an English model originally trained by niclas. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_model_4_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_model_4_en_4.2.0_3.0_1664098250606.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_model_4_en_4.2.0_3.0_1664098250606.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_model_4", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_model_4", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
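Wav2Vec2ForCTC models emit one prediction per audio frame, and the transcript is recovered by CTC decoding: collapse runs of repeated ids, then drop the special blank token. A minimal sketch of greedy CTC decoding — the toy vocabulary and frame ids are invented; the real model uses its own character vocabulary and frame rate:

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated per-frame predictions, then drop CTC blanks."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(id_to_char[i])
        prev = i
    return "".join(out)

vocab = {1: "h", 2: "e", 3: "l", 4: "o"}
# Per-frame argmax ids; the blank (0) is what separates the two l's in "hello"
print(ctc_greedy_decode([1, 1, 2, 3, 0, 3, 4, 4], vocab))  # → 'hello'
```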
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_model_4| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Italian T5ForConditionalGeneration Small Cased model (from it5) author: John Snow Labs name: t5_it5_efficient_small_el32_formal_to_informal date: 2023-01-30 tags: [it, open_source, t5, tensorflow] task: Text Generation language: it edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `it5-efficient-small-el32-formal-to-informal` is an Italian model originally trained by `it5`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_formal_to_informal_it_4.3.0_3.0_1675103238992.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_it5_efficient_small_el32_formal_to_informal_it_4.3.0_3.0_1675103238992.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_formal_to_informal","it") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_it5_efficient_small_el32_formal_to_informal","it") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_it5_efficient_small_el32_formal_to_informal| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|it| |Size:|593.5 MB| ## References - https://huggingface.co/it5/it5-efficient-small-el32-formal-to-informal - https://github.com/stefan-it - https://arxiv.org/abs/2203.03759 - https://gsarti.com - https://malvinanissim.github.io - https://arxiv.org/abs/2109.10686 - https://github.com/gsarti/it5 - https://paperswithcode.com/sota?task=Formal-to-informal+Style+Transfer&dataset=XFORMAL+%28Italian+Subset%29 --- layout: model title: German ElectraForQuestionAnswering model (from deepset) author: John Snow Labs name: electra_qa_g_base_germanquad date: 2022-06-22 tags: [de, open_source, electra, question_answering] task: Question Answering language: de edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `gelectra-base-germanquad` is a German model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_g_base_germanquad_de_4.0.0_3.0_1655921748064.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_g_base_germanquad_de_4.0.0_3.0_1655921748064.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_g_base_germanquad","de") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["Was ist mein Name?", "Mein Name ist Clara und ich lebe in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_g_base_germanquad","de")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("Was ist mein Name?", "Mein Name ist Clara und ich lebe in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("de.answer_question.electra.base").predict("""Was ist mein Name?|||Mein Name ist Clara und ich lebe in Berkeley.""")
```
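In the NLU one-liner above, the question and context are passed as a single string separated by `|||`. As a quick illustration of that convention, here is a minimal sketch in plain Python (the helper name and the strict two-part split are assumptions based on the snippet above, not part of the NLU API):

```python
# Hypothetical helper mirroring the "question|||context" convention
# used in the nlu.load(...).predict(...) example above.
def split_qa_input(combined: str, sep: str = "|||"):
    """Split a combined QA string into (question, context)."""
    question, _, context = combined.partition(sep)
    return question.strip(), context.strip()

q, c = split_qa_input(
    "Was ist mein Name?|||Mein Name ist Clara und ich lebe in Berkeley."
)
print(q)  # Was ist mein Name?
print(c)  # Mein Name ist Clara und ich lebe in Berkeley.
```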
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_g_base_germanquad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|de| |Size:|410.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/deepset/gelectra-base-germanquad - https://deepset.ai/germanquad - https://deepset.ai/german-bert - https://github.com/deepset-ai/FARM - https://github.com/deepset-ai/haystack/ - https://haystack.deepset.ai/community/join --- layout: model title: ICD10CM ChunkResolver author: John Snow Labs name: chunkresolve_icd10cm_diseases_clinical date: 2021-04-02 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Entity Resolution model Based on KNN using Word Embeddings + Word Movers Distance. ## Predicted Entities ICD10-CM Codes and their normalized definition with ``clinical_embeddings``. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_diseases_clinical_en_3.0.0_3.0_1617355419289.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_diseases_clinical_en_3.0.0_3.0_1617355419289.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... icd10cmResolver = ChunkEntityResolverModel.pretrained('chunkresolve_icd10cm_diseases_clinical', 'en', "clinical/models")\ .setEnableLevenshtein(True)\ .setNeighbours(200).setAlternatives(5).setDistanceWeights([3,3,2,0,0,7])\ .setInputCols('token', 'chunk_embs_jsl')\ .setOutputCol('icd10cm_resolution') pipeline_icd10 = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, jslNer, drugNer, jslConverter, drugConverter, jslChunkEmbeddings, drugChunkEmbeddings, icd10cmResolver]) empty_df = spark.createDataFrame([[""]]).toDF("text") data = ["""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret's Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU"""] pipeline_model = pipeline_icd10.fit(empty_df) light_pipeline = LightPipeline(pipeline_model) result = light_pipeline.annotate(data) ``` ```scala ... 
val icd10cmResolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_diseases_clinical", "en", "clinical/models")
    .setEnableLevenshtein(true)
    .setNeighbours(200).setAlternatives(5).setDistanceWeights(Array(3,3,2,0,0,7))
    .setInputCols("token", "chunk_embs_jsl")
    .setOutputCol("icd10cm_resolution")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, stopwords, word_embeddings, jslNer, drugNer, jslConverter, drugConverter, jslChunkEmbeddings, drugChunkEmbeddings, icd10cmResolver))

val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret's Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU").toDF("text")

val result = pipeline.fit(data).transform(data)
```
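As described above, the resolver ranks candidate ICD10-CM codes by nearest-neighbor search over embeddings. A toy sketch of that KNN idea in plain Python (the vectors below are invented for illustration and cosine similarity stands in for the model's actual Word Movers Distance):

```python
import math

# Toy candidate codes with made-up embedding vectors; the real model
# uses clinical word embeddings and Word Movers Distance, not these.
CANDIDATES = {
    "G4700": [0.9, 0.1, 0.0],   # Insomnia, unspecified
    "K2970": [0.1, 0.8, 0.1],   # Gastritis, unspecified
    "N185":  [0.0, 0.2, 0.9],   # Chronic kidney disease, stage 5
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def resolve(chunk_vec, k=2):
    """Return the k candidate codes closest to the chunk embedding."""
    ranked = sorted(CANDIDATES,
                    key=lambda code: cosine(chunk_vec, CANDIDATES[code]),
                    reverse=True)
    return ranked[:k]

print(resolve([0.85, 0.15, 0.05], k=1))  # ['G4700']
```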
## Results

```bash
|   | coords      | chunk                       | entity    | icd10cm_opts                                                                              |
|---|-------------|-----------------------------|-----------|-------------------------------------------------------------------------------------------|
| 0 | 2::499::506 | insomnia                    | Diagnosis | [(G4700, Insomnia, unspecified), (G4709, Other insomnia), (F5102, Adjustment insomnia)...]|
| 1 | 4::83::109  | chronic renal insufficiency | Diagnosis | [(N185, Chronic kidney disease, stage 5), (N181, Chronic kidney disease, stage 1), (N1...]|
| 2 | 4::120::128 | gastritis                   | Diagnosis | [(K2970, Gastritis, unspecified, without bleeding), (B9681, Helicobacter pylori [H. py...]|
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|chunkresolve_icd10cm_diseases_clinical|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[token, chunk_embeddings]|
|Output Labels:|[icd10cm]|
|Language:|en|

---
layout: model
title: Fast Neural Machine Translation Model from Brazilian Sign Language to English
author: John Snow Labs
name: opus_mt_bzs_en
date: 2020-12-29
task: Translation
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, bzs, en, xx]
supported: true
annotator: MarianTransformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial ones. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `bzs` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bzs_en_xx_2.7.0_2.4_1609254326608.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bzs_en_xx_2.7.0_2.4_1609254326608.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_bzs_en", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

data = ["Your sentence to translate!"]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_bzs_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("Your sentence to translate!").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.bzs.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_bzs_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Translate Slovak to English Pipeline
author: John Snow Labs
name: translate_sk_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, sk, en, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial ones. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

Note that this is a computationally expensive module, especially on longer sequences; using an accelerator such as a GPU is recommended.

- source languages: `sk`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_sk_en_xx_2.7.0_2.4_1609688872908.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_sk_en_xx_2.7.0_2.4_1609688872908.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_sk_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_sk_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.sk.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_sk_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English image_classifier_vit_animal_classifier_huggingface ViTForImageClassification from vivekRahul author: John Snow Labs name: image_classifier_vit_animal_classifier_huggingface date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_animal_classifier_huggingface` is a English model originally trained by vivekRahul. ## Predicted Entities `lion`, `tiger`, `dog`, `cat`, `elephant` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_animal_classifier_huggingface_en_4.1.0_3.0_1660168778923.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_animal_classifier_huggingface_en_4.1.0_3.0_1660168778923.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_animal_classifier_huggingface", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_animal_classifier_huggingface", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
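The classifier above emits one of the five animal labels listed under Predicted Entities; internally, image classification heads end in a softmax over class scores. A toy sketch of that final scoring step in plain Python (the logit values are made up for illustration and this is not the actual ViT code):

```python
import math

# Labels taken from the Predicted Entities list above.
LABELS = ["lion", "tiger", "dog", "cat", "elephant"]

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits):
    """Return (label, probability) of the highest-scoring class."""
    probs = softmax(logits)
    i = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[i], probs[i]

label, prob = classify([0.2, 0.1, 3.5, 0.3, 0.4])  # made-up logits
print(label)  # dog
```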
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_animal_classifier_huggingface|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|

---
layout: model
title: Legal Indemnification Relation Extraction (sm, Bidirectional)
author: John Snow Labs
name: legre_indemnifications
date: 2022-09-28
tags: [en, legal, re, indemnification, licensed]
task: Relation Extraction
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: RelationExtractionDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

IMPORTANT: Don't run this model on a whole legal agreement. Instead:

- Split the agreement by paragraphs. You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration;
- Use the `legclf_indemnification_clause` Text Classifier to select only those paragraphs.

This is a Relation Extraction model that groups the entities extracted with the Indemnification NER model (see `legner_bert_indemnifications` in Models Hub). It is a `sm` model without meaningful directions in the relations (it was not trained to distinguish whether a relation runs left-to-right or right-to-left). Bigger models, also trained with directed relationships, are available in Models Hub.
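The description above asks you to split the agreement into paragraphs before classifying and extracting relations. A minimal sketch of such a splitter in plain Python (splitting on blank lines is one simple choice; it is not necessarily the exact approach used in the referenced notebook):

```python
import re

def split_paragraphs(text: str):
    """Split a document into paragraphs on blank lines."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

# Hypothetical three-clause agreement used only for illustration.
agreement = "Clause 1. ...\n\nClause 2. Indemnification. ...\n\nClause 3. ..."
for paragraph in split_paragraphs(agreement):
    print(paragraph)
```
Each resulting paragraph can then be sent through the `legclf_indemnification_clause` classifier, and only the indemnification paragraphs forwarded to this model.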
## Predicted Entities `is_indemnification_subject`, `is_indemnification_object`, `is_indemnification_indobject` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/LEGALRE_INDEMNIFICATION/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legre_indemnifications_en_1.0.0_3.0_1664361611044.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legre_indemnifications_en_1.0.0_3.0_1664361611044.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencizer = nlp.SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "en") \ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") tokenClassifier = legal.BertForTokenClassification.pretrained("legner_bert_indemnifications", "en", "legal/models")\ .setInputCols("token", "sentence")\ .setOutputCol("label")\ .setCaseSensitive(True) ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","label"])\ .setOutputCol("ner_chunk") # ONLY NEEDED IF YOU WANT TO FILTER RELATION PAIRS OR SYNTACTIC DISTANCE # ================= pos_tagger = nlp.PerceptronModel()\ .pretrained() \ .setInputCols(["sentence", "token"])\ .setOutputCol("pos_tags") dependency_parser = nlp.DependencyParserModel() \ .pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") #Set a filter on pairs of named entities which will be treated as relation candidates re_filter = legal.RENerChunksFilter()\ .setInputCols(["ner_chunk", "dependencies"])\ .setOutputCol("re_ner_chunks")\ .setMaxSyntacticDistance(20)\ .setRelationPairs(['INDEMNIFICATION_SUBJECT-INDEMNIFICATION_ACTION', 'INDEMNIFICATION_SUBJECT-INDEMNIFICATION_INDIRECT_OBJECT', 'INDEMNIFICATION_ACTION-INDEMNIFICATION', 'INDEMNIFICATION_ACTION-INDEMNIFICATION_INDIRECT_OBJECT']) # ================= reDL = legal.RelationExtractionDLModel()\ .pretrained("legre_indemnifications", "en", "legal/models")\ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentence"])\ .setOutputCol("relations") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentencizer, tokenizer, tokenClassifier, ner_converter, pos_tagger, dependency_parser, re_filter, reDL]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = 
nlpPipeline.fit(empty_data)

text = '''The Company shall indemnify and hold harmless HOC against any losses, claims, damages or liabilities to which it may become subject under the 1933 Act or otherwise, insofar as such losses, claims, damages or liabilities (or actions in respect thereof) arise out of or are based upon '''

lmodel = LightPipeline(model)

res = lmodel.annotate(text)
```
## Results ```bash relation entity1 entity1_begin entity1_end chunk1 entity2 entity2_begin entity2_end chunk2 confidence 1 is_indemnification_subject INDEMNIFICATION_SUBJECT 4 10 Company INDEMNIFICATION_ACTION 32 44 hold harmless 0.8847967 2 is_indemnification_indobject INDEMNIFICATION_SUBJECT 4 10 Company INDEMNIFICATION_INDIRECT_OBJECT 46 48 HOC 0.96191925 3 is_indemnification_indobject INDEMNIFICATION_ACTION 12 26 shall indemnify INDEMNIFICATION_INDIRECT_OBJECT 46 48 HOC 0.7332646 10 is_indemnification_object INDEMNIFICATION_ACTION 32 44 hold harmless INDEMNIFICATION 70 75 claims 0.9728908 11 is_indemnification_object INDEMNIFICATION_ACTION 32 44 hold harmless INDEMNIFICATION 78 84 damages 0.9727499 12 is_indemnification_object INDEMNIFICATION_ACTION 32 44 hold harmless INDEMNIFICATION 89 99 liabilities 0.964168 ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legre_indemnifications| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|405.9 MB| ## References In-house annotated examples from CUAD legal dataset ## Benchmarking ```bash label Recall Precision F1 Support is_indemnification_indobject 0.966 1.000 0.982 29 is_indemnification_object 0.929 0.929 0.929 42 is_indemnification_subject 0.931 0.931 0.931 29 no_rel 0.950 0.941 0.945 100 Avg. 0.944 0.950 0.947 - Weighted-Avg. 
0.945 0.945 0.945 -
```

---
layout: model
title: Pipeline to Detect Clinical Entities (BertForTokenClassifier)
author: John Snow Labs
name: bert_token_classifier_ner_jsl_pipeline
date: 2023-06-06
tags: [licensed, en, clinical, ner, ner_jsl, bertfortokenclassification]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [bert_token_classifier_ner_jsl](https://nlp.johnsnowlabs.com/2023/05/04/bert_token_classifier_ner_jsl_en.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_pipeline_en_4.3.0_3.2_1686088562944.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_pipeline_en_4.3.0_3.2_1686088562944.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_jsl_pipeline", "en", "clinical/models") text = """The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""" result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_jsl_pipeline", "en", "clinical/models") val text = """The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. 
The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""" val result = pipeline.fullAnnotate(text) ```
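`fullAnnotate` returns the detected chunks together with their labels, which you will often want to regroup for inspection. A toy sketch of that post-processing in plain Python (the `(chunk, label)` pairs below are hand-copied examples, not the pipeline's actual output object):

```python
from collections import defaultdict

# Example (chunk, label) pairs shaped like this pipeline's NER output.
pairs = [
    ("21-day-old", "Age"),
    ("Caucasian", "Race_Ethnicity"),
    ("male", "Gender"),
    ("congestion", "Symptom"),
    ("mom", "Gender"),
]

def group_by_label(chunks):
    """Group recognized chunks under their entity label."""
    grouped = defaultdict(list)
    for chunk, label in chunks:
        grouped[label].append(chunk)
    return dict(grouped)

print(group_by_label(pairs)["Gender"])  # ['male', 'mom']
```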
## Results ```bash | | ner_chunks | begin | end | ner_labels | confidence | |---:|:------------------------------------------|--------:|------:|:-----------------------------|-------------:| | 0 | 21-day-old | 17 | 26 | Age | 0.996622 | | 1 | Caucasian | 28 | 36 | Race_Ethnicity | 0.999759 | | 2 | male | 38 | 41 | Gender | 0.999847 | | 3 | 2 days | 52 | 57 | Duration | 0.818646 | | 4 | congestion | 62 | 71 | Symptom | 0.997344 | | 5 | mom | 75 | 77 | Gender | 0.999601 | | 6 | yellow | 99 | 104 | Symptom | 0.476263 | | 7 | discharge | 106 | 114 | Symptom | 0.704853 | | 8 | nares | 135 | 139 | External_body_part_or_region | 0.999152 | | 9 | she | 147 | 149 | Gender | 0.999927 | | 10 | mild | 168 | 171 | Modifier | 0.999674 | | 11 | problems with his breathing while feeding | 173 | 213 | Symptom | 0.995353 | | 12 | perioral cyanosis | 237 | 253 | Symptom | 0.99852 | | 13 | retractions | 258 | 268 | Symptom | 0.999806 | | 14 | One day ago | 272 | 282 | RelativeDate | 0.99949 | | 15 | mom | 285 | 287 | Gender | 0.999779 | | 16 | tactile temperature | 304 | 322 | Symptom | 0.997475 | | 17 | Tylenol | 345 | 351 | Drug_BrandName | 0.998978 | | 18 | Baby-girl | 354 | 362 | Age | 0.990654 | | 19 | decreased | 382 | 390 | Symptom | 0.996808 | | 20 | intake | 397 | 402 | Symptom | 0.983608 | | 21 | His | 405 | 407 | Gender | 0.999922 | | 22 | breast-feeding | 416 | 429 | External_body_part_or_region | 0.994421 | | 23 | 20 minutes | 444 | 453 | Duration | 0.992322 | | 24 | 5 to 10 minutes | 464 | 478 | Duration | 0.969913 | | 25 | his | 493 | 495 | Gender | 0.999908 | | 26 | respiratory congestion | 497 | 518 | Symptom | 0.995677 | | 27 | He | 521 | 522 | Gender | 0.999803 | | 28 | tired | 555 | 559 | Symptom | 0.999463 | | 29 | fussy | 574 | 578 | Symptom | 0.996514 | | 30 | over the past 2 days | 580 | 599 | RelativeDate | 0.998001 | | 31 | albuterol | 642 | 650 | Drug_Ingredient | 0.99964 | | 32 | ER | 676 | 677 | Clinical_Dept | 0.998161 | | 33 | His | 680 | 682 | Gender | 
0.999921 | | 34 | urine output has also decreased | 684 | 714 | Symptom | 0.971606 | | 35 | he | 726 | 727 | Gender | 0.999916 | | 36 | per 24 hours | 765 | 776 | Frequency | 0.910935 | | 37 | he | 783 | 784 | Gender | 0.999922 | | 38 | per 24 hours | 812 | 823 | Frequency | 0.921849 | | 39 | Mom | 826 | 828 | Gender | 0.999606 | | 40 | diarrhea | 841 | 848 | Symptom | 0.999849 | | 41 | His | 851 | 853 | Gender | 0.999739 | | 42 | bowel | 855 | 859 | Internal_organ_or_component | 0.999471 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_jsl_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|405.4 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from Evelyn18) author: John Snow Labs name: roberta_qa_base_spanish_squades_modelo1 date: 2023-01-20 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-squades-modelo1` is a Spanish model originally trained by `Evelyn18`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_modelo1_es_4.3.0_3.0_1674218442505.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_modelo1_es_4.3.0_3.0_1674218442505.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

question_answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_modelo1","es")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[document_assembler, question_answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val questionAnswering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_modelo1","es")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, questionAnswering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_spanish_squades_modelo1|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|es|
|Size:|460.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/Evelyn18/roberta-base-spanish-squades-modelo1

---
layout: model
title: Named Entity Recognition in Romanian Official Documents (Large)
author: John Snow Labs
name: legner_romanian_official_lg
date: 2022-11-10
tags: [ro, ner, legal, licensed]
task: Named Entity Recognition
language: ro
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is the large version of a NER model that extracts the following 14 entities from Romanian official documents.

## Predicted Entities

`PER`, `LOC`, `ORG`, `DATE`, `DECISION`, `DECREE`, `DIRECTIVE`, `ORDINANCE`, `EMERGENCY_ORDINANCE`, `LAW`, `ORDER`, `REGULATION`, `REPORT`, `TREATY`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/legal/LEGNER_ROMANIAN_OFFICIAL/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_romanian_official_lg_ro_1.0.0_3.0_1668084251147.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_romanian_official_lg_ro_1.0.0_3.0_1668084251147.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_base_cased", "ro")\
    .setInputCols("sentence", "token")\
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)\
    .setCaseSensitive(True)

ner_model = legal.NerModel.pretrained("legner_romanian_official_lg", "ro", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

data = spark.createDataFrame([["""LEGE nr. 159 din 25 iulie 2019 pentru modificarea și completarea Decretului-lege nr. 118 / 1990 privind acordarea unor drepturi persoanelor persecutate din motive politice de dictatura instaurată cu începere de la 6 martie 1945, precum și celor deportate în străinătate ori constituite în prizonieri și pentru modificarea Ordonanței Guvernului nr. 105 / 1999 privind acordarea unor drepturi persoanelor persecutate de către regimurile instaurate în România cu începere de la 6 septembrie 1940 până la 6 martie 1945 din motive etnice Publicat în MONITORUL OFICIAL nr. 625 din 26 iulie 2019 Parlamentul României adoptă prezenta lege. Articolul I Decretul-lege nr.
118 / 1990 privind acordarea unor drepturi persoanelor persecutate din motive politice de dictatura instaurată cu începere de la 6 martie 1945, precum și celor deportate în străinătate ori constituite în prizonieri, republicat în Monitorul Oficial al României, Partea I, nr. 631 din 23 septembrie 2009, cu modificările și completările ulterioare, se modifică și se completează după cum urmează: Articolul II Ordonanța Guvernului nr. 105 / 1999 privind acordarea unor drepturi persoanelor persecutate de către regimurile instaurate în România cu începere de la 6 septembrie 1940 până la 6 martie 1945 din motive etnice, publicată în Monitorul Oficial al României, Partea I, nr. 426 din 31 august 1999, aprobată cu modificări și completări prin Legea nr. 189 / 2000, cu modificările și completările ulterioare, se modifică după cum urmează: Această lege a fost adoptată de Parlamentul României, cu respectarea prevederilor art. 75 și ale art. 76 alin. (2) din Constituția României, republicată. PREȘEDINTELE CAMEREI DEPUTAȚILOR ION-MARCEL CIOLACU București, 25 iulie 2019."""]]).toDF("text") result = model.transform(data) ```
## Results ```bash +------------------------------------+---------+ |chunk |label | +------------------------------------+---------+ |LEGE nr. 159 din 25 iulie 2019 |LAW | |Decretului-lege nr. 118 / 1990 |DECREE | |6 martie 1945 |DATE | |Ordonanței Guvernului nr. 105 / 1999|ORDINANCE| |România |LOC | |6 septembrie 1940 |DATE | |6 martie 1945 |DATE | |26 iulie 2019 |DATE | |Parlamentul României |ORG | |Decretul-lege nr. 118 / 1990 |DECREE | |6 martie 1945 |DATE | |Monitorul Oficial al României |ORG | |23 septembrie 2009 |DATE | |Ordonanța Guvernului nr. 105 / 1999 |ORDINANCE| |România |LOC | |6 septembrie 1940 |DATE | |6 martie 1945 |DATE | |Monitorul Oficial al României |ORG | |31 august 1999 |DATE | |Legea nr. 189 / 2000 |LAW | |Parlamentul României |ORG | |Constituția României |LAW | |CAMEREI DEPUTAȚILOR |ORG | |ION-MARCEL CIOLACU |PER | |București |LOC | |25 iulie 2019 |DATE | +------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_romanian_official_lg| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ro| |Size:|16.5 MB| ## References Dataset is available [here](https://zenodo.org/record/7025333#.Y2zsquxBx83). 
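The chunk/label table above is typically built by flattening the pipeline's `ner_chunk` annotations. A minimal, framework-free sketch of that post-processing step, assuming the common Spark NLP convention that each chunk annotation carries its text in `result` and its label in `metadata["entity"]` (field names here are illustrative, not taken from this card):

```python
def chunks_to_rows(annotations):
    """Turn a list of chunk-annotation dicts into (chunk, label) tuples."""
    return [(a["result"], a["metadata"]["entity"]) for a in annotations]

# Example shaped like the first rows of the results table above:
sample = [
    {"result": "LEGE nr. 159 din 25 iulie 2019", "metadata": {"entity": "LAW"}},
    {"result": "6 martie 1945", "metadata": {"entity": "DATE"}},
]
rows = chunks_to_rows(sample)
```

In a real pipeline you would obtain `annotations` from `LightPipeline.fullAnnotate` or by exploding the `ner_chunk` column of the transformed DataFrame.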
## Benchmarking ```bash label precision recall f1-score support DATE 0.9036 0.9223 0.9128 193 DECISION 0.9831 0.9831 0.9831 59 DECREE 0.5000 1.0000 0.6667 1 DIRECTIVE 1.0000 0.6667 0.8000 3 EMERGENCY_ORDINANCE 1.0000 0.9615 0.9804 26 LAW 0.9619 0.9806 0.9712 103 LOC 0.9110 0.8365 0.8721 159 ORDER 0.9767 1.0000 0.9882 42 ORDINANCE 1.0000 0.9500 0.9744 20 ORG 0.8899 0.8879 0.8889 455 PER 0.9091 0.9821 0.9442 112 REGULATION 0.9118 0.8378 0.8732 37 REPORT 0.7778 0.7778 0.7778 9 TREATY 1.0000 1.0000 1.0000 3 micro-avg 0.9139 0.9116 0.9127 1222 macro-avg 0.9089 0.9133 0.9024 1222 weighted-avg 0.9143 0.9116 0.9124 1222 ``` --- layout: model title: Chinese Bert Embeddings (from Geotrend) author: John Snow Labs name: bert_embeddings_bert_base_zh_cased date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-zh-cased` is a Chinese model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_zh_cased_zh_3.4.2_3.0_1649669775199.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_zh_cased_zh_3.4.2_3.0_1649669775199.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_zh_cased","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_zh_cased","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.bert_base_zh_cased").predict("""I love Spark NLP""") ```
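The pipeline above attaches one embedding vector per token in the `embeddings` column. Downstream, such vectors are most often compared with cosine similarity; a dependency-free sketch of that comparison (the vectors here are toy 2-d examples, not real BERT outputs):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # identical vectors -> 1.0
```

With real output, you would extract each token's `embeddings` array from the transformed DataFrame and compare pairs this way.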
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_zh_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|360.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-zh-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_10 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-64-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_10_en_4.0.0_3.0_1654191695998.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_10_en_4.0.0_3.0_1654191695998.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_10","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_64d_seed_10").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
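The NLU one-liner above packs the question and the context into a single string separated by `|||`. A small sketch of that convention, assuming a plain split on the separator (the helper name is illustrative, not part of the nlu API):

```python
def split_qa(payload: str):
    """Split an NLU-style 'question|||context' payload into its two parts."""
    question, _, context = payload.partition("|||")
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
```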
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|378.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-64-finetuned-squad-seed-10 --- layout: model title: English BertForQuestionAnswering model (from batterydata) author: John Snow Labs name: bert_qa_batteryscibert_uncased_squad_v1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `batteryscibert-uncased-squad-v1` is an English model originally trained by `batterydata`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_batteryscibert_uncased_squad_v1_en_4.0.0_3.0_1654179403799.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_batteryscibert_uncased_squad_v1_en_4.0.0_3.0_1654179403799.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_batteryscibert_uncased_squad_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_batteryscibert_uncased_squad_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad_battery.scibert.uncased").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_batteryscibert_uncased_squad_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|410.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/batterydata/batteryscibert-uncased-squad-v1 - https://github.com/ShuHuang/batterybert --- layout: model title: English BertForQuestionAnswering Base Uncased model (from ksabeh) author: John Snow Labs name: bert_qa_base_uncased_attribute_correction_mlm date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-attribute-correction-mlm` is an English model originally trained by `ksabeh`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_attribute_correction_mlm_en_4.0.0_3.0_1657183769786.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_attribute_correction_mlm_en_4.0.0_3.0_1657183769786.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_attribute_correction_mlm","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_attribute_correction_mlm","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_attribute_correction_mlm| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/ksabeh/bert-base-uncased-attribute-correction-mlm --- layout: model title: English DistilBertForQuestionAnswering Base Cased model (from adamlin) author: John Snow Labs name: distilbert_qa_base_cased_sgd_step5000 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-sgd_qa-step5000` is an English model originally trained by `adamlin`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_sgd_step5000_en_4.3.0_3.0_1672767052489.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_sgd_step5000_en_4.3.0_3.0_1672767052489.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_sgd_step5000","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_sgd_step5000","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_cased_sgd_step5000| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/adamlin/distilbert-base-cased-sgd_qa-step5000 --- layout: model title: Part of Speech for Finnish author: John Snow Labs name: pos_ud_tdt date: 2021-03-08 tags: [part_of_speech, open_source, finnish, pos_ud_tdt, fi] task: Part of Speech Tagging language: fi edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - NOUN - ADJ - VERB - ADV - CCONJ - PUNCT - SCONJ - PRON - PROPN - AUX - ADP - NUM - X - SYM - INTJ {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_tdt_fi_3.0.0_3.0_1615230305202.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_tdt_fi_3.0.0_3.0_1615230305202.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_tdt", "fi") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Hei John Snow Labs! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_tdt", "fi") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Hei John Snow Labs! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Hei John Snow Labs! "] token_df = nlu.load('fi.pos').predict(text) token_df ```
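The pipeline above produces one POS tag per token. A common downstream step is counting how often each tag occurs; a framework-free sketch over (token, tag) rows shaped like the pipeline's output:

```python
from collections import Counter

def pos_histogram(rows):
    """Count how often each POS tag appears in (token, tag) rows."""
    return Counter(tag for _, tag in rows)

rows = [("Hei", "INTJ"), ("John", "PROPN"), ("Snow", "PROPN"),
        ("Labs", "PROPN"), ("!", "PUNCT")]
print(pos_histogram(rows))  # PROPN appears 3 times
```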
## Results ```bash token pos 0 Hei INTJ 1 John PROPN 2 Snow PROPN 3 Labs PROPN 4 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_tdt| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|fi| --- layout: model title: Multilingual XLMRobertaForTokenClassification Base Cased model (from jgriffi) author: John Snow Labs name: xlmroberta_ner_jgriffi_base_finetuned_panx_all date: 2022-08-13 tags: [xx, open_source, xlm_roberta, ner] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-all` is a Multilingual model originally trained by `jgriffi`. ## Predicted Entities `ORG`, `LOC`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jgriffi_base_finetuned_panx_all_xx_4.1.0_3.0_1660428574153.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_jgriffi_base_finetuned_panx_all_xx_4.1.0_3.0_1660428574153.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jgriffi_base_finetuned_panx_all","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_jgriffi_base_finetuned_panx_all","xx") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_jgriffi_base_finetuned_panx_all| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|862.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/jgriffi/xlm-roberta-base-finetuned-panx-all --- layout: model title: General Oncology Pipeline author: John Snow Labs name: oncology_general_pipeline date: 2023-03-29 tags: [licensed, pipeline, oncology, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline includes Named-Entity Recognition, Assertion Status and Relation Extraction models to extract information from oncology texts. This pipeline extracts diagnoses, treatments, tests, anatomical references and demographic entities. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/oncology_general_pipeline_en_4.3.2_3.2_1680110018146.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/oncology_general_pipeline_en_4.3.2_3.2_1680110018146.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("oncology_general_pipeline", "en", "clinical/models") text = '''The patient underwent a left mastectomy for a left breast cancer two months ago. The tumor is positive for ER and PR.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("oncology_general_pipeline", "en", "clinical/models") val text = "The patient underwent a left mastectomy for a left breast cancer two months ago. The tumor is positive for ER and PR." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.oncology_general.pipeline").predict("""The patient underwent a left mastectomy for a left breast cancer two months ago. The tumor is positive for ER and PR.""") ```
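The relation extraction models in this pipeline label unrelated entity pairs with `O`; when consuming the output, those rows are usually dropped first. A minimal, framework-free sketch (the tuple layout mirrors the chunk1/entity1/chunk2/entity2/relation columns of the results shown below and is illustrative):

```python
def keep_relations(rows):
    """Drop pairs labeled 'O' (no relation) from relation-extraction rows."""
    return [r for r in rows if r[-1] != "O"]

rows = [
    ("mastectomy", "Cancer_Surgery", "two months ago", "Relative_Date", "is_date_of"),
    ("tumor", "Tumor_Finding", "ER", "Biomarker", "O"),
]
filtered = keep_relations(rows)
```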
## Results ```bash " ******************** ner_oncology_wip results ******************** | chunk | ner_label | |:---------------|:-----------------| | left | Direction | | mastectomy | Cancer_Surgery | | left | Direction | | breast cancer | Cancer_Dx | | two months ago | Relative_Date | | tumor | Tumor_Finding | | positive | Biomarker_Result | | ER | Biomarker | | PR | Biomarker | ******************** ner_oncology_diagnosis_wip results ******************** | chunk | ner_label | |:--------------|:--------------| | breast cancer | Cancer_Dx | | tumor | Tumor_Finding | ******************** ner_oncology_tnm_wip results ******************** | chunk | ner_label | |:--------------|:------------| | breast cancer | Cancer_Dx | | tumor | Tumor | ******************** ner_oncology_therapy_wip results ******************** | chunk | ner_label | |:-----------|:---------------| | mastectomy | Cancer_Surgery | ******************** ner_oncology_test_wip results ******************** | chunk | ner_label | |:---------|:-----------------| | positive | Biomarker_Result | | ER | Biomarker | | PR | Biomarker | ******************** assertion_oncology_wip results ******************** | chunk | ner_label | assertion | |:--------------|:---------------|:------------| | mastectomy | Cancer_Surgery | Past | | breast cancer | Cancer_Dx | Present | | tumor | Tumor_Finding | Present | | ER | Biomarker | Present | | PR | Biomarker | Present | ******************** re_oncology_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:--------------|:-----------------|:---------------|:--------------|:--------------| | mastectomy | Cancer_Surgery | two months ago | Relative_Date | is_related_to | | breast cancer | Cancer_Dx | two months ago | Relative_Date | is_related_to | | tumor | Tumor_Finding | ER | Biomarker | O | | tumor | Tumor_Finding | PR | Biomarker | O | | positive | Biomarker_Result | ER | Biomarker | is_related_to | | positive | Biomarker_Result | PR | Biomarker 
| is_related_to | ******************** re_oncology_granular_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:--------------|:-----------------|:---------------|:--------------|:--------------| | mastectomy | Cancer_Surgery | two months ago | Relative_Date | is_date_of | | breast cancer | Cancer_Dx | two months ago | Relative_Date | is_date_of | | tumor | Tumor_Finding | ER | Biomarker | O | | tumor | Tumor_Finding | PR | Biomarker | O | | positive | Biomarker_Result | ER | Biomarker | is_finding_of | | positive | Biomarker_Result | PR | Biomarker | is_finding_of | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|oncology_general_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ChunkMergeModel - AssertionDLModel - PerceptronModel - DependencyParserModel - RelationExtractionModel - RelationExtractionModel --- layout: model title: English XlmRoBertaForQuestionAnswering (from teacookies) author: John Snow Labs name: xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265906 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-more_fine_tune_24465520-26265906` is an English model originally trained by `teacookies`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265906_en_4.0.0_3.0_1655985334529.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265906_en_4.0.0_3.0_1655985334529.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265906","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265906","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.xlm_roberta.fine_tune_24465520_26265906").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_more_fine_tune_24465520_26265906| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|888.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-more_fine_tune_24465520-26265906 --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_32_epochs30 TFWav2Vec2ForCTC from ying-tina author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab_32_epochs30 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_32_epochs30` is an English model originally trained by ying-tina. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_32_epochs30_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_32_epochs30_en_4.2.0_3.0_1664111484072.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_32_epochs30_en_4.2.0_3.0_1664111484072.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_32_epochs30', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_32_epochs30", lang = "en") val annotations = pipeline.transform(audioDF) ```
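The `audioDF` referenced above is a DataFrame holding raw audio samples as float arrays. A sketch of turning a mono 16-bit PCM WAV file into such an array (the 16 kHz rate and the `audio_content` column name are assumptions; the Spark step is commented out because it needs a live `spark` session):

```python
import io
import math
import struct
import wave

def wav_to_floats(source):
    """Read a mono 16-bit PCM WAV (path or file object) into floats in [-1.0, 1.0]."""
    with wave.open(source, "rb") as w:
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a 0.1 s, 440 Hz test tone at 16 kHz in memory so the example is self-contained.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    pcm = [int(32767 * math.sin(2 * math.pi * 440 * t / 16000)) for t in range(1600)]
    w.writeframes(struct.pack("<%dh" % len(pcm), *pcm))
buf.seek(0)

floats = wav_to_floats(buf)
print(len(floats))  # 1600
# audioDF = spark.createDataFrame([[floats]], ["audio_content"])  # assumed column name
```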
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_32_epochs30| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|354.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from flood) author: John Snow Labs name: xlmroberta_ner_flood_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `flood`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_flood_base_finetuned_panx_de_4.1.0_3.0_1660432859459.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_flood_base_finetuned_panx_de_4.1.0_3.0_1660432859459.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_flood_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_flood_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
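The `NerConverter` stage above merges the token-level IOB tags emitted by the classifier into entity chunks. A simplified pure-Python illustration of that merging (not the actual Spark NLP implementation; the example tokens are made up):

```python
def iob_to_chunks(tokens, tags):
    """Collect consecutive B-/I- tagged tokens into (text, label) chunks."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" tag or dangling I- closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

toks = ["Angela", "Merkel", "besucht", "Berlin"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(iob_to_chunks(toks, tags))  # [('Angela Merkel', 'PER'), ('Berlin', 'LOC')]
```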
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_flood_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/flood/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Hungarian BertForQuestionAnswering model (from mcsabai) author: John Snow Labs name: bert_qa_huBert_fine_tuned_hungarian_squadv1 date: 2022-06-02 tags: [hu, open_source, question_answering, bert] task: Question Answering language: hu edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `huBert-fine-tuned-hungarian-squadv1` is a Hungarian model originally trained by `mcsabai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_huBert_fine_tuned_hungarian_squadv1_hu_4.0.0_3.0_1654187962165.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_huBert_fine_tuned_hungarian_squadv1_hu_4.0.0_3.0_1654187962165.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_huBert_fine_tuned_hungarian_squadv1","hu") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_huBert_fine_tuned_hungarian_squadv1","hu") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("hu.answer_question.squad.bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_huBert_fine_tuned_hungarian_squadv1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|hu| |Size:|413.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mcsabai/huBert-fine-tuned-hungarian-squadv1 --- layout: model title: English image_classifier_vit_pond_image_classification_9 ViTForImageClassification from SummerChiam author: John Snow Labs name: image_classifier_vit_pond_image_classification_9 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_pond_image_classification_9` is an English model originally trained by SummerChiam. ## Predicted Entities `Normal`, `Boiling`, `Algae`, `NormalCement`, `NormalRain`, `BoilingNight`, `NormalNight` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_9_en_4.1.0_3.0_1660167226926.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_9_en_4.1.0_3.0_1660167226926.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_pond_image_classification_9", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_pond_image_classification_9", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_pond_image_classification_9| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Legal Agreements Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_agreements_bert date: 2023-03-05 tags: [en, legal, classification, clauses, agreements, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Agreements` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens.
If you have more than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Agreements`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_agreements_bert_en_1.0.0_3.0_1678050691083.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_agreements_bert_en_1.0.0_3.0_1678050691083.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_agreements_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
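The paragraph splitting (by multiline) mentioned in the description can be done with plain Python before building the DataFrame; a minimal sketch (the blank-line heuristic and the sample text are assumptions, not the workshop implementation):

```python
import re

def split_provisions(text):
    """Split a contract into provisions on blank lines, dropping empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = (
    "Section 1. Agreements.\nThis Agreement constitutes the entire agreement.\n\n"
    "Section 2. Notices.\nAll notices shall be in writing."
)
provisions = split_provisions(doc)
print(len(provisions))  # 2
# df = spark.createDataFrame([[p] for p in provisions]).toDF("text")  # needs `spark`
```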
## Results ```bash +-------+ |result| +-------+ |[Agreements]| |[Other]| |[Other]| |[Agreements]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_agreements_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Agreements 0.86 0.96 0.91 26 Other 0.97 0.90 0.93 39 accuracy - - 0.92 65 macro-avg 0.92 0.93 0.92 65 weighted-avg 0.93 0.92 0.92 65 ``` --- layout: model title: Detect biological concepts (biobert) author: John Snow Labs name: ner_bionlp_biobert date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect general biological entities like tissues, organisms, cells, etc., in text using a pretrained NER model.
## Predicted Entities `tissue_structure`, `Amino_acid`, `Simple_chemical`, `Organism_substance`, `Developing_anatomical_structure`, `Cell`, `Cancer`, `Cellular_component`, `Gene_or_gene_product`, `Immaterial_anatomical_entity`, `Organ`, `Organism`, `Pathological_formation`, `Organism_subdivision`, `Anatomical_system`, `Tissue` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_TUMOR/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_bionlp_biobert_en_3.0.0_3.0_1617260864949.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_bionlp_biobert_en_3.0.0_3.0_1617260864949.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_bionlp_biobert", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_bionlp_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu 
nlu.load("en.med_ner.bionlp.biobert").predict("""Put your text here.""") ```
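The per-entity scores in the benchmarking table below follow directly from the tp/fp/fn counts; a quick sketch of the formulas, checked against the `Cancer` row:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 4), round(recall, 4), round(f1, 4)

# Cancer row from the benchmarking table: tp=1384, fp=193, fn=144
print(prf(1384, 193, 144))  # (0.8776, 0.9058, 0.8915)
```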
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_bionlp_biobert| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Benchmarking ```bash +-------------------------------+------+-----+------+------+---------+------+------+ | entity| tp| fp| fn| total|precision|recall| f1| +-------------------------------+------+-----+------+------+---------+------+------+ | Organ| 123.0| 59.0| 50.0| 173.0| 0.6758| 0.711| 0.693| | Pathological_formation| 82.0| 32.0| 45.0| 127.0| 0.7193|0.6457|0.6805| | Organism_substance| 75.0| 8.0| 51.0| 126.0| 0.9036|0.5952|0.7177| | tissue_structure| 412.0|162.0| 53.0| 465.0| 0.7178| 0.886|0.7931| | Cellular_component| 188.0| 46.0| 61.0| 249.0| 0.8034| 0.755|0.7785| | Tissue| 244.0| 73.0| 51.0| 295.0| 0.7697|0.8271|0.7974| | Cancer|1384.0|193.0| 144.0|1528.0| 0.8776|0.9058|0.8915| |Developing_anatomical_structure| 10.0| 3.0| 11.0| 21.0| 0.7692|0.4762|0.5882| | Immaterial_anatomical_entity| 20.0| 16.0| 21.0| 41.0| 0.5556|0.4878|0.5195| | Gene_or_gene_product|3829.0|233.0|1045.0|4874.0| 0.9426|0.7856| 0.857| | Cell|1873.0|175.0| 231.0|2104.0| 0.9146|0.8902|0.9022| | Organism| 559.0|116.0| 79.0| 638.0| 0.8281|0.8762|0.8515| | Simple_chemical| 928.0|106.0| 421.0|1349.0| 0.8975|0.6879|0.7789| +-------------------------------+------+-----+------+------+---------+------+------+ +------------------+ | macro| +------------------+ |0.6589490623994527| +------------------+ +------------------+ | micro| +------------------+ |0.8407823737981155| +------------------+ ``` --- layout: model title: Legal Transactions with affiliates Clause Binary Classifier author: John Snow Labs name: legclf_transactions_with_affiliates_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true 
article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `transactions-with-affiliates` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `transactions-with-affiliates` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_transactions_with_affiliates_clause_en_1.0.0_3.2_1660123125455.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_transactions_with_affiliates_clause_en_1.0.0_3.2_1660123125455.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_transactions_with_affiliates_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
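As the description notes, several binary clause classifiers can be chained and their outputs collapsed into one True/False flag per clause type. A small post-processing sketch (the prediction values are mocked to match the Results section below, not real pipeline output):

```python
def clause_flags(predictions, clause_label):
    """True where the classifier predicted the target clause type."""
    return [p == clause_label for p in predictions]

# Mocked `category` results for four provisions:
preds = ["transactions-with-affiliates", "other", "other", "transactions-with-affiliates"]
print(clause_flags(preds, "transactions-with-affiliates"))
# [True, False, False, True]
```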
## Results ```bash +-------+ | result| +-------+ |[transactions-with-affiliates]| |[other]| |[other]| |[transactions-with-affiliates]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_transactions_with_affiliates_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.97 0.98 0.97 86 transactions-with-affiliates 0.92 0.88 0.90 26 accuracy - - 0.96 112 macro-avg 0.94 0.93 0.94 112 weighted-avg 0.95 0.96 0.96 112 ``` --- layout: model title: English image_classifier_vit_orcs_and_friends ViTForImageClassification from andy-0v0 author: John Snow Labs name: image_classifier_vit_orcs_and_friends date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_orcs_and_friends` is an English model originally trained by andy-0v0. ## Predicted Entities `ogre`, `goblin`, `orc`, `gremlin`, `troll` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_orcs_and_friends_en_4.1.0_3.0_1660167849859.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_orcs_and_friends_en_4.1.0_3.0_1660167849859.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_orcs_and_friends", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_orcs_and_friends", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_orcs_and_friends| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English asr_wav2vec2_base_rj_try_5 TFWav2Vec2ForCTC from rjrohit author: John Snow Labs name: pipeline_asr_wav2vec2_base_rj_try_5 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_rj_try_5` is an English model originally trained by rjrohit. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_rj_try_5_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_rj_try_5_en_4.2.0_3.0_1664102652691.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_rj_try_5_en_4.2.0_3.0_1664102652691.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_rj_try_5', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_rj_try_5", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_rj_try_5| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Fast Neural Machine Translation Model from English to Swahili author: John Snow Labs name: opus_mt_en_sw date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, sw, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `sw` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_sw_xx_2.7.0_2.4_1609171045871.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_sw_xx_2.7.0_2.4_1609171045871.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_sw", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_sw", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.sw').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_sw| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Spanish RobertaForQuestionAnswering (from PlanTL-GOB-ES) author: John Snow Labs name: roberta_qa_PlanTL_GOB_ES_roberta_large_bne_sqac date: 2022-06-20 tags: [es, open_source, question_answering, roberta] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-bne-sqac` is a Spanish model originally trained by `PlanTL-GOB-ES`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_PlanTL_GOB_ES_roberta_large_bne_sqac_es_4.0.0_3.0_1655736230510.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_PlanTL_GOB_ES_roberta_large_bne_sqac_es_4.0.0_3.0_1655736230510.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_PlanTL_GOB_ES_roberta_large_bne_sqac","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_PlanTL_GOB_ES_roberta_large_bne_sqac","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.sqac.roberta.large.by_PlanTL-GOB-ES").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_PlanTL_GOB_ES_roberta_large_bne_sqac| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|es| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne-sqac - https://arxiv.org/abs/2107.07253 - http://www.bne.es/en/Inicio/index.html - https://arxiv.org/abs/1907.11692 - https://github.com/PlanTL-GOB-ES/lm-spanish --- layout: model title: Legal Electronic Communications Clause Binary Classifier author: John Snow Labs name: legclf_electronic_communications_clause date: 2022-12-07 tags: [en, legal, classification, clauses, electronic_communications, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `electronic-communications` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, obtaining as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `electronic-communications`, `other` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_electronic_communications_clause_en_1.0.0_3.0_1670445124280.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_electronic_communications_clause_en_1.0.0_3.0_1670445124280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_electronic_communications_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
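The description above recommends pre-splitting long documents because the sentence embeddings accept up to 512 tokens. Below is a minimal sketch of paragraph splitting with an approximate token budget. This is plain Python, not a Spark NLP API, and the whitespace token count is only a rough stand-in for the model's real tokenizer.

```python
import re

def split_into_chunks(text, max_tokens=512):
    # Split on blank lines (paragraphs), then greedily pack paragraphs
    # into chunks that stay under the approximate token budget.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())  # rough whitespace token count
        if current and count + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = "First clause text here.\n\nSecond clause text here.\n\nThird clause."
print(split_into_chunks(doc, max_tokens=6))
```

Each returned chunk can then become one row of the `text` column in the DataFrame fed to the pipeline above.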
## Results ```bash +-------+ |result| +-------+ |[electronic-communications]| |[other]| |[other]| |[electronic-communications]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_electronic_communications_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support electronic-communications 1.00 1.00 1.00 25 other 1.00 1.00 1.00 73 accuracy - - 1.00 98 macro-avg 1.00 1.00 1.00 98 weighted-avg 1.00 1.00 1.00 98 ``` --- layout: model title: English T5ForConditionalGeneration Small Cased model (from erickfm) author: John Snow Labs name: t5_small_finetuned_bias date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-finetuned-bias` is an English model originally trained by `erickfm`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_finetuned_bias_en_4.3.0_3.0_1675125947945.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_finetuned_bias_en_4.3.0_3.0_1675125947945.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_finetuned_bias","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_finetuned_bias","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_finetuned_bias| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|284.4 MB| ## References - https://huggingface.co/erickfm/t5-small-finetuned-bias - https://github.com/rpryzant/neutralizing-bias --- layout: model title: Japanese BertForQuestionAnswering Large model (from KoichiYasuoka) author: John Snow Labs name: bert_qa_large_japanese_wikipedia_ud_head date: 2022-06-28 tags: [ja, open_source, bert, question_answering] task: Question Answering language: ja edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-japanese-wikipedia-ud-head` is a Japanese model originally trained by `KoichiYasuoka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_large_japanese_wikipedia_ud_head_ja_4.0.0_3.0_1656413609706.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_large_japanese_wikipedia_ud_head_ja_4.0.0_3.0_1656413609706.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_large_japanese_wikipedia_ud_head","ja") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["私の名前は何ですか?", "私の名前はクララで、私はバークレーに住んでいます。"]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_large_japanese_wikipedia_ud_head","ja") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("私の名前は何ですか?", "私の名前はクララで、私はバークレーに住んでいます。")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ja.answer_question.wikipedia.bert.large").predict("""私の名前は何ですか?|||"私の名前はクララで、私はバークレーに住んでいます。""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_large_japanese_wikipedia_ud_head| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ja| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/KoichiYasuoka/bert-large-japanese-wikipedia-ud-head - https://github.com/UniversalDependencies/UD_Japanese-GSDLUW --- layout: model title: Italian DistilBERT Embeddings (from Geotrend) author: John Snow Labs name: distilbert_embeddings_distilbert_base_it_cased date: 2022-04-12 tags: [distilbert, embeddings, it, open_source] task: Embeddings language: it edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-it-cased` is an Italian model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_it_cased_it_3.4.2_3.0_1649783835301.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_it_cased_it_3.4.2_3.0_1649783835301.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_it_cased","it") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Adoro Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_it_cased","it") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Adoro Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("it.embed.distilbert_base_it_cased").predict("""Adoro Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_base_it_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|it| |Size:|234.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/distilbert-base-it-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Translate Ewe to English Pipeline author: John Snow Labs name: translate_ee_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, ee, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `ee` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ee_en_xx_2.7.0_2.4_1609699221850.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ee_en_xx_2.7.0_2.4_1609699221850.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ee_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ee_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ee.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ee_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate Chinese to English Pipeline author: John Snow Labs name: translate_zh_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, zh, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. - source languages: `zh` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_zh_en_xx_2.7.0_2.4_1609685820690.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_zh_en_xx_2.7.0_2.4_1609685820690.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_zh_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_zh_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.zh.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_zh_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Amharic Named Entity Recognition (from mbeukman) author: John Snow Labs name: xlmroberta_ner_xlm_roberta_base_finetuned_ner_amharic date: 2022-05-17 tags: [xlm_roberta, ner, token_classification, am, open_source] task: Named Entity Recognition language: am edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-ner-amharic` is an Amharic model originally trained by `mbeukman`. ## Predicted Entities `PER`, `ORG`, `LOC`, `DATE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_ner_amharic_am_3.4.2_3.0_1652810198597.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_finetuned_ner_amharic_am_3.4.2_3.0_1652810198597.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_ner_amharic","am") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["ስካርቻ nlp እወዳለሁ"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_finetuned_ner_amharic","am") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("ስካርቻ nlp እወዳለሁ").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_xlm_roberta_base_finetuned_ner_amharic| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|am| |Size:|773.0 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-ner-amharic - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://www.apache.org/licenses/LICENSE-2.0 --- layout: model title: Fast Neural Machine Translation Model from English to Icelandic author: John Snow Labs name: opus_mt_en_is date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, is, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `en` - target languages: `is` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_is_xx_2.7.0_2.4_1609168104095.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_is_xx_2.7.0_2.4_1609168104095.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_is", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_is", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.is').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_is| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Embeddings Clinical author: John Snow Labs name: embeddings_clinical class: WordEmbeddingsModel language: en nav_key: models repository: clinical/models date: 2020-01-28 task: Embeddings edition: Healthcare NLP 2.4.0 spark_version: 2.4 tags: [clinical,licensed,embeddings,en] supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/embeddings_clinical_en_2.4.0_2.4_1580237286004.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/embeddings_clinical_en_2.4.0_2.4_1580237286004.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python model = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\ .setInputCols("document","token")\ .setOutputCol("word_embeddings") ``` ```scala val model = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models") .setInputCols("document","token") .setOutputCol("word_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.glove.clinical").predict("""Put your text here.""") ```
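The annotator above maps each token to a dense vector (the model card lists a 200-dimensional output), and downstream components typically compare such vectors with cosine similarity. A toy sketch in plain Python, using made-up 3-dimensional vectors purely for illustration (not real embeddings from this model):

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity: dot product over the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

vec_fever = [0.9, 0.1, 0.3]      # hypothetical embedding of "fever"
vec_pyrexia = [0.8, 0.2, 0.4]    # hypothetical embedding of "pyrexia"
vec_invoice = [-0.5, 0.9, -0.1]  # hypothetical embedding of "invoice"

print(cosine_similarity(vec_fever, vec_pyrexia))  # clinically close -> higher
print(cosine_similarity(vec_fever, vec_invoice))  # unrelated -> lower
```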
{:.model-param} ## Model Information {:.table-model} |---------------|---------------------| | Name: | embeddings_clinical | | Type: | WordEmbeddingsModel | | Compatibility: | Spark NLP 2.4.0+ | | License: | Licensed | | Edition: | Official | |Input labels: | [document, token] | |Output labels: | [word_embeddings] | | Language: | en | | Dimension: | 200.0 | {:.h2_title} ## Data Source Trained on PubMed corpora https://www.nlm.nih.gov/databases/download/pubmed_medline.html --- layout: model title: Part of Speech for Basque author: John Snow Labs name: pos_ud_bdt date: 2020-07-29 23:34:00 +0800 task: Part of Speech Tagging language: eu edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [pos, eu] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_bdt_eu_2.5.5_2.4_1596053577577.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_bdt_eu_2.5.5_2.4_1596053577577.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_bdt", "eu") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Iparraldeko erregea izateaz gain, mediku ingelesa eta anestesia eta higiene medikoa garatzen duen liderra da John Snow.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_bdt", "eu") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Iparraldeko erregea izateaz gain, mediku ingelesa eta anestesia eta higiene medikoa garatzen duen liderra da John Snow.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Iparraldeko erregea izateaz gain, mediku ingelesa eta anestesia eta higiene medikoa garatzen duen liderra da John Snow."""] pos_df = nlu.load('eu.pos').predict(text, output_level='token') pos_df ```
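The Results section shows the tagger's output as rows of (tag, word) annotations. A small plain-Python sketch of tallying tag frequencies from such pairs, with toy data mirroring the example output (not a Spark NLP API):

```python
from collections import Counter

# (tag, word) pairs mirroring the example Results rows.
annotations = [
    ("NOUN", "Iparraldeko"),
    ("NOUN", "erregea"),
    ("VERB", "izateaz"),
    ("NOUN", "gain"),
    ("PUNCT", ","),
]

# Count how often each part-of-speech tag occurs.
tag_counts = Counter(tag for tag, _ in annotations)
print(tag_counts.most_common())  # -> [('NOUN', 3), ('VERB', 1), ('PUNCT', 1)]
```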
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=10, result='NOUN', metadata={'word': 'Iparraldeko'}), Row(annotatorType='pos', begin=12, end=18, result='NOUN', metadata={'word': 'erregea'}), Row(annotatorType='pos', begin=20, end=26, result='VERB', metadata={'word': 'izateaz'}), Row(annotatorType='pos', begin=28, end=31, result='NOUN', metadata={'word': 'gain'}), Row(annotatorType='pos', begin=32, end=32, result='PUNCT', metadata={'word': ','}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_bdt| |Type:|pos| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|eu| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Legal Foodstuff Document Classifier (EURLEX) author: John Snow Labs name: legclf_foodstuff_bert date: 2023-03-06 tags: [en, legal, classification, clauses, foodstuff, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_foodstuff_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class Foodstuff or not (Binary Classification) according to EuroVoc labels. 
## Predicted Entities `Foodstuff`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_foodstuff_bert_en_1.0.0_3.0_1678111793995.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_foodstuff_bert_en_1.0.0_3.0_1678111793995.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_foodstuff_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
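As a quick sanity check on the scores reported in the Benchmarking section of this card, F1 is the harmonic mean of precision and recall. A minimal sketch, using the per-class precision/recall values from the benchmark table:

```python
# F1 = harmonic mean of precision and recall; the values below are taken
# from the Benchmarking table of this model card.
def f1_score(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.84, 0.84), 2))  # Foodstuff class
print(round(f1_score(0.81, 0.80), 2))  # Other class
```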
## Results ```bash +-------+ |result| +-------+ |[Foodstuff]| |[Other]| |[Other]| |[Foodstuff]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_foodstuff_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Foodstuff 0.84 0.84 0.84 669 Other 0.81 0.80 0.80 541 accuracy - - 0.82 1210 macro-avg 0.82 0.82 0.82 1210 weighted-avg 0.82 0.82 0.82 1210 ``` --- layout: model title: Translate English to German Pipeline author: John Snow Labs name: translate_en_de date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, de, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `de` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_de_xx_2.7.0_2.4_1609687050179.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_de_xx_2.7.0_2.4_1609687050179.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_de", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_de", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.de').predict(text, output_level='sentence') translate_df ```
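The `annotate` call returns a plain dictionary keyed by output column. A minimal sketch of joining the per-sentence translations back into one string; the dictionary below is illustrative (the German value is a made-up example, not actual model output):

```python
# Illustrative shape of an annotate() result (hypothetical values)
annotations = {
    "sentence": ["Your sentence to translate!"],
    "translation": ["Ihr Satz zum Übersetzen!"],
}

def join_translations(result: dict) -> str:
    # The translation annotator emits one entry per detected sentence;
    # join them back into a single text.
    return " ".join(result["translation"])

print(join_translations(annotations))
```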
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_de| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from Mossi to English author: John Snow Labs name: opus_mt_mos_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, mos, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `mos` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_mos_en_xx_2.7.0_2.4_1609163565820.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_mos_en_xx_2.7.0_2.4_1609163565820.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_mos_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Your text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_mos_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.mos.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_mos_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Romanian Lemmatizer author: John Snow Labs name: lemma date: 2020-05-05 11:16:00 +0800 task: Lemmatization language: ro edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [lemmatizer, ro] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_ro_2.5.0_2.4_1588666512149.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_ro_2.5.0_2.4_1588666512149.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "ro") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("În afară de a fi regele nordului, John Snow este un medic englez și un lider în dezvoltarea anesteziei și igienei medicale.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "ro") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("În afară de a fi regele nordului, John Snow este un medic englez și un lider în dezvoltarea anesteziei și igienei medicale.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""În afară de a fi regele nordului, John Snow este un medic englez și un lider în dezvoltarea anesteziei și igienei medicale."""] lemma_df = nlu.load('ro.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
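Conceptually, the lemmatizer maps each inflected surface form to its root via a context-aware dictionary lookup. A toy sketch of the core idea, with a few hypothetical Romanian entries (not the model's actual data):

```python
# Toy lemma dictionary (hypothetical entries, for illustration only)
lemma_dict = {"regele": "rege", "este": "fi", "anesteziei": "anestezie"}

def lemmatize(tokens):
    # Fall back to the surface form when no dictionary entry exists
    return [lemma_dict.get(token, token) for token in tokens]

print(lemmatize(["regele", "este", "aici"]))
```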
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=1, result='În', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=3, end=7, result='afară', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=9, end=10, result='de', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=12, end=12, result='al', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=14, end=15, result='fi', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|ro| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: English DistilBertForQuestionAnswering Base Cased model (from Moussab) author: John Snow Labs name: distilbert_qa_base_cased_led_squad_orkg_which_5e_05 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-orkg-which-5e-05` is an English model originally trained by `Moussab`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_orkg_which_5e_05_en_4.3.0_3.0_1672766921583.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_orkg_which_5e_05_en_4.3.0_3.0_1672766921583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_orkg_which_5e_05","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_orkg_which_5e_05","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
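Each answer annotation carries `begin`/`end` character offsets into the context; in Spark NLP these offsets are inclusive, so the answer text can be recovered by slicing. A sketch with toy offsets for the example above (illustrative values, not actual model output):

```python
context = "My name is Clara and I live in Berkeley."
# Hypothetical answer annotation for the question "What's my name?"
answer = {"begin": 11, "end": 15, "result": "Clara"}

# Spark NLP's `end` offset is inclusive, hence the +1 when slicing
extracted = context[answer["begin"]:answer["end"] + 1]
print(extracted)
```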
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_cased_led_squad_orkg_which_5e_05| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Moussab/distilbert-base-cased-distilled-squad-orkg-which-5e-05 --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_4 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-1024-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_4_en_4.0.0_3.0_1655730810758.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_4_en_4.0.0_3.0_1655730810758.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_1024d_seed_4").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_1024_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|439.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-1024-finetuned-squad-seed-4 --- layout: model title: Detect Persons, Locations, Organizations and Misc Entities in German (WikiNER 6B 300) author: John Snow Labs name: wikiner_6B_300 date: 2020-02-03 task: Named Entity Recognition language: de edition: Spark NLP 2.4.0 spark_version: 2.4 tags: [ner, de, open_source] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 6B 300 is trained with GloVe 6B 300 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_DE){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_DE.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_6B_300_de_2.4.0_2.4_1579717534653.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_6B_300_de_2.4.0_2.4_1579717534653.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang="xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("wikiner_6B_300", "de") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (* 28. Oktober 1955 in London) ist ein US-amerikanischer Geschäftsmann, Softwareentwickler, Investor und Philanthrop. Er ist bekannt als Mitbegründer der Microsoft Corporation. Während seiner Karriere bei Microsoft war Gates Vorsitzender, Chief Executive Officer (CEO), Präsident und Chief Software Architect und bis Mai 2014 der größte Einzelaktionär. Er ist einer der bekanntesten Unternehmer und Pioniere der Mikrocomputer-Revolution der 1970er und 1980er Jahre. Gates wurde in Seattle, Washington, geboren und wuchs dort auf. 1975 gründete er Microsoft zusammen mit seinem Freund aus Kindertagen, Paul Allen, in Albuquerque, New Mexico. Es entwickelte sich zum weltweit größten Unternehmen für Personal-Computer-Software. Gates leitete das Unternehmen als Chairman und CEO, bis er im Januar 2000 als CEO zurücktrat. Er blieb jedoch Chairman und wurde Chief Software Architect. In den späten neunziger Jahren wurde Gates für seine Geschäftstaktiken kritisiert, die als wettbewerbswidrig angesehen wurden. Diese Meinung wurde durch zahlreiche Gerichtsurteile bestätigt. Im Juni 2006 gab Gates bekannt, dass er eine Teilzeitstelle bei Microsoft und eine Vollzeitstelle bei der Bill & Melinda Gates Foundation, der privaten gemeinnützigen Stiftung, die er und seine Frau Melinda Gates im Jahr 2000 gegründet haben, übernehmen wird. 
Er übertrug seine Aufgaben nach und nach auf Ray Ozzie und Craig Mundie. Im Februar 2014 trat er als Vorsitzender von Microsoft zurück und übernahm eine neue Position als Technologieberater, um den neu ernannten CEO Satya Nadella zu unterstützen.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang="xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("wikiner_6B_300", "de") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (* 28. Oktober 1955 in London) ist ein US-amerikanischer Geschäftsmann, Softwareentwickler, Investor und Philanthrop. Er ist bekannt als Mitbegründer der Microsoft Corporation. Während seiner Karriere bei Microsoft war Gates Vorsitzender, Chief Executive Officer (CEO), Präsident und Chief Software Architect und bis Mai 2014 der größte Einzelaktionär. Er ist einer der bekanntesten Unternehmer und Pioniere der Mikrocomputer-Revolution der 1970er und 1980er Jahre. Gates wurde in Seattle, Washington, geboren und wuchs dort auf. 1975 gründete er Microsoft zusammen mit seinem Freund aus Kindertagen, Paul Allen, in Albuquerque, New Mexico. Es entwickelte sich zum weltweit größten Unternehmen für Personal-Computer-Software. Gates leitete das Unternehmen als Chairman und CEO, bis er im Januar 2000 als CEO zurücktrat. Er blieb jedoch Chairman und wurde Chief Software Architect. In den späten neunziger Jahren wurde Gates für seine Geschäftstaktiken kritisiert, die als wettbewerbswidrig angesehen wurden. Diese Meinung wurde durch zahlreiche Gerichtsurteile bestätigt. 
Im Juni 2006 gab Gates bekannt, dass er eine Teilzeitstelle bei Microsoft und eine Vollzeitstelle bei der Bill & Melinda Gates Foundation, der privaten gemeinnützigen Stiftung, die er und seine Frau Melinda Gates im Jahr 2000 gegründet haben, übernehmen wird. Er übertrug seine Aufgaben nach und nach auf Ray Ozzie und Craig Mundie. Im Februar 2014 trat er als Vorsitzender von Microsoft zurück und übernahm eine neue Position als Technologieberater, um den neu ernannten CEO Satya Nadella zu unterstützen.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (* 28. Oktober 1955 in London) ist ein US-amerikanischer Geschäftsmann, Softwareentwickler, Investor und Philanthrop. Er ist bekannt als Mitbegründer der Microsoft Corporation. Während seiner Karriere bei Microsoft war Gates Vorsitzender, Chief Executive Officer (CEO), Präsident und Chief Software Architect und bis Mai 2014 der größte Einzelaktionär. Er ist einer der bekanntesten Unternehmer und Pioniere der Mikrocomputer-Revolution der 1970er und 1980er Jahre. Gates wurde in Seattle, Washington, geboren und wuchs dort auf. 1975 gründete er Microsoft zusammen mit seinem Freund aus Kindertagen, Paul Allen, in Albuquerque, New Mexico. Es entwickelte sich zum weltweit größten Unternehmen für Personal-Computer-Software. Gates leitete das Unternehmen als Chairman und CEO, bis er im Januar 2000 als CEO zurücktrat. Er blieb jedoch Chairman und wurde Chief Software Architect. In den späten neunziger Jahren wurde Gates für seine Geschäftstaktiken kritisiert, die als wettbewerbswidrig angesehen wurden. Diese Meinung wurde durch zahlreiche Gerichtsurteile bestätigt. Im Juni 2006 gab Gates bekannt, dass er eine Teilzeitstelle bei Microsoft und eine Vollzeitstelle bei der Bill & Melinda Gates Foundation, der privaten gemeinnützigen Stiftung, die er und seine Frau Melinda Gates im Jahr 2000 gegründet haben, übernehmen wird. 
Er übertrug seine Aufgaben nach und nach auf Ray Ozzie und Craig Mundie. Im Februar 2014 trat er als Vorsitzender von Microsoft zurück und übernahm eine neue Position als Technologieberater, um den neu ernannten CEO Satya Nadella zu unterstützen."""] ner_df = nlu.load('de.ner.wikiner.glove.6B_300').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
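The `ner_converter` stage in the pipelines above groups token-level IOB tags into the entity chunks shown in the Results table. A simplified sketch of that grouping (illustrative, not Spark NLP's actual implementation):

```python
# Minimal IOB-to-chunk grouping: B- starts a new entity, I- continues it,
# anything else closes the current entity.
def iob_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(iob_to_chunks(["Paul", "Allen", "in", "Albuquerque"],
                    ["B-PER", "I-PER", "O", "B-LOC"]))
```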
{:.h2_title} ## Results ```bash +--------------------------+---------+ |chunk |ner_label| +--------------------------+---------+ |William Henry Gates III |PER | |London |LOC | |US-amerikanischer |MISC | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PER | |Chief Executive Officer |MISC | |CEO |ORG | |Mikrocomputer-Revolution |MISC | |Seattle |LOC | |Washington |LOC | |Microsoft |ORG | |Paul Allen |PER | |Albuquerque |LOC | |New Mexico |LOC | |Personal-Computer-Software|ORG | |Gates |ORG | |CEO |ORG | |CEO |ORG | |Chief Software Architect |MISC | +--------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wikiner_6B_300| |Type:|ner| |Compatibility:|Spark NLP 2.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|de| |Case sensitive:|false| {:.h2_title} ## Data Source The model is trained based on data from [https://de.wikipedia.org](https://de.wikipedia.org) --- layout: model title: English ALBERT Embeddings (Base, v1) author: John Snow Labs name: albert_embeddings_albert_base_v1 date: 2022-04-14 tags: [albert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: AlBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `albert-base-v1` is an English model originally trained by HuggingFace.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_base_v1_en_3.4.2_3.0_1649954125996.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_base_v1_en_3.4.2_3.0_1649954125996.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_base_v1","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_base_v1","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.albert_base_v1").predict("""I love Spark NLP""") ```
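A common downstream use of the `embeddings` column is comparing tokens by cosine similarity. A self-contained sketch with toy 4-dimensional vectors standing in for the model's real per-token embedding vectors:

```python
import math

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for per-token ALBERT embeddings
v_spark = [0.9, 0.1, 0.3, 0.0]
v_nlp = [0.8, 0.2, 0.4, 0.1]
print(round(cosine_similarity(v_spark, v_nlp), 3))
```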
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_embeddings_albert_base_v1| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|44.8 MB| |Case sensitive:|false| ## References - https://huggingface.co/albert-base-v1 - https://arxiv.org/abs/1909.11942 - https://github.com/google-research/albert - https://yknzhu.wixsite.com/mbweb - https://en.wikipedia.org/wiki/English_Wikipedia --- layout: model title: Explain Document Pipeline for Russian author: John Snow Labs name: explain_document_md date: 2021-03-22 tags: [open_source, russian, explain_document_md, pipeline, ru] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: ru edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_md is a pretrained pipeline that processes text with a set of basic annotators. It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_ru_3.0.0_3.0_1616432725151.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_md_ru_3.0.0_3.0_1616432725151.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_md', lang = 'ru') annotations = pipeline.fullAnnotate("Здравствуйте из Джона Снежных Лабораторий! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_md", lang = "ru") val result = pipeline.fullAnnotate("Здравствуйте из Джона Снежных Лабораторий! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Здравствуйте из Джона Снежных Лабораторий! "] result_df = nlu.load('ru.explain.md').predict(text) result_df ```
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:------------------------------------------------|:-----------------------------------------------|:-----------------------------------------------------------|:-----------------------------------------------------------|:-------------------------------------------|:-----------------------------|:--------------------------------------|:-------------------------------| | 0 | ['Здравствуйте из Джона Снежных Лабораторий! '] | ['Здравствуйте из Джона Снежных Лабораторий!'] | ['Здравствуйте', 'из', 'Джона', 'Снежных', 'Лабораторий!'] | ['здравствовать', 'из', 'Джон', 'Снежных', 'Лабораторий!'] | ['NOUN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.0, 0.0, 0.0, 0.0,.,...]] | ['O', 'O', 'B-LOC', 'I-LOC', 'I-LOC'] | ['Джона Снежных Лабораторий!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_md| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|ru| --- layout: model title: Clinical Deidentification Pipeline (Arabic) author: John Snow Labs name: clinical_deidentification date: 2023-06-14 tags: [licensed, deidentification, clinical, ar, pipeline] task: De-identification language: ar edition: Healthcare NLP 4.4.4 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to deidentify Arabic PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. 
The pipeline can mask and obfuscate CONTACT, NAME, DATE, ID, LOCATION, AGE, PATIENT, HOSPITAL, ORGANIZATION, CITY, STREET, USERNAME, SEX, IDNUM, EMAIL, ZIP, MEDICALRECORD, PROFESSION, PHONE, COUNTRY, DOCTOR, SSN, ACCOUNT, LICENSE, DLN and VIN entities. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_ar_4.4.4_3.0_1686734212603.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/clinical_deidentification_ar_4.4.4_3.0_1686734212603.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "ar", "clinical/models") text = ''' ملاحظات سريرية - مريض الربو: التاريخ: 30 مايو 2023 اسم المريضة: ليلى حسن تم تسجيل المريض في النظام باستخدام رقم الضمان الاجتماعي 123456789012. العنوان: شارع المعرفة، مبنى رقم 789، حي الأمانة، جدة الرمز البريدي: 54321 البلد: المملكة العربية السعودية اسم المستشفى: مستشفى النور اسم الطبيب: د. أميرة أحمد تفاصيل الحالة: المريضة ليلى حسن، البالغة من العمر 35 عامًا، تعاني من مرض الربو المزمن. تشكو من ضيق التنفس والسعال المتكرر والشهيق الشديد. تم تشخيصها بمرض الربو بناءً على تاريخها الطبي واختبارات وظائف الرئة. الخطة: تم وصف مضادات الالتهاب غير الستيرويدية والموسعات القصبية لتحسين التنفس وتقليل التهيج. يجب على المريضة حمل معها جهاز الاستنشاق في حالة حدوث نوبة ربو حادة. يتعين على المريضة تجنب التحسس من العوامل المسببة للربو، مثل الدخان والغبار والحيوانات الأليفة. يجب مراقبة وظائف الرئة بانتظام ومتابعة التعليمات الطبية المتعلقة بمرض الربو. تعليم المريضة بشأن كيفية استخدام جهاز الاستنشاق بشكل صحيح وتقنيات التنفس الصحيح ''' result = deid_pipeline.annotate(text) print("\nMasked with entity labels") print("-"*30) print("\n".join(result['masked_with_entity'])) print("\nMasked with chars") print("-"*30) print("\n".join(result['masked_with_chars'])) print("\nMasked with fixed length chars") print("-"*30) print("\n".join(result['masked_fixed_length_chars'])) print("\nObfuscated") print("-"*30) print("\n".join(result['obfuscated'])) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val deid_pipeline = new PretrainedPipeline("clinical_deidentification","ar","clinical/models") val text = ''' ملاحظات سريرية - مريض الربو: التاريخ: 30 مايو 2023 اسم المريضة: ليلى حسن تم تسجيل المريض في النظام باستخدام رقم الضمان الاجتماعي 123456789012. 
العنوان: شارع المعرفة، مبنى رقم 789، حي الأمانة، جدة الرمز البريدي: 54321 البلد: المملكة العربية السعودية اسم المستشفى: مستشفى النور اسم الطبيب: د. أميرة أحمد تفاصيل الحالة: المريضة ليلى حسن، البالغة من العمر 35 عامًا، تعاني من مرض الربو المزمن. تشكو من ضيق التنفس والسعال المتكرر والشهيق الشديد. تم تشخيصها بمرض الربو بناءً على تاريخها الطبي واختبارات وظائف الرئة. الخطة: تم وصف مضادات الالتهاب غير الستيرويدية والموسعات القصبية لتحسين التنفس وتقليل التهيج. يجب على المريضة حمل معها جهاز الاستنشاق في حالة حدوث نوبة ربو حادة. يتعين على المريضة تجنب التحسس من العوامل المسببة للربو، مثل الدخان والغبار والحيوانات الأليفة. يجب مراقبة وظائف الرئة بانتظام ومتابعة التعليمات الطبية المتعلقة بمرض الربو. تعليم المريضة بشأن كيفية استخدام جهاز الاستنشاق بشكل صحيح وتقنيات التنفس الصحيح ''' val result = deid_pipeline.annotate(text) ```
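The pipeline returns each de-identification style under its own output key, the same keys the Python example above reads: `masked_with_entity`, `masked_with_chars`, `masked_fixed_length_chars`, and `obfuscated`. As a minimal pure-Python sketch of selecting one style from an `annotate()` result, the stub dict below stands in for the real pipeline output:

```python
# Stub standing in for deid_pipeline.annotate(text); the real result holds
# one string per detected sentence under each masking-style key.
result = {
    "masked_with_entity": ["اسم المريضة: [المريض]"],
    "masked_with_chars": ["اسم المريضة: [٭٭٭٭٭٭]"],
    "masked_fixed_length_chars": ["اسم المريضة: ٭٭٭٭"],
    "obfuscated": ["اسم المريضة: رياض حسيبي"],
}

def deid_text(result, style="obfuscated"):
    """Join the sentence-level outputs for the chosen masking style."""
    return "\n".join(result[style])

print(deid_text(result, "masked_with_entity"))
```

Any of the four keys can be passed as `style`, so downstream code can pick masking or obfuscation per use case without re-running the pipeline.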
## Results ```bash Masked with entity labels ------------------------------ ملاحظات سريرية - مريض الربو: التاريخ: [تاريخ] [تاريخ] اسم المريضة: [المريض] تم تسجيل المريض في النظام باستخدام رقم الضمان الاجتماعي [هاتف]. العنوان: شارع المعرفة، مبنى رقم [الرمز البريدي] حي [الموقع] الرمز البريدي: [الرمز البريدي] البلد: [المدينة] [البلد] اسم المستشفى: [الموقع] اسم الطبيب: د. [دكتور] تفاصيل الحالة: المريضة [المريض] حسن، البالغة من العمر [العمر]عامًا، تعاني من مرض الربو المزمن. تشكو من ضيق التنفس والسعال المتكرر والشهيق الشديد. تم تشخيصها بمرض الربو بناءً على تاريخها الطبي واختبارات وظائف الرئة. الخطة: تم وصف مضادات الالتهاب غير الستيرويدية والموسعات القصبية لتحسين التنفس وتقليل التهيج. يجب على المريضة حمل معها جهاز الاستنشاق في حالة حدوث نوبة ربو حادة. يتعين على المريضة تجنب التحسس من العوامل المسببة للربو، مثل الدخان والغبار والحيوانات الأليفة. يجب مراقبة وظائف الرئة بانتظام ومتابعة التعليمات الطبية المتعلقة بمرض الربو. تعليم المريضة بشأن كيفية استخدام جهاز الاستنشاق بشكل صحيح وتقنيات التنفس الصحيح Masked with chars ------------------------------ ملاحظات سريرية - مريض الربو: التاريخ: [٭٭٭٭٭] [٭٭] اسم المريضة: [٭٭٭٭٭٭] تم تسجيل المريض في النظام باستخدام رقم الضمان الاجتماعي [٭٭٭٭٭٭٭٭٭٭]. العنوان: شارع المعرفة، مبنى رقم [٭٭] حي [٭٭٭٭٭٭٭٭٭٭] الرمز البريدي: [٭٭٭] البلد: [٭٭٭٭٭٭٭٭٭٭٭٭٭] [٭٭٭٭٭٭] اسم المستشفى: [٭٭٭٭٭٭٭٭٭٭] اسم الطبيب: د. [٭٭٭٭٭٭٭٭] تفاصيل الحالة: المريضة [٭٭] حسن، البالغة من العمر ٭٭عامًا، تعاني من مرض الربو المزمن. تشكو من ضيق التنفس والسعال المتكرر والشهيق الشديد. تم تشخيصها بمرض الربو بناءً على تاريخها الطبي واختبارات وظائف الرئة. الخطة: تم وصف مضادات الالتهاب غير الستيرويدية والموسعات القصبية لتحسين التنفس وتقليل التهيج. يجب على المريضة حمل معها جهاز الاستنشاق في حالة حدوث نوبة ربو حادة. يتعين على المريضة تجنب التحسس من العوامل المسببة للربو، مثل الدخان والغبار والحيوانات الأليفة. يجب مراقبة وظائف الرئة بانتظام ومتابعة التعليمات الطبية المتعلقة بمرض الربو. 
تعليم المريضة بشأن كيفية استخدام جهاز الاستنشاق بشكل صحيح وتقنيات التنفس الصحيح Masked with fixed length chars ------------------------------ ملاحظات سريرية - مريض الربو: التاريخ: ٭٭٭٭ ٭٭٭٭ اسم المريضة: ٭٭٭٭ تم تسجيل المريض في النظام باستخدام رقم الضمان الاجتماعي ٭٭٭٭. العنوان: شارع المعرفة، مبنى رقم ٭٭٭٭ حي ٭٭٭٭ الرمز البريدي: ٭٭٭٭ البلد: ٭٭٭٭ ٭٭٭٭ اسم المستشفى: ٭٭٭٭ اسم الطبيب: د. ٭٭٭٭ تفاصيل الحالة: المريضة ٭٭٭٭ حسن، البالغة من العمر ٭٭٭٭عامًا، تعاني من مرض الربو المزمن. تشكو من ضيق التنفس والسعال المتكرر والشهيق الشديد. تم تشخيصها بمرض الربو بناءً على تاريخها الطبي واختبارات وظائف الرئة. الخطة: تم وصف مضادات الالتهاب غير الستيرويدية والموسعات القصبية لتحسين التنفس وتقليل التهيج. يجب على المريضة حمل معها جهاز الاستنشاق في حالة حدوث نوبة ربو حادة. يتعين على المريضة تجنب التحسس من العوامل المسببة للربو، مثل الدخان والغبار والحيوانات الأليفة. يجب مراقبة وظائف الرئة بانتظام ومتابعة التعليمات الطبية المتعلقة بمرض الربو. تعليم المريضة بشأن كيفية استخدام جهاز الاستنشاق بشكل صحيح وتقنيات التنفس الصحيح Obfuscated ------------------------------ ملاحظات سريرية - مريض الربو: التاريخ: 30 يونيو 2024 اسم المريضة: رياض حسيبي تم تسجيل المريض في النظام باستخدام رقم الضمان الاجتماعي 217492240818. العنوان: شارع المعرفة، مبنى رقم 616، حي شارع الحرية الرمز البريدي: 32681 البلد: تاونات موريشيوس اسم المستشفى: شارع الجزر الموريسية اسم الطبيب: د. ميرا نورة تفاصيل الحالة: المريضة ربى شحاتة حسن، البالغة من العمر 26عامًا، تعاني من مرض الربو المزمن. تشكو من ضيق التنفس والسعال المتكرر والشهيق الشديد. تم تشخيصها بمرض الربو بناءً على تاريخها الطبي واختبارات وظائف الرئة. الخطة: تم وصف مضادات الالتهاب غير الستيرويدية والموسعات القصبية لتحسين التنفس وتقليل التهيج. يجب على المريضة حمل معها جهاز الاستنشاق في حالة حدوث نوبة ربو حادة. يتعين على المريضة تجنب التحسس من العوامل المسببة للربو، مثل الدخان والغبار والحيوانات الأليفة. يجب مراقبة وظائف الرئة بانتظام ومتابعة التعليمات الطبية المتعلقة بمرض الربو. 
تعليم المريضة بشأن كيفية استخدام جهاز الاستنشاق بشكل صحيح وتقنيات التنفس الصحيح ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|clinical_deidentification| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|ar| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ContextualParserModel - ChunkMergeModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - DeIdentificationModel - Finisher --- layout: model title: Fast Neural Machine Translation Model from Basque (Family) to English author: John Snow Labs name: opus_mt_euq_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, euq, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `euq` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_euq_en_xx_2.7.0_2.4_1609164976603.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_euq_en_xx_2.7.0_2.4_1609164976603.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_euq_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_euq_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.euq.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_euq_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Stopwords Remover for Kannada language (82 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, kn, open_source] task: Stop Words Removal language: kn edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_kn_3.4.1_3.0_1646672976164.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_kn_3.4.1_3.0_1646672976164.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","kn") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["ನೀವು ನನಗೆ ಉತ್ತಮವಾಗಿಲ್ಲ"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stopWords = StopWordsCleaner.pretrained("stopwords_iso","kn") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stopWords)) val data = Seq("ನೀವು ನನಗೆ ಉತ್ತಮವಾಗಿಲ್ಲ").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("kn.stopwords").predict("""ನೀವು ನನಗೆ ಉತ್ತಮವಾಗಿಲ್ಲ""") ```
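Conceptually, `StopWordsCleaner` just filters the token stream against the 82-entry Kannada stopword list. A minimal pure-Python sketch of that filtering step follows; the two stopwords below are illustrative assumptions, not a dump of the actual list:

```python
# Illustrative subset of a stopword list (assumed entries, for this sketch only).
stopwords = {"ನೀವು", "ನನಗೆ"}

tokens = ["ನೀವು", "ನನಗೆ", "ಉತ್ತಮವಾಗಿಲ್ಲ"]

# Keep only tokens that are not stopwords, preserving order.
clean_tokens = [t for t in tokens if t not in stopwords]
print(clean_tokens)  # ['ಉತ್ತಮವಾಗಿಲ್ಲ']
```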
## Results ```bash +--------------+ |result | +--------------+ |[ಉತ್ತಮವಾಗಿಲ್ಲ]| +--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|kn| |Size:|1.8 KB| --- layout: model title: English LongformerForQuestionAnswering model (Squad dataset) author: John Snow Labs name: longformer_qa_base_4096_finetuned_squadv2 date: 2022-06-26 tags: [en, open_source, longformer, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: LongformerForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `longformer-base-4096-finetuned-squadv2` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/longformer_qa_base_4096_finetuned_squadv2_en_4.0.0_3.0_1656255377204.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/longformer_qa_base_4096_finetuned_squadv2_en_4.0.0_3.0_1656255377204.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_base_4096_finetuned_squadv2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = LongformerForQuestionAnswering.pretrained("longformer_qa_base_4096_finetuned_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.longformer.base_v2").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|longformer_qa_base_4096_finetuned_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|551.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/longformer-base-4096-finetuned-squadv2 - https://rajpurkar.github.io/SQuAD-explorer/ - https://arxiv.org/abs/2004.05150 --- layout: model title: English DistilBertForQuestionAnswering model (from andi611) author: John Snow Labs name: distilbert_qa_base_uncased_qa_with_ner date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-qa-with-ner` is an English model originally trained by `andi611`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_qa_with_ner_en_4.0.0_3.0_1654727168244.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_qa_with_ner_en_4.0.0_3.0_1654727168244.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_qa_with_ner","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_qa_with_ner","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.conll.distil_bert.base_uncased").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_qa_with_ner| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/andi611/distilbert-base-uncased-qa-with-ner --- layout: model title: Word2Vec Embeddings in Walloon (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, wa, open_source] task: Embeddings language: wa edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_wa_3.4.1_3.0_1647466983817.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_wa_3.4.1_3.0_1647466983817.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","wa") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","wa") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("wa.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|wa| |Size:|92.6 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Temporality / Certainty Assertion Status (sm) author: John Snow Labs name: legassertion_time date: 2022-09-27 tags: [en, licensed] task: Assertion Status language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a small Assertion Status model aimed at detecting temporality (PRESENT, PAST, FUTURE) or certainty (POSSIBLE) in legal documents. ## Predicted Entities `PRESENT`, `PAST`, `FUTURE`, `POSSIBLE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/LEGASSERTION_TEMPORALITY){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legassertion_time_en_1.0.0_3.0_1664274039847.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legassertion_time_en_1.0.0_3.0_1664274039847.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python # YOUR NER HERE # ... embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") chunk_converter = nlp.ChunkConverter() \ .setInputCols(["entity"]) \ .setOutputCol("ner_chunk") assertion = legal.AssertionDLModel.pretrained("legassertion_time", "en", "legal/models")\ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, tokenizer, embeddings, ner, chunk_converter, assertion ]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) lp = nlp.LightPipeline(model) texts = ["The subsidiaries of Atlantic Inc will participate in a merging operation", "The Conditions and Warranties of this agreement might be modified"] lp.annotate(texts) ```
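Each NER chunk is paired positionally with one assertion label, which is how the Results rows below combine `chunk` and `assertion`. A small sketch of zipping the two output columns of an `annotate()` call; the stub dicts mirror the documented example rather than a live pipeline run:

```python
# Stubs standing in for lp.annotate(texts): one dict per input text, with
# positionally aligned chunk and assertion lists.
annotations = [
    {"ner_chunk": ["Atlantic Inc"], "assertion": ["FUTURE"]},
    {"ner_chunk": ["Conditions and Warranties"], "assertion": ["POSSIBLE"]},
]

# Pair each detected chunk with its assertion status, per input text.
pairs = [list(zip(a["ner_chunk"], a["assertion"])) for a in annotations]
print(pairs)
```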
## Results ```bash chunk,begin,end,entity_type,assertion Atlantic Inc,20,31,ORG,FUTURE chunk,begin,end,entity_type,assertion Conditions and Warranties,4,28,DOC,POSSIBLE ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legassertion_time| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, doc_chunk, embeddings]| |Output Labels:|[assertion]| |Language:|en| |Size:|2.2 MB| ## References In-house annotations on financial and legal corpora ## Benchmarking ```bash label tp fp fn prec rec f1 PRESENT 201 11 16 0.9481132 0.9262672 0.937063 POSSIBLE 171 3 6 0.9827586 0.9661017 0.974359 FUTURE 119 6 4 0.952 0.9674796 0.959677 PAST 270 16 10 0.9440559 0.9642857 0.954063 Macro-average 761 36 36 0.9567319 0.9560336 0.9563826 Micro-average 761 36 36 0.9548306 0.9548306 0.9548306 ``` --- layout: model title: ICD10 to ICD9 Code Mapping author: John Snow Labs name: icd10_icd9_mapping date: 2021-12-22 tags: [icd10, icd9, en, clinical, licensed, code_mapping] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps ICD10 codes to ICD9 codes without using any text data. You just feed comma- or whitespace-delimited ICD10 codes, and it returns the corresponding ICD9 codes as a list.
{:.btn-box} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.1.Healthcare_Code_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10_icd9_mapping_en_3.3.4_2.4_1640175449509.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10_icd9_mapping_en_3.3.4_2.4_1640175449509.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("icd10_icd9_mapping", "en", "clinical/models") pipeline.annotate('E669 R630 J988') ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("icd10_icd9_mapping", "en", "clinical/models") val result = pipeline.annotate("E669 R630 J988") ``` {:.nlu-block} ```python import nlu nlu.load("en.icd10_icd9.mapping").predict("""E669 R630 J988""") ```
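The `annotate()` call returns the parsed ICD10 codes and their ICD9 equivalents as two positionally aligned lists (see the Results section). A minimal post-processing sketch that turns them into a code-to-code lookup; the `result` dict below mirrors the documented output rather than a live pipeline call:

```python
# Stub mirroring the documented output of pipeline.annotate('E669 R630 J988').
result = {
    "document": ["E669 R630 J988"],
    "icd10": ["E669", "R630", "J988"],
    "icd9": ["27800", "7830", "5198"],
}

# The two lists are aligned index by index, so zip() yields the mapping.
icd10_to_icd9 = dict(zip(result["icd10"], result["icd9"]))
print(icd10_to_icd9)  # {'E669': '27800', 'R630': '7830', 'J988': '5198'}
```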
## Results ```bash {'document': ['E669 R630 J988'], 'icd10': ['E669', 'R630', 'J988'], 'icd9': ['27800', '7830', '5198']} Note: | ICD10 | Details | | ---------- | ----------------------------:| | E669 | Obesity | | R630 | Anorexia | | J988 | Other specified respiratory disorders | | ICD9 | Details | | ---------- | ---------------------------:| | 27800 | Obesity | | 7830 | Anorexia | | 5198 | Other diseases of respiratory system | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd10_icd9_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|545.2 KB| ## Included Models - DocumentAssembler - TokenizerModel - LemmatizerModel --- layout: model title: English DistilBertForQuestionAnswering Cased model (from LucasS) author: John Snow Labs name: distilbert_qa_absa date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilBertABSA` is an English model originally trained by `LucasS`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_absa_en_4.3.0_3.0_1672766295348.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_absa_en_4.3.0_3.0_1672766295348.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_absa","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_absa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_absa| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/LucasS/distilBertABSA --- layout: model title: Finnish XlmRoBertaForQuestionAnswering (from Gantenbein) author: John Snow Labs name: xlm_roberta_qa_ADDI_FI_XLM_R date: 2022-06-23 tags: [fi, open_source, question_answering, xlmroberta] task: Question Answering language: fi edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ADDI-FI-XLM-R` is a Finnish model originally trained by `Gantenbein`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_ADDI_FI_XLM_R_fi_4.0.0_3.0_1655983152909.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_ADDI_FI_XLM_R_fi_4.0.0_3.0_1655983152909.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_ADDI_FI_XLM_R","fi") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_ADDI_FI_XLM_R","fi") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("fi.answer_question.xlm_roberta").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_ADDI_FI_XLM_R| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|fi| |Size:|777.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Gantenbein/ADDI-FI-XLM-R --- layout: model title: English image_classifier_vit_shirt_identifier ViTForImageClassification from b25mayank3 author: John Snow Labs name: image_classifier_vit_shirt_identifier date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_shirt_identifier` is an English model originally trained by b25mayank3. ## Predicted Entities `Big Check shirt`, `Formal Shirt`, `casual shirt`, `denim shirt` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_shirt_identifier_en_4.1.0_3.0_1660171643901.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_shirt_identifier_en_4.1.0_3.0_1660171643901.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_shirt_identifier", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_shirt_identifier", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_shirt_identifier| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Word2Vec Embeddings in Serbian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, sr, open_source] task: Embeddings language: sr edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sr_3.4.1_3.0_1647456008397.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_sr_3.4.1_3.0_1647456008397.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Волим СПАРК НЛП"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Волим СПАРК НЛП").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("sr.embed.w2v_cc_300d").predict("""Волим СПАРК НЛП""") ```
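Downstream, the 300-dimensional vectors in the `embeddings` column are typically compared with cosine similarity. A self-contained sketch of that computation (the short vectors here are toy stand-ins for the model's 300-d output, not real embeddings):

```python
import math

def cosine_similarity(u, v):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d stand-ins for 300-d word vectors: identical vectors score 1.0.
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))  # 1.0
```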
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|sr| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from Moussab) author: John Snow Labs name: roberta_qa_deepset_base_squad2_orkg_no_label_1e_4 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deepset-roberta-base-squad2-orkg-no-label-1e-4` is an English model originally trained by `Moussab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_no_label_1e_4_en_4.3.0_3.0_1674209605485.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_deepset_base_squad2_orkg_no_label_1e_4_en_4.3.0_3.0_1674209605485.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_no_label_1e_4","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_deepset_base_squad2_orkg_no_label_1e_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_deepset_base_squad2_orkg_no_label_1e_4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Moussab/deepset-roberta-base-squad2-orkg-no-label-1e-4 --- layout: model title: English RobertaForSequenceClassification Cased model (from liamcripwell) author: John Snow Labs name: roberta_classifier_ctrl44_clf date: 2022-12-09 tags: [en, open_source, roberta, sequence_classification, classification, tensorflow] task: Text Classification language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ctrl44-clf` is an English model originally trained by `liamcripwell`. ## Predicted Entities `rephrase`, `ignore`, `syntax-split`, `discourse-split` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_classifier_ctrl44_clf_en_4.2.4_3.0_1670624790107.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_classifier_ctrl44_clf_en_4.2.4_3.0_1670624790107.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_ctrl44_clf","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_classifier]) data = spark.createDataFrame([["I love you!"], ["I feel lucky to be here."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_ctrl44_clf","en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_classifier)) val data = Seq("I love you!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
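Under the hood, a sequence classifier like this one scores each label and picks the one with the highest probability. A generic softmax sketch over toy logits, using this card's predicted-entity names (the logit values are made up for illustration):

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["rephrase", "ignore", "syntax-split", "discourse-split"]
logits = [2.0, 0.5, 1.0, -1.0]  # toy values, not real model output
probs = softmax(logits)
print(labels[probs.index(max(probs))])  # rephrase
```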
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_classifier_ctrl44_clf| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|467.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/liamcripwell/ctrl44-clf --- layout: model title: English Legal RoBERTa Embeddings (CaseLaw, Base, Cased) author: John Snow Labs name: roberta_embeddings_legal_roberta_base date: 2022-04-14 tags: [roberta, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `legal-roberta-base` is an English model originally trained by `saibo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_legal_roberta_base_en_3.4.2_3.0_1649946738367.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_legal_roberta_base_en_3.4.2_3.0_1649946738367.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.legal_roberta_base").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_legal_roberta_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|468.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/saibo/legal-roberta-base - https://www.kaggle.com/uspto/patent-litigations - https://case.law/ - https://www.kaggle.com/bigquery/patents - https://www.kaggle.com/sohier/beyond-queries-exploring-the-bigquery-api --- layout: model title: Vietnamese asr_Fine_Tune_XLSR_Wav2Vec2_Speech2Text_Vietnamese TFWav2Vec2ForCTC from leduytan93 author: John Snow Labs name: asr_Fine_Tune_XLSR_Wav2Vec2_Speech2Text_Vietnamese date: 2022-09-26 tags: [wav2vec2, vi, audio, open_source, asr] task: Automatic Speech Recognition language: vi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Fine_Tune_XLSR_Wav2Vec2_Speech2Text_Vietnamese` is a Vietnamese model originally trained by leduytan93. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_Fine_Tune_XLSR_Wav2Vec2_Speech2Text_Vietnamese_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Fine_Tune_XLSR_Wav2Vec2_Speech2Text_Vietnamese_vi_4.2.0_3.0_1664197522821.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Fine_Tune_XLSR_Wav2Vec2_Speech2Text_Vietnamese_vi_4.2.0_3.0_1664197522821.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_Fine_Tune_XLSR_Wav2Vec2_Speech2Text_Vietnamese", "vi")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_Fine_Tune_XLSR_Wav2Vec2_Speech2Text_Vietnamese", "vi") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
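The snippets above assume an existing `audioDf` with an `audio_content` column holding float arrays sampled at 16 kHz. A hedged, standard-library-only sketch of turning a mono 16-bit PCM WAV file into such floats (`librosa.load` is the more common route; `sample.wav` is a placeholder file name):

```python
import struct
import wave

def wav_to_floats(path):
    # Decode mono 16-bit PCM samples into the [-1.0, 1.0) floats that
    # Wav2Vec2ForCTC expects in the "audio_content" column.
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Hypothetical usage with an active SparkSession `spark`:
# audioDf = spark.createDataFrame([[wav_to_floats("sample.wav")]], ["audio_content"])
```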
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_Fine_Tune_XLSR_Wav2Vec2_Speech2Text_Vietnamese| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|vi| |Size:|1.2 GB| --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_32_epochs30 TFWav2Vec2ForCTC from ying-tina author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab_32_epochs30 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_32_epochs30` is an English model originally trained by ying-tina. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_32_epochs30_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_32_epochs30_en_4.2.0_3.0_1664111278307.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_32_epochs30_en_4.2.0_3.0_1664111278307.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_32_epochs30", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_32_epochs30", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab_32_epochs30| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|354.9 MB| --- layout: model title: Pipeline to Detect Cellular/Molecular Biology Entities author: John Snow Labs name: bert_token_classifier_ner_jnlpba_cellular_pipeline date: 2023-03-20 tags: [en, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [bert_token_classifier_ner_jnlpba_cellular](https://nlp.johnsnowlabs.com/2022/07/25/bert_token_classifier_ner_jnlpba_cellular_en_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jnlpba_cellular_pipeline_en_4.3.0_3.2_1679303520732.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jnlpba_cellular_pipeline_en_4.3.0_3.2_1679303520732.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_jnlpba_cellular_pipeline", "en", "clinical/models") text = '''The results suggest that activation of protein kinase C, but not new protein synthesis, is required for IL-2 induction of IFN-gamma and GM-CSF cytoplasmic mRNA. It also was observed that suppression of cytokine gene expression by these agents was independent of the inhibition of proliferation. These data indicate that IL-2 and IL-12 may have distinct signaling pathways leading to the induction of IFN-gammaand GM-CSFgene expression, andthatthe NK3.3 cell line may serve as a novel model for dissecting the biochemical and molecular events involved in these pathways. A functional T-cell receptor signaling pathway is required for p95vav activity. Stimulation of the T-cell antigen receptor ( TCR ) induces activation of multiple tyrosine kinases, resulting in phosphorylation of numerous intracellular substrates. One substrate is p95vav, which is expressed exclusively in hematopoietic and trophoblast cells..''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_jnlpba_cellular_pipeline", "en", "clinical/models") val text = "The results suggest that activation of protein kinase C, but not new protein synthesis, is required for IL-2 induction of IFN-gamma and GM-CSF cytoplasmic mRNA. It also was observed that suppression of cytokine gene expression by these agents was independent of the inhibition of proliferation. These data indicate that IL-2 and IL-12 may have distinct signaling pathways leading to the induction of IFN-gammaand GM-CSFgene expression, andthatthe NK3.3 cell line may serve as a novel model for dissecting the biochemical and molecular events involved in these pathways. 
A functional T-cell receptor signaling pathway is required for p95vav activity. Stimulation of the T-cell antigen receptor ( TCR ) induces activation of multiple tyrosine kinases, resulting in phosphorylation of numerous intracellular substrates. One substrate is p95vav, which is expressed exclusively in hematopoietic and trophoblast cells.." val result = pipeline.fullAnnotate(text) ```
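Spark NLP chunk annotations report inclusive `begin`/`end` character offsets into the input text, which is how the positions in the results below are derived. A quick plain-Python check against the first two entities:

```python
# First sentence of the example text passed to the pipeline.
text = "The results suggest that activation of protein kinase C, but not new protein synthesis, is required for IL-2 induction of IFN-gamma and GM-CSF cytoplasmic mRNA."
# begin/end offsets are inclusive on both ends, so slice with end + 1.
begin, end = 39, 54
print(text[begin:end + 1])  # protein kinase C
print(text[104:107 + 1])    # IL-2
```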
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:--------------------------------------|--------:|------:|:------------|-------------:| | 0 | protein kinase C | 39 | 54 | protein | 0.993263 | | 1 | IL-2 | 104 | 107 | protein | 0.969095 | | 2 | IFN-gamma and GM-CSF cytoplasmic mRNA | 122 | 158 | RNA | 0.998495 | | 3 | cytokine gene | 202 | 214 | DNA | 0.953537 | | 4 | IL-2 | 320 | 323 | protein | 0.999317 | | 5 | IL-12 | 329 | 333 | protein | 0.999216 | | 6 | IFN-gammaand GM-CSFgene | 400 | 422 | protein | 0.995236 | | 7 | NK3.3 cell line | 447 | 461 | cell_line | 0.998958 | | 8 | T-cell receptor | 583 | 597 | protein | 0.987655 | | 9 | p95vav | 633 | 638 | protein | 0.999857 | | 10 | T-cell antigen receptor | 669 | 691 | protein | 0.99891 | | 11 | TCR | 695 | 697 | protein | 0.998049 | | 12 | tyrosine kinases | 732 | 747 | protein | 0.999636 | | 13 | p95vav | 834 | 839 | protein | 0.999842 | | 14 | hematopoietic and trophoblast cells | 876 | 910 | cell_type | 0.999709 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_jnlpba_cellular_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.8 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Detect Radiology Concepts (WIP) author: John Snow Labs name: jsl_rd_ner_wip_greedy_clinical date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract clinical entities from Radiology reports using pretrained NER model. 
## Predicted Entities `Kidney_Disease`, `HDL`, `Diet`, `Test`, `Imaging_Technique`, `Triglycerides`, `Obesity`, `Duration`, `Weight`, `Social_History_Header`, `ImagingTest`, `Labour_Delivery`, `Disease_Syndrome_Disorder`, `Communicable_Disease`, `Overweight`, `Units`, `Smoking`, `Score`, `Substance_Quantity`, `Form`, `Race_Ethnicity`, `Modifier`, `Hyperlipidemia`, `ImagingFindings`, `Psychological_Condition`, `OtherFindings`, `Cerebrovascular_Disease`, `Date`, `Test_Result`, `VS_Finding`, `Employment`, `Death_Entity`, `Gender`, `Oncological`, `Heart_Disease`, `Medical_Device`, `Total_Cholesterol`, `ManualFix`, `Time`, `Route`, `Pulse`, `Admission_Discharge`, `RelativeDate`, `O2_Saturation`, `Frequency`, `RelativeTime`, `Hypertension`, `Alcohol`, `Allergen`, `Fetus_NewBorn`, `Birth_Entity`, `Age`, `Respiration`, `Medical_History_Header`, `Oxygen_Therapy`, `Section_Header`, `LDL`, `Treatment`, `Vital_Signs_Header`, `Direction`, `BMI`, `Pregnancy`, `Sexually_Active_or_Sexual_Orientation`, `Symptom`, `Clinical_Dept`, `Measurements`, `Height`, `Family_History_Header`, `Substance`, `Strength`, `Injury_or_Poisoning`, `Relationship_Status`, `Blood_Pressure`, `Drug`, `Temperature`, `EKG_Findings`, `Diabetes`, `BodyPart`, `Vaccine`, `Procedure`, `Dosage` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_rd_ner_wip_greedy_clinical_en_3.0.0_3.0_1617260438155.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_rd_ner_wip_greedy_clinical_en_3.0.0_3.0_1617260438155.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("jsl_rd_ner_wip_greedy_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("jsl_rd_ner_wip_greedy_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("EXAMPLE_TEXT").toDF("text") val result = pipeline.fit(data).transform(data) 
``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl.wip.clinical.rd").predict("""Put your text here.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|jsl_rd_ner_wip_greedy_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Benchmarking ```bash entity tp fp fn total precision recall f1 VS_Finding 306.0 129.0 119.0 425.0 0.7034 0.72 0.7116 Direction 8717.0 678.0 616.0 9333.0 0.9278 0.934 0.9309 Respiration 224.0 28.0 18.0 242.0 0.8889 0.9256 0.9069 Cerebrovascular_Disease 149.0 57.0 64.0 213.0 0.7233 0.6995 0.7112 Family_History_Header 315.0 1.0 3.0 318.0 0.9968 0.9906 0.9937 Heart_Disease 1087.0 198.0 141.0 1228.0 0.8459 0.8852 0.8651 ImagingFindings 5568.0 1112.0 1627.0 7195.0 0.8335 0.7739 0.8026 RelativeTime 422.0 138.0 100.0 522.0 0.7536 0.8084 0.78 Strength 96.0 51.0 54.0 150.0 0.6531 0.64 0.6465 BodyPart 20155.0 1698.0 1860.0 22015.0 0.9223 0.9155 0.9189 Smoking 151.0 16.0 5.0 156.0 0.9042 0.9679 0.935 Medical_Device 8162.0 885.0 821.0 8983.0 0.9022 0.9086 0.9054 EKG_Findings 131.0 37.0 83.0 214.0 0.7798 0.6121 0.6859 Pulse 382.0 44.0 50.0 432.0 0.8967 0.8843 0.8904 Psychological_Condition 195.0 32.0 43.0 238.0 0.859 0.8193 0.8387 Triglycerides 18.0 0.0 0.0 18.0 1.0 1.0 1.0 Overweight 6.0 2.0 1.0 7.0 0.75 0.8571 0.8 Obesity 68.0 3.0 5.0 73.0 0.9577 0.9315 0.9444 Admission_Discharge 376.0 26.0 24.0 400.0 0.9353 0.94 0.9377 HDL 11.0 0.0 5.0 16.0 1.0 0.6875 0.8148 Diabetes 227.0 9.0 12.0 239.0 0.9619 0.9498 0.9558 Section_Header 13630.0 476.0 413.0 14043.0 0.9663 0.9706 0.9684 Age 1174.0 129.0 94.0 1268.0 0.901 0.9259 0.9133 O2_Saturation 122.0 34.0 29.0 151.0 0.7821 0.8079 0.7948 Drug 9391.0 1505.0 928.0 10319.0 0.8619 0.9101 0.8853 Kidney_Disease 296.0 28.0 53.0 349.0 0.9136 0.8481 0.8796 Test 3980.0 721.0 925.0 4905.0 0.8466 0.8114 0.8286 Communicable_Disease 40.0 18.0 12.0 52.0 0.6897 0.7692 0.7273 Hypertension 163.0 16.0 10.0 173.0 0.9106 0.9422 0.9261 Oxygen_Therapy 123.0 36.0 27.0 150.0 0.7736 0.82 0.7961 
Test_Result 1607.0 374.0 458.0 2065.0 0.8112 0.7782 0.7944
Modifier 1229.0 435.0 593.0 1822.0 0.7386 0.6745 0.7051
BMI 21.0 4.0 7.0 28.0 0.84 0.75 0.7925
Labour_Delivery 117.0 38.0 62.0 179.0 0.7548 0.6536 0.7006
Employment 414.0 65.0 93.0 507.0 0.8643 0.8166 0.8398
Fetus_NewBorn 118.0 68.0 87.0 205.0 0.6344 0.5756 0.6036
Clinical_Dept 1937.0 189.0 133.0 2070.0 0.9111 0.9357 0.9233
Time 637.0 43.0 27.0 664.0 0.9368 0.9593 0.9479
Procedure 7578.0 953.0 1088.0 8666.0 0.8883 0.8745 0.8813
ImagingTest 1712.0 213.0 281.0 1993.0 0.8894 0.859 0.8739
Diet 79.0 44.0 82.0 161.0 0.6423 0.4907 0.5563
Oncological 1088.0 188.0 103.0 1191.0 0.8527 0.9135 0.882
LDL 20.0 7.0 1.0 21.0 0.7407 0.9524 0.8333
Symptom 15940.0 3662.0 3035.0 18975.0 0.8132 0.8401 0.8264
Temperature 240.0 28.0 25.0 265.0 0.8955 0.9057 0.9006
Vital_Signs_Header 850.0 34.0 52.0 902.0 0.9615 0.9424 0.9518
Total_Cholesterol 43.0 6.0 7.0 50.0 0.8776 0.86 0.8687
Relationship_Status 51.0 3.0 9.0 60.0 0.9444 0.85 0.8947
Blood_Pressure 353.0 18.0 117.0 470.0 0.9515 0.7511 0.8395
Injury_or_Poisoning 1003.0 311.0 241.0 1244.0 0.7633 0.8063 0.7842
Treatment 335.0 98.0 91.0 426.0 0.7737 0.7864 0.78
Pregnancy 214.0 99.0 86.0 300.0 0.6837 0.7133 0.6982
Vaccine 29.0 3.0 10.0 39.0 0.9063 0.7436 0.8169
Height 105.0 10.0 45.0 150.0 0.913 0.7 0.7925
Disease_Syndrome_Disorder 8466.0 1568.0 1533.0 9999.0 0.8437 0.8467 0.8452
Frequency 1263.0 237.0 173.0 1436.0 0.842 0.8795 0.8604
Route 219.0 35.0 144.0 363.0 0.8622 0.6033 0.7099
Duration 978.0 199.0 338.0 1316.0 0.8309 0.7432 0.7846
Death_Entity 35.0 17.0 16.0 51.0 0.6731 0.6863 0.6796
Alcohol 102.0 24.0 21.0 123.0 0.8095 0.8293 0.8193
Date 840.0 43.0 13.0 853.0 0.9513 0.9848 0.9677
Hyperlipidemia 44.0 4.0 1.0 45.0 0.9167 0.9778 0.9462
Social_History_Header 284.0 6.0 27.0 311.0 0.9793 0.9132 0.9451
ManualFix 50.0 2.0 7.0 57.0 0.9615 0.8772 0.9174
Imaging_Technique 845.0 240.0 98.0 943.0 0.7788 0.8961 0.8333
Race_Ethnicity 141.0 0.0 5.0 146.0 1.0 0.9658 0.9826
RelativeDate 1691.0 394.0 194.0 1885.0 0.811 0.8971 0.8519
Gender 6800.0 105.0 130.0 6930.0 0.9848 0.9812 0.983
Dosage 122.0 67.0 81.0 203.0 0.6455 0.601 0.6224
Medical_History_Header 486.0 10.0 19.0 505.0 0.9798 0.9624 0.971
Sexually_Active_or_Sexual_Orientation 12.0 0.0 5.0 17.0 1.0 0.7059 0.8276
Substance 102.0 11.0 22.0 124.0 0.9027 0.8226 0.8608
Weight 346.0 26.0 65.0 411.0 0.9301 0.8418 0.8838
macro - - - - - - 0.8038
micro - - - - - - 0.8793
```

---
layout: model
title: English DistilBertForQuestionAnswering Base Uncased model (from shwetha)
author: John Snow Labs
name: distilbert_qa_shwetha_base_uncased_finetuned_squad
date: 2023-01-03
tags: [en, open_source, distilbert, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `shwetha`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_shwetha_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772737063.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_shwetha_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772737063.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_shwetha_base_uncased_finetuned_squad","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_shwetha_base_uncased_finetuned_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
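For intuition, extractive QA models of this kind score every token position as a potential answer start and answer end, then return the best-scoring valid span from the context. A self-contained sketch with made-up scores (toy values, not the actual model weights):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) pair with the highest combined score,
    subject to start <= end and a maximum span length."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]  # toy start scores
end   = [0.1, 0.1, 0.2, 4.0, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]  # toy end scores
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```

The real annotator does this over model logits; the sketch only illustrates the span-selection step.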
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_shwetha_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.6 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/shwetha/distilbert-base-uncased-finetuned-squad

---
layout: model
title: Drug Side Effect Classifier (BioBERT)
author: John Snow Labs
name: bert_sequence_classifier_vop_drug_side_effect
date: 2023-06-13
tags: [licensed, classification, side_effect, drug, vop, en, tensorflow]
task: Text Classification
language: en
edition: Healthcare NLP 4.4.3
spark_version: 3.0
supported: true
engine: tensorflow
annotator: MedicalBertForSequenceClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a [BioBERT based](https://github.com/dmis-lab/biobert) classifier that can classify informal texts (such as tweets or forum posts) according to the presence of drug side effects.

## Predicted Entities

`Drug_AE`, `Other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_drug_side_effect_en_4.4.3_3.0_1686668996540.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_vop_drug_side_effect_en_4.4.3_3.0_1686668996540.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_vop_drug_side_effect", "en", "clinical/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("prediction")

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

data = spark.createDataFrame(["I felt kind of dizzy after taking that medication for a month.",
                              "I took antibiotics last week and everything went well."], StringType()).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documenter = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_vop_drug_side_effect", "en", "clinical/models")
    .setInputCols(Array("document", "token"))
    .setOutputCol("prediction")

val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier))

val data = Seq("I felt kind of dizzy after taking that medication for a month.",
               "I took antibiotics last week and everything went well.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
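The classifier head behind a sequence classifier like this produces one logit per label (`Drug_AE`, `Other`); a softmax turns those logits into the prediction confidence. A toy, self-contained illustration with invented logits (not taken from the model):

```python
import math

def softmax(logits):
    m = max(logits)                         # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Drug_AE", "Other"]
logits = [2.1, -0.4]                        # invented values for illustration
probs = softmax(logits)
prediction = labels[probs.index(max(probs))]
print(prediction)  # -> Drug_AE
```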
## Results

```bash
+--------------------------------------------------------------+---------+
|text                                                          |result   |
+--------------------------------------------------------------+---------+
|I felt kind of dizzy after taking that medication for a month.|[Drug_AE]|
|I took antibiotics last week and everything went well.        |[Other]  |
+--------------------------------------------------------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_sequence_classifier_vop_drug_side_effect|
|Compatibility:|Healthcare NLP 4.4.3+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|406.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

In-house annotated health-related text in colloquial language.

## Sample text from the training dataset

Hello,I’m 20 year old girl. I’m diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I’m taking weekly supplement of vitamin D and 1000 mcg b12 daily. I’m taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I’m facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.

## Benchmarking

```bash
       label  precision    recall  f1-score  support
     Drug_AE   0.861111  0.788136  0.823009      118
       Other   0.875622  0.921466  0.897959      191
    accuracy          -         -  0.870550      309
   macro_avg   0.868367  0.854801  0.860484      309
weighted_avg   0.870081  0.870550  0.869337      309
```

---
layout: model
title: Legal No third party beneficiaries Clause Binary Classifier
author: John Snow Labs
name: legclf_no_third_party_beneficiaries_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `no-third-party-beneficiaries` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `no-third-party-beneficiaries`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_no_third_party_beneficiaries_clause_en_1.0.0_3.2_1660122706229.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_no_third_party_beneficiaries_clause_en_1.0.0_3.2_1660122706229.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("clause_text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = nlp.ClassifierDLModel.pretrained("legclf_no_third_party_beneficiaries_clause", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    embeddings,
    docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text")

model = nlpPipeline.fit(df)

result = model.transform(df)
```
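The paragraph splitting recommended in the description (by multiline) can be done with the standard library before assembling the Spark DataFrame; `split_paragraphs` below is a hypothetical helper, not part of Spark NLP:

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines, dropping empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Clause 1. No third party beneficiaries...\n\nClause 2. Governing law..."
clauses = split_paragraphs(doc)
print(len(clauses))  # -> 2
```

Each resulting paragraph can then become one row of the `clause_text` column, so the classifier sees one candidate clause at a time.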
## Results

```bash
+------------------------------+
|result                        |
+------------------------------+
|[no-third-party-beneficiaries]|
|[other]                       |
|[other]                       |
|[no-third-party-beneficiaries]|
+------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_no_third_party_beneficiaries_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.0 MB|

## References

Legal documents, scraped from the Internet and classified in-house.

## Benchmarking

```bash
                       label  precision  recall  f1-score  support
no-third-party-beneficiaries       0.96    0.96      0.96       49
                       other       0.98    0.98      0.98      130
                    accuracy          -       -      0.98      179
                   macro-avg       0.97    0.97      0.97      179
                weighted-avg       0.98    0.98      0.98      179
```

---
layout: model
title: English T5ForConditionalGeneration Cased model (from jeremyccollinsmpi)
author: John Snow Labs
name: t5_autotrain_inference_probability_3_900329401
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-inference_probability_3-900329401` is an English model originally trained by `jeremyccollinsmpi`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_autotrain_inference_probability_3_900329401_en_4.3.0_3.0_1675099900080.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_autotrain_inference_probability_3_900329401_en_4.3.0_3.0_1675099900080.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_autotrain_inference_probability_3_900329401","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_autotrain_inference_probability_3_900329401","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
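T5 produces its output one token at a time; the simplest decoding strategy, greedy decoding, keeps the single best next token until an end marker. A toy, model-free sketch of that loop (the scoring table is invented and stands in for the real model):

```python
def greedy_decode(next_token, start="<s>", eos="</s>", max_steps=10):
    """Repeatedly pick the highest-scoring next token until EOS."""
    seq = [start]
    for _ in range(max_steps):
        scores = next_token(seq)      # token -> score for the current prefix
        tok = max(scores, key=scores.get)
        if tok == eos:
            break
        seq.append(tok)
    return seq[1:]

# Invented transition scores standing in for the real model.
table = {
    "<s>": {"entailment": 0.9, "contradiction": 0.1, "</s>": 0.0},
    "entailment": {"</s>": 0.95, "entailment": 0.05},
    "contradiction": {"</s>": 1.0},
}
out = greedy_decode(lambda seq: table[seq[-1]])
print(out)  # -> ['entailment']
```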
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_autotrain_inference_probability_3_900329401|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|846.1 MB|

## References

- https://huggingface.co/jeremyccollinsmpi/autotrain-inference_probability_3-900329401

---
layout: model
title: Part of Speech for Turkish
author: John Snow Labs
name: pos_ud_imst
date: 2020-05-03 12:43:00 +0800
task: Part of Speech Tagging
language: tr
edition: Spark NLP 2.5.0
spark_version: 2.4
tags: [pos, tr]
supported: true
annotator: PerceptronModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

{:.h2_title}
## Description

This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.

{:.btn-box}
[Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_imst_tr_2.5.0_2.4_1587480006078.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_imst_tr_2.5.0_2.4_1587480006078.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
pos = PerceptronModel.pretrained("pos_ud_imst", "tr") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])

light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

results = light_pipeline.fullAnnotate("John Snow, kuzeyin kralı olmanın yanı sıra bir İngiliz doktordur ve anestezi ve tıbbi hijyenin geliştirilmesinde liderdir.")
```
```scala
...
val pos = PerceptronModel.pretrained("pos_ud_imst", "tr")
    .setInputCols(Array("document", "token"))
    .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))

val data = Seq("John Snow, kuzeyin kralı olmanın yanı sıra bir İngiliz doktordur ve anestezi ve tıbbi hijyenin geliştirilmesinde liderdir.").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu

text = ["""John Snow, kuzeyin kralı olmanın yanı sıra bir İngiliz doktordur ve anestezi ve tıbbi hijyenin geliştirilmesinde liderdir."""]
pos_df = nlu.load('tr.pos.ud_imst').predict(text, output_level='token')
pos_df
```
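Downstream, POS output is often aggregated into simple statistics, e.g. counting tags over the (token, tag) pairs the model returns. A self-contained sketch using made-up output in the shape of the results below:

```python
from collections import Counter

# Toy (token, tag) pairs shaped like the annotator's output.
tagged = [("John", "PROPN"), ("Snow", "PROPN"), (",", "PUNCT"), ("kuzeyin", "NOUN")]
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common(1))  # -> [('PROPN', 2)]
```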
{:.h2_title}
## Results

```bash
[Row(annotatorType='pos', begin=0, end=3, result='NOUN', metadata={'word': 'John'}),
Row(annotatorType='pos', begin=5, end=8, result='PROPN', metadata={'word': 'Snow'}),
Row(annotatorType='pos', begin=9, end=9, result='PUNCT', metadata={'word': ','}),
Row(annotatorType='pos', begin=11, end=17, result='NOUN', metadata={'word': 'kuzeyin'}),
Row(annotatorType='pos', begin=19, end=23, result='NOUN', metadata={'word': 'kralı'}),
...]
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pos_ud_imst|
|Type:|pos|
|Compatibility:|Spark NLP 2.5.0+|
|Edition:|Official|
|Input labels:|[token]|
|Output labels:|[pos]|
|Language:|tr|
|Case sensitive:|false|
|License:|Open Source|

{:.h2_title}
## Data Source

The model is imported from [https://universaldependencies.org](https://universaldependencies.org)

---
layout: model
title: Pipeline to Detect Clinical Entities (BertForTokenClassifier)
author: John Snow Labs
name: bert_token_classifier_ner_jsl_pipeline
date: 2023-06-13
tags: [ner_jsl, ner, berfortokenclassification, en, licensed]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.4.4
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on the top of [bert_token_classifier_ner_jsl](https://nlp.johnsnowlabs.com/2022/03/21/bert_token_classifier_ner_jsl_en_2_4.html) model.

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_pipeline_en_4.4.4_3.2_1686664116632.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_jsl_pipeline_en_4.4.4_3.2_1686664116632.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("bert_token_classifier_ner_jsl_pipeline", "en", "clinical/models")

text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.'''

result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = new PretrainedPipeline("bert_token_classifier_ner_jsl_pipeline", "en", "clinical/models")

val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature."

val result = pipeline.fullAnnotate(text)
```

{:.nlu-block}
```python
import nlu

nlu.load("en.classify.bert_token_ner_jsl.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby-girl also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""")
```
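The pipeline's NerConverterInternalModel stage merges token-level B-/I- tags into the entity chunks shown in the results; the idea can be sketched independently of Spark NLP:

```python
def bio_to_chunks(tokens, tags):
    """Merge BIO-tagged tokens into (text, label) chunks."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" tag or inconsistent continuation: flush any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["21-day-old", "Caucasian", "male", "here"]
tags = ["B-Age", "B-Demographics", "I-Demographics", "O"]
print(bio_to_chunks(tokens, tags))
# -> [('21-day-old', 'Age'), ('Caucasian male', 'Demographics')]
```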
## Results

```bash
|    | ner_chunk                        |   begin |   end | ner_label    |   confidence |
|---:|:---------------------------------|--------:|------:|:-------------|-------------:|
|  0 | 21-day-old                       |      17 |    26 | Age          |     0.999456 |
|  1 | Caucasian male                   |      28 |    41 | Demographics |     0.9901   |
|  2 | congestion                       |      62 |    71 | Symptom      |     0.997918 |
|  3 | mom                              |      75 |    77 | Demographics |     0.999013 |
|  4 | yellow discharge                 |      99 |   114 | Symptom      |     0.998663 |
|  5 | nares                            |     135 |   139 | Body_Part    |     0.998609 |
|  6 | she                              |     147 |   149 | Demographics |     0.999442 |
|  7 | mild problems with his breathing |     168 |   199 | Symptom      |     0.930385 |
|  8 | perioral cyanosis                |     237 |   253 | Symptom      |     0.99819  |
|  9 | retractions                      |     258 |   268 | Symptom      |     0.999783 |
| 10 | One day ago                      |     272 |   282 | Date_Time    |     0.999386 |
| 11 | mom                              |     285 |   287 | Demographics |     0.999835 |
| 12 | tactile temperature              |     304 |   322 | Symptom      |     0.999352 |
| 13 | Tylenol                          |     345 |   351 | Drug         |     0.999762 |
| 14 | Baby-girl                        |     354 |   362 | Age          |     0.980529 |
| 15 | decreased p.o. intake            |     382 |   402 | Symptom      |     0.998978 |
| 16 | His                              |     405 |   407 | Demographics |     0.999913 |
| 17 | breast-feeding                   |     416 |   429 | Body_Part    |     0.99954  |
| 18 | his                              |     493 |   495 | Demographics |     0.999661 |
| 19 | respiratory congestion           |     497 |   518 | Symptom      |     0.834984 |
| 20 | He                               |     521 |   522 | Demographics |     0.999858 |
| 21 | tired                            |     555 |   559 | Symptom      |     0.999516 |
| 22 | fussy                            |     574 |   578 | Symptom      |     0.997592 |
| 23 | over the past 2 days             |     580 |   599 | Date_Time    |     0.994786 |
| 24 | albuterol                        |     642 |   650 | Drug         |     0.999735 |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_token_classifier_ner_jsl_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|405.0 MB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- MedicalBertForTokenClassifier
- NerConverterInternalModel

---
layout: model
title: BioBERT Embeddings (Clinical)
author: John Snow Labs
name: biobert_clinical_base_cased
date: 2020-08-25
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.0
spark_version: 2.4
tags: [embeddings, en, open_source]
supported: true
recommended: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model contains pre-trained weights of ClinicalBERT for generic clinical text. This domain-specific model improves performance on 3/5 clinical NLP tasks and establishes a new state-of-the-art on the MedNLI dataset. The details are described in the paper "[Publicly Available Clinical BERT Embeddings](https://www.aclweb.org/anthology/W19-1909/)".

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/biobert_clinical_base_cased_en_2.6.0_2.4_1598343387227.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/biobert_clinical_base_cased_en_2.6.0_2.4_1598343387227.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
embeddings = BertEmbeddings.pretrained("biobert_clinical_base_cased", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])

pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

result = pipeline_model.transform(spark.createDataFrame([['I hate cancer']], ["text"]))
```
```scala
val embeddings = BertEmbeddings.pretrained("biobert_clinical_base_cased", "en")
    .setInputCols("sentence", "token")
    .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))

val data = Seq("I hate cancer").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu

text = ["I hate cancer"]
embeddings_df = nlu.load('en.embed.biobert.clinical_base_cased').predict(text, output_level='token')
embeddings_df
```
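The 768-dimensional vectors this model returns are typically compared with cosine similarity. A minimal, library-free sketch on toy 3-dimensional vectors (the values are invented, only the comparison logic is real):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for 768-dim token embeddings.
v_cancer = [0.29, 0.22, -0.51]
v_tumor  = [0.31, 0.20, -0.48]
v_happy  = [-0.40, 0.90, 0.10]
print(cosine(v_cancer, v_tumor) > cosine(v_cancer, v_happy))  # -> True
```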
{:.h2_title}
## Results

```bash
token    en_embed_biobert_clinical_base_cased_embeddings
I        [0.2206662893295288, 0.41324421763420105, -0.3...
hate     [-0.19311018288135529, 0.6037888526916504, -0....
cancer   [0.2895708680152893, 0.22499887645244598, -0.5...
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|biobert_clinical_base_cased|
|Type:|embeddings|
|Compatibility:|Spark NLP 2.6.0|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[word_embeddings]|
|Language:|[en]|
|Dimension:|768|
|Case sensitive:|true|

{:.h2_title}
## Data Source

The model is imported from [https://github.com/EmilyAlsentzer/clinicalBERT](https://github.com/EmilyAlsentzer/clinicalBERT)

---
layout: model
title: English XlmRoBertaForQuestionAnswering (from teacookies)
author: John Snow Labs
name: xlm_roberta_qa_autonlp_roberta_base_squad2_24465514
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-roberta-base-squad2-24465514` is an English model originally trained by `teacookies`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465514_en_4.0.0_3.0_1655985998052.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465514_en_4.0.0_3.0_1655985998052.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465514","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = XlmRoBertaForQuestionAnswering
    .pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465514","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu

nlu.load("en.answer_question.squadv2.xlm_roberta.base_24465514.by_teacookies").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_roberta_base_squad2_24465514| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|887.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-roberta-base-squad2-24465514 --- layout: model title: Embeddings BioVec author: John Snow Labs name: embeddings_biovec class: WordEmbeddingsModel language: en nav_key: models repository: clinical/models date: 2020-06-02 task: Embeddings edition: Healthcare NLP 2.5.0 spark_version: 2.4 tags: [clinical,embeddings,en] supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/embeddings_biovec_en_2.5.0_2.4_1591068211397.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/embeddings_biovec_en_2.5.0_2.4_1591068211397.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python model = WordEmbeddingsModel.pretrained("embeddings_biovec","en","clinical/models")\ .setInputCols("document","token")\ .setOutputCol("word_embeddings") ``` ```scala val model = WordEmbeddingsModel.pretrained("embeddings_biovec","en","clinical/models") .setInputCols("document","token") .setOutputCol("word_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.glove.biovec").predict("""Put your text here.""") ```
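Conceptually, a word-embeddings lookup annotator is a dictionary from token to vector, returning a zero vector for out-of-vocabulary tokens. A toy sketch of that behaviour (the three-dimensional vocabulary below is invented; the real model resolves tokens to 300-dimensional BioVec vectors):

```python
DIM = 3  # toy size; embeddings_biovec uses 300 dimensions

vocab = {  # invented vectors for illustration only
    "cancer": [0.29, 0.22, -0.51],
    "patient": [0.10, -0.33, 0.07],
}

def embed(token):
    """Look up a token; unknown tokens map to the zero vector."""
    return vocab.get(token, [0.0] * DIM)

print(embed("cancer"))   # [0.29, 0.22, -0.51]
print(embed("aspirin"))  # [0.0, 0.0, 0.0]
```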
{:.h2_title} ## Results Word2Vec feature vectors based on ``embeddings_biovec``. {:.model-param} ## Model Information {:.table-model} |---------------|---------------------| | Name: | embeddings_biovec | | Type: | WordEmbeddingsModel | | Compatibility: | Spark NLP 2.5.0+ | | License: | Licensed | | Edition: | Official | |Input labels: | [document, token] | |Output labels: | [word_embeddings] | | Language: | en | | Dimension: | 300 | {:.h2_title} ## Data Source Trained on PubMed corpora [https://github.com/ncbi-nlp/BioSentVec](https://github.com/ncbi-nlp/BioSentVec) --- layout: model title: Extract Clinical Entities from Voice of the Patient Documents (embeddings_clinical_medium) author: John Snow Labs name: ner_vop_emb_clinical_medium date: 2023-06-06 tags: [licensed, clinical, en, ner, vop] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts healthcare-related terms from documents written in the patient’s own words (voice of the patient).
## Predicted Entities `Gender`, `Employment`, `BodyPart`, `Vaccine`, `Age`, `PsychologicalCondition`, `Form`, `Drug`, `Substance`, `ClinicalDept`, `Laterality`, `DateTime`, `Test`, `Route`, `Disease`, `AdmissionDischarge`, `Dosage`, `Duration`, `VitalTest`, `Frequency`, `Symptom`, `Allergen`, `Procedure`, `RelationshipStatus`, `HealthStatus`, `Modifier`, `TestResult`, `InjuryOrPoisoning`, `SubstanceQuantity`, `MedicalDevice`, `Treatment` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_emb_clinical_medium_en_4.4.3_3.0_1686073490433.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_emb_clinical_medium_en_4.4.3_3.0_1686073490433.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_emb_clinical_medium", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. 
I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_emb_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . 
Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:---------------------|:-----------------------| | 20 year old | Age | | girl | Gender | | hyperthyroid | Disease | | 1 month ago | DateTime | | weak | Symptom | | light | Symptom | | panic attacks | PsychologicalCondition | | depression | PsychologicalCondition | | left | Laterality | | chest | BodyPart | | pain | Symptom | | increased | TestResult | | heart rate | VitalTest | | rapidly | Modifier | | weight loss | Symptom | | 4 months | Duration | | hospital | ClinicalDept | | discharged | AdmissionDischarge | | hospital | ClinicalDept | | blood tests | Test | | brain | BodyPart | | mri | Test | | ultrasound scan | Test | | endoscopy | Procedure | | doctors | Employment | | homeopathy doctor | Employment | | he | Gender | | hyperthyroid | Disease | | TSH | Test | | 0.15 | TestResult | | T3 | Test | | T4 | Test | | normal | TestResult | | b12 deficiency | Disease | | vitamin D deficiency | Disease | | weekly | Frequency | | supplement | Drug | | vitamin D | Drug | | 1000 mcg | Dosage | | b12 | Drug | | daily | Frequency | | homeopathy medicine | Drug | | 40 days | Duration | | after 30 days | DateTime | | TSH | Test | | 0.5 | TestResult | | now | DateTime | | weakness | Symptom | | depression | PsychologicalCondition | | last week | DateTime | | rapid | TestResult | | heartrate | VitalTest | | homeopathy | Treatment | | thyroid | BodyPart | | allopathy | Drug | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_emb_clinical_medium| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.9 MB| |Dependencies:|embeddings_clinical_medium| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. 
I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
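The precision, recall, and F1 figures in the benchmarking table below follow directly from the tp/fp/fn counts; a quick sanity check in plain Python against the `Gender` row:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true-positive/false-positive/false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = prf(tp=1293, fp=24, fn=24)  # Gender row of the table
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.98 0.98 0.98
```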
## Benchmarking ```bash label tp fp fn total precision recall f1 Gender 1293 24 24 1317 0.98 0.98 0.98 Employment 1182 48 61 1243 0.96 0.95 0.96 BodyPart 2697 228 203 2900 0.92 0.93 0.93 Vaccine 39 4 3 42 0.91 0.93 0.92 Age 552 61 30 582 0.90 0.95 0.92 PsychologicalCondition 405 33 39 444 0.92 0.91 0.92 Form 248 30 18 266 0.89 0.93 0.91 Drug 1327 158 113 1440 0.89 0.92 0.91 Substance 387 55 34 421 0.88 0.92 0.90 ClinicalDept 293 42 33 326 0.87 0.90 0.89 Laterality 550 55 78 628 0.91 0.88 0.89 DateTime 4030 621 372 4402 0.87 0.92 0.89 Test 1056 134 152 1208 0.89 0.87 0.88 Route 42 7 6 48 0.86 0.88 0.87 Disease 1750 312 265 2015 0.85 0.87 0.86 AdmissionDischarge 27 2 7 34 0.93 0.79 0.86 Dosage 346 57 66 412 0.86 0.84 0.85 Duration 1889 287 421 2310 0.87 0.82 0.84 VitalTest 144 28 28 172 0.84 0.84 0.84 Frequency 889 171 190 1079 0.84 0.82 0.83 Symptom 3707 671 868 4575 0.85 0.81 0.83 Allergen 34 2 12 46 0.94 0.74 0.83 Procedure 583 140 122 705 0.81 0.83 0.82 RelationshipStatus 18 2 6 24 0.90 0.75 0.82 HealthStatus 83 29 24 107 0.74 0.78 0.76 Modifier 787 183 352 1139 0.81 0.69 0.75 TestResult 362 112 162 524 0.76 0.69 0.73 InjuryOrPoisoning 119 35 57 176 0.77 0.68 0.72 SubstanceQuantity 59 20 26 85 0.75 0.69 0.72 MedicalDevice 228 92 104 332 0.71 0.69 0.70 Treatment 144 45 84 228 0.76 0.63 0.69 macro_avg 25270 3688 3960 29230 0.86 0.83 0.85 micro_avg 25270 3688 3960 29230 0.87 0.87 0.87 ``` --- layout: model title: Translate Ga to English Pipeline author: John Snow Labs name: translate_gaa_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, gaa, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. 
Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `gaa` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_gaa_en_xx_2.7.0_2.4_1609686849272.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_gaa_en_xx_2.7.0_2.4_1609686849272.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_gaa_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_gaa_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.gaa.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_gaa_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: BERT LaBSE Sentence Embeddings author: John Snow Labs name: labse date: 2020-09-23 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, xx] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Language-agnostic BERT sentence embedding model supporting 109 languages. The model encodes text into high-dimensional vectors and is trained and optimized to produce similar representations exclusively for bilingual sentence pairs that are translations of each other, so it can be used to mine a larger corpus for translations of a sentence. The details are described in the paper "[Language-agnostic BERT Sentence Embedding. July 2020](https://arxiv.org/abs/2007.01852)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/labse_xx_2.6.0_2.4_1600858075633.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/labse_xx_2.6.0_2.4_1600858075633.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("labse", "xx") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP'], ['Many thanks']], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("labse", "xx") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I love NLP", "Many thanks").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP", "Many thanks"] embeddings_df = nlu.load('xx.embed_sentence.labse').predict(text, output_level='sentence') embeddings_df ```
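Since LaBSE places translations close together in embedding space, translation mining reduces to nearest-neighbour search under cosine similarity. A self-contained sketch of that search (the 4-dimensional vectors are invented stand-ins for the real 768-dimensional `sentence_embeddings` output):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

query = [0.9, 0.1, 0.0, 0.2]  # embedding of the sentence to match (toy values)
candidates = {                 # corpus embeddings (toy values)
    "candidate A": [0.88, 0.12, 0.05, 0.18],  # near the query: likely a translation
    "candidate B": [0.05, 0.90, 0.30, 0.00],  # unrelated sentence
}

best = max(candidates, key=lambda s: cosine(query, candidates[s]))
print(best)  # candidate A
```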
{:.h2_title} ## Results ```bash sentence xx_embed_sentence_labse_embeddings 0 I love NLP [-0.060951583087444305, -0.011645414866507053,... 1 Many thanks [0.002173778833821416, -0.05513454228639603, 0... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|labse| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[xx]| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/google/LaBSE/1](https://tfhub.dev/google/LaBSE/1) --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_0 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-64-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_0_en_4.0.0_3.0_1654191670990.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_0_en_4.0.0_3.0_1654191670990.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_64d_seed_0").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|378.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-64-finetuned-squad-seed-0 --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_12_h_512 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-12_H-512` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_512_zh_4.2.4_3.0_1670021556439.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_12_h_512_zh_4.2.4_3.0_1670021556439.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_512","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_12_h_512","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_12_h_512| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|184.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-12_H-512 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: English BertForQuestionAnswering model (from ncduy) author: John Snow Labs name: bert_qa_bert_base_cased_finetuned_squad_test date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-cased-finetuned-squad-test` is an English model originally trained by `ncduy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_cased_finetuned_squad_test_en_4.0.0_3.0_1654179794462.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_cased_finetuned_squad_test_en_4.0.0_3.0_1654179794462.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_cased_finetuned_squad_test","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_cased_finetuned_squad_test","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_cased.by_ncduy").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_cased_finetuned_squad_test| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ncduy/bert-base-cased-finetuned-squad-test --- layout: model title: Multilingual DistilBertForQuestionAnswering Cased model (from ZYW) author: John Snow Labs name: distilbert_qa_squad_zyw_model date: 2023-01-03 tags: [xx, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squad-en-de-es-vi-zh-model` is a Multilingual model originally trained by `ZYW`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_squad_zyw_model_xx_4.3.0_3.0_1672775639500.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_squad_zyw_model_xx_4.3.0_3.0_1672775639500.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_zyw_model","xx")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_zyw_model","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_squad_zyw_model| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|505.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ZYW/squad-en-de-es-vi-zh-model --- layout: model title: Legal Assignment Clause Binary Classifier author: John Snow Labs name: legclf_assignment_clause date: 2023-02-13 tags: [en, legal, classification, assignment, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a binary classifier (True, False) for the `assignment` clause type. To use it, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only individual sentences rather than the whole text, so it is better to skip them unless you want binary classification at the sentence level. If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques shown in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).
This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `assignment`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_assignment_clause_en_1.0.0_3.0_1676302845555.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_assignment_clause_en_1.0.0_3.0_1676302845555.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_assignment_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
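The paragraph splitting "by multiline" suggested in the description can be sketched in plain Python before the text ever reaches the Spark pipeline. This is only an illustrative helper (the function name and regex are not part of the Spark NLP API), assuming paragraphs are separated by blank lines:

```python
import re

def split_paragraphs(text: str):
    """Split a long legal document into paragraphs on blank lines,
    dropping empty fragments, so each piece fits the 512-token limit."""
    return [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]

doc = ("ASSIGNMENT. Neither party may assign this Agreement.\n\n"
       "GOVERNING LAW. This Agreement is governed by the laws of Delaware.")
paragraphs = split_paragraphs(doc)
# Each paragraph can then become one row of the "text" column, e.g.:
# df = spark.createDataFrame([[p] for p in paragraphs]).toDF("text")
print(len(paragraphs))
```

Each resulting paragraph is classified independently, so one document yields one True/False prediction per paragraph.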
## Results ```bash +-------+ |result| +-------+ |[assignment]| |[other]| |[other]| |[assignment]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_assignment_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support assignment 0.95 1.00 0.98 20 other 1.00 0.91 0.95 11 accuracy - - 0.97 31 macro-avg 0.98 0.95 0.96 31 weighted-avg 0.97 0.97 0.97 31 ``` --- layout: model title: BioBERT Sentence Embeddings (Discharge) author: John Snow Labs name: sent_biobert_discharge_base_cased date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains the pre-trained weights of ClinicalBERT for discharge summaries. This domain-specific model shows performance improvements on 3 out of 5 clinical NLP tasks and establishes a new state of the art on the MedNLI dataset. The details are described in the paper "[Publicly Available Clinical BERT Embeddings](https://www.aclweb.org/anthology/W19-1909/)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_biobert_discharge_base_cased_en_2.6.0_2.4_1598349530721.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_biobert_discharge_base_cased_en_2.6.0_2.4_1598349530721.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_discharge_base_cased", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_discharge_base_cased", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.biobert.discharge_base_cased').predict(text, output_level='sentence') embeddings_df ```
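A common downstream use of these sentence embeddings is comparing two sentences by cosine similarity. A minimal pure-Python sketch, using short toy vectors in place of the 768-dimensional vectors this model actually produces:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors standing in for real sentence embeddings:
v1 = [0.31, 0.37, -0.42, 0.10]
v2 = [0.35, 0.08, -0.09, 0.22]
print(cosine_similarity(v1, v2))
```

Values close to 1.0 indicate semantically similar sentences; in practice you would read the vectors out of the `sentence_embeddings` column of the transformed DataFrame.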
{:.h2_title} ## Results ```bash sentence en_embed_sentence_biobert_discharge_base_cased_embeddings I hate cancer [0.3155321180820465, 0.37484583258628845, -0.4... Antibiotics aren't painkiller [0.3543206453323364, 0.0787968561053276, -0.08... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_biobert_discharge_base_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/EmilyAlsentzer/clinicalBERT](https://github.com/EmilyAlsentzer/clinicalBERT) --- layout: model title: Hindi XlmRoBertaForQuestionAnswering (from abhishek) author: John Snow Labs name: xlm_roberta_qa_autonlp_hindi_question_answering_23865268 date: 2022-06-23 tags: [hi, open_source, question_answering, xlmroberta] task: Question Answering language: hi edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-hindi-question-answering-23865268` is a Hindi model originally trained by `abhishek`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_hindi_question_answering_23865268_hi_4.0.0_3.0_1655984202768.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_hindi_question_answering_23865268_hi_4.0.0_3.0_1655984202768.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_hindi_question_answering_23865268","hi") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_hindi_question_answering_23865268","hi") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("hi.answer_question.xlm_roberta").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
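Conceptually, extractive QA models like this one score every token position as a possible answer start and end, then return the highest-scoring valid span from the context. A toy pure-Python illustration of that span selection (the tokens and scores below are made up for the example, not produced by the model):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) pair maximizing start+end score, with end >= start."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            if s_score + end_scores[e] > best[2]:
                best = (s, e, s_score + end_scores[e])
    return best[0], best[1]

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_scores = [0.1, 0.2, 0.1, 5.0, 0.1, 0.0, 0.1, 0.1, 0.3, 0.0]
end_scores   = [0.0, 0.1, 0.2, 4.0, 0.1, 0.0, 0.1, 0.2, 0.4, 0.0]
s, e = best_span(start_scores, end_scores)
print(" ".join(tokens[s:e + 1]))  # → Clara
```

The real model does this over subword tokens and returns the decoded span in the `answer` output column.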
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_hindi_question_answering_23865268| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|hi| |Size:|1.9 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/abhishek/autonlp-hindi-question-answering-23865268 --- layout: model title: Pipeline to Detect Drug Chemicals author: John Snow Labs name: ner_drugs_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, drug, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_drugs](https://nlp.johnsnowlabs.com/2021/03/31/ner_drugs_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_drugs_pipeline_en_3.4.1_3.0_1647872661131.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_drugs_pipeline_en_3.4.1_3.0_1647872661131.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_drugs_pipeline", "en", "clinical/models") pipeline.annotate("The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts a Val366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulness of vinorelbine monotherapy in patients with advanced or recurrent breast cancer after standard therapy, we evaluated the efficacy and safety of vinorelbine in patients previously treated with anthracyclines and taxanes.") ``` ```scala val pipeline = new PretrainedPipeline("ner_drugs_pipeline", "en", "clinical/models") pipeline.annotate("The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts a Val366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulness of vinorelbine monotherapy in patients with advanced or recurrent breast cancer after standard therapy, we evaluated the efficacy and safety of vinorelbine in patients previously treated with anthracyclines and taxanes.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.drugs.pipeline").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomic organization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene for Type II diabetes mellitus in the Pima Indian population. The gene spans approximately 7.6 kb and contains one noncoding and two coding exons separated by approximately 2.2 and approximately 2.6 kb introns, respectively. We identified 14 single nucleotide polymorphisms (SNPs), including one that predicts a Val366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Our expression studies revealed the presence of the transcript in various human tissues including pancreas, and two major insulin-responsive tissues: fat and skeletal muscle. The characterization of the KCNJ9 gene should facilitate further studies on the function of the KCNJ9 protein and allow evaluation of the potential role of the locus in Type II diabetes. BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulness of vinorelbine monotherapy in patients with advanced or recurrent breast cancer after standard therapy, we evaluated the efficacy and safety of vinorelbine in patients previously treated with anthracyclines and taxanes.""") ```
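The chunk/label pairs shown in the Results section are plain strings once extracted from the pipeline output, so rendering them as a table is ordinary string formatting. A minimal sketch (the chunk list is hard-coded here for illustration; in a real run it would come from the `annotate()` output, whose key names depend on the pipeline's output columns):

```python
def format_chunk_table(chunks, label="DrugChem"):
    """Render detected NER chunks as a simple ASCII chunk/label table."""
    width = max(len(c) for c in chunks) + 2
    border = "+" + "-" * width + "+----------+"
    lines = [border, "|" + "chunk".ljust(width) + "|ner       |", border]
    for c in chunks:
        lines.append("|" + c.ljust(width) + "|" + label.ljust(10) + "|")
    lines.append(border)
    return "\n".join(lines)

print(format_chunk_table(["potassium", "anthracyclines", "taxanes", "vinorelbine"]))
```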
## Results ```bash +-----------------+---------+ |chunk |ner | +-----------------+---------+ |potassium |DrugChem | |anthracyclines |DrugChem | |taxanes |DrugChem | |vinorelbine |DrugChem | |vinorelbine |DrugChem | |anthracyclines |DrugChem | |taxanes |DrugChem | +-----------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_drugs_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Modern Greek (1453-) asr_wav2vec2_large_xlsr_bahasa_indonesia TFWav2Vec2ForCTC from Bagus author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_bahasa_indonesia date: 2022-09-25 tags: [wav2vec2, el, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: el edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_bahasa_indonesia` is a Modern Greek (1453-) model originally trained by Bagus. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_bahasa_indonesia_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_bahasa_indonesia_el_4.2.0_3.0_1664108133476.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_bahasa_indonesia_el_4.2.0_3.0_1664108133476.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_bahasa_indonesia', lang = 'el') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_bahasa_indonesia", lang = "el") val annotations = pipeline.transform(audioDF) ```
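The `audioDF` above must contain the raw audio as an array of floats (Wav2Vec2 models typically expect 16 kHz mono). A stdlib-only sketch of decoding a 16-bit mono PCM WAV file into normalized floats; the file path and helper name are illustrative, not part of the Spark NLP API:

```python
import struct
import wave

def wav_to_floats(path):
    """Decode a 16-bit mono PCM WAV file into floats in roughly [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2 and wf.getnchannels() == 1
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# floats = wav_to_floats("sample.wav")
# audioDF = spark.createDataFrame([(floats,)], ["audio_content"])
```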
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_bahasa_indonesia| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|el| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English T5ForConditionalGeneration Large Cased model (from google) author: John Snow Labs name: t5_efficient_large_el4 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-large-el4` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_large_el4_en_4.3.0_3.0_1675116178104.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_large_el4_en_4.3.0_3.0_1675116178104.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_large_el4","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_large_el4","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_large_el4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|995.5 MB| ## References - https://huggingface.co/google/t5-efficient-large-el4 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English T5ForConditionalGeneration Tiny Cased model (from google) author: John Snow Labs name: t5_efficient_tiny_ff6000 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-ff6000` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_ff6000_en_4.3.0_3.0_1675123549632.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_ff6000_en_4.3.0_3.0_1675123549632.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_tiny_ff6000","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_tiny_ff6000","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_tiny_ff6000| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|86.1 MB| ## References - https://huggingface.co/google/t5-efficient-tiny-ff6000 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Translate English to Oromo Pipeline author: John Snow Labs name: translate_en_om date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, om, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `om` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_om_xx_2.7.0_2.4_1609690425970.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_om_xx_2.7.0_2.4_1609690425970.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_om", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_om", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.om').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_om| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from ioanfr) author: John Snow Labs name: distilbert_qa_ioanfr_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `ioanfr`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_ioanfr_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771445817.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_ioanfr_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771445817.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ioanfr_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ioanfr_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_ioanfr_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/ioanfr/distilbert-base-uncased-finetuned-squad --- layout: model title: ELECTRA Sentence Embeddings(ELECTRA Large) author: John Snow Labs name: sent_electra_large_uncased date: 2020-08-27 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description ELECTRA is a BERT-like model that is pre-trained as a discriminator in a set-up resembling a generative adversarial network (GAN). It was originally published by: Kevin Clark and Minh-Thang Luong and Quoc V. Le and Christopher D. Manning: [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://openreview.net/forum?id=r1xMH1BtvB), ICLR 2020. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_electra_large_uncased_en_2.6.0_2.4_1598489955147.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_electra_large_uncased_en_2.6.0_2.4_1598489955147.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_electra_large_uncased", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_electra_large_uncased", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.electra_large_uncased').predict(text, output_level='sentence') embeddings_df ```
{:.h2_title} ## Results ```bash sentence en_embed_sentence_electra_large_uncased_embeddings I hate cancer [-0.5168956518173218, -0.4284093976020813, -0.... Antibiotics aren't painkiller [0.03924501687288284, 0.28086787462234497, 0.3... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_electra_large_uncased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|1024| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/google/electra_large/2 --- layout: model title: English asr_wav2vec2_base_timit_demo_colab0_by_tahazakir TFWav2Vec2ForCTC from tahazakir author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab0_by_tahazakir date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab0_by_tahazakir` is an English model originally trained by tahazakir. 
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab0_by_tahazakir_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab0_by_tahazakir_en_4.2.0_3.0_1664037102995.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab0_by_tahazakir_en_4.2.0_3.0_1664037102995.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab0_by_tahazakir', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab0_by_tahazakir", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab0_by_tahazakir| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Classifier for Adverse Drug Events author: John Snow Labs name: classifierdl_ade_biobert date: 2021-01-21 task: Text Classification language: en nav_key: models edition: Healthcare NLP 2.7.1 spark_version: 2.4 tags: [licensed, clinical, en, classifier] supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Classify text/sentences into two categories: - `True` : The sentence is talking about a possible ADE - `False` : The sentence doesn't have any information about an ADE. ## Predicted Entities `True`, `False` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifierdl_ade_biobert_en_2.7.1_2.4_1611243410222.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifierdl_ade_biobert_en_2.7.1_2.4_1611243410222.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained('biobert_pubmed_base_cased')\ .setInputCols(["document", 'token'])\ .setOutputCol("word_embeddings") sentence_embeddings = SentenceEmbeddings() \ .setInputCols(["document", "word_embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") classifier = ClassifierDLModel.pretrained('classifierdl_ade_biobert', 'en', 'clinical/models')\ .setInputCols(['document', 'token', 'sentence_embeddings'])\ .setOutputCol('class') nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings, sentence_embeddings, classifier]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate(["I feel a bit drowsy & have a little blurred vision after taking an insulin", "I feel great after taking tylenol"]) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("document", "token")) .setOutputCol("word_embeddings") val sentence_embeddings = new SentenceEmbeddings() .setInputCols(Array("document", "word_embeddings")) .setOutputCol("sentence_embeddings") .setPoolingStrategy("AVERAGE") val classifier = ClassifierDLModel.pretrained("classifierdl_ade_biobert", "en", "clinical/models") .setInputCols(Array("document", "token", "sentence_embeddings")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, embeddings, sentence_embeddings, classifier)) val data = Seq("""I feel a bit drowsy & have a little blurred vision after taking an insulin, I feel 
great after taking tylenol""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.ade.biobert").predict("""I feel a bit drowsy & have a little blurred vision after taking an insulin""") ```
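The `SentenceEmbeddings` stage above uses `setPoolingStrategy("AVERAGE")`, which collapses the token-level BioBERT vectors into a single fixed-size sentence vector by element-wise averaging. A minimal pure-Python sketch of that pooling step (illustrative only, not Spark NLP's implementation; the 4-dimensional toy vectors are made up):

```python
def average_pool(token_vectors):
    """Element-wise average of equal-length token vectors; returns one sentence vector."""
    if not token_vectors:
        return []
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / len(token_vectors) for i in range(dim)]

# Three 4-dimensional token embeddings collapse into one 4-dimensional sentence embedding.
tokens = [[1.0, 0.0, 2.0, 4.0],
          [3.0, 2.0, 0.0, 0.0],
          [2.0, 4.0, 1.0, 2.0]]
sentence_vector = average_pool(tokens)
print(sentence_vector)  # [2.0, 2.0, 1.0, 2.0]
```

With the `"SUM"` strategy the vectors would be added instead of averaged; `"AVERAGE"` keeps the sentence vector on the same scale as the individual token vectors.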
## Results ```bash | | text | label | |--:|:---------------------------------------------------------------------------|:------| | 0 | I feel a bit drowsy & have a little blurred vision after taking an insulin | True | | 1 | I feel great after taking tylenol | False | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_ade_biobert| |Compatibility:|Healthcare NLP 2.7.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Dependencies:|biobert_pubmed_base_cased| ## Data Source Trained on a custom dataset comprising CADEC, DRUG-AE and Twimed. ## Benchmarking ```bash label precision recall f1-score support False 0.96 0.94 0.95 6923 True 0.71 0.79 0.75 1359 micro-avg 0.91 0.91 0.91 8282 macro-avg 0.83 0.86 0.85 8282 weighted-avg 0.92 0.91 0.91 8282 ``` --- layout: model title: Legal Authority Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_authority_bert date: 2023-03-05 tags: [en, legal, classification, clauses, authority, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a binary classifier (True, False) for the `Authority` clause type. To use this model, make sure you provide enough context as input. Adding sentence splitters to the pipeline will make the model see only sentences, not the whole text, so it is better to skip them unless you want to do binary classification at the sentence level. 
If you have large legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Authority`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_authority_bert_en_1.0.0_3.0_1678050593552.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_authority_bert_en_1.0.0_3.0_1678050593552.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_authority_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
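The description above recommends paragraph splitting (by multiline) before classifying long contracts. A minimal sketch of that idea, splitting on blank lines with the standard library; this is an assumed simplification, not the workshop notebook's exact code:

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# A toy two-provision contract; each paragraph would be classified separately.
contract = """Section 1. Authority.
The Company has full corporate power to execute this Agreement.

Section 2. Governing Law.
This Agreement shall be governed by Delaware law."""

paragraphs = split_paragraphs(contract)
print(len(paragraphs))  # 2
```

Each resulting paragraph can then be fed to the pipeline above (and to any other binary clause classifiers you add), keeping every input under the 512-token embedding limit.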
## Results ```bash +-----------+ |result| +-----------+ |[Authority]| |[Other]| |[Other]| |[Authority]| +-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_authority_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Training dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Authority 0.93 0.93 0.93 56 Other 0.95 0.95 0.95 82 accuracy - - 0.94 138 macro-avg 0.94 0.94 0.94 138 weighted-avg 0.94 0.94 0.94 138 ``` --- layout: model title: Classifier for Genders - SBERT author: John Snow Labs name: classifierdl_gender_sbert date: 2021-01-21 task: Text Classification language: en nav_key: models edition: Healthcare NLP 2.7.1 spark_version: 2.4 tags: [en, licensed, classifier, clinical] supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model classifies the gender of the patient in the clinical document using context. ## Predicted Entities `Female`, `Male`, `Unknown` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CLASSIFICATION_GENDER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/21_Gender_Classifier.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifierdl_gender_sbert_en_2.7.1_2.4_1611248306976.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifierdl_gender_sbert_en_2.7.1_2.4_1611248306976.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli", 'en', 'clinical/models')\ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") gender_classifier = ClassifierDLModel.pretrained( 'classifierdl_gender_sbert', 'en', 'clinical/models') \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") nlp_pipeline = Pipeline(stages=[document_assembler, sbert_embedder, gender_classifier]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate("""social history: shows that does not smoke cigarettes or drink alcohol, lives in a nursing home. family history: shows a family history of breast cancer.""") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.gender.sbert").predict("""social history: shows that does not smoke cigarettes or drink alcohol, lives in a nursing home. family history: shows a family history of breast cancer.""") ```
## Results ```bash Female ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_gender_sbert| |Compatibility:|Healthcare NLP 2.7.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Dependencies:|sbiobert_base_cased_mli| ## Data Source This model is trained on more than four thousand clinical documents (radiology reports, pathology reports, clinical visits, etc.), annotated internally. ## Benchmarking ```bash precision recall f1-score support Female 0.9390 0.9747 0.9565 237 Male 0.9561 0.8720 0.9121 125 Unknown 0.8491 0.8824 0.8654 51 accuracy 0.9322 413 macro avg 0.9147 0.9097 0.9113 413 weighted avg 0.9331 0.9322 0.9318 413 ``` --- layout: model title: Twitter XLM-RoBERTa Base (twitter_xlm_roberta_base) author: John Snow Labs name: twitter_xlm_roberta_base date: 2021-05-25 tags: [xx, multilingual, embeddings, xlm_roberta, open_source] task: Embeddings language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description [XLM-RoBERTa](https://ai.facebook.com/blog/-xlm-r-state-of-the-art-cross-lingual-understanding-through-self-supervision/) is a scaled cross-lingual sentence encoder. It is trained on 2.5TB of Common Crawl data filtered to cover 100 languages, and achieves state-of-the-art results on multiple cross-lingual benchmarks. The XLM-RoBERTa model was proposed in [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. It is based on Facebook's RoBERTa model released in 2019 and is a large multilingual language model trained on 2.5TB of filtered CommonCrawl data. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/twitter_xlm_roberta_base_xx_3.1.0_2.4_1621962368880.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/twitter_xlm_roberta_base_xx_3.1.0_2.4_1621962368880.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python embeddings = XlmRoBertaEmbeddings.pretrained("twitter_xlm_roberta_base", "xx") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = XlmRoBertaEmbeddings.pretrained("twitter_xlm_roberta_base", "xx") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("xx.embed.xlm.twitter").predict("""Put your text here.""") ```
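A common downstream use of the resulting `embeddings` column is comparing vectors with cosine similarity. A small self-contained sketch of the metric (the 2-dimensional vectors are toy values, not real XLM-R outputs):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Because the score depends only on direction, it is scale-invariant, which makes it a reasonable default for comparing contextual embeddings of different tokens or sentences.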
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|twitter_xlm_roberta_base| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[embeddings]| |Language:|xx| |Case sensitive:|true| ## Data Source - [https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base](https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base) - [https://github.com/cardiffnlp/xlm-t](https://github.com/cardiffnlp/xlm-t) --- layout: model title: Arabic Bert Embeddings (from Kamel) author: John Snow Labs name: bert_embeddings_DarijaBERT date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `DarijaBERT` is an Arabic model originally trained by `Kamel`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_DarijaBERT_ar_3.4.2_3.0_1649677857720.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_DarijaBERT_ar_3.4.2_3.0_1649677857720.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_DarijaBERT","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_DarijaBERT","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.DarijaBERT").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_DarijaBERT| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|554.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/Kamel/DarijaBERT - https://github.com/AIOXLABS/DBert --- layout: model title: Word2Vec Embeddings in Polish (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, pl, open_source] task: Embeddings language: pl edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_pl_3.4.1_3.0_1647451641210.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_pl_3.4.1_3.0_1647451641210.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","pl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Kocham iskra NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","pl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Kocham iskra NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pl.embed.w2v_cc_300d").predict("""Kocham iskra NLP""") ```
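As the description notes, `WordEmbeddingsModel` is a lookup annotator: each token is mapped to a stored vector, and since the model card lists `Case sensitive: false`, matching here ignores case. A toy sketch of that behaviour (the 3-dimensional table is invented, the real model uses 300 dimensions, and mapping out-of-vocabulary tokens to a zero vector is an assumption about the default handling):

```python
DIM = 3  # the real w2v_cc_300d model uses 300 dimensions

# Hypothetical vocabulary table; real tables hold millions of entries.
table = {"kocham": [0.1, 0.2, 0.3], "nlp": [0.4, 0.5, 0.6]}

def lookup(token):
    """Case-insensitive lookup; out-of-vocabulary tokens get a zero vector."""
    return table.get(token.lower(), [0.0] * DIM)

print(lookup("Kocham"))  # [0.1, 0.2, 0.3]
print(lookup("iskra"))   # [0.0, 0.0, 0.0] (out of vocabulary)
```

Unlike contextual models such as BERT, a lookup table assigns every occurrence of a word the same vector regardless of its sentence context.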
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|pl| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English BertForQuestionAnswering model (from aodiniz) author: John Snow Labs name: bert_qa_bert_uncased_L_4_H_512_A_8_cord19_200616_squad2_covid_qna date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-4_H-512_A-8_cord19-200616_squad2_covid-qna` is an English model originally trained by `aodiniz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_512_A_8_cord19_200616_squad2_covid_qna_en_4.0.0_3.0_1654185294283.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_512_A_8_cord19_200616_squad2_covid_qna_en_4.0.0_3.0_1654185294283.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_4_H_512_A_8_cord19_200616_squad2_covid_qna","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_uncased_L_4_H_512_A_8_cord19_200616_squad2_covid_qna","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_covid_cord19.bert.uncased_4l_512d_a8a_512d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
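Internally, span-extraction QA models such as `BertForQuestionAnswering` score every context token as a candidate answer start and as a candidate answer end, then pick the span with the highest combined score (start no later than end, length bounded). A simplified sketch with invented logits for the example context; the real model computes its logits from BERT rather than from hand-written numbers:

```python
def best_span(start_logits, end_logits, max_len=10):
    """Return (start, end) maximizing start_logits[s] + end_logits[e] with s <= e."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.2, 0.1, 2.5, 0.1, 0.1, 0.1, 0.1, 0.3, 0.0]
end_logits   = [0.1, 0.1, 0.1, 2.8, 0.1, 0.1, 0.1, 0.1, 0.4, 0.0]
s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # Clara
```

The `setMaxSentenceLength(512)` call in the Scala snippet bounds how much question-plus-context text the encoder sees; the span search itself is cheap by comparison.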
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_uncased_L_4_H_512_A_8_cord19_200616_squad2_covid_qna| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|107.2 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-4_H-512_A-8_cord19-200616_squad2_covid-qna --- layout: model title: English Bert Embeddings (from GroNLP) author: John Snow Labs name: bert_embeddings_hateBERT date: 2022-04-11 tags: [bert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `hateBERT` is an English model originally trained by `GroNLP`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_hateBERT_en_3.4.2_3.0_1649671840480.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_hateBERT_en_3.4.2_3.0_1649671840480.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_hateBERT","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_hateBERT","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.hateBERT").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_hateBERT| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|409.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/GroNLP/hateBERT - https://www.semanticscholar.org/author/Tommaso-Caselli/1864635 - https://www.semanticscholar.org/author/Valerio-Basile/3101511 - https://www.semanticscholar.org/author/Jelena-Mitrovic/145157863 - https://www.semanticscholar.org/author/M.-Granitzer/2389675 - https://aclanthology.org/2021.woah-1.3/ - https://osf.io/tbd58/?view_only=cb79b3228d4248ddb875eb1803525ad8 --- layout: model title: French CamemBert Embeddings (from linyi) author: John Snow Labs name: camembert_embeddings_linyi_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `linyi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_linyi_generic_model_fr_3.4.4_3.0_1653989548358.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_linyi_generic_model_fr_3.4.4_3.0_1653989548358.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_linyi_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_linyi_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_linyi_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/linyi/dummy-model --- layout: model title: English image_classifier_vit_finetuned_eurosat_kornia ViTForImageClassification from nielsr author: John Snow Labs name: image_classifier_vit_finetuned_eurosat_kornia date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_finetuned_eurosat_kornia` is an English model originally trained by nielsr. ## Predicted Entities `Residential`, `AnnualCrop`, `Highway`, `Pasture`, `SeaLake`, `Industrial`, `HerbaceousVegetation`, `River`, `PermanentCrop`, `Forest` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_finetuned_eurosat_kornia_en_4.1.0_3.0_1660165618186.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_finetuned_eurosat_kornia_en_4.1.0_3.0_1660165618186.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_finetuned_eurosat_kornia", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_finetuned_eurosat_kornia", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
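`ViTForImageClassification` produces one logit per label, and the `class` output is the label with the highest logit; softmax turns the logits into probabilities. A sketch over a hypothetical subset of this model's labels (the logit values are invented):

```python
import math

labels = ["Residential", "AnnualCrop", "Highway", "Forest"]  # subset of the 10 classes

def softmax(logits):
    """Numerically stable softmax: shift by the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.2, 3.1, 0.5, 1.0]
probs = softmax(logits)
prediction = labels[probs.index(max(probs))]
print(prediction)  # AnnualCrop
```

Since softmax is monotonic, the argmax of the probabilities always matches the argmax of the raw logits; the probabilities are only needed when you want calibrated-looking confidence scores.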
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_finetuned_eurosat_kornia| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English BertForQuestionAnswering model (from healx) author: John Snow Labs name: bert_qa_biomedical_slot_filling_reader_large date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `biomedical-slot-filling-reader-large` is an English model originally trained by `healx`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_biomedical_slot_filling_reader_large_en_4.0.0_3.0_1654185806879.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_biomedical_slot_filling_reader_large_en_4.0.0_3.0_1654185806879.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_biomedical_slot_filling_reader_large","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_biomedical_slot_filling_reader_large","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bio_medical.bert.large").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_biomedical_slot_filling_reader_large| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/healx/biomedical-slot-filling-reader-large - https://arxiv.org/abs/2109.08564 --- layout: model title: English image_classifier_vit_pond_image_classification_11 ViTForImageClassification from SummerChiam author: John Snow Labs name: image_classifier_vit_pond_image_classification_11 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_pond_image_classification_11` is an English model originally trained by SummerChiam. ## Predicted Entities `NormalCement0`, `Boiling0`, `NormalNight0`, `Algae0`, `BoilingNight0`, `NormalRain0`, `Normal0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_11_en_4.1.0_3.0_1660168212591.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_11_en_4.1.0_3.0_1660168212591.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
# Load the images to classify into a DataFrame (the path is a placeholder)
imageDF = spark.read.format("image").option("dropInvalid", True).load("path/to/images")

image_assembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

imageClassifier = ViTForImageClassification \
    .pretrained("image_classifier_vit_pond_image_classification_11", "en") \
    .setInputCols("image_assembler") \
    .setOutputCol("class")

pipeline = Pipeline(stages=[image_assembler, imageClassifier])

pipelineModel = pipeline.fit(imageDF)
pipelineDF = pipelineModel.transform(imageDF)
```
```scala
// Load the images to classify into a DataFrame (the path is a placeholder)
val imageDF = spark.read.format("image").option("dropInvalid", value = true).load("path/to/images")

val imageAssembler = new ImageAssembler()
    .setInputCol("image")
    .setOutputCol("image_assembler")

val imageClassifier = ViTForImageClassification
    .pretrained("image_classifier_vit_pond_image_classification_11", "en")
    .setInputCols("image_assembler")
    .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier))

val pipelineModel = pipeline.fit(imageDF)
val pipelineDF = pipelineModel.transform(imageDF)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_pond_image_classification_11| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Fast Neural Machine Translation Model from English to Semitic Languages author: John Snow Labs name: opus_mt_en_sem date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, sem, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `sem` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_sem_xx_2.7.0_2.4_1609167506537.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_sem_xx_2.7.0_2.4_1609167506537.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_en_sem", "xx") \
    .setInputCols(["sentence"]) \
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

data = ["Your text to translate."]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_en_sem", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("Your text to translate.").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.en.marian.translate_to.sem').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_en_sem|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: SDOH Under Treatment For Classification
author: John Snow Labs
name: genericclassifier_sdoh_under_treatment_sbiobert_cased_mli
date: 2023-05-30
tags: [en, licensed, sdoh, generic_classifier, under_treatment, biobert]
task: Text Classification
language: en
edition: Healthcare NLP 4.4.2
spark_version: 3.0
supported: true
annotator: GenericClassifierModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This Generic Classifier model detects whether the patient is under treatment. If treatment status is not mentioned in the text, the document is labeled `Not_Under_Treatment_Or_Not_Mentioned`. The model was trained with the GenericClassifierApproach annotator.

`Under_Treatment`: The patient is under treatment.

`Not_Under_Treatment_Or_Not_Mentioned`: The patient is not under treatment, or it is not mentioned in the clinical notes.
## Predicted Entities `Under_Treatment`, `Not_Under_Treatment_Or_Not_Mentioned` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/SOCIAL_DETERMINANT_CLASSIFICATION_GENERIC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/SOCIAL_DETERMINANT_CLASSIFICATION.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_under_treatment_sbiobert_cased_mli_en_4.4.2_3.0_1685475327309.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_under_treatment_sbiobert_cased_mli_en_4.4.2_3.0_1685475327309.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") features_asm = FeaturesAssembler()\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("features") generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_under_treatment_sbiobert_cased_mli", 'en', 'clinical/models')\ .setInputCols(["features"])\ .setOutputCol("prediction") pipeline = Pipeline(stages=[ document_assembler, sentence_embeddings, features_asm, generic_classifier ]) text_list = ["Patient is a 50-year-old male who was diagnosed with hepatitis C. He has received a treatment plan that includes medication and regular monitoring of his liver function.", "Patient has been living with chronic migraines for several years. She has not pursued any specific treatment for her migraines and has been managing her condition through lifestyle modifications such as stress reduction techniques and avoiding triggers.", """Sarah, a 55-year-old woman with a history of high cholesterol and a family history of heart disease, presented to her primary care physician with complaints of chest pain and shortness of breath. After a thorough evaluation, Sarah was diagnosed with coronary artery disease (CAD), a condition that can lead to heart attacks and other serious complications. To manage her CAD, Sarah was started on a treatment plan that included medication to lower her cholesterol and blood pressure, as well as aspirin to prevent blood clots. In addition to medication, Sarah was advised to make lifestyle modifications such as improving her diet, quitting smoking, and increasing physical activity. 
Over the course of several months, Sarah's symptoms improved, and follow-up tests showed that her cholesterol and blood pressure were within the target range. However, Sarah continued to experience occasional chest pain, and her medication regimen was adjusted accordingly. With regular follow-up appointments and adherence to her treatment plan, Sarah's CAD remained under control, and she was able to resume her normal activities with improved quality of life. """, """John, a 60-year-old man with a history of smoking and high blood pressure, presented to his primary care physician with complaints of chest pain and shortness of breath. Further tests revealed that John had a blockage in one of his coronary arteries, which required urgent intervention. However, John was hesitant to undergo treatment, citing concerns about potential complications and side effects of medications and procedures. Despite the physician's recommendations and attempts to educate John about the risks of leaving the blockage untreated, John ultimately chose not to pursue any treatment. Over the next several months, John continued to experience symptoms, which progressively worsened, and he ultimately required hospitalization for a heart attack. The medical team attempted to intervene at that point, but the damage to John's heart was severe, and his prognosis was poor. 
"""]

df = spark.createDataFrame(text_list, StringType()).toDF("text")

result = pipeline.fit(df).transform(df)

result.select("text", "prediction.result").show(truncate=100)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")

val features_asm = new FeaturesAssembler()
    .setInputCols("sentence_embeddings")
    .setOutputCol("features")

val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_under_treatment_sbiobert_cased_mli", "en", "clinical/models")
    .setInputCols("features")
    .setOutputCol("prediction")

val pipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence_embeddings,
    features_asm,
    generic_classifier))

val data = Seq(
    "Patient is a 50-year-old male who was diagnosed with hepatitis C. He has received a treatment plan that includes medication and regular monitoring of his liver function.",
    "Patient has been living with chronic migraines for several years. She has not pursued any specific treatment for her migraines and has been managing her condition through lifestyle modifications such as stress reduction techniques and avoiding triggers.",
    """Sarah, a 55-year-old woman with a history of high cholesterol and a family history of heart disease, presented to her primary care physician with complaints of chest pain and shortness of breath. After a thorough evaluation, Sarah was diagnosed with coronary artery disease (CAD), a condition that can lead to heart attacks and other serious complications. To manage her CAD, Sarah was started on a treatment plan that included medication to lower her cholesterol and blood pressure, as well as aspirin to prevent blood clots. In addition to medication, Sarah was advised to make lifestyle modifications such as improving her diet, quitting smoking, and increasing physical activity. Over the course of several months, Sarah's symptoms improved, and follow-up tests showed that her cholesterol and blood pressure were within the target range. However, Sarah continued to experience occasional chest pain, and her medication regimen was adjusted accordingly. With regular follow-up appointments and adherence to her treatment plan, Sarah's CAD remained under control, and she was able to resume her normal activities with improved quality of life.""",
    """John, a 60-year-old man with a history of smoking and high blood pressure, presented to his primary care physician with complaints of chest pain and shortness of breath. Further tests revealed that John had a blockage in one of his coronary arteries, which required urgent intervention. However, John was hesitant to undergo treatment, citing concerns about potential complications and side effects of medications and procedures. Despite the physician's recommendations and attempts to educate John about the risks of leaving the blockage untreated, John ultimately chose not to pursue any treatment. Over the next several months, John continued to experience symptoms, which progressively worsened, and he ultimately required hospitalization for a heart attack. The medical team attempted to intervene at that point, but the damage to John's heart was severe, and his prognosis was poor."""
).toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
## Results

```bash
+----------------------------------------------------------------------------------------------------+--------------------------------------+
|                                                                                                text|                                result|
+----------------------------------------------------------------------------------------------------+--------------------------------------+
|Patient is a 50-year-old male who was diagnosed with hepatitis C. He has received a treatment pla...|                     [Under_Treatment]|
|Patient has been living with chronic migraines for several years. She has not pursued any specifi...|[Not_Under_Treatment_Or_Not_Mentioned]|
|Sarah, a 55-year-old woman with a history of high cholesterol and a family history of heart disea...|                     [Under_Treatment]|
|John, a 60-year-old man with a history of smoking and high blood pressure, presented to his prima...|[Not_Under_Treatment_Or_Not_Mentioned]|
+----------------------------------------------------------------------------------------------------+--------------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|genericclassifier_sdoh_under_treatment_sbiobert_cased_mli|
|Compatibility:|Healthcare NLP 4.4.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[features]|
|Output Labels:|[prediction]|
|Language:|en|
|Size:|3.4 MB|
|Dependencies:|sbiobert_base_cased_mli|

## References

Internal SDOH Project

## Benchmarking

```bash
                               label  precision  recall  f1-score  support
Not_Under_Treatment_Or_Not_Mentioned       0.88    0.75      0.81      281
                     Under_Treatment       0.86    0.93      0.90      458
                            accuracy          -       -      0.86      739
                           macro-avg       0.87    0.84      0.85      739
                        weighted-avg       0.87    0.86      0.86      739
```

---
layout: model
title: NER Model Finder
author: John Snow Labs
name: ner_model_finder
date: 2022-01-21
tags: [pretrainedpipeline, clinical, ner, en, licensed]
task: Pipeline Healthcare
language: en
nav_key: models
edition: Healthcare NLP 3.3.2
spark_version: 2.4
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is trained with BERT embeddings and can be used to find the most appropriate NER model given an entity name.

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_MODEL_FINDER/){:.button.button-orange}
[Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.Pretrained_Clinical_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_model_finder_en_3.3.2_2.4_1642758002888.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_model_finder_en_3.3.2_2.4_1642758002888.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline ner_pipeline = PretrainedPipeline("ner_model_finder", "en", "clinical/models") result = ner_pipeline.annotate("medication") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val ner_pipeline = PretrainedPipeline("ner_model_finder","en","clinical/models") val result = ner_pipeline.annotate("medication") ``` {:.nlu-block} ```python import nlu nlu.load("en.ner.model_finder.pipeline").predict("""Put your text here.""") ```
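`annotate` returns a plain Python dict, and for this pipeline the `model_names` value holds the whole recommendation as a single stringified list (see the Results section of this card), so one extra parsing step is needed. A small sketch, assuming that output shape:

```python
import ast

# Example output shape from ner_model_finder (shortened; see the Results section)
output = {'model_names': ["['ner_posology', 'ner_posology_large', 'ner_drugs_large']"]}

# The recommendations arrive as one stringified list; parse it safely with ast
model_names = ast.literal_eval(output['model_names'][0])
print(model_names[0])  # ner_posology
```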
## Results ```bash {'model_names': ["['ner_posology', 'ner_posology_large', 'ner_posology_small', 'ner_posology_greedy', 'ner_drugs_large', 'ner_posology_experimental', 'ner_drugs_greedy', 'ner_ade_clinical', 'ner_jsl_slim', 'ner_posology_healthcare', 'ner_ade_healthcare', 'jsl_ner_wip_modifier_clinical', 'ner_ade_clinical', 'ner_jsl_greedy', 'ner_risk_factors']"]} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_model_finder| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|155.8 MB| ## Included Models - DocumentAssembler - BertSentenceEmbeddings - SentenceEntityResolverModel - Finisher --- layout: model title: Word2Vec Embeddings in Chinese (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_zh_3.4.1_3.0_1647290907204.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_zh_3.4.1_3.0_1647290907204.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","zh") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])

data = spark.createDataFrame([["我 喜欢 Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","zh")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))

val data = Seq("我 喜欢 Spark NLP").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("zh.embed.w2v_cc_300d").predict("""我 喜欢 Spark NLP""")
```
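Each token in the `embeddings` column carries a 300-dimensional vector. A common next step is comparing two word vectors by cosine similarity; a minimal plain-Python sketch, with short stub vectors standing in for the real 300-d ones:

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Stub vectors for illustration; real w2v_cc_300d vectors have 300 components
print(round(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]), 2))  # 1.0
```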
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|zh| |Size:|1.4 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Fast Neural Machine Translation Model from Baltic languages to English author: John Snow Labs name: opus_mt_bat_en date: 2021-06-01 tags: [open_source, seq2seq, translation, bat, en, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). source languages: bat target languages: en {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bat_en_xx_3.1.0_2.4_1622552991351.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bat_en_xx_3.1.0_2.4_1622552991351.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_bat_en", "xx") \
    .setInputCols(["sentence"]) \
    .setOutputCol("translation")

pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])

data = spark.createDataFrame([["Your text to translate."]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_bat_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("Your text to translate.").toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Baltic languages.translate_to.English').predict(text, output_level='sentence')
translate_df
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_bat_en| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Multilingual XLMRobertaForTokenClassification Cased model (from opensource) author: John Snow Labs name: xlmroberta_ner_extract_names date: 2022-08-13 tags: [xx, open_source, xlm_roberta, ner] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `extract_names` is a Multilingual model originally trained by `opensource`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_extract_names_xx_4.1.0_3.0_1660422227464.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_extract_names_xx_4.1.0_3.0_1660422227464.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_extract_names","xx") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_extract_names","xx")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("document", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
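The NerConverter stage merges the token-level B-/I- tags in the `ner` column into labeled entity chunks (`ner_chunk`). A plain-Python sketch of that grouping logic, using this card's `PER`/`LOC` labels (illustrative only, not Spark NLP's actual implementation):

```python
def bio_to_chunks(tokens, tags):
    """Group token-level BIO tags (e.g. B-PER, I-PER, O) into labeled chunks."""
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = [tag[2:], [token]]
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(token)
        else:  # "O", or an I- tag that does not continue the open chunk
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(words)) for label, words in chunks]

print(bio_to_chunks(
    ["Marie", "Curie", "worked", "in", "Paris"],
    ["B-PER", "I-PER", "O", "O", "B-LOC"],
))  # [('PER', 'Marie Curie'), ('LOC', 'Paris')]
```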
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_extract_names| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|967.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/opensource/extract_names --- layout: model title: Fast Neural Machine Translation Model from Macedonian to English author: John Snow Labs name: opus_mt_mk_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, mk, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `mk` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_mk_en_xx_2.7.0_2.4_1609165073677.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_mk_en_xx_2.7.0_2.4_1609165073677.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_mk_en", "xx") \
    .setInputCols(["sentence"]) \
    .setOutputCol("translation")

marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian])
light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

data = ["Your text to translate."]
result = light_pipeline.fullAnnotate(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_mk_en", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian))

val data = Seq("Your text to translate.").toDS.toDF("text")
val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
opus_df = nlu.load('xx.mk.marian.translate_to.en').predict(text, output_level='sentence')
opus_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_mk_en|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English RobertaForQuestionAnswering (from Mr-Wick)
author: John Snow Labs
name: roberta_qa_Roberta
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Roberta` is an English model originally trained by `Mr-Wick`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_Roberta_en_4.0.0_3.0_1655727223761.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_Roberta_en_4.0.0_3.0_1655727223761.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_Roberta","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([document_assembler, spanClassifier])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = RoBertaForQuestionAnswering
    .pretrained("roberta_qa_Roberta","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.roberta.by_Mr-Wick").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_Roberta| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|461.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Mr-Wick/Roberta --- layout: model title: Pipeline to Mapping ICD10-CM Codes with Their Corresponding UMLS Codes author: John Snow Labs name: icd10cm_umls_mapping date: 2022-06-27 tags: [icd10cm, umls, pipeline, chunk_mapper, clinical, licensed, en] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of `icd10cm_umls_mapper` model. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10cm_umls_mapping_en_3.5.3_3.0_1656366054366.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10cm_umls_mapping_en_3.5.3_3.0_1656366054366.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("icd10cm_umls_mapping", "en", "clinical/models") result = pipeline.fullAnnotate(["M8950", "R822", "R0901"]) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("icd10cm_umls_mapping", "en", "clinical/models") val result = pipeline.fullAnnotate(Array("M8950", "R822", "R0901")) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd10cm.umls").predict("""Put your text here.""") ```
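Conceptually, the `ChunkMapperModel` stage in this pipeline resolves each recognized ICD-10-CM code through a lookup resource. A toy Python sketch of that idea, using three example codes; the dictionary below is a hypothetical mini-table for illustration, not the model's actual mapping resource:

```python
# Toy illustration of what the chunk-mapper stage does conceptually:
# each recognized ICD-10-CM code chunk is resolved through a lookup table.
# This three-entry dictionary is illustrative only; the real model ships
# a much larger mapping resource.
icd10cm_to_umls = {
    "M8950": "C4721411",
    "R822": "C0159076",
    "R0901": "C0004044",
}

def map_codes(codes):
    """Return (code, umls_code) pairs; umls_code is None for unknown codes."""
    return [(code, icd10cm_to_umls.get(code)) for code in codes]

pairs = map_codes(["M8950", "R822", "R0901"])
```

The real annotator additionally carries character offsets and metadata for each mapped chunk.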
## Results ```bash | | icd10cm_code | umls_code | |---:|:---------------|:------------| | 0 | M8950 | C4721411 | | 1 | R822 | C0159076 | | 2 | R0901 | C0004044 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd10cm_umls_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|952.4 KB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel --- layout: model title: Catalan, Valencian asr_Wav2Vec2_Large_XLSR_53_catalan TFWav2Vec2ForCTC from PereLluis13 author: John Snow Labs name: asr_Wav2Vec2_Large_XLSR_53_catalan date: 2022-09-24 tags: [wav2vec2, ca, audio, open_source, asr] task: Automatic Speech Recognition language: ca edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Wav2Vec2_Large_XLSR_53_catalan` is a Catalan, Valencian model originally trained by PereLluis13. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_Wav2Vec2_Large_XLSR_53_catalan_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Wav2Vec2_Large_XLSR_53_catalan_ca_4.2.0_3.0_1664036578113.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Wav2Vec2_Large_XLSR_53_catalan_ca_4.2.0_3.0_1664036578113.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_Wav2Vec2_Large_XLSR_53_catalan", "ca")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_Wav2Vec2_Large_XLSR_53_catalan", "ca") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
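The `audioDf` in the snippets above is assumed to hold the raw audio as an array of floats in an `audio_content` column; Wav2Vec2 models are typically trained on 16 kHz mono audio. A minimal stdlib sketch of decoding a 16-bit PCM mono WAV into normalized floats — the synthetic tone below only stands in for a real recording:

```python
import io
import math
import struct
import wave

def wav_to_floats(wav_bytes):
    """Decode a 16-bit PCM mono WAV into floats in [-1.0, 1.0]."""
    with wave.open(io.BytesIO(wav_bytes)) as wf:
        assert wf.getsampwidth() == 2 and wf.getnchannels() == 1
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a 0.1 s, 16 kHz mono test tone in memory (stand-in for a real file).
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    tone = b"".join(
        struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * 440 * i / 16000)))
        for i in range(1600)
    )
    wf.writeframes(tone)

floats = wav_to_floats(buf.getvalue())
```

Each list of floats would then become one row of the `audio_content` column fed to `AudioAssembler`.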
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_Wav2Vec2_Large_XLSR_53_catalan| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ca| |Size:|1.2 GB| --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from Cole) author: John Snow Labs name: xlmroberta_ner_cole_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `Cole`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_cole_base_finetuned_panx_de_4.1.0_3.0_1660429422413.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_cole_base_finetuned_panx_de_4.1.0_3.0_1660429422413.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_cole_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_cole_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
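The `NerConverter` stage merges the token-level IOB tags emitted by the classifier (`B-PER`, `I-PER`, `O`, ...) into entity chunks. A minimal illustration of that grouping logic — not the annotator's actual implementation, which also tracks character offsets and metadata:

```python
def iob_to_chunks(tokens, tags):
    """Group (token, IOB-tag) pairs into (label, chunk_text) entities."""
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = (tag[2:], [token])           # start a new chunk
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)               # extend the open chunk
        else:                                      # "O" or a broken I- tag
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(words)) for label, words in chunks]

chunks = iob_to_chunks(
    ["Angela", "Merkel", "besuchte", "Berlin"],
    ["B-PER", "I-PER", "O", "B-LOC"],
)
```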
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_cole_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Cole/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Legal Independent Contractor Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_independent_contractor_agreement_bert date: 2023-01-26 tags: [en, legal, classification, contractor, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_independent_contractor_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `independent-contractor-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `independent-contractor-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_independent_contractor_agreement_bert_en_1.0.0_3.0_1674732769102.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_independent_contractor_agreement_bert_en_1.0.0_3.0_1674732769102.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_independent_contractor_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
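As a sanity check, the macro and weighted averages reported in the Benchmarking section can be recomputed from the per-class F1 scores and supports (values copied from that table):

```python
# Per-class (f1, support) pairs from the benchmarking table.
scores = {
    "independent-contractor-agreement": (0.97, 67),
    "other": (0.98, 116),
}

total = sum(n for _, n in scores.values())
# Macro average: unweighted mean over classes.
macro_f1 = sum(f1 for f1, _ in scores.values()) / len(scores)
# Weighted average: mean weighted by class support.
weighted_f1 = sum(f1 * n for f1, n in scores.values()) / total
# macro_f1 == 0.975 and weighted_f1 is about 0.976; both round to the
# reported 0.98.
```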
## Results ```bash +-------+ |result| +-------+ |[independent-contractor-agreement]| |[other]| |[other]| |[independent-contractor-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_independent_contractor_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support independent-contractor-agreement 0.98 0.96 0.97 67 other 0.97 0.99 0.98 116 accuracy - - 0.98 183 macro-avg 0.98 0.97 0.98 183 weighted-avg 0.98 0.98 0.98 183 ``` --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_news_pretrain_roberta_FT_new_newsqa date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `news_pretrain_roberta_FT_new_newsqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_news_pretrain_roberta_FT_new_newsqa_en_4.0.0_3.0_1655729181858.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_news_pretrain_roberta_FT_new_newsqa_en_4.0.0_3.0_1655729181858.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_news_pretrain_roberta_FT_new_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_news_pretrain_roberta_FT_new_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.roberta.qa_ft_new.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_news_pretrain_roberta_FT_new_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|466.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/news_pretrain_roberta_FT_new_newsqa --- layout: model title: Polish DistilBertForTokenClassification Cased model (from clarin-pl) author: John Snow Labs name: distilbert_token_classifier_fastpdn_distiluse date: 2023-03-14 tags: [pl, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: pl edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `FastPDN-distiluse` is a Polish model originally trained by `clarin-pl`. 
## Predicted Entities `nam_num_phone`, `nam_loc_land_mountain`, `nam_loc_gpe_subdivision`, `nam_adj_country`, `nam_oth_currency`, `nam_loc_country_region`, `nam_pro_title_document`, `nam_loc_gpe_admin1`, `nam_eve_human_sport`, `nam_loc_hydronym_sea`, `nam_fac_park`, `nam_adj_city`, `nam_loc_land_region`, `nam_liv_animal`, `nam_liv_person`, `nam_pro_media_tv`, `nam_fac_bridge`, `nam_pro_model_car`, `nam_oth_tech`, `nam_oth_position`, `nam_loc_land_island`, `nam_liv_habitant`, `nam_pro_award`, `nam_pro_title_article`, `nam_org_group`, `nam_num_house`, `nam_pro_title_book`, `nam_pro_media_periodic`, `nam_pro_media_web`, `nam_pro_title_treaty`, `nam_loc_gpe_conurbation`, `nam_pro_software_game`, `nam_pro_brand`, `nam_fac_goe`, `nam_loc_historical_region`, `nam_pro`, `nam_pro_media_radio`, `nam_pro_title`, `nam_loc_hydronym_river`, `nam_loc_land`, `nam_org_group_team`, `nam_fac_system`, `nam_org_company`, `nam_pro_title_song`, `nam_loc_land_peak`, `nam_eve`, `nam_loc_hydronym_ocean`, `nam_org_group_band`, `nam_liv_character`, `nam_loc_gpe_admin2`, `nam_org_organization`, `nam_adj_person`, `nam_eve_human`, `nam_org_nation`, `nam_loc_gpe_district`, `nam_liv_god`, `nam_org_political_party`, `nam_oth_data_format`, `nam_loc_land_continent`, `nam_fac_goe_stop`, `nam_loc`, `nam_oth`, `nam_loc_gpe_admin3`, `nam_pro_media`, `nam_loc_gpe_city`, `nam_loc_hydronym_lake`, `nam_pro_title_tv`, `nam_oth_license`, `nam_org_organization_sub`, `nam_adj`, `nam_loc_hydronym`, `nam_oth_www`, `nam_org_institution`, `nam_pro_vehicle`, `nam_pro_software`, `nam_loc_gpe_country`, `nam_eve_human_cultural`, `nam_fac_road`, `nam_pro_title_album`, `nam_loc_astronomical`, `nam_eve_human_holiday`, `nam_fac_square` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_fastpdn_distiluse_pl_4.3.1_3.0_1678783020131.zip){:.button.button-orange} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_fastpdn_distiluse_pl_4.3.1_3.0_1678783020131.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_fastpdn_distiluse","pl") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_fastpdn_distiluse","pl") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
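The label set listed under Predicted Entities is hierarchical: `nam_<domain>_<subtype>...`. When post-processing predictions it can be handy to roll fine-grained labels up to their top-level domain; a small sketch, with the label sample taken from the list above:

```python
from collections import defaultdict

# A few fine-grained labels from the Predicted Entities list.
labels = [
    "nam_loc_gpe_city", "nam_loc_hydronym_river", "nam_org_company",
    "nam_liv_person", "nam_pro_software", "nam_org_institution",
]

# Group by the first two underscore-separated components (e.g. "nam_loc").
by_domain = defaultdict(list)
for label in labels:
    parts = label.split("_")
    by_domain["_".join(parts[:2])].append(label)
```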
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_fastpdn_distiluse| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|pl| |Size:|509.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/clarin-pl/FastPDN-distiluse - https://gitlab.clarin-pl.eu/information-extraction/poldeepner2 - https://gitlab.clarin-pl.eu/grupa-wieszcz/ner/fast-pdn - https://clarin-pl.eu/dspace/bitstream/handle/11321/294/WytyczneKPWr-jednostkiidentyfikacyjne.pdf --- layout: model title: English BertForQuestionAnswering Cased model (from motiondew) author: John Snow Labs name: bert_qa_set_date_1_lr_3e_5_bs_32_ep_3 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-set_date_1-lr-3e-5-bs-32-ep-3` is a English model originally trained by `motiondew`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_1_lr_3e_5_bs_32_ep_3_en_4.0.0_3.0_1657188352417.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_set_date_1_lr_3e_5_bs_32_ep_3_en_4.0.0_3.0_1657188352417.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_1_lr_3e_5_bs_32_ep_3","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_set_date_1_lr_3e_5_bs_32_ep_3","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
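Under the hood, extractive QA heads such as `BertForQuestionAnswering` score every context token as a potential answer start and answer end; the predicted answer is the span maximizing `start_logit + end_logit` with start <= end. A toy illustration with made-up logits, not the model's real scores:

```python
def best_span(start_logits, end_logits, max_len=30):
    """Return (start, end) indices of the highest-scoring answer span."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            if s + end_logits[j] > best_score:
                best_score = s + end_logits[j]
                best = (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Hypothetical logits: "Clara" scores high as both start and end.
start = [0.1, 0.2, 0.1, 4.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end   = [0.1, 0.1, 0.1, 3.5, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]
i, j = best_span(start, end)
answer = " ".join(tokens[i:j + 1])
```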
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_set_date_1_lr_3e_5_bs_32_ep_3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/motiondew/bert-set_date_1-lr-3e-5-bs-32-ep-3 --- layout: model title: English DistilBertForQuestionAnswering model (from MYX4567) author: John Snow Labs name: distilbert_qa_MYX4567_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is a English model originally trained by `MYX4567`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_MYX4567_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724267012.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_MYX4567_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724267012.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_MYX4567_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_MYX4567_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_MYX4567_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/MYX4567/distilbert-base-uncased-finetuned-squad --- layout: model title: BioBERT Sentence Embeddings (Clinical) author: John Snow Labs name: sent_biobert_clinical_base_cased date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains the pre-trained weights of ClinicalBERT for generic clinical text. This domain-specific model improves performance on 3/5 clinical NLP tasks and establishes a new state of the art on the MedNLI dataset. The details are described in the paper "[Publicly Available Clinical BERT Embeddings](https://www.aclweb.org/anthology/W19-1909/)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_biobert_clinical_base_cased_en_2.6.0_2.4_1598349343675.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_biobert_clinical_base_cased_en_2.6.0_2.4_1598349343675.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer'], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.biobert.clinical_base_cased').predict(text, output_level='sentence') embeddings_df ```
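The resulting 768-dimensional sentence vectors are typically compared with cosine similarity. A dependency-free sketch; the three-component vectors are made-up stand-ins loosely echoing the first embedding components shown in the Results section, not real model output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical truncated embeddings for the two example sentences.
v1 = [0.39, 0.64, -0.55]
v2 = [0.19, 0.34, -0.33]
sim = cosine(v1, v2)
```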
{:.h2_title} ## Results ```bash sentence en_embed_sentence_biobert_clinical_base_cased_embeddings I hate cancer [0.397987425327301, 0.6472950577735901, -0.551... Antibiotics aren't painkiller [0.19467104971408844, 0.3496762812137604, -0.3... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_biobert_clinical_base_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/EmilyAlsentzer/clinicalBERT](https://github.com/EmilyAlsentzer/clinicalBERT) --- layout: model title: Swedish BertForMaskedLM Base Cased model (from KB) author: John Snow Labs name: bert_embeddings_kb_base_swedish_cased date: 2022-12-02 tags: [sv, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: sv edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-swedish-cased` is a Swedish model originally trained by `KB`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_kb_base_swedish_cased_sv_4.2.4_3.0_1670018989680.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_kb_base_swedish_cased_sv_4.2.4_3.0_1670018989680.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_kb_base_swedish_cased","sv") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_kb_base_swedish_cased","sv") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_kb_base_swedish_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|sv| |Size:|468.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/KB/bert-base-swedish-cased --- layout: model title: Translate English to Hebrew Pipeline author: John Snow Labs name: translate_en_he date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, he, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `he` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_he_xx_2.7.0_2.4_1609687340750.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_he_xx_2.7.0_2.4_1609687340750.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_he", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_he", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.he').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_he| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Judgments Clause Binary Classifier author: John Snow Labs name: legclf_judgments_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `judgments` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. 
## Predicted Entities `other`, `judgments` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_judgments_clause_en_1.0.0_3.2_1660122586859.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_judgments_clause_en_1.0.0_3.2_1660122586859.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_judgments_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[judgments]| |[other]| |[other]| |[judgments]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_judgments_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support judgments 1.00 0.91 0.96 35 other 0.98 1.00 0.99 120 accuracy - - 0.98 155 macro-avg 0.99 0.96 0.97 155 weighted-avg 0.98 0.98 0.98 155 ``` --- layout: model title: Explain Document pipeline for Portuguese (explain_document_lg) author: John Snow Labs name: explain_document_lg date: 2021-03-23 tags: [open_source, portuguese, explain_document_lg, pipeline, pt] supported: true task: [Named Entity Recognition, Lemmatization] language: pt edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_lg is a pretrained pipeline that performs basic text processing steps and recognizes entities.
It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_pt_3.0.0_3.0_1616505297906.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_lg_pt_3.0.0_3.0_1616505297906.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_lg', lang = 'pt') annotations = pipeline.fullAnnotate("Olá de John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_lg", lang = "pt") val result = pipeline.fullAnnotate("Olá de John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Olá de John Snow Labs! "] result_df = nlu.load('pt.explain.lg').predict(text) result_df ```
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:----------------------------|:---------------------------|:---------------------------------------|:---------------------------------------|:--------------------------------------------|:-----------------------------|:--------------------------------------|:--------------------| | 0 | ['Olá de John Snow Labs! '] | ['Olá de John Snow Labs!'] | ['Olá', 'de', 'John', 'Snow', 'Labs!'] | ['Olá', 'de', 'John', 'Snow', 'Labs!'] | ['PROPN', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.4388400018215179,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'I-PER'] | ['John Snow Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_lg| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|pt| --- layout: model title: Fast Neural Machine Translation Model from American Sign Language to Swedish author: John Snow Labs name: opus_mt_ase_sv date: 2021-06-01 tags: [open_source, seq2seq, translation, ase, sv, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
source languages: ase target languages: sv {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ase_sv_xx_3.1.0_2.4_1622553091292.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ase_sv_xx_3.1.0_2.4_1622553091292.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ase_sv", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ase_sv", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.American Sign Language.translate_to.Swedish').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ase_sv| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Luxembourgish BertForMaskedLM Cased model (from raduion) author: John Snow Labs name: bert_embeddings_medium_luxembourgish date: 2022-12-02 tags: [lu, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: lu edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-medium-luxembourgish` is a Luxembourgish model originally trained by `raduion`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_medium_luxembourgish_lu_4.2.4_3.0_1670020644930.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_medium_luxembourgish_lu_4.2.4_3.0_1670020644930.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_medium_luxembourgish","lu") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_medium_luxembourgish","lu") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_medium_luxembourgish| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|lu| |Size:|263.2 MB| |Case sensitive:|true| ## References - https://huggingface.co/raduion/bert-medium-luxembourgish --- layout: model title: English asr_Dansk_wav2vec2_stt TFWav2Vec2ForCTC from Siyam author: John Snow Labs name: pipeline_asr_Dansk_wav2vec2_stt date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Dansk_wav2vec2_stt` is an English model originally trained by Siyam. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_Dansk_wav2vec2_stt_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_Dansk_wav2vec2_stt_en_4.2.0_3.0_1664120521607.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_Dansk_wav2vec2_stt_en_4.2.0_3.0_1664120521607.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_Dansk_wav2vec2_stt', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_Dansk_wav2vec2_stt", lang = "en") val annotations = pipeline.transform(audioDF) ```
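The `audioDF` passed to `transform` above must carry the raw audio as an array of floats (Wav2Vec2 models typically expect 16 kHz mono). Below is a dependency-free sketch of the usual preprocessing step — converting 16-bit PCM samples to normalized floats — with the caveat that the exact DataFrame schema (an `audio_content` column of floats) is an assumption based on the AudioAssembler examples elsewhere on this hub:

```python
import struct

def pcm16_to_floats(pcm_bytes: bytes) -> list:
    """Convert little-endian 16-bit PCM samples to floats in [-1.0, 1.0)."""
    n = len(pcm_bytes) // 2
    samples = struct.unpack("<" + "h" * n, pcm_bytes[: n * 2])
    return [s / 32768.0 for s in samples]

raw = struct.pack("<hh", 16384, -32768)  # two synthetic samples
print(pcm16_to_floats(raw))  # [0.5, -1.0]

# The float list would then become one row of the DataFrame, e.g. (hypothetical):
# audioDF = spark.createDataFrame([(floats,)], ["audio_content"])
```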
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_Dansk_wav2vec2_stt| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Chinese BertForQuestionAnswering Base Cased model (from jackh1995) author: John Snow Labs name: bert_qa_roberta_base_chinese_extractive date: 2022-07-07 tags: [zh, open_source, bert, question_answering] task: Question Answering language: zh edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-chinese-extractive-qa` is a Chinese model originally trained by `jackh1995`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_roberta_base_chinese_extractive_zh_4.0.0_3.0_1657190912340.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_roberta_base_chinese_extractive_zh_4.0.0_3.0_1657190912340.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_roberta_base_chinese_extractive","zh") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_roberta_base_chinese_extractive","zh") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_roberta_base_chinese_extractive| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|zh| |Size:|381.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/jackh1995/roberta-base-chinese-extractive-qa --- layout: model title: Legal Termination for cause Clause Binary Classifier (md) author: John Snow Labs name: legclf_termination_for_cause_md date: 2023-01-11 tags: [en, legal, classification, document, agreement, contract, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `termination-for-cause` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `termination-for-cause` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/CLASSIFY_LEGAL_DOCUMENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_termination_for_cause_md_en_1.0.0_3.0_1673460252721.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_termination_for_cause_md_en_1.0.0_3.0_1673460252721.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_termination_for_cause_md", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
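As the description notes, several of these binary clause classifiers can run over the same clause text, each answering with its own clause name or `other`. A plain-Python sketch of collapsing such per-model predictions into a clause-to-True/False map (the model names and predictions here are invented for illustration):

```python
def clause_flags(predictions: dict) -> dict:
    """Turn {clause_name: predicted_label} into {clause_name: bool},
    where each classifier predicts either its clause name or "other"."""
    return {clause: label == clause for clause, label in predictions.items()}

preds = {
    "termination-for-cause": "termination-for-cause",
    "judgments": "other",
    "collateral": "other",
}
print(clause_flags(preds))  # only termination-for-cause maps to True
```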
## Results ```bash +-------+ | result| +-------+ |[termination-for-cause]| |[other]| |[other]| |[termination-for-cause]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_termination_for_cause_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash precision recall f1-score support attorney-fees 1.00 1.00 1.00 38 other 1.00 1.00 1.00 39 accuracy 1.00 77 macro avg 1.00 1.00 1.00 77 weighted avg 1.00 1.00 1.00 77 ``` --- layout: model title: Detect Problems, Tests and Treatments (ner_clinical_en) author: John Snow Labs name: ner_clinical_en date: 2020-01-30 task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 tags: [clinical, licensed, ner, en] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terms. The SparkNLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. ## Predicted Entities ``PROBLEM``, ``TEST``, ``TREATMENT``.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CLINICAL/){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_en_3.0.0_3.0_1617208419368.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_en_3.0.0_3.0_1617208419368.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPython.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature."]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.").toDF("text") val result = pipeline.fit(data).transform(data) ```
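The benchmarking tables on these cards report per-label tp/fp/fn counts next to precision, recall and F1. The three metrics follow directly from the counts, as this small self-contained check shows (using the I-PROBLEM row from the table further down):

```python
def prf(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall and F1 from raw true-positive / false-positive /
    false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = prf(15645, 1808, 2031)  # I-PROBLEM: tp, fp, fn
print(round(p, 6), round(r, 6), round(f, 6))  # ≈ 0.896408 0.885098 0.890717, as reported
```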
{:.h2_title} ## Results The output is a dataframe with a sentence per row and a ``"ner"`` column containing all of the entity labels in the sentence, entity character indices, and other metadata. To get only the tokens and entity labels, without the metadata, select ``"token.result"`` and ``"ner.result"`` from your output dataframe or add the ``"Finisher"`` to the end of your pipeline. ```bash +-------------------------------------+---------+ |chunk |ner | +-------------------------------------+---------+ |congestion |PROBLEM | |suctioning yellow discharge |PROBLEM | |some mild problems with his breathing|PROBLEM | |any perioral cyanosis |PROBLEM | |retractions |PROBLEM | |a tactile temperature |TEST | |Tylenol |TREATMENT| |his respiratory congestion |PROBLEM | |more tired |PROBLEM | |fussy |PROBLEM | |albuterol treatments |TREATMENT| |His urine output |TEST | |dirty diapers |TREATMENT| |diarrhea |PROBLEM | |yellow colored |PROBLEM | +-------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical| |Type:|ner| |Compatibility:|Spark NLP 3.0.0+| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence,token, embeddings]| |Output Labels:|[ner]| |Language:|[en]| |Case sensitive:|false| {:.h2_title} ## Data Source Trained on 2010 i2b2 challenge data with `embeddings_clinical`. 
https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ {:.h2_title} ## Benchmarking ```bash label tp fp fn prec rec f1 I-TREATMENT 6492 873 1445 0.881466 0.817941 0.848517 I-PROBLEM 15645 1808 2031 0.896408 0.885098 0.890717 B-PROBLEM 11160 1048 1424 0.914155 0.88684 0.90029 I-TEST 6878 864 1132 0.888401 0.858677 0.873286 B-TEST 8140 932 1081 0.897266 0.882768 0.889958 B-TREATMENT 8163 945 1150 0.896245 0.876517 0.886271 Macro-average 56478 6470 8263 0.895657 0.867974 0.881598 Micro-average 56478 6470 8263 0.897217 0.872368 0.884618 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from 21iridescent) author: John Snow Labs name: distilbert_qa_21iridescent_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is a English model originally trained by `21iridescent`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_21iridescent_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768188793.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_21iridescent_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768188793.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_21iridescent_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_21iridescent_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_21iridescent_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/21iridescent/distilbert-base-uncased-finetuned-squad --- layout: model title: T5 for Informal to Formal Style Transfer author: John Snow Labs name: t5_informal_to_formal_styletransfer date: 2022-01-12 tags: [t5, open_source, en] task: Text Generation language: en nav_key: models edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a text-to-text model based on T5, fine-tuned to generate formal text from an informal (casual) text input, for the task "transfer Casual to Formal:". It is based on Prithiviraj Damodaran's Styleformer. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/T5_LINGUISTIC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/T5_LINGUISTIC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_informal_to_formal_styletransfer_en_3.4.0_3.0_1641985760876.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_informal_to_formal_styletransfer_en_3.4.0_3.0_1641985760876.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import sparknlp from sparknlp.base import * from sparknlp.annotator import * spark = sparknlp.start() documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") t5 = T5Transformer.pretrained("t5_informal_to_formal_styletransfer") \ .setTask("transfer Casual to Formal:") \ .setInputCols(["documents"]) \ .setMaxOutputLength(200) \ .setOutputCol("transfers") pipeline = Pipeline().setStages([documentAssembler, t5]) data = spark.createDataFrame([["Who gives a crap?"]]).toDF("text") result = pipeline.fit(data).transform(data) result.select("transfers.result").show(truncate=False) ``` ```scala import spark.implicits._ import com.johnsnowlabs.nlp.base.DocumentAssembler import com.johnsnowlabs.nlp.annotators.seq2seq.T5Transformer import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("documents") val t5 = T5Transformer.pretrained("t5_informal_to_formal_styletransfer") .setTask("transfer Casual to Formal:") .setMaxOutputLength(200) .setInputCols("documents") .setOutputCol("transfer") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("Who gives a crap?").toDF("text") val result = pipeline.fit(data).transform(data) result.select("transfer.result").show(false) ``` {:.nlu-block} ```python import nlu nlu.load("en.t5.informal_to_formal_styletransfer").predict("""transfer Casual to Formal:""") ```
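`setTask` fixes the prefix T5 is conditioned on; what the model effectively receives is that prefix prepended to each input row. A trivial sketch of the prompt construction (the exact joining whitespace is an assumption, shown only to illustrate how T5 task prefixes work):

```python
def build_t5_prompt(task: str, text: str) -> str:
    """Prepend a T5 task prefix to the input text (joining with a single
    space is an assumption for illustration)."""
    return f"{task} {text}"

print(build_t5_prompt("transfer Casual to Formal:", "Who gives a crap?"))
# transfer Casual to Formal: Who gives a crap?
```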
## Results ```bash +------------+ |result | +------------+ |[Who cares?]| +------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_informal_to_formal_styletransfer| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[summaries]| |Language:|en| |Size:|924.0 MB| ## Data Source The original model is from the transformers library: https://huggingface.co/prithivida/informal_to_formal_styletransfer --- layout: model title: Part of Speech for Polish author: John Snow Labs name: pos_ud_lfg date: 2021-03-08 tags: [part_of_speech, open_source, polish, pos_ud_lfg, pl] task: Part of Speech Tagging language: pl edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - ADJ - PROPN - VERB - NOUN - PUNCT - NUM - ADV - ADP - SCONJ - PRON - AUX - DET - CCONJ - PART - INTJ {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_lfg_pl_3.0.0_3.0_1615230237237.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_lfg_pl_3.0.0_3.0_1615230237237.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

pos = PerceptronModel.pretrained("pos_ud_lfg", "pl") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])

example = spark.createDataFrame([['Witaj z John Snow Labs! ']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence_detector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val pos = PerceptronModel.pretrained("pos_ud_lfg", "pl")
  .setInputCols(Array("document", "token"))
  .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))

val data = Seq("Witaj z John Snow Labs! ").toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["Witaj z John Snow Labs! "]
token_df = nlu.load('pl.pos.ud_lfg').predict(text)
token_df
```
## Results

```bash
   token    pos
0  Witaj   VERB
1      z    ADP
2   John  PROPN
3   Snow  PROPN
4   Labs  PROPN
5      !  PUNCT
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pos_ud_lfg|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|pl|

---
layout: model
title: Legal Collateral Clause Binary Classifier
author: John Snow Labs
name: legclf_collateral_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `collateral` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into account that the embeddings of this model allow up to 512 tokens. If your input is longer than that, consider splitting it into smaller pieces (you can also check the tutorial linked above).

This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
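The paragraph-splitting strategy mentioned above can be prototyped outside Spark NLP in a few lines of plain Python. This is an illustrative sketch only (the helper name and the whitespace-token heuristic for the 512-token limit are assumptions, not part of the library): split on blank lines and keep only chunks that fit the embedding limit.

```python
# Illustrative helper (not part of Spark NLP): split a long legal document
# into paragraph-sized chunks before classifying each one.
def split_paragraphs(text, max_tokens=512):
    # Paragraphs are separated by blank lines ("multiline" splitting)
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    # Rough whitespace-token count as a proxy for the 512-token embedding limit
    return [p for p in paragraphs if len(p.split()) <= max_tokens]

doc = "The Collateral shall secure the Obligations.\n\nThis Agreement may be amended."
print(split_paragraphs(doc))
```

Each returned chunk would then populate one row of the text column fed to the classification pipeline.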
## Predicted Entities `other`, `collateral` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_collateral_clause_en_1.0.0_3.2_1660123314449.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_collateral_clause_en_1.0.0_3.2_1660123314449.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_collateral_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[collateral]| |[other]| |[other]| |[collateral]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_collateral_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scrapped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support collateral 0.92 0.89 0.90 37 other 0.94 0.96 0.95 69 accuracy - - 0.93 106 macro-avg 0.93 0.92 0.93 106 weighted-avg 0.93 0.93 0.93 106 ``` --- layout: model title: Extract relations between chemicals and proteins (ReDL) author: John Snow Labs name: redl_chemprot_biobert date: 2021-02-04 task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 2.7.3 spark_version: 2.4 tags: [licensed, clinical, en, relation_extraction] supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect interactions between chemicals and proteins using BERT model by classifying whether a specified semantic relation holds between the chemical and protein entities within a sentence or document. 
## Predicted Entities

`CPR:1`, `CPR:2`, `CPR:3`, `CPR:4`, `CPR:5`, `CPR:6`, `CPR:7`, `CPR:8`, `CPR:9`, `CPR:10`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CHEM_PROT){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_chemprot_biobert_en_2.7.3_2.4_1612443115083.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_chemprot_biobert_en_2.7.3_2.4_1612443115083.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use

In the table below, the `redl_chemprot_biobert` RE model, its labels, optimal NER model, and meaningful relation pairs are illustrated.

| RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS |
|:---------------------:|:---------------------------------------------------------------------:|:---------------------:|---------------------------|
| redl_chemprot_biobert | CPR:1, CPR:2, CPR:3, CPR:4, CPR:5, CPR:6, CPR:7, CPR:8, CPR:9, CPR:10 | ner_chemprot_clinical | ["No need to set pairs."] |
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"]) \
    .setOutputCol("embeddings")

ner_tagger = MedicalNerModel.pretrained("ner_chemprot_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens", "embeddings"])\
    .setOutputCol("ner_tags")

ner_converter = NerConverter() \
    .setInputCols(["sentences", "tokens", "ner_tags"]) \
    .setOutputCol("ner_chunks")

dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \
    .setInputCols(["sentences", "pos_tags", "tokens"]) \
    .setOutputCol("dependencies")

# Set a filter on pairs of named entities which will be treated as relation candidates
re_ner_chunk_filter = RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"])\
    .setMaxSyntacticDistance(10)\
    .setOutputCol("re_ner_chunks")
    #.setRelationPairs(['SYMPTOM-EXTERNAL_BODY_PART_OR_REGION'])

# This model was trained on sentence-level relations. A model trained on
# document-level relations can use "document" instead of "sentence" as input at prediction time.
re_model = RelationExtractionDLModel.pretrained('redl_chemprot_biobert', 'en', "clinical/models") \
    .setPredictionThreshold(0.5)\
    .setInputCols(["re_ner_chunks", "sentences"]) \
    .setOutputCol("relations")

pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model])

text='''In this study, we examined the effects of mitiglinide on various cloned K(ATP) channels (Kir6.2/SUR1, Kir6.2/SUR2A, and Kir6.2/SUR2B) reconstituted in COS-1 cells, and compared them to another meglitinide-related compound, nateglinide. Patch-clamp analysis using inside-out recording configuration showed that mitiglinide inhibits the Kir6.2/SUR1 channel currents in a dose-dependent manner (IC50 value, 100 nM) but does not significantly inhibit either Kir6.2/SUR2A or Kir6.2/SUR2B channel currents even at high doses (more than 10 microM). Nateglinide inhibits Kir6.2/SUR1 and Kir6.2/SUR2B channels at 100 nM, and inhibits Kir6.2/SUR2A channels at high concentrations (1 microM). Binding experiments on mitiglinide, nateglinide, and repaglinide to SUR1 expressed in COS-1 cells revealed that they inhibit the binding of [3H]glibenclamide to SUR1 (IC50 values: mitiglinide, 280 nM; nateglinide, 8 microM; repaglinide, 1.6 microM), suggesting that they all share a glibenclamide binding site. The insulin responses to glucose, mitiglinide, tolbutamide, and glibenclamide in MIN6 cells after chronic mitiglinide, nateglinide, or repaglinide treatment were comparable to those after chronic tolbutamide and glibenclamide treatment. These results indicate that, similar to the sulfonylureas, mitiglinide is highly specific to the Kir6.2/SUR1 complex, i.e., the pancreatic beta-cell K(ATP) channel, and suggest that mitiglinide may be a clinically useful anti-diabetic drug.'''

data = spark.createDataFrame([[text]]).toDF("text")
p_model = pipeline.fit(data)
result = p_model.transform(data)
```
```scala
...
val documenter = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentencer = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentences")

val tokenizer = new Tokenizer()
  .setInputCols("sentences")
  .setOutputCol("tokens")

val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens"))
  .setOutputCol("pos_tags")

val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens"))
  .setOutputCol("embeddings")

val ner_tagger = MedicalNerModel.pretrained("ner_chemprot_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens", "embeddings"))
  .setOutputCol("ner_tags")

val ner_converter = new NerConverter()
  .setInputCols(Array("sentences", "tokens", "ner_tags"))
  .setOutputCol("ner_chunks")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentences", "pos_tags", "tokens"))
  .setOutputCol("dependencies")

// Set a filter on pairs of named entities which will be treated as relation candidates
val re_ner_chunk_filter = new RENerChunksFilter()
  .setInputCols(Array("ner_chunks", "dependencies"))
  .setMaxSyntacticDistance(10)
  .setOutputCol("re_ner_chunks")
// .setRelationPairs(Array("SYMPTOM-EXTERNAL_BODY_PART_OR_REGION"))

// This model was trained on sentence-level relations. A model trained on
// document-level relations can use "document" instead of "sentence" as input at prediction time.
val re_model = RelationExtractionDLModel.pretrained("redl_chemprot_biobert", "en", "clinical/models")
  .setPredictionThreshold(0.5)
  .setInputCols(Array("re_ner_chunks", "sentences"))
  .setOutputCol("relations")

val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model))

val data = Seq("In this study, we examined the effects of mitiglinide on various cloned K(ATP) channels (Kir6.2/SUR1, Kir6.2/SUR2A, and Kir6.2/SUR2B) reconstituted in COS-1 cells, and compared them to another meglitinide-related compound, nateglinide. Patch-clamp analysis using inside-out recording configuration showed that mitiglinide inhibits the Kir6.2/SUR1 channel currents in a dose-dependent manner (IC50 value, 100 nM) but does not significantly inhibit either Kir6.2/SUR2A or Kir6.2/SUR2B channel currents even at high doses (more than 10 microM). Nateglinide inhibits Kir6.2/SUR1 and Kir6.2/SUR2B channels at 100 nM, and inhibits Kir6.2/SUR2A channels at high concentrations (1 microM). Binding experiments on mitiglinide, nateglinide, and repaglinide to SUR1 expressed in COS-1 cells revealed that they inhibit the binding of [3H]glibenclamide to SUR1 (IC50 values: mitiglinide, 280 nM; nateglinide, 8 microM; repaglinide, 1.6 microM), suggesting that they all share a glibenclamide binding site. The insulin responses to glucose, mitiglinide, tolbutamide, and glibenclamide in MIN6 cells after chronic mitiglinide, nateglinide, or repaglinide treatment were comparable to those after chronic tolbutamide and glibenclamide treatment. These results indicate that, similar to the sulfonylureas, mitiglinide is highly specific to the Kir6.2/SUR1 complex, i.e., the pancreatic beta-cell K(ATP) channel, and suggest that mitiglinide may be a clinically useful anti-diabetic drug.").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.relation.chemprot").predict("""In this study, we examined the effects of mitiglinide on various cloned K(ATP) channels (Kir6.2/SUR1, Kir6.2/SUR2A, and Kir6.2/SUR2B) reconstituted in COS-1 cells, and compared them to another meglitinide-related compound, nateglinide. Patch-clamp analysis using inside-out recording configuration showed that mitiglinide inhibits the Kir6.2/SUR1 channel currents in a dose-dependent manner (IC50 value, 100 nM) but does not significantly inhibit either Kir6.2/SUR2A or Kir6.2/SUR2B channel currents even at high doses (more than 10 microM). Nateglinide inhibits Kir6.2/SUR1 and Kir6.2/SUR2B channels at 100 nM, and inhibits Kir6.2/SUR2A channels at high concentrations (1 microM). Binding experiments on mitiglinide, nateglinide, and repaglinide to SUR1 expressed in COS-1 cells revealed that they inhibit the binding of [3H]glibenclamide to SUR1 (IC50 values: mitiglinide, 280 nM; nateglinide, 8 microM; repaglinide, 1.6 microM), suggesting that they all share a glibenclamide binding site. The insulin responses to glucose, mitiglinide, tolbutamide, and glibenclamide in MIN6 cells after chronic mitiglinide, nateglinide, or repaglinide treatment were comparable to those after chronic tolbutamide and glibenclamide treatment. These results indicate that, similar to the sulfonylureas, mitiglinide is highly specific to the Kir6.2/SUR1 complex, i.e., the pancreatic beta-cell K(ATP) channel, and suggest that mitiglinide may be a clinically useful anti-diabetic drug.""")
```
## Results ```bash | | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |---:|:-----------|:----------|----------------:|--------------:|:------------------|:----------|----------------:|--------------:|:--------------|-------------:| | 0 | CPR:2 | CHEMICAL | 43 | 53 | mitiglinide | GENE-N | 80 | 87 | channels | 0.998399 | | 1 | CPR:2 | GENE-N | 80 | 87 | channels | CHEMICAL | 224 | 234 | nateglinide | 0.994489 | | 2 | CPR:2 | CHEMICAL | 706 | 716 | mitiglinide | GENE-Y | 751 | 754 | SUR1 | 0.999304 | | 3 | CPR:2 | CHEMICAL | 823 | 839 | [3H]glibenclamide | GENE-Y | 844 | 847 | SUR1 | 0.998923 | | 4 | CPR:2 | GENE-N | 998 | 1004 | insulin | CHEMICAL | 1019 | 1025 | glucose | 0.979057 | | 5 | CPR:2 | GENE-N | 998 | 1004 | insulin | CHEMICAL | 1028 | 1038 | mitiglinide | 0.988504 | | 6 | CPR:2 | GENE-N | 998 | 1004 | insulin | CHEMICAL | 1041 | 1051 | tolbutamide | 0.991856 | | 7 | CPR:2 | GENE-N | 998 | 1004 | insulin | CHEMICAL | 1058 | 1070 | glibenclamide | 0.994092 | | 8 | CPR:2 | GENE-N | 998 | 1004 | insulin | CHEMICAL | 1100 | 1110 | mitiglinide | 0.994409 | | 9 | CPR:2 | CHEMICAL | 1290 | 1300 | mitiglinide | GENE-N | 1387 | 1393 | channel | 0.981534 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_chemprot_biobert| |Compatibility:|Healthcare NLP 2.7.3+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Data Source Trained on ChemProt benchmark dataset. ## Benchmarking ```bash Relation Recall Precision F1 Support CPR:1 0.870 0.908 0.888 215 CPR:10 0.818 0.762 0.789 258 CPR:2 0.726 0.806 0.764 1651 CPR:3 0.788 0.785 0.787 657 CPR:4 0.901 0.855 0.878 1599 CPR:5 0.799 0.891 0.842 184 CPR:6 0.888 0.845 0.866 258 CPR:7 0.520 0.765 0.619 25 CPR:8 0.083 0.333 0.133 24 CPR:9 0.930 0.805 0.863 629 Avg. 
0.732 0.775 0.743 ``` --- layout: model title: Catalan RobertaForQuestionAnswering Cased model (from crodri) author: John Snow Labs name: roberta_qa_ca_v2_squac_ca_catalan date: 2023-01-20 tags: [ca, open_source, roberta, question_answering, tensorflow] task: Question Answering language: ca edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-ca-v2-qa-squac-ca-catalanqa` is a Catalan model originally trained by `crodri`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ca_v2_squac_ca_catalan_ca_4.3.0_3.0_1674219909583.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ca_v2_squac_ca_catalan_ca_4.3.0_3.0_1674219909583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

question_answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ca_v2_squac_ca_catalan","ca")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[document_assembler, question_answering])

data = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val questionAnswering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ca_v2_squac_ca_catalan","ca")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, questionAnswering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_ca_v2_squac_ca_catalan|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|ca|
|Size:|461.2 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/crodri/roberta-ca-v2-qa-squac-ca-catalanqa

---
layout: model
title: Translate Hiri Motu to English Pipeline
author: John Snow Labs
name: translate_ho_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, ho, en, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `ho`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_ho_en_xx_2.7.0_2.4_1609690127723.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_ho_en_xx_2.7.0_2.4_1609690127723.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_ho_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_ho_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.ho.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_ho_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Russian BertForQuestionAnswering Cased model (from ruselkomp) author: John Snow Labs name: bert_qa_deep_pavlov_full_2 date: 2022-07-07 tags: [ru, open_source, bert, question_answering] task: Question Answering language: ru edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `deep-pavlov-full-2` is a Russian model originally trained by `ruselkomp`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_deep_pavlov_full_2_ru_4.0.0_3.0_1657189320548.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_deep_pavlov_full_2_ru_4.0.0_3.0_1657189320548.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_deep_pavlov_full_2","ru") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["Как меня зовут?", "Меня зовут Клара, и я живу в Беркли."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_deep_pavlov_full_2","ru")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("Как меня зовут?", "Меня зовут Клара, и я живу в Беркли.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_deep_pavlov_full_2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|ru|
|Size:|665.0 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/ruselkomp/deep-pavlov-full-2

---
layout: model
title: BioBERT Sentence Embeddings (Clinical)
author: John Snow Labs
name: sent_biobert_clinical_base_cased
date: 2020-09-19
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 2.6.2
spark_version: 2.4
tags: [embeddings, en, open_source]
supported: true
recommended: true
annotator: BertSentenceEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model contains pre-trained weights of ClinicalBERT for generic clinical text. This domain-specific model improves performance on 3 of 5 clinical NLP tasks and establishes a new state of the art on the MedNLI dataset. The details are described in the paper "[Publicly Available Clinical BERT Embeddings](https://www.aclweb.org/anthology/W19-1909/)".

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_biobert_clinical_base_cased_en_2.6.2_2.4_1600533460155.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_biobert_clinical_base_cased_en_2.6.2_2.4_1600533460155.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased", "en") \
    .setInputCols("sentence") \
    .setOutputCol("sentence_embeddings")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"]))
```
```scala
...
val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased", "en")
  .setInputCols("sentence")
  .setOutputCol("sentence_embeddings")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('en.embed_sentence.biobert.clinical_base_cased').predict(text, output_level='sentence')
embeddings_df
```
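To give a sense of how the resulting sentence embeddings are typically consumed downstream, here is a plain-NumPy cosine-similarity sketch. The 3-dimensional vectors are toy stand-ins (not real model output) for the 768-dimensional embeddings this model produces.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of Euclidean norms
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for two 768-dim sentence embeddings
v1 = [0.39, 0.64, -0.55]
v2 = [0.19, 0.35, -0.31]
print(cosine_similarity(v1, v2))
```

In practice the vectors would come from the `sentence_embeddings` output column of the pipeline above.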
{:.h2_title} ## Results ```bash sentence en_embed_sentence_biobert_clinical_base_cased_embeddings I hate cancer [0.397987425327301, 0.6472950577735901, -0.551... Antibiotics aren't painkiller [0.19467104971408844, 0.3496762812137604, -0.3... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_biobert_clinical_base_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.2| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/EmilyAlsentzer/clinicalBERT](https://github.com/EmilyAlsentzer/clinicalBERT) --- layout: model title: English image_classifier_vit_garbage_classification ViTForImageClassification from yangy50 author: John Snow Labs name: image_classifier_vit_garbage_classification date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_garbage_classification` is a English model originally trained by yangy50. ## Predicted Entities `plastic`, `cardboard`, `paper`, `metal`, `glass`, `trash` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_garbage_classification_en_4.1.0_3.0_1660166313891.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_garbage_classification_en_4.1.0_3.0_1660166313891.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_garbage_classification", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_garbage_classification", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_garbage_classification| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Detect Company Sectors in texts (small) author: John Snow Labs name: finner_wiki_sector date: 2023-01-15 tags: [en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an NER model, aimed to detect Company Sectors. It was trained with wikipedia texts about companies. ## Predicted Entities `SECTOR`, `O` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_wiki_sector_en_1.0.0_3.0_1673797361843.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_wiki_sector_en_1.0.0_3.0_1673797361843.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pyspark.sql.functions as F documenter = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencizer = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512) chunks = finance.NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") ner = finance.NerModel.pretrained("finner_wiki_sector", "en", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") pipe = nlp.Pipeline(stages=[documenter, sentencizer, tokenizer, embeddings, ner, chunks]) model = pipe.fit(df) res = model.transform(df) res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.begin, res.ner_chunk.end, res.ner_chunk.metadata)).alias("cols")) \ .select(F.expr("cols['3']['sentence']").alias("sentence_id"), F.expr("cols['0']").alias("chunk"), F.expr("cols['2']").alias("end"), F.expr("cols['3']['entity']").alias("ner_label"))\ .filter("ner_label!='O'")\ .show(truncate=False) ```
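The final `select` above zips the chunk texts, character offsets, and metadata, then drops rows labeled `O`. The same reshaping can be sketched in plain Python over `fullAnnotate`-style lists; the function name, inputs, and the sample offsets are illustrative, not part of the Spark NLP API:

```python
def zip_chunks(chunks, begins, ends, metadatas):
    """Pair each chunk with its offsets and entity label, keeping non-'O' rows."""
    rows = []
    for chunk, begin, end, meta in zip(chunks, begins, ends, metadatas):
        label = meta.get("entity", "O")
        if label != "O":
            rows.append({"chunk": chunk, "begin": begin, "end": end, "ner_label": label})
    return rows

rows = zip_chunks(
    ["lawn mowers", "and"],
    [165, 177],
    [175, 179],
    [{"entity": "SECTOR", "sentence": "1"}, {"entity": "O", "sentence": "1"}],
)
# Only the SECTOR chunk survives the filter.
```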
## Results ```bash +-----------+-----------------+---+---------+ |sentence_id|chunk |end|ner_label| +-----------+-----------------+---+---------+ |1 |lawn mowers |175|SECTOR | |1 |snow blowers |192|SECTOR | |1 |irrigation system|214|SECTOR | +-----------+-----------------+---+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_wiki_sector| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|1.2 MB| ## References Wikipedia ## Benchmarking ```bash label tp fp fn prec rec f1 B-SECTOR 70 17 23 0.8045977 0.75268817 0.7777778 I-SECTOR 24 11 9 0.6857143 0.72727275 0.70588243 Macro-average 94 28 32 0.745156 0.73998046 0.7425592 Micro-average 94 28 32 0.7704918 0.74603176 0.7580645 ``` --- layout: model title: Mapping Companies to NASDAQ Stock Screener by Company Name author: John Snow Labs name: finmapper_nasdaq_company_name_stock_screener date: 2023-01-19 tags: [en, finance, licensed, nasdaq, company] task: Chunk Mapping language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model allows you to, given an extracted name of a company, get the following information about that company from the Nasdaq Stock Screener: - Country - IPO_Year - Industry - Last_Sale - Market_Cap - Name - Net_Change - Percent_Change - Sector - Symbol - Volume It can optionally be combined with Entity Resolution to first normalize the company name.
## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finmapper_nasdaq_company_name_stock_screener_en_1.0.0_3.0_1674161310624.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finmapper_nasdaq_company_name_stock_screener_en_1.0.0_3.0_1674161310624.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('document') tokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = finance.NerModel.pretrained('finner_orgs_prods_alias', 'en', 'finance/models')\ .setInputCols(["document", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") # Optional: To normalize the ORG name using NASDAQ data before the mapping ########################################################################## chunkToDoc = nlp.Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") chunk_embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("chunk_embeddings") use_er_model = finance.SentenceEntityResolverModel.pretrained('finel_nasdaq_company_name_stock_screener', 'en', 'finance/models')\ .setInputCols("chunk_embeddings")\ .setOutputCol('normalized')\ .setDistanceFunction("EUCLIDEAN") ########################################################################## CM = finance.ChunkMapperModel.pretrained('finmapper_nasdaq_company_name_stock_screener', 'en', 'finance/models')\ .setInputCols(["normalized"])\ .setOutputCol("mappings") pipeline = nlp.Pipeline().setStages([document_assembler, tokenizer, embeddings, ner_model, ner_converter, chunkToDoc, # Optional for normalization chunk_embeddings, # Optional for normalization use_er_model, # Optional for normalization CM]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = pipeline.fit(empty_data) lp = nlp.LightPipeline(model) text = """Nike is an American multinational association that is involved in the design, development, 
manufacturing and worldwide marketing and sales of apparel, footwear, accessories, equipment and services.""" result = lp.fullAnnotate(text) ```
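The mapper returns every Stock Screener field as a string (see the Results section). When numeric values are needed downstream, a small pure-Python helper can coerce them; `parse_screener_fields` is a hypothetical name, not part of the Finance NLP API, and the field list follows the description above:

```python
def parse_screener_fields(mapping):
    """Coerce the string-valued NASDAQ screener fields to numbers where sensible."""
    out = dict(mapping)
    if "Last_Sale" in out:              # e.g. "$128.85"
        out["Last_Sale"] = float(out["Last_Sale"].lstrip("$"))
    if "Percent_Change" in out:         # e.g. "0.751%"
        out["Percent_Change"] = float(out["Percent_Change"].rstrip("%"))
    for key in ("Market_Cap", "Net_Change", "Volume", "IPO_Year"):
        if key in out:
            out[key] = float(out[key])  # handles plain and E-notation strings
    return out

fields = parse_screener_fields({
    "Last_Sale": "$128.85",
    "Percent_Change": "0.751%",
    "Market_Cap": "1.9979004036E11",
    "Symbol": "NKE",
})
```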
## Results ```bash "Country": "United States", "IPO_Year": "0", "Industry": "Shoe Manufacturing", "Last_Sale": "$128.85", "Market_Cap": "1.9979004036E11", "Name": "Nike Inc. Common Stock", "Net_Change": "0.96", "Percent_Change": "0.751%", "Sector": "Consumer Discretionary", "Symbol": "NKE", "Volume": "4854668" ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finmapper_nasdaq_company_name_stock_screener| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|599.1 KB| ## References https://www.nasdaq.com/market-activity/stocks/screener --- layout: model title: Translate English to Tonga (Zambezi) Pipeline author: John Snow Labs name: translate_en_toi date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, toi, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `en` - target languages: `toi` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_toi_xx_2.7.0_2.4_1609689111234.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_toi_xx_2.7.0_2.4_1609689111234.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_toi", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_toi", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.toi').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_toi| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from Seychellois Creole to English author: John Snow Labs name: opus_mt_crs_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, crs, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `crs` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_crs_en_xx_2.7.0_2.4_1609168963014.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_crs_en_xx_2.7.0_2.4_1609168963014.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_crs_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_crs_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.crs.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_crs_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Spanish RobertaForTokenClassification Base Cased model (from bertin-project) author: John Snow Labs name: roberta_token_classifier_bertin_base_pos_conll2002 date: 2023-03-01 tags: [es, open_source, roberta, token_classification, ner, tensorflow] task: Named Entity Recognition language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bertin-base-pos-conll2002-es` is a Spanish model originally trained by `bertin-project`. ## Predicted Entities `DA`, `VAM`, `I`, `VSM`, `PP`, `VSS`, `DI`, `AQ`, `Y`, `VMN`, `Fit`, `Fg`, `Fia`, `Fpa`, `Fat`, `VSN`, `Fpt`, `DD`, `VAP`, `SP`, `NP`, `Fh`, `VAI`, `CC`, `Fd`, `VMG`, `NC`, `PX`, `DE`, `Fz`, `PN`, `Fx`, `Faa`, `Fs`, `Fe`, `VSP`, `DP`, `VAS`, `VSG`, `PT`, `Ft`, `VAN`, `PI`, `P0`, `RG`, `RN`, `CS`, `DN`, `VMI`, `Fp`, `Fc`, `PR`, `VSI`, `AO`, `VMM`, `PD`, `VMS`, `DT`, `Z`, `VMP` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_bertin_base_pos_conll2002_es_4.3.0_3.0_1677703697571.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_bertin_base_pos_conll2002_es_4.3.0_3.0_1677703697571.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_bertin_base_pos_conll2002","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_bertin_base_pos_conll2002","es") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_token_classifier_bertin_base_pos_conll2002| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|es| |Size:|426.4 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/bertin-project/bertin-base-pos-conll2002-es --- layout: model title: Fast Neural Machine Translation Model from English to Spanish author: John Snow Labs name: opus_mt_en_es date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, es, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `es` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_es_xx_2.7.0_2.4_1609167574367.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_es_xx_2.7.0_2.4_1609167574367.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_es", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_es", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.es').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_es| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: French CamemBert Embeddings (from codingJacob) author: John Snow Labs name: camembert_embeddings_codingJacob_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `codingJacob`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_codingJacob_generic_model_fr_3.4.4_3.0_1653987610803.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_codingJacob_generic_model_fr_3.4.4_3.0_1653987610803.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_codingJacob_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_codingJacob_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_codingJacob_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/codingJacob/dummy-model --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_bert_base_uncased_few_shot_k_64_finetuned_squad_seed_0 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-few-shot-k-64-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_few_shot_k_64_finetuned_squad_seed_0_en_4.0.0_3.0_1654180885453.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_few_shot_k_64_finetuned_squad_seed_0_en_4.0.0_3.0_1654180885453.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_few_shot_k_64_finetuned_squad_seed_0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_few_shot_k_64_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased_64d_seed_0").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
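In the NLU one-liner above, the question and the context travel in a single string separated by `|||`. A tiny helper (the name is illustrative, not part of the nlu API) makes that input format explicit:

```python
def to_nlu_qa_input(question, context):
    """Join a question and its context with the '|||' separator used for nlu QA models."""
    return f"{question}|||{context}"

qa_input = to_nlu_qa_input("What's my name?", "My name is Clara and I live in Berkeley.")
```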
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_few_shot_k_64_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-base-uncased-few-shot-k-64-finetuned-squad-seed-0 --- layout: model title: English image_classifier_vit_new_exper3 ViTForImageClassification from sudo-s author: John Snow Labs name: image_classifier_vit_new_exper3 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_new_exper3` is an English model originally trained by sudo-s.
## Predicted Entities `45`, `98`, `113`, `34`, `67`, `120`, `93`, `142`, `147`, `12`, `66`, `89`, `51`, `124`, `84`, `8`, `73`, `78`, `19`, `100`, `23`, `62`, `135`, `128`, `4`, `121`, `88`, `77`, `40`, `110`, `15`, `11`, `104`, `90`, `9`, `141`, `139`, `132`, `44`, `33`, `117`, `22`, `56`, `55`, `26`, `134`, `50`, `123`, `37`, `68`, `61`, `107`, `13`, `46`, `99`, `24`, `94`, `83`, `35`, `16`, `79`, `5`, `103`, `112`, `72`, `10`, `59`, `144`, `87`, `48`, `21`, `116`, `76`, `138`, `54`, `43`, `148`, `127`, `65`, `71`, `57`, `108`, `32`, `80`, `106`, `137`, `82`, `49`, `6`, `126`, `36`, `1`, `39`, `140`, `17`, `25`, `60`, `14`, `133`, `47`, `122`, `111`, `102`, `31`, `96`, `69`, `95`, `58`, `145`, `64`, `53`, `42`, `75`, `115`, `109`, `0`, `20`, `27`, `70`, `2`, `86`, `38`, `81`, `118`, `92`, `125`, `18`, `101`, `30`, `7`, `143`, `97`, `130`, `114`, `129`, `29`, `41`, `105`, `63`, `3`, `74`, `91`, `52`, `85`, `131`, `28`, `119`, `136`, `146` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_new_exper3_en_4.1.0_3.0_1660173387567.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_new_exper3_en_4.1.0_3.0_1660173387567.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_new_exper3", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_new_exper3", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_new_exper3| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.3 MB| --- layout: model title: Detect Time-related Terminology author: John Snow Labs name: roberta_token_classifier_timex_semeval date: 2021-12-28 tags: [timex, ner, roberta, en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.3.4 spark_version: 2.4 supported: true annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model was imported from `Hugging Face` and it's been trained to detect time-related terminology, leveraging `RoBERTa` embeddings and `RobertaForTokenClassification` for NER purposes. ## Predicted Entities `Period`, `Year`, `Calendar-Interval`, `Month-Of-Year`, `Day-Of-Month`, `Day-Of-Week`, `Hour-Of-Day`, `Minute-Of-Hour`, `Number`, `Second-Of-Minute`, `Time-Zone`, `Part-Of-Day`, `Season-Of-Year`, `AMPM-Of-Day`, `Part-Of-Week`, `Week-Of-Year`, `Two-Digit-Year`, `Sum`, `Difference`, `Union`, `Intersection`, `Every-Nth`, `This`, `Last`, `Next`, `Before`, `After`, `Between`, `NthFromStart`, `NthFromEnd`, `Frequency`, `Modifier` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_TIMEX_SEMEVAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_timex_semeval_en_3.3.4_2.4_1640679857852.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 
URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_token_classifier_timex_semeval_en_3.3.4_2.4_1640679857852.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_timex_semeval", "en")\ .setInputCols(["sentence",'token'])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = """Model training was started at 22:12C and it took 3 days from Tuesday to Friday.""" result = model.transform(spark.createDataFrame([[text]]).toDF("text")) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_timex_semeval", "en") .setInputCols(Array("sentence","token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val example = Seq("Model training was started at 22:12C and it took 3 days from Tuesday to Friday.").toDS.toDF("text") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu
nlu.load("en.ner.time").predict("""Model training was started at 22:12C and it took 3 days from Tuesday to Friday.""") ```
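Each predicted chunk comes back paired with a time-expression label (see the Results table). Once the `(chunk, label)` pairs have been extracted from the annotations, grouping them by label is plain Python; the pairs below are taken from the example sentence:

```python
from collections import defaultdict

def group_by_label(pairs):
    """Group predicted chunks under their NER label."""
    grouped = defaultdict(list)
    for chunk, label in pairs:
        grouped[label].append(chunk)
    return dict(grouped)

pairs = [
    ("22:12C", "Period"),
    ("3", "Number"),
    ("days", "Calendar-Interval"),
    ("Tuesday", "Day-Of-Week"),
    ("to", "Between"),
    ("Friday", "Day-Of-Week"),
]
by_label = group_by_label(pairs)
```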
## Results ```bash +-------+-----------------+ |chunk |ner_label | +-------+-----------------+ |22:12C |Period | |3 |Number | |days |Calendar-Interval| |Tuesday|Day-Of-Week | |to |Between | |Friday |Day-Of-Week | +-------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_token_classifier_timex_semeval| |Compatibility:|Spark NLP 3.3.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|439.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## Data Source [https://huggingface.co/clulab/roberta-timex-semeval](https://huggingface.co/clulab/roberta-timex-semeval) --- layout: model title: Relation Extraction between different oncological entity types using granular classes (ReDL) author: John Snow Labs name: redl_oncology_granular_biobert_wip date: 2022-09-29 tags: [licensed, clinical, oncology, en, relation_extraction, temporal, test, biomarker, anatomy] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Using this relation extraction model, four relation types can be identified: is_date_of (between date entities and other clinical entities), is_size_of (between Tumor_Finding and Tumor_Size), is_location_of (between anatomical entities and other entities) and is_finding_of (between test entities and their results).
## Predicted Entities `is_date_of`, `is_finding_of`, `is_location_of`, `is_size_of`, `O` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_oncology_granular_biobert_wip_en_4.1.0_3.0_1664482477934.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_oncology_granular_biobert_wip_en_4.1.0_3.0_1664482477934.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use relation pairs to include only the combinations of entities that are relevant in your case.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunk", "dependencies"]) \ .setOutputCol("re_ner_chunk") \ .setMaxSyntacticDistance(10) \ .setRelationPairs(["Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery"]) re_model = RelationExtractionDLModel.pretrained("redl_oncology_granular_biobert_wip", "en", "clinical/models") \ .setInputCols(["re_ner_chunk", "sentence"]) \ .setOutputCol("relation_extraction") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model]) data = spark.createDataFrame([["A mastectomy was performed two months ago, and a 3 cm mass was extracted."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val 
document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentence", "pos_tags", "token")) .setOutputCol("dependencies") val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols("ner_chunk", "dependencies") .setOutputCol("re_ner_chunk") .setMaxSyntacticDistance(10) .setRelationPairs(Array("Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery")) val re_model = RelationExtractionDLModel.pretrained("redl_oncology_granular_biobert_wip", "en", "clinical/models") .setPredictionThreshold(0.5f) .setInputCols("re_ner_chunk", "sentence") .setOutputCol("relation_extraction") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("A mastectomy was performed two months ago, and a 3 cm mass was extracted.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python 
import nlu nlu.load("en.relation.oncology_granular_biobert_wip").predict("""A mastectomy was performed two months ago, and a 3 cm mass was extracted.""") ```
## Results ```bash | chunk1 | entity1 | chunk2 | entity2 | relation | confidence | |:-----------|:---------------|:---------------|:--------------|:-----------|-------------:| | mastectomy | Cancer_Surgery | two months ago | Relative_Date | is_date_of | 0.965252 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_oncology_granular_biobert_wip| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|405.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label recall precision f1 O 0.83 0.91 0.87 is_date_of 0.82 0.80 0.81 is_finding_of 0.92 0.85 0.88 is_location_of 0.95 0.85 0.90 is_size_of 0.91 0.80 0.85 macro-avg 0.89 0.84 0.86 ``` --- layout: model title: Legal Cultivation Of Agricultural Land Document Classifier (EURLEX) author: John Snow Labs name: legclf_cultivation_of_agricultural_land_bert date: 2023-03-06 tags: [en, legal, classification, clauses, cultivation_of_agricultural_land, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_cultivation_of_agricultural_land_bert model is a Bert Sentence Embeddings Document Classifier that determines whether a given document belongs to the class Cultivation_of_Agricultural_Land or not (Binary Classification), according to EuroVoc labels. 
## Predicted Entities `Cultivation_of_Agricultural_Land`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cultivation_of_agricultural_land_bert_en_1.0.0_3.0_1678111585218.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cultivation_of_agricultural_land_bert_en_1.0.0_3.0_1678111585218.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_cultivation_of_agricultural_land_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Cultivation_of_Agricultural_Land]| |[Other]| |[Other]| |[Cultivation_of_Agricultural_Land]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_cultivation_of_agricultural_land_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.1 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Cultivation_of_Agricultural_Land 0.86 0.88 0.87 56 Other 0.87 0.85 0.86 55 accuracy - - 0.86 111 macro-avg 0.87 0.86 0.86 111 weighted-avg 0.86 0.86 0.86 111 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Tuvaluan author: John Snow Labs name: opus_mt_en_tvl date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, tvl, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `en` - target languages: `tvl` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_tvl_xx_2.7.0_2.4_1609168063147.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_tvl_xx_2.7.0_2.4_1609168063147.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_tvl", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_tvl", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.tvl').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_tvl| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English XLMRobertaForTokenClassification Base Cased model (from Yaxin) author: John Snow Labs name: xlmroberta_ner_yaxin_base_conll2003 date: 2022-08-13 tags: [en, open_source, xlm_roberta, ner] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-conll2003-ner` is a English model originally trained by `Yaxin`. ## Predicted Entities `ORG`, `LOC`, `PER`, `MISC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_yaxin_base_conll2003_en_4.1.0_3.0_1660426515795.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_yaxin_base_conll2003_en_4.1.0_3.0_1660426515795.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_yaxin_base_conll2003","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_yaxin_base_conll2003","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_yaxin_base_conll2003| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|792.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Yaxin/xlm-roberta-base-conll2003-ner - https://paperswithcode.com/sota?task=Token+Classification&dataset=conll2003 --- layout: model title: Pipeline to Resolve Medication Codes author: John Snow Labs name: medication_resolver_pipeline date: 2023-04-13 tags: [licensed, clinical, en, resolver, snomed, umls, rxnorm, ndc, ade, pipeline] task: Chunk Mapping language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pretrained resolver pipeline to extract medications and resolve their adverse reactions (ADE), RxNorm, UMLS, NDC, SNOMED CT codes, and action/treatments in clinical text. Action/treatments are available for branded medication, and SNOMED codes are available for non-branded medication. This pipeline can be used as a LightPipeline (with `annotate`/`fullAnnotate`). You can use `medication_resolver_transform_pipeline` for Spark transform. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/medication_resolver_pipeline_en_4.3.2_3.0_1681388823359.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/medication_resolver_pipeline_en_4.3.2_3.0_1681388823359.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline med_resolver_pipeline = PretrainedPipeline("medication_resolver_pipeline", "en", "clinical/models") text = """The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""" result = med_resolver_pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val med_resolver_pipeline = new PretrainedPipeline("medication_resolver_pipeline", "en", "clinical/models") val result = med_resolver_pipeline.fullAnnotate("""The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.medication").predict("""The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet.""") ```
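The `fullAnnotate` call above returns, per input text, lists of annotation objects, each carrying a `result` string and a `metadata` dict. A minimal plain-Python sketch of flattening such output into chunk/code rows like the table in the Results section; the `Annotation` class and the `entity` metadata key here are illustrative stand-ins for the real Spark NLP annotation objects, not the pipeline's documented schema:

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    # Stand-in for a Spark NLP annotation: a result string plus metadata.
    result: str
    metadata: dict = field(default_factory=dict)

def rows_from_annotations(chunks, codes):
    # Pair each extracted medication chunk with its resolved code,
    # keeping the entity label from the chunk's metadata.
    return [(chunk.result, chunk.metadata.get("entity"), code.result)
            for chunk, code in zip(chunks, codes)]

# Simulated fullAnnotate output for one document
chunks = [Annotation("Lescol 40 MG", {"entity": "DRUG"})]
codes = [Annotation("103919")]
print(rows_from_annotations(chunks, codes))
```

The same zip-and-flatten pattern applies to the other resolver columns (UMLS, NDC, SNOMED CT), as long as the lists are aligned per chunk.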
## Results ```bash | | chunks | entities | ADE | RxNorm | Action | Treatment | UMLS | SNOMED_CT | NDC_Product | NDC_Package | |---:|:-----------------------------|:-----------|:----------------------------|---------:|:---------------------------|:-------------------------------------------|:---------|:------------|:--------------|:--------------| | 0 | Amlodopine Vallarta 10-320mg | DRUG | Gynaecomastia | 722131 | NONE | NONE | C1949334 | 425838008 | 00093-7693 | 00093-7693-56 | | 1 | Eviplera | DRUG | Anxiety | 217010 | Inhibitory Bone Resorption | Osteoporosis | C0720318 | NONE | NONE | NONE | | 2 | Lescol 40 MG | DRUG | NONE | 103919 | Hypocholesterolemic | Heterozygous Familial Hypercholesterolemia | C0353573 | NONE | 00078-0234 | 00078-0234-05 | | 3 | Everolimus 1.5 mg tablet | DRUG | Acute myocardial infarction | 2056895 | NONE | NONE | C4723581 | NONE | 00054-0604 | 00054-0604-21 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|medication_resolver_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|3.1 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel - TextMatcherModel - ChunkMergeModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperFilterer - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel - ResolverMerger - ResolverMerger - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - ChunkMapperModel - Finisher --- layout: model title: Bangla Bert Embeddings author: John Snow Labs name: bert_embeddings_bangla_bert_base date: 2022-04-11 tags: [bert, embeddings, bn, open_source] task: Embeddings language: bn edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## 
Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bangla-bert-base` is a Bangla model originally trained by `sagorsarker`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bangla_bert_base_bn_3.4.2_3.0_1649673290861.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bangla_bert_base_bn_3.4.2_3.0_1649673290861.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bangla_bert_base","bn") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["আমি স্পার্ক এনএলপি ভালোবাসি"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bangla_bert_base","bn") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("আমি স্পার্ক এনএলপি ভালোবাসি").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("bn.embed.bangala_bert").predict("""আমি স্পার্ক এনএলপি ভালোবাসি""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bangla_bert_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|bn| |Size:|617.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/sagorsarker/bangla-bert-base - https://github.com/sagorbrur/bangla-bert - https://arxiv.org/abs/1810.04805 - https://github.com/google-research/bert - https://oscar-corpus.com/ - https://dumps.wikimedia.org/bnwiki/latest/ - https://github.com/sagorbrur/bnlp - https://twitter.com/mapmeld - https://github.com/rezacsedu/Classification_Benchmarks_Benglai_NLP - https://github.com/sagorbrur/bangla-bert/blob/master/notebook/bangla-bert-evaluation-classification-task.ipynb - https://github.com/sagorbrur/bangla-bert/tree/master/evaluations/wikiann - https://arxiv.org/abs/2012.14353 - https://arxiv.org/abs/2104.08613 - https://arxiv.org/abs/2107.03844 - https://arxiv.org/abs/2101.00204 - https://github.com/sagorbrur - https://www.tensorflow.org/tfrc --- layout: model title: English BertForQuestionAnswering model (from nbroad) author: John Snow Labs name: bert_qa_xdistil_l12_h384_squad2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xdistil-l12-h384-squad2` is an English model originally trained by `nbroad`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_xdistil_l12_h384_squad2_en_4.0.0_3.0_1654192565150.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_xdistil_l12_h384_squad2_en_4.0.0_3.0_1654192565150.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_xdistil_l12_h384_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_xdistil_l12_h384_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.distilled").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_xdistil_l12_h384_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|124.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/nbroad/xdistil-l12-h384-squad2 --- layout: model title: Legal Reimbursements Clause Binary Classifier author: John Snow Labs name: legclf_reimbursements_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `reimbursements` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). 
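The paragraph-splitting step described above (splitting by multiline before classification, while keeping pieces within the 512-token budget) can be sketched in plain Python; the whitespace-based count below is a rough stand-in for the embedding model's real subword tokenizer, which usually produces more tokens:

```python
def split_into_paragraphs(text, max_tokens=512):
    # Split a document on blank lines and flag whether each piece
    # likely fits within the embedding model's token limit.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    # Approximate token count by whitespace splitting; a WordPiece
    # tokenizer would typically yield a higher count.
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

doc = "Clause 1. The Company shall reimburse all expenses.\n\nClause 2. Payment terms."
for paragraph, fits in split_into_paragraphs(doc):
    print(fits, paragraph)
```

Each piece that fits can then be fed to the classifier pipeline as a separate row of the input DataFrame.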
This model can be combined with any of the other hundreds of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `reimbursements` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_reimbursements_clause_en_1.0.0_3.2_1660123924164.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_reimbursements_clause_en_1.0.0_3.2_1660123924164.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_reimbursements_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
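As the description notes, several binary clause classifiers can be combined to obtain a True/False flag per clause type. One way is to run each model's pipeline separately and merge the collected label columns afterwards; a minimal plain-Python sketch of that merging step, where the label lists are stand-ins for the `result` column collected from each pipeline:

```python
def merge_clause_predictions(predictions_by_model):
    # predictions_by_model maps a clause name to the list of labels
    # (one per document) emitted by that clause's binary classifier,
    # where each label is either the clause name or "other".
    n_docs = len(next(iter(predictions_by_model.values())))
    flags = []
    for i in range(n_docs):
        flags.append({clause: labels[i] == clause
                      for clause, labels in predictions_by_model.items()})
    return flags

# Simulated output of two clause classifiers over three clause texts
collected = {
    "reimbursements": ["reimbursements", "other", "other"],
    "cultivation_of_agricultural_land": ["other", "other", "cultivation_of_agricultural_land"],
}
print(merge_clause_predictions(collected))
```

The result is one dict per document, mapping each clause type to a boolean, which is easy to load back into a DataFrame.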
## Results ```bash +-------+ | result| +-------+ |[reimbursements]| |[other]| |[other]| |[reimbursements]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_reimbursements_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.96 0.97 0.96 132 reimbursements 0.93 0.90 0.91 58 accuracy - - 0.95 190 macro-avg 0.94 0.93 0.94 190 weighted-avg 0.95 0.95 0.95 190 ``` --- layout: model title: Pipeline to Detect Chemicals and Proteins in text (biobert) author: John Snow Labs name: ner_chemprot_biobert_pipeline date: 2023-03-20 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_chemprot_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_chemprot_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chemprot_biobert_pipeline_en_4.3.0_3.2_1679314581092.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chemprot_biobert_pipeline_en_4.3.0_3.2_1679314581092.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_chemprot_biobert_pipeline", "en", "clinical/models") text = '''Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_chemprot_biobert_pipeline", "en", "clinical/models") val text = "Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.chemprot_biobert.pipeline").predict("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-------------|--------:|------:|:------------|-------------:| | 0 | Keratinocyte | 0 | 11 | GENE-Y | 0.894 | | 1 | growth | 13 | 18 | GENE-Y | 0.4833 | | 2 | factor | 20 | 25 | GENE-Y | 0.7991 | | 3 | acidic | 31 | 36 | GENE-Y | 0.9765 | | 4 | fibroblast | 38 | 47 | GENE-Y | 0.3905 | | 5 | growth | 49 | 54 | GENE-Y | 0.7109 | | 6 | factor | 56 | 61 | GENE-Y | 0.8693 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_chemprot_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.2 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: English asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you TFWav2Vec2ForCTC from project2you author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you` is an English model originally trained by project2you. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you_en_4.2.0_3.0_1664110092661.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you_en_4.2.0_3.0_1664110092661.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
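Both snippets above transform a pre-existing `audioDf`, whose construction is not shown: the `audio_content` column must hold the raw waveform as an array of floats. A minimal, stdlib-only sketch of that step is below; the file name is illustrative, and it assumes 16-bit PCM mono input (Wav2Vec2 checkpoints are typically trained on 16 kHz audio, so resample beforehand if needed).

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit PCM WAV file into a list of floats in [-1.0, 1.0).

    Resampling and downmixing are not handled here; make sure the file
    already matches the sampling rate the model was trained on.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        frames = wav.readframes(wav.getnframes())
    # Unpack little-endian signed shorts and normalize to [-1.0, 1.0).
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# The floats then become the "audio_content" column consumed by AudioAssembler:
# audioDf = spark.createDataFrame([[wav_to_floats("sample.wav")]]).toDF("audio_content")
```
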
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_demo_colab_by_project2you| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Tagalog Electra Embeddings (from jcblaise) author: John Snow Labs name: electra_embeddings_electra_tagalog_base_cased_generator date: 2022-05-17 tags: [tl, open_source, electra, embeddings] task: Embeddings language: tl edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-tagalog-base-cased-generator` is a Tagalog model originally trained by `jcblaise`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_tagalog_base_cased_generator_tl_3.4.4_3.0_1652786728634.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_tagalog_base_cased_generator_tl_3.4.4_3.0_1652786728634.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_tagalog_base_cased_generator","tl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Mahilig ako sa Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_tagalog_base_cased_generator","tl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Mahilig ako sa Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
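Each row of `result` carries one embedding vector per token. If you need a single vector per sentence, a common approach is mean pooling, sketched below in plain Python; the commented Spark extraction lines are illustrative, and Spark NLP's `SentenceEmbeddings` annotator offers the same AVERAGE pooling natively if you prefer to stay inside the pipeline.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token embedding vectors into one
    sentence-level vector (element-wise mean across tokens)."""
    if not token_vectors:
        raise ValueError("no token vectors to pool")
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# The per-token vectors live in the "embeddings" annotation column, e.g.:
# rows = result.selectExpr("explode(embeddings.embeddings) as vec").collect()
# sentence_vector = mean_pool([row.vec for row in rows])
```
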
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electra_tagalog_base_cased_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|tl| |Size:|130.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/jcblaise/electra-tagalog-base-cased-generator - https://blaisecruz.com --- layout: model title: Detect PHI for Deidentification (Enriched) author: John Snow Labs name: ner_deid_enriched date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The Named Entity Recognition annotator allows a generic model to be trained by utilizing a deep learning algorithm (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (Enriched) is a Named Entity Recognition model that annotates text to find protected health information that may need to be deidentified. The entities it annotates are Age, City, Country, Date, Doctor, Hospital, Idnum, Medicalrecord, Organization, Patient, Phone, Profession, State, Street, Username, and Zip. Clinical NER is trained with the 'embeddings_clinical' word embeddings model, so be sure to use the same embeddings in the pipeline. We adhered to the official annotation guidelines (AG) of the 2014 i2b2 Deid challenge while annotating new datasets for this model. 
All the details regarding the nuances of the AG, along with explanations, can be found here: [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/) ## Predicted Entities ``Age``, ``City``, ``Country``, ``Date``, ``Doctor``, ``Hospital``, ``Idnum``, ``Medicalrecord``, ``Organization``, ``Patient``, ``Phone``, ``Profession``, ``State``, ``Street``, ``Username``, ``Zip`` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_DEMOGRAPHICS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_DEMOGRAPHICS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_enriched_en_3.0.0_3.0_1617208426129.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_deid_enriched_en_3.0.0_3.0_1617208426129.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_deid_enriched","en","clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. 
There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.']], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_deid_enriched","en","clinical/models") .setInputCols("sentence","token","embeddings") .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. 
A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.deid.enriched").predict("""HISTORY OF PRESENT ILLNESS: Mr. Smith is a 60-year-old white male veteran with multiple comorbidities, who has a history of bladder cancer diagnosed approximately two years ago by the VA Hospital. He underwent a resection there. He was to be admitted to the Day Hospital for cystectomy. He was seen in Urology Clinic and Radiology Clinic on 02/04/2003. HOSPITAL COURSE: Mr. Smith presented to the Day Hospital in anticipation for Urology surgery. On evaluation, EKG, echocardiogram was abnormal, a Cardiology consult was obtained. A cardiac adenosine stress MRI was then proceeded, same was positive for inducible ischemia, mild-to-moderate inferolateral subendocardial infarction with peri-infarct ischemia. In addition, inducible ischemia seen in the inferior lateral septum. Mr. Smith underwent a left heart catheterization, which revealed two vessel coronary artery disease. The RCA, proximal was 95% stenosed and the distal 80% stenosed. 
The mid LAD was 85% stenosed and the distal LAD was 85% stenosed. There was four Multi-Link Vision bare metal stents placed to decrease all four lesions to 0%. Following intervention, Mr. Smith was admitted to 7 Ardmore Tower under Cardiology Service under the direction of Dr. Hart. Mr. Smith had a noncomplicated post-intervention hospital course. He was stable for discharge home on 02/07/2003 with instructions to take Plavix daily for one month and Urology is aware of the same.""") ```
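The chunks produced by `ner_converter` carry character offsets (`begin`/`end`), which downstream deidentification uses to replace the detected PHI. The masking step can be sketched in plain Python as below; this is illustrative only — Healthcare NLP ships dedicated De-Identification annotators for production use.

```python
def mask_entities(text, chunks):
    """Replace each detected chunk with a <LABEL> placeholder, using the
    begin/end character offsets returned by NerConverter (end is inclusive).
    Chunks are applied right-to-left so earlier offsets remain valid."""
    for begin, end, label in sorted(chunks, reverse=True):
        text = text[:begin] + "<" + label + ">" + text[end + 1:]
    return text

# Hypothetical offsets for illustration:
# mask_entities("Mr. Smith was seen by Dr. Hart.",
#               [(4, 8, "PATIENT"), (26, 29, "DOCTOR")])
```
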
## Results ```bash +---------------+---------+ |chunk |ner_label| +---------------+---------+ |Smith |PATIENT | |VA Hospital |HOSPITAL | |Day Hospital |HOSPITAL | |02/04/2003 |DATE | |Smith |PATIENT | |Day Hospital |HOSPITAL | |Smith |PATIENT | |Smith |PATIENT | |7 Ardmore Tower|HOSPITAL | |Hart |DOCTOR | |Smith |PATIENT | |02/07/2003 |DATE | +---------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_deid_enriched| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on JSL enriched n2c2 2014: De-identification and Heart Disease Risk Factors Challenge datasets with `embeddings_clinical` https://portal.dbmi.hms.harvard.edu/projects/n2c2-2014/ ## Benchmarking ```bash label tp fp fn prec rec f1 I-AGE 7 3 6 0.7 0.538462 0.608696 I-DOCTOR 800 27 94 0.967352 0.894855 0.929692 I-IDNUM 6 0 2 1 0.75 0.857143 B-DATE 1883 34 56 0.982264 0.971119 0.97666 I-DATE 425 28 25 0.93819 0.944444 0.941307 B-PHONE 29 7 9 0.805556 0.763158 0.783784 B-STATE 87 4 11 0.956044 0.887755 0.920635 B-CITY 35 11 26 0.76087 0.57377 0.654206 I-ORGANIZATION 12 4 15 0.75 0.444444 0.55814 B-DOCTOR 728 75 53 0.9066 0.932138 0.919192 I-PROFESSION 43 11 13 0.796296 0.767857 0.781818 I-PHONE 62 4 4 0.939394 0.939394 0.939394 B-AGE 234 13 16 0.947368 0.936 0.94165 B-STREET 20 7 16 0.740741 0.555556 0.634921 I-ZIP 60 3 2 0.952381 0.967742 0.96 I-MEDICALRECORD 54 5 2 0.915254 0.964286 0.93913 B-ZIP 2 1 0 0.666667 1 0.8 B-HOSPITAL 256 23 66 0.917563 0.795031 0.851913 I-STREET 150 17 20 0.898204 0.882353 0.890208 B-COUNTRY 22 2 8 0.916667 0.733333 0.814815 I-COUNTRY 1 0 0 1 1 1 I-STATE 6 0 1 1 0.857143 0.923077 B-USERNAME 30 0 4 1 0.882353 0.9375 I-HOSPITAL 295 37 64 0.888554 0.821727 0.853835 I-PATIENT 243 26 41 0.903346 0.855634 0.878843 B-PROFESSION 52 8 17 0.866667 0.753623 0.806202 B-IDNUM 32 3 12 0.914286 0.727273 
0.810127 I-CITY 76 15 13 0.835165 0.853933 0.844444 B-PATIENT 337 29 40 0.920765 0.893899 0.907133 B-MEDICALRECORD 74 6 4 0.925 0.948718 0.936709 B-ORGANIZATION 20 5 13 0.8 0.606061 0.689655 Macro-average 6083 408 673 0.7976 0.697533 0.744218 Micro-average 6083 408 673 0.937144 0.900385 0.918397 ``` --- layout: model title: English Bert Embeddings (Base, Uncased, Chemical) author: John Snow Labs name: bert_embeddings_chemical_bert_uncased date: 2022-04-11 tags: [bert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `chemical-bert-uncased` is an English model originally trained by `recobo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chemical_bert_uncased_en_3.4.2_3.0_1649671988862.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chemical_bert_uncased_en_3.4.2_3.0_1649671988862.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_chemical_bert_uncased","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_chemical_bert_uncased","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.chemical_bert_uncased").predict("""I love Spark NLP""") ```
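Domain-adapted embeddings such as this chemical BERT are typically compared with cosine similarity, e.g. to find chemically related terms. A minimal plain-Python sketch (Spark NLP itself does not require this step; it is post-processing on the extracted vectors):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors of equal length.
    Returns 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```
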
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chemical_bert_uncased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|412.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/recobo/chemical-bert-uncased --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from cj-mills) author: John Snow Labs name: xlmroberta_ner_cj_mills_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `cj-mills`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_cj_mills_base_finetuned_panx_de_4.1.0_3.0_1660431546170.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_cj_mills_base_finetuned_panx_de_4.1.0_3.0_1660431546170.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_cj_mills_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_cj_mills_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_cj_mills_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|850.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/cj-mills/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Legal NER for MAPA (Multilingual Anonymisation for Public Administrations) author: John Snow Labs name: legner_mapa date: 2023-04-27 tags: [ro, licensed, ner, legal, mapa] task: Named Entity Recognition language: ro edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union. This model extracts `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, and `PERSON` entities from `Romanian` documents. ## Predicted Entities `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, `PERSON` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_mapa_ro_1.0.0_3.0_1682609352989.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_mapa_ro_1.0.0_3.0_1682609352989.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_ro_cased", "ro")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_mapa", "ro", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""Or, rezultă din hotărârea Curții de Apel București din 12 iunie 2013 că instanța română a aplicat greșit dreptul Uniunii (32) atunci când a respins excepția de litispendență invocată de domnul Liberato, întemeiată pe cererile referitoare la legătura matrimonială."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
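Under the hood, `ner_converter` collapses the model's token-level IOB tags (`B-X`, `I-X`, `O`) into the chunks shown in the results. A simplified plain-Python sketch of that grouping logic (illustrative; the actual NerConverter also handles edge cases such as whitelists and IOB2 variants):

```python
def iob_to_chunks(tokens, tags):
    """Collapse parallel lists of tokens and IOB tags into
    (chunk_text, label) pairs. A chunk starts at B-X and extends
    over consecutive I-X tags with the same label."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O" tag, or an I- tag that does not continue the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks
```
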
## Results ```bash +---------------+---------+ |chunk |ner_label| +---------------+---------+ |București |ADDRESS | |12 iunie 2013 |DATE | |domnul Liberato|PERSON | +---------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_mapa| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ro| |Size:|1.4 MB| ## References The dataset is available [here](https://huggingface.co/datasets/joelito/mapa). ## Benchmarking ```bash label precision recall f1-score support ADDRESS 0.88 0.96 0.92 23 AMOUNT 1.00 0.67 0.80 3 DATE 0.97 0.97 0.97 31 ORGANISATION 0.67 0.71 0.69 28 PERSON 0.91 0.83 0.87 48 macro-avg 0.86 0.86 0.86 133 macro-avg 0.88 0.83 0.85 133 weighted-avg 0.87 0.86 0.86 133 ``` --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_4 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-128-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_4_en_4.0.0_3.0_1655731258207.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_4_en_4.0.0_3.0_1655731258207.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_128d_seed_4").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
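Extractive QA models fine-tuned on SQuAD, like this one, are usually scored with exact match after light answer normalization. A simplified plain-Python version of that normalization, modeled on the SQuAD evaluation script, is sketched below (for quick sanity checks of the `answer` column, not a replacement for the official script):

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and English articles, and collapse
    whitespace -- the normalization used by SQuAD-style evaluation."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """True if the two answers are identical after normalization."""
    return normalize(prediction) == normalize(gold)
```
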
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|423.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-128-finetuned-squad-seed-4 --- layout: model title: Legal No Conflicts Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_no_conflicts_bert date: 2023-03-05 tags: [en, legal, classification, clauses, no_conflicts, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from the US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `No_Conflicts` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it unless you want to do Binary Classification at the sentence level. 
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. If your texts are longer than that, consider splitting them into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `No_Conflicts`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_no_conflicts_bert_en_1.0.0_3.0_1678050654719.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_no_conflicts_bert_en_1.0.0_3.0_1678050654719.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_no_conflicts_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
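As the description notes, this binary classifier is meant to be combined with other clause classifiers, each emitting its clause label or `Other`. A hypothetical sketch of aggregating such outputs into per-clause boolean flags (the model names in the example are illustrative, not real pipeline output):

```python
def aggregate_clause_flags(results):
    """Turn per-classifier predictions, e.g.
    {"legclf_no_conflicts_bert": "No_Conflicts", "legclf_example_bert": "Other"},
    into clause-presence flags. Any prediction other than "Other"
    marks that classifier's clause as present."""
    return {name: label != "Other" for name, label in results.items()}
```
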
## Results ```bash +--------------+ |result        | +--------------+ |[No_Conflicts]| |[Other]       | |[Other]       | |[No_Conflicts]| +--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_no_conflicts_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support No_Conflicts 0.96 0.98 0.97 84 Other 0.98 0.97 0.98 118 accuracy - - 0.98 202 macro-avg 0.97 0.98 0.97 202 weighted-avg 0.98 0.98 0.98 202 ``` --- layout: model title: Detect Adverse Drug Events (clinical_medium) author: John Snow Labs name: ner_ade_emb_clinical_medium date: 2023-05-21 tags: [en, clinical, ade, drug, licensed, ner] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect adverse reactions to drugs in reviews, tweets, and medical text using a pre-trained NER model. ## Predicted Entities `DRUG`, `ADE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_ade_emb_clinical_medium_en_4.4.2_3.0_1684646733993.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_ade_emb_clinical_medium_en_4.4.2_3.0_1684646733993.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_ade_emb_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(['sentence', 'token', 'ner'])\ .setOutputCol('ner_chunk') pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter ]) sample_df = spark.createDataFrame([["Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . 
Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps."]]).toDF("text") result = pipeline.fit(sample_df).transform(sample_df) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_ade_emb_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, clinical_embeddings, ner_model, ner_converter)) val sample_data = Seq("Been taking Lipitor for 15 years , have experienced severe fatigue a lot!!! . Doctor moved me to voltaren 2 months ago , so far , have only experienced cramps.").toDS.toDF("text") val result = pipeline.fit(sample_data).transform(sample_data) ```
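The `NerConverterInternal` stage in the pipeline above merges token-level BIO tags emitted by the NER model into entity chunks like `Lipitor`/`DRUG`. As a rough illustration of that merging logic, here is a simplified pure-Python sketch (illustrative only — the real annotator also tracks character offsets and confidence scores):

```python
# Simplified sketch of BIO-tag chunk merging (not the annotator's actual code).
def merge_bio_chunks(tokens, tags):
    """Merge (token, BIO-tag) pairs into (chunk_text, label) tuples."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag starts a new chunk; flush any chunk in progress.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # An I- tag with a matching label continues the current chunk.
            current.append(token)
        else:
            # O tag (or label mismatch) ends the current chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Been", "taking", "Lipitor", ",", "experienced", "severe", "fatigue"]
tags   = ["O",    "O",      "B-DRUG",  "O", "O",           "B-ADE",  "I-ADE"]
print(merge_bio_chunks(tokens, tags))  # [('Lipitor', 'DRUG'), ('severe fatigue', 'ADE')]
```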
## Results ```bash +--------------+-----+---+---------+ |chunk |begin|end|ner_label| +--------------+-----+---+---------+ |Lipitor |12 |18 |DRUG | |severe fatigue|52 |65 |ADE | |voltaren |97 |104|DRUG | |cramps |152 |157|ADE | +--------------+-----+---+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_ade_emb_clinical_medium| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|2.7 MB| ## Benchmarking ```bash label precision recall f1-score support DRUG 0.92 0.91 0.91 15895 ADE 0.83 0.77 0.80 6077 micro-avg 0.89 0.87 0.88 21972 macro-avg 0.87 0.84 0.86 21972 weighted-avg 0.89 0.87 0.88 21972 ``` --- layout: model title: Extract treatment entities (Voice of the Patients) author: John Snow Labs name: ner_vop_treatment_wip date: 2023-05-19 tags: [licensed, clinical, en, ner, vop, patient, treatment] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts treatment mentions from documents written in the patients' own words. Note: the 'wip' suffix indicates that model development is a work in progress; the model will be finalised and its performance improved in upcoming releases.
## Predicted Entities `Drug`, `Form`, `Route`, `Dosage`, `Duration`, `Procedure`, `Frequency`, `Treatment` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_treatment_wip_en_4.4.2_3.0_1684513558207.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_treatment_wip_en_4.4.2_3.0_1684513558207.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_treatment_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["My grandpa was diagnosed with type 2 diabetes and had to make some changes to his lifestyle. He also takes metformin and glipizide to help regulate his blood sugar levels. 
It's been a bit of an adjustment, but he's doing well."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_treatment_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("My grandpa was diagnosed with type 2 diabetes and had to make some changes to his lifestyle. He also takes metformin and glipizide to help regulate his blood sugar levels. It's been a bit of an adjustment, but he's doing well.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:----------|:------------| | metformin | Drug | | glipizide | Drug | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_treatment_wip| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.9 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
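The benchmarking table below reports raw `tp`/`fp`/`fn` counts alongside the derived metrics. The derivation is direct, as this small sketch shows (numbers taken from the `Drug` row of the table):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw true-positive/false-positive/false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Drug row from the benchmark below: tp=1299, fp=92, fn=141
p, r, f1 = prf(1299, 92, 141)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # precision=0.93 recall=0.90 f1=0.92
```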
## Benchmarking ```bash label tp fp fn total precision recall f1 Drug 1299 92 141 1440 0.93 0.90 0.92 Form 251 36 15 266 0.87 0.94 0.91 Route 41 5 7 48 0.89 0.85 0.87 Dosage 342 35 70 412 0.91 0.83 0.87 Duration 2064 530 246 2310 0.80 0.89 0.84 Procedure 543 67 162 705 0.89 0.77 0.83 Frequency 888 181 191 1079 0.83 0.82 0.83 Treatment 165 54 63 228 0.75 0.72 0.74 macro_avg 5593 1000 895 6488 0.86 0.84 0.85 micro_avg 5593 1000 895 6488 0.85 0.86 0.86 ``` --- layout: model title: Legal Participation Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_participation_agreement_bert date: 2022-11-25 tags: [en, legal, classification, agreement, participation, licensed, bert] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_participation_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `participation-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `participation-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_participation_agreement_bert_en_1.0.0_3.0_1669368597593.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_participation_agreement_bert_en_1.0.0_3.0_1669368597593.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_participation_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[participation-agreement]| |[other]| |[other]| |[participation-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_participation_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Legal documents, scraped from the Internet and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.97 1.00 0.98 65 participation-agreement 1.00 0.94 0.97 33 accuracy - - 0.98 98 macro-avg 0.99 0.97 0.98 98 weighted-avg 0.98 0.98 0.98 98 ``` --- layout: model title: Arabic Named Entity Recognition (Modern Standard Arabic-MSA) author: John Snow Labs name: bert_ner_bert_base_arabic_camelbert_msa_ner date: 2022-05-04 tags: [bert, ner, token_classification, ar, open_source] task: Named Entity Recognition language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-msa-ner` is an Arabic model originally trained by `CAMeL-Lab`. ## Predicted Entities `ORG`, `LOC`, `PERS`, `MISC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_arabic_camelbert_msa_ner_ar_3.4.2_3.0_1651630233283.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_arabic_camelbert_msa_ner_ar_3.4.2_3.0_1651630233283.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_arabic_camelbert_msa_ner","ar") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_arabic_camelbert_msa_ner","ar") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("أنا أحب الشرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.ner.arabic_camelbert_msa_ner").predict("""أنا أحب الشرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_bert_base_arabic_camelbert_msa_ner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ar| |Size:|407.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa-ner - https://camel.abudhabi.nyu.edu/anercorp/ - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://github.com/CAMeL-Lab/camel_tools --- layout: model title: BERT Sequence Classification - Detect Spam SMS author: John Snow Labs name: bert_sequence_classifier_sms_spam date: 2021-11-07 tags: [sms, spam, bert_for_sequence_classification, en, open_source] task: Text Classification language: en nav_key: models edition: Spark NLP 3.3.2 spark_version: 2.4 supported: true annotator: BertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is imported from `Hugging Face-models`. It is a BERT-Tiny model fine-tuned on the `sms_spam` dataset. It identifies whether an SMS is spam or not. - `LABEL_0` : No Spam - `LABEL_1` : Spam ## Predicted Entities `LABEL_0`, `LABEL_1` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_sms_spam_en_3.3.2_2.4_1636290194115.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_sequence_classifier_sms_spam_en_3.3.2_2.4_1636290194115.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = BertForSequenceClassification \ .pretrained('bert_sequence_classifier_sms_spam', 'en') \ .setInputCols(['token', 'document']) \ .setOutputCol('class') \ .setCaseSensitive(True) \ .setMaxSentenceLength(512) pipeline = Pipeline(stages=[document_assembler, tokenizer, sequenceClassifier]) example = spark.createDataFrame([['Camera - You are awarded a SiPix Digital Camera! call 09061221066 from landline. Delivery within 28 days.']]).toDF("text") result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_sms_spam", "en") .setInputCols("document", "token") .setOutputCol("class") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) val example = Seq("Camera - You are awarded a SiPix Digital Camera! call 09061221066 from landline. Delivery within 28 days.").toDS.toDF("text") val result = pipeline.fit(example).transform(example) ```
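Per the model description, `LABEL_0` means no spam and `LABEL_1` means spam. A tiny post-processing helper (a hypothetical convenience, not part of the library) can make the classifier output readable:

```python
# Mapping taken from the model description above; the helper itself is a
# hypothetical convenience for post-processing the classifier output.
LABEL_MAP = {"LABEL_0": "ham", "LABEL_1": "spam"}

def decode_labels(raw_labels):
    """Translate raw classifier labels into human-readable names."""
    return [LABEL_MAP[label] for label in raw_labels]

print(decode_labels(["LABEL_1"]))  # ['spam']
```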
## Results ```bash ['LABEL_1'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_sms_spam| |Compatibility:|Spark NLP 3.3.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[label]| |Language:|en| |Case sensitive:|true| ## Data Source [https://huggingface.co/mrm8488/bert-tiny-finetuned-sms-spam-detection](https://huggingface.co/mrm8488/bert-tiny-finetuned-sms-spam-detection) ## Benchmarking ```bash label score accuracy 0.98 ``` --- layout: model title: Legal Wood Industry Document Classifier (EURLEX) author: John Snow Labs name: legclf_wood_industry_bert date: 2023-03-06 tags: [en, legal, classification, clauses, wood_industry, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The `legclf_wood_industry_bert` model is a Bert Sentence Embeddings Document Classifier that classifies whether a given document belongs to the class `Wood_Industry` or not (Binary Classification) according to EuroVoc labels. ## Predicted Entities `Wood_Industry`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_wood_industry_bert_en_1.0.0_3.0_1678111777594.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_wood_industry_bert_en_1.0.0_3.0_1678111777594.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_wood_industry_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Wood_Industry]| |[Other]| |[Other]| |[Wood_Industry]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_wood_industry_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.2 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.94 0.88 0.91 34 Wood_Industry 0.86 0.92 0.89 26 accuracy - - 0.90 60 macro-avg 0.90 0.90 0.90 60 weighted-avg 0.90 0.90 0.90 60 ``` --- layout: model title: Pipeline to Mapping ICD10-CM Codes with Their Corresponding SNOMED Codes author: John Snow Labs name: icd10cm_snomed_mapping date: 2023-03-29 tags: [en, licensed, icd10cm, snomed, pipeline, chunk_mapping] task: Chunk Mapping language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of `icd10cm_snomed_mapper` model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10cm_snomed_mapping_en_4.3.2_3.2_1680118771671.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10cm_snomed_mapping_en_4.3.2_3.2_1680118771671.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("icd10cm_snomed_mapping", "en", "clinical/models") result = pipeline.fullAnnotate("R079 N4289 M62830") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("icd10cm_snomed_mapping", "en", "clinical/models") val result = pipeline.fullAnnotate("R079 N4289 M62830") ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.icd10cm_to_snomed.pipe").predict("""Put your text here.""") ```
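Conceptually, the mapper behaves like a dictionary lookup from ICD-10-CM codes to SNOMED codes. The sketch below illustrates that behaviour in pure Python, using the three code pairs shown in the results below and assuming they line up positionally; the real `ChunkMapperModel` handles tokenization and ships a far larger mapping resource:

```python
# Illustrative lookup table built from the example output; the real pipeline's
# mapping resource covers many more codes.
ICD10CM_TO_SNOMED = {
    "R079": "161972006",
    "N4289": "22035000",
    "M62830": "16410651000119105",
}

def map_codes(text):
    """Map whitespace-separated ICD-10-CM codes to their SNOMED codes."""
    return [ICD10CM_TO_SNOMED.get(code, "NONE") for code in text.split()]

print(map_codes("R079 N4289 M62830"))  # ['161972006', '22035000', '16410651000119105']
```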
## Results ```bash | | icd10cm_code | snomed_code | |---:|:----------------------|:-----------------------------------------| | 0 | R079 | N4289 | M62830 | 161972006 | 22035000 | 16410651000119105 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd10cm_snomed_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.1 MB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel --- layout: model title: Entity Resolver for Human Phenotype Ontology author: John Snow Labs name: sbiobertresolve_HPO date: 2021-05-16 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.4 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps phenotypic abnormalities encountered in human diseases to Human Phenotype Ontology (HPO) codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings, and has faster load time, with a speedup of about 6X when compared to previous versions. Also the load process now is more memory friendly meaning that the maximum memory required during load time is smaller, reducing the chances of OOM exceptions, and thus relaxing hardware requirements. ## Predicted Entities This model returns Human Phenotype Ontology (HPO) codes for phenotypic abnormalities encountered in human diseases. 
It also returns associated codes from the following vocabularies for each HPO code: MeSH (Medical Subject Headings), SNOMED, UMLS (Unified Medical Language System), ORPHA (the international reference resource for information on rare diseases and orphan drugs), and OMIM (Online Mendelian Inheritance in Man). {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_MSH/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_HPO_en_3.0.4_3.0_1621189482944.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_HPO_en_3.0.4_3.0_1621189482944.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The ```sbiobertresolve_HPO``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_human_phenotype_gene_clinical``` as the NER model. There is no need to use ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli",'en','clinical/models')\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_HPO", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") pipeline = Pipeline(stages = [document_assembler, sentence_detector, tokens, embeddings, ner, ner_converter, chunk2doc, sbert_embedder, resolver]) model = LightPipeline(pipeline.fit(spark.createDataFrame([[""]], ["text"]))) text="""These disorders include cancer, bipolar disorder, schizophrenia, autism, Cri-du-chat syndrome, myopia, cortical cataract-linked Alzheimer's disease, and infectious diseases""" res = model.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.HPO").predict("""These disorders include cancer, bipolar disorder, schizophrenia, autism, Cri-du-chat syndrome, myopia, cortical cataract-linked Alzheimer's disease, and infectious diseases""") ```
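The `aux_codes` column in the results below packs several vocabularies into one string, with `||` between vocabularies and `VOCAB:code1,code2` entries. A small parser sketch (a hypothetical helper; the format is inferred from the sample output below):

```python
def parse_aux_codes(aux):
    """Split an aux_codes string like 'MSH:D009369||SNOMED:108369006,363346000'
    into a {vocabulary: [codes]} dict."""
    out = {}
    for part in aux.split("||"):
        vocab, _, codes = part.partition(":")  # split on the first ':' only
        out[vocab] = codes.split(",")
    return out

aux = "MSH:D009369||SNOMED:108369006,363346000||UMLS:C0006826,C0027651||ORPHA:1775"
print(parse_aux_codes(aux)["SNOMED"])  # ['108369006', '363346000']
```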
## Results ```bash | | chunk | entity | resolution | aux_codes | |---:|:-----------------|:---------|:-------------|:-----------------------------------------------------------------------------| | 0 | cancer | HP | HP:0002664 | MSH:D009369||SNOMED:108369006,363346000||UMLS:C0006826,C0027651||ORPHA:1775 | | 1 | bipolar disorder | HP | HP:0007302 | MSH:D001714||SNOMED:13746004||UMLS:C0005586||ORPHA:370079 | | 2 | schizophrenia | HP | HP:0100753 | MSH:D012559||SNOMED:191526005,58214004||UMLS:C0036341||ORPHA:231169 | | 3 | autism | HP | HP:0000717 | MSH:D001321||SNOMED:408856003,408857007,43614003||UMLS:C0004352||ORPHA:79279 | | 4 | myopia | HP | HP:0000545 | MSH:D009216||SNOMED:57190000||UMLS:C0027092||ORPHA:370022 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_HPO| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[hpo_code]| |Language:|en| |Case sensitive:|false| --- layout: model title: GPT2 text-to-text model, distilled version author: John Snow Labs name: gpt2_distilled date: 2021-12-03 tags: [gpt2, en, open_source] task: Text Generation language: en nav_key: models edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: GPT2Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where the model is primed with an input and it generates a lengthy continuation. This is a distilled version which has fewer parameters and requires less computational resources to run. 
## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/gpt2_distilled_en_3.4.0_3.0_1638520197316.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/gpt2_distilled_en_3.4.0_3.0_1638520197316.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") gpt2 = GPT2Transformer.pretrained("gpt2_distilled") \ .setInputCols(["documents"]) \ .setMaxOutputLength(50) \ .setOutputCol("generation") pipeline = Pipeline().setStages([documentAssembler, gpt2]) data = spark.createDataFrame([["My name is Leonardo."]]).toDF("text") result = pipeline.fit(data).transform(data) result.select("generation.result").show(truncate=False) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("documents") val gpt2 = GPT2Transformer.pretrained("gpt2_distilled") .setInputCols(Array("documents")) .setMinOutputLength(10) .setMaxOutputLength(50) .setDoSample(false) .setTopK(50) .setNoRepeatNgramSize(3) .setOutputCol("generation") val pipeline = new Pipeline().setStages(Array(documentAssembler, gpt2)) val data = Seq("My name is Leonardo.").toDF("text") val result = pipeline.fit(data).transform(data) result.select("generation.result").show(truncate = false) ``` {:.nlu-block} ```python import nlu nlu.load("en.gpt2.distilled").predict("""My name is Leonardo.""") ```
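The Scala example above sets `setNoRepeatNgramSize(3)`, which blocks any 3-gram from appearing twice in the generated text. A simplified pure-Python sketch of the check such a constraint performs at each decoding step (illustrative only, not the transformer's implementation):

```python
def banned_next_tokens(generated, n=3):
    """Return the tokens that would complete an n-gram already present in `generated`."""
    if len(generated) < n - 1:
        return set()
    prefix = tuple(generated[-(n - 1):])  # the last n-1 tokens
    banned = set()
    # Any earlier occurrence of this prefix bans the token that followed it.
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned

tokens = ["my", "name", "is", "leo", "and", "my", "name"]
print(banned_next_tokens(tokens))  # {'is'} -- 'my name is' already occurred
```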
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|gpt2_distilled| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[generation]| |Language:|en| ## Data Source OpenAI WebText - a corpus created by scraping web pages with emphasis on document quality. It consists of over 8 million documents for a total of 40 GB of text. All Wikipedia documents were removed from WebText. --- layout: model title: English BertForQuestionAnswering model (from mrm8488) author: John Snow Labs name: bert_qa_bert_mini_finetuned_squadv2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-mini-finetuned-squadv2` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_mini_finetuned_squadv2_en_4.0.0_3.0_1654183774865.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_mini_finetuned_squadv2_en_4.0.0_3.0_1654183774865.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_mini_finetuned_squadv2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_mini_finetuned_squadv2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.base_v2.by_mrm8488").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_mini_finetuned_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|42.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/bert-mini-finetuned-squadv2 - https://twitter.com/mrm8488 - https://github.com/google-research - https://arxiv.org/abs/1908.08962 - https://rajpurkar.github.io/SQuAD-explorer/ - https://github.com/google-research/bert/ - https://www.linkedin.com/in/manuel-romero-cs/ --- layout: model title: German asr_exp_w2v2t_wav2vec2_s982 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2t_wav2vec2_s982 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2t_wav2vec2_s982` is a German model originally trained by jonatasgrosman. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_exp_w2v2t_wav2vec2_s982_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2t_wav2vec2_s982_de_4.2.0_3.0_1664122237111.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2t_wav2vec2_s982_de_4.2.0_3.0_1664122237111.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2t_wav2vec2_s982", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2t_wav2vec2_s982", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
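Wav2Vec2ForCTC emits a frame-level sequence of character predictions that is turned into text by CTC decoding: collapse consecutive repeats, then drop the blank symbol. A minimal greedy-decoding sketch (the frame sequence and the `_` blank symbol here are illustrative assumptions, not real model output):

```python
from itertools import groupby

# Greedy CTC decoding: collapse runs of identical frame predictions,
# then remove the blank token. The frames below are made up for illustration.
def ctc_greedy_decode(frames, blank="_"):
    collapsed = [label for label, _ in groupby(frames)]
    return "".join(label for label in collapsed if label != blank)

frames = ["h", "h", "_", "a", "a", "_", "l", "l", "_", "l", "o", "o"]
print(ctc_greedy_decode(frames))  # -> hallo
```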
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2t_wav2vec2_s982| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: French CamemBert Embeddings (from Hasanmuradbuet) author: John Snow Labs name: camembert_embeddings_Hasanmuradbuet_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `Hasanmuradbuet`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Hasanmuradbuet_generic_model_fr_3.4.4_3.0_1653986214124.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_Hasanmuradbuet_generic_model_fr_3.4.4_3.0_1653986214124.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Hasanmuradbuet_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_Hasanmuradbuet_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
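The `embeddings` column produced above holds one dense vector per token; downstream code typically compares such vectors with cosine similarity. A self-contained sketch (the tiny 3-dimensional vectors are made up; real CamemBERT vectors have 768 dimensions):

```python
import math

# Cosine similarity between two embedding vectors: dot product divided by
# the product of the vector norms. Returns 1.0 for identical directions.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Two hypothetical token vectors, only for illustration.
v1 = [0.2, 0.1, 0.4]
v2 = [0.2, 0.1, 0.4]
print(round(cosine_similarity(v1, v2), 3))  # -> 1.0
```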
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_Hasanmuradbuet_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/Hasanmuradbuet/dummy-model --- layout: model title: Legal Non Soliciting Clause Binary Classifier author: John Snow Labs name: legclf_non_solic_clause date: 2023-02-13 tags: [en, legal, classification, non_soliciting, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `non_solic` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). 
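The paragraph-splitting technique mentioned above (splitting by multiline) can be sketched in plain Python before sending each piece through the classifier:

```python
import re

# Split a long legal document into paragraphs on blank lines (one or more
# empty or whitespace-only lines), dropping empty fragments. Each paragraph
# can then be classified separately. The sample text is illustrative only.
def split_paragraphs(text):
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "Clause 1. The employee shall not solicit clients.\n\nClause 2. Governing law."
print(split_paragraphs(doc))
```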
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `non_solic`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_non_solic_clause_en_1.0.0_3.0_1676305039177.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_non_solic_clause_en_1.0.0_3.0_1676305039177.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_non_solic_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-----------+ |result     | +-----------+ |[non_solic]| |[other]    | |[other]    | |[non_solic]| +-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_non_solic_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support non_solic 0.92 1.00 0.96 11 other 1.00 0.80 0.89 5 accuracy - - 0.94 16 macro-avg 0.96 0.90 0.92 16 weighted-avg 0.94 0.94 0.94 16 ``` --- layout: model title: Multilingual XLMRobertaForTokenClassification Cased model (from jplu) author: John Snow Labs name: xlmroberta_ner_tf_r_40_lang date: 2022-08-13 tags: [xx, open_source, xlm_roberta, ner] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tf-xlm-r-ner-40-lang` is a Multilingual model originally trained by `jplu`. ## Predicted Entities `ORG`, `LOC`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tf_r_40_lang_xx_4.1.0_3.0_1660423122270.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tf_r_40_lang_xx_4.1.0_3.0_1660423122270.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tf_r_40_lang","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tf_r_40_lang","xx") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
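The `NerConverter` stage merges token-level IOB tags into entity chunks; that merge logic can be sketched in plain Python (the tokens and tags below are toy inputs, not real model output):

```python
# Merge IOB2 tags into (chunk_text, label) pairs, in the spirit of
# what NerConverter does with the "ner" column.
def iob_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["John", "Lennon", "was", "born", "in", "Liverpool"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(iob_to_chunks(tokens, tags))
# -> [('John Lennon', 'PER'), ('Liverpool', 'LOC')]
```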
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_tf_r_40_lang| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|967.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/jplu/tf-xlm-r-ner-40-lang - https://arxiv.org/abs/1911.02116 - https://github.com/google-research/xtreme - https://aclweb.org/anthology/P17-1178 - https://github.com/google-research/xtreme#download-the-data --- layout: model title: NER Model for 10 African Languages author: John Snow Labs name: xlm_roberta_large_token_classifier_masakhaner date: 2021-12-06 tags: [amharic, hausa, igbo, kinyarwanda, luganda, swahilu, wolof, yoruba, token_classifier, xlm_roberta, ner, nigerian, pidgin, xx, open_source] task: Named Entity Recognition language: xx edition: Spark NLP 3.3.2 spark_version: 2.4 supported: true recommended: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description - This model is imported from `Hugging Face`. - It's been trained by fine-tuning `xlm_roberta_large` on 10 African languages (Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Nigerian Pidgin, Swahili, Wolof, and Yorùbá). 
## Predicted Entities `DATE`, `LOC`, `PER`, `ORG` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/Ner_masakhaner/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/Ner_masakhaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_masakhaner_xx_3.3.2_2.4_1638784947143.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_large_token_classifier_masakhaner_xx_3.3.2_2.4_1638784947143.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlm_roberta_large_token_classifier_masakhaner", "xx")\ .setInputCols(["sentence", "token"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = """አህመድ ቫንዳ ከ3-10-2000 ጀምሮ በአዲስ አበባ ኖሯል።""" result = model.transform(spark.createDataFrame([[text]]).toDF("text")) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlm_roberta_large_token_classifier_masakhaner", "xx") .setInputCols(Array("sentence","token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter)) val example = Seq("አህመድ ቫንዳ ከ3-10-2000 ጀምሮ በአዲስ አበባ ኖሯል።").toDS.toDF("text") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("xx.ner.masakhaner").predict("""አህመድ ቫንዳ ከ3-10-2000 ጀምሮ በአዲስ አበባ ኖሯል።""") ```
## Results ```bash +----------------+---------+ |chunk |ner_label| +----------------+---------+ |አህመድ ቫንዳ |PER | |ከ3-10-2000 ጀምሮ|DATE | |በአዲስ አበባ |LOC | +----------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_large_token_classifier_masakhaner| |Compatibility:|Spark NLP 3.3.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|xx| |Case sensitive:|true| |Max sentence length:|256| ## Data Source [https://huggingface.co/Davlan/xlm-roberta-large-masakhaner](https://huggingface.co/Davlan/xlm-roberta-large-masakhaner) ## Benchmarking ```bash language: F1-score: -------- -------- amh 75.76 hau 91.75 ibo 86.26 kin 76.38 lug 84.64 luo 80.65 pcm 89.55 swa 89.48 wol 70.70 yor 82.05 ``` --- layout: model title: Detect Clinical Entities (clinical_medium) author: John Snow Labs name: ner_jsl_emb_clinical_medium date: 2023-04-12 tags: [ner, licensed, clinical, en, clinical_medium] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terminology. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNN. This model is the official version of the jsl_ner_wip_clinical model. Definitions of Predicted Entities: - `Injury_or_Poisoning`: Physical harm or injury caused to the body, including those caused by accidents, falls, or poisoning of a patient or someone else. - `Direction`: All the information relating to the laterality of the internal and external organs. - `Test`: Mentions of laboratory, pathology, and radiological tests. 
- `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient. - `Death_Entity`: Mentions that indicate the death of a patient. - `Relationship_Status`: State of the patient's romantic or social relationships (e.g. single, married, divorced). - `Duration`: The duration of a medical treatment or medication use. - `Respiration`: Number of breaths per minute. - `Hyperlipidemia`: Terms that indicate hyperlipidemia with relevant subtypes and synonyms. - `Birth_Entity`: Mentions that indicate giving birth. - `Age`: All mentions of ages, past or present, related to the patient or anybody else. - `Labour_Delivery`: Extractions include stages of labor and delivery. - `Family_History_Header`: Identifies section headers that correspond to Family History of the patient. - `BMI`: Numeric values and other text information related to Body Mass Index. - `Temperature`: All mentions that refer to body temperature. - `Alcohol`: Terms that indicate alcohol use, abuse or drinking issues of a patient or someone else. - `Kidney_Disease`: Terms that refer to any kidney diseases (includes mentions of modifiers such as "Acute" or "Chronic"). - `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else. - `Medical_History_Header`: Identifies section headers that correspond to Past Medical History of a patient. - `Cerebrovascular_Disease`: All terms that refer to cerebrovascular diseases and events. - `Oxygen_Therapy`: Breathing support triggered by patient or entirely or partially by machine (e.g. ventilator, BPAP, CPAP). - `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements. - `Psychological_Condition`: All mental health diagnoses, disorders, conditions or syndromes of a patient or someone else. - `Heart_Disease`: All mentions of acquired, congenital or degenerative heart diseases. 
- `Employment`: All mentions of patient or provider occupational titles and employment status. - `Obesity`: Terms related to a patient being obese (overweight and BMI are extracted as different labels). - `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.). - `Pregnancy`: All terms related to Pregnancy (excluding terms that are extracted with their specific labels, such as "Labour_Delivery" etc.). - `ImagingFindings`: All mentions of radiographic and imagistic findings. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments. - `Medical_Device`: All mentions related to medical devices and supplies. - `Race_Ethnicity`: All terms that refer to racial and national origin of sociocultural groups. - `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital signs Headers are extracted separately with their specific labels). - `Symptom`: All the symptoms mentioned in the document, of a patient or someone else. - `Treatment`: Includes therapeutic and minimally invasive treatment and procedures (invasive treatments or procedures are extracted as "Procedure"). - `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs). - `Route`: Drug and medication administration routes available, as described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_Ingredient`: Active ingredient(s) found in drug products. - `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted. - `Diet`: All mentions and information regarding the patient's dietary habits. 
- `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by the naked eye. - `LDL`: All mentions related to the lab test and results for LDL (Low Density Lipoprotein). - `VS_Finding`: Qualitative data (e.g. Fever, Cyanosis, Tachycardia) and any other symptoms that refer to vital signs. - `Allergen`: Allergen related extractions mentioned in the document. - `EKG_Findings`: All mentions of EKG readings. - `Imaging_Technique`: All mentions of special radiographic views or special imaging techniques used in radiology. - `Triglycerides`: All terms related to the specific lab test for Triglycerides. - `RelativeTime`: Time references that are relative to different times or events (e.g. words such as "approximately", "in the morning"). - `Gender`: Gender-specific nouns and pronouns. - `Pulse`: Peripheral heart rate, without advanced information like measurement location. - `Social_History_Header`: Identifies section headers that correspond to Social History of a patient. - `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs). - `Diabetes`: All terms related to diabetes mellitus. - `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately. - `Internal_organ_or_component`: All mentions related to internal body parts or organs that cannot be examined by the naked eye. - `Clinical_Dept`: Terms that indicate the medical and/or surgical departments. - `Form`: Drug and medication forms available, as described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). 
- `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients. - `Strength`: Potency of one unit of drug (or a combination of drugs); the available measurement units are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Fetus_NewBorn`: All terms related to fetus, infant, newborn (excluding terms that are extracted with their specific labels, such as "Labour_Delivery", "Pregnancy" etc.). - `RelativeDate`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "approximately two years ago", "about two days ago"). - `Height`: All mentions related to a patient's height. - `Test_Result`: Terms related to all the test results present in the document (clinical tests results are included). - `Sexually_Active_or_Sexual_Orientation`: All terms that are related to sexuality, sexual orientations and sexual activity. - `Frequency`: Frequency of administration for a dose prescribed. - `Time`: Specific time references (hour and/or minutes). - `Weight`: All mentions related to a patient's weight. - `Vaccine`: Generic and brand name of vaccines or vaccination procedure. - `Vital_Signs_Header`: Identifies section headers that correspond to Vital Signs of a patient. - `Communicable_Disease`: Includes all mentions of communicable diseases. - `Dosage`: Quantity prescribed by the physician for an active ingredient; the available measurement units are described by [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Overweight`: Terms related to the patient being overweight (BMI and Obesity are extracted separately). 
- `Hypertension`: All terms related to Hypertension (quantitative data such as 150/100 is extracted as Blood_Pressure). - `HDL`: Terms related to the lab test for HDL (High Density Lipoprotein). - `Total_Cholesterol`: Terms related to the lab test and results for cholesterol. - `Smoking`: All mentions of smoking status of a patient. - `Date`: Mentions of an exact date, in any format, including day number, month and/or year. ## Predicted Entities `Injury_or_Poisoning`, `Direction`, `Test`, `Admission_Discharge`, `Death_Entity`, `Relationship_Status`, `Duration`, `Respiration`, `Hyperlipidemia`, `Birth_Entity`, `Age`, `Labour_Delivery`, `Family_History_Header`, `BMI`, `Temperature`, `Alcohol`, `Kidney_Disease`, `Oncological`, `Medical_History_Header`, `Cerebrovascular_Disease`, `Oxygen_Therapy`, `O2_Saturation`, `Psychological_Condition`, `Heart_Disease`, `Employment`, `Obesity`, `Disease_Syndrome_Disorder`, `Pregnancy`, `ImagingFindings`, `Procedure`, `Medical_Device`, `Race_Ethnicity`, `Section_Header`, `Symptom`, `Treatment`, `Substance`, `Route`, `Drug_Ingredient`, `Blood_Pressure`, `Diet`, `External_body_part_or_region`, `LDL`, `VS_Finding`, `Allergen`, `EKG_Findings`, `Imaging_Technique`, `Triglycerides`, `RelativeTime`, `Gender`, `Pulse`, `Social_History_Header`, `Substance_Quantity`, `Diabetes`, `Modifier`, `Internal_organ_or_component`, `Clinical_Dept`, `Form`, `Drug_BrandName`, `Strength`, `Fetus_NewBorn`, `RelativeDate`, `Height`, `Test_Result`, `Sexually_Active_or_Sexual_Orientation`, `Frequency`, `Time`, `Weight`, `Vaccine`, `Vaccine_Name`, `Vital_Signs_Header`, `Communicable_Disease`, `Dosage`, `Overweight`, `Hypertension`, `HDL`, `Total_Cholesterol`, `Smoking`, `Date` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in 
Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_emb_clinical_medium_en_4.3.2_3.0_1681306334405.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_emb_clinical_medium_en_4.3.2_3.0_1681306334405.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_jsl_emb_clinical_medium", "en", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") ner_pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). Additionally, there is no side effect observed after Influenza vaccine. One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature. 
"""]]).toDF("text") result = ner_pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val jsl_ner = MedicalNerModel.pretrained("ner_jsl_emb_clinical_medium", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val jsl_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val jsl_ner_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter)) val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). Additionally, there is no side effect observed after Influenza vaccine. One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text") val result = jsl_ner_pipeline.fit(data).transform(data) ```
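The `begin` and `end` columns in the NER output are inclusive character offsets into the original text; a small sketch of how such offsets line up with extracted chunks (a toy helper, not part of the library):

```python
# Locate the character offsets (begin, end inclusive) of each chunk in the
# original text, mirroring the begin/end columns of the NER result table.
# Chunks are assumed to appear in document order.
def chunk_offsets(text, chunks):
    offsets, cursor = [], 0
    for chunk in chunks:
        begin = text.index(chunk, cursor)
        end = begin + len(chunk) - 1
        offsets.append((chunk, begin, end))
        cursor = end + 1
    return offsets

text = "The patient is a 21-day-old Caucasian male"
print(chunk_offsets(text, ["21-day-old", "Caucasian", "male"]))
# -> [('21-day-old', 17, 26), ('Caucasian', 28, 36), ('male', 38, 41)]
```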
## Results ```bash +-----------------------------------------+-----+---+----------------------------+ |chunk |begin|end|ner_label | +-----------------------------------------+-----+---+----------------------------+ |21-day-old |17 |26 |Age | |Caucasian |28 |36 |Race_Ethnicity | |male |38 |41 |Gender | |2 days |52 |57 |Duration | |congestion |62 |71 |Symptom | |mom |75 |77 |Gender | |suctioning yellow discharge |88 |114|Symptom | |nares |135 |139|External_body_part_or_region| |she |147 |149|Gender | |mild |168 |171|Modifier | |problems with his breathing while feeding|173 |213|Symptom | |perioral cyanosis |237 |253|Symptom | |retractions |258 |268|Symptom | |Influenza vaccine |325 |341|Vaccine_Name | |One day ago |344 |354|RelativeDate | |mom |357 |359|Gender | |tactile temperature |376 |394|Symptom | |Tylenol |417 |423|Drug_BrandName | |Baby |426 |429|Age | |decreased p.o |449 |461|Symptom | +-----------------------------------------+-----+---+----------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_emb_clinical_medium| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|15.2 MB| ## Benchmarking ```bash label precision recall f1-score support Drug_Ingredient 0.91 0.94 0.93 1905 Disease_Syndrome_Disorder 0.85 0.89 0.87 4949 Drug_BrandName 0.94 0.92 0.93 963 Strength 0.95 0.93 0.94 759 Route 0.90 0.94 0.92 943 Internal_organ_or_component 0.89 0.90 0.89 10310 Dosage 0.95 0.78 0.86 478 Frequency 0.90 0.87 0.88 1016 Treatment 0.90 0.70 0.78 332 Procedure 0.85 0.91 0.88 6433 Gender 0.98 0.99 0.99 5586 RelativeTime 0.79 0.70 0.74 306 Direction 0.91 0.90 0.91 4344 Modifier 0.84 0.82 0.83 2863 Symptom 0.84 0.83 0.84 11599 Date 0.94 0.98 0.96 546 External_body_part_or_region 0.89 0.85 0.87 3270 Section_Header 0.98 0.97 0.98 9320 Age 0.85 0.92 0.88 757 Substance 0.91 0.85 0.88 113 VS_Finding 0.86 
0.60 0.70 304 Medical_Device 0.87 0.93 0.90 5475 Oxygen_Therapy 0.89 0.85 0.87 117 Test 0.87 0.87 0.87 4491 Diabetes 0.95 0.97 0.96 149 Duration 0.86 0.86 0.86 1009 ImagingFindings 0.83 0.50 0.63 353 Hyperlipidemia 0.80 0.87 0.84 47 Hypertension 0.97 0.94 0.96 152 RelativeDate 0.90 0.89 0.89 1338 Clinical_Dept 0.92 0.94 0.93 1771 Kidney_Disease 0.90 0.96 0.93 228 Heart_Disease 0.93 0.85 0.89 967 Diet 0.67 0.62 0.65 106 Weight 0.93 0.93 0.93 254 Test_Result 0.79 0.81 0.80 1470 Form 0.85 0.85 0.85 254 Time 0.80 0.72 0.76 76 Psychological_Condition 0.81 0.76 0.79 187 Injury_or_Poisoning 0.84 0.80 0.82 889 Admission_Discharge 0.91 0.97 0.94 301 Labour_Delivery 0.74 0.73 0.73 110 Employment 0.90 0.73 0.81 389 Vaccine 0.83 0.33 0.48 15 Obesity 0.92 0.91 0.92 54 Oncological 0.92 0.93 0.93 784 Smoking 0.96 0.94 0.95 106 Imaging_Technique 0.70 0.53 0.60 98 Blood_Pressure 0.86 0.87 0.86 314 Pulse 0.87 0.92 0.89 278 Respiration 0.96 0.94 0.95 180 O2_Saturation 0.86 0.77 0.81 96 Medical_History_Header 0.93 0.99 0.96 396 Total_Cholesterol 0.88 0.37 0.52 19 Cerebrovascular_Disease 0.75 0.78 0.76 108 Pregnancy 0.90 0.71 0.79 201 Death_Entity 0.85 0.76 0.80 46 EKG_Findings 0.78 0.43 0.56 186 Race_Ethnicity 0.97 0.99 0.98 118 Family_History_Header 0.97 0.99 0.98 273 Alcohol 0.84 0.95 0.89 84 Fetus_NewBorn 0.75 0.54 0.63 235 Vital_Signs_Header 0.96 0.94 0.95 710 Relationship_Status 0.93 0.95 0.94 41 Height 0.98 0.85 0.91 68 Temperature 0.91 0.96 0.94 141 Triglycerides 0.40 0.40 0.40 10 LDL 0.94 0.68 0.79 22 Social_History_Header 0.98 0.97 0.98 259 Communicable_Disease 0.79 0.62 0.70 50 Overweight 0.86 0.86 0.86 7 Allergen 0.00 0.00 0.00 23 Substance_Quantity 0.33 1.00 0.50 2 Vaccine_Name 0.68 1.00 0.81 15 Birth_Entity 0.00 0.00 0.00 3 BMI 1.00 0.67 0.80 15 Sexually_Active_or_Sexual_Orientation 1.00 0.80 0.89 5 HDL 1.00 1.00 1.00 2 micro-avg 0.89 0.89 0.89 92193 macro-avg 0.85 0.81 0.82 92193 weighted-avg 0.89 0.89 0.89 92193 ``` --- layout: model title: Detect Radiology Related 
Entities author: John Snow Labs name: ner_radiology date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for radiology related texts and reports. ## Predicted Entities `ImagingTest`, `Imaging_Technique`, `ImagingFindings`, `OtherFindings`, `BodyPart`, `Direction`, `Test`, `Symptom`, `Disease_Syndrome_Disorder`, `Medical_Device`, `Procedure`, `Measurements`, `Units` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_RADIOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_radiology_en_3.0.0_3.0_1617208452337.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_radiology_en_3.0.0_3.0_1617208452337.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") radiology_ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, radiology_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. 
This may represent benign fibrous tissue or a lipoma."]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val radiology_ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val nlpPipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, radiology_ner, ner_converter)) val data = Seq("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""").toDS.toDF("text") val result = nlpPipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.radiology").predict("""Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.""") ```
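A common post-processing step for radiology output is stitching each `Measurements` chunk to the `Units` chunk that immediately follows it (e.g. `0.5 x 0.5 x 0.4` + `cm`). A minimal pure-Python sketch over illustrative `(chunk, entity)` pairs as one might collect them from the `ner_chunk` column:

```python
# Illustrative (chunk, entity) pairs, as collected from "ner_chunk" after
# transform(); values taken from the sample sentence above.
pairs = [
    ("Bilateral", "Direction"), ("breast", "BodyPart"),
    ("ultrasound", "ImagingTest"), ("ovoid mass", "ImagingFindings"),
    ("0.5 x 0.5 x 0.4", "Measurements"), ("cm", "Units"),
]

# Attach each Units chunk to the Measurements chunk right before it.
measurements = [
    f"{pairs[i][0]} {pairs[i + 1][0]}"
    for i in range(len(pairs) - 1)
    if pairs[i][1] == "Measurements" and pairs[i + 1][1] == "Units"
]
print(measurements)  # ['0.5 x 0.5 x 0.4 cm']
```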
## Results ```bash | | chunks | entities | |----|-----------------------|---------------------------| | 0 | Bilateral | Direction | | 1 | breast | BodyPart | | 2 | ultrasound | ImagingTest | | 3 | ovoid mass | ImagingFindings | | 4 | 0.5 x 0.5 x 0.4 | Measurements | | 5 | cm | Units | | 6 | anteromedial aspect | Direction | | 7 | left | Direction | | 8 | shoulder | BodyPart | | 9 | mass | ImagingFindings | | 10 | isoechoic echotexture | ImagingFindings | | 11 | muscle | BodyPart | | 12 | internal color flow | ImagingFindings | | 13 | benign fibrous tissue | ImagingFindings | | 14 | lipoma | Disease_Syndrome_Disorder | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_radiology| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on a custom dataset comprising MIMIC-CXR and MT Radiology texts. ## Benchmarking ```bash +--------------------+------+-----+-----+------+---------+------+------+ | entity| tp| fp| fn| total|precision|recall| f1| +--------------------+------+-----+-----+------+---------+------+------+ | OtherFindings| 8.0| 15.0| 63.0| 71.0| 0.3478|0.1127|0.1702| | Measurements| 481.0| 30.0| 15.0| 496.0| 0.9413|0.9698|0.9553| | Direction| 650.0|137.0| 94.0| 744.0| 0.8259|0.8737|0.8491| | ImagingFindings|1345.0|355.0|324.0|1669.0| 0.7912|0.8059|0.7985| | BodyPart|1942.0|335.0|290.0|2232.0| 0.8529|0.8701|0.8614| | Medical_Device| 236.0| 75.0| 64.0| 300.0| 0.7588|0.7867|0.7725| | Test| 222.0| 41.0| 48.0| 270.0| 0.8441|0.8222| 0.833| | Procedure| 269.0|117.0|116.0| 385.0| 0.6969|0.6987|0.6978| | ImagingTest| 263.0| 50.0| 43.0| 306.0| 0.8403|0.8595|0.8498| | Symptom| 498.0|101.0|132.0| 630.0| 0.8314|0.7905|0.8104| |Disease_Syndrome_...|1180.0|258.0|200.0|1380.0| 0.8206|0.8551|0.8375| | Units| 269.0| 10.0| 2.0| 271.0| 0.9642|0.9926|0.9782| | Imaging_Technique| 140.0| 38.0| 25.0| 165.0| 
0.7865|0.8485|0.8163| +--------------------+------+-----+-----+------+---------+------+------+ +------------------+ | macro| +------------------+ |0.7524248724038437| +------------------+ +------------------+ | micro| +------------------+ |0.8315240382681794| +------------------+ ``` --- layout: model title: Swedish BertForTokenClassification Base Cased model (from KBLab) author: John Snow Labs name: bert_token_classifier_base_swedish_lowermix_reallysimple_ner date: 2022-11-30 tags: [sv, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: sv edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-swedish-lowermix-reallysimple-ner` is a Swedish model originally trained by `KBLab`. ## Predicted Entities `TME`, `OBJ`, `MSR`, `LOC`, `ORG`, `PRS`, `WRK`, `EVN` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_base_swedish_lowermix_reallysimple_ner_sv_4.2.4_3.0_1669815085727.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_base_swedish_lowermix_reallysimple_ner_sv_4.2.4_3.0_1669815085727.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_base_swedish_lowermix_reallysimple_ner","sv") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_base_swedish_lowermix_reallysimple_ner","sv") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_base_swedish_lowermix_reallysimple_ner| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|sv| |Size:|465.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/KBLab/bert-base-swedish-lowermix-reallysimple-ner - https://kb-labb.github.io/posts/2022-02-07-sucx3_ner --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_nl20 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-nl20` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl20_en_4.3.0_3.0_1675121765443.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl20_en_4.3.0_3.0_1675121765443.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_nl20","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_nl20","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_nl20| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|345.3 MB| ## References - https://huggingface.co/google/t5-efficient-small-nl20 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Explain Document Pipeline for Spanish author: John Snow Labs name: explain_document_sm date: 2021-03-22 tags: [open_source, spanish, explain_document_sm, pipeline, es] supported: true task: [Named Entity Recognition, Lemmatization, Part of Speech Tagging] language: es edition: Spark NLP 3.0.0 spark_version: 3.0 article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_sm is a pretrained pipeline that performs the basic text processing steps, covering most of the common text processing tasks on your dataframe. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/jupyter/annotation/english/explain-document-dl/Explain%20Document%20DL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_es_3.0.0_3.0_1616422359763.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_sm_es_3.0.0_3.0_1616422359763.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('explain_document_sm', lang = 'es') annotations = pipeline.fullAnnotate("Hola de John Snow Labs! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("explain_document_sm", lang = "es") val result = pipeline.fullAnnotate("Hola de John Snow Labs! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hola de John Snow Labs! "] result_df = nlu.load('es.explain').predict(text) result_df ```
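`fullAnnotate` returns the pipeline outputs as parallel lists keyed by column, so tokens can be paired with their tags using a plain `zip`. A minimal sketch with illustrative values for the sample sentence above (the lists mirror the Results section, not a live pipeline run):

```python
# Illustrative token and POS lists, as returned under annotations["token"]
# and annotations["pos"] for the sample sentence above.
tokens = ["Hola", "de", "John", "Snow", "Labs!"]
pos = ["PART", "ADP", "PROPN", "PROPN", "PROPN"]

# Pair each token with its part-of-speech tag.
tagged = list(zip(tokens, pos))
print(tagged[2])  # ('John', 'PROPN')
```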
## Results ```bash | | document | sentence | token | lemma | pos | embeddings | ner | entities | |---:|:-----------------------------|:----------------------------|:----------------------------------------|:----------------------------------------|:-------------------------------------------|:-----------------------------|:---------------------------------------|:-----------------------| | 0 | ['Hola de John Snow Labs! '] | ['Hola de John Snow Labs!'] | ['Hola', 'de', 'John', 'Snow', 'Labs!'] | ['Hola', 'de', 'John', 'Snow', 'Labs!'] | ['PART', 'ADP', 'PROPN', 'PROPN', 'PROPN'] | [[0.1754499971866607,.,...]] | ['O', 'O', 'B-PER', 'I-PER', 'B-MISC'] | ['John Snow', 'Labs!'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_sm| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|es| --- layout: model title: Pipeline to Detect Anatomical Structures in Medical Text author: John Snow Labs name: bert_token_classifier_ner_anatem_pipeline date: 2023-03-20 tags: [en, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_anatem](https://nlp.johnsnowlabs.com/2022/07/25/bert_token_classifier_ner_anatem_en_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_anatem_pipeline_en_4.3.0_3.2_1679303142191.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_anatem_pipeline_en_4.3.0_3.2_1679303142191.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_anatem_pipeline", "en", "clinical/models") text = '''Malignant cells often display defects in autophagy, an evolutionarily conserved pathway for degrading long-lived proteins and cytoplasmic organelles. However, as yet, there is no genetic evidence for a role of autophagy genes in tumor suppression. The beclin 1 autophagy gene is monoallelically deleted in 40 - 75 % of cases of human sporadic breast, ovarian, and prostate cancer.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_anatem_pipeline", "en", "clinical/models") val text = "Malignant cells often display defects in autophagy, an evolutionarily conserved pathway for degrading long-lived proteins and cytoplasmic organelles. However, as yet, there is no genetic evidence for a role of autophagy genes in tumor suppression. The beclin 1 autophagy gene is monoallelically deleted in 40 - 75 % of cases of human sporadic breast, ovarian, and prostate cancer." val result = pipeline.fullAnnotate(text) ```
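Each detected chunk carries a confidence score, so rows tabulated from `fullAnnotate` can be filtered with plain Python. A minimal sketch over illustrative rows (the chunk names and scores below are examples taken from the sample text, not guaranteed model output):

```python
# Illustrative rows as they might be tabulated from fullAnnotate(text),
# one dict per detected chunk.
rows = [
    {"ner_chunk": "Malignant cells", "ner_label": "Anatomy", "confidence": 0.999951},
    {"ner_chunk": "tumor", "ner_label": "Anatomy", "confidence": 0.999871},
    {"ner_chunk": "prostate cancer", "ner_label": "Anatomy", "confidence": 0.999968},
]

# Keep only chunks above a chosen confidence threshold.
high_conf = [r["ner_chunk"] for r in rows if r["confidence"] >= 0.9999]
print(high_conf)  # ['Malignant cells', 'prostate cancer']
```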
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-----------------------|--------:|------:|:------------|-------------:| | 0 | Malignant cells | 0 | 14 | Anatomy | 0.999951 | | 1 | cytoplasmic organelles | 126 | 147 | Anatomy | 0.999937 | | 2 | tumor | 229 | 233 | Anatomy | 0.999871 | | 3 | breast | 343 | 348 | Anatomy | 0.999842 | | 4 | ovarian | 351 | 357 | Anatomy | 0.99998 | | 5 | prostate cancer | 364 | 378 | Anatomy | 0.999968 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_anatem_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: English asr_wav2vec2_base_timit_demo_colab2_by_hassnain TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab2_by_hassnain date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab2_by_hassnain` is an English model originally trained by hassnain. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab2_by_hassnain_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab2_by_hassnain_en_4.2.0_3.0_1664039971215.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab2_by_hassnain_en_4.2.0_3.0_1664039971215.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab2_by_hassnain", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab2_by_hassnain", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
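The snippets above assume an existing `audioDf` with an `audio_content` column holding the raw waveform as an array of floats (Wav2Vec2 models typically expect 16 kHz mono). A minimal sketch of producing such a float array with only the standard library; the final `spark.createDataFrame` line is commented out because it assumes an active `SparkSession` named `spark`, and in practice the floats would come from decoding a real audio file:

```python
import math

SAMPLE_RATE = 16_000  # Hz; the usual sampling rate for Wav2Vec2 models

# Synthesize one second of a 440 Hz tone as floats in [-1.0, 1.0];
# in practice these values would come from decoding a WAV/FLAC file.
audio_data = [
    math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)
]

# audioDf = spark.createDataFrame([(audio_data,)], ["audio_content"])  # assumes a SparkSession `spark`
print(len(audio_data))  # 16000
```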
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab2_by_hassnain| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: Icelandic Lemmatizer author: John Snow Labs name: lemma date: 2021-04-02 tags: [is, open_source, lemmatizer] task: Lemmatization language: is edition: Spark NLP 2.7.5 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a dictionary-based lemmatizer that assigns all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/TEXT_PREPROCESSING/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TEXT_PREPROCESSING.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_is_2.7.5_2.4_1617376506935.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_is_2.7.5_2.4_1617376506935.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma", "is") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) example = spark.createDataFrame([['En þar er þeir vinnast eigi til þá hafa þeir við aðra stafi svo marga og þesskonar sem þarf en hina taka þeir úr er eigi eru réttræðir í máli þeirra .']], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma", "is") .setInputCols("token") .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("En þar er þeir vinnast eigi til þá hafa þeir við aðra stafi svo marga og þesskonar sem þarf en hina taka þeir úr er eigi eru réttræðir í máli þeirra .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["En þar er þeir vinnast eigi til þá hafa þeir við aðra stafi svo marga og þesskonar sem þarf en hina taka þeir úr er eigi eru réttræðir í máli þeirra ."] lemma_df = nlu.load('is.lemma').predict(text, output_level = "document") lemma_df.lemma.values[0] ```
## Results ```bash +-----------+ | lemma| +-----------+ | hinn| | þar| | vera| | þeir| | vinnast| | ekki| | til| | þá| | hafa| | þeir| | ég| | aðra| | stafur| | svo| | margur| |og-og-og-og| | þesskonar| | sem| | þurfa| | hinn| +-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Compatibility:|Spark NLP 2.7.5+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|is| ## Data Source The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) version 2.7. ## Benchmarking ```bash Precision=0.63, Recall=0.59, F1-score=0.61 ``` --- layout: model title: English T5ForConditionalGeneration Base Cased model (from deep-learning-analytics) author: John Snow Labs name: t5_triviaqa_base date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `triviaqa-t5-base` is an English model originally trained by `deep-learning-analytics`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_triviaqa_base_en_4.3.0_3.0_1675157650979.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_triviaqa_base_en_4.3.0_3.0_1675157650979.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_triviaqa_base","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_triviaqa_base","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_triviaqa_base| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|927.9 MB| ## References - https://huggingface.co/deep-learning-analytics/triviaqa-t5-base - https://medium.com/@priya.dwivedi/build-a-trivia-bot-using-t5-transformer-345ff83205b6 - https://www.triviaquestionss.com/easy-trivia-questions/ - https://laffgaff.com/easy-trivia-questions-and-answers/ --- layout: model title: English DistilBertForQuestionAnswering Cased model (from jadegao) author: John Snow Labs name: distilbert_qa_askinvesto_model date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `askinvesto-distilbert-model` is an English model originally trained by `jadegao`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_askinvesto_model_en_4.3.0_3.0_1672765610636.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_askinvesto_model_en_4.3.0_3.0_1672765610636.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_askinvesto_model","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_askinvesto_model","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_askinvesto_model| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/jadegao/askinvesto-distilbert-model --- layout: model title: ICD10 to UMLS Code Mapping author: John Snow Labs name: icd10cm_umls_mapping date: 2021-07-01 tags: [icd10cm, umls, en, licensed, pipeline] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps ICD10CM codes to UMLS codes without using any text data. You’ll just feed whitespace-delimited ICD10CM codes and it will return the corresponding UMLS codes as a list. If there is no mapping, the original code is returned. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_CODE_MAPPING/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10cm_umls_mapping_en_3.1.0_2.4_1625126281405.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10cm_umls_mapping_en_3.1.0_2.4_1625126281405.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("icd10cm_umls_mapping", "en", "clinical/models") pipeline.annotate(["M8950", "R822", "R0901"]) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("icd10cm_umls_mapping", "en", "clinical/models") val result = pipeline.annotate(Array("M8950", "R822", "R0901")) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd10cm.umls").predict("""M8950 R822 R0901""") ```
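The `annotate` call returns parallel lists of input and mapped codes. A small pure-Python helper (illustrative only, not part of the pipeline API) can pair them into a code-to-code lookup; the sample values here mirror the Results section of this card.

```python
# Sketch: pairing the pipeline's parallel output lists into an
# ICD-10-CM -> UMLS lookup dict. to_code_mapping is a hypothetical
# helper; the sample values mirror the Results section below.
def to_code_mapping(result):
    """Zip the 'icd10cm' and 'umls' lists of an annotate() result."""
    return dict(zip(result["icd10cm"], result["umls"]))

sample = {"icd10cm": ["M89.50", "R82.2", "R09.01"],
          "umls": ["C4721411", "C0159076", "C0004044"]}
mapping = to_code_mapping(sample)  # {'M89.50': 'C4721411', ...}
```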
## Results ```bash {'icd10cm': ['M89.50', 'R82.2', 'R09.01'], 'umls': ['C4721411', 'C0159076', 'C0004044']} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd10cm_umls_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - TokenizerModel - LemmatizerModel - Finisher --- layout: model title: English asr_wav2vec2_large_xls_r_300m_georgian_v0.6 TFWav2Vec2ForCTC from pavle-tsotskolauri author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_300m_georgian_v0.6 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_georgian_v0.6` is an English model originally trained by pavle-tsotskolauri. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_georgian_v0.6_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_georgian_v0.6_en_4.2.0_3.0_1664119076573.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_georgian_v0.6_en_4.2.0_3.0_1664119076573.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_georgian_v0.6', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_georgian_v0.6", lang = "en") val annotations = pipeline.transform(audioDF) ```
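The snippet above assumes an `audioDF` DataFrame already exists. A minimal sketch of building one, assuming a 16-bit PCM mono WAV at 16 kHz (the sample rate Wav2Vec2 models expect), a `spark` session in scope, and an illustrative file name:

```python
# Sketch: decoding a 16-bit PCM WAV into the float-array column that
# Spark NLP's AudioAssembler consumes. The file name is illustrative;
# a Spark session ("spark") is assumed in scope for the last line.
import struct
import wave

def wav_to_floats(path):
    """Return the samples of a 16-bit PCM WAV as floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# audioDF = spark.createDataFrame([[wav_to_floats("sample.wav")]], ["audio_content"])
```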
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_georgian_v0.6| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|2.3 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Mapping MESH Codes with Their Corresponding UMLS Codes author: John Snow Labs name: mesh_umls_mapper date: 2022-06-26 tags: [mesh, umls, chunk_mapper, clinical, licensed, en] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps MESH codes to their corresponding codes in the Unified Medical Language System (UMLS). ## Predicted Entities `umls_code` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/mesh_umls_mapper_en_3.5.3_3.0_1656281333787.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/mesh_umls_mapper_en_3.5.3_3.0_1656281333787.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings")\ .setCaseSensitive(False) mesh_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_mesh", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("mesh_code")\ .setDistanceFunction("EUCLIDEAN") chunkerMapper = ChunkMapperModel\ .pretrained("mesh_umls_mapper", "en", "clinical/models")\ .setInputCols(["mesh_code"])\ .setOutputCol("umls_mappings")\ .setRels(["umls_code"]) pipeline = Pipeline(stages = [ documentAssembler, sbert_embedder, mesh_resolver, chunkerMapper ]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) light_pipeline= LightPipeline(model) result = light_pipeline.fullAnnotate("N-acetyl-L-arginine") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols("ner_chunk") .setOutputCol("sbert_embeddings") val mesh_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_mesh", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("mesh_code") .setDistanceFunction("EUCLIDEAN") val chunkerMapper = ChunkMapperModel .pretrained("mesh_umls_mapper", "en", "clinical/models") .setInputCols("mesh_code") .setOutputCol("umls_mappings") .setRels(Array("umls_code")) val pipeline = new Pipeline(stages = Array( documentAssembler, sbert_embedder, mesh_resolver, chunkerMapper )) val data = Seq("N-acetyl-L-arginine").toDS.toDF("text") val result= pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.mesh_to_umls").predict("""Put your 
text here.""") ```
## Results ```bash | | ner_chunk | mesh_code | umls_mappings | |---:|:--------------------|:------------|:----------------| | 0 | N-acetyl-L-arginine | C000015 | C0067655 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|mesh_umls_mapper| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[mesh_code]| |Output Labels:|[mappings]| |Language:|en| |Size:|3.8 MB| --- layout: model title: English asr_wav2vec2_large_960h TFWav2Vec2ForCTC from facebook author: John Snow Labs name: pipeline_asr_wav2vec2_large_960h date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_960h` is an English model originally trained by facebook. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_960h_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_960h_en_4.2.0_3.0_1664016613522.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_960h_en_4.2.0_3.0_1664016613522.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_960h', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_960h", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_960h| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|755.4 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English RobertaForQuestionAnswering (from anas-awadalla) author: John Snow Labs name: roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_10 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-128-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_10_en_4.0.0_3.0_1655731124938.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_10_en_4.0.0_3.0_1655731124938.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_10","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_128d_seed_10").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_base_few_shot_k_128_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|422.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-128-finetuned-squad-seed-10 --- layout: model title: English T5ForConditionalGeneration Base Cased model (from google) author: John Snow Labs name: t5_efficient_base_nl8 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-base-nl8` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nl8_en_4.3.0_3.0_1675114011208.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_nl8_en_4.3.0_3.0_1675114011208.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_base_nl8","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_base_nl8","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_base_nl8| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|348.0 MB| ## References - https://huggingface.co/google/t5-efficient-base-nl8 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Fast Neural Machine Translation Model from Afro-Asiatic languages to Afro-Asiatic languages author: John Snow Labs name: opus_mt_afa_afa date: 2021-06-01 tags: [open_source, seq2seq, translation, afa, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. source languages: afa target languages: afa {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_afa_afa_xx_3.1.0_2.4_1622553414579.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_afa_afa_xx_3.1.0_2.4_1622553414579.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_afa_afa", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_afa_afa", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Afro-Asiatic languages.translate_to.Afro-Asiatic languages').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_afa_afa| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_nadaAlnada TFWav2Vec2ForCTC from nadaAlnada author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab_by_nadaAlnada date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_nadaAlnada` is an English model originally trained by nadaAlnada. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_by_nadaAlnada_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_nadaAlnada_en_4.2.0_3.0_1664114063867.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_nadaAlnada_en_4.2.0_3.0_1664114063867.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_by_nadaAlnada", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_by_nadaAlnada", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab_by_nadaAlnada| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Sentence Entity Resolver for RxNorm (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_rxnorm date: 2021-10-10 tags: [rxnorm, entity_resolution, licensed, clinical, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.2.3 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. ## Predicted Entities `RxNorm Codes` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_RXNORM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_en_3.2.3_2.4_1633875017884.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_en_3.2.3_2.4_1633875017884.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("jsl_ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "jsl_ner"]) \ .setOutputCol("ner_chunk") chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") rxnorm_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_rxnorm","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, rxnorm_resolver]) data = spark.createDataFrame([["""She is given Fragmin 5000 units subcutaneously daily , Xenaderm to wounds topically b.i.d., lantus 40 units subcutaneously at bedtime , OxyContin 30 mg p.o.q. , folic acid 1 mg daily , levothyroxine 0.1 mg p.o. daily , Prevacid 30 mg daily , Avandia 4 mg daily , norvasc 10 mg daily , lexapro 20 mg daily , aspirin 81 mg daily , Neurontin 400 mg ."""]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala ... 
val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("jsl_ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "jsl_ner")) .setOutputCol("ner_chunk") val chunk2doc = new Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val rxnorm_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_rxnorm","en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, rxnorm_resolver)) val data = Seq("She is given Fragmin 5000 units subcutaneously daily , Xenaderm to wounds topically b.i.d., lantus 40 units subcutaneously at bedtime , OxyContin 30 mg p.o.q. , folic acid 1 mg daily , levothyroxine 0.1 mg p.o. 
daily , Prevacid 30 mg daily , Avandia 4 mg daily , norvasc 10 mg daily , lexapro 20 mg daily , aspirin 81 mg daily , Neurontin 400 mg .").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.rxnorm").predict("""She is given Fragmin 5000 units subcutaneously daily , Xenaderm to wounds topically b.i.d., lantus 40 units subcutaneously at bedtime , OxyContin 30 mg p.o.q. , folic acid 1 mg daily , levothyroxine 0.1 mg p.o. daily , Prevacid 30 mg daily , Avandia 4 mg daily , norvasc 10 mg daily , lexapro 20 mg daily , aspirin 81 mg daily , Neurontin 400 mg .""") ```
## Results ```bash +-------------+------+----------+----------------------------------------------------------------------+----------------------------------------------------------------------+ | ner_chunk|entity|rxnorm_code| all_codes| resolutions| +-------------+------+----------+----------------------------------------------------------------------+----------------------------------------------------------------------+ | Fragmin| DRUG| 281554|281554:::1992532:::151794:::217814:::361779:::1098701:::203870:::15...|Fragmin:::Kedrab:::Frisium:::Isopto Frin:::Frumax:::Folgard:::Faslo...| | Xenaderm| DRUG| 581754|581754:::898496:::202363:::1307304:::94611:::1046399:::1093360:::11...|Xenaderm:::Xiaflex:::Xanax:::Xtandi:::Xerac AC:::Xgeva:::Xoten:::Xa...| | lantus| DRUG| 261551|261551:::151959:::28381:::202990:::196502:::608814:::1040032:::7049...|Lantus:::Laratrim:::lachesine:::Larodopa:::Lamictal:::Lansinoh:::La...| | OxyContin| DRUG| 218986|218986:::32680:::32698:::352843:::218859:::1086614:::1120014:::2189...|Oxycontin:::oxychlorosene:::oxyphencyclimine:::Ocutricin HC:::Ocutr...| | folic acid| DRUG| 4511|4511:::1162058:::1162059:::62356:::1376005:::542060:::619039:::1162...|folic acid:::folic acid Oral Product:::folic acid Pill:::folate:::F...| |levothyroxine| DRUG| 10582|10582:::1868004:::40144:::1602753:::1602745:::227577:::1602750:::11...|levothyroxine:::levothyroxine Injection:::levothyroxine sodium:::le...| | Prevacid| DRUG| 83156|83156:::219485:::219171:::606051:::858359:::1547099:::2286610:::105...|Prevacid:::Provisc:::Perisine:::ProQuad:::Acuvail:::suvorexant:::Pi...| | Avandia| DRUG| 261455|261455:::152800:::1310526:::236219:::1370666:::686438:::215221:::99...|Avandia:::Amilamont:::Aubagio:::alibendol:::anisate:::Invega:::Amil...| | norvasc| DRUG| 58927|58927:::1876388:::218772:::262324:::385700:::226108:::203013:::2036...|Norvasc:::NoRisc:::Norco:::Norflex:::Norval:::Norimode:::Norcuron::...| | lexapro| DRUG| 
352741|352741:::580253:::227285:::2058916:::24867:::847463:::2055761:::144...|Lexapro:::Levsinex:::Loprox:::Vizimpro:::fenproporex:::Levoprome:::...| | aspirin| DRUG| 1191|1191:::405403:::218266:::1154070:::215568:::202547:::1154069:::2393...|aspirin:::YSP Aspirin:::Med Aspirin:::aspirin Pill:::Bayer Aspirin:...| | Neurontin| DRUG| 196498|196498:::152627:::151178:::827343:::134802:::1720602:::152141:::131...|Neurontin:::Nystamont:::Nitronal:::Nucort:::Naropin:::Nucala:::Nyst...| +-------------+------+----------+----------------------------------------------------------------------+----------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_rxnorm| |Compatibility:|Healthcare NLP 3.2.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[rxnorm_code]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on the 02 August 2021 RxNorm dataset. --- layout: model title: English DistilBertForQuestionAnswering model (from mcurmei) Single author: John Snow Labs name: distilbert_qa_single_label_N_max date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `single_label_N_max` is an English model originally trained by `mcurmei`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_single_label_N_max_en_4.0.0_3.0_1654728665750.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_single_label_N_max_en_4.0.0_3.0_1654728665750.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_single_label_N_max","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_single_label_N_max","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.single_label_n_max.by_mcurmei").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
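The NLU one-liner above packs the question and context into a single string separated by `|||`. A sketch of that convention in pure Python (the `split_qa` helper is illustrative, not nlu's internal API):

```python
# Sketch: the "question|||context" packing convention used by the NLU
# snippet above. split_qa is a hypothetical helper, not part of the
# nlu package; it simply undoes the packing on the first "|||".
def split_qa(packed):
    question, _, context = packed.partition("|||")
    return question, context

q, c = split_qa("What is my name?|||My name is Clara and I live in Berkeley.")
```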
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_single_label_N_max| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mcurmei/single_label_N_max --- layout: model title: Extract Mentions of Response to Cancer Treatment author: John Snow Labs name: ner_oncology_response_to_treatment date: 2022-10-25 tags: [licensed, clinical, oncology, en, ner, treatment] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP for Healthcare 4.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts entities related to the patient's response to oncology treatment, including clinical response and changes in tumor size. Definitions of Predicted Entities: - `Line_Of_Therapy`: Explicit references to the line of therapy of an oncological therapy (e.g. "first-line treatment"). - `Response_To_Treatment`: Terms related to the patient's clinical progress in relation to cancer treatment, including "recurrence", "bad response" or "improvement". - `Size_Trend`: Terms related to changes in the size of the tumor (such as "growth" or "reduced in size"). 
## Predicted Entities `Line_Of_Therapy`, `Response_To_Treatment`, `Size_Trend` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_response_to_treatment_en_4.0.0_3.0_1666722959227.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_response_to_treatment_en_4.0.0_3.0_1666722959227.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_response_to_treatment", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["She completed her first-line therapy, but some months later there was recurrence of the breast cancer."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_response_to_treatment", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("She completed her first-line therapy, but some months later there was recurrence of the breast cancer.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_response_to_treatment").predict("""She completed her first-line therapy, but some months later there was recurrence of the breast cancer.""") ```
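The `NerConverter` stage collapses token-level B-/I-/O tags into labeled chunks. A toy pure-Python sketch of that BIO-to-chunk step (illustrative only, not the Spark NLP implementation):

```python
def bio_to_chunks(tokens, tags):
    """Collapse token-level B-/I- tags into (chunk_text, label) pairs."""
    chunks, cur, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new chunk starts
            if cur:
                chunks.append((" ".join(cur), label))
            cur, label = [tok], tag[2:]
        elif tag.startswith("I-") and cur:  # current chunk continues
            cur.append(tok)
        else:                               # O tag: close any open chunk
            if cur:
                chunks.append((" ".join(cur), label))
            cur, label = [], None
    if cur:
        chunks.append((" ".join(cur), label))
    return chunks

tokens = "She completed her first-line therapy , but there was recurrence".split()
tags = ["O", "O", "O", "B-Line_Of_Therapy", "O", "O", "O", "O", "O", "B-Response_To_Treatment"]
print(bio_to_chunks(tokens, tags))
```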
## Results ```bash | chunk | ner_label | |:-----------|:----------------------| | first-line | Line_Of_Therapy | | recurrence | Response_To_Treatment | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_response_to_treatment| |Compatibility:|Spark NLP for Healthcare 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.4 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Response_To_Treatment 326 101 157 483 0.76 0.67 0.72 Size_Trend 43 28 70 113 0.61 0.38 0.47 Line_Of_Therapy 99 11 7 106 0.90 0.93 0.92 macro_avg 468 140 234 702 0.76 0.66 0.70 micro_avg 468 140 234 702 0.76 0.67 0.71 ``` --- layout: model title: Persian Lemmatizer author: John Snow Labs name: lemma date: 2020-11-28 task: Lemmatization language: fa edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [lemmatizer, fa, open_source] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_fa_2.7.0_2.4_1606581127793.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_fa_2.7.0_2.4_1606581127793.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of a pipeline after tokenisation.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "fa") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate(['فرماندهان فرماندهی فرمانده فرمانده‌ای فرماندهٔ فرماند']) ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "fa") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("فرماندهان فرماندهی فرمانده فرمانده‌ای فرماندهٔ فرماند").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""فرماندهان فرماندهی فرمانده فرمانده‌ای فرماندهٔ فرماند"""] lemma_df = nlu.load('fa.lemma').predict(text) lemma_df.lemma.values[0] ```
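Dictionary-backed lemmatizers like this one essentially map each inflected surface form to a root, falling back to the token itself for unknown forms. A toy sketch with a hypothetical three-entry dictionary (the real model ships a far larger mapping learned from Universal Dependencies data):

```python
# hypothetical mini-dictionary; illustrative entries only
lemma_dict = {
    "فرماندهان": "فرمانده",   # plural -> singular root
    "فرمانده‌ای": "فرمانده",
    "فرماندهٔ": "فرمانده",
}

def lemmatize(tokens):
    # fall back to the surface form when the token is not in the dictionary
    return [lemma_dict.get(t, t) for t in tokens]

print(lemmatize(["فرماندهان", "فرمانده", "فرماندهٔ"]))
```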
## Results ```bash {'lemma': [Annotation(token, 0, 8, فرمانده, {'sentence': '0'}), Annotation(token, 10, 17, فرماندهی, {'sentence': '0'}), Annotation(token, 19, 25, فرمانده, {'sentence': '0'}), Annotation(token, 27, 36, فرمانده, {'sentence': '0'}), Annotation(token, 38, 45, فرمانده, {'sentence': '0'}), Annotation(token, 47, 52, فرماند, {'sentence': '0'})]} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|fa| ## Data Source This model is trained on data obtained from [https://universaldependencies.org/](https://universaldependencies.org/) --- layout: model title: Dutch Named Entity Recognition (from gunghio) author: John Snow Labs name: distilbert_ner_distilbert_base_multilingual_cased_finetuned_conll2003_ner date: 2022-05-16 tags: [distilbert, ner, token_classification, nl, open_source] task: Named Entity Recognition language: nl edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-multilingual-cased-finetuned-conll2003-ner` is a Dutch model originally trained by `gunghio`. ## Predicted Entities `ORG`, `MISC`, `PER`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_distilbert_base_multilingual_cased_finetuned_conll2003_ner_nl_3.4.2_3.0_1652721741727.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_distilbert_base_multilingual_cased_finetuned_conll2003_ner_nl_3.4.2_3.0_1652721741727.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_distilbert_base_multilingual_cased_finetuned_conll2003_ner","nl") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ik hou van Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_distilbert_base_multilingual_cased_finetuned_conll2003_ner","nl") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ik hou van Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_ner_distilbert_base_multilingual_cased_finetuned_conll2003_ner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|nl| |Size:|505.8 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner - https://paperswithcode.com/sota?task=Named+Entity+Recognition&dataset=ConLL+2003 --- layout: model title: English T5ForConditionalGeneration Tiny Cased model (from google) author: John Snow Labs name: t5_efficient_tiny_nl8 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-nl8` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nl8_en_4.3.0_3.0_1675123995941.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nl8_en_4.3.0_3.0_1675123995941.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_tiny_nl8","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_tiny_nl8","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
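The `nl8` suffix in the model name denotes a deep-narrow variant with 8 layers per stack. A rough back-of-the-envelope weight count per transformer block, assuming the tiny configuration's dimensions (d_model = 256, d_ff = 1024; these values are assumptions from the T5-efficient naming scheme, not read from this checkpoint):

```python
def block_params(d_model, d_ff):
    # self-attention: four d_model x d_model projection matrices (biases ignored)
    attn = 4 * d_model * d_model
    # feed-forward: one up-projection and one down-projection matrix
    ff = 2 * d_model * d_ff
    return attn + ff

per_block = block_params(256, 1024)
print(per_block)      # weights per encoder block under these assumptions
print(8 * per_block)  # across 8 encoder layers
```

Decoder blocks add a cross-attention sub-layer and the vocabulary embedding matrix adds several million more weights, so this is only an encoder-side order-of-magnitude estimate.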
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_tiny_nl8| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|60.3 MB| ## References - https://huggingface.co/google/t5-efficient-tiny-nl8 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Multilingual T5ForConditionalGeneration Small Cased model (from north) author: John Snow Labs name: t5_small_ncc date: 2023-01-31 tags: [is, nn, en, "no", sv, open_source, t5, xx, tensorflow] task: Text Generation language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5_small_NCC` is a Multilingual model originally trained by `north`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_ncc_xx_4.3.0_3.0_1675156926295.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_ncc_xx_4.3.0_3.0_1675156926295.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_ncc","xx") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_ncc","xx") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_ncc| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|xx| |Size:|1.6 GB| ## References - https://huggingface.co/north/t5_small_NCC - https://github.com/google-research/text-to-text-transfer-transformer - https://github.com/google-research/t5x - https://arxiv.org/abs/2104.09617 - https://arxiv.org/pdf/1910.10683.pdf - https://sites.research.google/trc/about/ --- layout: model title: English asr_wav2vec2_large_xlsr_hindi_demo_colab_by_rafiulrumy TFWav2Vec2ForCTC from rafiulrumy author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_hindi_demo_colab_by_rafiulrumy date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_hindi_demo_colab_by_rafiulrumy` is an English model originally trained by rafiulrumy. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_hindi_demo_colab_by_rafiulrumy_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_hindi_demo_colab_by_rafiulrumy_en_4.2.0_3.0_1664102298062.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_hindi_demo_colab_by_rafiulrumy_en_4.2.0_3.0_1664102298062.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_hindi_demo_colab_by_rafiulrumy', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_hindi_demo_colab_by_rafiulrumy", lang = "en") val annotations = pipeline.transform(audioDF) ```
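The `Wav2Vec2ForCTC` stage inside this pipeline emits one character distribution per audio frame and decodes them with CTC. A minimal greedy (best-path) CTC decode, sketched in plain Python with a toy vocabulary (illustrative only, not the Spark NLP implementation):

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank=0):
    """Collapse repeated ids, then drop blanks: the standard CTC best-path decode."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(id_to_char[i])
        prev = i
    return "".join(out)

vocab = {1: "h", 2: "i", 3: "n", 4: "d"}  # toy vocabulary
# per-frame argmax ids emitted by the acoustic model (0 = CTC blank)
frames = [1, 1, 0, 2, 2, 3, 0, 4, 4, 2]
print(ctc_greedy_decode(frames, vocab))  # -> hindi
```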
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_hindi_demo_colab_by_rafiulrumy| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: BERT Sentence Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on QNLI author: John Snow Labs name: sent_bert_wiki_books_qnli date: 2021-08-31 tags: [en, open_source, sentence_embeddings, wikipedia_dataset, books_corpus_dataset, qnli_dataset] task: Embeddings language: en nav_key: models edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses a BERT base architecture initialized from https://tfhub.dev/google/experts/bert/wiki_books/1 and fine-tuned on QNLI. This is a BERT base architecture, but some changes have been made to the original training and export scheme based on more recent learnings. This model is intended to be used for a variety of English NLP tasks. The pre-training data contains more formal text, and the model may not generalize to more colloquial text such as social media or messages. This model is fine-tuned on QNLI and is recommended for question-based natural language inference tasks. QNLI is a classification task over (question, context) pairs, where the context paragraphs are drawn from Wikipedia: the model predicts whether the context contains the answer to the question. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_bert_wiki_books_qnli_en_3.2.0_3.0_1630412734083.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_bert_wiki_books_qnli_en_3.2.0_3.0_1630412734083.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_wiki_books_qnli", "en") \ .setInputCols("sentence") \ .setOutputCol("bert_sentence") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings ]) ``` ```scala val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_wiki_books_qnli", "en") .setInputCols("sentence") .setOutputCol("bert_sentence") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings )) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] sent_embeddings_df = nlu.load('en.embed_sentence.bert.wiki_books_qnli').predict(text, output_level='sentence') sent_embeddings_df ```
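Sentence embeddings from a QNLI-tuned encoder are typically compared with cosine similarity to rank candidate contexts for a question. A self-contained sketch with toy 4-dimensional vectors standing in for the real `bert_sentence` output (the vector values here are made up for illustration):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# toy vectors standing in for the encoder's sentence embeddings
question = [0.9, 0.1, 0.0, 0.2]
contexts = {
    "relevant": [0.8, 0.2, 0.1, 0.1],
    "unrelated": [0.0, 0.1, 0.9, 0.0],
}
# rank contexts by similarity to the question, most similar first
ranked = sorted(contexts, key=lambda k: cosine(question, contexts[k]), reverse=True)
print(ranked[0])  # -> relevant
```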
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_bert_wiki_books_qnli| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[bert_sentence]| |Language:|en| |Case sensitive:|false| ## Data Source [1]: [Wikipedia dataset](https://dumps.wikimedia.org/) [2]: [BooksCorpus dataset](http://yknzhu.wixsite.com/mbweb) [3]: [QNLI dataset](https://gluebenchmark.com/) This model has been imported from: https://tfhub.dev/google/experts/bert/wiki_books/qnli/2 --- layout: model title: Zero-shot Relation Extraction (BioBert) author: John Snow Labs name: re_zeroshot_biobert date: 2022-04-05 tags: [zero, shot, zero_shot, en, licensed] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.5.0 spark_version: 3.0 supported: true annotator: ZeroShotRelationExtractionModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model performs zero-shot relation extraction between clinical entities: it needs no training dataset, relying only on the pretrained BioBert embeddings included in the model. This model requires Healthcare NLP 3.5.0. Take a look at how it works in the "Open in Colab" section below. ## Predicted Entities {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.3.ZeroShot_Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_zeroshot_biobert_en_3.5.0_3.0_1649176740466.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_zeroshot_biobert_en_3.5.0_3.0_1649176740466.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} {% raw %} ```python documenter = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("tokens") sentencer = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") words_embedder = nlp.WordEmbeddingsModel() \ .pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_clinical = medical.NerModel() \ .pretrained("ner_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens", "embeddings"]) \ .setOutputCol("ner_clinical") ner_clinical_converter = nlp.NerConverter() \ .setInputCols(["sentences", "tokens", "ner_clinical"]) \ .setOutputCol("ner_clinical_chunks")\ .setWhiteList(["PROBLEM", "TEST"]) # PROBLEM-TEST-TREATMENT ner_posology = medical.NerModel.pretrained("ner_posology", "en", "clinical/models") \ .setInputCols(["sentences", "tokens", "embeddings"]) \ .setOutputCol("ner_posology") ner_posology_converter = nlp.NerConverter() \ .setInputCols(["sentences", "tokens", "ner_posology"]) \ .setOutputCol("ner_posology_chunks")\ .setWhiteList(["DRUG"]) # DRUG-FREQUENCY-DOSAGE-DURATION-FORM-ROUTE-STRENGTH chunk_merger = medical.ChunkMergeApproach()\ .setInputCols("ner_clinical_chunks", "ner_posology_chunks")\ .setOutputCol('merged_ner_chunks') ## ZERO-SHOT RE Starting... 
re_model = medical.ZeroShotRelationExtractionModel.pretrained("re_zeroshot_biobert", "en", "clinical/models")\ .setInputCols(["merged_ner_chunks", "sentences"]) \ .setOutputCol("relations")\ .setMultiLabel(True) re_model.setRelationalCategories({ "ADE": ["{DRUG} causes {PROBLEM}."], "IMPROVE": ["{DRUG} improves {PROBLEM}.", "{DRUG} cures {PROBLEM}."], "REVEAL": ["{TEST} reveals {PROBLEM}."]}) pipeline = nlp.Pipeline() \ .setStages([documenter, tokenizer, sentencer, words_embedder, ner_clinical, ner_clinical_converter, ner_posology, ner_posology_converter, chunk_merger, re_model]) data = spark.createDataFrame( [["Paracetamol can alleviate headache or sickness. An MRI test can be used to find cancer."]] ).toDF("text") results = pipeline.fit(data).transform(data) results\ .selectExpr("explode(relations) as relation")\ .show(truncate=False) ``` ```scala val data = Seq("Paracetamol can alleviate headache or sickness. An MRI test can be used to find cancer.").toDS.toDF("text") val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("tokens") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_clinical = MedicalNerModel() .pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_clinical") val ner_clinical_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner_clinical")) .setOutputCol("ner_clinical_chunks") .setWhiteList(Array("PROBLEM", "TEST")) // PROBLEM-TEST-TREATMENT val ner_posology = MedicalNerModel() .pretrained("ner_posology", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) 
.setOutputCol("ner_posology") val ner_posology_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner_posology")) .setOutputCol("ner_posology_chunks") .setWhiteList(Array("DRUG")) // DRUG-FREQUENCY-DOSAGE-DURATION-FORM-ROUTE-STRENGTH val chunk_merger = new ChunkMergeApproach() .setInputCols(Array("ner_clinical_chunks", "ner_posology_chunks")) .setOutputCol("merged_ner_chunks") // ZERO-SHOT RE starting... val re_model = ZeroShotRelationExtractionModel.pretrained("re_zeroshot_biobert", "en", "clinical/models") .setInputCols(Array("merged_ner_chunks", "sentences")) .setOutputCol("relations") .setMultiLabel(true) re_model.setRelationalCategories(Map( "ADE" -> Array("{DRUG} causes {PROBLEM}."), "IMPROVE" -> Array("{DRUG} improves {PROBLEM}.", "{DRUG} cures {PROBLEM}."), "REVEAL" -> Array("{TEST} reveals {PROBLEM}.") )) val pipeline = new Pipeline() .setStages(Array(documenter, tokenizer, sentencer, words_embedder, ner_clinical, ner_clinical_converter, ner_posology, ner_posology_converter, chunk_merger, re_model)) val model = pipeline.fit(data) val results = model.transform(data) ``` {% endraw %} {:.nlu-block} ```python import nlu nlu.load("en.relation.zeroshot_biobert").predict("""Paracetamol can alleviate headache or sickness. An MRI test can be used to find cancer.""") ```
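The `setRelationalCategories` call above turns each template into natural-language-inference hypotheses by filling the placeholders with compatible entity chunks. A simplified pure-Python sketch of that expansion (illustrative only; the real annotator also runs the NLI model and applies confidence thresholds):

```python
def build_hypotheses(chunks, templates):
    """Fill each relation template with entity pairs whose types match the
    placeholders, in placeholder order."""
    hyps = []
    for label, patterns in templates.items():
        for pat in patterns:
            for text1, type1 in chunks:
                for text2, type2 in chunks:
                    p1, p2 = "{" + type1 + "}", "{" + type2 + "}"
                    if type1 != type2 and p1 in pat and p2 in pat and pat.index(p1) < pat.index(p2):
                        hyps.append((label, pat.replace(p1, text1).replace(p2, text2)))
    return hyps

chunks = [("Paracetamol", "DRUG"), ("headache", "PROBLEM")]
templates = {"IMPROVE": ["{DRUG} improves {PROBLEM}."]}
print(build_hypotheses(chunks, templates))  # -> [('IMPROVE', 'Paracetamol improves headache.')]
```

Each generated hypothesis is then scored against the sentence by the NLI model; pairs whose hypothesis is entailed become predicted relations.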
## Results ```bash +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |relation | +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |{category, 268, 358, IMPROVE, {entity1_begin -> 0, relation -> IMPROVE, hypothesis -> Paracetamol improves sickness., confidence -> 0.98819494, nli_prediction -> entail, entity1 -> DRUG, syntactic_distance -> undefined, chunk2 -> sickness, entity2_end -> 45, entity1_end -> 10, entity2_begin -> 38, entity2 -> PROBLEM, chunk1 -> Paracetamol, sentence -> 0}, []}| |{category, 0, 90, IMPROVE, {entity1_begin -> 0, relation -> IMPROVE, hypothesis -> Paracetamol improves headache., confidence -> 0.9929625, nli_prediction -> entail, entity1 -> DRUG, syntactic_distance -> undefined, chunk2 -> headache, entity2_end -> 33, entity1_end -> 10, entity2_begin -> 26, entity2 -> PROBLEM, chunk1 -> Paracetamol, sentence -> 0}, []} | |{category, 536, 615, REVEAL, {entity1_begin -> 48, relation -> REVEAL, hypothesis -> An MRI test reveals cancer., confidence -> 0.9760039, nli_prediction -> entail, entity1 -> TEST, syntactic_distance -> undefined, chunk2 -> cancer, entity2_end -> 85, entity1_end -> 58, entity2_begin -> 80, entity2 -> PROBLEM, chunk1 -> An MRI test, sentence -> 1}, []} | 
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_zeroshot_biobert| |Compatibility:|Healthcare NLP 3.5.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| ## References As it is a zero-shot relation extractor, no training dataset is necessary. --- layout: model title: BERT Embeddings (Large Cased) author: John Snow Labs name: bert_large_cased date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains a deep bidirectional transformer trained on Wikipedia and the BookCorpus. The details are described in the paper "[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_large_cased_en_2.6.0_2.4_1598340717429.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_large_cased_en_2.6.0_2.4_1598340717429.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("bert_large_cased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("bert_large_cased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.large_cased').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash token en_embed_bert_large_cased_embeddings I [-0.5893247723579407, -1.1389378309249878, -0.... love [-0.8002289533615112, -0.15043185651302338, 0.... NLP [-0.8995863199234009, 0.08327484875917435, 0.9... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_large_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|1024| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/google/bert_cased_L-24_H-1024_A-16/1](https://tfhub.dev/google/bert_cased_L-24_H-1024_A-16/1) --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from FabianWillner) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_trivia_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-triviaqa-finetuned-squad` is an English model originally trained by `FabianWillner`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_trivia_squad_en_4.3.0_3.0_1672773863211.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_trivia_squad_en_4.3.0_3.0_1672773863211.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_trivia_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_trivia_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_trivia_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/FabianWillner/distilbert-base-uncased-finetuned-triviaqa-finetuned-squad --- layout: model title: Chinese T5ForConditionalGeneration Cased model (from hululuzhu) author: John Snow Labs name: t5_chinese_poem_mengzi_finetune date: 2023-01-30 tags: [zh, open_source, t5, tensorflow] task: Text Generation language: zh edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese-poem-t5-mengzi-finetune` is a Chinese model originally trained by `hululuzhu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_chinese_poem_mengzi_finetune_zh_4.3.0_3.0_1675100525010.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_chinese_poem_mengzi_finetune_zh_4.3.0_3.0_1675100525010.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_chinese_poem_mengzi_finetune","zh") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_chinese_poem_mengzi_finetune","zh") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_chinese_poem_mengzi_finetune| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|zh| |Size:|1.0 GB| ## References - https://huggingface.co/hululuzhu/chinese-poem-t5-mengzi-finetune - https://github.com/hululuzhu/chinese-ai-writing-share - https://github.com/hululuzhu/chinese-ai-writing-share/tree/main/slides - https://github.com/chinese-poetry/chinese-poetry --- layout: model title: Spanish RobertaForQuestionAnswering (from mrm8488) author: John Snow Labs name: roberta_qa_RuPERTa_base_finetuned_squadv2 date: 2022-06-20 tags: [es, open_source, question_answering, roberta] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `RuPERTa-base-finetuned-squadv2` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_RuPERTa_base_finetuned_squadv2_es_4.0.0_3.0_1655727371159.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_RuPERTa_base_finetuned_squadv2_es_4.0.0_3.0_1655727371159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_RuPERTa_base_finetuned_squadv2","es") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_RuPERTa_base_finetuned_squadv2","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("es.answer_question.squadv2.roberta.base_v2").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_RuPERTa_base_finetuned_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|es| |Size:|470.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/RuPERTa-base-finetuned-squadv2 --- layout: model title: Extract Mentions of Response to Cancer Treatment author: John Snow Labs name: ner_oncology_response_to_treatment date: 2022-11-24 tags: [licensed, clinical, en, oncology, ner, treatment] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts entities related to the patient's response to the oncology treatment, including clinical response and changes in tumor size. Definitions of Predicted Entities: - `Line_Of_Therapy`: Explicit references to the line of therapy of an oncological therapy (e.g. "first-line treatment"). - `Response_To_Treatment`: Terms related to clinical progress of the patient related to cancer treatment, including "recurrence", "bad response" or "improvement". - `Size_Trend`: Terms related to the changes in the size of the tumor (such as "growth" or "reduced in size").
## Predicted Entities `Line_Of_Therapy`, `Response_To_Treatment`, `Size_Trend` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_response_to_treatment_en_4.2.2_3.0_1669307329775.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_response_to_treatment_en_4.2.2_3.0_1669307329775.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_response_to_treatment", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["She completed her first-line therapy, but some months later there was recurrence of the breast cancer."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_response_to_treatment", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("She completed her first-line therapy, but some months later there was recurrence of the breast cancer.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_response_to_treatment").predict("""She completed her first-line therapy, but some months later there was recurrence of the breast cancer.""")  ```
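The `NerConverter` stage above merges the model's token-level IOB tags into entity chunks such as the ones shown in the Results section. As an illustration of that merging logic only (a minimal sketch, not Spark NLP's actual implementation), in plain Python:

```python
def iob_to_chunks(tokens, tags):
    # Merge token-level IOB tags (B-X / I-X / O) into (chunk, label) pairs
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(tuple(current))
            current = [token, tag[2:]]
        elif tag.startswith("I-") and current and tag[2:] == current[1]:
            current[0] += " " + token
        else:
            if current:
                chunks.append(tuple(current))
            current = None
    if current:
        chunks.append(tuple(current))
    return chunks

# Hypothetical tags for part of the example sentence
tokens = ["She", "completed", "her", "first-line", "therapy"]
tags = ["O", "O", "O", "B-Line_Of_Therapy", "O"]
print(iob_to_chunks(tokens, tags))  # [('first-line', 'Line_Of_Therapy')]
```

In the real pipeline the chunks land in the `ner_chunk` output column, annotated with their label in the metadata.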
## Results ```bash | chunk | ner_label | |:-----------|:----------------------| | first-line | Line_Of_Therapy | | recurrence | Response_To_Treatment | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_response_to_treatment| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|34.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Response_To_Treatment 326 101 157 483 0.76 0.67 0.72 Size_Trend 43 28 70 113 0.61 0.38 0.47 Line_Of_Therapy 99 11 7 106 0.90 0.93 0.92 macro_avg 468 140 234 702 0.76 0.66 0.70 micro_avg 468 140 234 702 0.76 0.67 0.71 ``` --- layout: model title: Legal Restricted Stock Unit Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_restricted_stock_unit_agreement date: 2022-12-06 tags: [en, legal, classification, agreement, restricted, stock, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_restricted_stock_unit_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `restricted-stock-unit-agreement` or not (Binary Classification). Longformers have a restriction on 4096 tokens, so only the first 4096 tokens will be taken into account. We have realised that for the big majority of the documents in legal corpora, if they are clean and only contain the legal document without any extra information before, 4096 is enough to perform Document Classification. 
If 4096 tokens are not enough for your documents, let us know and we can take a different approach: splitting each document into chunks of 4096 tokens, averaging the chunk embeddings, and training on the averaged version, so that the whole document is taken into account. In practice, this should rarely be necessary. ## Predicted Entities `restricted-stock-unit-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_restricted_stock_unit_agreement_en_1.0.0_3.0_1670358322057.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_restricted_stock_unit_agreement_en_1.0.0_3.0_1670358322057.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_restricted_stock_unit_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
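The chunk-and-average fallback mentioned in the description can be sketched in plain Python. This is illustrative only: in a real pipeline the per-chunk vectors would come from the Longformer model above, not from toy data.

```python
def split_into_chunks(tokens, max_len=4096):
    # Break a long token sequence into windows the model can accept
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def average_embeddings(chunk_embeddings):
    # Element-wise mean over the per-chunk document vectors
    dim = len(chunk_embeddings[0])
    n = len(chunk_embeddings)
    return [sum(vec[d] for vec in chunk_embeddings) / n for d in range(dim)]

# A 10,000-token document splits into three windows
chunks = split_into_chunks(list(range(10000)), max_len=4096)
print([len(c) for c in chunks])  # [4096, 4096, 1808]
```

The averaged vector can then be fed to the classifier in place of the single-chunk sentence embedding.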
## Results ```bash +-------+ |result| +-------+ |[restricted-stock-unit-agreement]| |[other]| |[other]| |[restricted-stock-unit-agreement]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_restricted_stock_unit_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.2 MB| ## References Legal documents, scrapped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.99 0.99 0.99 80 restricted-stock-unit-agreement 0.97 0.97 0.97 39 accuracy - - 0.98 119 macro-avg 0.98 0.98 0.98 119 weighted-avg 0.98 0.98 0.98 119 ``` --- layout: model title: Greek BertForQuestionAnswering Cased model (from Danastos) author: John Snow Labs name: bert_qa_triviaqa_el_4 date: 2022-07-07 tags: [el, open_source, bert, question_answering] task: Question Answering language: el edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `triviaqa_bert_el_4` is a Greek model originally trained by `Danastos`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_triviaqa_el_4_el_4.0.0_3.0_1657193016800.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_triviaqa_el_4_el_4.0.0_3.0_1657193016800.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_triviaqa_el_4","el") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Ποιο είναι το όνομά μου?", "Το όνομά μου είναι Κλάρα και μένω στο Μπέρκλεϊ."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_triviaqa_el_4","el") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Ποιο είναι το όνομά μου?", "Το όνομά μου είναι Κλάρα και μένω στο Μπέρκλεϊ.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_triviaqa_el_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|el| |Size:|421.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Danastos/triviaqa_bert_el_4 --- layout: model title: Spanish Electra Uncased Embeddings (Oscar dataset) author: John Snow Labs name: electra_embeddings_electricidad_base_generator date: 2022-05-17 tags: [es, open_source, electra, embeddings] task: Embeddings language: es edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electricidad-base-generator` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electricidad_base_generator_es_3.4.4_3.0_1652786783374.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electricidad_base_generator_es_3.4.4_3.0_1652786783374.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electricidad_base_generator","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electricidad_base_generator","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electricidad_base_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|es| |Size:|126.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/mrm8488/electricidad-base-generator - https://imgur.com/uxAvBfh - https://oscar-corpus.com/ - https://openreview.net/pdf?id=r1xMH1BtvB - https://arxiv.org/pdf/1406.2661.pdf - https://rajpurkar.github.io/SQuAD-explorer/ - https://twitter.com/julien_c - https://twitter.com/mrm8488 --- layout: model title: Multilingual XLMRobertaForTokenClassification Base Cased model (from transformersbook) author: John Snow Labs name: xlmroberta_ner_transformersbook_base_finetuned_panx_all date: 2022-08-13 tags: [xx, open_source, xlm_roberta, ner] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-all` is a Multilingual model originally trained by `transformersbook`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_transformersbook_base_finetuned_panx_all_xx_4.1.0_3.0_1660429047165.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_transformersbook_base_finetuned_panx_all_xx_4.1.0_3.0_1660429047165.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_transformersbook_base_finetuned_panx_all","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_transformersbook_base_finetuned_panx_all","xx") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_transformersbook_base_finetuned_panx_all| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|861.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/transformersbook/xlm-roberta-base-finetuned-panx-all - https://learning.oreilly.com/library/view/natural-language-processing/9781098103231/ - https://github.com/nlp-with-transformers/notebooks/blob/main/04_multilingual-ner.ipynb - https://paperswithcode.com/sota?task=Token+Classification&dataset=wikiann --- layout: model title: Text Cleaning author: John Snow Labs name: text_cleaning date: 2022-07-07 tags: [en, open_source] task: Text Classification language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description text_cleaning is a pretrained pipeline that cleans text with a few basic processing steps. It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/text_cleaning_en_4.0.0_3.0_1657198002660.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/text_cleaning_en_4.0.0_3.0_1657198002660.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("text_cleaning", "en") result = pipeline.annotate("""I love johnsnowlabs! """) ```
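The pipeline's stages (tokenizer, normalizer, stop-words cleaner, lemmatizer) roughly amount to tokenizing, lowercasing, stripping punctuation and dropping stopwords. A rough pure-Python analogue of that cleaning, for intuition only (not the pipeline itself, and with a tiny made-up stopword list):

```python
import re

STOPWORDS = {"i", "a", "an", "the"}  # tiny illustrative list

def clean_text(text):
    # Tokenize on letters, lowercase, and drop stopwords
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(clean_text("I love johnsnowlabs! "))  # ['love', 'johnsnowlabs']
```

The real pipeline additionally lemmatizes tokens and reassembles them into a document via `TokenAssembler`, and `annotate` returns every stage's output, not just the final tokens.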
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|text_cleaning| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|934.6 KB| ## Included Models - DocumentAssembler - TokenizerModel - NormalizerModel - StopWordsCleaner - LemmatizerModel - TokenAssembler --- layout: model title: BioBERT Sentence Embeddings (Pubmed Large) author: John Snow Labs name: sent_biobert_pubmed_large_cased date: 2020-09-19 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.2 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains a pre-trained weights of BioBERT, a language representation model for biomedical domain, especially designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_biobert_pubmed_large_cased_en_2.6.2_2.4_1600531709085.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_biobert_pubmed_large_cased_en_2.6.2_2.4_1600531709085.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_large_cased", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_large_cased", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.biobert.pubmed_large_cased').predict(text, output_level='sentence') embeddings_df ```
{:.h2_title} ## Results ```bash en_embed_sentence_biobert_pubmed_large_cased_embeddings sentence [-0.50444495677948, 0.15660151839256287, -0.03... I hate cancer [0.3133466839790344, 0.1945181041955948, 0.014... Antibiotics aren't painkiller ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_biobert_pubmed_large_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.2| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|1024| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert) --- layout: model title: Fast Neural Machine Translation Model from English to Pedi author: John Snow Labs name: opus_mt_en_nso date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, nso, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
- source languages: `en` - target languages: `nso` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_nso_xx_2.7.0_2.4_1609165051207.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_nso_xx_2.7.0_2.4_1609165051207.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_nso", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_nso", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.nso').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_nso| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English BertForQuestionAnswering model (from krinal214) author: John Snow Labs name: bert_qa_bert_all_squad_all_translated date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-all-squad_all_translated` is an English model originally trained by `krinal214`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_all_squad_all_translated_en_4.0.0_3.0_1654179485871.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_all_squad_all_translated_en_4.0.0_3.0_1654179485871.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_all_squad_all_translated","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_all_squad_all_translated","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad_translated.bert.by_krinal214").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
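In the NLU one-liner above, the question and its context travel as a single string joined by the `|||` separator. A minimal pure-Python sketch of that packing convention (illustrative only; these helper names are not part of the `nlu` API):

```python
def pack_qa(question: str, context: str) -> str:
    """Join a question and its context with the '|||' separator used in the NLU example."""
    return f"{question}|||{context}"

def unpack_qa(packed: str) -> tuple:
    """Split a packed string back into (question, context)."""
    question, _, context = packed.partition("|||")
    return question, context

packed = pack_qa("What's my name?", "My name is Clara and I live in Berkeley.")
print(unpack_qa(packed)[0])  # What's my name?
```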
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_all_squad_all_translated| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/krinal214/bert-all-squad_all_translated --- layout: model title: Clinical QA BioGPT (JSL - conversational) author: John Snow Labs name: biogpt_chat_jsl_conversational date: 2023-04-18 tags: [clinical, licensed, en, tensorflow] task: Text Generation language: en edition: Healthcare NLP 4.4.0 spark_version: 3.0 supported: true engine: tensorflow annotator: MedicalTextGenerator article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is based on BioGPT, fine-tuned on medical conversations from clinical settings, and can answer clinical questions related to symptoms, drugs, tests, and diseases. The difference between this model and `biogpt_chat_jsl` is that this model produces more concise responses. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/biogpt_chat_jsl_conversational_en_4.4.0_3.0_1681853305199.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/biogpt_chat_jsl_conversational_en_4.4.0_3.0_1681853305199.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") gpt_qa = MedicalTextGenerator.pretrained("biogpt_chat_jsl_conversational", "en", "clinical/models")\ .setInputCols("documents")\ .setOutputCol("answer").setMaxNewTokens(100) pipeline = Pipeline().setStages([document_assembler, gpt_qa]) data = spark.createDataFrame([["How to treat asthma ?"]]).toDF("text") pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("documents") val summarizer = MedicalTextGenerator.pretrained("biogpt_chat_jsl_conversational", "en", "clinical/models") .setInputCols("documents") .setOutputCol("answer").setMaxNewTokens(100) val pipeline = new Pipeline().setStages(Array(document_assembler, summarizer)) val text = "How to treat asthma ?" val data = Seq(Array(text)).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash question: How to treat asthma ? answer: You have to take montelukast + albuterol tablet once or twice in day according to severity of symptoms. Montelukast is used as a maintenance therapy to relieve symptoms of asthma. Albuterol is used as a rescue therapy when symptoms are severe. You can also use inhaled corticosteroids ( ICS ) like budesonide or fluticasone for long term treatment. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|biogpt_chat_jsl_conversational| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.4 GB| |Case sensitive:|true| --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_cline_squad2.0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cline_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_cline_squad2.0_en_4.0.0_3.0_1655728028363.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_cline_squad2.0_en_4.0.0_3.0_1655728028363.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_cline_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_cline_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.cline.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_cline_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|466.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/cline_squad2.0 --- layout: model title: English asr_wav2vec2_vee_demo_colab TFWav2Vec2ForCTC from KISSz author: John Snow Labs name: asr_wav2vec2_vee_demo_colab date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_vee_demo_colab` is an English model originally trained by KISSz. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_vee_demo_colab_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_vee_demo_colab_en_4.2.0_3.0_1664104883104.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_vee_demo_colab_en_4.2.0_3.0_1664104883104.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_vee_demo_colab", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_vee_demo_colab", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
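The `audio_content` column consumed by `AudioAssembler` above is expected to hold the raw waveform as an array of floats. A stdlib-only sketch of decoding a mono 16-bit PCM WAV file into that shape (`wav_to_floats` is a hypothetical helper, not part of Spark NLP):

```python
import struct
import wave

def wav_to_floats(path: str) -> list:
    """Decode a mono 16-bit PCM WAV file into a list of floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "expects 16-bit PCM samples"
        frames = wf.readframes(wf.getnframes())
    # '<h' = little-endian signed 16-bit; normalize by the int16 range
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]
```

The resulting list can then be placed into the DataFrame column that feeds `AudioAssembler`.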
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_vee_demo_colab| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|3.2 MB| --- layout: model title: Detect Persons, Locations and Organization Entities in Turkish (GloVe 840B_300) author: John Snow Labs name: turkish_ner_840B_300 date: 2020-11-10 task: Named Entity Recognition language: tr edition: Spark NLP 2.6.2 spark_version: 2.4 tags: [tr, open_source] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Pretrained Named Entity Recognition (NER) deep learning model for Turkish texts. It recognizes Persons, Locations, and Organization entities using GloVe 840B_300 word embeddings. The SparkNLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. {:.h2_title} ## Predicted Entities Persons-``PER``, Locations-``LOC``, Organizations-``ORG``. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_TR/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_TR.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/turkish_ner_840B_300_tr_2.6.2_2.4_1605042988496.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/turkish_ner_840B_300_tr_2.6.2_2.4_1605042988496.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. 
Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... ner_model = NerDLModel.pretrained("turkish_ner_840B_300", "tr") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (28 Ekim 1955 doğumlu), Amerikalı bir iş adamı, yazılım geliştirici, yatırımcı ve hayırseverdir. En çok Microsoft şirketinin kurucu ortağı olarak bilinir. William Gates , Microsoft şirketindeki kariyeri boyunca başkan, icra kurulu başkanı, başkan ve yazılım mimarisi başkanı pozisyonlarında bulunmuş, aynı zamanda Mayıs 2014'e kadar en büyük bireysel hissedar olmuştur. O, 1970'lerin ve 1980'lerin mikrobilgisayar devriminin en tanınmış girişimcilerinden ve öncülerinden biridir. Seattle Washington'da doğup büyüyen William Gates, 1975'te New Mexico Albuquerque'de çocukluk arkadaşı Paul Allen ile Microsoft şirketini kurdu; dünyanın en büyük kişisel bilgisayar yazılım şirketi haline geldi. William Gates, Ocak 2000'de icra kurulu başkanı olarak istifa edene kadar şirketi başkan ve icra kurulu başkanı olarak yönetti ve daha sonra yazılım mimarisi başkanı oldu. 1990'ların sonlarında, William Gates rekabete aykırı olduğu düşünülen iş taktikleri nedeniyle eleştirilmişti. Bu görüş, çok sayıda mahkeme kararıyla onaylanmıştır. Haziran 2006'da William Gates, Microsoft şirketinde yarı zamanlı bir göreve ve 2000 yılında eşi Melinda Gates ile birlikte kurdukları özel hayır kurumu olan B&Melinda G. Vakfı'nda tam zamanlı çalışmaya geçeceğini duyurdu. Görevlerini kademeli olarak Ray Ozzie ve Craig Mundie'ye devretti. 
Şubat 2014'te Microsoft başkanlığından ayrıldı ve yeni atanan icra kurulu başkanı, Satya Nadella'yı desteklemek için teknoloji danışmanı olarak yeni bir göreve başladı."]], ["text"])) ``` ```scala ... val ner_model = NerDLModel.pretrained("turkish_ner_840B_300", "tr") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (28 Ekim 1955 doğumlu), Amerikalı bir iş adamı, yazılım geliştirici, yatırımcı ve hayırseverdir. En çok Microsoft şirketinin kurucu ortağı olarak bilinir. William Gates , Microsoft şirketindeki kariyeri boyunca başkan, icra kurulu başkanı, başkan ve yazılım mimarisi başkanı pozisyonlarında bulunmuş, aynı zamanda Mayıs 2014'e kadar en büyük bireysel hissedar olmuştur. O, 1970'lerin ve 1980'lerin mikrobilgisayar devriminin en tanınmış girişimcilerinden ve öncülerinden biridir. Seattle Washington'da doğup büyüyen William Gates, 1975'te New Mexico Albuquerque'de çocukluk arkadaşı Paul Allen ile Microsoft şirketini kurdu; dünyanın en büyük kişisel bilgisayar yazılım şirketi haline geldi. William Gates, Ocak 2000'de icra kurulu başkanı olarak istifa edene kadar şirketi başkan ve icra kurulu başkanı olarak yönetti ve daha sonra yazılım mimarisi başkanı oldu. 1990'ların sonlarında, William Gates rekabete aykırı olduğu düşünülen iş taktikleri nedeniyle eleştirilmişti. Bu görüş, çok sayıda mahkeme kararıyla onaylanmıştır. Haziran 2006'da William Gates, Microsoft şirketinde yarı zamanlı bir göreve ve 2000 yılında eşi Melinda Gates ile birlikte kurdukları özel hayır kurumu olan B&Melinda G. Vakfı'nda tam zamanlı çalışmaya geçeceğini duyurdu. Görevlerini kademeli olarak Ray Ozzie ve Craig Mundie'ye devretti. 
Şubat 2014'te Microsoft başkanlığından ayrıldı ve yeni atanan icra kurulu başkanı, Satya Nadella'yı desteklemek için teknoloji danışmanı olarak yeni bir göreve başladı.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (28 Ekim 1955 doğumlu), Amerikalı bir iş adamı, yazılım geliştirici, yatırımcı ve hayırseverdir. En çok Microsoft şirketinin kurucu ortağı olarak bilinir. William Gates , Microsoft şirketindeki kariyeri boyunca başkan, icra kurulu başkanı, başkan ve yazılım mimarisi başkanı pozisyonlarında bulunmuş, aynı zamanda Mayıs 2014'e kadar en büyük bireysel hissedar olmuştur. O, 1970'lerin ve 1980'lerin mikrobilgisayar devriminin en tanınmış girişimcilerinden ve öncülerinden biridir. Seattle Washington'da doğup büyüyen William Gates, 1975'te New Mexico Albuquerque'de çocukluk arkadaşı Paul Allen ile Microsoft şirketini kurdu; dünyanın en büyük kişisel bilgisayar yazılım şirketi haline geldi. William Gates, Ocak 2000'de icra kurulu başkanı olarak istifa edene kadar şirketi başkan ve icra kurulu başkanı olarak yönetti ve daha sonra yazılım mimarisi başkanı oldu. 1990'ların sonlarında, William Gates rekabete aykırı olduğu düşünülen iş taktikleri nedeniyle eleştirilmişti. Bu görüş, çok sayıda mahkeme kararıyla onaylanmıştır. Haziran 2006'da William Gates, Microsoft şirketinde yarı zamanlı bir göreve ve 2000 yılında eşi Melinda Gates ile birlikte kurdukları özel hayır kurumu olan B&Melinda G. Vakfı'nda tam zamanlı çalışmaya geçeceğini duyurdu. Görevlerini kademeli olarak Ray Ozzie ve Craig Mundie'ye devretti. Şubat 2014'te Microsoft başkanlığından ayrıldı ve yeni atanan icra kurulu başkanı, Satya Nadella'yı desteklemek için teknoloji danışmanı olarak yeni bir göreve başladı."""] ner_df = nlu.load('tr.ner').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title} ## Results ```bash +-------------------------+---------+ |chunk |ner_label| +-------------------------+---------+ |William Henry Gates III |PER | |Microsoft |ORG | |William Gates |PER | |Microsoft |ORG | |Seattle Washington'da |LOC | |William Gates |PER | |New Mexico Albuquerque'de|LOC | |Paul Allen |PER | |Microsoft |ORG | |William Gates |PER | |William Gates |PER | |William Gates |PER | |Microsoft |ORG | |Melinda Gates |PER | |B&Melinda G. Vakfı'nda |ORG | |Ray Ozzie |PER | |Craig Mundie'ye |PER | |Microsoft |ORG | |Satya Nadella'yı |PER | +-------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|turkish_ner_840B_300| |Type:|ner| |Compatibility:|Spark NLP 2.6.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|tr| |Dependencies:|glove_840B_300| {:.h2_title} ## Data Source Trained on a custom dataset with multi-lingual GloVe Embeddings ``glove_840B_300``. 
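The benchmarking section that follows reports raw true-positive, false-positive, and false-negative counts per label; precision, recall, and F1 derive directly from them. A small self-contained sketch, plugging in the B-LOC counts from that table:

```python
def prf(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and F1 from raw true/false positive/negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# B-LOC row of the benchmark: tp=2081, fp=157, fn=142
p, r, f = prf(2081, 157, 142)
print(round(p, 4), round(r, 4), round(f, 4))
```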
{:.h2_title} ## Benchmarking ```bash label tp fp fn prec rec f1 B-LOC 2081 157 142 0.9298481 0.93612236 0.9329747 I-ORG 1292 220 152 0.8544974 0.8947368 0.87415427 I-LOC 293 81 66 0.78342247 0.81615597 0.79945433 I-PER 1578 127 99 0.9255132 0.940966 0.9331757 B-ORG 1846 185 145 0.9089119 0.9271723 0.91795135 B-PER 3043 186 206 0.942397 0.93659586 0.9394875 tp: 10133 fp: 956 fn: 810 labels: 6 Macro-average prec: 0.890765, rec: 0.9086249, f1: 0.8996063 Micro-average prec: 0.91378844, rec: 0.9259801, f1: 0.9198439 ``` --- layout: model title: Legal Construction Clause Binary Classifier author: John Snow Labs name: legclf_construction_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `construction` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend you to split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
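The paragraph-splitting advice above (split long documents on multiple line breaks before classifying each piece) can be sketched in plain Python. This is a minimal illustration, not the Legal NLP workshop's actual implementation:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs on runs of blank lines."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = "CONSTRUCTION. Headings are for convenience only.\n\n2. Definitions.\nTerms are defined below."
for paragraph in split_paragraphs(doc):
    print(repr(paragraph))
```

Each returned paragraph can then be fed to the classifier pipeline as its own `clause_text` row.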
This model can be combined with hundreds of other Legal Clause Classifiers available in Models Hub, producing as output a True/False value for each clause model you add. ## Predicted Entities `other`, `construction` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_construction_clause_en_1.0.0_3.2_1660123351617.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_construction_clause_en_1.0.0_3.2_1660123351617.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_construction_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[construction]| |[other]| |[other]| |[construction]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_construction_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support construction 0.88 0.84 0.86 43 other 0.90 0.93 0.92 70 accuracy - - 0.89 113 macro-avg 0.89 0.88 0.89 113 weighted-avg 0.89 0.89 0.89 113 ``` --- layout: model title: English BertForQuestionAnswering model (from SauravMaheshkar) author: John Snow Labs name: bert_qa_bert_large_uncased_whole_word_masking_chaii date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-whole-word-masking-chaii` is an English model originally trained by `SauravMaheshkar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_chaii_en_4.0.0_3.0_1654536887751.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_uncased_whole_word_masking_chaii_en_4.0.0_3.0_1654536887751.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_uncased_whole_word_masking_chaii","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_large_uncased_whole_word_masking_chaii","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.chaii.bert.large_uncased_uncased_whole_word_masking.by_SauravMaheshkar").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_large_uncased_whole_word_masking_chaii| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/SauravMaheshkar/bert-large-uncased-whole-word-masking-chaii --- layout: model title: Explain Document Pipeline - CRA author: John Snow Labs name: explain_clinical_doc_cra class: PipelineModel language: en nav_key: models repository: clinical/models date: 2020-08-19 task: [Named Entity Recognition, Assertion Status, Relation Extraction, Pipeline Healthcare] edition: Healthcare NLP 2.5.5 spark_version: 2.4 tags: [clinical,licensed,pipeline,en] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description A pretrained pipeline with `ner_clinical`, `assertion_dl`, `re_clinical`. It will extract clinical entities, assign assertion status, and find relationships between clinical entities. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.Pretrained_Clinical_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_cra_en_2.5.5_2.4_1597846145640.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/explain_clinical_doc_cra_en_2.5.5_2.4_1597846145640.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPython.html %} ```python cra_pipeline = PretrainedPipeline("explain_clinical_doc_cra","en","clinical/models") annotations = cra_pipeline.fullAnnotate("""She is admitted to The John Hopkins Hospital 2 days ago with a history of gestational diabetes mellitus diagnosed. She denied pain and any headache.She was seen by the endocrinology service and she was discharged on 03/02/2018 on 40 units of insulin glargine, 12 units of insulin lispro, and metformin 1000 mg two times a day. She had close follow-up with endocrinology post discharge. """)[0] annotations.keys() ``` ```scala val cra_pipeline = new PretrainedPipeline("explain_clinical_doc_cra","en","clinical/models") val result = cra_pipeline.fullAnnotate("""She is admitted to The John Hopkins Hospital 2 days ago with a history of gestational diabetes mellitus diagnosed. She denied pain and any headache.She was seen by the endocrinology service and she was discharged on 03/02/2018 on 40 units of insulin glargine, 12 units of insulin lispro, and metformin 1000 mg two times a day. She had close follow-up with endocrinology post discharge. """)(0) ```
{:.h2_title} ## Results This pretrained pipeline gives the result of `ner_clinical`, `re_clinical` and `assertion_dl` models. ```bash | | chunk1 | ner_clinical1 | assertion | relation | chunk2 | ner_clinical2 | |---|-------------------------------|---------------|------------|----------|---------------|---------------| | 0 | gestational diabetes mellitus | PROBLEM | present | TrAP | metformin | TREATMENT | | 1 | gestational diabetes mellitus | PROBLEM | present | TrAP | polyuria | PROBLEM | | 2 | gestational diabetes mellitus | PROBLEM | present | TrCP | polydipsia | PROBLEM | | 3 | gestational diabetes mellitus | PROBLEM | present | TrCP | poor appetite | PROBLEM | | 4 | metformin | TREATMENT | present | TrAP | polyuria | PROBLEM | | 5 | metformin | TREATMENT | present | TrAP | polydipsia | PROBLEM | | 6 | metformin | TREATMENT | present | TrAP | poor appetite | PROBLEM | | 7 | polydipsia | PROBLEM | present | TrAP | vomiting | PROBLEM | ``` {:.model-param} ## Model Information {:.table-model} |---------------|--------------------------| | Name: | explain_clinical_doc_cra | | Type: | PipelineModel | | Compatibility: | Spark NLP 2.5.5+ | | License: | Licensed | | Edition: | Official | | Language: | en | {:.h2_title} ## Included Models - ner_clinical - assertion_dl - re_clinical --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_32_finetuned_squad_seed_10 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`roberta-base-few-shot-k-32-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_32_finetuned_squad_seed_10_en_4.3.0_3.0_1674215255291.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_32_finetuned_squad_seed_10_en_4.3.0_3.0_1674215255291.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_32_finetuned_squad_seed_10","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_32_finetuned_squad_seed_10","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_base_few_shot_k_32_finetuned_squad_seed_10|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|417.6 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-32-finetuned-squad-seed-10

---
layout: model
title: English BertForQuestionAnswering model (from phiyodr)
author: John Snow Labs
name: bert_qa_bert_base_finetuned_squad2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-finetuned-squad2` is an English model originally trained by `phiyodr`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_finetuned_squad2_en_4.0.0_3.0_1654179945636.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_finetuned_squad2_en_4.0.0_3.0_1654179945636.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_finetuned_squad2","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_bert_base_finetuned_squad2","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2.bert.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
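In the NLU one-liner above, the question and the context are passed to `predict` as a single string joined by a `|||` separator. A minimal pure-Python sketch of that input convention (no Spark session required; the delimiter is the only detail taken from the snippet above):

```python
# nlu question-answering loaders accept "question|||context" as one string.
payload = "What's my name?|||My name is Clara and I live in Berkeley."

# Split on the first "|||" only, in case the context itself contains pipes.
question, context = payload.split("|||", 1)
print(question)  # What's my name?
print(context)   # My name is Clara and I live in Berkeley.
```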
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_finetuned_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/phiyodr/bert-base-finetuned-squad2
- https://arxiv.org/abs/1810.04805
- https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/
- https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
- https://rajpurkar.github.io/SQuAD-explorer/
- https://arxiv.org/abs/1806.03822

---
layout: model
title: Extract relations between effects of using multiple drugs (ReDL)
author: John Snow Labs
name: redl_drug_drug_interaction_biobert
date: 2021-02-04
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 2.7.3
spark_version: 2.4
tags: [licensed, clinical, en, relation_extraction]
supported: true
annotator: RelationExtractionDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Extract potential improvements or harmful effects of drug-drug interactions (DDIs) that arise when two or more drugs are taken at the same time or at certain intervals.

## Predicted Entities

`DDI-advise`, `DDI-effect`, `DDI-false`, `DDI-int`, `DDI-mechanism`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_DRUG_DRUG_INT/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_drug_drug_interaction_biobert_en_2.7.3_2.4_1612441748775.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_drug_drug_interaction_biobert_en_2.7.3_2.4_1612441748775.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
...
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = sparknlp.annotators.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

pos_tagger = PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

words_embedder = WordEmbeddingsModel() \
    .pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"]) \
    .setOutputCol("embeddings")

ner_tagger = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ner_tags")

ner_converter = NerConverter() \
    .setInputCols(["sentences", "tokens", "ner_tags"]) \
    .setOutputCol("ner_chunks")

dependency_parser = DependencyParserModel() \
    .pretrained("dependency_conllu", "en") \
    .setInputCols(["sentences", "pos_tags", "tokens"]) \
    .setOutputCol("dependencies")

# Set a filter on pairs of named entities which will be treated as relation candidates
re_ner_chunk_filter = RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"])\
    .setMaxSyntacticDistance(10)\
    .setOutputCol("re_ner_chunks")
#.setRelationPairs(['SYMPTOM-EXTERNAL_BODY_PART_OR_REGION'])

# The dataset this model was trained on is annotated sentence-wise.
# This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input.
re_model = RelationExtractionDLModel()\
    .pretrained('redl_drug_drug_interaction_biobert', 'en', "clinical/models") \
    .setPredictionThreshold(0.5)\
    .setInputCols(["re_ner_chunks", "sentences"]) \
    .setOutputCol("relations")

pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model])

text = """When carbamazepine is withdrawn from the combination therapy, aripiprazole dose should then be reduced. \
If additional adrenergic drugs are to be administered by any route, \
they should be used with caution because the pharmacologically predictable sympathetic effects of Metformin may be potentiated"""

data = spark.createDataFrame([[text]]).toDF("text")

p_model = pipeline.fit(data)

result = p_model.transform(data)
```
```scala
...
val documenter = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentencer = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentences")

val tokenizer = new Tokenizer()
    .setInputCols("sentences")
    .setOutputCol("tokens")

val pos_tagger = PerceptronModel
    .pretrained("pos_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens"))
    .setOutputCol("pos_tags")

val words_embedder = WordEmbeddingsModel
    .pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens"))
    .setOutputCol("embeddings")

val ner_tagger = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens", "embeddings"))
    .setOutputCol("ner_tags")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentences", "tokens", "ner_tags"))
    .setOutputCol("ner_chunks")

val dependency_parser = DependencyParserModel
    .pretrained("dependency_conllu", "en")
    .setInputCols(Array("sentences", "pos_tags", "tokens"))
    .setOutputCol("dependencies")

// Set a filter on pairs of named entities which will be treated as relation candidates
val re_ner_chunk_filter = new RENerChunksFilter()
    .setInputCols(Array("ner_chunks", "dependencies"))
    .setMaxSyntacticDistance(10)
    .setOutputCol("re_ner_chunks")
    // .setRelationPairs(Array("SYMPTOM-EXTERNAL_BODY_PART_OR_REGION"))

// The dataset this model was trained on is annotated sentence-wise.
// This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input.
val re_model = RelationExtractionDLModel
    .pretrained("redl_drug_drug_interaction_biobert", "en", "clinical/models")
    .setPredictionThreshold(0.5)
    .setInputCols(Array("re_ner_chunks", "sentences"))
    .setOutputCol("relations")

val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model))

val data = Seq("When carbamazepine is withdrawn from the combination therapy, aripiprazole dose should then be reduced. If additional adrenergic drugs are to be administered by any route, they should be used with caution because the pharmacologically predictable sympathetic effects of Metformin may be potentiated").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.relation.drug_drug_interaction").predict("""When carbamazepine is withdrawn from the combination therapy, aripiprazole dose should then be reduced. \
If additional adrenergic drugs are to be administered by any route, \
they should be used with caution because the pharmacologically predictable sympathetic effects of Metformin may be potentiated""")
```
## Results

```bash
+---------+-------+-------------+-----------+-------------+-------+-------------+-----------+------------+----------+
| relation|entity1|entity1_begin|entity1_end|       chunk1|entity2|entity2_begin|entity2_end|      chunk2|confidence|
+---------+-------+-------------+-----------+-------------+-------+-------------+-----------+------------+----------+
|DDI-false|   DRUG|            5|         17|carbamazepine|   DRUG|           62|         73|aripiprazole|0.91685396|
+---------+-------+-------------+-----------+-------------+-------+-------------+-----------+------------+----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|redl_drug_drug_interaction_biobert|
|Compatibility:|Healthcare NLP 2.7.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|

## Data Source

Trained on the DDI Extraction corpus.

## Benchmarking

```bash
Relation       Recall  Precision  F1     Support
DDI-advise     0.758   0.874      0.812  211
DDI-effect     0.759   0.754      0.756  348
DDI-false      0.977   0.957      0.967  4097
DDI-int        0.175   0.458      0.253  63
DDI-mechanism  0.783   0.853      0.816  281
Avg.           0.690   0.779      0.721
```

---
layout: model
title: Spanish Bert Embeddings (Base, Question, Allqa)
author: John Snow Labs
name: bert_embeddings_dpr_spanish_question_encoder_allqa_base
date: 2022-04-11
tags: [bert, embeddings, es, open_source]
task: Embeddings
language: es
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `dpr-spanish-question_encoder-allqa-base` is a Spanish model originally trained by `IIC`.
{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_dpr_spanish_question_encoder_allqa_base_es_3.4.2_3.0_1649671162169.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_dpr_spanish_question_encoder_allqa_base_es_3.4.2_3.0_1649671162169.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

embeddings = BertEmbeddings.pretrained("bert_embeddings_dpr_spanish_question_encoder_allqa_base","es") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])

data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_dpr_spanish_question_encoder_allqa_base","es")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))

val data = Seq("Me encanta chispa nlp").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("es.embed.dpr_spanish_question_encoder_allqa_base").predict("""Me encanta chispa nlp""")
```
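`BertEmbeddings` produces one vector per token (nested under the `embeddings` output column in the result DataFrame). When a single sentence-level vector is needed, a common approach — not something this annotator does for you — is to mean-pool the token vectors. A toy sketch with made-up 4-dimensional vectors (real vectors from this model are much larger, e.g. 768 dimensions):

```python
# Hypothetical per-token vectors; a real run would collect them from the
# "embeddings" column of the transformed DataFrame.
token_vectors = [
    [0.1, 0.2, 0.3, 0.4],   # "Me"
    [0.3, 0.0, 0.1, 0.2],   # "encanta"
    [0.2, 0.4, 0.2, 0.0],   # "chispa"
]

# Mean-pool: average each dimension across all tokens.
dim = len(token_vectors[0])
sentence_vector = [
    sum(vec[i] for vec in token_vectors) / len(token_vectors)
    for i in range(dim)
]
print(sentence_vector)  # each component is approximately 0.2
```

Mean pooling is only one option; max pooling or using a dedicated sentence-embedding annotator are alternatives depending on the downstream task.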
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_dpr_spanish_question_encoder_allqa_base|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|es|
|Size:|412.4 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/IIC/dpr-spanish-question_encoder-allqa-base
- https://arxiv.org/abs/2004.04906
- https://github.com/facebookresearch/DPR
- https://paperswithcode.com/sota?task=text+similarity&dataset=squad_es

---
layout: model
title: Modern Greek (1453-) asr_wav2vec2_large_xlsr_53_greek_by_lighteternal TFWav2Vec2ForCTC from lighteternal
author: John Snow Labs
name: pipeline_asr_wav2vec2_large_xlsr_53_greek_by_lighteternal
date: 2022-09-25
tags: [wav2vec2, el, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: el
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_greek_by_lighteternal` is a Modern Greek (1453-) model originally trained by lighteternal.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_greek_by_lighteternal_gpu.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_greek_by_lighteternal_el_4.2.0_3.0_1664105186590.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_greek_by_lighteternal_el_4.2.0_3.0_1664105186590.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_greek_by_lighteternal', lang = 'el')

annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_greek_by_lighteternal", lang = "el")

val annotations = pipeline.transform(audioDF)
```
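The `transform` call above assumes an `audioDF` whose audio column holds the recording as a sequence of floating-point samples; how that DataFrame is built is not shown in the card. A hedged sketch, synthesizing one second of a 440 Hz tone with the standard library in place of a real recording (the `spark.createDataFrame` step at the end is an assumption, shown only as a comment):

```python
import math

# One second of a 440 Hz sine wave at 16 kHz, the sample rate
# wav2vec2 XLSR models were trained on. Values stay in [-1.0, 1.0],
# matching normalized float audio.
sample_rate = 16_000
samples = [math.sin(2 * math.pi * 440 * t / sample_rate)
           for t in range(sample_rate)]

# With a Spark session available, this list would become the input DataFrame:
# audioDF = spark.createDataFrame([(samples,)], ["audio_content"])
```

In practice the float array would come from decoding a real audio file (e.g. with a library such as librosa or soundfile), resampled to 16 kHz if needed.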
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_greek_by_lighteternal|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|el|
|Size:|1.2 GB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: English XlmRoBertaForQuestionAnswering (from vesteinn)
author: John Snow Labs
name: xlm_roberta_qa_XLMr_ENIS_QA_IsQ_EnA
date: 2022-06-23
tags: [en, open_source, question_answering, xlmroberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: XlmRoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `XLMr-ENIS-QA-IsQ-EnA` is an English model originally trained by `vesteinn`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_XLMr_ENIS_QA_IsQ_EnA_en_4.0.0_3.0_1655984020629.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_XLMr_ENIS_QA_IsQ_EnA_en_4.0.0_3.0_1655984020629.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_XLMr_ENIS_QA_IsQ_EnA","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = XlmRoBertaForQuestionAnswering
    .pretrained("xlm_roberta_qa_XLMr_ENIS_QA_IsQ_EnA","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.xlmr_roberta").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_roberta_qa_XLMr_ENIS_QA_IsQ_EnA|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|457.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/vesteinn/XLMr-ENIS-QA-IsQ-EnA

---
layout: model
title: English RobertaForQuestionAnswering (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_4
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-32-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_4_en_4.0.0_3.0_1655732633509.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_4_en_4.0.0_3.0_1655732633509.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_4","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = RoBertaForQuestionAnswering
    .pretrained("roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_4","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.base_32d_seed_4").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_few_shot_k_32_finetuned_squad_seed_4|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|417.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-32-finetuned-squad-seed-4

---
layout: model
title: Part of Speech for Portuguese
author: John Snow Labs
name: pos_ud_bosque
date: 2021-03-08
tags: [part_of_speech, open_source, portuguese, pos_ud_bosque, pt]
task: Part of Speech Tagging
language: pt
edition: Spark NLP 3.0.0
spark_version: 3.0
supported: true
annotator: PerceptronModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`.

## Predicted Entities

- DET
- NOUN
- ADJ
- PROPN
- AUX
- ADP
- PUNCT
- NUM
- ADV
- VERB
- PRON
- CCONJ
- X
- SCONJ
- INTJ
- PART
- SYM

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_bosque_pt_3.0.0_3.0_1615230225146.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_bosque_pt_3.0.0_3.0_1615230225146.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

pos = PerceptronModel.pretrained("pos_ud_bosque", "pt") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    pos
])

example = spark.createDataFrame([['Olá de John Snow Labs! ']], ["text"])

result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val pos = PerceptronModel.pretrained("pos_ud_bosque", "pt")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))

val data = Seq("Olá de John Snow Labs! ").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu

text = ["Olá de John Snow Labs! "]
token_df = nlu.load('pt.pos').predict(text)
token_df
```
## Results

```bash
  token    pos
0   Olá  PROPN
1    de    ADP
2  John  PROPN
3  Snow  PROPN
4  Labs  PROPN
5     !  PUNCT
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pos_ud_bosque|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[pos]|
|Language:|pt|

---
layout: model
title: Pipeline to Detect Genes/Proteins (BC2GM) in Medical Texts
author: John Snow Labs
name: ner_biomedical_bc2gm_pipeline
date: 2023-03-14
tags: [bc2gm, ner, biomedical, gene_protein, gene, protein, en, licensed, clinical]
task: Named Entity Recognition
language: en
edition: Healthcare NLP 4.3.0
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [ner_biomedical_bc2gm](https://nlp.johnsnowlabs.com/2022/05/11/ner_biomedical_bc2gm_en_2_4.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_biomedical_bc2gm_pipeline_en_4.3.0_3.2_1678787724252.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_biomedical_bc2gm_pipeline_en_4.3.0_3.2_1678787724252.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("ner_biomedical_bc2gm_pipeline", "en", "clinical/models")

text = '''Immunohistochemical staining was positive for S-100 in all 9 cases stained, positive for HMB-45 in 9 (90%) of 10, and negative for cytokeratin in all 9 cases in which myxoid melanoma remained in the block after previous sections.'''

result = pipeline.fullAnnotate(text)
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = new PretrainedPipeline("ner_biomedical_bc2gm_pipeline", "en", "clinical/models")

val text = "Immunohistochemical staining was positive for S-100 in all 9 cases stained, positive for HMB-45 in 9 (90%) of 10, and negative for cytokeratin in all 9 cases in which myxoid melanoma remained in the block after previous sections."

val result = pipeline.fullAnnotate(text)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.biomedical_bc2gm.pipeline").predict("""Immunohistochemical staining was positive for S-100 in all 9 cases stained, positive for HMB-45 in 9 (90%) of 10, and negative for cytokeratin in all 9 cases in which myxoid melanoma remained in the block after previous sections.""")
```
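The `begin`/`end` values reported for each detected chunk correspond to zero-based, inclusive character offsets into the input text. A quick pure-Python check against the example sentence used above:

```python
# Same example text as in the fullAnnotate call above.
text = ("Immunohistochemical staining was positive for S-100 in all 9 cases "
        "stained, positive for HMB-45 in 9 (90%) of 10, and negative for "
        "cytokeratin in all 9 cases in which myxoid melanoma remained in the "
        "block after previous sections.")

offsets = {}
for chunk in ["S-100", "HMB-45", "cytokeratin"]:
    begin = text.find(chunk)
    end = begin + len(chunk) - 1  # inclusive end offset
    offsets[chunk] = (begin, end)
    print(chunk, begin, end)
```

Running this reproduces the offsets the pipeline reports for the three gene/protein chunks (46-50, 89-94 and 131-141), which is useful when mapping predictions back onto the original document.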
## Results

```bash
|    | ner_chunks  |   begin |   end | ner_label    |   confidence |
|---:|:------------|--------:|------:|:-------------|-------------:|
|  0 | S-100       |      46 |    50 | GENE_PROTEIN |       0.9911 |
|  1 | HMB-45      |      89 |    94 | GENE_PROTEIN |       0.9944 |
|  2 | cytokeratin |     131 |   141 | GENE_PROTEIN |       0.9951 |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_biomedical_bc2gm_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.3.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|

## Included Models

- DocumentAssembler
- SentenceDetectorDLModel
- TokenizerModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverterInternalModel

---
layout: model
title: Italian BertForMaskedLM Base Uncased model (from dbmdz)
author: John Snow Labs
name: bert_embeddings_base_italian_uncased
date: 2022-12-02
tags: [it, open_source, bert_embeddings, bertformaskedlm]
task: Embeddings
language: it
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-italian-uncased` is an Italian model originally trained by `dbmdz`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_italian_uncased_it_4.2.4_3.0_1670017957577.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_italian_uncased_it_4.2.4_3.0_1670017957577.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_italian_uncased","it") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_italian_uncased","it")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_italian_uncased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|it| |Size:|412.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/dbmdz/bert-base-italian-uncased - http://opus.nlpl.eu/ - https://traces1.inria.fr/oscar/ - https://github.com/dbmdz/berts/issues/7 - https://github.com/stefan-it/turkish-bert/tree/master/electra - https://github.com/stefan-it/italian-bertelectra - https://github.com/dbmdz/berts/issues/new --- layout: model title: Multilingual XLMRoBerta Embeddings (from hfl) author: John Snow Labs name: xlmroberta_embeddings_cino_base_v2 date: 2022-05-13 tags: [zh, ko, open_source, xlm_roberta, embeddings, xx, cino] task: Embeddings language: xx edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: XlmRoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `cino-base-v2` is a Multilingual model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_cino_base_v2_xx_3.4.4_3.0_1652439334973.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_cino_base_v2_xx_3.4.4_3.0_1652439334973.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_cino_base_v2","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_cino_base_v2","xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_embeddings_cino_base_v2| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|xx| |Size:|712.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/cino-base-v2 - https://github.com/ymcui/Chinese-Minority-PLM - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology --- layout: model title: Notice Clause NER Model author: John Snow Labs name: legner_notice_clause date: 2022-12-16 tags: [en, legal, ner, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an NER model aimed at notice clauses, retrieving entities such as NOTICE_METHOD, NOTICE_PARTY, ADDRESS, EMAIL, etc. Make sure you run this model only on notice clauses, after filtering them with `legclf_notice_clause`. ## Predicted Entities `ADDRESS`, `DEPARTMENT`, `EMAIL`, `FAX`, `NAME`, `NOTICE_METHOD`, `NOTICE_PARTY`, `PERSON`, `PHONE`, `TITLE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_notice_clause_en_1.0.0_3.0_1671211179919.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_notice_clause_en_1.0.0_3.0_1671211179919.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained("legner_notice_clause", "en", "legal/models") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = nlp.NerConverter() \ .setInputCols(["document","token","ner"]) \ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, ner_model, ner_converter ]) empty_df = spark.createDataFrame([['']]).toDF("text") model = pipeline.fit(empty_df) data = spark.createDataFrame([["""Source: FUELCELL ENERGY INC, 8-K, 11/6/2019 ExxonMobil: ExxonMobil Research and Engineering Company 1545 Route 22 East Annandale, NJ 08801-0900 Attention: Timothy Barckholtz, Senior Scientific Advisor Email: tim.barckholtz@exxonmobil.com FCE: FuelCell Energy, Inc. 782"""]]).toDF("text") result = model.transform(data) ```
## Results ```bash +---------------------------------------------------------------------+------------+ |ner_chunk |label | +---------------------------------------------------------------------+------------+ |ExxonMobil |NOTICE_PARTY| |ExxonMobil Research and Engineering Company |NAME | |1545 Route 22 East Annandale, NJ 08801-0900 |ADDRESS | |Timothy Barckholtz |PERSON | |Senior Scientific Advisor |TITLE | |tim.barckholtz@exxonmobil.com |EMAIL | +---------------------------------------------------------------------+------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_notice_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|1.1 MB| ## References In-house dataset ## Benchmarking ```bash label precision recall f1-score support ADDRESS 0.86 0.94 0.90 141 DEPARTMENT 0.75 0.27 0.40 11 EMAIL 0.92 1.00 0.96 48 FAX 0.65 0.88 0.75 51 NAME 0.78 0.79 0.79 140 NOTICE_METHOD 0.74 0.80 0.77 353 NOTICE_PARTY 0.77 0.85 0.81 103 PERSON 0.91 0.94 0.92 114 PHONE 0.60 0.47 0.53 19 TITLE 0.76 0.90 0.82 80 micro-avg 0.78 0.85 0.81 1060 macro-avg 0.77 0.79 0.77 1060 weighted-avg 0.79 0.85 0.81 1060 ``` --- layout: model title: Legal Organization Clause Binary Classifier author: John Snow Labs name: legclf_organization_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `organization` clause type. To use this model, make sure you provide enough context as an input. 
Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `organization` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_organization_clause_en_1.0.0_3.2_1660123788612.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_organization_clause_en_1.0.0_3.2_1660123788612.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
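The paragraph-splitting approach mentioned above can be sketched in plain Python (a minimal illustration only, not the workshop notebook's code; the `split_paragraphs` helper and the whitespace-based token approximation are assumptions made for this example):

```python
import re

def split_paragraphs(text, max_tokens=512):
    # Split a document into paragraphs on blank lines (multiline splitting)
    # and flag whether each piece stays within an approximate token budget.
    # Real subword tokenizers count differently; whitespace splitting is
    # only a rough proxy used for this sketch.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(p, len(p.split()) <= max_tokens) for p in paragraphs]

doc = "First clause text.\n\nSecond clause text,\nstill the same paragraph.\n\nThird clause."
chunks = split_paragraphs(doc)
print(len(chunks))  # 3
```

Each resulting piece can then be fed to the classifier as its own row, so every candidate clause gets its own True/False prediction.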
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_organization_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +--------------+ | result| +--------------+ |[organization]| | [other]| | [other]| |[organization]| +--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_organization_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.2 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support organization 0.99 0.91 0.95 100 other 0.96 1.00 0.98 207 accuracy - - 0.97 307 macro-avg 0.97 0.95 0.96 307 weighted-avg 0.97 0.97 0.97 307 ``` --- layout: model title: Chinese Bert Embeddings (4-layer) author: John Snow Labs name: bert_embeddings_rbt4 date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `rbt4` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt4_zh_3.4.2_3.0_1649669853254.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_rbt4_zh_3.4.2_3.0_1649669853254.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_rbt4","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_rbt4","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.rbt4").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_rbt4| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|171.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/rbt4 - https://arxiv.org/abs/1906.08101 - https://github.com/google-research/bert - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://arxiv.org/abs/2004.13922 --- layout: model title: English asr_wav2vec2_base_test TFWav2Vec2ForCTC from cahya author: John Snow Labs name: asr_wav2vec2_base_test date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_test` is an English model originally trained by cahya. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_test_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_test_en_4.2.0_3.0_1664035642510.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_test_en_4.2.0_3.0_1664035642510.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_test", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_test", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_test| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|348.7 MB| --- layout: model title: Dispute Clauses NER author: John Snow Labs name: legner_dispute_clauses date: 2023-01-18 tags: [en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Legal NER model which helps to retrieve Courts/Arbitrations, Rules and Resolution Means from legal agreements. ## Predicted Entities `RESOLUT_MEANS`, `RULES_NAME`, `COURT_NAME` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_dispute_clauses_en_1.0.0_3.0_1674054944954.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_dispute_clauses_en_1.0.0_3.0_1674054944954.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained("legner_dispute_clauses","en","legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""The contract includes a dispute clause that requires the parties to follow the rules of judicial arbitration set forth by the United Nations Commission on International Trade Law (UNCITRAL) Rules of Arbitration and the jurisdiction of the International Chamber of Commerce court in the event of a dispute."""] res = model.transform(spark.createDataFrame([text]).toDF("text")) ```
## Results ```bash +-------------+---------------+ | token| ner_label| +-------------+---------------+ | The| O| | contract| O| | includes| O| | a| O| | dispute| O| | clause| O| | that| O| | requires| O| | the| O| | parties| O| | to| O| | follow| O| | the| O| | rules| O| | of| O| | judicial|B-RESOLUT_MEANS| | arbitration|I-RESOLUT_MEANS| | set| O| | forth| O| | by| O| | the| O| | United| B-RULES_NAME| | Nations| I-RULES_NAME| | Commission| I-RULES_NAME| | on| I-RULES_NAME| |International| I-RULES_NAME| | Trade| I-RULES_NAME| | Law| I-RULES_NAME| | (| I-RULES_NAME| | UNCITRAL| I-RULES_NAME| | )| I-RULES_NAME| | Rules| I-RULES_NAME| | of| I-RULES_NAME| | Arbitration| I-RULES_NAME| | and| O| | the| O| | jurisdiction| O| | of| O| | the| O| |International| B-COURT_NAME| | Chamber| I-COURT_NAME| | of| I-COURT_NAME| | Commerce| I-COURT_NAME| | court| O| | in| O| | the| O| | event| O| | of| O| | a| O| | dispute| O| | .| O| +-------------+---------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_dispute_clauses| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.2 MB| ## References In-house annotations of the CUAD dataset ## Benchmarking ```bash label tp fp fn prec rec f1 B-RESOLUT_MEANS 14 4 6 0.7777778 0.7 0.73684216 B-RULES_NAME 15 0 5 1.0 0.75 0.85714287 I-RESOLUT_MEANS 12 0 3 1.0 0.8 0.88888896 B-COURT_NAME 26 6 6 0.8125 0.8125 0.8125 I-RULES_NAME 101 7 19 0.9351852 0.84166664 0.8859649 I-COURT_NAME 166 23 24 0.87830687 0.8736842 0.87598944 Macro-average 334 40 63 0.9006283 0.7963085 0.8452619 Micro-average 334 40 63 0.8930481 0.84130985 0.8664072 ``` --- layout: model title: Chinese Bert Embeddings (Base, Wobert model) author: John Snow Labs name: bert_embeddings_wobert_chinese_base date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 
spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `wobert_chinese_base` is a Chinese model originally trained by `junnyu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_wobert_chinese_base_zh_3.4.2_3.0_1649669889842.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_wobert_chinese_base_zh_3.4.2_3.0_1649669889842.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_wobert_chinese_base","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_wobert_chinese_base","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.wobert_chinese_base").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_wobert_chinese_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|419.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/junnyu/wobert_chinese_base - https://github.com/ZhuiyiTechnology/WoBERT - https://github.com/JunnYu/WoBERT_pytorch --- layout: model title: Turkish Named Entity Recognition (from winvoker) author: John Snow Labs name: bert_ner_bert_base_turkish_cased_ner_tf date: 2022-05-09 tags: [bert, ner, token_classification, tr, open_source] task: Named Entity Recognition language: tr edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-turkish-cased-ner-tf` is a Turkish model originally trained by `winvoker`. ## Predicted Entities `LOC`, `PER`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_turkish_cased_ner_tf_tr_3.4.2_3.0_1652099181398.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_base_turkish_cased_ner_tf_tr_3.4.2_3.0_1652099181398.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_turkish_cased_ner_tf","tr") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Spark NLP'yi seviyorum"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_base_turkish_cased_ner_tf","tr") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Spark NLP'yi seviyorum").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_bert_base_turkish_cased_ner_tf| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|tr| |Size:|412.9 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/winvoker/bert-base-turkish-cased-ner-tf - https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt - https://ieeexplore.ieee.org/document/7495744 --- layout: model title: English image_classifier_vit_rust_image_classification_6 ViTForImageClassification from SummerChiam author: John Snow Labs name: image_classifier_vit_rust_image_classification_6 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rust_image_classification_6` is an English model originally trained by SummerChiam. ## Predicted Entities `nonrust`, `rust` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_6_en_4.1.0_3.0_1660168572229.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_6_en_4.1.0_3.0_1660168572229.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_rust_image_classification_6", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_rust_image_classification_6", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_rust_image_classification_6| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English BertForQuestionAnswering Large Cased model (from LucasS) author: John Snow Labs name: bert_qa_bertlargeabsa date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bertLargeABSA` is an English model originally trained by `LucasS`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bertlargeabsa_en_4.0.0_3.0_1657188817667.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bertlargeabsa_en_4.0.0_3.0_1657188817667.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bertlargeabsa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bertlargeabsa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bertlargeabsa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/LucasS/bertLargeABSA --- layout: model title: Multilingual BertForQuestionAnswering Cased model (from roshnir) author: John Snow Labs name: bert_qa_mbert_finetuned_mlqa_en_zh_hi_dev date: 2022-07-07 tags: [xx, open_source, bert, question_answering] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mBert-finetuned-mlqa-dev-en-zh-hi` is a Multilingual model originally trained by `roshnir`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_en_zh_hi_dev_xx_4.0.0_3.0_1657190064489.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_mbert_finetuned_mlqa_en_zh_hi_dev_xx_4.0.0_3.0_1657190064489.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_en_zh_hi_dev","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE"]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_mbert_finetuned_mlqa_en_zh_hi_dev","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("PUT YOUR QUESTION HERE", "PUT YOUR CONTEXT HERE")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_mbert_finetuned_mlqa_en_zh_hi_dev| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|626.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/roshnir/mBert-finetuned-mlqa-dev-en-zh-hi --- layout: model title: Relation Extraction between anatomical entities and other clinical entities author: John Snow Labs name: re_oncology_location_wip date: 2022-09-27 tags: [licensed, clinical, oncology, en, relation_extraction, anatomy] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: RelationExtractionModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This relation extraction model links extractions from anatomical entities (such as Site_Breast or Site_Lung) to other clinical entities (such as Tumor_Finding or Cancer_Surgery). ## Predicted Entities `is_location_of`, `O` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_oncology_location_wip_en_4.0.0_3.0_1664301554732.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_oncology_location_wip_en_4.0.0_3.0_1664301554732.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use relation pairs to include only the combinations of entities that are relevant in your case.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") re_model = RelationExtractionModel.pretrained("re_oncology_location_wip", "en", "clinical/models") \ .setInputCols(["embeddings", "pos_tags", "ner_chunk", "dependencies"]) \ .setOutputCol("relation_extraction") \ .setRelationPairs(["Tumor_Finding-Site_Breast", "Site_Breast-Tumor_Finding","Tumor_Finding-Anatomical_Site", "Anatomical_Site-Tumor_Finding"]) \ .setMaxSyntacticDistance(10) pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_model]) data = spark.createDataFrame([["In April 2011, she first noticed a lump in her right breast."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = 
SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentence", "pos_tags", "token")) .setOutputCol("dependencies") val re_model = RelationExtractionModel.pretrained("re_oncology_location_wip", "en", "clinical/models") .setInputCols(Array("embeddings", "pos_tags", "ner_chunk", "dependencies")) .setOutputCol("relation_extraction") .setRelationPairs(Array("Tumor_Finding-Site_Breast", "Site_Breast-Tumor_Finding","Tumor_Finding-Anatomical_Site", "Anatomical_Site-Tumor_Finding")) .setMaxSyntacticDistance(10) val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_model)) val data = Seq("In April 2011, she first noticed a lump in her right breast.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.oncology_location_wip").predict("""In April 2011, she first noticed a lump in her right breast.""") ```
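The `setRelationPairs` call above restricts candidate relations to the listed entity-type pairs, and `setMaxSyntacticDistance` prunes pairs that are too far apart. Conceptually the filter behaves like the following pure-Python sketch (illustrative only — the function name, the flat `chunks` structure, and the distance check are simplified assumptions, not the Spark NLP internals):

```python
# Hypothetical sketch of relation-pair filtering; NOT Spark NLP internals.
def candidate_pairs(chunks, allowed_pairs, max_distance):
    """chunks: list of (entity_type, token_position); allowed_pairs: {'A-B', ...}."""
    allowed = {tuple(p.split("-")) for p in allowed_pairs}
    out = []
    for i, (t1, p1) in enumerate(chunks):
        for t2, p2 in chunks[i + 1:]:
            # Keep only pairs whose types are whitelisted and close enough.
            if (t1, t2) in allowed and abs(p1 - p2) <= max_distance:
                out.append((t1, t2))
    return out

chunks = [("Tumor_Finding", 8), ("Site_Breast", 11), ("Date", 2)]
pairs = candidate_pairs(chunks, {"Tumor_Finding-Site_Breast"}, max_distance=10)
# Only the Tumor_Finding/Site_Breast combination survives the filter.
```

Only the surviving pairs are passed to the relation classifier, which is why restricting the pairs both speeds up inference and removes irrelevant predictions.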
## Results ```bash +--------------+-------------+-------------+-----------+------+---------------+-------------+-----------+------+----------+ | relation| entity1|entity1_begin|entity1_end|chunk1| entity2|entity2_begin|entity2_end|chunk2|confidence| +--------------+-------------+-------------+-----------+------+---------------+-------------+-----------+------+----------+ |is_location_of|Tumor_Finding| 35| 38| lump|Anatomical_Site| 53| 58|breast|0.81353307| +--------------+-------------+-------------+-----------+------+---------------+-------------+-----------+------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_oncology_location_wip| |Type:|re| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]| |Output Labels:|[relations]| |Language:|en| |Size:|266.7 KB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label recall precision f1 O 0.83 0.93 0.88 is_location_of 0.93 0.84 0.88 macro-avg 0.88 0.88 0.88 ``` --- layout: model title: Recognize Entities OntoNotes - ELECTRA Small author: John Snow Labs name: onto_recognize_entities_electra_small date: 2020-12-09 task: [Named Entity Recognition, Sentence Detection, Embeddings, Pipeline Public] language: en nav_key: models edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [en, open_source, pipeline] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A pre-trained pipeline containing a NerDL model trained on OntoNotes 5.0 with `electra_small_uncased` embeddings. It can extract the following 18 entities: ## Predicted Entities `CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_electra_small_en_2.7.0_2.4_1607511710029.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_electra_small_en_2.7.0_2.4_1607511710029.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('onto_recognize_entities_electra_small') result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("onto_recognize_entities_electra_small") val result = pipeline.annotate("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament.") ``` {:.nlu-block} ```python import nlu text = ["""Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament."""] ner_df = nlu.load('en.ner.onto.electra.small').predict(text, output_level='chunk') ner_df[["entities", "entities_class"]] ```
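Inside this pipeline, the NerConverter stage merges the token-level IOB tags produced by the NER model into the entity chunks shown in the Results section. A minimal, self-contained sketch of that merge (illustrative only, not the actual Spark NLP implementation):

```python
def iob_to_chunks(tokens, tags):
    """Merge token-level IOB tags (B-X / I-X / O) into (chunk, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" tag or inconsistent continuation: close any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Johnson", "first", "entered", "politics", "in", "2001"]
tags = ["B-PERSON", "B-ORDINAL", "O", "O", "O", "B-DATE"]
# → [("Johnson", "PERSON"), ("first", "ORDINAL"), ("2001", "DATE")]
```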
{:.h2_title} ## Results ```bash +------------+---------+ |chunk |ner_label| +------------+---------+ |Johnson |PERSON | |first |ORDINAL | |2001 |DATE | |eight years |DATE | |London |GPE | |2008 to 2016|DATE | +------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_recognize_entities_electra_small| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|en| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - Tokenizer - BertEmbeddings - NerDLModel - NerConverter --- layout: model title: German asr_exp_w2v2t_r_wav2vec2_s466 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2t_r_wav2vec2_s466 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2t_r_wav2vec2_s466` is a German model originally trained by jonatasgrosman. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_exp_w2v2t_r_wav2vec2_s466_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2t_r_wav2vec2_s466_de_4.2.0_3.0_1664108739065.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2t_r_wav2vec2_s466_de_4.2.0_3.0_1664108739065.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2t_r_wav2vec2_s466", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2t_r_wav2vec2_s466", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
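Wav2Vec2ForCTC emits one symbol distribution per audio frame, and the transcript is obtained by CTC decoding: collapse consecutive repeats, then drop the blank token. A minimal greedy-decoding sketch (illustrative of the idea only, not the Spark NLP internals; the toy vocabulary is made up):

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Standard greedy CTC decoding: collapse repeats, then drop blanks."""
    collapsed, prev = [], None
    for fid in frame_ids:
        if fid != prev:          # collapse runs of the same symbol
            collapsed.append(fid)
        prev = fid
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)

vocab = {0: "<blank>", 1: "h", 2: "a", 3: "l", 4: "o"}
# frames: h h a <blank> l l l <blank> l o  →  "hallo"
frames = [1, 1, 2, 0, 3, 3, 3, 0, 3, 4]
assert ctc_greedy_decode(frames, vocab) == "hallo"
```

The blank between the two `l` runs is what lets CTC keep a genuine double letter while still collapsing repeated frames of a single letter.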
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2t_r_wav2vec2_s466| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: English BertForMaskedLM Base Uncased model (from model-attribution-challenge) author: John Snow Labs name: bert_embeddings_model_attribution_challenge_base_uncased date: 2022-12-02 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased` is an English model originally trained by `model-attribution-challenge`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_model_attribution_challenge_base_uncased_en_4.2.4_3.0_1670019231783.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_model_attribution_challenge_base_uncased_en_4.2.4_3.0_1670019231783.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_model_attribution_challenge_base_uncased","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_model_attribution_challenge_base_uncased","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_model_attribution_challenge_base_uncased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|409.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/model-attribution-challenge/bert-base-uncased - https://arxiv.org/abs/1810.04805 - https://github.com/google-research/bert - https://github.com/google-research/bert/blob/master/README.md - https://yknzhu.wixsite.com/mbweb - https://en.wikipedia.org/wiki/English_Wikipedia --- layout: model title: Dutch RoBERTa Embeddings (from CLTL) author: John Snow Labs name: roberta_embeddings_MedRoBERTa.nl date: 2022-04-14 tags: [roberta, embeddings, nl, open_source] task: Embeddings language: nl edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `MedRoBERTa.nl` is a Dutch model originally trained by `CLTL`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_MedRoBERTa.nl_nl_3.4.2_3.0_1649949062891.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_MedRoBERTa.nl_nl_3.4.2_3.0_1649949062891.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_MedRoBERTa.nl","nl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ik hou van vonk nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_MedRoBERTa.nl","nl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ik hou van vonk nlp").toDF("text") val result = pipeline.fit(data).transform(data) ```
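The pipeline above leaves one dense vector per token in the `embeddings` column. A common downstream use is comparing tokens by cosine similarity; the following self-contained sketch shows the computation on toy 3-dimensional vectors (real MedRoBERTa.nl vectors have several hundred dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for token embeddings.
assert abs(cosine([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]) - 1.0) < 1e-9  # identical
assert abs(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])) < 1e-9        # orthogonal
```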
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_MedRoBERTa.nl| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|nl| |Size:|472.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/CLTL/MedRoBERTa.nl - https://github.com/cltl-students/verkijk_stella_rma_thesis_dutch_medical_language_model --- layout: model title: Turkish BertForTokenClassification Cased model (from alierenak) author: John Snow Labs name: bert_token_classifier_berturk_cased_ner date: 2022-11-30 tags: [tr, open_source, bert, token_classification, ner, tensorflow] task: Named Entity Recognition language: tr edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `berturk-cased-ner` is a Turkish model originally trained by `alierenak`. ## Predicted Entities `LOCATION`, `ORGANIZATION`, `PERSON` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_berturk_cased_ner_tr_4.2.4_3.0_1669815532606.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_berturk_cased_ner_tr_4.2.4_3.0_1669815532606.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_berturk_cased_ner","tr") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_berturk_cased_ner","tr") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_berturk_cased_ner| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|tr| |Size:|412.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/alierenak/berturk-cased-ner - https://www.cambridge.org/core/journals/natural-language-engineering/article/abs/statistical-information-extraction-system-for-turkish/7C288FAFC71D5F0763C1F8CE66464017 - https://aclanthology.org/P11-3019 - https://data.tdd.ai/#/effafb5f-ebfc-4e5c-9a63-4f709ec1a135 --- layout: model title: ALBERT XLarge CoNLL-03 NER Pipeline author: ahmedlone127 name: albert_xlarge_token_classifier_conll03_pipeline date: 2022-06-14 tags: [open_source, ner, token_classifier, albert, conll03, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: false article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [albert_xlarge_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/26/albert_xlarge_token_classifier_conll03_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/albert_xlarge_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655219194773.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/albert_xlarge_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655219194773.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("albert_xlarge_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("albert_xlarge_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PER | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_xlarge_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Community| |Language:|en| |Size:|206.5 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - AlbertForTokenClassification - NerConverter - Finisher --- layout: model title: Gender Classifier (BERT) author: John Snow Labs name: bert_sequence_classifier_gender_biobert date: 2022-02-08 tags: [bert, sequence_classification, en, licensed] task: Text Classification language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model classifies the gender of a patient in a clinical document using context. This model is a [BioBERT-based](https://github.com/dmis-lab/biobert) classifier. ## Predicted Entities `Female`, `Male`, `Unknown` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_gender_biobert_en_3.4.1_3.0_1644317917385.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_gender_biobert_en_3.4.1_3.0_1644317917385.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_gender_biobert", "en", "clinical/models")\ .setInputCols(["document","token"]) \ .setOutputCol("class") \ .setCaseSensitive(True) \ .setMaxSentenceLength(512) pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame([["The patient took Advil and he experienced an adverse reaction."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_gender_biobert", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq("The patient took Advil and he experienced an adverse reaction.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.gender.seq_biobert").predict("""The patient took Advil and he experienced an adverse reaction.""") ```
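The classifier head produces one logit per class (`Female`, `Male`, `Unknown`), and the predicted label is the argmax after a softmax. A toy sketch of that final step (the logit values and label order here are made up for illustration, not taken from the model):

```python
import math

LABELS = ["Female", "Male", "Unknown"]  # illustrative label order

def predict(logits):
    """Softmax over logits, then pick the highest-probability label."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]   # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return LABELS[probs.index(max(probs))], probs

label, probs = predict([0.2, 3.1, -0.5])  # hypothetical logits
assert label == "Male"
assert abs(sum(probs) - 1.0) < 1e-9
```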
## Results ```bash +---------------------------------------------------------------+------+ |text |result| +---------------------------------------------------------------+------+ |The patient took Advil and he experienced an adverse reaction. |[Male]| +---------------------------------------------------------------+------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_gender_biobert| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.0 MB| |Case sensitive:|true| |Max sentence length:|128| ## References This model is trained on more than four thousand clinical documents (radiology reports, pathology reports, clinical visits, etc.) annotated internally. ## Benchmarking ```bash label precision recall f1-score support Female 0.94 0.94 0.94 479 Male 0.88 0.86 0.87 245 Unknown 0.73 0.78 0.76 102 accuracy 0.89 0.89 0.89 826 macro-avg 0.85 0.86 0.85 826 weighted-avg 0.90 0.89 0.90 826 ``` --- layout: model title: English T5ForConditionalGeneration Base Cased model (from pszemraj) author: John Snow Labs name: t5_base_askscience date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-askscience` is an English model originally trained by `pszemraj`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_base_askscience_en_4.3.0_3.0_1675108079371.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_base_askscience_en_4.3.0_3.0_1675108079371.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_base_askscience","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_base_askscience","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_base_askscience| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|1.0 GB| ## References - https://huggingface.co/pszemraj/t5-base-askscience --- layout: model title: English BertForQuestionAnswering Large Uncased model (from tli8hf) author: John Snow Labs name: bert_qa_unqover_large_uncased_newsqa date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `unqover-bert-large-uncased-newsqa` is an English model originally trained by `tli8hf`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_unqover_large_uncased_newsqa_en_4.0.0_3.0_1657193094308.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_unqover_large_uncased_newsqa_en_4.0.0_3.0_1657193094308.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_unqover_large_uncased_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_unqover_large_uncased_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
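Extractive QA models like this one score every context token as a possible answer start and answer end; the returned answer is the best-scoring valid span. A simplified sketch of that selection step (pure Python with made-up scores; real models add more constraints, such as excluding question tokens):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick (i, j) maximizing start_scores[i] + end_scores[j] with i <= j < i + max_len."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best, best_score = (i, j), s + end_scores[j]
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Hypothetical per-token scores peaking on "Clara".
start = [0.1, 0.0, 0.2, 4.0, 0.0, 0.0, 0.0, 0.0, 0.3, 0.0]
end   = [0.0, 0.1, 0.0, 3.5, 0.0, 0.0, 0.0, 0.0, 0.4, 0.0]
i, j = best_span(start, end)
assert tokens[i:j + 1] == ["Clara"]
```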
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_unqover_large_uncased_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/tli8hf/unqover-bert-large-uncased-newsqa --- layout: model title: Stopwords Remover for Sinhalese language (191 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, si, open_source] task: Stop Words Removal language: si edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_si_3.4.1_3.0_1646673160202.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_si_3.4.1_3.0_1646673160202.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","si") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["ඔබ මට වඩා හොඳ නැත"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","si") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("ඔබ මට වඩා හොඳ නැත").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("si.stopwords").predict("""ඔබ මට වඩා හොඳ නැත""") ```
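Conceptually, the `StopWordsCleaner` stage above is token filtering: it drops tokens found in a fixed stop-word list. A minimal pure-Python sketch of the idea follows; the single-entry `STOP_WORDS` set is a hypothetical illustration, not the 191-entry Sinhalese list bundled with `stopwords_iso`.

```python
# Illustrative only: a one-entry, hypothetical stop-word set, not the
# 191-entry ISO list shipped with the model.
STOP_WORDS = {"වඩා"}

def clean_tokens(tokens):
    # Keep every token not in the stop-word set, preserving order.
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_tokens("ඔබ මට වඩා හොඳ නැත".split()))
# ['ඔබ', 'මට', 'හොඳ', 'නැත']
```

With this particular entry the output matches the `cleanTokens` result shown in the Results section below, though the real model applies its full list.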
## Results ```bash +------------------+ |result | +------------------+ |[ඔබ, මට, හොඳ, නැත]| +------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|si| |Size:|2.2 KB| --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_roberta_twostagequadruplet_hier_epochs_1_shard_1_squad2.0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_twostagequadruplet_hier_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_twostagequadruplet_hier_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739733256.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_twostagequadruplet_hier_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739733256.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_roberta_twostagequadruplet_hier_epochs_1_shard_1_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_rule_based_roberta_twostagequadruplet_hier_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base_rule_based_twostagequadruplet_hier_epochs_1_shard_1.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_roberta_twostagequadruplet_hier_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|306.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_twostagequadruplet_hier_epochs_1_shard_1_squad2.0 --- layout: model title: Japanese DistilBERT Embeddings (from Geotrend) author: John Snow Labs name: distilbert_embeddings_distilbert_base_ja_cased date: 2022-04-12 tags: [distilbert, embeddings, ja, open_source] task: Embeddings language: ja edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `distilbert-base-ja-cased` is a Japanese model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_ja_cased_ja_3.4.2_3.0_1649783582188.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_distilbert_base_ja_cased_ja_3.4.2_3.0_1649783582188.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_ja_cased","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["私はSpark NLPを愛しています"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_distilbert_base_ja_cased","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("私はSpark NLPを愛しています").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ja.embed.distilbert_base_ja_cased").predict("""私はSpark NLPを愛しています""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_distilbert_base_ja_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Size:|188.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/distilbert-base-ja-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: English BertForMaskedLM Cased model (from bioformers) author: John Snow Labs name: bert_embeddings_bioformer_cased_v1.0 date: 2022-12-02 tags: [en, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bioformer-cased-v1.0` is an English model originally trained by `bioformers`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bioformer_cased_v1.0_en_4.2.4_3.0_1670020877701.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bioformer_cased_v1.0_en_4.2.4_3.0_1670020877701.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_bioformer_cased_v1.0","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_bioformer_cased_v1.0","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bioformer_cased_v1.0| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|159.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/bioformers/bioformer-cased-v1.0 - https://allenai.github.io/scispacy/ - https://biocreative.bioinformatics.udel.edu/media/store/files/2021/TRACK5_pos_1_BC7_submission_221.pdf - https://github.com/WGLab/bioformer/issues --- layout: model title: Spanish Part of Speech Tagger (from mrm8488) author: John Snow Labs name: bert_pos_bert_spanish_cased_finetuned_pos_syntax date: 2022-05-09 tags: [bert, pos, part_of_speech, es, open_source] task: Part of Speech Tagging language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-spanish-cased-finetuned-pos-syntax` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_spanish_cased_finetuned_pos_syntax_es_3.4.2_3.0_1652091508354.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_spanish_cased_finetuned_pos_syntax_es_3.4.2_3.0_1652091508354.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_spanish_cased_finetuned_pos_syntax","es") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_spanish_cased_finetuned_pos_syntax","es") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_spanish_cased_finetuned_pos_syntax| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|es| |Size:|410.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/mrm8488/bert-spanish-cased-finetuned-pos-syntax - https://github.com/dccuchile/beto - https://www.kaggle.com/nltkdata/conll-corpora - https://twitter.com/mrm8488 --- layout: model title: Legal Ownership Clause Binary Classifier author: John Snow Labs name: legclf_ownership_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `ownership` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
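The paragraph splitting and 512-token guard described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the workshop's implementation: the blank-line split is the "by multiline" technique, and the whitespace word count is only a rough stand-in for the model's real subword tokenizer.

```python
import re

def split_paragraphs(text):
    # Split on one or more blank lines ("multiline" paragraph splitting),
    # dropping empty fragments.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def chunk_by_words(paragraph, max_words=512):
    # Rough guard for the 512-token embedding limit; a whitespace word
    # count only approximates the subword token count.
    words = paragraph.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# Hypothetical two-clause document for illustration.
doc = "1. OWNERSHIP. All materials remain the property of the Company.\n\n2. REMEDIES. The parties agree that damages may be inadequate."
print(len(split_paragraphs(doc)))  # 2
```

Each resulting piece can then be fed to the classifier as its own `clause_text` row.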
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `ownership` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_ownership_clause_en_1.0.0_3.2_1660123796143.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_ownership_clause_en_1.0.0_3.2_1660123796143.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_ownership_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-----------+ |     result| +-----------+ |[ownership]| |[other]| |[other]| |[ownership]| +-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_ownership_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.95 0.95 0.95 59 ownership 0.91 0.91 0.91 32 accuracy - - 0.93 91 macro-avg 0.93 0.93 0.93 91 weighted-avg 0.93 0.93 0.93 91 ``` --- layout: model title: Legal Remedies Clause Binary Classifier author: John Snow Labs name: legclf_remedies_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `remedies` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `remedies` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_remedies_clause_en_1.0.0_3.2_1660123931699.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_remedies_clause_en_1.0.0_3.2_1660123931699.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_remedies_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +----------+ |    result| +----------+ |[remedies]| |[other]| |[other]| |[remedies]| +----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_remedies_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.95 0.96 0.96 56 remedies 0.92 0.88 0.90 26 accuracy - - 0.94 82 macro-avg 0.93 0.92 0.93 82 weighted-avg 0.94 0.94 0.94 82 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from eAsyle) author: John Snow Labs name: roberta_qa_base_custom date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta_base_custom_QA` is an English model originally trained by `eAsyle`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_custom_en_4.3.0_3.0_1674223065625.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_custom_en_4.3.0_3.0_1674223065625.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_custom","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_custom","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_custom| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|424.1 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/eAsyle/roberta_base_custom_QA --- layout: model title: Detect Living Species(embeddings_scielo_300d) author: John Snow Labs name: ner_living_species_300 date: 2022-07-26 tags: [es, ner, licensed, clinical] task: Named Entity Recognition language: es edition: Healthcare NLP 3.5.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract living species from clinical texts in Spanish, which is critical to scientific disciplines like medicine, biology, ecology/biodiversity, nutrition and agriculture. This model is trained using `embeddings_scielo_300d` embeddings. It is trained on the [LivingNER](https://temu.bsc.es/livingner/) corpus that is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. ## Predicted Entities `HUMAN`, `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_300_es_3.5.0_3.0_1658876470162.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_300_es_3.5.0_3.0_1658876470162.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("embeddings_scielo_300d","es","clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_living_species_300", "es","clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. 
Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("embeddings_scielo_300d","es","clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_living_species_300", "es","clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. 
Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.med_ner.living_species.300").predict("""Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.""") ```
## Results ```bash +--------------+-------+ |ner_chunk |label | +--------------+-------+ |Lactante varón|HUMAN | |familiares |HUMAN | |personales |HUMAN | |neonatal |HUMAN | |legumbres |SPECIES| |lentejas |SPECIES| |garbanzos |SPECIES| |legumbres |SPECIES| |madre |HUMAN | |Cacahuete |SPECIES| |padres |HUMAN | +--------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_300| |Compatibility:|Healthcare NLP 3.5.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|15.2 MB| ## References https://temu.bsc.es/livingner/ ## Benchmarking ```bash label precision recall f1-score support B-HUMAN 0.98 0.97 0.98 3281 B-SPECIES 0.94 0.98 0.96 3712 I-HUMAN 0.87 0.81 0.84 297 I-SPECIES 0.79 0.89 0.84 1732 micro-avg 0.92 0.95 0.94 9022 macro-avg 0.90 0.91 0.90 9022 weighted-avg 0.93 0.95 0.94 9022 ``` --- layout: model title: English DistilBertForTokenClassification Cased model (from Neurona) author: John Snow Labs name: distilbert_token_classifier_cpener_test date: 2023-03-06 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cpener-test` is an English model originally trained by `Neurona`. 
## Predicted Entities `cpe_version`, `cpe_product`, `cpe_vendor` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_cpener_test_en_4.3.1_3.0_1678133810294.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_cpener_test_en_4.3.1_3.0_1678133810294.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_cpener_test","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_cpener_test","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_cpener_test| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Neurona/cpener-test --- layout: model title: Abkhazian asr_xls_r_ab_test_by_pablouribe TFWav2Vec2ForCTC from pablouribe author: John Snow Labs name: pipeline_asr_xls_r_ab_test_by_pablouribe date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_ab_test_by_pablouribe` is an Abkhazian model originally trained by pablouribe. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_xls_r_ab_test_by_pablouribe_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_ab_test_by_pablouribe_ab_4.2.0_3.0_1664021656355.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_ab_test_by_pablouribe_ab_4.2.0_3.0_1664021656355.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_xls_r_ab_test_by_pablouribe', lang = 'ab') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_xls_r_ab_test_by_pablouribe", lang = "ab") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_xls_r_ab_test_by_pablouribe| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|ab| |Size:|451.4 KB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English asr_wav2vec2_base_960h TFWav2Vec2ForCTC from facebook author: John Snow Labs name: asr_wav2vec2_base_960h date: 2022-09-23 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_960h` is an English model originally trained by facebook. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_960h_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_960h_en_4.2.0_3.0_1663934856069.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_960h_en_4.2.0_3.0_1663934856069.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_960h", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_960h", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
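Inside the annotator, Wav2Vec2ForCTC turns per-frame character predictions into text via CTC decoding: repeated symbols are merged and the blank token is dropped. As a rough illustration of the greedy variant of that step (a standalone sketch, not Spark NLP's internal implementation; the token ids and vocabulary are hypothetical):

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Standard CTC greedy decoding: merge consecutive repeats,
    then drop the blank token."""
    out = []
    prev = None
    for i in frame_ids:
        # Keep a symbol only if it differs from the previous frame
        # (repeat merging) and is not the blank.
        if i != prev and i != blank_id:
            out.append(id_to_char[i])
        prev = i
    return "".join(out)
```

For example, the frame sequence h h e l <blank> l o decodes to "hello": the duplicated h collapses, and the blank separates the two distinct l symbols.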
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_960h| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|227.6 MB| --- layout: model title: Finnish asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot TFWav2Vec2ForCTC from aapot author: John Snow Labs name: asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot` is a Finnish model originally trained by aapot. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot_fi_4.2.0_3.0_1664024535110.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot_fi_4.2.0_3.0_1664024535110.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot", "fi")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot", "fi") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
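ASR models such as this one are conventionally evaluated with word error rate (WER): the word-level edit distance between a reference transcript and the hypothesis, divided by the reference length. A minimal reference implementation of the metric (shown only to clarify how such models are scored; it is not part of Spark NLP):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words,
    normalized by reference length (reference assumed non-empty)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)
```

One substitution in a three-word reference yields a WER of 1/3.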
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xlsr_300m_finnish_lm_by_aapot| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|fi| |Size:|1.2 GB| --- layout: model title: BioBERT Embeddings (PMC) author: John Snow Labs name: biobert_pmc_base_cased date: 2020-09-19 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.2 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains the pre-trained weights of BioBERT, a language representation model for the biomedical domain, especially designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/biobert_pmc_base_cased_en_2.6.2_2.4_1600530421096.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/biobert_pmc_base_cased_en_2.6.2_2.4_1600530421096.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("biobert_pmc_base_cased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("biobert_pmc_base_cased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I hate cancer").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer"] embeddings_df = nlu.load('en.embed.biobert.pmc_base_cased').predict(text, output_level='token') embeddings_df ```
{:.h2_title} ## Results ```bash token en_embed_biobert_pmc_base_cased_embeddings I [0.0654267892241478, 0.06330983340740204, 0.13... hate [0.3058323264122009, 0.4778319299221039, -0.09... cancer [0.3130614757537842, 0.024675076827406883, -0.... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|biobert_pmc_base_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.2| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert) --- layout: model title: XLNet Large CoNLL-03 NER Pipeline author: ahmedlone127 name: xlnet_large_token_classifier_conll03_pipeline date: 2022-06-14 tags: [open_source, ner, token_classifier, xlnet, conll03, large, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: false article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [xlnet_large_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/09/28/xlnet_large_token_classifier_conll03_en.html) model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/xlnet_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655218011796.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/xlnet_large_token_classifier_conll03_pipeline_en_4.0.0_3.0_1655218011796.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("xlnet_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("xlnet_large_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ```
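The token classifier in this pipeline labels individual tokens with IOB tags, and the NerConverter stage then merges those tags into the chunks shown in the results. A simplified sketch of that aggregation step (the tag sequence here is hypothetical, and the real converter also tracks character offsets and confidence scores):

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB2 token tags into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close any open chunk
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)  # continue the open chunk
        else:  # "O" tag or inconsistent "I-": close the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks
```

Run on the sentence from the usage example with tags B-PERSON for "John" and B-ORG I-ORG I-ORG for "John Snow Labs", this reproduces the two chunks in the results table below.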
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PERSON | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlnet_large_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Community| |Language:|en| |Size:|1.4 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - XlnetForTokenClassification - NerConverter - Finisher --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_cline_emanuals_techqa date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cline-emanuals-techqa` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_cline_emanuals_techqa_en_4.0.0_3.0_1655727895129.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_cline_emanuals_techqa_en_4.0.0_3.0_1655727895129.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_cline_emanuals_techqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_cline_emanuals_techqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.roberta.techqa_cline_emanuals.by_AnonymousSub").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
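Under the hood, extractive QA models like this score every context token as a possible answer start and end, and the returned answer is the highest-scoring valid span. A toy sketch of that span-selection step (plain Python lists stand in for the model's real logits; the scores below are made up):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) token indices maximizing
    start_scores[i] + end_scores[j], subject to i <= j < i + max_len."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return best
```

The `max_len` constraint mirrors the common practice of capping answer length so a spuriously high end score far from the start cannot produce an implausibly long answer.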
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_cline_emanuals_techqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|466.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/cline-emanuals-techqa --- layout: model title: Fast Neural Machine Translation Model from English to Turkic Languages author: John Snow Labs name: opus_mt_en_trk date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, trk, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `trk` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_trk_xx_2.7.0_2.4_1609167167159.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_trk_xx_2.7.0_2.4_1609167167159.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_trk", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_trk", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.trk').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_trk| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering Cased model (from Raynok) author: John Snow Labs name: roberta_qa_squad date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-squad` is an English model originally trained by `Raynok`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_squad_en_4.3.0_3.0_1674222368310.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_squad_en_4.3.0_3.0_1674222368310.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Raynok/roberta-squad --- layout: model title: Smaller BERT Embeddings (L-12_H-128_A-2) author: John Snow Labs name: small_bert_L12_128 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L12_128_en_2.6.0_2.4_1598344378220.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L12_128_en_2.6.0_2.4_1598344378220.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L12_128", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L12_128", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L12_128').predict(text, output_level='token') embeddings_df ```
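The 128-dimensional vectors this model emits per token are usually compared with cosine similarity, e.g. to find semantically related tokens or to feed downstream similarity search. A small standalone helper illustrating the comparison (not a Spark NLP API; the two-dimensional vectors below are toy inputs):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Identical directions score 1.0, orthogonal vectors 0.0; in practice the 128-dimensional embedding columns from the result DataFrame would be passed in place of the toy vectors.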
{:.h2_title} ## Results ```bash token en_embed_bert_small_L12_128_embeddings I [-0.19200828671455383, -1.1298311948776245, -0... love [-0.427543580532074, -1.2282991409301758, -0.2... NLP [0.5775153636932373, 0.06353635340929031, -0.3... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|small_bert_L12_128| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|128| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/1 --- layout: model title: English asr_wav2vec2_base_timit_demo_colab7_by_sameearif88 TFWav2Vec2ForCTC from sameearif88 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab7_by_sameearif88 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab7_by_sameearif88` is an English model originally trained by sameearif88.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab7_by_sameearif88_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab7_by_sameearif88_en_4.2.0_3.0_1664020412505.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab7_by_sameearif88_en_4.2.0_3.0_1664020412505.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab7_by_sameearif88', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab7_by_sameearif88", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab7_by_sameearif88| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Pipeline to Detect clinical entities (ner_jsl_enriched_biobert) author: John Snow Labs name: ner_jsl_enriched_biobert_pipeline date: 2023-03-20 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_jsl_enriched_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_jsl_enriched_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_biobert_pipeline_en_4.3.0_3.2_1679316183988.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_enriched_biobert_pipeline_en_4.3.0_3.2_1679316183988.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_jsl_enriched_biobert_pipeline", "en", "clinical/models") text = '''The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_jsl_enriched_biobert_pipeline", "en", "clinical/models") val text = "The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. 
The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl_enriched_biobert.pipeline").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
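Each row in the Results table below pairs an extracted chunk and its label with a confidence score, and downstream code commonly filters and tallies that output. A small standalone sketch of such post-processing (plain tuples stand in for Spark NLP's annotation objects; the threshold is an arbitrary example):

```python
from collections import Counter

def summarize_chunks(chunks, min_conf=0.5):
    """Keep (text, label, confidence) triples at or above min_conf,
    and count the surviving chunks per NER label."""
    kept = [c for c in chunks if c[2] >= min_conf]
    counts = Counter(label for _, label, _ in kept)
    return kept, counts
```

Applied to rows like ("21-day-old", "Age", 1.0) or ("perioral cyanosis", "Symptom_Name", 0.42), this drops low-confidence symptom chunks while keeping high-confidence demographic ones.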
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:----------------------------|--------:|------:|:-------------|-------------:| | 0 | 21-day-old | 17 | 26 | Age | 1 | | 1 | male | 38 | 41 | Gender | 0.9326 | | 2 | mom | 75 | 77 | Gender | 0.9258 | | 3 | she | 147 | 149 | Gender | 0.8551 | | 4 | mild | 168 | 171 | Modifier | 0.8119 | | 5 | problems with his breathing | 173 | 199 | Symptom_Name | 0.624975 | | 6 | negative | 220 | 227 | Negation | 0.9946 | | 7 | perioral cyanosis | 237 | 253 | Symptom_Name | 0.41775 | | 8 | retractions | 258 | 268 | Symptom_Name | 0.9572 | | 9 | mom | 285 | 287 | Gender | 0.9468 | | 10 | Tylenol | 345 | 351 | Drug_Name | 0.989 | | 11 | His | 400 | 402 | Gender | 0.8694 | | 12 | his | 488 | 490 | Gender | 0.8967 | | 13 | respiratory congestion | 492 | 513 | Symptom_Name | 0.4195 | | 14 | He | 516 | 517 | Gender | 0.8529 | | 15 | tired | 550 | 554 | Symptom_Name | 0.7902 | | 16 | fussy | 569 | 573 | Symptom_Name | 0.9389 | | 17 | albuterol | 637 | 645 | Drug_Name | 0.9588 | | 18 | His | 675 | 677 | Gender | 0.8484 | | 19 | he | 721 | 722 | Gender | 0.8909 | | 20 | he | 778 | 779 | Gender | 0.8625 | | 21 | Mom | 821 | 823 | Gender | 0.8167 | | 22 | denies | 825 | 830 | Negation | 0.9841 | | 23 | diarrhea | 836 | 843 | Symptom_Name | 0.6033 | | 24 | His | 846 | 848 | Gender | 0.8459 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_enriched_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.3 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Fast Neural Machine Translation Model from English to Chinyanja author: John Snow Labs name: opus_mt_en_ny date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, 
seq2seq, translation, en, ny, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `ny` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_ny_xx_2.7.0_2.4_1609169030913.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_ny_xx_2.7.0_2.4_1609169030913.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_ny", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = "Your sentence to translate!" result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_ny", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.ny').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_ny| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Civil Law Document Classifier (EURLEX) author: John Snow Labs name: legclf_civil_law_bert date: 2023-03-06 tags: [en, legal, classification, clauses, civil_law, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the `legclf_civil_law_bert` model, a BERT Sentence Embeddings Document Classifier, classifies whether the document belongs to the class Civil_Law or not (binary classification) according to EuroVoc labels. ## Predicted Entities `Civil_Law`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_civil_law_bert_en_1.0.0_3.0_1678111748956.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_civil_law_bert_en_1.0.0_3.0_1678111748956.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_civil_law_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-----------+ |result | +-----------+ |[Civil_Law]| |[Other] | |[Other] | |[Civil_Law]| +-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_civil_law_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.7 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Civil_Law 0.89 0.89 0.89 53 Other 0.87 0.87 0.87 45 accuracy - - 0.88 98 macro-avg 0.88 0.88 0.88 98 weighted-avg 0.88 0.88 0.88 98 ``` --- layout: model title: English T5ForConditionalGeneration Base Cased model (from razent) author: John Snow Labs name: t5_scifive_base_pubmed_pmc date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `SciFive-base-Pubmed_PMC` is an English model originally trained by `razent`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_scifive_base_pubmed_pmc_en_4.3.0_3.0_1675099109583.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_scifive_base_pubmed_pmc_en_4.3.0_3.0_1675099109583.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_scifive_base_pubmed_pmc","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_scifive_base_pubmed_pmc","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_scifive_base_pubmed_pmc| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|474.3 MB| ## References - https://huggingface.co/razent/SciFive-base-Pubmed_PMC - https://arxiv.org/abs/2106.03598 - https://github.com/justinphan3110/SciFive --- layout: model title: Pipeline to Mapping SNOMED Codes with Their Corresponding ICD10-CM Codes author: John Snow Labs name: snomed_icd10cm_mapping date: 2023-06-13 tags: [en, licensed, clinical, pipeline, chunk_mapping, snomed, icd10cm] task: Chunk Mapping language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of `snomed_icd10cm_mapper` model. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/snomed_icd10cm_mapping_en_4.4.4_3.2_1686663521616.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/snomed_icd10cm_mapping_en_4.4.4_3.2_1686663521616.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("snomed_icd10cm_mapping", "en", "clinical/models") result = pipeline.fullAnnotate("128041000119107 292278006 293072005") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("snomed_icd10cm_mapping", "en", "clinical/models") val result = pipeline.fullAnnotate("128041000119107 292278006 293072005") ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.snomed_to_icd10cm.pipe").predict("""Put your text here.""") ```
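Conceptually, the chunk-mapper stage of this pipeline is a lookup from a SNOMED code to its corresponding ICD10-CM code. A minimal pure-Python sketch of that behavior, using the three example codes from this page (the `"NONE"` fallback is an illustrative convention, not the pipeline's actual output format):

```python
# Toy sketch of what a chunk mapper does: a code-to-code dictionary lookup.
# The three entries mirror the example codes on this page.
snomed_to_icd10cm = {
    "128041000119107": "K22.70",   # SNOMED -> ICD10-CM
    "292278006": "T43.595",
    "293072005": "T37.1X5",
}

def map_codes(text):
    """Map a whitespace-separated string of SNOMED codes."""
    return [snomed_to_icd10cm.get(code, "NONE") for code in text.split()]

print(map_codes("128041000119107 292278006 293072005"))
```

The real pipeline resolves codes through the pretrained `snomed_icd10cm_mapper` model rather than a hand-built dictionary.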
## Results ```bash | | snomed_code | icd10cm_code | |---:|:----------------------------------------|:---------------------------| | 0 | 128041000119107 | 292278006 | 293072005 | K22.70 | T43.595 | T37.1X5 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|snomed_icd10cm_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.5 MB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC5CD_Chem_Modified_PubMedBERT_512 date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC5CD-Chem-Modified-PubMedBERT-512` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC5CD_Chem_Modified_PubMedBERT_512_en_4.0.0_3.0_1657109249349.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC5CD_Chem_Modified_PubMedBERT_512_en_4.0.0_3.0_1657109249349.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC5CD_Chem_Modified_PubMedBERT_512","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC5CD_Chem_Modified_PubMedBERT_512","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC5CD_Chem_Modified_PubMedBERT_512| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|408.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC5CD-Chem-Modified-PubMedBERT-512 --- layout: model title: Translate Baltic languages to English Pipeline author: John Snow Labs name: translate_bat_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, bat, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `bat` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_bat_en_xx_2.7.0_2.4_1609686947121.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_bat_en_xx_2.7.0_2.4_1609686947121.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_bat_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_bat_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.bat.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_bat_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal NER for MAPA(Multilingual Anonymisation for Public Administrations) author: John Snow Labs name: legner_mapa date: 2023-04-27 tags: [it, ner, legal, mapa, licensed] task: Named Entity Recognition language: it edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union. This model extracts `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, and `PERSON` entities from `Italian` documents. ## Predicted Entities `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, `PERSON` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_mapa_it_1.0.0_3.0_1682597548726.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_mapa_it_1.0.0_3.0_1682597548726.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_it_cased", "it")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_mapa", "it", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""In pendenza del giudizio relativo alla responsabilità genitoriale instaurato in Italia, la sig.ra Grigorescu, il 30 settembre 2009, ha adito la Judecătoria București ( Tribunale di primo grado di Bucarest ) chiedendo il divorzio, l’affidamento esclusivo del figlio e un contributo al mantenimento del figlio a carico del padre a titolo di mantenimento della prole."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
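The `NerConverter` stage at the end of this pipeline merges token-level IOB tags into entity chunks. A minimal sketch of that merging logic (the token/tag arrays are illustrative, built from the Italian example sentence; the real tags come from the pretrained `legner_mapa` model):

```python
# Minimal sketch of IOB-tag merging (what a NerConverter stage does):
# consecutive B-/I- tags of the same label are joined into one chunk.
def iob_to_chunks(tokens, tags):
    chunks, cur_toks, cur_label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur_toks:
                chunks.append((" ".join(cur_toks), cur_label))
            cur_toks, cur_label = [tok], tag[2:]
        elif tag.startswith("I-") and cur_label == tag[2:]:
            cur_toks.append(tok)
        else:  # "O" or a label break ends the current chunk
            if cur_toks:
                chunks.append((" ".join(cur_toks), cur_label))
            cur_toks, cur_label = [], None
    if cur_toks:
        chunks.append((" ".join(cur_toks), cur_label))
    return chunks

tokens = ["la", "sig.ra", "Grigorescu", ",", "il", "30", "settembre", "2009"]
tags   = ["O",  "B-PERSON", "I-PERSON", "O", "O", "B-DATE", "I-DATE", "I-DATE"]
print(iob_to_chunks(tokens, tags))
```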
## Results ```bash +-----------------+---------+ |chunk |ner_label| +-----------------+---------+ |Italia |ADDRESS | |sig.ra Grigorescu|PERSON | |30 settembre 2009|DATE | |Bucarest |ADDRESS | +-----------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_mapa| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|it| |Size:|1.4 MB| ## References The dataset is available [here](https://huggingface.co/datasets/joelito/mapa). ## Benchmarking ```bash label precision recall f1-score support ADDRESS 1.00 1.00 1.00 14 AMOUNT 1.00 1.00 1.00 3 DATE 1.00 1.00 1.00 45 ORGANISATION 0.89 0.89 0.89 9 PERSON 0.92 1.00 0.96 12 macro-avg 0.96 0.98 0.97 83 weighted-avg 0.98 0.99 0.98 83 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from sunitha) author: John Snow Labs name: distilbert_qa_base_uncased_3feb_2022_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-3feb-2022-finetuned-squad` is an English model originally trained by `sunitha`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_3feb_2022_finetuned_squad_en_4.3.0_3.0_1672767357306.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_3feb_2022_finetuned_squad_en_4.3.0_3.0_1672767357306.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_3feb_2022_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_3feb_2022_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_3feb_2022_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/sunitha/distilbert-base-uncased-3feb-2022-finetuned-squad --- layout: model title: Fast Neural Machine Translation Model from Arabic to Polish author: John Snow Labs name: opus_mt_ar_pl date: 2021-06-01 tags: [open_source, seq2seq, translation, ar, pl, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `ar` - target languages: `pl` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ar_pl_xx_3.1.0_2.4_1622560089386.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ar_pl_xx_3.1.0_2.4_1622560089386.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ar_pl", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ar_pl", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Arabic.translate_to.Polish').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ar_pl| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Roles NER author: John Snow Labs name: legner_roles date: 2023-01-02 tags: [role, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This NER model extracts legal roles in an agreement, such as `Borrower`, `Supplier`, `Agent`, `Attorney`, `Pursuant`, etc. ## Predicted Entities `ROLE`, `O` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_roles_en_1.0.0_3.0_1672673551040.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_roles_en_1.0.0_3.0_1672673551040.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencizer = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = legal.NerModel.pretrained('legner_roles', 'en', 'legal/models')\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[documenter, sentencizer, tokenizer, embeddings, ner, ner_converter]) empty = spark.createDataFrame([[""]]).toDF("text") model = pipeline.fit(empty) result = model.transform(spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text")) ```
## Results ```bash +-------+---------+---------+ |sent_id|chunk |ner_label| +-------+---------+---------+ |1 |Lender |ROLE | |1 |Lender's |ROLE | |1 |principal|ROLE | |1 |Lender |ROLE | |2 |pursuant |ROLE | |3 |Lenders |ROLE | |3 |Lenders |ROLE | |3 |Lenders |ROLE | |4 |Lenders |ROLE | |7 |Agent |ROLE | |14 |Lenders |ROLE | |14 |Borrowers|ROLE | |14 |Lender |ROLE | |14 |Agent |ROLE | |14 |Lender |ROLE | |15 |Agent |ROLE | |15 |Lender |ROLE | |15 |pursuant |ROLE | |15 |Agent |ROLE | |15 |Borrowers|ROLE | +-------+---------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_roles| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.2 MB| ## References CUAD dataset and synthetic data ## Benchmarking ```bash label tp fp fn prec rec f1 B-ROLE 19095 16 77 0.9991628 0.9959837 0.9975707 I-ROLE 162 1 0 0.993865 1.0 0.9969231 Macro-average 19257 17 77 0.9965139 0.99799186 0.9972523 Micro-average 19257 17 77 0.999118 0.9960174 0.9975653 ``` --- layout: model title: English asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2 TFWav2Vec2ForCTC from gary109 author: John Snow Labs name: asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2` is an English model originally trained by gary109.
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2_en_4.2.0_3.0_1664101378845.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2_en_4.2.0_3.0_1664101378845.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
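Wav2Vec2ForCTC predicts one character label per audio frame; CTC decoding then collapses consecutive repeats and drops the blank symbol to recover the transcript. A minimal greedy-decoding sketch (the frame string and the `_` blank symbol are illustrative):

```python
import itertools

BLANK = "_"  # illustrative CTC blank symbol

def ctc_greedy_decode(frame_labels):
    """Collapse repeated frame labels, then drop blanks (greedy CTC)."""
    return "".join(k for k, _ in itertools.groupby(frame_labels) if k != BLANK)

# 12 audio frames worth of argmax labels decode to "hello":
# the blank between the two l-runs is what keeps the double "l".
print(ctc_greedy_decode("hh_eell_lloo"))
```

Note that without the blank separator, repeated characters would be merged, which is why CTC alphabets include it.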
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_ai_light_dance_singing2_wav2vec2_large_xlsr_53_5gram_v4_2| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Detect Drug Information (Small) author: John Snow Labs name: ner_posology_small date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for posology. This NER model is trained with the ``embeddings_clinical`` word embeddings model, so be sure to use the same embeddings in the pipeline. ## Predicted Entities ``DOSAGE``, ``DRUG``, ``DURATION``, ``FORM``, ``FREQUENCY``, ``ROUTE``, ``STRENGTH``. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_small_en_3.0.0_3.0_1617208436385.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_small_en_3.0.0_3.0_1617208436385.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_posology_small","en","clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. 
She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.']], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val model = MedicalNerModel.pretrained("ner_posology_small","en","clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, model, ner_converter)) val data = Seq("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. 
She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology.small").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. 
The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""") ```
## Results ```bash +--------------+---------+ |chunk |ner | +--------------+---------+ |insulin |DRUG | |Bactrim |DRUG | |for 14 days |DURATION | |Fragmin |DRUG | |5000 units |DOSAGE | |subcutaneously|ROUTE | |daily |FREQUENCY| |Xenaderm |DRUG | |topically |ROUTE | |b.i.d., |FREQUENCY| |Lantus |DRUG | |40 units |DOSAGE | |subcutaneously|ROUTE | |at bedtime |FREQUENCY| |OxyContin |DRUG | |30 mg |STRENGTH | |p.o |ROUTE | |q.12 h |FREQUENCY| |folic acid |DRUG | |1 mg |STRENGTH | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_small| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on the 2018 i2b2 dataset (no FDA) with ``embeddings_clinical``. https://www.i2b2.org/NLP/Medication ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|--------------:|------:|------:|------:|---------:|---------:|---------:| | 0 | B-DRUG | 1408 | 62 | 99 | 0.957823 | 0.934307 | 0.945919 | | 1 | B-STRENGTH | 470 | 43 | 29 | 0.916179 | 0.941884 | 0.928854 | | 2 | I-DURATION | 123 | 22 | 8 | 0.848276 | 0.938931 | 0.891304 | | 3 | I-STRENGTH | 499 | 66 | 15 | 0.883186 | 0.970817 | 0.924931 | | 4 | I-FREQUENCY | 945 | 47 | 55 | 0.952621 | 0.945 | 0.948795 | | 5 | B-FORM | 365 | 13 | 12 | 0.965608 | 0.96817 | 0.966887 | | 6 | B-DOSAGE | 298 | 27 | 26 | 0.916923 | 0.919753 | 0.918336 | | 7 | I-DOSAGE | 348 | 29 | 22 | 0.923077 | 0.940541 | 0.931727 | | 8 | I-DRUG | 208 | 25 | 60 | 0.892704 | 0.776119 | 0.830339 | | 9 | I-ROUTE | 10 | 0 | 2 | 1 | 0.833333 | 0.909091 | | 10 | B-ROUTE | 467 | 4 | 25 | 0.991507 | 0.949187 | 0.969886 | | 11 | B-DURATION | 64 | 10 | 10 | 0.864865 | 0.864865 | 0.864865 | | 12 | B-FREQUENCY | 588 | 12 | 17 | 0.98 | 0.971901 | 0.975934 | | 13 | I-FORM | 264 | 5 | 4 | 0.981413 | 0.985075 | 0.98324 | | 14 | Macro-average | 6057 | 365 | 384 | 
0.93387 | 0.924277 | 0.929049 | | 15 | Micro-average | 6057 | 365 | 384 | 0.943164 | 0.940382 | 0.941771 | ``` --- layout: model title: Translate English to Luvale Pipeline author: John Snow Labs name: translate_en_lue date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, lue, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this module is very computationally expensive, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `lue` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_lue_xx_2.7.0_2.4_1609688898091.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_lue_xx_2.7.0_2.4_1609688898091.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_lue", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_lue", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.lue').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_lue| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Multilingual DistilBertForQuestionAnswering Cased model (from ZYW) author: John Snow Labs name: distilbert_qa_squad_model date: 2023-01-03 tags: [xx, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squad-en-de-es-model` is a Multilingual model originally trained by `ZYW`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_squad_model_xx_4.3.0_3.0_1672775581986.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_squad_model_xx_4.3.0_3.0_1672775581986.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_model","xx")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_squad_model","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
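Extractive QA models of this kind score every context token twice: once as a potential answer start and once as a potential end; the predicted answer is the span maximizing start + end with start ≤ end. A plain-Python sketch of that selection step (hypothetical scores, simplified; not the annotator's actual implementation):

```python
def best_span(start_scores, end_scores, max_len=15):
    # Pick (i, j) maximizing start_scores[i] + end_scores[j] with i <= j
    # and a bounded span length, as extractive QA heads typically do.
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]  # hypothetical logits
end   = [0.1, 0.1, 0.2, 4.5, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # -> Clara
```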
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_squad_model| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|505.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ZYW/squad-en-de-es-model --- layout: model title: Korean ElectraForQuestionAnswering model (from obokkkk) author: John Snow Labs name: electra_qa_base_v3_discriminator_finetuned_klue_v4 date: 2022-06-22 tags: [ko, open_source, electra, question_answering] task: Question Answering language: ko edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `koelectra-base-v3-discriminator-finetuned-klue-v4` is a Korean model originally trained by `obokkkk`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_base_v3_discriminator_finetuned_klue_v4_ko_4.0.0_3.0_1655922184133.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_base_v3_discriminator_finetuned_klue_v4_ko_4.0.0_3.0_1655922184133.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_v3_discriminator_finetuned_klue_v4","ko") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_v3_discriminator_finetuned_klue_v4","ko") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("내 이름은 무엇입니까?", "제 이름은 클라라이고 저는 버클리에 살고 있습니다.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ko.answer_question.klue.electra.base.by_obokkkk").predict("""내 이름은 무엇입니까?|||제 이름은 클라라이고 저는 버클리에 살고 있습니다.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_base_v3_discriminator_finetuned_klue_v4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ko| |Size:|419.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/obokkkk/koelectra-base-v3-discriminator-finetuned-klue-v4 --- layout: model title: English asr_wav2vec2_large_xls_r_300m_CN_colab TFWav2Vec2ForCTC from li666 author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_300m_CN_colab date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_CN_colab` is an English model originally trained by li666. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_CN_colab_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_CN_colab_en_4.2.0_3.0_1664094642974.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_CN_colab_en_4.2.0_3.0_1664094642974.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_CN_colab', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_CN_colab", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_CN_colab| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Financial Pipeline (ORG-PER-ROLE-DATE) author: John Snow Labs name: finpipe_org_per_role_date date: 2022-09-09 tags: [en, financial, licensed] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a pretrained pipeline to extract Companies (ORG), People (PERSON), Job titles (ROLE) and Dates combining different pretrained NER models to improve coverage. ## Predicted Entities `ORG`, `PERSON`, `ROLE`, `DATE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FINPIPE_ORG_PER_DATE_ROLES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finpipe_org_per_role_date_en_1.0.0_3.2_1662716423161.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finpipe_org_per_role_date_en_1.0.0_3.2_1662716423161.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from johnsnowlabs import * deid_pipeline = PretrainedPipeline("finpipe_org_per_role_date", "en", "finance/models") res = deid_pipeline.annotate("John Smith works as Computer Engineer at Amazon since 2020") for token, ner in zip(res['token'], res['ner']): print(f"{token} ({ner})") ```
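The per-token tags printed by the snippet above use the IOB scheme (`B-` opens a chunk, `I-` continues it, `O` is outside); the pipeline's NerConverter stages group them into entity chunks. A plain-Python sketch of that grouping (illustrative only, not the annotator's implementation):

```python
def iob_to_chunks(tokens, tags):
    # Group B-/I- tagged tokens into (chunk_text, label) pairs.
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # flush any open chunk before starting a new one
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)  # continue the open chunk
        else:
            if current:  # an "O" tag closes the open chunk
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["John", "Smith", "works", "as", "Computer", "Engineer",
          "at", "Amazon", "since", "2020"]
tags = ["B-PERSON", "I-PERSON", "O", "O", "B-ROLE", "I-ROLE",
        "O", "B-ORG", "O", "B-DATE"]
print(iob_to_chunks(tokens, tags))
# -> [('John Smith', 'PERSON'), ('Computer Engineer', 'ROLE'), ('Amazon', 'ORG'), ('2020', 'DATE')]
```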
## Results ```bash John (B-PERSON) Smith (I-PERSON) works (O) as (O) Computer (B-ROLE) Engineer (I-ROLE) at (O) Amazon (B-ORG) since (O) 2020 (B-DATE) ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finpipe_org_per_role_date| |Type:|pipeline| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|828.4 MB| ## References In-house annotations on legal and financial documents, Ontonotes, Conll 2003, Finsec conll, Cuad dataset, 10k filings ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - FinanceNerModel - FinanceBertForTokenClassification - NerConverter - NerConverter - ChunkMergeModel --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from Evelyn18) author: John Snow Labs name: roberta_qa_base_spanish_squades_modelo_v1 date: 2023-01-20 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-spanish-squades-modelo-robertav1` is a Spanish model originally trained by `Evelyn18`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_modelo_v1_es_4.3.0_3.0_1674218326080.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_spanish_squades_modelo_v1_es_4.3.0_3.0_1674218326080.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_modelo_v1","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_spanish_squades_modelo_v1","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_spanish_squades_modelo_v1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|460.0 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/Evelyn18/roberta-base-spanish-squades-modelo-robertav1 --- layout: model title: Spanish Named Entity Recognition (Base, Plus, CAPITEL competition at IberLEF 2020 dataset) author: John Snow Labs name: roberta_ner_roberta_base_bne_capitel_ner_plus date: 2022-05-03 tags: [roberta, ner, open_source, es] task: Named Entity Recognition language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-base-bne-capitel-ner-plus` is a Spanish model originally trained by `PlanTL-GOB-ES`. ## Predicted Entities `ORG`, `OTH`, `LOC`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_ner_roberta_base_bne_capitel_ner_plus_es_3.4.2_3.0_1651592741277.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_ner_roberta_base_bne_capitel_ner_plus_es_3.4.2_3.0_1651592741277.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_roberta_base_bne_capitel_ner_plus","es") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_roberta_base_bne_capitel_ner_plus","es") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.ner.roberta_base_bne_capitel_ner_plus").predict("""Amo Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_ner_roberta_base_bne_capitel_ner_plus| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|es| |Size:|458.9 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne-capitel-ner-plus - https://arxiv.org/abs/1907.11692 - http://www.bne.es/en/Inicio/index.html - https://sites.google.com/view/capitel2020 - https://github.com/PlanTL-GOB-ES/lm-spanish - https://arxiv.org/abs/2107.07253 --- layout: model title: Lemmatizer (Germany, SpacyLookup) author: John Snow Labs name: lemma_spacylookup date: 2022-03-03 tags: [open_source, lemmatizer, de] task: Lemmatization language: de edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This German Lemmatizer is a scalable, production-ready version of the rule-based lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_de_3.4.1_3.0_1646316522910.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_de_3.4.1_3.0_1646316522910.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","de") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Du bist nicht besser als ich"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","de") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("Du bist nicht besser als ich").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.lemma").predict("""Du bist nicht besser als ich""") ```
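Under the hood, a lookup lemmatizer is essentially a token-to-lemma dictionary, with out-of-vocabulary tokens passed through unchanged. A minimal plain-Python sketch of the approach (the two-entry table below is a hypothetical stand-in; the shipped dictionary is far larger):

```python
# Tiny illustrative lookup table for the example sentence only.
LEMMA_TABLE = {"bist": "sein", "besser": "gut"}

def lemmatize(tokens):
    # Replace each token with its lemma when the table knows it,
    # otherwise keep the surface form as-is.
    return [LEMMA_TABLE.get(t, t) for t in tokens]

print(lemmatize(["Du", "bist", "nicht", "besser", "als", "ich"]))
# -> ['Du', 'sein', 'nicht', 'gut', 'als', 'ich']
```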
## Results ```bash +--------------------------------+ |result | +--------------------------------+ |[Du, sein, nicht, gut, als, ich]| +--------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma_spacylookup| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|de| |Size:|4.2 MB| --- layout: model title: Translate Bislama to English Pipeline author: John Snow Labs name: translate_bi_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, bi, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this module is very computationally expensive, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `bi` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_bi_en_xx_2.7.0_2.4_1609689875128.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_bi_en_xx_2.7.0_2.4_1609689875128.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_bi_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_bi_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.bi.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_bi_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Extract relations between phenotypic abnormalities and diseases (ReDL) author: John Snow Labs name: redl_human_phenotype_gene_biobert date: 2023-01-14 tags: [relation_extraction, en, licensed, clinical, tensorflow] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract relations to fully understand the origin of some phenotypic abnormalities and their associated diseases. `1`: the entities are related; `0`: the entities are not related. ## Predicted Entities `1`, `0` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_human_phenotype_gene_biobert_en_4.2.4_3.0_1673737099610.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_human_phenotype_gene_biobert_en_4.2.4_3.0_1673737099610.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") words_embedder = WordEmbeddingsModel() \ .pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel.pretrained("ner_human_phenotype_gene_clinical", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverterInternal() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel() \ .pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") # Set a filter on pairs of named entities which will be treated as relation candidates re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks") # The dataset this model is trained on is sentence-wise. # This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. 
re_model = RelationExtractionDLModel()\ .pretrained('redl_human_phenotype_gene_biobert', 'en', "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) text = """She has a retinal degeneration, hearing loss and renal failure, short stature, Mutations in the SH3PXD2B gene coding for the Tks4 protein are responsible for the autosomal recessive.""" data = spark.createDataFrame([[text]]).toDF("text") p_model = pipeline.fit(data) result = p_model.transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_human_phenotype_gene_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) 
.setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") // The dataset this model is trained on is sentence-wise. // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. val re_model = RelationExtractionDLModel.pretrained("redl_human_phenotype_gene_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("""She has a retinal degeneration, hearing loss and renal failure, short stature, Mutations in the SH3PXD2B gene coding for the Tks4 protein are responsible for the autosomal recessive.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.humen_phenotype_gene").predict("""She has a retinal degeneration, hearing loss and renal failure, short stature, Mutations in the SH3PXD2B gene coding for the Tks4 protein are responsible for the autosomal recessive.""") ```
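Downstream code typically keeps only the confident positive predictions from the `relations` column. The sketch below is a hypothetical post-processing helper, not part of the pipeline above: it assumes relation annotations shaped like `LightPipeline.fullAnnotate` output dicts, with the label in `result` and the score in `metadata["confidence"]`.

```python
def filter_relations(relations, threshold=0.8):
    """Keep only positive ("1") relation annotations whose confidence
    meets the threshold. Each annotation is assumed to be a dict like
    {"result": "1", "metadata": {"confidence": "0.97", ...}}."""
    kept = []
    for rel in relations:
        confidence = float(rel["metadata"]["confidence"])
        if rel["result"] == "1" and confidence >= threshold:
            kept.append(rel)
    return kept

# Hypothetical sample mirroring the fullAnnotate structure:
sample = [
    {"result": "1", "metadata": {"confidence": "0.97", "chunk1": "SH3PXD2B", "chunk2": "autosomal recessive"}},
    {"result": "0", "metadata": {"confidence": "0.95", "chunk1": "retinal degeneration", "chunk2": "hearing loss"}},
    {"result": "1", "metadata": {"confidence": "0.63", "chunk1": "retinal degeneration", "chunk2": "SH3PXD2B"}},
]
print(len(filter_relations(sample)))  # only the high-confidence positive pair survives
```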
## Results ```bash +--------+-------+-------------+-----------+--------------------+-------+-------------+-----------+-------------------+----------+ |relation|entity1|entity1_begin|entity1_end| chunk1|entity2|entity2_begin|entity2_end| chunk2|confidence| +--------+-------+-------------+-----------+--------------------+-------+-------------+-----------+-------------------+----------+ | 0| HP| 10| 29|retinal degeneration| HP| 32| 43| hearing loss|0.92880034| | 0| HP| 10| 29|retinal degeneration| HP| 49| 61| renal failure|0.93935645| | 0| HP| 10| 29|retinal degeneration| HP| 64| 76| short stature|0.92370766| | 1| HP| 10| 29|retinal degeneration| GENE| 96| 103| SH3PXD2B|0.63739055| | 1| HP| 10| 29|retinal degeneration| HP| 162| 180|autosomal recessive|0.58393383| | 0| HP| 32| 43| hearing loss| HP| 49| 61| renal failure| 0.9543991| | 0| HP| 32| 43| hearing loss| HP| 64| 76| short stature| 0.8060494| | 1| HP| 32| 43| hearing loss| GENE| 96| 103| SH3PXD2B| 0.8507128| | 1| HP| 32| 43| hearing loss| HP| 162| 180|autosomal recessive|0.90283227| | 0| HP| 49| 61| renal failure| HP| 64| 76| short stature|0.85388213| | 1| HP| 49| 61| renal failure| GENE| 96| 103| SH3PXD2B|0.76057386| | 1| HP| 49| 61| renal failure| HP| 162| 180|autosomal recessive|0.85482293| | 1| HP| 64| 76| short stature| GENE| 96| 103| SH3PXD2B| 0.8951201| | 1| HP| 64| 76| short stature| HP| 162| 180|autosomal recessive| 0.9018232| | 1| GENE| 96| 103| SH3PXD2B| HP| 162| 180|autosomal recessive|0.97185487| +--------+-------+-------------+-----------+--------------------+-------+-------------+-----------+-------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_human_phenotype_gene_biobert| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|401.7 MB| ## References Trained on a silver standard corpus of human phenotype and gene annotations and their relations. 
## Benchmarking ```bash label Recall Precision F1 Support 0 0.922 0.908 0.915 129 1 0.831 0.855 0.843 71 Avg. 0.877 0.882 0.879 - ``` --- layout: model title: Abkhazian asr_xls_r_ab_test_by_baaastien TFWav2Vec2ForCTC from baaastien author: John Snow Labs name: asr_xls_r_ab_test_by_baaastien date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_ab_test_by_baaastien` is an Abkhazian model originally trained by baaastien. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_xls_r_ab_test_by_baaastien_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xls_r_ab_test_by_baaastien_ab_4.2.0_3.0_1664020806550.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xls_r_ab_test_by_baaastien_ab_4.2.0_3.0_1664020806550.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_xls_r_ab_test_by_baaastien", "ab")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_xls_r_ab_test_by_baaastien", "ab") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
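Both snippets above assume an `audioDf` holding the raw audio as an array of floats in a column named `audio_content` (matching the assembler's input column). A minimal sketch of producing such floats from a 16-bit PCM mono WAV file with the Python standard library; the Spark step is shown as a comment because it needs a live session, and the file name is hypothetical.

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit PCM mono WAV file into a list of floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wav:
        raw = wav.readframes(wav.getnframes())
    # Each sample is a little-endian signed 16-bit integer.
    samples = struct.unpack("<" + "h" * (len(raw) // 2), raw)
    return [s / 32768.0 for s in samples]

# floats = wav_to_floats("sample.wav")  # hypothetical file, ideally 16 kHz mono
# audioDf = spark.createDataFrame([[floats]], ["audio_content"])
```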
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_xls_r_ab_test_by_baaastien| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ab| |Size:|442.4 KB| --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC5CDR_Chem_Modified_BioBERT_384 date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC5CDR-Chem-Modified-BioBERT-384` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC5CDR_Chem_Modified_BioBERT_384_en_4.0.0_3.0_1657109284236.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC5CDR_Chem_Modified_BioBERT_384_en_4.0.0_3.0_1657109284236.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC5CDR_Chem_Modified_BioBERT_384","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC5CDR_Chem_Modified_BioBERT_384","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
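The token classifier emits token-level BIO tags (e.g. `B-Chemical`, `I-Chemical`, `O`); a converter stage then merges consecutive tags into entity chunks. A minimal pure-Python sketch of that merge logic, independent of Spark NLP (token and tag lists below are illustrative):

```python
def bio_to_chunks(tokens, tags):
    """Merge token-level BIO tags into (chunk_text, label) pairs,
    the way a NER converter stage does."""
    chunks = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" tag or a dangling "I-": close any open chunk.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks

tokens = ["Treatment", "with", "5", "-", "fluorocytosine", "and", "amphotericin", "B"]
tags = ["O", "O", "B-Chemical", "I-Chemical", "I-Chemical", "O", "B-Chemical", "I-Chemical"]
print(bio_to_chunks(tokens, tags))
# [('5 - fluorocytosine', 'Chemical'), ('amphotericin B', 'Chemical')]
```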
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC5CDR_Chem_Modified_BioBERT_384| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|403.7 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC5CDR-Chem-Modified-BioBERT-384 --- layout: model title: Extract Biomarkers and Their Results author: John Snow Labs name: ner_oncology_biomarker_healthcare date: 2023-01-11 tags: [licensed, clinical, oncology, en, ner, biomarker] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts mentions of biomarkers and biomarker results from oncology texts. ## Predicted Entities `Biomarker_Result`, `Biomarker` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_ONCOLOGY_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_biomarker_healthcare_en_4.2.4_3.0_1673477151495.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_biomarker_healthcare_en_4.2.4_3.0_1673477151495.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel()\ .pretrained("embeddings_healthcare_100d", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel\ .pretrained("ner_oncology_biomarker_healthcare", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["The results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA), Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87%."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel .pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel() .pretrained("embeddings_healthcare_100d", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_biomarker_healthcare", "en", 
"clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("The results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA), Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87%.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.oncology_biomarker_healthcare").predict("""The results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA), Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87%.""") ```
## Results ```bash | chunk | ner_label | |:-----------------------------------------|:-----------------| | negative | Biomarker_Result | | CK7 | Biomarker | | synaptophysin | Biomarker | | Syn | Biomarker | | chromogranin A | Biomarker | | CgA | Biomarker | | Muc5AC | Biomarker | | human epidermal growth factor receptor-2 | Biomarker | | HER2 | Biomarker | | Muc6 | Biomarker | | positive | Biomarker_Result | | CK20 | Biomarker | | Muc1 | Biomarker | | Muc2 | Biomarker | | E-cadherin | Biomarker | | p53 | Biomarker | | Ki-67 index | Biomarker | | 87% | Biomarker_Result | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_biomarker_healthcare| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|33.8 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label tp fp fn total precision recall f1 Biomarker_Result 519 78 62 581 0.87 0.89 0.88 Biomarker 828 51 98 926 0.94 0.89 0.92 macro-avg 1347 129 160 1507 0.91 0.89 0.90 micro-avg 1347 129 160 1507 0.91 0.89 0.90 ``` --- layout: model title: Sentence Embeddings - sbert mini (tuned) author: John Snow Labs name: sbert_jsl_mini_uncased date: 2021-06-30 tags: [embeddings, clinical, licensed, en] task: Embeddings language: en nav_key: models edition: Healthcare NLP 3.1.0 spark_version: 2.4 supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained to generate contextual sentence embeddings of input sentences. 
{:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_mini_uncased_en_3.1.0_2.4_1625050221194.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbert_jsl_mini_uncased_en_3.1.0_2.4_1625050221194.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sbiobert_embeddings = BertSentenceEmbeddings.pretrained("sbert_jsl_mini_uncased","en","clinical/models").setInputCols(["sentence"]).setOutputCol("sbert_embeddings") ``` ```scala val sbiobert_embeddings = BertSentenceEmbeddings .pretrained("sbert_jsl_mini_uncased","en","clinical/models") .setInputCols(Array("sentence")) .setOutputCol("sbert_embeddings") ``` {:.nlu-block} ```python import nlu nlu.load("en.embed_sentence.bert.jsl_mini_uncased").predict("""Put your text here.""") ```
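Sentence embeddings like these are usually consumed by a similarity or entity-resolution stage downstream. A minimal sketch of cosine similarity over the vectors the annotator emits; the 3-dimensional vectors are toy stand-ins for the model's actual sentence vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical directions score 1.0; orthogonal directions score 0.0.
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))  # 1.0
```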
## Results ```bash Gives a 768-dimensional vector representation of the sentence. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbert_jsl_mini_uncased| |Compatibility:|Healthcare NLP 3.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Case sensitive:|false| ## Data Source Tuned on the MedNLI dataset. ## Benchmarking ```bash MedNLI Score Acc 0.663 STS(cos) 0.701 ``` --- layout: model title: Adverse Drug Events Classifier author: John Snow Labs name: classifierml_ade date: 2023-05-16 tags: [text_classification, ade, en, clinical, licensed] task: Text Classification language: en edition: Healthcare NLP 4.4.1 spark_version: 3.0 supported: true annotator: DocumentMLClassifierModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is trained with the DocumentMLClassifierApproach annotator and classifies a text/sentence into two categories. `True`: the sentence describes a possible ADE. `False`: the sentence contains no information about an ADE. The corpus used for model training is the ADE-Corpus-V2 Dataset: Adverse Drug Reaction Data, a dataset for classifying whether a sentence is ADE-related (True) or not (False). ## Predicted Entities `True`, `False` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifierml_ade_en_4.4.1_3.0_1684249521145.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifierml_ade_en_4.4.1_3.0_1684249521145.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols("document")\ .setOutputCol("token") classifier_ml = DocumentMLClassifierModel.pretrained("classifierml_ade", "en", "clinical/models")\ .setInputCols("token")\ .setOutputCol("prediction") clf_Pipeline = Pipeline(stages=[ document_assembler, tokenizer, classifier_ml]) data = spark.createDataFrame([["""I feel great after taking tylenol."""], ["""Toxic epidermal necrolysis resulted after 19 days of treatment with 5-fluorocytosine and amphotericin B."""]]).toDF("text") result = clf_Pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val classifier_ml = DocumentMLClassifierModel.pretrained("classifierml_ade", "en", "clinical/models") .setInputCols("token") .setOutputCol("prediction") val clf_Pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, classifier_ml)) val data = Seq("I feel great after taking tylenol.", "Toxic epidermal necrolysis resulted after 19 days of treatment with 5-fluorocytosine and amphotericin B.").toDS.toDF("text") val result = clf_Pipeline.fit(data).transform(data) ```
## Results ```bash +--------------------------------------------------------------------------------------------------------+-------+ |text |result | +--------------------------------------------------------------------------------------------------------+-------+ |Toxic epidermal necrolysis resulted after 19 days of treatment with 5-fluorocytosine and amphotericin B.|[True] | |I feel great after taking tylenol |[False]| +--------------------------------------------------------------------------------------------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierml_ade| |Compatibility:|Healthcare NLP 4.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[prediction]| |Language:|en| |Size:|2.6 MB| ## References The corpus used for model training is ADE-Corpus-V2 Dataset: Adverse Drug Reaction Data. This is a dataset for classification of a sentence if it is ADE-related (True) or not (False). Reference: Gurulingappa et al., Benchmark Corpus to Support Information Extraction for Adverse Drug Effects, JBI, 2012. http://www.sciencedirect.com/science/article/pii/S1532046412000615 ## Benchmarking ```bash label precision recall f1-score support False 0.90 0.94 0.92 3359 True 0.85 0.75 0.79 1364 accuracy - - 0.89 4723 macro avg 0.87 0.85 0.86 4723 weighted avg 0.89 0.89 0.89 4723 ``` --- layout: model title: Stop Words Cleaner for English author: John Snow Labs name: stopwords_en date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: en nav_key: models edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, en] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. 
Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_en_en_2.5.4_2.4_1594742439135.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_en_en_2.5.4_2.4_1594742439135.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_en", "en") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Other than being the king of the north, John Snow is an English physician and a leader in the development of anaesthesia and medical hygiene.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_en", "en") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Other than being the king of the north, John Snow is an English physician and a leader in the development of anaesthesia and medical hygiene.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Other than being the king of the north, John Snow is an English physician and a leader in the development of anaesthesia and medical hygiene."""] stopword_df = nlu.load('en.stopwords').predict(text) stopword_df[["cleanTokens"]] ```
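Conceptually, the annotator is a token filter against a fixed word list. A minimal pure-Python sketch of the idea with a toy stopword subset (the pretrained model ships its own full list and is case-insensitive by default):

```python
# Toy subset; the real stopwords_en model carries a much larger list.
STOPWORDS = {"other", "than", "being", "the", "of", "is", "a", "an", "and", "in"}

def clean_tokens(tokens, stopwords=STOPWORDS, case_sensitive=False):
    """Drop tokens found in the stopword set; matching is
    case-insensitive by default, like the pretrained cleaner."""
    if case_sensitive:
        return [t for t in tokens if t not in stopwords]
    return [t for t in tokens if t.lower() not in stopwords]

print(clean_tokens(["Other", "than", "being", "the", "king", "of", "the", "north"]))
# ['king', 'north']
```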
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=21, end=24, result='king', metadata={'sentence': '0'}), Row(annotatorType='token', begin=33, end=37, result='north', metadata={'sentence': '0'}), Row(annotatorType='token', begin=38, end=38, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=40, end=43, result='John', metadata={'sentence': '0'}), Row(annotatorType='token', begin=45, end=48, result='Snow', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_en| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|en| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: English image_classifier_vit_beer_whisky_wine_detection ViTForImageClassification from firas-spanioli author: John Snow Labs name: image_classifier_vit_beer_whisky_wine_detection date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_beer_whisky_wine_detection` is an English model originally trained by firas-spanioli. 
## Predicted Entities `beer`, `whisky`, `wine` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_beer_whisky_wine_detection_en_4.1.0_3.0_1660165748295.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_beer_whisky_wine_detection_en_4.1.0_3.0_1660165748295.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_beer_whisky_wine_detection", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_beer_whisky_wine_detection", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_beer_whisky_wine_detection| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Part of Speech for Dutch author: John Snow Labs name: pos_ud_alpino date: 2020-05-04 01:47:00 +0800 task: Part of Speech Tagging language: nl edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [pos, nl] supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model annotates the part of speech of tokens in a text. The [parts of speech](https://universaldependencies.org/u/pos/) annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically. {:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/2da56c087da53a2fac1d51774d49939e05418e57/tutorials/Certification_Trainings/Public/6.Playground_DataFrames.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_alpino_nl_2.5.0_2.4_1588545949009.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_alpino_nl_2.5.0_2.4_1588545949009.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... pos = PerceptronModel.pretrained("pos_ud_alpino", "nl") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Behalve dat hij de koning van het noorden is, is John Snow een Engelse arts en een leider in de ontwikkeling van anesthesie en medische hygiëne.") ``` ```scala ... val pos = PerceptronModel.pretrained("pos_ud_alpino", "nl") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Behalve dat hij de koning van het noorden is, is John Snow een Engelse arts en een leider in de ontwikkeling van anesthesie en medische hygiëne.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Behalve dat hij de koning van het noorden is, is John Snow een Engelse arts en een leider in de ontwikkeling van anesthesie en medische hygiëne."""] pos_df = nlu.load('nl.pos.ud_alpino').predict(text, output_level='token') pos_df ```
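The `fullAnnotate` call returns, per output column, a list of annotation rows carrying the tag in `result` and the matched word in `metadata` (see the Results section). As a rough post-processing sketch — plain dicts stand in for Spark NLP annotation objects here — the tags can be paired with their words like this:

```python
# Plain dicts standing in for Spark NLP annotation rows (illustrative only).
annotations = [
    {"result": "SCONJ", "metadata": {"word": "Behalve"}},
    {"result": "PRON", "metadata": {"word": "hij"}},
    {"result": "NOUN", "metadata": {"word": "koning"}},
]

# Pair each word with its part-of-speech tag.
pairs = [(a["metadata"]["word"], a["result"]) for a in annotations]
print(pairs)  # [('Behalve', 'SCONJ'), ('hij', 'PRON'), ('koning', 'NOUN')]
```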
{:.h2_title} ## Results ```bash [Row(annotatorType='pos', begin=0, end=6, result='SCONJ', metadata={'word': 'Behalve'}), Row(annotatorType='pos', begin=8, end=10, result='SCONJ', metadata={'word': 'dat'}), Row(annotatorType='pos', begin=12, end=14, result='PRON', metadata={'word': 'hij'}), Row(annotatorType='pos', begin=16, end=17, result='DET', metadata={'word': 'de'}), Row(annotatorType='pos', begin=19, end=24, result='NOUN', metadata={'word': 'koning'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_alpino| |Type:|pos| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[pos]| |Language:|nl| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Indonesian BertForQuestionAnswering model (from Rifky) author: John Snow Labs name: bert_qa_Indobert_QA date: 2022-06-02 tags: [id, open_source, question_answering, bert] task: Question Answering language: id edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Indobert-QA` is an Indonesian model originally trained by `Rifky`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_Indobert_QA_id_4.0.0_3.0_1654176733545.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_Indobert_QA_id_4.0.0_3.0_1654176733545.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_Indobert_QA","id") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_Indobert_QA","id") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("id.answer_question.indo_bert").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
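The nlu one-liner above packs the question and the context into a single string separated by `|||`. A minimal sketch of that convention — the splitting helper here is hypothetical, not part of nlu:

```python
def split_qa(packed):
    """Split a 'question|||context' string into its two parts (hypothetical helper)."""
    question, _, context = packed.partition("|||")
    return question.strip(), context.strip()

q, c = split_qa("What's my name?|||My name is Clara and I live in Berkeley.")
print(q)  # What's my name?
print(c)  # My name is Clara and I live in Berkeley.
```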
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_Indobert_QA| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|id| |Size:|412.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Rifky/Indobert-QA - https://indolem.github.io/ - https://github.com/rifkybujana/IndoBERT-QA - https://github.com/Wikidepia/indonesian_datasets/tree/master/question-answering/squad --- layout: model title: Chinese T5ForConditionalGeneration Small Cased model (from thu-coai) author: John Snow Labs name: t5_longlm_small date: 2023-01-30 tags: [zh, open_source, t5, tensorflow] task: Text Generation language: zh edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `LongLM-small` is a Chinese model originally trained by `thu-coai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_longlm_small_zh_4.3.0_3.0_1675098245048.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_longlm_small_zh_4.3.0_3.0_1675098245048.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_longlm_small","zh") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_longlm_small","zh") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_longlm_small| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|zh| |Size:|288.2 MB| ## References - https://huggingface.co/thu-coai/LongLM-small - https://jianguanthu.github.io/ - http://coai.cs.tsinghua.edu.cn/ --- layout: model title: Fast Neural Machine Translation Model from Galician to English author: John Snow Labs name: opus_mt_gl_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, gl, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `gl` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_gl_en_xx_2.7.0_2.4_1609169484710.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_gl_en_xx_2.7.0_2.4_1609169484710.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_gl_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your text to translate.") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_gl_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your text to translate.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.gl.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
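The pipeline translates sentence by sentence: the sentence detector splits the document before each piece reaches `MarianTransformer`. A toy sketch of that flow, with a naive regex splitter and a stub standing in for the real translator (nothing here calls Spark NLP):

```python
import re

def split_sentences(text):
    # Naive splitter standing in for SentenceDetectorDLModel.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def translate(sentence):
    # Stub standing in for the opus_mt_gl_en model: just tags the sentence.
    return f"<en>{sentence}"

doc = "Primeira frase. Segunda frase!"
translations = [translate(s) for s in split_sentences(doc)]
print(translations)  # ['<en>Primeira frase.', '<en>Segunda frase!']
```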
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_gl_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Farming Systems Document Classifier (EURLEX) author: John Snow Labs name: legclf_farming_systems_bert date: 2023-03-06 tags: [en, legal, classification, clauses, farming_systems, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_farming_systems_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the Farming_Systems class or not (binary classification) according to EuroVoc labels. ## Predicted Entities `Farming_Systems`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_farming_systems_bert_en_1.0.0_3.0_1678111765449.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_farming_systems_bert_en_1.0.0_3.0_1678111765449.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_farming_systems_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
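The benchmarking section below reports macro and weighted averages over the two classes. As a quick sanity check, those averages can be recomputed from the per-class F1 scores and supports; the numbers are copied from the benchmarking table, and the computation itself is illustrative:

```python
# Per-class F1 and support, as reported for this classifier.
f1 = {"Farming_Systems": 0.82, "Other": 0.81}
support = {"Farming_Systems": 52, "Other": 50}

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1.values()) / len(f1)
# Weighted average: mean weighted by each class's support.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(round(macro_f1, 3), round(weighted_f1, 3))
```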
## Results ```bash +-------+ |result| +-------+ |[Farming_Systems]| |[Other]| |[Other]| |[Farming_Systems]| ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_farming_systems_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.7 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Farming_Systems 0.82 0.81 0.82 52 Other 0.80 0.82 0.81 50 accuracy - - 0.81 102 macro-avg 0.81 0.81 0.81 102 weighted-avg 0.81 0.81 0.81 102 ``` --- layout: model title: Pipeline to Extract Cancer Therapies and Posology Information author: John Snow Labs name: ner_oncology_unspecific_posology_pipeline date: 2023-03-09 tags: [licensed, clinical, oncology, en, ner, treatment, posology] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_oncology_unspecific_posology](https://nlp.johnsnowlabs.com/2022/11/24/ner_oncology_unspecific_posology_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_unspecific_posology_pipeline_en_4.3.0_3.2_1678347063020.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_unspecific_posology_pipeline_en_4.3.0_3.2_1678347063020.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_oncology_unspecific_posology_pipeline", "en", "clinical/models") text = '''The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving her second cycle of chemotherapy and is in good overall condition.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_oncology_unspecific_posology_pipeline", "en", "clinical/models") val text = "The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses. She is currently receiving her second cycle of chemotherapy and is in good overall condition." val result = pipeline.fullAnnotate(text) ```
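In the results, `begin` and `end` are character offsets into the input text, with `end` inclusive (the usual Spark NLP convention), so a chunk can be recovered with `text[begin:end + 1]`. A plain-Python check against the example sentence:

```python
text = ("The patient underwent a regimen consisting of adriamycin (60 mg/m2) "
        "and cyclophosphamide (600 mg/m2) over six courses.")

def chunk(text, begin, end):
    # Spark NLP offsets are inclusive on both sides, hence end + 1.
    return text[begin:end + 1]

print(chunk(text, 46, 55))  # adriamycin
print(chunk(text, 58, 65))  # 60 mg/m2
print(chunk(text, 72, 87))  # cyclophosphamide
```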
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-----------------|--------:|------:|:---------------------|-------------:| | 0 | adriamycin | 46 | 55 | Cancer_Therapy | 1 | | 1 | 60 mg/m2 | 58 | 65 | Posology_Information | 0.86955 | | 2 | cyclophosphamide | 72 | 87 | Cancer_Therapy | 1 | | 3 | 600 mg/m2 | 90 | 98 | Posology_Information | 0.81215 | | 4 | over six courses | 101 | 116 | Posology_Information | 0.9078 | | 5 | second cycle | 150 | 161 | Posology_Information | 0.9853 | | 6 | chemotherapy | 166 | 177 | Cancer_Therapy | 0.9998 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_unspecific_posology_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Sentence Entity Resolver for RxNorm (disposition) author: John Snow Labs name: sbiobertresolve_rxnorm_disposition date: 2021-08-12 tags: [rxnorm, licensed, en, entity_resolution] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.1.3 spark_version: 2.4 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps medication entities (like drugs/ingredients) to RxNorm codes and their dispositions using `sbiobert_base_cased_mli` Sentence Bert Embeddings. ## Predicted Entities Predicts RxNorm Codes, their normalized definition for each chunk, and dispositions if any. In the result, look for the aux_label parameter in the metadata to get dispositions divided by `|`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_RXNORM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_RXNORM.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_disposition_en_3.1.3_2.4_1628792971821.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_rxnorm_disposition_en_3.1.3_2.4_1628792971821.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The ```sbiobertresolve_rxnorm_disposition``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_posology``` as the NER model, with ```DRUG``` set in ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings\ .pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") rxnorm_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_rxnorm_disposition", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("rxnorm_code")\ .setDistanceFunction("EUCLIDEAN") pipelineModel = PipelineModel( stages = [ documentAssembler, sbert_embedder, rxnorm_resolver ]) rxnorm_lp = LightPipeline(pipelineModel) result = rxnorm_lp.fullAnnotate("belimumab 80 mg/ml injectable solution") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en","clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sbert_embeddings") val rxnorm_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_rxnorm_disposition", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("rxnorm_code") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, rxnorm_resolver)) val pipelineModel = pipeline.fit(Seq("").toDF("text")) val rxnorm_lp = new LightPipeline(pipelineModel) val result = rxnorm_lp.fullAnnotate("belimumab 80 mg/ml injectable solution") ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.rxnorm_disposition").predict("""belimumab 80 mg/ml injectable solution""") ```
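Dispositions arrive in the `all_k_aux_labels` metadata with multiple values joined by `|` (see the Results table). A small sketch of unpacking them, with hard-coded strings standing in for real resolver metadata:

```python
# Aux label string as returned in the resolver metadata, "|"-separated
# (values here are illustrative, copied from the Results table).
aux_label = "Immunomodulator|Alkylating agent"

dispositions = [d.strip() for d in aux_label.split("|") if d.strip()]
print(dispositions)  # ['Immunomodulator', 'Alkylating agent']
```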
## Results ```bash | | chunks | code | resolutions | all_codes | all_k_aux_labels | all_distances | |---:|:--------------------------------------|:--------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------|:--------------------------------------------------------------------------------------------|:----------------------------------------------| | 0 |belimumab 80 mg/ml injectable solution | 1092440 | [belimumab 80 mg/ml injectable solution, belimumab 80 mg/ml injectable solution [benlysta], ifosfamide 80 mg/ml injectable solution, belimumab 80 mg/ml [benlysta], belimumab 80 mg/ml, ...]| [1092440, 1092444, 107034, 1092442, 1092438, ...] | [Immunomodulator, Immunomodulator, Alkylating agent, Immunomodulator, Immunomodulator, ...] | [0.0000, 0.0145, 0.0479, 0.0619, 0.0636, ...] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_rxnorm_disposition| |Compatibility:|Healthcare NLP 3.1.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[rxnorm_code]| |Language:|en| |Case sensitive:|false| --- layout: model title: Translate Kuanyama to English Pipeline author: John Snow Labs name: translate_kj_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, kj, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. 
Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. - source languages: `kj` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_kj_en_xx_2.7.0_2.4_1609686306200.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_kj_en_xx_2.7.0_2.4_1609686306200.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_kj_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_kj_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.kj.translate_to.en').predict(text, output_level='sentence') translate_df ```
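`annotate` returns a dict keyed by the pipeline's output columns; the exact key names depend on the pipeline's stages, so the `translation` key below is an assumption, not verified against this pipeline. A sketch with a stubbed result:

```python
# Stub of an annotate() result; the "translation" key and its value are
# assumed for illustration, not produced by the actual pipeline.
result = {
    "sentence": ["Your sentence to translate!"],
    "translation": ["(translated text)"],
}

for src, tgt in zip(result["sentence"], result["translation"]):
    print(f"{src} -> {tgt}")
```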
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_kj_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Word2Vec Embeddings in Finnish (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-15 tags: [cc, embeddings, fastText, word2vec, fi, open_source] task: Embeddings language: fi edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_fi_3.4.1_3.0_1647373893178.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_fi_3.4.1_3.0_1647373893178.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","fi") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Rakastan kipinää NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","fi") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Rakastan kipinää NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fi.embed.w2v_cc_300d").predict("""Rakastan kipinää NLP""") ```
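Each token maps to a 300-dimensional vector, and token similarity is typically measured with cosine similarity over those vectors. A self-contained sketch of the computation, using toy 3-d vectors in place of real 300-d embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(round(cosine([1.0, 0.0, 2.0], [1.0, 0.0, 2.0]), 6))  # identical vectors -> 1.0
print(round(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 6))  # orthogonal vectors -> 0.0
```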
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|fi| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Detect chemicals in text author: John Snow Labs name: ner_chemicals date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract different types of chemical compounds mentioned in text using pretrained NER model. ## Predicted Entities `CHEM` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CHEMICALS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chemicals_en_3.0.0_3.0_1617260785955.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chemicals_en_3.0.0_3.0_1617260785955.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_chemicals", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_chemicals", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("EXAMPLE_TEXT").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.chemicals").predict("""Put your text here.""") ```
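The benchmarking table below reports precision, recall, and F1 derived from the raw tp/fp/fn counts; each is a one-liner to recompute (counts copied from the table, computation shown for illustration):

```python
# Raw counts for the CHEM entity, as reported in the benchmarking table.
tp, fp, fn = 58026.0, 3022.0, 3975.0

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.9505 0.9359 0.9431
```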
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_chemicals| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Benchmarking ```bash +------+-------+------+------+-------+---------+------+------+ |entity| tp| fp| fn| total|precision|recall| f1| +------+-------+------+------+-------+---------+------+------+ | CHEM|58026.0|3022.0|3975.0|62001.0| 0.9505|0.9359|0.9431| +------+-------+------+------+-------+---------+------+------+ +------------------+ | macro| +------------------+ |0.9431364740875586| +------------------+ +------------------+ | micro| +------------------+ |0.9431364740875586| +------------------+ ``` --- layout: model title: English ElectraForQuestionAnswering model (from hankzhong) author: John Snow Labs name: electra_qa_hankzhong_small_discriminator_finetuned_squad date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-small-discriminator-finetuned-squad` is a English model originally trained by `hankzhong`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_hankzhong_small_discriminator_finetuned_squad_en_4.0.0_3.0_1655921254220.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_hankzhong_small_discriminator_finetuned_squad_en_4.0.0_3.0_1655921254220.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_hankzhong_small_discriminator_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_hankzhong_small_discriminator_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.electra.small.by_hankzhong").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_hankzhong_small_discriminator_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|51.4 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/hankzhong/electra-small-discriminator-finetuned-squad --- layout: model title: Fast Neural Machine Translation Model from English to Morisyen author: John Snow Labs name: opus_mt_en_mfe date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, mfe, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. - source languages: `en` - target languages: `mfe` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_mfe_xx_2.7.0_2.4_1609168764734.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_mfe_xx_2.7.0_2.4_1609168764734.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_mfe", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_mfe", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.mfe').predict(text, output_level='sentence') opus_df ```
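The pipeline above translates sentence by sentence: `SentenceDetectorDLModel` segments the document and `MarianTransformer` translates each segment. As a rough stdlib-only stand-in for the segmentation step (the real detector is a trained model, not a regex; `naive_sentences` is an illustrative helper, not a Spark NLP API):

```python
import re

# Naive punctuation-based sentence splitter. Only an illustration of what
# the segmentation stage produces; SentenceDetectorDL is far more robust
# (abbreviations, quotes, multilingual punctuation, etc.).
def naive_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```

Each returned sentence corresponds to one `sentence` annotation fed into the translator, which is why the Marian stage reads the `sentence` column rather than the whole `document`.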
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_mfe| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering Cased model (from saraks) author: John Snow Labs name: distilbert_qa_cuad_document_name_cased_08_31_v1 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `cuad-distil-document_name-cased-08-31-v1` is an English model originally trained by `saraks`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_document_name_cased_08_31_v1_en_4.3.0_3.0_1672766097687.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_cuad_document_name_cased_08_31_v1_en_4.3.0_3.0_1672766097687.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) question_answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_document_name_cased_08_31_v1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[document_assembler, question_answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val questionAnswering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_cuad_document_name_cased_08_31_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, questionAnswering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_cuad_document_name_cased_08_31_v1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/saraks/cuad-distil-document_name-cased-08-31-v1 --- layout: model title: Fast Neural Machine Translation Model from Bemba (Zambia) to English author: John Snow Labs name: opus_mt_bem_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, bem, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. - source languages: `bem` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_bem_en_xx_2.7.0_2.4_1609171066307.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_bem_en_xx_2.7.0_2.4_1609171066307.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_bem_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_bem_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.bem.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_bem_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering (from akdeniz27) author: John Snow Labs name: roberta_qa_akdeniz27_roberta_base_cuad date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-cuad` is an English model originally trained by `akdeniz27`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_akdeniz27_roberta_base_cuad_en_4.0.0_3.0_1655730413653.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_akdeniz27_roberta_base_cuad_en_4.0.0_3.0_1655730413653.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_akdeniz27_roberta_base_cuad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_akdeniz27_roberta_base_cuad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.cuad.roberta.base.by_akdeniz27").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_akdeniz27_roberta_base_cuad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|447.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/akdeniz27/roberta-base-cuad - https://github.com/TheAtticusProject/cuad - https://github.com/marshmellow77/cuad-demo --- layout: model title: English asr_wav2vec2_coral_300ep TFWav2Vec2ForCTC from joaoalvarenga author: John Snow Labs name: asr_wav2vec2_coral_300ep date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_coral_300ep` is an English model originally trained by joaoalvarenga. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_coral_300ep_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_coral_300ep_en_4.2.0_3.0_1664023690286.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_coral_300ep_en_4.2.0_3.0_1664023690286.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_coral_300ep", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_coral_300ep", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
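The snippets above assume an existing `audioDf` with an `audio_content` column of audio samples. How that DataFrame is built is left to you; as one stdlib-only sketch, a 16-bit PCM mono WAV file can be turned into normalized floats like this (`wav_to_floats` and the file name are illustrative helpers, not part of Spark NLP):

```python
import struct
import wave

# Read a 16-bit PCM mono WAV file and scale its samples to [-1.0, 1.0].
def wav_to_floats(path):
    with wave.open(path, "rb") as wf:
        frames = wf.readframes(wf.getnframes())
    # Each sample is a little-endian signed 16-bit integer.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# The float list can then be wrapped into the DataFrame the pipeline expects, e.g.:
# audioDf = spark.createDataFrame([[wav_to_floats("sample.wav")]]).toDF("audio_content")
```

In practice any audio library (librosa, soundfile, torchaudio) that yields float arrays at the model's expected sample rate works equally well.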
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_coral_300ep| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Arabic Bert Embeddings (from alger-ia) author: John Snow Labs name: bert_embeddings_dziribert date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `dziribert` is an Arabic model originally trained by `alger-ia`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_dziribert_ar_3.4.2_3.0_1649677382652.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_dziribert_ar_3.4.2_3.0_1649677382652.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_dziribert","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_dziribert","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.dziribert").predict("""أنا أحب شرارة NLP""") ```
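A common downstream use of the `embeddings` column produced above is comparing token or pooled sentence vectors with cosine similarity. A minimal stdlib sketch of that comparison (the vectors here are toy values, not real dziribert embeddings, and `cosine` is an illustrative helper):

```python
import math

# Cosine similarity between two equal-length embedding vectors:
# dot(u, v) / (|u| * |v|), in [-1, 1] for real-valued vectors.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Identical directions score 1.0 and orthogonal vectors score 0.0, which makes the measure a convenient drop-in for semantic-similarity ranking over the annotator's output.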
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_dziribert| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|465.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/alger-ia/dziribert - https://arxiv.org/pdf/2109.12346.pdf - https://github.com/alger-ia/dziribert --- layout: model title: Greek BertForQuestionAnswering Cased model (from Danastos) author: John Snow Labs name: bert_qa_nq_squad_el_3 date: 2022-07-07 tags: [el, open_source, bert, question_answering] task: Question Answering language: el edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `nq_squad_bert_el_3` is a Greek model originally trained by `Danastos`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_nq_squad_el_3_el_4.0.0_3.0_1657190653972.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_nq_squad_el_3_el_4.0.0_3.0_1657190653972.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_nq_squad_el_3","el") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Ποιο είναι το όνομά μου?", "Το όνομά μου είναι Κλάρα και μένω στο Μπέρκλεϊ."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_nq_squad_el_3","el") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Ποιο είναι το όνομά μου?", "Το όνομά μου είναι Κλάρα και μένω στο Μπέρκλεϊ.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_nq_squad_el_3| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|el| |Size:|421.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Danastos/nq_squad_bert_el_3 --- layout: model title: English BertForQuestionAnswering model (from ruselkomp) author: John Snow Labs name: bert_qa_sbert_large_nlu_ru_finetuned_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `sbert_large_nlu_ru-finetuned-squad` is an English model originally trained by `ruselkomp`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_sbert_large_nlu_ru_finetuned_squad_en_4.0.0_3.0_1654189351825.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_sbert_large_nlu_ru_finetuned_squad_en_4.0.0_3.0_1654189351825.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sbert_large_nlu_ru_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_sbert_large_nlu_ru_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.large").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_sbert_large_nlu_ru_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.6 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ruselkomp/sbert_large_nlu_ru-finetuned-squad --- layout: model title: English asr_wav2vec2_large_xlsr_sermon TFWav2Vec2ForCTC from sharonibejih author: John Snow Labs name: asr_wav2vec2_large_xlsr_sermon date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_sermon` is an English model originally trained by sharonibejih. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_sermon_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_sermon_en_4.2.0_3.0_1664115027943.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_sermon_en_4.2.0_3.0_1664115027943.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_sermon", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_sermon", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
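`Wav2Vec2ForCTC` produces per-frame character predictions that are decoded with the CTC rule: collapse runs of repeated tokens, then drop the blank symbol. A toy sketch of that collapse step (the blank is assumed here to be `"_"`; the real decoder works on model-specific token ids, and `ctc_collapse` is an illustrative helper):

```python
# CTC greedy collapse: merge adjacent repeated tokens, then remove blanks.
def ctc_collapse(tokens, blank="_"):
    out, prev = [], None
    for t in tokens:
        # A token is kept only when it differs from its predecessor
        # and is not the blank separator.
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return "".join(out)
```

The blank lets the model emit genuinely doubled letters (e.g. `l_l` collapses to `ll`, while `ll` collapses to a single `l`), which is why it sits between the repeated characters in the frame sequence.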
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_sermon| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Translate English to Altaic languages Pipeline author: John Snow Labs name: translate_en_tut date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, tut, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `tut` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_tut_xx_2.7.0_2.4_1609689008206.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_tut_xx_2.7.0_2.4_1609689008206.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_tut", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_tut", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.tut').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_tut| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Vietnamese XlmRoBertaForQuestionAnswering (from ancs21) author: John Snow Labs name: xlm_roberta_qa_xlm_roberta_large_vi_qa date: 2022-06-23 tags: [vi, open_source, question_answering, xlmroberta] task: Question Answering language: vi edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-large-vi-qa` is a Vietnamese model originally trained by `ancs21`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_large_vi_qa_vi_4.0.0_3.0_1655996230939.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_xlm_roberta_large_vi_qa_vi_4.0.0_3.0_1655996230939.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_xlm_roberta_large_vi_qa","vi") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_xlm_roberta_large_vi_qa","vi") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("vi.answer_question.xlm_roberta.large").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_xlm_roberta_large_vi_qa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|vi| |Size:|1.9 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ancs21/xlm-roberta-large-vi-qa - https://github.com/deepmind/xquad/blob/master/xquad.vi.json - https://github.com/mailong25/bert-vietnamese-question-answering/tree/master/dataset --- layout: model title: English T5ForConditionalGeneration Tiny Cased model (from google) author: John Snow Labs name: t5_efficient_tiny_ff12000 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-ff12000` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_ff12000_en_4.3.0_3.0_1675123452076.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_ff12000_en_4.3.0_3.0_1675123452076.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_tiny_ff12000","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_tiny_ff12000","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
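T5 checkpoints are text-to-text, so the task is normally selected by prepending a textual prefix to the input (Spark NLP's `T5Transformer` exposes this via `setTask`). A minimal, framework-free sketch of that convention — the prefix names here are illustrative examples, not an exhaustive list:

```python
def with_task_prefix(task, text):
    """Prepend a T5-style task prefix, e.g. 'summarize: <text>'."""
    return f"{task.rstrip(':')}: {text}"

# Hypothetical task prefixes used for illustration only:
print(with_task_prefix("summarize", "Spark NLP is an NLP library."))
# summarize: Spark NLP is an NLP library.
print(with_task_prefix("translate English to German:", "Hello"))
# translate English to German: Hello
```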
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_tiny_ff12000| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|134.1 MB| ## References - https://huggingface.co/google/t5-efficient-tiny-ff12000 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: English asr_wav2vec2_base_960h_by_facebook TFWav2Vec2ForCTC from facebook author: John Snow Labs name: asr_wav2vec2_base_960h_by_facebook date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_960h_by_facebook` is an English model originally trained by facebook. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_base_960h_by_facebook_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_960h_by_facebook_en_4.2.0_3.0_1664035746899.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_960h_by_facebook_en_4.2.0_3.0_1664035746899.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_960h_by_facebook", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_960h_by_facebook", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
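Wav2Vec2ForCTC-style models emit a symbol prediction per audio frame and decode the frame sequence with CTC: consecutive duplicates are collapsed, then the blank token is dropped. A rough, framework-free sketch of the greedy decoding step — the vocabulary and frames below are invented for illustration:

```python
BLANK = "_"  # CTC blank symbol (placeholder choice for this sketch)

def ctc_greedy_decode(frame_symbols):
    """Collapse consecutive duplicates, then remove CTC blanks."""
    collapsed = []
    prev = None
    for sym in frame_symbols:
        if sym != prev:
            collapsed.append(sym)
        prev = sym
    return "".join(s for s in collapsed if s != BLANK)

# Hypothetical per-frame argmax output spelling the word "cat":
frames = ["c", "c", "_", "a", "a", "_", "t", "t"]
print(ctc_greedy_decode(frames))  # cat
```

Note how the blank lets CTC represent genuinely repeated characters: `["h", "_", "h"]` decodes to `"hh"`, while `["h", "h"]` collapses to `"h"`.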
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_960h_by_facebook| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|227.6 MB| --- layout: model title: German Electra Embeddings (from stefan-it) author: John Snow Labs name: electra_embeddings_electra_base_gc4_64k_100000_cased_generator date: 2022-05-17 tags: [de, open_source, electra, embeddings] task: Embeddings language: de edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-gc4-64k-100000-cased-generator` is a German model originally trained by `stefan-it`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_100000_cased_generator_de_3.4.4_3.0_1652786285688.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_100000_cased_generator_de_3.4.4_3.0_1652786285688.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_100000_cased_generator","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_100000_cased_generator","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
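Downstream, token embeddings like the ones produced above are often compared with cosine similarity (e.g. to find semantically close tokens). A minimal sketch on made-up vectors, no Spark required:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-dimensional "embeddings" (real Electra vectors are much larger):
v1 = [0.2, 0.1, 0.9]
v2 = [0.2, 0.1, 0.9]
print(round(cosine_similarity(v1, v2), 3))  # 1.0
```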
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electra_base_gc4_64k_100000_cased_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|de| |Size:|222.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/stefan-it/electra-base-gc4-64k-100000-cased-generator - https://german-nlp-group.github.io/projects/gc4-corpus.html - https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf --- layout: model title: Translate English to Lozi Pipeline author: John Snow Labs name: translate_en_loz date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, loz, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `loz` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_loz_xx_2.7.0_2.4_1609690709587.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_loz_xx_2.7.0_2.4_1609690709587.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_loz", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_loz", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.loz').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_loz| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal NER for NDA (Dispute Resolution Clause) author: John Snow Labs name: legner_nda_dispute_resolution date: 2023-04-06 tags: [en, licensed, legal, ner, nda] task: Named Entity Recognition language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an NER model, aimed to be run **only** after detecting the `DISPUTE_RESOL` clause with a proper classifier (use legmulticlf_mnda_sections_paragraph_other for that purpose). It will extract the following entities: `COURT_NAME`, `LAW_LOCATION`, and `RESOLUT_MEANS`. ## Predicted Entities `COURT_NAME`, `LAW_LOCATION`, `RESOLUT_MEANS` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_nda_dispute_resolution_en_1.0.0_3.0_1680821390209.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_nda_dispute_resolution_en_1.0.0_3.0_1680821390209.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_nda_dispute_resolution", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""In case no settlement can be reached through consultation within thirty ( 30 ) days after such dispute is raised, each party can submit such matter to China International Economic and Trade Arbitration Commission ( the "CIETAC") in accordance with its rules."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
## Results ```bash +-------------------------------------------------------+-------------+ |chunk |ner_label | +-------------------------------------------------------+-------------+ |consultation |RESOLUT_MEANS| |China |LAW_LOCATION | |International Economic and Trade Arbitration Commission|COURT_NAME | +-------------------------------------------------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_nda_dispute_resolution| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References In-house annotations on Non-disclosure Agreements ## Benchmarking ```bash label precision recall f1-score support COURT_NAME 0.86 0.89 0.88 75 LAW_LOCATION 0.79 0.85 0.82 79 RESOLUT_MEANS 0.88 0.88 0.88 58 micro-avg 0.84 0.87 0.85 212 macro-avg 0.84 0.87 0.86 212 weighted-avg 0.84 0.87 0.85 212 ``` --- layout: model title: English image_classifier_vit_vc_bantai__withoutAMBI_adunest_trial ViTForImageClassification from AykeeSalazar author: John Snow Labs name: image_classifier_vit_vc_bantai__withoutAMBI_adunest_trial date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_vc_bantai__withoutAMBI_adunest_trial` is an English model originally trained by AykeeSalazar. 
## Predicted Entities `nonViolation`, `publicDrinking`, `publicSmoking` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vc_bantai__withoutAMBI_adunest_trial_en_4.1.0_3.0_1660172056548.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vc_bantai__withoutAMBI_adunest_trial_en_4.1.0_3.0_1660172056548.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_vc_bantai__withoutAMBI_adunest_trial", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_vc_bantai__withoutAMBI_adunest_trial", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
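ViTForImageClassification ends in a softmax over per-class logits, and the predicted label is the argmax. A small self-contained sketch of that final step — the logits here are invented for illustration:

```python
import math

def predict_label(logits, labels):
    """Softmax over logits, return the top label and its probability."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # shift for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

labels = ["nonViolation", "publicDrinking", "publicSmoking"]
label, prob = predict_label([2.0, 0.1, -1.0], labels)
print(label)  # nonViolation
```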
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_vc_bantai__withoutAMBI_adunest_trial| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English BertForQuestionAnswering Base Uncased model (from AnonymousSub) author: John Snow Labs name: bert_qa_base_uncased_squad2.0 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_squad2.0_en_4.0.0_3.0_1657185907503.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_uncased_squad2.0_en_4.0.0_3.0_1657185907503.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(False) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_uncased_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
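Extractive QA models of this kind score every context token as a potential answer start and end; the returned answer is the highest-scoring valid span (start ≤ end, bounded length). A toy sketch of that span-selection step with invented tokens and logits:

```python
def best_span(start_logits, end_logits, max_len=15):
    """Return (start, end) indices of the highest-scoring span."""
    best = (0, 0, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best[:2]

# Hypothetical per-token logits for the context "My name is Clara":
tokens = ["My", "name", "is", "Clara"]
start = [0.1, 0.2, 0.1, 3.0]
end = [0.0, 0.1, 0.2, 2.5]
s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # Clara
```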
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_uncased_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.8 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/bert-base-uncased_squad2.0 --- layout: model title: Financial ORG, PER, ROLE, DATE NER author: John Snow Labs name: finner_org_per_role_date date: 2022-08-30 tags: [en, finance, ner, licensed] task: Named Entity Recognition language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an NER model trained on SEC 10-K documents, aimed at extracting the following entities: - ORG - PERSON - ROLE - DATE ## Predicted Entities `ORG`, `PER`, `DATE`, `ROLE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/FINPIPE_ORG_PER_DATE_ROLES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_org_per_role_date_en_1.0.0_3.2_1661862246514.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_org_per_role_date_en_1.0.0_3.2_1661862246514.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter, ]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon"""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
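The `NerConverter` stage above merges token-level IOB tags (`B-PERSON`, `I-PERSON`, `O`, ...) into the entity chunks shown in the results. A simplified, framework-free sketch of that merge — the tokens and tags below are invented:

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = [tag[2:], [tok]]          # start a new chunk
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)              # continue the open chunk
        else:                                   # "O" or a label mismatch
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(" ".join(toks), label) for label, toks in chunks]

tokens = ["Jeffrey", "Bezos", "is", "CEO", "of", "Amazon"]
tags = ["B-PERSON", "I-PERSON", "O", "B-ROLE", "O", "B-ORG"]
print(iob_to_chunks(tokens, tags))
# [('Jeffrey Bezos', 'PERSON'), ('CEO', 'ROLE'), ('Amazon', 'ORG')]
```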
## Results ```bash +-----+---+---------------------+------+ |begin|end| chunk|entity| +-----+---+---------------------+------+ | 0| 20|Jeffrey Preston Bezos|PERSON| | 37| 48| entrepreneur| ROLE| | 51| 57| founder| ROLE| | 63| 65| CEO| ROLE| | 70| 75| Amazon| ORG| +-----+---+---------------------+------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finner_org_per_role_date| |Type:|finance| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References SEC 10-K filings with in-house annotations ## Benchmarking ```bash label tp fp fn prec rec f1 B-PERSON 254 20 56 0.9270073 0.81935483 0.86986303 I-ORG 1161 133 231 0.8972179 0.8340517 0.8644826 B-DATE 202 15 14 0.9308756 0.9351852 0.9330255 I-DATE 302 29 12 0.9123867 0.96178347 0.93643415 B-ROLE 219 21 47 0.9125 0.8233083 0.8656126 B-ORG 674 92 163 0.87989557 0.80525684 0.84092325 I-ROLE 260 26 68 0.90909094 0.79268295 0.8469055 I-PERSON 501 34 94 0.9364486 0.8420168 0.88672566 Macro-average 3573 370 685 0.91317785 0.851705 0.8813709 Micro-average 3573 370 685 0.9061628 0.83912635 0.8713572 ``` --- layout: model title: Relation Extraction between different oncological entity types (ReDL) author: John Snow Labs name: redl_oncology_biobert_wip date: 2023-01-15 tags: [licensed, clinical, oncology, en, relation_extraction, temporal, test, biomarker, anatomy, tensorflow] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This relation extraction model identifies relations between dates and other clinical entities, between tumor mentions and their size, between anatomical entities and other clinical entities, and between tests and their results. 
In contrast to re_oncology_granular, all these relation types are labeled as is_related_to. The different types of relations can be distinguished by considering the pair of entity types that are linked. ## Predicted Entities `is_related_to` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_oncology_biobert_wip_en_4.2.4_3.0_1673763869198.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_oncology_biobert_wip_en_4.2.4_3.0_1673763869198.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") re_ner_chunk_filter = RENerChunksFilter()\ .setInputCols(["ner_chunk", "dependencies"])\ .setOutputCol("re_ner_chunk")\ .setMaxSyntacticDistance(10)\ .setRelationPairs(["Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery"]) re_model = RelationExtractionDLModel.pretrained("redl_oncology_biobert_wip", "en", "clinical/models")\ .setInputCols(["re_ner_chunk", "sentence"])\ .setOutputCol("relation_extraction") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model]) data = spark.createDataFrame([["A mastectomy was performed two months ago, and a 3 cm mass was extracted."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val 
document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentence", "pos_tags", "token")) .setOutputCol("dependencies") val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunk", "dependencies")) .setOutputCol("re_ner_chunk") .setMaxSyntacticDistance(10) .setRelationPairs(Array("Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding", "Cancer_Surgery-Relative_Date", "Relative_Date-Cancer_Surgery")) val re_model = RelationExtractionDLModel.pretrained("redl_oncology_biobert_wip", "en", "clinical/models") .setPredictionThreshold(0.5f) .setInputCols(Array("re_ner_chunk", "sentence")) .setOutputCol("relation_extraction") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("A mastectomy was performed two months ago, and a 3 cm mass was extracted.").toDS.toDF("text") val result = 
pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.oncology_biobert_wip").predict("""A mastectomy was performed two months ago, and a 3 cm mass was extracted.""") ```
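The `RENerChunksFilter` stage above only passes on candidate entity pairs whose labels appear in `setRelationPairs`, so the relation extractor never scores implausible combinations. The filtering idea can be sketched like this (chunks and pairs mirror the example sentence, but the code is an illustration, not the annotator's implementation):

```python
def candidate_pairs(chunks, allowed):
    """Keep (chunk1, chunk2) pairs whose label pair is in the allowed list."""
    allowed = {tuple(p.split("-")) for p in allowed}
    pairs = []
    for i, (text1, label1) in enumerate(chunks):
        for text2, label2 in chunks[i + 1:]:
            if (label1, label2) in allowed:
                pairs.append((text1, text2))
    return pairs

chunks = [("mastectomy", "Cancer_Surgery"),
          ("two months ago", "Relative_Date"),
          ("3 cm", "Tumor_Size"),
          ("mass", "Tumor_Finding")]
allowed = ["Cancer_Surgery-Relative_Date", "Tumor_Size-Tumor_Finding"]
print(candidate_pairs(chunks, allowed))
# [('mastectomy', 'two months ago'), ('3 cm', 'mass')]
```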
## Results ```bash +-------------+--------------+-------------+-----------+----------+-------------+-------------+-----------+--------------+----------+ | relation| entity1|entity1_begin|entity1_end| chunk1| entity2|entity2_begin|entity2_end| chunk2|confidence| +-------------+--------------+-------------+-----------+----------+-------------+-------------+-----------+--------------+----------+ |is_related_to|Cancer_Surgery| 2| 11|mastectomy|Relative_Date| 27| 40|two months ago|0.91422147| |is_related_to| Tumor_Size| 49| 52| 3 cm|Tumor_Finding| 54| 57| mass|0.90398973| +-------------+--------------+-------------+-----------+----------+-------------+-------------+-----------+--------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_oncology_biobert_wip| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|401.7 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label recall precision f1 O 0.82 0.89 0.86 is_related_to 0.90 0.84 0.87 macro-avg 0.86 0.87 0.86 ``` --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_8_austria_2_s42 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_xls_r_accent_germany_8_austria_2_s42 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_exp_w2v2r_xls_r_accent_germany_8_austria_2_s42` is a German model originally trained by jonatasgrosman. 
NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_exp_w2v2r_xls_r_accent_germany_8_austria_2_s42_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_8_austria_2_s42_de_4.2.0_3.0_1664118903970.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_xls_r_accent_germany_8_austria_2_s42_de_4.2.0_3.0_1664118903970.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_xls_r_accent_germany_8_austria_2_s42", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_xls_r_accent_germany_8_austria_2_s42", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_xls_r_accent_germany_8_austria_2_s42| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: Legal Interests Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_interests_bert date: 2023-03-05 tags: [en, legal, classification, clauses, interests, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Interests` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the hundreds of other Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Interests`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_interests_bert_en_1.0.0_3.0_1678049881866.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_interests_bert_en_1.0.0_3.0_1678049881866.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_interests_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
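The paragraph-level splitting recommended in the description can be prototyped without Spark NLP. The following is a minimal pure-Python sketch; the `split_paragraphs` helper, the blank-line regex, and the whitespace tokenization are illustrative assumptions, not the workshop's exact code:

```python
import re

def split_paragraphs(text, max_tokens=512):
    """Split a document into paragraphs on blank lines, capping each
    paragraph at max_tokens whitespace-separated tokens to respect the
    512-token limit of the classifier's embeddings."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Truncate overly long provisions so the embedder never sees more
    # than max_tokens tokens per paragraph.
    return [" ".join(p.split()[:max_tokens]) for p in paragraphs]

doc = "Clause 1. Interest accrues daily.\n\nClause 2. Governing law is Delaware."
print(split_paragraphs(doc))
```

Each resulting paragraph can then be fed to the pipeline above as one row of the `text` column, so the classifier scores provisions individually rather than the whole document.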
## Results ```bash +-----------+ |result | +-----------+ |[Interests]| |[Other] | |[Other] | |[Interests]| +-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_interests_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Interests 0.97 0.93 0.95 40 Other 0.95 0.98 0.97 58 accuracy - - 0.96 98 macro-avg 0.96 0.95 0.96 98 weighted-avg 0.96 0.96 0.96 98 ``` --- layout: model title: English image_classifier_vit_lung_cancer ViTForImageClassification from Giuliano author: John Snow Labs name: image_classifier_vit_lung_cancer date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_lung_cancer` is an English model originally trained by Giuliano. ## Predicted Entities `LUAD`, `LUSC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_lung_cancer_en_4.1.0_3.0_1660171240836.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_lung_cancer_en_4.1.0_3.0_1660171240836.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_lung_cancer", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_lung_cancer", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_lung_cancer| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Stop Words Cleaner for Swedish author: John Snow Labs name: stopwords_sv date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: sv edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, sv] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_sv_sv_2.5.4_2.4_1594742438273.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_sv_sv_2.5.4_2.4_1594742438273.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_sv", "sv") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Förutom att vara kungen i norr är John Snow en engelsk läkare och en ledare inom utveckling av anestesi och medicinsk hygien.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_sv", "sv") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Förutom att vara kungen i norr är John Snow en engelsk läkare och en ledare inom utveckling av anestesi och medicinsk hygien.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Förutom att vara kungen i norr är John Snow en engelsk läkare och en ledare inom utveckling av anestesi och medicinsk hygien."""] stopword_df = nlu.load('sv.stopwords').predict(text) stopword_df[['cleanTokens']] ```
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=6, result='Förutom', metadata={'sentence': '0'}), Row(annotatorType='token', begin=17, end=22, result='kungen', metadata={'sentence': '0'}), Row(annotatorType='token', begin=26, end=29, result='norr', metadata={'sentence': '0'}), Row(annotatorType='token', begin=31, end=32, result='är', metadata={'sentence': '0'}), Row(annotatorType='token', begin=34, end=37, result='John', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_sv| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|sv| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: Mapping Abbreviations and Acronyms of Medical Regulatory Activities with Their Definitions author: John Snow Labs name: abbreviation_mapper date: 2022-05-11 tags: [en, abbreviation, definition, licensed, clinical, chunk_mapper] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.1 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps abbreviations and acronyms of medical regulatory activities with their `definition`. 
## Predicted Entities `definition` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/abbreviation_mapper_en_3.5.1_3.0_1652307379928.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/abbreviation_mapper_en_3.5.1_3.0_1652307379928.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols("sentence")\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") # NER model to detect abbreviations in the text abbr_ner = MedicalNerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("abbr_ner") abbr_converter = NerConverter() \ .setInputCols(["sentence", "token", "abbr_ner"]) \ .setOutputCol("abbr_ner_chunk") chunkerMapper = ChunkMapperModel.pretrained("abbreviation_mapper", "en", "clinical/models")\ .setInputCols(["abbr_ner_chunk"])\ .setOutputCol("mappings")\ .setRel("definition") pipeline = Pipeline().setStages([document_assembler, sentence_detector, tokenizer, word_embeddings, abbr_ner, abbr_converter, chunkerMapper]) text = ["""Gravid with estimated fetal weight of 6-6/12 pounds. LABORATORY DATA: Laboratory tests include a CBC which is normal. HIV: Negative. One-Hour Glucose: 117. 
Group B strep has not been done as yet."""] data = spark.createDataFrame([text]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val abbr_ner = MedicalNerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("abbr_ner") val abbr_converter = new NerConverter() .setInputCols(Array("sentence", "token", "abbr_ner")) .setOutputCol("abbr_ner_chunk") val chunkerMapper = ChunkMapperModel.pretrained("abbreviation_mapper", "en", "clinical/models") .setInputCols(Array("abbr_ner_chunk")) .setOutputCol("mappings") .setRel("definition") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_detector, tokenizer, word_embeddings, abbr_ner, abbr_converter, chunkerMapper)) val test_sentence = """Gravid with estimated fetal weight of 6-6/12 pounds. LABORATORY DATA: Laboratory tests include a CBC which is normal. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.""" val data = Seq(test_sentence).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.abbreviation_to_definition").predict("""Gravid with estimated fetal weight of 6-6/12 pounds. LABORATORY DATA: Laboratory tests include a CBC which is normal. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.""") ```
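Conceptually, the chunk mapper behaves like a dictionary lookup over the NER chunks detected upstream. The sketch below is illustrative only; the pretrained model ships its own mapping resource, and the `map_abbreviation` helper is a hypothetical name (the two entries come from the Results table of this card):

```python
# Illustrative mapping resource; the real model's dictionary is far larger.
ABBREVIATIONS = {
    "CBC": "complete blood count",
    "HIV": "human immunodeficiency virus",
}

def map_abbreviation(chunk):
    # Look up the definition for a detected abbreviation chunk, if known.
    return ABBREVIATIONS.get(chunk)

print(map_abbreviation("CBC"))  # complete blood count
```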
## Results ```bash +----------+------------------------------+ |ner_chunk |mapping_result | +----------+------------------------------+ |CBC |complete blood count | |HIV |human immunodeficiency virus | +----------+------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|abbreviation_mapper| |Compatibility:|Healthcare NLP 3.5.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[abbr_ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|214.8 KB| ## References https://www.johnsnowlabs.com/marketplace/list-of-abbreviations-and-acronyms-for-medical-regulatory-activities/ --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from SSarim) author: John Snow Labs name: distilbert_qa_ssarim_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `SSarim`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_ssarim_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769154633.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_ssarim_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769154633.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ssarim_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ssarim_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_ssarim_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/SSarim/distilbert-base-uncased-finetuned-squad --- layout: model title: Abkhazian asr_wav2vec2_xls_r_300m_ab_CV8 TFWav2Vec2ForCTC from emre author: John Snow Labs name: asr_wav2vec2_xls_r_300m_ab_CV8 date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_xls_r_300m_ab_CV8` is an Abkhazian model originally trained by emre. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_xls_r_300m_ab_CV8_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_ab_CV8_ab_4.2.0_3.0_1664020050816.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_xls_r_300m_ab_CV8_ab_4.2.0_3.0_1664020050816.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_xls_r_300m_ab_CV8", "ab")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_xls_r_300m_ab_CV8", "ab") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_xls_r_300m_ab_CV8| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ab| |Size:|1.2 GB| --- layout: model title: English BertForQuestionAnswering model (from ricardo-filho) author: John Snow Labs name: bert_qa_bert_large_faquad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_large_faquad` is an English model originally trained by `ricardo-filho`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_faquad_en_4.0.0_3.0_1654185123808.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_faquad_en_4.0.0_3.0_1654185123808.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_faquad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_large_faquad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.large.by_ricardo-filho").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_large_faquad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ricardo-filho/bert_large_faquad --- layout: model title: Fast Neural Machine Translation Model from English to Rundi author: John Snow Labs name: opus_mt_en_run date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, run, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `run` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_run_xx_2.7.0_2.4_1609166290734.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_run_xx_2.7.0_2.4_1609166290734.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_run", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_run", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDS.toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.run').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_run| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from Arabic to Turkish author: John Snow Labs name: opus_mt_ar_tr date: 2021-06-01 tags: [open_source, seq2seq, translation, ar, tr, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. source languages: ar target languages: tr {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ar_tr_xx_3.1.0_2.4_1622563228141.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ar_tr_xx_3.1.0_2.4_1622563228141.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ar_tr", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ar_tr", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Arabic.translate_to.Turkish').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ar_tr| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Japanese Bert Embeddings (Base, Whole Word Masking) author: John Snow Labs name: bert_embeddings_bert_base_japanese_whole_word_masking date: 2022-04-11 tags: [bert, embeddings, ja, open_source] task: Embeddings language: ja edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-japanese-whole-word-masking` is a Japanese model originally trained by `cl-tohoku`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_japanese_whole_word_masking_ja_3.4.2_3.0_1649674234386.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_japanese_whole_word_masking_ja_3.4.2_3.0_1649674234386.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_japanese_whole_word_masking","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["私はSpark NLPを愛しています"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_japanese_whole_word_masking","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("私はSpark NLPを愛しています").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ja.embed.bert_base_japanese_whole_word_masking").predict("""私はSpark NLPを愛しています""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_japanese_whole_word_masking| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ja| |Size:|415.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking - https://github.com/google-research/bert - https://github.com/cl-tohoku/bert-japanese/tree/v1.0 - https://github.com/attardi/wikiextractor - https://taku910.github.io/mecab/ - https://creativecommons.org/licenses/by-sa/3.0/ - https://www.tensorflow.org/tfrc/ --- layout: model title: English asr_wav2vec2_base_timit_demo_colab9 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab9 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab9` is an English model originally trained by hassnain. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab9_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab9_en_4.2.0_3.0_1664019968511.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab9_en_4.2.0_3.0_1664019968511.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab9', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab9", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab9| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Multilingual DistilBertForQuestionAnswering Cased model (from ZYW) author: John Snow Labs name: distilbert_qa_test_squad_trained date: 2023-01-03 tags: [xx, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `test-squad-trained` is a Multilingual model originally trained by `ZYW`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_test_squad_trained_xx_4.3.0_3.0_1672775713238.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_test_squad_trained_xx_4.3.0_3.0_1672775713238.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_test_squad_trained","xx")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_test_squad_trained","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
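Under the hood, extractive QA models such as `DistilBertForQuestionAnswering` score every context token as a potential answer start and answer end, then pick the best valid span. A toy sketch of that span selection with made-up scores (the function and numbers below are our own illustration, not Spark NLP API):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) token pair maximizing start+end score,
    with end >= start and a bounded answer length."""
    best, span = float("-inf"), (0, 0)
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best:
                best, span = s + end_scores[j], (i, j)
    return span

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 6.0, 0.0, 0.1, 0.0, 0.0, 1.0, 0.0]
end   = [0.0, 0.1, 0.0, 5.5, 0.2, 0.0, 0.0, 0.0, 1.2, 0.1]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # prints: Clara
```

With these scores the best span is the single token "Clara", matching the answer the pretrained model is expected to return for the example above.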
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_test_squad_trained| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|505.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ZYW/test-squad-trained --- layout: model title: Relation Extraction between Test and Results (ReDL) author: John Snow Labs name: redl_oncology_test_result_biobert_wip date: 2022-09-29 tags: [licensed, clinical, oncology, en, relation_extraction, test] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This relation extraction model links test extractions to their corresponding results. ## Predicted Entities `is_finding_of`, `O` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_oncology_test_result_biobert_wip_en_4.1.0_3.0_1664460623630.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_oncology_test_result_biobert_wip_en_4.1.0_3.0_1664460623630.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Each relevant relation pair in the pipeline should include one test entity (such as Biomarker, Imaging_Test, Pathology_Test or Oncogene) and one result entity (such as Biomarker_Result, Pathology_Result or Tumor_Finding).
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentence", "pos_tags", "token"]) \ .setOutputCol("dependencies") re_ner_chunk_filter = RENerChunksFilter()\ .setInputCols(["ner_chunk", "dependencies"])\ .setOutputCol("re_ner_chunk")\ .setMaxSyntacticDistance(10)\ .setRelationPairs(["Biomarker-Biomarker_Result", "Biomarker_Result-Biomarker", "Oncogene-Biomarker_Result", "Biomarker_Result-Oncogene", "Pathology_Test-Pathology_Result", "Pathology_Result-Pathology_Test"]) re_model = RelationExtractionDLModel.pretrained("redl_oncology_test_result_biobert_wip", "en", "clinical/models")\ .setInputCols(["re_ner_chunk", "sentence"])\ .setOutputCol("relation_extraction") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, re_model]) data = spark.createDataFrame([["Pathology showed tumor cells, which were positive for estrogen and progesterone receptors."]]).toDF("text") result = pipeline.fit(data).transform(data) ```
```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentence", "pos_tags", "token")) .setOutputCol("dependencies") val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols("ner_chunk", "dependencies") .setOutputCol("re_ner_chunk") .setMaxSyntacticDistance(10) .setRelationPairs(Array("Biomarker-Biomarker_Result", "Biomarker_Result-Biomarker", "Oncogene-Biomarker_Result", "Biomarker_Result-Oncogene", "Pathology_Test-Pathology_Result", "Pathology_Result-Pathology_Test")) val re_model = RelationExtractionDLModel.pretrained("redl_oncology_test_result_biobert_wip", "en", "clinical/models") .setPredictionThreshold(0.5f) .setInputCols("re_ner_chunk", "sentence") .setOutputCol("relation_extraction") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, pos_tagger, dependency_parser, re_ner_chunk_filter, 
re_model)) val data = Seq("Pathology showed tumor cells, which were positive for estrogen and progesterone receptors.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.oncology.test_result_biobert").predict("""Pathology showed tumor cells, which were positive for estrogen and progesterone receptors.""") ```
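`RENerChunksFilter` above only forwards entity pairs whose labels match one of the `setRelationPairs` entries and whose syntactic distance is within the limit; everything else never reaches the relation extraction model. A simplified stand-in for that filtering (our own code, using chunk-index distance as a crude proxy for true dependency-tree distance):

```python
ALLOWED = {("Biomarker", "Biomarker_Result"),
           ("Oncogene", "Biomarker_Result"),
           ("Pathology_Test", "Pathology_Result")}

def candidate_pairs(chunks, allowed=ALLOWED, max_distance=10):
    """Yield entity-chunk pairs whose types match an allowed relation pair
    (in either direction) and that are close enough in the sentence."""
    for a_idx, (a_text, a_type) in enumerate(chunks):
        for b_idx, (b_text, b_type) in enumerate(chunks):
            if a_idx >= b_idx:
                continue
            if ((a_type, b_type) in allowed or (b_type, a_type) in allowed) \
                    and abs(a_idx - b_idx) <= max_distance:
                yield (a_text, a_type, b_text, b_type)

# Chunks roughly as the NER stage would produce them for the example sentence.
chunks = [("Pathology", "Pathology_Test"), ("positive", "Biomarker_Result"),
          ("estrogen", "Biomarker"), ("progesterone receptors", "Biomarker")]
pairs = list(candidate_pairs(chunks))
print(pairs)
```

Only the two Biomarker/Biomarker_Result pairs survive, which is exactly the pair of relations the model scores in the Results section.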
## Results ```bash | chunk1 | entity1 | chunk2 | entity2 | relation | confidence | |-------- |----------------- |----------------------- |---------- |-------------- |----------- | |positive | Biomarker_Result | estrogen | Biomarker | is_finding_of | 0.99451536 | |positive | Biomarker_Result | progesterone receptors | Biomarker | is_finding_of | 0.99218905 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_oncology_test_result_biobert_wip| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|405.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label recall precision f1 O 0.87 0.92 0.9 is_finding_of 0.93 0.88 0.9 macro-avg 0.90 0.90 0.9 ``` --- layout: model title: English BertForQuestionAnswering model (from ZYW) author: John Snow Labs name: bert_qa_squad_mbert_model_2 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squad-mbert-model_2` is an English model originally trained by `ZYW`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_squad_mbert_model_2_en_4.0.0_3.0_1654192072821.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_squad_mbert_model_2_en_4.0.0_3.0_1654192072821.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_squad_mbert_model_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_squad_mbert_model_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.multi_lingual_bert.v2.by_ZYW").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_squad_mbert_model_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ZYW/squad-mbert-model_2 --- layout: model title: Bashkir asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt TFWav2Vec2ForCTC from AigizK author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt date: 2022-09-24 tags: [wav2vec2, ba, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: ba edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt` is a Bashkir model originally trained by AigizK. NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt_ba_4.2.0_3.0_1664040387674.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt_ba_4.2.0_3.0_1664040387674.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt', lang = 'ba') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt", lang = "ba") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_bashkir_cv7_opt| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|ba| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from hark99) author: John Snow Labs name: distilbert_qa_hark99_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `hark99`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_hark99_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771083025.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_hark99_base_uncased_finetuned_squad_en_4.3.0_3.0_1672771083025.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hark99_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_hark99_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_hark99_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/hark99/distilbert-base-uncased-finetuned-squad --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_test TFWav2Vec2ForCTC from ying-tina author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab_test date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_test` is an English model originally trained by ying-tina. NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_test_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_test_en_4.2.0_3.0_1664111990571.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_test_en_4.2.0_3.0_1664111990571.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_test', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_test", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_test| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|354.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Fast Neural Machine Translation Model from English to Papiamento author: John Snow Labs name: opus_mt_en_pap date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, pap, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial ones. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `pap` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_pap_xx_2.7.0_2.4_1609169563076.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_pap_xx_2.7.0_2.4_1609169563076.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentences") marian = MarianTransformer.pretrained("opus_mt_en_pap", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_pap", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.pap').predict(text, output_level='sentence') opus_df ```
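MarianTransformer translates sentence by sentence, which is why `SentenceDetectorDLModel` precedes it in the pipeline. The trained detector can be approximated, very roughly, by a punctuation-based splitter; the sketch below is our own illustration of the idea, not the actual model:

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace.
    SentenceDetectorDLModel learns these boundaries instead of hard-coding them."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "Marian is fast. It is written in C++!"
print(split_sentences(text))  # ['Marian is fast.', 'It is written in C++!']
```

A rule like this fails on abbreviations and decimal points, which is precisely why the pipeline uses a trained detector before translation.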
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_pap| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English T5ForConditionalGeneration Tiny Cased model (from google) author: John Snow Labs name: t5_efficient_tiny_nl32 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-nl32` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nl32_en_4.3.0_3.0_1675123928277.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nl32_en_4.3.0_3.0_1675123928277.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCols("text") \ .setOutputCols("document") t5 = T5Transformer.pretrained("t5_efficient_tiny_nl32","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCols("text") .setOutputCols("document") val t5 = T5Transformer.pretrained("t5_efficient_tiny_nl32","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
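T5 models are text-to-text: the task to perform is usually encoded as a plain-text prefix on the input (for example `summarize:` or `translate English to German:`), so the `"PUT YOUR STRING HERE"` placeholder above would normally carry such a prefix. A minimal sketch of that input preparation (our own helper, not a Spark NLP API):

```python
def add_task_prefix(task, texts):
    """T5 is text-to-text: the task is just a plain-text prefix on each input
    string rather than a separate model parameter."""
    return [f"{task}: {t}" for t in texts]

batch = add_task_prefix("summarize", ["Spark NLP is an NLP library for Apache Spark."])
print(batch[0])  # summarize: Spark NLP is an NLP library for Apache Spark.
```

The prefixed strings are what would populate the `text` column of the DataFrame fed to the pipeline.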
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_tiny_nl32| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|145.4 MB| ## References - https://huggingface.co/google/t5-efficient-tiny-nl32 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Chinese Bert Embeddings (Bert, Whole Word Masking) author: John Snow Labs name: bert_embeddings_chinese_bert_wwm_ext date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `chinese-bert-wwm-ext` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_bert_wwm_ext_zh_3.4.2_3.0_1649668880514.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_bert_wwm_ext_zh_3.4.2_3.0_1649668880514.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_chinese_bert_wwm_ext","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_chinese_bert_wwm_ext","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.chinese_bert_wwm_ext").predict("""I love Spark NLP""") ```
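The `embeddings` column produced above holds one dense vector per token; a common downstream use is comparing tokens or pooled sentences by cosine similarity. A self-contained sketch with toy low-dimensional vectors standing in for the 768-dimensional BERT embeddings (our own code, not part of the card's pipeline):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-d vectors; in practice these come from the "embeddings" annotations.
u, v = [1.0, 0.0, 2.0, 0.0], [2.0, 0.0, 4.0, 0.0]
print(round(cosine(u, v), 6))  # parallel vectors -> 1.0
```

In a real pipeline you would extract each token's `embeddings` array from the transformed DataFrame and feed those arrays to a function like this.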
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_bert_wwm_ext| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|384.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/hfl/chinese-bert-wwm-ext - https://arxiv.org/abs/1906.08101 - https://github.com/google-research/bert - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/MacBERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/HFL-Anthology - https://arxiv.org/abs/2004.13922 --- layout: model title: Named Entity Recognition Profiling (Clinical) author: John Snow Labs name: ner_profiling_clinical date: 2023-04-28 tags: [licensed, en, clinical, profiling, ner_profiling, ner] task: [Named Entity Recognition, Pipeline Healthcare] language: en edition: Healthcare NLP 4.4.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to explore all the available pretrained NER models at once. When you run this pipeline over your text, you get the predictions of every pretrained clinical NER model trained with `embeddings_clinical`. This version adds new clinical NER models and their outputs to the previous release. 
Here are the NER models that this pretrained pipeline includes: `jsl_ner_wip_clinical`,`jsl_ner_wip_greedy_clinical`,`jsl_ner_wip_modifier_clinical`, `jsl_rd_ner_wip_greedy_clinical`, `ner_abbreviation_clinical`, `ner_ade_binary`, `ner_ade_clinical`, `ner_anatomy`, `ner_anatomy_coarse`, `ner_bacterial_species`, `ner_biomarker`, `ner_biomedical_bc2gm`, `ner_bionlp`, `ner_cancer_genetics`, `ner_cellular`, `ner_chemd_clinical`, `ner_chemicals`, `ner_chemprot_clinical`, `ner_chexpert`, `ner_clinical`, `ner_clinical_large`, `ner_clinical_trials_abstracts`, `ner_covid_trials`, `ner_deid_augmented`, `ner_deid_enriched`, `ner_deid_generic_augmented`,`ner_deid_large`, `ner_deid_sd`,`ner_deid_sd_large`,`ner_deid_subentity_augmented`,`ner_deid_subentity_augmented_i2b2`, `ner_deid_synthetic`, `ner_diseases`, `ner_diseases_large`, `ner_drugprot_clinical`, `ner_drugs`, `ner_drugs_greedy`, `ner_drugs_large`, `ner_eu_clinical_case`, `ner_eu_clinical_condition`, `ner_events_admission_clinical`, `ner_events_clinical`, `ner_financial_contract`, `ner_genetic_variants`, `ner_healthcare`, `ner_human_phenotype_gene_clinical`, `ner_human_phenotype_go_clinical`, `ner_jsl`, `ner_jsl_enriched`, `ner_jsl_slim`, `ner_living_species`, `ner_measurements_clinical`, `ner_medmentions_coarse`, `ner_nature_nero_clinical`, `ner_nihss`, `ner_oncology`, `ner_oncology_anatomy_general`, `ner_oncology_anatomy_granular`, `ner_oncology_biomarker`, `ner_oncology_demographics`, `ner_oncology_diagnosis`, `ner_oncology_posology`, `ner_oncology_response_to_treatment`, `ner_oncology_test`, `ner_oncology_therapy`, `ner_oncology_tnm`, `ner_oncology_unspecific_posology`, `ner_oncology_wip`, `ner_pathogen`, `ner_posology`, `ner_posology_experimental`, `ner_posology_greedy`, `ner_posology_large`, `ner_posology_small`, `ner_radiology`, `ner_radiology_wip_clinical`, `ner_risk_factors`, `ner_sdoh_access_to_healthcare_wip`, `ner_sdoh_community_condition_wip`, `ner_sdoh_demographics_wip`, 
`ner_sdoh_health_behaviours_problems_wip`, `ner_sdoh_income_social_status_wip`, `ner_sdoh_mentions`, `ner_sdoh_slim_wip`, `ner_sdoh_social_environment_wip`, `ner_sdoh_substance_usage_wip`, `ner_sdoh_wip`, `ner_supplement_clinical`, `ner_vop_anatomy_wip`, `ner_vop_clinical_dept_wip`, `ner_vop_demographic_wip`, `ner_vop_problem_reduced_wip`, `ner_vop_problem_wip`, `ner_vop_slim_wip`, `ner_vop_temporal_wip`, `ner_vop_test_wip`, `ner_vop_treatment_wip`, `ner_vop_wip`, `nerdl_tumour_demo` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_profiling_clinical_en_4.4.0_3.0_1682686984113.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_profiling_clinical_en_4.4.0_3.0_1682686984113.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline ner_profiling_pipeline = PretrainedPipeline("ner_profiling_clinical", "en", "clinical/models") result = ner_profiling_pipeline.annotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.""") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val ner_profiling_pipeline = PretrainedPipeline("ner_profiling_clinical", "en", "clinical/models") val result = ner_profiling_pipeline.annotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.""") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.profiling_clinical").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.""") ```
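`annotate()` returns one output field per included NER model, so with roughly a hundred models it helps to regroup the flat result dictionary by model name. The sketch below shows that post-processing over a mock result; the key naming is illustrative only (inspect `result.keys()` for the pipeline's actual field names):

```python
def group_by_model(annotations, suffix="_chunk"):
    """Collect per-model chunk outputs from a flat annotate() result dict,
    keyed by the model-name portion of each field name."""
    return {k[: -len(suffix)]: v for k, v in annotations.items() if k.endswith(suffix)}

# Mock of an annotate() result: two NER models plus a shared token field.
mock = {
    "ner_jsl_chunk": ["gestational diabetes mellitus", "obesity"],
    "ner_clinical_chunk": ["gestational diabetes mellitus"],
    "token": ["A", "28-year-old", "female"],
}
print(group_by_model(mock))
```

Grouping like this makes it easy to print a per-model block of predictions, as in the Results section below, or to diff which models agree on a given chunk.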
## Results ```bash ******************** ner_jsl Model Results ******************** [('28-year-old', 'Age'), ('female', 'Gender'), ('gestational diabetes mellitus', 'Diabetes'), ('eight years prior', 'RelativeDate'), ('subsequent', 'Modifier'), ('type two diabetes mellitus', 'Diabetes'), ('T2DM', 'Diabetes'), ('HTG-induced pancreatitis', 'Disease_Syndrome_Disorder'), ('three years prior', 'RelativeDate'), ('acute', 'Modifier'), ('hepatitis', 'Communicable_Disease'), ('obesity', 'Obesity'), ('body mass index', 'Symptom'), ('33.5 kg/m2', 'Weight'), ('one-week', 'Duration'), ('polyuria', 'Symptom'), ('polydipsia', 'Symptom'), ('poor appetite', 'Symptom'), ('vomiting', 'Symptom')] ******************** ner_diseases_large Model Results ******************** [('gestational diabetes mellitus', 'Disease'), ('diabetes mellitus', 'Disease'), ('T2DM', 'Disease'), ('pancreatitis', 'Disease'), ('hepatitis', 'Disease'), ('obesity', 'Disease'), ('polyuria', 'Disease'), ('polydipsia', 'Disease'), ('vomiting', 'Disease')] ******************** ner_radiology Model Results ******************** [('gestational diabetes mellitus', 'Disease_Syndrome_Disorder'), ('type two diabetes mellitus', 'Disease_Syndrome_Disorder'), ('T2DM', 'Disease_Syndrome_Disorder'), ('HTG-induced pancreatitis', 'Disease_Syndrome_Disorder'), ('acute hepatitis', 'Disease_Syndrome_Disorder'), ('obesity', 'Disease_Syndrome_Disorder'), ('body', 'BodyPart'), ('mass index', 'Symptom'), ('BMI', 'Test'), ('33.5', 'Measurements'), ('kg/m2', 'Units'), ('polyuria', 'Symptom'), ('polydipsia', 'Symptom'), ('poor appetite', 'Symptom'), ('vomiting', 'Symptom')] ******************** ner_clinical Model Results ******************** [('gestational diabetes mellitus', 'PROBLEM'), ('subsequent type two diabetes mellitus', 'PROBLEM'), ('T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index', 'PROBLEM'), ('BMI', 'TEST'), ('polyuria', 'PROBLEM'), 
('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')] ******************** ner_medmentions_coarse Model Results ******************** [('female', 'Organism_Attribute'), ('diabetes mellitus', 'Disease_or_Syndrome'), ('diabetes mellitus', 'Disease_or_Syndrome'), ('T2DM', 'Disease_or_Syndrome'), ('HTG-induced pancreatitis', 'Disease_or_Syndrome'), ('associated with', 'Qualitative_Concept'), ('acute hepatitis', 'Disease_or_Syndrome'), ('obesity', 'Disease_or_Syndrome'), ('body mass index', 'Clinical_Attribute'), ('BMI', 'Clinical_Attribute'), ('polyuria', 'Sign_or_Symptom'), ('polydipsia', 'Sign_or_Symptom'), ('poor appetite', 'Sign_or_Symptom'), ('vomiting', 'Sign_or_Symptom')] ... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_profiling_clinical| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|3.1 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter (one MedicalNerModel and NerConverter pair for each of the pretrained NER models listed above) - Finisher --- layout: model title: Multilingual T5ForConditionalGeneration Cased model (from qiaoyi) author: John Snow Labs name: t5_comment_summarization4designtutor date: 2023-01-30 tags: [ro, fr, de, en, open_source, t5, xx] task: Text Generation language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Comment_Summarization4DesignTutor` is a Multilingual model originally trained by `qiaoyi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_comment_summarization4designtutor_xx_4.3.0_3.0_1675096380129.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_comment_summarization4designtutor_xx_4.3.0_3.0_1675096380129.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_comment_summarization4designtutor","xx") \ .setInputCols(["document"]) \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_comment_summarization4designtutor","xx") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_comment_summarization4designtutor| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|xx| |Size:|271.8 MB| ## References - https://huggingface.co/qiaoyi/Comment_Summarization4DesignTutor - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/1805.12471 - https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf - https://aclanthology.org/I05-5002 - https://arxiv.org/abs/1708.00055 - https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs - https://arxiv.org/abs/1704.05426 - https://arxiv.org/abs/1606.05250 - https://link.springer.com/chapter/10.1007/11736790_9 - https://semanticsarchive.net/Archive/Tg3ZGI2M/Marneffe.pdf - https://www.researchgate.net/publication/221251392_Choice_of_Plausible_Alternatives_An_Evaluation_of_Commonsense_Causal_Reasoning - https://arxiv.org/abs/1808.09121 - https://aclanthology.org/N18-1023 - https://arxiv.org/abs/1810.12885 - https://arxiv.org/abs/1905.10044 - https://arxiv.org/pdf/1910.10683.pdf - https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67 --- layout: model title: Detect Bacterial Species (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_bacteria date: 2022-01-07 tags: [bacteria, bertfortokenclassification, ner, en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 2.4 supported: true annotator: MedicalBertForTokenClassifier article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect different species of bacteria in text using a pretrained NER model.
This model was trained with the `BertForTokenClassification` method from the `transformers` library and imported into Spark NLP. ## Predicted Entities `SPECIES` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bacteria_en_3.3.4_2.4_1641568604267.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bacteria_en_3.3.4_2.4_1641568604267.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") tokenClassifier = MedicalBertForTokenClassification.pretrained("bert_token_classifier_ner_bacteria", "en", "clinical/models")\ .setInputCols("token", "document")\ .setOutputCol("ner")\ .setCaseSensitive(True) ner_converter = NerConverter()\ .setInputCols(["document","token","ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter]) data = spark.createDataFrame([[ """Based on these genetic and phenotypic properties, we propose that strain SMSP (T) represents \ a novel species of the genus Methanoregula, for which we propose the name Methanoregula formicica \ sp. nov., with the type strain SMSP (T) (= NBRC 105244 (T) = DSM 22288 (T)).""" ]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val tokenClassifier = MedicalBertForTokenClassification.pretrained("bert_token_classifier_ner_bacteria", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("ner") .setCaseSensitive(true) val ner_converter = new NerConverter() .setInputCols(Array("document","token","ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter)) val data = Seq("""Based on these genetic and phenotypic properties, we propose that strain SMSP (T) represents a novel species of the genus Methanoregula, for which we propose the name Methanoregula formicica sp.
nov., with the type strain SMSP (T) (= NBRC 105244 (T) = DSM 22288 (T)).""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.ner_bacteria").predict("""Based on these genetic and phenotypic properties, we propose that strain SMSP (T) represents \ a novel species of the genus Methanoregula, for which we propose the name Methanoregula formicica \ sp. nov., with the type strain SMSP (T) (= NBRC 105244 (T) = DSM 22288 (T)).""") ```
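The model's raw `ner` output is token-level IOB tags (e.g. `B-SPECIES`, `I-SPECIES`, `O`), which the `NerConverter` stage merges into entity chunks. A minimal pure-Python sketch of that merging step (an illustration of the idea, not the annotator's actual implementation):

```python
# Collapse token-level IOB tags into (chunk, label) pairs, the way a
# NER converter stage conceptually works. Illustration only.

def iob_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the previous chunk
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)  # continue the open chunk
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:  # flush a chunk that runs to the end
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["the", "name", "Methanoregula", "formicica", "sp", "."]
tags   = ["O",   "O",    "B-SPECIES",     "I-SPECIES", "O",  "O"]
print(iob_to_chunks(tokens, tags))  # [('Methanoregula formicica', 'SPECIES')]
```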
## Results ```bash +-------------------+----------------------+ |chunk |ner_label | +-------------------+----------------------+ |erbA IRES |Organism | |erbA/myb virus |Organism | |erythroid cells |Cell | |bone marrow |Multi-tissue_structure| |blastoderm cultures|Cell | |erbA/myb IRES virus|Organism | |erbA IRES virus |Organism | |blastoderm |Cell | +-------------------+----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_bacteria| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## Data Source Trained on Cancer Genetics (CG) task of the BioNLP Shared Task 2013. https://aclanthology.org/W13-2008/ ## Benchmarking ```bash label precision recall f1-score support B-SPECIES 0.98 0.84 0.91 767 I-SPECIES 0.99 0.84 0.91 1043 accuracy - - 0.84 1810 macro-avg 0.85 0.89 0.87 1810 weighted-avg 0.99 0.84 0.91 1810 ``` --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_nl22 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-nl22` is an English model originally trained by `google`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl22_en_4.3.0_3.0_1675121912411.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl22_en_4.3.0_3.0_1675121912411.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_nl22","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_nl22","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_nl22| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|373.8 MB| ## References - https://huggingface.co/google/t5-efficient-small-nl22 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Extract Temporal Entities from Voice of the Patient Documents (embeddings_clinical_large) author: John Snow Labs name: ner_vop_temporal_emb_clinical_large_final date: 2023-06-06 tags: [clinical, licensed, ner, en, vop, temporal] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts temporal references from health-related text written in patients' own words. ## Predicted Entities `DateTime`, `Duration`, `Frequency` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_temporal_emb_clinical_large_final_en_4.4.3_3.0_1686076327359.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_temporal_emb_clinical_large_final_en_4.4.3_3.0_1686076327359.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_temporal_emb_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["I broke my arm playing football last month and had to get surgery in the orthopedic department. 
The cast just came off yesterday and I'm excited to start physical therapy and get back to the game."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_temporal_emb_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("I broke my arm playing football last month and had to get surgery in the orthopedic department. The cast just came off yesterday and I'm excited to start physical therapy and get back to the game.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:-----------|:------------| | last month | DateTime | | yesterday | DateTime | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_temporal_emb_clinical_large_final| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical_large| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
## Benchmarking ```bash label tp fp fn total precision recall f1 DateTime 4021 585 381 4402 0.87 0.91 0.89 Duration 1889 266 421 2310 0.88 0.82 0.85 Frequency 890 166 189 1079 0.84 0.82 0.83 macro_avg 6800 1017 991 7791 0.86 0.85 0.86 micro_avg 6800 1017 991 7791 0.87 0.87 0.87 ``` --- layout: model title: Italian BertForQuestionAnswering model (from mrm8488) author: John Snow Labs name: bert_qa_bert_italian_finedtuned_squadv1_it_alfa date: 2022-06-06 tags: [it, open_source, question_answering, bert] task: Question Answering language: it edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-italian-finedtuned-squadv1-it-alfa` is an Italian model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_italian_finedtuned_squadv1_it_alfa_it_4.0.0_3.0_1654536030553.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_italian_finedtuned_squadv1_it_alfa_it_4.0.0_3.0_1654536030553.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_italian_finedtuned_squadv1_it_alfa","it") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_italian_finedtuned_squadv1_it_alfa","it") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("it.answer_question.squad.bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
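In the NLU one-liner above, the question and context are packed into a single string separated by `|||`. A tiny helper for building such inputs in bulk; the only assumption is that separator convention, which is taken from the snippet itself:

```python
# Join (question, context) pairs into the "question|||context" strings
# expected by the NLU question-answering loader shown above.

def to_nlu_qa_inputs(pairs, sep="|||"):
    """Format (question, context) tuples for nlu.load(...).predict()."""
    return [f"{question}{sep}{context}" for question, context in pairs]

pairs = [("What's my name?", "My name is Clara and I live in Berkeley.")]
print(to_nlu_qa_inputs(pairs))
```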
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_italian_finedtuned_squadv1_it_alfa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|it| |Size:|410.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/bert-italian-finedtuned-squadv1-it-alfa - https://github.com/crux82/squad-it/blob/master/README.md#evaluating-a-neural-model-over-squad-it - https://twitter.com/mrm8488 - https://link.springer.com/chapter/10.1007/978-3-030-03840-3_29 - https://github.com/crux82/squad-it - https://rajpurkar.github.io/SQuAD-explorer/ - https://www.linkedin.com/in/manuel-romero-cs/ --- layout: model title: Pipeline to Summarize Clinical Notes author: John Snow Labs name: summarizer_clinical_jsl_pipeline date: 2023-05-29 tags: [licensed, en, clinical, text_summarization] task: Summarization language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [summarizer_clinical_jsl](https://nlp.johnsnowlabs.com/2023/03/25/summarizer_clinical_jsl.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_jsl_pipeline_en_4.4.2_3.0_1685392074544.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_jsl_pipeline_en_4.4.2_3.0_1685392074544.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("summarizer_clinical_jsl_pipeline", "en", "clinical/models") text = """Patient with hypertension, syncope, and spinal stenosis - for recheck. (Medical Transcription Sample Report) SUBJECTIVE: The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS: Reviewed and unchanged from the dictation on 12/03/2003. MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash. """ result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("summarizer_clinical_jsl_pipeline", "en", "clinical/models") val text = """Patient with hypertension, syncope, and spinal stenosis - for recheck. (Medical Transcription Sample Report) SUBJECTIVE: The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS: Reviewed and unchanged from the dictation on 12/03/2003. MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash. """ val result = pipeline.fullAnnotate(text) ```
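`fullAnnotate()` returns one dict per input text, mapping output columns to annotation objects whose `result` field holds the generated text. A minimal, mocked sketch of pulling out the summary strings; the `summary` column name is an assumption, so check the returned keys on your result:

```python
# Extract summary strings from a fullAnnotate()-style result.
# The mock below stands in for Spark NLP annotation objects; the
# "summary" output column name is assumed, not guaranteed.
from types import SimpleNamespace

def extract_summaries(full_result, output_col="summary"):
    """Return the .result text of every annotation in output_col, per row."""
    return [
        annotation.result
        for row in full_result
        for annotation in row.get(output_col, [])
    ]

mock_result = [{"summary": [SimpleNamespace(result="A 78-year-old female returns for recheck.")]}]
print(extract_summaries(mock_result))
```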
## Results ```bash A 78-year-old female with hypertension, syncope, and spinal stenosis returns for recheck. She denies chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. She is on multiple medications and has Elocon cream and Synalar cream for rash. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_clinical_jsl_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|936.7 MB| ## Included Models - DocumentAssembler - MedicalSummarizer --- layout: model title: English T5ForConditionalGeneration Cased model (from marcus2000) author: John Snow Labs name: t5_fine_tuned_model date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fine_tuned_t5_model` is an English model originally trained by `marcus2000`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_fine_tuned_model_en_4.3.0_3.0_1675101922079.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_fine_tuned_model_en_4.3.0_3.0_1675101922079.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_fine_tuned_model","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_fine_tuned_model","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_fine_tuned_model| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|918.7 MB| ## References - https://huggingface.co/marcus2000/fine_tuned_t5_model - https://paperswithcode.com/sota?task=automatic-speech-recognition&dataset=Librispeech+%28clean%29 --- layout: model title: English DistilBertForTokenClassification Cased model (from Lucifermorningstar011) author: John Snow Labs name: distilbert_token_classifier_autotrain_ner_778023879 date: 2023-03-14 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-ner-778023879` is an English model originally trained by `Lucifermorningstar011`. ## Predicted Entities `9`, `0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_ner_778023879_en_4.3.1_3.0_1678783208077.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_ner_778023879_en_4.3.1_3.0_1678783208077.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_ner_778023879","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_ner_778023879","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_ner_778023879| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Lucifermorningstar011/autotrain-ner-778023879 --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_4_h_128 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-4_H-128` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_128_zh_4.2.4_3.0_1670021654446.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_128_zh_4.2.4_3.0_1670021654446.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_128","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_128","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_4_h_128| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|13.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-4_H-128 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://cloud.tencent.com/ --- layout: model title: Legal NER for MAPA(Multilingual Anonymisation for Public Administrations) author: John Snow Labs name: legner_mapa date: 2023-04-27 tags: [es, licensed, legal, ner, mapa] task: Named Entity Recognition language: es edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The dataset consists of 12 documents taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union. This model extracts `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, and `PERSON` entities from `Spanish` documents. ## Predicted Entities `ADDRESS`, `AMOUNT`, `DATE`, `ORGANISATION`, `PERSON` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_mapa_es_1.0.0_3.0_1682593085140.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_mapa_es_1.0.0_3.0_1682593085140.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_base_es_cased", "es")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_mapa", "es", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""Heiko Jonny Maniero , de nacionalidad italiana , nació y reside en Alemania."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
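The pipeline returns its entities as `ner_chunk` annotations. For inspection outside Spark, the annotations can be flattened into `(chunk, label)` pairs; the sketch below assumes the typical Spark NLP annotation shape, where each annotation exposes a `result` string and a `metadata` map with an `entity` key (the helper name is hypothetical):

```python
# Hypothetical helper: flatten Spark NLP-style ner_chunk annotations into
# (chunk, label) pairs. Assumes each annotation carries `result` (the chunk
# text) and `metadata["entity"]` (the predicted label), as ner_chunk
# annotations typically do.
def chunks_to_rows(annotations):
    return [(ann["result"], ann["metadata"]["entity"]) for ann in annotations]

# Annotations shaped like the output for the sample sentence above:
anns = [
    {"result": "Heiko Jonny Maniero", "metadata": {"entity": "PERSON"}},
    {"result": "Alemania", "metadata": {"entity": "ADDRESS"}},
]
print(chunks_to_rows(anns))  # [('Heiko Jonny Maniero', 'PERSON'), ('Alemania', 'ADDRESS')]
```

The same pairs are what the Results table renders as `chunk` and `ner_label` columns.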
## Results ```bash +-------------------+---------+ |chunk |ner_label| +-------------------+---------+ |Heiko Jonny Maniero|PERSON | |Alemania |ADDRESS | +-------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_mapa| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| |Size:|1.4 MB| ## References The dataset is available [here](https://huggingface.co/datasets/joelito/mapa). ## Benchmarking ```bash label precision recall f1-score support ADDRESS 1.00 0.86 0.92 7 AMOUNT 1.00 1.00 1.00 1 DATE 1.00 0.92 0.96 24 ORGANISATION 0.83 0.71 0.77 7 PERSON 0.75 0.71 0.73 17 micro-avg 0.90 0.82 0.86 56 macro-avg 0.92 0.84 0.88 56 weighted-avg 0.90 0.82 0.86 56 ``` --- layout: model title: GloVe Embeddings 6B 300 (Multilingual) author: John Snow Labs name: glove_6B_300 date: 2020-01-22 task: Embeddings language: xx edition: Spark NLP 2.4.0 spark_version: 2.4 tags: [open_source, embeddings] supported: true annotator: WordEmbeddingsModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description GloVe (Global Vectors) is a model for distributed word representation. This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity. It outperformed many common Word2vec models on the word analogy task. One benefit of GloVe is that it is the result of directly modeling relationships, instead of getting them as a side effect of training a language model.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/glove_6B_300_xx_2.4.0_2.4_1579698630432.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/glove_6B_300_xx_2.4.0_2.4_1579698630432.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddings.pretrained("glove_6B_300", "xx") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love Spark NLP']], ["text"])) ``` ```scala val embeddings = WordEmbeddings.pretrained("glove_6B_300", "xx") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""I love Spark NLP"""] glove_df = nlu.load('xx.embed.glove.6B_300').predict(text) glove_df ```
{:.h2_title} ## Results ```bash token | glove_embeddings | -------|----------------------------------------------------| I | [0.1941000074148178, 0.22603000700473785, -0.4...] | love | [0.13948999345302582, 0.534529983997345, -0.25...] | Spark | [0.20353999733924866, 0.6292600035667419, 0.27...] | NLP | [0.059436000883579254, 0.18411000072956085, -0...] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|glove_6B_300| |Type:|embeddings| |Compatibility:|Spark NLP 2.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[xx]| |Dimension:|300| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from [https://nlp.stanford.edu/projects/glove/](https://nlp.stanford.edu/projects/glove/) --- layout: model title: English DistilBertForQuestionAnswering Cased model (from ajaypyatha) author: John Snow Labs name: distilbert_qa_sdsqna date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `sdsqna` is an English model originally trained by `ajaypyatha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_sdsqna_en_4.3.0_3.0_1672775452508.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_sdsqna_en_4.3.0_3.0_1672775452508.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sdsqna","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_sdsqna","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_sdsqna| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ajaypyatha/sdsqna --- layout: model title: Oncology Pipeline for Diagnosis Entities author: John Snow Labs name: oncology_diagnosis_pipeline date: 2023-03-29 tags: [licensed, pipeline, oncology, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.2 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline includes Named-Entity Recognition, Assertion Status, Relation Extraction and Entity Resolution models to extract information from oncology texts. This pipeline focuses on entities related to oncological diagnosis. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/oncology_diagnosis_pipeline_en_4.3.2_3.2_1680111732253.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/oncology_diagnosis_pipeline_en_4.3.2_3.2_1680111732253.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("oncology_diagnosis_pipeline", "en", "clinical/models") text = '''Two years ago, the patient presented with a 4-cm tumor in her left breast. She was diagnosed with ductal carcinoma. According to her last CT, she has no lung metastases.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("oncology_diagnosis_pipeline", "en", "clinical/models") val text = "Two years ago, the patient presented with a 4-cm tumor in her left breast. She was diagnosed with ductal carcinoma. According to her last CT, she has no lung metastases." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.oncology_diagnosis.pipeline").predict("""Two years ago, the patient presented with a 4-cm tumor in her left breast. She was diagnosed with ductal carcinoma. According to her last CT, she has no lung metastases.""") ```
## Results ```bash " ******************** ner_oncology_wip results ******************** | chunk | ner_label | |:-----------|:------------------| | 4-cm | Tumor_Size | | tumor | Tumor_Finding | | left | Direction | | breast | Site_Breast | | ductal | Histological_Type | | carcinoma | Cancer_Dx | | lung | Site_Lung | | metastases | Metastasis | ******************** ner_oncology_diagnosis_wip results ******************** | chunk | ner_label | |:-----------|:------------------| | 4-cm | Tumor_Size | | tumor | Tumor_Finding | | ductal | Histological_Type | | carcinoma | Cancer_Dx | | metastases | Metastasis | ******************** ner_oncology_tnm_wip results ******************** | chunk | ner_label | |:-----------|:------------------| | 4-cm | Tumor_Description | | tumor | Tumor | | ductal | Tumor_Description | | carcinoma | Cancer_Dx | | metastases | Metastasis | ******************** assertion_oncology_wip results ******************** | chunk | ner_label | assertion | |:-----------|:------------------|:------------| | tumor | Tumor_Finding | Present | | ductal | Histological_Type | Present | | carcinoma | Cancer_Dx | Present | | metastases | Metastasis | Absent | ******************** assertion_oncology_problem_wip results ******************** | chunk | ner_label | assertion | |:-----------|:------------------|:-----------------------| | tumor | Tumor_Finding | Medical_History | | ductal | Histological_Type | Medical_History | | carcinoma | Cancer_Dx | Medical_History | | metastases | Metastasis | Hypothetical_Or_Absent | ******************** re_oncology_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:---------|:--------------|:-----------|:--------------|:--------------| | 4-cm | Tumor_Size | tumor | Tumor_Finding | is_related_to | | 4-cm | Tumor_Size | carcinoma | Cancer_Dx | O | | tumor | Tumor_Finding | breast | Site_Breast | is_related_to | | breast | Site_Breast | carcinoma | Cancer_Dx | O | | lung | Site_Lung | metastases | 
Metastasis | is_related_to | ******************** re_oncology_granular_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:---------|:--------------|:-----------|:--------------|:---------------| | 4-cm | Tumor_Size | tumor | Tumor_Finding | is_size_of | | 4-cm | Tumor_Size | carcinoma | Cancer_Dx | O | | tumor | Tumor_Finding | breast | Site_Breast | is_location_of | | breast | Site_Breast | carcinoma | Cancer_Dx | O | | lung | Site_Lung | metastases | Metastasis | is_location_of | ******************** re_oncology_size_wip results ******************** | chunk1 | entity1 | chunk2 | entity2 | relation | |:---------|:-----------|:----------|:--------------|:-----------| | 4-cm | Tumor_Size | tumor | Tumor_Finding | is_size_of | | 4-cm | Tumor_Size | carcinoma | Cancer_Dx | O | ******************** ICD-O resolver results ******************** | chunk | ner_label | code | normalized_term | |:-----------|:------------------|:-------|:------------------| | tumor | Tumor_Finding | 8000/1 | tumor | | breast | Site_Breast | C50 | breast | | ductal | Histological_Type | 8500/2 | dcis | | carcinoma | Cancer_Dx | 8010/3 | carcinoma | | lung | Site_Lung | C34.9 | lung | | metastases | Metastasis | 8000/6 | tumor, metastatic | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|oncology_diagnosis_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|2.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - ChunkMergeModel - ChunkMergeModel - AssertionDLModel - AssertionDLModel - PerceptronModel - DependencyParserModel - RelationExtractionModel - RelationExtractionModel - RelationExtractionModel - ChunkMergeModel - Chunk2Doc - BertSentenceEmbeddings - SentenceEntityResolverModel --- layout: 
model title: Deidentify PHI (Large) author: John Snow Labs name: deidentify_large language: en nav_key: models repository: clinical/models date: 2020-08-04 task: De-identification edition: Healthcare NLP 2.5.5 spark_version: 2.4 tags: [deidentify, en, clinical, licensed] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description Deidentify (Large) is a deidentification model. It identifies instances of protected health information in text documents, and it can either obfuscate them (e.g., replacing names with different, fake names) or mask them (e.g., replacing "2020-06-04" with "<DATE>"). This model is useful for maintaining HIPAA compliance when dealing with text documents that contain protected health information. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/nerdl_deid_en_1.8.0_2.4_1545462443516.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/nerdl_deid_en_1.8.0_2.4_1545462443516.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... nlp_pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter]) data = spark.createDataFrame([['Patient AIQING, 25 month years-old , born in Beijing, was transfered to the The Johns Hopkins Hospital. Phone number: (541) 754-3010. MSW 100009632582 for his colonic polyps. He wants to know the results from them. He is not taking hydrochlorothiazide and is curious about his blood pressure. He said he has cut his alcohol back to 6 pack once a week. He has cut back his cigarettes to one time per week. P: Follow up with Dr. Hobbs in 3 months. Gilbert P. Perez, M.D.']], ["text"]) result = nlp_pipeline.fit(data).transform(data) obfuscation = DeIdentificationModel.pretrained("deidentify_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "ner_chunk"]) \ .setOutputCol("obfuscated") \ .setMode("obfuscate") deid_text = obfuscation.transform(result) ``` ```scala ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Patient AIQING, 25 month years-old , born in Beijing, was transfered to the The Johns Hopkins Hospital. Phone number: (541) 754-3010. MSW 100009632582 for his colonic polyps. He wants to know the results from them. He is not taking hydrochlorothiazide and is curious about his blood pressure. He said he has cut his alcohol back to 6 pack once a week. He has cut back his cigarettes to one time per week. P: Follow up with Dr. Hobbs in 3 months. Gilbert P. Perez, M.D.").toDF("text") val result = pipeline.fit(data).transform(data) val deid = DeIdentificationModel.pretrained("deidentify_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "ner_chunk")) .setOutputCol("obfuscated") .setMode("obfuscate") val deid_text = deid.transform(result) ``` {:.nlu-block} ```python import nlu nlu.load("en.de_identify.large").predict("""Patient AIQING, 25 month years-old , born in Beijing, was transfered to the The Johns Hopkins Hospital. Phone number: (541) 754-3010. MSW 100009632582 for his colonic polyps. He wants to know the results from them. He is not taking hydrochlorothiazide and is curious about his blood pressure. He said he has cut his alcohol back to 6 pack once a week. He has cut back his cigarettes to one time per week. P: Follow up with Dr. Hobbs in 3 months. Gilbert P. Perez, M.D.""") ```
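The model card's description distinguishes masking from obfuscation. As an illustration only (not the `DeIdentificationModel` implementation), here is a minimal sketch of mask-mode behavior, assuming Spark NLP's convention of character offsets with an inclusive `end`; the helper name is hypothetical:

```python
# Illustrative sketch of "mask" mode de-identification: replace each detected
# PHI span with a <LABEL> placeholder. Assumes character offsets with an
# inclusive `end`, as Spark NLP annotations use. Not the actual
# DeIdentificationModel implementation.
def mask_phi(text, chunks):
    # chunks: list of (begin, end, label) tuples, non-overlapping
    out, cursor = [], 0
    for begin, end, label in sorted(chunks):
        out.append(text[cursor:begin])
        out.append("<%s>" % label)
        cursor = end + 1  # inclusive end offset
    out.append(text[cursor:])
    return "".join(out)

text = "Follow up with Dr. Hobbs in 3 months."
# span for "Hobbs": begins at index 19, ends at 23 (inclusive)
print(mask_phi(text, [(19, 23, "DOCTOR")]))  # Follow up with Dr. <DOCTOR> in 3 months.
```

Obfuscation differs in that each span is replaced with a plausible fake value of the same type (a different name, a shifted date) rather than a `<LABEL>` placeholder.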
{:.h2_title} ## Results ```bash | | sentence | deidentified | |--:|--------------------------------------------------:|--------------------------------------------------:| | 0 | Patient AIQING, 25 month years-old , born in B... | Patient CAM, month years-old , born in M... | | 1 | Phone number: (541) 754-3010. | Phone number: (603)531-7148. | | 2 | MSW 100009632582 for his colonic polyps. | MSW for his colonic polyps. | | 3 | He wants to know the results from them. | He wants to know the results from them. | | 4 | He is not taking hydrochlorothiazide and is cu... | He is not taking hydrochlorothiazide and is cu... | | 5 | He said he has cut his alcohol back to 6 pack ... | He said he has cut his alcohol back to 6 pack ... | | 6 | He \nhas cut back his cigarettes to one time p... | He \nhas cut back his cigarettes to one time p... | | 7 | P: Follow up with Dr. Hobbs in 3 months. | P: Follow up with Dr. RODOLPH in 3 months. | | 8 | Gilbert P. Perez, M.D. | Gertie, M.D. | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|deidentify_large| |Type:|deid| |Compatibility:| Healthcare NLP 2.5.5| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, ner_chunk]| |Output Labels:|[obfuscated]| |Language:|en| |Case sensitive:|false| {:.h2_title} ## Data Source The model was trained based on data from [https://portal.dbmi.hms.harvard.edu/projects/n2c2-2014/](https://portal.dbmi.hms.harvard.edu/projects/n2c2-2014/) --- layout: model title: Modern Greek (1453-) asr_wav2vec2_large_xlsr_greek_2 TFWav2Vec2ForCTC from skylord author: John Snow Labs name: asr_wav2vec2_large_xlsr_greek_2 date: 2022-09-25 tags: [wav2vec2, el, audio, open_source, asr] task: Automatic Speech Recognition language: el edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to 
provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_greek_2` is a Modern Greek (1453-) model originally trained by skylord. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_greek_2_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_greek_2_el_4.2.0_3.0_1664111950103.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_greek_2_el_4.2.0_3.0_1664111950103.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_greek_2", "el")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_greek_2", "el") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
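Both snippets reference an `audioDf` that is not constructed in this card. Its `audio_content` column is expected to hold arrays of floats. One way to produce such an array from a mono 16-bit PCM WAV file (16 kHz is the rate Wav2Vec2 models are typically trained on) is sketched below using only the standard library; the helper name and the commented DataFrame line are illustrative assumptions, not part of the Spark NLP API:

```python
# Illustrative sketch: decode a mono 16-bit PCM WAV into the list of floats
# in [-1.0, 1.0] that Spark NLP's AudioAssembler consumes.
import io
import struct
import wave

def wav_to_floats(wav_bytes):
    """Decode mono 16-bit PCM WAV bytes into floats in [-1.0, 1.0]."""
    with wave.open(io.BytesIO(wav_bytes)) as wf:
        assert wf.getnchannels() == 1 and wf.getsampwidth() == 2
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# With a Spark session running, the DataFrame the snippets expect could then
# be built roughly as (assumed, not shown in the card):
# audioDf = spark.createDataFrame([[wav_to_floats(raw_bytes)]]).toDF("audio_content")
```

Resampling to the model's expected rate (if the source audio differs) would need an audio library and is out of scope for this sketch.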
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_greek_2| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|el| |Size:|1.2 GB| --- layout: model title: French CamemBert Embeddings (from peterhsu) author: John Snow Labs name: camembert_embeddings_tf_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tf-dummy-model` is a French model originally trained by `peterhsu`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_tf_generic_model_fr_3.4.4_3.0_1653992001210.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_tf_generic_model_fr_3.4.4_3.0_1653992001210.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_tf_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_tf_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_tf_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/peterhsu/tf-dummy-model --- layout: model title: English RobertaForQuestionAnswering (from tli8hf) author: John Snow Labs name: roberta_qa_unqover_roberta_base_newsqa date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `unqover-roberta-base-newsqa` is an English model originally trained by `tli8hf`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_unqover_roberta_base_newsqa_en_4.0.0_3.0_1655740101329.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_unqover_roberta_base_newsqa_en_4.0.0_3.0_1655740101329.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_unqover_roberta_base_newsqa","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_unqover_roberta_base_newsqa","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.news.roberta.base").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_unqover_roberta_base_newsqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|463.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/tli8hf/unqover-roberta-base-newsqa --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from ksabeh) author: John Snow Labs name: roberta_qa_base_attribute_correction date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-attribute-correction` is an English model originally trained by `ksabeh`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_attribute_correction_en_4.3.0_3.0_1674212650491.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_attribute_correction_en_4.3.0_3.0_1674212650491.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_attribute_correction","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_attribute_correction","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_attribute_correction| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|430.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/ksabeh/roberta-base-attribute-correction --- layout: model title: Legal Air And Space Transport Document Classifier (EURLEX) author: John Snow Labs name: legclf_air_and_space_transport_bert date: 2023-03-06 tags: [en, legal, classification, clauses, air_and_space_transport, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. Given a document, the legclf_air_and_space_transport_bert model, which is a Bert Sentence Embeddings Document Classifier, determines whether the document belongs to the class Air_and_Space_Transport or not (binary classification) according to EuroVoc labels. ## Predicted Entities `Air_and_Space_Transport`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_air_and_space_transport_bert_en_1.0.0_3.0_1678111618067.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_air_and_space_transport_bert_en_1.0.0_3.0_1678111618067.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_air_and_space_transport_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Air_and_Space_Transport]| |[Other]| |[Other]| |[Air_and_Space_Transport]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_air_and_space_transport_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.6 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Air_and_Space_Transport 0.94 0.97 0.95 65 Other 0.97 0.93 0.95 60 accuracy - - 0.95 125 macro-avg 0.95 0.95 0.95 125 weighted-avg 0.95 0.95 0.95 125 ``` --- layout: model title: Multilingual RobertaForQuestionAnswering (from leolin12345) author: John Snow Labs name: roberta_qa_ft_lr_cu_leolin12345 date: 2022-06-21 tags: [cu, ft, lr, open_source, question_answering, roberta, xx] task: Question Answering language: xx edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ft-lr-cu` is a Multilingual model originally trained by `leolin12345`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ft_lr_cu_leolin12345_xx_4.0.0_3.0_1655789046666.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ft_lr_cu_leolin12345_xx_4.0.0_3.0_1655789046666.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_ft_lr_cu_leolin12345","xx") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_ft_lr_cu_leolin12345","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("xx.answer_question.roberta").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_ft_lr_cu_leolin12345| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|xx| |Size:|466.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/leolin12345/ft-lr-cu --- layout: model title: Sentence Entity Resolver for Billable ICD10-CM HCC Codes (sbiobertresolve_icd10cm_slim_billable_hcc) author: John Snow Labs name: sbiobertresolve_icd10cm_slim_billable_hcc date: 2023-05-31 tags: [licensed, en, clinical, icd10cm, entity_resolution] task: Entity Resolution language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted clinical entities to ICD-10-CM codes using `sbiobert_base_cased_mli` sentence bert embeddings. In this model, synonyms having low cosine similarity to unnormalized terms are dropped. It returns the official resolution text within the brackets and also provides billable and Hierarchical Condition Categories (HCC) information of the codes in the `all_k_aux_labels` parameter in the metadata. This column can be split to get further details: `billable status || hcc status || hcc score`. For example, if `all_k_aux_labels` is `1||1||19`, the `billable status` is 1, the `hcc status` is 1, and the `hcc score` is 19.
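As an illustration of that encoding, each `all_k_aux_labels` entry can be split on `||` into its three fields. The helper below is a minimal sketch; the function name is ours, not part of the Spark NLP API:

```python
def parse_hcc_metadata(aux_label):
    """Split an `all_k_aux_labels` entry of the form
    'billable||hcc_status||hcc_score' into named integer fields."""
    billable, hcc_status, hcc_score = aux_label.split("||")
    return {
        "billable": int(billable),
        "hcc_status": int(hcc_status),
        "hcc_score": int(hcc_score),
    }

# Example from the description: billable code, HCC-relevant, score 19
print(parse_hcc_metadata("1||1||19"))
# {'billable': 1, 'hcc_status': 1, 'hcc_score': 19}
```

The same parsing applies to each element of the `hcc_list` column shown in the Results section, one entry per candidate code.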
## Predicted Entities `ICD-10-CM Codes`, `billable status`, `hcc status`, `hcc score` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_ICD10_CM.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_slim_billable_hcc_en_4.4.2_3.0_1685500916240.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_slim_billable_hcc_en_4.4.2_3.0_1685500916240.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("word_embeddings") ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token", "word_embeddings"])\ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(["PROBLEM"]) c2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sentence_embeddings")\ .setCaseSensitive(False) icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_slim_billable_hcc", "en", "clinical/models") \ .setInputCols(["sentence_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") resolver_pipeline = Pipeline(stages = [document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter, c2doc, sbert_embedder, icd_resolver]) data = spark.createDataFrame([["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, associated with obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, and vomiting.
Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection."""]]).toDF("text") result = resolver_pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") .setWhiteList("PROBLEM") val chunk2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols("ner_chunk_doc") .setOutputCol("sbert_embeddings") val icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_slim_billable_hcc","en", "clinical/models") .setInputCols("sbert_embeddings") .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, associated with obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, and 
vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash +-------------------------------------+-------+----------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+ | ner_chunk| entity|icd10_code| resolutions| all_codes| hcc_list| +-------------------------------------+-------+----------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+ | gestational diabetes mellitus|PROBLEM| O24.4|[gestational diabetes mellitus [gestational diabetes mellitus], gestatio...|[O24.4, O24.41, Z86.32, O24.11, O24.81, P70.2, O24.01, O24.42, O24.414, ...|[0||0||0, 0||0||0, 1||0||0, 0||0||0, 0||0||0, 1||0||0, 0||0||0, 0||0||0,...| |subsequent type two diabetes mellitus|PROBLEM| E11|[type 2 diabetes mellitus [type 2 diabetes mellitus], type 2 diabetes me...|[E11, E11.62, E11.5, E11.69, E11.59, E09, E11.6, E11.8, E11.4, E11.628, ...|[0||0||0, 0||0||0, 0||0||0, 1||1||18, 1||1||18, 0||0||0, 0||0||0, 1||1||...| | obesity|PROBLEM| E66|[overweight and obesity [overweight and obesity], overweight [overweight...|[E66, E66.3, E66.8, E66.0, E66.1, E88.81, E66.09, E66.01, E34.4, E66.9, ...|[0||0||0, 1||0||0, 1||0||0, 0||0||0, 1||0||0, 1||0||0, 1||0||0, 1||1||22...| | a body mass index|PROBLEM| Z68|[body mass index [bmi] [body mass index [bmi]], localized adiposity [loc...|[Z68, E65, L02.221, Z96.81, Y92.81, Y93.75, L02.23, L02.22, M67.49, R73,...|[0||0||0, 1||0||0, 1||0||0, 1||0||0, 0||0||0, 1||0||0, 0||0||0, 0||0||0,...| | polyuria|PROBLEM| R35|[polyuria [polyuria], nocturnal polyuria [nocturnal polyuria], other pol...|[R35, R35.81, R35.89, R35.8, R31, R30.0, E72.01, R80, R34, R82.4, R82.99...|[0||0||0, 1||0||0, 1||0||0, 0||0||0, 0||0||0, 1||0||0, 1||1||23, 0||0||0...| | 
polydipsia|PROBLEM| R63.1|[polydipsia [polydipsia], polyhydramnios [polyhydramnios], parasomnia [p...|[R63.1, O40, G47.5, R63.2, R00.2, G47.1, G47.13, F51.11, G47.19, L68.3, ...|[1||0||0, 0||0||0, 0||0||0, 1||0||0, 1||0||0, 0||0||0, 1||0||0, 1||0||0,...| | vomiting|PROBLEM| R11.1|[vomiting [vomiting], cyclical vomiting [cyclical vomiting], nausea [nau...|[R11.1, G43.A, R11.0, R11, R11.14, R11.12, R23.1, G47.51, R11.10, H57.03...|[0||0||0, 0||0||0, 1||0||0, 0||0||0, 1||0||0, 1||0||0, 1||0||0, 1||0||0,...| | a respiratory tract infection|PROBLEM| T17|[foreign body in respiratory tract [foreign body in respiratory tract], ...|[T17, T81.4, T81.81, J95.851, T17.8, Z87.0, J44.0, J06, T81.44, Z22, T17...|[0||0||0, 0||0||0, 0||0||0, 1||1||114, 0||0||0, 0||0||0, 1||1||111, 0||0...| +-------------------------------------+-------+----------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icd10cm_slim_billable_hcc| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Size:|435.3 MB| |Case sensitive:|false| --- layout: model title: Chinese Bert Embeddings (Base, trained on Oscar dataset) author: John Snow Labs name: bert_embeddings_mengzi_oscar_base date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `mengzi-oscar-base` is a Chinese model originally trained by `Langboat`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_mengzi_oscar_base_zh_3.4.2_3.0_1649670580702.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_mengzi_oscar_base_zh_3.4.2_3.0_1649670580702.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_mengzi_oscar_base","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_mengzi_oscar_base","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.mengzi_oscar_base").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_mengzi_oscar_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|383.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/Langboat/mengzi-oscar-base - https://github.com/microsoft/Oscar - https://github.com/Langboat/Mengzi - https://arxiv.org/abs/2110.06696 - https://github.com/microsoft/Oscar/blob/master/INSTALL.md - https://github.com/Langboat/Mengzi/blob/main/Mengzi-Oscar.md --- layout: model title: Pipeline to Detect Drug Information author: John Snow Labs name: ner_posology_healthcare_pipeline date: 2022-03-22 tags: [licensed, ner, clinical, drug, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_posology_healthcare](https://nlp.johnsnowlabs.com/2021/04/01/ner_posology_healthcare_en.html) model. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_POSOLOGY.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_healthcare_pipeline_en_3.4.1_3.0_1647942842636.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_healthcare_pipeline_en_3.4.1_3.0_1647942842636.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_posology_healthcare_pipeline", "en", "clinical/models") pipeline.fullAnnotate('The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. 
daily, and Bactrim DS b.i.d.') ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_posology_healthcare_pipeline", "en", "clinical/models") pipeline.fullAnnotate("The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. 
daily, and Bactrim DS b.i.d.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology.healthcare_pipeline").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""") ```
## Results ```bash +--------------+---------+ |chunks |entities | +--------------+---------+ |insulin |Drug | |Bactrim |Drug | |for 14 days |Duration | |Fragmin |Drug | |5000 units |Dosage | |subcutaneously|Route | |daily |Frequency| |Xenaderm |Drug | |topically |Route | |b.i.d |Frequency| |Lantus |Drug | |40 units |Dosage | |subcutaneously|Route | |at bedtime |Frequency| |OxyContin |Drug | |30 mg |Strength | |p.o |Route | |q.12 h |Frequency| |folic acid |Drug | |1 mg |Strength | +--------------+---------+ only showing top 20 rows ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_posology_healthcare_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|513.6 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Translate Romance languages to English Pipeline author: John Snow Labs name: translate_roa_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, roa, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences.
The use of an accelerator such as GPU is recommended. - source languages: `roa` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_roa_en_xx_2.7.0_2.4_1609687168744.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_roa_en_xx_2.7.0_2.4_1609687168744.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_roa_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_roa_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.roa.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_roa_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForSequenceClassification Cased model (from Shenzy) author: John Snow Labs name: roberta_classifier_sentence_classification4designtutor date: 2022-12-09 tags: [en, open_source, roberta, sequence_classification, classification] task: Text Classification language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `Sentence_Classification4DesignTutor` is an English model originally trained by `Shenzy`. ## Predicted Entities `2`, `1`, `0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_classifier_sentence_classification4designtutor_en_4.2.4_3.0_1670621867959.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_classifier_sentence_classification4designtutor_en_4.2.4_3.0_1670621867959.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_sentence_classification4designtutor","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, roberta_classifier]) data = spark.createDataFrame([["I love you!"], ["I feel lucky to be here."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val roberta_classifier = RoBertaForSequenceClassification.pretrained("roberta_classifier_sentence_classification4designtutor","en") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, roberta_classifier)) val data = Seq("I love you!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
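Internally, a sequence classifier of this kind scores the whole sentence and emits whichever of the labels `2`, `1`, `0` receives the highest probability. A minimal, library-free sketch of that final softmax-and-argmax step (the logits below are invented for illustration, not produced by the model):

```python
import math

def softmax(logits):
    """Convert raw class logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, labels):
    """Return the highest-probability label, as the annotator does for its `class` output."""
    probs = softmax(logits)
    return labels[probs.index(max(probs))], probs

# Hypothetical logits for the three classes `0`, `1`, `2`
label, probs = predict_label([0.3, 2.1, -1.0], ["0", "1", "2"])
```

The `result` column in the pipeline output above contains exactly this kind of argmax label, one per input row.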
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_classifier_sentence_classification4designtutor| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Shenzy/Sentence_Classification4DesignTutor --- layout: model title: Hindi BertForQuestionAnswering model (from bhavikardeshna) author: John Snow Labs name: bert_qa_multilingual_bert_base_cased_hindi date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: hi edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `multilingual-bert-base-cased-hindi` is a Hindi model originally trained by `bhavikardeshna`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_bert_base_cased_hindi_hi_4.0.0_3.0_1654188532153.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_multilingual_bert_base_cased_hindi_hi_4.0.0_3.0_1654188532153.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_multilingual_bert_base_cased_hindi","hi") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_multilingual_bert_base_cased_hindi","hi") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("hi.answer_question.bert.multilingual_hindi_tuned_base_cased.by_bhavikardeshna").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
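Span-based question answering of this kind boils down to picking the start and end tokens in the context that jointly maximize the model's start and end scores. A small sketch of that selection step, using hypothetical per-token scores (not the model's real outputs) for the example context above:

```python
def best_span(start_scores, end_scores, max_len=30):
    """Pick the (start, end) token pair maximizing start+end score, with end >= start."""
    best = (0, 0)
    best_score = float("-inf")
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = ss + end_scores[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Hypothetical scores: the model strongly favors "Clara" as both start and end
start = [0.1, 0.0, 0.2, 3.0, 0.0, 0.1, 0.0, 0.0, 0.5, 0.0]
end   = [0.0, 0.1, 0.0, 2.5, 0.0, 0.0, 0.1, 0.0, 0.9, 0.0]
s, e = best_span(start, end)
answer = " ".join(tokens[s:e + 1])
```

The `answer` column produced by the pipeline above holds the text of this best-scoring span.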
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_multilingual_bert_base_cased_hindi| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|hi| |Size:|665.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/bhavikardeshna/multilingual-bert-base-cased-hindi --- layout: model title: Word2Vec Embeddings in Catalan (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, ca, open_source] task: Embeddings language: ca edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ca_3.4.1_3.0_1647288979526.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ca_3.4.1_3.0_1647288979526.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ca") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["M'encanta Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ca") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("M'encanta Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ca.embed.w2v_cc_300d").predict("""M'encanta Spark NLP""") ```
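As the description says, this annotator is a lookup table from tokens to vectors: each token is matched against the table (case-insensitively here, consistent with the card's `Case sensitive: false`), and out-of-vocabulary tokens get a default vector. A toy sketch with an invented 3-dimensional table standing in for the real 300-dimensional one:

```python
def lookup_embeddings(tokens, table, dim=300):
    """Map each token to its vector; unseen tokens get a zero vector of the same dimension."""
    zero = [0.0] * dim
    return [table.get(token.lower(), zero) for token in tokens]

# Tiny made-up table; the real model stores 300-dimensional vectors for Catalan
table = {"spark": [0.1, 0.2, 0.3], "nlp": [0.4, 0.5, 0.6]}
vectors = lookup_embeddings(["Spark", "NLP", "rocks"], table, dim=3)
```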
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|ca| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Legal Economic Conditions Document Classifier (EURLEX) author: John Snow Labs name: legclf_economic_conditions_bert date: 2023-03-06 tags: [en, legal, classification, clauses, economic_conditions, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_economic_conditions_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the class Economic_Conditions or not (Binary Classification) according to EuroVoc labels. ## Predicted Entities `Economic_Conditions`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_economic_conditions_bert_en_1.0.0_3.0_1678111905018.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_economic_conditions_bert_en_1.0.0_3.0_1678111905018.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_economic_conditions_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
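The `macro-avg` and `weighted-avg` rows in benchmarking tables like the one below are derived from the per-class rows: the macro average treats every class equally, while the weighted average scales each class by its support. A quick library-free check using this card's reported per-class numbers:

```python
def macro_and_weighted(per_class):
    """per_class maps label -> (precision, recall, f1, support); returns macro and support-weighted averages."""
    n = len(per_class)
    total_support = sum(m[3] for m in per_class.values())
    macro = [sum(m[i] for m in per_class.values()) / n for i in range(3)]
    weighted = [sum(m[i] * m[3] for m in per_class.values()) / total_support for i in range(3)]
    return macro, weighted

# Per-class scores taken from the Benchmarking section of this card
scores = {
    "Economic_Conditions": (0.85, 0.85, 0.85, 62),
    "Other": (0.84, 0.84, 0.84, 56),
}
macro, weighted = macro_and_weighted(scores)
```

Both averages land at roughly 0.845, which the table reports as 0.85 after rounding to two decimals.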
## Results ```bash +-------+ |result| +-------+ |[Economic_Conditions]| |[Other]| |[Other]| |[Economic_Conditions]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_economic_conditions_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.8 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Economic_Conditions 0.85 0.85 0.85 62 Other 0.84 0.84 0.84 56 accuracy - - 0.85 118 macro-avg 0.85 0.85 0.85 118 weighted-avg 0.85 0.85 0.85 118 ``` --- layout: model title: Legal Sales Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_sales_agreement_bert date: 2023-02-02 tags: [en, legal, classification, sales, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_sales_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `sales-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `sales-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_sales_agreement_bert_en_1.0.0_3.0_1675359477213.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_sales_agreement_bert_en_1.0.0_3.0_1675359477213.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_sales_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[sales-agreement]| |[other]| |[other]| |[sales-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_sales_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.88 0.93 0.90 122 sales-agreement 0.84 0.76 0.80 63 accuracy - - 0.87 185 macro-avg 0.86 0.84 0.85 185 weighted-avg 0.87 0.87 0.87 185 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Shona author: John Snow Labs name: opus_mt_en_sn date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, sn, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en` - target languages: `sn` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_sn_xx_2.7.0_2.4_1609170752195.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_sn_xx_2.7.0_2.4_1609170752195.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_sn", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Your sentence to translate!"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_sn", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.sn').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_sn| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Financial News Summarization (Headers, Large) author: John Snow Labs name: finsum_news_headers_lg date: 2022-11-24 tags: [financial, summarization, summary, en, licensed] task: Summarization language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Financial news Summarizer, aimed to extract headers from financial news. This is a `lg` (large) version. Other versions may be found in Models Hub. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finsum_news_headers_lg_en_1.0.0_3.0_1669287223603.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finsum_news_headers_lg_en_1.0.0_3.0_1669287223603.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") t5 = nlp.T5Transformer() \ .pretrained("finsum_news_headers_lg" ,"en", "finance/models") \ .setTask("summarization") \ .setInputCols(["documents"]) \ .setMaxOutputLength(512) \ .setOutputCol("summaries") data_df = spark.createDataFrame([["FTX is expected to make its debut appearance Tuesday in Delaware bankruptcy court, where its new management is expected to recount events leading up to the cryptocurrency platform’s sudden collapse and explain the steps it has since taken to secure customer funds and other assets."]]).toDF("text") pipeline = nlp.Pipeline().setStages([document_assembler, t5]) results = pipeline.fit(data_df).transform(data_df) results.select("summaries.result").show(truncate=False) ```
## Results ```bash FTX to Make Debut in Delaware Bankruptcy Court Tuesday. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finsum_news_headers_lg| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[summaries]| |Language:|en| |Size:|925.6 MB| ## References In-house JSL financial summarized news. --- layout: model title: Sentence Entity Resolver for NDC (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_ndc date: 2022-04-18 tags: [ndc, entity_resolution, licensed, en, clinical] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.3.2 spark_version: 2.4 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps clinical entities and concepts (like drugs/ingredients) to [National Drug Codes](https://www.fda.gov/drugs/drug-approvals-and-databases/national-drug-code-directory) using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It also returns package options and alternative drugs in the all_k_aux_label column. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_NDC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_NDC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_ndc_en_3.3.2_2.4_1650298194939.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_ndc_en_3.3.2_2.4_1650298194939.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") posology_ner = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(["DRUG"]) c2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sentence_embeddings") ndc_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_ndc", "en", "clinical/models") \ .setInputCols(["ner_chunk", "sentence_embeddings"]) \ .setOutputCol("ndc")\ .setDistanceFunction("EUCLIDEAN")\ .setCaseSensitive(False) resolver_pipeline = Pipeline(stages = [ documentAssembler, sentenceDetector, tokenizer, word_embeddings, posology_ner, ner_converter, c2doc, sbert_embedder, ndc_resolver ]) data = spark.createDataFrame([["""The patient was given aspirin 81 mg and metformin 500 mg"""]]).toDF("text") result = resolver_pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = 
WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val posology_ner = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("DRUG")) val c2doc = new Chunk2Doc() .setInputCols("ner_chunk") .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sentence_embeddings") val ndc_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_ndc", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sentence_embeddings")) .setOutputCol("ndc") .setDistanceFunction("EUCLIDEAN") .setCaseSensitive(false) val resolver_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, word_embeddings, posology_ner, ner_converter, c2doc, sbert_embedder, ndc_resolver )) val clinical_note = Seq("The patient was given aspirin 81 mg and metformin 500 mg").toDS.toDF("text") val results = resolver_pipeline.fit(clinical_note).transform(clinical_note) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.ndc").predict("""The patient was given aspirin 81 mg and metformin 500 mg""") ```
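Conceptually, the resolver embeds each drug chunk with Sentence-BERT and returns the NDC code whose stored embedding is nearest under the configured `EUCLIDEAN` distance. A toy sketch of that nearest-neighbor step (the 3-dimensional vectors below are invented; the real embeddings are far higher-dimensional):

```python
import math

def resolve(chunk_embedding, code_embeddings):
    """Return the NDC code whose embedding has the smallest Euclidean distance to the chunk."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(code_embeddings, key=lambda code: dist(chunk_embedding, code_embeddings[code]))

# Invented embeddings for two codes that appear in the Results section below
codes = {
    "41250-0780": [0.9, 0.1, 0.0],
    "62207-0491": [0.0, 0.8, 0.2],
}
best = resolve([0.85, 0.15, 0.05], codes)
```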
## Results ```bash +----------------+----------+----------------------------------------------------------------------------------------------------+ | ner_chunk| ndc_code| aux_labels| +----------------+----------+----------------------------------------------------------------------------------------------------+ | aspirin 81 mg|41250-0780|{'packages': "['1 BOTTLE, PLASTIC in 1 PACKAGE (41250-780-01) > 120 TABLET, DELAYED RELEASE in 1...| |metformin 500 mg|62207-0491|{'packages': "['5000 TABLET in 1 POUCH (62207-491-31)', '25000 TABLET in 1 CARTON (62207-491-35)'...| +----------------+----------+----------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_ndc| |Compatibility:|Healthcare NLP 3.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[ndc]| |Language:|en| |Size:|932.2 MB| |Case sensitive:|false| ## References It is trained on U.S. FDA 2022-NDC Codes dataset. --- layout: model title: Legal Security Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_security_agreement date: 2022-11-10 tags: [en, legal, classification, agreement, security, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_security_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class `security-agreement` or not (Binary Classification). Longformers have a restriction on 4096 tokens, so only the first 4096 tokens will be taken into account. 
We have found that, for the vast majority of documents in legal corpora, provided they are clean and contain only the legal document itself without any extra leading text, 4096 tokens are enough to perform Document Classification. If that is not the case, let us know and we can apply another approach for you: splitting the document into chunks of 4096 tokens, averaging their embeddings, and training on the averaged version, which means the whole document is taken into account. In theory, however, this should not be required. ## Predicted Entities `security-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_security_agreement_en_1.0.0_3.0_1668111327495.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_security_agreement_en_1.0.0_3.0_1668111327495.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = nlp.ClassifierDLModel.pretrained("legclf_security_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
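The chunk-and-average fallback mentioned in the description, splitting a long document into 4096-token windows, embedding each window, and averaging the results, can be sketched as follows (toy 2-dimensional vectors and a chunk size of 2 keep the example small; the real setup would use 4096-token chunks of Longformer embeddings):

```python
def chunked_average(token_embeddings, chunk_size=4096):
    """Average embeddings within each chunk, then average the chunk means so the whole document counts."""
    dim = len(token_embeddings[0])
    chunk_means = []
    for i in range(0, len(token_embeddings), chunk_size):
        chunk = token_embeddings[i:i + chunk_size]
        chunk_means.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return [sum(m[d] for m in chunk_means) / len(chunk_means) for d in range(dim)]

# Toy document of three 2-d token embeddings, chunked two at a time
doc = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
avg = chunked_average(doc, chunk_size=2)
```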
## Results ```bash +-------+ | result| +-------+ |[security-agreement]| |[other]| |[other]| |[security-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_security_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.4 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.92 0.96 0.94 85 security-agreement 0.92 0.82 0.87 40 accuracy - - 0.92 125 macro-avg 0.92 0.89 0.91 125 weighted-avg 0.92 0.92 0.92 125 ``` --- layout: model title: Fast Neural Machine Translation Model from English to Sino-Tibetan Languages author: John Snow Labs name: opus_mt_en_sit date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, sit, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `en` - target languages: `sit` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_sit_xx_2.7.0_2.4_1609169720133.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_sit_xx_2.7.0_2.4_1609169720133.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_sit", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["Your sentence to translate!"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_sit", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.sit').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_sit| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering (from thatdramebaazguy) author: John Snow Labs name: roberta_qa_movie_roberta_squad date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `movie-roberta-squad` is an English model originally trained by `thatdramebaazguy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_movie_roberta_squad_en_4.0.0_3.0_1655729079202.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_movie_roberta_squad_en_4.0.0_3.0_1655729079202.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_movie_roberta_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_movie_roberta_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.movie_squad.roberta.by_thatdramebaazguy").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_movie_roberta_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|466.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/thatdramebaazguy/movie-roberta-squad --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_jessiejohnson TFWav2Vec2ForCTC from jessiejohnson author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab_by_jessiejohnson date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_jessiejohnson` is an English model originally trained by jessiejohnson. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab_by_jessiejohnson_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_jessiejohnson_en_4.2.0_3.0_1664019369584.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab_by_jessiejohnson_en_4.2.0_3.0_1664019369584.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab_by_jessiejohnson", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab_by_jessiejohnson", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab_by_jessiejohnson| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|228.0 MB| --- layout: model title: Toxic Comment Classification author: John Snow Labs name: multiclassifierdl_use_toxic date: 2021-01-21 task: Text Classification language: en nav_key: models edition: Spark NLP 2.7.1 spark_version: 2.4 tags: [en, open_source, text_classification] supported: true annotator: MultiClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments. The Conversation AI team, a research initiative founded by Jigsaw and Google (both a part of Alphabet), is working on tools to help improve the online conversation. One area of focus is the study of negative online behaviors, like toxic comments (i.e. comments that are rude, disrespectful, or otherwise likely to make someone leave a discussion). So far they’ve built a range of publicly available models served through the Perspective API, including toxicity. But the current models still make errors, and they don’t allow users to select which types of toxicity they’re interested in finding (e.g. some platforms may be fine with profanity, but not with other types of toxic content). Automatically detect identity hate, insult, obscene, severe toxic, threat, or toxic content in social media comments using our out-of-the-box Spark NLP Multiclassifier DL.
## Predicted Entities `toxic`, `severe_toxic`, `identity_hate`, `insult`, `obscene`, `threat` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_MULTILABEL_TOXIC/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_MULTILABEL_TOXIC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/multiclassifierdl_use_toxic_en_2.7.1_2.4_1611231604648.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/multiclassifierdl_use_toxic_en_2.7.1_2.4_1611231604648.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") use = UniversalSentenceEncoder.pretrained() \ .setInputCols(["document"])\ .setOutputCol("use_embeddings") docClassifier = MultiClassifierDLModel.pretrained("multiclassifierdl_use_toxic") \ .setInputCols(["use_embeddings"])\ .setOutputCol("category")\ .setThreshold(0.5) pipeline = Pipeline( stages = [ document, use, docClassifier ]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") .setCleanupMode("shrink") val use = UniversalSentenceEncoder.pretrained() .setInputCols("document") .setOutputCol("use_embeddings") val docClassifier = MultiClassifierDLModel.pretrained("multiclassifierdl_use_toxic") .setInputCols("use_embeddings") .setOutputCol("category") .setThreshold(0.5f) val pipeline = new Pipeline() .setStages( Array( documentAssembler, use, docClassifier ) ) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.toxic").predict("""Put your text here.""") ```
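The classifier is multi-label: each of the six categories gets an independent score, and `setThreshold(0.5)` decides which labels appear in the `category` column. A toy illustration of that selection logic (the scores below are invented, not model output):

```python
def predicted_labels(scores, threshold=0.5):
    """Keep every label whose score meets the threshold (multi-label, not softmax)."""
    return sorted(label for label, score in scores.items() if score >= threshold)

# Hypothetical per-label scores for a single comment
scores = {"toxic": 0.97, "insult": 0.64, "obscene": 0.48, "threat": 0.02}
print(predicted_labels(scores))  # → ['insult', 'toxic']
```

Lowering the threshold trades precision for recall; a comment can carry several labels at once, or none.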
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|multiclassifierdl_use_toxic| |Compatibility:|Spark NLP 2.7.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[use_embeddings]| |Output Labels:|[category]| |Language:|en| ## Data Source https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/overview ## Benchmarking ```bash precision recall f1-score support 0 0.56 0.30 0.39 127 1 0.71 0.70 0.70 761 2 0.76 0.72 0.74 824 3 0.55 0.21 0.31 147 4 0.79 0.38 0.51 50 5 0.94 1.00 0.97 1504 micro avg 0.83 0.80 0.81 3413 macro avg 0.72 0.55 0.60 3413 weighted avg 0.81 0.80 0.80 3413 samples avg 0.84 0.83 0.80 3413 F1 micro averaging: 0.8113432835820896 ``` --- layout: model title: English BertForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_4 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-32-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_4_en_4.0.0_3.0_1657192524287.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_4_en_4.0.0_3.0_1657192524287.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
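For readers new to extractive QA: models like this score every token as a candidate answer start and end, then return the highest-scoring valid span from the context. A toy sketch of that selection step (illustrative only — not Spark NLP's internal implementation; the scores are made up):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) token indices maximizing start+end score, with end >= start."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        # Only consider spans of at most max_len tokens starting at i
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best, best_score = (i, j), s + end_scores[j]
    return best

# Tokens: ["My", "name", "is", "Clara"] — both peaks at index 3 ("Clara")
print(best_span([0.1, 0.0, 0.2, 5.0], [0.0, 0.1, 0.3, 4.0]))  # → (3, 3)
```

The `max_len` cap mirrors the common restriction that answers be short spans rather than whole paragraphs.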
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_4| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|377.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-32-finetuned-squad-seed-4 --- layout: model title: Gujarati RoBERTa Embeddings (from surajp) author: John Snow Labs name: roberta_embeddings_RoBERTa_hindi_guj_san date: 2022-04-14 tags: [roberta, embeddings, gu, open_source] task: Embeddings language: gu edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `RoBERTa-hindi-guj-san` is a Gujarati model originally trained by `surajp`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_RoBERTa_hindi_guj_san_gu_3.4.2_3.0_1649948207679.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_RoBERTa_hindi_guj_san_gu_3.4.2_3.0_1649948207679.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_RoBERTa_hindi_guj_san","gu") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["મને સ્પાર્ક એનએલપી ગમે છે"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_RoBERTa_hindi_guj_san","gu") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("મને સ્પાર્ક એનએલપી ગમે છે").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("gu.embed.RoBERTa_hindi_guj_san").predict("""મને સ્પાર્ક એનએલપી ગમે છે""") ```
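The pipeline emits one embedding per token in the `embeddings` column. A common next step is mean-pooling the token vectors into a single sentence vector; a dependency-free sketch of that step (plain lists stand in for the annotation objects Spark NLP actually returns):

```python
def mean_pool(token_vectors):
    """Average a list of equally-sized token vectors into one sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

# Two 2-dimensional "token embeddings" → their element-wise mean
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # → [2.0, 3.0]
```

Max-pooling or using only the first token's vector are common alternatives; which works best depends on the downstream task.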
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_RoBERTa_hindi_guj_san| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|gu| |Size:|252.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/surajp/RoBERTa-hindi-guj-san - https://github.com/goru001/inltk - https://www.kaggle.com/disisbig/hindi-wikipedia-articles-172k - https://www.kaggle.com/disisbig/gujarati-wikipedia-articles - https://www.kaggle.com/disisbig/sanskrit-wikipedia-articles - https://twitter.com/parmarsuraj99 - https://www.linkedin.com/in/parmarsuraj99/ --- layout: model title: English RobertaForQuestionAnswering (from phiyodr) author: John Snow Labs name: roberta_qa_roberta_large_finetuned_squad2 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-finetuned-squad2` is an English model originally trained by `phiyodr`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_finetuned_squad2_en_4.0.0_3.0_1655737049100.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_finetuned_squad2_en_4.0.0_3.0_1655737049100.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_finetuned_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_large_finetuned_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.large.by_phiyodr").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_large_finetuned_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/phiyodr/roberta-large-finetuned-squad2 - https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json - https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/ - https://arxiv.org/abs/1907.11692 - https://arxiv.org/abs/1806.03822 - https://rajpurkar.github.io/SQuAD-explorer/ --- layout: model title: Pipeline to Detect Living Species (w2v_cc_300d) author: John Snow Labs name: ner_living_species_pipeline date: 2023-03-13 tags: [it, ner, clinical, licensed] task: Named Entity Recognition language: it edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_living_species](https://nlp.johnsnowlabs.com/2022/06/23/ner_living_species_it_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_pipeline_it_4.3.0_3.2_1678703906819.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_pipeline_it_4.3.0_3.2_1678703906819.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_living_species_pipeline", "it", "clinical/models") text = '''Una donna di 74 anni è stata ricoverata con dolore addominale diffuso, ipossia e astenia di 2 settimane di evoluzione. La sua storia personale includeva ipertensione in trattamento con amiloride/idroclorotiazide e dislipidemia controllata con lovastatina. La sua storia familiare era: madre morta di cancro gastrico, fratello con cirrosi epatica di eziologia sconosciuta e sorella con carcinoma epatocellulare. Lo studio eziologico delle diverse cause di malattia epatica cronica comprendeva: virus epatotropi (HBV, HCV) e HIV, studio dell'autoimmunità, ceruloplasmina, ferritina e porfirine nelle urine, tutti risultati negativi. Il paziente è stato messo in trattamento anticoagulante con acenocumarolo e diuretici a tempo indeterminato.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_living_species_pipeline", "it", "clinical/models") val text = "Una donna di 74 anni è stata ricoverata con dolore addominale diffuso, ipossia e astenia di 2 settimane di evoluzione. La sua storia personale includeva ipertensione in trattamento con amiloride/idroclorotiazide e dislipidemia controllata con lovastatina. La sua storia familiare era: madre morta di cancro gastrico, fratello con cirrosi epatica di eziologia sconosciuta e sorella con carcinoma epatocellulare. Lo studio eziologico delle diverse cause di malattia epatica cronica comprendeva: virus epatotropi (HBV, HCV) e HIV, studio dell'autoimmunità, ceruloplasmina, ferritina e porfirine nelle urine, tutti risultati negativi. Il paziente è stato messo in trattamento anticoagulante con acenocumarolo e diuretici a tempo indeterminato." val result = pipeline.fullAnnotate(text) ```
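`fullAnnotate` returns, for each input text, a dictionary keyed by output column whose values are annotation objects carrying `result`, `begin`, `end`, and a `metadata` map. A sketch of flattening those into rows like the Results table (the `Ann` dataclass is a stand-in for Spark NLP's annotation type, and `ner_chunk` is the assumed chunk column name):

```python
from dataclasses import dataclass, field

@dataclass
class Ann:  # stand-in for sparknlp.annotation.Annotation
    result: str
    begin: int
    end: int
    metadata: dict = field(default_factory=dict)

def chunk_rows(annotations, column="ner_chunk"):
    """Flatten fullAnnotate output into (chunk, begin, end, label) rows."""
    return [(a.result, a.begin, a.end, a.metadata.get("entity"))
            for a in annotations[0][column]]

# Two illustrative chunks mimicking the Italian example above
demo = [{"ner_chunk": [Ann("donna", 4, 8, {"entity": "HUMAN"}),
                       Ann("HBV", 511, 513, {"entity": "SPECIES"})]}]
print(chunk_rows(demo))
# → [('donna', 4, 8, 'HUMAN'), ('HBV', 511, 513, 'SPECIES')]
```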
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-----------------|--------:|------:|:------------|-------------:| | 0 | donna | 4 | 8 | HUMAN | 0.9992 | | 1 | personale | 133 | 141 | HUMAN | 0.9982 | | 2 | madre | 285 | 289 | HUMAN | 0.9979 | | 3 | fratello | 317 | 324 | HUMAN | 0.9857 | | 4 | sorella | 373 | 379 | HUMAN | 0.9847 | | 5 | virus epatotropi | 493 | 508 | SPECIES | 0.7906 | | 6 | HBV | 511 | 513 | SPECIES | 0.9833 | | 7 | HCV | 516 | 518 | SPECIES | 0.991 | | 8 | HIV | 523 | 525 | SPECIES | 0.991 | | 9 | paziente | 634 | 641 | HUMAN | 0.9978 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|it| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Pipeline to Detect Pathogen, Medical Condition and Medicine author: John Snow Labs name: ner_pathogen_pipeline date: 2022-06-29 tags: [licensed, clinical, en, pathogen, ner, medicine, medical_condition, pipeline] task: Pipeline Healthcare language: en nav_key: models edition: Healthcare NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_pathogen](https://nlp.johnsnowlabs.com/2022/06/28/ner_pathogen_en_3_0.html) model.
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_pathogen_pipeline_en_4.0.0_3.0_1656527387514.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_pathogen_pipeline_en_4.0.0_3.0_1656527387514.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_pathogen_pipeline", "en", "clinical/models") result = pipeline.fullAnnotate("""Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. This can progress to loss of skin color, a fast heart rate as it becomes more severe. While it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions.""") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_pathogen_pipeline", "en", "clinical/models") val result = pipeline.fullAnnotate("""Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. This can progress to loss of skin color, a fast heart rate as it becomes more severe. While it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions.""") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.pathogen.pipeline").predict("""Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. This can progress to loss of skin color, a fast heart rate as it becomes more severe. 
While it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions.""") ```
## Results ```bash +---------------+----------------+ |chunk |ner_label | +---------------+----------------+ |Racecadotril |Medicine | |loperamide |Medicine | |Diarrhea |MedicalCondition| |dehydration |MedicalCondition| |skin color |MedicalCondition| |fast heart rate|MedicalCondition| |rabies virus |Pathogen | |Lyssavirus |Pathogen | |Ephemerovirus |Pathogen | +---------------+----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_pathogen_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: French CamemBert Embeddings (from cmarkea) author: John Snow Labs name: camembert_embeddings_distilcamembert_base date: 2022-05-23 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilcamembert-base` is a French model originally trained by `cmarkea`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_distilcamembert_base_fr_3.4.4_3.0_1653321109186.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_distilcamembert_base_fr_3.4.4_3.0_1653321109186.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_distilcamembert_base","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_distilcamembert_base","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
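Once sentences are embedded (for example by pooling the per-token vectors), cosine similarity is the usual way to compare them. A dependency-free sketch, with illustrative toy vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine([1.0, 0.0], [1.0, 0.0]))  # → 1.0  (identical direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # → 0.0  (orthogonal)
```

Scores near 1.0 indicate semantically similar texts; real embedding vectors would be 768-dimensional for a model like this.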
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_distilcamembert_base| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|256.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/cmarkea/distilcamembert-base - https://arxiv.org/abs/1910.01108 --- layout: model title: Legal NER for NDA (Non-solicitation Clauses) author: John Snow Labs name: legner_nda_non_solicitation date: 2023-04-10 tags: [en, legal, licensed, ner, nda, non_solicitation] task: Named Entity Recognition language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a NER model, aimed to be run **only** after detecting the `NON_SOLIC` clause with a proper classifier (use `legmulticlf_mnda_sections_paragraph_other` model for that purpose). It will extract the following entities: `NON_SOLIC_ACTION`, `NON_SOLIC_OBJECT`, `NON_SOLIC_IND_OBJECT`, and `NON_SOLIC_SUBJECT`. ## Predicted Entities `NON_SOLIC_ACTION`, `NON_SOLIC_OBJECT`, `NON_SOLIC_IND_OBJECT`, `NON_SOLIC_SUBJECT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_nda_non_solicitation_en_1.0.0_3.0_1681096605002.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_nda_non_solicitation_en_1.0.0_3.0_1681096605002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_nda_non_solicitation", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""During the 12-month period commencing on the date of this Agreement, each Party agrees that it will not permit any of its officers, directors, or employees or any direct or indirect subsidiary or any employment agency retained by a Party or any such subsidiary, in each case who is or becomes aware of the negotiation of a possible Transaction between the Parties, to solicit for employment with such Party or with any of its direct or indirect subsidiaries any employee of the other Party or any of its direct or indirect subsidiaries;"""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
## Results ```bash +----------+--------------------+ |chunk |ner_label | +----------+--------------------+ |Party |NON_SOLIC_SUBJECT | |officers |NON_SOLIC_IND_OBJECT| |directors |NON_SOLIC_IND_OBJECT| |employees |NON_SOLIC_IND_OBJECT| |agency |NON_SOLIC_SUBJECT | |solicit |NON_SOLIC_ACTION | |employment|NON_SOLIC_OBJECT | |employee |NON_SOLIC_IND_OBJECT| +----------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_nda_non_solicitation| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References In-house annotations on non-disclosure agreements ## Benchmarking ```bash label precision recall f1-score support NON_SOLIC_ACTION 1.00 0.95 0.98 22 NON_SOLIC_IND_OBJECT 0.67 0.90 0.76 29 NON_SOLIC_OBJECT 0.81 0.94 0.87 18 NON_SOLIC_SUBJECT 0.86 0.86 0.86 21 micro-avg 0.80 0.91 0.85 90 macro-avg 0.83 0.91 0.87 90 weighted-avg 0.82 0.91 0.86 90 ``` --- layout: model title: English asr_wav2vec2_large_robust_swbd_300h TFWav2Vec2ForCTC from facebook author: John Snow Labs name: pipeline_asr_wav2vec2_large_robust_swbd_300h date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_robust_swbd_300h` is an English model originally trained by facebook.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_robust_swbd_300h_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_robust_swbd_300h_en_4.2.0_3.0_1664041796149.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_robust_swbd_300h_en_4.2.0_3.0_1664041796149.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_robust_swbd_300h', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_robust_swbd_300h", lang = "en") val annotations = pipeline.transform(audioDF) ```
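The snippets above assume an `audioDF` that already contains the raw audio as an array of floats; producing that array happens outside Spark NLP. As an illustrative, stdlib-only sketch (not part of the pipeline API), 16-bit PCM WAV bytes can be decoded into normalized floats like this:

```python
# Illustrative only: Spark NLP's audio stages consume raw audio as an
# array of floats. This stdlib-only sketch decodes a 16-bit PCM mono WAV
# into [-1.0, 1.0] floats, which could then be parallelized into the
# DataFrame column the pipeline reads.
import io
import struct
import wave

def wav_to_floats(wav_bytes: bytes) -> list:
    """Decode 16-bit PCM WAV bytes into a list of floats in [-1, 1]."""
    with wave.open(io.BytesIO(wav_bytes)) as w:
        assert w.getsampwidth() == 2, "expects 16-bit samples"
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a tiny in-memory WAV (8 samples of silence) to show the round trip.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<8h", *([0] * 8)))
floats = wav_to_floats(buf.getvalue())
print(len(floats), floats[0])  # 8 0.0
```

In practice the decoded float arrays would be collected into a Spark DataFrame column before being passed to the pretrained pipeline.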
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_robust_swbd_300h| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|757.5 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Extract Test Entities from Voice of the Patient Documents (embeddings_clinical) author: John Snow Labs name: ner_vop_test date: 2023-06-06 tags: [licensed, clinical, ner, en, vop, test] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts mentions of tests from documents written in the patient's own words. ## Predicted Entities `VitalTest`, `Test`, `Measurements`, `TestResult` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_test_en_4.4.3_3.0_1686076606665.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_test_en_4.4.3_3.0_1686076606665.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_test", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["I went to the endocrinology department to get my thyroid levels checked. 
They ordered a blood test and found out that I have hypothyroidism, so now I'm on medication to manage it."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_test", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("I went to the endocrinology department to get my thyroid levels checked. They ordered a blood test and found out that I have hypothyroidism, so now I'm on medication to manage it.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
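In the transformed DataFrame, each element of the `ner_chunk` column is an annotation whose `result` field holds the chunk text and whose metadata carries the entity label. As a toy, Spark-free illustration (the dicts below are stand-ins for annotation objects, not actual pipeline output):

```python
# Illustrative only: toy dicts mirror the shape of "ner_chunk"
# annotations (chunk text in `result`, label in `metadata["entity"]`)
# so the flattening logic can be shown without a running Spark session.
toy_ner_chunks = [
    {"result": "thyroid levels", "metadata": {"entity": "Test"}},
    {"result": "blood test", "metadata": {"entity": "Test"}},
]

# Flatten annotations into (chunk, ner_label) pairs.
rows = [(a["result"], a["metadata"]["entity"]) for a in toy_ner_chunks]
for chunk, label in rows:
    print(f"{chunk:<15}| {label}")
# thyroid levels | Test
# blood test     | Test
```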
## Results ```bash | chunk | ner_label | |:---------------|:------------| | thyroid levels | Test | | blood test | Test | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_test| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
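The precision, recall, and F1 figures reported in the benchmarking section follow directly from the tp/fp/fn counts in the same table. A quick sketch, reproducing the VitalTest row:

```python
# The NER benchmarks in these cards report per-label tp/fp/fn counts;
# precision, recall and F1 derive directly from them. The counts below
# are the VitalTest row of this model's benchmarking table.
def prf(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = prf(tp=159, fp=24, fn=13)  # VitalTest row
print(f"{p:.2f} {r:.2f} {f1:.2f}")  # 0.87 0.92 0.90
```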
## Benchmarking ```bash label tp fp fn total precision recall f1 VitalTest 159 24 13 172 0.87 0.92 0.90 Test 1040 101 168 1208 0.91 0.86 0.89 Measurements 139 14 47 186 0.91 0.75 0.82 TestResult 336 71 188 524 0.83 0.64 0.72 macro_avg 1674 210 416 2090 0.88 0.79 0.83 micro_avg 1674 210 416 2090 0.89 0.80 0.84 ``` --- layout: model title: Context Spell Checker Pipeline for English author: ahmedlone127 name: spellcheck_dl_pipeline date: 2022-06-14 tags: [spellcheck, spell, spellcheck_pipeline, spelling_corrector, en, open_source] task: Spell Check language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: false article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained spellchecker pipeline is built on top of the [spellcheck_dl](https://nlp.johnsnowlabs.com/2022/04/02/spellcheck_dl_en_2_4.html) model. This pipeline is for PySpark 2.4.x users with Spark NLP 3.4.2 and above. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CONTEXTUAL_SPELL_CHECKER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/spellcheck_dl_pipeline_en_4.0.0_3.0_1655213660333.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/spellcheck_dl_pipeline_en_4.0.0_3.0_1655213660333.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("spellcheck_dl_pipeline", lang = "en") text = ["During the summer we have the best ueather.", "I have a black ueather jacket, so nice."] pipeline.annotate(text) ``` ```scala val pipeline = new PretrainedPipeline("spellcheck_dl_pipeline", lang = "en") val example = Array("During the summer we have the best ueather.", "I have a black ueather jacket, so nice.") pipeline.annotate(example) ```
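`annotate` returns one dictionary per input sentence with aligned token-level lists, so corrections can be found by comparing the `token` and `checked` lists position by position. A toy illustration using the output shape shown in the Results section:

```python
# Illustrative only: this dict mirrors the shape of one element of the
# `annotate` output. Since "token" and "checked" are aligned, zipping
# them surfaces the spelling corrections.
ann = {
    "checked": ["During", "the", "summer", "we", "have", "the", "best", "weather", "."],
    "token": ["During", "the", "summer", "we", "have", "the", "best", "ueather", "."],
}

corrections = [(t, c) for t, c in zip(ann["token"], ann["checked"]) if t != c]
print(corrections)  # [('ueather', 'weather')]
```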
## Results ```bash [{'checked': ['During', 'the', 'summer', 'we', 'have', 'the', 'best', 'weather', '.'], 'document': ['During the summer we have the best ueather.'], 'token': ['During', 'the', 'summer', 'we', 'have', 'the', 'best', 'ueather', '.']}, {'checked': ['I', 'have', 'a', 'black', 'leather', 'jacket', ',', 'so', 'nice', '.'], 'document': ['I have a black ueather jacket, so nice.'], 'token': ['I', 'have', 'a', 'black', 'ueather', 'jacket', ',', 'so', 'nice', '.']}] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|spellcheck_dl_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Community| |Language:|en| |Size:|99.8 MB| ## Included Models - DocumentAssembler - TokenizerModel - ContextSpellCheckerModel --- layout: model title: Universal Sentence Encoder XLING English and Spanish author: John Snow Labs name: tfhub_use_xling_en_es date: 2020-12-08 task: Embeddings language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 deprecated: true tags: [embeddings, open_source, xx] supported: true annotator: UniversalSentenceEncoder article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The Universal Sentence Encoder Cross-lingual (XLING) module is an extension of the Universal Sentence Encoder that includes training on multiple tasks across languages. The multi-task training setup is based on the paper "Learning Cross-lingual Sentence Representations via a Multi-task Dual Encoder". This specific module is trained on English and Spanish (en-es) tasks, and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and tasks, with the goal of learning text representations that are useful out-of-the-box for a number of applications. The input to the module is variable length English or Spanish text and the output is a 512 dimensional vector. 
We note that one does not need to specify the language that the input is in, as the model was trained such that English and Spanish text with similar meanings will have similar (high dot product score) embeddings. We also note that this model can be used for monolingual English (and potentially monolingual Spanish) tasks with comparable or even better performance than the purely English Universal Sentence Encoder. Note: This model only works on Linux and macOS operating systems and is not compatible with Windows due to the incompatibility of the SentencePiece library. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/tfhub_use_xling_en_es_xx_2.7.0_2.4_1607440558771.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/tfhub_use_xling_en_es_xx_2.7.0_2.4_1607440558771.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_xling_en_es", "xx") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP', 'Me encanta usar SparkNLP']], ["text"])) ``` ```scala ... val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_xling_en_es", "xx") .setInputCols("document") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I love NLP", "Me encanta usar SparkNLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP", "Me encanta usar SparkNLP"] embeddings_df = nlu.load('xx.use.xling_en_es').predict(text, output_level='sentence') embeddings_df ```
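As the description notes, sentences with similar meanings receive embeddings with a high dot product score; on unit-normalized vectors this is cosine similarity. A toy sketch with made-up 3-dimensional vectors (illustrative numbers only, not actual model output):

```python
# Illustrative only: similarity between sentence embeddings is typically
# scored with cosine similarity (a dot product on unit-normalized
# vectors). The 3-d vectors here are toy stand-ins for the model's
# 512-d outputs.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

en = [0.1, 0.9, 0.2]      # toy embedding of "I love NLP"
es = [0.12, 0.88, 0.19]   # toy embedding of "Me encanta usar SparkNLP"
other = [0.9, -0.1, 0.3]  # toy embedding of an unrelated sentence

# The cross-lingual pair scores higher than the unrelated pair.
print(cosine(en, es) > cosine(en, other))  # True
```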
## Results It produces a 512-dimensional vector for each sentence. ```bash xx_use_xling_en_es_embeddings sentence 0 [-0.02727784588932991, 0.022969702258706093, 0... I love NLP 1 [-0.01980777457356453, 0.03035994991660118, 0.... Me encanta usar SparkNLP ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|tfhub_use_xling_en_es| |Compatibility:|Spark NLP 2.7.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|xx| ## Data Source This embeddings model is imported from [https://tfhub.dev/google/universal-sentence-encoder-xling/en-es/1](https://tfhub.dev/google/universal-sentence-encoder-xling/en-es/1) --- layout: model title: Extract Clinical Entities from Voice of the Patient Documents (embeddings_clinical) author: John Snow Labs name: ner_vop date: 2023-06-06 tags: [licensed, clinical, en, ner, vop] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts healthcare-related terms from documents written in the patient's own words.
## Predicted Entities `Gender`, `Employment`, `Age`, `BodyPart`, `Substance`, `Form`, `PsychologicalCondition`, `Vaccine`, `Drug`, `DateTime`, `ClinicalDept`, `Laterality`, `Test`, `AdmissionDischarge`, `Disease`, `VitalTest`, `Dosage`, `Duration`, `RelationshipStatus`, `Route`, `Allergen`, `Frequency`, `Symptom`, `Procedure`, `HealthStatus`, `InjuryOrPoisoning`, `Modifier`, `Treatment`, `SubstanceQuantity`, `MedicalDevice`, `TestResult` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_en_4.4.3_3.0_1686072829593.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_en_4.4.3_3.0_1686072829593.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. 
I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . 
Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
## Results ```bash | chunk | ner_label | |:---------------------|:-----------------------| | 20 year old | Age | | girl | Gender | | hyperthyroid | Disease | | 1 month ago | DateTime | | weak | Symptom | | light | Symptom | | panic attacks | PsychologicalCondition | | depression | PsychologicalCondition | | left | Laterality | | chest | BodyPart | | pain | Symptom | | increased | TestResult | | heart rate | VitalTest | | rapidly | Modifier | | weight loss | Symptom | | 4 months | Duration | | hospital | ClinicalDept | | discharged | AdmissionDischarge | | hospital | ClinicalDept | | blood tests | Test | | brain | BodyPart | | mri | Test | | ultrasound scan | Test | | endoscopy | Procedure | | doctors | Employment | | homeopathy doctor | Employment | | he | Gender | | hyperthyroid | Disease | | TSH | Test | | 0.15 | TestResult | | T3 | Test | | T4 | Test | | normal | TestResult | | b12 deficiency | Disease | | vitamin D deficiency | Disease | | weekly | Frequency | | supplement | Drug | | vitamin D | Drug | | 1000 mcg | Dosage | | b12 | Drug | | daily | Frequency | | homeopathy medicine | Drug | | 40 days | Duration | | 2nd test | Test | | after 30 days | DateTime | | TSH | Test | | 0.5 | TestResult | | now | DateTime | | weakness | Symptom | | depression | PsychologicalCondition | | last week | DateTime | | rapid heartrate | Symptom | | allopathy medicine | Treatment | | homeopathy | Treatment | | thyroid | BodyPart | | allopathy | Treatment | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.9 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. 
I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
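In the benchmarking table that follows, the micro-averaged metrics are computed from the pooled tp/fp/fn totals rather than by averaging the per-label scores. A quick sketch using the pooled totals from that table:

```python
# Micro-averaging pools tp/fp/fn over all labels before computing
# precision/recall, whereas macro-averaging takes the mean of per-label
# scores. The totals below are the pooled counts from this model's
# benchmarking table.
totals = {"tp": 25361, "fp": 4097, "fn": 3869}  # pooled over all labels
micro_p = totals["tp"] / (totals["tp"] + totals["fp"])
micro_r = totals["tp"] / (totals["tp"] + totals["fn"])
print(f"{micro_p:.2f} {micro_r:.2f}")  # 0.86 0.87
```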
## Benchmarking ```bash label tp fp fn total precision recall f1 Gender 1291 19 26 1317 0.99 0.98 0.98 Employment 1182 56 61 1243 0.95 0.95 0.95 Age 544 50 38 582 0.92 0.93 0.93 BodyPart 2711 219 189 2900 0.93 0.93 0.93 Substance 390 45 31 421 0.90 0.93 0.91 Form 246 30 20 266 0.89 0.92 0.91 PsychologicalCondition 403 34 41 444 0.92 0.91 0.91 Vaccine 37 4 5 42 0.90 0.88 0.89 Drug 1330 208 110 1440 0.86 0.92 0.89 DateTime 4045 690 357 4402 0.85 0.92 0.89 ClinicalDept 277 24 49 326 0.92 0.85 0.88 Laterality 550 66 78 628 0.89 0.88 0.88 Test 1063 158 145 1208 0.87 0.88 0.88 AdmissionDischarge 28 2 6 34 0.93 0.82 0.88 Disease 1706 247 309 2015 0.87 0.85 0.86 VitalTest 143 19 29 172 0.88 0.83 0.86 Dosage 333 38 79 412 0.90 0.81 0.85 Duration 1897 320 413 2310 0.86 0.82 0.84 RelationshipStatus 19 2 5 24 0.90 0.79 0.84 Route 39 7 9 48 0.85 0.81 0.83 Allergen 33 1 13 46 0.97 0.72 0.83 Frequency 905 224 174 1079 0.80 0.84 0.82 Symptom 3813 973 762 4575 0.80 0.83 0.81 Procedure 556 111 149 705 0.83 0.79 0.81 HealthStatus 77 15 30 107 0.84 0.72 0.77 InjuryOrPoisoning 131 42 45 176 0.76 0.74 0.75 Modifier 837 254 302 1139 0.77 0.73 0.75 Treatment 164 50 64 228 0.77 0.72 0.74 SubstanceQuantity 58 18 27 85 0.76 0.68 0.72 MedicalDevice 232 86 100 332 0.73 0.70 0.71 TestResult 321 85 203 524 0.79 0.61 0.69 macro_avg 25361 4097 3869 29230 0.86 0.83 0.84 micro_avg 25361 4097 3869 29230 0.86 0.87 0.86 ``` --- layout: model title: Legal Indemnity Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_indemnity_agreement date: 2022-12-06 tags: [en, legal, classification, agreement, indemnity, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_indemnity_agreement` model is a Legal Longformer Document Classifier to classify if the document belongs to the class 
`indemnity-agreement` or not (Binary Classification). Longformer models are limited to 4096 tokens, so only the first 4096 tokens of each document are taken into account. In our experience, for the vast majority of documents in legal corpora, 4096 tokens are enough for document classification, provided the documents are clean and contain only the legal text without extra leading material. If that is not the case for your corpus, let us know and we can take a different approach: splitting each document into 4096-token chunks, averaging their embeddings, and training on the averaged representation so that the whole document is taken into account. In theory, however, this should not be required. ## Predicted Entities `indemnity-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_indemnity_agreement_en_1.0.0_3.0_1670357677216.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_indemnity_agreement_en_1.0.0_3.0_1670357677216.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document")\ tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_indemnity_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
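The `SentenceEmbeddings` stage above uses `AVERAGE` pooling: the document vector is the element-wise mean of the token embeddings. A minimal sketch with toy 3-dimensional vectors standing in for the real embeddings:

```python
# Illustrative only: AVERAGE pooling takes the element-wise mean of the
# token embedding vectors. Toy 3-d vectors stand in for the Longformer
# token embeddings produced by the pipeline above.
def average_pool(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

tokens = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]  # two toy token embeddings
print(average_pool(tokens))  # [2.0, 1.0, 1.0]
```

The same averaging idea underlies the chunk-level fallback described above for documents longer than 4096 tokens.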
## Results ```bash +-------+ |result| +-------+ |[indemnity-agreement]| |[other]| |[other]| |[indemnity-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_indemnity_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support indemnity-agreement 0.96 0.94 0.95 50 other 0.97 0.98 0.98 111 accuracy - - 0.97 161 macro-avg 0.97 0.96 0.96 161 weighted-avg 0.97 0.97 0.97 161 ``` --- layout: model title: Arabic Bert Embeddings (Medium) author: John Snow Labs name: bert_embeddings_bert_medium_arabic date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-medium-arabic` is an Arabic model originally trained by `asafaya`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_medium_arabic_ar_3.4.2_3.0_1649678487166.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_medium_arabic_ar_3.4.2_3.0_1649678487166.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_medium_arabic","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_medium_arabic","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.bert_medium_arabic").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_medium_arabic| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|158.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/asafaya/bert-medium-arabic - https://traces1.inria.fr/oscar/ - http://commoncrawl.org/ - https://dumps.wikimedia.org/backup-index.html - https://github.com/google-research/bert - https://www.tensorflow.org/tfrc - https://github.com/alisafaya/Arabic-BERT --- layout: model title: Extract Clinical Problem Entities from Voice of the Patient Documents (embeddings_clinical) author: John Snow Labs name: ner_vop_problem date: 2023-06-06 tags: [licensed, clinical, ner, en, problem, vop] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts clinical problems from documents written in the patient's own words, using a granular taxonomy. ## Predicted Entities `PsychologicalCondition`, `Disease`, `Symptom`, `HealthStatus`, `Modifier`, `InjuryOrPoisoning` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_en_4.4.3_3.0_1686075641877.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_problem_en_4.4.3_3.0_1686075641877.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_vop_problem", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. 
After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_vop_problem", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("I've been experiencing joint pain and fatigue lately, so I went to the rheumatology department. After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
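The `ner_converter` stage already merges IOB tags into `ner_chunk` spans; if you work from the raw `ner` column instead, the same grouping can be reproduced in plain Python. A minimal sketch (the function name and sample tag lists are illustrative, not actual model output):

```python
def bio_to_chunks(tokens, tags):
    """Group BIO-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O" tag or an inconsistent I- tag closes the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["joint", "pain", "and", "fatigue"]
tags = ["O", "B-Symptom", "O", "B-Symptom"]
print(bio_to_chunks(tokens, tags))  # [('pain', 'Symptom'), ('fatigue', 'Symptom')]
```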
## Results ```bash | chunk | ner_label | |:---------------------|:------------| | pain | Symptom | | fatigue | Symptom | | rheumatoid arthritis | Disease | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_problem| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you. 
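The precision, recall, and F1 figures in the benchmarking table below follow directly from the raw tp/fp/fn counts. A quick sketch of the arithmetic, using the `Symptom` row as an example:

```python
def prf(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Symptom row of the benchmark: tp=3726, fp=668, fn=849
p, r, f = prf(3726, 668, 849)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.85 0.81 0.83
```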
## Benchmarking ```bash label tp fp fn total precision recall f1 PsychologicalCondition 413 40 31 444 0.91 0.93 0.92 Disease 1748 288 267 2015 0.86 0.87 0.86 Symptom 3726 668 849 4575 0.85 0.81 0.83 HealthStatus 81 23 26 107 0.78 0.76 0.77 Modifier 845 259 294 1139 0.77 0.74 0.75 InjuryOrPoisoning 117 25 59 176 0.82 0.66 0.74 macro_avg 6930 1303 1526 8456 0.83 0.80 0.81 micro_avg 6930 1303 1526 8456 0.84 0.82 0.83 ``` --- layout: model title: Depression Classifier (PHS-BERT) for Tweets author: John Snow Labs name: bert_sequence_classifier_depression_twitter date: 2022-08-09 tags: [public_health, en, licensed, sequence_classification, mental_health, depression, twitter] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [PHS-BERT](https://arxiv.org/abs/2204.04521) based tweet classification model that can classify whether tweets contain depressive text. ## Predicted Entities `depression`, `no-depression` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_depression_twitter_en_4.0.2_3.0_1660051816827.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_depression_twitter_en_4.0.2_3.0_1660051816827.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_depression_twitter", "en", "clinical/models")\ .setInputCols(["document","token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame([ ["Do what makes you happy, be with who makes you smile, laugh as much as you breathe, and love as long as you live!"], ["Everything is a lie, everyone is fake, I'm so tired of living"] ]).toDF("text") result = pipeline.fit(data).transform(data) result.select("text", "class.result").show(truncate=False) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_depression_twitter", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier)) val data = Seq( "Do what makes you happy, be with who makes you smile, laugh as much as you breathe, and love as long as you live!", "Everything is a lie, everyone is fake, I'm so tired of living" ).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.bert_sequence.depression_twitter").predict("""Do what makes you happy, be with who makes you smile, laugh as much as you breathe, and love as long as you live!""") ```
## Results ```bash +-----------------------------------------------------------------------------------------------------------------+---------------+ |text |result | +-----------------------------------------------------------------------------------------------------------------+---------------+ |Do what makes you happy, be with who makes you smile, laugh as much as you breathe, and love as long as you live!|[no-depression]| |Everything is a lie, everyone is fake, I'm so tired of living |[depression] | +-----------------------------------------------------------------------------------------------------------------+---------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_depression_twitter| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References Curated from several academic and in-house datasets. ## Benchmarking ```bash label precision recall f1-score support minimum 0.97 0.98 0.97 1411 high-depression 0.95 0.92 0.93 595 accuracy - - 0.96 2006 macro-avg 0.96 0.95 0.95 2006 weighted-avg 0.96 0.96 0.96 2006 ``` --- layout: model title: Pipeline to Detect Temporal Relations for Clinical Events (Enriched) author: John Snow Labs name: re_temporal_events_enriched_clinical_pipeline date: 2022-03-31 tags: [licensed, clinical, relation_extraction, event, enriched, en] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [re_temporal_events_enriched_clinical](https://nlp.johnsnowlabs.com/2020/09/28/re_temporal_events_enriched_clinical_en.html) model. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_temporal_events_enriched_clinical_pipeline_en_3.4.1_3.0_1648734605627.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_temporal_events_enriched_clinical_pipeline_en_3.4.1_3.0_1648734605627.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("re_temporal_events_enriched_clinical_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 56-year-old right-handed female with longstanding intermittent right low back pain, who was involved in a motor vehicle accident in September of 2005. At that time, she did not notice any specific injury, but five days later, she started getting abnormal right low back pain.") ``` ```scala val pipeline = new PretrainedPipeline("re_temporal_events_enriched_clinical_pipeline", "en", "clinical/models") pipeline.annotate("The patient is a 56-year-old right-handed female with longstanding intermittent right low back pain, who was involved in a motor vehicle accident in September of 2005. At that time, she did not notice any specific injury, but five days later, she started getting abnormal right low back pain.") ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.temproal_enriched.pipeline").predict("""The patient is a 56-year-old right-handed female with longstanding intermittent right low back pain, who was involved in a motor vehicle accident in September of 2005. At that time, she did not notice any specific injury, but five days later, she started getting abnormal right low back pain.""") ```
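Each relation the pipeline produces carries its entity pair in annotation metadata, which is how tabular views like the one under Results are typically assembled. A sketch under the assumption of a simplified annotation layout (the dict keys below are illustrative, not the exact Spark NLP schema):

```python
def relations_to_rows(annotations):
    """Flatten relation annotations into plain dict rows for tabular display."""
    rows = []
    for ann in annotations:
        meta = ann["metadata"]
        rows.append({
            "relation": ann["result"],
            "entity1": meta["entity1"],
            "chunk1": meta["chunk1"],
            "entity2": meta["entity2"],
            "chunk2": meta["chunk2"],
            "confidence": float(meta["confidence"]),
        })
    return rows

# Hand-written sample mimicking the first relation in the Results table
sample = [{"result": "OVERLAP",
           "metadata": {"entity1": "PROBLEM",
                        "chunk1": "longstanding intermittent right low back pain",
                        "entity2": "OCCURRENCE",
                        "chunk2": "a motor vehicle accident",
                        "confidence": "0.532308"}}]
rows = relations_to_rows(sample)
print(rows[0]["relation"], rows[0]["chunk2"])  # OVERLAP a motor vehicle accident
```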
## Results ```bash +----+------------+-----------+-----------------+---------------+-----------------------------------------------+------------+-----------------+---------------+--------------------------+--------------+ | | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | +====+============+===========+=================+===============+===============================================+============+=================+===============+==========================+==============+ | 0 | OVERLAP | PROBLEM | 54 | 98 | longstanding intermittent right low back pain | OCCURRENCE | 121 | 144 | a motor vehicle accident | 0.532308 | +----+------------+-----------+-----------------+---------------+-----------------------------------------------+------------+-----------------+---------------+--------------------------+--------------+ | 1 | AFTER | DATE | 171 | 179 | that time | PROBLEM | 201 | 219 | any specific injury | 0.577288 | +----+------------+-----------+-----------------+---------------+-----------------------------------------------+------------+-----------------+---------------+--------------------------+--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_temporal_events_enriched_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - PerceptronModel - WordEmbeddingsModel - MedicalNerModel - NerConverter - DependencyParserModel - RelationExtractionModel --- layout: model title: English RobertaForQuestionAnswering Cased model (from AyushPJ) author: John Snow Labs name: roberta_qa_ai_club_inductions_21_nlp date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: 
true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ai-club-inductions-21-nlp-roBERTa` is an English model originally trained by `AyushPJ`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ai_club_inductions_21_nlp_en_4.3.0_3.0_1674208962596.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ai_club_inductions_21_nlp_en_4.3.0_3.0_1674208962596.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ai_club_inductions_21_nlp","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ai_club_inductions_21_nlp","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_ai_club_inductions_21_nlp| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|465.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AyushPJ/ai-club-inductions-21-nlp-roBERTa --- layout: model title: Portuguese Named Entity Recognition (from m-lin20) author: John Snow Labs name: roberta_ner_satellite_instrument_roberta_NER date: 2022-05-03 tags: [roberta, ner, open_source, pt] task: Named Entity Recognition language: pt edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `satellite-instrument-roberta-NER` is a Portuguese model originally trained by `m-lin20`. ## Predicted Entities `instrument`, `satellite` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_ner_satellite_instrument_roberta_NER_pt_3.4.2_3.0_1651594391122.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_ner_satellite_instrument_roberta_NER_pt_3.4.2_3.0_1651594391122.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_satellite_instrument_roberta_NER","pt") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Eu amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_satellite_instrument_roberta_NER","pt") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Eu amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pt.ner.satellite_instrument_roberta_NER").predict("""Eu amo Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_ner_satellite_instrument_roberta_NER| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|pt| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/m-lin20/satellite-instrument-roberta-NER - https://github.com/Tsinghua-mLin/satellite-instrument-NER --- layout: model title: Legal Indemnification NER (Light, md) author: John Snow Labs name: legner_indemnifications_md date: 2022-12-01 tags: [en, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description IMPORTANT: Don't run this model on the whole legal agreement. Instead: - Split by paragraphs. You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration; - Use the `legclf_indemnification_clause` Text Classifier to select only these paragraphs; This is a Legal Named Entity Recognition model that identifies the Subject (who), Action (verb), Object (the indemnification) and Indirect Object (to whom) in Indemnification clauses. This is the `md` (medium) version of the model, trained with more data and more resistant to false positives outside the target section, which may help when running it on whole documents (although this is not recommended). 
## Predicted Entities `INDEMNIFICATION`, `INDEMNIFICATION_SUBJECT`, `INDEMNIFICATION_ACTION`, `INDEMNIFICATION_INDIRECT_OBJECT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_indemnifications_md_en_1.0.0_3.0_1669894326703.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_indemnifications_md_en_1.0.0_3.0_1669894326703.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained('legner_indemnifications_md', 'en', 'legal/models')\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[documentAssembler,sentenceDetector,tokenizer,embeddings,ner_model,ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) lmodel = nlp.LightPipeline(model) text='''The Company shall protect and indemnify the Supplier against any damages, losses or costs whatsoever''' res = lmodel.annotate(text) ```
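`LightPipeline.annotate` returns a plain dict of output lists, so a token-level view like the one under Results can be built by zipping the `token` and `ner` outputs. A minimal sketch (the sample lists are hand-written, not actual model output):

```python
def token_label_pairs(res):
    """Pair each token with its predicted IOB tag from a LightPipeline result dict."""
    return list(zip(res["token"], res["ner"]))

res = {"token": ["The", "Company", "shall", "protect"],
       "ner":   ["O", "O", "B-INDEMNIFICATION_ACTION", "I-INDEMNIFICATION_ACTION"]}
for token, label in token_label_pairs(res):
    print(f"{token:>10} {label}")
```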
## Results ```bash +----------+---------------------------------+ | token| ner_label| +----------+---------------------------------+ | The| O| | Company| O| | shall| B-INDEMNIFICATION_ACTION| | protect| I-INDEMNIFICATION_ACTION| | and| O| | indemnify| B-INDEMNIFICATION_ACTION| | the| O| | Supplier|B-INDEMNIFICATION_INDIRECT_OBJECT| | against| O| | any| O| | damages| B-INDEMNIFICATION| | ,| O| | losses| B-INDEMNIFICATION| | or| O| | costs| B-INDEMNIFICATION| |whatsoever| O| +----------+---------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_indemnifications_md| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References In-house annotated examples from CUAD legal dataset ## Benchmarking ```bash label tp fp fn prec rec f1 I-INDEMNIFICATION_ACTION 9 2 0 0.8181818 1.0 0.90000004 B-INDEMNIFICATION_INDIRECT_OBJECT 24 7 0 0.7741935 1.0 0.8727273 B-INDEMNIFICATION_SUBJECT 5 2 0 0.71428573 1.0 0.8333334 I-INDEMNIFICATION_SUBJECT 3 0 0 1.0 1.0 1.0 B-INDEMNIFICATION 23 2 0 0.92 1.0 0.9583333 I-INDEMNIFICATION_INDIRECT_OBJECT 9 3 2 0.75 0.8181818 0.78260875 B-INDEMNIFICATION_ACTION 9 4 0 0.6923077 1.0 0.8181818 I-INDEMNIFICATION 5 5 0 0.5 1.0 0.6666667 Macro-average 87 25 2 0.77112114 0.97727275 0.8620434 Micro-average 87 25 2 0.77678573 0.9775281 0.86567163 ``` --- layout: model title: German ElectraForQuestionAnswering model (from deutsche-telekom) author: John Snow Labs name: electra_qa_base_squad2 date: 2022-06-22 tags: [de, open_source, electra, question_answering] task: Question Answering language: de edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark 
NLP. `electra-base-de-squad2` is a German model originally trained by `deutsche-telekom`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_base_squad2_de_4.0.0_3.0_1655920425196.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_base_squad2_de_4.0.0_3.0_1655920425196.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_squad2","de") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Was ist mein Name?", "Mein Name ist Clara und ich lebe in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_squad2","de") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Was ist mein Name?", "Mein Name ist Clara und ich lebe in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.answer_question.squadv2.electra.base").predict("""Was ist mein Name?|||"Mein Name ist Clara und ich lebe in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_base_squad2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|de| |Size:|414.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/deutsche-telekom/electra-base-de-squad2 --- layout: model title: Word2Vec Embeddings in Piedmontese (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, pms, open_source] task: Embeddings language: pms edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_pms_3.4.1_3.0_1647451242344.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_pms_3.4.1_3.0_1647451242344.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","pms") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","pms") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("pms.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
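Lookup embeddings like these are usually compared with cosine similarity. A self-contained sketch of the computation in plain Python, with toy 3-dimensional vectors standing in for the real 300-dimensional ones:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors: two near-synonyms should score close to 1.0
v_cat = [0.2, 0.8, 0.1]
v_dog = [0.25, 0.75, 0.15]
print(round(cosine_similarity(v_cat, v_dog), 3))
```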
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|pms| |Size:|122.9 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English RobertaForQuestionAnswering (from tli8hf) author: John Snow Labs name: roberta_qa_unqover_roberta_base_squad date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `unqover-roberta-base-squad` is an English model originally trained by `tli8hf`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_unqover_roberta_base_squad_en_4.0.0_3.0_1655740154333.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_unqover_roberta_base_squad_en_4.0.0_3.0_1655740154333.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_unqover_roberta_base_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_unqover_roberta_base_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base.by_tli8hf").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_unqover_roberta_base_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|463.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/tli8hf/unqover-roberta-base-squad --- layout: model title: Pipeline to Classify Texts into TREC-6 Classes author: ahmedlone127 name: bert_sequence_classifier_trec_coarse_pipeline date: 2022-06-14 tags: [bert_sequence, trec, coarse, bert, en, open_source] task: Text Classification language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: false article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_sequence_classifier_trec_coarse_en](https://nlp.johnsnowlabs.com/2021/11/06/bert_sequence_classifier_trec_coarse_en.html) model. The TREC dataset for question classification consists of open-domain, fact-based questions divided into broad semantic categories. You can check the official documentation of the dataset, entities, etc. [here](https://search.r-project.org/CRAN/refmans/textdata/html/dataset_trec.html). ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/community.johnsnowlabs.com/ahmedlone127/bert_sequence_classifier_trec_coarse_pipeline_en_4.0.0_3.0_1655211903928.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://community.johnsnowlabs.com/ahmedlone127/bert_sequence_classifier_trec_coarse_pipeline_en_4.0.0_3.0_1655211903928.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python trec_pipeline = PretrainedPipeline("bert_sequence_classifier_trec_coarse_pipeline", lang = "en") trec_pipeline.annotate("Germany is the largest country in Europe economically.") ``` ```scala val trec_pipeline = new PretrainedPipeline("bert_sequence_classifier_trec_coarse_pipeline", lang = "en") trec_pipeline.annotate("Germany is the largest country in Europe economically.") ```
## Results ```bash ['LOC'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_trec_coarse_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Community| |Language:|en| |Size:|406.6 MB| ## Included Models - DocumentAssembler - TokenizerModel - BertForSequenceClassification --- layout: model title: English image_classifier_vit_gtsrb_model ViTForImageClassification from bazyl author: John Snow Labs name: image_classifier_vit_gtsrb_model date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_gtsrb_model` is an English model originally trained by bazyl.
## Predicted Entities `Children crossing`, `Double curve`, `Road work`, `Yield`, `Beware of ice/snow`, `Speed limit (70km/h)`, `Bicycles crossing`, `Roundabout mandatory`, `Speed limit (30km/h)`, `Keep left`, `Dangerous curve left`, `No vehicles`, `End of no passing`, `Bumpy road`, `Speed limit (50km/h)`, `Turn left ahead`, `Speed limit (20km/h)`, `General caution`, `Speed limit (100km/h)`, `End speed + passing limits`, `Go straight or right`, `Dangerous curve right`, `Speed limit (80km/h)`, `Slippery road`, `Turn right ahead`, `No passing veh over 3.5 tons`, `Speed limit (60km/h)`, `Pedestrians`, `Right-of-way at intersection`, `Priority road`, `End of speed limit (80km/h)`, `Road narrows on the right`, `No entry`, `Stop`, `Wild animals crossing`, `Veh > 3.5 tons prohibited`, `End no passing veh > 3.5 tons`, `Go straight or left`, `Speed limit (120km/h)`, `Ahead only`, `Keep right`, `Traffic signals`, `No passing` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_gtsrb_model_en_4.1.0_3.0_1660166650134.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_gtsrb_model_en_4.1.0_3.0_1660166650134.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_gtsrb_model", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_gtsrb_model", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_gtsrb_model| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|322.0 MB| --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Tianle) author: John Snow Labs name: distilbert_qa_tianle_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Tianle`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_tianle_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769458679.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_tianle_base_uncased_finetuned_squad_en_4.3.0_3.0_1672769458679.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_tianle_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_tianle_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_tianle_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Tianle/distilbert-base-uncased-finetuned-squad --- layout: model title: Word2Vec Embeddings in Egyptian Arabic (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, arz, open_source] task: Embeddings language: arz edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_arz_3.4.1_3.0_1647294515164.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_arz_3.4.1_3.0_1647294515164.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","arz") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","arz") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("arz.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|arz| |Size:|205.5 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Legal Participation Agreement Document Classifier (Longformer) author: John Snow Labs name: legclf_participation_agreement date: 2022-11-10 tags: [en, legal, classification, agreement, participation, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_participation_agreement` model is a Legal Longformer Document Classifier that determines whether a document belongs to the class `participation-agreement` or not (binary classification). Longformers are limited to 4096 tokens, so only the first 4096 tokens of each document are taken into account. We have found that, for the vast majority of documents in legal corpora, as long as they are clean and contain only the legal document itself without extra leading material, 4096 tokens are enough for document classification. If that is not the case for your data, let us know and we can take another approach: splitting each document into 4096-token chunks, averaging the chunk embeddings, and training on the averaged vectors, so that the whole document is taken into account. In theory, though, this should not be required.
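The chunk-and-average fallback described above can be sketched in plain Python. This is an illustrative sketch only, not the Spark NLP implementation; `embed_chunk` is a hypothetical stand-in for whatever encoder produces one fixed-size vector per 4096-token chunk.

```python
# Illustrative sketch (not the Spark NLP implementation): split a long
# document into 4096-token chunks, embed each chunk, and average the
# chunk embeddings so the whole document contributes to one vector.

def chunk(tokens, size=4096):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def average_vectors(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def document_embedding(tokens, embed_chunk, size=4096):
    """Embed each chunk, then average the chunk embeddings."""
    return average_vectors([embed_chunk(c) for c in chunk(tokens, size)])

# Toy stand-in embedder: a 1-d "embedding" holding the chunk length.
toy_embed = lambda c: [float(len(c))]
doc = ["token"] * 10000          # 10,000 tokens -> chunks of 4096/4096/1808
vec = document_embedding(doc, toy_embed)
```

With a real encoder, `embed_chunk` would return the Longformer embedding of the chunk; the averaged vector then replaces the sentence embeddings fed to the classifier.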
## Predicted Entities `participation-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_participation_agreement_en_1.0.0_3.0_1668110456181.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_participation_agreement_en_1.0.0_3.0_1668110456181.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = nlp.ClassifierDLModel.pretrained("legclf_participation_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[participation-agreement]| |[other]| |[other]| |[participation-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_participation_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.7 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.97 0.99 0.98 85 participation-agreement 0.97 0.90 0.93 31 accuracy - - 0.97 116 macro-avg 0.97 0.95 0.96 116 weighted-avg 0.97 0.97 0.97 116 ``` --- layout: model title: Stopwords Remover for Finnish language (822 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, fi, open_source] task: Stop Words Removal language: fi edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_fi_3.4.1_3.0_1646672293430.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_fi_3.4.1_3.0_1646672293430.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","fi") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Et ole parempi kuin minä"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stopWords = StopWordsCleaner.pretrained("stopwords_iso","fi") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stopWords)) val data = Seq("Et ole parempi kuin minä").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fi.stopwords").predict("""Et ole parempi kuin minä""") ```
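Conceptually, the cleaner keeps only tokens absent from its stopword list. The sketch below illustrates that logic in plain Python; it is not the Spark NLP implementation, and the five-entry set is a tiny hand-picked sample of the model's 822 Finnish entries.

```python
# Illustrative sketch (not the Spark NLP implementation) of stopword
# removal. The set below is a tiny sample; the real model has 822 entries.

FINNISH_STOPWORDS = {"et", "ole", "parempi", "kuin", "minä"}

def remove_stopwords(tokens, stopwords, case_sensitive=False):
    """Keep only the tokens that do not appear in the stopword set."""
    if not case_sensitive:
        lowered = {w.lower() for w in stopwords}
        return [t for t in tokens if t.lower() not in lowered]
    return [t for t in tokens if t not in stopwords]

# Every token of "Et ole parempi kuin minä" is a stopword, so nothing survives.
print(remove_stopwords(["Et", "ole", "parempi", "kuin", "minä"], FINNISH_STOPWORDS))  # []
```

This is also why the Results section shows an empty list for the example sentence: all of its tokens are Finnish stopwords.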
## Results ```bash +------+ |result| +------+ |[] | +------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|fi| |Size:|3.7 KB| --- layout: model title: Fast Neural Machine Translation Model from English to Tetun Dili author: John Snow Labs name: opus_mt_en_tdt date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, tdt, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `en` - target languages: `tdt` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_tdt_xx_2.7.0_2.4_1609168336715.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_tdt_xx_2.7.0_2.4_1609168336715.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_tdt", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_tdt", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.tdt').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_tdt| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Kannada RoBERTa Embeddings (from Naveen-k) author: John Snow Labs name: roberta_embeddings_KanBERTo date: 2022-04-14 tags: [roberta, embeddings, kn, open_source] task: Embeddings language: kn edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `KanBERTo` is a Kannada model originally trained by `Naveen-k`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_KanBERTo_kn_3.4.2_3.0_1649948277756.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_KanBERTo_kn_3.4.2_3.0_1649948277756.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_KanBERTo","kn") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["ನಾನು ಸ್ಪಾರ್ಕ್ ಎನ್ಎಲ್ಪಿ ಪ್ರೀತಿಸುತ್ತೇನೆ"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_KanBERTo","kn") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("ನಾನು ಸ್ಪಾರ್ಕ್ ಎನ್ಎಲ್ಪಿ ಪ್ರೀತಿಸುತ್ತೇನೆ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("kn.embed.KanBERTo").predict("""ನಾನು ಸ್ಪಾರ್ಕ್ ಎನ್ಎಲ್ಪಿ ಪ್ರೀತಿಸುತ್ತೇನೆ""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_KanBERTo| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|kn| |Size:|314.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/Naveen-k/KanBERTo - https://en.wikipedia.org/wiki/Kannada - https://traces1.inria.fr/oscar/files/compressed-orig/kn.txt.gz - https://traces1.inria.fr/oscar/ --- layout: model title: English BertForQuestionAnswering model (from ArpanZS) author: John Snow Labs name: bert_qa_debug_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `debug_squad` is an English model originally trained by `ArpanZS`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_debug_squad_en_4.0.0_3.0_1654187393032.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_debug_squad_en_4.0.0_3.0_1654187393032.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_debug_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_debug_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_ArpanZS").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_debug_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|408.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ArpanZS/debug_squad --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_bert_medium_finetuned_squad date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-medium-finetuned-squad` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_medium_finetuned_squad_en_4.0.0_3.0_1654183705887.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_medium_finetuned_squad_en_4.0.0_3.0_1654183705887.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_medium_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_medium_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.medium.by_anas-awadalla").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_medium_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|154.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/bert-medium-finetuned-squad --- layout: model title: Spanish RobertaForSequenceClassification Base Cased model (from mrm8488) author: John Snow Labs name: roberta_sequence_classifier_ruperta_base_finetuned_pawsx date: 2022-07-13 tags: [es, open_source, roberta, sequence_classification] task: Text Classification language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForSequenceClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `RuPERTa-base-finetuned-pawsx-es` is a Spanish model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_ruperta_base_finetuned_pawsx_es_4.0.0_3.0_1657716331993.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_sequence_classifier_ruperta_base_finetuned_pawsx_es_4.0.0_3.0_1657716331993.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_ruperta_base_finetuned_pawsx","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("class") pipeline = Pipeline(stages=[documentAssembler, tokenizer, classifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val classifier = RoBertaForSequenceClassification.pretrained("roberta_sequence_classifier_ruperta_base_finetuned_pawsx","es") .setInputCols(Array("document", "token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, classifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_sequence_classifier_ruperta_base_finetuned_pawsx| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|es| |Size:|472.9 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mrm8488/RuPERTa-base-finetuned-pawsx-es --- layout: model title: French RoBERTa Embeddings (from abhilash1910) author: John Snow Labs name: roberta_embeddings_french_roberta date: 2022-04-14 tags: [roberta, embeddings, fr, open_source] task: Embeddings language: fr edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `french-roberta` is a French model originally trained by `abhilash1910`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_french_roberta_fr_3.4.2_3.0_1649947893823.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_french_roberta_fr_3.4.2_3.0_1649947893823.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_french_roberta","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark Nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_french_roberta","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark Nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fr.embed.french_roberta").predict("""J'adore Spark Nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_french_roberta| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|fr| |Size:|255.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/abhilash1910/french-roberta --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_finetuned_squad_r3f date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad-r3f` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_squad_r3f_en_4.3.0_3.0_1674217712089.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_finetuned_squad_r3f_en_4.3.0_3.0_1674217712089.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_squad_r3f","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_finetuned_squad_r3f","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_finetuned_squad_r3f| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|464.2 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-finetuned-squad-r3f --- layout: model title: Legal Organizations Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_organizations_bert date: 2023-03-05 tags: [en, legal, classification, clauses, organizations, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset is aimed at contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Organizations` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. 
Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Organizations`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_organizations_bert_en_1.0.0_3.0_1678050638474.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_organizations_bert_en_1.0.0_3.0_1678050638474.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_organizations_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +--------------+ |result | +--------------+ |[Organizations]| |[Other]| |[Other]| |[Organizations]| +--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_organizations_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Organizations 0.98 0.94 0.96 47 Other 0.96 0.99 0.97 70 accuracy - - 0.97 117 macro-avg 0.97 0.96 0.96 117 weighted-avg 0.97 0.97 0.97 117 ``` --- layout: model title: English BertForQuestionAnswering model (from Sounak) author: John Snow Labs name: bert_qa_bert_large_finetuned date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-finetuned` is an English model originally trained by `Sounak`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_finetuned_en_4.0.0_3.0_1654536299509.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_large_finetuned_en_4.0.0_3.0_1654536299509.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_finetuned","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_large_finetuned","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.large.by_Sounak").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_large_finetuned| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Sounak/bert-large-finetuned --- layout: model title: Multilingual DistilBertForQuestionAnswering Base Cased model (from ruselkomp) author: John Snow Labs name: distilbert_qa_ruselkomp_base_cased_finetuned_squad date: 2023-01-03 tags: [xx, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: xx edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-multilingual-cased-finetuned-squad` is a Multilingual model originally trained by `ruselkomp`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_ruselkomp_base_cased_finetuned_squad_xx_4.3.0_3.0_1672767256602.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_ruselkomp_base_cased_finetuned_squad_xx_4.3.0_3.0_1672767256602.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ruselkomp_base_cased_finetuned_squad","xx")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_ruselkomp_base_cased_finetuned_squad","xx") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_ruselkomp_base_cased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|xx| |Size:|505.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ruselkomp/distilbert-base-multilingual-cased-finetuned-squad --- layout: model title: XLM-RoBERTa Base, CoNLL-03 NER Pipeline author: John Snow Labs name: xlm_roberta_base_token_classifier_conll03_pipeline date: 2022-04-21 tags: [open_source, ner, token_classifier, xlm_roberta, conll03, xlm, base, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [xlm_roberta_base_token_classifier_conll03](https://nlp.johnsnowlabs.com/2021/10/03/xlm_roberta_base_token_classifier_conll03_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_base_token_classifier_conll03_pipeline_en_3.4.1_3.0_1650542851685.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_base_token_classifier_conll03_pipeline_en_3.4.1_3.0_1650542851685.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("xlm_roberta_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ``` ```scala val pipeline = new PretrainedPipeline("xlm_roberta_base_token_classifier_conll03_pipeline", lang = "en") pipeline.annotate("My name is John and I work at John Snow Labs.") ```
## Results ```bash +--------------+---------+ |chunk |ner_label| +--------------+---------+ |John |PERSON | |John Snow Labs|ORG | +--------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_base_token_classifier_conll03_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|851.9 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - XlmRoBertaForTokenClassification - NerConverter - Finisher --- layout: model title: Sentiment Analysis Pipeline for Spanish texts author: John Snow Labs name: classifierdl_bert_sentiment_pipeline date: 2021-11-29 tags: [spanish, sentiment, es, classifier, open_source] task: Sentiment Analysis language: es edition: Spark NLP 3.3.0 spark_version: 2.4 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline identifies the sentiments (positive or negative) in Spanish texts. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_ES_SENTIMENT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_bert_sentiment_pipeline_es_3.3.0_2.4_1638192149292.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_bert_sentiment_pipeline_es_3.3.0_2.4_1638192149292.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("classifierdl_bert_sentiment_pipeline", lang = "es") result1 = pipeline.annotate("Estoy seguro de que esta vez pasará la entrevista.") result2 = pipeline.annotate("Soy una persona que intenta desayunar todas las mañanas sin falta.") result3 = pipeline.annotate("No estoy seguro de si mi salario mensual es suficiente para vivir.") ``` ```scala val pipeline = new PretrainedPipeline("classifierdl_bert_sentiment_pipeline", lang = "es") val result1 = pipeline.annotate("Estoy seguro de que esta vez pasará la entrevista.")(0) val result2 = pipeline.annotate("Soy una persona que intenta desayunar todas las mañanas sin falta.")(0) val result3 = pipeline.annotate("No estoy seguro de si mi salario mensual es suficiente para vivir.")(0) ```
## Results ```bash ['POSITIVE'] ['NEUTRAL'] ['NEGATIVE'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_bert_sentiment_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.3.0+| |License:|Open Source| |Edition:|Official| |Language:|es| ## Included Models - DocumentAssembler - BertSentenceEmbeddings - ClassifierDLModel --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from kevinbror) author: John Snow Labs name: distilbert_qa_kevinbror_base_uncased_finetuned_squad2 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad2` is an English model originally trained by `kevinbror`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_kevinbror_base_uncased_finetuned_squad2_en_4.3.0_3.0_1672773532721.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_kevinbror_base_uncased_finetuned_squad2_en_4.3.0_3.0_1672773532721.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kevinbror_base_uncased_finetuned_squad2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_kevinbror_base_uncased_finetuned_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_kevinbror_base_uncased_finetuned_squad2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/kevinbror/distilbert-base-uncased-finetuned-squad2 --- layout: model title: Translate English to Ukrainian Pipeline author: John Snow Labs name: translate_en_uk date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, uk, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `uk` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_uk_xx_2.7.0_2.4_1609686168178.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_uk_xx_2.7.0_2.4_1609686168178.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_uk", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_uk", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.uk').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_uk| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from Afrikaans to Russian author: John Snow Labs name: opus_mt_af_ru date: 2021-06-01 tags: [open_source, seq2seq, translation, af, ru, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). source languages: af target languages: ru {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_af_ru_xx_3.1.0_2.4_1622551937327.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_af_ru_xx_3.1.0_2.4_1622551937327.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_af_ru", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_af_ru", "xx") .setInputCols("sentence") .setOutputCol("translation") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.Afrikaans.translate_to.Russian').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_af_ru| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Detect Living Species (w2v_cc_300d) author: John Snow Labs name: ner_living_species date: 2022-06-23 tags: [gl, ner, clinical, licensed] task: Named Entity Recognition language: gl edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract living species from clinical texts in Galician which is critical to scientific disciplines like medicine, biology, ecology/biodiversity, nutrition and agriculture. This model is trained using `w2v_cc_300d` embeddings. It is trained on the [LivingNER](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) corpus that is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. **NOTE :** 1. The text files were translated from Spanish with a neural machine translation system. 2. The annotations were translated with the same neural machine translation system. 3. The translated annotations were transferred to the translated text files using an annotation transfer technology. 
## Predicted Entities `HUMAN`, `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_gl_3.5.3_3.0_1655976346794.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_gl_3.5.3_3.0_1655976346794.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "gl")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_living_species", "gl", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""Muller de 45 anos, sen antecedentes médicos de interese, que foi remitida á consulta de dermatoloxía de urxencias por lesións faciales de tres semanas de evolución. A paciente non presentaba lesións noutras localizaciones nin outra clínica de interese. No seu centro de saúde prescribíronlle corticoides tópicos ante a sospeita de picaduras de artrópodos e unha semana despois, antivirales orais baixo o diagnóstico de posible infección herpética. As lesións interferían de forma notable na súa vida persoal e profesional xa que traballaba de face ao púbico. 
Unha semana máis tarde o diagnóstico foi confirmado ao resultar o cultivo positivo a Staphylococcus aureus."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","gl") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_living_species", "gl","clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""Muller de 45 anos, sen antecedentes médicos de interese, que foi remitida á consulta de dermatoloxía de urxencias por lesións faciales de tres semanas de evolución. A paciente non presentaba lesións noutras localizaciones nin outra clínica de interese. No seu centro de saúde prescribíronlle corticoides tópicos ante a sospeita de picaduras de artrópodos e unha semana despois, antivirales orais baixo o diagnóstico de posible infección herpética. As lesións interferían de forma notable na súa vida persoal e profesional xa que traballaba de face ao púbico. 
Unha semana máis tarde o diagnóstico foi confirmado ao resultar o cultivo positivo a Staphylococcus aureus.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("gl.med_ner.living_species").predict("""Muller de 45 anos, sen antecedentes médicos de interese, que foi remitida á consulta de dermatoloxía de urxencias por lesións faciales de tres semanas de evolución. A paciente non presentaba lesións noutras localizaciones nin outra clínica de interese. No seu centro de saúde prescribíronlle corticoides tópicos ante a sospeita de picaduras de artrópodos e unha semana despois, antivirales orais baixo o diagnóstico de posible infección herpética. As lesións interferían de forma notable na súa vida persoal e profesional xa que traballaba de face ao púbico. Unha semana máis tarde o diagnóstico foi confirmado ao resultar o cultivo positivo a Staphylococcus aureus.""") ```
## Results ```bash +---------------------+-------+ |ner_chunk |label | +---------------------+-------+ |Muller |HUMAN | |paciente |HUMAN | |artrópodos |SPECIES| |antivirales |SPECIES| |herpética |SPECIES| |púbico |HUMAN | |Staphylococcus aureus|SPECIES| +---------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|gl| |Size:|15.2 MB| ## References [https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) ## Benchmarking ```bash label precision recall f1-score support B-HUMAN 0.88 0.97 0.92 2952 B-SPECIES 0.54 0.91 0.67 3333 I-HUMAN 0.74 0.75 0.74 206 I-SPECIES 0.59 0.85 0.70 1297 micro-avg 0.65 0.92 0.76 7788 macro-avg 0.69 0.87 0.76 7788 weighted-avg 0.68 0.92 0.77 7788 ``` --- layout: model title: Translate English to Kuanyama Pipeline author: John Snow Labs name: translate_en_kj date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, kj, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). 
Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `en`
- target languages: `kj`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_kj_xx_2.7.0_2.4_1609687416916.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_kj_xx_2.7.0_2.4_1609687416916.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_kj", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_kj", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.kj').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_kj| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Detect Person, Location, Organization, and Miscellaneous entities in Arabic (ANERcorp) author: John Snow Labs name: aner_cc_300d date: 2022-08-09 tags: [ner, ar, open_source] task: Named Entity Recognition language: ar edition: Spark NLP 4.0.2 spark_version: 3.0 supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses Arabic word embeddings to find 4 different types of entities in Arabic text. It is trained using `arabic_w2v_cc_300d` word embeddings, so please use the same embeddings in the pipeline. ## Predicted Entities `PER`, `LOC`, `ORG`, `MISC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/aner_cc_300d_ar_4.0.2_3.0_1660030385202.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/aner_cc_300d_ar_4.0.2_3.0_1660030385202.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = NerDLModel.pretrained("aner_cc_300d", "ar") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter])

light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

annotations = light_pipeline.fullAnnotate("في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز")
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = NerDLModel.pretrained("aner_cc_300d", "ar")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val nlp_pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, embeddings, ner, ner_converter))

val data = Seq("""في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز""").toDS.toDF("text")

val result =
nlp_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.ner").predict("""في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز""") ```
## Results ```bash | | ner_chunk | entity | |---:|-------------------------:|-------------:| | 0 | قوات الثورة العربية | ORG | | 1 | دمشق | LOC | | 2 | الإنكليز | PER | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|aner_cc_300d| |Type:|ner| |Compatibility:|Spark NLP 4.0.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token, word_embeddings]| |Output Labels:|[ner]| |Language:|ar| |Size:|14.8 MB| ## References This model is trained on data obtained from [http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp](http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp) ## Benchmarking ```bash label tp fp fn prec rec f1 B-LOC 163 28 34 0.853403 0.827411 0.840206 I-ORG 60 10 5 0.857142 0.923077 0.888889 I-MIS 124 53 53 0.700565 0.700565 0.700565 I-LOC 64 20 23 0.761904 0.735632 0.748538 B-MIS 297 71 52 0.807065 0.851003 0.828452 I-PER 84 23 13 0.785046 0.865979 0.823530 B-ORG 54 9 12 0.857142 0.818181 0.837210 B-PER 182 26 33 0.875 0.846512 0.860520 Macro-average 1028 240 225 0.812159 0.821045 0.816578 Micro-average 1028 240 225 0.810726 0.820431 0.815550 ``` --- layout: model title: Part of Speech for Latvian author: John Snow Labs name: pos_ud_lvtb date: 2021-03-09 tags: [part_of_speech, open_source, latvian, pos_ud_lvtb, lv] task: Part of Speech Tagging language: lv edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. 
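An averaged perceptron tagger scores each candidate tag as a sum of per-feature weights, updates weights only on misclassified tokens, and finally replaces every weight with its average over all training steps, which damps overfitting to late updates. A minimal sketch of the idea — an illustration with an invented feature set and toy data, not the Spark NLP implementation:

```python
from collections import defaultdict

class AveragedPerceptronTagger:
    """Toy averaged-perceptron POS tagger (illustrative only)."""

    def __init__(self):
        self.weights = defaultdict(float)   # (feature, tag) -> weight
        self.totals = defaultdict(float)    # running weight * lifetime sums
        self.timestamps = defaultdict(int)  # step at which a weight last changed
        self.step = 0

    def features(self, tokens, i):
        # Deliberately tiny feature set: current word, its suffix, previous word.
        return [f"word={tokens[i].lower()}",
                f"suffix={tokens[i][-3:].lower()}",
                f"prev={tokens[i - 1].lower() if i else '<s>'}"]

    def predict(self, feats, tags):
        scores = {t: sum(self.weights[(f, t)] for f in feats) for t in tags}
        return max(tags, key=lambda t: (scores[t], t))  # deterministic tie-break

    def _update_weight(self, key, delta):
        # Bank weight * (steps it was live) before changing it, for averaging.
        self.totals[key] += (self.step - self.timestamps[key]) * self.weights[key]
        self.timestamps[key] = self.step
        self.weights[key] += delta

    def train(self, sentences, tags, epochs=5):
        """Train once; weights are averaged in place at the end."""
        for _ in range(epochs):
            for tokens, gold in sentences:
                for i, y in enumerate(gold):
                    self.step += 1
                    feats = self.features(tokens, i)
                    guess = self.predict(feats, tags)
                    if guess != y:  # perceptron rule: update only on errors
                        for f in feats:
                            self._update_weight((f, y), +1.0)
                            self._update_weight((f, guess), -1.0)
        if self.step:  # replace weights with their time-averaged values
            for key, w in self.weights.items():
                self.totals[key] += (self.step - self.timestamps[key]) * w
                self.weights[key] = self.totals[key] / self.step

# Toy usage with two hand-labeled sentences and a hypothetical tagset.
tagger = AveragedPerceptronTagger()
tagset = ["DET", "NOUN", "VERB"]
tagger.train([(["the", "cat", "sleeps"], ["DET", "NOUN", "VERB"]),
              (["a", "dog", "runs"], ["DET", "NOUN", "VERB"])], tagset)
print(tagger.predict(tagger.features(["the", "cat", "sleeps"], 1), tagset))  # NOUN
```

In practice the feature set is far richer (prefixes, surrounding words, previously predicted tags) and training iterates over a full treebank, but the error-driven update and the final averaging have this same shape.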
## Predicted Entities - PROPN - VERB - ADJ - PUNCT - PRON - PART - CCONJ - NOUN - AUX - ADP - ADV - DET - NUM - SCONJ - SYM - X - INTJ {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_lvtb_lv_3.0.0_3.0_1615292214572.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_lvtb_lv_3.0.0_3.0_1615292214572.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

pos = PerceptronModel.pretrained("pos_ud_lvtb", "lv") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    pos
])

example = spark.createDataFrame([['Sveiki no John Sniega Labs! ']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val pos = PerceptronModel.pretrained("pos_ud_lvtb", "lv")
    .setInputCols(Array("document", "token"))
    .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))

val data = Seq("Sveiki no John Sniega Labs! ").toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu

text = ["Sveiki no John Sniega Labs! "]
token_df = nlu.load('lv.pos').predict(text)
token_df
```
## Results ```bash token pos 0 Sveiki VERB 1 no ADP 2 John PROPN 3 Sniega PROPN 4 Labs PROPN 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_lvtb| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|lv| --- layout: model title: Legal Third Supplemental Indenture Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_third_supplemental_indenture_bert date: 2023-02-02 tags: [en, legal, classification, third, supplemental, indenture, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_third_supplemental_indenture_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `third-supplemental-indenture` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `third-supplemental-indenture`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_third_supplemental_indenture_bert_en_1.0.0_3.0_1675360680501.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_third_supplemental_indenture_bert_en_1.0.0_3.0_1675360680501.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_third_supplemental_indenture_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+------------------------------+
|result                        |
+------------------------------+
|[third-supplemental-indenture]|
|[other]                       |
|[other]                       |
|[third-supplemental-indenture]|
+------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_third_supplemental_indenture_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.4 MB|

## References

Legal documents scraped from the Internet and classified in-house, plus SEC documents.

## Benchmarking

```bash
label                         precision  recall  f1-score  support
other                         0.96       0.92    0.94      122
third-supplemental-indenture  0.81       0.90    0.85      49
accuracy                      -          -       0.91      171
macro-avg                     0.89       0.91    0.90      171
weighted-avg                  0.92       0.91    0.91      171
```

---
layout: model
title: Pipeline to Detect Cellular/Molecular Biology Entities (BertForTokenClassification)
author: John Snow Labs
name: bert_token_classifier_ner_cellular_pipeline
date: 2022-03-21
tags: [licensed, ner, clinical, berfortokenclassification, cellular, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [bert_token_classifier_ner_cellular](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_cellular_en.html) model.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_cellular_pipeline_en_3.4.1_3.0_1647889939388.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_cellular_pipeline_en_3.4.1_3.0_1647889939388.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("bert_token_classifier_ner_cellular_pipeline", "en", "clinical/models") pipeline.annotate("Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.") ``` ```scala val pipeline = new PretrainedPipeline("bert_token_classifier_ner_cellular_pipeline", "en", "clinical/models") pipeline.annotate("Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. 
Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.cellular_pipeline").predict("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""") ```
## Results ```bash +-----------------------------------------------------------+---------+ |chunk |ner_label| +-----------------------------------------------------------+---------+ |intracellular signaling proteins |protein | |human T-cell leukemia virus type 1 promoter |DNA | |Tax |protein | |Tax-responsive element 1 |DNA | |cyclic AMP-responsive members |protein | |CREB/ATF family |protein | |transcription factors |protein | |Tax |protein | |human T-cell leukemia virus type 1 Tax-responsive element 1|DNA | |TRE-1 |DNA | |lacZ gene |DNA | |CYC1 promoter |DNA | |TRE-1 |DNA | |cyclic AMP response element-binding protein |protein | |CREB |protein | |CREB |protein | |GAL4 activation domain |protein | |GAD |protein | |reporter gene |DNA | |Tax |protein | +-----------------------------------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_cellular_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.8 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverter --- layout: model title: Part of Speech for Tamil (pos_ttb) author: John Snow Labs name: pos_ttb date: 2021-03-10 tags: [open_source, pos, ta] supported: true task: Part of Speech Tagging language: ta edition: Spark NLP 2.7.5 spark_version: 2.4 annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron` architecture. 
## Predicted Entities - ADJ - ADP - ADV - AUX - CCONJ - DET - NOUN - NUM - PART - PRON - PROPN - PUNCT - VERB - X {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ttb_ta_2.7.5_2.4_1615399578187.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ttb_ta_2.7.5_2.4_1615399578187.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

pos = PerceptronModel.pretrained("pos_ttb", "ta") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    pos
])

example = spark.createDataFrame([['எனவே ஐநா சபை மூலமாக நிதி உதவியை அளிக்கும் ஆறு இந்தியாவுக்கு தகவல் அனுப்பிய் உள்ளோம் என அந் நாட்டின் வெளியுறவுத் துறை செய்தித்தொடர்பாளர் அப்துல் பாசித் தெரிவித்தார் . ']], ["text"])
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val pos = PerceptronModel.pretrained("pos_ttb", "ta")
    .setInputCols(Array("document", "token"))
    .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))

val data = Seq("எனவே ஐநா சபை மூலமாக நிதி உதவியை அளிக்கும் ஆறு இந்தியாவுக்கு தகவல் அனுப்பிய் உள்ளோம் என அந் நாட்டின் வெளியுறவுத் துறை செய்தித்தொடர்பாளர் அப்துல் பாசித் தெரிவித்தார் . ").toDF("text")
val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu

text = ["எனவே ஐநா சபை மூலமாக நிதி உதவியை அளிக்கும் ஆறு இந்தியாவுக்கு தகவல் அனுப்பிய் உள்ளோம் என அந் நாட்டின் வெளியுறவுத் துறை செய்தித்தொடர்பாளர் அப்துல் பாசித் தெரிவித்தார் . "]
token_df = nlu.load('ta.pos.ttb').predict(text)
token_df
```
## Results ```bash +---------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+ |text |result | +---------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+ |எனவே ஐநா சபை மூலமாக நிதி உதவியை அளிக்கும் ஆறு இந்தியாவுக்கு தகவல் அனுப்பிய் உள்ளோம் என அந் நாட்டின் வெளியுறவுத் துறை செய்தித்தொடர்பாளர் அப்துல் பாசித் தெரிவித்தார் .|[ADV, PROPN, NOUN, ADP, NOUN, NOUN, ADJ, PART, PROPN, NOUN, VERB, AUX, PART, DET, NOUN, NOUN, NOUN, NOUN, PROPN, PROPN, VERB, PUNCT]| +---------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ttb| |Compatibility:|Spark NLP 2.7.5+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[pos]| |Language:|ta| ## Data Source The model was trained on the [Universal Dependencies](https://www.universaldependencies.org) data set. 
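Universal Dependencies treebanks ship as CoNLL-U files: one token per line with ten tab-separated columns, the surface form in column 2 (FORM) and the part-of-speech tag in column 4 (UPOS), and a blank line terminating each sentence. A minimal sketch of extracting (form, UPOS) pairs from such a file, per the CoNLL-U layout — the sample sentence below is invented:

```python
def read_conllu_pos(lines):
    """Yield one [(form, upos), ...] list per sentence from CoNLL-U lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                      # blank line terminates a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):          # sentence-level metadata/comments
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        sentence.append((cols[1], cols[3]))  # FORM is column 2, UPOS is column 4
    if sentence:                          # flush a sentence with no trailing blank
        yield sentence

# Invented two-token sample in valid CoNLL-U layout (10 columns per token).
sample = [
    "# sent_id = toy-1\n",
    "1\tSveiki\tsveiki\tINTJ\t_\t_\t0\troot\t_\t_\n",
    "2\t!\t!\tPUNCT\t_\t_\t1\tpunct\t_\t_\n",
    "\n",
]
print(list(read_conllu_pos(sample)))  # [[('Sveiki', 'INTJ'), ('!', 'PUNCT')]]
```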
## Benchmarking ```bash | | precision | recall | f1-score | support | |--------------|-----------|--------|----------|---------| | ADJ | 0.21 | 0.58 | 0.31 | 53 | | ADP | 0.76 | 0.41 | 0.53 | 68 | | ADV | 0.70 | 0.57 | 0.63 | 75 | | AUX | 0.73 | 0.74 | 0.73 | 151 | | CCONJ | 1.00 | 1.00 | 1.00 | 8 | | DET | 0.80 | 0.69 | 0.74 | 29 | | NOUN | 0.72 | 0.80 | 0.76 | 526 | | NUM | 0.87 | 0.71 | 0.78 | 91 | | PART | 0.82 | 0.81 | 0.82 | 168 | | PRON | 0.73 | 0.75 | 0.74 | 61 | | PROPN | 0.67 | 0.57 | 0.62 | 249 | | PUNCT | 0.83 | 0.89 | 0.86 | 190 | | VERB | 0.65 | 0.51 | 0.57 | 319 | | X | 0.00 | 0.00 | 0.00 | 1 | | accuracy | | | 0.70 | 1989 | | macro avg | 0.68 | 0.65 | 0.65 | 1989 | | weighted avg | 0.72 | 0.70 | 0.70 | 1989 | ``` --- layout: model title: Detect Clinical Entities in Romanian (Bert, Base, Cased) author: John Snow Labs name: ner_clinical_bert date: 2022-08-12 tags: [licensed, clinical, ro, ner, bert] task: Named Entity Recognition language: ro edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true recommended: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract clinical entities from Romanian clinical texts. This model is trained using `bert_base_cased` embeddings. 
## Predicted Entities `Measurements`, `Form`, `Symptom`, `Route`, `Procedure`, `Disease_Syndrome_Disorder`, `Score`, `Drug_Ingredient`, `Pulse`, `Frequency`, `Date`, `Body_Part`, `Drug_Brand_Name`, `Time`, `Direction`, `Medical_Device`, `Imaging_Technique`, `Test`, `Imaging_Findings`, `Imaging_Test`, `Test_Result`, `Weight`, `Clinical_Dept`, `Units` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_bert_ro_4.0.2_3.0_1660295520992.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_clinical_bert_ro_4.0.2_3.0_1660295520992.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical_bert","ro","clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter]) data = spark.createDataFrame([[""" Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. 
Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min."""]]).toDF("text")

result = nlpPipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_model = MedicalNerModel.pretrained("ner_clinical_bert", "ro", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))

val data = Seq("""Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.""").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu

nlu.load("ro.embed.clinical.bert.base_cased").predict(""" Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp.
Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.""") ```
## Results ```bash +--------------------------+-------------------------+ |chunks |entities | +--------------------------+-------------------------+ |Angio CT cardio-toracic |Imaging_Test | |Atrezie |Disease_Syndrome_Disorder| |valva pulmonara |Body_Part | |Hipoplazie |Disease_Syndrome_Disorder| |VS |Body_Part | |Atrezie |Disease_Syndrome_Disorder| |VAV stang |Body_Part | |Anastomoza Glenn |Disease_Syndrome_Disorder| |Tromboza |Disease_Syndrome_Disorder| |Sectia Clinica Cardiologie|Clinical_Dept | |GE Revolution HD |Medical_Device | |Branula albastra |Medical_Device | |membrului superior drept |Body_Part | |Scout |Body_Part | |30 ml |Dosage | |Iomeron 350 |Drug_Ingredient | |2.2 ml/s |Dosage | |20 ml |Dosage | |ser fiziologic |Drug_Ingredient | |angio-CT |Imaging_Test | +--------------------------+-------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_clinical_bert| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ro| |Size:|16.2 MB| ## Benchmarking ```bash label precision recall f1-score support Body_Part 0.91 0.93 0.92 679 Clinical_Dept 0.68 0.65 0.67 97 Date 0.99 0.99 0.99 87 Direction 0.66 0.76 0.70 50 Disease_Syndrome_Disorder 0.73 0.76 0.74 121 Dosage 0.78 1.00 0.87 38 Drug_Ingredient 0.90 0.94 0.92 48 Form 1.00 1.00 1.00 6 Imaging_Findings 0.86 0.82 0.84 201 Imaging_Technique 0.92 0.92 0.92 26 Imaging_Test 0.93 0.98 0.95 205 Measurements 0.71 0.69 0.70 214 Medical_Device 0.85 0.81 0.83 42 Pulse 0.82 1.00 0.90 9 Route 1.00 0.91 0.95 33 Score 1.00 0.98 0.99 41 Time 1.00 1.00 1.00 28 Units 0.60 0.93 0.73 88 Weight 0.82 1.00 0.90 9 micro-avg 0.84 0.87 0.86 2037 macro-avg 0.70 0.74 0.72 2037 weighted-avg 0.84 0.87 0.85 2037 ``` --- layout: model title: English asr_wav2vec2_base_timit_demo_colab2_by_ahmad573 TFWav2Vec2ForCTC from ahmad573 author: John Snow Labs name: 
asr_wav2vec2_base_timit_demo_colab2_by_ahmad573
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab2_by_ahmad573` is an English model originally trained by ahmad573.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use `asr_wav2vec2_base_timit_demo_colab2_by_ahmad573_gpu`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab2_by_ahmad573_en_4.2.0_3.0_1664036825737.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab2_by_ahmad573_en_4.2.0_3.0_1664036825737.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab2_by_ahmad573", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab2_by_ahmad573", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab2_by_ahmad573| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: English asr_Wav2Vec2_XLSR_Bengali_10500 TFWav2Vec2ForCTC from shoubhik author: John Snow Labs name: asr_Wav2Vec2_XLSR_Bengali_10500 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_Wav2Vec2_XLSR_Bengali_10500` is an English model originally trained by shoubhik. NOTE: This model only works on a CPU; if you need to use it on a GPU device, please use asr_Wav2Vec2_XLSR_Bengali_10500_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_Wav2Vec2_XLSR_Bengali_10500_en_4.2.0_3.0_1664104957251.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_Wav2Vec2_XLSR_Bengali_10500_en_4.2.0_3.0_1664104957251.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_Wav2Vec2_XLSR_Bengali_10500", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_Wav2Vec2_XLSR_Bengali_10500", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_Wav2Vec2_XLSR_Bengali_10500| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|3.6 GB| --- layout: model title: Translate English to Pangasinan Pipeline author: John Snow Labs name: translate_en_pag date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, pag, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `pag` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_pag_xx_2.7.0_2.4_1609691698746.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_pag_xx_2.7.0_2.4_1609691698746.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_pag", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_pag", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.pag').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_pag| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Extract relations between effects of using multiple drugs (ReDL) author: John Snow Labs name: redl_drug_drug_interaction_biobert date: 2023-01-14 tags: [relation_extraction, en, licensed, clinical, tensorflow] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract potential improvements or harmful effects of Drug-Drug interactions (DDIs) when two or more drugs are taken at the same time or at a certain interval. ## Predicted Entities `DDI-advise`, `DDI-effect`, `DDI-false`, `DDI-int`, `DDI-mechanism` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_DRUG_DRUG_INT/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_drug_drug_interaction_biobert_en_4.2.4_3.0_1673734887835.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_drug_drug_interaction_biobert_en_4.2.4_3.0_1673734887835.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") words_embedder = WordEmbeddingsModel() \ .pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverterInternal() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel() \ .pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") # Set a filter on pairs of named entities which will be treated as relation candidates re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks") #.setRelationPairs(['SYMPTOM-EXTERNAL_BODY_PART_OR_REGION']) # The dataset this model is trained on is annotated sentence-wise. # This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. 
re_model = RelationExtractionDLModel()\ .pretrained('redl_drug_drug_interaction_biobert', 'en', "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) text="""When carbamazepine is withdrawn from the combination therapy, aripiprazole dose should then be reduced. \ If additional adrenergic drugs are to be administered by any route, \ they should be used with caution because the pharmacologically predictable sympathetic effects of Metformin may be potentiated""" data = spark.createDataFrame([[text]]).toDF("text") p_model = pipeline.fit(data) result = p_model.transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val 
re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") // .setRelationPairs(Array("SYMPTOM-EXTERNAL_BODY_PART_OR_REGION")) // The dataset this model is trained on is annotated sentence-wise. // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. val re_model = RelationExtractionDLModel() .pretrained("redl_drug_drug_interaction_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("""When carbamazepine is withdrawn from the combination therapy, aripiprazole dose should then be reduced. If additional adrenergic drugs are to be administered by any route, they should be used with caution because the pharmacologically predictable sympathetic effects of Metformin may be potentiated""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.drug_drug_interaction").predict("""When carbamazepine is withdrawn from the combination therapy, aripiprazole dose should then be reduced. \ If additional adrenergic drugs are to be administered by any route, \ they should be used with caution because the pharmacologically predictable sympathetic effects of Metformin may be potentiated""") ```
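The model emits a `relation` label and a `confidence` score per candidate pair; a common post-processing step is to drop `DDI-false` predictions and pairs below a confidence threshold. A plain-Python sketch over collected rows (the row values here are illustrative, not real model output; the field names mirror the Results columns of this card):

```python
# Sketch: keep only positive DDI relations above a confidence threshold.
# Rows are dicts as they might look after result.select(...).collect();
# the data values are made up for illustration.
rows = [
    {"relation": "DDI-false",  "chunk1": "carbamazepine",    "chunk2": "aripiprazole", "confidence": 0.92},
    {"relation": "DDI-advise", "chunk1": "carbamazepine",    "chunk2": "aripiprazole", "confidence": 0.71},
    {"relation": "DDI-effect", "chunk1": "adrenergic drugs", "chunk2": "Metformin",    "confidence": 0.42},
]

def keep(row, threshold=0.5):
    """Keep a row only if it is a positive relation with enough confidence."""
    return row["relation"] != "DDI-false" and row["confidence"] >= threshold

positive = [r for r in rows if keep(r)]
print([r["relation"] for r in positive])  # ['DDI-advise']
```

Note that `setPredictionThreshold(0.5)` in the pipeline already bounds the scores the model returns; this extra pass is only needed when you want a stricter cut-off or to discard the `DDI-false` class.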
## Results ```bash +---------+-------+-------------+-----------+-------------+-------+-------------+-----------+------------+----------+ | relation|entity1|entity1_begin|entity1_end| chunk1|entity2|entity2_begin|entity2_end| chunk2|confidence| +---------+-------+-------------+-----------+-------------+-------+-------------+-----------+------------+----------+ |DDI-false| DRUG| 5| 17|carbamazepine| DRUG| 62| 73|aripiprazole|0.91685396| +---------+-------+-------------+-----------+-------------+-------+-------------+-----------+------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_drug_drug_interaction_biobert| |Compatibility:|Healthcare NLP 4.2.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|401.7 MB| ## References Trained on the DDI Extraction corpus. ## Benchmarking ```bash label Recall Precision F1 Support DDI-advise 0.758 0.874 0.812 211 DDI-effect 0.759 0.754 0.756 348 DDI-false 0.977 0.957 0.967 4097 DDI-int 0.175 0.458 0.253 63 DDI-mechanism 0.783 0.853 0.816 281 Avg. 0.690 0.779 0.721 - ``` --- layout: model title: Legal Definitions and Interpretation Clause Binary Classifier author: John Snow Labs name: legclf_definitions_and_interpretation_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `definitions-and-interpretation` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. 
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `definitions-and-interpretation` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_definitions_and_interpretation_clause_en_1.0.0_3.2_1660122329652.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_definitions_and_interpretation_clause_en_1.0.0_3.2_1660122329652.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_definitions_and_interpretation_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
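The description notes that several binary clause classifiers can be combined, yielding a True/False value per clause type. A plain-Python sketch of that aggregation step (the classifier outputs below are illustrative; each binary model answers either its clause label or `other`):

```python
# Sketch: fold the outputs of several binary clause classifiers into one
# True/False map per clause type. The prediction lists mimic the single-element
# "category" result each classifier produces for one document.
predictions = {
    "definitions-and-interpretation": ["definitions-and-interpretation"],
    "confidentiality": ["other"],
    "governing-law": ["other"],
}

def to_flags(preds):
    """A clause type is flagged True when its classifier did not answer 'other'."""
    return {clause: result[0] != "other" for clause, result in preds.items()}

flags = to_flags(predictions)
print(flags)  # {'definitions-and-interpretation': True, 'confidentiality': False, 'governing-law': False}
```

In a real pipeline each entry of `predictions` would come from running the document through one `legclf_*` model; only the final dict comprehension is specific to this sketch.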
## Results ```bash +-------+ | result| +-------+ |[definitions-and-interpretation]| |[other]| |[other]| |[definitions-and-interpretation]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_definitions_and_interpretation_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support definitions-and-interpretation 0.96 0.94 0.95 54 other 0.98 0.99 0.98 156 accuracy - - 0.98 210 macro-avg 0.97 0.97 0.97 210 weighted-avg 0.98 0.98 0.98 210 ``` --- layout: model title: English image_classifier_vit_Tomato_Leaf_Classifier ViTForImageClassification from Aftabhussain author: John Snow Labs name: image_classifier_vit_Tomato_Leaf_Classifier date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_Tomato_Leaf_Classifier` is an English model originally trained by Aftabhussain. ## Predicted Entities `Bacterial_spot`, `Healthy` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Tomato_Leaf_Classifier_en_4.1.0_3.0_1660167088791.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_Tomato_Leaf_Classifier_en_4.1.0_3.0_1660167088791.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_Tomato_Leaf_Classifier", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_Tomato_Leaf_Classifier", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_Tomato_Leaf_Classifier| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from ManqingLiu) author: John Snow Labs name: xlmroberta_ner_manqingliu_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `ManqingLiu`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_manqingliu_base_finetuned_panx_de_4.1.0_3.0_1660429790495.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_manqingliu_base_finetuned_panx_de_4.1.0_3.0_1660429790495.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_manqingliu_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_manqingliu_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
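The `NerConverter` stage merges token-level IOB tags (`B-PER`, `I-PER`, `O`, ...) into entity chunks. A minimal plain-Python sketch of that merge logic, which mirrors what the converter does without reproducing its actual implementation:

```python
# Sketch of IOB-to-chunk merging: contiguous B-/I- tags of the same entity
# type collapse into one chunk; anything else closes the current chunk.
def iob_to_chunks(tokens, tags):
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = {"entity": tag[2:], "text": token}
        elif tag.startswith("I-") and current and current["entity"] == tag[2:]:
            current["text"] += " " + token
        else:  # "O" tag, or an I- tag that does not continue the open chunk
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return chunks

tokens = ["Angela", "Merkel", "besuchte", "Berlin"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(iob_to_chunks(tokens, tags))
# [{'entity': 'PER', 'text': 'Angela Merkel'}, {'entity': 'LOC', 'text': 'Berlin'}]
```

The real converter also tracks character offsets and confidence metadata; this sketch keeps only the label/text merge to show why the pipeline needs a converter stage after the token classifier.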
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_manqingliu_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/ManqingLiu/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: Legal Specific Performance Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_specific_performance_bert date: 2023-03-05 tags: [en, legal, classification, clauses, specific_performance, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Specific_Performance` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. 
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Specific_Performance`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_specific_performance_bert_en_1.0.0_3.0_1678050609899.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_specific_performance_bert_en_1.0.0_3.0_1678050609899.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_specific_performance_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Specific_Performance]| |[Other]| |[Other]| |[Specific_Performance]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_specific_performance_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.4 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Other 0.98 0.98 0.98 53 Specific_Performance 0.97 0.97 0.97 34 accuracy - - 0.98 87 macro-avg 0.98 0.98 0.98 87 weighted-avg 0.98 0.98 0.98 87 ``` --- layout: model title: Pipeline to Extract Clinical Abbreviations and Acronyms author: John Snow Labs name: ner_abbreviation_clinical_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_abbreviation_clinical](https://nlp.johnsnowlabs.com/2021/12/30/ner_abbreviation_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_abbreviation_clinical_pipeline_en_3.4.1_3.0_1647874931143.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_abbreviation_clinical_pipeline_en_3.4.1_3.0_1647874931143.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_abbreviation_clinical_pipeline", "en", "clinical/models") pipeline.annotate("Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.") ``` ```scala val pipeline = new PretrainedPipeline("ner_abbreviation_clinical_pipeline", "en", "clinical/models") pipeline.annotate("Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical-abbreviation.pipeline").predict("""Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.""") ```
## Results ```bash +-----+---------+ |chunk|ner_label| +-----+---------+ |CBC |ABBR | |AB |ABBR | |VDRL |ABBR | |HIV |ABBR | +-----+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_abbreviation_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Legal Confidentiality Clause Binary Classifier author: John Snow Labs name: legclf_cuad_confidentiality_clause date: 2022-09-20 tags: [en, legal, classification, clauses, confidentiality, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `confidentiality` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above). 
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `confidentiality` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cuad_confidentiality_clause_en_1.0.0_3.2_1663693181343.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_cuad_confidentiality_clause_en_1.0.0_3.2_1663693181343.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_cuad_confidentiality_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+-----------------+
|           result|
+-----------------+
|[confidentiality]|
|          [other]|
|          [other]|
|[confidentiality]|
+-----------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_cuad_confidentiality_clause|
|Type:|legal|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.2 MB|

## References

In-house annotations on the CUAD dataset

## Benchmarking

```bash
          label  precision  recall  f1-score  support
confidentiality       1.00    0.91      0.95       11
          other       0.95    1.00      0.97       19
       accuracy          -       -      0.97       30
      macro-avg       0.97    0.95      0.96       30
   weighted-avg       0.97    0.97      0.97       30
```

---
layout: model
title: English asr_wav2vec2_base_timit_demo_colab51 TFWav2Vec2ForCTC from hassnain
author: John Snow Labs
name: asr_wav2vec2_base_timit_demo_colab51
date: 2022-09-24
tags: [wav2vec2, en, audio, open_source, asr]
task: Automatic Speech Recognition
language: en
nav_key: models
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab51` is an English model originally trained by hassnain.

NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab51_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab51_en_4.2.0_3.0_1664021491825.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab51_en_4.2.0_3.0_1664021491825.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab51", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab51", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
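The `audioDf` used above is assumed to hold a column of floating-point samples named `audio_content`; a stdlib-only sketch for turning a 16-bit PCM mono WAV file into such an array (the helper name is illustrative — production code would typically use librosa or soundfile):

```python
import struct
import wave

def load_wav_as_floats(path):
    """Read a 16-bit PCM mono WAV file into floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2 and wf.getnchannels() == 1
        raw = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [s / 32768.0 for s in samples]

# floats = load_wav_as_floats("sample.wav")
# audioDf = spark.createDataFrame([[floats]], ["audio_content"])
```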
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|asr_wav2vec2_base_timit_demo_colab51|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|en|
|Size:|355.0 MB|

---
layout: model
title: English image_classifier_vit_road_good_damaged_condition ViTForImageClassification from edixo
author: John Snow Labs
name: image_classifier_vit_road_good_damaged_condition
date: 2022-08-10
tags: [vit, en, images, open_source]
task: Image Classification
language: en
nav_key: models
edition: Spark NLP 4.1.0
spark_version: 3.0
supported: true
annotator: ViTForImageClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_road_good_damaged_condition` is an English model originally trained by edixo.

## Predicted Entities

`damaged road`, `good road`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_road_good_damaged_condition_en_4.1.0_3.0_1660171740568.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_road_good_damaged_condition_en_4.1.0_3.0_1660171740568.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_road_good_damaged_condition", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_road_good_damaged_condition", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
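Under the hood, the classifier scores every label and the `class` column carries the winner; a toy softmax/argmax illustration of how a top class is picked (the scores below are invented, not produced by this model):

```python
import math

def top_class(scores, labels):
    """Softmax the raw scores and return (best_label, probability)."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = probs.index(max(probs))
    return labels[best], probs[best]

print(top_class([2.0, -1.0], ["damaged road", "good road"]))
```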
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|image_classifier_vit_road_good_damaged_condition|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[image_assembler]|
|Output Labels:|[class]|
|Language:|en|
|Size:|321.9 MB|

---
layout: model
title: Spanish RoBERTa Embeddings (from MMG)
author: John Snow Labs
name: roberta_embeddings_mlm_spanish_roberta_base
date: 2022-04-14
tags: [roberta, embeddings, es, open_source]
task: Embeddings
language: es
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `mlm-spanish-roberta-base` is a Spanish model originally trained by `MMG`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_mlm_spanish_roberta_base_es_3.4.2_3.0_1649945432879.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_mlm_spanish_roberta_base_es_3.4.2_3.0_1649945432879.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_mlm_spanish_roberta_base","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_mlm_spanish_roberta_base","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.mlm_spanish_roberta_base").predict("""Me encanta chispa nlp""") ```
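The `embeddings` column above carries one vector per token; a common post-processing step is mean pooling those vectors into one sentence vector. A plain-Python sketch (toy 3-dimensional vectors, not real model output):

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token embeddings into one vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

tokens = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
print(mean_pool(tokens))  # [2.0, 2.0, 1.0]
```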
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_mlm_spanish_roberta_base| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|473.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/MMG/mlm-spanish-roberta-base - https://github.com/dccuchile/GLUES --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_6_h_768 date: 2022-12-02 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-6_H-768` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_768_zh_4.2.4_3.0_1670021751518.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_768_zh_4.2.4_3.0_1670021751518.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_768","zh") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_768","zh")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
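Embedding vectors like those produced above are usually compared with cosine similarity; a plain-Python sketch over toy 2-dimensional vectors (real vectors from this model have 768 dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0
```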
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_embeddings_chinese_roberta_l_6_h_768|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|zh|
|Size:|224.1 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/uer/chinese_roberta_L-6_H-768
- https://github.com/dbiir/UER-py/
- https://arxiv.org/abs/1909.05658
- https://arxiv.org/abs/1908.08962
- https://github.com/dbiir/UER-py/wiki/Modelzoo
- https://github.com/CLUEbenchmark/CLUECorpus2020/
- https://cloud.tencent.com/

---
layout: model
title: Icelandic DistilBertForTokenClassification Cased model (from m3hrdadfi)
author: John Snow Labs
name: dtilbert_token_classifier_typo_detector
date: 2023-03-03
tags: [is, open_source, distilbert, token_classification, ner, tensorflow]
task: Named Entity Recognition
language: is
edition: Spark NLP 4.3.1
spark_version: 3.0
supported: true
engine: tensorflow
annotator: DistilBertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `typo-detector-distilbert-is` is an Icelandic model originally trained by `m3hrdadfi`.

## Predicted Entities

`TYPO`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/dtilbert_token_classifier_typo_detector_is_4.3.1_3.0_1677881909024.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/dtilbert_token_classifier_typo_detector_is_4.3.1_3.0_1677881909024.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

tokenClassifier = DistilBertForTokenClassification.pretrained("dtilbert_token_classifier_typo_detector","is") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val tokenClassifier = DistilBertForTokenClassification.pretrained("dtilbert_token_classifier_typo_detector","is")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
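The `ner` column holds one tag per token; tokens tagged as typos can be pulled out with a few lines of plain Python (the Icelandic example below is made up for illustration):

```python
def flag_typos(tokens, tags):
    """Return the tokens whose predicted tag marks them as typos
    (B-TYPO / I-TYPO in IOB notation, or a bare TYPO label)."""
    return [t for t, tag in zip(tokens, tags) if tag.endswith("TYPO")]

tokens = ["Ég", "hef", "aldrey", "séð", "hana"]
tags = ["O", "O", "B-TYPO", "O", "O"]
print(flag_typos(tokens, tags))  # ['aldrey']
```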
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|dtilbert_token_classifier_typo_detector|
|Compatibility:|Spark NLP 4.3.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|is|
|Size:|505.7 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/m3hrdadfi/typo-detector-distilbert-is
- https://github.com/m3hrdadfi/typo-detector/issues

---
layout: model
title: Spanish Named Entity Recognition (from mrm8488)
author: John Snow Labs
name: roberta_ner_RuPERTa_base_finetuned_ner
date: 2022-05-03
tags: [roberta, ner, open_source, es]
task: Named Entity Recognition
language: es
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `RuPERTa-base-finetuned-ner` is a Spanish model originally trained by `mrm8488`.

## Predicted Entities

`MISC`, `ORG`, `LOC`, `PER`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_ner_RuPERTa_base_finetuned_ner_es_3.4.2_3.0_1651593655495.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_ner_RuPERTa_base_finetuned_ner_es_3.4.2_3.0_1651593655495.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_RuPERTa_base_finetuned_ner","es") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_RuPERTa_base_finetuned_ner","es") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.ner.RuPERTa_base_finetuned_ner").predict("""Amo Spark NLP""") ```
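The raw `ner` column holds one IOB tag per token; grouping those tags into entity chunks (the job a `NerConverter` stage performs inside Spark NLP) can be sketched in plain Python (the example tokens and tags are illustrative):

```python
def iob_to_chunks(tokens, tags):
    """Collapse IOB tags (B-X / I-X / O) into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Juan", "vive", "en", "Buenos", "Aires"]
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC"]
print(iob_to_chunks(tokens, tags))  # [('Juan', 'PER'), ('Buenos Aires', 'LOC')]
```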
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_ner_RuPERTa_base_finetuned_ner|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|es|
|Size:|470.7 MB|
|Case sensitive:|true|
|Max sentence length:|128|

## References

- https://huggingface.co/mrm8488/RuPERTa-base-finetuned-ner
- https://www.kaggle.com/nltkdata/conll-corpora
- https://twitter.com/mrm8488

---
layout: model
title: Detect Drug Information (Small)
author: John Snow Labs
name: ner_posology_small
class: NerDLModel
language: en
nav_key: models
repository: clinical/models
date: 2020-04-21
task: Named Entity Recognition
edition: Healthcare NLP 2.4.2
spark_version: 2.4
tags: [clinical,licensed,ner,en]
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

{:.h2_title}
## Description

Pretrained named entity recognition deep learning model for posology. This NER model is trained with the ``embeddings_clinical`` word embeddings model, so be sure to use the same embeddings in the pipeline.

{:.h2_title}
## Predicted Entities

``DOSAGE``, ``DRUG``, ``DURATION``, ``FORM``, ``FREQUENCY``, ``ROUTE``, ``STRENGTH``.
{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_POSOLOGY/){:.button.button-orange}{:target="_blank"}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_posology_small_en_2.4.2_2.4_1587513301751.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_posology_small_en_2.4.2_2.4_1587513301751.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

{:.h2_title}
## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_posology_small","en","clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([['The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. 
She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.']], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val model = NerDLModel.pretrained("ner_posology_small","en","clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, model, ner_converter)) val data = Seq("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. 
She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.posology.small").predict("""The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. 
The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d.""") ```
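The `ner_chunk` output of the pipeline above yields (chunk, label) pairs, and a common post-processing step is to attach each drug's attributes to it. A plain-Python sketch (the rule — attach every attribute to the most recent DRUG — is a simplification; the chunks below are a hand-picked subset):

```python
def group_posology(chunks):
    """Attach DOSAGE/ROUTE/FREQUENCY/... chunks to the preceding DRUG."""
    grouped, current = [], None
    for text, label in chunks:
        if label == "DRUG":
            current = {"drug": text, "attributes": []}
            grouped.append(current)
        elif current is not None:
            current["attributes"].append((label, text))
    return grouped

chunks = [("Fragmin", "DRUG"), ("5000 units", "DOSAGE"),
          ("subcutaneously", "ROUTE"), ("daily", "FREQUENCY"),
          ("Xenaderm", "DRUG"), ("topically", "ROUTE")]
print(group_posology(chunks))
```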
## Results

```bash
+--------------+---------+
|chunk         |ner      |
+--------------+---------+
|insulin       |DRUG     |
|Bactrim       |DRUG     |
|for 14 days   |DURATION |
|Fragmin       |DRUG     |
|5000 units    |DOSAGE   |
|subcutaneously|ROUTE    |
|daily         |FREQUENCY|
|Xenaderm      |DRUG     |
|topically     |ROUTE    |
|b.i.d.,       |FREQUENCY|
|Lantus        |DRUG     |
|40 units      |DOSAGE   |
|subcutaneously|ROUTE    |
|at bedtime    |FREQUENCY|
|OxyContin     |DRUG     |
|30 mg         |STRENGTH |
|p.o           |ROUTE    |
|q.12 h        |FREQUENCY|
|folic acid    |DRUG     |
|1 mg          |STRENGTH |
+--------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Name:|ner_posology_small|
|Type:|NerDLModel|
|Compatibility:|Spark NLP 2.4.2+|
|License:|Licensed|
|Edition:|Official|
|Input labels:|[sentence, token, word_embeddings]|
|Output labels:|[ner]|
|Language:|en|
|Case sensitive:|false|
|Dependencies:|embeddings_clinical|

{:.h2_title}
## Data Source

Trained on the 2018 i2b2 dataset (no FDA) with ``embeddings_clinical``.
https://www.i2b2.org/NLP/Medication {:.h2_title} ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|--------------:|------:|------:|------:|---------:|---------:|---------:| | 0 | B-DRUG | 1408 | 62 | 99 | 0.957823 | 0.934307 | 0.945919 | | 1 | B-STRENGTH | 470 | 43 | 29 | 0.916179 | 0.941884 | 0.928854 | | 2 | I-DURATION | 123 | 22 | 8 | 0.848276 | 0.938931 | 0.891304 | | 3 | I-STRENGTH | 499 | 66 | 15 | 0.883186 | 0.970817 | 0.924931 | | 4 | I-FREQUENCY | 945 | 47 | 55 | 0.952621 | 0.945 | 0.948795 | | 5 | B-FORM | 365 | 13 | 12 | 0.965608 | 0.96817 | 0.966887 | | 6 | B-DOSAGE | 298 | 27 | 26 | 0.916923 | 0.919753 | 0.918336 | | 7 | I-DOSAGE | 348 | 29 | 22 | 0.923077 | 0.940541 | 0.931727 | | 8 | I-DRUG | 208 | 25 | 60 | 0.892704 | 0.776119 | 0.830339 | | 9 | I-ROUTE | 10 | 0 | 2 | 1 | 0.833333 | 0.909091 | | 10 | B-ROUTE | 467 | 4 | 25 | 0.991507 | 0.949187 | 0.969886 | | 11 | B-DURATION | 64 | 10 | 10 | 0.864865 | 0.864865 | 0.864865 | | 12 | B-FREQUENCY | 588 | 12 | 17 | 0.98 | 0.971901 | 0.975934 | | 13 | I-FORM | 264 | 5 | 4 | 0.981413 | 0.985075 | 0.98324 | | 14 | Macro-average | 6057 | 365 | 384 | 0.93387 | 0.924277 | 0.929049 | | 15 | Micro-average | 6057 | 365 | 384 | 0.943164 | 0.940382 | 0.941771 | ``` --- layout: model title: Translate English to Creoles and pidgins, French‑based Pipeline author: John Snow Labs name: translate_en_cpf date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, cpf, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. 
Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `en`
- target languages: `cpf`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_cpf_xx_2.7.0_2.4_1609691980246.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_cpf_xx_2.7.0_2.4_1609691980246.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_cpf", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_cpf", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.cpf').predict(text, output_level='sentence') translate_df ```
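Since translation cost grows with sequence length, long inputs are best split into sentence batches before calling `annotate`; a naive plain-Python sketch (the period-based splitting rule and the character budget are simplifications):

```python
def batch_sentences(text, max_chars=200):
    """Split text on sentence-ending periods and pack sentences into
    batches of at most max_chars characters each."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    batches, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            batches.append(current)
            current = sentence
        else:
            current = (current + " " + sentence).strip()
    if current:
        batches.append(current)
    return batches

text = "First sentence. Second sentence. Third one."
print(batch_sentences(text, max_chars=30))
```

Each batch can then be passed to `pipeline.annotate` separately.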
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_en_cpf|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: English BertForQuestionAnswering model (from ofirzaf)
author: John Snow Labs
name: bert_qa_ofirzaf_bert_large_uncased_squad
date: 2022-06-06
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-large-uncased-squad` is an English model originally trained by `ofirzaf`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_ofirzaf_bert_large_uncased_squad_en_4.0.0_3.0_1654536654168.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_ofirzaf_bert_large_uncased_squad_en_4.0.0_3.0_1654536654168.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_ofirzaf_bert_large_uncased_squad","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```

```scala
val document = new MultiDocumentAssembler()
    .setInputCols("question", "context")
    .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
    .pretrained("bert_qa_ofirzaf_bert_large_uncased_squad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
    ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
    ("What's my name?", "My name is Clara and I live in Berkeley."))
    .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.bert.large_uncased.by_ofirzaf").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
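Extractive QA models of this kind score every token as a possible answer start and end; a toy illustration of how a span is selected from such scores (the numbers below are invented, not real model logits, and this is not Spark NLP's internal code):

```python
def best_span(tokens, start_scores, end_scores, max_len=8):
    """Pick the span with the highest combined start+end score,
    with end >= start and a bounded span length."""
    best, answer = float("-inf"), ""
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(tokens))):
            if s + end_scores[j] > best:
                best = s + end_scores[j]
                answer = " ".join(tokens[i:j + 1])
    return answer

tokens = ["My", "name", "is", "Clara"]
start = [0.1, 0.2, 0.3, 2.0]
end = [0.1, 0.1, 0.2, 1.9]
print(best_span(tokens, start, end))  # Clara
```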
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_ofirzaf_bert_large_uncased_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/ofirzaf/bert-large-uncased-squad --- layout: model title: Abkhazian asr_xls_r_ab_test_by_pablouribe TFWav2Vec2ForCTC from pablouribe author: John Snow Labs name: asr_xls_r_ab_test_by_pablouribe date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_ab_test_by_pablouribe` is an Abkhazian model originally trained by pablouribe. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_xls_r_ab_test_by_pablouribe_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xls_r_ab_test_by_pablouribe_ab_4.2.0_3.0_1664021635884.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xls_r_ab_test_by_pablouribe_ab_4.2.0_3.0_1664021635884.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_xls_r_ab_test_by_pablouribe", "ab")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_xls_r_ab_test_by_pablouribe", "ab") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
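The snippets above assume an `audioDf` that already exists. Wav2Vec2ForCTC expects the `audio_content` column to hold the raw waveform as an array of floats (typically 16 kHz mono, normalized to [-1, 1]); building the DataFrame is then usually something like `spark.createDataFrame([(floats,)], ["audio_content"])`, which is an assumption about your setup, not part of this card. A minimal standard-library sketch of turning 16-bit PCM bytes into such floats, with synthetic audio so the example is self-contained:

```python
import array
import math

# Build one second of synthetic 16-bit PCM (a 440 Hz tone) so the example
# runs without an audio file; in practice these bytes come from a WAV's frames.
RATE = 16000
pcm = array.array("h", (int(32767 * 0.5 * math.sin(2 * math.pi * 440 * t / RATE))
                        for t in range(RATE)))
raw_bytes = pcm.tobytes()

# Decode the interleaved 16-bit samples back into floats normalized to
# [-1, 1], the representation expected in the audio_content column.
samples = array.array("h")
samples.frombytes(raw_bytes)
floats = [s / 32768.0 for s in samples]

print(len(floats), min(floats) >= -1.0, max(floats) <= 1.0)
```

If your source audio is not 16 kHz mono, resample it first; feeding a mismatched sample rate degrades transcription quality.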
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_xls_r_ab_test_by_pablouribe| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ab| |Size:|445.7 KB| --- layout: model title: English asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr4 TFWav2Vec2ForCTC from lilitket author: John Snow Labs name: pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr4 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr4` is an English model originally trained by lilitket. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr4_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr4_en_4.2.0_3.0_1664119982498.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr4_en_4.2.0_3.0_1664119982498.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr4', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr4", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xls_r_300m_hyAM_batch4_lr4| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal Costs Clause Binary Classifier author: John Snow Labs name: legclf_costs_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `costs` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
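The paragraph-splitting step recommended above can be as simple as breaking the raw text on blank lines before feeding each chunk to the classifier. A minimal sketch (the regex and the sample contract text are illustrative, not the workshop's exact implementation):

```python
import re

# Hypothetical contract excerpt with paragraphs separated by blank lines.
contract = """1. COSTS. Each party shall bear its own costs and expenses.

2. NOTICES. All notices shall be in writing.

3. GOVERNING LAW. This Agreement is governed by the laws of Delaware."""

# Split on one or more blank lines and drop empty chunks; each resulting
# paragraph then becomes one row of the clause_text column fed to the model.
paragraphs = [p.strip() for p in re.split(r"\n\s*\n", contract) if p.strip()]
for p in paragraphs:
    print(p.split(".")[0])
```

Each paragraph stays well under the 512-token embedding limit, and the classifier returns one True/False label per row.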
## Predicted Entities `other`, `costs` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_costs_clause_en_1.0.0_3.2_1660122307454.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_costs_clause_en_1.0.0_3.2_1660122307454.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_costs_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[costs]| |[other]| |[other]| |[costs]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_costs_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support costs 0.89 0.85 0.87 48 other 0.95 0.96 0.95 128 accuracy - - 0.93 176 macro-avg 0.92 0.91 0.91 176 weighted-avg 0.93 0.93 0.93 176 ``` --- layout: model title: English RobertaForQuestionAnswering (from armageddon) author: John Snow Labs name: roberta_qa_roberta_large_squad2_covid_qa_deepset date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-squad2-covid-qa-deepset` is an English model originally trained by `armageddon`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_squad2_covid_qa_deepset_en_4.0.0_3.0_1655737643329.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_squad2_covid_qa_deepset_en_4.0.0_3.0_1655737643329.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_squad2_covid_qa_deepset","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_large_squad2_covid_qa_deepset","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_covid.roberta.large").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_large_squad2_covid_qa_deepset| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/armageddon/roberta-large-squad2-covid-qa-deepset --- layout: model title: Legal Remedies cumulative Clause Binary Classifier author: John Snow Labs name: legclf_remedies_cumulative_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `remedies-cumulative` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents and want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `remedies-cumulative` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_remedies_cumulative_clause_en_1.0.0_3.2_1660122931314.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_remedies_cumulative_clause_en_1.0.0_3.2_1660122931314.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_remedies_cumulative_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +---------------------+ | result| +---------------------+ |[remedies-cumulative]| |[other]| |[other]| |[remedies-cumulative]| +---------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_remedies_cumulative_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 0.96 0.98 114 remedies-cumulative 0.91 1.00 0.95 42 accuracy - - 0.97 156 macro-avg 0.96 0.98 0.97 156 weighted-avg 0.98 0.97 0.97 156 ``` --- layout: model title: English asr_wav2vec2_murad_with_some_data TFWav2Vec2ForCTC from MBMMurad author: John Snow Labs name: asr_wav2vec2_murad_with_some_data date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_murad_with_some_data` is an English model originally trained by MBMMurad. NOTE: This model only works on a CPU. If you need to use this model on a GPU device, please use asr_wav2vec2_murad_with_some_data_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_murad_with_some_data_en_4.2.0_3.0_1664110531411.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_murad_with_some_data_en_4.2.0_3.0_1664110531411.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_murad_with_some_data", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_murad_with_some_data", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_murad_with_some_data| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Finnish asr_wav2vec2_large_xlsr_53_finnish_by_vasilis TFWav2Vec2ForCTC from vasilis author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_vasilis date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_finnish_by_vasilis` is a Finnish model originally trained by vasilis. NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_vasilis_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_vasilis_fi_4.2.0_3.0_1664024101617.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_vasilis_fi_4.2.0_3.0_1664024101617.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_vasilis', lang = 'fi') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_vasilis", lang = "fi") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_53_finnish_by_vasilis| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|fi| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Indonesian T5ForConditionalGeneration Base Cased model (from cahya) author: John Snow Labs name: t5_cahya_base_indonesian_summarization_cased date: 2023-01-30 tags: [id, open_source, t5, tensorflow] task: Text Generation language: id edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-indonesian-summarization-cased` is an Indonesian model originally trained by `cahya`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_cahya_base_indonesian_summarization_cased_id_4.3.0_3.0_1675109672981.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_cahya_base_indonesian_summarization_cased_id_4.3.0_3.0_1675109672981.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_cahya_base_indonesian_summarization_cased","id") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_cahya_base_indonesian_summarization_cased","id") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_cahya_base_indonesian_summarization_cased| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|id| |Size:|926.2 MB| ## References - https://huggingface.co/cahya/t5-base-indonesian-summarization-cased --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from raisinbl) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_squad_2_512_1 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad_2_512_1` is an English model originally trained by `raisinbl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad_2_512_1_en_4.3.0_3.0_1672773730195.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad_2_512_1_en_4.3.0_3.0_1672773730195.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_2_512_1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_2_512_1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_squad_2_512_1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/raisinbl/distilbert-base-uncased-finetuned-squad_2_512_1 --- layout: model title: Smaller BERT Sentence Embeddings (L-10_H-512_A-8) author: John Snow Labs name: sent_small_bert_L10_512 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_small_bert_L10_512_en_2.6.0_2.4_1598350765497.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_small_bert_L10_512_en_2.6.0_2.4_1598350765497.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L10_512", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer'], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L10_512", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.small_bert_L10_512').predict(text, output_level='sentence') embeddings_df ```
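Sentence embeddings like these are typically compared with cosine similarity, e.g. for semantic search or clustering. A minimal sketch on toy low-dimensional vectors (illustrative values, not actual 512-dimensional model output):

```python
import math

# Toy 4-dimensional "sentence embeddings"; the real model produces
# 512-dimensional vectors per sentence.
emb_a = [0.45, -0.16, 0.98, 0.33]
emb_b = [0.40, -0.10, 0.90, 0.30]

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

print(round(cosine(emb_a, emb_b), 3))
```

Values near 1 indicate semantically similar sentences; orthogonal (unrelated) sentence vectors score near 0.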
{:.h2_title} ## Results ```bash en_embed_sentence_small_bert_L10_512_embeddings sentence [-0.45785054564476013, 0.16585223376750946, 1.... I hate cancer [0.07550251483917236, 0.009794552810490131, 0.... Antibiotics aren't painkiller ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_small_bert_L10_512| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|512| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-512_A-8/1 --- layout: model title: Slovak Lemmatizer author: John Snow Labs name: lemma date: 2020-05-05 11:16:00 +0800 task: Lemmatization language: sk edition: Spark NLP 2.5.0 spark_version: 2.4 tags: [lemmatizer, sk] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. 
{:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_sk_2.5.0_2.4_1588666524270.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_sk_2.5.0_2.4_1588666524270.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "sk") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Okrem toho, že je kráľom severu, je John Snow anglickým lekárom a lídrom vo vývoji anestézie a lekárskej hygieny.") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "sk") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Okrem toho, že je kráľom severu, je John Snow anglickým lekárom a lídrom vo vývoji anestézie a lekárskej hygieny.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Okrem toho, že je kráľom severu, je John Snow anglickým lekárom a lídrom vo vývoji anestézie a lekárskej hygieny."""] lemma_df = nlu.load('sk.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
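At its core, a lemmatizer of this kind maps inflected surface forms to dictionary roots, with context used to disambiguate ambiguous forms. A toy sketch with a hand-made Slovak mapping (the entries are illustrative, not the model's actual dictionary):

```python
# Toy dictionary-based lemmatization: inflected Slovak forms -> roots.
# Illustrative entries only; the real annotator's dictionary is much larger
# and it also uses surrounding context to resolve ambiguous forms.
LEMMAS = {
    "kráľom": "kráľ",
    "severu": "sever",
    "lekárom": "lekár",
    "lídrom": "líder",
    "vývoji": "vývoj",
}

def lemmatize(tokens):
    # Fall back to the surface form when a word is not in the dictionary.
    return [LEMMAS.get(t.lower(), t) for t in tokens]

print(lemmatize(["kráľom", "severu", "a"]))  # -> ['kráľ', 'sever', 'a']
```

The pretrained model wraps this lookup in the annotator pipeline, consuming `token` annotations and emitting `lemma` annotations as shown above.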
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=4, result='Okrem', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=6, end=9, result='to', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=10, end=10, result=',', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=12, end=13, result='že', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=15, end=16, result='jesť', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.0+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|sk| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Dutch Named Entity Recognition (from ml6team) author: John Snow Labs name: xlmroberta_ner_xlm_roberta_base_nl_emoji_ner date: 2022-05-17 tags: [xlm_roberta, ner, token_classification, nl, open_source] task: Named Entity Recognition language: nl edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-nl-emoji-ner` is a Dutch model originally trained by `ml6team`.
## Predicted Entities `😨`, `🚗`, `☕`, `😥`, `💰`, `😠`, `🍾`, `😍`, `😄`, `🤯` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_nl_emoji_ner_nl_3.4.2_3.0_1652809991427.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_nl_emoji_ner_nl_3.4.2_3.0_1652809991427.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_nl_emoji_ner","nl") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ik hou van Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_nl_emoji_ner","nl") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ik hou van Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_xlm_roberta_base_nl_emoji_ner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|nl| |Size:|851.4 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ml6team/xlm-roberta-base-nl-emoji-ner --- layout: model title: Classifier for Adverse Drug Events using Clinical Bert author: John Snow Labs name: classifierdl_ade_clinicalbert date: 2021-01-21 task: Text Classification language: en nav_key: models edition: Healthcare NLP 2.7.1 spark_version: 2.4 tags: [en, licensed, classifier, clinical] supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Classifies a text/sentence into two categories: `True` : the sentence mentions a possible adverse drug event (ADE). `False` : the sentence doesn't contain any information about an ADE. ## Predicted Entities `True`, `False` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/16.Adverse_Drug_Event_ADE_NER_and_Classifier.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifierdl_ade_clinicalbert_en_2.7.1_2.4_1611244439637.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifierdl_ade_clinicalbert_en_2.7.1_2.4_1611244439637.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document") tokenizer = Tokenizer().setInputCols(['document']).setOutputCol('token') embeddings = BertEmbeddings.pretrained('biobert_clinical_base_cased')\ .setInputCols(["document", 'token'])\ .setOutputCol("word_embeddings") sentence_embeddings = SentenceEmbeddings() \ .setInputCols(["document", "word_embeddings"]) \ .setOutputCol("sentence_embeddings") \ .setPoolingStrategy("AVERAGE") classifier = ClassifierDLModel.pretrained('classifierdl_ade_clinicalbert', 'en', 'clinical/models')\ .setInputCols(['document', 'token', 'sentence_embeddings']).setOutputCol('class') nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings, sentence_embeddings, classifier]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate(["I feel a bit drowsy & have a little blurred vision after taking an insulin", "I feel great after taking tylenol"]) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.ade.clinicalbert").predict("""I feel a bit drowsy & have a little blurred vision after taking an insulin""") ```
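This card only shows the Python and NLU usage. A Scala equivalent can be sketched by mirroring the Python stages above one-to-one (the stage and column names below are carried over from the Python example, not taken from a published Scala snippet):

```scala
// Sketch of a Scala pipeline mirroring the Python example above.
// Assumes a running SparkSession with Spark NLP for Healthcare available.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("biobert_clinical_base_cased")
  .setInputCols(Array("document", "token"))
  .setOutputCol("word_embeddings")

val sentenceEmbeddings = new SentenceEmbeddings()
  .setInputCols(Array("document", "word_embeddings"))
  .setOutputCol("sentence_embeddings")
  .setPoolingStrategy("AVERAGE")

val classifier = ClassifierDLModel.pretrained("classifierdl_ade_clinicalbert", "en", "clinical/models")
  .setInputCols(Array("document", "token", "sentence_embeddings"))
  .setOutputCol("class")

val pipeline = new Pipeline().setStages(
  Array(documentAssembler, tokenizer, embeddings, sentenceEmbeddings, classifier))

val data = Seq("I feel a bit drowsy & have a little blurred vision after taking an insulin").toDF("text")
val result = pipeline.fit(data).transform(data)
```

As in the Python example, the classifier consumes the averaged sentence embeddings and emits `True`/`False` in the `class` column.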
## Results ```bash | | text | label | |--:|:---------------------------------------------------------------------------|:------| | 0 | I feel a bit drowsy & have a little blurred vision after taking an insulin | True | | 1 | I feel great after taking tylenol | False | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_ade_clinicalbert| |Compatibility:|Spark NLP 2.7.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Dependencies:|biobert_clinical_base_cased| ## Data Source Trained on a custom dataset comprising CADEC, DRUG-AE, and Twimed. ## Benchmarking ```bash precision recall f1-score support False 0.95 0.92 0.93 6923 True 0.64 0.78 0.70 1359 micro avg 0.89 0.89 0.89 8282 macro avg 0.80 0.85 0.82 8282 weighted avg 0.90 0.89 0.90 8282 ``` --- layout: model title: Fast Neural Machine Translation Model from Ukrainian to English author: John Snow Labs name: opus_mt_uk_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, uk, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects.
- source languages: `uk` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_uk_en_xx_2.7.0_2.4_1609170699118.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_uk_en_xx_2.7.0_2.4_1609170699118.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_uk_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_uk_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.uk.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_uk_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Pipeline to Detect Chemicals in Medical Text author: John Snow Labs name: bert_token_classifier_ner_bc4chemd_chemicals_pipeline date: 2023-03-20 tags: [en, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_bc4chemd_chemicals](https://nlp.johnsnowlabs.com/2022/07/25/bert_token_classifier_ner_bc4chemd_chemicals_en_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc4chemd_chemicals_pipeline_en_4.3.0_3.2_1679301435930.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_bc4chemd_chemicals_pipeline_en_4.3.0_3.2_1679301435930.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_bc4chemd_chemicals_pipeline", "en", "clinical/models") text = '''The main isolated compounds were triterpenes (alpha - amyrin, beta - amyrin, lupeol, betulin, betulinic acid, uvaol, erythrodiol and oleanolic acid) and phenolic acid derivatives from 4 - hydroxybenzoic acid (gallic and protocatechuic acids and isocorilagin).''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_bc4chemd_chemicals_pipeline", "en", "clinical/models") val text = "The main isolated compounds were triterpenes (alpha - amyrin, beta - amyrin, lupeol, betulin, betulinic acid, uvaol, erythrodiol and oleanolic acid) and phenolic acid derivatives from 4 - hydroxybenzoic acid (gallic and protocatechuic acids and isocorilagin)." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:--------------------------------|--------:|------:|:------------|-------------:| | 0 | triterpenes | 33 | 43 | CHEM | 0.99999 | | 1 | alpha - amyrin | 46 | 59 | CHEM | 0.999939 | | 2 | beta - amyrin | 62 | 74 | CHEM | 0.999679 | | 3 | lupeol | 77 | 82 | CHEM | 0.999968 | | 4 | betulin | 85 | 91 | CHEM | 0.999975 | | 5 | betulinic acid | 94 | 107 | CHEM | 0.999984 | | 6 | uvaol | 110 | 114 | CHEM | 0.99998 | | 7 | erythrodiol | 117 | 127 | CHEM | 0.999987 | | 8 | oleanolic acid | 133 | 146 | CHEM | 0.999984 | | 9 | phenolic acid | 153 | 165 | CHEM | 0.999985 | | 10 | 4 - hydroxybenzoic acid | 184 | 206 | CHEM | 0.999973 | | 11 | gallic and protocatechuic acids | 209 | 239 | CHEM | 0.999984 | | 12 | isocorilagin | 245 | 256 | CHEM | 0.999985 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_bc4chemd_chemicals_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Word2Vec Embeddings in Asturian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-14 tags: [cc, embeddings, fastText, word2vec, ast, open_source] task: Embeddings language: ast edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ast_3.4.1_3.0_1647284350724.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_ast_3.4.1_3.0_1647284350724.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ast") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ast") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ast.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|ast| |Size:|391.2 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Legal Term Loan Credit Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_term_loan_credit_agreement_bert date: 2023-01-26 tags: [en, legal, classification, loan, credit, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_term_loan_credit_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify if the document belongs to the class `term-loan-credit-agreement` or not (Binary Classification). Unlike the Longformer model, this model is lighter in terms of inference time. ## Predicted Entities `term-loan-credit-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_term_loan_credit_agreement_bert_en_1.0.0_3.0_1674735257018.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_term_loan_credit_agreement_bert_en_1.0.0_3.0_1674735257018.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_term_loan_credit_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
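Only a Python example is provided for this classifier. For Scala users, a minimal sketch is shown below, assuming the same stages as the Python pipeline; `ClassifierDLModel` is used here as the closest core-library analogue of the Python `legal.ClassifierDLModel` wrapper, so treat the class names as assumptions:

```scala
// Sketch of a Scala equivalent of the Python pipeline above (assumed class names).
// Requires a SparkSession with Legal NLP available and a valid license.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

val docClassifier = ClassifierDLModel.pretrained("legclf_term_loan_credit_agreement_bert", "en", "legal/models")
  .setInputCols(Array("sentence_embeddings"))
  .setOutputCol("category")

val pipeline = new Pipeline().setStages(
  Array(documentAssembler, embeddings, docClassifier))

val data = Seq("YOUR TEXT HERE").toDF("text")
val result = pipeline.fit(data).transform(data)
```

The `category` column will hold either `term-loan-credit-agreement` or `other`, matching the Python example.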
## Results ```bash +-------+ |result| +-------+ |[term-loan-credit-agreement]| |[other]| |[other]| |[term-loan-credit-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_term_loan_credit_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.3 MB| ## References Legal documents, scraped from the Internet and classified in-house, plus SEC documents ## Benchmarking ```bash label precision recall f1-score support other 0.98 0.98 0.98 116 term-loan-credit-agreement 0.96 0.96 0.96 51 accuracy - - 0.98 167 macro-avg 0.97 0.97 0.97 167 weighted-avg 0.98 0.98 0.98 167 ``` --- layout: model title: English BertForQuestionAnswering model (from madlag) author: John Snow Labs name: bert_qa_bert_base_uncased_squadv1_x1.84_f88.7_d36_hybrid_filled_v1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-squadv1-x1.84-f88.7-d36-hybrid-filled-v1` is an English model originally trained by `madlag`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squadv1_x1.84_f88.7_d36_hybrid_filled_v1_en_4.0.0_3.0_1654181557143.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_uncased_squadv1_x1.84_f88.7_d36_hybrid_filled_v1_en_4.0.0_3.0_1654181557143.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_uncased_squadv1_x1.84_f88.7_d36_hybrid_filled_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_uncased_squadv1_x1.84_f88.7_d36_hybrid_filled_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased_1_block_sparse_0.13_v1.by_madlag").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_uncased_squadv1_x1.84_f88.7_d36_hybrid_filled_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|206.9 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/madlag/bert-base-uncased-squadv1-x1.84-f88.7-d36-hybrid-filled-v1 - https://rajpurkar.github.io/SQuAD-explorer - https://www.aclweb.org/anthology/N19-1423.pdf --- layout: model title: French CamemBert Embeddings (from cylee) author: John Snow Labs name: camembert_embeddings_cylee_generic_model date: 2022-05-31 tags: [fr, open_source, camembert, embeddings] task: Embeddings language: fr edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: CamemBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained CamemBert Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `dummy-model` is a French model originally trained by `cylee`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/camembert_embeddings_cylee_generic_model_fr_3.4.4_3.0_1653987722726.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/camembert_embeddings_cylee_generic_model_fr_3.4.4_3.0_1653987722726.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_cylee_generic_model","fr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["J'adore Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = CamemBertEmbeddings.pretrained("camembert_embeddings_cylee_generic_model","fr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("J'adore Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|camembert_embeddings_cylee_generic_model| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|fr| |Size:|266.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/cylee/dummy-model --- layout: model title: Relation extraction between body parts and procedures author: John Snow Labs name: redl_bodypart_procedure_test_biobert date: 2021-02-04 task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 2.7.3 spark_version: 2.4 tags: [licensed, clinical, en, relation_extraction] supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Relation extraction between body part entities like `Internal_organ_or_component` and `External_body_part_or_region`, and procedure/test entities. `1` : the body part and the test/procedure are related to each other. `0` : the body part and the test/procedure are not related to each other. ## Predicted Entities `0`, `1` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.1.Clinical_Relation_Extraction_BodyParts_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_procedure_test_biobert_en_2.7.3_2.4_1612447034744.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_procedure_test_biobert_en_2.7.3_2.4_1612447034744.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverter() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") # Set a filter on pairs of named entities which will be treated as relation candidates re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setMaxSyntacticDistance(10)\ .setOutputCol("re_ner_chunks")\ .setRelationPairs(["external_body_part_or_region-test"]) # The dataset this model was trained on is annotated at the sentence level. # The model can also be trained on document-level relations; in that case, use "document" instead of "sentence" as input when predicting.
re_model = RelationExtractionDLModel.pretrained('redl_bodypart_procedure_test_biobert', 'en', "clinical/models") \ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model]) text = "TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound." data = spark.createDataFrame([[text]]).toDF("text") p_model = pipeline.fit(data) result = p_model.transform(data) ``` ```scala ... val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks")
.setRelationPairs(Array("external_body_part_or_region-test")) // The dataset this model was trained on is annotated at the sentence level. // The model can also be trained on document-level relations; in that case, use "document" instead of "sentence" as input when predicting. val re_model = RelationExtractionDLModel.pretrained("redl_bodypart_procedure_test_biobert", "en", "clinical/models") .setPredictionThreshold(0.5) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model)) val data = Seq("TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.bodypart.procedure").predict("""TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.""")
```
## Results ```bash | | relation | entity1 | chunk1 | entity2 | chunk2 | confidence | |---:|-----------:|:-----------------------------|:---------|:----------|:--------------------|-------------:| | 0 | 1 | External_body_part_or_region | chest | Test | portable ultrasound | 0.99953 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_bodypart_procedure_test_biobert| |Compatibility:|Healthcare NLP 2.7.3+| |License:|Licensed| |Edition:|Official| |Language:|en| ## Data Source Trained on a custom internal dataset. ## Benchmarking ```bash Relation Recall Precision F1 Support 0 0.338 0.472 0.394 325 1 0.904 0.843 0.872 1275 Avg. 0.621 0.657 0.633 ``` --- layout: model title: English asr_wav2vec2_base_timit_demo_colab0_by_hassnain TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab0_by_hassnain date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab0_by_hassnain` is an English model originally trained by hassnain.
NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab0_by_hassnain_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab0_by_hassnain_en_4.2.0_3.0_1664039054193.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab0_by_hassnain_en_4.2.0_3.0_1664039054193.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab0_by_hassnain', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab0_by_hassnain", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab0_by_hassnain| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Multilingual XLMRobertaForTokenClassification Base Cased model (from lijingxin) author: John Snow Labs name: xlmroberta_ner_lijingxin_base_finetuned_panx_all date: 2022-08-13 tags: [xx, open_source, xlm_roberta, ner] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-all` is a Multilingual model originally trained by `lijingxin`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_lijingxin_base_finetuned_panx_all_xx_4.1.0_3.0_1660428690762.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_lijingxin_base_finetuned_panx_all_xx_4.1.0_3.0_1660428690762.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_lijingxin_base_finetuned_panx_all","xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_lijingxin_base_finetuned_panx_all","xx") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_lijingxin_base_finetuned_panx_all| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|861.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/lijingxin/xlm-roberta-base-finetuned-panx-all --- layout: model title: English RobertaForQuestionAnswering (from Firat) author: John Snow Labs name: roberta_qa_Firat_roberta_base_finetuned_squad date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-finetuned-squad` is an English model originally trained by `Firat`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_Firat_roberta_base_finetuned_squad_en_4.0.0_3.0_1655734262536.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_Firat_roberta_base_finetuned_squad_en_4.0.0_3.0_1655734262536.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_Firat_roberta_base_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_Firat_roberta_base_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base.by_Firat").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_Firat_roberta_base_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Firat/roberta-base-finetuned-squad --- layout: model title: Turkish ElectraForQuestionAnswering model (from enelpi) Discriminator Version-1 author: John Snow Labs name: electra_qa_base_discriminator_finetuned_squadv1 date: 2022-06-22 tags: [tr, open_source, electra, question_answering] task: Question Answering language: tr edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-discriminator-finetuned_squadv1_tr` is a Turkish model originally trained by `enelpi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_base_discriminator_finetuned_squadv1_tr_4.0.0_3.0_1655920559958.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_base_discriminator_finetuned_squadv1_tr_4.0.0_3.0_1655920559958.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_discriminator_finetuned_squadv1","tr") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_discriminator_finetuned_squadv1","tr") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("tr.answer_question.squad.electra.base").predict("""Benim adım ne?|||Benim adım Clara ve Berkeley'de yaşıyorum.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_base_discriminator_finetuned_squadv1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|tr| |Size:|412.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/enelpi/electra-base-discriminator-finetuned_squadv1_tr --- layout: model title: Named Entity Recognition (NER) Model in Bengali (bengaliner_cc_300d) author: John Snow Labs name: bengaliner_cc_300d date: 2021-02-10 task: Named Entity Recognition language: bn edition: Spark NLP 2.7.3 spark_version: 2.4 tags: [open_source, bn, ner] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect 4 different types of entities in Bengali text. ## Predicted Entities `PER`, `ORG`, `LOC`, `TIME` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_BN/){:.button.button-orange} [Open in Colab](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bengaliner_cc_300d_bn_2.7.3_2.4_1612957259511.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bengaliner_cc_300d_bn_2.7.3_2.4_1612957259511.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("bengali_cc_300d", "bn") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner = NerDLModel.pretrained("bengaliner_cc_300d", "bn") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings, ner, ner_converter]) example = spark.createDataFrame([['১৯৪৮ সালে ইয়াজউদ্দিন আহম্মেদ মুন্সিগঞ্জ উচ্চ বিদ্যালয় থেকে মেট্রিক পাশ করেন এবং ১৯৫০ সালে মুন্সিগঞ্জ হরগঙ্গা কলেজ থেকে ইন্টারমেডিয়েট পাশ করেন']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("bengali_cc_300d", "bn") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("bengaliner_cc_300d", "bn") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner, ner_converter)) val data = Seq("১৯৪৮ সালে ইয়াজউদ্দিন আহম্মেদ মুন্সিগঞ্জ উচ্চ বিদ্যালয় থেকে মেট্রিক পাশ করেন এবং ১৯৫০ সালে মুন্সিগঞ্জ হরগঙ্গা কলেজ থেকে ইন্টারমেডিয়েট পাশ করেন").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("bn.ner").predict("""১৯৪৮ সালে ইয়াজউদ্দিন আহম্মেদ মুন্সিগঞ্জ উচ্চ বিদ্যালয় থেকে মেট্রিক পাশ করেন এবং ১৯৫০ সালে মুন্সিগঞ্জ হরগঙ্গা কলেজ থেকে ইন্টারমেডিয়েট পাশ করেন""") ```
## Results ```bash +----------------------+-----------+ | ner_chunk | label | +----------------------+-----------+ | ১৯৪৮ সালে | TIME | | ইয়াজউদ্দিন আহম্মেদ | PER | | মুন্সিগঞ্জ উচ্চ বিদ্যালয় | ORG | | ১৯৫০ সালে | TIME | | মুন্সিগঞ্জ হরগঙ্গা কলেজ | ORG | +----------------------+-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bengaliner_cc_300d| |Type:|ner| |Compatibility:|Spark NLP 2.7.3+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token, word_embeddings]| |Output Labels:|[ner]| |Language:|bn| ## Data Source This model was trained on data obtained from https://ieeexplore.ieee.org/document/8944804 ## Benchmarking ```bash label tp fp fn prec rec f1 I-TIME 167 37 25 0.8186275 0.8697917 0.8434344 B-LOC 678 114 195 0.8560606 0.7766323 0.81441444 I-ORG 287 104 143 0.73401535 0.66744184 0.6991474 B-TIME 414 54 123 0.88461536 0.7709497 0.8238806 I-LOC 98 50 76 0.6621622 0.5632184 0.6086956 I-PER 805 38 55 0.9549229 0.93604654 0.94539046 B-ORG 446 108 225 0.8050541 0.6646796 0.72816324 B-PER 764 48 183 0.9408867 0.80675817 0.86867535 tp: 3659 fp: 553 fn: 1025 labels: 8 Macro-average prec: 0.8320431, rec: 0.75693977, f1: 0.79271656 Micro-average prec: 0.86870843, rec: 0.78116995, f1: 0.8226169 ``` --- layout: model title: Legal Usa patriot act Clause Binary Classifier author: John Snow Labs name: legclf_usa_patriot_act_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `usa-patriot-act` clause type. To use this model, make sure you provide enough context as an input.
Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clause Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `usa-patriot-act` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_usa_patriot_act_clause_en_1.0.0_3.2_1660123155724.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_usa_patriot_act_clause_en_1.0.0_3.2_1660123155724.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_usa_patriot_act_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
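The paragraph-level splitting recommended in the description can be sketched in plain Python. This is a toy stand-in for the Legal NLP splitting utilities, not the tutorial code itself: break the document on blank lines and feed each piece to the classifier as a separate `clause_text` row.

```python
import re

def split_paragraphs(text):
    """Split a document into candidate clauses on blank lines (multiline splitting)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

contract = (
    "Section 1. Compliance.\n\n"
    "The parties shall comply with the USA PATRIOT Act.\n\n"
)
paragraphs = split_paragraphs(contract)
# Each entry in `paragraphs` would become one row of the `clause_text` column.
```

Keeping each piece under the 512-token embedding limit is the point of this step; very long paragraphs may still need a further split by headers/subheaders.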
## Results ```bash +-------+ | result| +-------+ |[usa-patriot-act]| |[other]| |[other]| |[usa-patriot-act]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_usa_patriot_act_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.99 1.00 0.99 97 usa-patriot-act 1.00 0.96 0.98 25 accuracy - - 0.99 122 macro-avg 0.99 0.98 0.99 122 weighted-avg 0.99 0.99 0.99 122 ``` --- layout: model title: English image_classifier_vit_rust_image_classification_11 ViTForImageClassification from SummerChiam author: John Snow Labs name: image_classifier_vit_rust_image_classification_11 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained ViT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_rust_image_classification_11` is an English model originally trained by SummerChiam. ## Predicted Entities `nonrust0`, `rust0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_11_en_4.1.0_3.0_1660171448218.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_rust_image_classification_11_en_4.1.0_3.0_1660171448218.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_rust_image_classification_11", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_rust_image_classification_11", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_rust_image_classification_11| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Legal NER (Headers / Subheaders) author: John Snow Labs name: legner_headers date: 2022-08-12 tags: [en, legal, ner, agreements, splitting, licensed] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Legal NER model, aimed at carrying out Section Splitting by using the Header and Subheader entities detected in the document. Other models can be found to detect other parts of the document, such as Signers, "Will-do" clauses, etc. ## Predicted Entities `HEADER`, `SUBHEADER` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/LEGALNER_HEADERS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_headers_en_1.0.0_3.2_1660298515978.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_headers_en_1.0.0_3.2_1660298515978.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = legal.NerModel.pretrained('legner_headers', 'en', 'legal/models')\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = [""" 2. Definitions. For purposes of this Agreement, the following terms have the meanings ascribed thereto in this Section 1. 2. Appointment as Reseller. 2.1 Appointment. The Company hereby [***]. Allscripts may also disclose Company's pricing information relating to its Merchant Processing Services and facilitate procurement of Merchant Processing Services on behalf of Sublicensed Customers, including, without limitation by references to such pricing information and Merchant Processing Services in Customer Agreements. 6 2.2 Customer Agreements."""] res = model.transform(spark.createDataFrame([text]).toDF("text")) ```
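As a rough illustration of what the `NerConverter` stage does with the token-level predictions (an approximate re-implementation for clarity, not Spark NLP source code): consecutive `B-X`/`I-X` tags are merged into one chunk of type `X`, which is what turns header/subheader tokens into usable section boundaries.

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O", or an I- tag with no matching open chunk: close any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["2", ".", "Definitions", ".", "2.1", "Appointment"]
tags = ["B-HEADER", "I-HEADER", "I-HEADER", "O", "B-SUBHEADER", "I-SUBHEADER"]
chunks = iob_to_chunks(tokens, tags)
# -> [("2 . Definitions", "HEADER"), ("2.1 Appointment", "SUBHEADER")]
```

Once chunks are recovered, splitting the document is just slicing the text between consecutive header offsets.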
## Results ```bash +-----------+-----------+ | token| ner_label| +-----------+-----------+ | 2| B-HEADER| | .| I-HEADER| |Definitions| I-HEADER| | .| O| | For| O| | purposes| O| | of| O| | this| O| | Agreement| O| | ,| O| | the| O| | following| O| | terms| O| | have| O| | the| O| | meanings| O| | ascribed| O| | thereto| O| | in| O| | this| O| | Section| O| | 1|B-SUBHEADER| | .|I-SUBHEADER| | 2|I-SUBHEADER| | .|I-SUBHEADER| |Appointment| I-HEADER| | as| I-HEADER| | Reseller| I-HEADER| | .| O| | 2.1|B-SUBHEADER| |Appointment|I-SUBHEADER| | .| O| | The| O| | Company| O| | hereby| O| | [***]| O| | .| O| | Allscripts| O| | may| O| | also| O| | disclose| O| | Company's| O| | pricing| O| |information| O| | relating| O| | to| O| | its| O| | Merchant| O| | Processing| O| | Services| O| | and| O| | facilitate| O| |procurement| O| | of| O| | Merchant| O| | Processing| O| | Services| O| | on| O| | behalf| O| | of| O| |Sublicensed| O| | Customers| O| | ,| O| | including| O| | ,| O| | without| O| | limitation| O| | by| O| | references| O| | to| O| | such| O| | pricing| O| |information| O| | and| O| | Merchant| O| | Processing| O| | Services| O| | in| O| | Customer| O| | Agreements| O| | .| O| | 6| O| | 2.2|B-SUBHEADER| | Customer|I-SUBHEADER| | Agreements|I-SUBHEADER| | .| O| +-----------+-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_headers| |Type:|legal| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.3 MB| ## References Manual annotations on CUAD dataset ## Benchmarking ```bash label tp fp fn prec rec f1 I-HEADER 1486 40 25 0.97378767 0.98345464 0.9785973 B-SUBHEADER 744 16 14 0.97894734 0.98153037 0.9802372 I-SUBHEADER 2382 53 34 0.9782341 0.98592716 0.98206556 B-HEADER 415 4 12 0.9904535 0.97189695 0.9810875 Macro-average 5027 113 85 0.9803556 0.9807023 0.9805289 Micro-average 5027 113 85 0.97801554 
0.98337245 0.98068666 ``` --- layout: model title: Stopwords Remover for Thai language (1006 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, th, open_source] task: Stop Words Removal language: th edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_th_3.4.1_3.0_1646673180836.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_th_3.4.1_3.0_1646673180836.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","th") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["คุณไม่ดีไปกว่าฉัน"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","th") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("คุณไม่ดีไปกว่าฉัน").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("th.stopwords").predict("""Put your text here.""") ```
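Conceptually, the `StopWordsCleaner` stage is just token filtering against a fixed list. A toy plain-Python equivalent (the three-word English list below is a made-up placeholder; the actual model ships 1006 Thai entries):

```python
STOPWORDS = {"the", "a", "is"}  # hypothetical toy list, not the model's Thai entries

def clean_tokens(tokens):
    """Drop tokens whose lowercase form appears in the stopword list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(clean_tokens(["The", "cat", "is", "a", "pet"]))  # -> ['cat', 'pet']
```

Note that the quality of the result depends entirely on the tokenizer: in the Thai example above, the whole sentence survives because it is tokenized as a single token that matches no stopword.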
## Results ```bash +-------------------+ |result | +-------------------+ |[คุณไม่ดีไปกว่าฉัน]| +-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|th| |Size:|6.0 KB| --- layout: model title: Pipeline to Detect Living Species (w2v_cc_300d) author: John Snow Labs name: ner_living_species_pipeline date: 2023-03-13 tags: [es, ner, clinical, licensed] task: Named Entity Recognition language: es edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_living_species](https://nlp.johnsnowlabs.com/2022/06/22/ner_living_species_es_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_pipeline_es_4.3.0_3.2_1678706115739.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_pipeline_es_4.3.0_3.2_1678706115739.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_living_species_pipeline", "es", "clinical/models") text = '''Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_living_species_pipeline", "es", "clinical/models") val text = "Lactante varón de dos años. Antecedentes familiares sin interés. Antecedentes personales: Embarazo, parto y periodo neonatal normal. En seguimiento por alergia a legumbres, diagnosticado con diez meses por reacción urticarial generalizada con lentejas y garbanzos, con dieta de exclusión a legumbres desde entonces. En ésta visita la madre describe episodios de eritema en zona maxilar derecha con afectación ocular ipsilateral que se resuelve en horas tras la administración de corticoides. Le ha ocurrido en 5-6 ocasiones, en relación con la ingesta de alimentos previamente tolerados. Exploración complementaria: Cacahuete, ac(ige)19.2 Ku.arb/l. 
Resultados: Ante la sospecha clínica de Síndrome de Frey, se tranquiliza a los padres, explicándoles la naturaleza del cuadro y se cita para revisión anual." val result = pipeline.fullAnnotate(text) ```
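A note on reading the pipeline's output: `begin` and `end` are character offsets into the input text, and `end` is inclusive in Spark NLP (so the 14-character chunk `Lactante varón` starting at 0 ends at 13). A small sketch of that convention, with a hypothetical helper name:

```python
def chunk_offsets(text, chunk):
    """Return Spark NLP-style (begin, end) character offsets; end is inclusive."""
    begin = text.find(chunk)
    if begin == -1:
        return None
    return begin, begin + len(chunk) - 1

text = "Lactante varón de dos años. Antecedentes familiares sin interés."
offsets = chunk_offsets(text, "Lactante varón")  # (0, 13)
```

The inclusive-end convention matters when slicing: `text[begin:end + 1]` recovers the chunk, not `text[begin:end]`.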
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:---------------|--------:|------:|:------------|-------------:| | 0 | Lactante varón | 0 | 13 | HUMAN | 0.9926 | | 1 | familiares | 41 | 50 | HUMAN | 0.9994 | | 2 | personales | 78 | 87 | HUMAN | 0.9987 | | 3 | neonatal | 116 | 123 | HUMAN | 0.9731 | | 4 | legumbres | 162 | 170 | SPECIES | 0.9978 | | 5 | lentejas | 243 | 250 | SPECIES | 0.9995 | | 6 | garbanzos | 254 | 262 | SPECIES | 0.9933 | | 7 | legumbres | 290 | 298 | SPECIES | 0.9991 | | 8 | madre | 334 | 338 | HUMAN | 0.9997 | | 9 | Cacahuete | 616 | 624 | SPECIES | 0.9998 | | 10 | padres | 728 | 733 | HUMAN | 0.9992 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|es| |Size:|1.3 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Fast Neural Machine Translation Model from English to Maltese author: John Snow Labs name: opus_mt_en_mt date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, mt, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `en` - target languages: `mt` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_mt_xx_2.7.0_2.4_1609169930393.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_mt_xx_2.7.0_2.4_1609169930393.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_mt", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_mt", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.mt').predict(text, output_level='sentence') opus_df ```
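Marian translates text sentence by sentence, which is why `SentenceDetectorDLModel` runs before `MarianTransformer` in the pipeline above. As a rough illustration of that pre-splitting step (a naive regex splitter for illustration only, not the actual SentenceDetectorDL model):

```python
import re

def naive_sentence_split(text):
    """Split on sentence-final punctuation followed by whitespace.
    A crude stand-in for SentenceDetectorDL, for illustration only."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sentences = naive_sentence_split("Hello there. How are you? Fine!")
# each resulting sentence would then be translated independently
```

The real detector is a trained model that also handles abbreviations and other cases this regex misses.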
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_mt| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Self Treatment Changes Classifier in Tweets (BioBERT) author: John Snow Labs name: bert_sequence_classifier_treatment_changes_sentiment_tweet date: 2022-08-04 tags: [en, clinical, licensed, public_health, classifier, sequence_classification, treatment_changes, sentiment, treatment] task: Text Classification language: en nav_key: models edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true annotator: MedicalBertForSequenceClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a [BioBERT-based](https://github.com/dmis-lab/biobert) classifier that classifies the sentiment of tweets in which patients report changes to their treatment. ## Predicted Entities `negative`, `positive` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/PUBLIC_HEALTH_CHANGE_DRUG_TREATMENT/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/PUBLIC_HEALTH_MB4SC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_treatment_changes_sentiment_tweet_en_4.0.2_3.0_1659637869883.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_sequence_classifier_treatment_changes_sentiment_tweet_en_4.0.2_3.0_1659637869883.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol('text') \ .setOutputCol('document') tokenizer = Tokenizer() \ .setInputCols(['document']) \ .setOutputCol('token') sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_treatment_changes_sentiment_tweet", "en", "clinical/models")\ .setInputCols(["document", "token"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, tokenizer, sequenceClassifier ]) data = spark.createDataFrame(["I love when they say things like this. I took that ambien instead of my thyroid pill.", "I am a 30 year old man who is not overweight but is still on the verge of needing a Lipitor prescription."], StringType()).toDF("text") result = pipeline.fit(data).transform(data) result.select("text", "class.result").show(truncate=False) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_treatment_changes_sentiment_tweet", "en", "clinical/models") .setInputCols(Array("document","token")) .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier)) val data = Seq("I love when they say things like this. I took that ambien instead of my thyroid pill.", "I am a 30 year old man who is not overweight but is still on the verge of needing a Lipitor prescription.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.bert_sequence.treatment_sentiment_tweets").predict("""I am a 30 year old man who is not overweight but is still on the verge of needing a Lipitor prescription.""") ```
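A sequence classifier like this one produces one raw score (logit) per label and returns the label with the highest softmax probability. A minimal sketch of that final step, using made-up logits rather than actual model outputs:

```python
import math

def classify(logits, labels=("negative", "positive")):
    """Softmax over raw scores, then return the argmax label and its probability."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]

label, confidence = classify([2.1, -0.7])  # hypothetical logits, not model output
# label == "negative", with a confidence around 0.94
```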
## Results ```bash +---------------------------------------------------------------------------------------------------------+----------+ |text |result | +---------------------------------------------------------------------------------------------------------+----------+ |I love when they say things like this. I took that ambien instead of my thyroid pill. |[positive]| |I am a 30 year old man who is not overweight but is still on the verge of needing a Lipitor prescription.|[negative]| +---------------------------------------------------------------------------------------------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_sequence_classifier_treatment_changes_sentiment_tweet| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|406.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## Benchmarking ```bash label precision recall f1-score support negative 0.9515 0.9751 0.9632 1368 positive 0.6304 0.4603 0.5321 126 accuracy - - 0.9317 1494 macro-avg 0.7910 0.7177 0.7476 1494 weighted-avg 0.9244 0.9317 0.9268 1494 ``` --- layout: model title: Detect Entities (BERT) author: John Snow Labs name: ner_dl_bert date: 2020-09-08 task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [ner, en, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description `ner_dl_bert` is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. It was trained on the CoNLL 2003 text corpus. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. 
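Closeness between embedding vectors is typically measured with cosine similarity. A toy sketch with made-up 3-dimensional vectors (real BERT embeddings have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# made-up vectors purely for illustration
king, queen, banana = [0.9, 0.8, 0.1], [0.85, 0.82, 0.15], [0.1, 0.2, 0.95]
# semantically similar words end up closer: cosine(king, queen) > cosine(king, banana)
```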
The `ner_dl_bert` model was trained with `bert_base_cased` word embeddings, so be sure to use the same embeddings in the pipeline. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_dl_bert_en_2.6.0_2.4_1599550979101.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ner_dl_bert_en_2.6.0_2.4_1599550979101.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## Predicted Entities `PER`, `LOC`, `ORG`, `MISC` ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained(name='bert_base_cased', lang='en') \ .setInputCols(['document', 'token']) \ .setOutputCol('embeddings') ner_model = NerDLModel.pretrained("ner_dl_bert", "en") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."]], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained(name="bert_base_cased", lang="en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("ner_dl_bert", "en") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. 
He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque, New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January 2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000. He gradually transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support the newly appointed CEO Satya Nadella."""] ner_df = nlu.load('en.ner.dl.bert').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
{:.h2_title} ## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |William Henry Gates III|PER | |American |MISC | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PER | |Seattle |LOC | |Washington |LOC | |Gates |PER | |Microsoft |ORG | |Paul Allen |PER | |Albuquerque |LOC | |New Mexico |LOC | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_dl_bert| |Type:|ner| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Case sensitive:|false| {:.h2_title} ## Data Source The model was trained on the [CoNLL 2003 Data Set](https://github.com/synalp/NER/tree/master/corpus/CoNLL-2003) --- layout: model title: Legal Romanian NER (RONEC dataset) author: John Snow Labs name: legner_ronec date: 2022-11-30 tags: [ro, ner, legal, ronec, licensed] task: Named Entity Recognition language: ro edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description `legner_ronec` is a Named Entity Recognition model trained on RONEC (ROmanian Named Entity Corpus). Unlike the original dataset, it has been trained with the following classes: - PERSON - proper nouns or pronouns if they refer to a person - LOC - location or geo-political entity - ORG - organization - LANGUAGE - language - NAT_REL_POL - national, religious or political organizations - DATETIME - a time and date in any format, including references to time (e.g. 'yesterday') - MONEY - a monetary value, numeric or otherwise - NUMERIC - a simple numeric value, represented as digits or words - ORDINAL - an ordinal value like 'first', 'third', etc. - WORK_OF_ART - a work of art like a named TV show, painting, etc. 
- EVENT - a named recognizable or periodic major event ## Predicted Entities `DATETIME`, `EVENT`, `LANGUAGE`, `LOC`, `MONEY`, `NAT_REL_POL`, `NUMERIC`, `ORDINAL`, `ORG`, `PERSON`, `WORK_OF_ART` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_ronec_ro_1.0.0_3.0_1669842840646.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_ronec_ro_1.0.0_3.0_1669842840646.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_base_cased", "ro")\ .setInputCols("sentence", "token")\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_ronec", "ro", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame([["""Guvernul de stânga italian, condus de premierul Romano Prodi, a devenit după numirea a încă trei secretari de stat, cel mai numeros Executiv din istoria Republicii italiene, având 102 membri."""]]).toDF("text") result = model.transform(data) ```
## Results ```bash +----------------------+-----------+ |ner_chunk |label | +----------------------+-----------+ |Guvernul |ORG | |italian |NAT_REL_POL| |premierul Romano Prodi|PERSON | |trei |NUMERIC | |secretari |PERSON | |Republicii italiene |LOC | |102 |NUMERIC | |membri |PERSON | +----------------------+-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_ronec| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ro| |Size:|16.2 MB| ## References Dataset is available [here](https://github.com/dumitrescustefan/ronec). ## Benchmarking ```bash label precision recall f1-score support DATETIME 0.90 0.90 0.90 1070 EVENT 0.53 0.68 0.59 116 LANGUAGE 0.98 0.95 0.97 44 LOC 0.91 0.90 0.91 1699 MONEY 0.97 0.97 0.97 130 NAT_REL_POL 0.92 0.94 0.93 510 NUMERIC 0.95 0.95 0.95 970 ORDINAL 0.88 0.93 0.90 183 ORG 0.81 0.83 0.82 779 PERSON 0.89 0.91 0.90 2635 WORK_OF_ART 0.73 0.57 0.64 140 micro-avg 0.89 0.90 0.89 8276 macro-avg 0.86 0.87 0.86 8276 weighted-avg 0.89 0.90 0.89 8276 ``` --- layout: model title: English RobertaForQuestionAnswering Cased model (from sunitha) author: John Snow Labs name: roberta_qa_cv_merge_ds date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `CV_Merge_DS` is an English model originally trained by `sunitha`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_cv_merge_ds_en_4.3.0_3.0_1674207962257.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_cv_merge_ds_en_4.3.0_3.0_1674207962257.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_cv_merge_ds","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_cv_merge_ds","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
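An extractive question-answering model does not generate free text; it predicts a start and an end position inside the context and returns that span as the answer. A minimal sketch of the span-extraction step, with hypothetical indices standing in for real model predictions:

```python
def extract_answer(context, start, end):
    """Return the answer span [start, end) from the context string."""
    return context[start:end]

context = "My name is Clara and I live in Berkeley."
answer = extract_answer(context, 11, 16)  # indices a real model would predict
# answer == "Clara"
```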
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_cv_merge_ds| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/sunitha/CV_Merge_DS --- layout: model title: Word2Vec Embeddings in Lithuanian (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, lt, open_source] task: Embeddings language: lt edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_lt_3.4.1_3.0_1647443135863.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_lt_3.4.1_3.0_1647443135863.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","lt") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Aš myliu kibirkštį nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","lt") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Aš myliu kibirkštį nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("lt.embed.w2v_cc_300d").predict("""Aš myliu kibirkštį nlp""") ```
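A lookup-style embeddings annotator is conceptually a token-to-vector dictionary that falls back to a zero vector for out-of-vocabulary tokens. A toy sketch with made-up 2-dimensional vectors (the real model produces 300-dimensional ones):

```python
# made-up vectors purely for illustration; not values from w2v_cc_300d
TOY_VECTORS = {"aš": [0.2, 0.7], "myliu": [0.9, 0.1]}
DIM = 2

def embed(tokens):
    """Map each token to its vector, or a zero vector if it is out of vocabulary."""
    return [TOY_VECTORS.get(t.lower(), [0.0] * DIM) for t in tokens]

vectors = embed(["Aš", "myliu", "nlp"])
# "nlp" is out of vocabulary here, so it maps to [0.0, 0.0]
```

Note the lowercasing step, which mirrors the model card's `Case sensitive: false` setting.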
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|lt| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: English BertForTokenClassification Cased model (from ghadeermobasher) author: John Snow Labs name: bert_ner_BC5CDR_Chem_Modified_SciBERT_512 date: 2022-07-06 tags: [bert, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BC5CDR-Chem-Modified-SciBERT-512` is an English model originally trained by `ghadeermobasher`. ## Predicted Entities `Chemical` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_BC5CDR_Chem_Modified_SciBERT_512_en_4.0.0_3.0_1657109466000.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_BC5CDR_Chem_Modified_SciBERT_512_en_4.0.0_3.0_1657109466000.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC5CDR_Chem_Modified_SciBERT_512","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_BC5CDR_Chem_Modified_SciBERT_512","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDF("text") val result = pipeline.fit(data).transform(data) ```
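A token classifier emits one BIO-style tag per token; entity chunks are then recovered by merging each `B-` tag with the `I-` tags that follow it (the job a NerConverter stage performs downstream). A minimal sketch of that merging, using a made-up tag sequence:

```python
def bio_to_chunks(tokens, tags):
    """Merge B-X followed by I-X tags into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:  # "O" tag, or an I- tag that does not continue the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

chunks = bio_to_chunks(
    ["Aspirin", "and", "folic", "acid"],
    ["B-Chemical", "O", "B-Chemical", "I-Chemical"],  # illustrative tags
)
# chunks == [("Aspirin", "Chemical"), ("folic acid", "Chemical")]
```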
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_BC5CDR_Chem_Modified_SciBERT_512| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|410.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ghadeermobasher/BC5CDR-Chem-Modified-SciBERT-512 --- layout: model title: English BertForQuestionAnswering Cased model (from songhee) author: John Snow Labs name: bert_qa_i_manual_m date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `i-manual-mbert` is an English model originally trained by `songhee`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_i_manual_m_en_4.0.0_3.0_1657189488729.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_i_manual_m_en_4.0.0_3.0_1657189488729.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_i_manual_m","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_i_manual_m","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_i_manual_m| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|666.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/songhee/i-manual-mbert --- layout: model title: English BertForQuestionAnswering model (from ericRosello) author: John Snow Labs name: bert_qa_results date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `results` is an English model originally trained by `ericRosello`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_results_en_4.0.0_3.0_1654189231590.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_results_en_4.0.0_3.0_1654189231590.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_results","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_results","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.bert.by_ericRosello").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_results| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ericRosello/results --- layout: model title: English DistilBertForQuestionAnswering Uncased model (from SauravMaheshkar) author: John Snow Labs name: distilbert_qa_base_uncased_distilled_chaii date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-distilled-chaii` is an English model originally trained by `SauravMaheshkar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_distilled_chaii_en_4.0.0_3.0_1654723758525.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_distilled_chaii_en_4.0.0_3.0_1654723758525.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_distilled_chaii","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_distilled_chaii","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.chaii.distil_bert.base_uncased").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
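The NLU one-liners on this page pass the question and context as a single string joined by `|||`. A minimal sketch of that convention (the `make_qa_input` helper is illustrative only and not part of the NLU API):

```python
# Build the "question|||context" input string expected by nlu.load(...).predict()
# for question-answering models. Illustrative helper; not part of the NLU library.
def make_qa_input(question: str, context: str) -> str:
    return f"{question}|||{context}"

print(make_qa_input("What is my name?",
                    "My name is Clara and I live in Berkeley."))
# → What is my name?|||My name is Clara and I live in Berkeley.
```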
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_distilled_chaii| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/SauravMaheshkar/distilbert-base-uncased-distilled-chaii --- layout: model title: Extract Demographic Entities from Voice of the Patient Documents (embeddings_clinical_medium) author: John Snow Labs name: ner_vop_demographic_emb_clinical_medium date: 2023-06-06 tags: [licensed, clinical, ner, en, vop, demographic] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts demographic mentions from documents written in the patient's own words. ## Predicted Entities `Gender`, `Employment`, `RaceEthnicity`, `Age`, `Substance`, `RelationshipStatus`, `SubstanceQuantity` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/VOICE_OF_THE_PATIENTS/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_vop_demographic_emb_clinical_medium_en_4.4.3_3.0_1686075342272.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_vop_demographic_emb_clinical_medium_en_4.4.3_3.0_1686075342272.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models")\
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_vop_demographic_emb_clinical_medium", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter])

data = spark.createDataFrame([["My grandma, who's 85 and Black, just had a pacemaker implanted in the cardiology department. The doctors say it'll help regulate her heartbeat and prevent future complications."]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_vop_demographic_emb_clinical_medium", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))

val data = Seq("My grandma, who's 85 and Black, just had a pacemaker implanted in the cardiology department. The doctors say it'll help regulate her heartbeat and prevent future complications.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
## Results
```bash
| chunk   | ner_label     |
|:--------|:--------------|
| grandma | Gender        |
| 85      | Age           |
| Black   | RaceEthnicity |
| doctors | Employment    |
| her     | Gender        |
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_vop_demographic_emb_clinical_medium| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|3.8 MB| |Dependencies:|embeddings_clinical_medium| ## References In-house annotated health-related text in colloquial language. ## Sample text from the training dataset Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem. Finally I got an appointment with a homeopathy doctor finally he find that i was suffering from hyperthyroid and my TSH was 0.15 T3 and T4 is normal . Also i have b12 deficiency and vitamin D deficiency so I'm taking weekly supplement of vitamin D and 1000 mcg b12 daily. I'm taking homeopathy medicine for 40 days and took 2nd test after 30 days. My TSH is 0.5 now. I feel a little bit relief from weakness and depression but I'm facing with 2 new problem from last week that is breathtaking problem and very rapid heartrate. I just want to know if i should start allopathy medicine or homeopathy is okay? Bcs i heard that thyroid take time to start recover. So please let me know if both of medicines take same time. Because some of my friends advising me to start allopathy and never take a chance as i can develop some serious problems.Sorry for my poor english😐Thank you.
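For reference, the precision, recall, and F1 figures reported in the benchmarking tables on this page follow directly from the per-label tp/fp/fn counts. A minimal sketch of the arithmetic, using the Gender and micro-average rows of this model's benchmark:

```python
# Recompute precision/recall/F1 from the tp/fp/fn counts reported in the
# benchmarking tables. precision = tp/(tp+fp), recall = tp/(tp+fn),
# and F1 is the harmonic mean of the two.
def prf(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

print(prf(1299, 23, 18))    # Gender row    → (0.98, 0.99, 0.98)
# The micro average pools the counts over all labels before taking the ratios:
print(prf(3526, 195, 179))  # micro_avg row → (0.95, 0.95, 0.95)
```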
## Benchmarking
```bash
             label   tp  fp  fn total precision recall   f1
            Gender 1299  23  18  1317      0.98   0.99 0.98
        Employment 1175  45  68  1243      0.96   0.95 0.95
     RaceEthnicity   31   2   2    33      0.94   0.94 0.94
               Age  546  54  36   582      0.91   0.94 0.92
         Substance  389  48  32   421      0.89   0.92 0.91
RelationshipStatus   21   5   3    24      0.81   0.88 0.84
 SubstanceQuantity   65  18  20    85      0.78   0.76 0.77
         macro_avg 3526 195 179  3705      0.90   0.91 0.90
         micro_avg 3526 195 179  3705      0.95   0.95 0.95
```
--- layout: model title: Detect Social Determinants of Health Mentions author: John Snow Labs name: ner_sdoh_mentions date: 2022-12-18 tags: [en, licensed, ner, sdoh, mentions, clinical] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Named Entity Recognition model detects Social Determinants of Health mentions in clinical notes. It was trained with the MedicalNerApproach annotator, which trains generic neural-network-based NER models.

| Entity Name | Description | Sample Texts | Chunks + Labels |
|------------------|-------------|--------------|-----------------|
| sdoh_community | The patient's social and community networks, including family members, friends, and other social connections. | - He has a 27 yo son. <br> - The patient lives with mother. <br> - She is a widow. <br> - Married and has two children. | - (son), (sdoh_community) <br> - (mother), (sdoh_community) <br> - (widow), (sdoh_community) <br> - (Married, children), (sdoh_community, sdoh_community) |
| sdoh_economics | The patient's economic status and financial resources, including their occupation, income, and employment status. | - The patient worked as an accountant. <br> - He is a retired history professor. <br> - She is a lawyer. <br> - Worked in insurance, currently unemployed. | - (worked), (sdoh_economics) <br> - (retired), (sdoh_economics) <br> - (lawyer), (sdoh_economics) <br> - (worked, unemployed), (sdoh_economics, sdoh_economics) |
| sdoh_education | The patient's education-related passages such as schooling, college, or degrees attained. | - She graduated from high school. <br> - He is a fourth grade teacher in inner city. <br> - He completed some college. | - (graduated from high school), (sdoh_education) <br> - (teacher), (sdoh_education) <br> - (college), (sdoh_education) |
| sdoh_environment | The patient's living environment and access to housing. | - He lives at home. <br> - Her other daughter lives in the apartment below. <br> - The patient lives with her husband in a retirement community. | - (home), (sdoh_environment) <br> - (apartment), (sdoh_environment) <br> - (retirement community), (sdoh_environment) |
| behavior_tobacco | Any indication of the patient's current or past tobacco use and smoking history. | - She smoked one pack a day for forty years. <br> - The patient denies tobacco use. <br> - The patient smokes an occasional cigar. | - (smoked one pack), (behavior_tobacco) <br> - (tobacco), (behavior_tobacco) <br> - (smokes an occasional cigar), (behavior_tobacco) |
| behavior_alcohol | Indications of the patient's alcohol consumption. | - She drinks alcohol. <br> - The patient denies ethanol. <br> - He denies ETOH. | - (alcohol), (behavior_alcohol) <br> - (ethanol), (behavior_alcohol) <br> - (ETOH), (behavior_alcohol) |
| behavior_drug | Any indication of the patient's current or past drug use. | - She denies any intravenous drug abuse. <br> - No illicit drug use including IV per family. <br> - The patient is using recreational drugs. | - (intravenous drug), (behavior_drug) <br> - (illicit drug, IV), (behavior_drug, behavior_drug) <br> - (recreational drugs), (behavior_drug) |

## Predicted Entities `sdoh_community`, `sdoh_economics`, `sdoh_education`, `sdoh_environment`, `behavior_tobacco`, `behavior_alcohol`, `behavior_drug` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_mentions_en_4.2.2_3.0_1671369830893.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_sdoh_mentions_en_4.2.2_3.0_1671369830893.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols("document")\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_sdoh_mentions", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[
    document_assembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])

data = spark.createDataFrame([["Mr. Known lastname 9880 is a pleasant, cooperative gentleman with a long standing history (20 years) diverticulitis. He is married and has 3 children. He works in a bank. He denies any alcohol or intravenous drug use. He has been smoking for many years."]]).toDF("text")

result = nlpPipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val ner_model = MedicalNerModel.pretrained("ner_sdoh_mentions", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val nlpPipeline = new Pipeline().setStages(Array(
  document_assembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner_model,
  ner_converter))

val data = Seq("Mr. Known lastname 9880 is a pleasant, cooperative gentleman with a long standing history (20 years) diverticulitis. He is married and has 3 children. He works in a bank. He denies any alcohol or intravenous drug use. He has been smoking for many years.").toDS.toDF("text")

val result = nlpPipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.sdoh_mentions").predict("""Mr. Known lastname 9880 is a pleasant, cooperative gentleman with a long standing history (20 years) diverticulitis. He is married and has 3 children. He works in a bank. He denies any alcohol or intravenous drug use. He has been smoking for many years.""")
```
## Results
```bash
+----------------+----------------+
|chunk           |ner_label       |
+----------------+----------------+
|married         |sdoh_community  |
|children        |sdoh_community  |
|works           |sdoh_economics  |
|alcohol         |behavior_alcohol|
|intravenous drug|behavior_drug   |
|smoking         |behavior_tobacco|
+----------------+----------------+
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_sdoh_mentions| |Compatibility:|Healthcare NLP 4.2.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|15.1 MB| ## Benchmarking
```bash
           label  precision  recall  f1-score  support
behavior_alcohol       0.95    0.94      0.94      798
   behavior_drug       0.93    0.92      0.92      366
behavior_tobacco       0.95    0.95      0.95      936
  sdoh_community       0.97    0.97      0.97      969
  sdoh_economics       0.95    0.91      0.93      363
  sdoh_education       0.69    0.65      0.67       34
sdoh_environment       0.93    0.90      0.92      651
       micro-avg       0.95    0.94      0.94     4117
       macro-avg       0.91    0.89      0.90     4117
    weighted-avg       0.95    0.94      0.94     4117
```
--- layout: model title: Castilian, Spanish BertForQuestionAnswering model (from CenIA) author: John Snow Labs name: bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_tar date: 2022-06-02 tags: [open_source, question_answering, bert] task: Question Answering language: es edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-spanish-wwm-uncased-finetuned-qa-tar` is a Castilian, Spanish model originally trained by `CenIA`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_tar_es_4.0.0_3.0_1654180605963.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_tar_es_4.0.0_3.0_1654180605963.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_tar","es") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    document_assembler,
    spanClassifier
])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
  .setInputCols("question", "context")
  .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
  .pretrained("bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_tar","es")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
  ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."),
  ("What's my name?", "My name is Clara and I live in Berkeley."))
  .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("es.answer_question.bert.base_uncased.by_CenIA").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_tar| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|410.2 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/CenIA/bert-base-spanish-wwm-uncased-finetuned-qa-tar --- layout: model title: Detect Persons, Locations, Organizations and Misc Entities in Spanish (WikiNER 840B 300) author: John Snow Labs name: wikiner_840B_300 date: 2020-02-03 task: Named Entity Recognition language: es edition: Spark NLP 2.4.0 spark_version: 2.4 tags: [ner, es, open_source] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 840B 300 is trained with GloVe 840B 300 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`.
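As the description notes, word embeddings place semantically similar words closer together, and that closeness is typically measured with cosine similarity. A toy sketch (the 3-dimensional vectors below are invented for illustration; the actual glove_840B_300 embeddings are 300-dimensional):

```python
import math

# Made-up toy word vectors; real glove_840B_300 embeddings have 300 dimensions.
vectors = {
    "rey":   [0.9, 0.1, 0.3],
    "reina": [0.85, 0.2, 0.35],
    "mesa":  [0.1, 0.9, 0.2],
}

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Semantically related words score higher than unrelated ones.
print(cosine(vectors["rey"], vectors["reina"]) > cosine(vectors["rey"], vectors["mesa"]))  # True
```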
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_ES){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_ES.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_840B_300_es_2.4.0_2.4_1581971942091.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_840B_300_es_2.4.0_2.4_1581971942091.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
...
embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

ner_model = NerDLModel.pretrained("wikiner_840B_300", "es") \
    .setInputCols(["document", "token", "embeddings"]) \
    .setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter])

pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (nacido el 28 de octubre de 1955) es un magnate de los negocios, desarrollador de software, inversor y filántropo estadounidense. Es mejor conocido como el cofundador de Microsoft Corporation. Durante su carrera en Microsoft, Gates ocupó los cargos de presidente, director ejecutivo (CEO), presidente y arquitecto de software en jefe, y también fue el mayor accionista individual hasta mayo de 2014. Es uno de los empresarios y pioneros más conocidos de revolución de la microcomputadora de los años setenta y ochenta. Nacido y criado en Seattle, Washington, Gates cofundó Microsoft con su amigo de la infancia Paul Allen en 1975, en Albuquerque, Nuevo México; se convirtió en la compañía de software de computadora personal más grande del mundo. Gates dirigió la compañía como presidente y CEO hasta que dejó el cargo de CEO en enero de 2000, pero siguió siendo presidente y se convirtió en el arquitecto jefe de software. A fines de la década de 1990, Gates había sido criticado por sus tácticas comerciales, que se han considerado anticompetitivas. Esta opinión ha sido confirmada por numerosas sentencias judiciales. En junio de 2006, Gates anunció que haría la transición a un puesto de medio tiempo en Microsoft y trabajaría a tiempo completo en la Fundación Bill y Melinda Gates, la fundación caritativa privada que él y su esposa, Melinda Gates, establecieron en 2000. Poco a poco transfirió sus deberes a Ray Ozzie y Craig Mundie. Renunció como presidente de Microsoft en febrero de 2014 y asumió un nuevo cargo como asesor tecnológico para apoyar al recién nombrado CEO Satya Nadella.']], ["text"]))
```
```scala
...
val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")

val ner_model = NerDLModel.pretrained("wikiner_840B_300", "es")
  .setInputCols(Array("document", "token", "embeddings"))
  .setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter))

val data = Seq("William Henry Gates III (nacido el 28 de octubre de 1955) es un magnate de los negocios, desarrollador de software, inversor y filántropo estadounidense. Es mejor conocido como el cofundador de Microsoft Corporation. Durante su carrera en Microsoft, Gates ocupó los cargos de presidente, director ejecutivo (CEO), presidente y arquitecto de software en jefe, y también fue el mayor accionista individual hasta mayo de 2014. Es uno de los empresarios y pioneros más conocidos de revolución de la microcomputadora de los años setenta y ochenta. Nacido y criado en Seattle, Washington, Gates cofundó Microsoft con su amigo de la infancia Paul Allen en 1975, en Albuquerque, Nuevo México; se convirtió en la compañía de software de computadora personal más grande del mundo. Gates dirigió la compañía como presidente y CEO hasta que dejó el cargo de CEO en enero de 2000, pero siguió siendo presidente y se convirtió en el arquitecto jefe de software. A fines de la década de 1990, Gates había sido criticado por sus tácticas comerciales, que se han considerado anticompetitivas. Esta opinión ha sido confirmada por numerosas sentencias judiciales. En junio de 2006, Gates anunció que haría la transición a un puesto de medio tiempo en Microsoft y trabajaría a tiempo completo en la Fundación Bill y Melinda Gates, la fundación caritativa privada que él y su esposa, Melinda Gates, establecieron en 2000. Poco a poco transfirió sus deberes a Ray Ozzie y Craig Mundie. Renunció como presidente de Microsoft en febrero de 2014 y asumió un nuevo cargo como asesor tecnológico para apoyar al recién nombrado CEO Satya Nadella.").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
text = ["""William Henry Gates III (nacido el 28 de octubre de 1955) es un magnate de los negocios, desarrollador de software, inversor y filántropo estadounidense. Es mejor conocido como el cofundador de Microsoft Corporation. Durante su carrera en Microsoft, Gates ocupó los cargos de presidente, director ejecutivo (CEO), presidente y arquitecto de software en jefe, y también fue el mayor accionista individual hasta mayo de 2014. Es uno de los empresarios y pioneros más conocidos de revolución de la microcomputadora de los años setenta y ochenta. Nacido y criado en Seattle, Washington, Gates cofundó Microsoft con su amigo de la infancia Paul Allen en 1975, en Albuquerque, Nuevo México; se convirtió en la compañía de software de computadora personal más grande del mundo. Gates dirigió la compañía como presidente y CEO hasta que dejó el cargo de CEO en enero de 2000, pero siguió siendo presidente y se convirtió en el arquitecto jefe de software. A fines de la década de 1990, Gates había sido criticado por sus tácticas comerciales, que se han considerado anticompetitivas. Esta opinión ha sido confirmada por numerosas sentencias judiciales. En junio de 2006, Gates anunció que haría la transición a un puesto de medio tiempo en Microsoft y trabajaría a tiempo completo en la Fundación Bill y Melinda Gates, la fundación caritativa privada que él y su esposa, Melinda Gates, establecieron en 2000. Poco a poco transfirió sus deberes a Ray Ozzie y Craig Mundie. Renunció como presidente de Microsoft en febrero de 2014 y asumió un nuevo cargo como asesor tecnológico para apoyar al recién nombrado CEO Satya Nadella."""]
ner_df = nlu.load('es.ner.wikiner.glove.840B_300').predict(text, output_level = "chunk")
ner_df[["entities", "entities_confidence"]]
```
{:.h2_title}
## Results
```bash
+-------------------------------------------------------------------+---------+
|chunk                                                              |ner_label|
+-------------------------------------------------------------------+---------+
|William Henry Gates III                                            |PER      |
|Microsoft Corporation                                              |ORG      |
|Durante su carrera                                                 |MISC     |
|Microsoft                                                          |ORG      |
|Gates                                                              |PER      |
|Nacido                                                             |MISC     |
|Seattle                                                            |LOC      |
|Washington                                                         |LOC      |
|Gates                                                              |PER      |
|Microsoft                                                          |ORG      |
|Paul Allen                                                         |PER      |
|Albuquerque                                                        |LOC      |
|Nuevo México                                                       |LOC      |
|Gates                                                              |PER      |
|CEO                                                                |ORG      |
|A fines de la década de 1990                                       |MISC     |
|Gates                                                              |PER      |
|Esta opinión ha sido confirmada por numerosas sentencias judiciales|MISC     |
|Gates                                                              |PER      |
|Microsoft                                                          |ORG      |
+-------------------------------------------------------------------+---------+
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wikiner_840B_300| |Type:|ner| |Compatibility:| Spark NLP 2.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|es| |Case sensitive:|false| {:.h2_title} ## Data Source The model was trained on data from [https://es.wikipedia.org](https://es.wikipedia.org) --- layout: model title: GPT2 text-to-text model (Large) author: John Snow Labs name: gpt_large date: 2021-12-03 tags: [en, open_source] task: Text Generation language: en nav_key: models edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: GPT2Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where the model is primed with an input and it generates a lengthy continuation. This is the large model (bigger than Base).
Other models (medium, base) are also available in [Models Hub](https://nlp.johnsnowlabs.com/models?task=Text+Generation) ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/gpt_large_en_3.4.0_3.0_1638547401185.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/gpt_large_en_3.4.0_3.0_1638547401185.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

gpt2 = GPT2Transformer.pretrained("gpt_large") \
    .setInputCols(["documents"]) \
    .setMaxOutputLength(50) \
    .setOutputCol("generation")

pipeline = Pipeline().setStages([documentAssembler, gpt2])

data = spark.createDataFrame([["My name is Leonardo."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("generation.result").show(truncate=False)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("documents")

val gpt2 = GPT2Transformer.pretrained("gpt_large")
  .setInputCols(Array("documents"))
  .setMinOutputLength(10)
  .setMaxOutputLength(50)
  .setDoSample(false)
  .setTopK(50)
  .setNoRepeatNgramSize(3)
  .setOutputCol("generation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, gpt2))

val data = Seq("My name is Leonardo.").toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("generation.result").show(truncate = false)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.gpt2.large").predict("""My name is Leonardo.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|gpt_large| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[generation]| |Language:|en| ## Data Source OpenAI WebText - a corpus created by scraping web pages with emphasis on document quality. It consists of over 8 million documents for a total of 40 GB of text. All Wikipedia documents were removed from WebText. --- layout: model title: English AlbertForQuestionAnswering model (from SalmanMo) author: John Snow Labs name: albert_qa_QA_1e date: 2022-06-24 tags: [en, open_source, albert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: AlBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ALBERT_QA_1e` is an English model originally trained by `SalmanMo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_qa_QA_1e_en_4.0.0_3.0_1656062921751.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_qa_QA_1e_en_4.0.0_3.0_1656062921751.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_QA_1e","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = AlbertForQuestionAnswering.pretrained("albert_qa_QA_1e","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.albert.by_SalmanMo").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|albert_qa_QA_1e| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|42.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/SalmanMo/ALBERT_QA_1e --- layout: model title: English image_classifier_vit_pond_image_classification_4 ViTForImageClassification from SummerChiam author: John Snow Labs name: image_classifier_vit_pond_image_classification_4 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_pond_image_classification_4` is an English model originally trained by SummerChiam. ## Predicted Entities `Normal`, `Boiling`, `Algae`, `NormalCement`, `NormalRain`, `BoilingNight`, `NormalNight` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_4_en_4.1.0_3.0_1660166981854.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_pond_image_classification_4_en_4.1.0_3.0_1660166981854.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_pond_image_classification_4", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_pond_image_classification_4", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_pond_image_classification_4| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English Bert Embeddings (Large, Cased) author: John Snow Labs name: bert_embeddings_bert_large_cased_whole_word_masking date: 2022-04-11 tags: [bert, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-large-cased-whole-word-masking` is an English model originally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_cased_whole_word_masking_en_3.4.2_3.0_1649671766081.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_large_cased_whole_word_masking_en_3.4.2_3.0_1649671766081.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_cased_whole_word_masking","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_cased_whole_word_masking","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.bert_large_cased_whole_word_masking").predict("""I love Spark NLP""") ```
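The pipeline above emits one vector per token in the `embeddings` column (1024-dimensional for this BERT-large model). A common downstream use of such vectors is comparing tokens or sentences by cosine similarity; the following is a framework-free sketch using tiny made-up vectors, not real model outputs:

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = a.b / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional stand-ins for two token embeddings
v_spark = [0.2, 0.1, 0.9, 0.3]
v_nlp = [0.1, 0.2, 0.8, 0.4]

print(round(cosine_similarity(v_spark, v_nlp), 3))  # prints 0.979
```

In practice the vectors would come from the `embeddings` annotations in the transformed DataFrame.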
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_large_cased_whole_word_masking| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| ## References - https://huggingface.co/bert-large-cased-whole-word-masking - https://arxiv.org/abs/1810.04805 - https://github.com/google-research/bert - https://yknzhu.wixsite.com/mbweb - https://en.wikipedia.org/wiki/English_Wikipedia --- layout: model title: English T5ForConditionalGeneration Cased model (from KES) author: John Snow Labs name: t5_ttparser date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `T5-TTParser` is an English model originally trained by `KES`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_ttparser_en_4.3.0_3.0_1675099426092.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_ttparser_en_4.3.0_3.0_1675099426092.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_ttparser","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_ttparser","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_ttparser| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|916.4 MB| ## References - https://huggingface.co/KES/T5-TTParser - https://arxiv.org/abs/1702.04066 - https://pypi.org/project/Caribe/ --- layout: model title: English DistilBertForQuestionAnswering Base Cased model (from Moussab) author: John Snow Labs name: distilbert_qa_base_cased_led_squad_orkg_no_label_1e_4 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-cased-distilled-squad-orkg-no-label-1e-4` is an English model originally trained by `Moussab`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_orkg_no_label_1e_4_en_4.3.0_3.0_1672766759116.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_cased_led_squad_orkg_no_label_1e_4_en_4.3.0_3.0_1672766759116.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_orkg_no_label_1e_4","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_cased_led_squad_orkg_no_label_1e_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_cased_led_squad_orkg_no_label_1e_4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|244.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Moussab/distilbert-base-cased-distilled-squad-orkg-no-label-1e-4 --- layout: model title: Detect Assertion Status from Oncology Tests author: John Snow Labs name: assertion_oncology_test_binary_wip date: 2022-10-01 tags: [licensed, clinical, oncology, en, assertion, test, biomarker] task: Assertion Status language: en nav_key: models edition: Healthcare NLP 4.1.0 spark_version: 3.0 supported: true annotator: AssertionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model detects the assertion status of oncology tests, such as Pathology_Test or Imaging_Test. The model identifies if the test has been performed (Present_Or_Past status), or if it is mentioned but in hypothetical or absent sentences (Hypothetical_Or_Absent status). 
## Predicted Entities `Hypothetical_Or_Absent`, `Medical_History` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ASSERTION_ONCOLOGY/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_test_binary_wip_en_4.1.0_3.0_1664641843088.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/assertion_oncology_test_binary_wip_en_4.1.0_3.0_1664641843088.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk")\ .setWhiteList(["Pathology_Test", "Imaging_Test"]) assertion = AssertionDLModel.pretrained("assertion_oncology_test_binary_wip", "en", "clinical/models") \ .setInputCols(["sentence", "ner_chunk", "embeddings"]) \ .setOutputCol("assertion") pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion]) data = spark.createDataFrame([["The result of the biopsy was positive. 
We recommend to perform a CT scan."]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Pathology_Test", "Imaging_Test")) val assertion = AssertionDLModel.pretrained("assertion_oncology_test_binary_wip","en","clinical/models") .setInputCols(Array("sentence","ner_chunk","embeddings")) .setOutputCol("assertion") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, assertion)) val data = Seq("""The result of the biopsy was positive. We recommend to perform a CT scan.""").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.assert.oncology_test_binary").predict("""The result of the biopsy was positive. We recommend to perform a CT scan.""") ```
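The per-label F1 scores reported in the Benchmarking section are the harmonic mean of the corresponding precision and recall. A quick plain-Python check of that arithmetic (the standard F1 definition, not Spark NLP code):

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Per-label precision/recall as reported for this model
print(f"{f1_score(0.79, 0.81):.2f}")  # Hypothetical_Or_Absent -> 0.80
print(f"{f1_score(0.80, 0.78):.2f}")  # Medical_History -> 0.79
```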
## Results ```bash | chunk | ner_label | assertion | |:--------|:---------------|:-----------------------| | biopsy | Pathology_Test | Medical_History | | CT scan | Imaging_Test | Hypothetical_Or_Absent | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|assertion_oncology_test_binary_wip| |Compatibility:|Healthcare NLP 4.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, chunk, embeddings]| |Output Labels:|[assertion_pred]| |Language:|en| |Size:|1.4 MB| ## References In-house annotated oncology case reports. ## Benchmarking ```bash label precision recall f1-score support Hypothetical_Or_Absent 0.79 0.81 0.80 37.0 Medical_History 0.80 0.78 0.79 36.0 macro-avg 0.79 0.79 0.79 73.0 weighted-avg 0.79 0.79 0.79 73.0 ``` --- layout: model title: Spanish DistilBertForQuestionAnswering Base Uncased model (from CenIA) author: John Snow Labs name: distilbert_qa_distillbert_base_uncased_finetuned_tar date: 2023-01-03 tags: [es, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distillbert-base-spanish-uncased-finetuned-qa-tar` is a Spanish model originally trained by `CenIA`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_distillbert_base_uncased_finetuned_tar_es_4.3.0_3.0_1672774812379.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_distillbert_base_uncased_finetuned_tar_es_4.3.0_3.0_1672774812379.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distillbert_base_uncased_finetuned_tar","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distillbert_base_uncased_finetuned_tar","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_distillbert_base_uncased_finetuned_tar| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|es| |Size:|250.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/CenIA/distillbert-base-spanish-uncased-finetuned-qa-tar --- layout: model title: English BertForQuestionAnswering model (from zhufy) author: John Snow Labs name: bert_qa_squad_en_bert_base date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squad-en-bert-base` is an English model originally trained by `zhufy`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_squad_en_bert_base_en_4.0.0_3.0_1654191955457.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_squad_en_bert_base_en_4.0.0_3.0_1654191955457.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_squad_en_bert_base","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_squad_en_bert_base","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base.by_zhufy").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
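The SQuAD leaderboard linked under References scores extractive QA models with exact match and token-overlap F1. The sketch below illustrates the token-overlap idea only; it deliberately skips the official SQuAD normalization steps (beyond lower-casing, articles and punctuation are not stripped):

```python
from collections import Counter

def token_f1(prediction, gold):
    # Token-overlap F1 between a predicted and a gold answer span
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Clara", "Clara"))           # identical spans -> 1.0
print(token_f1("name is Clara", "Clara"))   # partial overlap -> 0.5
```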
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_squad_en_bert_base| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/zhufy/squad-en-bert-base - https://rajpurkar.github.io/SQuAD-explorer/ --- layout: model title: Chinese Bert Embeddings (from amine) author: John Snow Labs name: bert_embeddings_bert_base_5lang_cased date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-5lang-cased` is a Chinese model originally trained by `amine`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_5lang_cased_zh_3.4.2_3.0_1649669822302.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_5lang_cased_zh_3.4.2_3.0_1649669822302.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_5lang_cased","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_5lang_cased","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.bert_5lang_cased").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_5lang_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|464.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/amine/bert-base-5lang-cased - https://cloud.google.com/compute/docs/machine-types#n1_machine_type --- layout: model title: News Classifier for Urdu texts author: John Snow Labs name: classifierdl_bert_news date: 2021-12-10 tags: [urdu, news, classifier, ur, open_source] task: Text Classification language: ur edition: Spark NLP 3.3.0 spark_version: 2.4 supported: true annotator: ClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Classify Urdu news into 7 categories. ## Predicted Entities `business`, `entertainment`, `health`, `inland`, `science`, `sports`, `weird_news` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/CLASSIFICATION_UR_NEWS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_UR_NEWS.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_bert_news_ur_3.3.0_2.4_1639125233132.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_bert_news_ur_3.3.0_2.4_1639125233132.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("news") \ .setOutputCol("document") embeddings = BertSentenceEmbeddings.pretrained("labse", "xx") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") classifierdl = ClassifierDLModel.pretrained("classifierdl_bert_news", "ur") \ .setInputCols(["document", "sentence_embeddings"]) \ .setOutputCol("class") urdu_news_pipeline = Pipeline(stages=[document_assembler, embeddings, classifierdl]) light_pipeline = LightPipeline(urdu_news_pipeline.fit(spark.createDataFrame([['']]).toDF("news"))) result = light_pipeline.annotate("گزشتہ ہفتے ایپل کے حصص میں 11 فیصد اضافہ ہوا ہے۔") result["class"] ``` ```scala val document = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val embeddings = BertSentenceEmbeddings .pretrained("labse", "xx") .setInputCols("document") .setOutputCol("sentence_embeddings") val document_classifier = ClassifierDLModel.pretrained("classifierdl_bert_news", "ur") .setInputCols(Array("document", "sentence_embeddings")) .setOutputCol("class") val nlpPipeline = new Pipeline().setStages(Array(document, embeddings, document_classifier)) val light_pipeline = new LightPipeline(nlpPipeline.fit(Seq("").toDF("text"))) val result = light_pipeline.annotate("گزشتہ ہفتے ایپل کے حصص میں 11 فیصد اضافہ ہوا ہے۔") ``` {:.nlu-block} ```python import nlu nlu.load("ur.classify.news").predict("""گزشتہ ہفتے ایپل کے حصص میں 11 فیصد اضافہ ہوا ہے۔""") ```
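The Benchmarking section below reports both a macro average (unweighted mean over the 7 labels) and a weighted average (weighted by each label's support). Because the supports are unbalanced (4022 sports examples vs. only 430 for health), the two differ; both can be reproduced from the per-label table in plain Python:

```python
# Per-label F1 and support, as reported in the Benchmarking section
f1_by_label = {
    "business": (0.85, 2365),
    "entertainment": (0.86, 3081),
    "health": (0.68, 430),
    "inland": (0.81, 3964),
    "science": (0.61, 558),
    "sports": (0.89, 4022),
    "weird_news": (0.57, 826),
}

total = sum(support for _, support in f1_by_label.values())
macro_f1 = sum(f1 for f1, _ in f1_by_label.values()) / len(f1_by_label)
weighted_f1 = sum(f1 * support for f1, support in f1_by_label.values()) / total

print(f"macro avg:    {macro_f1:.2f}")     # matches the reported 0.75
print(f"weighted avg: {weighted_f1:.2f}")  # matches the reported 0.82
```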
## Results ```bash ['business'] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifierdl_bert_news| |Compatibility:|Spark NLP 3.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|ur| |Size:|23.6 MB| ## Data Source Combination of multiple open source data sets. ## Benchmarking ```bash label precision recall f1-score support business 0.83 0.86 0.85 2365 entertainment 0.87 0.85 0.86 3081 health 0.68 0.67 0.68 430 inland 0.80 0.82 0.81 3964 science 0.62 0.60 0.61 558 sports 0.88 0.89 0.89 4022 weird_news 0.60 0.54 0.57 826 accuracy - - 0.82 15246 macro-avg 0.76 0.75 0.75 15246 weighted-avg 0.82 0.82 0.82 15246 ``` --- layout: model title: Multilingual XLMRobertaForTokenClassification Base Cased model (from robkayinto) author: John Snow Labs name: xlmroberta_ner_robkayinto_base_finetuned_panx_all date: 2022-08-13 tags: [xx, open_source, xlm_roberta, ner] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-all` is a Multilingual model originally trained by `robkayinto`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_robkayinto_base_finetuned_panx_all_xx_4.1.0_3.0_1660428813467.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_robkayinto_base_finetuned_panx_all_xx_4.1.0_3.0_1660428813467.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_robkayinto_base_finetuned_panx_all","xx") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_robkayinto_base_finetuned_panx_all","xx")
  .setInputCols(Array("document", "token"))
  .setOutputCol("ner")

val ner_converter = new NerConverter()
  .setInputCols(Array("document", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
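The `NerConverter` stage above assembles the classifier's token-level IOB2 tags into entity chunks. As a rough illustration of that merging logic (a plain-Python sketch, not Spark NLP's actual implementation):

```python
def iob2_chunks(tokens, tags):
    """Merge IOB2-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # a B- tag always opens a new chunk, closing any open one
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # continuation of the open chunk
            current.append(tok)
        else:
            # "O" (or an inconsistent I- tag) closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

print(iob2_chunks(
    ["John", "Smith", "works", "at", "Google", "."],
    ["B-PER", "I-PER", "O", "O", "B-ORG", "O"]))
# → [('John Smith', 'PER'), ('Google', 'ORG')]
```

The real annotator additionally carries character offsets and confidence metadata for each chunk; this sketch only shows the tag-merging idea.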
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_robkayinto_base_finetuned_panx_all| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|xx| |Size:|861.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/robkayinto/xlm-roberta-base-finetuned-panx-all --- layout: model title: English RobertaForQuestionAnswering (from mrm8488) author: John Snow Labs name: roberta_qa_distilroberta_finetuned_squadv1 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilroberta-finetuned-squadv1` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_distilroberta_finetuned_squadv1_en_4.0.0_3.0_1655728443758.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_distilroberta_finetuned_squadv1_en_4.0.0_3.0_1655728443758.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_distilroberta_finetuned_squadv1","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([document_assembler, spanClassifier])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
  .setInputCols("question", "context")
  .setOutputCols("document_question", "document_context")

val spanClassifier = RoBertaForQuestionAnswering
  .pretrained("roberta_qa_distilroberta_finetuned_squadv1","en")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
  ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
  ("What's my name?", "My name is Clara and I live in Berkeley."))
  .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.roberta.distilled").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
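Under the hood, extractive QA annotators such as `RoBertaForQuestionAnswering` score candidate answer spans over the context and return the best one. A simplified sketch of that span selection (illustrative only; the real annotator works on model logits over subword tokens, and the scores below are made up):

```python
def best_span(start_scores, end_scores, max_len=30):
    """Pick the (start, end) pair maximizing start+end score, with start <= end
    and a bounded span length."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# hypothetical per-token start/end scores peaking on "Clara"
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.0, 0.1, 0.0, 0.3, 0.0]
end   = [0.0, 0.1, 0.0, 4.8, 0.0, 0.0, 0.0, 0.1, 0.2, 0.0]

s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))  # → Clara
```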
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_distilroberta_finetuned_squadv1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|307.0 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/distilroberta-finetuned-squadv1 --- layout: model title: English asr_wav2vec2_base_timit_demo_colab92 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab92 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab92` is an English model originally trained by hassnain. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab92_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab92_en_4.2.0_3.0_1664022445550.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab92_en_4.2.0_3.0_1664022445550.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab92', lang = 'en')

annotations = pipeline.transform(audioDF)
```
```scala
val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab92", lang = "en")

val annotations = pipeline.transform(audioDF)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab92| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Indonesian T5ForConditionalGeneration Base Cased model (from panggi) author: John Snow Labs name: t5_panggi_base_indonesian_summarization_cased date: 2023-01-30 tags: [id, open_source, t5, tensorflow] task: Text Generation language: id edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-base-indonesian-summarization-cased` is an Indonesian model originally trained by `panggi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_panggi_base_indonesian_summarization_cased_id_4.3.0_3.0_1675109758797.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_panggi_base_indonesian_summarization_cased_id_4.3.0_3.0_1675109758797.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_panggi_base_indonesian_summarization_cased","id") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_panggi_base_indonesian_summarization_cased","id")
  .setInputCols("document")
  .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_panggi_base_indonesian_summarization_cased| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|id| |Size:|925.6 MB| ## References - https://huggingface.co/panggi/t5-base-indonesian-summarization-cased - https://github.com/kata-ai/indosum --- layout: model title: Chinese BertForQuestionAnswering model (from hfl) author: John Snow Labs name: bert_qa_chinese_pert_large_mrc date: 2022-06-06 tags: [zh, open_source, question_answering, bert] task: Question Answering language: zh edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese-pert-large-mrc` is a Chinese model originally trained by `hfl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_chinese_pert_large_mrc_zh_4.0.0_3.0_1654537830236.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_chinese_pert_large_mrc_zh_4.0.0_3.0_1654537830236.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_chinese_pert_large_mrc","zh") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([document_assembler, spanClassifier])

example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(example).transform(example)
```
```scala
val document = new MultiDocumentAssembler()
  .setInputCols("question", "context")
  .setOutputCols("document_question", "document_context")

val spanClassifier = BertForQuestionAnswering
  .pretrained("bert_qa_chinese_pert_large_mrc","zh")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document, spanClassifier))

val example = Seq(
  ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."),
  ("What's my name?", "My name is Clara and I live in Berkeley."))
  .toDF("question", "context")

val result = pipeline.fit(example).transform(example)
```

{:.nlu-block}
```python
import nlu
nlu.load("zh.answer_question.bert.large.by_hfl").predict("""What's my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_chinese_pert_large_mrc| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|zh| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/hfl/chinese-pert-large-mrc - https://github.com/ymcui/PERT - https://github.com/ymcui/Chinese-ELECTRA - https://github.com/ymcui/Chinese-Minority-PLM - https://github.com/ymcui/HFL-Anthology - https://github.com/ymcui/Chinese-BERT-wwm - https://github.com/ymcui/Chinese-XLNet - https://github.com/airaria/TextBrewer - https://github.com/ymcui/MacBERT --- layout: model title: Pipeline to Detect Chemicals and Proteins in text author: John Snow Labs name: ner_chemprot_biobert_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_chemprot_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_chemprot_biobert_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chemprot_biobert_pipeline_en_3.4.1_3.0_1647868851526.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chemprot_biobert_pipeline_en_3.4.1_3.0_1647868851526.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
pipeline = PretrainedPipeline("ner_chemprot_biobert_pipeline", "en", "clinical/models")

pipeline.annotate("Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.")
```
```scala
val pipeline = new PretrainedPipeline("ner_chemprot_biobert_pipeline", "en", "clinical/models")

pipeline.annotate("Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.")
```

{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.chemprot_biobert.pipeline").predict("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""")
```
## Results ```bash +-------------------------------+--------+ |chunks |entities| +-------------------------------+--------+ |Keratinocyte growth factor |GENE-Y | |acidic fibroblast growth factor|GENE-Y | +-------------------------------+--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_chemprot_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.0 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter --- layout: model title: Legal Optional renewal Clause Binary Classifier author: John Snow Labs name: legclf_optional_renewal_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `optional_renewal` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `optional_renewal` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_optional_renewal_clause_en_1.0.0_3.2_1660122788779.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_optional_renewal_clause_en_1.0.0_3.2_1660122788779.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("clause_text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = nlp.ClassifierDLModel.pretrained("legclf_optional_renewal_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[documentAssembler, embeddings, docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text")

model = nlpPipeline.fit(df)

result = model.transform(df)
```
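The description above recommends paragraph splitting (by multiline) before classification. A minimal stand-alone sketch of that preprocessing step (plain Python; the Legal NLP workshop tutorial may use different tooling):

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines and drop empty fragments,
    so each paragraph can be classified as one candidate clause."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = """Section 1. Renewal.
The term shall renew automatically.

Section 2. Termination.
Either party may terminate with notice."""

for paragraph in split_paragraphs(doc):
    print(repr(paragraph))
```

Each resulting paragraph can then be fed into the `clause_text` column of the pipeline above.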
## Results ```bash +------------------+ | result| +------------------+ |[optional_renewal]| |[other]| |[other]| |[optional_renewal]| +------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_optional_renewal_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.9 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support optional_renewal 1.00 0.91 0.95 32 other 0.97 1.00 0.99 101 accuracy - - 0.98 133 macro-avg 0.99 0.95 0.97 133 weighted-avg 0.98 0.98 0.98 133 ``` --- layout: model title: English asr_wav2vec2_large_xls_r_300m_urdu_proj TFWav2Vec2ForCTC from MSaudTahir author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_urdu_proj date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_urdu_proj` is an English model originally trained by MSaudTahir. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_300m_urdu_proj_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_urdu_proj_en_4.2.0_3.0_1664101958745.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_urdu_proj_en_4.2.0_3.0_1664101958745.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
audio_assembler = AudioAssembler() \
    .setInputCol("audio_content") \
    .setOutputCol("audio_assembler")

speech_to_text = Wav2Vec2ForCTC \
    .pretrained("asr_wav2vec2_large_xls_r_300m_urdu_proj", "en") \
    .setInputCols("audio_assembler") \
    .setOutputCol("text")

pipeline = Pipeline(stages=[audio_assembler, speech_to_text])

pipelineModel = pipeline.fit(audioDf)

pipelineDF = pipelineModel.transform(audioDf)
```
```scala
val audioAssembler = new AudioAssembler()
  .setInputCol("audio_content")
  .setOutputCol("audio_assembler")

val speechToText = Wav2Vec2ForCTC
  .pretrained("asr_wav2vec2_large_xls_r_300m_urdu_proj", "en")
  .setInputCols("audio_assembler")
  .setOutputCol("text")

val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))

val pipelineModel = pipeline.fit(audioDf)

val pipelineDF = pipelineModel.transform(audioDf)
```
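The `audio_content` column consumed by `AudioAssembler` is expected to hold the raw waveform as an array of floats. A stdlib-only sketch of producing such floats from a 16-bit mono PCM WAV file (an illustrative assumption; your audio-loading stack may differ):

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit mono PCM WAV into floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2 and wf.getnchannels() == 1
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# demo: write four known 16-bit samples, then read them back as floats
with wave.open("demo.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))

print(wav_to_floats("demo.wav"))  # → [0.0, 0.5, -0.5, ~1.0]
```

The resulting list of floats is what you would place in the `audio_content` column of `audioDf` above.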
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_urdu_proj| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Legal Acceleration Clause Binary Classifier author: John Snow Labs name: legclf_acceleration_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `acceleration` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. 
## Predicted Entities `other`, `acceleration` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_acceleration_clause_en_1.0.0_3.2_1660123216364.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_acceleration_clause_en_1.0.0_3.2_1660123216364.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("clause_text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = nlp.ClassifierDLModel.pretrained("legclf_acceleration_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

nlpPipeline = nlp.Pipeline(stages=[documentAssembler, embeddings, docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text")

model = nlpPipeline.fit(df)

result = model.transform(df)
```
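As the description notes, several of these binary clause classifiers can be combined so that each contributes one True/False flag per document. A toy sketch of that aggregation, using keyword-based stand-ins instead of the real licensed models (illustrative only):

```python
def clause_flags(text, classifiers):
    """Run several binary clause classifiers over one text and collect flags.

    `classifiers` maps a clause name to a callable that returns either that
    clause name or "other" (stand-ins for the real models here).
    """
    return {name: clf(text) == name for name, clf in classifiers.items()}

# hypothetical keyword-triggered stand-ins for two clause classifiers
classifiers = {
    "acceleration": lambda t: "acceleration" if "accelerate" in t else "other",
    "optional_renewal": lambda t: "optional_renewal" if "renew" in t else "other",
}

print(clause_flags("The lender may accelerate the maturity of the loan.", classifiers))
# → {'acceleration': True, 'optional_renewal': False}
```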
## Results ```bash +--------------+ | result| +--------------+ |[acceleration]| |[other]| |[other]| |[acceleration]| +--------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_acceleration_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support acceleration 1.00 0.78 0.88 27 other 0.92 1.00 0.96 65 accuracy - - 0.93 92 macro-avg 0.96 0.89 0.92 92 weighted-avg 0.94 0.93 0.93 92 ``` --- layout: model title: Financial Relation Extraction (Tickers) author: John Snow Labs name: finre_has_ticker date: 2022-10-15 tags: [en, finance, re, has_ticker, licensed] task: Relation Extraction language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model can be used to extract the tickers of companies or products. A ticker (stock symbol) is a unique series of letters assigned to a security for trading purposes. For example: Company: Apple Inc. Ticker: `AAPL` ## Predicted Entities `has_ticker` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finre_has_ticker_en_1.0.0_3.0_1665842119957.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finre_has_ticker_en_1.0.0_3.0_1665842119957.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model_org = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_org")

ner_converter_org = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_org"])\
    .setOutputCol("ner_chunk_org")\
    .setWhiteList(['ORG'])

ner_model_ticker = finance.NerModel.pretrained("finner_ticker", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_ticker")

ner_converter_ticker = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner_ticker"]) \
    .setOutputCol("ner_chunk_ticker")

chunk_merger = finance.ChunkMergeApproach()\
    .setInputCols("ner_chunk_ticker", "ner_chunk_org")\
    .setOutputCol('ner_chunk')\
    .setMergeOverlapping(True)

pos = nlp.PerceptronModel.pretrained("pos_anc", 'en')\
    .setInputCols("sentence", "token")\
    .setOutputCol("pos")

dependency_parser = nlp.DependencyParserModel.pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos", "token"])\
    .setOutputCol("dependencies")

re_ner_chunk_filter = finance.RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunk")\
    .setRelationPairs(["ORG-TICKER"])\
    .setMaxSyntacticDistance(4)

re_Model = finance.RelationExtractionDLModel.pretrained("finre_has_ticker", "en", "finance/models")\
    .setInputCols(["ner_chunk", "sentence"])\
    .setOutputCol("relations")\
    .setPredictionThreshold(0.2)

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model_org,
    ner_converter_org,
    ner_model_ticker,
    ner_converter_ticker,
    chunk_merger,
    pos,
    dependency_parser,
    re_ner_chunk_filter,
    re_Model])

empty_df = spark.createDataFrame([['']]).toDF("text")

re_model = pipeline.fit(empty_df)

text = """MTH - Meritage Homes Corporation Reports Disappointing Revenue. RECN, Resources Connection Inc. Shareholder Raymond James Trust Has Decreased Holding"""

light_model = nlp.LightPipeline(re_model)

light_model.fullAnnotate(text)
```
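The `RENerChunksFilter` stage above keeps only entity pairs whose types match `setRelationPairs(["ORG-TICKER"])` before relation classification. A simplified sketch of that pair-generation idea (plain Python; it ignores the syntactic-distance constraint the real filter also applies):

```python
def candidate_pairs(chunks, allowed):
    """Generate candidate chunk pairs whose entity types match an allowed pair."""
    pairs = []
    for i, a in enumerate(chunks):
        for b in chunks[i + 1:]:
            if (a["type"], b["type"]) in allowed or (b["type"], a["type"]) in allowed:
                pairs.append((a["text"], b["text"]))
    return pairs

# toy chunks mimicking what the merged NER stage might produce
chunks = [
    {"text": "MTH", "type": "TICKER"},
    {"text": "Meritage Homes Corporation", "type": "ORG"},
    {"text": "Disappointing", "type": "O"},
]

print(candidate_pairs(chunks, {("ORG", "TICKER")}))
# → [('MTH', 'Meritage Homes Corporation')]
```

Only the surviving pairs are then scored by the `finre_has_ticker` relation model.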
## Results ```bash | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |-----------:|--------:|--------------:|------------:|-------:|--------:|--------------:|------------:|---------------------------:|-----------:| | has_ticker | TICKER | 0 | 2 | MTH | ORG | 6 | 31 | Meritage Homes Corporation | 0.99532026 | | has_ticker | TICKER | 64 | 67 | RECN | ORG | 70 | 93 | Resources Connection Inc | 0.97409964 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finre_has_ticker| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|409.9 MB| ## References Manual annotations on tweets ## Benchmarking ```bash label Recall Precision F1 Support has_ticker 0.717 0.827 0.768 60 Avg. 0.717 0.827 0.768 - Weighted-Avg. 0.717 0.827 0.768 - ``` --- layout: model title: Translate English to Albanian Pipeline author: John Snow Labs name: translate_en_sq date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, sq, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. 
- source languages: `en` - target languages: `sq` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_sq_xx_2.7.0_2.4_1609690542107.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_sq_xx_2.7.0_2.4_1609690542107.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("translate_en_sq", lang = "xx")

pipeline.annotate("Your sentence to translate!")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = new PretrainedPipeline("translate_en_sq", lang = "xx")

pipeline.annotate("Your sentence to translate!")
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.en.translate_to.sq').predict(text, output_level='sentence')
translate_df
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_sq| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Kinyarwanda XLMRobertaForTokenClassification Base Cased model (from mbeukman) author: John Snow Labs name: xlmroberta_ner_base_finetuned_ner_kinyarwand date: 2022-08-01 tags: [rw, open_source, xlm_roberta, ner] task: Named Entity Recognition language: rw edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-ner-kinyarwanda` is a Kinyarwanda model originally trained by `mbeukman`. ## Predicted Entities `DATE`, `PER`, `ORG`, `LOC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_ner_kinyarwand_rw_4.1.0_3.0_1659354904793.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_base_finetuned_ner_kinyarwand_rw_4.1.0_3.0_1659354904793.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_ner_kinyarwand","rw") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_base_finetuned_ner_kinyarwand","rw") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
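The `ner_converter` stage above groups the model's token-level IOB tags into entity chunks. The following is a minimal plain-Python sketch of that grouping logic, not the annotator's actual implementation; tokens and tags are invented for illustration:

```python
def iob_chunks(tokens, tags):
    """Group IOB2 tags (B-X, I-X, O) into (entity_type, chunk_text) pairs."""
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new chunk, closing any open one.
            if current:
                chunks.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # I- continues the open chunk only if the entity type matches.
            current[1].append(token)
        else:
            # O (or an orphan I-) closes the open chunk.
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(words)) for label, words in chunks]

print(iob_chunks(["Kigali", "is", "in", "Rwanda"],
                 ["B-LOC", "O", "O", "B-LOC"]))
# → [('LOC', 'Kigali'), ('LOC', 'Rwanda')]
```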
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_base_finetuned_ner_kinyarwand| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|rw| |Size:|775.8 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/mbeukman/xlm-roberta-base-finetuned-ner-kinyarwanda - https://arxiv.org/abs/2103.11811 - https://github.com/Michael-Beukman/NERTransfer - https://github.com/masakhane-io/masakhane-ner --- layout: model title: English BertForQuestionAnswering model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_10 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-128-finetuned-squad-seed-10` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_10_en_4.0.0_3.0_1654191366585.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_10_en_4.0.0_3.0_1654191366585.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_10","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_10","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_128d_seed_10").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_10| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|381.1 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-128-finetuned-squad-seed-10 --- layout: model title: English BertForQuestionAnswering Cased model (from LeoFelix) author: John Snow Labs name: bert_qa_leofelix_finetuned_squad date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `LeoFelix`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_leofelix_finetuned_squad_en_4.0.0_3.0_1657186109766.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_leofelix_finetuned_squad_en_4.0.0_3.0_1657186109766.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_leofelix_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_leofelix_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
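Under the hood, extractive QA models like this one score every context token as a candidate answer start and answer end, and the predicted answer is the highest-scoring valid span. A toy sketch of that span selection with made-up scores (these are not the model's real logits):

```python
def best_span(tokens, start_scores, end_scores, max_len=15):
    """Return the answer span maximizing start_score[i] + end_score[j], j >= i."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        # Only consider ends at or after the start, within a length cap.
        for j in range(i, min(i + max_len, len(tokens))):
            score = s + end_scores[j]
            if score > best_score:
                best_score, best = score, (i, j)
    i, j = best
    return " ".join(tokens[i : j + 1])

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]
end   = [0.0, 0.1, 0.0, 4.0, 0.2, 0.0, 0.1, 0.0, 0.5, 0.0]
print(best_span(tokens, start, end))  # → Clara
```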
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_leofelix_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|406.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/LeoFelix/bert-finetuned-squad --- layout: model title: Smaller BERT Embeddings (L-8_H-512_A-8) author: John Snow Labs name: small_bert_L8_512 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L8_512_en_2.6.0_2.4_1598344705269.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L8_512_en_2.6.0_2.4_1598344705269.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L8_512", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L8_512", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L8_512').predict(text, output_level='token') embeddings_df ```
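The `embeddings` column produced above holds one 512-dimensional vector per token, and a common downstream step is comparing those vectors by cosine similarity. A minimal sketch with toy 3-dimensional vectors standing in for the real embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Illustrative vectors only; real values come from the transform above.
king, queen, apple = [0.9, 0.8, 0.1], [0.85, 0.75, 0.2], [0.1, 0.2, 0.9]
print(cosine(king, queen) > cosine(king, apple))  # → True
```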
{:.h2_title} ## Results ```bash token en_embed_bert_small_L8_512_embeddings I [-0.0785166472196579, -0.22764667868614197, 0.... love [0.4189109206199646, 0.17259590327739716, 0.87... NLP [0.583548367023468, -0.21391963958740234, -0.2... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|small_bert_L8_512| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|512| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1 --- layout: model title: Yue Chinese asr_wav2vec2_large_xlsr_cantonese_by_ctl TFWav2Vec2ForCTC from ctl author: John Snow Labs name: pipeline_asr_wav2vec2_large_xlsr_cantonese_by_ctl date: 2022-09-24 tags: [wav2vec2, yue, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: yue edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_cantonese_by_ctl` is a Yue Chinese model originally trained by ctl.
NOTE: This pipeline only works on a CPU. If you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_large_xlsr_cantonese_by_ctl_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_cantonese_by_ctl_yue_4.2.0_3.0_1664039821455.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_large_xlsr_cantonese_by_ctl_yue_4.2.0_3.0_1664039821455.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_large_xlsr_cantonese_by_ctl', lang = 'yue') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_large_xlsr_cantonese_by_ctl", lang = "yue") val annotations = pipeline.transform(audioDF) ```
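The `Wav2Vec2ForCTC` stage inside this pipeline turns per-frame character predictions into text via CTC decoding: repeated frame predictions are collapsed, then blank symbols are removed. A minimal greedy-decoding sketch (the vocabulary and ids are invented for illustration, not the model's actual alphabet):

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then drop CTC blanks."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(id_to_char[i])
        prev = i
    return "".join(out)

vocab = {1: "h", 2: "e", 3: "l", 4: "o"}
# The blank (0) between the two 3s keeps "ll" as two separate letters.
print(ctc_greedy_decode([1, 1, 2, 3, 0, 3, 3, 4], vocab))  # → hello
```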
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_large_xlsr_cantonese_by_ctl| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|yue| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: German BertForQuestionAnswering model (from Sahajtomar) author: John Snow Labs name: bert_qa_GBERTQnA date: 2022-06-02 tags: [de, open_source, question_answering, bert] task: Question Answering language: de edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `GBERTQnA` is a German model originally trained by `Sahajtomar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_GBERTQnA_de_4.0.0_3.0_1654176679493.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_GBERTQnA_de_4.0.0_3.0_1654176679493.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_GBERTQnA","de") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_GBERTQnA","de") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("de.answer_question.bert").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_GBERTQnA| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|de| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Sahajtomar/GBERTQnA - https://github.com/Sahajtomar/Question-Answering/blob/main/Sahajtomar_GBERTQnA.ipynb --- layout: model title: Translate Afrikaans to English Pipeline author: John Snow Labs name: translate_af_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, af, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `af` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_af_en_xx_2.7.0_2.4_1609687489458.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_af_en_xx_2.7.0_2.4_1609687489458.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_af_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_af_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.af.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_af_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Fast Neural Machine Translation Model from Pangasinan to English author: John Snow Labs name: opus_mt_pag_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pag, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `pag` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_pag_en_xx_2.7.0_2.4_1609167115456.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_pag_en_xx_2.7.0_2.4_1609167115456.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_pag_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_pag_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.pag.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_pag_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English DistilBertForQuestionAnswering model (from LucasS) author: John Snow Labs name: distilbert_qa_distilBertABSA date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilBertABSA` is an English model originally trained by `LucasS`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_distilBertABSA_en_4.0.0_3.0_1654723518164.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_distilBertABSA_en_4.0.0_3.0_1654723518164.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distilBertABSA","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distilBertABSA","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.by_LucasS").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_distilBertABSA| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/LucasS/distilBertABSA --- layout: model title: Marathi DistilBERT Embeddings (from DarshanDeshpande) author: John Snow Labs name: distilbert_embeddings_marathi_distilbert date: 2022-04-12 tags: [distilbert, embeddings, mr, open_source] task: Embeddings language: mr edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `marathi-distilbert` is a Marathi model originally trained by `DarshanDeshpande`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_marathi_distilbert_mr_3.4.2_3.0_1649783605801.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_marathi_distilbert_mr_3.4.2_3.0_1649783605801.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_marathi_distilbert","mr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["मला स्पार्क एनएलपी आवडते"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_marathi_distilbert","mr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("मला स्पार्क एनएलपी आवडते").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("mr.embed.distilbert").predict("""मला स्पार्क एनएलपी आवडते""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_marathi_distilbert| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|mr| |Size:|247.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/DarshanDeshpande/marathi-distilbert - https://github.com/DarshanDeshpande - https://www.linkedin.com/in/darshan-deshpande/ - https://github.com/Baras64 - http://www.linkedin.com/in/harsh-abhi --- layout: model title: English Named Entity Recognition (from sven-nm) author: John Snow Labs name: roberta_ner_roberta_classics_ner date: 2022-05-03 tags: [roberta, ner, open_source, en] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta_classics_ner` is an English model originally trained by `sven-nm`. ## Predicted Entities `REFSCOPE`, `AWORK`, `FRAGREF`, `AAUTHOR`, `REFAUWORK` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_ner_roberta_classics_ner_en_3.4.2_3.0_1651594279156.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_ner_roberta_classics_ner_en_3.4.2_3.0_1651594279156.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_roberta_classics_ner","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_ner_roberta_classics_ner","en") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.ner.roberta_classics_ner").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_ner_roberta_classics_ner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|444.3 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/sven-nm/roberta_classics_ner - https://www.epische-bauformen.uni-rostock.de/ - http://infoscience.epfl.ch/record/291236?&ln=en - https://github.com/impresso/CLEF-HIPE-2020-scorer - https://github.com/AjaxMultiCommentary --- layout: model title: Detect Cancer Genetics author: John Snow Labs name: ner_bionlp date: 2021-03-31 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for biology and genetics terms. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
## Predicted Entities ``Amino_acid``, ``Anatomical_system``, ``Cancer``, ``Cell``, ``Cellular_component``, ``Developing_anatomical_Structure``, ``Gene_or_gene_product``, ``Immaterial_anatomical_entity``, ``Multi-tissue_structure``, ``Organ``, ``Organism``, ``Organism_subdivision``, ``Simple_chemical``, ``Tissue`` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_TUMOR){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_bionlp_en_3.0.0_3.0_1617208458431.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_bionlp_en_3.0.0_3.0_1617208458431.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_bionlp", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. 
The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes."""]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_bionlp", "en", "clinical/models") .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. 
The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.bionlp").predict("""The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.""") ```
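The `NerConverter` stage above assembles token-level IOB tags into entity chunks. As a rough illustration of what that merging does, here is a simplified pure-Python sketch (for intuition only, not Spark NLP's actual implementation):

```python
def merge_iob(tokens, tags):
    """Merge (token, IOB-tag) pairs into (chunk_text, label) tuples."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the previous chunk
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)  # continue the open chunk
        else:  # "O" tag or inconsistent I- tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["The", "human", "KCNJ9", "locus"]
tags = ["O", "B-Organism", "B-Gene_or_gene_product", "O"]
print(merge_iob(tokens, tags))
# -> [('human', 'Organism'), ('KCNJ9', 'Gene_or_gene_product')]
```

Multi-token spans such as `skeletal muscle` are produced the same way, by extending a `B-` chunk with matching `I-` tags.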
## Results ```bash |id |sentence_id|chunk |begin|end|ner_label | +---+-----------+----------------------+-----+---+--------------------+ |0 |0 |human |4 |8 |Organism | |0 |0 |Kir 3.3 |17 |23 |Gene_or_gene_product| |0 |0 |GIRK3 |26 |30 |Gene_or_gene_product| |0 |0 |potassium |92 |100|Simple_chemical | |0 |0 |GIRK |103 |106|Gene_or_gene_product| |0 |1 |chromosome 1q21-23 |188 |205|Cellular_component | |0 |5 |pancreas |697 |704|Organ | |0 |5 |tissues |740 |746|Tissue | |0 |5 |fat andskeletal muscle|749 |770|Tissue | |0 |6 |KCNJ9 |801 |805|Gene_or_gene_product| |0 |6 |Type II |940 |946|Gene_or_gene_product| |1 |0 |breast cancer |84 |96 |Cancer | |1 |0 |patients |134 |141|Organism | |1 |0 |anthracyclines |167 |180|Simple_chemical | |1 |0 |taxanes |186 |192|Simple_chemical | |1 |1 |vinorelbine |246 |256|Simple_chemical | |1 |1 |patients |273 |280|Organism | |1 |1 |breast |309 |314|Cancer | |1 |1 |vinorelbine inpatients|386 |407|Simple_chemical | |1 |1 |anthracyclines |433 |446|Simple_chemical | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_bionlp| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on Cancer Genetics (CG) task of the BioNLP Shared Task 2013 with ``embeddings_clinical``. 
https://aclanthology.org/W13-2008/ ## Benchmarking ```bash | | label | tp | fp | fn | prec | rec | f1 | |---:|:----------------------------------|------:|------:|-----:|---------:|---------:|---------:| | 1 | I-Amino_acid | 1 | 0 | 2 | 1 | 0.333333 | 0.5 | | 2 | I-Simple_chemical | 264 | 39 | 358 | 0.871287 | 0.424437 | 0.570811 | | 3 | B-Immaterial_anatomical_entity | 19 | 12 | 12 | 0.612903 | 0.612903 | 0.612903 | | 4 | B-Cellular_component | 144 | 24 | 36 | 0.857143 | 0.8 | 0.827586 | | 5 | B-Cancer | 808 | 103 | 115 | 0.886937 | 0.875406 | 0.881134 | | 6 | I-Cell | 888 | 91 | 198 | 0.907048 | 0.81768 | 0.860048 | | 7 | B-Tissue | 137 | 44 | 47 | 0.756906 | 0.744565 | 0.750685 | | 8 | B-Organism_substance | 67 | 4 | 34 | 0.943662 | 0.663366 | 0.77907 | | 9 | B-Simple_chemical | 598 | 165 | 128 | 0.783748 | 0.823692 | 0.803224 | | 10 | B-Cell | 910 | 125 | 98 | 0.879227 | 0.902778 | 0.890847 | | 11 | I-Organ | 7 | 2 | 10 | 0.777778 | 0.411765 | 0.538462 | | 12 | I-Tissue | 86 | 21 | 25 | 0.803738 | 0.774775 | 0.788991 | | 13 | I-Pathological_formation | 20 | 5 | 19 | 0.8 | 0.512821 | 0.625 | | 14 | I-Organism | 58 | 13 | 62 | 0.816901 | 0.483333 | 0.60733 | | 15 | B-Gene_or_gene_product | 2354 | 282 | 165 | 0.89302 | 0.934498 | 0.913288 | | 16 | I-Cancer | 488 | 73 | 116 | 0.869875 | 0.807947 | 0.837768 | | 17 | B-Organ | 109 | 36 | 47 | 0.751724 | 0.698718 | 0.724252 | | 18 | B-Pathological_formation | 58 | 20 | 30 | 0.74359 | 0.659091 | 0.698795 | | 19 | I-Cellular_component | 33 | 5 | 36 | 0.868421 | 0.478261 | 0.616822 | | 20 | I-Multi-tissue_structure | 132 | 34 | 29 | 0.795181 | 0.819876 | 0.807339 | | 21 | B-Organism | 437 | 53 | 77 | 0.891837 | 0.850195 | 0.870518 | | 22 | I-Gene_or_gene_product | 1268 | 161 | 1086 | 0.887334 | 0.538658 | 0.670367 | | 23 | B-Multi-tissue_structure | 228 | 62 | 73 | 0.786207 | 0.757475 | 0.771574 | | 24 | Macro-average | 9159 | 1398 | 2948 | 0.76803 | 0.548396 | 0.639891 | | 25 | Micro-average | 9159 | 1398 | 2948 | 
0.867576 | 0.756505 | 0.808242 | ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from AyushPJ) author: John Snow Labs name: roberta_qa_ai_club_inductions_21_nlp_base_squad_v2 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `ai-club-inductions-21-nlp-roBERTa-base-squad-v2` is an English model originally trained by `AyushPJ`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_ai_club_inductions_21_nlp_base_squad_v2_en_4.3.0_3.0_1674209021708.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_ai_club_inductions_21_nlp_base_squad_v2_en_4.3.0_3.0_1674209021708.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ai_club_inductions_21_nlp_base_squad_v2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_ai_club_inductions_21_nlp_base_squad_v2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
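Extractive QA models such as this one score every token as a candidate answer start and end, then select the highest-scoring valid span from the context. A toy sketch of that span-selection step with made-up scores (illustrative only; the real model computes these logits internally):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) indices maximizing start + end score, with end >= start."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best[2]:
                best = (i, j, s + end_scores[j])
    return best[0], best[1]

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Hypothetical per-token scores peaking on "Clara":
start = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 1.0, 0.0]
end   = [0.0, 0.1, 0.0, 4.5, 0.2, 0.0, 0.0, 0.0, 2.0, 0.1]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # -> Clara
```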
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_ai_club_inductions_21_nlp_base_squad_v2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|465.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AyushPJ/ai-club-inductions-21-nlp-roBERTa-base-squad-v2 --- layout: model title: DistilBERT base model (cased) author: John Snow Labs name: distilbert_base_cased date: 2021-05-20 tags: [distilbert, en, english, open_source, embeddings] task: Embeddings language: en nav_key: models edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a distilled version of the [BERT base model](https://huggingface.co/bert-base-cased). It was introduced in [this paper](https://arxiv.org/abs/1910.01108). The code for the distillation process can be found [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation). This model is cased: it does make a difference between english and English. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_base_cased_en_3.1.0_2.4_1621521790187.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_base_cased_en_3.1.0_2.4_1621521790187.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_base_cased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_base_cased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.distilbert").predict("""Put your text here.""") ```
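A common downstream use of these embeddings is semantic comparison via cosine similarity. A minimal pure-Python helper, independent of Spark NLP (the vectors here are made up for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [0.2, 0.1, 0.7]
v2 = [0.2, 0.1, 0.7]
v3 = [-0.7, 0.1, 0.2]
print(round(cosine(v1, v2), 3))          # identical vectors -> 1.0
print(cosine(v1, v3) < cosine(v1, v2))   # dissimilar pair scores lower -> True
```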
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_base_cased| |Compatibility:|Spark NLP 3.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token, sentence]| |Output Labels:|[embeddings]| |Language:|en| |Case sensitive:|true| ## Data Source [https://huggingface.co/distilbert-base-cased](https://huggingface.co/distilbert-base-cased) ## Benchmarking ```bash When fine-tuned on downstream tasks, this model achieves the following results: GLUE test results: | Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | |:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:| | | 81.5 | 87.8 | 88.2 | 90.4 | 47.2 | 85.5 | 85.6 | 60.6 | ``` --- layout: model title: English BertForQuestionAnswering Base Cased model (from MrAnderson) author: John Snow Labs name: bert_qa_base_1024_full_trivia date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-1024-full-trivia` is an English model originally trained by `MrAnderson`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_1024_full_trivia_en_4.0.0_3.0_1657182678581.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_1024_full_trivia_en_4.0.0_3.0_1657182678581.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_1024_full_trivia","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_1024_full_trivia","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq("What is my name?", "My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_base_1024_full_trivia| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|409.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/MrAnderson/bert-base-1024-full-trivia --- layout: model title: English asr_english_filipino_wav2vec2_l_xls_r_test_06 TFWav2Vec2ForCTC from Khalsuu author: John Snow Labs name: pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_06 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_english_filipino_wav2vec2_l_xls_r_test_06` is an English model originally trained by Khalsuu. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_06_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_06_en_4.2.0_3.0_1664115059427.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_06_en_4.2.0_3.0_1664115059427.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_06', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_06", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_english_filipino_wav2vec2_l_xls_r_test_06| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Mapping National Drug Codes (NDC) with Corresponding HCPCS Codes and Descriptions author: John Snow Labs name: ndc_hcpcs_mapper date: 2023-04-13 tags: [en, licensed, clinical, chunk_mapping, ndc, hcpcs, brand_name] task: Chunk Mapping language: en edition: Healthcare NLP 4.4.0 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps National Drug Codes (NDC) with their corresponding HCPCS codes and their descriptions. ## Predicted Entities `hcpcs_code`, `hcpcs_description` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ndc_hcpcs_mapper_en_4.4.0_3.0_1681405091593.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ndc_hcpcs_mapper_en_4.4.0_3.0_1681405091593.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ndc_chunk") chunkerMapper = DocMapperModel.pretrained("ndc_hcpcs_mapper", "en", "clinical/models")\ .setInputCols(["ndc_chunk"])\ .setOutputCol("mappings")\ .setRels(["hcpcs_code", "hcpcs_description"]) pipeline = Pipeline().setStages([document_assembler, chunkerMapper]) model = pipeline.fit(spark.createDataFrame([['']]).toDF('text')) lp = LightPipeline(model) res = lp.fullAnnotate(["16714-0892-01", "00990-6138-03", "43598-0650-11"]) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ndc_chunk") val chunkerMapper = DocMapperModel .pretrained("ndc_hcpcs_mapper", "en", "clinical/models") .setInputCols(Array("ndc_chunk")) .setOutputCol("mappings") .setRels(Array("hcpcs_code", "hcpcs_description")) val mapper_pipeline = new Pipeline().setStages(Array( document_assembler, chunkerMapper)) val data = Seq("16714-0892-01", "00990-6138-03", "43598-0650-11").toDS.toDF("text") val result = mapper_pipeline.fit(data).transform(data) ```
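The example NDC inputs use the 11-digit 5-4-2 format (labeler-product-package segments). A small helper for validating and splitting that format (an illustrative sketch; the mapper model handles code matching internally):

```python
import re

# 11-digit NDC in 5-4-2 form: labeler (5), product (4), package (2).
NDC_5_4_2 = re.compile(r"^(\d{5})-(\d{4})-(\d{2})$")

def parse_ndc(code):
    """Split an 11-digit NDC into (labeler, product, package), or None if malformed."""
    m = NDC_5_4_2.match(code)
    return m.groups() if m else None

print(parse_ndc("16714-0892-01"))  # -> ('16714', '0892', '01')
print(parse_ndc("not-a-code"))     # -> None
```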
## Results ```bash +-------------+----------------------------+-----------------+ |ndc_chunk |mappings |relation | +-------------+----------------------------+-----------------+ |16714-0892-01|J0878 |hcpcs_code | |16714-0892-01|INJECTION, DAPTOMYCIN, 1 MG |hcpcs_description| |00990-6138-03|A4217 |hcpcs_code | |00990-6138-03|STERILE WATER/SALINE, 500 ML|hcpcs_description| |43598-0650-11|J9340 |hcpcs_code | |43598-0650-11|INJECTION, THIOTEPA, 15 MG |hcpcs_description| +-------------+----------------------------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ndc_hcpcs_mapper| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|203.1 KB| --- layout: model title: Sentence Detection in Nepali Text author: John Snow Labs name: sentence_detector_dl date: 2021-08-30 tags: [sentence_detection, ne, open_source] task: Sentence Detection language: ne edition: Spark NLP 3.2.0 spark_version: 3.0 supported: true annotator: SentenceDetectorDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation. 
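For contrast with the learned model, a deliberately naive rule-based splitter shows the failure mode that SentenceDetectorDL is trained to avoid: punctuation-only rules break on abbreviations and similar edge cases (a toy baseline, not part of Spark NLP):

```python
import re

def naive_split(text):
    """Split after ., !, or ? followed by whitespace -- no abbreviation handling."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(naive_split("Dr. Smith arrived. He was late!"))
# The abbreviation "Dr." is wrongly treated as a sentence end:
# -> ['Dr.', 'Smith arrived.', 'He was late!']
```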
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_ne_3.2.0_3.0_1630334779183.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sentence_detector_dl_ne_3.2.0_3.0_1630334779183.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel\ .pretrained("sentence_detector_dl", "ne") \ .setInputCols(["document"]) \ .setOutputCol("sentences") sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL])) sd_model.fullAnnotate("""अंग्रेजी पढ्ने अनुच्छेद को एक महान स्रोत को लागी हेर्दै हुनुहुन्छ? तपाइँ सही ठाउँमा आउनुभएको छ. हालै गरिएको एक अध्ययन अनुसार आजको युवाहरुमा पढ्ने बानी छिटोछिटो घट्दै गएको छ. उनीहरु केहि सेकेन्ड भन्दा बढी को लागी एक दिईएको अंग्रेजी पढ्ने अनुच्छेद मा ध्यान केन्द्रित गर्न सक्दैनन्! साथै, पठन थियो र सबै प्रतियोगी परीक्षा को एक अभिन्न हिस्सा हो। त्यसोभए, तपाइँ तपाइँको पठन कौशल कसरी सुधार गर्नुहुन्छ? यो प्रश्न को जवाफ वास्तव मा अर्को प्रश्न हो: पढ्ने कौशल को उपयोग के हो? पढ्न को मुख्य उद्देश्य 'भावना बनाउन' हो.""") ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "ne") .setInputCols(Array("document")) .setOutputCol("sentence") val pipeline = new Pipeline().setStages(Array(documenter, model)) val data = Seq("अंग्रेजी पढ्ने अनुच्छेद को एक महान स्रोत को लागी हेर्दै हुनुहुन्छ? तपाइँ सही ठाउँमा आउनुभएको छ. हालै गरिएको एक अध्ययन अनुसार आजको युवाहरुमा पढ्ने बानी छिटोछिटो घट्दै गएको छ. उनीहरु केहि सेकेन्ड भन्दा बढी को लागी एक दिईएको अंग्रेजी पढ्ने अनुच्छेद मा ध्यान केन्द्रित गर्न सक्दैनन्! साथै, पठन थियो र सबै प्रतियोगी परीक्षा को एक अभिन्न हिस्सा हो। त्यसोभए, तपाइँ तपाइँको पठन कौशल कसरी सुधार गर्नुहुन्छ? यो प्रश्न को जवाफ वास्तव मा अर्को प्रश्न हो: पढ्ने कौशल को उपयोग के हो? पढ्न को मुख्य उद्देश्य 'भावना बनाउन' हो.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load('ne.sentence_detector').predict("अंग्रेजी पढ्ने अनुच्छेद को एक महान स्रोत को लागी हेर्दै हुनुहुन्छ? तपाइँ सही ठाउँमा आउनुभएको छ. 
हालै गरिएको एक अध्ययन अनुसार आजको युवाहरुमा पढ्ने बानी छिटोछिटो घट्दै गएको छ. उनीहरु केहि सेकेन्ड भन्दा बढी को लागी एक दिईएको अंग्रेजी पढ्ने अनुच्छेद मा ध्यान केन्द्रित गर्न सक्दैनन्! साथै, पठन थियो र सबै प्रतियोगी परीक्षा को एक अभिन्न हिस्सा हो। त्यसोभए, तपाइँ तपाइँको पठन कौशल कसरी सुधार गर्नुहुन्छ? यो प्रश्न को जवाफ वास्तव मा अर्को प्रश्न हो: पढ्ने कौशल को उपयोग के हो? पढ्न को मुख्य उद्देश्य 'भावना बनाउन' हो.", output_level ='sentence') ```
## Results ```bash +-----------------------------------------------------------------------------------------------------------------------+ |result | +-----------------------------------------------------------------------------------------------------------------------+ |[अंग्रेजी पढ्ने अनुच्छेद को एक महान स्रोत को लागी हेर्दै हुनुहुन्छ?] | |[तपाइँ सही ठाउँमा आउनुभएको छ.] | |[हालै गरिएको एक अध्ययन अनुसार आजको युवाहरुमा पढ्ने बानी छिटोछिटो घट्दै गएको छ.] | |[उनीहरु केहि सेकेन्ड भन्दा बढी को लागी एक दिईएको अंग्रेजी पढ्ने अनुच्छेद मा ध्यान केन्द्रित गर्न सक्दैनन्!] | |[साथै, पठन थियो र सबै प्रतियोगी परीक्षा को एक अभिन्न हिस्सा हो। त्यसोभए, तपाइँ तपाइँको पठन कौशल कसरी सुधार गर्नुहुन्छ?] | |[यो प्रश्न को जवाफ वास्तव मा अर्को प्रश्न हो:] | |[पढ्ने कौशल को उपयोग के हो?] | |[ पढ्न को मुख्य उद्देश्य 'भावना बनाउन' हो।] | +-----------------------------------------------------------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sentence_detector_dl| |Compatibility:|Spark NLP 3.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[sentences]| |Language:|ne| ## Benchmarking ```bash label Accuracy Recall Prec F1 0 0.98 1.00 0.96 0.98 ``` --- layout: model title: Ganda asr_wav2vec2_luganda_by_indonesian_nlp TFWav2Vec2ForCTC from indonesian-nlp author: John Snow Labs name: asr_wav2vec2_luganda_by_indonesian_nlp date: 2022-09-24 tags: [wav2vec2, lg, audio, open_source, asr] task: Automatic Speech Recognition language: lg edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_luganda_by_indonesian_nlp` is a Ganda model originally trained by indonesian-nlp. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_luganda_by_indonesian_nlp_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_luganda_by_indonesian_nlp_lg_4.2.0_3.0_1664036258895.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_luganda_by_indonesian_nlp_lg_4.2.0_3.0_1664036258895.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_luganda_by_indonesian_nlp", "lg")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_luganda_by_indonesian_nlp", "lg") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
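The `audio_content` column consumed by `AudioAssembler` holds the raw waveform as an array of floats. A standard-library sketch of loading a mono 16-bit PCM WAV into such an array (the file name in the commented call is hypothetical):

```python
import struct
import wave

def wav_to_floats(path):
    """Read a mono 16-bit PCM WAV into a list of floats in [-1, 1]."""
    with wave.open(path, "rb") as w:
        raw = w.readframes(w.getnframes())
    # Each sample is a little-endian signed 16-bit integer.
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [s / 32768.0 for s in samples]

# audio_content = wav_to_floats("speech.wav")  # hypothetical input file
```

The resulting list can be placed in the DataFrame column that `AudioAssembler` reads.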
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_luganda_by_indonesian_nlp| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|lg| |Size:|1.2 GB| --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: roberta_qa_base_few_shot_k_64_finetuned_squad_seed_0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-64-finetuned-squad-seed-0` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_64_finetuned_squad_seed_0_en_4.3.0_3.0_1674215810378.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_few_shot_k_64_finetuned_squad_seed_0_en_4.3.0_3.0_1674215810378.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_64_finetuned_squad_seed_0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_few_shot_k_64_finetuned_squad_seed_0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_few_shot_k_64_finetuned_squad_seed_0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|419.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-64-finetuned-squad-seed-0 --- layout: model title: Part of Speech for Amharic author: John Snow Labs name: pos_ud_att date: 2021-03-09 tags: [part_of_speech, open_source, amharic, pos_ud_att, am] task: Part of Speech Tagging language: am edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - NOUN - DET - PART - VERB - PRON - PUNCT - AUX - PROPN - ADP - SCONJ - ADV - CCONJ - ADJ - INTJ - NUM {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_att_am_3.0.0_3.0_1615292425835.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_att_am_3.0.0_3.0_1615292425835.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_att", "am") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Hello from John Snow Labs!']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_att", "am") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Hello from John Snow Labs!").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs!"] token_df = nlu.load('am.pos').predict(text) token_df ```
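The averaged perceptron architecture mentioned in the description can be sketched in a few lines of plain Python: each feature of a token votes for tags with a learned weight, and the final weights are averaged over all training steps to reduce overfitting. This is only an illustrative sketch, not Spark NLP's actual `PerceptronModel` implementation:

```python
from collections import defaultdict

class AveragedPerceptron:
    """Toy averaged-perceptron tagger: features vote for tags with learned
    weights; averaging over training steps is done lazily via timestamps."""

    def __init__(self):
        self.weights = {}                  # feature -> {tag: weight}
        self._totals = defaultdict(float)  # (feature, tag) -> summed weight
        self._stamps = defaultdict(int)    # (feature, tag) -> last update step
        self.step = 0

    def predict(self, features):
        scores = defaultdict(float)
        for feat in features:
            for tag, weight in self.weights.get(feat, {}).items():
                scores[tag] += weight
        return max(scores, key=scores.get) if scores else None

    def update(self, truth, guess, features):
        self.step += 1
        if truth == guess:
            return
        for feat in features:
            tag_weights = self.weights.setdefault(feat, {})
            for tag, delta in ((truth, +1.0), (guess, -1.0)):
                key = (feat, tag)
                # credit the old weight for every step it was in effect
                self._totals[key] += (self.step - self._stamps[key]) * tag_weights.get(tag, 0.0)
                self._stamps[key] = self.step
                tag_weights[tag] = tag_weights.get(tag, 0.0) + delta

tagger = AveragedPerceptron()
tagger.update(truth="PROPN", guess="NOUN", features=["word=Snow", "prev_tag=PROPN"])
print(tagger.predict(["word=Snow", "prev_tag=PROPN"]))  # PROPN
```

The feature names (`word=…`, `prev_tag=…`) are hypothetical; a real tagger would use a richer feature set (suffixes, surrounding words, previous two tags).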
## Results ```bash token pos 0 Hello NOUN 1 from NOUN 2 John NOUN 3 Snow VERB 4 Labs VERB 5 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_att| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|am| --- layout: model title: Swedish asr_wav2vec2_large_voxrex_swedish_4gram TFWav2Vec2ForCTC from viktor-enzell author: John Snow Labs name: asr_wav2vec2_large_voxrex_swedish_4gram date: 2022-09-25 tags: [wav2vec2, sv, audio, open_source, asr] task: Automatic Speech Recognition language: sv edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_voxrex_swedish_4gram` is a Swedish model originally trained by viktor-enzell. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_voxrex_swedish_4gram_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_voxrex_swedish_4gram_sv_4.2.0_3.0_1664113748205.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_voxrex_swedish_4gram_sv_4.2.0_3.0_1664113748205.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_voxrex_swedish_4gram", "sv")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_voxrex_swedish_4gram", "sv") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
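The example above assumes an existing `audioDf` whose `audio_content` column holds the raw waveform as an array of floats. A minimal sketch of how such a frame could be assembled — the 16 kHz mono rate and the synthetic tone are illustrative assumptions, not part of this card:

```python
import math

# Wav2Vec2ForCTC consumes the raw waveform as a flat array of floats;
# 16 kHz mono is the rate the underlying wav2vec2 models were trained on.
SAMPLE_RATE = 16000

# Stand-in for a real recording: one second of a quiet 440 Hz tone.
audio = [0.1 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
         for t in range(SAMPLE_RATE)]

# With an active Spark session, the DataFrame used above could then be built:
# audioDf = spark.createDataFrame([(audio,)], ["audio_content"])
```

In practice the float array would come from decoding a real audio file (e.g. with librosa or the `wave` module) resampled to 16 kHz.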
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_voxrex_swedish_4gram| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|sv| |Size:|757.4 MB| --- layout: model title: Detect Clinical Entities (jsl_ner_wip_greedy_biobert) author: John Snow Labs name: jsl_ner_wip_greedy_biobert date: 2021-07-26 tags: [ner, licensed, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.1.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terminology. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. ## Predicted Entities `Test_Result`, `Relationship_Status`, `RelativeDate`, `Blood_Pressure`, `Triglycerides`, `Smoking`, `Pregnancy`, `Medical_History_Header`, `LDL`, `Hypertension`, `Hyperlipidemia`, `Frequency`, `BMI`, `Internal_organ_or_component`, `Allergen`, `Fetus_NewBorn`, `Substance_Quantity`, `Time`, `Temperature`, `Procedure`, `Strength`, `Treatment`, `HDL`, `Alcohol`, `Birth_Entity`, `Diet`, `Weight`, `Oxygen_Therapy`, `Injury_or_Poisoning`, `Section_Header`, `Obesity`, `EKG_Findings`, `Gender`, `Height`, `Social_History_Header`, `Diabetes`, `Route`, `Race_Ethnicity`, `Substance`, `Drug`, `External_body_part_or_region`, `RelativeTime`, `Admission_Discharge`, `Psychological_Condition`, `Total_Cholesterol`, `Labour_Delivery`, `Imaging_Technique`, `Date`, `Form`, `Overweight`, `Cerebrovascular_Disease`, `Vital_Signs_Header`, `Oncological`, `ImagingFindings`, `Communicable_Disease`, `Duration`, `Vaccine`, `Kidney_Disease`, `O2_Saturation`, `Heart_Disease`, `Employment`, 
`Sexually_Active_or_Sexual_Orientation`, `Test`, `Disease_Syndrome_Disorder`, `Respiration`, `Direction`, `Medical_Device`, `Clinical_Dept`, `Modifier`, `Symptom`, `Pulse`, `Age`, `Death_Entity`, `Dosage`, `Family_History_Header`, `VS_Finding` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_greedy_biobert_en_3.1.3_3.0_1627304288213.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_ner_wip_greedy_biobert_en_3.1.3_3.0_1627304288213.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") jsl_ner = MedicalNerModel.pretrained("jsl_ner_wip_greedy_biobert", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("jsl_ner") jsl_ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "jsl_ner"]) \ .setOutputCol("ner_chunk") jsl_ner_pipeline = Pipeline().setStages([ documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter]) jsl_ner_model = jsl_ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame([["""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature."""]]).toDF("text") result = jsl_ner_model.transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val jsl_ner = MedicalNerModel.pretrained("jsl_ner_wip_greedy_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("jsl_ner") val jsl_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "jsl_ner")) .setOutputCol("ner_chunk") val jsl_ner_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter)) val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text") val result = jsl_ner_pipeline.fit(data).transform(data) ```
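The per-label scores in the Benchmarking section below are derived from raw true-positive, false-positive, and false-negative counts; as a plain-Python reminder of how those columns relate (independent of Spark NLP):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts, as in the benchmark table."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# B-Oxygen_Therapy row (tp=47, fp=11, fn=10) reproduces 0.8103 / 0.8246 / 0.8174
print([round(x, 4) for x in prf(47, 11, 10)])

# The micro-average pools the counts over all labels before computing scores:
print([round(x, 4) for x in prf(65035, 12855, 11898)])  # ~0.8350 / 0.8453 / 0.8401
```

The macro-average, by contrast, is the unweighted mean of the per-label scores, which is why it is lower here: rare labels with weak scores count as much as frequent, well-predicted ones.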
## Results ```bash | | chunk | entity | |---:|:-----------------------------------------------|:-----------------------------| | 0 | 21-day-old | Age | | 1 | Caucasian | Race_Ethnicity | | 2 | male | Gender | | 3 | for 2 days | Duration | | 4 | congestion | Symptom | | 5 | mom | Gender | | 6 | suctioning yellow discharge | Symptom | | 7 | nares | External_body_part_or_region | | 8 | she | Gender | | 9 | mild problems with his breathing while feeding | Symptom | | 10 | perioral cyanosis | Symptom | | 11 | retractions | Symptom | | 12 | One day ago | RelativeDate | | 13 | mom | Gender | | 14 | tactile temperature | Symptom | | 15 | Tylenol | Drug | | 16 | Baby | Age | | 17 | decreased p.o. intake | Symptom | | 18 | His | Gender | | 19 | breast-feeding | External_body_part_or_region | | 20 | q.2h | Frequency | | 21 | to 5 to 10 minutes | Duration | | 22 | his | Gender | | 23 | respiratory congestion | Symptom | | 24 | He | Gender | | 25 | tired | Symptom | | 26 | fussy | Symptom | | 27 | over the past 2 days | RelativeDate | | 28 | albuterol | Drug | | 29 | ER | Clinical_Dept | | 30 | His | Gender | | 31 | urine output has also decreased | Symptom | | 32 | he | Gender | | 33 | per 24 hours | Frequency | | 34 | he | Gender | | 35 | per 24 hours | Frequency | | 36 | Mom | Gender | | 37 | diarrhea | Symptom | | 38 | His | Gender | | 39 | bowel | Internal_organ_or_component | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|jsl_ner_wip_greedy_biobert| |Compatibility:|Healthcare NLP 3.1.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on data gathered and manually annotated by John Snow Labs. 
https://www.johnsnowlabs.com/data/ ## Benchmarking ```bash label tp fp fn prec rec f1 B-Oxygen_Therapy 47 11 10 0.8103448 0.8245614 0.81739134 B-Cerebrovascular_Disease 43 20 21 0.6825397 0.671875 0.6771653 B-Triglycerides 5 0 0 1.0 1.0 1.0 I-Cerebrovascular_Disease 25 12 27 0.6756757 0.48076922 0.56179774 B-Medical_Device 2704 531 364 0.8358578 0.88135594 0.85800415 B-Labour_Delivery 43 16 29 0.7288136 0.5972222 0.6564886 I-Vaccine 5 0 5 1.0 0.5 0.6666667 I-Obesity 6 4 1 0.6 0.85714287 0.70588243 I-Smoking 3 1 2 0.75 0.6 0.6666667 B-RelativeTime 67 36 51 0.65048546 0.5677966 0.60633487 B-Imaging_Technique 33 12 19 0.73333335 0.63461536 0.68041235 B-Heart_Disease 285 55 68 0.8382353 0.8073654 0.82251084 B-Procedure 1876 303 384 0.8609454 0.8300885 0.84523547 I-RelativeTime 105 43 53 0.7094595 0.664557 0.6862745 B-Drug 1803 299 265 0.8577545 0.87185687 0.8647482 B-Obesity 29 9 5 0.7631579 0.85294116 0.8055555 I-RelativeDate 617 167 107 0.7869898 0.8522099 0.8183024 B-O2_Saturation 27 8 6 0.7714286 0.8181818 0.7941177 B-Direction 2856 390 326 0.8798521 0.89754874 0.88861233 I-Alcohol 4 4 4 0.5 0.5 0.5 I-Oxygen_Therapy 25 7 6 0.78125 0.8064516 0.79365087 B-Diet 23 14 32 0.6216216 0.4181818 0.5 B-Dosage 35 26 29 0.57377046 0.546875 0.55999994 B-Injury_or_Poisoning 308 52 83 0.85555553 0.7877238 0.82023966 B-Hypertension 80 9 2 0.8988764 0.9756098 0.9356726 I-Test_Result 124 73 156 0.6294416 0.44285715 0.5199161 B-Alcohol 54 11 12 0.830 0.8181818 0.8244275 B-Height 14 5 5 0.7368421 0.7368421 0.7368421 I-Substance 18 8 8 0.6923077 0.6923077 0.6923077 B-RelativeDate 372 109 93 0.7733888 0.8 0.78646934 B-Admission_Discharge 218 22 14 0.90833336 0.9396552 0.9237288 B-Date 345 24 26 0.93495935 0.9299191 0.9324324 B-Kidney_Disease 63 10 20 0.8630137 0.7590361 0.8076923 I-Strength 22 17 13 0.5641026 0.62857145 0.59459466 I-Injury_or_Poisoning 301 93 98 0.7639594 0.75438595 0.75914246 I-Time 28 11 17 0.71794873 0.62222224 0.6666667 B-Substance 48 11 10 0.8135593 0.82758623 
0.8205129 B-Total_Cholesterol 6 3 0 0.6666667 1.0 0.8 I-Vital_Signs_Header 276 28 8 0.90789473 0.97183096 0.93877554 I-Internal_organ_or_component 2907 518 490 0.8487591 0.8557551 0.8522427 B-Hyperlipidemia 28 3 0 0.9032258 1.0 0.9491525 B-Overweight 3 0 3 1.0 0.5 0.6666667 I-Sexually_Active_or_Sexual_Orientation 2 0 3 1.0 0.4 0.5714286 B-Sexually_Active_or_Sexual_Orientation 2 0 2 1.0 0.5 0.6666667 I-Fetus_NewBorn 50 38 58 0.5681818 0.46296296 0.5102041 B-BMI 6 0 1 1.0 0.85714287 0.9230769 B-ImagingFindings 52 41 61 0.5591398 0.460177 0.5048544 B-Test_Result 714 135 212 0.8409894 0.7710583 0.8045071 B-Section_Header 2140 79 65 0.9643984 0.97052157 0.96745026 I-Treatment 85 21 29 0.8018868 0.74561405 0.7727273 B-Clinical_Dept 638 82 77 0.88611114 0.8923077 0.88919866 I-Kidney_Disease 114 7 18 0.94214875 0.8636364 0.90118575 I-Pulse 189 27 42 0.875 0.8181818 0.84563756 B-Test 1589 320 315 0.83237296 0.83455884 0.83346444 B-Weight 54 12 13 0.8181818 0.80597013 0.81203 I-Respiration 114 4 17 0.9661017 0.870229 0.91566265 I-EKG_Findings 68 34 52 0.6666667 0.56666666 0.6126126 I-Section_Header 3828 168 77 0.957958 0.9802817 0.9689913 B-Strength 27 13 23 0.675 0.54 0.6 I-Social_History_Header 137 4 4 0.9716312 0.9716312 0.9716312 B-Vital_Signs_Header 183 18 7 0.9104478 0.9631579 0.9360614 B-Death_Entity 28 9 6 0.7567568 0.8235294 0.7887324 B-Modifier 302 90 282 0.77040815 0.5171233 0.6188525 B-Blood_Pressure 93 14 21 0.86915886 0.81578946 0.84162897 I-O2_Saturation 49 19 23 0.7205882 0.6805556 0.7 B-Frequency 437 77 68 0.8501946 0.86534655 0.8577036 I-Triglycerides 5 0 0 1.0 1.0 1.0 I-Duration 513 254 47 0.66883963 0.9160714 0.77317256 I-Diabetes 50 4 6 0.9259259 0.89285713 0.90909094 B-Race_Ethnicity 78 3 2 0.962963 0.975 0.9689441 I-Gender 114 2 17 0.98275864 0.870229 0.9230769 I-Height 43 13 10 0.76785713 0.8113208 0.78899086 B-Communicable_Disease 10 5 9 0.6666667 0.5263158 0.5882354 I-Family_History_Header 134 1 0 0.9925926 1.0 0.9962825 B-LDL 2 2 2 0.5 0.5 0.5 
I-Race_Ethnicity 6 0 0 1.0 1.0 1.0 B-Psychological_Condition 103 21 17 0.83064514 0.85833335 0.84426236 I-Age 116 14 50 0.8923077 0.6987952 0.78378385 B-EKG_Findings 33 18 32 0.64705884 0.50769234 0.56896555 B-Employment 168 29 44 0.8527919 0.7924528 0.8215159 I-Oncological 358 38 17 0.9040404 0.9546667 0.9286641 B-Time 27 7 18 0.7941176 0.6 0.68354434 B-Treatment 93 31 41 0.75 0.69402987 0.7209303 B-Temperature 69 5 8 0.9324324 0.8961039 0.9139073 I-Procedure 2437 379 501 0.86541194 0.8294758 0.84706295 B-Relationship_Status 30 3 1 0.90909094 0.9677419 0.9375 B-Pregnancy 56 17 30 0.7671233 0.6511628 0.7044025 I-Route 8 4 7 0.6666667 0.53333336 0.59259266 I-Medical_History_Header 151 4 15 0.9741936 0.9096386 0.94080997 I-Imaging_Technique 25 5 20 0.8333333 0.5555556 0.66666675 B-Smoking 74 6 4 0.925 0.94871795 0.93670887 I-Labour_Delivery 36 8 18 0.8181818 0.6666667 0.7346939 I-Death_Entity 3 0 2 1.0 0.6 0.75 B-Diabetes 77 9 5 0.89534885 0.9390244 0.9166666 B-Gender 4479 82 111 0.9820215 0.97581697 0.9789094 B-Vaccine 6 1 9 0.85714287 0.4 0.54545456 I-Heart_Disease 393 61 89 0.8656388 0.8153527 0.8397436 I-Dosage 31 27 22 0.5344828 0.5849057 0.5585586 B-Social_History_Header 78 2 3 0.975 0.962963 0.9689441 B-External_body_part_or_region 1640 402 311 0.8031342 0.8405946 0.8214376 I-Clinical_Dept 546 59 47 0.90247935 0.920742 0.91151917 I-Test 1195 320 402 0.7887789 0.748278 0.7679949 I-Frequency 340 97 120 0.77803206 0.73913044 0.75808245 B-Age 454 35 57 0.9284254 0.888454 0.908 B-Pulse 90 11 17 0.8910891 0.8411215 0.8653846 I-Symptom 4265 2050 1232 0.6753761 0.7758778 0.72214705 I-Pregnancy 39 28 42 0.58208954 0.4814815 0.527027 I-LDL 5 0 4 1.0 0.5555556 0.71428573 I-Diet 33 14 25 0.70212764 0.5689655 0.6285714 I-Blood_Pressure 198 54 27 0.78571427 0.88 0.83018863 I-ImagingFindings 136 99 85 0.57872343 0.61538464 0.5964913 I-Date 203 13 10 0.9398148 0.9530516 0.946387 B-Route 84 23 47 0.78504676 0.64122134 0.7058824 B-Duration 204 110 26 0.6496815 0.8869565 
0.74999994 B-Medical_History_Header 56 1 7 0.98245615 0.8888889 0.93333334 B-Respiration 55 4 6 0.9322034 0.90163934 0.9166667 I-External_body_part_or_region 314 105 167 0.74940336 0.65280664 0.6977778 I-BMI 15 0 1 1.0 0.9375 0.9677419 B-Internal_organ_or_component 4349 886 761 0.8307545 0.8510763 0.8407926 I-Weight 150 22 23 0.872093 0.867052 0.8695652 B-Disease_Syndrome_Disorder 1698 375 358 0.81910276 0.82587546 0.8224752 B-Symptom 4358 1002 932 0.8130597 0.8238185 0.8184037 B-VS_Finding 138 36 37 0.79310346 0.7885714 0.79083097 I-Disease_Syndrome_Disorder 1723 372 451 0.82243437 0.7925483 0.8072148 I-Drug 3282 838 493 0.79660195 0.86940396 0.8314123 I-Medical_Device 1864 418 242 0.81682736 0.88509023 0.84958977 B-Oncological 278 22 22 0.9266667 0.9266667 0.9266667 I-Temperature 111 8 6 0.9327731 0.94871795 0.94067794 I-Employment 92 27 19 0.77310926 0.8288288 0.8 I-Psychological_Condition 32 7 19 0.82051283 0.627451 0.7111111 B-Family_History_Header 68 0 0 1.0 1.0 1.0 I-Direction 311 91 144 0.7736318 0.6835165 0.72578764 Macro-average 65035 12855 11898 0.761429 0.70630085 0.7328297 Micro-average 65035 12855 11898 0.83495957 0.845346 0.8401207 ``` --- layout: model title: Translate Salishan languages to English Pipeline author: John Snow Labs name: translate_sal_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, sal, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. 
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences; the use of an accelerator such as a GPU is recommended. - source languages: `sal` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_sal_en_xx_2.7.0_2.4_1609686923332.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_sal_en_xx_2.7.0_2.4_1609686923332.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_sal_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_sal_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.sal.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_sal_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Proposal Classification author: John Snow Labs name: legclf_proposal_topic date: 2023-02-17 tags: [en, legal, classification, proposal, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Given a proposal on a socially important issue, this model classifies it according to its topic. ## Predicted Entities `Democracy`, `Digital`, `EU_In_The_World`, `Economy`, `Education`, `Green_Deal`, `Health`, `Migration`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_proposal_topic_en_1.0.0_3.0_1676594573703.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_proposal_topic_en_1.0.0_3.0_1676594573703.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_embeddings = nlp.UniversalSentenceEncoder.pretrained()\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") classifier = legal.ClassifierDLModel.pretrained("legclf_proposal_topic", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("class") clf_pipeline = nlp.Pipeline(stages=[ document_assembler, sentence_embeddings, classifier ]) empty_df = spark.createDataFrame([['']]).toDF("text") model = clf_pipeline.fit(empty_df) text = ["""In order to involve young people in the European Union, they need to understand the role, importance, and impact of the European Union on their lives and how they can contribute to the EU. I believe that many Europeans do not know the values of Europe, how they can contribute to the EU, etc. To do this, it was necessary to create an education program on the European Union that could cut across all countries, including a discipline on the EU, visits by young people to the European institutions, and a 'channel of communication' between young people and the EU. The same could be done for older people in senior universities."""] data = spark.createDataFrame([text]).toDF("text") result = model.transform(data) ```
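In the Benchmarking table below, the `weighted-avg` row is the support-weighted mean of the per-class scores, while `macro-avg` is the unweighted mean. A quick plain-Python sanity check over the reported per-class F1 values:

```python
# (label, f1, support) rows copied from the benchmark table below
rows = [
    ("Democracy", 0.88, 62), ("Digital", 0.82, 35), ("EU_In_The_World", 0.75, 39),
    ("Economy", 0.80, 43), ("Education", 0.88, 46), ("Green_Deal", 0.88, 49),
    ("Health", 0.91, 21), ("Migration", 0.87, 27), ("Other", 0.98, 32),
]
total = sum(support for _, _, support in rows)
macro_f1 = sum(f1 for _, f1, _ in rows) / len(rows)
weighted_f1 = sum(f1 * support for _, f1, support in rows) / total
print(total, round(macro_f1, 2), round(weighted_f1, 2))  # 354 0.86 0.86
```

Both averages round to the 0.86 reported in the table, with the weighted average pulled slightly toward the larger classes such as `Democracy`.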
## Results ```bash +-----------+ | result| +-----------+ |[Education]| +-----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_proposal_topic| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.6 MB| ## References Training dataset available [here](https://touche.webis.de/clef23/touche23-web/multilingual-stance-classification.html#data) ## Benchmarking ```bash label precision recall f1-score support Democracy 0.86 0.90 0.88 62 Digital 0.85 0.80 0.82 35 EU_In_The_World 0.78 0.72 0.75 39 Economy 0.82 0.77 0.80 43 Education 0.89 0.87 0.88 46 Green_Deal 0.85 0.92 0.88 49 Health 0.87 0.95 0.91 21 Migration 0.86 0.89 0.87 27 Other 1.00 0.97 0.98 32 accuracy - - 0.86 354 macro-avg 0.86 0.87 0.86 354 weighted-avg 0.86 0.86 0.86 354 ``` --- layout: model title: Marathi Bert Embeddings author: John Snow Labs name: bert_embeddings_marathi_bert date: 2022-04-11 tags: [bert, embeddings, mr, open_source] task: Embeddings language: mr edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `marathi-bert` is a Marathi model originally trained by `l3cube-pune`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_marathi_bert_mr_3.4.2_3.0_1649675042393.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_marathi_bert_mr_3.4.2_3.0_1649675042393.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_marathi_bert","mr") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["मला स्पार्क एनएलपी आवडते"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_marathi_bert","mr") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("मला स्पार्क एनएलपी आवडते").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("mr.embed.marathi_bert").predict("""मला स्पार्क एनएलपी आवडते""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_marathi_bert| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|mr| |Size:|668.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/l3cube-pune/marathi-bert - https://github.com/l3cube-pune/MarathiNLP - https://arxiv.org/abs/2202.01159 --- layout: model title: German Electra Embeddings (from stefan-it) author: John Snow Labs name: electra_embeddings_electra_base_gc4_64k_700000_cased_generator date: 2022-05-17 tags: [de, open_source, electra, embeddings] task: Embeddings language: de edition: Spark NLP 3.4.4 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-gc4-64k-700000-cased-generator` is a German model originally trained by `stefan-it`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_700000_cased_generator_de_3.4.4_3.0_1652786477097.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_gc4_64k_700000_cased_generator_de_3.4.4_3.0_1652786477097.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_700000_cased_generator","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_gc4_64k_700000_cased_generator","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_embeddings_electra_base_gc4_64k_700000_cased_generator| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|de| |Size:|222.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/stefan-it/electra-base-gc4-64k-700000-cased-generator - https://german-nlp-group.github.io/projects/gc4-corpus.html - https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf --- layout: model title: NER Pipeline for 6 Scandinavian Languages author: John Snow Labs name: bert_token_classifier_scandi_ner_pipeline date: 2022-02-15 tags: [danish, norwegian, swedish, icelandic, faroese, bert, xx, open_source] task: Named Entity Recognition language: xx edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the [bert_token_classifier_scandi_ner](https://nlp.johnsnowlabs.com/2021/12/09/bert_token_classifier_scandi_ner_xx.html) model, which is imported from `HuggingFace`. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_SCANDINAVIAN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_scandi_ner_pipeline_xx_3.4.0_3.0_1644927539839.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_scandi_ner_pipeline_xx_3.4.0_3.0_1644927539839.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python scandiner_pipeline = PretrainedPipeline("bert_token_classifier_scandi_ner_pipeline", lang = "xx") scandiner_pipeline.annotate("Hans er professor ved Statens Universitet, som ligger i København, og han er en rigtig københavner.") ``` ```scala val scandiner_pipeline = new PretrainedPipeline("bert_token_classifier_scandi_ner_pipeline", lang = "xx") scandiner_pipeline.annotate("Hans er professor ved Statens Universitet, som ligger i København, og han er en rigtig københavner.") ```
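Inside the pipeline, a NerConverter-style stage merges the token-level BIO tags emitted by the classifier into entity chunks. A minimal pure-Python sketch of that merging step (toy tags for the Danish example sentence, not actual model output):

```python
def bio_to_chunks(tokens, tags):
    """Merge token-level B-/I- tags into (chunk, label) spans,
    mirroring what a NerConverter-style stage does."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Hans", "er", "professor", "ved", "Statens", "Universitet"]
tags   = ["B-PER", "O", "O", "O", "B-ORG", "I-ORG"]
print(bio_to_chunks(tokens, tags))
# → [('Hans', 'PER'), ('Statens Universitet', 'ORG')]
```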
## Results ```bash +-------------------+---------+ |chunk |ner_label| +-------------------+---------+ |Hans |PER | |Statens Universitet|ORG | |København |LOC | |københavner |MISC | +-------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_scandi_ner_pipeline| |Type:|pipeline| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Language:|xx| |Size:|666.9 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - BertForTokenClassification - NerConverter - Finisher --- layout: model title: Pipeline to Detect diseases in text (large) author: John Snow Labs name: ner_diseases_large_pipeline date: 2023-03-14 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_diseases_large](https://nlp.johnsnowlabs.com/2021/04/01/ner_diseases_large_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_diseases_large_pipeline_en_4.3.0_3.2_1678836753073.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_diseases_large_pipeline_en_4.3.0_3.2_1678836753073.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_diseases_large_pipeline", "en", "clinical/models") text = '''Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_diseases_large_pipeline", "en", "clinical/models") val text = "Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. 
Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.diseases_large.pipeline").predict("""Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive.""") ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:----------------|--------:|------:|:------------|-------------:| | 0 | T-cell leukemia | 136 | 150 | Disease | 0.93585 | | 1 | T-cell leukemia | 402 | 416 | Disease | 0.9567 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_diseases_large_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from ChutianTao) author: John Snow Labs name: distilbert_qa_base_uncased_finetuned_squad_1 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad-1` is an English model originally trained by `ChutianTao`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad_1_en_4.3.0_3.0_1672773434592.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_finetuned_squad_1_en_4.3.0_3.0_1672773434592.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(False) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_finetuned_squad_1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
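For context, an extractive QA head like this one scores each token as a potential answer start or end and returns the highest-scoring span. A minimal sketch of that span-selection step with made-up logits (not the real model's scores):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick the (start, end) token pair maximizing start+end score,
    as extractive QA heads do, with end >= start and a length cap."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley"]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3]
end   = [0.1, 0.1, 0.2, 4.8, 0.1, 0.1, 0.1, 0.1, 0.4]
s, e = best_span(start, end)
print(" ".join(context[s:e + 1]))  # → Clara
```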
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_finetuned_squad_1| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/ChutianTao/distilbert-base-uncased-finetuned-squad-1 --- layout: model title: Sentence Entity Resolver for ICD10-CM (Augmented) author: John Snow Labs name: sbiobertresolve_icd10cm_augmented date: 2023-05-20 tags: [icd10cm, entity_resolution, clinical, en, licensed] task: Entity Resolution language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps extracted medical entities to ICD10-CM codes using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It has also been augmented with synonyms to improve accuracy, and it returns the official resolution text within brackets. ## Predicted Entities `ICD10CM Codes` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_4.4.2_3.0_1684591054927.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_icd10cm_augmented_en_4.4.2_3.0_1684591054927.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\ .setInputCols(["sentence","token","embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence","token","ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(['PROBLEM']) chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver]) data_ner = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. 
Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection."]]).toDF("text") results = nlpPipeline.fit(data_ner).transform(data_ner) ``` ```scala val document_assembler = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") .setInputCols(Array("sentence","token","embeddings")) .setOutputCol("ner") val ner_converter = NerConverter() .setInputCols(Array("sentence","token","ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("PROBLEM")) val chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sbert_embeddings") val icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented","en", "clinical/models") .setInputCols(Array("sbert_embeddings")) .setOutputCol("resolution") .setDistanceFunction("EUCLIDEAN") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, icd10_resolver)) val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with obesity with a body 
mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.icd10cm.augmented").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection.""") ```
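Conceptually, the resolver embeds each PROBLEM chunk and picks the ICD-10-CM code whose reference embedding is closest under the configured distance (`setDistanceFunction("EUCLIDEAN")`). A minimal pure-Python sketch of that nearest-neighbor step (the 3-dimensional vectors are invented for illustration; E66.9 (obesity) and R35 (polyuria) are real ICD-10-CM codes):

```python
import math

def resolve(chunk_vec, code_embeddings):
    """Return the code whose reference embedding is closest to the
    chunk embedding by Euclidean distance."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return min(code_embeddings, key=lambda code: dist(chunk_vec, code_embeddings[code]))

# toy 3-d embeddings for two hypothetical code entries
codes = {"E66.9": [0.9, 0.1, 0.0], "R35": [0.0, 0.8, 0.2]}
print(resolve([0.85, 0.15, 0.05], codes))  # → E66.9
```

The real model does this over the full augmented ICD-10-CM vocabulary using 768-dimensional Sentence-BERT embeddings.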
## Results ```bash +-------------------------------------+-------+------------+-------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------+ | ner_chunk| entity|icd10cm_code| resolutions | all_codes| +-------------------------------------+-------+------------+-------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------+ | gestational diabetes mellitus|PROBLEM| O24.4|gestational diabetes mellitus [gestational diabetes mellitus], gestational diabetes mellitus [gestational... |O24.4, O24.41, O24.43, Z86.32, Z87.5, O24.31, O24.11, O24.1, O24.81...| |subsequent type two diabetes mellitus|PROBLEM| O24.11|pre-existing type 2 diabetes mellitus [pre-existing type 2 diabetes mellitus], disorder associated with t... |O24.11, E11.8, E11, E13.9, E11.9, E11.3, E11.44, Z86.3, Z86.39, E11...| | obesity|PROBLEM| E66.9|obesity [obesity], abdominal obesity [abdominal obesity], obese [obese], central obesity [central obesity... |E66.9, E66.8, Z68.41, Q13.0, E66, E66.01, Z86.39, E34.9, H35.50, Z8...| | a body mass index|PROBLEM| Z68.41|finding of body mass index [finding of body mass index], observation of body mass index [observation of b... |Z68.41, E66.9, R22.9, Z68.1, R22.3, R22.1, Z68, R22.2, R22.0, R41.8...| | polyuria|PROBLEM| R35|polyuria [polyuria], nocturnal polyuria [nocturnal polyuria], polyuric state [polyuric state], polyuric ... |R35, R35.81, R35.8, E23.2, R35.89, R31, R35.0, R82.99, N40.1, E72.3...| | polydipsia|PROBLEM| R63.1|polydipsia [polydipsia], psychogenic polydipsia [psychogenic polydipsia], primary polydipsia [primary po... 
|R63.1, F63.89, E23.2, F63.9, O40, G47.5, M79.89, R63.2, R06.1, H53....| | poor appetite|PROBLEM| R63.0|poor appetite [poor appetite], poor feeding [poor feeding], bad taste in mouth [bad taste in mouth], unp... |R63.0, P92.9, R43.8, R43.2, E86, R19.6, F52.0, Z72.4, R06.89, Z76.8...| | vomiting|PROBLEM| R11.1|vomiting [vomiting], intermittent vomiting [intermittent vomiting], vomiting symptoms [vomiting symptom... |R11.1, R11, R11.10, G43.A1, P92.1, P92.09, G43.A, R11.13, R11.0 | | a respiratory tract infection|PROBLEM| J98.8|respiratory tract infection [respiratory tract infection], upper respiratory tract infection [upper respi... |J98.8, J06.9, A49.9, J22, J20.9, Z59.3, T17, J04.10, Z13.83, J18.9 ...| +-------------------------------------+-------+------------+-------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_icd10cm_augmented| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[icd10cm_code]| |Language:|en| |Size:|1.5 GB| |Case sensitive:|false| ## References Trained on ICD10CM 2023 Codes dataset: https://www.cdc.gov/nchs/icd/icd10cm.htm --- layout: model title: Legal Organisation Of Work And Working Conditions Document Classifier (EURLEX) author: John Snow Labs name: legclf_organisation_of_work_and_working_conditions_bert date: 2023-03-06 tags: [en, legal, classification, clauses, organisation_of_work_and_working_conditions, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal.
All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_organisation_of_work_and_working_conditions_bert model is a Bert Sentence Embeddings Document Classifier that predicts whether a given document belongs to the class Organisation_of_Work_and_Working_Conditions or not (binary classification) according to EuroVoc labels. ## Predicted Entities `Organisation_of_Work_and_Working_Conditions`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_organisation_of_work_and_working_conditions_bert_en_1.0.0_3.0_1678111659189.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_organisation_of_work_and_working_conditions_bert_en_1.0.0_3.0_1678111659189.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_organisation_of_work_and_working_conditions_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Organisation_of_Work_and_Working_Conditions]| |[Other]| |[Other]| |[Organisation_of_Work_and_Working_Conditions]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_organisation_of_work_and_working_conditions_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.5 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Organisation_of_Work_and_Working_Conditions 1.00 0.96 0.98 23 Other 0.96 1.00 0.98 24 accuracy - - 0.98 47 macro-avg 0.98 0.98 0.98 47 weighted-avg 0.98 0.98 0.98 47 ``` --- layout: model title: Croatian Legal Roberta Embeddings author: John Snow Labs name: roberta_base_croatian_legal date: 2023-02-16 tags: [hr, croatian, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: hr edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-croatian-roberta-base` is a Croatian model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_croatian_legal_hr_4.2.4_3.0_1676576201975.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_croatian_legal_hr_4.2.4_3.0_1676576201975.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_croatian_legal", "hr")\ .setInputCols(["sentence"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_croatian_legal", "hr") .setInputCols("sentence") .setOutputCol("embeddings") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_croatian_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|hr| |Size:|416.0 MB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-croatian-roberta-base --- layout: model title: English T5ForConditionalGeneration Base Cased model (from tennessejoyce) author: John Snow Labs name: t5_titlewave_base date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `titlewave-t5-base` is an English model originally trained by `tennessejoyce`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_titlewave_base_en_4.3.0_3.0_1675157398835.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_titlewave_base_en_4.3.0_3.0_1675157398835.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_titlewave_base","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_titlewave_base","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_titlewave_base| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|917.2 MB| ## References - https://huggingface.co/tennessejoyce/titlewave-t5-base - https://github.com/tennessejoyce/TitleWave - https://archive.org/details/stackexchange - https://github.com/tennessejoyce/TitleWave/blob/master/model_training/test_summarizer.ipynb --- layout: model title: Hindi asr_CDAC_hindispeechrecognition TFWav2Vec2ForCTC from nalini2799 author: John Snow Labs name: asr_CDAC_hindispeechrecognition date: 2022-09-26 tags: [wav2vec2, hi, audio, open_source, asr] task: Automatic Speech Recognition language: hi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_CDAC_hindispeechrecognition` is a Hindi model originally trained by nalini2799. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_CDAC_hindispeechrecognition_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_CDAC_hindispeechrecognition_hi_4.2.0_3.0_1664188438122.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_CDAC_hindispeechrecognition_hi_4.2.0_3.0_1664188438122.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_CDAC_hindispeechrecognition", "hi")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_CDAC_hindispeechrecognition", "hi") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
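Wav2Vec2ForCTC predicts one token distribution per audio frame; the transcript is obtained by CTC decoding, which collapses repeated frame predictions and removes blank tokens. A minimal greedy-decoding sketch in pure Python (toy ids and vocabulary, unrelated to this model's real Hindi vocabulary):

```python
def ctc_greedy_decode(frame_ids, id2char, blank=0):
    """Collapse repeated frame predictions and drop blanks —
    the greedy decoding step behind a CTC head."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(id2char[i])
        prev = i
    return "".join(out)

# toy per-frame argmax ids; 0 is the CTC blank token
id2char = {1: "h", 2: "i"}
print(ctc_greedy_decode([1, 1, 0, 2, 2, 0, 0], id2char))  # → hi
```

Note how the blank token lets CTC represent genuinely repeated characters: `[1, 0, 1]` decodes to `"hh"`, while `[1, 1]` collapses to `"h"`.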
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_CDAC_hindispeechrecognition| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|hi| |Size:|1.2 GB| --- layout: model title: Chinese Bert Embeddings (from shibing624) author: John Snow Labs name: bert_embeddings_macbert4csc_base_chinese date: 2022-04-11 tags: [bert, embeddings, zh, open_source] task: Embeddings language: zh edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `macbert4csc-base-chinese` is a Chinese model originally trained by `shibing624`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_macbert4csc_base_chinese_zh_3.4.2_3.0_1649669240219.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_macbert4csc_base_chinese_zh_3.4.2_3.0_1649669240219.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_macbert4csc_base_chinese","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_macbert4csc_base_chinese","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("zh.embed.macbert4csc_base_chinese").predict("""I love Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_macbert4csc_base_chinese| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|384.0 MB| |Case sensitive:|true| ## References - https://huggingface.co/shibing624/macbert4csc-base-chinese - https://github.com/shibing624/pycorrector - https://pan.baidu.com/s/1BV5tr9eONZCI0wERFvr0gQ - http://nlp.ee.ncu.edu.tw/resource/csc.html - https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml - https://github.com/shibing624/pycorrector/tree/master/pycorrector/macbert - https://arxiv.org/abs/2004.13922 --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_4_h_128 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-4_H-128` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_128_zh_4.2.4_3.0_1670325916540.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_4_h_128_zh_4.2.4_3.0_1670325916540.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_128","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_4_h_128","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_4_h_128| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|13.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-4_H-128 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://github.com/dbiir/UER-py/ - https://cloud.tencent.com/ --- layout: model title: Romanian T5ForConditionalGeneration Small Cased model (from BlackKakapo) author: John Snow Labs name: t5_small_paraphrase_v2 date: 2023-01-31 tags: [ro, open_source, t5, tensorflow] task: Text Generation language: ro edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-small-paraphrase-ro-v2` is a Romanian model originally trained by `BlackKakapo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_small_paraphrase_v2_ro_4.3.0_3.0_1675155540424.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_small_paraphrase_v2_ro_4.3.0_3.0_1675155540424.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_small_paraphrase_v2","ro") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_small_paraphrase_v2","ro") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_small_paraphrase_v2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|ro| |Size:|288.8 MB| ## References - https://huggingface.co/BlackKakapo/t5-small-paraphrase-ro-v2 - https://img.shields.io/badge/V.2-17.08.2022-brightgreen --- layout: model title: Persian Named Entity Recognition (from HooshvareLab) author: John Snow Labs name: bert_ner_bert_fa_zwnj_base_ner date: 2022-05-09 tags: [bert, ner, token_classification, fa, open_source] task: Named Entity Recognition language: fa edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-fa-zwnj-base-ner` is a Persian model originally trained by `HooshvareLab`. ## Predicted Entities `LOC`, `PRO`, `MON`, `TIM`, `PER`, `DAT`, `FAC`, `EVE`, `PCT`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_bert_fa_zwnj_base_ner_fa_3.4.2_3.0_1652099703343.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_bert_fa_zwnj_base_ner_fa_3.4.2_3.0_1652099703343.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_fa_zwnj_base_ner","fa") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["من عاشق جرقه nlp هستم"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_bert_fa_zwnj_base_ner","fa") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("من عاشق جرقه nlp هستم").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_bert_fa_zwnj_base_ner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|fa| |Size:|442.2 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/HooshvareLab/bert-fa-zwnj-base-ner - https://github.com/HaniehP/PersianNER - http://nsurl.org/2019-2/tasks/task-7-named-entity-recognition-ner-for-farsi/ - https://elisa-ie.github.io/wikiann/ - https://github.com/hooshvare/parsner/issues --- layout: model title: Financial News Summarization (Medium) author: John Snow Labs name: finsum_news_md date: 2022-11-24 tags: [en, licensed] task: Summarization language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Financial news Summarizer, finetuned with a financial dataset (about 10K news). ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finsum_news_md_en_1.0.0_3.0_1669312993098.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finsum_news_md_en_1.0.0_3.0_1669312993098.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") t5 = nlp.T5Transformer() \ .pretrained("finsum_news_md", "en", "finance/models") \ .setTask("summarization") \ .setInputCols(["documents"]) \ .setMaxOutputLength(512) \ .setOutputCol("summaries")  # the task prefix "summarize:" also works data_df = spark.createDataFrame([["Deere Grows Sales 37% as Shipments Rise. Farm equipment supplier forecasts higher sales in year ahead, lifted by price increases and infrastructure investments. Deere & Co. said its fiscal fourth-quarter sales surged 37% as supply constraints eased and the company shipped more of its farm and construction equipment. The Moline, Ill.-based company, the largest supplier of farm equipment in the U.S., said demand held up as it raised prices on farm equipment, and forecast sales gains in the year ahead. Chief Executive John May cited strong demand and increased investment in infrastructure projects as the Biden administration ramps up spending. Elevated crop prices have kept farmers interested in new machinery even as their own production expenses increase."]]).toDF("text") pipeline = nlp.Pipeline().setStages([document_assembler, t5]) results = pipeline.fit(data_df).transform(data_df) results.select("summaries.result").show(truncate=False) ```
## Results ```bash Deere & Co. said its fiscal fourth-quarter sales surged 37% as supply constraints eased and the company shipped more farm and construction equipment. Deere & Co. said its fiscal fourth-quarter sales surged 37% as supply constraints eased and the company shipped more farm and construction equipment. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finsum_news_md| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[summaries]| |Language:|en| |Size:|925.1 MB| ## References John Snow Labs in-house summarized articles. --- layout: model title: English image_classifier_vit_vc_bantai__withoutAMBI_adunest_v3 ViTForImageClassification from AykeeSalazar author: John Snow Labs name: image_classifier_vit_vc_bantai__withoutAMBI_adunest_v3 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`image_classifier_vit_vc_bantai__withoutAMBI_adunest_v3` is a English model originally trained by AykeeSalazar. ## Predicted Entities `nonViolation`, `publicDrinking`, `publicSmoking` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vc_bantai__withoutAMBI_adunest_v3_en_4.1.0_3.0_1660168819647.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_vc_bantai__withoutAMBI_adunest_v3_en_4.1.0_3.0_1660168819647.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_vc_bantai__withoutAMBI_adunest_v3", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_vc_bantai__withoutAMBI_adunest_v3", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_vc_bantai__withoutAMBI_adunest_v3| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: Igbo Named Entity Recognition (from Davlan) author: John Snow Labs name: distilbert_ner_distilbert_base_multilingual_cased_masakhaner date: 2022-05-16 tags: [distilbert, ner, token_classification, ig, open_source] task: Named Entity Recognition language: ig edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-multilingual-cased-masakhaner` is an Igbo model originally trained by `Davlan`. ## Predicted Entities `DATE`, `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_ner_distilbert_base_multilingual_cased_masakhaner_ig_3.4.2_3.0_1652721792853.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_ner_distilbert_base_multilingual_cased_masakhaner_ig_3.4.2_3.0_1652721792853.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_distilbert_base_multilingual_cased_masakhaner","ig") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Ahụrụ m n'anya na-atọ m ụtọ"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_ner_distilbert_base_multilingual_cased_masakhaner","ig") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Ahụrụ m n'anya na-atọ m ụtọ").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_ner_distilbert_base_multilingual_cased_masakhaner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ig| |Size:|505.8 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Davlan/distilbert-base-multilingual-cased-masakhaner - https://github.com/masakhane-io/masakhane-ner - https://github.com/masakhane-io/masakhane-ner - https://arxiv.org/abs/2103.11811 --- layout: model title: Named Entity Recognition (NER) Model in Finnish (GloVe 6B 300) author: John Snow Labs name: finnish_ner_6B_300 date: 2020-09-01 task: Named Entity Recognition language: fi edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [ner, fi, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Finnish NER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. The model is trained with GloVe 6B 300 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Product-`PRO`, Date-`DATE`, Event-`EVENT`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/PP_EXPLAIN_DOCUMENT_FI/){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/PP_EXPLAIN_DOCUMENT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/finnish_ner_6B_300_fi_2.6.0_2.4_1598965807718.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/finnish_ner_6B_300_fi_2.6.0_2.4_1598965807718.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang="xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("finnish_ner_6B_300", "fi") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([['']]).toDF('text')) result = pipeline_model.transform(spark.createDataFrame([["William Henry Gates III (28. lokakuuta 1955) on amerikkalaisia \u200b\u200bulkoministeriöitä, ohjelmistoja, sijoittajia ja filantroppeja. Microsoft on toiminut Microsoft Corporationin välittäjänä. I løbet af sin karriere hos Microsoft havde Gates stillinger som formand, administrerende direktør (administratorerendeøøør), præsident og chefsoftwarearkitekt, samtidig med at han var den største individualelle aktionær indtil maj 2014. mikrotietokonevoluutioille i 1970'erne 1980 1980erne. Født and opvokset i Seattle, Washington, var Gates grundlægger af Microsoft sammen med barndomsvennen Paul Allen i 1975 i Albuquerque, New Mexico; Det fortsatte med at blive verdens største virksomhed inden for personlig tietokoneohjelmistot. Gates førte virksomheden som formand and administratorer direktør, indtil han trådte tilbage som administrerende direktør tammikuu 2000, miehet han forblev formand blev chefsoftwarearkitekt. Olen slutningen 1990'erne var Gates blevet kritiseret for syn forretningstaktik, der er blevet betragtet som konkurrencebegrænsende. Denne udtalelse er blevet opretholdt ved adskillige retsafgørelser. Kesäkuun 2006 Meddelte Gates, at han ville overgå til en deltidsrolle i Microsoft og fuldtidsarbejde i Bill & Melinda Gates Foundation, det private velgørende fundament, som han og hans kone, Melinda Gates, oprettede i 2000. 
Han overførte gradvist sine pligter Tilaaja Ray Ozzie ja Craig Mundie. Han trådte tilbage som formand for Microsoft helmikuussa 2014 ja tiltrådte en ny stilling som teknologiatietojen antaja at støtte den nyudnævnte adminerende direktør Satya Nadella."]], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_6B_300", lang="xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("finnish_ner_6B_300", "fi") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (28. lokakuuta 1955) on amerikkalaisia ulkoministeriöitä, ohjelmistoja, sijoittajia ja filantroppeja. Microsoft on toiminut Microsoft Corporationin välittäjänä. I løbet af sin karriere hos Microsoft havde Gates stillinger som formand, administrerende direktør (administratorerendeøøør), præsident og chefsoftwarearkitekt, samtidig med at han var den største individualelle aktionær indtil maj 2014. mikrotietokonevoluutioille i 1970'erne 1980 1980erne. Født and opvokset i Seattle, Washington, var Gates grundlægger af Microsoft sammen med barndomsvennen Paul Allen i 1975 i Albuquerque, New Mexico; Det fortsatte med at blive verdens største virksomhed inden for personlig tietokoneohjelmistot. Gates førte virksomheden som formand and administratorer direktør, indtil han trådte tilbage som administrerende direktør tammikuu 2000, miehet han forblev formand blev chefsoftwarearkitekt. Olen slutningen 1990'erne var Gates blevet kritiseret for syn forretningstaktik, der er blevet betragtet som konkurrencebegrænsende. Denne udtalelse er blevet opretholdt ved adskillige retsafgørelser. 
Kesäkuun 2006 Meddelte Gates, at han ville overgå til en deltidsrolle i Microsoft og fuldtidsarbejde i Bill & Melinda Gates Foundation, det private velgørende fundament, som han og hans kone, Melinda Gates, oprettede i 2000. [9] Han overførte gradvist sine pligter Tilaaja Ray Ozzie ja Craig Mundie. Han trådte tilbage som formand for Microsoft helmikuussa 2014 ja tiltrådte en ny stilling som teknologiatietojen antaja at støtte den nyudnævnte adminerende direktør Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (28. lokakuuta 1955) on amerikkalaisia ​​ulkoministeriöitä, ohjelmistoja, sijoittajia ja filantroppeja. Microsoft on toiminut Microsoft Corporationin välittäjänä. I løbet af sin karriere hos Microsoft havde Gates stillinger som formand, administrerende direktør (administratorerendeøøør), præsident og chefsoftwarearkitekt, samtidig med at han var den største individualelle aktionær indtil maj 2014. mikrotietokonevoluutioille i 1970'erne 1980 1980erne. Født and opvokset i Seattle, Washington, var Gates grundlægger af Microsoft sammen med barndomsvennen Paul Allen i 1975 i Albuquerque, New Mexico; Det fortsatte med at blive verdens største virksomhed inden for personlig tietokoneohjelmistot. Gates førte virksomheden som formand and administratorer direktør, indtil han trådte tilbage som administrerende direktør tammikuu 2000, miehet han forblev formand blev chefsoftwarearkitekt. Olen slutningen 1990'erne var Gates blevet kritiseret for syn forretningstaktik, der er blevet betragtet som konkurrencebegrænsende. Denne udtalelse er blevet opretholdt ved adskillige retsafgørelser. Kesäkuun 2006 Meddelte Gates, at han ville overgå til en deltidsrolle i Microsoft og fuldtidsarbejde i Bill & Melinda Gates Foundation, det private velgørende fundament, som han og hans kone, Melinda Gates, oprettede i 2000. 
[9] Han overførte gradvist sine pligter Tilaaja Ray Ozzie ja Craig Mundie. Han trådte tilbage som formand for Microsoft helmikuussa 2014 ja tiltrådte en ny stilling som teknologiatietojen antaja at støtte den nyudnævnte adminerende direktør Satya Nadella."""] ner_df = nlu.load('fi.ner.6B_300d').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
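The card's description notes that this NER model does not read words directly but consumes word embeddings, where "more semantically similar words are closer together." A toy, self-contained illustration of that idea, using made-up 3-dimensional vectors (the real GloVe 6B 300 vectors have 300 dimensions; these values are invented for the sketch):

```python
import math

# Hypothetical 3-d vectors for illustration only; not actual GloVe values.
toy_vectors = {
    "kissa": [0.9, 0.1, 0.3],    # "cat"
    "koira": [0.8, 0.2, 0.25],   # "dog"
    "auto":  [0.1, 0.9, 0.7],    # "car"
}

def euclidean(u, v):
    """Distance between two embedding vectors: smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d_cat_dog = euclidean(toy_vectors["kissa"], toy_vectors["koira"])
d_cat_car = euclidean(toy_vectors["kissa"], toy_vectors["auto"])

# Semantically related words sit closer together than unrelated ones,
# which is what the downstream NerDLModel relies on.
assert d_cat_dog < d_cat_car
```

This is also why the card stresses using the same `glove_6B_300` embeddings the model was trained with: a NerDLModel trained on one embedding space cannot interpret vectors from a different one.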
{:.h2_title} ## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |William Henry Gates III|PER | |lokakuuta 1955 |DATE | |​​ulkoministeriöitä |ORG | |Microsoft |ORG | |Microsoft Corporationin|PRO | |Microsoft |ORG | |Gates |PER | |2014 |DATE | |1970'erne 1980 |DATE | |Seattle |LOC | |Washington |LOC | |Gates |PER | |Microsoft |ORG | |Paul Allen |PER | |Albuquerque |LOC | |New Mexico |LOC | |Gates |PER | |tammikuu 2000 |DATE | |Gates |PER | |Denne |PER | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finnish_ner_6B_300| |Type:|ner| |Compatibility:| Spark NLP 2.6.0+| |Edition:|Official| |License:|Open Source| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|fi| |Case sensitive:|false| {:.h2_title} ## Data Source The detailed information can be found from [https://www.aclweb.org/anthology/2020.lrec-1.567.pdf](https://www.aclweb.org/anthology/2020.lrec-1.567.pdf) --- layout: model title: Russian DistilBertForQuestionAnswering model (from AndrewChar) author: John Snow Labs name: distilbert_qa_model_QA_5_epoch_RU date: 2022-06-08 tags: [ru, open_source, distilbert, question_answering] task: Question Answering language: ru edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `model-QA-5-epoch-RU` is a Russian model originally trained by `AndrewChar`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_model_QA_5_epoch_RU_ru_4.0.0_3.0_1654728395737.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_model_QA_5_epoch_RU_ru_4.0.0_3.0_1654728395737.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_model_QA_5_epoch_RU","ru") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Как меня зовут?", "Меня зовут Клара, и я живу в Беркли."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_model_QA_5_epoch_RU","ru") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Как меня зовут?", "Меня зовут Клара, и я живу в Беркли.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ru.answer_question.distil_bert").predict("""Как меня зовут?|||"Меня зовут Клара, и я живу в Беркли.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_model_QA_5_epoch_RU| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ru| |Size:|505.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AndrewChar/model-QA-5-epoch-RU --- layout: model title: Chinese BertForQuestionAnswering model (from yechen) author: John Snow Labs name: bert_qa_question_answering_chinese date: 2022-06-02 tags: [zh, open_source, question_answering, bert] task: Question Answering language: zh edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `question-answering-chinese` is a Chinese model originally trained by `yechen`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_question_answering_chinese_zh_4.0.0_3.0_1654189170939.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_question_answering_chinese_zh_4.0.0_3.0_1654189170939.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_question_answering_chinese","zh") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_question_answering_chinese","zh") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("zh.answer_question.bert").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_question_answering_chinese| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|zh| |Size:|1.2 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/yechen/question-answering-chinese --- layout: model title: Persian BertForQuestionAnswering Cased model (from newsha) author: John Snow Labs name: bert_qa_pquad_2 date: 2022-07-07 tags: [fa, open_source, bert, question_answering] task: Question Answering language: fa edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `PQuAD_2` is a Persian model originally trained by `newsha`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_pquad_2_fa_4.0.0_3.0_1657182314348.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_pquad_2_fa_4.0.0_3.0_1657182314348.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_pquad_2","fa") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["اسم من چیست؟", "نام من کلارا است و من در برکلی زندگی می کنم."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_pquad_2","fa") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("اسم من چیست؟", "نام من کلارا است و من در برکلی زندگی می کنم.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_pquad_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|fa| |Size:|607.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/newsha/PQuAD_2 --- layout: model title: Universal Sentence Encoder author: John Snow Labs name: tfhub_use date: 2020-04-17 task: Embeddings language: en nav_key: models edition: Spark NLP 2.4.0 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: UniversalSentenceEncoder article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks. The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable length English text and the output is a 512 dimensional vector. We apply this model to the STS benchmark for semantic similarity, and the results can be seen in the example notebook made available. The universal-sentence-encoder model is trained with a deep averaging network (DAN) encoder. The details are described in the paper "[Universal Sentence Encoder](https://arxiv.org/abs/1803.11175)". 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/tfhub_use_en_2.4.0_2.4_1587136330099.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/tfhub_use_en_2.4.0_2.4_1587136330099.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = UniversalSentenceEncoder.pretrained("tfhub_use", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["I love NLP"], ["Many thanks"]], ["text"])) ``` ```scala ... val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I love NLP", "Many thanks").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed_sentence.tfhub_use').predict(text, output_level='sentence') embeddings_df ```
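Sentence embeddings such as these are typically compared with cosine similarity for semantic-similarity tasks like the STS benchmark mentioned above. A minimal sketch in plain Python (toy 3-dimensional vectors stand in for the model's real 512-dimensional output):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors; real USE embeddings have 512 dimensions.
v1 = [0.06, 0.02, -0.01]
v2 = [0.03, -0.04, -0.02]
print(cosine(v1, v2))
```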
{:.h2_title} ## Results ```bash sentence en_embed_sentence_tfhub_use_embeddings 0 I love NLP [0.06498772650957108, 0.01892215944826603, -0.... 1 Many thanks [0.0255892276763916, -0.042829226702451706, -0... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|tfhub_use| |Type:|embeddings| |Compatibility:|Spark NLP 2.4.0| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|512| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://tfhub.dev/google/universal-sentence-encoder/2](https://tfhub.dev/google/universal-sentence-encoder/2) --- layout: model title: Pipeline to Extract Entities Related to TNM Staging author: John Snow Labs name: ner_oncology_tnm_pipeline date: 2023-03-09 tags: [licensed, en, clinical, oncology, ner] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_oncology_tnm](https://nlp.johnsnowlabs.com/2022/11/24/ner_oncology_tnm_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_tnm_pipeline_en_4.3.0_3.2_1678352273944.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_tnm_pipeline_en_4.3.0_3.2_1678352273944.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_oncology_tnm_pipeline", "en", "clinical/models") text = '''The final diagnosis was metastatic breast carcinoma, and it was classified as T2N1M1 stage IV. The histological grade of this 4 cm tumor was grade 2.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_oncology_tnm_pipeline", "en", "clinical/models") val text = "The final diagnosis was metastatic breast carcinoma, and it was classified as T2N1M1 stage IV. The histological grade of this 4 cm tumor was grade 2." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-----------------|--------:|------:|:------------------|-------------:| | 0 | metastatic | 24 | 33 | Metastasis | 0.9999 | | 1 | breast carcinoma | 35 | 50 | Cancer_Dx | 0.9972 | | 2 | T2N1M1 stage IV | 78 | 92 | Staging | 0.905267 | | 3 | 4 cm | 126 | 129 | Tumor_Description | 0.85105 | | 4 | tumor | 131 | 135 | Tumor | 0.9926 | | 5 | grade 2 | 141 | 147 | Tumor_Description | 0.89705 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_tnm_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Legal NER for NDA (Termination Clause) author: John Snow Labs name: legner_nda_termination date: 2023-04-06 tags: [en, licensed, legal, ner, nda, termination] task: Named Entity Recognition language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is an NER model, intended to be run **only** after detecting the `TERMINATION` clause with a proper classifier (use legmulticlf_mnda_sections_paragraph_other for that purpose). It extracts the following entities: `TERM_DATE` and `REF_TERM_DATE`. ## Predicted Entities `TERM_DATE`, `REF_TERM_DATE` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_nda_termination_en_1.0.0_3.0_1680816697135.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_nda_termination_en_1.0.0_3.0_1680816697135.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_nda_termination", "en", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") model = nlpPipeline.fit(empty_data) text = ["""Except as otherwise specified herein, the obligations of the parties set forth in this Agreement shall terminate and be of no further force and effect eighteen months from the date hereof."""] result = model.transform(spark.createDataFrame([text]).toDF("text")) ```
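The `NerConverter` stage in the pipeline above groups the token-level IOB tags emitted by the NER model into entity chunks like `eighteen months` / `TERM_DATE`. A minimal sketch of that grouping logic (illustrative only, not Spark NLP's actual implementation):

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                       # flush the previous chunk
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)               # continue the open chunk
        else:                                 # "O" tag or broken sequence
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["terminate", "eighteen", "months", "from", "the", "date", "hereof"]
tags = ["O", "B-TERM_DATE", "I-TERM_DATE", "O", "O", "B-REF_TERM_DATE", "I-REF_TERM_DATE"]
print(iob_to_chunks(tokens, tags))
# -> [('eighteen months', 'TERM_DATE'), ('date hereof', 'REF_TERM_DATE')]
```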
## Results ```bash +---------------+-------------+ |chunk |ner_label | +---------------+-------------+ |eighteen months|TERM_DATE | |date hereof |REF_TERM_DATE| +---------------+-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_nda_termination| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|16.2 MB| ## References In-house annotations on non-disclosure agreements ## Benchmarking ```bash label precision recall f1-score support B-TERM_DATE 1.00 0.92 0.96 12 I-TERM_DATE 0.97 1.00 0.98 28 B-REF_TERM_DATE 0.91 0.91 0.91 11 I-REF_TERM_DATE 1.00 0.90 0.95 10 micro-avg 0.97 0.95 0.96 61 macro-avg 0.97 0.93 0.95 61 weighted-avg 0.97 0.95 0.96 61 ``` --- layout: model title: Danish asr_xls_r_300m_danish_nst_cv9 TFWav2Vec2ForCTC from chcaa author: John Snow Labs name: pipeline_asr_xls_r_300m_danish_nst_cv9 date: 2022-09-25 tags: [wav2vec2, da, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: da edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_300m_danish_nst_cv9` is a Danish model originally trained by chcaa.
NOTE: This pipeline only works on a CPU; if you need to run it on a GPU device, please use pipeline_asr_xls_r_300m_danish_nst_cv9_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_300m_danish_nst_cv9_da_4.2.0_3.0_1664102626776.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_xls_r_300m_danish_nst_cv9_da_4.2.0_3.0_1664102626776.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_xls_r_300m_danish_nst_cv9', lang = 'da') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_xls_r_300m_danish_nst_cv9", lang = "da") val annotations = pipeline.transform(audioDF) ```
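Transcriptions produced by ASR pipelines like this one are commonly evaluated with word error rate (WER), the word-level Levenshtein distance between the reference and the hypothesis, divided by the reference length. A minimal plain-Python sketch (illustrative only; production evaluation normally uses a dedicated library):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[-1][-1] / len(ref)

print(wer("det er en god dag", "det er en dag"))  # -> 0.2
```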
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_xls_r_300m_danish_nst_cv9| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|da| |Size:|757.6 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English financial Word Embeddings (Roberta) author: John Snow Labs name: roberta_embeddings_financial date: 2022-05-04 tags: [roberta, embeddings, en, open_source, financial] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Financial Pretrained Roberta Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `abhilash1910/financial_roberta` is an English model originally trained by `abhilash1910`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_financial_en_3.4.2_3.0_1651678861262.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_financial_en_3.4.2_3.0_1651678861262.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_financial","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I Love Spark-NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_financial","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I Love Spark-NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
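The pipeline above produces one embedding vector per token. For document-level tasks, token embeddings are often averaged into a single fixed-size vector (mean pooling). A minimal sketch in plain Python (toy 2-dimensional vectors stand in for the real embedding dimension):

```python
def mean_pool(token_vectors):
    """Average a list of token embeddings into one document vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

# Toy 2-d token vectors; real RoBERTa embeddings are much wider.
vecs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(vecs))  # -> [3.0, 4.0]
```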
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_financial| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|324.5 MB| |Case sensitive:|true| --- layout: model title: Relation Extraction Between Dates and Clinical Entities (ReDL) author: John Snow Labs name: redl_date_clinical_biobert date: 2021-06-01 tags: [licensed, en, clinical, relation_extraction] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.0.3 spark_version: 3.0 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Identify whether tests were conducted or a diagnosis was made on a particular date by checking relations between clinical entities and dates. `1`: the date and the clinical entity are related; `0`: they are not related. ## Predicted Entities `0`, `1` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_CLINICAL_DATE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.1.Clinical_Relation_Extraction_BodyParts_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_date_clinical_biobert_en_3.0.3_3.0_1622583984107.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_date_clinical_biobert_en_3.0.3_3.0_1622583984107.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = sparknlp.annotators.Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") words_embedder = WordEmbeddingsModel()\ .pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentences", "tokens"])\ .setOutputCol("embeddings") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") events_ner_tagger = MedicalNerModel.pretrained("ner_events_clinical", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_chunker = NerConverterInternal()\ .setInputCols(["sentences", "tokens", "ner_tags"])\ .setOutputCol("ner_chunks") dependency_parser = DependencyParserModel() \ .pretrained("dependency_conllu", "en") \ .setInputCols(["sentences", "pos_tags", "tokens"]) \ .setOutputCol("dependencies") events_re_ner_chunk_filter = RENerChunksFilter() \ .setInputCols(["ner_chunks", "dependencies"])\ .setOutputCol("re_ner_chunks") events_re_Model = RelationExtractionDLModel() \ .pretrained('redl_date_clinical_biobert', "en", "clinical/models")\ .setPredictionThreshold(0.5)\ .setInputCols(["re_ner_chunks", "sentences"]) \ .setOutputCol("relations") pipeline = Pipeline(stages=[ documenter, sentencer, tokenizer, words_embedder, pos_tagger, events_ner_tagger, ner_chunker, dependency_parser, events_re_ner_chunk_filter, events_re_Model]) data = spark.createDataFrame([['''This 73 y/o patient had CT on 1/12/95, with progressive memory and cognitive decline since 8/11/94.''']]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala ... 
val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols(Array("sentences")) .setOutputCol("tokens") val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val events_ner_tagger = MedicalNerModel.pretrained("ner_events_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_chunker = new NerConverterInternal() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") val events_re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setOutputCol("re_ner_chunks") val events_re_Model = RelationExtractionDLModel.pretrained("redl_date_clinical_biobert", "en", "clinical/models") .setPredictionThreshold(0.5f) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, words_embedder, pos_tagger, events_ner_tagger, ner_chunker, dependency_parser, events_re_ner_chunk_filter, events_re_Model)) val data = Seq("This 73 y/o patient had CT on 1/12/95, with progressive memory and cognitive decline since 8/11/94.").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.date").predict("""This 73 y/o patient had CT on 1/12/95, with progressive memory and cognitive decline since
8/11/94.""") ```
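The `setPredictionThreshold(0.5)` call in the pipeline above drops relation candidates whose confidence falls below the threshold. A minimal sketch of that filtering step (illustrative dictionaries, not Spark NLP's internal data structures):

```python
def filter_relations(relations, threshold=0.5):
    """Keep only relation predictions at or above the confidence threshold."""
    return [r for r in relations if r["confidence"] >= threshold]

# Hypothetical predictions shaped after the Results table below.
preds = [
    {"chunk1": "CT", "chunk2": "1/12/95", "relation": "1", "confidence": 0.97},
    {"chunk1": "CT", "chunk2": "8/11/94", "relation": "1", "confidence": 0.31},
]
print(filter_relations(preds))
```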
## Results ```bash | | relations | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence | |---|-----------|---------|---------------|-------------|------------------------------------------|---------|---------------|-------------|---------|------------| | 0 | 1 | Test | 24 | 25 | CT | Date | 31 | 37 | 1/12/95 | 1.0 | | 1 | 1 | Symptom | 45 | 84 | progressive memory and cognitive decline | Date | 92 | 98 | 8/11/94 | 1.0 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_date_clinical_biobert| |Compatibility:|Healthcare NLP 3.0.3+| |License:|Licensed| |Edition:|Official| |Language:|en| |Case sensitive:|true| ## Data Source Trained on an internal dataset. ## Benchmarking ```bash Relation Recall Precision F1 Support 0 0.738 0.729 0.734 84 1 0.945 0.947 0.946 416 Avg. 0.841 0.838 0.840 - ``` --- layout: model title: Malay DistilBERT Embeddings (from w11wo) author: John Snow Labs name: distilbert_embeddings_malaysian_distilbert_small date: 2022-04-12 tags: [distilbert, embeddings, ms, open_source] task: Embeddings language: ms edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: DistilBertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `malaysian-distilbert-small` is a Malay model originally trained by `w11wo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_malaysian_distilbert_small_ms_3.4.2_3.0_1649784041633.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_malaysian_distilbert_small_ms_3.4.2_3.0_1649784041633.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_malaysian_distilbert_small","ms") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Saya suka Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = DistilBertEmbeddings.pretrained("distilbert_embeddings_malaysian_distilbert_small","ms") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Saya suka Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ms.embed.distilbert").predict("""Saya suka Spark NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_malaysian_distilbert_small| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ms| |Size:|248.4 MB| |Case sensitive:|true| ## References - https://huggingface.co/w11wo/malaysian-distilbert-small - https://arxiv.org/abs/1910.01108 - https://github.com/sgugger - https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb - https://w11wo.github.io/ --- layout: model title: Fast Neural Machine Translation Model from English to Bantu Languages author: John Snow Labs name: opus_mt_en_bnt date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, en, bnt, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team, with help from many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations, and research projects.
- source languages: `en` - target languages: `bnt` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_en_bnt_xx_2.7.0_2.4_1609169302611.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_en_bnt_xx_2.7.0_2.4_1609169302611.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_en_bnt", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) data = ["text to translate"] result = light_pipeline.fullAnnotate(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_en_bnt", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(Seq.empty[String].toDS.toDF("text")).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.en.marian.translate_to.bnt').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_en_bnt| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Twi asr_wav2vec2large_xlsr_akan TFWav2Vec2ForCTC from azunre author: John Snow Labs name: asr_wav2vec2large_xlsr_akan date: 2022-09-24 tags: [wav2vec2, tw, audio, open_source, asr] task: Automatic Speech Recognition language: tw edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2large_xlsr_akan` is a Twi model originally trained by azunre. NOTE: This model only works on a CPU; if you need to run it on a GPU device, please use asr_wav2vec2large_xlsr_akan_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2large_xlsr_akan_tw_4.2.0_3.0_1664022240589.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2large_xlsr_akan_tw_4.2.0_3.0_1664022240589.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2large_xlsr_akan", "tw")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2large_xlsr_akan", "tw") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
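Conceptually, `Wav2Vec2ForCTC` models emit per-audio-frame character scores that are decoded with CTC: collapse consecutive repeated symbols, then drop the blank symbol. A minimal greedy-decoding sketch (toy vocabulary and frame ids, illustrative only; real decoders often add a language model or beam search):

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Greedy CTC decoding: collapse repeats, then drop blanks."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(id_to_char[i])
        prev = i
    return "".join(out)

# Toy vocabulary; id 0 is the CTC blank.
vocab = {1: "d", 2: "a", 3: "g"}
frames = [1, 1, 0, 2, 0, 0, 3, 3]
print(ctc_greedy_decode(frames, vocab))  # -> dag
```

Note that a blank between two identical ids keeps them distinct: `[1, 0, 1]` decodes to `"dd"`, while `[1, 1]` collapses to `"d"`.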
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2large_xlsr_akan| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|tw| |Size:|1.2 GB| --- layout: model title: Armenian Lemmatizer author: John Snow Labs name: lemma date: 2020-07-29 23:34:00 +0800 task: Lemmatization language: hy edition: Spark NLP 2.5.5 spark_version: 2.4 tags: [lemmatizer, hy] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb#scrollTo=bbzEH9u7tdxR){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_hy_2.5.5_2.4_1596054393298.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_hy_2.5.5_2.4_1596054393298.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... lemmatizer = LemmatizerModel.pretrained("lemma", "hy") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Հյուսիսային թագավոր լինելուց բացի, Johnոն Սնոուն անգլիացի բժիշկ է և անզգայացման և բժշկական հիգիենայի զարգացման առաջատար:") ``` ```scala ... val lemmatizer = LemmatizerModel.pretrained("lemma", "hy") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("Հյուսիսային թագավոր լինելուց բացի, Johnոն Սնոուն անգլիացի բժիշկ է և անզգայացման և բժշկական հիգիենայի զարգացման առաջատար:").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Հյուսիսային թագավոր լինելուց բացի, Johnոն Սնոուն անգլիացի բժիշկ է և անզգայացման և բժշկական հիգիենայի զարգացման առաջատար:"""] lemma_df = nlu.load('hy.lemma').predict(text, output_level='document') lemma_df.lemma.values[0] ```
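Conceptually, a lemmatizer maps each surface form to its dictionary root, falling back to the token itself when no entry exists, as the results below show for the Armenian example (`լինելուց` → `լինել`). A minimal sketch (hypothetical mini-dictionary; real models ship much larger, context-aware lookups):

```python
def lemmatize(tokens, lemma_dict):
    """Map each token to its dictionary lemma, falling back to the token."""
    return [lemma_dict.get(t, t) for t in tokens]

# Hypothetical mini-dictionary for illustration only.
lemmas = {"լինելուց": "լինել", "running": "run", "mice": "mouse"}
print(lemmatize(["mice", "running", "fast"], lemmas))
# -> ['mouse', 'run', 'fast']
```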
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=10, result='Հյուսիսային', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=12, end=18, result='թագավոր', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=20, end=27, result='լինել', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=29, end=32, result='բացի', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=33, end=33, result=',', metadata={'sentence': '0'}, embeddings=[]), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Type:|lemmatizer| |Compatibility:|Spark NLP 2.5.5+| |Edition:|Official| |Input labels:|[token]| |Output labels:|[lemma]| |Language:|hy| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://universaldependencies.org](https://universaldependencies.org) --- layout: model title: Danish BertForQuestionAnswering model (from jacobshein) author: John Snow Labs name: bert_qa_danish_bert_botxo_qa_squad date: 2022-06-02 tags: [da, open_source, question_answering, bert] task: Question Answering language: da edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `danish-bert-botxo-qa-squad` is a Danish model originally trained by `jacobshein`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_danish_bert_botxo_qa_squad_da_4.0.0_3.0_1654187371533.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_danish_bert_botxo_qa_squad_da_4.0.0_3.0_1654187371533.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_danish_bert_botxo_qa_squad","da") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_danish_bert_botxo_qa_squad","da") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("da.answer_question.squad.bert").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
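The predicted span lands in the `answer` output column of the transformed DataFrame. A Spark-free sketch of reading it back, using stand-in rows shaped like that output (the metadata `score` key is an assumption here, not confirmed by this card):

```python
# Each transformed row's "answer" column holds a list of annotations; the
# predicted span sits in the `result` field. The rows below are stand-ins
# shaped like that output, with illustrative values.
rows = [
    {"question": "What's my name?",
     "context": "My name is Clara and I live in Berkeley.",
     "answer": [{"result": "Clara", "metadata": {"score": "0.98"}}]},
]

def best_answer(row):
    """Return the first predicted span, or None if the model found nothing."""
    return row["answer"][0]["result"] if row["answer"] else None

for row in rows:
    print(row["question"], "->", best_answer(row))
```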
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_danish_bert_botxo_qa_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|da| |Size:|412.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/jacobshein/danish-bert-botxo-qa-squad - https://jacobhein.com/#contact - https://github.com/botxo/nordic_bert - https://github.com/ccasimiro88/TranslateAlignRetrieve/tree/multilingual/squads-tar/da --- layout: model title: SDOH Housing Insecurity For Classification author: John Snow Labs name: genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli date: 2023-04-27 tags: [en, licensed, clinical, sdoh, housing, biobert, generic_classifier, housing_insecurity] task: Text Classification language: en edition: Healthcare NLP 4.3.2 spark_version: 3.0 supported: true annotator: GenericClassifierModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Generic Classifier model detects whether the patient has housing insecurity. If the clinical note mentions housing problems, the model identifies them. If there is no housing issue or none is mentioned in the text, the note is classified as "no housing insecurity". The model is trained using the GenericClassifierApproach annotator. `Housing_Insecurity`: The patient has housing problems. `No_Housing_Insecurity`: The patient has no housing problems, or none are mentioned in the clinical notes.
## Predicted Entities `Housing_Insecurity`, `No_Housing_Insecurity` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli_en_4.3.2_3.0_1682607884617.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli_en_4.3.2_3.0_1682607884617.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") features_asm = FeaturesAssembler()\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("features") generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli", 'en', 'clinical/models')\ .setInputCols(["features"])\ .setOutputCol("class") pipeline = Pipeline(stages=[ document_assembler, sentence_embeddings, features_asm, generic_classifier ]) text_list = ["""Patient: Mary H. Background: Mary is a 40-year-old woman who has been diagnosed with asthma and allergies. She has been managing her conditions with medication and regular follow-up appointments with her healthcare provider. She lives in a rented apartment with her husband and two children and has been stably housed for the past five years. Presenting problem: Mary presents to the clinic for a routine check-up and reports no significant changes in her health status or symptoms related to her asthma or allergies. However, she expresses concerns about the quality of the air in her apartment and potential environmental triggers that could impact her health. Medical history: Mary has a medical history of asthma and allergies. She takes an inhaler and antihistamines to manage her conditions. Social history: Mary is married with two children and lives in a rented apartment. She and her husband both work full-time jobs and have health insurance. They have savings and are able to cover basic expenses. Assessment: The clinician assesses Mary's medical conditions and determines that her asthma and allergies are stable and well-controlled. 
The clinician also assesses Mary's housing situation and determines that her apartment building is in good condition and does not present any immediate environmental hazards. Plan: The clinician advises Mary to continue to monitor her health conditions and to report any changes or concerns to her healthcare team. The clinician also prescribes a referral to an allergist who can provide additional evaluation and treatment for her allergies. The clinician recommends that Mary and her family take steps to minimize potential environmental triggers in their apartment, such as avoiding smoking and using air purifiers. The clinician advises Mary to continue to maintain her stable housing situation and to seek assistance if any financial or housing issues arise. """, """Patient: Sarah L. Background: Sarah is a 35-year-old woman who has been experiencing housing insecurity for the past year. She was evicted from her apartment due to an increase in rent, which she could not afford, and has been staying with friends and family members ever since. She works as a part-time sales associate at a retail store and has no medical conditions. Presenting problem: Sarah presents to the clinic with complaints of increased stress and anxiety related to her housing insecurity. She reports feeling constantly on edge and worried about where she will sleep each night. She is also having difficulty concentrating at work and has been missing shifts due to her anxiety. Medical history: Sarah has no significant medical history and takes no medications. Social history: Sarah is currently single and has no children. She has a high school diploma but has not attended college. She has been working at her current job for three years and earns minimum wage. She has no savings and relies on her income to cover basic expenses. Assessment: The clinician assesses Sarah's mental health and determines that she is experiencing symptoms of anxiety and depression related to her housing insecurity. 
The clinician also assesses Sarah's housing situation and determines that she is at risk for homelessness if she is unable to secure stable housing soon. Plan: The clinician refers Sarah to a social worker who can help her connect with local housing resources, including subsidized housing programs and emergency shelters. The clinician also prescribes an antidepressant medication to help manage her symptoms of anxiety and depression. The clinician advises Sarah to continue to seek employment opportunities that may offer higher pay and stability."""] df = spark.createDataFrame(text_list, StringType()).toDF("text") result = pipeline.fit(df).transform(df) result.select("text", "class.result").show(truncate=100) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence_embeddings") val features_asm = new FeaturesAssembler() .setInputCols("sentence_embeddings") .setOutputCol("features") val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli", "en", "clinical/models") .setInputCols("features") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_embeddings, features_asm, generic_classifier)) val data = Seq("""Patient: Mary H. Background: Mary is a 40-year-old woman who has been diagnosed with asthma and allergies. She has been managing her conditions with medication and regular follow-up appointments with her healthcare provider. She lives in a rented apartment with her husband and two children and has been stably housed for the past five years. Presenting problem: Mary presents to the clinic for a routine check-up and reports no significant changes in her health status or symptoms related to her asthma or allergies. 
However, she expresses concerns about the quality of the air in her apartment and potential environmental triggers that could impact her health. Medical history: Mary has a medical history of asthma and allergies. She takes an inhaler and antihistamines to manage her conditions. Social history: Mary is married with two children and lives in a rented apartment. She and her husband both work full-time jobs and have health insurance. They have savings and are able to cover basic expenses. Assessment: The clinician assesses Mary's medical conditions and determines that her asthma and allergies are stable and well-controlled. The clinician also assesses Mary's housing situation and determines that her apartment building is in good condition and does not present any immediate environmental hazards. Plan: The clinician advises Mary to continue to monitor her health conditions and to report any changes or concerns to her healthcare team. The clinician also prescribes a referral to an allergist who can provide additional evaluation and treatment for her allergies. The clinician recommends that Mary and her family take steps to minimize potential environmental triggers in their apartment, such as avoiding smoking and using air purifiers. The clinician advises Mary to continue to maintain her stable housing situation and to seek assistance if any financial or housing issues arise. """, """Patient: Sarah L. Background: Sarah is a 35-year-old woman who has been experiencing housing insecurity for the past year. She was evicted from her apartment due to an increase in rent, which she could not afford, and has been staying with friends and family members ever since. She works as a part-time sales associate at a retail store and has no medical conditions. Presenting problem: Sarah presents to the clinic with complaints of increased stress and anxiety related to her housing insecurity. She reports feeling constantly on edge and worried about where she will sleep each night. 
She is also having difficulty concentrating at work and has been missing shifts due to her anxiety. Medical history: Sarah has no significant medical history and takes no medications. Social history: Sarah is currently single and has no children. She has a high school diploma but has not attended college. She has been working at her current job for three years and earns minimum wage. She has no savings and relies on her income to cover basic expenses. Assessment: The clinician assesses Sarah's mental health and determines that she is experiencing symptoms of anxiety and depression related to her housing insecurity. The clinician also assesses Sarah's housing situation and determines that she is at risk for homelessness if she is unable to secure stable housing soon. Plan: The clinician refers Sarah to a social worker who can help her connect with local housing resources, including subsidized housing programs and emergency shelters. The clinician also prescribes an antidepressant medication to help manage her symptoms of anxiety and depression. The clinician advises Sarah to continue to seek employment opportunities that may offer higher pay and stability.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
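Downstream, the classifier's string label is often mapped to a boolean flag per document. A Spark-free sketch, using stand-in rows shaped like the transformed DataFrame's `class.result` column (one predicted label per document; patient names are illustrative):

```python
# Stand-in rows shaped like the transformed DataFrame's "class.result" column.
rows = [
    {"patient": "Mary H.",  "class": ["No_Housing_Insecurity"]},
    {"patient": "Sarah L.", "class": ["Housing_Insecurity"]},
]

def has_housing_insecurity(row):
    """True when the single predicted label is Housing_Insecurity."""
    return bool(row["class"]) and row["class"][0] == "Housing_Insecurity"

flagged = [r["patient"] for r in rows if has_housing_insecurity(r)]
print(flagged)
```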
## Results ```bash +----------------------------------------------------------------------------------------------------+-----------------------+ | text| result| +----------------------------------------------------------------------------------------------------+-----------------------+ |Patient: Mary H.\n\nBackground: Mary is a 40-year-old woman who has been diagnosed with asthma an...|[No_Housing_Insecurity]| |Patient: Sarah L.\n\nBackground: Sarah is a 35-year-old woman who has been experiencing housing i...| [Housing_Insecurity]| +----------------------------------------------------------------------------------------------------+-----------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli| |Compatibility:|Healthcare NLP 4.3.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[features]| |Output Labels:|[prediction]| |Language:|en| |Size:|3.4 MB| |Dependencies:|sbiobert_base_cased_mli| ## References Internal SDOH Project ## Benchmarking ```bash label precision recall f1-score support Housing_Insecurity 0.83 0.81 0.82 64 No_Housing_Insecurity 0.86 0.87 0.86 83 accuracy - - 0.84 147 macro-avg 0.84 0.84 0.84 147 weighted-avg 0.84 0.84 0.84 147 ``` --- layout: model title: Detect Living Species (bert_embeddings_bert_base_fr_cased) author: John Snow Labs name: ner_living_species_bert date: 2022-06-23 tags: [fr, ner, clinical, licensed, bert] task: Named Entity Recognition language: fr edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model extracts mentions of living species from French clinical texts, a task critical to scientific disciplines such as medicine, biology, ecology/biodiversity, nutrition and agriculture. It is trained using `bert_embeddings_bert_base_fr_cased` embeddings.
It is trained on the [LivingNER](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) corpus that is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. **NOTE :** 1. The text files were translated from Spanish with a neural machine translation system. 2. The annotations were translated with the same neural machine translation system. 3. The translated annotations were transferred to the translated text files using an annotation transfer technology. ## Predicted Entities `HUMAN`, `SPECIES` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_fr_3.5.3_3.0_1655973142130.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_fr_3.5.3_3.0_1655973142130.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_fr_cased", "fr")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ner_model = MedicalNerModel.pretrained("ner_living_species_bert", "fr", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) data = spark.createDataFrame([["""Femme de 47 ans allergique à l'iode, fumeuse sociale, opérée pour des varices, deux césariennes et un abcès fessier. Vit avec son mari et ses trois enfants, travaille comme enseignante. Initialement, le patient a eu une bonne évolution, mais au 2ème jour postopératoire, il a commencé à montrer une instabilité hémodynamique. Les sérologies pour Coxiella burnetii, Bartonella henselae, Borrelia burgdorferi, Entamoeba histolytica, Toxoplasma gondii, herpès simplex virus 1 et 2, cytomégalovirus, virus d'Epstein Barr, virus de la varicelle et du zona et parvovirus B19 étaient négatives. 
Cependant, un test au rose Bengale positif pour Brucella, le test de Coombs et les agglutinations étaient également positifs avec un titre de 1/40."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_fr_cased", "fr") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_living_species_bert", "fr", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("""Femme de 47 ans allergique à l'iode, fumeuse sociale, opérée pour des varices, deux césariennes et un abcès fessier. Vit avec son mari et ses trois enfants, travaille comme enseignante. Initialement, le patient a eu une bonne évolution, mais au 2ème jour postopératoire, il a commencé à montrer une instabilité hémodynamique. Les sérologies pour Coxiella burnetii, Bartonella henselae, Borrelia burgdorferi, Entamoeba histolytica, Toxoplasma gondii, herpès simplex virus 1 et 2, cytomégalovirus, virus d'Epstein Barr, virus de la varicelle et du zona et parvovirus B19 étaient négatives. 
Cependant, un test au rose Bengale positif pour Brucella, le test de Coombs et les agglutinations étaient également positifs avec un titre de 1/40.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fr.med_ner.living_species.bert").predict("""Femme de 47 ans allergique à l'iode, fumeuse sociale, opérée pour des varices, deux césariennes et un abcès fessier. Vit avec son mari et ses trois enfants, travaille comme enseignante. Initialement, le patient a eu une bonne évolution, mais au 2ème jour postopératoire, il a commencé à montrer une instabilité hémodynamique. Les sérologies pour Coxiella burnetii, Bartonella henselae, Borrelia burgdorferi, Entamoeba histolytica, Toxoplasma gondii, herpès simplex virus 1 et 2, cytomégalovirus, virus d'Epstein Barr, virus de la varicelle et du zona et parvovirus B19 étaient négatives. Cependant, un test au rose Bengale positif pour Brucella, le test de Coombs et les agglutinations étaient également positifs avec un titre de 1/40.""") ```
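To pair each extracted chunk with its entity label, the `ner_chunk` annotations can be post-processed in plain Python. A Spark-free sketch with stand-in annotations shaped like that column (chunk text in `result`, label in `metadata["entity"]`; values taken from the example sentence, trimmed for brevity):

```python
# Stand-in annotations shaped like the "ner_chunk" output column.
ner_chunks = [
    {"result": "Femme",             "metadata": {"entity": "HUMAN"}},
    {"result": "patient",           "metadata": {"entity": "HUMAN"}},
    {"result": "Coxiella burnetii", "metadata": {"entity": "SPECIES"}},
]

# Zip each chunk with its entity label.
pairs = [(c["result"], c["metadata"]["entity"]) for c in ner_chunks]
for chunk, label in pairs:
    print(f"{chunk}\t{label}")
```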
## Results ```bash +--------------------------------+-------+ |ner_chunk |label | +--------------------------------+-------+ |Femme |HUMAN | |mari |HUMAN | |enfants |HUMAN | |patient |HUMAN | |Coxiella burnetii |SPECIES| |Bartonella henselae |SPECIES| |Borrelia burgdorferi |SPECIES| |Entamoeba histolytica |SPECIES| |Toxoplasma gondii |SPECIES| |cytomégalovirus |SPECIES| |virus d'Epstein Barr |SPECIES| |virus de la varicelle et du zona|SPECIES| |parvovirus B19 |SPECIES| |Brucella |SPECIES| +--------------------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_bert| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|fr| |Size:|16.4 MB| ## References [https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/](https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/) ## Benchmarking ```bash label precision recall f1-score support B-HUMAN 0.81 0.95 0.87 2549 B-SPECIES 0.66 0.87 0.75 2824 I-HUMAN 0.98 0.43 0.60 114 I-SPECIES 0.73 0.77 0.75 1109 micro-avg 0.73 0.87 0.80 6596 macro-avg 0.80 0.75 0.74 6596 weighted-avg 0.74 0.87 0.80 6596 ``` --- layout: model title: English RobertaForQuestionAnswering Cased model (from choosistant) author: John Snow Labs name: roberta_qa_model_fine_tuned date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `qa-model-fine-tuned` is an English model originally trained by `choosistant`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_model_fine_tuned_en_4.3.0_3.0_1674211892619.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_model_fine_tuned_en_4.3.0_3.0_1674211892619.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_model_fine_tuned","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_model_fine_tuned","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_model_fine_tuned| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/choosistant/qa-model-fine-tuned --- layout: model title: English DistilBertForQuestionAnswering model (from exafluence) author: John Snow Labs name: distilbert_qa_BERT_ClinicalQA date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `BERT-ClinicalQA` is an English model originally trained by `exafluence`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_BERT_ClinicalQA_en_4.0.0_3.0_1654722853170.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_BERT_ClinicalQA_en_4.0.0_3.0_1654722853170.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_BERT_ClinicalQA","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_BERT_ClinicalQA","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.clinical.distil_bert").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_BERT_ClinicalQA| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/exafluence/BERT-ClinicalQA --- layout: model title: Fast Neural Machine Translation Model from Japanese to English author: John Snow Labs name: opus_mt_jap_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, jap, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. - source languages: `jap` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_jap_en_xx_2.7.0_2.4_1609164435326.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_jap_en_xx_2.7.0_2.4_1609164435326.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_jap_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Your sentence to translate!") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_jap_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Your sentence to translate!").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.jap.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_jap_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Translate Congo Swahili to English Pipeline author: John Snow Labs name: translate_swc_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, swc, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended. - source languages: `swc` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_swc_en_xx_2.7.0_2.4_1609687888047.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_swc_en_xx_2.7.0_2.4_1609687888047.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_swc_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_swc_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.swc.translate_to.en').predict(text, output_level='sentence') translate_df ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|translate_swc_en|
|Type:|pipeline|
|Compatibility:|Spark NLP 2.7.0+|
|Edition:|Official|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)
---
layout: model
title: German ALBERT Embeddings (from abhilash1910)
author: John Snow Labs
name: albert_embeddings_albert_german_ner
date: 2022-04-14
tags: [albert, embeddings, de, open_source]
task: Embeddings
language: de
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: AlBertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained ALBERT Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `albert-german-ner` is a German model originally trained by `abhilash1910`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_german_ner_de_3.4.2_3.0_1649954270550.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/albert_embeddings_albert_german_ner_de_3.4.2_3.0_1649954270550.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_german_ner","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Ich liebe Funken NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = AlbertEmbeddings.pretrained("albert_embeddings_albert_german_ner","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Ich liebe Funken NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.embed.albert_german_ner").predict("""Ich liebe Funken NLP""") ```
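Once the `embeddings` column is collected from the pipeline above, token vectors can be compared directly, e.g. with cosine similarity. A minimal pure-Python sketch of that comparison, independent of Spark NLP (the 3-dimensional vectors are illustrative stand-ins — real ALBERT vectors are much larger):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors standing in for two token embeddings
liebe = [0.2, 0.8, 0.1]
mag = [0.25, 0.75, 0.05]
print(cosine_similarity(liebe, mag))
```

Values close to 1.0 indicate semantically similar tokens under the model's embedding space.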
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|albert_embeddings_albert_german_ner|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|de|
|Size:|44.8 MB|
|Case sensitive:|false|

## References

- https://huggingface.co/abhilash1910/albert-german-ner
---
layout: model
title: Pipeline to Mapping SNOMED Codes with Their Corresponding UMLS Codes
author: John Snow Labs
name: snomed_umls_mapping
date: 2023-06-13
tags: [en, licensed, clinical, pipeline, chunk_mapping, snomed, umls]
task: Chunk Mapping
language: en
edition: Healthcare NLP 4.4.4
spark_version: 3.2
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the `snomed_umls_mapper` model.

## Predicted Entities

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/snomed_umls_mapping_en_4.4.4_3.2_1686663535191.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/snomed_umls_mapping_en_4.4.4_3.2_1686663535191.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("snomed_umls_mapping", "en", "clinical/models")

result = pipeline.fullAnnotate("733187009 449433008 51264003")
```
```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = new PretrainedPipeline("snomed_umls_mapping", "en", "clinical/models")

val result = pipeline.fullAnnotate("733187009 449433008 51264003")
```

{:.nlu-block}
```python
import nlu
nlu.load("en.snomed.umls.mapping").predict("""Put your text here.""")
```
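Conceptually, the `ChunkMapperModel` inside this pipeline performs a dictionary lookup from SNOMED concept IDs to UMLS CUIs. A minimal pure-Python sketch of that idea, using only the three example codes from this card (a tiny stand-in for the pipeline's full mapping resource):

```python
# Tiny stand-in for the pipeline's SNOMED -> UMLS mapping resource
SNOMED_TO_UMLS = {
    "733187009": "C4546029",
    "449433008": "C3164619",
    "51264003": "C0271267",
}

def map_codes(text):
    """Map whitespace-separated SNOMED codes to UMLS CUIs; unknown codes map to None."""
    return [(code, SNOMED_TO_UMLS.get(code)) for code in text.split()]

print(map_codes("733187009 449433008 51264003"))
```

The real pipeline also tokenizes the input and handles codes it cannot map; this sketch only shows the lookup step.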
## Results

```bash
Results

|    | snomed_code                      | umls_code                      |
|---:|:---------------------------------|:-------------------------------|
|  0 | 733187009 | 449433008 | 51264003 | C4546029 | C3164619 | C0271267 |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|snomed_umls_mapping|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 4.4.4+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|5.1 MB|

## Included Models

- DocumentAssembler
- TokenizerModel
- ChunkMapperModel
---
layout: model
title: Context Spell Checker for English
author: John Snow Labs
name: spellcheck_dl
date: 2022-03-28
tags: [spellcheck, en, open_source]
task: Spell Check
language: en
nav_key: models
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
annotator: ContextSpellCheckerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Spell Checker is a sequence-to-sequence model that detects and corrects spelling errors in your input text. It’s based on Levenshtein Automaton for generating candidate corrections and a Neural Language Model for ranking corrections.

## Predicted Entities

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/CONTEXTUAL_SPELL_CHECKER/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CONTEXTUAL_SPELL_CHECKER.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spellcheck_dl_en_3.4.1_3.0_1648457196011.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/spellcheck_dl_en_3.4.1_3.0_1648457196011.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = RecursiveTokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")\
    .setPrefixes(["\"", "“", "(", "[", "\n", "."]) \
    .setSuffixes(["\"", "”", ".", ",", "?", ")", "]", "!", ";", ":", "'s", "’s"])

spellModel = ContextSpellCheckerModel\
    .pretrained("spellcheck_dl", "en")\
    .setInputCols("token")\
    .setOutputCol("checked")

pipeline = Pipeline(stages = [documentAssembler, tokenizer, spellModel])

empty_df = spark.createDataFrame([[""]]).toDF("text")
lp = LightPipeline(pipeline.fit(empty_df))

text = ["During the summer we have the best ueather.", "I have a black ueather jacket, so nice."]
lp.annotate(text)
```
```scala
val assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new RecursiveTokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")
    .setPrefixes(Array("\"", "“", "(", "[", "\n", "."))
    .setSuffixes(Array("\"", "”", ".", ",", "?", ")", "]", "!", ";", ":", "'s", "’s"))

val spellChecker = ContextSpellCheckerModel
    .pretrained("spellcheck_dl", "en")
    .setInputCols("token")
    .setOutputCol("checked")

val pipeline = new Pipeline().setStages(Array(assembler, tokenizer, spellChecker))

val emptyDf = Seq("").toDF("text")
val lp = new LightPipeline(pipeline.fit(emptyDf))

val text = Array("During the summer we have the best ueather.", "I have a black ueather jacket, so nice.")
lp.annotate(text)
```

{:.nlu-block}
```python
import nlu
nlu.load("spell").predict("""During the summer we have the best ueather.""")
```
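As the description notes, candidate corrections come from a Levenshtein automaton: only words within a small edit distance of the misspelled token are admitted before the language model ranks them. The core idea can be sketched in plain Python with a didactic dynamic-programming distance (not the automaton the annotator actually uses):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Both vocabulary words are one edit away from the typo "ueather";
# the neural language model then picks the one that fits the context,
# which is why the two example sentences get different corrections.
candidates = [w for w in ["weather", "leather", "winter"] if levenshtein("ueather", w) <= 1]
print(candidates)
```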
## Results

```bash
[{'checked': ['During', 'the', 'summer', 'we', 'have', 'the', 'best', 'weather', '.'],
  'document': ['During the summer we have the best ueather.'],
  'token': ['During', 'the', 'summer', 'we', 'have', 'the', 'best', 'ueather', '.']},
 {'checked': ['I', 'have', 'a', 'black', 'leather', 'jacket', ',', 'so', 'nice', '.'],
  'document': ['I have a black ueather jacket, so nice.'],
  'token': ['I', 'have', 'a', 'black', 'ueather', 'jacket', ',', 'so', 'nice', '.']}]
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|spellcheck_dl|
|Compatibility:|Spark NLP 3.4.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[corrected]|
|Language:|en|
|Size:|99.7 MB|

## References

Combination of custom data sets.
---
layout: model
title: RE Pipeline between Body Parts and Direction Entities
author: John Snow Labs
name: re_bodypart_directions_pipeline
date: 2022-03-31
tags: [licensed, clinical, relation_extraction, body_part, directions, en]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 3.4.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This pretrained pipeline is built on top of the [re_bodypart_directions](https://nlp.johnsnowlabs.com/2021/01/18/re_bodypart_directions_en.html) model.
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_BODYPART_ENT/){:.button.button-orange.button-orange-trans.arr.button-icon} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.1.Clinical_Relation_Extraction_BodyParts_Models.ipynb){:.button.button-orange.button-orange-trans.arr.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_bodypart_directions_pipeline_en_3.4.1_3.0_1648732927504.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_bodypart_directions_pipeline_en_3.4.1_3.0_1648732927504.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("re_bodypart_directions_pipeline", "en", "clinical/models") pipeline.fullAnnotate("MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("re_bodypart_directions_pipeline", "en", "clinical/models") pipeline.fullAnnotate("MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia") ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.bodypart_directions.pipeline").predict("""MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia""") ```
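Each relation row returned by `fullAnnotate` carries a confidence score, and downstream code typically keeps only positive relations above a threshold. A hedged pure-Python sketch over simplified rows shaped like this pipeline's output (the three rows reuse values from this card's example sentence; real output has more fields, such as entity offsets):

```python
# Simplified rows mimicking the relation-extraction output
rows = [
    {"relations": "1", "chunk1": "upper", "chunk2": "brain stem", "confidence": 0.9999989},
    {"relations": "0", "chunk1": "upper", "chunk2": "cerebellum", "confidence": 0.99992585},
    {"relations": "1", "chunk1": "left",  "chunk2": "cerebellum", "confidence": 1.0},
]

def positive_relations(rows, threshold=0.5):
    """Keep only rows labeled '1' (related) whose confidence clears the threshold."""
    return [(r["chunk1"], r["chunk2"]) for r in rows
            if r["relations"] == "1" and r["confidence"] >= threshold]

print(positive_relations(rows))
```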
## Results

```bash
| index | relations | entity1                     | entity1_begin | entity1_end | chunk1     | entity2                     | entity2_begin | entity2_end | chunk2        | confidence |
|-------|-----------|-----------------------------|---------------|-------------|------------|-----------------------------|---------------|-------------|---------------|------------|
| 0     | 1         | Direction                   | 35            | 39          | upper      | Internal_organ_or_component | 41            | 50          | brain stem    | 0.9999989  |
| 1     | 0         | Direction                   | 35            | 39          | upper      | Internal_organ_or_component | 59            | 68          | cerebellum    | 0.99992585 |
| 2     | 0         | Direction                   | 35            | 39          | upper      | Internal_organ_or_component | 81            | 93          | basil ganglia | 0.9999999  |
| 3     | 0         | Internal_organ_or_component | 41            | 50          | brain stem | Direction                   | 54            | 57          | left          | 0.999811   |
| 4     | 0         | Internal_organ_or_component | 41            | 50          | brain stem | Direction                   | 75            | 79          | right         | 0.9998203  |
| 5     | 1         | Direction                   | 54            | 57          | left       | Internal_organ_or_component | 59            | 68          | cerebellum    | 1.0        |
| 6     | 0         | Direction                   | 54            | 57          | left       | Internal_organ_or_component | 81            | 93          | basil ganglia | 0.97616416 |
| 7     | 0         | Internal_organ_or_component | 59            | 68          | cerebellum | Direction                   | 75            | 79          | right         | 0.953046   |
| 8     | 1         | Direction                   | 75            | 79          | right      | Internal_organ_or_component | 81            | 93          | basil ganglia | 1.0        |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|re_bodypart_directions_pipeline|
|Type:|pipeline|
|Compatibility:|Healthcare NLP 3.4.1+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|1.7 GB|

## Included Models

- DocumentAssembler
- SentenceDetector
- TokenizerModel
- PerceptronModel
- WordEmbeddingsModel
- MedicalNerModel
- NerConverter
- DependencyParserModel
- RelationExtractionModel
---
layout: model
title: Detect Adverse Drug Events (biobert)
author: John Snow Labs
name: ner_ade_biobert
date: 2021-04-01
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect adverse drug events in tweets, reviews, and medical text using pretrained NER model. ## Predicted Entities `DRUG`, `ADE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ADE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_ade_biobert_en_3.0.0_3.0_1617260850526.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_ade_biobert_en_3.0.0_3.0_1617260850526.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_ade_biobert", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])

model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text"))
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_ade_biobert", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))

val data = Seq("EXAMPLE_TEXT").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.med_ner.ade_biobert").predict("""Put your text here.""")
```
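The `NerConverter` stage above merges token-level BIO tags (`B-ADE`, `I-ADE`, `B-DRUG`, …) into entity chunks. A minimal sketch of that merging logic in plain Python (illustrative only — the real annotator also tracks character offsets and metadata; the sample sentence is made up):

```python
def bio_to_chunks(tokens, tags):
    """Collapse BIO-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O" or an inconsistent I- tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Aspirin", "gave", "me", "severe", "stomach", "pain"]
tags = ["B-DRUG", "O", "O", "B-ADE", "I-ADE", "I-ADE"]
print(bio_to_chunks(tokens, tags))
```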
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_ade_biobert|
|Compatibility:|Healthcare NLP 3.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|en|

## Benchmarking

```bash
label         precision  recall  f1-score  support
B-ADE              0.48    0.82      0.60     3582
B-DRUG             0.87    0.65      0.75    11763
I-ADE              0.48    0.76      0.59     4309
I-DRUG             0.95    0.28      0.43     7654
O                  0.97    0.98      0.97   303457
accuracy              -       -      0.95   330765
macro-avg          0.75    0.70      0.67   330765
weighted-avg       0.95    0.95      0.94   330765
```
---
layout: model
title: English BertForQuestionAnswering model (from SupriyaArun)
author: John Snow Labs
name: bert_qa_SupriyaArun_bert_base_uncased_finetuned_squad
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-uncased-finetuned-squad` is an English model originally trained by `SupriyaArun`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_SupriyaArun_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654181043121.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_SupriyaArun_bert_base_uncased_finetuned_squad_en_4.0.0_3.0_1654181043121.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_SupriyaArun_bert_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_SupriyaArun_bert_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.base_uncased.by_SupriyaArun").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
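Under the hood, extractive QA models like this one score every context token as a potential answer start or end, and the answer is the highest-scoring valid span (start ≤ end). A toy sketch of that span selection with made-up logits (the real model produces these scores from the question/context pair):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) maximizing start+end score, with start <= end and bounded length."""
    best, best_score = (0, 0), float("-inf")
    for s, ss in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = ss + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Hypothetical per-token logits for the question "What's my name?"
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.0, 0.1, 0.2, 1.0, 0.0]
end   = [0.0, 0.1, 0.2, 4.5, 0.1, 0.0, 0.1, 0.1, 2.0, 0.3]

s, e = best_span(start, end)
print(" ".join(tokens[s:e + 1]))
```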
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_SupriyaArun_bert_base_uncased_finetuned_squad|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|407.7 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/SupriyaArun/bert-base-uncased-finetuned-squad
---
layout: model
title: Legal Fees Clause Binary Classifier
author: John Snow Labs
name: legclf_fees_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `fees` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that this model's embeddings allow up to 512 tokens. If you have more than that, consider splitting the text into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in the Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`other`, `fees`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_fees_clause_en_1.0.0_3.2_1660122434131.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_fees_clause_en_1.0.0_3.2_1660122434131.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_fees_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
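As the description above recommends, long contracts should be split before classification — the simplest option being paragraph splitting by blank line. A minimal plain-Python sketch of that step (the two-paragraph `contract` string is a made-up example; the workshop notebook linked above shows richer splitting options):

```python
def split_paragraphs(document: str):
    """Split a contract on blank lines and drop empty fragments."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]

contract = """Fees. The Client shall pay all fees within 30 days of invoice.

Governing Law. This Agreement is governed by the laws of Delaware."""

paragraphs = split_paragraphs(contract)
# Each paragraph can now be sent through the classifier pipeline separately
print(len(paragraphs))
```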
## Results

```bash
+-------+
| result|
+-------+
| [fees]|
|[other]|
|[other]|
| [fees]|
+-------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_fees_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.1 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
label         precision  recall  f1-score  support
fees               0.95    0.94      0.94       81
other              0.98    0.99      0.98      284
accuracy              -       -      0.98      365
macro-avg          0.97    0.96      0.96      365
weighted-avg       0.98    0.98      0.98      365
```
---
layout: model
title: English BertForQuestionAnswering model (from aodiniz)
author: John Snow Labs
name: bert_qa_bert_uncased_L_2_H_512_A_8_squad2
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-2_H-512_A-8_squad2` is an English model originally trained by `aodiniz`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_2_H_512_A_8_squad2_en_4.0.0_3.0_1654185232766.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_2_H_512_A_8_squad2_en_4.0.0_3.0_1654185232766.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_2_H_512_A_8_squad2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_uncased_L_2_H_512_A_8_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.bert.uncased_2l_512d_a8a_512d").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_uncased_L_2_H_512_A_8_squad2|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|83.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/aodiniz/bert_uncased_L-2_H-512_A-8_squad2
---
layout: model
title: Persian RoBERTa Embeddings (from HooshvareLab)
author: John Snow Labs
name: roberta_embeddings_roberta_fa_zwnj_base
date: 2022-04-14
tags: [roberta, embeddings, fa, open_source]
task: Embeddings
language: fa
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: RoBertaEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `roberta-fa-zwnj-base` is a Persian model originally trained by `HooshvareLab`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_fa_zwnj_base_fa_3.4.2_3.0_1649948242326.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_roberta_fa_zwnj_base_fa_3.4.2_3.0_1649948242326.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_fa_zwnj_base","fa") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["من عاشق جرقه NLP هستم"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_roberta_fa_zwnj_base","fa") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("من عاشق جرقه NLP هستم").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("fa.embed.roberta_fa_zwnj_base").predict("""من عاشق جرقه NLP هستم""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_embeddings_roberta_fa_zwnj_base|
|Compatibility:|Spark NLP 3.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[bert]|
|Language:|fa|
|Size:|444.9 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/HooshvareLab/roberta-fa-zwnj-base
- https://github.com/hooshvare/roberta/issues
---
layout: model
title: English ElectraForQuestionAnswering model (from mrm8488)
author: John Snow Labs
name: electra_qa_base_finetuned_squadv1
date: 2022-06-22
tags: [en, open_source, electra, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-finetuned-squadv1` is an English model originally trained by `mrm8488`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_base_finetuned_squadv1_en_4.0.0_3.0_1655920647215.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_base_finetuned_squadv1_en_4.0.0_3.0_1655920647215.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_finetuned_squadv1","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_finetuned_squadv1","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squad.electra.base.by_mrm8488").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_base_finetuned_squadv1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|408.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/electra-base-finetuned-squadv1 - https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/ --- layout: model title: English image_classifier_vit_base_patch16_224_cifar10 ViTForImageClassification from karthiksv author: John Snow Labs name: image_classifier_vit_base_patch16_224_cifar10 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_patch16_224_cifar10` is an English model originally trained by karthiksv. ## Predicted Entities `deer`, `bird`, `dog`, `horse`, `automobile`, `truck`, `frog`, `ship`, `airplane`, `cat` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_cifar10_en_4.1.0_3.0_1660170286683.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_patch16_224_cifar10_en_4.1.0_3.0_1660170286683.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_patch16_224_cifar10", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_patch16_224_cifar10", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_patch16_224_cifar10| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English image_classifier_vit_base_beans_demo_v3 ViTForImageClassification from nateraw author: John Snow Labs name: image_classifier_vit_base_beans_demo_v3 date: 2022-08-10 tags: [vit, en, images, open_source] task: Image Classification language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: ViTForImageClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained VIT model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `image_classifier_vit_base_beans_demo_v3` is an English model originally trained by nateraw. ## Predicted Entities `angular_leaf_spot`, `bean_rust`, `healthy` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_beans_demo_v3_en_4.1.0_3.0_1660167976186.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/image_classifier_vit_base_beans_demo_v3_en_4.1.0_3.0_1660167976186.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python image_assembler = ImageAssembler() \ .setInputCol("image") \ .setOutputCol("image_assembler") imageClassifier = ViTForImageClassification \ .pretrained("image_classifier_vit_base_beans_demo_v3", "en")\ .setInputCols("image_assembler") \ .setOutputCol("class") pipeline = Pipeline(stages=[ image_assembler, imageClassifier, ]) pipelineModel = pipeline.fit(imageDF) pipelineDF = pipelineModel.transform(imageDF) ``` ```scala val imageAssembler = new ImageAssembler() .setInputCol("image") .setOutputCol("image_assembler") val imageClassifier = ViTForImageClassification .pretrained("image_classifier_vit_base_beans_demo_v3", "en") .setInputCols("image_assembler") .setOutputCol("class") val pipeline = new Pipeline().setStages(Array(imageAssembler, imageClassifier)) val pipelineModel = pipeline.fit(imageDF) val pipelineDF = pipelineModel.transform(imageDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|image_classifier_vit_base_beans_demo_v3| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[image_assembler]| |Output Labels:|[class]| |Language:|en| |Size:|321.9 MB| --- layout: model title: English XlmRoBertaForQuestionAnswering (from teacookies) author: John Snow Labs name: xlm_roberta_qa_autonlp_roberta_base_squad2_24465519 date: 2022-06-23 tags: [en, open_source, question_answering, xlmroberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autonlp-roberta-base-squad2-24465519` is an English model originally trained by `teacookies`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465519_en_4.0.0_3.0_1655986545762.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_roberta_qa_autonlp_roberta_base_squad2_24465519_en_4.0.0_3.0_1655986545762.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = XlmRoBertaForQuestionAnswering.pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465519","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = XlmRoBertaForQuestionAnswering .pretrained("xlm_roberta_qa_autonlp_roberta_base_squad2_24465519","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.xlm_roberta.base_24465519.by_teacookies").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlm_roberta_qa_autonlp_roberta_base_squad2_24465519| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[question, context]| |Output Labels:|[answer]| |Language:|en| |Size:|887.8 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/teacookies/autonlp-roberta-base-squad2-24465519 --- layout: model title: English T5ForConditionalGeneration Small Cased model (from google) author: John Snow Labs name: t5_efficient_small_nl4 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-small-nl4` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl4_en_4.3.0_3.0_1675122566099.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_small_nl4_en_4.3.0_3.0_1675122566099.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_small_nl4","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_small_nl4","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_small_nl4| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|119.7 MB| ## References - https://huggingface.co/google/t5-efficient-small-nl4 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Mapping RxNorm Codes with Corresponding UMLS Codes author: John Snow Labs name: rxnorm_umls_mapper date: 2022-06-24 tags: [rxnorm, umls, chunk_mapper, en, licensed] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.3 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps RxNorm codes with corresponding UMLS Codes. ## Predicted Entities {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_umls_mapper_en_3.5.3_3.0_1656088714126.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_umls_mapper_en_3.5.3_3.0_1656088714126.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("ner_chunk") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli", "en","clinical/models")\ .setInputCols(["ner_chunk"])\ .setOutputCol("sbert_embeddings") rxnorm_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models")\ .setInputCols(["ner_chunk", "sbert_embeddings"])\ .setOutputCol("rxnorm_code")\ .setDistanceFunction("EUCLIDEAN") chunkerMapper = ChunkMapperModel.pretrained("rxnorm_umls_mapper", "en", "clinical/models")\ .setInputCols(["rxnorm_code"])\ .setOutputCol("mappings")\ .setRels(["umls_code"]) pipeline = Pipeline( stages = [ documentAssembler, sbert_embedder, rxnorm_resolver, chunkerMapper ]) model = pipeline.fit(spark.createDataFrame([['']]).toDF('text')) lp = LightPipeline(model) result = lp.fullAnnotate(["amlodipine 5 MG", "hydrochlorothiazide 25 MG"]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings .pretrained("sbiobert_base_cased_mli", "en","clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sbert_embeddings") val rxnorm_resolver = SentenceEntityResolverModel .pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models") .setInputCols(Array("ner_chunk", "sbert_embeddings")) .setOutputCol("rxnorm_code") .setDistanceFunction("EUCLIDEAN") val chunkerMapper = ChunkMapperModel.pretrained("rxnorm_umls_mapper", "en", "clinical/models") .setInputCols(Array("rxnorm_code")) .setOutputCol("mappings") .setRels(Array("umls_code")) val pipeline = new Pipeline().setStages(Array(documentAssembler, sbert_embedder, rxnorm_resolver, chunkerMapper)) val model = pipeline.fit(Seq("").toDF("text")) val lp = new LightPipeline(model) val result = lp.fullAnnotate(Array("amlodipine 5 MG", "hydrochlorothiazide 25 MG")) ``` {:.nlu-block} 
```python import nlu nlu.load("en.rxnorm_to_umls").predict("""hydrochlorothiazide 25 MG""") ```
## Results ```bash | | chunk | rxnorm_code | umls_mappings | |---:|:--------------------------|--------------:|:----------------| | 0 | amlodipine 5 MG | 329528 | C1124796 | | 1 | hydrochlorothiazide 25 MG | 310798 | C0977518 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|rxnorm_umls_mapper| |Compatibility:|Healthcare NLP 3.5.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[rxnorm_code]| |Output Labels:|[mappings]| |Language:|en| |Size:|1.9 MB| --- layout: model title: Detect bacterial species (embeddings_clinical_large) author: John Snow Labs name: ner_bacterial_species_emb_clinical_large date: 2023-05-23 tags: [ner, clinical, bacterial_species, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.2 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect different types of species of bacteria in text using a pretrained NER model. ## Predicted Entities `SPECIES` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_BACTERIAL_SPECIES/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_bacterial_species_emb_clinical_large_en_4.4.2_3.0_1684854147723.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_bacterial_species_emb_clinical_large_en_4.4.2_3.0_1684854147723.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") species_ner = MedicalNerModel.pretrained("ner_bacterial_species_emb_clinical_large", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("species_ner") species_ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "species_ner"]) \ .setOutputCol("species_ner_chunk") species_ner_pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, word_embeddings, species_ner, species_ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") species_ner_model = species_ner_pipeline.fit(empty_data) results = species_ner_model.transform(spark.createDataFrame([[''' Proportions of Veillonella parvula and Prevotella melaninogenica were higher in saliva and on the lateral and dorsal surfaces of the tongue, while Streptococcus mitis and S. oralis were in significantly lower proportions in saliva and on the tongue dorsum. Cluster analysis resulted in the formation of 2 clusters with >85% similarity. Cluster 1 comprised saliva, lateral and dorsal tongue surfaces, while Cluster 2 comprised the remaining soft tissue locations. V. parvula, P. melaninogenica, Eikenella corrodens, Neisseria mucosa, Actinomyces odontolyticus, Fusobacterium periodonticum, F. nucleatum ss vincentii and Porphyromonas gingivalis were in significantly higher proportions in Cluster 1 and S. mitis, S. oralis and S. 
noxia were significantly higher in Cluster 2. These findings were confirmed using data from the 44 subjects providing plaque samples.''']]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val species_ner_model = MedicalNerModel.pretrained("ner_bacterial_species_emb_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("species_ner") val species_ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "species_ner")) .setOutputCol("species_ner_chunk") val species_pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, species_ner_model, species_ner_converter)) val data = Seq("""Proportions of Veillonella parvula and Prevotella melaninogenica were higher in saliva and on the lateral and dorsal surfaces of the tongue, while Streptococcus mitis and S. oralis were in significantly lower proportions in saliva and on the tongue dorsum. Cluster analysis resulted in the formation of 2 clusters with >85% similarity. Cluster 1 comprised saliva, lateral and dorsal tongue surfaces, while Cluster 2 comprised the remaining soft tissue locations. V. parvula, P. melaninogenica, Eikenella corrodens, Neisseria mucosa, Actinomyces odontolyticus, Fusobacterium periodonticum, F. nucleatum ss vincentii and Porphyromonas gingivalis were in significantly higher proportions in Cluster 1 and S. mitis, S. oralis and S. noxia were significantly higher in Cluster 2. These findings were confirmed using data from the 44 subjects providing plaque samples.""").toDS.toDF("text") val result = species_pipeline.fit(data).transform(data) ```
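As a quick sanity check, the F1 score reported in the Benchmarking section of this card follows directly from the precision and recall of the `SPECIES` label (with a single label, the micro, macro, and weighted averages all coincide):

```python
# F1 is the harmonic mean of precision and recall
# (values taken from the Benchmarking table).
precision, recall = 0.80, 0.82

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.81
```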
## Results ```bash | | chunks | begin | end | entities | |---:|:----------------------------|--------:|------:|:-----------| | 0 | Veillonella parvula | 16 | 34 | SPECIES | | 1 | Prevotella melaninogenica | 40 | 64 | SPECIES | | 2 | Streptococcus mitis | 148 | 166 | SPECIES | | 3 | S. oralis | 172 | 180 | SPECIES | | 4 | V. parvula | 464 | 473 | SPECIES | | 5 | P. melaninogenica | 476 | 492 | SPECIES | | 6 | Eikenella corrodens | 495 | 513 | SPECIES | | 7 | Neisseria mucosa | 516 | 531 | SPECIES | | 8 | Actinomyces odontolyticus | 534 | 558 | SPECIES | | 9 | Fusobacterium periodonticum | 561 | 587 | SPECIES | | 10 | F. nucleatum ss vincentii | 590 | 614 | SPECIES | | 11 | Porphyromonas gingivalis | 620 | 643 | SPECIES | | 12 | S. mitis | 703 | 710 | SPECIES | | 13 | S. oralis | 713 | 721 | SPECIES | | 14 | S. noxia | 727 | 734 | SPECIES | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_bacterial_species_emb_clinical_large| |Compatibility:|Healthcare NLP 4.4.2+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|2.8 MB| ## Benchmarking ```bash label precision recall f1-score support SPECIES 0.80 0.82 0.81 1810 micro-avg 0.80 0.82 0.81 1810 macro-avg 0.80 0.82 0.81 1810 weighted-avg 0.80 0.82 0.81 1810 ``` --- layout: model title: English T5ForConditionalGeneration Base Cased model (from google) author: John Snow Labs name: t5_efficient_base_dm512 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`t5-efficient-base-dm512` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_base_dm512_en_4.3.0_3.0_1675110784924.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_base_dm512_en_4.3.0_3.0_1675110784924.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_base_dm512","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_base_dm512","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_base_dm512| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|316.6 MB| ## References - https://huggingface.co/google/t5-efficient-base-dm512 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: T5 for Passive to Active Style Transfer author: John Snow Labs name: t5_passive_to_active_styletransfer date: 2022-01-12 tags: [t5, open_source, en] task: Text Generation language: en nav_key: models edition: Spark NLP 3.4.0 spark_version: 3.0 supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a text-to-text model based on T5 fine-tuned to generate actively written text from a passively written text input, for the task "transfer Passive to Active:". It is based on Prithiviraj Damodaran's Styleformer. ## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/T5_LINGUISTIC/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/T5_LINGUISTIC.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_passive_to_active_styletransfer_en_3.4.0_3.0_1641987698487.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_passive_to_active_styletransfer_en_3.4.0_3.0_1641987698487.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import sparknlp from sparknlp.base import * from sparknlp.annotator import * spark = sparknlp.start() documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("documents") t5 = T5Transformer.pretrained("t5_passive_to_active_styletransfer") \ .setTask("transfer Passive to Active:") \ .setInputCols(["documents"]) \ .setMaxOutputLength(200) \ .setOutputCol("transfers") pipeline = Pipeline().setStages([documentAssembler, t5]) data = spark.createDataFrame([["A letter was sent to you."]]).toDF("text") result = pipeline.fit(data).transform(data) result.select("transfers.result").show(truncate=False) ``` ```scala import spark.implicits._ import com.johnsnowlabs.nlp.base.DocumentAssembler import com.johnsnowlabs.nlp.annotators.seq2seq.T5Transformer import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("documents") val t5 = T5Transformer.pretrained("t5_passive_to_active_styletransfer") .setTask("transfer Passive to Active:") .setMaxOutputLength(200) .setInputCols("documents") .setOutputCol("transfer") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("A letter was sent to you.").toDF("text") val result = pipeline.fit(data).transform(data) result.select("transfer.result").show(false) ``` {:.nlu-block} ```python import nlu nlu.load("en.t5.passive_to_active_styletransfer").predict("""A letter was sent to you.""") ```
## Results ```bash +-------------------+ |result | +-------------------+ |[you sent a letter]| +-------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_passive_to_active_styletransfer| |Compatibility:|Spark NLP 3.4.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[transfers]| |Language:|en| |Size:|265.0 MB| ## Data Source The original model is from the transformers library: https://huggingface.co/prithivida/passive_to_active_styletransfer --- layout: model title: Financial Question Answering (Bert) author: John Snow Labs name: finqa_bert date: 2023-01-03 tags: [en, licensed] task: Question Answering language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Financial Bert-based Question Answering model, trained on squad-v2, finetuned on proprietary Financial questions and answers. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finqa_bert_en_1.0.0_3.0_1672759463237.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finqa_bert_en_1.0.0_3.0_1672759463237.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) spanClassifier = nlp.BertForQuestionAnswering.pretrained("finqa_bert","en", "finance/models") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = nlp.Pipeline().setStages([ documentAssembler, spanClassifier ]) example = spark.createDataFrame([["On which market is their common stock traded?", "Our common stock is traded on the Nasdaq Global Select Market under the symbol CDNS."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) result.select('answer.result').show() ```
## Results ```bash `Nasdaq Global Select Market under the symbol CDNS` ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finqa_bert| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References Trained on squad-v2, fine-tuned on proprietary Financial questions and answers. --- layout: model title: Legal Disclosure Clause Binary Classifier author: John Snow Labs name: legclf_disclosure_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `disclosure` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `disclosure` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_disclosure_clause_en_1.0.0_3.2_1660122351983.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_disclosure_clause_en_1.0.0_3.2_1660122351983.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_disclosure_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
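The description above notes that this classifier can be combined with other clause classifiers, each contributing a True/False verdict. A plain-Python sketch of merging such per-model outputs into one flag map per clause (the model names and label lists here are illustrative, not actual pipeline output):

```python
def combine_clause_predictions(predictions: dict) -> list:
    """predictions maps classifier name -> list of predicted labels, one per clause.
    Returns, for each clause, a dict of clause_type -> True/False."""
    n = len(next(iter(predictions.values())))
    combined = []
    for i in range(n):
        row = {}
        for name, labels in predictions.items():
            # Derive the clause type from the conventional model name.
            clause_type = name.replace("legclf_", "").replace("_clause", "")
            # Any label other than "other" means the clause type was detected.
            row[clause_type] = (labels[i] != "other")
        combined.append(row)
    return combined

preds = {
    "legclf_disclosure_clause": ["disclosure", "other"],
    "legclf_notice_clause": ["other", "notice"],
}
print(combine_clause_predictions(preds))
```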
## Results

```bash
+------------+
|      result|
+------------+
|[disclosure]|
|     [other]|
|     [other]|
|[disclosure]|
+------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_disclosure_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.9 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
  disclosure       0.93    0.82      0.87       33
       other       0.93    0.98      0.95       86
    accuracy          -       -      0.93      119
   macro-avg       0.93    0.90      0.91      119
weighted-avg       0.93    0.93      0.93      119
```

---
layout: model
title: Adverse Drug Events Classifier (LogReg)
author: John Snow Labs
name: classifier_logreg_ade
date: 2023-05-16
tags: [text_classification, ade, logreg, en, clinical, licensed]
task: Text Classification
language: en
edition: Healthcare NLP 4.4.1
spark_version: 3.0
supported: true
annotator: DocumentLogRegClassifierModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is trained with the Logistic Regression algorithm and classifies text/sentences into two categories:

- `True` : The sentence is talking about a possible ADE.
- `False` : The sentence doesn't have any information about an ADE.

The corpus used for model training is the ADE-Corpus-V2 Dataset: Adverse Drug Reaction Data. This is a dataset for classifying whether a sentence is ADE-related (True) or not (False).

## Predicted Entities

`True`, `False`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/classifier_logreg_ade_en_4.4.1_3.0_1684248428027.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/classifier_logreg_ade_en_4.4.1_3.0_1684248428027.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

logreg = DocumentLogRegClassifierModel.pretrained("classifier_logreg_ade", "en", "clinical/models")\
    .setInputCols("token")\
    .setOutputCol("prediction")

clf_Pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    logreg])

data = spark.createDataFrame([["""None of the patients required treatment for the overdose."""], ["""Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient."""]]).toDF("text")

result = clf_Pipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val logreg = DocumentLogRegClassifierModel.pretrained("classifier_logreg_ade", "en", "clinical/models")
    .setInputCols("token")
    .setOutputCol("prediction")

val clf_Pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, logreg))

val data = Seq("None of the patients required treatment for the overdose.", "Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.").toDF("text")

val result = clf_Pipeline.fit(data).transform(data)
```
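Conceptually, a bag-of-words logistic regression like this one sums per-token weights and squashes the result through a sigmoid to get a probability. A toy sketch of that scoring step (the weights below are made up for illustration, not the trained model's actual coefficients):

```python
import math

def logistic_score(tokens, weights, bias=0.0):
    """Bag-of-words logistic regression: sum per-token weights, apply sigmoid."""
    z = bias + sum(weights.get(t.lower(), 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-z))

# Toy weights, NOT the model's real coefficients.
weights = {"induced": 2.0, "overdose": -0.5, "required": -1.0}
tokens = "aspirin - induced asthma".split()
prob_ade = logistic_score(tokens, weights)
print("True" if prob_ade > 0.5 else "False")
```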
## Results ```bash +----------------------------------------------------------------------------------------+-------+ |text |result | +----------------------------------------------------------------------------------------+-------+ |Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.|[True] | |None of the patients required treatment for the overdose. |[False]| +----------------------------------------------------------------------------------------+-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|classifier_logreg_ade| |Compatibility:|Healthcare NLP 4.4.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[prediction]| |Language:|en| |Size:|596.0 KB| ## References The corpus used for model training is ADE-Corpus-V2 Dataset: Adverse Drug Reaction Data. This is a dataset for classification of a sentence if it is ADE-related (True) or not (False). Reference: Gurulingappa et al., Benchmark Corpus to Support Information Extraction for Adverse Drug Effects, JBI, 2012. http://www.sciencedirect.com/science/article/pii/S1532046412000615 ## Benchmarking ```bash label precision recall f1-score support False 0.91 0.92 0.92 3362 True 0.79 0.79 0.79 1361 accuracy - - 0.88 4723 macro_avg 0.85 0.85 0.85 4723 weighted_avg 0.88 0.88 0.88 4723 ``` --- layout: model title: Explain Document pipeline for Hebrew (explain_document_lg) author: John Snow Labs name: explain_document_lg date: 2021-04-30 tags: [hebrew, ner, he, open_source, explain_document_lg, pipeline] task: Named Entity Recognition language: he edition: Spark NLP 3.0.2 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The explain_document_lg is a pre-trained pipeline that we can use to process text with a simple pipeline that performs basic processing steps and recognizes entities. 
It performs most of the common text processing tasks on your dataframe.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_he_3.0.2_3.0_1619775273050.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/explain_document_lg_he_3.0.2_3.0_1619775273050.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('explain_document_lg', lang = 'he')
annotations = pipeline.fullAnnotate("""היי, מעבדות ג'ון סנו!""")[0]
annotations.keys()
```
```scala
val pipeline = new PretrainedPipeline("explain_document_lg", lang = "he")
val result = pipeline.fullAnnotate("היי, מעבדות ג'ון סנו!")(0)
```

{:.nlu-block}
```python
import nlu
nlu.load("he.explain_document").predict("""היי, מעבדות ג'ון סנו!""")
```
## Results ```bash +----------------------+------------------------+----------------------+---------------------------+--------------------+---------+ | text| document| sentence| token| ner|ner_chunk| +----------------------+------------------------+----------------------+---------------------------+--------------------+---------+ | היי ג'ון מעבדות שלג! |[ היי ג'ון מעבדות שלג! ]|[היי ג'ון מעבדות שלג!]|[היי, ג'ון, מעבדות, שלג, !]|[O, B-PERS, O, O, O]| [ג'ון]| +----------------------+------------------------+----------------------+---------------------------+--------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|explain_document_lg| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.2+| |License:|Open Source| |Edition:|Official| |Language:|he| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - WordEmbeddingsModel - NerDLModel - NerConverter --- layout: model title: Multilingual XLMRobertaForTokenClassification Base Cased model (from dkasti) author: John Snow Labs name: xlmroberta_ner_dkasti_base_finetuned_panx_all date: 2022-08-13 tags: [xx, open_source, xlm_roberta, ner] task: Named Entity Recognition language: xx edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-all` is a Multilingual model originally trained by `dkasti`. 
## Predicted Entities `ORG`, `LOC`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_dkasti_base_finetuned_panx_all_xx_4.1.0_3.0_1660428009273.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_dkasti_base_finetuned_panx_all_xx_4.1.0_3.0_1660428009273.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_dkasti_base_finetuned_panx_all","xx") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_dkasti_base_finetuned_panx_all","xx")
    .setInputCols(Array("document", "token"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("document", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
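The `NerConverter` stage above groups IOB-tagged tokens into entity chunks. A minimal plain-Python sketch of that grouping logic (toy tokens and tags, independent of Spark NLP):

```python
def bio_to_chunks(tokens, tags):
    """Collapse B-/I-/O tags into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)  # continue the open chunk
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["John", "Snow", "works", "at", "Acme"]
tags   = ["B-PER", "I-PER", "O", "O", "B-ORG"]
print(bio_to_chunks(tokens, tags))  # [('John Snow', 'PER'), ('Acme', 'ORG')]
```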
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlmroberta_ner_dkasti_base_finetuned_panx_all|
|Compatibility:|Spark NLP 4.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[ner]|
|Language:|xx|
|Size:|861.7 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/dkasti/xlm-roberta-base-finetuned-panx-all

---
layout: model
title: Notice Clause Binary Classifier
author: John Snow Labs
name: legclf_notice_clause
date: 2022-12-16
tags: [en, legal, notice, classification, licensed, tensorflow]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `notice` clause type (where information about people, addresses, notice methods, etc. is mentioned). To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.

## Predicted Entities

`notice`, `other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_notice_clause_en_1.0.0_3.0_1671208792323.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_notice_clause_en_1.0.0_3.0_1671208792323.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = legal.ClassifierDLModel.pretrained("legclf_notice_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier ]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+--------+
|  result|
+--------+
|[notice]|
| [other]|
| [other]|
|[notice]|
+--------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_notice_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.8 MB|

## References

Legal documents, scraped from the Internet and classified in-house, plus SEC documents

## Benchmarking

```bash
       label  precision  recall  f1-score  support
      notice       1.00    0.98      0.99       64
       other       0.99    1.00      0.99       92
    accuracy          -       -      0.99      156
   macro-avg       0.99    0.99      0.99      156
weighted-avg       0.99    0.99      0.99      156
```

---
layout: model
title: English T5ForConditionalGeneration Small Cased model (from HUPD)
author: John Snow Labs
name: t5_hupd_small
date: 2023-01-30
tags: [en, open_source, t5, tensorflow]
task: Text Generation
language: en
nav_key: models
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: T5Transformer
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `hupd-t5-small` is an English model originally trained by `HUPD`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_hupd_small_en_4.3.0_3.0_1675102875669.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_hupd_small_en_4.3.0_3.0_1675102875669.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

t5 = T5Transformer.pretrained("t5_hupd_small","en") \
    .setInputCols("document") \
    .setOutputCol("answers")

pipeline = Pipeline(stages=[documentAssembler, t5])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val t5 = T5Transformer.pretrained("t5_hupd_small","en")
    .setInputCols("document")
    .setOutputCol("answers")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|t5_hupd_small|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[t5]|
|Language:|en|
|Size:|289.1 MB|

## References

- https://huggingface.co/HUPD/hupd-t5-small
- https://patentdataset.org/
- https://github.com/suzgunmirac/hupd

---
layout: model
title: English DistilBertForQuestionAnswering model (from armageddon)
author: John Snow Labs
name: distilbert_qa_base_uncased_squad2_covid_qa_deepset
date: 2022-06-08
tags: [en, open_source, distilbert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: DistilBertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-squad2-covid-qa-deepset` is an English model originally trained by `armageddon`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_covid_qa_deepset_en_4.0.0_3.0_1654727294594.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_squad2_covid_qa_deepset_en_4.0.0_3.0_1654727294594.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2_covid_qa_deepset","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_squad2_covid_qa_deepset","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.answer_question.squadv2_covid.distil_bert.base_uncased").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
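Extractive QA models like this one score every context token as a possible answer start and end, then return the best-scoring span. A framework-free sketch of that span-selection step (the logits below are made up for illustration):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick the (start, end) token pair maximizing start+end score, end >= start."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

context = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Toy logits peaking at "Clara", NOT real model output.
start_logits = [0.1, 0.2, 0.1, 3.0, 0.0, 0.1, 0.0, 0.0, 1.0, 0.0]
end_logits   = [0.0, 0.1, 0.2, 2.5, 0.1, 0.0, 0.0, 0.0, 1.2, 0.0]
s, e = best_span(start_logits, end_logits)
print(" ".join(context[s:e + 1]))  # Clara
```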
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbert_qa_base_uncased_squad2_covid_qa_deepset|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|247.5 MB|
|Case sensitive:|false|
|Max sentence length:|512|

## References

- https://huggingface.co/armageddon/distilbert-base-uncased-squad2-covid-qa-deepset

---
layout: model
title: German asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458 TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: pipeline_asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, pipeline, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458` is a German model originally trained by jonatasgrosman.

NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458_gpu

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458_de_4.2.0_3.0_1664117928306.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458_de_4.2.0_3.0_1664117928306.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458", lang = "de") val annotations = pipeline.transform(audioDF) ```
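Under the hood, a Wav2Vec2ForCTC stage emits one token prediction per audio frame, and CTC decoding collapses repeats and removes blank tokens to produce the transcript. A minimal greedy-decoding sketch (toy frame IDs and vocabulary, not the pipeline's real ones):

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """CTC greedy decoding: drop consecutive duplicates, then drop blanks."""
    out, prev = [], None
    for i in frame_ids:
        # Keep a frame only when it differs from the previous one and is not blank.
        if i != prev and i != blank_id:
            out.append(id_to_char[i])
        prev = i
    return "".join(out)

vocab = {0: "<pad>", 1: "h", 2: "a", 3: "l", 4: "o"}
frames = [1, 1, 2, 0, 3, 3, 0, 3, 4, 4]
print(ctc_greedy_decode(frames, vocab))  # hallo
```

Note how the blank between the two runs of `3` is what lets the decoder emit a doubled letter.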
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|pipeline_asr_exp_w2v2r_xls_r_accent_germany_2_austria_8_s458|
|Type:|pipeline|
|Compatibility:|Spark NLP 4.2.0+|
|License:|Open Source|
|Edition:|Official|
|Language:|de|
|Size:|1.2 GB|

## Included Models

- AudioAssembler
- Wav2Vec2ForCTC

---
layout: model
title: Legal Exceptions Clause Binary Classifier
author: John Snow Labs
name: legclf_exceptions_clause
date: 2022-08-10
tags: [en, legal, classification, clauses, licensed]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `exceptions` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above).

This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
## Predicted Entities `other`, `exceptions` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_exceptions_clause_en_1.0.0_3.2_1660123503661.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_exceptions_clause_en_1.0.0_3.2_1660123503661.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_exceptions_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
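The description above notes the 512-token limit of the sentence embeddings. One simple way to prepare longer inputs, beyond the splitting techniques in the linked tutorial, is to cut the token sequence into fixed-size windows; a plain-Python sketch:

```python
def window_tokens(tokens, max_len=512, stride=512):
    """Split a long token sequence into windows of at most max_len tokens.
    A stride smaller than max_len would produce overlapping windows."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), stride)]

tokens = ["tok%d" % i for i in range(1100)]
windows = window_tokens(tokens)
print([len(w) for w in windows])  # [512, 512, 76]
```

Each window can then be classified as a separate row, and the per-window results combined afterwards.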
## Results

```bash
+------------+
|      result|
+------------+
|[exceptions]|
|     [other]|
|     [other]|
|[exceptions]|
+------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_exceptions_clause|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|23.1 MB|

## References

Legal documents, scraped from the Internet, and classified in-house

## Benchmarking

```bash
       label  precision  recall  f1-score  support
  exceptions       0.88    0.94      0.91       32
       other       0.97    0.95      0.96       79
    accuracy          -       -      0.95      111
   macro-avg       0.93    0.94      0.94      111
weighted-avg       0.95    0.95      0.95      111
```

---
layout: model
title: Icelandic RobertaForQuestionAnswering Cased model (from nozagleh)
author: John Snow Labs
name: roberta_qa_icebert_is_finetune
date: 2023-01-20
tags: [is, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: is
edition: Spark NLP 4.3.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `IceBERT-QA-Is-finetune` is an Icelandic model originally trained by `nozagleh`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_icebert_is_finetune_is_4.3.0_3.0_1674208136981.zip){:.button.button-orange}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_icebert_is_finetune_is_4.3.0_3.0_1674208136981.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_icebert_is_finetune","is")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```
```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_icebert_is_finetune","is")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_icebert_is_finetune|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|is|
|Size:|451.0 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/nozagleh/IceBERT-QA-Is-finetune

---
layout: model
title: Fast and Accurate Language Identification - 99 Languages (CNN)
author: John Snow Labs
name: ld_tatoeba_cnn_99
date: 2020-12-05
task: Language Detection
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [language_detection, open_source, xx]
supported: true
annotator: LanguageDetectorDL
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Language detection and identification is the task of automatically detecting the language(s) present in a document based on the content of the document. ``LanguageDetectorDL`` is an annotator that detects the language of documents or sentences depending on the ``inputCols``. In addition, ``LanguageDetectorDL`` can accurately detect language from documents with mixed languages by coalescing sentences and selecting the best candidate.

We have designed and developed Deep Learning models using CNNs in TensorFlow/Keras. The model is trained on the Tatoeba dataset and evaluated with high accuracy on the Europarl dataset. The output is a language code in Wiki Code style: [https://en.wikipedia.org/wiki/List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias).
This model can detect the following languages: `Afrikaans`, `Arabic`, `Algerian Arabic`, `Assamese`, `Kotava`, `Azerbaijani`, `Belarusian`, `Bengali`, `Berber`, `Breton`, `Bulgarian`, `Catalan`, `Chavacano`, `Cebuano`, `Czech`, `Chuvash`, `Mandarin Chinese`, `Cornish`, `Danish`, `German`, `Central Dusun`, `Modern Greek (1453-)`, `English`, `Esperanto`, `Estonian`, `Basque`, `Finnish`, `French`, `Guadeloupean Creole French`, `Irish`, `Galician`, `Gronings`, `Guarani`, `Hebrew`, `Hindi`, `Croatian`, `Hungarian`, `Armenian`, `Ido`, `Interlingue`, `Ilocano`, `Interlingua`, `Indonesian`, `Icelandic`, `Italian`, `Lojban`, `Japanese`, `Kabyle`, `Georgian`, `Kazakh`, `Khasi`, `Khmer`, `Korean`, `Coastal Kadazan`, `Latin`, `Lingua Franca Nova`, `Lithuanian`, `Latvian`, `Literary Chinese`, `Marathi`, `Meadow Mari`, `Macedonian`, `Low German (Low Saxon)`, `Dutch`, `Norwegian Nynorsk`, `Norwegian Bokmål`, `Occitan`, `Ottoman Turkish`, `Kapampangan`, `Picard`, `Persian`, `Polish`, `Portuguese`, `Romanian`, `Kirundi`, `Russian`, `Slovak`, `Spanish`, `Albanian`, `Serbian`, `Swedish`, `Swabian`, `Tatar`, `Tagalog`, `Thai`, `Klingon`, `Toki Pona`, `Turkmen`, `Turkish`, `Uyghur`, `Ukrainian`, `Urdu`, `Vietnamese`, `Volapük`, `Waray`, `Shanghainese`, `Yiddish`, `Cantonese`, `Malay`. ## Predicted Entities `af`, `ar`, `arq`, `as`, `avk`, `az`, `be`, `bn`, `ber`, `br`, `bg`, `ca`, `cbk`, `ceb`, `cs`, `cv`, `cmn`, `kw`, `da`, `de`, `dtp`, `el`, `en`, `eo`, `et`, `eu`, `fi`, `fr`, `gcf`, `ga`, `gl`, `gos`, `gn`, `he`, `hi`, `hr`, `hu`, `hy`, `io`, `ie`, `ilo`, `ia`, `id`, `is`, `it`, `jbo`, `ja`, `kab`, `ka`, `kk`, `kha`, `km`, `ko`, `kzj`, `la`, `lfn`, `lt`, `lvs`, `lzh`, `mr`, `mhr`, `mk`, `nds`, `nl`, `nn`, `nb`, `oc`, `ota`, `pam`, `pcd`, `pes`, `pl`, `pt`, `ro`, `rn`, `ru`, `sk`, `es`, `sq`, `sr`, `sv`, `swg`, `tt`, `tl`, `th`, `tlh`, `toki`, `tk`, `tr`, `ug`, `uk`, `ur`, `vi`, `vo`, `war`, `wuu`, `yi`, `yue`, `zsm`. 
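A simple stand-in for the sentence-coalescing behavior described above is a majority vote over per-sentence predictions (the real annotator works on model confidences, so this is only an illustrative sketch):

```python
from collections import Counter

def coalesce_languages(sentence_predictions):
    """Pick the document-level language code from per-sentence predictions
    by majority vote."""
    return Counter(sentence_predictions).most_common(1)[0][0]

# One mixed-language document: three French sentences, one English.
preds = ["fr", "fr", "en", "fr"]
print(coalesce_languages(preds))  # fr
```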
{:.btn-box} [Open in Colab](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ld_tatoeba_cnn_99_xx_2.7.0_2.4_1607183215533.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/ld_tatoeba_cnn_99_xx_2.7.0_2.4_1607183215533.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... language_detector = LanguageDetectorDL.pretrained("ld_tatoeba_cnn_99", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("language") languagePipeline = Pipeline(stages=[documentAssembler, sentenceDetector, language_detector]) light_pipeline = LightPipeline(languagePipeline.fit(spark.createDataFrame([['']]).toDF("text"))) result = light_pipeline.fullAnnotate("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.") ``` ```scala ... val languageDetector = LanguageDetectorDL.pretrained("ld_tatoeba_cnn_99", "xx") .setInputCols("sentence") .setOutputCol("language") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, languageDetector)) val data = Seq("Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala."] lang_df = nlu.load('xx.classify.wiki_99').predict(text, output_level='sentence') lang_df ```
## Results ```bash 'fr' ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ld_tatoeba_cnn_99| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[language]| |Language:|xx| ## Data Source Tatoeba ## Benchmarking ```bash Evaluated on the Europarl dataset, which the model has never seen: +--------+-----+-------+------------------+ |src_lang|count|correct| precision| +--------+-----+-------+------------------+ | fi| 1000| 1000| 1.0| | pt| 1000| 1000| 1.0| | it| 1000| 998| 0.998| | es| 1000| 997| 0.997| | en| 1000| 995| 0.995| | da| 1000| 994| 0.994| | sv| 1000| 992| 0.992| | pl| 914| 899|0.9835886214442013| | hu| 880| 863|0.9806818181818182| | lt| 1000| 975| 0.975| | bg| 1000| 951| 0.951| | et| 928| 783| 0.84375| +--------+-----+-------+------------------+ +-------+-------------------+ |summary| precision| +-------+-------------------+ | count| 12| | mean| 0.9758350366355016| | stddev|0.04391442353856736| | min| 0.84375| | max| 1.0| +-------+-------------------+ ``` --- layout: model title: Detect Clinical Entities (ner_jsl_greedy_biobert) author: John Snow Labs name: ner_jsl_greedy_biobert date: 2021-08-13 tags: [ner, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.2.0 spark_version: 2.4 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition deep learning model for clinical terminology. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. This model is the official version of the jsl_ner_wip_greedy_biobert model. Definitions of Predicted Entities: - `Injury_or_Poisoning`: Physical harm or injury caused to the body, including those caused by accidents, falls, or poisoning of a patient or someone else. 
- `Direction`: All the information relating to the laterality of the internal and external organs. - `Test`: Mentions of laboratory, pathology, and radiological tests. - `Admission_Discharge`: Terms that indicate the admission and/or the discharge of a patient. - `Death_Entity`: Mentions that indicate the death of a patient. - `Relationship_Status`: State of the patient's romantic or social relationships (e.g. single, married, divorced). - `Duration`: The duration of a medical treatment or medication use. - `Respiration`: Number of breaths per minute. - `Hyperlipidemia`: Terms that indicate hyperlipidemia with relevant subtypes and synonyms. - `Birth_Entity`: Mentions that indicate giving birth. - `Age`: All mentions of ages, past or present, related to the patient or anybody else. - `Labour_Delivery`: Extractions include stages of labor and delivery. - `Family_History_Header`: Identifies section headers that correspond to the Family History of the patient. - `BMI`: Numeric values and other text information related to Body Mass Index. - `Temperature`: All mentions that refer to body temperature. - `Alcohol`: Terms that indicate alcohol use, abuse or drinking issues of a patient or someone else. - `Kidney_Disease`: Terms that refer to any kidney diseases (includes mentions of modifiers such as "Acute" or "Chronic"). - `Oncological`: All the cancer, tumor or metastasis related extractions mentioned in the document, of the patient or someone else. - `Medical_History_Header`: Identifies section headers that correspond to the Past Medical History of a patient. - `Cerebrovascular_Disease`: All terms that refer to cerebrovascular diseases and events. - `Oxygen_Therapy`: Breathing support initiated by the patient or provided entirely or partially by a machine (e.g. ventilator, BPAP, CPAP). - `O2_Saturation`: Systemic arterial, venous or peripheral oxygen saturation measurements. 
- `Psychological_Condition`: All mental health diagnoses, disorders, conditions or syndromes of a patient or someone else. - `Heart_Disease`: All mentions of acquired, congenital or degenerative heart diseases. - `Employment`: All mentions of patient or provider occupational titles and employment status. - `Obesity`: Terms related to a patient being obese (overweight and BMI are extracted as different labels). - `Disease_Syndrome_Disorder`: All the diseases mentioned in the document, of the patient or someone else (excluding diseases that are extracted with their specific labels, such as "Heart_Disease" etc.). - `Pregnancy`: All terms related to Pregnancy (excluding terms that are extracted with their specific labels, such as "Labour_Delivery" etc.). - `ImagingFindings`: All mentions of radiographic and imaging findings. - `Procedure`: All mentions of invasive medical or surgical procedures or treatments. - `Medical_Device`: All mentions related to medical devices and supplies. - `Race_Ethnicity`: All terms that refer to racial and national origin of sociocultural groups. - `Section_Header`: All the section headers present in the text (Medical History, Family History, Social History, Physical Examination and Vital signs Headers are extracted separately with their specific labels). - `Symptom`: All the symptoms mentioned in the document, of a patient or someone else. - `Treatment`: Includes therapeutic and minimally invasive treatments and procedures (invasive treatments or procedures are extracted as "Procedure"). - `Substance`: All mentions of substance use related to the patient or someone else (recreational drugs, illicit drugs). - `Route`: Drug and medication administration routes, as described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). 
- `Drug_Ingredient`: Active ingredient/s found in drug products. - `Blood_Pressure`: Systemic blood pressure, mean arterial pressure, systolic and/or diastolic are extracted. - `Diet`: All mentions and information regarding the patient's dietary habits. - `External_body_part_or_region`: All mentions related to external body parts or organs that can be examined by the naked eye. - `LDL`: All mentions related to the lab test and results for LDL (Low Density Lipoprotein). - `VS_Finding`: Qualitative data (e.g. Fever, Cyanosis, Tachycardia) and any other symptoms that refer to vital signs. - `Allergen`: Allergen related extractions mentioned in the document. - `EKG_Findings`: All mentions of EKG readings. - `Imaging_Technique`: All mentions of special radiographic views or special imaging techniques used in radiology. - `Triglycerides`: All terms related to the specific lab test for Triglycerides. - `RelativeTime`: Time references that are relative to different times or events (e.g. words such as "approximately", "in the morning"). - `Gender`: Gender-specific nouns and pronouns. - `Pulse`: Peripheral heart rate, without advanced information like measurement location. - `Social_History_Header`: Identifies section headers that correspond to the Social History of a patient. - `Substance_Quantity`: All mentions of substance quantity (quantitative information related to illicit/recreational drugs). - `Diabetes`: All terms related to diabetes mellitus. - `Modifier`: Terms that modify the symptoms, diseases or risk factors. If a modifier is included in the ICD-10 name of a specific disease, the respective modifier is not extracted separately. - `Internal_organ_or_component`: All mentions related to internal body parts or organs that cannot be examined by the naked eye. - `Clinical_Dept`: Terms that indicate the medical and/or surgical departments. 
- `Form`: Drug and medication forms, as described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Drug_BrandName`: Commercial labeling name chosen by the labeler or the drug manufacturer for a drug containing a single or multiple drug active ingredients. - `Strength`: Potency of one unit of drug (or a combination of drugs); the measurement units are described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Fetus_NewBorn`: All terms related to fetus, infant, newborn (excluding terms that are extracted with their specific labels, such as "Labour_Delivery", "Pregnancy" etc.). - `RelativeDate`: Temporal references that are relative to the date of the text or to any other specific date (e.g. "approximately two years ago", "about two days ago"). - `Height`: All mentions related to a patient's height. - `Test_Result`: Terms related to all the test results present in the document (clinical test results are included). - `Sexually_Active_or_Sexual_Orientation`: All terms that are related to sexuality, sexual orientations and sexual activity. - `Frequency`: Frequency of administration for a dose prescribed. - `Time`: Specific time references (hour and/or minutes). - `Weight`: All mentions related to a patient's weight. - `Vaccine`: Generic and brand name of vaccines or vaccination procedure. - `Vital_Signs_Header`: Identifies section headers that correspond to the Vital Signs of a patient. - `Communicable_Disease`: Includes all mentions of communicable diseases. 
- `Dosage`: Quantity prescribed by the physician for an active ingredient; measurement units are described by the [FDA](http://wayback.archive-it.org/7993/20171115111313/https:/www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071667.htm). - `Overweight`: Terms related to the patient being overweight (BMI and Obesity are extracted separately). - `Hypertension`: All terms related to Hypertension (quantitative data such as 150/100 is extracted as Blood_Pressure). - `HDL`: Terms related to the lab test for HDL (High Density Lipoprotein). - `Total_Cholesterol`: Terms related to the lab test and results for cholesterol. - `Smoking`: All mentions of the smoking status of a patient. - `Date`: Mentions of an exact date, in any format, including day number, month and/or year. ## Predicted Entities `Test_Result`, `Relationship_Status`, `RelativeDate`, `Blood_Pressure`, `Triglycerides`, `Smoking`, `Pregnancy`, `Medical_History_Header`, `LDL`, `Hypertension`, `Hyperlipidemia`, `Frequency`, `BMI`, `Internal_organ_or_component`, `Allergen`, `Fetus_NewBorn`, `Substance_Quantity`, `Time`, `Temperature`, `Procedure`, `Strength`, `Treatment`, `HDL`, `Alcohol`, `Birth_Entity`, `Diet`, `Weight`, `Oxygen_Therapy`, `Injury_or_Poisoning`, `Section_Header`, `Obesity`, `EKG_Findings`, `Gender`, `Height`, `Social_History_Header`, `Diabetes`, `Route`, `Race_Ethnicity`, `Substance`, `Drug`, `External_body_part_or_region`, `RelativeTime`, `Admission_Discharge`, `Psychological_Condition`, `Total_Cholesterol`, `Labour_Delivery`, `Imaging_Technique`, `Date`, `Form`, `Overweight`, `Cerebrovascular_Disease`, `Vital_Signs_Header`, `Oncological`, `ImagingFindings`, `Communicable_Disease`, `Duration`, `Vaccine`, `Kidney_Disease`, `O2_Saturation`, `Heart_Disease`, `Employment`, `Sexually_Active_or_Sexual_Orientation`, `Test`, `Disease_Syndrome_Disorder`, `Respiration`, `Direction`, `Medical_Device`, `Clinical_Dept`, 
`Modifier`, `Symptom`, `Pulse`, `Age`, `Death_Entity`, `Dosage`, `Family_History_Header`, `VS_Finding` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_JSL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_greedy_biobert_en_3.2.0_2.4_1628850116918.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_greedy_biobert_en_3.2.0_2.4_1628850116918.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = BertEmbeddings.pretrained('biobert_pubmed_base_cased') \ .setInputCols(['sentence', 'token']) \ .setOutputCol('embeddings') jsl_ner = MedicalNerModel.pretrained("ner_jsl_greedy_biobert", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("jsl_ner") jsl_ner_converter = NerConverter() \ .setInputCols(["sentence", "token", "jsl_ner"]) \ .setOutputCol("ner_chunk") jsl_ner_pipeline = Pipeline().setStages([ documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter]) jsl_ner_model = jsl_ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame([["""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature."""]]).toDF("text") result = jsl_ner_model.transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val jsl_ner = MedicalNerModel.pretrained("ner_jsl_greedy_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("jsl_ner") val jsl_ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "jsl_ner")) .setOutputCol("ner_chunk") val jsl_ner_pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, embeddings, jsl_ner, jsl_ner_converter)) val data = Seq("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. 
His bowel movements are yellow colored and soft in nature.""").toDS.toDF("text") val result = jsl_ner_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl_greedy_biobert").predict("""The patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.""") ```
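Once collected from the `ner_chunk` column, the output boils down to (chunk, label) pairs. A minimal pure-Python sketch of grouping the extracted chunks by entity label, shown on a handful of pairs from the results section so it runs without Spark:

```python
from collections import defaultdict

# A few (chunk, label) pairs as they appear in the ner_chunk output.
pairs = [
    ("21-day-old", "Age"),
    ("male", "Gender"),
    ("congestion", "Symptom"),
    ("mom", "Gender"),
    ("Tylenol", "Drug"),
]

# Group the extracted chunks under their entity label.
by_label = defaultdict(list)
for chunk, label in pairs:
    by_label[label].append(chunk)

print(dict(by_label))
# {'Age': ['21-day-old'], 'Gender': ['male', 'mom'], 'Symptom': ['congestion'], 'Drug': ['Tylenol']}
```

The same grouping applies unchanged to the full pair list collected from a transformed DataFrame.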
## Results ```bash | | chunk | entity | |---:|:-----------------------------------------------|:-----------------------------| | 0 | 21-day-old | Age | | 1 | Caucasian | Race_Ethnicity | | 2 | male | Gender | | 3 | for 2 days | Duration | | 4 | congestion | Symptom | | 5 | mom | Gender | | 6 | suctioning yellow discharge | Symptom | | 7 | nares | External_body_part_or_region | | 8 | she | Gender | | 9 | mild problems with his breathing while feeding | Symptom | | 10 | perioral cyanosis | Symptom | | 11 | retractions | Symptom | | 12 | One day ago | RelativeDate | | 13 | mom | Gender | | 14 | tactile temperature | Symptom | | 15 | Tylenol | Drug | | 16 | Baby | Age | | 17 | decreased p.o. intake | Symptom | | 18 | His | Gender | | 19 | breast-feeding | External_body_part_or_region | | 20 | q.2h | Frequency | | 21 | to 5 to 10 minutes | Duration | | 22 | his | Gender | | 23 | respiratory congestion | Symptom | | 24 | He | Gender | | 25 | tired | Symptom | | 26 | fussy | Symptom | | 27 | over the past 2 days | RelativeDate | | 28 | albuterol | Drug | | 29 | ER | Clinical_Dept | | 30 | His | Gender | | 31 | urine output has also decreased | Symptom | | 32 | he | Gender | | 33 | per 24 hours | Frequency | | 34 | he | Gender | | 35 | per 24 hours | Frequency | | 36 | Mom | Gender | | 37 | diarrhea | Symptom | | 38 | His | Gender | | 39 | bowel | Internal_organ_or_component | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_greedy_biobert| |Compatibility:|Healthcare NLP 3.2.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on data gathered and manually annotated by John Snow Labs. 
https://www.johnsnowlabs.com/data/ ## Benchmarking ```bash label tp fp fn prec rec f1 B-Oxygen_Therapy 47 11 10 0.8103448 0.8245614 0.81739134 B-Cerebrovascular_Disease 43 20 21 0.6825397 0.671875 0.6771653 B-Triglycerides 5 0 0 1 1 1 I-Cerebrovascular_Disease 25 12 27 0.6756757 0.48076922 0.56179774 B-Medical_Device 2704 531 364 0.8358578 0.88135594 0.85800415 B-Labour_Delivery 43 16 29 0.7288136 0.5972222 0.6564886 I-Vaccine 5 0 5 1 0.5 0.6666667 I-Obesity 6 4 1 0.6 0.85714287 0.70588243 I-Smoking 3 1 2 0.75 0.6 0.6666667 B-RelativeTime 67 36 51 0.65048546 0.5677966 0.60633487 B-Imaging_Technique 33 12 19 0.73333335 0.63461536 0.68041235 B-Heart_Disease 285 55 68 0.8382353 0.8073654 0.82251084 B-Procedure 1876 303 384 0.8609454 0.8300885 0.84523547 I-RelativeTime 105 43 53 0.7094595 0.664557 0.6862745 B-Drug 1803 299 265 0.8577545 0.87185687 0.8647482 B-Obesity 29 9 5 0.7631579 0.85294116 0.8055555 I-RelativeDate 617 167 107 0.7869898 0.8522099 0.8183024 B-O2_Saturation 27 8 6 0.7714286 0.8181818 0.7941177 B-Direction 2856 390 326 0.8798521 0.89754874 0.88861233 I-Alcohol 4 4 4 0.5 0.5 0.5 I-Oxygen_Therapy 25 7 6 0.78125 0.8064516 0.79365087 B-Diet 23 14 32 0.6216216 0.4181818 0.5 B-Dosage 35 26 29 0.57377046 0.546875 0.55999994 B-Injury_or_Poisoning 308 52 83 0.85555553 0.7877238 0.82023966 B-Hypertension 80 9 2 0.8988764 0.9756098 0.9356726 I-Test_Result 124 73 156 0.6294416 0.44285715 0.5199161 B-Alcohol 54 11 12 0.83076924 0.8181818 0.8244275 B-Height 14 5 5 0.7368421 0.7368421 0.7368421 I-Substance 18 8 8 0.6923077 0.6923077 0.6923077 B-RelativeDate 372 109 93 0.7733888 0.8 0.78646934 B-Admission_Discharge 218 22 14 0.90833336 0.9396552 0.9237288 B-Date 345 24 26 0.93495935 0.9299191 0.9324324 B-Kidney_Disease 63 10 20 0.8630137 0.7590361 0.8076923 I-Strength 22 17 13 0.5641026 0.62857145 0.59459466 I-Injury_or_Poisoning 301 93 98 0.7639594 0.75438595 0.75914246 I-Time 28 11 17 0.71794873 0.62222224 0.6666667 B-Substance 48 11 10 0.8135593 0.82758623 
0.8205129 B-Total_Cholesterol 6 3 0 0.6666667 1 0.8 I-Vital_Signs_Header 276 28 8 0.90789473 0.97183096 0.93877554 I-Internal_organ_or_component 2907 518 490 0.8487591 0.8557551 0.8522427 B-Hyperlipidemia 28 3 0 0.9032258 1 0.9491525 B-Overweight 3 0 3 1 0.5 0.6666667 I-Sexually_Active_or_Sexual_Orientation 2 0 3 1 0.4 0.5714286 B-Sexually_Active_or_Sexual_Orientation 2 0 2 1 0.5 0.6666667 I-Fetus_NewBorn 50 38 58 0.5681818 0.46296296 0.5102041 B-BMI 6 0 1 1 0.85714287 0.9230769 B-ImagingFindings 52 41 61 0.5591398 0.460177 0.5048544 B-Test_Result 714 135 212 0.8409894 0.7710583 0.8045071 B-Section_Header 2140 79 65 0.9643984 0.97052157 0.96745026 I-Treatment 85 21 29 0.8018868 0.74561405 0.7727273 B-Clinical_Dept 638 82 77 0.88611114 0.8923077 0.88919866 I-Kidney_Disease 114 7 18 0.94214875 0.8636364 0.90118575 I-Pulse 189 27 42 0.875 0.8181818 0.84563756 B-Test 1589 320 315 0.83237296 0.83455884 0.83346444 B-Weight 54 12 13 0.8181818 0.80597013 0.81203 I-Respiration 114 4 17 0.9661017 0.870229 0.91566265 I-EKG_Findings 68 34 52 0.6666667 0.56666666 0.6126126 I-Section_Header 3828 168 77 0.957958 0.9802817 0.9689913 B-Strength 27 13 23 0.675 0.54 0.6 I-Social_History_Header 137 4 4 0.9716312 0.9716312 0.9716312 B-Vital_Signs_Header 183 18 7 0.9104478 0.9631579 0.9360614 B-Death_Entity 28 9 6 0.7567568 0.8235294 0.7887324 B-Modifier 302 90 282 0.77040815 0.5171233 0.6188525 B-Blood_Pressure 93 14 21 0.86915886 0.81578946 0.84162897 I-O2_Saturation 49 19 23 0.7205882 0.6805556 0.7 B-Frequency 437 77 68 0.8501946 0.86534655 0.8577036 I-Triglycerides 5 0 0 1 1 1 I-Duration 513 254 47 0.66883963 0.9160714 0.77317256 I-Diabetes 50 4 6 0.9259259 0.89285713 0.90909094 B-Race_Ethnicity 78 3 2 0.962963 0.975 0.9689441 I-Gender 114 2 17 0.98275864 0.870229 0.9230769 I-Height 43 13 10 0.76785713 0.8113208 0.78899086 B-Communicable_Disease 10 5 9 0.6666667 0.5263158 0.5882354 I-Family_History_Header 134 1 0 0.9925926 1 0.9962825 B-LDL 2 2 2 0.5 0.5 0.5 I-Race_Ethnicity 6 0 0 1 
1 1 B-Psychological_Condition 103 21 17 0.83064514 0.85833335 0.84426236 I-Age 116 14 50 0.8923077 0.6987952 0.78378385 B-EKG_Findings 33 18 32 0.64705884 0.50769234 0.56896555 B-Employment 168 29 44 0.8527919 0.7924528 0.8215159 I-Oncological 358 38 17 0.9040404 0.9546667 0.9286641 B-Time 27 7 18 0.7941176 0.6 0.68354434 B-Treatment 93 31 41 0.75 0.69402987 0.7209303 B-Temperature 69 5 8 0.9324324 0.8961039 0.9139073 I-Procedure 2437 379 501 0.86541194 0.8294758 0.84706295 B-Relationship_Status 30 3 1 0.90909094 0.9677419 0.9375 B-Pregnancy 56 17 30 0.7671233 0.6511628 0.7044025 I-Route 8 4 7 0.6666667 0.53333336 0.59259266 I-Medical_History_Header 151 4 15 0.9741936 0.9096386 0.94080997 I-Imaging_Technique 25 5 20 0.8333333 0.5555556 0.66666675 B-Smoking 74 6 4 0.925 0.94871795 0.93670887 I-Labour_Delivery 36 8 18 0.8181818 0.6666667 0.7346939 I-Death_Entity 3 0 2 1 0.6 0.75 B-Diabetes 77 9 5 0.89534885 0.9390244 0.9166666 B-Gender 4479 82 111 0.9820215 0.97581697 0.9789094 B-Vaccine 6 1 9 0.85714287 0.4 0.54545456 I-Heart_Disease 393 61 89 0.8656388 0.8153527 0.8397436 I-Dosage 31 27 22 0.5344828 0.5849057 0.5585586 B-Social_History_Header 78 2 3 0.975 0.962963 0.9689441 B-External_body_part_or_region 1640 402 311 0.8031342 0.8405946 0.8214376 I-Clinical_Dept 546 59 47 0.90247935 0.920742 0.91151917 I-Test 1195 320 402 0.7887789 0.748278 0.7679949 I-Frequency 340 97 120 0.77803206 0.73913044 0.75808245 B-Age 454 35 57 0.9284254 0.888454 0.908 B-Pulse 90 11 17 0.8910891 0.8411215 0.8653846 I-Symptom 4265 2050 1232 0.6753761 0.7758778 0.72214705 I-Pregnancy 39 28 42 0.58208954 0.4814815 0.527027 I-LDL 5 0 4 1 0.5555556 0.71428573 I-Diet 33 14 25 0.70212764 0.5689655 0.6285714 I-Blood_Pressure 198 54 27 0.78571427 0.88 0.83018863 I-ImagingFindings 136 99 85 0.57872343 0.61538464 0.5964913 I-Date 203 13 10 0.9398148 0.9530516 0.946387 B-Route 84 23 47 0.78504676 0.64122134 0.7058824 B-Duration 204 110 26 0.6496815 0.8869565 0.74999994 B-Medical_History_Header 56 1 7 
0.98245615 0.8888889 0.93333334 B-Respiration 55 4 6 0.9322034 0.90163934 0.9166667 I-External_body_part_or_region 314 105 167 0.74940336 0.65280664 0.6977778 I-BMI 15 0 1 1 0.9375 0.9677419 B-Internal_organ_or_component 4349 886 761 0.8307545 0.8510763 0.8407926 I-Weight 150 22 23 0.872093 0.867052 0.8695652 B-Disease_Syndrome_Disorder 1698 375 358 0.81910276 0.82587546 0.8224752 B-Symptom 4358 1002 932 0.8130597 0.8238185 0.8184037 B-VS_Finding 138 36 37 0.79310346 0.7885714 0.79083097 I-Disease_Syndrome_Disorder 1723 372 451 0.82243437 0.7925483 0.8072148 I-Drug 3282 838 493 0.79660195 0.86940396 0.8314123 I-Medical_Device 1864 418 242 0.81682736 0.88509023 0.84958977 B-Oncological 278 22 22 0.9266667 0.9266667 0.9266667 I-Temperature 111 8 6 0.9327731 0.94871795 0.94067794 I-Employment 92 27 19 0.77310926 0.8288288 0.8 I-Psychological_Condition 32 7 19 0.82051283 0.627451 0.7111111 B-Family_History_Header 68 0 0 1 1 1 I-Direction 311 91 144 0.7736318 0.6835165 0.72578764 Macro-average 65035 12855 11898 0.761429 0.706300 0.7328297 Micro-average 65035 12855 11898 0.834959 0.845346 0.8401207 ``` --- layout: model title: English asr_wav2vec2_base_timit_demo_colab30 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab30 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab30` is an English model originally trained by hassnain. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_base_timit_demo_colab30_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab30_en_4.2.0_3.0_1664020634835.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab30_en_4.2.0_3.0_1664020634835.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab30", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab30", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
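The snippets above assume an `audioDf` whose audio column already contains arrays of floats. A minimal sketch of producing such floats from a 16-bit mono WAV — the WAV here is synthetic and in-memory so the example is self-contained, and the commented `createDataFrame` line at the end is only illustrative:

```python
import array
import io
import wave

# Write a tiny synthetic 16-bit mono WAV to an in-memory buffer.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(16000)   # wav2vec2 models expect 16 kHz audio
    w.writeframes(array.array("h", [0, 8192, 0, -8192] * 40).tobytes())

# Read it back and normalize the samples to [-1.0, 1.0).
buf.seek(0)
with wave.open(buf, "rb") as w:
    samples = array.array("h", w.readframes(w.getnframes()))
floats = [s / 32768.0 for s in samples]

# Illustrative only: audioDf = spark.createDataFrame([(floats,)], ["audio_content"])
print(len(floats), max(floats))  # 160 0.25
```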
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab30| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: Part of Speech for Basque author: John Snow Labs name: pos_ud_bdt date: 2021-03-09 tags: [part_of_speech, open_source, basque, pos_ud_bdt, eu] task: Part of Speech Tagging language: eu edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true annotator: PerceptronModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description A [Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech) classifier predicts a grammatical label for every token in the input text. Implemented with an `averaged perceptron architecture`. ## Predicted Entities - ADV - PUNCT - VERB - NOUN - NUM - CCONJ - ADJ - AUX - PROPN - DET - PRON - PART - SYM - INTJ - X - ADP {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/GRAMMAR_EN/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/GRAMMAR_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_ud_bdt_eu_3.0.0_3.0_1615292144964.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pos_ud_bdt_eu_3.0.0_3.0_1615292144964.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentence_detector = SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols(["sentence"]) \ .setOutputCol("token") pos = PerceptronModel.pretrained("pos_ud_bdt", "eu") \ .setInputCols(["document", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, pos ]) example = spark.createDataFrame([['Kaixo John Snow Labs-etik! ']], ["text"]) result = pipeline.fit(example).transform(example) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val pos = PerceptronModel.pretrained("pos_ud_bdt", "eu") .setInputCols(Array("document", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos)) val data = Seq("Kaixo John Snow Labs-etik! ").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["Kaixo John Snow Labs-etik! "] token_df = nlu.load('eu.pos').predict(text) token_df ```
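The `pos` column lines up one tag per token. A small pure-Python sketch of laying out token/tag pairs the way the results table does — the tags below are copied from the model card, not predicted here:

```python
# Tags copied from the model card's results; not computed in this sketch.
tokens = ["Kaixo", "John", "Snow", "Labs-etik", "!"]
tags = ["INTJ", "PROPN", "PROPN", "PROPN", "PUNCT"]

assert len(tokens) == len(tags), "one tag per token"
rows = list(zip(tokens, tags))
for i, (token, tag) in enumerate(rows):
    print(f"{i} {token:<10} {tag}")
```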
## Results ```bash token pos 0 Kaixo INTJ 1 John PROPN 2 Snow PROPN 3 Labs-etik PROPN 4 ! PUNCT ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pos_ud_bdt| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[pos]| |Language:|eu| --- layout: model title: English asr_wav2vec2_base_timit_demo_colab2_by_sameearif88 TFWav2Vec2ForCTC from sameearif88 author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab2_by_sameearif88 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab2_by_sameearif88` is an English model originally trained by sameearif88. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab2_by_sameearif88_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab2_by_sameearif88_en_4.2.0_3.0_1664040818102.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab2_by_sameearif88_en_4.2.0_3.0_1664040818102.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab2_by_sameearif88', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab2_by_sameearif88", lang = "en") val annotations = pipeline.transform(audioDF) ```
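Under the hood, the Wav2Vec2ForCTC stage emits one label per audio frame, and the transcription comes from CTC decoding, which collapses repeated labels and drops the blank symbol. As a rough illustration of that idea only (not Spark NLP's actual decoder, and with made-up frame labels), a greedy CTC decode looks like this:

```python
def ctc_greedy_decode(frame_labels, blank="-"):
    """Collapse consecutive repeats, then drop blanks, as in greedy CTC decoding."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)          # keep only the first of a run, skip blanks
        prev = label
    return "".join(out)

# The hypothetical frame sequence "hh-e-ll-llo" decodes to "hello":
# repeats collapse to "h-e-l-lo", then blanks are removed.
print(ctc_greedy_decode(list("hh-e-ll-llo")))  # hello
```

The blank symbol is what lets CTC represent genuine double letters: the two "l"s in "hello" survive because a blank frame separates them.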
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab2_by_sameearif88| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Chinese T5ForConditionalGeneration Base Cased model (from thu-coai) author: John Snow Labs name: t5_longlm_base date: 2023-01-30 tags: [zh, open_source, t5] task: Text Generation language: zh edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `LongLM-base` is a Chinese model originally trained by `thu-coai`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_longlm_base_zh_4.3.0_3.0_1675098192540.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_longlm_base_zh_4.3.0_3.0_1675098192540.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_longlm_base","zh") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_longlm_base","zh") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_longlm_base| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|zh| |Size:|926.8 MB| ## References - https://huggingface.co/thu-coai/LongLM-base - https://jianguanthu.github.io/ - http://coai.cs.tsinghua.edu.cn/ --- layout: model title: English DistilBertForTokenClassification Cased model (from ismail-lucifer011) author: John Snow Labs name: distilbert_token_classifier_autotrain_name_vsv_all_901529445 date: 2023-03-06 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-name_vsv_all-901529445` is an English model originally trained by `ismail-lucifer011`. ## Predicted Entities `OOV`, `Name` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_vsv_all_901529445_en_4.3.1_3.0_1678134144791.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_name_vsv_all_901529445_en_4.3.1_3.0_1678134144791.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_vsv_all_901529445","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_name_vsv_all_901529445","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_name_vsv_all_901529445| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/ismail-lucifer011/autotrain-name_vsv_all-901529445 --- layout: model title: Detect Persons, Locations, Organizations and Misc Entities in Italian (WikiNER 840B 300) author: John Snow Labs name: wikiner_840B_300 date: 2020-02-03 task: Named Entity Recognition language: it edition: Spark NLP 2.4.0 spark_version: 2.4 tags: [ner, it, open_source] supported: true annotator: NerDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description WikiNER is a Named Entity Recognition (or NER) model, meaning it annotates text to find features like the names of people, places, and organizations. This NER model does not read words directly but instead reads word embeddings, which represent words as points such that more semantically similar words are closer together. WikiNER 840B 300 is trained with GloVe 840B 300 word embeddings, so be sure to use the same embeddings in the pipeline. {:.h2_title} ## Predicted Entities Persons-`PER`, Locations-`LOC`, Organizations-`ORG`, Miscellaneous-`MISC`. 
{:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_IT){:.button.button-orange}{:target="_blank"} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_IT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/wikiner_840B_300_it_2.4.0_2.4_1579699913554.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/wikiner_840B_300_it_2.4.0_2.4_1579699913554.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") ner_model = NerDLModel.pretrained("wikiner_840B_300", "it") \ .setInputCols(["document", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['William Henry Gates III (nato il 28 ottobre 1955) è un magnate d"affari americano, sviluppatore di software, investitore e filantropo. È noto soprattutto come co-fondatore di Microsoft Corporation. Durante la sua carriera in Microsoft, Gates ha ricoperto le posizioni di presidente, amministratore delegato (CEO), presidente e capo architetto del software, pur essendo il principale azionista individuale fino a maggio 2014. È uno dei più noti imprenditori e pionieri del rivoluzione dei microcomputer degli anni "70 e "80. Nato e cresciuto a Seattle, Washington, Gates ha co-fondato Microsoft con l"amico d"infanzia Paul Allen nel 1975, ad Albuquerque, nel New Mexico; divenne la più grande azienda di software per personal computer al mondo. Gates ha guidato l"azienda come presidente e CEO fino a quando non si è dimesso da CEO nel gennaio 2000, ma è rimasto presidente e divenne capo architetto del software. Alla fine degli anni "90, Gates era stato criticato per le sue tattiche commerciali, che erano state considerate anticoncorrenziali. Questa opinione è stata confermata da numerose sentenze giudiziarie. Nel giugno 2006, Gates ha annunciato che sarebbe passato a un ruolo part-time presso Microsoft e un lavoro a tempo pieno presso la Bill & Melinda Gates Foundation, la fondazione di beneficenza privata che lui e sua moglie, Melinda Gates, hanno fondato nel 2000. 
A poco a poco trasferì i suoi doveri a Ray Ozzie e Craig Mundie. Si è dimesso da presidente di Microsoft nel febbraio 2014 e ha assunto un nuovo incarico come consulente tecnologico per supportare il neo nominato CEO Satya Nadella.']], ["text"])) ``` ```scala ... val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", lang="xx") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val ner_model = NerDLModel.pretrained("wikiner_840B_300", "it") .setInputCols(Array("document", "token", "embeddings")) .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter)) val data = Seq("William Henry Gates III (nato il 28 ottobre 1955) è un magnate d'affari americano, sviluppatore di software, investitore e filantropo. È noto soprattutto come co-fondatore di Microsoft Corporation. Durante la sua carriera in Microsoft, Gates ha ricoperto le posizioni di presidente, amministratore delegato (CEO), presidente e capo architetto del software, pur essendo il principale azionista individuale fino a maggio 2014. È uno dei più noti imprenditori e pionieri del rivoluzione dei microcomputer degli anni '70 e '80. Nato e cresciuto a Seattle, Washington, Gates ha co-fondato Microsoft con l'amico d'infanzia Paul Allen nel 1975, ad Albuquerque, nel New Mexico; divenne la più grande azienda di software per personal computer al mondo. Gates ha guidato l'azienda come presidente e CEO fino a quando non si è dimesso da CEO nel gennaio 2000, ma è rimasto presidente e divenne capo architetto del software. Alla fine degli anni '90, Gates era stato criticato per le sue tattiche commerciali, che erano state considerate anticoncorrenziali. Questa opinione è stata confermata da numerose sentenze giudiziarie. 
Nel giugno 2006, Gates ha annunciato che sarebbe passato a un ruolo part-time presso Microsoft e un lavoro a tempo pieno presso la Bill & Melinda Gates Foundation, la fondazione di beneficenza privata che lui e sua moglie, Melinda Gates, hanno fondato nel 2000. A poco a poco trasferì i suoi doveri a Ray Ozzie e Craig Mundie. Si è dimesso da presidente di Microsoft nel febbraio 2014 e ha assunto un nuovo incarico come consulente tecnologico per supportare il neo nominato CEO Satya Nadella.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""William Henry Gates III (nato il 28 ottobre 1955) è un magnate d'affari americano, sviluppatore di software, investitore e filantropo. È noto soprattutto come co-fondatore di Microsoft Corporation. Durante la sua carriera in Microsoft, Gates ha ricoperto le posizioni di presidente, amministratore delegato (CEO), presidente e capo architetto del software, pur essendo il principale azionista individuale fino a maggio 2014. È uno dei più noti imprenditori e pionieri del rivoluzione dei microcomputer degli anni '70 e '80. Nato e cresciuto a Seattle, Washington, Gates ha co-fondato Microsoft con l'amico d'infanzia Paul Allen nel 1975, ad Albuquerque, nel New Mexico; divenne la più grande azienda di software per personal computer al mondo. Gates ha guidato l'azienda come presidente e CEO fino a quando non si è dimesso da CEO nel gennaio 2000, ma è rimasto presidente e divenne capo architetto del software. Alla fine degli anni '90, Gates era stato criticato per le sue tattiche commerciali, che erano state considerate anticoncorrenziali. Questa opinione è stata confermata da numerose sentenze giudiziarie. Nel giugno 2006, Gates ha annunciato che sarebbe passato a un ruolo part-time presso Microsoft e un lavoro a tempo pieno presso la Bill & Melinda Gates Foundation, la fondazione di beneficenza privata che lui e sua moglie, Melinda Gates, hanno fondato nel 2000. 
A poco a poco trasferì i suoi doveri a Ray Ozzie e Craig Mundie. Si è dimesso da presidente di Microsoft nel febbraio 2014 e ha assunto un nuovo incarico come consulente tecnologico per supportare il neo nominato CEO Satya Nadella."""] ner_df = nlu.load('it.ner').predict(text, output_level = "chunk") ner_df[["entities", "entities_confidence"]] ```
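The entity chunks reported for this model come from merging token-level B-/I-/O tags into spans, which is the job of the `ner_converter` stage in the pipeline above. A minimal pure-Python sketch of that grouping logic (illustrative only, not NerConverter's actual implementation):

```python
def bio_to_chunks(tokens, tags):
    """Merge token-level BIO tags (B-PER, I-PER, O, ...) into (chunk, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                              # close any open chunk
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)                    # continue the open chunk
        else:                                        # O tag (or stray I-) closes the chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["William", "Henry", "Gates", "III", "è", "di", "Microsoft"]
tags = ["B-PER", "I-PER", "I-PER", "I-PER", "O", "O", "B-ORG"]
print(bio_to_chunks(tokens, tags))
# [('William Henry Gates III', 'PER'), ('Microsoft', 'ORG')]
```

This is why the results table below shows multi-token chunks such as "William Henry Gates III" rather than one row per token.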
{:.h2_title} ## Results ```bash +-------------------------------+---------+ |chunk |ner_label| +-------------------------------+---------+ |William Henry Gates III |PER | |Microsoft Corporation |ORG | |Microsoft |ORG | |Gates |PER | |Seattle |LOC | |Washington |LOC | |Gates |PER | |Microsoft |ORG | |Paul Allen |PER | |Albuquerque |LOC | |New Mexico |LOC | |Gates |PER | |CEO |ORG | |CEO |ORG | |Gates |PER | |Gates |PER | |Microsoft |ORG | |Bill & Melinda Gates Foundation|ORG | |Melinda Gates |PER | |Ray Ozzie |PER | +-------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|wikiner_840B_300| |Type:|ner| |Compatibility:| Spark NLP 2.4.0| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|it| |Case sensitive:|false| {:.h2_title} ## Data Source The model is trained based on data from [https://it.wikipedia.org](https://it.wikipedia.org) --- layout: model title: Word2Vec Embeddings in Odia (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-16 tags: [cc, embeddings, fastText, word2vec, or, open_source] task: Embeddings language: or edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_or_3.4.1_3.0_1647451017345.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_or_3.4.1_3.0_1647451017345.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","or") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["ମୁଁ ସ୍ପାର୍କ NLP କୁ ଭଲ ପାଏ |"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","or") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("ମୁଁ ସ୍ପାର୍କ NLP କୁ ଭଲ ପାଏ |").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("or.embed.w2v_cc_300d").predict("""ମୁଁ ସ୍ପାର୍କ NLP କୁ ଭଲ ପାଏ |""") ```
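Each token is mapped to a 300-dimensional vector, and downstream models compare tokens through the geometry of those vectors; cosine similarity is the usual measure. A toy illustration with 3-dimensional vectors (not the real 300-dimensional embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```

In the real embeddings, semantically similar words end up with higher cosine similarity, which is what makes the lookup useful as input to taggers and classifiers.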
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|or| |Size:|186.6 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Detect Clinical Entities (ner_eu_clinical_case) author: John Snow Labs name: ner_eu_clinical_case date: 2023-01-25 tags: [clinical, licensed, ner, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 4.2.7 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained named entity recognition (NER) deep learning model for clinical entities. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art NER model: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. The corpus used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives. ## Predicted Entities `clinical_event`, `bodypart`, `clinical_condition`, `units_measurements`, `patient`, `date_time` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_en_4.2.7_3.0_1674657662344.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_eu_clinical_case_en_4.2.7_3.0_1674657662344.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") ner = MedicalNerModel.pretrained('ner_eu_clinical_case', "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter]) data = spark.createDataFrame([["""A 3-year-old boy with autistic disorder on hospital of pediatric ward A at university hospital. He has no family history of illness or autistic spectrum disorder. The child was diagnosed with a severe communication disorder, with social interaction difficulties and sensory processing delay. Blood work was normal (thyroid-stimulating hormone (TSH), hemoglobin, mean corpuscular volume (MCV), and ferritin). Upper endoscopy also showed a submucosal tumor causing subtotal obstruction of the gastric outlet. Because a gastrointestinal stromal tumor was suspected, distal gastrectomy was performed. 
Histopathological examination revealed spindle cell proliferation in the submucosal layer."""]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val ner_model = MedicalNerModel.pretrained("ner_eu_clinical_case", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documenter, sentenceDetector, tokenizer, word_embeddings, ner_model, ner_converter)) val data = Seq("""A 3-year-old boy with autistic disorder on hospital of pediatric ward A at university hospital. He has no family history of illness or autistic spectrum disorder. The child was diagnosed with a severe communication disorder, with social interaction difficulties and sensory processing delay. Blood work was normal (thyroid-stimulating hormone (TSH), hemoglobin, mean corpuscular volume (MCV), and ferritin). Upper endoscopy also showed a submucosal tumor causing subtotal obstruction of the gastric outlet. Because a gastrointestinal stromal tumor was suspected, distal gastrectomy was performed. 
Histopathological examination revealed spindle cell proliferation in the submucosal layer.""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.clinical_case_eu").predict("""A 3-year-old boy with autistic disorder on hospital of pediatric ward A at university hospital. He has no family history of illness or autistic spectrum disorder. The child was diagnosed with a severe communication disorder, with social interaction difficulties and sensory processing delay. Blood work was normal (thyroid-stimulating hormone (TSH), hemoglobin, mean corpuscular volume (MCV), and ferritin). Upper endoscopy also showed a submucosal tumor causing subtotal obstruction of the gastric outlet. Because a gastrointestinal stromal tumor was suspected, distal gastrectomy was performed. Histopathological examination revealed spindle cell proliferation in the submucosal layer.""") ```
## Results ```bash +------------------------------+------------------+ |chunk |ner_label | +------------------------------+------------------+ |A 3-year-old boy |patient | |autistic disorder |clinical_condition| |He |patient | |illness |clinical_event | |autistic spectrum disorder |clinical_condition| |The child |patient | |diagnosed |clinical_event | |disorder |clinical_event | |difficulties |clinical_event | |Blood |bodypart | |work |clinical_event | |normal |units_measurements| |hormone |clinical_event | |hemoglobin |clinical_event | |volume |clinical_event | |endoscopy |clinical_event | |showed |clinical_event | |tumor |clinical_condition| |causing |clinical_event | |obstruction |clinical_event | |the gastric outlet |bodypart | |gastrointestinal stromal tumor|clinical_condition| |suspected |clinical_event | |gastrectomy |clinical_event | |examination |clinical_event | |revealed |clinical_event | |spindle cell proliferation |clinical_condition| |the submucosal layer |bodypart | +------------------------------+------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_eu_clinical_case| |Compatibility:|Healthcare NLP 4.2.7+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|849.0 KB| ## References The corpus used for model training is provided by European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives. 
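The benchmarking section that follows reports raw true-positive, false-positive, and false-negative counts per label, from which the metric columns derive: precision = tp / (tp + fp), recall = tp / (tp + fn), and F1 is their harmonic mean. A quick sanity check reproducing two rows of that table:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts, rounded to 4 decimals."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 4), round(recall, 4), round(f1, 4)

# date_time row: tp=54, fp=7, fn=15
print(prf(54, 7, 15))    # (0.8852, 0.7826, 0.8308)
# units_measurements row: tp=111, fp=48, fn=12
print(prf(111, 48, 12))  # (0.6981, 0.9024, 0.7872)
```

The harmonic mean simplifies to F1 = 2·tp / (2·tp + fp + fn), which is why a label with many false negatives (like clinical_condition) scores low even when precision is acceptable.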
## Benchmarking ```bash label tp fp fn total precision recall f1 date_time 54.0 7.0 15.0 69.0 0.8852 0.7826 0.8308 units_measurements 111.0 48.0 12.0 123.0 0.6981 0.9024 0.7872 clinical_condition 93.0 47.0 81.0 174.0 0.6643 0.5345 0.5924 patient 119.0 16.0 5.0 124.0 0.8815 0.9597 0.9189 clinical_event 331.0 126.0 89.0 420.0 0.7243 0.7881 0.7548 bodypart 171.0 58.0 84.0 255.0 0.7467 0.6706 0.7066 macro - - - - - - 0.7651 micro - - - - - - 0.7454 ``` --- layout: model title: Legal Laws Clause Binary Classifier author: John Snow Labs name: legclf_laws_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `laws` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `laws` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_laws_clause_en_1.0.0_3.2_1660123661133.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_laws_clause_en_1.0.0_3.2_1660123661133.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_laws_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
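The paragraph splitting (by multiline) recommended in the description can be as simple as breaking the document on blank lines before feeding each piece into the `clause_text` column. A minimal sketch in plain Python, independent of the Legal NLP helper annotators (the sample clauses are made up):

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = ("1. Governing Law.\nThis Agreement is governed by the laws of Delaware.\n"
       "\n"
       "2. Notices.\nAll notices shall be in writing.")
for paragraph in split_paragraphs(doc):
    print(repr(paragraph))  # each piece would become one classifier input row
```

Splitting this way keeps each classifier input within the 512-token embedding limit while preserving enough context for a clause-level decision.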
## Results ```bash +--------+ |  result| +--------+
| [laws] |
|[other] |
|[other] |
| [laws] |
+--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_laws_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support compliance-with-laws 0.92 0.94 0.93 64 other 0.97 0.96 0.97 141 accuracy - - 0.96 205 macro-avg 0.95 0.95 0.95 205 weighted-avg 0.96 0.96 0.96 205 ``` --- layout: model title: English BertForQuestionAnswering model (from aodiniz) author: John Snow Labs name: bert_qa_bert_uncased_L_4_H_768_A_12_cord19_200616_squad2_covid_qna date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert_uncased_L-4_H-768_A-12_cord19-200616_squad2_covid-qna` is an English model originally trained by `aodiniz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_768_A_12_cord19_200616_squad2_covid_qna_en_4.0.0_3.0_1654185337634.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_uncased_L_4_H_768_A_12_cord19_200616_squad2_covid_qna_en_4.0.0_3.0_1654185337634.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_uncased_L_4_H_768_A_12_cord19_200616_squad2_covid_qna","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_uncased_L_4_H_768_A_12_cord19_200616_squad2_covid_qna","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2_covid_cord19.bert.uncased_4l_768d_a12a_768d").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
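Extractive QA models of this kind score every context token as a possible answer start and as a possible answer end; the predicted answer is the span maximizing start score + end score, subject to start ≤ end and a maximum span length. A toy pure-Python illustration of that span search (the scores below are hypothetical, not this model's actual outputs):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (i, j) maximizing start_scores[i] + end_scores[j] with i <= j < i + max_len."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.0, 0.2, 2.5, 0.0, 0.1, 0.0, 0.0, 0.3, 0.0]
end   = [0.0, 0.1, 0.0, 2.0, 0.1, 0.0, 0.0, 0.2, 0.4, 0.1]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # Clara
```

In SQuAD 2.0-style models the special [CLS] position competes as a "no answer" span, which is how the model can abstain when the context does not contain the answer.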
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_uncased_L_4_H_768_A_12_cord19_200616_squad2_covid_qna| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|194.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/aodiniz/bert_uncased_L-4_H-768_A-12_cord19-200616_squad2_covid-qna --- layout: model title: English BERT Embeddings Cased model (from mrm8488) author: John Snow Labs name: bert_embeddings_scibert_scivocab_finetuned_cord19 date: 2022-07-15 tags: [en, open_source, bert, embeddings] task: Embeddings language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BERT Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `scibert_scivocab-finetuned-CORD19` is an English model originally trained by `mrm8488`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_scibert_scivocab_finetuned_cord19_en_4.0.0_3.0_1657880897640.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_scibert_scivocab_finetuned_cord19_en_4.0.0_3.0_1657880897640.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_scibert_scivocab_finetuned_cord19","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_scibert_scivocab_finetuned_cord19","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_scibert_scivocab_finetuned_cord19| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|412.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/mrm8488/scibert_scivocab-finetuned-CORD19 --- layout: model title: Named Entity Recognition for Japanese (BertForTokenClassification) author: John Snow Labs name: bert_token_classifier_ner_ud_gsd date: 2021-09-10 tags: [ner, ja, open_source, bert] task: Named Entity Recognition language: ja edition: Spark NLP 3.2.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model annotates named entities in a text, which can be used to find features such as names of people, places, and organizations. It has BERT embeddings integrated with a dense layer attached at the end to classify tokens directly. Word segmentation is needed to extract the tokens. The model uses BERT embeddings from https://github.com/cl-tohoku/bert-japanese. ## Predicted Entities `ORDINAL`, `PERSON`, `LAW`, `MOVEMENT`, `LOC`, `WORK_OF_ART`, `DATE`, `NORP`, `TITLE_AFFIX`, `QUANTITY`, `FAC`, `TIME`, `MONEY`, `LANGUAGE`, `GPE`, `EVENT`, `ORG`, `PERCENT`, `PRODUCT` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_token_classifier_ner_ud_gsd_ja_3.2.2_3.0_1631279615344.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_token_classifier_ner_ud_gsd_ja_3.2.2_3.0_1631279615344.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.base import * from sparknlp.annotator import * from pyspark.ml import Pipeline documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetector() \ .setInputCols("document") \ .setOutputCol("sentence") word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") \ .setInputCols(["sentence"]) \ .setOutputCol("token") nerTagger = BertForTokenClassification \ .pretrained("bert_token_classifier_ner_ud_gsd", "ja") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline().setStages([ documentAssembler, sentenceDetector, word_segmenter, nerTagger ]) data = spark.createDataFrame([["宮本茂氏は、日本の任天堂のゲームプロデューサーです。"]]).toDF("text") model = pipeline.fit(data) result = model.transform(data) ``` ```scala import spark.implicits._ import com.johnsnowlabs.nlp.DocumentAssembler import com.johnsnowlabs.nlp.annotator.{SentenceDetector, WordSegmenterModel} import com.johnsnowlabs.nlp.annotator.BertForTokenClassification import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") .setInputCols("sentence") .setOutputCol("token") val nerTagger = BertForTokenClassification .pretrained("bert_token_classifier_ner_ud_gsd", "ja") .setInputCols("sentence", "token") .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array( documentAssembler, sentence, word_segmenter, nerTagger )) val data = Seq("宮本茂氏は、日本の任天堂のゲームプロデューサーです。").toDF("text") val model = pipeline.fit(data) val result = model.transform(data) result.selectExpr("explode(arrays_zip(token.result, ner.result)) as cols") .selectExpr("cols['0'] as token", "cols['1'] as ner") .show() ``` {:.nlu-block} ```python import nlu 
nlu.load("ja.classify.token_bert.classifier_ner_ud_gsd").predict("""Put your text here.""") ```
## Results ```bash +--------------+--------+ | token| ner| +--------------+--------+ | 宮本|B-PERSON| | 茂|B-PERSON| | 氏| O| | は| O| | 、| O| | 日本| O| | の| O| | 任天| O| | 堂| O| | の| O| | ゲーム| O| |プロデューサー| O| | です| O| | 。| O| +--------------+--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_ud_gsd| |Compatibility:|Spark NLP 3.2.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[ner]| |Language:|ja| |Case sensitive:|true| |Max sentence length:|128| ## Data Source The model was trained on the Universal Dependencies, curated by Google. A NER version was created by megagonlabs: https://github.com/megagonlabs/UD_Japanese-GSD Reference: Asahara, M., Kanayama, H., Tanaka, T., Miyao, Y., Uematsu, S., Mori, S., Matsumoto, Y., Omura, M., & Murawaki, Y. (2018). Universal Dependencies Version 2 for Japanese. In LREC-2018. ## Benchmarking ```bash label precision recall f1-score support DATE 0.87 0.94 0.91 191 EVENT 0.27 0.82 0.41 17 FAC 0.66 0.63 0.64 62 GPE 0.40 0.59 0.48 70 LANGUAGE 0.50 0.67 0.57 6 LAW 0.00 0.00 0.00 0 LOC 0.59 0.69 0.63 35 MONEY 0.90 0.90 0.90 20 MOVEMENT 0.27 1.00 0.43 3 NORP 0.46 0.57 0.50 46 O 0.98 0.97 0.98 11897 ORDINAL 0.94 0.70 0.80 43 ORG 0.49 0.43 0.46 204 PERCENT 1.00 0.80 0.89 20 PERSON 0.56 0.68 0.61 105 PRODUCT 0.32 0.53 0.40 30 QUANTITY 0.83 0.75 0.79 189 TIME 0.88 0.64 0.74 44 TITLE_AFFIX 0.33 0.80 0.47 10 WORK_OF_ART 0.67 0.76 0.71 42 accuracy - - 0.95 13034 macro-avg 0.60 0.69 0.62 13034 weighted-avg 0.96 0.95 0.95 13034 ``` --- layout: model title: Legal Investments Clause Binary Classifier author: John Snow Labs name: legclf_investments_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a 
Binary Classifier (True, False) for the `investments` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend you split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `investments` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_investments_clause_en_1.0.0_3.2_1660123646127.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_investments_clause_en_1.0.0_3.2_1660123646127.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_investments_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
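As a minimal illustration of the paragraph-splitting pre-step recommended in the Description (a plain-Python sketch, independent of Spark NLP; the contract text and helper name are illustrative):

```python
import re

def split_into_paragraphs(document: str) -> list:
    """Split a legal document on blank lines, yielding one candidate clause per paragraph."""
    return [p.strip() for p in re.split(r"\n\s*\n+", document) if p.strip()]

contract = (
    "1. INVESTMENTS. The Borrower shall not make any Investments...\n\n"
    "2. GOVERNING LAW. This Agreement shall be governed by...\n"
)
paragraphs = split_into_paragraphs(contract)
```

Each resulting paragraph can then become one row of the `clause_text` column fed to the pipeline above.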
## Results ```bash +-------------+ | result| +-------------+ |[investments]| | [other]| | [other]| |[investments]| +-------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_investments_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support investments 0.97 0.85 0.91 34 other 0.94 0.99 0.96 80 accuracy - - 0.95 114 macro-avg 0.95 0.92 0.93 114 weighted-avg 0.95 0.95 0.95 114 ``` --- layout: model title: Pipeline to Detect Living Species author: John Snow Labs name: bert_token_classifier_ner_living_species_pipeline date: 2023-03-20 tags: [en, ner, clinical, licensed, bertfortokenclassification] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of [bert_token_classifier_ner_living_species](https://nlp.johnsnowlabs.com/2022/06/26/bert_token_classifier_ner_living_species_en_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_pipeline_en_4.3.0_3.2_1679304760714.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_living_species_pipeline_en_4.3.0_3.2_1679304760714.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("bert_token_classifier_ner_living_species_pipeline", "en", "clinical/models") text = '''42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("bert_token_classifier_ner_living_species_pipeline", "en", "clinical/models") val text = "42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. 
In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:------------------------|--------:|------:|:------------|-------------:| | 0 | woman | 12 | 16 | HUMAN | 0.986743 | | 1 | bacterial | 145 | 153 | SPECIES | 0.975256 | | 2 | Fusarium spp | 337 | 348 | SPECIES | 0.998142 | | 3 | patient | 355 | 361 | HUMAN | 0.994012 | | 4 | species | 507 | 513 | SPECIES | 0.962562 | | 5 | Fusarium solani complex | 522 | 544 | SPECIES | 0.999028 | | 6 | antifungals | 792 | 802 | SPECIES | 0.999852 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_living_species_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.8 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverterInternalModel --- layout: model title: Translate English to Macedonian Pipeline author: John Snow Labs name: translate_en_mk date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, mk, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects (see below for an incomplete list). Note that this is a very computationally expensive module, especially on longer sequences. 
The use of an accelerator such as GPU is recommended. - source languages: `en` - target languages: `mk` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_mk_xx_2.7.0_2.4_1609687312775.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_mk_xx_2.7.0_2.4_1609687312775.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_mk", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_mk", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.mk').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_mk| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Stopwords Remover for Danish language (219 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, da, open_source] task: Stop Words Removal language: da edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_da_3.4.1_3.0_1646672996410.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_da_3.4.1_3.0_1646672996410.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","da") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Du er ikke bedre end mig"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","da") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("Du er ikke bedre end mig").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("da.stopwords").predict("""Du er ikke bedre end mig""") ```
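Conceptually, the cleaner drops every token that appears in the Danish stopword list. A plain-Python sketch of that behaviour, using a tiny assumed subset of the 219-entry ISO list (not the actual resource bundled with the model):

```python
# Tiny assumed subset of the Danish ISO stopword list (illustrative only)
danish_stopwords = {"du", "er", "ikke", "end", "mig"}

tokens = "Du er ikke bedre end mig".split()
# Keep only tokens that are not stopwords (matching case-insensitively)
clean_tokens = [t for t in tokens if t.lower() not in danish_stopwords]
# clean_tokens == ["bedre"]
```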
## Results ```bash +-------+ |result | +-------+ |[bedre]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|da| |Size:|2.0 KB| --- layout: model title: Pipeline to Detect Clinical Entities (ner_jsl_slim) author: John Snow Labs name: ner_jsl_slim_pipeline date: 2023-03-07 tags: [ner, en, clinical, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.1 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of [ner_jsl_slim](https://nlp.johnsnowlabs.com/2021/08/13/ner_jsl_slim_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_jsl_slim_pipeline_en_4.3.1_3.2_1678195679312.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_jsl_slim_pipeline_en_4.3.1_3.2_1678195679312.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_jsl_slim_pipeline", "en", "clinical/models") text = "Hyperparathyroidism was considered upon the fourth occasion. The history of weakness and generalized joint pains were present. He also had history of epigastric pain diagnosed informally as gastritis. He had previously had open reduction and internal fixation for the initial two fractures under general anesthesia. He sustained mandibular fracture." result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_jsl_slim_pipeline", "en", "clinical/models") val text = "Hyperparathyroidism was considered upon the fourth occasion. The history of weakness and generalized joint pains were present. He also had history of epigastric pain diagnosed informally as gastritis. He had previously had open reduction and internal fixation for the initial two fractures under general anesthesia. He sustained mandibular fracture." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.jsl_slim.pipeline").predict("""Hyperparathyroidism was considered upon the fourth occasion. The history of weakness and generalized joint pains were present. He also had history of epigastric pain diagnosed informally as gastritis. He had previously had open reduction and internal fixation for the initial two fractures under general anesthesia. He sustained mandibular fracture.""") ```
## Results ```bash | | chunks | begin | end | entities | confidence | |---:|:-------------------------------------|--------:|------:|:--------------------------|-------------:| | 0 | Hyperparathyroidism | 0 | 18 | Disease_Syndrome_Disorder | 0.9977 | | 1 | weakness | 76 | 83 | Symptom | 0.9744 | | 2 | generalized joint pains | 89 | 111 | Symptom | 0.584067 | | 3 | He | 127 | 128 | Demographics | 0.9996 | | 4 | epigastric pain | 150 | 164 | Symptom | 0.66655 | | 5 | gastritis | 190 | 198 | Disease_Syndrome_Disorder | 0.9874 | | 6 | He | 201 | 202 | Demographics | 0.9995 | | 7 | open reduction and internal fixation | 223 | 258 | Procedure | 0.61648 | | 8 | fractures under general anesthesia | 280 | 313 | Drug | 0.79585 | | 9 | He | 316 | 317 | Demographics | 0.9992 | | 10 | sustained mandibular fracture | 319 | 347 | Disease_Syndrome_Disorder | 0.662467 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_jsl_slim_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Pipeline to Mapping RxNorm Codes with Corresponding National Drug Codes (NDC) author: John Snow Labs name: rxnorm_ndc_mapping date: 2023-06-13 tags: [en, licensed, clinical, pipeline, chunk_mapping, rxnorm, ndc] task: Chunk Mapping language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps RxNorm codes to NDC codes without using any text data. Simply feed it whitespace-delimited RxNorm codes and it will return the two corresponding types of NDC codes, called `package ndc` and `product ndc`. 
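Conceptually, the pipeline behaves like a dictionary lookup from RxNorm code to NDC codes. A plain-Python sketch with toy data (the two entries below are copied from the Results section of this page; the real mapping resource ships inside the model):

```python
# Toy stand-in for the pipeline's ChunkMapper resources (illustrative only)
rxnorm_to_ndc = {
    "1652674": {"package_ndc": "62135-0625-60", "product_ndc": "46708-0499"},
    "259934":  {"package_ndc": "13349-0010-39", "product_ndc": "13349-0010"},
}

codes = "1652674 259934".split()  # whitespace-delimited RxNorm codes
package_ndc = [rxnorm_to_ndc[c]["package_ndc"] for c in codes]
product_ndc = [rxnorm_to_ndc[c]["product_ndc"] for c in codes]
```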
## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_ndc_mapping_en_4.4.4_3.2_1686665536839.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_ndc_mapping_en_4.4.4_3.2_1686665536839.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("rxnorm_ndc_mapping", "en", "clinical/models") result = pipeline.fullAnnotate("1652674 259934") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("rxnorm_ndc_mapping", "en", "clinical/models") val result = pipeline.fullAnnotate("1652674 259934") ``` {:.nlu-block} ```python import nlu nlu.load("en.map_entity.rxnorm_to_ndc.pipe").predict("""Put your text here.""") ```
## Results ```bash {'document': ['1652674 259934'], 'package_ndc': ['62135-0625-60', '13349-0010-39'], 'product_ndc': ['46708-0499', '13349-0010'], 'rxnorm_code': ['1652674', '259934']} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|rxnorm_ndc_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|4.0 MB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel - ChunkMapperModel --- layout: model title: Legal Interest Clause Binary Classifier author: John Snow Labs name: legclf_interest_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `interest` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend you split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). 
This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `interest` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_interest_clause_en_1.0.0_3.2_1660122572067.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_interest_clause_en_1.0.0_3.2_1660122572067.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_interest_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +----------+ | result| +----------+ |[interest]| | [other]| | [other]| |[interest]| +----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_interest_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support conflict-of-interest 0.90 0.81 0.85 32 other 0.95 0.97 0.96 108 accuracy - - 0.94 140 macro-avg 0.92 0.89 0.91 140 weighted-avg 0.93 0.94 0.93 140 ``` --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from JoanTirant) author: John Snow Labs name: roberta_qa_joantirant_base_bne_finetuned_s_c date: 2023-01-20 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bne-finetuned-sqac` is a Spanish model originally trained by `JoanTirant`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_joantirant_base_bne_finetuned_s_c_es_4.3.0_3.0_1674212952508.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_joantirant_base_bne_finetuned_s_c_es_4.3.0_3.0_1674212952508.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_joantirant_base_bne_finetuned_s_c","es")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_joantirant_base_bne_finetuned_s_c","es") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_joantirant_base_bne_finetuned_s_c| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|es| |Size:|460.4 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/JoanTirant/roberta-base-bne-finetuned-sqac --- layout: model title: English asr_iloko TFWav2Vec2ForCTC from denden author: John Snow Labs name: pipeline_asr_iloko date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_iloko` is an English model originally trained by denden. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_iloko_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_iloko_en_4.2.0_3.0_1664095414916.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_iloko_en_4.2.0_3.0_1664095414916.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_iloko', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_iloko", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_iloko| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal Jurisdiction Clause Binary Classifier author: John Snow Labs name: legclf_jurisdiction_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `jurisdiction` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added.
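The "paragraph splitting (by multiline)" mentioned above can be as simple as breaking the document on blank lines before classifying each piece. The helper below is only an illustrative sketch (the function name and sample contract text are assumptions, not part of this model card):

```python
import re

def split_paragraphs(document: str) -> list:
    """Split a legal document into non-empty paragraphs on blank lines,
    keeping each clause candidate small enough for the 512-token embedding limit."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

contract = (
    "GOVERNING LAW.\nThis Agreement shall be governed by the laws of the State of Delaware.\n\n"
    "NOTICES.\nAll notices under this Agreement shall be in writing."
)
paragraphs = split_paragraphs(contract)
# each paragraph can then become one row of the `clause_text` column,
# e.g. spark.createDataFrame([[p] for p in paragraphs]).toDF("clause_text")
```

Each resulting paragraph is then classified independently, so one document yields one True/False prediction per clause candidate.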
## Predicted Entities `other`, `jurisdiction` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_jurisdiction_clause_en_1.0.0_3.2_1660122594462.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_jurisdiction_clause_en_1.0.0_3.2_1660122594462.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_jurisdiction_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[jurisdiction]| |[other]| |[other]| |[jurisdiction]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_jurisdiction_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support jurisdiction 0.94 0.96 0.95 70 other 0.99 0.98 0.98 208 accuracy - - 0.97 278 macro-avg 0.96 0.97 0.97 278 weighted-avg 0.97 0.97 0.97 278 ``` --- layout: model title: English BertForQuestionAnswering model (from jatinshah) author: John Snow Labs name: bert_qa_jatinshah_bert_finetuned_squad date: 2022-06-06 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad` is an English model originally trained by `jatinshah`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_jatinshah_bert_finetuned_squad_en_4.0.0_3.0_1654535654266.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_jatinshah_bert_finetuned_squad_en_4.0.0_3.0_1654535654266.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_jatinshah_bert_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_jatinshah_bert_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_jatinshah").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_jatinshah_bert_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.2 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/jatinshah/bert-finetuned-squad --- layout: model title: Spanish Named Entity Recognition (from lirondos) author: John Snow Labs name: bert_ner_anglicisms_spanish_mbert date: 2022-05-09 tags: [bert, ner, token_classification, es, open_source] task: Named Entity Recognition language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, uploaded to Hugging Face, adapted and imported into Spark NLP. `anglicisms-spanish-mbert` is a Spanish model originally trained by `lirondos`. ## Predicted Entities `OTHER`, `ENG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_ner_anglicisms_spanish_mbert_es_3.4.2_3.0_1652096511636.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_ner_anglicisms_spanish_mbert_es_3.4.2_3.0_1652096511636.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_ner_anglicisms_spanish_mbert","es") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Amo Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_ner_anglicisms_spanish_mbert","es") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Amo Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_ner_anglicisms_spanish_mbert| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|es| |Size:|665.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/lirondos/anglicisms-spanish-mbert - https://github.com/lirondos/coalas/ - https://arxiv.org/abs/2203.16169 --- layout: model title: Multilabel Classification of NDA Clauses (sentences, medium) author: John Snow Labs name: legmulticlf_mnda_sections_other date: 2023-02-09 tags: [mnda, nda, en, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: MultiClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a version of `legmulticlf_mnda_sections` (medium) that includes more negative examples (`OTHER`) to reinforce the difference between groups, and it also returns `OTHER` as a synonym of `[]`. It should be run on sentences of the NDA clauses and will retrieve a series of 1..N labels for each of them. The possible clause types detected by this model in NDA / MNDA agreements are: 1. Parties to the Agreement - Names of the Parties Clause 2. Identification of What Information Is Confidential - Definition of Confidential Information Clause 3. Use of Confidential Information: Permitted Use Clause and Obligations of the Recipient 4. Time Frame of the Agreement - Termination Clause 5. Return of Confidential Information Clause 6. Remedies for Breaches of Agreement - Remedies Clause 7. Non-Solicitation Clause 8. Dispute Resolution Clause 9. Exceptions Clause 10. Non-competition clause 11. 
Other: None of the above (synonym of `[]`) ## Predicted Entities `APPLIC_LAW`, `ASSIGNMENT`, `DEF_OF_CONF_INFO`, `DISPUTE_RESOL`, `EXCEPTIONS`, `NAMES_OF_PARTIES`, `NON_COMP`, `NON_SOLIC`, `PREAMBLE`, `REMEDIES`, `REQ_DISCL`, `RETURN_OF_CONF_INFO`, `TERMINATION`, `USE_OF_CONF_INFO`, `OTHER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legmulticlf_mnda_sections_other_en_1.0.0_3.0_1675938628942.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legmulticlf_mnda_sections_other_en_1.0.0_3.0_1675938628942.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = ( nlp.DocumentAssembler().setInputCol("text").setOutputCol("document") ) sentence_splitter = ( nlp.SentenceDetector() .setInputCols(["document"]) .setOutputCol("sentence") .setCustomBounds(["\n"]) ) embeddings = ( nlp.UniversalSentenceEncoder.pretrained() .setInputCols("sentence") .setOutputCol("sentence_embeddings") ) classifierdl_pred = nlp.MultiClassifierDLModel.pretrained('legmulticlf_mnda_sections_other', 'en', 'legal/models')\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("class") clf_pipeline = nlp.Pipeline(stages=[document_assembler, sentence_splitter, embeddings, classifierdl_pred]) df = spark.createDataFrame([["Governing Law.\nThis Agreement shall be govern..."]]).toDF("text") res = clf_pipeline.fit(df).transform(df) res.select('text', 'class.result').show() ```
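Because `OTHER` is also returned as a synonym of the empty label set `[]`, downstream code may want to normalize the two forms before counting clause types. The helper below is a sketch under that assumption, not part of the pretrained model:

```python
def normalize_labels(labels):
    """Map an empty multilabel prediction to the explicit OTHER class,
    and deduplicate/sort any non-empty prediction."""
    return ["OTHER"] if not labels else sorted(set(labels))

# applied to each row of `class.result` collected from the pipeline output:
print(normalize_labels([]))              # ['OTHER']
print(normalize_labels(["APPLIC_LAW"]))  # ['APPLIC_LAW']
```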
## Results ```bash [APPLIC_LAW] Governing Law.\nThis Agreement shall be govern... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legmulticlf_mnda_sections_other| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|13.0 MB| ## References In-house MNDA ## Benchmarking ```bash label precision recall f1-score support APPLIC_LAW 0.90 0.86 0.88 22 ASSIGNMENT 0.90 0.90 0.90 21 DEF_OF_CONF_INFO 0.83 0.80 0.82 25 DISPUTE_RESOL 0.84 0.75 0.79 36 EXCEPTIONS 0.94 0.85 0.89 20 NAMES_OF_PARTIES 0.85 0.95 0.90 37 NON_COMP 0.89 0.94 0.92 18 NON_SOLIC 0.90 1.00 0.95 9 OTHER 0.95 0.90 0.92 123 PREAMBLE 0.88 0.81 0.84 36 REMEDIES 0.74 0.74 0.74 27 REQ_DISCL 0.86 0.75 0.80 16 RETURN_OF_CONF_INFO 0.85 0.88 0.87 26 TERMINATION 0.79 0.79 0.79 19 USE_OF_CONF_INFO 0.79 0.71 0.75 31 micro-avg 0.88 0.85 0.86 466 macro-avg 0.86 0.84 0.85 466 weighted-avg 0.88 0.85 0.86 466 samples-avg 0.83 0.85 0.83 466 ``` --- layout: model title: Pipeline to Detect clinical events (biobert) author: John Snow Labs name: ner_events_biobert_pipeline date: 2023-03-20 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on the top of [ner_events_biobert](https://nlp.johnsnowlabs.com/2021/04/01/ner_events_biobert_en.html) model. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_events_biobert_pipeline_en_4.3.0_3.2_1679315540096.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_events_biobert_pipeline_en_4.3.0_3.2_1679315540096.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_events_biobert_pipeline", "en", "clinical/models") text = '''The patient presented to the emergency room last evening.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_events_biobert_pipeline", "en", "clinical/models") val text = "The patient presented to the emergency room last evening." val result = pipeline.fullAnnotate(text) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.biobert_events.pipeline").predict("""The patient presented to the emergency room last evening.""") ```
## Results ```bash | | ner_chunk | begin | end | ner_label | confidence | |---:|:-------------------|--------:|------:|:--------------|-------------:| | 0 | presented | 12 | 20 | OCCURRENCE | 0.5019 | | 1 | the emergency room | 25 | 42 | CLINICAL_DEPT | 0.695333 | | 2 | last evening | 44 | 55 | DATE | 0.7621 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_events_biobert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|422.1 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Mapping RxNorm Codes with Corresponding National Drug Codes author: John Snow Labs name: rxnorm_ndc_mapper date: 2022-05-20 tags: [ndc, rxnorm, en, licensed] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps RxNorm and RxNorm Extension codes with corresponding National Drug Codes (NDC). ## Predicted Entities `Product NDC`, `Package NDC` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_ndc_mapper_en_3.5.1_3.0_1653061064192.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_ndc_mapper_en_3.5.1_3.0_1653061064192.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('ner_chunk') sbert_embedder = BertSentenceEmbeddings.pretrained('sbiobert_base_cased_mli', 'en', 'clinical/models')\ .setInputCols(["ner_chunk"])\ .setOutputCol("sentence_embeddings")\ .setCaseSensitive(False) rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("rxnorm_code")\ .setDistanceFunction("EUCLIDEAN") chunkerMapper_product = ChunkMapperModel.pretrained("rxnorm_ndc_mapper", "en", "clinical/models")\ .setInputCols(["rxnorm_code"])\ .setOutputCol("Product NDC")\ .setRel("Product NDC") chunkerMapper_package = ChunkMapperModel.pretrained("rxnorm_ndc_mapper", "en", "clinical/models")\ .setInputCols(["rxnorm_code"])\ .setOutputCol("Package NDC")\ .setRel("Package NDC") pipeline = Pipeline().setStages([document_assembler, sbert_embedder, rxnorm_resolver, chunkerMapper_product, chunkerMapper_package]) model = pipeline.fit(spark.createDataFrame([['']]).toDF('text')) lp = LightPipeline(model) result = lp.annotate(['doxycycline hyclate 50 MG Oral Tablet', 'macadamia nut 100 MG/ML']) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sentence_embeddings") .setCaseSensitive(false) val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models") .setInputCols(Array("sentence_embeddings")) .setOutputCol("rxnorm_code") .setDistanceFunction("EUCLIDEAN") val chunkerMapper_product = ChunkMapperModel.pretrained("rxnorm_ndc_mapper", "en", "clinical/models") .setInputCols(Array("rxnorm_code")) .setOutputCol("Product NDC") .setRel("Product NDC") val chunkerMapper_package = ChunkMapperModel.pretrained("rxnorm_ndc_mapper", "en", "clinical/models") .setInputCols(Array("rxnorm_code")) .setOutputCol("Package NDC") .setRel("Package NDC") val pipeline = new Pipeline().setStages(Array(document_assembler, sbert_embedder, rxnorm_resolver, chunkerMapper_product, chunkerMapper_package)) val text_data = Seq("doxycycline hyclate 50 MG Oral Tablet", "macadamia nut 100 MG/ML").toDF("text") val res = pipeline.fit(text_data).transform(text_data) ``` {:.nlu-block} ```python import nlu nlu.load("en.rxnorm_to_ndc").predict("""Product NDC""") ```
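A package-level NDC uses a labeler-product-package segment layout, so the Product NDC returned by the mapper is the Package NDC with its trailing package segment removed. The helper below only illustrates that relation and is not part of the model:

```python
def product_from_package(package_ndc: str) -> str:
    """Drop the trailing package-code segment of a labeler-product-package NDC."""
    return "-".join(package_ndc.split("-")[:2])

print(product_from_package("62135-0625-60"))  # 62135-0625
```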
## Results ```bash | | ner_chunk | rxnorm_code | Package NDC | Product NDC | |---:|:------------------------------------------|:--------------|:------------------|:---------------| | 0 | ['doxycycline hyclate 50 MG Oral Tablet'] | ['1652674'] | ['62135-0625-60'] | ['62135-0625'] | | 1 | ['macadamia nut 100 MG/ML'] | ['212433'] | ['00187-1474-08'] | ['00187-1474'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|rxnorm_ndc_mapper| |Compatibility:|Healthcare NLP 3.5.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|2.0 MB| --- layout: model title: Detect Chemicals in Text (embeddings_clinical_large) author: John Snow Labs name: ner_chemicals_emb_clinical_large date: 2023-06-02 tags: [ner, clinical, licensed, en, chemicals] task: Named Entity Recognition language: en edition: Healthcare NLP 4.4.3 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Extract different types of chemical compounds mentioned in text using a pretrained NER model (trained with `embeddings_clinical_large`). ## Predicted Entities `CHEM` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_CHEMICALS/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chemicals_emb_clinical_large_en_4.4.3_3.0_1685713522857.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chemicals_emb_clinical_large_en_4.4.3_3.0_1685713522857.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") chemicals_ner = MedicalNerModel.pretrained("ner_chemicals_emb_clinical_large", "en", "clinical/models" ) \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("chemicals_ner") chemicals_ner_converter = NerConverterInternal() \ .setInputCols(["sentence", "token", "chemicals_ner"]) \ .setOutputCol("chemicals_ner_chunk") chemicals_ner_pipeline = Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, word_embeddings, chemicals_ner, chemicals_ner_converter]) empty_data = spark.createDataFrame([[""]]).toDF("text") chemicals_ner_model = chemicals_ner_pipeline.fit(empty_data) results = chemicals_ner_model.transform(spark.createDataFrame([['''Differential cell - protective function of two resveratrol (trans - 3, 5, 4 - trihydroxystilbene) glucosides against oxidative stress. Resveratrol (trans - 3, 5, 4 - trihydroxystilbene ; RSV) , a natural polyphenol, exerts a beneficial effect on health and diseases. RSV targets and activates the NAD(+) - dependent protein deacetylase SIRT1; in turn, SIRT1 induces an intracellular antioxidative mechanism by inducing mitochondrial superoxide dismutase (SOD2). Most RSV found in plants is glycosylated, and the effect of these glycosylated forms on SIRT1 has not been studied. 
''']]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val chemicals_ner_model = MedicalNerModel.pretrained("ner_chemicals_emb_clinical_large", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("chemicals_ner") val chemicals_ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "chemicals_ner")) .setOutputCol("chemicals_ner_chunk") val chemicals_pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, chemicals_ner_model, chemicals_ner_converter)) val data = Seq(""" Differential cell - protective function of two resveratrol (trans - 3, 5, 4 - trihydroxystilbene) glucosides against oxidative stress. Resveratrol (trans - 3, 5, 4 - trihydroxystilbene ; RSV) , a natural polyphenol, exerts a beneficial effect on health and diseases. RSV targets and activates the NAD(+) - dependent protein deacetylase SIRT1; in turn, SIRT1 induces an intracellular antioxidative mechanism by inducing mitochondrial superoxide dismutase (SOD2). Most RSV found in plants is glycosylated, and the effect of these glycosylated forms on SIRT1 has not been studied.""").toDS.toDF("text") val result = chemicals_pipeline.fit(data).transform(data) ```
## Results ```bash | | chunks | begin | end | entities | |---:|:-------------------------------------------------|--------:|------:|:-----------| | 0 | resveratrol | 48 | 58 | CHEM | | 1 | trans - 3, 5, 4 - trihydroxystilbene) glucosides | 61 | 108 | CHEM | | 2 | Resveratrol | 136 | 146 | CHEM | | 3 | trans - 3, 5, 4 - trihydroxystilbene | 149 | 185 | CHEM | | 4 | RSV | 189 | 191 | CHEM | | 5 | polyphenol | 206 | 215 | CHEM | | 6 | NAD(+) | 300 | 305 | CHEM | | 7 | superoxide | 436 | 445 | CHEM | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_chemicals_emb_clinical_large| |Compatibility:|Healthcare NLP 4.4.3+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token, embeddings]| |Output Labels:|[ner]| |Language:|en| |Size:|2.8 MB| ## Benchmarking ```bash label precision recall f1-score support CHEM 0.94 0.93 0.94 62001 micro-avg 0.94 0.93 0.94 62001 macro-avg 0.94 0.93 0.94 62001 weighted-avg 0.94 0.93 0.94 62001 ``` --- layout: model title: English RobertaForQuestionAnswering (from AnonymousSub) author: John Snow Labs name: roberta_qa_rule_based_roberta_twostagetriplet_hier_epochs_1_shard_1_squad2.0 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `rule_based_roberta_twostagetriplet_hier_epochs_1_shard_1_squad2.0` is an English model originally trained by `AnonymousSub`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_twostagetriplet_hier_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739809828.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_rule_based_roberta_twostagetriplet_hier_epochs_1_shard_1_squad2.0_en_4.0.0_3.0_1655739809828.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_rule_based_roberta_twostagetriplet_hier_epochs_1_shard_1_squad2.0","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_rule_based_roberta_twostagetriplet_hier_epochs_1_shard_1_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squadv2.roberta.base_rule_based_twostagetriplet_hier_epochs_1_shard_1.by_AnonymousSub").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_rule_based_roberta_twostagetriplet_hier_epochs_1_shard_1_squad2.0| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|306.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/AnonymousSub/rule_based_roberta_twostagetriplet_hier_epochs_1_shard_1_squad2.0 --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from machine2049) author: John Snow Labs name: distilbert_qa_machine2049_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `machine2049`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_machine2049_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772141683.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_machine2049_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772141683.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_machine2049_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_machine2049_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_machine2049_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/machine2049/distilbert-base-uncased-finetuned-squad --- layout: model title: Pipeline to Extract textual entities in biomedical texts author: John Snow Labs name: ner_nature_nero_clinical_pipeline date: 2023-03-14 tags: [ner, en, clinical, licensed] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_nature_nero_clinical](https://nlp.johnsnowlabs.com/2022/02/08/ner_nature_nero_clinical_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_nature_nero_clinical_pipeline_en_4.3.0_3.2_1678776843378.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_nature_nero_clinical_pipeline_en_4.3.0_3.2_1678776843378.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_nature_nero_clinical_pipeline", "en", "clinical/models") text = '''he patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_nature_nero_clinical_pipeline", "en", "clinical/models") val text = "he patient is a 21-day-old Caucasian male here for 2 days of congestion - mom has been suctioning yellow discharge from the patient's nares, plus she has noticed some mild problems with his breathing while feeding (but negative for any perioral cyanosis or retractions). One day ago, mom also noticed a tactile temperature and gave the patient Tylenol. Baby also has had some decreased p.o. intake. His normal breast-feeding is down from 20 minutes q.2h. to 5 to 10 minutes secondary to his respiratory congestion. He sleeps well, but has been more tired and has been fussy over the past 2 days. 
The parents noticed no improvement with albuterol treatments given in the ER. His urine output has also decreased; normally he has 8 to 10 wet and 5 dirty diapers per 24 hours, now he has down to 4 wet diapers per 24 hours. Mom denies any diarrhea. His bowel movements are yellow colored and soft in nature." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:---------------------------------------------|--------:|------:|:----------------------|-------------:| | 0 | perioral cyanosis | 236 | 252 | Medicalfinding | 0.198 | | 1 | One day | 271 | 277 | Duration | 0.35005 | | 2 | mom | 284 | 286 | Namedentity | 0.1301 | | 3 | tactile temperature | 303 | 321 | Quantityormeasurement | 0.1074 | | 4 | patient Tylenol | 336 | 350 | Chemical | 0.20805 | | 5 | decreased p.o. intake | 376 | 396 | Medicalprocedure | 0.105725 | | 6 | normal breast-feeding | 403 | 423 | Medicalfinding | 0.1769 | | 7 | 20 minutes q.2h | 438 | 452 | Timepoint | 0.275333 | | 8 | 5 to 10 minutes | 458 | 472 | Duration | 0.22645 | | 9 | respiratory congestion | 491 | 512 | Medicalfinding | 0.1423 | | 10 | past 2 days | 583 | 593 | Duration | 0.256867 | | 11 | parents | 600 | 606 | Persongroup | 0.9441 | | 12 | improvement | 619 | 629 | Process | 0.147 | | 13 | albuterol treatments | 636 | 655 | Medicalprocedure | 0.305 | | 14 | ER | 670 | 671 | Bodypart | 0.2024 | | 15 | urine output | 678 | 689 | Quantityormeasurement | 0.1283 | | 16 | 8 to 10 wet and 5 dirty diapers per 24 hours | 727 | 770 | Measurement | 0.121327 | | 17 | 4 wet diapers per 24 hours | 792 | 817 | Measurement | 0.1611 | | 18 | Mom | 820 | 822 | Person | 0.9515 | | 19 | diarrhea | 835 | 842 | Medicalfinding | 0.533 | | 20 | bowel movements | 849 | 863 | Biologicalprocess | 0.2036 | | 21 | soft in nature | 888 | 901 | Biologicalprocess | 0.170467 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_nature_nero_clinical_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: Pipeline to Detect Entities Related to Cancer 
Therapies author: John Snow Labs name: ner_oncology_therapy_pipeline date: 2023-03-09 tags: [clinical, en, licensed, oncology, treatment, ner] task: Named Entity Recognition language: en edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_oncology_therapy](https://nlp.johnsnowlabs.com/2022/11/24/ner_oncology_therapy_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_oncology_therapy_pipeline_en_4.3.0_3.2_1678351787302.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_oncology_therapy_pipeline_en_4.3.0_3.2_1678351787302.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_oncology_therapy_pipeline", "en", "clinical/models") text = '''The had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_oncology_therapy_pipeline", "en", "clinical/models") val text = "The had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast. The cancer recurred as a right lung metastasis 13 years later. The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy." val result = pipeline.fullAnnotate(text) ```
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:-------------------------------|--------:|------:|:----------------------|-------------:| | 0 | mastectomy | 36 | 45 | Cancer_Surgery | 0.9817 | | 1 | axillary lymph node dissection | 54 | 83 | Cancer_Surgery | 0.719725 | | 2 | radiotherapy | 183 | 194 | Radiotherapy | 0.9984 | | 3 | recurred | 239 | 246 | Response_To_Treatment | 0.9481 | | 4 | adriamycin | 337 | 346 | Chemotherapy | 0.9981 | | 5 | 60 mg/m2 | 349 | 356 | Dosage | 0.58815 | | 6 | cyclophosphamide | 363 | 378 | Chemotherapy | 0.9976 | | 7 | 600 mg/m2 | 381 | 389 | Dosage | 0.64205 | | 8 | six courses | 397 | 407 | Cycle_Count | 0.46815 | | 9 | first line | 413 | 422 | Line_Of_Therapy | 0.95015 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_oncology_therapy_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverterInternalModel --- layout: model title: English RoBERTa Embeddings (Large, Biology/Medical) author: John Snow Labs name: roberta_embeddings_pmc_med_bio_mlm_roberta_large date: 2022-04-14 tags: [roberta, embeddings, en, open_source] task: Embeddings language: en nav_key: models edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `pmc-med-bio-mlm-roberta-large` is an English model originally trained by `raynardj`. 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_pmc_med_bio_mlm_roberta_large_en_3.4.2_3.0_1649946971676.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_pmc_med_bio_mlm_roberta_large_en_3.4.2_3.0_1649946971676.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_pmc_med_bio_mlm_roberta_large","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_pmc_med_bio_mlm_roberta_large","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.embed.pmc_med_bio_mlm_roberta_large").predict("""I love Spark NLP""") ```
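Each row of the `embeddings` column above carries one dense vector per token. A common downstream use is comparing those vectors with cosine similarity; a dependency-free sketch with toy vectors (not real RoBERTa output):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # -> 1.0
```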
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_pmc_med_bio_mlm_roberta_large| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| ## References - https://huggingface.co/raynardj/pmc-med-bio-mlm-roberta-large --- layout: model title: Pipeline to Detect Living Species (bert_embeddings_bert_base_fr_cased) author: John Snow Labs name: ner_living_species_bert_pipeline date: 2023-03-13 tags: [fr, ner, clinical, licensed, bert] task: Named Entity Recognition language: fr edition: Healthcare NLP 4.3.0 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_living_species_bert](https://nlp.johnsnowlabs.com/2022/06/23/ner_living_species_bert_fr_3_0.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_pipeline_fr_4.3.0_3.2_1678730531080.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_living_species_bert_pipeline_fr_4.3.0_3.2_1678730531080.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("ner_living_species_bert_pipeline", "fr", "clinical/models") text = '''Femme de 47 ans allergique à l'iode, fumeuse sociale, opérée pour des varices, deux césariennes et un abcès fessier. Vit avec son mari et ses trois enfants, travaille comme enseignante. Initialement, le patient a eu une bonne évolution, mais au 2ème jour postopératoire, il a commencé à montrer une instabilité hémodynamique. Les sérologies pour Coxiella burnetii, Bartonella henselae, Borrelia burgdorferi, Entamoeba histolytica, Toxoplasma gondii, herpès simplex virus 1 et 2, cytomégalovirus, virus d'Epstein Barr, virus de la varicelle et du zona et parvovirus B19 étaient négatives. Cependant, un test au rose Bengale positif pour Brucella, le test de Coombs et les agglutinations étaient également positifs avec un titre de 1/40.''' result = pipeline.fullAnnotate(text) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("ner_living_species_bert_pipeline", "fr", "clinical/models") val text = "Femme de 47 ans allergique à l'iode, fumeuse sociale, opérée pour des varices, deux césariennes et un abcès fessier. Vit avec son mari et ses trois enfants, travaille comme enseignante. Initialement, le patient a eu une bonne évolution, mais au 2ème jour postopératoire, il a commencé à montrer une instabilité hémodynamique. Les sérologies pour Coxiella burnetii, Bartonella henselae, Borrelia burgdorferi, Entamoeba histolytica, Toxoplasma gondii, herpès simplex virus 1 et 2, cytomégalovirus, virus d'Epstein Barr, virus de la varicelle et du zona et parvovirus B19 étaient négatives. Cependant, un test au rose Bengale positif pour Brucella, le test de Coombs et les agglutinations étaient également positifs avec un titre de 1/40." val result = pipeline.fullAnnotate(text) ```
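`fullAnnotate` returns, for each input text, a dictionary of annotation lists keyed by output column, and tables like the one in the Results section are assembled from each chunk annotation's text, character offsets, and metadata. A sketch using a hypothetical stand-in `Annotation` class (field names chosen to mirror Spark NLP's annotation structure, but assumed here):

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    # hypothetical stand-in for a Spark NLP chunk annotation
    result: str            # chunk text
    begin: int             # start character offset
    end: int               # end character offset (inclusive)
    metadata: dict = field(default_factory=dict)

def to_rows(chunks):
    """Flatten chunk annotations into (text, begin, end, label, confidence) rows."""
    return [
        (c.result, c.begin, c.end,
         c.metadata.get("entity"), float(c.metadata.get("confidence", 0.0)))
        for c in chunks
    ]

rows = to_rows([Annotation("Femme", 0, 4, {"entity": "HUMAN", "confidence": "1.0"})])
print(rows)  # -> [('Femme', 0, 4, 'HUMAN', 1.0)]
```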
## Results ```bash | | ner_chunks | begin | end | ner_label | confidence | |---:|:---------------------------------|--------:|------:|:------------|-------------:| | 0 | Femme | 0 | 4 | HUMAN | 1 | | 1 | mari | 130 | 133 | HUMAN | 1 | | 2 | enfants | 148 | 154 | HUMAN | 0.9999 | | 3 | patient | 203 | 209 | HUMAN | 0.9993 | | 4 | Coxiella burnetii | 346 | 362 | SPECIES | 0.9879 | | 5 | Bartonella henselae | 365 | 383 | SPECIES | 0.9926 | | 6 | Borrelia burgdorferi | 386 | 405 | SPECIES | 0.9959 | | 7 | Entamoeba histolytica | 408 | 428 | SPECIES | 0.9913 | | 8 | Toxoplasma gondii | 431 | 447 | SPECIES | 0.97845 | | 9 | cytomégalovirus | 479 | 493 | SPECIES | 0.9976 | | 10 | virus d'Epstein Barr | 496 | 515 | SPECIES | 0.967967 | | 11 | virus de la varicelle et du zona | 518 | 549 | SPECIES | 0.985429 | | 12 | parvovirus B19 | 554 | 567 | SPECIES | 0.98595 | | 13 | Brucella | 636 | 643 | SPECIES | 0.9995 | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_living_species_bert_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Language:|fr| |Size:|410.6 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverterInternalModel --- layout: model title: Turkish Named Entity Recognition (from akdeniz27) author: John Snow Labs name: xlmroberta_ner_xlm_roberta_base_turkish_ner date: 2022-05-17 tags: [xlm_roberta, ner, token_classification, tr, open_source] task: Named Entity Recognition language: tr edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Named Entity Recognition model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. 
`xlm-roberta-base-turkish-ner` is a Turkish model originally trained by `akdeniz27`. ## Predicted Entities `LOC`, `ORG`, `PER` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_turkish_ner_tr_3.4.2_3.0_1652807661266.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_xlm_roberta_base_turkish_ner_tr_3.4.2_3.0_1652807661266.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_turkish_ner","tr") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["Spark NLP'yi seviyorum"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_xlm_roberta_base_turkish_ner","tr") .setInputCols(Array("sentence", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("Spark NLP'yi seviyorum").toDF("text") val result = pipeline.fit(data).transform(data) ```
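The token classifier writes one IOB-style tag per token to the `ner` column; entity chunks are then recovered by merging each `B-` tag with the `I-` tags that follow it (in Spark NLP this is what an `NerConverter` stage does, with more bookkeeping than shown). A simplified, dependency-free sketch of that merging:

```python
def bio_to_chunks(tokens, tags):
    """Merge IOB2 tags into (chunk_text, label) pairs. Simplified sketch."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # an "O" tag or an inconsistent I- tag closes the current chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Mustafa", "Kemal", "Ankara'ya", "gitti"]
tags = ["B-PER", "I-PER", "B-LOC", "O"]
print(bio_to_chunks(tokens, tags))  # -> [('Mustafa Kemal', 'PER'), ("Ankara'ya", 'LOC')]
```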
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_xlm_roberta_base_turkish_ner| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|tr| |Size:|851.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/akdeniz27/xlm-roberta-base-turkish-ner - https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt --- layout: model title: Extract relations between drugs and proteins (ReDL) author: John Snow Labs name: redl_drugprot_biobert date: 2022-01-05 tags: [relation_extraction, clinical, en, licensed] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.3.4 spark_version: 3.0 supported: true annotator: RelationExtractionDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Detect interactions between chemical compounds/drugs and genes/proteins using BERT, by classifying whether a specified semantic relation holds between chemical and gene entities within a sentence or document. The entity labels used during training were derived from the [custom NER model](https://nlp.johnsnowlabs.com/2021/12/20/ner_drugprot_clinical_en.html) created by our team for the [DrugProt corpus](https://zenodo.org/record/5119892). These include `CHEMICAL` for chemical compounds/drugs, `GENE` for genes/proteins, and `GENE_AND_CHEMICAL` for entity mentions of type `GENE` and of type `CHEMICAL` that overlap (such as enzymes and small peptides). The relation categories from the [DrugProt corpus](https://zenodo.org/record/5119892) were condensed from 13 categories to 10 categories due to low numbers of examples for certain categories. This merging process involved grouping the `SUBSTRATE_PRODUCT-OF` and `SUBSTRATE` relation categories together and grouping the `AGONIST-ACTIVATOR`, `AGONIST-INHIBITOR` and `AGONIST` relation categories together. 
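The label merging described above can be expressed as a lookup table. Only the two groupings named in the description are encoded below; every other DrugProt label is assumed to map to itself:

```python
# Merges described in the model card; all other labels are kept unchanged.
MERGES = {
    "SUBSTRATE_PRODUCT-OF": "SUBSTRATE",
    "AGONIST-ACTIVATOR": "AGONIST",
    "AGONIST-INHIBITOR": "AGONIST",
}

def merge_label(label):
    """Map an original DrugProt relation label to its condensed category."""
    return MERGES.get(label, label)

print(merge_label("AGONIST-ACTIVATOR"))  # -> AGONIST
```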
## Predicted Entities `INHIBITOR`, `DIRECT-REGULATOR`, `SUBSTRATE`, `ACTIVATOR`, `INDIRECT-UPREGULATOR`, `INDIRECT-DOWNREGULATOR`, `ANTAGONIST`, `PRODUCT-OF`, `PART-OF`, `AGONIST` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_drugprot_biobert_en_3.3.4_3.0_1641393971428.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_drugprot_biobert_en_3.3.4_3.0_1641393971428.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use The table below shows the `redl_drugprot_biobert` RE model, its labels, the optimal companion NER model, and the meaningful relation pairs.

| RE MODEL | RE MODEL LABELS | NER MODEL | RE PAIRS |
|:--------:|:----------------|:---------:|:---------|
| redl_drugprot_biobert | INHIBITOR, DIRECT-REGULATOR, SUBSTRATE, ACTIVATOR, INDIRECT-UPREGULATOR, INDIRECT-DOWNREGULATOR, ANTAGONIST, PRODUCT-OF, PART-OF, AGONIST | ner_drugprot_clinical | ["chemical-gene", "chemical-gene_and_chemical", "gene_and_chemical-gene"] |
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencer = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentences") tokenizer = Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") words_embedder = WordEmbeddingsModel()\ .pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentences", "tokens"])\ .setOutputCol("embeddings") drugprot_ner_tagger = MedicalNerModel.pretrained("ner_drugprot_clinical", "en", "clinical/models")\ .setInputCols("sentences", "tokens", "embeddings")\ .setOutputCol("ner_tags") ner_converter = NerConverter()\ .setInputCols(["sentences", "tokens", "ner_tags"])\ .setOutputCol("ner_chunks") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models")\ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") dependency_parser = DependencyParserModel()\ .pretrained("dependency_conllu", "en")\ .setInputCols(["sentences", "pos_tags", "tokens"])\ .setOutputCol("dependencies") # Set a filter on pairs of named entities which will be treated as relation candidates drugprot_re_ner_chunk_filter = RENerChunksFilter()\ .setInputCols(["ner_chunks", "dependencies"])\ .setOutputCol("re_ner_chunks")\ .setMaxSyntacticDistance(4) # .setRelationPairs(['CHEMICAL-GENE']) drugprot_re_Model = RelationExtractionDLModel()\ .pretrained('redl_drugprot_biobert', "en", "clinical/models")\ .setPredictionThreshold(0.9)\ .setInputCols(["re_ner_chunks", "sentences"])\ .setOutputCol("relations") pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, words_embedder, drugprot_ner_tagger, ner_converter, pos_tagger, dependency_parser, drugprot_re_ner_chunk_filter, drugprot_re_Model]) text='''Lipid specific activation of the murine P4-ATPase Atp8a1 (ATPase II). 
The asymmetric transbilayer distribution of phosphatidylserine (PS) in the mammalian plasma membrane and secretory vesicles is maintained, in part, by an ATP-dependent transporter. This aminophospholipid "flippase" selectively transports PS to the cytosolic leaflet of the bilayer and is sensitive to vanadate, Ca(2+), and modification by sulfhydryl reagents. Although the flippase has not been positively identified, a subfamily of P-type ATPases has been proposed to function as transporters of amphipaths, including PS and other phospholipids. A candidate PS flippase ATP8A1 (ATPase II), originally isolated from bovine secretory vesicles, is a member of this subfamily based on sequence homology to the founding member of the subfamily, the yeast protein Drs2, which has been linked to ribosomal assembly, the formation of Golgi-coated vesicles, and the maintenance of PS asymmetry. To determine if ATP8A1 has biochemical characteristics consistent with a PS flippase, a murine homologue of this enzyme was expressed in insect cells and purified. The purified Atp8a1 is inactive in detergent micelles or in micelles containing phosphatidylcholine, phosphatidic acid, or phosphatidylinositol, is minimally activated by phosphatidylglycerol or phosphatidylethanolamine (PE), and is maximally activated by PS. The selectivity for PS is dependent upon multiple elements of the lipid structure. Similar to the plasma membrane PS transporter, Atp8a1 is activated only by the naturally occurring sn-1,2-glycerol isomer of PS and not the sn-2,3-glycerol stereoisomer. Both flippase and Atp8a1 activities are insensitive to the stereochemistry of the serine headgroup. Most modifications of the PS headgroup structure decrease recognition by the plasma membrane PS flippase. 
Activation of Atp8a1 is also reduced by these modifications; phosphatidylserine-O-methyl ester, lysophosphatidylserine, glycerophosphoserine, and phosphoserine, which are not transported by the plasma membrane flippase, do not activate Atp8a1. Weakly translocated lipids (PE, phosphatidylhydroxypropionate, and phosphatidylhomoserine) are also weak Atp8a1 activators. However, N-methyl-phosphatidylserine, which is transported by the plasma membrane flippase at a rate equivalent to PS, is incapable of activating Atp8a1 activity. These results indicate that the ATPase activity of the secretory granule Atp8a1 is activated by phospholipids binding to a specific site whose properties (PS selectivity, dependence upon glycerol but not serine, stereochemistry, and vanadate sensitivity) are similar to, but distinct from, the properties of the substrate binding site of the plasma membrane flippase.''' data = spark.createDataFrame([[text]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala ... 
val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentencer = new SentenceDetector() .setInputCols("document") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val drugprot_ner_tagger = MedicalNerModel.pretrained("ner_drugprot_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") // Set a filter on pairs of named entities which will be treated as relation candidates val drugprot_re_ner_chunk_filter = new RENerChunksFilter() .setInputCols(Array("ner_chunks", "dependencies")) .setMaxSyntacticDistance(10) .setOutputCol("re_ner_chunks") // .setRelationPairs(Array("CHEMICAL-GENE")) // This model can also be trained on document-level relations - in which case, while predicting, use "document" instead of "sentence" as input. 
val drugprot_re_Model = RelationExtractionDLModel() .pretrained("redl_drugprot_biobert", "en", "clinical/models") .setPredictionThreshold(0.9) .setInputCols(Array("re_ner_chunks", "sentences")) .setOutputCol("relations") val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, words_embedder, drugprot_ner_tagger, ner_converter, pos_tagger, dependency_parser, drugprot_re_ner_chunk_filter, drugprot_re_Model)) val data = Seq("""Lipid specific activation of the murine P4-ATPase Atp8a1 (ATPase II). The asymmetric transbilayer distribution of phosphatidylserine (PS) in the mammalian plasma membrane and secretory vesicles is maintained, in part, by an ATP-dependent transporter. This aminophospholipid "flippase" selectively transports PS to the cytosolic leaflet of the bilayer and is sensitive to vanadate, Ca(2+), and modification by sulfhydryl reagents. Although the flippase has not been positively identified, a subfamily of P-type ATPases has been proposed to function as transporters of amphipaths, including PS and other phospholipids. A candidate PS flippase ATP8A1 (ATPase II), originally isolated from bovine secretory vesicles, is a member of this subfamily based on sequence homology to the founding member of the subfamily, the yeast protein Drs2, which has been linked to ribosomal assembly, the formation of Golgi-coated vesicles, and the maintenance of PS asymmetry. To determine if ATP8A1 has biochemical characteristics consistent with a PS flippase, a murine homologue of this enzyme was expressed in insect cells and purified. The purified Atp8a1 is inactive in detergent micelles or in micelles containing phosphatidylcholine, phosphatidic acid, or phosphatidylinositol, is minimally activated by phosphatidylglycerol or phosphatidylethanolamine (PE), and is maximally activated by PS. The selectivity for PS is dependent upon multiple elements of the lipid structure. 
Similar to the plasma membrane PS transporter, Atp8a1 is activated only by the naturally occurring sn-1,2-glycerol isomer of PS and not the sn-2,3-glycerol stereoisomer. Both flippase and Atp8a1 activities are insensitive to the stereochemistry of the serine headgroup. Most modifications of the PS headgroup structure decrease recognition by the plasma membrane PS flippase. Activation of Atp8a1 is also reduced by these modifications; phosphatidylserine-O-methyl ester, lysophosphatidylserine, glycerophosphoserine, and phosphoserine, which are not transported by the plasma membrane flippase, do not activate Atp8a1. Weakly translocated lipids (PE, phosphatidylhydroxypropionate, and phosphatidylhomoserine) are also weak Atp8a1 activators. However, N-methyl-phosphatidylserine, which is transported by the plasma membrane flippase at a rate equivalent to PS, is incapable of activating Atp8a1 activity. These results indicate that the ATPase activity of the secretory granule Atp8a1 is activated by phospholipids binding to a specific site whose properties (PS selectivity, dependence upon glycerol but not serine, stereochemistry, and vanadate sensitivity) are similar to, but distinct from, the properties of the substrate binding site of the plasma membrane flippase.""").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.drugprot").predict("""Lipid specific activation of the murine P4-ATPase Atp8a1 (ATPase II). The asymmetric transbilayer distribution of phosphatidylserine (PS) in the mammalian plasma membrane and secretory vesicles is maintained, in part, by an ATP-dependent transporter. This aminophospholipid "flippase" selectively transports PS to the cytosolic leaflet of the bilayer and is sensitive to vanadate, Ca(2+), and modification by sulfhydryl reagents. 
Although the flippase has not been positively identified, a subfamily of P-type ATPases has been proposed to function as transporters of amphipaths, including PS and other phospholipids. A candidate PS flippase ATP8A1 (ATPase II), originally isolated from bovine secretory vesicles, is a member of this subfamily based on sequence homology to the founding member of the subfamily, the yeast protein Drs2, which has been linked to ribosomal assembly, the formation of Golgi-coated vesicles, and the maintenance of PS asymmetry. To determine if ATP8A1 has biochemical characteristics consistent with a PS flippase, a murine homologue of this enzyme was expressed in insect cells and purified. The purified Atp8a1 is inactive in detergent micelles or in micelles containing phosphatidylcholine, phosphatidic acid, or phosphatidylinositol, is minimally activated by phosphatidylglycerol or phosphatidylethanolamine (PE), and is maximally activated by PS. The selectivity for PS is dependent upon multiple elements of the lipid structure. Similar to the plasma membrane PS transporter, Atp8a1 is activated only by the naturally occurring sn-1,2-glycerol isomer of PS and not the sn-2,3-glycerol stereoisomer. Both flippase and Atp8a1 activities are insensitive to the stereochemistry of the serine headgroup. Most modifications of the PS headgroup structure decrease recognition by the plasma membrane PS flippase. Activation of Atp8a1 is also reduced by these modifications; phosphatidylserine-O-methyl ester, lysophosphatidylserine, glycerophosphoserine, and phosphoserine, which are not transported by the plasma membrane flippase, do not activate Atp8a1. Weakly translocated lipids (PE, phosphatidylhydroxypropionate, and phosphatidylhomoserine) are also weak Atp8a1 activators. However, N-methyl-phosphatidylserine, which is transported by the plasma membrane flippase at a rate equivalent to PS, is incapable of activating Atp8a1 activity. 
These results indicate that the ATPase activity of the secretory granule Atp8a1 is activated by phospholipids binding to a specific site whose properties (PS selectivity, dependence upon glycerol but not serine, stereochemistry, and vanadate sensitivity) are similar to, but distinct from, the properties of the substrate binding site of the plasma membrane flippase.""") ```
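The `setPredictionThreshold(0.9)` call in the pipeline above discards candidate relations that the model scores below 0.9 confidence. A minimal Python sketch of that filtering step (the candidate list and scores below are invented for illustration; the real model scores every chunk pair produced by the chunk filter):

```python
# Toy sketch of confidence thresholding on relation-extraction output.
# Candidates and scores are invented for illustration only.
def filter_relations(candidates, threshold=0.9):
    """Keep only candidate relations at or above the confidence threshold."""
    return [c for c in candidates if c["confidence"] >= threshold]

candidates = [
    {"relation": "SUBSTRATE", "chunk1": "PS", "chunk2": "flippase", "confidence": 0.998},
    {"relation": "ACTIVATOR", "chunk1": "PE", "chunk2": "Atp8a1", "confidence": 0.72},
]

print([c["relation"] for c in filter_relations(candidates)])  # -> ['SUBSTRATE']
```

Lowering the threshold trades precision for recall: at 0.5, both candidates above would survive.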
## Results ```bash +---------+--------+-------------+-----------+--------------------+-------+-------------+-----------+--------------------+----------+ | relation| entity1|entity1_begin|entity1_end| chunk1|entity2|entity2_begin|entity2_end| chunk2|confidence| +---------+--------+-------------+-----------+--------------------+-------+-------------+-----------+--------------------+----------+ |SUBSTRATE|CHEMICAL| 308| 310| PS| GENE| 275| 283| flippase| 0.998399| |ACTIVATOR|CHEMICAL| 1563| 1578| sn-1,2-glycerol| GENE| 1479| 1509|plasma membrane P...| 0.999304| |ACTIVATOR|CHEMICAL| 1563| 1578| sn-1,2-glycerol| GENE| 1511| 1517| Atp8a1| 0.979057| |ACTIVATOR|CHEMICAL| 2112| 2114| PE| GENE| 2189| 2195| Atp8a1| 0.998299| |ACTIVATOR|CHEMICAL| 2116| 2145|phosphatidylhydro...| GENE| 2189| 2195| Atp8a1| 0.981534| |ACTIVATOR|CHEMICAL| 2151| 2173|phosphatidylhomos...| GENE| 2189| 2195| Atp8a1| 0.988504| |SUBSTRATE|CHEMICAL| 2217| 2244|N-methyl-phosphat...| GENE| 2290| 2298| flippase| 0.994092| |ACTIVATOR|CHEMICAL| 1292| 1312|phosphatidylglycerol| GENE| 1134| 1140| Atp8a1| 0.994409| |ACTIVATOR|CHEMICAL| 1316| 1340|phosphatidylethan...| GENE| 1134| 1140| Atp8a1| 0.988359| |ACTIVATOR|CHEMICAL| 1342| 1344| PE| GENE| 1134| 1140| Atp8a1| 0.988399| |ACTIVATOR|CHEMICAL| 1377| 1379| PS| GENE| 1134| 1140| Atp8a1| 0.996349| |ACTIVATOR|CHEMICAL| 2526| 2528| PS| GENE| 2444| 2450| Atp8a1| 0.978597| |ACTIVATOR|CHEMICAL| 2526| 2528| PS| GENE| 2403| 2409| ATPase| 0.988679| +---------+--------+-------------+-----------+--------------------+-------+-------------+-----------+--------------------+----------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|redl_drugprot_biobert| |Compatibility:|Healthcare NLP 3.3.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|405.4 MB| ## Data Source This model was trained on the [DrugProt corpus](https://zenodo.org/record/5119892). 
## Benchmarking

```bash
label                   recall  precision  f1     support
ACTIVATOR               0.885   0.776      0.827  235
AGONIST                 0.810   0.925      0.864  137
ANTAGONIST              0.970   0.919      0.944  199
DIRECT-REGULATOR        0.836   0.901      0.867  403
INDIRECT-DOWNREGULATOR  0.885   0.850      0.867  313
INDIRECT-UPREGULATOR    0.844   0.887      0.865  270
INHIBITOR               0.947   0.937      0.942  1083
PART-OF                 0.939   0.889      0.913  247
PRODUCT-OF              0.697   0.953      0.805  145
SUBSTRATE               0.912   0.884      0.898  468
Avg                     0.873   0.892      0.879  -
Weighted-Avg            0.897   0.899      0.897  -
```

---
layout: model
title: Legal Assertion Status (Negation)
author: John Snow Labs
name: legassertion_negation
date: 2023-01-01
tags: [negation, en, licensed]
task: Assertion Status
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This is a Legal Negation model that identifies whether an NER entity is negated in its context.

## Predicted Entities

`positive`, `negative`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legassertion_negation_en_1.0.0_3.0_1672578547085.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legassertion_negation_en_1.0.0_3.0_1672578547085.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python import pyspark.sql.functions as F document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetector() \ .setInputCols(["document"]) \ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner = legal.NerModel.pretrained("legner_orgs_prods_alias","en","legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter() \ .setInputCols(["sentence", "token", "ner"]) \ .setOutputCol("ner_chunk") legassertion = legal.AssertionDLModel.pretrained("legassertion_negation", "en", "legal/models")\ .setInputCols(["sentence", "ner_chunk", "embeddings"])\ .setOutputCol("leglabel") pipe = nlp.Pipeline(stages = [ document_assembler, sentence_detector, tokenizer, embeddings, ner, ner_converter, legassertion]) text = "Gradio INC will not be entering into a joint agreement with Hugging Face, Inc." sdf = spark.createDataFrame([[text]]).toDF("text") res = pipe.fit(sdf).transform(sdf) res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.leglabel.result)).alias("cols"))\ .select(F.expr("cols['0']").alias("ner_chunk"), F.expr("cols['1']").alias("assertion")).show(200, truncate=100) ```
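For intuition only, a toy cue-word heuristic below reproduces the predictions for the example sentence. This is NOT how `legassertion_negation` works internally (it is a deep-learning model over contextual embeddings); the cue list and window size are invented:

```python
# Toy negation heuristic for intuition only -- NOT the model's method.
# Cue list and window size are invented; the real model learns the
# context from Bert embeddings.
NEG_CUES = {"not", "no", "never", "without"}

def assertion_status(sentence, chunk, window=4):
    tokens = sentence.lower().replace(",", "").split()
    chunk_tokens = chunk.lower().split()
    for i in range(len(tokens) - len(chunk_tokens) + 1):
        if tokens[i:i + len(chunk_tokens)] == chunk_tokens:
            context = tokens[max(0, i - window): i + len(chunk_tokens) + window]
            return "negative" if NEG_CUES & set(context) else "positive"
    return "positive"

text = "Gradio INC will not be entering into a joint agreement with Hugging Face, Inc."
print(assertion_status(text, "Gradio INC"))    # -> negative ("not" falls in the window)
print(assertion_status(text, "Hugging Face"))  # -> positive (no cue nearby)
```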
## Results

```bash
+-----------------+---------+
|        ner_chunk|assertion|
+-----------------+---------+
|       Gradio INC| negative|
|Hugging Face, Inc| positive|
+-----------------+---------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legassertion_negation|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[document, chunk, embeddings]|
|Output Labels:|[assertion]|
|Language:|en|
|Size:|2.2 MB|

## References

In-house annotated legal sentences

## Benchmarking

```bash
label          tp  fp  fn  prec       rec        f1
negative       26   0   1  1.0        0.962963   0.9811321
positive       38   1   0  0.974359   1.0        0.987013
Macro-average  64   1   1  0.9871795  0.9814815  0.9843222
Micro-average              0.9846154  0.9846154  0.9846154
```

---
layout: model
title: German asr_exp_w2v2t_r_wav2vec2_s37 TFWav2Vec2ForCTC from jonatasgrosman
author: John Snow Labs
name: asr_exp_w2v2t_r_wav2vec2_s37
date: 2022-09-25
tags: [wav2vec2, de, audio, open_source, asr]
task: Automatic Speech Recognition
language: de
edition: Spark NLP 4.2.0
spark_version: 3.0
supported: true
annotator: Wav2Vec2ForCTC
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2t_r_wav2vec2_s37` is a German model originally trained by jonatasgrosman.

NOTE: This model only works on a CPU; if you need to run it on a GPU, please use `asr_exp_w2v2t_r_wav2vec2_s37_gpu`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2t_r_wav2vec2_s37_de_4.2.0_3.0_1664107978613.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2t_r_wav2vec2_s37_de_4.2.0_3.0_1664107978613.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2t_r_wav2vec2_s37", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2t_r_wav2vec2_s37", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
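Under the hood, Wav2Vec2ForCTC emits one label distribution per audio frame, and the transcript comes from CTC decoding: collapse runs of identical frame labels, then drop the blank symbol. A minimal greedy-decoding sketch (the frame labels and vocabulary below are toy stand-ins, not the model's actual output):

```python
# Greedy CTC decode: collapse runs of identical frame labels, then drop
# the blank. Frame labels are invented for illustration.
BLANK = "_"

def ctc_greedy_decode(frame_labels):
    collapsed, prev = [], None
    for label in frame_labels:
        if label != prev:
            collapsed.append(label)
        prev = label
    return "".join(l for l in collapsed if l != BLANK)

# The blank between the two "l" runs preserves the double letter.
print(ctc_greedy_decode(["h", "h", "e", "_", "l", "l", "_", "l", "o"]))  # -> hello
```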
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2t_r_wav2vec2_s37| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: Vietnamese Lemmatizer author: John Snow Labs name: lemma date: 2021-04-02 tags: [vi, open_source, lemmatizer] task: Lemmatization language: vi edition: Spark NLP 3.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a dictionary-based lemmatizer that assigns all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/TEXT_PREPROCESSING/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TEXT_PREPROCESSING.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_vi_3.0.0_3.0_1617388850136.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_vi_3.0.0_3.0_1617388850136.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

lemmatizer = LemmatizerModel.pretrained("lemma", "vi") \
    .setInputCols(["token"]) \
    .setOutputCol("lemma")

pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer])

example = spark.createDataFrame([['Tất cả đều hồi hộp .']], ["text"])

results = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val lemmatizer = LemmatizerModel.pretrained("lemma", "vi")
    .setInputCols("token")
    .setOutputCol("lemma")

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer))

val data = Seq("Tất cả đều hồi hộp .").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu

text = ["Tất cả đều hồi hộp ."]
lemma_df = nlu.load('vi.lemma').predict(text, output_level="document")
lemma_df.lemma.values[0]
```
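Conceptually, a dictionary-based lemmatizer is a table lookup from inflected forms to a single root. A minimal sketch with a few invented English entries (the actual model ships a Vietnamese dictionary):

```python
# Minimal dictionary lemmatizer: each inflected form maps to one root.
# Entries are invented English examples, not the model's Vietnamese data.
LEMMA_DICT = {"ran": "run", "running": "run", "runs": "run", "better": "good"}

def lemmatize(tokens):
    # Fall back to the surface form when the dictionary has no entry.
    return [LEMMA_DICT.get(token.lower(), token) for token in tokens]

print(lemmatize(["She", "ran", "better", "races"]))  # -> ['She', 'run', 'good', 'races']
```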
## Results

```bash
+-----+
|lemma|
+-----+
|  Tất|
|   cả|
|  đều|
|  hồi|
|  hộp|
|    .|
+-----+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|lemma|
|Compatibility:|Spark NLP 3.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token]|
|Output Labels:|[lemma]|
|Language:|vi|

## Data Source

The model was trained on version 2.7 of [Universal Dependencies](https://www.universaldependencies.org).

## Benchmarking

```bash
Precision=0.96, Recall=0.89, F1-score=0.93
```

---
layout: model
title: Japanese Electra Embeddings (from izumi-lab)
author: John Snow Labs
name: electra_embeddings_electra_base_japanese_generator
date: 2022-05-17
tags: [ja, open_source, electra, embeddings]
task: Embeddings
language: ja
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Electra Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-japanese-generator` is a Japanese model originally trained by `izumi-lab`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_japanese_generator_ja_3.4.4_3.0_1652786593799.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_embeddings_electra_base_japanese_generator_ja_3.4.4_3.0_1652786593799.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_japanese_generator","ja") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Spark NLPが大好きです"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("electra_embeddings_electra_base_japanese_generator","ja") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Spark NLPが大好きです").toDF("text") val result = pipeline.fit(data).transform(data) ```
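Token embeddings like these are typically consumed by comparing vectors, for example with cosine similarity. A self-contained sketch on invented 3-d vectors (the real model emits 768-dimensional contextual vectors per token):

```python
import math

# Cosine similarity between two embedding vectors (toy 3-d examples;
# real Electra/Bert embeddings are 768-dimensional).
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine((1.0, 0.0, 1.0), (1.0, 0.0, 0.5)), 3))  # -> 0.949
```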
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|electra_embeddings_electra_base_japanese_generator|
|Compatibility:|Spark NLP 3.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|ja|
|Size:|132.8 MB|
|Case sensitive:|true|

## References

- https://huggingface.co/izumi-lab/electra-base-japanese-generator
- https://github.com/google-research/electra
- https://github.com/retarfi/language-pretraining/tree/v1.0
- https://arxiv.org/abs/2003.10555
- https://creativecommons.org/licenses/by-sa/4.0/

---
layout: model
title: Turkish BertForQuestionAnswering model (from husnu)
author: John Snow Labs
name: bert_qa_bert_base_turkish_cased_finetuned_lr_2e_05_epochs_3
date: 2022-06-02
tags: [open_source, question_answering, bert]
task: Question Answering
language: tr
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-turkish-cased-finetuned_lr-2e-05_epochs-3` is a Turkish model originally trained by `husnu`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_turkish_cased_finetuned_lr_2e_05_epochs_3_tr_4.0.0_3.0_1654180690796.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_turkish_cased_finetuned_lr_2e_05_epochs_3_tr_4.0.0_3.0_1654180690796.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_turkish_cased_finetuned_lr_2e_05_epochs_3","tr") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_turkish_cased_finetuned_lr_2e_05_epochs_3","tr") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("tr.answer_question.bert.base_cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
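Extractive QA models like this one score each context token as a potential answer start and answer end; the predicted answer is the span maximising the combined start and end logits. A minimal decoding sketch with invented logits (the model produces real ones per context token):

```python
# Pick the (start, end) token pair with the highest combined logit,
# subject to start <= end and a maximum span length. Logits invented.
def best_span(start_logits, end_logits, max_len=15):
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.0, 0.2, 5.0, 0.0, 0.0, 0.0, 0.0, 0.3, 0.0]
end_logits   = [0.0, 0.1, 0.0, 4.8, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0]

s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # -> Clara
```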
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_turkish_cased_finetuned_lr_2e_05_epochs_3|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|tr|
|Size:|412.8 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/husnu/bert-base-turkish-cased-finetuned_lr-2e-05_epochs-3

---
layout: model
title: Legal Securities Purchase Agreement Document Classifier (Bert Sentence Embeddings)
author: John Snow Labs
name: legclf_securities_purchase_agreement_bert
date: 2022-11-25
tags: [en, legal, classification, agreement, securities_purchase, licensed, bert]
task: Text Classification
language: en
nav_key: models
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The `legclf_securities_purchase_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `securities-purchase-agreement` or not (binary classification). Unlike the Longformer model, this model is lighter in terms of inference time.

## Predicted Entities

`securities-purchase-agreement`, `other`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_securities_purchase_agreement_bert_en_1.0.0_3.0_1669369230699.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_securities_purchase_agreement_bert_en_1.0.0_3.0_1669369230699.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_securities_purchase_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
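The classifier head sits on top of a single sentence-embedding vector per document. For intuition, a toy nearest-centroid sketch over invented 2-d vectors (the real model is a trained DL classifier over 768-dimensional Bert sentence embeddings, not a centroid lookup):

```python
import math

# Toy nearest-centroid document classifier; the centroids and query
# vector are invented 2-d stand-ins for real sentence embeddings.
CENTROIDS = {
    "securities-purchase-agreement": (0.9, 0.1),
    "other": (0.1, 0.9),
}

def classify(embedding):
    return min(CENTROIDS, key=lambda label: math.dist(embedding, CENTROIDS[label]))

print(classify((0.8, 0.2)))  # -> securities-purchase-agreement
```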
## Results

```bash
+-------------------------------+
|result                         |
+-------------------------------+
|[securities-purchase-agreement]|
|[other]                        |
|[other]                        |
|[securities-purchase-agreement]|
+-------------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_securities_purchase_agreement_bert|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Size:|22.6 MB|

## References

Legal documents scraped from the Internet and classified in-house, plus SEC documents

## Benchmarking

```bash
label                          precision  recall  f1-score  support
other                          0.93       0.95    0.94      65
securities-purchase-agreement  0.93       0.89    0.91      45
accuracy                       -          -       0.93      110
macro-avg                      0.93       0.92    0.92      110
weighted-avg                   0.93       0.93    0.93      110
```

---
layout: model
title: English BertForQuestionAnswering model (from anas-awadalla)
author: John Snow Labs
name: bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_4
date: 2022-06-02
tags: [en, open_source, question_answering, bert]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-1024-finetuned-squad-seed-4` is an English model originally trained by `anas-awadalla`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_4_en_4.0.0_3.0_1654189557504.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_4_en_4.0.0_3.0_1654189557504.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_4","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_4","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.span_bert.base_cased_1024d_seed_4").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_4|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|390.5 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-1024-finetuned-squad-seed-4

---
layout: model
title: Intent Classification for Airline Traffic Information System queries (ATIS dataset)
author: John Snow Labs
name: classifierdl_use_atis
date: 2021-01-25
task: Text Classification
language: en
nav_key: models
edition: Spark NLP 2.7.1
spark_version: 2.4
tags: [en, classifier, open_source]
supported: true
annotator: ClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Classify user questions into 5 categories of an airline traffic information system.

## Predicted Entities

`atis_abbreviation`, `atis_airfare`, `atis_airline`, `atis_flight`, `atis_ground_service`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/classifierdl_use_atis_en_2.7.1_2.4_1611572512585.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_atis_en_2.7.1_2.4_1611572512585.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained('tfhub_use', lang="en") \
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

document_classifier = ClassifierDLModel.pretrained('classifierdl_use_atis', 'en') \
    .setInputCols(["document", "sentence_embeddings"]) \
    .setOutputCol("class")

nlpPipeline = Pipeline(stages=[document_assembler, use, document_classifier])

light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text")))

annotations = light_pipeline.fullAnnotate(['what is the price of flight from newyork to washington',
                                           'how many flights does twa have in business class'])
```

{:.nlu-block}
```python
import nlu
nlu.load("en.classify.intent.airline").predict("""what is the price of flight from newyork to washington""")
```
## Results

```bash
+-------------------------------------------------------------------+----------------+
| document                                                          | class          |
+-------------------------------------------------------------------+----------------+
| what is the price of flight from newyork to washington            | atis_airfare   |
| how many flights does twa have in business class                  | atis_quantity  |
+-------------------------------------------------------------------+----------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|classifierdl_use_atis|
|Compatibility:|Spark NLP 2.7.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[class]|
|Language:|en|
|Dependencies:|tfhub_use|

## Data Source

This model is trained on data obtained from https://www.kaggle.com/hassanamin/atis-airlinetravelinformationsystem

## Benchmarking

```bash
                     precision  recall  f1-score  support
atis_abbreviation    1.00       1.00    1.00      33
atis_airfare         0.60       0.96    0.74      48
atis_airline         0.69       0.89    0.78      38
atis_flight          0.99       0.93    0.96      632
atis_ground_service  1.00       1.00    1.00      36
accuracy                                0.93      787
macro avg            0.86       0.96    0.90      787
weighted avg         0.95       0.93    0.94      787
```

---
layout: model
title: English RobertaForQuestionAnswering (from anas-awadalla)
author: John Snow Labs
name: roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_8
date: 2022-06-20
tags: [en, open_source, question_answering, roberta]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: RoBertaForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-few-shot-k-64-finetuned-squad-seed-8` is an English model originally trained by `anas-awadalla`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_8_en_4.0.0_3.0_1655733637029.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_8_en_4.0.0_3.0_1655733637029.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_8","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_8","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.base_64d_seed_8").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_roberta_base_few_shot_k_64_finetuned_squad_seed_8|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[question, context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|419.4 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/anas-awadalla/roberta-base-few-shot-k-64-finetuned-squad-seed-8

---
layout: model
title: Arabic Part of Speech Tagger (DA-Dialectal Arabic dataset, Gulf Arabic POS)
author: John Snow Labs
name: bert_pos_bert_base_arabic_camelbert_da_pos_glf
date: 2022-04-26
tags: [bert, pos, part_of_speech, ar, open_source]
task: Part of Speech Tagging
language: ar
edition: Spark NLP 3.4.2
spark_version: 3.0
supported: true
annotator: BertForTokenClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Part of Speech model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-arabic-camelbert-da-pos-glf` is an Arabic model originally trained by `CAMeL-Lab`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_da_pos_glf_ar_3.4.2_3.0_1650993664822.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_pos_bert_base_arabic_camelbert_da_pos_glf_ar_3.4.2_3.0_1650993664822.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer() \ .setInputCols("sentence") \ .setOutputCol("token") tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_da_pos_glf","ar") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("pos") pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier]) data = spark.createDataFrame([["أنا أحب الشرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val tokenClassifier = BertForTokenClassification.pretrained("bert_pos_bert_base_arabic_camelbert_da_pos_glf","ar") .setInputCols(Array("sentence", "token")) .setOutputCol("pos") val pipeline = new Pipeline().setStages(Array(documentAssembler,sentenceDetector, tokenizer, tokenClassifier)) val data = Seq("أنا أحب الشرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.pos.arabic_camelbert_da_pos_glf").predict("""أنا أحب الشرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_pos_bert_base_arabic_camelbert_da_pos_glf| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|ar| |Size:|407.6 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-da-pos-glf - https://camel.abudhabi.nyu.edu/annotated-gumar-corpus/ - https://arxiv.org/abs/2103.06678 - https://github.com/CAMeL-Lab/CAMeLBERT - https://github.com/CAMeL-Lab/camel_tools --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_squadv2_base_3_epochs date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `squadv2-roberta-base-3-epochs` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_squadv2_base_3_epochs_en_4.3.0_3.0_1674224185355.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_squadv2_base_3_epochs_en_4.3.0_3.0_1674224185355.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_squadv2_base_3_epochs","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_squadv2_base_3_epochs","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_squadv2_base_3_epochs| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|460.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/squadv2-roberta-base-3-epochs --- layout: model title: Pipeline to Detect Chemical Compounds and Genes (BertForTokenClassifier) author: John Snow Labs name: bert_token_classifier_ner_chemprot_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, berfortokenclassification, chemprot, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [bert_token_classifier_ner_chemprot](https://nlp.johnsnowlabs.com/2022/01/06/bert_token_classifier_ner_chemprot_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemprot_pipeline_en_3.4.1_3.0_1647889733985.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/bert_token_classifier_ner_chemprot_pipeline_en_3.4.1_3.0_1647889733985.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("bert_token_classifier_ner_chemprot_pipeline", "en", "clinical/models") pipeline.annotate("Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.") ``` ```scala val pipeline = new PretrainedPipeline("bert_token_classifier_ner_chemprot_pipeline", "en", "clinical/models") pipeline.annotate("Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.") ``` {:.nlu-block} ```python import nlu nlu.load("en.classify.token_bert.chemprot_pipeline").predict("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""") ```
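The pipeline's `NerConverter` stage merges the classifier's token-level BIO tags into entity chunks. A simplified, self-contained sketch of that grouping logic (illustrative only, not Spark NLP's actual implementation):

```python
# Simplified BIO-to-chunk grouping, the idea behind NerConverter.

def bio_to_chunks(tokens, tags):
    """Group B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" tag ends any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Keratinocyte", "growth", "factor", "and", "acidic", "fibroblast", "growth", "factor"]
tags   = ["B-GENE-Y", "I-GENE-Y", "I-GENE-Y", "O", "B-GENE-Y", "I-GENE-Y", "I-GENE-Y", "I-GENE-Y"]
print(bio_to_chunks(tokens, tags))
```

With the tags shown, this yields the same two `GENE-Y` chunks listed in the Results table.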
## Results ```bash +-------------------------------+---------+ |chunk |ner_label| +-------------------------------+---------+ |Keratinocyte growth factor |GENE-Y | |acidic fibroblast growth factor|GENE-Y | +-------------------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_token_classifier_ner_chemprot_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|404.7 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - MedicalBertForTokenClassifier - NerConverter --- layout: model title: English DistilBertForQuestionAnswering model (from Slavka) Winlogbeat author: John Snow Labs name: distilbert_qa_distil_bert_finetuned_log_parser_winlogbeat date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distil-bert-finetuned-log-parser-winlogbeat` is a English model originally trained by `Slavka`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_distil_bert_finetuned_log_parser_winlogbeat_en_4.0.0_3.0_1654723486624.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_distil_bert_finetuned_log_parser_winlogbeat_en_4.0.0_3.0_1654723486624.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distil_bert_finetuned_log_parser_winlogbeat","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_distil_bert_finetuned_log_parser_winlogbeat","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.distil_bert.log_parser_winlogbeat.by_Slavka").predict("""What is my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_distil_bert_finetuned_log_parser_winlogbeat| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Slavka/distil-bert-finetuned-log-parser-winlogbeat --- layout: model title: English asr_model_sid_voxforge_cetuc_1 TFWav2Vec2ForCTC from joaoalvarenga author: John Snow Labs name: pipeline_asr_model_sid_voxforge_cetuc_1 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_model_sid_voxforge_cetuc_1` is a English model originally trained by joaoalvarenga. NOTE: This pipeline only works on a CPU, if you need to use this pipeline on a GPU device please use pipeline_asr_model_sid_voxforge_cetuc_1_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_model_sid_voxforge_cetuc_1_en_4.2.0_3.0_1664021895154.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_model_sid_voxforge_cetuc_1_en_4.2.0_3.0_1664021895154.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_model_sid_voxforge_cetuc_1', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_model_sid_voxforge_cetuc_1", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_model_sid_voxforge_cetuc_1| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Legal Amendments Clause Binary Classifier (LEDGAR) author: John Snow Labs name: legclf_amendments_bert date: 2023-03-05 tags: [en, legal, classification, clauses, amendments, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The LEDGAR dataset is aimed at contract provision (paragraph) classification. The contract provisions come from contracts obtained from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision. This model is a Binary Classifier (True, False) for the `Amendments` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend you split the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. 
If you have more than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as an output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Amendments`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_amendments_bert_en_1.0.0_3.0_1678049943783.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_amendments_bert_en_1.0.0_3.0_1678049943783.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_amendments_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
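The description above recommends feeding the classifier provision-sized chunks rather than whole documents. A minimal, Spark-independent sketch of the "paragraph splitting (by multiline)" technique it mentions — the sample clause text is illustrative:

```python
import re

def split_paragraphs(text):
    """Split a document on blank lines ("multiline" splitting) and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = """1. AMENDMENTS. This Agreement may be amended only by a written instrument.

2. NOTICES. All notices shall be in writing."""
print(len(split_paragraphs(doc)))  # -> 2
```

Each resulting paragraph can then be fed to the pipeline as one row of the `text` column.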
## Results ```bash +-------+ |result| +-------+ |[Amendments]| |[Other]| |[Other]| |[Amendments]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_amendments_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.7 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Amendments 0.96 0.94 0.95 181 Other 0.95 0.97 0.96 206 accuracy - - 0.95 387 macro-avg 0.95 0.95 0.95 387 weighted-avg 0.95 0.95 0.95 387 ``` --- layout: model title: Italian Bert Embeddings (Cased) author: John Snow Labs name: bert_embeddings_bert_base_italian_xxl_cased date: 2022-04-11 tags: [bert, embeddings, it, open_source] task: Embeddings language: it edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true recommended: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-italian-xxl-cased` is an Italian model originally trained by `dbmdz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_italian_xxl_cased_it_3.4.2_3.0_1649676734473.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_italian_xxl_cased_it_3.4.2_3.0_1649676734473.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_italian_xxl_cased","it") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Adoro Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_italian_xxl_cased","it") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Adoro Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("it.embed.bert_base_italian_xxl_cased").predict("""Adoro Spark NLP""") ```
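Token embeddings like these are typically consumed downstream for semantic similarity. A minimal cosine-similarity sketch in plain Python — the 3-dimensional vectors are illustrative stand-ins, since real BERT-base embeddings are 768-dimensional:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Tiny made-up vectors standing in for two token embeddings.
a = [0.2, 0.1, 0.9]
b = [0.25, 0.05, 0.85]
print(round(cosine(a, b), 3))
```

Values close to 1.0 indicate near-identical directions in embedding space, which is the usual signal for semantic closeness.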
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_italian_xxl_cased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|it| |Size:|415.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/dbmdz/bert-base-italian-xxl-cased - http://opus.nlpl.eu/ - https://traces1.inria.fr/oscar/ - https://github.com/dbmdz/berts/issues/7 - https://github.com/stefan-it/turkish-bert/tree/master/electra - https://github.com/stefan-it/italian-bertelectra - https://github.com/dbmdz/berts/issues/new --- layout: model title: Named Entity Recognition Profiling (Biobert) author: John Snow Labs name: ner_profiling_biobert date: 2022-08-28 tags: [ner, ner_profiling, biobert, licensed, en] task: [Named Entity Recognition, Pipeline Healthcare] language: en nav_key: models edition: Healthcare NLP 4.0.2 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pipeline can be used to explore all the available pretrained NER models at once. When you run this pipeline over your text, you will end up with the predictions coming out of each pretrained clinical NER model trained with `biobert_pubmed_base_cased`. It has been updated by adding NER model outputs to the previous version. 
Here are the NER models that this pretrained pipeline includes: `jsl_ner_wip_greedy_biobert`, `jsl_rd_ner_wip_greedy_biobert`, `ner_ade_biobert`, `ner_anatomy_biobert`, `ner_anatomy_coarse_biobert`, `ner_bionlp_biobert`, `ner_cellular_biobert`, `ner_chemprot_biobert`, `ner_clinical_biobert`, `ner_deid_biobert`, `ner_deid_enriched_biobert`, `ner_diseases_biobert`, `ner_events_biobert`, `ner_human_phenotype_gene_biobert`, `ner_human_phenotype_go_biobert`, `ner_jsl_biobert`, `ner_jsl_enriched_biobert`, `ner_jsl_greedy_biobert`, `ner_living_species_biobert`, `ner_posology_biobert`, `ner_posology_large_biobert`, `ner_risk_factors_biobert` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.2.Pretrained_NER_Profiling_Pipelines.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_profiling_biobert_en_4.0.2_3.0_1661701458686.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_profiling_biobert_en_4.0.2_3.0_1661701458686.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline ner_profiling_pipeline = PretrainedPipeline('ner_profiling_biobert', 'en', 'clinical/models') result = ner_profiling_pipeline.annotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .""") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val ner_profiling_pipeline = PretrainedPipeline("ner_profiling_biobert", "en", "clinical/models") val result = ner_profiling_pipeline.annotate("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .""") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.profiling_biobert").predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .""") ```
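Since this profiling pipeline runs every bundled NER model at once, `annotate()` returns one output entry per model. A hedged sketch of collecting the chunks per model from such a dict — the key names below are illustrative placeholders, so check the pipeline's actual output columns before relying on them:

```python
def chunks_per_model(result, suffix="_chunks"):
    """Collect {model_name: chunk_list} from a profiling-style annotate() dict."""
    return {k[: -len(suffix)]: v for k, v in result.items() if k.endswith(suffix)}

# Illustrative output shape (not the pipeline's exact keys).
annotations = {
    "ner_diseases_biobert_chunks": ["gestational diabetes mellitus", "obesity"],
    "ner_risk_factors_biobert_chunks": ["diabetes mellitus", "obesity"],
    "token": ["A", "28-year-old", "female"],
}
print(sorted(chunks_per_model(annotations)))
```

Grouping the output this way makes it easy to compare which of the bundled models agree on a given entity.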
## Results ```bash ******************** ner_diseases_biobert Model Results ******************** [('gestational diabetes mellitus', 'Disease'), ('type two diabetes mellitus', 'Disease'), ('T2DM', 'Disease'), ('HTG-induced pancreatitis', 'Disease'), ('hepatitis', 'Disease'), ('obesity', 'Disease'), ('polyuria', 'Disease'), ('polydipsia', 'Disease'), ('poor appetite', 'Disease'), ('vomiting', 'Disease')] ******************** ner_events_biobert Model Results ******************** [('gestational diabetes mellitus', 'PROBLEM'), ('eight years', 'DURATION'), ('presentation', 'OCCURRENCE'), ('type two diabetes mellitus ( T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('three years', 'DURATION'), ('presentation', 'OCCURRENCE'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index', 'TEST'), ('BMI', 'TEST'), ('presented', 'OCCURRENCE'), ('a one-week', 'DURATION'), ('polyuria', 'PROBLEM'), ('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')] ******************** ner_jsl_biobert Model Results ******************** [('28-year-old', 'Age'), ('female', 'Gender'), ('gestational diabetes mellitus', 'Diabetes'), ('eight years prior', 'RelativeDate'), ('type two diabetes mellitus', 'Diabetes'), ('T2DM', 'Disease_Syndrome_Disorder'), ('HTG-induced pancreatitis', 'Disease_Syndrome_Disorder'), ('three years prior', 'RelativeDate'), ('acute', 'Modifier'), ('hepatitis', 'Disease_Syndrome_Disorder'), ('obesity', 'Obesity'), ('body mass index', 'BMI'), ('BMI ) of 33.5 kg/m2', 'BMI'), ('one-week', 'Duration'), ('polyuria', 'Symptom'), ('polydipsia', 'Symptom'), ('poor appetite', 'Symptom'), ('vomiting', 'Symptom')] ******************** ner_clinical_biobert Model Results ******************** [('gestational diabetes mellitus', 'PROBLEM'), ('subsequent type two diabetes mellitus ( T2DM', 'PROBLEM'), ('HTG-induced pancreatitis', 'PROBLEM'), ('an acute hepatitis', 'PROBLEM'), ('obesity', 'PROBLEM'), ('a body mass index ( BMI )', 
'TEST'), ('polyuria', 'PROBLEM'), ('polydipsia', 'PROBLEM'), ('poor appetite', 'PROBLEM'), ('vomiting', 'PROBLEM')] ******************** ner_risk_factors_biobert Model Results ******************** [('diabetes mellitus', 'DIABETES'), ('subsequent type two diabetes mellitus', 'DIABETES'), ('obesity', 'OBESE')] ... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_profiling_biobert| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.0.2+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|766.4 MB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - BertEmbeddings - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - MedicalNerModel - NerConverter - Finisher --- layout: model title: T5 Clinical Summarization / QA model author: John Snow Labs name: t5_base_pubmedqa date: 2022-10-25 tags: [t5, licensed, clinical, en] task: Summarization language: en nav_key: models edition: Spark NLP for Healthcare 4.1.0 spark_version: [3.0, 3.2] supported: true annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The T5 transformer model described in the seminal paper “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” can perform a variety of tasks, such as text summarization, question 
answering and translation. More details about using the model can be found in the paper (https://arxiv.org/pdf/1910.10683.pdf). This model is specifically trained on medical data for text summarization and question answering. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/t5_base_pubmedqa_en_4.1.0_3.0_1666670271455.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/t5_base_pubmedqa_en_4.1.0_3.0_1666670271455.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("documents") t5 = T5Transformer().pretrained("t5_base_pubmedqa", "en", "clinical/models")\ .setInputCols(["documents"])\ .setOutputCol("t5_output")\ .setTask("summarize medical questions:")\ .setMaxOutputLength(200) pipeline = Pipeline(stages=[document_assembler, t5]) data = spark.createDataFrame([ [1, "content:SUBJECT: Normal physical traits but no period MESSAGE: I'm a 40 yr. old woman that has infantile reproductive organs and have never experienced a mensus. I have had Doctors look but they all say I just have infantile female reproductive organs. When I try to look for answers on the internet I cannot find anything. ALL my \"girly\" parts are normal. My organs never matured. Could you give me more information please. focus:all"] ]).toDF('id', 'text') results = pipeline.fit(data).transform(data) results.select("t5_output.result").show(truncate=False) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("documents") val t5 = T5Transformer() .pretrained("t5_base_pubmedqa", "en", "clinical/models") .setInputCols("documents") .setOutputCol("t5_output") .setTask("summarize medical questions:") .setMaxOutputLength(200) val pipeline = new Pipeline().setStages(Array(document_assembler, t5)) val data = Seq((1, "content:SUBJECT: Normal physical traits but no period MESSAGE: I'm a 40 yr. old woman that has infantile reproductive organs and have never experienced a mensus. I have had Doctors look but they all say I just have infantile female reproductive organs. When I try to look for answers on the internet I cannot find anything. ALL my \"girly\" parts are normal. My organs never matured. Could you give me more information please. focus:all")).toDF("id", "text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.t5.base_pubmedqa").predict("""content:SUBJECT: Normal physical traits but no period MESSAGE: I'm a 40 yr. old woman that has infantile reproductive organs and have never experienced a mensus. I have had Doctors look but they all say I just have infantile female reproductive organs. When I try to look for answers on the internet I cannot find anything. ALL my \""") ```
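T5 models like this one accept a limited number of input tokens, so long inputs need truncation before they reach the annotator. A rough whitespace-based sketch — the real limit counts subword tokens, so treat this as an approximation rather than the model's actual tokenizer:

```python
def truncate_tokens(text, max_tokens=512):
    """Rough whitespace-based truncation; the real T5 limit applies to subword tokens."""
    words = text.split()
    return " ".join(words[:max_tokens])

long_input = "word " * 600
print(len(truncate_tokens(long_input).split()))  # -> 512
```

For precise control, truncate with the model's own tokenizer instead of whitespace words.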
## Results ```bash I have a normal physical appearance and have no mensus. Can you give me more information? ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_base_pubmedqa| |Compatibility:|Spark NLP for Healthcare 4.1.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|916.7 MB| ## References Trained on Pubmed data & qnli --- layout: model title: Summarize clinical notes author: John Snow Labs name: summarizer_clinical_jsl date: 2023-03-25 tags: [en, licensed, clinical, summarization, tensorflow] task: Summarization language: en edition: Healthcare NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: MedicalSummarizer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a modified version of a Flan-T5-based (LLM) summarization model, fine-tuned with clinical notes, encounters, critical care notes, discharge notes, and reports curated by John Snow Labs. The model is further optimized by augmenting the training methodology and dataset. It can generate summaries of up to 512 tokens given the input text (max 1024 tokens). 
## Predicted Entities {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/MEDICAL_TEXT_SUMMARIZATION/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/32.Medical_Text_Summarization.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_jsl_en_4.3.1_3.0_1679772340755.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/summarizer_clinical_jsl_en_4.3.1_3.0_1679772340755.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") summarizer = MedicalSummarizer()\ .pretrained("summarizer_clinical_jsl", "en", "clinical/models")\ .setInputCols("document")\ .setOutputCol("summary")\ .setMaxTextLength(512)\ .setMaxNewTokens(512) pipeline = Pipeline(stages=[document,summarizer]) text = """Patient with hypertension, syncope, and spinal stenosis - for recheck. (Medical Transcription Sample Report) SUBJECTIVE: The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS: Reviewed and unchanged from the dictation on 12/03/2003. MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash.""" data = spark.createDataFrame([[text]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val summarizer = MedicalSummarizer() .pretrained("summarizer_clinical_jsl", "en", "clinical/models") .setInputCols("document") .setOutputCol("summary") .setMaxTextLength(512) .setMaxNewTokens(512) val pipeline = new Pipeline().setStages(Array(document_assembler, summarizer)) val text = """Patient with hypertension, syncope, and spinal stenosis - for recheck. (Medical Transcription Sample Report) SUBJECTIVE: The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. PAST MEDICAL HISTORY / SURGERY / HOSPITALIZATIONS: Reviewed and unchanged from the dictation on 12/03/2003. 
MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily. She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash.""" val data = Seq(text).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
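The benchmark tables in the Benchmarking section below report BERTScore precision, recall, and F1 side by side. Since F1 is the harmonic mean of precision and recall, the columns can be sanity-checked against each other with a few lines of plain Python (independent of Spark NLP; the values plugged in here are taken from the MtSamples row for `summarizer_clinical_jsl`):

```python
# Sanity-check: BERTScore F1 is the harmonic mean of precision and recall.
def harmonic_f1(precision: float, recall: float) -> float:
    """F1 = 2PR / (P + R)."""
    return 2 * precision * recall / (precision + recall)

# MtSamples benchmark row for summarizer_clinical_jsl
p, r = 0.9041, 0.9374
print(round(harmonic_f1(p, r), 4))  # → 0.9204, matching the reported bertscore_f1
```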
## Results ```bash A 78-year-old female with hypertension, syncope, and spinal stenosis returns for recheck. She denies chest pain, palpations, orthopnea, nocturnal dyspnea, or edema. She is on multiple medications and has Elocon cream and Synalar cream for rash. ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|summarizer_clinical_jsl| |Compatibility:|Healthcare NLP 4.3.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|920.1 MB| ## Benchmarking ### Benchmark on MtSamples Summarization Dataset: | model_name | model_size | rouge | bleu | bertscore_precision | bertscore_recall | bertscore_f1 | |--|--|--|--|--|--|--| philschmid/flan-t5-base-samsum | 250M | 0.1919 | 0.1124 | 0.8409 | 0.8964 | 0.8678 | linydub/bart-large-samsum | 500M | 0.1586 | 0.0732 | 0.8747 | 0.8184 | 0.8456 | philschmid/bart-large-cnn-samsum | 500M | 0.2170 | 0.1299 | 0.8846 | 0.8436 | 0.8636 | transformersbook/pegasus-samsum | 500M | 0.1924 | 0.0965 | 0.8920 | 0.8149 | 0.8517 | summarizer_clinical_jsl | 250M | 0.4836 | 0.4188 | 0.9041 | 0.9374 | 0.9204 | summarizer_clinical_jsl_augmented | 250M | 0.5119 | 0.4545 | 0.9282 | 0.9526 | 0.9402 | ### Benchmark on MIMIC Summarization Dataset: | model_name | model_size | rouge | bleu | bertscore_precision | bertscore_recall | bertscore_f1 | |--|--|--|--|--|--|--| philschmid/flan-t5-base-samsum | 250M | 0.1910 | 0.1037 | 0.8708 | 0.9056 | 0.8879 | linydub/bart-large-samsum | 500M | 0.1252 | 0.0382 | 0.8933 | 0.8440 | 0.8679 | philschmid/bart-large-cnn-samsum | 500M | 0.1795 | 0.0889 | 0.9172 | 0.8978 | 0.9074 | transformersbook/pegasus-samsum | 570M | 0.1425 | 0.0582 | 0.9171 | 0.8682 | 0.8920 | summarizer_clinical_jsl | 250M | 0.395 | 0.2962 | 0.895 | 0.9316 | 0.913 | summarizer_clinical_jsl_augmented | 250M | 0.3964 | 0.307 | 0.9109 | 0.9452 | 0.9227 | ![Benchmark 
Summary](https://github.com/JohnSnowLabs/jsl-private-projects/blob/llm_benchmarks/internal_projects/LLM_Experiments/jsl-summarization-benchmarks.png?raw=true) ## References Trained on in-house curated dataset --- layout: model title: English BertForQuestionAnswering model (from Kutay) author: John Snow Labs name: bert_qa_fine_tuned_squad_aip date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `fine_tuned_squad_aip` is an English model originally trained by `Kutay`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_fine_tuned_squad_aip_en_4.0.0_3.0_1654187684708.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_fine_tuned_squad_aip_en_4.0.0_3.0_1654187684708.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_fine_tuned_squad_aip","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_fine_tuned_squad_aip","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.by_Kutay").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_fine_tuned_squad_aip| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|407.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/Kutay/fine_tuned_squad_aip --- layout: model title: Recognize Entities OntoNotes pipeline - ELECTRA Base author: John Snow Labs name: onto_recognize_entities_electra_base date: 2021-03-23 tags: [open_source, english, onto_recognize_entities_electra_base, pipeline, en] supported: true task: [Named Entity Recognition, Lemmatization] language: en nav_key: models edition: Spark NLP 3.0.0 spark_version: 3.0 annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The onto_recognize_entities_electra_base is a pretrained pipeline that performs basic text processing steps and recognizes entities. It handles most of the common text processing tasks on your dataframe. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/NER_EN_18/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_electra_base_en_3.0.0_3.0_1616477573783.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/onto_recognize_entities_electra_base_en_3.0.0_3.0_1616477573783.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('onto_recognize_entities_electra_base', lang = 'en') annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0] annotations.keys() ``` ```scala val pipeline = new PretrainedPipeline("onto_recognize_entities_electra_base", lang = "en") val result = pipeline.fullAnnotate("Hello from John Snow Labs ! ")(0) ``` {:.nlu-block} ```python import nlu text = ["Hello from John Snow Labs ! "] result_df = nlu.load('en.ner.onto.electra.base').predict(text) result_df ```
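The `entities` column shown in the Results below is derived from the token-level IOB tags in the `ner` column: a `B-` tag opens a chunk and consecutive `I-` tags of the same span extend it. A minimal plain-Python sketch of that grouping logic (illustrative only, not the Spark NLP internals):

```python
# Group token-level IOB tags into entity chunks, the way a NER converter does.
def iob_to_chunks(tokens, tags):
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity starts here
            if current:
                chunks.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:  # continue the open entity
            current.append(token)
        else:                             # "O" (or stray I-) closes any open entity
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tokens = ["Hello", "from", "John", "Snow", "Labs", "!"]
tags = ["O", "O", "B-ORG", "I-ORG", "I-ORG", "O"]
print(iob_to_chunks(tokens, tags))  # → ['John Snow Labs']
```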
## Results ```bash | | document | sentence | token | embeddings | ner | entities | |---:|:---------------------------------|:--------------------------------|:-----------------------------------------------|:-----------------------------|:-------------------------------------------|:-------------------| | 0 | ['Hello from John Snow Labs ! '] | ['Hello from John Snow Labs !'] | ['Hello', 'from', 'John', 'Snow', 'Labs', '!'] | [[0.2088415920734405,.,...]] | ['O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O'] | ['John Snow Labs'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|onto_recognize_entities_electra_base| |Type:|pipeline| |Compatibility:|Spark NLP 3.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| --- layout: model title: English BertForQuestionAnswering Cased model (from KFlash) author: John Snow Labs name: bert_qa_kflash_finetuned_squad_accelera date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-finetuned-squad-accelerate` is an English model originally trained by `KFlash`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_kflash_finetuned_squad_accelera_en_4.0.0_3.0_1657186938840.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_kflash_finetuned_squad_accelera_en_4.0.0_3.0_1657186938840.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_kflash_finetuned_squad_accelera","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_kflash_finetuned_squad_accelera","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_kflash_finetuned_squad_accelera| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/KFlash/bert-finetuned-squad-accelerate --- layout: model title: Chinese BertForMaskedLM Cased model (from uer) author: John Snow Labs name: bert_embeddings_chinese_roberta_l_6_h_768 date: 2022-12-06 tags: [zh, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: zh edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `chinese_roberta_L-6_H-768` is a Chinese model originally trained by `uer`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_768_zh_4.2.4_3.0_1670326011437.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_chinese_roberta_l_6_h_768_zh_4.2.4_3.0_1670326011437.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_768","zh") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_chinese_roberta_l_6_h_768","zh") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_chinese_roberta_l_6_h_768| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|zh| |Size:|224.1 MB| |Case sensitive:|true| ## References - https://huggingface.co/uer/chinese_roberta_L-6_H-768 - https://github.com/dbiir/UER-py/ - https://arxiv.org/abs/1909.05658 - https://arxiv.org/abs/1908.08962 - https://github.com/dbiir/UER-py/wiki/Modelzoo - https://github.com/CLUEbenchmark/CLUECorpus2020/ - https://github.com/dbiir/UER-py/ - https://cloud.tencent.com/ --- layout: model title: Detect Symptoms, Treatments and Other Entities in German author: John Snow Labs name: ner_healthcare date: 2020-09-28 task: Named Entity Recognition language: de edition: Healthcare NLP 2.6.0 spark_version: 2.4 tags: [ner, clinical, de, licensed] supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model can be used to detect symptoms, treatments and other entities in medical text in German language. 
## Predicted Entities `DIAGLAB_PROCEDURE`, `MEDICAL_SPECIFICATION`, `MEDICAL_DEVICE`, `MEASUREMENT`, `BIOLOGICAL_CHEMISTRY`, `BODY_FLUID`, `TIME_INFORMATION`, `LOCAL_SPECIFICATION`, `BIOLOGICAL_PARAMETER`, `PROCESS`, `MEDICATION`, `DOSING`, `DEGREE`, `MEDICAL_CONDITION`, `PERSON`, `TISSUE`, `STATE_OF_HEALTH`, `BODY_PART`, `TREATMENT` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_HEALTHCARE_DE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_HEALTHCARE_DE.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_de_2.5.5_2.4_1599433028253.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_healthcare_de_2.5.5_2.4_1599433028253.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use Use as part of an nlp pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add the NerConverter to the end of the pipeline to convert entity tokens into full entity chunks.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","de","clinical/models")\ .setInputCols(["sentence","token"])\ .setOutputCol("embeddings") clinical_ner = NerDLModel.pretrained("ner_healthcare", "de", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, clinical_ner_converter]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) annotations = light_pipeline.fullAnnotate("Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist ein hochmalignes bronchogenes Karzinom") ``` ```scala ... val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","de","clinical/models") .setInputCols(Array("sentence","token")) .setOutputCol("embeddings") val ner = NerDLModel.pretrained("ner_healthcare", "de", "clinical/models") .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") ... val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, clinical_ner_converter)) val data = Seq("Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist ein hochmalignes bronchogenes Karzinom").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("de.med_ner.healthcare").predict("""Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist ein hochmalignes bronchogenes Karzinom""") ```
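The `begin`/`end` values in the Results table below are character offsets into the input text, and both ends are inclusive, so a chunk is recovered with `text[begin:end + 1]`. A quick plain-Python check against the sample sentence used above:

```python
# Spark NLP annotation offsets are inclusive on both ends.
text = ("Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) "
        "ist ein hochmalignes bronchogenes Karzinom")

def chunk_at(begin: int, end: int) -> str:
    return text[begin:end + 1]  # end is inclusive, hence the +1

print(chunk_at(4, 15))   # → Kleinzellige
print(chunk_at(17, 33))  # → Bronchialkarzinom
```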
{:.h2_title} ## Results ```bash +----+-------------------+---------+---------+--------------------------+ | | chunk | begin | end | entity | +====+===================+=========+=========+==========================+ | 0 | Kleinzellige | 4 | 15 | MEDICAL_SPECIFICATION | +----+-------------------+---------+---------+--------------------------+ | 1 | Bronchialkarzinom | 17 | 33 | MEDICAL_CONDITION | +----+-------------------+---------+---------+--------------------------+ | 2 | Kleinzelliger | 36 | 48 | MEDICAL_SPECIFICATION | +----+-------------------+---------+---------+--------------------------+ | 3 | Lungenkrebs | 50 | 60 | MEDICAL_CONDITION | +----+-------------------+---------+---------+--------------------------+ | 4 | SCLC | 63 | 66 | MEDICAL_CONDITION | +----+-------------------+---------+---------+--------------------------+ | 5 | hochmalignes | 77 | 88 | MEASUREMENT | +----+-------------------+---------+---------+--------------------------+ | 6 | bronchogenes | 90 | 101 | BODY_PART | +----+-------------------+---------+---------+--------------------------+ | 7 | Karzinom | 103 | 110 | MEDICAL_CONDITION | +----+-------------------+---------+---------+--------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_healthcare| |Type:|ner| |Compatibility:|Healthcare NLP 2.6.0 +| |Edition:|Official| |License:|Licensed| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|[de]| |Case sensitive:|false| | Dependencies: | w2v_cc_300d | {:.h2_title} ## Data Source Trained on 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text with *w2v_cc_300d*. 
{:.h2_title} ## Benchmarking ```bash | | label | tp | fp | fn | precision| recall| f1 | |---:|--------------------:|-------:|------:|-----:|---------:|---------:|---------:| | 0 | BIOLOGICAL_PARAMETER| 103 | 52 | 57 | 0.6645 | 0.6438 | 0.654 | | 1 | BODY_FLUID | 166 | 16 | 24 | 0.9121 | 0.8737 | 0.8925 | | 2 | PERSON | 475 | 74 | 142 | 0.8652 | 0.7699 | 0.8148 | | 3 | DOSING | 38 | 14 | 31 | 0.7308 | 0.5507 | 0.6281 | | 4 | DIAGLAB_PROCEDURE | 236 | 58 | 68 | 0.8027 | 0.7763 | 0.7893 | | 5 | BODY_PART | 690 | 72 | 79 | 0.9055 | 0.8973 | 0.9014 | | 6 | MEDICATION | 391 | 117 | 167 | 0.7697 | 0.7007 | 0.7336 | | 7 | STATE_OF_HEALTH | 321 | 41 | 76 | 0.8867 | 0.8086 | 0.8458 | | 8 | LOCAL_SPECIFICATION | 57 | 19 | 24 | 0.75 | 0.7037 | 0.7261 | | 9 | MEASUREMENT | 574 | 260 | 222 | 0.6882 | 0.7211 | 0.7043 | | 10 | TREATMENT | 476 | 131 | 135 | 0.7842 | 0.7791 | 0.7816 | | 11 | MEDICAL_CONDITION | 1741 | 442 | 271 | 0.7975 | 0.8653 | 0.83 | | 12 | TIME_INFORMATION | 651 | 126 | 161 | 0.8378 | 0.8017 | 0.8194 | | 13 | BIOLOGICAL_CHEMISTRY| 192 | 55 | 60 | 0.7773 | 0.7619 | 0.7695 | ``` --- layout: model title: Zero-shot Legal NER (CUAD, large) author: John Snow Labs name: legner_roberta_zeroshot_cuad_large date: 2023-01-30 tags: [en, licensed, tensorflow] task: Named Entity Recognition language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true recommended: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a Zero-shot NER model, trained using Roberta on SQUAD and finetuned to perform Zero-shot NER using CUAD legal dataset. In order to use it, a specific prompt is required. This is an example of it for extracting PARTIES: ``` "Highlight the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. 
Details: The two or more parties who signed the contract" ``` ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_roberta_zeroshot_cuad_large_en_1.0.0_3.0_1675089504749.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_roberta_zeroshot_cuad_large_en_1.0.0_3.0_1675089504749.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") zeroshot = nlp.ZeroShotNerModel.pretrained("legner_roberta_zeroshot_cuad_large","en","legal/models")\ .setInputCols(["document", "token"])\ .setOutputCol("zero_shot_ner")\ .setEntityDefinitions( { 'PARTIES': ['Highlight the parts (if any) of this contract related to "Parties" that should be reviewed by a lawyer. Details: The two or more parties who signed the contract'] }) nerconverter = nlp.NerConverter()\ .setInputCols(["document", "token", "zero_shot_ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline().setStages([ document_assembler, tokenizer, zeroshot, nerconverter ]) from pyspark.sql import types as T sample_text = ["""THIS CREDIT AGREEMENT is dated as of April 29, 2010, and is made by and among P.H. GLATFELTER COMPANY, a Pennsylvania corporation ( the "COMPANY") and certain of its subsidiaries. Identified on the signature pages hereto (each a "BORROWER" and collectively, the "BORROWERS"), each of the GUARANTORS (as hereinafter defined), the LENDERS (as hereinafter defined), PNC BANK, NATIONAL ASSOCIATION, in its capacity as agent for the Lenders under this Agreement (hereinafter referred to in such capacity as the "ADMINISTRATIVE AGENT"), and, for the limited purpose of public identification in trade tables, PNC CAPITAL MARKETS LLC and CITIZENS BANK OF PENNSYLVANIA, as joint arrangers and joint bookrunners, and CITIZENS BANK OF PENNSYLVANIA, as syndication agent.""".replace('\n',' ')] p_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) res = p_model.transform(spark.createDataFrame(sample_text, T.StringType()).toDF("text")) res.show() ```
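The `setEntityDefinitions` call above maps each label to one or more natural-language prompts, all following the same CUAD-style template quoted in the Description. When defining several entities, a small helper keeps the prompts consistent (the helper and any extra entity descriptions are illustrative, not part of Spark NLP or the model):

```python
# Build CUAD-style zero-shot NER prompts from a display name and a description.
# This helper is a hypothetical convenience, not part of the Spark NLP API.
def cuad_prompt(label: str, details: str) -> str:
    return (f'Highlight the parts (if any) of this contract related to '
            f'"{label}" that should be reviewed by a lawyer. Details: {details}')

definitions = {
    "PARTIES": [cuad_prompt("Parties",
                            "The two or more parties who signed the contract")],
}
print(definitions["PARTIES"][0])
```

The resulting dictionary has exactly the shape expected by `setEntityDefinitions` in the snippet above.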
## Results ```bash +-----------------------+---------+ |chunk |ner_label| +-----------------------+---------+ |P.H. GLATFELTER COMPANY|PARTIES | +-----------------------+---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_roberta_zeroshot_cuad_large| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|454.4 MB| |Case sensitive:|true| |Max sentence length:|128| ## References SQUAD and CUAD --- layout: model title: English DistilBertForTokenClassification Cased model (from Lucifermorningstar011) author: John Snow Labs name: distilbert_token_classifier_autotrain_ner_778023879 date: 2023-03-06 tags: [en, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: en edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `autotrain-ner-778023879` is an English model originally trained by `Lucifermorningstar011`. ## Predicted Entities `9`, `0` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_ner_778023879_en_4.3.1_3.0_1678134257673.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_autotrain_ner_778023879_en_4.3.1_3.0_1678134257673.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_ner_778023879","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_autotrain_ner_778023879","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_autotrain_ner_778023879| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|244.1 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/Lucifermorningstar011/autotrain-ner-778023879 --- layout: model title: Spanish RoBERTa Embeddings (Base, Random Sampling, Using Sequence Length 512) author: John Snow Labs name: roberta_embeddings_bertin_base_random_exp_512seqlen date: 2022-04-14 tags: [roberta, embeddings, es, open_source] task: Embeddings language: es edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bertin-base-random-exp-512seqlen` is a Spanish model originally trained by `bertin-project`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_embeddings_bertin_base_random_exp_512seqlen_es_3.4.2_3.0_1649946088404.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_embeddings_bertin_base_random_exp_512seqlen_es_3.4.2_3.0_1649946088404.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_bertin_base_random_exp_512seqlen","es") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Me encanta chispa nlp"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_bertin_base_random_exp_512seqlen","es") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Me encanta chispa nlp").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.embed.bertin_base_random_exp_512seqlen").predict("""Me encanta chispa nlp""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_embeddings_bertin_base_random_exp_512seqlen| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|es| |Size:|230.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/bertin-project/bertin-base-random-exp-512seqlen --- layout: model title: English asr_wav2vec2_base_timit_demo_colab7_by_hassnain TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: asr_wav2vec2_base_timit_demo_colab7_by_hassnain date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab7_by_hassnain` is an English model originally trained by hassnain. NOTE: This model only works on a CPU; if you need to use this model on a GPU device please use asr_wav2vec2_base_timit_demo_colab7_by_hassnain_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab7_by_hassnain_en_4.2.0_3.0_1664018865478.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_base_timit_demo_colab7_by_hassnain_en_4.2.0_3.0_1664018865478.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_base_timit_demo_colab7_by_hassnain", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_base_timit_demo_colab7_by_hassnain", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
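The snippets above assume an `audioDf` DataFrame with an `audio_content` column of audio samples, but do not show how to build it. A minimal sketch of reading a 16-bit PCM mono WAV file into a normalized float list with the Python standard library (the helper name `wav_to_floats` is illustrative, not part of Spark NLP):

```python
import struct
import wave

def wav_to_floats(path):
    """Read a 16-bit PCM mono WAV file and return samples normalized to [-1.0, 1.0]."""
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    # Each sample is a little-endian signed 16-bit integer.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# The resulting float list can then be wrapped into the DataFrame the pipeline expects:
# audioDf = spark.createDataFrame([[wav_to_floats("sample.wav")]]).toDF("audio_content")
```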
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_base_timit_demo_colab7_by_hassnain| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|355.0 MB| --- layout: model title: German XLMRobertaForTokenClassification Base Cased model (from lijingxin) author: John Snow Labs name: xlmroberta_ner_lijingxin_base_finetuned_panx date: 2022-08-13 tags: [de, open_source, xlm_roberta, ner] task: Named Entity Recognition language: de edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `xlm-roberta-base-finetuned-panx-de` is a German model originally trained by `lijingxin`. ## Predicted Entities `PER`, `LOC`, `ORG` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_lijingxin_base_finetuned_panx_de_4.1.0_3.0_1660435174962.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_lijingxin_base_finetuned_panx_de_4.1.0_3.0_1660435174962.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_lijingxin_base_finetuned_panx","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_lijingxin_base_finetuned_panx","de") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_lijingxin_base_finetuned_panx| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|de| |Size:|854.5 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/lijingxin/xlm-roberta-base-finetuned-panx-de - https://paperswithcode.com/sota?task=Token+Classification&dataset=xtreme --- layout: model title: English BertForQuestionAnswering model (from batterydata) author: John Snow Labs name: bert_qa_batteryonlybert_cased_squad_v1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `batteryonlybert-cased-squad-v1` is an English model originally trained by `batterydata`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_batteryonlybert_cased_squad_v1_en_4.0.0_3.0_1654179335684.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_batteryonlybert_cased_squad_v1_en_4.0.0_3.0_1654179335684.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_batteryonlybert_cased_squad_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_batteryonlybert_cased_squad_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad_battery.bert.cased_only_bert.by_batterydata").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_batteryonlybert_cased_squad_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|404.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/batterydata/batteryonlybert-cased-squad-v1 - https://github.com/ShuHuang/batterybert --- layout: model title: Recognize Entities Bert Noncontrib author: John Snow Labs name: recognize_entities_bert_noncontrib date: 2022-06-20 tags: [en, open_source] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The recognize_entities_bert_noncontrib is a pretrained pipeline that can be used to process text, performing basic processing steps and recognizing entities. It performs most of the common text processing tasks on your dataframe. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/recognize_entities_bert_noncontrib_en_4.0.0_3.0_1655697462434.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/recognize_entities_bert_noncontrib_en_4.0.0_3.0_1655697462434.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("recognize_entities_bert_noncontrib", "en") result = pipeline.annotate("""I love johnsnowlabs!""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|recognize_entities_bert_noncontrib| |Type:|pipeline| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|425.6 MB| ## Included Models - DocumentAssembler - SentenceDetector - TokenizerModel - BertEmbeddings - NerDLModel --- layout: model title: Legal Marketing Document Classifier (EURLEX) author: John Snow Labs name: legclf_marketing_bert date: 2023-03-06 tags: [en, legal, classification, clauses, marketing, licensed, tensorflow] task: Text Classification language: en edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description European Union (EU) legislation is published in the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The legclf_marketing_bert model is a Bert Sentence Embeddings Document Classifier that, given a document, classifies whether it belongs to the Marketing class or not (binary classification) according to EuroVoc labels. ## Predicted Entities `Marketing`, `Other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_marketing_bert_en_1.0.0_3.0_1678111761414.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_marketing_bert_en_1.0.0_3.0_1678111761414.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_marketing_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[Marketing]| |[Other]| |[Other]| |[Marketing]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_marketing_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.1 MB| ## References Train dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Marketing 0.85 0.84 0.84 716 Other 0.82 0.83 0.83 648 accuracy - - 0.84 1364 macro-avg 0.84 0.84 0.84 1364 weighted-avg 0.84 0.84 0.84 1364 ``` --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from liuhaor4) author: John Snow Labs name: distilbert_qa_liuhaor4_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `liuhaor4`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_liuhaor4_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772009401.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_liuhaor4_base_uncased_finetuned_squad_en_4.3.0_3.0_1672772009401.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_liuhaor4_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_liuhaor4_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_liuhaor4_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/liuhaor4/distilbert-base-uncased-finetuned-squad --- layout: model title: Swedish BertForMaskedLM Cased model (from Addedk) author: John Snow Labs name: bert_embeddings_kb_distilled_cased date: 2022-12-02 tags: [sv, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: sv edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `kbbert-distilled-cased` is a Swedish model originally trained by `Addedk`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_kb_distilled_cased_sv_4.2.4_3.0_1670022547171.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_kb_distilled_cased_sv_4.2.4_3.0_1670022547171.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_kb_distilled_cased","sv") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_kb_distilled_cased","sv") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_kb_distilled_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|sv| |Size:|308.3 MB| |Case sensitive:|true| ## References - https://huggingface.co/Addedk/kbbert-distilled-cased - https://spraakbanken.gu.se/en/resources/gigaword - https://github.com/AddedK/swedish-mbert-distillation/blob/main/azureML/pretrain_distillation.py - https://kth.diva-portal.org/smash/record.jsf?aq2=%5B%5B%5D%5D&c=2&af=%5B%5D&searchType=UNDERGRADUATE&sortOrder2=title_sort_asc&language=en&pid=diva2%3A1698451&aq=%5B%5B%7B%22freeText%22%3A%22added+kina%22%7D%5D%5D&sf=all&aqe=%5B%5D&sortOrder=author_sort_asc&onlyFullText=false&noOfRows=50&dswid=-6142 - https://arxiv.org/abs/2103.06418 --- layout: model title: English BertForQuestionAnswering model (from mrm8488) author: John Snow Labs name: bert_qa_bert_medium_wrslb_finetuned_squadv1 date: 2022-06-02 tags: [en, open_source, question_answering, bert] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-medium-wrslb-finetuned-squadv1` is an English model originally trained by `mrm8488`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_medium_wrslb_finetuned_squadv1_en_4.0.0_3.0_1654183753800.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_medium_wrslb_finetuned_squadv1_en_4.0.0_3.0_1654183753800.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_medium_wrslb_finetuned_squadv1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_medium_wrslb_finetuned_squadv1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.bert.medium").predict("""What's my name?|||My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_bert_medium_wrslb_finetuned_squadv1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|154.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/mrm8488/bert-medium-wrslb-finetuned-squadv1 --- layout: model title: Turkish BertForQuestionAnswering Uncased model (from enelpi) author: John Snow Labs name: bert_qa_question_answering_uncased_squadv2 date: 2022-07-07 tags: [tr, open_source, bert, question_answering] task: Question Answering language: tr edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-question-answering-uncased-squadv2_tr` is a Turkish model originally trained by `enelpi`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_question_answering_uncased_squadv2_tr_4.0.0_3.0_1657187939073.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_question_answering_uncased_squadv2_tr_4.0.0_3.0_1657187939073.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_question_answering_uncased_squadv2","tr") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_question_answering_uncased_squadv2","tr") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_question_answering_uncased_squadv2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|tr| |Size:|413.2 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/enelpi/bert-question-answering-uncased-squadv2_tr --- layout: model title: Finnish asr_wav2vec2_large_xlsr_53_finnish_by_aapot TFWav2Vec2ForCTC from aapot author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_finnish_by_aapot date: 2022-09-24 tags: [wav2vec2, fi, audio, open_source, asr] task: Automatic Speech Recognition language: fi edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_finnish_by_aapot` is a Finnish model originally trained by aapot. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_finnish_by_aapot_gpu. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_finnish_by_aapot_fi_4.2.0_3.0_1664019226592.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_finnish_by_aapot_fi_4.2.0_3.0_1664019226592.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_finnish_by_aapot", "fi")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_finnish_by_aapot", "fi") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_finnish_by_aapot| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|fi| |Size:|1.2 GB| --- layout: model title: Stop Words Cleaner for Hindi author: John Snow Labs name: stopwords_hi date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: hi edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, hi] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_hi_hi_2.5.4_2.4_1594742439035.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_hi_hi_2.5.4_2.4_1594742439035.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_hi", "hi") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("उत्तर के राजा होने के अलावा, जॉन स्नो एक अंग्रेजी चिकित्सक और संज्ञाहरण और चिकित्सा स्वच्छता के विकास में अग्रणी है।") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_hi", "hi") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("उत्तर के राजा होने के अलावा, जॉन स्नो एक अंग्रेजी चिकित्सक और संज्ञाहरण और चिकित्सा स्वच्छता के विकास में अग्रणी है।").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""उत्तर के राजा होने के अलावा, जॉन स्नो एक अंग्रेजी चिकित्सक और संज्ञाहरण और चिकित्सा स्वच्छता के विकास में अग्रणी है।"""] stopword_df = nlu.load('hi.stopwords').predict(text) stopword_df[['cleanTokens']] ```
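Conceptually, the StopWordsCleaner annotator applies a set-based filter over the token stream: any token found in its dictionary is dropped, and all other tokens pass through in order. A plain-Python illustration of that mechanic (the tiny Hindi stop list below is a made-up sample, not the model's actual curated dictionary):

```python
# Toy stop list: a few common Hindi function words, for illustration only.
STOP_WORDS = {"के", "और", "में", "है", "एक"}

def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Return the tokens with stop words removed, preserving original order."""
    return [t for t in tokens if t not in stop_words]
```

The pretrained `stopwords_hi` model ships its own dictionary; this sketch only shows why the output column (`cleanTokens`) is a subsequence of the input tokens.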
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=4, result='उत्तर', metadata={'sentence': '0'}), Row(annotatorType='token', begin=9, end=12, result='राजा', metadata={'sentence': '0'}), Row(annotatorType='token', begin=22, end=26, result='अलावा', metadata={'sentence': '0'}), Row(annotatorType='token', begin=27, end=27, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=29, end=31, result='जॉन', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_hi| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|hi| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: Legal Notices Clause Binary Classifier author: John Snow Labs name: legclf_notices_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `notices` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip it, unless you want to do Binary Classification at the sentence level.
If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the tutorial linked above). This model can be combined with any of the other hundreds of Legal Clause Classifiers you will find in Models Hub, producing as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `notices` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_notices_clause_en_1.0.0_3.2_1660122758974.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_notices_clause_en_1.0.0_3.2_1660122758974.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_notices_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
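As noted in the description, this classifier can be combined with other clause classifiers, each contributing its own label. A plain-Python sketch of folding such per-model labels into a single clause map; the model names and this post-processing step are illustrative, not part of the Spark NLP API:

```python
# Hypothetical post-processing: each clause classifier predicts either its
# clause name (e.g. "notices") or "other"; fold that into {clause: bool}.
def fold_clause_predictions(predictions):
    """predictions maps a model name like 'legclf_notices_clause' to its label."""
    return {
        model.replace("legclf_", "").replace("_clause", ""): label != "other"
        for model, label in predictions.items()
    }

flags = fold_clause_predictions({
    "legclf_notices_clause": "notices",   # illustrative outputs
    "legclf_whereas_clause": "other",
})
print(flags)  # {'notices': True, 'whereas': False}
```

The same pattern scales to however many clause classifiers you chain in the pipeline.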
## Results ```bash +---------+ | result | +---------+ |[notices]| |[other] | |[other] | |[notices]| +---------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_notices_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.3 MB| ## References Legal documents, scraped from the Internet and classified in-house. ## Benchmarking ```bash label precision recall f1-score support notices 0.96 0.97 0.97 191 other 0.99 0.99 0.99 529 accuracy - - 0.98 720 macro-avg 0.98 0.98 0.98 720 weighted-avg 0.98 0.98 0.98 720 ``` --- layout: model title: Detect Genes and Human Phenotypes (biobert) author: John Snow Labs name: ner_human_phenotype_gene_biobert date: 2021-04-01 tags: [ner, clinical, licensed, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 supported: true annotator: MedicalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model can be used to detect normalized mentions of genes (gene) and human phenotypes (hp) in medical text.
## Predicted Entities `HP`, `GENE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_HUMAN_PHENOTYPE_GENE_CLINICAL/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_biobert_en_3.0.0_3.0_1617260636569.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_human_phenotype_gene_biobert_en_3.0.0_3.0_1617260636569.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_human_phenotype_gene_biobert", "en", "clinical/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["EXAMPLE_TEXT"]]).toDF("text")) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings_clinical = BertEmbeddings.pretrained("biobert_pubmed_base_cased") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_human_phenotype_gene_biobert", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter)) val data = Seq("EXAMPLE_TEXT").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu
nlu.load("en.med_ner.human_phenotype.gene_biobert").predict("""Put your text here.""") ```
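After `transform`, each `ner_chunk` annotation carries the chunk text and its entity label (`HP` or `GENE`). A rough sketch of pairing them once the rows are collected; the dictionary layout and the example chunks are assumptions for illustration, not the exact Spark NLP schema:

```python
# Collected ner_chunk annotations, mimicked here as plain dicts (hypothetical
# values; real rows hold Annotation structs with begin/end offsets too).
rows = [
    {"result": "BRCA1", "metadata": {"entity": "GENE"}},
    {"result": "short stature", "metadata": {"entity": "HP"}},
]
pairs = [(r["result"], r["metadata"]["entity"]) for r in rows]
print(pairs)  # [('BRCA1', 'GENE'), ('short stature', 'HP')]
```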
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_human_phenotype_gene_biobert| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Benchmarking ```bash entity tp fp fn total precision recall f1 HP 1761.0 198.0 342.0 2103.0 0.8989 0.8374 0.8671 GENE 1600.0 290.0 361.0 1961.0 0.8466 0.8159 0.831 macro - - - - - - 0.8490 micro - - - - - - 0.8496 ``` --- layout: model title: Translate Esperanto to English Pipeline author: John Snow Labs name: translate_eo_en date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, eo, en, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on larger sequences. The use of an accelerator such as a GPU is recommended.
- source languages: `eo` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_eo_en_xx_2.7.0_2.4_1609687626260.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_eo_en_xx_2.7.0_2.4_1609687626260.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_eo_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_eo_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.eo.translate_to.en').predict(text, output_level='sentence') translate_df ```
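Because the translation module is expensive on long sequences, one practical mitigation is to feed the pipeline sentence-sized pieces. A naive pre-splitting sketch; the pipeline's own sentence detection normally handles this, so treat it as illustrative only:

```python
import re

def split_sentences(text):
    # Split after sentence-final punctuation; good enough for a quick batch.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

sentences = split_sentences("Saluton! Kiel vi fartas? Mi fartas bone.")
print(sentences)  # ['Saluton!', 'Kiel vi fartas?', 'Mi fartas bone.']
# Each piece could then be passed to pipeline.annotate(...) individually.
```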
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_eo_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Bangla DistilBertForQuestionAnswering Cased model (from khasrul-alam) author: John Snow Labs name: distilbert_qa_banglabert_finetuned_squad date: 2023-01-03 tags: [bn, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: bn edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `banglabert-finetuned-squad` is a Bangla model originally trained by `khasrul-alam`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_banglabert_finetuned_squad_bn_4.3.0_3.0_1672765708608.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_banglabert_finetuned_squad_bn_4.3.0_3.0_1672765708608.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_banglabert_finetuned_squad","bn")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_banglabert_finetuned_squad","bn") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
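Once `result` is collected, the predicted answer lives in the `answer` annotations. A hedged sketch of pulling the answer strings out; the row structure below is an assumption (plain dicts standing in for collected Spark rows):

```python
# Hypothetical collected rows after transform(); real rows are Spark Rows
# holding Annotation structs, so adapt the field access accordingly.
result_rows = [{"answer": [{"result": "Clara"}]}]
answers = [a["result"] for row in result_rows for a in row["answer"]]
print(answers)  # ['Clara']
```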
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_banglabert_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|bn| |Size:|247.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/khasrul-alam/banglabert-finetuned-squad --- layout: model title: English ElectraForQuestionAnswering model (from SauravMaheshkar) author: John Snow Labs name: electra_qa_base_chaii date: 2022-06-22 tags: [en, open_source, electra, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-base-chaii` is an English model originally trained by `SauravMaheshkar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_base_chaii_en_4.0.0_3.0_1655920382993.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_base_chaii_en_4.0.0_3.0_1655920382993.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_chaii","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_base_chaii","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.chaii.electra.base").predict("""What is my name?|||My name is Clara and I live in Berkeley.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_base_chaii| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|408.5 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/SauravMaheshkar/electra-base-chaii --- layout: model title: English asr_wav2vec2_large_xls_r_300m_turkish_colab_by_chaitanya97 TFWav2Vec2ForCTC from chaitanya97 author: John Snow Labs name: asr_wav2vec2_large_xls_r_300m_turkish_colab_by_chaitanya97 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xls_r_300m_turkish_colab_by_chaitanya97` is an English model originally trained by chaitanya97. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xls_r_300m_turkish_colab_by_chaitanya97_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_chaitanya97_en_4.2.0_3.0_1664036036447.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xls_r_300m_turkish_colab_by_chaitanya97_en_4.2.0_3.0_1664036036447.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_chaitanya97", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xls_r_300m_turkish_colab_by_chaitanya97", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
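Both snippets assume an `audioDf` whose `audio_content` column holds the raw waveform as a sequence of floats. A minimal sketch of shaping that input (the sample values are placeholders; building the DataFrame itself needs a live SparkSession):

```python
# One row per utterance; the floats stand in for real audio samples.
rows = [([0.0, 0.12, -0.07, 0.03],)]
# With a SparkSession available you would then create:
# audioDf = spark.createDataFrame(rows, ["audio_content"])
print(len(rows[0][0]))  # 4 samples in the first (placeholder) utterance
```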
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xls_r_300m_turkish_colab_by_chaitanya97| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Lemmatizer (Tagalog, SpacyLookup) author: John Snow Labs name: lemma_spacylookup date: 2022-03-03 tags: [open_source, lemmatizer, tl] task: Lemmatization language: tl edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Tagalog Lemmatizer is a scalable, production-ready version of the rule-based Lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_tl_3.4.1_3.0_1646316594942.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_tl_3.4.1_3.0_1646316594942.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","tl") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Hindi ka mas mahusay kaysa sa akin"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","tl") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("Hindi ka mas mahusay kaysa sa akin").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("tl.lemma").predict("""Hindi ka mas mahusay kaysa sa akin""") ```
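Under the hood, a SpacyLookup lemmatizer is essentially a dictionary lookup with an identity fallback, which is why out-of-vocabulary tokens come back unchanged in the Results section. A toy sketch; the table entry is invented for illustration:

```python
lookup = {"mahusay": "husay"}  # hypothetical lookup entry

def lemmatize(tokens, table):
    # Return the table entry when present, otherwise the token itself.
    return [table.get(t, t) for t in tokens]

print(lemmatize(["Hindi", "ka", "mahusay"], lookup))  # ['Hindi', 'ka', 'husay']
```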
## Results ```bash +------------------------------------------+ |result | +------------------------------------------+ |[Hindi, ka, mas, mahusay, kaysa, sa, akin]| +------------------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma_spacylookup| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|tl| |Size:|3.2 KB| --- layout: model title: German asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s756 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s756 date: 2022-09-26 tags: [wav2vec2, de, audio, open_source, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s756` is a German model originally trained by jonatasgrosman. NOTE: This model only works on a CPU, if you need to use this model on a GPU device please use asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s756_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s756_de_4.2.0_3.0_1664190015002.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s756_de_4.2.0_3.0_1664190015002.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s756", "de")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s756", "de") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_exp_w2v2r_vp_100k_accent_germany_0_austria_10_s756| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|de| |Size:|1.2 GB| --- layout: model title: ICD10 to UMLS Code Mapping author: John Snow Labs name: icd10cm_umls_mapping date: 2023-06-13 tags: [en, licensed, icd10cm, umls, pipeline, chunk_mapping] task: Chunk Mapping language: en edition: Healthcare NLP 4.4.4 spark_version: 3.2 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline maps ICD10CM codes to UMLS codes without using any text data. You’ll just feed white space-delimited ICD10CM codes and it will return the corresponding UMLS codes as a list. If there is no mapping, the original code is returned with no mapping. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/icd10cm_umls_mapping_en_4.4.4_3.2_1686663524227.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/icd10cm_umls_mapping_en_4.4.4_3.2_1686663524227.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("icd10cm_umls_mapping", "en", "clinical/models") result = pipeline.fullAnnotate(['M8950', 'R822', 'R0901']) ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("icd10cm_umls_mapping", "en", "clinical/models") val result = pipeline.fullAnnotate(Array("M8950", "R822", "R0901")) ``` {:.nlu-block} ```python import nlu nlu.load("en.icd10cm.umls.mapping").predict("""Put your text here.""") ```
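The behaviour described above, returning the UMLS code when a mapping exists and echoing the input code otherwise, can be sketched as a plain dictionary lookup. The three code pairs come from the Results section of this card; any other entries would be assumptions:

```python
# ICD10CM -> UMLS pairs taken from the Results section of this card.
icd10_to_umls = {"M8950": "C4721411", "R822": "C0159076", "R0901": "C0004044"}

def map_codes(codes):
    # Whitespace-delimited input; unknown codes fall back to themselves.
    return [icd10_to_umls.get(code, code) for code in codes.split()]

print(map_codes("M8950 R822 X999"))  # ['C4721411', 'C0159076', 'X999']
```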
## Results ```bash {'icd10cm': ['M89.50', 'R82.2', 'R09.01'], 'umls': ['C4721411', 'C0159076', 'C0004044']} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|icd10cm_umls_mapping| |Type:|pipeline| |Compatibility:|Healthcare NLP 4.4.4+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|956.6 KB| ## Included Models - DocumentAssembler - TokenizerModel - ChunkMapperModel --- layout: model title: English asr_maialong_model TFWav2Vec2ForCTC from buidung2004 author: John Snow Labs name: pipeline_asr_maialong_model date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_maialong_model` is an English model originally trained by buidung2004. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_maialong_model_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_maialong_model_en_4.2.0_3.0_1664023438768.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_maialong_model_en_4.2.0_3.0_1664023438768.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_maialong_model', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_maialong_model", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_maialong_model| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|227.7 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Bhojpuri Lemmatizer author: John Snow Labs name: lemma date: 2021-01-18 task: Lemmatization language: bh edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [bh, bho, open_source, lemmatizer] supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous. {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/public/TEXT_PREPROCESSING/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TEXT_PREPROCESSING.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_bh_2.7.0_2.4_1610989221391.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_bh_2.7.0_2.4_1610989221391.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer()\ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma", "bh") \ .setInputCols(["token"]) \ .setOutputCol("lemma") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) results = light_pipeline.fullAnnotate(["एह आयोजन में विश्व भोजपुरी सम्मेलन , पूर्वांचल एकता मंच , वीर कुँवर सिंह फाउन्डेशन , पूर्वांचल भोजपुरी महासभा , अउर हर्फ - मीडिया के सहभागिता बा ।"]) ``` ```scala val document_assembler = DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = Tokenizer() .setInputCols("document") .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma", "bh") .setInputCols("token") .setOutputCol("lemma") val nlp_pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer)) val data = Seq("एह आयोजन में विश्व भोजपुरी सम्मेलन , पूर्वांचल एकता मंच , वीर कुँवर सिंह फाउन्डेशन , पूर्वांचल भोजपुरी महासभा , अउर हर्फ - मीडिया के सहभागिता बा ।").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["एह आयोजन में विश्व भोजपुरी सम्मेलन , पूर्वांचल एकता मंच , वीर कुँवर सिंह फाउन्डेशन , पूर्वांचल भोजपुरी महासभा , अउर हर्फ - मीडिया के सहभागिता बा ।"] lemma_df = nlu.load('bh.lemma').predict(text, output_level = "document") lemma_df.lemma.values[0] ```
## Results ```bash {'lemma': [Annotation(token, 0, 1, एह, {'sentence': '0'}), Annotation(token, 3, 7, आयोजन, {'sentence': '0'}), Annotation(token, 9, 11, में, {'sentence': '0'}), Annotation(token, 13, 17, विश्व, {'sentence': '0'}), Annotation(token, 19, 25, भोजपुरी, {'sentence': '0'}), Annotation(token, 27, 33, सम्मेलन, {'sentence': '0'}), Annotation(token, 35, 35, ,, {'sentence': '0'}), Annotation(token, 37, 45, पूर्वांचल, {'sentence': '0'}), Annotation(token, 47, 50, एकता, {'sentence': '0'}), Annotation(token, 52, 54, मंच, {'sentence': '0'}), Annotation(token, 56, 56, ,, {'sentence': '0'}), Annotation(token, 58, 60, वीर, {'sentence': '0'}), Annotation(token, 62, 66, कुँवर, {'sentence': '0'}), Annotation(token, 68, 71, सिंह, {'sentence': '0'}), Annotation(token, 73, 81, फाउन्डेशन, {'sentence': '0'}), Annotation(token, 83, 83, ,, {'sentence': '0'}), Annotation(token, 85, 93, पूर्वांचल, {'sentence': '0'}), Annotation(token, 95, 101, भोजपुरी, {'sentence': '0'}), Annotation(token, 103, 108, महासभा, {'sentence': '0'}), Annotation(token, 110, 110, ,, {'sentence': '0'}), Annotation(token, 112, 114, अउर, {'sentence': '0'}), Annotation(token, 116, 119, हर्फ, {'sentence': '0'}), Annotation(token, 121, 121, -, {'sentence': '0'}), Annotation(token, 123, 128, मीडिया, {'sentence': '0'}), Annotation(token, 130, 131, को, {'sentence': '0'}), Annotation(token, 133, 140, सहभागिता, {'sentence': '0'}), Annotation(token, 142, 143, बा, {'sentence': '0'}), Annotation(token, 145, 145, ।, {'sentence': '0'})]} ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[document]| |Output Labels:|[token]| |Language:|bh| ## Data Source The model was trained on the [Universal Dependencies](http://universaldependencies.org) data set version 2.7. Reference: - Ojha, A. K., & Zeman, D. (2020). Universal Dependency Treebanks for Low-Resource Indian Languages: The Case of Bhojpuri.
Proceedings of the WILDRE5 - 5th Workshop on Indian Language Data: Resources and Evaluation. --- layout: model title: Whereas Pipeline author: John Snow Labs name: legpipe_whereas date: 2022-08-24 tags: [en, legal, whereas, licensed] task: [Named Entity Recognition, Part of Speech Tagging, Dependency Parser, Relation Extraction] language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: Pipeline article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description IMPORTANT: Don't run this model on the whole legal agreement. Instead: - Split by paragraphs. You can use [notebook 1](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings) in Finance or Legal as inspiration; - Use the `legclf_cuad_whereas_clause` Text Classifier to select only these paragraphs; This is a Pretrained Pipeline to show extraction of whereas clauses (Subject, Action and Object), and also the relationships between them, using two approaches: - A Semantic Relation Extraction Model; - A Dependency Parser Tree; The difficulty of these entities is that they are totally free-text, with OBJECT sometimes being very long with very diverse vocabulary. Although the NER and the REDL can help you identify them, the Dependency Parser has been added so you can navigate the tree looking for specific direct objects or other phrases. ## Predicted Entities `WHEREAS_SUBJECT`, `WHEREAS_OBJECT`, `WHEREAS_ACTION` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/LEGALNER_WHEREAS){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legpipe_whereas_en_1.0.0_3.2_1661340138139.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legpipe_whereas_en_1.0.0_3.2_1661340138139.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
from johnsnowlabs import *

deid_pipeline = PretrainedPipeline("legpipe_whereas", "en", "legal/models")

pipeline_result = deid_pipeline.fullAnnotate('WHEREAS VerticalNet owns and operates a series of online communities.')

# Return NER chunks
pipeline_result[0]['ner_chunk']

# Return RE
pipeline_result[0]['relations']

# Visualize the dependencies
dependency_vis = viz.DependencyParserVisualizer()

dependency_vis.display(pipeline_result[0],  # results of a single example, not the complete dataframe
                       pos_col='pos',       # the POS column
                       dependency_col='dependencies',         # the dependency column
                       dependency_type_col='dependency_type'  # the dependency type column
                       )
```
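The description above tells you to split the agreement into paragraphs and keep only the whereas paragraphs before running this pipeline. Below is a minimal, framework-free sketch of that preprocessing step; the `is_whereas_paragraph` heuristic is a hypothetical stand-in for the `legclf_cuad_whereas_clause` classifier, used here only for illustration.

```python
import re

def split_paragraphs(agreement):
    """Split a legal agreement into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", agreement) if p.strip()]

def is_whereas_paragraph(paragraph):
    # Hypothetical stand-in for the legclf_cuad_whereas_clause classifier:
    # a simple keyword heuristic, NOT the real model.
    return paragraph.upper().startswith("WHEREAS")

agreement = (
    "WHEREAS VerticalNet owns and operates a series of online communities.\n\n"
    "NOW, THEREFORE, the parties agree as follows.\n\n"
    "WHEREAS the parties wish to enter into this agreement."
)

# Only the filtered paragraphs would be fed to the pretrained pipeline.
whereas_paragraphs = [p for p in split_paragraphs(agreement) if is_whereas_paragraph(p)]
print(whereas_paragraphs)
```

In a real workflow the classifier's prediction replaces the keyword check, but the shape of the loop stays the same.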
## Results

```bash
# NER
['VerticalNet', 'operates', 'a series of online communities']

# Relations
['has_subject', 'has_subject', 'has_object']

# DEP
# Use Spark NLP Display to see the dependency tree
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legpipe_whereas|
|Type:|pipeline|
|Compatibility:|Legal NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Size:|918.6 MB|

## References

In-house annotations on the CUAD dataset

## Included Models

- nlp.DocumentAssembler
- nlp.Tokenizer
- nlp.PerceptronModel
- nlp.DependencyParserModel
- nlp.TypedDependencyParserModel
- nlp.RoBertaEmbeddings
- legal.NerModel
- nlp.NerConverter
- legal.RelationExtractionDLModel

---
layout: model
title: English BertForQuestionAnswering Base Cased model (from roshnir)
author: John Snow Labs
name: bert_qa_base_multi_mlqa_dev
date: 2022-07-07
tags: [en, open_source, bert, question_answering]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multi-mlqa-dev-en` is an English model originally trained by `roshnir`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_base_multi_mlqa_dev_en_4.0.0_3.0_1657183101705.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_base_multi_mlqa_dev_en_4.0.0_3.0_1657183101705.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_multi_mlqa_dev","en") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_base_multi_mlqa_dev","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
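Under the hood, extractive QA annotators like this one score every token as a possible answer start and answer end, then return the highest-scoring valid span from the context. A toy sketch of that span-selection logic follows; the tokens and logits are made up for illustration and are not the model's real outputs.

```python
def best_span(start_logits, end_logits, max_len=30):
    """Pick the (start, end) pair maximizing start+end score, with end >= start."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Illustrative logits: the model strongly prefers "Clara" as both start and end.
start_logits = [0.1, 0.0, 0.2, 5.0, 0.0, 0.1, 0.0, 0.0, 1.0, 0.0]
end_logits   = [0.0, 0.1, 0.0, 4.5, 0.2, 0.0, 0.1, 0.0, 2.0, 0.0]

s, e = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # prints "Clara"
```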
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_base_multi_mlqa_dev|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|626.2 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/roshnir/bert-base-multi-mlqa-dev-en

---
layout: model
title: Detect Clinical Events
author: John Snow Labs
name: ner_events_clinical
date: 2021-03-31
tags: [ner, clinical, licensed, en]
task: Named Entity Recognition
language: en
nav_key: models
edition: Healthcare NLP 3.0.0
spark_version: 3.0
supported: true
annotator: MedicalNerModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model can be used to detect clinical events in medical text.

## Predicted Entities

`DATE`, `TIME`, `PROBLEM`, `TEST`, `TREATMENT`, `OCCURRENCE`, `CLINICAL_DEPT`, `EVIDENTIAL`, `DURATION`, `FREQUENCY`, `ADMISSION`, `DISCHARGE`.

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_EVENTS_CLINICAL/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_events_clinical_en_3.0.0_3.0_1617209685283.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_events_clinical_en_3.0.0_3.0_1617209685283.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") clinical_ner = MedicalNerModel.pretrained("ner_events_clinical", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter]) model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame([["The patient presented to the emergency room last evening"]], ["text"])) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_detector = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") val ner = MedicalNerModel.pretrained("ner_events_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token", "embeddings")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter)) val data = Seq("""The patient presented to 
the emergency room last evening""").toDS().toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.events_clinical").predict("""The patient presented to the emergency room last evening""") ```
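The `NerConverter` stage in the pipelines above groups token-level IOB tags into the entity chunks shown in the Results section. A rough pure-Python approximation of that grouping is sketched below; the tags are illustrative, chosen to reproduce this card's example sentence, not actual model output.

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (text, label) chunks, as NerConverter does."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # a new chunk starts
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)             # continue the open chunk
        else:                               # "O" tag closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["The", "patient", "presented", "to", "the", "emergency", "room", "last", "evening"]
tags = ["O", "O", "B-EVIDENTIAL", "O", "B-CLINICAL_DEPT", "I-CLINICAL_DEPT", "I-CLINICAL_DEPT", "B-DATE", "I-DATE"]
print(iob_to_chunks(tokens, tags))
```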
## Results ```bash +----+-----------------------------+---------+---------+-----------------+ | | chunk | begin | end | entity | +====+=============================+=========+=========+=================+ | 0 | presented | 12 | 20 | EVIDENTIAL | +----+-----------------------------+---------+---------+-----------------+ | 1 | the emergency room | 25 | 42 | CLINICAL_DEPT | +----+-----------------------------+---------+---------+-----------------+ | 2 | last evening | 44 | 55 | DATE | +----+-----------------------------+---------+---------+-----------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_events_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|en| ## Data Source Trained on i2b2 events data with *clinical_embeddings*. ## Benchmarking ```bash label tp fp fn prec rec f1 I-TIME 82 12 45 0.87234 0.645669 0.742081 I-TREATMENT 2580 439 535 0.854588 0.82825 0.841213 B-OCCURRENCE 1548 680 945 0.694793 0.620939 0.655793 I-DURATION 366 183 99 0.666667 0.787097 0.721893 B-DATE 847 151 138 0.848697 0.859898 0.854261 I-DATE 921 191 196 0.828237 0.82453 0.82638 B-ADMISSION 105 102 15 0.507246 0.875 0.642202 I-PROBLEM 5238 902 823 0.853094 0.864214 0.858618 B-CLINICAL_DEPT 613 130 119 0.825034 0.837432 0.831187 B-TIME 36 8 24 0.818182 0.6 0.692308 I-CLINICAL_DEPT 1273 210 137 0.858395 0.902837 0.880055 B-PROBLEM 3717 608 591 0.859422 0.862813 0.861114 I-TEST 2304 384 361 0.857143 0.86454 0.860826 B-TEST 1870 372 300 0.834077 0.861751 0.847688 B-TREATMENT 2767 437 513 0.863608 0.843598 0.853485 B-EVIDENTIAL 394 109 201 0.7833 0.662185 0.717669 B-DURATION 236 119 105 0.664789 0.692082 0.678161 B-FREQUENCY 117 20 79 0.854015 0.596939 0.702703 Macro-average 25806 5821 6342 0.735285 0.677034 0.704959 Micro-average 25806 5821 6342 0.815948 0.802725 0.809283 ``` --- layout: model title: English XLMRoBerta Embeddings (from 
EMBEDDIA)
author: John Snow Labs
name: xlmroberta_embeddings_litlat_bert
date: 2022-05-13
tags: [en, open_source, xlm_roberta, embeddings, litlat]
task: Embeddings
language: en
nav_key: models
edition: Spark NLP 3.4.4
spark_version: 3.0
supported: true
annotator: XlmRoBertaEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained XLMRoBERTa Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `litlat-bert` is an English model originally trained by `EMBEDDIA`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_litlat_bert_en_3.4.4_3.0_1652440079941.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_litlat_bert_en_3.4.4_3.0_1652440079941.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_litlat_bert","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_litlat_bert","en") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ```
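Token embeddings like the ones produced above are typically compared with cosine similarity: related tokens score higher than unrelated ones. A small self-contained sketch with toy vectors follows; real embedding vectors from models like this have hundreds of dimensions, and the values below are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 4-dimensional "embeddings" (illustrative values only).
spark_vec = [0.2, 0.8, 0.1, 0.3]
nlp_vec = [0.25, 0.75, 0.05, 0.35]
cat_vec = [0.9, 0.1, 0.7, 0.0]

# Related tokens should score higher than unrelated ones.
print(cosine(spark_vec, nlp_vec) > cosine(spark_vec, cat_vec))  # prints True
```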
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_embeddings_litlat_bert| |Compatibility:|Spark NLP 3.4.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|en| |Size:|361.9 MB| |Case sensitive:|true| ## References - https://huggingface.co/EMBEDDIA/litlat-bert --- layout: model title: Spanish RobertaForQuestionAnswering Base Cased model (from BSC-TeMU) author: John Snow Labs name: roberta_qa_bsc_temu_base_bne_s_c date: 2022-12-02 tags: [es, open_source, roberta, question_answering, tensorflow] task: Question Answering language: es edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-bne-sqac` is a Spanish model originally trained by `BSC-TeMU`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_bsc_temu_base_bne_s_c_es_4.2.4_3.0_1669985795826.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_bsc_temu_base_bne_s_c_es_4.2.4_3.0_1669985795826.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_bsc_temu_base_bne_s_c","es")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_bsc_temu_base_bne_s_c","es")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_bsc_temu_base_bne_s_c|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|es|
|Size:|460.3 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/BSC-TeMU/roberta-base-bne-sqac
- https://arxiv.org/abs/1907.11692
- http://www.bne.es/en/Inicio/index.html
- https://github.com/PlanTL-SANIDAD/lm-spanish
- https://arxiv.org/abs/2107.07253

---
layout: model
title: Sentence Entity Resolver for Snomed Concepts, CT version (``sbiobert_base_cased_mli`` embeddings)
author: John Snow Labs
name: sbiobertresolve_snomed_findings
date: 2021-05-16
tags: [entity_resolution, clinical, licensed, en]
task: Entity Resolution
language: en
nav_key: models
edition: Healthcare NLP 3.0.4
spark_version: 3.0
supported: true
annotator: SentenceEntityResolverModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model maps extracted medical entities to SNOMED codes (CT version) using `sbiobert_base_cased_mli` Sentence Bert Embeddings. It loads about 6X faster than previous versions, and the load process is more memory-friendly: the maximum memory required during loading is smaller, which reduces the chance of OOM exceptions and relaxes hardware requirements.

## Predicted Entities

Predicts SNOMED codes and their normalized definitions for each chunk.
{:.btn-box}
[Live Demo](https://nlp.johnsnowlabs.com/demos){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_findings_en_3.0.4_3.0_1621191323188.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_snomed_findings_en_3.0.4_3.0_1621191323188.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use

The ```sbiobertresolve_snomed_findings``` resolver model must be used with ```sbiobert_base_cased_mli``` as the embeddings model and ```ner_clinical``` as the NER model. There is no need to set ```.setWhiteList()```.
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") snomed_resolver = SentenceEntityResolverModel\ .pretrained("sbiobertresolve_snomed_findings","en", "clinical/models") \ .setInputCols(["ner_chunk", "sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala ... 
val chunk2doc = new Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")

val sbert_embedder = BertSentenceEmbeddings
    .pretrained("sbiobert_base_cased_mli","en","clinical/models")
    .setInputCols(Array("ner_chunk_doc"))
    .setOutputCol("sbert_embeddings")

val snomed_resolver = SentenceEntityResolverModel
    .pretrained("sbiobertresolve_snomed_findings","en", "clinical/models")
    .setInputCols(Array("ner_chunk", "sbert_embeddings"))
    .setOutputCol("resolution")
    .setDistanceFunction("EUCLIDEAN")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, snomed_resolver))

val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.resolve.snomed.findings").predict("""This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St .
Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .""") ```
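Conceptually, the resolver compares each chunk's sentence embedding against a precomputed index of code embeddings and returns the nearest code, here by Euclidean distance as set with `.setDistanceFunction("EUCLIDEAN")`. A toy sketch of that nearest-neighbor lookup follows; the 3-dimensional vectors and the tiny index are made up for illustration (real sentence embeddings have hundreds of dimensions and the index holds the whole terminology).

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy index mapping SNOMED codes to 3-d "sentence embeddings" (illustrative values).
index = {
    "38341003": [0.9, 0.1, 0.0],   # hypertension
    "13645005": [0.1, 0.8, 0.2],   # COPD
    "235653009": [0.0, 0.2, 0.9],  # gastritis
}

def resolve(chunk_embedding):
    """Return the code whose embedding is nearest to the chunk embedding."""
    return min(index, key=lambda code: euclidean(index[code], chunk_embedding))

print(resolve([0.85, 0.15, 0.05]))  # prints "38341003"
```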
## Results ```bash +--------------------+-----+---+---------+---------+----------+--------------------+--------------------+ | chunk|begin|end| entity| code|confidence| resolutions| codes| +--------------------+-----+---+---------+---------+----------+--------------------+--------------------+ | hypertension| 68| 79| PROBLEM| 38341003| 0.3234|hypertension:::hy...|38341003:::155295...| |chronic renal ins...| 83|109| PROBLEM|723190009| 0.7522|chronic renal ins...|723190009:::70904...| | COPD| 113|116| PROBLEM| 13645005| 0.1226|copd - chronic ob...|13645005:::155565...| | gastritis| 120|128| PROBLEM|235653009| 0.2444|gastritis:::gastr...|235653009:::45560...| | TIA| 136|138| PROBLEM|275382005| 0.0766|cerebral trauma (...|275382005:::44739...| |a non-ST elevatio...| 182|202| PROBLEM|233843008| 0.2224|silent myocardial...|233843008:::19479...| |Guaiac positive s...| 208|229| PROBLEM| 59614000| 0.9678|guaiac-positive s...|59614000:::703960...| |cardiac catheteri...| 295|317| TEST|301095005| 0.2584|cardiac finding::...|301095005:::25090...| | PTCA| 324|327|TREATMENT|373108000| 0.0809|post percutaneous...|373108000:::25103...| | mid LAD lesion| 332|345| PROBLEM|449567000| 0.0900|overriding left v...|449567000:::46140...| +--------------------+-----+---+---------+---------+----------+--------------------+--------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_snomed_findings| |Compatibility:|Healthcare NLP 3.0.4+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk, sbert_embeddings]| |Output Labels:|[snomed_ct_code]| |Language:|en| |Case sensitive:|false| ## Data Source Trained on SNOMED (CT version) Findings with ``sbiobert_base_cased_mli`` sentence embeddings. 
http://www.snomed.org/

---
layout: model
title: English RobertaForQuestionAnswering Base Cased model (from akdeniz27)
author: John Snow Labs
name: roberta_qa_akdeniz27_base_cuad
date: 2022-12-02
tags: [en, open_source, roberta, question_answering, tensorflow]
task: Question Answering
language: en
nav_key: models
edition: Spark NLP 4.2.4
spark_version: 3.0
supported: true
engine: tensorflow
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-cuad` is an English model originally trained by `akdeniz27`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_akdeniz27_base_cuad_en_4.2.4_3.0_1669986180302.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_akdeniz27_base_cuad_en_4.2.4_3.0_1669986180302.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_akdeniz27_base_cuad","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_akdeniz27_base_cuad","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_akdeniz27_base_cuad|
|Compatibility:|Spark NLP 4.2.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[class]|
|Language:|en|
|Size:|447.9 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/akdeniz27/roberta-base-cuad
- https://github.com/TheAtticusProject/cuad
- https://github.com/marshmellow77/cuad-demo

---
layout: model
title: Relation Extraction Between Body Parts and Procedures
author: John Snow Labs
name: redl_bodypart_procedure_test_biobert
date: 2021-06-01
tags: [licensed, en, relation_extraction, clinical]
task: Relation Extraction
language: en
nav_key: models
edition: Healthcare NLP 3.0.3
spark_version: 3.0
supported: true
annotator: RelationExtractionDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Relation extraction between body part entities such as `Internal_organ_or_component` and `External_body_part_or_region`, and procedure and test entities.

`1`: the body part and the test/procedure are related to each other.
`0`: the body part and the test/procedure are not related to each other.

## Predicted Entities

`0`, `1`

{:.btn-box}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.1.Clinical_Relation_Extraction_BodyParts_Models.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_procedure_test_biobert_en_3.0.3_3.0_1622581871045.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/redl_bodypart_procedure_test_biobert_en_3.0.3_3.0_1622581871045.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
...
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = sparknlp.annotators.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ner_tags")

ner_converter = NerConverter()\
    .setInputCols(["sentences", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

# Set a filter on the pairs of named entities that will be treated as relation candidates
re_ner_chunk_filter = RENerChunksFilter()\
    .setInputCols(["ner_chunks", "dependencies"])\
    .setMaxSyntacticDistance(10)\
    .setOutputCol("re_ner_chunks")\
    .setRelationPairs(["external_body_part_or_region-test"])

# The dataset this model was trained on is sentence-wise.
# The model can also be trained on document-level relations - in that case, use "document" instead of "sentence" as input when predicting.
re_model = RelationExtractionDLModel.pretrained('redl_bodypart_procedure_test_biobert', 'en', "clinical/models")\
    .setPredictionThreshold(0.5)\
    .setInputCols(["re_ner_chunks", "sentences"])\
    .setOutputCol("relations")

pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model])

text = "TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound."

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
...
val documenter = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentencer = new SentenceDetector()
    .setInputCols(Array("document"))
    .setOutputCol("sentences")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentences"))
    .setOutputCol("tokens")

val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens"))
    .setOutputCol("pos_tags")

val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens"))
    .setOutputCol("embeddings")

val ner_tagger = MedicalNerModel.pretrained("ner_jsl_greedy", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens", "embeddings"))
    .setOutputCol("ner_tags")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentences", "tokens", "ner_tags"))
    .setOutputCol("ner_chunks")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
    .setInputCols(Array("sentences", "pos_tags", "tokens"))
    .setOutputCol("dependencies")

// Set a filter on the pairs of named entities that will be treated as relation candidates
val re_ner_chunk_filter = new RENerChunksFilter()
    .setInputCols(Array("ner_chunks", "dependencies"))
    .setMaxSyntacticDistance(10)
    .setOutputCol("re_ner_chunks")
    .setRelationPairs(Array("external_body_part_or_region-test"))

// The dataset this model was trained on is sentence-wise.
// The model can also be trained on document-level relations - in that case, use "document" instead of "sentence" as input when predicting.
val re_model = RelationExtractionDLModel.pretrained("redl_bodypart_procedure_test_biobert", "en", "clinical/models")
    .setPredictionThreshold(0.5)
    .setInputCols(Array("re_ner_chunks", "sentences"))
    .setOutputCol("relations")

val pipeline = new Pipeline().setStages(Array(documenter, sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter, re_model))

val data = Seq("TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("en.relation.bodypart.procedure").predict("""TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.""")
```
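The `RENerChunksFilter` stage above keeps only those entity-chunk pairs whose types appear in `setRelationPairs` and whose syntactic distance is within `setMaxSyntacticDistance`; only the surviving pairs are scored by the relation model. A rough sketch of that candidate filtering follows, approximating syntactic distance by token distance (an assumption made for illustration; the real filter measures distance over the dependency tree).

```python
def candidate_pairs(chunks, allowed_pairs, max_distance):
    """chunks: list of (text, entity_type, token_position) tuples.
    Keep pairs whose type combination is allowed and whose token
    distance is within max_distance (a stand-in for syntactic distance)."""
    allowed = {tuple(p.split("-")) for p in allowed_pairs}
    pairs = []
    for i, (t1, e1, p1) in enumerate(chunks):
        for t2, e2, p2 in chunks[i + 1:]:
            if ((e1, e2) in allowed or (e2, e1) in allowed) and abs(p1 - p2) <= max_distance:
                pairs.append((t1, t2))
    return pairs

# Chunks from the example sentence, with illustrative token positions.
chunks = [
    ("chest", "external_body_part_or_region", 17),
    ("portable ultrasound", "test", 22),
    ("mother", "relative", 14),
]

print(candidate_pairs(chunks, ["external_body_part_or_region-test"], max_distance=10))
```

Only the ("chest", "portable ultrasound") pair survives: "mother" has a type that is not in the allowed pair list.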
## Results

```bash
|    | relation | entity1                      | chunk1 | entity2 | chunk2              | confidence |
|---:|---------:|:-----------------------------|:-------|:--------|:--------------------|-----------:|
|  0 |        1 | External_body_part_or_region | chest  | Test    | portable ultrasound |    0.99953 |
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|redl_bodypart_procedure_test_biobert|
|Compatibility:|Healthcare NLP 3.0.3+|
|License:|Licensed|
|Edition:|Official|
|Language:|en|
|Case sensitive:|true|

## Data Source

Trained on a custom internal dataset.

## Benchmarking

```bash
Relation  Recall  Precision  F1     Support
0         0.338   0.472      0.394  325
1         0.904   0.843      0.872  1275
Avg.      0.621   0.657      0.633  -
```

---
layout: model
title: Legal Enforceability Clause Binary Classifier (LEDGAR)
author: John Snow Labs
name: legclf_enforceability_bert
date: 2023-03-05
tags: [en, legal, classification, clauses, enforceability, licensed, tensorflow]
task: Text Classification
language: en
edition: Legal NLP 1.0.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LegalClassifierDLModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The LEDGAR dataset targets contract provision (paragraph) classification. The provisions come from contracts in US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

This model is a Binary Classifier (True, False) for the `Enforceability` clause type. To use this model, make sure you provide enough context as input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them unless you want to do Binary Classification at sentence level.
If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/01.Page_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `Enforceability`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_enforceability_bert_en_1.0.0_3.0_1678047282639.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_enforceability_bert_en_1.0.0_3.0_1678047282639.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_enforceability_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
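As the description notes, the embeddings behind this classifier accept at most 512 tokens, so long filings should be split before classification. Below is a minimal plain-Python sketch of the "paragraph splitting (by multiline)" idea, packing paragraphs into chunks under a whitespace-token budget. `split_paragraphs` is a hypothetical helper for illustration only, not part of Spark NLP.

```python
def split_paragraphs(text, max_tokens=512):
    """Split on blank lines, then pack paragraphs into chunks of at most
    max_tokens whitespace-separated tokens each."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for p in paragraphs:
        n = len(p.split())
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(p)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = "Clause one text here.\n\nClause two text here.\n\nClause three."
print(split_paragraphs(doc, max_tokens=8))
```

Each resulting chunk (assuming no single paragraph exceeds the budget) can then be fed to the pipeline above as a separate row of the `text` column.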
## Results ```bash +-------+ |result| +-------+ |[Enforceability]| |[Other]| |[Other]| |[Enforceability]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_enforceability_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.5 MB| ## References Training dataset available [here](https://huggingface.co/datasets/lex_glue) ## Benchmarking ```bash label precision recall f1-score support Enforceability 0.78 0.84 0.81 25 Other 0.90 0.86 0.88 43 accuracy - - 0.85 68 macro-avg 0.84 0.85 0.84 68 weighted-avg 0.86 0.85 0.85 68 ``` --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from csarron) author: John Snow Labs name: roberta_qa_base_squad_v1 date: 2022-12-02 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad-v1` is an English model originally trained by `csarron`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad_v1_en_4.2.4_3.0_1669986599451.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_base_squad_v1_en_4.2.4_3.0_1669986599451.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad_v1","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_base_squad_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
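Extractive QA heads like this one score every context token as a possible answer start and end; the returned answer is the best-scoring valid span (start before end). Below is a toy plain-Python illustration of that span selection with made-up scores, not the actual Spark NLP decoding logic.

```python
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Hypothetical start/end scores; a real model derives these from the
# joint question-context encoding.
start_scores = [0.1, 0.2, 0.1, 3.5, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]
end_scores   = [0.1, 0.1, 0.1, 3.1, 0.1, 0.1, 0.1, 0.1, 0.5, 0.1]

# Pick the (start, end) pair with the highest combined score, start <= end.
best = max(
    ((s, e) for s in range(len(tokens)) for e in range(s, len(tokens))),
    key=lambda se: start_scores[se[0]] + end_scores[se[1]],
)
answer = " ".join(tokens[best[0]:best[1] + 1])
print(answer)  # → Clara
```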
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_base_squad_v1| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|457.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/csarron/roberta-base-squad-v1 - https://arxiv.org/abs/1907.11692 - https://rajpurkar.github.io/SQuAD-explorer - https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json - https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json - https://awk.ai/ - https://github.com/csarron - https://twitter.com/sysnlp --- layout: model title: Legal No assignment Clause Binary Classifier author: John Snow Labs name: legclf_no_assignment_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `no-assignment` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level. If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens.
If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `other`, `no-assignment` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_no_assignment_clause_en_1.0.0_3.2_1660123750879.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_no_assignment_clause_en_1.0.0_3.2_1660123750879.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_no_assignment_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
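As the description mentions, several binary clause classifiers can be run over the same text and their per-model labels collected into one True/False profile per clause type. A plain-Python sketch of that aggregation step is shown below; the model names and labels are illustrative outputs, not live predictions.

```python
# Hypothetical per-model outputs: each binary classifier emits either its
# clause label or "other" for a given text.
predictions = {
    "legclf_no_assignment_clause": "no-assignment",
    "legclf_enforceability_bert": "other",
    "legclf_right_of_first_refusal_clause": "other",
}

def clause_profile(predictions):
    """Map each classifier's label to a boolean: True if its clause was found."""
    return {model: label != "other" for model, label in predictions.items()}

print(clause_profile(predictions))
```

Running many such classifiers in one pipeline yields this profile for every document row.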
## Results ```bash +-------+ | result| +-------+ |[no-assignment]| |[other]| |[other]| |[no-assignment]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_no_assignment_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support no-assignment 0.87 0.91 0.89 43 other 0.95 0.93 0.94 89 accuracy - - 0.92 132 macro-avg 0.91 0.92 0.91 132 weighted-avg 0.93 0.92 0.92 132 ``` --- layout: model title: Legal Investment Advisory And Management Agreement Document Classifier (Bert Sentence Embeddings) author: John Snow Labs name: legclf_investment_advisory_and_management_agreement_bert date: 2023-01-29 tags: [en, legal, classification, investment, advisory, management, agreement, licensed, bert, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_investment_advisory_and_management_agreement_bert` model is a Bert Sentence Embeddings Document Classifier used to classify whether a document belongs to the class `investment-advisory-and-management-agreement` or not (Binary Classification). Unlike the Longformer model, this model is faster at inference.
## Predicted Entities `investment-advisory-and-management-agreement`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_investment_advisory_and_management_agreement_bert_en_1.0.0_3.0_1674990433150.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_investment_advisory_and_management_agreement_bert_en_1.0.0_3.0_1674990433150.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_investment_advisory_and_management_agreement_bert", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[investment-advisory-and-management-agreement]| |[other]| |[other]| |[investment-advisory-and-management-agreement]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_investment_advisory_and_management_agreement_bert| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.2 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support investment-advisory-and-management-agreement 1.00 0.97 0.99 34 other 0.98 1.00 0.99 53 accuracy - - 0.99 87 macro-avg 0.99 0.99 0.99 87 weighted-avg 0.99 0.99 0.99 87 ``` --- layout: model title: Legal Right Of First Refusal Clause Binary Classifier author: John Snow Labs name: legclf_right_of_first_refusal_clause date: 2023-01-29 tags: [en, legal, classification, right, first, refusal, clauses, right_of_first_refusal, licensed, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow annotator: LegalClassifierDLModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `right-of-first-refusal` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at the sentence level.
If you have big legal documents, and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Note that the embeddings used by this model allow up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added. ## Predicted Entities `right-of-first-refusal`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_right_of_first_refusal_clause_en_1.0.0_3.0_1675005454414.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_right_of_first_refusal_clause_en_1.0.0_3.0_1675005454414.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\ .setInputCols("document")\ .setOutputCol("sentence_embeddings") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_right_of_first_refusal_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[right-of-first-refusal]| |[other]| |[other]| |[right-of-first-refusal]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_right_of_first_refusal_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|22.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 1.00 0.97 0.99 38 right-of-first-refusal 0.96 1.00 0.98 22 accuracy - - 0.98 60 macro-avg 0.98 0.99 0.98 60 weighted-avg 0.98 0.98 0.98 60 ``` --- layout: model title: Stopwords Remover for Albanian language (223 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, sq, open_source] task: Stop Words Removal language: sq edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_sq_3.4.1_3.0_1646673139987.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_sq_3.4.1_3.0_1646673139987.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","sq") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["Ju nuk jeni më të mirë se unë"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stopWords = StopWordsCleaner.pretrained("stopwords_iso","sq") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stopWords)) val data = Seq("Ju nuk jeni më të mirë se unë").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("sq.stopwords").predict("""Ju nuk jeni më të mirë se unë""") ```
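Conceptually, `StopWordsCleaner` simply filters tokens against a fixed word list. The plain-Python equivalent below reproduces the example sentence's output using an assumed subset of the stopwords-iso Albanian list (the full list has 223 entries).

```python
# Assumed subset of the stopwords-iso Albanian stopword list.
stopwords = {"ju", "nuk", "se"}

tokens = "Ju nuk jeni më të mirë se unë".split()
# Case-insensitive filtering, mirroring the annotator's default behavior.
clean_tokens = [t for t in tokens if t.lower() not in stopwords]
print(clean_tokens)  # → ['jeni', 'më', 'të', 'mirë', 'unë']
```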
## Results ```bash +-------------------------+ |result | +-------------------------+ |[jeni, më, të, mirë, unë]| +-------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|sq| |Size:|1.9 KB| --- layout: model title: English RobertaForQuestionAnswering Cased model (from AnonymousSub) author: John Snow Labs name: roberta_qa_emanuals_squad2.0 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `EManuals_RoBERTa_squad2.0` is an English model originally trained by `AnonymousSub`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_emanuals_squad2.0_en_4.3.0_3.0_1674208019986.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_emanuals_squad2.0_en_4.3.0_3.0_1674208019986.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_emanuals_squad2.0","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_emanuals_squad2.0","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_emanuals_squad2.0| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[class]| |Language:|en| |Size:|466.7 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/AnonymousSub/EManuals_RoBERTa_squad2.0 --- layout: model title: Abkhazian asr_speech_sprint_test TFWav2Vec2ForCTC from Mofe author: John Snow Labs name: asr_speech_sprint_test date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_speech_sprint_test` is an Abkhazian model originally trained by Mofe. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_speech_sprint_test_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_speech_sprint_test_ab_4.2.0_3.0_1664021394571.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_speech_sprint_test_ab_4.2.0_3.0_1664021394571.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_speech_sprint_test", "ab")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_speech_sprint_test", "ab") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
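Wav2Vec2ForCTC emits a character prediction for every audio frame; CTC decoding then collapses consecutive repeats and drops the blank symbol to form the transcript. The greedy plain-Python sketch below illustrates the idea over toy frame labels (it is not the annotator's internal implementation).

```python
BLANK = "-"  # the CTC blank symbol in this toy alphabet

def ctc_greedy_decode(frame_labels):
    """Collapse consecutive duplicate labels, then remove blanks
    (greedy CTC decoding over per-frame argmax labels)."""
    collapsed = []
    prev = None
    for label in frame_labels:
        if label != prev:
            collapsed.append(label)
        prev = label
    return "".join(l for l in collapsed if l != BLANK)

# Per-frame predictions for a short utterance; the blank between the two
# 'l' runs is what lets CTC emit a genuine double letter.
frames = ["h", "h", "-", "e", "l", "l", "-", "l", "o", "o"]
print(ctc_greedy_decode(frames))  # → hello
```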
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_speech_sprint_test| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ab| |Size:|446.9 KB| --- layout: model title: Relation extraction between Drugs and ADE - Conversational Text author: John Snow Labs name: re_ade_conversational date: 2022-07-27 tags: [relation_extraction, licensed, clinical, en] task: Relation Extraction language: en nav_key: models edition: Healthcare NLP 3.5.0 spark_version: 3.0 supported: true annotator: RelationExtractionModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model relates drugs and the adverse reactions they cause in conversational text. ## Predicted Entities `is_related`, `not_related` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/RE_ADE/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/re_ade_conversational_en_3.5.0_3.0_1658956087191.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/re_ade_conversational_en_3.5.0_3.0_1658956087191.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documenter = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("sentences") tokenizer = Tokenizer()\ .setInputCols(["sentences"])\ .setOutputCol("tokens") words_embedder = WordEmbeddingsModel() \ .pretrained("embeddings_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"]) \ .setOutputCol("embeddings") ner_tagger = MedicalNerModel() \ .pretrained("ner_ade_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens", "embeddings"]) \ .setOutputCol("ner_tags") ner_converter = NerConverter() \ .setInputCols(["sentences", "tokens", "ner_tags"]) \ .setOutputCol("ner_chunks") pos_tagger = PerceptronModel()\ .pretrained("pos_clinical", "en", "clinical/models") \ .setInputCols(["sentences", "tokens"])\ .setOutputCol("pos_tags") dependency_parser = sparknlp.annotators.DependencyParserModel()\ .pretrained("dependency_conllu", "en")\ .setInputCols(["sentences", "pos_tags", "tokens"])\ .setOutputCol("dependencies") re_model = RelationExtractionModel()\ .pretrained("re_ade_conversational", "en", "clinical/models")\ .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\ .setOutputCol("relations")\ .setRelationPairs(["ade-drug", "drug-ade"]) # Possible relation pairs. Default: All Relations. nlp_pipeline = Pipeline(stages=[documenter, tokenizer, words_embedder, pos_tagger, ner_tagger, ner_converter, dependency_parser, re_model]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) text ="""19.32 day 20 rivaroxaban diary. 
still residual aches and pains; only had 4 paracetamol today.""" annotations = light_pipeline.fullAnnotate(text) ``` ```scala val documenter = new DocumentAssembler() .setInputCol("text") .setOutputCol("sentences") val tokenizer = new Tokenizer() .setInputCols("sentences") .setOutputCol("tokens") val words_embedder = WordEmbeddingsModel() .pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("embeddings") val ner_tagger = MedicalNerModel() .pretrained("ner_ade_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens", "embeddings")) .setOutputCol("ner_tags") val ner_converter = new NerConverter() .setInputCols(Array("sentences", "tokens", "ner_tags")) .setOutputCol("ner_chunks") val pos_tagger = PerceptronModel() .pretrained("pos_clinical", "en", "clinical/models") .setInputCols(Array("sentences", "tokens")) .setOutputCol("pos_tags") val dependency_parser = DependencyParserModel() .pretrained("dependency_conllu", "en") .setInputCols(Array("sentences", "pos_tags", "tokens")) .setOutputCol("dependencies") val re_model = RelationExtractionModel() .pretrained("re_ade_conversational", "en", "clinical/models") .setInputCols(Array("embeddings", "pos_tags", "ner_chunks", "dependencies")) .setOutputCol("relations") .setMaxSyntacticDistance(3) // default: 0 .setPredictionThreshold(0.5) // default: 0.5 .setRelationPairs(Array("drug-ade", "ade-drug")) // Possible relation pairs. Default: All Relations. val nlpPipeline = new Pipeline().setStages(Array(documenter, tokenizer, words_embedder, pos_tagger, ner_tagger, ner_converter, dependency_parser, re_model)) val data = Seq("""19.32 day 20 rivaroxaban diary. still residual aches and pains; only had 4 paracetamol today.""").toDS.toDF("text") val result = nlpPipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.relation.adverse_drug_events.conversational").predict("""19.32 day 20 rivaroxaban diary.
still residual aches and pains; only had 4 paracetamol today.""") ```
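The `setRelationPairs` parameter in the pipelines above restricts scoring to the listed entity-type pairs; conceptually, candidate entity pairs are filtered by type before the relation classifier ever sees them. A plain-Python sketch of that filtering step, using chunks and types taken from the example sentence (the pairing logic here is illustrative, not the annotator's internals):

```python
# Allowed (type1, type2) combinations, mirroring setRelationPairs(["ade-drug", "drug-ade"]).
allowed_pairs = {("ade", "drug"), ("drug", "ade")}

# NER chunks extracted from the example text, as (chunk, entity_type).
chunks = [
    ("residual aches and pains", "ADE"),
    ("rivaroxaban", "DRUG"),
    ("paracetamol", "DRUG"),
]

# Keep only same-sentence pairs whose entity types match an allowed combination.
candidates = [
    (c1, c2)
    for i, (c1, t1) in enumerate(chunks)
    for c2, t2 in chunks[i + 1:]
    if (t1.lower(), t2.lower()) in allowed_pairs
]
print(candidates)
```

Only the surviving candidate pairs are passed to the relation classifier, which then assigns `is_related` or `not_related` to each.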
## Results ```bash | | chunk1 | entity1 | chunk2 | entity2 | relation | |----|-------------------------------|------------|-------------|---------|-------------| | 0 | residual aches and pains | ADE | rivaroxaban | DRUG | is_related | | 1 | residual aches and pains | ADE | paracetamol | DRUG | not_related | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|re_ade_conversational| |Type:|re| |Compatibility:|Healthcare NLP 3.5.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[embeddings, pos_tags, train_ner_chunks, dependencies]| |Output Labels:|[relations]| |Language:|en| |Size:|11.3 MB| ## References Trained on the SMM4H dataset, annotated manually. https://healthlanguageprocessing.org/smm4h-2022/ ## Benchmarking ```bash label precision recall f1-score support not_related 0.81 0.88 0.85 528 is_related 0.94 0.89 0.91 1019 accuracy - - 0.89 1547 macro-avg 0.87 0.89 0.88 1547 weighted-avg 0.89 0.89 0.89 1547 ``` --- layout: model title: Fast Neural Machine Translation Model from Tuvaluan to English author: John Snow Labs name: opus_mt_tvl_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, tvl, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.
- source languages: `tvl` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_tvl_en_xx_2.7.0_2.4_1609163187137.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_tvl_en_xx_2.7.0_2.4_1609163187137.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentences") marian = MarianTransformer.pretrained("opus_mt_tvl_en", "xx")\ .setInputCols(["sentences"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("Text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_tvl_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("Text to translate").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.tvl.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_tvl_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Persian BertForQuestionAnswering Cased model (from marzinouri101) author: John Snow Labs name: bert_qa_parsbert_finetuned_persianqa date: 2022-07-07 tags: [fa, open_source, bert, question_answering] task: Question Answering language: fa edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `parsbert-finetuned-persianQA` is a Persian model originally trained by `marzinouri101`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_parsbert_finetuned_persianqa_fa_4.0.0_3.0_1657190740642.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_parsbert_finetuned_persianqa_fa_4.0.0_3.0_1657190740642.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_parsbert_finetuned_persianqa","fa") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, spanClassifier])

data = spark.createDataFrame([["اسم من چیست؟", "نام من کلارا است و من در برکلی زندگی می کنم."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_parsbert_finetuned_persianqa","fa")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier))

val data = Seq(("اسم من چیست؟", "نام من کلارا است و من در برکلی زندگی می کنم.")).toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_parsbert_finetuned_persianqa| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|fa| |Size:|442.3 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/marzinouri101/parsbert-finetuned-persianQA --- layout: model title: Spanish Legal Roberta Embeddings author: John Snow Labs name: roberta_base_spanish_legal date: 2023-02-16 tags: [es, spanish, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: es edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-spanish-roberta-base` is a Spanish model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_spanish_legal_es_4.2.4_3.0_1676579126608.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_spanish_legal_es_4.2.4_3.0_1676579126608.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_spanish_legal", "es")\
    .setInputCols(["sentence"])\
    .setOutputCol("embeddings")
```

```scala
val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_spanish_legal", "es")
    .setInputCols("sentence")
    .setOutputCol("embeddings")
```
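RoBerta-style encoders emit one vector per token; if you need a single sentence-level vector downstream, a common approach is mean pooling. A minimal stdlib sketch with toy 3-dimensional vectors standing in for real 768-dimensional model output:

```python
def mean_pool(vectors):
    # Element-wise average of equal-length token embedding vectors,
    # yielding one sentence-level vector.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Toy 3-d token vectors in place of real 768-d model output.
tokens = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
print(mean_pool(tokens))  # [2.0, 2.0, 2.0]
```

Inside a Spark NLP pipeline, the `SentenceEmbeddings` annotator with an AVERAGE pooling strategy performs this same aggregation for you.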
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_spanish_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|es| |Size:|416.2 MB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-spanish-roberta-base --- layout: model title: Sentence Entity Resolver for CPT (sbiobert_base_cased_mli embeddings) author: John Snow Labs name: sbiobertresolve_cpt language: en nav_key: models repository: clinical/models date: 2020-11-27 task: Entity Resolution edition: Healthcare NLP 2.6.4 spark_version: 2.4 tags: [clinical,entity_resolution,en] supported: true annotator: SentenceEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model maps extracted medical entities to CPT codes using chunk embeddings. {:.h2_title} ## Predicted Entities CPT Codes and their normalized definition with ``sbiobert_base_cased_mli`` sentence embeddings. {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}{:target="_blank"} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cpt_en_2.6.4_2.4_1606235767322.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cpt_en_2.6.4_2.4_1606235767322.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings\ .pretrained("sbiobert_base_cased_mli","en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sbert_embeddings") cpt_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_cpt","en", "clinical/models") \ .setInputCols(["sbert_embeddings"]) \ .setOutputCol("resolution")\ .setDistanceFunction("EUCLIDEAN") nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, cpt_resolver]) data = spark.createDataFrame([["This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU ."]]).toDF("text") results = nlpPipeline.fit(data).transform(data) ``` ```scala ... 
val chunk2doc = new Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")

val sbert_embedder = BertSentenceEmbeddings
    .pretrained("sbiobert_base_cased_mli","en","clinical/models")
    .setInputCols(Array("ner_chunk_doc"))
    .setOutputCol("sbert_embeddings")

val cpt_resolver = SentenceEntityResolverModel
    .pretrained("sbiobertresolve_cpt","en", "clinical/models")
    .setInputCols(Array("ner_chunk", "sbert_embeddings"))
    .setOutputCol("resolution")
    .setDistanceFunction("EUCLIDEAN")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk2doc, sbert_embedder, cpt_resolver))

val data = Seq("This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret's Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .").toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.h2_title}
## Results

```bash
+--------------------+-----+---+---------+-----+----------+--------------------+--------------------+
|               chunk|begin|end|   entity| code|confidence|   all_k_resolutions|         all_k_codes|
+--------------------+-----+---+---------+-----+----------+--------------------+--------------------+
|        hypertension|   68| 79|  PROBLEM|49425|    0.0967|Insertion of peri...|49425:::36818:::3...|
|chronic renal ins...|   83|109|  PROBLEM|50070|    0.2569|Nephrolithotomy; ...|50070:::49425:::5...|
|                COPD|  113|116|  PROBLEM|49425|    0.0779|Insertion of peri...|49425:::31592:::4...|
|           gastritis|  120|128|  PROBLEM|43810|    0.5289|Gastroduodenostom...|43810:::43880:::4...|
|                 TIA|  136|138|  PROBLEM|25927|    0.2060|Transmetacarpal a...|25927:::25931:::6...|
|a non-ST elevatio...|  182|202|  PROBLEM|33300|    0.3046|Repair of cardiac...|33300:::33813:::3...|
|Guaiac positive s...|  208|229|  PROBLEM|47765|    0.0974|Anastomosis, of i...|47765:::49425:::1...|
|cardiac catheteri...|  295|317|     TEST|62225|    0.1996|Replacement or ir...|62225:::33722:::4...|
|                PTCA|  324|327|TREATMENT|60500|    0.1481|Parathyroidectomy...|60500:::43800:::2...|
|      mid LAD lesion|  332|345|  PROBLEM|33722|    0.3097|Closure of aortic...|33722:::33732:::3...|
+--------------------+-----+---+---------+-----+----------+--------------------+--------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---------------|---------------------|
| Name: | sbiobertresolve_cpt |
| Type: | SentenceEntityResolverModel |
| Compatibility: | Spark NLP 2.6.4+ |
| License: | Licensed |
| Edition: | Official |
| Input labels: | [ner_chunk, chunk_embeddings] |
| Output labels: | [resolution] |
| Language: | en |
| Dependencies: | sbiobert_base_cased_mli |

{:.h2_title}
## Data Source

Trained on the Current Procedural Terminology dataset with `sbiobert_base_cased_mli` sentence embeddings.

---
layout: model
title: Translate Tok Pisin to English Pipeline
author: John Snow Labs
name: translate_tpi_en
date: 2021-01-03
task: [Translation, Pipeline Public]
language: xx
edition: Spark NLP 2.7.0
spark_version: 2.4
tags: [open_source, seq2seq, translation, pipeline, tpi, en, xx]
supported: true
annotator: PipelineModel
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic contributors (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

- source languages: `tpi`
- target languages: `en`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_tpi_en_xx_2.7.0_2.4_1609689664783.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_tpi_en_xx_2.7.0_2.4_1609689664783.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_tpi_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_tpi_en", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.tpi.translate_to.en').predict(text, output_level='sentence') translate_df ```
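`annotate` returns a plain Python dict mapping each output column to a list of annotation strings. A minimal sketch of collecting the translated sentences from such a result; the dict below is a hypothetical stand-in for real pipeline output, not an actual translation produced by this model:

```python
# Hypothetical result shaped like PretrainedPipeline.annotate() output:
# a dict from output column name to a list of annotation strings.
result = {
    "document": ["Yu tok wanem?"],
    "sentence": ["Yu tok wanem?"],
    "translation": ["What did you say?"],
}

# Join the per-sentence translations into a single string.
translated = " ".join(result["translation"])
print(translated)  # What did you say?
```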
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_tpi_en| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English RobertaForQuestionAnswering Cased model (from eAsyle) author: John Snow Labs name: roberta_qa_testabsa3 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `testABSA3` is a English model originally trained by `eAsyle`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_testabsa3_en_4.3.0_3.0_1674224346874.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_testabsa3_en_4.3.0_3.0_1674224346874.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
Document_Assembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(["document_question", "document_context"])

Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_testabsa3","en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[Document_Assembler, Question_Answering])

data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)
```

```scala
val Document_Assembler = new MultiDocumentAssembler()
    .setInputCols(Array("question", "context"))
    .setOutputCols(Array("document_question", "document_context"))

val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_testabsa3","en")
    .setInputCols(Array("document_question", "document_context"))
    .setOutputCol("answer")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering))

val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context")

val result = pipeline.fit(data).transform(data)
```
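Extractive QA models such as this one predict a start and end position inside the context, and the answer is the corresponding slice of text. A toy stdlib sketch with hypothetical character offsets (a real model predicts positions internally; these numbers are for illustration only):

```python
context = "My name is Clara and I live in Berkeley."

# Hypothetical character offsets a span classifier might predict
# for the question "What's my name?".
start, end = 11, 16
answer = context[start:end]
print(answer)  # Clara
```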
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|roberta_qa_testabsa3|
|Compatibility:|Spark NLP 4.3.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|en|
|Size:|426.2 MB|
|Case sensitive:|true|
|Max sentence length:|256|

## References

- https://huggingface.co/eAsyle/testABSA3

---
layout: model
title: Polish BertForQuestionAnswering model (from henryk)
author: John Snow Labs
name: bert_qa_bert_base_multilingual_cased_finetuned_polish_squad1
date: 2022-06-02
tags: [pl, open_source, question_answering, bert]
task: Question Answering
language: pl
edition: Spark NLP 4.0.0
spark_version: 3.0
supported: true
annotator: BertForQuestionAnswering
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-multilingual-cased-finetuned-polish-squad1` is a Polish model originally trained by `henryk`.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_finetuned_polish_squad1_pl_4.0.0_3.0_1654180093762.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_bert_base_multilingual_cased_finetuned_polish_squad1_pl_4.0.0_3.0_1654180093762.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_base_multilingual_cased_finetuned_polish_squad1","pl") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = BertForQuestionAnswering .pretrained("bert_qa_bert_base_multilingual_cased_finetuned_polish_squad1","pl") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lenon born?", "John Lenon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("pl.answer_question.squad.bert.multilingual_base_cased").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bert_qa_bert_base_multilingual_cased_finetuned_polish_squad1|
|Compatibility:|Spark NLP 4.0.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document_question, document_context]|
|Output Labels:|[answer]|
|Language:|pl|
|Size:|665.6 MB|
|Case sensitive:|true|
|Max sentence length:|512|

## References

- https://huggingface.co/henryk/bert-base-multilingual-cased-finetuned-polish-squad1
- https://www.linkedin.com/in/henryk-borzymowski-0755a2167/
- https://rajpurkar.github.io/SQuAD-explorer/
- https://github.com/google-research/bert/blob/master/multilingual.md

---
layout: model
title: Financial Security ownership Item Binary Classifier
author: John Snow Labs
name: finclf_security_ownership_item
date: 2022-08-10
tags: [en, finance, classification, 10k, annual, reports, sec, filings, licensed]
task: Text Classification
language: en
nav_key: models
edition: Finance NLP 1.0.0
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `security_ownership` item type of 10K Annual Reports. To use this model, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline makes the model see only sentences, not the whole text, so it is better to skip it unless you want to do Binary Classification at sentence level.

If you have big financial documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Finance NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/1.Tokenization_Splitting.ipynb)), namely:

- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into consideration that this model's embeddings allow up to 512 tokens.
If you have more than that, consider splitting the text into smaller pieces (you can also check the tutorial linked above).

## Predicted Entities

`other`, `security_ownership`

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finclf_security_ownership_item_en_1.0.0_3.2_1660154472575.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finclf_security_ownership_item_en_1.0.0_3.2_1660154472575.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") useEmbeddings = nlp.UniversalSentenceEncoder.pretrained() \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("finclf_security_ownership_item", "en", "finance/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, useEmbeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results

```bash
+--------------------+
|result              |
+--------------------+
|[security_ownership]|
|[other]             |
|[other]             |
|[security_ownership]|
+--------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|finclf_security_ownership_item|
|Compatibility:|Finance NLP 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.6 MB|

## References

Weak labelling on documents from the Edgar database.

## Benchmarking

```bash
label               precision  recall  f1-score  support
other               0.90       0.84    0.87      31
security_ownership  0.78       0.86    0.82      21
accuracy            -          -       0.85      52
macro-avg           0.84       0.85    0.84      52
weighted-avg        0.85       0.85    0.85      52
```

---
layout: model
title: Word2Vec Embeddings in Mongolian (300d)
author: John Snow Labs
name: w2v_cc_300d
date: 2022-03-16
tags: [cc, embeddings, fastText, word2vec, mn, open_source]
task: Embeddings
language: mn
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Word Embeddings lookup annotator that maps tokens to vectors.

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_mn_3.4.1_3.0_1647447849598.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_mn_3.4.1_3.0_1647447849598.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","mn") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Би SPARK NLP-т дуртай"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","mn") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Би SPARK NLP-т дуртай").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("mn.embed.w2v_cc_300d").predict("""Би SPARK NLP-т дуртай""") ```
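Each token in the result carries a 300-dimensional vector, and vectors for related words point in similar directions, which is usually measured with cosine similarity. A stdlib sketch with toy 3-d vectors in place of real 300-d model output:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors:
    # dot product divided by the product of their norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d vectors standing in for the model's 300-d embeddings.
print(round(cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]), 3))  # identical vectors -> 1.0
print(round(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]), 3))  # orthogonal vectors -> 0.0
```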
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|mn| |Size:|352.4 MB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: Spanish Deberta Embeddings model (from plncmm) author: John Snow Labs name: deberta_embeddings_wl_base date: 2023-03-13 tags: [deberta, open_source, deberta_embeddings, debertav2formaskedlm, es, tensorflow] task: Embeddings language: es edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DeBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DebertaEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `mdeberta-wl-base-es` is a Spanish model originally trained by `plncmm`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/deberta_embeddings_wl_base_es_4.3.1_3.0_1678702937786.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/deberta_embeddings_wl_base_es_4.3.1_3.0_1678702937786.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_wl_base","es") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val embeddings = DeBertaEmbeddings.pretrained("deberta_embeddings_wl_base","es")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|deberta_embeddings_wl_base| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|es| |Size:|1.0 GB| |Case sensitive:|false| ## References https://huggingface.co/plncmm/mdeberta-wl-base-es --- layout: model title: Fast Neural Machine Translation Model from Azerbaijani to Spanish author: John Snow Labs name: opus_mt_az_es date: 2021-06-01 tags: [open_source, seq2seq, translation, az, es, xx, multilingual] task: Translation language: xx edition: Spark NLP 3.1.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). source languages: az target languages: es {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_az_es_xx_3.1.0_2.4_1622560735224.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_az_es_xx_3.1.0_2.4_1622560735224.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained("opus_mt_az_es", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")
```

```scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val marian = MarianTransformer.pretrained("opus_mt_az_es", "xx")
    .setInputCols("sentence")
    .setOutputCol("translation")
```

{:.nlu-block}
```python
import nlu
text = ["text to translate"]
translate_df = nlu.load('xx.Azerbaijani.translate_to.Spanish').predict(text, output_level='sentence')
translate_df
```
{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|opus_mt_az_es|
|Compatibility:|Spark NLP 3.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence]|
|Output Labels:|[translation]|
|Language:|xx|

## Data Source

[https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models)

---
layout: model
title: Lemmatizer (Spanish, SpacyLookup)
author: John Snow Labs
name: lemma_spacylookup
date: 2022-03-03
tags: [open_source, lemmatizer, es]
task: Lemmatization
language: es
edition: Spark NLP 3.4.1
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This Spanish Lemmatizer is a scalable, production-ready version of the rule-based Lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/).

{:.btn-box}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_es_3.4.1_3.0_1646316590883.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_es_3.4.1_3.0_1646316590883.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","es") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["No eres mejor que yo"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","es") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("No eres mejor que yo").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("es.lemma").predict("""No eres mejor que yo""") ```
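The lookup approach behind this model amounts to a dictionary from inflected forms to lemmas, keeping the surface form when no entry exists. A toy stdlib sketch; the entries below are a hypothetical, tiny stand-in for the real Spanish lookup table:

```python
# Hypothetical subset of a form -> lemma lookup table.
lookup = {"eres": "ser", "soy": "ser", "somos": "ser"}

def lemmatize(tokens):
    # Return the lemma when the lowercased token is in the table,
    # otherwise keep the surface form unchanged.
    return [lookup.get(t.lower(), t) for t in tokens]

print(lemmatize("No eres mejor que yo".split()))  # ['No', 'ser', 'mejor', 'que', 'yo']
```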
## Results ```bash +-------------------------+ |result                   | +-------------------------+ |[No, ser, mejor, que, yo]| +-------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma_spacylookup| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|es| |Size:|5.3 MB| --- layout: model title: BioBERT Sentence Embeddings (Pubmed PMC) author: John Snow Labs name: sent_biobert_pubmed_pmc_base_cased date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: BertSentenceEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains the pre-trained weights of BioBERT, a language representation model for the biomedical domain, designed especially for biomedical text mining tasks such as named entity recognition, relation extraction, and question answering. The details are described in the paper "[BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/sent_biobert_pubmed_pmc_base_cased_en_2.6.0_2.4_1598349155555.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/sent_biobert_pubmed_pmc_base_cased_en_2.6.0_2.4_1598349155555.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_pmc_base_cased", "en") \ .setInputCols("sentence") \ .setOutputCol("sentence_embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([["I hate cancer"], ["Antibiotics aren't painkiller"]], ["text"])) ``` ```scala ... val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_pmc_base_cased", "en") .setInputCols("sentence") .setOutputCol("sentence_embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings)) val data = Seq("I hate cancer", "Antibiotics aren't painkiller").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer", "Antibiotics aren't painkiller"] embeddings_df = nlu.load('en.embed_sentence.biobert.pubmed_pmc_base_cased').predict(text, output_level='sentence') embeddings_df ```
{:.h2_title} ## Results ```bash sentence en_embed_sentence_biobert_pubmed_pmc_base_cased_embeddings I hate cancer [0.2354733943939209, 0.30127033591270447, -0.1... Antibiotics aren't painkiller [0.2837969958782196, 0.03842488303780556, 0.04... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sent_biobert_pubmed_pmc_base_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[sentence_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/dmis-lab/biobert](https://github.com/dmis-lab/biobert) --- layout: model title: Moldavian, Moldovan, Romanian asr_wav2vec2_large_xlsr_53_romanian_by_gmihaila TFWav2Vec2ForCTC from gmihaila author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_romanian_by_gmihaila date: 2022-09-25 tags: [wav2vec2, ro, audio, open_source, asr] task: Automatic Speech Recognition language: ro edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.`asr_wav2vec2_large_xlsr_53_romanian_by_gmihaila` is a Moldavian, Moldovan, Romanian model originally trained by gmihaila. 
NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_romanian_by_gmihaila_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_romanian_by_gmihaila_ro_4.2.0_3.0_1664099376954.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_romanian_by_gmihaila_ro_4.2.0_3.0_1664099376954.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_romanian_by_gmihaila", "ro")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_romanian_by_gmihaila", "ro") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
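For intuition about what the `Wav2Vec2ForCTC` stage above does after the acoustic model runs, CTC decoding collapses per-frame label predictions into text by merging repeated labels and dropping the blank symbol. A rough, self-contained sketch of greedy CTC decoding (the frame labels and blank character here are hypothetical; real models operate on label indices):

```python
from itertools import groupby

BLANK = "_"  # CTC blank symbol (placeholder; real models use a reserved index)

def ctc_greedy_decode(frame_labels):
    """Collapse runs of repeated labels, then drop blanks: 'ccc_aa_t_' -> 'cat'."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in collapsed if label != BLANK)

print(ctc_greedy_decode(list("ccc_aa_t_")))  # -> 'cat'
```

The blank symbol is what lets CTC emit genuinely doubled letters: `hh_ee_ll_ll_oo` decodes to `hello`, whereas without blanks the two `l` runs would merge.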
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_romanian_by_gmihaila| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ro| |Size:|1.2 GB| --- layout: model title: ICD10CM Injuries Entity Resolver author: John Snow Labs name: chunkresolve_icd10cm_injuries_clinical date: 2021-04-02 tags: [entity_resolution, clinical, licensed, en] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.0.0 spark_version: 3.0 deprecated: true annotator: ChunkEntityResolverModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Entity resolution model based on KNN over word embeddings, using Word Mover's Distance. ## Predicted Entities ICD10-CM codes and their normalized definitions, with `clinical_embeddings` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_ICD10_CM/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_injuries_clinical_en_3.0.0_3.0_1617355437876.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/chunkresolve_icd10cm_injuries_clinical_en_3.0.0_3.0_1617355437876.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... injury_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_injuries_clinical","en","clinical/models")\ .setInputCols("token","chunk_embeddings")\ .setOutputCol("entity") pipeline_puerile = Pipeline(stages = [documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, injury_resolver]) data = spark.createDataFrame([["""The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion."""]]).toDF("text") model = pipeline_puerile.fit(data) results = model.transform(data) ``` ```scala ... val injury_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_injuries_clinical","en","clinical/models") .setInputCols(Array("token","chunk_embeddings")) .setOutputCol("resolution") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, word_embeddings, clinical_ner, ner_converter, chunk_embeddings, injury_resolver)) val data = Seq("The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot.
She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion.").toDF("text") val result = pipeline.fit(data).transform(data) ```
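As a mental model for what the resolver stage does, it performs a nearest-neighbour search: the embedding of each extracted chunk is compared against the embeddings of the ICD-10-CM descriptions, and the closest codes are returned. A toy plain-Python sketch using cosine similarity over made-up 3-dimensional vectors (the actual model uses clinical word embeddings and Word Movers Distance, not cosine, and the `code_vectors` table below is purely illustrative):

```python
import math

# Illustrative code->vector table; the real resolver embeds the full
# ICD-10-CM description text with clinical word embeddings.
code_vectors = {
    "S0031XA": [0.9, 0.1, 0.0],   # Abrasion of nose, initial encounter
    "S060X0A": [0.0, 0.8, 0.6],   # Concussion without loss of consciousness
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def resolve(chunk_vector, k=1):
    """Return the k codes whose description vectors are nearest the chunk vector."""
    ranked = sorted(code_vectors,
                    key=lambda code: cosine(chunk_vector, code_vectors[code]),
                    reverse=True)
    return ranked[:k]

print(resolve([0.85, 0.2, 0.1]))  # nearest to the 'Abrasion of nose' vector
```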
## Results ```bash chunk entity icd10_inj_description icd10_inj_code 0 a cold, cough PROBLEM Insect bite (nonvenomous) of throat, sequela S1016XS 1 runny nose PROBLEM Abrasion of nose, initial encounter S0031XA 2 fever PROBLEM Blister (nonthermal) of abdominal wall, initia... S30821A 3 difficulty breathing PROBLEM Concussion without loss of consciousness, init... S060X0A 4 her cough PROBLEM Contusion of throat, subsequent encounter S100XXD 5 physical exam TEST Concussion without loss of consciousness, init... S060X0A 6 fairly congested PROBLEM Contusion and laceration of right cerebrum wit... S06310A 7 Amoxil TREATMENT Insect bite (nonvenomous), unspecified ankle, ... S90569S 8 Aldex TREATMENT Insect bite (nonvenomous) of penis, initial en... S30862A 9 difficulty breathing PROBLEM Concussion without loss of consciousness, init... S060X0A 10 more congested PROBLEM Complete traumatic amputation of two or more r... S98211S 11 trouble sleeping PROBLEM Concussion without loss of consciousness, sequela S060X0S 12 congestion PROBLEM Unspecified injury of portal vein, initial enc... S35319A ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|chunkresolve_icd10cm_injuries_clinical| |Compatibility:|Healthcare NLP 3.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[token, chunk_embeddings]| |Output Labels:|[icd10cm]| |Language:|en| --- layout: model title: Stopwords Remover for Urdu language (508 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, ur, open_source] task: Stop Words Removal language: ur edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). 
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_ur_3.4.1_3.0_1646672328433.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_ur_3.4.1_3.0_1646672328433.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","ur") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["تم مجھ سے بہتر نہیں ہو"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","ur") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("تم مجھ سے بہتر نہیں ہو").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ur.stopwords").predict("""تم مجھ سے بہتر نہیں ہو""") ```
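Conceptually, `StopWordsCleaner` filters the token stream against a fixed word list. A plain-Python sketch of the same idea (the one-entry `stopwords` set below is illustrative; the pretrained model ships the full 508-entry Urdu list):

```python
# Illustrative subset; the pretrained model carries 508 Urdu stopwords.
stopwords = {"ہو"}

def clean_tokens(tokens):
    # Keep every token that is not in the stopword list.
    return [tok for tok in tokens if tok not in stopwords]

print(clean_tokens("تم مجھ سے بہتر نہیں ہو".split()))
```

With this toy list, only the final token is removed, which mirrors the Results section below.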
## Results ```bash +-------------------------+ |result | +-------------------------+ |[تم, مجھ, سے, بہتر, نہیں]| +-------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|ur| |Size:|3.3 KB| --- layout: model title: Fast Neural Machine Translation Model from Venda to English author: John Snow Labs name: opus_mt_ve_en date: 2020-12-28 task: Translation language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, ve, en, xx] supported: true annotator: MarianTransformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list). - source languages: `ve` - target languages: `en` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/opus_mt_ve_en_xx_2.7.0_2.4_1609169172425.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/opus_mt_ve_en_xx_2.7.0_2.4_1609169172425.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentencerDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") marian = MarianTransformer.pretrained("opus_mt_ve_en", "xx")\ .setInputCols(["sentence"])\ .setOutputCol("translation") marian_pipeline = Pipeline(stages=[documentAssembler, sentencerDL, marian]) light_pipeline = LightPipeline(marian_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))) result = light_pipeline.fullAnnotate("text to translate") ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") .setInputCols("document") .setOutputCol("sentence") val marian = MarianTransformer.pretrained("opus_mt_ve_en", "xx") .setInputCols("sentence") .setOutputCol("translation") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, marian)) val data = Seq("text to translate").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["text to translate"] opus_df = nlu.load('xx.ve.marian.translate_to.en').predict(text, output_level='sentence') opus_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|opus_mt_ve_en| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Input Labels:|[sentence]| |Output Labels:|[translation]| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: English asr_wav2vec2_base_timit_demo_colab52 TFWav2Vec2ForCTC from hassnain author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab52 date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab52` is an English model originally trained by hassnain. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab52_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab52_en_4.2.0_3.0_1664022006865.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab52_en_4.2.0_3.0_1664022006865.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab52', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab52", lang = "en") val annotations = pipeline.transform(audioDF) ```
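The pipeline above expects `audioDF` to carry the raw waveform as an array of floats (typically 16 kHz mono). One way to build such a column is to decode a WAV file with Python's standard library before handing the values to Spark; a sketch (the final two commented lines assume an active Spark session and the `audio_content` column name used elsewhere in these docs):

```python
import struct
import wave

def wav_to_floats(path):
    """Decode a 16-bit PCM mono WAV file into floats in [-1.0, 1.0]."""
    with wave.open(path, "rb") as wav:
        # This sketch only handles 16-bit mono; resampling is out of scope.
        assert wav.getsampwidth() == 2 and wav.getnchannels() == 1
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# floats = wav_to_floats("speech.wav")                              # hypothetical file
# audioDF = spark.createDataFrame([[floats]], ["audio_content"])    # assumes a Spark session
```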
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab52| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|355.0 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from mdineshk) author: John Snow Labs name: distilbert_qa_base_uncased_meded date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-meded` is an English model originally trained by `mdineshk`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_meded_en_4.3.0_3.0_1672773966077.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_base_uncased_meded_en_4.3.0_3.0_1672773966077.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_meded","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_base_uncased_meded","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
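Internally, extractive QA models of this kind score every context token as a potential answer start and answer end, then return the highest-scoring valid span (start before end, within a length limit). A simplified plain-Python sketch with made-up logits for the example sentence above (the scores are invented for illustration, not real model outputs):

```python
def best_span(start_logits, end_logits, max_len=15):
    """Pick (i, j) maximizing start_logits[i] + end_logits[j], with i <= j < i + max_len."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            if s + end_logits[j] > best_score:
                best_score, best = s + end_logits[j], (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]  # made-up start scores
end   = [0.1, 0.1, 0.1, 4.8, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]  # made-up end scores
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # -> 'Clara'
```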
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_base_uncased_meded| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/mdineshk/distilbert-base-uncased-meded --- layout: model title: English asr_wav2vec2_large_xlsr_53_english_by_jonatasgrosman TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: asr_wav2vec2_large_xlsr_53_english_by_jonatasgrosman date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_large_xlsr_53_english_by_jonatasgrosman` is an English model originally trained by jonatasgrosman. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_wav2vec2_large_xlsr_53_english_by_jonatasgrosman_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_english_by_jonatasgrosman_en_4.2.0_3.0_1664036112317.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_wav2vec2_large_xlsr_53_english_by_jonatasgrosman_en_4.2.0_3.0_1664036112317.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_wav2vec2_large_xlsr_53_english_by_jonatasgrosman", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_wav2vec2_large_xlsr_53_english_by_jonatasgrosman", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_wav2vec2_large_xlsr_53_english_by_jonatasgrosman| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|1.2 GB| --- layout: model title: Named Entity Recognition in Romanian Official Documents (Small) author: John Snow Labs name: legner_romanian_official_sm date: 2022-11-10 tags: [ro, ner, legal, licensed] task: Named Entity Recognition language: ro edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true annotator: LegalNerModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a small version of a NER model that extracts only PER (Person), LOC (Location), ORG (Organization), and DATE entities from Romanian official documents. ## Predicted Entities `PER`, `LOC`, `ORG`, `DATE` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/legal/LEGNER_ROMANIAN_OFFICIAL/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legner_romanian_official_sm_ro_1.0.0_3.0_1668082337617.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legner_romanian_official_sm_ro_1.0.0_3.0_1668082337617.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_base_cased", "ro")\ .setInputCols("sentence", "token")\ .setOutputCol("embeddings")\ .setMaxSentenceLength(512)\ .setCaseSensitive(True) ner_model = legal.NerModel.pretrained("legner_romanian_official_sm", "ro", "legal/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = nlp.Pipeline(stages=[ document_assembler, sentence_detector, tokenizer, embeddings, ner_model, ner_converter ]) model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) data = spark.createDataFrame([["""Prezentul ordin se publică în Monitorul Oficial al României, Partea I. Ministrul sănătății, Sorina Pintea București, 28 februarie 2019."""]]).toDF("text") result = model.transform(data) ```
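The `NerConverter` stage at the end of the pipeline merges the per-token IOB tags emitted by the NER model into entity chunks. A plain-Python sketch of that merging logic (the token and tag lists below are hypothetical, chosen to mirror part of the example sentence's expected output):

```python
def iob_to_chunks(tokens, tags):
    """Merge B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                      # close any open chunk
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)              # continue the open chunk
        else:                                # "O" tag ends the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:                              # flush a chunk that ends the sentence
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["Ministrul", "sănătății", ",", "Sorina", "Pintea"]
tags = ["O", "O", "O", "B-PER", "I-PER"]
print(iob_to_chunks(tokens, tags))  # -> [('Sorina Pintea', 'PER')]
```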
## Results ```bash +-----------------------------+-----+ |chunk |label| +-----------------------------+-----+ |Monitorul Oficial al României|ORG | |Sorina Pintea |PER | |București |LOC | |28 februarie 2019 |DATE | +-----------------------------+-----+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legner_romanian_official_sm| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence, token, embeddings]| |Output Labels:|[ner]| |Language:|ro| |Size:|16.4 MB| ## References Dataset is available [here](https://zenodo.org/record/7025333#.Y2zsquxBx83). ## Benchmarking ```bash label precision recall f1-score support DATE 0.87 0.96 0.91 397 LOC 0.87 0.78 0.83 190 ORG 0.90 0.93 0.91 559 PER 0.98 0.93 0.95 108 micro-avg 0.89 0.92 0.90 1254 macro-avg 0.91 0.90 0.90 1254 weighted-avg 0.89 0.92 0.90 1254 ``` --- layout: model title: Word2Vec Embeddings in Esperanto (300d) author: John Snow Labs name: w2v_cc_300d date: 2022-03-15 tags: [cc, embeddings, fastText, word2vec, eo, open_source] task: Embeddings language: eo edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Word Embeddings lookup annotator that maps tokens to vectors. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_eo_3.4.1_3.0_1647368870513.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/w2v_cc_300d_eo_3.4.1_3.0_1647368870513.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","eo") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","eo") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("I love Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("eo.embed.w2v_cc_300d").predict("""I love Spark NLP""") ```
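A `WordEmbeddingsModel` like this one is effectively a lookup table from tokens to fixed vectors (300-dimensional here), with out-of-vocabulary tokens typically mapped to a zero vector. A toy sketch with 3-dimensional illustrative vectors (the `vectors` table is invented, not the model's data):

```python
DIM = 3  # the real model uses 300 dimensions
vectors = {"love": [0.2, 0.9, 0.1], "spark": [0.7, 0.1, 0.5]}  # illustrative entries

def embed(tokens):
    """Look up each token, falling back to a zero vector for OOV tokens."""
    return [vectors.get(tok.lower(), [0.0] * DIM) for tok in tokens]

emb = embed("I love Spark NLP".split())
print(len(emb), len(emb[0]))  # 4 tokens, DIM values each
```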
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|w2v_cc_300d| |Type:|embeddings| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[embeddings]| |Language:|eo| |Size:|1.2 GB| |Case sensitive:|false| |Dimension:|300| --- layout: model title: SDOH Insurance Coverage For Classification author: John Snow Labs name: genericclassifier_sdoh_insurance_coverage_sbiobert_cased_mli date: 2023-04-28 tags: [en, licensed, sdoh, social_determinants, insurance, generic_classifier, biobert] task: Text Classification language: en edition: Healthcare NLP 4.4.0 spark_version: 3.0 supported: true annotator: GenericClassifierModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Generic Classifier model is intended for detecting insurance coverage. In this classifier, we know/assume that the patient **has insurance**. `Good`: The insurance covers all or most of the medications. `Poor`: The insurance doesn't cover all medications, specialist visits, or prescription medications. That may affect the patient's treatment. `Unknown`: Insurance coverage is not mentioned in the clinical notes or is not known. ## Predicted Entities `Good`, `Poor`, `Unknown` {:.btn-box} [Live Demo](https://nlp.johnsnowlabs.com/social_determinant){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_insurance_coverage_sbiobert_cased_mli_en_4.4.0_3.0_1682710286670.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/genericclassifier_sdoh_insurance_coverage_sbiobert_cased_mli_en_4.4.0_3.0_1682710286670.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\ .setInputCols(["document"])\ .setOutputCol("sentence_embeddings") features_asm = FeaturesAssembler()\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("features") generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_insurance_coverage_sbiobert_cased_mli", 'en', 'clinical/models')\ .setInputCols(["features"])\ .setOutputCol("prediction") pipeline = Pipeline(stages=[ document_assembler, sentence_embeddings, features_asm, generic_classifier ]) text_list = ["The patient's Medicaid insurance is limited with some medicaitons.", "She is under good coverage Medicare insurance", "The patient has poor coverage of Private insurance", """Medical File for John Smith, Male, Age 42 Chief Complaint: Patient complains of nausea, vomiting, and shortness of breath. History of Present Illness: The patient has a history of hypertension and diabetes, which are both poorly controlled. The patient has been feeling unwell for the past week, with symptoms including nausea, vomiting, and shortness of breath. Upon examination, the patient was found to have a high serum creatinine level of 5.8 mg/dL, indicating renal failure. Past Medical History: The patient has a history of hypertension and diabetes, which have been poorly controlled due to poor medication adherence. The patient also has a history of smoking, which has been a contributing factor to the development of renal failure. Medications: The patient is currently taking Metformin and Lisinopril for the management of diabetes and hypertension, respectively. However, due to poor Medicaid coverage, the patient is unable to afford some of the medications prescribed by his physician. 
Insurance Status: The patient has Medicaid insurance, which provides poor coverage for some of the medications needed to manage his medical conditions, including those related to his renal failure. Physical Examination: Upon physical examination, the patient appears pale and lethargic. Blood pressure is 160/100 mmHg, heart rate is 90 beats per minute, and respiratory rate is 20 breaths per minute. There is diffuse abdominal tenderness on palpation, and lung auscultation reveals bilateral rales. Diagnosis: The patient is diagnosed with acute renal failure, likely due to uncontrolled hypertension and poorly managed diabetes. Treatment: The patient is started on intravenous fluids and insulin to manage his blood sugar levels. Due to the patient's poor Medicaid coverage, the physician works with the patient to identify alternative medications that are more affordable and will still provide effective management of his medical conditions. Follow-Up: The patient is advised to follow up with his primary care physician for ongoing management of his renal failure and other medical conditions. The patient is also referred to a nephrologist for further evaluation and management of his renal failure. """, """Certainly, here is an example case study for a patient with private insurance: Case Study for Emily Chen, Female, Age 38 Chief Complaint: Patient reports chronic joint pain and stiffness. History of Present Illness: The patient has been experiencing chronic joint pain and stiffness, particularly in the hands, knees, and ankles. The pain is worse in the morning and improves throughout the day. The patient has also noticed some swelling and redness in the affected joints. Past Medical History: The patient has a history of osteoarthritis, which has been gradually worsening over the past several years. The patient has tried over-the-counter pain relievers and joint supplements, but has not found significant relief. 
Medications: The patient is currently taking over-the-counter pain relievers and joint supplements for the management of her osteoarthritis. Insurance Status: The patient has private insurance, which provides comprehensive coverage for her medical care, including specialist visits and prescription medications. Physical Examination: Upon physical examination, the patient has tenderness and swelling in multiple joints, particularly the hands, knees, and ankles. Range of motion is limited due to pain and stiffness. Diagnosis: The patient is diagnosed with osteoarthritis, a chronic degenerative joint disease that causes pain, swelling, and stiffness in the affected joints. Treatment: The patient is prescribed a nonsteroidal anti-inflammatory drug (NSAID) to manage pain and inflammation. The physician also recommends physical therapy to improve range of motion and strengthen the affected joints. The patient is advised to continue taking joint supplements for ongoing joint health. Follow-Up: The patient is advised to follow up with the physician in 4-6 weeks to monitor response to treatment and make any necessary adjustments. 
The patient is also referred to a rheumatologist for further evaluation and management of her osteoarthritis."""] df = spark.createDataFrame(text_list, StringType()).toDF("text") result = pipeline.fit(df).transform(df) result.select("text", "prediction.result").show(truncate=100) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols("document") .setOutputCol("sentence_embeddings") val features_asm = new FeaturesAssembler() .setInputCols("sentence_embeddings") .setOutputCol("features") val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_insurance_coverage_sbiobert_cased_mli", "en", "clinical/models") .setInputCols("features") .setOutputCol("prediction") val pipeline = new Pipeline().setStages(Array( document_assembler, sentence_embeddings, features_asm, generic_classifier)) val data = Seq(Array("The patient's Medicaid insurance is limited with some medicaitons.", "She is under good coverage Medicare insurance", "The patient has poor coverage of Private insurance", """Medical File for John Smith, Male, Age 42 Chief Complaint: Patient complains of nausea, vomiting, and shortness of breath. History of Present Illness: The patient has a history of hypertension and diabetes, which are both poorly controlled. The patient has been feeling unwell for the past week, with symptoms including nausea, vomiting, and shortness of breath. Upon examination, the patient was found to have a high serum creatinine level of 5.8 mg/dL, indicating renal failure. Past Medical History: The patient has a history of hypertension and diabetes, which have been poorly controlled due to poor medication adherence. The patient also has a history of smoking, which has been a contributing factor to the development of renal failure. 
Medications: The patient is currently taking Metformin and Lisinopril for the management of diabetes and hypertension, respectively. However, due to poor Medicaid coverage, the patient is unable to afford some of the medications prescribed by his physician. Insurance Status: The patient has Medicaid insurance, which provides poor coverage for some of the medications needed to manage his medical conditions, including those related to his renal failure. Physical Examination: Upon physical examination, the patient appears pale and lethargic. Blood pressure is 160/100 mmHg, heart rate is 90 beats per minute, and respiratory rate is 20 breaths per minute. There is diffuse abdominal tenderness on palpation, and lung auscultation reveals bilateral rales. Diagnosis: The patient is diagnosed with acute renal failure, likely due to uncontrolled hypertension and poorly managed diabetes. Treatment: The patient is started on intravenous fluids and insulin to manage his blood sugar levels. Due to the patient's poor Medicaid coverage, the physician works with the patient to identify alternative medications that are more affordable and will still provide effective management of his medical conditions. Follow-Up: The patient is advised to follow up with his primary care physician for ongoing management of his renal failure and other medical conditions. The patient is also referred to a nephrologist for further evaluation and management of his renal failure. """, """Certainly, here is an example case study for a patient with private insurance: Case Study for Emily Chen, Female, Age 38 Chief Complaint: Patient reports chronic joint pain and stiffness. History of Present Illness: The patient has been experiencing chronic joint pain and stiffness, particularly in the hands, knees, and ankles. The pain is worse in the morning and improves throughout the day. The patient has also noticed some swelling and redness in the affected joints. 
Past Medical History: The patient has a history of osteoarthritis, which has been gradually worsening over the past several years. The patient has tried over-the-counter pain relievers and joint supplements, but has not found significant relief. Medications: The patient is currently taking over-the-counter pain relievers and joint supplements for the management of her osteoarthritis. Insurance Status: The patient has private insurance, which provides comprehensive coverage for her medical care, including specialist visits and prescription medications. Physical Examination: Upon physical examination, the patient has tenderness and swelling in multiple joints, particularly the hands, knees, and ankles. Range of motion is limited due to pain and stiffness. Diagnosis: The patient is diagnosed with osteoarthritis, a chronic degenerative joint disease that causes pain, swelling, and stiffness in the affected joints. Treatment: The patient is prescribed a nonsteroidal anti-inflammatory drug (NSAID) to manage pain and inflammation. The physician also recommends physical therapy to improve range of motion and strengthen the affected joints. The patient is advised to continue taking joint supplements for ongoing joint health. Follow-Up: The patient is advised to follow up with the physician in 4-6 weeks to monitor response to treatment and make any necessary adjustments. The patient is also referred to a rheumatologist for further evaluation and management of her osteoarthritis.""")).toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
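For intuition about the three labels, the short example sentences above can be tagged with a toy keyword heuristic. This is purely illustrative and is not how the model works (the real classifier uses BioBERT sentence embeddings); the keywords are assumptions chosen to match the example sentences:

```python
# Toy illustration of the Good / Poor / Unknown scheme described above.
# NOT the model's logic -- just a keyword sketch to make the label semantics concrete.
def toy_insurance_label(text: str) -> str:
    lowered = text.lower()
    # Signs the insurance does not cover everything the patient needs.
    if "poor" in lowered or "limited" in lowered or "unable to afford" in lowered:
        return "Poor"
    # Signs the insurance covers all or most care.
    if "good coverage" in lowered or "comprehensive coverage" in lowered:
        return "Good"
    # Coverage quality not mentioned.
    return "Unknown"

print(toy_insurance_label("She is under good coverage Medicare insurance"))  # Good
```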
## Results ```bash +----------------------------------------------------------------------------------------------------+------+ | text|result| +----------------------------------------------------------------------------------------------------+------+ | The patient's Medicaid insurance is limited with some medicaitons.|[Poor]| | She is under good coverage Medicare insurance|[Good]| | The patient has poor coverage of Private insurance|[Poor]| |Medical File for John Smith, Male, Age 42\n\nChief Complaint: Patient complains of nausea, vomiti...|[Poor]| |Certainly, here is an example case study for a patient with private insurance:\n\nCase Study for ...|[Good]| +----------------------------------------------------------------------------------------------------+------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|genericclassifier_sdoh_insurance_coverage_sbiobert_cased_mli| |Compatibility:|Healthcare NLP 4.4.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[features]| |Output Labels:|[prediction]| |Language:|en| |Size:|3.4 MB| |Dependencies:|sbiobert_base_cased_mli| ## References Internal SDOH project ## Benchmarking ```bash label precision recall f1-score support Good 0.81 0.86 0.84 74 Poor 0.94 0.84 0.89 70 Unknown 0.67 0.71 0.69 31 accuracy - - 0.83 175 macro-avg 0.80 0.81 0.80 175 weighted-avg 0.84 0.83 0.83 175 ``` --- layout: model title: English BertForQuestionAnswering Base Cased model (from anas-awadalla) author: John Snow Labs name: bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_2 date: 2022-07-07 tags: [en, open_source, bert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and 
production-readiness using Spark NLP. `spanbert-base-cased-few-shot-k-16-finetuned-squad-seed-2` is an English model originally trained by `anas-awadalla`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_2_en_4.0.0_3.0_1657192046030.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_2_en_4.0.0_3.0_1657192046030.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_2","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
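Extractive QA models such as this one select an answer span from the context rather than generating free text; the `answer` annotation carries the span text with character offsets. A toy sketch (not the model) locating the expected answer for the example above, assuming inclusive end offsets as Spark NLP annotations generally use:

```python
# Illustrative only: extractive QA returns a span of the context, not free text.
# Here we locate the expected answer "Clara" by string search to show the
# (begin, end) character-offset convention.
context = "My name is Clara and I live in Berkeley."
answer = "Clara"
begin = context.find(answer)
end = begin + len(answer) - 1  # inclusive end offset (an assumption for this sketch)
print(begin, end, context[begin:begin + len(answer)])
```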
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_2| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|375.9 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/anas-awadalla/spanbert-base-cased-few-shot-k-16-finetuned-squad-seed-2 --- layout: model title: Smaller BERT Embeddings (L-4_H-512_A-8) author: John Snow Labs name: small_bert_L4_512 date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [open_source, embeddings, en] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is one of the smaller BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962). The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/small_bert_L4_512_en_2.6.0_2.4_1598344591466.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/small_bert_L4_512_en_2.6.0_2.4_1598344591466.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("small_bert_L4_512", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I love NLP']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("small_bert_L4_512", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I love NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I love NLP"] embeddings_df = nlu.load('en.embed.bert.small_L4_512').predict(text, output_level='token') embeddings_df ```
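Embeddings like the 512-dimensional vectors this model produces are typically compared with cosine similarity. A minimal, dependency-free sketch of the metric (the short vectors below are made up for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Identical directions score 1.0; orthogonal directions score 0.0.
print(cosine([1.0, 0.0], [1.0, 0.0]))
print(cosine([1.0, 0.0], [0.0, 1.0]))
```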
{:.h2_title} ## Results ```bash token en_embed_bert_small_L4_512_embeddings I [0.5992571711540222, 0.4491763710975647, 0.515... love [0.9561082124710083, 0.11446993052959442, 0.19... NLP [-1.0831373929977417, 0.9501134753227234, -0.4... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|small_bert_L4_512| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|512| |Case sensitive:|false| {:.h2_title} ## Data Source The model is imported from https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1 --- layout: model title: Abkhazian asr_xls_r_ab_test_by_FitoDS TFWav2Vec2ForCTC from FitoDS author: John Snow Labs name: asr_xls_r_ab_test_by_FitoDS date: 2022-09-24 tags: [wav2vec2, ab, audio, open_source, asr] task: Automatic Speech Recognition language: ab edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_xls_r_ab_test_by_FitoDS` is an Abkhazian model originally trained by FitoDS. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_xls_r_ab_test_by_FitoDS_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_xls_r_ab_test_by_FitoDS_ab_4.2.0_3.0_1664021158373.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_xls_r_ab_test_by_FitoDS_ab_4.2.0_3.0_1664021158373.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_xls_r_ab_test_by_FitoDS", "ab")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_xls_r_ab_test_by_FitoDS", "ab") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
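The `audio_content` column fed to `AudioAssembler` holds the raw waveform as an array of floats (Wav2Vec2 models are commonly trained on 16 kHz mono audio). A standard-library sketch of decoding WAV bytes into normalized floats, assuming mono 16-bit PCM input:

```python
import io
import struct
import wave

def wav_to_floats(wav_bytes: bytes) -> list:
    """Decode mono 16-bit PCM WAV bytes to floats in [-1.0, 1.0)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        assert wf.getnchannels() == 1 and wf.getsampwidth() == 2, "expects mono 16-bit PCM"
        frames = wf.readframes(wf.getnframes())
    # "<h" = little-endian signed 16-bit samples; normalize by 2**15.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]
```

The resulting list of floats is what you would place in the `audio_content` column of `audioDf` before fitting the pipeline above.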
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_xls_r_ab_test_by_FitoDS| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|ab| |Size:|442.4 KB| --- layout: model title: Translate English to Basque Pipeline author: John Snow Labs name: translate_en_eu date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, eu, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and, in the past, the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations, and research projects. Note that this is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended. - source languages: `en` - target languages: `eu` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_eu_xx_2.7.0_2.4_1609691166839.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_eu_xx_2.7.0_2.4_1609691166839.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_eu", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_eu", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.eu').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_eu| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Mapping National Drug Codes (NDC) Codes with Corresponding Drug Brand Names author: John Snow Labs name: ndc_drug_brandname_mapper date: 2023-02-22 tags: [chunk_mapping, ndc, drug_brand_name, clinical, en, licensed] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 4.3.0 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps National Drug Codes (NDC) codes with their corresponding drug brand names. ## Predicted Entities `drug_brand_name` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ndc_drug_brandname_mapper_en_4.3.0_3.0_1677102197072.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ndc_drug_brandname_mapper_en_4.3.0_3.0_1677102197072.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") mapper = DocMapperModel.pretrained("ndc_drug_brandname_mapper", "en", "clinical/models")\ .setInputCols("document")\ .setOutputCol("mappings")\ .setRels(["drug_brand_name"]) pipeline = Pipeline( stages = [ documentAssembler, mapper ]) model = pipeline.fit(spark.createDataFrame([['']]).toDF('text')) lp = LightPipeline(model) result = lp.fullAnnotate(["0009-4992", "57894-150"]) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val mapper = DocMapperModel.pretrained("ndc_drug_brandname_mapper", "en", "clinical/models") .setInputCols("document") .setOutputCol("mappings") .setRels(Array("drug_brand_name")) val pipeline = new Pipeline().setStages(Array( documentAssembler, mapper )) val data = Seq("0009-4992", "57894-150").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
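Conceptually, the mapper is a lookup from an NDC code to a brand name. A toy stand-in with just the two codes from the example (the real model ships its own mapping data covering many more codes):

```python
# Toy stand-in for the chunk mapper: a plain dict lookup.
# The two entries mirror the example results; the real model covers far more codes.
NDC_TO_BRAND = {
    "0009-4992": "ZYVOX",
    "57894-150": "ZYTIGA",
}

def map_ndc(code: str) -> str:
    # Unknown codes fall back to a sentinel, as chunk mappers typically do.
    return NDC_TO_BRAND.get(code, "NONE")

print(map_ndc("0009-4992"))  # ZYVOX
```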
## Results ```bash | | ndc_code | drug_brand_name | |---:|:-----------|:------------------| | 0 | 0009-4992 | ZYVOX | | 1 | 57894-150 | ZYTIGA | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ndc_drug_brandname_mapper| |Compatibility:|Healthcare NLP 4.3.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[chunk]| |Output Labels:|[brandname]| |Language:|en| |Size:|917.7 KB| --- layout: model title: Stopwords Remover for Bengali language (458 entries) author: John Snow Labs name: stopwords_iso date: 2022-03-07 tags: [stopwords, bn, open_source] task: Stop Words Removal language: bn edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This is a scalable, production-ready Stopwords Remover model trained using the corpus available at [stopwords-iso](https://github.com/stopwords-iso/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_iso_bn_3.4.1_3.0_1646673064513.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_iso_bn_3.4.1_3.0_1646673064513.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") stop_words = StopWordsCleaner.pretrained("stopwords_iso","bn") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") pipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words]) example = spark.createDataFrame([["আপনি আমার চেয়ে ভাল না"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val stop_words = StopWordsCleaner.pretrained("stopwords_iso","bn") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, stop_words)) val data = Seq("আপনি আমার চেয়ে ভাল না").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("bn.stopwords").predict("""আপনি আমার চেয়ে ভাল না""") ```
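Stopword cleaning itself is a set-membership filter over tokens. A minimal sketch reproducing the Bengali example, assuming the four removed tokens are among the model's 458 entries (consistent with the example output):

```python
# Minimal illustration of stopword cleaning: keep tokens not in the stopword set.
# The four stopwords here are taken from the Bengali example sentence; the real
# model ships a 458-entry list derived from stopwords-iso.
STOPWORDS = {"আপনি", "আমার", "চেয়ে", "না"}

def clean_tokens(tokens):
    return [t for t in tokens if t not in STOPWORDS]

print(clean_tokens("আপনি আমার চেয়ে ভাল না".split()))  # ['ভাল']
```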
## Results ```bash +------+ |result| +------+ |[ভাল] | +------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_iso| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|bn| |Size:|3.1 KB| --- layout: model title: English T5ForConditionalGeneration Tiny Cased model (from google) author: John Snow Labs name: t5_efficient_tiny_nl16 date: 2023-01-31 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-tiny-nl16` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nl16_en_4.3.0_3.0_1675123795068.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_tiny_nl16_en_4.3.0_3.0_1675123795068.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_tiny_nl16","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_tiny_nl16","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_tiny_nl16| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|88.6 MB| ## References - https://huggingface.co/google/t5-efficient-tiny-nl16 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Lemmatizer (Romanian, SpacyLookup) author: John Snow Labs name: lemma_spacylookup date: 2022-03-03 tags: [open_source, lemmatizer, ro] task: Lemmatization language: ro edition: Spark NLP 3.4.1 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This Romanian Lemmatizer is a scalable, production-ready version of the rule-based Lemmatizer available in the [Spacy Lookups Data repository](https://github.com/explosion/spacy-lookups-data/). {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_ro_3.4.1_3.0_1646316527511.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/lemma_spacylookup_ro_3.4.1_3.0_1646316527511.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols(["document"]) \ .setOutputCol("token") lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","ro") \ .setInputCols(["token"]) \ .setOutputCol("lemma") pipeline = Pipeline(stages=[documentAssembler, tokenizer, lemmatizer]) example = spark.createDataFrame([["Nu ești mai bun decât mine"]], ["text"]) results = pipeline.fit(example).transform(example) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val lemmatizer = LemmatizerModel.pretrained("lemma_spacylookup","ro") .setInputCols(Array("token")) .setOutputCol("lemma") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, lemmatizer)) val data = Seq("Nu ești mai bun decât mine").toDF("text") val results = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ro.lemma").predict("""Nu ești mai bun decât mine""") ```
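A lookup lemmatizer maps each token through a dictionary and falls back to the surface form when a token is absent. A toy sketch mirroring the Romanian example (only two entries here; the real model ships the full spaCy lookup table):

```python
# Toy dictionary lemmatizer: look each token up, fall back to the token itself.
# The two entries mirror the example output ("ești" -> "fi", "mine" -> "mină").
LEMMAS = {"ești": "fi", "mine": "mină"}

def lemmatize(tokens):
    return [LEMMAS.get(t, t) for t in tokens]

print(lemmatize("Nu ești mai bun decât mine".split()))
# ['Nu', 'fi', 'mai', 'bun', 'decât', 'mină']
```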
## Results ```bash +-------------------------------+ |result | +-------------------------------+ |[Nu, fi, mai, bun, decât, mină]| +-------------------------------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|lemma_spacylookup| |Compatibility:|Spark NLP 3.4.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[lemma]| |Language:|ro| |Size:|3.4 MB| --- layout: model title: English asr_wav2vec2_base_timit_demo_colab_by_gullenasatish TFWav2Vec2ForCTC from gullenasatish author: John Snow Labs name: pipeline_asr_wav2vec2_base_timit_demo_colab_by_gullenasatish date: 2022-09-24 tags: [wav2vec2, en, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_wav2vec2_base_timit_demo_colab_by_gullenasatish` is an English model originally trained by gullenasatish. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_wav2vec2_base_timit_demo_colab_by_gullenasatish_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_gullenasatish_en_4.2.0_3.0_1664038844889.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_wav2vec2_base_timit_demo_colab_by_gullenasatish_en_4.2.0_3.0_1664038844889.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_wav2vec2_base_timit_demo_colab_by_gullenasatish', lang = 'en') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_wav2vec2_base_timit_demo_colab_by_gullenasatish", lang = "en") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_wav2vec2_base_timit_demo_colab_by_gullenasatish| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|en| |Size:|354.9 MB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: Pipeline to Detect chemicals in text author: John Snow Labs name: ner_chemicals_pipeline date: 2022-03-21 tags: [licensed, ner, clinical, en] task: Named Entity Recognition language: en nav_key: models edition: Healthcare NLP 3.4.1 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained pipeline is built on top of the [ner_chemicals](https://nlp.johnsnowlabs.com/2021/04/01/ner_chemicals_en.html) model. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_chemicals_pipeline_en_3.4.1_3.0_1647869797628.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/ner_chemicals_pipeline_en_3.4.1_3.0_1647869797628.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline("ner_chemicals_pipeline", "en", "clinical/models") pipeline.annotate("The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.") ``` ```scala val pipeline = new PretrainedPipeline("ner_chemicals_pipeline", "en", "clinical/models") pipeline.annotate("The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.") ``` {:.nlu-block} ```python import nlu nlu.load("en.med_ner.chemicals.pipeline").predict("""The results have shown that the product p - choloroaniline is not a significant factor in chlorhexidine - digluconate associated erosive cystitis. A high percentage of kanamycin - colistin and povidone - iodine irrigations were associated with erosive cystitis.""") ```
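The pipeline's final NerConverter stage merges token-level IOB tags into entity chunks. A minimal pure-Python sketch of that merging logic; the tokens and tags below are illustrative, chosen to mirror the chemicals in the sample sentence.

```python
# Sketch of NerConverter-style chunking: consecutive B-/I- tags of the
# same entity are joined into one chunk; O tags close the current chunk.
def iob_to_chunks(tokens, tags):
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" tag: flush any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["kanamycin", "-", "colistin", "and", "povidone", "-", "iodine"]
tags   = ["B-CHEM", "O", "B-CHEM", "O", "B-CHEM", "I-CHEM", "I-CHEM"]
print(iob_to_chunks(tokens, tags))
# -> [('kanamycin', 'CHEM'), ('colistin', 'CHEM'), ('povidone - iodine', 'CHEM')]
```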
## Results ```bash +---------------------------+--------+ |chunks |entities| +---------------------------+--------+ |p - choloroaniline |CHEM | |chlorhexidine - digluconate|CHEM | |kanamycin |CHEM | |colistin |CHEM | |povidone - iodine |CHEM | +---------------------------+--------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|ner_chemicals_pipeline| |Type:|pipeline| |Compatibility:|Healthcare NLP 3.4.1+| |License:|Licensed| |Edition:|Official| |Language:|en| |Size:|1.7 GB| ## Included Models - DocumentAssembler - SentenceDetectorDLModel - TokenizerModel - WordEmbeddingsModel - MedicalNerModel - NerConverter --- layout: model title: Italian Bert Embeddings (Uncased) author: John Snow Labs name: bert_embeddings_bert_base_italian_xxl_uncased date: 2022-04-11 tags: [bert, embeddings, it, open_source] task: Embeddings language: it edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `bert-base-italian-xxl-uncased` is an Italian model originally trained by `dbmdz`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_italian_xxl_uncased_it_3.4.2_3.0_1649676784263.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_bert_base_italian_xxl_uncased_it_3.4.2_3.0_1649676784263.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_italian_xxl_uncased","it") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["Adoro Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_italian_xxl_uncased","it") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("Adoro Spark NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("it.embed.bert_base_italian_xxl_uncased").predict("""Adoro Spark NLP""") ```
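BertEmbeddings outputs one vector per token. If you need a single sentence-level vector from those token embeddings, one common approach is mean pooling. A tiny sketch with made-up 3-dimensional vectors; real BERT vectors are 768-dimensional.

```python
# Sketch of mean pooling: average the per-token vectors component-wise
# to get one sentence vector. The vectors below are illustrative only.
token_vectors = [
    [0.25, 0.5, 0.0],   # "Adoro"
    [0.5, 0.0, 0.25],   # "Spark"
    [0.75, 0.25, 0.5],  # "NLP"
]

def mean_pool(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

print(mean_pool(token_vectors))  # -> [0.5, 0.25, 0.25]
```

In Spark NLP the same effect can be obtained by adding a SentenceEmbeddings stage with average pooling after the token-level embeddings, rather than pooling by hand.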
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_bert_base_italian_xxl_uncased| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|it| |Size:|415.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased - http://opus.nlpl.eu/ - https://traces1.inria.fr/oscar/ - https://github.com/dbmdz/berts/issues/7 - https://github.com/stefan-it/turkish-bert/tree/master/electra - https://github.com/stefan-it/italian-bertelectra - https://github.com/dbmdz/berts/issues/new --- layout: model title: German BertForMaskedLM Base Cased model author: John Snow Labs name: bert_embeddings_base_german_cased date: 2022-12-02 tags: [de, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: de edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-german-cased` is a German model originally trained by HuggingFace. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_german_cased_de_4.2.4_3.0_1670017649516.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_german_cased_de_4.2.4_3.0_1670017649516.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_german_cased","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_german_cased","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_german_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|409.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/bert-base-german-cased - https://static.tildacdn.com/tild6438-3730-4164-b266-613634323466/german_bert.png - https://github.com/deepset-ai/FARM/issues/60 - https://deepset.ai/german-bert - https://thumb.tildacdn.com/tild3162-6462-4566-b663-376630376138/-/format/webp/Screenshot_from_2020.png - https://thumb.tildacdn.com/tild6335-3531-4137-b533-313365663435/-/format/webp/deepset_checkpoints.png - https://raw.githubusercontent.com/deepset-ai/FARM/master/docs/img/deepset_logo.png - https://github.com/deepset-ai/FARM - https://github.com/deepset-ai/haystack/ - https://twitter.com/deepset_ai - https://www.linkedin.com/company/deepset-ai/ - https://deepset.ai --- layout: model title: Arabic Bert Embeddings (from Ebtihal) author: John Snow Labs name: bert_embeddings_AraBertMo_base_V1 date: 2022-04-11 tags: [bert, embeddings, ar, open_source] task: Embeddings language: ar edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `AraBertMo_base_V1` is an Arabic model originally trained by `Ebtihal`.
{:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_AraBertMo_base_V1_ar_3.4.2_3.0_1649678935180.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_AraBertMo_base_V1_ar_3.4.2_3.0_1649678935180.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_AraBertMo_base_V1","ar") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["أنا أحب شرارة NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_AraBertMo_base_V1","ar") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("أنا أحب شرارة NLP").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ar.embed.AraBertMo_base_V1").predict("""أنا أحب شرارة NLP""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_AraBertMo_base_V1| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ar| |Size:|410.7 MB| |Case sensitive:|true| ## References - https://huggingface.co/Ebtihal/AraBertMo_base_V1 - https://github.com/google-research/bert - https://traces1.inria.fr/oscar/ - https://uokufa.edu.iq/ - https://mathcomp.uokufa.edu.iq/ --- layout: model title: Icelandic DistilBertForTokenClassification Cased model (from m3hrdadfi) author: John Snow Labs name: distilbert_token_classifier_icelandic_ner date: 2023-03-06 tags: [is, open_source, distilbert, token_classification, ner, tensorflow] task: Named Entity Recognition language: is edition: Spark NLP 4.3.1 spark_version: 3.0 supported: true engine: tensorflow annotator: DistilBertForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `icelandic-ner-distilbert` is an Icelandic model originally trained by `m3hrdadfi`. ## Predicted Entities `Money`, `Date`, `Time`, `Percent`, `Miscellaneous`, `Location`, `Person`, `Organization` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_icelandic_ner_distilbert_is_4.3.1_3.0_1678134372960.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_token_classifier_icelandic_ner_distilbert_is_4.3.1_3.0_1678134372960.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_icelandic_ner_distilbert","is") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_icelandic_ner_distilbert","is") .setInputCols(Array("document", "token")) .setOutputCol("ner") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_token_classifier_icelandic_ner_distilbert| |Compatibility:|Spark NLP 4.3.1+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|is| |Size:|505.8 MB| |Case sensitive:|true| |Max sentence length:|128| ## References - https://huggingface.co/m3hrdadfi/icelandic-ner-distilbert - http://hdl.handle.net/20.500.12537/42 - https://en.ru.is/ - https://github.com/m3hrdadfi/icelandic-ner/issues --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from mvonwyl) author: John Snow Labs name: distilbert_qa_mvonwyl_base_uncased_finetuned_squad2 date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad2` is an English model originally trained by `mvonwyl`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_mvonwyl_base_uncased_finetuned_squad2_en_4.3.0_3.0_1672773565472.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_mvonwyl_base_uncased_finetuned_squad2_en_4.3.0_3.0_1672773565472.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_mvonwyl_base_uncased_finetuned_squad2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_mvonwyl_base_uncased_finetuned_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_mvonwyl_base_uncased_finetuned_squad2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/mvonwyl/distilbert-base-uncased-finetuned-squad2 --- layout: model title: English XLMRobertaForTokenClassification Large Cased model (from asahi417) author: John Snow Labs name: xlmroberta_ner_tner_large_bc5cdr date: 2022-08-13 tags: [en, open_source, xlm_roberta, ner] task: Named Entity Recognition language: en nav_key: models edition: Spark NLP 4.1.0 spark_version: 3.0 supported: true annotator: XlmRoBertaForTokenClassification article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained XLMRobertaForTokenClassification model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `tner-xlm-roberta-large-bc5cdr` is an English model originally trained by `asahi417`. ## Predicted Entities `chemical`, `disease` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_large_bc5cdr_en_4.1.0_3.0_1660424448383.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_ner_tner_large_bc5cdr_en_4.1.0_3.0_1660424448383.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_large_bc5cdr","en") \ .setInputCols(["document", "token"]) \ .setOutputCol("ner") ner_converter = NerConverter()\ .setInputCols(["document", "token", "ner"])\ .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[documentAssembler, tokenizer, token_classifier, ner_converter]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token_classifier = XlmRoBertaForTokenClassification.pretrained("xlmroberta_ner_tner_large_bc5cdr","en") .setInputCols(Array("document", "token")) .setOutputCol("ner") val ner_converter = new NerConverter() .setInputCols(Array("document", "token", "ner")) .setOutputCol("ner_chunk") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, token_classifier, ner_converter)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|xlmroberta_ner_tner_large_bc5cdr| |Compatibility:|Spark NLP 4.1.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document, token]| |Output Labels:|[ner]| |Language:|en| |Size:|1.7 GB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/asahi417/tner-xlm-roberta-large-bc5cdr - https://github.com/asahi417/tner --- layout: model title: Sentence Entity Resolver for CPT codes (procedures and measurements) - Augmented author: John Snow Labs name: sbiobertresolve_cpt_procedures_measurements_augmented date: 2022-05-10 tags: [licensed, en, clinical, entity_resolution, cpt] task: Entity Resolution language: en nav_key: models edition: Healthcare NLP 3.5.1 spark_version: 3.0 supported: true recommended: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model maps medical entities to CPT codes using Sentence Bert Embeddings. The corpus of this model has been extended to measurements, and this model is capable of mapping both procedure and measurement concepts/entities to CPT codes. Measurement codes are helpful in codifying medical entities related to tests and their results.
## Predicted Entities `CPT Codes` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/healthcare/ER_CPT/){:.button.button-orange} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_CPT.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cpt_procedures_measurements_augmented_en_3.5.1_3.0_1652168576968.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/sbiobertresolve_cpt_procedures_measurements_augmented_en_3.5.1_3.0_1652168576968.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ .setInputCols(["sentence", "token"])\ .setOutputCol("word_embeddings") ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \ .setInputCols(["sentence", "token", "word_embeddings"]) \ .setOutputCol("ner") ner_converter = NerConverterInternal()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk")\ .setWhiteList(["Procedure", "Test"]) c2doc = Chunk2Doc()\ .setInputCols("ner_chunk")\ .setOutputCol("ner_chunk_doc") sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en","clinical/models")\ .setInputCols(["ner_chunk_doc"])\ .setOutputCol("sentence_embeddings")\ .setCaseSensitive(False) cpt_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_cpt_procedures_measurements_augmented", "en", "clinical/models")\ .setInputCols(["sentence_embeddings"]) \ .setOutputCol("cpt_code")\ .setDistanceFunction("EUCLIDEAN") resolver_pipeline = Pipeline(stages = [ document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter, c2doc, sbert_embedder, cpt_resolver]) model = resolver_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) text='''She was admitted to the hospital with chest pain and found to have bilateral pleural effusion, the right greater than the left. CT scan of the chest also revealed a large mediastinal lymph node. We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma. 
At this time, chest tube placement for drainage of the fluid occurred and thoracoscopy, which were performed, which revealed epithelioid malignant mesothelioma.''' data = spark.createDataFrame([[text]]).toDF("text") result = model.transform(data) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetectorDL = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") .setInputCols(Array("sentence", "token")) .setOutputCol("word_embeddings") val ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") .setInputCols(Array("sentence", "token", "word_embeddings")) .setOutputCol("ner") val ner_converter = new NerConverterInternal() .setInputCols(Array("sentence", "token", "ner")) .setOutputCol("ner_chunk") .setWhiteList(Array("Procedure", "Test")) val c2doc = new Chunk2Doc() .setInputCols(Array("ner_chunk")) .setOutputCol("ner_chunk_doc") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") .setInputCols(Array("ner_chunk_doc")) .setOutputCol("sentence_embeddings") .setCaseSensitive(false) val cpt_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_cpt_procedures_measurements_augmented", "en", "clinical/models") .setInputCols(Array("sentence_embeddings")) .setOutputCol("cpt_code") .setDistanceFunction("EUCLIDEAN") val resolver_pipeline = new Pipeline().setStages(Array( document_assembler, sentenceDetectorDL, tokenizer, word_embeddings, ner, ner_converter, c2doc, sbert_embedder, cpt_resolver)) val data = Seq("She was admitted to the hospital with chest pain and found to have bilateral pleural effusion, the right greater than the left. 
CT scan of the chest also revealed a large mediastinal lymph node. We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma. At this time, chest tube placement for drainage of the fluid occurred and thoracoscopy, which were performed, which revealed epithelioid malignant mesothelioma.").toDS.toDF("text") val results = resolver_pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.resolve.cpt.procedures_measurements").predict("""She was admitted to the hospital with chest pain and found to have bilateral pleural effusion, the right greater than the left. CT scan of the chest also revealed a large mediastinal lymph node. We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma. At this time, chest tube placement for drainage of the fluid occurred and thoracoscopy, which were performed, which revealed epithelioid malignant mesothelioma.""") ```
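A SentenceEntityResolverModel maps each chunk embedding to the closest code in its reference embedding space; this pipeline configures Euclidean distance via `setDistanceFunction("EUCLIDEAN")`. A toy sketch of that nearest-neighbor step, with made-up 3-dimensional vectors and a hypothetical two-entry code table (real embeddings are 768-dimensional and the reference set covers the whole CPT corpus).

```python
# Sketch of embedding-based entity resolution: return the code whose
# reference vector is nearest (Euclidean distance) to the chunk vector.
import math

REFERENCE = {
    "33030": [0.9, 0.1, 0.0],  # hypothetical vector for "Pericardiectomy"
    "71250": [0.1, 0.9, 0.2],  # hypothetical vector for "CT scan of chest"
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def resolve(chunk_embedding, reference):
    return min(reference, key=lambda code: euclidean(chunk_embedding, reference[code]))

print(resolve([0.8, 0.2, 0.1], REFERENCE))  # -> "33030"
```

The `all_k_codes` column in the results is the same idea extended from the single nearest neighbor to the k nearest codes, ranked by distance.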
## Results ```bash +---------------------+---------+--------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+ | chunk| entity|cpt_code| all_k_resolutions| all_k_codes| +---------------------+---------+--------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+ | CT scan of the chest| Test| 71250|Diagnostic CT scan of chest [Computed tomography, thorax, diagnostic; without contrast material]:...|71250:::70490:::76497:::71260:::74150:::70486:::73200:::70480:::77014:::73700:::71270:::70491:::7...| | pericardectomy|Procedure| 33030|Pericardectomy [Pericardiectomy, subtotal or complete; without cardiopulmonary bypass]:::Pericard...|33030:::33020:::64746:::49250:::27350:::68520:::32310:::27340:::33025:::32215:::41821:::1005708::...| | chest tube placement|Procedure| 39503|Insertion of chest tube [Repair, neonatal diaphragmatic hernia, with or without chest tube insert...|39503:::96440:::32553:::35820:::32100:::36226:::21899:::29200:::0174T:::31502:::31605:::69424:::1...| |drainage of the fluid|Procedure| 10140|Drainage of blood or fluid accumulation [Incision and drainage of hematoma, seroma or fluid colle...|10140:::40800:::61108:::41006:::62180:::83986:::49082:::27030:::21502:::49323:::32554:::51040:::6...| | thoracoscopy|Procedure| 1020900|Thoracoscopy [Thoracoscopy]:::Thoracoscopy, surgical; with control of traumatic hemorrhage | [Hea...| 1020900:::32654:::32668:::1006014:::35820:::32606:::32555:::31781:::31515:::29200| +---------------------+---------+--------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+ ``` 
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|sbiobertresolve_cpt_procedures_measurements_augmented| |Compatibility:|Healthcare NLP 3.5.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[cpt_code]| |Language:|en| |Size:|360.4 MB| |Case sensitive:|false| ## References Trained on Current Procedural Terminology dataset with `sbiobert_base_cased_mli` sentence embeddings. --- layout: model title: English DistilBertForQuestionAnswering Base Uncased model (from Plimpton) author: John Snow Labs name: distilbert_qa_plimpton_base_uncased_finetuned_squad date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Plimpton`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_plimpton_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768921810.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_plimpton_base_uncased_finetuned_squad_en_4.3.0_3.0_1672768921810.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_plimpton_base_uncased_finetuned_squad","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_plimpton_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq("What's my name?","My name is Clara and I live in Berkeley.").toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_plimpton_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Plimpton/distilbert-base-uncased-finetuned-squad --- layout: model title: Legal Indenture Document Binary Classifier (Longformer) author: John Snow Labs name: legclf_indenture_agreement date: 2022-12-18 tags: [en, legal, classification, licensed, document, longformer, indenture, tensorflow] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description The `legclf_indenture_agreement` model is a Longformer Document Classifier used to classify if the document belongs to the class `indenture` or not (Binary Classification). Longformers are limited to 4096 tokens, so only the first 4096 tokens will be taken into account. We have found that, for the vast majority of documents in legal corpora, 4096 tokens are enough to perform Document Classification, provided the documents are clean and contain only the legal text without any extra leading material. If your document needs to process more than 4096 tokens, you can try the following: split it into chunks of 4096 tokens, average their embeddings, and train with the averaged version, which means the whole document is taken into account.
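The chunk-and-average workaround described above can be sketched in plain Python. This is a minimal illustration of the idea only, not Spark NLP API code; the `embed_chunk` encoder below is a made-up placeholder standing in for a real Longformer.

```python
# Sketch: average embeddings over 4096-token chunks so the whole
# document contributes, not just the first 4096 tokens.
def chunk(tokens, size=4096):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def average(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(vals) / n for vals in zip(*vectors)]

def embed_chunk(tokens):
    # Placeholder for a real encoder (e.g. a Longformer): here each
    # chunk is "embedded" as a single dummy feature, its length.
    return [float(len(tokens))]

tokens = ["tok"] * 10000           # a document longer than 4096 tokens
chunks = chunk(tokens)             # -> chunks of 4096, 4096 and 1808 tokens
doc_embedding = average([embed_chunk(c) for c in chunks])
print(len(chunks), doc_embedding)
```

The averaged vector can then be fed to the document classifier in place of a single-chunk embedding.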
## Predicted Entities `indenture`, `other` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_indenture_agreement_en_1.0.0_3.0_1671393672864.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_indenture_agreement_en_1.0.0_3.0_1671393672864.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") tokenizer = nlp.Tokenizer()\ .setInputCols(["document"])\ .setOutputCol("token") embeddings = nlp.LongformerEmbeddings.pretrained("legal_longformer_base", "en")\ .setInputCols("document", "token")\ .setOutputCol("embeddings") sentence_embeddings = nlp.SentenceEmbeddings()\ .setInputCols(["document", "embeddings"])\ .setOutputCol("sentence_embeddings")\ .setPoolingStrategy("AVERAGE") doc_classifier = legal.ClassifierDLModel.pretrained("legclf_indenture_agreement", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ document_assembler, tokenizer, embeddings, sentence_embeddings, doc_classifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ |result| +-------+ |[indenture]| |[other]| |[other]| |[indenture]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_indenture_agreement| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[class]| |Language:|en| |Size:|21.8 MB| ## References Legal documents, scraped from the Internet, and classified in-house + SEC documents ## Benchmarking ```bash label precision recall f1-score support indenture 0.90 0.94 0.92 113 other 0.96 0.94 0.95 188 accuracy - - 0.94 301 macro-avg 0.93 0.94 0.93 301 weighted-avg 0.94 0.94 0.94 301 ``` --- layout: model title: Stop Words Cleaner for French author: John Snow Labs name: stopwords_fr date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: fr edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, fr] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions.
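The idea behind a stop-words cleaner can be illustrated without Spark NLP: filter tokens against a known stop-word set. The tiny French stop-word list below is a made-up sample for the example, not the model's actual list.

```python
# Sketch: drop tokens that appear in a stop-word set (case-insensitive).
STOPWORDS_FR = {"en", "plus", "d", "le", "du", "est", "un", "et", "de", "la"}  # sample only

def clean_tokens(tokens, stopwords=STOPWORDS_FR):
    """Return the tokens that are not stop words."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ["En", "plus", "le", "roi", "du", "nord", "John", "Snow"]
print(clean_tokens(tokens))  # -> ['roi', 'nord', 'John', 'Snow']
```

The pretrained `stopwords_fr` annotator does the same token filtering inside a Spark pipeline, with a curated French list.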
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_fr_fr_2.5.4_2.4_1594742439495.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_fr_fr_2.5.4_2.4_1594742439495.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_fr", "fr") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("En plus d'être le roi du nord, John Snow est un médecin anglais et un leader dans le développement de l'anesthésie et de l'hygiène médicale.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_fr", "fr") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("En plus d'être le roi du nord, John Snow est un médecin anglais et un leader dans le développement de l'anesthésie et de l'hygiène médicale.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""En plus d'être le roi du nord, John Snow est un médecin anglais et un leader dans le développement de l'anesthésie et de l'hygiène médicale."""] stopword_df = nlu.load('fr.stopwords').predict(text) stopword_df[["cleanTokens"]] ```
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=8, end=13, result="d'être", metadata={'sentence': '0'}), Row(annotatorType='token', begin=18, end=20, result='roi', metadata={'sentence': '0'}), Row(annotatorType='token', begin=25, end=28, result='nord', metadata={'sentence': '0'}), Row(annotatorType='token', begin=29, end=29, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=31, end=34, result='John', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_fr| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|fr| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords) --- layout: model title: Translate English to Tahitian Pipeline author: John Snow Labs name: translate_en_ty date: 2021-01-03 task: [Translation, Pipeline Public] language: xx edition: Spark NLP 2.7.0 spark_version: 2.4 tags: [open_source, seq2seq, translation, pipeline, en, ty, xx] supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects. Note that this is a very computationally expensive module, especially on longer sequences.
The use of an accelerator such as GPU is recommended. - source languages: `en` - target languages: `ty` {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/translate_en_ty_xx_2.7.0_2.4_1609687934373.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/translate_en_ty_xx_2.7.0_2.4_1609687934373.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline("translate_en_ty", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` ```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline val pipeline = new PretrainedPipeline("translate_en_ty", lang = "xx") pipeline.annotate("Your sentence to translate!") ``` {:.nlu-block} ```python import nlu text = ["text to translate"] translate_df = nlu.load('xx.en.translate_to.ty').predict(text, output_level='sentence') translate_df ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|translate_en_ty| |Type:|pipeline| |Compatibility:|Spark NLP 2.7.0+| |Edition:|Official| |Language:|xx| ## Data Source [https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models](https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models) --- layout: model title: Legal Severability Clause Binary Classifier author: John Snow Labs name: legclf_severability_clause date: 2022-08-10 tags: [en, legal, classification, clauses, licensed] task: Text Classification language: en nav_key: models edition: Legal NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model is a Binary Classifier (True, False) for the `severability` clause type. To use this model, make sure you provide enough context as an input. Adding Sentence Splitters to the pipeline will make the model see only sentences, not the whole text, so it's better to skip them, unless you want to do Binary Classification at sentence level. If you have big legal documents and you want to look for clauses, we recommend splitting the documents using any of the techniques available in our Legal NLP Workshop Tokenization & Splitting Tutorial (link [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/1.Tokenization_Splitting.ipynb)), namely: - Paragraph splitting (by multiline); - Splitting by headers / subheaders; - etc. Take into consideration that the embeddings of this model allow up to 512 tokens. If you have more than that, consider splitting into smaller pieces (you can also check the same tutorial link provided above). This model can be combined with any of the other "hundreds" of Legal Clauses Classifiers you will find in Models Hub, getting as output a series of True/False values for each of the legal clause models you have added.
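Paragraph splitting by multiline, mentioned above, can be sketched with a regular expression: split on blank lines so each candidate clause can be classified on its own. This is an illustrative sketch only, not the tutorial's exact code; the sample clauses are made up.

```python
import re

# Sketch: split a long legal document into paragraphs on blank lines.
def split_paragraphs(text):
    # Two or more consecutive newlines mark a paragraph boundary.
    parts = re.split(r"\n{2,}", text)
    return [p.strip() for p in parts if p.strip()]

doc = ("SEVERABILITY. If any provision is held invalid...\n\n"
       "GOVERNING LAW. This Agreement shall be governed...\n\n\n"
       "NOTICES. All notices shall be in writing...")
for i, para in enumerate(split_paragraphs(doc)):
    print(i, para.split(".")[0])
```

Each resulting paragraph can then be fed to the classifier as its own `clause_text` row.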
## Predicted Entities `other`, `severability` {:.btn-box} [Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_severability_clause_en_1.0.0_3.2_1660123999498.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/legal/models/legclf_severability_clause_en_1.0.0_3.2_1660123999498.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler() \ .setInputCol("clause_text") \ .setOutputCol("document") embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \ .setInputCols("document") \ .setOutputCol("sentence_embeddings") docClassifier = nlp.ClassifierDLModel.pretrained("legclf_severability_clause", "en", "legal/models")\ .setInputCols(["sentence_embeddings"])\ .setOutputCol("category") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, embeddings, docClassifier]) df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text") model = nlpPipeline.fit(df) result = model.transform(df) ```
## Results ```bash +-------+ | result| +-------+ |[severability]| |[other]| |[other]| |[severability]| +-------+ ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|legclf_severability_clause| |Compatibility:|Legal NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[sentence_embeddings]| |Output Labels:|[category]| |Language:|en| |Size:|23.0 MB| ## References Legal documents, scraped from the Internet, and classified in-house ## Benchmarking ```bash label precision recall f1-score support other 0.97 0.98 0.98 131 severability 0.97 0.95 0.96 81 accuracy - - 0.97 212 macro-avg 0.97 0.97 0.97 212 weighted-avg 0.97 0.97 0.97 212 ``` --- layout: model title: Map Company Tickers to Subsidiary Companies (wikipedia, en) author: John Snow Labs name: finmapper_wikipedia_parentcompanies_ticker date: 2023-01-13 tags: [subsidiaries, companies, acquisitions, en, licensed] task: Chunk Mapping language: en nav_key: models edition: Finance NLP 1.0.0 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model allows you, given an extracted TICKER, to retrieve all the parent, subsidiary, or acquired companies and/or companies in the same group. ## Predicted Entities {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finmapper_wikipedia_parentcompanies_ticker_en_1.0.0_3.0_1673610848433.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finmapper_wikipedia_parentcompanies_ticker_en_1.0.0_3.0_1673610848433.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = nlp.DocumentAssembler()\ .setInputCol("text")\ .setOutputCol("document") sentenceDetector = nlp.SentenceDetector()\ .setInputCols(["document"])\ .setOutputCol("sentence") tokenizer = nlp.Tokenizer()\ .setInputCols(["sentence"])\ .setOutputCol("token") embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \ .setInputCols(["sentence", "token"]) \ .setOutputCol("embeddings") ner_model = finance.NerModel.pretrained("finner_ticker", "en", "finance/models")\ .setInputCols(["sentence", "token", "embeddings"])\ .setOutputCol("ner") ner_converter = nlp.NerConverter()\ .setInputCols(["sentence", "token", "ner"])\ .setOutputCol("ner_chunk") CM = finance.ChunkMapperModel()\ .pretrained('finmapper_wikipedia_parentcompanies_ticker','en','finance/models')\ .setInputCols(["ner_chunk"])\ .setOutputCol("mappings") nlpPipeline = nlp.Pipeline(stages=[ documentAssembler, sentenceDetector, tokenizer, embeddings, ner_model, ner_converter, CM ]) text = ["""ABG is a multinational corporation that is engaged in ..."""] test_data = spark.createDataFrame([text]).toDF("text") model = nlpPipeline.fit(test_data) lp = nlp.LightPipeline(model) res = model.transform(test_data) ```
## Results ```bash {'mappings': ['ABSA Group Limited', 'ABSA Group Limited@http://www.wikidata.org/entity/Q58641733', 'ABSA Group Limited@ABSA Group Limited@en', 'ABSA Group Limited@http://www.wikidata.org/prop/direct/P749', 'ABSA Group Limited@is_parent_of', 'ABSA Group Limited@JOHANNESBURG STOCK EXCHANGE@en', 'ABSA Group Limited@باركليز@ar', 'ABSA Group Limited@http://www.wikidata.org/entity/Q245343'], ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|finmapper_wikipedia_parentcompanies_ticker| |Compatibility:|Finance NLP 1.0.0+| |License:|Licensed| |Edition:|Official| |Input Labels:|[ner_chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|1.3 MB| ## References Wikidata --- layout: model title: Mapping RxNorm Codes with Corresponding National Drug Codes author: John Snow Labs name: rxnorm_ndc_mapper date: 2022-05-09 tags: [chunk_mapper, ndc, rxnorm, licensed, en, clinical] task: Chunk Mapping language: en nav_key: models edition: Healthcare NLP 3.5.1 spark_version: 3.0 supported: true annotator: ChunkMapperModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This pretrained model maps RxNorm and RxNorm Extension codes with corresponding National Drug Codes (NDC). ## Predicted Entities `Product NDC`, `Package NDC` {:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/rxnorm_ndc_mapper_en_3.5.1_3.0_1652076748381.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/rxnorm_ndc_mapper_en_3.5.1_3.0_1652076748381.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = DocumentAssembler()\ .setInputCol('text')\ .setOutputCol('ner_chunk') sbert_embedder = BertSentenceEmbeddings.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\ .setInputCols(["ner_chunk"])\ .setOutputCol("sentence_embeddings")\ .setCaseSensitive(False) rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en", "clinical/models") \ .setInputCols(["sentence_embeddings"]) \ .setOutputCol("rxnorm_code")\ .setDistanceFunction("EUCLIDEAN") chunkerMapper_product = ChunkMapperModel.pretrained("rxnorm_ndc_mapper", "en", "clinical/models")\ .setInputCols(["rxnorm_code"])\ .setOutputCol("Product NDC")\ .setRel("Product NDC") chunkerMapper_package = ChunkMapperModel.pretrained("rxnorm_ndc_mapper", "en", "clinical/models")\ .setInputCols(["rxnorm_code"])\ .setOutputCol("Package NDC")\ .setRel("Package NDC") pipeline = Pipeline().setStages([document_assembler, sbert_embedder, rxnorm_resolver, chunkerMapper_product, chunkerMapper_package ]) model = pipeline.fit(spark.createDataFrame([['']]).toDF('text')) lp = LightPipeline(model) result = lp.annotate(['doxepin hydrochloride 50 MG/ML', 'macadamia nut 100 MG/ML']) ``` ```scala val document_assembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("ner_chunk") val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en","clinical/models") .setInputCols(Array("ner_chunk")) .setOutputCol("sentence_embeddings") .setCaseSensitive(false) val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en", "clinical/models") .setInputCols(Array("sentence_embeddings")) .setOutputCol("rxnorm_code") .setDistanceFunction("EUCLIDEAN") val chunkerMapper_product = ChunkMapperModel.pretrained("rxnorm_ndc_mapper", "en", "clinical/models") .setInputCols(Array("rxnorm_code")) .setOutputCol("Product NDC") .setRel("Product NDC") val 
chunkerMapper_package = ChunkMapperModel.pretrained("rxnorm_ndc_mapper", "en", "clinical/models") .setInputCols(Array("rxnorm_code")) .setOutputCol("Package NDC") .setRel("Package NDC") val pipeline = new Pipeline().setStages(Array( document_assembler, sbert_embedder, rxnorm_resolver, chunkerMapper_product, chunkerMapper_package )) val text_data = Seq("doxepin hydrochloride 50 MG/ML", "macadamia nut 100 MG/ML").toDS.toDF("text") val res = pipeline.fit(text_data).transform(text_data) ``` {:.nlu-block} ```python import nlu nlu.load("en.rxnorm_to_ndc").predict("""Product NDC""") ```
## Results ```bash | | ner_chunk | rxnorm_code | Package NDC | Product NDC | |---:|:-----------------------------------|:--------------|:------------------|:---------------| | 0 | ['doxepin hydrochloride 50 MG/ML'] | ['1000091'] | ['00378-8117-45'] | ['00378-8117'] | | 1 | ['macadamia nut 100 MG/ML'] | ['212433'] | ['00064-2120-08'] | ['00064-2120'] | ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|rxnorm_ndc_mapper| |Compatibility:|Healthcare NLP 3.5.1+| |License:|Licensed| |Edition:|Official| |Input Labels:|[chunk]| |Output Labels:|[mappings]| |Language:|en| |Size:|4.2 MB| --- layout: model title: Javanese DistilBertForMaskedLM Small Cased model (from w11wo) author: John Snow Labs name: distilbert_embeddings_javanese_small_imdb date: 2022-12-12 tags: [jv, open_source, distilbert_embeddings, distilbertformaskedlm] task: Embeddings language: jv edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `javanese-distilbert-small-imdb` is a Javanese model originally trained by `w11wo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_javanese_small_imdb_jv_4.2.4_3.0_1670864962624.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_embeddings_javanese_small_imdb_jv_4.2.4_3.0_1670864962624.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_javanese_small_imdb","jv") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(False) pipeline = Pipeline(stages=[documentAssembler, tokenizer, distilbert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val distilbert_loaded = DistilBertEmbeddings.pretrained("distilbert_embeddings_javanese_small_imdb","jv") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, distilbert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_embeddings_javanese_small_imdb| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|jv| |Size:|247.9 MB| |Case sensitive:|false| ## References - https://huggingface.co/w11wo/javanese-distilbert-small-imdb - https://arxiv.org/abs/1910.01108 - https://github.com/sgugger - https://w11wo.github.io/ --- layout: model title: English asr_vakyansh_wav2vec2_maithili_maim_50 TFWav2Vec2ForCTC from Harveenchadha author: John Snow Labs name: asr_vakyansh_wav2vec2_maithili_maim_50 date: 2022-09-25 tags: [wav2vec2, en, audio, open_source, asr] task: Automatic Speech Recognition language: en nav_key: models edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: Wav2Vec2ForCTC article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_vakyansh_wav2vec2_maithili_maim_50` is an English model originally trained by Harveenchadha. NOTE: This model only works on a CPU; if you need to use this model on a GPU device, please use asr_vakyansh_wav2vec2_maithili_maim_50_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/asr_vakyansh_wav2vec2_maithili_maim_50_en_4.2.0_3.0_1664117277718.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/asr_vakyansh_wav2vec2_maithili_maim_50_en_4.2.0_3.0_1664117277718.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python audio_assembler = AudioAssembler() \ .setInputCol("audio_content") \ .setOutputCol("audio_assembler") speech_to_text = Wav2Vec2ForCTC \ .pretrained("asr_vakyansh_wav2vec2_maithili_maim_50", "en")\ .setInputCols("audio_assembler") \ .setOutputCol("text") pipeline = Pipeline(stages=[ audio_assembler, speech_to_text, ]) pipelineModel = pipeline.fit(audioDf) pipelineDF = pipelineModel.transform(audioDf) ``` ```scala val audioAssembler = new AudioAssembler() .setInputCol("audio_content") .setOutputCol("audio_assembler") val speechToText = Wav2Vec2ForCTC .pretrained("asr_vakyansh_wav2vec2_maithili_maim_50", "en") .setInputCols("audio_assembler") .setOutputCol("text") val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText)) val pipelineModel = pipeline.fit(audioDf) val pipelineDF = pipelineModel.transform(audioDf) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|asr_vakyansh_wav2vec2_maithili_maim_50| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[audio_assembler]| |Output Labels:|[text]| |Language:|en| |Size:|227.8 MB| --- layout: model title: German BertForMaskedLM Base Cased model (from deepset) author: John Snow Labs name: bert_embeddings_g_base date: 2022-12-06 tags: [de, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: de edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `gbert-base` is a German model originally trained by `deepset`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_g_base_de_4.2.4_3.0_1670326376987.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_g_base_de_4.2.4_3.0_1670326376987.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_g_base","de") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_g_base","de") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_g_base| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|de| |Size:|412.5 MB| |Case sensitive:|true| ## References - https://huggingface.co/deepset/gbert-base - https://arxiv.org/pdf/2010.10906.pdf - https://deepset.ai/german-bert - https://deepset.ai/germanquad - https://github.com/deepset-ai/FARM - https://github.com/deepset-ai/haystack/ - https://twitter.com/deepset_ai - https://www.linkedin.com/company/deepset-ai/ - https://haystack.deepset.ai/community/join - https://github.com/deepset-ai/haystack/discussions - https://deepset.ai - http://www.deepset.ai/jobs --- layout: model title: Turkish ElectraForQuestionAnswering model (from kuzgunlar) author: John Snow Labs name: electra_qa_turkish date: 2022-06-22 tags: [tr, open_source, electra, question_answering] task: Question Answering language: tr edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `electra-turkish-qa` is a Turkish model originally trained by `kuzgunlar`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/electra_qa_turkish_tr_4.0.0_3.0_1655921428307.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/electra_qa_turkish_tr_4.0.0_3.0_1655921428307.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_turkish","tr") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("electra_qa_turkish","tr") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Benim adım ne?", "Benim adım Clara ve Berkeley'de yaşıyorum.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("tr.answer_question.electra").predict("""Benim adım ne?|||Benim adım Clara ve Berkeley'de yaşıyorum.""")
```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|electra_qa_turkish| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|tr| |Size:|412.7 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/kuzgunlar/electra-turkish-qa --- layout: model title: English RobertaForQuestionAnswering (from csarron) author: John Snow Labs name: roberta_qa_roberta_large_squad_v1 date: 2022-06-20 tags: [en, open_source, question_answering, roberta] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-large-squad-v1` is an English model originally trained by `csarron`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_squad_v1_en_4.0.0_3.0_1655737362298.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_roberta_large_squad_v1_en_4.0.0_3.0_1655737362298.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python document_assembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = RoBertaForQuestionAnswering.pretrained("roberta_qa_roberta_large_squad_v1","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer") \ .setCaseSensitive(True) pipeline = Pipeline().setStages([ document_assembler, spanClassifier ]) example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(example).transform(example) ``` ```scala val document = new MultiDocumentAssembler() .setInputCols("question", "context") .setOutputCols("document_question", "document_context") val spanClassifier = RoBertaForQuestionAnswering .pretrained("roberta_qa_roberta_large_squad_v1","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) .setMaxSentenceLength(512) val pipeline = new Pipeline().setStages(Array(document, spanClassifier)) val example = Seq( ("Where was John Lennon born?", "John Lennon was born in London and lived in Paris. My name is Sarah and I live in London."), ("What's my name?", "My name is Clara and I live in Berkeley.")) .toDF("question", "context") val result = pipeline.fit(example).transform(example) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.roberta.large").predict("""What's my name?|||"My name is Clara and I live in Berkeley.""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_roberta_large_squad_v1| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|1.3 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/csarron/roberta-large-squad-v1 --- layout: model title: Russian BertForQuestionAnswering Cased model (from ruselkomp) author: John Snow Labs name: bert_qa_sber_full_tes date: 2022-07-07 tags: [ru, open_source, bert, question_answering] task: Question Answering language: ru edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: BertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `sber-full-test` is a Russian model originally trained by `ruselkomp`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_qa_sber_full_tes_ru_4.0.0_3.0_1657191384050.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_qa_sber_full_tes_ru_4.0.0_3.0_1657191384050.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sber_full_tes","ru") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["Как меня зовут?", "Меня зовут Клара, и я живу в Беркли."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_sber_full_tes","ru") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("Как меня зовут?", "Меня зовут Клара, и я живу в Беркли.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_qa_sber_full_tes| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|ru| |Size:|1.6 GB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/ruselkomp/sber-full-test --- layout: model title: English RobertaForQuestionAnswering Base Cased model (from ModelTC) author: John Snow Labs name: roberta_qa_modeltc_base_squad2 date: 2023-01-20 tags: [en, open_source, roberta, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained RobertaForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `roberta-base-squad2` is an English model originally trained by `ModelTC`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_qa_modeltc_base_squad2_en_4.3.0_3.0_1674218961578.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_qa_modeltc_base_squad2_en_4.3.0_3.0_1674218961578.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_modeltc_base_squad2","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = RoBertaForQuestionAnswering.pretrained("roberta_qa_modeltc_base_squad2","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_qa_modeltc_base_squad2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|464.3 MB| |Case sensitive:|true| |Max sentence length:|256| ## References - https://huggingface.co/ModelTC/roberta-base-squad2 --- layout: model title: Polish BertForMaskedLM Base Cased model (from Geotrend) author: John Snow Labs name: bert_embeddings_base_pl_cased date: 2022-12-02 tags: [pl, open_source, bert_embeddings, bertformaskedlm] task: Embeddings language: pl edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained BertForMaskedLM model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `bert-base-pl-cased` is a Polish model originally trained by `Geotrend`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_pl_cased_pl_4.2.4_3.0_1670018669418.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_base_pl_cased_pl_4.2.4_3.0_1670018669418.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_pl_cased","pl") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") \ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, tokenizer, bert_loaded]) data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val bert_loaded = BertEmbeddings.pretrained("bert_embeddings_base_pl_cased","pl") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, bert_loaded)) val data = Seq("I love Spark NLP").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_base_pl_cased| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|pl| |Size:|387.6 MB| |Case sensitive:|true| ## References - https://huggingface.co/Geotrend/bert-base-pl-cased - https://www.aclweb.org/anthology/2020.sustainlp-1.16.pdf - https://github.com/Geotrend-research/smaller-transformers --- layout: model title: Lithuanian Legal Roberta Embeddings author: John Snow Labs name: roberta_base_lithuanian_legal date: 2023-02-16 tags: [lt, lithuanian, embeddings, transformer, open_source, legal, tensorflow] task: Embeddings language: lt edition: Spark NLP 4.2.4 spark_version: 3.0 supported: true engine: tensorflow annotator: RoBertaEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Legal Roberta Embeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `legal-lithuanian-roberta-base` is a Lithuanian model originally trained by `joelito`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/roberta_base_lithuanian_legal_lt_4.2.4_3.0_1676555711899.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/roberta_base_lithuanian_legal_lt_4.2.4_3.0_1676555711899.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_lithuanian_legal", "lt")\ .setInputCols(["sentence", "token"])\ .setOutputCol("embeddings") ``` ```scala val sentence_embeddings = RoBertaEmbeddings.pretrained("roberta_base_lithuanian_legal", "lt") .setInputCols(Array("sentence", "token")) .setOutputCol("embeddings") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|roberta_base_lithuanian_legal| |Compatibility:|Spark NLP 4.2.4+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[embeddings]| |Language:|lt| |Size:|416.0 MB| |Case sensitive:|true| ## References https://huggingface.co/joelito/legal-lithuanian-roberta-base --- layout: model title: English DistilBertForQuestionAnswering model (from Adrian) Squad1 author: John Snow Labs name: distilbert_qa_Adrian_base_uncased_finetuned_squad date: 2022-06-08 tags: [en, open_source, distilbert, question_answering] task: Question Answering language: en nav_key: models edition: Spark NLP 4.0.0 spark_version: 3.0 supported: true annotator: DistilBertForQuestionAnswering article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Question Answering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert-base-uncased-finetuned-squad` is an English model originally trained by `Adrian`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_Adrian_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724056294.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_Adrian_base_uncased_finetuned_squad_en_4.0.0_3.0_1654724056294.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = MultiDocumentAssembler() \ .setInputCols(["question", "context"]) \ .setOutputCols(["document_question", "document_context"]) spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Adrian_base_uncased_finetuned_squad","en") \ .setInputCols(["document_question", "document_context"]) \ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[documentAssembler, spanClassifier]) data = spark.createDataFrame([["What is my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val spanClassifier = DistilBertForQuestionAnswering.pretrained("distilbert_qa_Adrian_base_uncased_finetuned_squad","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(documentAssembler, spanClassifier)) val data = Seq(("What is my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("en.answer_question.squad.distil_bert.base_uncased.by_Adrian").predict("""What is my name?|||"My name is Clara and I live in Berkeley.""") ```
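Under the hood, span-extraction QA annotators like this one score every context token as a candidate answer start and answer end, then return the best-scoring valid span. A toy pure-Python sketch of that selection step (the scores below are made up for illustration; the real model computes them from the question and context):

```python
# Toy illustration of extractive QA span selection: the model emits a
# start score and an end score per context token; the answer is the
# highest-scoring span with start <= end (bounded by a max span length).
def best_span(start_scores, end_scores, max_len=15):
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Made-up scores that peak on "Clara":
start = [0.1, 0.2, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1]
end   = [0.1, 0.1, 0.1, 4.8, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # -> Clara
```

With these illustrative scores the selected span is the single token `Clara`, mirroring the answer the pipeline extracts for this example.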
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_Adrian_base_uncased_finetuned_squad| |Compatibility:|Spark NLP 4.0.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.5 MB| |Case sensitive:|false| |Max sentence length:|512| ## References - https://huggingface.co/Adrian/distilbert-base-uncased-finetuned-squad --- layout: model title: English T5ForConditionalGeneration Large Cased model (from google) author: John Snow Labs name: t5_efficient_large_dl2 date: 2023-01-30 tags: [en, open_source, t5, tensorflow] task: Text Generation language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow annotator: T5Transformer article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained T5ForConditionalGeneration model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `t5-efficient-large-dl2` is an English model originally trained by `google`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/t5_efficient_large_dl2_en_4.3.0_3.0_1675114295600.zip){:.button.button-orange} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/t5_efficient_large_dl2_en_4.3.0_3.0_1675114295600.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") t5 = T5Transformer.pretrained("t5_efficient_large_dl2","en") \ .setInputCols("document") \ .setOutputCol("answers") pipeline = Pipeline(stages=[documentAssembler, t5]) data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val t5 = T5Transformer.pretrained("t5_efficient_large_dl2","en") .setInputCols("document") .setOutputCol("answers") val pipeline = new Pipeline().setStages(Array(documentAssembler, t5)) val data = Seq("PUT YOUR STRING HERE").toDS.toDF("text") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|t5_efficient_large_dl2| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[documents]| |Output Labels:|[t5]| |Language:|en| |Size:|770.5 MB| ## References - https://huggingface.co/google/t5-efficient-large-dl2 - https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html - https://arxiv.org/abs/2109.10686 - https://github.com/google-research/google-research/issues/986#issuecomment-1035051145 --- layout: model title: Korean Bert Embeddings (from snunlp) author: John Snow Labs name: bert_embeddings_KR_FinBert date: 2022-04-11 tags: [bert, embeddings, ko, open_source] task: Embeddings language: ko edition: Spark NLP 3.4.2 spark_version: 3.0 supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Bert Embeddings model, uploaded to Hugging Face, adapted and imported into Spark NLP. `KR-FinBert` is a Korean model originally trained by `snunlp`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bert_embeddings_KR_FinBert_ko_3.4.2_3.0_1649675550368.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bert_embeddings_KR_FinBert_ko_3.4.2_3.0_1649675550368.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python documentAssembler = DocumentAssembler() \ .setInputCol("text") \ .setOutputCol("document") tokenizer = Tokenizer() \ .setInputCols("document") \ .setOutputCol("token") embeddings = BertEmbeddings.pretrained("bert_embeddings_KR_FinBert","ko") \ .setInputCols(["document", "token"]) \ .setOutputCol("embeddings") pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings]) data = spark.createDataFrame([["나는 Spark NLP를 좋아합니다"]]).toDF("text") result = pipeline.fit(data).transform(data) ``` ```scala val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols(Array("document")) .setOutputCol("token") val embeddings = BertEmbeddings.pretrained("bert_embeddings_KR_FinBert","ko") .setInputCols(Array("document", "token")) .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings)) val data = Seq("나는 Spark NLP를 좋아합니다").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu nlu.load("ko.embed.KR_FinBert").predict("""나는 Spark NLP를 좋아합니다""") ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|bert_embeddings_KR_FinBert| |Compatibility:|Spark NLP 3.4.2+| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[bert]| |Language:|ko| |Size:|380.8 MB| |Case sensitive:|true| ## References - https://huggingface.co/snunlp/KR-FinBert - https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments --- layout: model title: German asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412 TFWav2Vec2ForCTC from jonatasgrosman author: John Snow Labs name: pipeline_asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412 date: 2022-09-25 tags: [wav2vec2, de, audio, open_source, pipeline, asr] task: Automatic Speech Recognition language: de edition: Spark NLP 4.2.0 spark_version: 3.0 supported: true annotator: PipelineModel article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained Wav2vec2 pipeline, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412` is a German model originally trained by jonatasgrosman. NOTE: This pipeline only works on a CPU; if you need to use this pipeline on a GPU device, please use pipeline_asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412_gpu {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412_de_4.2.0_3.0_1664112298280.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/pipeline_asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412_de_4.2.0_3.0_1664112298280.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python pipeline = PretrainedPipeline('pipeline_asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412', lang = 'de') annotations = pipeline.transform(audioDF) ``` ```scala val pipeline = new PretrainedPipeline("pipeline_asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412", lang = "de") val annotations = pipeline.transform(audioDF) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|pipeline_asr_exp_w2v2r_xls_r_accent_germany_5_austria_5_s412| |Type:|pipeline| |Compatibility:|Spark NLP 4.2.0+| |License:|Open Source| |Edition:|Official| |Language:|de| |Size:|1.2 GB| ## Included Models - AudioAssembler - Wav2Vec2ForCTC --- layout: model title: English DistilBertForQuestionAnswering Cased model (from nlpunibo) author: John Snow Labs name: distilbert_qa_convolutional_classifier date: 2023-01-03 tags: [en, open_source, distilbert, question_answering, tensorflow] task: Question Answering language: en nav_key: models edition: Spark NLP 4.3.0 spark_version: 3.0 supported: true engine: tensorflow article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description Pretrained DistilBertForQuestionAnswering model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `distilbert_convolutional_classifier` is an English model originally trained by `nlpunibo`. {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbert_qa_convolutional_classifier_en_4.3.0_3.0_1672774548669.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbert_qa_convolutional_classifier_en_4.3.0_3.0_1672774548669.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python Document_Assembler = MultiDocumentAssembler()\ .setInputCols(["question", "context"])\ .setOutputCols(["document_question", "document_context"]) Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_convolutional_classifier","en")\ .setInputCols(["document_question", "document_context"])\ .setOutputCol("answer")\ .setCaseSensitive(True) pipeline = Pipeline(stages=[Document_Assembler, Question_Answering]) data = spark.createDataFrame([["What's my name?","My name is Clara and I live in Berkeley."]]).toDF("question", "context") result = pipeline.fit(data).transform(data) ``` ```scala val Document_Assembler = new MultiDocumentAssembler() .setInputCols(Array("question", "context")) .setOutputCols(Array("document_question", "document_context")) val Question_Answering = DistilBertForQuestionAnswering.pretrained("distilbert_qa_convolutional_classifier","en") .setInputCols(Array("document_question", "document_context")) .setOutputCol("answer") .setCaseSensitive(true) val pipeline = new Pipeline().setStages(Array(Document_Assembler, Question_Answering)) val data = Seq(("What's my name?","My name is Clara and I live in Berkeley.")).toDS.toDF("question", "context") val result = pipeline.fit(data).transform(data) ```
{:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|distilbert_qa_convolutional_classifier| |Compatibility:|Spark NLP 4.3.0+| |License:|Open Source| |Edition:|Official| |Input Labels:|[document_question, document_context]| |Output Labels:|[answer]| |Language:|en| |Size:|247.6 MB| |Case sensitive:|true| |Max sentence length:|512| ## References - https://huggingface.co/nlpunibo/distilbert_convolutional_classifier --- layout: model title: BioBERT Embeddings (Discharge) author: John Snow Labs name: biobert_discharge_base_cased date: 2020-08-25 task: Embeddings language: en nav_key: models edition: Spark NLP 2.6.0 spark_version: 2.4 tags: [embeddings, en, open_source] supported: true annotator: BertEmbeddings article_header: type: cover use_language_switcher: "Python-Scala-Java" --- ## Description This model contains the pre-trained weights of ClinicalBERT for discharge summaries. This domain-specific model improves performance on 3 of 5 clinical NLP tasks and establishes a new state of the art on the MedNLI dataset. The details are described in the paper "[Publicly Available Clinical BERT Embeddings](https://www.aclweb.org/anthology/W19-1909/)". {:.btn-box} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/biobert_discharge_base_cased_en_2.6.0_2.4_1598343571130.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/biobert_discharge_base_cased_en_2.6.0_2.4_1598343571130.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... embeddings = BertEmbeddings.pretrained("biobert_discharge_base_cased", "en") \ .setInputCols("sentence", "token") \ .setOutputCol("embeddings") nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings]) pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")) result = pipeline_model.transform(spark.createDataFrame([['I hate cancer']], ["text"])) ``` ```scala ... val embeddings = BertEmbeddings.pretrained("biobert_discharge_base_cased", "en") .setInputCols("sentence", "token") .setOutputCol("embeddings") val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings)) val data = Seq("I hate cancer").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["I hate cancer"] embeddings_df = nlu.load('en.embed.biobert.discharge_base_cased').predict(text, output_level='token') embeddings_df ```
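Each token in the `embeddings` column carries a 768-dimensional vector (the model card's Dimension field). Outside Spark NLP, a common follow-up step is to mean-pool the token vectors into a single sentence vector; a minimal numpy sketch, using tiny 4-dimensional stand-ins for the real 768-dimensional outputs:

```python
import numpy as np

# Each token gets one embedding vector; mean-pooling averages them
# component-wise into a single fixed-size sentence representation.
# The 4-dimensional values below are illustrative stand-ins only.
token_vectors = np.array([
    [0.1, 0.2, 0.3, 0.4],   # "I"
    [0.5, 0.6, 0.7, 0.8],   # "hate"
    [0.9, 1.0, 1.1, 1.2],   # "cancer"
])
sentence_vector = token_vectors.mean(axis=0)
print(sentence_vector)  # -> [0.5 0.6 0.7 0.8]
```

Mean-pooling is only one choice; max-pooling or using a dedicated sentence-embedding annotator are alternatives, depending on the downstream task.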
{:.h2_title} ## Results ```bash token en_embed_biobert_discharge_base_cased_embeddings I [0.0036486536264419556, 0.3796533942222595, -0... hate [0.1914958357810974, 0.6709488034248352, -0.49... cancer [0.04618441313505173, -0.04562612622976303, -0... ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|biobert_discharge_base_cased| |Type:|embeddings| |Compatibility:|Spark NLP 2.6.0| |License:|Open Source| |Edition:|Official| |Input Labels:|[sentence, token]| |Output Labels:|[word_embeddings]| |Language:|[en]| |Dimension:|768| |Case sensitive:|true| {:.h2_title} ## Data Source The model is imported from [https://github.com/EmilyAlsentzer/clinicalBERT](https://github.com/EmilyAlsentzer/clinicalBERT) --- layout: model title: Stop Words Cleaner for Czech author: John Snow Labs name: stopwords_cs date: 2020-07-14 19:03:00 +0800 task: Stop Words Removal language: cs edition: Spark NLP 2.5.4 spark_version: 2.4 tags: [stopwords, cs] supported: true annotator: StopWordsCleaner article_header: type: cover use_language_switcher: "Python-Scala-Java" --- {:.h2_title} ## Description This model removes 'stop words' from text. Stop words are words so common that they can be removed without significantly altering the meaning of a text. Removing stop words is useful when one wants to deal with only the most semantically important words in a text, and ignore words that are rarely semantically relevant, such as articles and prepositions. 
{:.btn-box} [Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/b2eb08610dd49d5b15077cc499a94b4ec1e8b861/jupyter/annotation/english/stop-words/StopWordsCleaner.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/stopwords_cs_cs_2.5.4_2.4_1594742440427.zip){:.button.button-orange.button-orange-trans.arr.button-icon} [Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/stopwords_cs_cs_2.5.4_2.4_1594742440427.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3} {:.h2_title} ## How to use
{% include programmingLanguageSelectScalaPythonNLU.html %} ```python ... stop_words = StopWordsCleaner.pretrained("stopwords_cs", "cs") \ .setInputCols(["token"]) \ .setOutputCol("cleanTokens") nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, stop_words]) light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))) results = light_pipeline.fullAnnotate("Kromě toho, že je králem severu, je John Snow anglickým lékařem a lídrem ve vývoji anestezie a lékařské hygieny.") ``` ```scala ... val stopWords = StopWordsCleaner.pretrained("stopwords_cs", "cs") .setInputCols(Array("token")) .setOutputCol("cleanTokens") val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, stopWords)) val data = Seq("Kromě toho, že je králem severu, je John Snow anglickým lékařem a lídrem ve vývoji anestezie a lékařské hygieny.").toDF("text") val result = pipeline.fit(data).transform(data) ``` {:.nlu-block} ```python import nlu text = ["""Kromě toho, že je králem severu, je John Snow anglickým lékařem a lídrem ve vývoji anestezie a lékařské hygieny."""] stopword_df = nlu.load('cs.stopwords').predict(text) stopword_df[["cleanTokens"]] ```
{:.h2_title} ## Results ```bash [Row(annotatorType='token', begin=0, end=4, result='Kromě', metadata={'sentence': '0'}), Row(annotatorType='token', begin=10, end=10, result=',', metadata={'sentence': '0'}), Row(annotatorType='token', begin=12, end=13, result='že', metadata={'sentence': '0'}), Row(annotatorType='token', begin=18, end=23, result='králem', metadata={'sentence': '0'}), Row(annotatorType='token', begin=25, end=30, result='severu', metadata={'sentence': '0'}), ...] ``` {:.model-param} ## Model Information {:.table-model} |---|---| |Model Name:|stopwords_cs| |Type:|stopwords| |Compatibility:|Spark NLP 2.5.4+| |Edition:|Official| |Input Labels:|[token]| |Output Labels:|[cleanTokens]| |Language:|cs| |Case sensitive:|false| |License:|Open Source| {:.h2_title} ## Data Source The model is imported from [https://github.com/WorldBrain/remove-stopwords](https://github.com/WorldBrain/remove-stopwords)
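Conceptually, the annotator does what this small pure-Python sketch does: drop every token whose lowercased form appears in a fixed stop-word set. The Czech words below are a tiny illustrative subset, not the model's actual list:

```python
# Illustrative subset of Czech stop words (NOT the pretrained model's list).
STOPWORDS = {"je", "a", "ve", "že", "toho"}

def clean_tokens(tokens):
    # Keep only tokens whose lowercased form is not a stop word,
    # mirroring this model's case-insensitive matching.
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = ["Kromě", "toho", ",", "že", "je", "králem", "severu"]
print(clean_tokens(tokens))  # -> ['Kromě', ',', 'králem', 'severu']
```

In the real pipeline the token list comes from a `Tokenizer` stage and the stop-word set ships inside the pretrained resource.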